## Project Overview

This repository hosts a general data analysis framework used to investigate reply behaviour and polarization in the
comment section of "Spiegel Online" (SPON). The research focuses on understanding uncongeniality within a large
online sample and examining polarization in online discussions.

The dataset for analysis can be found on the [Open Science Framework](

## Setup
To set up the analysis, follow the steps below:

### Clone the Repository

```bash
git clone --branch public
```

### Install non-Python prerequisites

To run the analysis, R needs to be installed. The analysis was conducted using R version 4.1.1.
Please install R from the [official website](
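
Before continuing, it can be useful to confirm that R is reachable from the shell. The following is a minimal sketch and not part of the repository: the helper name `check_tool` is hypothetical, and it relies only on the POSIX `command -v` builtin and R's `--version` flag.

```bash
# Hypothetical helper (not part of this repository): report whether a command is on the PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
    fi
}

# R is the prerequisite named above; print the first line of its version banner if it is installed.
check_tool R
if command -v R >/dev/null 2>&1; then
    R --version | head -n 1
fi
```

The analysis was conducted with R 4.1.1, so a different version in the banner means the setup deviates from the one described here.
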
In addition, installing development tools is recommended. This can be done via apt-get on Linux systems:

```bash
sudo apt-get update
sudo apt-get install build-essential
```

Furthermore, to enable the generation of PDF reports, pandoc and texlive need to be installed.
Both can be installed via apt-get on Linux systems:

```bash
sudo apt-get install pandoc
sudo apt-get install texlive-latex-recommended
```

The code also runs without these tools if the pdf flag in the config file is set to false.
Please note that the markdown versions of result reports use relative paths to images,
so the images are displayed only when the reports are viewed from within the `results_reports` folder (in contrast to PDF reports).

### Install requirements

The code was tested under Python 3.10.2.
It is recommended to run the code in a virtual environment. To create a virtual environment, run the following commands:

```bash
python3 -m venv venv
source venv/bin/activate
```

To install the required Python packages, run:

```bash
pip3 install -r requirements.txt
```

## Running analysis

To run an analysis with this framework, adapt the configuration file `config.yaml` to the desired settings.
Adapt the `data_path` to the directory in which the dataset that is available on the
[Open Science Framework]( is stored.

To replicate the analysis provided in "Polarizing reply patterns in comment sections of a large German news outlet"
all other settings can be left unchanged.

The analysis can be run by executing the following command:

```bash
python3
```

### Configuring analysis

This framework allows running a wide range of analyses on a dataset or its subsets by defining analysis jobs as YAML files.
Such files consist of four parts:
- `preprocessing`: defines all subsets of the dataset that will be targeted in the analysis
- `descriptive`: defines all descriptive analyses that will be conducted
- `analysis`: defines all other analysis jobs that will be conducted (e.g. regression, correlation, etc.)
- `visualizations`: defines all visualizations that will be created (e.g. histograms, scatterplots, etc.)

Examples of all supported analyses and their arguments can be found in the `analysis_config_templates` folder.
The general structure of an analysis job consists of a tag that names the analysis followed by a list of arguments.
The `name` argument is mandatory and is used for identification and naming of the output files.
Please note that the `dataset` argument refers to the names of the datasets defined in `preprocessing`. Some analyses
(e.g. forest plots) require additional information referring to specific models that are also defined in the analysis job.

### Output

An analysis job creates three types of outputs:
- A markdown report in the `results_reports` folder which, for each analysis, gives the settings of the analysis and the result
- A PDF report in the `results_reports` folder which is the conversion of the markdown report (that can be shared)
- For each analysis, a file in the `results` folder that contains the results of the analysis and is named after the analysis

### Contributing: Extending the framework

If you want to extend the framework with a new analysis, you can do so by following these steps:
- Fork the repository
- Add your analysis function class to the `analysis_functions` folder
- Write a wrapper function for your analysis that takes a list of job arguments and calls the analysis.
- Add the wrapper function to the `` file and extend it to create a list of analysis jobs for the
newly created analysis type
- Create a parameter dataclass in the `data_classes` folder that inherits from `GeneralParameters`.
- Add your dataclass to the `` in order for it to be readable from the job yaml file.
- Extend `utils/` to log the settings for your analysis in order for them to be documented in the report.
- Either use the extended analysis framework or create a merge request for it to be included in the main repository.

## License

See the LICENSE file for details on the GNU General Public License v3.0.

## Contact

For queries, feedback, or issue reporting, please e-mail Gerrit Anders at
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_bayesian_regression.yaml b/analysis_config_templates/template_analysis_bayesian_regression.yaml
new file mode 100644
index 0000000..0e6fc4b
--- /dev/null
+++ b/analysis_config_templates/template_analysis_bayesian_regression.yaml
@@ -0,0 +1,10 @@
+---
+analysis:
+  - !bayesian_regression
+    name: "Example_bayes_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_comparison_variance_in_and_between_group.yaml b/analysis_config_templates/template_analysis_comparison_variance_in_and_between_group.yaml
new file mode 100644
index 0000000..94302ea
--- /dev/null
+++ b/analysis_config_templates/template_analysis_comparison_variance_in_and_between_group.yaml
@@ -0,0 +1,8 @@
+---
+analysis:
+  - !comparison_variance_in_and_between_group
+    name: "Example_comparison_variance_in_and_between_group"
+    dataset: "data"
+    variable: 'bayes-corrected (q=0.25) variance'
+    group: 'user_id'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_effect_of_up_and_downvotes.yaml b/analysis_config_templates/template_analysis_effect_of_up_and_downvotes.yaml
new file mode 100644
index 0000000..709a2f6
--- /dev/null
+++ b/analysis_config_templates/template_analysis_effect_of_up_and_downvotes.yaml
@@ -0,0 +1,22 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+
+  - !increase_per_up_and_downvote_from_totalvotes_and_valence
+    name: "Example_increase_per_up_and_downvote"
+    dataset: "data"
+    weight_as_distribution_quantile: true
+    weight_m: 0.25
+    model_name: "Example_linear_regression"
+    step:
+      - 0
+      - 1
+    startpoint: "average"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_get_function_inverse_bayes_transformed_regression.yaml b/analysis_config_templates/template_analysis_get_function_inverse_bayes_transformed_regression.yaml
new file mode 100644
index 0000000..b326b21
--- /dev/null
+++ b/analysis_config_templates/template_analysis_get_function_inverse_bayes_transformed_regression.yaml
@@ -0,0 +1,17 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+    report_effect_size: false
+
+  - !function_inverse_bayes_transformed_regression
+    name: "function_Example"
+    dataset: "data"
+    model_name: "Example_linear_regression"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_grouped_linear_regression.yaml b/analysis_config_templates/template_analysis_grouped_linear_regression.yaml
new file mode 100644
index 0000000..fa81368
--- /dev/null
+++ b/analysis_config_templates/template_analysis_grouped_linear_regression.yaml
@@ -0,0 +1,17 @@
+---
+analysis:
+  - !linear_regression_grouped
+    name: "Example_grouped_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    aggregation_functions:
+      - 'mean'
+      - 'sum'
+      - 'sum'
+    group_by: 'user_id'
+    standardize: false
+    print_detailed_coefficients: true
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_linear_regression.yaml b/analysis_config_templates/template_analysis_linear_regression.yaml
new file mode 100644
index 0000000..b58928e
--- /dev/null
+++ b/analysis_config_templates/template_analysis_linear_regression.yaml
@@ -0,0 +1,11 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_paired_ttest.yaml b/analysis_config_templates/template_analysis_paired_ttest.yaml
new file mode 100644
index 0000000..9996afe
--- /dev/null
+++ b/analysis_config_templates/template_analysis_paired_ttest.yaml
@@ -0,0 +1,8 @@
+---
+analysis:
+  - !paired_ttest
+    name: "Example_paired_ttest"
+    dataset: "data"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_pearson_correlation.yaml b/analysis_config_templates/template_analysis_pearson_correlation.yaml
new file mode 100644
index 0000000..cd14db5
--- /dev/null
+++ b/analysis_config_templates/template_analysis_pearson_correlation.yaml
@@ -0,0 +1,8 @@
+---
+analysis:
+  - !pearson_correlation
+    name: "Example_pearson_correlation"
+    dataset: "data"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_analysis_ttest.yaml b/analysis_config_templates/template_analysis_ttest.yaml
new file mode 100644
index 0000000..3703097
--- /dev/null
+++ b/analysis_config_templates/template_analysis_ttest.yaml
@@ -0,0 +1,8 @@
+---
+analysis:
+  - !ttest
+    name: "Example_ttest"
+    dataset: "data"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_descriptive_create_descriptive_aggregated.yaml b/analysis_config_templates/template_descriptive_create_descriptive_aggregated.yaml
new file mode 100644
index 0000000..4203673
--- /dev/null
+++ b/analysis_config_templates/template_descriptive_create_descriptive_aggregated.yaml
@@ -0,0 +1,11 @@
+---
+descriptive:
+  - !descriptive_aggregated
+    name: "Example_overview"
+    dataset: "data"
+    variables:
+      - 'Count'
+      - 'totalvotes'
+    aggregation_function: "sum"
+    group_by: "user_id"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_descriptive_create_descriptive_overview.yaml b/analysis_config_templates/template_descriptive_create_descriptive_overview.yaml
new file mode 100644
index 0000000..f8e4cd7
--- /dev/null
+++ b/analysis_config_templates/template_descriptive_create_descriptive_overview.yaml
@@ -0,0 +1,40 @@
+---
+descriptive:
+  - !descriptive_overview
+    name: "Example_overview"
+    dataset: "data"
+    group_by: "order"
+    metrics:
+      - operation: "count"
+        column: null
+      - operation: "sum"
+        column: "number O(n+1)-replies"
+      - operation: "count_nonzero"
+        column: "number O(n+1)-replies"
+      - operation: "count_nonzero"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "upvotes"
+      - operation: "sum"
+        column: "downvotes"
+      - operation: "count_nonzero"
+        column: "totalvotes"
+      - operation: "mean"
+        column: "valence"
+      - operation: "std_dev"
+        column: "valence"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "mean"
+        column: "extremity"
+      - operation: "std_dev"
+        column: "extremity"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) extremity"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) extremity"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_descriptive_percentage_of_dataset_under_condtion.yaml b/analysis_config_templates/template_descriptive_percentage_of_dataset_under_condtion.yaml
new file mode 100644
index 0000000..ec4da1e
--- /dev/null
+++ b/analysis_config_templates/template_descriptive_percentage_of_dataset_under_condtion.yaml
@@ -0,0 +1,9 @@
+---
+descriptive:
+  - !percentage_of_dataset_under_condition
+    name: "Example_percentage_of_dataset_under_condition"
+    dataset: "data"
+    variable: "totalvotes"
+    comparison: "smaller"
+    condition: 10
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_barchart.yaml b/analysis_config_templates/template_plot_barchart.yaml
new file mode 100644
index 0000000..006290c
--- /dev/null
+++ b/analysis_config_templates/template_plot_barchart.yaml
@@ -0,0 +1,13 @@
+---
+visualization:
+  - !barchart
+    name: 'Example_barchart'
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: None
+    x_axis_label: 'bayes-corrected (q=0.25) extremity'
+    y_axis_label: 'Count'
+    chart_orientation: 'h'
+    sort_order: 'ascending'
+    title: 'Barchart'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_boxplot.yaml b/analysis_config_templates/template_plot_boxplot.yaml
new file mode 100644
index 0000000..a368d41
--- /dev/null
+++ b/analysis_config_templates/template_plot_boxplot.yaml
@@ -0,0 +1,11 @@
+---
+visualization:
+  - !boxplot
+    name: 'Example_boxplot'
+    dataset: "data"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+    x_axis_label: ''
+    y_axis_label: 'Extremity value'
+    title: 'Box plot comparing bayes-corrected extremity with the mean extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_contourplot.yaml b/analysis_config_templates/template_plot_contourplot.yaml
new file mode 100644
index 0000000..d9b1eb7
--- /dev/null
+++ b/analysis_config_templates/template_plot_contourplot.yaml
@@ -0,0 +1,27 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+    report_effect_size: false
+
+  - !function_inverse_bayes_transformed_regression
+    name: "function_Example"
+    dataset: "data"
+    model_name: "Example_linear_regression"
+
+visualization:
+  - !contourplot
+    name: "Example_surfaceplot"
+    dataset: "data"
+    function_name: "function_Example"
+    x_axis_maximum: 20
+    y_axis_maximum: 20
+    x_axis_label: "downvotes"
+    y_axis_label: "upvotes"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_count_distribution.yaml b/analysis_config_templates/template_plot_count_distribution.yaml
new file mode 100644
index 0000000..0c68d21
--- /dev/null
+++ b/analysis_config_templates/template_plot_count_distribution.yaml
@@ -0,0 +1,15 @@
+---
+visualization:
+  - !count_distribution
+    name: 'Example_count_distribution'
+    dataset: "data"
+    variable: 'user_id'
+    x_axis_label: 'Number of comments'
+    y_axis_label: 'Number of users'
+    x_axis_limits:
+      - 0
+      - 10
+    x_axis_logarithmic_scaling: false
+    y_axis_logarithmic_scaling: false
+    title: 'Distribution of Comments over Users'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_densityplot.yaml b/analysis_config_templates/template_plot_densityplot.yaml
new file mode 100644
index 0000000..4d60e2d
--- /dev/null
+++ b/analysis_config_templates/template_plot_densityplot.yaml
@@ -0,0 +1,10 @@
+---
+visualization:
+  - !densityplot
+    name: 'Example_densityplot'
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies'
+    data_breakpoints:
+      - 0
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_forestplot.yaml b/analysis_config_templates/template_plot_forestplot.yaml
new file mode 100644
index 0000000..611c2f7
--- /dev/null
+++ b/analysis_config_templates/template_plot_forestplot.yaml
@@ -0,0 +1,37 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression_subset_1"
+    dataset: "data_subset_1"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Example_linear_regression_subset_2"
+    dataset: "data_subset_2"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+visualization:
+  - !forestplot:
+    name: "Example_forestplot"
+    regression_model_names:
+      - "Example_linear_regression_subset_1"
+      - "Example_linear_regression_subset_2"
+    regression_model_labels:
+      - "Subset 1"
+      - "Subset 2"
+    coefficient_names:
+      - "bayes-corrected (q=0.25) valence"
+      - "totalvotes"
+    x_axis_minimum: 0
+    dotsize: 5
+    x_axis_label: "Standardized coefficient (95% Confidence Interval)"
+...
+
diff --git a/analysis_config_templates/template_plot_forestplot_paired_ttest.yaml b/analysis_config_templates/template_plot_forestplot_paired_ttest.yaml
new file mode 100644
index 0000000..9a75fe8
--- /dev/null
+++ b/analysis_config_templates/template_plot_forestplot_paired_ttest.yaml
@@ -0,0 +1,27 @@
+---
+analysis:
+  - !paired_ttest
+    name: "Example_paired_ttest_subset_1"
+    dataset: "data_subset_1"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+
+  - !paired_ttest
+    name: "Example_paired_ttest_subset_2"
+    dataset: "data_subset_2"
+    variable_1: 'bayes-corrected (q=0.25) extremity'
+    variable_2: 'mean bayes-corrected (q=0.25) extremity of replies'
+
+visualization:
+  - !forestplot:
+    name: "Example_forestplot_paired_ttest"
+    paired_ttest_names:
+      - "Example_paired_ttest_subset_1"
+      - "Example_paired_ttest_subset_2"
+    paired_ttest_labels:
+      - "Subset 1"
+      - "Subset 2"
+    x_axis_minimum: 0
+    dotsize: 5
+    x_axis_label: "Mean difference bayes-corrected (q=0.25) extremity (95% Confidence Interval)"
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_grouped_histogram.yaml b/analysis_config_templates/template_plot_grouped_histogram.yaml
new file mode 100644
index 0000000..fccb2b4
--- /dev/null
+++ b/analysis_config_templates/template_plot_grouped_histogram.yaml
@@ -0,0 +1,12 @@
+---
+visualization:
+  - !grouped_histogram
+    name: "Example_grouped_histogram"
+    dataset: "data"
+    group_by: 'user_id'
+    aggregation_variable: 'bayes-corrected (q=0.25) valence'
+    aggregation_function: 'mean'
+    x_axis_label: 'Valence'
+    y_axis_label: 'Number of users'
+    title: 'Histogram of Mean Valence per User'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_heatmap.yaml b/analysis_config_templates/template_plot_heatmap.yaml
new file mode 100644
index 0000000..d6ff256
--- /dev/null
+++ b/analysis_config_templates/template_plot_heatmap.yaml
@@ -0,0 +1,17 @@
+---
+visualization:
+  - !heatmap
+    name: "Example_heatmap"
+    dataset: "data"
+    axis_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    heat_variable: 'number O(n+1)-replies'
+    axis_maxima:
+      - 1
+      - 40
+    axis_minima:
+      - 0
+      - 0
+    logarithmic_heat_scaling: 'false'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_hexbinplot.yaml b/analysis_config_templates/template_plot_hexbinplot.yaml
new file mode 100644
index 0000000..6c91b47
--- /dev/null
+++ b/analysis_config_templates/template_plot_hexbinplot.yaml
@@ -0,0 +1,12 @@
+---
+visualization:
+  - !hexbinplot
+    name: "Example_hexbinplot"
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies'
+    x_axis_maximum: 1
+    y_axis_maximum: 1
+    trendline: false
+    logarithmic_hex_scaling: false
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_histogram.yaml b/analysis_config_templates/template_plot_histogram.yaml
new file mode 100644
index 0000000..c1a21f7
--- /dev/null
+++ b/analysis_config_templates/template_plot_histogram.yaml
@@ -0,0 +1,12 @@
+---
+visualization:
+  - !histogram
+    name: 'Descriptive_histogram_comments_over_totalvotes'
+    dataset: "data"
+    variable: 'totalvotes'
+    x_axis_label: 'Number of total votes'
+    y_axis_label: 'Number of comments'
+    x_axis_logarithmic_scaling: false
+    y_axis_logarithmic_scaling: true
+    title: ''
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_percentage_stacked_barchart.yaml b/analysis_config_templates/template_plot_percentage_stacked_barchart.yaml
new file mode 100644
index 0000000..ab91ffa
--- /dev/null
+++ b/analysis_config_templates/template_plot_percentage_stacked_barchart.yaml
@@ -0,0 +1,14 @@
+---
+visualization:
+  - !percentage_stacked_barchart
+    name: 'Example_percentage_stacked_barchart'
+    dataset: "data"
+    variable_x_axis: 'section'
+    variables_to_compare:
+      - 'upvotes'
+      - 'downvotes'
+    x_axis_label: 'Section'
+    chart_orientation: 'h'
+    sort_order: 'ascending'
+    title: 'Stacked Barchart of Upvotes and Downvotes by Section'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_ridgelineplot.yaml b/analysis_config_templates/template_plot_ridgelineplot.yaml
new file mode 100644
index 0000000..2865efe
--- /dev/null
+++ b/analysis_config_templates/template_plot_ridgelineplot.yaml
@@ -0,0 +1,10 @@
+---
+visualization:
+  - !ridgelineplot
+    name: "Example_ridgelineplot"
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies'
+    data_breakpoints:
+      - 0.5
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_simple_scatterplot.yaml b/analysis_config_templates/template_plot_simple_scatterplot.yaml
new file mode 100644
index 0000000..5d57fef
--- /dev/null
+++ b/analysis_config_templates/template_plot_simple_scatterplot.yaml
@@ -0,0 +1,8 @@
+---
+visualization:
+  - !simple_scatterplot
+    name: "Example_simple_scatterplot"
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_stacked_barchart.yaml b/analysis_config_templates/template_plot_stacked_barchart.yaml
new file mode 100644
index 0000000..5ac42eb
--- /dev/null
+++ b/analysis_config_templates/template_plot_stacked_barchart.yaml
@@ -0,0 +1,14 @@
+---
+visualization:
+  - !stacked_barchart
+    name: 'Example_stacked_barchart'
+    dataset: "data"
+    variable_x_axis: 'section'
+    variable_y_axis: None
+    x_axis_label: 'section'
+    y_axis_label: 'Count'
+    hue: 'order'
+    chart_orientation: 'h'
+    sort_order: 'ascending'
+    title: 'Stacked Barchart of Comments by Section and Order'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_surfaceplot.yaml b/analysis_config_templates/template_plot_surfaceplot.yaml
new file mode 100644
index 0000000..8cc4ef7
--- /dev/null
+++ b/analysis_config_templates/template_plot_surfaceplot.yaml
@@ -0,0 +1,31 @@
+---
+analysis:
+  - !linear_regression
+    name: "Example_linear_regression"
+    dataset: "data"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+    report_effect_size: false
+
+  - !function_inverse_bayes_transformed_regression
+    name: "function_Example"
+    dataset: "data"
+    model_name: "Example_linear_regression"
+
+visualization:
+  - !surfaceplot
+    name: "Example_surfaceplot"
+    dataset: "data"
+    function_name: "function_Example"
+    x_axis_maximum: 20
+    y_axis_maximum: 20
+    x_axis_label: "downvotes"
+    y_axis_label: "upvotes"
+    z_axis_label: "replies"
+    elevation_angle: 45
+    azimuth_angle: 205
+    title: 'Effect of up- and downvotes according to example linear regression'
+...
\ No newline at end of file
diff --git a/analysis_config_templates/template_plot_violinplot.yaml b/analysis_config_templates/template_plot_violinplot.yaml
new file mode 100644
index 0000000..350509f
--- /dev/null
+++ b/analysis_config_templates/template_plot_violinplot.yaml
@@ -0,0 +1,8 @@
+---
+visualization:
+  - !violinplot
+    name: "Example_violinplot"
+    dataset: "data"
+    variable_x_axis: 'bayes-corrected (q=0.25) extremity'
+    variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies'
+...
\ No newline at end of file
diff --git a/analysis_jobs/analysis_job_manuscript.yaml b/analysis_jobs/analysis_job_manuscript.yaml
new file mode 100644
index 0000000..b95086a
--- /dev/null
+++ b/analysis_jobs/analysis_job_manuscript.yaml
@@ -0,0 +1,667 @@
+---
+preprocessing:
+  data_order0:
+    - method: data_order
+      param: 0
+  data_order1:
+    - method: data_order
+      param: 1
+  data_politics:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Politics'
+  data_foreign_affairs:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Foreign affairs'
+  data_science:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Science'
+  data_economy:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Economy'
+  data_miscellaneous:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Miscellaneous'
+  data_culture:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Culture'
+  data_sports:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Sports'
+  data_mobility:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Mobility'
+  data_internet:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Internet'
+  data_health:
+    - method: data_order
+      param: 0
+    - method: data_section
+      param: 'Health'
+  data_order0_with_minimum_one_vote:
+    - method: data_order
+      param: 0
+    - method: exclude_data_with_value
+      param: {'column': 'totalvotes', 'value': 0}
+
+descriptive:
+  - !descriptive_overview
+    name: "Extended_Data_Table_1_Descriptive_Data_for_different_comment_levels"
+    dataset: "data"
+    group_by: "order"
+    metrics:
+      - operation: "count"
+        column: null
+      - operation: "count_nonzero"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "totalvotes"
+      - operation: "mean"
+        column: "totalvotes"
+      - operation: "std_dev"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "upvotes"
+      - operation: "sum"
+        column: "downvotes"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) extremity"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) extremity"
+
+  - !descriptive_overview
+    name: "Extended_Data_Table_2_Descriptive_Data_for_different_news_categories"
+    dataset: "data"
+    group_by: "section"
+    metrics:
+      - operation: "count"
+        column: null
+      - operation: "sum"
+        column: "number O(n+1)-replies"
+      - operation: "count_nonzero"
+        column: "number O(n+1)-replies"
+      - operation: "count_nonzero"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "totalvotes"
+      - operation: "sum"
+        column: "upvotes"
+      - operation: "sum"
+        column: "downvotes"
+      - operation: "count_nonzero"
+        column: "totalvotes"
+      - operation: "mean"
+        column: "valence"
+      - operation: "std_dev"
+        column: "valence"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) valence"
+      - operation: "mean"
+        column: "extremity"
+      - operation: "std_dev"
+        column: "extremity"
+      - operation: "mean"
+        column: "bayes-corrected (q=0.25) extremity"
+      - operation: "std_dev"
+        column: "bayes-corrected (q=0.25) extremity"
+
+analysis:
+  - !linear_regression
+    name: "Evidence_uncongeniality_simplest_model_linear_regression_only_valence_non_standardized"
+    dataset: "data_order0"
+    independent_variables:
+      - 'valence'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: false
+    report_effect_size: true
+
+  - !linear_regression
+    name: "Evidence_uncongeniality_preregistered_model"
+    dataset: "data_order0"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+    report_effect_size: true
+
+  - !linear_regression
+    name: "Evidence_uncongeniality_stability_against_variation_in_weight_q5"
+    dataset: "data_order0"
+    independent_variables:
+      - 'bayes-corrected (q=0.5) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongeniality_stability_against_variation_in_weight_q75"
+    dataset: "data_order0"
+    independent_variables:
+      - 'bayes-corrected (q=0.75) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongeniality_stability_against_variation_in_weight__no_bayes_correction"
+    dataset: "data_order0"
+    independent_variables:
+      - 'valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression_grouped
+    name: "Evidence_uncongeniality_robustness_analysis_on_person_level"
+    dataset: "data_order0"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    aggregation_functions:
+      - 'mean'
+      - 'sum'
+      - 'sum'
+    group_by: 'user_id'
+    standardize: true
+    print_detailed_coefficients: true
+
+  - !linear_regression_grouped
+    name: "Evidence_uncongeniality_robustness_analysis_on_section_level"
+    dataset: "data_order0"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    aggregation_functions:
+      - 'mean'
+      - 'sum'
+      - 'sum'
+    group_by: 'section'
+    standardize: true
+    print_detailed_coefficients: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_politics"
+    dataset: "data_politics"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_affairs"
+    dataset: "data_foreign_affairs"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_science"
+    dataset: "data_science"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_economy"
+    dataset: "data_economy"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_miscellaneous"
+    dataset: "data_miscellaneous"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_culture"
+    dataset: "data_culture"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_sports"
+    dataset: "data_sports"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_mobility"
+    dataset: "data_mobility"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_internet"
+    dataset: "data_internet"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongenialty_section_health"
+    dataset: "data_health"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncongeniality_robustness_order1"
+    dataset: "data_order1"
+    independent_variables:
+      - 'bayes-corrected (q=0.25) valence'
+      - 'totalvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_uncogeniality_model_with_seperate_upvotes_downvotes"
+    dataset: "data_order0"
+    independent_variables:
+      - 'upvotes'
+      - 'downvotes'
+    dependent_variable: 'number O(n+1)-replies'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_preregistered_model"
+    dataset: "data_order0"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_stability_against_variation_in_weight_q5"
+    dataset: "data_order0"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.5) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.5) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_stability_against_variation_in_weight_q75"
+    dataset: "data_order0"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.75) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.75) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_stability_against_variation_in_weight_no_bayes_correction"
+    dataset: "data_order0"
+    independent_variables:
+      - 'mean valence of replies'
+    dependent_variable: 'valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_politics"
+    dataset: "data_politics"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_affairs"
+    dataset: "data_foreign_affairs"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_science"
+    dataset: "data_science"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_economy"
+    dataset: "data_economy"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_miscellaneous"
+    dataset: "data_miscellaneous"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_culture"
+    dataset: "data_culture"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_sports"
+    dataset: "data_sports"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+    dependent_variable: 'bayes-corrected (q=0.25) valence'
+    standardize: true
+
+  - !linear_regression
+    name: "Evidence_antagonism_section_mobility"
+    dataset: "data_mobility"
+    independent_variables:
+      - 'mean bayes-corrected (q=0.25) valence of replies'
+
dependent_variable: 'bayes-corrected (q=0.25) valence' + standardize: true + + - !linear_regression + name: "Evidence_antagonism_section_internet" + dataset: "data_internet" + independent_variables: + - 'mean bayes-corrected (q=0.25) valence of replies' + dependent_variable: 'bayes-corrected (q=0.25) valence' + standardize: true + + - !linear_regression + name: "Evidence_antagonism_section_health" + dataset: "data_health" + independent_variables: + - 'mean bayes-corrected (q=0.25) valence of replies' + dependent_variable: 'bayes-corrected (q=0.25) valence' + standardize: true + + - !linear_regression + name: "Evidence_antagonism_robustness_order1" + dataset: "data_order1" + independent_variables: + - 'mean bayes-corrected (q=0.25) valence of replies' + dependent_variable: 'bayes-corrected (q=0.25) valence' + standardize: true + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity" + dataset: "data_order0" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q5" + dataset: "data_order0" + variable_1: 'bayes-corrected (q=0.5) extremity' + variable_2: 'mean bayes-corrected (q=0.5) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q75" + dataset: "data_order0" + variable_1: 'bayes-corrected (q=0.75) extremity' + variable_2: 'mean bayes-corrected (q=0.75) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_stability_against_variation_in_weight_paired_ttest_bayes" + dataset: "data_order0" + variable_1: 'extremity' + variable_2: 'mean extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_robustness_paired_ttest_order1" + dataset: "data_order1" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - 
!paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_politics" + dataset: "data_politics" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_foreign_affairs" + dataset: "data_foreign_affairs" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_science" + dataset: "data_science" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_economy" + dataset: "data_economy" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_miscellaneous" + dataset: "data_miscellaneous" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_culture" + dataset: "data_culture" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_sports" + dataset: "data_sports" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_mobility" + dataset: "data_mobility" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_internet" + dataset: "data_internet" + variable_1: 'bayes-corrected (q=0.25) 
extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + + - !paired_ttest + name: "Evidence_polarization_paired_ttest_extremity_health" + dataset: "data_health" + variable_1: 'bayes-corrected (q=0.25) extremity' + variable_2: 'mean bayes-corrected (q=0.25) extremity of replies' + +visualization: + - !hexbinplot + name: "Fig_2a" + dataset: "data_order0" + variable_x_axis: 'bayes-corrected (q=0.25) valence' + variable_y_axis: 'number O(n+1)-replies' + y_axis_maximum: 40 + trendline: True + logarithmic_hex_scaling: True + + - !forestplot + name: "Fig_2b" + regression_model_names: + - "Evidence_uncongenialty_section_politics" + - "Evidence_uncongenialty_section_affairs" + - "Evidence_uncongenialty_section_science" + - "Evidence_uncongenialty_section_economy" + - "Evidence_uncongenialty_section_miscellaneous" + - "Evidence_uncongenialty_section_culture" + - "Evidence_uncongenialty_section_sports" + - "Evidence_uncongenialty_section_mobility" + - "Evidence_uncongenialty_section_internet" + - "Evidence_uncongenialty_section_health" + regression_model_labels: + - "Politics" + - "Foreign Affairs" + - "Science" + - "Economy" + - "Miscellaneous" + - "Culture" + - "Sports" + - "Mobility" + - "Internet" + - "Health" + coefficient_names: + - "bayes-corrected (q=0.25) valence" + - "totalvotes" + x_axis_minimum: -0.6 + dotsize: 2 + x_axis_label: "Standardized coefficient (95% Confidence Interval)" + + - !heatmap + name: "Fig_2c" + dataset: "data_order0_with_minimum_one_vote" + axis_variables: + - 'upvotes' + - 'downvotes' + heat_variable: 'number O(n+1)-replies' + axis_maxima: + - 20 + - 20 + axis_minima: + - 0 + - 0 + logarithmic_heat_scaling: 'false' + + - !densityplot + name: 'Fig_3a' + dataset: "data_order0" + variable_x_axis: 'mean bayes-corrected (q=0.25) valence of replies' + variable_y_axis: 'bayes-corrected (q=0.25) valence' + data_breakpoints: + - 0 + + - !forestplot + name: "Fig_3b" + regression_model_names: + - 
"Evidence_antagonism_section_politics" + - "Evidence_antagonism_section_affairs" + - "Evidence_antagonism_section_science" + - "Evidence_antagonism_section_economy" + - "Evidence_antagonism_section_miscellaneous" + - "Evidence_antagonism_section_culture" + - "Evidence_antagonism_section_sports" + - "Evidence_antagonism_section_mobility" + - "Evidence_antagonism_section_internet" + - "Evidence_antagonism_section_health" + regression_model_labels: + - "Politics" + - "Foreign Affairs" + - "Science" + - "Economy" + - "Miscellaneous" + - "Culture" + - "Sports" + - "Mobility" + - "Internet" + - "Health" + coefficient_names: + - 'mean bayes-corrected (q=0.25) valence of replies' + x_axis_minimum: -0.1 + dotsize: 2 + x_axis_label: "Standardized coefficient (95% Confidence Interval)" + + - !violinplot + name: "Fig_4a" + dataset: "data_order0" + variable_x_axis: 'bayes-corrected (q=0.25) extremity' + variable_y_axis: 'mean bayes-corrected (q=0.25) extremity of replies' + x_axis_label: '' + y_axis_label: 'Extremity value' + title: '' + + - !forestplot_paired_ttest + name: "Fig_4b" + paired_ttest_names: + - "Evidence_polarization_paired_ttest_extremity_politics" + - "Evidence_polarization_paired_ttest_extremity_foreign_affairs" + - "Evidence_polarization_paired_ttest_extremity_science" + - "Evidence_polarization_paired_ttest_extremity_economy" + - "Evidence_polarization_paired_ttest_extremity_miscellaneous" + - "Evidence_polarization_paired_ttest_extremity_culture" + - "Evidence_polarization_paired_ttest_extremity_sports" + - "Evidence_polarization_paired_ttest_extremity_mobility" + - "Evidence_polarization_paired_ttest_extremity_internet" + - "Evidence_polarization_paired_ttest_extremity_health" + paired_ttest_labels: + - "Politics" + - "Foreign Affairs" + - "Science" + - "Economy" + - "Miscellaneous" + - "Culture" + - "Sports" + - "Mobility" + - "Internet" + - "Health" + x_axis_minimum: -0.06 + dotsize: 2 + x_axis_label: "Mean difference bayes-corrected (q=0.25) extremity 
(95% Confidence Interval)" + + - !histogram + name: 'Extended_Fig_1' + dataset: "data" + variable: 'totalvotes' + x_axis_label: 'Number of total votes' + y_axis_label: 'Number of comments' + x_axis_logarithmic_scaling: false + y_axis_logarithmic_scaling: true + title: '' +... \ No newline at end of file diff --git a/config.yaml b/config.yaml new file mode 100644 index 0000000..41719e7 --- /dev/null +++ b/config.yaml @@ -0,0 +1,6 @@ +--- +data_path: 'put-path-to-the-data-directory-here' +dataset_name: '2024-02-28_preprocessed_data.parquet' +analysis_job_file: 'analysis_jobs/analysis_job_manuscript.yaml' +create_pdf_report: true +... \ No newline at end of file diff --git a/ b/ new file mode 100644 index 0000000..4e5b268 --- /dev/null +++ b/ @@ -0,0 +1,60 @@ +import pandas as pd +import yaml +import logging +import datetime +from pathlib import Path + +from src.analysis import run_analyses +from src.preprocessor import Preprocessing + +from src.data_loading_and_saving.constructor import custom_constructor +from src.data_loading_and_saving.create_results_report import ( + create_markdown_report, + create_pdf_report, +) + + +def main(): + yaml.SafeLoader.add_multi_constructor("!", custom_constructor) + + with open("config.yaml", "r") as file: + config = yaml.safe_load(file) + + path_to_data: str = config["data_path"] + name_data: str = config["dataset_name"] + name_analysis_job_file: str = config["analysis_job_file"] + bool_create_pdf_report: bool = config["create_pdf_report"] + + log_filename: str = "log.log" + todays_date: str ="%B %d, %Y") + output_name: str = f"{todays_date}_analysis_report" + logging.basicConfig( + filename=log_filename, + filemode="w", + format="%(message)s", + level=logging.INFO, + ) + + preprocessor: Preprocessing = Preprocessing(path_to_data, name_data) + + with open(name_analysis_job_file, "r") as file: + analysis_config = yaml.safe_load(file) + + datasets: dict[str, pd.DataFrame] = 
preprocessor.preprocess_datasets(analysis_config["preprocessing"]) + + run_analyses(analysis_config, datasets) + + create_markdown_report( + log_filename=Path(log_filename), + output_name=Path(output_name), + output_dir=Path("results_reports"), + ) + + if bool_create_pdf_report: + create_pdf_report( + markdown_filename=Path(output_name), output_dir=Path("results_reports") + ) + + +if __name__ == "__main__": + main() diff --git a/requirements.txt b/requirements.txt new file mode 100644 index 0000000..b7c60ae --- /dev/null +++ b/requirements.txt @@ -0,0 +1,13 @@ +statsmodels==0.14.2 +numpy==2.0.0 +scipy==1.13.1 +scikit-learn==1.5.0 +matplotlib==3.9.0 +seaborn==0.13.2 +pandas==2.2.2 +rpy2==3.5.16 +pyarrow==16.1.0 +pingouin==0.5.4 +attrs==23.2.0 +pyyaml==6.0.1 +pypandoc==1.13 \ No newline at end of file diff --git a/results/Evidence_antagonism_preregistered_model.txt b/results/Evidence_antagonism_preregistered_model.txt new file mode 100644 index 0000000..0052699 --- /dev/null +++ b/results/Evidence_antagonism_preregistered_model.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.021 +Model: OLS Adj. R-squared: 0.021 +Method: Least Squares F-statistic: 5.020e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:51 Log-Likelihood: -221.34 +No. 
Observations: 2392896 AIC: 446.7 +Df Residuals: 2392894 BIC: 472.1 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1246 0.000 796.722 0.000 0.124 0.125 +mean bayes-corrected (q=0.25) valence of replies -0.0351 0.000 -224.063 0.000 -0.035 -0.035 +============================================================================== +Omnibus: 426104.077 Durbin-Watson: 1.729 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 131391.270 +Skew: -0.336 Prob(JB): 0.00 +Kurtosis: 2.070 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_robustness_order1.txt b/results/Evidence_antagonism_robustness_order1.txt new file mode 100644 index 0000000..1ad8bc8 --- /dev/null +++ b/results/Evidence_antagonism_robustness_order1.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.057 +Model: OLS Adj. R-squared: 0.057 +Method: Least Squares F-statistic: 9.915e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: 2.1429e+05 +No. 
Observations: 1630262 AIC: -4.286e+05 +Df Residuals: 1630260 BIC: -4.286e+05 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1419 0.000 854.072 0.000 0.142 0.142 +mean bayes-corrected (q=0.25) valence of replies -0.0523 0.000 -314.877 0.000 -0.053 -0.052 +============================================================================== +Omnibus: 101738.374 Durbin-Watson: 1.753 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 62821.338 +Skew: -0.351 Prob(JB): 0.00 +Kurtosis: 2.343 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_affairs.txt b/results/Evidence_antagonism_section_affairs.txt new file mode 100644 index 0000000..570faab --- /dev/null +++ b/results/Evidence_antagonism_section_affairs.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.019 +Model: OLS Adj. R-squared: 0.019 +Method: Least Squares F-statistic: 8343. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: -43060. +No. 
Observations: 440260 AIC: 8.612e+04 +Df Residuals: 440258 BIC: 8.615e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1353 0.000 336.404 0.000 0.134 0.136 +mean bayes-corrected (q=0.25) valence of replies -0.0367 0.000 -91.341 0.000 -0.038 -0.036 +============================================================================== +Omnibus: 129058.321 Durbin-Watson: 1.735 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 32315.635 +Skew: -0.421 Prob(JB): 0.00 +Kurtosis: 1.974 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_culture.txt b/results/Evidence_antagonism_section_culture.txt new file mode 100644 index 0000000..22bb3db --- /dev/null +++ b/results/Evidence_antagonism_section_culture.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.032 +Model: OLS Adj. R-squared: 0.032 +Method: Least Squares F-statistic: 3435. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -1315.0 +No. Observations: 102305 AIC: 2634. +Df Residuals: 102303 BIC: 2653. 
+Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1253 0.001 163.518 0.000 0.124 0.127 +mean bayes-corrected (q=0.25) valence of replies -0.0449 0.001 -58.610 0.000 -0.046 -0.043 +============================================================================== +Omnibus: 19234.419 Durbin-Watson: 1.748 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 5689.759 +Skew: -0.334 Prob(JB): 0.00 +Kurtosis: 2.057 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_economy.txt b/results/Evidence_antagonism_section_economy.txt new file mode 100644 index 0000000..a67a2f1 --- /dev/null +++ b/results/Evidence_antagonism_section_economy.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.017 +Model: OLS Adj. R-squared: 0.017 +Method: Least Squares F-statistic: 5484. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: 24023. +No. 
Observations: 316428 AIC: -4.804e+04 +Df Residuals: 316426 BIC: -4.802e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1474 0.000 369.619 0.000 0.147 0.148 +mean bayes-corrected (q=0.25) valence of replies -0.0295 0.000 -74.054 0.000 -0.030 -0.029 +============================================================================== +Omnibus: 28321.195 Durbin-Watson: 1.760 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 17536.904 +Skew: -0.450 Prob(JB): 0.00 +Kurtosis: 2.278 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_health.txt b/results/Evidence_antagonism_section_health.txt new file mode 100644 index 0000000..015d4e2 --- /dev/null +++ b/results/Evidence_antagonism_section_health.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.043 +Model: OLS Adj. R-squared: 0.043 +Method: Least Squares F-statistic: 1211. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 1.61e-259 +Time: 09:31:54 Log-Likelihood: -439.22 +No. 
Observations: 27005 AIC: 882.4 +Df Residuals: 27003 BIC: 898.9 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1074 0.001 71.776 0.000 0.104 0.110 +mean bayes-corrected (q=0.25) valence of replies -0.0521 0.001 -34.794 0.000 -0.055 -0.049 +============================================================================== +Omnibus: 6746.889 Durbin-Watson: 1.761 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1324.435 +Skew: -0.197 Prob(JB): 2.53e-288 +Kurtosis: 1.989 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_internet.txt b/results/Evidence_antagonism_section_internet.txt new file mode 100644 index 0000000..9d7acfd --- /dev/null +++ b/results/Evidence_antagonism_section_internet.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.028 +Model: OLS Adj. R-squared: 0.028 +Method: Least Squares F-statistic: 1805. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -3477.5 +No. Observations: 63079 AIC: 6959. +Df Residuals: 63077 BIC: 6977. 
+Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1191 0.001 117.001 0.000 0.117 0.121 +mean bayes-corrected (q=0.25) valence of replies -0.0433 0.001 -42.490 0.000 -0.045 -0.041 +============================================================================== +Omnibus: 21454.701 Durbin-Watson: 1.721 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 4028.801 +Skew: -0.319 Prob(JB): 0.00 +Kurtosis: 1.939 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_miscellaneous.txt b/results/Evidence_antagonism_section_miscellaneous.txt new file mode 100644 index 0000000..c0311b8 --- /dev/null +++ b/results/Evidence_antagonism_section_miscellaneous.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.028 +Model: OLS Adj. R-squared: 0.028 +Method: Least Squares F-statistic: 6790. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -13916. +No. 
Observations: 235551 AIC: 2.784e+04 +Df Residuals: 235549 BIC: 2.786e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1362 0.001 257.499 0.000 0.135 0.137 +mean bayes-corrected (q=0.25) valence of replies -0.0436 0.001 -82.403 0.000 -0.045 -0.043 +============================================================================== +Omnibus: 52959.753 Durbin-Watson: 1.732 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 15867.344 +Skew: -0.409 Prob(JB): 0.00 +Kurtosis: 2.027 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_mobility.txt b/results/Evidence_antagonism_section_mobility.txt new file mode 100644 index 0000000..f2cf0f9 --- /dev/null +++ b/results/Evidence_antagonism_section_mobility.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.024 +Model: OLS Adj. R-squared: 0.024 +Method: Least Squares F-statistic: 1726. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: 10825. +No. 
Observations: 69253 AIC: -2.165e+04 +Df Residuals: 69251 BIC: -2.163e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1109 0.001 141.050 0.000 0.109 0.112 +mean bayes-corrected (q=0.25) valence of replies -0.0327 0.001 -41.551 0.000 -0.034 -0.031 +============================================================================== +Omnibus: 6922.840 Durbin-Watson: 1.814 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 2381.203 +Skew: -0.195 Prob(JB): 0.00 +Kurtosis: 2.179 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_antagonism_section_politics.txt b/results/Evidence_antagonism_section_politics.txt new file mode 100644 index 0000000..1ddc4e0 --- /dev/null +++ b/results/Evidence_antagonism_section_politics.txt @@ -0,0 +1,25 @@ + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.018 +Model: OLS Adj. R-squared: 0.018 +Method: Least Squares F-statistic: 1.166e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: 34045. +No. 
Observations: 621929   AIC: -6.809e+04
+Df Residuals: 621927   BIC: -6.806e+04
+Df Model: 1
+Covariance Type: nonrobust
+====================================================================================================================
+                                                      coef   std err          t     P>|t|    [0.025    0.975]
+--------------------------------------------------------------------------------------------------------------------
+const                                               0.1305     0.000    449.326     0.000     0.130     0.131
+mean bayes-corrected (q=0.25) valence of replies   -0.0314     0.000   -107.983     0.000    -0.032    -0.031
+==============================================================================
+Omnibus: 78154.602   Durbin-Watson: 1.733
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 31765.731
+Skew: -0.357   Prob(JB): 0.00
+Kurtosis: 2.155   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_antagonism_section_science.txt b/results/Evidence_antagonism_section_science.txt
new file mode 100644
index 0000000..bf4593e
--- /dev/null
+++ b/results/Evidence_antagonism_section_science.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+============================================================================================
+Dep. Variable: bayes-corrected (q=0.25) valence   R-squared: 0.028
+Model: OLS   Adj. R-squared: 0.028
+Method: Least Squares   F-statistic: 1.007e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:53   Log-Likelihood: 27583.
+No. Observations: 345534   AIC: -5.516e+04
+Df Residuals: 345532   BIC: -5.514e+04
+Df Model: 1
+Covariance Type: nonrobust
+====================================================================================================================
+                                                      coef   std err          t     P>|t|    [0.025    0.975]
+--------------------------------------------------------------------------------------------------------------------
+const                                               0.0723     0.000    190.132     0.000     0.072     0.073
+mean bayes-corrected (q=0.25) valence of replies   -0.0381     0.000   -100.372     0.000    -0.039    -0.037
+==============================================================================
+Omnibus: 59103.072   Durbin-Watson: 1.791
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 12955.369
+Skew: -0.052   Prob(JB): 0.00
+Kurtosis: 2.057   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_antagonism_section_sports.txt b/results/Evidence_antagonism_section_sports.txt
new file mode 100644
index 0000000..0cfaa30
--- /dev/null
+++ b/results/Evidence_antagonism_section_sports.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+============================================================================================
+Dep. Variable: bayes-corrected (q=0.25) valence   R-squared: 0.032
+Model: OLS   Adj. R-squared: 0.032
+Method: Least Squares   F-statistic: 3344.
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:54   Log-Likelihood: -6723.8
+No. Observations: 100071   AIC: 1.345e+04
+Df Residuals: 100069   BIC: 1.347e+04
+Df Model: 1
+Covariance Type: nonrobust
+====================================================================================================================
+                                                      coef   std err          t     P>|t|    [0.025    0.975]
+--------------------------------------------------------------------------------------------------------------------
+const                                               0.1246     0.001    152.318     0.000     0.123     0.126
+mean bayes-corrected (q=0.25) valence of replies   -0.0473     0.001    -57.827     0.000    -0.049    -0.046
+==============================================================================
+Omnibus: 28267.899   Durbin-Watson: 1.740
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 6368.870
+Skew: -0.345   Prob(JB): 0.00
+Kurtosis: 1.975   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_antagonism_stability_against_variation_in_weight_no_bayes_correction.txt b/results/Evidence_antagonism_stability_against_variation_in_weight_no_bayes_correction.txt
new file mode 100644
index 0000000..2d16de4
--- /dev/null
+++ b/results/Evidence_antagonism_stability_against_variation_in_weight_no_bayes_correction.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+==============================================================================
+Dep. Variable: valence   R-squared: 0.010
+Model: OLS   Adj. R-squared: 0.010
+Method: Least Squares   F-statistic: 2.337e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:53   Log-Likelihood: -4.8218e+05
+No. Observations: 2392896   AIC: 9.644e+05
+Df Residuals: 2392894   BIC: 9.644e+05
+Df Model: 1
+Covariance Type: nonrobust
+===========================================================================================
+                             coef   std err          t     P>|t|    [0.025    0.975]
+-------------------------------------------------------------------------------------------
+const                      0.1158     0.000    604.951     0.000     0.115     0.116
+mean valence of replies   -0.0293     0.000   -152.877     0.000    -0.030    -0.029
+==============================================================================
+Omnibus: 785394.853   Durbin-Watson: 1.750
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 152455.997
+Skew: -0.323   Prob(JB): 0.00
+Kurtosis: 1.946   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_antagonism_stability_against_variation_in_weight_q5.txt b/results/Evidence_antagonism_stability_against_variation_in_weight_q5.txt
new file mode 100644
index 0000000..ce04362
--- /dev/null
+++ b/results/Evidence_antagonism_stability_against_variation_in_weight_q5.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+===========================================================================================
+Dep. Variable: bayes-corrected (q=0.5) valence   R-squared: 0.027
+Model: OLS   Adj. R-squared: 0.027
+Method: Least Squares   F-statistic: 6.556e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:52   Log-Likelihood: 3.9215e+05
+No. Observations: 2392896   AIC: -7.843e+05
+Df Residuals: 2392894   BIC: -7.843e+05
+Df Model: 1
+Covariance Type: nonrobust
+===================================================================================================================
+                                                     coef   std err          t     P>|t|    [0.025    0.975]
+-------------------------------------------------------------------------------------------------------------------
+const                                              0.1323     0.000    996.732     0.000     0.132     0.133
+mean bayes-corrected (q=0.5) valence of replies   -0.0340     0.000   -256.042     0.000    -0.034    -0.034
+==============================================================================
+Omnibus: 168653.316   Durbin-Watson: 1.726
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 107460.980
+Skew: -0.396   Prob(JB): 0.00
+Kurtosis: 2.328   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_antagonism_stability_against_variation_in_weight_q75.txt b/results/Evidence_antagonism_stability_against_variation_in_weight_q75.txt
new file mode 100644
index 0000000..dbc887c
--- /dev/null
+++ b/results/Evidence_antagonism_stability_against_variation_in_weight_q75.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+============================================================================================
+Dep. Variable: bayes-corrected (q=0.75) valence   R-squared: 0.032
+Model: OLS   Adj. R-squared: 0.032
+Method: Least Squares   F-statistic: 8.012e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:52   Log-Likelihood: 8.8112e+05
+No. Observations: 2392896   AIC: -1.762e+06
+Df Residuals: 2392894   BIC: -1.762e+06
+Df Model: 1
+Covariance Type: nonrobust
+====================================================================================================================
+                                                       coef   std err          t     P>|t|    [0.025    0.975]
+--------------------------------------------------------------------------------------------------------------------
+const                                                0.1411     0.000   1303.270     0.000     0.141     0.141
+mean bayes-corrected (q=0.75) valence of replies    -0.0306     0.000   -283.054     0.000    -0.031    -0.030
+==============================================================================
+Omnibus: 95205.666   Durbin-Watson: 1.729
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 102788.717
+Skew: -0.491   Prob(JB): 0.00
+Kurtosis: 2.742   Cond. No. 1.00
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity.txt b/results/Evidence_polarization_paired_ttest_extremity.txt
new file mode 100644
index 0000000..62ed026
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.28634078314814315
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.31853427098636283
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12461005214018245
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09803757310470287
+Degrees of Freedom: 2392895
+Cohen's d: -0.28714996199978216
+T-statistic: -396.76675511778956
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_culture.txt b/results/Evidence_polarization_paired_ttest_extremity_culture.txt
new file mode 100644
index 0000000..9318fb0
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_culture.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.2873043034163312
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3180681433274033
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12427097360816901
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.10041204122831116
+Degrees of Freedom: 102304
+Cohen's d: -0.27231114070182555
+T-statistic: -77.26207861609845
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_economy.txt b/results/Evidence_polarization_paired_ttest_extremity_economy.txt
new file mode 100644
index 0000000..269c044
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_economy.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.28753090601867964
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3206172046668397
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12269603857552688
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09544219919767762
+Degrees of Freedom: 316427
+Cohen's d: -0.3010114266220678
+T-statistic: -144.89599610520233
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_foreign_affairs.txt b/results/Evidence_polarization_paired_ttest_extremity_foreign_affairs.txt
new file mode 100644
index 0000000..5c96bca
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_foreign_affairs.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.30983360408913946
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.330913534598374
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1266220167440838
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.0994658270407316
+Degrees of Freedom: 440259
+Cohen's d: -0.18514479506979328
+T-statistic: -116.67457613500132
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_health.txt b/results/Evidence_polarization_paired_ttest_extremity_health.txt
new file mode 100644
index 0000000..39c8898
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_health.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.286001211119296
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.32344058185785135
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12360412242902419
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09582069057098505
+Degrees of Freedom: 27004
+Cohen's d: -0.3385470290175242
+T-statistic: -48.09524752175683
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_internet.txt b/results/Evidence_polarization_paired_ttest_extremity_internet.txt
new file mode 100644
index 0000000..fad6588
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_internet.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.30568494651578504
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.33706126033387757
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12135285517757544
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09268724557998224
+Degrees of Freedom: 63078
+Cohen's d: -0.2905871965145026
+T-statistic: -63.21801300923011
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_miscellaneous.txt b/results/Evidence_polarization_paired_ttest_extremity_miscellaneous.txt
new file mode 100644
index 0000000..38a640a
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_miscellaneous.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.3045872088628839
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.33005502824126426
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12405998014131653
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09742991339150692
+Degrees of Freedom: 235550
+Cohen's d: -0.22832386975048508
+T-statistic: -97.05206575930157
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_mobility.txt b/results/Evidence_polarization_paired_ttest_extremity_mobility.txt
new file mode 100644
index 0000000..b176644
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_mobility.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.25434099233474056
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3002874727751491
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1194543498720806
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09661488578718154
+Degrees of Freedom: 69252
+Cohen's d: -0.4229377864257918
+T-statistic: -93.24696971910268
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_politics.txt b/results/Evidence_polarization_paired_ttest_extremity_politics.txt
new file mode 100644
index 0000000..98caff1
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_politics.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.2747813213977206
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.31051648819461664
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1232411698734475
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09815038738028235
+Degrees of Freedom: 621928
+Cohen's d: -0.3207697725588003
+T-statistic: -224.4339595489235
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_science.txt b/results/Evidence_polarization_paired_ttest_extremity_science.txt
new file mode 100644
index 0000000..0c4b6cf
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_science.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.25732194943047365
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3019777399435376
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1187657730515952
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09498121080140695
+Degrees of Freedom: 345533
+Cohen's d: -0.4152747999524859
+T-statistic: -212.56678640514008
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_paired_ttest_extremity_sports.txt b/results/Evidence_polarization_paired_ttest_extremity_sports.txt
new file mode 100644
index 0000000..c720e07
--- /dev/null
+++ b/results/Evidence_polarization_paired_ttest_extremity_sports.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.30601102207250513
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.328439915246921
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12292708240128108
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.098047183761713
+Degrees of Freedom: 100070
+Cohen's d: -0.20172544463043993
+T-statistic: -55.9671976011527
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_robustness_paired_ttest_order1.txt b/results/Evidence_polarization_robustness_paired_ttest_order1.txt
new file mode 100644
index 0000000..91adfe0
--- /dev/null
+++ b/results/Evidence_polarization_robustness_paired_ttest_order1.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.25) extremity: 0.29265411081901965
+Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.316766141686027
+Standard Deviation of bayes-corrected (q=0.25) extremity: 0.11701339959130957
+Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09812627267575441
+Degrees of Freedom: 1630261
+Cohen's d: -0.2232935227954181
+T-statistic: -248.9875068375778
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_bayes.txt b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_bayes.txt
new file mode 100644
index 0000000..9fc47e5
--- /dev/null
+++ b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_bayes.txt
@@ -0,0 +1,8 @@
+Mean of extremity: 0.2786279465660722
+Mean of mean extremity of replies: 0.33064022086792666
+Standard Deviation of extremity: 0.15566001726472525
+Standard Deviation of mean extremity of replies: 0.15685179947476463
+Degrees of Freedom: 2392895
+Cohen's d: -0.332863548235494
+T-statistic: -441.7826610833192
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q5.txt b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q5.txt
new file mode 100644
index 0000000..6fe52e0
--- /dev/null
+++ b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q5.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.5) extremity: 0.2934997056888845
+Mean of mean bayes-corrected (q=0.5) extremity of replies: 0.31880240669265064
+Standard Deviation of bayes-corrected (q=0.5) extremity: 0.10366027656607042
+Standard Deviation of mean bayes-corrected (q=0.5) extremity of replies: 0.07259709613375841
+Degrees of Freedom: 2392895
+Cohen's d: -0.28275329909468133
+T-statistic: -394.7125869249032
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q75.txt b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q75.txt
new file mode 100644
index 0000000..6b0909d
--- /dev/null
+++ b/results/Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q75.txt
@@ -0,0 +1,8 @@
+Mean of bayes-corrected (q=0.75) extremity: 0.3010823980840001
+Mean of mean bayes-corrected (q=0.75) extremity of replies: 0.32039106933723704
+Standard Deviation of bayes-corrected (q=0.75) extremity: 0.08248076963764756
+Standard Deviation of mean bayes-corrected (q=0.75) extremity of replies: 0.05223289636934443
+Degrees of Freedom: 2392895
+Cohen's d: -0.2796984844303324
+T-statistic: -391.6388789093796
+P-value: 0.0
\ No newline at end of file
diff --git a/results/Evidence_uncogeniality_model_with_seperate_upvotes_downvotes.txt b/results/Evidence_uncogeniality_model_with_seperate_upvotes_downvotes.txt
new file mode 100644
index 0000000..02bdb7d
--- /dev/null
+++ b/results/Evidence_uncogeniality_model_with_seperate_upvotes_downvotes.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.194
+Model: OLS   Adj. R-squared: 0.194
+Method: Least Squares   F-statistic: 7.311e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:51   Log-Likelihood: -1.0415e+07
+No. Observations: 6069971   AIC: 2.083e+07
+Df Residuals: 6069968   BIC: 2.083e+07
+Df Model: 2
+Covariance Type: nonrobust
+==============================================================================
+               coef   std err          t     P>|t|    [0.025    0.975]
+------------------------------------------------------------------------------
+const        1.1129     0.001   2037.629     0.000     1.112     1.114
+upvotes      0.0893     0.001    162.278     0.000     0.088     0.090
+downvotes    0.6433     0.001   1168.654     0.000     0.642     0.644
+==============================================================================
+Omnibus: 3179849.625   Durbin-Watson: 1.812
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 138815450.026
+Skew: 1.836   Prob(JB): 0.00
+Kurtosis: 26.138   Cond. No. 1.13
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_preregistered_model.txt b/results/Evidence_uncongeniality_preregistered_model.txt
new file mode 100644
index 0000000..14784e5
--- /dev/null
+++ b/results/Evidence_uncongeniality_preregistered_model.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.220
+Model: OLS   Adj. R-squared: 0.220
+Method: Least Squares   F-statistic: 6.744e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:43   Log-Likelihood: -8.1863e+06
+No. Observations: 4786218   AIC: 1.637e+07
+Df Residuals: 4786215   BIC: 1.637e+07
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               1.1760     0.001   1922.382     0.000     1.175     1.177
+bayes-corrected (q=0.25) valence   -0.4349     0.001   -707.468     0.000    -0.436    -0.434
+totalvotes                          0.5207     0.001    847.067     0.000     0.520     0.522
+==============================================================================
+Omnibus: 2282674.662   Durbin-Watson: 1.758
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 64040137.713
+Skew: 1.723   Prob(JB): 0.00
+Kurtosis: 20.586   Cond. No. 1.10
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_robustness_analysis_on_person_level.txt b/results/Evidence_uncongeniality_robustness_analysis_on_person_level.txt
new file mode 100644
index 0000000..2a5f03b
--- /dev/null
+++ b/results/Evidence_uncongeniality_robustness_analysis_on_person_level.txt
@@ -0,0 +1 @@
+totalvotes: 208.8619281171 (CI: [ 208.5635392284, 209.1603170058])
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_robustness_analysis_on_section_level.txt b/results/Evidence_uncongeniality_robustness_analysis_on_section_level.txt
new file mode 100644
index 0000000..905f677
--- /dev/null
+++ b/results/Evidence_uncongeniality_robustness_analysis_on_section_level.txt
@@ -0,0 +1 @@
+totalvotes: 444292.7728500224 (CI: [ 403421.6792516428, 485163.8664484020])
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_robustness_order1.txt b/results/Evidence_uncongeniality_robustness_order1.txt
new file mode 100644
index 0000000..86fc044
--- /dev/null
+++ b/results/Evidence_uncongeniality_robustness_order1.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.136
+Model: OLS   Adj. R-squared: 0.136
+Method: Least Squares   F-statistic: 3.982e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:50   Log-Likelihood: -6.2998e+06
+No. Observations: 5050120   AIC: 1.260e+07
+Df Residuals: 5050117   BIC: 1.260e+07
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               0.6133     0.000   1636.095     0.000     0.613     0.614
+bayes-corrected (q=0.25) valence   -0.2055     0.000   -548.027     0.000    -0.206    -0.205
+totalvotes                          0.2575     0.000    686.512     0.000     0.257     0.258
+==============================================================================
+Omnibus: 2832727.339   Durbin-Watson: 1.864
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 85433368.019
+Skew: 2.153   Prob(JB): 0.00
+Kurtosis: 22.684   Cond. No. 1.03
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_simplest_model_linear_regression_only_valence_non_standardized.txt b/results/Evidence_uncongeniality_simplest_model_linear_regression_only_valence_non_standardized.txt
new file mode 100644
index 0000000..aec4dc4
--- /dev/null
+++ b/results/Evidence_uncongeniality_simplest_model_linear_regression_only_valence_non_standardized.txt
@@ -0,0 +1,25 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.077
+Model: OLS   Adj. R-squared: 0.077
+Method: Least Squares   F-statistic: 4.005e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:42   Log-Likelihood: -8.5881e+06
+No. Observations: 4786218   AIC: 1.718e+07
+Df Residuals: 4786216   BIC: 1.718e+07
+Df Model: 1
+Covariance Type: nonrobust
+==============================================================================
+             coef   std err          t     P>|t|    [0.025    0.975]
+------------------------------------------------------------------------------
+const      1.4225     0.001   1845.132     0.000     1.421     1.424
+valence   -1.3913     0.002   -632.878     0.000    -1.396    -1.387
+==============================================================================
+Omnibus: 2883084.941   Durbin-Watson: 1.828
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 98618092.392
+Skew: 2.349   Prob(JB): 0.00
+Kurtosis: 24.736   Cond. No. 3.42
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_stability_against_variation_in_weight__no_bayes_correction.txt b/results/Evidence_uncongeniality_stability_against_variation_in_weight__no_bayes_correction.txt
new file mode 100644
index 0000000..c6f344f
--- /dev/null
+++ b/results/Evidence_uncongeniality_stability_against_variation_in_weight__no_bayes_correction.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.199
+Model: OLS   Adj. R-squared: 0.199
+Method: Least Squares   F-statistic: 5.941e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:47   Log-Likelihood: -8.2498e+06
+No. Observations: 4786218   AIC: 1.650e+07
+Df Residuals: 4786215   BIC: 1.650e+07
+Df Model: 2
+Covariance Type: nonrobust
+==============================================================================
+                coef   std err          t     P>|t|    [0.025    0.975]
+------------------------------------------------------------------------------
+const         1.1760     0.001   1897.046     0.000     1.175     1.177
+valence      -0.3745     0.001   -601.728     0.000    -0.376    -0.373
+totalvotes    0.5306     0.001    852.573     0.000     0.529     0.532
+==============================================================================
+Omnibus: 2293481.647   Durbin-Watson: 1.752
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 63398255.054
+Skew: 1.739   Prob(JB): 0.00
+Kurtosis: 20.487   Cond. No. 1.09
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_stability_against_variation_in_weight_q5.txt b/results/Evidence_uncongeniality_stability_against_variation_in_weight_q5.txt
new file mode 100644
index 0000000..87c6dbb
--- /dev/null
+++ b/results/Evidence_uncongeniality_stability_against_variation_in_weight_q5.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.229
+Model: OLS   Adj. R-squared: 0.229
+Method: Least Squares   F-statistic: 7.096e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:44   Log-Likelihood: -8.1590e+06
+No. Observations: 4786218   AIC: 1.632e+07
+Df Residuals: 4786215   BIC: 1.632e+07
+Df Model: 2
+Covariance Type: nonrobust
+===================================================================================================
+                                     coef   std err          t     P>|t|    [0.025    0.975]
+---------------------------------------------------------------------------------------------------
+const                              1.1760     0.001   1933.368     0.000     1.175     1.177
+bayes-corrected (q=0.5) valence   -0.4582     0.001   -749.070     0.000    -0.459    -0.457
+totalvotes                         0.5147     0.001    841.341     0.000     0.513     0.516
+==============================================================================
+Omnibus: 2271398.527   Durbin-Watson: 1.760
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 63503192.358
+Skew: 1.712   Prob(JB): 0.00
+Kurtosis: 20.513   Cond. No. 1.11
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongeniality_stability_against_variation_in_weight_q75.txt b/results/Evidence_uncongeniality_stability_against_variation_in_weight_q75.txt
new file mode 100644
index 0000000..760a407
--- /dev/null
+++ b/results/Evidence_uncongeniality_stability_against_variation_in_weight_q75.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.236
+Model: OLS   Adj. R-squared: 0.236
+Method: Least Squares   F-statistic: 7.380e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:45   Log-Likelihood: -8.1372e+06
+No. Observations: 4786218   AIC: 1.627e+07
+Df Residuals: 4786215   BIC: 1.627e+07
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                       coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                                1.1760     0.001   1942.187     0.000     1.175     1.177
+bayes-corrected (q=0.75) valence    -0.4762     0.001   -781.029     0.000    -0.477    -0.475
+totalvotes                           0.5081     0.001    833.387     0.000     0.507     0.509
+==============================================================================
+Omnibus: 2256599.632   Durbin-Watson: 1.761
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 62251699.550
+Skew: 1.700   Prob(JB): 0.00
+Kurtosis: 20.338   Cond. No. 1.12
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongenialty_section_affairs.txt b/results/Evidence_uncongenialty_section_affairs.txt
new file mode 100644
index 0000000..1298acf
--- /dev/null
+++ b/results/Evidence_uncongenialty_section_affairs.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.237
+Model: OLS   Adj. R-squared: 0.237
+Method: Least Squares   F-statistic: 1.380e+05
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:48   Log-Likelihood: -1.5539e+06
+No. Observations: 890221   AIC: 3.108e+06
+Df Residuals: 890218   BIC: 3.108e+06
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               1.1789     0.001    802.397     0.000     1.176     1.182
+bayes-corrected (q=0.25) valence   -0.4979     0.001   -337.303     0.000    -0.501    -0.495
+totalvotes                          0.5435     0.001    368.179     0.000     0.541     0.546
+==============================================================================
+Omnibus: 415616.007   Durbin-Watson: 1.775
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 8567668.092
+Skew: 1.765   Prob(JB): 0.00
+Kurtosis: 17.782   Cond. No. 1.10
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongenialty_section_culture.txt b/results/Evidence_uncongenialty_section_culture.txt
new file mode 100644
index 0000000..1d100a8
--- /dev/null
+++ b/results/Evidence_uncongenialty_section_culture.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.243
+Model: OLS   Adj. R-squared: 0.243
+Method: Least Squares   F-statistic: 3.781e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:48   Log-Likelihood: -3.6290e+05
+No. Observations: 235911   AIC: 7.258e+05
+Df Residuals: 235908   BIC: 7.258e+05
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               0.9173     0.002    395.396     0.000     0.913     0.922
+bayes-corrected (q=0.25) valence   -0.3334     0.002   -142.771     0.000    -0.338    -0.329
+totalvotes                          0.5075     0.002    217.346     0.000     0.503     0.512
+==============================================================================
+Omnibus: 99947.806   Durbin-Watson: 1.805
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 886847.368
+Skew: 1.813   Prob(JB): 0.00
+Kurtosis: 11.779   Cond. No. 1.12
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongenialty_section_economy.txt b/results/Evidence_uncongenialty_section_economy.txt
new file mode 100644
index 0000000..0c08249
--- /dev/null
+++ b/results/Evidence_uncongenialty_section_economy.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.196
+Model: OLS   Adj. R-squared: 0.196
+Method: Least Squares   F-statistic: 7.576e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:48   Log-Likelihood: -1.0058e+06
+No. Observations: 620776   AIC: 2.012e+06
+Df Residuals: 620773   BIC: 2.012e+06
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               1.1396     0.002    734.230     0.000     1.137     1.143
+bayes-corrected (q=0.25) valence   -0.3478     0.002   -223.518     0.000    -0.351    -0.345
+totalvotes                          0.4695     0.002    301.664     0.000     0.466     0.473
+==============================================================================
+Omnibus: 202475.900   Durbin-Watson: 1.799
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 1088427.374
+Skew: 1.479   Prob(JB): 0.00
+Kurtosis: 8.773   Cond. No. 1.08
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongenialty_section_health.txt b/results/Evidence_uncongenialty_section_health.txt
new file mode 100644
index 0000000..8d07d92
--- /dev/null
+++ b/results/Evidence_uncongenialty_section_health.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.257
+Model: OLS   Adj. R-squared: 0.257
+Method: Least Squares   F-statistic: 8576.
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:48   Log-Likelihood: -86794.
+No. Observations: 49462   AIC: 1.736e+05
+Df Residuals: 49459   BIC: 1.736e+05
+Df Model: 2
+Covariance Type: nonrobust
+====================================================================================================
+                                      coef   std err          t     P>|t|    [0.025    0.975]
+----------------------------------------------------------------------------------------------------
+const                               1.3371     0.006    212.544     0.000     1.325     1.349
+bayes-corrected (q=0.25) valence   -0.4685     0.006    -73.917     0.000    -0.481    -0.456
+totalvotes                          0.6228     0.006     98.259     0.000     0.610     0.635
+==============================================================================
+Omnibus: 17663.533   Durbin-Watson: 1.771
+Prob(Omnibus): 0.000   Jarque-Bera (JB): 106942.347
+Skew: 1.595   Prob(JB): 0.00
+Kurtosis: 9.459   Cond. No. 1.13
+==============================================================================
+
+Notes:
+[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
\ No newline at end of file
diff --git a/results/Evidence_uncongenialty_section_internet.txt b/results/Evidence_uncongenialty_section_internet.txt
new file mode 100644
index 0000000..a2f0d4d
--- /dev/null
+++ b/results/Evidence_uncongenialty_section_internet.txt
@@ -0,0 +1,26 @@
+                          OLS Regression Results
+=================================================================================
+Dep. Variable: number O(n+1)-replies   R-squared: 0.256
+Model: OLS   Adj. R-squared: 0.256
+Method: Least Squares   F-statistic: 2.267e+04
+Date: Mon, 22 Jul 2024   Prob (F-statistic): 0.00
+Time: 09:31:48   Log-Likelihood: -2.1421e+05
+No.
Observations: 131977 AIC: 4.284e+05 +Df Residuals: 131974 BIC: 4.284e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.0804 0.003 320.014 0.000 1.074 1.087 +bayes-corrected (q=0.25) valence -0.4040 0.003 -118.355 0.000 -0.411 -0.397 +totalvotes 0.5375 0.003 157.450 0.000 0.531 0.544 +============================================================================== +Omnibus: 54168.298 Durbin-Watson: 1.825 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 590918.640 +Skew: 1.674 Prob(JB): 0.00 +Kurtosis: 12.811 Cond. No. 1.16 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_uncongenialty_section_miscellaneous.txt b/results/Evidence_uncongenialty_section_miscellaneous.txt new file mode 100644 index 0000000..607320e --- /dev/null +++ b/results/Evidence_uncongenialty_section_miscellaneous.txt @@ -0,0 +1,26 @@ + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.246 +Model: OLS Adj. R-squared: 0.246 +Method: Least Squares F-statistic: 7.921e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -8.1045e+05 +No. 
Observations: 485006 AIC: 1.621e+06 +Df Residuals: 485003 BIC: 1.621e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1141 0.002 602.981 0.000 1.110 1.118 +bayes-corrected (q=0.25) valence -0.4406 0.002 -237.533 0.000 -0.444 -0.437 +totalvotes 0.5508 0.002 296.904 0.000 0.547 0.554 +============================================================================== +Omnibus: 308614.044 Durbin-Watson: 1.795 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 33388300.741 +Skew: 2.187 Prob(JB): 0.00 +Kurtosis: 43.411 Cond. No. 1.09 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_uncongenialty_section_mobility.txt b/results/Evidence_uncongenialty_section_mobility.txt new file mode 100644 index 0000000..cff0e6e --- /dev/null +++ b/results/Evidence_uncongenialty_section_mobility.txt @@ -0,0 +1,26 @@ + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.198 +Model: OLS Adj. R-squared: 0.198 +Method: Least Squares F-statistic: 1.449e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.9705e+05 +No. 
Observations: 117051 AIC: 3.941e+05 +Df Residuals: 117048 BIC: 3.941e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.3476 0.004 353.887 0.000 1.340 1.355 +bayes-corrected (q=0.25) valence -0.3144 0.004 -80.973 0.000 -0.322 -0.307 +totalvotes 0.5090 0.004 131.111 0.000 0.501 0.517 +============================================================================== +Omnibus: 32287.766 Durbin-Watson: 1.796 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 111823.546 +Skew: 1.377 Prob(JB): 0.00 +Kurtosis: 6.917 Cond. No. 1.22 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_uncongenialty_section_politics.txt b/results/Evidence_uncongenialty_section_politics.txt new file mode 100644 index 0000000..fa8edf9 --- /dev/null +++ b/results/Evidence_uncongenialty_section_politics.txt @@ -0,0 +1,26 @@ + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.209 +Model: OLS Adj. R-squared: 0.209 +Method: Least Squares F-statistic: 1.708e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -2.1743e+06 +No. 
Observations: 1295105 AIC: 4.349e+06 +Df Residuals: 1295102 BIC: 4.349e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1182 0.001 981.264 0.000 1.116 1.120 +bayes-corrected (q=0.25) valence -0.3909 0.001 -341.822 0.000 -0.393 -0.389 +totalvotes 0.5079 0.001 444.124 0.000 0.506 0.510 +============================================================================== +Omnibus: 680589.819 Durbin-Watson: 1.782 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 49094495.451 +Skew: 1.699 Prob(JB): 0.00 +Kurtosis: 32.971 Cond. No. 1.09 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_uncongenialty_section_science.txt b/results/Evidence_uncongenialty_section_science.txt new file mode 100644 index 0000000..04e4bc5 --- /dev/null +++ b/results/Evidence_uncongenialty_section_science.txt @@ -0,0 +1,26 @@ + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.253 +Model: OLS Adj. R-squared: 0.253 +Method: Least Squares F-statistic: 9.746e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.0810e+06 +No. 
Observations: 575190 AIC: 2.162e+06 +Df Residuals: 575187 BIC: 2.162e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.6458 0.002 787.663 0.000 1.642 1.650 +bayes-corrected (q=0.25) valence -0.3951 0.002 -184.289 0.000 -0.399 -0.391 +totalvotes 0.7495 0.002 349.574 0.000 0.745 0.754 +============================================================================== +Omnibus: 194870.309 Durbin-Watson: 1.765 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1100608.449 +Skew: 1.527 Prob(JB): 0.00 +Kurtosis: 9.050 Cond. No. 1.26 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Evidence_uncongenialty_section_sports.txt b/results/Evidence_uncongenialty_section_sports.txt new file mode 100644 index 0000000..c2b6204 --- /dev/null +++ b/results/Evidence_uncongenialty_section_sports.txt @@ -0,0 +1,26 @@ + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.256 +Model: OLS Adj. R-squared: 0.256 +Method: Least Squares F-statistic: 3.965e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -3.4768e+05 +No. 
Observations: 230524 AIC: 6.954e+05 +Df Residuals: 230521 BIC: 6.954e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 0.8891 0.002 390.420 0.000 0.885 0.894 +bayes-corrected (q=0.25) valence -0.3918 0.002 -171.548 0.000 -0.396 -0.387 +totalvotes 0.4784 0.002 209.473 0.000 0.474 0.483 +============================================================================== +Omnibus: 109314.794 Durbin-Watson: 1.837 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1540320.347 +Skew: 1.926 Prob(JB): 0.00 +Kurtosis: 15.063 Cond. No. 1.08 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. \ No newline at end of file diff --git a/results/Extended_Data_Table_1_Descriptive_Data_for_different_comment_levels.csv b/results/Extended_Data_Table_1_Descriptive_Data_for_different_comment_levels.csv new file mode 100644 index 0000000..306becc --- /dev/null +++ b/results/Extended_Data_Table_1_Descriptive_Data_for_different_comment_levels.csv @@ -0,0 +1,6 @@ +,total_count,totalvotes_nonzero,totalvotes_sum,totalvotes_mean,totalvotes_std_dev,upvotes_sum,downvotes_sum,bayes-corrected (q=0.25) valence_mean,bayes-corrected (q=0.25) valence_std_dev,bayes-corrected (q=0.25) extremity_mean,bayes-corrected (q=0.25) extremity_std_dev +Total,20161317,14706588,154821490,7.679135742967585,11.556761173578584,102022297,52799193,0.17704715667464238,0.2021943857837279,0.3158469490984941,0.11041613015478122 +0,6069971,4786218,77964965,12.84437190886085,15.885943210670801,50729878,27235087,0.17395713322300024,0.2242844895596835,0.3081309024315796,0.11996477068190904 
+1,6755090,5050120,46320518,6.857128180379536,9.434159387282262,31970589,14349929,0.19288237536756858,0.19202605012773946,0.3198901628191536,0.10879194365502998
+2,3786555,2608297,18126241,4.787000584964433,7.119372189759389,11414766,6711475,0.16402235924006153,0.19272989539907023,0.31758620689084427,0.10383240571773994
+3,3549701,2261953,12409766,3.496003184493567,5.5015115601260955,7907064,4502702,0.16325036748997546,0.1823236133759817,0.3211412527772316,0.09881505843280065
diff --git a/results/Extended_Data_Table_2_Descriptive_Data_for_different_news_categories.csv b/results/Extended_Data_Table_2_Descriptive_Data_for_different_news_categories.csv
new file mode 100644
index 0000000..51f66fb
--- /dev/null
+++ b/results/Extended_Data_Table_2_Descriptive_Data_for_different_news_categories.csv
@@ -0,0 +1,28 @@
+,total_count,number O(n+1)-replies_sum,number O(n+1)-replies_nonzero,totalvotes_nonzero,totalvotes_sum,upvotes_sum,downvotes_sum,valence_mean,valence_std_dev,bayes-corrected (q=0.25) valence_mean,bayes-corrected (q=0.25) valence_std_dev,extremity_mean,extremity_std_dev,bayes-corrected (q=0.25) extremity_mean,bayes-corrected (q=0.25) extremity_std_dev
+Total,20161317,14091458,7495456,14706588,154821490,102022297,52799193,0.1809732449640069,0.3221017025345512,0.17704715667464238,0.2021943857837279,0.3263665242005636,0.17316382137734357,0.3158469490984941,0.11041613015478122
+Backstage,2638,1309,916,2091,19339,14013,5326,0.23396256608765395,0.306585082923225,0.21223527077342733,0.18627809433377085,0.34578357774960666,0.1706915190446471,0.3284342540070973,0.10947346803113846
+Career,125360,84283,47627,94139,991285,688567,302718,0.23144672748128672,0.31736287806178626,0.207642429386839,0.20352379529383485,0.3562556537779823,0.165432505887087,0.33650872389557185,0.10619423486404499
+Community,2546,1519,921,1691,10943,7543,3400,0.22997949625046876,0.31967854887735475,0.200451242569652,0.17131680735740445,0.35250476788410146,0.17545113771119852,0.3275935830615364,0.09841736341890532 +Culture,783764,492965,283683,594634,7485201,4965916,2519285,0.18924208094858397,0.31430902283965584,0.18237701691103533,0.2075772292325427,0.3257271885165046,0.16883248354886274,0.3165073708092221,0.11183033084670414 +Economy,2532709,1753030,981305,1832061,15418671,10477493,4941178,0.19695589659717094,0.3200608942654652,0.1874348176320727,0.18779895991104745,0.33178431101377187,0.17649287038142622,0.3182130588344701,0.10772151446952792 +Family,49628,31670,18194,38744,504399,350538,153861,0.2207067677046886,0.3014587355657185,0.2041527930238584,0.2020069369216948,0.33366309594582666,0.1680957128606617,0.32246786364599994,0.11428810172513852 +Fitness,3010,2211,1182,2183,22484,14215,8269,0.15967373329129744,0.30431877346003455,0.1619996073865862,0.1864234078141263,0.29418798473390506,0.17756989413975396,0.29151785340526726,0.1149792183614986 +Foreign affairs,3677268,2544425,1330773,2734274,33483913,22653979,10829934,0.2002575026917707,0.325382737606063,0.1904666416894296,0.21644036248740706,0.34356756409069306,0.16714752082493223,0.32992358948836636,0.1107027200883534 +Health,232501,170195,87992,169188,1861800,1220143,641657,0.18424450969159725,0.32460476601354893,0.17814240568123157,0.20780662621629975,0.3319043469304209,0.17074351373987692,0.3200033262093227,0.10944699632311113 +History,72480,47028,26802,56445,679183,472071,207112,0.22095860634303593,0.31534867450645176,0.2039420491794974,0.2117808348179994,0.3471692447301984,0.1665525064085087,0.3331839376145845,0.11121773599665685 +International,1778,661,443,1021,5800,3874,1926,0.19351189270788866,0.3781743471452822,0.18492958397224973,0.194320028301713,0.3900060650559993,0.16806412167150314,0.3495507394881325,0.08766877038980984 
+Internet,498610,333659,186807,367308,3674903,2466723,1208180,0.2091647658832731,0.3286639549837689,0.19303117978437134,0.2079459094850678,0.3515812393300087,0.16781033736005121,0.3332794356286173,0.10537720414529439 +Miscellaneous,1962726,1352139,729325,1449191,17475106,11899133,5575973,0.20487949983503603,0.32223613216900404,0.1930421103534628,0.21138823123717118,0.3434851196383367,0.16682222414119258,0.3292431100186108,0.10905631859394622 +Mobility,554408,415352,219481,421827,3502371,2196167,1306204,0.16054863670067138,0.31191065777214666,0.1618961069183528,0.18002270627566236,0.30120556091206946,0.1798309742036998,0.2958943291943663,0.10998330012774073 +Politics,5116347,3451139,1901059,3675657,39155173,25667532,13487641,0.17444454738877926,0.3163335278430911,0.17315213087529358,0.19604468474970993,0.31674249710795505,0.17370081307098184,0.3084535160005122,0.11058794557807118 +Psychology,77714,49836,28755,59103,731898,505589,226309,0.20632700233260906,0.31154092970651803,0.1944627043194951,0.20658764113640296,0.3333799252712208,0.1687757327928834,0.3224848844257549,0.1127179346512217 +Relationships,8131,4828,2914,6585,117075,86625,30450,0.24777040590992844,0.29358367752069375,0.22795252407820465,0.21253014623413818,0.34813759810576583,0.1623966700762943,0.33698553182471686,0.11775078466218207 +Science,3525557,2774136,1307843,2480281,21660444,12904848,8755596,0.13124914209737715,0.3254561015718625,0.14358107029299916,0.1929315922045766,0.30217351871367176,0.17843527137661813,0.29726966358820944,0.10920227830101752 +Services,15,6,4,13,70,49,21,0.11337188452573069,0.33244948852144557,0.16113358674678163,0.17243290163606495,0.2928590640129101,0.17757621269031232,0.3027156286778409,0.09718850862453934 +Sports,742645,458996,266832,573957,6603661,4457164,2146497,0.19390481300491805,0.324823993978004,0.1866839749335055,0.21488052454498108,0.3392841168420786,0.1673196240006354,0.3276705379187469,0.1093840089279375 
+Start,59059,38288,22794,45209,446297,312161,134136,0.23012121622760803,0.3130036249047111,0.20696856220124274,0.20060384009293042,0.35070753025975687,0.16712187857191027,0.33303728389868575,0.10782854662594989
+Style,30611,17243,10890,24054,237133,168395,68738,0.24331081020636638,0.3088338815884642,0.2153432930645136,0.20011272670414704,0.35698752853489446,0.1647288201407465,0.3384781050249686,0.10535514942392003
+Tests,14585,8163,5413,11604,99542,73441,26101,0.27363177574221587,0.2996883510847475,0.23290915494157782,0.1893771434271009,0.37215267353257064,0.1618158234175073,0.3471946880707655,0.10229356775332507
+Total,2638,2185,922,1915,17354,9336,8018,0.0677696797423411,0.3072866586840947,0.10190539473940045,0.18182325609165528,0.2587321299251362,0.17900538958637705,0.266230689221851,0.11332285476209328
+Travel,84136,55950,32431,63101,614135,404586,209549,0.19389251358743367,0.31251346847520517,0.18297252866496708,0.19464504647573025,0.32412318682777963,0.1737874145135004,0.3130547750304372,0.1114047246370484
+Your SPIEGEL,453,242,148,312,3310,2196,1114,0.18208463027330246,0.2940537652828505,0.17712455278668807,0.1727453899902895,0.29483491204446555,0.1803382964418267,0.29333866069487885,0.10993924168770208
diff --git a/results/Extended_Fig_1.png b/results/Extended_Fig_1.png
new file mode 100644
index 0000000..d4dc8bf
Binary files /dev/null and b/results/Extended_Fig_1.png differ
diff --git a/results/Fig_2a.png b/results/Fig_2a.png
new file mode 100644
index 0000000..789dc3b
Binary files /dev/null and b/results/Fig_2a.png differ
diff --git a/results/Fig_2b.png b/results/Fig_2b.png
new file mode 100644
index 0000000..0f89b0b
Binary files /dev/null and b/results/Fig_2b.png differ
diff --git a/results/Fig_2c.png b/results/Fig_2c.png
new file mode 100644
index 0000000..a70b777
Binary files /dev/null and b/results/Fig_2c.png differ
diff --git a/results/Fig_3a.png b/results/Fig_3a.png
new file mode 100644
index 0000000..6d59afe
Binary files /dev/null and
b/results/Fig_3a.png differ
diff --git a/results/Fig_3b.png b/results/Fig_3b.png
new file mode 100644
index 0000000..d51df71
Binary files /dev/null and b/results/Fig_3b.png differ
diff --git a/results/Fig_4a.png b/results/Fig_4a.png
new file mode 100644
index 0000000..01dd7b2
Binary files /dev/null and b/results/Fig_4a.png differ
diff --git a/results/Fig_4b.png b/results/Fig_4b.png
new file mode 100644
index 0000000..e08e28e
Binary files /dev/null and b/results/Fig_4b.png differ
diff --git a/results_reports/ b/results_reports/
new file mode 100644
index 0000000..8c3b3e2
--- /dev/null
+++ b/results_reports/
@@ -0,0 +1,1644 @@
+# Analysis Results for July 22, 2024
+
+Descriptive Analysis: Extended_Data_Table_1_Descriptive_Data_for_different_comment_levels
+```
+Data: data
+Metrics: [Metric(operation='count', column=None), Metric(operation='count_nonzero', column='totalvotes'), Metric(operation='sum', column='totalvotes'), Metric(operation='mean', column='totalvotes'), Metric(operation='std_dev', column='totalvotes'), Metric(operation='sum', column='upvotes'), Metric(operation='sum', column='downvotes'), Metric(operation='mean', column='bayes-corrected (q=0.25) valence'), Metric(operation='std_dev', column='bayes-corrected (q=0.25) valence'), Metric(operation='mean', column='bayes-corrected (q=0.25) extremity'), Metric(operation='std_dev', column='bayes-corrected (q=0.25) extremity')]
+Group By: order
+```
+| | total_count | totalvotes_nonzero | totalvotes_sum | totalvotes_mean | totalvotes_std_dev | upvotes_sum | downvotes_sum | bayes-corrected (q=0.25) valence_mean | bayes-corrected (q=0.25) valence_std_dev | bayes-corrected (q=0.25) extremity_mean | bayes-corrected (q=0.25) extremity_std_dev |
+|:------|--------------:|---------------------:|-----------------:|------------------:|---------------------:|--------------:|----------------:|----------------------------------------:|-------------------------------------------:|------------------------------------------:|---------------------------------------------:|
+| Total | 2.01613e+07 | 1.47066e+07 | 1.54821e+08 | 7.67914 | 11.5568 | 1.02022e+08 | 5.27992e+07 | 0.177047 | 0.202194 | 0.315847 | 0.110416 |
+| 0 | 6.06997e+06 | 4.78622e+06 | 7.7965e+07 | 12.8444 | 15.8859 | 5.07299e+07 | 2.72351e+07 | 0.173957 | 0.224284 | 0.308131 | 0.119965 |
+| 1 | 6.75509e+06 | 5.05012e+06 | 4.63205e+07 | 6.85713 | 9.43416 | 3.19706e+07 | 1.43499e+07 | 0.192882 | 0.192026 | 0.31989 | 0.108792 |
+| 2 | 3.78656e+06 | 2.6083e+06 | 1.81262e+07 | 4.787 | 7.11937 | 1.14148e+07 | 6.71148e+06 | 0.164022 | 0.19273 | 0.317586 | 0.103832 |
+| 3 | 3.5497e+06 | 2.26195e+06 | 1.24098e+07 | 3.496 | 5.50151 | 7.90706e+06 | 4.5027e+06 | 0.16325 | 0.182324 | 0.321141 | 0.0988151 |
+Descriptive Analysis: Extended_Data_Table_2_Descriptive_Data_for_different_news_categories
+```
+Data: data
+Metrics: [Metric(operation='count', column=None), Metric(operation='sum', column='number O(n+1)-replies'), Metric(operation='count_nonzero', column='number O(n+1)-replies'), Metric(operation='count_nonzero', column='totalvotes'), Metric(operation='sum', column='totalvotes'), Metric(operation='sum', column='upvotes'), Metric(operation='sum', column='downvotes'), Metric(operation='count_nonzero', column='totalvotes'), Metric(operation='mean', column='valence'), Metric(operation='std_dev', column='valence'), Metric(operation='mean', column='bayes-corrected (q=0.25) valence'), Metric(operation='std_dev', column='bayes-corrected (q=0.25) valence'), Metric(operation='mean', column='extremity'), Metric(operation='std_dev', column='extremity'), Metric(operation='mean', column='bayes-corrected (q=0.25) extremity'), Metric(operation='std_dev',
column='bayes-corrected (q=0.25) extremity')] +Group By: section +``` +| | total_count | number O(n+1)-replies_sum | number O(n+1)-replies_nonzero | totalvotes_nonzero | totalvotes_sum | upvotes_sum | downvotes_sum | valence_mean | valence_std_dev | bayes-corrected (q=0.25) valence_mean | bayes-corrected (q=0.25) valence_std_dev | extremity_mean | extremity_std_dev | bayes-corrected (q=0.25) extremity_mean | bayes-corrected (q=0.25) extremity_std_dev | +|:----------------|-----------------:|----------------------------:|--------------------------------:|---------------------:|-----------------:|-----------------:|-----------------:|---------------:|------------------:|----------------------------------------:|-------------------------------------------:|-----------------:|--------------------:|------------------------------------------:|---------------------------------------------:| +| Total | 2.01613e+07 | 1.40915e+07 | 7.49546e+06 | 1.47066e+07 | 1.54821e+08 | 1.02022e+08 | 5.27992e+07 | 0.180973 | 0.322102 | 0.177047 | 0.202194 | 0.326367 | 0.173164 | 0.315847 | 0.110416 | +| Backstage | 2638 | 1309 | 916 | 2091 | 19339 | 14013 | 5326 | 0.233963 | 0.306585 | 0.212235 | 0.186278 | 0.345784 | 0.170692 | 0.328434 | 0.109473 | +| Career | 125360 | 84283 | 47627 | 94139 | 991285 | 688567 | 302718 | 0.231447 | 0.317363 | 0.207642 | 0.203524 | 0.356256 | 0.165433 | 0.336509 | 0.106194 | +| Community | 2546 | 1519 | 921 | 1691 | 10943 | 7543 | 3400 | 0.229979 | 0.319679 | 0.200451 | 0.171317 | 0.352505 | 0.175451 | 0.327594 | 0.0984174 | +| Culture | 783764 | 492965 | 283683 | 594634 | 7.4852e+06 | 4.96592e+06 | 2.51928e+06 | 0.189242 | 0.314309 | 0.182377 | 0.207577 | 0.325727 | 0.168832 | 0.316507 | 0.11183 | +| Economy | 2.53271e+06 | 1.75303e+06 | 981305 | 1.83206e+06 | 1.54187e+07 | 1.04775e+07 | 4.94118e+06 | 0.196956 | 0.320061 | 0.187435 | 0.187799 | 0.331784 | 0.176493 | 0.318213 | 0.107722 | +| Family | 49628 | 31670 | 18194 | 38744 | 504399 | 350538 | 153861 
| 0.220707 | 0.301459 | 0.204153 | 0.202007 | 0.333663 | 0.168096 | 0.322468 | 0.114288 | +| Fitness | 3010 | 2211 | 1182 | 2183 | 22484 | 14215 | 8269 | 0.159674 | 0.304319 | 0.162 | 0.186423 | 0.294188 | 0.17757 | 0.291518 | 0.114979 | +| Foreign affairs | 3.67727e+06 | 2.54442e+06 | 1.33077e+06 | 2.73427e+06 | 3.34839e+07 | 2.2654e+07 | 1.08299e+07 | 0.200258 | 0.325383 | 0.190467 | 0.21644 | 0.343568 | 0.167148 | 0.329924 | 0.110703 | +| Health | 232501 | 170195 | 87992 | 169188 | 1.8618e+06 | 1.22014e+06 | 641657 | 0.184245 | 0.324605 | 0.178142 | 0.207807 | 0.331904 | 0.170744 | 0.320003 | 0.109447 | +| History | 72480 | 47028 | 26802 | 56445 | 679183 | 472071 | 207112 | 0.220959 | 0.315349 | 0.203942 | 0.211781 | 0.347169 | 0.166553 | 0.333184 | 0.111218 | +| International | 1778 | 661 | 443 | 1021 | 5800 | 3874 | 1926 | 0.193512 | 0.378174 | 0.18493 | 0.19432 | 0.390006 | 0.168064 | 0.349551 | 0.0876688 | +| Internet | 498610 | 333659 | 186807 | 367308 | 3.6749e+06 | 2.46672e+06 | 1.20818e+06 | 0.209165 | 0.328664 | 0.193031 | 0.207946 | 0.351581 | 0.16781 | 0.333279 | 0.105377 | +| Miscellaneous | 1.96273e+06 | 1.35214e+06 | 729325 | 1.44919e+06 | 1.74751e+07 | 1.18991e+07 | 5.57597e+06 | 0.204879 | 0.322236 | 0.193042 | 0.211388 | 0.343485 | 0.166822 | 0.329243 | 0.109056 | +| Mobility | 554408 | 415352 | 219481 | 421827 | 3.50237e+06 | 2.19617e+06 | 1.3062e+06 | 0.160549 | 0.311911 | 0.161896 | 0.180023 | 0.301206 | 0.179831 | 0.295894 | 0.109983 | +| Politics | 5.11635e+06 | 3.45114e+06 | 1.90106e+06 | 3.67566e+06 | 3.91552e+07 | 2.56675e+07 | 1.34876e+07 | 0.174445 | 0.316334 | 0.173152 | 0.196045 | 0.316742 | 0.173701 | 0.308454 | 0.110588 | +| Psychology | 77714 | 49836 | 28755 | 59103 | 731898 | 505589 | 226309 | 0.206327 | 0.311541 | 0.194463 | 0.206588 | 0.33338 | 0.168776 | 0.322485 | 0.112718 | +| Relationships | 8131 | 4828 | 2914 | 6585 | 117075 | 86625 | 30450 | 0.24777 | 0.293584 | 0.227953 | 0.21253 | 0.348138 | 0.162397 | 0.336986 | 
0.117751 | +| Science | 3.52556e+06 | 2.77414e+06 | 1.30784e+06 | 2.48028e+06 | 2.16604e+07 | 1.29048e+07 | 8.7556e+06 | 0.131249 | 0.325456 | 0.143581 | 0.192932 | 0.302174 | 0.178435 | 0.29727 | 0.109202 | +| Services | 15 | 6 | 4 | 13 | 70 | 49 | 21 | 0.113372 | 0.332449 | 0.161134 | 0.172433 | 0.292859 | 0.177576 | 0.302716 | 0.0971885 | +| Sports | 742645 | 458996 | 266832 | 573957 | 6.60366e+06 | 4.45716e+06 | 2.1465e+06 | 0.193905 | 0.324824 | 0.186684 | 0.214881 | 0.339284 | 0.16732 | 0.327671 | 0.109384 | +| Start | 59059 | 38288 | 22794 | 45209 | 446297 | 312161 | 134136 | 0.230121 | 0.313004 | 0.206969 | 0.200604 | 0.350708 | 0.167122 | 0.333037 | 0.107829 | +| Style | 30611 | 17243 | 10890 | 24054 | 237133 | 168395 | 68738 | 0.243311 | 0.308834 | 0.215343 | 0.200113 | 0.356988 | 0.164729 | 0.338478 | 0.105355 | +| Tests | 14585 | 8163 | 5413 | 11604 | 99542 | 73441 | 26101 | 0.273632 | 0.299688 | 0.232909 | 0.189377 | 0.372153 | 0.161816 | 0.347195 | 0.102294 | +| Total | 2638 | 2185 | 922 | 1915 | 17354 | 9336 | 8018 | 0.0677697 | 0.307287 | 0.101905 | 0.181823 | 0.258732 | 0.179005 | 0.266231 | 0.113323 | +| Travel | 84136 | 55950 | 32431 | 63101 | 614135 | 404586 | 209549 | 0.193893 | 0.312513 | 0.182973 | 0.194645 | 0.324123 | 0.173787 | 0.313055 | 0.111405 | +| Your SPIEGEL | 453 | 242 | 148 | 312 | 3310 | 2196 | 1114 | 0.182085 | 0.294054 | 0.177125 | 0.172745 | 0.294835 | 0.180338 | 0.293339 | 0.109939 | +Linear Regression Analysis: Evidence_uncongeniality_simplest_model_linear_regression_only_valence_non_standardized +``` +Independent Variables: ['valence'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: False +Report effect size: True +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.077 +Model: OLS Adj. 
R-squared: 0.077 +Method: Least Squares F-statistic: 4.005e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:42 Log-Likelihood: -8.5881e+06 +No. Observations: 4786218 AIC: 1.718e+07 +Df Residuals: 4786216 BIC: 1.718e+07 +Df Model: 1 +Covariance Type: nonrobust +============================================================================== + coef std err t P>|t| [0.025 0.975] +------------------------------------------------------------------------------ +const 1.4225 0.001 1845.132 0.000 1.421 1.424 +valence -1.3913 0.002 -632.878 0.000 -1.396 -1.387 +============================================================================== +Omnibus: 2883084.941 Durbin-Watson: 1.828 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 98618092.392 +Skew: 2.349 Prob(JB): 0.00 +Kurtosis: 24.736 Cond. No. 3.42 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongeniality_preregistered_model +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: True +Report effect size: True +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.220 +Model: OLS Adj. R-squared: 0.220 +Method: Least Squares F-statistic: 6.744e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:43 Log-Likelihood: -8.1863e+06 +No. 
Observations: 4786218 AIC: 1.637e+07 +Df Residuals: 4786215 BIC: 1.637e+07 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1760 0.001 1922.382 0.000 1.175 1.177 +bayes-corrected (q=0.25) valence -0.4349 0.001 -707.468 0.000 -0.436 -0.434 +totalvotes 0.5207 0.001 847.067 0.000 0.520 0.522 +============================================================================== +Omnibus: 2282674.662 Durbin-Watson: 1.758 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 64040137.713 +Skew: 1.723 Prob(JB): 0.00 +Kurtosis: 20.586 Cond. No. 1.10 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongeniality_stability_against_variation_in_weight_q5 +``` +Independent Variables: ['bayes-corrected (q=0.5) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.229 +Model: OLS Adj. R-squared: 0.229 +Method: Least Squares F-statistic: 7.096e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:44 Log-Likelihood: -8.1590e+06 +No. 
Observations: 4786218 AIC: 1.632e+07 +Df Residuals: 4786215 BIC: 1.632e+07 +Df Model: 2 +Covariance Type: nonrobust +=================================================================================================== + coef std err t P>|t| [0.025 0.975] +--------------------------------------------------------------------------------------------------- +const 1.1760 0.001 1933.368 0.000 1.175 1.177 +bayes-corrected (q=0.5) valence -0.4582 0.001 -749.070 0.000 -0.459 -0.457 +totalvotes 0.5147 0.001 841.341 0.000 0.513 0.516 +============================================================================== +Omnibus: 2271398.527 Durbin-Watson: 1.760 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 63503192.358 +Skew: 1.712 Prob(JB): 0.00 +Kurtosis: 20.513 Cond. No. 1.11 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongeniality_stability_against_variation_in_weight_q75 +``` +Independent Variables: ['bayes-corrected (q=0.75) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.236 +Model: OLS Adj. R-squared: 0.236 +Method: Least Squares F-statistic: 7.380e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:45 Log-Likelihood: -8.1372e+06 +No. 
Observations: 4786218 AIC: 1.627e+07 +Df Residuals: 4786215 BIC: 1.627e+07 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1760 0.001 1942.187 0.000 1.175 1.177 +bayes-corrected (q=0.75) valence -0.4762 0.001 -781.029 0.000 -0.477 -0.475 +totalvotes 0.5081 0.001 833.387 0.000 0.507 0.509 +============================================================================== +Omnibus: 2256599.632 Durbin-Watson: 1.761 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 62251699.550 +Skew: 1.700 Prob(JB): 0.00 +Kurtosis: 20.338 Cond. No. 1.12 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongeniality_stability_against_variation_in_weight__no_bayes_correction +``` +Independent Variables: ['valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.199 +Model: OLS Adj. R-squared: 0.199 +Method: Least Squares F-statistic: 5.941e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:47 Log-Likelihood: -8.2498e+06 +No. 
Observations: 4786218 AIC: 1.650e+07 +Df Residuals: 4786215 BIC: 1.650e+07 +Df Model: 2 +Covariance Type: nonrobust +============================================================================== + coef std err t P>|t| [0.025 0.975] +------------------------------------------------------------------------------ +const 1.1760 0.001 1897.046 0.000 1.175 1.177 +valence -0.3745 0.001 -601.728 0.000 -0.376 -0.373 +totalvotes 0.5306 0.001 852.573 0.000 0.529 0.532 +============================================================================== +Omnibus: 2293481.647 Durbin-Watson: 1.752 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 63398255.054 +Skew: 1.739 Prob(JB): 0.00 +Kurtosis: 20.487 Cond. No. 1.09 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Grouped Regression Analysis: Evidence_uncongeniality_robustness_analysis_on_person_level +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Grouped by: user_id +Aggregation methods: {'bayes-corrected (q=0.25) valence': 'mean', 'totalvotes': 'sum', 'number O(n+1)-replies': 'sum'} +Standardize: True +Report effect size: False +Print detailed coefficients: True +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.934 +Model: OLS Adj. R-squared: 0.934 +Method: Least Squares F-statistic: 9.416e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:47 Log-Likelihood: -7.2556e+05 +No. 
Observations: 133441 AIC: 1.451e+06 +Df Residuals: 133438 BIC: 1.451e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 42.4546 0.152 278.870 0.000 42.156 42.753 +bayes-corrected (q=0.25) valence -3.7792 0.152 -24.824 0.000 -4.078 -3.481 +totalvotes 208.8619 0.152 1371.920 0.000 208.564 209.160 +============================================================================== +Omnibus: 253518.796 Durbin-Watson: 1.996 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 10401556910.024 +Skew: 13.456 Prob(JB): 0.00 +Kurtosis: 1370.496 Cond. No. 1.01 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +``` +const: 42.4545829243 (CI: [ 42.1561998015, 42.7529660471]) +``` +``` +bayes-corrected (q=0.25) valence: -3.7791674082 (CI: [-4.0775562969, -3.4807785195]) +``` +``` +totalvotes: 208.8619281171 (CI: [ 208.5635392284, 209.1603170058]) +``` +Grouped Regression Analysis: Evidence_uncongeniality_robustness_analysis_on_section_level +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Grouped by: section +Aggregation methods: {'bayes-corrected (q=0.25) valence': 'mean', 'totalvotes': 'sum', 'number O(n+1)-replies': 'sum'} +Standardize: True +Report effect size: False +Print detailed coefficients: True +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.959 +Model: OLS Adj. 
R-squared: 0.955 +Method: Least Squares F-statistic: 268.1 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 1.16e-16 +Time: 09:31:47 Log-Likelihood: -334.54 +No. Observations: 26 AIC: 675.1 +Df Residuals: 23 BIC: 678.9 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 2.598e+05 1.95e+04 13.297 0.000 2.19e+05 3e+05 +bayes-corrected (q=0.25) valence -4.206e+04 1.98e+04 -2.129 0.044 -8.29e+04 -1190.281 +totalvotes 4.443e+05 1.98e+04 22.488 0.000 4.03e+05 4.85e+05 +============================================================================== +Omnibus: 25.218 Durbin-Watson: 1.928 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 48.147 +Skew: 1.909 Prob(JB): 3.51e-11 +Kurtosis: 8.465 Cond. No. 1.16 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +``` +const: 259811.1538461538 (CI: [ 219391.7551783801, 300230.5525139275]) +``` +``` +bayes-corrected (q=0.25) valence: -42061.3741960863 (CI: [-82932.4677944659, -1190.2805977067]) +``` +``` +totalvotes: 444292.7728500224 (CI: [ 403421.6792516428, 485163.8664484020]) +``` +Linear Regression Analysis: Evidence_uncongenialty_section_politics +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_politics +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.209 +Model: OLS Adj. 
R-squared: 0.209 +Method: Least Squares F-statistic: 1.708e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -2.1743e+06 +No. Observations: 1295105 AIC: 4.349e+06 +Df Residuals: 1295102 BIC: 4.349e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1182 0.001 981.264 0.000 1.116 1.120 +bayes-corrected (q=0.25) valence -0.3909 0.001 -341.822 0.000 -0.393 -0.389 +totalvotes 0.5079 0.001 444.124 0.000 0.506 0.510 +============================================================================== +Omnibus: 680589.819 Durbin-Watson: 1.782 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 49094495.451 +Skew: 1.699 Prob(JB): 0.00 +Kurtosis: 32.971 Cond. No. 1.09 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_affairs +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_foreign_affairs +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.237 +Model: OLS Adj. R-squared: 0.237 +Method: Least Squares F-statistic: 1.380e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.5539e+06 +No. 
Observations: 890221 AIC: 3.108e+06 +Df Residuals: 890218 BIC: 3.108e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1789 0.001 802.397 0.000 1.176 1.182 +bayes-corrected (q=0.25) valence -0.4979 0.001 -337.303 0.000 -0.501 -0.495 +totalvotes 0.5435 0.001 368.179 0.000 0.541 0.546 +============================================================================== +Omnibus: 415616.007 Durbin-Watson: 1.775 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 8567668.092 +Skew: 1.765 Prob(JB): 0.00 +Kurtosis: 17.782 Cond. No. 1.10 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_science +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_science +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.253 +Model: OLS Adj. R-squared: 0.253 +Method: Least Squares F-statistic: 9.746e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.0810e+06 +No. 
Observations: 575190 AIC: 2.162e+06 +Df Residuals: 575187 BIC: 2.162e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.6458 0.002 787.663 0.000 1.642 1.650 +bayes-corrected (q=0.25) valence -0.3951 0.002 -184.289 0.000 -0.399 -0.391 +totalvotes 0.7495 0.002 349.574 0.000 0.745 0.754 +============================================================================== +Omnibus: 194870.309 Durbin-Watson: 1.765 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1100608.449 +Skew: 1.527 Prob(JB): 0.00 +Kurtosis: 9.050 Cond. No. 1.26 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_economy +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_economy +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.196 +Model: OLS Adj. R-squared: 0.196 +Method: Least Squares F-statistic: 7.576e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.0058e+06 +No. 
Observations: 620776 AIC: 2.012e+06 +Df Residuals: 620773 BIC: 2.012e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1396 0.002 734.230 0.000 1.137 1.143 +bayes-corrected (q=0.25) valence -0.3478 0.002 -223.518 0.000 -0.351 -0.345 +totalvotes 0.4695 0.002 301.664 0.000 0.466 0.473 +============================================================================== +Omnibus: 202475.900 Durbin-Watson: 1.799 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1088427.374 +Skew: 1.479 Prob(JB): 0.00 +Kurtosis: 8.773 Cond. No. 1.08 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_miscellaneous +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_miscellaneous +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.246 +Model: OLS Adj. R-squared: 0.246 +Method: Least Squares F-statistic: 7.921e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -8.1045e+05 +No. 
Observations: 485006 AIC: 1.621e+06 +Df Residuals: 485003 BIC: 1.621e+06 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.1141 0.002 602.981 0.000 1.110 1.118 +bayes-corrected (q=0.25) valence -0.4406 0.002 -237.533 0.000 -0.444 -0.437 +totalvotes 0.5508 0.002 296.904 0.000 0.547 0.554 +============================================================================== +Omnibus: 308614.044 Durbin-Watson: 1.795 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 33388300.741 +Skew: 2.187 Prob(JB): 0.00 +Kurtosis: 43.411 Cond. No. 1.09 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_culture +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_culture +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.243 +Model: OLS Adj. R-squared: 0.243 +Method: Least Squares F-statistic: 3.781e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -3.6290e+05 +No. 
Observations: 235911 AIC: 7.258e+05 +Df Residuals: 235908 BIC: 7.258e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 0.9173 0.002 395.396 0.000 0.913 0.922 +bayes-corrected (q=0.25) valence -0.3334 0.002 -142.771 0.000 -0.338 -0.329 +totalvotes 0.5075 0.002 217.346 0.000 0.503 0.512 +============================================================================== +Omnibus: 99947.806 Durbin-Watson: 1.805 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 886847.368 +Skew: 1.813 Prob(JB): 0.00 +Kurtosis: 11.779 Cond. No. 1.12 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_sports +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_sports +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.256 +Model: OLS Adj. R-squared: 0.256 +Method: Least Squares F-statistic: 3.965e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -3.4768e+05 +No. 
Observations: 230524 AIC: 6.954e+05 +Df Residuals: 230521 BIC: 6.954e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 0.8891 0.002 390.420 0.000 0.885 0.894 +bayes-corrected (q=0.25) valence -0.3918 0.002 -171.548 0.000 -0.396 -0.387 +totalvotes 0.4784 0.002 209.473 0.000 0.474 0.483 +============================================================================== +Omnibus: 109314.794 Durbin-Watson: 1.837 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1540320.347 +Skew: 1.926 Prob(JB): 0.00 +Kurtosis: 15.063 Cond. No. 1.08 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_mobility +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_mobility +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.198 +Model: OLS Adj. R-squared: 0.198 +Method: Least Squares F-statistic: 1.449e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -1.9705e+05 +No. 
Observations: 117051 AIC: 3.941e+05 +Df Residuals: 117048 BIC: 3.941e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.3476 0.004 353.887 0.000 1.340 1.355 +bayes-corrected (q=0.25) valence -0.3144 0.004 -80.973 0.000 -0.322 -0.307 +totalvotes 0.5090 0.004 131.111 0.000 0.501 0.517 +============================================================================== +Omnibus: 32287.766 Durbin-Watson: 1.796 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 111823.546 +Skew: 1.377 Prob(JB): 0.00 +Kurtosis: 6.917 Cond. No. 1.22 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_internet +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_internet +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.256 +Model: OLS Adj. R-squared: 0.256 +Method: Least Squares F-statistic: 2.267e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -2.1421e+05 +No. 
Observations: 131977 AIC: 4.284e+05 +Df Residuals: 131974 BIC: 4.284e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.0804 0.003 320.014 0.000 1.074 1.087 +bayes-corrected (q=0.25) valence -0.4040 0.003 -118.355 0.000 -0.411 -0.397 +totalvotes 0.5375 0.003 157.450 0.000 0.531 0.544 +============================================================================== +Omnibus: 54168.298 Durbin-Watson: 1.825 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 590918.640 +Skew: 1.674 Prob(JB): 0.00 +Kurtosis: 12.811 Cond. No. 1.16 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongenialty_section_health +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_health +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.257 +Model: OLS Adj. R-squared: 0.257 +Method: Least Squares F-statistic: 8576. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:48 Log-Likelihood: -86794. +No. 
Observations: 49462 AIC: 1.736e+05 +Df Residuals: 49459 BIC: 1.736e+05 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 1.3371 0.006 212.544 0.000 1.325 1.349 +bayes-corrected (q=0.25) valence -0.4685 0.006 -73.917 0.000 -0.481 -0.456 +totalvotes 0.6228 0.006 98.259 0.000 0.610 0.635 +============================================================================== +Omnibus: 17663.533 Durbin-Watson: 1.771 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 106942.347 +Skew: 1.595 Prob(JB): 0.00 +Kurtosis: 9.459 Cond. No. 1.13 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncongeniality_robustness_order1 +``` +Independent Variables: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order1 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.136 +Model: OLS Adj. R-squared: 0.136 +Method: Least Squares F-statistic: 3.982e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:50 Log-Likelihood: -6.2998e+06 +No. 
Observations: 5050120 AIC: 1.260e+07 +Df Residuals: 5050117 BIC: 1.260e+07 +Df Model: 2 +Covariance Type: nonrobust +==================================================================================================== + coef std err t P>|t| [0.025 0.975] +---------------------------------------------------------------------------------------------------- +const 0.6133 0.000 1636.095 0.000 0.613 0.614 +bayes-corrected (q=0.25) valence -0.2055 0.000 -548.027 0.000 -0.206 -0.205 +totalvotes 0.2575 0.000 686.512 0.000 0.257 0.258 +============================================================================== +Omnibus: 2832727.339 Durbin-Watson: 1.864 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 85433368.019 +Skew: 2.153 Prob(JB): 0.00 +Kurtosis: 22.684 Cond. No. 1.03 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_uncogeniality_model_with_seperate_upvotes_downvotes +``` +Independent Variables: ['upvotes', 'downvotes'] +Dependent Variable: number O(n+1)-replies +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +================================================================================= +Dep. Variable: number O(n+1)-replies R-squared: 0.194 +Model: OLS Adj. R-squared: 0.194 +Method: Least Squares F-statistic: 7.311e+05 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:51 Log-Likelihood: -1.0415e+07 +No. 
Observations: 6069971 AIC: 2.083e+07 +Df Residuals: 6069968 BIC: 2.083e+07 +Df Model: 2 +Covariance Type: nonrobust +============================================================================== + coef std err t P>|t| [0.025 0.975] +------------------------------------------------------------------------------ +const 1.1129 0.001 2037.629 0.000 1.112 1.114 +upvotes 0.0893 0.001 162.278 0.000 0.088 0.090 +downvotes 0.6433 0.001 1168.654 0.000 0.642 0.644 +============================================================================== +Omnibus: 3179849.625 Durbin-Watson: 1.812 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 138815450.026 +Skew: 1.836 Prob(JB): 0.00 +Kurtosis: 26.138 Cond. No. 1.13 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_preregistered_model +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.021 +Model: OLS Adj. R-squared: 0.021 +Method: Least Squares F-statistic: 5.020e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:51 Log-Likelihood: -221.34 +No. 
Observations: 2392896 AIC: 446.7 +Df Residuals: 2392894 BIC: 472.1 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1246 0.000 796.722 0.000 0.124 0.125 +mean bayes-corrected (q=0.25) valence of replies -0.0351 0.000 -224.063 0.000 -0.035 -0.035 +============================================================================== +Omnibus: 426104.077 Durbin-Watson: 1.729 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 131391.270 +Skew: -0.336 Prob(JB): 0.00 +Kurtosis: 2.070 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_stability_against_variation_in_weight_q5 +``` +Independent Variables: ['mean bayes-corrected (q=0.5) valence of replies'] +Dependent Variable: bayes-corrected (q=0.5) valence +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +=========================================================================================== +Dep. Variable: bayes-corrected (q=0.5) valence R-squared: 0.027 +Model: OLS Adj. R-squared: 0.027 +Method: Least Squares F-statistic: 6.556e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:52 Log-Likelihood: 3.9215e+05 +No. 
Observations: 2392896 AIC: -7.843e+05 +Df Residuals: 2392894 BIC: -7.843e+05 +Df Model: 1 +Covariance Type: nonrobust +=================================================================================================================== + coef std err t P>|t| [0.025 0.975] +------------------------------------------------------------------------------------------------------------------- +const 0.1323 0.000 996.732 0.000 0.132 0.133 +mean bayes-corrected (q=0.5) valence of replies -0.0340 0.000 -256.042 0.000 -0.034 -0.034 +============================================================================== +Omnibus: 168653.316 Durbin-Watson: 1.726 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 107460.980 +Skew: -0.396 Prob(JB): 0.00 +Kurtosis: 2.328 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_stability_against_variation_in_weight_q75 +``` +Independent Variables: ['mean bayes-corrected (q=0.75) valence of replies'] +Dependent Variable: bayes-corrected (q=0.75) valence +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.75) valence R-squared: 0.032 +Model: OLS Adj. R-squared: 0.032 +Method: Least Squares F-statistic: 8.012e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:52 Log-Likelihood: 8.8112e+05 +No. 
Observations: 2392896 AIC: -1.762e+06 +Df Residuals: 2392894 BIC: -1.762e+06 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1411 0.000 1303.270 0.000 0.141 0.141 +mean bayes-corrected (q=0.75) valence of replies -0.0306 0.000 -283.054 0.000 -0.031 -0.030 +============================================================================== +Omnibus: 95205.666 Durbin-Watson: 1.729 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 102788.717 +Skew: -0.491 Prob(JB): 0.00 +Kurtosis: 2.742 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_stability_against_variation_in_weight_no_bayes_correction +``` +Independent Variables: ['mean valence of replies'] +Dependent Variable: valence +Data: data_order0 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================== +Dep. Variable: valence R-squared: 0.010 +Model: OLS Adj. R-squared: 0.010 +Method: Least Squares F-statistic: 2.337e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: -4.8218e+05 +No. 
Observations: 2392896 AIC: 9.644e+05 +Df Residuals: 2392894 BIC: 9.644e+05 +Df Model: 1 +Covariance Type: nonrobust +=========================================================================================== + coef std err t P>|t| [0.025 0.975] +------------------------------------------------------------------------------------------- +const 0.1158 0.000 604.951 0.000 0.115 0.116 +mean valence of replies -0.0293 0.000 -152.877 0.000 -0.030 -0.029 +============================================================================== +Omnibus: 785394.853 Durbin-Watson: 1.750 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 152455.997 +Skew: -0.323 Prob(JB): 0.00 +Kurtosis: 1.946 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_politics +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_politics +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.018 +Model: OLS Adj. R-squared: 0.018 +Method: Least Squares F-statistic: 1.166e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: 34045. +No. 
Observations: 621929 AIC: -6.809e+04 +Df Residuals: 621927 BIC: -6.806e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1305 0.000 449.326 0.000 0.130 0.131 +mean bayes-corrected (q=0.25) valence of replies -0.0314 0.000 -107.983 0.000 -0.032 -0.031 +============================================================================== +Omnibus: 78154.602 Durbin-Watson: 1.733 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 31765.731 +Skew: -0.357 Prob(JB): 0.00 +Kurtosis: 2.155 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_affairs +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_foreign_affairs +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.019 +Model: OLS Adj. R-squared: 0.019 +Method: Least Squares F-statistic: 8343. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: -43060. +No. 
Observations: 440260 AIC: 8.612e+04 +Df Residuals: 440258 BIC: 8.615e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1353 0.000 336.404 0.000 0.134 0.136 +mean bayes-corrected (q=0.25) valence of replies -0.0367 0.000 -91.341 0.000 -0.038 -0.036 +============================================================================== +Omnibus: 129058.321 Durbin-Watson: 1.735 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 32315.635 +Skew: -0.421 Prob(JB): 0.00 +Kurtosis: 1.974 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_science +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_science +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.028 +Model: OLS Adj. R-squared: 0.028 +Method: Least Squares F-statistic: 1.007e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: 27583. +No. 
Observations: 345534 AIC: -5.516e+04 +Df Residuals: 345532 BIC: -5.514e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.0723 0.000 190.132 0.000 0.072 0.073 +mean bayes-corrected (q=0.25) valence of replies -0.0381 0.000 -100.372 0.000 -0.039 -0.037 +============================================================================== +Omnibus: 59103.072 Durbin-Watson: 1.791 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 12955.369 +Skew: -0.052 Prob(JB): 0.00 +Kurtosis: 2.057 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_economy +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_economy +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.017 +Model: OLS Adj. R-squared: 0.017 +Method: Least Squares F-statistic: 5484. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:53 Log-Likelihood: 24023. +No. 
Observations: 316428 AIC: -4.804e+04 +Df Residuals: 316426 BIC: -4.802e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1474 0.000 369.619 0.000 0.147 0.148 +mean bayes-corrected (q=0.25) valence of replies -0.0295 0.000 -74.054 0.000 -0.030 -0.029 +============================================================================== +Omnibus: 28321.195 Durbin-Watson: 1.760 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 17536.904 +Skew: -0.450 Prob(JB): 0.00 +Kurtosis: 2.278 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_miscellaneous +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_miscellaneous +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.028 +Model: OLS Adj. R-squared: 0.028 +Method: Least Squares F-statistic: 6790. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -13916. +No. 
Observations: 235551 AIC: 2.784e+04 +Df Residuals: 235549 BIC: 2.786e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1362 0.001 257.499 0.000 0.135 0.137 +mean bayes-corrected (q=0.25) valence of replies -0.0436 0.001 -82.403 0.000 -0.045 -0.043 +============================================================================== +Omnibus: 52959.753 Durbin-Watson: 1.732 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 15867.344 +Skew: -0.409 Prob(JB): 0.00 +Kurtosis: 2.027 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_culture +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_culture +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.032 +Model: OLS Adj. R-squared: 0.032 +Method: Least Squares F-statistic: 3435. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -1315.0 +No. Observations: 102305 AIC: 2634. +Df Residuals: 102303 BIC: 2653. 
+Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1253 0.001 163.518 0.000 0.124 0.127 +mean bayes-corrected (q=0.25) valence of replies -0.0449 0.001 -58.610 0.000 -0.046 -0.043 +============================================================================== +Omnibus: 19234.419 Durbin-Watson: 1.748 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 5689.759 +Skew: -0.334 Prob(JB): 0.00 +Kurtosis: 2.057 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_sports +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_sports +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.032 +Model: OLS Adj. R-squared: 0.032 +Method: Least Squares F-statistic: 3344. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -6723.8 +No. 
Observations: 100071 AIC: 1.345e+04 +Df Residuals: 100069 BIC: 1.347e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1246 0.001 152.318 0.000 0.123 0.126 +mean bayes-corrected (q=0.25) valence of replies -0.0473 0.001 -57.827 0.000 -0.049 -0.046 +============================================================================== +Omnibus: 28267.899 Durbin-Watson: 1.740 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 6368.870 +Skew: -0.345 Prob(JB): 0.00 +Kurtosis: 1.975 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_mobility +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_mobility +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.024 +Model: OLS Adj. R-squared: 0.024 +Method: Least Squares F-statistic: 1726. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: 10825. +No. 
Observations: 69253 AIC: -2.165e+04 +Df Residuals: 69251 BIC: -2.163e+04 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1109 0.001 141.050 0.000 0.109 0.112 +mean bayes-corrected (q=0.25) valence of replies -0.0327 0.001 -41.551 0.000 -0.034 -0.031 +============================================================================== +Omnibus: 6922.840 Durbin-Watson: 1.814 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 2381.203 +Skew: -0.195 Prob(JB): 0.00 +Kurtosis: 2.179 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_internet +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_internet +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.028 +Model: OLS Adj. R-squared: 0.028 +Method: Least Squares F-statistic: 1805. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: -3477.5 +No. Observations: 63079 AIC: 6959. +Df Residuals: 63077 BIC: 6977. 
+Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1191 0.001 117.001 0.000 0.117 0.121 +mean bayes-corrected (q=0.25) valence of replies -0.0433 0.001 -42.490 0.000 -0.045 -0.041 +============================================================================== +Omnibus: 21454.701 Durbin-Watson: 1.721 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 4028.801 +Skew: -0.319 Prob(JB): 0.00 +Kurtosis: 1.939 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_section_health +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_health +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.043 +Model: OLS Adj. R-squared: 0.043 +Method: Least Squares F-statistic: 1211. +Date: Mon, 22 Jul 2024 Prob (F-statistic): 1.61e-259 +Time: 09:31:54 Log-Likelihood: -439.22 +No. 
Observations: 27005 AIC: 882.4 +Df Residuals: 27003 BIC: 898.9 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1074 0.001 71.776 0.000 0.104 0.110 +mean bayes-corrected (q=0.25) valence of replies -0.0521 0.001 -34.794 0.000 -0.055 -0.049 +============================================================================== +Omnibus: 6746.889 Durbin-Watson: 1.761 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 1324.435 +Skew: -0.197 Prob(JB): 2.53e-288 +Kurtosis: 1.989 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Linear Regression Analysis: Evidence_antagonism_robustness_order1 +``` +Independent Variables: ['mean bayes-corrected (q=0.25) valence of replies'] +Dependent Variable: bayes-corrected (q=0.25) valence +Data: data_order1 +Standardize: True +Report effect size: False +``` +``` + OLS Regression Results +============================================================================================ +Dep. Variable: bayes-corrected (q=0.25) valence R-squared: 0.057 +Model: OLS Adj. R-squared: 0.057 +Method: Least Squares F-statistic: 9.915e+04 +Date: Mon, 22 Jul 2024 Prob (F-statistic): 0.00 +Time: 09:31:54 Log-Likelihood: 2.1429e+05 +No. 
Observations: 1630262 AIC: -4.286e+05 +Df Residuals: 1630260 BIC: -4.286e+05 +Df Model: 1 +Covariance Type: nonrobust +==================================================================================================================== + coef std err t P>|t| [0.025 0.975] +-------------------------------------------------------------------------------------------------------------------- +const 0.1419 0.000 854.072 0.000 0.142 0.142 +mean bayes-corrected (q=0.25) valence of replies -0.0523 0.000 -314.877 0.000 -0.053 -0.052 +============================================================================== +Omnibus: 101738.374 Durbin-Watson: 1.753 +Prob(Omnibus): 0.000 Jarque-Bera (JB): 62821.338 +Skew: -0.351 Prob(JB): 0.00 +Kurtosis: 2.343 Cond. No. 1.00 +============================================================================== + +Notes: +[1] Standard Errors assume that the covariance matrix of the errors is correctly specified. +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_order0 +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.28634078314814315 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.31853427098636283 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12461005214018245 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09803757310470287 +Degrees of Freedom: 2392895 +Cohen's d: -0.28714996199978216 +T-statistic: -396.76675511778956 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q5 +``` +Variable 1: bayes-corrected (q=0.5) extremity +Variable 2: mean bayes-corrected (q=0.5) extremity of replies +Data: data_order0 +``` +``` +Mean of bayes-corrected (q=0.5) extremity: 0.2934997056888845 +Mean of mean bayes-corrected (q=0.5) extremity of replies: 0.31880240669265064 +Standard 
Deviation of bayes-corrected (q=0.5) extremity: 0.10366027656607042 +Standard Deviation of mean bayes-corrected (q=0.5) extremity of replies: 0.07259709613375841 +Degrees of Freedom: 2392895 +Cohen's d: -0.28275329909468133 +T-statistic: -394.7125869249032 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_stability_against_variation_in_weight_paired_ttest_q75 +``` +Variable 1: bayes-corrected (q=0.75) extremity +Variable 2: mean bayes-corrected (q=0.75) extremity of replies +Data: data_order0 +``` +``` +Mean of bayes-corrected (q=0.75) extremity: 0.3010823980840001 +Mean of mean bayes-corrected (q=0.75) extremity of replies: 0.32039106933723704 +Standard Deviation of bayes-corrected (q=0.75) extremity: 0.08248076963764756 +Standard Deviation of mean bayes-corrected (q=0.75) extremity of replies: 0.05223289636934443 +Degrees of Freedom: 2392895 +Cohen's d: -0.2796984844303324 +T-statistic: -391.6388789093796 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_stability_against_variation_in_weight_paired_ttest_bayes +``` +Variable 1: extremity +Variable 2: mean extremity of replies +Data: data_order0 +``` +``` +Mean of extremity: 0.2786279465660722 +Mean of mean extremity of replies: 0.33064022086792666 +Standard Deviation of extremity: 0.15566001726472525 +Standard Deviation of mean extremity of replies: 0.15685179947476463 +Degrees of Freedom: 2392895 +Cohen's d: -0.332863548235494 +T-statistic: -441.7826610833192 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_robustness_paired_ttest_order1 +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_order1 +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.29265411081901965 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.316766141686027 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.11701339959130957 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of 
replies: 0.09812627267575441 +Degrees of Freedom: 1630261 +Cohen's d: -0.2232935227954181 +T-statistic: -248.9875068375778 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_politics +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_politics +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.2747813213977206 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.31051648819461664 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1232411698734475 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09815038738028235 +Degrees of Freedom: 621928 +Cohen's d: -0.3207697725588003 +T-statistic: -224.4339595489235 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_foreign_affairs +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_foreign_affairs +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.30983360408913946 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.330913534598374 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1266220167440838 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.0994658270407316 +Degrees of Freedom: 440259 +Cohen's d: -0.18514479506979328 +T-statistic: -116.67457613500132 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_science +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_science +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.25732194943047365 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3019777399435376 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1187657730515952 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 
0.09498121080140695 +Degrees of Freedom: 345533 +Cohen's d: -0.4152747999524859 +T-statistic: -212.56678640514008 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_economy +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_economy +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.28753090601867964 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3206172046668397 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12269603857552688 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09544219919767762 +Degrees of Freedom: 316427 +Cohen's d: -0.3010114266220678 +T-statistic: -144.89599610520233 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_miscellaneous +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_miscellaneous +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.3045872088628839 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.33005502824126426 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12405998014131653 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09742991339150692 +Degrees of Freedom: 235550 +Cohen's d: -0.22832386975048508 +T-statistic: -97.05206575930157 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_culture +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_culture +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.2873043034163312 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3180681433274033 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12427097360816901 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.10041204122831116 
+Degrees of Freedom: 102304 +Cohen's d: -0.27231114070182555 +T-statistic: -77.26207861609845 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_sports +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_sports +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.30601102207250513 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.328439915246921 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12292708240128108 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.098047183761713 +Degrees of Freedom: 100070 +Cohen's d: -0.20172544463043993 +T-statistic: -55.9671976011527 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_mobility +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_mobility +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.25434099233474056 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.3002874727751491 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.1194543498720806 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09661488578718154 +Degrees of Freedom: 69252 +Cohen's d: -0.4229377864257918 +T-statistic: -93.24696971910268 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_internet +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_internet +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.30568494651578504 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.33706126033387757 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12135285517757544 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09268724557998224 +Degrees of Freedom: 63078 +Cohen's 
d: -0.2905871965145026 +T-statistic: -63.21801300923011 +P-value: 0.0 +``` +Paired TTest Analysis: Evidence_polarization_paired_ttest_extremity_health +``` +Variable 1: bayes-corrected (q=0.25) extremity +Variable 2: mean bayes-corrected (q=0.25) extremity of replies +Data: data_health +``` +``` +Mean of bayes-corrected (q=0.25) extremity: 0.286001211119296 +Mean of mean bayes-corrected (q=0.25) extremity of replies: 0.32344058185785135 +Standard Deviation of bayes-corrected (q=0.25) extremity: 0.12360412242902419 +Standard Deviation of mean bayes-corrected (q=0.25) extremity of replies: 0.09582069057098505 +Degrees of Freedom: 27004 +Cohen's d: -0.3385470290175242 +T-statistic: -48.09524752175683 +P-value: 0.0 +``` +Visualization: Fig_2a +``` +Data: data_order0 +Title: None +Creating Hexbin Plot +Variable X: bayes-corrected (q=0.25) valence +Variable Y: number O(n+1)-replies +X Axis Maximum: None +Y Axis Maximum: 40 +Trendline: True +Log Scaling: True +``` +Plot saved at results/Fig_2a.png + +![](../results/Fig_2a.png) + +Visualization: Fig_2b +``` +Data: data +Title: None +Creating Forest Plot +Regression Model Names: ['Evidence_uncongenialty_section_politics', 'Evidence_uncongenialty_section_foreign_affairs', 'Evidence_uncongenialty_section_science', 'Evidence_uncongenialty_section_economy', 'Evidence_uncongenialty_section_miscellaneous', 'Evidence_uncongenialty_section_culture', 'Evidence_uncongenialty_section_sports', 'Evidence_uncongenialty_section_mobility', 'Evidence_uncongenialty_section_internet', 'Evidence_uncongenialty_section_health'] +Coefficient Names: ['bayes-corrected (q=0.25) valence', 'totalvotes'] +X-Axis Minimum: -0.6 +X-Axis Maximum: None +Dotsize: 2 +``` +Plot saved at results/Fig_2b.png + +![](../results/Fig_2b.png) + +Visualization: Fig_2c +``` +Data: data_order0_with_minimum_one_vote +Title: None +Creating Heatmap +Axis Variables: ['upvotes', 'downvotes'] +Heat Variable: number O(n+1)-replies +Max Axis Values: [20, 20] +Min Axis Values: 
[0, 0] +Log Scaling: false +``` +Plot saved at results/Fig_2c.png + +![](../results/Fig_2c.png) + +Visualization: Fig_3a +``` +Data: data_order0 +Title: None +Creating Density Plot +Variable X: mean bayes-corrected (q=0.25) valence of replies +Variable Y: bayes-corrected (q=0.25) valence +Data Breakpoints: [0] +``` +Plot saved at results/Fig_3a.png + +![](../results/Fig_3a.png) + +Visualization: Fig_3b +``` +Data: data +Title: None +Creating Forest Plot +Regression Model Names: ['Evidence_antagonism_section_politics', 'Evidence_antagonism_section_foreign_affairs', 'Evidence_antagonism_section_science', 'Evidence_antagonism_section_economy', 'Evidence_antagonism_section_miscellaneous', 'Evidence_antagonism_section_culture', 'Evidence_antagonism_section_sports', 'Evidence_antagonism_section_mobility', 'Evidence_antagonism_section_internet', 'Evidence_antagonism_section_health'] +Coefficient Names: ['mean bayes-corrected (q=0.25) valence of replies'] +X-Axis Minimum: -0.1 +X-Axis Maximum: None +Dotsize: 2 +``` +Plot saved at results/Fig_3b.png + +![](../results/Fig_3b.png) + +Visualization: Fig_4a +``` +Data: data_order0 +Title: +Creating Violin Plot +Variable X: bayes-corrected (q=0.25) extremity +Variable Y: mean bayes-corrected (q=0.25) extremity of replies +X-Axis Label: +Y-Axis Label: Extremity value +``` +Plot saved at results/Fig_4a.png + +![](../results/Fig_4a.png) + +Visualization: Fig_4b +``` +Data: data +Title: None +Creating Forest Plot Paired TTest +Paired TTest Names: ['Evidence_polarization_paired_ttest_extremity_politics', 'Evidence_polarization_paired_ttest_extremity_affairs', 'Evidence_polarization_paired_ttest_extremity_science', 'Evidence_polarization_paired_ttest_extremity_economy', 'Evidence_polarization_paired_ttest_extremity_miscellaneous', 'Evidence_polarization_paired_ttest_extremity_culture', 'Evidence_polarization_paired_ttest_extremity_sports', 'Evidence_polarization_paired_ttest_extremity_mobility', 
'Evidence_polarization_paired_ttest_extremity_internet', 'Evidence_polarization_paired_ttest_extremity_health'] +X-Axis Minimum: -0.06 +X-Axis Maximum: None +Dotsize: 2 +``` +Plot saved at results/Fig_4b.png + +![](../results/Fig_4b.png) + +Visualization: Extended_Fig_1 +``` +Data: data +Title: +Creating Histogram Plot +Variable: totalvotes +X-Axis Limits: None +X-Axis Logarithmic Scaling: False +Y-Axis Logarithmic Scaling: True +``` +Plot saved at results/Extended_Fig_1.png + +![](../results/Extended_Fig_1.png) + + + + + \ No newline at end of file diff --git a/results_reports/analysis_report_manuscript.pdf b/results_reports/analysis_report_manuscript.pdf new file mode 100644 index 0000000..4903201 Binary files /dev/null and b/results_reports/analysis_report_manuscript.pdf differ diff --git a/src/ b/src/ new file mode 100644 index 0000000..8bfa2b1 --- /dev/null +++ b/src/ @@ -0,0 +1,2 @@ +"""Functions to handle data and perform analysis on Spiegel Online Data""" + diff --git a/src/ b/src/ new file mode 100644 index 0000000..4743c97 --- /dev/null +++ b/src/ @@ -0,0 +1,174 @@ +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.comparison_variance_in_and_between_group import ( + ComparisonVariance, +) +from src.analysis_functions.descriptive import DescriptiveAnalysis +from src.analysis_functions.regression import Regression +from src.analysis_functions.specific_analysis.increase_per_up_and_downvote import ( + InfluenceOfUpAndDownvotesOnReplies, +) +from src.analysis_functions.ttest import TTest +from src.analysis_functions.pearson_correlation import PearsonCorrelation + +from src.analysis_functions.visualization import DataVisualizer +from src.analysis_wrappers.comparison_variance_in_and_between_group_wrapper import ( + run_comparison_variance_in_and_between_group, +) +from src.analysis_wrappers.descriptive_wrapper import run_descriptive_analysis + +from src.analysis_wrappers.regression_wrapper import run_regression 
+from src.analysis_wrappers.pearson_correlation_wrapper import run_pearson_correlation +from src.analysis_wrappers.specific_analysis_wrappers.get_function_inverse_bayes_transformed_regression import ( + run_get_function_inverse_bayes_transformed_regression, +) +from src.analysis_wrappers.specific_analysis_wrappers.increase_per_up_and_downvote_wrapper import ( + run_report_influence_of_up_and_downvotes_on_replies, +) +from src.analysis_wrappers.ttest_wrapper import run_ttest +from src.analysis_wrappers.visualization_wrapper import run_visualization +from src.data_classes.parameters_analysis_comparison_variance_in_and_between_group import ( + ComparisonVarianceInAndBetweenGroupParameters, +) +from src.data_classes.parameters_analysis_get_function_inverse_bayes_transformed_regression import ( + GetFunctionInverseBayesTransformedRegressionParameters, +) +from src.data_classes.parameters_analysis_influence_of_up_and_downvotes import ( + InfluenceOfVotesParameters, +) + +from src.data_classes.parameters_analysis_regression import ( + BayesianRegressionParameters, + LinearRegressionParameters, + GroupedLinearRegressionParameters, +) +from src.data_classes.parameters_analysis_pearson_correlation import ( + PearsonCorrelationParameters, +) +from src.data_classes.parameters_analysis_ttest import ( + TTestParameters, + PairedTTestParameters, +) +from src.utils.helper_functions import FunctionData + + +def run_analyses(analyses: dict, preprocessed_datasets: dict) -> None: + """ + Orchestrates the execution of various statistical analyses and visualizations based on a configuration file. + + This function initializes analysis and visualization classes, assigns preprocessed data to analyses, + categorizes analyses by type, and sequentially executes them. It supports descriptive statistics, + regression analyses, t-tests, Pearson correlation analyses, comparison of variance, influence of up and downvotes, + and data visualization. 
The function ensures that the necessary data and results are passed between analyses + and visualizations as required. + + Parameters + ---------- + analyses : dict + A dictionary containing configurations for different categories of analyses + (descriptive, analysis, visualization) and their parameters. + preprocessed_datasets : dict + A dictionary mapping dataset names to their preprocessed forms. This data is used across various analyses. + + Raises + ------ + ValueError + If an unknown analysis type is encountered in the configuration. + """ + regression: Regression = Regression() + comparison_variance: ComparisonVariance = ComparisonVariance() + pearson_correlation: PearsonCorrelation = PearsonCorrelation() + ttest: TTest = TTest() + influence_up_and_downvotes: InfluenceOfUpAndDownvotesOnReplies = InfluenceOfUpAndDownvotesOnReplies() + + descriptive: DescriptiveAnalysis = DescriptiveAnalysis() + visualizer: DataVisualizer = DataVisualizer() + + for category in ["descriptive", "analysis", "visualization"]: + if category in analyses: + for analysis in analyses[category]: + = preprocessed_datasets[analysis.dataset] + + regression_analyses: list = [] + pearson_correlation_analyses: list = [] + ttest_analyses: list = [] + comparison_variance_in_and_between_group_analyses: list = [] + influence_up_and_downvotes_analyses: list = [] + get_function_inverse_bayes_transformed_regression_analyses: list = [] + + if "descriptive" in analyses: + run_descriptive_analysis(descriptive, analyses["descriptive"]) + + for analysis in analyses.get("analysis", []): + if isinstance( + analysis, + ( + LinearRegressionParameters, + GroupedLinearRegressionParameters, + BayesianRegressionParameters, + ), + ): + regression_analyses.append(analysis) + elif isinstance(analysis, ComparisonVarianceInAndBetweenGroupParameters): + comparison_variance_in_and_between_group_analyses.append(analysis) + elif isinstance(analysis, PearsonCorrelationParameters): + 
pearson_correlation_analyses.append(analysis) + elif isinstance(analysis, (TTestParameters, PairedTTestParameters)): + ttest_analyses.append(analysis) + elif isinstance(analysis, InfluenceOfVotesParameters): + influence_up_and_downvotes_analyses.append(analysis) + elif isinstance( + analysis, GetFunctionInverseBayesTransformedRegressionParameters + ): + get_function_inverse_bayes_transformed_regression_analyses.append(analysis) + else: + raise ValueError(f"Unknown analysis type for analysis: {analysis}") + + regression_results: dict[str, RegressionResults] = {} + if regression_analyses: + regression_results: dict[str, RegressionResults] = run_regression( + regression, regression_analyses + ) + + if comparison_variance_in_and_between_group_analyses: + run_comparison_variance_in_and_between_group( + comparison_variance, comparison_variance_in_and_between_group_analyses + ) + + if pearson_correlation_analyses: + run_pearson_correlation(pearson_correlation, pearson_correlation_analyses) + + ttest_results: dict[str, tuple[float, float, float]] = {} + if ttest_analyses: + ttest_results: dict[str, tuple[float, float, float]] = run_ttest( + ttest, ttest_analyses + ) + + if influence_up_and_downvotes_analyses: + if regression_results == {}: + raise ValueError( + "Regression results are required for the influence of up and downvotes analysis" + ) + run_report_influence_of_up_and_downvotes_on_replies( + influence_up_and_downvotes, + influence_up_and_downvotes_analyses, + regression_results, + ) + + functions: dict[str, FunctionData] = {} + if get_function_inverse_bayes_transformed_regression_analyses: + functions: dict[ + str, FunctionData + ] = run_get_function_inverse_bayes_transformed_regression( + get_function_inverse_bayes_transformed_regression_analyses, + regression_results, + ) + + if "visualization" in analyses: + run_visualization( + visualizer, + analyses["visualization"], + regression_results, + functions, + ttest_results, + ) diff --git a/src/analysis_functions/ 
b/src/analysis_functions/ new file mode 100644 index 0000000..e69de29 diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 0000000..2c88e24 --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,91 @@ +import os +from pathlib import Path + +import pandas as pd + +from src.data_loading_and_saving.print_and_save_results import print_and_save_result + + +class ComparisonVariance: + """ + A class used to compare the variance of a variable within groups to its variance between groups. + + Attributes: + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. + filepath: str + The directory where the result should be saved. + + Methods + ------- + compare_ingroup_intergroup_variance(data: pd.DataFrame, variable: str, group: str, name_save_file: Path): + This method compares the within-group variance to the between-group variance of the given variable + """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Constructs all the necessary attributes for the ComparisonVariance object. + + Parameters + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. + filepath: str + The directory where the result should be saved. + """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def compare_ingroup_intergroup_variance( + self, data: pd.DataFrame, variable: str, group: str, name_save_file: Path + ) -> None: + """ + Compares the variance within a group to the variance between groups for a specified variable.
+ + This method calculates the total variance of the variable, the average variance of the variable + within each group, and the variance of the variable between groups. It then prints and/or saves + these results based on the object's attributes. + + Parameters + ---------- + data : pd.DataFrame + The dataset containing the variable and group columns. + variable : str + The name of the column representing the variable for which variance is calculated. + group : str + The name of the column representing the groups. + name_save_file : Path + The name of the file to which the result will be saved (if save_result is True). + """ + total_var: float = data[variable].var() + + within_group_var: float = data.groupby(group)[variable].var().mean() + + between_group_var: float = total_var - within_group_var + + results: str = f""" + Total variance: {total_var} + Between-group variance: {between_group_var} + Within-group variance: {within_group_var} + """ + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + results, + name_save_file, + ) diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 0000000..9b4e47c --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,351 @@ +import os +from typing import Union, Optional +from pathlib import Path + +import numpy as np +import pandas as pd + +from src.data_classes.parameters_descriptive_overview import Metric +from src.data_loading_and_saving.print_and_save_results import print_and_save_result + + +class DescriptiveAnalysis: + """ + A class used to perform descriptive data analysis. + + Attributes + ---------- + print_result: bool + A flag used to indicate if the function should print the result to the standard output. + save_result: bool + A flag used to indicate if the function should save the result to a file. + filepath: str + The directory path where the result files will be saved if save_result is True. + ...
+ + Methods + ------- + create_descriptives_for_metrics(data, metrics, name_save_file, group_by=None): + Performs multiple descriptive analyses on the provided dataset and either prints or saves the result. + create_descriptive_aggregated_for_metrics(data, variables, aggregation_function, group_by, name_save_file): + Performs descriptive analysis on the provided dataset with aggregation and grouping, and either prints + or saves the result. + give_percentage_of_dataset_under_condition(data, variable, comparison, condition, name_save_file): + Calculates and prints/saves the percentage of the dataset that meets a specified condition. + _compute_metrics(group_name, data, metrics, group_column=None): + Helper method to compute specified metrics on the data. + _count_values(data, column): + Counts the non-null values in a specified column or in the dataframe if no column is specified. + _count_nonzero(data, column): + Counts the non-zero/True values in a specified column. + _count_unique(data, column): + Counts the unique values in a specified column. + _sum_values(data, column): + Sums the values in a specified column. + _mean_values(data, column): + Calculates the mean of the values in a specified column. + _std_dev(data, column): + Calculates the standard deviation of the values in a specified column. + """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Constructs all the necessary attributes for the DescriptiveAnalysis object. + + Parameters + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. + filepath: str + The directory where the result should be saved. 
+ """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def create_descriptives_for_metrics( + self, + data: pd.DataFrame, + metrics: list[Metric], + name_save_file: Path, + group_by: Optional[str] = None, + ) -> None: + """ + This method performs multiple descriptive analysis on the provided dataset and either prints or saves the result + + Parameters + ---------- + data: pandas.DataFrame + The input dataframe which contains the data. + metrics: MetricList + The list of metrics to be calculated. + name_save_file: str + The name of the file to which the result will be saved (if self.save_result is True). + group_by: str + The column to group by. This gives the option to perform the analysis on each unique group member. + + Returns + ------- + None + """ + dataframes: list = [] + result_dataframes: pd.DataFrame = self._compute_metrics("Total", data, metrics) + dataframes.append(result_dataframes) + + if group_by: + grouped_dataframe = data.groupby(group_by) + for group_name, group in grouped_dataframe: + result_dataframes: pd.DataFrame = self._compute_metrics( + group_name, group, metrics + ) + dataframes.append(result_dataframes) + + result_dataframes: pd.DataFrame = pd.concat(dataframes) + result_dataframes.set_index(result_dataframes.columns[0], inplace=True) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + result_dataframes, + name_save_file, + ) + + def create_descriptive_aggregated_for_metrics( + self, + data: pd.DataFrame, + variables: list[str], + aggregation_function: str, + group_by: str, + name_save_file: Path, + ) -> None: + """ + This method performs multiple descriptive analysis on the provided dataset with grouping. + + Parameters + ---------- + data : pd.DataFrame + The dataset to perform descriptive analysis on. + variables : list[str] + The list of columns we want information on. 
+ aggregation_function : str + The method to aggregate the data. Either 'sum' or 'mean'. + group_by : str + The column to group by. + name_save_file : str + The name for saving the result. + + Returns + ------- + None + """ + if aggregation_function == "sum": + if "Count" in variables: + variables.remove("Count") + grouped_data: pd.DataFrame = ( + data.groupby(group_by)[variables] + .sum() + .aggregate(["mean", "std", "max", "min"]) + ) + + count_data: pd.DataFrame = ( + data.groupby(group_by) + .size() + .reset_index(name="Count")["Count"] + .aggregate(["mean", "std", "max", "min"]) + ) + grouped_data: pd.DataFrame = pd.concat( + [count_data, grouped_data], axis=1 + ) + else: + grouped_data: pd.DataFrame = ( + data.groupby(group_by)[variables] + .sum() + .aggregate(["mean", "std", "max", "min"]) + ) + + elif aggregation_function == "mean": + if "Count" in variables: + variables.remove("Count") + grouped_data: pd.DataFrame = ( + data.groupby(group_by)[variables] + .mean() + .aggregate(["mean", "std", "max", "min"]) + ) + + count_data: pd.DataFrame = ( + data.groupby(group_by) + .size() + .reset_index(name="Count")["Count"] + .aggregate(["mean", "std", "max", "min"]) + ) + grouped_data: pd.DataFrame = pd.concat( + [count_data, grouped_data], axis=1 + ) + else: + grouped_data: pd.DataFrame = ( + data.groupby(group_by)[variables] + .mean() + .aggregate(["mean", "std", "max", "min"]) + ) + + else: + raise ValueError(f"Invalid aggregation: {aggregation_function}") + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + grouped_data, + name_save_file, + ) + + def give_percentage_of_dataset_under_condition( + self, + data: pd.DataFrame, + variable: str, + comparison: str, + condition: Union[int, float], + name_save_file: Path, + ) -> None: + """ + Calculates the percentage of the dataset that meets a specified condition and either prints or saves the result. 
+ + This method evaluates a condition on a specified column of the dataset and calculates the percentage of rows + that meet this condition. The result can be printed to the console or saved to a file, depending on the + object's attributes. + + Parameters + ---------- + data : pd.DataFrame + The dataset to evaluate the condition on. + variable : str + The column name on which the condition will be applied. + comparison : str + The type of comparison to perform. Valid options are "smaller" (<=), "larger" (>=), or "not" (!=). + condition : Union[int, float] + The value to compare against the data in the specified column. + name_save_file : Path + The path (including filename) where the result should be saved if saving is enabled. + + Returns + ------- + None + """ + total_data: int = len(data) + conditional_data: int = 0 + if comparison == "smaller": + conditional_data: int = len(data[data[variable] <= condition]) + elif comparison == "larger": + conditional_data: int = len(data[data[variable] >= condition]) + elif comparison == "not": + conditional_data: int = len(data[data[variable] != condition]) + else: + raise ValueError("Invalid comparison type") + + percentage: float = conditional_data / total_data * 100 + + percentage_result: str = ( + f"The percentage of the dataset under that condition is {percentage} %" + ) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + percentage_result, + name_save_file, + ) + + def _compute_metrics( + self, group_name, data: pd.DataFrame, metrics: list[Metric], group_column=None + ) -> pd.DataFrame: + """ + Computes specified metrics on the given dataset and returns the results in a DataFrame. + + This method iterates over a list of Metric objects, each defining an operation (e.g., count, sum, mean) and + a column on which the operation is to be performed. It constructs a result dictionary where each key is a + metric name with the operation and column, and the value is the result of the operation.
This dictionary is + then converted to a DataFrame and returned. + + Parameters + ---------- + group_name : str + The name of the group for which metrics are being computed. This is used as a prefix in the result + dictionary keys if not None. + data : pd.DataFrame + The dataset on which the metrics are to be computed. + metrics : list[Metric] + A list of Metric objects, each specifying an operation and a column. + group_column : str, optional + The name of the column by which the data was grouped, if any. This is used in the result dictionary keys. + + Returns + ------- + pd.DataFrame + A DataFrame containing the computed metrics, with each row representing the results for a group (if + group_name is not None) or the entire dataset. + """ + result_dict: dict = {group_column: group_name} if group_name is not None else {} + for metric in metrics: + operation: str = metric.operation + column: str = metric.column + if operation == "count": + result_dict[ + f'{column if column else "total"}_count' + ] = self._count_values(data, column) + elif operation == "count_nonzero": + result_dict[f"{column}_nonzero"] = self._count_nonzero(data, column) + elif operation == "count_unique": + result_dict[f"{column}_unique"] = self._count_unique(data, column) + elif operation == "sum": + result_dict[f"{column}_sum"] = self._sum_values(data, column) + elif operation == "mean": + result_dict[f"{column}_mean"] = self._mean_values(data, column) + elif operation == "std_dev": + result_dict[f"{column}_std_dev"] = self._std_dev(data, column) + + return pd.DataFrame(result_dict, index=[0]) + + @staticmethod + def _count_values(data: pd.DataFrame, column: str) -> int: + return len(data[column].dropna()) if column else len(data) + + @staticmethod + def _count_nonzero(data: pd.DataFrame, column: str) -> int: + data_type = data[column].dtype + if data_type == bool: + return len(data[data[column] == True]) + return len(data[data[column] != 0]) + + @staticmethod + def _count_unique(data: 
pd.DataFrame, column: str) -> int: + first_element = data[column].dropna().iloc[0] + if isinstance(first_element, (list, np.ndarray)): + flattend_entries = pd.Series( + [item for sublist in data[column] for item in sublist] + ) + return flattend_entries.nunique() + else: + return data[column].nunique() + + @staticmethod + def _sum_values(data: pd.DataFrame, column: str) -> Union[int, float]: + return data[column].sum() + + @staticmethod + def _mean_values(data: pd.DataFrame, column: str) -> Union[int, float]: + return data[column].mean() + + @staticmethod + def _std_dev(data: pd.DataFrame, column: str) -> Union[int, float]: + return data[column].std() diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 0000000..e167421 --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,94 @@ +import os +from pathlib import Path + +import pandas as pd +from scipy.stats import pearsonr + +from src.data_loading_and_saving.print_and_save_results import print_and_save_result + + +class PearsonCorrelation: + """ + A class used to calculate the Pearson's correlation coefficient between two variables. + + Attributes: + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. + filepath: str + The directory where the result should be saved. + + Methods + ------- + calculate_correlation(data: pd.DataFrame, first_group_name: str, second_group_name: str, name_save_file: Path): + This method calculates Pearson's correlation coefficient for the given columns + """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Constructs all the necessary attributes for the PearsonCorrelation object. + + Parameters + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. 
+ filepath: str + The directory where the result should be saved. + """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def calculate_correlation( + self, + data: pd.DataFrame, + first_group_name: str, + second_group_name: str, + name_save_file: Path, + ) -> None: + """ + This method calculates Pearson's correlation coefficient for the given columns + and either prints or saves the result based on the object properties. + + Parameters + ---------- + data: pd.DataFrame + The data frame containing the data. + first_group_name: str + The name of the first group. + second_group_name: str + The name of the second group. + name_save_file: Path + The name of the file where the result should be saved. + """ + data: pd.DataFrame = data[[first_group_name, second_group_name]].dropna() + group_1: pd.DataFrame = data[first_group_name] + group_2: pd.DataFrame = data[second_group_name] + + pearson_correlation: float + p_value: float + pearson_correlation, p_value = pearsonr(group_1, group_2) + + pearson_correlation_summary: str = ( + f"Pearson correlation between {first_group_name} and {second_group_name}: {pearson_correlation}\n" + f"P-value: {p_value}" + ) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + pearson_correlation_summary, + name_save_file, + ) diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 0000000..eafbb7f --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,502 @@ +import math +import os +from typing import Union +from pathlib import Path + +import numpy as np +import pandas as pd +from rpy2 import robjects as ro +from rpy2.robjects import pandas2ri, Formula +from sklearn.preprocessing import StandardScaler +from statsmodels import api as sm +from statsmodels.iolib.summary import Summary +from statsmodels.regression.linear_model import RegressionResults + +from 
src.data_loading_and_saving.print_and_save_results import print_and_save_result +from src.utils.handle_r_dependencies import BAS + + +class Regression: + """ + A class used to perform Linear and Bayesian regression analyses. + + ... + + Attributes + ---------- + print_result: bool + A flag used to indicate if the function should print the result to the standard output. + save_result: bool + A flag used to indicate if the function should save the result to a file. + filepath: str + The directory path where the result files will be saved if save_result is True. + + Methods + ------- + linear_regression(data, x_vector, y, standardize, name_save_file) -> RegressionResults: + Performs an OLS (Ordinary Least Squares) linear regression on the provided dataset. + linear_regression_grouped(data, x_vector, y, dictionary_aggregation_methods_for_data_columns, + column_to_group_by, standardize, name_save_file) -> RegressionResults: + Performs Linear Regression on the data grouped by the given column as per the aggregation dictionary. + predict_percentage_increase_between_liner_model_points(model, data_point_1, data_point_2) -> float: + Calculates the percentage increase in fitted values between two different data points using a given model. + report_effect_size_of_model(model, name_save_file) -> None: + Computes and reports the effect size of the model + (r-squared, Cohen's f, and equivalent Cohen's d for the regression model). + bayesian_regression(data, x_vector, y, name_save_file) -> None: + Performs a Bayesian linear regression on the provided dataset. + _create_dataframe_of_bayesian_regression(summary, x_vector) -> pd.DataFrame: + Transforms the provided Bayesian regression summary into a DataFrame. + _clean_names(name) -> str: + Cleans the input string by replacing certain characters with underscores or removing them. + _clean_column_names(dataframe) -> pd.DataFrame: + Cleans the DataFrame column names using the "_clean_names" method. 
+ _get_clean_x_vector(x_vector) -> list[str]: + Cleans the elements of the input list using the "_clean_names" method. + """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Initializes the Regression object with the provided parameters. + + Parameters + ---------- + print_result: bool, optional + A flag to indicate if the function should print the result to the standard output. Default is True. + save_result: bool, optional + A flag to indicate if the function should save the result to a file. Default is True. + filepath: str, optional + The directory path where the result files will be saved if save_result is True. Default is "results/". + """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def linear_regression( + self, + data: pd.DataFrame, + x_vector: list[str], + y: str, + standardize: bool = False, + report_effect_size: bool = False, + name_save_file: Path = "", + ) -> RegressionResults: + """ + This method performs Ordinary Least Squares (OLS) Linear Regression using the given parameters + and either prints or saves the result based on the Regression object properties. + + Parameters + ---------- + data: pandas.DataFrame + The input dataframe which contains the data. + x_vector: list + The list of column names to be used as independent variables in the regression model. + y: str + The column name to be used as dependent variable in the regression model. + standardize: bool + If the independent variables should be standardized before fitting for better comparison. + report_effect_size: bool + If the effect size of the model should be reported (r-squared, Cohen's f, and equivalent Cohen's d). + name_save_file: str + The name of the file to which the result will be saved (if self.save_result is True). 
+ + Returns + ------- + model: RegressionResults + The fitted regression model. + """ + data: pd.DataFrame = data.dropna(subset=x_vector) + data: pd.DataFrame = data.dropna(subset=y) + + x_vector_data: pd.DataFrame = data[x_vector] + y_data: pd.Series = data[y] + + if standardize: + scaler: StandardScaler = StandardScaler() + x_vector_data_standardized: np.ndarray = scaler.fit_transform(x_vector_data) + x_vector_data: pd.DataFrame = pd.DataFrame( + x_vector_data_standardized, columns=x_vector + ) + x_vector_data.set_index(y_data.index, inplace=True) + + x_vector_data: pd.DataFrame = sm.add_constant(x_vector_data) + + model: RegressionResults = sm.OLS(y_data, x_vector_data).fit() + + linear_regression_summary: Summary = model.summary() + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + linear_regression_summary, + name_save_file, + ) + + if report_effect_size: + name_save_file_effect_size: Path = name_save_file.with_stem( + name_save_file.stem + "_effect_size" + ) + self._report_effect_size_of_model(model, name_save_file_effect_size) + + return model + + def linear_regression_grouped( + self, + data: pd.DataFrame, + x_vector: list[str], + y: str, + dictionary_aggregation_methods_for_data_columns: dict[str, str], + column_to_group_by: str, + standardize: bool = False, + report_effect_size: bool = False, + print_detailed_coefficients: bool = False, + name_save_file: Path = "", + ) -> RegressionResults: + """ + This method groups the data by column_to_group_by, aggregates each column as specified in the + aggregation dictionary, and then performs OLS Linear Regression on the aggregated data. + + Parameters + ---------- + data: pandas.DataFrame + The input dataframe which contains the data. + x_vector: list + The list of column names to be used as independent variables in the regression model. + y: str + The column name to be used as dependent variable in the regression model.
+ dictionary_aggregation_methods_for_data_columns: dict + The dictionary specifying how to aggregate each column before regression. + The keys of the dictionary should include all columns in X and y. + column_to_group_by: str + The column name to group the data by before aggregating. + name_save_file: str + The name of the file to which the result will be saved (if self.save_result is True). + standardize: bool + If the regression should be performed standardized instead to return beta factors. + report_effect_size: bool + If the effect size of the model should be reported (r-squared, Cohen's f, and equivalent Cohen's d). + print_detailed_coefficients: bool + If the coefficients should be printed separately with 10 point float accuracy and 95% CI. + + Returns + ------- + model: RegressionResults + The fitted regression model. + """ + + all_cols = set(x_vector + [y]) + if not all_cols.issubset( + set(dictionary_aggregation_methods_for_data_columns.keys()) + ): + raise ValueError( + "dictionary_aggregation_methods_for_data_columns should contain all columns from X and y." + ) + + if column_to_group_by not in data.columns: + raise ValueError( + f"'{column_to_group_by}' column to group by not found in data." 
+ ) + + grouped_data = ( + data.groupby(column_to_group_by) + .agg(dictionary_aggregation_methods_for_data_columns) + .reset_index() + ) + + model = self.linear_regression( + data=grouped_data, + x_vector=x_vector, + y=y, + standardize=standardize, + report_effect_size=report_effect_size, + name_save_file=name_save_file, + ) + + if print_detailed_coefficients: + coefficients = model.params + confidence_intervals = model.conf_int() + for name, coefficient in coefficients.items(): + confidence_interval = confidence_intervals.loc[name] + detailed_coefficient_information: str = f"{name}: {coefficient: .10f} (CI: [{confidence_interval[0]: .10f}, {confidence_interval[1]: .10f}])" + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + detailed_coefficient_information, + name_save_file, + ) + + return model + + @staticmethod + def predict_percentage_increase_between_liner_model_points( + model: RegressionResults, + data_point_1: Union[list, np.array], + data_point_2: Union[list, np.array], + ) -> float: + """ + Calculate the percentage increase in fitted values between two different data points using a given model. + + Parameters + ---------- + model: statsmodels.regression.linear_model.OLS + The fitted regression model. + data_point_1: list or numpy.array + The first data point (a vector of X values). + data_point_2: list or numpy.array + The second data point (a vector of X values). + + Returns + ------- + percentage_increase: float + The percentage increase in fitted value from data_point1 to data_point2. 
+ """ + if not isinstance(data_point_1, pd.Series): + data_point_1 = pd.Series(data_point_1, index=model.model.exog_names[1:]) + if not isinstance(data_point_2, pd.Series): + data_point_2 = pd.Series(data_point_2, index=model.model.exog_names[1:]) + + if len(data_point_1) != len(data_point_2): + raise ValueError("Data points must have the same dimension.") + if len(data_point_1) != len(model.params) - 1: + raise ValueError("Dimensions of data points and model do not match.") + + x_vector_data: pd.DataFrame = pd.DataFrame( + [data_point_1, data_point_2],[1:] + ) + x_vector_data: pd.DataFrame = sm.add_constant(x_vector_data) + + predictions_for_both_datapoints: np.ndarray = model.predict(x_vector_data) + + percentage_increase: float = ( + (predictions_for_both_datapoints[1] - predictions_for_both_datapoints[0]) + / predictions_for_both_datapoints[0] + ) * 100 + + return percentage_increase + + def _report_effect_size_of_model( + self, model: RegressionResults, name_save_file: Path + ) -> None: + """ + Computes and reports the effect size of the model (r-squared, Cohen's f and equivalent Cohen's d + for the regression model) and either print or save it based on the Regression object properties. + + Parameters + ---------- + model: RegressionResults + The regression model results obtained from statsmodels regression. + name_save_file: Path + The name of the file to which the result will be saved if self.save_result is True. 
+ """ + r_squared: float = model.rsquared + f_squared: float = r_squared / (1 - r_squared) + cohens_f: float = math.sqrt(f_squared) + equivalent_cohens_d: float = 2 * cohens_f + + effect_size_model: str = f""" + R-Square R^2: {r_squared} + Cohen's f: {cohens_f} + Equivalent Cohen's d: {equivalent_cohens_d} + """ + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + effect_size_model, + name_save_file, + ) + + def bayesian_regression( + self, data: pd.DataFrame, x_vector: list[str], y: str, name_save_file: Path + ) -> None: + """ + This method performs Bayesian Regression using the given parameters and either + prints or saves the result based on the Regression object properties. + + The results can be interpreted in the following way: + - `P(B != 0 | Y)`: This column gives the probability that the coefficient + associated with the factor (the B-value) is not zero given the data Y. + This is a way of measuring the relevance and significance of the predictive factor. + The higher the score, the more likely it is that the factor has an impact on your outcome variable. + - `Model 1, Model 2, Model 3, etc`: These columns represent different models that have been built. + The values in these columns are the specific coefficient values for each factor within each model. + - `BF`: BF is short for Bayes Factor, which indicates the strength of evidence for a model as + compared to an alternative model. Higher BF values indicate stronger evidence for a model. + - `PostProbs`: Posterior probabilities for each model. These add up to 1 across all models and + give the probability of each model being the "best" model given the data. + - `R2`: Shows the proportion of the variance for a dependent variable that's explained by + independent variables in the model. Close to 1 indicates the model explains a large amount of the variance. + - `dim`: The dimensionality of the model, which is basically the number of parameters in each model.
+ - `logmarg`: It is the logarithm of the marginal likelihood for each model. + This is a factor that is used in determining the posterior probabilities of each model. + + Parameters + ---------- + data: pandas.DataFrame + The input dataframe which contains the data. + x_vector: list + The list of column names to be used as independent variables in the regression model. + y: str + The column name to be used as dependent variable in the regression model. + name_save_file: str + The name of the file to which the result will be saved (if self.save_result is True). + + Returns + ------- + None + """ + data: pd.DataFrame = data.dropna(subset=x_vector) + + pandas2ri.activate() + + data: pd.DataFrame = self._clean_column_names(data) + x_vector: list[str] = self._get_clean_x_vector(x_vector) + y: str = self._clean_names(y) + + formula: Formula = Formula(y + " ~ " + " + ".join(x_vector)) + + data_r = pandas2ri.py2rpy(data) + + bayesian_model = BAS.bas_lm( + formula=formula, + data=data_r, + method="MCMC", + prior="ZS-null", + modelprior=BAS.uniform(), + ) + + bayesian_regression_summary: np.ndarray = ro.r["summary"](bayesian_model) + + try: + dataframe_bayesian_regression_summary = ( + self._create_dataframe_of_bayesian_regression( + bayesian_regression_summary, x_vector + ) + ) + except ValueError: + raise ValueError( + "The number of bayesian models does not match the expected values." + ) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + dataframe_bayesian_regression_summary, + name_save_file, + ) + + @staticmethod + def _create_dataframe_of_bayesian_regression( + summary: np.ndarray, x_vector: list[str] + ) -> pd.DataFrame: + """ + Create a DataFrame from the summary of Bayesian regression results. + + Parameters + ---------- + summary: np.ndarray + The Bayesian regression summary obtained from performing a Bayesian linear regression. 
+ x_vector: list[str] + The list of column names used as independent variables in the Bayesian regression model. + + Returns + ------- + pd.DataFrame + A DataFrame that presents the Bayesian regression's summary, including the probability + of the regression coefficients not being zero, model number, intercept, independent variables, + Bayes Factor (BF), posterior probabilities (PostProbs), R squared (R2), dimension (dim), + and log marginal likelihood (logmarg). + """ + base_columns: list[str] = ["P(B != 0 | Y)"] + model_columns: list[str] = [ + "model {}".format(i + 1) for i in range(summary.shape[1] - 1) + ] + columns: list[str] = base_columns + model_columns + + index: list[str] = [ + "Intercept", + *x_vector, + "BF", + "PostProbs", + "R2", + "dim", + "logmarg", + ] + + dataframe_summary: pd.DataFrame = pd.DataFrame( + summary, columns=columns, index=index + ) + + return dataframe_summary + + @staticmethod + def _clean_names(name: str) -> str: + """ + This method performs cleaning of individual name (could be column name or value in X). + Cleaning includes replacing certain characters with underscores or removing them. + + Parameters + ---------- + name: str + Original name (un-cleaned). + + Returns + ------- + str + Cleaned up name. + """ + return ( + name.replace(" ", "_") + .replace("(", "") + .replace(")", "") + .replace("-", "_") + .replace("+", "_") + .replace("=", "") + .replace(".", "") + ) + + def _clean_column_names(self, dataframe: pd.DataFrame) -> pd.DataFrame: + """ + This method performs cleaning of DataFrame column names using the method clean_names. + It cleans a deep copy to not alter the original dataframe. + + Parameters + ---------- + dataframe: pd.DataFrame + DataFrame with original (un-cleaned) column names. + + Returns + ------- + pd.DataFrame + DataFrame with cleaned column names. 
+ """ + dataframe: pd.DataFrame = dataframe.copy() + dataframe.columns = [self._clean_names(col) for col in dataframe.columns] + return dataframe + + def _get_clean_x_vector(self, x_vector: list[str]) -> list[str]: + """ + This method uses clean_names to clean the individual elements/column names in x_vector. + + Parameters + ---------- + x_vector: list[str] + List with original (un-cleaned) elements. + + Returns + ------- + list[str] + List where all elements have been cleaned. + """ + return [self._clean_names(x) for x in x_vector] diff --git a/src/analysis_functions/specific_analysis/ b/src/analysis_functions/specific_analysis/ new file mode 100644 index 0000000..e69de29 diff --git a/src/analysis_functions/specific_analysis/ b/src/analysis_functions/specific_analysis/ new file mode 100644 index 0000000..b3347b3 --- /dev/null +++ b/src/analysis_functions/specific_analysis/ @@ -0,0 +1,57 @@ +from collections import namedtuple +from typing import NamedTuple + +import pandas as pd +from statsmodels.regression.linear_model import RegressionResults + +from src.utils.helper_functions import FunctionData + + +def get_function_inverse_bayes_transformed_regression(data: pd.DataFrame, model: RegressionResults) -> FunctionData: + """ + Calculate and return a function representing the inverse Bayes transformed regression. + + This function constructs a parameter tuple for the regression function, and then returns + a FunctionData object containing the regression + function and its parameters. + + Parameters + --------- + data: pd.DataFrame + The dataset containing the 'bayes-corrected (q=0.25) valence' column. + model: RegressionResults + The regression model results from which the gradient and intercept parameters are extracted. + + Returns + ------- + FunctionData: An object containing the regression function and its parameters. 
+ """ + average_valence: float = data["bayes-corrected (q=0.25) valence"].mean() + + Param = namedtuple( + "Param", + [ + "average_valence", + "gradient_valence", + "gradient_totalvotes", + "intercept", + ], + ) + + def function(x: float, y: float, parameters: Param) -> float: + return ( + parameters.gradient_valence + * (-1 * (x / (x + y) + parameters.average_valence / (x + y)) - 0.5) + + parameters.gradient_totalvotes * (x + y) + + parameters.intercept + ) + + params: NamedTuple = Param( + average_valence=average_valence, + gradient_valence=model.params.iloc[1], + gradient_totalvotes=model.params.iloc[2], + intercept=model.params.iloc[0], + ) + + function_data: FunctionData = FunctionData(function, params) + return function_data diff --git a/src/analysis_functions/specific_analysis/ b/src/analysis_functions/specific_analysis/ new file mode 100644 index 0000000..a843859 --- /dev/null +++ b/src/analysis_functions/specific_analysis/ @@ -0,0 +1,168 @@ +import os +from pathlib import Path +from typing import Union, Optional + +import pandas as pd +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.regression import Regression +from src.data_loading_and_saving.print_and_save_results import print_and_save_result +from src.utils.helper_functions import ( + calculate_inverse_bayes_correction, + transform_to_bayes_corrected_valence, +) + + +class InfluenceOfUpAndDownvotesOnReplies: + """ + A class used to perform a dataset specific analysis on the influence of upvotes and downvotes + on the number of replies based on a linear regression model for totalvotes and valence + + Attributes + ---------- + print_result: bool + A flag used to indicate if the function should print the result to the standard output. + save_result: bool + A flag used to indicate if the function should save the result to a file. + filepath: str + The directory path where the result files will be saved if save_result is True. + ... 
+ + Methods + ------- + report_increase_per_up_and_downvote_from_totalvotes_and_valence + (data: pd.DataFrame, weight_as_distribution_quantile: bool, weight_m: float, model: RegressionResults, + step: list, startpoint: Union[str, list], name_save_file: Optional[Path]) -> None: + Gives the % increase in reply likelihood for a given step in upvotes and downvotes + according to a linear regression model + """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Constructs all the necessary attributes for the InfluenceOfUpAndDownvotesOnReplies object. + + Parameters + ---------- + print_result: bool + A flag to determine if the result should be printed. + save_result: bool + A flag to determine if the result should be saved. + filepath: str + The directory where the result should be saved. + """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def report_increase_per_up_and_downvote_from_totalvotes_and_valence( + self, + data: pd.DataFrame, + weight_as_distribution_quantile: bool, + weight_m: float, + model: RegressionResults, + step: list = None, + startpoint: Union[str, list] = "average", + name_save_file: Optional[Path] = None, + ) -> None: + """ + Reports the percentage increase in reply likelihood for specified upvote and downvote steps + from a given startpoint, using a linear regression model. + + Parameters + ---------- + data : pd.DataFrame + The dataset containing 'totalvotes', 'valence', and other relevant metrics. + weight_as_distribution_quantile : bool + Determines if weight_m should be treated as a quantile value for weighting totalvotes. + weight_m : float + The weight factor or quantile value for calculating weighted measures. + model : RegressionResults + The fitted linear regression model used for prediction. 
+ step : list, optional + A list containing the step increase for upvotes and downvotes respectively. + startpoint : Union[str, list], optional + The starting point for calculation. Can be 'average' or a list of [totalvotes, bayes_corrected_valence]. + name_save_file : Optional[Path], optional + The name of the file to save the result to. + + Returns + ------- + None + """ + startpoint_totalvotes: int = 0 + startpoint_bayes_corrected_valence: int = 0 + average_valence: float = data["valence"].mean() + + if isinstance(startpoint, str) and startpoint.lower() == "average": + startpoint_totalvotes: float = data["totalvotes"].mean() + startpoint_bayes_corrected_valence: float = data[ + "bayes-corrected (q=0.25) valence" + ].mean() + + elif isinstance(startpoint, list): + startpoint_totalvotes: Union[int, float] = startpoint[0] + startpoint_bayes_corrected_valence: Union[int, float] = startpoint[1] + + else: + raise ValueError("startpoint must be a valid point or 'average'") + + if weight_as_distribution_quantile: + weight_m: float = data["totalvotes"].quantile(q=weight_m) + + non_bayes_corrected_valence: float = ( + calculate_inverse_bayes_correction( + bayes_corrected_value=startpoint_bayes_corrected_valence, + volume=startpoint_totalvotes, + weight_factor_m=weight_m, + average_measure=average_valence, + ) + ) + + downvote_equivalent_average_bayes_corrected_valence: float = ( + - (non_bayes_corrected_valence - 0.5) * startpoint_totalvotes + ) + + average_bayes_corrected_negativty_plus: float = transform_to_bayes_corrected_valence( + upvotes=startpoint_totalvotes + - downvote_equivalent_average_bayes_corrected_valence + + step[0], + downvotes=downvote_equivalent_average_bayes_corrected_valence + step[1], + average_valence=average_valence, + weight_factor_m=weight_m, + ) + + increase_per_step: float = ( + Regression().predict_percentage_increase_between_liner_model_points( + model=model, + data_point_1=[ + startpoint_bayes_corrected_valence, + startpoint_totalvotes, + ], +
data_point_2=[ + average_bayes_corrected_negativty_plus, + startpoint_totalvotes + step[0] + step[1], + ], + ) + ) + + result = f""" + your startpoint was: bayes-corrected-valence {startpoint_bayes_corrected_valence}, totalvotes {startpoint_totalvotes} + for a step with {step[0]} upvotes and {step[1]} downvotes increase + you obtain an endpoint of: bayes-corrected-valence {average_bayes_corrected_negativty_plus}, totalvotes {startpoint_totalvotes + step[0] + step[1]} + the increase in replies is {increase_per_step} % + """ + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + result, + name_save_file, + ) diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 0000000..eadf1fb --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,195 @@ +import os +import pandas as pd +import numpy as np +from scipy import stats +from scipy.stats import ttest_ind, ttest_rel +from pathlib import Path + +from src.data_loading_and_saving.print_and_save_results import print_and_save_result + + +class TTest: + """ + A class used to perform a T-test analysis. + + ... + + Attributes + ---------- + print_result: bool + A flag used to indicate if the function should print the result to the standard output. + save_result: bool + A flag used to indicate if the function should save the result to a file. + filepath: str + The directory path where the result files will be saved if save_result is True. + + Methods + ------- + perform_ttest(data, first_group_name, second_group_name, name_save_file) -> None: + Performs an independent samples T-test for the two specified groups in the provided dataset. + perform_paired_ttest(data, first_group_name, second_group_name, name_save_file) -> tuple[float, float, float]: + Performs a paired samples T-test on the two specified groups in the provided dataset. + Returns the mean difference with the 95% confidence interval.
+ """ + + def __init__( + self, + print_result: bool = True, + save_result: bool = True, + filepath: str = "results/", + ): + """ + Initializes the TTest object with the provided parameters. + + Parameters + ---------- + print_result: bool, optional + A flag to indicate if the function should print the result to the standard output. Default is True. + save_result: bool, optional + A flag to indicate if the function should save the result to a file. Default is True. + filepath: str, optional + The directory path where the result files will be saved if save_result is True. Default is "results/". + """ + self.print_result: bool = print_result + self.save_result: bool = save_result + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def perform_ttest( + self, + data: pd.DataFrame, + first_group_name: str, + second_group_name: str, + name_save_file: Path, + ) -> None: + """ + This method performs a T-test for the means of two independent samples of scores using + the given columns and either prints or saves the result based on the TTest object properties. + + Parameters + ---------- + data: pandas.DataFrame + The input dataframe which contains the data. + first_group_name: str + The name of the first column to be used in the t-test. + second_group_name: str + The name of the second column to be used in the t-test. + name_save_file: str + The name of the file to which the result will be saved (if self.save_result is True). 
+ + Returns + ------- + None + """ + group_1: pd.DataFrame = data[first_group_name].dropna() + group_2: pd.DataFrame = data[second_group_name].dropna() + + degrees_of_freedom: int = len(group_1) + len(group_2) - 2 + + t_statistic: float + p_value: float + t_statistic, p_value = ttest_ind(group_1, group_2) + + mean_group_1: float = group_1.mean() + mean_group_2: float = group_2.mean() + + standard_deviation_group_1: float = group_1.std() + standard_deviation_group_2: float = group_2.std() + + ttest_summary: str = ( + f"Mean of {first_group_name}: {mean_group_1}\n" + f"Mean of {second_group_name}: {mean_group_2}\n" + f"Standard Deviation of {first_group_name}: {standard_deviation_group_1}\n" + f"Standard Deviation of {second_group_name}: {standard_deviation_group_2}\n" + f"Degrees of Freedom: {degrees_of_freedom}\n" + f"T-statistic: {t_statistic}\n" + f"P-value: {p_value}" + ) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + ttest_summary, + name_save_file, + ) + + def perform_paired_ttest( + self, + data: pd.DataFrame, + first_group_name: str, + second_group_name: str, + name_save_file: Path, + ) -> tuple[float, float, float]: + """ + Performs a paired sample t-test and calculates the effect size (Cohen's d) using the given columns + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data of two related groups to be compared + first_group_name : str + The name of the first group (column) + second_group_name : str + the name of the second group (column) + name_save_file: Path + The path of the file to save the result in. 
+ + Returns + ------- + mean_difference : float + The mean difference between the two samples + confidence_interval[0] : float + The lower bound of the 95% confidence interval + confidence_interval[1] : float + The upper bound of the 95% confidence interval + """ + + data: pd.DataFrame = data[[first_group_name, second_group_name]].dropna() + group_1: pd.DataFrame = data[first_group_name] + group_2: pd.DataFrame = data[second_group_name] + degrees_of_freedom: int = len(group_1) - 1 + + t_statistic: float + p_value: float + t_statistic, p_value = ttest_rel(group_1, group_2) + + mean_group_1: float = group_1.mean() + mean_group_2: float = group_2.mean() + standard_deviation_group_1: float = group_1.std() + standard_deviation_group_2: float = group_2.std() + pooled_standard_deviation: float = np.sqrt( + (standard_deviation_group_1**2 + standard_deviation_group_2**2) / 2 + ) + cohens_d: float = (mean_group_1 - mean_group_2) / pooled_standard_deviation + + mean_difference: float = mean_group_1 - mean_group_2 + standard_error_difference: float = np.std(group_1 - group_2, ddof=1) / np.sqrt( + len(group_1) + ) + + confidence_interval: np.ndarray[float] = stats.t.interval( + 0.95, len(group_1) - 1, loc=mean_difference, scale=standard_error_difference + ) + + paired_ttest_summary: str = ( + f"Mean of {first_group_name}: {mean_group_1}\n" + f"Mean of {second_group_name}: {mean_group_2}\n" + f"Standard Deviation of {first_group_name}: {standard_deviation_group_1}\n" + f"Standard Deviation of {second_group_name}: {standard_deviation_group_2}\n" + f"Degrees of Freedom: {degrees_of_freedom}\n" + f"Cohen's d: {cohens_d}\n" + f"T-statistic: {t_statistic}\n" + f"P-value: {p_value}" + ) + + print_and_save_result( + self.print_result, + self.save_result, + self.filepath, + paired_ttest_summary, + name_save_file, + ) + + return mean_difference, confidence_interval[0], confidence_interval[1] diff --git a/src/analysis_functions/ b/src/analysis_functions/ new file mode 100644 index 
0000000..078ad60 --- /dev/null +++ b/src/analysis_functions/ @@ -0,0 +1,1352 @@ +import os +from pathlib import Path +from typing import Optional, Union, Tuple, Dict + +import matplotlib.pyplot as plt +import numpy as np +import pandas as pd +import seaborn as sns +from statsmodels.regression.linear_model import RegressionResults + +from src.data_loading_and_saving.save_plot import save_plot +from src.utils.helper_functions import FunctionData + + +class DataVisualizer: + """ + A class used to perform various visualizations. + + This class provides methods to create different types of plots including bar charts, stacked bar charts, + percentage stacked bar charts, count distribution plots, grouped histograms, simple scatter plots, heatmaps, + surface plots, contour plots, density plots, ridgeline plots, hexbin plots with optional trendline, box plots, + violin plots, and forest plots for regression model coefficients or paired t-test results. It supports saving + plots to files. + + Attributes + ---------- + save_plots : bool + A flag used to indicate if the function should save the plots to a file. + filepath : str + The directory path where the plot files will be saved if save_plots is True. + """ + + def __init__(self, save_plots: bool = True, filepath="results/"): + self.save_plots: bool = save_plots + self.filepath: str = filepath + if not os.path.isdir(filepath): + os.makedirs(filepath) + + def create_bar_chart( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: Optional[str] = None, + x_axis_label: Optional[str] = None, + y_axis_label: Optional[str] = None, + title: Optional[str] = None, + chart_orientation: str = "h", + name_save_file: Optional[Path] = None, + sort_order="ascending", + custom_order: Optional[list[str]] = None, + ) -> None: + """ + Creates a bar chart based on the provided DataFrame and plotting parameters. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to plot. 
+ variable_x_axis : str + The column name in `data` to be used as the x-axis variable. + variable_y_axis : Optional[str], optional + The column name in `data` to be used as the y-axis variable. If None, a count of occurrences is used. + x_axis_label : Optional[str], optional + The label for the x-axis. If None, `variable_x_axis` is used. + y_axis_label : Optional[str], optional + The label for the y-axis. If None and `variable_y_axis` is None, "Count" is used; otherwise, + `variable_y_axis` is used. + title : Optional[str], optional + The title of the chart. If None, a default title is generated. + chart_orientation : str, default "h" + The orientation of the chart. "h" for horizontal, any other value for vertical. + name_save_file : Optional[Path], optional + The path (including filename) where the chart should be saved. If None, the chart is not saved. + sort_order : str, default "ascending" + The order in which the bars are sorted. Can be "ascending", "descending", or "custom". + custom_order : Optional[list[str]], optional + Specifies a custom order for the x-axis categories. Only effective if `sort_order` is "custom". 
+ + Returns + ------- + None + """ + plt.figure(figsize=(10, 8)) + + x_axis_label: str = x_axis_label if x_axis_label else variable_x_axis + y_axis_label: str = ( + "Count" + if variable_y_axis is None + else y_axis_label + if y_axis_label + else variable_y_axis + ) + + if variable_y_axis is None: + data: pd.DataFrame = data.copy() + data["count"] = 1 + variable_y_axis: str = "count" + + title: str = ( + title if title else f"Bar Chart {variable_y_axis} by {variable_x_axis}" + ) + + aggregated_data: pd.DataFrame = ( + data.groupby(variable_x_axis)[variable_y_axis].sum().reset_index() + ) + + if sort_order == "ascending": + aggregated_data: pd.DataFrame = aggregated_data.sort_values( + by=variable_y_axis + ) + elif sort_order == "descending": + aggregated_data: pd.DataFrame = aggregated_data.sort_values( + by=variable_y_axis, ascending=False + ) + elif sort_order == "custom" and custom_order is not None: + data[variable_x_axis] = pd.Categorical( + data[variable_x_axis], categories=custom_order, ordered=True + ) + aggregated_data: pd.DataFrame = ( + data.groupby([variable_x_axis]).sum().unstack().fillna(0) + ) + + if chart_orientation == "h": + aggregated_data.plot( + x=variable_x_axis, y=variable_y_axis, kind="barh", legend=False + ) + plt.xlabel(y_axis_label) + plt.ylabel(x_axis_label) + else: + aggregated_data.plot( + x=variable_x_axis, y=variable_y_axis, kind="bar", legend=False + ) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_stacked_bar_chart( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: Optional[str], + hue: str, + x_axis_label: Optional[str] = None, + y_axis_label: Optional[str] = None, + title: Optional[str] = None, + chart_orientation: str = "h", + name_save_file: Optional[Path] = None, + sort_order: str = "descending", + custom_order: Optional[list[str]] = None, + ) -> None: + """ + Creates a stacked bar chart from the 
given DataFrame. + + This method generates a stacked bar chart visualizing the relationship between two variables, + with the option to categorize data based on a third variable (`hue`). The chart can be sorted + and customized in various ways, including orientation, sorting order, and custom sorting. + + Warning: This can take a very long time for large datasets due to sorting. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to be plotted. + variable_x_axis : str + The column name in `data` to be used as the x-axis variable. + variable_y_axis : Optional[str] + The column name in `data` to be used as the y-axis variable. If None, a count of occurrences is used. + hue : str + The column name to be used for categorizing the data into different sections of the stacked bar. + x_axis_label : Optional[str], optional + The label for the x-axis. If None, `variable_x_axis` is used. + y_axis_label : Optional[str], optional + The label for the y-axis. If None and `variable_y_axis` is None, "Count" is used; otherwise, + `variable_y_axis` is used. + title : Optional[str], optional + The title of the chart. If None, a default title is generated based on the provided parameters. + chart_orientation : str, default "h" + The orientation of the chart. "h" for horizontal, any other value for vertical. + name_save_file : Optional[Path], optional + The path (including filename) where the chart should be saved. If None, the chart is not saved. + sort_order : str, default "descending" + The order in which the bars are sorted. Can be "ascending", "descending", or "custom". + custom_order : Optional[list[str]], optional + Specifies a custom order for the x-axis categories. Only effective if `sort_order` is "custom". 
+ + Returns + ------- + None + """ + plt.figure(figsize=(10, 8)) + + x_axis_label: str = x_axis_label if x_axis_label else variable_x_axis + y_axis_label: str = ( + "Count" + if variable_y_axis is None + else y_axis_label + if y_axis_label + else variable_y_axis + ) + + if variable_y_axis is None: + data: pd.DataFrame = data.copy() + data["count"] = 1 + variable_y_axis: str = "count" + + title: str = ( + title + if title + else f"Stacked Bar Chart {variable_y_axis} by {variable_x_axis} stacked by {hue}" + ) + + aggregated_data: pd.DataFrame = ( + data.groupby([variable_x_axis, hue]).sum().unstack() + ) + + sum_of_each_group: pd.Series = aggregated_data.sum(axis=1) + + if sort_order == "ascending": + aggregated_data: pd.DataFrame = aggregated_data.loc[ + sum_of_each_group.sort_values().index + ] + elif sort_order == "descending": + aggregated_data: pd.DataFrame = aggregated_data.loc[ + sum_of_each_group.sort_values(ascending=False).index + ] + elif sort_order == "custom" and custom_order is not None: + data[variable_x_axis] = pd.Categorical( + data[variable_x_axis], categories=custom_order, ordered=True + ) + aggregated_data: pd.DataFrame = ( + data.groupby([variable_x_axis, hue]).sum().unstack().fillna(0) + ) + + if chart_orientation == "h": + aggregated_data.loc[:, variable_y_axis].plot( + kind="barh", stacked=True, legend=True + ) + plt.xlabel(y_axis_label) + plt.ylabel(x_axis_label) + else: + aggregated_data.loc[:, variable_y_axis].plot( + kind="bar", stacked=True, legend=True + ) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + + plt.title(title) + plt.legend(title=hue, title_fontsize="13", loc="best") + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_percentage_stacked_bar_chart( + self, + data: pd.DataFrame, + variable_x_axis: str, + variables_to_compare: list[str], + x_axis_label: Optional[str] = None, + y_axis_label: str = "Percentage", + title: Optional[str] = None, + chart_orientation: str = "h", + name_save_file: 
Optional[Path] = None, + sort_order: str = "ascending", + ) -> None: + """ + Creates a percentage stacked bar chart for the specified variables in a DataFrame. + + This method visualizes the distribution of the specified variables as a percentage of their total + across the categories defined by the `variable_x_axis`. It supports both horizontal and vertical + orientations and allows for sorting the data in ascending or descending order. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to be plotted. + variable_x_axis : str + The column name in `data` to be used as the x-axis variable. + variables_to_compare : list[str] + A list of column names whose values are to be compared and visualized as stacked percentages. + x_axis_label : Optional[str], optional + The label for the x-axis. If None, `variable_x_axis` is used as the label. Default is None. + y_axis_label : str, default "Percentage" + The label for the y-axis. Default is "Percentage". + title : Optional[str], optional + The title of the chart. If None, a default title is generated based on the provided parameters. + Default is None. + chart_orientation : str, default "h" + The orientation of the chart. "h" for horizontal, any other value for vertical. Default is "h". + name_save_file : Optional[Path], optional + The path (including filename) where the chart should be saved. If None, the chart is not saved. + Default is None. + sort_order : str, default "ascending" + The order in which the bars are sorted. Can be "ascending" or "descending". Default is "ascending". 
+ + Returns + ------- + None + """ + plt.figure(figsize=(10, 8)) + + x_axis_label: str = x_axis_label if x_axis_label else variable_x_axis + title: str = ( + title + if title + else f"Percentage Stacked Bar Chart {variable_x_axis} by {variables_to_compare}" + ) + + sum_data: pd.DataFrame = data.groupby(variable_x_axis)[ + variables_to_compare + ].sum() + + data_percentages: pd.DataFrame = ( + sum_data.div(sum_data.sum(axis=1), axis=0) * 100 + ) + + total_row: pd.Series = ( + data[variables_to_compare].sum() / data[variables_to_compare].sum().sum() + ) * 100 + = "Total" + + if sort_order == "ascending": + data_percentages: pd.DataFrame = data_percentages.loc[ + data_percentages.sum(axis=1).sort_values().index + ] + elif sort_order == "descending": + data_percentages: pd.DataFrame = data_percentages.loc[ + data_percentages.sum(axis=1).sort_values(ascending=False).index + ] + + data_percentages: pd.DataFrame = pd.concat( + [data_percentages, total_row.to_frame().T] + ) + + if chart_orientation == "h": + data_percentages.plot(kind="barh", stacked=True) + plt.xlabel(y_axis_label) + plt.ylabel(x_axis_label) + else: + data_percentages.plot(kind="bar", stacked=True) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + + plt.title(title) + plt.legend(title="Variable", title_fontsize="13", loc="best") + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + + def create_histogram( + self, + data: pd.DataFrame, + variable: str, + x_axis_limits: Optional[list[Union[int, float]]] = None, + x_axis_logarithmic_scaling: bool = False, + y_axis_logarithmic_scaling: bool = False, + x_axis_label: Optional[str] = None, + y_axis_label: Optional[str] = "Count", + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + ) -> None: + + """ + Creates a histogram from the given DataFrame. + + This method generates a histogram visualizing the distribution of values in a given column of the DataFrame. 
+ It supports optional logarithmic scaling of the x-axis, y-axis, or both to better visualize data with a wide + distribution range. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to be plotted. + variable : str + The column name in `data` for which the histogram will be plotted. + x_axis_limits : Optional[list[Union[int, float]]], optional + A list containing two elements [min, max] that define the limits of the x-axis. + If None, the limits are determined automatically. + x_axis_logarithmic_scaling : bool, optional + If True, applies logarithmic scaling to the x-axis. Useful for data with a wide range of values. + y_axis_logarithmic_scaling : bool, optional + If True, applies logarithmic scaling to the y-axis. Useful for data with a wide range of values. + x_axis_label : str, optional + The label for the x-axis. If None, defaults to "Number of {variable}". + y_axis_label : str, optional + The label for the y-axis. Defaults to "Count". + title : Optional[str], optional + The title of the plot. If None, defaults to "Histogram of {variable}". + name_save_file : Optional[Path], optional + The path (including filename) where the plot should be saved. If None, the plot is not saved.
+ + Returns + ------- + None + """ + plt.figure(figsize=(10, 8)) + + x_axis_label: str = x_axis_label if x_axis_label else f"Number of {variable}" + title: str = title if title else f"Histogram of {variable}" + + plt.hist(data[variable], bins=50) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + plt.title(title) + + if x_axis_limits: + plt.xlim(x_axis_limits) + + if x_axis_logarithmic_scaling: + plt.xscale("log") + + if y_axis_logarithmic_scaling: + plt.yscale("log") + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_count_distribution( + self, + data: pd.DataFrame, + variable: str, + x_axis_limits: Optional[list[Union[int, float]]] = None, + x_axis_logarithmic_scaling: bool = False, + y_axis_logarithmic_scaling: bool = False, + x_axis_label: str = "Number of Occurances", + y_axis_label: Optional[str] = None, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + ) -> None: + """ + Creates a distribution plot showing the frequency of unique values for a specified column in a DataFrame. + + This method visualizes how often unique values occur within a given column of the DataFrame. It can also apply + logarithmic scaling to the x-axis, y-axis, or both to better visualize data with a wide distribution range. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to plot. + variable : str + The name of the column in `data` for which the distribution of unique values will be plotted. + x_axis_limits : Optional[list[Union[int, float]]], optional + A list containing two elements [min, max] that define the limits of the x-axis. + If None, the limits are determined automatically. + x_axis_logarithmic_scaling : bool, optional + If True, applies logarithmic scaling to the x-axis. Useful for data with a wide range of values. + y_axis_logarithmic_scaling : bool, optional + If True, applies logarithmic scaling to the y-axis. Useful for data with a wide range of values. 
+ x_axis_label : str, optional + The label for the x-axis. Defaults to "Number of Occurances". + y_axis_label : Optional[str], optional + The label for the y-axis. If None, defaults to "Number of {variable}", where + {variable} is the column name specified in `variable`. + title : Optional[str], optional + The title of the plot. If None, defaults to "Distribution of Count over {variable}". + name_save_file : Optional[Path], optional + The file path (including the name) where the plot should be saved. If None, the plot is not saved. + + Returns + ------- + None + """ + y_axis_label: str = y_axis_label if y_axis_label else f"Number of {variable}" + title: str = title if title else f"Distribution of Count over {variable}" + + plt.figure(figsize=(10, 6)) + + uniques_per_column: pd.Series = data[variable].value_counts() + + uniques_count: pd.Series = uniques_per_column.value_counts() + + uniques_count: pd.Series = uniques_count.sort_index() + +, uniques_count.values) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + plt.title(title) + + if x_axis_limits: + plt.xlim(x_axis_limits) + + if x_axis_logarithmic_scaling: + plt.xscale("log") + + if y_axis_logarithmic_scaling: + plt.yscale("log") + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_grouped_histogram( + self, + data: pd.DataFrame, + group_by: str, + aggregation_column: str, + aggregation_function: str, + x_axis_label: Optional[str] = None, + y_axis_label: Optional[str] = None, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + bins: int = 10, + edgecolor: str = "black", + ): + """ + This method creates a histogram of values that are first grouped and then aggregated. + + Parameters + ---------- + data : DataFrame + The pandas DataFrame to create a histogram from. + group_by : str + The column in the DataFrame to group by. + aggregation_column : str + The column to aggregate. + aggregation_function : str + The string passed to the pandas agg function.
+ x_axis_label : str, default data[group_by] + The label for X-axis. + y_axis_label : str, default 'Count' + The label for Y-axis. + title : str, default 'Histogram of {agg_func} {group_by} per {agg_col}' + The title for the plot. + name_save_file : str, default None + The name to save the generated plot. + bins : int, default 10 + The number of bins to divide the data into. + edgecolor : str, default 'black' + The edge color for the bins in the histogram. + """ + + agg_data: pd.DataFrame = data.groupby(group_by)[aggregation_column].agg(aggregation_function) + + plt.figure(figsize=(10, 8)) + + x_axis_label: str = ( + x_axis_label if x_axis_label else f"{aggregation_function} {group_by}" + ) + y_axis_label: str = ( + y_axis_label if y_axis_label else f"Number of {aggregation_column}" + ) + title: str = ( + title + if title + else f"Histogram of {aggregation_function.title()} {aggregation_column.title()} per {group_by.title()}" + ) + + plt.hist(agg_data, bins=bins, edgecolor=edgecolor) + plt.xlabel(x_axis_label) + plt.ylabel(y_axis_label) + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_scatter_plot_simple( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: str, + title=None, + hue=None, + name_save_file: Optional[Path] = None, + ) -> None: + """ + Creates a simple scatter plot from the given DataFrame. + + This method generates a scatter plot visualizing the relationship between two variables. It supports + optional grouping by color through the 'hue' parameter and saving the plot to a file. + + Parameters + ---------- + data : pd.DataFrame + The DataFrame containing the data to be plotted. + variable_x_axis : str + The column name in `data` to be used as the x-axis variable. + variable_y_axis : str + The column name in `data` to be used as the y-axis variable. + title : str, optional + The title of the plot. If None, a default title is generated based on the x and y variables. 
+ hue : str, optional + The column name to be used for color encoding. This allows for the visualization of a third variable. + name_save_file : Optional[Path], optional + The path (including filename) where the plot should be saved. If None, the plot is not saved. + + Returns + ------- + None + """ + plt.figure(figsize=(10, 8)) + title: str = ( + title if title else f"Scatter Plot {variable_y_axis} vs {variable_x_axis}" + ) + sns.scatterplot(data=data, x=variable_x_axis, y=variable_y_axis, hue=hue) + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_heatmap( + self, + data: pd.DataFrame, + axis_variables: list[str], + heat_variable: str, + max_values_axes: list[Union[int, float]], + min_values_axes: list[Union[int, float]], + log_option: str, + name_save_file: Path, + ): + """ + This method creates a heatmap. + + Parameters + ---------- + data: pd.DataFrame + The input dataframe which contains the data. + axis_variables: list[str] + The two column names that span the heatmap. The first is used for the rows, the second for the columns. + heat_variable: str + The column name whose mean value per cell determines the color. + max_values_axes: list + The list of maximum values for each column in `axis_variables`. + Any row with a value higher than this for the respective column is excluded. + min_values_axes: list + The list of minimum values for each column in `axis_variables`. + Any row with a value less than this for the respective column is excluded. + log_option: str + Log-transform the data before plotting. One of four options: "false", "x_log", "y_log", "double_log". + "x_log" transforms the axis variables, "y_log" the heat variable, "double_log" both. + If an invalid value is supplied, nothing is log-transformed. + name_save_file: Path + The name of the file to which the plot will be saved (if self.save_plots is True).
+ + Returns + ------- + None + """ + data_filtered: pd.DataFrame = data.copy() + + for col, max_val, min_val in zip( + axis_variables, max_values_axes, min_values_axes + ): + data_filtered = data_filtered[ + (data_filtered[col] <= max_val) & (data_filtered[col] >= min_val) + ] + + if log_option == "x_log": + data_filtered[axis_variables] = np.log(data_filtered[axis_variables]) + elif log_option == "y_log": + data_filtered[heat_variable] = np.log(data_filtered[heat_variable]) + elif log_option == "double_log": + data_filtered[axis_variables] = np.log(data_filtered[axis_variables]) + data_filtered[heat_variable] = np.log(data_filtered[heat_variable]) + + heatmap_data = data_filtered.pivot_table( + index=axis_variables[0], + columns=axis_variables[1], + values=heat_variable, + aggfunc="mean", + ) + + heatmap_data.sort_index(axis=0, ascending=False, inplace=True) + + heatmap_data = heatmap_data.interpolate( + method="linear", limit_direction="both", axis=0 + ) + + plt.figure(figsize=(10, 8)) + sns.heatmap(heatmap_data, cmap="YlGnBu") + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def generate_surface_plot( + self, + function: FunctionData, + x_axis_maximum: Union[int, float], + y_axis_maximum: Union[int, float], + x_axis_label: str, + y_axis_label: str, + z_axis_label: str, + elevation_angle: Optional[Union[int, float]] = None, + azimuth_angle: Optional[Union[int, float]] = None, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + x_steps: int = 1000, + y_steps: int = 1000, + ) -> None: + """ + Generates a 3D surface plot for a function. + + Parameters + ---------- + function: FunctionData + The function to plot as well as the parameters. + x_axis_maximum: Union[int, float] + The range of x values to plot. + y_axis_maximum: Union[int, float] + The range of y values to plot. + x_axis_label: str + The label for the x-axis. + y_axis_label: str + The label for the y-axis. + z_axis_label: str + The label for the z-axis. 
+ elevation_angle: Optional[Union[int, float]], default None + The elevation angle of the view. If None, the matplotlib default is kept. + azimuth_angle: Optional[Union[int, float]], default None + The azimuth angle of the view in the x,y plane. If None, the matplotlib default is kept. + title: str, default None + The title of the plot. + name_save_file: str, default None + The name of the file to which the result will be saved. + x_steps: int + The number of grid points in the x direction + y_steps: int + The number of grid points in the y direction + """ + + x_range: np.ndarray = np.linspace(1, x_axis_maximum, x_steps) + y_range: np.ndarray = np.linspace(1, y_axis_maximum, y_steps) + + x, y = np.meshgrid(x_range, y_range) + z = function(x, y) + origin_z = function(1, 1) + + fig = plt.figure(figsize=(14, 8)) + ax = fig.add_subplot(111, projection="3d") + + # Set both angles in a single call: view_init resets any angle passed as None + # to its initial value, so two separate calls would discard the elevation. + if elevation_angle is not None or azimuth_angle is not None: + ax.view_init(elev=elevation_angle, azim=azimuth_angle) + + ax.plot_surface(x, y, z) + ax.set_xlabel(x_axis_label) + ax.set_ylabel(y_axis_label) + ax.set_zlabel(z_axis_label) + + title: str = title if title else f"3D surface plot of {function.func.__name__}" + plt.title(title) + + plt.subplots_adjust(left=0.0, right=1.0, bottom=0.0, top=1.0) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_contour_plot( + self, + function: FunctionData, + x_axis_maximum: Union[int, float], + y_axis_maximum: Union[int, float], + x_axis_label: str, + y_axis_label: str, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + x_steps: int = 1000, + y_steps: int = 1000, + ) -> None: + """ + Generates a contour plot for a function. + + Parameters + ---------- + function: FunctionData + The function to plot as well as the parameters. + x_axis_maximum: Union[int, float] + The maximum x value to plot. The grid starts at 1. + y_axis_maximum: Union[int, float] + The maximum y value to plot. The grid starts at 1. + x_axis_label: str + The label for the x-axis. + y_axis_label: str + The label for the y-axis. + title: str, default None + The title of the plot. + name_save_file: str, default None + The name of the file to which the result will be saved.
+ x_steps: int + The step size of the grid x direction + y_steps: int + The step size of the grid y direction + """ + x_range: np.ndarray = np.linspace(1, x_axis_maximum, x_steps) + y_range: np.ndarray = np.linspace(1, y_axis_maximum, y_steps) + + x, y = np.meshgrid(x_range, y_range) + z = function(x, y) + + fig = plt.figure() + ax = fig.add_subplot(111) + ax.contour(x, y, z) + ax.set_xlabel(x_axis_label) + ax.set_ylabel(y_axis_label) + + title: str = title if title else f"Contour plot of {function.func.__name__}" + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_density_plot( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: str, + data_breakpoints: list[Union[int, float]], + name_save_file: Optional[Path] = None, + title: Optional[str] = None, + group_names: Optional[list[str]] = None, + x_axis_limits: Optional[Tuple[float, float]] = None, + y_axis_limits: Optional[Tuple[float, float]] = None, + ) -> None: + """ + This method creates a density plot. + + Parameters + ---------- + data : pd.DataFrame + DataFrame that has the data. + variable_x_axis : str + The name of the column of the provided DataFrame that this method applies breakpoints to. + variable_y_axis : str + The name of the column of the DataFrame that is plotted in the density plot. + data_breakpoints : list[Union[int, float]] + List of breakpoints or bins where the dataset is split. + name_save_file : str, optional + The name of the file to which the plot will be saved (if self.save_plots is True). + title : str, optional + The title of the plot. + group_names : list[str], optional + A list of group names used to rename the boolean categories in the legend. + x_axis_limits : tuple[float, float], optional + The lower and upper limits of the x-axis. Format as (lower_limit, upper_limit). + y_axis_limits : tuple[float, float], optional + The lower and upper limits of the y-axis. Format as (lower_limit, upper_limit). 
+ + Returns + ------- + None + """ + data_temp: pd.DataFrame = data.copy() + data_breakpoints: list[Union[int, float]] = sorted(data_breakpoints) + + data_temp["group"] = pd.cut( + data_temp[variable_x_axis], bins=[-np.inf] + data_breakpoints + [np.inf] + ) + + fig, ax = plt.subplots(figsize=(10, 5)) + + for group, values in data_temp.groupby("group", observed=False): + sns.kdeplot( + values[variable_y_axis], + bw_adjust=0.5, + clip_on=False, + fill=True, + alpha=0.5, + linewidth=1.5, + ax=ax, + label=str(group), + ) + + plt.tight_layout() + + ax.legend() + + if title: + ax.set_title(title) + + if group_names: + legend = ax.legend_ + for t, l in zip(legend.texts, group_names): + t.set_text(l) + + if x_axis_limits: + ax.set_xlim(x_axis_limits) + if y_axis_limits: + ax.set_ylim(y_axis_limits) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_ridgeline_plot( + self, + data: pd.DataFrame, + x_axis_variable: str, + y_axis_variable: str, + data_breakpoints: list[Union[int, float]], + name_save_file: Optional[Path] = None, + title: Optional[str] = None, + group_names: Optional[list[str]] = None, + x_axis_limits: Optional[Tuple[float, float]] = None, + y_axis_limits: Optional[Tuple[float, float]] = None, + y_label: Optional[str] = None, + ) -> None: + """ + This method creates a ridgeline plot. + + Parameters + ---------- + data : pd.DataFrame + DataFrame that has the data. + x_axis_variable : str + The name of the column of the provided DataFrame that this method applies breakpoints to. + y_axis_variable : str + The name of the column of the DataFrame that is plotted in the ridgeline plot. + data_breakpoints : list[Union[int, float]] + List of breakpoints or bins where the dataset is split. + name_save_file : str, optional + The name of the file to which the plot will be saved (if self.save_plots is True). + title : str, optional + The title of the plot. 
+ group_names : list[str], optional + A list of group names used to rename the boolean categories in the legend. + x_axis_limits : tuple[float, float], optional + The lower and upper limits of the x-axis. Format as (lower_limit, upper_limit). + y_axis_limits : tuple[float, float], optional + The lower and upper limits of the y-axis. Format as (lower_limit, upper_limit). + y_label : str, optional + The label for the y-axis + + Returns + ------- + None + """ + data_temp: pd.DataFrame = data.copy() + data_breakpoints: list[Union[int, float]] = sorted(data_breakpoints) + + data_temp["group"] = pd.cut( + data_temp[x_axis_variable], bins=[-np.inf] + data_breakpoints + [np.inf] + ) + + g = sns.FacetGrid( + data_temp, row="group", hue="group", aspect=15, height=2, palette="tab10" + ) + + sns.kdeplot, + y_axis_variable, + bw_adjust=0.5, + clip_on=False, + fill=True, + alpha=1, + linewidth=1.5, + ) + + sns.kdeplot, y_axis_variable, clip_on=False, color="w", lw=2, bw_adjust=0.5 + ) +, y=0, lw=2, clip_on=False) + + def label(x, color, label): + ax = plt.gca() + ax.text( + 0, + 0.2, + label, + fontweight="bold", + fontsize=15, + color=color, + ha="left", + va="center", + transform=ax.transAxes, + ) + +, y_axis_variable) + + g.fig.subplots_adjust(hspace=-0.25) + + g.set_titles("") + g.set_ylabels("") + g.set(yticks=[]) + g.despine(bottom=True, left=True) + + plt.tight_layout() + + if title: + plt.title(title) + + if group_names: + g.set( + yticklabels=group_names[::-1] + ) # Reversed because of the order how seaborn plots + + if y_axis_limits: + g.set(ylim=y_axis_limits) +, y=y_axis_limits[0], lw=2, clip_on=False) + if x_axis_limits: + g.set(xlim=x_axis_limits) + + plt.xlabel(y_axis_variable, fontsize=15) + axes = g.axes.flatten() + middle_plot = len(axes) // 2 + if y_label is None: + axes[middle_plot].set_ylabel(x_axis_variable, fontsize=15) + else: + axes[middle_plot].set_ylabel(y_label, fontsize=15) + plt.tick_params(axis="both", which="major", labelsize=15) + 
plt.tick_params(axis="both", which="minor", labelsize=12) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_hexbin_plot( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: str, + x_axis_maximum: Union[int, float], + y_axis_maximum: Union[int, float], + trendline: bool = False, + log_scale: bool = False, + x_label: Optional[str] = None, + y_label: Optional[str] = None, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + ): + """ + Generates a hexbin plot with an optional trendline. + + Parameters + ---------- + data: DataFrame + The DataFrame object that contains the data. + + variable_x_axis: str + The column name in data for x values to plot. + + variable_y_axis: str + The column name in data for y values to plot. + x_axis_maximum: int + The range maximum of x values to plot. + y_axis_maximum: int + The range maximum of y values to plot. + trendline: bool, default False + Whether to plot a trend line. + log_scale: bool, default False + Changes the scaling of the hue of hexbins to logarithmic. + x_label: str, default None + The label for the x-axis. + y_label: str, default None + The label for the y-axis. + title: str, default None + The title of the plot. + name_save_file: str, default None + The name of the file to which the result will be saved. 
+ """ + + x_data: pd.DataFrame = data[variable_x_axis] + y_data: pd.DataFrame = data[variable_y_axis] + + bins = None + if log_scale: + bins = "log" + + plt.hexbin(x_data, y_data, gridsize=50, cmap="Blues", bins=bins) + + data: pd.DataFrame = data.replace([np.inf, -np.inf], np.nan).dropna() + + x_data_clean, y_data_clean = data[variable_x_axis], data[variable_y_axis] + + if trendline: + z = np.polyfit(x_data_clean, y_data_clean, 1) + p = np.poly1d(z) + plt.plot(x_data_clean, p(x_data_clean), "r--") + + if x_axis_maximum is not None: + plt.xlim(0, x_axis_maximum) + if y_axis_maximum is not None: + plt.ylim(0, y_axis_maximum) + + if x_label: + plt.xlabel(x_label) + else: + plt.xlabel(variable_x_axis) + if y_label: + plt.ylabel(y_label) + else: + plt.ylabel(variable_y_axis) + if title: + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_box_plot( + self, + data: pd.DataFrame, + variable_1: str, + variable_2: str, + x_axis_label: str, + y_axis_label: str, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + ): + """ + Generates a box plot for two columns from a data frame. + + Parameters + ---------- + data: dataframe + The dataframe that contains the data. + variable_1: str + The name of the first column to plot. + variable_2: str + The name of the second column to plot. + x_axis_label: str + The label for the x-axis. + y_axis_label: str + The label for the y-axis. + title: str, default None + The title of the plot. + name_save_file: str, default None + The name of the file to which the result will be saved. 
+ """ + + column_one_data: list = data[variable_1].dropna().values.tolist() + column_two_data: list = data[variable_2].dropna().values.tolist() + + fig = plt.figure(figsize=(10, 6)) + ax = fig.add_subplot(111) + ax.boxplot([column_one_data, column_two_data], labels=[variable_1, variable_2]) + ax.set_xlabel(x_axis_label) + ax.set_ylabel(y_axis_label) + + title: str = title if title else f"Box plot of {variable_1} and {variable_2}" + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_violin_plot( + self, + data: pd.DataFrame, + variable_x_axis: str, + variable_y_axis: str, + x_axis_label: str, + y_axis_label: str, + title: Optional[str] = None, + name_save_file: Optional[Path] = None, + ): + """ + Generates a violin plot for two columns from a data frame. + Violin plots show the distribution of data as a frequency distribution. + Varying sample size between the columns will influence the violin plot. + + Parameters + ---------- + data: dataframe + The dataframe that contains the data. + variable_x_axis: str + The name of the first column to plot. + variable_y_axis: str + The name of the second column to plot. + x_axis_label: str + The label for the x-axis. + y_axis_label: str + The label for the y-axis. + title: str, default None + The title of the plot. + name_save_file: str, default None + The name of the file to which the result will be saved. 
+ """ + column_one_data: list = data[variable_x_axis].dropna().values.tolist() + column_two_data: list = data[variable_y_axis].dropna().values.tolist() + + fig = plt.figure(figsize=(10, 6)) + ax = fig.add_subplot(111) + ax.violinplot([column_one_data, column_two_data]) + ax.set_xticks([1, 2]) + ax.set_xticklabels([variable_x_axis, variable_y_axis]) + ax.set_xlabel(x_axis_label) + ax.set_ylabel(y_axis_label) + + title: str = ( + title + if title + else f"Violin plot of {variable_x_axis} and {variable_y_axis}" + ) + plt.title(title) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_forest_plot( + self, + regression_models: Dict[str, RegressionResults], + coefficient_names: list[str], + sort_by_size: bool = False, + x_axis_minimum: Optional[Union[int, float]] = None, + x_axis_maximum: Optional[Union[int, float]] = None, + x_axis_label: Optional[str] = None, + dotsize: float = 5, + colors: list[str] = ["orange", "royalblue", "forestgreen", "firebrick"], + name_save_file: Optional[Path] = None, + ) -> None: + """ + The function creates a forest plot of coefficients from multiple regression models. + + Parameters + ---------- + regression_models: dict + A dictionary containing category names as keys and RegressionResults as values. + coefficient_names: list[str] + The coefficients of the regression model to be plotted in the forest plot + sort_by_size: bool + If one want to sort the datapoints by order + # Set x-axis limits if provided + x_axis_minimum: float, default None + The minimum value of the x-axis. Ignored if not given. + x_axis_maximum: float, default None + The maximum value of the x-axis. + x_axis_label: str, default None + The label of the x-axis + dotsize: float, default 5 + The size of the dots indicating values in the forestplot + colors: list[str] + The colors used to plot the coefficients + name_save_file: str, default None + The name of the file to which the result will be saved. 
+ + + Returns + ------- + None + """ + data: dict = {"categories": []} + + for coefficient_name in coefficient_names: + data[coefficient_name] = [] + data[f"conf_interval_{coefficient_name}"] = [] + data[f"color_{coefficient_name}"] = [] + + for category, model in regression_models.items(): + data["categories"].append(category) + for idx, coefficient_name in enumerate(coefficient_names): + color = colors[idx % len(colors)] + coef = model.params.get(coefficient_name, None) + if coef is not None: + data[coefficient_name].append(coef) + conf_int = model.conf_int().loc[coefficient_name] + data[f"conf_interval_{coefficient_name}"].append( + [conf_int[0], conf_int[1]] + ) + data[f"color_{coefficient_name}"].append(color) + else: + data[coefficient_name].append(None) + data[f"conf_interval_{coefficient_name}"].append([None, None]) + data[f"color_{coefficient_name}"].append(None) + + data_dataframe: pd.DataFrame = pd.DataFrame(data) + + if sort_by_size: + data_dataframe.sort_values(by=coefficient_names, key=abs, inplace=True) + data_dataframe.reset_index(drop=True, inplace=True) + + f, ax = plt.subplots(figsize=(7, 3.5 + 0.5 * len(coefficient_names))) + ax.plot([0, 0], [0.5, len(data_dataframe) + 0.5], "--", color="black") + + for index, row in data_dataframe.iterrows(): + for idx, coefficient_name in enumerate(coefficient_names): + color = row[f"color_{coefficient_name}"] + coef = row[coefficient_name] + conf_int = row[f"conf_interval_{coefficient_name}"] + if color and coef is not None: + ax.plot( + conf_int, + [index + 1, index + 1], + "-", + color=color, + solid_capstyle="round", + ) + ax.plot(coef, index + 1, "o", color=color, markersize=dotsize) + + ax.set_yticks(range(1, len(data_dataframe) + 1)) + ax.set_yticklabels(data_dataframe["categories"]) + ax.invert_yaxis() + + if x_axis_minimum is not None or x_axis_maximum is not None: + ax.set_xlim(left=x_axis_minimum, right=x_axis_maximum) + + if x_axis_label is not None: + ax.set_xlabel(x_axis_label) + + if 
len(coefficient_names) > 1: + handles = [ + plt.Line2D( + [0], [0], color=colors[i % len(colors)], linewidth=2, linestyle="-" + ) + for i in range(len(coefficient_names)) + ] + ax.legend( + handles, + coefficient_names, + title="Coefficients", + loc="best", + fontsize="small", + title_fontsize="small", + ) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) + + def create_forest_plot_paired_ttest( + self, + paired_ttests: dict[str, tuple[float, float, float]], + x_axis_minimum: Optional[Union[int, float]] = None, + x_axis_maximum: Optional[Union[int, float]] = None, + x_axis_label: Optional[str] = None, + dotsize: float = 5, + name_save_file: Optional[Path] = None, + ) -> None: + """ + The function creates a forest plot of mean differences from multiple paired t-tests. + + Parameters + ---------- + paired_ttests: dict + A dictionary containing category names as keys and paired t-test results as values. + Each test result is a tuple of (mean difference, lower CI, upper CI). + x_axis_minimum: float, default None + The minimum value of the x-axis. Ignored if not given. + x_axis_maximum: float, default None + The maximum value of the x-axis. Ignored if not given. + x_axis_label: str, default None + The label of the x-axis + dotsize: float, default 5 + Size of the dots in the plot. + name_save_file: str, default None + The name of the file to which the result will be saved. 
+ + Returns + ------- + None + """ + data: dict = {"categories": [], "mean_diff": [], "conf_interval": []} + + for category, (mean_diff, lower_ci, upper_ci) in paired_ttests.items(): + data["categories"].append(category) + data["mean_diff"].append(mean_diff) + data["conf_interval"].append([lower_ci, upper_ci]) + + data_dataframe: pd.DataFrame = pd.DataFrame(data) + + f, ax = plt.subplots(figsize=(8, len(data_dataframe) * 0.5)) + + ax.plot([0, 0], [0.5, len(data_dataframe) + 0.5], "--", color="black") + + for index, row in data_dataframe.iterrows(): + ax.plot( + row["conf_interval"], + [index + 1, index + 1], + "-", + color="orange", + solid_capstyle="round", + ) + ax.plot(row["mean_diff"], index + 1, "o", color="orange", markersize=dotsize) + + ax.set_yticks(range(1, len(data_dataframe) + 1)) + ax.set_yticklabels(data_dataframe["categories"]) + ax.invert_yaxis() + + if x_axis_minimum is not None or x_axis_maximum is not None: + ax.set_xlim(left=x_axis_minimum, right=x_axis_maximum) + + if x_axis_label is not None: + ax.set_xlabel(x_axis_label) + + save_plot(self.save_plots, self.filepath, plt, name_save_file) diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..e69de29 diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..6a0bf34 --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,42 @@ +from src.analysis_functions.comparison_variance_in_and_between_group import ( + ComparisonVariance, +) +from src.data_classes.parameters_analysis_comparison_variance_in_and_between_group import ( + ComparisonVarianceInAndBetweenGroupParameters, +) +from src.utils.helper_logging import log_comparison_variance_details + + +def run_comparison_variance_in_and_between_group( + comparison_variance: ComparisonVariance, + analyses_list: list[ComparisonVarianceInAndBetweenGroupParameters], +): + """ + Run a comparison between the mean variance in a variable grouped by a condition and the variance + 
between group members based on a list of parameter objects. + + This function iterates over a list of ComparisonVarianceInAndBetweenGroupParameters objects, + and then reports the total variance, variance between group members and mean variance of group members for + each pair of variables specified in the ComparisonVarianceInAndBetweenGroupParameters object. + + Parameters + ---------- + comparison_variance: ComparisonVariance + An instance of the ComparisonVariance class. + analyses_list: list[ComparisonVarianceInAndBetweenGroupParameters] + A list of ComparisonVarianceInAndBetweenGroupParameters objects. + Each object contains the parameters for a single variation comparison analysis. + + Returns + ------- + None + """ + for analysis_config in analyses_list: + log_comparison_variance_details(analysis_config) + + comparison_variance.compare_ingroup_intergroup_variance( +, + variable=analysis_config.variable, +, + name_save_file=analysis_config.name_save_file, + ) diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..5e3b491 --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,73 @@ +from src.analysis_functions.descriptive import DescriptiveAnalysis +from src.data_classes.parameters_descriptive_aggregated import ( + DescriptiveAggregatedParameters, +) +from src.data_classes.parameters_descriptive_overview import ( + DescriptiveOverviewParameters, +) +from src.data_classes.parameters_descriptive_percentage_of_dataset_under_condition import ( + DescriptivePercentageOfDatasetUnderConditionParameters, +) + +from src.data_classes.parameters_general import GeneralParameters + +from src.utils.helper_logging import log_descriptive_analysis_details + + +def run_descriptive_analysis( + descriptive: DescriptiveAnalysis, analyses_list: list[GeneralParameters] +): + """ + Run descriptive analyses based on a list of parameter objects. 
+ + This function iterates over a list of GeneralParameters objects, + logs the details of each analysis, and then performs the appropriate + descriptive analysis based on the type of the Parameters object. + + Parameters + ---------- + descriptive: DescriptiveAnalysis + An instance of the DescriptiveAnalysis class. + analyses_list: list[GeneralParameters] + A list of GeneralParameters objects. + Each object contains the parameters for a single descriptive analysis. + + Raises + ------ + ValueError: If the type of the GeneralParameters object is not recognized or not a descriptive analysis. + + Returns + ------- + None + """ + for analysis_config in analyses_list: + log_descriptive_analysis_details(analysis_config) + if isinstance(analysis_config, DescriptiveAggregatedParameters): + descriptive.create_descriptive_aggregated_for_metrics( +, + variables=analysis_config.variables, + aggregation_function=analysis_config.aggregation_function, + group_by=analysis_config.group_by, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, DescriptiveOverviewParameters): + descriptive.create_descriptives_for_metrics( +, + metrics=analysis_config.metrics, + group_by=analysis_config.group_by, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance( + analysis_config, DescriptivePercentageOfDatasetUnderConditionParameters + ): + descriptive.give_percentage_of_dataset_under_condition( +, + variable=analysis_config.variable, + comparison=analysis_config.comparison, + condition=analysis_config.condition, + name_save_file=analysis_config.name_save_file, + ) + else: + raise ValueError( + f"Invalid type of descriptive analysis requested {type(analysis_config)}" + ) diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..727e9c5 --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,39 @@ +from src.analysis_functions.pearson_correlation import PearsonCorrelation +from
src.data_classes.parameters_analysis_pearson_correlation import ( + PearsonCorrelationParameters, +) +from src.utils.helper_logging import log_pearson_correlation_details + + +def run_pearson_correlation( + pearson_correlation: PearsonCorrelation, + analyses_list: list[PearsonCorrelationParameters], +): + """ + Run Pearson correlation analyses based on a list of parameter objects. + + This function iterates over a list of PearsonCorrelationParameters objects, + and then calculates the Pearson correlation for each pair of variables specified + in the PearsonCorrelationParameters object. + + Parameters + ---------- + pearson_correlation: PearsonCorrelation + An instance of the PearsonCorrelation class. + analyses_list: list[PearsonCorrelationParameters] + A list of PearsonCorrelationParameters objects. + Each object contains the parameters for a single Pearson correlation analysis. + + Returns + ------- + None + """ + for analysis_config in analyses_list: + log_pearson_correlation_details(analysis_config) + + pearson_correlation.calculate_correlation( +, + first_group_name=analysis_config.variable_1, + second_group_name=analysis_config.variable_2, + name_save_file=analysis_config.name_save_file, + ) diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..bef17fd --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,85 @@ +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.regression import Regression + +from src.data_classes.parameters_analysis_regression import ( + RegressionParameters, + BayesianRegressionParameters, + LinearRegressionParameters, + GroupedLinearRegressionParameters, +) +from src.utils.helper_logging import log_regression_details + + +def run_regression( + regression: Regression, analyses_list: list[RegressionParameters] +) -> dict[str, RegressionResults]: + """ + Run regression analyses based on a list of parameter objects.
+ + This function iterates over a list of RegressionParameters objects, + and then performs the appropriate regression analysis based on the + type of the RegressionParameters object. + + Parameters + ---------- + regression: Regression + An instance of the Regression class. + analyses_list: list[RegressionParameters] + A list of RegressionParameters objects. + Each object contains the parameters for a single regression analysis. + + Raises + ------ + ValueError: If the type of the RegressionParameters object is not recognized. + + Returns + ------- + dict[str, RegressionResults] + A dictionary of regression results, where the key is the name of the analysis + and the value is the regression results. + """ + + regression_results: dict = {} + + for analysis_config in analyses_list: + result = None + + if isinstance(analysis_config, LinearRegressionParameters): + log_regression_details(analysis_config, "Linear") + + result = regression.linear_regression( +, + x_vector=analysis_config.independent_variables, + y=analysis_config.dependent_variable, + standardize=analysis_config.standardize, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, BayesianRegressionParameters): + log_regression_details(analysis_config, "Bayesian") + + regression.bayesian_regression( +, + x_vector=analysis_config.independent_variables, + y=analysis_config.dependent_variable, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, GroupedLinearRegressionParameters): + log_regression_details(analysis_config, "Grouped") + + result = regression.linear_regression_grouped( +, + x_vector=analysis_config.independent_variables, + y=analysis_config.dependent_variable, + dictionary_aggregation_methods_for_data_columns=analysis_config.dictionary_aggregation_methods, + column_to_group_by=analysis_config.group_by, + standardize=analysis_config.standardize, + print_detailed_coefficients=analysis_config.print_detailed_coefficients, + 
name_save_file=analysis_config.name_save_file, + ) + else: + raise ValueError(f"Unknown parameter type: {type(analysis_config)}") + + regression_results[]: RegressionResults = result + + return regression_results diff --git a/src/analysis_wrappers/specific_analysis_wrappers/ b/src/analysis_wrappers/specific_analysis_wrappers/ new file mode 100644 index 0000000..e69de29 diff --git a/src/analysis_wrappers/specific_analysis_wrappers/ b/src/analysis_wrappers/specific_analysis_wrappers/ new file mode 100644 index 0000000..9e1fdbf --- /dev/null +++ b/src/analysis_wrappers/specific_analysis_wrappers/ @@ -0,0 +1,52 @@ +import pandas as pd +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.specific_analysis.get_function_inverse_bayes_transformed_regression import ( + get_function_inverse_bayes_transformed_regression, +) +from src.data_classes.parameters_analysis_get_function_inverse_bayes_transformed_regression import ( + GetFunctionInverseBayesTransformedRegressionParameters, +) +from src.utils.helper_functions import FunctionData +from src.utils.helper_logging import ( + log_get_function_inverse_bayes_transformed_regression, +) + + +def run_get_function_inverse_bayes_transformed_regression( + analyses_list: list[GetFunctionInverseBayesTransformedRegressionParameters], + regression_models: dict[str, RegressionResults], +) -> dict[str, FunctionData]: + """ + Executes the inverse Bayes transformed regression analysis for a list of analysis configurations. + + This function iterates over a list of analysis configurations, logs the configuration details, + performs the inverse Bayes transformed regression analysis using the specified regression model, + and collects the results in a dictionary. + + Parameters + ---------- + analyses_list : list[GetFunctionInverseBayesTransformedRegressionParameters] + A list of parameters for each analysis to be run. 
Each item in the list is an instance + of GetFunctionInverseBayesTransformedRegressionParameters, which includes the data and model name. + regression_models : dict[str, RegressionResults] + A dictionary mapping model names to their corresponding fitted regression models. + + Returns + ------- + dict[str, FunctionData] + A dictionary mapping the name of each analysis to its resulting FunctionData object. + """ + functions: dict = {} + + for analysis_config in analyses_list: + log_get_function_inverse_bayes_transformed_regression(analysis_config) + + function: FunctionData = get_function_inverse_bayes_transformed_regression( +, + model=regression_models[analysis_config.model_name], + ) + + functions[] = function + + return functions diff --git a/src/analysis_wrappers/specific_analysis_wrappers/ b/src/analysis_wrappers/specific_analysis_wrappers/ new file mode 100644 index 0000000..9e1840f --- /dev/null +++ b/src/analysis_wrappers/specific_analysis_wrappers/ @@ -0,0 +1,51 @@ +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.specific_analysis.increase_per_up_and_downvote import ( + InfluenceOfUpAndDownvotesOnReplies, +) +from src.data_classes.parameters_analysis_influence_of_up_and_downvotes import ( + InfluenceOfVotesParameters, +) +from src.utils.helper_logging import log_influence_of_up_and_downvotes_on_replies + + +def run_report_influence_of_up_and_downvotes_on_replies( + influence_of_up_and_downvotes: InfluenceOfUpAndDownvotesOnReplies, + analyses_list: list[InfluenceOfVotesParameters], + regression_models: dict[str, RegressionResults], +) -> None: + """ + Run specific analyses to find the effect of up-and downvotes on replies based on a list of parameter objects. + + This function iterates over a list of InfluenceOfVotesParameters objects, + and then calculates the influence of up-and downvotes on replies based on + the parameters in each object and a given regression model. 
+ + + Parameters + ---------- + influence_of_up_and_downvotes: InfluenceOfUpAndDownvotesOnReplies + An instance of the InfluenceOfUpAndDownvotesOnReplies class. + analyses_list: list[InfluenceOfVotesParameters] + A list of InfluenceOfVotesParameters objects. + Each object contains the parameters for a single analysis of the influence of up- and downvotes. + regression_models: dict[str, RegressionResults] + A dictionary of regression models, where the key is the model name and the value is the regression model. + Created via the regression_wrappers.run_regression function. + + Returns + ------- + None + """ + for analysis_config in analyses_list: + log_influence_of_up_and_downvotes_on_replies(analysis_config) + + influence_of_up_and_downvotes.report_increase_per_up_and_downvote_from_totalvotes_and_valence( +, + weight_as_distribution_quantile=analysis_config.weight_as_distribution_quantile, + weight_m=analysis_config.weight_m, + model=regression_models[analysis_config.model_name], + step=analysis_config.step, + startpoint=analysis_config.startpoint, + name_save_file=analysis_config.name_save_file, + ) diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..b1c3575 --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,57 @@ +from src.analysis_functions.ttest import TTest +from src.data_classes.parameters_analysis_ttest import ( + TTestParameters, + PairedTTestParameters, +) +from src.utils.helper_logging import log_ttest_details + + +def run_ttest( + ttest: TTest, analyses_list: list[TTestParameters] +) -> dict[str, tuple[float, float, float]]: + """ + Run t-test analyses based on a list of parameter objects. + + This function iterates over a list of TTestParameters objects, + logs the details of each analysis, and then performs the appropriate + t-test analysis based on the type of the TTestParameters object. + + Parameters + ---------- + ttest: TTest + An instance of the TTest class.
+ analyses_list: list[TTestParameters] + A list of TTestParameters objects. + Each object contains the parameters for a single t-test analysis. + + Returns + ------- + dict[str, tuple[float, float, float]] + """ + ttest_results: dict = {} + + for analysis_config in analyses_list: + result = None + + if isinstance(analysis_config, PairedTTestParameters): + log_ttest_details(analysis_config, "Paired") + + result = ttest.perform_paired_ttest( +, + first_group_name=analysis_config.variable_1, + second_group_name=analysis_config.variable_2, + name_save_file=analysis_config.name_save_file, + ) + + else: + log_ttest_details(analysis_config, "") + ttest.perform_ttest( +, + first_group_name=analysis_config.variable_1, + second_group_name=analysis_config.variable_2, + name_save_file=analysis_config.name_save_file, + ) + + ttest_results[] = result + + return ttest_results diff --git a/src/analysis_wrappers/ b/src/analysis_wrappers/ new file mode 100644 index 0000000..684181e --- /dev/null +++ b/src/analysis_wrappers/ @@ -0,0 +1,360 @@ +from statsmodels.regression.linear_model import RegressionResults + +from src.analysis_functions.visualization import DataVisualizer +from src.data_classes.parameters_plot_barchart import BarChartPlotParameters +from src.data_classes.parameters_plot_boxplot import BoxPlotParameters +from src.data_classes.parameters_plot_contourplot import ContourPlotParameters +from src.data_classes.parameters_plot_count_distribution import ( + CountDistributionPlotParameters, +) +from src.data_classes.parameters_plot_densityplot import DensityPlotParameters +from src.data_classes.parameters_plot_forestplot import ForestPlotParameters +from src.data_classes.parameters_plot_forestplot_paired_ttest import ( + ForestPlotPairedTTestParameters, +) +from src.data_classes.parameters_plot_grouped_histogram import ( + GroupedHistogramParameters, +) +from src.data_classes.parameters_plot_heatmap import HeatmapParameters +from src.data_classes.parameters_plot_hexbinplot 
import HexbinPlotParameters +from src.data_classes.parameters_plot_histogram import HistogramPlotParameters +from src.data_classes.parameters_plot_percentage_stacked_barchart import ( + PercentageStackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_ridgelineplot import RidgelineParameters +from src.data_classes.parameters_plot_simple_scatterplot import ( + SimpleScatterPlotParameters, +) +from src.data_classes.parameters_plot_stacked_barchart import ( + StackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_surfaceplot import SurfacePlotParameters +from src.data_classes.parameters_plot_violinplot import ViolinPlotParameters +from src.data_classes.parameters_visualization import PlotParameters +from src.utils.helper_functions import FunctionData +from src.utils.helper_logging import log_visualization_details + + +def run_visualization( + visualizer: DataVisualizer, + analyses_list: list[PlotParameters], + regression_results: dict[str, RegressionResults], + functions: dict[str, FunctionData], + ttest_results: dict[str, tuple[float, float, float]], +) -> None: + """ + Run visualization analyses based on a list of parameter objects. + + This function iterates over a list of PlotParameters objects, + logs the details of each analysis, and then performs the appropriate + visualization based on the type of the PlotParameters object. + + Parameters + ---------- + visualizer: DataVisualizer + An instance of the DataVisualizer class. + analyses_list: list[PlotParameters] + A list of PlotParameters objects. + Each object contains the parameters for a single visualization. + regression_results: dict[str, RegressionResults] + A dictionary mapping regression model names to their results. + functions: dict[str, FunctionData] + A dictionary mapping function names to their data. + ttest_results: dict[str, tuple[float, float, float]] + A dictionary mapping paired t-test names to their results. 
+ + Raises + ------ + ValueError: If the type of the PlotParameters object is not recognized. + + Returns + ------- + None + """ + for analysis_config in analyses_list: + log_visualization_details(analysis_config) + if isinstance(analysis_config, BarChartPlotParameters): + visualizer.create_bar_chart( +, + variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + chart_orientation=analysis_config.chart_orientation, + sort_order=analysis_config.sort_order, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + custom_order=analysis_config.custom_order, + ) + elif isinstance(analysis_config, BoxPlotParameters): + visualizer.create_box_plot( +, + variable_1=analysis_config.variable_1, + variable_2=analysis_config.variable_2, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, ContourPlotParameters): + visualizer.create_contour_plot( + function=functions[analysis_config.function_name], + x_axis_maximum=analysis_config.x_axis_maximum, + y_axis_maximum=analysis_config.y_axis_maximum, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, HistogramPlotParameters): + visualizer.create_histogram( +, + variable=analysis_config.variable, + x_axis_limits=analysis_config.x_axis_limits, + x_axis_logarithmic_scaling=analysis_config.x_axis_logarithmic_scaling, + y_axis_logarithmic_scaling=analysis_config.y_axis_logarithmic_scaling, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif 
isinstance(analysis_config, CountDistributionPlotParameters): + visualizer.create_count_distribution( +, + variable=analysis_config.variable, + x_axis_limits=analysis_config.x_axis_limits, + x_axis_logarithmic_scaling=analysis_config.x_axis_logarithmic_scaling, + y_axis_logarithmic_scaling=analysis_config.y_axis_logarithmic_scaling, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, DensityPlotParameters): + visualizer.create_density_plot( +, + variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + data_breakpoints=analysis_config.data_breakpoints, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, ForestPlotParameters): + visualizer.create_forest_plot( + regression_models=_get_labeled_models_for_forest_plot( + analysis_config, regression_results + ), + coefficient_names=analysis_config.coefficient_names, + sort_by_size=analysis_config.sort_by_size, + x_axis_minimum=analysis_config.x_axis_minimum, + x_axis_maximum=analysis_config.x_axis_maximum, + dotsize=analysis_config.dotsize, + colors=analysis_config.colors, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, ForestPlotPairedTTestParameters): + visualizer.create_forest_plot_paired_ttest( + paired_ttests=_get_labeled_ttests_for_forest_plot( + analysis_config, ttest_results + ), + x_axis_minimum=analysis_config.x_axis_minimum, + x_axis_maximum=analysis_config.x_axis_maximum, + dotsize=analysis_config.dotsize, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, GroupedHistogramParameters): + visualizer.create_grouped_histogram( +, + group_by=analysis_config.group_by, + aggregation_column=analysis_config.aggregation_variable, + 
aggregation_function=analysis_config.aggregation_function, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, HeatmapParameters): + visualizer.create_heatmap( +, + axis_variables=analysis_config.axis_variables, + heat_variable=analysis_config.heat_variable, + max_values_axes=analysis_config.axis_maxima, + min_values_axes=analysis_config.axis_minima, + log_option=analysis_config.logarithmic_heat_scaling, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, HexbinPlotParameters): + visualizer.create_hexbin_plot( +, + variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + x_axis_maximum=analysis_config.x_axis_maximum, + y_axis_maximum=analysis_config.y_axis_maximum, + trendline=analysis_config.trendline, + log_scale=analysis_config.logarithmic_hex_scaling, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, PercentageStackedBarChartPlotParameters): + visualizer.create_percentage_stacked_bar_chart( +, + variable_x_axis=analysis_config.variable_x_axis, + variables_to_compare=analysis_config.variables_to_compare, + chart_orientation=analysis_config.chart_orientation, + sort_order=analysis_config.sort_order, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, RidgelineParameters): + visualizer.create_ridgeline_plot( +, + x_axis_variable=analysis_config.variable_x_axis, + y_axis_variable=analysis_config.variable_y_axis, + data_breakpoints=analysis_config.data_breakpoints, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, StackedBarChartPlotParameters): + visualizer.create_stacked_bar_chart( +, + 
variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + hue=analysis_config.hue, + chart_orientation=analysis_config.chart_orientation, + sort_order=analysis_config.sort_order, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + custom_order=analysis_config.custom_order, + ) + elif isinstance(analysis_config, SimpleScatterPlotParameters): + visualizer.create_scatter_plot_simple( +, + variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, SurfacePlotParameters): + visualizer.generate_surface_plot( + function=functions[analysis_config.function_name], + x_axis_maximum=analysis_config.x_axis_maximum, + y_axis_maximum=analysis_config.y_axis_maximum, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + z_axis_label=analysis_config.z_axis_label, + elevation_angle=analysis_config.elevation_angle, + azimuth_angle=analysis_config.azimuth_angle, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + elif isinstance(analysis_config, ViolinPlotParameters): + visualizer.create_violin_plot( +, + variable_x_axis=analysis_config.variable_x_axis, + variable_y_axis=analysis_config.variable_y_axis, + x_axis_label=analysis_config.x_axis_label, + y_axis_label=analysis_config.y_axis_label, + title=analysis_config.title, + name_save_file=analysis_config.name_save_file, + ) + else: + raise ValueError( + f"Invalid type of visualization requested {type(analysis_config)}" + ) + + +def _get_labeled_models_for_forest_plot( + analysis_config: ForestPlotParameters, + regression_results: dict[str, RegressionResults], +) -> dict: + """ + Maps regression model names to their labels for forest plot 
visualization. + + This function takes an analysis configuration for a forest plot and a dictionary + of regression results. It maps the specified regression model names to their + corresponding labels as defined in the analysis configuration. This mapping is + used for labeling the models in the forest plot visualization. + + Parameters + ---------- + analysis_config : ForestPlotParameters + The configuration parameters for the forest plot, including the names and + labels of the regression models to be used. + regression_results : dict[str, RegressionResults] + A dictionary mapping model names to their corresponding fitted regression models. + + Returns + ------- + dict + A dictionary mapping the labels (as specified in the analysis configuration) + to the corresponding regression models. + """ + mapping_regression_model_names_to_labels: dict = { + name: label + for name, label in zip( + analysis_config.regression_model_names, + analysis_config.regression_model_labels, + ) + if name in regression_results.keys() + } + + selected_models: dict = { + name: regression_results[name] + for name in mapping_regression_model_names_to_labels.keys() + } + + labeled_models: dict = { + mapping_regression_model_names_to_labels[name]: model + for name, model in selected_models.items() + } + + return labeled_models + + +def _get_labeled_ttests_for_forest_plot( + analysis_config: ForestPlotPairedTTestParameters, + ttest_results: dict[str, tuple[float, float, float]], +) -> dict: + """ + Maps paired t-test names to their labels for forest plot visualization. + + This function takes an analysis configuration for a forest plot that includes + paired t-tests and a dictionary of t-test results. It maps the specified t-test + names to their corresponding labels as defined in the analysis configuration. + This mapping is used for labeling the t-tests in the forest plot visualization. 
+ + Parameters + ---------- + analysis_config : ForestPlotPairedTTestParameters + The configuration parameters for the forest plot, including the names and + labels of the paired t-tests to be used. + ttest_results : dict[str, tuple[float, float, float]] + A dictionary mapping t-test names to their results (mean difference and the + lower and upper bounds of the confidence interval). + + Returns + ------- + dict + A dictionary mapping the labels (as specified in the analysis configuration) + to the corresponding t-test results. + """ + mapping_ttest_names_to_labels: dict = { + name: label + for name, label in zip( + analysis_config.paired_ttest_names, analysis_config.paired_ttest_labels + ) + if name in ttest_results.keys() + } + + selected_ttests: dict = { + name: ttest_results[name] for name in mapping_ttest_names_to_labels.keys() + } + + labeled_ttests: dict = { + mapping_ttest_names_to_labels[name]: ttest + for name, ttest in selected_ttests.items() + } + + return labeled_ttests diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..e69de29 diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..726819d --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,10 @@ +from attrs import field, define +from attrs.validators import instance_of + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class ComparisonVarianceInAndBetweenGroupParameters(GeneralParameters): + variable: str = field(validator=instance_of(str)) + group: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..c09cb8f --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,9 @@ +from attrs import field, define +from attrs.validators import instance_of + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class GetFunctionInverseBayesTransformedRegressionParameters(GeneralParameters): + model_name: str = field(validator=instance_of(str)) diff --git
a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..da51270 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,21 @@ +from typing import Iterable, Union + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class InfluenceOfVotesParameters(GeneralParameters): + weight_as_distribution_quantile: bool = field(validator=instance_of(bool)) + weight_m: Union[int, float] = field(validator=instance_of((int, float))) + model_name: str = field(validator=instance_of(str)) + step: list[int] = field( + validator=deep_iterable( + member_validator=instance_of(int), iterable_validator=instance_of(Iterable) + ), + ) + startpoint: Union[str, list[Union[int, float]]] = field( + validator=instance_of((str, list)) + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..4c48dd4 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,10 @@ +from attrs import field, define +from attrs.validators import instance_of + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class PearsonCorrelationParameters(GeneralParameters): + variable_1: str = field(validator=instance_of(str)) + variable_2: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..2d5a380 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,61 @@ +from typing import Iterable +from pathlib import Path + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable, optional +from attr import attrib + +from src.data_classes.parameters_general import GeneralParameters +from src.utils.helper_conversion import create_dictionary_of_aggregation_methods + + +@define +class RegressionParameters(GeneralParameters): + dependent_variable: str = field(validator=instance_of(str)) + independent_variables: list[str] = field( + validator=deep_iterable( + 
member_validator=instance_of(str), iterable_validator=instance_of(Iterable) + ), + ) + + +@define +class BayesianRegressionParameters(RegressionParameters): + pass + + +@define +class LinearRegressionParameters(RegressionParameters): + standardize: bool = field(default=False, validator=instance_of(bool)) + report_effect_size: bool = field( + default=False, validator=optional(instance_of(bool)) + ) + + +@define +class GroupedLinearRegressionParameters(RegressionParameters): + aggregation_functions: list[str] = field( + factory=list, + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ), + ) + group_by: str = field(default="sum", validator=instance_of(str)) + standardize: bool = field(default=False, validator=instance_of(bool)) + report_effect_size: bool = field( + default=False, validator=optional(instance_of(bool)) + ) + print_detailed_coefficients: bool = field( + default=False, validator=optional(instance_of(bool)) + ) + dictionary_aggregation_methods = attrib(init=False) + + def __attrs_post_init__(self): + self.name_save_file = Path(f"{}") + self.dictionary_aggregation_methods: dict = ( + create_dictionary_of_aggregation_methods( + self.independent_variables, + self.dependent_variable, + self.aggregation_functions, + ) + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..34cca10 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,14 @@ +from attrs import field, define +from attrs.validators import instance_of + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class TTestParameters(GeneralParameters): + variable_1: str = field(validator=instance_of(str)) + variable_2: str = field(validator=instance_of(str)) + + +class PairedTTestParameters(TTestParameters): + pass diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..ed31293 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,15 @@ +from attr import attrib +from attrs
import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_general import GeneralParameters + + +@define +class DescriptiveAggregatedParameters(GeneralParameters): + variables: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ) + aggregation_function: str = field(validator=instance_of(str)) + group_by: str = attrib(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..5534a2f --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,26 @@ +from pathlib import Path +from typing import Union, Any + +from attr import attrib +from attrs import field, define +from attrs.validators import instance_of, optional, deep_iterable +from src.data_classes.parameters_general import GeneralParameters + + +@define +class Metric: + operation: str = field(validator=instance_of(str)) + column: str = field(validator=optional(instance_of(str))) + + +@define +class DescriptiveOverviewParameters(GeneralParameters): + metrics: list[Union[Any, Metric]] = field(validator=instance_of(list)) + group_by: str = attrib(validator=instance_of(str)) + + def __attrs_post_init__(self): + self.metrics = [ + Metric(**metric) if isinstance(metric, dict) else metric + for metric in self.metrics + ] + self.name_save_file: Path = Path(f"{}") diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..cb6c593 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,12 @@ +from typing import Union + +from attrs import field, define +from attrs.validators import instance_of +from src.data_classes.parameters_general import GeneralParameters + + +@define +class DescriptivePercentageOfDatasetUnderConditionParameters(GeneralParameters): + variable: str = field(validator=instance_of(str)) + comparison: str = field(validator=instance_of(str)) + condition: Union[int, float] = field(validator=instance_of((int, float))) 
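Note on the `DescriptiveOverviewParameters` class above: its `metrics` may be given either as `Metric` objects or as plain dictionaries (presumably as they arrive from a parsed job file), and they are normalized in `__attrs_post_init__`. The following is a minimal sketch of that coercion pattern, not the repository's code. It substitutes the standard library's `dataclasses` for `attrs` so that it runs without extra dependencies, and the names `MetricSketch` and `OverviewSketch` as well as all example values are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Optional, Union


@dataclass
class MetricSketch:
    """Hypothetical stand-in for the attrs-based Metric class."""

    operation: str
    column: Optional[str]

    def __post_init__(self) -> None:
        # Hand-written checks that mimic instance_of / optional(instance_of).
        if not isinstance(self.operation, str):
            raise TypeError("operation must be a str")
        if self.column is not None and not isinstance(self.column, str):
            raise TypeError("column must be a str or None")


@dataclass
class OverviewSketch:
    """Hypothetical stand-in for DescriptiveOverviewParameters."""

    metrics: list[Union[dict[str, Any], MetricSketch]]
    group_by: str

    def __post_init__(self) -> None:
        # Convert dictionaries into MetricSketch objects, keep the rest as-is.
        self.metrics = [
            MetricSketch(**metric) if isinstance(metric, dict) else metric
            for metric in self.metrics
        ]


# Example values are made up for illustration only.
overview = OverviewSketch(
    metrics=[
        {"operation": "mean", "column": "replies"},
        MetricSketch(operation="count", column=None),
    ],
    group_by="article_id",
)
```

Normalizing in the post-init hook means that later code can rely on every entry exposing `.operation` and `.column`, regardless of how the job was specified.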
diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..7191a63 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,17 @@ +import pandas as pd +from attr import attrib + +from attrs import field, define +from attrs.validators import instance_of +from pathlib import Path + + +@define +class GeneralParameters: + name: str = field(validator=instance_of(str)) + dataset: str = field(validator=instance_of(str)) + data: pd.DataFrame = attrib(init=False) + name_save_file: Path = attrib(init=False) + + def __attrs_post_init__(self): + self.name_save_file: Path = Path(f"{}") diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..7798270 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,21 @@ +from typing import Optional + +from attrs import field, define +from attrs.validators import instance_of, optional, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class BarChartPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: Optional[str] = field(validator=optional(instance_of(str))) + chart_orientation: str = field(validator=instance_of(str)) + sort_order: str = field(validator=instance_of(str)) + custom_order: Optional[list[str]] = field( + default=None, + validator=optional( + deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ), + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..4916ac8 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,9 @@ +from attrs import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class BoxPlotParameters(PlotParameters): + variable_1: str = field(validator=instance_of(str)) + variable_2: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 
index 0000000..13804e8 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,13 @@ +from typing import Union + +from attrs import field, define +from attrs.validators import instance_of +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class ContourPlotParameters(PlotParameters): + function_name: str = field(validator=instance_of(str)) + x_axis_maximum: Union[int, float] = field(validator=instance_of((int, float))) + y_axis_maximum: Union[int, float] = field(validator=instance_of((int, float))) + dataset: str = "data" \ No newline at end of file diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..6a88f47 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,22 @@ +from typing import Union, Optional + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable, optional +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class CountDistributionPlotParameters(PlotParameters): + variable: str = field(validator=instance_of(str)) + x_axis_limits: Optional[list[Union[int, float]]] = field( + default=None, + validator=optional( + deep_iterable( + member_validator=instance_of((int, float)), + iterable_validator=instance_of(list), + ) + ), + kw_only=True, + ) + x_axis_logarithmic_scaling: bool = field(validator=instance_of(bool)) + y_axis_logarithmic_scaling: bool = field(validator=instance_of(bool)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..3f71483 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,17 @@ +from typing import Union + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class DensityPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: str = field(validator=instance_of(str)) + data_breakpoints: list[Union[int, float]] = 
field( + validator=deep_iterable( + member_validator=instance_of((int, float)), + iterable_validator=instance_of(list), + ) + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..6421e72 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,41 @@ +from typing import Union, Optional + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable, optional +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class ForestPlotParameters(PlotParameters): + regression_model_names: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ) + regression_model_labels: list[str] = field( + validator=[ + deep_iterable( + member_validator=instance_of(str), + iterable_validator=instance_of(list), + ), + ], + ) + coefficient_names: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ) + sort_by_size: Optional[bool] = field( + default=False, validator=optional(instance_of(bool)) + ) + x_axis_minimum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))) + ) + x_axis_maximum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))) + ) + dotsize: Optional[int] = field(default=5, validator=optional(instance_of(int))) + colors: Optional[list[str]] = field( + default=["orange", "royalblue", "forestgreen", "firebrick"], validator=optional(instance_of(list)) + ) + dataset: str = "data" diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..a146739 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,30 @@ +from typing import Union, Optional + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable, optional +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class 
ForestPlotPairedTTestParameters(PlotParameters): + paired_ttest_names: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ) + paired_ttest_labels: list[str] = field( + validator=[ + deep_iterable( + member_validator=instance_of(str), + iterable_validator=instance_of(list), + ), + ], + ) + x_axis_minimum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))) + ) + x_axis_maximum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))) + ) + dotsize: Optional[int] = field(default=5, validator=optional(instance_of(int))) + dataset: str = "data" diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..8a60399 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,10 @@ +from attrs import field, define +from attrs.validators import instance_of +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class GroupedHistogramParameters(PlotParameters): + group_by: str = field(validator=instance_of(str)) + aggregation_variable: str = field(validator=instance_of(str)) + aggregation_function: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..d5d6293 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,28 @@ +from typing import Iterable, Union + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class HeatmapParameters(PlotParameters): + axis_variables: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(Iterable) + ), + ) + heat_variable: str = field(validator=instance_of(str)) + axis_maxima: list[Union[int, float]] = field( + validator=deep_iterable( + member_validator=instance_of((int, float)), + 
iterable_validator=instance_of(Iterable), + ), + ) + axis_minima: list[Union[int, float]] = field( + validator=deep_iterable( + member_validator=instance_of((int, float)), + iterable_validator=instance_of(Iterable), + ), + ) + logarithmic_heat_scaling: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..a96b4c6 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,19 @@ +from typing import Union, Optional + +from attrs import field, define +from attrs.validators import instance_of, optional +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class HexbinPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: str = field(validator=instance_of(str)) + x_axis_maximum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))), kw_only=True + ) + y_axis_maximum: Optional[Union[int, float]] = field( + default=None, validator=optional(instance_of((int, float))), kw_only=True + ) + trendline: bool = field(validator=instance_of(bool)) + logarithmic_hex_scaling: bool = field(validator=instance_of(bool)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..138a447 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,22 @@ +from typing import Union, Optional + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable, optional +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class HistogramPlotParameters(PlotParameters): + variable: str = field(validator=instance_of(str)) + x_axis_limits: Optional[list[Union[int, float]]] = field( + default=None, + validator=optional( + deep_iterable( + member_validator=instance_of((int, float)), + iterable_validator=instance_of(list), + ) + ), + kw_only=True, + ) + x_axis_logarithmic_scaling: bool = field(validator=instance_of(bool)) + 
y_axis_logarithmic_scaling: bool = field(validator=instance_of(bool)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..da3b388 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,17 @@ +from typing import Optional + +from attrs import field, define +from attrs.validators import instance_of, optional, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class PercentageStackedBarChartPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variables_to_compare: list[str] = field( + validator=deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ) + chart_orientation: str = field(validator=instance_of(str)) + sort_order: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..69d1a96 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,17 @@ +from typing import Iterable, Union + +from attrs import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class RidgelineParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: str = field(validator=instance_of(str)) + data_breakpoints: list[Union[int, float]] = field( + validator=deep_iterable( + member_validator=instance_of((int, float)), + iterable_validator=instance_of(list), + ) + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..f9ca86f --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,9 @@ +from attrs import field, define +from attrs.validators import instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class SimpleScatterPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: str = 
field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..972812c --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,22 @@ +from typing import Optional + +from attrs import field, define +from attrs.validators import instance_of, optional, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class StackedBarChartPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: Optional[str] = field(validator=optional(instance_of(str))) + hue: str = field(validator=instance_of(str)) + chart_orientation: str = field(validator=instance_of(str)) + sort_order: str = field(validator=instance_of(str)) + custom_order: Optional[list[str]] = field( + default=None, + validator=optional( + deep_iterable( + member_validator=instance_of(str), iterable_validator=instance_of(list) + ) + ), + ) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..12837a3 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,16 @@ +from typing import Union + +from attrs import field, define +from attrs.validators import instance_of +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class SurfacePlotParameters(PlotParameters): + function_name: str = field(validator=instance_of(str)) + x_axis_maximum: Union[int, float] = field(validator=instance_of((int, float))) + y_axis_maximum: Union[int, float] = field(validator=instance_of((int, float))) + z_axis_label: str = field(validator=instance_of(str)) + elevation_angle: Union[int, float] = field(validator=instance_of((int, float))) + azimuth_angle: Union[int, float] = field(validator=instance_of((int, float))) + dataset: str = "data" diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..1dc1d5b --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,9 @@ +from attrs import field, define +from attrs.validators import 
instance_of, deep_iterable +from src.data_classes.parameters_visualization import PlotParameters + + +@define +class ViolinPlotParameters(PlotParameters): + variable_x_axis: str = field(validator=instance_of(str)) + variable_y_axis: str = field(validator=instance_of(str)) diff --git a/src/data_classes/ b/src/data_classes/ new file mode 100644 index 0000000..91b53b9 --- /dev/null +++ b/src/data_classes/ @@ -0,0 +1,19 @@ +from typing import Optional + +from attr import define, field +from attr.validators import instance_of, optional + +from src.data_classes.parameters_general import GeneralParameters + + +@define +class PlotParameters(GeneralParameters): + x_axis_label: Optional[str] = field( + default=None, validator=optional(instance_of(str)), kw_only=True + ) + y_axis_label: Optional[str] = field( + default=None, validator=optional(instance_of(str)), kw_only=True + ) + title: Optional[str] = field( + default=None, validator=optional(instance_of(str)), kw_only=True + ) diff --git a/src/data_loading_and_saving/ b/src/data_loading_and_saving/ new file mode 100644 index 0000000..e69de29 diff --git a/src/data_loading_and_saving/ b/src/data_loading_and_saving/ new file mode 100644 index 0000000..e862552 --- /dev/null +++ b/src/data_loading_and_saving/ @@ -0,0 +1,143 @@ +import yaml + +from src.data_classes.parameters_analysis_comparison_variance_in_and_between_group import ( + ComparisonVarianceInAndBetweenGroupParameters, +) +from src.data_classes.parameters_analysis_get_function_inverse_bayes_transformed_regression import ( + GetFunctionInverseBayesTransformedRegressionParameters, +) +from src.data_classes.parameters_analysis_influence_of_up_and_downvotes import ( + InfluenceOfVotesParameters, +) +from src.data_classes.parameters_analysis_regression import ( + BayesianRegressionParameters, + LinearRegressionParameters, + GroupedLinearRegressionParameters, +) +from src.data_classes.parameters_analysis_ttest import ( + TTestParameters, + PairedTTestParameters, +) +from 
src.data_classes.parameters_analysis_pearson_correlation import ( + PearsonCorrelationParameters, +) +from src.data_classes.parameters_descriptive_aggregated import ( + DescriptiveAggregatedParameters, +) +from src.data_classes.parameters_descriptive_overview import ( + DescriptiveOverviewParameters, +) +from src.data_classes.parameters_descriptive_percentage_of_dataset_under_condition import ( + DescriptivePercentageOfDatasetUnderConditionParameters, +) +from src.data_classes.parameters_plot_barchart import BarChartPlotParameters + +from src.data_classes.parameters_plot_boxplot import BoxPlotParameters +from src.data_classes.parameters_plot_contourplot import ContourPlotParameters +from src.data_classes.parameters_plot_count_distribution import ( + CountDistributionPlotParameters, +) +from src.data_classes.parameters_plot_densityplot import DensityPlotParameters +from src.data_classes.parameters_plot_forestplot import ForestPlotParameters +from src.data_classes.parameters_plot_forestplot_paired_ttest import ForestPlotPairedTTestParameters +from src.data_classes.parameters_plot_grouped_histogram import ( + GroupedHistogramParameters, +) +from src.data_classes.parameters_plot_heatmap import HeatmapParameters +from src.data_classes.parameters_plot_hexbinplot import HexbinPlotParameters +from src.data_classes.parameters_plot_histogram import HistogramPlotParameters +from src.data_classes.parameters_plot_percentage_stacked_barchart import ( + PercentageStackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_ridgelineplot import RidgelineParameters +from src.data_classes.parameters_plot_simple_scatterplot import ( + SimpleScatterPlotParameters, +) +from src.data_classes.parameters_plot_stacked_barchart import ( + StackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_surfaceplot import SurfacePlotParameters +from src.data_classes.parameters_plot_violinplot import ViolinPlotParameters + + +def custom_constructor(loader: yaml.Loader, 
tag_suffix: str, node: yaml.Node): + """ + This function handles custom YAML tags and creates an instance of the appropriate class based on the tag suffix. + + Parameters: + loader (yaml.Loader): The YAML loader instance. + tag_suffix (str): The suffix of the YAML tag. This determines the class to be instantiated. + node (yaml.Node): The YAML node to be transformed into a Python dictionary. + + Returns: + _class: An instance of the parameter class that matches the tag suffix, + for example LinearRegressionParameters or BayesianRegressionParameters. + + Raises: + ValueError: If the tag suffix is not a supported type of analysis or visualization. + """ + if tag_suffix == "descriptive_aggregated": + _class = DescriptiveAggregatedParameters + elif tag_suffix == "descriptive_overview": + _class = DescriptiveOverviewParameters + elif tag_suffix == "percentage_of_dataset_under_condition": + _class = DescriptivePercentageOfDatasetUnderConditionParameters + elif tag_suffix == "comparison_variance_in_and_between_group": + _class = ComparisonVarianceInAndBetweenGroupParameters + elif tag_suffix == "linear_regression": + _class = LinearRegressionParameters + elif tag_suffix == "bayesian_regression": + _class = BayesianRegressionParameters + elif tag_suffix == "linear_regression_grouped": + _class = GroupedLinearRegressionParameters + elif tag_suffix == "increase_per_up_and_downvote_from_totalvotes_and_valence": + _class = InfluenceOfVotesParameters + elif tag_suffix == "function_inverse_bayes_transformed_regression": + _class = GetFunctionInverseBayesTransformedRegressionParameters + elif tag_suffix == "ttest": + _class = TTestParameters + elif tag_suffix == "paired_ttest": + _class = PairedTTestParameters + elif tag_suffix == "pearson_correlation": + _class = PearsonCorrelationParameters + elif tag_suffix == "barchart": + _class = BarChartPlotParameters + elif tag_suffix == "boxplot": + _class = BoxPlotParameters + elif tag_suffix == "contourplot": + _class = ContourPlotParameters + elif 
tag_suffix == "histogram": + _class = HistogramPlotParameters + elif tag_suffix == "count_distribution": + _class = CountDistributionPlotParameters + elif tag_suffix == "densityplot": + _class = DensityPlotParameters + elif tag_suffix == "forestplot": + _class = ForestPlotParameters + elif tag_suffix == "forestplot_paired_ttest": + _class = ForestPlotPairedTTestParameters + elif tag_suffix == "grouped_histogram": + _class = GroupedHistogramParameters + elif tag_suffix == "heatmap": + _class = HeatmapParameters + elif tag_suffix == "hexbinplot": + _class = HexbinPlotParameters + elif tag_suffix == "percentage_stacked_barchart": + _class = PercentageStackedBarChartPlotParameters + elif tag_suffix == "ridgelineplot": + _class = RidgelineParameters + elif tag_suffix == "simple_scatterplot": + _class = SimpleScatterPlotParameters + elif tag_suffix == "stacked_barchart": + _class = StackedBarChartPlotParameters + elif tag_suffix == "surfaceplot": + _class = SurfacePlotParameters + elif tag_suffix == "violinplot": + _class = ViolinPlotParameters + else: + raise ValueError( + f"Unexpected tag suffix: {tag_suffix}. Please check for typos or if the analysis is not supported" + ) + + instance_data = loader.construct_mapping(node, deep=True) + return _class(**instance_data) diff --git a/src/data_loading_and_saving/ b/src/data_loading_and_saving/ new file mode 100644 index 0000000..946f497 --- /dev/null +++ b/src/data_loading_and_saving/ @@ -0,0 +1,202 @@ +import datetime +import os +import re +from pathlib import Path + +import pypandoc + + +def create_markdown_report(log_filename: Path, output_name: Path, output_dir: Path): + """ + Generates a markdown report from a specified log file. + + This function reads a log file, processes the content by adding markdown image syntax for any detected plot paths, + and then writes the processed content into a markdown file named after the `output_name` parameter in the specified + `output_dir`. 
The report includes a header with the current date and the processed log content. + + Parameters + ---------- + log_filename : Path + The path to the log file to be processed. + output_name : Path + The base name for the output markdown file (without extension). + output_dir : Path + The directory where the markdown file will be saved. + """ + with open(log_filename, "r") as log_file: + log_content: str = log_file.read() + log_lines: list[str] = log_content.split("\n") + + processed_log_content: str = "" + for line in log_lines: + if "Plot saved at" in line: + image_path: str = re.search(r"Plot saved at (.*)", line).group(1) + line += f"\n\n![](../{image_path})\n" + processed_log_content += line + "\n" + + todays_date: str = datetime.datetime.now().strftime("%B %d, %Y") + markdown_content: str = f"""# Analysis Results for {todays_date} + +{processed_log_content} + + """ + + with open(f"{output_dir}/{output_name}.md", "w") as md_file: + md_file.write(markdown_content) + + +def create_pdf_report(markdown_filename: Path, output_dir: Path, font_size: str = "8pt"): + """ + Generates a PDF report from a markdown file. + This function converts a markdown file into a PDF file directly using pypandoc, + preserving all formatting including code blocks and embedded images. + The PDF file is saved in the specified `output_dir` with the same base name as the `markdown_filename`. + + Parameters + ---------- + markdown_filename : Path + The path to the markdown file to be converted into PDF. + output_dir : Path + The directory where the PDF file will be saved. + font_size : str, optional + The font size for the PDF (default is '8pt'). 
+ """ + os.chdir(output_dir) + + markdown_path: Path = Path(f"{markdown_filename}.md") + new_md_file = split_tables_in_markdown(markdown_path) + new_md_file = split_regression_long_vars(new_md_file) + output_pdf_path: Path = Path(f"{markdown_filename}.pdf") + + fontsize_number = font_size[:-2] + + header_includes = f""" + \\usepackage[utf8]{{inputenc}} + \\usepackage{{lmodern}} + \\usepackage{{adjustbox}} + \\usepackage{{anyfontsize}} + \\usepackage{{geometry}} + \\usepackage{{listings}} + \\lstset{{basicstyle=\\fontsize{{{fontsize_number}}}{{{int(fontsize_number) * 0.9}}}\selectfont}} + \\fontsize{{{fontsize_number}}}{{{int(fontsize_number) * 1.2}}}\\selectfont + """ + + table_settings = f""" + \\usepackage{{etoolbox}} + \\BeforeBeginEnvironment{{tabular}}{{\\begin{{adjustbox}}{{max width=\\textwidth}}}} + \\AfterEndEnvironment{{tabular}}{{\\end{{adjustbox}}}} + """ + + extra_args = [ + '--pdf-engine=xelatex', + '--variable', f'geometry:top=1in, bottom=1in, left=1in, right=1in', + '--variable', f'header-includes:{header_includes}', + '--variable', f'header-includes:{table_settings}', + '-s' + ] + + pypandoc.convert_file(new_md_file, 'pdf', outputfile=output_pdf_path, extra_args=extra_args) + + os.remove(Path(f"{markdown_filename}")) + os.remove(Path(f"{markdown_filename}")) + + +def split_tables_in_markdown(file_path: str, max_width=100) -> Path: + """ + Process a markdown file and split tables that are wider than `max_width` into smaller tables. 
+ """ + with open(file_path, 'r') as f: + lines = f.readlines() + + new_lines: list = [] + table_buffer: list = [] + for line in lines: + if line.startswith('|'): + table_buffer.append(line.rstrip('\n')) + else: + if table_buffer: + new_lines.extend(split_table_into_smaller_ones(table_buffer, max_width)) + table_buffer: list = [] + + new_lines.append(line.rstrip('\n')) + + if table_buffer: + new_lines.extend(split_table_into_smaller_ones(table_buffer, max_width)) + + new_file_path = str(file_path).replace('.md', '') + with open(new_file_path, 'w') as f: + f.write('\n'.join(new_lines)) + + return Path(new_file_path) + + +def split_table_into_smaller_ones(table_lines, max_width: int) -> list[str]: + """ + Given the lines of a table and a maximum width, split the table into multiple smaller tables + """ + new_lines: list = [] + headers: str = table_lines[0].split('|')[1:-1] + rows: list = [row.split('|')[1:-1] for row in table_lines[1:]] + + current_width: int = 0 + start_col: int = 0 + for end_col, header in enumerate(headers, start=1): + current_width += len(header) + 3 + if current_width > max_width or end_col == len(headers): + new_headers: str = headers[start_col:end_col] + new_lines.append('|' + '|'.join(new_headers) + '|') + for row in rows: + new_cells: str = row[start_col:end_col] + new_lines.append('|' + '|'.join(new_cells) + '|') + new_lines.append('\n') + current_width: int = 0 + start_col: str = end_col + + return new_lines + + +def split_regression_long_vars(file_path: Path, max_width: int=105) -> Path: + """ Process a regression result file and split variables that have too long lines + (length > `max_width`) into smaller strings. 
""" + with open(file_path, 'r') as f: + lines = f.readlines() + + new_lines: list = [] + is_ols_block, is_const_met = False, False + + for line in lines: + if 'OLS Regression Results' in line: + is_ols_block = True + if line.strip() == '```': + is_ols_block = False + if is_ols_block: + if 'const' in line: + is_const_met = True + if line.strip().startswith('==='): + is_const_met = False + if is_const_met: + if set(line.strip()) not in [{'-', '+', '='}, {' '}, {''}]: + split_line = line.split(' ') + temp_line = '' + new_line = [] + for word in split_line: + if (len(temp_line + word) <= max_width or '[' in word or ']' in word) and not word.startswith('----'): + temp_line += ' ' + word + else: + new_line.append(temp_line.strip()) + temp_line = word + if temp_line: + new_line.append(temp_line.strip()) + new_lines.extend(new_line) + else: + new_lines.append(line.strip()[:max_width]) + else: + new_lines.append(line.rstrip('\n')) + else: + new_lines.append(line.rstrip('\n')) + + new_file_path = Path(str(file_path).replace('.md', '')) + with open(new_file_path, 'w') as f: + f.write('\n'.join(new_lines)) + + return new_file_path diff --git a/src/data_loading_and_saving/ b/src/data_loading_and_saving/ new file mode 100644 index 0000000..da8039a --- /dev/null +++ b/src/data_loading_and_saving/ @@ -0,0 +1,63 @@ +from typing import Union +from pathlib import Path +import logging + +import pandas as pd +from statsmodels.iolib.summary import Summary +from statsmodels.regression.linear_model import RegressionResults + + +def print_and_save_result( + print_result: bool, + save_result: bool, + filepath: str, + result: Union[str, Summary, RegressionResults, pd.DataFrame], + name_save_file: Path, +) -> None: + """ + Prints and/or saves the analysis result based on the specified conditions. + + This function is designed to handle the output of statistical analysis results, allowing for both printing to the + console and saving to a file. 
The type of the result (e.g., string, DataFrame) determines the format of the saved file. + Logging is used to record the result in a consistent format, facilitating debugging and record-keeping. + + Parameters + ---------- + print_result : bool + A flag indicating whether to print the result to the console. + save_result : bool + A flag indicating whether to save the result to a file. + filepath : str + The base path where the result file will be saved. It is used in conjunction with `name_save_file` + to construct the full file path. + result : Union[str, Summary, RegressionResults, pd.DataFrame] + The result of the analysis. Can be a string, a Summary object, a RegressionResults object, or a pandas DataFrame + name_save_file : Path + The name of the file (without extension) to which the result will be saved. + The extension is determined by the type of `result`. + + Returns + ------- + None + """ + pd.set_option('display.max_columns', None) + pd.set_option('display.width', 1000) + + if isinstance(result, pd.DataFrame): + result_str = result.to_markdown() + + else: + result_str = str(result) +"```") + +"```") + + if print_result: + print(result_str) + if save_result: + if type(result) == pd.DataFrame: + result.to_csv(f"{filepath}{name_save_file}.csv", index=True) + + else: + with open(f"{filepath}{name_save_file}.txt", "w") as f: + f.write(result_str) diff --git a/src/data_loading_and_saving/ b/src/data_loading_and_saving/ new file mode 100644 index 0000000..30b2cd8 --- /dev/null +++ b/src/data_loading_and_saving/ @@ -0,0 +1,36 @@ +import logging +from typing import Optional + +from matplotlib import pyplot as plt + + +def save_plot(save_plots: bool, filepath: str, plot: plt, name: Optional[str], file_format: str = "png"): + """ + This method saves the generated plot to a file. + + Parameters + ---------- + save_plots: bool + A boolean flag to determine whether to save the plot or not. + filepath: str + The path to the directory where the plot will be saved. 
+ plot: plt + The matplotlib.pyplot instance. + name: str + The name of the file to which the plot will be saved (if self.save_plots is True). + file_format: str + The file format in which plot will be saved. By default, it is 'png'. + + Returns + ------- + None + """ + + if save_plots: + output_path = f"{filepath}{name}.{file_format}" + plot.savefig(output_path, bbox_inches="tight", dpi=300) + + plot.close() +"Plot saved at {output_path}") + else: + diff --git a/src/ b/src/ new file mode 100644 index 0000000..7a322d4 --- /dev/null +++ b/src/ @@ -0,0 +1,121 @@ +from typing import Any + +import pandas as pd + + +class Preprocessing: + """ + A class for preprocessing datasets loaded from a parquet file. + + This class provides methods to preprocess datasets according to a given configuration. It supports operations such as + filtering data by order, section, popular sections, users with more comments, and excluding data based on specific values. + + Attributes + ---------- + data : pd.DataFrame + The dataset loaded from the specified parquet file. + + Methods + ------- + preprocess_datasets(preprocessing_config): + Applies a series of preprocessing steps to the dataset based on the provided configuration and returns the modified datasets. + _apply_chain(data, chain): + Applies a chain of preprocessing steps to the given data. + full_data(): + Returns the full dataset without any preprocessing. + data_order(data, order): + Filters the dataset to include only the data with the specified order. + data_section(data, section): + Filters the dataset to include only the data from the specified section. + popular_sections(data, threshold): + Filters the dataset to include only the data from sections with more than a specified number of entries. + users_with_more_comments(data, num_comments): + Filters the dataset to include only the data from users with more than a specified number of comments. 
+ exclude_data_with_value(data, param): + Excludes data from the dataset based on a specified column value. + """ + + def __init__(self, path_to_file: str, name_data: str): + """ + Initializes the Preprocessing class with data loaded from a specified parquet file. + + Parameters + ---------- + path_to_file : str + The path to the directory containing the parquet file. + name_data : str + The name of the parquet file from which to load the data. + """ + = pd.read_parquet(path_to_file + name_data) + + def preprocess_datasets(self, preprocessing_config) -> dict[str, pd.DataFrame]: + """ + Applies a series of preprocessing steps to the dataset based on the provided configuration. + + Parameters + ---------- + preprocessing_config : dict + A dictionary where keys are dataset names and values are lists of preprocessing steps (as dictionaries) to be applied. + + Returns + ------- + dict[str, pd.DataFrame] + A dictionary of preprocessed datasets. + """ + datasets = {} + for dataset_name, config in preprocessing_config.items(): + datasets[dataset_name]: pd.DataFrame = self._apply_chain(, config) + datasets["data"] = self.full_data() + return datasets + + def _apply_chain(self, data: pd.DataFrame, chain: list) -> pd.DataFrame: + """ + Applies a chain of preprocessing steps to the given data. + + Parameters + ---------- + data : pd.DataFrame + The dataset to preprocess. + chain : list + A list of dictionaries, each representing a preprocessing step with a method name and optional parameters. + + Returns + ------- + pd.DataFrame + The preprocessed dataset. 
+ """ + for step in chain: + method_name = step["method"] + param = step.get("param") + method = getattr(self, method_name) + data = method(data) if param is None else method(data, param)  # keep the method's default when no param is configured + return data + + def full_data(self): + return + + @staticmethod + def data_order(data: pd.DataFrame, order: int) -> pd.DataFrame: + return data.loc[data["order"] == order] + + @staticmethod + def data_section(data: pd.DataFrame, section: str) -> pd.DataFrame: + return data.loc[data["section"] == section] + + @staticmethod + def popular_sections(data: pd.DataFrame, threshold: int = 1000) -> pd.DataFrame: + counts = data["section"].value_counts() + popular_sections = counts[counts > threshold].index.tolist() + return data[data["section"].isin(popular_sections)] + + @staticmethod + def users_with_more_comments(data: pd.DataFrame, num_comments: int = 10) -> pd.DataFrame: + counts_user = data["user_id"].value_counts() + active_users = counts_user[counts_user > num_comments].index.tolist() + return data[data["user_id"].isin(active_users)] + + @staticmethod + def exclude_data_with_value(data: pd.DataFrame, param: dict) -> pd.DataFrame: + column: str = param['column'] + value: Any = param['value'] + return data[data[column] != value] diff --git a/src/utils/ b/src/utils/ new file mode 100644 index 0000000..e69de29 diff --git a/src/utils/ b/src/utils/ new file mode 100644 index 0000000..4b4aaac --- /dev/null +++ b/src/utils/ @@ -0,0 +1,40 @@ +from rpy2.robjects import packages as rpackages + + +def is_r_package_installed(package_name: str) -> bool: + """ + Checks if a given R package is installed. + + Parameters + ---------- + package_name : str + The name of the R package to check. + + Returns + ------- + bool + True if the package is installed, False otherwise. + """ + return rpackages.isinstalled(package_name) + + +def install_r_package(package_name: str): + """ + Installs a given R package using CRAN mirror. + + This function selects the first CRAN mirror and installs the specified R package.
+ + Parameters + ---------- + package_name : str + The name of the R package to install. + """ + utils = rpackages.importr("utils") + utils.chooseCRANmirror(ind=1) + utils.install_packages(package_name) + + +if not is_r_package_installed("BAS"): + install_r_package("BAS") + +BAS = rpackages.importr("BAS") diff --git a/src/utils/ b/src/utils/ new file mode 100644 index 0000000..38693a1 --- /dev/null +++ b/src/utils/ @@ -0,0 +1,87 @@ +import inspect +from typing import Any + + +def create_dictionary_of_aggregation_methods( + dependent_variables: list[str], + independent_variable: str, + aggregation_functions: list[str], +) -> dict[str, str]: + """ + Creates a dictionary mapping each variable (dependent and independent) to its specified aggregation function. + + This function is designed to facilitate the aggregation of data by dynamically creating a mapping of variables + to their corresponding aggregation functions. This is particularly useful in data analysis and preprocessing + where different variables may require different methods of aggregation. + + Parameters + ---------- + dependent_variables : list[str] + A list of strings representing the names of the dependent variables. + independent_variable : str + A string representing the name of the independent variable. + aggregation_functions : list[str] + A list of strings representing the aggregation functions to be applied to each variable. The order of functions + in this list should correspond to the order of variables in `dependent_variables` + followed by the `independent_variable`. + + Returns + ------- + dict[str, str] + A dictionary where keys are variable names (both dependent and independent) and values are the corresponding + aggregation functions as strings. + + Raises + ------ + ValueError + If the total number of variables (dependent + independent) + does not match the number of provided aggregation functions. 
+ + Example + ------- + >>> create_dictionary_of_aggregation_methods(['var1', 'var2'], 'var3', ['sum', 'mean', 'count']) + {'var1': 'sum', 'var2': 'mean', 'var3': 'count'} + """ + variables: list[str] = dependent_variables + [independent_variable] + + if len(variables) != len(aggregation_functions): + raise ValueError( + "The number of variables must match the number of aggregation functions." + ) + + return { + variable: aggregation_function + for variable, aggregation_function in zip(variables, aggregation_functions) + } + + +def get_data_name(data: Any) -> str | None: + """ + Retrieves the variable name of the input data as it appears in the caller's scope. + + This function uses introspection to look back into the caller's local variables and find the name of the variable + that references the data passed to this function. This can be useful for debugging or when dynamically generating + output based on variable names. + + Parameters + ---------- + data : Any + The data object whose variable name is to be found. + + Returns + ------- + str or None + The name of the variable as a string if found; otherwise, None.
+ + Example + ------- + >>> my_data = [1, 2, 3] + >>> get_data_name(my_data) + 'my_data' + """ + callers_local_vars = inspect.currentframe().f_back.f_locals.items() + data_name = [ + variable_name for variable_name, variable_value in callers_local_vars if variable_value is data + ] + if data_name: + return data_name[0] diff --git a/src/utils/ b/src/utils/ new file mode 100644 index 0000000..288ac37 --- /dev/null +++ b/src/utils/ @@ -0,0 +1,102 @@ +"""Minor helper functions""" + + +class FunctionData: + def __init__(self, func, params): + self.func = func + self.params = params + + def __call__(self, x, y): + return self.func(x, y, self.params) + + +def transform_to_bayes_corrected_valence( + upvotes: int, downvotes: int, average_valence: float, weight_factor_m: float +) -> float: + """ + Transforms upvotes and downvotes into a Bayesian-corrected valence. + + Parameters + ---------- + upvotes: int + The number of upvotes. + downvotes: int + The number of downvotes. + average_valence: float + The average valence. + weight_factor_m: float + The weight factor m. + + Returns + ------- + bayes_corrected_value: float + The Bayesian-corrected valence, a weighted average of the raw valence (between -0.5 and 0.5) and `average_valence`. + """ + valence = -(downvotes / (downvotes + upvotes)) + 0.5  # 0.5 with only upvotes, -0.5 with only downvotes + totalvotes = upvotes + downvotes + + bayes_corrected_value = caluculate_bayes_correction( + valence, totalvotes, weight_factor_m, average_valence + ) + return bayes_corrected_value + + +def caluculate_bayes_correction( + measure: float, volume: float, weight: float, average_measure: float +) -> float: + """ + Calculates the Bayesian-corrected measure as a volume-weighted average of the measure and the average measure. + + Parameters + ---------- + measure: float + The measure. + volume: float + The volume. + weight: float + The weight factor m. It controls how strongly the average measure is weighted relative to the volume. + average_measure: float + The average_measure.
+ + Returns + ------- + bayes_corrected_measure: float + The Bayesian-corrected measure, a weighted average of `measure` and `average_measure`. + """ + bayes_corrected_measure = ( + volume / (volume + weight) * measure + + weight / (volume + weight) * average_measure + ) + + return bayes_corrected_measure + + +def calculate_inverse_bayes_correction( + bayes_corrected_value: float, + volume: float, + weight_factor_m: float, + average_measure: float, +) -> float: + """ + Inverts the Bayesian correction, recovering the uncorrected measure from a corrected value. + + Parameters + ---------- + bayes_corrected_value: float + The Bayesian-corrected value. + volume: float + The volume (e.g., the average number of total votes). Must be non-zero. + weight_factor_m: float + The weight factor m. + average_measure: float + The average_measure. + + Returns + ------- + inverse_bayes_corrected_value: float + The uncorrected measure that yields `bayes_corrected_value` for the given volume. + """ + inverse_bayes_corrected_value = ( + ((volume + weight_factor_m) / volume) * (bayes_corrected_value - weight_factor_m / (volume + weight_factor_m) * average_measure) + ) + return inverse_bayes_corrected_value diff --git a/src/utils/ b/src/utils/ new file mode 100644 index 0000000..1ed20b5 --- /dev/null +++ b/src/utils/ @@ -0,0 +1,260 @@ +import logging + +from src.data_classes.parameters_analysis_regression import ( + LinearRegressionParameters, + BayesianRegressionParameters, + GroupedLinearRegressionParameters, +) +from src.data_classes.parameters_descriptive_aggregated import ( + DescriptiveAggregatedParameters, +) +from src.data_classes.parameters_descriptive_overview import ( + DescriptiveOverviewParameters, +) +from src.data_classes.parameters_descriptive_percentage_of_dataset_under_condition import ( + DescriptivePercentageOfDatasetUnderConditionParameters, +) +from src.data_classes.parameters_plot_barchart import BarChartPlotParameters +from src.data_classes.parameters_plot_boxplot import BoxPlotParameters +from src.data_classes.parameters_plot_contourplot import ContourPlotParameters +from
src.data_classes.parameters_plot_count_distribution import ( + CountDistributionPlotParameters, +) +from src.data_classes.parameters_plot_densityplot import DensityPlotParameters +from src.data_classes.parameters_plot_forestplot import ForestPlotParameters +from src.data_classes.parameters_plot_forestplot_paired_ttest import ( + ForestPlotPairedTTestParameters, +) +from src.data_classes.parameters_plot_grouped_histogram import ( + GroupedHistogramParameters, +) +from src.data_classes.parameters_plot_heatmap import HeatmapParameters +from src.data_classes.parameters_plot_hexbinplot import HexbinPlotParameters +from src.data_classes.parameters_plot_histogram import HistogramPlotParameters +from src.data_classes.parameters_plot_percentage_stacked_barchart import ( + PercentageStackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_ridgelineplot import RidgelineParameters +from src.data_classes.parameters_plot_simple_scatterplot import ( + SimpleScatterPlotParameters, +) +from src.data_classes.parameters_plot_stacked_barchart import ( + StackedBarChartPlotParameters, +) +from src.data_classes.parameters_plot_surfaceplot import SurfacePlotParameters +from src.data_classes.parameters_plot_violinplot import ViolinPlotParameters + + +def log_descriptive_analysis_details(config): +"Descriptive Analysis: {config.name_save_file}") +"```") +"Data: {config.dataset}") + if isinstance(config, DescriptiveAggregatedParameters): +"Variables: {config.variables}") +"Aggregation Function: {config.aggregation_function}") +"Group By: {config.group_by}") + elif isinstance(config, DescriptiveOverviewParameters): +"Metrics: {config.metrics}") +"Group By: {config.group_by}") + elif isinstance(config, DescriptivePercentageOfDatasetUnderConditionParameters): +"Variable: {config.variable}") +"Comparison: {config.comparison}") +"Condition: {config.condition}") + else: + raise ValueError( + f"Invalid type of descriptive analysis requested {type(config)}" + ) +"```") + + +def 
log_regression_details(config, regression_type: str): +"{regression_type} Regression Analysis: {config.name_save_file}") +"```") +"Independent Variables: {config.independent_variables}") +"Dependent Variable: {config.dependent_variable}") +"Data: {config.dataset}") + if isinstance(config, LinearRegressionParameters): +"Standardize: {config.standardize}") +"Report effect size: {config.report_effect_size}") + elif isinstance(config, BayesianRegressionParameters): + pass + elif isinstance(config, GroupedLinearRegressionParameters): +"Independent Variables: {config.independent_variables}") +"Grouped by: {config.group_by}") +"Aggregation methods: {config.dictionary_aggregation_methods}") +"Standardize: {config.standardize}") +"Report effect size: {config.report_effect_size}") + + f"Print detailed coefficients: {config.print_detailed_coefficients}" + ) + else: + raise ValueError(f"Unknown parameter type: {type(config)}") + +"```") + + +def log_get_function_inverse_bayes_transformed_regression(config): + + f"Get Function Inverse Bayes Transformed Regression: {config.name_save_file}" + ) +"```") +"Data: {}") +"Model Name: {config.model_name}") +"```") + + +def log_influence_of_up_and_downvotes_on_replies(config): +"Influence of Up and Downvotes on Replies: {config.name_save_file}") +"```") + + f"Weight as Distribution Quantile: {config.weight_as_distribution_quantile}" + ) +"Weight m: {config.weight_m}") +"Model Name: {config.model_name}") +"Step: {config.step}") +"Startpoint: {config.startpoint}") +"```") + + +def log_comparison_variance_details(config): +"Comparison Variance Analysis: {config.name_save_file}") +"```") +"Variable: {config.variable}") +"Group: {}") +"Data: {}") +"```") + + +def log_pearson_correlation_details(config): +"Pearson Correlation Analysis: {config.name_save_file}") +"```") +"Variable 1: {config.variable_1}") +"Variable 2: {config.variable_2}") +"Data: {config.dataset}") +"```") + + +def log_ttest_details(config, ttest_type: str): +"{ttest_type} 
TTest Analysis: {config.name_save_file}") +"```") +"Variable 1: {config.variable_1}") +"Variable 2: {config.variable_2}") +"Data: {config.dataset}") +"```") + + +def log_visualization_details(config): +"Visualization: {config.name_save_file}") +"```") +"Data: {config.dataset}") +"Title: {config.title}") + if isinstance(config, BarChartPlotParameters): +"Creating Bar Chart") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"Chart Orientation: {config.chart_orientation}") +"Sort Order: {config.sort_order}") + elif isinstance(config, BoxPlotParameters): +"Creating Box Plot") +"Variable 1: {config.variable_1}") +"Variable 2: {config.variable_2}") +"X-Axis Label: {config.x_axis_label}") +"Y-Axis Label: {config.y_axis_label}") + elif isinstance(config, ContourPlotParameters): +"Creating Contour Plot") +"Function Name: {config.function_name}") +"X-Axis Maximum: {config.x_axis_maximum}") +"Y-Axis Maximum: {config.y_axis_maximum}") + elif isinstance(config, HistogramPlotParameters): +"Creating Histogram Plot") +"Variable: {config.variable}") +"X-Axis Limits: {config.x_axis_limits}") +"X-Axis Logarithmic Scaling: {config.x_axis_logarithmic_scaling}") +"Y-Axis Logarithmic Scaling: {config.y_axis_logarithmic_scaling}") + elif isinstance(config, CountDistributionPlotParameters): +"Creating Count Distribution Plot") +"Variable: {config.variable}") +"X-Axis Limits: {config.x_axis_limits}") +"X-Axis Logarithmic Scaling: {config.x_axis_logarithmic_scaling}") +"Y-Axis Logarithmic Scaling: {config.y_axis_logarithmic_scaling}") + elif isinstance(config, DensityPlotParameters): +"Creating Density Plot") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"Data Breakpoints: {config.data_breakpoints}") + elif isinstance(config, ForestPlotParameters): +"Creating Forest Plot") +"Regression Model Names: {config.regression_model_names}") +"Coefficient Names: {config.coefficient_names}") +"X-Axis Minimum: 
{config.x_axis_minimum}") +"X-Axis Maximum: {config.x_axis_maximum}") +"Dotsize: {config.dotsize}") + elif isinstance(config, ForestPlotPairedTTestParameters): +"Creating Forest Plot Paired TTest") +"Paired TTest Names: {config.paired_ttest_names}") +"X-Axis Minimum: {config.x_axis_minimum}") +"X-Axis Maximum: {config.x_axis_maximum}") +"Dotsize: {config.dotsize}") + elif isinstance(config, GroupedHistogramParameters): +"Creating Grouped Histogram") +"Group By: {config.group_by}") +"Aggregation Variable: {config.aggregation_variable}") +"Aggregation Function: {config.aggregation_function}") +"X-Axis Label: {config.x_axis_label}") +"Y-Axis Label: {config.y_axis_label}") + elif isinstance(config, HeatmapParameters): +"Creating Heatmap") +"Axis Variables: {config.axis_variables}") +"Heat Variable: {config.heat_variable}") +"Max Axis Values: {config.axis_maxima}") +"Min Axis Values: {config.axis_minima}") +"Log Scaling: {config.logarithmic_heat_scaling}") + elif isinstance(config, HexbinPlotParameters): +"Creating Hexbin Plot") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"X Axis Maximum: {config.x_axis_maximum}") +"Y Axis Maximum: {config.y_axis_maximum}") +"Trendline: {config.trendline}") +"Log Scaling: {config.logarithmic_hex_scaling}") + elif isinstance(config, PercentageStackedBarChartPlotParameters): +"Creating Percentage Stacked Bar Chart") +"Variable X: {config.variable_x_axis}") +"Variables to compare: {config.variables_to_compare}") +"Chart Orientation: {config.chart_orientation}") +"Sort Order: {config.sort_order}") + elif isinstance(config, RidgelineParameters): +"Creating Ridgeline Plot") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"Data Breakpoints: {config.data_breakpoints}") + elif isinstance(config, SimpleScatterPlotParameters): +"Creating Simple Scatter Plot") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") + elif isinstance(config, 
StackedBarChartPlotParameters): +"Creating Stacked Bar Chart") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"Hue: {config.hue}") +"Chart Orientation: {config.chart_orientation}") +"Sort Order: {config.sort_order}") + elif isinstance(config, SurfacePlotParameters): +"Creating Surface Plot") +"Function Name: {config.function_name}") +"X-Axis Maximum: {config.x_axis_maximum}") +"Y-Axis Maximum: {config.y_axis_maximum}") +"X-Axis Label: {config.x_axis_label}") +"Y-Axis Label: {config.y_axis_label}") +"Z-Axis Label: {config.z_axis_label}") +"Elevation Angle: {config.elevation_angle}") +"Azimuth Angle: {config.azimuth_angle}") + elif isinstance(config, ViolinPlotParameters): +"Creating Violin Plot") +"Variable X: {config.variable_x_axis}") +"Variable Y: {config.variable_y_axis}") +"X-Axis Label: {config.x_axis_label}") +"Y-Axis Label: {config.y_axis_label}") + else: + raise ValueError(f"Invalid type of visualization requested {type(config)}") +"```")
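
The Bayesian correction helpers added in this patch are a weighted average and its algebraic inverse, so a round trip should return the raw valence. The following is a minimal sketch of that check, not the repository's code: `bayes_correction` and `inverse_bayes_correction` are hypothetical stand-ins that restate the two formulas from the patch, and the vote counts, weight, and average valence are made-up illustrative numbers.

```python
import math


def bayes_correction(measure, volume, weight, average_measure):
    """Hypothetical stand-in: volume-weighted average of measure and average_measure."""
    total = volume + weight
    return volume / total * measure + weight / total * average_measure


def inverse_bayes_correction(corrected, volume, weight, average_measure):
    """Hypothetical stand-in: recover the raw measure (requires volume > 0)."""
    total = volume + weight
    return (total / volume) * (corrected - weight / total * average_measure)


# Illustrative numbers only: 30 upvotes, 10 downvotes, m = 10, average valence 0.1.
upvotes, downvotes = 30, 10
total_votes = upvotes + downvotes
valence = -(downvotes / total_votes) + 0.5  # raw valence, here 0.25

corrected = bayes_correction(valence, total_votes, 10, 0.1)
recovered = inverse_bayes_correction(corrected, total_votes, 10, 0.1)

print(round(corrected, 4))               # 0.22
print(math.isclose(recovered, valence))  # True
```

With few votes the corrected value is pulled towards the average, and with many votes it approaches the raw valence. The inverse divides by the volume, so it is undefined when there are no votes.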