uncongeniality_preprocessor is a Python application for preprocessing comment and article data scraped from DER SPIEGEL. Building on the spon_scraper application, it cleans and merges the data and saves a final, analysis-ready dataset. The contents of the dataset can be configured according to what the planned analysis requires. Including the text body of comments, in particular, increases the size of the dataset significantly and should only be done if necessary.

Please note that, due to the restructuring of the DER SPIEGEL website in December 2023 and the subsequent discontinuation of the spon_scraper, this code is intended only for processing datasets that have already been collected.

Features

Data Preprocessing: Efficiently cleans and preprocesses the extracted article and comment data.
Data Merging: Merges the article and comment data into a single consolidated dataset based on the specified conditions.
Bayesian Corrections: Performs Bayesian corrections on measures within the comment data.
Data Saving: Saves the preprocessed data in Parquet format.
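
This README does not say which Bayesian correction the project applies. As a purely illustrative sketch, assume the measure is a proportion (for example, a share of positive reactions per comment) and the correction shrinks it toward a prior mean using a Beta prior. The function name and all parameters below are hypothetical and are not taken from the project code.

```python
def bayesian_corrected_rate(successes, trials, prior_mean, prior_strength):
    """Shrink an observed proportion toward a prior mean (Beta-Binomial posterior mean).

    Hypothetical helper for illustration only; the project's actual
    correction may differ.
    """
    if prior_strength <= 0:
        raise ValueError("prior_strength must be positive")
    if not 0 <= prior_mean <= 1:
        raise ValueError("prior_mean must lie in [0, 1]")
    if successes < 0 or trials < successes:
        raise ValueError("need 0 <= successes <= trials")

    # Express the prior as pseudo-counts of a Beta(alpha, beta) distribution.
    alpha = prior_mean * prior_strength
    beta = (1 - prior_mean) * prior_strength

    # Posterior mean: observed counts plus prior pseudo-counts.
    return (successes + alpha) / (trials + alpha + beta)
```

With this form, a comment with very few observations stays close to the prior mean, while a comment with many observations is dominated by its own data.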

Setup

To set up uncongeniality_preprocessor, follow these steps:

Prerequisites

Ensure you have Python 3.10 or later installed. All code was tested under Python 3.10.13.

Installation

Clone the repository:

git clone https://gitea.iwm-tuebingen.de/ganders/uncongeniality_preprocessing.git
cd uncongeniality_preprocessing

Install the required Python packages:

pip install -r requirements.txt

Usage

Configuration

Adjust the settings in the settings.json file to your requirements, and specify the paths to the collected article and comment data.
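
Before starting a long run, it can help to check that the settings file is complete. The sketch below shows one way to do this. The key names are assumptions made for illustration and may not match the keys actually used in settings.json, so adapt them to the file shipped with the repository.

```python
import json

# Hypothetical key names; check settings.json in the repository for the real ones.
REQUIRED_KEYS = ("article_data_path", "comment_data_path", "dataset_name")


def load_settings(path="settings.json"):
    """Load the settings file and fail early if an expected key is missing."""
    with open(path, encoding="utf-8") as handle:
        settings = json.load(handle)

    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f"settings file is missing keys: {', '.join(missing)}")
    return settings
```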

Running the Preprocessor

To run the preprocessor, use the following command:

python main.py

This command will start the preprocessing based on the parameters specified in your settings.json file.

Output Data Structure

The preprocessed data is saved as a Parquet file whose name is derived from the dataset name given in the settings file.
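
To work with the output, the file can be read back with pandas. The following is a minimal sketch that assumes the file is named `<dataset name>.parquet` and that pandas with a Parquet engine such as pyarrow is installed. The naming scheme, the helper names, and any column names passed in are assumptions, not part of the project.

```python
from pathlib import Path


def output_path(dataset_name, output_dir="."):
    """Build the assumed output location: <output_dir>/<dataset_name>.parquet."""
    return Path(output_dir) / f"{dataset_name}.parquet"


def load_dataset(dataset_name, columns=None, output_dir="."):
    """Read the preprocessed dataset, optionally restricted to some columns."""
    # Imported here so the path helper works without pandas installed.
    import pandas as pd

    return pd.read_parquet(output_path(dataset_name, output_dir), columns=columns)
```

Passing a list of column names via `columns` loads only those columns, which keeps memory use down when the dataset includes the comment text bodies.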

License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.