# uncongeniality_preprocessor

`uncongeniality_preprocessor` is a Python application designed to preprocess comment and article data scraped from DER SPIEGEL. Building on the `spon_scraper` application, the project cleans the data, merges it, and saves a final analysis-ready dataset. The information included in the dataset can be tailored to the planned analysis. In particular, including the text body of comments increases the size of the dataset significantly and should only be done if necessary.

Please note that due to the restructuring of the DER SPIEGEL website in December 2023, and the subsequent discontinuation of `spon_scraper`, this code is intended to process already collected datasets.

## Features

- **Data Preprocessing**: Efficiently cleans and preprocesses the extracted article and comment data.
- **Data Merging**: Merges the article and comment data into a single consolidated dataset based on the specified conditions.
- **Bayesian Corrections**: Performs Bayesian corrections on measures within the comment data.
- **Data Saving**: Saves the preprocessed data in Parquet format.

## Setup

To set up `uncongeniality_preprocessor`, follow these steps:

### Prerequisites

Ensure you have Python 3.10 or later installed. All code was tested under Python 3.10.13.

### Installation

1. **Clone the repository**:

   ```bash
   git clone https://gitea.iwm-tuebingen.de/ganders/uncongeniality_preprocessing.git
   cd uncongeniality_preprocessor
   ```

2. **Install the required Python packages**:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Configuration

Amend the settings in the `settings.json` file as per your requirements and specify the paths to the collected article and comment data.

### Running the Preprocessor

To run the preprocessor, use the following command:

```bash
python main.py
```

This command starts the preprocessing based on the parameters specified in your `settings.json` file.
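For orientation, a `settings.json` might look like the sketch below. All field names here are illustrative assumptions, not the actual schema; consult the `settings.json` shipped with the repository for the real keys.

```json
{
  "article_data_path": "data/articles/",
  "comment_data_path": "data/comments/",
  "dataset_name": "spiegel_comments_2023",
  "include_comment_body": false
}
```

Keeping `include_comment_body` disabled is the safer default, since the comment text is what drives most of the dataset's size.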
## Output Data Structure

The preprocessed data is saved as a structured Parquet file, named according to the dataset name provided in the settings file.

## License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

## Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.