# uncongeniality_preprocessor

`uncongeniality_preprocessor` is a Python application designed to preprocess comment and article data scraped from DER SPIEGEL. Building on the `spon_scraper` application, the project cleans the data, merges it, and saves a final analysis-ready dataset. The information included in the dataset can be tailored to the planned analysis. In particular, including the text body of comments increases the size of the dataset significantly and should only be done if necessary.

Please note that due to the restructuring of the DER SPIEGEL website in December 2023, and the subsequent discontinuation of `spon_scraper`, this code is intended to process already collected datasets.

## Features

- **Data Preprocessing**: Efficiently cleans and preprocesses the extracted article and comment data.
- **Data Merging**: Merges the article and comment data into a single consolidated dataset based on the specified conditions.
- **Bayesian Corrections**: Performs Bayesian corrections on measures within the comment data.
- **Data Saving**: Saves the preprocessed data in Parquet format.

## Setup

To set up `uncongeniality_preprocessor`, follow these steps:

### Prerequisites

Ensure you have Python 3.10 or later installed. All code was tested under Python 3.10.13.

### Installation

1. **Clone the repository**:

   ```bash
   git clone https://gitea.iwm-tuebingen.de/ganders/uncongeniality_preprocessing.git
   cd uncongeniality_preprocessor
   ```

2. **Install the required Python packages**:

   ```bash
   pip install -r requirements.txt
   ```

## Usage

### Configuration

Amend the settings in the `settings.json` file as per your requirements and specify the paths to the collected article and comment data.

### Running the Preprocessor

To run the preprocessor, use the following command:

```bash
python main.py
```

This command starts the preprocessing based on the parameters specified in your `settings.json` file.
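For orientation, a `settings.json` might look like the sketch below. All field names here are illustrative assumptions, not the actual schema; consult the `settings.json` shipped with the repository for the real keys.

```json
{
  "article_data_path": "data/articles/",
  "comment_data_path": "data/comments/",
  "dataset_name": "spiegel_comments_2023",
  "include_comment_body": false
}
```

Keeping `include_comment_body` disabled is the safer default, since the comment text is what drives most of the dataset's size.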
## Output Data Structure

The preprocessed data is saved as a structured Parquet file, named according to the dataset name provided in the settings file.

## License

This project is licensed under the GNU General Public License v3.0. See the LICENSE file for more details.

## Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.