# Preprocessing the HMC Longitudinal Study

[License: GPL-3.0](https://www.gnu.org/licenses/gpl-3.0)

This is a Python application designed to preprocess data from the HMC longitudinal study.
It transforms survey data into a structured format suitable for analysis, using YAML configuration files that document the different questionnaires employed over the study waves.

## Features

- Flexible data preprocessing using YAML configuration files
- Automatic generation of database documentation (Markdown and PDF)
- Support for multiple output formats (CSV, SQLite)
- Processing and validation of scales and composite scores across multiple survey waves
- Modular architecture for easy extensibility

## Installation

Clone the repository and install using pip:

```bash
git clone https://gitea.iwm-tuebingen.de/HMC/preprocessing.git
cd preprocessing
pip install .
```

This uses the `pyproject.toml` file for all dependencies and build instructions.
Note that the project requires Python 3.10 or higher, and use of [virtual environments](https://docs.python.org/3/library/venv.html) is recommended.

## Usage

### 1. Global Settings

To process the data, first create the global settings file `settings.yaml` in the root directory of the project.
You can follow the example provided in `settings_example.yaml` to create your own settings file.
The main settings define the locations of the configuration and data files.

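A minimal `settings.yaml` might look like the following sketch. The key names here are illustrative assumptions only; refer to `settings_example.yaml` for the actual keys:

```yaml
# Illustrative sketch; the authoritative keys are in settings_example.yaml.
config_dir: config     # directory containing the questionnaire YAML files
data_dir: data/raw     # directory containing the raw survey exports
results_dir: results   # directory for the per-wave CSV/Excel exports
```
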
### 2. Configuration Files

The project uses YAML configuration files to define the structure of the questionnaires and the processing steps.
Check the `config` directory to make sure that all required questionnaires are present.

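As a sketch of what such a questionnaire configuration could contain (the field names below are hypothetical, not the project's actual schema):

```yaml
# Hypothetical questionnaire config; field names are illustrative assumptions.
scale: delegation_comfort
waves: [1, 2, 3]
response_range: {min: 1, max: 5}
items:
  - item_1
  - item_2
  - item_3
inverted_items: [item_2]   # reverse-coded before scoring
score: mean                # how the scale score is computed
```
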
### 3. Running the Preprocessing

To run the preprocessing, use the command-line interface:

```bash
python HMC_preprocessing.py
```

### Output

The preprocessing generates several output files:

- SQLite database file `hmc_data.db` containing the processed data in a denormalized format
- Markdown documentation `database_api_reference.md` describing the database schema
- PDF documentation `database_api_reference.pdf` describing the database schema

Furthermore, each wave is exported as a separate CSV or Excel file in the `results` directory.

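Once generated, the database can be inspected with Python's standard `sqlite3` module, for example to list its tables (the table names depend on the configured questionnaires):

```python
# Minimal sketch: inspect the generated database. Assumes the preprocessing
# has been run and `hmc_data.db` exists in the working directory.
import sqlite3

conn = sqlite3.connect("hmc_data.db")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)  # table names depend on the configured questionnaires
conn.close()
```
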
## Code Structure

Our approach centers on the flexible processing of a wide range of psychological scales and composite measures across multiple survey waves.
The project leverages YAML configuration files to describe the structure, scoring, and validation rules for each questionnaire,
allowing new scales or response formats to be integrated with minimal code changes.

For each wave, the system reads the relevant configuration, imports the raw data, and applies the specified processing logic
(such as item inversion, custom scoring, and subgroup filtering) entirely based on the configuration.
This enables researchers to adapt the pipeline to evolving study designs or new measurement instruments without modifying the core codebase.

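As an illustration, configuration-driven item inversion and scale scoring could look like the following sketch. The function names, the config keys, and the mean-based scoring rule are illustrative assumptions, not the project's actual API:

```python
# Hypothetical sketch of configuration-driven item inversion and scale scoring;
# names and the scoring rule (mean of items) are illustrative assumptions.
from statistics import mean


def invert_item(value: int, scale_min: int, scale_max: int) -> int:
    """Reverse-code a Likert item, e.g. 1..5 becomes 5..1."""
    return scale_min + scale_max - value


def score_scale(responses: dict[str, int], config: dict) -> float:
    """Apply inversion to the configured items, then average."""
    items = []
    for name, value in responses.items():
        if name in config.get("invert", []):
            value = invert_item(value, config["min"], config["max"])
        items.append(value)
    return mean(items)


config = {"min": 1, "max": 5, "invert": ["item_2"]}
print(score_scale({"item_1": 4, "item_2": 2, "item_3": 5}, config))
# item_2 is reverse-coded from 2 to 4, so the score is mean(4, 4, 5)
```
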
Processed data from all waves is consolidated into a unified database, and the schema is automatically documented.
The modular design ensures that each step, from data import to scale computation and documentation, is
transparent, reproducible, and easily extensible for future requirements.

To achieve this, the following modules are used:

- `settings_loader.py`: Loads and validates global settings
- `data_loader.py`: Imports raw survey data
- `scale_processor.py`: Processes individual scales
- `composite_processor.py`: Computes composite scores (e.g. for the user and non-user groups)
- `process_all_waves.py`: Orchestrates processing across all waves
- `database_populator.py`: Exports processed data to the database
- `database_documentation_generator.py`: Generates database documentation
- `logging_config.py`: Configures logging for the entire process

## Additional Information

- If a combined scale is used (combining `user` and `no_user` items), the group-specific scales are not included in the dataset.
- Boolean columns are saved as 0/1 in CSV and Excel files for better interoperability.
- If single items are retained, they are named `{scale_name}-item_{item_number}`.
- In `impact_of_delegation_on_skills`, the value 6 is coded as NA as it does not fit the answer scale.
- Hope and concern is coded so that higher values indicate higher hope.
- In wave 1, `delegation_comfort` item 3 was always NA, so no Cronbach's alpha is computed for it.

## Contributing

Before making a pull request:

- Please review the CONTRIBUTING.md guidelines.
- Add corresponding unit or integration tests for new code.
- Ensure your code passes linting and typing checks (black, ruff, mypy).

For development, you can install the package and all dev dependencies in editable mode:

```bash
pip install -e ".[dev]"
```

Note that the project uses [pre-commit](https://pre-commit.com/) to ensure code quality and consistency,
and that the main branch is protected so that all tests must pass before merging.

## License

The code in this project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for details.

## Contact
|
|
For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de. |