# Preprocessing the HMC Longitudinal Study

[License: GPL-3.0](https://www.gnu.org/licenses/gpl-3.0)

This is a Python application designed to preprocess data from the HMC longitudinal study.
It transforms survey data into a structured format suitable for analysis, using YAML configuration files that document the different questionnaires employed over the study waves.

## Features

- Flexible data preprocessing using YAML configuration files
- Automatic generation of database documentation (Markdown and PDF)
- Support for multiple output formats (CSV, SQLite)
- Processing and validation of scales and composite scores across multiple survey waves
- Modular architecture for easy extensibility

## Installation

Clone the repository and install using pip:

```bash
git clone https://gitea.iwm-tuebingen.de/HMC/preprocessing.git
cd preprocessing
pip install .
```

This uses the `pyproject.toml` file for all dependencies and build instructions.
Note that the project requires Python 3.10 or higher, and use of [virtual environments](https://docs.python.org/3/library/venv.html) is recommended.

## Usage

### 1. Global Settings

To process the data, first create the global settings file `settings.yaml` in the root directory of the project.
You can follow the example provided in `settings_example.yaml` to create your own settings file.
The main settings define the locations of the configuration and data files.

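A minimal `settings.yaml` might look like the following sketch. The key names here are illustrative assumptions only; refer to `settings_example.yaml` for the actual keys:

```yaml
# Illustrative sketch; the authoritative keys are in settings_example.yaml.
config_dir: config     # directory containing the questionnaire YAML files
data_dir: data/raw     # directory containing the raw survey exports
results_dir: results   # directory for the per-wave CSV/Excel exports
```
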
### 2. Configuration Files

The project uses YAML configuration files to define the structure of the questionnaires and the processing steps.
Check the `config` directory to make sure that all required questionnaires are present.

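As a sketch of what such a questionnaire configuration could contain (the field names below are hypothetical, not the project's actual schema):

```yaml
# Hypothetical questionnaire config; field names are illustrative assumptions.
scale: delegation_comfort
waves: [1, 2, 3]
response_range: {min: 1, max: 5}
items:
  - item_1
  - item_2
  - item_3
inverted_items: [item_2]   # reverse-coded before scoring
score: mean                # how the scale score is computed
```
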
### 3. Running the Preprocessing

To run the preprocessing, use the command-line interface:

```bash
python HMC_preprocessing.py
```

### Output

The preprocessing generates several output files:

- SQLite database file `hmc_data.db` containing the processed data in a denormalized format
- Markdown documentation `database_api_reference.md` describing the database schema
- PDF documentation `database_api_reference.pdf` describing the database schema

Furthermore, each wave is exported as a separate CSV or Excel file in the `results` directory.

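Once generated, the database can be inspected with Python's standard `sqlite3` module, for example to list its tables (the table names depend on the configured questionnaires):

```python
# Minimal sketch: inspect the generated database. Assumes the preprocessing
# has been run and `hmc_data.db` exists in the working directory.
import sqlite3

conn = sqlite3.connect("hmc_data.db")
tables = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(tables)  # table names depend on the configured questionnaires
conn.close()
```
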
## Code Structure

Our approach centers on the flexible processing of a wide range of psychological scales and composite measures across multiple survey waves.
The project leverages YAML configuration files to describe the structure, scoring, and validation rules for each questionnaire,
allowing new scales or response formats to be integrated with minimal code changes.

For each wave, the system reads the relevant configuration, imports the raw data, and applies the specified processing logic
(such as item inversion, custom scoring, and subgroup filtering) entirely based on the configuration.
This enables researchers to adapt the pipeline to evolving study designs or new measurement instruments without modifying the core codebase.

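As an illustration, configuration-driven item inversion and scale scoring could look like the following sketch. The function names, the config keys, and the mean-based scoring rule are illustrative assumptions, not the project's actual API:

```python
# Hypothetical sketch of configuration-driven item inversion and scale scoring;
# names and the scoring rule (mean of items) are illustrative assumptions.
from statistics import mean


def invert_item(value: int, scale_min: int, scale_max: int) -> int:
    """Reverse-code a Likert item, e.g. 1..5 becomes 5..1."""
    return scale_min + scale_max - value


def score_scale(responses: dict[str, int], config: dict) -> float:
    """Apply inversion to the configured items, then average."""
    items = []
    for name, value in responses.items():
        if name in config.get("invert", []):
            value = invert_item(value, config["min"], config["max"])
        items.append(value)
    return mean(items)


config = {"min": 1, "max": 5, "invert": ["item_2"]}
print(score_scale({"item_1": 4, "item_2": 2, "item_3": 5}, config))
# item_2 is reverse-coded from 2 to 4, so the score is mean(4, 4, 5)
```
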
Processed data from all waves is consolidated into a unified database, and the schema is automatically documented.
The modular design ensures that each step, from data import to scale computation and documentation, is
transparent, reproducible, and easily extensible for future requirements.

To achieve this, the following modules are used:

- `settings_loader.py`: Loads and validates global settings
- `data_loader.py`: Imports raw survey data
- `scale_processor.py`: Processes individual scales
- `composite_processor.py`: Computes composite scores (e.g. for the user and non-user groups)
- `process_all_waves.py`: Orchestrates processing across all waves
- `database_populator.py`: Exports processed data to the database
- `database_documentation_generator.py`: Generates database documentation
- `logging_config.py`: Configures logging for the entire process

## Additional Information

- If a combined scale is used (combining `user` and `no_user` items), the group-specific scales are not included in the dataset.
- Boolean columns are saved as 0/1 in CSV and Excel files for better interoperability.
- If single items are retained, they are named `{scale_name}-item_{item_number}`.
- In `impact_of_delegation_on_skills`, the value 6 is coded as NA as it does not fit the answer scale.
- Hope and concern is coded so that higher values indicate higher hope.
- In wave 1, `delegation_comfort` item 3 was always NA, so no Cronbach's alpha is computed for it.

## Contributing

Before making a pull request:

- Please review the CONTRIBUTING.md guidelines.
- Add corresponding unit or integration tests for new code.
- Ensure your code passes linting and typing checks (black, ruff, mypy).

For development, you can install the package and all dev dependencies in editable mode:

```bash
pip install -e ".[dev]"
```

Note that the project uses [pre-commit](https://pre-commit.com/) to ensure code quality and consistency,
and that the main branch is protected so that all tests must pass before merging.

## License

The code in this project is licensed under the GNU General Public License v3.0. See the [LICENSE](LICENSE) file for details.

## Contact
|
|
For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de. |