2025-12-15 13:47:28 +01:00

Preprocessing HMC longitudinal study

License: GPL v3

This Python application preprocesses data from the HMC longitudinal study. It transforms survey data into a structured format suitable for analysis, using YAML configuration files that document the questionnaires administered across the study waves.

Features

  • Flexible data preprocessing using YAML configuration files
  • Automatic generation of database documentation (Markdown and PDF)
  • Support for multiple output formats (CSV, SQLite)
  • Processing and validation of scales and composite scores across multiple survey waves
  • Modular architecture for easy extensibility

Installation

Clone the repository and install using pip:

git clone https://gitea.iwm-tuebingen.de/HMC/preprocessing.git
cd preprocessing
pip install .

This uses the pyproject.toml file for all dependencies and build instructions. Note that the project requires Python 3.10 or higher; using a virtual environment is recommended.

Usage

1. Global Settings

To process the data, first create the global settings file settings.yaml in the root directory of the project. You can follow the example provided in settings_example.yaml. The settings define the locations of the configuration and data files.
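For illustration, a minimal settings.yaml might look like the sketch below. The authoritative keys are defined by settings_example.yaml; the key names used here are assumptions, not the project's real schema:

```yaml
# Illustrative sketch only -- consult settings_example.yaml for the real keys.
config_dir: ./config        # location of the questionnaire YAML files
data_dir: ./data            # location of the raw survey exports
results_dir: ./results      # where CSV/Excel exports are written
```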

2. Configuration Files

The project uses YAML configuration files to define the structure of the questionnaires and the processing steps. Check the config directory to make sure that all required questionnaires are present.
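As a rough illustration of what such a questionnaire configuration could contain (the actual schema is defined by the files in the config directory; every key below is hypothetical):

```yaml
# Hypothetical sketch of a questionnaire config; the project's
# real schema is defined by the files in the config directory.
delegation_comfort:
  items: [item_1, item_2, item_3]
  inverted: [item_2]          # items reflected before scoring
  scale: {min: 1, max: 5}
  waves: [1, 2, 3]
```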

3. Running the Preprocessing

To run the preprocessing, you can use the command line interface:

python HMC_preprocessing.py

Output

The preprocessing will generate several output files:

  • SQLite database hmc_data.db containing the processed data in a denormalized format.
  • Markdown documentation database_api_reference.md describing the database schema.
  • PDF documentation database_api_reference.pdf describing the database schema.

Furthermore, each wave is exported as a separate CSV or Excel file in the results directory.
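After a preprocessing run, the generated database can be inspected with Python's standard sqlite3 module. The table names depend on the configured questionnaires, so none are assumed here:

```python
import sqlite3

# Open the database produced by the preprocessing run.
con = sqlite3.connect("hmc_data.db")

# List all tables to see how the waves were stored; the actual
# table names depend on your configuration files.
tables = [row[0] for row in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
)]
print(tables)
con.close()
```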

Code Structure

Our approach centers on the flexible processing of a wide range of psychological scales and composite measures across multiple survey waves. The project leverages YAML configuration files to describe the structure, scoring, and validation rules for each questionnaire, allowing new scales or response formats to be integrated with minimal code changes.

For each wave, the system reads the relevant configuration, imports the raw data, and applies the specified processing logic, such as item inversion, custom scoring, and subgroup filtering, entirely based on the configuration. This enables researchers to adapt the pipeline to evolving study designs or new measurement instruments without modifying the core codebase.
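The configuration-driven processing can be illustrated with a small sketch. The config keys (items, inverted, scale_min, scale_max) and the mean-score rule are illustrative assumptions, not the project's actual schema:

```python
from statistics import mean

def score_scale(responses: dict[str, float], config: dict) -> float:
    """Compute a scale score from raw item responses.

    Inverted items are reflected on the response range before
    averaging, e.g. on a 1..5 scale a response of 2 becomes 4.
    The keys used here are hypothetical, not the project's schema.
    """
    lo, hi = config["scale_min"], config["scale_max"]
    values = []
    for item in config["items"]:
        value = responses[item]
        if item in config.get("inverted", []):
            value = lo + hi - value  # reflect the response
        values.append(value)
    return mean(values)

config = {
    "items": ["item_1", "item_2", "item_3"],
    "inverted": ["item_2"],
    "scale_min": 1,
    "scale_max": 5,
}
score = score_scale({"item_1": 4, "item_2": 2, "item_3": 5}, config)
print(score)  # item_2 is reflected from 2 to 4 before averaging
```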

Processed data from all waves is consolidated into a unified database, and the schema is automatically documented. The modular design ensures that each step, from data import to scale computation and documentation, is transparent, reproducible, and easily extensible for future requirements.

In order to achieve this the following modules are used:

  • settings_loader.py: Loads and validates global settings
  • data_loader.py: Imports raw survey data
  • scale_processor.py: Processes individual scales
  • composite_processor.py: Computes composite scores (e.g. for the user and non-user groups)
  • process_all_waves.py: Orchestrates processing across all waves
  • database_populator.py: Exports processed data to the database
  • database_documentation_generator.py: Generates database documentation
  • logging_config.py: Configures logging for the entire process
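How these modules might fit together can be sketched as follows. The function bodies are stubs and the names are placeholders; the real APIs live in the modules listed above:

```python
# Hypothetical orchestration sketch. In the real project these roles
# are filled by data_loader.py, scale_processor.py, and
# database_populator.py, coordinated by process_all_waves.py.

def load_raw_data(wave: int) -> list[dict]:
    # Stub standing in for data_loader.py.
    return [{"respondent": 1, "item_1": 3}]

def process_scales(rows: list[dict]) -> list[dict]:
    # Stub standing in for scale_processor.py / composite_processor.py.
    return rows

def export(rows: list[dict], wave: int) -> None:
    # Stub standing in for database_populator.py.
    print(f"wave {wave}: exported {len(rows)} rows")

def process_all_waves(waves: range) -> None:
    # Mirrors the role of process_all_waves.py: one pass per wave.
    for wave in waves:
        export(process_scales(load_raw_data(wave)), wave)

process_all_waves(range(1, 3))
```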

Additional Information

  • If a combined scale is used (combining user and no_user items), the group-specific scales are not included in the dataset.
  • Boolean columns are saved as 0/1 in CSV and Excel files for better interoperability.
  • If single items are retained, they are named {scale_name}-item_{item_number}.
  • In impact_of_delegation_on_skills, the value 6 is coded as NA, as it does not fit the answer scale.
  • The hope and concern scale is coded so that higher values indicate higher hope.
  • In wave 1, delegation_comfort item 3 was always NA, so no Cronbach's alpha is computed.

Contributing

To contribute, please follow these steps before making a pull request:

  • Please review the CONTRIBUTING.md guidelines.
  • Add corresponding unit or integration tests for new code.
  • Ensure your code passes linting and typing checks (black, ruff, mypy).

For development, you can install the package and all dev dependencies in editable mode:

pip install -e ".[dev]"

Note that the project uses pre-commit to ensure code quality and consistency, and that the main branch is protected to ensure that all tests pass before merging.

License

The code in this project is licensed under the GNU General Public License v3.0. See the LICENSE file for details.

Contact

For questions, feedback, or to report issues, please contact Gerrit Anders at g.anders@iwm-tuebingen.de.

Description
Public preprocessing code for HMC project (longitudinal study on AI usage). Responsible: Gerrit Anders.