Data for the HMC (Human Machine Communication) project
================
# Variables
An overview of all item-level variables can be found in the [item
reference](item_refrence.md) and in the [Excel codebook](HMC_codebook.xlsx).
These files show which variables were collected in each wave.
# Folder and file organisation
## Folders
* `01_raw_data` contains the downloaded files from Qualtrics
* `02_anonymized_data` contains the anonymized data files (apart from the
anonymization, these can still be considered raw data)
* `03_cleaned` contains data files with harmonized variable names; in addition,
some incorrect variable names were fixed and duplicate entries from subjects
who completed a wave more than once were removed; see `cleaning.R` and below
for more details
## Files
* `HMC_codebook.xlsx` contains all variable names for all waves, with the
original descriptions as presented in Qualtrics and the original variable
names as well as the harmonized variable names
* `HMC_variables.xlsx` contains an overview of the variables, their origin, who
requested them, etc. This file is for internal use and is not committed with
the public version
# Data collection and data files
The data collection was done in Qualtrics. The following projects are on
https://kmrc.qualtrics.com/:
* `AI_Trends_Wave1`
* `AI_Trends_Wave2`
* `AI_Trends_Wave3`
* `AI_Trends_Wave4`
* `AI_Trends_Wave4_sample2`
* `AI_Trends_Wave5`
* `AI_Trends_Wave5_sample2`
* `AI_Trends_Wave6`
* `AI_Trends_Wave6_sample2`
## Sample
### Sample 2 data files
Subjects from the first wave who did not participate in the following waves
were invited again after...
<!-- TODO: Add more details -->
## Download settings in Qualtrics
The data were downloaded from Qualtrics as CSV files with the following
settings.
### Overall
- Download all fields
- Export values
### CSV
- Recode seen but unanswered questions as -99
- Recode seen but unanswered multi-value fields as 0
- Split multi-value fields into columns
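
With these settings, a "seen but unanswered" question arrives in the CSV as
`-99`. The project's own scripts are written in R; the Python sketch below only
illustrates one way such a code could be turned into a missing value when
reading an export. The function name `read_export` and the example columns are
hypothetical, and the `0` used for multi-value fields is deliberately left
untouched.

```python
import csv
import io

MISSING_CODE = "-99"  # "seen but unanswered" code from the export settings


def read_export(handle):
    """Read a CSV export and turn the -99 code into None (missing)."""
    reader = csv.DictReader(handle)
    return [
        {name: (None if value == MISSING_CODE else value)
         for name, value in row.items()}
        for row in reader
    ]


# Hypothetical two-column export, used only for illustration.
example = io.StringIO("subj_id,trust_fav\nsubj0001,4\nsubj0002,-99\n")
rows = read_export(example)
```

All values stay strings here; converting them to numbers would be a separate
step.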
# Data anonymization
After download from Qualtrics, files were put in the respective folders for each
wave in `03_data/01_raw_data/wave*`. The script
`03_data/01_raw_data/anonymization.R` mainly removes the `PROLIFIC_IDs` from the
data and adds an anonymized ID `subj_id` with entries `subj0001 - subj1009` to
all data sets.
Irrelevant columns -- mostly automatically created by Qualtrics -- are also
removed. See `anonymization.R` for details.
The anonymized data files are saved to `03_data/02_anonymized_data/` as
CSV files with file names `HMC_<wave>_anonymized.csv`.
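
The idea behind the anonymization step can be sketched as follows. This is not
the code in `anonymization.R` but a minimal Python illustration under
assumptions: the column name `PROLIFIC_ID`, the function names, and the rule
that IDs are numbered in order of first appearance are all hypothetical. The
point it shows is that one mapping is built across all waves, so a subject keeps
the same `subj_id` everywhere.

```python
def build_id_map(prolific_ids):
    """Assign subj0001, subj0002, ... in order of first appearance."""
    id_map = {}
    for pid in prolific_ids:
        if pid not in id_map:
            id_map[pid] = f"subj{len(id_map) + 1:04d}"
    return id_map


def anonymize(rows, id_map, id_column="PROLIFIC_ID"):
    """Return copies of rows with the Prolific ID replaced by subj_id."""
    anonymized = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != id_column}
        new_row["subj_id"] = id_map[row[id_column]]
        anonymized.append(new_row)
    return anonymized


# Made-up IDs and values, for illustration only.
wave1 = [{"PROLIFIC_ID": "abc", "age": "31"}, {"PROLIFIC_ID": "def", "age": "45"}]
wave2 = [{"PROLIFIC_ID": "def", "age": "45"}]
id_map = build_id_map(r["PROLIFIC_ID"] for w in (wave1, wave2) for r in w)
wave2_anon = anonymize(wave2, id_map)
```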
# Data preprocessing
After data anonymization, some further basic preprocessing was done on the
data with the script `03_data/02_anonymized_data/cleaning.R`. In particular,
the original variable names in Qualtrics were harmonized so they all follow the
same structure.
The cleaned data files are saved to `03_data/03_cleaned_data/` as
CSV files with file names `HMC_<wave>_cleaned.csv`.
The following section gives an overview of the problems in the data that
required cleaning.
## Problems
### with variable names over waves
* `trust_fav` and `Q161` and `Q162`
* `obj_know` and `Q158`
* intention labels are swapped
--> `int_use_bhvr_fav = int_use_bhvr_noUser` and vice versa
* ...
<!-- TODO: Add more details -->
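
As an illustration of the swapped intention labels, the following Python sketch
exchanges the values stored under the two names. It is only a sketch: the helper
`swap_columns` and the example row are hypothetical, which waves are affected is
not stated here, and the actual fix lives in `cleaning.R`.

```python
def swap_columns(row, first, second):
    """Return a copy of row with the values under two column names exchanged."""
    swapped = dict(row)
    swapped[first], swapped[second] = row[second], row[first]
    return swapped


# Made-up values, for illustration only.
row = {"subj_id": "subj0001", "int_use_bhvr_fav": "2", "int_use_bhvr_noUser": "5"}
fixed = swap_columns(row, "int_use_bhvr_fav", "int_use_bhvr_noUser")
```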
### with subjects
* Two entries in wave 1: `subj0762`
* Three entries in wave 3: `subj1009`
* We kept the first entry for each subject
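
Keeping the first entry per subject can be sketched like this in Python. It
assumes that "first" means the first row in file order, which is not confirmed
here; the function name and the example rows are hypothetical.

```python
def keep_first_entry(rows, id_column="subj_id"):
    """Keep only the first row seen for each subject, preserving row order."""
    seen = set()
    kept = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])
            kept.append(row)
    return kept


# Made-up rows: subj0762 appears twice, as in wave 1.
wave1_rows = [
    {"subj_id": "subj0761", "entry": "a"},
    {"subj_id": "subj0762", "entry": "b"},
    {"subj_id": "subj0762", "entry": "c"},
]
deduplicated = keep_first_entry(wave1_rows)
```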
# TODOs
* Add more preprocessing steps like variable renaming?
* Get age (and other descriptives?) for subj1008 and subj1009 from Prolific
data?