Data for the HMC (Human Machine Communication) project
================

# Variables

An overview of all variables at the item level can be found in the [item
reference](item_refrence.md) and in the [Excel codebook](HMC_codebook.xlsx).
These files show which variables were collected in each wave.

# Folder and file organisation

## Folders

* `01_raw_data` contains the downloaded files from Qualtrics
* `02_anonymized_data` contains the anonymized data files (apart from the
  anonymization, these can still be considered raw data)
* `03_cleaned_data` contains data files with harmonized variable names;
  additionally, some incorrect variable names were fixed and duplicate entries
  from subjects who completed a wave two or more times were removed; see
  `cleaning.R` and the sections below for more details

## Files

* `HMC_codebook.xlsx` contains all variable names for all waves, with the
  original descriptions as presented in Qualtrics and the original variable
  names as well as the harmonized variable names
* `HMC_variables.xlsx` contains an overview of the variables, their origin, who
  wanted them in the data, etc. This file is for internal use and is not
  committed with the public version

# Data collection and data files

The data collection was done in Qualtrics. The following projects are on
https://kmrc.qualtrics.com/:

* `AI_Trends_Wave1`
* `AI_Trends_Wave2`
* `AI_Trends_Wave3`
* `AI_Trends_Wave4`
* `AI_Trends_Wave4_sample2`
* `AI_Trends_Wave5`
* `AI_Trends_Wave5_sample2`
* `AI_Trends_Wave6`
* `AI_Trends_Wave6_sample2`

## Sample

### Sample 2 data files

Subjects from the first wave who did not participate in the following waves
were invited again after...

<!-- TODO: Add more details -->

## Download settings in Qualtrics

The data were downloaded from Qualtrics as CSV files with the following
settings.

### Overall

- Download all fields
- Export values

### CSV

- Recode seen but unanswered questions as -99
- Recode seen but unanswered multi-value fields as 0
- Split multi-value fields into columns
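
With these settings, a value of `-99` in the CSV files marks a question that was seen but not answered, whereas `0` in a split multi-value column is a regular "not selected" value. A minimal sketch of how the `-99` code could be turned into a missing value on import is shown below. It uses Python's standard library purely for illustration (the project's own scripts are written in R); the function name and the column names `item_a` and `item_b` are made up for the example.

```python
import csv
import io

MISSING_CODE = "-99"  # code chosen in the Qualtrics export settings above


def read_with_missing(handle):
    """Read CSV rows as dicts, replacing the -99 code with None."""
    reader = csv.DictReader(handle)
    return [
        {name: (None if value == MISSING_CODE else value)
         for name, value in row.items()}
        for row in reader
    ]


# Tiny in-memory example; the column names are hypothetical.
sample = io.StringIO("subj_id,item_a,item_b\nsubj0001,3,-99\nsubj0002,-99,5\n")
rows = read_with_missing(sample)
```

Values are kept as strings here; converting them to numbers would be a separate step.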

# Data anonymization

After download from Qualtrics, the files were put in the respective folder for
each wave in `03_data/01_raw_data/wave*`. The script
`03_data/01_raw_data/anonymization.R` mainly removes the `PROLIFIC_IDs` from the
data and adds an anonymized ID `subj_id` with entries `subj0001` to `subj1009`
to all data sets.

Irrelevant columns -- mostly automatically created by Qualtrics -- are also
removed. See `anonymization.R` for details.

The anonymized data files are saved to `03_data/02_anonymized_data/` as
CSV files with file names `HMC_<wave>_anonymized.csv`.
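
The idea behind the ID replacement can be sketched as follows. This is not the code from `anonymization.R` but a small Python illustration under assumptions: the ID column is taken to be named `PROLIFIC_ID`, labels are assigned in order of first appearance, and the function names are hypothetical. The actual script may assign the labels differently.

```python
def assign_subject_ids(prolific_ids):
    """Map each distinct ID to a label such as 'subj0001', in first-seen order."""
    mapping = {}
    for pid in prolific_ids:
        if pid not in mapping:
            # Zero-padded to four digits, matching the subj0001 style
            mapping[pid] = f"subj{len(mapping) + 1:04d}"
    return mapping


def anonymize(rows, mapping, id_column="PROLIFIC_ID"):
    """Return copies of the rows with the ID column replaced by subj_id."""
    out = []
    for row in rows:
        new_row = {k: v for k, v in row.items() if k != id_column}
        new_row["subj_id"] = mapping[row[id_column]]
        out.append(new_row)
    return out


# Hypothetical rows; the same participant appears twice.
raw_rows = [
    {"PROLIFIC_ID": "abc", "item_a": "3"},
    {"PROLIFIC_ID": "xyz", "item_a": "4"},
    {"PROLIFIC_ID": "abc", "item_a": "5"},
]
mapping = assign_subject_ids(r["PROLIFIC_ID"] for r in raw_rows)
anon_rows = anonymize(raw_rows, mapping)
```

Because the same mapping is reused for every data set, a participant keeps the same `subj_id` across waves.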

# Data preprocessing

After data anonymization, some basic preprocessing was done on the data with
the script `03_data/02_anonymized_data/cleaning.R`. In particular, the original
variable names in Qualtrics were harmonized so that they all follow the same
structure.

The cleaned data files are saved to `03_data/03_cleaned_data/` as
CSV files with file names `HMC_<wave>_cleaned.csv`.

The following section gives an overview of the problems in the data that
required cleaning.

## Problems

### with variable names over waves

* `trust_fav` and `Q161` and `Q162`
* `obj_know` and `Q158`
* intention labels are swapped
  --> `int_use_bhvr_fav = int_use_bhvr_noUser` and vice versa
* ...

<!-- TODO: Add more details -->
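
As an illustration of the swapped intention labels, the sketch below exchanges the values of the two columns in a single row. It is a Python example, not the code from `cleaning.R`, and it assumes that "swapped" means the values stored under one name belong under the other. Which waves are affected is not stated here, so the fix would have to be restricted to those waves. The function name and the example values are made up.

```python
def swap_columns(row, name_a, name_b):
    """Return a copy of row with the values of two columns exchanged."""
    fixed = dict(row)
    fixed[name_a], fixed[name_b] = row[name_b], row[name_a]
    return fixed


# Hypothetical row from an affected wave.
row = {"subj_id": "subj0001", "int_use_bhvr_fav": "1", "int_use_bhvr_noUser": "7"}
fixed_row = swap_columns(row, "int_use_bhvr_fav", "int_use_bhvr_noUser")
```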

### with subjects

* Two entries in wave 1: `subj0762`
* Three entries in wave 3: `subj1009`
* We kept the first entry for each subject
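
Keeping the first entry per subject can be sketched as below. This is a Python illustration, not the code from `cleaning.R`. It assumes that "first" means the first row in file order and that the rows of one wave are already sorted chronologically. The function name and the column `item_a` are hypothetical.

```python
def keep_first_entry(rows, id_column="subj_id"):
    """Keep only the first row for each subject, preserving row order."""
    seen = set()
    kept = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])
            kept.append(row)
    return kept


# Hypothetical wave with a subject who completed the survey three times.
wave_rows = [
    {"subj_id": "subj1008", "item_a": "2"},
    {"subj_id": "subj1009", "item_a": "4"},
    {"subj_id": "subj1009", "item_a": "5"},
    {"subj_id": "subj1009", "item_a": "1"},
]
deduplicated = keep_first_entry(wave_rows)
```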

# TODOs

* Add more preprocessing steps like variable renaming?

* Get age (and other descriptives?) for subj1008 and subj1009 from Prolific
  data?