Data for the HMC (Human Machine Communication) project
================
# Variables
An item-level overview of all variables can be found in the [item
reference](item_refrence.md) and in the [Excel codebook](HMC_codebook.xlsx).
These files show which variables were collected in each wave.
# Folder and file organisation
## Folders
* `01_raw_data` contains the downloaded files from Qualtrics
* `02_anonymized_data` contains the anonymized data files (otherwise this can
still be considered raw data)
* `03_cleaned_data` contains data files with harmonized variable names; in
addition, some incorrect variable names were fixed and duplicate entries from
subjects who completed a wave more than once were removed; see `cleaning.R` and
the "Data cleaning" section below for details
## Files
* `HMC_codebook.xlsx` contains all variable names for all waves, with the
original descriptions as presented in Qualtrics and the original variable
names as well as the harmonized variable names
* `HMC_variables.xlsx` contains an overview of the variables, their origin, who
requested them, etc. This file is for internal use and is not committed with
the public version
# Data collection and data files
The data collection was done in Qualtrics. The following projects are on
https://kmrc.qualtrics.com/:
* `AI_Trends_Wave1`
* `AI_Trends_Wave2`
* `AI_Trends_Wave3`
* `AI_Trends_Wave4`
* `AI_Trends_Wave4_sample2`
* `AI_Trends_Wave5`
* `AI_Trends_Wave5_sample2`
* `AI_Trends_Wave6`
* `AI_Trends_Wave6_sample2`
## Sample
### Sample 2 data files
After wave 3, we re-invited wave-1 participants for waves 4 to 6 to increase
statistical power for questions that did not require participation in all six
waves. This departed from our original plan to invite only participants from the
immediately preceding wave, because ongoing monitoring showed that many
non-users remained non-users and that relatively few participants perceived AI
as a social actor. To capture more contemporary usage and obtain sufficient
variation for research questions that filter for individuals who perceived AI as
a social actor, we broadened recruitment in wave 4 to all wave-1 participants.
## Download settings in Qualtrics
The data were downloaded from Qualtrics as CSV files with the following
settings.
### Overall
- Download all fields
- Export values
### CSV
- Recode seen but unanswered questions as -99
- Recode seen but unanswered multi-value fields as 0
- Split multi-value fields into columns
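
With these settings, `-99` marks a question that was seen but not answered, so it should be treated as missing rather than as a numeric response when the files are read. A minimal sketch in Python using only the standard library; the column names in the sample are hypothetical, and real exports may contain additional header rows that this sketch does not handle:

```python
import csv
import io

MISSING_CODE = "-99"  # "seen but unanswered", per the export settings above


def read_rows(handle):
    """Yield one dict per CSV row, mapping the -99 code to None."""
    for row in csv.DictReader(handle):
        yield {name: (None if value == MISSING_CODE else value)
               for name, value in row.items()}


# Hypothetical two-row export; column names are made up for illustration.
sample = io.StringIO("respondent,item_a,item_b\nr1,3,-99\nr2,-99,5\n")
rows = list(read_rows(sample))
```

Values of `0` in split multi-value columns are left untouched here, since they are a distinct code in the export settings.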
# Data anonymization
After download from Qualtrics, the files were placed in the folder for the
respective wave, `03_data/01_raw_data/wave*`. The script
`03_data/01_raw_data/anonymization.R` mainly removes the `PROLIFIC_IDs` from the
data and adds an anonymized ID `subj_id` with entries `subj0001` to `subj1009`
to all data sets.
Irrelevant columns - mostly automatically created by Qualtrics - are also
removed. See `anonymization.R` for details.
The anonymized data files are saved to `03_data/02_anonymized_data/` as
CSV files with file names `HMC_<wave>_anonymized.csv`.
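
The ID replacement can be illustrated as follows. This is only a sketch in Python and not the logic of `anonymization.R`: the column name `PROLIFIC_ID`, the order in which IDs are assigned, and the function name `anonymize` are assumptions. The point it shows is that the same lookup table has to be reused for every wave so that a subject keeps the same `subj_id`.

```python
def anonymize(rows, key=None, id_column="PROLIFIC_ID"):
    """Replace the platform ID with a sequential subj_id.

    `key` maps original IDs to anonymized IDs; pass the key returned for an
    earlier wave so that returning subjects keep their subj_id.
    `id_column` is an assumed column name.
    """
    key = {} if key is None else key
    anonymized = []
    for row in rows:
        original = row[id_column]
        if original not in key:
            key[original] = f"subj{len(key) + 1:04d}"
        rest = {name: value for name, value in row.items() if name != id_column}
        anonymized.append({"subj_id": key[original], **rest})
    return anonymized, key


# Hypothetical data for two waves; IDs and values are placeholders.
wave1 = [{"PROLIFIC_ID": "abc", "item_a": "1"}, {"PROLIFIC_ID": "def", "item_a": "2"}]
wave2 = [{"PROLIFIC_ID": "def", "item_a": "3"}]

wave1_anon, id_key = anonymize(wave1)
wave2_anon, id_key = anonymize(wave2, key=id_key)
```

The lookup table itself links original and anonymized IDs, so it would need to stay out of any public version of the data.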
# Data cleaning
After data anonymization, some basic data cleaning was done with the script
`03_data/02_anonymized_data/cleaning.R`. In particular, the original variable
names in Qualtrics were harmonized so that they all follow the same structure.
The cleaned data files are saved to `03_data/03_cleaned_data/` as CSV files with
file names `HMC_<wave>_cleaned.csv`.
The following section gives an overview of the problems in the data that
required cleaning.
## Problems
### with variable names
* For the variables on which tasks subjects would delegate to AI, there were
some inconsistencies in the naming. This affected _only_ the variable names;
the items were presented correctly to the subjects. The following variables
were renamed:
  - `delg_tsk_typs_4 --> delg_tsk_typs_3`
  - `delg_tsk_typs_5 --> delg_tsk_typs_4`
  - `delg_tsk_typs_6 --> delg_tsk_typs_5`
  - `delg_tsk_typs_7 --> delg_tsk_typs_6`
  - `delg_tsk_typs_8 --> delg_tsk_typs_7`
  - `delg_tsk_typs_8` was deleted
* The labels of the intention variables were swapped by accident and this was
corrected:
  - `int_use_bhvr_fav = int_use_bhvr_noUser` and vice versa
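
Because each new name in the list above is also the old name of the next variable, the renames have to be applied in one pass rather than one after another; otherwise a later rename could pick up a column that was already renamed. A minimal Python sketch of this idea, not the code in `cleaning.R`. It assumes that the intention-variable fix is a swap of the two column names and that no column named `delg_tsk_typs_3` existed before renaming; the sample row is hypothetical, and the deletion listed above is not covered.

```python
# Shift delg_tsk_typs_4..8 down by one, as listed above.
RENAMES = {f"delg_tsk_typs_{i}": f"delg_tsk_typs_{i - 1}" for i in range(4, 9)}
# Assumption: the intention-variable fix is a swap of the two column names.
RENAMES.update({
    "int_use_bhvr_fav": "int_use_bhvr_noUser",
    "int_use_bhvr_noUser": "int_use_bhvr_fav",
})


def rename_columns(row, renames=RENAMES):
    """Return a copy of `row` with all renames applied in a single pass."""
    renamed = {renames.get(name, name): value for name, value in row.items()}
    if len(renamed) != len(row):
        raise ValueError("renaming produced duplicate column names")
    return renamed


# Hypothetical row; the values are placeholders.
row = {
    "delg_tsk_typs_1": "a", "delg_tsk_typs_2": "b",
    "delg_tsk_typs_4": "c", "delg_tsk_typs_5": "d",
    "int_use_bhvr_fav": "x", "int_use_bhvr_noUser": "y",
}
renamed = rename_columns(row)
```

Building a new dictionary from the old one means every lookup uses the original names, which is what makes both the chained renames and the swap safe.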
### with subjects
* `subj0762` had two entries in wave 1
* `subj1009` had three entries in wave 3
* For subjects with multiple entries, we kept the first entry
* `subj1009` was then removed from the dataset entirely because it appeared only
in wave 3 and it is unclear how this happened; only subjects who participated
in wave 1 were invited to participate in further waves
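
The two rules above (keep the first entry per subject, then drop subjects who were not in wave 1) can be sketched in Python as follows. This is an illustration rather than the code in `cleaning.R`; it assumes that "first" means first in file order, and the function names and sample rows are hypothetical.

```python
def keep_first_entry(rows, id_column="subj_id"):
    """Keep only the first row for each subject, in file order."""
    seen = set()
    kept = []
    for row in rows:
        if row[id_column] not in seen:
            seen.add(row[id_column])
            kept.append(row)
    return kept


def restrict_to_wave1(rows, wave1_ids, id_column="subj_id"):
    """Drop rows from subjects who did not take part in wave 1."""
    return [row for row in rows if row[id_column] in wave1_ids]


# Hypothetical wave-3 rows; values are placeholders.
wave3 = [
    {"subj_id": "subj0001", "item_a": "1"},
    {"subj_id": "subj1009", "item_a": "2"},
    {"subj_id": "subj1009", "item_a": "3"},
    {"subj_id": "subj0001", "item_a": "4"},
]
wave1_ids = {"subj0001", "subj0762"}

deduplicated = keep_first_entry(wave3)
cleaned = restrict_to_wave1(deduplicated, wave1_ids)
```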
# Data preprocessing
The final data preprocessing creates scales from the collected items. It was
done in Python and the code for the preprocessing can be found in a separate
code repository: https://gitea.iwm-tuebingen.de/HMC/preprocessing. The files
with the final variables for each scale are then saved in the folder
`03_data/04_preprocessed_data` as CSV files with file names
`HMC_<wave>_preprocessed.csv`.