Body:
- Generated preprocessed data from preprocessing repo @ e0f6561
- Data repo base before update: bf4efcc
- Run timestamp (UTC): 2025-12-15T12:33:55Z
- Target dir: 03_data/04_preprocessed_data
Data for the HMC (Human Machine Communication) project
Variables
An overview of all variables on item level can be found in the item reference and in the EXCEL codebook. These files show which variables have been collected in each wave.
Folder and file organisation
Folders
- 01_raw_data contains the downloaded files from Qualtrics
- 02_anonymized_data contains the anonymized data files (otherwise this can still be considered raw data)
- 03_cleaned contains data files with harmonized variable names; additionally, some incorrect variable names were fixed and duplicate entries from subjects who completed a wave two or more times were removed; see cleaning.R and below for more details
- 04_preprocessed_data contains the preprocessed data files with scales created from the items; see database_api_reference.md for a description of the scales
Files
- HMC_codebook.xlsx contains all variable names for all waves, with the original descriptions as presented in Qualtrics, the original variable names, and the harmonized variable names
- HMC_variables.xlsx contains an overview of the variables, their origin, who requested them, etc. This file is for internal use and is not committed with the public version
Data collection and data files
The data collection was done in Qualtrics. The following projects are on https://kmrc.qualtrics.com/:
- AI_Trends_Wave1
- AI_Trends_Wave2
- AI_Trends_Wave3
- AI_Trends_Wave4
- AI_Trends_Wave4_sample2
- AI_Trends_Wave5
- AI_Trends_Wave5_sample2
- AI_Trends_Wave6
- AI_Trends_Wave6_sample2
Sample
Sample 2 data files
After wave 3, we re-invited wave-1 participants for waves 4–6 to increase statistical power for questions that did not require participation in all six waves. This departed from our original plan to invite only participants from the immediately preceding wave, because ongoing monitoring showed that many non-users remained non-users and that relatively few participants perceived AI as a social actor. To capture more contemporary usage and obtain sufficient variation for research questions that filter for individuals who perceived AI as a social actor, we broadened recruitment in wave 4 to all wave-1 participants.
Download settings in Qualtrics
The data were downloaded from Qualtrics as CSV files with the following settings.
Overall
- Download all fields
- Export values
CSV
- Recode seen but unanswered questions as -99
- Recode seen but unanswered multi-value fields as 0
- Split multi-value fields into columns
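With these export settings, -99 marks a question that was seen but not answered, so it is usually safer to treat it as missing than as a numeric value when reading the CSV files. The following is a minimal sketch using only the Python standard library. The function name and the sample columns are hypothetical, and the sketch does not deal with any additional header rows a Qualtrics export may contain.

```python
import csv
import io

MISSING_CODE = "-99"  # export setting: seen but unanswered questions


def read_rows(handle, missing_code=MISSING_CODE):
    """Read CSV rows as dicts, replacing the missing code with None."""
    reader = csv.DictReader(handle)
    return [
        {key: (None if value == missing_code else value) for key, value in row.items()}
        for row in reader
    ]


# Hypothetical two-column sample, not taken from the real data.
sample = io.StringIO("subj_id,item_1\nsubj0001,3\nsubj0002,-99\n")
rows = read_rows(sample)
```

Note that a 0 in a multi-value field is a regular value under these settings (option seen but not selected), so only -99 is mapped to None here.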
Data anonymization
After download from Qualtrics, files were put in the respective folders for each
wave in 03_data/01_raw_data/wave*. The script
03_data/01_raw_data/anonymization.R mainly removes the PROLIFIC_IDs from the
data and adds an anonymized ID subj_id with entries subj0001 - subj1009 to
all data sets.
Irrelevant columns - mostly automatically created by Qualtrics - are also
removed. See anonymization.R for details.
The anonymized data files are saved to 03_data/02_anonymized_data/ as
CSV files with file names HMC_<wave>_anonymized.csv.
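The idea behind this step can be sketched in Python as follows. This is not the code in anonymization.R: how that script assigns IDs is not described here, so the sketch assumes IDs are handed out in order of first appearance, and the function name and sample rows are hypothetical.

```python
def anonymize(rows, id_column="PROLIFIC_ID", id_map=None):
    """Replace id_column with an anonymized subj_id.

    Returns the new rows and the ID map, so the same map can be
    reused for the next wave (assumption: first-appearance order).
    """
    id_map = {} if id_map is None else id_map
    anonymized = []
    for row in rows:
        original = row[id_column]
        if original not in id_map:
            id_map[original] = f"subj{len(id_map) + 1:04d}"
        rest = {key: value for key, value in row.items() if key != id_column}
        anonymized.append({"subj_id": id_map[original], **rest})
    return anonymized, id_map


# Hypothetical rows for two waves, not taken from the real data.
wave1 = [{"PROLIFIC_ID": "abc", "item_1": "1"}, {"PROLIFIC_ID": "def", "item_1": "2"}]
wave2 = [{"PROLIFIC_ID": "def", "item_1": "5"}]

wave1_anon, id_map = anonymize(wave1)
wave2_anon, id_map = anonymize(wave2, id_map=id_map)
```

Reusing the same map across waves is what keeps a subject's subj_id identical in every data set.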
Data cleaning
After data anonymization, some basic data cleaning was done with the
script 03_data/02_anonymized_data/cleaning.R. In particular, the original
variable names in Qualtrics were harmonized so that they all follow the same
structure.
The cleaned data files are saved to 03_data/03_cleaned_data/ as CSV files with
file names HMC_<wave>_cleaned.csv.
The following section gives an overview of the problems in the data that required cleaning.
Problems
with variable names
- For the variables capturing which tasks subjects would delegate to AI, the naming was inconsistent. This affected only the variable names; the items were presented correctly to the subjects. The following variables were renamed:
  - delg_tsk_typs_4 --> delg_tsk_typs_3
  - delg_tsk_typs_5 --> delg_tsk_typs_4
  - delg_tsk_typs_6 --> delg_tsk_typs_5
  - delg_tsk_typs_7 --> delg_tsk_typs_6
  - delg_tsk_typs_8 --> delg_tsk_typs_7
  - delg_tsk_typs_8 was deleted
- The labels of the intention variables were swapped by accident and this was corrected:
  int_use_bhvr_fav = int_use_bhvr_noUser and vice versa
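Because the new names overlap with the old ones (for example, delg_tsk_typs_5 becomes delg_tsk_typs_4 while delg_tsk_typs_4 itself is renamed), renaming one column at a time can overwrite data if done in the wrong order. The sketch below applies all renames in a single pass, which also handles the swap of the two intention variables. It is an illustration in Python, not the code in cleaning.R; the function name and the sample row are hypothetical.

```python
# Old name -> new name, as listed above; the last two entries swap names.
RENAMES = {
    "delg_tsk_typs_4": "delg_tsk_typs_3",
    "delg_tsk_typs_5": "delg_tsk_typs_4",
    "delg_tsk_typs_6": "delg_tsk_typs_5",
    "delg_tsk_typs_7": "delg_tsk_typs_6",
    "delg_tsk_typs_8": "delg_tsk_typs_7",
    "int_use_bhvr_fav": "int_use_bhvr_noUser",
    "int_use_bhvr_noUser": "int_use_bhvr_fav",
}


def rename_columns(row, renames=RENAMES):
    """Apply all renames at once so overlapping names cannot clobber each other."""
    renamed = {renames.get(name, name): value for name, value in row.items()}
    if len(renamed) != len(row):
        raise ValueError("renaming produced duplicate column names")
    return renamed


# Hypothetical row, not taken from the real data.
row = {
    "delg_tsk_typs_4": "a",
    "delg_tsk_typs_5": "b",
    "int_use_bhvr_fav": "x",
    "int_use_bhvr_noUser": "y",
}
fixed = rename_columns(row)
```

The length check is a guard for the case where a renamed column collides with an existing one, which would otherwise silently drop a column.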
with subjects
- Two entries in wave 1: subj0762
- Three entries in wave 3: subj1009
- We kept the first entry for each subject
- subj1009 has been removed from the dataset since it only appeared in wave 3 and it is unclear how this happened; only subjects who participated in wave 1 have been invited to participate in further waves
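The two rules for subjects (keep the first entry, drop subj1009 entirely) can be sketched in Python as below. This assumes the rows are already in the order in which they were recorded; which entry cleaning.R treats as the first is not specified here. The function name and sample rows are hypothetical.

```python
def keep_first_entries(rows, drop_subjects=()):
    """Keep the first row per subj_id and drop excluded subjects entirely."""
    seen = set()
    kept = []
    for row in rows:
        subj = row["subj_id"]
        if subj in drop_subjects or subj in seen:
            continue
        seen.add(subj)
        kept.append(row)
    return kept


# Hypothetical rows, not taken from the real data.
rows = [
    {"subj_id": "subj0762", "item_1": "1"},
    {"subj_id": "subj0001", "item_1": "2"},
    {"subj_id": "subj0762", "item_1": "3"},
    {"subj_id": "subj1009", "item_1": "4"},
]
cleaned = keep_first_entries(rows, drop_subjects={"subj1009"})
```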
Data preprocessing
The final data preprocessing creates scales from the collected items. It was
done in Python, and the code for the preprocessing can be found in a separate
code repository: https://gitea.iwm-tuebingen.de/HMC/preprocessing. The files
with the final variables for each scale are then saved in the folder
03_data/04_preprocessed_data. Three versions are provided: CSV and Excel
versions per wave, as well as an overall SQLite database containing all waves in
one file. database_api_reference.md contains the documentation of the database.
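A quick way to get oriented in the SQLite database is to list its tables before consulting database_api_reference.md. The sketch below uses Python's built-in sqlite3 module. The table and column names are hypothetical stand-ins created in memory, since the actual database file name and schema are not given here; for the real data, connect to the database file in 03_data/04_preprocessed_data instead.

```python
import sqlite3


def list_tables(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in cursor]


# In-memory stand-in with a hypothetical table, not the real schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo_wave (subj_id TEXT, demo_scale REAL)")
conn.execute("INSERT INTO demo_wave VALUES ('subj0001', 3.5)")

tables = list_tables(conn)
first_row = conn.execute("SELECT subj_id, demo_scale FROM demo_wave").fetchone()
```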