Add content to data records and code availability

Nora Wickelmaier 2025-10-17 14:25:06 +02:00
parent ab0e760fcf
commit 1a58709bfd
3 changed files with 307 additions and 144 deletions

View File

@ -3,8 +3,10 @@
\usepackage[margin = 2.4cm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{Sweave}
\usepackage{url}
\usepackage{authblk}
\usepackage[style = apa, backend = biber, natbib = true]{biblatex}
\addbibresource{lit.bib}
\title{Working title: Data Descriptor for HMC Data Set}
\author{Angelica Henestrosa}
@ -43,23 +45,14 @@ This widespread proliferation of AI technologies, coupled with their increasingl
As AI systems become more adaptive and embedded in everyday life, understanding the determinants of usage intensity, behavioral patterns, and types of use becomes essential.
Moreover, the field of AI is evolving at a fast pace, and user characteristics such as attitudes and trust are subject to change over time. Therefore, longitudinal research that captures temporal fluctuations in user traits and behaviors is crucial.
Accordingly, this longitudinally designed data set aims to capture the evolving perceptions of opportunities and risks associated with AI, perceived capabilities of AI systems, attitudes toward AI, trust in AI, willingness to delegate tasks to AI, areas of application, (to be continued) and the interrelationships among these constructs over time, and to provide initial indications of causality. Longitudinal studies are more likely to detect changes if there is a potential change trigger \citep{Zhao2024}.
Central questions are whether predictors of technology acceptance as well as technology use change over time, whether the perception of AI tools as tools vs.\ agents (and if so, what type of role or relationship) changes over time, whether this perception is related to concepts like credibility, trustworthiness, or task delegation, and whether factors such as social presence or perceived anthropomorphism mediate such processes. Long-term effects of delegating tasks to AI tools on outcomes like perceived self-efficacy (writing skills), loneliness, or cognitive self-esteem, as well as the moderating role of personality, can also be explored.
% Note: let's all reflect on which term we want to use and why, and how we define it: usage vs. use
This project is a joint project of the human-computer interaction group at
the Leibniz-Institut für Wissensmedien in Tübingen (IWM). There are several (how many should we mention?) preregistrations from group members focusing on their individual subquestions. For an overview of the work packages and their research questions, please visit our repository at \url{https://gitea.iwm-tuebingen.de/HMC/data}.
Thus, this data set may be used to examine research questions that cut across the individual work packages and offers the possibility to extract and analyze specific subgroups or individual trajectories not covered by the work packages.
Because the data set was collected shortly before the public release of Apple Intelligence on consumer devices, it offers a timely snapshot of user attitudes and behaviors at a pivotal moment in AI adoption. This context enhances the relevance of the data for understanding emerging patterns in human-AI interaction. Moreover, the findings may provide early indicators of how psychological variables such as trust, perceived usefulness, and willingness to delegate tasks relate to AI usage, potentially offering a prognosis for similar developments in other countries.
@ -143,12 +136,6 @@ We collected sociodemographic information, including, age, gender, educational l
\section{Data Records}
% * @Nora perhaps you could elaborate on this?
% * Explain what the dataset contains.
% * Specify the repository where the dataset is stored.
% * Provide an overview of the data files and their formats.
@ -158,25 +145,123 @@ Moreover, a codebook explaining variable abbreviations and containing informatio
% * Include 1-2 tables or figures if necessary, but avoid summarizing data that
% can be generated from the dataset.
% * how should we report on the variables and scales:
% **item and scale level OR just scale level ?
% **link to Gerrits scale list: https://gitea.iwm-tuebingen.de/AG4/project_HMC_preprocessing/src/branch/main/results/database_api_reference.md ?
% **extra codebook or merge that information to Gerrits list?
Data records for each of the six waves are available in CSV format at
\url{https://gitea.iwm-tuebingen.de/HMC/data} together with the R scripts for
data anonymization and data cleaning.
% --> an overview of all variables, their calculation, their measurement format, and their M, SD, and Cronbach's alpha would be ideal!
In a first step, the data were anonymized by removing participants' Prolific
IDs, unused variables, and variables containing only \texttt{NA} values that
resulted from faulty questionnaire programming. The result is six files (one
for each wave) with the primary data containing the single items of each scale
measured. Furthermore, variable names were harmonized and subjects who filled
in the survey several times were excluded. The final data sets are ready for
analysis after some additional data preparation steps for building the scales
(if desired).
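A minimal sketch of these steps in base R, using illustrative (not actual) variable and file names, could look as follows; the authoritative implementation is in \texttt{anonymization.R} and \texttt{cleaning.R} in the repository.
%
\begin{verbatim}
## Minimal sketch of the anonymization and cleaning steps;
## column and file names are illustrative assumptions
dat <- read.csv("HMC_wave1_raw.csv")
dat$prolific_id <- NULL                  # drop Prolific IDs
dat <- dat[, colSums(!is.na(dat)) > 0]   # drop variables that are all NA
dat <- dat[!duplicated(dat$subj_id), ]   # drop repeated survey entries
write.csv(dat, "HMC_wave1_cleaned.csv", row.names = FALSE)
\end{verbatim}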
Figure~\ref{fig:folderstruc} shows the folder structure and files contained in
the repository of the data records. This repository is generated from the
local project folder that all project collaborators can access. All files are
text files or PDFs, with the exception of the codebook, which is an Excel file.
However, an export of the information contained in the Excel codebook to a
Markdown file is also included, for faster readability online and to ensure
that all files are available in non-proprietary formats.
\begin{figure}
\begin{verbatim}
https://gitea.iwm-tuebingen.de/HMC/data
|-- 01_project_management
| |-- workpackages
| | |-- workpackages.md
|-- 02_material
| |-- AI_Trends_Wave1_Survey.pdf
| |-- AI_Trends_Wave2_Survey.pdf
| |-- AI_Trends_Wave3_Survey.pdf
| |-- AI_Trends_Wave4_Survey.pdf
| |-- AI_Trends_Wave5_Survey.pdf
| |-- AI_Trends_Wave6_Survey.pdf
|-- 03_data
| |-- 01_raw_data
| | |-- anonymization.R
| |-- 02_anonymized_data
| | |-- cleaning.R
| |-- 03_cleaned_data
| | |-- HMC_wave1_cleaned.csv
| | |-- HMC_wave2_cleaned.csv
| | |-- HMC_wave3_cleaned.csv
| | |-- HMC_wave4_cleaned.csv
| | |-- HMC_wave5_cleaned.csv
| | |-- HMC_wave6_cleaned.csv
| |-- HMC_codebook.xlsx
| |-- item_reference.md
| |-- README.md
|-- README.md
\end{verbatim}
\caption{Folder structure of the repository containing the data records.}
\label{fig:folderstruc}
\end{figure}
Furthermore, a codebook explaining variable abbreviations and coding, and
containing references as well as information about the waves in which each
variable was collected, is available at
\url{https://gitea.iwm-tuebingen.de/HMC/data/src/branch/main/03_data/item_reference.md}.
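The Markdown export of the codebook can, for example, be regenerated from the Excel file with the CRAN packages \texttt{openxlsx} and \texttt{knitr}; the following sketch assumes a single-sheet codebook, which may differ from the actual file layout.
%
\begin{verbatim}
## Sketch: export the Excel codebook to a Markdown table;
## sheet number and paths are assumptions
library(openxlsx)
library(knitr)
codebook <- read.xlsx("HMC_codebook.xlsx", sheet = 1)
writeLines(kable(codebook, format = "markdown"), "item_reference.md")
\end{verbatim}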
% TODO: Where should the demographics table go? Here or above in the Methods
% section?
Table~\ref{tab:demographics} provides an overview of the demographic variables
across all six waves. Education and income were collected on six-point scales.
Answering options for education are
%
\begin{enumerate}
\item Some high school or less
\item High school diploma or GED
\item Some college, but no degree
\item Associates or technical degree
\item Bachelor's degree
\item Graduate or professional degree (MA, MS, MBA, PhD, JD, MD, DDS, etc.)
\end{enumerate}
%
and for income
%
\begin{enumerate}
\item Less than \$25,000
\item \$25,000-\$49,999
\item \$50,000-\$74,999
\item \$75,000-\$99,999
\item \$100,000-\$149,999
\item \$150,000 or more
\end{enumerate}
%
The rate of AI system users increases over the six waves, from about 76\% in
the first wave to almost 90\% in the sixth wave.
<<echo = false, results = tex>>=
# Read data
dat1 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave1_cleaned.csv")
dat2 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave2_cleaned.csv")
dat3 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave3_cleaned.csv")
dat4 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave4_cleaned.csv")
dat5 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave5_cleaned.csv")
dat6 <- read.csv("../data/03_data/03_cleaned_data/HMC_wave6_cleaned.csv")
dat1$use <- factor(dat1$use,
levels = 1:2,
labels = c("user", "noUser"))
dat2$use <- factor(dat2$use,
levels = 1:2,
labels = c("user", "noUser"))
dat3$use <- factor(dat3$use,
levels = 1:2,
labels = c("user", "noUser"))
dat4$use <- factor(dat4$use,
levels = 1:2,
labels = c("user", "noUser"))
dat5$use <- factor(dat5$use,
levels = 1:2,
labels = c("user", "noUser"))
dat6$use <- factor(dat6$use,
levels = 1:2,
labels = c("user", "noUser"))
subj_id_w2 <- unique(dat2$subj_id)
subj_id_w3 <- unique(dat3$subj_id)
@ -214,9 +299,13 @@ dat$gender <- factor(dat$gender,
# TODO: What to do about these? Reported means in table, since it is too detailed
# otherwise - but is this what we want?
# Remove categories that are not informative for means and SDs
dat$education <- ifelse(dat$education == 7, NA, dat$education)
dat$income <- ifelse(dat$income == 7, NA, dat$income)
# dat$use <- factor(dat$use,
# levels = 1:2,
# labels = c("user", "noUser"))
# TODO: Put in separate table? Left out for now!
dat$apple_use <- factor(dat$apple_use,
@ -235,19 +324,19 @@ dat$apple_AI_intent_use <- factor(dat$apple_AI_intent_use,
tab_demo <- matrix(NA, nrow = 6, ncol = 8)
rownames(tab_demo) <- paste("wave", 1:6)
colnames(tab_demo) <- c("Total N", "User", "Male", "Female", "Other",
"Age M(SD)", "Education M(SD)", "Income M(SD)")
tab_demo[, 1] <- c(nrow(dat1), nrow(dat2), nrow(dat3), nrow(dat4), nrow(dat5),
nrow(dat6))
tab_demo[, 2] <- c(
paste0(sprintf(fmt = "%.2f", dat1 |> subset(use == "user") |> nrow() / dat1 |> nrow() * 100), "%"),
paste0(sprintf(fmt = "%.2f", dat2 |> subset(use == "user") |> nrow() / dat2 |> nrow() * 100), "%"),
paste0(sprintf(fmt = "%.2f", dat3 |> subset(use == "user") |> nrow() / dat3 |> nrow() * 100), "%"),
paste0(sprintf(fmt = "%.2f", dat4 |> subset(use == "user") |> nrow() / dat4 |> nrow() * 100), "%"),
paste0(sprintf(fmt = "%.2f", dat5 |> subset(use == "user") |> nrow() / dat5 |> nrow() * 100), "%"),
paste0(sprintf(fmt = "%.2f", dat6 |> subset(use == "user") |> nrow() / dat6 |> nrow() * 100), "%")
)
tab_demo[, 3] <- c(dat |> subset(gender == "Male") |> nrow(),
@ -276,101 +365,101 @@ tab_demo[, 5] <- c(dat |> subset(gender %in% c("Non-binary / third gender",
)
tab_demo[, 6] <- c(
paste0(dat$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
dat$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w2)$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w2)$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w3)$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w3)$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w4)$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w4)$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w5)$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w5)$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w6)$age |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w6)$age |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
)
)
tab_demo[, 7] <- c(
paste0(dat$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
dat$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w2)$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w2)$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w3)$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w3)$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w4)$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w4)$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w5)$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w5)$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w6)$education |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w6)$education |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
)
)
tab_demo[, 8] <- c(
paste0(dat$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
dat$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w2)$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w2)$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w3)$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w3)$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w4)$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w4)$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w5)$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w5)$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
),
paste0(subset(dat, subj_id %in% subj_id_w6)$income |> mean(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
" (",
subset(dat, subj_id %in% subj_id_w6)$income |> sd(na.rm = TRUE) |> sprintf(fmt = "%.2f"),
")"
)
)
@ -411,9 +500,6 @@ Maybe here elaborate on limitations:
\section{Code Availability}
% Consider whether to link separately to Gitea and to OSF (+ material) here
% * Include a subheading titled "Code Availability" in the publication.
% * Indicate whether custom code can be accessed.
% * Provide details on how to access the custom code, including any restrictions
@ -424,7 +510,25 @@ All python (version x) an R (version x) code for data anonymization, data cleani
% immediately before the references.
% * If no custom code has been used, include a statement confirming this.
The primary cleaned data and the accompanying R code for data anonymization and
cleaning for all six waves are available at
\url{https://gitea.iwm-tuebingen.de/HMC/data}. The repository and all material
can be downloaded directly or cloned as a Git repository. All additional R
packages used for data cleaning (e.\,g., \texttt{dplyr},
\texttt{qualtRics}, or \texttt{openxlsx}) are available on CRAN
(\url{https://cran.r-project.org/}) and can be freely downloaded there.
However, the scripts are mainly provided to make transparent which steps have
been taken for data anonymization and data cleaning. The data files and the
codebook can be downloaded and used without having to rerun any of the scripts.
We provide the data at the item level here, so that they can be used for any
kind of analysis. The codebook provides the information needed to aggregate
items into scales, e.\,g., which items belong to one scale and which items
should be reverse-coded before being included in the scale.
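As a minimal sketch, a scale score could be built in R as follows, assuming a five-point response format and purely illustrative item names; the actual item-to-scale mapping and keying are documented in the codebook.
%
\begin{verbatim}
## Sketch: reverse-code a negatively keyed item and build a scale mean;
## item names and the 5-point format are assumptions
dat <- read.csv("HMC_wave1_cleaned.csv")
dat$trust_2r <- 6 - dat$trust_2   # reverse-code on a 1-5 response scale
dat$trust <- rowMeans(dat[, c("trust_1", "trust_2r", "trust_3")],
                      na.rm = TRUE)
\end{verbatim}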
% TODO: Should we maybe add information on how to build the scale? Like "take
% the mean", "take the sum" - does this differ? --> Check YAMLs
\printbibliography
\section*{Author Contributions}
@ -432,12 +536,4 @@ All python (version x) an R (version x) code for data anonymization, data cleani
\section*{Acknowledgements}
\end{document}

Binary file not shown.

View File

@ -3,8 +3,10 @@
\usepackage[margin = 2.4cm]{geometry}
\usepackage[utf8]{inputenc}
\usepackage{Sweave}
\usepackage{url}
\usepackage{authblk}
\usepackage[style = apa, backend = biber, natbib = true]{biblatex}
\addbibresource{lit.bib}
\title{Working title: Data Descriptor for HMC Data Set}
\author{Angelica Henestrosa}
@ -43,23 +45,14 @@ This widespread proliferation of AI technologies, coupled with their increasingl
As AI systems become more adaptive and embedded in everyday life, understanding the determinants of usage intensity, behavioral patterns, and types of use becomes essential.
Moreover, the field of AI is evolving at a fast pace, and user characteristics such as attitudes and trust are subject to change over time. Therefore, longitudinal research that captures temporal fluctuations in user traits and behaviors is crucial.
Accordingly, this longitudinally designed data set aims to capture the evolving perceptions of opportunities and risks associated with AI, perceived capabilities of AI systems, attitudes toward AI, trust in AI, willingness to delegate tasks to AI, areas of application, (to be continued) and the interrelationships among these constructs over time, and to provide initial indications of causality. Longitudinal studies are more likely to detect changes if there is a potential change trigger \citep{Zhao2024}.
Central questions are whether predictors of technology acceptance as well as technology use change over time, whether the perception of AI tools as tools vs.\ agents (and if so, what type of role or relationship) changes over time, whether this perception is related to concepts like credibility, trustworthiness, or task delegation, and whether factors such as social presence or perceived anthropomorphism mediate such processes. Long-term effects of delegating tasks to AI tools on outcomes like perceived self-efficacy (writing skills), loneliness, or cognitive self-esteem, as well as the moderating role of personality, can also be explored.
% Note: let's all reflect on which term we want to use and why, and how we define it: usage vs. use
This project is a joint project of the human-computer interaction group at
the Leibniz-Institut für Wissensmedien in Tübingen (IWM). There are several (how many should we mention?) preregistrations from group members focusing on their individual subquestions. For an overview of the work packages and their research questions, please visit our repository at \url{https://gitea.iwm-tuebingen.de/HMC/data}.
Thus, this data set may be used to examine research questions that cut across the individual work packages and offers the possibility to extract and analyze specific subgroups or individual trajectories not covered by the work packages.
Because the data set was collected shortly before the public release of Apple Intelligence on consumer devices, it offers a timely snapshot of user attitudes and behaviors at a pivotal moment in AI adoption. This context enhances the relevance of the data for understanding emerging patterns in human-AI interaction. Moreover, the findings may provide early indicators of how psychological variables such as trust, perceived usefulness, and willingness to delegate tasks relate to AI usage, potentially offering a prognosis for similar developments in other countries.
@ -143,12 +136,6 @@ We collected sociodemographic information, including, age, gender, educational l
\section{Data Records}
% * @Nora perhaps you could elaborate on this?
% * Explain what the dataset contains.
% * Specify the repository where the dataset is stored.
% * Provide an overview of the data files and their formats.
@ -158,30 +145,109 @@ Moreover, a codebook explaining variable abbreviations and containing informatio
% * Include 1-2 tables or figures if necessary, but avoid summarizing data that
% can be generated from the dataset.
% * how should we report on the variables and scales:
% **item and scale level OR just scale level ?
% **link to Gerrits scale list: https://gitea.iwm-tuebingen.de/AG4/project_HMC_preprocessing/src/branch/main/results/database_api_reference.md ?
% **extra codebook or merge that information to Gerrits list?
Data records for each of the six waves are available in CSV format at
\url{https://gitea.iwm-tuebingen.de/HMC/data} together with the R scripts for
data anonymization and data cleaning.
% --> an overview of all variables, their calculation, their measurement format, and their M, SD, and Cronbach's alpha would be ideal!
In a first step, the data were anonymized by removing participants' Prolific
IDs, unused variables, and variables containing only \texttt{NA} values that
resulted from faulty questionnaire programming. The result is six files (one
for each wave) with the primary data containing the single items of each scale
measured. Furthermore, variable names were harmonized and subjects who filled
in the survey several times were excluded. The final data sets are ready for
analysis after some additional data preparation steps for building the scales
(if desired).
Figure~\ref{fig:folderstruc} shows the folder structure and files contained in
the repository of the data records. This repository is generated from the
local project folder that all project collaborators can access. All files are
text files or PDFs, with the exception of the codebook, which is an Excel file.
However, an export of the information contained in the Excel codebook to a
Markdown file is also included, for faster readability online and to ensure
that all files are available in non-proprietary formats.
\begin{figure}
\begin{verbatim}
https://gitea.iwm-tuebingen.de/HMC/data
|-- 01_project_management
| |-- workpackages
| | |-- workpackages.md
|-- 02_material
| |-- AI_Trends_Wave1_Survey.pdf
| |-- AI_Trends_Wave2_Survey.pdf
| |-- AI_Trends_Wave3_Survey.pdf
| |-- AI_Trends_Wave4_Survey.pdf
| |-- AI_Trends_Wave5_Survey.pdf
| |-- AI_Trends_Wave6_Survey.pdf
|-- 03_data
| |-- 01_raw_data
| | |-- anonymization.R
| |-- 02_anonymized_data
| | |-- cleaning.R
| |-- 03_cleaned_data
| | |-- HMC_wave1_cleaned.csv
| | |-- HMC_wave2_cleaned.csv
| | |-- HMC_wave3_cleaned.csv
| | |-- HMC_wave4_cleaned.csv
| | |-- HMC_wave5_cleaned.csv
| | |-- HMC_wave6_cleaned.csv
| |-- HMC_codebook.xlsx
| |-- item_reference.md
| |-- README.md
|-- README.md
\end{verbatim}
\caption{Folder structure of the repository containing the data records.}
\label{fig:folderstruc}
\end{figure}
Furthermore, a codebook explaining variable abbreviations and coding, and
containing references as well as information about the waves in which each
variable was collected, is available at
\url{https://gitea.iwm-tuebingen.de/HMC/data/src/branch/main/03_data/item_reference.md}.
% TODO: Where should the demographics table go? Here or above in the Methods
% section?
Table~\ref{tab:demographics} provides an overview of the demographic variables
across all six waves. Education and income were collected on six-point scales.
Answering options for education are
%
\begin{enumerate}
\item Some high school or less
\item High school diploma or GED
\item Some college, but no degree
\item Associates or technical degree
\item Bachelor's degree
\item Graduate or professional degree (MA, MS, MBA, PhD, JD, MD, DDS, etc.)
\end{enumerate}
%
and for income
%
\begin{enumerate}
\item Less than \$25,000
\item \$25,000-\$49,999
\item \$50,000-\$74,999
\item \$75,000-\$99,999
\item \$100,000-\$149,999
\item \$150,000 or more
\end{enumerate}
%
The rate of AI system users increases over the six waves, from about 76\% in
the first wave to almost 90\% in the sixth wave.
% latex table generated in R 4.5.1 by xtable 1.8-4 package
% Fri Oct 17 14:22:35 2025
\begin{table}[ht]
\centering
\begin{tabular}{lrrrrrccc}
\hline
& Total N & User & Male & Female & Other & Age M(SD) & Education M(SD) & Income M(SD) \\
\hline
wave 1 & 1007 & 76.07\% & 500 & 494 & 13 & 38.68 (11.11) & 4.37 (1.34) & 3.51 (1.58) \\
wave 2 & 768 & 80.34\% & 375 & 384 & 8 & 39.37 (11.08) & 4.33 (1.32) & 3.50 (1.57) \\
wave 3 & 658 & 82.83\% & 318 & 332 & 6 & 39.86 (11.00) & 4.30 (1.33) & 3.51 (1.56) \\
wave 4 & 611 & 82.49\% & 282 & 323 & 5 & 40.13 (11.04) & 4.22 (1.35) & 3.43 (1.56) \\
wave 5 & 564 & 85.99\% & 259 & 300 & 4 & 40.43 (11.06) & 4.19 (1.33) & 3.42 (1.56) \\
wave 6 & 514 & 89.30\% & 238 & 270 & 5 & 40.36 (11.12) & 4.15 (1.33) & 3.36 (1.53) \\
\hline
\end{tabular}
\caption{Demographic variables per wave}
@ -217,9 +283,6 @@ Maybe here elaborate on limitations:
\section{Code Availability}
% Consider whether to link separately to Gitea and to OSF (+ material) here
% * Include a subheading titled "Code Availability" in the publication.
% * Indicate whether custom code can be accessed.
% * Provide details on how to access the custom code, including any restrictions
@ -230,7 +293,25 @@ All python (version x) an R (version x) code for data anonymization, data cleani
% immediately before the references.
% * If no custom code has been used, include a statement confirming this.
The primary cleaned data and the accompanying R code for data anonymization and
cleaning for all six waves are available at
\url{https://gitea.iwm-tuebingen.de/HMC/data}. The repository and all material
can be downloaded directly or cloned as a Git repository. All additional R
packages used for data cleaning (e.\,g., \texttt{dplyr},
\texttt{qualtRics}, or \texttt{openxlsx}) are available on CRAN
(\url{https://cran.r-project.org/}) and can be freely downloaded there.
However, the scripts are mainly provided to make transparent which steps have
been taken for data anonymization and data cleaning. The data files and the
codebook can be downloaded and used without having to rerun any of the scripts.
We provide the data at the item level here, so that they can be used for any
kind of analysis. The codebook provides the information needed to aggregate
items into scales, e.\,g., which items belong to one scale and which items
should be reverse-coded before being included in the scale.
% TODO: Should we maybe add information on how to build the scale? Like "take
% the mean", "take the sum" - does this differ? --> Check YAMLs
\printbibliography
\section*{Author Contributions}
@ -238,18 +319,4 @@ All python (version x) an R (version x) code for data anonymization, data cleani
\section*{Acknowledgements}
\end{document}