diff --git a/.Rbuildignore b/.Rbuildignore
new file mode 100644
index 0000000..6e6b0d3
--- /dev/null
+++ b/.Rbuildignore
@@ -0,0 +1 @@
+^README\.Rmd$
diff --git a/README.Rmd b/README.Rmd
new file mode 100644
index 0000000..2cfef99
--- /dev/null
+++ b/README.Rmd
@@ -0,0 +1,541 @@
+---
+output: github_document
+---
+
+
+
+```{r, include = FALSE}
+knitr::opts_chunk$set(
+  collapse = TRUE,
+  comment = "#>",
+  fig.path = "man/figures/README-",
+  out.width = "100%"
+)
+```
+
+# R package mtt
+
+![mtt package](man/figures/logo.png)
+
+This package was created to process log files obtained from multi-touch
+tables at the Leibniz-Institut für Wissensmedien (IWM).
+
+## Installation
+
+It can be installed via
+
+`devtools::install_git("https://gitea.iwm-tuebingen.de/R/mtt.git")`
+
+If you get an error message, you probably need to install `git2r` first with
+
+`install.packages("git2r")`.
+
+The package depends on the following R packages
+
+* `dplyr`
+* `pbapply`
+* `XML`
+* `lubridate`
+
+so make sure they are installed as well.
+
+# Multi-Touch Table
+
+The multi-touch table at the Herzog-Anton-Ulrich-Museum (HAUM) in
+Braunschweig gives visitors of the museum the opportunity to interact with
+about 70 artworks and 3 virtual cards containing information about the
+museum and its layout. The table was installed at the museum in October
+2016 and since November 2016 log files from interactions of visitors of the
+museum have been collected. These log files are in an unstructured format
+and cannot be easily analyzed. The purpose of the following document is to
+describe how the data have been transformed and which decisions have been
+made along the way.
+
+
+
+# Data structure
+
+The log files contain lines that indicate the beginning and end of possible
+activities that can be performed when interacting with the artworks on the
+table. The layout of the table looks like pictures have been tossed on a
+large table. Every artwork is visible at the start configuration.
People
+can move the pictures on the table, and they can be scaled and rotated.
+Additionally, the virtual picture cards can be flipped in order to find
+more information about the artwork on the "back" of the card. One has to press
+a little `i` for more information in one of the bottom corners of the card.
+On the back of the card two to six information cards can be found with a
+teaser text about a certain topic. These topic cards can be opened and a
+hypertext with detailed information opens. Within these hypertexts certain
+technical terms can be clicked for lay people to get more information. This
+also opens up a pop-up. The events encoded in the raw log files therefore
+have the following structure.
+
+```
+"Start Application"  --> Start Application
+"Show Application"
+"Transform start"    --> Move
+"Transform stop"
+"Show Info"          --> Flip Card
+"Show Front"
+"Artwork/OpenCard"   --> Open Topic
+"Artwork/CloseCard"
+"ShowPopup"          --> Open Popup
+"HidePopup"
+```
+
+The right side shows what events can be extracted from these raw lines. The
+"Start Application" is not an event in the original sense since it only
+indicates if the table was started or maybe reset itself. This is not an
+interaction with the table and therefore not interesting in itself. All
+"Start Application" and "Show Application" lines are therefore excluded from
+the data during further processing and remain only in the raw log files.
+
+# Parsing the raw log files
+
+The first step is to parse the raw log files that are stored by the
+application as text files in a rather unstructured format to a format that
+can be read by common statistics software packages. The data are therefore
+transferred to a spreadsheet format. The following section describes what
+problems were encountered while doing this.
+
+## Corrupt lines
+
+When reading the files containing the raw logs into R, a warning appears
+that says
+
+```
+Warning messages:
+  incomplete final line found on '2016/2016_11_18-11_31_0.log'
+  incomplete final line found on '2016/2016_11_18-11_38_30.log'
+  incomplete final line found on '2016/2016_11_18-11_40_36.log'
+  ...
+```
+
+When you open these files, it looks like the last line contains some binary
+content. It is unclear why and how this happens. So when reading the data,
+these lines are removed. A warning is given that indicates how many
+files have been affected.
+
+## Extracted variables from raw log files
+
+The following variables (columns in the data frame) are extracted from the
+raw log file:
+
+* `fileId`: Containing the zero-left-padded file name of the raw log file
+  the data line has been extracted from.
+
+* `folder`: The names of the folders in which the raw log files have been
+  organized. For the HAUM data set, the data are sorted by year (folders
+  2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
+
+* `date`: Extracted timestamp from the raw log file in the format
+  `yyyy-mm-dd hh:mm:ss`.
+
+* `timeMs`: Containing a timestamp in milliseconds that restarts with
+  every new raw log file.
+
+* `event`: Start and stop event tags. See above for possible values.
+
+* `item`: Identifier of the different items. This is a three-digit
+  (left-padded) number. The numbers of the items correspond to the
+  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
+  originally taken from the museum's catalogue.
+
+* `popup`: Name of the pop-up opened. This is only interesting for
+  "openPopup" events.
+
+* `topic`: The number of the topic card that has been opened at the back of
+  the item card. See below for a more detailed description of what these
+  numbers mean.
+
+* `x`: Value of x-coordinate in pixels on the 4K-Display ($3840 \times 2160$).
+
+* `y`: Value of y-coordinate in pixels.
+
+* `scale`: Number in 128 bit that indicates how much the card has been
+  scaled.
+
+* `rotation`: Degree of rotation from start configuration.
+
+
+
+## Variables after "closing of events"
+
+The raw log data consist of start and stop events for each event type.
+After preprocessing four event types are extracted: `move`, `flipCard`,
+`openTopic`, and `openPopup`. Except for the `move` events, which can occur
+at any time when interacting with an item card on the table, the events
+have a hierarchical order: An item card first needs to be flipped
+(`flipCard`), then the topic cards on the back of the card can be opened
+(`openTopic`), and finally pop-ups on these topic cards can be opened
+(`openPopup`). This implies that the event `openPopup` can only be present
+for a certain item if the card has already been flipped (i.e., an event
+`flipCard` for the same item has already occurred).
+
+After preprocessing, the data frame is now in a wide format with columns
+for the start and the stop of each event and contains the following
+variables:
+
+* `fileId.start` / `fileId.stop`: See above.
+
+* `date.start` / `date.stop`: See above.
+
+* `folder`: Containing the folder name (see above).
+
+* `case`: A numerical variable indicating cases in the data. A "case"
+  indicates an interaction interval and could be defined in different ways.
+  Right now, a new case begins when no event has occurred for 20 seconds or
+  longer.
+
+* `path`: A path is defined as one interaction with one item. A path
+  can either start with a `flipCard` event or when an item has been
+  touched for the first time within this case. A path ends with the
+  item card being flipped closed again or with the last movement of the
+  card within this case. One case can contain several paths with the same
+  item when the item is flipped open and flipped closed again several
+  times within a short time.
+
+* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
+  has been opened from the glossar folder. These pop-ups can be assigned to
+  the wrong item since it is not possible to do this algorithmically.
+  It is possible that two items are flipped open that could both link to
+  the same pop-up from the glossar. The indicator variable is left as a
+  variable, so that these pop-ups can be easily deleted from the data.
+  Right now, glossar entries can be ignored completely by setting an
+  argument and this is done by default. Using the pop-ups from the glossar
+  will need a lot more love before it behaves satisfactorily.
+
+* `event`: Indicating the event. Can take the values `move`, `flipCard`,
+  `openTopic`, and `openPopup`.
+
+* `item`: Identifier of the different artworks and information cards. This
+  is a three-digit (left-padded) number. See above.
+
+* `timeMs.start` / `timeMs.stop`: See above.
+
+* `duration`: Calculated by $timeMs.stop - timeMs.start$ in milliseconds.
+  Needs to be adjusted for events spanning more than one log file by adding
+  $600,000 \times (\text{number of log files} - 1)$. See below for details.
+
+* `topic`: See above.
+
+* `popup`: See above.
+
+* `x.start` / `x.stop`: See above.
+
+* `y.start` / `y.stop`: See above.
+
+* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and
+  $(x.stop, y.stop)$.
+
+* `scale.start` / `scale.stop`: See above.
+
+* `scaleSize`: Relative scaling of item card, calculated by
+  $\frac{scale.stop}{scale.start}$.
+
+* `rotation.start` / `rotation.stop`: See above.
+
+* `rotationDegree`: Difference of rotation from $rotation.stop$ to
+  $rotation.start$.
+
+## How unclosed events are handled
+
+Events do not necessarily need to be completed. A person can, e.g., leave
+the table and not flip the item card closed again. For `flipCard`,
+`openTopic`, and `openPopup` the data frame contains `NA` when the event
+does not complete.
For `move` events it happens quite often that a start
+event follows a start event and a stop event follows a stop event.
+Technically a move event cannot *not* be finished and the number of events
+without a start or stop indicates that the time resolution was not
+sufficient to catch all these events accurately. Double start and stop
+`move` events have therefore been deleted from the data set.
+
+## Additional meta data
+
+For the HAUM data, I added meta data on state holidays and school
+vacations.
+
+This led to the following additional variables:
+
+* `holiday`
+
+* `vacations`
+
+# Problems and how I handled them
+
+This section lists some problems with the log data that required decisions.
+These decisions influence the outcome and maybe even the data quality. Hence, I
+tried to document how I handled these problems and explain the decisions I
+made.
+
+## Weird behavior of `timeMs` and negative `duration` values
+
+`timeMs` resets itself every time a new log file starts. This means that
+the durations of events spanning more than one log file must be adjusted.
+Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start`
+must be subtracted from the maximum duration of the log file where the
+event started ($600,000$ ms) and the `timeMs.stop` must be added. If the
+event spans more than two log files, a multiple of $600,000$ must be taken,
+e.g. for three log files it must be: $2 \times 600,000 - timeMs.start +
+timeMs.stop$ and so on.
+
+```{r timems, echo = FALSE, results = FALSE, fig.show = TRUE}
+# Read data
+datraw <- read.table("../../MDS/2023ss/60100_master_thesis/analysis/code/results/raw_logfiles_2024-02-21_16-07-33.csv", sep = ";",
+                     header = TRUE)
+
+plot(timeMs ~ as.factor(fileId), datraw[1:5000,], xlab = "fileId")
+```
+
+The boxplot shows that we have a continuous range of values within one log
+file but that `timeMs` does not increase over log files.
I kept
+`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop`
+in the data frame, so it is clear when events span more than one log file.
+
+
+
+## Left padding of file IDs
+
+The file names of the raw log files are automatically generated and contain
+a timestamp. This timestamp is not well formed. First, it contains an
+incorrect month. The months go from 0 to 11, which means that the file
+`2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm.
+Another problem is that the file names are not zero left padded, e.g.,
+`2016_11_15-12_2_57.log`. This file was collected on December 15, 2016 at
+12:02 pm and therefore before the file above. But most sorting algorithms
+will sort these files in the order shown below. In order to preprocess the
+data and close events that belong together, the data need to be sorted by
+events and artworks repeatedly. In order to get them back in the correct
+time order, it is necessary to order them based on three variables:
+`fileId.start`, `date.start` and `timeMs.start`. The file IDs therefore
+need to sort in the correct order (again see below for example). I zero
+left padded the log file names within the data frame and use them as an
+identifier. These "file names" do not correspond exactly to the original
+raw log file names. This needs to be kept in mind when doing any kind of
+matching etc.
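The padding step described above can be illustrated with a minimal sketch. The helper `pad_file_id()` is hypothetical and not taken from the package. It assumes that every file name consists of six numeric fields (`year_month_day-hour_minute_second`) followed by `.log`.

```r
# Hypothetical helper (not part of mtt): zero-pad every numeric field of a
# raw log file name such as "2016_11_15-12_2_57.log".
pad_file_id <- function(file_name) {
  stem <- sub("\\.log$", "", basename(file_name))
  parts <- strsplit(stem, "[_-]")
  vapply(parts, function(p) {
    # Pad each field to at least two digits; the year stays four digits
    p <- sprintf("%02d", as.integer(p))
    paste0(paste(p[1:3], collapse = "_"), "-",
           paste(p[4:6], collapse = "_"), ".log")
  }, character(1))
}

ids <- c("2016_11_15-12_12_57.log", "2016_11_15-12_2_57.log")
sort(ids)               # unpadded: the 12:12 file sorts before the 12:02 file
sort(pad_file_id(ids))  # padded: the files sort chronologically
```

The sketch only touches the padding. It deliberately leaves the month in the file name as it is (0 to 11), so the padded IDs still differ from the `date` column by one month.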
+
+```
+## what it looked like before left padding
+# 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
+# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
+# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
+# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
+# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
+# 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
+
+## what it looks like now
+# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
+# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
+# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
+# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
+# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
+# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
+```
+
+## Timestamps repeat
+
+The timestamps in the `date` variable record year, month, day, hour,
+minute and seconds.
Since one second is not a very short time interval for
+a move on a touch display, this is not fine grained enough to bring events
+into the correct order, meaning there are events from the same log file
+having the same timestamp and even events from different log files having
+the same timestamp. The log files get written about every 10 minutes
+(which can easily be seen when looking at the file names of the raw log
+files). So in order to get events in the correct order, it is necessary to
+first order by file ID, within file ID then sort by timestamp `date` and
+then within these more coarse grained timestamps sort by `timeMs`. But as
+explained above, `timeMs` can only be sorted within one file ID, since the
+values do not increase consistently over log files, but restart with each
+raw log file.
+
+## x,y-coordinates outside of display range
+
+The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
+pixels. When you plot the start and stop coordinates, the display is
+clearly distinguishable. However, a lot of points are outside of the
+display range. This can happen when the art objects are scaled and then
+moved to the very edge of the table. Then it will record pixels outside of
+the table. These are actually valid data points and I will leave them as
+is.
+
+```{r xycoord}
+datlogs <- read.table("../../MDS/2023ss/60100_master_thesis/analysis/code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
+                      header = TRUE)
+
+par(mfrow = c(1, 2))
+plot(y.start ~ x.start, datlogs)
+abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
+plot(y.stop ~ x.stop, datlogs)
+abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
+
+aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
+```
+
+## Pop-ups from glossar cannot be assigned to a specific item
+
+All the information, pictures and texts for the topics and pop-ups are
+stored in `/data/haum/ContentEyevisit/eyevisit_cards_light/`.
+Among other things, each folder contains XML-files with the information
+about any technical terms that can be opened from the hypertexts on the
+topic cards. Often this information is item dependent and then the
+corresponding XML-file is in the folder for this item. Sometimes, however,
+more general terms can be opened. In order to avoid multiple files
+containing the same information, these were stored in a folder called
+`glossar` and get accessed from there. The raw log files only contain the
+path to this glossar entry and do not record from which item it was
+accessed. I tried to assign these glossar entries to the correct items. The
+(very heuristic) approach was this:
+
+1. Create a lookup table with all XML-file names (possible pop-ups) from
+   the glossar folder and what items possibly call them. This was stored
+   as an `RData` object for easier handling but should maybe be stored in a
+   more interoperable format.
+
+2. I went through all possible pop-ups in this lookup table and stored the
+   items that are associated with it.
+
+3. I created a sub data frame without move events (since they can never be
+   associated with a pop-up) and went through every line and looked up if
+   an item and a topic card had been opened. If this was the case and a
+   glossar entry came up before the item was closed again, I assigned
+   this item to the glossar entry.
+
+This is heuristic since it is possible that several topic cards from
+different items are opened simultaneously and the glossar pop-up could
+be opened from either one (it could even be more than two, of course). In
+these cases the item that was opened closest to the glossar pop-up has
+been assigned, but this can never be completely error free.
+
+And this heuristic only assigns a little more than half of the glossar
+entries.
Since my heuristic only looks for the last item that has been
+opened and whether this item is a possible candidate, it misses all glossar
+pop-ups where another item has been opened in between. This is still an
+open TODO to write a more elaborate algorithm.
+
+All glossar pop-ups that do not get matched with an item are removed
+from the data set with a warning if the argument `glossar = TRUE` is set.
+Otherwise the glossar entries will be ignored completely.
+
+## Assign a `case` variable based on "time heuristic"
+
+One thing needed in order to work with the data set and use it for machine
+learning algorithms like process mining, is a variable that tries to
+identify a case. A case variable will structure the data frame in a way
+that navigation behavior can actually be investigated. However, we do not
+know if several people are standing around the table interacting with it or
+just one very active person. The simplest way to define a case variable is
+to just use a time limit between events. This means that when the table has
+not been interacted with for, e.g., 20 seconds then it is assumed that a
+person moved on and a new person started interacting with the table. This
+is the easiest heuristic and implemented at the moment. Process mining
+shows that this simple approach works in a way that the correct process
+gets extracted by the algorithm.
+
+In order to investigate user behavior on a more fine grained level, it will
+be necessary to come up with a more elaborate approach. A better, still
+simple approach, could be to use this kind of time limit and additionally
+look at the distance between items interacted with within one time window.
+When items are far apart it seems plausible that more than one person
+interacted with them. Very short time lapses between events on different
+items could also be an indicator that more than one person is interacting
+with the table.
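The time heuristic can be sketched in a few lines of R. This is a simplified illustration, not the package's implementation. The function name `assign_case()` is hypothetical, and the sketch measures the gap between consecutive start times only, whereas a fuller version would probably compare the stop of one event with the start of the next.

```r
# Hypothetical sketch (not the package's implementation): start a new case
# whenever the gap between consecutive events is at least `cutoff` seconds.
assign_case <- function(start, cutoff = 20) {
  secs <- as.numeric(start)           # POSIXct -> seconds since epoch
  stopifnot(!is.unsorted(secs))       # events must already be in time order
  gap <- c(0, diff(secs))             # first event has no predecessor
  cumsum(gap >= cutoff) + 1           # each long gap opens the next case
}

start <- as.POSIXct(c("2016-12-15 12:12:56", "2016-12-15 12:12:57",
                      "2016-12-15 12:13:30", "2016-12-15 12:13:35"),
                    tz = "UTC")
assign_case(start)
```

In this example the 33-second pause between the second and third event opens a second case, so the first two events share one case number and the last two share the next.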
+
+## Assign a `path` variable
+
+The `path` variable is supposed to show one interaction trace with one
+artwork, meaning it starts when an artwork is touched or flipped and stops
+when it is closed again. It is easy to assign a path from flipping a card,
+through opening (maybe several) topics and pop-ups for this artwork card, to
+closing this card again. But one would like to assign the same path to
+move events surrounding this interaction. Again, this is not possible in an
+algorithmic way but only heuristically.
+
+Again, I used a time cutoff for this. First, if a `move` event occurs, it
+is checked whether the same item has been flipped less than 20 seconds
+beforehand. If yes, the same path indicator is assigned to this `move`. If
+not, temporarily a new "move indicator" is assigned. Then, a "backward
+pass" is applied, where it is checked if the same item is opened less than
+20 seconds _after_ the event occurs. If yes, that path indicator is
+assigned. For all the remaining moves, a new path number is assigned. This
+corresponds to items being moved without being flipped.
+
+## A `move` event does not record any change
+
+Most of the events in the log files are move events. Additionally, many of
+these move events are recorded but they do not indicate any change, meaning
+the only difference is the timestamp. All other variables indicating moves
+like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc. do
+not show _any_ change. They represent about 2/3 of all move events. These
+events are probably short touches of the table without an actual
+interaction. They were therefore removed from the data set.
+
+## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
+
+In the beginning I thought that the number for topics was the index of
+where the card was presented on the back of the item. But this is not
+correct. It is the number of the topic.
There are eight topics in total:
+
+```
+Indices for topics:
+0 artist
+1 thema
+2 komposition
+3 leben des kunstwerks
+4 details
+5 licht und farbe
+6 extra info
+7 technik
+```
+On the back of items, there can be between 2 and 6 topic cards. Several of
+these topic cards can be about the same topic, e.g., there can be two topic
+cards assigned to the topic `thema`. It is impossible to find out if the
+same topic card was opened several times or if different topic cards with
+the same topic were opened from the same item. See example below for item
+"001".
+
+```{r topics, echo = FALSE}
+devtools::load_all()
+items <- sprintf("%03d", unique(datlogs$item))
+topics <- extract_topics(items, xmlfiles = paste0(items, ".xml"),
+                         xmlpath = "../../MDS/2023ss/60100_master_thesis/analysis/data/haum/ContentEyevisit/eyevisit_cards_light/")
+head(topics)
+```
+
+## New artworks "504" and "505" starting October 2022
+
+When I read in the complete data frame for the first time, all of a
+sudden there were 72 instead of 70 items. It seems like these two
+artworks appear on October 21, 2022.
+
+```{r newitems}
+summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
+```
+
+The artworks seem to have been updated in general after October 21, 2022. The
+following table shows which items were presented in which years.
+
+```{r years}
+xtabs(~ item + lubridate::year(date.start), datlogs)
+```
+
+It shows that the artworks have been updated after the Corona pandemic. I
+think the table was also moved to a different location at that point.
+
diff --git a/README.md b/README.md
index efc1976..5aab782 100644
--- a/README.md
+++ b/README.md
@@ -1,7 +1,12 @@
+
+
+
 # R package mtt
 
-This package was created to process log files obtained from
-Multi-Touch-Tables at the IWM.
+![mtt package](man/figures/logo.png)
+
+This package was created to process log files obtained from multi-touch
+tables at the Leibniz-Institut für Wissensmedien (IWM).
 
 ## Installation
 
@@ -9,16 +14,597 @@ It can be installed via
 
 `devtools::install_git("https://gitea.iwm-tuebingen.de/R/mtt.git")`
 
-If you get an error message, you probably need to install `git2r`first with
+If you get an error message, you probably need to install `git2r` first
+with
 
 `install.packages("git2r")`.
 
 The package depends on the following R packages
 
-* `dplyr`
-* `pbapply`
-* `XML`
-* `lubridate`
+- `dplyr`
+- `pbapply`
+- `XML`
+- `lubridate`
 
 so make sure they are installed as well.
 
+# Multi-Touch Table
+
+The multi-touch table at the Herzog-Anton-Ulrich-Museum (HAUM) in
+Braunschweig gives visitors of the museum the opportunity to interact
+with about 70 artworks and 3 virtual cards containing information about
+the museum and its layout. The table was installed at the museum in
+October 2016 and since November 2016 log files from interactions of
+visitors of the museum have been collected. These log files are in an
+unstructured format and cannot be easily analyzed. The purpose of the
+following document is to describe how the data have been transformed
+and which decisions have been made along the way.
+
+
+
+# Data structure
+
+The log files contain lines that indicate the beginning and end of
+possible activities that can be performed when interacting with the
+artworks on the table. The layout of the table looks like pictures have
+been tossed on a large table. Every artwork is visible at the start
+configuration. People can move the pictures on the table, and they can be
+scaled and rotated. Additionally, the virtual picture cards can be
+flipped in order to find more information about the artwork on the “back”
+of the card. One has to press a little `i` for more information in one
+of the bottom corners of the card. On the back of the card two to six
+information cards can be found with a teaser text about a certain topic.
+These topic cards can be opened and a hypertext with detailed
+information opens.
Within these hypertexts certain technical terms can
+be clicked for lay people to get more information. This also opens up a
+pop-up. The events encoded in the raw log files therefore have the
+following structure.
+
+    "Start Application"  --> Start Application
+    "Show Application"
+    "Transform start"    --> Move
+    "Transform stop"
+    "Show Info"          --> Flip Card
+    "Show Front"
+    "Artwork/OpenCard"   --> Open Topic
+    "Artwork/CloseCard"
+    "ShowPopup"          --> Open Popup
+    "HidePopup"
+
+The right side shows what events can be extracted from these raw lines.
+The “Start Application” is not an event in the original sense since it
+only indicates if the table was started or maybe reset itself. This is
+not an interaction with the table and therefore not interesting in
+itself. All “Start Application” and “Show Application” lines are therefore
+excluded from the data during further processing and remain only in the raw
+log files.
+
+# Parsing the raw log files
+
+The first step is to parse the raw log files that are stored by the
+application as text files in a rather unstructured format to a format
+that can be read by common statistics software packages. The data are
+therefore transferred to a spreadsheet format. The following section
+describes what problems were encountered while doing this.
+
+## Corrupt lines
+
+When reading the files containing the raw logs into R, a warning appears
+that says
+
+    Warning messages:
+      incomplete final line found on '2016/2016_11_18-11_31_0.log'
+      incomplete final line found on '2016/2016_11_18-11_38_30.log'
+      incomplete final line found on '2016/2016_11_18-11_40_36.log'
+      ...
+
+When you open these files, it looks like the last line contains some
+binary content. It is unclear why and how this happens. So when reading
+the data, these lines are removed. A warning is given that
+indicates how many files have been affected.
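One way to drop such a final line could look like the following sketch. It is only an illustration under assumptions: the helper `read_log_lines()` is hypothetical and not the package's code, and it treats "the file does not end with a newline" as the criterion for a corrupt last line, which matches the warning above but has not been checked against the binary content itself.

```r
# Hypothetical helper (not part of mtt): read a raw log file and drop the
# final line if the file does not end with a newline character.
read_log_lines <- function(path) {
  n <- file.size(path)
  bytes <- readBin(path, what = "raw", n = n)
  incomplete <- n > 0 && bytes[n] != as.raw(0x0a)
  # warn = FALSE silences the "incomplete final line" warning shown above
  lines <- readLines(path, warn = FALSE)
  if (incomplete && length(lines) > 0) {
    lines <- lines[-length(lines)]
  }
  lines
}

# Usage on a small temporary file whose last line is cut off
tmp <- tempfile(fileext = ".log")
writeBin(charToRaw("first line\nsecond line\ntrunc"), tmp)
read_log_lines(tmp)
```

Counting the files for which `incomplete` is `TRUE` would give the number reported in the warning mentioned above.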
+
+## Extracted variables from raw log files
+
+The following variables (columns in the data frame) are extracted from
+the raw log file:
+
+- `fileId`: Containing the zero-left-padded file name of the raw log
+  file the data line has been extracted from.
+
+- `folder`: The names of the folders in which the raw log files have been
+  organized. For the HAUM data set, the data are sorted by year
+  (folders 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
+
+- `date`: Extracted timestamp from the raw log file in the format
+  `yyyy-mm-dd hh:mm:ss`.
+
+- `timeMs`: Containing a timestamp in milliseconds that restarts with
+  every new raw log file.
+
+- `event`: Start and stop event tags. See above for possible values.
+
+- `item`: Identifier of the different items. This is a three-digit
+  (left-padded) number. The numbers of the items correspond to the
+  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
+  originally taken from the museum's catalogue.
+
+- `popup`: Name of the pop-up opened. This is only interesting for
+  “openPopup” events.
+
+- `topic`: The number of the topic card that has been opened at the back
+  of the item card. See below for a more detailed description of what these
+  numbers mean.
+
+- `x`: Value of x-coordinate in pixels on the 4K-Display
+  ($3840 \times 2160$).
+
+- `y`: Value of y-coordinate in pixels.
+
+- `scale`: Number in 128 bit that indicates how much the card has been
+  scaled.
+
+- `rotation`: Degree of rotation from start configuration.
+
+
+
+## Variables after “closing of events”
+
+The raw log data consist of start and stop events for each event type.
+After preprocessing four event types are extracted: `move`, `flipCard`,
+`openTopic`, and `openPopup`.
Except for the `move` events, which can
+occur at any time when interacting with an item card on the table, the
+events have a hierarchical order: An item card first needs to be flipped
+(`flipCard`), then the topic cards on the back of the card can be opened
+(`openTopic`), and finally pop-ups on these topic cards can be opened
+(`openPopup`). This implies that the event `openPopup` can only be
+present for a certain item if the card has already been flipped (i.e.,
+an event `flipCard` for the same item has already occurred).
+
+After preprocessing, the data frame is now in a wide format with columns
+for the start and the stop of each event and contains the following
+variables:
+
+- `fileId.start` / `fileId.stop`: See above.
+
+- `date.start` / `date.stop`: See above.
+
+- `folder`: Containing the folder name (see above).
+
+- `case`: A numerical variable indicating cases in the data. A “case”
+  indicates an interaction interval and could be defined in different
+  ways. Right now, a new case begins when no event has occurred for
+  20 seconds or longer.
+
+- `path`: A path is defined as one interaction with one item. A path can
+  either start with a `flipCard` event or when an item has been touched
+  for the first time within this case. A path ends with the item card
+  being flipped closed again or with the last movement of the card within
+  this case. One case can contain several paths with the same item when
+  the item is flipped open and flipped closed again several times within
+  a short time.
+
+- `glossar`: An indicator variable with values 0/1 that tracks if a
+  pop-up has been opened from the glossar folder. These pop-ups can be
+  assigned to the wrong item since it is not possible to do this
+  algorithmically. It is possible that two items are flipped open that
+  could both link to the same pop-up from the glossar. The indicator
+  variable is left as a variable, so that these pop-ups can be easily
+  deleted from the data.
Right now, glossar entries can be ignored
+  completely by setting an argument and this is done by default. Using
+  the pop-ups from the glossar will need a lot more love before it
+  behaves satisfactorily.
+
+- `event`: Indicating the event. Can take the values `move`, `flipCard`,
+  `openTopic`, and `openPopup`.
+
+- `item`: Identifier of the different artworks and information cards.
+  This is a three-digit (left-padded) number. See above.
+
+- `timeMs.start` / `timeMs.stop`: See above.
+
+- `duration`: Calculated by $timeMs.stop - timeMs.start$ in
+  milliseconds. Needs to be adjusted for events spanning more than one
+  log file by adding $600,000 \times (\text{number of log files} - 1)$. See
+  below for details.
+
+- `topic`: See above.
+
+- `popup`: See above.
+
+- `x.start` / `x.stop`: See above.
+
+- `y.start` / `y.stop`: See above.
+
+- `distance`: Euclidean distance calculated from $(x.start, y.start)$
+  and $(x.stop, y.stop)$.
+
+- `scale.start` / `scale.stop`: See above.
+
+- `scaleSize`: Relative scaling of item card, calculated by
+  $\frac{scale.stop}{scale.start}$.
+
+- `rotation.start` / `rotation.stop`: See above.
+
+- `rotationDegree`: Difference of rotation from $rotation.stop$ to
+  $rotation.start$.
+
+## How unclosed events are handled
+
+Events do not necessarily need to be completed. A person can, e.g.,
+leave the table and not flip the item card closed again. For `flipCard`,
+`openTopic`, and `openPopup` the data frame contains `NA` when the event
+does not complete. For `move` events it happens quite often that a start
+event follows a start event and a stop event follows a stop event.
+Technically a move event cannot *not* be finished and the number of
+events without a start or stop indicates that the time resolution was not
+sufficient to catch all these events accurately. Double start and stop
+`move` events have therefore been deleted from the data set.
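A possible reading of "double start and stop events are deleted" is sketched below. The function `is_paired_move()` is hypothetical and not taken from the package. It assumes the rows belong to a single item, are already in time order, and that only a start directly followed by a stop counts as a complete move.

```r
# Hypothetical sketch (not the package's code): flag the `move` rows of ONE
# item that form a proper start/stop pair; rows are assumed to be in time order.
is_paired_move <- function(event) {
  nxt <- c(event[-1], NA)               # event tag of the following row
  prv <- c(NA, event[-length(event)])   # event tag of the preceding row
  (event == "Transform start" & nxt %in% "Transform stop") |
    (event == "Transform stop" & prv %in% "Transform start")
}

event <- c("Transform start", "Transform start", "Transform stop",
           "Transform stop", "Transform start", "Transform stop")
event[is_paired_move(event)]
```

In the example the first start and the second stop have no partner and are dropped, which leaves two complete start/stop pairs.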
+ +## Additional meta data + +For the HAUM data, I added meta data on state holidays and school +vacations. + +This led to the following additional variables: + +- `holiday` + +- `vacations` + +# Problems and how I handled them + +This lists some problems with the log data that required decisions. +These decisions influence the outcome and maybe even the data quality. +Hence, I tried to document how I handled these problems and explain the +decisions I made. + +## Weird behavior of `timeMs` and negative `duration` values + +`timeMs` resets itself every time a new log file starts. This means that +the durations of events spanning more than one log file must be +adjusted. Instead of just calculating $timeMs.stop - timeMs.start$, +`timeMs.start` must be subtracted from the maximum duration of the log +file where the event started ($600,000$ ms) and the `timeMs.stop` must +be added. If the event spans more than two log files, a multiple of +$600,000$ must be taken, e.g. for three log files it must be: +$2 \times 600,000 - timeMs.start + timeMs.stop$ and so on. + + + +The boxplot shows that we have a continuous range of values within one +log file but that `timeMs` does not increase over log files. I kept +`timeMs.start` and `timeMs.stop` and also `fileId.start` and +`fileId.stop` in the data frame, so it is clear when events span more +than one log file. + + + +## Left padding of file IDs + +The file names of the raw log files are automatically generated and +contain a timestamp. This timestamp is not well formed. First, it +contains an incorrect month. The months go from 0 to 11, which means +that the file name `2016_11_15-12_12_57.log` was collected on December +15, 2016 at 12:12 pm. Another problem is that the file names are not +zero left padded, e.g., `2016_11_15-12_2_57.log`. This file was +collected on December 15, 2016 at 12:02 pm and therefore before the file +above. But most sorting algorithms will sort these files in the order +shown below. 
In order to preprocess the data and close events that +belong together, the data need to be sorted by events and artworks +repeatedly. In order to get them back in the correct time order, it is +necessary to order them based on three variables: `fileId.start`, +`date.start` and `timeMs.start`. The file IDs therefore need to sort in +the correct order (again see below for example). I zero left padded the +log file names within the data frame using it as an identifier. These +“file names” do not correspond exactly to the original raw log file +names. This needs to be kept in mind when doing any kind of matching +etc. + + ## what it looked like before left padding + # 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254 + # 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465 + # 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605 + # 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605 + # 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362 + # 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465 + + ## what it looks like now + # 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254 + # 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465 + # 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 
076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465 + # 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605 + # 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605 + # 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362 + +## Timestamps repeat + +The timestamps in the `date` variable record year, month, day, hour, +minute and seconds. Since one second is not a very short time interval +for a move on a touch display, this is not fine grained enough to bring +events into the correct order, meaning there are events from the same +log file having the same timestamp and even events from different log +files having the same timestamp. The log files get written about every +10 minutes (which can easily be seen when looking at the file names of +the raw log files). So in order to get events in the correct order, it +is necessary to first order by file ID, within file ID then sort by +timestamp `date` and then within these more coarse grained timestamps +sort by `timeMs`. But as explained above, `timeMs` can only be sorted +within one file ID, since the values do not increase consistently over log +files, but start from a new offset for each raw log file. + +## x,y-coordinates outside of display range + +The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160 +pixels. When you plot the start and stop coordinates, the display is +clearly distinguishable. However, a lot of points are outside of the +display range. This can happen when the art objects are scaled and then +moved to the very edge of the table. The log then records pixels outside +of the table. These are actually valid data points and I will leave them +as is. 
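
How many points fall outside the display can be counted with a few lines of base R. The following is only a sketch on made-up coordinates; the column names `x.start` and `y.start` follow the variable list above, while the toy values and the helper names `width`, `height` and `outside` are assumptions for illustration.

``` r
# Toy start coordinates (made up); the real data frame also has
# x.stop and y.stop, which can be checked in the same way.
coords <- data.frame(
  x.start = c(100, -50, 3900, 2000),
  y.start = c(200, 300, 1000, 2500)
)

# Display size in pixels as stated in the text.
width  <- 3840
height <- 2160

# TRUE for every point that lies outside the display rectangle.
outside <- with(coords,
  x.start < 0 | x.start > width | y.start < 0 | y.start > height)

sum(outside)   # number of start points outside the display
mean(outside)  # share of start points outside the display
```

Since these points are considered valid, the indicator is only meant for description, not for filtering.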
+ +``` r +datlogs <- read.table("../../MDS/2023ss/60100_master_thesis/analysis/code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";", + header = TRUE) + +par(mfrow = c(1, 2)) +plot(y.start ~ x.start, datlogs) +abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2) +plot(y.stop ~ x.stop, datlogs) +abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2) +``` + + + +``` r + +aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean) +#> x.start x.stop y.start y.stop +#> 1 1978.202 1975.876 1137.481 1133.494 +``` + +## Pop-ups from glossar cannot be assigned to a specific item + +All the information, pictures and texts for the topics and pop-ups are +stored in +`/data/haum/ContentEyevisit/eyevisit_cards_light/`. Among +other things, each folder contains XML files with the information about +any technical terms that can be opened from the hypertexts on the topic +cards. Often this information is item dependent and then the +corresponding XML file is in the folder for this item. Sometimes, +however, more general terms can be opened. In order to avoid multiple +files containing the same information, these were stored in a folder +called `glossar` and get accessed from there. The raw log files only +contain the path to this glossar entry and do not record from which +item it was accessed. I tried to assign these glossar entries to the +correct items. The (very heuristic) approach was this: + +1. Create a lookup table with all XML file names (possible pop-ups) + from the glossar folder and what items possibly call them. This was + stored as an `RData` object for easier handling but should maybe be + stored in a more interoperable format. + +2. I went through all possible pop-ups in this lookup table and stored + the items that are associated with it. + +3. 
I created a sub data frame without move events (since they can never + be associated with a pop-up) and went through every line and looked + up if an item and a topic card had been opened. If this was the case + and a glossar entry came up before the item was closed again, I + assigned this item to the glossar entry. + +This is heuristic since it is possible that several topic cards from +different items are opened simultaneously and the glossar pop-up could +be opened from either one (it could even be more than two, of course). +In these cases the item that was opened closest to the glossar pop-up +has been assigned, but this can never be completely error free. + +This heuristic only assigns a little more than half of the glossar +entries. Since it only looks at the last item that has been +opened and checks if this item is a possible candidate, it misses all glossar +pop-ups where another item has been opened in between. Writing a more +elaborate algorithm is still an open TODO. + +All glossar pop-ups that do not get matched with an item are removed +from the data set with a warning if the argument `glossar = TRUE` is +set. Otherwise the glossar entries will be ignored completely. + +## Assign a `case` variable based on “time heuristic” + +One thing needed in order to work with the data set and use it for +machine learning algorithms like process mining, is a variable that +tries to identify a case. A case variable will structure the data frame +in a way that navigation behavior can actually be investigated. However, +we do not know if several people are standing around the table +interacting with it or just one very active person. The simplest way to +define a case variable is to just use a time limit between events. This +means that when the table has not been interacted with for, e.g., 20 +seconds then it is assumed that a person moved on and a new person +started interacting with the table. 
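
A time limit of this kind can be sketched in a few lines of base R. The snippet below is an illustration under assumptions, not the package code: it works on made-up start times in seconds and measures the gap between consecutive event starts, whereas the real data would first need one continuous time scale across log files. The names `start_sec`, `cutoff` and `gap` are hypothetical.

``` r
# Toy event start times in seconds, sorted in time order (made up).
start_sec <- c(0, 5, 12, 40, 45, 100)

# Assumed cutoff of 20 seconds, as in the example above.
cutoff <- 20

# Gap to the previous event; the first event has no predecessor.
gap <- c(0, diff(start_sec))

# A new case begins whenever the gap reaches the cutoff.
case <- cumsum(gap >= cutoff) + 1
case
```

Here the gaps of 28 and 55 seconds each open a new case, so the six events are split into three cases.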
This is the easiest heuristic and +the one implemented at the moment. Process mining shows that this simple +approach works in the sense that the correct process gets extracted by the +algorithm. + +In order to investigate user behavior on a more fine grained level, it +will be necessary to come up with a more elaborate approach. A better, +but still simple, approach could be to use this kind of time limit and +additionally look at the distance between items interacted with within +one time window. When items are far apart it seems plausible that more +than one person interacted with them. Very short time lapses between +events on different items could also be an indicator that more than one +person is interacting with the table. + +## Assign a `path` variable + +The `path` variable is supposed to show one interaction trace with one +artwork, meaning it starts when an artwork is touched or flipped and +stops when it is closed again. It is easy to assign a path from flipping +a card, through opening (maybe several) topics and pop-ups for this artwork +card, until closing this card again. But one would like to assign the +same path to move events surrounding this interaction. Again, this is +not possible in an exact algorithmic way but only heuristically. + +Again, I used a time cutoff for this. First, if a `move` event occurs, +it is checked whether the same item has been flipped less than 20 seconds +beforehand. If yes, the same path indicator is assigned to this `move`. +If not, temporarily a new “move indicator” is assigned. Then, a +“backward pass” is applied, where it is checked if the same item is +opened less than 20 seconds *after* the event occurs. If yes, that path +indicator is assigned. For all the remaining moves, a new path number is +assigned. This corresponds to items being moved without being flipped. + +## A `move` event does not record any change + +Most of the events in the log files are move events. 
Additionally, many +of these move events are recorded but they do not indicate any change, +meaning the only difference is the timestamp. All other variables +indicating moves like `x.start` and `x.stop`, `rotation.start` and +`rotation.stop` etc. do not show *any* change. They represent about 2/3 +of all move events. These events are probably short touches of the table +without an actual interaction. They were therefore removed from the data +set. + +## Card indices go from 0 to 7 (instead of 0 to 5 as expected) + +In the beginning I thought that the number for topics was the index of +where the card was presented on the back of the item. But this is not +correct. It is the number of the topic. There are eight topics in total: + + Indices for topics: + 0 artist + 1 thema + 2 komposition + 3 leben des kunstwerks + 4 details + 5 licht und farbe + 6 extra info + 7 technik + +On the back of items, there can be between 2 and 6 topic cards. Several +of these topic cards can be about the same topic, e.g., there can be two +topic cards assigned to the topic `thema`. It is impossible to find out +if the same topic card was opened several times or if different topic +cards with the same topic were opened from the same item. See example +below for item “001”. + + #> ℹ Loading mtt + #> item file_name topic + #> 1 001 001_dargestellte.xml thema + #> 2 001 001_thema1.xml thema + #> 3 001 001_leben.xml leben des kunstwerks + #> 4 001 001_leben3.xml leben des kunstwerks + #> 5 001 001_thema2.xml thema + #> 6 001 001_thema.xml thema + +## New artworks “504” and “505” starting October 2022 + +When I read in the complete data frame for the first time, all of a +sudden there were 72 instead of 70 items. It seems like these two +artworks first appear on October 21, 2022. + +``` r +summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"])) +#> Min. 1st Qu. Median Mean 3rd Qu. Max. 
+#> "2022-10-21" "2023-01-11" "2023-03-08" "2023-03-09" "2023-05-21" "2023-07-05" +``` + +The artworks seem to have been updated in general after October 21, 2022. +The following table shows which items were presented in which years. + +``` r +xtabs(~ item + lubridate::year(date.start), datlogs) +#> lubridate::year(date.start) +#> item 2016 2017 2018 2019 2020 2022 2023 +#> 1 277 4082 1912 1434 424 394 1315 +#> 3 485 6730 3126 2356 528 457 1124 +#> 19 714 8656 4028 2743 660 698 1595 +#> 20 595 8461 3996 2983 938 657 1355 +#> 24 497 6638 2912 2251 649 439 1028 +#> 27 567 5959 3112 2318 651 711 1324 +#> 28 601 9329 4394 3056 778 762 1570 +#> 29 425 6865 3830 2365 516 615 1174 +#> 31 289 4118 2051 1218 291 296 675 +#> 32 562 7016 3477 2253 726 766 1647 +#> 33 509 4936 2242 1449 555 358 666 +#> 36 434 4505 2276 1668 373 387 976 +#> 37 242 4478 2182 1554 339 423 1168 +#> 38 480 4617 2144 1397 371 381 784 +#> 39 395 3227 1313 1003 237 161 622 +#> 41 282 3329 1303 1022 225 209 701 +#> 42 203 3113 1307 903 242 191 421 +#> 43 115 2420 1089 806 176 219 486 +#> 45 1491 13561 5924 4474 966 585 1828 +#> 46 903 9181 5340 3812 961 944 1648 +#> 47 306 4949 2395 1510 750 297 675 +#> 48 723 10455 5384 4162 1328 948 2031 +#> 49 433 4326 2124 1414 434 431 809 +#> 51 564 7837 4577 2991 884 659 1370 +#> 52 447 5021 2104 1729 471 349 840 +#> 54 424 5068 2816 2008 529 370 918 +#> 55 358 4859 2069 1428 341 403 1303 +#> 57 860 14264 6625 5092 1410 1221 2714 +#> 60 555 6865 3539 2336 639 586 1415 +#> 62 547 6736 3803 2210 795 633 1322 +#> 63 251 3677 1827 1241 300 282 527 +#> 66 552 6004 2774 1977 505 373 932 +#> 69 394 3730 1827 1438 272 206 680 +#> 70 226 3766 1843 973 293 268 703 +#> 71 557 6160 2490 1846 570 323 839 +#> 72 426 6194 2857 2129 508 635 1553 +#> 73 432 6125 2880 1821 583 395 939 +#> 75 258 5885 2418 1562 369 257 645 +#> 76 861 12435 6253 4214 1753 1153 2268 +#> 77 816 8595 4197 2897 699 674 1452 +#> 78 410 5632 2498 1924 394 408 850 +#> 80 1650 25687 12429 7782 1975 1712 4433 
+#> 83 644 8618 4720 3026 987 1027 2294 +#> 84 184 2121 1231 759 231 254 465 +#> 87 149 1618 722 632 99 0 0 +#> 88 513 6996 3493 2272 539 533 1420 +#> 89 214 2204 950 723 156 0 0 +#> 90 281 3756 1372 1143 403 320 932 +#> 93 613 8528 4224 3015 696 1174 2058 +#> 98 462 6662 3265 2565 704 670 1453 +#> 99 180 4162 1653 1454 363 411 868 +#> 101 414 4209 1859 1282 392 411 981 +#> 103 677 8758 4366 3165 1045 909 1871 +#> 104 423 5256 2381 1865 463 467 933 +#> 107 181 2101 1106 788 205 146 339 +#> 109 321 4001 1619 1106 292 188 453 +#> 110 489 5846 2785 2008 494 387 923 +#> 125 640 8435 4519 3334 926 0 0 +#> 129 598 11322 5046 3369 910 1131 1682 +#> 145 419 7821 3945 2694 706 740 1396 +#> 176 507 8465 3968 2787 687 552 1544 +#> 180 516 7563 3720 2765 585 550 1272 +#> 183 377 4014 1819 1741 346 251 675 +#> 187 340 4222 2165 1753 319 312 734 +#> 197 426 7710 3603 2510 671 602 1217 +#> 229 303 4872 2360 1891 482 389 1005 +#> 231 271 3606 1851 1239 318 236 467 +#> 501 1915 15968 7849 5060 1157 890 2989 +#> 502 1212 14550 7111 4749 1105 883 2752 +#> 503 1308 15218 8632 6399 1626 870 2558 +#> 504 0 0 0 0 0 363 662 +#> 505 0 0 0 0 0 426 1533 +``` + +It shows that the artworks have been updated after the Corona pandemic. +I think the table was also moved to a different location at that point. diff --git a/man/figures/README-timems-1.png b/man/figures/README-timems-1.png new file mode 100644 index 0000000..f08b70a Binary files /dev/null and b/man/figures/README-timems-1.png differ diff --git a/man/figures/README-xycoord-1.png b/man/figures/README-xycoord-1.png new file mode 100644 index 0000000..d72a279 Binary files /dev/null and b/man/figures/README-xycoord-1.png differ diff --git a/man/figures/logo.png b/man/figures/logo.png new file mode 100644 index 0000000..e933311 Binary files /dev/null and b/man/figures/logo.png differ