Updated README.Rmd and exported as github_document

2024-03-22 15:58:30 +01:00
parent 37e67bfa69
commit 9762c61a8d
4 changed files with 745 additions and 371 deletions
@@ -1,46 +1,38 @@
 ---
-title: "Background information about MTT data"
-author: "Nora Wickelmaier"
-date: "`r Sys.Date()`"
-output: 
-  html_document:
-    number_sections: true
-    toc: true
+title: "Log data from the Multi-Touch Table at the HAUM"
+output: github_document
 ---

 ```{r, include = FALSE}
-# setwd("C:/Users/nwickelmaier/Nextcloud/Documents/MDS/2023ss/60100_master_thesis")
-devtools::load_all("../../../software/mtt")
+devtools::load_all("../../../../software/mtt")
 ```

-# Log data from the Multi-Touch Table at the HAUM
-
 The Multi-Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in
 Braunschweig gives visitors of the museum the opportunity to interact with
-67 artworks and 3 tiles containing information about the museum and its
-layout. The table was installed at the institute in October 2016 and since
-November 2016 log files from interactions of visitors of the museum have
-been collected. These log files are in an unstructured format and cannot be
-easily analyzed. The purpose of the following document is to describe how
-the data haven been transformed and which decisions have been made along
-the way.
+about 70 artworks and 3 virtual cards containing information about the
+museum and its layout. The table was installed at the institute in October
+2016 and since November 2016 log files from interactions of visitors of the
+museum have been collected. These log files are in an unstructured format
+and cannot be easily analyzed. The purpose of the following document is to
+describe how the data have been transformed and which decisions have been
+made along the way.

 # Data structure

 The log files contain lines that indicate the beginning and end of possible
-actions that can be performed when interacting with the artworks on the
-table. The layout of the table looks like 70 pictures have been tossed on a
+activities that can be performed when interacting with the artworks on the
+table. The layout of the table looks like pictures have been tossed on a
 large table. Every artwork is visible at the start configuration. People
 can move the pictures on the table, and they can be scaled and rotated.
 Additionally, the virtual picture cards can be flipped in order to find
 more information about the artwork on the "back" of the card. One has to press
 a little `i` for more information in one of the bottom corners of the card.
-On the back of the card two (?) to six information cards can be found with
-a teaser text about a certain topic. These topic cards can be opened and a
-hypertext with detailed information pops up. Within these hypertexts
-certain technical terms can be clicked for lay people to get more
-information. This also opens up a pop-up. The events encoded in the raw log
-files therefore have the following structure.
+On the back of the card two to six information cards can be found with a
+teaser text about a certain topic. These topic cards can be opened and a
+hypertext with detailed information opens. Within these hypertexts certain
+technical terms can be clicked for lay people to get more information. This
+also opens up a pop-up. The events encoded in the raw log files therefore
+have the following structure.

 ```
 "Start Application"     --> Start Application
@@ -100,32 +92,32 @@ raw log file:
  organized in. For the HAUM data set, the data are sorted by year (folders
  2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).

-* `data`: Extracted time stamp from the raw log file in the format
+* `date`: Extracted timestamp from the raw log file in the format
  `yyyy-mm-dd hh:mm:ss`.

-* `timeMs`: Containing a time stamp in Milliseconds that restarts with
+* `timeMs`: Containing a timestamp in milliseconds that restarts with
  every new raw log file.

 * `event`: Start and stop event tags. See above for possible values.

-* `artwork`: Identifier of the different artworks. This is a 3 digit
-  (left-padded) number. The numbers of the artworks correspond to the
+* `item`: Identifier of the different items. This is a three-digit
+  (left-padded) number. The numbers of the items correspond to the
  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
  originally taken from the museum's catalogue.

-* `popup`: Name of the pop-up opened. This is only interestin for
+* `popup`: Name of the pop-up opened. This is only interesting for
  "openPopup" events.

-* `topicNumber`: The number of the topic card that has been opened at the back of
-  the artwork card. See below for a more detailed descripttion what these
-  numbers possibly mean.
+* `topic`: The number of the topic card that has been opened at the back of
+  the item card. See below for a more detailed description of what these
+  numbers mean.

 * `x`: Value of x-coordinate in pixels on the 4K display ($3840 \times 2160$)

 * `y`: Value of y-coordinate in pixels

-* `scale`: Number in 128 bit that indicates how much the artwork card has
-  been scaled (????)
+* `scale`: Number in 128 bit that indicates how much the card has been
+  scaled

 * `rotation`: Degree of rotation in start configuration.

@@ -134,43 +126,45 @@ raw log file:

 ## Variables after "closing of events"

-The raw log data consists of start and stop events for each event type.
-After preprocessing for event types are extracted: `move`, `flipCard`,
+The raw log data consist of start and stop events for each event type.
+After preprocessing four event types are extracted: `move`, `flipCard`,
 `openTopic`, and `openPopup`. Except for the `move` events, which can occur
-at any time when interacting with an artwork card on the table, the events
-have a hierachical order: An artwork card first needs to be flipped
+at any time when interacting with an item card on the table, the events
+have a hierarchical order: An item card first needs to be flipped
 (`flipCard`), then the topic cards on the back of the card can be opened
 (`openTopic`), and finally pop-ups on these topic cards can be opened
 (`openPopup`). This implies that the event `openPopup` can only be present
-for a certain artwork, if the card has already been flipped (i.e., an event
-`flipCard` for the same artwork has already occured).
+for a certain item, if the card has already been flipped (i.e., an event
+`flipCard` for the same item has already occurred).

 After preprocessing, the data frame is now in a wide format with columns
 for the start and the stop of each event and contains the following
 variables:

-* `folder`: Containing the folder name (see above)
+* `fileId.start` / `fileId.stop`: See above.

-* `eventId`: A numerical variable that indicates the number of the event.
-  Starts at 1 and ends with the total number of events, counting up by 1.
+* `date.start` / `date.stop`: See above.
+
+* `folder`: Containing the folder name (see above)

 * `case`: A numerical variable indicating cases in the data. A "case"
  indicates an interaction interval and could be defined in different ways.
-  Right now a new case begins, when no event occured for 20 seconds.
+  Right now a new case begins when no event has occurred for 20 seconds or
+  longer.

-* `trace`: A trace is defined as one interaction with one artwork. A trace
-  can either start with a `flipCard` event or when an artwork has been
-  touched for the first time within this case. A trace ends with the
-  artwork card being flipped close again or with the last movement of the
-  card within this case. One case can contain several traces with the same
-  artwork when the artwork is flipped open and slipped close again several
+* `path`: A path is defined as one interaction with one item. A path
+  can either start with a `flipCard` event or when an item has been
+  touched for the first time within this case. A path ends with the
+  item card being flipped closed again or with the last movement of the
+  card within this case. One case can contain several paths with the same
+  item when the item is flipped open and flipped closed again several
  times within a short time.

 * `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
  has been opened from the glossar folder. These pop-ups can be assigned to
-  the wronge artwork since it is not possible to do this algorithmically.
-  It is possible that two artworks are flipped open that could both link to
-  the same popup from a glossar. The indicator variable is left as a
+  the wrong item since it is not possible to do this algorithmically.
+  It is possible that two items are flipped open that could both link to
+  the same pop-up from a glossar. The indicator variable is left as a
  variable, so that these pop-ups can be easily deleted from the data.
  Right now, glossar entries can be ignored completely by setting an
  argument and this is done by default. Using the pop-ups from the glossar
@@ -179,20 +173,16 @@ variables:
 * `event`: Indicating the event. Can take the values `move`, `flipCard`,
  `openTopic`, and `openPopup`.

-* `artwork`: Identifier of the different artworks. This is a 3 digit
-  (left-padded) number. See above.
-
-* `fileId.start` / `fileId.stop`: See above.
-
-* `date.start` / `date.stop`: See above.
+* `item`: Identifier of the different artworks and information cards. This
+  is a three-digit (left-padded) number. See above.

 * `timeMs.start` / `timeMs.stop`: See above.

 * `duration`: Calculated by $timeMs.stop - timeMs.start$ in Milliseconds.
  Needs to be adjusted for events spanning more than one log file by a
-  factor of $60,000 \times #logfiles$. See below for details.
+  factor of $600,000 \times \text{number of logfiles}$. See below for details.

-* `topicNumber`: See above.
+* `topic`: See above.

 * `popup`: See above.

@@ -200,11 +190,12 @@ variables:

 * `y.start` / `y.stop`: See above.

-* `distance`: Euclidean distande calculated from $(x.start, y.start)$ and $(x.stop, y.stop)$.
+* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and
+  $(x.stop, y.stop)$.

 * `scale.start` / `scale.stop`: See above.

-* `scaleSize`: Relative scaling of artwork card, calculated by
+* `scaleSize`: Relative scaling of item card, calculated by
  $\frac{scale.stop}{scale.start}$.

 * `rotation.start` / `rotation.stop`: See above.
@@ -215,60 +206,26 @@ variables:
 ## How unclosed events are handled

 Events do not necessarily need to be completed. A person can, e.g., leave
-the table and not flip the artwork card close again. For `flipCard`,
+the table and not flip the item card closed again. For `flipCard`,
 `openTopic`, and `openPopup` the data frame contains `NA` when the event
-does not complete. For `move` events is happens quite often that a start
+does not complete. For `move` events it happens quite often that a start
 event follows a start event and a stop event follows a stop event.
 Technically a move event cannot *not* be finished and the number of events
-without a start or stop indicated that the time resolution was not
+without a start or stop indicates that the time resolution was not
 sufficient to catch all these events accurately. Double start and stop
-`move`events have therefore been deleted from the data set.
-
-<!--
-## How a case is defined
-
-* Herausfinden, ob mehr als eine Person am Tisch steht?
-  - Sliding window, in der Anzahl von Artworks gezählt wird? Oder wie weit
-    angefasste Artworks voneinander entfernt sind?
-  - Man kann sowas schon "sehen" in den Logs - aber wie kann ich es
-    automatisiert rausziehen? Was ist meine Definition von
-    "Interaktionsboost"?
-  - Egal wie wir es machen, geht es auf den "Event-Log-Daten"?
-->
+`move` events have therefore been deleted from the data set.

 ## Additional meta data

 For the HAUM data, I added meta data on state holidays and school
-vacations. Additionally, the topic categories of the topic cards were
-extracted from the XML files and added to the data frame.
+vacations. 

 This led to the following additional variables:

-* `topicIndex`
-
-* `topicFile`
-
-* `topic`
-
-* `state` (Niedersachsen for complete HAUM data set)
-
-* `stateCode` (NI)
-
 * `holiday`

 * `vacations`

-* `stateCodeVacations`
-
-<!--
-  - Metadata on artworks like, name, artist, type of artwork, epoch, etc.
-  - School vacations and holidays
-  - Special exhibits at the museum
-  - Number of visitors per day (bei Sven noch mal nachhaken?)
-  - Age structure of visitors per day?
-  - ... ????
-->
-
 # Problems and how I handled them

 This lists some problems with the log data that required decisions. These
@@ -287,33 +244,12 @@ event spans more than two log files, a multiple of $600,000$ must be taken,
 e.g. for three log files it must be: $2 \times 600,000 - timeMs.start +
 timeMs.stop$ and so on.
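+
+The adjustment described above can be sketched as follows. This is only an
+illustration: `adjust_duration()` and its arguments are hypothetical names
+and not part of the `mtt` package, and the sketch assumes that every log
+file covers $600,000$ milliseconds, which is only approximately true.
+
+```r
+# Hypothetical helper, not part of the mtt package.
+# n_files: number of log files the event spans
+#          (1 = start and stop are in the same log file)
+# file_ms: assumed length of one log file in milliseconds (10 minutes)
+adjust_duration <- function(time_start, time_stop, n_files, file_ms = 600000) {
+  (n_files - 1) * file_ms - time_start + time_stop
+}
+
+adjust_duration(590000, 15000, n_files = 2)  # event spanning two log files
+adjust_duration(590000, 15000, n_files = 3)  # event spanning three log files
+```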

-```{r, results = FALSE, fig.show = TRUE}
+```{r timems, echo = FALSE, results = FALSE, fig.show = TRUE}
 # Read data
-dat0 <- read.table("data/haum/raw_logfiles_small_2023-09-26_13-50-20.csv", sep = ";",
-                   header = TRUE)
-dat0$date <- as.POSIXct(dat0$date)
-dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
+datraw <- read.table("code/results/raw_logfiles_2024-02-21_16-07-33.csv", sep = ";",
+                     header = TRUE)

-# Remove irrelevant events
-dat <- subset(dat0, !(dat0$event %in% c("Start Application",
-                                        "Show Application")))
-
-# Add trace variable
-artworks <- unique(stats::na.omit(dat$artwork))
-artworks <- artworks[artworks != "glossar"]
-glossar_files <- unique(subset(dat, dat$artwork == "glossar")$popup)
-glossar_dict <- create_glossardict(artworks, glossar_files,
-                    xmlpath = "data/haum/ContentEyevisit/eyevisit_cards_light/")
-dat1 <- add_trace(dat, glossar_dict)
-
-# Close events
-dat2 <- rbind(close_events(dat1, "move", rm_nochange_moves = TRUE),
-              close_events(dat1, "flipCard", rm_nochange_moves = TRUE),
-              close_events(dat1, "openTopic", rm_nochange_moves = TRUE),
-              close_events(dat1, "openPopup", rm_nochange_moves = TRUE))
-dat2 <- dat2[order(dat2$fileId.start, dat2$date.start, dat2$timeMs.start), ]
-
-plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
+plot(timeMs ~ as.factor(fileId), datraw[1:5000,], xlab = "fileId")
 ```

 The boxplot shows that we have a continuous range of values within one log
@@ -322,7 +258,7 @@ file but that `timeMs` does not increase over log files. I kept
 in the data frame, so it is clear when events span more than one log file.

 <!--
-Infos from Philipp:
+Infos from the programmer:

 "I also just went through the code from back then. The logging works like
 this: from the start of the application, every 10 minutes a new log file
@@ -340,7 +276,7 @@ es passt."
 ## Left padding of file IDs

 The file names of the raw log files are automatically generated and contain
-a time stamp. This time stamp is not well formed. First, it contains an
+a timestamp. This timestamp is not well formed. First, it contains an
 incorrect month. The months go from 0 to 11, which means that the file name
 `2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm.
 Another problem is that the file names are not zero left padded, e.g.,
@@ -350,11 +286,12 @@ will sort these files in the order shown below. In order to preprocess the
 data and close events that belong together, the data need to be sorted by
 events and artworks repeatedly. In order to get them back in the correct
 time order, it is necessary to order them based on three variables:
-`fileId`, `date.start` and `timeMs`. The file IDs therefore need to
-sort in the correct order (again see below for example). I zero left padded
-the log file names within the data frame using it as an identifier. These
-"file names" do not correspond exactly to the original raw log file names.
-This needs to be kept in mind when doing any kind of matching etc.
+`fileId.start`, `date.start` and `timeMs.start`. The file IDs therefore
+need to sort in the correct order (again see below for example). I zero
+left padded the log file names within the data frame using it as an
+identifier. These "file names" do not correspond exactly to the original
+raw log file names. This needs to be kept in mind when doing any kind of
+matching etc.
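+
+A minimal sketch of such a zero left padding is shown here. The function
+`pad_file_id()` is a hypothetical helper and not necessarily how the `mtt`
+package does it, and the second file name is made up for illustration. The
+sketch only pads the fields and leaves the 0-based month untouched.
+
+```r
+# Hypothetical helper: zero left pad all fields of "yyyy_m_d-h_m_s.log"
+pad_file_id <- function(x) {
+  # drop the extension and split at "_" and "-"
+  parts <- strsplit(sub("\\.log$", "", x), "[_-]")
+  vapply(parts, function(p) {
+    p <- as.integer(p)
+    sprintf("%04d_%02d_%02d-%02d_%02d_%02d.log",
+            p[1], p[2], p[3], p[4], p[5], p[6])
+  }, character(1))
+}
+
+# the second file name is a made-up example of an unpadded name
+ids <- c("2016_11_15-12_12_57.log", "2016_11_15-9_3_7.log")
+sort(pad_file_id(ids))
+```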

 ```
 ## what it looked like before left padding
@@ -376,16 +313,16 @@ This needs to be kept in mind when doing any kind of matching etc.

 ## Timestamps repeat

-The time stamps in the `date` variable record year, month, day, hour,
+The timestamps in the `date` variable record year, month, day, hour,
 minute and seconds. Since one second is not a very short time interval for
 a move on a touch display, this is not fine grained enough to bring events
 into the correct order, meaning there are events from the same log file
-having the same time stamp and even events from different log files having
-the same time stamp. The log files get written about every 10 minutes
+having the same timestamp and even events from different log files having
+the same timestamp. The log files get written about every 10 minutes
 (which can easily be seen when looking at the file names of the raw log
 files). So in order to get events in the correct order, it is necessary to
-first order by file ID, within file ID then sort by time stamp `date` and
-then within these more coarse grained time stamps sort be `timeMs`. But as
+first order by file ID, within file ID then sort by timestamp `date` and
+then within these more coarse grained timestamps sort by `timeMs`. But as
 explained above, `timeMs` can only be sorted within one file ID, since it
 does not increase consistently over log files, but has a new offset for each
 raw log file.
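+
+The three-level ordering can be sketched with base R. The data frame below
+contains made-up values and only borrows the column names `fileId`, `date`
+and `timeMs` from the raw data.
+
+```r
+# Toy data with hypothetical values
+toy <- data.frame(
+  fileId = c("2016_11_15-12_22_57.log",
+             "2016_11_15-12_12_57.log",
+             "2016_11_15-12_12_57.log"),
+  date   = as.POSIXct(c("2016-12-15 12:23:01",
+                        "2016-12-15 12:13:05",
+                        "2016-12-15 12:13:05"), tz = "UTC"),
+  timeMs = c(4000, 8500, 8200)
+)
+
+# order by file ID first, then by timestamp, then by milliseconds
+toy_sorted <- toy[order(toy$fileId, toy$date, toy$timeMs), ]
+toy_sorted
+```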
@@ -394,64 +331,67 @@ raw log file.

 The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
 pixels. When you plot the start and stop coordinates, the display is
-clearly to distinguish. However, a lot of points are outside of the display
-range. This can happen, when the art objects are scaled and then moved to
-the very edge of the table. Then it will record pixels outside of the
-table. These are actually valid data points and I will leave them as is.
+clearly distinguishable. However, a lot of points are outside of the
+display range. This can happen when the art objects are scaled and then
+moved to the very edge of the table. The log then records pixels outside of
+the table. These are actually valid data points and I will leave them as
+is.
+
+```{r xycoord}
+datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
+                      header = TRUE)

-```{r}
 par(mfrow = c(1, 2))
-plot(y.start ~ x.start, dat2)
+plot(y.start ~ x.start, datlogs)
 abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
-plot(y.stop ~ x.stop, dat2)
+plot(y.stop ~ x.stop, datlogs)
 abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)

-aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, dat2, mean)
+aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
 ```

-## Pop-ups from glossar cannot be assigned to a specific artwork
+## Pop-ups from glossar cannot be assigned to a specific item

 All the information, pictures and texts for the topics and pop-ups are
-stored in
-`/Logfiles/ContentEyevisit/eyevisit_cards_light/<artwork_number>`. Among
-other things, each folder contains XML-files with the information about any
-technical terms that can be opened from the hypertexts on the topic cards.
-Often these information are artwork dependent and then the corresponding
-XML-file is in the folder for this artwork. Sometimes, however, more
-general terms can be opened. In order to avoid multiple files containing
-the same information, these were stored in a folder called `glossar` and
-get accessed from there. The raw log files only contain the path to this
-glossar entry and did not record from which artwork it was accessed. I
-tried to assign these glossar entries to the correct artworks. The (very
-heuristic) approach was this:
+stored in `/data/haum/ContentEyevisit/eyevisit_cards_light/<item_number>`.
+Among other things, each folder contains XML-files with the information
+about any technical terms that can be opened from the hypertexts on the
+topic cards. Often this information is item dependent and then the
+corresponding XML-file is in the folder for this item. Sometimes, however,
+more general terms can be opened. In order to avoid multiple files
+containing the same information, these were stored in a folder called
+`glossar` and get accessed from there. The raw log files only contain the
+path to this glossar entry and do not record from which item it was
+accessed. I tried to assign these glossar entries to the correct items. The
+(very heuristic) approach was this:

 1. Create a lookup table with all XML-file names (possible pop-ups) from
-   the glossar folder and what artworks possibly call them. This was stored
+   the glossar folder and what items possibly call them. This was stored
   as an `RData` object for easier handling but should maybe be stored in a
   more interoperable format.

 2. I went through all possible pop-ups in this lookup table and stored the
-   artworks that are associated with it.
+   items that are associated with it.

 3. I created a sub data frame without move events (since they can never be
   associated with a pop-up) and went through every line and looked up if
-   an artwork and a topic card had been opened. If this was the case and a
-   glossar entry came up before the artwork was closed again, I assigned
-   this artwork to this glossar entry.
+   an item and a topic card had been opened. If this was the case and a
+   glossar entry came up before the item was closed again, I assigned
+   this item to the glossar entry.

 This is heuristic since it is possible that several topic cards from
-different artworks are opened simultaneously and the glossar pop-up could
+different items are opened simultaneously and the glossar pop-up could
 be opened from either one (it could even be more than two, of course). In
-these cases the artwork that was opened closest to the glossar pop-up has
+these cases the item that was opened closest to the glossar pop-up has
 been assigned, but this can never be completely error free.

 And this heuristic only assigns a little more than half of the glossar
-entries. Since my heuristic only looks for the last artwork that has been
-opened and if this artwork is a possible candidate it misses all glossar
-pop-ups where another artwork has been opened in between. This is still an
+entries. Since my heuristic only looks at the last item that has been
+opened and checks if this item is a possible candidate, it misses all glossar
+pop-ups where another item has been opened in between. This is still an
 open TODO to write a more elaborate algorithm.

-All glossar pop-ups that do not get matched with an artwork are removed
+All glossar pop-ups that do not get matched with an item are removed
 from the data set with a warning if the argument `glossar = TRUE` is set.
 Otherwise the glossar entries will be ignored completely.
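+
+The core idea of the heuristic can be sketched as follows. This is a
+strongly simplified illustration and not the implementation in the `mtt`
+package: `assign_glossar()`, the lookup table and the event sequence are
+all hypothetical, and the sketch does not check whether an item has been
+closed again.
+
+```r
+# Hypothetical lookup table: glossar pop-up -> items that can call it
+glossar_dict <- list("term_a.xml" = c("001", "002"),
+                     "term_b.xml" = c("003"))
+
+# Hypothetical event sequence without move events
+events <- data.frame(
+  item  = c("001", "001", "glossar", "002", "glossar"),
+  popup = c(NA, NA, "term_a.xml", NA, "term_b.xml"),
+  stringsAsFactors = FALSE
+)
+
+assign_glossar <- function(events, dict) {
+  last_item <- NA_character_
+  for (i in seq_len(nrow(events))) {
+    if (events$item[i] != "glossar") {
+      last_item <- events$item[i]      # remember the item opened last
+    } else if (last_item %in% dict[[events$popup[i]]]) {
+      events$item[i] <- last_item      # possible candidate: assign it
+    }
+  }
+  events
+}
+
+# entries without a possible candidate keep the value "glossar"
+assign_glossar(events, glossar_dict)$item
+```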

@@ -473,232 +413,89 @@ gets extracted by the algorithm.
 In order to investigate user behavior on a more fine grained level, it will
 be necessary to come up with a more elaborate approach. A better, still
 simple approach, could be to use this kind of time limit and additionally
-look at the distance between artworks interacted with within one time
-window. When artworks are far apart it seems plausible that more than one
-person interacted with them. Very short time lapses between events on
-different artworks could also be an indicator that more than one person is
-interacting with the table.
+look at the distance between items interacted with within one time window.
+When items are far apart it seems plausible that more than one person
+interacted with them. Very short time lapses between events on different
+items could also be an indicator that more than one person is interacting
+with the table.
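+
+The plain time limit that such an approach would extend can be sketched as
+follows. `assign_case()` is a hypothetical helper for illustration, not the
+function used in the `mtt` package, and the timestamps are made up.
+
+```r
+# Hypothetical helper: start a new case whenever the gap to the previous
+# event is `gap` seconds or longer
+assign_case <- function(timestamps, gap = 20) {
+  gaps <- c(0, diff(as.numeric(timestamps)))
+  cumsum(gaps >= gap) + 1
+}
+
+# made-up timestamps, given as seconds after a start time
+ts <- as.POSIXct("2016-12-15 12:00:00", tz = "UTC") + c(0, 5, 12, 40, 45, 70)
+assign_case(ts)
+```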

-## Assign a `trace` variable
+## Assign a `path` variable

-The `trace` variable is supposed to show one interaction trace with one
+The `path` variable is supposed to show one interaction path with one
 artwork, meaning it starts when an artwork is touched or flipped and stops
-when it is closed again. It is easy to assign a trace from flipping a card
+when it is closed again. It is easy to assign a path from flipping a card
 over opening (maybe several) topics and pop-ups for this artwork card until
-closing this card again. But one would like to assign the same trace to
+closing this card again. But one would like to assign the same path to
 move events surrounding this interaction. Again, this is not possible in an
-algorithmic way but only heuristically. I used the `case` variable in order
-to get meaningful units around the artworks.
+algorithmic way but only heuristically.

-If within one case only a single trace for a single artwork was opened, I
-assigned this trace to the moves associated with this artwork. It (quite
-often) happens that within one case one artwork is opened and closed
-several times, each time starting a new trace. I then assigned all the
-following move events to the trace beforehand. This is, of course,
-arbitrary and could also be handled the other way around.
-
-Another possibility is, that an artwork gets moved within one trace without
-being flipped. I then assigned a new trace to this move.
-
-This overall worked very well even though it was based on the very
-heuristic approach assigning a case when the table has not been touched for
-20 seconds. It should be kept in mind that the trace assignments for the
-moves will change when case is defined in a different way.
+I used a time cutoff for this as well. First, if a `move` event occurs, it
+is checked whether the same item has been flipped less than 20 seconds
+beforehand. If yes, the same path indicator is assigned to this `move`. If
+not, a new "move indicator" is assigned temporarily. Then, a "backward
+pass" is applied, where it is checked whether the same item is opened less
+than 20 seconds _after_ the event occurs. If yes, that path indicator is
+assigned. For all the remaining moves, a new path number is assigned. This
+corresponds to items being moved without being flipped.
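+
+A simplified sketch of this procedure is given below. It is not the code
+from the `mtt` package: `assign_move_paths()` and the toy data are
+hypothetical, each event has only a single time in seconds, the forward and
+backward check are done in one loop, and every remaining move gets its own
+new path number.
+
+```r
+# Hypothetical helper; d needs the columns item, event, time and path,
+# where path is already set for flipCard events and NA for move events
+assign_move_paths <- function(d, cutoff = 20) {
+  flips <- d[d$event == "flipCard", ]
+  for (i in which(d$event == "move")) {
+    same <- flips[flips$item == d$item[i], ]
+    lag  <- d$time[i] - same$time    # >= 0: flip happened before the move
+    before <- same[lag >= 0 & lag < cutoff, ]
+    after  <- same[lag < 0 & -lag < cutoff, ]
+    if (nrow(before) > 0) {
+      # forward check: most recent flip of the same item
+      d$path[i] <- before$path[which.max(before$time)]
+    } else if (nrow(after) > 0) {
+      # backward check: next flip of the same item
+      d$path[i] <- after$path[which.min(after$time)]
+    }
+  }
+  # remaining moves: items moved without being flipped
+  open <- which(is.na(d$path))
+  d$path[open] <- max(c(0, d$path), na.rm = TRUE) + seq_along(open)
+  d
+}
+
+# toy data with made-up values
+toy_moves <- data.frame(
+  item  = c("001", "001", "001", "002"),
+  event = c("move", "flipCard", "move", "move"),
+  time  = c(0, 10, 25, 100),
+  path  = c(NA, 1, NA, NA)
+)
+assign_move_paths(toy_moves)$path
+```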

 ## A `move` event does not record any change

 Most of the events in the log files are move events. Additionally, many of
-these move events are recorded but they do not indicate any change meaning
-the only difference is the time stamp. All other variables indicating moves
+these move events are recorded but they do not indicate any change, meaning
+the only difference is the timestamp. All other variables indicating moves
 like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc. do
-not show any change. They represent about 2/3 of all move events. These
+not show _any_ change. They represent about 2/3 of all move events. These
 events are probably short touches of the table without an actual
 interaction. They were therefore removed from the data set.
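+
+How such moves can be identified is sketched below with made-up values.
+The column names follow the variable list above. The sketch assumes there
+are no missing values in these columns and is independent of how the `mtt`
+package implements the removal.
+
+```r
+# Toy data with hypothetical values: the first move shows no change at all
+moves <- data.frame(
+  x.start = c(100, 250), x.stop = c(100, 300),
+  y.start = c(200, 400), y.stop = c(200, 380),
+  scale.start = c(1, 1), scale.stop = c(1, 1.2),
+  rotation.start = c(0, 10), rotation.stop = c(0, 15)
+)
+
+# TRUE for moves where position, scale and rotation all stay the same
+no_change <- with(moves, x.start == x.stop & y.start == y.stop &
+                         scale.start == scale.stop &
+                         rotation.start == rotation.stop)
+
+moves_kept <- moves[!no_change, ]
+moves_kept
+```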

-## Events that only close (`date.start` is NA)
-
-It looks like there is some kind of log error for the events that do not
-have a start stop. I was able to get rid of most by sorting for `popup` for
-the openPopup events, but there are still some left (50 for the small data
-set, which corresponds to 0.2 per mill). The following example shows that
-artwork "501" gets closed (line 31030) while the pop-up `sommerbau.xml`
-is still opened (line 31027). Then artwork "501" gets opened again
-(line 31035) and after that the pop-up `sommerbau.xml` is closed (line
-31040). This should not be possible and therefore (correctly) two events
-are assigned: One where the pop-up was opened and then not closed (which is
-common) and another one where the pop-up has no start.
-
-```{r}
-dat[31000:31019,]
-# Card gets flipped closed before pop-up closes --> log error!
-```
-
-I did not check all of these cases (for the complete data set this is
-simply not possible by hand) but just excluded all events that do not have
-a `date.start` since they are hard to interpret. Often they are log errors
-but in some cases they might be resolvable.
-
-```{r}
-# remove all events that do not have a `date.start`
-dim(dat2[is.na(dat2$date.start), ])
-dat2 <- dat2[!is.na(dat2$date.start), ]
-```
-
-In order to deal with these logging errors, I check the data for what I
-call "fragmented traces". These are traces that cannot happen, when
-everything is logged correctly, e.g., traces containing `flipCard ->
-openPopup` or traces that only consist of `move`, `openTopic`, and
-`openPopup` events. These fragmented traces are removed from the data. It
-was not possible to check them all manually, but the 20 or more that I do
-check in the raw log files were all some kind of logging error like above.
-Most often a card was already closed again, before a topic card or pop-up
-was recorded as being closed.
-
 ## Card indices go from 0 to 7 (instead of 0 to 5 as expected)

-See `questions_number-of-cards.R` for more details.
+In the beginning I thought that the number for topics was the index of
+where the card was presented on the back of the item. But this is not
+correct. It is the number of the topic. There are eight topics in total:

-I wrote a function that for each artwork extracts the file names of the
-possible topic cards and then looks up which topics have actually been
-displayed on the back of the card. I added an index giving the ordering in
-the index files.
-
-The possible values in the variable `topicNumber` range from 0 to 7,
-however, no artwork has more than six different numbers. So I just renamed
-those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
-to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
-assign topics and file names to the according pop-ups. This needs to be
-cross checked with the programming, but seems the most plausible approach
-with my current knowledge.
-
-<!-- TODO: Ask Philipp -->
-
-## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
-
-When I extract the topics from `index.html` I get different topics, than
-when I get them from `<artwork>.html`. At first glance, it looks like using
-`index.html` actually gives the wrong results.
-
-```{r}
-artworks <- unique(dat2$artwork)
-path <- "data/haum/ContentEyevisit/eyevisit_cards_light/"
-topics <- extract_topics(artworks, rep("index.xml", length(artworks)), path)
-topics2 <- extract_topics(artworks, paste0(artworks, ".xml"), path)
-
-topics[!topics$file_name %in% topics2$file_name, ]
-topics2[!topics2$file_name %in% topics$file_name, ]
 ```
+Indices for topics:
+0   artist
+1   thema
+2   komposition
+3   leben des kunstwerks
+4   details
+5   licht und farbe
+6   extra info
+7   technik
+```
+On the back of items, there can be between 2 and 6 topic cards. Several of
+these topic cards can be about the same topic, e.g., there can be two topic
+cards assigned to the topic `thema`. It is impossible to find out if the
+same topic card was opened several times or if different topic cards with
+the same topic were opened from the same item. See example below for item
+"001".

-For artwork "031", `index.html` only defines 5 cards (the 6th is commented
-out), but `topicNumber` for this artwork has 6 different entries. I will
-therefore extract the topics from `<artwork>.html`. (This seems also better
-compatible with other data sets like 8o8m.)
+```{r topics, echo = FALSE}
+items <- sprintf("%03d", unique(datlogs$item))
+topics <- extract_topics(items, xmlfiles = paste0(items, ".xml"),
+                         xmlpath = "data/haum/ContentEyevisit/eyevisit_cards_light/")
+head(topics)
+```

 ## New artworks "504" and "505" starting October 2022

 When I read in the complete data frame for the first time, all of a
-sudden there were 72 instead of 70 artworks. It seems like these two
+sudden there were 72 instead of 70 items. It seems like these two
 artworks appear on October 21, 2022.

-```{r}
-dat0 <- read.table("data/haum/raw_logfiles_2023-09-23_01-31-30.csv",
-                   sep = ";", header = TRUE)
-dat0$date <- as.POSIXct(dat0$date)
-dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
-
-# Remove irrelevant events
-dat <- subset(dat0, !(dat0$event %in% c("Start Application",
-                                        "Show Application")))
-
-summary(dat[dat$artwork %in% c("504", "505"), ])
+```{r newitems}
+summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
 ```

-The artworks seem to be have updated in general after October 21, 2022.
+The artworks seem to have been updated in general after October 21, 2022. The
+following table shows which items were presented in which years.

-```{r}
-art_after_oct2022 <- sort(unique(dat[dat$date >= "2022-10-21", "artwork"]))
-art_before_oct2022 <- sort(unique(dat[dat$date <= "2022-10-21", "artwork"]))
-# Removed artworks
-art_before_oct2022[!art_before_oct2022 %in% art_after_oct2022]
-# Additional artworks
-art_after_oct2022[!art_after_oct2022 %in% art_before_oct2022]
+```{r years}
+xtabs(~ item + lubridate::year(date.start), datlogs)
 ```

-The following table shows which artworks were presented in which years.
-
-```{r}
-xtabs(~ artwork + lubridate::year(date), dat)
-```
-
-It strongly suggests that the artworks haven been updated after the Corona
-pandemic. I think, the table was also moved to a different location at that
-point. (Check with PG to make sure.)
-
-# Optimizing resources used by the code
-
-After I started trying out the functions on the complete data set, it
-became obvious (not surprisingly `:)`) that this will not work --
-especially for the move events. The reshape function cannot take a long
-data frame with over 6 Million entries and convert it into a wide data
-frame (at least not on my laptop). The code is supposed to work "out of the
-box" for researchers, hence it *should* run on a regular (8 core) laptop.
-So, I changed the reshaping so that it is done in batches on subsets of the
-data for every `fileId` separately. This means that events that span over
-two (or more) raw log files cannot be closed and will then be removed from
-the data set. The function warns about this, but it is a random process
-getting rid of these data and seems therefore not like a systematic
-problem. Another reason why this is not bad, is that durations cannot be
-calculated for events across log files anyways, because the time stamps do
-not increase systematically over log files (see above).
-
-UPDATE: By now, I close the events spanning more than one log file after
-this has been done.
-
-I meant to put the lists back together with `do.call(rbind, some_list)` but
-this can also not handle big data sets. I therefore switched to
-`dplyr::bind_rows(some_ist)` which is really fast and was developed
-especially for this purpose. It means, that I have to depend on the dplyr
-package (which I am not a big fan of, since I meant to keep the package
-self-contained).
-
-# Reading list
-
-* @Arizmendi2022 [--]
-* @Bannert2014 [x]
-* @Bousbia2010 [--]
-* @Cerezo2020
-* @GerjetsSchwan2021 [x]
-* @Goldhammer2020
-* @Guenther2007
-* @HuberBannert2023 [x]
-* @Kroehne2018
-* @SchwanGerjets2021 [x]
-* @vanderAalst2016 [Chap. 2, x]
-* @vanderAalst2016 [Chap. 3]
-* @vanderAalst2016 [Chap. 5, x]
-* @Wang2019
-
-# Open stuff
-
-* Angle from which people approach table in Braunschweig? Consider in
-  rotation variable?
-* Time limit for `case` variable different for different events? (openTopic
-  should be opened the longest)
-
-  $\to$ I think this is not relevant since I am looking at time *between*
-  events!
-
-# Stuff AK found interesting
-
-* Pre/post corona
-* Identify school classes
-* How many persons are present at the table?
-
-# Other potential questions
-
-* "Bursts"
-* 1st vs. 2nd half of the day
-* Can we identify "types of art"? With clustering or something?
-* Possible to estimate how many persons per day? Maybe average of certain
-  weekdays? ... ?
+It shows that the artworks have been updated after the Corona pandemic. I
+think the table was also moved to a different location at that point.

@@ -0,0 +1,577 @@
+Log data from the Multi-Touch Table at the HAUM
+================
+
+The Multi Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in
+Braunschweig gives visitors of the museum the opportunity to interact
+with about 70 artworks and 3 virtual cards containing information about
+the museum and its layout. The table was installed at the institute in
+October 2016 and since November 2016 log files from interactions of
+visitors of the museum have been collected. These log files are in an
+unstructured format and cannot be easily analyzed. The purpose of the
+following document is to describe how the data have been transformed
+and which decisions have been made along the way.
+
+# Data structure
+
+The log files contain lines that indicate the beginning and end of
+possible activities that can be performed when interacting with the
+artworks on the table. The layout of the table looks as if pictures had
+been tossed on a large table. Every artwork is visible in the start
+configuration. People can move the pictures on the table, and they can
+scale and rotate them. Additionally, the virtual picture cards can be
+flipped in order to find more information about the artwork on the “back”
+of the card. To do so, one has to press a little `i` in one of the
+bottom corners of the card. On the back of the card, two to six
+information cards can be found with a teaser text about a certain topic.
+These topic cards can be opened and a hypertext with detailed
+information opens. Within these hypertexts certain technical terms can
+be clicked for lay people to get more information. This also opens up a
+pop-up. The events encoded in the raw log files therefore have the
+following structure.
+
+    "Start Application"     --> Start Application
+    "Show Application"
+    "Transform start"       --> Move
+    "Transform stop"
+    "Show Info"             --> Flip Card
+    "Show Front"
+    "Artwork/OpenCard"      --> Open Topic
+    "Artwork/CloseCard"
+    "ShowPopup"             --> Open Popup
+    "HidePopup"
+
+The right side shows what events can be extracted from these raw lines.
+The “Start Application” is not an event in the original sense since it
+only indicates if the table was started or maybe reset itself. This is
+not an interaction with the table and therefore not interesting in
+itself. All “Start Application” and “Show Application” lines are
+therefore excluded from the data during further processing and only
+remain in the raw log files.
+
+# Parsing the raw log files
+
+The first step is to parse the raw log files that are stored by the
+application as text files in a rather unstructured format to a format
+that can be read by common statistics software packages. The data are
+therefore transferred to a spreadsheet format. The following section
+describes what problems were encountered while doing this.
+
+## Corrupt lines
+
+When reading the files containing the raw logs into R, a warning appears
+that says
+
+    Warning messages:
+      incomplete final line found on '2016/2016_11_18-11_31_0.log'
+      incomplete final line found on '2016/2016_11_18-11_38_30.log'
+      incomplete final line found on '2016/2016_11_18-11_40_36.log'
+      ...
+
+When you open these files, it looks like the last line contains some
+binary content. It is unclear why and how this happens. When reading
+the data, these lines are therefore removed, and a warning indicates
+how many files were affected.
+
+## Extracted variables from raw log files
+
+The following variables (columns in the data frame) are extracted from
+the raw log file:
+
+- `fileId`: Containing the zero-left-padded file name of the raw log
+  file the data line has been extracted from
+
+- `folder`: The names of the folders in which the raw log files have
+  been organized. For the HAUM data set, the data are sorted by year
+  (folders 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
+
+- `date`: Extracted timestamp from the raw log file in the format
+  `yyyy-mm-dd hh:mm:ss`.
+
+- `timeMs`: Containing a timestamp in milliseconds that restarts with
+  every new raw log file.
+
+- `event`: Start and stop event tags. See above for possible values.
+
+- `item`: Identifier of the different items. This is a three-digit
+  (left-padded) number. The numbers of the items correspond to the
+  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
+  originally taken from the museum's catalogue.
+
+- `popup`: Name of the pop-up opened. This is only interesting for
+  “openPopup” events.
+
+- `topic`: The number of the topic card that has been opened at the back
+  of the item card. See below for a more detailed description of what
+  these numbers mean.
+
+- `x`: Value of x-coordinate in pixel on the 4K-Display
+  ($3840 \times 2160$)
+
+- `y`: Value of y-coordinate in pixel
+
+- `scale`: Number in 128 bit that indicates how much the card has been
+  scaled
+
+- `rotation`: Degree of rotation in start configuration.
+
+<!-- TODO: After what time interval does the table reset itself to the
+  start configuration? -> PM needs to look it up -->
+
+## Variables after “closing of events”
+
+The raw log data consist of start and stop events for each event type.
+After preprocessing four event types are extracted: `move`, `flipCard`,
+`openTopic`, and `openPopup`. Except for the `move` events, which can
+occur at any time when interacting with an item card on the table, the
+events have a hierarchical order: An item card first needs to be flipped
+(`flipCard`), then the topic cards on the back of the card can be opened
+(`openTopic`), and finally pop-ups on these topic cards can be opened
+(`openPopup`). This implies that the event `openPopup` can only be
+present for a certain item, if the card has already been flipped (i.e.,
+an event `flipCard` for the same item has already occurred).
+
+After preprocessing, the data frame is now in a wide format with columns
+for the start and the stop of each event and contains the following
+variables:
+
+- `fileId.start` / `fileId.stop`: See above.
+
+- `date.start` / `date.stop`: See above.
+
+- `folder`: Containing the folder name (see above)
+
+- `case`: A numerical variable indicating cases in the data. A “case”
+  indicates an interaction interval and could be defined in different
+  ways. Right now a new case begins, when no event occurred for 20
+  seconds or longer.
+
+- `path`: A path is defined as one interaction with one item. A path can
+  either start with a `flipCard` event or when an item has been touched
+  for the first time within this case. A path ends with the item card
+  being flipped closed again or with the last movement of the card within
+  this case. One case can contain several paths with the same item when
+  the item is flipped open and flipped closed again several times within
+  a short time.
+
+- `glossar`: An indicator variable with values 0/1 that tracks if a
+  pop-up has been opened from the glossar folder. These pop-ups can be
+  assigned to the wrong item since it is not possible to do this
+  algorithmically. It is possible that two items are flipped open that
+  could both link to the same pop-up from a glossar. The indicator
+  variable is left as a variable, so that these pop-ups can be easily
+  deleted from the data. Right now, glossar entries can be ignored
+  completely by setting an argument and this is done by default. Using
+  the pop-ups from the glossar will need a lot more love, before it
+  behaves satisfactorily.
+
+- `event`: Indicating the event. Can take the values `move`, `flipCard`,
+  `openTopic`, and `openPopup`.
+
+- `item`: Identifier of the different artworks and information cards.
+  This is a three-digit (left-padded) number. See above.
+
+- `timeMs.start` / `timeMs.stop`: See above.
+
+- `duration`: Calculated by $timeMs.stop - timeMs.start$ in
+  milliseconds. For events spanning more than one log file,
+  $600,000 \times (\text{number of log files} - 1)$ must be added. See
+  below for details.
+
+- `topic`: See above.
+
+- `popup`: See above.
+
+- `x.start` / `x.stop`: See above.
+
+- `y.start` / `y.stop`: See above.
+
+- `distance`: Euclidean distance calculated from $(x.start, y.start)$
+  and $(x.stop, y.stop)$.
+
+- `scale.start` / `scale.stop`: See above.
+
+- `scaleSize`: Relative scaling of item card, calculated by
+  $\frac{scale.stop}{scale.start}$.
+
+- `rotation.start` / `rotation.stop`: See above.
+
+- `rotationDegree`: Difference of rotation from $rotation.stop$ to
+  $rotation.start$.
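+
+The derived variables above can be sketched in a few lines of base R.
+This is only an illustration on made-up values, not the package code; in
+particular, the sign convention for `rotationDegree` (stop minus start)
+is an assumption.
+
+``` r
+# Toy data with made-up values for two closed move events
+ev <- data.frame(
+  x.start = c(100, 2000), x.stop = c(400, 2000),
+  y.start = c(100, 1000), y.stop = c(500, 1000),
+  scale.start = c(0.3, 0.3), scale.stop = c(0.6, 0.3),
+  rotation.start = c(10, 13), rotation.stop = c(25, 13)
+)
+
+# Euclidean distance between start and stop position
+ev$distance <- sqrt((ev$x.stop - ev$x.start)^2 + (ev$y.stop - ev$y.start)^2)
+# Relative scaling: values > 1 mean the card was enlarged
+ev$scaleSize <- ev$scale.stop / ev$scale.start
+# Assumed sign convention: stop minus start
+ev$rotationDegree <- ev$rotation.stop - ev$rotation.start
+ev[, c("distance", "scaleSize", "rotationDegree")]
+```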
+
+## How unclosed events are handled
+
+Events do not necessarily need to be completed. A person can, e.g.,
+leave the table and not flip the item card closed again. For `flipCard`,
+`openTopic`, and `openPopup` the data frame contains `NA` when the event
+does not complete. For `move` events it happens quite often that a start
+event follows a start event and a stop event follows a stop event.
+Technically a move event cannot *not* be finished and the number of
+events without a start or stop indicates that the time resolution was not
+sufficient to catch all these events accurately. Double start and stop
+`move` events have therefore been deleted from the data set.
+
+## Additional meta data
+
+For the HAUM data, I added meta data on state holidays and school
+vacations.
+
+This led to the following additional variables:
+
+- `holiday`
+
+- `vacations`
+
+# Problems and how I handled them
+
+This lists some problems with the log data that required decisions.
+These decisions influence the outcome and maybe even the data quality.
+Hence, I tried to document how I handled these problems and explain the
+decisions I made.
+
+## Weird behavior of `timeMs` and neg. `duration` values
+
+`timeMs` resets itself every time a new log file starts. This means that
+the durations of events spanning more than one log file must be
+adjusted. Instead of just calculating $timeMs.stop - timeMs.start$,
+`timeMs.start` must be subtracted from the maximum duration of the log
+file where the event started ($600,000$ ms) and `timeMs.stop` must
+be added. If the event spans more than two log files, a multiple of
+$600,000$ must be used, e.g., for three log files it must be
+$2 \times 600,000 - timeMs.start + timeMs.stop$, and so on.
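+
+As a rough sketch of this adjustment (not the package code), assume a
+hypothetical column `nfiles` that counts how many log files an event
+spans; it would have to be derived from `fileId.start` and
+`fileId.stop`. All values are made up.
+
+``` r
+# Toy events; `nfiles` is a hypothetical helper column, not in the data
+ev <- data.frame(
+  timeMs.start = c(1000, 599671, 590000),
+  timeMs.stop  = c(4000,    621,   5000),
+  nfiles       = c(1, 2, 3)
+)
+
+# Each log file after the first adds 10 minutes (600,000 ms)
+ev$duration <- (ev$nfiles - 1) * 600000 - ev$timeMs.start + ev$timeMs.stop
+ev$duration
+```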
+
+![](README_files/figure-gfm/timems-1.png)<!-- -->
+
+The boxplot shows that we have a continuous range of values within one
+log file but that `timeMs` does not increase over log files. I kept
+`timeMs.start` and `timeMs.stop` and also `fileId.start` and
+`fileId.stop` in the data frame, so it is clear when events span more
+than one log file.
+
+<!--
+Information from the programmer:
+
+"I also just went through the code from back then. The logging works
+like this: When the application starts, a new log file is created every
+10 minutes. The start time from which the duration is calculated is
+reset each time. So duration is not "time since start of the
+application" but "time since restart of the logger". Your assumption is
+therefore correct - there should be no durations >10 minutes. The first
+entry of a log file can be anything between 0 and 10 minutes (depending
+on whether the table was in use at the time of the new logging
+interval). So if a case is spread over 2+ logs, you have to add 10
+minutes to the duration for each log file after the first, so that it
+fits."
+-->
+
+## Left padding of file IDs
+
+The file names of the raw log files are automatically generated and
+contain a timestamp. This timestamp is not well formed. First, it
+contains an incorrect month. The months go from 0 to 11, which means
+that the file `2016_11_15-12_12_57.log` was collected on December
+15, 2016 at 12:12 pm. Another problem is that the file names are not
+zero left padded, e.g., `2016_11_15-12_2_57.log`. This file was
+collected on December 15, 2016 at 12:02 pm and therefore before the file
+above. But most sorting algorithms will sort these files in the order
+shown below. In order to preprocess the data and close events that
+belong together, the data need to be sorted by events and artworks
+repeatedly. In order to get them back in the correct time order, it is
+necessary to order them based on three variables: `fileId.start`,
+`date.start` and `timeMs.start`. The file IDs therefore need to sort in
+the correct order (again see below for an example). I zero left padded
+the log file names within the data frame and use them as an identifier.
+These “file names” do not correspond exactly to the original raw log
+file names. This needs to be kept in mind when doing any kind of
+matching etc.
+
+    ## what it looked like before left padding
+    # 1422  ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56  599671 Transform start     076 076.xml   NA 2092.25 2008.00 0.3000000   13.26874254
+    # 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57     621 Transform start     076 076.xml   NA 2092.25 2008.00 0.3000000   13.26523465
+    # 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57     677  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997736   13.26239605
+    # 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57     774 Transform start     076 076.xml   NA 2092.25 2008.00 0.2999345   13.26239605
+    # 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57     850  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997107   13.26223362
+    # 1427  ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57  599916  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997771   13.26523465
+
+    ## what it looks like now
+    # 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56  599671 Transform start     076 076.xml   NA 2092.25 2008.00 0.3000000   13.26874254
+    # 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57  599916  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997771   13.26523465
+    # 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57     621 Transform start     076 076.xml   NA 2092.25 2008.00 0.3000000   13.26523465
+    # 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57     677  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997736   13.26239605
+    # 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57     774 Transform start     076 076.xml   NA 2092.25 2008.00 0.2999345   13.26239605
+    # 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57     850  Transform stop     076 076.xml   NA 2092.25 2008.00 0.2997107   13.26223362
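+
+A minimal sketch of how such zero left padding could be done with a
+regular expression is shown here. The helper name `pad_file_id()` is
+made up for illustration and is not the function used in the package.
+
+``` r
+# Pad every single-digit number in a file name with a leading zero
+# (hypothetical helper)
+pad_file_id <- function(x) {
+  gsub("(?<![0-9])([0-9])(?![0-9])", "0\\1", x, perl = TRUE)
+}
+
+ids <- c("2016_11_15-12_12_57.log", "2016_11_15-12_2_57.log")
+# Byte-wise sorting puts the later file (12:12) first
+sort(ids, method = "radix")
+# After padding, the file from 12:02 sorts before the one from 12:12
+sort(pad_file_id(ids), method = "radix")
+```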
+
+## Timestamps repeat
+
+The timestamps in the `date` variable record year, month, day, hour,
+minute and seconds. Since one second is not a very short time interval
+for a move on a touch display, this is not fine grained enough to bring
+events into the correct order, meaning there are events from the same
+log file having the same timestamp and even events from different log
+files having the same timestamp. The log files get written about every
+10 minutes (which can easily be seen when looking at the file names of
+the raw log files). So in order to get events in the correct order, it
+is necessary to first order by file ID, then within file ID sort by
+timestamp `date`, and then within these more coarse grained timestamps
+sort by `timeMs`. But as explained above, `timeMs` can only be sorted
+within one file ID, since it does not increase consistently over log
+files, but restarts with each raw log file.
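+
+The three-level ordering can be sketched with `order()` on a toy data
+frame. The rows are made up (modeled on the example in the previous
+section) and the file IDs are assumed to be zero left padded already.
+
+``` r
+# Toy rows in scrambled order; fileId is already zero left padded
+raw <- data.frame(
+  fileId = c("2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log",
+             "2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log"),
+  date   = as.POSIXct(c("2016-12-15 12:12:57", "2016-12-15 12:12:57",
+                        "2016-12-15 12:12:57", "2016-12-15 12:12:56"),
+                      tz = "UTC"),
+  timeMs = c(677, 599916, 621, 599671)
+)
+
+# Order by file ID first, then timestamp, then milliseconds within file
+raw <- raw[order(raw$fileId, raw$date, raw$timeMs), ]
+raw$timeMs
+```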
+
+## x,y-coordinates outside of display range
+
+The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
+pixels. When you plot the start and stop coordinates, the display is
+clearly distinguishable. However, a lot of points are outside of the
+display range. This can happen, when the art objects are scaled and then
+moved to the very edge of the table. Then it will record pixels outside
+of the table. These are actually valid data points and I will leave them
+as is.
+
+``` r
+datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
+                      header = TRUE)
+
+par(mfrow = c(1, 2))
+plot(y.start ~ x.start, datlogs)
+abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
+plot(y.stop ~ x.stop, datlogs)
+abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
+```
+
+![](README_files/figure-gfm/xycoord-1.png)<!-- -->
+
+``` r
+aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
+```
+
+    ##    x.start   x.stop  y.start   y.stop
+    ## 1 1978.202 1975.876 1137.481 1133.494
+
+## Pop-ups from glossar cannot be assigned to a specific item
+
+All the information, pictures and texts for the topics and pop-ups are
+stored in
+`/data/haum/ContentEyevisit/eyevisit_cards_light/<item_number>`. Among
+other things, each folder contains XML-files with the information about
+any technical terms that can be opened from the hypertexts on the topic
+cards. Often this information is item dependent and then the
+corresponding XML-file is in the folder for this item. Sometimes,
+however, more general terms can be opened. In order to avoid multiple
+files containing the same information, these were stored in a folder
+called `glossar` and get accessed from there. The raw log files only
+contain the path to this glossar entry and did not record from which
+item it was accessed. I tried to assign these glossar entries to the
+correct items. The (very heuristic) approach was this:
+
+1.  Create a lookup table with all XML-file names (possible pop-ups)
+    from the glossar folder and what items possibly call them. This was
+    stored as an `RData` object for easier handling but should maybe be
+    stored in a more interoperable format.
+
+2.  I went through all possible pop-ups in this lookup table and stored
+    the items that are associated with it.
+
+3.  I created a sub data frame without move events (since they can never
+    be associated with a pop-up) and went through every line and looked
+    up if an item and a topic card had been opened. If this was the case
+    and a glossar entry came up before the item was closed again, I
+    assigned this item to the glossar entry.
+
+This is heuristic since it is possible that several topic cards from
+different items are opened simultaneously and the glossar pop-up could
+be opened from either one (it could even be more than two, of course).
+In these cases the item that was opened closest to the glossar pop-up
+has been assigned, but this can never be completely error free.
+
+And this heuristic only assigns a little more than half of the glossar
+entries. Since it only checks whether the last item that has been
+opened is a possible candidate, it misses all glossar pop-ups where
+another item has been opened in between. Writing a more elaborate
+algorithm is still an open TODO.
+
+All glossar pop-ups that do not get matched with an item are removed
+from the data set with a warning if the argument `glossar = TRUE` is
+set. Otherwise the glossar entries will be ignored completely.
+
+## Assign a `case` variable based on “time heuristic”
+
+One thing needed in order to work with the data set and use it for
+machine learning algorithms like process mining, is a variable that
+tries to identify a case. A case variable will structure the data frame
+in a way that navigation behavior can actually be investigated. However,
+we do not know if several people are standing around the table
+interacting with it or just one very active person. The simplest way to
+define a case variable is to just use a time limit between events. This
+means that when the table has not been interacted with for, e.g., 20
+seconds, it is assumed that a person moved on and a new person
+started interacting with the table. This is the easiest heuristic and
+implemented at the moment. Process mining shows that this simple
+approach works in a way that the correct process gets extracted by the
+algorithm.
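+
+A minimal sketch of such a time heuristic, assuming the events are
+already sorted and using made-up timestamps. For simplicity the gap is
+measured between consecutive start times, which may differ from what the
+package does.
+
+``` r
+# Made-up start times of consecutive events (already in correct order)
+start <- as.POSIXct("2016-12-15 12:00:00", tz = "UTC") + c(0, 5, 30, 35, 100)
+
+# Seconds elapsed since the previous event
+gap <- as.numeric(diff(start), units = "secs")
+# A new case starts whenever 20 seconds or more passed without an event
+case <- cumsum(c(TRUE, gap >= 20))
+case
+```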
+
+In order to investigate user behavior on a more fine grained level, it
+will be necessary to come up with a more elaborate approach. A better,
+still simple approach, could be to use this kind of time limit and
+additionally look at the distance between items interacted with within
+one time window. When items are far apart it seems plausible that more
+than one person interacted with them. Very short time lapses between
+events on different items could also be an indicator that more than one
+person is interacting with the table.
+
+## Assign a `path` variable
+
+The `path` variable is supposed to show one interaction trace with one
+artwork. Meaning it starts when an artwork is touched or flipped and
+stops when it is closed again. It is easy to assign a path from flipping
+a card over opening (maybe several) topics and pop-ups for this artwork
+card until closing this card again. But one would like to assign the
+same path to move events surrounding this interaction. Again, this is
+not possible in an algorithmic way but only heuristically.
+
+Again, I used a time cutoff for this. First, if a `move` event occurs,
+it is checked, if the same item has been flipped less than 20 seconds
+beforehand. If yes, the same path indicator is assigned to this `move`.
+If not, temporarily a new “move indicator” is assigned. Then, a
+“backward pass” is applied, where it is checked if the same item is
+opened less than 20 seconds *after* the event occurs. If yes, that path
+indicator is assigned. For all the remaining moves, a new path number is
+assigned. This corresponds to items being moved without being flipped.
+
+## A `move` event does not record any change
+
+Most of the events in the log files are move events. Additionally, many
+of these move events are recorded but they do not indicate any change,
+meaning the only difference is the timestamp. All other variables
+indicating moves like `x.start` and `x.stop`, `rotation.start` and
+`rotation.stop` etc. do not show *any* change. They represent about 2/3
+of all move events. These events are probably short touches of the table
+without an actual interaction. They were therefore removed from the data
+set.
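+
+Such a filter could look like the following sketch on toy data. The
+values are made up, and the package may check additional variables or
+use a different rule.
+
+``` r
+# Toy events: the second move records no change at all
+ev <- data.frame(
+  event = c("move", "move", "flipCard"),
+  x.start = c(100, 500, 700), x.stop = c(150, 500, 700),
+  y.start = c(100, 500, 700), y.stop = c(100, 500, 700),
+  scale.start = c(0.3, 0.3, 0.3), scale.stop = c(0.3, 0.3, 0.3),
+  rotation.start = c(10, 20, 30), rotation.stop = c(10, 20, 30)
+)
+
+# TRUE if position, scale, and rotation are identical at start and stop
+unchanged <- ev$x.start == ev$x.stop & ev$y.start == ev$y.stop &
+  ev$scale.start == ev$scale.stop &
+  ev$rotation.start == ev$rotation.stop
+# Drop only move events without any change; keep all other events
+ev <- ev[!(ev$event == "move" & unchanged), ]
+ev$event
+```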
+
+## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
+
+In the beginning I thought that the number for topics was the index of
+where the card was presented on the back of the item. But this is not
+correct. It is the number of the topic. There are eight topics in total:
+
+    Indices for topics:
+    0   artist
+    1   thema
+    2   komposition
+    3   leben des kunstwerks
+    4   details
+    5   licht und farbe
+    6   extra info
+    7   technik
+
+On the back of items, there can be between 2 and 6 topic cards. Several
+of these topic cards can be about the same topic, e.g., there can be two
+topic cards assigned to the topic `thema`. It is impossible to find out
+if the same topic card was opened several times or if different topic
+cards with the same topic were opened from the same item. See example
+below for item “001”.
+
+    ##   item            file_name                topic
+    ## 1  001 001_dargestellte.xml                thema
+    ## 2  001       001_thema1.xml                thema
+    ## 3  001        001_leben.xml leben des kunstwerks
+    ## 4  001       001_leben3.xml leben des kunstwerks
+    ## 5  001       001_thema2.xml                thema
+    ## 6  001        001_thema.xml                thema
+
+## New artworks “504” and “505” starting October 2022
+
+When I read in the complete data frame for the first time, all of a
+sudden there were 72 instead of 70 items. It seems like these two
+artworks appear on October 21, 2022.
+
+``` r
+summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
+```
+
+    ##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max. 
+    ## "2022-10-21" "2023-01-11" "2023-03-08" "2023-03-09" "2023-05-21" "2023-07-05"
+
+The artworks seem to have been updated in general after October 21, 2022.
+The following table shows which items were presented in which years.
+
+``` r
+xtabs(~ item + lubridate::year(date.start), datlogs)
+```
+
+    ##      lubridate::year(date.start)
+    ## item   2016  2017  2018  2019  2020  2022  2023
+    ##   1     277  4082  1912  1434   424   394  1315
+    ##   3     485  6730  3126  2356   528   457  1124
+    ##   19    714  8656  4028  2743   660   698  1595
+    ##   20    595  8461  3996  2983   938   657  1355
+    ##   24    497  6638  2912  2251   649   439  1028
+    ##   27    567  5959  3112  2318   651   711  1324
+    ##   28    601  9329  4394  3056   778   762  1570
+    ##   29    425  6865  3830  2365   516   615  1174
+    ##   31    289  4118  2051  1218   291   296   675
+    ##   32    562  7016  3477  2253   726   766  1647
+    ##   33    509  4936  2242  1449   555   358   666
+    ##   36    434  4505  2276  1668   373   387   976
+    ##   37    242  4478  2182  1554   339   423  1168
+    ##   38    480  4617  2144  1397   371   381   784
+    ##   39    395  3227  1313  1003   237   161   622
+    ##   41    282  3329  1303  1022   225   209   701
+    ##   42    203  3113  1307   903   242   191   421
+    ##   43    115  2420  1089   806   176   219   486
+    ##   45   1491 13561  5924  4474   966   585  1828
+    ##   46    903  9181  5340  3812   961   944  1648
+    ##   47    306  4949  2395  1510   750   297   675
+    ##   48    723 10455  5384  4162  1328   948  2031
+    ##   49    433  4326  2124  1414   434   431   809
+    ##   51    564  7837  4577  2991   884   659  1370
+    ##   52    447  5021  2104  1729   471   349   840
+    ##   54    424  5068  2816  2008   529   370   918
+    ##   55    358  4859  2069  1428   341   403  1303
+    ##   57    860 14264  6625  5092  1410  1221  2714
+    ##   60    555  6865  3539  2336   639   586  1415
+    ##   62    547  6736  3803  2210   795   633  1322
+    ##   63    251  3677  1827  1241   300   282   527
+    ##   66    552  6004  2774  1977   505   373   932
+    ##   69    394  3730  1827  1438   272   206   680
+    ##   70    226  3766  1843   973   293   268   703
+    ##   71    557  6160  2490  1846   570   323   839
+    ##   72    426  6194  2857  2129   508   635  1553
+    ##   73    432  6125  2880  1821   583   395   939
+    ##   75    258  5885  2418  1562   369   257   645
+    ##   76    861 12435  6253  4214  1753  1153  2268
+    ##   77    816  8595  4197  2897   699   674  1452
+    ##   78    410  5632  2498  1924   394   408   850
+    ##   80   1650 25687 12429  7782  1975  1712  4433
+    ##   83    644  8618  4720  3026   987  1027  2294
+    ##   84    184  2121  1231   759   231   254   465
+    ##   87    149  1618   722   632    99     0     0
+    ##   88    513  6996  3493  2272   539   533  1420
+    ##   89    214  2204   950   723   156     0     0
+    ##   90    281  3756  1372  1143   403   320   932
+    ##   93    613  8528  4224  3015   696  1174  2058
+    ##   98    462  6662  3265  2565   704   670  1453
+    ##   99    180  4162  1653  1454   363   411   868
+    ##   101   414  4209  1859  1282   392   411   981
+    ##   103   677  8758  4366  3165  1045   909  1871
+    ##   104   423  5256  2381  1865   463   467   933
+    ##   107   181  2101  1106   788   205   146   339
+    ##   109   321  4001  1619  1106   292   188   453
+    ##   110   489  5846  2785  2008   494   387   923
+    ##   125   640  8435  4519  3334   926     0     0
+    ##   129   598 11322  5046  3369   910  1131  1682
+    ##   145   419  7821  3945  2694   706   740  1396
+    ##   176   507  8465  3968  2787   687   552  1544
+    ##   180   516  7563  3720  2765   585   550  1272
+    ##   183   377  4014  1819  1741   346   251   675
+    ##   187   340  4222  2165  1753   319   312   734
+    ##   197   426  7710  3603  2510   671   602  1217
+    ##   229   303  4872  2360  1891   482   389  1005
+    ##   231   271  3606  1851  1239   318   236   467
+    ##   501  1915 15968  7849  5060  1157   890  2989
+    ##   502  1212 14550  7111  4749  1105   883  2752
+    ##   503  1308 15218  8632  6399  1626   870  2558
+    ##   504     0     0     0     0     0   363   662
+    ##   505     0     0     0     0     0   426  1533
+
+It shows that the artworks have been updated after the Corona pandemic.
+I think the table was also moved to a different location at that point.