---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  fig.path = "man/figures/README-"
)
devtools::load_all()
```

# R package mtt

![mtt package](man/figures/logo.png)

This package was created to process log files obtained from multi-touch
tables at the Leibniz-Institut für Wissensmedien (IWM).

## Installation

It can be installed via

`devtools::install_git("https://gitea.iwm-tuebingen.de/R/mtt.git")`

If you get an error message, you probably need to install `git2r` first with

`install.packages("git2r")`.

The package depends on the following R packages

* `dplyr`
* `pbapply`
* `XML`
* `lubridate`

so make sure they are installed as well.

# Multi-Touch Table

The multi-touch table at the Herzog-Anton-Ulrich-Museum (HAUM) in
Braunschweig gives visitors of the museum the opportunity to interact with
about 70 artworks and 3 virtual cards containing information about the
museum and its layout. The table was installed at the museum in October
2016, and log files from visitors' interactions have been collected since
November 2016. These log files are in an unstructured format
and cannot be easily analyzed. The purpose of the following document is to
describe how the data have been transformed and which decisions have been
made along the way.

<!--
The implementation of the steps described here can be found at:
https://gitea.iwm-tuebingen.de/R/mtt.
-->

# Data structure

The log files contain lines that indicate the beginning and end of the
activities that can be performed when interacting with the artworks on the
table. The layout of the table looks as if pictures have been tossed on a
large table. Every artwork is visible in the start configuration. People
can move the pictures on the table, and they can scale and rotate them.
Additionally, the virtual picture cards can be flipped in order to find
more information about the artwork on the "back" of the card. To do so, one
has to press a little `i` in one of the bottom corners of the card.
On the back of the card, two to six information cards can be found, each
with a teaser text about a certain topic. These topic cards can be opened,
which shows a hypertext with detailed information. Within these hypertexts,
certain technical terms can be clicked by lay people to get more
information; this opens a pop-up. The events encoded in the raw log files
therefore have the following structure.

```
"Start Application"     --> Start Application
"Show Application"
"Transform start"       --> Move
"Transform stop"
"Show Info"             --> Flip Card
"Show Front"
"Artwork/OpenCard"      --> Open Topic
"Artwork/CloseCard"
"ShowPopup"             --> Open Popup
"HidePopup"
```

The right side shows which events can be extracted from these raw lines.
"Start Application" is not an event in the original sense since it only
indicates that the table was started or maybe reset itself. This is not an
interaction with the table and therefore not interesting in itself. All
"Start Application" and "Show Application" lines are therefore excluded from
the data during further processing and remain only in the raw log files.
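
The mapping above can be written down as a small lookup. This is only a minimal sketch and not the package's implementation: the object name `event_map` and the toy vector `raw` are made up for illustration, while the event names are the ones used later in this document.

```r
# Raw start/stop tags from the log files mapped to event names
# (`event_map` is a hypothetical name, not part of the package)
event_map <- c(
  "Transform start"   = "move",
  "Transform stop"    = "move",
  "Show Info"         = "flipCard",
  "Show Front"        = "flipCard",
  "Artwork/OpenCard"  = "openTopic",
  "Artwork/CloseCard" = "openTopic",
  "ShowPopup"         = "openPopup",
  "HidePopup"         = "openPopup"
)

# Toy example of raw tags as they could appear in a log file
raw <- c("Start Application", "Transform start", "Show Info", "Transform stop")

# Drop the application start-up lines, then translate the remaining tags
raw <- raw[!raw %in% c("Start Application", "Show Application")]
events <- unname(event_map[raw])
events
```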

# Parsing the raw log files

The first step is to parse the raw log files, which the application stores
as text files in a rather unstructured format, into a format that
can be read by common statistics software packages. The data are therefore
transferred to a spreadsheet format. The following section describes which
problems were encountered while doing this.

## Corrupt lines

When reading the files containing the raw logs into R, a warning appears
that says

```
Warning messages:
  incomplete final line found on '2016/2016_11_18-11_31_0.log'
  incomplete final line found on '2016/2016_11_18-11_38_30.log'
  incomplete final line found on '2016/2016_11_18-11_40_36.log'
  ...
```

When you open these files, it looks like the last line contains some binary
content. It is unclear why and how this happens. So when reading the data,
these lines are removed. A warning is given that indicates how many
files have been affected.

## Extracted variables from raw log files

The following variables (columns in the data frame) are extracted from the
raw log file:

* `fileId`: Containing the zero-left-padded file name of the raw log file
  the data line has been extracted from

* `folder`: The names of the folders in which the raw log files have been
  organized. For the HAUM data set, the data are sorted by year (folders
  2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).

* `date`: Extracted timestamp from the raw log file in the format
  `yyyy-mm-dd hh:mm:ss`.

* `timeMs`: Containing a timestamp in milliseconds that restarts with
  every new raw log file.

* `event`: Start and stop event tags. See above for possible values.

* `item`: Identifier of the different items. This is a three-digit
  (left-padded) number. The numbers of the items correspond to the
  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
  originally taken from the museum's catalogue.

* `popup`: Name of the pop-up opened. This is only interesting for
  "openPopup" events.

* `topic`: The number of the topic card that has been opened at the back of
  the item card. See below for a more detailed description what these
  numbers mean.

* `x`: Value of x-coordinate in pixel on the 4K-Display ($3840 \times 2160$).

* `y`: Value of y-coordinate in pixel.

* `scale`: Number in 128 bit that indicates how much the card has been
  scaled.

* `rotation`: Degree of rotation from start configuration.

<!-- TODO: After which time interval does the table reset itself to the
  start configuration? -->

## Variables after "closing of events"

The raw log data consist of start and stop events for each event type.
After preprocessing four event types are extracted: `move`, `flipCard`,
`openTopic`, and `openPopup`. Except for the `move` events, which can occur
at any time when interacting with an item card on the table, the events
have a hierarchical order: An item card first needs to be flipped
(`flipCard`), then the topic cards on the back of the card can be opened
(`openTopic`), and finally pop-ups on these topic cards can be opened
(`openPopup`). This implies that the event `openPopup` can only be present
for a certain item if the card has already been flipped (i.e., an event
`flipCard` for the same item has already occurred).

After preprocessing, the data frame is now in a wide format with columns
for the start and the stop of each event and contains the following
variables:

* `fileId.start` / `fileId.stop`: See above.

* `date.start` / `date.stop`: See above.

* `folder`: Containing the folder name (see above).

* `case`: A numerical variable indicating cases in the data. A "case"
  indicates an interaction interval and could be defined in different ways.
  Right now, a new case begins when no event occurred, i.e., no new path
  started, for 20 seconds or longer.

* `path`: A path is defined as one interaction with one item. A path
  can either start with a `flipCard` event or when an item is
  touched for the first time within this case. A path ends with the
  item card being flipped closed again or with the last movement of the
  card within this case. One case can contain several paths with the same
  item when the item is flipped open and flipped closed again several
  times within a short time.

* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
  has been opened from the glossar folder. These pop-ups can be assigned to
  the wrong item since the assignment cannot be done unambiguously by an
  algorithm: It is possible that two items are flipped open that could both
  link to the same pop-up from the glossar. The indicator is kept as a
  variable so that these pop-ups can be easily deleted from the data.
  Right now, glossar entries can be ignored completely by setting an
  argument, and this is done by default. Using the pop-ups from the glossar
  will need a lot more love before it behaves satisfactorily.

* `event`: Indicating the event. Can take the values `move`, `flipCard`,
  `openTopic`, and `openPopup`.

* `item`: Identifier of the different artworks and information cards. This
  is a three-digit (left-padded) number. See above.

* `timeMs.start` / `timeMs.stop`: See above.

* `duration`: Calculated by $timeMs.stop - timeMs.start$ in milliseconds.
  Needs to be adjusted for events spanning more than one log file by adding
  $600,000$ ms for every additional log file. See below for details.

* `topic`: See above.

* `popup`: See above.

* `x.start` / `x.stop`: See above.

* `y.start` / `y.stop`: See above.

* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and
  $(x.stop, y.stop)$.

* `scale.start` / `scale.stop`: See above.

* `scaleSize`: Relative scaling of item card, calculated by
  $\frac{scale.stop}{scale.start}$.

* `rotation.start` / `rotation.stop`: See above.

* `rotationDegree`: Difference of rotation from $rotation.stop$ to
  $rotation.start$.
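
The derived variables `distance`, `scaleSize`, and `rotationDegree` from the list above can be sketched in a few lines. The toy data frame contains made-up values, and the sign convention for `rotationDegree` (stop minus start) is an assumption, since the description only speaks of a difference.

```r
# Toy data with made-up values; column names follow the list above
dat <- data.frame(
  x.start = c(100, 2092.25),     x.stop = c(400, 2092.25),
  y.start = c(100, 2008),        y.stop = c(500, 2008),
  scale.start = c(0.3, 0.3),     scale.stop = c(0.6, 0.3),
  rotation.start = c(10, 13.2),  rotation.stop = c(25, 13.2)
)

# Euclidean distance between start and stop position
dat$distance <- sqrt((dat$x.stop - dat$x.start)^2 +
                     (dat$y.stop - dat$y.start)^2)

# Relative scaling of the item card
dat$scaleSize <- dat$scale.stop / dat$scale.start

# Rotation difference (assumption: stop minus start)
dat$rotationDegree <- dat$rotation.stop - dat$rotation.start

dat[, c("distance", "scaleSize", "rotationDegree")]
```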

## How unclosed events are handled

Events do not necessarily need to be completed. A person can, e.g., leave
the table and not flip the item card closed again. For `flipCard`,
`openTopic`, and `openPopup` the data frame contains `NA` when the event
does not complete. For `move` events it happens quite often that a start
event follows a start event and a stop event follows a stop event.
Technically a move event cannot *not* be finished, and the number of events
without a start or stop indicates that the time resolution was not
sufficient to catch all these events accurately. Double start and stop
`move` events have therefore been deleted from the data set.

## Additional meta data

For the HAUM data, I added meta data on state holidays and school
vacations.

This led to the following additional variables:

* `holiday`

* `vacations`

# Problems and how I handled them

This lists some problems with the log data that required decisions. These
decisions influence the outcome and maybe even the data quality. Hence, I
tried to document how I handled these problems and explain the decisions I
made.

## Weird behavior of `timeMs` and neg. `duration` values

`timeMs` resets itself every time a new log file starts. This means that
the durations of events spanning more than one log file must be adjusted.
Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start`
must be subtracted from the maximum duration of the log file where the
event started ($600,000$ ms) and `timeMs.stop` must be added. If the
event spans more than two log files, a multiple of $600,000$ must be taken,
e.g., for three log files it must be: $2 \times 600,000 - timeMs.start +
timeMs.stop$ and so on.
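
As a minimal sketch of this adjustment (not the package's own code): the function name `adjust_duration` and the argument `n_files`, the number of log files an event spans, are hypothetical, and how `n_files` is derived from `fileId.start` and `fileId.stop` is left open here.

```r
# Duration in ms for an event that spans `n_files` log files,
# assuming every log file covers at most 600,000 ms (10 minutes)
adjust_duration <- function(timeMs.start, timeMs.stop, n_files = 1,
                            max_ms = 600000) {
  timeMs.stop - timeMs.start + (n_files - 1) * max_ms
}

adjust_duration(621, 677)                   # same log file: 56
adjust_duration(599000, 500, n_files = 2)   # two log files: 1500
adjust_duration(599000, 500, n_files = 3)   # three log files: 601500
```

For a single log file the correction term is zero, so the function reduces to $timeMs.stop - timeMs.start$.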

```{r timems, echo = FALSE, results = FALSE, fig.show = TRUE}
# Read data
datraw <- read.table("../../MDS/2023ss/60100_master_thesis/analysis/code/results/raw_logfiles_2024-02-21_16-07-33.csv", sep = ";",
                     header = TRUE)

plot(timeMs ~ as.factor(fileId), datraw[1:5000,], xlab = "fileId")
```

The boxplot shows that we have a continuous range of values within one log
file but that `timeMs` does not increase over log files. I kept
`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop`
in the data frame, so it is clear when events span more than one log file.

<!--
Infos from the programmer:

"I have also just gone through the code from back then. The logging works
like this: When the application starts, a new log file is created every 10
minutes. The start time from which the duration is calculated is reset each
time. So duration is not "time since start of the application" but "time
since restart of the logger". Your assumption is therefore correct - there
should be no durations >10 minutes. The first entry of a log file can be
anything between 0 and 10 minutes (depending on whether the table was in
use at the time of the new logging interval). So if a case is spread across
2+ logs, you have to add 10 minutes to the duration for each log file after
the first one for it to fit."
-->

## Left padding of file IDs

The file names of the raw log files are automatically generated and contain
a timestamp. This timestamp is not well formed. First, it contains an
incorrect month. The months go from 0 to 11, which means that the file
`2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm.
Another problem is that the file names are not zero left padded, e.g.,
`2016_11_15-12_2_57.log`. This file was collected on December 15, 2016 at
12:02 pm and therefore before the file above. But most sorting algorithms
will sort these files in the wrong order, as shown below. In order to
preprocess the data and close events that belong together, the data need to
be sorted by events and artworks repeatedly. In order to get them back in
the correct time order, it is necessary to order them based on three
variables: `fileId.start`, `date.start` and `timeMs.start`. The file IDs
therefore need to sort in the correct order (again, see below for an
example). I zero left padded the log file names within the data frame,
where they are used as an identifier. These "file names" therefore do not
correspond exactly to the original raw log file names. This needs to be
kept in mind when doing any kind of matching etc.

```
## what it looked like before left padding
# 1422  ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    677  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    850  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
# 1427  ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465

## what it looks like now
# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57    621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57    677  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57    774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57    850  Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
```
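
The zero left padding can be sketched as follows. This is an illustration under the assumption that every file name consists of exactly six numeric components followed by `.log`; the function name `pad_file_id` is made up and is not necessarily how the package does it. The month is left as it is in the file name.

```r
# Zero left pad the numeric components of a raw log file name
# (`pad_file_id` is a hypothetical helper)
pad_file_id <- function(path) {
  # Keep only the file name and pull out its six numeric components
  fname <- basename(path)
  parts <- regmatches(fname, gregexpr("[0-9]+", fname))
  vapply(parts, function(p) {
    p <- as.integer(p)
    sprintf("%04d_%02d_%02d-%02d_%02d_%02d.log",
            p[1], p[2], p[3], p[4], p[5], p[6])
  }, character(1))
}

ids <- pad_file_id(c("2016_11_15-12_12_57.log", "2016_11_15-12_2_57.log"))
sort(ids)
```

After padding, plain alphabetical sorting puts `2016_11_15-12_02_57.log` before `2016_11_15-12_12_57.log`.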

## Timestamps repeat

The timestamps in the `date` variable record year, month, day, hour,
minute and second. Since one second is not a very short time interval for
a move on a touch display, this is not fine grained enough to bring events
into the correct order: There are events from the same log file
with the same timestamp and even events from different log files with
the same timestamp. The log files get written about every 10 minutes
(which can easily be seen when looking at the file names of the raw log
files). So in order to get events in the correct order, it is necessary to
first sort by file ID, then within file ID by timestamp `date`, and
then within these more coarse grained timestamps by `timeMs`. But as
explained above, `timeMs` can only be sorted within one file ID, since its
values do not increase consistently over log files but start from a new
offset for each raw log file.
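
The three-level sort can be illustrated with base R's `order()`. The toy data frame below reuses file IDs, timestamps and `timeMs` values from the padded example in the previous section, deliberately shuffled; it assumes the file IDs are already zero left padded.

```r
# Shuffled toy rows (values taken from the padded example above)
dat <- data.frame(
  fileId = c("2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log",
             "2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log"),
  date   = c("2016-12-15 12:12:57", "2016-12-15 12:12:57",
             "2016-12-15 12:12:57", "2016-12-15 12:12:56"),
  timeMs = c(677, 599916, 621, 599671),
  stringsAsFactors = FALSE
)

# Sort by file ID first, then by timestamp, then by timeMs
dat <- dat[order(dat$fileId, dat$date, dat$timeMs), ]
dat$timeMs
```

The result is in time order even though `timeMs` itself is not increasing across the two files.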

## x,y-coordinates outside of display range

The display of the multi-touch table is a 4K display with 3840 x 2160
pixels. When you plot the start and stop coordinates, the display is
clearly distinguishable. However, a lot of points are outside of the
display range. This can happen when the art objects are scaled and then
moved to the very edge of the table; the log then records pixels outside of
the display. These are actually valid data points and I will leave them as
is.

```{r xycoord}
datlogs <- read.table("../../MDS/2023ss/60100_master_thesis/analysis/code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
                      header = TRUE)

par(mfrow = c(1, 2))
plot(y.start ~ x.start, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
plot(y.stop ~ x.stop, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)

aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
```
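
If one wants to flag such points instead of only plotting them, a small helper could look like this. It is just a sketch: the function name `outside_display` is hypothetical and the coordinates are made up.

```r
# TRUE for coordinates that fall outside the 3840 x 2160 display
outside_display <- function(x, y, width = 3840, height = 2160) {
  x < 0 | x > width | y < 0 | y > height
}

# Made-up coordinates for illustration
x <- c(2092.25, -15, 3900, 100)
y <- c(2008, 50, 1000, 2200)

outside_display(x, y)        # FALSE TRUE TRUE TRUE
mean(outside_display(x, y))  # share of points outside: 0.75
```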

## Pop-ups from glossar cannot be assigned to a specific item

All the information, pictures and texts for the topics and pop-ups are
stored in `/data/haum/ContentEyevisit/eyevisit_cards_light/<item_number>`.
Among other things, each folder contains XML files with the information
about any technical terms that can be opened from the hypertexts on the
topic cards. Often this information is item dependent and then the
corresponding XML file is in the folder for this item. Sometimes, however,
more general terms can be opened. In order to avoid multiple files
containing the same information, these were stored in a folder called
`glossar` and get accessed from there. The raw log files only contain the
path to this glossar entry and do not record from which item it was
accessed. I tried to assign these glossar entries to the correct items. The
(very heuristic) approach was this:

1. Create a lookup table with all XML file names (possible pop-ups) from
   the glossar folder and the items that possibly call them. This was stored
   as an `RData` object for easier handling but should maybe be stored in a
   more interoperable format.

2. I went through all possible pop-ups in this lookup table and stored the
   items that are associated with it.

3. I created a sub data frame without move events (since they can never be
   associated with a pop-up) and went through every line and looked up if
   an item and a topic card had been opened. If this was the case and a
   glossar entry came up before the item was closed again, I assigned
   this item to the glossar entry.

This is heuristic since it is possible that several topic cards from
different items are opened simultaneously and the glossar pop-up could
be opened from either one (it could even be more than two, of course). In
these cases the item that was opened closest to the glossar pop-up has
been assigned, but this can never be completely error free.

Moreover, this heuristic only assigns a little more than half of the glossar
entries. Since it only looks at the last item that has been opened and
checks whether this item is a possible candidate, it misses all glossar
pop-ups where another item has been opened in between. Writing a more
elaborate algorithm is still an open TODO.
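
The core of this heuristic, checking the last opened item against the lookup table, can be sketched like this. Everything in the example is hypothetical: the lookup table `lut`, the XML file names and the function `assign_glossar_item` are made up for illustration and do not reproduce the package's code.

```r
# Hypothetical lookup table: glossar pop-up file -> items that can open it
lut <- list(
  "glossar_term_a.xml" = c("001", "076"),
  "glossar_term_b.xml" = c("502")
)

# Assign the last opened item if it is a possible candidate, otherwise NA
assign_glossar_item <- function(popup, last_item, lut) {
  candidates <- lut[[popup]]
  if (!is.na(last_item) && last_item %in% candidates) {
    last_item
  } else {
    NA_character_
  }
}

assign_glossar_item("glossar_term_a.xml", "076", lut)  # "076"
assign_glossar_item("glossar_term_b.xml", "076", lut)  # NA
```

The second call shows the weakness described above: if another item was opened in between, the pop-up stays unassigned.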

All glossar pop-ups that do not get matched with an item are removed
from the data set with a warning if the argument `glossar = TRUE` is set.
Otherwise the glossar entries will be ignored completely.

## Assign a `case` variable based on "time heuristic"

One thing needed in order to work with the data set and use it for machine
learning algorithms like process mining is a variable that tries to
identify a case. A case variable will structure the data frame in a way
that navigation behavior can actually be investigated. However, we do not
know if several people are standing around the table interacting with it or
just one very active person. The simplest way to define a case variable is
to just use a time limit between events. This means that when the table has
not been interacted with for, e.g., 20 seconds, then it is assumed that a
person moved on and a new person started interacting with the table. This
is the easiest heuristic and the one implemented at the moment. Process
mining shows that this simple approach works in the sense that the correct
process gets extracted by the algorithm.
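
A time-limit heuristic of this kind can be sketched in a few lines of base R. The function name `assign_case` is hypothetical, and the sketch assumes that the timestamps are already sorted in time order and compares consecutive events only.

```r
# Start a new case whenever the gap to the previous event is `gap`
# seconds or longer (assumes `date` is sorted in time order)
assign_case <- function(date, gap = 20) {
  t <- as.numeric(as.POSIXct(date, tz = "UTC"))
  cumsum(c(TRUE, diff(t) >= gap))
}

# Made-up timestamps with gaps of 1, 33 and 20 seconds
date <- c("2016-12-15 12:12:56", "2016-12-15 12:12:57",
          "2016-12-15 12:13:30", "2016-12-15 12:13:50")
assign_case(date)  # 1 1 2 3
```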

In order to investigate user behavior on a more fine grained level, it will
be necessary to come up with a more elaborate approach. A better, still
simple approach, could be to use this kind of time limit and additionally
look at the distance between items interacted with within one time window.
When items are far apart it seems plausible that more than one person
interacted with them. Very short time lapses between events on different
items could also be an indicator that more than one person is interacting
with the table.

## Assign a `path` variable

The `path` variable is supposed to show one interaction trace with one
artwork, meaning it starts when an artwork is touched or flipped and stops
when it is closed again. It is easy to assign a path that runs from
flipping a card, through opening (maybe several) topics and pop-ups for
this artwork card, to closing this card again. But one would like to assign
the same path to move events surrounding this interaction. Again, this is
not possible in an exact algorithmic way but only heuristically.

Again, I used a time cutoff for this. First, if a `move` event occurs, it
is checked whether the same item has been flipped less than 20 seconds
beforehand. If yes, the same path indicator is assigned to this `move`. If
not, a new "move indicator" is assigned temporarily. Then, a "backward
pass" is applied, where it is checked whether the same item is opened less
than 20 seconds _after_ the event occurs. If yes, that path indicator is
assigned. For all the remaining moves, a new path number is assigned. This
corresponds to items being moved without being flipped.
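
The following is a rough sketch of this idea, not the package's implementation. It makes several assumptions: events have a single `time` column in seconds, `flipCard` events already carry a `path` number, both passes are folded into one loop (looking backward first, then forward), and every remaining move gets its own new path number. The function name `assign_move_paths` is made up.

```r
# Assign paths to move events based on a time cutoff (hypothetical helper)
assign_move_paths <- function(dat, cutoff = 20) {
  flips <- dat[dat$event == "flipCard", ]
  next_path <- max(c(0, dat$path), na.rm = TRUE) + 1
  for (i in which(dat$event == "move")) {
    same <- flips[flips$item == dat$item[i], ]
    lag <- dat$time[i] - same$time   # positive: flip happened before the move
    before <- same$path[lag >= 0 & lag < cutoff]
    after  <- same$path[lag < 0 & -lag < cutoff]
    if (length(before) > 0) {
      dat$path[i] <- before[length(before)]  # most recent flip before the move
    } else if (length(after) > 0) {
      dat$path[i] <- after[1]                # first flip after the move
    } else {
      dat$path[i] <- next_path               # move without a nearby flip
      next_path <- next_path + 1
    }
  }
  dat
}

# Made-up events: two moves around a flip of item "001", one move of "002"
dat <- data.frame(
  event = c("move", "flipCard", "move", "move"),
  item  = c("001", "001", "001", "002"),
  time  = c(0, 10, 15, 100),
  path  = c(NA, 1, NA, NA),
  stringsAsFactors = FALSE
)
assign_move_paths(dat)$path  # 1 1 1 2
```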

## A `move` event does not record any change

Most of the events in the log files are move events. Additionally, many of
these move events are recorded but they do not indicate any change, meaning
the only difference is the timestamp. All other variables indicating moves
like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc. do
not show _any_ change. They represent about 2/3 of all move events. These
events are probably short touches of the table without an actual
interaction. They were therefore removed from the data set.
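
Removing such events amounts to a simple filter. The sketch below assumes the column names of the wide data frame described above; the function name `drop_static_moves` and the toy values are made up.

```r
# Drop move events where position, scale and rotation did not change
drop_static_moves <- function(dat) {
  static <- dat$event == "move" &
    dat$x.start == dat$x.stop &
    dat$y.start == dat$y.stop &
    dat$scale.start == dat$scale.stop &
    dat$rotation.start == dat$rotation.stop
  # `%in% TRUE` treats NA comparisons as "not static"
  dat[!(static %in% TRUE), ]
}

# Made-up events: a static move, a real move, and a flip without coordinates
dat <- data.frame(
  event = c("move", "move", "flipCard"),
  x.start = c(100, 100, NA),       x.stop = c(100, 250, NA),
  y.start = c(200, 200, NA),       y.stop = c(200, 200, NA),
  scale.start = c(0.3, 0.3, NA),   scale.stop = c(0.3, 0.3, NA),
  rotation.start = c(13, 13, NA),  rotation.stop = c(13, 13, NA),
  stringsAsFactors = FALSE
)
drop_static_moves(dat)$event  # "move" "flipCard"
```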

## Card indices go from 0 to 7 (instead of 0 to 5 as expected)

In the beginning I thought that the number for topics was the index of
where the card was presented on the back of the item. But this is not
correct. It is the number of the topic. There are eight topics in total:

```
Indices for topics:
0   artist
1   thema
2   komposition
3   leben des kunstwerks
4   details
5   licht und farbe
6   extra info
7   technik
```

On the back of items, there can be between 2 and 6 topic cards. Several of
these topic cards can be about the same topic, e.g., there can be two topic
cards assigned to the topic `thema`. It is impossible to find out if the
same topic card was opened several times or if different topic cards with
the same topic were opened from the same item. See example below for item
"001".

```{r topics, echo = FALSE}
items <- sprintf("%03d", unique(datlogs$item))
topics <- extract_topics(items, xmlfiles = paste0(items, ".xml"),
                         xmlpath = "../../MDS/2023ss/60100_master_thesis/analysis/data/haum/ContentEyevisit/eyevisit_cards_light/")
head(topics)
```
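
To make the topic numbers easier to read in tables and plots, they can be turned into a labeled factor. This is only a convenience sketch based on the index list above; the object name `topic_labels` and the example vector are made up.

```r
# Labels in the order of the topic indices 0 to 7
topic_labels <- c("artist", "thema", "komposition", "leben des kunstwerks",
                  "details", "licht und farbe", "extra info", "technik")

# Made-up topic numbers as they could appear in the `topic` column
topic <- c(0, 7, 1, NA)
topic_f <- factor(topic, levels = 0:7, labels = topic_labels)
as.character(topic_f)  # "artist" "technik" "thema" NA
```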

## New artworks "504" and "505" starting October 2022

When I read in the complete data frame for the first time, all of a
sudden there were 72 instead of 70 items. It seems like these two
artworks first appear on October 21, 2022.

```{r newitems}
summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
```

The artworks seem to have been updated in general after October 21, 2022.
The following table shows which items were presented in which years.

```{r years}
xtabs(~ item + lubridate::year(date.start), datlogs)
```

It shows that the artworks have been updated after the Corona pandemic. I
think the table was also moved to a different location at that point.