Updated README.Rmd and exported as github_document

Nora Wickelmaier 2024-03-22 15:58:30 +01:00
parent 37e67bfa69
commit 9762c61a8d
4 changed files with 745 additions and 371 deletions


@ -1,46 +1,38 @@
---
title: "Background information about MTT data"
author: "Nora Wickelmaier"
date: "`r Sys.Date()`"
output:
html_document:
number_sections: true
toc: true
title: "Log data from the Multi-Touch Table at the HAUM"
output: github_document
---
```{r, include = FALSE}
# setwd("C:/Users/nwickelmaier/Nextcloud/Documents/MDS/2023ss/60100_master_thesis")
devtools::load_all("../../../software/mtt")
devtools::load_all("../../../../software/mtt")
```
# Log data from the Multi-Touch Table at the HAUM
The Multi Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in
Braunschweig gives visitors of the Museum the opportunity to interact with
67 artworks and 3 tiles containing information about the museum and its
layout. The table was installed at the institute in October 2016 and since
November 2016 log files from interactions of visitors of the museum have
been collected. These log files are in an unstructured format and cannot be
easily analyzed. The purpose of the following document is to describe how
the data have been transformed and which decisions have been made along
the way.
about 70 artworks and 3 virtual cards containing information about the
museum and its layout. The table was installed at the institute in October
2016 and since November 2016 log files from interactions of visitors of the
museum have been collected. These log files are in an unstructured format
and cannot be easily analyzed. The purpose of the following document is to
describe how the data have been transformed and which decisions have been
made along the way.
# Data structure
The log files contain lines that indicate the beginning and end of possible
actions that can be performed when interacting with the artworks on the
table. The layout of the table looks like 70 pictures have been tossed on a
activities that can be performed when interacting with the artworks on the
table. The layout of the table looks like pictures have been tossed on a
large table. Every artwork is visible at the start configuration. People
can move the pictures on the table, they can be scaled and rotated.
Additionally, the virtual picture cards can be flipped in order to find
more information of the artwork on the "back" of the card. One has to press
a little `i` for more information in one of the bottom corners of the card.
On the back of the card two (?) to six information cards can be found with
a teaser text about a certain topic. These topic cards can be opened and a
hypertext with detailed information pops up. Within these hypertexts
certain technical terms can be clicked for lay people to get more
information. This also opens up a pop-up. The events encoded in the raw log
files therefore have the following structure.
On the back of the card two to six information cards can be found with a
teaser text about a certain topic. These topic cards can be opened and a
hypertext with detailed information opens. Within these hypertexts certain
technical terms can be clicked for lay people to get more information. This
also opens up a pop-up. The events encoded in the raw log files therefore
have the following structure.
```
"Start Application" --> Start Application
@ -100,32 +92,32 @@ raw log file:
organized in. For the HAUM data set, the data are sorted by year (folders
2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
* `data`: Extracted time stamp from the raw log file in the format
* `date`: Extracted timestamp from the raw log file in the format
`yyyy-mm-dd hh:mm:ss`.
* `timeMs`: Containing a time stamp in Milliseconds that restarts with
* `timeMs`: Containing a timestamp in Milliseconds that restarts with
every new raw log file.
* `event`: Start and stop event tags. See above for possible values.
* `artwork`: Identifier of the different artworks. This is a 3 digit
(left-padded) number. The numbers of the artworks correspond to the
* `item`: Identifier of the different items. This is a three-digit
(left-padded) number. The numbers of the items correspond to the
folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
originally taken from the museum's catalogue.
* `popup`: Name of the pop-up opened. This is only interestin for
* `popup`: Name of the pop-up opened. This is only interesting for
"openPopup" events.
* `topicNumber`: The number of the topic card that has been opened at the back of
the artwork card. See below for a more detailed description of what these
numbers possibly mean.
* `topic`: The number of the topic card that has been opened at the back of
the item card. See below for a more detailed description of what these
numbers mean.
* `x`: Value of x-coordinate in pixel on the 4K-Display ($3840 \times 2160$)
* `y`: Value of y-coordinate in pixel
* `scale`: Number in 128 bit that indicates how much the artwork card has
been scaled (????)
* `scale`: Number in 128 bit that indicates how much the card has been
scaled
* `rotation`: Degree of rotation in start configuration.
@ -134,43 +126,45 @@ raw log file:
## Variables after "closing of events"
The raw log data consists of start and stop events for each event type.
After preprocessing for event types are extracted: `move`, `flipCard`,
The raw log data consist of start and stop events for each event type.
After preprocessing four event types are extracted: `move`, `flipCard`,
`openTopic`, and `openPopup`. Except for the `move` events, which can occur
at any time when interacting with an artwork card on the table, the events
have a hierachical order: An artwork card first needs to be flipped
at any time when interacting with an item card on the table, the events
have a hierarchical order: An item card first needs to be flipped
(`flipCard`), then the topic cards on the back of the card can be opened
(`openTopic`), and finally pop-ups on these topic cards can be opened
(`openPopup`). This implies that the event `openPopup` can only be present
for a certain artwork, if the card has already been flipped (i.e., an event
`flipCard` for the same artwork has already occurred).
for a certain item, if the card has already been flipped (i.e., an event
`flipCard` for the same item has already occurred).
After preprocessing, the data frame is now in a wide format with columns
for the start and the stop of each event and contains the following
variables:
* `folder`: Containing the folder name (see above)
* `fileId.start` / `fileId.stop`: See above.
* `eventId`: A numerical variable that indicates the number of the event.
Starts at 1 and ends with the total number of events, counting up by 1.
* `date.start` / `date.stop`: See above.
* `folder`: Containing the folder name (see above)
* `case`: A numerical variable indicating cases in the data. A "case"
indicates an interaction interval and could be defined in different ways.
Right now a new case begins, when no event occured for 20 seconds.
Right now a new case begins, when no event occurred for 20 seconds or
longer.
* `trace`: A trace is defined as one interaction with one artwork. A trace
can either start with a `flipCard` event or when an artwork has been
touched for the first time within this case. A trace ends with the
artwork card being flipped close again or with the last movement of the
card within this case. One case can contain several traces with the same
artwork when the artwork is flipped open and slipped close again several
* `path`: A path is defined as one interaction with one item. A path
can either start with a `flipCard` event or when an item has been
touched for the first time within this case. A path ends with the
item card being flipped close again or with the last movement of the
card within this case. One case can contain several paths with the same
item when the item is flipped open and flipped close again several
times within a short time.
* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
has been opened from the glossar folder. These pop-ups can be assigned to
the wronge artwork since it is not possible to do this algorithmically.
It is possible that two artworks are flipped open that could both link to
the same popup from a glossar. The indicator variable is left as a
the wrong item since it is not possible to do this algorithmically.
It is possible that two items are flipped open that could both link to
the same pop-up from a glossar. The indicator variable is left as a
variable, so that these pop-ups can be easily deleted from the data.
Right now, glossar entries can be ignored completely by setting an
argument and this is done by default. Using the pop-ups from the glossar
@ -179,20 +173,16 @@ variables:
* `event`: Indicating the event. Can take the values `move`, `flipCard`,
`openTopic`, and `openPopup`.
* `artwork`: Identifier of the different artworks. This is a 3 digit
(left-padded) number. See above.
* `fileId.start` / `fileId.stop`: See above.
* `date.start` / `date.stop`: See above.
* `item`: Identifier of the different artworks and information cards. This
is a three-digit (left-padded) number. See above.
* `timeMs.start` / `timeMs.stop`: See above.
* `duration`: Calculated by $timeMs.stop - timeMs.start$ in Milliseconds.
Needs to be adjusted for events spanning more than one log file by a
factor of $600,000 \times #logfiles$. See below for details.
factor of $600,000 \times \text{number of logfiles}$. See below for details.
* `topicNumber`: See above.
* `topic`: See above.
* `popup`: See above.
@ -200,11 +190,12 @@ variables:
* `y.start` / `y.stop`: See above.
* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and $(x.stop, y.stop)$.
* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and
$(x.stop, y.stop)$.
* `scale.start` / `scale.stop`: See above.
* `scaleSize`: Relative scaling of artwork card, calculated by
* `scaleSize`: Relative scaling of item card, calculated by
$\frac{scale.stop}{scale.start}$.
* `rotation.start` / `rotation.stop`: See above.
@ -215,60 +206,26 @@ variables:
## How unclosed events are handled
Events do not necessarily need to be completed. A person can, e.g., leave
the table and not flip the artwork card close again. For `flipCard`,
the table and not flip the item card close again. For `flipCard`,
`openTopic`, and `openPopup` the data frame contains `NA` when the event
does not complete. For `move` events is happens quite often that a start
does not complete. For `move` events it happens quite often that a start
event follows a start event and a stop event follows a stop event.
Technically a move event cannot *not* be finished and the number of events
without a start or stop indicated that the time resolution was not
without a start or stop indicate that the time resolution was not
sufficient to catch all these events accurately. Double start and stop
`move`events have therefore been deleted from the data set.
<!--
## How a case is defined
* Find out whether more than one person is standing at the table?
- Sliding window in which the number of artworks is counted? Or how far
apart the touched artworks are from each other?
- You can already "see" this in the logs - but how can I extract it
automatically? What is my definition of an
"interaction boost"?
- However we do it, does it work on the "event log data"?
-->
`move` events have therefore been deleted from the data set.
## Additional meta data
For the HAUM data, I added meta data on state holidays and school
vacations. Additionally, the topic categories of the topic cards were
extracted from the XML files and added to the data frame.
vacations.
This led to the following additional variables:
* `topicIndex`
* `topicFile`
* `topic`
* `state` (Niedersachsen for complete HAUM data set)
* `stateCode` (NI)
* `holiday`
* `vacations`
* `stateCodeVacations`
<!--
- Metadata on artworks like, name, artist, type of artwork, epoch, etc.
- School vacations and holidays
- Special exhibits at the museum
- Number of visitors per day (follow up with Sven again?)
- Age structure of visitors per day?
- ... ????
-->
# Problems and how I handled them
This lists some problems with the log data that required decisions. These
@ -287,33 +244,12 @@ event spans more than two log files, a multiple of $600,000$ must be taken,
e.g. for three log files it must be: $2 \times 600,000 - timeMs.start +
timeMs.stop$ and so on.
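The adjustment can be written down as a small function. This is only a sketch under the assumption that every raw log file covers exactly 10 minutes ($600,000$ ms) and that `timeMs` restarts at zero in each file. The name `span_duration()` and its arguments are hypothetical and not part of the `mtt` package.

```r
# Hypothetical helper (not part of the mtt package): duration in ms of an
# event that starts in one raw log file and stops in the same or a later one.
# Assumes timeMs restarts with every file and each file covers file_ms ms.
span_duration <- function(start_ms, stop_ms, n_files, file_ms = 600000) {
  stopifnot(all(n_files >= 1))
  (n_files - 1) * file_ms - start_ms + stop_ms
}

d_one   <- span_duration(1000, 4000, n_files = 1)    # same file: stop - start
d_two   <- span_duration(590000, 5000, n_files = 2)  # 600,000 - start + stop
d_three <- span_duration(590000, 5000, n_files = 3)  # 2 * 600,000 - start + stop
```

With `n_files = 1` the expression reduces to the plain difference `timeMs.stop - timeMs.start`, so the same function would cover events within a single log file.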
```{r, results = FALSE, fig.show = TRUE}
```{r timems, echo = FALSE, results = FALSE, fig.show = TRUE}
# Read data
dat0 <- read.table("data/haum/raw_logfiles_small_2023-09-26_13-50-20.csv", sep = ";",
header = TRUE)
dat0$date <- as.POSIXct(dat0$date)
dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
datraw <- read.table("code/results/raw_logfiles_2024-02-21_16-07-33.csv", sep = ";",
header = TRUE)
# Remove irrelevant events
dat <- subset(dat0, !(dat0$event %in% c("Start Application",
"Show Application")))
# Add trace variable
artworks <- unique(stats::na.omit(dat$artwork))
artworks <- artworks[artworks != "glossar"]
glossar_files <- unique(subset(dat, dat$artwork == "glossar")$popup)
glossar_dict <- create_glossardict(artworks, glossar_files,
xmlpath = "data/haum/ContentEyevisit/eyevisit_cards_light/")
dat1 <- add_trace(dat, glossar_dict)
# Close events
dat2 <- rbind(close_events(dat1, "move", rm_nochange_moves = TRUE),
close_events(dat1, "flipCard", rm_nochange_moves = TRUE),
close_events(dat1, "openTopic", rm_nochange_moves = TRUE),
close_events(dat1, "openPopup", rm_nochange_moves = TRUE))
dat2 <- dat2[order(dat2$fileId.start, dat2$date.start, dat2$timeMs.start), ]
plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
plot(timeMs ~ as.factor(fileId), datraw[1:5000,], xlab = "fileId")
```
The boxplot shows that we have a continuous range of values within one log
@ -322,7 +258,7 @@ file but that `timeMs` does not increase over log files. I kept
in the data frame, so it is clear when events span more than one log file.
<!--
Infos from Philipp:
Infos from the programmer:
"I also just went through the code from back then. The logging works
like this: Once the application starts, every 10 minutes a new log file
@ -340,7 +276,7 @@ es passt."
## Left padding of file IDs
The file names of the raw log files are automatically generated and contain
a time stamp. This time stamp is not well formed. First, it contains an
a timestamp. This timestamp is not well formed. First, it contains an
incorrect month. The months go from 0 to 11 which means, that the file name
`2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm.
Another problem is that the file names are not zero left padded, e.g.,
@ -350,11 +286,12 @@ will sort these files in the order shown below. In order to preprocess the
data and close events that belong together, the data need to be sorted by
events and artworks repeatedly. In order to get them back in the correct
time order, it is necessary to order them based on three variables:
`fileId`, `date.start` and `timeMs`. The file IDs therefore need to
sort in the correct order (again see below for example). I zero left padded
the log file names within the data frame using it as an identifier. These
"file names" do not correspond exactly to the original raw log file names.
This needs to be kept in mind when doing any kind of matching etc.
`fileId.start`, `date.start` and `timeMs.start`. The file IDs therefore
need to sort in the correct order (again see below for example). I zero
left padded the log file names within the data frame using it as an
identifier. These "file names" do not correspond exactly to the original
raw log file names. This needs to be kept in mind when doing any kind of
matching etc.
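One possible way to zero left pad such names is a regular expression that prefixes every stand-alone single digit with a `0`. This is a sketch, not necessarily how the package does it: the function name `pad_file_id()` is hypothetical and the unpadded file name is made up for illustration.

```r
# Hypothetical helper: zero left pad every single-digit field of a file name.
# A digit is padded only if it has no digit directly before or after it.
pad_file_id <- function(x) {
  gsub("(?<![0-9])([0-9])(?![0-9])", "0\\1", x, perl = TRUE)
}

# The second file name is made up for illustration only
raw_ids <- c("2016_11_15-12_12_57.log", "2016_11_15-9_2_7.log")
padded  <- pad_file_id(raw_ids)

# After padding, alphabetical sorting agrees with the order of the time fields
sorted_ids <- sort(padded)
```

The incorrect month (0 to 11) is deliberately left untouched here, since the padded names are only used as identifiers for sorting.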
```
## what it looked like before left padding
@ -376,16 +313,16 @@ This needs to be kept in mind when doing any kind of matching etc.
## Timestamps repeat
The time stamps in the `date` variable record year, month, day, hour,
The timestamps in the `date` variable record year, month, day, hour,
minute and seconds. Since one second is not a very short time interval for
a move on a touch display, this is not fine grained enough to bring events
into the correct order, meaning there are events from the same log file
having the same time stamp and even events from different log files having
the same time stamp. The log files get written about every 10 minutes
having the same timestamp and even events from different log files having
the same timestamp. The log files get written about every 10 minutes
(which can easily be seen when looking at the file names of the raw log
files). So in order to get events in the correct order, it is necessary to
first order by file ID, within file ID then sort by time stamp `date` and
then within these more coarse grained time stamps sort by `timeMs`. But as
first order by file ID, within file ID then sort by timestamp `date` and
then within these more coarse grained timestamps sort by `timeMs`. But as
explained above, `timeMs` can only be sorted within one file ID, since they
do not increase consistently over log files, but restart with each
raw log file.
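The three-level ordering can be sketched with `order()`. The data frame below is a made-up toy example. It only assumes the column names `fileId`, `date` and `timeMs` described above and file IDs that are already zero left padded.

```r
# Toy data (made up): two log files, two events sharing the same timestamp
logs <- data.frame(
  fileId = c("2016_11_15-12_22_57.log", "2016_11_15-12_12_57.log",
             "2016_11_15-12_12_57.log", "2016_11_15-12_12_57.log"),
  date   = as.POSIXct(c("2016-12-15 12:22:58", "2016-12-15 12:13:01",
                        "2016-12-15 12:13:00", "2016-12-15 12:13:00"),
                      tz = "UTC"),
  timeMs = c(120, 4300, 3900, 3200)
)

# Order by file ID first, then by timestamp, then by timeMs within a file
sorted <- logs[order(logs$fileId, logs$date, logs$timeMs), ]
```

Because `fileId` is the first sorting key, the small `timeMs` value from the later log file ends up last, even though it is the smallest value in the column.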
@ -394,64 +331,67 @@ raw log file.
The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
pixels. When you plot the start and stop coordinates, the display is
clearly to distinguish. However, a lot of points are outside of the display
range. This can happen, when the art objects are scaled and then moved to
the very edge of the table. Then it will record pixels outside of the
table. These are actually valid data points and I will leave them as is.
clearly distinguishable. However, a lot of points are outside of the
display range. This can happen, when the art objects are scaled and then
moved to the very edge of the table. Then it will record pixels outside of
the table. These are actually valid data points and I will leave them as
is.
```{r xycoord}
datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
header = TRUE)
```{r}
par(mfrow = c(1, 2))
plot(y.start ~ x.start, dat2)
plot(y.start ~ x.start, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
plot(y.stop ~ x.stop, dat2)
plot(y.stop ~ x.stop, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, dat2, mean)
aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
```
## Pop-ups from glossar cannot be assigned to a specific artwork
## Pop-ups from glossar cannot be assigned to a specific item
All the information, pictures and texts for the topics and pop-ups are
stored in
`/Logfiles/ContentEyevisit/eyevisit_cards_light/<artwork_number>`. Among
other things, each folder contains XML-files with the information about any
technical terms that can be opened from the hypertexts on the topic cards.
Often this information is artwork dependent and then the corresponding
XML-file is in the folder for this artwork. Sometimes, however, more
general terms can be opened. In order to avoid multiple files containing
the same information, these were stored in a folder called `glossar` and
get accessed from there. The raw log files only contain the path to this
glossar entry and did not record from which artwork it was accessed. I
tried to assign these glossar entries to the correct artworks. The (very
heuristic) approach was this:
stored in `/data/haum/ContentEyevisit/eyevisit_cards_light/<item_number>`.
Among other things, each folder contains XML-files with the information
about any technical terms that can be opened from the hypertexts on the
topic cards. Often this information is item dependent and then the
corresponding XML-file is in the folder for this item. Sometimes, however,
more general terms can be opened. In order to avoid multiple files
containing the same information, these were stored in a folder called
`glossar` and get accessed from there. The raw log files only contain the
path to this glossar entry and did not record from which item it was
accessed. I tried to assign these glossar entries to the correct items. The
(very heuristic) approach was this:
1. Create a lookup table with all XML-file names (possible pop-ups) from
the glossar folder and what artworks possibly call them. This was stored
the glossar folder and what items possibly call them. This was stored
as an `RData` object for easier handling but should maybe be stored in a
more interoperable format.
2. I went through all possible pop-ups in this lookup table and stored the
artworks that are associated with it.
items that are associated with it.
3. I created a sub data frame without move events (since they can never be
associated with a pop-up) and went through every line and looked up if
an artwork and a topic card had been opened. If this was the case and a
glossar entry came up before the artwork was closed again, I assigned
this artwork to this glossar entry.
an item and a topic card had been opened. If this was the case and a
glossar entry came up before the item was closed again, I assigned
this item to the glossar entry.
This is heuristic since it is possible that several topic cards from
different artworks are opened simultaneously and the glossar pop-up could
different items are opened simultaneously and the glossar pop-up could
be opened from either one (it could even be more than two, of course). In
these cases the artwork that was opened closest to the glossar pop-up has
these cases the item that was opened closest to the glossar pop-up has
been assigned, but this can never be completely error free.
And this heuristic only assigns a little more than half of the glossar
entries. Since my heuristic only looks for the last artwork that has been
opened and if this artwork is a possible candidate it misses all glossar
pop-ups where another artwork has been opened in between. This is still an
entries. Since my heuristic only looks for the last item that has been
opened and if this item is a possible candidate it misses all glossar
pop-ups where another item has been opened in between. This is still an
open TODO to write a more elaborate algorithm.
All glossar pop-ups that do not get matched with an artwork are removed
All glossar pop-ups that do not get matched with an item are removed
from the data set with a warning if the argument `glossar = TRUE` is set.
Otherwise the glossar entries will be ignored completely.
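The core of the heuristic, looking up whether the most recently opened item is a candidate for a glossar pop-up, can be sketched as follows. This is a simplified illustration and not the package code: the lookup table, the file names, the item numbers and the function `assign_glossar_item()` are all hypothetical.

```r
# Hypothetical lookup table: glossar pop-up file -> items that can call it.
# File names and item numbers are made up for illustration.
glossar_dict <- list(
  "term_a.xml" = c("001", "052"),
  "term_b.xml" = c("080")
)

# Return the most recently opened item if it is a candidate for the pop-up,
# otherwise NA (the pop-up stays unassigned).
assign_glossar_item <- function(popup, last_item, dict) {
  candidates <- dict[[popup]]
  if (!is.null(candidates) && last_item %in% candidates) {
    last_item
  } else {
    NA_character_
  }
}

hit  <- assign_glossar_item("term_a.xml", "052", glossar_dict)
miss <- assign_glossar_item("term_a.xml", "080", glossar_dict)
```

The second call returns `NA` because item "080" cannot link to `term_a.xml`. This mirrors the limitation described above: if another item was opened in between, the pop-up is not matched.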
@ -473,232 +413,89 @@ gets extracted by the algorithm.
In order to investigate user behavior on a more fine grained level, it will
be necessary to come up with a more elaborate approach. A better, still
simple approach, could be to use this kind of time limit and additionally
look at the distance between artworks interacted with within one time
window. When artworks are far apart it seems plausible that more than one
person interacted with them. Very short time lapses between events on
different artworks could also be an indicator that more than one person is
interacting with the table.
look at the distance between items interacted with within one time window.
When items are far apart it seems plausible that more than one person
interacted with them. Very short time lapses between events on different
items could also be an indicator that more than one person is interacting
with the table.
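For reference, the simple time limit rule (a new case begins when no event occurred for 20 seconds or longer) can be sketched like this. The function name `assign_case()` is hypothetical, and the sketch assumes that the timestamps are already sorted in time order.

```r
# Hypothetical helper: assign a case number based on gaps between events.
# `date` must be a sorted POSIXct vector, `gap` is the time limit in seconds.
assign_case <- function(date, gap = 20) {
  secs <- as.numeric(date)
  # A new case starts at the first event and after every gap >= `gap` seconds
  cumsum(c(TRUE, diff(secs) >= gap))
}

# Made-up timestamps: gaps of 5, 25 and 19 seconds
dates <- as.POSIXct(c("2016-12-15 12:00:00", "2016-12-15 12:00:05",
                      "2016-12-15 12:00:30", "2016-12-15 12:00:49"),
                    tz = "UTC")
cases <- assign_case(dates)
```

A distance-based refinement as suggested above would add a second condition to the same comparison, e.g. on the distance between the coordinates of consecutive events.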
## Assign a `trace` variable
## Assign a `path` variable
The `trace` variable is supposed to show one interaction trace with one
The `path` variable is supposed to show one interaction trace with one
artwork. Meaning it starts when an artwork is touched or flipped and stops
when it is closed again. It is easy to assign a trace from flipping a card
when it is closed again. It is easy to assign a path from flipping a card
over opening (maybe several) topics and pop-ups for this artwork card until
closing this card again. But one would like to assign the same trace to
closing this card again. But one would like to assign the same path to
move events surrounding this interaction. Again, this is not possible in an
algorithmic way but only heuristically. I used the `case` variable in order
to get meaningful units around the artworks.
algorithmic way but only heuristically.
If within one case only a single trace for a single artwork was opened, I
assigned this trace to the moves associated with this artwork. It (quite
often) happens that within one case one artwork is opened and closed
several times, each time starting a new trace. I then assigned all the
following move events to the trace beforehand. This is, of course,
arbitrary and could also be handled the other way around.
Another possibility is, that an artwork gets moved within one trace without
being flipped. I then assigned a new trace to this move.
This overall worked very well even though it was based on the very
heuristic approach assigning a case when the table has not been touched for
20 seconds. It should be kept in mind that the trace assignments for the
moves will change when case is defined in a different way.
Again, I used a time cutoff for this. First, if a `move` event occurs, it
is checked, if the same item has been flipped less than 20 seconds
beforehand. If yes, the same path indicator is assigned to this `move`. If
not, temporarily a new "move indicator" is assigned. Then, a "backward
pass" is applied, where it is checked if the same item is opened less than
20 seconds _after_ the event occurs. If yes, that path indicator is
assigned. For all the remaining moves, a new path number is assigned. This
corresponds to items being moved without being flipped.
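The two checks can be sketched in a simplified form. The sketch below is not the package implementation and makes several assumptions: every flip is represented by a single time in seconds, the data frames `moves` and `flips` with their column names are hypothetical, and every remaining move receives its own new path number.

```r
# Hypothetical sketch of the path assignment for move events.
# moves: data frame with columns item, time (seconds)
# flips: data frame with columns item, time (seconds), path
assign_move_paths <- function(moves, flips, cutoff = 20) {
  path <- rep(NA_integer_, nrow(moves))
  for (i in seq_len(nrow(moves))) {
    same <- flips[flips$item == moves$item[i], ]
    lag  <- moves$time[i] - same$time  # positive: flip happened before the move
    before <- same[lag >= 0 & lag < cutoff, ]
    after  <- same[lag < 0 & -lag < cutoff, ]
    if (nrow(before) > 0) {
      # Same item flipped less than `cutoff` seconds beforehand
      path[i] <- before$path[which.max(before$time)]
    } else if (nrow(after) > 0) {
      # "Backward pass": same item flipped less than `cutoff` seconds afterwards
      path[i] <- after$path[which.min(after$time)]
    }
  }
  # Remaining moves: items moved without being flipped get new path numbers
  n_new <- sum(is.na(path))
  path[is.na(path)] <- max(flips$path, 0L) + seq_len(n_new)
  path
}

flips <- data.frame(item = c("001", "002"), time = c(100, 200), path = 1:2)
moves <- data.frame(item = c("001", "002", "001", "003"),
                    time = c(110, 190, 300, 105))
move_paths <- assign_move_paths(moves, flips)
```

In the toy data, the first move follows a flip of item "001" within 20 seconds, the second move precedes a flip of item "002" within 20 seconds, and the last two moves are not close to any flip of the same item.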
## A `move` event does not record any change
Most of the events in the log files are move events. Additionally, many of
these move events are recorded but they do not indicate any change meaning
the only difference is the time stamp. All other variables indicating moves
these move events are recorded but they do not indicate any change, meaning
the only difference is the timestamp. All other variables indicating moves
like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc. do
not show any change. They represent about 2/3 of all move events. These
not show _any_ change. They represent about 2/3 of all move events. These
events are probably short touches of the table without an actual
interaction. They were therefore removed from the data set.
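A minimal sketch of how such moves could be identified is shown below. It assumes the column names of the wide data frame described above and uses made-up values. It is not the code used in the package.

```r
# Toy data (made up): three move events in wide format
moves <- data.frame(
  x.start = c(100, 200, 300),      x.stop = c(100, 250, 300),
  y.start = c(50, 60, 70),         y.stop = c(50, 60, 70),
  scale.start = c(1, 1, 1),        scale.stop = c(1, 1, 1.5),
  rotation.start = c(0, 10, 20),   rotation.stop = c(0, 10, 20)
)

# A move shows no change if position, scale and rotation are all identical
nochange <- with(moves, x.start == x.stop & y.start == y.stop &
                   scale.start == scale.stop &
                   rotation.start == rotation.stop)

moves_kept <- moves[!nochange, ]
```

Only the first row is dropped: the second move changes its x-coordinate and the third changes its scale. Note that comparisons involving `NA` return `NA`, so unclosed events would need to be handled separately.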
## Events that only close (`date.start` is NA)
It looks like there is some kind of log error for the events that do not
have a start stop. I was able to get rid of most by sorting for `popup` for
the openPopup events, but there are still some left (50 for the small data
set, which corresponds to 0.2 per mill). The following example shows that
artwork "501" gets closed (line 31030) while the pop-up `sommerbau.xml`
is still opened (line 31027). Then artwork "501" gets opened again
(line 31035) and after that the pop-up `sommerbau.xml` is closed (line
31040). This should not be possible and therefore (correctly) two events
are assigned: One where the pop-up was opened and then not closed (which is
common) and another one where the pop-up has no start.
```{r}
dat[31000:31019,]
# Card gets flipped closed before pop-up closes --> log error!
```
I did not check all of these cases (for the complete data set this is
simply not possible by hand) but just excluded all events that do not have
a `date.start` since they are hard to interpret. Often they are log errors
but in some cases they might be resolvable.
```{r}
# remove all events that do not have a `date.start`
dim(dat2[is.na(dat2$date.start), ])
dat2 <- dat2[!is.na(dat2$date.start), ]
```
In order to deal with these logging errors, I check the data for what I
call "fragmented traces". These are traces that cannot happen, when
everything is logged correctly, e.g., traces containing `flipCard ->
openPopup` or traces that only consist of `move`, `openTopic`, and
`openPopup` events. These fragmented traces are removed from the data. It
was not possible to check them all manually, but the 20 or more that I did
check in the raw log files were all some kind of logging error like above.
Most often a card was already closed again, before a topic card or pop-up
was recorded as being closed.
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
See `questions_number-of-cards.R` for more details.
In the beginning I thought that the number for topics was the index of
where the card was presented on the back of the item. But this is not
correct. It is the number of the topic. There are eight topics in total:
I wrote a function that for each artwork extracts the file names of the
possible topic cards and then looks up which topics have actually been
displayed on the back of the card. I added an index giving the ordering in
the index files.
The possible values in the variable `topicNumber` range from 0 to 7,
however, no artwork has more than six different numbers. So I just renamed
those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
assign topics and file names to the according pop-ups. This needs to be
cross checked with the programming, but seems the most plausible approach
with my current knowledge.
<!-- TODO: Ask Philipp -->
## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
When I extract the topics from `index.html` I get different topics, than
when I get them from `<artwork>.html`. At first glance, it looks like using
`index.html` actually gives the wrong results.
```{r}
artworks <- unique(dat2$artwork)
path <- "data/haum/ContentEyevisit/eyevisit_cards_light/"
topics <- extract_topics(artworks, rep("index.xml", length(artworks)), path)
topics2 <- extract_topics(artworks, paste0(artworks, ".xml"), path)
topics[!topics$file_name %in% topics2$file_name, ]
topics2[!topics2$file_name %in% topics$file_name, ]
```
Indices for topics:
0 artist
1 thema
2 komposition
3 leben des kunstwerks
4 details
5 licht und farbe
6 extra info
7 technik
```
On the back of items, there can be between 2 and 6 topic cards. Several of
these topic cards can be about the same topic, e.g., there can be two topic
cards assigned to the topic `thema`. It is impossible to find out if the
same topic card was opened several times or if different topic cards with
the same topic were opened from the same item. See example below for item
"001".
For artwork "031", `index.html` only defines 5 cards (the 6th is commented
out), but `topicNumber` for this artwork has 6 different entries. I will
therefore extract the topics from `<artwork>.html`. (This seems also better
compatible with other data sets like 8o8m.)
```{r topics, echo = FALSE}
items <- sprintf("%03d", unique(datlogs$item))
topics <- extract_topics(items, xmlfiles = paste0(items, ".xml"),
xmlpath = "data/haum/ContentEyevisit/eyevisit_cards_light/")
head(topics)
```
## New artworks "504" and "505" starting October 2022
When I read in the complete data frame for the first time, all of a
sudden there were 72 instead of 70 artworks. It seems like these two
sudden there were 72 instead of 70 items. It seems like these two
artworks appear on October 21, 2022.
```{r}
dat0 <- read.table("data/haum/raw_logfiles_2023-09-23_01-31-30.csv",
sep = ";", header = TRUE)
dat0$date <- as.POSIXct(dat0$date)
dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
# Remove irrelevant events
dat <- subset(dat0, !(dat0$event %in% c("Start Application",
"Show Application")))
summary(dat[dat$artwork %in% c("504", "505"), ])
```{r newitems}
summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
```
The artworks seem to have been updated in general after October 21, 2022.
The artworks seem to have been updated in general after October 21, 2022. The
following table shows which items were presented in which years.
```{r}
art_after_oct2022 <- sort(unique(dat[dat$date >= "2022-10-21", "artwork"]))
art_before_oct2022 <- sort(unique(dat[dat$date <= "2022-10-21", "artwork"]))
# Removed artworks
art_before_oct2022[!art_before_oct2022 %in% art_after_oct2022]
# Additional artworks
art_after_oct2022[!art_after_oct2022 %in% art_before_oct2022]
```{r years}
xtabs(~ item + lubridate::year(date.start), datlogs)
```
The following table shows which artworks were presented in which years.
```{r}
xtabs(~ artwork + lubridate::year(date), dat)
```
It strongly suggests that the artworks have been updated after the Corona
pandemic. I think the table was also moved to a different location at that
point. (Check with PG to make sure.)
# Optimizing resources used by the code
After I started trying out the functions on the complete data set, it
became obvious (not surprisingly `:)`) that this will not work --
especially for the move events. The reshape function cannot take a long
data frame with over 6 Million entries and convert it into a wide data
frame (at least not on my laptop). The code is supposed to work "out of the
box" for researchers, hence it *should* run on a regular (8 core) laptop.
So, I changed the reshaping so that it is done in batches on subsets of the
data for every `fileId` separately. This means that events that span over
two (or more) raw log files cannot be closed and will then be removed from
the data set. The function warns about this, but which events get removed
is random, so this does not seem like a systematic problem. Another reason
why this is not bad is that durations cannot be calculated for events
across log files anyway, because the time stamps do not increase
systematically over log files (see above).
UPDATE: By now, I close the events spanning more than one log file after
this has been done.
I meant to put the lists back together with `do.call(rbind, some_list)` but
this can also not handle big data sets. I therefore switched to
`dplyr::bind_rows(some_list)` which is really fast and was developed
especially for this purpose. It means that I have to depend on the dplyr
package (which I am not a big fan of, since I meant to keep the package
self-contained).
# Reading list
* @Arizmendi2022 [--]
* @Bannert2014 [x]
* @Bousbia2010 [--]
* @Cerezo2020
* @GerjetsSchwan2021 [x]
* @Goldhammer2020
* @Guenther2007
* @HuberBannert2023 [x]
* @Kroehne2018
* @SchwanGerjets2021 [x]
* @vanderAalst2016 [Chap. 2, x]
* @vanderAalst2016 [Chap. 3]
* @vanderAalst2016 [Chap. 5, x]
* @Wang2019
# Open stuff
* Angle from which people approach table in Braunschweig? Consider in
rotation variable?
* Time limit for `case` variable different for different events? (openTopic
should be opened the longest)
$\to$ I think this is not relevant since I am looking at time *between*
events!
# Stuff AK found interesting
* Pre/post corona
* Identify school classes
* How many persons are present at the table?
# Other potential questions
* "Bursts"
* 1st vs. 2nd half of the day
* Can we identify "types of art"? With clustering or something?
* Possible to estimate how many persons per day? Maybe average of certain
weekdays? ... ?
It shows that the artworks have been updated after the Corona pandemic. I
think the table was also moved to a different location at that point.

577
README.md Normal file
View File

@ -0,0 +1,577 @@
Log data from the Multi-Touch Table at the HAUM
================
The Multi-Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in
Braunschweig gives visitors of the museum the opportunity to interact
with about 70 artworks and 3 virtual cards containing information about
the museum and its layout. The table was installed at the institute in
October 2016 and since November 2016 log files from interactions of
visitors of the museum have been collected. These log files are in an
unstructured format and cannot be easily analyzed. The purpose of the
following document is to describe how the data have been transformed
and which decisions have been made along the way.
# Data structure
The log files contain lines that indicate the beginning and end of
possible activities that can be performed when interacting with the
artworks on the table. The layout of the table looks like pictures have
been tossed on a large table. Every artwork is visible at the start
configuration. People can move the pictures on the table, they can be
scaled and rotated. Additionally, the virtual picture cards can be
flipped in order to find more information about the artwork on the “back”
of the card. One has to press a little `i` for more information in one
of the bottom corners of the card. On the back of the card two to six
information cards can be found with a teaser text about a certain topic.
These topic cards can be opened and a hypertext with detailed
information opens. Within these hypertexts certain technical terms can
be clicked for lay people to get more information. This also opens up a
pop-up. The events encoded in the raw log files therefore have the
following structure.

    "Start Application"  --> Start Application
    "Show Application"
    "Transform start"    --> Move
    "Transform stop"
    "Show Info"          --> Flip Card
    "Show Front"
    "Artwork/OpenCard"   --> Open Topic
    "Artwork/CloseCard"
    "ShowPopup"          --> Open Popup
    "HidePopup"

The right side shows what events can be extracted from these raw lines.
The “Start Application” is not an event in the original sense since it
only indicates if the table was started or maybe reset itself. This is
not an interaction with the table and therefore not interesting in
itself. All “Start Application” and “Show Application” are therefore
excluded from the data when further processed and are only in the raw
log files.
# Parsing the raw log files
The first step is to parse the raw log files that are stored by the
application as text files in a rather unstructured format to a format
that can be read by common statistics software packages. The data are
therefore transferred to a spread sheet format. The following section
describes what problems were encountered while doing this.
## Corrupt lines
When reading the files containing the raw logs into R, a warning appears
that says
Warning messages:
incomplete final line found on '2016/2016_11_18-11_31_0.log'
incomplete final line found on '2016/2016_11_18-11_38_30.log'
incomplete final line found on '2016/2016_11_18-11_40_36.log'
...
When you open these files, it looks like the last line contains some
binary content. It is unclear why and how this happens. So when reading
the data, these lines were removed. A warning will be given that
indicates how many files have been affected.
## Extracted variables from raw log files
The following variables (columns in the data frame) are extracted from
the raw log file:
- `fileId`: Containing the zero-left-padded file name of the raw log
file the data line has been extracted from
- `folder`: The folder names in which the raw log files have been
organized. For the HAUM data set, the data are sorted by year
(folders 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
- `date`: Extracted timestamp from the raw log file in the format
`yyyy-mm-dd hh:mm:ss`.
- `timeMs`: Containing a timestamp in milliseconds that restarts with
every new raw log file.
- `event`: Start and stop event tags. See above for possible values.
- `item`: Identifier of the different items. This is a three-digit
(left-padded) number. The numbers of the items correspond to the
folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
originally taken from the museum's catalogue.
- `popup`: Name of the pop-up opened. This is only interesting for
“openPopup” events.
- `topic`: The number of the topic card that has been opened at the back
of the item card. See below for a more detailed description of what
these numbers mean.
- `x`: Value of x-coordinate in pixels on the 4K-Display
($3840 \times 2160$)
- `y`: Value of y-coordinate in pixels
- `scale`: Number in 128 bit that indicates how much the card has been
scaled
- `rotation`: Degree of rotation in start configuration.
<!-- TODO: After which time interval does the table reset itself to the
start configuration? -> PM needs to look it up -->
## Variables after “closing of events”
The raw log data consist of start and stop events for each event type.
After preprocessing four event types are extracted: `move`, `flipCard`,
`openTopic`, and `openPopup`. Except for the `move` events, which can
occur at any time when interacting with an item card on the table, the
events have a hierarchical order: An item card first needs to be flipped
(`flipCard`), then the topic cards on the back of the card can be opened
(`openTopic`), and finally pop-ups on these topic cards can be opened
(`openPopup`). This implies that the event `openPopup` can only be
present for a certain item, if the card has already been flipped (i.e.,
an event `flipCard` for the same item has already occurred).
After preprocessing, the data frame is now in a wide format with columns
for the start and the stop of each event and contains the following
variables:
- `fileId.start` / `fileId.stop`: See above.
- `date.start` / `date.stop`: See above.
- `folder`: Containing the folder name (see above)
- `case`: A numerical variable indicating cases in the data. A “case”
indicates an interaction interval and could be defined in different
ways. Right now a new case begins, when no event occurred for 20
seconds or longer.
- `path`: A path is defined as one interaction with one item. A path can
either start with a `flipCard` event or when an item has been touched
for the first time within this case. A path ends with the item card
being flipped closed again or with the last movement of the card within
this case. One case can contain several paths with the same item when
the item is flipped open and flipped closed again several times within
a short time.
- `glossar`: An indicator variable with values 0/1 that tracks if a
pop-up has been opened from the glossar folder. These pop-ups can be
assigned to the wrong item since it is not possible to do this
algorithmically. It is possible that two items are flipped open that
could both link to the same pop-up from a glossar. The indicator
variable is left as a variable, so that these pop-ups can be easily
deleted from the data. Right now, glossar entries can be ignored
completely by setting an argument and this is done by default. Using
the pop-ups from the glossar will need a lot more love, before it
behaves satisfactorily.
- `event`: Indicating the event. Can take the values `move`, `flipCard`,
`openTopic`, and `openPopup`.
- `item`: Identifier of the different artworks and information cards.
This is a three-digit (left-padded) number. See above.
- `timeMs.start` / `timeMs.stop`: See above.
- `duration`: Calculated by $timeMs.stop - timeMs.start$ in
milliseconds. For events spanning more than one log file,
$600,000 \times (\text{number of log files} - 1)$ must be added. See
below for details.
- `topic`: See above.
- `popup`: See above.
- `x.start` / `x.stop`: See above.
- `y.start` / `y.stop`: See above.
- `distance`: Euclidean distance calculated from $(x.start, y.start)$
and $(x.stop, y.stop)$.
- `scale.start` / `scale.stop`: See above.
- `scaleSize`: Relative scaling of item card, calculated by
$\frac{scale.stop}{scale.start}$.
- `rotation.start` / `rotation.stop`: See above.
- `rotationDegree`: Difference of rotation from $rotation.stop$ to
$rotation.start$.
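
As a minimal sketch of how `duration`, `distance`, and `scaleSize` follow from the start and stop columns for events within a single log file (the toy values are made up for illustration and this is not the package code):

``` r
# Toy events in wide format (values made up)
toy <- data.frame(
  timeMs.start = c(621, 774),     timeMs.stop = c(677, 850),
  x.start      = c(100, 2092.25), x.stop      = c(103, 2092.25),
  y.start      = c(200, 2008),    y.stop      = c(204, 2008),
  scale.start  = c(0.3, 0.3),     scale.stop  = c(0.6, 0.3)
)

# Duration in milliseconds (only valid within one log file)
toy$duration <- toy$timeMs.stop - toy$timeMs.start
# Euclidean distance between start and stop position in pixels
toy$distance <- sqrt((toy$x.stop - toy$x.start)^2 +
                     (toy$y.stop - toy$y.start)^2)
# Relative scaling: values > 1 mean the card was enlarged
toy$scaleSize <- toy$scale.stop / toy$scale.start

toy[, c("duration", "distance", "scaleSize")]
```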
## How unclosed events are handled
Events do not necessarily need to be completed. A person can, e.g.,
leave the table and not flip the item card closed again. For `flipCard`,
`openTopic`, and `openPopup` the data frame contains `NA` when the event
does not complete. For `move` events it happens quite often that a start
event follows a start event and a stop event follows a stop event.
Technically a move event cannot *not* be finished and the number of
events without a start or stop indicates that the time resolution was not
sufficient to catch all these events accurately. Double start and stop
`move` events have therefore been deleted from the data set.
## Additional meta data
For the HAUM data, I added meta data on state holidays and school
vacations.
This led to the following additional variables:
- `holiday`
- `vacations`
# Problems and how I handled them
This lists some problems with the log data that required decisions.
These decisions influence the outcome and maybe even the data quality.
Hence, I tried to document how I handled these problems and explain the
decisions I made.
## Weird behavior of `timeMs` and neg. `duration` values
`timeMs` resets itself every time a new log file starts. This means that
the durations of events spanning more than one log file must be
adjusted. Instead of just calculating $timeMs.stop - timeMs.start$,
`timeMs.start` must be subtracted from the maximum duration of the log
file where the event started ($600,000$ ms) and `timeMs.stop` must
be added. If the event spans more than two log files, a multiple of
$600,000$ must be taken, e.g. for three log files it must be:
$2 \times 600,000 - timeMs.start + timeMs.stop$ and so on.
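
The adjustment can be sketched as follows. The helper name `adjust_duration` and its argument `n_files` (the number of log files an event touches) are assumptions for illustration and not part of the package:

``` r
# Hypothetical helper: duration in ms for an event that may span several
# log files; n_files = 1 means start and stop are in the same log file
adjust_duration <- function(timeMs.start, timeMs.stop, n_files = 1,
                            max_ms = 600000) {
  (n_files - 1) * max_ms - timeMs.start + timeMs.stop
}

# Event within one log file
adjust_duration(621, 677)
# Event starting at the end of one log file and stopping in the next one
adjust_duration(599671, 621, n_files = 2)
```

For `n_files = 1` this reduces to $timeMs.stop - timeMs.start$, so the same function could be applied to all events.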
![](README_files/figure-gfm/timems-1.png)<!-- -->
The boxplot shows that we have a continuous range of values within one
log file but that `timeMs` does not increase over log files. I kept
`timeMs.start` and `timeMs.stop` and also `fileId.start` and
`fileId.stop` in the data frame, so it is clear when events span more
than one log file.
<!--
Info from the programmer:
"I also just went through the code from back then. The logging works like
this: Once the application starts, a new log file is created every 10
minutes. The start time from which the duration is calculated is reset
each time. So duration is not "time since start of the application" but
"time since restart of the logger". Your assumption is therefore correct
- there should be no durations >10 minutes. The first entry of a log file
can be anything between 0 and 10 minutes (depending on whether the table
was in use at the time of the new logging interval). So if a case is
spread over 2+ logs, you have to add 10 minutes to the duration for each
log file after the first one to make it fit."
-->
## Left padding of file IDs
The file names of the raw log files are automatically generated and
contain a timestamp. This timestamp is not well formed. First, it
contains an incorrect month. The months go from 0 to 11, which means
that the file named `2016_11_15-12_12_57.log` was collected on December
15, 2016 at 12:12 pm. Another problem is that the file names are not
zero left padded, e.g., `2016_11_15-12_2_57.log`. This file was
collected on December 15, 2016 at 12:02 pm and therefore before the file
above. But most sorting algorithms will sort these files in the order
shown below. In order to preprocess the data and close events that
belong together, the data need to be sorted by events and artworks
repeatedly. In order to get them back in the correct time order, it is
necessary to order them based on three variables: `fileId.start`,
`date.start` and `timeMs.start`. The file IDs therefore need to sort in
the correct order (again see below for example). I zero left padded the
log file names within the data frame using it as an identifier. These
“file names” do not correspond exactly to the original raw log file
names. This needs to be kept in mind when doing any kind of matching
etc.
## what it looked like before left padding
# 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
# 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
## what it looks like now
# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
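
A minimal sketch of such a zero left padding, assuming that every single-digit part of the file name should get a leading zero (the helper `pad_file_id` is hypothetical and not the function used in the package):

``` r
# Hypothetical helper: zero left pad every single-digit part of a file name
pad_file_id <- function(x) {
  x <- basename(x)  # drop folder names
  # A digit with no digit directly before or after it gets a leading zero
  gsub("(?<![0-9])([0-9])(?![0-9])", "0\\1", x, perl = TRUE)
}

files <- c("2016_11_15-12_12_57.log", "2016_11_15-12_2_57.log")

# Sorting the original names puts the later file first
sort(files, method = "radix")
# After padding, the files sort in the order they were written
sort(pad_file_id(files), method = "radix")
```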
## Timestamps repeat
The timestamps in the `date` variable record year, month, day, hour,
minute and seconds. Since one second is not a very short time interval
for a move on a touch display, this is not fine grained enough to bring
events into the correct order, meaning there are events from the same
log file having the same timestamp and even events from different log
files having the same timestamp. The log files get written about every
10 minutes (which can easily be seen when looking at the file names of
the raw log files). So in order to get events in the correct order, it
is necessary to first order by file ID, within file ID then sort by
timestamp `date` and then within these more coarse grained timestamps
sort by `timeMs`. But as explained above, `timeMs` can only be sorted
within one file ID, since the values do not increase consistently over log
files, but start over with each raw log file.
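
The three-level ordering can be sketched with a few toy lines (values made up, file IDs already zero left padded):

``` r
# Toy data standing in for parsed raw log lines
toy <- data.frame(
  fileId = c("2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log",
             "2016_11_15-12_12_57.log", "2016_11_15-12_02_57.log"),
  date   = as.POSIXct(c("2016-12-15 12:12:57", "2016-12-15 12:12:57",
                        "2016-12-15 12:12:57", "2016-12-15 12:12:56"),
                      tz = "UTC"),
  timeMs = c(677, 599916, 621, 599671),
  stringsAsFactors = FALSE
)

# Order by file ID first, then by timestamp, then by milliseconds
ord <- order(toy$fileId, toy$date, toy$timeMs)
toy[ord, ]
```

Ordering by `date` and `timeMs` alone would mix up the lines here, since three of them share the same timestamp and `timeMs` is only comparable within one file ID.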
## x,y-coordinates outside of display range
The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
pixels. When you plot the start and stop coordinates, the display is
clearly distinguishable. However, a lot of points are outside of the
display range. This can happen, when the art objects are scaled and then
moved to the very edge of the table. Then it will record pixels outside
of the table. These are actually valid data points and I will leave them
as is.
``` r
datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
header = TRUE)
par(mfrow = c(1, 2))
plot(y.start ~ x.start, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
plot(y.stop ~ x.stop, datlogs)
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
```
![](README_files/figure-gfm/xycoord-1.png)<!-- -->
``` r
aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
```
## x.start x.stop y.start y.stop
## 1 1978.202 1975.876 1137.481 1133.494
## Pop-ups from glossar cannot be assigned to a specific item
All the information, pictures and texts for the topics and pop-ups are
stored in
`/data/haum/ContentEyevisit/eyevisit_cards_light/<item_number>`. Among
other things, each folder contains XML-files with the information about
any technical terms that can be opened from the hypertexts on the topic
cards. Often this information is item-dependent and then the
corresponding XML-file is in the folder for this item. Sometimes,
however, more general terms can be opened. In order to avoid multiple
files containing the same information, these were stored in a folder
called `glossar` and get accessed from there. The raw log files only
contain the path to this glossar entry and did not record from which
item it was accessed. I tried to assign these glossar entries to the
correct items. The (very heuristic) approach was this:
1. Create a lookup table with all XML-file names (possible pop-ups)
from the glossar folder and what items possibly call them. This was
stored as an `RData` object for easier handling but should maybe be
stored in a more interoperable format.
2. I went through all possible pop-ups in this lookup table and stored
the items that are associated with it.
3. I created a sub data frame without move events (since they can never
be associated with a pop-up) and went through every line and looked
up if an item and a topic card had been opened. If this was the case
and a glossar entry came up before the item was closed again, I
assigned this item to the glossar entry.
This is heuristic since it is possible that several topic cards from
different items are opened simultaneously and the glossar pop-up could
be opened from either one (it could even be more than two, of course).
In these cases the item that was opened closest to the glossar pop-up
has been assigned, but this can never be completely error free.
And this heuristic only assigns a little more than half of the glossar
entries. Since my heuristic only looks at the last item that has been
opened and checks whether this item is a possible candidate, it misses
all glossar pop-ups where another item has been opened in between.
Writing a more elaborate algorithm is still an open TODO.
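
A strongly simplified sketch of this heuristic is given below. It assumes that glossar pop-ups carry the item value `"glossar"` in the raw data, it ignores the closing of items, and the lookup table, file names, and event sequence are made up for illustration. It is not the package implementation:

``` r
# Hypothetical lookup table: glossar pop-up -> items that can call it
lookup <- list(term_a.xml = c("001", "031"), term_b.xml = "045")

# Toy event sequence without move events (values made up)
toy <- data.frame(
  event = c("openTopic", "openTopic", "openPopup", "openPopup", "openPopup"),
  item  = c("001", "045", "glossar", "glossar", "glossar"),
  popup = c(NA, NA, "term_b.xml", "term_a.xml", "term_c.xml"),
  stringsAsFactors = FALSE
)

last_item <- NA_character_
toy$assigned <- toy$item
for (i in seq_len(nrow(toy))) {
  if (toy$item[i] != "glossar") {
    # Remember the most recently opened item
    last_item <- toy$item[i]
  } else if (last_item %in% lookup[[toy$popup[i]]]) {
    # Last opened item is a possible candidate: assign it
    toy$assigned[i] <- last_item
  } else {
    # No match: the glossar pop-up stays unassigned
    toy$assigned[i] <- NA
  }
}
toy
```

The fourth line shows the weakness described above: `term_a.xml` could have been opened from item "001", but since item "045" was opened in between, the pop-up is not assigned.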
All glossar pop-ups that do not get matched with an item are removed
from the data set with a warning if the argument `glossar = TRUE` is
set. Otherwise the glossar entries will be ignored completely.
## Assign a `case` variable based on “time heuristic”
One thing needed in order to work with the data set and use it for
machine learning algorithms like process mining, is a variable that
tries to identify a case. A case variable will structure the data frame
in a way that navigation behavior can actually be investigated. However,
we do not know if several people are standing around the table
interacting with it or just one very active person. The simplest way to
define a case variable is to just use a time limit between events. This
means that when the table has not been interacted with for, e.g., 20
seconds then it is assumed that a person moved on and a new person
started interacting with the table. This is the easiest heuristic and
implemented at the moment. Process mining shows that this simple
approach works in a way that the correct process gets extracted by the
algorithm.
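
The time heuristic can be sketched as follows. The helper `assign_case` is hypothetical and simplified: it only looks at one sorted vector of timestamps and starts a new case after a gap of 20 seconds or longer.

``` r
# Hypothetical helper: start a new case whenever the gap to the previous
# event is `cutoff` seconds or longer; `time` must be sorted increasingly
assign_case <- function(time, cutoff = 20) {
  gaps <- diff(as.numeric(time))
  cumsum(c(TRUE, gaps >= cutoff))
}

# Toy timestamps (made up): gaps of 5, 7, 28, 5, and 25 seconds
time <- as.POSIXct("2016-12-15 12:00:00", tz = "UTC") +
  c(0, 5, 12, 40, 45, 70)
assign_case(time)
```

Changing `cutoff` is all that is needed to try out a different time limit.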
In order to investigate user behavior on a more fine grained level, it
will be necessary to come up with a more elaborate approach. A better,
still simple approach, could be to use this kind of time limit and
additionally look at the distance between items interacted with within
one time window. When items are far apart it seems plausible that more
than one person interacted with them. Very short time lapses between
events on different items could also be an indicator that more than one
person is interacting with the table.
## Assign a `path` variable
The `path` variable is supposed to show one interaction trace with one
artwork. Meaning it starts when an artwork is touched or flipped and
stops when it is closed again. It is easy to assign a path from flipping
a card over opening (maybe several) topics and pop-ups for this artwork
card until closing this card again. But one would like to assign the
same path to move events surrounding this interaction. Again, this is
not possible in an algorithmic way but only heuristically.
Again, I used a time cutoff for this. First, if a `move` event occurs,
it is checked, if the same item has been flipped less than 20 seconds
beforehand. If yes, the same path indicator is assigned to this `move`.
If not, temporarily a new “move indicator” is assigned. Then, a
“backward pass” is applied, where it is checked if the same item is
opened less than 20 seconds *after* the event occurs. If yes, that path
indicator is assigned. For all the remaining moves, a new path number is
assigned. This corresponds to items being moved without being flipped.
## A `move` event does not record any change
Most of the events in the log files are move events. Additionally, many
of these move events are recorded but they do not indicate any change,
meaning the only difference is the timestamp. All other variables
indicating moves like `x.start` and `x.stop`, `rotation.start` and
`rotation.stop` etc. do not show *any* change. They represent about 2/3
of all move events. These events are probably short touches of the table
without an actual interaction. They were therefore removed from the data
set.
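
A minimal sketch of how such move events can be identified, assuming the wide format described above without missing values in the compared columns (toy values made up):

``` r
# Toy events in wide format
toy <- data.frame(
  event          = c("move", "move", "flipCard", "move"),
  x.start        = c(100, 100, 300, 50),
  x.stop         = c(100, 180, 300, 50),
  y.start        = c(200, 200, 400, 60),
  y.stop         = c(200, 200, 400, 60),
  scale.start    = c(0.3, 0.3, 0.3, 0.3),
  scale.stop     = c(0.3, 0.3, 0.3, 0.35),
  rotation.start = c(13.2, 13.2, 0, 5),
  rotation.stop  = c(13.2, 13.2, 0, 5),
  stringsAsFactors = FALSE
)

# A move without change has identical start and stop values everywhere
no_change <- with(toy, event == "move" &
  x.start == x.stop & y.start == y.stop &
  scale.start == scale.stop & rotation.start == rotation.stop)

# Keep everything except moves without change
toy[!no_change, ]
```

Only the first line is dropped: the second move changes `x`, the fourth changes `scale`, and the `flipCard` event is kept regardless.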
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
In the beginning I thought that the number for topics was the index of
where the card was presented on the back of the item. But this is not
correct. It is the number of the topic. There are eight topics in total:
Indices for topics:
0 artist
1 thema
2 komposition
3 leben des kunstwerks
4 details
5 licht und farbe
6 extra info
7 technik
On the back of items, there can be between 2 and 6 topic cards. Several
of these topic cards can be about the same topic, e.g., there can be two
topic cards assigned to the topic `thema`. It is impossible to find out
if the same topic card was opened several times or if different topic
cards with the same topic were opened from the same item. See example
below for item “001”.
## item file_name topic
## 1 001 001_dargestellte.xml thema
## 2 001 001_thema1.xml thema
## 3 001 001_leben.xml leben des kunstwerks
## 4 001 001_leben3.xml leben des kunstwerks
## 5 001 001_thema2.xml thema
## 6 001 001_thema.xml thema
## New artworks “504” and “505” starting October 2022
When I read in the complete data frame for the first time, all of a
sudden there were 72 instead of 70 items. It seems like these two
artworks appear on October 21, 2022.
``` r
summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## "2022-10-21" "2023-01-11" "2023-03-08" "2023-03-09" "2023-05-21" "2023-07-05"
The artworks seem to have been updated in general after October 21, 2022.
The following table shows which items were presented in which years.
``` r
xtabs(~ item + lubridate::year(date.start), datlogs)
```
## lubridate::year(date.start)
## item 2016 2017 2018 2019 2020 2022 2023
## 1 277 4082 1912 1434 424 394 1315
## 3 485 6730 3126 2356 528 457 1124
## 19 714 8656 4028 2743 660 698 1595
## 20 595 8461 3996 2983 938 657 1355
## 24 497 6638 2912 2251 649 439 1028
## 27 567 5959 3112 2318 651 711 1324
## 28 601 9329 4394 3056 778 762 1570
## 29 425 6865 3830 2365 516 615 1174
## 31 289 4118 2051 1218 291 296 675
## 32 562 7016 3477 2253 726 766 1647
## 33 509 4936 2242 1449 555 358 666
## 36 434 4505 2276 1668 373 387 976
## 37 242 4478 2182 1554 339 423 1168
## 38 480 4617 2144 1397 371 381 784
## 39 395 3227 1313 1003 237 161 622
## 41 282 3329 1303 1022 225 209 701
## 42 203 3113 1307 903 242 191 421
## 43 115 2420 1089 806 176 219 486
## 45 1491 13561 5924 4474 966 585 1828
## 46 903 9181 5340 3812 961 944 1648
## 47 306 4949 2395 1510 750 297 675
## 48 723 10455 5384 4162 1328 948 2031
## 49 433 4326 2124 1414 434 431 809
## 51 564 7837 4577 2991 884 659 1370
## 52 447 5021 2104 1729 471 349 840
## 54 424 5068 2816 2008 529 370 918
## 55 358 4859 2069 1428 341 403 1303
## 57 860 14264 6625 5092 1410 1221 2714
## 60 555 6865 3539 2336 639 586 1415
## 62 547 6736 3803 2210 795 633 1322
## 63 251 3677 1827 1241 300 282 527
## 66 552 6004 2774 1977 505 373 932
## 69 394 3730 1827 1438 272 206 680
## 70 226 3766 1843 973 293 268 703
## 71 557 6160 2490 1846 570 323 839
## 72 426 6194 2857 2129 508 635 1553
## 73 432 6125 2880 1821 583 395 939
## 75 258 5885 2418 1562 369 257 645
## 76 861 12435 6253 4214 1753 1153 2268
## 77 816 8595 4197 2897 699 674 1452
## 78 410 5632 2498 1924 394 408 850
## 80 1650 25687 12429 7782 1975 1712 4433
## 83 644 8618 4720 3026 987 1027 2294
## 84 184 2121 1231 759 231 254 465
## 87 149 1618 722 632 99 0 0
## 88 513 6996 3493 2272 539 533 1420
## 89 214 2204 950 723 156 0 0
## 90 281 3756 1372 1143 403 320 932
## 93 613 8528 4224 3015 696 1174 2058
## 98 462 6662 3265 2565 704 670 1453
## 99 180 4162 1653 1454 363 411 868
## 101 414 4209 1859 1282 392 411 981
## 103 677 8758 4366 3165 1045 909 1871
## 104 423 5256 2381 1865 463 467 933
## 107 181 2101 1106 788 205 146 339
## 109 321 4001 1619 1106 292 188 453
## 110 489 5846 2785 2008 494 387 923
## 125 640 8435 4519 3334 926 0 0
## 129 598 11322 5046 3369 910 1131 1682
## 145 419 7821 3945 2694 706 740 1396
## 176 507 8465 3968 2787 687 552 1544
## 180 516 7563 3720 2765 585 550 1272
## 183 377 4014 1819 1741 346 251 675
## 187 340 4222 2165 1753 319 312 734
## 197 426 7710 3603 2510 671 602 1217
## 229 303 4872 2360 1891 482 389 1005
## 231 271 3606 1851 1239 318 236 467
## 501 1915 15968 7849 5060 1157 890 2989
## 502 1212 14550 7111 4749 1105 883 2752
## 503 1308 15218 8632 6399 1626 870 2558
## 504 0 0 0 0 0 363 662
## 505 0 0 0 0 0 426 1533
It shows that the artworks have been updated after the Corona pandemic.
I think the table was also moved to a different location at that point.
