Updated README and tried to document all decisions I made so far
This commit is contained in:
parent
9f15ea1b62
commit
498b487338
470
README.Rmd
Normal file
@ -0,0 +1,470 @@
|
|||||||
|
---
|
||||||
|
title: "Background information about MTT data"
|
||||||
|
author: "Nora Wickelmaier"
|
||||||
|
date: "`r Sys.Date()`"
|
||||||
|
output:
|
||||||
|
html_document:
|
||||||
|
number_sections: true
|
||||||
|
toc: true
|
||||||
|
---
|
||||||
|
|
||||||
|
# Log data from the Multi-Touch Table at the HAUM
|
||||||
|
|
||||||
|
The Multi-Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in
Braunschweig gives museum visitors the opportunity to interact with
|
||||||
|
67 artworks and 3 tiles containing information about the museum and its
|
||||||
|
layout. The table was installed at the institute in October 2016 and since
|
||||||
|
November 2016 log files from interactions of visitors of the museum have
|
||||||
|
been collected. These log files are in an unstructured format and cannot be
|
||||||
|
easily analyzed. The purpose of the following document is to describe how
|
||||||
|
the data have been transformed and which decisions have been made along
|
||||||
|
the way.
|
||||||
|
|
||||||
|
# Data structure
|
||||||
|
|
||||||
|
The log files contain lines that indicate the beginning and end of possible
|
||||||
|
actions that can be performed when interacting with the artworks on the
|
||||||
|
table. The layout of the table looks as if 70 pictures had been tossed onto a
large table. Every artwork is visible in the start configuration. People
can move, scale, and rotate the pictures on the table.
|
||||||
|
Additionally, the virtual picture cards can be flipped in order to find
|
||||||
|
more information about the artwork on the "back" of the card. One has to press
|
||||||
|
a little `i` for more information in one of the bottom corners of the card.
|
||||||
|
On the back of the card two (?) to six information cards can be found with
|
||||||
|
a teaser text about a certain topic. These topic cards can be opened and a
|
||||||
|
hypertext with detailed information pops up. Within these hypertexts
|
||||||
|
certain technical terms can be clicked for lay people to get more
|
||||||
|
information. This also opens up a pop-up. The events encoded in the raw log
|
||||||
|
files therefore have the following structure.
|
||||||
|
|
||||||
|
```
|
||||||
|
"Start Application" --> Start Application
|
||||||
|
"Show Application"
|
||||||
|
"Transform start" --> Move
|
||||||
|
"Transform stop"
|
||||||
|
"Show Info" --> Flip Card
|
||||||
|
"Show Front"
|
||||||
|
"Artwork/OpenCard" --> Open Topic
|
||||||
|
"Artwork/CloseCard"
|
||||||
|
"ShowPopup" --> Open Popup
|
||||||
|
"HidePopup"
|
||||||
|
```
|
||||||
|
|
||||||
|
The right side shows what events can be extracted from these raw lines. The
|
||||||
|
"Start Application" is not an event in the original sense since it only
|
||||||
|
indicates that the table was started or possibly reset itself. This is not an
interaction with the table and therefore not interesting in itself. All
"Start Application" and "Show Application" lines are therefore excluded
during further processing and remain only in the raw log files.
|
||||||
|
|
||||||
|
# Parsing the raw log files
|
||||||
|
|
||||||
|
The first step is to parse the raw log files that are stored by the
|
||||||
|
application as text files in a rather unstructured format into a format that
is easier to handle. The data are therefore converted to a spreadsheet
|
||||||
|
format. The following section describes what problems were encountered
|
||||||
|
while doing this.
|
||||||
|
|
||||||
|
## Corrupt lines
|
||||||
|
|
||||||
|
When reading the files containing the raw logs into R, a warning appears
|
||||||
|
that says
|
||||||
|
|
||||||
|
```
|
||||||
|
Warning messages:
|
||||||
|
incomplete final line found on '_2016/2016_11_18-11_31_0.log'
|
||||||
|
incomplete final line found on '_2016/2016_11_18-11_38_30.log'
|
||||||
|
incomplete final line found on '_2016/2016_11_18-11_40_36.log'
|
||||||
|
...
|
||||||
|
```
|
||||||
|
|
||||||
|
When you open these files, it looks like the last line contains some binary
|
||||||
|
content. It is unclear why and how this happens. These lines are therefore
removed when reading the data, and a warning indicates how many
files were affected.
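
A minimal sketch of how such a cleaning step could look. The helper `read_log_lines()` and the criterion used to detect a corrupt last line (not valid UTF-8) are assumptions for illustration; they are not taken from `code/functions.R`.

```r
# Hypothetical helper: read one raw log file and drop a corrupt last line.
read_log_lines <- function(path) {
  # warn = FALSE silences the "incomplete final line" warning;
  # skipNul = TRUE skips embedded NUL bytes
  lines <- readLines(path, warn = FALSE, skipNul = TRUE)
  n <- length(lines)
  # Assumed criterion: a corrupt last line is not valid UTF-8
  corrupt <- n > 0 && !validUTF8(lines[n])
  if (corrupt) lines <- lines[-n]
  list(lines = lines, corrupt = corrupt)
}

# Demo on a temporary file whose last line consists of raw bytes
tmp <- tempfile(fileext = ".log")
writeBin(c(charToRaw("line one\nline two\n"), as.raw(c(0xff, 0xfe, 0x80))), tmp)

res <- lapply(tmp, read_log_lines)
n_corrupt <- sum(vapply(res, function(r) r$corrupt, logical(1)))
if (n_corrupt > 0) {
  warning(n_corrupt, " file(s) had a corrupt last line", call. = FALSE)
}
```

Passing a vector of file paths to `lapply()` instead of the single temporary file gives the number of affected files in `n_corrupt`.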
|
||||||
|
|
||||||
|
## Units of the variables
|
||||||
|
|
||||||
|
* What unit do x and y have? Pixels? --> yes
* What unit does scale have? --> some kind of bit, does not matter when
  calculating a ratio
* Is rotation really in degrees? --> yes
* After what time interval does the table reset itself to the
  start configuration? --> PM needs to look it up
|
||||||
|
|
||||||
|
## How unclosed events are handled
|
||||||
|
|
||||||
|
## How a case is defined
|
||||||
|
|
||||||
|
* Find out whether more than one person is standing at the table?
  - A sliding window in which the number of artworks is counted? Or how far
    apart the touched artworks are?
  - One can already "see" this in the logs, but how can I extract it
    automatically? What is my definition of an
    "interaction boost"?
  - However we do it, is it done on the "event log data"?
|
||||||
|
|
||||||
|
## Additional meta data
|
||||||
|
|
||||||
|
* Enrich the log data with additional metadata? What would be interesting?

  - Metadata on artworks like name, artist, type of artwork, epoch, etc.
  - School vacations and holidays
  - Special exhibits at the museum
  - Number of visitors per day (follow up with Sven again?)
|
||||||
|
- Age structure of visitors per day?
|
||||||
|
- ... ????
|
||||||
|
|
||||||
|
# Problems and how I handled them
|
||||||
|
|
||||||
|
```{r, include = FALSE}
|
||||||
|
# setwd("C:/Users/nwickelmaier/Nextcloud/Documents/MDS/2023ss/60100_master_thesis")
|
||||||
|
source("code/functions.R")
|
||||||
|
```
|
||||||
|
|
||||||
|
This lists some problems with the log data that required decisions. These
|
||||||
|
decisions influence the outcome and maybe even the data quality. Hence, I
|
||||||
|
tried to document how I handled these problems and explain the decisions I
|
||||||
|
made.
|
||||||
|
|
||||||
|
## Weird behavior of `timeMs` and neg. `duration` values
|
||||||
|
|
||||||
|
I think the negative duration values occur when an event starts in one
log file and completes in another one. The variable `timeMs` seems to be
continuous within one log file but not across log files.
|
||||||
|
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
# Read data
|
||||||
|
dat0 <- read.table("data/rawdata_logfiles_small.csv", sep = ";",
|
||||||
|
header = TRUE)
|
||||||
|
dat0$date <- as.POSIXct(dat0$date)
|
||||||
|
dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
|
||||||
|
|
||||||
|
# Remove irrelevant events
|
||||||
|
dat <- subset(dat0, !(dat0$event %in% c("Start Application",
|
||||||
|
"Show Application")))
|
||||||
|
|
||||||
|
# Add trace variable
|
||||||
|
dat1 <- add_trace(dat, glossar_dict = "data/glossar_dict.RData")
|
||||||
|
|
||||||
|
# Close events
|
||||||
|
dat2 <- rbind(close_events(dat1, "move"),
|
||||||
|
close_events(dat1, "flipCard"),
|
||||||
|
close_events(dat1, "openTopic"),
|
||||||
|
close_events(dat1, "openPopup"))
|
||||||
|
dat2 <- dat2[order(dat2$date.start, dat2$fileId.start), ]
|
||||||
|
|
||||||
|
head(dat2[which(dat2$duration < 0),
|
||||||
|
c("fileId.start", "fileId.stop", "event", "artwork", "duration")], 20)
|
||||||
|
|
||||||
|
head(dat2[which(dat2$fileId.start != dat2$fileId.stop),
|
||||||
|
c("fileId.start", "fileId.stop", "event", "artwork", "duration")], 20)
|
||||||
|
|
||||||
|
plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
|
||||||
|
|
||||||
|
# Remove durations when event spans more than one log file, since they are
|
||||||
|
# not interpretable
|
||||||
|
#dat2[which(dat2$fileId.start != dat2$fileId.stop), "duration"] <- NA
|
||||||
|
```
|
||||||
|
|
||||||
|
The boxplot shows that we have a continuous range of values within one log
file but that `timeMs` does not increase across log files. Since it does not
seem possible to fix this in a consistent way, I set all durations to `NA` where
`fileId.start` and `fileId.stop` are not identical. I kept `timeMs.start`
and `timeMs.stop` as well as `fileId.start` and `fileId.stop` in the data
frame, so it is clear why there are no durations.
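
The rule can be illustrated on a toy data frame. This is only a sketch: the values are made up, and computing `duration` as `timeMs.stop - timeMs.start` is an assumption about how the durations are derived.

```r
# Toy events using the column names from this document (values invented)
ev <- data.frame(
  fileId.start = c("a.log", "a.log", "b.log"),
  fileId.stop  = c("a.log", "b.log", "b.log"),
  timeMs.start = c(100, 599000, 20),
  timeMs.stop  = c(250, 300, 90),
  stringsAsFactors = FALSE
)

# Assumed definition of the duration in milliseconds
ev$duration <- ev$timeMs.stop - ev$timeMs.start

# Durations that span two log files are not interpretable
ev$duration[ev$fileId.start != ev$fileId.stop] <- NA
```

The second toy event starts in `a.log` and stops in `b.log`; its negative raw duration is replaced by `NA`, while the start and stop values stay in the data frame.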
|
||||||
|
|
||||||
|
NOTE: Part of this problem was that the time stamps in the log
file names are not zero-left-padded, so the files were not in the
correct order when read into R. After zero-left-padding these file IDs and
sorting by them and then by `date.start` within each file, some of the
durations are fixed. Unfortunately, only three `move` events were fixed, since
this only fixed irregularities *within* one log file. See below for more
details.
|
||||||
|
|
||||||
|
## Left padding of file IDs
|
||||||
|
|
||||||
|
The file names of the raw log files are automatically generated and contain
|
||||||
|
a time stamp. This time stamp is not well formed. First, it contains an
|
||||||
|
incorrect month. The months go from 0 to 11, which means that the file
`2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm.
Another problem is that the file names are not zero-left-padded, e.g.,
`2016_11_15-12_2_57.log`. This file was collected on December 15, 2016 at
12:02 pm and therefore before the file above. But most sorting algorithms
will sort these files in the order shown below. In order to preprocess the
data and close events that belong together, the data need to be sorted by
events and artworks repeatedly. In order to get them back into the correct
time order, it is necessary to order them based on three variables:
`fileId.start`, `date.start` and `timeMs`. The file IDs therefore need to
sort in the correct order (again, see below for an example). I zero-left-padded
the log file names within the data frame, where they are used as an
identifier. These "file names" do not correspond exactly to the original raw
log file names. This needs to be kept in mind when doing any kind of matching
etc.
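
A minimal sketch of the zero-left-padding step. `pad_file_id()` is a hypothetical helper written for this illustration and not the function used in `code/functions.R`; it pads every single-digit number in a file name.

```r
# Hypothetical helper: zero-left-pad every single-digit number in a file name
pad_file_id <- function(x) {
  # A digit with no digit directly before or after it gets a leading "0"
  gsub("(?<![0-9])([0-9])(?![0-9])", "0\\1", x, perl = TRUE)
}

ids <- c("2016_11_15-12_12_57.log", "2016_11_15-12_2_57.log")

# method = "radix" sorts independently of the locale
raw_order    <- sort(ids, method = "radix")
padded_order <- sort(pad_file_id(ids), method = "radix")
```

In `raw_order` the 12:12 file comes before the 12:02 file; in `padded_order` the files are in chronological order.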
|
||||||
|
|
||||||
|
```
|
||||||
|
## what it looked like before left padding
|
||||||
|
# 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
|
||||||
|
# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
|
||||||
|
# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
|
||||||
|
# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
|
||||||
|
# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
|
||||||
|
# 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
|
||||||
|
|
||||||
|
## what it looks like now
|
||||||
|
# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
|
||||||
|
# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
|
||||||
|
# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
|
||||||
|
# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
|
||||||
|
# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
|
||||||
|
# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
|
||||||
|
```
|
||||||
|
|
||||||
|
## Timestamps repeat
|
||||||
|
|
||||||
|
The time stamps in the `date` variable record year, month, day, hour,
|
||||||
|
minute and second. Since one second is not a very short time interval for
a move on a touch display, this is not fine-grained enough to bring events
into the correct order: there are events from the same log file
with the same time stamp and even events from different log files with
the same time stamp. The log files get written about every 10 minutes
(which can easily be seen when looking at the file names of the raw log
files). So in order to get events into the correct order, it is necessary to
first order by file ID, then within file ID by the time stamp `date`, and
then within these coarser time stamps by `timeMs`. But as
explained above, `timeMs` can only be sorted within one file ID, since the
values do not increase consistently across log files but have a new offset for
each raw log file.
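
The three-level ordering can be sketched with `order()` on a toy data frame. The rows are invented; only the column names follow this document.

```r
# Toy events (values invented), deliberately in the wrong order
ev <- data.frame(
  fileId = c("b.log", "a.log", "a.log", "a.log"),
  date   = as.POSIXct(c("2016-12-15 12:12:57", "2016-12-15 12:12:57",
                        "2016-12-15 12:12:56", "2016-12-15 12:12:57"),
                      tz = "UTC"),
  timeMs = c(621, 599916, 599671, 599800),
  stringsAsFactors = FALSE
)

# Order by file ID first, then by second, then by the millisecond counter
ev <- ev[order(ev$fileId, ev$date, ev$timeMs), ]
```

Sorting by `timeMs` alone would put the event from `b.log` first, because its counter restarted at a small value.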
|
||||||
|
|
||||||
|
## x,y-coordinates outside of display range
|
||||||
|
|
||||||
|
The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160
|
||||||
|
pixels. When you plot the start and stop coordinates, the display is
clearly distinguishable. However, a lot of points are outside of the display
range. This can happen when the art objects are scaled and then moved to
the very edge of the table. The log then records pixels outside of the
table. These are actually valid data points and I will leave them as is.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
par(mfrow = c(1, 2))
|
||||||
|
plot(y.start ~ x.start, dat2)
|
||||||
|
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
|
||||||
|
plot(y.stop ~ x.stop, dat2)
|
||||||
|
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
|
||||||
|
|
||||||
|
aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, dat2, mean)
|
||||||
|
```
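
To quantify how many points lie outside the display, a check along the following lines could be used. This is a sketch on invented coordinates; `outside_display()` is a hypothetical helper and the toy values are not taken from the data.

```r
# Hypothetical helper: TRUE for coordinates outside the 3840 x 2160 display
outside_display <- function(x, y, width = 3840, height = 2160) {
  x < 0 | x > width | y < 0 | y > height
}

# Toy start coordinates (invented)
toy <- data.frame(x.start = c(100, -50, 3900, 2000),
                  y.start = c(200, 300, 1000, 2500))

toy$outside <- outside_display(toy$x.start, toy$y.start)

# Share of start points outside the display
share_outside <- mean(toy$outside)
```

Since the points are valid, the flag is only meant for description, not for filtering.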
|
||||||
|
|
||||||
|
## Pop-ups from glossar cannot be assigned to a specific artwork
|
||||||
|
|
||||||
|
All the information, pictures and texts for the topics and pop-ups are
|
||||||
|
stored in
|
||||||
|
`/Logfiles/ContentEyevisit/eyevisit_cards_light/<artwork_number>`. Among
|
||||||
|
other things, each folder contains XML-files with the information about any
|
||||||
|
technical terms that can be opened from the hypertexts on the topic cards.
|
||||||
|
Often this information is artwork-dependent and then the corresponding
XML-file is in the folder for this artwork. Sometimes, however, more
general terms can be opened. In order to avoid multiple files containing
the same information, these were stored in a folder called `glossar` and
get accessed from there. The raw log files only contain the path to this
glossar entry and do not record from which artwork it was accessed. I
|
||||||
|
tried to assign these glossar entries to the correct artworks. The (very
|
||||||
|
heuristic) approach was this:
|
||||||
|
|
||||||
|
1. Create a lookup table with all XML-file names (possible pop-ups) from
|
||||||
|
the glossar folder and what artworks possibly call them. This was stored
|
||||||
|
as an `RData` object for easier handling but should maybe be stored in a
|
||||||
|
more interoperable format.
|
||||||
|
|
||||||
|
2. I went through all possible pop-ups in this lookup table and stored the
|
||||||
|
artworks that are associated with it.
|
||||||
|
|
||||||
|
3. I created a sub data frame without move events (since they can never be
|
||||||
|
associated with a pop-up) and went through every line and looked up if
|
||||||
|
an artwork and a topic card had been opened. If this was the case and a
|
||||||
|
glossar entry came up before the artwork was closed again, I assigned
|
||||||
|
this artwork to this glossar entry.
|
||||||
|
|
||||||
|
This is heuristic since it is possible that several topic cards from
|
||||||
|
different artworks are opened simultaneously and the glossar pop-up could
|
||||||
|
be opened from either one (it could even be more than two, of course). In
|
||||||
|
these cases the artwork that was opened closest to the glossar pop-up has
|
||||||
|
been assigned, but this can never be completely error free.
|
||||||
|
|
||||||
|
In addition, this heuristic only assigns a little more than half of the glossar
entries. Since it only looks at the last artwork that has been
opened and checks whether this artwork is a possible candidate, it misses all
glossar pop-ups where another artwork has been opened in between. Writing a
more elaborate algorithm is still an open TODO.
|
||||||
|
|
||||||
|
All glossar pop-ups that do not get matched with an artwork are removed
|
||||||
|
from the data set with a warning.
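
The "last opened artwork" idea can be sketched as follows. Everything here is an assumption for illustration: the helper `assign_glossar()`, the toy events, and the candidate artworks in `glossar_lookup` are made up, and the actual logic in `add_trace()` may differ.

```r
# Toy lookup table: glossar pop-up -> artworks that can call it (invented)
glossar_lookup <- list("sommerbau.xml" = c("501", "502"))

# Toy events without moves (invented)
ev <- data.frame(
  event   = c("Show Info", "ShowPopup", "Show Front", "Show Info", "ShowPopup"),
  artwork = c("501", "glossar", "501", "051", "glossar"),
  popup   = c(NA, "sommerbau.xml", NA, NA, "sommerbau.xml"),
  stringsAsFactors = FALSE
)

# Hypothetical helper implementing the heuristic
assign_glossar <- function(ev, lookup) {
  last_opened <- NA_character_
  for (i in seq_len(nrow(ev))) {
    if (ev$artwork[i] != "glossar") {
      # Track which artwork card is currently flipped open
      if (ev$event[i] == "Show Info") last_opened <- ev$artwork[i]
      if (ev$event[i] == "Show Front") last_opened <- NA_character_
      next
    }
    candidates <- lookup[[ev$popup[i]]]
    matched <- !is.na(last_opened) && last_opened %in% candidates
    # Assign the artwork only if it can call this glossar entry
    ev$artwork[i] <- if (matched) last_opened else NA_character_
  }
  ev
}

res <- assign_glossar(ev, glossar_lookup)
n_unmatched <- sum(is.na(res$artwork))
res <- res[!is.na(res$artwork), ]  # unmatched glossar pop-ups are dropped
```

The second pop-up in the toy data stays unmatched because the last opened artwork, "051", is not a candidate for this entry.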
|
||||||
|
|
||||||
|
## Assign a `case` variable based on "time heuristic"
|
||||||
|
|
||||||
|
One thing needed in order to work with the data set and use it for machine
|
||||||
|
learning algorithms like process mining is a variable that tries to
|
||||||
|
identify a case. A case variable will structure the data frame in a way
|
||||||
|
that navigation behavior can actually be investigated. However, we do not
|
||||||
|
know if several people are standing around the table interacting with it or
|
||||||
|
just one very active person. The simplest way to define a case variable is
|
||||||
|
to just use a time limit between events. This means that when the table has
not been interacted with for, e.g., 20 seconds, it is assumed that a
person moved on and a new person started interacting with the table. This
is the easiest heuristic and the one implemented at the moment. Process mining
shows that this simple approach works in the sense that the correct process
gets extracted by the algorithm.
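
A minimal sketch of such a time heuristic, assuming the events are already in time order and the gap is measured between consecutive time stamps. `assign_case()` is a hypothetical helper, not the implementation in `code/functions.R`.

```r
# Hypothetical helper: a new case starts whenever the gap to the previous
# event exceeds `limit` seconds
assign_case <- function(date, limit = 20) {
  gap <- c(0, diff(as.numeric(date)))  # seconds since the previous event
  cumsum(gap > limit) + 1
}

# Toy time stamps (invented): gaps of 5, 7, 28, 5 and 55 seconds
date <- as.POSIXct("2016-12-15 12:00:00", tz = "UTC") + c(0, 5, 12, 40, 45, 100)
case <- assign_case(date)
```

With the 20-second limit the six toy events fall into three cases, because two of the gaps exceed the limit.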
|
||||||
|
|
||||||
|
In order to investigate user behavior on a more fine grained level, it will
|
||||||
|
be necessary to come up with a more elaborate approach. A better, still
|
||||||
|
simple approach could be to use this kind of time limit and additionally
|
||||||
|
look at the distance between artworks interacted with within one time
|
||||||
|
window. When artworks are far apart it seems plausible that more than one
|
||||||
|
person interacted with them. Very short time lapses between events on
|
||||||
|
different artworks could also be an indicator that more than one person is
|
||||||
|
interacting with the table.
|
||||||
|
|
||||||
|
## Assign a `trace` variable
|
||||||
|
|
||||||
|
The `trace` variable is supposed to capture one interaction trace with one
artwork, meaning it starts when an artwork is touched or flipped and stops
when it is closed again. It is easy to assign a trace from flipping a card,
through opening (maybe several) topics and pop-ups for this artwork card, until
closing this card again. But one would also like to assign the same trace to
move events surrounding this interaction. Again, this is not possible in an
exact algorithmic way but only heuristically. I used the `case` variable in
order to get meaningful units around the artworks.
|
||||||
|
|
||||||
|
If within one case only a single trace for a single artwork was opened, I
|
||||||
|
assigned this trace to the moves associated with this artwork. It (quite
often) happens that within one case one artwork is opened and closed
several times, each time starting a new trace. I then assigned all the
following move events to the preceding trace. This is, of course,
arbitrary and could also be handled the other way around.
|
||||||
|
|
||||||
|
Another possibility is that an artwork gets moved within one trace without
|
||||||
|
being flipped. I then assigned a new trace to this move.
|
||||||
|
|
||||||
|
Overall, this worked very well even though it is based on the very
heuristic approach of assigning a new case when the table has not been touched
for 20 seconds. It should be kept in mind that the trace assignments for the
moves will change when `case` is defined in a different way.
|
||||||
|
|
||||||
|
## A `move` event does not record any change
|
||||||
|
|
||||||
|
Most of the events in the log files are move events. Many of
these move events are recorded without indicating any change, meaning
the only difference is the time stamp. All other variables indicating moves,
like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc., do
not show any change. These events represent about 2/3 of all move events. They
are probably short touches of the table without an actual
interaction. They were therefore removed from the data set.
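
A sketch of how such a filter could look on a toy data frame. The values are invented, and the column names `y.start`, `y.stop`, `scale.start` and `scale.stop` are assumed by analogy with the `x` and `rotation` pairs named above.

```r
# Toy move events (values invented)
mv <- data.frame(
  event = "move",
  x.start = c(100, 100, 300), x.stop = c(100, 150, 300),
  y.start = c(50, 50, 80),    y.stop = c(50, 50, 80),
  scale.start = c(0.3, 0.3, 0.3), scale.stop = c(0.3, 0.3, 0.31),
  rotation.start = c(13.2, 13.2, 0), rotation.stop = c(13.2, 13.2, 0),
  stringsAsFactors = FALSE
)

# A move records no change if position, scale and rotation are all identical
no_change <- with(mv, x.start == x.stop & y.start == y.stop &
                      scale.start == scale.stop &
                      rotation.start == rotation.stop)

mv <- mv[!no_change, ]
```

Only the first toy event is dropped: the second changes its x-coordinate and the third changes its scale.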
|
||||||
|
|
||||||
|
## Events that only close (`date.start` is NA)
|
||||||
|
|
||||||
|
It looks like there is some kind of log error for the events that do not
have a start. I was able to get rid of most by sorting by `popup` for
the openPopup events, but there are still some left (50 for the small data
set, which corresponds to 0.2 per mille). The following example shows that
artwork "501" gets closed (line 31030) while the pop-up `sommerbau.xml`
is still open (line 31027). Then artwork "501" gets opened again
(line 31035) and after that the pop-up `sommerbau.xml` is closed (line
31040). This should not be possible and therefore (correctly) two events
are assigned: one where the pop-up was opened and then not closed (which is
common) and another one where the pop-up has no start.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
dat[31000:31019,]
|
||||||
|
# Card gets flipped closed before pop-up closes --> log error!
|
||||||
|
```
|
||||||
|
|
||||||
|
I did not check all of these cases (for the complete data set this is
|
||||||
|
simply not possible by hand) but just excluded all events that do not have
|
||||||
|
a `date.start` since they are hard to interpret. Often they are log errors
|
||||||
|
but in some cases they might be resolvable.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
# remove all events that do not have a `date.start`
|
||||||
|
dim(dat2[is.na(dat2$date.start), ])
|
||||||
|
dat2 <- dat2[!is.na(dat2$date.start), ]
|
||||||
|
```
|
||||||
|
|
||||||
|
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
|
||||||
|
|
||||||
|
See `questions_number-of-cards.R` for more details.
|
||||||
|
|
||||||
|
I wrote a function that for each artwork extracts the file names of the
|
||||||
|
possible topic cards and then looks up which topics have actually been
|
||||||
|
displayed on the back of the card. I added an index giving the ordering in
|
||||||
|
the index files.
|
||||||
|
|
||||||
|
The possible values in the variable `topicNumber` range from 0 to 7;
however, no artwork has more than six different numbers. So I just renumbered
those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
assign topics and file names to the corresponding pop-ups. This needs to be
cross-checked with the programming, but seems the most plausible approach
with my current knowledge.
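
The renumbering can be sketched as a rank mapping within each artwork. `renumber_topics()` and the toy data are assumptions for illustration, not the function described above.

```r
# Hypothetical helper: map the observed topic numbers to 1, 2, ... in
# ascending order
renumber_topics <- function(x) match(x, sort(unique(x)))

# The example from the text, with two repeated values
renumbered <- renumber_topics(c(0, 1, 2, 4, 5, 6, 4, 0))

# Applied per artwork on a toy data frame (values invented)
toy <- data.frame(artwork = c("031", "031", "031", "076", "076"),
                  topicNumber = c(0, 4, 6, 1, 7),
                  stringsAsFactors = FALSE)
toy$topicIndex <- ave(toy$topicNumber, toy$artwork, FUN = renumber_topics)
```

Because the mapping is computed per artwork, the same raw `topicNumber` can get a different index for different artworks.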
|
||||||
|
|
||||||
|
## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
|
||||||
|
|
||||||
|
When I extract the topics from `index.xml`, I get different topics than
when I get them from `<artwork_number>.xml`. At first glance, it looks like
using `index.xml` actually gives the wrong results.
|
||||||
|
|
||||||
|
```{r}
|
||||||
|
artworks <- unique(dat2$artwork)
|
||||||
|
path <- "data/ContentEyevisit/eyevisit_cards_light/"
|
||||||
|
topics <- extract_topics(artworks, "index.xml", path)
|
||||||
|
topics2 <- extract_topics(artworks, paste0(artworks, ".xml"), path)
|
||||||
|
|
||||||
|
topics[!topics$file_name %in% topics2$file_name, ]
|
||||||
|
topics2[!topics2$file_name %in% topics$file_name, ]
|
||||||
|
```
|
||||||
|
|
||||||
|
For artwork "031", `index.xml` only defines 5 cards (the 6th is commented
out), but `topicNumber` for this artwork has 6 different entries. I will
therefore extract the topics from `<artwork_number>.xml`. (This also seems more
compatible with other data sets like 8o8m.)
|
||||||
|
|
||||||
|
# Reading list
|
||||||
|
|
||||||
|
* @Arizmendi2022 [--]
|
||||||
|
* @Bannert2014 [x]
|
||||||
|
* @Bousbia2010 [--]
|
||||||
|
* @Cerezo2020
|
||||||
|
* @GerjetsSchwan2021 [x]
|
||||||
|
* @Goldhammer2020
|
||||||
|
* @Guenther2007
|
||||||
|
* @HuberBannert2023 [x]
|
||||||
|
* @Kroehne2018
|
||||||
|
* @SchwanGerjets2021 [x]
|
||||||
|
* @vanderAalst2016 [Chap. 2, x]
|
||||||
|
* @vanderAalst2016 [Chap. 3]
|
||||||
|
* @vanderAalst2016 [Chap. 5, x]
|
||||||
|
* @Wang2019
|
||||||
|
|
||||||
|
# Open stuff
|
||||||
|
|
||||||
|
* Angle from which people approach table in Braunschweig? Consider in
|
||||||
|
rotation variable?
|
||||||
|
* Time limit for `case` variable different for different events? (openTopic
|
||||||
|
should be opened the longest)
|
||||||
|
|
||||||
|
$\to$ I think this is not relevant since I am looking at time *between*
|
||||||
|
events!
|
||||||
|
|
||||||
|
# Stuff AK found interesting
|
||||||
|
|
||||||
|
* Pre/post corona
|
||||||
|
* Identify school classes
|
||||||
|
* How many persons are present at the table?
|
||||||
|
|
||||||
|
# Other potential questions
|
||||||
|
|
||||||
|
* "Bursts"
|
||||||
|
* 1st vs. 2nd half of the day
|
||||||
|
* Can we identify "types of art"? With clustering or something?
|
||||||
|
* Possible to estimate how many persons per day? Maybe average of certain
|
||||||
|
weekdays? ... ?
|
||||||
|
|
321
README.md
@ -1,321 +0,0 @@
|
|||||||
# Open questions

## Understanding the data

* What unit do x and y have? Pixels? --> yes
* What unit does scale have? --> some kind of bit, does not matter when
  calculating a ratio
* Is rotation really in degrees? --> yes
* After what time interval does the table reset itself to the
  start configuration? --> PM needs to look it up
|
|
||||||
|
|
||||||
## Table software

* Is there documentation for the pictures that goes beyond the xml files?
  Something like a manual or similar?
* Is there possibly still a tablet with the application on it somewhere?
* What do the colors of the topic cards mean? --> can be seen in the xml files
|
|
||||||
|
|
||||||
## Event logs

* How do we deal with "unclosed" events? Simply delete them?
  - for Transform I tend towards yes, because otherwise totally uninteresting
  - for flipCard I am not so sure... But then no duration can be
    calculated, it would be NA
* Moves/scales/rotations without any change I would definitely
  delete
* It is not possible (or I do not know how) to identify events that belong
  together in a one-to-one way
  - proceed heuristically? Simply drop duplicated Transform start and stop?
  - The data are not "error free"; there are, e.g., Transform events where
    the end was not logged
* How do I identify an "interaction unit"?
  - What is a "case"?
  - Rather roughly via time intervals?
  - Any other idea?
* Find out whether more than one person is standing at the table?
  - A sliding window in which the number of artworks is counted? Or how far
    apart the touched artworks are?
  - One can already "see" this in the logs, but how can I extract it
    automatically? What is my definition of an
    "interaction boost"?
  - However we do it, is it done on the "event log data"?
* Enrich the log data with additional metadata? What would be interesting?
  - Metadata on artworks like name, artist, type of artwork, epoch, etc.
  - School vacations and holidays
  - Special exhibits at the museum
  - Number of visitors per day
  - Age structure of visitors per day?
  - ... ????
|
|
||||||
|
|
||||||
## HAUM
|
|
||||||
|
|
||||||
* Follow up with Sven again about visitor numbers?
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
# Problems and how I handled them
|
|
||||||
|
|
||||||
This lists some problems with the log data that required decisions. These
|
|
||||||
decisions influence the outcome and maybe even the data quality. Hence, I
|
|
||||||
tried to document how I handled these problems and explain the decisions I
|
|
||||||
made.
|
|
||||||
|
|
||||||
## Weird behavior of `time_ms` and neg. `duration` values
|
|
||||||
|
|
||||||
I think the negative duration values happen, when an event starts in one
|
|
||||||
log file and completes in another one. The variable `time_ms` seems to be
|
|
||||||
continuous within one log file but not over several log files.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
dat_all[which(dat_all$duration < 0), ][1:5, 1:10]
|
|
||||||
|
|
||||||
# flipCard
|
|
||||||
## trace 56
|
|
||||||
dat3[dat3$trace == 56,]
|
|
||||||
|
|
||||||
dat[dat$fileid == "2016_11_15-11_12_57.log" & dat$date == "2016-12-15 11:17:26", ]
|
|
||||||
dat[dat$fileid == "2016_11_15-11_42_57.log" & dat$date == "2016-12-15 11:46:19", ]
|
|
||||||
|
|
||||||
#dat[309:1405, ]
|
|
||||||
|
|
||||||
tmp <- dat[300:1405, ]
|
|
||||||
tmp[tmp$artwork == "051", ]
|
|
||||||
## -> was closed correctly, but does it belong together?
|
|
||||||
|
|
||||||
|
|
||||||
## trace 61
|
|
||||||
dat3[dat3$trace == 61,]
|
|
||||||
|
|
||||||
dat[dat$fileid == "2016_11_15-11_12_57.log" & dat$date == "2016-12-15 11:17:52", ]
|
|
||||||
dat[dat$fileid == "2016_11_15-11_42_57.log" & dat$date == "2016-12-15 11:46:19", ]
|
|
||||||
|
|
||||||
tmp <- dat[350:1408, ]
|
|
||||||
tmp[tmp$artwork == "057", ]
|
|
||||||
## -> was closed correctly, but does it belong together?
|
|
||||||
|
|
||||||
|
|
||||||
# openTopic
|
|
||||||
dat_all[which(dat_all$duration < 0), ][100:105, 1:10]
|
|
||||||
|
|
||||||
# trace 2052
|
|
||||||
dat4[dat4$trace == 2052,]
|
|
||||||
|
|
||||||
dat[dat$fileid == "2016_11_17-14_12_10.log" & dat$date == "2016-12-17 14:21:51", ]
|
|
||||||
dat[dat$fileid == "2016_11_17-14_22_10.log" & dat$date == "2016-12-17 14:22:25", ]
|
|
||||||
|
|
||||||
tmp <- dat[23801:23950, ]
|
|
||||||
tmp[tmp$artwork == "502", ]
|
|
||||||
|
|
||||||
plot(time_ms ~ as.factor(fileid), dat[1:5000,])
|
|
||||||
```
|
|
||||||
|
|
||||||
The boxplot shows that we have a continuous range of values within one log
|
|
||||||
file but that `time_ms` does not increase over log files.
|
|
||||||
<!--
|
|
||||||
TODO: I will probably update how events are closed, and the names of these
data frames, especially `dat3` and `dat4`, will have to be adjusted.
|
|
||||||
-->
|
|
||||||
Since it does not seem possible to fix this in a consistent way, I will set
negative durations to `NA`. I will keep `time_ms.start` and `time_ms.stop`
in the data frame, so it is clear why there are no durations. Maybe it
would also be useful to keep `logfileid.start` and `logfileid.stop` in the
data? Maybe just for verifying this theory...
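Setting negative durations to `NA` while keeping the start and stop values
could look roughly like the sketch below. The data frame `events` and its
values are made up for illustration; only the column names `time_ms.start`,
`time_ms.stop` and `duration` are taken from the text above.

```r
# Toy event table (hypothetical values, not taken from the real data)
events <- data.frame(
  event         = c("move", "flipCard", "openTopic"),
  time_ms.start = c(599671, 1200, 580000),
  time_ms.stop  = c(599916, 4300, 2500)
)

# Duration is the difference between stop and start within a log file
events$duration <- events$time_ms.stop - events$time_ms.start

# A negative duration is not interpretable, so it is set to NA;
# time_ms.start and time_ms.stop stay in the data frame
events$duration[which(events$duration < 0)] <- NA

events
```

The third toy event stops at a smaller `time_ms` than it started with, so
its duration becomes `NA` while both time stamps remain visible.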
|
|
||||||
|
|
||||||
Part of the problem was that the timestamps in the log file names are not
left-padded with zeros. Padding them fixed only three `move` events,
however, since it only fixed irregularities *within* one log file.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
table(dat_all[dat_all$duration < 0, "event"])
|
|
||||||
|
|
||||||
# flipCard move openPopup openTopic
|
|
||||||
# 562 100 34 284
|
|
||||||
|
|
||||||
|
|
||||||
dat[dat$event %in% c("Transform start", "Transform stop"), ][1100:1300,]
|
|
||||||
# --> got fixed by left padding... but only three all together!!
|
|
||||||
|
|
||||||
dat_all[735, ]
|
|
||||||
|
|
||||||
## what it looked like before left padding
|
|
||||||
# 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
|
|
||||||
# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
|
|
||||||
# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
|
|
||||||
# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
|
|
||||||
# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
|
|
||||||
# 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
|
|
||||||
|
|
||||||
## what it looks like now
|
|
||||||
# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
|
|
||||||
# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
|
|
||||||
# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
|
|
||||||
# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
|
|
||||||
# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
|
|
||||||
# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
|
|
||||||
```
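The left padding of file names can be sketched as follows. This is only an
illustration: `pad_log_name()` is a hypothetical helper written for this
example and is not the `leftpad_fnames()` function used in the
preprocessing script, which may work differently. It assumes that it is
applied to the bare file name, not to the full path.

```r
# Hypothetical helper: put a zero in front of every digit that stands alone,
# i.e. that has no other digit directly before or after it
pad_log_name <- function(x) {
  gsub("(?<![0-9])([0-9])(?![0-9])", "0\\1", x, perl = TRUE)
}

fnames <- c("2016_11_15-12_2_57.log", "2016_11_15-12_12_57.log")
padded <- pad_log_name(fnames)

# With padding, sorting by name agrees with the chronological order
sort(padded, method = "radix")
```

Without the padding, `12_2_57` sorts after `12_12_57`, which is how lines
from the earlier file ended up below lines from the later file.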
|
|
||||||
|
|
||||||
`time_ms` does not increase from log file to log file:
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
tmp1 <- dat[!duplicated(dat$fileid), c("fileid", "time_ms", "event")]
|
|
||||||
tmp2 <- dat[!duplicated(dat$fileid, fromLast = TRUE), c("fileid", "time_ms", "event")]
|
|
||||||
tmp <- rbind(tmp1, tmp2)
|
|
||||||
tmp <- tmp[order(tmp$fileid), ]
|
|
||||||
head(tmp, 50)
|
|
||||||
|
|
||||||
plot(time_ms ~ as.factor(fileid), dat[1:2000, ], xlab = "fileid")
|
|
||||||
```
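The same observation can be checked without a plot. The sketch below uses a
small made-up data frame `toy` with the columns `fileid` and `time_ms`; some
values are borrowed from the excerpt above, the rest are invented. The names
`within_ok`, `first_ms`, `last_ms` and `restarts` are only used here.

```r
# Toy data: two log files with three lines each (partly invented values)
toy <- data.frame(
  fileid  = rep(c("2016_11_15-12_02_57.log", "2016_11_15-12_12_57.log"),
                each = 3),
  time_ms = c(598000, 599671, 599916, 621, 677, 774)
)

# TRUE for a file if time_ms never decreases within that file
within_ok <- tapply(toy$time_ms, toy$fileid, function(x) all(diff(x) >= 0))

# First and last time_ms value of each file
first_ms <- tapply(toy$time_ms, toy$fileid, function(x) x[1])
last_ms  <- tapply(toy$time_ms, toy$fileid, function(x) x[length(x)])

# TRUE if a file starts with a smaller time_ms than the previous file ended
restarts <- first_ms[-1] < last_ms[-length(last_ms)]

within_ok
restarts
```

In the toy data, `time_ms` is ordered within each file but starts over in
the second file, which is the pattern that leads to negative durations for
events spanning two files.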
|
|
||||||
|
|
||||||
## x,y-coordinates outside of display range
|
|
||||||
|
|
||||||
The display is a 4K display with 3840 x 2160 pixels. When you plot the
start and stop coordinates, the display area is clearly distinguishable.
However, many points lie outside of the display range. This can happen when
the art objects are scaled and then moved to the very edge of the table;
the log then records pixels outside of the table. These are actually valid
data points and I will leave them as is.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
par(mfrow = c(1, 2))
|
|
||||||
plot(y.start ~ x.start, dat)
|
|
||||||
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
|
|
||||||
plot(y.stop ~ x.stop, dat)
|
|
||||||
abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
|
|
||||||
|
|
||||||
|
|
||||||
aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, dat, mean)
|
|
||||||
```
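To get a number for "a lot of points", the share of coordinates outside the
display range could be computed along these lines. The data frame `pts` is
made up for this sketch; with the real data one would use `x.start` and
`y.start` (or the stop coordinates) from `dat` instead.

```r
# Display size as stated in the text
width  <- 3840
height <- 2160

# Toy coordinates (hypothetical values)
pts <- data.frame(
  x.start = c(100, -50, 3900, 2000),
  y.start = c(200, 1000, 500, 2300)
)

# A point is outside if x or y leaves the display range
outside <- pts$x.start < 0 | pts$x.start > width |
           pts$y.start < 0 | pts$y.start > height

# Proportion of points outside the display
mean(outside)
```

Since these points are considered valid, the flag would only be used for
description, not for filtering.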
|
|
||||||
|
|
||||||
|
|
||||||
## Timestamps repeat
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## Popups from glossar cannot be assigned to a specific artwork
|
|
||||||
|
|
||||||
|
|
||||||
## Assign a case variable based on "time heuristic"
|
|
||||||
|
|
||||||
## A `move` event does not record any change
|
|
||||||
|
|
||||||
## Add moves to `trace` variable
|
|
||||||
|
|
||||||
## openPopup does not close correctly
|
|
||||||
|
|
||||||
The sorting had to include `popup`; otherwise, nested events could not be
closed correctly.
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
# TODO: Some correct entries are not closed:
|
|
||||||
df[df$trace == 1843, ]
|
|
||||||
# WHY???
|
|
||||||
# --> Wrong eventid!
|
|
||||||
dat5[dat5$trace == 1843, ]
|
|
||||||
openPopup_wide[openPopup_wide$trace == 1843, ]
|
|
||||||
```
|
|
||||||
## Events that only close (`date.start` is NA)
|
|
||||||
|
|
||||||
It looks like there is some kind of log error for the events that do not
have a start. I was able to get rid of most of them by sorting by `popup`
for the openPopup events, but there are still some left (50 for the small
data set, which corresponds to 0.2 per mille).
|
|
||||||
|
|
||||||
```{r}
|
|
||||||
# remove all events that do not have a `date.start`
|
|
||||||
dim(dat_all[is.na(dat_all$date.start), ])
|
|
||||||
dat_all <- dat_all[!is.na(dat_all$date.start), ]
|
|
||||||
# TODO: Find out how it can be that there is only a `date.stop`
|
|
||||||
## --> happens, when event is not properly closed, see here:
|
|
||||||
df[df$trace == 1843, ]
|
|
||||||
dat_openPopup[dat_openPopup$trace == 1843, ]
|
|
||||||
## --> still 50 (small data set) left, and some really do not seem to be
|
|
||||||
## opened! Must be a log error
|
|
||||||
# --> others should be closed!
|
|
||||||
dat[31000:31019,] # this one e.g.
|
|
||||||
# --> Actually NOT! card gets flipped before! Again - log error!
|
|
||||||
```
|
|
||||||
Will probably just get rid of them!
|
|
||||||
|
|
||||||
Think about whether to give warning messages about these deletions in the
functions.
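One possible way to give such a warning is sketched below. The function
`drop_unopened()` and the data frame `toy` are hypothetical and only meant
to illustrate the idea; they are not part of `functions.R`.

```r
# Hypothetical helper: remove events without `date.start` and warn about it
drop_unopened <- function(data) {
  unopened <- is.na(data$date.start)
  if (any(unopened)) {
    warning(sum(unopened), " event(s) without `date.start` removed",
            call. = FALSE)
  }
  data[!unopened, , drop = FALSE]
}

# Toy data: the second event was never opened (made-up values)
toy <- data.frame(
  event      = c("openPopup", "openPopup", "flipCard"),
  date.start = as.POSIXct(c("2016-12-15 11:17:26", NA,
                            "2016-12-15 11:17:52"), tz = "UTC"),
  date.stop  = as.POSIXct(c("2016-12-15 11:17:30", "2016-12-15 11:46:19",
                            "2016-12-15 11:18:10"), tz = "UTC")
)

cleaned <- suppressWarnings(drop_unopened(toy))
nrow(cleaned)
```

Reporting the count in the warning keeps track of how many events are lost
each time the data are processed.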
|
|
||||||
|
|
||||||
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
|
|
||||||
|
|
||||||
See `questions_number-of-cards.R` for details.
|
|
||||||
|
|
||||||
## Extracting topics
|
|
||||||
|
|
||||||
When I extract the topics from `index.xml`, I get different topics than
when I get them from `<artwork>.xml`. At first glance, it looks like using
`index.xml` actually gives the wrong results.
|
|
||||||
|
|
||||||
```
|
|
||||||
topics <- extract_topics(artworks, "index.xml", path)
|
|
||||||
topics2 <- extract_topics(artworks, paste0(artworks, ".xml"), path)
|
|
||||||
|
|
||||||
topics[!topics$file_name %in% topics2$file_name, ]
|
|
||||||
# artwork file_name topic index
|
|
||||||
# 072 072_artist.xml artist 1
|
|
||||||
# 073 073_artist.xml artist 1
|
|
||||||
# 110 110_technik.xml technik 2
|
|
||||||
topics2[!topics2$file_name %in% topics$file_name, ]
|
|
||||||
# artwork file_name topic index
|
|
||||||
# 031 031_vergleich.xml extra info 6
|
|
||||||
# 033 033_technik.xml technik 2
|
|
||||||
# 055 055_vergleich4.xml extra info 5
|
|
||||||
# 063 063_thema3.xml thema 3
|
|
||||||
# 063 063_extrainfo1.xml thema 4
|
|
||||||
# 072 072_artist2.xml artist 1
|
|
||||||
# 073 073_artist2.xml artist 1
|
|
||||||
# 099 099_technik.xml technik 2
|
|
||||||
# 110 110_technikneu.xml technik 2
|
|
||||||
```
|
|
||||||
|
|
||||||
For artwork 031, `index.xml` only defines 5 cards (the 6th is commented
out), but `topicNumber` for this artwork has 6 different entries. I will
therefore extract the topics from `<artwork>.xml`. (This also seems more
compatible with other data sets like 8o8m.)
|
|
||||||
|
|
||||||
# Reading list
|
|
||||||
|
|
||||||
* @Arizmendi2022 [$-$]
|
|
||||||
* @Bannert2014 [x]
|
|
||||||
* @Bousbia2010 [$-$]
|
|
||||||
* @Cerezo2020
|
|
||||||
* @GerjetsSchwan2021 [x]
|
|
||||||
* @Goldhammer2020
|
|
||||||
* @Guenther2007
|
|
||||||
* @HuberBannert2023 [x]
|
|
||||||
* @Kroehne2018
|
|
||||||
* @SchwanGerjets2021 [x]
|
|
||||||
* @vanderAalst2016 [Chap. 2, x]
|
|
||||||
* @vanderAalst2016 [Chap. 3]
|
|
||||||
* @vanderAalst2016 [Chap. 5, x]
|
|
||||||
* @Wang2019
|
|
||||||
|
|
||||||
# Open stuff
|
|
||||||
|
|
||||||
* Angle from which people approach table in Braunschweig? Consider in
|
|
||||||
rotation variable?
|
|
||||||
* Time limit for `case` variable different for different events? (openTopic
|
|
||||||
should be opened the longest)
|
|
||||||
--> I think this is not relevant since I am looking at time *between*
|
|
||||||
events!
|
|
||||||
|
|
||||||
# Stuff AK found interesting
|
|
||||||
|
|
||||||
* Pre/post corona
|
|
||||||
* Identify school classes
|
|
||||||
* How many persons are present at the table?
|
|
||||||
|
|
||||||
# Other potential questions
|
|
||||||
|
|
||||||
* "Bursts"
|
|
||||||
* 1st vs. 2nd half of the day
|
|
||||||
* Can we identify "types of art"? With clustering or something?
|
|
||||||
* Possible to estimate how many persons per day? Maybe average of certain
|
|
||||||
weekdays? ... ?
|
|
||||||
|
|
@ -1,46 +1,6 @@
|
|||||||
#' ---
|
|
||||||
#' title: "Preprocessing raw log files"
|
|
||||||
#' author: "Nora Wickelmaier"
|
|
||||||
#' date: "`r Sys.Date()`"
|
|
||||||
#' output:
|
|
||||||
#' html_document:
|
|
||||||
#' default
|
|
||||||
#' pdf_document:
|
|
||||||
#' toc: true
|
|
||||||
#' number_sections: true
|
|
||||||
#' geometry: margin = 2.5cm
|
|
||||||
#' ---
|
|
||||||
|
|
||||||
# setwd("C:/Users/nwickelmaier/Nextcloud/Documents/MDS/2023ss/60100_master_thesis/code")
|
# setwd("C:/Users/nwickelmaier/Nextcloud/Documents/MDS/2023ss/60100_master_thesis/code")
|
||||||
|
|
||||||
#+ setup, include = FALSE
|
###### HELPER ######
|
||||||
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
|
|
||||||
|
|
||||||
#' The following events can be extracted from the log files:
|
|
||||||
#'
|
|
||||||
#' ```
|
|
||||||
#' LogEntry classes:
|
|
||||||
#' TRANSFORM_START: "Transform start" --> "Transformation Start" in Tool
|
|
||||||
#' TRANSFORM_STOP: "Transform stop"
|
|
||||||
#' START_APPLICATION: "Start Application"
|
|
||||||
#' SHOW_APPLICATION: "Show Application"
|
|
||||||
#' SHOW_INFO: "Show Info" --> "Flip Card" in Tool
|
|
||||||
#' SHOW_FRONT: "Show Front"
|
|
||||||
#' SHOW_POPUP: "ShowPopup" --> "Show Popup" in Tool
|
|
||||||
#' HIDE_POPUP: "HidePopup"
|
|
||||||
#' ARTWORK: "Artwork" --> "Show Topic" in Tool
|
|
||||||
#' ```
|
|
||||||
|
|
||||||
#' Choose which folders with raw log files should be included:
|
|
||||||
|
|
||||||
folders <- "all"
|
|
||||||
#folders <- "_2016b"
|
|
||||||
|
|
||||||
dirpaths <- paste0("../data/haum_logs_2016-2023/", folders)
|
|
||||||
|
|
||||||
fnames <- dir(dirpaths, pattern = "*.log", full.names = TRUE)
|
|
||||||
length(fnames)
|
|
||||||
head(fnames)
|
|
||||||
|
|
||||||
# Need to left pad file names. If I do not do this, the sorting of the
|
# Need to left pad file names. If I do not do this, the sorting of the
|
||||||
# timestamps will be off and I get negative durations later on since the
|
# timestamps will be off and I get negative durations later on since the
|
||||||
@ -70,6 +30,18 @@ leftpad_fnames <- function(x) {
|
|||||||
res
|
res
|
||||||
}
|
}
|
||||||
|
|
||||||
|
##### CONTENT ######
|
||||||
|
|
||||||
|
# Choose which folders with raw log files should be included
|
||||||
|
|
||||||
|
folders <- "all"
|
||||||
|
#folders <- "_2016b"
|
||||||
|
|
||||||
|
dirpaths <- paste0("../data/haum_logs_2016-2023/", folders)
|
||||||
|
|
||||||
|
fnames <- dir(dirpaths, pattern = "*.log", full.names = TRUE)
|
||||||
|
length(fnames)
|
||||||
|
head(fnames)
|
||||||
|
|
||||||
logs <- lapply(fnames, readLines)
|
logs <- lapply(fnames, readLines)
|
||||||
nlog <- sapply(logs, length)
|
nlog <- sapply(logs, length)
|
||||||
@ -77,31 +49,19 @@ dat <- data.frame(fileId = rep(leftpad_fnames(fnames), nlog),
|
|||||||
logs = unlist(logs))
|
logs = unlist(logs))
|
||||||
head(dat$logs)
|
head(dat$logs)
|
||||||
|
|
||||||
#' Remove corrupted lines
|
# Remove corrupted lines
|
||||||
|
|
||||||
# Warning messages:
|
|
||||||
# incomplete final line found on '_2016/2016_11_18-11_31_0.log'
|
|
||||||
# incomplete final line found on '_2016/2016_11_18-11_38_30.log'
|
|
||||||
# incomplete final line found on '_2016/2016_11_18-11_40_36.log'
|
|
||||||
# ...
|
|
||||||
|
|
||||||
## --> files have a last line that looks like a binary entry??
|
|
||||||
|
|
||||||
# From LogEntry.as:
|
|
||||||
# //pm: inserted this check to account for some broken logfiles
|
|
||||||
# if (metaData[1] == null){
|
|
||||||
# trace("corrupt line... still do not know how these came to happen.");
|
|
||||||
|
|
||||||
# corrupt lines are "" and need to be removed
|
# corrupt lines are "" and need to be removed
|
||||||
d1 <- dim(dat)[1]
|
d1 <- dim(dat)[1]
|
||||||
dat <- subset(dat, dat$logs != "")
|
dat <- subset(dat, dat$logs != "")
|
||||||
d2 <- dim(dat)[1]
|
d2 <- dim(dat)[1]
|
||||||
|
|
||||||
#' The files contain `r d1-d2` corrupt lines that were remooved from the
|
# TODO: Catch this in a function and give back a meaningful warning
|
||||||
#' data.
|
# The files contain `r d1-d2` corrupt lines that were removed from the
|
||||||
#'
|
# data.
|
||||||
|
|
||||||
#' ### Extract relevant infos
|
# Extract relevant infos
|
||||||
|
|
||||||
date <- sapply(dat$logs, gsub,
|
date <- sapply(dat$logs, gsub,
|
||||||
pattern = "^\\[(.*)\\], \\[.*$",
|
pattern = "^\\[(.*)\\], \\[.*$",
|
||||||
@ -139,8 +99,6 @@ ts_elements <- strsplit(timestamp, ":")
|
|||||||
time_ms <- as.numeric(sapply(ts_elements, function(x) x[4])) +
|
time_ms <- as.numeric(sapply(ts_elements, function(x) x[4])) +
|
||||||
as.numeric(sapply(ts_elements, function(x) x[3])) * 1000 +
|
as.numeric(sapply(ts_elements, function(x) x[3])) * 1000 +
|
||||||
as.numeric(sapply(ts_elements, function(x) x[2])) * 1000 * 60
|
as.numeric(sapply(ts_elements, function(x) x[2])) * 1000 * 60
|
||||||
# TODO: Maybe change to simple gsub()...
|
|
||||||
# --> This is theoretically sound but a lot of lines for just removing ":"
|
|
||||||
|
|
||||||
dat$date <- lubridate::parse_date_time(date, "bdyHMSOp")
|
dat$date <- lubridate::parse_date_time(date, "bdyHMSOp")
|
||||||
dat$timeMs <- time_ms
|
dat$timeMs <- time_ms
|
||||||
@ -156,17 +114,11 @@ dat$rotation <- moves[,4]
|
|||||||
dat$logs <- NULL
|
dat$logs <- NULL
|
||||||
# remove original log files from data so file becomes smaller
|
# remove original log files from data so file becomes smaller
|
||||||
|
|
||||||
str(dat)
|
# sort by fileId, since reading in by file names does not make sense
|
||||||
|
# because of missing left zero padding
|
||||||
head(dat, 20)
|
|
||||||
|
|
||||||
# sort by fileId, since reading in by file names does not make sense because of
|
|
||||||
# missing left zero padding
|
|
||||||
dat <- dat[order(dat$fileId, dat$date, dat$timeMs), ]
|
dat <- dat[order(dat$fileId, dat$date, dat$timeMs), ]
|
||||||
|
|
||||||
## TODO: Replace artwork and popup numbers with informative strings
|
# Export data
|
||||||
|
|
||||||
#' ### Save data frame
|
|
||||||
|
|
||||||
write.table(dat, "../data/rawdata_logfiles.csv",
|
write.table(dat, "../data/rawdata_logfiles.csv",
|
||||||
sep = ";", quote = FALSE, row.names = FALSE)
|
sep = ";", quote = FALSE, row.names = FALSE)
|
||||||
|
@ -2,8 +2,7 @@
|
|||||||
|
|
||||||
source("functions.R")
|
source("functions.R")
|
||||||
|
|
||||||
# Read data
|
# Read data ##############################################################
|
||||||
|
|
||||||
dat0 <- read.table("../data/rawdata_logfiles_small.csv", sep = ";",
|
dat0 <- read.table("../data/rawdata_logfiles_small.csv", sep = ";",
|
||||||
header = TRUE)
|
header = TRUE)
|
||||||
dat0$date <- as.POSIXct(dat0$date)
|
dat0$date <- as.POSIXct(dat0$date)
|
||||||
@ -13,7 +12,7 @@ dat0$glossar <- ifelse(dat0$artwork == "glossar", 1, 0)
|
|||||||
dat <- subset(dat0, !(dat0$event %in% c("Start Application",
|
dat <- subset(dat0, !(dat0$event %in% c("Start Application",
|
||||||
"Show Application")))
|
"Show Application")))
|
||||||
|
|
||||||
# Add trace variable
|
# Add trace variable #####################################################
|
||||||
dat1 <- add_trace(dat)
|
dat1 <- add_trace(dat)
|
||||||
|
|
||||||
# Close events
|
# Close events
|
||||||
@ -21,7 +20,9 @@ dat2 <- rbind(close_events(dat1, "move"),
|
|||||||
close_events(dat1, "flipCard"),
|
close_events(dat1, "flipCard"),
|
||||||
close_events(dat1, "openTopic"),
|
close_events(dat1, "openTopic"),
|
||||||
close_events(dat1, "openPopup"))
|
close_events(dat1, "openPopup"))
|
||||||
|
|
||||||
dat2 <- dat2[order(dat2$date.start, dat2$fileId.start), ]
|
dat2 <- dat2[order(dat2$date.start, dat2$fileId.start), ]
|
||||||
|
|
||||||
# Remove durations when event spans more than one log file, since they are
|
# Remove durations when event spans more than one log file, since they are
|
||||||
# not interpretable
|
# not interpretable
|
||||||
dat2[which(dat2$fileId.start != dat2$fileId.stop), "duration"] <- NA
|
dat2[which(dat2$fileId.start != dat2$fileId.stop), "duration"] <- NA
|
||||||
@ -29,13 +30,12 @@ dat2[which(dat2$fileId.start != dat2$fileId.stop), "duration"] <- NA
|
|||||||
# Remove all events that do not have a `date.start`
|
# Remove all events that do not have a `date.start`
|
||||||
dat2 <- dat2[!is.na(dat2$date.start), ]
|
dat2 <- dat2[!is.na(dat2$date.start), ]
|
||||||
rownames(dat2) <- NULL
|
rownames(dat2) <- NULL
|
||||||
|
# TODO: Throw warning about this
|
||||||
|
|
||||||
#summary(dat2)
|
# Add case variable ######################################################
|
||||||
|
|
||||||
# Add case variable
|
|
||||||
dat3 <- add_case(dat2)
|
dat3 <- add_case(dat2)
|
||||||
|
|
||||||
# Add event ID
|
# Add event ID ###########################################################
|
||||||
dat3$eventId <- seq_len(nrow(dat3))
|
dat3$eventId <- seq_len(nrow(dat3))
|
||||||
dat3 <- dat3[, c("fileId.start", "fileId.stop", "eventId", "case",
|
dat3 <- dat3[, c("fileId.start", "fileId.stop", "eventId", "case",
|
||||||
"trace", "glossar", "event", "artwork",
|
"trace", "glossar", "event", "artwork",
|
||||||
@ -46,17 +46,19 @@ dat3 <- dat3[, c("fileId.start", "fileId.stop", "eventId", "case",
|
|||||||
"scaleSize", "rotation.start", "rotation.stop",
|
"scaleSize", "rotation.start", "rotation.stop",
|
||||||
"rotationDegree")]
|
"rotationDegree")]
|
||||||
|
|
||||||
# Add trace for move events
|
# Add trace for move events ##############################################
|
||||||
dat4 <- add_trace_moves(dat3)
|
dat4 <- add_trace_moves(dat3)
|
||||||
|
|
||||||
# Add topics: file names and topics
|
# Add topics: file names and topics ######################################
|
||||||
artworks <- unique(dat4$artwork)
|
artworks <- unique(dat4$artwork)
|
||||||
topics <- extract_topics(artworks, pattern = paste0(artworks, ".xml"),
|
topics <- extract_topics(artworks, pattern = paste0(artworks, ".xml"),
|
||||||
path = "../data/ContentEyevisit/eyevisit_cards_light/")
|
path = "../data/ContentEyevisit/eyevisit_cards_light/")
|
||||||
|
|
||||||
dat5 <- add_topic(dat4, topics = topics)
|
dat5 <- add_topic(dat4, topics = topics)
|
||||||
|
|
||||||
# Export data
|
# TODO: Replace artwork with informative strings
|
||||||
|
|
||||||
|
# Export data ############################################################
|
||||||
write.table(dat5, "../data/event_logfiles.csv", sep = ";",
|
write.table(dat5, "../data/event_logfiles.csv", sep = ";",
|
||||||
row.names = FALSE)
|
row.names = FALSE)
|
||||||
|
|
||||||
|
@ -40,14 +40,11 @@ add_trace <- function(data, glossar_dict = "../data/glossar_dict.RData") {
|
|||||||
load(glossar_dict)
|
load(glossar_dict)
|
||||||
lut <- glossar_dict[glossar_dict$glossar_file %in% glossar_files, ]
|
lut <- glossar_dict[glossar_dict$glossar_file %in% glossar_files, ]
|
||||||
|
|
||||||
head(subdata2[, c("artwork", "event", "popup", "trace")], 20)
|
|
||||||
|
|
||||||
inside <- glossar_files[glossar_files %in%
|
inside <- glossar_files[glossar_files %in%
|
||||||
lut[sapply(lut$artwork, length) == 1,
|
lut[sapply(lut$artwork, length) == 1,
|
||||||
"glossar_file"]]
|
"glossar_file"]]
|
||||||
single_art <- unlist(lut[lut$glossar_file %in% inside, "artwork"])
|
single_art <- unlist(lut[lut$glossar_file %in% inside, "artwork"])
|
||||||
|
|
||||||
|
|
||||||
for (file in lut$glossar_file) {
|
for (file in lut$glossar_file) {
|
||||||
|
|
||||||
artwork_list <- unlist(lut[lut$glossar_file == file, "artwork"])
|
artwork_list <- unlist(lut[lut$glossar_file == file, "artwork"])
|
||||||
|