From b81a3e984c56b29528495d29636ff8b145930aa0 Mon Sep 17 00:00:00 2001
From: nwickel <n.wickelmaier@iwm-tuebingen.de>
Date: Wed, 25 Oct 2023 17:13:07 +0200
Subject: [PATCH] Updated README after many new insights

---
 README.Rmd | 269 +++++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 212 insertions(+), 57 deletions(-)

diff --git a/README.Rmd b/README.Rmd
index 2a836ea..8bbca70 100644
--- a/README.Rmd
+++ b/README.Rmd
@@ -22,7 +22,7 @@ layout. The table was installed at the institute in October 2016 and since
 November 2016 log files from interactions of visitors of the museum have
 been collected. These log files are in an unstructured format and cannot be
 easily analyzed. The purpose of the following document is to describe how
-the data haven been transformed and which decisions have been made a long
+the data have been transformed and which decisions have been made along
 the way.
 
 # Data structure
@@ -66,9 +66,9 @@ data when further processed and are only in the raw log files.
 
 The first step is to parse the raw log files that are stored by the
 application as text files in a rather unstructured format to a format that
-is better handled. The data are therefore transferred to a spread sheet
-format. The following section describes what problems were encountered
-while doing this.
+can be read by common statistics software packages. The data are therefore
+transferred to a spreadsheet format. The following section describes what
+problems were encountered while doing this.
 
 ## Corrupt lines
 
@@ -77,9 +77,9 @@ that says
 
 ```
 Warning messages:
-  incomplete final line found on '_2016/2016_11_18-11_31_0.log'
-  incomplete final line found on '_2016/2016_11_18-11_38_30.log'
-  incomplete final line found on '_2016/2016_11_18-11_40_36.log'
+  incomplete final line found on '2016/2016_11_18-11_31_0.log'
+  incomplete final line found on '2016/2016_11_18-11_38_30.log'
+  incomplete final line found on '2016/2016_11_18-11_40_36.log'
   ...
 ```
 
@@ -88,17 +88,143 @@ content. It is unclear why and how this happens. So when reading the data,
 these lines were removed. A warning will be given that indicates how many
 files have been affected.
 
-## Units of the variables
+## Extracted variables from raw log files
 
-* Welche Einheit haben x und y? Pixel? --> yes
-* Welche Einheit hat scale? --> some kind if bit, does not matter, when
-  calculating a ratio
-* rotation wirklich degree? --> yes
-* Nach welchem Zeitintervall resettet sich der Tisch wieder in die
-  Ausgangskonfiguration? --> PM needs to look it up
+The following variables (columns in the data frame) are extracted from the
+raw log file:
+
+* `fileId`: Containing the zero-left-padded file name of the raw log file
+  the data line has been extracted from
+
+* `folder`: The name of the folder in which the raw log files have been
+  organized. For the HAUM data set, the data are sorted by year (folders
+  2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
+
+* `date`: Extracted time stamp from the raw log file in the format
+  `yyyy-mm-dd hh:mm:ss`.
+
+* `timeMs`: Containing a time stamp in milliseconds that restarts with
+  every new raw log file.
+
+* `event`: Start and stop event tags. See above for possible values.
+
+* `artwork`: Identifier of the different artworks. This is a 3-digit
+  (left-padded) number. The numbers of the artworks correspond to the
+  folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
+  originally taken from the museum's catalogue.
+
+* `popup`: Name of the pop-up opened. This is only interesting for
+  "openPopup" events.
+
+* `topicNumber`: The number of the topic card that has been opened at the back of
+  the artwork card. See below for a more detailed description of what these
+  numbers possibly mean.
+
+* `x`: Value of x-coordinate in pixels on the 4K-Display ($3840 \times 2160$)
+
+* `y`: Value of y-coordinate in pixels
+
+* `scale`: Number in 128 bit that indicates how much the artwork card has
+  been scaled (????)
+
+* `rotation`: Degree of rotation in start configuration.
+
+<!-- TODO: After what time interval does the table reset itself to the
+  start configuration? -> PM needs to look it up -->
+
+## Variables after "closing of events"
+
+The raw log data consists of start and stop events for each event type.
+After preprocessing, four event types are extracted: `move`, `flipCard`,
+`openTopic`, and `openPopup`. Except for the `move` events, which can occur
+at any time when interacting with an artwork card on the table, the events
+have a hierarchical order: An artwork card first needs to be flipped
+(`flipCard`), then the topic cards on the back of the card can be opened
+(`openTopic`), and finally pop-ups on these topic cards can be opened
+(`openPopup`). This implies that the event `openPopup` can only be present
+for a certain artwork, if the card has already been flipped (i.e., an event
+`flipCard` for the same artwork has already occurred).
+
+After preprocessing, the data frame is now in a wide format with columns
+for the start and the stop of each event and contains the following
+variables:
+
+* `folder`: Containing the folder name (see above)
+
+* `eventId`: A numerical variable that indicates the number of the event.
+  Starts at 1 and ends with the total number of events, counting up by 1.
+
+* `case`: A numerical variable indicating cases in the data. A "case"
+  indicates an interaction interval and could be defined in different ways.
+  Right now, a new case begins when no event has occurred for 20 seconds.
+
+* `trace`: A trace is defined as one interaction with one artwork. A trace
+  can either start with a `flipCard` event or when an artwork has been
+  touched for the first time within this case. A trace ends with the
+  artwork card being flipped closed again or with the last movement of the
+  card within this case. One case can contain several traces with the same
+  artwork when the artwork is flipped open and flipped closed again several
+  times within a short time.
+
+* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
+  has been opened from the glossar folder. These pop-ups can be assigned to
+  the wrong artwork since it is not possible to do this algorithmically.
+  It is possible that two artworks are flipped open that could both link to
+  the same popup from a glossar. The indicator variable is left as a
+  variable, so that these pop-ups can be easily deleted from the data.
+  Right now, glossar entries can be ignored completely by setting an
+  argument and this is done by default. Using the pop-ups from the glossar
+  will need a lot more love, before it behaves satisfactorily.
+
+* `event`: Indicating the event. Can take the values `move`, `flipCard`,
+  `openTopic`, and `openPopup`.
+
+* `artwork`: Identifier of the different artworks. This is a 3 digit
+  (left-padded) number. See above.
+
+* `fileId.start` / `fileId.stop`: See above.
+
+* `date.start` / `date.stop`: See above.
+
+* `timeMs.start` / `timeMs.stop`: See above.
+
+* `duration`: Calculated by $timeMs.stop - timeMs.start$ in milliseconds.
+  For events spanning more than one log file, $600,000$ ms must be added
+  for each log file after the first. See below for details.
+
+* `topicNumber`: See above.
+
+* `popup`: See above.
+
+* `x.start` / `x.stop`: See above.
+
+* `y.start` / `y.stop`: See above.
+
+* `distance`: Euclidean distance between $(x.start, y.start)$ and $(x.stop, y.stop)$.
+
+* `scale.start` / `scale.stop`: See above.
+
+* `scaleSize`: Relative scaling of artwork card, calculated by
+  $\frac{scale.stop}{scale.start}$.
+
+* `rotation.start` / `rotation.stop`: See above.
+
+* `rotationDegree`: Difference of rotation from $rotation.stop$ to
+  $rotation.start$.
 
 ## How unclosed events are handled
 
+Events do not necessarily need to be completed. A person can, e.g., leave
+the table and not flip the artwork card closed again. For `flipCard`,
+`openTopic`, and `openPopup` the data frame contains `NA` when the event
+does not complete. For `move` events it happens quite often that a start
+event follows a start event and a stop event follows a stop event.
+Technically a move event cannot *not* be finished and the number of events
+without a start or stop indicates that the time resolution was not
+sufficient to catch all these events accurately. Double start and stop
+`move` events have therefore been deleted from the data set.
+
+<!--
 ## How a case is defined
 
 * Herausfinden, ob mehr als eine Person am Tisch steht?
@@ -108,17 +234,40 @@ files have been affected.
     automatisiert rausziehen? Was ist meine Definition von
     "Interaktionsboost"?
   - Egal wie wir es machen, geht es auf den "Event-Log-Daten"?
+-->
 
 ## Additional meta data
 
-* Anreicherung der Log-Daten mit weiteren Metadaten? Was wäre interessant?
+For the HAUM data, I added meta data on state holidays and school
+vacations. Additionally, the topic categories of the topic cards were
+extracted from the XML files and added to the data frame.
 
+This led to the following additional variables:
+
+* `topicIndex`
+
+* `topicFile`
+
+* `topic`
+
+* `state` (Niedersachsen for complete HAUM data set)
+
+* `stateCode` (NI)
+
+* `holiday`
+
+* `vacations`
+
+* `stateCodeVacations`
+
+<!--
   - Metadata on artworks like, name, artist, type of artwork, epoch, etc.
   - School vacations and holidays
   - Special exhibits at the museum
   - Number of visitors per day (bei Sven noch mal nachhaken?)
   - Age structure of visitors per day?
   - ... ????
+-->
 
 # Problems and how I handled them
 
@@ -129,10 +278,14 @@ made.
 
 ## Weird behavior of `timeMs` and neg. `duration` values
 
-I think the negative duration values happen, when an event starts in one
-log file and completes in another one. The variable `timeMs` seems to be
-continuous within one log file but not over several log files.
-
+`timeMs` resets itself every time a new log file starts. This means that
+the durations of events spanning more than one log file must be adjusted.
+Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start`
+must be subtracted from the maximum duration of the log file where the
+event started ($600,000$ ms) and the `timeMs.stop` must be added. If the
+event spans more than two log files, a multiple of $600,000$ must be taken,
+e.g. for three log files it must be: $2 \times 600,000 - timeMs.start +
+timeMs.stop$ and so on.
 
 ```{r, results = FALSE, fig.show = TRUE}
 # Read data
@@ -154,34 +307,22 @@ glossar_dict <- create_glossardict(artworks, glossar_files,
 dat1 <- add_trace(dat, glossar_dict)
 
 # Close events
-dat2 <- rbind(close_events(dat1, "move"),
-              close_events(dat1, "flipCard"),
-              close_events(dat1, "openTopic"),
-              close_events(dat1, "openPopup"))
-dat2 <- dat2[order(dat2$date.start, dat2$fileId), ]
+dat2 <- rbind(close_events(dat1, "move", rm_nochange_moves = TRUE),
+              close_events(dat1, "flipCard", rm_nochange_moves = TRUE),
+              close_events(dat1, "openTopic", rm_nochange_moves = TRUE),
+              close_events(dat1, "openPopup", rm_nochange_moves = TRUE))
+dat2 <- dat2[order(dat2$fileId.start, dat2$date.start, dat2$timeMs.start), ]
 
 plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
 ```
 
 The boxplot shows that we have a continuous range of values within one log
-file but that `timeMs` does not increase over log files. Since it seems not
-possible to fix this in a consistent way, I set all durations to `NA` where
-`fileId.start` and `fileId.stop` are not identical. I kept `timeMs.start`
-and `timeMs.stop` and also `fileId.start` and `fileId.stop` in the data
-frame, so it is clear why there are no durations. The other
+file but that `timeMs` does not increase over log files. I kept
+`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop`
+in the data frame, so it is clear when events span more than one log file.
 
-NOTE: Part of this problem was that time stamps that are part of the log
-file names are not zero-left-padded and therefore the files were not in the
-correct order when read into R. When zero left padding these file IDs and
-sorting by them and then by `date.start` within, some of the durations are
-exactly fixed. Unfortunately, only three `move` events were fixed, since it
-only fixed irregularities *within* one log file. See below for more
-details.
-
-UPDATE: By now I remove all events that span more than one log file. This
-lets me improve speed considerably.
-
-UPDATE: Infos from Philipp:
+<!--
+Info from Philipp:
 
 "Bin außerdem gerade den Code von damals durchgegangen. Das Logging läuft
 so: Mit Start der Anwendung wird alle 10 Minuten ein neues Logfile
@@ -194,6 +335,7 @@ nachdem, ob der Tisch zum Zeitpunkt des neuen Logging-Intervalls in
 Benutzung war). Wenn ein Case also über 2+ Logs verteilt ist, musst du auf
 die Duration jeweils 10 Minuten pro Logfile nach dem ersten addieren, damit
 es passt."
+-->
 
 ## Left padding of file IDs
 
@@ -277,7 +419,7 @@ technical terms that can be opened from the hypertexts on the topic cards.
 Often these information are artwork dependent and then the corresponding
 XML-file is in the folder for this artwork. Sometimes, however, more
 general terms can be opened. In order to avoid multiple files containing
-the same informatione, these were stored in a folder called `glossar` and
+the same information, these were stored in a folder called `glossar` and
 get accessed from there. The raw log files only contain the path to this
 glossar entry and did not record from which artwork it was accessed. I
 tried to assign these glossar entries to the correct artworks. The (very
@@ -310,12 +452,13 @@ pop-ups where another artwork has been opened in between. This is still an
 open TODO to write a more elaborate algorithm.
 
 All glossar pop-ups that do not get matched with an artwork are removed
-from the data set with a warning.
+from the data set with a warning if the argument `glossar = TRUE` is set.
+Otherwise the glossar entries will be ignored completely.
 
 ## Assign a `case` variable based on "time heuristic"
 
 One thing needed in order to work with the data set and use it for machine
-learning algorithms like process mining is a variable that tries to
+learning algorithms like process mining, is a variable that tries to
 identify a case. A case variable will structure the data frame in a way
 that navigation behavior can actually be investigated. However, we do not
 know if several people are standing around the table interacting with it or
@@ -329,9 +472,9 @@ gets extracted by the algorithm.
 
 In order to investigate user behavior on a more fine grained level, it will
 be necessary to come up with a more elaborate approach. A better, still
-simple approach could be to use this kind of time limit and additionally
+simple approach, could be to use this kind of time limit and additionally
 look at the distance between artworks interacted with within one time
-window. When artworks are far apart is seems plausible that more than one
+window. When artworks are far apart it seems plausible that more than one
 person interacted with them. Very short time lapses between events on
 different artworks could also be an indicator that more than one person is
 interacting with the table.
@@ -348,7 +491,7 @@ algorithmic way but only heuristically. I used the `case` variable in order
 to get meaningful units around the artworks.
 
 If within one case only a single trace for a single artwork was opened, I
-assigned this trace to the moves associated with this artwork. I (quite
+assigned this trace to the moves associated with this artwork. It (quite
 often) happens that within one case one artwork is opened and closed
 several times, each time starting a new trace. I then assigned all the
 following move events to the trace beforehand. This is, of course,
@@ -401,6 +544,16 @@ dim(dat2[is.na(dat2$date.start), ])
 dat2 <- dat2[!is.na(dat2$date.start), ]
 ```
 
+In order to deal with these logging errors, I check the data for what I
+call "fragmented traces". These are traces that cannot occur when
+everything is logged correctly, e.g., traces containing `flipCard ->
+openPopup` or traces that only consist of `move`, `openTopic`, and
+`openPopup` events. These fragmented traces are removed from the data. It
+was not possible to check them all manually, but the 20 or more that I did
+check in the raw log files were all some kind of logging error like above.
+Most often a card was already closed again before a topic card or pop-up
+was recorded as being closed.
+
 ## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
 
 See `questions_number-of-cards.R` for more details.
@@ -411,13 +564,15 @@ displayed on the back of the card. I added an index giving the ordering in
 the index files.
 
 The possible values in the variable `topicNumber` range from 0 to 7,
-however, not artwork has more than six different numbers. So I just renamed
+however, no artwork has more than six different numbers. So I just renamed
 those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
 to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
 assign topics and file names to the according pop-ups. This needs to be
 cross checked with the programming, but seems the most plausible approach
 with my current knowledge.
 
+<!-- TODO: Ask Philipp -->
+
 ## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
 
 When I extract the topics from `index.html` I get different topics, than
@@ -479,9 +634,6 @@ It strongly suggests that the artworks haven been updated after the Corona
 pandemic. I think, the table was also moved to a different location at that
 point. (Check with PG to make sure.)
 
-I need to get the XML files for "504" and "505" from PM in order to extract
-information on them for the metadata.
-
 # Optimizing resources used by the code
 
 After I started trying out the functions on the complete data set, it
@@ -492,12 +644,15 @@ frame (at least not on my laptop). The code is supposed to work "out of the
 box" for researchers, hence it *should* run on a regular (8 core) laptop.
 So, I changed the reshaping so that it is done in batches on subsets of the
 data for every `fileId` separately. This means that events that span over
-two raw log files cannot be closed and will then be removed from the data
-set. The functions warns about this, but it is a random process getting rid
-of these data and seems therefore not like a systematic problem. Another
-reason why this is not bad, is that durations cannot be calculated for
-events across log files anyways, because the time stamps do not increase
-systematically over log files (see above).
+two (or more) raw log files cannot be closed and will then be removed from
+the data set. The function warns about this, but it is a random process
+getting rid of these data and therefore does not seem to be a systematic
+problem. Another reason why this is not bad is that durations cannot be
+calculated for events across log files anyway, because the time stamps do
+not increase systematically over log files (see above).
+
+UPDATE: By now, I close the events spanning more than one log file after
+this has been done.
 
 I meant to put the lists back together with `do.call(rbind, some_list)` but
 this can also not handle big data sets. I therefore switched to