diff --git a/README.Rmd b/README.Rmd deleted file mode 100644 index 1629914..0000000 --- a/README.Rmd +++ /dev/null @@ -1,504 +0,0 @@ ---- -title: "Log data from the Multi-Touch Table at the HAUM" -output: github_document ---- - -```{r, include = FALSE} -devtools::load_all("../../../../software/mtt") -``` - -The Multi Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in -Braunschweig gives visitors of the Museum the opportunity to interact with -about 70 artworks and 3 virtual cards containing information about the -museum and its layout. The table was installed at the museum in October -2016 and since November 2016 log files from interactions of visitors of the -museum have been collected. These log files are in an unstructured format -and cannot be easily analyzed. The purpose of the following document is to -describe how the data haven been transformed and which decisions have been -made along the way. - -The implementation of the steps described here can be found at: -https://gitea.iwm-tuebingen.de/R/mtt. - -# Data structure - -The log files contain lines that indicate the beginning and end of possible -activities that can be performed when interacting with the artworks on the -table. The layout of the table looks like pictures have been tossed on a -large table. Every artwork is visible at the start configuration. People -can move the pictures on the table, they can be scaled and rotated. -Additionally, the virtual picture cards can be flipped in order to find -more information of the artwork on the "back" of the card. One has to press -a little `i` for more information in one of the bottom corners of the card. -On the back of the card two to six information cards can be found with a -teaser text about a certain topic. These topic cards can be opened and a -hypertext with detailed information opens. Within these hypertexts certain -technical terms can be clicked for lay people to get more information. This -also opens up a pop-up. 
The events encoded in the raw log files therefore -have the following structure. - -``` -"Start Application" --> Start Application -"Show Application" -"Transform start" --> Move -"Transform stop" -"Show Info" --> Flip Card -"Show Front" -"Artwork/OpenCard" --> Open Topic -"Artwork/CloseCard" -"ShowPopup" --> Open Popup -"HidePopup" -``` - -The right side shows what events can be extracted from these raw lines. The -"Start Application" is not an event in the original sense since it only -indicates if the table was started or maybe reset itself. This is not an -interaction with the table and therefore not interesting in itself. All -"Start Application" and "Show Application" are therefore excluded from the -data when further processed and are only in the raw log files. - -# Parsing the raw log files - -The first step is to parse the raw log files that are stored by the -application as text files in a rather unstructured format to a format that -can be read by common statistics software packages. The data are therefore -transferred to a spread sheet format. The following section describes what -problems were encountered while doing this. - -## Corrupt lines - -When reading the files containing the raw logs into R, a warning appears -that says - -``` -Warning messages: - incomplete final line found on '2016/2016_11_18-11_31_0.log' - incomplete final line found on '2016/2016_11_18-11_38_30.log' - incomplete final line found on '2016/2016_11_18-11_40_36.log' - ... -``` - -When you open these files, it looks like the last line contains some binary -content. It is unclear why and how this happens. So when reading the data, -these lines were removed. A warning will be given that indicates how many -files have been affected. 
- -## Extracted variables from raw log files - -The following variables (columns in the data frame) are extracted from the -raw log file: - -* `fileId`: Containing the zero-left-padded file name of the raw log file - the data line has been extracted from - -* `folder`: The folder names in which the raw log files haven been - organized in. For the HAUM data set, the data are sorted by year (folders - 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023). - -* `date`: Extracted timestamp from the raw log file in the format - `yyyy-mm-dd hh:mm:ss`. - -* `timeMs`: Containing a timestamp in Milliseconds that restarts with - every new raw log files. - -* `event`: Start and stop event tags. See above for possible values. - -* `item`: Identifier of the different items. This is a three-digit - (left-padded) number. The numbers of the items correspond to the - folder names in `/ContentEyevisit/eyevisit_cards_light/` and were - orginally taken from the museums catalogue. - -* `popup`: Name of the pop-up opened. This is only interesting for - "openPopup" events. - -* `topic`: The number of the topic card that has been opened at the back of - the item card. See below for a more detailed description what these - numbers mean. - -* `x`: Value of x-coordinate in pixel on the 4K-Display ($3840 \times 2160$). - -* `y`: Value of y-coordinate in pixel. - -* `scale`: Number in 128 bit that indicates how much the card has been - scaled. - -* `rotation`: Degree of rotation from start configuration. - - - -## Variables after "closing of events" - -The raw log data consist of start and stop events for each event type. -After preprocessing four event types are extracted: `move`, `flipCard`, -`openTopic`, and `openPopup`. 
Except for the `move` events, which can occur -at any time when interacting with an item card on the table, the events -have a hierarchical order: An item card first needs to be flipped -(`flipCard`), then the topic cards on the back of the card can be opened -(`openTopic`), and finally pop-ups on these topic cards can be opened -(`openPopup`). This implies that the event `openPopup` can only be present -for a certain item, if the card has already been flipped (i.e., an event -`flipCard` for the same item has already occured). - -After preprocessing, the data frame is now in a wide format with columns -for the start and the stop of each event and contains the following -variables: - -* `fileId.start` / `fileId.stop`: See above. - -* `date.start` / `date.stop`: See above. - -* `folder`: Containing the folder name (see above). - -* `case`: A numerical variable indicating cases in the data. A "case" - indicates an interaction interval and could be defined in different ways. - Right now a new case begins, when no event occurred when no new path - started for 20 seconds or longer. - -* `path`: A path is defined as one interaction with one item. A path - can either start with a `flipCard` event or when an item has been - touched for the first time within this case. A path ends with the - item card being flipped close again or with the last movement of the - card within this case. One case can contain several paths with the same - item when the item is flipped open and flipped close again several - times within a short time. - -* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up - has been opened from the glossar folder. These pop-ups can be assigned to - the wrong item since it is not possible to do this algorithmically. - It is possible that two items are flipped open that could both link to - the same pop-up from a glossar. The indicator variable is left as a - variable, so that these pop-ups can be easily deleted from the data. 
- Right now, glossar entries can be ignored completely by setting an - argument and this is done by default. Using the pop-ups from the glossar - will need a lot more love, before it behaves satisfactorily. - -* `event`: Indicating the event. Can take tha values `move`, `flipCard`, - `openTopic`, and `openPopup`. - -* `item`: Identifier of the different artworks and information cards. This - is a three-digit (left-padded) number. See above. - -* `timeMs.start` / `timeMs.stop`: See above. - -* `duration`: Calculated by $timeMs.stop - timeMs.start$ in Milliseconds. - Needs to be adjusted for events spanning more than one log file by a - factor of $60,000 \times \text{number of logfiles}$. See below for details. - -* `topic`: See above. - -* `popup`: See above. - -* `x.start` / `x.stop`: See above. - -* `y.start` / `y.stop`: See above. - -* `distance`: Euclidean distande calculated from $(x.start, y.start)$ and - $(x.stop, y.stop)$. - -* `scale.start` / `scale.stop`: See above. - -* `scaleSize`: Relative scaling of item card, calculated by - $\frac{scale.stop}{scale.start}$. - -* `rotation.start` / `rotation.stop`: See above. - -* `rotationDegree`: Difference of rotation from $rotation.stop$ to - $rotation.start$. - -## How unclosed events are handled - -Events do not necessarily need to be completed. A person can, e.g., leave -the table and not flip the item card close again. For `flipCard`, -`openTopic`, and `openPopup` the data frame contains `NA` when the event -does not complete. For `move` events it happens quite often that a start -event follows a start event and a stop event follows a stop event. -Technically a move event cannot *not* be finished and the number of events -without a start or stop indicate that the time resolution was not -sufficient to catch all these events accurately. Double start and stop -`move` events have therefore been deleted from the data set. 
- -## Additional meta data - -For the HAUM data, I added meta data on state holidays and school -vacations. - -This led to the following additional variables: - -* `holiday` - -* `vacations` - -# Problems and how I handled them - -This lists some problems with the log data that required decisions. These -decisions influence the outcome and maybe even the data quality. Hence, I -tried to document how I handled these problems and explain the decisions I -made. - -## Weird behavior of `timeMs` and neg. `duration` values - -`timeMs` resets itself every time a new log file starts. This means that -the durations of events spanning more than one log file must be adjusted. -Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start` -must be subtracted from the maximum duration of the log file where the -event started ($600,000 ms$) and the `timeMs.stop` must be added. If the -event spans more than two log files, a multiple of $600,000$ must be taken, -e.g. for three log files it must be: $2 \times 600,000 - timeMs.start + -timeMs.stop$ and so on. - -```{r timems, echo = FALSE, results = FALSE, fig.show = TRUE} -# Read data -datraw <- read.table("code/results/raw_logfiles_2024-02-21_16-07-33.csv", sep = ";", - header = TRUE) - -plot(timeMs ~ as.factor(fileId), datraw[1:5000,], xlab = "fileId") -``` - -The boxplot shows that we have a continuous range of values within one log -file but that `timeMs` does not increase over log files. I kept -`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop` -in the data frame, so it is clear when events span more than one log file. - - - -## Left padding of file IDs - -The file names of the raw log files are automatically generated and contain -a timestamp. This timestamp is not well formed. First, it contains an -incorrect month. The months go from 0 to 11 which means, that the file name -`2016_11_15-12_12_57.log` was collected on December 15, 2016 at 12:12 pm. 
-Another problem is that the file names are not zero left padded, e.g., -`2016_11_15-12_2_57.log`. This file was collected on December 15, 2016 at -12:02 pm and therefore before the file above. But most sorting algorithms, -will sort these files in the order shown below. In order to preprocess the -data and close events that belong together, the data need to be sorted by -events and artworks repeatedly. In order to get them back in the correct -time order, it is necessary to order them based on three variables: -`fileId.start`, `date.start` and `timeMs.start`. The file IDs therefore -need to sort in the correct order (again see below for example). I zero -left padded the log file names within the data frame using it as an -identifier. These "file names" do not correspond exactly to the original -raw log file names. This needs to be kept in mind when doing any kind of -matching etc. - -``` -## what it looked like before left padding -# 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254 -# 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465 -# 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605 -# 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605 -# 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362 -# 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465 - -## what it looks like now -# 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 
Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254 -# 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465 -# 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57 621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465 -# 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57 677 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605 -# 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57 774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605 -# 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57 850 Transform stop 076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362 -``` - -## Timestamps repeat - -The timestamps in the `date` variable record year, month, day, hour, -minute and seconds. Since one second is not a very short time interval for -a move on a touch display, this is not fine grained enough to bring events -into the correct order, meaning there are events from the same log file -having the same timestamp and even events from different log files having -the same timestamp. The log files get written about every 10 minutes -(which can easily be seen when looking at the file names of the raw log -files). So in order to get events in the correct order, it is necessary to -first order by file ID, within file ID then sort by timestamp `date` and -then within these more coarse grained timestamps sort be `timeMs`. But as -explained above, `timeMs` can only be sorted within one file ID, since they -do not increase consistently over log files, but have a new setoff for each -raw log file. - -## x,y-coordinates outside of display range - -The display of the Multi-Touch-Table is a 4K-display with 3840 x 2160 -pixels. When you plot the start and stop coordinates, the display is -clearly distinguishable. However, a lot of points are outside of the -display range. 
This can happen, when the art objects are scaled and then -moved to the very edge of the table. Then it will record pixels outside of -the table. These are actually valid data points and I will leave them as -is. - -```{r xycoord} -datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";", - header = TRUE) - -par(mfrow = c(1, 2)) -plot(y.start ~ x.start, datlogs) -abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2) -plot(y.stop ~ x.stop, datlogs) -abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2) - -aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean) -``` - -## Pop-ups from glossar cannot be assigned to a specific item - -All the information, pictures and texts for the topics and pop-ups are -stored in `/data/haum/ContentEyevisit/eyevisit_cards_light/`. -Among other things, each folder contains XML-files with the information -about any technical terms that can be opened from the hypertexts on the -topic cards. Often these information are item dependent and then the -corresponding XML-file is in the folder for this item. Sometimes, however, -more general terms can be opened. In order to avoid multiple files -containing the same information, these were stored in a folder called -`glossar` and get accessed from there. The raw log files only contain the -path to this glossar entry and did not record from which item it was -accessed. I tried to assign these glossar entries to the correct items. The -(very heuristic) approach was this: - -1. Create a lookup table with all XML-file names (possible pop-ups) from - the glossar folder and what items possibly call them. This was stored - as an `RData` object for easier handling but should maybe be stored in a - more interoperable format. - -2. I went through all possible pop-ups in this lookup table and stored the - items that are associated with it. - -3. 
I created a sub data frame without move events (since they can never be - associated with a pop-up) and went through every line and looked up if - an item and a topic card had been opened. If this was the case and a - glossar entry came up before the item was closed again, I assigned - this item to the glossar entry. - -This is heuristic since it is possible that several topic cards from -different items are opened simultaneously and the glossar pop-up could -be opened from either one (it could even be more than two, of course). In -these cases the item that was opened closest to the glossar pop-up has -been assigned, but this can never be completely error free. - -And this heuristic only assigns a little more than half of the glossar -entries. Since my heuristic only looks for the last item that has been -opened and if this item is a possible candidate it misses all glossar -pop-ups where another item has been opened in between. This is still an -open TODO to write a more elaborate algorithm. - -All glossar pop-ups that do not get matched with an item are removed -from the data set with a warning if the argument `glossar = TRUE` is set. -Otherwise the glossar entries will be ignored completely. - -## Assign a `case` variable based on "time heuristic" - -One thing needed in order to work with the data set and use it for machine -learning algorithms like process mining, is a variable that tries to -identify a case. A case variable will structure the data frame in a way -that navigation behavior can actually be investigated. However, we do not -know if several people are standing around the table interacting with it or -just one very active person. The simplest way to define a case variable is -to just use a time limit between events. This means that when the table has -not been interacted with for, e.g., 20 seconds than it is assumed that a -person moved on and a new person started interacting with the table. 
This -is the easiest heuristic and implemented at the moment. Process mining -shows that this simple approach works in a way that the correct process -gets extracted by the algorithm. - -In order to investigate user behavior on a more fine grained level, it will -be necessary to come up with a more elaborate approach. A better, still -simple approach, could be to use this kind of time limit and additionally -look at the distance between items interacted with within one time window. -When items are far apart it seems plausible that more than one person -interacted with them. Very short time lapses between events on different -items could also be an indicator that more than one person is interacting -with the table. - -## Assign a `path` variable - -The `path` variable is supposed to show one interaction trace with one -artwork. Meaning it starts when an artwork is touched or flipped and stops -when it is closed again. It is easy to assign a path from flipping a card -over opening (maybe several) topics and pop-ups for this artwork card until -closing this card again. But one would like to assign the same path to -move events surrounding this interaction. Again, this is not possible in an -algorithmic way but only heuristically. - -Again, I used a time cutoff for this. First, if a `move` event occurs, it -is checked, if the same item has been flipped less than 20 seconds -beforehand. If yes, the same path indicator is assigned to this `move`. If -not, temporarily a new "move indicator" is assigned. Then, a "backward -pass" is applied, where it is checked if the same item is opened less than -20 seconds _after_ the event occurs. If yes, that path indicator is -assigned. For all the remaining moves, a new path number is assigned. This -corresponds to items being moved without being flipped. - -## A `move` event does not record any change - -Most of the events in the log files are move events. 
Additionally, many of -these move events are recorded but they do not indicate any change, meaning -the only difference is the timestamp. All other variables indicating moves -like `x.start` and `x.stop`, `rotation.start` and `rotation.stop` etc. do -not show _any_ change. They represent about 2/3 of all move events. These -events are probably short touches of the table without an actual -interaction. They were therefore removed from the data set. - -## Card indices go from 0 to 7 (instead of 0 to 5 as expected) - -In the beginning I thought that the number for topics was the index of -where the card was presented on the back of the item. But this is not -correct. It is the number of the topic. There are eight topics in total: - -``` -Indices for topics: -0 artist -1 thema -2 komposition -3 leben des kunstwerks -4 details -5 licht und farbe -6 extra info -7 technik -``` -On the back of items, there can be between 2 to 6 topic cards. Several of -these topic cards can be about the same topic, e.g., there can be two topic -cards assigned to the topic `thema`. It is impossible to find out if the -same topic card was opened several times or if different topic cards with -the same topic were opened from the same item. See example below for item -"001". - -```{r topics, echo = FALSE} -items <- sprintf("%03d", unique(datlogs$item)) -topics <- extract_topics(items, xmlfiles = paste0(items, ".xml"), - xmlpath = "data/haum/ContentEyevisit/eyevisit_cards_light/") -head(topics) -``` - -## New artworks "504" and "505" starting October 2022 - -When I read in the complete data frame for the first time, all of the -sudden there were 72 instead of 70 items. It seems like these two -artworks appear on October 21, 2022. - -```{r newitems} -summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"])) -``` - -The artworks seem to be have updated in general after October 21, 2022. The -following table shows which items were presented in which years. 
- -```{r years} -xtabs(~ item + lubridate::year(date.start), datlogs) -``` - -It shows that the artworks haven been updated after the Corona pandemic. I -think, the table was also moved to a different location at that point. - diff --git a/README.md b/README.md index adbbd88..6d6c5cc 100644 --- a/README.md +++ b/README.md @@ -1,580 +1,66 @@ -Log data from the Multi-Touch Table at the HAUM -================ +# Accompanying Analysis Code for the Master Thesis "XXX" -The Multi Touch Table at the Herzog-Anton-Ulrich-Museum (HAUM) in -Braunschweig gives visitors of the Museum the opportunity to interact -with about 70 artworks and 3 virtual cards containing information about -the museum and its layout. The table was installed at the museum in -October 2016 and since November 2016 log files from interactions of -visitors of the museum have been collected. These log files are in an -unstructured format and cannot be easily analyzed. The purpose of the -following document is to describe how the data haven been transformed -and which decisions have been made along the way. +The multi-touch table at the Herzog-Anton-Ulrich-Museum (HAUM) in +Braunschweig gives visitors of the museum the opportunity to interact with +about 70 artworks and 3 virtual cards containing information about the +museum and its layout. The table was installed at the museum in October +2016 and since November 2016 log files from interactions of visitors of the +museum have been collected. The master thesis for which this repository was +created analyzed data collected between December 14, 2016 and July 5, 2023. +In total, the data set consists of 39,767 log files containing 6,700,176 +events. -The implementation of the steps described here can be found at: +The following gives a short overview of the analyses conducted. All +analysis scripts can be found in the `/code/` folder. 
+ +## Preprocessing and Descriptives + +The first script `01_preprocessing.R` preprocesses the raw log files by +first parsing them so they are readable by standard statistics software +like R or Python and then converting them to event logs. A short R package +doing the preprocessing and more information can be found at . -# Data structure +The second script `02_descriptives.R` calculates some descriptive +statistics and creates plots to get an overall feeling for the data set. + +## Conformance Checking + +A normative Petri net to test the data quality after the preprocessing is +created in `03_create-petrinet.py` and the actual data quality check is +done in `04_conformance-checking.py`. Both scripts are written in Python +using the pm4py library. For more information and the full documentation go +to . + +The next script `05_check-traces.R` (written in R again) checks the corrupt +trace found during conformance checking and exports the cleaned data sets +used for the following analyses. + +## Clustering of Items + +To answer the first research question in the thesis "Do interaction +patterns look different for different artworks? (Control-flow perspective)", +process mining was applied to all paths separately for each item on the +multi-touch table. Fitness, precision, generalizability, simplicity, +soundness, number of connecting arcs, number of transitions, number of +places, number of different variants, and the most frequent variant were +obtained and saved to a CSV file (Python script `06_infos-items.py`). This +information was then read into R in the next script +(`07_item-clustering.R`) and used (together with other features) for +hierarchical clustering. + +## Clustering of Cases + +For the second research question "What kind of patterns exist and are there +typical user behaviors? 
(Case perspective)" six indicator variables for +five proposed user navigation types were calculated in +`08_case-characteristics.R` and then used for hierarchical clustering and +recursive partitioning to extract the different navigation types in script +`09_user-navigation.R`. A validation of the results for data from 2018 was +done in `10_validation.R`. The different variants of the cases for the +complete data set and for the data used for investigating the navigation types +(all log files from 2019) were examined in `11_investigate-variants.R`, and the +clusters found for the navigation types were further investigated with +process mining techniques in R (`12_dfgs-case-clusters.R`) and Python +(`13_pm-case-clusters.py`). -The log files contain lines that indicate the beginning and end of -possible activities that can be performed when interacting with the -artworks on the table. The layout of the table looks like pictures have -been tossed on a large table. Every artwork is visible at the start -configuration. People can move the pictures on the table, they can be -scaled and rotated. Additionally, the virtual picture cards can be -flipped in order to find more information of the artwork on the “back” -of the card. One has to press a little `i` for more information in one -of the bottom corners of the card. On the back of the card two to six -information cards can be found with a teaser text about a certain topic. -These topic cards can be opened and a hypertext with detailed -information opens. Within these hypertexts certain technical terms can -be clicked for lay people to get more information. This also opens up a -pop-up. The events encoded in the raw log files therefore have the -following structure. 
- - "Start Application" --> Start Application - "Show Application" - "Transform start" --> Move - "Transform stop" - "Show Info" --> Flip Card - "Show Front" - "Artwork/OpenCard" --> Open Topic - "Artwork/CloseCard" - "ShowPopup" --> Open Popup - "HidePopup" - -The right side shows what events can be extracted from these raw lines. -The “Start Application” is not an event in the original sense since it -only indicates if the table was started or maybe reset itself. This is -not an interaction with the table and therefore not interesting in -itself. All “Start Application” and “Show Application” are therefore -excluded from the data when further processed and are only in the raw -log files. - -# Parsing the raw log files - -The first step is to parse the raw log files that are stored by the -application as text files in a rather unstructured format to a format -that can be read by common statistics software packages. The data are -therefore transferred to a spread sheet format. The following section -describes what problems were encountered while doing this. - -## Corrupt lines - -When reading the files containing the raw logs into R, a warning appears -that says - - Warning messages: - incomplete final line found on '2016/2016_11_18-11_31_0.log' - incomplete final line found on '2016/2016_11_18-11_38_30.log' - incomplete final line found on '2016/2016_11_18-11_40_36.log' - ... - -When you open these files, it looks like the last line contains some -binary content. It is unclear why and how this happens. So when reading -the data, these lines were removed. A warning will be given that -indicates how many files have been affected. - -## Extracted variables from raw log files - -The following variables (columns in the data frame) are extracted from -the raw log file: - -- `fileId`: Containing the zero-left-padded file name of the raw log - file the data line has been extracted from - -- `folder`: The folder names in which the raw log files haven been - organized in. 
For the HAUM data set, the data are sorted by year - (folders 2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023). - -- `date`: Extracted timestamp from the raw log file in the format - `yyyy-mm-dd hh:mm:ss`. - -- `timeMs`: Containing a timestamp in Milliseconds that restarts with - every new raw log files. - -- `event`: Start and stop event tags. See above for possible values. - -- `item`: Identifier of the different items. This is a three-digit - (left-padded) number. The numbers of the items correspond to the - folder names in `/ContentEyevisit/eyevisit_cards_light/` and were - orginally taken from the museums catalogue. - -- `popup`: Name of the pop-up opened. This is only interesting for - “openPopup” events. - -- `topic`: The number of the topic card that has been opened at the back - of the item card. See below for a more detailed description what these - numbers mean. - -- `x`: Value of x-coordinate in pixel on the 4K-Display - ($3840 \times 2160$). - -- `y`: Value of y-coordinate in pixel. - -- `scale`: Number in 128 bit that indicates how much the card has been - scaled. - -- `rotation`: Degree of rotation from start configuration. - - - -## Variables after “closing of events” - -The raw log data consist of start and stop events for each event type. -After preprocessing four event types are extracted: `move`, `flipCard`, -`openTopic`, and `openPopup`. Except for the `move` events, which can -occur at any time when interacting with an item card on the table, the -events have a hierarchical order: An item card first needs to be flipped -(`flipCard`), then the topic cards on the back of the card can be opened -(`openTopic`), and finally pop-ups on these topic cards can be opened -(`openPopup`). This implies that the event `openPopup` can only be -present for a certain item, if the card has already been flipped (i.e., -an event `flipCard` for the same item has already occured). 
-
-After preprocessing, the data frame is now in a wide format with columns
-for the start and the stop of each event and contains the following
-variables:
-
-- `fileId.start` / `fileId.stop`: See above.
-
-- `date.start` / `date.stop`: See above.
-
-- `folder`: Containing the folder name (see above).
-
-- `case`: A numerical variable indicating cases in the data. A “case”
-  indicates an interaction interval and could be defined in different
-  ways. Right now, a new case begins when no event occurred (i.e., no
-  new path started) for 20 seconds or longer.
-
-- `path`: A path is defined as one interaction with one item. A path can
-  either start with a `flipCard` event or when an item has been touched
-  for the first time within this case. A path ends with the item card
-  being flipped closed again or with the last movement of the card
-  within this case. One case can contain several paths with the same
-  item when the item is flipped open and flipped closed again several
-  times within a short time.
-
-- `glossar`: An indicator variable with values 0/1 that tracks if a
-  pop-up has been opened from the glossar folder. These pop-ups can be
-  assigned to the wrong item since it is not possible to do this
-  algorithmically. It is possible that two items are flipped open that
-  could both link to the same pop-up from a glossar. The indicator
-  variable is left as a variable, so that these pop-ups can be easily
-  deleted from the data. Right now, glossar entries can be ignored
-  completely by setting an argument and this is done by default. Using
-  the pop-ups from the glossar will need a lot more love before it
-  behaves satisfactorily.
-
-- `event`: Indicating the event. Can take the values `move`, `flipCard`,
-  `openTopic`, and `openPopup`.
-
-- `item`: Identifier of the different artworks and information cards.
-  This is a three-digit (left-padded) number. See above.
-
-- `timeMs.start` / `timeMs.stop`: See above.
-
-- `duration`: Calculated as $timeMs.stop - timeMs.start$ in
-  milliseconds. For events spanning more than one log file,
-  $600,000 \times (\text{number of log files} - 1)$ must be added. See
-  below for details.
-
-- `topic`: See above.
-
-- `popup`: See above.
-
-- `x.start` / `x.stop`: See above.
-
-- `y.start` / `y.stop`: See above.
-
-- `distance`: Euclidean distance calculated from $(x.start, y.start)$
-  and $(x.stop, y.stop)$.
-
-- `scale.start` / `scale.stop`: See above.
-
-- `scaleSize`: Relative scaling of item card, calculated by
-  $\frac{scale.stop}{scale.start}$.
-
-- `rotation.start` / `rotation.stop`: See above.
-
-- `rotationDegree`: Difference of rotation from $rotation.stop$ to
-  $rotation.start$.
-
-## How unclosed events are handled
-
-Events do not necessarily need to be completed. A person can, e.g.,
-leave the table and not flip the item card closed again. For `flipCard`,
-`openTopic`, and `openPopup` the data frame contains `NA` when the event
-does not complete. For `move` events it happens quite often that a start
-event follows a start event and a stop event follows a stop event.
-Technically a move event cannot *not* be finished and the number of
-events without a start or stop indicates that the time resolution was
-not sufficient to catch all these events accurately. Double start and
-stop `move` events have therefore been deleted from the data set.
-
-## Additional meta data
-
-For the HAUM data, I added meta data on state holidays and school
-vacations.
-
-This led to the following additional variables:
-
-- `holiday`
-
-- `vacations`
-
-# Problems and how I handled them
-
-This section lists some problems with the log data that required
-decisions. These decisions influence the outcome and maybe even the data
-quality. Hence, I tried to document how I handled these problems and
-explain the decisions I made.
-
-## Weird behavior of `timeMs` and negative `duration` values
-
-`timeMs` resets itself every time a new log file starts.
-This means that
-the durations of events spanning more than one log file must be
-adjusted. Instead of just calculating $timeMs.stop - timeMs.start$,
-`timeMs.start` must be subtracted from the maximum duration of the log
-file where the event started ($600,000$ ms) and `timeMs.stop` must be
-added. If the event spans more than two log files, a multiple of
-$600,000$ must be taken, e.g., for three log files it must be
-$2 \times 600,000 - timeMs.start + timeMs.stop$, and so on.
-
-![](README_files/figure-gfm/timems-1.png)
-
-The boxplot shows that we have a continuous range of values within one
-log file but that `timeMs` does not increase over log files. I kept
-`timeMs.start` and `timeMs.stop` and also `fileId.start` and
-`fileId.stop` in the data frame, so it is clear when events span more
-than one log file.
-
-## Left padding of file IDs
-
-The file names of the raw log files are automatically generated and
-contain a timestamp. This timestamp is not well formed. First, it
-contains an incorrect month. The months go from 0 to 11, which means
-that the file `2016_11_15-12_12_57.log` was collected on December 15,
-2016 at 12:12 pm. Another problem is that the file names are not zero
-left padded, e.g., `2016_11_15-12_2_57.log`. This file was collected on
-December 15, 2016 at 12:02 pm and therefore before the file above. But
-most sorting algorithms will sort these files in the order shown below.
-In order to preprocess the data and close events that belong together,
-the data need to be sorted by events and artworks repeatedly. In order
-to get them back in the correct time order, it is necessary to order
-them based on three variables: `fileId.start`, `date.start` and
-`timeMs.start`. The file IDs therefore need to sort in the correct order
-(again see below for an example). I zero left padded the log file names
-within the data frame and use them as an identifier. These “file names”
-do not correspond exactly to the original raw log file names.
-This needs to be kept in mind when doing any kind of matching
-etc.
-
-    ## what it looked like before left padding
-    # 1422 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log  2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
-    # 1423 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
-    # 1424 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    677 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
-    # 1425 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
-    # 1426 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_12_57.log 2016-12-15 12:12:57    850 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
-    # 1427 ../data/haum_logs_2016-2023/_2016b/2016_11_15-12_2_57.log  2016-12-15 12:12:57 599916 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
-
-    ## what it looks like now
-    # 1422 2016_11_15-12_02_57.log 2016-12-15 12:12:56 599671 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26874254
-    # 1423 2016_11_15-12_02_57.log 2016-12-15 12:12:57 599916 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997771 13.26523465
-    # 1424 2016_11_15-12_12_57.log 2016-12-15 12:12:57    621 Transform start 076 076.xml NA 2092.25 2008.00 0.3000000 13.26523465
-    # 1425 2016_11_15-12_12_57.log 2016-12-15 12:12:57    677 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997736 13.26239605
-    # 1426 2016_11_15-12_12_57.log 2016-12-15 12:12:57    774 Transform start 076 076.xml NA 2092.25 2008.00 0.2999345 13.26239605
-    # 1427 2016_11_15-12_12_57.log 2016-12-15 12:12:57    850 Transform stop  076 076.xml NA 2092.25 2008.00 0.2997107 13.26223362
-
-## Timestamps repeat
-
-The timestamps in the `date` variable record year, month, day, hour,
-minute and seconds.
-Since one second is not a very short time interval
-for a move on a touch display, this is not fine-grained enough to bring
-events into the correct order, meaning there are events from the same
-log file having the same timestamp and even events from different log
-files having the same timestamp. The log files get written about every
-10 minutes (which can easily be seen when looking at the file names of
-the raw log files). So in order to get events in the correct order, it
-is necessary to first order by file ID, then within file ID sort by
-timestamp `date`, and then within these more coarse-grained timestamps
-sort by `timeMs`. But as explained above, `timeMs` can only be sorted
-within one file ID, since the values do not increase consistently over
-log files, but have a new offset for each raw log file.
-
-## x,y-coordinates outside of display range
-
-The display of the Multi-Touch-Table is a 4K display with 3840 x 2160
-pixels. When you plot the start and stop coordinates, the display is
-clearly distinguishable. However, a lot of points are outside of the
-display range. This can happen when the art objects are scaled and then
-moved to the very edge of the table. Then pixels outside of the table
-are recorded. These are actually valid data points and I will leave them
-as is.
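To get a feeling for how many points fall outside the display, one could count them directly. The sketch below uses a small made-up data frame `toy` in place of `datlogs`, and the helper `outside_display()` is a hypothetical name introduced for this example; only the display size of 3840 x 2160 pixels is taken from the text above.

```r
# Made-up coordinates standing in for datlogs (illustrative values only)
toy <- data.frame(
  x.start = c(100, 3900, 2000, -15),
  y.start = c(50, 1000, 2200, 300),
  x.stop  = c(120, 3800, 2000, 10),
  y.stop  = c(60, 1000, 2100, 300)
)

# Hypothetical helper: TRUE if a point lies outside the 4K display
outside_display <- function(x, y, width = 3840, height = 2160) {
  x < 0 | x > width | y < 0 | y > height
}

out_start <- outside_display(toy$x.start, toy$y.start)
out_stop  <- outside_display(toy$x.stop, toy$y.stop)

# Share of events starting or stopping outside the display
mean(out_start)
mean(out_stop)
```

If the coordinate columns of the real data contain `NA`, `mean(..., na.rm = TRUE)` would be needed instead.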
-
-``` r
-datlogs <- read.table("code/results/event_logfiles_2024-02-21_16-07-33.csv", sep = ";",
-                      header = TRUE)
-
-par(mfrow = c(1, 2))
-plot(y.start ~ x.start, datlogs)
-abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
-plot(y.stop ~ x.stop, datlogs)
-abline(v = c(0, 3840), h = c(0, 2160), col = "blue", lwd = 2)
-```
-
-![](README_files/figure-gfm/xycoord-1.png)
-
-``` r
-aggregate(cbind(x.start, x.stop, y.start, y.stop) ~ 1, datlogs, mean)
-```
-
-    ##    x.start   x.stop  y.start   y.stop
-    ## 1 1978.202 1975.876 1137.481 1133.494
-
-## Pop-ups from glossar cannot be assigned to a specific item
-
-All the information, pictures and texts for the topics and pop-ups are
-stored in
-`/data/haum/ContentEyevisit/eyevisit_cards_light/`. Among
-other things, each folder contains XML-files with the information about
-any technical terms that can be opened from the hypertexts on the topic
-cards. Often this information is item dependent and then the
-corresponding XML-file is in the folder for this item. Sometimes,
-however, more general terms can be opened. In order to avoid multiple
-files containing the same information, these were stored in a folder
-called `glossar` and get accessed from there. The raw log files only
-contain the path to this glossar entry and did not record from which
-item it was accessed. I tried to assign these glossar entries to the
-correct items. The (very heuristic) approach was this:
-
-1.  Create a lookup table with all XML-file names (possible pop-ups)
-    from the glossar folder and what items possibly call them. This was
-    stored as an `RData` object for easier handling but should maybe be
-    stored in a more interoperable format.
-
-2.  I went through all possible pop-ups in this lookup table and stored
-    the items that are associated with it.
-
-3.  I created a sub data frame without move events (since they can never
-    be associated with a pop-up) and went through every line and looked
-    up if an item and a topic card had been opened. If this was the case
-    and a glossar entry came up before the item was closed again, I
-    assigned this item to the glossar entry.
-
-This is heuristic since it is possible that several topic cards from
-different items are opened simultaneously and the glossar pop-up could
-be opened from either one (it could even be more than two, of course).
-In these cases the item that was opened closest to the glossar pop-up
-has been assigned, but this can never be completely error free.
-
-In addition, this heuristic only assigns a little more than half of the
-glossar entries. Since it only looks at the last item that has been
-opened and checks whether this item is a possible candidate, it misses
-all glossar pop-ups where another item has been opened in between.
-Writing a more elaborate algorithm is still an open TODO.
-
-All glossar pop-ups that do not get matched with an item are removed
-from the data set with a warning if the argument `glossar = TRUE` is
-set. Otherwise the glossar entries will be ignored completely.
-
-## Assign a `case` variable based on “time heuristic”
-
-One thing needed in order to work with the data set and use it for
-machine learning algorithms like process mining is a variable that
-tries to identify a case. A case variable will structure the data frame
-in a way that navigation behavior can actually be investigated. However,
-we do not know if several people are standing around the table
-interacting with it or just one very active person. The simplest way to
-define a case variable is to just use a time limit between events. This
-means that when the table has not been interacted with for, e.g., 20
-seconds, then it is assumed that a person moved on and a new person
-started interacting with the table.
-This is the easiest heuristic and
-the one implemented at the moment. Process mining shows that this simple
-approach works in a way that the correct process gets extracted by the
-algorithm.
-
-In order to investigate user behavior on a more fine-grained level, it
-will be necessary to come up with a more elaborate approach. A better,
-but still simple, approach could be to use this kind of time limit and
-additionally look at the distance between items interacted with within
-one time window. When items are far apart it seems plausible that more
-than one person interacted with them. Very short time lapses between
-events on different items could also be an indicator that more than one
-person is interacting with the table.
-
-## Assign a `path` variable
-
-The `path` variable is supposed to show one interaction trace with one
-artwork. That is, it starts when an artwork is touched or flipped and
-stops when it is closed again. It is easy to assign a path from flipping
-a card, through opening (maybe several) topics and pop-ups for this
-artwork card, until closing this card again. But one would like to
-assign the same path to move events surrounding this interaction. Again,
-this is not possible in an algorithmic way but only heuristically.
-
-Again, I used a time cutoff for this. First, if a `move` event occurs,
-it is checked whether the same item has been flipped less than 20
-seconds beforehand. If yes, the same path indicator is assigned to this
-`move`. If not, temporarily a new “move indicator” is assigned. Then, a
-“backward pass” is applied, where it is checked if the same item is
-opened less than 20 seconds *after* the event occurs. If yes, that path
-indicator is assigned. For all the remaining moves, a new path number is
-assigned. This corresponds to items being moved without being flipped.
-
-## A `move` event does not record any change
-
-Most of the events in the log files are move events.
-Additionally, many
-of these move events are recorded but they do not indicate any change,
-meaning the only difference is the timestamp. All other variables
-indicating moves like `x.start` and `x.stop`, `rotation.start` and
-`rotation.stop` etc. do not show *any* change. They represent about 2/3
-of all move events. These events are probably short touches of the table
-without an actual interaction. They were therefore removed from the data
-set.
-
-## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
-
-In the beginning I thought that the number for topics was the index of
-where the card was presented on the back of the item. But this is not
-correct. It is the number of the topic. There are eight topics in total:
-
-    Indices for topics:
-    0 artist
-    1 thema
-    2 komposition
-    3 leben des kunstwerks
-    4 details
-    5 licht und farbe
-    6 extra info
-    7 technik
-
-On the back of items, there can be between 2 and 6 topic cards. Several
-of these topic cards can be about the same topic, e.g., there can be two
-topic cards assigned to the topic `thema`. It is impossible to find out
-if the same topic card was opened several times or if different topic
-cards with the same topic were opened from the same item. See the
-example below for item “001”.
-
-    ##   item            file_name                topic
-    ## 1  001 001_dargestellte.xml                thema
-    ## 2  001       001_thema1.xml                thema
-    ## 3  001        001_leben.xml leben des kunstwerks
-    ## 4  001       001_leben3.xml leben des kunstwerks
-    ## 5  001       001_thema2.xml                thema
-    ## 6  001        001_thema.xml                thema
-
-## New artworks “504” and “505” starting October 2022
-
-When I read in the complete data frame for the first time, all of a
-sudden there were 72 instead of 70 items. It seems like these two
-artworks first appear on October 21, 2022.
-
-``` r
-summary(as.Date(datraw[datraw$item %in% c("504", "505"), "date"]))
-```
-
-    ##         Min.      1st Qu.       Median         Mean      3rd Qu.         Max.
-    ## "2022-10-21" "2023-01-11" "2023-03-08" "2023-03-09" "2023-05-21" "2023-07-05"
-
-The artworks seem to have been updated in general after October 21, 2022.
-The following table shows which items were presented in which years.
-
-``` r
-xtabs(~ item + lubridate::year(date.start), datlogs)
-```
-
-    ##      lubridate::year(date.start)
-    ## item  2016  2017  2018  2019  2020  2022  2023
-    ## 1      277  4082  1912  1434   424   394  1315
-    ## 3      485  6730  3126  2356   528   457  1124
-    ## 19     714  8656  4028  2743   660   698  1595
-    ## 20     595  8461  3996  2983   938   657  1355
-    ## 24     497  6638  2912  2251   649   439  1028
-    ## 27     567  5959  3112  2318   651   711  1324
-    ## 28     601  9329  4394  3056   778   762  1570
-    ## 29     425  6865  3830  2365   516   615  1174
-    ## 31     289  4118  2051  1218   291   296   675
-    ## 32     562  7016  3477  2253   726   766  1647
-    ## 33     509  4936  2242  1449   555   358   666
-    ## 36     434  4505  2276  1668   373   387   976
-    ## 37     242  4478  2182  1554   339   423  1168
-    ## 38     480  4617  2144  1397   371   381   784
-    ## 39     395  3227  1313  1003   237   161   622
-    ## 41     282  3329  1303  1022   225   209   701
-    ## 42     203  3113  1307   903   242   191   421
-    ## 43     115  2420  1089   806   176   219   486
-    ## 45    1491 13561  5924  4474   966   585  1828
-    ## 46     903  9181  5340  3812   961   944  1648
-    ## 47     306  4949  2395  1510   750   297   675
-    ## 48     723 10455  5384  4162  1328   948  2031
-    ## 49     433  4326  2124  1414   434   431   809
-    ## 51     564  7837  4577  2991   884   659  1370
-    ## 52     447  5021  2104  1729   471   349   840
-    ## 54     424  5068  2816  2008   529   370   918
-    ## 55     358  4859  2069  1428   341   403  1303
-    ## 57     860 14264  6625  5092  1410  1221  2714
-    ## 60     555  6865  3539  2336   639   586  1415
-    ## 62     547  6736  3803  2210   795   633  1322
-    ## 63     251  3677  1827  1241   300   282   527
-    ## 66     552  6004  2774  1977   505   373   932
-    ## 69     394  3730  1827  1438   272   206   680
-    ## 70     226  3766  1843   973   293   268   703
-    ## 71     557  6160  2490  1846   570   323   839
-    ## 72     426  6194  2857  2129   508   635  1553
-    ## 73     432  6125  2880  1821   583   395   939
-    ## 75     258  5885  2418  1562   369   257   645
-    ## 76     861 12435  6253  4214  1753  1153  2268
-    ## 77     816  8595  4197  2897   699   674  1452
-    ## 78     410  5632  2498  1924   394   408   850
-    ## 80    1650 25687 12429  7782  1975  1712  4433
-    ## 83     644  8618  4720  3026   987  1027  2294
-    ## 84     184  2121  1231   759   231   254   465
-    ## 87     149  1618   722   632    99     0     0
-    ## 88     513  6996  3493  2272   539   533  1420
-    ## 89     214  2204   950   723   156     0     0
-    ## 90     281  3756  1372  1143   403   320   932
-    ## 93     613  8528  4224  3015   696  1174  2058
-    ## 98     462  6662  3265  2565   704   670  1453
-    ## 99     180  4162  1653  1454   363   411   868
-    ## 101    414  4209  1859  1282   392   411   981
-    ## 103    677  8758  4366  3165  1045   909  1871
-    ## 104    423  5256  2381  1865   463   467   933
-    ## 107    181  2101  1106   788   205   146   339
-    ## 109    321  4001  1619  1106   292   188   453
-    ## 110    489  5846  2785  2008   494   387   923
-    ## 125    640  8435  4519  3334   926     0     0
-    ## 129    598 11322  5046  3369   910  1131  1682
-    ## 145    419  7821  3945  2694   706   740  1396
-    ## 176    507  8465  3968  2787   687   552  1544
-    ## 180    516  7563  3720  2765   585   550  1272
-    ## 183    377  4014  1819  1741   346   251   675
-    ## 187    340  4222  2165  1753   319   312   734
-    ## 197    426  7710  3603  2510   671   602  1217
-    ## 229    303  4872  2360  1891   482   389  1005
-    ## 231    271  3606  1851  1239   318   236   467
-    ## 501   1915 15968  7849  5060  1157   890  2989
-    ## 502   1212 14550  7111  4749  1105   883  2752
-    ## 503   1308 15218  8632  6399  1626   870  2558
-    ## 504      0     0     0     0     0   363   662
-    ## 505      0     0     0     0     0   426  1533
-
-It shows that the artworks have been updated after the Corona pandemic.
-I think the table was also moved to a different location at that point.
diff --git a/README_files/figure-gfm/timems-1.png b/README_files/figure-gfm/timems-1.png
deleted file mode 100644
index f08b70a..0000000
Binary files a/README_files/figure-gfm/timems-1.png and /dev/null differ
diff --git a/README_files/figure-gfm/xycoord-1.png b/README_files/figure-gfm/xycoord-1.png
deleted file mode 100644
index d72a279..0000000
Binary files a/README_files/figure-gfm/xycoord-1.png and /dev/null differ