Updated README after many new insights
parent
4973ceb31b
commit
b81a3e984c
267
README.Rmd
@ -66,9 +66,9 @@ data when further processed and are only in the raw log files.
|
||||
|
||||
The first step is to parse the raw log files that are stored by the
|
||||
application as text files in a rather unstructured format to a format that
|
||||
is better handled. The data are therefore transferred to a spread sheet
|
||||
format. The following section describes what problems were encountered
|
||||
while doing this.
|
||||
can be read by common statistics software packages. The data are therefore
|
||||
transferred to a spread sheet format. The following section describes what
|
||||
problems were encountered while doing this.
|
||||
|
||||
## Corrupt lines
|
||||
|
||||
@ -77,9 +77,9 @@ that says
|
||||
|
||||
```
|
||||
Warning messages:
|
||||
incomplete final line found on '_2016/2016_11_18-11_31_0.log'
|
||||
incomplete final line found on '_2016/2016_11_18-11_38_30.log'
|
||||
incomplete final line found on '_2016/2016_11_18-11_40_36.log'
|
||||
incomplete final line found on '2016/2016_11_18-11_31_0.log'
|
||||
incomplete final line found on '2016/2016_11_18-11_38_30.log'
|
||||
incomplete final line found on '2016/2016_11_18-11_40_36.log'
|
||||
...
|
||||
```
|
||||
|
||||
@ -88,17 +88,143 @@ content. It is unclear why and how this happens. So when reading the data,
|
||||
these lines were removed. A warning will be given that indicates how many
|
||||
files have been affected.
|
||||
|
||||
## Units of the variables
|
||||
## Extracted variables from raw log files
|
||||
|
||||
* What unit do x and y have? Pixels? --> yes
* What unit does scale have? --> some kind of bit, does not matter when
calculating a ratio
* Is rotation really in degrees? --> yes
* After what time interval does the table reset itself to its initial
configuration? --> PM needs to look it up
|
||||
The following variables (columns in the data frame) are extracted from the
|
||||
raw log file:
|
||||
|
||||
* `fileId`: Containing the zero-left-padded file name of the raw log file
|
||||
the data line has been extracted from
|
||||
|
||||
* `folder`: The names of the folders in which the raw log files have been
organized. For the HAUM data set, the data are sorted by year (folders
|
||||
2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
|
||||
|
||||
* `date`: Extracted time stamp from the raw log file in the format
|
||||
`yyyy-mm-dd hh:mm:ss`.
|
||||
|
||||
* `timeMs`: Containing a time stamp in milliseconds that restarts with
every new raw log file.
|
||||
|
||||
* `event`: Start and stop event tags. See above for possible values.
|
||||
|
||||
* `artwork`: Identifier of the different artworks. This is a 3 digit
|
||||
(left-padded) number. The numbers of the artworks correspond to the
|
||||
folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
|
||||
originally taken from the museum's catalogue.
|
||||
|
||||
* `popup`: Name of the pop-up opened. This is only interesting for
|
||||
"openPopup" events.
|
||||
|
||||
* `topicNumber`: The number of the topic card that has been opened at the back of
|
||||
the artwork card. See below for a more detailed description of what these
|
||||
numbers possibly mean.
|
||||
|
||||
* `x`: Value of x-coordinate in pixel on the 4K-Display ($3840 \times 2160$)
|
||||
|
||||
* `y`: Value of y-coordinate in pixel
|
||||
|
||||
* `scale`: Number in 128 bit that indicates how much the artwork card has
|
||||
been scaled (????)
|
||||
|
||||
* `rotation`: Degree of rotation in start configuration.
|
||||
|
||||
<!-- TODO: After what time interval does the table reset itself to its
initial configuration? -> PM needs to look it up -->
|
||||
|
||||
## Variables after "closing of events"
|
||||
|
||||
The raw log data consists of start and stop events for each event type.
|
||||
After preprocessing, four event types are extracted: `move`, `flipCard`,
|
||||
`openTopic`, and `openPopup`. Except for the `move` events, which can occur
|
||||
at any time when interacting with an artwork card on the table, the events
|
||||
have a hierarchical order: An artwork card first needs to be flipped
|
||||
(`flipCard`), then the topic cards on the back of the card can be opened
|
||||
(`openTopic`), and finally pop-ups on these topic cards can be opened
|
||||
(`openPopup`). This implies that the event `openPopup` can only be present
|
||||
for a certain artwork, if the card has already been flipped (i.e., an event
|
||||
`flipCard` for the same artwork has already occurred).
|
||||
|
||||
After preprocessing, the data frame is now in a wide format with columns
|
||||
for the start and the stop of each event and contains the following
|
||||
variables:
|
||||
|
||||
* `folder`: Containing the folder name (see above)
|
||||
|
||||
* `eventId`: A numerical variable that indicates the number of the event.
|
||||
Starts at 1 and ends with the total number of events, counting up by 1.
|
||||
|
||||
* `case`: A numerical variable indicating cases in the data. A "case"
|
||||
indicates an interaction interval and could be defined in different ways.
|
||||
Right now, a new case begins when no event has occurred for 20 seconds.
|
||||
|
||||
* `trace`: A trace is defined as one interaction with one artwork. A trace
|
||||
can either start with a `flipCard` event or when an artwork has been
|
||||
touched for the first time within this case. A trace ends with the
|
||||
artwork card being flipped closed again or with the last movement of the
card within this case. One case can contain several traces with the same
artwork when the artwork is flipped open and flipped closed again several
|
||||
times within a short time.
|
||||
|
||||
* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
|
||||
has been opened from the glossar folder. These pop-ups can be assigned to
|
||||
the wrong artwork since it is not possible to do this algorithmically.
|
||||
It is possible that two artworks are flipped open that could both link to
|
||||
the same popup from a glossar. The indicator variable is left as a
|
||||
variable, so that these pop-ups can be easily deleted from the data.
|
||||
Right now, glossar entries can be ignored completely by setting an
|
||||
argument and this is done by default. Using the pop-ups from the glossar
|
||||
will need a lot more love, before it behaves satisfactorily.
|
||||
|
||||
* `event`: Indicating the event. Can take the values `move`, `flipCard`,
|
||||
`openTopic`, and `openPopup`.
|
||||
|
||||
* `artwork`: Identifier of the different artworks. This is a 3 digit
|
||||
(left-padded) number. See above.
|
||||
|
||||
* `fileId.start` / `fileId.stop`: See above.
|
||||
|
||||
* `date.start` / `date.stop`: See above.
|
||||
|
||||
* `timeMs.start` / `timeMs.stop`: See above.
|
||||
|
||||
* `duration`: Calculated by $timeMs.stop - timeMs.start$ in milliseconds.
Needs to be adjusted for events spanning more than one log file by adding
$600,000 \times (\#logfiles - 1)$. See below for details.
|
||||
|
||||
* `topicNumber`: See above.
|
||||
|
||||
* `popup`: See above.
|
||||
|
||||
* `x.start` / `x.stop`: See above.
|
||||
|
||||
* `y.start` / `y.stop`: See above.
|
||||
|
||||
* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and $(x.stop, y.stop)$.
|
||||
|
||||
* `scale.start` / `scale.stop`: See above.
|
||||
|
||||
* `scaleSize`: Relative scaling of artwork card, calculated by
|
||||
$\frac{scale.stop}{scale.start}$.
|
||||
|
||||
* `rotation.start` / `rotation.stop`: See above.
|
||||
|
||||
* `rotationDegree`: Difference of rotation from $rotation.stop$ to
|
||||
$rotation.start$.
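
The derived variables in the list above can be sketched in a few lines of
R. This is only an illustration on a made-up two-row data frame, not the
code used by the package. The sign convention for `rotationDegree` (stop
minus start) is an assumption, and `duration` is shown only for events
that stay within a single log file.

```r
# Toy data frame with the start/stop columns described above (values invented)
dat2 <- data.frame(
  timeMs.start   = c(1000, 5000), timeMs.stop   = c(4000, 5500),
  x.start        = c(0, 100),     y.start       = c(0, 100),
  x.stop         = c(3, 100),     y.stop        = c(4, 100),
  scale.start    = c(1, 2),       scale.stop    = c(2, 1),
  rotation.start = c(0, 90),      rotation.stop = c(45, 90)
)

# Duration in milliseconds (valid only within one log file)
dat2$duration <- dat2$timeMs.stop - dat2$timeMs.start

# Euclidean distance between start and stop position in pixels
dat2$distance <- sqrt((dat2$x.stop - dat2$x.start)^2 +
                      (dat2$y.stop - dat2$y.start)^2)

# Relative scaling of the artwork card
dat2$scaleSize <- dat2$scale.stop / dat2$scale.start

# Change in rotation (assumed here: stop minus start)
dat2$rotationDegree <- dat2$rotation.stop - dat2$rotation.start
```

A `scaleSize` above 1 then means the card was enlarged, below 1 that it
was shrunk.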
|
||||
|
||||
## How unclosed events are handled
|
||||
|
||||
Events do not necessarily need to be completed. A person can, e.g., leave
|
||||
the table and not flip the artwork card close again. For `flipCard`,
|
||||
`openTopic`, and `openPopup` the data frame contains `NA` when the event
|
||||
does not complete. For `move` events it happens quite often that a start
event follows a start event and a stop event follows a stop event.
Technically a move event cannot *not* be finished, and the number of events
without a start or stop indicates that the time resolution was not
sufficient to catch all these events accurately. Double start and stop
`move` events have therefore been deleted from the data set.
|
||||
|
||||
<!--
|
||||
## How a case is defined
|
||||
|
||||
* Find out whether more than one person is standing at the table?
|
||||
@ -108,17 +234,40 @@ files have been affected.
|
||||
extract automatically? What is my definition of
"interaction boost"?
|
||||
- No matter how we do it, does it work on the "event log data"?
|
||||
-->
|
||||
|
||||
## Additional meta data
|
||||
|
||||
* Enrich the log data with further metadata? What would be interesting?
|
||||
For the HAUM data, I added meta data on state holidays and school
|
||||
vacations. Additionally, the topic categories of the topic cards were
|
||||
extracted from the XML files and added to the data frame.
|
||||
|
||||
This led to the following additional variables:
|
||||
|
||||
* `topicIndex`
|
||||
|
||||
* `topicFile`
|
||||
|
||||
* `topic`
|
||||
|
||||
* `state` (Niedersachsen for complete HAUM data set)
|
||||
|
||||
* `stateCode` (NI)
|
||||
|
||||
* `holiday`
|
||||
|
||||
* `vacations`
|
||||
|
||||
* `stateCodeVacations`
|
||||
|
||||
<!--
|
||||
- Metadata on artworks like, name, artist, type of artwork, epoch, etc.
|
||||
- School vacations and holidays
|
||||
- Special exhibits at the museum
|
||||
- Number of visitors per day (follow up with Sven again?)
|
||||
- Age structure of visitors per day?
|
||||
- ... ????
|
||||
-->
|
||||
|
||||
# Problems and how I handled them
|
||||
|
||||
@ -129,10 +278,14 @@ made.
|
||||
|
||||
## Weird behavior of `timeMs` and neg. `duration` values
|
||||
|
||||
I think the negative duration values happen, when an event starts in one
|
||||
log file and completes in another one. The variable `timeMs` seems to be
|
||||
continuous within one log file but not over several log files.
|
||||
|
||||
`timeMs` resets itself every time a new log file starts. This means that
|
||||
the durations of events spanning more than one log file must be adjusted.
|
||||
Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start`
must be subtracted from the maximum duration of the log file where the
event started ($600,000$ ms) and `timeMs.stop` must be added. If the
event spans more than two log files, a multiple of $600,000$ must be used,
e.g., for three log files it must be
$2 \times 600,000 - timeMs.start + timeMs.stop$, and so on.
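
As a minimal sketch of this adjustment, assuming that every log file
covers exactly 10 minutes and that the number of log files an event
touches is already known: the function name `adjust_duration` and the
argument `n_files` are made up for this illustration and are not part of
the package.

```r
# Assumed length of one log file in milliseconds (10 minutes)
LOG_MS <- 600000

# Duration of an event that starts in one log file and stops in a later one.
# n_files = number of log files the event touches (1 = same file).
adjust_duration <- function(timeMs.start, timeMs.stop, n_files) {
  (n_files - 1) * LOG_MS - timeMs.start + timeMs.stop
}

adjust_duration(1000, 4000, n_files = 1)    # same file: plain difference
adjust_duration(590000, 5000, n_files = 2)  # event crosses one file boundary
adjust_duration(590000, 5000, n_files = 3)  # event crosses two file boundaries
```

For `n_files = 1` the expression reduces to the plain difference
$timeMs.stop - timeMs.start$, so the same function could be applied to all
events.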
|
||||
|
||||
```{r, results = FALSE, fig.show = TRUE}
|
||||
# Read data
|
||||
@ -154,34 +307,22 @@ glossar_dict <- create_glossardict(artworks, glossar_files,
|
||||
dat1 <- add_trace(dat, glossar_dict)
|
||||
|
||||
# Close events
|
||||
dat2 <- rbind(close_events(dat1, "move"),
|
||||
close_events(dat1, "flipCard"),
|
||||
close_events(dat1, "openTopic"),
|
||||
close_events(dat1, "openPopup"))
|
||||
dat2 <- dat2[order(dat2$date.start, dat2$fileId), ]
|
||||
dat2 <- rbind(close_events(dat1, "move", rm_nochange_moves = TRUE),
|
||||
close_events(dat1, "flipCard", rm_nochange_moves = TRUE),
|
||||
close_events(dat1, "openTopic", rm_nochange_moves = TRUE),
|
||||
close_events(dat1, "openPopup", rm_nochange_moves = TRUE))
|
||||
dat2 <- dat2[order(dat2$fileId.start, dat2$date.start, dat2$timeMs.start), ]
|
||||
|
||||
plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
|
||||
```
|
||||
|
||||
The boxplot shows that we have a continuous range of values within one log
|
||||
file but that `timeMs` does not increase over log files. Since it seems not
|
||||
possible to fix this in a consistent way, I set all durations to `NA` where
|
||||
`fileId.start` and `fileId.stop` are not identical. I kept `timeMs.start`
|
||||
and `timeMs.stop` and also `fileId.start` and `fileId.stop` in the data
|
||||
frame, so it is clear why there are no durations. The other
|
||||
file but that `timeMs` does not increase over log files. I kept
|
||||
`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop`
|
||||
in the data frame, so it is clear when events span more than one log file.
|
||||
|
||||
NOTE: Part of this problem was that time stamps that are part of the log
|
||||
file names are not zero-left-padded and therefore the files were not in the
|
||||
correct order when read into R. After zero-left-padding these file IDs and
sorting by them and then by `date.start` within each file, some of the
durations are fixed. Unfortunately, only three `move` events were fixed, since
this only resolves irregularities *within* one log file. See below for more
details.
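
To illustrate why the padding matters, here is a small sketch that pads
every numeric part of a file name to at least two digits. The helper
`pad_file_id` is hypothetical, and the assumed file name pattern
(`yyyy_mm_dd-hh_mm_s.log`) is taken from the warning messages shown
earlier; the actual implementation may differ.

```r
# Hypothetical helper: zero-left-pad all numeric parts of a log file name
pad_file_id <- function(x) {
  stem  <- sub("\\.log$", "", x)       # drop the extension
  parts <- strsplit(stem, "[_-]")      # year, month, day, hour, minute, second
  vapply(parts, function(p) {
    p <- sprintf("%02d", as.integer(p))
    paste0(p[1], "_", p[2], "_", p[3], "-", p[4], "_", p[5], "_", p[6], ".log")
  }, character(1))
}

files <- c("2016_11_18-11_38_30.log",
           "2016_11_18-11_38_4.log",
           "2016_11_18-11_31_0.log")

padded <- pad_file_id(files)
sort(padded)  # padded names sort chronologically as plain strings
```

Without padding, `"2016_11_18-11_38_30.log"` sorts before
`"2016_11_18-11_38_4.log"` in a character sort, although it was written 26
seconds later.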
|
||||
|
||||
UPDATE: By now I remove all events that span more than one log file. This
|
||||
lets me improve speed considerably.
|
||||
|
||||
UPDATE: Infos from Philipp:
|
||||
<!--
|
||||
Infos from Philipp:
|
||||
|
"I also just went through the code from back then. The logging works like
this: once the application starts, every 10 minutes a new log file
@ -194,6 +335,7 @@ on whether the table was, at the time of the new logging interval, in
use). So if a case is spread across 2+ logs, you have to add 10 minutes to
the duration for each log file after the first one, so that it fits."
|
||||
-->
|
||||
|
||||
## Left padding of file IDs
|
||||
|
||||
@ -277,7 +419,7 @@ technical terms that can be opened from the hypertexts on the topic cards.
|
||||
Often this information is artwork-dependent and then the corresponding
|
||||
XML-file is in the folder for this artwork. Sometimes, however, more
|
||||
general terms can be opened. In order to avoid multiple files containing
|
||||
the same informatione, these were stored in a folder called `glossar` and
|
||||
the same information, these were stored in a folder called `glossar` and
|
||||
get accessed from there. The raw log files only contain the path to this
|
||||
glossar entry and did not record from which artwork it was accessed. I
|
||||
tried to assign these glossar entries to the correct artworks. The (very
|
||||
@ -310,12 +452,13 @@ pop-ups where another artwork has been opened in between. This is still an
|
||||
open TODO to write a more elaborate algorithm.
|
||||
|
||||
All glossar pop-ups that do not get matched with an artwork are removed
|
||||
from the data set with a warning.
|
||||
from the data set with a warning if the argument `glossar = TRUE` is set.
|
||||
Otherwise the glossar entries will be ignored completely.
|
||||
|
||||
## Assign a `case` variable based on "time heuristic"
|
||||
|
||||
One thing needed in order to work with the data set and use it for machine
|
||||
learning algorithms like process mining is a variable that tries to
|
||||
learning algorithms like process mining, is a variable that tries to
|
||||
identify a case. A case variable will structure the data frame in a way
|
||||
that navigation behavior can actually be investigated. However, we do not
|
||||
know if several people are standing around the table interacting with it or
|
||||
@ -329,9 +472,9 @@ gets extracted by the algorithm.
|
||||
|
||||
In order to investigate user behavior on a more fine grained level, it will
|
||||
be necessary to come up with a more elaborate approach. A better, still
|
||||
simple approach could be to use this kind of time limit and additionally
|
||||
simple approach, could be to use this kind of time limit and additionally
|
||||
look at the distance between artworks interacted with within one time
|
||||
window. When artworks are far apart is seems plausible that more than one
|
||||
window. When artworks are far apart it seems plausible that more than one
|
||||
person interacted with them. Very short time lapses between events on
|
||||
different artworks could also be an indicator that more than one person is
|
||||
interacting with the table.
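
The simple time heuristic (a new case starts when no event has occurred
for 20 seconds) can be sketched as follows. The function `assign_case` and
the example time stamps are made up for this illustration; the distance
criterion discussed above is not included.

```r
# Assign a running case number: a new case starts whenever the gap to the
# previous event is larger than `gap` seconds. Expects sorted time stamps.
assign_case <- function(time, gap = 20) {
  time <- as.numeric(time)             # POSIXct -> seconds
  stopifnot(!is.unsorted(time))
  cumsum(c(TRUE, diff(time) > gap))
}

# Six invented events, given as seconds after an arbitrary start time
t0 <- as.POSIXct("2016-11-18 11:31:00", tz = "UTC")
events <- t0 + c(0, 5, 12, 40, 45, 100)

assign_case(events)
```

Here the gaps of 28 and 55 seconds each open a new case, so the six events
fall into three cases.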
|
||||
@ -348,7 +491,7 @@ algorithmic way but only heuristically. I used the `case` variable in order
|
||||
to get meaningful units around the artworks.
|
||||
|
||||
If within one case only a single trace for a single artwork was opened, I
|
||||
assigned this trace to the moves associated with this artwork. I (quite
|
||||
assigned this trace to the moves associated with this artwork. It (quite
|
||||
often) happens that within one case one artwork is opened and closed
|
||||
several times, each time starting a new trace. I then assigned all the
|
||||
following move events to the trace beforehand. This is, of course,
|
||||
@ -401,6 +544,16 @@ dim(dat2[is.na(dat2$date.start), ])
|
||||
dat2 <- dat2[!is.na(dat2$date.start), ]
|
||||
```
|
||||
|
||||
In order to deal with these logging errors, I check the data for what I
|
||||
call "fragmented traces". These are traces that cannot happen, when
|
||||
everything is logged correctly, e.g., traces containing `flipCard ->
|
||||
openPopup` or traces that only consist of `move`, `openTopic`, and
|
||||
`openPopup` events. These fragmented traces are removed from the data. It
|
||||
was not possible to check them all manually, but the 20 or more that I did
|
||||
check in the raw log files were all some kind of logging error like above.
|
||||
Most often a card was already closed again, before a topic card or pop-up
|
||||
was recorded as being closed.
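
A rough sketch of such a check, under the assumption that a trace is
fragmented if it contains `openTopic` or `openPopup` events without a
`flipCard`, or an `openPopup` without any `openTopic`. The function
`is_fragmented` and the toy log are invented for this illustration; the
rules actually used may be stricter.

```r
# Returns TRUE if the event sequence of one trace cannot occur when
# everything is logged correctly (under the assumptions stated above).
is_fragmented <- function(events) {
  ev <- events[events != "move"]
  if (length(ev) == 0) return(FALSE)   # traces with only moves are kept
  no_flip             <- !("flipCard" %in% ev)
  popup_without_topic <- ("openPopup" %in% ev) && !("openTopic" %in% ev)
  no_flip || popup_without_topic
}

# Toy log with three traces
log <- data.frame(
  trace = c(1, 1, 1, 2, 2, 3, 3),
  event = c("flipCard", "openTopic", "openPopup",  # complete
            "flipCard", "openPopup",               # pop-up without topic
            "move", "openTopic")                   # topic without flip
)

frag  <- tapply(log$event, log$trace, is_fragmented)
clean <- log[!log$trace %in% names(frag)[frag], ]
```

Only trace 1 survives in `clean`; traces 2 and 3 are dropped as
fragmented.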
|
||||
|
||||
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
|
||||
|
||||
See `questions_number-of-cards.R` for more details.
|
||||
@ -411,13 +564,15 @@ displayed on the back of the card. I added an index giving the ordering in
|
||||
the index files.
|
||||
|
||||
The possible values in the variable `topicNumber` range from 0 to 7,
|
||||
however, not artwork has more than six different numbers. So I just renamed
|
||||
however, no artwork has more than six different numbers. So I just renamed
|
||||
those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
|
||||
to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
|
||||
assign topics and file names to the corresponding pop-ups. This needs to be
|
||||
cross checked with the programming, but seems the most plausible approach
|
||||
with my current knowledge.
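
The renumbering can be sketched like this, assuming it is done separately
for each artwork. The helper `renumber` and the toy data are made up for
this illustration.

```r
# Map the observed topic numbers to consecutive indices 1, 2, 3, ...
# NA values (e.g., for move events) stay NA.
renumber <- function(x) match(x, sort(unique(x)))

renumber(c(0, 1, 2, 4, 5, 6))

# Applied per artwork on a toy data frame
d <- data.frame(
  artwork     = c("001", "001", "001", "002", "002"),
  topicNumber = c(0, 4, 7, 2, 2)
)
d$topicIndex <- ave(d$topicNumber, d$artwork, FUN = renumber)
```

`match()` against the sorted unique values keeps the original ordering of
the numbers and only closes the gaps.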
|
||||
|
||||
<!-- TODO: Ask Philipp -->
|
||||
|
||||
## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
|
||||
|
||||
When I extract the topics from `index.html` I get different topics, than
|
||||
@ -479,9 +634,6 @@ It strongly suggests that the artworks haven been updated after the Corona
|
||||
pandemic. I think, the table was also moved to a different location at that
|
||||
point. (Check with PG to make sure.)
|
||||
|
||||
I need to get the XML files for "504" and "505" from PM in order to extract
|
||||
information on them for the metadata.
|
||||
|
||||
# Optimizing resources used by the code
|
||||
|
||||
After I started trying out the functions on the complete data set, it
|
||||
@ -492,12 +644,15 @@ frame (at least not on my laptop). The code is supposed to work "out of the
|
||||
box" for researchers, hence it *should* run on a regular (8 core) laptop.
|
||||
So, I changed the reshaping so that it is done in batches on subsets of the
|
||||
data for every `fileId` separately. This means that events that span over
|
||||
two raw log files cannot be closed and will then be removed from the data
|
||||
set. The functions warns about this, but it is a random process getting rid
|
||||
of these data and seems therefore not like a systematic problem. Another
|
||||
reason why this is not bad, is that durations cannot be calculated for
|
||||
events across log files anyways, because the time stamps do not increase
|
||||
systematically over log files (see above).
|
||||
two (or more) raw log files cannot be closed and will then be removed from
|
||||
the data set. The function warns about this, but it is a random process
|
||||
getting rid of these data and seems therefore not like a systematic
|
||||
problem. Another reason why this is not bad, is that durations cannot be
|
||||
calculated for events across log files anyways, because the time stamps do
|
||||
not increase systematically over log files (see above).
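
The batch idea can be illustrated on toy data: split the long-format log
by `fileId`, close events inside each piece, and combine the results. The
worker `close_in_file` below is a deliberately naive stand-in (it pairs
the i-th start with the i-th stop) and is not the package's
`close_events()`; the column names are assumptions.

```r
# Toy long-format log: one row per start/stop line
dat <- data.frame(
  fileId = c("a.log", "a.log", "a.log", "b.log"),
  type   = c("start", "stop", "start", "stop"),
  timeMs = c(100, 400, 900, 50)
)

# Hypothetical per-file worker: pair the i-th start with the i-th stop
# within one file; lines without a partner in the same file are dropped.
close_in_file <- function(d) {
  starts <- d$timeMs[d$type == "start"]
  stops  <- d$timeMs[d$type == "stop"]
  n <- min(length(starts), length(stops))
  data.frame(fileId       = rep(d$fileId[1], n),
             timeMs.start = starts[seq_len(n)],
             timeMs.stop  = stops[seq_len(n)])
}

batches <- lapply(split(dat, dat$fileId), close_in_file)
res     <- do.call(rbind, batches)  # fine for toy data, slow for large lists
```

The event that starts at 900 ms in `a.log` and stops in `b.log` is lost,
which is exactly the kind of cross-file event described above.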
|
||||
|
||||
UPDATE: By now, I close the events spanning more than one log file after
|
||||
this has been done.
|
||||
|
||||
I meant to put the lists back together with `do.call(rbind, some_list)` but
|
||||
this can also not handle big data sets. I therefore switched to
|
||||
|