Updated README after many new insights
parent
4973ceb31b
commit
b81a3e984c
269
README.Rmd
@ -22,7 +22,7 @@ layout. The table was installed at the institute in October 2016 and since
|
|||||||
November 2016 log files from interactions of visitors of the museum have
|
November 2016 log files from interactions of visitors of the museum have
|
||||||
been collected. These log files are in an unstructured format and cannot be
|
been collected. These log files are in an unstructured format and cannot be
|
||||||
easily analyzed. The purpose of the following document is to describe how
|
easily analyzed. The purpose of the following document is to describe how
|
||||||
the data haven been transformed and which decisions have been made a long
|
the data have been transformed and which decisions have been made along
|
||||||
the way.
|
the way.
|
||||||
|
|
||||||
# Data structure
|
# Data structure
|
||||||
@ -66,9 +66,9 @@ data when further processed and are only in the raw log files.
|
|||||||
|
|
||||||
The first step is to parse the raw log files that are stored by the
|
The first step is to parse the raw log files that are stored by the
|
||||||
application as text files in a rather unstructured format to a format that
|
application as text files in a rather unstructured format to a format that
|
||||||
is better handled. The data are therefore transferred to a spread sheet
|
can be read by common statistics software packages. The data are therefore
|
||||||
format. The following section describes what problems were encountered
|
transferred to a spreadsheet format. The following section describes what
|
||||||
while doing this.
|
problems were encountered while doing this.
|
||||||
|
|
||||||
## Corrupt lines
|
## Corrupt lines
|
||||||
|
|
||||||
@ -77,9 +77,9 @@ that says
|
|||||||
|
|
||||||
```
|
```
|
||||||
Warning messages:
|
Warning messages:
|
||||||
incomplete final line found on '_2016/2016_11_18-11_31_0.log'
|
incomplete final line found on '2016/2016_11_18-11_31_0.log'
|
||||||
incomplete final line found on '_2016/2016_11_18-11_38_30.log'
|
incomplete final line found on '2016/2016_11_18-11_38_30.log'
|
||||||
incomplete final line found on '_2016/2016_11_18-11_40_36.log'
|
incomplete final line found on '2016/2016_11_18-11_40_36.log'
|
||||||
...
|
...
|
||||||
```
|
```
|
||||||
|
|
||||||
@ -88,17 +88,143 @@ content. It is unclear why and how this happens. So when reading the data,
|
|||||||
these lines were removed. A warning will be given that indicates how many
|
these lines were removed. A warning will be given that indicates how many
|
||||||
files have been affected.
|
files have been affected.
|
||||||
|
|
||||||
## Units of the variables
|
## Extracted variables from raw log files
|
||||||
|
|
||||||
* What unit do x and y have? Pixels? --> yes
|
The following variables (columns in the data frame) are extracted from the
|
||||||
* What unit does scale have? --> some kind of bit, does not matter when
|
raw log file:
|
||||||
calculating a ratio
|
|
||||||
* Is rotation really in degrees? --> yes
|
* `fileId`: Containing the zero-left-padded file name of the raw log file
|
||||||
* After which time interval does the table reset itself to the
|
the data line has been extracted from
|
||||||
initial configuration? --> PM needs to look it up
|
|
||||||
|
* `folder`: The names of the folders in which the raw log files have been
|
||||||
|
organized. For the HAUM data set, the data are sorted by year (folders
|
||||||
|
2016, 2017, 2018, 2019, 2020, 2021, 2022, and 2023).
|
||||||
|
|
||||||
|
* `date`: Extracted time stamp from the raw log file in the format
|
||||||
|
`yyyy-mm-dd hh:mm:ss`.
|
||||||
|
|
||||||
|
* `timeMs`: Containing a time stamp in milliseconds that restarts with
|
||||||
|
every new raw log file.
|
||||||
|
|
||||||
|
* `event`: Start and stop event tags. See above for possible values.
|
||||||
|
|
||||||
|
* `artwork`: Identifier of the different artworks. This is a 3 digit
|
||||||
|
(left-padded) number. The numbers of the artworks correspond to the
|
||||||
|
folder names in `/ContentEyevisit/eyevisit_cards_light/` and were
|
||||||
|
originally taken from the museum's catalogue.
|
||||||
|
|
||||||
|
* `popup`: Name of the pop-up opened. This is only interesting for
|
||||||
|
"openPopup" events.
|
||||||
|
|
||||||
|
* `topicNumber`: The number of the topic card that has been opened at the back of
|
||||||
|
the artwork card. See below for a more detailed description of what these
|
||||||
|
numbers possibly mean.
|
||||||
|
|
||||||
|
* `x`: Value of x-coordinate in pixel on the 4K-Display ($3840 \times 2160$)
|
||||||
|
|
||||||
|
* `y`: Value of y-coordinate in pixel
|
||||||
|
|
||||||
|
* `scale`: Number in 128 bit that indicates how much the artwork card has
|
||||||
|
been scaled (????)
|
||||||
|
|
||||||
|
* `rotation`: Degree of rotation in start configuration.
|
||||||
|
|
||||||
|
<!-- TODO: After which time interval does the table reset itself to the
|
||||||
|
initial configuration? -> PM needs to look it up -->
|
||||||
|
|
||||||
|
## Variables after "closing of events"
|
||||||
|
|
||||||
|
The raw log data consists of start and stop events for each event type.
|
||||||
|
After preprocessing, four event types are extracted: `move`, `flipCard`,
|
||||||
|
`openTopic`, and `openPopup`. Except for the `move` events, which can occur
|
||||||
|
at any time when interacting with an artwork card on the table, the events
|
||||||
|
have a hierarchical order: An artwork card first needs to be flipped
|
||||||
|
(`flipCard`), then the topic cards on the back of the card can be opened
|
||||||
|
(`openTopic`), and finally pop-ups on these topic cards can be opened
|
||||||
|
(`openPopup`). This implies that the event `openPopup` can only be present
|
||||||
|
for a certain artwork, if the card has already been flipped (i.e., an event
|
||||||
|
`flipCard` for the same artwork has already occurred).
|
||||||
|
|
||||||
|
After preprocessing, the data frame is now in a wide format with columns
|
||||||
|
for the start and the stop of each event and contains the following
|
||||||
|
variables:
|
||||||
|
|
||||||
|
* `folder`: Containing the folder name (see above)
|
||||||
|
|
||||||
|
* `eventId`: A numerical variable that indicates the number of the event.
|
||||||
|
Starts at 1 and ends with the total number of events, counting up by 1.
|
||||||
|
|
||||||
|
* `case`: A numerical variable indicating cases in the data. A "case"
|
||||||
|
indicates an interaction interval and could be defined in different ways.
|
||||||
|
Right now, a new case begins when no event has occurred for 20 seconds.
|
||||||
|
|
||||||
|
* `trace`: A trace is defined as one interaction with one artwork. A trace
|
||||||
|
can either start with a `flipCard` event or when an artwork has been
|
||||||
|
touched for the first time within this case. A trace ends with the
|
||||||
|
artwork card being flipped closed again or with the last movement of the
|
||||||
|
card within this case. One case can contain several traces with the same
|
||||||
|
artwork when the artwork is flipped open and flipped closed again several
|
||||||
|
times within a short time.
|
||||||
|
|
||||||
|
* `glossar`: An indicator variable with values 0/1 that tracks if a pop-up
|
||||||
|
has been opened from the glossar folder. These pop-ups can be assigned to
|
||||||
|
the wrong artwork since it is not possible to do this algorithmically.
|
||||||
|
It is possible that two artworks are flipped open that could both link to
|
||||||
|
the same popup from a glossar. The indicator variable is left as a
|
||||||
|
variable, so that these pop-ups can be easily deleted from the data.
|
||||||
|
Right now, glossar entries can be ignored completely by setting an
|
||||||
|
argument and this is done by default. Using the pop-ups from the glossar
|
||||||
|
will need a lot more love, before it behaves satisfactorily.
|
||||||
|
|
||||||
|
* `event`: Indicating the event. Can take the values `move`, `flipCard`,
|
||||||
|
`openTopic`, and `openPopup`.
|
||||||
|
|
||||||
|
* `artwork`: Identifier of the different artworks. This is a 3 digit
|
||||||
|
(left-padded) number. See above.
|
||||||
|
|
||||||
|
* `fileId.start` / `fileId.stop`: See above.
|
||||||
|
|
||||||
|
* `date.start` / `date.stop`: See above.
|
||||||
|
|
||||||
|
* `timeMs.start` / `timeMs.stop`: See above.
|
||||||
|
|
||||||
|
* `duration`: Calculated by $timeMs.stop - timeMs.start$ in milliseconds.
|
||||||
|
Needs to be adjusted for events spanning more than one log file by
|
||||||
|
adding $600,000 \times (\#logfiles - 1)$. See below for details.
|
||||||
|
|
||||||
|
* `topicNumber`: See above.
|
||||||
|
|
||||||
|
* `popup`: See above.
|
||||||
|
|
||||||
|
* `x.start` / `x.stop`: See above.
|
||||||
|
|
||||||
|
* `y.start` / `y.stop`: See above.
|
||||||
|
|
||||||
|
* `distance`: Euclidean distance calculated from $(x.start, y.start)$ and $(x.stop, y.stop)$.
|
||||||
|
|
||||||
|
* `scale.start` / `scale.stop`: See above.
|
||||||
|
|
||||||
|
* `scaleSize`: Relative scaling of artwork card, calculated by
|
||||||
|
$\frac{scale.stop}{scale.start}$.
|
||||||
|
|
||||||
|
* `rotation.start` / `rotation.stop`: See above.
|
||||||
|
|
||||||
|
* `rotationDegree`: Difference of rotation from $rotation.stop$ to
|
||||||
|
$rotation.start$.
|
||||||
|
|
||||||
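The derived columns in the list above (`distance`, `scaleSize`, `rotationDegree`) could be computed roughly as in the following sketch. The toy data frame `ev` and its values are invented for illustration, and the sign convention for `rotationDegree` (stop minus start) is an assumption, since the text only speaks of a "difference".

```r
# Toy data with start/stop columns as described above; all values invented.
ev <- data.frame(
  x.start = c(100, 500), y.start = c(200, 500),
  x.stop  = c(400, 500), y.stop  = c(600, 500),
  scale.start = c(128, 128), scale.stop = c(256, 128),
  rotation.start = c(0, 10), rotation.stop = c(90, 10)
)

# Euclidean distance between start and stop position (in pixels)
ev$distance <- sqrt((ev$x.stop - ev$x.start)^2 + (ev$y.stop - ev$y.start)^2)

# Relative scaling; the unit of `scale` cancels out in the ratio
ev$scaleSize <- ev$scale.stop / ev$scale.start

# Assumption: rotation difference taken as stop minus start
ev$rotationDegree <- ev$rotation.stop - ev$rotation.start
```

Because `scaleSize` is a ratio, the unknown unit of `scale` does not matter for this variable.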
## How unclosed events are handled
|
## How unclosed events are handled
|
||||||
|
|
||||||
|
Events do not necessarily need to be completed. A person can, e.g., leave
|
||||||
|
the table and not flip the artwork card closed again. For `flipCard`,
|
||||||
|
`openTopic`, and `openPopup` the data frame contains `NA` when the event
|
||||||
|
does not complete. For `move` events it happens quite often that a start
|
||||||
|
event follows a start event and a stop event follows a stop event.
|
||||||
|
Technically a move event cannot *not* be finished and the number of events
|
||||||
|
without a start or stop indicates that the time resolution was not
|
||||||
|
sufficient to catch all these events accurately. Double start and stop
|
||||||
|
`move` events have therefore been deleted from the data set.
|
||||||
|
|
||||||
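One possible reading of the rule for double start and stop `move` events is sketched below: a start is kept only if a stop follows directly, and a stop only if a start precedes it directly. The tags `"start"` and `"stop"` are hypothetical stand-ins for the real event tags, and the sketch is not the package's actual implementation.

```r
# Hypothetical tags for the `move` events of one artwork, in chronological
# order; the real tags in the raw log files differ.
tag <- c("start", "start", "stop", "stop", "start", "stop")

nxt <- c(tag[-1], NA)              # tag of the following event
prv <- c(NA, tag[-length(tag)])    # tag of the preceding event

# Keep a start only if a stop follows, and a stop only if a start precedes
keep <- (tag == "start" & !is.na(nxt) & nxt == "stop") |
        (tag == "stop"  & !is.na(prv) & prv == "start")

which(keep)  # positions of the events that survive
```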
|
<!--
|
||||||
## How a case is defined
|
## How a case is defined
|
||||||
|
|
||||||
* Find out whether more than one person is standing at the table?
|
* Find out whether more than one person is standing at the table?
|
||||||
@ -108,17 +234,40 @@ files have been affected.
|
|||||||
extract automatically? What is my definition of
|
extract automatically? What is my definition of
|
||||||
"interaction boost"?
|
"interaction boost"?
|
||||||
- No matter how we do it, does it work on the "event log data"?
|
- No matter how we do it, does it work on the "event log data"?
|
||||||
|
-->
|
||||||
|
|
||||||
## Additional meta data
|
## Additional meta data
|
||||||
|
|
||||||
* Enrich the log data with further metadata? What would be interesting?
|
For the HAUM data, I added meta data on state holidays and school
|
||||||
|
vacations. Additionally, the topic categories of the topic cards were
|
||||||
|
extracted from the XML files and added to the data frame.
|
||||||
|
|
||||||
|
This led to the following additional variables:
|
||||||
|
|
||||||
|
* `topicIndex`
|
||||||
|
|
||||||
|
* `topicFile`
|
||||||
|
|
||||||
|
* `topic`
|
||||||
|
|
||||||
|
* `state` (Niedersachsen for complete HAUM data set)
|
||||||
|
|
||||||
|
* `stateCode` (NI)
|
||||||
|
|
||||||
|
* `holiday`
|
||||||
|
|
||||||
|
* `vacations`
|
||||||
|
|
||||||
|
* `stateCodeVacations`
|
||||||
|
|
||||||
|
<!--
|
||||||
- Metadata on artworks like, name, artist, type of artwork, epoch, etc.
|
- Metadata on artworks like, name, artist, type of artwork, epoch, etc.
|
||||||
- School vacations and holidays
|
- School vacations and holidays
|
||||||
- Special exhibits at the museum
|
- Special exhibits at the museum
|
||||||
- Number of visitors per day (follow up with Sven again?)
|
- Number of visitors per day (follow up with Sven again?)
|
||||||
- Age structure of visitors per day?
|
- Age structure of visitors per day?
|
||||||
- ... ????
|
- ... ????
|
||||||
|
-->
|
||||||
|
|
||||||
# Problems and how I handled them
|
# Problems and how I handled them
|
||||||
|
|
||||||
@ -129,10 +278,14 @@ made.
|
|||||||
|
|
||||||
## Weird behavior of `timeMs` and neg. `duration` values
|
## Weird behavior of `timeMs` and neg. `duration` values
|
||||||
|
|
||||||
I think the negative duration values happen, when an event starts in one
|
`timeMs` resets itself every time a new log file starts. This means that
|
||||||
log file and completes in another one. The variable `timeMs` seems to be
|
the durations of events spanning more than one log file must be adjusted.
|
||||||
continuous within one log file but not over several log files.
|
Instead of just calculating $timeMs.stop - timeMs.start$, `timeMs.start`
|
||||||
|
must be subtracted from the maximum duration of the log file where the
|
||||||
|
event started ($600,000 ms$) and the `timeMs.stop` must be added. If the
|
||||||
|
event spans more than two log files, a multiple of $600,000$ must be taken,
|
||||||
|
e.g. for three log files it must be: $2 \times 600,000 - timeMs.start +
|
||||||
|
timeMs.stop$ and so on.
|
||||||
|
|
||||||
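The adjustment described above can be written as a small helper. This is only a sketch: `adjust_duration()` and its argument `n_files` are hypothetical names, not functions of the package. It assumes that every raw log file covers exactly 10 minutes (600,000 ms).

```r
# Assumption: every raw log file covers 10 minutes (600,000 ms).
log_length_ms <- 600000

# Hypothetical helper, not part of the package: `n_files` is the number of
# raw log files the event spans (1 = starts and stops in the same file).
adjust_duration <- function(time_start, time_stop, n_files) {
  (n_files - 1) * log_length_ms - time_start + time_stop
}

adjust_duration(1000, 5000, 1)      # same file: plain difference
adjust_duration(590000, 5000, 2)    # event spans two log files
adjust_duration(590000, 5000, 3)    # event spans three log files
```

For an event within a single log file the helper reduces to $timeMs.stop - timeMs.start$.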
```{r, results = FALSE, fig.show = TRUE}
|
```{r, results = FALSE, fig.show = TRUE}
|
||||||
# Read data
|
# Read data
|
||||||
@ -154,34 +307,22 @@ glossar_dict <- create_glossardict(artworks, glossar_files,
|
|||||||
dat1 <- add_trace(dat, glossar_dict)
|
dat1 <- add_trace(dat, glossar_dict)
|
||||||
|
|
||||||
# Close events
|
# Close events
|
||||||
dat2 <- rbind(close_events(dat1, "move"),
|
dat2 <- rbind(close_events(dat1, "move", rm_nochange_moves = TRUE),
|
||||||
close_events(dat1, "flipCard"),
|
close_events(dat1, "flipCard", rm_nochange_moves = TRUE),
|
||||||
close_events(dat1, "openTopic"),
|
close_events(dat1, "openTopic", rm_nochange_moves = TRUE),
|
||||||
close_events(dat1, "openPopup"))
|
close_events(dat1, "openPopup", rm_nochange_moves = TRUE))
|
||||||
dat2 <- dat2[order(dat2$date.start, dat2$fileId), ]
|
dat2 <- dat2[order(dat2$fileId.start, dat2$date.start, dat2$timeMs.start), ]
|
||||||
|
|
||||||
plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
|
plot(timeMs ~ as.factor(fileId), dat[1:5000,], xlab = "fileId")
|
||||||
```
|
```
|
||||||
|
|
||||||
The boxplot shows that we have a continuous range of values within one log
|
The boxplot shows that we have a continuous range of values within one log
|
||||||
file but that `timeMs` does not increase over log files. Since it seems not
|
file but that `timeMs` does not increase over log files. I kept
|
||||||
possible to fix this in a consistent way, I set all durations to `NA` where
|
`timeMs.start` and `timeMs.stop` and also `fileId.start` and `fileId.stop`
|
||||||
`fileId.start` and `fileId.stop` are not identical. I kept `timeMs.start`
|
in the data frame, so it is clear when events span more than one log file.
|
||||||
and `timeMs.stop` and also `fileId.start` and `fileId.stop` in the data
|
|
||||||
frame, so it is clear why there are no durations. The other
|
|
||||||
|
|
||||||
NOTE: Part of this problem was that time stamps that are part of the log
|
<!--
|
||||||
file names are not zero-left-padded and therefore the files were not in the
|
Info from Philipp:
|
||||||
correct order when read into R. When zero left padding these file IDs and
|
|
||||||
sorting by them and then by `date.start` within, some of the durations are
|
|
||||||
exactly fixed. Unfortunately, only three `move` events were fixed, since it
|
|
||||||
only fixed irregularities *within* one log file. See below for more
|
|
||||||
details.
|
|
||||||
|
|
||||||
UPDATE: By now I remove all events that span more than one log file. This
|
|
||||||
lets me improve speed considerably.
|
|
||||||
|
|
||||||
UPDATE: Infos from Philipp:
|
|
||||||
|
|
||||||
"I also just went through the code from back then. The logging works
|
"I also just went through the code from back then. The logging works
|
||||||
like this: once the application starts, every 10 minutes a new log file
|
like this: once the application starts, every 10 minutes a new log file
|
||||||
@ -194,6 +335,7 @@ on whether the table was, at the time of the new logging interval, in
|
|||||||
use). So if a case is spread over 2+ logs, you have to add
|
use). So if a case is spread over 2+ logs, you have to add
|
||||||
10 minutes per log file after the first to the duration, so that
|
10 minutes per log file after the first to the duration, so that
|
||||||
it fits."
|
it fits."
|
||||||
|
-->
|
||||||
|
|
||||||
## Left padding of file IDs
|
## Left padding of file IDs
|
||||||
|
|
||||||
@ -277,7 +419,7 @@ technical terms that can be opened from the hypertexts on the topic cards.
|
|||||||
Often this information is artwork dependent and then the corresponding
|
Often this information is artwork dependent and then the corresponding
|
||||||
XML-file is in the folder for this artwork. Sometimes, however, more
|
XML-file is in the folder for this artwork. Sometimes, however, more
|
||||||
general terms can be opened. In order to avoid multiple files containing
|
general terms can be opened. In order to avoid multiple files containing
|
||||||
the same informatione, these were stored in a folder called `glossar` and
|
the same information, these were stored in a folder called `glossar` and
|
||||||
get accessed from there. The raw log files only contain the path to this
|
get accessed from there. The raw log files only contain the path to this
|
||||||
glossar entry and did not record from which artwork it was accessed. I
|
glossar entry and did not record from which artwork it was accessed. I
|
||||||
tried to assign these glossar entries to the correct artworks. The (very
|
tried to assign these glossar entries to the correct artworks. The (very
|
||||||
@ -310,12 +452,13 @@ pop-ups where another artwork has been opened in between. This is still an
|
|||||||
open TODO to write a more elaborate algorithm.
|
open TODO to write a more elaborate algorithm.
|
||||||
|
|
||||||
All glossar pop-ups that do not get matched with an artwork are removed
|
All glossar pop-ups that do not get matched with an artwork are removed
|
||||||
from the data set with a warning.
|
from the data set with a warning if the argument `glossar = TRUE` is set.
|
||||||
|
Otherwise the glossar entries will be ignored completely.
|
||||||
|
|
||||||
## Assign a `case` variable based on "time heuristic"
|
## Assign a `case` variable based on "time heuristic"
|
||||||
|
|
||||||
One thing needed in order to work with the data set and use it for machine
|
One thing needed in order to work with the data set and use it for machine
|
||||||
learning algorithms like process mining is a variable that tries to
|
learning algorithms like process mining, is a variable that tries to
|
||||||
identify a case. A case variable will structure the data frame in a way
|
identify a case. A case variable will structure the data frame in a way
|
||||||
that navigation behavior can actually be investigated. However, we do not
|
that navigation behavior can actually be investigated. However, we do not
|
||||||
know if several people are standing around the table interacting with it or
|
know if several people are standing around the table interacting with it or
|
||||||
@ -329,9 +472,9 @@ gets extracted by the algorithm.
|
|||||||
|
|
||||||
In order to investigate user behavior on a more fine grained level, it will
|
In order to investigate user behavior on a more fine grained level, it will
|
||||||
be necessary to come up with a more elaborate approach. A better, still
|
be necessary to come up with a more elaborate approach. A better, still
|
||||||
simple approach could be to use this kind of time limit and additionally
|
simple approach, could be to use this kind of time limit and additionally
|
||||||
look at the distance between artworks interacted with within one time
|
look at the distance between artworks interacted with within one time
|
||||||
window. When artworks are far apart is seems plausible that more than one
|
window. When artworks are far apart it seems plausible that more than one
|
||||||
person interacted with them. Very short time lapses between events on
|
person interacted with them. Very short time lapses between events on
|
||||||
different artworks could also be an indicator that more than one person is
|
different artworks could also be an indicator that more than one person is
|
||||||
interacting with the table.
|
interacting with the table.
|
||||||
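The simple time limit that is currently used (a new case starts when no event has occurred for 20 seconds, see the description of `case`) can be sketched as follows. The time stamps are invented, and the sketch ignores the distance between artworks discussed above.

```r
# Invented time stamps of six events, in chronological order
date <- as.POSIXct("2016-12-01 10:00:00", tz = "UTC") +
  c(0, 5, 12, 40, 45, 120)

# Seconds since the previous event (the first event has no predecessor)
gap <- c(Inf, diff(as.numeric(date)))

# Assumption: a new case starts whenever the gap exceeds 20 seconds
case <- cumsum(gap > 20)
case
```

Here the gaps of 28 and 75 seconds each open a new case, so the six events fall into three cases.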
@ -348,7 +491,7 @@ algorithmic way but only heuristically. I used the `case` variable in order
|
|||||||
to get meaningful units around the artworks.
|
to get meaningful units around the artworks.
|
||||||
|
|
||||||
If within one case only a single trace for a single artwork was opened, I
|
If within one case only a single trace for a single artwork was opened, I
|
||||||
assigned this trace to the moves associated with this artwork. I (quite
|
assigned this trace to the moves associated with this artwork. It (quite
|
||||||
often) happens that within one case one artwork is opened and closed
|
often) happens that within one case one artwork is opened and closed
|
||||||
several times, each time starting a new trace. I then assigned all the
|
several times, each time starting a new trace. I then assigned all the
|
||||||
following move events to the trace beforehand. This is, of course,
|
following move events to the trace beforehand. This is, of course,
|
||||||
@ -401,6 +544,16 @@ dim(dat2[is.na(dat2$date.start), ])
|
|||||||
dat2 <- dat2[!is.na(dat2$date.start), ]
|
dat2 <- dat2[!is.na(dat2$date.start), ]
|
||||||
```
|
```
|
||||||
|
|
||||||
|
In order to deal with these logging errors, I check the data for what I
|
||||||
|
call "fragmented traces". These are traces that cannot happen, when
|
||||||
|
everything is logged correctly, e.g., traces containing `flipCard ->
|
||||||
|
openPopup` or traces that only consist of `move`, `openTopic`, and
|
||||||
|
`openPopup` events. These fragmented traces are removed from the data. It
|
||||||
|
was not possible to check them all manually, but the 20 or more that I did
|
||||||
|
check in the raw log files were all some kind of logging error like above.
|
||||||
|
Most often a card was already closed again, before a topic card or pop-up
|
||||||
|
was recorded as being closed.
|
||||||
|
|
||||||
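A check for fragmented traces could look roughly like the following sketch. The function `is_fragmented()` is hypothetical and simplifies the rules described above: it only tests which event types are present in a trace (a pop-up needs an opened topic, and a topic or pop-up needs a flipped card) and does not look at the order of the events.

```r
# Invented example: three traces with their events in chronological order
tr <- data.frame(
  trace = c(1, 1, 1, 2, 2, 3, 3),
  event = c("flipCard", "openTopic", "openPopup",
            "flipCard", "openPopup",
            "move", "openTopic")
)

# Hypothetical check, a simplification of the rules described above
is_fragmented <- function(events) {
  has <- function(e) e %in% events
  (has("openPopup") && !has("openTopic")) ||
    ((has("openTopic") || has("openPopup")) && !has("flipCard"))
}

# One logical value per trace, named by the trace number
frag <- tapply(tr$event, tr$trace, is_fragmented)

# Drop all rows that belong to a fragmented trace
tr_clean <- tr[!tr$trace %in% names(frag)[frag], ]
```

In this toy example trace 2 (`flipCard -> openPopup`) and trace 3 (no `flipCard`) are flagged, and only trace 1 remains.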
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
|
## Card indices go from 0 to 7 (instead of 0 to 5 as expected)
|
||||||
|
|
||||||
See `questions_number-of-cards.R` for more details.
|
See `questions_number-of-cards.R` for more details.
|
||||||
@ -411,13 +564,15 @@ displayed on the back of the card. I added an index giving the ordering in
|
|||||||
the index files.
|
the index files.
|
||||||
|
|
||||||
The possible values in the variable `topicNumber` range from 0 to 7,
|
The possible values in the variable `topicNumber` range from 0 to 7,
|
||||||
however, not artwork has more than six different numbers. So I just renamed
|
however, no artwork has more than six different numbers. So I just renamed
|
||||||
those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
|
those numbers from 1 to the highest number, e.g., $0,1,2,4,5,6$ was changed
|
||||||
to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
|
to $0\to 1,1\to 2,2\to 3,4\to 4,5\to 5,6\to 6$. Next I used the index to
|
||||||
assign topics and file names to the according pop-ups. This needs to be
|
assign topics and file names to the according pop-ups. This needs to be
|
||||||
cross checked with the programming, but seems the most plausible approach
|
cross checked with the programming, but seems the most plausible approach
|
||||||
with my current knowledge.
|
with my current knowledge.
|
||||||
|
|
||||||
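The renumbering described above amounts to replacing each topic number by its rank among the distinct numbers of the same artwork. A minimal sketch in base R, with invented data and the hypothetical column name `renumbered`:

```r
# The example from the text: topic numbers observed for one artwork
topicNumber <- c(0, 1, 2, 4, 5, 6)

# Rank of each value among the distinct values of this artwork
match(topicNumber, sort(unique(topicNumber)))

# The same per artwork, for a data frame with several (invented) artworks
d <- data.frame(artwork     = c("001", "001", "001", "002", "002"),
                topicNumber = c(0, 4, 6, 1, 7))
d$renumbered <- ave(d$topicNumber, d$artwork,
                    FUN = function(x) match(x, sort(unique(x))))
```

Whether this matches the numbering used by the application still needs to be cross checked, as noted above.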
|
<!-- TODO: Ask Philipp -->
|
||||||
|
|
||||||
## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
|
## Extracting topics from `index.xml` vs. `<artwork_number>.xml`
|
||||||
|
|
||||||
When I extract the topics from `index.html` I get different topics, than
|
When I extract the topics from `index.html` I get different topics, than
|
||||||
@ -479,9 +634,6 @@ It strongly suggests that the artworks have been updated after the Corona
|
|||||||
pandemic. I think, the table was also moved to a different location at that
|
pandemic. I think, the table was also moved to a different location at that
|
||||||
point. (Check with PG to make sure.)
|
point. (Check with PG to make sure.)
|
||||||
|
|
||||||
I need to get the XML files for "504" and "505" from PM in order to extract
|
|
||||||
information on them for the metadata.
|
|
||||||
|
|
||||||
# Optimizing resources used by the code
|
# Optimizing resources used by the code
|
||||||
|
|
||||||
After I started trying out the functions on the complete data set, it
|
After I started trying out the functions on the complete data set, it
|
||||||
@ -492,12 +644,15 @@ frame (at least not on my laptop). The code is supposed to work "out of the
|
|||||||
box" for researchers, hence it *should* run on a regular (8 core) laptop.
|
box" for researchers, hence it *should* run on a regular (8 core) laptop.
|
||||||
So, I changed the reshaping so that it is done in batches on subsets of the
|
So, I changed the reshaping so that it is done in batches on subsets of the
|
||||||
data for every `fileId` separately. This means that events that span over
|
data for every `fileId` separately. This means that events that span over
|
||||||
two raw log files cannot be closed and will then be removed from the data
|
two (or more) raw log files cannot be closed and will then be removed from
|
||||||
set. The functions warns about this, but it is a random process getting rid
|
the data set. The function warns about this, but it is a random process
|
||||||
of these data and seems therefore not like a systematic problem. Another
|
getting rid of these data and seems therefore not like a systematic
|
||||||
reason why this is not bad, is that durations cannot be calculated for
|
problem. Another reason why this is not bad, is that durations cannot be
|
||||||
events across log files anyways, because the time stamps do not increase
|
calculated for events across log files anyways, because the time stamps do
|
||||||
systematically over log files (see above).
|
not increase systematically over log files (see above).
|
||||||
|
|
||||||
|
UPDATE: By now, I close the events spanning more than one log file after
|
||||||
|
this has been done.
|
||||||
|
|
||||||
I meant to put the lists back together with `do.call(rbind, some_list)` but
|
I meant to put the lists back together with `do.call(rbind, some_list)` but
|
||||||
this can also not handle big data sets. I therefore switched to
|
this can also not handle big data sets. I therefore switched to
|
||||||
|