# escrito-docker
## High-level info
This repo contains (almost) everything needed to start a Docker container running [ESCRITO](https://github.com/catalpa-cl/escrito). In addition, a custom Java package is added to ESCRITO in order to classify new learner answers based on stored models that were trained using ESCRITO.
## Details (see Dockerfile)
The base layer of the Docker image is [eclipse-temurin:11-jdk-alpine](https://hub.docker.com/layers/library/eclipse-temurin/11-jdk-alpine/images/sha256-2a16c92565236e8d9b3c3747d995c33f239e8ed30bcea1c1ba6c1a5cfa72da79?context=explore), which is itself based on [alpine:3.18](https://hub.docker.com/layers/library/alpine/3.18/images/sha256-48d9183eb12a05c99bcc0bf44a003607b8e941e1d4f41f9ad12bdcc4b5672f86?context=explore) (i.e. Alpine Linux). As the name suggests, it provides a Java 11 JDK inside the container, which is needed to compile and run ESCRITO.
Aside from some system-level utilities (`curl`, `vim`, `git`), [maven](https://maven.apache.org/) is installed in order to compile ESCRITO from source within the container. The maven version to install is determined by the `MAVEN_VERSION` environment variable. Currently, it is set to `3.5.4`, which seems to work fine.
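To illustrate how a version variable like `MAVEN_VERSION` can drive the download, here is a minimal sketch. The helper name `maven_download_url` is hypothetical, and the URL layout is that of the public Apache archive for Maven 3.x releases; it is an assumption and may differ from what the `Dockerfile` in this repo actually uses.

```shell
#!/bin/sh
# Hypothetical helper (not taken from this repo's Dockerfile).
# Assumes the Apache archive layout for Maven 3.x binary releases.
maven_download_url() {
    version="$1"
    printf 'https://archive.apache.org/dist/maven/maven-3/%s/binaries/apache-maven-%s-bin.tar.gz\n' \
        "$version" "$version"
}

# Fall back to the version mentioned above if the variable is not set.
MAVEN_VERSION="${MAVEN_VERSION:-3.5.4}"
maven_download_url "$MAVEN_VERSION"
```

Changing `MAVEN_VERSION` then only changes the two version segments of the URL, so trying a different maven release means changing a single value.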
After maven is set up, the [ESCRITO GitHub repo](https://github.com/catalpa-cl/escrito) is cloned to path `/escrito` inside the container. Next, the parent `pom.xml` inside `/escrito/de.unidue.ltl.escrito` that was pulled from GitHub during cloning is removed and replaced with the `pom.xml` from this repository. The `pom.xml` in this repository contains two small additions that fix some errors that otherwise occur during the maven build and cause it to fail (the changes/additions are described [here](https://stackoverflow.com/a/63438394)).
Next, the `.java` files in `local/models/` of this directory are copied to their required position within the package structure of the ESCRITO repo inside the Docker container. These files, in particular `StoredModelPredictor.java`, are needed to run the classification of new learner answers from within the container.
Next, the local directories `dkpro_target/`, `.m2/` and `scripts/` are copied inside the Docker container.
Local directory `dkpro_target/` is mapped to `/dkpro_target` within the Docker container and will later be pointed to by environment variable `$DKPRO_HOME` (see script `scripts/classify.sh`). This directory contains another directory, `models/`, which is where **all pre-trained models must be stored** (one sub-folder per trained model!).
Local directory `.m2/repository/` is the local maven repository which will be used during the compilation of the ESCRITO source code. It contains most of the necessary dependencies, which will make the compilation MUCH faster, since they don't have to be downloaded first (otherwise this would take days, literally).
Local directory `scripts/` contains a single shell script named `classify.sh`, which will be called in order to delegate the actual classification. It sets the env. variable `$DKPRO_HOME` to `/dkpro_target` and afterwards calls `StoredModelPredictor.java`. This call is wrapped in a script since the `classpath` is HUGE. More on this below.
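Since every pre-trained model lives in its own sub-folder of `/dkpro_target/models/`, it can be handy to list the available models, or to check a model name, before running a classification. The following is a minimal sketch; the helpers `list_models` and `model_exists` and the `MODELS_DIR` variable are hypothetical and not part of `scripts/classify.sh`.

```shell
#!/bin/sh
# Hypothetical helpers; not part of this repository's scripts.
# MODELS_DIR defaults to the models location described above.
MODELS_DIR="${MODELS_DIR:-/dkpro_target/models}"

# Print the names of all model sub-folders, one per line.
list_models() {
    for dir in "$MODELS_DIR"/*/; do
        # Skip the unexpanded pattern if there are no sub-folders.
        [ -d "$dir" ] || continue
        basename "$dir"
    done
}

# Succeed only if the given name is an existing model sub-folder.
model_exists() {
    [ -n "$1" ] && [ -d "$MODELS_DIR/$1" ]
}

# Example use inside the container:
#   if model_exists 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0'; then
#       echo "model found"
#   else
#       echo "unknown model, available models:"; list_models
#   fi
```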
Once all necessary files are copied to the container, the ESCRITO source code is compiled using maven by calling:
```
mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -e
```
This command is run from within `/escrito/de.unidue.ltl.escrito` and tells maven to search for dependencies in the local directory `/.m2/repository`. The `-DskipTests` flag skips automatic testing, and `-e` makes maven print more detailed error messages if the build fails.
You can track the compilation via the output in your terminal during `docker image build` (see below). It should take roughly 10 minutes and (ideally) end with output that is equivalent to the following:
```
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] de.unidue.ltl.escrito 0.0.1-SNAPSHOT ............... SUCCESS [ 1.462 s]
[INFO] de.unidue.ltl.escrito.core ......................... SUCCESS [04:29 min]
[INFO] de.unidue.ltl.escrito.io ........................... SUCCESS [ 1.307 s]
[INFO] de.unidue.ltl.escrito.features ..................... SUCCESS [ 2.642 s]
[INFO] de.unidue.ltl.escrito.examples ..................... SUCCESS [ 1.780 s]
[INFO] de.unidue.ltl.escrito.languagetool 0.0.1-SNAPSHOT .. SUCCESS [04:21 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:59 min
[INFO] Finished at: 2023-10-03T16:39:25Z
[INFO] ------------------------------------------------------------------------
```
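If you save the build output to a file, you can also check the result after the fact instead of watching the terminal. The sketch below is only an illustration: the helper `check_maven_log` and the file name `build.log` are hypothetical, and the pattern assumes reactor summary lines in the format shown above.

```shell
#!/bin/sh
# Hypothetical helper; the log file name is an assumption.
# One way to produce the log:
#   docker image build --no-cache -t escrito:latest . 2>&1 | tee build.log

check_maven_log() {
    log="$1"
    if grep -q 'BUILD SUCCESS' "$log"; then
        echo "Maven build succeeded"
        return 0
    fi
    echo "Maven build did not succeed; modules not marked SUCCESS:"
    # Reactor summary lines look like "[INFO] <module> ..... <STATUS>".
    grep -E '\[INFO\] .*\.\.\. (FAILURE|SKIPPED)' "$log" \
        || echo "(no reactor summary found)"
    return 1
}

# Example: check_maven_log build.log
```

The function returns a non-zero exit status when `BUILD SUCCESS` is missing, so it can also be used in an `if` or chained with `&&`.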
Lastly, the `ENTRYPOINT` of the container is essentially an infinite loop so that once started, the container will theoretically run indefinitely.
### Building the image and starting a container
To build the Docker image, run the following from within the current directory (i.e. from where the `Dockerfile` is located):
```
docker image build --no-cache -t escrito:latest .
```
This will create an image called `escrito:latest`.
Once the image build is done, run the following to start a container based on image `escrito:latest`:
```
docker container run -d \
-v ./local:/escrito/de.unidue.ltl.escrito/de.unidue.ltl.escrito.examples/src/main/java/de/unidue/ltl/escrito/examples/local \
-v ./scripts:/scripts \
--name escrito-demo escrito:latest
```
This will start a container called `escrito-demo`. The lines starting with `-v` map volumes from host to container and make sure that changes applied outside the container are also reflected inside the container, and vice versa (see [here](https://docs.docker.com/storage/volumes/#choose-the--v-or---mount-flag) for more info).
Once the container `escrito-demo` is running, run the following command to attach to a terminal within the container:
```
docker container exec -it escrito-demo /bin/sh
```
### Classify new learner's answers
When you are attached to the terminal of container `escrito-demo`, you can run the following to classify a learner's answer based on a stored, pre-trained model:
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
```
This command calls the script `classify.sh` in directory `/scripts` within the container. In the command above, **two arguments** are passed to `classify.sh`. The first argument, `'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0'`, specifies the model that should be used to classify the answer. This argument **must exactly match the name of a sub-folder of `/dkpro_target/models/`**. Otherwise the classification **will not work**. The second argument, `'Test answer'`, is a placeholder for the answer that should be classified. **Both arguments should be strings wrapped in quotes (i.e. `'...'`)**. This is especially important for the second argument (the answer to be classified), since otherwise answers containing whitespace will be interpreted as multiple individual arguments.
To demonstrate this, lines `27`-`31` in file `local/models/StoredModelPredictor.java` print some output for debugging.
If we run
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
```
we should get the following output:
```
Total number of arguments passed: 2
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test answer
0
```
where the first three lines come from the above-mentioned debugging print statements in `local/models/StoredModelPredictor.java` and the last line shows the binary classification outcome for the specified model and the specified answer.
If we instead pass the second argument **without quotes**
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' Test answer
```
we should get
```
Total number of arguments passed: 3
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test
Argument 2: answer
0
```
This is definitely not what we want and we should therefore always pass answers in quotes to `classify.sh`.
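The same splitting behaviour can be reproduced without ESCRITO. The sketch below uses a hypothetical shell function, `show_args`, as a stand-in for the debugging output of `StoredModelPredictor.java`; it only counts and prints its arguments and does not perform any classification.

```shell
#!/bin/sh
# Stand-in for the debugging output described above (illustration only;
# it does not call ESCRITO and prints no classification result).
show_args() {
    echo "Total number of arguments passed: $#"
    i=0
    for arg in "$@"; do
        echo "Argument $i: $arg"
        i=$((i + 1))
    done
}

answer='Test answer'

# Quoted: the answer stays one argument (2 arguments in total).
show_args 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' "$answer"

# Unquoted: the shell splits the answer at the space (3 arguments in total).
show_args 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' $answer
```

The same applies when the answer is stored in a shell variable: it has to be expanded as `"$answer"`, not `$answer`, to reach the script as a single argument.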
### Updating `StoredModelPredictor.java`
Due to the volume mapping specified in the `docker container run` command (see above), changes applied to `local/models/StoredModelPredictor.java` on the host will be reflected inside the Docker container. However, for these changes to actually **take effect**, ESCRITO has to be re-compiled inside the Docker container. To re-compile using maven, run the following when you are attached to a terminal inside the `escrito-demo` container as described above:
```
cd /escrito/de.unidue.ltl.escrito && mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -o -e
```
The `-o` flag tells maven to work in 'offline mode', i.e. it forces maven to only rely on resources that are locally available. This makes the re-build quite fast. After the re-build is done, run `classify.sh` again and you should be able to observe that your changes have taken effect.
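If you are unsure whether a re-compile is needed, you can compare the timestamps of the source file and the compiled class. The following is a rough sketch: the helper `needs_recompile` is hypothetical, and the class file path assumes maven's default `target/classes` output directory and a package that mirrors the source folder, neither of which is confirmed by this repo.

```shell
#!/bin/sh
# Hypothetical helper, not part of this repository.
# Succeeds (exit status 0) if the source file is newer than the class file,
# or if the class file does not exist yet.
needs_recompile() {
    src="$1"
    class="$2"
    [ -e "$class" ] || return 0
    [ -n "$(find "$src" -newer "$class")" ]
}

# Assumed paths inside the container (maven default layout).
EXAMPLES=/escrito/de.unidue.ltl.escrito/de.unidue.ltl.escrito.examples
SRC="$EXAMPLES/src/main/java/de/unidue/ltl/escrito/examples/local/models/StoredModelPredictor.java"
CLASS="$EXAMPLES/target/classes/de/unidue/ltl/escrito/examples/local/models/StoredModelPredictor.class"

if needs_recompile "$SRC" "$CLASS"; then
    echo "StoredModelPredictor.java changed or was never compiled: re-run mvn compile"
else
    echo "Compiled class is up to date"
fi
```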