escrito-docker
High-level info
This repo contains (almost) everything needed to start a Docker container running ESCRITO. In addition, a custom Java package is added to ESCRITO in order to classify new learner answers based on stored models that were trained using ESCRITO.
Details (see Dockerfile)
The base layer of the Docker image is eclipse-temurin:11-jdk-alpine which is itself based on alpine:3.18 (i.e. alpine Linux). As the name suggests, it provides a Java 11 JDK inside the container, which is needed to compile and run ESCRITO.
Aside from some system-level utilities (curl, vim, git), maven is installed in order to compile ESCRITO from source within the container. The maven version is determined by the MAVEN_VERSION environment variable. Currently, it is set to 3.5.4, which seems to work fine.
After maven is set up, the ESCRITO github repo is cloned to /escrito inside the container. Next, the parent pom.xml inside /escrito/de.unidue.ltl.escrito, which is pulled from github during cloning, is removed and replaced with the pom.xml from this repository. The pom.xml in this repository contains two small additions that fix errors that would otherwise cause the maven build to fail (the changes/additions are described here).
Next, the .java files in local/models/ of this directory are copied to their required position within the package structure of the ESCRITO repo inside the Docker container. These files, in particular StoredModelPredictor.java, are needed to classify new learner answers from within the container.
Next, the local directories dkpro_target/, .m2/ and scripts/ are copied into the Docker container.
The local directory dkpro_target/ is mapped to /dkpro_target within the Docker container and will later be pointed to by the environment variable $DKPRO_HOME (see script scripts/classify.sh). This directory contains another directory, models/, which is where all pre-trained models must be stored (one sub-folder per trained model!).
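Since every trained model sits in its own sub-folder of models/, the available models can be listed, or a given model name checked, with a few lines of shell. The following is a minimal sketch, not part of this repository: the function names list_models and model_exists are hypothetical, and the fallback to /dkpro_target assumes the layout described above.

```shell
# Hypothetical helpers (not part of this repo).
# Assumes models live in $DKPRO_HOME/models, with /dkpro_target as the default.

# Print the name of every model sub-folder, one per line.
list_models() {
    for dir in "${DKPRO_HOME:-/dkpro_target}"/models/*/; do
        if [ -d "$dir" ]; then
            basename "$dir"
        fi
    done
}

# Succeed only if a sub-folder with the given model name exists.
model_exists() {
    [ -d "${DKPRO_HOME:-/dkpro_target}/models/$1" ]
}
```

Inside the container, something like `model_exists 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' || echo 'no such model'` could then be used to catch a mistyped model name before running the classification.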
The local directory .m2/repository/ is the local maven repository that is used during the compilation of the ESCRITO source code. It contains most of the necessary dependencies, which makes the compilation MUCH faster, since they don't have to be downloaded first (otherwise this would take days, literally).
The local directory scripts/ contains a single shell script named classify.sh, which is called in order to delegate the actual classification. It sets the environment variable $DKPRO_HOME to /dkpro_target and afterwards calls StoredModelPredictor.java. This call is wrapped in a script since the classpath is HUGE. More on this below.
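To make the role of such a wrapper more concrete, here is a rough sketch of what a script like classify.sh might look like. It is not the actual script from scripts/. Several things are assumptions made for illustration: the fully qualified class name (inferred from where local/ ends up in the ESCRITO package structure), the CLASSPATH_FILE variable and its default path (a hypothetical file holding the long classpath on one line), and the JAVA variable (which only exists so the launcher can be swapped out).

```shell
#!/bin/sh
# Hypothetical sketch of a classify.sh-style wrapper; NOT the script shipped in scripts/.

# Assumed fully qualified class name of the predictor.
MAIN_CLASS="de.unidue.ltl.escrito.examples.local.models.StoredModelPredictor"

classify() {
    model="$1"    # name of a sub-folder of /dkpro_target/models/
    answer="$2"   # the learner answer as ONE argument

    # Hypothetical: read the (very long) classpath from a file instead of inlining it.
    classpath="$(cat "${CLASSPATH_FILE:-/scripts/classpath.txt}")"

    # Set DKPRO_HOME for this call only, then start the predictor.
    # The quotes around "$answer" keep answers with spaces in one piece.
    DKPRO_HOME=/dkpro_target "${JAVA:-java}" -cp "$classpath" "$MAIN_CLASS" "$model" "$answer"
}

# Example call (commented out, since it needs the compiled ESCRITO classes):
# classify 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
```

Keeping the classpath out of the command line you type is the main point of the wrapper: the caller only has to supply the model name and the answer.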
Once all necessary files are copied to the container, the ESCRITO source code is compiled using maven by calling:
mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -e
This command is called from within /escrito/de.unidue.ltl.escrito and tells maven to search for dependencies within the local directory /.m2/repository. Also, automatic testing during compilation is skipped via the flag -DskipTests.
You can track the compilation via the output in your terminal during docker image build (see below). It should take roughly 10 minutes and (ideally) end with output equivalent to the following:
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] de.unidue.ltl.escrito 0.0.1-SNAPSHOT ............... SUCCESS [ 1.462 s]
[INFO] de.unidue.ltl.escrito.core ......................... SUCCESS [04:29 min]
[INFO] de.unidue.ltl.escrito.io ........................... SUCCESS [ 1.307 s]
[INFO] de.unidue.ltl.escrito.features ..................... SUCCESS [ 2.642 s]
[INFO] de.unidue.ltl.escrito.examples ..................... SUCCESS [ 1.780 s]
[INFO] de.unidue.ltl.escrito.languagetool 0.0.1-SNAPSHOT .. SUCCESS [04:21 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:59 min
[INFO] Finished at: 2023-10-03T16:39:25Z
[INFO] ------------------------------------------------------------------------
Lastly, the ENTRYPOINT of the container is essentially an infinite loop, so that once started, the container will theoretically run indefinitely.
Building the image and starting a container
To build the Docker image, run the following from within the current directory (i.e. from where the Dockerfile is located):
docker image build --no-cache -t escrito:latest .
This will create an image called escrito:latest.
Once the image build is done, run the following to start a container based on image escrito:latest:
docker container run -d \
-v ./local:/escrito/de.unidue.ltl.escrito/de.unidue.ltl.escrito.examples/src/main/java/de/unidue/ltl/escrito/examples/local \
-v ./scripts:/scripts \
--name escrito-demo escrito:latest
This will start a container called escrito-demo. The lines starting with -v map directories from the host into the container and make sure that changes applied outside the container are also reflected inside the container, and vice versa (see here for more info).
Once the container escrito-demo is running, run the following command to attach to a terminal within the container:
docker container exec -it escrito-demo /bin/sh
Classify new learner answers
When you are attached to the terminal of container escrito-demo, you can run the following to classify a learner's answer based on a stored, pre-trained model:
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
This command calls the script classify.sh in directory /scripts within the container. In the command above, two arguments are passed to classify.sh. The first argument, 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0', specifies the model that should be used to classify the answer. This argument must match the name of a sub-folder of /dkpro_target/models/. Otherwise the classification will not work. The second argument, 'Test answer', is a placeholder for the answer that should be classified. Both arguments should be strings wrapped in quotes (i.e. '...'). This is especially important for the second argument (the answer to be classified), since otherwise sentences containing whitespace will be interpreted as multiple individual arguments.
To demonstrate this, lines 27-31 in file local/models/StoredModelPredictor.java print some output for debugging.
If we run
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
we should get the following output:
Total number of arguments passed: 2
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test answer
0
where the first three lines are due to the above-mentioned debugging print statements in local/models/StoredModelPredictor.java and the last line shows the binary classification outcome for the specified model and the specified answer.
If we instead pass the second argument without quotes
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' Test answer
we should get
Total number of arguments passed: 3
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test
Argument 2: answer
0
This is definitely not what we want, and we should therefore always pass answers in quotes to classify.sh.
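The same splitting happens when the answer is stored in a shell variable and that variable is expanded without double quotes. The short sketch below illustrates this in POSIX sh or bash. The helper count_args is hypothetical and only mimics the "Total number of arguments passed" debugging line; it does not call ESCRITO.

```shell
# Hypothetical helper: prints how many arguments it received.
count_args() {
    echo "$#"
}

answer='Test answer'

count_args "$answer"   # quoted expansion: prints 1
count_args $answer     # unquoted expansion: split on whitespace, prints 2
```

So when calling classify.sh from another script, the safe form is to pass the answer as "$answer", with double quotes around the variable.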
Updating StoredModelPredictor.java
Due to the volume mapping specified in the docker container run command (see above), changes applied to local/models/StoredModelPredictor.java on the host will be reflected inside the Docker container. However, for these changes to actually take effect, ESCRITO has to be re-compiled inside the Docker container. To re-compile using maven, run the following when you are attached to a terminal inside the escrito-demo container as described above:
cd /escrito/de.unidue.ltl.escrito && mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -o -e
The -o flag tells maven to work in 'offline mode', i.e. it forces maven to rely only on resources that are locally available. This makes the re-build quite fast. After the re-build is done, run classify.sh again and you should be able to observe that your changes have taken effect.