first commit

Felix S 2023-10-07 15:11:50 +02:00
commit cdffb21cd7
716 changed files with 1183 additions and 0 deletions

1
.gitignore vendored Normal file

@ -0,0 +1 @@
.m2/repository/*

34
Dockerfile Normal file

@ -0,0 +1,34 @@
FROM eclipse-temurin:11-jdk-alpine
# install system-level utilities
RUN apk add --no-cache curl vim git
# configure & install maven
ENV MAVEN_VERSION=3.5.4
ENV MAVEN_HOME=/usr/lib/mvn
ENV PATH=$MAVEN_HOME/bin:$PATH
RUN wget http://archive.apache.org/dist/maven/maven-3/$MAVEN_VERSION/binaries/apache-maven-$MAVEN_VERSION-bin.tar.gz && \
tar -zxvf apache-maven-$MAVEN_VERSION-bin.tar.gz && \
rm apache-maven-$MAVEN_VERSION-bin.tar.gz && \
mv apache-maven-$MAVEN_VERSION /usr/lib/mvn
# clone escrito git repo and remove pulled parent pom.xml
RUN git clone https://github.com/catalpa-cl/escrito.git /escrito && \
rm /escrito/de.unidue.ltl.escrito/pom.xml
# copy pom.xml from host
COPY pom.xml /escrito/de.unidue.ltl.escrito
# copy local.models package from host
COPY ./local /escrito/de.unidue.ltl.escrito/de.unidue.ltl.escrito.examples/src/main/java/de/unidue/ltl/escrito/examples/local
# copy some more directories from host
COPY ./dkpro_target /dkpro_target
COPY ./.m2 /.m2
COPY ./scripts /scripts
WORKDIR /escrito/de.unidue.ltl.escrito
RUN mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -e
WORKDIR /
# keep the container alive with an infinite loop
ENTRYPOINT ["/bin/sh", "-c", "while true; do sleep 1; done"]

117
README.md Normal file

@ -0,0 +1,117 @@
# escrito-docker
## High-level info
This repo contains (almost) everything needed to start a Docker container running [ESCRITO](https://github.com/catalpa-cl/escrito). In addition, a custom Java package is added to ESCRITO in order to classify new learner answers based on stored models that were trained using ESCRITO.
## Details (see Dockerfile)
The base layer of the Docker image is [eclipse-temurin:11-jdk-alpine](https://hub.docker.com/layers/library/eclipse-temurin/11-jdk-alpine/images/sha256-2a16c92565236e8d9b3c3747d995c33f239e8ed30bcea1c1ba6c1a5cfa72da79?context=explore) which is itself based on [alpine:3.18](https://hub.docker.com/layers/library/alpine/3.18/images/sha256-48d9183eb12a05c99bcc0bf44a003607b8e941e1d4f41f9ad12bdcc4b5672f86?context=explore) (i.e. alpine Linux). As the name suggests, it provides a Java 11 JDK inside the container, which is needed to compile and run ESCRITO.
Aside from some system-level utilities (`curl`, `vim`, `git`), [maven](https://maven.apache.org/) is installed in order to compile ESCRITO from source within the container. The installed maven version is determined by the `MAVEN_VERSION` environment variable. It is currently set to `3.5.4`, which has worked for the build shown further down.
After maven is set up, the [ESCRITO GitHub repo](https://github.com/catalpa-cl/escrito) is cloned to `/escrito` inside the container. Next, the parent `pom.xml` in `/escrito/de.unidue.ltl.escrito` that comes with the clone is removed and replaced with the `pom.xml` from this repository. The `pom.xml` in this repository contains two small additions that fix errors which would otherwise cause the maven build to fail (the additions are described [here](https://stackoverflow.com/a/63438394)).
Next, the `.java` files in `local/models/` of this directory are copied to their required position within the package structure of the ESCRITO repo inside the Docker container. These files, in particular `StoredModelPredictor.java`, are needed to classify new learner answers from within the container.
Next, the local directories `dkpro_target/`, `.m2/` and `scripts/` are copied inside the Docker container.
Local directory `dkpro_target/` is mapped to `/dkpro_target` within the Docker container and will later be pointed to by environment variable `$DKPRO_HOME` (see script `scripts/classify.sh`). This directory contains another directory `models/`, which is where **all pre-trained models must be stored** (one sub-folder per trained model!).
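Since every trained model lives in its own sub-folder of `/dkpro_target/models/`, the set of usable model names can be read straight from that directory. The following is a minimal sketch, not part of this repository: the variable `MODELS_DIR` and the helper names `list_models` and `model_exists` are assumptions made for illustration.

```shell
#!/bin/sh
# Sketch only: MODELS_DIR and both function names are illustrative assumptions.
MODELS_DIR="${MODELS_DIR:-/dkpro_target/models}"

# Print the name of every sub-folder of MODELS_DIR, one per line.
list_models() {
    for dir in "$MODELS_DIR"/*/; do
        [ -d "$dir" ] || continue   # the glob matched nothing
        basename "$dir"
    done
}

# Succeed only if a sub-folder with exactly this name exists.
model_exists() {
    [ -n "$1" ] && [ -d "$MODELS_DIR/$1" ]
}
```

Inside the container, `list_models` would then print one model name per line, and a wrapper script could call `model_exists "$1"` to reject an unknown model name early with a readable error message.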
Local directory `.m2/repository/` is the local maven repository which will be used during the compilation of the ESCRITO source code. It contains most of the necessary dependencies, which will make the compilation MUCH faster, since they don't have to be downloaded first (otherwise this would take days, literally).
Local directory `scripts/` contains a single shell script named `classify.sh`, which will be called in order to delegate the actual classification. It sets the env. variable `$DKPRO_HOME` to `/dkpro_target` and afterwards calls `StoredModelPredictor.java`. This call is wrapped in a script since the `classpath` is HUGE. More on this below.
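The contents of `classify.sh` are not shown here, so the following is only a rough sketch of what such a wrapper could look like under stated assumptions. The fully qualified class name is inferred from the directory layout, the `CLASSPATH` value is a placeholder for the real (very long) classpath, and the `JAVA` override and the function name `classify` are hypothetical.

```shell
#!/bin/sh
# Sketch only, not the repository's actual classify.sh.
# Assumptions: MAIN_CLASS is inferred from the package directory layout,
# and CLASSPATH stands in for the real, much longer classpath.
MAIN_CLASS="de.unidue.ltl.escrito.examples.local.models.StoredModelPredictor"
JAVA="${JAVA:-java}"   # overridable so the command can be previewed

classify() {
    if [ "$#" -ne 2 ]; then
        echo "usage: classify <model-name> <answer>" >&2
        return 2
    fi
    # DKPro components look up models relative to DKPRO_HOME.
    DKPRO_HOME=/dkpro_target
    export DKPRO_HOME
    # Quoting "$1" and "$2" keeps an answer with spaces in one argument.
    "$JAVA" -cp "${CLASSPATH:-.}" "$MAIN_CLASS" "$1" "$2"
}
```

Unlike the behaviour described further down, this sketch rejects calls with more or fewer than two arguments instead of forwarding them, which is a design choice of the sketch and not something the real script is known to do.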
Once all necessary files are copied to the container, the ESCRITO source code is compiled using maven by calling:
```
mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -e
```
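This command is run from within `/escrito/de.unidue.ltl.escrito` and tells maven to resolve dependencies from the local repository `/.m2/repository`. The `-DskipTests` flag disables test execution (the `compile` phase stops before the test phase anyway, so the flag acts only as a safeguard), and `-e` makes maven print detailed error messages if the build fails.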
You can track the compilation via the output in your terminal during `docker image build` (see below). It should take roughly 10 minutes and (ideally) end with output that is equivalent to the following:
```
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] de.unidue.ltl.escrito 0.0.1-SNAPSHOT ............... SUCCESS [ 1.462 s]
[INFO] de.unidue.ltl.escrito.core ......................... SUCCESS [04:29 min]
[INFO] de.unidue.ltl.escrito.io ........................... SUCCESS [ 1.307 s]
[INFO] de.unidue.ltl.escrito.features ..................... SUCCESS [ 2.642 s]
[INFO] de.unidue.ltl.escrito.examples ..................... SUCCESS [ 1.780 s]
[INFO] de.unidue.ltl.escrito.languagetool 0.0.1-SNAPSHOT .. SUCCESS [04:21 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 08:59 min
[INFO] Finished at: 2023-10-03T16:39:25Z
[INFO] ------------------------------------------------------------------------
```
Lastly, the `ENTRYPOINT` of the container is essentially an infinite loop so that once started, the container will theoretically run indefinitely.
### Building the image and starting a container
To build the Docker image, run the following from within the current directory (i.e. from where the `Dockerfile` is located):
```
docker image build --no-cache -t escrito:latest .
```
This will create an image called `escrito:latest`.
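The `Dockerfile` copies `pom.xml`, `local/`, `dkpro_target/`, `.m2/` and `scripts/` from the build context, and the `.gitignore` in this commit excludes `.m2/repository/*`, so a fresh clone may not contain everything the build needs. A small pre-flight check along the following lines could catch that before a long build starts. This is a sketch only; `check_build_context` is a hypothetical helper, not part of this repository.

```shell
#!/bin/sh
# Sketch only: check_build_context is a hypothetical helper.
# It verifies that everything the Dockerfile copies exists in a directory.
check_build_context() {
    ctx="${1:-.}"
    missing=0
    for f in Dockerfile pom.xml; do
        [ -f "$ctx/$f" ] || { echo "missing file: $f" >&2; missing=1; }
    done
    for d in local dkpro_target .m2 scripts; do
        [ -d "$ctx/$d" ] || { echo "missing directory: $d" >&2; missing=1; }
    done
    return "$missing"
}
```

It could be chained in front of the build, for example `check_build_context . && docker image build --no-cache -t escrito:latest .`, so that the build is only attempted when all inputs are present.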
Once the image build is done, run the following to start a container based on image `escrito:latest`:
```
docker container run -d \
-v ./local:/escrito/de.unidue.ltl.escrito/de.unidue.ltl.escrito.examples/src/main/java/de/unidue/ltl/escrito/examples/local \
-v ./scripts:/scripts \
--name escrito-demo escrito:latest
```
This will start a container called `escrito-demo`. The lines starting with `-v` map volumes from host to container and make sure that changes applied outside the container are also reflected inside the container, and vice versa (see [here](https://docs.docker.com/storage/volumes/#choose-the--v-or---mount-flag) for more info).
Once the container `escrito-demo` is running, run the following command to attach to a terminal within the container:
```
docker container exec -it escrito-demo /bin/sh
```
### Classify new learner's answers
When you are attached to the terminal of container `escrito-demo`, you can run the following to classify a learner's answer based on a stored, pre-trained model:
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
```
This command calls the script `classify.sh` in directory `/scripts` within the container. In the command above, **two arguments** are passed to `classify.sh`. The first argument `'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0'` specifies the model that should be used to classify the answer. This argument **must match the name of a sub-folder of `/dkpro_target/models/`**. Otherwise the classification **will not work**. The second argument `'Test answer'` is a placeholder for the answer to be classified. **Both arguments should be strings wrapped in quotes (i.e. `'...'`)**. This is especially important for the second argument (the answer to be classified), since otherwise an answer containing whitespace is split into multiple individual arguments.
To demonstrate this, lines `27`-`31` in file `local/models/StoredModelPredictor.java` print some output for debugging.
If we run
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'
```
we should get the following output:
```
Total number of arguments passed: 2
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test answer
0
```
where the first three lines come from the above-mentioned debugging print statements in `local/models/StoredModelPredictor.java` and the last line shows the binary classification outcome for the specified model and answer.
If we instead pass the second argument **without quotes**
```
cd /scripts && ./classify.sh 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' Test answer
```
we should get
```
Total number of arguments passed: 3
Argument 0: Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0
Argument 1: Test
Argument 2: answer
0
```
This is definitely not what we want and we should therefore always pass answers in quotes to `classify.sh`.
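The splitting happens in the shell before `classify.sh` or any Java code sees the arguments, so it can be reproduced without the container. The sketch below mimics the debugging output shown above with a stand-in function; `show_args` and the model name `some-model` are illustrative and not part of this repository.

```shell
#!/bin/sh
# Stand-in for the debugging output above; show_args is not the real script.
show_args() {
    echo "Total number of arguments passed: $#"
    i=0
    # "$@" (quoted) preserves each argument exactly as it was passed in.
    for arg in "$@"; do
        echo "Argument $i: $arg"
        i=$((i + 1))
    done
}

show_args 'some-model' 'Test answer'   # quoted answer: 2 arguments
show_args 'some-model' Test answer     # unquoted answer: 3 arguments
```

The same rule applies inside any wrapper script: arguments need to be forwarded as `"$@"` or `"$1" "$2"` with quotes, otherwise a correctly quoted answer would be split again on its way to Java.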
### Updating `StoredModelPredictor.java`
Due to the volume mapping specified in the `docker container run` command (see above), changes applied to `local/models/StoredModelPredictor.java` on the host will be reflected inside the Docker container. However, for these changes to actually **take effect**, ESCRITO has to be re-compiled inside the Docker container. To re-compile using maven, run the following when you are attached to a terminal inside the `escrito-demo` container as described above:
```
cd /escrito/de.unidue.ltl.escrito && mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -o -e
```
The `-o` flag tells maven to work in 'offline mode', i.e. it forces maven to only rely on resources that are locally available. This makes the re-build quite fast. After the re-build is done, run `classify.sh` again and you should be able to observe that your changes have taken effect.
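The edit, re-compile and re-run cycle can presumably also be driven from the host without attaching to the container, using `docker container exec` with its `-w` (working directory) option. The following is a sketch under assumptions: the `DOCKER` and `CONTAINER` variables and the function names `recompile` and `classify_from_host` are made up for illustration, and `classify.sh` is assumed to be executable and to work when started with `/scripts` as the working directory.

```shell
#!/bin/sh
# Sketch only: DOCKER, CONTAINER and the function names are illustrative.
DOCKER="${DOCKER:-docker}"
CONTAINER="${CONTAINER:-escrito-demo}"

# Re-compile ESCRITO inside the running container (maven offline mode).
recompile() {
    "$DOCKER" container exec -w /escrito/de.unidue.ltl.escrito "$CONTAINER" \
        mvn compile -Dmaven.repo.local=/.m2/repository -DskipTests -o -e
}

# Classify one answer; each argument is forwarded as a single word.
classify_from_host() {
    "$DOCKER" container exec -w /scripts "$CONTAINER" \
        /scripts/classify.sh "$1" "$2"
}
```

A typical cycle would then be `recompile && classify_from_host 'Me_n-SMO-C-1.0-NormalizedPolyKernel-E-3.0' 'Test answer'`. Because the answer is passed as a quoted `"$2"`, it reaches the script as one argument.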

@ -0,0 +1,3 @@
#Bipartition threshold used to train this model (only multi-label classification)
#Mon Sep 25 16:06:42 CEST 2023
threshold=0.5

@ -0,0 +1,2 @@
org.dkpro.tc.features.ngram.WordNGram ngramMinN=1 uniqueFeatureExtractorName=WordNGram26187166286928 ngramUseTopK=2000 ngramMaxN=3
org.dkpro.tc.features.ngram.CharacterNGram ngramMinN=2 uniqueFeatureExtractorName=CharacterNGram26187166531803 ngramUseTopK=2000 ngramMaxN=5

@ -0,0 +1,3 @@
#Feature mode used to train this model
#Mon Sep 25 16:06:42 CEST 2023
featureMode=unit

@ -0,0 +1,3 @@
#Learning mode used to train this model
#Mon Sep 25 16:06:42 CEST 2023
learningMode=singleLabel

@ -0,0 +1 @@
org.dkpro.tc.ml.weka.WekaAdapter

@ -0,0 +1 @@
targetLocation=lucene

@ -0,0 +1 @@
sourceLocation=lucene

@ -0,0 +1,3 @@
#Version of DKPro TC used to train this model
#Mon Sep 25 16:06:42 CEST 2023
TcVersion=1.1.0

@ -0,0 +1,3 @@
#Bipartition threshold used to train this model (only multi-label classification)
#Mon Sep 25 16:08:16 CEST 2023
threshold=0.5

@ -0,0 +1,2 @@
org.dkpro.tc.features.ngram.WordNGram ngramMinN=1 uniqueFeatureExtractorName=WordNGram26281023249414 ngramUseTopK=10000 ngramMaxN=3
org.dkpro.tc.features.ngram.CharacterNGram ngramMinN=2 uniqueFeatureExtractorName=CharacterNGram26281023259954 ngramUseTopK=10000 ngramMaxN=5

@ -0,0 +1,3 @@
#Feature mode used to train this model
#Mon Sep 25 16:08:16 CEST 2023
featureMode=unit

@ -0,0 +1,3 @@
#Learning mode used to train this model
#Mon Sep 25 16:08:16 CEST 2023
learningMode=singleLabel

@ -0,0 +1 @@
org.dkpro.tc.ml.weka.WekaAdapter

@ -0,0 +1 @@
targetLocation=lucene

@ -0,0 +1 @@
sourceLocation=lucene

@ -0,0 +1,3 @@
#Version of DKPro TC used to train this model
#Mon Sep 25 16:08:16 CEST 2023
TcVersion=1.1.0

@ -0,0 +1,3 @@
#Bipartition threshold used to train this model (only multi-label classification)
#Mon Sep 25 16:09:44 CEST 2023
threshold=0.5

@ -0,0 +1,2 @@
org.dkpro.tc.features.ngram.WordNGram ngramMinN=1 uniqueFeatureExtractorName=WordNGram26374503486335 ngramUseTopK=8000 ngramMaxN=3
org.dkpro.tc.features.ngram.CharacterNGram ngramMinN=2 uniqueFeatureExtractorName=CharacterNGram26374503492723 ngramUseTopK=8000 ngramMaxN=5

@ -0,0 +1,3 @@
#Feature mode used to train this model
#Mon Sep 25 16:09:44 CEST 2023
featureMode=unit

@ -0,0 +1,3 @@
#Learning mode used to train this model
#Mon Sep 25 16:09:44 CEST 2023
learningMode=singleLabel

@ -0,0 +1 @@
org.dkpro.tc.ml.weka.WekaAdapter

@ -0,0 +1 @@
targetLocation=lucene

@ -0,0 +1 @@
sourceLocation=lucene

@ -0,0 +1,3 @@
#Version of DKPro TC used to train this model
#Mon Sep 25 16:09:44 CEST 2023
TcVersion=1.1.0

@ -0,0 +1,3 @@
#Bipartition threshold used to train this model (only multi-label classification)
#Mon Sep 25 16:11:16 CEST 2023
threshold=0.5

@ -0,0 +1,2 @@
org.dkpro.tc.features.ngram.WordNGram ngramMinN=1 uniqueFeatureExtractorName=WordNGram26462656536859 ngramUseTopK=8000 ngramMaxN=3
org.dkpro.tc.features.ngram.CharacterNGram ngramMinN=2 uniqueFeatureExtractorName=CharacterNGram26462656543898 ngramUseTopK=8000 ngramMaxN=5

@ -0,0 +1,3 @@
#Feature mode used to train this model
#Mon Sep 25 16:11:16 CEST 2023
featureMode=unit

@ -0,0 +1,3 @@
#Learning mode used to train this model
#Mon Sep 25 16:11:16 CEST 2023
learningMode=singleLabel

@ -0,0 +1 @@
org.dkpro.tc.ml.weka.WekaAdapter

@ -0,0 +1 @@
targetLocation=lucene

@ -0,0 +1 @@
sourceLocation=lucene

@ -0,0 +1,3 @@
#Version of DKPro TC used to train this model
#Mon Sep 25 16:11:16 CEST 2023
TcVersion=1.1.0

@ -0,0 +1,3 @@
#Bipartition threshold used to train this model (only multi-label classification)
#Mon Sep 25 16:12:56 CEST 2023
threshold=0.5

@ -0,0 +1,2 @@
org.dkpro.tc.features.ngram.WordNGram ngramMinN=1 uniqueFeatureExtractorName=WordNGram26554722299725 ngramUseTopK=5000 ngramMaxN=3
org.dkpro.tc.features.ngram.CharacterNGram ngramMinN=2 uniqueFeatureExtractorName=CharacterNGram26554722305026 ngramUseTopK=5000 ngramMaxN=5

@ -0,0 +1,3 @@
#Feature mode used to train this model
#Mon Sep 25 16:12:56 CEST 2023
featureMode=unit

@ -0,0 +1,3 @@
#Learning mode used to train this model
#Mon Sep 25 16:12:56 CEST 2023
learningMode=singleLabel

@ -0,0 +1 @@
org.dkpro.tc.ml.weka.WekaAdapter

Some files were not shown because too many files have changed in this diff.