The pipeline for the corpus analysis

Pre analysis

Building tokenisers

Tokenisers are built once a day on a machine other than stallo. The comments in $GTHOME/tools/stallo-analysis/ detail what is being done.

The tokeniser build jobs are kicked off by this cronjob:

14 23 * * * export GTHOME=$HOME/gtsvn && svn up -q $GTHOME/tools/stallo-analysis/ && $GTHOME/tools/stallo-analysis/
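
The five leading fields of a crontab entry are minute, hour, day of month, month and day of week, so "14 23 * * *" means 23:14 every day. As a minimal sketch, the hypothetical helper below (describe_cron is not part of the stallo-analysis scripts) labels those fields for any entry:

```shell
#!/bin/bash
# describe_cron: label the five schedule fields of a crontab entry.
# Hypothetical helper for illustration only.
describe_cron() {
    set -f          # disable globbing so '*' stays literal while splitting
    set -- $1       # split the entry on whitespace into positional parameters
    set +f
    echo "minute=$1 hour=$2 day-of-month=$3 month=$4 day-of-week=$5"
}

describe_cron "14 23 * * * some-command"
# prints: minute=14 hour=23 day-of-month=* month=* day-of-week=*
```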

The hfst tools on this machine are updated at least once a week from the nightly apertium repo.

Fetch converted corpus and kick off analysis

The analysis is kicked off by this cronjob on stallo:

00 01 * * * . $HOME/.bash_profile && ls -lR $HOME/.local/share/giella && svn up $GTHOME/tools && $GTHOME/tools/stallo-analysis/

The script $GTHOME/tools/stallo-analysis/ fetches the converted corpus files and dispatches a separate analysis job for each combination of corpus, language and type (the type being either xfst or hfst).
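
The dispatch pattern described above can be sketched as three nested loops. This is an illustration only, not the actual script: the function name list_jobs, the job-name format and the language codes in the usage line are assumptions, and where the real script would submit a job this sketch merely prints a name.

```shell
#!/bin/bash
# list_jobs: print one job name per corpus, language and analyser type.
# Hypothetical sketch; the real script submits jobs instead of printing names.
list_jobs() {
    local corpora="$1" langs="$2"
    local corpus lang type
    for corpus in $corpora; do
        for lang in $langs; do
            for type in xfst hfst; do
                echo "analyse-${corpus}-${lang}-${type}"
            done
        done
    done
}

# Example: two corpora and two (assumed) language codes give 2 x 2 x 2 = 8 jobs.
list_jobs "freecorpus boundcorpus" "sme smj"
```

Keeping one job per combination means a failure in, say, the hfst analysis of one language does not block the others.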


The analysis is done on the boerre account on stallo. The comments in $GTHOME/tools/stallo-analysis/ detail what is being done.

Post analysis

Analysed files are sent to gtweb by this cronjob:

00 01 * * * . $HOME/.bash_profile && ls -lR $HOME/.local/share/giella && svn up $GTHOME/tools && $GTHOME/tools/stallo-analysis/

The script $GTHOME/tools/stallo-analysis/ details what is being done.

Files and compilers

These are the repositories found on stallo:

  • The langtech repo: ~boerre/svnrepos/main/
  • The freecorpus repo: ~boerre/svnrepos/freecorpus/
  • The boundcorpus repo: ~boerre/svnrepos/boundcorpus/

The xfst, hfst, vislcg3 and CorpusTools tools are installed in ~boerre/bin on stallo.

hfst: Build commands

cd ~/hfst/
git pull
module load autoconf/2.69
module load automake/1.13.1
module load gcc/4.9.1
./configure --enable-all-tools --with-unicode-handler=glib --prefix=/home/boerre --with-readline
make install

vislcg3: Build commands

module load CMake/3.6.2-foss-2016b
module load Boost/1.61.0-foss-2016b
module load ICU
cd ~/svnrepos/vislcg3/
svn up
cmake \
    -DCMAKE_INCLUDE_PATH=/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/include/ \
    -DCMAKE_LIBRARY_PATH=/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/lib/ \
    -DCMAKE_EXE_LINKER_FLAGS=-L/global/hds/software/cpu/eb3/ICU/61.1-iomkl-2018a/lib .
make install

CorpusTools: Install commands

cd ~/svnrepos/main/tools/CorpusTools/
svn up
python setup.py develop --user

Environment variables on stallo

These environment variables are set on stallo to make hfst and vislcg3 work as expected:

export PATH=$HOME/bin:$PATH
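
To confirm that the PATH setting has taken effect, one can check that the tools resolve from the shell. The sketch below is a hypothetical helper (check_tools is not part of the repository) and assumes the installed binaries are named hfst-lookup and vislcg3; adjust the list to whatever is actually in ~boerre/bin.

```shell
#!/bin/bash
# check_tools: report every named tool that cannot be found on PATH.
# Returns non-zero if at least one tool is missing.
check_tools() {
    local missing=0 tool
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# The tool names here are assumptions about what is installed.
check_tools hfst-lookup vislcg3 || echo "check that \$HOME/bin is on PATH"
```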