UIT The arctic university of Norway > Giellatekno
 

Webdict Compilation

This text documents the compilation process of the web dictionaries.

For the moment, we use the Apertium dictionary format.

The files

The online files themselves are stored in the relevant catalogue for displaying online files on victorio. They should be identical to the files in $GTHOME/apps/dicts/apertiumdict/, so keep the two catalogues in synch. The victorio files are for the net, and the ones in apps/dicts are for offline use.

The conversion procedure

The files are converted from the original dictionary files, $GTHOME/words/dicts/LANG1LANG2/, via the following procedure:

You need the Apertium dixtools:

  1. download it from wiki.apertium.org/wiki/Apertium-dixtools, and save it somewhere on your machine (outside the GTHOME tree)
  2. install it via the following commands:
    1. cd apertium-dixtools/
    2. ant jar
    3. sudo ant install

Then convert the Giellatekno xml format into Apertium xml format (we assume you stand in $GTHOME/words/dicts/scripts/):

(For an example of commands, see below)

  • collect all relevant entries in a single file (this script does only that)
    • in issue the following command (where INPUT_DIR is ../LANG1LANG2/src/):
    • java -Xmx2048m -Dfile.encoding=UTF8 net.sf.saxon.Transform -it:main collect-dict-parts.xsl inDir=INPUT_DIR/ > PATH_TO_OUTPUT_FILE
    • different filters for different language pair are possible/needed
    • filtering takes place both in collect-dict-parts.xsl and gtdict2simple-apertiumdix.xsl so before using these scripts check both your input and the code to assure the output you expect
  • convert the gt_format into the accepted apertium_xml format
    • The output file from last command shall be the INPUT_FILE here. Note that the lines between the first line and the <r> node must be deleted from the INPUT_FILE, otherwise the xsl script declares it "missing".
    • In the file gtdict2simple-apertiumdix.xsl, edit the variables inFile and outDir, so that the former points to the input file and the latter to a directory we make for this.
    • Then, issue the command:
    • java -Xmx2048m -Dfile.encoding=UTF8 net.sf.saxon.Transform -it:main gtdict2simple-apertiumdix.xsl
  • compile the file using the apertium tools (see above), with this command ( INPUT_FILE is last command's output):
    • apertium-dixtools dix2trie INPUT_FILE lr LANG1-LANG2-lr-trie.xml
  • update the file from the $GTHOME/apps/dicts/apertiumdict/ into the relevant catalogue for displaying online files on gtweb.

Here comes an exampls, again assuming you stand in $GTHOME/words/dicts/scripts/, and want to convert sme-fin:

java -Xmx2048m -Dfile.encoding=UTF8 net.sf.saxon.Transform -it:main collect-dict-parts.xsl inDir=../smefin/src > tull/out_simple-apertium/tull.xml 
see tull/out_simple-apertium/tull.xml 
    THEN DELETE THE LINES BETWEEN THE FIRST LINE AND THE <r> NODE
java -Xmx2048m -Dfile.encoding=UTF8 net.sf.saxon.Transform -it:main gtdict2simple-apertiumdix.xsl
tail tull/ut/tull.xml 
apertium-dixtools dix2trie tull/ut/tull.xml lr ../../../apps/dicts/apertium_dict/dics/fin-smn-lr-trie.xml