
OmegaT

Installation

The OmegaT user documentation page covers both installation and usage, and can be found here:

Adapting to our language pairs

The idea is to offer a set of ready-made folders, in two different formats:

  1. as a one-time download of a zipped file archive
  2. as an svn checkout (via Tortoise on Windows), which gives access to updates

For the time being, the folders are at https://victorio.uit.no/biggies/trunk/mt/omegat/.

The idea is to put the following resources into the following subdirectories:

  • into dictionary: our dictionaries, cf. OmegaT documentation
  • into glossary: term lists, partly fad-marked pairs, partly from satni.org, cf. documentation
  • into tm: our parallel texts, with all files merged into one .tmx file (or one per theme), cf. documentation
  • into omegat: a file segmentation.conf, for sentence-level segmentation, cf. documentation
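As a quick sanity check of a downloaded or checked-out folder, the layout above can be verified with a few lines of Java. This is only a sketch: the class name ProjectLayoutCheck and the method missingFolders are made up for illustration, and the check only looks for the four subdirectories named in the list, not for their contents.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ProjectLayoutCheck {
    // The four resource subdirectories named in the list above.
    static final List<String> EXPECTED =
            Arrays.asList("dictionary", "glossary", "tm", "omegat");

    /** Returns the expected subdirectories that are missing under projectRoot. */
    static List<String> missingFolders(Path projectRoot) {
        List<String> missing = new ArrayList<>();
        for (String name : EXPECTED) {
            if (!Files.isDirectory(projectRoot.resolve(name))) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // The project folder is passed as the first argument (default: current directory).
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<String> missing = missingFolders(root);
        System.out.println(missing.isEmpty()
                ? "All expected folders are present"
                : "Missing folders: " + missing);
    }
}
```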

The source and target folders are given svn ignore status. As we develop the folders, we should determine which other files to ignore and which to share.

The language pairs

The language pairs are of three types:

  1. smesma, smesmj, smesmn: The main focus here is MT. Glossaries and dictionaries are less interesting, since they are already in the bidix and since we do not have an OmegaT-compatible tokenizer to look up inflected words.
  2. nobsme, nobsmj, nobsma, finsme, finsmn, finsms: Here we have no MT (except for finsme, which is not much developed). The focus here is on glossaries (fad project, etc.) and translation memory.
  3. smasme, smjsme, smnsme, smenob: We ignore these in OmegaT for now. They are mainly made for understanding, not for text production.

Working plan

  1. Add glossaries
  2. Develop segmentation.conf
  3. Test and evaluate

Future plans

Adding more resources:

  • Analysers for lemmatisation in dictionary lookup
  • Proofing tools

HFST Tokenizer

The HFST tokenizer is available precompiled. You need to download:

Put the downloaded files into ~/Library/Preferences/OmegaT/plugins (create the directory if it does not exist).

The HFST tokenizer source is on GitHub.

FSTs are looked up in the OmegaT preferences folder ~/Library/Preferences/OmegaT/spelling, and each filename should end in -<lang>.hfstol
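The naming convention can be illustrated with a small Java sketch that looks for a file ending in -<lang>.hfstol in a given folder. It is not the plugin's actual lookup code: the class name FstLocator, the method findTransducer, and the default language code sme are assumptions made for this example.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Optional;

public class FstLocator {
    /** Returns the first file in dir whose name ends in "-<lang>.hfstol", if any. */
    static Optional<Path> findTransducer(Path dir, String lang) throws IOException {
        if (!Files.isDirectory(dir)) {
            return Optional.empty();
        }
        String suffix = "-" + lang + ".hfstol";
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir)) {
            for (Path file : files) {
                if (file.getFileName().toString().endsWith(suffix)) {
                    return Optional.of(file);
                }
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) throws IOException {
        // Folder named in the text above; the language code is an example value.
        Path spelling = Paths.get(System.getProperty("user.home"),
                "Library", "Preferences", "OmegaT", "spelling");
        String lang = args.length > 0 ? args[0] : "sme";
        System.out.println(findTransducer(spelling, lang)
                .map(Path::toString)
                .orElse("no transducer found for " + lang));
    }
}
```

Running it with a language code as its argument shows whether a matching file is in place before OmegaT is started.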

Mac App Bundling

This section is only for reference.

HfstTokenizer can be compiled together with OmegaT and bundled into a Mac app. Follow these instructions:

  1. Download the OmegaT 3.x source code (not 4.x) here
  2. Get the appbundler used by OmegaT from here. It needs Java 1.7
    1. Install it into ~/.ant/lib/
    2. This appbundler needs JavaAppLauncher and jre-mac-root to be present in the OMEGAT_ASSETS_DIR folder, whose location is read from the environment variables. If they are not found in this folder, the build process looks one folder down from where you installed the OmegaT sources.
      1. jre-mac-root is a soft link to the folder where the Java Runtime libraries are found
  3. Download the thread-safe version of the hfst lookup library here and put it into OMEGAT_SRC_FOLDER/lib, where OMEGAT_SRC_FOLDER is the folder where you installed the OmegaT source files.
  4. Copy HfstTokenizer.java and HfstStemFilter.java to OMEGAT_SRC_FOLDER/src/org/omegat/tokenizer.
    1. Modify the package name in the files if needed
    2. Remove throws IOException from the getTokenStream method and correct the StandardTokenizer constructor call
    3. Diff HfstTokenizer.java against the 4.x HfstTokenizer.java (see diffs below)
  5. Add hfst-ol.jar to manifest-template.mf (details below)
  6. Add a lib/hfst-ol.jar entry to the Class-Path variable in manifest.mf
  7. Run ant mac in the OmegaT source folder, the one where you installed OmegaT

Diffs:

1c1
< package org.omegat.tokenizer;
---
> package no.divvun.tokenizer;
16a17
> import org.omegat.tokenizer.BaseTokenizer;
17a19
> import org.omegat.tokenizer.Tokenizer;
60,63c62,64
<           final boolean stopWordsAllowed) {
<     StandardTokenizer tokenizer = new StandardTokenizer(getBehavior(),
<                         new StringReader(strOrig));
<     // tokenizer.setReader(new StringReader(strOrig));
---
>           final boolean stopWordsAllowed) throws IOException {
>     StandardTokenizer tokenizer = new StandardTokenizer();
>     tokenizer.setReader(new StringReader(strOrig));
71,72c72
<       return new HfstStemFilter(new StandardTokenizer(getBehavior(),
<                     new StringReader(strOrig)), transducer);
---
>       return new HfstStemFilter(tokenizer, transducer);

1c1
< package org.omegat.tokenizer;
---
> package no.divvun.tokenizer;
11a12
> import org.apache.lucene.util.AttributeSource.State;
47,49c48,49
<       for (String s : res) {
< //      res.forEach(anal -> {
<         String stem = s.substring(0, s.indexOf("+"));
---
>       res.forEach(anal -> {
>         String stem = anal.substring(0, anal.indexOf("+"));
53c53
<       }
---
>       });
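Both versions of the loop take the stem to be everything before the first "+" in an analysis string. The standalone Java sketch below isolates that step. The class name StemExtractor, the method stemOf, and the sample analysis guolli+N+Sg+Nom are assumptions for illustration. The guard for strings without a "+" is an addition that is not in the diffs: without it, indexOf returns -1 and substring throws a StringIndexOutOfBoundsException.

```java
public class StemExtractor {
    /**
     * Returns the part of an analysis string before the first '+'.
     * If there is no '+', the whole string is returned unchanged
     * (this guard is an addition, not part of the original code).
     */
    static String stemOf(String analysis) {
        int plus = analysis.indexOf('+');
        return plus < 0 ? analysis : analysis.substring(0, plus);
    }

    public static void main(String[] args) {
        // Sample analysis string, assumed for illustration.
        System.out.println(stemOf("guolli+N+Sg+Nom")); // prints "guolli"
        // No tag separator: returned as is instead of throwing.
        System.out.println(stemOf("guolli"));          // prints "guolli"
    }
}
```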
    

Add the following for hfst-ol.jar to the template (manifest-template.mf):

Name: org.omegat.tokenizer.HfstTokenizer
OmegaT-Plugin: tokenizer
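To check that the manifest entry is well formed, the manifest can be parsed with the standard java.util.jar.Manifest class. The following is a minimal sketch: the class name PluginManifestCheck and its helper methods are made up for this example, and the manifest text in main is a shortened stand-in for the real manifest.mf, containing only the Class-Path entry and the plugin section mentioned above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.jar.Attributes;
import java.util.jar.Manifest;

public class PluginManifestCheck {
    /** Parses manifest text; every line, including the last, must end in a newline. */
    static Manifest parse(String text) throws IOException {
        return new Manifest(
                new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8)));
    }

    /** Returns the OmegaT-Plugin value declared for className, or null if absent. */
    static String pluginType(Manifest manifest, String className) {
        Attributes entry = manifest.getEntries().get(className);
        return entry == null ? null : entry.getValue("OmegaT-Plugin");
    }

    public static void main(String[] args) throws IOException {
        // Shortened stand-in for manifest.mf (assumed content, not the real file).
        String text = "Manifest-Version: 1.0\n"
                + "Class-Path: lib/hfst-ol.jar\n"
                + "\n"
                + "Name: org.omegat.tokenizer.HfstTokenizer\n"
                + "OmegaT-Plugin: tokenizer\n"
                + "\n";
        Manifest manifest = parse(text);
        System.out.println(
                pluginType(manifest, "org.omegat.tokenizer.HfstTokenizer")); // prints "tokenizer"
    }
}
```

The blank line before Name: matters: it ends the main section, so that the plugin declaration is read as a separate per-entry section.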