OmegaT

Installation

The OmegaT user documentation page covers both installation and use, and can be found here:

Adapting to our language pairs

The idea is to offer a set of ready-made folders, in two different formats:

  1. as a one-time download of a zip archive
  2. as an svn checkout (via TortoiseSVN on Windows), which gives access to updates

For the time being, the folders are at https://victorio.uit.no/biggies/trunk/mt/omegat/.

The idea is to put the following resources into the following subdirectories:

  • into dictionary: our dictionaries (OmegaT documentation)
  • into glossary: term lists, partly fad-marked pairs, partly from satni.org, cf. documentation
  • into tm: our parallel texts, with all files fused into one .tmx file (or one per theme), cf. documentation
  • into omegat: a file segmentation.conf, for doing sentence level segmentation, cf. documentation
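
Fusing several parallel-text files into one .tmx file, as described for the tm folder, could be sketched roughly as follows. This is a minimal Python sketch, not the project's actual tooling: the function name merge_tmx is hypothetical, and it assumes that every input is a well-formed TMX file with a <body> of <tu> elements and that keeping the header of the first file is acceptable. ElementTree does not preserve a DOCTYPE line, so it would have to be added back if a downstream tool requires it.

```python
import xml.etree.ElementTree as ET


def merge_tmx(inputs, output):
    """Copy every <tu> of the later files into the <body> of the first file.

    `inputs` is a list of TMX file paths, `output` the path of the merged file.
    Returns the number of translation units in the merged file.
    """
    paths = [str(p) for p in inputs]
    merged = ET.parse(paths[0])              # the first file supplies the header
    body = merged.getroot().find("body")
    for path in paths[1:]:
        other_body = ET.parse(path).getroot().find("body")
        body.extend(other_body.findall("tu"))  # append translation units as-is
    merged.write(str(output), encoding="utf-8", xml_declaration=True)
    return len(body.findall("tu"))
```

A call such as merge_tmx(["a.tmx", "b.tmx"], "tm/all.tmx") would write a single file for the tm folder; the file names here are placeholders. Duplicate translation units are not removed in this sketch.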

The source and target folders are given svn ignore status. As we develop the folders, we should determine which other files to ignore and which to share.

The language pairs

The language pairs are of three types:

  1. smesma, smesmj, smesmn: The main thing here is MT. Glossaries and dictionaries are less interesting, since they are already in the bidix and since we do not have an OmegaT-compatible tokenizer to look up inflected words.
  2. nobsme, nobsmj, nobsma, finsme, finsmn, finsms: Here we have no MT (except for finsme, which is not well developed). The focus is on glossaries (fad project, etc.) and translation memory.
  3. smasme, smjsme, smnsme, smenob: We ignore these in OmegaT for now. They are mainly made for understanding, not for text production.
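
The pair names above read as two three-letter ISO 639-3 codes written together, source language first. The following Python sketch makes that reading explicit. It is only an illustration: the helper names split_pair and describe_pair are hypothetical, and the source-first interpretation is an assumption based on how the pairs are described in the list.

```python
# ISO 639-3 codes that occur in the pair names on this page.
LANGUAGE_NAMES = {
    "sme": "North Sami",
    "smj": "Lule Sami",
    "sma": "South Sami",
    "smn": "Inari Sami",
    "sms": "Skolt Sami",
    "nob": "Norwegian Bokmål",
    "fin": "Finnish",
}


def split_pair(pair):
    """Split a six-letter pair name such as 'nobsme' into (source, target)."""
    if len(pair) != 6:
        raise ValueError(f"expected a six-letter pair name, got {pair!r}")
    return pair[:3], pair[3:]


def describe_pair(pair):
    """Return a readable 'source -> target' description of a pair name."""
    source, target = split_pair(pair)
    return f"{LANGUAGE_NAMES[source]} -> {LANGUAGE_NAMES[target]}"
```

Under these assumptions, describe_pair("nobsme") gives "Norwegian Bokmål -> North Sami", a pair of the second type, where the work is on glossaries and translation memory.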

Working plan

  1. Add glossaries
  2. Develop segmentation.conf
  3. Test and evaluate
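
For the first step, OmegaT reads glossaries as plain tab-separated text with a source term, a target term and an optional comment on each line. A term list could be written in that shape with a sketch like the one below. The function name write_glossary is hypothetical, the choice of a .utf8 file extension for the output is an assumption to check against the OmegaT documentation for the installed version, and the term pairs in the usage note are illustrative only.

```python
def write_glossary(entries, path):
    """Write (source, target, comment) triples as one tab-separated line each.

    Returns the number of entries written. Fields must not contain tabs or
    newlines, because those characters delimit columns and rows.
    """
    lines = []
    for source, target, comment in entries:
        for field in (source, target, comment):
            if "\t" in field or "\n" in field:
                raise ValueError(f"field contains a tab or newline: {field!r}")
        lines.append("\t".join((source, target, comment)))
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

For example, write_glossary([("skole", "skuvla", ""), ("språk", "giella", "")], "glossary/nobsme.utf8") would produce a two-line glossary for the nobsme pair; the file name is a placeholder.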

Future plans

Adding more resources:

  • Analysers for lemmatisation in dictionary lookup
  • Proofing tools

HFST Tokenizer

You can get the HFST tokenizer precompiled. You need to download:

Then put them into ~/Library/Preferences/OmegaT/plugins (create the directory if it does not exist).

The HFST tokenizer source is on GitHub.

FSTs are looked up in the OmegaT preferences folder ~/Library/Preferences/OmegaT/spelling, and each filename should end in -<lang>.hfstol.
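
To check which files in the spelling folder match this naming convention for a given language, a small Python sketch like the following may help. It only mirrors the filename rule stated above and is not the plugin's own lookup code. The function name find_fsts is hypothetical, and which language code the plugin expects in place of <lang> is not specified here, so the code passed in is an assumption.

```python
from pathlib import Path

# Folder named on this page (the macOS location of the OmegaT preferences).
SPELLING_DIR = Path.home() / "Library" / "Preferences" / "OmegaT" / "spelling"


def find_fsts(lang, folder=SPELLING_DIR):
    """Return the files in `folder` whose names end in -<lang>.hfstol."""
    folder = Path(folder)
    if not folder.is_dir():
        return []                      # nothing installed yet
    suffix = f"-{lang}.hfstol"
    return sorted(
        path for path in folder.iterdir()
        if path.is_file() and path.name.endswith(suffix)
    )
```

An empty result for a language suggests that the file is either missing or does not follow the -<lang>.hfstol pattern.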