Development tools
The project manipulates text, organized in lexicons, in many ways. All text is UTF-8 encoded.
UTF-8 setup
The tools have to be set up to handle the UTF-8 encoding.
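Before blaming a tool for garbled characters, it can help to confirm that the file itself is valid UTF-8. The following is a minimal sketch in Python, not part of the project's own tool set; the function names `is_valid_utf8` and `check_file` are made up for illustration.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def check_file(path: str) -> bool:
    """Read a file as raw bytes and report whether it is valid UTF-8."""
    # Open in binary mode so Python does not apply the locale's encoding.
    with open(path, "rb") as handle:
        return is_valid_utf8(handle.read())
```

A file saved as Latin-1 by a wrongly configured editor will typically fail this check as soon as it contains a letter such as á, because the single Latin-1 byte is not a complete UTF-8 sequence.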
Editors
To edit our source files we need a text editor that supports UTF-8 and can save the edited result as plain text.
We recommend Emacs and its modes to edit and manipulate our source files. If you work on your local machine, you may also use SubEthaEdit. It is easy to use, and you may let others edit the text while you watch, but it is not as advanced and time-saving as Emacs.
Documentation tools
To document our work we use Forrest and its native XML format for our documents. To edit these documents you need at least a text editor, and we recommend XMLMind, a Java-based XML editor that offers WYSIWYG editing of Forrest documents.
Morphological analysis
The project uses the following Xerox tools: twolc (for morphophonology), lexc (for morphology), xfst (for compiling the final transducer), and lookup (for analysis and generation).
The list below links to the Xerox documentation pages for these tools; these and other links are found here:
- twolc, for phonological and morphophonological rules
- lexc, for representing the Sami stems and the affix lexica
- xfst, the finite-state transducer tool, for integrating the different parts of the program, and for compiling the preprocessor
- tokenize, for tokenization and processing (note that at the moment we do not use tokenize for preprocessing, but Perl)
- lookup, an interface to the morphological analyser. NB: see also our lookup notes
The programs are started by typing e.g. lexc and then pressing the Enter key. The tools are documented in Karttunen / Beesley Finite-State Morphology: Xerox Tools and Techniques. The tools may also be installed on your own machine, be it on Mac OS X, Linux or Windows. One version of the software is found on the CD accompanying the book; for the latest version, ask Trond for reference.
Disambiguation tools
- Morphological disambiguation
- lookup2cg, a script to transform Xerox output to CG input
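To illustrate the kind of transformation lookup2cg performs, here is a minimal sketch in Python. It is not the project's lookup2cg script. It assumes that lookup prints one tab-separated line per analysis (word form, then lemma and tags joined by `+`), that a blank line separates word forms, and that unknown words arrive as `word`, `word`, `+?`. The function name `lookup_to_cg` and the sample analyses are invented for illustration.

```python
def lookup_to_cg(lookup_output: str) -> str:
    """Convert tab-separated lookup lines into CG-style cohorts.

    Assumed input:  wordform<TAB>lemma+Tag+Tag   (blank line between words)
    Output:         "<wordform>" followed by one indented reading per analysis
    """
    cg_lines = []
    new_cohort = True  # True until the first analysis of a word form is seen
    for line in lookup_output.splitlines():
        if not line.strip():
            new_cohort = True  # a blank line closes the current cohort
            continue
        fields = line.split("\t")
        wordform = fields[0]
        # Unknown words are assumed to come as "word<TAB>word<TAB>+?",
        # so joining the remaining fields yields "word+?".
        analysis = "".join(fields[1:])
        lemma, *tags = analysis.split("+")
        if new_cohort:
            cg_lines.append('"<{}>"'.format(wordform))
            new_cohort = False
        cg_lines.append('\t"{}" {}'.format(lemma, " ".join(tags)))
    return "\n".join(cg_lines)


if __name__ == "__main__":
    sample = "viessu\tviessu+N+Sg+Nom\n\nja\tja+CC\nja\tja+Adv\n"
    print(lookup_to_cg(sample))
```

Grouping on blank lines rather than on identical word forms keeps two adjacent occurrences of the same word as two separate cohorts, which is what a disambiguator needs.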
Analysis and testing
The easiest and most effective way to analyse and test text (although a little scary at first) is to use command-line tools. We have made a short introduction in English and a longer document in Norwegian on this topic. The introduction on how to use our parser is also an excellent introduction on how to combine the individual tools.
Our home-made tools, and adjustments of public tools
- The cgi-bin setup for making the parsers accessible on the web
- The web interface to our web demo
- Conversion scripts
- Testing tools
- Emacs for lexicon expansion
- Special Emacs modes
Last modified: $Date: 2008-11-05 18:52:54 +0100 (gask, 05 skáb 2008) $, by $Author: boerre $
by Trond Trosterud, Børre Gaup

