How To Add A New Language

Adding a new language to the infrastructure

Languages may be added to one of these three directories:

  1. $GTHOME/langs (for languages we work on, or which have full-size transducers)
  2. $GTHOME/experiment-langs (experiments of different kinds)
  3. $GTHOME/startup-langs (when we do not know whether the work will lead to anything or not)

There are also a couple of directories for a few languages with a closed license and restricted access. Their structure is identical to the open ones above.

How to add a new language

The steps below show how to add new languages to $GTHOME/langs/. To add to one of the other two directories, replace "langs" with "experiment-langs" or "startup-langs".

  1. cd $GTHOME/langs
  2. ./autogen.sh
  3. ./configure
  4. make NEW_LANGS=LANGCODE

where LANGCODE is the 3-letter ISO 639-3 code for the language you want to add. You can also add more than one language by separating the codes with spaces, and putting double quotes around the whole list:

  • make NEW_LANGS="LANGCODE1 LANGCODE2 LANGCODE3"

./autogen.sh and ./configure should only be required the first time.
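As a minimal sketch of the step above, the following helper checks that each code has the three-lowercase-letter shape of an ISO 639-3 code and then prints the corresponding make command. The function name new_langs_cmd is hypothetical and not part of the infrastructure; the codes sme and smj are used only as examples.

```shell
#!/bin/sh
# Hypothetical helper, not part of the infrastructure: verify that every
# argument is three lowercase letters, then print the make command to run
# from $GTHOME/langs.
new_langs_cmd() {
    # At least one language code is required.
    [ "$#" -gt 0 ] || { echo "usage: new_langs_cmd CODE..." >&2; return 1; }
    for code in "$@"; do
        case "$code" in
            [a-z][a-z][a-z]) ;;   # shape of an ISO 639-3 code
            *) echo "not a three-letter code: $code" >&2; return 1 ;;
        esac
    done
    # "$*" joins the codes with spaces, as NEW_LANGS expects.
    echo "make NEW_LANGS=\"$*\""
}

new_langs_cmd sme smj   # prints: make NEW_LANGS="sme smj"
```

The helper only checks the shape of each code; it does not check that the code is actually assigned in ISO 639-3.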

Result

The above command will create a new directory for the specified language, and populate it with the required makefiles, autoconf files and template source files. The files are automatically added (but not committed) to svn, and all relevant svn:ignore properties are set.

In addition, the new language code is added to $GTHOME/langs/Makefile.am and $GTHOME/langs/configure.ac. Once you have checked the new language and found that everything is OK, please svn commit the new language directory together with the configure.ac and Makefile.am files. If the changes in those two files are not committed, the new language will be left out of future updates to the infrastructure.
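The commit step could look roughly like the sketch below, run from $GTHOME/langs. The function name commit_new_lang, the DRY_RUN switch and the commit message are illustrative assumptions, not project conventions; only svn status and svn commit are standard Subversion commands.

```shell
#!/bin/sh
# Hypothetical wrapper: review and commit a newly added language directory
# together with the two updated top-level files.
# When DRY_RUN is set to a non-empty value, the svn commands are printed
# (without their original quoting) instead of being executed.
commit_new_lang() {
    lang=$1
    run=${DRY_RUN:+echo}   # "echo" in dry-run mode, empty otherwise
    $run svn status "$lang" configure.ac Makefile.am
    $run svn commit -m "Add new language $lang" "$lang" configure.ac Makefile.am
}

# Dry run with an example language code; nothing is committed.
DRY_RUN=1
commit_new_lang sme
```

Committing the three paths in a single svn commit keeps the new directory and the updated language lists in the same revision.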

Before you can start doing real work, one more set of preparations is needed:

cd LANGCODE
./autogen.sh
./configure

Now you can start editing the source files. Whenever you want to make sure everything compiles, run make. Run make check to ensure that all defined tests pass. Remember to update the test suites as you enhance the linguistic model!

Setting up the documentation page for the new language

The new language must also be added to the language documentation page. Here we document how to set up language documentation for new languages.

Adding a new language to the $GTBIG/prooftesting dir

The procedure is the same as above, except that a template collection can be added to the command:

  1. cd $GTHOME/langs
  2. $GTCORE/scripts/new-language.sh LANGCODE [TEMPLATECOLL]

where

  • TEMPLATECOLL (optional, usually automatically identified) is the name of the template collection to use; presently there are two template collections:
    • prooftesting - templates for populating directories for testing proofing tools, also organised according to language

This directory contains infrastructure for testing proofing tools for a number of languages. At present, only spellers are tested, but more tool types will be added in the future. The prerequisite for being able to test a speller is:

  • at least one speller gold-standard document for the targeted language, stored in $GTFREE/[pre]stable/goldstandard/converted/
  • a speller lexicon available in the test infra for that language
  • a command line speller for the lexicon(s) in the test infra
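The first prerequisite can be checked with a small sketch like the one below. It assumes that "[pre]stable" in the path above stands for the two directories prestable and stable; the function name find_goldstandard_dirs is hypothetical. It only reports which of the two gold-standard directories exist and does not inspect their contents.

```shell
#!/bin/sh
# Hypothetical check, assuming "[pre]stable" means "prestable" or "stable":
# print every gold-standard directory that exists under the given root and
# return non-zero if none is found.
find_goldstandard_dirs() {
    root=$1        # root of the $GTFREE checkout
    found=1        # becomes 0 as soon as one directory exists
    for stage in prestable stable; do
        dir="$root/$stage/goldstandard/converted"
        if [ -d "$dir" ]; then
            echo "$dir"
            found=0
        fi
    done
    return "$found"
}

# Falls back to the current directory when GTFREE is not set.
find_goldstandard_dirs "${GTFREE:-.}" || echo "no gold-standard directory found" >&2
```

Whether a directory actually holds documents for the targeted language still has to be verified by hand.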

The command to set up the basic testing infrastructure for a new language is exactly as above, with only one path adjustment:

  1. cd $GTBIG/prooftesting
  2. $GTCORE/scripts/new-language.sh LANGCODE

Result

A new language will be added to the testing infrastructure, ready to be populated with speller lexicons. Add new lexicons in the appropriate places (see the other languages for examples), and you should be ready to prepare the testing for your new language:

cd LANGCODE
./autogen.sh
./configure

If everything is OK, at least one of the speller tests listed at the end of the ./configure step should return yes, after which you can just run make. If there is test data available in $GTFREE, the test(s) will run, and after a while the test results are written to an XML file.

TODO:

  • add a build step to install the XML file in a location available to forrest, so that the test results can be viewed online as HTML and nice graphs