Workplan

These are the milestones for the now closed Sámi language technology (Disamb) project.

Autumn 2006

The Disamb project is officially over at the end of 2006. The remaining tasks are:

  • Improve the Northern Sámi grammatical analysers, with respect to both lexical coverage and grammatical analyses
  • Improve the Lule Sámi grammatical analysers to a certain extent
  • Collect more corpus texts
  • Improve the graphical interface for texts, and make analysed texts available
  • Optimize the preprocessing and postprocessing tools for coverage and speed

The original milestones

In our application we stated the following milestone list for the project:

No. Milestone                                 Start   Finish
1   Language-independent preprocessing        2004-1  2004-1
2   Infrastructure for disambiguation         2004-1  2004-2
3   Corpus interface - prototype              2004-1  2004-4
4   Groundwork for Northern Sámi              2004-1  2004-4
5   Northern Sámi disambiguation - prototype  2004-1  2005-2
6   Revise morphological analysis programs    2004-1  2006-4
7   Groundwork for Lule Sámi                  2004-3  2005-4
8   Lule Sámi disambiguation - prototype      2004-4  2005-4
9   Parallel text corpora - prototype         2005-1  2005-2
10  Corpus interface - beta                   2005-1  2005-4
11  Northern Sámi disambiguation - beta       2005-3  2005-4
12  Parallel text corpora - beta              2005-3  2006-1
13  Lule Sámi disambiguation - done           2005-4  2006-2
14  Northern Sámi disambiguation - done       2006-1  2006-4
15  Corpus interface - done                   2006-1  2006-4
16  Parallel text corpora - done              2006-2  2006-4

Status quo, June 2006

The language-independent preprocessor
We have a language-independent preprocessor (preprocess) and a morphology-to-disambiguation processor (lookup2cg). The language recognizer is in place, as are tools for processing XML corpora.
Corpus interface
We use a corpus interface developed at the University of Oslo.
Disambiguator
We disambiguate Northern Sámi with an ambiguity rate better than average for statistical disambiguators, but we still lag behind other constraint grammars. The Lule Sámi disambiguator has an ambiguity rate of 1.5 readings per wordform (compared to an input of 2.0 readings per wordform), so it will not reach the level of the Northern Sámi one.
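
The ambiguity rate quoted above is the average number of readings left per wordform. A minimal sketch of that arithmetic, assuming the analyses are available as one list of readings per wordform; the function name, the data layout and the toy readings are all hypothetical and are not taken from the project's tools:

```python
def ambiguity_rate(cohorts):
    """Return the average number of readings per wordform.

    `cohorts` is a list with one entry per wordform; each entry is the
    list of readings still attached to that wordform (assumed layout).
    """
    if not cohorts:
        raise ValueError("need at least one wordform")
    return sum(len(readings) for readings in cohorts) / len(cohorts)


# Hypothetical toy data: four wordforms with placeholder reading labels.
before = [["r1", "r2"], ["r1", "r2", "r3"], ["r1"], ["r1", "r2"]]
# The same wordforms after some readings have been removed by disambiguation.
after = [["r1"], ["r1", "r2"], ["r1"], ["r1", "r2"]]

print(ambiguity_rate(before))  # 8 readings / 4 wordforms
print(ambiguity_rate(after))   # 6 readings / 4 wordforms
```

With this toy data the input has 2.0 readings per wordform and the output 1.5, the same proportions as the Lule Sámi figures; a fully disambiguated text would give 1.0.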

Open issues

Derivation in the lookup2cg preprocessor (Saara, Trond)
There are open issues here, but basically, lookup2cg works well.
Systematic testing of the morphology of the parser
The sme.fst parser still has open issues, but over the last year we have systematically gone through the basic inflection for nouns, adjectives and verbs.
Gathering of corpus texts
This is done in cooperation with the Divvun project. The work is well underway, and the infrastructure is in use. Some of the major text producers are still an open issue, but we have collected approximately a quarter of the body of electronically available text.

Status quo, June 2005

The language-independent preprocessor
This goal is fulfilled, as we have a revised language-independent preprocessor (preprocess) and a morphology-to-disambiguation processor (lookup2cg). There is still work to do on language-specific preprocessing (not mentioned in the list). This work will in practice run in parallel with other work.
Infrastructure for disambiguation
This was already in place in 2003.
Corpus interface
Scheduled to be finished in 2004-4. Delayed; new goal: 2005-3. This will be done in cooperation with Divvun and with the University of Oslo.
Disambiguation prototype for sme
The disambiguator is well beyond the prototype level.

Status quo, June 2005

Work on Northern Sámi grammar is on schedule. Work on Lule Sámi is on schedule as well, as we now have a lexicon and a working morphological analyser. The goal for Lule Sámi is to have a working disambiguator for comparison with the Northern Sámi one, and this we now have. Infrastructure for corpora is ready, but we still do not have access to Lule Sámi texts apart from the New Testament.

Status quo, June 2004

Work on Northern Sámi grammar is on schedule. Work on Lule Sámi is lagging behind, since we still do not have a Lule Sámi lexicon. Work on Lule Sámi grammar started June 2005, without a lexicon. Infrastructure for parsing is on schedule; infrastructure for corpora is not, but it looks like it will be ready by the time the first texts are finished.

Milestones for the Divvun project

The milestones for the Divvun project can be found here (also in Northern Sámi, Lule Sámi, Norwegian and Finnish).

Last modified: 2008-02-01 12:48:55 +0100, by trond

by Trond Trosterud