Workplan
These are the milestones for the now-closed Sámi language technology (Disamb) project.
Autumn 2006
The Disamb project officially ends at the end of 2006. The remaining tasks are:
- Improve the Northern Sámi grammatical analysers, with respect to both lexical coverage and grammatical analyses
- Improve the Lule Sámi grammatical analysers to a certain extent
- Collect more corpus texts
- Improve the graphical interface for texts, and make analysed texts available
- Optimize the preprocessing and postprocessing tools for coverage and speed
The original milestones
In our application we stated the following milestone list for the project:
| Milestone | Start | Finish |
|---|---|---|
| 1 Language independent preprocessing | 2004-1 | 2004-1 |
| 2 Infrastructure for disambiguation | 2004-1 | 2004-2 |
| 3 Corpus interface - prototype | 2004-1 | 2004-4 |
| 4 Ground work for Northern Sámi | 2004-1 | 2004-4 |
| 5 Northern Sámi disambiguation - prototype | 2004-1 | 2005-2 |
| 6 Revise morphological analysis programs | 2004-1 | 2006-4 |
| 7 Ground work for Lule Sámi | 2004-3 | 2005-4 |
| 8 Lule Sámi disambiguation - prototype | 2004-4 | 2005-4 |
| 9 Parallel text corpora - prototype | 2005-1 | 2005-2 |
| 10 Corpus interface - beta | 2005-1 | 2005-4 |
| 11 Northern Sámi disambiguation - beta | 2005-3 | 2005-4 |
| 12 Parallel text corpora - beta | 2005-3 | 2006-1 |
| 13 Lule Sámi disambiguation - done | 2005-4 | 2006-2 |
| 14 Northern Sámi disambiguation - done | 2006-1 | 2006-4 |
| 15 Corpus interface - done | 2006-1 | 2006-4 |
| 16 Parallel text corpora - done | 2006-2 | 2006-4 |
Status quo, June 2006
- The language-independent preprocessor
  - We have a language-independent preprocessor (preprocess) and a morphology-to-disambiguation processor (lookup2cg). The language recognizer is in place, as are tools for processing XML corpora.
- Corpus interface
  - We use a corpus interface developed at the University of Oslo.
- Disambiguator
  - We disambiguate Northern Sámi with an ambiguity rate better than the average for statistical disambiguators, but we still lag behind other constraint grammars. The Lule Sámi disambiguator has an ambiguity rate of 1.5 readings per wordform (compared to an input of 2.0 readings per wordform); it will thus not reach the level of the Northern Sámi one.
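The ambiguity rate quoted above is the mean number of readings left per wordform. As a minimal sketch of how such a figure can be computed, assuming each wordform is paired with its list of remaining readings: the function name `ambiguity_rate`, the data layout and the sample data are hypothetical and chosen only to reproduce the 2.0 and 1.5 figures; they are not taken from the project's tools or corpora.

```python
from typing import List, Sequence, Tuple

# A cohort pairs one wordform with the readings still attached to it.
Cohort = Tuple[str, List[str]]


def ambiguity_rate(cohorts: Sequence[Cohort]) -> float:
    """Return the mean number of readings per wordform."""
    if not cohorts:
        raise ValueError("no wordforms to measure")
    total_readings = sum(len(readings) for _, readings in cohorts)
    return total_readings / len(cohorts)


# Hypothetical input from the morphological analyser: 8 readings, 4 wordforms.
before = [
    ("wordform1", ["reading A", "reading B"]),
    ("wordform2", ["reading A", "reading B", "reading C"]),
    ("wordform3", ["reading A"]),
    ("wordform4", ["reading A", "reading B"]),
]

# Hypothetical output after disambiguation: 6 readings, 4 wordforms.
after = [
    ("wordform1", ["reading A"]),
    ("wordform2", ["reading A", "reading B"]),
    ("wordform3", ["reading A"]),
    ("wordform4", ["reading A", "reading B"]),
]

print(ambiguity_rate(before))  # 2.0
print(ambiguity_rate(after))   # 1.5
```

A rate of 1.0 would mean that every wordform is fully disambiguated. The measure says nothing about whether the remaining readings are the correct ones.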
Open issues
- Derivation in the lookup2cg preprocessor (Saara, Trond)
  - There are open issues here, but basically, lookup2cg works well.
- Systematic testing of the morphology of the parser
  - The sme.fst parser still has open issues, but over the last year we have systematically gone through the basic inflection for nouns, adjectives and verbs.
- Gathering of corpus texts
  - This is done in cooperation with the Divvun project. The work is well underway, and the infrastructure is in use. Some of the major text producers are still an open issue, but we have collected approximately 1/4 of the body of electronically available text.
Status quo, June 2005
- The language-independent preprocessor
  - This goal is fulfilled, as we have a revised language-independent preprocessor (preprocess) and a morphology-to-disambiguation processor (lookup2cg). There is still work to do on language-specific preprocessing (not mentioned in the list). This work will in practice run in parallel with other work.
- Infrastructure for disambiguation
  - This was in place already in 2003.
- Corpus interface
  - Scheduled to be finished in 2004-4. Delayed; new goal: 2005-3. This will be done in cooperation with Divvun and with the University of Oslo.
- Disambiguation prototype for sme
  - The disambiguator is well beyond the prototype level.
Status quo, June 2005
Work on Northern Sámi grammar is on schedule. Work on Lule Sámi is on schedule as well, as we now have a lexicon and a working morphological analyser. The goal for Lule Sámi is to have a working disambiguator as a comparison to the Northern Sámi one, and this we have. Infrastructure for corpora is ready, but we still do not have access to Lule Sámi texts, apart from the New Testament.
Status quo, June 2004
Work on Northern Sámi grammar is on schedule. Work on Lule Sámi is lagging behind, since we still haven't got any Lule Sámi lexicon. Work on Lule Sámi grammar started June 2005, without a lexicon. Infrastructure for parsing is on schedule; infrastructure for corpora is not, but it looks like it will be ready when the texts start being finished.
Milestones for the Divvun project
The milestones for the Divvun project can be found here (also in Northern Sámi, Lule Sámi, Norwegian and Finnish).
Last modified: 2008-02-01 12:48:55 +0100, by Trond Trosterud

