UIT The arctic university of Norway > Giellatekno
 

Meeting_2006-01-05

Meeting setup

  • Date: 05.01.2006
  • Time: 14.00 Norw. time
  • Place: Göttingen, Helsinki, Tromsø
  • Tools: iChat, SubEthaEdit, phone

Agenda

  1. Opening, agenda review
  2. Background for this meeting (Trond)
  3. Original plan and status quo (Trond)
  4. Review
    1. Review, corpus (Saara)
    2. Review, basic sme work, lexicon (Ilona)
    3. Review, basic sme work, morphophonology (Trond)
    4. Review, basic sme work, syntax (Linda)
  5. Workplan for 2006
  6. Other projects
  7. Other issues
  8. Next meeting

1. Opening, agenda review, participants

Opened at 14: 09.

Present: All

Absent: None

Main secretary: All except Ilona

Agenda accepted as is?

2.Background

2. Original plan and status quo (Trond)

This was our original plan:

                                        Start	Finish    Status quo
1 Språkuavhengig preprosessering	    2004-1	2004-1    ok
2 Infrastruktur for disambiguering	    2004-1	2004-2    ok
3 Korpusgrensesnitt - prototyp	        2004-1	2004-4    (2004-3/2005-3)
4 Grunnarbeid for nordsamisk	        2004-1	2004-4    ok
5 Nordsamisk disambiguering - prototype	2004-1	2005-2    ok
6 Revidere morfologiske analyseprogram	2004-1	2006-4F   ongoing, with divvun
7 Grunnarbeid for lulesamisk	        2004-3	2005-4    L
8 Lulesamisk disambiguering - prototyp	2004-4	2005-4    L
9 Parallelltekstkorpora - prototyp	    2005-1	2005-2    late, 2006-1?
10 Korpusgrensesnitt - beta	            2005-1	2005-4    late, 2006-2?
11 Nordsamisk disambiguering - beta  	2005-3	2005-4    (2005-1)
12 Parallelltekstkorpora - beta	        2005-3	2006-1    late
13 Lulesamisk disambiguering - ferdig	2005-4	2006-2    L
14 Nordsamisk disambiguering - ferdig	2006-1	2006-4F   L
15 Korpusgransesnitt - ferdig	        2006-1	2006-4F    
16 Parallelltekstkorpora - ferdig	    2006-2	2006-4F    

Comments to Lule Sámi (L):

Lule Sámi is late, since we still have not got any lexicon. Divvun will make a Lule Sámi speller, and we participate in the basic morphological work, but the transducer will be finished so late in our project period, that we will not spend much time on developing a parser for it this year. Our board (the KUNSTI program board) has suggested that we omit Lule Sámi if the full smj.fst analyser gets much more delayed than it already is. The role of Lule Sámi will not become more than being a test ground for alternative approaches to North Sámi, with the possible exception of a sme/smj bilingual rawtext corpus.

3.Review

Here, we give a short, subjective statement of what the status quo is and what we see as the main tasks. Planning proper comes at the next point of the agenda.

3.1. Review, corpus (Saara)

Graphical interface is basical ready, although we haven't been in contact with Lars Nygård at UiO for a while. Also, we still have too little text. Saara is working with the corpus interface to collect texts.

With the parallel text corpus, nothing has been done. We have at least bibles. Bibles are aligned, i.e. the verses have the same numbers. For other texts, we need a text aligner.

3.2. Review, basic sme work, lexicon (Ilona)

Ilona works on missing word lists from the xml-based corpora. The bulk of the missing list contains typos, i.e. our lexical coverage starts to be good (judging from non-hapaxes). She now goes through rest lists again, to see whether they are covered by new versions of the

3.3. Review, basic sme work, morphophonology (Trond)

The analyser is constantly improved. The generator now wildly overgenerates. It is not so bad for our corpus analyser project, but for other applications it is very bad.

3.4. Review, basic sme work, syntax (Linda)

Linda has been owrking on single words, e.g. oktii, disambiguating most, but not all cases. Also she has been working on numerals, the problem is that they can vary in meaning. The complete tag list is needed for the corpus interface. The documentation of the tag list is intended (at least) to be up to date at any point.

We will need regular expressions (such as date) for numerals in the lexc format, this will be looked into, especially by Linda (suggestions) and Saara (comments).

abbreviations: links to non-abbreviated forms would be interesting

example in the lexicon: geogr ab-dot-adj ; !geografiijas

Four large open issues: locativepl/comitativesg, gen/acc, locatives/pxsg3, numerals

gt/sme/corp/examples/
ex-ComLoc.txt  ex-PxLoc.txt   ex-buot.txt    ex-seammas.txt
ex-AdvComp.txt ex-GenAcc.txt  ex-Verb.txt    ex-maid.txt
ex-Aktio.txt   ex-Num.txt     ex-alit.txt    ex-oktii.txt

4. Workplan for 2006

Goal: Finish in time, with a disambiguator and a graphical search interface for grammatically analysed Sámi text, monolingual, and perhaps also bilingual.

4.1. Graphical corpus interface

There are 2 x 2 different tasks here:

  • Text search is both monolingual text search, and parallel text search
  • There is both plain text search and text search with grammatical analysis

Work plan:

Corpus

  • Saara. Also: Børre, Tomi, Sjur, Trond
  1. Text
  2. Graphical interface for:
    1. Monolingual text search
    2. Bilingual text search
    3. Monolingual grammatical search
    4. Bilingual grammatical search

Lexicon:

  • Ilona, Maaren.

Morphophonology:

  • Trond, Ilona, Thomas, etc.

Disambiguation:

  • Writing rules: Linda, Trond
  • Testing: Linda, Trond, Ilona
  • Dependencies:

Monolingual rawtext corpus

  • Collected text
    • < Contracts, negotiations, <- contracts ok, negotiations only starting
    • < Corpus infrastructure <- still under construction, soon ok
  • Monolingual rawtext search interface
  • Monolingual rawtext corpus
    • < Collected text,
    • < Monolingual rawtext search interface <- done, but not installed (on our servers or in Oslo)

Bilingual rawtext corpus

  • Bilingual rawtext search interface
    • < Monolingual rawtext search interface
  • Bilingual rawtext corpus
    • < Collected bilingual text,
    • < Bilingual rawtext search interface,
    • < Text aligner (except for the bible)
  • Analysed text
    • < disambiguator
    • < collected text

Monolingual grammatical corpus

  • Monolingual grammatical search interface
    • < Monolingual rawtext search interface
    • < The disambiguator
    • < Disambugated and proofread text

Bilingual grammatical corpus

  • Monolingual grammatical search
    • < Disambiguator,
  • Bilingual grammatical search
    • < Disambugated and proofread text for the other language

4.2. Lexicon

  • Every new text genre added to the corpus will require large-scale additions to the lexicon.
    • Work: Ilona and Maaren
  • The name lexicon project will have consequences as well

We will need to differentiate between sentence adverbials and other adverbials.

  • measta Adv => not derived from adjectives
  • bures Adv => derived from adjectives (for rules such as .. IF NOT A* ... )

4.3. Morphophonology

  • This will proceed in cooperation with Divvun, we will not need separate planning for this.

4.4. Syntax

Tasks:

  • Revise the tagset
  • Improve recall and precisison on known issues
  • Extend?

Goal: Have a disambiguator that is good enough to min

4.5. Personal overview

What do do, when.

texts asap, the rest as outlined above: interface with goal at the end of the spring grammar, a deadline (not final one) at the end of the spring.

5.Other projects

5.1. Pedagogical programs

Eckhard's message to us:

Jeg har en glædelig julenyhed: Vi har fået et positivt svar på den nye PaNoLa-ansøgning, dog med en reduceret bevilgingsramme på 200.000, lige som sidst - så det bliver nok det samme aktivitetsniveau som i 2005. Vi må snart beslutte hvor og hvordan vi arrangerer workshoppene.

http://beta.visl.sdu.dk/visl/smi/

5.2. Text-to-speech

Lähettäjä Bjarte.Toftaker@hum.uit.no Päiväys: 4. januar 2006 09.09.32 GMT+01: 00 Vastaanottaja: Gulbrand.Alhaug@hum.uit.no, Trond.Trosterud@hum.uit.no Fakultetet er bedt om å prioritere søknadene om midler til bilateralt forskningssamarbeid. Dere har begge søkt, kan dere oversende en kopi av søknadene til meg, så vi får gjort prioriteringene. Haster en del, så det hadde vært fint om dere kunne fått gjort dette snarest.

6. Other issues

7. Next meeting

Ovdalaš juovllaid lea bidjon diehtu intranehttii ahte mis lea bargiidseminára ođđajagimánu 25. ja 26. beivviid Kárášjogas. Sávan ahte dii lehpet oaidnán dan dieđu. Prográmma maid biddjo intranehttii doaivumis boahtte vahkus. Like før jul ble det lagt ut informasjon på intranettet at vi har personalseminar 25. og 26. januar i Karasjok. Håper at dere har sett denne informasjonen. Programmet vil bli lagt ut på intranettet i løpet av neste uke.

One of the two final weeks of January.

Physical, in Helsinki or Tromsø

Ilona wants to know via sms.