
Meeting_2006-12-11

Meeting setup

  • Date: 11.12.2006
  • Time: 09.30 Norw. time
  • Place: Where we are
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:46.

Present: Saara, Sjur, Thomas, Tomi, Trond

Absent: Børre, Maaren

Agenda accepted with additions to Other.

2. Updated task status since last meeting

Børre

  • contact authors who have already received the corpus licensing contract
    • Not done
  • continue work on script for automatic testing of the spell checker in Word
    • Some done
  • sma discussions with SD (with Sjur, Trond)
    • Not done
  • get an Intel Mac for testing Windows spellers; get a WinXP license from SD
    • Not done
  • update all forrest installations, including local patches
    • needs to be redone, due to a bug in the forrest tarball distributed on divvun.no
  • fix bugs!
    • Not done

Maaren

  • investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

  • finalize the server wrapper for the Xerox tools.
    • done
  • help Trond with some shell commands
  • re-analyze parallel files
    • faced some problems with tca2
  • consider implementing some new features to the corpus files
    • not finished.
  • add closed POSes to the paradigm gen, if needed.
    • done
  • investigate why possessives have disappeared from the paradigm generator
    • fixed
  • fix bugs!
    • fixed some

Sjur

  • name lexicon:
    • refactor SD-terms editor code
      • some more done
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
  • hire linguist and programmer
  • decide how to specify compounding behaviour info in the lexicon
  • sma discussions with SD (with Børre, Trond)
  • get an Intel Mac for testing Windows spellers; get a WinXP license from SD
  • publish corpus contracts and project infra on NoDaLi-sta
  • ask SD/Sig-Britt Persson about some of the South Sámi bible texts
    • done, will receive them soon
  • fix bugs!
  • other things:
    • SD employee seminar took a lot of time
    • also demonstrated the alpha spellers

Thomas

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
    • not done
  • decide how to specify compounding behaviour info in the lexicon
    • not done
  • fix bugs!
    • none assigned in the bug list

Tomi

  • add closed POS and clitics to PLX generation
    • not done
  • add derivations to the PLX generation
    • not done
  • add compound stems to the PLX generation
    • only nouns
  • make sure the normative generator is used when generating paradigms
    • done
  • investigate why possessives have disappeared from the paradigm generator
    • done
  • fix bugs!

Trond

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
    • Looked at them, meeting still not held.
  • get more sma texts
    • Awaiting talks and a memory stick.
  • decide how to specify compounding behaviour info in the lexicon
    • Worked on this one, the issue is still open.
  • sma discussions with SD (with Børre, Sjur)

3. Documentation

TODO:

  • update all forrest installations, including local patches (Børre)
    • done in Alta for Divvun, still needs a few changes in the file $FORREST_HOME/main/webapp/sitemap.xmap - all false should be changed to true in the i18nMatcher and LocaleAction configs.
  • either fix installations (Sjur), or create a new tarball (Børre)

4. Corpus gathering

Sjur talked to Pia in Alta about the sma bible texts, and she will send us the texts she has. The inclusion in the corpus must be accepted by Bibelselskapet, which has already done so for smj and sme (with a bound license).

TODO:

  • get sma Bible / NT texts (Trond)
    • Sjur discussed with Pia in Alta
  • Discussions with the Sámi Parliament about sma (Børre, Sjur, Trond)
  • ask SD/Sig-Britt Persson about some of the South Sámi bible texts (Sjur)
    • done

5. Corpus infrastructure

Aligner

There are problems with the aligner; Saara has talked to Børre, who will look into it. Saara has also obtained Uplug with the hunalign aligner, a command-line aligner. Earlier it did not support an anchor list, but the latest version does. It has arrived a bit late for the Disamb project, though.

TODO:

  • gather more parallel texts (Trond, Børre)
  • re-analyze parallel files using the command-line version (Saara)

6. Infrastructure

Xerox tools wrapped as servers

Vislcg can't be included in the server wrapping code, because it does not support the interactive operation mode that is needed. Another solution needs to be found.

TODO:

  • investigate why possessives have disappeared from the paradigm generator (Number, also a facultative (?) category, has not disappeared) (Saara, Tomi)
    • fixed
  • make sure the normative generator is used when generating paradigms (Tomi)
    • done
  • find a way of integrating vislcg as a server, or send a feature request to the vislcg developers (Saara)

7. Linguistics

Names and multilinguality

TODO:

  1. finish first version of the editing (Sjur)
  2. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  3. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  4. convert propernoun-($lang)-lex.txt to a file derived from the common xml files (Sjur, Tomi, Saara)
  5. start to use the xml file as source file
  6. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  7. merge placenames which are erroneously in different entries, e.g. Helsinki, Helsingfors, Helsset (linguists)
  8. publish the name lexicon on risten.no (Sjur)
  9. add missing parallel names for placenames (linguists)
  10. add informative links between first names like Niillas and Nils (linguists)

North Sámi

Nothing this week.

Lule Sámi

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
    • not done yet.

8. Name lexicon infrastructure

Decided in Tromsø:

  • add logging facilities to the interface
  • add option to download local copies of the lexicon files directly from the db
  • batch editing (change all entries in the found set), should later be enhanced to allow selection of exceptions (the found set minus deselected items)
  • tag for excluding/including a name from certain applications
  • future expansion: choose what info to display in the single language browser
  • display existing language entries when adding a new language to a record
  • add editor to change single, existing entries

Details can be found in the meeting memo.
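
The batch-editing behaviour listed above (change all entries in the found set, later minus any deselected items) could be sketched roughly as follows. This is only an illustration in Python, not the actual implementation, which is planned as XQueries and UI; the function name batch_edit, the entry ids, and the entry fields are all hypothetical.

```python
def batch_edit(entries, found_ids, deselected_ids, change):
    """Apply `change` to every entry in the found set except deselected ones.

    entries        -- dict mapping an entry id to an entry (hypothetical shape)
    found_ids      -- ids matched by the search (the "found set")
    deselected_ids -- ids the user has excluded from the batch edit
    change         -- function that takes an entry and returns the edited entry
    """
    # The set of entries to edit is the found set minus the deselected items.
    targets = set(found_ids) - set(deselected_ids)
    # Build a new dict so that the original entries are left untouched.
    return {
        entry_id: change(entry) if entry_id in targets else entry
        for entry_id, entry in entries.items()
    }


# Hypothetical example data; ids and fields are made up for illustration.
entries = {
    "n1": {"name": "Helsinki", "type": "primary"},
    "n2": {"name": "Helsingfors", "type": "primary"},
    "n3": {"name": "Helsset", "type": "primary"},
}
found = ["n1", "n2", "n3"]
deselected = ["n1"]

updated = batch_edit(
    entries, found, deselected, lambda e: {**e, "type": "secondary"}
)
```

Returning a new dictionary rather than editing in place makes it easy to log or review the change before it is written back, which fits the logging facilities also decided upon.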

TODO:

  • develop the needed XQueries and UI (Sjur, Tomi)

Postponed:

  • data synchronisation between risten.no and the cvs repo
  • new version of xml2lexc (based on ccat), which should handle complex names correctly: construct entries like the ones we have now from the different parts of a complex name entry

9. Spellers

Polderland data generation

TODO:

  • decide how to specify compounding behaviour info for the lexicon (Thomas, Trond, Sjur)
  • add closed POS and clitics to PLX generation (Tomi)
  • add derivations to the PLX generation (Tomi)
  • add compound stems to the PLX generation (Tomi)

Aspell

TODO when the major part of the PLX conversion is done:

  • add Aspell/Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
  • study Hunspell, perhaps also Soikko (Børre, Sjur, Tomi)

Testing

TODO:

  • get an Intel Mac for testing Windows spellers; get a WinXP license from SD (Børre, Sjur)

10. Other

Corpus contracts

TODO:

  • publish corpus contracts and project infra on NoDaLi-sta (Sjur)

Bug fixing

56 open Divvun/Disamb bugs, and 23 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Task lists as iCal entries

TODO:

  • update Maaren's Forrest installation (Børre)
    • done

New Perl modules

The rewritten preprocessor depends on a few new Perl modules. Saara has sent installation instructions to everyone, and will write some documentation on the modules. Børre should augment the documentation further.

TODO:

  • write Perl module dependency documentation (Saara)
  • update setup and installation instructions (Børre)

11. Next meeting, closing

The next meeting is 18.12.2006, 09:30 Norwegian time.

The meeting was closed at 11:06.

Appendix - task lists for the next week

Børre

iCal

  • contact authors who have already received the corpus licensing contract
  • continue work on script for automatic testing of the spell checker in Word
  • sma discussions with SD (with Sjur, Trond)
  • get an Intel Mac for testing Windows spellers; get a WinXP license from SD
  • recreate our forrest tarball
  • update setup and installation instructions for new users/computers
  • fix bugs!

Maaren

iCal

  • investigate the generated word form list sent to Polderland - use the command make wordlist TARGET=sme in victorio

Saara

iCal

  • help Trond with some shell commands
  • re-analyze parallel files
  • consider implementing some new features to the corpus files
  • write some Perl documentation
  • vislcg as server, possibly as feature request to the vislcg devs
  • fix bugs!

Sjur

iCal

  • name lexicon:
    • refactor SD-terms editor code
    • implement missing propnouns editing functions
    • implement improvements decided upon in Tromsø
  • hire linguist and programmer
  • decide how to specify compounding behaviour info in the lexicon
  • sma discussions with SD (with Børre, Trond)
  • get an Intel Mac for testing Windows spellers; get a WinXP license from SD
  • publish corpus contracts and project infra on NoDaLi-sta
  • fix forrest installations for Maaren, Disamb
  • fix bugs!

Thomas

iCal

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  • decide how to specify compounding behaviour info in the lexicon
  • fix bugs!

Tomi

iCal

  • add closed POS and clitics to PLX generation
  • add derivations to the PLX generation
  • add compound stems to the PLX generation
  • fix bugs!

Trond

iCal

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  • get more sma texts
  • decide how to specify compounding behaviour info in the lexicon
  • sma discussions with SD (with Børre, Sjur)
  • fix bugs!