UIT The arctic university of Norway > Giellatekno
 

Meeting_2007-10-08

Meeting setup

  • Date: 8.10.2007
  • Time: 09.30 Norw. time
  • Place: Internet
  • Tools: SubEthaEdit, iChat/Skype

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09: 39.

Present: Børre, Ilona, Per-Eric, Risten, Sjur, Thomas, Tomi, Trond

Absent: none

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • move Steinar's error markup in the xml files to (a copy of) the original
    • not done
  • Hunspell lexicon conversion
    • nouns, adjs and verbs seem to work okay, other POS'es and CPOS'es (?) don't work as expected
  • collect/build an e-mail notify list
    • not done
  • fix bugs!

Ilona

  • lexicalise missing words
    • Never done... But now already pretty far with a missing list made of Assu-files.
  • make sms propernoun-list
    • Was done already
  • Change NIILLAS-names to ANAR or DUORTNUS.
    • Done

Maaren

  • lexicalise actio compounds

Per-Eric

  • expand the smj typos list
    • Worked and still working
  • add missing smj words
    • Worked and still working
  • lexicalise words from the Olavi missing list
    • Worked and still working
  • fix bugs!
    • Fixed some

Risten

  • fixed and open issues to README files
    • done
  • update translations of README-files - Thursday afternoon
    • done

Saara

  • add new XSL/XML headers for proofing test docs
    • not done
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
    • not done

Sjur

  • document the AppleScript testing tool
    • nothing new
  • document the testing procedures
    • not yet
  • work on the XML name editor/risten.no integration
    • still nothing
  • fixed and open issues to README files
    • done
  • test correct-type markup with latest enhancements
    • nope
  • collect/build an e-mail notify list
    • yes
  • update translations of README-files - Thursday afternoon
    • several times
  • update installation packages
    • as well
  • announce the beta
    • today - we need a multilingual e-mail text
  • fix bugs!
    • yes, and reported new ones
  • other things:
    • received, installed and tested InDesign hyphenation - works great! (but there are hyphenation errors, we need a hyphenation command line tool to test the behaviour of the Polderland hyphenation; it has been ordered, and should arrive within the next two weeks).

Thomas

  • explain compound-tags to Tomi
    • done
  • add oslolaš type derivation test cases to smj regresssion file
    • not done
  • sme->smj lexicon conversion to build bilingual lexicon resources
    • worked
  • update translations of README-files - Thursday afternoon
    • done
  • fix bugs!
    • worked some

Tomi

  • make PLX conversion test sample; add conversion testing to the make file
    • not done
  • Hunspell lexicon conversion
    • Børre is doing
  • fix stuorra-oslolaš lower case o
    • this one is fixed? Yes, the latest regression tests are very good: )
  • sme->smj lexicon conversion to build bilingual lexicon resources
    • not done
  • test whether we can revert Makefile changes, and if positive, revert them
    • done
  • fix bugs!
    • fixed

Trond

  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
    • Not done.
  • fix stuorra-oslolaš lower case o
    • fixed
  • add sma texts to the corpus repository
    • Analysed the sma texts, they look promising, but will require work in order to be added properly. I suggest postponing this to after christmas (as i have done so far, also)
  • sme->smj lexicon conversion to build bilingual lexicon resources
    • Great progress done, still some minor changes left until it is working
  • update translations of README-files - Thursday afternoon
    • Done (well, it might have been Friday...)
  • fix bugs!.

3. Documentation

Bugzilla 3.0.x has some nice features we would like to use, like shared, filtered queries.

TODO:

  • add semi-automatic updates of fixed and open issues to README files ( Sjur)
    • done
  • update Bugzilla (Børre)

4. Corpus gathering

Trond had a look at the sma bible texts. We will postpone adding them till after Christmas, when we hopefully will have a dedicated sma project.

Børre has received lots of texts from Torkel Rasmussen, they will be added this week.

TODO:

  • test correct-type markup with latest enhancements (Sjur)
  • add texts from Torkel Rasmussen ( Børre)

5. Corpus infrastructure

Nothing.

6. Infrastructure

Speller testing is still fluctuating a bit.

7. Linguistics

North Sámi

Čorru > čorut
*Oslolaš with hyphen required, is printed now, but shouldn't
oslolaš - is done correctly now

This one is fixed by the latest changes in the PLX conversion.

TODO:

  • lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji ( Maaren)
  • fix stuorra-oslolaš lower case o ( Tomi)
    • fixed

Lule Sámi

We have the same oslolaš derivation in smj too, but with another derivation.

Tjårro > tjårok
*Oslolaš with hyphen required, is printed now, but shouldn't
oslolaš - is done correctly now

Correct compounds are still not recognised:

Stuorafuoskok => input, should be accepted
(218) Stuorauvsuk
(217) Stuorraluohkák

fuoskok fuoskok Fuossko+N+Prop+Plc+Sg+Gen+Der1+Der/k+N+Sg+Nom

StuorFuoskok	Stuorafuoskok	2
(220) Stuorruskak
(220) Stuoruduskak
(219) SUFUR-Fuoskok

Fuoskok is in the PLX lexicon, but does not take part in this type of compounding. It should actually have been fuoskok.

Fuoskok Fuossko+N+Prop+Plc+Pl+Nom+Clt+ge Fuoskok Fuossko+N+Prop+Plc+Sg+Gen+Clt+ge

smj propernoun bug issue: procedure

  1. convert from common base (which means sme base)
    1. Words not convertable should be added to separate smj lexicon, and words that should not be converted from sme sme should be moved to non-convert lexicon in sme???
  2. send to smj morphology

The original todo was to correct the smj morphology. Current test shows weaknesses in both camps:

  • conversion errors
  • words that should not have been converten
  • missing smj-unique names
  • errors in the morphology

Testing procedures:

  • analyse baseforms (as for sme)
  • generate a couple of caseforms from the baseforms, and inspect result

Suggestion: Let us first analyse the proper noun base forms for Lule Sámi, and thereafter look at the morphology.

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas)
    • it is only about adding smj names now, working on it
  • lexicalise words from the Olavi missing list, but check against the pdf original where in doubt (Per-Eric)
    • working on it
  • add oslolaš type derivation test cases to the regresssion file ( Thomas)
    • done
  • sme->smj lexicon conversion to build bilingual lexicon resources, and increase smj coverage (Trond, Thomas, Svenne)
    • working on it
  • add proper nouns (Thomas, Ilona)

8. Name lexicon infrastructure

This sub-project needs to get up and running soon. Mainly Sjur's task.

Decisions made in Tromsø can be found in this meeting memo.

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
  4. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils ( linguists)

9. Proofing tools

Hunspell

Sami languages are not supported in OpenOffice.org, until that is fixed we will have to do the same tricks we apply in Microsoft Office 2004 for Mac.

TODO:

  • Hunspell lexicon conversion (Tomi, Børre)
  • Begin adding support for the sami languages in OpenOffice.org (Børre)

Testing

Spelling Error Markup

TODO:

  • Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Saara)
  • move Steinar's error markup in the xml files to (a copy of) the original ( Børre, Kimme)

Automated testing

The infrastructure is about to settle.

TODO:

  • document the AppleScript testing tool (Sjur)
  • document the testing procedures (Sjur)

Lexicon conversion to the PLX format

TODO:

  • fix oslolaš bug (Tomi)
    • fixed for sme, still open for smj

InDesign tools

We have received the first hyphenation beta for InDesign. It has been tested, and seems to work fine.

We should give the beta to Min Áigi and Davvi Girji.

TODO:

  • make available InDesign hyphenator to Min Áigi/Davvi Girji ( Sjur)
  • document the InDesign tools (Sjur)
  • add hyphenation testing (Sjur)

Hyphenators

There are some hyphenation errors we need to debug.

TODO:

  • get command line hyphenator (Sjur)
  • collect list of problematic words for the hyphenator (Sjur, Thomas, all)

New public beta

TODO:

  • collect/build an e-mail notify list; we make it simple, a text document with e-mail addresses (Sjur, Børre, others)
    • done
  • update list of fixed and known issues - Tuesday afternoon (Sjur, Risten)
    • done
  • update translations of README-files - Thursday afternoon ( Risten, Thomas, Sjur, Trond)
    • done
  • update installation packages (Sjur)
    • done
  • announce the beta (Sjur)
    • today

Release version

The CD cover etc. will be worked on by John-Marcus Kuhmunen, and will follow the SD design rules. He is now waiting for the text to be put on the CD cover and other places.

TODO:

  • write text to go on the CD cover (Risten)
  • set up CD-printing printer (Risten)

10. Other

Corpus contracts

Delayed till after final release.

TODO:

  • publish corpus contracts and project infra as open-source on NoDaLi-sta ( Sjur)

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, such that for each bug, we know exactly when it should have been fixed, in what file(s) and what version.

59 open Divvun/Disamb bugs (26 of these 56 are speller-related bugs, 33 are other bugs), and 23 risten.no bugs

11. Next meeting, closing

The next meeting is 15.10.2007, 09: 30 Norwegian time.

Trond will be away.

The meeting was closed at 10: 43.

Appendix - task lists for the next week

Boerre

iCal

  • move Steinar's error markup in the xml files to (a copy of) the original
  • Hunspell lexicon conversion
  • update Bugzilla to 3.0.x
  • begin adding support for the sami languages in OpenOffice.org
  • add texts from Torkel Rasmussen
  • fix bugs!

Ilona

iCal

  • lexicalise missing words
    • Will I have new missing lists to do?
  • Check the Finnish translation
  • add smj proper nouns
  • other smj tasks

Maaren

  • lexicalise actio compounds

Per-Eric

iCal

  • expand the smj typos list
  • add missing smj words
  • lexicalise words from the Olavi missing list
  • fix bugs!

Risten

iCal

  • write text to go on the CD cover
  • set up CD-printing printer

Saara

iCal

  • add new XSL/XML headers for proofing test docs
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)

Sjur

iCal

  • document the AppleScript testing tool
  • document the testing procedures
  • work on the XML name editor/risten.no integration
  • test correct-type markup with latest enhancements
  • get command line hyphenator for automated testing of the hyph-lexicons
  • collect list of problematic words for the hyphenator
  • make available InDesign hyphenator to Min Áigi/Davvi Girji
  • document the InDesign tools
  • add hyphenation testing
  • fix bugs!

Thomas

iCal

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • add smj proper nouns
  • fix bugs!

Tomi

iCal

  • Hunspell lexicon conversion
  • sme->smj lexicon conversion to build bilingual lexicon resources
  • fix oslolaš bug in smj ( Tomi)
  • fix bugs!

Trond

iCal

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • fix bugs!.