UIT The arctic university of Norway > Giellatekno
 

Meeting_2007-11-26

Meeting setup

  • Date: 26.11.2007
  • Time: 09.30 Norw. time
  • Place: Internet
  • Tools: SubEthaEdit, iChat/Skype

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 10: 15.

Present: Børre, Ilona, Per-Eric, Sjur, Thomas, Tomi

Absent: Risten, Trond

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • move Steinar's error markup in the xml files to (a copy of) the original
    • not done
  • fix bug 550
    • not done
  • fix Windows CD installation bug
    • not done
  • discuss more parallel texts
    • not done
  • finalise InDesign hyphenator
    • not done
  • update usage and installation documentation
    • not done
  • fix bugs!
    • not done
  • Other:
    • Continued work on hunspell
    • Updated OS X on xserve, added RAM.
    • Made new logos

Ilona

  • other smj tasks, ask Thomas
  • Buy the DVD for Leopard
    • Bought, and downloaded Leopard disk image to the computer. Have to burn it to the DVD and then install it on the computer. Will try to do it today.
  • other tasks:
    • Done some testing in sme speller and reported it to Thomas

Maaren

  • lexicalise actio compounds

Per-Eric

  • lexicalise words from the Olavi missing list
    • Done
  • derivations tests
    • Done some
  • fix bugs!
    • Not done anything this week

Risten

  • set up CD-printing printer
    • on its way - ordered
  • get price and schedule for printed CD cover
    • done

Saara

  • add new XSL/XML headers for proofing test docs
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
  • add nested error markup to xml conversion
  • discuss more parallel texts

Sjur

  • work on the XML name editor/risten.no integration
    • nothing
  • set up risten.no on the G5 again
    • made it, although Trond reports problems
  • test new and nested error markup
    • later
  • improve hyphenation testing
    • nothing improved
  • fix bug 550
    • not done
  • fix Windows CD installation bug
    • tried, but it's not working. Work-around should be documented
  • finalise InDesign hyphenator
    • nothing done yet
  • update usage and installation documentation
    • not yet
  • fix bugs!
  • other:
    • several improvements to the CD creation process
    • worked on the CD cover (proofreading many times)
    • faroese speller telephone meeting

Thomas

  • sme->smj lexicon conversion to build bilingual lexicon resources
    • not anything this week
  • check for bad hyphenation
    • worked a little
  • look at test cases still not behaving properly
    • worked a little
  • paradigm testing
    • done some
  • test the proofing tools with all MS Office applications
    • done
  • finalise InDesign hyphenator
    • not done
  • update usage and installation documentation
    • not done
  • fix bugs!
    • worked

Tomi

  • Hunspell lexicon conversion
    • not done
  • fix bugs!
    • fixed bugs

Trond

  • sme->smj lexicon conversion to build bilingual lexicon resources
    • Worked on this, refined, work is on schedule
  • telephone meeting with Sjur and the faroese group re faroese speller
    • Done
  • discuss more parallel texts
    • Worked on this.
  • fix bugs!.

3. Documentation

Nothing new.

TODO:

4. Corpus infrastructure

Nothing.

5. Infrastructure

TODO:

  • add Jabber account in iChat (all)

6. Linguistics

North Sámi

Hyphenation is better, but still contains a lot of errors. Sjur will run the latest hyphenator on our test material, and discuss the test results with the rest.

TODO:

  • test latest hyphenator (Sjur)
  • analyse test results (Thomas, Sjur, Trond)

Lule Sámi

Trond and his team have found words to be added to the smj lexicon.

cat smesmj.txt | grep -v 'prop$' | cut -f2 | lookup -flags mbTT -utf8 ~/gt/smj/bin/smj.fst | grep '\?' | l

6581 words in the smesmj.txt lexicon. Disregarding the proper nouns, 1824 are not recognised by smj-norm.fst or by smj.fst. Many of these are loan words or they are derivations. Some examples:

čála    tjála   n
giehtačála      giehtatjála     n
vuolláičála     vuollájtjála    n <= :-)
vinjučála       vinjotjála      n
johtučála       jåhtotjála      n
čuokkisčála     tjuokkestjála   n
bajildusčála    bajeldustjála   n
mála    mála    n
tjála   čála    n
giehtatjála     giehtačála      n <=== :-((
vuollájtjála    vuolláičála     n
tjuokkestjála   čuokkisčála     n
vinjotjála      vinjučála       n
jåhtotjála      johtučála       n
leapma  liebma  n

ja/dahje        ja/dahje        +?
gobba   gobba   +?
gaiba   gaiba   +?
struhcca        struhcca        +?
fáhcca  fáhcca  +?
suorbmafáhcca   suorbmafáhcca   +?
vahca   vahca   +?
ohca    ohca    +?
juhca   juhca   +?

It seems to be a mixup of smj and sme in the material. That has to be cleaned up.

We have to test hyphenation for lulesami as well.

TODO:

  • lexicalise words from the Olavi missing list, but check against the pdf original where in doubt (Per-Eric)
    • done
  • sme->smj lexicon conversion to build bilingual lexicon resources, and increase smj coverage (Trond, Thomas, Svenne). Add the words.
  • test hyphenation (Sjur, Thomas)

7. Name lexicon infrastructure

Sjur got risten.no up and running on the G5. Worked only for him, though.

Decisions made in Tromsø can be found in this meeting memo.

TODO:

  1. set up Tomcat and risten.no on the G5 again (Sjur, Børre)
    1. install risten.no
      1. did it
  2. fix bugs in lexc2xml; add comments to the log element (Saara)
  3. finish first version of the editing (Sjur)
  4. test editing of the xml files. If ok, then: ( Sjur, Thomas, Trond)
  5. make terms-smX.xml <=== automatically from propernoun-sme-lex.xml (add nob as well) (the morphological section should be kept intact, in e.g. propernoun-sme-morph.txt) (Sjur, Saara)
  6. convert propernoun-($lang)-lex.txt to a derived file from common xml files ( Sjur, Tomi, Saara)
  7. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (ie the G5 as an alternative server to the public risten.no - it might be faster and better suited than the official one; also local installations could be treated the same way)
  8. start to use the xml file as source file
  9. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  10. merge placenames which are errouneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  11. publish the name lexicon on risten.no (Sjur)
  12. add missing parallel names for placenames (linguists)
  13. add informative links between first names like Niillas and Nils ( linguists)

8. Proofing tools

Hunspell

Continuously improving.

TODO:

  • Hunspell lexicon conversion (Tomi, Børre)

Testing

Spelling Error Markup

This will wait till after the release.

TODO:

  • Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Saara)
  • move Steinar's error markup in the xml files to (a copy of) the original ( Børre, Kimme)
  • add nested error markup to xml conversion (Saara)
  • test new and nested error markup (Sjur)

Automated testing

TODO:

  • improve hyphenation testing (Sjur)
    • not done yet

MS Office

An important aspect of this testing is to document in the user guide anything that could be a problem for users.

TODO:

  • test the proofing tools with all MS Office applications (Børre, Thomas)
    • Thomas has tested all Windows apps - they all work fine with our tools

Lexicon conversion to the PLX format

Open issues based on test results:

smj

482 - still problematic (prefix), 484 - double hyphens suggested, 575 - name+name = double hyphens in sugg, Svierigadárogielan - still rejected (prefix)

sme

397 - double hyphens (name+name), 419 - fixed, 425 - roman number, 431 - does not accept the correct string, but DO suggest the same; also hyphen final forms are accepted, but not the same form when part of a compound, 452 - fixed, 461 - ovda accepted, almost 50 % (17) gets correct suggestion, 489, 522 - fixed, 524 - fixed, Guovdageainnu-láđđi not accepted.

TODO:

  • look at test cases still not behaving properly (Thomas, Tomi)
  • check that the smj R lexicon is identical to sme ( Thomas)

InDesign tools

TODO:

  • improve hyphenation testing (Sjur)

Hyphenators

Testing!!!

Release version

Schedule and tasks for the remaining weeks:

TODO:

  • set up CD-printing printer (Risten, Leif Åge)
  • fix Windows CD installation bug (Sjur, Børre)
    • put on hold - work-around should be documented
  • get price and schedule for printed CD cover (Risten)
    • done: 3980,- + 900,- + VAT for 1000 covers.
    • 8 days production time
    • 100 covers will be picked up in Tromsø (Børre)
    • Print 50 CDs, take them to Oslo (Risten, Julie)
    • Burn the CDs in Oslo (Sjur)
  • fix remaining bugs - golden master by end of this Monday (all)
  • finalise InDesign hyphenator (Sjur, Børre, Thomas)
    • testing = hyphenation testing
    • documentation
    • installation
  • update usage and installation documentation (Børre, Thomas, Sjur)
  • translate all new documentation (all)
  • QA all documentation (all)
  • do as much hunsopell as possible (Børre, Tomi)

Actual release

December 12 is the most likely date, before 12: 00. Still to be confirmed.

There will be a release party in the afternoon.

9. Other

Corpus contracts

Delayed till after final release.

TODO:

  • publish corpus contracts and project infra as open-source on NoDaLi-sta ( Sjur)

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, such that for each bug, we know exactly when it should have been fixed, in what file(s) and what version.

83 open Divvun/Disamb bugs (45 of these 83 are speller-related bugs, 38 are other bugs), and 23 risten.no bugs

Software updates

  • Leopard, 10.5
    • Ilona - Will have it tomorrow or at latest on Wednesday. Probably needs help with installing.
    • Per-Eric - ready for updating tomorrow - Børre will help

10. Next meeting, closing

The next meeting is 03.12.2007, 09: 30 Norwegian time.

The meeting was closed at 11: 42.

Appendix - task lists for the next week

Boerre

iCal

  • move Steinar's error markup in the xml files to (a copy of) the original
  • fix bug 550
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • fix bugs!

Ilona

iCal

  • lexicalise smj missing words.
  • Help Trond with the smj dictionary.
  • Install Leopard

Maaren

  • lexicalise actio compounds

Per-Eric

iCal

  • check some unusual words from the Olavi missing list which are still not lexicalised
  • derivations tests
  • Install Leopard
  • fix bugs!

Risten

iCal

  • set up CD-printing printer
  • test printer
  • finish the cd cover and cd design

Saara

iCal

  • add new XSL/XML headers for proofing test docs
  • Set up ways of adding meta-information for proofing correct corpus docs (source info, used in testing or not, added to lexicon or not)
  • add nested error markup to xml conversion
  • discuss more parallel texts

Sjur

iCal

  • fix bug 550
  • document Windows CD installation work-around
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • test latest hyphenator
  • analyse hyphenation test results
  • fix bugs!

Thomas

iCal

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • test hyphenation
  • analyse hyphenation test results
  • look at test cases still not behaving properly
  • paradigm testing
  • finalise InDesign hyphenator
  • update usage and installation documentation
  • check that the smj R lexicon is identical to sme
  • fix bugs!

Tomi

iCal

Trond

iCal

  • sme->smj lexicon conversion to build bilingual lexicon resources
  • test hyphenation
  • analyse hyphenation test results
  • fix bugs!.