Meeting_2007-06-11

Meeting setup

  • Date: 11.06.2007
  • Time: 09.30 Norwegian time
  • Place: Internet
  • Tools: SubEthaEdit, iChat/Skype

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from last week
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. Name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09:57.

Present: Børre, Maaren, Per-Eric, Sjur, Steinar, Thomas, Tomi, Trond

Absent: Saara

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • add sma texts to the corpus repository
    • not done
  • run all known spelling errors in the prooftest corpus through the speller
    • not done
  • add extraction of all known spelling errors in the regular corpus (not the prooftest corpus), and check that they are properly corrected
    • not done
  • update and fix our documentation and infrastructure as Steinar finds problem areas - low priority
    • began work again
  • study the Hunspell formalism in detail
    • nothing new
  • contact Davvi Girji / Mikal Aase
    • not done
  • install larger disks, new RAM on the G5 when they arrive
    • Arrived. Will install it asap.
  • move list of known bugs to Bugzilla
    • not done
  • update/check installed file list and paths for Windows
    • not done
  • fix bugs!

Inga

  • expand the smj typos list
    • worked on it, still in progress
  • add missing smj words
    • worked on it, still in progress

Maaren

  • lexicalise actio compounds
  • Manually mark speller test documents for typos

Per-Eric

  • expand the smj typos list
    • worked on it, still in progress
  • add missing smj words
    • worked on it, still in progress

Saara

  • improve cgi-bin scripts
    • done
  • add new XSL/XML headers for proofing test docs
    • will do this week
  • Try to add files with Lars to the corpus interface.
  • fix bugs!

Sjur

  • run all known spelling errors in the corpus through the speller
    • not done, depends on speller test bench improvements
  • document the AppleScript testing tool
    • not done
  • integrate regression self tests with the make file
    • not done
  • improve speller test bench
    • worked on it; problems with the Perl script that processes the speller test results
  • integrate the ccat speller testing options in the make file
    • worked on it; problems with the Perl script that processes the speller test results
  • fix internet setup for Per-Eric's satellite modem
    • nothing new
  • look over the Bugzilla status mails
    • nothing new
  • contact Davvi Girji / Mikal Aase
    • done
  • ask Xerox for a commercial license for the xfst tools on the G5
    • not done
  • check with Sámi publishing houses whether support for CS2 is still needed
    • checked Min Áigi, Áššu and Davvi Girji - CS2 not needed so far
  • fix stuorra-oslolaš lower case o
    • topic for the Drag meeting
  • ö/ä vs ø/æ in speller
    • topic for the Drag meeting
  • study the Hunspell formalism in detail
    • topic for the Drag meeting
  • move list of known bugs to Bugzilla
    • done
  • resend the press release to some channels in Sweden, Finland and Norway
    • not done
  • publish corpus contracts and project infra as open-source on NoDaLi-sta
    • not done
  • fix bugs!
    • filed many new ones
  • other:
    • finished installation of Parallels Desktop, Windows XP, Office 2007 and our Windows proofing tools for testing the Windows version of the spellers.

Steinar

  • Beta testing: Align manually (shorter texts)
  • Manually mark speller test texts for typos (making them into gold standards), add the texts to the orig/catalogue
    • added more texts
  • Complete the semantic sets in sme-dis.rle
    • no work this week
  • missing lists
    • no work this week
  • fix bugs!

Thomas

  • work with compounding
    • worked
  • Lack of lowering before hyphen: Twol rewrite.
    • not done
  • smj: öä not accepted, only øæ (except for lexicalised names)
    • not done
  • fix stuorra-oslolaš lower case o
    • not done
  • investigate why actios of 3-syllable verbs are not accepted by the speller
    • had some help with this, we will see
  • investigate why some adverbs of 3-syllable adjectives are not accepted by the speller
    • seem to work
  • fix bugs!
    • have barely had time

Tomi

  • add compounding restrictions to the PLX conversion
    • added
  • make PLX conversion test sample; add conversion testing to the make file
    • not done
  • improve prefix and middle-noun PLX conversion
    • done
  • integrate the ccat speller testing options in the Makefile
    • not done
  • first part of multiword expressions not accepted
    • not done
  • open up compounding for all actios
    • not done
  • fix bugs!
    • fixed

Trond

  • Work on the web corpus issues
    • Done some work, yes.
  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
    • Fixed a fatal bug here (1/3 of names restored!), but not worked more on the morphological issue.
  • Go through the Num bugs
    • Not done
  • fix stuorra-oslolaš lower case o
    • Not done
  • fix bugs!
    • Closed several, but opened more, I am afraid.

3. Documentation

TODO:

  • write form to request corpus user account (Børre, Sjur, Trond)
  • document how to apply for access to closed corpus, and details on the corpus and its use in general (Børre, Sjur, Trond)
  • correct and improve the documentation based on feedback from Steinar (Børre)

4. Corpus gathering

Sjur spoke to Davvi Girji. We will send them a list of the authors contacted, and DG will send a list of all works DG has ever published (not all of them are available in digital form, though).

TODO:

  • sme texts: no new additions, fix corpus errors during this month (Børre, Trond, Saara)
  • missing nob parallel texts should be added if such holes are found (Børre, Trond)
  • Go through the list of missing or erroneous nob texts, based upon Saara's perfect list (Børre, Trond)
  • add sma texts to the corpus repository (Børre)
  • contact Davvi Girji / Mikal Aase (Børre, Sjur)
    • done

5. Corpus infrastructure

Nothing this week either.

6. Infrastructure

TODO:

  • update and fix our documentation and infrastructure as Steinar finds problem areas (Børre)
    • working on this one
  • fix internet setup for Per-Eric's satellite modem (Sjur, Børre)
    • this influences iChat, SEE sharing, and ARD connections

7. Linguistics

North Sámi

Actio compounds: Maaren and Duomma disagree about which forms are correct; this needs to be resolved. We need further clarification of the system, which we will work out in Drag.

TODO:

  • lexicalise actio compounds. Example: vuolggasadji vs. vuolginsadji (Maaren)
    • both vuolgin- and vuolgga- are acceptable, e.g. vuolggasadji and vuolgindássi
    • possibly turn on free compounding as part of the PLX conversions (i.e. free compounding in the spellers, but not in the analyzers/transducers)
  • fix stuorra-oslolaš lower case o (Sjur, Thomas, Trond)
  • open up compounding for all actios (Tomi)

Lule Sámi

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt (Thomas, Trond)
  • ö/ä vs ø/æ in speller (Thomas, Sjur)
  • lexicalise words from the Olavi missing list, but check against the pdf original where in doubt (Inga)
  • add normativity issues to our normativity document (Inga, Thomas)
  • investigate why actios of 3-syllable verbs are not accepted by the speller (Thomas)
    • norm-lookup does not see these, ordinary lookup sees them
      • these were grepped out because their lexicon names contained the string SUB. The names have now been changed, which should fix the problem; needs to be tested in the new speller
  • investigate why some adverbs of 3-syllable adjectives are not accepted by the speller (Thomas)
    • norm-lookup sees some, but not all; ordinary lookup sees them
      • it seems to be fixed, needs to be tested in the new speller
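
The SUB problem described above is a typical case of a substring filter matching more than intended. A minimal sketch in Python, assuming hypothetical lexc-style lines (the stems, lexicon names and the `!SUB` comment layout below are illustrative assumptions, not copied from the project's lexicon files), contrasts a bare substring test with a match on the marker itself:

```python
import re

# Hypothetical lexc-like lines. The stems, the lexicon names and the
# trailing "!SUB" marker are assumptions made for illustration only.
LINES = [
    "stem1 LEX_SUBST ;",     # lexicon name happens to contain "SUB"
    "stem2 LEX_B ;",         # ordinary entry
    "stem3 LEX_B ; !SUB",    # entry explicitly marked with the SUB tag
]

# Matches only a trailing comment that consists of the SUB marker.
SUB_MARKER = re.compile(r"!\s*SUB\s*$")


def drop_naive(lines):
    """Drop every line containing the bare substring 'SUB' (over-matches)."""
    return [ln for ln in lines if "SUB" not in ln]


def drop_marked(lines):
    """Drop only the lines that end in the SUB marker comment."""
    return [ln for ln in lines if not SUB_MARKER.search(ln)]


if __name__ == "__main__":
    print(drop_naive(LINES))   # also loses the LEX_SUBST entry
    print(drop_marked(LINES))  # keeps it
```

Renaming the lexica, as was done here, removes the collision; anchoring the filter on the marker guards against the same collision returning with a new lexicon name.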

8. Name lexicon infrastructure

Decisions made in Tromsø can be found in this meeting memo.

TODO:

  1. fix bugs in lexc2xml; add comments to the log element (Saara)
  2. finish first version of the editing (Sjur)
  3. test editing of the xml files. If ok, then: (Sjur, Thomas, Trond)
  4. generate terms-smX.xml automatically from propernoun-sme-lex.xml (add nob as well); the morphological section should be kept intact, e.g. in propernoun-sme-morph.txt (Sjur, Saara)
  5. convert propernoun-($lang)-lex.txt to a derived file from common xml files (Sjur, Tomi, Saara)
  6. implement data synchronisation between risten.no and the cvs repo, and possibly other servers (i.e. the G5 as an alternative server to the public risten.no, since it might be faster and better suited than the official one; local installations could be treated the same way)
  7. start to use the xml file as source file
  8. clean terms-sme.xml such that all names have the correct tag for their use (e.g. @type=secondary) (Thomas, Maaren, linguists)
  9. merge placenames which are erroneously in different entries: e.g. Helsinki, Helsingfors, Helsset (linguists)
  10. publish the name lexicon on risten.no (Sjur)
  11. add missing parallel names for placenames (linguists)
  12. add informative links between first names like Niillas and Nils (linguists)

9. Spellers

OOo spellers

Børre, Sjur, Tomi will have a session on this in Drag.

TODO:

  • add Hunspell data generation to the lexc2xspell (Tomi - after the PLX data generation is finished)
  • study the Hunspell formalism in detail (Børre, Sjur, Tomi)

Testing

Spelling Error Markup

Text in other languages should not be marked as spelling errors.

TODO:

  • Manually mark test texts for typos (making them into gold standards) (Steinar)
  • Set up ways of adding meta-information (source info, used in testing or not, added to lexicon or not) (Saara)

Testing tools

Sjur is trying to get the ccat typos option integrated in the test targets in the Makefile. Hopefully done soon.

TODO:

  • document the AppleScript testing tool (Sjur)
  • improve speller test bench (Sjur)
    • integrate the ccat speller testing options in the Makefile (Sjur, Tomi)
      • working

Regression tests

Nothing new

TODO:

  • add extraction of all known spelling errors in the corpus (not the prooftest corpus), and check that they are properly corrected (Børre, Sjur)
  • test the typos.txt list, and check that all entries are properly corrected (Børre, Sjur)
  • consider how to do a regression self-test, i.e. how to test the full wordlist (Børre, Sjur)
    • extract all the base forms in the lexicon, and run them through the speller
    • extract all SUB-marked entries, and run them through the lexicon
      • integrate these in the make file (Sjur)
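
The first self-test idea above (run every base form through the speller and report what is rejected) can be sketched as follows. This is a minimal Python sketch under stated assumptions: `rejected_base_forms` and the `accepts` callback are hypothetical names, and the set-based stand-in below replaces the real speller, whose interface is not described in these minutes.

```python
from typing import Callable, Iterable, List


def rejected_base_forms(
    base_forms: Iterable[str],
    accepts: Callable[[str], bool],
) -> List[str]:
    """Return the base forms that the speller does not accept.

    `accepts` is a hypothetical callback wrapping whatever speller is
    under test; an empty result means the self-test passed.
    """
    return [word for word in base_forms if not accepts(word)]


if __name__ == "__main__":
    # Stand-in speller: a plain set of known words (assumption for the demo).
    toy_speller = {"vuolggasadji", "advokáhtta"}
    lexicon_base_forms = ["vuolggasadji", "vuolgindássi", "advokáhtta"]

    failures = rejected_base_forms(lexicon_base_forms, toy_speller.__contains__)
    for word in failures:
        print("rejected:", word)
    # A make target could treat a non-empty list as a failed test.
    raise SystemExit(1 if failures else 0)
```

Returning the rejected words rather than a bare pass/fail keeps the result usable both as a make-file exit status and as a list of candidate bug reports.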

Lexicon conversion to the PLX format

TODO:

  • install larger disks, new RAM on the G5 when they arrive (Børre)
    • received, will be installed soon.
  • ask for mklex for Linux (victorio) from Polderland (Sjur)
    • waiting for the offer
  • ask Xerox for a commercial license for the xfst tools on the G5 (Sjur)
  • add compounding restrictions to the PLX conversion (Tomi)
    • done, seems correct, but needs more testing when a new speller is ready.

Compounding restrictions

Compounding restrictions are now integrated in the PLX conversion, thanks to Tomi.

TODO:

  1. improve prefix conversion to PLX (Tomi)
    1. done
  2. improve middle noun conversion to PLX (Tomi)
    1. done
  3. improve noun + adjective PLX conversion: (Tomi)
    1. compounding stems - how do we generate them? Using the java client? +SgNomCmp+Cmpnd = sáme–, should give the correct compounding stem, shouldn't it? We want to optionally go from: sáme- NLI to sáme NL: - NLI (->) NL, which means we should be able to extract correct compounding stems using xfst methods only.
      1. done
    2. compounding tags - we need to obey them when making the transducers. Suggestion - see above.
      1. done
  4. make conversion test sample; add conversion testing to the make file (Tomi)
    1. to regression test / QA the PLX conversion.
      1. not done

Public Beta follow-up

TODO:

  • fix clitics (Tomi)
    • done after the release, has to be tested
      • can be tested in the small speller - tested, all accepted: Sjurgo buorrege Trosterutge biilago
  • file list in Windows not complete (Børre, Sjur)
  • test smj on typos (Børre)
    • tried, but got an error, thus skipped. Needs to be checked now.
      • error reported to Saara
  • celebrate
    • NOT done - will do in Drag :)
  • resend the press release to some channels in Sweden, Finland and Norway (Sjur)
    • Per-Eric will follow up in Sweden and Tomi in Finland, to make sure we get reasonable coverage, or at least enough users for the beta spellers. Regarding Finnish press coverage, other Finnish institutions to contact could be:
      • Samiradio (Tomi) - they're planning to make a report
      • Sami parliament (Tomi)
      • Oulu - giellagas (Tomi)
      • Lapin yliopisto - Rantala (Trond)
      • Helsingin yliopisto - Seurujärvi-Kari (Tomi)
      • KOTUS (Sjur)
      • Citysaamit (Tomi)
      • Oulun saamelaiset (Tomi)
  • move list of known errors to Bugzilla (Børre, Sjur)
    • done

10. Other

Summer vacation

When are we taking it? Please fill in the table below:

Name      Starting  Ending
Børre     x         x
Maaren    9.7.      10.8.
Per-Eric  9.7.      20.7.
Saara     2.7.      3.8.
Sjur      x         x
Steinar   x         x
Thomas    9.7.      12.8.
Tomi      9.7.      5.8.
Trond     2.7.      12.8. (but working at the end)

Divvun people also need to send the dates to Julie Eira or Ellen Mienna Guttorm.

Corpus contracts

TODO:

  • publish corpus contracts and project infra as open-source on NoDaLi-sta (Sjur)

Bug fixing

When fixing bugs, record the version number containing the fix in the Bugzilla bug report, so that for each bug we know exactly when it should have been fixed, in which file(s) and in which version.

56 open Divvun/Disamb bugs (21 of these 56 are speller bugs, 35 are general bugs), and 23 risten.no bugs

TODO:

  • look over the Bugzilla status mails (Børre)

The meeting in Drag

The Sámi Parliament board has its meeting June 19-21. We should use Monday, June 18 as our travel day, and return on Friday, June 22. Fly to Bodø, and go by rental car from there. It is also possible to go by car all the way from Tromsø, which is even faster. Those going to Bodø are (at least):

  • Maaren (?)
  • Sjur
  • Tomi

Topics for Drag:

  • two-level fixes (stuorra-oslolaš)
  • OOo/Hunspell
  • QA session
  • Actio compounding clarifications
  • smj work in general
  • loan words in -áhta or -áhtta (example: advokáhtta or advokáhta)

SD-ráddi presentation (1 hour):

  • demo Divvun
  • demo risten.no
  • operation of Divvun
  • operation of risten.no
  • extension/new project (i.e. operation)
  • South Sámi
  • terminology development
  • parallel corpus
  • Nordic cooperation

Sjur will order rooms for all (except Per-Eric) at Hamarøy Hotell, and a meeting room either at the hotel or at Árran. Beds are needed as follows:

  • Monday: Sjur, Maaren, Tomi
  • Tuesday: Sjur, Maaren, Thomas, Tomi, Trond, Børre
  • Wednesday: Sjur, Maaren, Tomi, Børre (not at Hamarøy Hotell - it is full)
  • Thursday: Sjur, Maaren, Tomi, Børre

TODO:

  • order rooms (Sjur)
  • order meeting room (Sjur)
  • plan presentation (Sjur)

A commercial

An alternative compiler to Xerox is coming up, in

11. Next meeting, closing

The next meeting is 25.6.2007, 10:30 Norwegian time (or possibly in the afternoon).

The meeting was closed at 11:28.

Appendix - task lists for the next week

Børre

  • add sma texts to the corpus repository
  • run all known spelling errors in the prooftest corpus through the speller
  • add extraction of all known spelling errors in the regular corpus (not the prooftest corpus), and check that they are properly corrected
  • update and fix our documentation and infrastructure as Steinar finds problem areas - low priority
  • study the Hunspell formalism in detail
  • follow-up contact with Davvi Girji
  • install larger disks, new RAM on the G5
  • update/check installed file list and paths for Windows
  • fix bugs!

Maaren

  • lexicalise actio compounds
  • Manually mark speller test documents for typos

Per-Eric

  • expand the smj typos list
  • add missing smj words
  • contact media in Sweden about the beta release

Saara

  • add new XSL/XML headers for proofing test docs
  • Try to add files with Lars to the corpus interface.
  • fix bugs!

Sjur

  • run all known spelling errors in the corpus through the speller
  • document the AppleScript testing tool
  • integrate regression self tests with the make file
  • improve speller test bench
  • integrate the ccat speller testing options in the make file
  • fix internet setup for Per-Eric's satellite modem
  • look over the Bugzilla status mails
  • ask Xerox for a commercial license for the xfst tools on the G5
  • check with Sámi publishing houses whether support for CS2 is still needed
  • resend the press release to some channels in Sweden, Finland and Norway
  • publish corpus contracts and project infra as open-source on NoDaLi-sta
  • study the Hunspell formalism in detail
  • fix bugs!

Steinar

  • Beta testing: Align manually (shorter texts)
  • Manually mark speller test texts for typos (making them into gold standards), add the texts to the orig/catalogue
  • Complete the semantic sets in sme-dis.rle
  • missing lists
  • fix bugs!

Thomas

  • work with compounding
  • Lack of lowering before hyphen: Twol rewrite.
  • smj: öä not accepted, only øæ (except for lexicalised names)
  • fix stuorra-oslolaš lower case o
  • add normativity issues to our normativity document
  • test the new speller for actios of 3-syllable verbs and adverbs of 3-syllable adjectives
  • fix bugs!

Tomi

  • make PLX conversion test sample; add conversion testing to the make file
  • integrate the ccat speller testing options in the Makefile
  • first part of multiword expressions not accepted
  • open up compounding for all actios
  • contact Finnish institutions about the speller beta release
  • study the Hunspell formalism in detail
  • add Hunspell data generation/conversion
  • fix bugs!

Trond

  • Work on the web corpus issues
  • update the smj proper noun lexicon, and refine the morphological analysis, cf. the propernoun-smj-lex.txt
  • fix bugs!