UIT, The arctic university of Norway > Giellatekno
 

Meeting_2006-02-13

Meeting setup

  • Date: 13.02.2006
  • Time: 09.30 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. name lexicon infrastructure
  8. Other issues
  9. Summary, task lists
  10. Closing

1. Opening, agenda review, participants

Opened at 09: 57.

Present: Børre, Saara, Sjur, Trond

Absent: Maaren, Thomas, Tomi

Main secretary: Børre

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • send out contracts with accompanying letter
    • Min Áigi
  • Gather public texts, preferrably also parallel ones
    • Gathered some, sitting on a computer at the Sámediggi.
  • Continue converting text from input format to our xml
    • Converted existing texts using the upload form. Nice experience: -)
  • review code and documentation for corpus xsl files under version control
    • Not done
  • convert nob and nno bible texts to be used as part of a parallel corpus, and review the paratext2xml converter as part of the conversion
    • Not done
  • convert smj NT to paratext
    • Not done
  • close bug 211 as WONTFIX
    • DONE : -)
  • fix bugs!

Maaren

  • work with risten.no
  • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

  • continue discussion on the new lexicon format
  • Refine language detection for Finnish
    • not done
  • Finish the review of the hyphenation detection.
    • not done
  • Review the handling of xsl-files in corpus infrastructure, including version control
    • done
  • Fix the preprocess script and optimize it.
    • not done
  • finalize an improved working version of the CGI and command line scripts for corpus additions
    • done
  • update conversion from lexc to xml (proper names) with the latest refinements
  • Try to add numeral treatment as part of the analyzator.
    • not done
  • Look at crontab ga/ directory issue with Trond.
    • done, but there is a bug.. which should be fixed now.
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
    • not done
  • Lule Sámi twol problems, with Thomas and Trond
    • delayed till Thomas is back
  • project planning with Trond, continued
    • not done
  • Follow up on place names from Norge Digitalt
    • not done
  • Evaluate SFST as speller (and analyzer) lexicon
    • not done
  • write a background document on the corpus contracts
    • not done
  • public tender:
    • review draft tender document from Finnut
      • done, feedback and changes returned
  • smj G3 issue with Thomas and Trond
    • delayed till Thomas is back
  • sme G3 issue with Thomas and Trond
    • delayed till Thomas is back
  • call EDD/ Christian Emil Ore about national place name lexicon
    • not done
  • risten.no/proper noun lexicon development: fix bugs, continue development
    • wrote a draft specification of filename conventions, that at the same time also repeates what has been said earlier regarding dir. structure.
    • some coding as well (don't remember the details any more)
  • fix bugs!
  • other:
    • monthly report for January
    • report for 2005 to Nordplus Sprog

Thomas

On sick-leave.

Tomi

  • Aspell: Continue working on the affix file & aspell
    • Contact aspell author (UTF-8 thing)
  • corpus infrastructure:
    • dtd location (both public and internal)
  • Document aspell and corpus infrastructure
  • new proper name lexicon
    • remove last part of complex names not used as simplex names
    • discuss the new lexicon format and other issues in the newsgroup
    • Look into data synchronisation of proper nouns between risten.no and CVS
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • comment review template made by Saara
  • fix bugs!

Trond

  • Work on corpus texts with Børre.
    • Done, but more to do.
  • Contact the Finnish and Swedish Bible societies to get Bible texts.
    • Not done.
  • Look at ga/ directory issue with Saara.
    • Done. She has made a script, and I have posted things to the newsgroup.
  • News group discussion followup.
    • Some done.
  • Do a bug report (if not done) on commandline bahaviour in the Xerox tools.
    • Hmm, not done, after all.
  • Ask for e-mail adress for corpus upload script
    • Don't remember this one.
  • fix bugs!.
    • This one was forgotten.

3. Documentation

Reviews

XSLT processing part of the corpus infra review is finished. The code is finished and ready to use, but language identification still needs improvements.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

Since last meeting:

  • Min Áigi
  • called Anders Kintel - will sign it as soon as he gets the contract; will also burn on a CD
  • Swedish Sámi Parliament: Grundström will be finished by summer time, now proofreading

Next:

  • calling Olavi Korhonen, his dictionary is now in for printing
  • then continuing on the list of orgs/persons to contact

Odin

Waiting for Sæth to discuss with colleagues about how to implement the cooperation, and return to us.

TODO:

  • call Sæth (Børre)

Bible texts

TODO:

  • review paratext2xml converter (Børre)
  • convert smj NT to paratext. (Børre)
  • ask to get fin and swe NT and OT in paratext format. (Trond)

5. Corpus infrastructure

We need more "version control" in the corpus work - we don't know which version of the XSL script was used (but roughly whether it was used or not). We need to document within the XSL file which version of all tools were used, including the XSL file (common section/template) itself.

Transferring the old gt/sme/corp files to the new corpus repo:

  • for the biggest top ten (or so) the orig. should be located and copied to the new corpus repo
  • then these files should be removed from gt/sme/corp/
  • all small files could just be forgotten/ignored

Task list:

  1. Include the xsl files under version control
    1. RCS version control is almost finished, but an issue with access control is still open.
      1. Access control resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access.
  2. Improve Finnish language detection as part of the corpus processing
    1. Move to Bugzilla (Saara)
  3. Review automatic hyphen:
    1. Acceptable results: 90% of all real hyphens correctly tagged.
      1. Move to Bugzilla (Saara)

Further discussion about corpus analysis and computer use:

  • the new G5 is tremendeously faster than cochise, thus we want to use it
  • cochise will continue to be our main corpus repo
  • the corpus/gt/ dir will be synchronised with the G5
  • corpus analysis and usage will happen on the G5
  • we need to develop strong enough security routines for the G5 to fulfill our obligations towards the text licensers
  • we are still using only one processor when analysing - making some simple multiprocessor analysis script will increase the speed of the analysis a lot

6. Linguistics

Anything? Nothing.

7. Name lexicon infrastructure

Complex names

TODO:

  • make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
    • the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)
  • Saara has added the analyzer as part of the preprocess, but it is slow, and needs to be optimized.

Move these issues to bugzilla (Børre)

Preprocessor optimization

To optimize one could build a targeted transducer only containing the relevant lexicons for preprocessing. But presently that leads to the following lexicons being reported as referenced but undefined:

  • (Punctuation)
  • (Abbreviation)
  • (Acronym)
  • (Adposition)
  • (Negativeverb)
  • (Copula)
  • (VerbRoot)
  • (AdjectiveRoot)
  • (At)
  • (NounSecond)
  • (ALIT)
  • (NAMAT)
  • (SAS)

Perhaps picking the

hum-tf4-ans142:~/gt/sme/src trond$ grep '% ' adv-sme-lex.txt 
earret% eará adv ;
dan% dihte adv ;
...

TODO:

  1. make a lexc Root lexicon (first 40 lines of sme-lex.txt)
  2. extract the relevant parts of the relevant lexica from the main transducer
  3. built from the union of a and b.

Discussion will continue on the newsgroup.

XML format

TODO:

  1. testing of conversion
  2. eXist as editor:
    1. develop the needed XQueries and interface
    2. data synchronisation between risten.no and
    3. test whether eXist as editor is actually working well

8. Other

SGL Seminar

SGL has now been elected, with the folowing members:

  • Rolf Olsen (Else Turi)
  • Tor Magne Berg (Marit Breie Henriksen)
  • Elle Marja Vars (-)
  • Lena Kappfjell (Albert Jåma)
  • Heidi Andersen (-)

SGL/normativity seminar:

  • all members = potentially/likely all languages
    • not all languages, only North Sámi
  • date? As early as possible, end of February/beginning of March
  • place? Maaren will investigate

Infra for new projects and ideas:

  • make Forrest integration work as expected (Børre)

Bug fixing

30 open bugs (and 24 risten.no bugs)

  • Add bug report for the Xerox backspace error (Trond)

9. Summary, task list

Børre

  • send out contracts with accompanying letter
  • Gather public texts, preferrably also parallel ones
  • Continue converting text from input format to our xml
  • convert nob and nno bible texts to be used as part of a parallel corpus
  • review the paratext2xml converter
  • convert smj NT to paratext
  • Call Ove Sæth and Olavi Korhonen
  • Correct Forrest integration for new projects and project ideas
  • Move complex name lexicon issue to bugzilla
  • fix bugs!

Maaren

  • work with risten.no
  • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

  • continue discussion on the new lexicon format
  • Move the issue "Refine language detection for Finnish" to Bugzilla
  • Move the issue "Finnish the review of the hyphenation detection" to Bugzilla
  • Add version information of the tools to part of the corpus infra.
  • Fix the preprocess script and optimize it.
  • finalize an improved working version of the CGI and command line scripts for corpus additions
  • update conversion from lexc to xml (proper names) with the latest refinements
  • Try to add numeral treatment as part of the analyzator.
  • Look at crontab ga/ directory issue with Trond.
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
  • Lule Sámi twol problems, with Thomas and Trond
  • project planning with Trond, continued
  • Follow up on place names from Norge Digitalt
  • Evaluate SFST as speller (and analyzer) lexicon
  • write a background document on the corpus contracts
  • public tender:
    • review draft tender document from Finnut
  • smj G3 issue with Thomas and Trond
  • sme G3 issue with Thomas and Trond
  • call EDD/ Christian Emil Ore about national place name lexicon
  • risten.no/proper noun lexicon development: fix bugs, continue development
  • fix bugs!

Thomas

  • work on North Sámi compounding and derivation
  • review corpus usage documentation
  • smj G3 issue with Sjur and Trond
  • sme G3 issue with Sjur and Trond

Tomi

  • move aspell UTF-8 suffix bug to Bugzilla
  • corpus infrastructure:
    • dtd location (both public and internal)
  • Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
  • new proper name lexicon
    • discuss the new lexicon format and other issues in the newsgroup
    • Look into data synchronisation of proper nouns between risten.no and CVS
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • fix bugs!

Trond

  • Work on corpus texts with Børre.
  • Contact the Finnish and Swedish Bible societies to get Bible texts.
  • Look at ga/ directory issue with Saara.
  • News group discussion followup.
  • Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
  • Ask IT guys for an e-mail adress for corpus upload script: corpus@giellatekno.uit.no (forwarded to Børre)
  • fix bugs!.

10. Next meeting, closing

20.02.2006 09: 30

Closed at 11: 33