UIT The arctic university of Norway > Giellatekno
 

Meeting_2006-02-27

Meeting setup

  • Date: 27.02.2006
  • Time: 09.30 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Linguistics
  7. name lexicon infrastructure
  8. Spellers
  9. Other issues
  10. Summary, task lists
  11. Closing

1. Opening, agenda review, participants

Opened at 10: 36.

Present: Børre, Saara, Sjur, Thomas, Trond

Absent: Maaren, Tomi

Main secretary: Trond

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • send out contracts with accompanying letter
  • Gather public texts, preferrably also parallel ones
  • Continue converting text from input format to our xml
    • Some new texts added to the corpus, from the Sámediggi
  • convert nob and nno bible texts to be used as part of a parallel corpus
    • Not done
  • review the paratext2xml converter
  • Not done
  • convert smj NT to paratext
    • Not done
  • Call Ove Sæth and Olavi Korhonen
    • Called Korhonen, he was very positive about sharing his Lule Sámi dictionary. He asked if he could get access to our corpus as a return favour, because he is also making a Northern Sámi dictionary.
  • Correct Forrest integration for new projects and project ideas
    • Done with Sjur
  • Move complex name lexicon issue to bugzilla
    • Not done
  • fix bugs!

Maaren

  • work with risten.no
    • Added words. Also added words from a frequency-based missing list.
  • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.

Saara

  • continue discussion on the new lexicon format
  • Fix the preprocess script and optimize it.
    • not done.
  • update conversion from lexc to xml (proper names) with the latest refinements
  • Try to add numeral treatment as part of the analyzator.
    • not done
  • Look at crontab ga/ directory issue with Trond.
    • postponed.
  • Create a parallel corpora of the new testaments.
    • In progress.
  • Routine for adding new languages to the propernoun xml-structure (Lule Sámi name lexicon and links to the termcenter.xml etc.).
    • In progress.
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
  • Lule Sámi twol problems, with Thomas and Trond
  • project planning with Trond, continued
  • Follow up on place names from Norge Digitalt
  • Evaluate SFST as speller (and analyzer) lexicon
  • write a background document on the corpus contracts
  • public tender:
    • review draft tender document from Finnut
      • almost finished - will finish today.
  • smj G3 issue with Thomas and Trond
  • sme G3 issue with Thomas and Trond
  • call EDD/ Christian Emil Ore about national place name lexicon
  • risten.no/proper noun lexicon development: fix bugs, continue development
  • fix bugs!
  • other:
    • was on Winter Holiday

Thomas

  • work on North Sámi compounding and derivation
    • nothing, been working with Lule sámi namelexicas and incoming words
  • review corpus usage documentation
    • nothing, been working with Lule sámi namelexicas and incoming words
  • smj G3 issue with Sjur and Trond
    • nothing, been working with Lule sámi namelexicas and incoming words
  • sme G3 issue with Sjur and Trond
    • nothing, been working with Lule sámi namelexicas and incoming words

Tomi

On sick leave.

Trond

  • Mostly worked on Lule Sámi this week.
  • Work on corpus texts with Børre.
    • Discussed issues.
  • Contact the Finnish and Swedish Bible societies to get Bible texts.
    • Not done.
  • Look at ga/ directory issue with Saara.
    • Postponed. We discussed it, though...
  • News group discussion followup.
  • Do a bug report (if not done) on commandline (mis)behaviour in the Xerox tools
    • Done.
  • Ask IT guys for an e-mail adress for corpus upload script: corpus@giellatekno.uit.no (forwarded to Børre)
    • Other people have looked into this.
  • fix bugs!.

3. Documentation

TODO:

  • Integrating the future project plans and ideas in our Forrest documentation ( Børre)
    • Done by Sjur and Børre.

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Waiting for Sæth to discuss with colleagues about how to implement the cooperation, and return to us.

TODO:

  • call Sæth (Børre)

Olavi Korhonen's Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary.

KIO Grafisk and the Iđut books

We need a test file in order to find out whether file conversion from Quark to InDesign works.

TODO:

  • Trond to ask for a Quark test file from his sister
  • Børre to ask KIO to send a more elaborate test file, representative for what we can expect, but without any copyrighted content (and please stress, that fonts will never leave the computer, they are part of the OS, and are not included in the Quark file). The file should include text with Sámi and Norwegian letters, and preferably some dummy pictures as well. Actually, Børre could create that file in Word and send it to KIO for them to move to Quark and send back.
  • Børre will send letters to the authors.

Bible texts

TODO:

  • review paratext2xml converter (Børre)
  • convert smj NT to paratext (Børre)
  • ask to get fin and swe NT and OT in paratext format. (Trond)
    • Still not done. Trond will contact Bibelselskapet for a new sme version, and for the NT nob and nno, and the fin and swe ones (after Wednesday)

5. Corpus infrastructure

We need more "version control" in the corpus work - we don't know which version of the XSL script was used (but roughly whether it was used or not). We need to document within the XSL file which version of all tools were used, including the XSL file (common section/template) itself.

Transferring the old gt/sme/corp files to the new corpus repo:

  • for the biggest top ten (or so) the orig. should be located and copied to the new corpus repo
  • then these files should be removed from gt/sme/corp/
  • all small files could just be forgotten/ignored

TODO:

  • Access control to corpus repo resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access. Still not done.
    • Saara will ask Roy to do what should be done.

Further discussion about corpus analysis and computer use:

The new G5 is tremendeously faster than cochise, thus we want to use it. But cochise will continue to be our main corpus repo.

  • the corpus/gt/ dir will be synchronised with the G5
    • TODO: set up copying script (Saara)
  • we need to develop strong enough security routines for the G5 to fulfill our obligations towards the text licensers
    • TODO: Børre to move this to bugzilla
  • we are still using only one processor when analysing - making some simple multiprocessor analysis script will increase the speed of the analysis a lot
    • script implemented, but not tested due to copying not in place yet (see above).

Secure copying:

To get cvs ssh working without password prompting:
ssh-keygen -t rsa
<just type enter to all questions>
chmod 0644 .ssh/id_rsa.pub
scp .ssh/id_rsa.pub <user>@cochise.uit.no:.ssh/authorized_keys2

New tasks:

  • corpus dtd documentation:
    • structure, content/model and location of the dtd (location = permlink) http: //giellatekno.uit.no/dtd/corpus.dtd TODO: Saara to write and finish the docu, also check the dtd link
      • the link isn't working
  • finalize gt/doc/ling/corpus_conversion_tech.html (and possibly convert it to xml) (Saara)
  • add xml validation against our dtd to the corpus conversion process (Saara)

6. Linguistics

North Sámi

We want to get the decisions from the SGL meeting in December.

TODO:

  • get the decisions (Maaren)

Lule Sámi

NFR wrote:

[Programstyret] bed (...) om at det umiddelbart vert laga ein alternativ plan 
for resten av prosjektperioden. Denne planen må gå ut frå den endra føresetnaden,
at ein ikkje har tilgang til den lulesamiske ordboka. Programstyret vil at ein
går over til den alternative planen frå 1. mars, dersom tilgangen ikkje er reelt
i orden på det tidspunktet. 

The criterion for continuing with Lule Sámi that we set up in our plan was the following, somewhat stronger:

kriteriet for om det blir satsing på lulesamisk er om vi har ein betaversjon
av den lulesamiske transdusaren, med integrert leksikon, i gang 1. mars.

Moments to our March 1st report:

  • We have recieved the dictionary, and integrated most of it (27.2: 78 % of the Kintel words are added)
  • Our Lule Sámi disambiguator is up and running as an alpha version
  • The analyser runs on 80,0 recall on token and 67,4 on type, on the New Testament (proper names excluded)
  • The analyser has 17773 lexicon lines
  • We have 3367 lines of non-allocated Kintel words
  • The integration of proper nouns will be done in parallel with Northern Sámi, and when it suits the Northern Sámi conversion timetable.

Goal for March 1st: Reach the 90 percent lexical recall limit.

Conclusion: we have already met the basic requirements for continuing with Lule Sámi.

TODO:

  • Trond and Thomas: work on the 90 % limit
  • Trond: Write a report to NFR.

7. Name lexicon infrastructure

Complex names

  • make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
    • the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)

TODO:

  • Move this issue to bugzilla (Børre)

XML format

TODO:

  1. testing of conversion
  2. eXist as editor:
    1. develop the needed XQueries and interface (Sjur, Tomi)
    2. data synchronisation between risten.no and (Sjur, Tomi)
    3. test whether eXist as editor is actually working well (Thomas, others)

8. Spellers

Nothing while Tomi is on sick leave, and until the new proper name lexicon is in place.

9. Other

Upcoming Strategy money application deadline at Samisk senter

Possible grant proposal themes include:

  • Southern Sámi:
    • Grammar research
    • corpus collection
    • normativity seminar (future speller)
  • Lule Sámi:
    • Disambiguator
  • Northern Sámi:
    • Text-to-speech
    • Semantic annotation
  • Programming infrastructure:
    • Setting up a structure for inflecting dictionaries

SGL Seminar

  • all members = potentially/likely all languages
    • not all languages, only North Sámi
  • date:
    • As early as possible, end of March?
  • place? Maaren will investigate

Bug fixing

33 open bugs (and 24 risten.no bugs)

10. Summary, task list

Børre

  • send out contracts with accompanying letter
  • Gather public texts, preferrably also parallel ones
  • Continue converting text from input format to our xml
  • convert nob and nno bible texts to be used as part of a parallel corpus
  • review the paratext2xml converter
  • convert smj NT to paratext
  • Call Ove Sæth
  • Move complex name lexicon issue to bugzilla
  • Ask KIO Grafisk to make a test Quark document based on a Word document from us
  • Send out letters to the Iđut authors
  • Add corpus security re G5 syncing as an issue to Bugzilla
  • fix bugs!

Maaren

  • work with risten.no
  • discuss with relevant people regarding seminar on proofing tools, normativity and SGL in February/March, including place.
  • get the normativity decissions from the December SGL meeting

Saara

  • continue discussion on the new lexicon format
  • Fix the preprocess script and optimize it.
  • Try to add numeral treatment as part of the analyzator.
  • Move gt2ga.sh to G5 and implement copying of the gt-dir.
  • Create a parallel corpora of the new testaments.
  • Routine for adding new languages to the propernoun xml-structure.
  • Move to Bugzilla: the analyzer needs to be optimized.
  • Implement validation of xml corpus against the dtd.
  • Create a group for corpus users.
  • Finish corpus dtd documentation, dtd location and permlink reference
  • Finish gt/doc/ling/corpus_conversion_tech.html and rename to .xml
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
  • Lule Sámi twol problems, with Thomas and Trond
  • project planning with Trond, continued
  • Follow up on place names from Norge Digitalt
  • Evaluate SFST as speller (and analyzer) lexicon
  • write a background document on the corpus contracts
  • public tender:
    • review draft tender document from Finnut
  • smj G3 issue with Thomas and Trond
  • sme G3 issue with Thomas and Trond
  • call EDD/ Christian Emil Ore about national place name lexicon
  • risten.no/proper noun lexicon development:
    • refactor code
    • implement inheritance/collection overriding for xsl/css/xquery using sitemaps
    • data synchronisation between risten.no and the cvs repository
  • fix bugs!

Thomas

  • work with Lule sámi 90% goal
  • work on North Sámi compounding and derivation
  • review corpus usage documentation
  • smj G3 issue with Sjur and Trond
  • sme G3 issue with Sjur and Trond

Tomi

  • move aspell UTF-8 suffix bug to Bugzilla
  • corpus infrastructure:
    • dtd location (both public and internal)
  • Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
  • new proper name lexicon
    • discuss the new lexicon format and other issues in the newsgroup
    • Look into data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • fix bugs!

Trond

  • Bring Lule Sámi up to 90 %
  • Write Lule Report to NFR
  • Apply for strategy funds
  • Work on corpus texts with Børre.
  • Contact the Finnish and Swedish Bible societies to get Bible texts.
  • Look at ga/ directory issue with Saara
  • Ask for a Quark test file from his sister
  • News group discussion followup.
  • fix bugs!.

11. Next meeting, closing

06.03.2006 09: 30

Børre, Saara, Trond will be away.

Closed at 12: 00