UIT The arctic university of Norway > Giellatekno
 

Meeting_2006-08-28

Meeting setup

  • Date: 28.08.2006
  • Time: 09.30 Norw. time
  • Place: Where we are
  • Tools: SubEthaEdit, iChat

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09: 47.

Present: Saara, Sjur, Thomas, Børre, Tomi, Trond

Absent: Maaren

Agenda accepted as is.

2. Updated task status since last meeting

Børre

  • corpus collection:
    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Send out letters to the rest of the Iđut authors
    • contact Ája (Kåfjord)
    • send contracts to Čálliid Lágádus
      • All the above not done
    • Contact Richard Valkepää at NSI about older Min Áigi and Áššu files.
      • Not heard anything from him
  • corpus conversion:
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
      • All above not done
    • Move norwegian documents in Min Áigi from sme to nob
      • Not done
  • corpus access:
    • possibly deploy the user account form as an HTML form
      • not done
    • make a test user
    • Write both user and admin documentation (Børre, review: Sjur, Thomas)
      • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
      • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
      • Not done
  • set up Bugzilla automatic reminders for open issues
    • My attempt doesn't work
  • create document & document entry for name double-tagging
    • ?
  • Update forrests to latest svn version
    • Done before vacations
  • fix bugs!

Maaren

  • On sick leave

Saara

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
    • tested
  • Implement parallel corpus upload in web upload script
    • not done, waiting for decision of the cgi-bin scripts
  • Install Gobby
    • done
  • Test the aligners once again
    • done
  • refine the xml output of the xml-tagged analyses
    • done
  • convert or adapt the received PHP for paradigm generation to our needs
    • discussed possible solutions in the news
  • remove headers and footers from antiword documents, other improvements
    • antiword done, pdf-documents are to be fixed next
  • fix bugs!

Sjur

  • public tender:
    • review letters to tenderers, contract for subcontractor
      • contract finished, signed and countersigned. First common meeting held.
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • name lexicon:
    • implement editing functions
      • more done, plans for the rest is made.
    • finalise refactoring for multiple collections (regular search interface)
      • mostly done, but a bug in eXist or Cocoon is blocking the final step
  • review user and admin documentation for corpus access
    • not done
  • write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
    • not done
  • fix bugs!

Thomas

  • investigate productivity of even-syllable Actio compounding
    • done
  • investigate and identify under which conditions even-syllable Actio compounding is possible
    • done
  • discuss findings with the rest of us
    • done
  • add proper numeral analysis/treatment to smj
    • done to the same extent as in sme
  • add loanwords (e.g. latin -ere verbs) to smj
    • done
  • sme G3 issue
    • started with a load of help from my friends
  • review user documentation for corpus access
    • not done
  • create smj abbr file
    • copied the sme abbr file and lulefied it
  • review the document
    • done
  • Redirected following three syllable verbs and prevent them from being first-part Actios in compounds:
    • Reflexives on -dit
    • Reciprocals on -dit, -(a)lit
    • Momentatives on -dit, -(a)lit, -ádit, -ihit
    • Frequentatives on -(a)lit, -(u)hit, -dit
    • Continuatives on -dit, -(u)hit, -nit
    • Inchoatives in -nit
    • Translatives on -dit
    • Essives on -dit and -stit
    • Causatives on -dit, -stit
      • done
  • find and study all derived verbs in our corpus (depends on Trond)
    • not done
  • suggest which derivations could be generated (depends on Trond)
    • not done

Tomi

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • read aligner docu, install, provide feedback
  • Set up the mechanism for the hash-mark transducer package
  • test the new xml output of the xml-tagged analyses
  • export corpus tools to /opt (with cron)
  • fix bugs!

Trond

  • better smj NT text
  • get fin, swe, nob and nno NT and OT in paratext format
    • No bible work during summer, not received anything, will have to look into this again.
  • install aligner, test it and give feedback
    • Aligner installed, it is good, but slow an operates manually. Will call Bergen today.
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
    • Not done.
  • make shell script wrappers for the most common commands for user friendlyness
    • Some done, not others.
  • write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
    • not done
  • write documentation for our bound users, with pointers to the ordinary documentation
    • not done
  • write documentation on double-tagging names
    • not done
  • discuss web-only user access management with Oslo
    • not done
  • change tagging of derived stems in the disamb output, to facilitate much easier extraction of non-lexicalised derivations
    • prepared for the conversion, not done the actual change.
  • fix bugs!.

3. Documentation

TODO:

  • Write both user and admin documentation (Børre, review: Sjur, Thomas)
    • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
    • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.

4. Corpus gathering

Nothing has happened during the summer.

Olavi Korhonen's Lule Sámi dictionary.

Waiting for an answer.

Bible texts

Will have a second round with the Word versions.

TODO:

  • get nob and nno NT and OT in paratext format. (Trond)
  • convert smj NT to paratext (Børre)
  • convert fin, swe to paratext or directly to our XML (Børre)

Kåfjord

TODO:

  • contact Ája (Børre)
  • talk to Lene about Kåfjord (Børre)

Sámi Instituhtta

When will we get the corpus? We still don't know, Børre will contact him again.

TODO:

  • contact NSI again (Børre)

Čálliid Lágádus

http://www.calliidlagadus.org/

TODO:

  • send contracts (Børre)

Árran

TODO:

  • contact Bård Eriksen again (Børre)

5. Corpus infrastructure

General

What we would like: A make-type system that kept track of the file.xml and file.xsl -pairs, always converting the former whenever the latter had a newer date. Cf. Trond's letter "Makefile for xsl corpus conversion?" in the news.

At first sight, this sounds like a good idea.

TODO:

  • remove headers and footers from antiword documents, other improvements as needed, including PDF conversion (Saara)
  • give the make issue a second sight (Tomi)
  • comment Trond's suggestion in the news (all)

User accounts and access

For details, see a previous meeting memo, as well as the memo from a dedicated meeting.

Shell access

TODO:

  • export to /opt (with cron) tools that the project team members find in their cvs tree (the bound users do not have a cvs tree, and therefore need these tools in /opt in order to conduct linguistic analyses) (Tomi)
    • ccat
  • make shell script wrappers for the most common commands for user friendliness (we must think of what commands they are) (Trond)
    • (first version of first script, teaksta.sh, was checked in, but it is still not working (the problem is a simple handling of input-output, some shell script literates should have a look at it)
  • write user account form, probably ask for copy of existing ones from the IT centre (Trond and Sjur)
  • possibly deploy the user account form as an HTML form (Børre)
  • write documentation for our bound users, with pointers to the ordinary documentation (Børre, Trond)
  • write documentation for how to apply for a user account (where's the form, to whom do I send the form, who needs it, etc.) (Børre)
  • make our own guidelines for the user application processing (Børre)
  • make a test user (Børre)
  • test corpus access as test user (Trond)

Web browser access

TODO:

  • discuss with Oslo (Trond)
  • delay other tasks until we are ready to go public?
  • user management for access to bound texts
  • short user guide needed before going public (either write one or take whatever has been made in Oslo (Trond)

More texts to the graphical corpus interface:

TODO:

  • add text to the server (Lars)

Aligner

The aligner aligns fine, better than its competitors. Unfortunately it is slow, and dependent upon manual input.

TODO:

  • contact Bergen to discuss these issues, ask for a non-manual interface, etc. ( Trond)

Language recognition

Still waiting for more smj and sma text to improve it. We need South Sámi, since Reindriftsnytt/Boazodoallo-ođđasat is trilingual, nob, sme, sma. We are presently unable to correctly identify sma.

Corpus summary

The time-based statistics is still missing.

TODO:

  • add time-based display as a feature request to Bugzilla (Sjur)

6. Infrastructure

Xerox tools wrapped as servers

To improve throughput and response time on heavy loads, it would really be nice to have the Xerox tools wrapped up as servers.

TODO:

  • decide the programming language to use (Saara)
  • find some (almost-)ready-to-use code to build on (Saara)
  • implement it (Saara)

Paradigm generation

Goal: Reuse Greenlandic code for paradigm generation.

Saara has given a report on the PHP code in News. Please read.

Conclusion: Only to be used as a source of inspiration. We'll wait with further work until we have the server wrapper thing (see above) in place.

Hyphenator

TODO:

  • correct the treatment of hyphenation of word boundaries and exceptions (fst gymnastics) (Sjur, Trond)
  • Update the sma hyphenator rule set with the insights gained from smj updates ( Trond during summer vacation)

Automatic Bugzilla reminder for untouched bugs

TODO:

  • give mail reminders a second try; ask Thor-Øivind for help if needed ( Børre)

M4

Tomi and Saara did a lot in Tromsø. How far is it now? Probably finished today!

TODO:

  • finish the work, and check it in (Saara)

7. Linguistics

Derivation and spellers like Aspell

To make it easier to extract all derived stems, we should enhance the tags used for derivations in sme to make them easier to grep. The most straightforward solution is to make the tags follow the same pattern as for smj, +Der/NNN. Presently sme is only using the NNN part as a tag, where NNN represents the derivational suffix. That is, there is no single pattern to match against for sme.

Problematic issue: the disamb output will presently give information only about non-lexicalised derivations. This can potentially give false data on the frequency of derivational affixes. To get (more) precise data, there should be an option in lookup2cg that favours derived analyses over lexicalised ones, everything else being equal. A similar option for compounds can also be useful in certain contexts. That is, the present behaviour should be partly turned upside-down ("select the analysis/-es with the fewest compounds and derivations available"). The best alternative would probably be to select the next to least complex analysis, that is, to allow only one compound border or one derivation.

TODO:

  • change tagging of derived stems in the disamb output, to facilitate much easier extraction of these verbs (Trond, Tomi)
    • Issues: Rewrite the sme-lex.txt, sme-dis.rle, sme-tdis.rle, twol-sme.txt files (and perhaps other source files), (40 tags and 5 files are involved (+ possibly 5-10 more, utf8-scripts, also the tag conversion scripts in sme/src and in gt/cwb/korpustags.txt)). Pseudocode:
For file i, take taglist 33-37 and replace with 39-43, respectively
  • find and study all derived verbs in our corpus (Thomas)
  • suggest which derivations could be generated (Thomas)
    • see source code above, but also consider overgeneration problems, as well as input from the statistical results in the previous point
  • lexicalise the rest (Thomas)

Semantic double-tagging of names

The policy needs documentation. Thus:

TODO:

  • Make a section under gt/doc/lang/smi/, add a chapter Common linguistic resources under the Languages section in site-frag.xml, and add the document below to that section (Børre)
  • write guidelines for annotators wrt. to name tagging and put them under gt/doc/lang/smi. A short version of the guidelines goes as follows:
    Systematic ambiguity (the Trosterud (plc/sur) and Aftenposten (org/obj) types) should not be doubletagged, but given primary tags (plc, org) only. Doubletagging should be confined to exceptional cases (Eira (Fem/Sur/Plc), Janne (Fem/Mal), such things. These guidelines should be added to our documentation file. (Trond)
  • make sure all linguists is aware of the guidelines (Trond, Sjur)
  • write disamb rules to implement the system above (Trond, Linda)

North Sámi

The following already derived verbs (verbs ending in -šit, -skit, etc.) are not happy with further derivation. It seems that most of them do not appear as Actio forms in the first part of compounding either. The following holds for both sme and smj:

LEXICON MUITTASJ !Words ending -šit, -skit, -smit, -idit, -ldit, -git and 
5-syllables, formerly directed to 
MUITAL
 +V+TV: MUITALStem ;
!These derived verbs have now been redirected to MUITTASJ and similar lexica.
!Reflexives on -dit
!Reciprocals on -dit, -(a)lit
!Momentatives on -dit, -(a)lit, -ádit, -ihit
!Frequentatives on -(a)lit, -(u)hit, -dit
!Continuatives on -dit, -(u)hit, -nit
!Inchoatives in -nit 
!Translatives on -dit
!Essives on -dit and -stit
!Causatives on -dit, -stit

Examples:

  • muittašit > *muittašallat

TODO:

  • investigate and identify under which conditions Actio compounding is possible ( Thomas)
    • done
  • discuss findings with all of us (Thomas)
    • done

Lule Sámi

TODO:

  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt ( Thomas, Trond)
  • convert roughly 100 smj names from that file (lines 740-843) to XML ( Saara)
  • add inc abbr to a new abbr lexicon file (Thomas)
    • copied the sme abbr file and lulefied it
  • add proper numeral analysis/treatment (Thomas)
    • done to the same extent as in sme
  • add loanwords (e.g. latin -ere verbs) (Thomas)
    • done

8. Name lexicon infrastructure

Decided in Tromsø:

  • add smj proper noun lexicon file to the output
  • remove
^ # 0}} from the center ID
* replace spaces with underscores in all IDs
* remove occurence indicator from language IDs: Agalin_1 (the center/concept ID)
  => Agalin (the language ID), and thus the two Agalin's should become one
  language entry (but two different concept entries)
* store a redundant copy of the center-file semantic information in the
  language-specific files, for processing speed
* add logging facilities
* add option to download local copies of the lexicon files directly from the db
* batch editing (change all entries in the found set), should later be enhanced
  to allow selection of exceptions (the found set minus deselected items)
* all names in all languages by default
* tag for excluding/including a name from certain applications
* hide / display {{^}} during browsing
* future epxansion: choose what info to display in the single language browser
* search by (single) language
** done
* display existing language entries when adding a new language to a record
* make searches behave predictable (the hits should be the expected ones)
** done
* add editor to change single, existing entries

Details can be found in [the meeting
memo.|/doc/admin/physical_meetings/tromso-2006-08-propnoun.html]

Language entry example illustrating both the sem-tag on the sense elements, and
the removal of occurence indicator:
{{{
<entry id="Agalin">
 <infl lexc="BERN" />
 <senses>
  <sense sem="plc" ref="Agalin"/>
  <sense sem="sur" ref="Agalin_1"/>
 </senses>
</entry>

TODO:

  • improve lexc2xml conversion (Saara)
    • add default smj entries
    • exclude ^ # 0 from the center ID
    • add an empty <log/> element to all entries (center and lanuage files)
    • add a last-update attribute to the root element of all files
  • finish refactoring for multiple collections in the search interfarce ( Sjur)
    • waiting for a bug fix (Tomi is investigating it)
  • develop the needed XQueries and UI (Sjur, Tomi)
  • data synchronisation between risten.no and the cvs repo (Tomi)
    • discussion started on eXist-list, nothing useful came up. We need to reformulate the question from our perspective, and bring it up again ( Sjur)

9. Public tender

TODO:

  • write a contract (mostly done by Finnut, review by Sjur)
    • done
  • get it signed (Finnut, Lennart Mikkelsen)
    • done

10. Tromsø meeting round-up

TODO:

  • check in meeting memos (Sjur)
  • Polderland questions. Thomas did already send requested info.
  • speller development - see the meeting memo. Separate follow-up next week.
  • Lule Sámi linguist - Sjur has tried to call a possible candidate, but no answer so far. He will use e-mail instead. (Sjur)
  • order AirPort Express to Tromsø (Sjur)

11. Other

Bug fixing

43 open Divvun/Disamb bugs (two down!), and 25 risten.no bugs

Guess: 1/3 of the bugs are fixed already (?)

Please help Saara with bug 279 (Perl locale). Not much help... Saara will contact Roy on this issue.

Gobby

TODO:

  • install Gobby (Trond, Sjur)
  • review the document ( Thomas)
    • done - document accepted: -)

Task lists as iCal entries

TODO:

  • update all forrest installations to r430284 (Børre)
cd $FORREST_HOME
svn up -r430284

11. Next meeting, closing

Next meeting 4.9.2006 at 9: 30.

Closed at 11: 25.

Appendix - task lists for the next week

Børre

iCal

  • corpus collection:
    • send out contracts with accompanying letter
    • Gather public texts, preferrably also parallel ones
    • Send out letters to the rest of the Iđut authors
    • contact Ája (Kåfjord), talk to Lene
    • send contracts to Čálliid Lágádus
    • contact Richard Valkepää at NSI about older Min Áigi and Áššu files
    • contact Bård Eriksen again
  • corpus conversion:
    • convert nob and nno bible texts to be used as part of a parallel corpus
    • convert fin, swe to paratext or directly to our XML
    • review the paratext2xml converter
    • Move norwegian documents in Min Áigi from sme to nob
  • corpus access:
    • possibly deploy the user account form as an HTML form
    • make a test user
    • Write both user and admin documentation (Børre, review: Sjur, Thomas)
      • User documentation probably in several languages. This covers how to apply for an account, on what grounds one can apply, and pointers to documentation telling how to use the corpus.
      • Admin documentation, telling how we set the permissions to the corpus files, and whatever other processes and tasks needed to set up a corpus account.
  • set up Bugzilla automatic reminders for open issues
  • create document & document entry for semantic double-tagging of names (for Trond)
  • Update forrests to svn version r430284
  • fix bugs!

Maaren

  • On sick leave

Saara

iCal

  • Create a parallel corpora of the new testaments
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
  • Implement parallel corpus upload in web upload script
  • remove headers and footers from pdf documents
  • Implement server of the analysis tools.
  • Add more languages to the lexc2xml propernoun conversion.
  • Refine the namelex output
  • finish M4 work
  • fix bugs!

Sjur

iCal

  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • name lexicon:
    • implement editing functions
    • finalise refactoring for multiple collections
    • implement improvements decided upon in Tromsø
  • review user and admin documentation for corpus access
  • write user account form, probably ask for copy of existing ones from the IT centre (with Trond)
  • add time-based corpus summary as a feature request to Bugzilla
  • check in meeting memos from Tromsø
  • start hiring process of linguist and programmer
  • order AirPort Express to the Tromsø gang
  • install Gobby
  • fix bugs!

Thomas

iCal

  • sme G3 issue
  • bug-fixing!
  • refine smj proper noun lexica, cf. the propernoun-smj-lex.txt
  • review user documentation for corpus access
  • find and study all derived verbs in our corpus (depends on Trond)
  • suggest which derivations could be generated (depends on Trond)

Tomi

iCal

  • new proper name lexicon
    • data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
    • implement improvements decided upon in Tromsø
  • read aligner docu, install, provide feedback
  • Set up the mechanism for the hash-mark transducer package
  • test the new xml output of the xml-tagged analyses
  • export corpus tools to /opt (with cron)
  • Do the sme Der/ change (with Trond)
  • consider Trond's suggestion of a makefile for corpus conversion
  • fix bugs!

Trond

iCal

  • better smj NT text
  • get fin, swe, nob and nno NT and OT in paratext format
  • contact Bergen about aligner issues
  • fst gymnastics to add hyphenation and word boundary marks to hyphenation transducer
  • make shell script wrappers for the most common commands for user friendlyness
  • write user account form, probably ask for copy of existing ones from the IT centre (with Sjur)
  • write documentation for our bound users, with pointers to the ordinary documentation
  • write documentation on semantic double-tagging of names
  • discuss web-only user access management with Oslo
  • change tagging of derived stems in the disamb output, to facilitate much easier extraction of non-lexicalised derivations
  • do the sme Der/ change (with Tomi)
  • write short user guide for the corpus web interface
  • install Gobby
  • fix bugs!.