UIT The arctic university of Norway > Giellatekno
 

Meeting_2006-03-20

Meeting setup

  • Date: 13.03.2006
  • Time: 09.30 Norw. time
  • Place: Wherever we are : -)
  • Tools: iChat, SubEthaEdit

Agenda

  1. Opening, agenda review
  2. Reviewing the task list from two weeks ago
  3. Documentation - divvun.no
  4. Corpus gathering
  5. Corpus infrastructure
  6. Infrastructure
  7. Linguistics
  8. name lexicon infrastructure
  9. Spellers
  10. Other issues
  11. Summary, task lists
  12. Closing

1. Opening, agenda review, participants

Opened at 09: 40.

Present: Børre, Linda (from topic 7 onwards), Maaren, Saara, Sjur, Thomas, Tomi, Trond

Absent: none

Main secretary: Tomi

Agenda accepted as is.

2. Reviewing the task list from the last meeting

Børre

  • send out contracts with accompanying letter
  • Gather public texts, preferrably also parallel ones
    • http: //girji.info/skolehist done …
  • Continue converting text from input format to our xml
    • Done
  • convert nob and nno bible texts to be used as part of a parallel corpus
    • Waiting for paratext versions
  • review the paratext2xml converter
    • Waiting for paratext versions of nob and nny
  • convert smj NT to paratext
    • Waiting for paratext versions of nob and nny
  • Call Ove Sæth
    • Not done
  • Move complex name lexicon issue to bugzilla
    • Not done
  • Ask KIO Grafisk to make a test Quark document based on a Word document from us
    • Iđut and KIO Grafisk won't give access to their Quark files, so they don't see the need to send it.
  • Send out letters to the Iđut authors
    • Åge Persen will collect an address list.
  • Add corpus security re G5 syncing as an issue to Bugzilla
    • Not done
  • set up anon. read-only cvs with Sjur
    • done
  • write docu for how to apply for a corpus user account (forms, recipients, etc)
    • Not done
  • remove old corpus files from gt/sme/corp/ after Trond has cleaned it
    • Not done
  • integrate generated corpus repository documents in the Forrest site
    • Not daon
  • give Saara the needed details regarding corpus dtd location on our public server
    • Not done
  • fix bugs!

Maaren

  • work with the top-ten list
    • done
  • translate stopword list from Norw. to smj, fin (aligner, stopword list from Trond)
    • ?????? (ask Trond)

Saara

  • Extract corpus meta info into a standard xml format; set up cron task for the extraction
    • Done.
  • Create a parallel corpora of the new testaments.
  • Implement validation of xml corpus against the dtd.
    • Done, not installed.
  • Create a group for corpus users.
    • learned how to create groups and added one: corpus.
  • Finish corpus dtd documentation, dtd location and permlink reference
    • The dtd-location still open.
  • Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
    • Implemented, some questions left before installation.
  • add more texts to the graphical corpus interface.
    • Not done. emailed Lars some questions.
  • grammatical searchability in the graphical corpus interface
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
    • not done
  • Lule Sámi twol problems, with Thomas and Trond
    • not done
  • project planning with Trond, continued
    • not done
  • Follow up on place names from Norge Digitalt
    • not done
  • Evaluate SFST as speller (and analyzer) lexicon
    • not done
  • write a background document on the corpus contracts
    • not done
  • public tender:
    • answer requests/questions
      • no questions this week
    • set up anon. read-only cvs with Børre
      • done (well, I really did nothing)
    • corpus repo access
      • still open, but we'll let them have only the free part for now; further discussions later
  • conversion of corpus repo summary xml to Forrest xml
    • not done
  • smj G3 issue with Thomas and Trond
    • not done
  • sme G3 issue with Thomas and Trond
    • not done
  • call EDD/ Christian Emil Ore about national place name lexicon
    • not done
  • risten.no/proper noun lexicon development:
    • refactor code
      • did some small adjustments - most of the work waiting for the task below
    • implement inheritance/collection overriding for xsl/css/xquery using sitemaps
      • first version of this system implemented and working for the initial query page for editors (select a collection - the query page that follows will now be generated from the most specific version found for the requested collection
  • fix bugs!
    • no bugs fixed last week: -(

Thomas

  • add incoming Lule sámi words
    • added and still adding
  • include the SGL decisions in our normativity document
    • done
  • include normativity desicions made by Magga and Sammalahti in our normativity document
    • done
  • work on North Sámi compounding and derivation
    • nothing
  • smj G3 issue with Sjur and Trond
    • nothing
  • sme G3 issue with Sjur and Trond
    • nothing
  • translate stopword list into smj (aligner; list from Trond)
    • not done

Tomi

  • move aspell UTF-8 suffix bug to Bugzilla
    • not done
  • corpus infrastructure:
    • dtd location (both public and internal)
      • not done
  • Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
    • not done
  • new proper name lexicon
    • discuss the new lexicon format and other issues in the newsgroup
    • implement data synchronisation of proper nouns between risten.no and CVS
      • looked briefly
    • XQuery refactoring and code development for our proper noun editor
      • helped Sjur with this one
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
      • not done
  • read aligner docu, install, provide feedback
    • not done
  • implement oslolaš issue for smj
    • done
  • fix bugs!
    • not done

Trond

  • Clean up the old corp/ directory.
    • Done
  • Work on corpus texts with Børre for parallel NT texts
    • Not much, still awaiting response from Bible societies.
  • Contact the Finnish and Swedish Bible societies to get Bible texts.
    • Asked my contacts in Norway to get addresses, still no response.
  • read aligner docu, install, provide feedback
    • Not done.
  • Ask for a Quark test file from his sister
    • Done, received, also got some from Michael Everson.
  • translate stopword list into nno?
    • Not done anything with the aligner issue.
  • fix bugs!.
    • Not done.

3. Documentation

Changes and updates because of the Divvun public tender

TODO:

  • document anonymous, read-only access to our cvs repo (Børre)
    • done
  • review: Sjur, Saara, by Wednesday morning
  • probably a new main section (sub-tab?) on external access to all our resources
  • documentation on how to apply for a user account for the corpus repo ( Børre)
    • Not done
  • we need to finish the corpus dtd documentation (Saara)

Permlink location for all our dtd's (filename will vary, of course):

http://giellatekno.uit.no/dtd/corpus.dtd

This corresponds to the dir ~/gt/public_html/dtd/ on our public web server. The DTD's has to be manually copied to this location, but since they don't change that often, that shouldn't be a big problem.

TODO:

  • copy updated DTD's to the permlink location (Børre or Saara)

4. Corpus gathering

Collecting

See a previous meeting memo for what's to be done.

TODO: Send out the rest of the letters (Børre)

Odin

Waiting for Sæth to discuss with colleagues about how to implement the cooperation, and return to us.

TODO:

  • call Sæth (Børre)
    • Not done.

Olavi Korhonen's Lule Sámi dictionary.

Korhonen and Oahpadusguovdásj have a shared copyright to the dictionary. They are both very positive.

KIO Grafisk and the Iđut books

Iđut and KIO Grafisk won't give access to their Quark files, due to copyright issues with fonts and pictures. It is a principle for them.

Citations from one of the discussions we have had with Quark experts:

  • Trond: Can you confirm that I can get a quark file from you WITHOUT at the same time receiving the fonts?
  • Michael: Of course. Quark files do not embed fonts. They are in the Fonts folder.
  • Trond: what about pictures?
  • Michael: However, if I send you a Quark file and you don't have the fonts, it will display with other fonts and will look bad. A low-quality image of a picture will appear when you import one into Quark, but the main picture that the printer uses is in a separate file.

TODO:

  • Børre will send letters to the authors.
  • send an e-mail explaining the issues as we see them, to Iđut and KIO Grafisk, and with a copy to the head of the project board (Anne Britt). This e-mail will be our last effort to try to get the Iđut material in its corrected form. ( Sjur and Børre)

Bible texts

TODO:

  • review paratext2xml converter (Saara)
  • convert smj NT to paratext (Børre)
  • ask to get fin and swe NT and OT in paratext format. (Trond)
    • Still not done. Trond has contacted Bibelselskapet for a new sme version, and the issue is under discussion.

5. Corpus infrastructure

TODO in transferring the old gt/sme/corp files to the new corpus repo:

  • for the biggest top ten (or so) the orig. should be located and copied to the new corpus repo (Trond)
    • done
  • then these files should be removed from gt/sme/corp/ (Trond/Børre)
    • done
  • all small files could just be forgotten/ignored
  • make sure there's nothing left with a copyright attached to it (Trond)
    • Trond will go a second round

TODO for access control:

  • Access control to corpus repo resolved through Unix groups: one group for corpus maintainers with write access, and another for corpus users, with read-only access. Still not done.
    • Saara has asked for a *nix group - use it when created
      • done

Further discussion about corpus analysis and computer use:

  • we need to develop strong enough security routines for the G5 to fulfill our obligations towards the text licensers
    • TODO: Børre to move this to bugzilla

TODO dtd usage and documentation:

  • corpus dtd documentation:
    • structure, content/model and location of the dtd (location = permlink) http: //giellatekno.uit.no/dtd/corpus.dtd TODO: Saara to write and finish the docu, also check the dtd link
      • see above under documentation for details.
  • add xml validation against our dtd to the corpus conversion process ( Saara)

HTML conversion problem

We need to extract only the table from input like below, since our DTD does not allow tables to be nested inside paragraphs. Cf newsgroup message from Saara.

<p>
  <table>
  ...
  </table>
</p>

The solution is a simple XSL template that will only match the relevant structures, and then "eat" the paragraphs before continue processing of the table:

  <xsl:template match="p[table]">
    <xsl:apply-templates select="./table">
  </xsl:template>

Correction tags?

There are many scenarios where information about spelling and other errors is useful, especially if combined with the correct string. As a simple way of marking up such info, Sjur proposed the following in the newsgroup:

... this is <error correct="text">tekst</error> with an error...

No problem in adding it to the DTD, together with corresponding info in the header/meta part. That info should be a single element stating that/whether it was manually edited for corrections. Absence of this element is the same as the document NOT being edited.

TODO:

  • update the DTD (Saara)

OPEN ISSUES:

  • since this is manual editing, we break the automatic regeneration/reconversion principle. Either we track each change when we find such editions in the existing version before re-conversion, and apply them again after the re-conversion, or we have to find another way of preserving them across conversion generations. Anyway, this is now left as an open issue, and added to Bugzilla (Sjur)
  • the proposed markup is too simplistic for describing more complex error patterns, e.g. when two different errors overlap or intertwine. One could allow nested error markup to cover cases with a syntactic error surrounding a spelling error (one error tag for the syntax error, another inside for the spelling error. To be further discussed in the newsgroup.

Changes and updates because of the Divvun public tender

User account admin and infra: see previous memo.

TODO: see above under Documentation.

Automatic build of the content of our corpus repo: also see previous memo.

TODO:

  • extract meta info into a compact xml document, the xml should be stored in the Forrest docu tree; set up cron task for the extraction (Saara)
    • done
  • discuss and decide upon the structure of the generated xml above (Sjur and Saara, news)
    • done
  • convert from that xml to Forrest document format (Sjur)
    • looked at the generated file, nothing more yet
  • integrate the final Forrest documents into Forrest, and make sure it gets published (Børre)
    • waiting for the above

Free and non-free texts

More info in the previous meeting memo.

TODO:

  • update scripts to handle this dichotomy. (Saara)
    • almost finished

More texts to the graphical corpus interface:

  • We would like to have more than the NT in the graphical interface (Saara)
  • We would like to have grammatical searchability, not only POS. (Saara, Trond)
  • This presupposes a discussion with Oslo. (Trond to start discussion and Trond and Saara to continue
  • For Lule Sámi: We would like to have a parallel corpus interface with NT (text only). This presupposes better quality texts (Trond, Børre)
    • Better Lule NT text still not made.

6. Infrastructure

We need to set up anonymous, read-only access to our cvs repo as outlined by our friend in Skolelinux.

Howto/who:

  • what do we need?
    • web interface? maybe, not required
    • command line check-out? yes (Roy Dragseth / Børre)
    • need to be able to restrict anonymous cvs to only specific modules ( Roy Dragseth / Børre)
      • done
  • testing needed: ( Saara, Sjur)

Aligner

TODO:

  • Read documentation and try out, give feedback to Bergen. (Trond, Saara, Tomi)
    • Trond to send relevant documents to Tomi.
  • Translate the stopword list anchor-eng-nor.txt into sme (and fin?) ( Tomi), and into smj (Thomas or Anders Urheim) (and nno? Trond). Note that the format is "lang1 / lang2", and that the number of lines should be kept in order to make it possible to move from one language to the next.
    • Saara to install the aligner, everyone to read the documentation on Tuesday and Wednesday, and then we have a meeting on it later this week.

Language recogniser

We don't have enough Finnish text. We will look at the Helsinki corpus ( Saara).

  • This is documented in bug database.

7. Linguistics

North Sámi

TODO:

  • document all past decisions in our normativity document (Thomas)
    • done
  • decide on a semantic feature system for nouns (Linda). This is needed for the sme locative/comitative issue. We looked at the feature system made by Eckhard Bick - Eckhard Bick: "The parsing system "palavras"", p.372ff. The top of the feature tree looks like the following:
                                 Concrete
                   +/                             \-
            Animate                                 Verbal Content
        +/           \-                            +/            \-
      Human          Moving                  Control               Mass
     +/    \-      +/      \-               +/     \-            +/    \-
#humans# Moving #vehicles# Movable    Perfective Perfective #features# Count 
.......

TODO:

  • Work with semantically based sets (Trond, Linda)
  • Return to the infrastructure issue (Trond)
  • A full semantic encoding of the lexicon is a future project, outside the scope of both divvun and disamb, but the ground work for such a project will be laid now.

Semantic tags we already have: Actio Plc (for proper nouns such as "London London+N+Prop+Plc+Sg+Nom")

Place names:

Now: Tags Plc Sur and combinations (London, Trosterud).

Problem: 19000 Plc 900 SurPlc (Trosterud BERN-surplc) Oslo sur. Berlin Bonn? 10500 Sur

  • Solution A:
    • Heavily retag the lexicon: London also as Sur (Jack)
  • Solution B: Do not use (new..) double tags.
    • Plc being the default tag for Plc/Sur ("if it can be Plc, it is Plc)
    • Sur being the tag of things that cannot be places (Andersen)
    • Then cg rules turning Plc into &Plc and &Sur, and Sur into &Sur.
    • Then rules for interpreting London, Trosterud, etc. as &Sur.
    • Then a final rule for removing ambiguous ones (remove Plc &Sur strings).
  • Solution C compromise:
    • Real placenames Menešjávri
    • Convertible placenames (today's double) England, Bonn, (default)
    • Real surnames Andersen, Johansson

TODO (Trond):

  1. Discussion testing
  2. infrastructure
  3. semiatomatic retagging

Lule Sámi

TODO:

  • add the rest of the inc- words (Thomas)
    • still working on it, should finish this week
  • name morphology (Thomas)
    • handed Tomi list
  • oslolaš for smj (Tomi)
    • done
  • translate Northern Sámi lists and sets to Lule Sámi
    • Linda, Trond, with help from mother tongue speakers (Thomas, others). Work in progress.

Trond will go to Drag tomorrow. Issues for the trip? No unobvious ones.

8. Name lexicon infrastructure

Complex names

  • make sure xml2lexc can handle complex names in ways compatible with our present tool chain (=reconstruct the lexc format we have now) (Tomi)
    • the resulting file format should be identical to our present prop-name file (=lexc), that can then be converted to our new xml format using the same script as for the regular names (Tomi or Saara, but only when the technical details are settled)

TODO:

  • Move this issue to bugzilla (Børre)

XML format

TODO on eXist as editor:

  1. refactor and prepare risten.no for multiple collections:
    1. develop the Cocoon sitemap to delegate requests to the proper folder level, such that the most specific code is always used (Sjur)
      1. Progressing well
    2. refactor the code into more and more specific components according to our folder hierarchy (Tomi)
  2. develop the needed XQueries and interface (Sjur, Tomi)
  3. data synchronisation between risten.no and the cvs repo (Tomi)
    1. done some, but it didn't work out, will need to start on a different trail
  4. test whether eXist as editor is actually working well (linguists)

Data synchronisation task list/specification:

  • the xml file needs to be stored/updated in cvs
  • there should be no diffs on whitespace and sorting order (to ensure we get only substantial diffs, and no noise)
  • the prop name update cycle should something like:
    • dump the xml from eXist (in proper sorting order)
    • check whether there are diffs against cvs; continue only if there are
    • update from cvs
    • error check: are there conflicts? if yes, send report to <somebody>
    • are there still diffs? if yes, continue:
    • check in/commit w. generated comment
    • error check: is the document valid and conformant xml? if no, stop and send a report to <somebody>
    • reimport the xml file into eXist
  • question: do we need to lock the file in eXist through this update cycle?
  • the update cycle should be a nightly cron job

9. Spellers

Nothing until the new proper noun lexicon is in place.

10. Other

Gobby

A cross-platform alternative to SubEthaEdit, Gobby, is now available for OS X through DarwinPorts. I haven't tested it in collaborative use, but it is worth looking at for collaboration involving Windows and Linux users.

Requirements for easy install:

Install and run the above as admin user. Then find and install Gobby (hint: use the search field), and wait. It took something like 8 hours on my computer to download, compile and install all dependencies, but there was no problems, hicups or other complications. When finished, open X11, and just type 'gobby' in any terminal window.

Bug fixing

35 open Divvun/Disamb bugs, and 25 risten.no bugs

SPR language policy decision

Last week's SPR meeting decided upon a language policy. Their decision was the following (cf. their meeting minutes for the final version):

«En fungerende språklig infrastruktur er av avgjørende betydning for at de samiske språkene skal kunne fungere som bruksspråk i et moderne samfunn. I en slik infrastruktur inngår en velutviklet terminologi, enspråklige ordbøker, ordbøker og lærebøker de samiske språkene i mellom, uttømmende grammatikker som gir informasjon om alle sider ved språket, og språkprogrammer som gjør det mulig å søke etter informasjon, finne termer, rette skrivefeil og grammatikk, og oversette maskinelt fra ett språk til et annet. For å få dette til trengs det representative tekstsamlinger, helst på flere hundre millioner ord, både enspråklige og parallellspråklige, det trengs grammatisk og språkteknologisk forsking. SPR sin rolle er å legge til rette for dette arbeidet.»

11. Summary, task list

Børre

  • send out contracts with accompanying letter
  • Gather public texts, preferrably also parallel ones
  • Continue converting text from input format to our xml
  • convert nob and nno bible texts to be used as part of a parallel corpus
  • review the paratext2xml converter
  • convert smj NT to paratext
  • Call Ove Sæth
  • Move complex name lexicon issue to bugzilla
  • Send out letters to the Iđut authors
  • Add corpus security re G5 syncing as an issue to Bugzilla
  • write docu for how to apply for a corpus user account (forms, recipients, etc)
  • remove old corpus files from gt/sme/corp/ after Trond has cleaned it
  • integrate generated corpus repository summaries in the Forrest site
  • copy updated DTD's to the permlink location, or help Saara do it
  • send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
  • fix bugs!

Maaren

  • work with new missing lists

Saara

  • Extract corpus meta info into a standard xml format; set up cron task for the extraction
  • Create a parallel corpora of the new testaments.
  • Implement validation of xml corpus against the dtd.
  • Create a group for corpus users.
  • Finish corpus dtd documentation, dtd location and permlink reference
  • update the corpus dtd with option for correction tags
  • copy updated dtd's to permanent external location
  • Update convert2xml.pl to handle two gt-trees (gtfree and gtbound)
  • add more texts to the graphical corpus interface
  • grammatical searchability in the graphical corpus interface
  • review paratext2xml converter.
  • install sentence aligner.
  • test anonymous cvs access and review documentation.
  • fix bugs!

Sjur

  • Follow up the lawyer treatment of the contracts
  • Follow up on place names from Norge Digitalt
  • Evaluate SFST as speller (and analyzer) lexicon
  • write a background document on the corpus contracts
  • public tender:
    • answer requests/questions
    • test anon. read-only cvs, review docu, and send link to Finnut
    • corpus repo access to free texts (with Børre)
  • conversion of corpus repo summary xml to Forrest xml
  • call EDD/ Christian Emil Ore about national place name lexicon
  • risten.no/proper noun lexicon development:
    • refactor code
    • implement inheritance/collection overriding for xsl/css/xquery using sitemaps
    • code design for XQueries needed for dict/term editing
  • send a final e-mail to Iđut and KIO Grafisk about copyright issues and texts
  • add manual editing of corpus files as an issue to Bugzilla (error tags)
  • fix bugs!

Thomas

  • add incoming Lule sámi words
  • work on North Sámi compounding and derivation
  • smj G3 issue
  • sme G3 issue
  • translate stopword list into smj (aligner; list from Trond)
  • assist Trond and Linda with the smj disamb work

Tomi

  • move aspell UTF-8 suffix bug to Bugzilla
  • corpus infrastructure:
    • dtd location (both public and internal)
  • Document aspell infrastructure: finish doc/proof/spelling/X-spell/aspell.xml (it's almost done, but there are a couple of loose ends)
  • new proper name lexicon
    • implement data synchronisation of proper nouns between risten.no and CVS
    • XQuery refactoring and code development for our proper noun editor
    • new version of xml2lexc (based on ccat), should handle complex names correct: construct entries like we have now from the different parts of a complex name entry
  • read aligner docu, install, provide feedback
  • translate stopword list into sme (aligner; list from Trond)
  • fix bugs!

Trond

  • Contact the Finnish and Swedish Bible societies to get Bible texts.
  • translate stopword list into nno?
  • double check all remaining docs in gt/sme/corp/ for copyright issues
  • grammatical searchability in the graphical corpus interface
  • better smj NT text
  • work on semantically based sets (sme, smj)
  • start and lead discussion and work on semantic features for disamb
  • fix bugs!.

12. Next meeting, closing

27.03.2006 09: 30

Maaren will be away the next four weeks, starting next week. After that she will work in Tromsø all May, and will share office with one of the Tromsø gang then. She will need her own key for that period (Trond).

Closed at 11: 48