UIT The arctic university of Norway > Giellatekno
 

Notes For New Oahpa Code

This is a document for ideas and discussion about the new Oahpa code.

workshop-week from 19. Sept

IDEAS

Requirements, from different perspectives:

  • programmer
  • linguist
  • teacher
  • students

Today:Special features, not used for all languages

Leksa:

  • audio-files (crk-oahpa, vro-oahpa)
    • for the time being: audio files are not functioning with Safari
 <l pos="N" animacy="AN" t2c="no" rime="0" gen_only="Pl,Sg" audio="kohkos">kohkôs</l>
  • "almost correct"-feedback to the student (tcomm) for sma-oahpa, sme-oahpa
         <t pos="n" stat="pref">mors eldre søster</t>
            <t pos="n">eldre søster til mor</t>
            <t pos="n" tcomm="yes">tante</t>
 
  • different colours in teh interface for different animacy (crk-oahpa)
    • crk-oahpa: green for N+AN, or animacy="AN"
  • Place names as an option (sme-oahpa, sma-oahpa)
  • Multiword expressions as an option in menu (sme-oahpa, sma-oahpa)
    • the target language is multiword expression
      It is important to distinguish this from the ordinary Leksa choices, so the user knows that Leksa is not aksing for "the Saami word" for so-and-so, but for something which is more than one word.

Morfa

  • Dictionary (NDS) is integrated (crk-oahpa)
  • only-sg choice in menu (fkv-oahpa, rus-oahpa)
  • Transitive verbs (TI) are presented with object in Morfa-S (crk-oahpa)
<l pos="V" object="iskocês" trans_anim="TI" initial="c" rime="m">wâpahtam</l>
  • Translation tasks in Morfa-C (crk-oahpa)
  • What kind of agreement or other rule inducement is used or is needed
    • rus-oahpa: agreement of subject and main verb in number for pres/past, in gender for past tense (in elaborated multiple sentence tasks may be actual, not grammatical gender for professions)
    • rus-oahpa: agreement of modifier (adj or adj-l pron) and noun in gender and case in singular, MFN as gender (for modifier) and case in plural. Implies also change of gender for Adj Sg Nom -> Pl Nom task: {Msc, Fem, Neu} -> MFN
    • rus-oahpa: verb passive forms may be congruent with reflexive verb forms, whereas not all reflexive verbs from dictionary are suitable to be used for questions-answers in active voice (passive is not curently used in tasks)

FST with error-tags (Err)

  • only for sme, in old infra:
lookup -flags mbTT -utf8 /Users/lan000/errortag-gt/gt/sme/bin/ped-sme.fst

and is used for Vasta and Sahka.

e.g.

viessui	viessu+Build+N+DiphErr+Sg+Ill
vissui	viessu+Build+N+Sg+Ill
lavka	lávka+N+Sg+Nom+AErr
lávka	lávka+N+Sg+Nom
lávka	lávka+N+CGErr+Sg+Gen
lávkka	lávka+N+Sg+Gen

We want this for more languages, but implemented in another way than for sme. By concatenating FSTs instead of messing up the source files, which is a big work to maintain.

That we have done for numeral transcriptor in crk (Err/ShLo means error marking short/long vowel)

peyak	1+Err/ShLo
pêyak	1

Numra

  • Numra: special feedback for tag +Err/ShLo (crk-oahpa)
    • crk-oahpa: you can test with peyak for 1
    • rus-oahpa: tag +Err/ShLo is added if one or more stress marks are missing in a numeral string, accepted as a correct answer. The student's answer with incorrect stress marks is not accepted (although almost correct).

Different ways of adding limitations for words or forms

Leksa

  • exclude="smenob" in e-element (smeOahpa, smaOahpa)
    • for excluding nobsme: remove the nobsme entry
  • <sem class="NOTLEKSA"/> (crkOahpa)

Morfa completily

  • morfas="no" in l-element (smeOahpa)
  • gen_only="None" (crkOahpa)

Morfa partly

  • to be included in some Morfa tasks: gen_only="Pl,Sg" (crkOahpa)
  • to be included in one Morfa task: <sem class="MORFAPOSS"/> (crkOahpa)
  • to be included in Morfa tasks: <sem class="MORFA"/> (fkvOahpa)
  • to be included in Morfa tasks: <sem class="MORFAS"/> (smeOahpa, smaOahpa)

Interface design

  • all relevant menus in the top, also PoS choice (crk-oahpa)
  • instructions is given over the tasks, instead of in a box to the right (crk-oahpa)
  • morphological feedback appears to the right of the question it is referring to (crk-oahpa)
  • tool tip in instructions, eg. example of how to answer (crk-oahpa)
  • tool tip in morphological feedback (crk-oahpa)
  • special characters are given in a typer for all programs (crk-oahpa)
    • it appears when you click in the blank
  • menu for grammar explanations, with links to a web grammar (sme-oahpa)
  • dialect choice in menu (sme-oahpa and sma-oahpa)
    • only one dialect is presented in the tasks, but both dialects are accepted as input from the student
    • the morphological feedback is according to the chosen dialect
  • all instruction is in target language, but the student gets a tool tip translation by clicking Alt (sme-oahpa)
  • short verbal explanation of the game topic under each game icon and name (vro-oahpa)

New ideas (don't hesitate)

General structure ideas

Will need some discussion:

Oahpa needs more separation within the source code, there is a lot of logic relating to rendering exercises spread across various parts.

One way to simplify this could be separating Oahpa into two major pieces: (1) linguistic and exercise generation functions, which could be separated out into an internal-use-only JSON API, (2) frontend stuff, like presenting exercises to users, tracking user login, course management, etc.

These two pieces could either exist as their own separate codebases, or be in one. Separating them should allow us to write unit tests for exercise generation with a fake language only for Oahpa testing, and ensure that as language needs change, we have less surprises.

Each language project should have a central configuration setup for linguistic info and meta-info, and we should avoid copying code for each new project.

Other:

  • Newest version of Django
  • Use django-rest-API for exercise generation pieces?
  • Word database structure is otherwise great, tagsets, semantics, etc., but it may be possible to extract wordforms from this (see below)

Exercises

  • Defining Morfa-S and Leksa exercises should be as simple as editing YAML; some bits of this are coded already in the inner workings of morfa-s, but this should be the default for morfa-s, leksa, and numra-type exercises that require no syntax. More info below
  • Exercise code needs to produce very simple output, so that we can write automated tests with a sample project and maybe with a fake language project where we only add linguistic data we need to test
  • Morfa-C: we can write some generalized agreement process that allows a language maintainer to easily define syntax relationships. These have been hard to add in python, and require significant work, but it could be simplified... Ryan may have a sample checked in somewhere, when it comes to it

Exercise Definitions

For exercise definitions there are three parts:

1.) defining how menus appear, and what content goes in them 2.) defining how the questions appear 3.) defining connections between (1) and (2)

For leksa, numra, and morfa-s, the general structure is shared: question is a word with a particular form, answer is a question with a particular form.

Question form:

  • internal question ID, for making menus, etc.?
  • language iso
  • word semantics
  • word morphology
  • context

Answer form:

  • language iso
  • word semantics (usually same)
  • word morphology (leksa: usually lemma, morfa-s: cases, tense, etc)
  • context

Some of the ideas here exist in NDS (context-dependent question context)

Database

As far as django is concerned, there should be a separate lexicon/morphology database from user and login data.

It should be possible to install wordforms as a separate part of the process:

  • Example: the word structure, semantics, and Morfa-C exercises all do not need to be updated, but wordforms have changed and need to be replaced
  • Example: wordforms are fine, but Morfa-S or Leksa question/answer pairs have changed
  • Example: wordforms are fine, but Morfa-C has a new question, or a changed question
  • Example: wordforms are fine, questions are fine, but semantic sets need to be changed
  • Example: maybe more ideas ...

An option is using the FSTs "live", and only caching the last forms generated, unless the FST changes. Forms would be incrementally updated as users request them, and an install process could install only the word metadata structure, which should make the install process faster. Typically, unless the FST is very slow, these lookups can happen very quickly.

We can rely on an updated lookup server daemon, which already exists, to speed up this process as well. (See main/ped/lookupserv/)

Install definition

The install process should be defined in some system other than a `bash file, perhaps also .yaml, alongside the systemwide configuration.

User feedback

A button "Report an error" (wrong word form, etc.). The error reports will be saved and the expert (teacher or linguist) decides if it really was an error.

User activity logging

Logging of the activity of all users, not only of those who are logged in.

Would be very nice to have ideas

... but not immediately crucial

  • Morfa-C could use a word / wordform / semantics browser interface to simplify our troubleshooting processes, and help define exercises. Export XML?

What to send to analyser

Today

  • Numra is sent to transkriptor-FST, with error-tags,
  • Vasta and Sahka is sent to FST with error-tags

New code

  • Send Morfa to FST (API) with error-tags
  • Send also Leksa to FST with error-tags?

Modularity

We should get the linguistic data out of the system files

Linguistic data

Different kinds of data:

  • information in the source files
  • definitions of how to display content based on morphology, contained within python (ex. if noun, add verb pronoun context, if Essive, no Sg or Pl.)
  • coordinated with the information in the systemfiles
    • for choices in menus
    • for producing the tasks

Contents of drop-down menus (semantic supersets, case a.o. grammatical category lists, source lists (books and chapters), etc.) should be in separate file(s), not inside the program code files.

  • Create a procedure for automatically generating these lists out of the Oahpa source files (lexicon, tags, paradigms, semantic sets).
  • In the teacher interface there could be an option to check the contents of the automatically generated menus and to suppress the unwanted items.

Small FST for Oahpa? Containing the lexicon from ped/LANG/src/ This may be included in the standard newinfra build process: ./configure --enable-minimal-oahpa-fst

System files

Audio files/Text-to-speech

There are some audio files in Leksa in crk-oahpa and vro-oahpa. In vro we are also developing a new feature: reading aloud Morfa-C questions with the help of text-to-speech software.

How to make Oahpa for new languages

Maintainence

  • Linguist interface. The linguist should be able to install/remove new lemmas and tasks.
  • Teacher interface, e.g. (taken from oahpapres.pdf): Good morning, Dorothy!
  1. Mark (in the list) the part of speech and the morphological inflection you want Morfa S task for
  2. Mark (in the list) the lemmas you want to include in the task
  3. Write the instruction for the students
  4. – Wait, while the program is compiling –
  5. Test the tasks. Click for the next step.

Consistency in naming

  • Aim to introduce/keep names the same for the same things (and better different for substantial differences)
  • Tag collisions
    • Different uses of +Pass tag in sme and rus-oahpa (affects also nob, smn, sms, fin, est, vro, fkv, fao, mhr, vep)
    • rus-oahpa: some verb forms have secondary tags that are confused even with PoS tag (Adv)
  • Database fields
    • Database tables have columns named with actually reserved words: string, case. This could be avoided.

Konteaksta

We have a version for sme and two versions for rus (Ruskonteaksta and RusVIEW).

Student-log-in

We have a demo-version

Experiences from working with Oahpa from different languages

Pain points

  1. database update takes time, sometimes changes are small (i.e., subset of wordforms updated in FST, but still need to run whole install)
  1. database update requires a lot of awareness of an individual oahpa installation
  1. database: lots of places to configure data being installed, adding files is not transparent
  1. exercises: Morfa-C: adding new types of agreement is tricky
  2. exercises: Morfa-S: new types of exercises are also tricky
  3. exercises: Numra relies on some external FSTs, which Morfa currently does not, need better error feedback for whoever is maintaining the processes that these are missing when the process starts
  4. exercise: sahka / vasta lookupserv is tough to configure (new version exists: main/ped/lookupserv/)
  1. general: errors are opaque, the backend could do a better job of explaining what is wrong

Document the differences between the versions

The full list (24 Oahpas), is, as always, http://giellatekno.uit.no/ped/index.html. URLs starting with testing.oahpa.no are on gtlab, URLs with oahpa.no are on gtoahpa.

Most differences follow from the availability of basis resources.

  • Only sme has a full CG program, so only sme has Sahka
  • We are now working on Vasta-F for translation (with CG) for crk and est
  • Languages without CG but with an FST typically have 4 programs
  • Languages also an FST typically have 2 programs, Numra and Leksa

sme

sma

est

vro

  • http://testing.oahpa.no/voro/
  • 4 programs
  • spell-relax issues because of different ways of marking palatalisation and glottal stop
  • some audio files in Leksa (the full lexicon coming soon)
  • TTS integration under development
  • Morfa-C noun: "limited mix" exercises

crk

liv

  • http: //testing.oahpa.no/livokel/]
  • http: //giellatekno.uit.no/ped/liv-oahpa.html]
  • 4 working programs
  • The code for this setup is the same as for the other
  • This is the language with the most complicated writing system, and it may therefore be a testing ground for different spellrelax issues.

rus

  • http: //giellatekno.uit.no/ped/rus-oahpa.html]
  • http: //testing.oahpa.no/rusoahpa/]
  • 4 programs
  • Russian has a mechanism for handling stress
  • rusoahpa should be moved from gtlab to gtoahpa

sms

The other Oahpas (probably not so relevant for the initial part of the work)

izh

mhr

mrj

myv

fkv

mdf

bxr

olo

yrk

FST exists but it not really implemented

hdn

http://oahpa.no/hdn_oahpa/

The FST-less Oahpas

rup

http://oahpa.no/armaneshte/ http: //giellatekno.uit.no/ped/rupdoc/rup-oahpa.html]

kpv

http: //oahpa.no/kpvoahpa/] http: //giellatekno.uit.no/ped/kpv-oahpa.html] Leksa, Numra

sjd

sjd has some audio files, and plans for extending them. Leksa, Numra

udm

smn