UIT The arctic university of Norway > Giellatekno
 

Meeting_2014-06-11

Tastatur og preprosessering

Planar for sumaren og hausten:

  • tastatur for iOS8 og Android (Lavangen og India? Utlysing)
  • preprosessering
  • arbeid til Mike

tastatur for iOS8 og Android (Lavangen og India? Utlysing)

Finansiering: Divvun-potten for ekstra satsingar Timeplan: ferdig til offentleg lansering av iOS8 (rykte: september) - vi satsar på 15. september for ein beta, ferdig så fort som mogleg etter det

Design-mål:

  • så lik Apple sitt tastatur som mogleg
  • fullføringsforslag og retteforslag frå hfst (men kanskje berre listebasert i fyrste versjon for å få han ut)
  • norske og ikkje-samiske teikn som popup-liste (som Apple-tastatura)
  • klårt skilje mellom språkuavhengig og språkspesifikk kode
  • tastaturlayout som xml-fil (eller noko liknande)
  • vi lagar for nordsamisk no, men skal enkelt kunna lagast for alle språka våre

Moglege framtidsvariantar:

  • swipe-inspirert?
  • (meir avansert) bruk av hfst-teknologi for stavekontroll og ordfullføring

preprosessering

Basert på:

  • hfst-pmatch? https: //kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstPmatch
  • hfst-ataq
  • something else?

Possible issues with hfst-pmatch:

  • char-by-char processing? Just like any other fst: state-by-state
  • processing of formatting? It can deal with any text - as long as the formatting is expressed as text (in-stream markup) it should be no problem
  • speed? We don't know yet, but fst speed anyway
  • compilation speed? Unknown until we try

Tommi: You cannot get your tokeniser as you analyse with ambiguos readings in middle of the string from pmatch; if "in order to" is lrlm there won't be "in" "order" "to" using pmatch applicator.

Sjur: Can this be changed in the pmatch code to collect all paths up until a common tokenisation point?

Tommi: Wouldn't it in the end be just as much work as rewriting from scratch and probably harder? Like, using pmatch for this with these specs is like having a hammer and trying very hard to use it on screws cause they kind of look like a nail.

See http://www.stanford.edu/~laurik/publications/pmatch for details on how to use (hfst-)pmatch.

arbeid til Mike

Mike to try out hfst-pmatch for a month, then we evaluate the feasibility of hfst-pmatch as an analysing tokeniser.

wishlist for tokeniser

  • have whitespace in the middle of words, e.g. \n and softhyphen
  • string: lettersequence - whitespacesequence - othersequence - ...
  • LR longest match for token-sharing boundaries
  • within the token, all the analyses
  • input: 12345678901235463; possible tokenisations:
    • 12 34 5678 90 12 35 463
    • 123 45678 90 12 35 463
      • ^12345678/12+34+5678/123+45678$ ^90/90$ ^12/12$ ...
      • thus: get both tokenisations between 1 and 8. then analyse
  • input: "the cat's mother, in order to", possible tokenisations:
    • the cat 's mother, in order to
    • the cat's mother, in order to
      • ^the/the$ ^cat's/cat+'s/cat's$ ^mother/mother$^,/,$ ^in order to/in+order+to/in order to$

Two possible tokenisations:

"<in order to>"
    "in order to" pr

"<in>"
    "pr"

"<order>"
    "order" vblex pres
    "order" n sg 

"<to>"
    "to" pr
  • output an ambiguous lattice ?
  • do backoff automata ? e.g. analyser -> regex -> unicode database
  • Sane handling for Finnic(?) coordinated compounds with hanging hyphen:
    • ”koira- ja kissajuttu” ?= koira+juttu ja kissa+juttu
    • it'd be neat if hyphenated words were not in morph. analyser.. maybe
  • Case mangling:
    • "Thing" -> thing
    • an tAerfort -> an t+aerfort

Re unicode regexes: "You can match a single character belonging to the "letter" category with \p{L}. You can match a single character not belonging to that category with \P{L}." See http://www.regular-expressions.info/unicode.html for details.

Which tools support Unicode regexes? pcre? Yes, I believe so. Any decent and recent programming language with proper ICU-based Unicode support : )