The sentence is analysed correctly with preprocess, but not with the Apertium fst preprocessor, which tokenises "6." as "6 ." and thereby introduces a sentence boundary. The result (in this case) is a wrong analysis in the Inf / Pl1 disambiguation: the verb should be Inf, but Apertium renders it as Pl1.

tf4-hsl-m0024:apertium-sme-smn trond$ echo "Giellagáldu čohkke áššedovdiid Anárii golggotmánu 6. beaivve ságastallat, makkár hástalusat leat giellagáhttemis."|smedis
... pos disambiguating ...
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
***** LEXICON LOOK-UP *****
LOOKUP STATISTICS (success with different strategies):
strategy 0: 14 times (100.00 %)
not found: 0 times (0.00 %)
corpus size: 14 words
execution time: 0 sec
speed: 14 words/sec
***** END OF LEXICON LOOK-UP *****
"<Giellagáldu>"
    "giella#gáldu" N <sme> Sem/Dummytag Sg Nom
"<čohkke>"
    "čohkket" V <sme> TV Ind Prs Sg3 @+FMAINV
"<áššedovdiid>"
    "áššedovdi" N <sme> NomAg Sem/Hum Pl Acc
"<Anárii>"
    "Anár" N <sme> Prop Sem/Sur Sg Ill
"<golggotmánu>"
    "golggotmánnu" N <sme> Sem/Time Sg Gen
"<6.>"
    "6" A <sme> Ord Attr
"<beaivve>"
    "beaivi" N <sme> Sem/Time Sg Gen Allegro
"<ságastallat>"
    "ságastallat" V <sme> TV Inf @-FMAINV
"<,>"
    "," CLB
"<makkár>"
    "makkár" Pron <sme> Interr Attr
"<hástalusat>"
    "hástalus" N <sme> Sem/Semcon Pl Nom
"<leat>"
    "leat" V <sme> IV Ind Prs Pl3 @+FMAINV
"<giellagáhttemis>"
    "giellagáhtten" N <sme> Sem/Act Sg Loc
"<.>"
    "." CLB

tf4-hsl-m0024:apertium-sme-smn trond$ echo "Giellagáldu čohkke áššedovdiid Anárii golggotmánu 6. beaivve ságastallat, makkár hástalusat leat giellagáhttemis."|apertium -d. sme-smn-disam|grep -v '^;'
"<Giellagáldu>"
    "gáldu" n sem_dummytag sg nom SELECT:10021:NDr2334
        "giella" n cmp_sgnom cmp
    "gáldu" n sg nom SELECT:10021:NDr2334
        "giella" n cmp_sgnom cmp
    "gáldu" n sem_dummytag sg nom SELECT:10021:NDr2334
        "giella" n sem_lang_tool cmp_sgnom cmp
    "gáldu" n sg nom SELECT:10021:NDr2334
        "giella" n sem_lang_tool cmp_sgnom cmp
"<čohkke>"
    "čohkket" vblex tv indic pres p3 sg @+FMAINV SELECT:3859:VFINSg3TEST MAP:7712:+FMAINV
"<áššedovdiid>"
    "áššedovdi" n nomag pl acc SELECT:9082:AccTV1
    "áššedovdi" n nomag sem_hum pl acc SELECT:9082:AccTV1
"<Anárii>"
    "Anár" np top sg ill SELECT:12007:cleanSemClass
"<golggotmánu>"
    "golggotmánnu" n sem_time sg gen SELECT:8866:timeADVL4
    "golggotmánnu" n sg gen SELECT:8866:timeADVL4
"<6>"
    "6" num arab sg nom
    "6" num arab sg gen
"<.>"
    "." sent
"<beaivve>"
    "beaivi" n sem_time sg gen
    "beaivi" n sg gen
"<ságastallat>"
    "ságastallat" vblex tv indic pres p1 pl @+FMAINV SELECT:6726:vfin MAP:7712:+FMAINV
"<,>"
    "," cm
"<makkár>"
    "makkár" prn itg attr SELECT:6214:DemAttr
"<hástalusat>"
    "hástalus" n pl nom
    "hástalus" n sem_semcon pl nom
"<leat>"
    "leat" vblex iv indic pres p3 pl @+FMAINV SELECT:6894:r2974 MAP:7707:+FMAINVCop
"<giellagáhttemis>"
    "gáhttet" ex_vblex tv der_nomact n sg loc
        "giella" n cmp_sgnom cmp
    "gáhttet" ex_vblex tv der_nomact n sg loc
        "giella" n sem_lang_tool cmp_sgnom cmp
"<.>"
    "." sent
"<.>"
    "." sent
The only way I know to make this work without the Perl-based preprocessor (which I believe is not possible or acceptable in an Apertium context) is to use the new hfst-tokenise tool plus an extra CG file for disambiguating tokenisation ambiguities (6. vs 6 + . in this case). That is what we do in the grammar checker, and what is planned as the replacement for preprocess + lookup2cg. This setup does introduce more dependencies (hfst-tokenise) and a new formalism for Apertium (the pmatch formalism, which is an extension of the Xerox fst formalism), which may or may not be problematic. I have added Fran to the CC list to get further comments.
Strange, because in sme-sma we have the opposite problem: the Ord analysis is chosen too often. First the same sentence that Trond referred to, then a sentence where we do not want Ord.

apertium-sme-sma$ echo 'Giellagáldu čohkke áššedovdiid Anárii golggotmánu 6. beaivve ságastallat' | apertium -d. sme-sma-disam
"<Giellagáldu>"
    "gáldu" n sem_dummytag sg nom SELECT:10034:NDr2334
        "giella" n cmp_sgnom cmp
    "gáldu" n sg nom SELECT:10034:NDr2334
        "giella" n cmp_sgnom cmp
    "gáldu" n sem_dummytag sg nom SELECT:10034:NDr2334
        "giella" n sem_lang_tool cmp_sgnom cmp
    "gáldu" n sg nom SELECT:10034:NDr2334
        "giella" n sem_lang_tool cmp_sgnom cmp
;   "gáldu" n sem_dummytag sg acc SELECT:10034:NDr2334
;       "giella" n cmp_sgnom cmp
;   "gáldu" n sem_dummytag sg gen SELECT:10034:NDr2334
;       "giella" n cmp_sgnom cmp
;   "gáldu" n sg acc SELECT:10034:NDr2334
;       "giella" n cmp_sgnom cmp
;   "gáldu" n sg gen SELECT:10034:NDr2334
;       "giella" n cmp_sgnom cmp
;   "gáldu" n sem_dummytag sg acc SELECT:10034:NDr2334
;       "giella" n sem_lang_tool cmp_sgnom cmp
;   "gáldu" n sem_dummytag sg gen SELECT:10034:NDr2334
;       "giella" n sem_lang_tool cmp_sgnom cmp
;   "gáldu" n sg acc SELECT:10034:NDr2334
;       "giella" n sem_lang_tool cmp_sgnom cmp
;   "gáldu" n sg gen SELECT:10034:NDr2334
;       "giella" n sem_lang_tool cmp_sgnom cmp
"<čohkke>"
    "čohkket" vblex tv indic pres p3 sg @+FMAINV MAP:7733:+FMAINVC
;   "čohkket" vblex tv imp p2 sg REMOVE:3821:TESTImprt
;   "čohkket" vblex tv imp conneg REMOVE:4409:muhtoNotConNeg
;   "čohkket" vblex tv indic pres conneg REMOVE:4409:muhtoNotConNeg
;   "čohkket" vblex tv vgen REMOVE:7059:KillAllVGen
"<áššedovdiid>"
    "dovdat" ex_vblex tv der_nomag n pl acc SELECT:9092:AccTV1
        "ášši" n g3 sem_semcon cmp_sgnom cmp
    "dovdat" ex_vblex tv der_nomag n pl acc SELECT:9092:AccTV1
        "ášši" n g3 cmp_sgnom cmp
;   "dovdat" ex_vblex tv der_nomag n pl gen SELECT:9092:AccTV1
;       "ášši" n g3 cmp_sggen cmp
;   "dovdat" ex_vblex tv der_nomag n pl gen SELECT:9092:AccTV1
;       "ášši" n g3 cmp_sgnom cmp
;   "dovdat" ex_vblex tv der_nomag n pl gen SELECT:9092:AccTV1
;       "ášši" n g3 sem_semcon cmp_sggen cmp
;   "dovdat" ex_vblex tv der_nomag n pl gen SELECT:9092:AccTV1
;       "ášši" n g3 sem_semcon cmp_sgnom cmp
;   "dovdat" ex_vblex tv der_nomag n pl acc SELECT:9092:AccTV1
;       "ášši" n g3 cmp_sggen cmp REMOVE:12049
;   "dovdat" ex_vblex tv der_nomag n pl acc SELECT:9092:AccTV1
;       "ášši" n g3 sem_semcon cmp_sggen cmp REMOVE:12049
"<Anárii>"
    "Anár" np sem_sur sg ill SELECT:2109:PlcSur4
;   "Anár" np sg ill SELECT:2109:PlcSur4
;   "Anár" np top sg ill SELECT:2109:PlcSur4
"<golggotmánu>"
    "golggotmánnu" n sem_time sg acc
    "golggotmánnu" n sg acc
;   "golggotmánnu" n sem_time sg gen REMOVE:8138:SEMTr
;   "golggotmánnu" n sg gen REMOVE:8138:SEMTr
"<6.>"
    "6" adj ord attr
"<beaivve>"
    "beaivi" n sem_time sg gen
    "beaivi" n sg gen
"<ságastallat>"
    "ságastallat" vblex tv inf @-FMAINV MAP:7894:-FMAINVInf SELECT:8000:killifVinCohort
;   "ságastit" ex_vblex tv der_alla vblex indic pres p1 pl REMOVE:2205:derV
;   "ságastit" ex_vblex tv der_alla vblex inf REMOVE:2205:derV
;   "ságastallat" vblex tv indic pres p1 pl @X MAP:7974:realverbX SELECT:8000:killifVinCohort
"<.>"
    "." sent

apertium-sme-sma$ echo Davvi Girji 2007. | apertium -d. sme-sma-disam
"<Davvi Girji>"
    "Davvi Girji" np sem_org sg nom
    "Davvi Girji" np sg nom
;   "Davvi Girji" np attr REMOVE:2405:PropAttrIfPropx
;   "Davvi Girji" np sem_org attr REMOVE:2405:PropAttrIfPropx
"<2007.>"
    "2007" adj ord attr
"<.>"
    "." sent
This is completely possible to do in Apertium by having both the entries and then disambiguating them in CG.

LEXICON Ambigs

%<num%>%+.%<sent%>:. # ;
%<num%>%<ord%>:. # ;

LEXICON Nums

< [ %0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]+ > Ambigs ;

Then you will get two readings:

^6./6<num>+.<sent>/6<num><ord>$

And you can write a CG rule to select ord if you have a noun after.
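The "keep both readings, then select" idea in this comment can be illustrated with a small toy model. This is only a sketch in Python, not the CG rule itself: the tag names (num, ord, sent) come from the thread, while the function names, the data layout, and the hand-written noun set are assumptions made for illustration and are not Apertium or vislcg3 APIs.

```python
# Toy model of "emit both readings, disambiguate afterwards".
# Everything here is hypothetical illustration code, not an Apertium API.

def analyse(token):
    """Return candidate analyses; each analysis is a list of (lemma, tags) parts."""
    if token.endswith(".") and token[:-1].isdigit():
        number = token[:-1]
        return [
            [(number, ("num", "ord"))],              # ordinal: "6." kept as one unit
            [(number, ("num",)), (".", ("sent",))],  # cardinal followed by sentence end
        ]
    return [[(token, ("other",))]]


def is_ordinal_reading(analysis):
    """True if the analysis is the single-part ordinal reading."""
    return len(analysis) == 1 and "ord" in analysis[0][1]


def disambiguate(tokens, nouns):
    """Keep the ordinal reading only when the next token is a known noun."""
    result = []
    for i, token in enumerate(tokens):
        readings = analyse(token)
        if len(readings) > 1:
            next_is_noun = i + 1 < len(tokens) and tokens[i + 1] in nouns
            readings = [r for r in readings if is_ordinal_reading(r) == next_is_noun]
        result.append((token, readings[0]))
    return result


if __name__ == "__main__":
    nouns = {"beaivve"}  # assumed mini-lexicon, just for the demo
    for token, reading in disambiguate(["golggotmánu", "6.", "beaivve"], nouns):
        print(token, reading)
```

In this model "6." before "beaivve" keeps the ordinal reading, while a sentence-final "2007." falls back to number plus sentence boundary, which is the behaviour the two examples earlier in the thread ask for.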
Yeah, hfst-tokenise is not ready for apertium use yet – it doesn't handle superblank-skipping at all. The robust solution for now is to just put both entries in the lexc and deal with it in a CG pre-disambiguation step – this has worked for other apertium pairs. If we ever change to hfst-tokenise in apertium-sme-*, we'll still need such a CG file. (One thing you don't get without hfst-tokenise is the correct word form on each part of a set-of-multiword-readings, but that's thrown away anyway later in the apertium pipeline.)
Since this issue needs to be solved with a combination of lexc and vislcg3, I am sending it back to Trond. The solution also needs to work for the grammar checker and for the hfst-tokenise-based analysis + tokenisation. Hm, perhaps this is something for Kevin to look into?
It would be useful to know why sme-sma and sme-smn behave so differently, as in our examples here. Does anyone have comments on that?
I am sending this one to Francis, to see whether he can find out what is going on.
Note: the issue is that numerals are tagged as adjectives (adj) in the smj and sma Apertium setups, but not in smn. Francis will fix this.
(In reply to Francis Tyers from comment #3)
> This is completely possible to do in Apertium by having both the entries and
> then disambiguating them in CG.
>
> LEXICON Ambigs
>
> %<num%>%+.%<sent%>:. # ;
> %<num%>%<ord%>:. # ;
>
> LEXICON Nums
>
> < [ %0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ]+ > Ambigs ;
>
> Then you will get two readings:
>
> ^6./6<num>+.<sent>/6<num><ord>$
>
> And you can write a CG rule to select ord if you have a noun after.

I have tried two methods:

1) Adding an extra cohort with vislcg3:

COPY:Apertium (Num Arab Sg Nom addsent) EXCEPT (A Ord Attr) TARGET (ord) ;
COPY:Apertium (Num Sem/Year Sg Nom) EXCEPT (A Ord Attr) TARGET (ord) ;
REMOVE:TestNum Ord IF (1 ("[a-z].*"r)) ;
ADDCOHORT ("<.>" "." sent) AFTER (Num Arab Sg Nom addsent) ;

This works, except that vislcg3 sorts the tags alphabetically, so num is no longer the first tag, and that prevents generation.

"<2008.>"
    "2008" sg nom num sem_year COPY:2211:Apertium
    "2008" sg nom addsent num arab COPY:2210:Apertium ADDCOHORT-AFTER:2214
;   "2008" adj ord attr REMOVE:2212:TestNum

echo nu dat lea 2008. ja maŋŋel boahtit. | apertium -d. sme-smn
nuuvt tot lii @2008 .já maŋa puáttip.

In addition, the whitespace comes out wrong.

2) Added to lexc:

+Num+Arab+Sg+Nom#%.%<sent%>:%. # ; !for Apertium

and a declaration for %<sent%> in root.

$HLOOKUP $GTHOME/langs/sme/src/analyser-disamb-gt-desc.hfstol
2008.
2008.    2008+A+Ord+Attr    0,000000
2008.    2008+Num+Arab+Sg+Nom#.<sent>    0,000000

But this extra string is not carried over to Apertium.

echo nu dat lea 2008. ja maŋŋel boahtit. | apertium -d. sme-smn-disam
"<2008.>"
    "2008" adj ord attr

What have I forgotten to do? (The entry is commented out for check-in.)
I am testing with
+Num+Arab+Sg+Nom#%.+sent:%. # ;
instead of
+Num+Arab+Sg+Nom#%.%<sent%>:%. # ; !for Apertium
(In reply to Lene Antonsen from comment #10)
> I am testing with
> +Num+Arab+Sg+Nom#%.+sent:%. # ;
> instead of
> +Num+Arab+Sg+Nom#%.%<sent%>:%. # ; !for Apertium

I could not get this to work either. I am opening a new bug to discuss the possibility of restricting the +Ord analysis to numbers below 1000, or possibly below 101.
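To make the proposed restriction of the +Ord analysis concrete, here is a minimal Python sketch of the check. It is an illustration only: the helper name allows_ordinal and its limit parameter are hypothetical, and the real restriction would have to be expressed in lexc or CG rather than in Python. The two limits tried are the ones mentioned in this comment (1000 and 101).

```python
# Hypothetical helper illustrating the proposed restriction; not part of
# any existing lexc, CG or Apertium code.

def allows_ordinal(token, limit=1000):
    """Return True if a 'digits + full stop' token may get an ordinal reading.

    Only numbers strictly below `limit` qualify, so year-like numbers such
    as "2007." are excluded when the limit is 1000.
    """
    if not token.endswith("."):
        return False
    digits = token[:-1]
    # isascii() guards int() against non-ASCII digit characters.
    if not (digits.isascii() and digits.isdigit()):
        return False
    return int(digits) < limit


if __name__ == "__main__":
    for tok in ["6.", "2007.", "100.", "101."]:
        print(tok, allows_ordinal(tok), allows_ordinal(tok, limit=101))
```

With the limit at 1000, "6." in "golggotmánu 6. beaivve" would still be eligible for the ordinal reading, while "2007." in "Davvi Girji 2007." would not.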