Bug 2455 - Ávvir: CG-inserted tags mess up typo generation for e.g. "sátnejodiheaddji"
Status: RESOLVED FIXED
Alias: None
Product: Grammar checkers
Classification: Unclassified
Component: Linguistic issues
Version: unspecified
Hardware: Macintosh Linux
Importance: P5 - Later enhancement
Assignee: Linda Wiechetek
Reported: 2018-04-10 09:25 CEST by Kevin Brubeck Unhammer
Modified: 2019-08-20 14:25 CEST

Description Kevin Brubeck Unhammer 2018-04-10 09:25:44 CEST
I just got another report from Ávvir that their speller stopped generating suggestions (but still shows underlines) for certain words. 

We used to get this:

$ echo sátnejodiheaddji | hfst-tokenise -g 'tokeniser-gramcheck-gt-desc.pmhfst'  | vislcg3 -g 'mwe-dis.bin'  | cg-mwesplit  | divvun-cgspell -n 25 -l 'acceptor.default.hfst' -m 'errmodel.default.hfst'  | vislcg3 -g 'spellchecker.bin' | divvun-suggest -g 'generator-gramcheck-gt-norm.hfstol' -m 'errors.xml' -l se
"<sátnejodiheaddji>"
        "jođiheaddji" Err/Orth N NomAg Sem/Hum Sg Acc <W:10.0000000000> &typo
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
typo
        "jođiheaddji" N NomAg Sem/Hum Sg Acc <W:10.0000000000> &typo &SUGGEST
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Acc        sátnejođiheaddji
        "jođiheaddji" Err/Orth N NomAg Sem/Hum Sg Gen <W:10.0000000000> &typo
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
typo
        "jođiheaddji" N NomAg Sem/Hum Sg Gen <W:10.0000000000> &typo &SUGGEST
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Gen        sátnejođiheaddji
        "jođiheaddji" Err/Orth N NomAg Sem/Hum Sg Nom <W:10.0000000000> &typo
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
typo
        "jođiheaddji" N NomAg Sem/Hum Sg Nom <W:10.0000000000> &typo &SUGGEST
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Nom        sátnejođiheaddji
:\n

i.e. it managed to generate "sátnejođiheaddji".

Now we get this:

"<sátnejodiheaddji>"
        "jođiheaddji" Err/Orth N NomGenSg NomAg Sem/Hum Sg Nom <W:0.0000000000> @HNOUN &typo
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0000000000>
typo
        "jođiheaddji" N NomGenSg NomAg Sem/Hum Sg Nom <W:0.0000000000> @HNOUN &typo &SUGGEST
                "sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0000000000>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomGenSg+NomAg+Sg+Nom       ?
:\n

Putting a "|sed 's/ NomGenSg//g'" into the pipeline makes it generate again.


This is the same issue we've had so many times with Apertium – using CG to insert "lexicon tags" will mess up generation. 

Possible fixes: Have NomGenSg in the lexicon instead of SUBSTITUTING it in in the disambiguator, or use one of the tag types that's ignored for generation (e.g. anything with <> around the tag is treated as CG-inserted – if it were called <NomGenSg> then divvun-suggest would skip it when trying to generate), or, worst-case, have an optional arc with NomGenSg for that word in the generator. It looks to me like something that should've been in the analyser in the first place, but maybe there's a good reason it wasn't.
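
To illustrate why a bare NomGenSg breaks generation while <NomGenSg> would not, here is a minimal Python sketch of how a generator input string could be built from a reading and its sub-reading. It is not divvun-suggest's actual code: the names `is_lexicon_tag` and `generator_input` are hypothetical, and the list of skipped prefixes is taken from the tag types mentioned in this report (@, <, Sem/, & and §).

```python
# Prefixes of tags treated as CG-inserted rather than lexicon tags.
# Assumption: this list follows the prefixes named in this bug report.
NON_LEXICON_PREFIXES = ("@", "<", "Sem/", "&", "§")


def is_lexicon_tag(tag):
    """Return True if the tag is expected to exist in the generator."""
    return not tag.startswith(NON_LEXICON_PREFIXES)


def generator_input(readings):
    """Build a generator input string from a reading and its sub-readings.

    `readings` is a list of (lemma, tags) pairs, main reading first, in the
    order they appear in the CG stream. Compound parts are joined with '#',
    innermost sub-reading first.
    """
    parts = []
    for lemma, tags in reversed(readings):
        kept = [t for t in tags if is_lexicon_tag(t)]
        parts.append("+".join([lemma] + kept))
    return "#".join(parts)


sub = ("sátni", ["N", "Sem/Cat", "Cmp/SgNom", "Cmp", "<W:0.0000000000>"])
bare = ("jođiheaddji", ["N", "NomGenSg", "NomAg", "Sem/Hum", "Sg", "Nom",
                        "<W:0.0000000000>", "@HNOUN", "&typo", "&SUGGEST"])
bracketed = ("jođiheaddji", ["N", "<NomGenSg>", "NomAg", "Sem/Hum", "Sg", "Nom",
                             "<W:0.0000000000>", "@HNOUN", "&typo", "&SUGGEST"])

# The bare tag leaks into the string sent to the generator ...
print(generator_input([bare, sub]))
# ... while the bracketed variant is skipped like any other CG-inserted tag.
print(generator_input([bracketed, sub]))
```

The first string contains +NomGenSg, which the generator does not know, so it answers "?"; the second is the same string that generated fine before the SUBSTITUTE rule was added.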

I'll put in a hotfix now for Ávvir at least, and a test in their update-script so they never update if this word doesn't get suggestions.
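
A check of this kind could look roughly like the following Python sketch. It is not the actual update-script test: `failed_generations` is a hypothetical helper, and it assumes the output shape shown in this report, where an unindented line holds the generator input followed by whitespace and either the generated form or "?".

```python
def failed_generations(suggest_output):
    """Return the generator inputs for which no form was generated.

    Assumption: each generation attempt is printed as an unindented line
    '<input with + separators><whitespace><form>', with '?' as the form
    when generation failed.
    """
    failed = []
    for line in suggest_output.splitlines():
        # Skip blank lines, indented readings and "<wordform>" lines.
        if not line or line[0].isspace() or line.startswith('"'):
            continue
        fields = line.rsplit(None, 1)
        if len(fields) == 2 and "+" in fields[0] and fields[1] == "?":
            failed.append(fields[0])
    return failed


broken = "\n".join([
    '"<sátnejodiheaddji>"',
    '\t"jođiheaddji" N NomGenSg NomAg Sem/Hum Sg Nom <W:0.0000000000> @HNOUN &typo &SUGGEST',
    '\t\t"sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0000000000>',
    'sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomGenSg+NomAg+Sg+Nom       ?',
])
working = "\n".join([
    '"<sátnejodiheaddji>"',
    '\t"jođiheaddji" N NomAg Sem/Hum Sg Nom <W:10.0000000000> &typo &SUGGEST',
    '\t\t"sátni" N Sem/Cat Cmp/SgNom Cmp <W:10.0000000000>',
    'sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Nom\tsátnejođiheaddji',
])

print(failed_generations(broken))
print(failed_generations(working))
```

An update script could run the pipeline on a small list of known typos and refuse to update whenever the returned list is non-empty.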
Comment 1 Kevin Brubeck Unhammer 2018-04-10 09:28:55 CEST
Oh, forgot to mention, this is the rule that does it in tools/grammarcheckers/disambiguator.cg3, line 3301 as of -r165388:

SUBSTITUTE N (N NomGenSg) TARGET N IF (0 (N Sg Nom) + $$SAMELEMMA LINK 0 (N Sg Gen) + $$SAMELEMMA)(NEGATE 0 NomGenSg);
Comment 2 Kevin Brubeck Unhammer 2018-04-10 09:40:28 CEST
A few more lines from grammarcheckers/disambiguator.cg3 that change "lexicon tags" (those not starting with @ or < or Sem/ or & or §):

./disambiguator.cg3:3398:SUBSTITUTE:PropAttr (Sg Nom) (Attr) PROP-ATTR + Nom (NOT 0 Attr LINK *1 PROP-SUR BARRIER REAL-WORD-NOT-ABBR OR COMMA)(NEGATE 0 Sem/Sur LINK -1 FIRSTNAME LINK 2 FIRSTNAME LINK 1 Sem/Sur);
./disambiguator.cg3:3399:SUBSTITUTE:Attr (Sg Nom) (Attr) PROP-ATTR + Nom (NEGATE -1 Prop)(1 ("ja") OR ("dahje") OR ("dehe") OR ("dahe") LINK 1 (Prop Attr) LINK *1 (N Prop Sem/Sur));
./disambiguator.cg3:3400:SUBSTITUTE:Attr (Prop Sem/Mal Sg Nom) (Prop Sem/Mal Attr) ("Hearrá") (1 ("Ipmil"));
./disambiguator.cg3:13969:SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"))(NOT 0 TV);

If they don't ever run on anything we do &SUGGEST on, we're safe for now, but we should consider if they can be added to the lexicon so we can keep analysis in the analyser.
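
Rules like these can be found semi-automatically. The following Python sketch flags SUBSTITUTE rules whose replacement list introduces a tag that looks like a lexicon tag. It is a rough regex heuristic, not a CG-3 parser: `inserted_lexicon_tags` is a hypothetical helper, it expects the bare rule text (without the file:line prefix from grep), and it only looks at the first two tag lists of a rule.

```python
import re

# Assumption: same non-lexicon prefixes as named in this report.
NON_LEXICON_PREFIXES = ("@", "<", "Sem/", "&", "§")

# SUBSTITUTE, an optional :name, then the tags to remove and the tags to
# insert, each either a parenthesised list or a single bare tag.
SUBSTITUTE_RE = re.compile(
    r"^\s*SUBSTITUTE(?::\S+)?\s+(\([^)]*\)|\S+)\s+(\([^)]*\)|\S+)"
)


def tags_of(group):
    """Split '(A B)' or 'A' into a list of tags."""
    return group.strip("()").split()


def inserted_lexicon_tags(rule):
    """Return lexicon-looking tags that the rule adds to a reading."""
    match = SUBSTITUTE_RE.match(rule)
    if match is None:
        return []
    removed = set(tags_of(match.group(1)))
    return [
        tag for tag in tags_of(match.group(2))
        if tag not in removed and not tag.startswith(NON_LEXICON_PREFIXES)
    ]


rules = [
    'SUBSTITUTE N (N NomGenSg) TARGET N IF (0 (N Sg Nom) + $$SAMELEMMA LINK 0 (N Sg Gen) + $$SAMELEMMA)(NEGATE 0 NomGenSg);',
    'SUBSTITUTE:TV_IV (V TV) (V IV) FAUXV (0 ("lávet"))(NOT 0 TV);',
    # Hypothetical variant of the first rule using a bracketed tag instead.
    'SUBSTITUTE N (N <NomGenSg>) TARGET N;',
]
for rule in rules:
    print(inserted_lexicon_tags(rule), rule[:40])
```

Any rule for which the list is non-empty is a candidate for breaking generation if it can apply to a reading that later gets &SUGGEST.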
Comment 3 Linda Wiechetek 2018-04-11 13:07:39 CEST
(In reply to Kevin Brubeck Unhammer from comment #0)

I made a secondary tag out of this, i.e. <NomGenSg>, which should solve the problem.
Comment 4 Linda Wiechetek 2018-04-11 13:17:29 CEST
(In reply to Kevin Brubeck Unhammer from comment #2)

This one is harder, because we need the same tags as before, so we cannot turn them into secondary tags.
I have nothing against having them in the lexicon, but this should be discussed with the people who build the descriptive analyser. I think it has been the strategy for quite some time to produce certain analyses through the CG instead of having several entries in the lexicon (e.g. intransitive verbs that can also be transitive). So we need to think this through together.
Comment 5 Sjur Nørstebø Moshagen 2019-08-20 14:25:11 CEST
Things seem to be working again now, at least with the example Kevin gave:

"<sátnejodiheaddji>"
	"jođiheaddji" N <NomGenSg> Err/Orth NomAg Sem/Hum Sg Gen <W:0.0> <cohort-with-dynamic-compound> &typo
		"sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0>
typo
	"jođiheaddji" N <NomGenSg> NomAg Sem/Hum Sg Gen <W:0.0> <cohort-with-dynamic-compound> &typo &SUGGEST
		"sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Gen	sátnejođiheaddji
	"jođiheaddji" N <NomGenSg> Err/Orth NomAg Sem/Hum Sg Nom <W:0.0> <cohort-with-dynamic-compound> &typo
		"sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0>
typo
	"jođiheaddji" N <NomGenSg> NomAg Sem/Hum Sg Nom <W:0.0> <cohort-with-dynamic-compound> &typo &SUGGEST
		"sátni" N Sem/Cat Cmp/SgNom Cmp <W:0.0>
sátni+N+Cmp/SgNom+Cmp#jođiheaddji+N+NomAg+Sg+Nom	sátnejođiheaddji
:\n

I am closing this bug report now.