Bug 2376 - Problematic tokenisation of unrecognised compounds with hyphen
Status: RESOLVED FIXED
Alias: None
Product: Pre- and postprocessing
Classification: Unclassified
Component: hfst-preprocess
Version: unspecified
Hardware: All All
Importance: P4 - Within a month normal
Assignee: Sjur Nørstebø Moshagen
Reported: 2017-04-11 09:04 CEST by Linda Wiechetek
Modified: 2017-09-15 07:48 CEST

Description Linda Wiechetek 2017-04-11 09:04:37 CEST
In the following sentence, the hyphen is analyzed as a separate token in erroneous compounds, i.e. compounds whose first component is a noun in the nominative plural.

Diibmá gudnejahtte bálkkašumiin Dievdduid-searvvi , dán jagi ádde bálkkašumi Guovdageainnieiddat-spábbačiekčanjovkui .

Guovdageainnieiddat-spábbačiekčanjovkui is analyzed as:

"<Guovdageainnieiddat>"
        "nieida" N Sem/Hum Pl Nom <W:10> @<SUBJ MAP:17304 #10->10
                "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:10> #10->10
;       "nieida" N Sem/Hum Sg Acc PxSg2 <W:10> REMOVE:12781:r2534
;               "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:10>
;       "nieida" N Sem/Hum Sg Gen PxSg2 <W:10> REMOVE:15381:r3235
;               "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:10>
"<->"
        "-" PUNCT <W:0> #11->11
"<spábbačiekčanjovkui>"
        "spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill <W:0> #12->12
;       "čiekčanjoavku" N Sem/Group_Hum Sg Ill <W:10>
;               "spábba" N Sem/Obj-play Cmp/SgNom Cmp <W:10> REMOVE:2072:longest-match
;       "joavku" N Sem/Group_Hum_Plc Sg Ill <W:10>
;               "spábbačiekčan" N Sem/Act Cmp/SgNom Cmp <W:10> REMOVE:2072:longest-match
: 


This causes two problems:
1. Guovdageainnieiddat-spábbačiekčanjovkui is not recognized as a compound error
2. I cannot match the arguments to their verbal governors, because a "-" token is often a sentence barrier, unlike a "-" within a compound
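
A minimal sketch of how such split compounds could be located in CG-format output. It assumes that lines starting with ":" carry the whitespace between tokens (as the ": " line at the end of the output above suggests), so that a "-" cohort with no such line on either side was written glued to its neighbours. The function names and the sample text are made up for illustration and are not part of any existing tool.

```python
import re

# Cohort header such as "<spábbačiekčanjovkui>"; captures the word form.
COHORT = re.compile(r'^"<(.*)>"\s*$')

def tokens(cg_text):
    """Yield ("word", form) for cohort headers and ("blank", text) for ':' lines."""
    for line in cg_text.splitlines():
        m = COHORT.match(line)
        if m:
            yield ("word", m.group(1))
        elif line.startswith(":"):
            yield ("blank", line[1:])
        # Reading lines (indented, or starting with ';') are ignored here.

def glued_hyphens(cg_text):
    """Return (left, right) pairs where a "-" cohort directly touches two word cohorts."""
    toks = list(tokens(cg_text))
    hits = []
    for i in range(1, len(toks) - 1):
        prev, cur, nxt = toks[i - 1], toks[i], toks[i + 1]
        if cur == ("word", "-") and prev[0] == "word" and nxt[0] == "word":
            hits.append((prev[1], nxt[1]))
    return hits

# Shortened version of the analysis quoted above (readings trimmed).
SAMPLE = "\n".join([
    '"<Guovdageainnieiddat>"',
    '\t"nieida" N Sem/Hum Pl Nom <W:10>',
    '"<->"',
    '\t"-" PUNCT <W:0>',
    '"<spábbačiekčanjovkui>"',
    '\t"spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill <W:0>',
    ': ',
])

print(glued_hyphens(SAMPLE))
```

Anything this reports is only a candidate; deciding whether it really is a compound error still needs the analyses of both neighbours.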
Comment 1 Kevin Brubeck Unhammer 2017-04-11 09:16:25 CEST
Isn't this "just" a matter of having a transition with hyphen back into nouns, just like we do with non-space transition into nouns?

At first glance at least, it doesn't seem like we need any of the special new machinery to handle this case; the tokenisation should be unambiguous for any such sequence 'N.CmpHyphen N.Ill', or 'N.Cmp N.CmpHyphen N.Cmp N.Ill' etc. (and would be possible to override for exceptions by lexicalising).
Comment 2 Linda Wiechetek 2017-04-11 10:29:44 CEST
I'm afraid I don't get what you are saying, Kevin. Could you be more explicit, maybe by means of an example?
Comment 3 Sjur Nørstebø Moshagen 2017-04-11 10:38:46 CEST
I agree with Kevin. I suggest this (Trond and I talked a bit about it):

From lexicon K, continue not only to # but also to Nouns, with a separate tag
{{+Cmp/Cit}}.

All words and word forms can form such citation compounds (with a hyphen), cf. "go-gážaldat".
Comment 4 Kevin Brubeck Unhammer 2017-04-11 11:03:07 CEST
(In reply to Kevin Brubeck Unhammer from comment #1)
> Isn't this "just" a matter of having a transition with hyphen back into
> nouns, just like we do with non-space transition into nouns?

So, skipping the details with vowel reduction and suffixes and such, this means having things like

LEXICON GOAHTI 
 +N+Cmp/PlNom+Cmp/Hyph+Cmp:%- R ;
 +N+Sg+Nom: K ;
 […]

in addition to the regular "+N+Cmp/SgNom+Cmp/Hyph+Cmp:%- R;" that we already have … somewhere in compounding.lexc+nouns.lexc … you probably know better than me :)


In any case, adding such a path should give the same result as we already get with SgNom:

$ echo Guovdageainnieida-spábbačiekčanjovkui | hfst-tokenise -g tokeniser-gramcheck-gt-desc.pmhfst 
"<Guovdageainnieida-spábbačiekčanjovkui>"
        "spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill <W:20>
                "nieida" N Sem/Hum Cmp/SgNom Cmp/Hyph Cmp <W:20>
                        "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:20>
        "čiekčanjoavku" N Sem/Group_Hum Sg Ill <W:30>
                "spábba" N Sem/Obj-play Cmp/SgNom Cmp <W:30>
                        "nieida" N Sem/Hum Cmp/SgNom Cmp/Hyph Cmp <W:30>
                                "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:30>
        "joavku" N Sem/Group_Hum_Plc Sg Ill <W:30>
                "spábbačiekčan" N Sem/Act Cmp/SgNom Cmp <W:30>
                        "nieida" N Sem/Hum Cmp/SgNom Cmp/Hyph Cmp <W:30>
                                "Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp <W:30>
:\n


And it doesn't require the new machinery; it also works with plain lookup:

$ echo Guovdageainnieida-spábbačiekčanjovkui |hfst-lookup -q -b0 analyser-disamb-gt-desc.hfstol 
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
Guovdageainnieida-spábbačiekčanjovkui   Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill       20,000000
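
The six lines above are identical, so for further processing it may help to collapse them and split the analysis string into its compound parts. A minimal sketch, assuming whitespace-separated columns (so no multi-word surface forms), "#" between compound members and "+" between tags as in the output above, and a weight printed with a decimal comma. The function names are illustrative and not part of hfst-lookup.

```python
def parse_lookup_line(line):
    """Split one lookup output line into (surface, compound parts, weight)."""
    surface, analysis, weight = line.split()
    parts = []
    for member in analysis.split("#"):       # '#' separates compound members
        lemma, *tags = member.split("+")     # first field is the lemma, rest are tags
        parts.append((lemma, tags))
    # The weight above is printed as "20,000000", i.e. with a decimal comma.
    return surface, parts, float(weight.replace(",", "."))

def unique_lines(lines):
    """Drop repeated output lines while keeping first-seen order."""
    return list(dict.fromkeys(line.strip() for line in lines if line.strip()))

# One of the lookup lines quoted above.
LINE = ("Guovdageainnieida-spábbačiekčanjovkui\t"
        "Guovdageaidnu+N+Prop+Err/Orth+Sem/Dummytag+Cmp"
        "#nieida+N+Sem/Hum+Cmp/SgNom+Cmp/Hyph+Cmp"
        "#spábbačiekčanjoavku+N+Sem/Group_Hum+Sg+Ill\t20,000000")

for raw in unique_lines([LINE] * 6):
    surface, parts, weight = parse_lookup_line(raw)
    print(surface, weight)
    for lemma, tags in parts:
        print("  ", lemma, tags)
```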
Comment 5 Linda Wiechetek 2017-04-20 12:37:08 CEST
Looks great! Let's do it, or is it fixed already?
Comment 6 Sjur Nørstebø Moshagen 2017-09-13 12:31:55 CEST
I think we need one modification of Kevin's suggestion: in principle any word form of any POS can be compounded like this:

ja-sátni

i.e. when discussing word forms, language and grammar in general. On the other hand, the second part of the compound is probably pretty limited, so I think it is enough that we only compound with regular nouns.
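
To make the proposed asymmetry concrete (any word form as first part, only regular nouns as second part), here is a toy sketch. The two dictionaries stand in for the real analysers, their entries are illustrative only, and the function name is hypothetical. The tag sequence Cmp/Cit, Cmp/Hyph, Cmp follows the proposal in this report.

```python
# Toy stand-ins for the real analysers; entries are illustrative only.
ANY_WORD_FORM = {
    "ja": ["ja+CC"],
    "sátni": ["sátni+N+Sg+Nom"],
}
REGULAR_NOUN_FORM = {
    "sátni": ["sátni+N+Sg+Nom"],
}

def citation_compound_analyses(surface):
    """Analyse 'first-second': first may be any word form, second must be a regular noun."""
    first, hyphen, second = surface.partition("-")
    if not hyphen:
        return []  # no hyphen, so not a citation compound
    return [
        f"{a}+Cmp/Cit+Cmp/Hyph+Cmp#{b}"
        for a in ANY_WORD_FORM.get(first, [])
        for b in REGULAR_NOUN_FORM.get(second, [])
    ]

print(citation_compound_analyses("ja-sátni"))   # accepted: the second part is a noun
print(citation_compound_analyses("sátni-ja"))   # rejected: "ja" is not a regular noun
```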
Comment 7 Sjur Nørstebø Moshagen 2017-09-14 19:47:52 CEST
This should be solved in svn rev 157051 and 157055.

NB! Test thoroughly! I analysed the whole of GTFREE, and could not find a single instance of this compound type, but the word reported in this bug report does now get an additional analysis:

$ echo Guovdageainnieiddat-spábbačiekčanjovkui | hfst-tokenise -g tools/tokenisers/tokeniser-gramcheck-gt-desc.pmhfst
"<Guovdageainnieiddat-spábbačiekčanjovkui>"
	"spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill
		"nieida" N Sem/Hum Pl Nom Cmp/Cit Cmp/Hyph Cmp
			"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill
		"nieida" N Sem/Hum Sg Acc PxSg2 Cmp/Cit Cmp/Hyph Cmp
			"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"spábbačiekčanjoavku" N Sem/Group_Hum Sg Ill
		"nieida" N Sem/Hum Sg Gen PxSg2 Cmp/Cit Cmp/Hyph Cmp
			"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"čiekčanjoavku" N Sem/Group_Hum Sg Ill
		"spábba" N Sem/Obj-play Cmp/SgNom Cmp
			"nieida" N Sem/Hum Pl Nom Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"joavku" N Sem/Group_Hum_Plc Sg Ill
		"spábbačiekčan" N Sem/Act Cmp/SgNom Cmp
			"nieida" N Sem/Hum Pl Nom Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"čiekčanjoavku" N Sem/Group_Hum Sg Ill
		"spábba" N Sem/Obj-play Cmp/SgNom Cmp
			"nieida" N Sem/Hum Sg Acc PxSg2 Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"joavku" N Sem/Group_Hum_Plc Sg Ill
		"spábbačiekčan" N Sem/Act Cmp/SgNom Cmp
			"nieida" N Sem/Hum Sg Acc PxSg2 Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"čiekčanjoavku" N Sem/Group_Hum Sg Ill
		"spábba" N Sem/Obj-play Cmp/SgNom Cmp
			"nieida" N Sem/Hum Sg Gen PxSg2 Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp
	"joavku" N Sem/Group_Hum_Plc Sg Ill
		"spábbačiekčan" N Sem/Act Cmp/SgNom Cmp
			"nieida" N Sem/Hum Sg Gen PxSg2 Cmp/Cit Cmp/Hyph Cmp
				"Guovdageaidnu" N Prop Err/Orth Sem/Dummytag Cmp

Is this what you are looking for?
Comment 8 Sjur Nørstebø Moshagen 2017-09-15 07:48:28 CEST
(In reply to Sjur Nørstebø Moshagen from comment #7)
> This should be solved in svn rev 157051 and 157055.
> 
> NB! Test thoroughly! I analysed the whole of GTFREE, and could not find a
> single instance of this compound type,

I searched for the wrong tag :/

There are a number of new compound analyses, most of them in the form of additional analyses in the cohort. There are a few instances of analyses of words that earlier got no analysis at all:

"<Čálagieđat-semináras>"
        "seminára" N Sem/Edu_Event Sg Acc PxSg3
                "giehta" N Sem/Body Pl Nom Cmp/Cit Cmp/Hyph Cmp
                        "čála" N Sem/Txt Cmp/SgNom Cmp

which to me seems to be exactly the thing we want.

Another such example:

"<Gamas-váriid>"
        "váre" N Sem/Hum Err/Orth Pl Acc
                "Gama" N Prop Sem/Plc Sg Loc Cmp/Cit Cmp/Hyph Cmp
[...]
        "várri" N Sem/Plc-elevate Err/Orth Pl Acc
                "Gama" N Prop Sem/Plc Sg Loc Cmp/Cit Cmp/Hyph Cmp

In GTFREE I found some cases of cuoŋománu-miessemánu and similar expressions, where all non-Cmp/Cit analyses contained Err/Orth. This means that one needs to be aware of cases where Cmp/Cit can cover erroneous spelling:

"<cuoŋománu-miessemánu>"
        "miessemánnu" N Sem/Time Sg Acc
                "cuoŋománnu" N Sem/Time Cmp/SgGen Err/Orth Cmp/Hyph Cmp
        "miessemánnu" N Sem/Time Sg Gen
                "cuoŋománnu" N Sem/Time Cmp/SgGen Err/Orth Cmp/Hyph Cmp
        "miessemánnu" N Sem/Time Sg Acc
                "cuoŋománnu" N Sem/Time Sg Acc Cmp/Cit Cmp/Hyph Cmp
        "miessemánnu" N Sem/Time Sg Gen
                "cuoŋománnu" N Sem/Time Sg Acc Cmp/Cit Cmp/Hyph Cmp

Another case of Err/Orth vs Cmp/Cit:

"<Ubme-eatnu>"
        "eatnu" N Sem/Plc Sg Nom
                "Ubmi" N Prop Sem/Plc Cmp/SgNom Err/Orth Cmp/Hyph Cmp
        "eatnu" N Sem/Plc Sg Nom
                "Ubmi" N Prop Sem/Plc Sg Gen Allegro Cmp/Cit Cmp/Hyph Cmp

I am closing this as fixed; reopen if there are any issues with how it was done. New bugs caused by this change should get their own Bugzilla report.
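
For the Err/Orth vs Cmp/Cit cases shown above (cuoŋománu-miessemánu, Ubme-eatnu), a minimal sketch of how such cohorts could be flagged in CG-format output. It assumes that sub-readings are indented deeper than the first reading of their cohort, as in the examples in this report. It flags a cohort when it has at least one Cmp/Cit reading and every other reading carries Err/Orth. The function names are illustrative and not part of any existing tool.

```python
import re

HEADER = re.compile(r'^"<(.*)>"\s*$')           # cohort header: "<word form>"
READING = re.compile(r'^(\s+)"([^"]*)"\s*(.*)$')  # indented "lemma" followed by tags

def parse_cohorts(cg_text):
    """Parse CG text into [(word form, [tag set per reading])].

    Sub-readings (indented deeper than the cohort's first reading) are
    merged into the reading they belong to.
    """
    cohorts = []
    base_indent = None
    for line in cg_text.splitlines():
        header = HEADER.match(line)
        if header:
            cohorts.append((header.group(1), []))
            base_indent = None
            continue
        reading = READING.match(line)
        if not reading or not cohorts:
            continue  # skips ';' (removed) lines, ':' blanks and anything else
        indent, tags = len(reading.group(1)), set(reading.group(3).split())
        if base_indent is None:
            base_indent = indent
        if indent <= base_indent:
            cohorts[-1][1].append(tags)      # a new top-level reading
        else:
            cohorts[-1][1][-1] |= tags       # a sub-reading: merge its tags

    return cohorts

def cit_may_mask_error(readings):
    """True if Cmp/Cit readings exist and all remaining readings are Err/Orth."""
    cit = [r for r in readings if "Cmp/Cit" in r]
    other = [r for r in readings if "Cmp/Cit" not in r]
    return bool(cit) and bool(other) and all("Err/Orth" in r for r in other)

# Two of the cohorts quoted above.
SAMPLE = '''"<Ubme-eatnu>"
        "eatnu" N Sem/Plc Sg Nom
                "Ubmi" N Prop Sem/Plc Cmp/SgNom Err/Orth Cmp/Hyph Cmp
        "eatnu" N Sem/Plc Sg Nom
                "Ubmi" N Prop Sem/Plc Sg Gen Allegro Cmp/Cit Cmp/Hyph Cmp
"<Čálagieđat-semináras>"
        "seminára" N Sem/Edu_Event Sg Acc PxSg3
                "giehta" N Sem/Body Pl Nom Cmp/Cit Cmp/Hyph Cmp
                        "čála" N Sem/Txt Cmp/SgNom Cmp
'''

flagged = [form for form, readings in parse_cohorts(SAMPLE) if cit_may_mask_error(readings)]
print(flagged)
```

Cohorts whose only analyses are Cmp/Cit ones, like Čálagieđat-semináras, are deliberately not flagged, since those are the words that previously got no analysis at all.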