Bug 2585

Summary: Double space in front of "Eanas oassi" suggest "Eanas oassi"
Product: Grammar checkers
Component: Linguistic issues
Severity: normal
Priority: P5 - Later
Version: unspecified
Hardware: Macintosh
OS: Linux
Reporter: Børre Gaup <borre.gaup>
Assignee: Linda Wiechetek <linda.wiechetek>
CC: linda.wiechetek, sjur.n.moshagen, thomas.omma, trond.trosterud, unhammer+apertium

Description Børre Gaup 2019-05-20 15:58:44 CEST
sme $ echo " olbmui.  Eanas oassi." | divvun-checker -l se -n smegram
{"errs":[["Eanas oassi",10,21,"double-space-before","Leat guokte gaskka ovdal \" oassi\"",["Eanas  oassi"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas oassi."}

If oassi is replaced with guossit, or Eanas with Eanáš, the correct suggestion is given:

sme $ echo " olbmui.  Eanas guossit." | divvun-checker -l se -n smegram
{"errs":[[".  Eanas",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanas\"",[". Eanas"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas guossit."}

sme $ echo " olbmui.  Eanáš oassi." | divvun-checker -l se -n smegram
{"errs":[[".  Eanáš",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanáš\"",[". Eanáš"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanáš oassi."}
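
The difference between the wrong and the correct report can be checked mechanically: in the first output the flagged span does not contain the two spaces at all. The following is a minimal Python sketch, not part of the checker. It assumes that the second and third fields of each error are character offsets into "text", which holds for the outputs above; the helper name span_has_double_space is made up for this illustration.

```python
import json

# Checker output copied verbatim from the first two runs above.
outputs = [
    r'''{"errs":[["Eanas oassi",10,21,"double-space-before","Leat guokte gaskka ovdal \" oassi\"",["Eanas  oassi"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas oassi."}''',
    r'''{"errs":[[".  Eanas",7,15,"double-space-before","Leat guokte gaskka ovdal \"Eanas\"",[". Eanas"],"Sátnegaskameattáhusat"]],"text":" olbmui.  Eanas guossit."}''',
]

def span_has_double_space(raw):
    """Return one bool per reported error: does the flagged span contain two spaces?"""
    data = json.loads(raw)
    results = []
    for form, start, end, *_rest in data["errs"]:
        span = data["text"][start:end]
        # Assumption: offsets are character offsets, so the slice equals the form.
        assert span == form
        results.append("  " in span)
    return results

for raw in outputs:
    print(span_has_double_space(raw))
```

The first run flags a span without any double space in it, the second flags the span that actually contains the two spaces.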

sme $ echo eanas | husmeNorm 
eanas   eanas+Adv       0,000000
eanas   eanas+Pron+Indef+Sg+Nom 0,000000
eanas   eanas+A+Attr    0,000000

sme $ echo eanáš | husmeNorm 
eanáš   eanášit+V+TV+Imprt+ConNeg       0,000000
eanáš   eanášit+V+TV+Imprt+Sg2  0,000000
eanáš   eanášit+V+TV+Ind+Prs+ConNeg     0,000000
eanáš   eanáš+Adv       0,000000
Comment 1 Linda Wiechetek 2019-08-17 20:06:08 CEST
What exactly is the problem? I can't see a difference in the headline. Could you check if the problem still exists?
Comment 2 Linda Wiechetek 2019-08-18 15:20:03 CEST
Now I see the problem. I sent you an email about it.
The difference between Eanas oassi and Eanáš oassi is that the first one is listed as a one-word compound. I'm not sure how that influences the matter.
Comment 3 Sjur Nørstebø Moshagen 2019-08-20 14:38:08 CEST
The underlying problem is that the whitespace analyser is applied directly after the morphological analysis & tokenisation, which means that the tag <doubleSpaceBefore>, meant to target the two spaces in front of Eanas, is added to all readings of the following word.

So far so good, and as it should be.

But when that following word is ambiguous in its tokenisation, as in this case, and it resolves to two tokens, the <doubleSpaceBefore> tag is dragged along into both of the new cohorts. This leads to the strange situation that the following word 'oassi' is also tagged as being preceded by two spaces, although that is not the case.
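
The effect can be reproduced in a toy model. The Python sketch below is only an illustration under stated assumptions: plain dicts stand in for cohorts and readings, the "N" reading of the compound is hypothetical, and tag_whitespace and split_mwe are made-up names, not the real whitespace analyser or cg-mwesplit.

```python
# Toy model only: plain dicts stand in for CG3 cohorts and readings.
TAG = "<doubleSpaceBefore>"

def tag_whitespace(cohort):
    """Add TAG to every reading if the cohort is preceded by two spaces."""
    if cohort["space_before"] == "  ":
        cohort["readings"] = [reading + [TAG] for reading in cohort["readings"]]
    return cohort

def split_mwe(cohort):
    """Naively split a multi-word cohort; every part inherits the <...> tags."""
    inherited = [t for t in cohort["readings"][0] if t.startswith("<")]
    parts = []
    for i, word in enumerate(cohort["form"].split(" ")):
        parts.append({
            "form": word,
            # Only the first part keeps the original preceding whitespace.
            "space_before": cohort["space_before"] if i == 0 else " ",
            "readings": [[word.lower()] + inherited],
        })
    return parts

# Hypothetical compound reading, tagged before it is split.
mwe = {"form": "Eanas oassi", "space_before": "  ",
       "readings": [["eanas oassi", "N"]]}
parts = split_mwe(tag_whitespace(mwe))
for part in parts:
    print(repr(part["space_before"]), part["form"], part["readings"])
```

Because the tag is added before the split, the second part 'oassi' carries <doubleSpaceBefore> even though only a single space precedes it.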

One solution would be to move the whitespace tagging until after MWE disambiguation. The problem with that is that we then lose the information from the whitespace tagger that could be useful when disambiguating ambiguous tokenisations. But maybe we don't use that information at all.

Linda, Kevin - other ideas? Comments?
Comment 4 Kevin Brubeck Unhammer 2019-08-21 09:43:51 CEST
The whitespace analyser can emit the tags


and of these I only see <firstWordOfParagraph> used, in one rule:

SELECT:before-paragraph ("." CLB) IF (1*> (>>>) BARRIER (>>>) LINK 1 <firstWordOfParagraph>);
	## Dat lea eanet go 10.  <linjeshift> Dat lei boahtán.

Is it possible to disambiguate «10.» here without referring to <firstWordOfParagraph>?

Alternatively, it is no problem to have *two* whitespace analysers running: one that adds more «informative» tags such as <firstWordOfParagraph> (and runs before mwe-dis.cg3), and one that adds error tags such as <doubleSpaceBefore> (after cg-mwesplit).
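
A rough Python sketch of the two-pass idea follows. It is a toy model, not the actual pipeline: the names INFO_TAGS, ERROR_TAGS, whitespace_tags, analyse and split are invented for the illustration, and the condition that triggers <firstWordOfParagraph> (a newline in the preceding whitespace) is an assumption.

```python
INFO_TAGS = {"<firstWordOfParagraph>"}   # may be used by disambiguation rules
ERROR_TAGS = {"<doubleSpaceBefore>"}     # only reported as errors

def whitespace_tags(cohort):
    """All whitespace tags that apply to a cohort (trigger conditions are assumptions)."""
    tags = set()
    if cohort["space_before"] == "  ":
        tags.add("<doubleSpaceBefore>")
    if "\n" in cohort["space_before"]:
        tags.add("<firstWordOfParagraph>")
    return tags

def analyse(cohorts, wanted):
    """One whitespace-analyser pass that only emits tags listed in `wanted`."""
    for cohort in cohorts:
        cohort["tags"] |= whitespace_tags(cohort) & wanted
    return cohorts

def split(cohorts):
    """Stand-in for MWE disambiguation plus splitting: tags are copied to each part."""
    out = []
    for cohort in cohorts:
        for i, word in enumerate(cohort["form"].split(" ")):
            out.append({
                "form": word,
                "space_before": cohort["space_before"] if i == 0 else " ",
                "tags": set(cohort["tags"]),
            })
    return out

cohorts = [{"form": "Eanas oassi", "space_before": "  ", "tags": set()}]
cohorts = analyse(cohorts, INFO_TAGS)    # pass 1: before MWE disambiguation
cohorts = split(cohorts)                 # multi-word cohort becomes two cohorts
cohorts = analyse(cohorts, ERROR_TAGS)   # pass 2: after the split
for cohort in cohorts:
    print(cohort["form"], sorted(cohort["tags"]))
```

With this ordering the error tag lands only on 'Eanas', while informative tags would still be available to rules that run before the split.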
Comment 5 Linda Wiechetek 2019-08-29 00:08:11 CEST
Well, right now we mess up anyway.. I tested "Dat lea eanet go 10. Dat lea eanet go 10. olbmui."

and we get:

        "go" CS <W:0.0> @CVP SELECT:8116:r1180 MAP:12871:r10 SELECT:13056:r1461
;       "go" CS <W:0.0> @CNP SELECT:8116:r1180 MAP:12871:r10 SELECT:13056:r1461
;       "go" Pcle Qst <W:0.0> SELECT:8116:r1180
        "10" A Arab Ord Attr <W:0.0> @>N MAP:21848:r86
;       "." CLB <W:0.0> "<.>"
;               "10" Num Sem/ID <W:0.0> "<10>" REMOVE:2689:longest-match
;       "." CLB <W:0.0> "<.>"
;               "10" Num Arab Sg Nom <W:0.0> "<10>" REMOVE:2689:longest-match
;       "." CLB <W:0.0> "<.>"
;               "10" Num Arab Sg Loc Attr <W:0.0> "<10>" REMOVE:2689:longest-match
;       "." CLB <W:0.0> "<.>"
;               "10" Num Arab Sg Ill Attr <W:0.0> "<10>" REMOVE:2689:longest-match
;       "." CLB <W:0.0> "<.>"
;               "10" Num Arab Sg Gen <W:0.0> "<10>" REMOVE:2689:longest-match
;       "." CLB <W:0.0> "<.>"
;               "10" Num Arab Sg Acc <W:0.0> "<10>" REMOVE:2689:longest-match
        "dat" Pron Dem Sg Nom <W:0.0> @SUBJ> SELECT:17765:r2334 MAP:23324
;       "dat" Pcle <W:0.0> SELECT:17765:r2334
;       "dat" Pron Dem Pl Nom <W:0.0> REMOVE:13658:r1619

"It is more than 10" should give us "." CLB.. 
I'll have a look at a possible rule.
Comment 6 Linda Wiechetek 2019-08-29 00:11:28 CEST
Ahh.. line break (linjeshift)... So yes, we do manage to disambiguate in this sentence without referring to <firstWordOfParagraph>, but I don't know to what extent we need to generalise.
Comment 7 Sjur Nørstebø Moshagen 2019-09-02 14:16:37 CEST
I am moving the whitespace analyser further down the command pipeline. Linda's rule no longer depends on this tag.
Comment 8 Sjur Nørstebø Moshagen 2019-09-06 20:52:34 CEST
(In reply to Kevin Brubeck Unhammer from comment #4)
> Alternatively, it is no problem to have *two* whitespace analysers
> running: one that adds more «informative» tags such as
> <firstWordOfParagraph> (and runs before mwe-dis.cg3), and one that adds
> error tags such as <doubleSpaceBefore> (after cg-mwesplit).

I chose to do it this way, and now things work as they should:

$ echo " olbmui.  Eanas oassi." | divvun-checker -a se.zcheck | jq .
  "errs": [
      ".  Eanas",
      "Leat guokte gaskka ovdal \"Eanas\"",
        ". Eanas"
  "text": " olbmui.  Eanas oassi."

I am closing the bug report.