Bug 2674 - Empty symbol analysed as "ʼ" (U+02BC MODIFIER LETTER APOSTROPHE)
Status: RESOLVED FIXED
Alias: None
Product: Pre- and postprocessing
Classification: Unclassified
Component: hfst-preprocess
Version: unspecified
Hardware: Macintosh Other
Importance: P5 - Later enhancement
Assignee: Sjur Nørstebø Moshagen
Reported: 2020-09-03 11:05 CEST by Trond Trosterud
Modified: 2021-10-27 22:32 CEST
Description Trond Trosterud 2020-09-03 11:05:09 CEST
Synopsis:
The problem is that an empty token (actually: every SPACE character) is analysed as MODIFIER LETTER APOSTROPHE (U+02BC).

Input is:
Йомак ¶
Туш то ¶

Command for analysis is:
ccat -l mhr ~/rusbound/converted/mhr/ficti/|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst


Output is:

"<Йомак>"
        "йомак" N Attr <W:0.0>
        "йомак" N Sg Nom <W:0.0>
: 
"<¶>"
        "¶" CLB <W:0.0>
:\n
"<>"
        "ʼ" N Symbol <W:0.0>
"<Туш>"
        "ту" Hom2 N Sg Ill <W:0.0>
        "ту" Hom3 N Sg Ill <W:0.0>
        "туш" Adv <W:0.0>
        "туш" Hom2 N Attr <W:0.0>
        "туш" Hom2 N Sg Nom <W:0.0>
        "туш" Hom3 N Attr <W:0.0>
        "туш" Hom3 N Sg Nom <W:0.0>
        "туш" Pron Pron Dem <W:0.0>
: 
"<то>"
        "то" CC CC"+WORK" <W:0.0>
        "то" Pron Pron Ind <W:0.0>
: 
"<¶>"
        "¶" CLB <W:0.0>
:\n
"<>"
        "ʼ" N Symbol <W:0.0>
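
To confirm which character the spurious lemma actually is (several apostrophe-like characters look almost identical in a terminal), the code point can be looked up with Python's standard `unicodedata` module. This is only a minimal sketch for checking the character; the helper name `describe` is made up for illustration and is not part of any HFST tool.

```python
import unicodedata


def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"


# The lemma from the spurious reading above, next to two look-alikes.
for ch in ["\u02bc", "\u2019", "\u00b4"]:
    print(describe(ch))
# U+02BC MODIFIER LETTER APOSTROPHE
# U+2019 RIGHT SINGLE QUOTATION MARK
# U+00B4 ACUTE ACCENT
```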
Comment 1 Trond Trosterud 2020-09-03 11:08:42 CEST
Correction: It does not happen for spaces between words. I get the empty tokens before and after :\n, i.e. at the end of the sentence:

echo "тиде книга."|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst
"<тиде>"
	"тидаш" V ConNeg <W:0.0>
	"тидаш" V Imprt Sg2 <W:0.0>
	"тиде" Pron Dem Sg Nom <W:0.0>
: 
"<книга>"
	"книга" A <W:0.0>
	"книга" A Der/MWN N Attr <W:0.0>
	"книга" A Der/MWN N Sg Nom <W:0.0>
	"книга" N Attr <W:0.0>
	"книга" N Sg Nom <W:0.0>
"<.>"
	"." CLB <W:0.0>
"<>"
	"ʼ" N Symbol <W:0.0>
:\n
"<>"
	"ʼ" N Symbol <W:0.0>

It happens only for mhr.
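
A quick way to check other texts or other languages for the same symptom is to scan the tokeniser output for cohorts whose wordform is empty (`"<>"`). The following is a rough sketch, not part of the GiellaLT or HFST tooling: the function name `empty_cohorts` and the embedded sample are assumptions made for illustration, and the parsing only covers the simple CG layout shown in this report (cohort lines starting with `"<`, indented reading lines, and `:`-prefixed blank lines).

```python
import re
from typing import List, Tuple

# A cohort line looks like "<wordform>"; the wordform may be empty.
COHORT_RE = re.compile(r'^"<(.*)>"$')


def empty_cohorts(cg_text: str) -> List[Tuple[int, List[str]]]:
    """Return (line_number, readings) for every cohort with an empty wordform."""
    hits: List[Tuple[int, List[str]]] = []
    current = None  # readings of the empty cohort being collected, if any
    for lineno, line in enumerate(cg_text.splitlines(), start=1):
        match = COHORT_RE.match(line)
        if match:
            current = None
            if match.group(1) == "":
                current = []
                hits.append((lineno, current))
        elif current is not None and line[:1] in (" ", "\t") and line.strip():
            # Indented line: a reading that belongs to the empty cohort.
            current.append(line.strip())
        else:
            # Anything else (e.g. a ":"-prefixed line) ends the cohort.
            current = None
    return hits


# Tail of the output quoted above; the raw string keeps ":\n" literal.
SAMPLE = r'''"<.>"
	"." CLB <W:0.0>
"<>"
	"ʼ" N Symbol <W:0.0>
:\n
"<>"
	"ʼ" N Symbol <W:0.0>
'''

for lineno, readings in empty_cohorts(SAMPLE):
    print(lineno, readings)
```

On the sample this reports the two empty cohorts on lines 3 and 6; on output from a tokeniser without the bug it should report nothing.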
Comment 2 Sjur Nørstebø Moshagen 2021-10-27 22:32:40 CEST
The problem seems to have been fixed:

echo "тиде книга."|hfst-tokenise -cg tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst 
"<тиде>"
	"тидаш" V ConNeg <W:0.0>
	"тидаш" V Imprt Sg2 <W:0.0>
	"тиде" Pron Dem Sg Nom <W:0.0>
: 
"<книга>"
	"книга" A <W:0.0>
	"книга" A Der/MWN N Attr <W:0.0>
	"книга" A Der/MWN N Sg Nom <W:0.0>
	"книга" N Attr <W:0.0>
	"книга" N Sg Nom <W:0.0>
"<.>"
	"." CLB <W:0.0>
:\n