Bug 2652

Summary: Stray character U+02BC "ʼ" introduced in hfst output of sms analysis
Product: Infrastructure
Component: newinfra
Reporter: Jack Rueter <rueter.jack>
Assignee: Sjur Nørstebø Moshagen <sjur.n.moshagen>
CC: borre.gaup, chiara.argese
Status: NEW
Severity: enhancement
Priority: P5 - Later
Version: unspecified
Hardware: Macintosh
OS: Other

Description Jack Rueter 2020-03-19 04:49:41 CET
When applying the pipeline
hfst-tokenise --giella-cg -W $GTHOME/langs/sms/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g $GTHOME/langs/sms/src/syntax/disambiguator.cg3 

to the sms text:
'De suu teâđast ... son käggõõđi teä suu koʹdde, pačču di koʹdde.'

undesired output occurs: a Modifier Letter Apostrophe (U+02BC) is introduced under empty "<>" word forms. (This character is part of the sms orthography.)

"<De>"
	"de" Adv Sem/Time 
	"de" CC 
: 
"<suu>"
	"son" Pron Pers Sg3 Acc 
	"son" Pron Pers Sg3 Gen
: 
"<teâđast>"
	"teâtt" N Sg Loc
	"teâđast" Adv 
	"teâđsted" V Ind Prs Sg3 
: 
"<...>"
	"..." CLB

"<>"
	"ʼ" N Symbol
: 
"<son>"
	"son" Pron Pers Sg3 Nom
: 
"<käggõõđi>"
	"käggõõđi" ?
: 
"<teä>"
	"teä" Adv Sem/Time Sem/Time 
: 
"<suu>"
	"son" Pron Pers Sg3 Acc 
	"son" Pron Pers Sg3 Gen
: 
"<koʹdde>"
	"kåʹdded" V Ind Prt Pl3 
"<,>"
	"," CLB
"<>"
	"ʼ" N Symbol
: 
"<pačču>"
	"pääččad" V Ind Prt Pl3 
: 
"<di>"
	"di" CC 
: 
"<koʹdde>"
	"kåʹdded" V Ind Prt Pl3 
"<.>"
	"." CLB

"<>"
	"ʼ" N Symbol
:
"<>"
	"ʼ" N Symbol
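
The stray readings in the output above can be located mechanically. The following is a minimal Python sketch, not part of the Giella infrastructure: the helper name find_empty_cohorts is hypothetical, and it assumes the stream layout shown in this report (a word form line starting with "<, followed by indented reading lines). It reports every reading listed under an empty "<>" word form together with the code points of its lemma.

```python
import unicodedata


def find_empty_cohorts(cg_output):
    """Return (line number, lemma, code point names) for each reading
    that sits under an empty "<>" word form in a CG3-style stream."""
    hits = []
    in_empty = False
    for lineno, line in enumerate(cg_output.splitlines(), start=1):
        if line.startswith('"<'):
            # A new word form starts; remember whether it is the empty one.
            in_empty = line.strip() == '"<>"'
        elif in_empty and line[:1] in ("\t", " ") and line.strip().startswith('"'):
            reading = line.strip()
            # The lemma is the text between the first pair of double quotes.
            lemma = reading[1:reading.index('"', 1)]
            names = [f"U+{ord(c):04X} {unicodedata.name(c, 'UNKNOWN')}" for c in lemma]
            hits.append((lineno, lemma, names))
    return hits


# Small excerpt modelled on the output in this report.
sample = (
    '"<...>"\n\t"..." CLB\n\n'
    '"<>"\n\t"\u02bc" N Symbol\n: \n'
    '"<son>"\n\t"son" Pron Pers Sg3 Nom\n'
)
for hit in find_empty_cohorts(sample):
    print(hit)
```

Run against the full output, a check like this would make it easy to see whether the workaround described below removes all of the empty word forms or only some of them.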

In the analysis, epsilon (the empty string) has been analysed as Modifier Letter Apostrophe (U+02BC) in seemingly random places.

One workaround is to comment out the Modifier Letter Apostrophe (U+02BC) entry in the local copy of "src/morphology/generated_files/symbols.lexc".

This immediately suppresses the unwanted output. However, the insertion of the Modifier Letter Apostrophe might be caused by the following spellrelax rule:
ʼ (->) 0 ,  # U+02BC MODIFIER LETTER APOSTROPHE, accept also ZERO.
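
If the spellrelax hypothesis holds, the mechanism can be illustrated with a toy model in Python. This is only a sketch under that assumption, not how HFST actually composes the transducers: the two-entry lexicon and the function names are hypothetical. The point is that once U+02BC may optionally surface as nothing, a lexicon entry consisting of that character alone also accepts the empty string.

```python
from itertools import product

APOSTROPHE = "\u02bc"  # MODIFIER LETTER APOSTROPHE

# Hypothetical mini-lexicon: one entry from symbols.lexc, one ordinary word.
LEXICON = {APOSTROPHE: "N Symbol", "son": "Pron Pers Sg3 Nom"}


def relaxed_surface_forms(form):
    """All surface strings accepted for `form` when every U+02BC
    may optionally be realised as zero (the effect of the rule above)."""
    options = [(ch, "") if ch == APOSTROPHE else (ch,) for ch in form]
    return {"".join(choice) for choice in product(*options)}


def analyse(surface):
    """Return the lexicon entries whose relaxed surface forms include `surface`."""
    return [
        (lemma, tags)
        for lemma, tags in LEXICON.items()
        if surface in relaxed_surface_forms(lemma)
    ]


# The empty string now receives the symbol reading.
print(analyse(""))
```

In this model, removing the U+02BC entry from the lexicon and restricting the optional deletion both make the empty-string analysis disappear, which is consistent with the workaround above but does not prove that the rule is the cause.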

Should symbols that belong to the native orthography be removed from symbols.lexc, or commented out?