Bug 2652 - Stray character U+02BC "ʼ" introduced in hfst output of sms analysis
Status: NEW
Alias: None
Product: Infrastructure
Classification: Unclassified
Component: newinfra
Version: unspecified
Hardware: Macintosh Other
Importance: P5 - Later enhancement
Assignee: Sjur Nørstebø Moshagen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2020-03-19 04:49 CET by Jack Rueter
Modified: 2020-03-19 04:49 CET

See Also:


Description Jack Rueter 2020-03-19 04:49:41 CET
When applying the pipeline
hfst-tokenise --giella-cg -W $GTHOME/langs/sms/tools/tokenisers/tokeniser-disamb-gt-desc.pmhfst | vislcg3 -g $GTHOME/langs/sms/src/syntax/disambiguator.cg3 

to the sms text:
'De suu teâđast ... son käggõõđi teä suu koʹdde, pačču di koʹdde.'

undesired output occurs in the form of an introduced Modifier Letter Apostrophe (U+02BC). (This character is part of the sms orthography.)

"<De>"
	"de" Adv Sem/Time 
	"de" CC 
: 
"<suu>"
	"son" Pron Pers Sg3 Acc 
	"son" Pron Pers Sg3 Gen
: 
"<teâđast>"
	"teâtt" N Sg Loc
	"teâđast" Adv 
	"teâđsted" V Ind Prs Sg3 
: 
"<...>"
	"..." CLB

"<>"
	"ʼ" N Symbol
: 
"<son>"
	"son" Pron Pers Sg3 Nom
: 
"<käggõõđi>"
	"käggõõđi" ?
: 
"<teä>"
	"teä" Adv Sem/Time Sem/Time 
: 
"<suu>"
	"son" Pron Pers Sg3 Acc 
	"son" Pron Pers Sg3 Gen
: 
"<koʹdde>"
	"kåʹdded" V Ind Prt Pl3 
"<,>"
	"," CLB
"<>"
	"ʼ" N Symbol
: 
"<pačču>"
	"pääččad" V Ind Prt Pl3 
: 
"<di>"
	"di" CC 
: 
"<koʹdde>"
	"kåʹdded" V Ind Prt Pl3 
"<.>"
	"." CLB

"<>"
	"ʼ" N Symbol
:\n
"<>"
	"ʼ" N Symbol

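In the analysis, epsilon (the empty string) has become Modifier Letter Apostrophe (U+02BC) in seemingly random places: each affected cohort has an empty wordform "<>" with the reading "ʼ" N Symbol.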
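
To check whether a given input triggers the problem, the output of the pipeline can be scanned for cohorts with an empty wordform. The following is only a minimal sketch, assuming the CG stream arrives on stdin in the format shown above (cohort lines unindented, reading lines tab-indented). The function name find_empty_cohorts is hypothetical and not part of hfst or vislcg3.

```shell
# Hypothetical helper (not part of hfst or vislcg3): report cohorts whose
# wordform is empty ("<>") in a CG stream read from stdin, together with
# the first reading that follows each of them.
find_empty_cohorts() {
    awk '
        # A cohort line with an empty wordform is exactly: "<>"
        $0 == "\"<>\"" { pending = NR; next }
        # Readings are tab-indented; print the cohort line number and the reading.
        pending && /^\t/ { sub(/^\t/, ""); print pending ": " $0; pending = 0; next }
        # Anything else ends the pending cohort.
        { pending = 0 }
    '
}
```

Appending "| find_empty_cohorts" to the pipeline above would then list only the stray cohorts, one per line, prefixed with the line number of the "<>" cohort in the stream.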

One workaround is to comment out "Modifier Letter Apostrophe U+02BC" locally in the 
"src/morphology/generated_files/symbols.lexc" file.
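
The local workaround could be scripted roughly as follows. This is a sketch under assumptions, not a tested recipe for the sms repository: it assumes the symbol sits on a line of its own in the file, and it relies on "!" starting a comment in lexc. The function name comment_out_u02bc is hypothetical.

```shell
# Hypothetical helper: comment out every line of a lexc file that contains
# U+02BC MODIFIER LETTER APOSTROPHE by prefixing it with "! " (lexc comment).
# Assumption: the symbol is listed on a line of its own.
comment_out_u02bc() {
    file=$1
    # Build the character from its UTF-8 bytes (0xCA 0xBC) so that it cannot
    # be confused with look-alike apostrophe characters.
    ch=$(printf '\312\274')
    # Lines that are already commented out are left untouched.
    awk -v ch="$ch" '
        index($0, ch) && $0 !~ /^[ \t]*!/ { print "! " $0; next }
        { print }
    ' "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

It would be called as: comment_out_u02bc src/morphology/generated_files/symbols.lexc. Since the file lives under generated_files, the change may be overwritten the next time the file is regenerated.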

This immediately removes the unwanted output. Of course, the introduction of the Modifier Letter Apostrophe might instead be associated with the spellrelax rule:
ʼ (->) 0 ,  # U+02BC MODIFIER LETTER APOSTROPHE, accept also ZERO.

Should native symbols be removed from, or commented out in, symbols.lexc?