Bug 1642

Summary: Add test to make sure that InitUpper actually covers all letters in alphabet
Product: Infrastructure Reporter: Sjur Nørstebø Moshagen <sjur.n.moshagen>
Component: newinfra Assignee: Sjur Nørstebø Moshagen <sjur.n.moshagen>
Status: ASSIGNED ---    
Severity: normal CC: sjur.n.moshagen, trond.trosterud
Priority: P4 - Within a month    
Version: unspecified   
Hardware: All   
OS: All   

Description Sjur Nørstebø Moshagen 2013-04-04 16:19:42 CEST
Cf. svn r73832:

Log:
Norwegian letters æøå (->) ÆØÅ were not part of this file,
and have evidently not worked since the introduction of the
new infra.

This shows that our testing routines are not exactly optimal :-(

TODO: Check inituppercase.regex for the other languages as well.

What we need is a generic test to ensure that the whole alphabet is covered by the regex.
Comment 1 Sjur Nørstebø Moshagen 2017-03-31 15:05:46 CEST
Idea for testing:

* extract all initial, lowercase letters in the lexical fst
* apply initial upper case to them
* apply a regular Unicode-aware uppercase transform (sed or similar) to the same letters
* compare the two results, FAIL if they differ

We need to check whether there can be exceptional letters that should not be uppercased.

Languages without casing can just comment out the test.
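
The comparison step in the idea above could be sketched roughly as follows, in Python. This is only an illustration under assumptions: `missing_initupper`, `ascii_only_init_upper` and the `exceptions` parameter are hypothetical names, and the `init_upper` callable stands in for whatever actually applies inituppercase.regex (for instance a wrapper around the compiled transducer). Python's `str.upper()` plays the role of the "regular Unicode transform".

```python
def missing_initupper(initial_letters, init_upper, exceptions=frozenset()):
    """Return {letter: (actual, expected)} for every lowercase letter whose
    init-upper result differs from plain Unicode uppercasing."""
    failures = {}
    for letter in sorted(set(initial_letters)):
        # Skip caseless symbols and letters deliberately exempted.
        if letter in exceptions or not letter.islower():
            continue
        expected = letter.upper()      # reference: Unicode uppercasing
        actual = init_upper(letter)    # system under test (hypothetical hook)
        if actual != expected:
            failures[letter] = (actual, expected)
    return failures


# Stand-in for a regex that only covers a-z, mimicking the æøå omission.
ASCII_ONLY = {chr(c): chr(c).upper() for c in range(ord("a"), ord("z") + 1)}


def ascii_only_init_upper(word):
    """Uppercase the first character only if it is in the a-z table."""
    first = word[:1]
    return ASCII_ONLY.get(first, first) + word[1:]


if __name__ == "__main__":
    failures = missing_initupper("abæøå", ascii_only_init_upper)
    for letter, (actual, expected) in failures.items():
        print(f"FAIL: {letter} -> {actual}, expected {expected}")
```

The `exceptions` set is one possible place to record letters that should not be uppercased one-to-one. For example, Python uppercases ß to SS, so a language using ß would probably need it listed there.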
Comment 2 Sjur Nørstebø Moshagen 2017-03-31 15:38:26 CEST
Tomi, would you have a suggestion for an effective xfst regex to extract the first letter of an fst?

I tried:

hfst[0]: load stack src/analyser-gt-norm.hfst
? bytes. 483360 states, 1099479 arcs, ? paths
hfst[1]: define lex
Defined 'lex'
hfst[0]: read regex lex.i .o. [ ? -> 0 || \[ .#. ]  _ ] ;
? bytes. 781273 states, 2658850 arcs, ? paths
hfst[1]: print random-lower
Jie
K

but as you can see it isn't doing what I want :-/
Comment 3 Tomi Pieski 2017-03-31 17:22:06 CEST
'print net' prints out the whole fst network, and the first letters are the ones on the transitions listed for the start state 's0'. Create a script file with the appropriate commands:

----scriptfile.hfst
load stack src/analyser-gt-norm.hfst
print net
----end file


Call this:
hfst-xfst -F scriptfile.hfst | grep -e 's0:'
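
To feed the grep result into a test, the labels on the 's0' line could be collected with a few lines of Python. This is a sketch only: it assumes that each transition is printed as `label -> target` and that transitions are separated by commas, which should be checked against real 'print net' output. The function name `s0_labels` and the sample line in the usage note are made up for illustration.

```python
import re

# Assumed arc layout: "<label> -> <target state>"; not verified against hfst-xfst.
ARC = re.compile(r"(\S+)\s*->\s*\w+")


def s0_labels(print_net_output):
    """Collect the arc labels listed on the line(s) for the start state."""
    labels = []
    for line in print_net_output.splitlines():
        # The start state may also be final, in which case it is assumed
        # to be printed with an 'fs0:' prefix instead of 's0:'.
        if line.startswith(("s0:", "fs0:")):
            arcs = line.split(":", 1)[1]
            labels.extend(ARC.findall(arcs))
    return labels
```

With a made-up line such as `s0: a -> s1, b -> s2, æ -> s3.` this would return the labels a, b and æ. Pair labels and multichar symbols would come back unsplit, so the caller would still need to pick the relevant side and filter for lowercase letters.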