Cf. svn r73832:
Norwegian letters æøå (->) ÆØÅ were not part of this file,
and have evidently not worked since the introduction of the
This shows that our testing routines are not exactly optimal :-(
TODO: Check inituppercase.regex for the other languages as well.
What we need is a generic test to ensure that the whole alphabet is covered by the regex.
Idea for testing:
* extract all initial, lowercase letters in the lexical fst
* apply the initial-uppercase regex to them
* independently uppercase the same letters with a regular Unicode-aware transform (sed or similar)
* compare the two, FAIL if diff
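
The comparison step above could be sketched roughly as follows. This is only an illustration in Python, not existing test code: `uncovered_letters` and `ascii_only_upper` are hypothetical names, and `ascii_only_upper` is a stand-in for applying the real inituppercase regex (here it deliberately mimics the bug by only handling ASCII letters). Python's built-in `str.upper` plays the role of the "regular Unicode transform".

```python
def uncovered_letters(initial_letters, apply_initial_upper):
    """Return {letter: (got, expected)} for every lowercase letter whose
    regex-produced uppercase differs from the Unicode uppercase."""
    failures = {}
    for letter in sorted(initial_letters):
        if not letter.islower():
            continue  # caseless or already uppercase: nothing to check
        got = apply_initial_upper(letter)
        expected = letter.upper()  # reference Unicode transform
        if got != expected:
            failures[letter] = (got, expected)
    return failures


def ascii_only_upper(word):
    """Hypothetical stand-in for the regex: uppercases only ASCII initials."""
    head = word[:1]
    return (head.upper() if head.isascii() else head) + word[1:]


# Letters as they might come out of the "extract initial letters" step.
letters = {"a", "b", "æ", "ø", "å"}
failures = uncovered_letters(letters, ascii_only_upper)
for letter, (got, expected) in failures.items():
    print(f"FAIL: {letter} -> {got}, expected {expected}")
```

With the stand-in above, the three Norwegian letters are reported as failures. Note that the reference transform is not neutral for every letter: Python's `str.upper` turns "ß" into "SS", so any exceptional letters would need an explicit allow-list before the comparison.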
We need to check whether there can be exceptional letters that should not be uppercased.
Languages without casing can just comment out the test.
Tomi, would you have a suggestion for an effective xfst regex to extract the first letter of an fst?
hfst: load stack src/analyser-gt-norm.hfst
? bytes. 483360 states, 1099479 arcs, ? paths
hfst: define lex
hfst: read regex lex.i .o. [ ? -> 0 || \[ .#. ] _ ] ;
? bytes. 781273 states, 2658850 arcs, ? paths
hfst: print random-lower
but as you can see it isn't doing what I want :-/
'print net' prints out the whole fst network, and the first letters are the labels on the arcs leaving state 's0'. Create a script file with the appropriate commands:
load stack src/analyser-gt-norm.hfst
print net
hfst-xfst -F scriptfile.hfst | grep -e 's0:'
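
To turn the grepped 's0:' line into a list of letters, something like the sketch below might work. It rests on an unverified assumption about the 'print net' output: that a state line looks like `s0: <a:b> -> s1, c -> s2.`, with `<upper:lower>` for pairs and a bare symbol for identity arcs. The sample line in the code is made up for illustration and is not real hfst-xfst output; `initial_labels` is a hypothetical helper. Check the pattern against the actual output before relying on it.

```python
import re

# One arc: either "<upper:lower> -> target" or "symbol -> target".
# ASSUMPTION: this mirrors the xfst-style 'print net' layout.
ARC = re.compile(r"(?:<(?P<up>.+?):(?P<low>.+?)>|(?P<same>\S+)) -> \S+")


def initial_labels(state_line):
    """Return the set of (upper, lower) labels on the arcs of one state line."""
    _, _, arcs = state_line.partition(":")  # drop the "s0" state name
    labels = set()
    for match in ARC.finditer(arcs):
        if match.group("same") is not None:
            labels.add((match.group("same"), match.group("same")))
        else:
            labels.add((match.group("up"), match.group("low")))
    return labels


# Hypothetical sample line, NOT copied from a real run.
sample = "s0:\t<a:a> -> s1, <æ:æ> -> s2, b -> s3."
labels = initial_labels(sample)
print(sorted(labels))
```

Both sides of each pair are kept so that the caller can pick whichever side holds the surface form; multicharacter symbols such as tags would still need to be filtered out afterwards.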