Cf. svn r73832:

Log: Norwegian letters æøå (->) ÆØÅ were not part of this file, and have evidently not worked since the introduction of the new infra. This shows that our testing routines are not exactly optimal :-(

TODO: Check inituppercase.regex for the other languages as well. What we need is a generic test to ensure that the whole alphabet is covered by the regex.
Idea for testing:

* extract all initial, lowercase letters in the lexical fst
* apply initial upper case
* apply a regular Unicode sed or similar transform to uppercase
* compare the two, FAIL if they differ

We need to check whether there can be exceptional letters that should not be uppercased. Languages without casing can just comment out the test.
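The comparison step could be sketched roughly as below. This is only an illustration of the idea, not the actual test: `find_uppercase_mismatches`, `fst_uppercase` and `ascii_only_uppercase` are made-up names, the letter list is typed in by hand instead of being extracted from the fst, and the stand-in casing function merely imitates a regex that forgot the non-ASCII letters. In a real test `fst_uppercase` would have to call the compiled inituppercase transducer. Python's `str.upper()` plays the role of the "regular Unicode transform"; note that it maps a few letters to more than one character (e.g. ß), which is one reason for the `exceptions` parameter.

```python
def find_uppercase_mismatches(first_letters, fst_uppercase, exceptions=()):
    """Compare the fst's initial-uppercase result with plain Unicode uppercasing.

    first_letters: iterable of initial lowercase letters (assumed to have been
                   extracted from the lexical fst by some other step).
    fst_uppercase: callable standing in for the inituppercase transducer
                   (hypothetical; not an HFST API).
    exceptions:    letters that are known not to be uppercased and are skipped.
    Returns a list of (letter, fst_result, unicode_result) for every difference.
    """
    mismatches = []
    for letter in sorted(set(first_letters)):
        # Skip declared exceptions and anything that is not a lowercase letter.
        if letter in exceptions or not letter.islower():
            continue
        expected = letter.upper()        # regular Unicode uppercase
        actual = fst_uppercase(letter)   # what the regex/fst produces
        if actual != expected:
            mismatches.append((letter, actual, expected))
    return mismatches


def ascii_only_uppercase(letter):
    """Toy stand-in for a regex that only covers a-z (imitates the æøå bug)."""
    return letter.upper() if letter.isascii() else letter


# Hand-written sample alphabet; a real run would use the letters from the fst.
letters = ["a", "k", "æ", "ø", "å"]
mismatches = find_uppercase_mismatches(letters, ascii_only_uppercase)

for letter, actual, expected in mismatches:
    print(f"FAIL: {letter} -> {actual}, expected {expected}")
```

The test would FAIL whenever the returned list is non-empty, and a language without casing would simply not run it.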
Tomi, would you have a suggestion for an effective xfst regex to extract the first letter of an fst? I tried:

hfst[0]: load stack src/analyser-gt-norm.hfst
? bytes. 483360 states, 1099479 arcs, ? paths
hfst[1]: define lex
Defined 'lex'
hfst[0]: read regex lex.i .o. [ ? -> 0 || \[ .#. ] _ ] ;
? bytes. 781273 states, 2658850 arcs, ? paths
hfst[1]: print random-lower
Jie K

but as you can see it isn't doing what I want :-/
'print net' prints out the whole fst network, and the first letters are the ones listed for the start state 's0'. Create a script file with the appropriate commands:

----scriptfile.hfst
load stack src/analyser-gt-norm.hfst
print net
----end file

Call it like this:

hfst-xfst -F scriptfile.hfst | grep -e 's0:'