Bug 1642 - Add test to make sure that InitUpper actually covers all letters in alphabet
Status: ASSIGNED
Alias: None
Product: Infrastructure
Classification: Unclassified
Component: newinfra
Version: unspecified
Hardware: All All
Importance: P4 - Within a month normal
Assignee: Sjur Nørstebø Moshagen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2013-04-04 16:19 CEST by Sjur Nørstebø Moshagen
Modified: 2018-05-29 10:52 CEST
Description Sjur Nørstebø Moshagen 2013-04-04 16:19:42 CEST
Cf. svn r73832:

Log:
Norwegian letters æøå (->) ÆØÅ was not part of this file,
and has evidently not worked since the introduction of the
new infra.

This shows that our testing routines are not exactly optimal :-(

TODO: Check inituppercase.regex for the other languages as well.

What we need is a generic test to ensure that the whole alphabet is covered by the regex.
Comment 1 Sjur Nørstebø Moshagen 2017-03-31 15:05:46 CEST
Idea for testing:

* extract all initial, lowercase letters in the lexical fst
* apply initial upper case
* apply a regular Unicode sed or similar transform to uppercase
* compare the two, FAIL if diff
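The comparison step could look roughly like the sketch below. It is only an illustration: `find_uncovered`, the `exceptions` parameter and the sample data are hypothetical names, and it assumes the initial letters and the InitUpper results have already been extracted from the fst by some other means. Python's `str.upper()` stands in for the "regular Unicode transform".

```python
def find_uncovered(initial_letters, init_upper, exceptions=frozenset()):
    """Return letters whose InitUpper result differs from Unicode uppercasing.

    initial_letters: lowercase letters found word-initially in the lexicon.
    init_upper: dict mapping a letter to what the InitUpper regex produced
                (letters the regex does not handle are simply absent).
    exceptions: letters that are deliberately not uppercased.
    """
    failures = {}
    for letter in sorted(initial_letters):
        expected = letter.upper()
        if letter in exceptions or expected == letter:
            continue  # exempt, or no case distinction for this symbol
        actual = init_upper.get(letter)
        if actual != expected:
            failures[letter] = (expected, actual)
    return failures


# Hypothetical data mimicking the situation before r73832:
letters = {"a", "b", "æ", "ø", "å"}
mapping = {"a": "A", "b": "B"}
missing = find_uncovered(letters, mapping)
# The test should FAIL whenever `missing` is non-empty.
```

Letters whose Unicode uppercase is not a single character (for instance ß) would show up as failures here, so they would have to go into `exceptions`.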

We need to check whether there can be exceptional letters that should not be uppercased.

Languages without casing can just comment out the test.
Comment 2 Sjur Nørstebø Moshagen 2017-03-31 15:38:26 CEST
Tomi, would you have a suggestion for an effective xfst regex to extract the first letter of an fst?

I tried:

hfst[0]: load stack src/analyser-gt-norm.hfst
? bytes. 483360 states, 1099479 arcs, ? paths
hfst[1]: define lex
Defined 'lex'
hfst[0]: read regex lex.i .o. [ ? -> 0 || \[ .#. ]  _ ] ;
? bytes. 781273 states, 2658850 arcs, ? paths
hfst[1]: print random-lower
Jie
K

but as you can see it isn't doing what I want :-/
Comment 3 Tomi Pieski 2017-03-31 17:22:06 CEST
'print net' prints out the whole fst network, and the first letters are the labels on the arcs listed for the start state 's0'. Create a script file with the appropriate commands:

----scriptfile.hfst
load stack src/analyser-gt-norm.hfst
print net
----end file


Call this:
hfst-xfst -F scriptfile.hfst | grep -e 's0:'
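
To turn the grep output into a list of letters, something like the following sketch might do. It assumes the state line has the shape `s0: a -> s1, b -> s2.` (with an `f` prefix when the state is final); that format and the name `first_symbols` are assumptions and should be checked against real `print net` output. Labels that themselves contain a comma would not survive this simple split.

```python
import re


def first_symbols(print_net_output):
    """Collect the arc labels listed for state s0 in 'print net' style text."""
    symbols = set()
    for line in print_net_output.splitlines():
        # Only the start state line; an optional 'f' marks a final state.
        match = re.match(r"\s*f?s0:\s*(.*)", line)
        if not match:
            continue
        for arc in match.group(1).rstrip(". ").split(","):
            if "->" in arc:
                symbols.add(arc.split("->")[0].strip())
    return symbols


# Hypothetical sample in the assumed format:
sample = "s0: a -> s1, b -> s2, æ -> s3.\ns1: x -> fs4.\nfs4: (no arcs)\n"
initial = first_symbols(sample)
```

The resulting set can then be filtered for lowercase letters and fed into the comparison against the InitUpper output.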