Bug 2537

Summary: Filter out numerals, acronyms, etc. from the word list for pattern hyphenation
Product: Infrastructure
Reporter: Sjur Nørstebø Moshagen <sjur.n.moshagen>
Component: newinfra
Assignee: Sjur Nørstebø Moshagen <sjur.n.moshagen>
Status: ASSIGNED
Severity: normal
CC: borre.gaup, chiara.argese
Priority: P5 - Later
Version: unspecified
Hardware: All
OS: All

Description Sjur Nørstebø Moshagen 2019-01-25 09:31:29 CET
The word list as it is generated now (random output from the fst) contains all sorts of noise: overgeneration patterns that are usually harmless but turn out to be really harmful in this context.
Comment 1 Sjur Nørstebø Moshagen 2019-01-25 09:39:27 CET
Use the weighted fst (do not convert it to unweighted), add heavy weights to the tags for all unwanted strings, then filter the output by weight (i.e. only output with a weight below a threshold should survive).

This requires that the wordlist is printed with weights, or that we remove such paths from the fst first, whichever is more easily implemented.
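
A minimal sketch of the weight-threshold filter, assuming the wordlist is printed as one entry per line with the weight in the last tab-separated column. That column layout, the function name filter_by_weight and the threshold value are assumptions for illustration, not what the build currently produces:

```python
def filter_by_weight(lines, threshold):
    """Keep only the words whose weight is strictly below `threshold`.

    Assumed input format (hypothetical): "word<TAB>weight" per line.
    Lines without a parsable weight column are skipped.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        word, sep, weight_text = line.rpartition("\t")
        if not sep:
            continue  # no weight column on this line
        try:
            weight = float(weight_text)
        except ValueError:
            continue  # weight column is not a number
        if weight < threshold:
            kept.append(word)
    return kept


if __name__ == "__main__":
    sample = ["hus\t0.0", "NRK\t1000.0", "123\t1000.0", "gussa\t2.5"]
    # Entries carrying the heavy weight (1000.0 here) are dropped.
    for word in filter_by_weight(sample, threshold=100.0):
        print(word)
```

The threshold only has to sit between the largest weight of a wanted string and the heavy weight attached to the unwanted tags, so its exact value is not critical.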
Comment 2 Sjur Nørstebø Moshagen 2019-01-25 09:42:51 CET
Another alternative: add more paths to be removed from the lexicon - we don't need acronyms and abbreviations in the hyphenator lexicon (they will be covered by the rule component). The same goes for numbers.

We already do this, so this is definitely the easiest way forward.
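
For comparison, a rough sketch of the same cleanup done on the generated word list instead of in the lexicon. This is a stand-in, not the lexicon-level path removal described above, and the three heuristics (contains a digit, all-uppercase, ends in a period) as well as the names is_unwanted and clean_wordlist are assumptions for illustration only:

```python
def is_unwanted(word):
    """Heuristically flag numerals, acronyms and abbreviations."""
    # Anything containing a digit is treated as a numeral-like string.
    if any(ch.isdigit() for ch in word):
        return True
    # Two or more letters, all uppercase: treated as an acronym.
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) >= 2 and all(ch.isupper() for ch in letters):
        return True
    # A trailing period is treated as an abbreviation marker.
    if word.endswith("."):
        return True
    return False


def clean_wordlist(words):
    """Return the words that are not flagged by is_unwanted."""
    return [w for w in words if w and not is_unwanted(w)]


if __name__ == "__main__":
    for word in clean_wordlist(["viessu", "NRK", "2019", "jna.", "Oslo", "A4"]):
        print(word)
```

Removing the paths in the lexicon remains the cleaner option, since string heuristics like these cannot tell an acronym from an ordinary word that happens to be written in capitals.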