Bug 2537 - Filter out numerals, acronyms, etc. from word list for pattern hyphenation
Summary: Filter out numerals, acronyms, etc. from word list for pattern hyphenation
Status: ASSIGNED
Alias: None
Product: Infrastructure
Classification: Unclassified
Component: newinfra
Version: unspecified
Hardware: All All
Importance: P5 - Later normal
Assignee: Sjur Nørstebø Moshagen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-01-25 09:31 CET by Sjur Nørstebø Moshagen
Modified: 2019-01-25 09:42 CET

See Also:



Description Sjur Nørstebø Moshagen 2019-01-25 09:31:29 CET
The word list is currently generated as random output from the FST, so it contains all sorts of noise: overgeneration patterns that are usually harmless but turn out to be really harmful in this context.
Comment 1 Sjur Nørstebø Moshagen 2019-01-25 09:39:27 CET
Use the weighted FST (do not convert it to unweighted), add heavy weights to the tags for all unwanted strings, then filter the output by weight (i.e. only output with a weight below a threshold should survive).

This requires either that the word list is printed with weights, or that we remove such paths from the FST first, whichever is more easily implemented.
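
A minimal sketch of the weight-based filtering step, assuming the word list has been printed as one "word<TAB>weight" pair per line. That format, the function name filter_by_weight and the threshold value are assumptions for illustration, not part of the existing build:

```python
def filter_by_weight(lines, threshold):
    """Keep words whose weight is strictly below `threshold`.

    Assumes each input line is "word<TAB>weight" (hypothetical format).
    Lines without a parsable weight are dropped rather than guessed at.
    """
    kept = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split on the last tab so a tab inside the word cannot hide the weight.
        word, sep, weight_text = line.rpartition("\t")
        if not sep:
            continue  # no weight column on this line
        try:
            weight = float(weight_text)
        except ValueError:
            continue  # weight column is not a number
        if weight < threshold:
            kept.append(word)
    return kept
```

With heavy weights (say 1000.0) on the unwanted tags, any threshold well below that value and above the ordinary path weights would separate the two groups.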
Comment 2 Sjur Nørstebø Moshagen 2019-01-25 09:42:51 CET
Another alternative: add more paths to be removed from the lexicon. We don't need acronyms and abbreviations in the hyphenator lexicon (they will be covered by the rule component). The same goes for numbers.

We already remove paths from the lexicon this way, so this is definitely the easiest way forward.
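
To illustrate which kinds of strings are meant, here is a rough surface-level check in Python. Note that this is a post-hoc filter on the printed word list, not the lexicon-level path removal described above, and the heuristics (any digit, all-uppercase letters, a period in the word) as well as the names looks_unwanted and filter_wordlist are assumptions made for the sketch:

```python
import re

_HAS_DIGIT = re.compile(r"\d")


def looks_unwanted(word):
    """Heuristically flag numerals, acronyms and abbreviations."""
    # Numerals and mixed forms such as "A4".
    if _HAS_DIGIT.search(word):
        return True
    # Acronyms: at least two letters, all of them uppercase.
    letters = [ch for ch in word if ch.isalpha()]
    if len(letters) >= 2 and all(ch.isupper() for ch in letters):
        return True
    # Abbreviations written with a period.
    if "." in word:
        return True
    return False


def filter_wordlist(words):
    """Return the words that do not look like numerals, acronyms or abbreviations."""
    return [w for w in words if not looks_unwanted(w)]
```

Such a check could at most serve as a sanity test on the generated list; removing the paths in the lexicon remains the cleaner fix.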