Summary: | Add language detection as part of the corpus infra | ||
---|---|---|---|
Product: | Corpus | Reporter: | Saara Huhmarniemi <saara.huhmarniemi> |
Component: | Text corpus infrastructure | Assignee: | Saara Huhmarniemi <saara.huhmarniemi> |
Status: | RESOLVED FIXED | ||
Severity: | normal | CC: | trond.trosterud |
Priority: | P2 - As soon as possible | ||
Version: | unspecified | ||
Hardware: | All | ||
OS: | MacOS X | ||
Bug Depends on: | 279 | ||
Bug Blocks: |
Description
Saara Huhmarniemi
2006-02-17 18:34:52 CET
text_cat has now been improved with larger texts for sme, smj, nob and nno. More texts are still required for smj and sma, and Finnish needs an update as well. text_cat should be improved in other ways too: non-word characters can affect the recognition, and the handling of Unicode characters should be checked.

Non-word characters are now excluded from the trigrams. Language recognition has been improved by creating a short-word list to help the trigram heuristics. The lists for some languages are still missing (smj, sme, nob), so they are not yet in full use.

The file-specific xsl-files already make it possible to choose the languages that are used in guessing the language. They should be updated with a variable "monolingual" for files that are not run through language recognition; e.g. the Bible contains short paragraphs with lots of names and is generally monolingual. The default is to assume that the document is multilingual and should be tested against all the languages in the LM-directory. The text_cat tool has been updated so that it uses these variables if present. Extensive testing is still required.

The file-specific xsl-files are now updated with the variable "monolingual", which can be used for turning off language recognition. The document is multilingual by default, and the tested languages can be chosen in the xsl-file (remember to select 1 for the variable "multilingual" as well). The short-word lists for smj, sme, sma and nob are still missing, so it is not clear how much the selection between e.g. nno and nob improves with the short-word lists. I leave the bug open until we get them.

The short-word lists are in place now, and text_cat seems to work. The selection between nob and nno did not get any better with the short-word lists, but the selection between the Sámi languages and others, e.g. nob, improved. There is always room for improvement in the language models (and short-word models), but I think otherwise this bug can be marked as fixed. |
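For readers unfamiliar with the approach described in this report, the following is a minimal Python sketch of trigram-based language guessing with non-word characters excluded and a short-word list used as a tie-breaking aid. It is not the text_cat code: the function names, the `bonus` and `size` parameters, and the rank-based distance are assumptions chosen for illustration, and the real tool may score languages differently. The `candidates` argument stands in for the per-file language selection made in the xsl-files.

```python
import re
from collections import Counter

# Letters only: digits, punctuation and underscores count as non-word characters.
WORD_RE = re.compile(r"[^\W\d_]+")


def trigram_profile(text, size=300):
    """Map each of the `size` most frequent trigrams to its rank (0 = most frequent)."""
    counts = Counter()
    for word in WORD_RE.findall(text.lower()):
        padded = f"_{word}_"  # mark word boundaries
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(size))}


def rank_distance(doc, model, penalty):
    """Sum of rank differences; trigrams missing from the model cost `penalty`."""
    return sum(
        abs(rank - model[gram]) if gram in model else penalty
        for gram, rank in doc.items()
    )


def guess_language(text, models, short_words, candidates=None, bonus=50, size=300):
    """Return the best-scoring language code, or None if the text has no letters.

    models:      {lang: trigram profile}   (hypothetical in-memory language models)
    short_words: {lang: set of short, frequent words}
    candidates:  optional subset of languages to test (assumed to mirror the
                 per-file language selection); default is all models
    """
    doc = trigram_profile(text, size)
    if not doc:
        return None
    words = WORD_RE.findall(text.lower())
    scores = {}
    for lang in candidates or models:
        score = rank_distance(doc, models[lang], penalty=size)
        # Each hit in the short-word list lowers (improves) the distance.
        hits = sum(1 for w in words if w in short_words.get(lang, ()))
        scores[lang] = score - bonus * hits
    return min(scores, key=scores.get)
```

A file marked as monolingual would simply skip the call to `guess_language` and keep its declared language. Because the short-word bonus only shifts the trigram distance, it helps most when two languages share few short words, which is consistent with the observation above that it separated the Sámi languages from nob better than it separated nob from nno.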