Bug 249

Summary: Add language detection as part of the corpus infra
Product: Corpus
Reporter: Saara Huhmarniemi <saara.huhmarniemi>
Component: Text corpus infrastructure
Assignee: Saara Huhmarniemi <saara.huhmarniemi>
Status: RESOLVED FIXED
Severity: normal
CC: trond.trosterud
Priority: P2 - As soon as possible    
Version: unspecified   
Hardware: All   
OS: MacOS X   
Bug Depends on: 279    
Bug Blocks:    

Description Saara Huhmarniemi 2006-02-17 18:34:52 CET
Finnish gets confused with the Sámi languages when language recognition is run with text_cat. Finnish detection should be improved by generating the language model from larger texts.
Comment 1 Saara Huhmarniemi 2006-04-22 13:23:02 CEST
text_cat has now been improved with larger texts for sme, smj, nob and nno. More texts are still required for smj and sma, and Finnish needs an update as well. text_cat should also be improved in other ways: e.g. non-word characters can affect the recognition, and the handling of Unicode characters should be checked.
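
For readers unfamiliar with this kind of classifier, here is a minimal sketch of the ranked n-gram comparison that text_cat-style tools are generally based on. It is written in Python for illustration and is not text_cat's own code; the function names, the trigram-only profile and the profile size of 300 are assumptions. It may help show why larger training texts matter: with little text, the frequency ranks in a language model are noisy.

```python
from collections import Counter

def rank_profile(text, size=300):
    """Map the `size` most frequent character trigrams to their rank (0 = most frequent)."""
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ordered = sorted(counts, key=lambda gram: (-counts[gram], gram))
    return {gram: rank for rank, gram in enumerate(ordered[:size])}

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; a trigram missing from the model costs a fixed penalty."""
    penalty = len(lang_profile)
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )

def closest_language(text, models):
    """Return the language code whose model has the smallest distance to the text."""
    doc = rank_profile(text)
    return min(models, key=lambda lang: out_of_place(doc, models[lang]))
```

Here `models` would be a dictionary from language code to a profile built with `rank_profile` from that language's training text.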
Comment 2 Saara Huhmarniemi 2006-05-15 14:32:23 CEST
Non-word characters are now excluded from the trigrams.
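
As an illustration of what excluding non-word characters can look like, here is a small Python sketch (not the actual text_cat change). The names `WORD_RE` and `trigram_counts`, the `_` word-boundary padding and the NFC normalization step are assumptions made for the example.

```python
import re
import unicodedata
from collections import Counter

# Runs of word characters other than decimal digits and the underscore.
# For str patterns, \w is Unicode-aware, so letters such as á, š or đ match.
WORD_RE = re.compile(r"[^\W\d_]+")

def trigram_counts(text):
    """Count character trigrams inside words only, padding each word with '_'."""
    # Normalize so that a letter plus a combining accent becomes one code point.
    text = unicodedata.normalize("NFC", text.lower())
    counts = Counter()
    for word in WORD_RE.findall(text):
        padded = f"_{word}_"
        for i in range(len(padded) - 2):
            counts[padded[i:i + 3]] += 1
    return counts
```

With this approach punctuation, digits and whitespace never appear inside a trigram, so they cannot shift the frequency ranks of a language model.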

Language recognition is improved by creating a short-word list for each language to help the trigram heuristics. Some languages are still missing (smj, sme, nob), so the lists are not yet in full use.
The file-specific xsl-files already make it possible to choose which languages are used when guessing the language.
The file-specific xsl-files should be updated with a variable "monolingual" for files that should not be run through language recognition; e.g. the Bible contains short paragraphs with lots of names and is generally monolingual. By default a document is assumed to be multilingual and is tested against all the languages in the LM directory.

The text_cat tool has been updated to use these variables when they are present. Extensive testing is still required.
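
The selection logic described in this comment could be sketched roughly as follows, in Python. The settings dictionary stands in for the values read from a file-specific xsl-file; the function name and the keys "monolingual" and "languages" as used here are assumptions for illustration, not the real interface.

```python
def candidate_languages(settings, available):
    """Return the language codes to test, or [] when recognition is skipped.

    settings:  hypothetical per-file options, e.g. {"monolingual": True}
               or {"languages": ["sme", "nob"]}
    available: language codes that have a model in the LM directory
    """
    if settings.get("monolingual"):
        # Monolingual files are not run through language recognition at all.
        return []
    chosen = settings.get("languages")
    if not chosen:
        # Default: the document is multilingual, test against every model.
        return list(available)
    # Only test chosen languages for which a model actually exists.
    return [lang for lang in available if lang in chosen]
```

Returning an empty list for monolingual files keeps the "skip recognition" decision in one place, so the caller only needs to check whether there is anything to test.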
Comment 3 Saara Huhmarniemi 2006-06-02 17:25:48 CEST
The file-specific xsl-files have now been updated with the variable "monolingual", which can be used for turning off language recognition. A document is multilingual by default, and the languages to be tested can be chosen in the xsl-file (remember to select 1 for the variable multilingual as well).

The short-word lists for smj, sme, sma and nob are still missing, so it is not yet clear how much the short-word lists improve the selection between e.g. nno and nob. I will leave the bug open until we get them.
Comment 4 Saara Huhmarniemi 2006-11-07 11:07:47 CET
The short-word lists are in place now, and text_cat seems to work. The selection between nob and nno did not get any better with the short-word lists, but the selection between the Sámi languages and other languages, e.g. nob, did improve.
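
One way short-word lists can be combined with trigram scores is sketched below in Python. How text_cat actually weighs the two is not described in this report, so the rule used here (most short-word hits wins, trigram distance breaks ties) is purely an assumption, as are the function names and the tiny word lists in the usage example.

```python
def short_word_hits(text, short_words):
    """Count, per language, the tokens found in that language's short-word list."""
    tokens = text.lower().split()
    return {
        lang: sum(token in words for token in tokens)
        for lang, words in short_words.items()
    }

def pick_language(trigram_distance, hits):
    """Prefer more short-word hits; on a tie, prefer the lower trigram distance."""
    return min(
        trigram_distance,
        key=lambda lang: (-hits.get(lang, 0), trigram_distance[lang]),
    )

# Illustrative lists only; real lists would be much longer.
short_words = {"nob": {"og", "ikke", "jeg"}, "sme": {"ja", "lea", "ii"}}
hits = short_word_hits("dat lea ja ii", short_words)
best = pick_language({"nob": 10, "sme": 12}, hits)
```

In this toy input the trigram distance alone slightly favours nob, but the short-word hits point to sme, which is the kind of correction the lists are meant to provide.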

There is always room for improvement in the language models (and the short-word models), but otherwise I think this bug can be marked as fixed.