Language recognition using text_cat

To identify sections of a document that are not in its main language, we need automatic language recognition. We have installed an open-source package that performs this task, and this page documents its origin and usage.

Source

The TextCat package has home pages at several locations. The Groningen home page also includes links to a background article, a list of the languages supported out of the box, and a list of competing tools. There is also a demo page, which gives the author's e-mail address.

Usage

The tool text_cat itself is installed in gt/scripts/, and basic usage is explained by:

text_cat -h

Typical usage will be something like:

text_cat -l "What language is this"

Or:

text_cat <input-file>

In both cases text_cat prints one or more strings naming the language(s) it believes the text to be written in.
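
To check several files in one go, the file form of the command can be wrapped in a small shell function. This is only a sketch: the function name guess_languages is made up for illustration, and it assumes that text_cat is on the PATH and takes a file name as its argument, as in the example above.

```shell
# Sketch: print "<file>: <guess>" for every file given as an argument.
# guess_languages is a hypothetical helper, not part of the TextCat package.
# Assumes text_cat is on PATH and accepts a file name as its argument.
guess_languages() {
    for f in "$@"; do
        if [ ! -r "$f" ]; then
            # Report unreadable files on stderr and carry on with the rest.
            printf '%s: cannot read file\n' "$f" >&2
            continue
        fi
        printf '%s: %s\n' "$f" "$(text_cat "$f")"
    done
}
```

Calling guess_languages chapter1.txt chapter2.txt would then print one line per file, with whatever text_cat reports for it.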

Adding a new recognizable language

The text_cat reference files are stored in $GTHOME/tools/lang-guesser.

Adding a new language to be recognized requires a suitable training corpus to be built. This is most easily done with the accompanying tool random_lines:

random_lines < some-text-file > ShortTexts/language-name.txt

This command extracts random lines of text from the input file and stores them in the output file, cleaning the text up a little along the way. The resulting file is then used to build a language model like this:

text_cat -n < ShortTexts/language-name.txt > LM/language-name.lm

After this, the language recognition tool text_cat is ready for use with another language as shown in the previous section.
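
The two steps can be combined into one shell function. The following is a minimal sketch, not part of the package: the name add_language and its two arguments are invented here, and it assumes that it is run from $GTHOME/tools/lang-guesser (so that ShortTexts/ and LM/ are the directories used above) with random_lines and text_cat on the PATH.

```shell
# Sketch: build a language model for one new language.
# add_language is a hypothetical helper; run it from
# $GTHOME/tools/lang-guesser with random_lines and text_cat on PATH.
add_language() {
    lang=$1      # label used for the file names, e.g. the language name
    corpus=$2    # plain-text file written in that language

    # Harmless if the directories already exist.
    mkdir -p ShortTexts LM

    # Step 1: sample random lines from the corpus as training text.
    random_lines < "$corpus" > "ShortTexts/$lang.txt" || return 1

    # Step 2: build the language model from the sampled text.
    text_cat -n < "ShortTexts/$lang.txt" > "LM/$lang.lm" || return 1
}
```

For example, add_language language-name some-text-file would produce the same two files as the commands above.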

by Sjur N. Moshagen