Pre- and postprocessing

Converting text between different formats, as done by scripts such as preprocess and lookup2cg. Corpus-related pre- and postprocessing tools (the aligner, language recognition) also belong here.


Component Default Assignee
abbr.txt Sjur Nørstebø Moshagen
The abbreviation and idiom component
casing and spellrelax Trond Trosterud
The case.regex, spellrelax.regex and allcaps.regex xfst files.
catxml Børre Gaup
Bugs relating to our corpus extraction tool catxml
ccat Børre Gaup
Bugs connected to our corpus extraction tool ccat (ccat has replaced xmlcat)
cg2visl Sjur Nørstebø Moshagen
Converting text from vislcg output to the input required by the pedagogical visl program.
Børre Gaup
A Perl script that takes the content of tags out of XML files, sends it to standard grammatical analysis, and puts it back again.
hfst-preprocess Sjur Nørstebø Moshagen
Language recognition with text_cat Børre Gaup
The text_cat tool and its accompanying language files are used to detect the language of text in mixed-language documents.
lookup2cg Sjur Nørstebø Moshagen
Changing text from lookup output to vislcg input.
preprocess file Sjur Nørstebø Moshagen
The perl file
Text aligner Børre Gaup
The parallel text aligner tool from Bergen and its behaviour.
The conversion scripts Børre Gaup
The scripts for converting files, especially from external formats to our internal latin-1-and-digraphs format