Corpus collector's manual


This file provides an overview of the corpus conversion process. When a new document is received, it is classified by language and genre and stored accordingly in the directory structure for original files. The original files are left untouched. A conversion script extracts the text and structural information from the document and transforms it into an xml-file. The xml-files are stored in a separate, parallel directory hierarchy, where they can be easily accessed and used as test material for grammar checkers and spellers. The metainformation associated with the document, such as the name of the author, is stored in an xsl-file and added to the xml-file during the conversion. This metainformation is also used by other corpus applications.


The file to be converted is given to convert2xml as a command line argument. If the argument is a directory, all files in that directory and its subdirectories are converted. The file types currently supported are doc, pdf, html, text and paratext. From the user's point of view there is no difference between the file types; the technical documentation is provided in corpus_conversion_tech.html.

Install convert2xml

Follow the instructions on our getting started page. Then install convert2xml by running

cd $GTHOME/tools/CorpusTools
python setup.py develop --user
mv $HOME/.local/bin/* $HOME/bin

Convert a file

convert2xml original.pdf

The result is a fully converted xml-file named original.pdf.xml.

If the file is placed inside a corpus hierarchy, e.g. orig/sme/facta/original.pdf, the converted file is written to converted/sme/facta/original.pdf.xml, where orig and converted are parallel directories.

Otherwise a directory named converted is created, and the result of the conversion is written to converted/original.pdf.xml.
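
The naming rules above can be illustrated with a short Python sketch. It is not the code convert2xml uses; the function name converted_path is hypothetical, and the sketch assumes that the .xml suffix is appended in both cases and that the fallback converted directory is relative to the working directory.

```python
from pathlib import PurePosixPath


def converted_path(original: str) -> str:
    """Return where the converted xml-file is expected (illustrative only)."""
    path = PurePosixPath(original)
    parts = list(path.parts)
    if "orig" in parts:
        # Inside a corpus hierarchy: swap the first orig directory for the
        # parallel converted directory and keep the rest of the path.
        parts[parts.index("orig")] = "converted"
        target = PurePosixPath(*parts)
    else:
        # Outside a hierarchy: assume a converted directory relative to the
        # working directory (an assumption, not confirmed by the manual).
        target = PurePosixPath("converted") / path.name
    # Assumption: the .xml suffix is appended to the full original name.
    return str(target) + ".xml"


print(converted_path("orig/sme/facta/original.pdf"))
print(converted_path("original.pdf"))
```

A helper like this can be handy for checking whether a converted file already exists before starting a new conversion.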

Convert all the files in directories


convert2xml dir1
convert2xml dir1 dir2
convert2xml dir*
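
To see beforehand which files a directory argument covers, the recursive walk can be sketched in Python. This is only an illustration of "all files in the directory and the directories inside it"; the function name files_under is hypothetical, and the real tool may skip some files (for example by file type), which this sketch does not attempt to model.

```python
import os


def files_under(directory: str) -> list[str]:
    """List every regular file below directory, including subdirectories."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    # Sort so that the listing is stable between runs.
    return sorted(found)
```

Calling files_under("dir1") returns the paths below dir1, which can be compared with the converted directory after running convert2xml dir1.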

Log files

If a file cannot be converted, the error messages, warnings and other information about the conversion are stored in a log file. The log file is named after the file being converted, with a .log extension, and is stored in the same directory as that file.
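
After converting a large directory, the log files give a quick way to find the documents that failed. The Python sketch below is a hedged illustration: the function name failed_conversions is hypothetical, and it assumes that the log file name is the full original file name followed by .log (for example original.pdf.log next to original.pdf).

```python
from pathlib import Path


def failed_conversions(root: str) -> list[str]:
    """Return original files under root that have a sibling .log file."""
    failed = []
    for log in sorted(Path(root).rglob("*.log")):
        # Assumed naming: original.pdf.log belongs to original.pdf.
        source = log.with_suffix("")
        if source.is_file():
            failed.append(str(source))
    return failed
```

Each returned path points at an original file whose log is worth reading before the conversion is tried again.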

Working with the corpus

by Tomi Pieski, Saara Huhmarniemi, Børre Gaup