Corpus collector's manual
This file provides an overview of the corpus conversion process. When a new document is received, it is classified according to language and genre and stored accordingly in the directory structure for the original files. All the original files are left untouched. The text and structural information contained in the document is extracted by a conversion script and transformed into an xml-file. The xml-files are stored in a separate, parallel directory hierarchy, where they can be easily accessed and used as test material for grammar checkers and spellers. The metainformation associated with the document, such as the name of the author, is stored in an xsl-file and appended to the xml-file during the conversion. The metainformation is also used elsewhere, in different corpus applications.
The file to be converted is given to convert2xml as a command line argument. If the argument is a directory, all files in that directory and in the directories inside it are converted. The file types supported at the moment are doc, pdf, html, text and paratext. From the user's point of view, there is no difference between the file types; the technical documentation is provided in corpus_conversion_tech.html.
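The handling of file and directory arguments described above can be illustrated with a minimal Python sketch. The function name expand_arguments is hypothetical and is not part of convert2xml; the sketch only shows the idea of descending into the directories and does not filter by file type.

```python
import os
from pathlib import Path


def expand_arguments(arguments):
    """Return the files named by the arguments, descending into directories.

    Hypothetical helper for illustration only; not the convert2xml code.
    """
    files = []
    for argument in arguments:
        path = Path(argument)
        if path.is_dir():
            # os.walk visits the directory itself and every directory inside it.
            for root, _dirs, names in os.walk(path):
                files.extend(Path(root) / name for name in names)
        else:
            # A plain file argument is kept as it is.
            files.append(path)
    return sorted(files)
```

Calling expand_arguments(["dir1", "original.pdf"]) would return original.pdf together with every file found under dir1, including the files in its subdirectories.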
Follow the instructions on our getting started page. Then install convert2xml by running
cd $GTHOME/tools/CorpusTools
python setup.py develop --user
mv $HOME/.local/bin/* $HOME/bin
Convert a file
convert2xml original.pdf
The resulting file is a fully converted xml-file, original.pdf.xml.
If the file is placed inside a corpus hierarchy, e.g. orig/sme/facta/original.pdf, then the converted file will be written to converted/sme/facta/original.pdf.xml, where orig and converted are parallel directories.
Otherwise a directory named converted will be made, and the result of the conversion is written to converted/original.pdf.xml.
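The two cases can be summarised in a short Python sketch. The function converted_path is hypothetical and only illustrates the naming rule; it is not the code used by convert2xml. It assumes that the converted file gets the .xml suffix in both cases, and that the converted directory made outside a corpus hierarchy is placed next to the original file, which this manual does not specify.

```python
from pathlib import Path


def converted_path(original):
    """Map the path of an original file to the path of its converted xml-file.

    Hypothetical helper for illustration only; not the convert2xml code.
    """
    path = Path(original)
    parts = list(path.parts)
    xml_name = path.name + ".xml"
    if "orig" in parts[:-1]:
        # Inside a corpus hierarchy: replace the orig directory with the
        # parallel converted directory and keep the rest of the path.
        parts[parts.index("orig")] = "converted"
        parts[-1] = xml_name
        return Path(*parts)
    # Outside a corpus hierarchy: assume that the converted directory is
    # made next to the original file.
    return path.parent / "converted" / xml_name
```

Under these assumptions, converted_path("orig/sme/facta/original.pdf") gives converted/sme/facta/original.pdf.xml, and converted_path("original.pdf") gives converted/original.pdf.xml.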
Convert all the files in directories
convert2xml dir1
convert2xml dir1 dir2
convert2xml dir*
If a file is not converted, the error messages, warnings and other information about the conversion are stored in a log file. The log file is named after the file in question, with a .log extension, and is stored in the same directory as the file.
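After converting a whole directory, it can be useful to see which files left a log file behind. The following Python sketch is one way to do that. Both function names are hypothetical, and the sketch assumes that .log is appended to the full file name (original.pdf becomes original.pdf.log), which is an interpretation of the naming rule above rather than documented behaviour.

```python
from pathlib import Path


def log_path(original):
    """Return the assumed log file path: same directory, file name plus .log."""
    path = Path(original)
    return path.with_name(path.name + ".log")


def files_with_logs(directory):
    """List the files under directory that have a log file next to them.

    Hypothetical helper for illustration only; not part of CorpusTools.
    """
    logs = sorted(Path(directory).rglob("*.log"))
    # Dropping the trailing .log recovers the name of the file that was logged.
    return [log.with_suffix("") for log in logs]
```

For example, files_with_logs("orig") would list every file under orig whose conversion produced a log file, so that these files can be inspected one by one.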
Working with the corpus
by Tomi Pieski, Saara Huhmarniemi, Børre Gaup