Corpus collector's manual
This file provides an overview of the corpus conversion process. When a new document is received, it is classified by language and genre and stored accordingly in the directory structure for the original files. The original files are left untouched. The text and structural information contained in the document is extracted by a conversion script and transformed into an xml-file. Xml-files are stored in a separate, parallel directory hierarchy where they can be easily accessed and used as test material for grammar checkers and spellers. The metainformation associated with the document, such as the name of the author, is stored in an xsl-file and appended to the xml-file during the conversion. The metainformation is also used by other corpus applications.
The first section explains the conversion process step by step. It is followed by more detailed descriptions of the directory structure and the handling of the file-specific xsl-files. There is also a short section on the log files and the different uses of convert2xml.
Basic conversion process
- Copy the file to the correct place in your working copy of the corpus repositories.
- Run convert2xml on the file, for example an MS Word document.
- If there are fatal errors during the conversion, you will see an error message stating «Could not convert file_name.doc». The error log of the file is found in «file_name.doc.log».
If a file is not converted, the error messages, warnings and other information from the conversion are stored in a log file. The log file is named after the current file and stored in the same directory as the file. It contains the information that was printed to STDERR during the conversion.
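The naming rule for log files can be used to list the files that failed to convert. The following is a minimal Python sketch, not part of convert2xml: the function names log_path_for and failed_conversions are hypothetical, and it assumes that a log file is present only when the conversion of the file next to it failed.

```python
from pathlib import Path


def log_path_for(source):
    """Return the assumed log path: the source file name plus '.log'."""
    source = Path(source)
    return source.with_name(source.name + ".log")


def failed_conversions(directory):
    """Return (source, log) pairs for every source file with a log beside it.

    Assumption: a '.log' file next to a source file means that the
    conversion of that file failed.
    """
    pairs = []
    for log in sorted(Path(directory).rglob("*.log")):
        # Strip the trailing '.log' to recover the source file name.
        source = log.with_name(log.name[: -len(".log")])
        if source.is_file():
            pairs.append((source, log))
    return pairs
```

For example, failed_conversions("orig/sme") would return one pair per file under orig/sme that has a log file beside it.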
Follow the instructions on our getting started page. Then install convert2xml by running:
cd $GTHOME/tools/CorpusTools
python setup.py install --install-scripts=/usr/local/bin
usage: convert2xml [-h] source

Convert original files to giellatekno xml.

positional arguments:
  source      either a file to be converted, or a directory containing files
              to be converted

optional arguments:
  -h, --help  show this help message and exit
The file to be converted is given to convert2xml as a command line argument. If the argument is a directory, all files in that directory and in the directories inside it are converted. The file types supported at the moment are doc, pdf, html, text and paratext; the corresponding file suffixes are .doc, .pdf, .html, .txt and .ptx. From the user's point of view there is no difference between the file types. The technical documentation is provided at corpus_conversion_tech.html.
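To preview which files a run on a directory would pick up, the suffix list above can be applied to a directory tree. This is a minimal Python sketch under stated assumptions, not the selection logic of convert2xml itself: the name convertible_files is hypothetical, and matching on the exact lower-case suffix is an assumption.

```python
from pathlib import Path

# Suffixes taken from the list of supported file types above.
SUPPORTED_SUFFIXES = {".doc", ".pdf", ".html", ".txt", ".ptx"}


def convertible_files(source):
    """Return the files under source that have a supported suffix.

    source may be a single file or a directory; directories are
    searched recursively, as described for convert2xml.
    """
    source = Path(source)
    if source.is_file():
        candidates = [source]
    else:
        candidates = sorted(p for p in source.rglob("*") if p.is_file())
    # Assumption: only the last suffix counts, so 'a.doc.log' is skipped.
    return [p for p in candidates if p.suffix in SUPPORTED_SUFFIXES]
```

Because only the last suffix is checked, log files such as file_name.doc.log and converted files such as original.pdf.xml are left out of the result.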
Suppose the original file is original.pdf. Convert it with, for example, the command
$ convert2xml original.pdf
The result is a fully converted xml-file, original.pdf.xml. If the file is placed inside a corpus hierarchy, e.g. orig/sme/facta/original.pdf, the converted file is written to converted/sme/facta/original.pdf.xml, where orig and converted are parallel directories. Otherwise a directory named converted is created, and the result of the conversion is written to converted/original.pdf.xml.
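The mapping from an original file to its converted counterpart can be sketched in a few lines of Python. This is an illustration of the rule described above, not the code used by convert2xml. The name converted_path is hypothetical. It assumes that the .xml suffix is appended in both cases, as in the original.pdf.xml example, and that outside a corpus hierarchy the converted directory is created beside the source file.

```python
from pathlib import Path


def converted_path(source):
    """Return the assumed output path for a source file."""
    source = Path(source)
    xml_name = source.name + ".xml"
    dirs = list(source.parent.parts)
    if "orig" in dirs:
        # Inside a corpus hierarchy: swap 'orig' for the parallel
        # 'converted' directory and keep the rest of the path.
        dirs[dirs.index("orig")] = "converted"
        return Path(*dirs) / xml_name
    # Assumption: otherwise 'converted' is made beside the source file.
    return source.parent / "converted" / xml_name
```

With these assumptions, orig/sme/facta/original.pdf maps to converted/sme/facta/original.pdf.xml, and a bare original.pdf maps to converted/original.pdf.xml.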
by Tomi Pieski, Saara Huhmarniemi, Børre Gaup