Corpus collector's manual

Introduction

This file provides an overview of the corpus conversion process. When a new document is received, it is classified by language and genre and stored accordingly in the directory structure for the original files. The original files are left untouched. A conversion script extracts the text and structural information contained in the document and transforms them into an xml-file. Xml-files are stored in a separate, parallel directory hierarchy, where they can be easily accessed and used as test material for grammar checkers and spellers. The metainformation associated with the document, such as the name of the author, is stored in an xsl-file and appended to the xml-file during the conversion. The metainformation is also used elsewhere, in different corpus applications.

The first section explains the conversion process step by step. It is followed by more detailed descriptions of the directory structure and the handling of the file-specific xsl-files. There is also a short section on the log files and the different uses of convert2xml.

Basic conversion process

Manual conversion

  1. Copy the file to the correct place in your working copy of the corpus repositories.
  2. Run convert2xml on the file. For example, for an MS Word document:

    convert2xml file_name.doc
                            
  3. If there are fatal errors during the conversion, you will see an error message stating «Could not convert file_name.doc». The error log of the file is found in «file_name.doc.log».
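
The steps above can also be scripted. The following is a minimal Python sketch, not part of CorpusTools: the helper name `convert_and_check` and its `command` parameter are hypothetical. It assumes, as described in step 3, that a failed conversion leaves file_name.doc.log next to the original file. It does not rely on the exit status of convert2xml, since that is not documented here.

```python
import subprocess
from pathlib import Path


def convert_and_check(source, command=("convert2xml",)):
    """Run the converter on one file and return the log path if a log exists.

    Hypothetical helper: `command` defaults to the convert2xml script and
    can be overridden. Returns None when no log file is found afterwards.
    """
    source = Path(source)
    # check=False: the exit status of convert2xml is not assumed here.
    subprocess.run([*command, str(source)], check=False)
    # A failed conversion is assumed to leave <file name>.log beside the file.
    log = source.with_name(source.name + ".log")
    return log if log.exists() else None
```

Calling `convert_and_check("file_name.doc")` would return the path file_name.doc.log if such a file exists after the run, and None otherwise. Note that a log file left over from an earlier run would also be reported.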

Log files

If a file is not converted, the error messages, warnings and other information from the conversion are stored in a log file. The log file is named after the file being converted and is stored in the same directory as that file. It contains the information that was printed to STDERR during the conversion.
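
Because each log file sits next to the file it belongs to, the files that failed to convert can be listed by searching for log files. The sketch below is only an illustration under that naming convention (the log of file_name.doc is file_name.doc.log); the function name `failed_conversions` is hypothetical and not part of convert2xml.

```python
from pathlib import Path


def failed_conversions(root):
    """Return (original, log) pairs for every log file found under root.

    Assumes the naming described above: the log of file_name.doc is
    file_name.doc.log in the same directory.
    """
    pairs = []
    for log in sorted(Path(root).rglob("*.log")):
        if log.name == ".log":
            continue  # no original file name can be derived from this
        original = log.with_name(log.name[: -len(".log")])
        # Only report logs whose original file is still present.
        if original.is_file():
            pairs.append((original, log))
    return pairs
```

The log files themselves can then be read as ordinary text files to see what was printed to STDERR.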

convert2xml

Follow the instructions on our getting started page. Then install convert2xml by running:

    cd $GTHOME/tools/CorpusTools
    python setup.py install --install-scripts=/usr/local/bin

usage: convert2xml [-h] source

Convert original files to giellatekno xml.

positional arguments:
  source      either a file to be converted, or a directory containing files
              to be converted

optional arguments:
  -h, --help  show this help message and exit

                

The file to be converted is given to convert2xml as a command line argument. If the argument is a directory, all files in that directory and in its subdirectories are converted. The file types currently supported are doc, pdf, html, text and paratext; the corresponding file suffixes are .doc, .pdf, .html, .txt and .ptx. From the user's point of view, there is no difference between the file types. The technical documentation is provided in corpus_conversion_tech.html.
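
Before converting a whole directory, it can be useful to preview which files have one of the supported suffixes. The following Python sketch is an illustration only and does not reproduce how convert2xml selects files internally. The names `SUPPORTED_SUFFIXES` and `convertible_files` are hypothetical, and matching the suffix case-insensitively is an assumption.

```python
import os

# Suffixes listed in the text above.
SUPPORTED_SUFFIXES = (".doc", ".pdf", ".html", ".txt", ".ptx")


def convertible_files(root):
    """Yield paths under root whose suffix is one of the supported ones."""
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # deterministic traversal order
        for name in sorted(filenames):
            # Assumption: suffixes are compared case-insensitively.
            if name.lower().endswith(SUPPORTED_SUFFIXES):
                yield os.path.join(dirpath, name)
```

For example, `list(convertible_files("orig/sme"))` would list the candidate files below orig/sme, including those in its subdirectories.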

Suppose the original file is original.pdf. Run, for example, the command:

    convert2xml original.pdf


The resulting file is a fully converted xml-file, original.pdf.xml. If the file is placed inside a corpus hierarchy, e.g. orig/sme/facta/original.pdf, the converted file is written to converted/sme/facta/original.pdf.xml, where orig and converted are parallel directories. Otherwise, a directory named converted is created and the result of the conversion is written to converted/original.pdf.xml.
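
The mapping from an original file to its converted file can be sketched in Python as follows. This is a hedged illustration, not the code used by convert2xml: the function name `converted_path` is hypothetical. It assumes that the .xml suffix is appended in both cases, and that outside a corpus hierarchy the converted directory is created beside the original file.

```python
from pathlib import PurePosixPath


def converted_path(source):
    """Map an original file path to the path of its converted xml-file."""
    path = PurePosixPath(source)
    parts = list(path.parts)
    if "orig" in parts:
        # Inside a corpus hierarchy: swap orig for the parallel directory.
        parts[parts.index("orig")] = "converted"
        target = PurePosixPath(*parts)
    else:
        # Assumption: the converted directory is created beside the file.
        target = path.parent / "converted" / path.name
    # Assumption: the converted file name is the original name plus .xml.
    return str(target.with_name(target.name + ".xml"))
```

With these assumptions, `converted_path("orig/sme/facta/original.pdf")` gives converted/sme/facta/original.pdf.xml, and `converted_path("original.pdf")` gives converted/original.pdf.xml.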

by Tomi Pieski, Saara Huhmarniemi, Børre Gaup