Corpus collector's manual
This file provides an overview of the corpus conversion process. When a new document is received, it is classified according to language and genre and stored in the corresponding place in the directory structure for the original files. The original files are left untouched. The text and structural information contained in the document are extracted by a conversion script and transformed into an xml-file. The xml-files are stored in a separate, parallel directory hierarchy, where they can be easily accessed and used as test material for grammar checkers and spellers. The metainformation associated with the document, such as the name of the author, is stored in an xsl-file and added to the xml-file during the conversion. The metainformation is also used elsewhere, in other corpus applications.
The first section explains the conversion process step by step. It is followed by more detailed descriptions of the directory structure and the handling of the file-specific xsl-files. There is also a short section on the log files and the different uses of the convert2xml.pl script.
Basic conversion process
When you have a file that you want to add to our corpus directory, you have two alternatives: you can either use the web upload script at http://www.divvun.no/upload/upload_corpus_file.html or add the file manually to the corpus subversion repositories. To add the file manually:
- Copy the file to the correct place in your working copy of the corpus repositories.
- Run the convert2xml.pl script for the file, giving the language as a command line option. For example, an MS Word document is converted with the command below. A relative path is fine as well, but do not use a tilde (~).
convert2xml.pl --lang=sme file_name.doc
The conversion takes a while, especially if the file is large or has a lot of structure. If you want to see all the error messages generated during the conversion, use the option --nolog.
- If there are fatal errors during the conversion, you will see an error message stating "ERROR <something>". In that case, consult the maintainers of the corpus files or see the log file (in <path-to-your-working-copy>/tmp) for details. The script also informs you of the character conversion and the original (or intermediate) encoding.
The error messages, warnings and other information from the conversion are stored in a log file. The log file is named after the current date and time and stored in a temporary directory, by default the directory tmp/ under corpdir. The directory can also be given as a command line option (--tmpdir). The log file contains the information that was printed to STDERR during the conversion. Errors that prevent a document from being converted are still printed to the screen as well. If you want all the errors to be printed to the screen, use the option --nolog. An example of a log file name: tmp/Feb-10-8-57.log
There are several options that control the conversion process. It is possible, for example, to skip the xsl-processing, the character decoding, or the hyphenation. The usage message of the script lists them:
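The naming scheme of the log file can be illustrated with a short Python sketch. It is not taken from convert2xml.pl: the exact format (abbreviated month, day, hour and minute without zero padding) is only inferred from the single example name tmp/Feb-10-8-57.log, and the function name log_file_name is hypothetical.

```python
from datetime import datetime

# English month abbreviations, hard-coded so the result does not depend on the locale.
MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]


def log_file_name(when, tmpdir="tmp"):
    """Build a log file name in the style of tmp/Feb-10-8-57.log.

    Assumption: the fields are month-day-hour-minute without zero padding,
    as inferred from the one example in the manual.
    """
    month = MONTHS[when.month - 1]
    return f"{tmpdir}/{month}-{when.day}-{when.hour}-{when.minute}.log"


# The year below is arbitrary; it does not appear in the file name.
print(log_file_name(datetime(2024, 2, 10, 8, 57)))
```

A name built this way sorts poorly and repeats every year, so if you need to find a specific log it is easiest to look at the newest file in the tmp directory.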
Usage: convert2xml.pl [OPTIONS] [FILE|DIR]
The available options:
    --xsl=<file>      The xsl-file which is used in the conversion. If not specified, the default values are used.
    --dir=<dir>       The directory where to search for converted files. If not given, only FILE is processed.
    --tmpdir=<dir>    The directory where the log and other temporary files are stored.
    --nolog           Print error messages to screen, not to log files.
    --corpdir=<dir>   The corpus directory, default is where you have checked out your working copy.
    --no-decode       Do not decode the characters.
    --multi-coding    Resolve the character coding separately for each paragraph.
    --no-hyph         Do not add <hyph/> tags.
    --no-xsl          Skip the file-specific xsl-processing.
    --all-hyph        Tag all hyphens (default is at the end of the line).
    --upload          Do conversion in the upload-directory.
    --lang            The main language of the document.
    --help            Print this message and exit.
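If the script is called from another program, the command line can be assembled as an argument list. The following Python sketch uses only options listed in the usage message; the helper name build_command and its parameters are hypothetical, and the sketch only builds the list without running anything.

```python
def build_command(path, lang=None, corpdir=None, tmpdir=None, nolog=False):
    """Return an argument list for convert2xml.pl (hypothetical helper).

    Only flags documented in the usage message are used. The manual warns
    against the tilde, so such paths are rejected here.
    """
    for value in (path, corpdir, tmpdir):
        if value is not None and value.startswith("~"):
            raise ValueError(f"do not use tilde in paths: {value}")
    command = ["convert2xml.pl"]
    if lang is not None:
        command.append(f"--lang={lang}")
    if corpdir is not None:
        command.append(f"--corpdir={corpdir}")
    if tmpdir is not None:
        command.append(f"--tmpdir={tmpdir}")
    if nolog:
        command.append("--nolog")
    command.append(path)  # the file or directory to convert comes last
    return command


print(build_command("file_name.doc", lang="sme"))
```

The resulting list can be passed to subprocess.run on a machine where convert2xml.pl is on the PATH.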
The file to be converted is given to convert2xml.pl as a command line argument. If the argument is a directory, all files in that directory and its subdirectories are converted. The file types supported at the moment are doc, pdf, html, text and paratext. The corresponding file suffixes are .doc, .pdf, .html, .txt and .ptx. From the user's point of view, there is no difference between the file types; the technical documentation is provided in corpus_conversion_tech.html.
Some documents get wrongly utf-8-encoded by the conversion tools, and they are fixed using the Perl module samiChar::Decode.pm. It is installed on victorio, and you should not notice its presence. If you are converting files on some other machine, you should install the module. Another Perl module that is needed is XML::Twig. Instructions for installing both modules can be found in corpus_conversion_tech.html.
Suppose the original file is original.pdf. It can be converted, for example, with the command
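The relation between file suffixes and file types can be written down as a small lookup table. This Python sketch only restates the list above; the function name file_type is hypothetical, and ignoring the case of the suffix is an assumption, not documented behaviour of convert2xml.pl.

```python
from pathlib import PurePosixPath

# Supported suffixes and the file types they stand for, as listed in the manual.
SUFFIX_TO_TYPE = {
    ".doc": "doc",
    ".pdf": "pdf",
    ".html": "html",
    ".txt": "text",
    ".ptx": "paratext",
}


def file_type(name):
    """Return the file type for a file name, or None if the suffix is not supported.

    Assumption: the suffix is compared case-insensitively.
    """
    return SUFFIX_TO_TYPE.get(PurePosixPath(name).suffix.lower())


print(file_type("file_name.doc"))
```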
$ perl convert2xml.pl --corpdir=/home/mydir/samipdf --tmpdir=mytmp original.pdf
Generally each file has only one character coding, which may be wrongly utf-8-encoded. The best way to decode the file is determined statistically. With the option --multi-coding, the file is assumed to contain more than one character coding, and the codings are resolved separately for each paragraph.
The result is a fully converted xml-file, original.pdf.xml. When working with the full directory hierarchy, the original file is expected to be found in some subdirectory under /usr/local/share/corp/orig. The resulting xml-file is then generated in the corresponding subdirectory under /usr/local/share/corp/bound. If you have the same hierarchy in the corpdir given on the command line, the process is similar. You may use relative path names for the files and directories, but do not use a tilde (~).
The hyphenation points are tagged as <hyph/>. By default, the script tags hyphens that are found at line breaks or at the end of a paragraph and are followed by a suitable word. If all the hyphenation points should be tagged, convert2xml.pl can be called with the option --all-hyph.
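The difference between decoding the whole file at once and decoding it paragraph by paragraph can be illustrated with a toy Python sketch. It is not the algorithm of samiChar::Decode.pm. It assumes that the garbling arose from 8-bit bytes being misread as Latin-1, it uses a hypothetical list of candidate codings, and its "statistics" are simply a count of typical Sámi letters.

```python
# Letters whose presence suggests a correct decoding (assumption of this sketch).
SAMI_LETTERS = set("áčđŋšŧžÁČĐŊŠŦŽ")
# Hypothetical candidate codings; the real module knows its own set.
CANDIDATES = ["iso8859_4", "mac_roman", "cp1252"]


def score(text):
    """Count the typical Sámi letters in the text."""
    return sum(ch in SAMI_LETTERS for ch in text)


def best_decoding(garbled):
    """Try each candidate coding and keep the result with the highest score."""
    try:
        raw = garbled.encode("latin-1")  # recover the original bytes
    except UnicodeEncodeError:
        return garbled  # not the kind of garbling this sketch handles
    best = garbled
    for codec in CANDIDATES:
        try:
            candidate = raw.decode(codec)
        except UnicodeDecodeError:
            continue
        if score(candidate) > score(best):
            best = candidate
    return best


def decode_text(garbled, multi_coding=False):
    """Decode the whole text at once, or each paragraph (line) separately."""
    if not multi_coding:
        return best_decoding(garbled)
    return "\n".join(best_decoding(p) for p in garbled.split("\n"))
```

In this sketch a text whose paragraphs come from different codings is only restored correctly with multi_coding=True, because a single coding chosen for the whole text fits at most one of the paragraphs.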
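The mapping from an original file to its xml-file can be sketched in Python as follows. The function name bound_path and the subdirectories sme/facta in the example are hypothetical; only the orig and bound directories and the added .xml suffix come from the description above.

```python
from pathlib import PurePosixPath


def bound_path(orig_file, corpdir="/usr/local/share/corp"):
    """Return the expected xml-file path for an original file (hypothetical helper).

    The file must lie under <corpdir>/orig; otherwise ValueError is raised.
    """
    corp = PurePosixPath(corpdir)
    relative = PurePosixPath(orig_file).relative_to(corp / "orig")
    # Same subdirectory under bound/, with .xml appended to the full file name.
    return corp / "bound" / relative.parent / (relative.name + ".xml")


# The language and genre subdirectories here are made up for the example.
print(bound_path("/usr/local/share/corp/orig/sme/facta/original.pdf"))
```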
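The effect of the two hyphenation modes can be illustrated with a simplified Python sketch. It is not the implementation in convert2xml.pl: it treats any word character after the hyphen as a "suitable word", and it assumes that the tag replaces the hyphen and the line break, which the manual does not specify.

```python
import re

# A hyphen between two word characters, with a line break after the hyphen.
LINE_END_HYPHEN = re.compile(r"(?<=\w)-\s*\n\s*(?=\w)")
# Any hyphen directly between two word characters.
ANY_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def tag_hyphens(text, all_hyph=False):
    """Replace hyphenation points with <hyph/> (simplified sketch).

    By default only hyphens at line breaks are tagged; with all_hyph=True
    the hyphens inside lines are tagged as well.
    """
    text = LINE_END_HYPHEN.sub("<hyph/>", text)
    if all_hyph:
        text = ANY_HYPHEN.sub("<hyph/>", text)
    return text


sample = "kon-\nverteren giella-teknologiija"
print(tag_hyphens(sample))
print(tag_hyphens(sample, all_hyph=True))
```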
by Tomi Pieski, Saara Huhmarniemi