
Corpus maintenance

This document keeps track of measures to improve the corpus collection and conversion process.

See also the sentence alignment page, which covers that specific part of corpus maintenance.

Corpus improvement project meetings

2017

2016

2014

2012

2011

2016 Metadata and corpus conversion work

OCR and conversion errors leftover from spring 2011

Task list, autumn 2011

Conversion

Issues related to conversion

Coverage

Issues related to

Parallel texts

Suggestions for detecting (flaws in) parallel texts

Where do we find texts

Corpus conversion targets

As a reminder, this is what we are aiming for:

  1. Sentence-aligned bilingual corpus for CAT
  2. Analysed mono- and bilingual corpus
    1. Lemmatised and word-aligned for terminology
    2. Fully analysed and presented for linguistic work and terminology
  3. Among other things, a one-click corpus like [europarl]