Bug 273 - File names cause errors in corpus conversion.
Alias: None
Product: Corpus
Classification: Unclassified
Component: Text corpus infrastructure
Version: unspecified
Hardware: All All
Importance: P2 - As soon as possible, normal
Assignee: Børre Gaup
Depends on: 76
Reported: 2006-04-08 18:40 CEST by Saara Huhmarniemi
Modified: 2015-03-16 14:16 CET
Description Saara Huhmarniemi 2006-04-08 18:40:27 CEST
Some of the file names in /usr/local/share/corp/orig contain non-UTF-8 characters or are otherwise troublesome. Some characters, such as parentheses, are harmful when passed to the shell, as in the file:

The non-UTF-8 characters in file names can be spotted at http://www.divvun.no/doc/lang/corp/corpus-sme.html
Not all of them cause errors in conversion.
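
A minimal sketch of how file names that are not valid UTF-8 could be listed, written in Python for illustration only (the corpus tools themselves are Perl scripts). The function names `is_valid_utf8` and `find_non_utf8_names` are hypothetical and do not come from the corpus tools.

```python
import os
from typing import List


def is_valid_utf8(raw_name: bytes) -> bool:
    """Return True if the raw file name bytes decode cleanly as UTF-8."""
    try:
        raw_name.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def find_non_utf8_names(directory: bytes) -> List[bytes]:
    """Walk a directory tree and collect paths whose file name is not UTF-8.

    The directory is passed as bytes so that os.walk yields the names as
    raw bytes instead of decoding them first.
    """
    bad = []
    for dirpath, _dirnames, filenames in os.walk(directory):
        for name in filenames:
            if not is_valid_utf8(name):
                bad.append(os.path.join(dirpath, name))
    return bad
```

Calling `find_non_utf8_names(b"/usr/local/share/corp/orig")` would return the offending paths as bytes, which can then be inspected or renamed by hand.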
Comment 1 Saara Huhmarniemi 2006-05-13 10:38:16 CEST
- The script convert2xml.pl is fixed so that files whose names contain troublesome characters (quotation marks, parentheses, &, etc.) are not accepted for conversion. In the directories under sme/gtbound and sme/gtfree, the broken files have been removed and the originals reconverted, so that only files with valid names remain. I'll write some documentation on the accepted file names (general Unix style with Sámi characters is preferred).

- The file names under sme/orig have been converted to NFC (Unicode Normalization Form C) to ease browsing and editing; the converted files in sme/gtbound and sme/gtfree end up in NFC as well.

- Some of the errors in the file names were caused by the module Decode.pm, which still had the line
"use encoding utf-8", which encoded the already UTF-8 characters as UTF-8 a second time. Other errors were caused by missing quotation marks around file names in convert2xml.pl. These are now fixed.
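
The three points above can be illustrated with a short Python sketch. It is not the code of convert2xml.pl or Decode.pm, which are Perl. The accepted character set is an assumption (word characters including Sámi letters, plus `.`, `_` and `-`), since the actual rules are not spelled out in this report, and all function names are hypothetical.

```python
import re
import shlex
import unicodedata

# Assumed whitelist: Unicode letters and digits, underscore, dot and hyphen.
# Quotation marks, parentheses, & and spaces are therefore rejected.
_ACCEPTED = re.compile(r"[\w.\-]+")


def to_nfc(name: str) -> str:
    """Normalize a file name to Unicode Normalization Form C."""
    return unicodedata.normalize("NFC", name)


def is_accepted(name: str) -> bool:
    """Check a file name against the assumed whitelist after NFC normalization."""
    return _ACCEPTED.fullmatch(to_nfc(name)) is not None


def double_encode(name: str) -> str:
    """Reproduce the double-encoding error: the UTF-8 bytes are misread as
    Latin-1 characters, which are then written out as UTF-8 again."""
    return name.encode("utf-8").decode("latin-1")


def shell_safe(name: str) -> str:
    """Quote a file name so the shell treats it as a single literal word."""
    return shlex.quote(name)
```

For example, `double_encode("á")` gives the familiar garbled pair `Ã¡`, and `shell_safe("file (1).txt")` gives `'file (1).txt'`, which a shell no longer interprets as a subshell expression. Normalizing before the whitelist check means a decomposed name (base letter plus combining accent) is judged in its composed NFC form.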

The files in smj are fixed as well, but those for other languages are not. I'll fix them too if needed.