Bug 211

Summary: Perl problems with UTF-8
Product: Infrastructure    Reporter: Trond Trosterud <trond.trosterud>
Component: Localisation    Assignee: Tomi Pieski <tomi.k.pieski>
Status: RESOLVED FIXED    
Severity: normal CC: saara.huhmarniemi, sjur.n.moshagen, thor.oivind.johansen, trond.trosterud
Priority: P2 - As soon as possible    
Version: unspecified   
Hardware: Macintosh   
OS: MacOS X   

Description Trond Trosterud 2005-11-17 14:19:54 CET
Under OS 10.3, after some struggling, the problem reported here did not exist. But since 10.4, it does. Here I will report the problem, a suboptimal solution, and a quest for an optimal solution.

The problem: perl has problems with UTF-8 input. It manifests itself as follows. With a command like the following (given in gt/sme/):

cat corp/*txt | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle  --minimal | grep oktii

the machine outputs the following discouraging messages (literally pages of them)
utf8 "\xC3" does not map to Unicode at /Users/trond/gt/script/preprocess line 91, <> chunk 24.
utf8 "\xC3" does not map to Unicode at /Users/trond/gt/script/preprocess line 91, <> chunk 24.
...
before returning any interesting output (the interesting output does come, eventually).

I discussed this with Thor-Øivind. On 10.3 he fixed this with an environment.plist file in the hidden directory .MacOSX/ in my home directory. On 10.4 this did not give the desired result. Instead, we did two things:
1. In the shell, we set environment variables, asking for LANG=no_NO.UTF-8 and (was it LC_ALL="C"? I don't remember the exact detail, and at the end of the day this may also turn out to be irrelevant).
2. We replaced the "preprocess" command with an explicit perl command and a -C flag, which, according to Thor-Øivind, means "beware of Unicode" or something similar. Thus, our command was changed to (note that this time we needed the path to the preprocess script file):

cat corp/*txt | perl -C ../script/preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle  --minimal | grep oktii

This time, it worked. No more ugly "does not map to Unicode". Unfortunately, the Mac didn't get a lot faster, so that bonus was not achieved, but at least I don't have to read the message any more.

Now, finally, comes the point:

This seems like a hack. We do not want to have to specify -C flags; we want perl to simply work.

Gurus in different camps: Is that possible, and how?
Comment 1 Saara Huhmarniemi 2005-11-18 17:57:09 CET
From Perldiag:
%s "\x%s" does not map to Unicode
When reading in different encodings Perl tries to map everything into Unicode characters. The bytes you read in are not legal in this encoding, for example

    utf8 "\xE4" does not map to Unicode
if you try to read in the a-diaereses Latin-1 as UTF-8.
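This is easy to reproduce from the shell (a minimal sketch; latin1.txt is a hypothetical file name, and the trailing locator of the warning is elided):

$ printf '\xe4\n' > latin1.txt
$ perl -e 'open(my $fh, "<:encoding(utf8)", shift) or die $!; print while <$fh>' latin1.txt
utf8 "\xE4" does not map to Unicode at -e line 1, ...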

However, all our files are utf8-encoded. The problem seems to occur when using the cat program. If you give the files straight to preprocess, there are no error messages:

$ preprocess --abbr=bin/abbr.txt corp/*txt | ..etc..

I don't know what is wrong with cat (or whether it is cat that causes the problems at all). Setting locales did not help, as noted earlier.
Comment 2 Trond Trosterud 2005-11-21 12:37:39 CET
Have a look at these 3 commands:

  1  cat dev/seammas.txt | perl -C ../script/preprocess --abbr=bin/abbr.txt | wc -l
  2  cat dev/seammas.txt | preprocess --abbr=bin/abbr.txt | wc -l
  3  preprocess --abbr=bin/abbr.txt dev/seammas.txt | wc -l

Numbers 1 and 3 are OK; number 2 displays the UTF-8 error. It seems Saara is right that this comes from cat garbling UTF-8 into Latin-1, and that it can be worked around either by avoiding cat or by using the -C option in perl (hmm, how come the -C flag cures the cat disease?). To me, both commands 1 and 3 are workarounds; I would rather have Everything Working Well (TM), and I hope the fact that two different commands give the right result is of some help.
Comment 3 Saara Huhmarniemi 2005-11-21 14:25:38 CET
Perl -C makes Perl follow the default locale, which means that it does not complain even if there is something wrong with the characters (getting Latin-1 when expecting utf8). I do not recommend using Perl -C, since at some point there may be misbehaviour related to locales.
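For reference, perlrun for Perl 5.8 documents what the switch expands to; a sketch, with the subflag summary paraphrased from those docs:

# A bare -C, or PERL_UNICODE set to the empty string, is equivalent to -CSDL:
#   S - treat STDIN, STDOUT and STDERR as UTF-8
#   D - let all newly opened filehandles default to UTF-8
#   L - apply the above only if a UTF-8 locale (LANG/LC_ALL) is in effect
perl -CSDL ../script/preprocess --abbr=bin/abbr.txt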
Comment 4 Sjur Nørstebø Moshagen 2005-12-05 13:43:17 CET
The following command:

cat corp/regnor.txt | ../script/preprocess 2> err.txt | l

reveals this list of errors in the single input file:

"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 5.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 136, <> chunk 5.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 13.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 114, <> chunk 13.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 15.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 114, <> chunk 15.
"\x{00c5}" does not map to utf8 at ../script/preprocess line 114, <> chunk 17.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 136, <> chunk 17.
"\x{00c5}" does not map to utf8 at ../script/preprocess line 136, <> chunk 17.
"\x{00be}" does not map to utf8 at ../script/preprocess line 114, <> chunk 19.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 40.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 136, <> chunk 42.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 48.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 114, <> chunk 50.
"\x{00c5}" does not map to utf8 at ../script/preprocess line 114, <> chunk 83.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 136, <> chunk 85.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 137.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 136, <> chunk 138.
"\x{00c3}" does not map to utf8 at ../script/preprocess line 136, <> chunk 224.
"\x{00a1}" does not map to utf8 at ../script/preprocess line 114, <> chunk 226.

It looks to me as if something is garbling the UTF-8 stream, corrupting multibyte sequences at irregular intervals by removing part of a double-byte sequence. That would explain those reports: the reported bytes are only valid inside multibyte sequences, never on their own. If one looks at the file with e.g. hexdump -C, one will find these bytes all over the place (and many more than reported).

Also, using cat -u (to prevent buffering) doesn't help; the hypothesis was that buffering could split a UTF-8 sequence and cause such problems.

cat -v removes the errors by transforming all high-byte sequences to escape sequences!
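An illustration of that transformation (a minimal sketch; á is the two-byte UTF-8 sequence C3 A1 seen in the reports above):

$ printf '\xc3\xa1' | cat -v
M-CM-!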

Conclusion:
To me it looks like cat isn't working properly with UTF-8 data. I'm waiting for Saara's test report as discussed in today's meeting:-)
Comment 5 Sjur Nørstebø Moshagen 2005-12-05 16:17:54 CET
This page (http://www.cl.cam.ac.uk/~mgk25/unicode.html) clearly states that cat should be agnostic to encoding issues:

"Most applications can do very fine with just soft conversion. This is what makes the introduction of UTF-8 on Unix feasible at all. To name two trivial examples, programs such as _cat_ and _echo_ ***do not have to be modified at all***." (my emphasis).

Why is it, then, that this problem seems to show up only when using cat? Actually it seems to be the combination of cat + Perl that causes the problem: cat by itself is fine, and perl by itself is fine.
Comment 6 Sjur Nørstebø Moshagen 2005-12-05 21:50:56 CET
Regarding comment #2 above, there is a strong argument on the net that command 2) is "harmful", and that command 3) is The Correct Way (TM). See further Rob Pike: "UNIX Style, or cat -v Considered Harmful" (http://gaul.org/files/cat_-v_considered_harmful.html). In this spirit only command 1) is a hack.

This is not to say that the behaviour described here is not a bug: even if the use of cat is "harmful", it should still work :-)
Comment 7 Sjur Nørstebø Moshagen 2005-12-05 22:06:15 CET
Sorry, the previous link was merely a discussion; this is the *real* argument:

Useless Uses of Cat (http://www.sektorn.mooo.com/era/unix/award.html)

(Yes, it is from our old Lingsoft friend Era - unfortunately the site is presently down)
Comment 8 Sjur Nørstebø Moshagen 2005-12-07 10:08:25 CET
The cat version coming with MacOS X 10.4.x is a BSD version dated 2.5.1995. Even though it *should* be agnostic about encodings, it is still quite old.

The GNU utility package coreutils was updated in November 2005, and has a cat version from 2003 (IIRC). It might be worth installing the coreutils package and testing the cat version in it. Installation can be done easily through DarwinPorts (use Port Authority as a GUI front end). Also, all such installations are easily removed later, if one wants to.
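A sketch of that installation (assuming DarwinPorts is set up; GNU tools are typically installed with a g prefix, so the new binary would be gcat):

$ sudo port install coreutils
$ printf '\xc3\xa1' | gcat | hexdump -C   # check that the two UTF-8 bytes pass through untouched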
Comment 9 Sjur Nørstebø Moshagen 2006-01-09 21:20:23 CET
In some testing we did in today's project meeting, we found the following interesting behaviour:

- there is no difference between 'pr' and 'cat': on the affected machines, both give the same errors.

This indicates that the problem is not related to 'cat' at all, but rather to the interaction between the shell (pipes), locales, and perl. At the same time, there is no difference between bash 2.0x (Børre) and bash 3.0x (Trond, Sjur) either.

Another fact that has gone unnoticed in the previous discussion is that support for locales was introduced with 10.4. This would explain why the problem did not show up earlier under 10.3 (cf. the original bug report).

The total number of affected computers is now three: Trond, Sjur and Børre.

For more info on today's efforts on trying to move this issue forward, see:
http://www.divvun.no/doc/admin/weekly/2006/Meeting_2006-01-09.html#Technical+issues
Comment 10 Børre Gaup 2006-02-06 11:41:09 CET
We are using ccat now. It doesn't have the problems that cat has, therefore this is a won't fix.
Comment 11 Trond Trosterud 2006-02-09 09:45:43 CET
I hereby reopen the "... \xC3 does not map to Unicode" bug. We got rid of it by avoiding "cat file | preprocess" and using "preprocess file" instead, and we didn't see the error with ccat. Now I have started using ccat seriously, with the following command:

 time ccat -a zcorp/gt/sme/*/*/*xml zcorp/gt/sme/*/*xml | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 bin/sme.fst | lookup2cg | vislcg --grammar=src/sme-dis.rle --minimal > zcorp/dis/sme/all/060208c.txt 

This was a command to make an analysed version of a 1.16-million-word corpus. Approximately 60% of it was analysed on the new G5 Mac before I got an "out of memory" message. Now, is this due to the constant printing of the error messages on the screen (which went on for 128 minutes before it gave in), or is the task simply too big? At one point (the latest version of the shell before 10.4.4) I believe Thor-Øivind had fixed this; then it came back, but we were able to make a workaround. That workaround is now in serious trouble, and I am thus open to views. It seems that the locale info in Perl and in the shell is out of sync?


Here came the answer, two hours later (omitting 128 minutes of "does not map" lines):

(...)
09, <> chunk 111.
utf8 "\xA1" does not map to Unicode at //Users/trond/gt/script/preprocess line 109, <> chunk 111.
utf8 "\xA1" does not map to Unicode at //Users/trond/gt/script/preprocess line 109, <> chunk 111.
perl(2364) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(2364) malloc: *** error: can't allocate region
perl(2364) malloc: *** set a breakpoint in szone_error to debug
Out of memory!
perl(2364) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(2364) malloc: *** error: can't allocate region
perl(2364) malloc: *** set a breakpoint in szone_error to debug
Out of memory!

real    128m50.590s
user    104m35.222s
sys     1m17.356s

Comment 12 Trond Trosterud 2006-02-09 09:48:21 CET
Btw, we got a similar problem with the missing-list command (the command is quoted at the end):

utf8 "\xA1" does not map to Unicode at //Users/trond/gt/script/preprocess line 109, <> chunk 111.
utf8 "\xA1" does not map to Unicode at //Users/trond/gt/script/preprocess line 109, <> chunk 111.
utf8 "\xA1" does not map to Unicode at //Users/trond/gt/script/preprocess line 109, <> chunk 111.
perl(2356) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(2356) malloc: *** error: can't allocate region
perl(2356) malloc: *** set a breakpoint in szone_error to debug
Out of memory!
perl(2356) malloc: *** vm_allocate(size=8421376) failed (error code=3)
perl(2356) malloc: *** error: can't allocate region
perl(2356) malloc: *** set a breakpoint in szone_error to debug
Out of memory!

real    19m16.091s
user    1m39.085s
sys     0m12.885s

hum-tf4-ans142:~/gt/sme trond$ time ccat -a zcorp/gt/sme/*/*/*xml zcorp/gt/sme/*/*xml | preprocess --abbr=bin/abbr.txt | lookup -flags mbTT -utf8 -f bin/missing | grep '\?' | cut -f1 | sort | uniq -c | sort -nr > dev/missing-z-060208.txt
Comment 13 Sjur Nørstebø Moshagen 2006-02-09 13:20:39 CET
Tomi's latest code change in the preprocess perl script fixes the bug. The change was:

Before:
use encoding 'utf-8';
use open ':utf8';

After:
#use encoding 'utf-8';
use open ':std', ':utf8';

The difference on my system comes from commenting out the first line; changes to the second line do not matter: with the first line commented out, the bug disappears.

The second line could probably be changed to:

use open ':locale';

This form implies ':std', and it will correctly pick up the utf8 value from the $LANG etc. environment variables. It works correctly on my system. More info can be found at:

http://search.cpan.org/~nwclark/perl-5.8.7/lib/open.pm
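A minimal sketch of such a script header (hypothetical, not the actual preprocess source):

#!/usr/bin/perl -w
use strict;
# Pick the encoding from the locale (LANG/LC_ALL); ':locale' implies
# ':std', so STDIN, STDOUT and STDERR are covered as well.
use open ':locale';

while (my $line = <STDIN>) {
    print $line;   # round-trips UTF-8 cleanly under a UTF-8 locale
}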

Conclusion: it was a bug in our own Perl code all along. It didn't materialize on 10.3.x, since that system didn't have a locale subsystem.

I'll now close the bug as fixed.
Comment 14 Saara Huhmarniemi 2006-12-20 20:22:34 CET
Using locales in Perl scripts has been problematic. The statement "use locale;" picks up only the letters that are defined in the current locale settings (e.g. with a Norwegian locale, Icelandic characters will not be recognized correctly). This is unbearable in a multilingual project like ours. In addition, there have been other problems, such as "Wide character in print at ..".

I have now updated all our Perl scripts to recognize all utf8 characters correctly. utf8 is now expected everywhere. Our own scripts and modules seem to work fine with the change, but there are some other Perl modules and functions that are troublesome; I list some of them below.

The environment variable I changed is PERL_UNICODE; setting it is equivalent to invoking Perl with the -C option.
To update your home Mac, add this line to the file .profile in your home directory:
export PERL_UNICODE=""

On victorio and the G5 I added this to /etc/profile, so there the change is already made.

If you have Perl scripts of your own:
If you use Unicode characters in the script itself, e.g. in regular expressions, then write
use utf8;
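A minimal sketch of what this enables (the word and the pattern are hypothetical):

use utf8;                                  # the script source itself is UTF-8
my $word = "mánná";
print "match\n" if $word =~ /[áčđŋšŧž]/;   # literal Sámi letters in a regex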

You shouldn't need any other utf8-related statements any more. There were a couple of more technical problems, which I write down for future reference:

1. There are some Perl modules that do not support Unicode; Expect.pm is one of them. Before sending utf8-encoded data to such a module, it has to be explicitly encoded as utf8, and the module's output decoded:
use Encode;
$string = Encode::encode_utf8($string);          # character string -> raw utf8 bytes
$module_output = Module::process($string);       # the non-Unicode module call
$output = Encode::decode_utf8($module_output);   # raw utf8 bytes -> character string

2. Some Perl functions that deal with filenames and system properties, such as mkdir, also require special treatment. For example, when using open2, the pipes have to be set to utf8 explicitly:
binmode R, ':utf8';
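A minimal sketch of that in context (handle names R/W are hypothetical; the command is the lookup call from earlier in this bug):

use IPC::Open2;

# R reads from the child process, W writes to it
my $pid = open2(\*R, \*W, 'lookup -flags mbTT -utf8 bin/sme.fst');
binmode R, ':utf8';   # decode the tool's output as UTF-8
binmode W, ':utf8';   # encode what we send to it as UTF-8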

The same goes for IO::Socket, which seems to ignore Unicode requirements. The xerox-tools server may be the only one that does not work with the new settings and the Java implementation.

3. For some reason, command line arguments such as $ARGV[0] get doubly utf8-encoded and have to
be decoded with
Encode::decode_utf8($ARGV[0]);
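A one-liner to apply this to all arguments at the top of a script (a sketch, assuming Encode is loaded):

use Encode;
@ARGV = map { Encode::decode_utf8($_) } @ARGV;   # undo the extra utf8 layer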

I reopen the bug for the new error reports.
Comment 15 Trond Trosterud 2008-05-03 10:37:53 CEST
One and a half years with no more bug reports, and I close the bug.