Bug 841 - Nested error markup does not get correctly converted to xml
Status: RESOLVED FIXED
Product: Corpus
Classification: Unclassified
Component: xml conversion
Version: unspecified
Hardware: All All
Importance: P2 - As soon as possible, critical
Assignee: Børre Gaup
Reported: 2010-05-11 17:42 CEST by Sjur Nørstebø Moshagen
Modified: 2015-03-16 14:16 CET
Description Sjur Nørstebø Moshagen 2010-05-11 17:42:09 CEST
With Lene's PhD project (and my test bench project), some bugs and problems in the convert2xml.pl script have turned up. The most problematic one is that nested error markup doesn't always convert properly to xml.

Here is a text snippet that causes this behaviour:

Eadni (barggá$(v,conc|bargá) boazzu)£(n,advl,compl,x,mix|bargá bohccuiguin). Sus lea guoktečuođinjealljenuppelohkai$(num,á|guoktečuođinjealljenuppelohkái). Therese barggá$(v,conc|bargá) buohcciviesus$(n,cmp|buohcceviesus) Romssas.

The resulting xml is presently as follows:

    <p>Eadni <errorort correct="bargá" errtype="conc" pos="v">barggá</errorort> boazzu)£(n,advl,compl,x,mix|bargá bohccuiguin). Sus lea guoktečuođinjealljenuppelohkai$(num,á|guoktečuođinjealljenuppelohkái). Therese barggá$(v,conc|bargá) buohcciviesus$(n,cmp|buohcceviesus) Romssas.
</p>

As seen above, only the first error is detected and converted correctly to xml. The rest is left untouched. During conversion, I get the following message:

***
*** WARNING - NO MATCH:  boazzu)£(n,advl,compl,x,mix|bargá bohccuiguin). Sus lea guoktečuođinjealljenuppelohkai$(num,á|guoktečuođinjealljenuppelohkái). Therese barggá$(v,conc|bargá) buohcciviesus$(n,cmp|buohcceviesus) Romssas.

***

The warning comes from the following part of the Perl code:

sub add_error_markup {
	my ($twig, $para) = @_;

	my @new_content;
	for my $c ($para->children) {
		my $text = $c->text;
		my $new_text;
		my $nomatch = 0;
		# separator: either §, $, €, ¥ or £
		while ($text && $text =~ /[$sep]/) {

			# No nested errors, no parentheses
			if ($text =~ s/^([^$sep]*\s)?(?:\()?($plainerr)(?:\))?(?=$|\n|\s|\p{P})//) {
				if($1) { push @new_content, $1; }
				print STDERR "Plain error: $2\n"; # Debug print-out
				get_error($2, \@new_content);
			}

			elsif ($text =~ s/^([^$sep\(\)]*\s)?(?:\()($plainerr)(?=[$sep])//) {
				if ($1) { push @new_content, $1; }
...
			}
			else {
				print "\n***\n*** WARNING - NO MATCH: $text\n***\n\n";
				push @new_content, $text;
				$text ="";
			}
		}
		if ($text) { push @new_content, $text; }
		
	}
	$para->set_content(@new_content);
}

I don't see what is wrong, except that it is related to the regex in the if test.
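
One reading of the output above, offered as an assumption since $plainerr is not shown here: the opening parenthesis before barggá is missing from the XML, so the first branch apparently consumed it through its optional (?:\()? and converted barggá$(v,conc|bargá) as a plain error. That leaves the text from boazzu) onwards with an unmatched closing parenthesis that neither branch can match. The sketch below shows one alternative: track open parentheses on a stack and build the errors inside-out. It is written in Python for brevity, not in the Perl of Corpus.pm; the names Error and parse_error_markup are hypothetical, and it assumes corrections contain no parentheses of their own.

```python
import re
from dataclasses import dataclass

SEPARATORS = "§$€¥£"  # taken from the comment in add_error_markup


@dataclass
class Error:
    sep: str        # separator that introduced the correction
    children: list  # erroneous text; may contain nested Error nodes
    attrs: str      # raw text before '|', e.g. "v,conc"
    correct: str    # text after '|'


def _add(nodes, item):
    """Append item, merging adjacent strings and dropping empty ones."""
    if isinstance(item, str):
        if not item:
            return
        if nodes and isinstance(nodes[-1], str):
            nodes[-1] += item
            return
    nodes.append(item)


def _read_correction(text, i):
    """If text[i:] starts with '<sep>(attrs|correct)', return (attrs, correct, end)."""
    if text[i] not in SEPARATORS or text[i + 1:i + 2] != "(":
        return None
    close = text.find(")", i + 2)
    if close == -1:
        return None
    attrs, _, correct = text[i + 2:close].rpartition("|")
    return attrs, correct, close + 1


def parse_error_markup(text):
    stack = [[]]  # stack[0] is the top level; each '(' opens a new group
    i = 0
    while i < len(text):
        ch = text[i]
        corr = _read_correction(text, i)
        if corr:
            # Single-word error: the word directly before the separator.
            cur = stack[-1]
            last = cur[-1] if cur and isinstance(cur[-1], str) else ""
            m = re.search(r"\S+$", last)
            if m:
                cur.pop()
                _add(cur, last[:m.start()])
                cur.append(Error(ch, [m.group()], corr[0], corr[1]))
                i = corr[2]
                continue
        if ch == "(":
            stack.append([])
        elif ch == ")" and len(stack) > 1:
            group = stack.pop()
            corr = _read_correction(text, i + 1) if i + 1 < len(text) else None
            if corr:
                # Parenthesised error: the whole group is the erroneous text.
                stack[-1].append(Error(text[i + 1], group, corr[0], corr[1]))
                i = corr[2]
                continue
            # Ordinary parentheses: put them back as plain text.
            for item in ["(", *group, ")"]:
                _add(stack[-1], item)
        else:
            _add(stack[-1], ch)
        i += 1
    while len(stack) > 1:  # an unmatched '(' is kept as literal text
        group = stack.pop()
        for item in ["(", *group]:
            _add(stack[-1], item)
    return stack[0]
```

For the first sentence of the snippet, this returns the leading text, one £ error whose children are the inner $ error followed by the text " boazzu", and the final full stop. Turning that tree into errorort and the other elements would be a separate step.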

There are many more such cases giving the same warning, but this should suffice for now.

Any comments or feedback appreciated.
Comment 1 Sjur Nørstebø Moshagen 2010-05-11 17:43:07 CEST
I forgot to mention that the problematic code is in the file:

$GTHOME/gt/script/langTools/Corpus.pm
Comment 2 Sjur Nørstebø Moshagen 2010-05-27 14:19:01 CEST
Steps to reproduce:

$ ssh victorio.uit.no
$ cd /usr/local/share/corp/goldstandard/orig/sme/learner
$ convert2xml.pl --test --nolog --corpdir=/usr/local/share/corp ude1.1.correct.txt

Observe how you get a warning:


***
*** WARNING - NO MATCH:  boazzu)£(n,advl,compl,x,mix|bargá bohccuiguin). Sus lea guoktečuođinjealljenuppelohkai$(num,á|guoktečuođinjealljenuppelohkái). Therese barggá$(v,conc|bargá) buohcciviesus$(n,cmp|buohcceviesus) Romssas.

***

Then open the converted file:

$ l ../../../converted/sme/learner/ude1.1.correct.txt.xml

Notice how the second to last paragraph (p element) contains unprocessed markup:

    <p>Eadni <errorort correct="bargá" errtype="conc" pos="v">barggá</errorort> boazzu)£(n,advl,compl,x,mix|bargá bohccuiguin). Sus lea guoktečuođinjealljenuppelohkai$(num,á|guoktečuođinjealljenuppelohkái). Therese barggá$(v,conc|bargá) buohcciviesus$(n,cmp|buohcceviesus) Romssas.
</p>

Expected result:

All error markup should have been converted to proper xml code.

The problematic Perl code is described earlier in this bug.
Comment 3 Sjur Nørstebø Moshagen 2011-05-04 15:29:57 CEST
Removed Saara from Cc list - she is not participating in this work anymore. Changed assignee to Børre.
Comment 4 Trond Trosterud 2011-05-04 20:15:57 CEST
Quoting a year back:

<quote>
Steps to reproduce:

$ ssh victorio.uit.no
$ cd /usr/local/share/corp/goldstandard/orig/sme/learner
$ convert2xml.pl --test --nolog --corpdir=/usr/local/share/corp
ude1.1.correct.txt
</quote>

This was a good report a year ago, but the procedure is no longer relevant today.

I replace the path with the present placement of the file, and learn that the convert2xml.pl is not the same script as a year ago:

boundcorpus$convert2xml.pl --test --nolog --corpdir=goldstandard/orig/sme/learner/ude1.1.correct.txt.xml 
Unknown option: test
Unknown option: nolog
Unknown option: corpdir
Usage: convert2xml.pl [OPTIONS] [FILES|DIRS]
The available options:
    --debug    Print all the operations that are done when converting files to stderr
    --shallow  Convert only files that haven't been converted before
boundcorpus$


So, I redo the command with contemporary flags:

boundcorpus$convert2xml.pl --debug goldstandard/orig/sme/learner/ude1.1.correct.txt.xml 
Processing files

Unable to handle 
/home/trond/boundcorpus/goldstandard/orig/sme/learner/ude1.1.correct.txt.xml
Get rid of this error by executing the following command from the
freecorpus and boundcorpus directories:

svn status | grep ^? | cut -f8 -d" " | xargs rm -v

If the problem persists after that command, issue a bug in
http://giellatekno.uit.no/bugzilla
unable to handle /home/trond/boundcorpus/goldstandard/orig/sme/learner/ude1.1.correct.txt.xml


No, I am not at all convinced that this is a relevant test.

Today, we convert the corpus not from xml but to xml. 

In my opinion, we need a blank slate.

If we are to debug our present goldstandard corpora with the year-old convert2xml.pl, the very least I expect is an update repeating the problem.

The real problem is of course the transfer of our eror$error etc notation to xml format, but that problem is today related to our present scripts.
Comment 5 Sjur Nørstebø Moshagen 2011-05-04 20:54:25 CEST
(In reply to comment #4)
> 
> Today, we convert the corpus not from xml but to xml. 

As we did a year ago, and from the very beginning.

> In my opinion, we need a blank slate.
> 
> If we are to debug our present goldstandard corpora with the year-old
> convert2xml.pl, the very least I expect is an update repeating the problem.
> 
> The real problem is of course the transfer of our eror$error etc notation to
> xml format, but that problem is today related to our present scripts.

Exactly. And the core of the error markup conversion was, until a few weeks ago, exactly the same as a year ago, despite the new version of the main process. Lately Børre has started some work on improving the error markup conversion, but I don't know how far he has come.

This bug report is as relevant as it was a year ago, and will stay open until we have a proper conversion of error markup. I am sure Børre can provide us with an up-to-date example of how to trigger the bug.
Comment 6 Trond Trosterud 2011-05-13 10:23:20 CEST
There are actually two problems here:

One is that the error markup does not work (?) for the gold corpus. Let us devote this bug to that problem (as it is stated in the title).

The other problem is that the error markup mechanism creeps in and spoils the ordinary corpus conversion.

For that I suggest we create a new bug.
Comment 7 Sjur Nørstebø Moshagen 2011-05-16 10:58:24 CEST
(In reply to comment #6)
> There are actually two problems here:
> 
> One is that the error markup does not work (?) for the gold corpus. Let us
> devote this bug to that problem (as it is stated in the title).

Agree.

> The other problem is that the error markup mechanism creeps in and spoils the
> ordinary corpus conversion.
> 
> For that I suggest we create a new bug.

Definitely. What I don't understand is why this happens in the first place. It used to be that the error markup processing depended on the pathname, such that it would only be triggered if the filename contained *correct* and/or the file was found in a subdir of goldstandard/. Unless at least one of these two conditions was fulfilled, the error markup conversion part ***was not triggered***, and could not have destroyed the regular conversion. This is how it should be, and if it is not behaving this way now, please create a new bug and describe exactly what is going on. IMPORTANT: include the exact steps needed to reproduce the bug.
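
The gating rule described here can be sketched as follows. This is a Python illustration of the rule as stated in this comment, not the code in convert2xml.pl, and the function name wants_error_markup is hypothetical.

```python
from pathlib import PurePosixPath


def wants_error_markup(path: str) -> bool:
    """Return True if error markup conversion should run for this file.

    Rule as described in the comment: the filename contains "correct",
    and/or the file lives somewhere below a goldstandard/ directory.
    """
    p = PurePosixPath(path)
    in_goldstandard = "goldstandard" in p.parts[:-1]  # directories only
    return "correct" in p.name or in_goldstandard
```

With this rule, goldstandard/orig/sme/learner/ude1.1.correct.txt triggers the error markup step, while an ordinary corpus file outside goldstandard/ and without "correct" in its name does not.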
Comment 8 Trond Trosterud 2011-05-19 14:48:06 CEST
"if it is not behaving this way now please create a new
bug, and describe exactly what is going on."

==> You have not seen it since you do not convert the corpus (?). For the weeks it was going on, this bug dominated the error reports, and we discussed it. Now I have looked for the error report ("halting at § in ordinary corpus"), and it is gone, for reasons unknown to at least Børre and me. Hence, no bug on that one.

So, you are right, we must move the error reports from tmp/ into bz. But take a look at the corpus as well.
Comment 9 Trond Trosterud 2011-05-23 13:07:14 CEST
Blocker, because the error corpus conversion does not work unless this is fixed.
Comment 10 Børre Gaup 2011-06-19 01:01:14 CEST
An implementation that handles nested markup was implemented in r43520 and r43521.
There are a lot of documents in [bound|free]corpus/goldstandard/orig that aren't valid according to the DTD. This is a sample output from the following command, run in boundcorpus:

convert2xml.pl --debug --shallow  `find goldstandard/orig/ -name \*.correct.\* | grep -v .svn | tr "\n" " "`

element errorort: validity error : Value "n" for attribute pos of errorort is not among the enumerated set
rorort>, NRK Sámi Radio <errorort correct="guovllukantuvra" errtype="a" pos="n"

Please test and report here if you find any errors in the conversion.
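
To see which attribute values the DTD would have to allow, one could collect them from the converted files. The following is a minimal sketch in Python using only the standard library. The function name is hypothetical, and it assumes that all error elements have tag names starting with "error", as errorort does.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def collect_error_attr_values(xml_text: str) -> dict:
    """Map (element name, attribute name) to the set of values seen."""
    values = defaultdict(set)
    for elem in ET.fromstring(xml_text).iter():
        if not elem.tag.startswith("error"):
            continue
        for name, value in elem.attrib.items():
            if name != "correct":  # corrections are free text, not enumerated
                values[(elem.tag, name)].add(value)
    return dict(values)
```

Run over every converted file, the union of these sets gives the values each enumerated attribute needs to accept.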
Comment 11 Lene Antonsen 2011-06-20 00:13:02 CEST
I suggest updating the DTD. Have others used it, or only the texts that I have annotated? Where is the DTD located?
Comment 12 Sjur Nørstebø Moshagen 2011-06-20 16:03:30 CEST
(In reply to comment #11)
> I suggest updating the DTD. Have others used it, or only the
> texts that I have annotated?

I don't know. But the DTD is shared by all corpus files, both with and without error tags, and with and without analysis.

> Where is the DTD located?

In $GTHOME/gt/dtd/corpus.dtd

I changed the severity of this bug from blocker to critical, since the bug should now be squashed. Børre has not found any errors other than DTD-related ones.
Comment 13 Lene Antonsen 2011-06-20 17:53:02 CEST
I have changed the annotation in all the files and updated corpus.dtd. I will take a closer look at the DTD tonight, in case it has further gaps. I have now copied the newest version of all the files I have worked on to apache-corpus.
Comment 14 Sjur Nørstebø Moshagen 2011-06-20 18:13:05 CEST
(In reply to comment #13)
> I have changed the annotation in all the files and updated corpus.dtd.

Good!

> I will take a closer
> look at the DTD tonight, in case it has further gaps.

:)

> I have now
> copied the newest version of all the files I have worked on to
> apache-corpus.

Hm, this doesn't work. You have to check them into svn, and you cannot do that in apache-corpus. If you put things there, you instead risk having your changes overwritten, and in any case they will not become visible to the rest of the world.

What you need to do is check out the bound corpus on victorio. It is fine to do that on vic, but not elsewhere. Then put your changes in place in your own vic copy of bound, and check in the changes on the command line in the usual way (svn ci -m "blablabla" filnamn.txt). The changes are then registered and included in the next conversion round.
Comment 15 Lene Antonsen 2011-06-20 20:44:26 CEST
I have checked the changes into svn in my own directories (private/plan/assignments/Lene/PhD/corpus/).
In addition, I copied them over to victorio (apache etc.) the way I did when I worked on this a year ago.
I did not know that I was now supposed to do it differently.

As I said, I would like training in the structure and use of our servers and corpus. It is important that all of us who work with giellatekno know how it is supposed to work.
Comment 16 Lene Antonsen 2011-06-22 10:39:35 CEST
Ciprian has checked in a temporary DTD with CDATA #IMPLIED so that it accepts a bit of everything. I will take a closer look at the attributes after the summer holidays.
Comment 17 Lene Antonsen 2011-06-28 16:28:10 CEST
I have now checked the changes into svn, in both boundcorpus and freecorpus.
Comment 18 Sjur Nørstebø Moshagen 2011-09-16 10:08:17 CEST
(In reply to comment #10)
> An implementation that handles nested markup was implemented in r43520 and
> r43521.

Then perhaps we can close this bug? :)

The new implementation has been tested several times lately, and most bugs have been fixed. Even though there are some open bugs left, these have their own Bugzilla bug reports, and should not prevent us from closing this one.