Bug 1226 - b. at end of sentence should behave like nr.
Summary: b. at end of sentence should behave like nr.
Status: RESOLVED FIXED
Alias: None
Product: Pre- and postprocessing
Classification: Unclassified
Component: preprocess file
Version: unspecified
Hardware: Macintosh Linux
Importance: P5 - Later enhancement
Assignee: Børre Gaup
Reported: 2011-12-19 16:39 CET by Børre Gaup
Modified: 2012-02-08 11:18 CET
Description Børre Gaup 2011-12-19 16:39:34 CET
b. at the end of a sentence should behave like nr.
echo "nr. Nysetning" | preprocess --abbr=$GTHOME/gt/sme/bin/abbr.txt gives
nr.
.
Nysetning

and 
echo "b. Nysetning" | preprocess --abbr=$GTHOME/gt/sme/bin/abbr.txt gives
b.
Nysetning
Comment 1 Børre Gaup 2011-12-19 16:51:00 CET
Looking at $GTHOME/gt/sme/bin/abbr.txt I find b at line 306, 637 and 713
The first b belongs to LEXICON TRNUMAB, found at line 279
The next two b's seem to belong to LEXICON TRAB, found at line 345

Could this be the source of the problem?
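One way the duplicate entries could matter, sketched in Python under assumptions: that abbr.txt is read top to bottom, that a "LEXICON NAME" line opens a class, and that each abbreviation ends up with exactly one class, so a later entry overwrites an earlier one. The real preprocess may load the file differently; load_abbrs and the sample text are hypothetical.

```python
def load_abbrs(text):
    """Map each abbreviation to the last LEXICON class it appears under."""
    abbrs = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("LEXICON "):
            current = line.split()[1]
        elif current is not None:
            # Assumption: one class per abbreviation, so a later entry
            # silently replaces an earlier one.
            abbrs[line] = current
    return abbrs


# Hypothetical excerpt: b is listed under both classes, nr under one.
SAMPLE = """\
LEXICON TRNUMAB
b
nr
LEXICON TRAB
b
"""

abbrs = load_abbrs(SAMPLE)
print(abbrs["b"], abbrs["nr"])
```

Under these assumptions b would come out as TRAB only, which would match the behaviour reported above.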
Comment 2 Børre Gaup 2011-12-19 16:55:36 CET
Looking for b in $GTHOME/gt/sme/src/ I find this at line 507: b ab-adv ; ! trab in 12. b. 2001 and "lea 30. b. mearidán…, but not in "Bohten geassemánu 3. b. Lean velá dáppe."
Comment 3 Trond Trosterud 2011-12-19 16:59:06 CET
(In reply to comment #1)
> Looking at $GTHOME/gt/sme/bin/abbr.txt I find b at line 306, 637 and 713
> The first b belongs to LEXICON TRNUMAB, found at line 279
> The next two b's seem to belong to LEXICON TRAB, found at line 345
> 
> Could this be the source of the problem?

Potentially, yes, but I removed it, and the problem remains. The cause is in the
preprocess file itself, which has a special twist requiring a TRNUMAB to be at
least two characters long. This requirement must be dropped (but check how
initials are handled).
Comment 4 Trond Trosterud 2011-12-19 17:01:05 CET
(In reply to comment #2)
> Looking for b in $GTHOME/gt/sme/src/ I find this at line 507: b ab-adv ; ! trab
> in 12. b. 2001 and "lea 30. b. mearidán…, but not in "Bohten geassemánu 3. b.
> Lean velá dáppe."

This is how it should be, but it is not. Instead, b. is treated as a TRAB, because of the one-char-ban in preprocess.
Comment 5 Børre Gaup 2011-12-19 19:59:24 CET
Running 
freecorpus $ echo "b. Nysetning" | preprocess --abbr=$GTHOME/gt/sme/bin/abbr.txt -v
[NEW:233] b.
[process_word:278] b.
[if word:322] b.
[process_word/go_to_test_abbr:388] b
[test_abbr:471] b
b.
[NEW:233] Nysetning
[no correction, no idiom:241] Nysetning
Nysetning

shows that b is treated as a TRAB here.
The relevant part of the code, with its comment, reads:
	# Transitive abbreviations are never followed
	# by sentence boundary.
	if ($abbrs{$abbr} eq $TRAB || $abbrs{lc($abbr)} eq $TRAB) {
		add_token($tokens_aref, $w_token, $word);
		verbose("test_abbr", $abbr, __LINE__);
		return 1;
	}

After that return, no further processing of the token takes place.

Removing these two lines from abbr-sme-lex.txt:
b+Use/-Spell:b ab-nodot-noun; !
b+Use/-Spell:b ab-dot-noun; !

leaves b as a TRNUMAB only.

Then running the same command as above gives this result:
freecorpus $ echo "b. Nysetning" | preprocess --abbr=$GTHOME/gt/sme/bin/abbr.txt -v
[NEW:233] b.
[process_word:278] b.
[if word:322] b.
[process_word/go_to_test_abbr:388] b
[test_abbr:484] b
b.
.
[NEW:233] Nysetning
[no correction, no idiom:241] Nysetning
Nysetning

The tests in abbrtester.py (test_nr_vs_b and test_b_inside_sentence) also pass, letting "b." behave as expected both inside sentences and at the end of sentences.
Comment 6 Trond Trosterud 2011-12-19 21:21:29 CET
~/main/gt$echo "b. Nysetning" | preprocess --abbr=sme/bin/abbr.txt 
b.
Nysetning
~/main/gt$echo "b. Nysetning" | preprocess
b.
.
Nysetning

With a number following, on the other hand:


~/main/gt$echo "b. 1324" | preprocess
b.
.
1324
~/main/gt$echo "b. 1324" | preprocess --abbr=sme/bin/abbr.txt
b.
1324

So we need to get abbr to work. Here is the -v test:


~/main/gt$echo "b. Nysetning" | preprocess --abbr=sme/bin/abbr.txt -v
[NEW:233] b.
[process_word:278] b.
[if word:322] b.
[process_word/go_to_test_abbr:388] b
[test_abbr:471] b
b.


~/main/gt$echo "nr. Nysetning" | preprocess --abbr=sme/bin/abbr.txt -v
[NEW:233] nr.
[process_word:278] nr.
[if word:322] nr.
[process_word/go_to_test_abbr:388] nr
[test_abbr:484] nr
nr.
.
[NEW:233] Nysetning

Line 484 is here:

	elsif ($abbrs{$abbr} eq $TRNUMAB || $abbrs{lc($abbr)} eq $TRNUMAB) {
		add_token ($tokens_aref, $w_token, $word);
		# Add a sentence break only if the next word starts with an uppercase
		# letter and is not a lone uppercase letter or a Roman numeral.
		if ($next_word =~ /^\p{Lu}/ && $next_word !~ /^(\p{Lu}|[IVXCDLM]+)$/o) {
			add_new_token ($tokens_aref, $sentence_break);
		}
		verbose("test_abbr", $abbr, __LINE__);    # <----------------- 484
		return 1;
	}

So if ($next_word =~ /^\p{Lu}/ && $next_word !~ /^(\p{Lu}|[IVXCDLM]+)$/o) {
is the requirement that one-letter abbreviations cannot be TRNUMAB. We need to get rid of it.
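What that condition tests can be restated as a small Python sketch. Python's re module has no \p{Lu}, so str.isupper() stands in for it here, which is only an approximation; breaks_after_trnumab is a hypothetical name, not something in preprocess.

```python
import re

# Same character set as the Perl pattern [IVXCDLM]+
ROMAN = re.compile(r"[IVXCDLM]+")


def breaks_after_trnumab(next_word):
    """True if a sentence break would be added after a TRNUMAB abbreviation."""
    if not next_word or not next_word[0].isupper():
        return False  # the next word must start with an uppercase letter
    if len(next_word) == 1:
        return False  # a lone uppercase letter, e.g. an initial
    if ROMAN.fullmatch(next_word):
        return False  # consists only of Roman numeral letters
    return True


for word in ["Nysetning", "asdf", "1324", "A", "XIV"]:
    print(word, breaks_after_trnumab(word))
```

Of these five words, only Nysetning triggers a sentence break in the sketch.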
Comment 7 Børre Gaup 2011-12-19 23:36:08 CET
To make b. work like nr. (take the same path through preprocess as nr.), I removed b from TRAB and commented out the line:
        if ($next_word =~ /^\p{Lu}/ && $next_word !~ /^(\p{Lu}|[IVXCDLM]+)$/o)
in preprocess.

This works for b. Nysetning, but the example below no longer works as expected.

echo njukčamánu 1. b. dii. 09.00. | preprocess --abbr=$GTHOME/gt/sme/bin/abbr.txt
njukčamánu
1.
b.
.
dii.
.
09.00
.

There seems to be a built-in conflict between b as TRAB and b as TRNUMAB. As things stand, you get either one or the other.

For example, c. Nysetning gives this result:
gt $ echo "c. Nysetning" | preprocess --abbr=sme/bin/abbr.txt
c.
Nysetning

where c is both TRAB and TRNUMAB.

This is hardly satisfactory, so we need to look at rules for when b. should behave as
b.
.

and when it should behave as
b.

...
Comment 8 Trond Trosterud 2011-12-20 15:06:45 CET
# Test with a modified preprocess --
# commented out if ($next_word =~ /^\p{Lu}/ && $next_word !~ /^(\p{Lu}|[IVXCDLM]+)$/o) {

~/main/gt$echo b. asdf b. Asfd b. 1 nr. asdf nr. Asdf nr. 2 | preprocess --abbr=sme/bin/abbr.txt 
b.
asdf
b.
Asfd
b.
1
nr.
. # not this
asdf # not this
nr.
.
Asdf
nr. # not this
. # not this
2

# Test with the regular preprocess
~/main/gt$echo b. asdf b. Asfd b. 1 nr. asdf nr. Asdf nr. 2 | preprocess --abbr=sme/bin/abbr.txt 
b.
asdf
b.
Asfd
b.
1
nr.
asdf
nr.
.
Asdf
nr.
2

With the regular preprocess, nr. is a trnumab, as we want, but b. is not.

Currently:
b = trab
nr = incorrectly itrab

We want this:
 b. asdf    b. . Asfd    b. 1       = trnumab
 nr. asdf   nr. . Asdf   nr. 2      = trnumab
 hr. asdf   hr. Asdf     hr. 2      = trab
 Ltd. asdf  Ltd. . Asdf  Ltd. . 2   = itrab
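
The wanted behaviour in the table can be written down as a small Python sketch, which could serve as a checklist when changing preprocess. It is a simplification: it looks only at the first character of the next word and ignores the lone-capital and Roman-numeral exceptions in the current code. The function name wants_break and the CLASSES mapping are hypothetical.

```python
def wants_break(abbr_class, next_word):
    """Whether a sentence break should follow the abbreviation, per the table."""
    capitalised = next_word[:1].isupper()
    number = next_word[:1].isdigit()
    if abbr_class == "trab":
        return False  # never followed by a sentence break
    if abbr_class == "trnumab":
        return capitalised  # break before a capitalised word, not a number
    if abbr_class == "itrab":
        return capitalised or number  # break before both
    raise ValueError(f"unknown class: {abbr_class}")


# Classes as wanted in the table, not as they currently are in abbr.txt.
CLASSES = {"b": "trnumab", "nr": "trnumab", "hr": "trab", "Ltd": "itrab"}

for abbr, cls in CLASSES.items():
    row = []
    for nxt in ["asdf", "Asdf", "2"]:
        sep = ". " if wants_break(cls, nxt) else ""
        row.append(f"{abbr}. {sep}{nxt}")
    print("   ".join(row), "=", cls)
```

Each printed row mirrors one row of the table, with ". " marking the sentence break.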
Comment 9 Børre Gaup 2012-02-08 11:18:49 CET
The test_nr_vs_b test in abbrtester.py shows that nr. and b. behave equally.