Bug 2256 - Extract termwiki term pairs for our apertium languages, prepare for inclusion in bidix
Status: ASSIGNED
Alias: None
Product: satni.org
Classification: Unclassified
Component: Termwiki data dump
Version: unspecified
Hardware: All All
Importance: P2 - As soon as possible normal
Assignee: Børre Gaup
Reported: 2016-11-09 11:53 CET by Sjur Nørstebø Moshagen
Modified: 2018-05-28 09:11 CEST

Description Sjur Nørstebø Moshagen 2016-11-09 11:53:43 CET
We want to leverage the work done in the termwiki for our MT systems, and one way of doing that is to automatically extract all term pairs for the language pairs we have MT systems for:

sme-nob
sme-sma
sme-smj
sme-smn
sme-fin

(please correct me if I missed something)

For both target and source language it is important that the term is analysable and generatable, so this must be checked before the candidate pair is constructed. It must also be checked that the pair does not already exist in the bidix.

Given this, new pairs should be constructed in bidix format and added to the apertium svn for the relevant pair, in a special bidix candidate file (there is no such file at the moment, so just create one with a descriptive name).
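
A minimal Python sketch of the requested steps, for illustration only. The entry shape follows the <e><p><l>…</l><r>…</r></p></e> examples quoted later in this report. The function names, the (SL lemma, TL lemma) key used for the "already in bidix" check, and the two callbacks sl_ok and tl_ok (standing in for the analyser/generator checks) are assumptions, not existing code.

```python
from xml.sax.saxutils import escape


def bidix_entry(sl_lemma, sl_tags, tl_lemma, tl_tags):
    """Format one pair as an <e> element like those quoted in this report."""
    def side(lemma, tags):
        return escape(lemma) + "".join('<s n="%s"/>' % tag for tag in tags)
    return "<e><p><l>%s</l><r>%s</r></p></e>" % (
        side(sl_lemma, sl_tags), side(tl_lemma, tl_tags))


def candidate_entries(pairs, existing_pairs, sl_ok, tl_ok):
    """Yield entries for pairs that pass both checks and are not yet known.

    pairs: iterable of (sl_lemma, sl_tags, tl_lemma, tl_tags)
    existing_pairs: set of (sl_lemma, tl_lemma) already in the bidix
    sl_ok, tl_ok: hypothetical callbacks wrapping the analyser/generator
    """
    seen = set(existing_pairs)
    for sl_lemma, sl_tags, tl_lemma, tl_tags in pairs:
        key = (sl_lemma, tl_lemma)
        if key in seen:
            continue  # already in the bidix, or already emitted
        if not (sl_ok(sl_lemma) and tl_ok(tl_lemma)):
            continue  # one side is not analysable/generatable
        seen.add(key)
        yield bidix_entry(sl_lemma, sl_tags, tl_lemma, tl_tags)
```

Multiword lemmas and paradigm references are deliberately left out of this sketch.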
Comment 1 Lene Antonsen 2016-11-09 13:07:18 CET
Børre has made this script:

New Revision: 133146

Added:
   trunk/gt/script/termwiki-expressions.py
Log:
Print se and fi entries from the daily termwiki dump to stdout

The se and fi output are separated by a tab.

If there are several expressions in each language, they are separated by
a comma.
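
For reference, a small Python sketch of how the output described in this log message could be read back into term pairs. It assumes exactly one tab per line, that a comma never occurs inside an expression, and that every se expression pairs with every fi expression on the same line; none of this is confirmed by the script itself, and parse_line and term_pairs are hypothetical names.

```python
def parse_line(line):
    """Split one 'se<TAB>fi' line into two lists of expressions."""
    se_field, fi_field = line.rstrip("\n").split("\t", 1)

    def split_expressions(field):
        return [part.strip() for part in field.split(",") if part.strip()]

    return split_expressions(se_field), split_expressions(fi_field)


def term_pairs(lines):
    """Yield (se, fi) pairs; lines without a tab are skipped."""
    for line in lines:
        if "\t" not in line:
            continue
        se_expressions, fi_expressions = parse_line(line)
        for se in se_expressions:
            for fi in fi_expressions:
                yield se, fi
```

It could be fed from the script's stdout, e.g. term_pairs(sys.stdin) in a pipeline.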
Comment 2 Sjur Nørstebø Moshagen 2017-02-10 11:00:29 CET
Moving satni.org bugs from the risten.no product to the satni.org product.
Comment 3 Tomi Pieski 2017-04-04 12:03:33 CEST

*** This bug has been marked as a duplicate of bug 2342 ***
Comment 4 Sjur Nørstebø Moshagen 2017-04-05 09:14:19 CEST
This is not a duplicate. The crucial part is:

> prepare for inclusion in bidix

That is, you need to format the term pairs as a bidix entry fragment, i.e. XML. The XML fragment should be sent somewhere, but at present I don't know where.

I'll add Fran to the Cc list. Maybe he or Lene or Trond have ideas for where to put the bidix fragment.
Comment 5 Tomi Pieski 2017-04-06 14:33:56 CEST
Ok. What is bidix? Can you explain?
Comment 7 Tomi Pieski 2017-04-18 09:19:36 CEST
Termwiki term pairs are prepared in GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
Comment 8 Sjur Nørstebø Moshagen 2017-04-25 10:32:17 CEST
(In reply to Tomi Pieski from comment #7)
> Termwiki term pairs are prepared in
> GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix

$ ll $GTHOME/tools/TermWikiExporter/bidix/
total 6360
-rw-r--r--@ 1 smo036  1907360568   3,1M 18 apr 16:42 sme-nob.dix

It looks reasonable, but there are a couple of questions:

1) you should add the required root element and xmllint the result to verify that the structure is ok
2) have you ensured that all entries in the bidix are recognised by the analyser?

Also, in the initial bug report I requested bidix data for all language pairs for which we have MT systems, and you have so far only added one. I will reopen the bug report. Leave it open until we all agree that we have reached the goal as stated in the initial bug report.
Comment 9 Tomi Pieski 2017-04-25 13:30:37 CEST
(In reply to Sjur Nørstebø Moshagen from comment #8)
> (In reply to Tomi Pieski from comment #7)
> > Termwiki term pairs are prepared in
> > GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
> 
> $ ll $GTHOME/tools/TermWikiExporter/bidix/
> total 6360
> -rw-r--r--@ 1 smo036  1907360568   3,1M 18 apr 16:42 sme-nob.dix
> 
> It looks reasonable, but there are a couple of questions:
> 
> 1) you should add the required root element and xmllint the result to verify
> that the structure is ok

Done.

> 2) have you ensured that all entries in the bidix are recognised by the
> analyser?
> 

Only analysed entries are added.

> Also, in the initial bug report I requested bidix data for all language
> pairs for which we have MT systems, and you have so far only added one. I
> will reopen the bug report. Leave it open until we all agree that we have
> reached the goal as stated in the initial bug report.

Leaving this just here.
Comment 10 Sjur Nørstebø Moshagen 2017-04-25 13:52:34 CEST
(In reply to Tomi Pieski from comment #9)
> > 1) you should add the required root element and xmllint the result to verify
> > that the structure is ok
> 
> Done.

Not completely. The correct root element is <dictionary>, and the linting command should be something like:

xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit 0;

That is, follow the dtd.
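
A hedged Python sketch of the wrapping step: it puts generated <e> entries inside a <dictionary> root and checks that the result is well-formed XML. The <section id="main" type="standard"> wrapper is an assumption about what dix.dtd expects, and the standard library does not validate against a DTD, so the xmllint --dtdvalid command above is still needed for the real check. wrap_entries and is_well_formed are hypothetical helper names.

```python
import xml.etree.ElementTree as ET


def wrap_entries(entries):
    """Wrap <e> entry strings in a <dictionary> root (section is assumed)."""
    body = "\n".join("    " + entry for entry in entries)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<dictionary>\n"
        '  <section id="main" type="standard">\n'
        + body + "\n"
        "  </section>\n"
        "</dictionary>\n"
    )


def is_well_formed(xml_text):
    """Return True if the text parses as XML (no DTD validation)."""
    try:
        ET.fromstring(xml_text.encode("utf-8"))
        return True
    except ET.ParseError:
        return False
```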
Comment 11 Tomi Pieski 2017-04-25 14:02:33 CEST
(In reply to Sjur Nørstebø Moshagen from comment #10)
> (In reply to Tomi Pieski from comment #9)
> > > 1) you should add the required root element and xmllint the result to verify
> > > that the structure is ok
> > 
> > Done.
> 
> Not completely. The correct root element is <dictionary>, and the linting
> command should be something like:
> 
> xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit
> 0;
> 
> That is, follow the dtd.

Dtd is not found in folder defined above.
Comment 15 Tomi Pieski 2017-04-27 11:37:45 CEST
(In reply to Tomi Pieski from comment #11)
> (In reply to Sjur Nørstebø Moshagen from comment #10)
> > (In reply to Tomi Pieski from comment #9)
> > > > 1) you should add the required root element and xmllint the result to verify
> > > > that the structure is ok
> > > 
> > > Done.
> > 
> > Not completely. The correct root element is <dictionary>, and the linting
> > command should be something like:
> > 
> > xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit
> > 0;
> > 
> > That is, follow the dtd.
> 
> Dtd is not found in folder defined above.

Dtd was found and XML is valid now.
Comment 16 Lene Antonsen 2017-04-28 15:31:08 CEST
Some comments based on the meeting today and revision 151881 after the meeting. First about PoS:

Be aware that we expect the target language (TL) word to be a base form (lemma), and usually the same PoS as the SL word. If the PoS differs, a linguist should look at the pair manually.

Here, the TL should be the noun 'fyr' and not the verb 'fyre':
<e><p><l>čuovgadoardna<s n="n"/><s n="sg"/><s n="nom"/></l><r>fyre<s n="v"/><s n="imp"/></r></p></e>

Here, the TL should be the verb 'rulle' and not the noun 'rull':
<e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>

And here there should only be tags for PoS and transitivity.
Comment 17 Lene Antonsen 2017-04-28 15:35:06 CEST
Correction to 'rulle' (instead of 'rull') as in my last comment.

> here; TL should be the verb 'rulle' and not the noun 'rulle'
> <e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s
> n="fem"/><s n="sg"/><s n="indef"/></r></p></e>

There are also examples of mismatch between SL verb and TL adjective
Comment 18 Lene Antonsen 2017-04-28 15:51:39 CEST
The list in revision 151881 of se-nb.dix has many duplicates:

bidix$ grep '<l>' se-nb.dix | tr '<' '>' | cut -d '>' -f7 | wc -l
    1678
bidix$ grep '<l>' se-nb.dix | tr '<' '>' | cut -d '>' -f7 |sort -u | wc -l
     324
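
The same count can be sketched in Python, which also shows which SL lemmas are repeated. This mirrors the shell pipeline above in taking the text right after <l> up to the next tag as the lemma; left_lemma_counts is a hypothetical helper, not part of TermWikiExporter.

```python
import re
from collections import Counter

# Text between "<l>" and the next "<", i.e. the SL lemma of an entry.
LEFT_LEMMA = re.compile(r"<l>([^<]*)<")


def left_lemma_counts(lines):
    """Count how often each SL lemma occurs in bidix entry lines."""
    counts = Counter()
    for line in lines:
        match = LEFT_LEMMA.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```

With counts = left_lemma_counts(open("se-nb.dix", encoding="utf-8")), sum(counts.values()) corresponds to the first number above and len(counts) to the second.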

Only base forms should be here, not inflected forms, e.g.:
<e><p><l>gednet<s n="v"/><s n="tv"/><s n="der_d"/><s n="v"/><s n="imprt"/><s n="sg2"/></l><r>avsone<s n="v"/><s n="inf"/></r></p></e>
Comment 19 Tomi Pieski 2017-04-28 21:03:15 CEST
(In reply to Lene Antonsen from comment #16)
> Some  comments based on the meeting today and revision 151881 after the
> meeting. First about PoS:
> 
> Be aware of that we expect the target language (TL) word to be a base form
> (lemma), and usually the same PoS as SL. If another PoS, then a linguist
> should look at it manually

New revision 151903 adds '<--CHECK THIS-->' to these.

> 
> here; TL should be the noun 'fyr' and not the verb 'fyre'
> <e><p><l>čuovgadoardna<s n="n"/><s n="sg"/><s n="nom"/></l><r>fyre<s
> n="v"/><s n="imp"/></r></p></e>
> 
Improved matching of the analysis base form.

> here; TL should be the verb 'rulle' and not the nouns 'rull'
> <e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s
> n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
> 
> and here should only be tags for PoS and transitivity

> The list in revision 151881 of se-nb.dix has many duplicates:

Many, or all, of these should be removed. Some duplicates have different target lemmas. Should they be removed too, and if so, how do we choose which one to remove?

The new version includes only certain cases. Are these cases defined somewhere, or could they be defined here? For now the analyses are filtered using this regex:
".*\\+[Sg|Pl].*\\+[Nom|Indef].*|.*\\+V(\\+TV|\\+IV)\\+Inf.*"
Comment 20 Lene Antonsen 2017-05-02 09:05:58 CEST
> > The list in revision 151881 of se-nb.dix has many duplicates:
> 
> Many, or all, of these should be removed. Some duplicates have different
> target lemma. Should they be removed, and how to choose which one to remove?

PROBLEM OF DUPLICATES:
TermWikiExporter$ grep '<r>' bidix/se-nb.dix |tr '<' '>' | wc -l
    2154
TermWikiExporter$ grep '<r>' bidix/se-nb.dix |tr '<' '>' | cut -d '>' -f7 |sort  -u | wc -l
    1179

Example of duplicates:
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="neu"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="mf"/><s n="sg"/><s n="indef"/></r></p></e>
 
PROBLEM OF OVERLAP WITH EXISTING ENTRIES:
We don't want overlap even if the TL lemma is different. Right now there is a big overlap. Example:

In se-nb.dix:
<e><p><l>čoahkkananluodda<s n="n"/><s n="sg"/><s n="nom"/></l><r>oppsamlingsveg<s n="n"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>

In apertium-sme-nob.sme-nob.dix
<e><p><l>čoahkkananluodda<s n="n"/></l><r>oppsamlingsvei<s n="n"/><s n="m"/></r></p></e>

PROBLEM WITH TAGGING LINGUISTIC ISSUES:
In apertium-sme-nob.sme-nob.dix there is a lot of linguistics, e.g. tags marking issues like
* mismatch of sg vs pl in SL and TL
* whether the nob noun should get an article in the indefinite form
* choosing m vs fem gender in nob
* mass nouns
* verbs: whether they are reflexive or causative, and whether or not they take a person as subject
* adjectives: whether they have comp and superl as morphology in nob
...


=> the list from termwiki has to be added to an income directory so the linguist can do the tagging.

FST FOR NOB:
apertium-sme-nob uses apertium-nob, not giella-nob
=> the pipeline should use apertium-nob FST
Comment 21 Lene Antonsen 2017-05-02 09:11:58 CEST
MORPH TAGS IN BIDIX:
There should only be PoS tags and, for verbs, also transitivity.
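
A short Python sketch of that reduction, assuming the first tag of each side is the PoS and that the transitivity tags are "tv" and "iv" ("iv" is an assumption; only "tv" appears in the examples in this report). reduce_tags is a hypothetical helper.

```python
TRANSITIVITY = {"tv", "iv"}  # "iv" is assumed, not seen in the examples


def reduce_tags(tags):
    """Keep the PoS tag and, for verbs, the first transitivity tag."""
    if not tags:
        return []
    pos = tags[0]
    kept = [pos]
    if pos == "v":
        transitivity = [tag for tag in tags[1:] if tag in TRANSITIVITY]
        kept.extend(transitivity[:1])
    return kept
```

Applied to the right-hand sides of the four enerådende entries in comment 20, this yields the same tag list for each, so the four entries collapse into one once identical pairs are removed.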
Comment 22 Tomi Pieski 2017-05-02 09:34:35 CEST
(In reply to Lene Antonsen from comment #20)
> > > The list in revision 151881 of se-nb.dix has many duplicates:
> > 
> > Many, or all, of these should be removed. Some duplicates have different
> > target lemma. Should they be removed, and how to choose which one to remove?
> 
> PROBLEM OF DUPLICATES:
> TermWikiExporter$ grep '<r>' bidix/se-nb.dix |tr '<' '>' | wc -l
>     2154
> TermWikiExporter$ grep '<r>' bidix/se-nb.dix |tr '<' '>' | cut -d '>' -f7
> |sort  -u | wc -l
>     1179
> 
> Example of duplicates:
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s
> n="a"/><s n="pos"/><s n="neu"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s
> n="a"/><s n="pos"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s
> n="a"/><s n="pos"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s
> n="a"/><s n="pos"/><s n="mf"/><s n="sg"/><s n="indef"/></r></p></e>
>  
> PROBLEM OF OVERLAP WITH EXISTING ENTRIES:
> We don't want overlap even if the TL lemma is different, now there is a big
> overlap. Example
> 
> In se-nb.dix:
> <e><p><l>čoahkkananluodda<s n="n"/><s n="sg"/><s
> n="nom"/></l><r>oppsamlingsveg<s n="n"/><s n="msc"/><s n="sg"/><s
> n="indef"/></r></p></e>
> 
> In apertium-sme-nob.sme-nob.dix
> <e><p><l>čoahkkananluodda<s n="n"/></l><r>oppsamlingsvei<s n="n"/><s
> n="m"/></r></p></e>
> 
> PROBLEM WITH TAGGING LINGUISTICS ISSUES
> In apertium-sme-nob.sme-nob.dix there is a lot of linguistics, e.g. tags
> marking issues like
> * mismatch sg vs pl in SL and TL
> * if the nob noun should get article in indefinite form
> * for choosing m vs fem gender in nob
> * massnoun
> * verbs: if they are reflexive, causative, if they will have a person or not
> as subject
> * adjectives: do they have comp and superl as morphology in nob
> .....
> 
> 
> => the list from termwiki has to be added to an income directory so the
> linguist can do the tagging.

It is in one income directory. If you are talking about a certain directory, please be more specific.
> 
> FST FOR NOB:
> apertium-sme-nob uses the apertium-nob, not the giella-nob
> => the pipeline should use apertium-nob FST

I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob directory. Where is this apertium-nob?
Comment 23 Sjur Nørstebø Moshagen 2017-05-02 10:01:16 CEST
(In reply to Tomi Pieski from comment #22)
> > => the list from termwiki has to be added to an income directory so the
> > linguist can do the tagging.
> 
> It is in one income directory. If you are talking about a certain directory,
> please be more specific.

Lene, Tomi has put the generated bidix files in:

(In reply to Tomi Pieski from comment #7)
> Termwiki term pairs are prepared in
> GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix

Another question is whether this is a good location - it is well hidden from the linguists (as demonstrated by Lene's comment).

> > FST FOR NOB:
> > apertium-sme-nob uses the apertium-nob, not the giella-nob
> > => the pipeline should use apertium-nob FST
> 
> I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob
> directory. Where is this apertium-nob?

./configure -h

You can build Apertium fst's as part of our infrastructure. These are not exactly the same as used by Apertium, but they are the starting point, so to speak. The fst's we build are intersected with the bidix fst's, so that the final Apertium analysers will only analyse those entries that are also in the bidix.

For the purposes of your work, that last restriction is actually counterproductive, so the Apertium fst we build in our infra should be exactly what we want.
Comment 24 Lene Antonsen 2017-05-02 10:10:50 CEST
To be specific: the apertium nob FST is built in apertium/languages/apertium-nob, not in the giella infrastructure.
Comment 25 Tomi Pieski 2017-05-02 13:07:02 CEST
(In reply to Sjur Nørstebø Moshagen from comment #23)
> (In reply to Tomi Pieski from comment #22)
> > > => the list from termwiki has to be added to an income directory so the
> > > linguist can do the tagging.
> > 
> > It is in one income directory. If you are talking about a certain directory,
> > please be more specific.
> 
> Lene, Tomi has put the generated bidix files in:
> 
> (In reply to Tomi Pieski from comment #7)
> > Termwiki term pairs are prepared in
> > GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
> 
> Another question is whether this is a good location - it is well hidden from
> the linguists (as demonstrated by Lene's comment).

It's the best and only location from the available options.

> 
> > > FST FOR NOB:
> > > apertium-sme-nob uses the apertium-nob, not the giella-nob
> > > => the pipeline should use apertium-nob FST
> > 
> > I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob
> > directory. Where is this apertium-nob?
> 
> ./configure -h
> 
> You can build Apertium fst's as part of our infrastructure. These are not
> exactly the same as used by Apertium, but they are the starting point, so to
> speak. The fst's we build are intersected with the bidix fst's, so that the
> final Apertium analysers will only analyse those entries that are also in
> the bidix.
> 
> For the purposes of your work, that last restriction is actually
> counterproductive, so the Apertium fst we build in our infra should be
> exactly what we want.

Coolio. Now using 'analyser-mt-gt-desc.hfstol' in tools/mt/apertium of each language.
Comment 26 Lene Antonsen 2017-05-02 16:16:59 CEST
(In reply to Lene Antonsen from comment #24)
> To be specific:
> The apertium nob FST is built in apertium/languages/apertium-nob
> and not in the giella-infrastructure

apertium/languages/apertium-nob

After compiling (make), the analysis command is:

lt-proc nob.automorf.bin
Comment 27 Lene Antonsen 2017-05-02 16:17:22 CEST
(In reply to Lene Antonsen from comment #24)
> To be specific:
> The apertium nob FST is built in apertium/languages/apertium-nob
> and not in the giella-infrastructure

apertium/languages/apertium-nob

After compiling (make), the analysis command is:

lt-proc nob.automorf.bin
Comment 28 Lene Antonsen 2017-05-04 09:31:48 CEST
apertium-nob$ lt-proc nob.automorf.bin
glatt
^glatt/glatte<vblex><imp>/glatt<adj><sint><pst><nt><sg><ind>/glatt<adj><sint><pst><mf><sg><ind>$


This command gives the apertium tags. For adjectives, "sint" should also be included, plus information about the paradigm, like this (the word pair is in se-nb.dix):

<e><p><l>livttis<s n="adj"/></l><r>glatt<s n="adj"/><s n="sint"/></r></p><par n="__adj"/></e>

Because the dix file includes much linguistic information, my suggestion is this: word pairs from termwiki whose SL part is not in the dix file should be exported to an incoming directory in apertium (e.g. apertium-sme-nob/incoming), and the linguist will then add the word pairs to the dix file. I suggest that the word list be in a format like this:

livttis adj : glatt adj
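
To get from the lt-proc output at the top of this comment to the lemma and PoS needed for such a list, the analyses can be split apart. The Python sketch below assumes one ^…$ unit per call and no escaped slashes inside lemmas; parse_lt_proc is a hypothetical helper, not an Apertium tool.

```python
import re


def parse_lt_proc(unit):
    """Split one '^surface/analysis/...$' unit into surface and analyses.

    Returns (surface, [(lemma, [tags...]), ...]).
    """
    text = unit.strip()
    if not (text.startswith("^") and text.endswith("$")):
        raise ValueError("not an lt-proc unit: %r" % unit)
    surface, *analyses = text[1:-1].split("/")
    parsed = []
    for analysis in analyses:
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed
```

For the glatt example this gives three analyses, one for the verb glatte and two for the adjective glatt; the first tag of each is the PoS that would go into the word list.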
Comment 29 Lene Antonsen 2017-05-04 09:40:27 CEST
> The wordpairs from termwiki, which SL part is not in the dix file, should be
> exported to an incoming directory in apertium (e.g.
> apertium-sme-nob/incoming) and then the linguist will add the wordpairs to
> the dix-file. I suggest that the word list could in a format like this:
> 
> livttis adj : glatt adj

For verbs the transitivity tag should also be included.

There could also be a second list for word pairs whose SL part is in the dix file but whose TL part differs from the one in the dix. This list would then be a resource for the linguist, who can check whether it contains word pairs she wants to add to the dix.
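
A Python sketch of the two lists, under the assumption that the existing bidix has already been read into a mapping from SL lemma to the set of TL lemmas it is paired with. format_pair and split_candidates are hypothetical names; the line format is the "livttis adj : glatt adj" shape suggested in comment 28.

```python
def format_pair(sl_lemma, sl_tags, tl_lemma, tl_tags):
    """Format one pair as 'lemma tags : lemma tags'."""
    return "%s %s : %s %s" % (
        sl_lemma, " ".join(sl_tags), tl_lemma, " ".join(tl_tags))


def split_candidates(pairs, bidix):
    """Sort pairs into (new, differing) lists of formatted lines.

    pairs: iterable of (sl_lemma, sl_tags, tl_lemma, tl_tags)
    bidix: dict mapping SL lemma -> set of TL lemmas already in the dix
    """
    new, differing = [], []
    for sl_lemma, sl_tags, tl_lemma, tl_tags in pairs:
        line = format_pair(sl_lemma, sl_tags, tl_lemma, tl_tags)
        if sl_lemma not in bidix:
            new.append(line)          # SL part is not in the dix at all
        elif tl_lemma not in bidix[sl_lemma]:
            differing.append(line)    # SL part is known, TL part differs
    return new, differing
```

Pairs that are already in the dix with the same TL lemma end up in neither list.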
Comment 30 Tomi Pieski 2017-05-15 11:15:25 CEST
(In reply to Lene Antonsen from comment #27)
> (In reply to Lene Antonsen from comment #24)
> > To be specific:
> > The apertium nob FST is built in apertium/languages/apertium-nob
> > and not in the giella-infrastructure
> 
> apertium/languages/apertium-nob
> 
> after compiling (make) the analysing command is:
> 
> lt-proc nob.automorf.bin

Compiling apertium fails on my Mac:

apertium_deshtml.cc:3963:2: error: ISO C++1z does not allow 'register' storage
      class specifier [-Wregister]
        register yy_state_type yy_current_state;
        ^~~~~~~~~
Comment 31 Sjur Nørstebø Moshagen 2018-05-28 09:11:50 CEST
Tomi is about to leave - handing this over to Børre.