We want to leverage the work done in the termwiki for our MT systems. One way of doing that is to automatically extract all term pairs for the language pairs we have MT systems for: sme-nob, sme-sma, sme-smj, sme-smn, sme-fin (please correct me if I missed something). For both target and source language it is important that the term is analysable and generatable, so this must be checked before the candidate pair is constructed. It must also be checked that the pair does not already exist in the bidix. Given this, new pairs should be constructed in bidix format and added to the apertium svn for the relevant pair, in a special bidix candidate file (there is no such file ATM, so just create one with a descriptive name).
Børre has made this script:
New Revision: 133146
Added: trunk/gt/script/termwiki-expressions.py
Log: Print se and fi entries from the daily termwiki dump to stdout. The se and fi output are separated by a tab. If there are several expressions in each language, they are separated by a comma.
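For anyone consuming that output downstream, a minimal sketch of reading it could look like the following. It assumes exactly the format described in the log message (se and fi separated by a tab, alternatives separated by commas). The function name `read_pairs` and the choice to pair every se expression with every fi expression are illustrative assumptions, not part of termwiki-expressions.py.

```python
from itertools import product


def read_pairs(lines):
    """Yield (se, fi) candidate pairs from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\n")
        if "\t" not in line:
            continue  # skip lines that do not follow the assumed format
        se_field, fi_field = line.split("\t", 1)
        se_terms = [t.strip() for t in se_field.split(",") if t.strip()]
        fi_terms = [t.strip() for t in fi_field.split(",") if t.strip()]
        # Assumption: each se expression may translate to each fi expression.
        yield from product(se_terms, fi_terms)
```

Typical use would be `read_pairs(sys.stdin)` with the script's output piped in.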
Moving satni.org bugs from the risten.no product to the satni.org product.
*** This bug has been marked as a duplicate of bug 2342 ***
This is not a duplicate. The crucial part is:
> prepare for inclusion in bidix
That is, you need to format the term pairs as a bidix entry fragment, i.e. XML. The XML fragment should be sent somewhere; presently I don't know where. I'll add Fran to the Cc list. Maybe he or Lene or Trond have ideas for where to put the bidix fragment.
Ok. What is bidix? Can you explain?
http://wiki.apertium.org/wiki/Northern_Sámi_and_Norwegian/bidix and http://wiki.apertium.org/wiki/Bilingual_dictionary
Termwiki term pairs are prepared in GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
(In reply to Tomi Pieski from comment #7)
> Termwiki term pairs are prepared in
> GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
$ ll $GTHOME/tools/TermWikiExporter/bidix/
total 6360
-rw-r--r--@ 1 smo036 1907360568 3,1M 18 apr 16:42 sme-nob.dix
It looks reasonable, but there are a couple of questions:
1) you should add the required root element and xmllint the result to verify that the structure is ok
2) have you ensured that all entries in the bidix are recognised by the analyser?
Also, in the initial bug report I requested bidix data for all language pairs for which we have MT systems, and you have so far only added one. I will reopen the bug report. Leave it open until we all agree that we have reached the goal as stated in the initial bug report.
(In reply to Sjur Nørstebø Moshagen from comment #8)
> (In reply to Tomi Pieski from comment #7)
> > Termwiki term pairs are prepared in
> > GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
>
> $ ll $GTHOME/tools/TermWikiExporter/bidix/
> total 6360
> -rw-r--r--@ 1 smo036 1907360568 3,1M 18 apr 16:42 sme-nob.dix
>
> It looks reasonable, but there are a couple of questions:
>
> 1) you should add the required root element and xmllint the result to verify
> that the structure is ok
Done.
> 2) have you ensured that all entries in the bidix are recognised by the
> analyser?
Only analysed entries are added.
> Also, in the initial bug report I requested bidix data for all language
> pairs for which we have MT systems, and you have so far only added one. I
> will reopen the bug report. Leave it open until we all agree that we have
> reached the goal as stated in the initial bug report.
Leaving this just here.
(In reply to Tomi Pieski from comment #9)
> > 1) you should add the required root element and xmllint the result to verify
> > that the structure is ok
>
> Done.
Not completely. The correct root element is <dictionary>, and the linting command should be something like:
xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit 0;
That is, follow the dtd.
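As a rough illustration of the two steps asked for here (add the root element, then validate against the DTD), a sketch in Python might look like this. The `<sdefs>` block and the `<section id="main" type="standard">` wrapper are my assumptions about what dix.dtd requires inside `<dictionary>`, and the function names are hypothetical. The xmllint flags and the DTD path are the ones given in the comment above.

```python
import shutil
import subprocess
import xml.etree.ElementTree as ET


def wrap_entries(entries):
    """Wrap <e> fragments in a <dictionary> root with <sdefs> and one <section>."""
    body = "\n".join(entries)
    # Parse with a throwaway root to collect the symbols used in <s n="..."/>.
    probe = ET.fromstring(f"<x>{body}</x>")
    symbols = sorted({s.get("n") for s in probe.iter("s")})
    sdefs = "\n".join(f'<sdef n="{n}"/>' for n in symbols)
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<dictionary>\n"
        f"<sdefs>\n{sdefs}\n</sdefs>\n"
        # Assumed section attributes; adjust to whatever the DTD demands.
        '<section id="main" type="standard">\n'
        f"{body}\n"
        "</section>\n"
        "</dictionary>\n"
    )


def validate(path, dtd="/usr/local/share/apertium/dix.dtd"):
    """Run xmllint with the flags from the comment above; True if valid."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint not found on PATH")
    result = subprocess.run(["xmllint", "--dtdvalid", dtd, "--noout", path])
    return result.returncode == 0
```

The output of `wrap_entries` would be written to the candidate file first, and `validate` then called on that file's path.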
(In reply to Sjur Nørstebø Moshagen from comment #10)
> (In reply to Tomi Pieski from comment #9)
> > > 1) you should add the required root element and xmllint the result to verify
> > > that the structure is ok
> >
> > Done.
>
> Not completely. The correct root element is <dictionary>, and the linting
> command should be something like:
>
> xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit 0;
>
> That is, follow the dtd.
The DTD is not found in the folder given above.
test
(In reply to Børre Gaup from comment #12) > test one more test
(In reply to Tomi Pieski from comment #11)
> (In reply to Sjur Nørstebø Moshagen from comment #10)
> > (In reply to Tomi Pieski from comment #9)
> > > > 1) you should add the required root element and xmllint the result to verify
> > > > that the structure is ok
> > >
> > > Done.
> >
> > Not completely. The correct root element is <dictionary>, and the linting
> > command should be something like:
> >
> > xmllint --dtdvalid /usr/local/share/apertium/dix.dtd --noout $FILE1 && exit 0;
> >
> > That is, follow the dtd.
>
> The DTD is not found in the folder given above.
The DTD was found and the XML is valid now.
Some comments based on the meeting today and revision 151881 after the meeting. First about PoS:
Be aware that we expect the target language (TL) word to be a base form (lemma), and usually the same PoS as SL. If another PoS, then a linguist should look at it manually.
here; TL should be the noun 'fyr' and not the verb 'fyre'
<e><p><l>čuovgadoardna<s n="n"/><s n="sg"/><s n="nom"/></l><r>fyre<s n="v"/><s n="imp"/></r></p></e>
here; TL should be the verb 'rulle' and not the noun 'rull'
<e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
and here should only be tags for PoS and transitivity
Correction to 'rulle' (instead of 'rull') as in my last comment.
> here; TL should be the verb 'rulle' and not the noun 'rulle'
> <e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
There are also examples of mismatch between SL verb and TL adjective.
The list in revision 151881 of se-nb.dix has many doublets:
bidix$ grep '<l>' se-nb.dix | tr '<' '>' | cut -d '>' -f7 | wc -l
1678
bidix$ grep '<l>' se-nb.dix | tr '<' '>' | cut -d '>' -f7 | sort -u | wc -l
324
Only base forms should be here, not inflected forms, e.g.
<e><p><l>gednet<s n="v"/><s n="tv"/><s n="der_d"/><s n="v"/><s n="imprt"/><s n="sg2"/></l><r>avsone<s n="v"/><s n="inf"/></r></p></e>
(In reply to Lene Antonsen from comment #16)
> Some comments based on the meeting today and revision 151881 after the
> meeting. First about PoS:
>
> Be aware that we expect the target language (TL) word to be a base form
> (lemma), and usually the same PoS as SL. If another PoS, then a linguist
> should look at it manually
New revision 151903 adds '<--CHECK THIS-->' to these.
> here; TL should be the noun 'fyr' and not the verb 'fyre'
> <e><p><l>čuovgadoardna<s n="n"/><s n="sg"/><s n="nom"/></l><r>fyre<s n="v"/><s n="imp"/></r></p></e>
Improved matching of the analysis base form.
> here; TL should be the verb 'rulle' and not the noun 'rull'
> <e><p><l>rullet<s n="v"/><s n="tv"/><s n="inf"/></l><r>rulle<s n="n"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
>
> and here should only be tags for PoS and transitivity
> The list in has many doublets: revision 151881 se-nb.dix
Many, or all, of these should be removed. Some duplicates have different target lemma. Should they be removed, and how to choose which one to remove?
New version includes only certain cases. Are these cases defined somewhere or could they be defined here? For now analyses are filtered using regex:
".*\\+[Sg|Pl].*\\+[Nom|Indef].*|.*\\+V(\\+TV|\\+IV)\\+Inf.*"
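A note on the regex quoted above: inside square brackets, `|` is a literal character and `[Sg|Pl]` matches any single one of `S`, `g`, `|`, `P`, `l`, so the pattern is looser than it looks. If the intent was to match the tags +Sg/+Pl followed by +Nom/+Indef (my reading, not confirmed in this thread), grouping expresses that. The sketch below compares the two in Python. The analysis strings are made-up examples in giella-style tag notation, and whole-string matching is an assumption.

```python
import re

# The pattern as quoted, with the string-literal escaping removed.
AS_QUOTED = re.compile(r".*\+[Sg|Pl].*\+[Nom|Indef].*|.*\+V(\+TV|\+IV)\+Inf.*")

# Same structure, but with groups instead of character classes.
WITH_GROUPS = re.compile(
    r".*\+(?:Sg|Pl).*\+(?:Nom|Indef).*|.*\+V(?:\+TV|\+IV)\+Inf.*"
)


def is_accepted(analysis, pattern=WITH_GROUPS):
    """True if the whole analysis string matches the given pattern."""
    return pattern.fullmatch(analysis) is not None
```

With the character-class version, a hypothetical analysis such as `viessu+N+Pl+Ill` is accepted because `+P` and `+I` satisfy the two classes; the grouped version rejects it.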
> > The list in has many doublets: revision 151881 se-nb.dix
>
> Many, or all, of these should be removed. Some duplicates have different
> target lemma. Should they be removed, and how to choose which one to remove?
PROBLEM OF DOUBLETS:
TermWikiExporter$ grep '<r>' bidix/se-nb.dix | tr '<' '>' | wc -l
2154
TermWikiExporter$ grep '<r>' bidix/se-nb.dix | tr '<' '>' | cut -d '>' -f7 | sort -u | wc -l
1179
Example of doublets:
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="neu"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
<e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="mf"/><s n="sg"/><s n="indef"/></r></p></e>
PROBLEM OF OVERLAP WITH EXISTING ENTRIES:
We don't want overlap even if the TL lemma is different; now there is a big overlap. Example:
In se-nb.dix:
<e><p><l>čoahkkananluodda<s n="n"/><s n="sg"/><s n="nom"/></l><r>oppsamlingsveg<s n="n"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
In apertium-sme-nob.sme-nob.dix:
<e><p><l>čoahkkananluodda<s n="n"/></l><r>oppsamlingsvei<s n="n"/><s n="m"/></r></p></e>
PROBLEM WITH TAGGING LINGUISTIC ISSUES:
In apertium-sme-nob.sme-nob.dix there is a lot of linguistics, e.g. tags marking issues like
* mismatch sg vs pl in SL and TL
* if the nob noun should get article in indefinite form
* for choosing m vs fem gender in nob
* massnoun
* verbs: if they are reflexive, causative, if they will have a person or not as subject
* adjectives: do they have comp and superl as morphology in nob
.....
=> the list from termwiki has to be added to an incoming directory so the linguist can do the tagging.
FST FOR NOB:
apertium-sme-nob uses the apertium-nob, not the giella-nob
=> the pipeline should use apertium-nob FST
MORPH TAGS IN BIDIX: There should only be PoS and for verbs also transitivity.
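To illustrate how this rule interacts with the doublet problem, here is a small sketch that strips every `<s>` tag except the PoS (and a transitivity tag for verbs) and then drops entries that have become identical. It assumes the first `<s>` in `<l>`/`<r>` is always the PoS and that `tv`/`iv` are the transitivity symbols, as in the entries quoted in this thread. The function names are hypothetical, and this is not the TermWikiExporter code.

```python
import xml.etree.ElementTree as ET

TRANSITIVITY = {"tv", "iv"}  # assumed transitivity symbols


def strip_morph_tags(side):
    """Keep only the first <s> (PoS) and, for verbs, one transitivity tag."""
    symbols = side.findall("s")
    names = [s.get("n") for s in symbols]
    keep = names[:1]
    if keep == ["v"]:
        keep += [n for n in names[1:] if n in TRANSITIVITY][:1]
    for s in symbols:
        side.remove(s)
    for n in keep:
        ET.SubElement(side, "s", {"n": n})


def reduce_entries(entries):
    """Reduce the tags of each <e> string and drop exact duplicates, keeping order."""
    seen, result = set(), []
    for text in entries:
        entry = ET.fromstring(text)
        for side in entry.findall(".//l") + entry.findall(".//r"):
            strip_morph_tags(side)
        reduced = ET.tostring(entry, encoding="unicode")
        if reduced not in seen:
            seen.add(reduced)
            result.append(reduced)
    return result
```

Under these assumptions the four aktoráđálaš/enerådende entries quoted earlier, which differ only in TL gender, collapse into a single entry.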
(In reply to Lene Antonsen from comment #20)
> > > The list in has many doublets: revision 151881 se-nb.dix
> >
> > Many, or all, of these should be removed. Some duplicates have different
> > target lemma. Should they be removed, and how to choose which one to remove?
>
> PROBLEM OF DOUBLETS:
> TermWikiExporter$ grep '<r>' bidix/se-nb.dix | tr '<' '>' | wc -l
> 2154
> TermWikiExporter$ grep '<r>' bidix/se-nb.dix | tr '<' '>' | cut -d '>' -f7 | sort -u | wc -l
> 1179
>
> Example of doublets
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="neu"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="fem"/><s n="sg"/><s n="indef"/></r></p></e>
> <e><p><l>aktoráđálaš<s n="a"/><s n="sg"/><s n="nom"/></l><r>enerådende<s n="a"/><s n="pos"/><s n="mf"/><s n="sg"/><s n="indef"/></r></p></e>
>
> PROBLEM OF OVERLAP WITH EXISTING ENTRIES:
> We don't want overlap even if the TL lemma is different, now there is a big
> overlap. Example
>
> In se-nb.dix:
> <e><p><l>čoahkkananluodda<s n="n"/><s n="sg"/><s n="nom"/></l><r>oppsamlingsveg<s n="n"/><s n="msc"/><s n="sg"/><s n="indef"/></r></p></e>
>
> In apertium-sme-nob.sme-nob.dix
> <e><p><l>čoahkkananluodda<s n="n"/></l><r>oppsamlingsvei<s n="n"/><s n="m"/></r></p></e>
>
> PROBLEM WITH TAGGING LINGUISTIC ISSUES
> In apertium-sme-nob.sme-nob.dix there is a lot of linguistics, e.g. tags
> marking issues like
> * mismatch sg vs pl in SL and TL
> * if the nob noun should get article in indefinite form
> * for choosing m vs fem gender in nob
> * massnoun
> * verbs: if they are reflexive, causative, if they will have a person or not
> as subject
> * adjectives: do they have comp and superl as morphology in nob
> .....
>
> => the list from termwiki has to be added to an incoming directory so the
> linguist can do the tagging.
It is in one incoming directory. If you are talking about a certain directory, please be more specific.
> FST FOR NOB:
> apertium-sme-nob uses the apertium-nob, not the giella-nob
> => the pipeline should use apertium-nob FST
I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob directory. Where is this apertium-nob?
(In reply to Tomi Pieski from comment #22)
> > => the list from termwiki has to be added to an incoming directory so the
> > linguist can do the tagging.
>
> It is in one incoming directory. If you are talking about a certain directory,
> please be more specific.
Lene, Tomi has put the generated bidix files in:
(In reply to Tomi Pieski from comment #7)
> Termwiki term pairs are prepared in
> GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
Another question is whether this is a good location - it is well hidden from the linguists (as demonstrated by Lene's comment).
> > FST FOR NOB:
> > apertium-sme-nob uses the apertium-nob, not the giella-nob
> > => the pipeline should use apertium-nob FST
>
> I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob
> directory. Where is this apertium-nob?
./configure -h
You can build Apertium FSTs as part of our infrastructure. These are not exactly the same as used by Apertium, but they are the starting point, so to speak. The FSTs we build are intersected with the bidix FSTs, so that the final Apertium analysers will only analyse those entries that are also in the bidix.
For the purposes of your work, that last restriction is actually counterproductive, so the Apertium FST we build in our infra should be exactly what we want.
To be specific: The apertium nob FST is built in apertium/languages/apertium-nob and not in the giella-infrastructure.
(In reply to Sjur Nørstebø Moshagen from comment #23)
> (In reply to Tomi Pieski from comment #22)
> > > => the list from termwiki has to be added to an incoming directory so the
> > > linguist can do the tagging.
> >
> > It is in one incoming directory. If you are talking about a certain directory,
> > please be more specific.
>
> Lene, Tomi has put the generated bidix files in:
>
> (In reply to Tomi Pieski from comment #7)
> > Termwiki term pairs are prepared in
> > GTHOME/langtech/tools/TermWikiExporter/bidix/sme-nob.dix
>
> Another question is whether this is a good location - it is well hidden from
> the linguists (as demonstrated by Lene's comment).
It's the best and only location from the available options.
> > > FST FOR NOB:
> > > apertium-sme-nob uses the apertium-nob, not the giella-nob
> > > => the pipeline should use apertium-nob FST
> >
> > I don't use giella-nob, I use 'analyser-gt-norm.hfstol' in giella/nob
> > directory. Where is this apertium-nob?
>
> ./configure -h
>
> You can build Apertium FSTs as part of our infrastructure. These are not
> exactly the same as used by Apertium, but they are the starting point, so to
> speak. The FSTs we build are intersected with the bidix FSTs, so that the
> final Apertium analysers will only analyse those entries that are also in
> the bidix.
>
> For the purposes of your work, that last restriction is actually
> counterproductive, so the Apertium FST we build in our infra should be
> exactly what we want.
Coolio. Now using 'analyser-mt-gt-desc.hfstol' in tools/mt/apertium of each language.
(In reply to Lene Antonsen from comment #24)
> To be specific;
> The apertium nob FST is built in apertium/languages/apertium-nob
> and not in the giella-infrastructure
apertium/languages/apertium-nob
after compiling (make) the analysing command is:
lt-proc nob.automorf.bin
apertium-nob$ lt-proc nob.automorf.bin
glatt
^glatt/glatte<vblex><imp>/glatt<adj><sint><pst><nt><sg><ind>/glatt<adj><sint><pst><mf><sg><ind>$
This command gives the apertium tags, and for adjectives "sint" should also be included, plus info about paradigm, like this (the word pair is in se-nb.dix):
<e><p><l>livttis<s n="adj"/></l><r>glatt<s n="adj"/><s n="sint"/></r></p><par n="__adj"/></e>
Because the dix file includes much linguistic information, my suggestion is: The word pairs from termwiki whose SL part is not in the dix file should be exported to an incoming directory in apertium (e.g. apertium-sme-nob/incoming), and then the linguist will add the word pairs to the dix file. I suggest that the word list could be in a format like this:
livttis adj : glatt adj
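For the pipeline side, the lt-proc output shown in this comment could be reduced to lemma-plus-PoS candidates with something like the sketch below. It only handles a single `^...$` unit with unescaped `/` separators. Treating an analysis as a base-form candidate when its lemma equals the surface form is a heuristic of mine, not something decided in this thread, and the function names are hypothetical.

```python
import re


def parse_lt_proc(unit):
    """Split one ^surface/analysis/...$ unit into (surface, [(lemma, tags), ...])."""
    match = re.fullmatch(r"\^(.*)\$", unit.strip())
    if match is None:
        raise ValueError(f"not an lt-proc lexical unit: {unit!r}")
    surface, *analyses = match.group(1).split("/")
    parsed = []
    for analysis in analyses:
        if analysis.startswith("*"):
            continue  # unknown word: lt-proc echoes the form with a leading '*'
        lemma = analysis.split("<", 1)[0]
        tags = re.findall(r"<([^<>]+)>", analysis)
        parsed.append((lemma, tags))
    return surface, parsed


def base_form_candidates(unit):
    """Return {(lemma, pos)} for analyses whose lemma equals the surface form."""
    surface, parsed = parse_lt_proc(unit)
    return {(lemma, tags[0]) for lemma, tags in parsed if lemma == surface and tags}
```

For the `glatt` example this keeps the adjective reading and drops the imperative of `glatte`, which matches the expectation that the TL word should be a lemma.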
> The wordpairs from termwiki, which SL part is not in the dix file, should be
> exported to an incoming directory in apertium (e.g.
> apertium-sme-nob/incoming) and then the linguist will add the wordpairs to
> the dix-file. I suggest that the word list could in a format like this:
>
> livttis adj : glatt adj
For verbs the transitivity tag should also be included.
And there could be a second list for word pairs whose SL part is in the dix file, but whose TL part is not the same as in the dix. This list would then be a resource for the linguist, to check if there are word pairs which she wants to add to the dix.
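The two lists proposed here could be produced roughly as follows. This is only a sketch: candidates are assumed to arrive as `(sl_lemma, sl_tags, tl_lemma, tl_tags)` tuples, the existing bidix is matched on lemma only (ignoring PoS), and all names are hypothetical.

```python
import xml.etree.ElementTree as ET


def lemma(side):
    """Lemma text of an <l>/<r> element, treating <b/> as a space."""
    parts = [side.text or ""]
    for child in side:
        if child.tag == "b":
            parts.append(" ")
        parts.append(child.tail or "")
    return "".join(parts).strip()


def existing_pairs(bidix_xml):
    """Map each SL lemma in the bidix to the set of its TL lemmas."""
    table = {}
    for pair in ET.fromstring(bidix_xml).iter("p"):
        left, right = pair.find("l"), pair.find("r")
        if left is not None and right is not None:
            table.setdefault(lemma(left), set()).add(lemma(right))
    return table


def split_candidates(candidates, bidix_xml):
    """Return (new, differing) lists in the 'sl tags : tl tags' format."""
    known = existing_pairs(bidix_xml)
    new, differing = [], []
    for sl, sl_tags, tl, tl_tags in candidates:
        line = f"{sl} {sl_tags} : {tl} {tl_tags}"
        if sl not in known:
            new.append(line)        # SL lemma missing from the dix
        elif tl not in known[sl]:
            differing.append(line)  # SL lemma present, but with another TL lemma
    return new, differing
```

With the čoahkkananluodda example from earlier in the thread, the termwiki pair with `oppsamlingsveg` would land in the second list, since the dix already has that SL lemma with `oppsamlingsvei`.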
(In reply to Lene Antonsen from comment #27)
> (In reply to Lene Antonsen from comment #24)
> > To be specific;
> > The apertium nob FST is built in apertium/languages/apertium-nob
> > and not in the giella-infrastructure
>
> apertium/languages/apertium-nob
>
> after compiling (make) the analysing command is:
>
> lt-proc nob.automorf.bin
Compiling apertium fails on my Mac:
apertium_deshtml.cc:3963:2: error: ISO C++1z does not allow 'register' storage class specifier [-Wregister]
        register yy_state_type yy_current_state;
        ^~~~~~~~~
Tomi is about to leave - handing this over to Børre.