Bug 283 - Several Min Áigi files are not converted properly.
Summary: Several Min Áigi files are not converted properly.
Status: RESOLVED FIXED
Alias: None
Product: Corpus
Classification: Unclassified
Component: xml conversion
Version: unspecified
Hardware: Macintosh MacOS X
Importance: P2 - As soon as possible normal
Assignee: Saara Huhmarniemi
URL:
Keywords:
Duplicates: 307
Depends on: 76 279
Blocks:
Reported: 2006-05-10 15:08 CEST by Maaren Palismaa
Modified: 2006-11-04 12:44 CET
Description Maaren Palismaa 2006-05-10 15:08:58 CEST
These are the files:
•AJ-sverre_porsanger.txt.xml
•AJ-vuoddjit.txt.xml
•ÅP-dynamittfond.txt.xml
•ÅP-Gåte.txt.xml
•ÅP-sponsoravtale.txt.xml
•IU-Bb_boazodoaluseminára.txt.xml
•Kronihkka-JK-terror.txt.xml
•Lohkki_János_+_reinjakt,_sami.txt.xml
•NHM-_Finnmarkoláhka.txt.xml
•NHM-Rigoberta.txt.xml
•sn-ealgaovttasbargu.txt.xml
•Uhca_-_Odda_Gaba.txt.xml
•AJ-ruoná_ivdni.txt.xml
•AJ-sabetskeanka.txt.xml
•AJ-skábma.txt.xml
•ÅP-GDG_seastin-duppalávvudeapm.txt.xml
•ÅP-liikabeana
•ÅP-sunniva
•ÅP-telefun_geazis.txt.xml
•IU-lasse_berit.txt.xml
•IU-veahki_haga.txt.xml
•IU-veahki_haga4.txt.xml
•JK_-_girjearvvostallan.txt.xml
•leder_6
•Leserinnlegg_-_Eva_Nielsen.txt.xml
•Lohkki_-_EU.txt.xml
•uhca-ÅP-romssa_taxi.txt.xml
•Uhca_-_duodji.txt.xml
•20_jagi_das_ovdal007.txt.xml
•AJ-deantogáttenuorat.txt.xml
•AJ-play-boy.txt.xml
•alm_NRK-sami_radio.txt.xml
•alm_prod_sjef,_3177_tegn.txt.xml
•ÅP-IDOL-Ellen_Marie_Eira.txt.xml
•ÅP-oiva_ohcan-eai_gavdnan.txt.xml
•ÅP-vajalduhttan_sámi_álb.txt.xml
•ÅP-veagalvaldimat.txt.xml
•ÅP-Walk_of_Fame.txt.xml
•IU-cuovvolan.txt.xml
•IU-dalve_ealgabivdu.txt.xml
•IU-jápmin_karasjogas.txt.xml
•KMO-measttir.txt.xml
•MÁrkka.txt.xml
•NHM-_ohcejoga_sátnejodiheaddji.txt.xml
•AEN-setter_dagsorden_-_sami.txt.xml
•AJ-álbmotbeaivi_suomas.txt.xml
•AJ-skihppagurra_festivála.txt.xml
•ÅP-dusse_okta_cevzzii.txt.xml
•ÅP-intro-álbmot.txt.xml
•ÅP-johkamohkemarkanat.txt.xml
•åpning,_sami.txt.xml
•Durham.txt.xml
•Emi_-_samisk.txt.xml
•filmspektakel,_sami.txt.xml
•horoskop_uke5_sami.txt.xml
•IU-boazosiehtadus4.txt.xml
•IU-NYYY_boazosiehtadus.txt.xml
•kaos,_sami.txt.xml
•kaos.txt.xml
•liv_på_flygende_teppe.txt.xml
•Program_-_barents_spetakel.txt.xml
•seminar.txt.xml
•spekulær_åpning.txt.xml
•UHCA-_MÅ_MED_FREDAG!!!.txt.xml
•uhca-ÅP-SGP2005.txt.xml
•UHCA-spor_i_snø.txt.xml
•Uhca_-_IFI.txt.xml
•Utstilling_-_sami.txt.xml
•Wimme_Sari_+_samefolketsdag.txt.xml
•AJ-maori_filmmat.txt.xml
•AJ-moana.txt.xml
•AJ-odda_filbmadahkkit.txt.xml
•AJ-ruoná_ivdni.txt.xml
•alm-Tana_NSR.txt.xml
•alm_beaivvas.txt.xml
•ÅP-dalvefestivala.txt.xml
•ÅP-Håvard_Klemetsen.txt.xml
•govvaraiddut_0009
•IU-beatnagat_hearggit.txt.xml
•IU-NBR_nubbejodiheaddji.txt.xml
•IU-ráhkkanit_boazonealgái.txt.xml
•Leder_9
•LOHKKI.txt.xml
•NHM-Gurut_golbma.txt.xml
•NHM-Solturnering.txt.xml
•PP-manaid_TV.txt.xml
•20_jagi_das_ovdal010.txt.xml
•AJ-katja.txt.xml
•ÅP-Johkamohkki.txt.xml
•HAL-_Boston-ny.txt.xml
•horoskop_uke_6,_sami.txt.xml
•IU-Beredskap_i_Karsjok.txt.xml
•IU-Sæther_beredskap.txt.xml
•leder_10
•NHM-veakki_asiai.txt.xml
•20_jagi_das_ovdal011.txt.xml
•AJ-_ovddesfilmmat+govva.txt.xml
•AJ-ann_helene.txt.xml
•AJ-giellabeassi.txt.xml
•AJ-kárásjogas.txt.xml
•AJ-mánát_sámedikkis.txt.xml
•alm_nasj_park,_R_fylkkamanni.txt.xml
•IU-Heargevuodjin.txt.xml
•LOHKKI_-_TERJE_TRETNES,_sami.txt.xml
•lohkki_Lásses.txt.xml
•lohkki_nrk-ii.txt.xml
•MLA_-_Zoya.txt.xml
•NHM-Sállosa.txt.xml
•NHM_-luktfri_møkkaspre,_sami.txt.xml
•20_jagi_dassái.txt.xml
•AJ-Guttorm.txt.xml
•AJ-manaid_mánna.txt.xml
•AJ-per_iver_turi.txt.xml
•AJ-Ravdna.txt.xml
•ÅP-christer.txt.xml
•ÅP-duodji-matki_aiggi_cada.txt.xml
•ÅP-Min_Áigi_ovdána.txt.xml
•ÅP-Zapp_me.txt.xml
•HRM-Utstilling_i_Tromsö,_sami.txt.xml
•IU-massan_doarjaga.txt.xml
•IU-massan_doarjaga3.txt.xml
•IU-Mathis_Ailu.txt.xml
•Leder_13
•lohkki_NAS.txt.xml
•Ny-AJ-per_iver_turi.txt.xml
•PP-Erkke_Ánde_2.txt.xml
•privahta_almmuhusa,_3_geardde.txt.xml
•_20__jagi_das_ovdal.txt.xml
•AJ-ragnild_lydia_Nystad.txt.xml
•ÅP-kitok_veaddeduodji.txt.xml
•ÅP-Sámedikkit_deaivvadit.txt.xml
•IU-10000_rein.txt.xml
•IU-biedganan.txt.xml
•IU-boazodoallosiehtadus.txt.xml
•Kronikk_-_Helga_Pedersen,_sami.txt.xml
•lohkki,_16
•MLA-Hammerfeast_satnejodiheadd.txt.xml
•MLA-Olli_ja_fala.txt.xml
•NHM-musihkkahoavda.txt.xml
•uhca-ÅP-nissonat_mahttet.txt.xml
•uhca-ÅP-nordfors-MÅ_MED.txt.xml
-F•crossbane,_sami.txt.xml
-F•Fakta_om_drag,_sami.txt.xml
•_FP-_hyperrask_bane,_sami.txt.xml
•Fakta_om_drag,_sami1.txt.xml
•AJ-Egil_Utsi.txt.xml
•AJ-midttun.txt.xml
•AJ-Nils_Utsi.txt.xml
•alm_Altta_siida.txt.xml
•alm_rabas_virggit,_Alta.txt.xml
•ann_kultur,_318__tegn.txt.xml
•ÅP-doarjjadoalut.txt.xml
•Diedut_govvamuituide.txt.xml
•IU-anarjoga_aidi3.txt.xml
•IU-anárjoga_áidi.txt.xml
•IU-anárjoga_áidi2.txt.xml
•NHM-gollenieida_steira.txt.xml
•NHM-kristin_áhcci.txt.xml
•NHM-Odda_NRK_hoavddat.txt.xml
•NHM-Rábmo_váhnemiid.txt.xml
•NHM-Sátnejodiheaddji.txt.xml
•UHCA_-_coop_utbeta,_sami.txt.xml
•20_jagi_DÁS_OVDAL.txt.xml
•AJ-harriet.txt.xml
•IU-Johttan_davas.txt.xml
•IU-laikes_boazodoallit.txt.xml
•IU-MNS_duhtavas.txt.xml
•IU-Sponheim_KTK-as.txt.xml
•leder_nr_17
•MLA-STK_okkupasjon.txt.xml
•NHM-Vintertur_i_fokus,_sami.txt.xml
•UHCA_-_NB_FREDAG!!!!.txt.xml
Comment 1 Saara Huhmarniemi 2006-05-11 10:15:32 CEST
Most of these files are now fixed. The files have been renamed so that the leading dot is replaced with an underscore (_). Some of the MinAigi files are still unconverted (because of remaining filename, Perl and character encoding problems), and there is work to be done with the format (@-tags). However, all the files that are converted to xml should now be analyzable. I leave the bug open until the rest of the problems are solved.
Comment 2 Saara Huhmarniemi 2006-05-15 15:04:39 CEST
The @-tags are now taken into account in the conversion process. There may be some errors, e.g. due to a missing @ in front of a keyword in the original document. The extra xml-tags are removed as well (e.g. \!q>). The 2003 files are now reconverted; the other directories will follow.
Comment 3 Trond Trosterud 2006-05-16 15:14:12 CEST
ccat -r zcorp/gtbound/sme/news/MinAigi/2003/ | less
10A oahppit leat dlvi mieht rhkkanan klssamtki E_landii. Sii leat ovttas vhnemiiguin _oaggn ru_aid, loaddavuovdi
=========> all sámi characters are lost (dálvi, ráhkkanan, klássamátkái, Eŋlandii).
Comment 4 Trond Trosterud 2006-05-16 15:26:57 CEST
Sorry, my last message was accidentally written on G5, not on victorio (hard to see the difference...). On victorio, everything works fine:
Golbma lávvardaga maŋŋálágaid čájeha TV2 sámi dokumentáraid. Ihttin diibmu 13.40 lea vuosttaš oassi. Mii guovlalat sihke
 Norggas, Suomas, Ruoŧas ja Ruoššas. Dás lea kultuvra, nugo giella, luohti, dálkkudanvuogit ja sámi bajásgeassin guovddá
žis. Gitta 1956 rádjai eai beassan sámi mánát sámástit skuvllas. Jus dahke dan de ráŋggáštuvvoje. Dasa lassin máhccat ru
Comment 5 Saara Huhmarniemi 2006-05-16 18:20:55 CEST
The first analysis was correct. There is a real problem with at least some of the 2003-files, like
gtbound/sme/news/MinAigi/2003/10A_på_klassetur.txt.xml

The Sámi characters are lost somewhere during the process. I'll see what's wrong.
Comment 6 Saara Huhmarniemi 2006-05-19 14:56:51 CEST
The problem with this file is that it contains no sámi characters except á. The process of guessing the encoding is based on counting the occurrences of sámi characters, and since there are none, it fails. I have now added á to the set of tested sámi characters. It was not there before, since it is often correctly encoded even when the rest of the document is not. The statistics should handle the change without errors.
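A minimal sketch of the counting idea described above, not the actual conversion code. The candidate encodings, the function names and the character set are assumptions chosen for illustration; the real tool presumably tests other (Sámi-specific) encodings.

```python
# Hypothetical sketch: guess an encoding by counting Sámi letters.
# CANDIDATES and SAMI_CHARS are illustrative assumptions, not the
# values used by the real conversion scripts.
SAMI_CHARS = set("áčđŋšŧžÁČĐŊŠŦŽ")
CANDIDATES = ("utf-8", "iso8859_10", "mac_roman")


def count_sami(text, sami_chars=SAMI_CHARS):
    """Count how many characters of text are in the tested set."""
    return sum(1 for ch in text if ch in sami_chars)


def guess_encoding(raw, candidates=CANDIDATES, sami_chars=SAMI_CHARS):
    """Return the candidate whose decoding yields the most Sámi letters.

    Returns None when no candidate yields any, which is the failure
    described above for a file whose only Sámi letter is untested.
    """
    best, best_score = None, 0
    for enc in candidates:
        try:
            text = raw.decode(enc)
        except UnicodeDecodeError:
            continue  # the bytes are not valid in this encoding
        score = count_sami(text, sami_chars)
        if score > best_score:
            best, best_score = enc, score
    return best
```

With á left out of the tested set, a text such as "dálvi ráhkkanan" scores zero under every candidate and the guess fails; with á included, the correct candidate gets a non-zero score.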
Comment 7 Saara Huhmarniemi 2006-06-05 16:18:18 CEST
I'm not able to determine the encoding of the following files:

MinAigi/2003/Ássi_gáldus.txt
MinAigi/2003/Eldrebølgen.doc

what do you think?
Comment 8 Trond Trosterud 2006-06-06 00:12:43 CEST
I copied the two files to my local machine and had a look at them. Eldrebølgen.doc (a file in Norwegian, btw.) opened on my local Mac without problems (the command "open Eldrebølgen.doc" rendered it fine in Word, with all æøå-s in place). I thus do not understand why it cannot be converted; it should be a routine task.

As for Ássi_gáldus.txt, it turned out to be harder. The caron letters (š, ž) came out as sˇ and zˇ, and the other ones as identical question marks. It seems the document started out as e.g. Winsam (or even UTF-8), and was then perhaps opened in a Mac Classic version of some program. I remember seeing the "delayed carons" when opening Sámi UTF-8 pages in a web browser in Mac OS 9. The real question is of course what happened to the other 5 letters. If they DO have different representations, we may dig out the correct values, but if they are all reduced to the same question mark (I don't have a hex editor), then we will have to drop this (and similar) file(s).
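Without a hex editor, the question of whether the remaining letters still have distinct byte values could be checked with a few lines of Python. This is only a sketch; the function name and the choice to also count the literal "?" and "_" bytes are my own assumptions.

```python
from collections import Counter


def byte_histogram(raw, extra=b"?_"):
    """Count every non-ASCII byte, plus literal '?' and '_' bytes.

    Several distinct high bytes suggest the letters can still be told
    apart; a single value (or only 0x3F / 0x5F) suggests they have
    been conflated.
    """
    counts = Counter(b for b in raw if b >= 0x80 or b in extra)
    return {f"0x{b:02X}": n for b, n in counts.most_common()}
```

Usage would be along the lines of `byte_histogram(open(path, "rb").read())`, where `path` points at the suspect file.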
Comment 9 Trond Trosterud 2006-06-15 09:40:25 CEST
*** Bug 307 has been marked as a duplicate of this bug. ***
Comment 10 Børre Gaup 2006-06-15 09:53:06 CEST
For those documents that have such serious encoding problems that they're not usable, I suggest we edit them manually, then regenerate them ...
Comment 11 Trond Trosterud 2006-06-15 10:01:27 CEST
My comments to the duplicate bug did not carry over here. The thing is that the 6 Sámi letters all seem to have been replaced with the same character "_" (underscore). Of course, it is possible to read through the files and fill in the missing characters manually, but at the moment I do not consider it a sensible way of spending time. In case future historians want to use our bases we may perhaps keep them in the orig repository, but in the derived version they are just noise, and they should not be generated. We may thus mark them as "do not generate" in their respective xsl files, or (the easy solution) we may just remove them from the orig catalogue. Removing things, as a principle, feels bad (it seems someone always comes and wants a second look just after we have deleted them), so if we could have a "do-not-generate"-type xsl file for them instead, it could perhaps be ok. I add my original comment from the duplicate bug here:
Comment 12 Trond Trosterud 2006-06-15 10:02:26 CEST
It seems the following 2003 files have got the Sámi characters conflated. Cf.
the following text snip, where all Sámi characters except á are represented
by underscore:

10A_på_klassetur.txt.xml:    <p>10A oahppit leat dálvi miehtá ráhkkanan
klássamátkái E_landii. Sii leat ovttas váhnemiiguin _oaggán ru_aid,
loaddavuovdima ja kafea bargguin. Lea oalle rah__amu_ leama_an gártadit
ru_aid gok_at mátkegoluid, muhto á_gir – ja vi__alvuo_ain leat gártadan
dan maid sii dárbbá_edje mátkái. Ulbmil mátkiin lei beassat geavahit
e_gelasgiela ja maiddái oahpásmuvvat eará kultuvrrain. </p>


For 2003, it is 51 out of 1632 files; for 2004 it is 6 out of approx. 6600 files.
I suggest we look through the MinAigi corpus, check whether the files are
garbled beyond rescue, and then remove them from the corpus (or possibly keep
them in the orig but blocked from being generated).

grep "__" * | cut -d":" -f1 | uniq | l

10A_på_klassetur.txt.xml
alm_Allaskuvla.txt.xml
alm_De_samiske_samlinger.txt.xml
alm_Finnm_AP.txt.xml
alm_Finnm_miljøtj.txt.xml
alm_Karasjok_kom_6_feb_prog_NY.txt.xml
alm_Nesseby_komm.txt.xml
alm_reindr_agronom.txt.xml
alm_Sami_Daiddaraddi.txt.xml
alm_Sámi_giellaguovddas.txt.xml
alm_Sami_Instituhtta.txt.xml
alm_SDGBF.txt.xml
alm_-_SDR.txt.xml
alm_suoma_samediggi.txt.xml
alm_Urfolksenter.txt.xml
alm_Vardobaiki.txt.xml
arvvostallan2_copy.txt.xml
ceavgegeadgi_-_notis.txt.xml
deanu_cealkamus.txt.xml
Dearv_Valentina.txt.xml
EA_020202_manna.txt.xml
EA-nordlys_ut.txt.xml
Finnmarksloven.txt.xml
Folk_er_folk_-_kulturforskjelle.doc.xml
Fredagsavisa-lohkkicalus_SMM.txt.xml
Giitu_-_Anne_lise.txt.xml
LESERINNLEGG.txt.xml
lohkki-bieggamillot.txt.xml
Lohkkicalus2.txt.xml
Lohkkicalus.txt.xml
lohkkiid-anders_jh_eira.txt.xml
LOhkkiid_privahta_skuvla.txt.xml
lohkki-kirku.txt.xml
lujavri_aviissa_haga.txt.xml
Máret_Sara-_Lohkkicalus.txt.xml
Marka103.txt.xml
Muitosatni.txt.xml
PP-gielddahoavda_copy.txt.xml
presideanta_sárdni_nr_1.txt.xml
revsnesham_samisk.txt.xml
riddu_riddu_samisk.txt.xml
s12_Skoleportala.txt.xml
s13_Garegasnjargga_bankku.txt.xml
s13_LAN.txt.xml
s16_Info_nuorra.txt.xml
s17_Gran_Canaria.txt.xml
s17_Raste.txt.xml
sametinget_HASTER.txt.xml
SAN-gaskasiidu_rep.txt.xml
uhca-_sami_allaskuvla_ap.txt.xml
uhca-_vitenskap_i_kauto.txt.xml
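
A slightly stricter variant of the grep above could be sketched in Python as follows. It only flags underscores that occur inside or next to a word, so a bare run of underscores used as a rule or blank line is not counted. The function names and the `min_hits` threshold are assumptions for illustration; underscores in xml attribute names or embedded filenames would still show up as false positives.

```python
import re
from pathlib import Path


def garbled_words(text):
    """Return tokens that contain both an underscore and a letter."""
    return [w for w in re.findall(r"\w+", text)
            if "_" in w and any(ch.isalpha() for ch in w)]


def garbled_files(directory, min_hits=1):
    """List the *.xml files in directory with at least min_hits such tokens."""
    hits = []
    for path in sorted(Path(directory).glob("*.xml")):
        text = path.read_text(encoding="utf-8", errors="replace")
        if len(garbled_words(text)) >= min_hits:
            hits.append(path.name)
    return hits
```

Raising `min_hits` would be one way to skip files where an underscore appears only once or twice for some other reason.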
Comment 13 Saara Huhmarniemi 2006-11-04 12:44:45 CET
This bug is now finally fixed, so that the problematic files are not included in the conversion. There were 2 or 3 files in the list where "__" was used for some other purpose.