Bug 2063

Summary: the freecorpus contains directories with mixed content
Product: Corpus
Reporter: Ciprian Gerstenberger <ciprian.gerstenberger>
Component: Text corpus infrastructure
Assignee: Børre Gaup <borre.gaup>
Status: REOPENED
Severity: enhancement
CC: ciprian.gerstenberger, lene.antonsen, sjur.n.moshagen, trond.trosterud
Priority: P5 - Later
Version: unspecified
Hardware: Macintosh
OS: Other

Description Ciprian Gerstenberger 2015-06-24 15:59:13 CEST
As we agreed, a dir in the corpus should contain either dirs only or files only. This is no longer the case, evidently for data coming from the Finnish side.
The following dirs contain both dirs and files:

freecorpus/2015-06-24/sme/facta/klemetti.blogspot.com/2009/
freecorpus/2015-06-24/sme/facta/lundui.fi/
freecorpus/2015-06-24/sme/facta/lundui.fi/aigeguovdil/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/nuuksio/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/earaguovllut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/historjacuozahagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/ceavetjavribuolbmatjavri/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/finnmarkkubalggis/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/geavujohtolat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/heahttaballasjohtolat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/kalohttageinnodat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/njuohttejohka/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/piilolabalggis/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/luondduguovddazat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/samimeahcceguovllut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/stobut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardeamiabc/
freecorpus/2015-06-24/sme/laws/finland/
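
A check like the one behind the list above could be sketched as follows. This is only a minimal illustration, not the script actually used for this report: the function name find_mixed_dirs is hypothetical, and skipping dot-entries (e.g. .svn) is an assumption about what should count as corpus content.

```python
import os


def find_mixed_dirs(root):
    """Return (path, n_dirs, n_files) for each directory under root
    that directly contains both subdirectories and files."""
    mixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Assumption: hidden entries such as .svn or .DS_Store are not
        # corpus content, so they are ignored (and not descended into).
        dirnames[:] = [d for d in dirnames if not d.startswith(".")]
        files = [f for f in filenames if not f.startswith(".")]
        if dirnames and files:
            mixed.append((dirpath, len(dirnames), len(files)))
    return sorted(mixed)
```

Calling find_mixed_dirs("freecorpus/orig") would return one tuple per offending directory; an empty list means the dirs-only-or-files-only rule holds for the whole tree.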
Comment 1 Ciprian Gerstenberger 2015-06-25 08:41:13 CEST
Here is an updated list from running the test in the freecorpus/orig directory; there are even more dirs with mixed content there.

orig/fin/facta/klemetti.blogspot.com/2009/
orig/fin/laws/finland/
orig/sme/facta/klemetti.blogspot.com/2009/
orig/sme/facta/lundui.fi/
orig/sme/facta/lundui.fi/aigeguovdil/
orig/sme/facta/lundui.fi/vanddardancuozahagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/
orig/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/nuuksio/
orig/sme/facta/lundui.fi/vanddardancuozahagat/earaguovllut/
orig/sme/facta/lundui.fi/vanddardancuozahagat/historjacuozahagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/ceavetjavribuolbmatjavri/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/finnmarkkubalggis/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/geavujohtolat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/heahttaballasjohtolat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/kalohttageinnodat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/njuohttejohka/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/piilolabalggis/
orig/sme/facta/lundui.fi/vanddardancuozahagat/luondduguovddazat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/samimeahcceguovllut/
orig/sme/facta/lundui.fi/vanddardancuozahagat/stobut/
orig/sme/facta/lundui.fi/vanddardeamiabc/
orig/sme/laws/finland/
orig/smj/admin/depts/regjeringen.no/
Comment 2 Børre Gaup 2015-07-08 00:54:48 CEST
Fixed in commits freecorpus r4880-4487
Comment 3 Ciprian Gerstenberger 2017-06-23 10:21:36 CEST
The following directories have mixed content:


01_2017-06-22/fc/sme/admin/allaskuvla dirs_|1| files_|22|
01_2017-06-22/fc/sme/blogs dirs_|2| files_|6|
01_2017-06-22/fc/sme/facta/samediggi.fi dirs_|1| files_|37|
01_2017-06-22/fc/smn/facta/samediggi.fi dirs_|1| files_|4|

Moreover, the klemetti blog (klemetti.blogspot.com) should be moved from the "facta" domain to the newly created "blogs" domain.