Bug 2063 - the freecorpus contains directories with mixed content
Summary: the freecorpus contains directories with mixed content
Status: REOPENED
Alias: None
Product: Corpus
Classification: Unclassified
Component: Text corpus infrastructure
Version: unspecified
Hardware: Macintosh Other
Importance: P5 - Later enhancement
Assignee: Børre Gaup
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-06-24 15:59 CEST by Ciprian Gerstenberger
Modified: 2017-06-23 10:21 CEST
CC List: 4 users

See Also:


Description Ciprian Gerstenberger 2015-06-24 15:59:13 CEST
As we agreed, a dir in the corpus should contain either dirs only or files only. This is no longer the case, evidently due to data coming from the Finnish side.
The following dirs contain both dirs and files:

freecorpus/2015-06-24/sme/facta/klemetti.blogspot.com/2009/
freecorpus/2015-06-24/sme/facta/lundui.fi/
freecorpus/2015-06-24/sme/facta/lundui.fi/aigeguovdil/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/nuuksio/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/earaguovllut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/historjacuozahagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/ceavetjavribuolbmatjavri/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/finnmarkkubalggis/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/geavujohtolat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/heahttaballasjohtolat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/kalohttageinnodat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/njuohttejohka/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/piilolabalggis/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/luondduguovddazat/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/samimeahcceguovllut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardancuozahagat/stobut/
freecorpus/2015-06-24/sme/facta/lundui.fi/vanddardeamiabc/
freecorpus/2015-06-24/sme/laws/finland/
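
A check like the one behind this list can be sketched in a few lines of Python. This is only an illustration, not the script actually used for the report: the function name `find_mixed_dirs` is hypothetical, and skipping `.svn`/`.git` metadata directories is an assumption about how the working copy is laid out.

```python
import os

# Assumption: version-control metadata dirs should not count as corpus content.
IGNORED_DIRS = {".svn", ".git"}


def find_mixed_dirs(root):
    """Return sorted (path, n_dirs, n_files) tuples for every directory
    under root that contains both subdirectories and files."""
    mixed = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune ignored dirs in place so os.walk does not descend into them
        # and they are not counted as subdirectories.
        dirnames[:] = [d for d in dirnames if d not in IGNORED_DIRS]
        if dirnames and filenames:
            mixed.append((dirpath, len(dirnames), len(filenames)))
    return sorted(mixed)
```

Calling `find_mixed_dirs("freecorpus")` from the parent directory of a checkout would return one tuple per offending dir; an empty list means the "dirs only or files only" rule holds everywhere below that root.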
Comment 1 Ciprian Gerstenberger 2015-06-25 08:41:13 CEST
Here is an updated list, this time with the test run on the freecorpus/orig directory; it shows even more dirs with mixed content.

orig/fin/facta/klemetti.blogspot.com/2009/
orig/fin/laws/finland/
orig/sme/facta/klemetti.blogspot.com/2009/
orig/sme/facta/lundui.fi/
orig/sme/facta/lundui.fi/aigeguovdil/
orig/sme/facta/lundui.fi/vanddardancuozahagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/
orig/sme/facta/lundui.fi/vanddardancuozahagat/albmotmeahcit/nuuksio/
orig/sme/facta/lundui.fi/vanddardancuozahagat/earaguovllut/
orig/sme/facta/lundui.fi/vanddardancuozahagat/historjacuozahagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/ceavetjavribuolbmatjavri/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/finnmarkkubalggis/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/geavujohtolat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/heahttaballasjohtolat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/kalohttageinnodat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/njuohttejohka/
orig/sme/facta/lundui.fi/vanddardancuozahagat/johtolagat/piilolabalggis/
orig/sme/facta/lundui.fi/vanddardancuozahagat/luondduguovddazat/
orig/sme/facta/lundui.fi/vanddardancuozahagat/samimeahcceguovllut/
orig/sme/facta/lundui.fi/vanddardancuozahagat/stobut/
orig/sme/facta/lundui.fi/vanddardeamiabc/
orig/sme/laws/finland/
orig/smj/admin/depts/regjeringen.no/
Comment 2 Børre Gaup 2015-07-08 00:54:48 CEST
Fixed in commits freecorpus r4880-4487
Comment 3 Ciprian Gerstenberger 2017-06-23 10:21:36 CEST
The following directories have mixed content:


01_2017-06-22/fc/sme/admin/allaskuvla dirs_|1| files_|22|
01_2017-06-22/fc/sme/blogs dirs_|2| files_|6|
01_2017-06-22/fc/sme/facta/samediggi.fi dirs_|1| files_|37|
01_2017-06-22/fc/smn/facta/samediggi.fi dirs_|1| files_|4|

Moreover, the klemetti blog (klemetti.blogspot.com) should be moved from the "facta" domain to the newly created "blogs" domain.