Bug 2146 - German compound nouns introduce problems with case handling
Summary: German compound nouns introduce problems with case handling
Status: ASSIGNED
Alias: None
Product: Infrastructure
Classification: Unclassified
Component: newinfra
Version: unspecified
Hardware: Macintosh Other
Importance: P5 - Later enhancement
Assignee: Sjur Nørstebø Moshagen
URL:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-01-13 10:47 CET by Jack Rueter
Modified: 2016-01-29 09:54 CET
CC List: 4 users

See Also:


Description Jack Rueter 2016-01-13 10:47:21 CET
The obligatory capitalization of German nouns causes problems in two places: non-initial nouns in compound nouns have to be down-cased, and nouns derived from verbs and adjectives have to be up-cased.
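
To make the two casing requirements concrete, here is a minimal Python sketch of the surface behaviour we expect. It is only an illustration: the function names are made up for this example, linking elements between compound members are ignored, and none of this is how the lexc/FST source handles it.

```python
def upper_first(word: str) -> str:
    """Upper-case the first character only, leaving the rest untouched."""
    return word[:1].upper() + word[1:]


def lower_first(word: str) -> str:
    """Lower-case the first character only, leaving the rest untouched."""
    return word[:1].lower() + word[1:]


def join_compound(members: list[str]) -> str:
    """Join noun lemmas into one compound (hypothetical helper).

    The first member keeps its capital; every non-initial member
    is down-cased, which is the first problem described above.
    """
    if not members:
        return ""
    head, *rest = members
    return upper_first(head) + "".join(lower_first(m) for m in rest)


def nominalize(word: str) -> str:
    """Up-case a verb or adjective form used as a noun (hypothetical helper)."""
    return upper_first(word)


print(join_compound(["Apfel", "Baum"]))  # Apfelbaum
print(nominalize("laufen"))              # Laufen
```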

One solution to uppercasing in German nouns could be:
Flag diacritics:
@U.Cap.Obl@
Apfel:apfel

Cf. main/langs/sme/src/morphology/root.lexc

Results vary, however, when we come to consider yaml tests:

Using the analyzer:
$GTHOME/langs/deu/src/analyser-gt-desc.hfstol
Äpfel
Äpfel	Apfel+N+Msc+Pl+Acc	0.000000
Äpfel	Apfel+N+Msc+Pl+Gen	0.000000
Äpfel	Apfel+N+Msc+Pl+Nom	0.000000

is not mirrored by the generator, which returns lower-case forms:
$GTHOME/langs/deu/src/generator-gt-desc.hfstol

Apfel+N+Msc+Pl+Acc	äpfel	0.000000
Apfel+N+Msc+Pl+Gen	äpfel
Apfel+N+Msc+Pl+Nom	äpfel	0.000000
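
The asymmetry can be checked mechanically. The following is a rough Python sketch, assuming the tab-separated lookup output quoted in this report is pasted in as strings; the helper names `parse_lookup` and `find_asymmetries` are invented for this example and are not part of the infrastructure or the yaml test runner.

```python
# Lookup output copied from this report (form/analysis, result, optional weight).
ANALYSER_OUTPUT = (
    "Äpfel\tApfel+N+Msc+Pl+Acc\t0.000000\n"
    "Äpfel\tApfel+N+Msc+Pl+Gen\t0.000000\n"
    "Äpfel\tApfel+N+Msc+Pl+Nom\t0.000000\n"
)
GENERATOR_OUTPUT = (
    "Apfel+N+Msc+Pl+Acc\täpfel\t0.000000\n"
    "Apfel+N+Msc+Pl+Gen\täpfel\n"
    "Apfel+N+Msc+Pl+Nom\täpfel\t0.000000\n"
)


def parse_lookup(text: str) -> list[tuple[str, str]]:
    """Return (input, output) pairs; lines without a tab are skipped."""
    pairs = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2:
            pairs.append((fields[0], fields[1]))
    return pairs


def find_asymmetries(analyses, generations):
    """List analyses whose surface form the generator does not give back."""
    generated = {}
    for tags, form in generations:
        generated.setdefault(tags, set()).add(form)
    mismatches = []
    for form, tags in analyses:
        forms = generated.get(tags, set())
        if form not in forms:
            mismatches.append((form, tags, sorted(forms)))
    return mismatches


mismatches = find_asymmetries(
    parse_lookup(ANALYSER_OUTPUT), parse_lookup(GENERATOR_OUTPUT)
)
for form, tags, forms in mismatches:
    print(f"{tags}: analysed from {form!r}, generated {forms}")
```

With the output above, all three analyses are reported, because the generator only produces the lower-case form.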


Optional sentence-initial upper-casing has worked for other languages, but German will need a little language-specific work.

This optional uppercasing is done as part of the regular compilation,
which means that we need a pre-tmp file for language-specific pre-processing before the language-independent compilation steps.

Thanks for the discussion, Sjur.
Comment 1 Sjur Nørstebø Moshagen 2016-01-29 09:54:23 CET
Changed subject line from "German compound nouns introduce problems with upcasing that are observed in yaml test analyses and generation asymmetry." to "German compound nouns introduce problems with case handling" - long subject lines tend to make the bug lists harder to read.