Bug 2420 - Case endings for URLs and guessers
Summary: Case endings for URLs and guessers
Status: ASSIGNED
Alias: None
Product: sme lexicon
Classification: Unclassified
Component: Continuation lexica
Version: unspecified
Hardware: All All
Importance: P3 - Within a week, normal
Assignee: Sjur Nørstebø Moshagen
 
Reported: 2017-09-07 16:36 CEST by Sjur Nørstebø Moshagen
Modified: 2017-09-13 14:56 CEST

Description Sjur Nørstebø Moshagen 2017-09-07 16:36:11 CEST
Consider this sentence:

Tearpmat biddjojuvvoojit neahttabáikái www.risten.no‘s dađistaga go Sámi giellalávdegoddi lea dohkkehan tearpmaid.

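The token www.risten.no‘s is presently not recognised as a URL, because the URL parser does not know about morphology and so cannot handle the case ending attached to the URL. Instead the token is split into pieces: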

"<www>"
        "www" N Prop Sem/Txt ACR Sg Acc <W:0.0000000000>
        "www" N Prop Sem/Txt ACR Sg Gen <W:0.0000000000>
        "www" N Prop Sem/Txt ACR Sg Nom <W:0.0000000000>
        "www" N Sem/Txt Prop ACR Sg Acc <W:0.0000000000>
        "www" N Sem/Txt Prop ACR Sg Gen <W:0.0000000000>
        "www" N Sem/Txt Prop ACR Sg Nom <W:0.0000000000>
"<.>"
        "." CLB <W:0.0000000000>
"<risten>"
        "riestit" V TV Ind Prt Sg1 <W:0.0000000000>
        "ristat" V TV Ind Prt Sg1 <W:0.0000000000>
        "ristet" V TV Actio Gen <W:0.0000000000>
        "ristet" V TV Actio Nom <W:0.0000000000>
        "ristet" V TV Ind Prs Sg1 <W:0.0000000000>
        "ristet" V TV Ind Prt ConNeg <W:0.0000000000>
        "ristet" V TV PrfPrc <W:0.0000000000>
        "ristet" VV TV Der/NomAct N Sg Gen <W:0.0000000000>
        "ristet" VV TV Der/NomAct N Sg Nom <W:0.0000000000>
"<.>"
        "." CLB <W:0.0000000000>
:no‘s 
"<dađistaga>"
        "dađistaga" Adv <W:0.0000000000>

(output from hfst-tokenise).

The easiest solution is to build a separate FST containing just the case affixes and their tags, and then concatenate it with the URL parser, so that a URL followed by a case ending is analysed as a single token.
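
To illustrate the idea outside the FST toolchain, here is a minimal Python sketch that uses regular expressions as a stand-in for the two transducers. Everything in it is an assumption made for illustration: the simplified URL pattern, the set of separator characters, the endings table, and the tag strings are hypothetical and not taken from the sme lexicon or the actual URL parser.

```python
import re

# Stand-in for the URL parser: a deliberately simplistic URL pattern
# (assumption, much cruder than a real URL grammar).
URL = r"(?:https?://)?(?:www\.)?[a-z0-9-]+(?:\.[a-z0-9-]+)+"

# Stand-in for the separate affix FST: ending -> tag string.
# Endings and tags are illustrative placeholders only.
ENDINGS = {"s": "Sg Loc", "i": "Sg Ill"}

# Characters assumed to separate the URL from the ending
# (U+2018, U+2019, ASCII apostrophe, colon).
SEPARATORS = "\u2018\u2019':"

# Longest endings first, so a longer ending wins over its own prefix.
_ending_alt = "|".join(
    sorted((re.escape(e) for e in ENDINGS), key=len, reverse=True)
)

# "Concatenation": the URL pattern followed by an optional separator + ending.
TOKEN = re.compile(
    "(?P<url>" + URL + ")"
    "(?:[" + re.escape(SEPARATORS) + "](?P<ending>" + _ending_alt + "))?",
    re.IGNORECASE,
)


def analyse(token):
    """Return (url, tags) if the whole token is a URL with an optional
    case ending, otherwise None. tags is "" for a bare URL."""
    match = TOKEN.fullmatch(token)
    if match is None:
        return None
    ending = match.group("ending")
    tags = ENDINGS[ending.lower()] if ending else ""
    return match.group("url"), tags
```

With this sketch, `analyse("www.risten.no\u2018s")` yields the URL and the tags of the ending as one analysis, while an ordinary word such as `dađistaga` is rejected. In the real setup the same composition would be done on the transducers themselves, not on regular expressions.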
Comment 1 Linda Wiechetek 2017-09-13 14:56:41 CEST
Being able to analyse these is definitely essential for both analysis and grammar checking.