BUG: words in German utterances are split
In German, we have many compound words like "Finanzbericht". LUIS splits these words into "finanz" and "bericht". Even dates like "14.06.2017" are split into "14 . 06 . 2017" (spaces before and after the dots).
LUIS really struggles with the split words. For example, if I have a List entity with the synonym "Finanzbericht" and I enter an utterance containing "Finanzbericht", LUIS does not recognize it. I need to add the split words as synonyms.
However, in the English version of LUIS the same German words don't get split.
This has been fixed in all languages.
For general information about language support, please see: https://docs.microsoft.com/en-us/azure/cognitive-services/luis/luis-language-support.
Today I have noticed that not every compound word is split by LUIS, e.g. "Reisekosten", "Konferenzraum" and "Öffnungszeiten" are split, but "Zugticket", "Sonderurlaub" or "Ruheraum" are not.
So what is wrong with German/Dutch and LUIS?
Marco Rietveld commented
@riham and @carol, this is a really big problem. In fact, it basically means that LUIS is unusable for Dutch (and German).
If we can't count on "pepernoten" or "nootmuskaat" to be treated as words (they are), then we would literally have to input a large chunk of the dictionary in order to compensate for this problem, which is an unacceptable workaround for us.
We've even seen that "chocolade" is being split into "choco" and "lade"!
I strongly urge you to retract support for Dutch (and German) until this problem is fixed.
Will there be an option to disable this "feature"? I now have to mark the split words as one utterance and then remove the whitespace in my code. These are unnecessary steps in my opinion. Or is there a better way to handle this behavior?
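A minimal sketch of the whitespace workaround described above, in Python. It assumes the split value comes back as a plain string such as "finanz bericht" or "14 . 06 . 2017". The function name is hypothetical and is not part of any LUIS SDK.

```python
import re


def rejoin_split_value(value: str) -> str:
    """Remove all whitespace from a value that was split during pre-processing.

    Hypothetical helper: it only undoes inserted spaces and does not restore
    the original casing.
    """
    return re.sub(r"\s+", "", value)


print(rejoin_split_value("finanz bericht"))
print(rejoin_split_value("14 . 06 . 2017"))
```

Note that this also merges values that really consist of several words, so it is only safe for entities that are known to be single compound words or dates.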
We might need to add a feature to disable pre-processing.
@Carol Hanna: So, would it be possible to add a feature to disable this "pre-processing" on demand?
Across all the languages there is a pre-processing step that is specific to every culture. Dutch and German have special handling that aims at splitting compound words into separate entities to facilitate labeling.
Johan Kroese commented
Same here in the Dutch language. It seems bound to the culture setting on the application. If I use Dutch words in a phrase list in a LUIS application where the culture is set to en-us, it doesn't do the annoying word breaking.
Create a LUIS app with culture 'Dutch' (or German).
Create a phrase list and give it a name.
Enter values: mercedes,microsoft
Export the application model.
Look at the JSON.
The UI will look OK. The JSON will show that it saves the phrases with broken words: "mercedes, micro soft".
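The check in the last step can be automated with a rough Python sketch like the one below. It assumes the exported model keeps phrase lists under a top-level key (here "model_features", which may be named differently in other export versions) and stores each list's phrases as a comma-separated string under "words". Both key names and the function name are assumptions, so adjust them to the actual export.

```python
import json


def find_broken_phrases(export_text, entered_values, lists_key="model_features"):
    """Return (entered, stored) pairs where the stored phrase differs from an
    entered value only by inserted spaces.

    The key names "model_features" and "words" are assumptions about the
    export layout, not a documented schema.
    """
    model = json.loads(export_text)
    entered = {v.strip().lower() for v in entered_values}
    broken = []
    for phrase_list in model.get(lists_key, []):
        for stored in phrase_list.get("words", "").split(","):
            stored = stored.strip()
            collapsed = "".join(stored.split()).lower()
            # Stored as entered: fine. Equal only after removing spaces: broken.
            if stored.lower() not in entered and collapsed in entered:
                broken.append((collapsed, stored))
    return broken


# Sample export fragment mirroring the phrase list from the steps above.
sample = json.dumps(
    {"model_features": [{"name": "brands", "words": "mercedes, micro soft"}]}
)
print(find_broken_phrases(sample, ["mercedes", "microsoft"]))
```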