OCR processing of numerical text should not be impaired when detected language is Turkish
While testing Turkish text that also contains numeric strings, I isolated the following anomaly:
Consider two versions of an otherwise identical image (for context, the image is a graphic showing the total number of COVID-19 deaths in Turkey on a particular date):
TÜRKİYE’DE ÖLÜMLER 1.368
DEATHS IN TURKEY 1.368
When the image is OCR-processed with the text in English (i.e. with the words: “DEATHS IN TURKEY”), the numeric string is returned correctly as “1.368”.
However, when the image is OCR-processed with the text in Turkish (i.e. with the words: “TÜRKİYE’DE ÖLÜMLER”), the numeric string is recognized incorrectly: the digit “1” is returned as a Turkish lowercase dotless i ("ı", Unicode U+0131), while the rest of the string is returned correctly as “.368”.
In both images, the numeric value is in exactly the same location, and the backgrounds are identical. Note that the thousands separator used in Turkey is a period (not a comma, as in the U.S.).
I feel recognition accuracy of numeric strings should not degrade when the primary language detected for the request is Turkish.
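Until this is addressed in the service, a client-side stopgap along the following lines might mask the symptom. This is only a sketch: `repair_numeric_token` and `repair_line` are hypothetical helper names, and it assumes the misread value arrives as a single token such as "ı.368", which may not match how the JSON output splits words.

```python
import re

DOTLESS_I = "\u0131"  # Turkish lowercase dotless i, the character returned for "1"

# Matches tokens made only of digits and dotless i, optionally grouped by "." or ",".
_NUMERIC_LIKE = re.compile(rf"^[0-9{DOTLESS_I}]+(?:[.,][0-9{DOTLESS_I}]+)*$")


def repair_numeric_token(token: str) -> str:
    """Replace dotless i with "1" only when the token otherwise looks numeric."""
    looks_numeric = _NUMERIC_LIKE.match(token) is not None
    has_real_digit = any(ch.isdigit() for ch in token)  # avoid rewriting a lone "ı"
    if DOTLESS_I in token and looks_numeric and has_real_digit:
        return token.replace(DOTLESS_I, "1")
    return token


def repair_line(line: str) -> str:
    """Apply the token repair to every space-separated token in a line."""
    return " ".join(repair_numeric_token(t) for t in line.split(" "))
```

Ordinary Turkish words that contain "ı" are left untouched because they include letters outside the numeric pattern; only digit-and-separator tokens are rewritten.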
I used the following endpoints for testing, with identical results:
I have attached both images below, along with the JSON output returned for each request.
The following curl request can be used to test the images:
--header "ocp-apim-subscription-key: <azure.subscription.id>" \
--header "content-type: application/octet-stream" \
--data-binary "@<image.file.full.pathname>" \
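For anyone who would rather reproduce this from Python than from curl, the same two headers and binary body can be assembled with the standard library. This is a rough sketch: `build_ocr_request` and its parameter names are hypothetical, the endpoint URL is left as an argument to be supplied by the caller, and the POST method is an assumption based on the use of `--data-binary`.

```python
import urllib.request


def build_ocr_request(endpoint_url: str, subscription_key: str,
                      image_bytes: bytes) -> urllib.request.Request:
    """Build (but do not send) a request mirroring the curl headers above."""
    return urllib.request.Request(
        endpoint_url,              # caller-supplied endpoint under test
        data=image_bytes,          # raw image bytes, like --data-binary
        headers={
            "ocp-apim-subscription-key": subscription_key,
            "content-type": "application/octet-stream",
        },
        method="POST",             # assumed; curl posts when given a data body
    )
```

The returned object can be passed to `urllib.request.urlopen` and the response body parsed with `json.loads` to compare the output for the two images.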