Recent developments in sensors enable new uses across many real-world applications, including optical character recognition (OCR). An OCR system requires text processing tools to be incorporated into the sensor functionality. The most critical stage in an OCR system is segmentation: subdividing a text image into characters that a classifier can then process individually. Arabic character segmentation is particularly challenging because the script is cursive, each character takes a different shape depending on its position in the word, and diacritics are present. A robust offline character segmentation algorithm for printed Arabic text with diacritics is developed based on contour extraction. The algorithm extracts the upper contour of a word and then identifies the areas where the word's characters should be split. A postprocessing stage then handles the over-segmentation that appears in the initial segmentation stage. The proposed scheme is benchmarked on the APTI dataset and on a manually collected dataset, covering more than 38,000 words in text images that vary in font size, type, and style. The experiments show that the proposed algorithm segments Arabic words with diacritics with an average accuracy of 98.5%.
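To make the idea of an upper contour concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes a binarized word image (nonzero = ink), an already known baseline row, and a simple rule that columns whose upper contour sits near the baseline are candidate split areas. The function names `upper_contour` and `candidate_splits`, the `baseline_row` parameter, and the `tol` tolerance are all hypothetical and chosen only for illustration; diacritic handling and the postprocessing stage are not modeled.

```python
import numpy as np


def upper_contour(word):
    """Return, for each column, the row of the topmost ink pixel.

    Columns without ink get the image height as a sentinel value.
    """
    word = np.asarray(word, dtype=bool)
    height = word.shape[0]
    has_ink = word.any(axis=0)
    top = word.argmax(axis=0)  # first True per column (0 if none)
    return np.where(has_ink, top, height)


def candidate_splits(contour, baseline_row, tol=1):
    """Return the centre column of each run where the contour is near the baseline.

    Assumption for this sketch: a connecting stroke between two characters
    shows up as a run of columns whose topmost ink lies within `tol` rows
    of the baseline.
    """
    near = np.abs(np.asarray(contour) - baseline_row) <= tol
    splits = []
    start = None
    for col, flag in enumerate(near):
        if flag and start is None:
            start = col  # a new run begins
        elif not flag and start is not None:
            splits.append((start + col - 1) // 2)  # centre of the finished run
            start = None
    if start is not None:  # run reaches the right edge
        splits.append((start + len(near) - 1) // 2)
    return splits
```

On a toy image with two tall shapes joined by a thin stroke on the baseline, `candidate_splits(upper_contour(img), baseline_row)` would report a column inside that stroke. A real system would also need to estimate the baseline and filter false splits, which is where a postprocessing step such as the one described above would come in.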
CITATIONS
Cited by 10 scholarly publications.
Image segmentation
Optical character recognition
Image processing algorithms and systems
Algorithm development
Detection and tracking algorithms
Image processing
Binary data