Paper
Utilizing web data in identification and correction of OCR errors
24 March 2014
Proceedings Volume 9021, Document Recognition and Retrieval XXI; 902109 (2014) https://doi.org/10.1117/12.2042403
Event: IS&T/SPIE Electronic Imaging, 2014, San Francisco, California, United States
Abstract
In this paper, we report on our experiments in detecting and correcting OCR errors with web data. More specifically, we use Google search to tap the large body of text available on the web and identify candidate corrections. We then use a combination of the longest common subsequence (LCS) and Bayesian estimates to automatically pick the proper candidate. Our experimental results on a small set of historical newspaper data show a recall of 51% and a precision of 100%. The paper further provides a detailed classification and analysis of all errors. In particular, we point out the shortcomings of our approach in suggesting proper candidates for the remaining errors.
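The abstract does not spell out how LCS and Bayesian estimates are combined, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a hypothetical scoring rule (normalized LCS similarity multiplied by a frequency-based prior) and a hand-written candidate table standing in for what a web search might return; the names `lcs_length` and `pick_candidate` and the counts are illustrative assumptions.

```python
from typing import Dict, Optional


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def pick_candidate(ocr_token: str, candidates: Dict[str, int]) -> Optional[str]:
    """Return the candidate with the highest similarity-times-prior score.

    `candidates` maps each candidate word to a hypothetical frequency count.
    The scoring rule is an assumption made for illustration only.
    """
    total = sum(candidates.values())
    if total == 0:
        return None
    best_word, best_score = None, 0.0
    for word, count in candidates.items():
        # Similarity: LCS length normalized by the longer of the two strings.
        similarity = lcs_length(ocr_token, word) / max(len(ocr_token), len(word))
        # Prior: relative frequency of the candidate among all candidates.
        prior = count / total
        score = similarity * prior
        if score > best_score:
            best_word, best_score = word, score
    return best_word


# Hypothetical example: an OCR token where "rn" was misread as "m".
suggestion = pick_candidate("govemment", {"government": 900, "movement": 100})
print(suggestion)
```

In a sketch like this, the LCS term rewards candidates that preserve most of the OCR token's characters in order, while the prior favors words that are common in the reference data. A real system would also need a threshold below which no correction is made.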
© (2014) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Kazem Taghva and Shivam Agarwal "Utilizing web data in identification and correction of OCR errors", Proc. SPIE 9021, Document Recognition and Retrieval XXI, 902109 (24 March 2014); https://doi.org/10.1117/12.2042403
KEYWORDS
Optical character recognition
Error analysis
Data corrections
Liquid crystals
Lanthanum
Machine learning
Computer science