Approximate string matching algorithms for limited-vocabulary OCR output correction
21 December 2000
Thomas A. Lasko, Susan E. Hauser
Proceedings Volume 4307, Document Recognition and Retrieval VIII; (2000) https://doi.org/10.1117/12.410841
Event: Photonics West 2001 - Electronic Imaging, 2001, San Jose, CA, United States
Abstract
Five methods for matching words mistranslated by optical character recognition (OCR) to their most likely match in a reference dictionary were tested on data from the archives of the National Library of Medicine. The five methods (an adaptation of the cross-correlation algorithm, the generic edit distance algorithm, the edit distance algorithm with a probabilistic substitution matrix, Bayesian analysis, and Bayesian analysis on an actively thinned reference dictionary) were implemented and their accuracy rates compared. Of the five, the Bayesian algorithm produced the most correct matches (87%) and had the advantage of producing scores that have a useful and practical interpretation.
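As an illustration of two of the methods named in the abstract, the following is a minimal sketch of dictionary matching by edit distance, with an optional substitution-cost function standing in for a probabilistic substitution matrix. It is not the authors' implementation: the function names `edit_distance` and `best_match`, the unit insertion and deletion costs, and the example confusion cost for `1` versus `l` are all assumptions made for illustration.

```python
from typing import Callable, Iterable, Tuple


def unit_cost(x: str, y: str) -> float:
    """Generic edit distance: a substitution costs 1, a match costs 0."""
    return 0.0 if x == y else 1.0


def edit_distance(a: str, b: str,
                  sub_cost: Callable[[str, str], float] = unit_cost) -> float:
    """Dynamic-programming edit distance between a and b.

    Insertions and deletions cost 1 (an assumption); substitutions
    cost sub_cost(x, y), which allows OCR-specific confusion weights.
    """
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1.0,                  # delete ca
                curr[j - 1] + 1.0,              # insert cb
                prev[j - 1] + sub_cost(ca, cb)  # substitute or match
            ))
        prev = curr
    return prev[-1]


def best_match(word: str, dictionary: Iterable[str],
               sub_cost: Callable[[str, str], float] = unit_cost
               ) -> Tuple[str, float]:
    """Return the dictionary entry with the smallest edit distance to word."""
    dist, entry = min((edit_distance(word, e, sub_cost), e) for e in dictionary)
    return entry, dist


def ocr_cost(x: str, y: str) -> float:
    """Hypothetical confusion weights: '1' and 'l' are cheap to swap."""
    if x == y:
        return 0.0
    if {x, y} == {"1", "l"}:
        return 0.2  # illustrative value, not taken from the paper
    return 1.0
```

Calling `best_match("rnedicine", ["medicine", "medical", "median"])` selects `medicine`, since the common OCR confusion of `m` with `rn` costs only one deletion plus one substitution. Passing `ocr_cost` as `sub_cost` shows how a substitution matrix can make likely OCR confusions cheaper than arbitrary character changes; in practice such weights would be estimated from observed OCR errors.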
© (2000) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Thomas A. Lasko and Susan E. Hauser "Approximate string matching algorithms for limited-vocabulary OCR output correction", Proc. SPIE 4307, Document Recognition and Retrieval VIII, (21 December 2000); https://doi.org/10.1117/12.410841
KEYWORDS
Optical character recognition
Associative arrays
Detection and tracking algorithms
Biology
Evolutionary algorithms
Medicine
Bismuth