Document structure analysis can be regarded as a syntactic analysis problem. The order and containment relations among the physical or logical components of a document page can be described by an ordered tree structure and can be modeled by a tree grammar which describes the page at the component level in terms of regions or blocks. This paper provides a detailed survey of past work on document structure analysis algorithms and summarizes the limitations of past approaches. In particular, we survey past work on document physical layout representations and algorithms, document logical structure representations and algorithms, and performance evaluation of document structure analysis algorithms. In the last section, we summarize this work and point out its limitations.
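The ordered-tree view of a page can be made concrete with a small sketch. The node type, field names, and region labels below are illustrative placeholders, not taken from any particular system surveyed here; the sketch only shows how containment and sibling (reading) order are captured in one structure.

```python
# A minimal sketch of an ordered-tree page representation; names are illustrative.
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Region:
    label: str                       # e.g. "page", "column", "paragraph"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixels
    children: List["Region"] = field(default_factory=list)  # stored in reading order

def preorder(node: Region, depth: int = 0):
    """Walk the tree in reading order (containment plus sibling order)."""
    print("  " * depth + f"{node.label} {node.bbox}")
    for child in node.children:
        preorder(child, depth + 1)

page = Region("page", (0, 0, 2550, 3300), [
    Region("column", (150, 200, 1250, 3100), [
        Region("paragraph", (150, 200, 1250, 600)),
        Region("paragraph", (150, 650, 1250, 1200)),
    ]),
    Region("column", (1350, 200, 2450, 3100)),
])
preorder(page)
```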
Image segmentation is an important component of any document image analysis system. While many segmentation algorithms exist in the literature, very few i) allow users to specify the physical style, and ii) incorporate user-specified style information into the algorithm's objective function that is to be minimized. We describe a segmentation algorithm that models a document's physical structure as a hierarchy in which each node describes a region of the document using a stochastic regular grammar. The exact form of the hierarchy and the stochastic language are specified by the user, while the probabilities associated with the transitions are estimated from groundtruth data. We demonstrate the segmentation algorithm on images of bilingual dictionaries.
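As a rough illustration of the idea (not the authors' implementation), a stochastic regular grammar over region labels can be approximated by a first-order Markov chain whose transition probabilities are estimated from groundtruth label sequences; candidate segmentations can then be ranked by log-probability. The labels and training sequences below are hypothetical.

```python
# Sketch: estimate transition probabilities from groundtruth label sequences
# and score candidate label sequences; labels and data are illustrative.
import math
from collections import defaultdict

def estimate_transitions(sequences):
    counts = defaultdict(lambda: defaultdict(int))
    for seq in sequences:
        seq = ["<s>"] + list(seq) + ["</s>"]
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    return {a: {b: c / sum(row.values()) for b, c in row.items()}
            for a, row in counts.items()}

def log_prob(seq, probs, floor=1e-6):
    seq = ["<s>"] + list(seq) + ["</s>"]
    return sum(math.log(probs.get(a, {}).get(b, floor)) for a, b in zip(seq, seq[1:]))

# Hypothetical groundtruth label sequences for dictionary entries.
train = [["headword", "pronunciation", "definition"],
         ["headword", "definition"],
         ["headword", "pronunciation", "definition", "example"]]
P = estimate_transitions(train)
print(log_prob(["headword", "definition"], P))
print(log_prob(["definition", "headword"], P))  # less grammatical, lower score
```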
Tools for visualizing and creating groundtruth and metadata are crucial for document image analysis research. In this paper, we describe TrueViz, a tool for visualizing and editing groundtruth/metadata for OCR. TrueViz is implemented in the Java programming language and works on various platforms including Windows and Unix. TrueViz reads and stores groundtruth/metadata in XML format, and reads a corresponding image stored in TIFF image file format. A multilingual text editing, display, and search module based on the Unicode representation of text is also provided. This software is being made available free of charge to researchers.
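A hedged sketch of how such groundtruth might be consumed programmatically is given below. The XML element and attribute names are hypothetical placeholders rather than the actual TrueViz schema; only the pairing of an XML metadata file with a TIFF page image follows the description above.

```python
# Sketch: load an XML groundtruth file alongside its TIFF page image.
# "Word", "text", and "bbox" are hypothetical names, not the TrueViz schema.
import xml.etree.ElementTree as ET
from PIL import Image  # pip install pillow

def load_page(xml_path, tiff_path):
    tree = ET.parse(xml_path)
    words = []
    for w in tree.getroot().iter("Word"):          # hypothetical element name
        text = w.get("text", "")
        bbox = tuple(int(v) for v in w.get("bbox", "0 0 0 0").split())
        words.append((text, bbox))
    image = Image.open(tiff_path)                  # TIFF page image
    return image, words

# image, words = load_page("page0001.xml", "page0001.tif")
```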
Document page segmentation is a crucial preprocessing step in an Optical Character Recognition (OCR) system. While numerous segmentation algorithms have been proposed, there is relatively little literature on comparative evaluation -- empirical or theoretical -- of these algorithms. We use the following five-step methodology to quantitatively compare the performance of page segmentation algorithms: (1) create mutually exclusive training and test datasets with groundtruth, (2) select a meaningful and computable performance metric, (3) use an optimization procedure to automatically search for the optimal parameter values of the segmentation algorithms on the training dataset, (4) evaluate the segmentation algorithms on the test dataset, and (5) perform a statistical error analysis to establish the statistical significance of the experimental results. We apply this methodology to five segmentation algorithms, three of which are representative research algorithms and the other two of which are well-known commercial products. The three research algorithms evaluated are Nagy's X-Y cut, O'Gorman's Docstrum, and Kise's Voronoi-diagram-based algorithm. The two commercial products evaluated are Caere Corporation's segmentation algorithm and ScanSoft Corporation's segmentation algorithm. The evaluations are conducted on 978 images from the University of Washington III dataset. We find that the performances of the Voronoi-based, Docstrum, and Caere segmentation algorithms are not significantly different from each other, but that they are significantly better than ScanSoft's segmentation algorithm, which in turn is significantly better than the X-Y cut algorithm. Furthermore, we see that the commercial segmentation algorithms and the research segmentation algorithms have comparable performance.
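The protocol can be summarized schematically as below. The segmentation algorithms, error metric, and parameter grids are stand-ins supplied by the caller; only the overall five-step flow follows the text.

```python
# Schematic sketch of the five-step comparison protocol; alg_a and alg_b are
# placeholder callables that return a per-page error for given parameters.
import random
from itertools import product
from statistics import mean, stdev

def evaluate(algorithm, params, pages):
    """Mean per-page error of `algorithm` run with `params` (lower is better)."""
    return mean(algorithm(page, **params) for page in pages)

def tune(algorithm, grid, train_pages):
    """Step 3: search the parameter grid for the lowest training error."""
    keys = list(grid)
    candidates = (dict(zip(keys, values)) for values in product(*grid.values()))
    return min(candidates, key=lambda p: evaluate(algorithm, p, train_pages))

def compare(alg_a, alg_b, grid_a, grid_b, pages, seed=0):
    rng = random.Random(seed)
    pages = list(pages)
    rng.shuffle(pages)                                    # step 1: disjoint train/test sets
    half = len(pages) // 2
    train, test = pages[:half], pages[half:]
    params_a = tune(alg_a, grid_a, train)                 # step 3: per-algorithm tuning
    params_b = tune(alg_b, grid_b, train)
    errs_a = [alg_a(p, **params_a) for p in test]         # steps 2 and 4: test-set errors
    errs_b = [alg_b(p, **params_b) for p in test]
    diffs = [a - b for a, b in zip(errs_a, errs_b)]       # step 5: paired error differences
    t = mean(diffs) / (stdev(diffs) / len(diffs) ** 0.5)  # paired t statistic
    return mean(diffs), t
```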
Numerous Optical Character Recognition (OCR) companies claim that their products have near-perfect recognition accuracy (close to 99.9%). In practice, however, these accuracy rates are rarely achieved. Most systems break down when the input document images are highly degraded, such as scanned images of carbon-copy documents, documents printed on low-quality paper, and documents that are n-th generation photocopies. Moreover, end users cannot compare the relative performances of the products because the various accuracy results are not reported on the same dataset. In this article we report our evaluation results for two popular Arabic OCR products: (1) Sakhr OCR and (2) OmniPage for Arabic. In our evaluation we establish that the Sakhr OCR product has a 15.47% lower page error rate relative to the OmniPage page error rate. The absolute page accuracy rates for Sakhr and OmniPage are 90.33% and 86.89%, respectively. Our evaluation was performed using the SAIC Arabic image dataset, and we used only those pages for which both OCR systems produced output. A scatter plot of the page accuracy-rate pairs reveals that Sakhr generally performs better on low-accuracy (degraded) pages. The scatter-plot visualization technique allows an algorithm developer to easily detect and analyze outliers in the results.
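The scatter-plot diagnostic is simple to reproduce; a minimal sketch with placeholder per-page accuracies is shown below. Each point is one page, with one product's page accuracy on each axis, so points above the diagonal are pages on which the y-axis product did better.

```python
# Sketch of the accuracy scatter plot; the accuracy arrays are placeholders,
# not the actual per-page results reported in the article.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
omnipage = np.clip(rng.normal(0.87, 0.08, 200), 0, 1)            # placeholder data
sakhr = np.clip(omnipage + rng.normal(0.03, 0.05, 200), 0, 1)    # placeholder data

plt.scatter(omnipage, sakhr, s=8)
plt.plot([0, 1], [0, 1], "k--", linewidth=1)   # y = x reference line
plt.xlabel("OmniPage page accuracy")
plt.ylabel("Sakhr page accuracy")
plt.title("Per-page accuracy pairs (points above the diagonal favor Sakhr)")
plt.show()
```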
Characterizing the performance of Optical Character Recognition (OCR) systems is crucial for monitoring technical progress, predicting OCR performance, providing scientific explanations for system behavior, and identifying open problems. While research has been done in the past to compare the performances of two or more OCR systems, these studies assume that the accuracies achieved on individual documents in a dataset are independent when, in fact, they are not. In this paper we show that accuracies reported on any dataset are correlated and invoke the appropriate statistical technique--the paired model--to compare the accuracies of two recognition systems. Theoretically we show that this method provides tighter confidence intervals than methods used in the OCR and computer vision literature. We also propose a new visualization method, which we call the accuracy scatter plot, for providing a visual summary of performance results. This method summarizes the accuracy comparisons on the entire corpus while simultaneously allowing the researcher to visually compare the performances on individual document images. Finally, we report on accuracy and speed performance as a function of scanning resolution. Contrary to what one might expect, the performance of one of the systems degrades when the image resolution is increased beyond 300 dpi. Furthermore, the average time taken to OCR a document image, after increasing almost linearly as a function of resolution, suddenly becomes constant beyond 400 dpi. This behavior is most likely because the OCR algorithm resamples images at resolutions of 400 dpi and higher down to a standard resolution. The two products that we compare are the Arabic OmniPage 2.0 and the Automatic Page Reader 3.01 from Sakhr. The SAIC Arabic dataset was used for the evaluations. The statistical and visualization methods presented in this article are very general and can be used for comparing the accuracies of any two recognition systems, not just OCR systems.
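A small simulation sketch of why pairing helps is given below: when both systems' per-page accuracies share a common page-difficulty component, working with per-page differences cancels that component and yields a narrower confidence interval than the unpaired (independence) analysis. The data and effect sizes are simulated, not taken from the paper.

```python
# Sketch: paired vs. unpaired confidence intervals on simulated, correlated
# per-page accuracies.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
page_difficulty = rng.normal(0.85, 0.10, 500)                     # shared per-page effect
acc_a = np.clip(page_difficulty + rng.normal(0.02, 0.03, 500), 0, 1)
acc_b = np.clip(page_difficulty + rng.normal(0.00, 0.03, 500), 0, 1)

# Paired model: per-page differences, so the shared difficulty cancels.
d = acc_a - acc_b
paired_half_width = stats.t.ppf(0.975, len(d) - 1) * d.std(ddof=1) / np.sqrt(len(d))

# Unpaired analysis (independence assumption): variances add instead of cancelling.
se = np.sqrt(acc_a.var(ddof=1) / len(acc_a) + acc_b.var(ddof=1) / len(acc_b))
unpaired_half_width = stats.t.ppf(0.975, 2 * len(d) - 2) * se

print(f"paired 95% CI half-width:   {paired_half_width:.4f}")
print(f"unpaired 95% CI half-width: {unpaired_half_width:.4f}")
```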
In this paper we propose to use the Bible as a dataset for comparing OCR accuracy across languages. Besides being available in a wide range of languages, Bible translations are closely parallel in content, carefully translated, surprisingly relevant with respect to modern-day language, and quite inexpensive. A project at the University of Maryland is currently implementing this idea. We have created a scanned image dataset with groundtruth from an Arabic Bible. We have also used image degradation models to create synthetically degraded images of a French Bible. We hope to generate similar Bible datasets for other languages, and we are exploring alternative corpora with similar properties, such as the Koran and the Bhagavad Gita. Quantitative OCR evaluation based on the Arabic Bible dataset is currently in progress.
Computing discrete 2D convolutions is an important problem in image processing. In mathematical morphology, an important variant is that of computing binary convolutions, where large kernels are involved. In this paper, we present an algorithm for computing convolutions of this form, where the kernel of the binary convolution is derived from a convex polygon. Because the kernel is a geometric object, we allow the algorithm some flexibility in how it elects to digitize the convex kernel at each placement, as long as the digitization satisfies certain reasonable requirements. We say that such a convolution is valid. Given this flexibility, we show that it is possible to compute binary convolutions more efficiently than would normally be possible for large kernels: our algorithm computes a valid convolution in O(kmn) time. Unlike standard algorithms for computing correlations and convolutions, the running time is independent of the area or perimeter of the kernel, and our technique does not rely on computing fast Fourier transforms. Our algorithm is based on a novel use of Bresenham's line-drawing algorithm and prefix-sums to update the convolution efficiently as the kernel is moved from one position to another across the image.
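The prefix-sum ingredient can be illustrated in a simplified setting. The sketch below counts the 1-pixels of a binary image under an axis-aligned rectangular kernel at every placement using 2D prefix sums, in time independent of the kernel area; it covers only the rectangular special case, not the convex-polygonal-kernel algorithm (which additionally uses Bresenham lines to track the kernel boundary).

```python
# Sketch: binary "convolution" (window counts) with a rectangular kernel via
# 2D prefix sums; cost per placement is O(1), independent of kernel area.
import numpy as np

def binary_box_convolution(image, kh, kw):
    """image: 2D 0/1 array; returns counts of 1s under each kh x kw window."""
    # Prefix sums with a zero border row and column.
    P = np.zeros((image.shape[0] + 1, image.shape[1] + 1), dtype=np.int64)
    P[1:, 1:] = np.cumsum(np.cumsum(image, axis=0), axis=1)
    # Inclusion-exclusion over the four prefix-sum corners of each window.
    return P[kh:, kw:] - P[:-kh, kw:] - P[kh:, :-kw] + P[:-kh, :-kw]

img = (np.random.default_rng(2).random((6, 8)) > 0.5).astype(np.int64)
print(binary_box_convolution(img, 3, 3))
```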
We have discussed a morphologically based nonlinear document degradation model that characterizes the perturbation process associated with printing and scanning [KHP93, KHP94]. In this paper we use the nonparametric estimation algorithm discussed in [KHP93, KHP94] to estimate the sizes of the structuring elements of the degradation model. Other parameters of the model can be estimated in a similar fashion. Thus, given a small sample of (real) scanned documents, we can estimate the parameters of the model using the nonparametric estimation algorithm, and then use the estimated parameters to create a large sample of simulated documents with degradation characteristics similar to those of the real scanned documents. The large simulated sample can then be used for various purposes, such as training classifiers, estimating the performance of OCR algorithms, and choosing parameter values in noise-removal algorithms.
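As a hedged sketch of how such a model can be used generatively, the function below flips pixels with a probability that decays with distance from the character boundary and then applies a morphological closing; the functional form and parameter names are illustrative stand-ins, and the actual model and its nonparametric estimator are the ones described in [KHP93, KHP94].

```python
# Sketch of a distance-dependent pixel-flipping degradation followed by a
# morphological closing; parameters are illustrative, not estimated values.
import numpy as np
from scipy import ndimage

def degrade(binary, alpha=1.0, alpha0=1.0, beta=1.0, beta0=1.0, eta=0.0,
            close_size=2, rng=None):
    rng = rng or np.random.default_rng()
    fg = binary.astype(bool)                      # True = ink
    d_fg = ndimage.distance_transform_edt(fg)     # distance of ink pixels to background
    d_bg = ndimage.distance_transform_edt(~fg)    # distance of background pixels to ink
    p_flip = np.where(fg,
                      beta0 * np.exp(-beta * d_fg ** 2) + eta,
                      alpha0 * np.exp(-alpha * d_bg ** 2) + eta)
    flipped = fg ^ (rng.random(fg.shape) < p_flip)
    # Correlate neighbouring flips with a morphological closing.
    return ndimage.binary_closing(flipped, structure=np.ones((close_size, close_size)))

clean = np.zeros((40, 40), dtype=bool)
clean[10:30, 18:22] = True                        # a synthetic vertical stroke
noisy = degrade(clean, rng=np.random.default_rng(0))
```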
Statistical Morphology is concerned with the statistical characterization of the four morphological operations — dilation, erosion, opening and closing. By statistical characterization of a morphological operator we mean the statistical characterization of the output in terms of the statistical characteristics of the input. Characterization of operators allows us to predict the characteristics of the output of an algorithm composed of a sequence of morphological operations in terms of the statistical characteristics of the input and the sequence of morphological operators used. Furthermore, such statistical analyses of morphological algorithms are necessary for evaluating an algorithm's performance. In this paper we describe what we have learned about one way to characterize the dilation and opening morphological operators in a one-dimensional setting. That is, the input to each of these operators is assumed to be binary and one-dimensional. The input is modeled as a union of randomly translated discrete lines of a fixed length. The line segments can overlap and result in line segments of various lengths. Thus the final output appears as an unordered pattern of lines and gaps of various lengths. This input is characterized by giving its line and gap length distributions and the distribution of the number of line and gap segments of various lengths. The characterization of a morphological operator, therefore, entails a similar characterization of the output. There has been recent interest in the area of statistical morphology and some results have been published in the literature. Morales and Acharya [MA92] analyzed the statistical characteristics of a morphological opening on grayscale signals perturbed by Gaussian noise. Stevenson and Arce [SA92] studied the effects of opening for a class of structuring elements. Atola, Koskinen and Neuvo [ALN93] studied the output distributions of one-dimensional grayscale filtering. Costa and Haralick [CH92] gave an empirical description of the output graylevel distributions of morphologically opened signals. Dougherty and Loce [DL93] used libraries of structuring elements to restore corrupted signals when a noise model is available. In the following section we set up the notation and definitions used in this paper. In section 3 we give a formal statement of the random process used to generate random sequences. In section 4 we give a maximum likelihood algorithm for estimating the model parameters. The four morphological operators are characterized in section 5.
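The 1-D setting can be explored empirically with a short simulation: generate a union of randomly translated fixed-length segments, dilate it with a small structuring element, and compare the empirical line and gap run-length distributions before and after. The parameters below are illustrative.

```python
# Simulation sketch of the 1-D random line/gap process and the effect of a
# dilation on its empirical run-length distributions; parameters are illustrative.
import numpy as np
from collections import Counter

def random_line_union(n=10_000, seg_len=5, n_segments=800, seed=3):
    x = np.zeros(n, dtype=bool)
    starts = np.random.default_rng(seed).integers(0, n - seg_len, n_segments)
    for s in starts:
        x[s:s + seg_len] = True          # segments may overlap and merge
    return x

def dilate(x, k=3):
    """1-D binary dilation with a length-k structuring element."""
    return np.convolve(x, np.ones(k), mode="same") > 0

def run_lengths(x, value):
    """Histogram of run lengths of `value` (True = line, False = gap)."""
    runs, count = Counter(), 0
    for v in x:
        if v == value:
            count += 1
        elif count:
            runs[count] += 1
            count = 0
    if count:
        runs[count] += 1
    return runs

x = random_line_union()
print("line runs before:", run_lengths(x, True).most_common(5))
print("gap runs  before:", run_lengths(x, False).most_common(5))
print("gap runs  after :", run_lengths(dilate(x), False).most_common(5))
```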
We present a general methodology for designing experiments to quantitatively characterize low-level computer vision algorithms. The methodology can be applied to any vision problem that can be posed as a detection task. It provides a convenient framework for measuring the sensitivity of an algorithm to the various factors that affect its performance. The methodology is illustrated by applying it to a line detection algorithm consisting of the second directional derivative edge detector followed by a Hough transform. In particular, we measure the selectivity of the algorithm in the presence of an interfering oriented grating and additive Gaussian noise. The final result is a measure of the detector's performance as a function of the orientation of the interfering grating.
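A schematic sketch of the experimental design is given below: a known target line is embedded in an image together with an oriented grating and additive Gaussian noise, and the detection rate is recorded as the grating orientation is swept. The detector here is a plain Hough transform from scikit-image standing in for the second-directional-derivative detector used in the paper, and all parameter values are illustrative.

```python
# Sketch of the detection experiment: sweep the grating orientation and record
# how often the strongest Hough peak comes out near the known vertical target.
import numpy as np
from skimage.transform import hough_line, hough_line_peaks
from skimage.draw import line as draw_line

def trial(grating_angle_deg, noise_sigma=0.3, size=128, rng=None):
    rng = rng or np.random.default_rng()
    img = np.zeros((size, size))
    rr, cc = draw_line(0, size // 2, size - 1, size // 2)   # vertical target line
    img[rr, cc] = 1.0
    yy, xx = np.mgrid[0:size, 0:size]
    theta = np.deg2rad(grating_angle_deg)
    img += 0.5 * (np.sin(0.5 * (xx * np.cos(theta) + yy * np.sin(theta))) > 0)
    img += rng.normal(0, noise_sigma, img.shape)
    h, angles, dists = hough_line(img > 0.8)
    _, peak_angles, _ = hough_line_peaks(h, angles, dists, num_peaks=1)
    return abs(np.rad2deg(peak_angles[0])) < 5    # strongest peak near vertical?

for a in range(0, 91, 15):
    hits = sum(trial(a, rng=np.random.default_rng(s)) for s in range(20))
    print(f"grating at {a:2d} deg: detection rate {hits / 20:.2f}")
```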
A convex, filled polygonal shape in R x R can be uniquely represented in the discrete Z x Z domain by the set of all lattice points lying in its interior and on its edges. We define a restricted convex shape as the discrete four-connected set of points representing any convex, filled polygon whose vertices lie on the lattice points and whose interior angles are multiples of 45°. In this paper we introduce the Boundary Code (B-Code), and we express the morphological dilation operation on restricted convex shapes with structuring elements that are also restricted convex shapes. The algorithm for this operation is of O(1) complexity and hence is independent of the size of the object. Further, we show that the algorithm for the n-fold dilation is also of O(1) complexity. We prove that there is a unique set of thirteen shapes {K1, K2, ..., K13} such that any given restricted convex shape K is expressible as K = K1^(n1) ⊕ K2^(n2) ⊕ ... ⊕ K13^(n13), where ⊕ denotes dilation and Ki^(ni) denotes the ni-fold dilation of Ki. We also derive a finite-step algorithm to find this decomposition.
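A sketch of the geometric idea behind the constant-time dilation is given below (it illustrates the principle, not the paper's B-Code data structure): a restricted convex shape can be summarized by the lengths of its boundary edges in the eight 45° directions, and since the dilation (Minkowski sum) of two convex polygons merges their edges by direction, dilation reduces to adding the two length vectors and n-fold dilation to scaling one of them.

```python
# Sketch: dilation of restricted convex shapes as componentwise addition of
# per-direction boundary edge lengths; an illustration, not the B-Code itself.
import numpy as np

# Edge directions at multiples of 45 degrees, in counter-clockwise order.
DIRS = np.array([(1, 0), (1, 1), (0, 1), (-1, 1),
                 (-1, 0), (-1, -1), (0, -1), (1, -1)])

def dilate(edge_lengths_a, edge_lengths_b):
    """Boundary description of A dilated by B (both restricted convex shapes)."""
    return np.asarray(edge_lengths_a) + np.asarray(edge_lengths_b)

def n_fold_dilate(edge_lengths_a, edge_lengths_b, n):
    """A dilated n times by B: scale B's edge lengths by n and add."""
    return np.asarray(edge_lengths_a) + n * np.asarray(edge_lengths_b)

def vertices(edge_lengths, start=(0, 0)):
    """Recover polygon vertices by walking the boundary edges in order."""
    pts, p = [np.array(start)], np.array(start)
    for length, d in zip(edge_lengths, DIRS):
        p = p + length * d
        pts.append(p)
    return np.array(pts)

square = [2, 0, 2, 0, 2, 0, 2, 0]         # axis-aligned 2x2 square
diamond = [0, 1, 0, 1, 0, 1, 0, 1]        # unit diamond (45-degree edges only)
print(vertices(dilate(square, diamond)))  # an octagon: the Minkowski sum
```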