Considerations in OCR/OCV Applications
By: Nello Zuech, Contributing Editor
There is a distinction between optical character recognition (OCR) and optical character verification (OCV). In the case of OCR, a single alphanumeric character is being recognized based on a comparison to essentially a database of character patterns. In other words, the pattern of the alphanumeric character being recognized is compared to the pattern of every character in the database. The pattern that is the best match is selected as the character currently being recognized. In the case of OCV, there is a priori knowledge about what the specific alphanumeric pattern is and the system is essentially verifying or validating that the current character is exactly what it is supposed to be. While a subtle difference between verification and recognition, it is significant. So the first thing that must be understood about the application is whether it is actually OCR or OCV.
For applications involving optical character recognition (OCR) or optical character verification (OCV), there are several 'rules of thumb':
- The stroke width of the smallest character should be at least three pixels wide
- A typical character should cover an area on the order of 20-25 pixels by 20-25 pixels
- The spacing between characters should be two pixels
- The font style should be bold
The critical issue here, then, is the length of the string of characters that one wants to read. At 20 pixels across a character and two pixels of spacing between characters, each character occupies about 22 pixels, so the maximum length of the character string would be on the order of 22 characters in order to fit into a camera with a 500-pixel arrangement. This assumes repeatable positioning, so there is no provision for locating the character string within the field of view. In optical character recognition/verification applications, a bold font style is desirable. In general it is also true that in a machine vision-based implementation only one font style can be handled at a given time.
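The arithmetic behind the 22-character figure can be sketched as follows. This is only an illustration: the function name `max_string_length` and its parameters are hypothetical, and it assumes every character has the same width, a fixed gap between neighbors, and no margin for locating the string.

```python
def max_string_length(sensor_px: int, char_px: int, gap_px: int) -> int:
    """Largest number of characters that fit across the sensor.

    N characters need N * char_px + (N - 1) * gap_px pixels, because
    gaps occur only between neighbouring characters.
    """
    if sensor_px < char_px:
        return 0
    return (sensor_px + gap_px) // (char_px + gap_px)


print(max_string_length(500, 20, 2))  # 22
print(max_string_length(500, 25, 2))  # 18
```

Moving from 20-pixel to 25-pixel characters, both within the rule of thumb, shortens the readable string noticeably, which is why string length should be settled early.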
Another rule of thumb is that the best OCR systems have a correct read rate on the order of 99.9%. In other words, one out of every thousand characters will be either misread or a 'no-read'. The impact of this in a given application should be evaluated. For example, if 300 objects per minute are to be read (18,000 per hour), and 0.1% are sorted as 'no reads', in one hour you would have approximately 18 products to be read manually. Is this acceptable? This is the best case scenario. The worst case would be if they were misread.
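A short sketch of this impact calculation follows. The function names are hypothetical. The second function goes one step beyond the text: it assumes each character in a string fails independently, which is an idealization, to show how a per-character read rate translates into a per-string failure rate.

```python
def no_reads_per_hour(objects_per_minute: float, no_read_rate: float) -> float:
    """Expected number of objects per hour that need a manual read."""
    return objects_per_minute * 60 * no_read_rate


def string_failure_rate(char_read_rate: float, chars_per_string: int) -> float:
    """Chance that at least one character in a string is not read correctly.

    Assumption: characters fail independently of each other.
    """
    return 1.0 - char_read_rate ** chars_per_string


print(round(no_reads_per_hour(300, 0.001)))     # 18
print(f"{string_failure_rate(0.999, 10):.2%}")  # 1.00%
```

Under the independence assumption, a 99.9% per-character rate on a 10-character string leaves roughly one string in a hundred with a problem, which is worth checking against the line rate before accepting a vendor's figure.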
There are many other application issues that must be understood in developing an OCR/OCV application. The number of alphanumerics in the data set must be defined. In addition, the number of different font styles that might be expected must be understood. Will both upper and lower case characters have to be handled? The combination of alphanumerics, upper and lower case and font styles may ultimately dictate the number of patterns that must be in the database representing each of the respective alphanumerics.
All the variables that can influence the scene must be understood. Positional variables might include part position, character field position and ultimately specific character position within the field. Positional variables could be both of a translation and rotation nature. This might mean that the machine vision system requires a 'find routine' before using pattern recognition algorithms to read the alphanumerics. The extent of the positional uncertainty may dictate the specific algorithms required to optimize performance. For all intents and purposes the machine vision system should be rotationally and translationally invariant, as transparently as possible to the user.
If there is 'z-axis' positional uncertainty because of variations in the position of the plane where the alphanumerics are printed with respect to where the camera has to be mounted, the system may also have to be scale invariant. If the plane where the alphanumerics appear cannot be maintained parallel to the camera plane, the system may also need the ability to handle perspective errors. If the character string happens to be on a cylindrically or spherically shaped part, this must be conveyed to the prospective machine vision companies bidding on the job, as the system would have to handle any distortions in image capture of the characters.
The surface properties of the object with the alphanumeric string may have variations that should be conveyed to prospective bidders. For example, the surface finish and/or the reflectance property may vary within the part or from part-to-part due to variations in manufacturing (different machine tools, cutters, etc.) or over time due to normal exposure to the atmosphere. The color of the part may vary from batch-to-batch or even within a given batch. If the hue does not vary, maybe the saturation level or brightness level of the color does vary. There may be other part attributes that may be variables that surround the alphanumeric string, e.g. holes or other cutouts. The system may have to be tolerant of physical conditions that are acceptable; e.g. part dimensional tolerances, surface scratches, burrs, etc.
There will no doubt be variations in the stroke width of the characters, character height, the character-to-character spacing and effective contrast of the alphanumeric string as a consequence of how the characters are imprinted. This contrast can vary from one character to another imprinted on the same part, randomly within a specific imprinted character as well as from part-to-part. These could stem from variations due to the manufacturing process, printing process, part or raw material suppliers and post manufacturing and application handling.
Any system that is delivered should be robust enough to handle 'confusion pairs' regardless of all the appearance variables expected. In other words, the system should not confuse a 'B' with a '3' or '8'; the letter 'I' with a '1'; a 'G' with a '6'; the letter 'O' with the numeral zero; '6' with a '9' or '3' with an '8'; or any other combination of alphanumerics that will be encountered in the application.
It may be that the system should be able to reliably read characters although some of the character stroke is missing due to improper printing or obliteration due to handling. This may be an impossible requirement for all conditions of partial print. Hence, it may be appropriate to suggest to the vendors that under these circumstances the system should issue a 'no read' signal in the event of uncertainty rather than risk a misread.
These are all considerations that should be defined in a comprehensive specification before inviting vendors to bid. To increase reliability a single font style should be specified. It should be as bold a font style as possible and have a set of alphanumerics each uniquely defined geometrically to increase the reliability of the system. The specification should include both false accept (misread) and false reject (no read) rates.
Comments on Optical Character Recognition
There are at least four fundamental approaches to OCR:
- correlation-based, essentially geometric shape scoring and matching
- nearest neighbor classifiers/decision theoretic, essentially using specific geometric features/feature sets and matching based on proximity to specific multidimensional model
- syntactic or structural, essentially using specific features and relationships between the features
- neural network/ fuzzy logic classification based on train-by-showing and reinforcing based on correct decision.
Some systems come with factory installed fonts which are pre-trained, e.g. SEMI, OCR-A, OCR-B. Some have the ability to be trained at the factory on the specific font style being used. Others require that the vendor train new font styles.
Different executions can yield different performance, so it is difficult to suggest that one approach is superior. All but the syntactic approach are font specific. Requiring such a system to be trained to read more than one font style at the same time degrades its ability to provide correct reads, as more characters can become the victim of a confusion pair (e.g. 3 vs. 8, 8 vs. B, etc.). The syntactic approach is generally conceded to be the most robust when it comes to multi-font applications. With this approach specific character features and specific relationships are used which are generally font-style independent.
For example, an 'E' might be characterized as follows: start at (0,0); if there is a vector in a generally easterly direction and another in a southerly direction, and if the southerly vector meets an intersection from which there is again a vector in the easterly direction and a vector in the southerly direction, and that southerly vector in turn intersects with a vector in the easterly direction, then the character is an 'E'.
The approach requires that each character be uniquely defined based on vectors and arcs and their directions. These conditions would generally exist regardless of the font style, although there could be font styles that cause confusion by their very nature. These should be avoided in any application.
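The 'E' rule above can be written down as a small structural test. This is a toy sketch, not any vendor's method: the stroke-graph representation (a dictionary mapping each junction to the compass directions leaving it) and the name `looks_like_E` are assumptions made for illustration, and a real system would first have to extract such vectors from the image.

```python
from typing import Dict, Tuple

Point = Tuple[int, int]
# junction -> {compass direction: junction reached by following that stroke}
StrokeGraph = Dict[Point, Dict[str, Point]]


def looks_like_E(strokes: StrokeGraph, start: Point = (0, 0)) -> bool:
    """Apply the easterly/southerly rule for an 'E' described in the text."""
    top = strokes.get(start, {})
    if "E" not in top or "S" not in top:
        return False
    middle = strokes.get(top["S"], {})
    if "E" not in middle or "S" not in middle:
        return False
    bottom = strokes.get(middle["S"], {})
    return "E" in bottom


# Hypothetical stroke graphs; coordinates are (x, y) with y increasing southward.
LETTER_E = {(0, 0): {"E": (2, 0), "S": (0, 1)},
            (0, 1): {"E": (1, 1), "S": (0, 2)},
            (0, 2): {"E": (2, 2)}}
LETTER_F = {(0, 0): {"E": (2, 0), "S": (0, 1)},
            (0, 1): {"E": (1, 1), "S": (0, 2)}}

print(looks_like_E(LETTER_E), looks_like_E(LETTER_F))  # True False
```

Because the rule refers only to directions and intersections, not to stroke lengths or thickness, the same test would accept an 'E' drawn in many font styles. A full rule set would also need extra conditions to keep other characters with a similar left-hand structure from satisfying it.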
No companies provide numbers with regard to read accuracies. All suggest, and rightfully so, that read accuracies are related to many factors: quality of print, application conditions, such as lighting uniformity/consistency, font style and potential for confusion pairs, how well the system is trained, consistency of spacing between characters, etc. Under good conditions, one can expect 99.9% read accuracies. Given more 'read' time, some systems claim 99.96% read accuracies.
Often systems have more than one approach to reading characters. If the characters can be reliably read with high confidence (as determined by the system), only one set of algorithms is enabled. If the system determines a degree of uncertainty for a given character, the better systems can then enable another set of algorithms to read that specific character. If still a concern, some systems can enable even more recognition algorithms. The final decision may then be based on 'voting' - the number of times the different approaches suggest the same character determines the decision regarding the character. Even where there is only one suite of character recognition algorithms, the system may have the ability to 'vote' by reading the character 5 - 10 times. For example, with ten reads, the threshold might be set such that the system must agree on the specific character at least six times, and when that is the case that is the character read.
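The voting step can be sketched in a few lines. The function name `vote` and the choice to return `None` for a 'no read' are assumptions for illustration. The six-of-ten threshold is the example given in the text.

```python
from collections import Counter
from typing import Optional, Sequence


def vote(reads: Sequence[str], min_agree: int) -> Optional[str]:
    """Return the character most reads agree on, or None for a 'no read'."""
    if not reads:
        return None
    char, count = Counter(reads).most_common(1)[0]
    return char if count >= min_agree else None


print(vote(list("BBBBBBB888"), min_agree=6))  # B
print(vote(list("BBBBB88888"), min_agree=6))  # None
```

The same function works whether the ten reads come from repeated passes of one algorithm or from several different algorithms, since it only counts agreement.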
False reads are usually controllable by establishing conditions during training that err in favor of a 'no read'. In the case of a 'no read' the system will generally display the unidentified character in a highlighted fashion so an operator can decide and input the correct character via a keyboard.
False reads can also be reduced by the addition of a check sum number in a character string. This number must agree with the other number read by some rule: e.g. the last digit of the sum of all the numbers read must be the same as the check sum number.
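The example rule, where the last digit of the sum of the digits read must equal the check digit, might look like this in code. The function name `checksum_ok` is hypothetical, and the sketch assumes the check digit is the final character of an all-numeric string.

```python
def checksum_ok(read: str) -> bool:
    """True if the final digit equals the last digit of the sum of the others."""
    if len(read) < 2 or not (read.isascii() and read.isdigit()):
        return False
    *payload, check = (int(ch) for ch in read)
    return sum(payload) % 10 == check


print(checksum_ok("123455"))  # True:  1+2+3+4+5 = 15, last digit 5
print(checksum_ok("128455"))  # False: a '3' misread as '8' gives 20, last digit 0
```

A string that fails the check can be treated as a 'no read' rather than passed on as a possible misread.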
Comparing read rates is also not straightforward. Most companies are reluctant to expand too much on their read rates or throughputs. For the most part it can be assumed that the read rates claimed are based on optimal conditions. In most cases, one must also add times associated with taking the picture and finding the string/finding the character before reading takes place. Again, these are dependent on quality of print and font types as well as whether rotation as well as translation must be handled.
In the case of OCR, lighting is very critical. Lighting may be as critical as the algorithms in achieving rigorous OCR. In OCR applications the lighting yields a binary image of the character strings: either the characters appear white on a dark background or vice versa. Ideally the lighting yields a consistent binary image. As a consequence, the algorithms used typically operate on binary images rather than gray scale images, which accelerates processing and reading.
Comments on Optical Character Verification
Besides engineering details, there are two basic questions related to the optical character verification (OCV) application: is the application satisfied by verifying that the character is correct, or is it also a requirement that the system assure the quality of the character, that is, print quality inspection (PQI)? In the way companies have approached this application, these two capabilities do not necessarily go together.
As observed in our comments about OCR, it is also true of OCV applications that lighting is critical. The objective is to yield a binary image. Most of the approaches to OCV exploit the fact that the image is binary. Many who offer general-purpose machine vision (GPMV) systems offer a binary correlation approach to verifying characters. As long as a binary state can be ensured, this approach generally produces adequate results for verifying correct characters.
Some systems use a gray scale template as the basis for establishing a character's pattern and subsequent comparison. The correlation routine uses degree of match or match score as the basis of determining if the character is correct. Being based on a gray scale template it is tolerant of conditions that affect gray scale, in effect normalizing them. Hence, such conditions are less likely to cause a false reject.
Some such conditions include: print contrast itself, print contrast differences that stem from different background colors, variations in contrast across a stroke width, variations in contrast stemming from lighting non-uniformity across the scene, from scene to scene, etc.
On the other hand, shape scoring may not be the best at checking the quality of the characters. Where conditions vary that affect the shape of the character stroke but are still acceptable in terms of legibility (stroke thickening or thinning, minor obliteration), establishing a shape match score that tolerates these conditions may affect the character verification reliability. By loosening the shape score match criteria to tolerate conditions that reflect modest character deterioration but still legible characters, it may even be possible that the shape of a different character becomes acceptable, or that conditions yielding a character that could be misinterpreted become acceptable to the system.
One special condition of concern is handling stroke width variation in ink jet printing. One approach to handling this is to 'fool' the system by training it to accept a character provided it matches the corresponding character in any one of three font styles. The font styles would correspond to thick, thin and nominal stroke width. This approach has two drawbacks. First, it most likely makes the system even more tolerant of quality concerns. Second, it costs speed: each match scoring comparison takes a finite amount of time, so correlating against three templates would take three times as long as doing a single correlation.
Some systems actually do an optical character recognition (OCR) during operator set-up, but in the run mode do optical character verification (OCV). During set-up the system reads each character based on correlating the gray scale template to each and every character in the font library. The character in the font library with the best match score or correlation to the character being read is deemed to be the character read. This is displayed on the screen for operator verification of correctness. During the run mode, the system knows which character is in each position so only a 'match' or verification is actually performed.
The gray scale template derived from the current imprint is matched to the gray scale template stored for the specific location in the current trained set. As long as the match number (or correlation number) is greater than the previously established value (on a scale of 1-1000 with 1000 being a perfect match), the system acknowledges the character as correct.
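A minimal sketch of such a match score is shown below, using normalized gray scale correlation as a stand-in, since the text does not give the scoring formula any particular product uses. The names `match_score` and `verify`, the flattening of each image into a plain list of gray values, the 0 to 1000 mapping and the example threshold of 700 are all assumptions.

```python
from math import sqrt
from typing import Sequence


def match_score(template: Sequence[float], image: Sequence[float]) -> int:
    """Normalized gray scale correlation mapped onto a 0..1000 scale."""
    if len(template) != len(image) or not template:
        raise ValueError("template and image must have the same non-zero size")
    n = len(template)
    mean_t = sum(template) / n
    mean_i = sum(image) / n
    dev_t = [t - mean_t for t in template]
    dev_i = [p - mean_i for p in image]
    denom = sqrt(sum(d * d for d in dev_t) * sum(d * d for d in dev_i))
    if denom == 0:
        return 0  # a featureless patch cannot match anything
    r = sum(a * b for a, b in zip(dev_t, dev_i)) / denom
    return round(1000 * max(r, 0.0))


def verify(template, image, min_score: int = 700) -> bool:
    """Accept the character only if the score exceeds the trained value."""
    return match_score(template, image) > min_score


trained = [20, 20, 220, 220, 20, 220]
dimmer = [0.5 * p + 30 for p in trained]  # same print, lower contrast, brighter background
print(match_score(trained, dimmer))       # 1000
print(verify([0, 0, 255, 255], [255, 0, 255, 0]))  # False
```

Because the mean is subtracted and the result is divided by the spread of each image, a uniform change in contrast or brightness leaves the score unchanged, which is the normalizing behavior attributed to gray scale templates earlier in this section.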
Verifying characters that could constitute 'confusion pairs' requires logic beyond match scores alone to be reliable. One approach is to establish regions of interest (ROIs) where there could be confusion and to look for contrast changes. Another approach includes different tools that are automatically enabled based on the character sets involved. So, for example, while the former approach would apply a 'region-of-interest' (ROI) at three locations along the left hand stroke of a 'B', which are areas that distinguish it from an '8' or a '3', and look for contrast changes, the latter approach would use a direction vector property - the direction in which the gray scale is changing at each pixel along the boundary. This is perceived to be more robust. Other rules used for other character pairs include number of holes, character breaks and pixel weighting.
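The ROI idea for the 'B' versus '8' or '3' case can be illustrated on toy 5 x 7 bitmaps. Everything here is hypothetical: the glyph shapes, the three ROI positions and the function name. As a simplification, the sketch checks for the presence of foreground in an already binarized glyph rather than measuring contrast changes in gray scale.

```python
# '#' is foreground (ink), '.' is background.
GLYPH_B = ["####.", "#...#", "#...#", "####.", "#...#", "#...#", "####."]
GLYPH_8 = [".###.", "#...#", "#...#", ".###.", "#...#", "#...#", ".###."]
GLYPH_3 = ["####.", "....#", "....#", ".###.", "....#", "....#", "####."]

# (row, column) of three spots on the left hand stroke where a 'B' has ink
# but an '8' or a '3' is missing ink in at least one of them.
LEFT_STROKE_ROIS = [(0, 0), (3, 0), (6, 0)]


def has_left_stroke(glyph, rois=LEFT_STROKE_ROIS) -> bool:
    """True only if every ROI on the left hand stroke contains foreground."""
    return all(glyph[row][col] == "#" for row, col in rois)


for name, glyph in (("B", GLYPH_B), ("8", GLYPH_8), ("3", GLYPH_3)):
    print(name, has_left_stroke(glyph))
```

Only the 'B' passes. The check looks at a handful of pixels, so it stays decisive even when the overall match scores of the three characters are close.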
The following discussion reviews the properties of two other approaches. Most of the other companies that offer OCV/PQI systems embody one or more of the techniques reviewed. Hence, the discussion is more generic than this framing suggests; its purpose is to describe the differences between approaches found in the literature.
While several companies use a binary template as the basis of their character comparisons, their respective executions may differ dramatically. For example, in one case the approach is based on using a gray scale region around a nominal value as the basis of the binary image. That is, all pixels that have a gray shade value between 60 - 120 (for example) are assigned to the black or foreground region. All pixels outside that range are assigned the background or white region. The nominal value itself is adaptive; that is, it uses the gray scale distribution in an ROI within the current scene to correct the value.
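A sketch of band binarization with an adaptive nominal value follows. The 60 - 120 band from the text is expressed as a nominal value of 90 with a half width of 30. The correction rule, which shifts the nominal value by the change in mean gray level of a reference ROI, is only one plausible reading of 'uses the gray scale distribution in an ROI' and is an assumption, as are the function names.

```python
from typing import List, Sequence


def binarize_band(pixels: Sequence[float], nominal: float = 90,
                  half_width: float = 30) -> List[int]:
    """1 (foreground) for gray values inside the band, 0 (background) outside."""
    low, high = nominal - half_width, nominal + half_width
    return [1 if low <= p <= high else 0 for p in pixels]


def corrected_nominal(trained_nominal: float, trained_roi: Sequence[float],
                      current_roi: Sequence[float]) -> float:
    """Assumed rule: shift the nominal value by the change in ROI mean."""
    shift = sum(current_roi) / len(current_roi) - sum(trained_roi) / len(trained_roi)
    return trained_nominal + shift


print(binarize_band([50, 60, 90, 120, 121, 200]))  # [0, 1, 1, 1, 0, 0]

# The scene is 10 gray levels brighter than at training time.
nominal = corrected_nominal(90, trained_roi=[100, 100], current_roi=[110, 110])
print(nominal)                                     # 100.0
print(binarize_band([70, 130, 131], nominal))      # [1, 1, 0]
```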
Another approach might establish a single threshold above which all pixels are white or background and below which all pixels are black or foreground. This approach uses an adaptive threshold to compensate for light level variations. It then is based on performing a correlation routine to establish best character match for a nominal threshold as originally determined and for each shade of gray 10 shades around nominal. This is all performed automatically during operator set up. The threshold, however, may not be adapted on a scene to scene basis.
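The set-up sweep around the nominal threshold could be sketched as below. Several things are assumed: '10 shades around nominal' is read as plus or minus 10, a simple count of agreeing pixels stands in for the correlation routine, ties go to the threshold closest to nominal, and the function names are hypothetical.

```python
from typing import List, Sequence


def binarize(gray: Sequence[int], threshold: int) -> List[int]:
    """Pixels below the threshold are black/foreground (1), the rest white (0)."""
    return [1 if p < threshold else 0 for p in gray]


def best_threshold(gray: Sequence[int], template: Sequence[int],
                   nominal: int, span: int = 10) -> int:
    """Try every threshold within +/- span of nominal and keep the best match."""
    def agreement(t: int) -> int:
        return sum(a == b for a, b in zip(binarize(gray, t), template))

    candidates = range(nominal - span, nominal + span + 1)
    # Prefer the highest agreement; among ties, the threshold nearest nominal.
    return max(candidates, key=lambda t: (agreement(t), -abs(t - nominal)))


gray = [40, 95, 100, 200]
template = [1, 1, 0, 0]
print(best_threshold(gray, template, nominal=105))  # 100
```

Here the nominal threshold of 105 would wrongly turn the pixel at gray level 100 into foreground. The sweep settles on 100, the nearest value that reproduces the template exactly.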
Another fundamental difference between approaches may be the basis of the window search routine. In one instance it might be based on blob analysis while in another it may be based on a normalized gray scale correlation. The blob analysis is based on bounding box distribution of the character pixel string and locating its centroid and correcting to a referenced position accordingly. This approach will be sensitive to any text, graphics or extraneous 'noise' that may get too close to the character string being examined.
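The sensitivity of a bounding-box locate to stray pixels is easy to demonstrate. This sketch assumes a binary image given as rows of 0/1 values and uses the center of the bounding box of all foreground pixels as the located position. The function names are hypothetical and real blob analysis would typically label connected regions first.

```python
def locate(binary):
    """Bounding box (top, left, bottom, right) and its center, or None if empty."""
    points = [(r, c) for r, row in enumerate(binary)
              for c, value in enumerate(row) if value]
    if not points:
        return None
    rows = [r for r, _ in points]
    cols = [c for _, c in points]
    top, bottom, left, right = min(rows), max(rows), min(cols), max(cols)
    center = ((top + bottom) / 2, (left + right) / 2)
    return (top, left, bottom, right), center


def offset_from_reference(binary, reference_center):
    """(row, column) correction needed to bring the string back to the reference."""
    found = locate(binary)
    if found is None:
        return None
    _, (row, col) = found
    return (row - reference_center[0], col - reference_center[1])


clean = [[0, 0, 0, 0, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 1, 1, 1, 0, 0],
         [0, 0, 0, 0, 0, 0]]
noisy = [row[:] for row in clean]
noisy[0][5] = 1  # one stray pixel near the string

print(offset_from_reference(clean, (1.5, 2.0)))  # (0.0, 0.0)
print(offset_from_reference(noisy, (1.5, 2.0)))  # (-0.5, 1.0)
```

A single extraneous pixel stretches the bounding box and moves the computed position by a full pixel in this small example, even though the string itself has not moved.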
After the region is found, one approach may look for specific features on specific characters to perform a fine align. In another approach fine align will use a correlation match usually based on a piece of a character - 'a gray scale edge template.' This is generally done automatically during operator set up and the system includes certain rules to identify what piece of what character it should use as the basis of the correlation template.
After binarizing the image, one system might perform an erosion and then base its decision on a foreground and background comparison to the learned template on a sub-region by sub-region basis. In another system, a character by character align is performed, in addition to the greater string align, before doing the image subtraction. This could be followed by a single pixel erosion to eliminate extraneous pixels and then another erosion whose pixel width is based on thick/thin setting established during engineering set up. This is designed to compensate for stroke width variations.
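The subtract-then-erode sequence can be sketched on small binary images. The 3 x 3 structuring element, the treatment of pixels outside the image as background and the function names are assumptions. The `passes` argument stands in for the thick/thin setting mentioned above.

```python
def subtract(image, template):
    """Pixel residue: 1 wherever the binary image and the template disagree."""
    return [[a ^ b for a, b in zip(row_i, row_t)]
            for row_i, row_t in zip(image, template)]


def erode(binary, passes=1):
    """3x3 binary erosion; each pass strips one pixel layer from every blob."""
    height, width = len(binary), len(binary[0])
    for _ in range(passes):
        def at(r, c, src=binary):
            return src[r][c] if 0 <= r < height and 0 <= c < width else 0
        binary = [[1 if all(at(r + dr, c + dc)
                            for dr in (-1, 0, 1) for dc in (-1, 0, 1)) else 0
                   for c in range(width)]
                  for r in range(height)]
    return binary


template = [[0] * 5 for _ in range(5)]

speck = [row[:] for row in template]
speck[2][2] = 1                      # isolated extraneous pixel
defect = [row[:] for row in template]
for r in range(1, 4):
    for c in range(1, 4):
        defect[r][c] = 1             # 3x3 blob of unexpected ink

print(sum(map(sum, erode(subtract(speck, template)))))   # 0
print(sum(map(sum, erode(subtract(defect, template)))))  # 1
```

One erosion pass removes the isolated speck entirely while the larger defect still leaves residue, so only differences wider than the tolerated stroke variation survive to the accept/reject decision.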
In some approaches the decision is based on the results of a single sub region, while in others the decision is based on the pixel residue results associated with the template subtraction for a specific character. In both cases 'sensitivity' for rejection might be based on a pre-established percent of the pixels that vary. Some implementations also automatically reject characters whose contrast is less than 50% of the contrast established during operator training. This percent can be adjusted during engineering set up.
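The rejection logic described here reduces to two tests, sketched below. The function name, the parameter names and the example sensitivity of 10% are hypothetical. The 50% contrast floor is the default mentioned in the text.

```python
def reject_character(residue_px: int, character_px: int, max_residue_pct: float,
                     contrast: float, trained_contrast: float,
                     min_contrast_ratio: float = 0.5) -> bool:
    """True if the character should be rejected."""
    if character_px <= 0:
        raise ValueError("character_px must be positive")
    # Contrast test: reject if contrast fell below the set fraction of the trained value.
    if contrast < min_contrast_ratio * trained_contrast:
        return True
    # Sensitivity test: reject if too large a share of pixels differ from the template.
    return 100.0 * residue_px / character_px > max_residue_pct


print(reject_character(5, 200, 10, contrast=80, trained_contrast=100))   # False
print(reject_character(30, 200, 10, contrast=80, trained_contrast=100))  # True
print(reject_character(0, 200, 10, contrast=40, trained_contrast=100))   # True
```

The third case shows a character that matches the template perfectly but is still rejected because its contrast has dropped below half of the trained value.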
Another approach found in some products is based on binary images and a syntactic analysis based on the extraction of localized geometric features and their relationship to each other. This approach is better suited to OCR applications and less suitable to applications that involve character quality as well. Another approach is also based on binary images and uses vector correlation as the basis for verification or recognition. While such an approach should be less sensitive to scale or stroke width thinning or thickening it is probably more sensitive to localized stroke shape variations.
In both cases, the robustness of the binarization is somewhat dependent on image preprocessing that the specific machine vision system may perform before or after it binarizes the image. Nevertheless, this approach is expected to yield performance likely to be good for OCV but wanting when it comes to character quality issues.
Several other executions are based on OCR as the means to verify characters. These tend to be somewhat slower and be less amenable to PQI. In addition to specific algorithms executed, throughput is a function of: positional repeatability of the character string, number of characters in a string, number of strings, expected variables in appearance and whether PQI is also required.
In general, systems that have executed morphological operations are better at PQI.
One difference between executions is that some perform their template matching on the basis of sub regions within a larger region that includes several characters (possibly even the entire string) at one time versus matching on a character by character basis. Some suggest that this may be adequate for laser marking since the failure is generally a 'no mark' condition for the entire set of characters (all strings and all characters in the string) rather than a character specific 'no mark'. This is not likely the case in ink jet printing.
Where several characters are verified as a pattern, confusion pairs are more of a problem because the analysis is not performed on a character by character basis. This approach is also more susceptible to variation in character to character spacing. In general the systems do offer an ability to establish their regions of interest on a character basis, but this may be more awkward to do during operator set-up and may require more time.
As also noted above, some systems have a built in set of rules to handle confusion pairs. In other systems, this feature may not exist, or they may instead use an ROI analysis based approach. During operator set up some systems require somewhat more operator intervention than others. In some systems, properties such as threshold, string locate, character align, contrast property for rejection, character scale and aspect ratio, and rules to enable for confusion pairs are all established automatically, totally transparent to the operator.
What does all this mean? Basically, while there are many ways to perform reasonably reliable OCV, approaches that perform more image processing are better suited to perform PQI as well. The more robust approaches generally operate on gray scale data and do more than binary pixel counting. The simpler approaches tend to have a higher incidence of false rejects when set up to avoid false accepts. Some are less suited to character quality evaluation than others and, because of the nature of ink jet printing for example, can be expected to experience more false rejects. On the other hand, some approaches are considered better suited to handling confusion pairs.