Like many companies, not least financial institutions, Capital One has thousands of documents to process, analyze, and transform in order to carry out day-to-day operations. Examples might include receipts, invoices, forms, statements, contracts, and many more pieces of unstructured data, and it's important to be able to quickly understand the information embedded within unstructured data such as these. Fortunately, recent advances in computer vision allow us to make great strides in easing the burden of document analysis and understanding. In this post, we'll describe a multi-task convolutional neural network that we developed in order to efficiently and accurately extract text from images of documents.

The challenge of extracting text from images of documents has traditionally been referred to as Optical Character Recognition (OCR) and has been the focus of much research. When documents are clearly laid out and have global structure (for example, a business letter), existing tools for OCR can perform quite well. A popular open source tool for OCR is the Tesseract Project, which was originally developed by Hewlett-Packard but has been under the care and feeding of Google in recent years. Tesseract provides an easy-to-use interface as well as an accompanying Python client library, and tends to be a go-to tool for OCR-related projects.
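To make that concrete, here is a minimal sketch of invoking Tesseract from Python via the community `pytesseract` wrapper (one of several Python clients for Tesseract). The image path is a placeholder, and the Tesseract binary must be installed separately:

```python
# A minimal sketch: extracting text from a clearly laid-out document scan
# with Tesseract through the pytesseract wrapper.
# Assumes the Tesseract binary is installed; "letter.png" is a placeholder path.
from PIL import Image
import pytesseract

image = Image.open("letter.png")            # a well-structured document image
text = pytesseract.image_to_string(image)   # run OCR with default settings
print(text)
```

On a clean, globally structured document like a business letter, defaults such as these are often all that's needed.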
More recently, cloud service providers have been rolling out text detection capabilities alongside their various computer vision offerings. These include GoogleVision, AWS Textract, Azure OCR, and Dropbox, among others. It is an exciting time in the field, as computer vision techniques are becoming widely available to empower many use cases.

There are, however, many use cases in what we might call non-traditional OCR where these existing generic solutions are not quite the right fit. An example might be detecting arbitrary text in images of natural scenes. Problems of this nature are formalized in the COCO-Text challenge, where the goal is to extract text that might be included in road signs, house numbers, advertisements, and so on. Another area that poses similar challenges is text extraction from images of complex documents. In contrast to documents with a global layout (such as a letter, a page from a book, or a column from a newspaper), many types of documents are relatively unstructured in their layout and have text elements scattered throughout (such as receipts, forms, and invoices). Problems like this have recently been formalized in the ICDAR DeTEXT Text Extraction From Biomedical Literature Figures challenge. These images are characterized by complex arrangements of text bodies scattered throughout a document and surrounded by many "distraction" objects. In these images, a primary challenge lies in properly segmenting objects in an image to identify reasonable text blocks. Example images from COCO-Text and ICDAR-DeTEXT are shown below.

![Example images from COCO-Text and ICDAR-DeTEXT]()

These regimes of non-traditional OCR pose unique challenges, including background/object separation, multiple scales of object detection, coloration, text orientation, text length diversity, font diversity, distraction objects, and occlusions. The problems posed in non-traditional OCR can be addressed with recent advances in computer vision, particularly within the field of object detection. As we discuss below, powerful methods from the object detection community can be easily adapted to the special case of OCR.

The field of computer vision aims to extract semantic knowledge from digitized images by tackling challenges such as image classification, object detection, image segmentation, depth estimation, pose estimation, and more. For this discussion, we'll focus on object detection (and the related task of image segmentation), which has seen impressive improvements in recent years. Early attempts at object detection focused on applying image classification techniques to various pre-identified parts of an image. Many approaches have since focused on speeding up the identification of candidate regions and on using convolutional mechanisms for feature extraction and classification. While there have been many interesting developments in the field, we will focus primarily on MaskRCNN, a model which is able to very successfully conduct object detection and image segmentation.

An example output from MaskRCNN is shown below. For any input image, this model is trying to accomplish three things: object detection (green boxes), object classification, and segmentation (colorful shaded regions). The green bounding boxes on the image below are the outputs of the model, and above each box is a prediction of what kind of object is contained within.

![Example output from MaskRCNN]()
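To see those three outputs side by side, here is a sketch of running the off-the-shelf pre-trained Mask R-CNN that ships with torchvision. Note this is the standard reference implementation, not the multi-task network described in this post, and the image path is a placeholder:

```python
# A sketch of Mask R-CNN inference with torchvision's pre-trained model,
# showing the three per-image outputs: boxes, class labels, and masks.
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# pretrained=True works on older torchvision; newer versions use `weights=`.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()  # inference mode

image = transforms.ToTensor()(Image.open("scene.jpg").convert("RGB"))
with torch.no_grad():
    (output,) = model([image])  # list of images in, one dict per image out

boxes = output["boxes"]    # object detection: [N, 4] bounding boxes
labels = output["labels"]  # object classification: one class id per box
masks = output["masks"]    # segmentation: [N, 1, H, W] per-instance masks
scores = output["scores"]  # confidence score for each detection
print(boxes.shape, labels.shape, masks.shape)
```

In practice, detections are typically filtered by a score threshold (say, 0.5) before the boxes and masks are drawn, which is how outputs like the figure above are produced.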