Intelligent Document Processing (IDP) in a New Age Automated Workplace

What is IDP and how does it integrate with automation to maximize benefits?

Introduction

In an increasingly digitized world, corporations have realized that the fastest way to improve the efficiency of internal operations is to automate, and that the biggest roadblock to automation is the lack of structured, digitized inputs. Consequently, there has been a strong focus on digitization and document processing. These initiatives aim to structure unstructured and semi-structured data more accurately and effectively, and to digitize non-electronic data into machine-readable formats. To this end, multiple technologies have been explored: optical character recognition through digital character libraries, redaction solutions to redact or un-redact digital documentation, and AI/ML to classify and categorize multiple formats and data structures and provide automated learning.

It is commonly understood that the scope of back-office automation today is limited to the volumes that arrive as electronic, structured data, with manual intervention needed for the rest. Through IDP, the scope of automation can be expanded to all the volumes that can be digitized and brought to a structured format, thereby maximizing the value delivered. The following sections look at how Intelligent Document Processing takes process automation to the next level by adding digitization, structuring, and intelligence to the picture.

Components of the IDP Solution

The document reading and extraction market today is highly fragmented. Different vendors provide different process flows and differentiated capabilities. Any comprehensive Intelligent Document Processing (IDP) solution will comprise five critical flows: Data Ingestion, Pre-Processing, Document Classification, Data Extraction, and Validation or Feedback Loop.

Data Ingestion: Any IDP solution needs to be able to read different documents using OCR or other powerful ML algorithms. Normally, any data captured by an organization can be categorized as Structured (fixed structure and hierarchy), Unstructured (unorganized, multi-format, and free-form), or Semi-structured (a blend of structured and unstructured data). Market research shows that more than 75% of data in the world is in either an unstructured or a semi-structured format. While it is easy to read and categorize data in structured formats, like Excel tables, reading data in unstructured and semi-structured formats requires the use of AI-based solutions such as Optical Character Recognition (OCR), Computer Vision (CV), and Natural Language Processing (NLP). Traditional OCR detects characters such as letters and numbers, but depends on structured layouts such as data tables. With CV and NLP, however, the capabilities of OCR to handle unstructured and semi-structured data have undergone a paradigm shift, resulting in a set of solutions called Intelligent Character Recognition (ICR).

IDP solutions capture the data extracted by ICR and enable the system to structure it based on the data types, regardless of whether the documents are structured, unstructured, or semi-structured. Adding ML algorithms on top allows the solution to learn from the manual corrections made in each document-reading iteration.

Pre-processing: IDP may need to run on handwritten documents, scanned images, or computer-generated PDF files. Each of these has a different standard of quality. Before any data can be extracted, a document needs to be evaluated from a quality assessment standpoint, which includes cleaning, organizing, and transforming the raw data to meet the quality parameters mandated by the IDP or machine learning models.

Some popular preprocessing methods used by IDP tools include:

Data annotation and labeling: A configuration process where specific document types, including document fields, get tagged and annotated to help with classification.
Merge/split documents: Capability to analyze multi-page documents and recognize their layout to detect when a document needs to be split, or to analyze multiple documents to recognize when they need to be merged into one.
Skew correction: Ability to recognize when a scanned document is not upright and the text appears rotated or tilted at an angle, and to carry out a skew correction (de-skewing).
Noise removal: The process of detecting unclear sections, such as black dots, shadows, or blurs, and cleaning them up to ensure better quality. Noise removal methods include median filtering, edge detection, linear regression, and auto-encoding.
Data validation and correction: Compatibility check of the document, format validation, resolution specification.
Taxonomies and ontologies: Collection of data into categories, and identification of patterns of entities and relationships among the data.

Any IDP solution should be able to perform pre-processing to ensure data quality for extraction.
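Of the noise removal methods listed above, median filtering is the simplest to illustrate. The following is a minimal sketch in plain Python, assuming the scanned page is available as a 2D list of grayscale values (0 is black, 255 is white). The function name median_filter and the sample patch are hypothetical and not taken from any particular IDP product, which would more likely rely on an image-processing library.

```python
from statistics import median_low

def median_filter(image, size=3):
    """Return a copy of a 2D grayscale image in which each pixel is replaced
    by the median of its size x size neighborhood.
    Pixels near the border use only the neighbors that exist."""
    half = size // 2
    rows, cols = len(image), len(image[0])
    result = []
    for r in range(rows):
        out_row = []
        for c in range(cols):
            window = [
                image[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            # median_low always returns an actual pixel value from the window
            out_row.append(median_low(window))
        result.append(out_row)
    return result

# Hypothetical 5x5 patch of a scan: white background with one black speck.
page = [[255] * 5 for _ in range(5)]
page[2][2] = 0

cleaned = median_filter(page)
```

Because the isolated speck is outnumbered by white pixels in every 3x3 window, it is replaced by the background value, while larger connected shapes such as text strokes would largely survive the filter.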

Document Classification: When a back-office process requires multiple types of documents or multiple types of information to be captured, these need to be classified correctly by the IDP solution so that the relevant information can be extracted. This could be the classification of multiple documents, or even the classification of pages or sections within a single document, based on the data type being captured.

The classification module of an IDP solution is based on a combination of NLP, ML algorithms, deep learning, and other AI technologies. In today’s IDP market, classification does not just identify and categorize the content type within a document, but aims to achieve intelligent classification by looking at additional contributing parameters, such as date ranges.
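To make the idea of page-level classification concrete, the sketch below scores each page against a set of keyword profiles and picks the best match. This keyword-overlap scoring is a deliberately simple stand-in for the NLP and ML models described above, not how a production classifier works. The PROFILES dictionary, the classify function, and the sample pages are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical keyword profiles; a real solution would learn these
# from labeled training documents rather than hard-code them.
PROFILES = {
    "invoice": {"invoice", "amount", "due", "total", "payment"},
    "resume": {"experience", "education", "skills", "employment"},
    "contract": {"agreement", "party", "terms", "termination"},
}

def classify(text, profiles=PROFILES):
    """Return (label, score) for the profile whose keywords occur most
    often in the text, or ("unknown", 0) if no keyword matches."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        label: sum(words[keyword] for keyword in keywords)
        for label, keywords in profiles.items()
    }
    best = max(scores, key=scores.get)
    if scores[best] == 0:
        return ("unknown", 0)
    return (best, scores[best])

# Classify the pages of a mixed bundle one by one.
pages = [
    "Invoice 1042: total amount due within 30 days of payment terms",
    "Work experience, education and key skills",
]
labels = [classify(page)[0] for page in pages]
```

Classifying page by page, as in the usage above, is what allows a single multi-page upload to be routed to different extraction models.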

Data Extraction: The actual data extraction element of an IDP solution goes beyond traditional OCR. Since OCR focuses on character recognition, it has limited understanding of what a data point means. With the evolution of technology, businesses are looking to neural networks and algorithms for natural language processing or computer vision to go from OCR to intelligent data extraction.

An efficient IDP solution addresses many extraction challenges:

Textual data extraction: Use of entity extraction models to identify and segregate sets of information based on similar or common semantic parameters. For example:
1. Key value pairs
2. Entity recognition
3. Questions and answers
Visual data extraction: Understanding of visual elements such as tables, graphs, checkboxes, logos, and signatures. IDP solutions focus on the following during extraction:
1. De-noising irrelevant content, and accurately detecting the region where the visual element is present;
2. Detecting elements that appear in multiple layouts and many variations;
3. Detecting the exact boundaries, and segmenting based on semantics;
4. Detecting sub-elements in the region of interest and extracting information from them, such as rows and columns for tables; and
5. Decoding the structural relationship of the information.
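As an illustration of the key-value pair extraction listed under textual data extraction above, the sketch below pulls "Label: value" pairs out of text that has already been read by OCR. It uses a regular expression as a simple rule-based stand-in for the entity extraction models an IDP solution would use. The pattern, the function name extract_key_values, and the sample text are assumptions made for this example.

```python
import re

# Matches lines of the form "Label: value". The label must start with a
# letter and may contain letters, digits, spaces and a few symbols.
KEY_VALUE = re.compile(r"^\s*([A-Za-z][A-Za-z0-9 ./#-]*?)\s*:\s*(.+?)\s*$")

def extract_key_values(text):
    """Return a dict of label -> value for every line that looks like a pair."""
    pairs = {}
    for line in text.splitlines():
        match = KEY_VALUE.match(line)
        if match:
            pairs[match.group(1)] = match.group(2)
    return pairs

# Hypothetical OCR output for a small invoice header.
sample = """Invoice No: INV-1042
Date: 2023-04-01
Thank you for your business
Total: 1,250.00 USD"""

fields = extract_key_values(sample)
```

Lines without a label, such as the free-text greeting, are skipped. A rule like this only works when the layout is predictable, which is why the approaches described above lean on learned models for unstructured documents.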

Validation and Feedback Loop: The key evaluation parameter of any IDP solution is the accuracy of extraction. Unlike OCR, IDP looks beyond the quality of extraction and validates data against external sources to improve accuracy. An efficient data validation system can improve IDP accuracy by roughly 5-10%. Data validation can be dictionary-based, context-based, or pattern-based. Most IDP solutions validate data against defined business rules.
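The three validation styles mentioned above can be sketched in a few lines. In this example the dictionary-based check looks a value up in a list of known values, the pattern-based check uses a regular expression, and the context-based check is a business rule that compares fields with one another. The field names, the currency list, and the date format are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical rules for an extracted invoice record.
KNOWN_CURRENCIES = {"USD", "EUR", "INR"}            # dictionary-based
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")    # pattern-based

def validate(record):
    """Return a list of problems found; an empty list means the record passed."""
    problems = []
    if record.get("currency") not in KNOWN_CURRENCIES:
        problems.append("currency not in dictionary")
    if not DATE_PATTERN.match(record.get("date", "")):
        problems.append("date does not match YYYY-MM-DD")
    # Context-based business rule: line items must add up to the stated total.
    difference = sum(record.get("line_items", [])) - record.get("total", 0)
    if abs(difference) > 0.01:
        problems.append("line items do not sum to total")
    return problems

good = {"currency": "USD", "date": "2023-04-01",
        "line_items": [1000.0, 250.0], "total": 1250.0}
bad = {"currency": "US$", "date": "01/04/2023",
       "line_items": [1000.0, 250.0], "total": 1520.0}

good_problems = validate(good)
bad_problems = validate(bad)
```

Records that return a non-empty list would typically be routed to a human reviewer rather than passed straight to downstream automation.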

Another method used by IDP solutions to improve accuracy is the incorporation of feedback loops. Any corrections performed by a user carrying out quality checks can act as an input to the system so that the accuracy can be improved for future extractions. Modern ML-based IDP solutions automatically learn from manual corrections to ensure the accuracy of future extractions is improved.
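A feedback loop can be pictured as a memory of reviewer corrections that is consulted on later extractions. The sketch below uses a plain lookup table for this, which is a much simpler mechanism than the model retraining that ML-based IDP solutions perform. The CorrectionMemory class and the sample values are hypothetical.

```python
class CorrectionMemory:
    """Remembers reviewer corrections per field and reapplies them later."""

    def __init__(self):
        # Maps field name -> {wrongly extracted value: corrected value}
        self._fixes = {}

    def record(self, field, extracted, corrected):
        """Store a correction made by a reviewer during quality checks."""
        if extracted != corrected:
            self._fixes.setdefault(field, {})[extracted] = corrected

    def apply(self, record):
        """Return a copy of the record with all known corrections applied."""
        return {
            field: self._fixes.get(field, {}).get(value, value)
            for field, value in record.items()
        }

memory = CorrectionMemory()

# A reviewer fixes a misread vendor name on the first document.
memory.record("vendor", "Acme C0rp", "Acme Corp")

# The same misread on a later document is now corrected automatically.
later = memory.apply({"vendor": "Acme C0rp", "total": "100"})
```

A lookup like this only repairs exact repeats of an earlier mistake, whereas a learning model can generalize from corrections to errors it has not seen before.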

Author:

Nandhagopal Muralithar is a Senior Business Consultant at IGT Solutions’ Intelligent Automation and Analytics Practice. With specialized experience of 7 years in Digital Transformation Consulting across Travel, Hospitality and Retail domains, Nandh has worked extensively on the front edge of process and conversational automation technologies, designing and delivering innovative back and front office automation solutions.