List of 25 Key Terms in OCR & IDP

OCR (Optical Character Recognition)
HCR (Handwritten Character Recognition)
ICR (Intelligent Character Recognition)
OMR (Optical Mark Recognition)
Computer Vision
Dots per Inch (DPI)
Deskew / Skew Correction
Character Error Rate (CER)
Word Error Rate (WER)
Confidence Score
Confidence Threshold
Parsing
Fuzzy Matching
Tokens
Lemmatization
Word Embedding
IDP (Intelligent Document Processing)
Human in the Loop
Straight Through Processing (STP)
RPA (Robotic Process Automation)
ML (Machine Learning)
DL (Deep Learning)
NLP (Natural Language Processing)
NER (Named Entity Recognition)
LLM (Large Language Model)

The world of OCR (Optical Character Recognition) and IDP (Intelligent Document Processing) is changing rapidly. For many, this technical vocabulary may seem complex, even though it is at the heart of modern document automation. This glossary presents 25 key definitions, ranging from the basics of OCR to advanced artificial intelligence building blocks, to help you better navigate the world of intelligent document management.

Understand the essentials of OCR and document automation through clear definitions, comparisons, and best practices that help you speed up your workflows and make your processes more reliable today.

OCR Basics and Its Variants

1 - OCR (Optical Character Recognition)

OCR is the technology that makes it possible to convert text found in an image or PDF into usable digital data.

For example, it can automatically extract the number from an invoice or the expiration date of an identity card. OCR is the fundamental building block of document automation, because it makes information “readable” by a computer.

2 - HCR (Handwritten Character Recognition)

Handwritten character recognition (HCR) is a technology dedicated to recognizing isolated handwritten characters. It is found, for example, in administrative or banking forms where you are asked to write in capital letters, letter by letter, in boxes. This is a reliable approach in highly structured environments, but it is limited when it comes to cursive scripts or whole sentences.

3 - ICR (Intelligent Character Recognition)

ICR is a more advanced evolution of HCR. It uses machine learning algorithms to recognize more complex types of writing, whether cursive or free-form handwriting. Unlike HCR, it can learn and improve through human corrections. For example, it is used to read handwritten notes, medical prescriptions, or annotations on invoices.

4 - OMR (Optical Mark Recognition)

OMR is a technology that detects the presence of visual marks on a document, such as checked boxes or filled circles. It is used in multiple-choice questionnaires, paper surveys, and some attendance sheets.
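To make the idea concrete, here is a minimal sketch of mark detection. It assumes the box region has already been cropped and is given as rows of grayscale values (0 = black, 255 = white); the function names and the 0.4 fill cutoff are illustrative assumptions, not values from a specific OMR product.

```python
def fill_ratio(box):
    """Fraction of dark pixels in a box region (rows of 0-255 grayscale values)."""
    dark = sum(1 for row in box for px in row if px < 128)
    total = sum(len(row) for row in box)
    return dark / total if total else 0.0


def is_marked(box, min_fill=0.4):
    """Treat the box as checked when enough of its pixels are dark (assumed cutoff)."""
    return fill_ratio(box) >= min_fill


# Two synthetic 10x10 boxes: one blank, one with an 8x8 dark area in the middle.
empty_box = [[255] * 10 for _ in range(10)]
checked_box = [
    [0 if 1 <= x <= 8 and 1 <= y <= 8 else 255 for x in range(10)]
    for y in range(10)
]

print(is_marked(empty_box), is_marked(checked_box))
```

In a real form, the cutoff would need tuning so that stray pen strokes or printed box borders are not counted as marks.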

Image Quality and Precision

5 - Computer Vision

Computer vision is a field of artificial intelligence that allows machines to understand and analyze images and videos. It is the basis of many OCR applications, since it makes it possible to identify the structure of a document, locate text areas, and distinguish between text, tables, and images.

6 - Dots per Inch (DPI)

DPI (dots per inch) measures the resolution of a scanned image. The higher the value, the more detail the image contains, which generally improves OCR accuracy.

In practice, a 300 DPI scan is often recommended for invoices or identity documents in order to obtain reliable extractions.

Note: beyond 600 DPI, extraction quality doesn’t really improve anymore, but file sizes become significantly heavier.
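The effect of DPI on file size can be illustrated with a little arithmetic. The sketch below assumes an A4 page (210 x 297 mm) stored as uncompressed 8-bit grayscale; the helper names are hypothetical, and real scans are usually compressed, so actual files will be smaller.

```python
MM_PER_INCH = 25.4


def pixel_size(width_mm, height_mm, dpi):
    """Pixel dimensions of a page scanned at the given resolution."""
    return (
        round(width_mm / MM_PER_INCH * dpi),
        round(height_mm / MM_PER_INCH * dpi),
    )


def raw_size_bytes(width_mm, height_mm, dpi, bytes_per_pixel=1):
    """Uncompressed size, assuming 1 byte per pixel (8-bit grayscale)."""
    width_px, height_px = pixel_size(width_mm, height_mm, dpi)
    return width_px * height_px * bytes_per_pixel


for dpi in (300, 600):
    width_px, height_px = pixel_size(210, 297, dpi)
    size_mb = raw_size_bytes(210, 297, dpi) / 1_000_000
    print(f"{dpi} DPI: {width_px} x {height_px} px, about {size_mb:.1f} MB raw")
```

Doubling the DPI doubles both dimensions, so the raw pixel count (and the uncompressed size) grows by roughly a factor of four.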

7 - Deskew / Skew Correction

When a document is scanned crooked, the text lines are slanted, which reduces extraction quality. Deskewing consists of automatically straightening the document so that OCR can work on properly aligned text. This pre-processing step is essential to avoid reading errors.

Note: this step is invisible to the end user, but it has a significant impact on recognition accuracy.
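As a rough illustration of the principle, the sketch below estimates a skew angle from points along one text baseline and rotates them back. It assumes those baseline points have already been detected by an earlier step; production deskewing works on the full image, and the function names here are hypothetical.

```python
import math


def estimate_skew_degrees(baseline_points):
    """Angle of the least-squares line through (x, y) points on one text baseline."""
    n = len(baseline_points)
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    var = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    return math.degrees(math.atan2(cov, var))


def rotate_points(points, degrees):
    """Rotate points around the origin by the given angle."""
    angle = math.radians(degrees)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    return [(x * cos_a - y * sin_a, x * sin_a + y * cos_a) for x, y in points]


# Synthetic baseline tilted by 2 degrees.
skewed = [(x, 50 + x * math.tan(math.radians(2.0))) for x in range(0, 500, 50)]
angle = estimate_skew_degrees(skewed)
straight = rotate_points(skewed, -angle)  # rotating by the opposite angle levels the line
print(round(angle, 3))
```

After the correction, all points share (almost exactly) the same vertical coordinate, which is what a horizontal text line looks like to the OCR engine.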

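8 - Character Error Rate (CER)

CER is an indicator that measures the rate of recognition errors at the character level. It is typically computed as the number of character insertions, deletions, and substitutions needed to turn the OCR output into the reference text, divided by the length of the reference. For example, if an OCR engine regularly mistakes the uppercase “O” for the digit “0”, the CER increases. The lower this indicator, the better the system performs.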

9 - Word Error Rate (WER)

WER works like CER, but at the level of whole words. It is often used to assess the transcription quality of a document or audio file. In professional use, a low WER is essential to guarantee reliable and usable extractions.
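Both CER and WER can be sketched with the same edit-distance routine, applied to characters in one case and to words in the other. This is a minimal illustration assuming a non-empty reference text and whitespace-separated words; the function names are illustrative.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character errors divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


reference = "INVOICE NO 1001"
ocr_output = "INV0ICE NO 1001"  # the letter O was read as the digit 0

print(f"CER = {cer(reference, ocr_output):.3f}, WER = {wer(reference, ocr_output):.3f}")
```

Here a single wrong character out of 15 gives a low CER, while the WER is one word out of three: one character error is enough to make the whole word wrong.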

10 - Confidence Score

The confidence score is a value assigned by an OCR engine to estimate how reliable the recognition of a character, word, or field is. For example, if a “Total Amount (incl. tax)” field is extracted with 98% confidence, it is most likely correct.

Note: making good use of confidence scores helps reduce the amount of manual verification required.

11 - Confidence Threshold

The confidence threshold is the minimum confidence score at which extracted data is considered acceptable. Below this threshold, the system may request a manual check. This makes it possible to combine automation and quality control.

Note: if the threshold is set too low, errors may slip through; if it is set too high, you will end up with excessive manual validations.
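A threshold-based routing rule might look like the sketch below. The field names, the 0.90 threshold, and the status labels are assumptions chosen for illustration; a real system would tune the threshold per field and per document type.

```python
def route_field(name, value, confidence, threshold=0.90):
    """Accept a field automatically or send it to manual review (assumed labels)."""
    status = "accepted" if confidence >= threshold else "manual_review"
    return {"field": name, "value": value, "confidence": confidence, "status": status}


# Hypothetical extraction results: (field name, value, confidence score).
extracted = [
    ("total_amount", "1250.50", 0.98),
    ("invoice_number", "INV-0O42", 0.71),
]

for name, value, confidence in extracted:
    print(route_field(name, value, confidence))
```

Raising the threshold sends more fields to review; lowering it lets more fields through unchecked, which is the trade-off described in the note above.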

Linguistic and Semantic Processing

12 - Parsing

Parsing is the process of analyzing a text in order to structure it and extract usable elements. In the context of OCR, this can mean identifying an amount in an invoice or a date in a contract, even if the format of the document varies.

Note: without parsing, OCR only produces a raw “copy-paste” of text, which is hard to work with.
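The difference between raw OCR text and parsed data can be shown with a small rule-based sketch. The sample text, the regular expressions, and the day/month/year date convention are all assumptions for illustration; real parsers have to handle many more layouts and formats.

```python
import re
from datetime import datetime
from decimal import Decimal

# Assumed patterns: an amount such as "1,250.50" on a line containing "total",
# and a date written as DD/MM/YYYY.
AMOUNT_RE = re.compile(r"total[^\d\n]*(\d[\d,]*\.\d{2})", re.IGNORECASE)
DATE_RE = re.compile(r"\b(\d{2}/\d{2}/\d{4})\b")


def parse_invoice(raw_text):
    """Turn raw OCR text into a small structured record (None when not found)."""
    amount_match = AMOUNT_RE.search(raw_text)
    date_match = DATE_RE.search(raw_text)
    total = Decimal(amount_match.group(1).replace(",", "")) if amount_match else None
    date = (
        datetime.strptime(date_match.group(1), "%d/%m/%Y").date()
        if date_match
        else None
    )
    return {"total": total, "date": date}


raw_text = "ACME Corp\nInvoice date: 03/02/2024\nTotal amount: 1,250.50 EUR\n"
parsed = parse_invoice(raw_text)
print(parsed)
```

The output is a record with a typed amount and date instead of a block of text, which is what downstream systems such as an ERP can actually consume.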

13 - Fuzzy Matching

Fuzzy matching allows you to compare two character strings even if they do not match exactly. For example, “Société Générale” and “Societe Generale” will be recognized as a match despite differences in accents or case. This approach is widely used for bank data reconciliation or KYC.

Note: fuzzy matching doesn’t always guarantee a perfect match; there’s a risk of “false positives” if the similarity threshold is poorly configured.
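One simple way to sketch fuzzy matching uses only the Python standard library: strip accents and case, then score similarity with `difflib.SequenceMatcher`. The 0.85 threshold and the function names are assumptions; dedicated libraries and other similarity measures are common in practice.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(text):
    """Remove accents and case differences before comparing."""
    decomposed = unicodedata.normalize("NFKD", text)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return without_accents.casefold().strip()


def similarity(a, b):
    """Similarity ratio between 0.0 and 1.0 on the normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(a, b, threshold=0.85):
    """Assumed decision rule: match when similarity reaches the threshold."""
    return similarity(a, b) >= threshold


print(is_match("Société Générale", "Societe Generale"))
print(is_match("Société Générale", "SOCIETE GENRALE"))  # one letter missing
print(is_match("Société Générale", "BNP Paribas"))
```

Lowering the threshold makes the comparison more tolerant of typos, but also increases the risk of matching two genuinely different names.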

14 - Tokens

Tokens are the basic units of text, obtained by breaking it down into words, subwords, or characters. Tokenization is a preliminary step in NLP that allows language to be processed in a more structured form.
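As a small illustration, the sketch below splits a sentence into word-level and character-level tokens using a simple regular expression. This is a deliberately naive rule; the subword tokenizers used by modern language models are learned from data and are not reproduced here.

```python
import re


def word_tokens(text):
    """Split into runs of word characters and individual punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def char_tokens(text):
    """Split into single characters, ignoring whitespace."""
    return [ch for ch in text if not ch.isspace()]


sentence = "Total due: 120 EUR."
print(word_tokens(sentence))
print(char_tokens("EUR."))
```

The same text yields far more tokens at the character level than at the word level, which is one reason the choice of tokenization affects how models process language.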

15 - Lemmatization

Lemmatization consists of reducing a word to its base form (the lemma). For example, “ran” and “running” both become “run.” This allows AI systems to better understand the general meaning of a text without being thrown off by grammatical variations.

Note: it differs from “stemming,” which simply removes suffixes and often produces less accurate results.
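The contrast between the two approaches can be sketched with toy implementations. The lookup table and the suffix rules below are tiny, hand-written assumptions for illustration; real lemmatizers rely on full dictionaries and grammatical analysis.

```python
# Hypothetical, hand-written lemma table (real systems use large lexicons).
LEMMAS = {"ran": "run", "running": "run", "runs": "run", "paid": "pay"}


def lemmatize(word):
    """Look the word up in the lemma table; fall back to the lowercase word."""
    lowered = word.lower()
    return LEMMAS.get(lowered, lowered)


def naive_stem(word):
    """Strip a few common suffixes without any knowledge of the language."""
    lowered = word.lower()
    for suffix in ("ing", "ed", "s"):
        if lowered.endswith(suffix) and len(lowered) > len(suffix) + 2:
            return lowered[: -len(suffix)]
    return lowered


for word in ("ran", "running", "runs"):
    print(word, "->", lemmatize(word), "|", naive_stem(word))
```

The lemmatizer maps all three forms to the same lemma, while the suffix stripper leaves an irregular form unchanged and cuts another into a non-word.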

16 - Word Embedding

Word embedding is a technique that turns words into numerical vectors. These representations allow machines to capture relationships between words, such as the proximity between “bill” and “payment.” Embeddings are used in modern NLP models to improve contextual understanding.

Note: well-known word embedding models include Word2Vec and GloVe; more recent models such as BERT go further and produce embeddings that depend on the surrounding context.
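The notion of proximity between words is usually measured with cosine similarity between their vectors. In the sketch below, the three-dimensional vectors are made up purely for illustration; real embeddings have hundreds of dimensions and are learned from large text corpora.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Made-up toy vectors, not taken from any trained model.
EMBEDDINGS = {
    "bill": [0.9, 0.8, 0.1],
    "payment": [0.8, 0.9, 0.2],
    "giraffe": [0.1, 0.0, 0.9],
}

close = cosine_similarity(EMBEDDINGS["bill"], EMBEDDINGS["payment"])
far = cosine_similarity(EMBEDDINGS["bill"], EMBEDDINGS["giraffe"])
print(f"bill/payment: {close:.2f}, bill/giraffe: {far:.2f}")
```

With these toy values, the related pair scores much higher than the unrelated one, which is the behavior trained embeddings are meant to exhibit.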

Intelligent Document Automation

17 - IDP (Intelligent Document Processing)

IDP is an approach that combines OCR, AI, and NLP to extract, classify, and validate data from complex documents. Unlike OCR alone, it integrates business logic (for example, verifying that an invoice contains a valid VAT number) and allows large volumes of documents to be processed automatically.
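The business-logic layer can be pictured as a set of validation rules applied to the extracted fields. The sketch below is an assumption-heavy illustration: the field names are hypothetical, the VAT pattern is a simplified format check rather than an official validation, and the totals rule is an extra example not taken from the text above.

```python
import re
from decimal import Decimal

# Simplified, hypothetical format check: two letters followed by 8 to 12
# letters or digits. Real VAT validation is country-specific.
VAT_PATTERN = re.compile(r"^[A-Z]{2}[0-9A-Z]{8,12}$")


def validate_invoice(fields):
    """Return a list of rule violations for one extracted invoice."""
    errors = []
    vat = fields.get("vat_number", "").replace(" ", "")
    if not VAT_PATTERN.match(vat):
        errors.append("vat_number has an unexpected format")
    net, tax, total = (Decimal(fields[key]) for key in ("net", "tax", "total"))
    if net + tax != total:
        errors.append("net + tax does not equal total")
    return errors


good = {"vat_number": "FR12 345678901", "net": "100.00", "tax": "20.00", "total": "120.00"}
bad = {"vat_number": "12345", "net": "100.00", "tax": "20.00", "total": "125.00"}

print(validate_invoice(good))
print(validate_invoice(bad))
```

An empty list means the document passed every rule; any violation can be used to send the invoice to a human reviewer instead of processing it automatically.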

18 - Human in the Loop

The human-in-the-loop approach involves including human intervention in an automated process to correct or validate certain data. It is particularly useful when OCR encounters poor-quality or atypical documents.

19 - Straight Through Processing (STP)

STP refers to fully automated processing, without any human intervention. It is highly sought after in financial processes (for example, automatic validation of a correctly formatted supplier invoice).

Note: achieving 100% STP is rare; most organizations combine STP with manual checks.
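A common way to track this is an STP rate, meaning the share of documents that went through without any manual step. The sketch below assumes each processed document carries a hypothetical `needs_review` flag; the data is invented for illustration.

```python
def stp_rate(documents):
    """Share of documents processed with no manual review (0.0 for an empty batch)."""
    if not documents:
        return 0.0
    straight_through = sum(1 for doc in documents if not doc["needs_review"])
    return straight_through / len(documents)


# Hypothetical batch: three invoices went straight through, one was reviewed.
batch = [
    {"id": "inv-001", "needs_review": False},
    {"id": "inv-002", "needs_review": False},
    {"id": "inv-003", "needs_review": True},
    {"id": "inv-004", "needs_review": False},
]

print(f"STP rate: {stp_rate(batch):.0%}")
```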

20 - RPA (Robotic Process Automation)

RPA allows you to automate repetitive tasks using software robots. Combined with OCR and IDP, it can automate entire workflows: receipt of invoices, extraction, entry into the ERP, then automatic archiving.

AI and Natural Language Processing

21 - ML (Machine Learning)

Machine learning is a branch of AI that allows a system to learn from data and improve its performance over time. In OCR, it is used to improve character recognition or to adapt extraction to new document formats.

22 - DL (Deep Learning)

Deep learning is a subset of machine learning based on deep neural networks. It is particularly effective for complex tasks such as image recognition, the reading of handwritten texts or the contextual understanding of documents.

To better understand the differences between these two approaches, check out our article on Machine Learning vs Deep Learning.

23 - NLP (Natural Language Processing)

NLP includes techniques that allow machines to understand and analyze human language. Combined with OCR, it makes it possible to extract meaning from unstructured documents such as contracts or emails.

24 - NER (Named Entity Recognition)

Named entity recognition is an NLP technique that identifies specific elements in a text: names of people, dates, amounts, account numbers, etc. It is a key feature for automating KYC verification and regulatory compliance.

25 - LLM (Large Language Model)

LLMs are AI models trained on huge volumes of text.

They are able to understand, summarize, or generate natural language. In IDP, they provide an additional layer of intelligence, for example by making it possible to contextualize an extraction or to check the consistency of a document.

Note: LLMs are powerful but can “hallucinate” responses; human oversight is therefore essential in professional contexts.