New Research

Cognitive Data Capture – Setting a New Baseline to Digitize Invoices

Christopher Bourez, Artificial Intelligence Architect, Ivalua

Many providers in the market talk about leveraging machine learning to capture and digitize information such as that from an invoice. The truth is that the improvement in accuracy machine learning can offer varies widely; not all approaches are made equal. For example, some may call a rule-based engine machine learning. In a rule-based system, the supplier is typically identified through a regex search for the VAT number in the OCR data of the invoice. Although the VAT number is precise and unique, it does not always appear on the invoice itself, and the OCR data might not be clean enough for a number to be recognized as a VAT number.
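To make the rule-based approach concrete, here is a minimal sketch of a regex-driven VAT lookup over raw OCR text. The pattern below (two-letter country code followed by 8 to 12 digits) is a simplified assumption for illustration, not a full VAT validator.

```python
import re

# Simplified VAT shape: 2-letter country code, optional space, 8-12 digits.
# A real validator would handle per-country formats and check digits.
VAT_PATTERN = re.compile(r"\b([A-Z]{2}\s?\d{8,12})\b")

def find_vat_candidates(ocr_text: str) -> list[str]:
    """Return normalized VAT-number candidates found in the OCR text."""
    return [m.replace(" ", "") for m in VAT_PATTERN.findall(ocr_text.upper())]

print(find_vat_candidates("Invoice 42 - VAT: FR 40303265045 - Total: 120 EUR"))
```

When the OCR output is noisy (a digit misread as a letter, a missing country prefix), the pattern simply never fires, which is exactly the low-recall behavior discussed below.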

Compare this to a cognitive model. There have been significant gains in the identification of fields in an invoice, such as total amount, total amount without tax, invoice code and dates, through a cognitive model based on image segmentation. However, even with this model the most challenging problem remains the correct identification of the supplier.

  • First, the supplier name is not as easily identified as it may seem. A company can be named after a street (Rue du Commerce), a field value (Total), a website, common words (public, best practices, buy, etc.), an abbreviation (“ABC”), a brand name rather than a legal name, and so on. The supplier name can also be replaced by a logo, which the OCR does not read. Unlike the total amount, which sits next to a set label such as “Total:” or “Total $”, the supplier name is never tied to a fixed label.
  • Second, the supplier name alone is not reliable or sufficient information to identify the correct supplier. Many suppliers can share the same name with different legal statuses or registration numbers, in different cities or even different countries.

In data science, precision and recall are the two metrics that best describe the behavior of these two methods. Quoting the Wikipedia definitions: “precision is the fraction of relevant instances among the retrieved instances, while recall is the fraction of relevant instances that have been retrieved over the total amount of relevant instances”. Thus, the rule-based regex method has high precision with low recall, while the cognitive method for supplier name identification has high recall but low precision.
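The trade-off can be made concrete with a tiny worked example; the counts below are invented for illustration, not taken from our dataset.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Return (precision, recall) from raw outcome counts."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall

# Rule-based regex: almost never wrong when it fires, but often fails to fire.
print(precision_recall(40, 1, 60))   # high precision, low recall

# Cognitive name spotting: usually proposes something, but often the wrong one.
print(precision_recall(80, 40, 5))   # high recall, lower precision
```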

Ivalua’s AI team has developed a new method to overcome the challenge of supplier identification in invoices. Our belief is that using the invoice template as the means to identify and capture information is unreliable: some templates are common to many suppliers, and what happens when a supplier decides to update their template? We wanted a more reliable method to identify the supplier with certainty.

While statistical language models are commonly used to improve the quality of neural machine translation by filtering out improbable word combinations, it is possible to define a statistical supplier model to rank supplier proposals by their computed probability.
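One minimal way to sketch such a statistical supplier model is an add-one-smoothed unigram model per supplier record, with candidates ranked by the log-probability of the invoice words. The supplier records and invoice words below are invented for illustration.

```python
import math
from collections import Counter

def score(invoice_words: list[str], supplier_words: list[str]) -> float:
    """Log-probability of the invoice words under an add-one-smoothed
    unigram model built from one supplier's record."""
    counts = Counter(supplier_words)
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen words
    return sum(math.log((counts[w] + 1) / (total + vocab)) for w in invoice_words)

# Invented supplier records and invoice words.
candidates = {
    "ACME SAS Paris": ["acme", "sas", "paris"],
    "ACME GmbH Berlin": ["acme", "gmbh", "berlin"],
}
invoice = ["acme", "paris"]
best = max(candidates, key=lambda s: score(invoice, candidates[s]))
print(best)  # the Paris entity matches both invoice words and wins
```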

The recent advances in question-answering over knowledge bases led to a similar conclusion. Possible matches to a user’s question are proposed by a cognitive model and the values to query against (subject, object, dates, …) are matched using tables of lexicalisations or other tools.

The issue of supplier identification can follow such an approach, provided there is access to a complementary supplier database with information on each supplier, such as names, addresses, cities and countries. Moreover, we can consider adding a table of lexicalisations, i.e., all possible ways to name each supplier (abbreviations, different spellings, the website, the name with legal status, the commercial brand, etc.).
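A lexicalisation table can be as simple as a mapping from supplier ids to the surface forms we might see on an invoice; the ids and names below are invented for illustration.

```python
# Invented lexicalisation table: supplier id -> every known surface form.
LEXICALISATIONS = {
    "SUP-001": {"total", "total se", "totalenergies.com"},
    "SUP-002": {"abc", "a.b.c. consulting", "abc consulting gmbh"},
}

def lookup(ocr_token: str) -> list[str]:
    """Return supplier ids whose lexicalisations contain the OCR token."""
    token = ocr_token.lower().strip()
    return [sid for sid, forms in LEXICALISATIONS.items() if token in forms]

print(lookup("Total SE"))
```

In practice such a table is populated from the supplier database plus manually added variants, and matching is fuzzy rather than exact.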

Ivalua assessed the accuracy of supplier identification using all of these methods. For this purpose, we selected a particularly difficult sample of 1,481 supplier records: few VAT numbers, sparse supplier addresses, spelling errors, a lack of lexicalisations, and so on. Here is what we found:

| Method | Supplier Identification Accuracy |
| --- | --- |
| Rule-based: simple regex search on VAT numbers | 46.15% |
| Cognitive model: the model proposes a supplier name in the invoice, then identification is performed by name match against the database | 60.60% |
| Full text search: a full-text distance between the OCR data in the invoice and available supplier data identifies the best supplier match in the database | 68.42% |
| Full text search with a few lexicalisations: the database of names is enhanced with other ways to name the supplier | 73.69% |
| Cognitive full text search: word matches are weighted by the cognitive score, the probability that the word belongs to the supplier zone in the invoice | 89.09% |
| Manual: we ask humans to select the supplier using a search UI over the database | 73.74% |
| Achievable upper bound with words only: due to the absence of word matches between the invoice OCR data and the supplier data, no word-based supplier identification method can score better on this database | 95% |

*Numbers are based on a challenging sample dataset.
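The cognitive full text search method can be sketched as follows: each OCR word carries the model's probability of belonging to the supplier zone, and a supplier's score sums the probabilities of the words it matches. All probabilities and names below are invented.

```python
def cognitive_score(supplier_words: set[str], ocr_words: dict[str, float]) -> float:
    """Sum, over OCR words matching this supplier, of the model's probability
    that the word belongs to the supplier zone of the invoice."""
    return sum(p for word, p in ocr_words.items() if word in supplier_words)

# Invented OCR words with made-up supplier-zone probabilities: "total" is a
# field value, so the segmentation model assigns it a very low weight.
ocr = {"acme": 0.92, "paris": 0.40, "total": 0.05, "invoice": 0.01}
suppliers = {
    "ACME SAS": {"acme", "sas", "paris"},
    "TOTAL SA": {"total", "sa"},
}
ranked = sorted(suppliers, key=lambda s: cognitive_score(suppliers[s], ocr), reverse=True)
print(ranked[0])
```

The weighting is what keeps a field value like "Total" from dragging an unrelated supplier to the top of the ranking.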

The best performance, 89.09%, is achieved by the combination of the search engine and the probabilities from the cognitive model, as expected. The cognitive model and the full text search each performed worse separately, at 60.60% and 68.42% respectively. Manually adding a few lexicalisations helped increase the performance of the search engine.

The most surprising result is the performance achieved by humans, 73.74%. We expected results above 90%. Looking at the human errors, we classified them into the following 13 issues:

  1. Insider rules and funky names making supplier identification impossible, such as identifying the supplier by the bank card used for the payment
  2. A change of supplier name
  3. Error in identifying the correct supplier country
  4. Mergers and acquisitions
  5. Spelling error in the database corrected later leading to a retrieval error
  6. Missing fuzzy search implementation (spaces, plurals, punctuation, stopwords such as “Inc”, corporate forms, countries)
  7. Abbreviations
  8. Void name in database
  9. ETL / import error
  10. Ambiguity between commercial name and legal entity name
  11. Human error – name identification (real error)
  12. Employee cheat (creation of an “all remaining” supplier to annotate tail spend faster)
  13. Incomplete data in query and database leading to mismatch: a name such as “xxx yyy” where “xxx” is in query and “yyy” is in database

Human annotators perform similarly to automatic supplier identification by full text search given all the OCR data, but are surpassed by the hybrid solution that combines the cognitive model and full text search statistically.

The use of a statistical model to identify an entity in a document opens the way for other applications in procurement databases such as automatic supplier deduplication as well as automatic item/product deduplication.
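As a hint of what supplier deduplication could look like, here is a minimal fuzzy-matching sketch using Python's standard difflib; the data is invented, and a production system would rely on the statistical model described above rather than string similarity alone.

```python
import difflib

def likely_duplicates(names: list[str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Flag pairs of supplier names whose lowercase similarity ratio
    meets the threshold (0.8 chosen arbitrarily for this sketch)."""
    pairs = []
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            ratio = difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()
            if ratio >= threshold:
                pairs.append((a, b))
    return pairs

print(likely_duplicates(["Acme SAS", "ACME S.A.S.", "Globex Ltd"]))
```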

We hope this provides a detailed peek into what Ivalua’s AI team is working on, and we hope to meet you at Ivalua NOW 2019.

Part 1 in this blog series can be found here: Invoice Data Capture with AI: Rule-based versus cognitive field extraction

