Master data matching methods (edoc master data matching service)

edoc master data matching service uses methods such as fuzzy matching and the Manhattan metric to compare document properties with master data, even when spellings differ, and to evaluate how unique a match is.

To increase the accuracy of master data matching, all comparison values are normalized: certain characters are removed and the character strings are converted to lowercase.

The following characters are removed from the comparison values:

Horizontal tab (\t)
Carriage return (\r)
New line (\n)
Vertical tab (\v)
Space
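
The normalization described above can be sketched in a few lines of Python. This is only an illustration, not the service's actual implementation; the names REMOVED_CHARACTERS and normalize are assumptions made for this example.

```python
# Characters removed from comparison values, as listed above:
# horizontal tab, carriage return, new line, vertical tab, space.
REMOVED_CHARACTERS = "\t\r\n\v "

# Translation table that deletes every character in REMOVED_CHARACTERS.
_REMOVAL_TABLE = str.maketrans("", "", REMOVED_CHARACTERS)


def normalize(value: str) -> str:
    """Remove the listed characters and convert the string to lowercase."""
    return value.translate(_REMOVAL_TABLE).lower()


print(normalize("ACME  Trading\tGmbH\r\n"))  # acmetradinggmbh
```

After normalization, "ACME Trading GmbH" and "acme tradinggmbh" yield the same comparison value, so differences in spacing and capitalization no longer affect the comparison.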

Things to know

For master data matching to be as successful as possible, sufficient and well-maintained master data must be available. The result also depends on how well the document properties were recognized beforehand. The more properties are recognized and the more complete the master data is, the more likely a good result becomes.

Fuzzy matching

Some properties, such as the company name, may be spelled in different ways. To ensure that master data matching succeeds despite such differences, edoc master data matching service uses fuzzy matching: character strings are checked for their degree of similarity, and the more similar two strings are, the higher their degree of comparison. For fuzzy matching, edoc master data matching service uses the "FuzzySharp" library. Further information about the library is available at https://github.com/JakeBayer/FuzzySharp
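
To illustrate the idea of a degree of comparison, the following Python sketch uses difflib.SequenceMatcher from the standard library as a stand-in. It is not FuzzySharp and will not reproduce the service's scores; the function name degree_of_comparison and the 0 to 100 scale are assumptions made for this example.

```python
from difflib import SequenceMatcher


def degree_of_comparison(left: str, right: str) -> float:
    """Return a similarity score from 0 (nothing in common) to 100 (identical).

    Stand-in for the fuzzy matching step; the scale is an assumption.
    """
    return 100.0 * SequenceMatcher(None, left, right).ratio()


# Comparison values are assumed to be normalized already.
print(degree_of_comparison("acmetradinggmbh", "acmetradinggmbh"))  # 100.0
print(degree_of_comparison("acmetradinggmbh", "acmetradingag"))
print(degree_of_comparison("acmetradinggmbh", "globexcorp"))
```

The second call scores higher than the third, because "acmetradingag" shares far more characters in sequence with the reference value than "globexcorp" does.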

Manhattan metric

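The Manhattan metric is a mathematical method that makes complex structures comparable. In general, it measures the distance between two points as the sum of the absolute differences of their coordinates, so a structure with many components is reduced to a single value.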

For example, a company is described by various properties such as name, bank details, addresses and tax information. The information from an invoice is compared with the master data using fuzzy matching, and the degree of comparison is calculated for each property. Some properties, such as the street name, can differ greatly in their spelling. To ensure that master data matching still succeeds, each property is given a weighting: unique properties such as the IBAN are weighted higher than properties such as a street name, which may be spelled in different ways.

For each property, the degree of comparison is multiplied by the weighting. The resulting values for all properties of a company or vendor are then added together. The higher the result, the more likely it is that the master data is correctly linked to a document.
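
The weighted sum described above can be sketched as follows. The property names, the weights and the function name weighted_score are hypothetical values chosen for this example; the weights actually used by the service are not stated here.

```python
from typing import Mapping

# Hypothetical weights: unique properties such as the IBAN count more
# than properties such as the street name.
WEIGHTS = {"iban": 5.0, "name": 3.0, "street": 1.0}


def weighted_score(degrees: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Multiply each degree of comparison by its weight and add the products."""
    return sum(weights[prop] * degree for prop, degree in degrees.items())


# Degrees of comparison for one vendor (illustrative values on a 0-100 scale).
degrees = {"iban": 100.0, "name": 90.0, "street": 40.0}
print(weighted_score(degrees, WEIGHTS))  # 810.0
```

In this example the poorly matching street name lowers the result only slightly, because the exact IBAN match carries five times its weight.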

The result must exceed a certain threshold value. The properties of the company or vendor with the highest score are written to the properties of the document.
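
The selection step might look like the following sketch. It assumes that only the highest-scoring company or vendor is checked against the threshold; the threshold value, the vendor identifiers and the function name best_match are hypothetical.

```python
from typing import Mapping, Optional

THRESHOLD = 600.0  # hypothetical threshold value


def best_match(scores: Mapping[str, float], threshold: float) -> Optional[str]:
    """Return the highest-scoring candidate, or None if no candidate
    has a score that exceeds the threshold."""
    if not scores:
        return None
    candidate = max(scores, key=scores.get)
    return candidate if scores[candidate] > threshold else None


# Weighted results per vendor (illustrative values).
scores = {"vendor-1001": 810.0, "vendor-1002": 420.0}
print(best_match(scores, THRESHOLD))  # vendor-1001
```

If no vendor exceeds the threshold, the function returns None, which corresponds to leaving the document properties unchanged in this sketch.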