August 14, 2019

AI and Machine Learning in Data Management: Patterns, Metapatterns, and Cheap Gas

The ibi Team

Topic: Data

Last updated: July 28th, 2020

We know that artificial intelligence (AI) and machine learning (ML) are important in certain use cases, such as seeing patterns in a mountain of data (petabytes, even zettabytes) that would be extremely hard for us to see. Finding these patterns in such volumes of data would take humans an incredible amount of time, if it were possible at all. It’s simply too high a cognitive load for humans to handle in a timely, consistent, accurate way.

How do those use cases apply to data quality (DQ) and master data management (MDM)?

When we first apply AI to DQ, we can use it to detect patterns in the data that we correct. An AI is trained by seeing how we fix different kinds of data.

Over time, however, it can also detect patterns in the kinds of corrections we make. In other words, an AI can see patterns in the patterns of data we correct: metapatterns, if you will. Consider spelling, for instance. The last name “McLauglin” just looks weird to us (depending on cultural context). We’d expect “au” to be followed by “gh,” not just “g.” Meanwhile, “Jon” and “John” are both correct, despite the dropped “h,” and even though the former is less common. An AI can, theoretically, learn enough about spelling to surface strange combinations to us, and sometimes even to silently correct errors.
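To make the spelling example concrete, here is a minimal sketch of one way a system might surface “strange combinations”: it counts character trigrams in names it has already seen and flags trigrams it has never encountered. This is only an illustration, not a description of any particular product. The function names and the tiny `KNOWN_NAMES` list are assumptions made up for the example, and a real system would learn from far more data.

```python
from collections import Counter

def trigrams(name):
    """Split a name into overlapping 3-character chunks, with start/end markers."""
    padded = f"^{name.lower()}$"
    return [padded[i:i + 3] for i in range(len(padded) - 2)]

def train(names):
    """Count how often each trigram appears in names we already trust."""
    counts = Counter()
    for name in names:
        counts.update(trigrams(name))
    return counts

def unseen_trigrams(name, counts):
    """Return the trigrams in `name` that never occurred in the training names."""
    return [t for t in trigrams(name) if counts[t] == 0]

# Hypothetical training data, invented for this sketch.
KNOWN_NAMES = ["McLaughlin", "Laughlin", "Vaughn", "McLean", "John", "Jon", "Jonas"]
counts = train(KNOWN_NAMES)

for candidate in ["McLauglin", "Jon", "John"]:
    odd = unseen_trigrams(candidate, counts)
    status = f"surface for review {odd}" if odd else "looks familiar"
    print(f"{candidate}: {status}")
```

With this toy data, “McLauglin” is surfaced because “ugl” and “gli” were never seen, while “Jon” and “John” both pass. Whether a flagged name is corrected silently or shown to a person is a separate decision.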

The successful use of AI with DQ is a matter of understanding what we want from it.

When data quality software sees patterns, it can surface them to us to make sure that specific things we’ve previously wanted to correct are being corrected, or even correct them on its own. In other words, it identifies, and perhaps corrects, errors.

When DQ starts to see metapatterns, it can surface things to us that it has never seen before, and that we may never have seen before. In other words, it identifies potential new rules for us.

Note that these patterns and metapatterns don’t have to be limited to a single field or record. In fact, it’s critically important that the AI takes in more data and detects relationships among fields as they relate to data quality. There’s so much more that can be understood if it does.

For example, we may correct first and last names at a much lower rate when the last name starts with “al ”, “al-”, “abd”, and so on, because the people doing the correcting aren’t confident about Arabic spelling. We can’t assume the data entry is better for those names (in fact, it might be worse), so noticing this pattern might make us want to hire a few Arabic language speakers to augment our capabilities there, which will help our AI learn, too.
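A metapattern like this can be checked with very little machinery. The sketch below compares how often records were corrected for last names with the prefixes mentioned above versus all other last names. The sample records, the group labels, and the function names are hypothetical and exist only to show the calculation.

```python
from collections import defaultdict

# Prefixes taken from the example in the text; the list is illustrative, not exhaustive.
PREFIXES = ("al ", "al-", "abd")

def prefix_group(last_name):
    """Label a last name by whether it starts with one of the listed prefixes."""
    return "prefixed" if last_name.lower().startswith(PREFIXES) else "other"

def correction_rates(records):
    """records: iterable of (last_name, was_corrected) pairs from a correction log."""
    totals = defaultdict(int)
    corrected = defaultdict(int)
    for last_name, was_corrected in records:
        group = prefix_group(last_name)
        totals[group] += 1
        corrected[group] += int(was_corrected)
    return {group: corrected[group] / totals[group] for group in totals}

# Hypothetical correction log, invented for this sketch.
SAMPLE = [
    ("Al-Masri", False), ("Abdullah", False), ("al Hakim", True), ("Al-Sayed", False),
    ("Smith", True), ("Jonson", True), ("Murphy", False), ("Nguyen", True),
]

rates = correction_rates(SAMPLE)
print(rates)
```

A large gap between the two rates does not say which group has cleaner data. It only signals that the correction process treats the groups differently, which is the cue for a human to investigate.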

We may also see sets of data that “look less strange” only in a given context, such as Irish names (“Siobhan” and “Maeve,” I’m looking at you) showing up in a historically Irish geographic area. It may be reasonable that those names have been left alone.

Similar effects can be seen when we attempt to perform identity matching. We need to make the best decisions about identifying and describing a given person or entity through the data we have in the various systems in which it resides – and anything that can reduce the amount of human cognitive load that it takes to do so, including all of the context across systems and cultures, is incredibly helpful.

The AI doesn’t understand human cultural differences. It just spots the patterns. As an AI learns these patterns and metapatterns, therefore, it needs a human to review them. In time, as the AI learns more, human intervention becomes less necessary.

This leads us to the various ways that AI, including ML, can be used on any problem.

  • Human-led, machine-assisted. This is where the human is making corrections or applying rules and the AI surfaces suggestions to the human (“augmented analytics”).
  • Machine-led, human-guided. This is where the machine is making corrections or applying rules and the AI surfaces exceptions or ambiguous cases to the human.
  • Machine-led, machine-guided. This is where the machine is making corrections and applying rules, and the self-learning of the machine provides a feedback loop with little to no day-to-day oversight from humans (no “micromanaging”), though the human is still expected to run quality control over the AI, just as a boss runs QC over an employee.

All three of those could happen at the same time with the same dataset: more difficult data (names) could be reviewed by people while simpler data (addresses) could be fully automated.
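One way to picture the three modes running side by side is a per-field routing table. The sketch below is a hedged illustration only: the field-to-mode mapping, the `phone` field, the confidence threshold, and the action names are all assumptions invented for the example, not features of any specific platform.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    HUMAN_LED = "human-led, machine-assisted"
    MACHINE_LED_HUMAN_GUIDED = "machine-led, human-guided"
    MACHINE_LED_MACHINE_GUIDED = "machine-led, machine-guided"

# Hypothetical assignment: harder fields stay closer to people.
FIELD_MODES = {
    "name": Mode.HUMAN_LED,
    "phone": Mode.MACHINE_LED_HUMAN_GUIDED,
    "address": Mode.MACHINE_LED_MACHINE_GUIDED,
}

@dataclass
class Suggestion:
    field: str
    original: str
    proposed: str
    confidence: float  # model's confidence in the proposed fix, 0.0 to 1.0

def route(suggestion, review_threshold=0.9):
    """Decide what happens to a proposed correction, based on the field's mode."""
    # Unknown fields default to the most conservative mode.
    mode = FIELD_MODES.get(suggestion.field, Mode.HUMAN_LED)
    if mode is Mode.HUMAN_LED:
        return "suggest_to_human"
    if mode is Mode.MACHINE_LED_HUMAN_GUIDED:
        # Apply confident fixes; send exceptions and ambiguous cases to a person.
        if suggestion.confidence >= review_threshold:
            return "auto_apply"
        return "escalate_to_human"
    # Fully automated, but logged so a human can still run QC afterwards.
    return "auto_apply_and_log"

print(route(Suggestion("name", "Jon", "John", 0.99)))
print(route(Suggestion("phone", "555 0100", "555-0100", 0.60)))
print(route(Suggestion("address", "12 Main Stret", "12 Main Street", 0.97)))
```

Moving a field from one mode to the next then becomes a one-line configuration change, made once review of the logged results shows the machine can be trusted with it.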

The ultimate goal would be to get to machine-led, machine-guided for more trivial tasks and machine-led, human-guided for more difficult ones. This is not necessarily a short journey, and it requires ongoing human review to demonstrate the effectiveness of the program and ensure alignment with corporate goals.

Now, these ideas have been around for a long time. One might reasonably wonder why we think we can make progress now, when these promises have been made before. I recall similar discussions around expert systems in the late ’80s and early ’90s.

I think the difference is “cheap gas.” Henry Ford started to change the world with his assembly line, but people didn’t drive all over the country until gas was available and affordable enough to make it easy for ordinary people to drive his cars. In the same way, the methods and ideas behind AI and ML have been around for decades, but only now do we have what is needed to make AI available and affordable: ubiquitous, inexpensive, instantly expandable computing power (think Big Data and cloud), large data sets for model training, and easy access to open-source machine learning tools.

Our Omni-Gen data management platform is designed with flexibility and extensibility in mind. Stay tuned for more information on how data management is advancing at Information Builders.