Breaking Down Data Silos: AI and ML in Master Data Management – YouTube Dictation Transcript & Vocabulary
Welcome to FluentDictation, the best YouTube dictation website for English practice. Master this C1-level video with an interactive transcript and shadowing tools. We've split "Breaking Down Data Silos: AI and ML in Master Data Management" into short segments, perfect for dictation practice and pronunciation improvement. Read the annotated transcript, learn key vocabulary, and improve your listening comprehension.
Join the thousands of learners using our YouTube dictation tool to improve their English listening and writing skills.

Interactive Transcript & Highlights
1. [Music] Hello, and welcome to the Data Engineering Podcast, the show about modern data management. It's 2024: why are we still doing data migrations by hand? Teams spend months, sometimes years, manually converting queries and data, burning resources and crushing morale. Datafold's AI-powered migration agent brings migrations into the modern era. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to 10 times faster than manual approaches, and they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today to learn how Datafold can automate your migration and ensure source-to-target parity. Your host is Tobias Macey, and today I'm interviewing Dan Bruckner about the application of ML and AI techniques to the challenge of reconciling data at the scale of business. So, Dan, can you start by introducing yourself? Yeah, thanks, Tobias, it's a pleasure to be here. I'm Dan Bruckner, a co-founder and CTO at Tamr. I've been solving problems in this space for, I don't know, going on 15 years now, and we build solutions for master data management, using AI and machine learning to simplify MDM projects and make them successful. And do you remember how you first got started working in data? Yeah, it goes way back. I actually started out as a physicist, and my first job out of college was working at CERN on the LHC. It was in the days before the LHC had actually started up and gotten going, and so most of what I did was actually write code and solve computational problems. In those days we were doing analysis over large volumes of simulated data, trying to model the system and get a handle on our expectations for what was going to happen. So I did that, and as I was doing it, with the system not yet running, I kind of got more interested in the computational problems that I was
working on and the code that I was writing. So when I got back to the States, I decided to pivot and move into computer science. I started programming, then got interested in computer science research, and because of my background I naturally gravitated toward data: large-scale data processing and database systems. Eventually I started working with Mike Stonebraker at MIT on research into large-scale data integration, approached holistically and using machine learning techniques, applying those techniques in ways that scale to extremely large volumes of data. And before we get too much into the application of ML techniques to that challenge of processing data, reconciling it, and getting it into a usable state, I'm wondering if you can start by giving a bit of an overview of some of the different ways that data at the organizational scale becomes unwieldy, and some of the challenges that arise from that lack of reconciliation. Yeah, it's a class of problem that I think is very common, taken for granted, and also not necessarily deeply understood. I like to start from an analogy to software engineering and Conway's law. Are you familiar with Conway's law? I am: that the software design will eventually reflect the organizational communication patterns, for better or worse. Yeah, that's exactly right. So the structure of your organization dictates the structure of your software architecture, and the same is true to a large extent in data and data management. The structure of data within a large organization is naturally going to reflect the structure of the teams, the groups, and the divisions that created that data. That can be a very good thing: it means individual teams can operate naturally and independently and use the data that they need to be successful and do what needs doing. But it also creates big challenges and missed opportunities when you start to move up a level and want to reason
about, change, and ask questions of the data across the whole organization. Different teams are speaking fundamentally different languages; they often have redundant, duplicated data, and it can be very hard to actually use that data to communicate and to make high-level decisions within the org. From a very nuts-and-bolts perspective, what kind of issues are we talking about? As basically a database guy, the kinds of problems we're interested in are fuzzy unions (putting together schemas across different databases), fuzzy joins, and fuzzy group-bys. So essentially, cases where you would like to treat large sets of data as a coherent whole, a single database, but you don't have the keys, you don't have the common attributes, you don't have the common identifiers, and so you're not actually able to just directly go and ask the questions you want to ask. First you have this problem of just mechanically getting all the data together, linking it up, getting a coherent picture that you can go query and use for applications, analytics, whatever it is you're trying to accomplish. And given the reflection of Conway's law in that data ecosystem for the business, what are some of the attributes of either scale or team dynamics that you see being the biggest contributors to that messiness and that lack of cohesion that brings out these problems? Yeah, depending on the scale of the organization there can be many, but frequently the most common case is this: data sets come from applications, they come from processes, whether software-based or not, that are well established and that are designed not primarily to create data but to solve some problem for the business. So sales, marketing, these kinds of basic things that companies do. As a side effect, they produce these piles of data, and then the teams that work with
those processes and applications are very vested in the way that things work. If you have other teams come in, data teams most frequently, to do analytics, to look across different groups and different parts of the org, there's a natural conflict that arises: "Well, we would like it all to look this way; we think this would solve the problem for the whole organization better," and the teams say, "No, that's not how we operate, we can't do that. You can't just come in and change our process, change our data. We've been doing this in this way forever." The problem gets worse the larger the organization gets, and especially for companies that grow through acquisitions and mergers, you start bringing in data that's arisen not just from different teams but from completely different organizations. You start trying to put it together and consolidate, and those kinds of small inconsistencies can really start to undermine the process of finding a good way to operate and put all the data together coherently. And so that process of reconciling data, bringing it together in a way that makes organizational sense so that you can start to ask those questions across the business, is largely called master data management, or building golden records. I'm wondering if you can talk to some of the typical approaches that teams and organizations try to take to actually embark upon that process of building those master records and reconciling that data, and some of the scaling challenges they run into, whether that's scaling at the compute level or scaling in terms of time, effort, and human capacity. Yeah, that's a good question, and a big question. So, breaking that down a bit: master data management really does cover the heart of this problem of linking different data sets together. There are a number of stages in a successful master data management effort that you have to move through
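The fuzzy joins and fuzzy group-bys mentioned earlier can be sketched concretely. This is a toy illustration, not Tamr's implementation: it links rows from two tables that share no key by comparing normalized name strings with Python's standard-library `difflib` (the company names and the 0.6 threshold are invented for the example):

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized edit-based similarity between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()

def fuzzy_join(left, right, key, threshold=0.6):
    """Pair up rows whose key values are similar enough, in the absence
    of a shared identifier. Naive all-pairs comparison: fine for a demo,
    but quadratic at scale, a problem discussed later in the episode."""
    return [
        (l, r)
        for l in left
        for r in right
        if similarity(l[key], r[key]) >= threshold
    ]

crm = [{"name": "Acme Corporation"}, {"name": "Globex Inc"}]
erp = [{"name": "ACME Corp."}, {"name": "Initech"}]

for l, r in fuzzy_join(crm, erp, "name"):
    print(l["name"], "<->", r["name"])  # → Acme Corporation <-> ACME Corp.
```

In practice the similarity function and the threshold are exactly the knobs that rules, trained models, and human feedback end up tuning.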
One stage even starts just ahead of getting into master data, which is just physically getting the data together and treating the data quality problem: essentially getting to a common level of quality, often pulling in third-party source data and reference data to enrich it, and getting your base to a good spot. Then, OK, you have a set of sources, different data tables and database systems; you put them in one physical place, and now you want to link them together. You want to create the point of reference across common records and solve that linkage problem, the entity resolution problem. Once you've done that, great, now we have a common identifier that we can use. You're now going to draw in all the data from these systems and attempt to consolidate it and produce golden records. So now you have an identifier that links source data, and for each identifier you have a kind of golden record: this is the best, this is the truth about this customer or this supplier or this part in our organization. So you produce that record, and now you want to start to manage it over time. As you go further, you're going to want to push more of that out to the source systems themselves and to downstream applications and analytic engines, so essentially solve this problem of the coexistence of master data on one hand and all these operational and analytical data sets that exist everywhere in the organization on the other. The physical problem of linking those things together and keeping them consistent becomes a big challenge as you start to operationalize the master data. And in different scenarios and different use cases, different folks will focus on different parts of this journey through master data management. Maybe some projects only require getting that identifier: throw the data together, get the identifier, great,
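The consolidation step just described, producing a golden record from a cluster of matched source records, can be sketched with one deliberately simple survivorship policy: majority vote per attribute. The records and the policy are invented for illustration; real systems also weigh source trust, recency, and completeness:

```python
from collections import Counter

def golden_record(cluster):
    """Collapse matched records sharing one identifier into a single
    'best' record: the most frequent non-empty value wins per attribute."""
    attrs = {key for record in cluster for key in record}
    golden = {}
    for attr in sorted(attrs):
        values = [r[attr] for r in cluster if r.get(attr)]
        if values:
            golden[attr] = Counter(values).most_common(1)[0][0]
    return golden

cluster = [  # three source systems describing the same supplier
    {"name": "Acme Corp", "city": "Boston"},
    {"name": "ACME Corporation", "city": "Boston", "phone": "555-0100"},
    {"name": "Acme Corp", "city": "Cambridge"},
]
print(golden_record(cluster))
# → {'city': 'Boston', 'name': 'Acme Corp', 'phone': '555-0100'}
```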
we can go, that's all we needed, we can run with it. Maybe you're just doing some analytics, so you do that as a one-off: every quarter you produce a report, so we refresh our data, we get this high level of integrity with our master data, we generate our report, we're good to go. As you move farther along and want to actually take that data, operationalize it, use it on an ongoing basis, and keep it fresh constantly, so that as new data comes into operational systems it's immediately mastered, immediately incorporated with the master data, you have to go farther in this journey of pulling together and closely integrating your master data system with your operational database systems and other applications. And the canonical example that's often brought to bear in this context is the idea of the customer record, where you have: this is our customer, these are all the attributes about them. And then there's the challenge of, well, which system is the one that we actually trust the most to collect that information accurately, or different systems collect different pieces of information. There's also the challenge that, when you're dealing with people, they change locations, so you have to make sure that you have the current address, but you also want to know their old addresses, and so then you have the issue of historizing that data. And this applies across other business objects beyond just your customers. I'm wondering if you can talk to some of the people problems of figuring out what those decision points are, the ways we determine which place we actually trust the most for which pieces of that data, and then actually managing the merging of those attributes from the multiple systems, to be able to say: this is the thing that we trust the most; that other system over there has different information, so we're going to ignore it, or we actually need to use it in that system, but over here we're going to use this. I'm
just wondering about some of the ways that organizations have to wrestle with that kind of constant decision making about what data to use where, when, and how. Yeah, I think what you're picking up on is a really key thing about master data management as a problem space: it's not just a technical problem. If it were just a technical problem, putting data together and creating a coherent knowledge graph, we know how to do that. In real organizations it's also a political problem. So you're not just trying to get the data to agree, you're actually trying to get these different teams to agree and coexist, each with their own special view of the data, because the reason the data silos were created in the first place was all of these teams operating independently and efficiently. In pulling together those silos, you need to make sure that you don't actually interfere with the independent, happy, trustful operation of everyone who created them. And what it comes down to is solving the master data management problem less from a dictatorial "we will come up with the one standard that will work for everybody" kind of approach, and more by creating a repository for the linkage, a system and a common touch point for all of these different silos and applications to touch base and stay closely linked in a clean way. One of our early customers was a very large manufacturer. When we started working with them, they essentially said: "OK, here's our history in master data management. This is a company with many lines of business and different divisions. We have 26 different major ERP systems (we have more, but the long tail is too much to worry about). All of our parts and all of our suppliers exist across all these 26 systems. We've had several efforts at master data management, and what happens is we go in, we pick some of
these systems, the largest, most popular ones that we think are the most trustworthy, we collect the data, we consolidate it, we create this master, we have a new identifier, and at the end no one wants to use it. We now have 27 systems for all of our supplier data and all of our parts data." And so if you go and do the technical work but don't do it in a way that meets the consumers of the data where they are, then the project can be a failure and essentially just make the problem worse. So it's really critical to find the way to not just create a standard but create a system that bridges the gap between all these different consumers, and does it in a scalable way. If you take three of 20 systems and say, "This is how we're consolidating," well, what about the 17 other teams? Their data is gone now; they have no frame of reference. So you need an approach that can scale to handle the whole problem. The other interesting piece of this is that business intelligence and data warehousing have existed in some fashion for at least the past 30 years, give or take, and so you'd think that, given that time span, this is a problem that would have been solved at least reasonably well by now. And yet even today it's still a challenge that organizations are tackling and starting new projects on, today, tomorrow, next week. I'm wondering what some of the evolutionary aspects of the problem are that lead us to keep revisiting it and keep resolving it across organization after organization, rather than it being a well-established, well-understood, more or less solved problem. Yeah, it's a good question, and I'd say master data management is going on about three decades old now, so companies have been building systems to solve this problem for a while, and the traditional
systems tend to focus on using sets of rules and strict data models to put together data from source systems, and they tend to focus more on the operational side. You do some basic data quality, you do some basic data integration, but fundamentally you get your set of golden records and then, OK, put that in a database and let's go use it. They tend to focus on supporting applications downstream, but they don't necessarily do a great job of pulling in lots of data from the organization, linking it together coherently, cleaning it up, enriching it, making sure that the master data itself is actually the best view there is of the data, has the highest possible quality, and has all of this linkage across the organization. So currently what's happening is that the application of AI and machine learning to these problems actually unlocks much better solutions and the ability to tackle this problem much more holistically, and to do it in a much higher fidelity way than has happened traditionally. So to that point of the application of ML and AI in this ecosystem: machine learning in various forms has been used with various levels of success in this context. You mentioned rules-based systems; that's maybe the expert systems era of AI, which we have largely moved past, and then there have been a lot of different natural language processing techniques used for trying to do some of that entity extraction and entity resolution. I'm wondering if you can talk to some of the evolutionary aspects of the application of ML and AI to the problem of master data. Yeah, absolutely. So you're exactly right to start with the rules, because I don't want to say that rules are the wrong way to solve this problem; they're actually very good for the right use case. But the fundamental nature of dealing with dirty data is, I think it's
Chekhov, it's like a Chekhov quotation, that all happy families are happy in the same way, but all unhappy families are unhappy in very different ways. It's true of data too: all bad data is bad in a different way, and so you need a lot of tools in your toolkit. Traditionally, rules were the approach: if you come up with a good data model, if we just model the problem well enough, if we model how a customer looks, a model and schema that's good enough, then we can put all customer data into that schema and define rules for how it should work. The reality is that data can mean lots of subtly different things and be used in subtly different ways, and it pretty much always is, so you have to always be ready to account for these slight differences in granularity or slight differences in shade of meaning. So essentially, beyond rules, fuzzy matching becomes big, and that starts to lead into natural language processing techniques, and especially techniques from information retrieval. Applying scalable methods from text search goes a very long way in dealing with fuzziness and solving fuzzy matching problems. Beyond that, you start to get into statistical techniques and traditional machine learning: building models to classify matches between data, to classify groups and taxonomies of data, and to look for different characteristics of the data to perform reconciliation and consolidation. And then, once you've entered this statistics and machine learning world, the sky's the limit. Essentially, techniques from 30 years ago in information retrieval are great, but you can move all the way up through to what we have today with large language models and generative AI, and apply that to the problem as well. Large language models and generative AI have definitely upended the overall
landscape of ML in recent years, where they have to some degree become synonymous with AI, even though that's not technically accurate. I'm curious whether you see those capabilities as being a transformative shift in this space of master data management, record reconciliation, and entity extraction, or if it's largely an iterative step: maybe a large iteration, but not a wholly transformative piece, just a step-change improvement in what we already had. Yeah, I think this is sort of a lame answer, but I think it's a little of both. There are key ways, in terms of how you match records or enrich data or classify data or parse data, where applying language models adds another really valuable tool to the toolkit. I wouldn't say completely... well, actually, there are some scenarios where it does completely let you throw away a lot of techniques of the past. In schema mapping, for example, large language models without much training are very good at: I give you two tables, tell me how to align them from a schema perspective. So for some problems at a small scale, it just blows everything out of the water. For larger scale problems, LLMs can give you a lot, and can give you a lot of subtlety that would be very difficult using traditional techniques. For example, language model embeddings are extremely good at capturing things like synonyms, synonymous meanings across different terms, and abbreviations, without having to build lookup tables and additional artifacts on the side. You get a lot of that, the richness of the meaning in the language and how language works, for free. Except it's not actually for free: there's a cost in terms of compute, and so it occupies an interesting spot in the trade-off space, where if you can figure out how to use it in a cost-effective way alongside other cheaper, more scalable techniques, then you can get a tremendous amount of value. I
think where it's really transformative is in creating more natural user experiences for actually working with the data and solving these problems for end users. One of the challenges that we at Tamr have wrestled with, and learned a ton about and gotten better and better at over our 12- or 13-year existence, is the problem of taking complex data problems and complex machine learning concepts and encapsulating them in a simple user interface, making them understandable to end users who are not PhDs in machine learning, AI, and statistics. And LLMs can actually come in and explain complex scenarios in straightforward ways. A very common situation for us: our system is doing some record matching and consolidation and presents the user with a maybe ambiguous case. We've clustered data from 40 different systems; here are 40 different records describing what we think is the same actual person, one of your customers, and what do you think: are these all in fact the same? Looking at a table with 40 records and a lot of columns is kind of an overwhelming experience, and it can be difficult for a human to parse that and decide: do they agree with the machine learning or not, did it hallucinate this? What a language model can do is give you ways to summarize what's on screen, very complex concepts, and draw the user's attention: "Hey, look, all 40 records here share the same last name. For the first name there are only three different values, and this is a common nickname, so it looks legit. There are a few different addresses, but it seems like maybe they moved over time." And we can put that in context. We also know what the model on the back end was that produced this suggestion; we can provide that context
to a language model as well, and it can explain: well, the reason these got pulled together was the weightings. Let's say this is a B2C customer mastering data product, and in this data product the model looks for and puts a lot of weight on commonality in the name and the address and the phone number. So you give users a way to engage with a hard problem, a very niche problem, but in a way that's easy to understand and accessible. And why this is so important is trust. One of the interesting things since the launch of ChatGPT and the rise of AI and large language models is that it has brought this question of trust, can you trust the robots, into the mainstream. We've been dealing with this problem, we've been a machine learning company since the beginning, and I've always dealt with the fact that everyone loves to be smarter than the computer, than the machine. So there's always a questioning of: I see the model is suggesting this; I want to make sure the model isn't doing something crazy. With large language models now being mainstream, people know that these things hallucinate; they come up with nonsense if you push them a little too far. So there's this real questioning of: if you're using machine learning and AI techniques to make this data good, can I trust it, is it real? Being able to take our results, the results of the modeling, and communicate them in a way that puts the data front and center but also provides the context is really important for end users to feel like: yeah, this is actually solving the problem, this is real, this is data I can trust, this is the truth about our customers, and we should adopt this and share this. And so it's kind of ironic that the technology that's leading to this crisis of trust can also be a big part of the
solution, framing it that way. But I think, as we move past the initial shock of artificial intelligence, that's what we're moving into. As a listener of the Data Engineering Podcast, you clearly care about data and how it affects your organization and the world. For even more perspective on the ways that data impacts everything around us, don't miss Data Citizens Dialogues, the forward-thinking podcast brought to you by Collibra. You'll get further insights from industry leaders and innovators in the world's largest companies on the topics that are top of mind for everyone. In every episode of Data Citizens Dialogues, industry leaders unpack data's impact on the world, like in their episode "The Secret Sauce Behind McDonald's Data Strategy," which digs into how AI-driven tools can be used to support crew efficiency and customer interactions. In particular, I appreciate the ability to hear about the challenges that enterprise-scale businesses are tackling in this fast-moving field. The Data Citizens Dialogues podcast is bringing the data conversation to you, so start listening now. Follow Data Citizens Dialogues on Apple, Spotify, YouTube, or wherever you get your podcasts. In that space of large language models and their application to the problem of master data management, what are some of the pieces that are missing from the out-of-the-box perspective? As in: I have a large language model; now I actually have to stand up pieces X, Y, and Z before I can even really start to bring it to bear on the problem, particularly in that context of hallucination and trust building. I'm wondering how you have been working through that challenge of harnessing the capabilities while mitigating the inherent risks, in the problem of actually building trustworthy data with an inherently unpredictable utility. Yeah. So, well, let's see, I'm going to go back a little to start. When we got started at Tamr, our initial vision, our initial goal,
this was in the days when the semantic web was really hot and knowledge graphs were becoming a big deal, and our vision was: we want a product that can efficiently produce the enterprise knowledge graph. So within your organization you have this extremely high quality linked data describing everything you do, everything you care about, all the key entities: your customers, your suppliers, your employees, all the parts and products that you produce, this whole space. And so now, in this context where we are today with these large-scale models, that idea of a correct knowledge graph is still sort of the guiding principle. It's really this idea of truth for the enterprise, and to get real value out of large-scale artificial intelligence, you need to find ways and architectures to tie it back to that truth. It's very good at syntax and articulating ideas; we just need to do a good job of giving it the right content and the right context, depending on what problem it's solving. So there's an increased importance of data quality, data linkage, and master data management, to be able to produce these common data sets and maintain them, so that you can point your GPT at them and get really good, high quality, trusted answers from the AI. I imagine that the inclination for people who are thinking about bringing AI to bear on this problem is: oh well, AI is very sophisticated, it has all of these nifty capabilities, I should be able to just set it loose on my data, it'll solve all of my problems, and I don't really have to worry about it, maybe just click yes or no a couple of times. Versus what I imagine to be the reality: you actually want to use all of those manual and statistical techniques that we have been relying on and developing for the past 30 years to do maybe the 80% case, and use the LLM for that 20% case that
takes 80% of the time, to accelerate the process a bit. I'm curious how you are guiding organizations on that strategic aspect of how, where, when, and why to actually bring these language models to bear on the problem, in conjunction with all of the other techniques that have been developed and that we have established trust and confidence in. Yeah. That question has become central for everyone these days: am I comfortable using generative AI with my data? The previous version of this was: am I comfortable putting our most business-critical, important, high-value data assets on the cloud? That's now shifted; most organizations are comfortable with the cloud, but now it's: well, can machine learning look at it? What if, God forbid, someone trains a model on our data and shares that model? I think there's a certain amount of just feeling out the right level of security around these things, and I don't want to go into that too deeply. But just for the purposes of solving problems in this space, there are big opportunities to improve the quality of matching and mastering using these new models, but they need to be harnessed. There's been a lot of research over the last few years applying large language models to these fuzzy database problems: fuzzy joins, fuzzy group-bys, schema mapping. And what they basically find is that large language models, with very little configuration and engineering, perform as well as a lot of state-of-the-art techniques that existed previously. The challenge is putting those techniques into a system and into a product where they're used intelligently, in conjunction with other, lower cost techniques of more traditional machine learning, and also in conjunction with rules and human feedback. One of our founding principles at Tamr is that the machine is never right 100% of the time.
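The principle that the machine is never right 100% of the time usually shows up in practice as a triage on match confidence: auto-accept the easy pairs, auto-reject the obvious non-matches, and route the ambiguous middle to a human reviewer. A minimal sketch, with invented scores and thresholds:

```python
def triage(scored_pairs, accept=0.9, reject=0.3):
    """Split candidate matches by model confidence: the easy ends are
    decided automatically, the ambiguous middle goes to human review."""
    matched, review, rejected = [], [], []
    for pair, score in scored_pairs:
        if score >= accept:
            matched.append(pair)
        elif score <= reject:
            rejected.append(pair)
        else:
            review.append(pair)
    return matched, review, rejected

scored = [
    (("Acme Corp", "ACME Corporation"), 0.97),        # easy: auto-match
    (("Acme Corp", "Acme Industrial Supply"), 0.55),  # ambiguous: ask a person
    (("Acme Corp", "Initech"), 0.04),                 # easy: auto-reject
]
matched, review, rejected = triage(scored)
print(len(matched), len(review), len(rejected))  # → 1 1 1
```

The thresholds themselves are a tuning decision: tighten the band and humans see more pairs; widen it and you trade review effort for error risk.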
You need humans in the loop to be able to review, to address complex cases, and to assess how well things are going, so you need a system that incorporates all of these pillars: human input and rules-based input. And people love rules. If you just say a match means the social security number values are equal, everyone loves that; they understand what it means. Given the truth about data quality and how things exist in the real world, that rule might actually be wrong a large proportion of the time in a particular real case, but people love that when it's wrong, they sort of get why. Whereas if some machine learning model is wrong and they want to know why, it's like, well, let me show you this random forest decision tree and talk about that. So you need all of these things, and they all come together to create a coherent solution that doesn't have an absolutely overwhelming cost. One thing we haven't talked much about is that if you want to solve these problems at scale, it can become very expensive very quickly. By their nature these are all matching problems, and they're quadratic: naively, you would be comparing everything to everything else. So if you take a naive approach, you're going to burn a lot of compute, you're going to spend a lot of money, and you probably won't get the best results. You need a way to identify the easy parts of the problem and solve them easily, send the unsolvable parts to a human, and then handle everything in between. That's where we're seeing the biggest boosts from vector embeddings, large language models, and newer cutting-edge techniques: they can dig into those ambiguous cases and get a lot of value. Another complexity of this space, particularly when you're
first embarking on the process of trying to reconcile all of your organizational data, is the level of expertise in the process of master data management, as well as the level of familiarity with the data itself. The person who created the data, figured out what the schema should be, and decided what attributes to pick may not even be with the company anymore, so you don't have all of the context or all the information. And especially when you have an inexperienced team who's just starting on this process, and then you say, hey, here, rub some machine learning on it, it's magical, I'm wondering what some of the potential pitfalls are that you're setting them up for if they don't have an appropriate understanding of the actual capabilities and limitations of the techniques, so that they can be appropriately skeptical or appropriately confident where those apply. Yeah, that's touching back on the political problem of master data management. You have to get a lot of different kinds of people involved. The domain experts, the subject matter experts, are very rarely the people who own the project, essentially; they have to be drafted in and convinced to share their time to really make the project successful. So yeah, there does tend to be skepticism about what the system is doing, and if it's an AI-based system, the skepticism is increased. They'll look at the data and say, this doesn't make sense, why is it doing it this way? So you need a workflow that embraces that uncertainty to an extent, or makes users comfortable with the fact that the data is bad and we're not going to fix it all at once; it's going to be a process. Something interesting we learned over the years: one of our earlier products was very explicitly designed as a system for end users to train models to
master data. There were workflows for end users to come in, and we used active learning techniques to surface really high-value examples: users go in, they label the examples, and the model gets trained as quickly as possible, making this ML-practitioner experience available to non-ML experts. And that system, as easy as we made it, was still hard to use, and you still had to understand machine learning to some degree, so for subject matter experts it was a challenge, and there was a lot of handholding. We've since moved towards pre-trained models: we have general models that can apply to your domain and start at a very high level, and then you'll be tuning them. What we found when we first did that, once we no longer had this active feedback workflow, was that it actually damages trust with the end users. They want to give feedback to the machine; they want to have that back and forth, that conversation. That's really important for gaining trust in the system, with the system directing how you should be exploring the data, how you should be interacting with it, and how you should be understanding it. So we've brought that back: even though you're not actually training now, you can still go and interact with the system as if you're training it, and it's a positive user experience. And in your experience of building Tamer, working with organizations to address this challenge of master data management and incorporating ML and the newer generation of generative AI capabilities into that system, what are some of the most interesting or innovative or unexpected ways that you've seen ML and AI techniques used in that context of data resolution and master data management? Yeah, it's a good question. Our architecture is fairly generic, so
we primarily work with structured enterprise data and relational database systems, but we can extend the system to work with more complex data types. A number of years ago we added support for geo data and GeoJSON, essentially polygons, and the applications that users have for that are really quite interesting and surprising. Normally you think about master data management as applying to a fairly narrow subset of data: customers, organizations, people, parts, products. That's kind of the sweet spot. But with some of these other data formats that we support, we've seen it applied to things like radar tracks, keeping track of fuzzy data related to planes and other kinds of aerial phenomena, stuff like that. It really gets out there, and it's pretty cool to see. In your work of building these systems and coming to grips with the constantly evolving landscape of ML and AI techniques, what are some of the most interesting or unexpected or challenging lessons that you've learned in the process of building a product that harnesses that? A big challenge that we have frequently thought we'd solved, but where we continue to find better solutions every couple of years, as each solution seems to reveal another side of the problem, is finding how to connect the two modes of solving these sorts of problems. We talked earlier about how you can take some data as a snapshot, as a one-off project: put it all together, run some large-scale batch computation, and then you're done. But the data is not static. What happens when customers X, Y, and Z all come in and someone moved, their address changed? Maybe some customers passed away and you no longer want to be sending them offers in the mail. Maybe two companies merge, or
split up. Things happen in the real world, and you need to manage this on an ongoing basis. Many applications, I'd say most applications, in the real world aren't content with just a one-off solution; they need to solve the problem in a live, ongoing, continuously updated, real-time kind of fashion. We've found it's a major challenge marrying the extremely efficient, high-throughput, batch-oriented solutions to these problems with a more operational live system, a database, that solves it on an ongoing basis, sort of in real time or in an event-driven fashion. You essentially have to take the core of the system, all the techniques we've been talking about, rules-based matching, lessons from natural language processing and information retrieval, and the latest AI has to offer, and apply them in consistent ways across two extremely different architectures, marry those together, and do it in a way that end users can actually use and transition between. So you can come into the system, load a whole bunch of data, process it in a big way, create that initial starting point for what your master data should look like, and then, boom, you're up and running. It's in a live database now; you can interact with it directly, you can start to point other systems at it, and you can use it in a way where it's not a one-off, not some extract that's going to be irrelevant tomorrow and that no one's going to adopt, but something that becomes a living piece of the overall architecture within an organization. And for people who are addressing these challenges of master data management and data reconciliation, trying to figure out the crosscutting view of their business, what are the cases where ML and AI are the wrong choice for some or all of that problem? Yeah, good question. I think it comes down to the simplicity of the problem. Sometimes, you know, AI
and ML are bright and shiny, but for smaller-scale problems, maybe you're just putting together a couple of sources, or maybe you really have a one-off, you're trying to create this one-time presentation. If you don't have a lot of complexity in the problem, then deterministic techniques are likely to win. There's nothing wrong with rules, and applying rules can be much cheaper than applying AI and ML to the problem. So if you have a low-stakes scenario where you can just 80/20 it and get a good answer quickly, then yeah, go crazy with deterministic solutions. That should really be the first step of any approach: pick the low-hanging fruit, then get into the hard problem and make it really good. And as you continue to invest in and keep tabs on this evolving space of large language models and generative AI and its application to the challenge of data cleaning, what are some of the hopes or predictions that you have, or any specific techniques or areas of effort that you're keeping a close watch on? Yeah, I think the ability to build intelligent agents into existing user workflows is creating a big opportunity. The first wave of this AI rollout was, put a chatbot in it: you've got a product, put a chat button on it, and it's going to be amazing. There are some good applications for that, but I think what's really coming up next is looking at the problems where LLMs are extremely well suited, and then asking how you can apply those to actual key features within a product and deploy them, really starting to think of this as a capability we can productionize: how do we think about our product roadmap, where we build it, how we use it, and how we adopt it? And I think the upshot is that a lot of the challenging work, not in solving
the master data management problem, but in managing the system and the complexity of it, can be automated to a much larger extent. There's a lot of configuration that goes into pulling data from different systems, aligning all the schemas, figuring out how you want to enrich, how you want to apply data quality transformations, how you want to pull in third-party source data, just creating that model of what you want your master data to look like, starting from what all your source data looks like. There are big opportunities for LLMs to go in and simplify that, to give you a very straightforward, basic, wizard-like experience for setting up this extremely complex machine to process all this data in complex ways, and then to manage it over time. Putting agents into the system can take the hardest parts of the user experience and either automate them away or turn them into a delight for end users. So we're focused a lot on really simplifying that experience and making master data management something that isn't a scary thing that sounds doomed to fail and likely to be very expensive, but more like: no, this is something you need; if you're not doing this, you're crazy. All your data could be ten times better, and you won't be tearing your hair out to get there. I think that point, too, of figuring out what that common, cohesive schema is, what representation is going to be useful, applicable, and easy to integrate, is one of the challenges as well, and maybe the LLMs can help set that initial pass of, here is something it could look like. Because at either end of the spectrum, you have people who are unable to see the art of the possible because it looks too daunting, or you have people at the other end of the spectrum who ask for the impossible
because they think it's easy. Yeah, absolutely. LLMs are really good at translating, right? You can speak different languages, and they can act as an intermediary, and it just works somehow. There's a vision for a future here: what if you did master data management and there wasn't even a single master data model? What if everyone got to keep the model they wanted from the beginning, and there was an LLM in the middle intelligently translating across these things, so that everyone thinks they're speaking the same language, even though it's really a Tower of Babel situation? That's kind of the promise here, and I think it's a big opportunity. There's still a lot of challenging engineering and product development to get there, but that's where we're headed. Are there any other aspects of this overall space of master data management, and the application of ML and AI to its execution and implementation, that we didn't discuss yet that you'd like to cover before we close out the show? I feel like I had something more to say about third-party data, but honestly, I think we might be good. Fair enough. Well, for anybody who wants to get in touch with you and follow along with the work that you and your team are doing, I'll have you add your preferred contact information to the show notes. And as the final question, I'd like to get your perspective on what you see as the biggest gap in the tooling or technology that's available for data management today. I think it feels like it comes back, somehow, to a location challenge. Maybe I'm just thinking about this because it's the problem I've been dealing with lately, but I feel like we haven't really solved the cross-cloud problem. There are really good systems on different clouds, and they don't
translate one to one, so there's a lot of essential technology that's locked up in different proprietary walled gardens. It's now very easy to build extremely powerful, cutting-edge data architectures for managing your data, but you have to make some pretty big decisions at the outset, and some pretty big bets on vendors and who you trust in the market, and it's gotten a lot harder to remain independent. On the other hand, it's also easier to remain independent: there are a lot of amazing tools for breaking up the relational database into its component parts and using independent systems to put it back together. At the same time, there are a lot of these amazing tools in the open source world, but it's difficult for those worlds to collide and to put it all together into a coherent approach. And I feel like there's a little bit too much satisfaction among folks thinking that if you put all the data into a single physical place, all of your problems are solved, when really you're just kicking a bunch of problems down the road for ten years, until you get sick of your vendor and need to go do something dramatically different. No, the data gravity problem is definitely real, and until we are able to circumvent physics it won't go away. Yeah. All right, well, thank you very much for taking the time today to join me and share your thoughts and experience on building these master data management workflows, bringing ML and AI to bear, and some of the ways that the current generation of LLMs and generative AI are adding new capabilities and techniques to that process. I appreciate all the time and energy that you and your team are putting into making that more accessible and easier to apply to this challenge, and I hope you enjoy the rest of your day. Thank you, yeah,
thanks so much for having me. This has been [Music] fantastic. Thank you for listening, and don't forget to check out our other shows: Podcast.__init__
2. It covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and co-workers. [Music]
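To make the interview's point about quadratic matching concrete, here is a minimal Python sketch of blocking, the standard trick for avoiding all-pairs comparison; the records and the choice of blocking key are invented for the example and are not from any system discussed above.

```python
from collections import defaultdict
from itertools import combinations

def naive_pairs(records):
    """Naive approach: compare every record to every other -- O(n^2)."""
    return list(combinations(range(len(records)), 2))

def blocked_pairs(records, key):
    """Only compare records that share a cheap blocking key,
    e.g. the first three letters of the name."""
    blocks = defaultdict(list)
    for i, rec in enumerate(records):
        blocks[key(rec)].append(i)
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(ids, 2))
    return pairs

records = [
    {"name": "Tamer Inc"},
    {"name": "Tamer, Inc."},
    {"name": "Acme Corp"},
    {"name": "ACME Corporation"},
]
key = lambda r: r["name"][:3].lower()
print(len(naive_pairs(records)))        # 6 candidate pairs
print(len(blocked_pairs(records, key))) # 2 candidate pairs
```

In practice the blocking key would be chosen, or learned, so that true matches rarely land in different blocks, since a pair split across blocks is never compared at all.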
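The "rules first, model for the ambiguous middle" idea that recurs in the conversation can be sketched as a simple cascade; the field names, the threshold, and the toy `fuzzy_model` below are all hypothetical stand-ins, not Tamer's actual matching logic.

```python
def match(rec_a, rec_b, fuzzy_model, threshold=0.8):
    """Cascade: cheap deterministic rules first, and the more expensive
    learned model only for the cases the rules cannot settle."""
    # Rule: identical tax IDs mean a match (the kind of rule users love).
    if rec_a.get("tax_id") and rec_a["tax_id"] == rec_b.get("tax_id"):
        return True
    # Rule: records from different countries are never a match.
    if rec_a.get("country") != rec_b.get("country"):
        return False
    # Everything in between goes to the learned model.
    return fuzzy_model(rec_a, rec_b) > threshold

# Toy stand-in for a trained similarity model.
fuzzy = lambda a, b: 0.9 if a["name"][:4].lower() == b["name"][:4].lower() else 0.1

a = {"tax_id": "12-345", "country": "US", "name": "Acme Corp"}
b = {"tax_id": "12-345", "country": "US", "name": "ACME Corporation"}
c = {"tax_id": "99-999", "country": "DE", "name": "Acme Corp"}
print(match(a, b, fuzzy))  # True: the tax-ID rule fires
print(match(a, c, fuzzy))  # False: the country rule fires
```

As the interview notes, such rules will sometimes be wrong, but when they are, users can at least see why, which is much harder with an opaque model.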
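The active-learning workflow described for the earlier product, surfacing the examples a human label would help most, can be approximated with uncertainty sampling; the pair IDs and match probabilities below are made up for illustration.

```python
def most_uncertain(pairs, score):
    """Order candidate pairs so the ones whose match probability is
    closest to 0.5 come first: these are the labels that teach the
    model the most."""
    return sorted(pairs, key=lambda p: abs(score(p) - 0.5))

# Toy match probabilities standing in for a trained model's output.
scores = {("a", "b"): 0.98, ("a", "c"): 0.51, ("b", "c"): 0.07}
queue = most_uncertain(list(scores), scores.get)
print(queue)  # the ambiguous ("a", "c") pair is surfaced first
```

Confident pairs (0.98, 0.07) sink to the back of the queue, so reviewer time is spent only where a label actually changes the model's behavior.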
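For the geo/GeoJSON use case, matching polygon records usually starts with a cheap geometric filter before any expensive comparison; this bounding-box overlap check is a generic sketch with invented coordinates, not Tamer's actual implementation.

```python
def bbox(points):
    """Axis-aligned bounding box (min_x, min_y, max_x, max_y) of a polygon."""
    xs, ys = zip(*points)
    return min(xs), min(ys), max(xs), max(ys)

def boxes_overlap(a, b):
    """Cheap first-pass filter: two polygons can only match if their
    bounding boxes intersect."""
    ax0, ay0, ax1, ay1 = bbox(a)
    bx0, by0, bx1, by1 = bbox(b)
    return ax0 <= bx1 and bx0 <= ax1 and ay0 <= by1 and by0 <= ay1

track_a = [(0, 0), (2, 0), (2, 2), (0, 2)]
track_b = [(1, 1), (3, 1), (3, 3), (1, 3)]
track_c = [(5, 5), (6, 5), (6, 6), (5, 6)]
print(boxes_overlap(track_a, track_b))  # True: candidates for full comparison
print(boxes_overlap(track_a, track_c))  # False: pruned immediately
```

Only the pairs that survive this filter would go on to a precise (and far costlier) polygon-intersection or track-similarity test.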
💡 Tap the highlighted words to see definitions and examples
Key Vocabulary (CEFR C1)
observation
B2: The act of observing, and the fact of being observed (see observance).
Example:
"how yeah yeah I I think I think what you're picking up on is a really key observation about master data management"
maintains
B1: To support (someone), to back up or assist (someone) in an action.
Example:
"Bridges the Gap and maintains the connectivity between all these different consumers and does it a scalable way if"
connectivity
B2: The state of being connected.
Example:
"Bridges the Gap and maintains the connectivity between all these different consumers and does it a scalable way if"
incremental
B2: Pertaining to an increment.
Example:
"lame answer but I think it's a little of both there are key ways that are incremental in terms of how you how you"
mainstreaming
B2: To popularize, to normalize, to render mainstream.
Example:
"uh since the launch of chat GPT and kind of the mainstreaming of AI and large"
consciousness
B2: The state of being conscious or aware; awareness.
Example:
"trust the robots to kind of uh main you know mainstream Consciousness and we we've"
executives
B1: A chief officer or administrator, especially one who can make significant decisions on their own authority.
Example:
"Executives in the world's largest companies on the topics that are top of mind for everyone in every episode of"
pre-trained
B2: Trained in advance on large amounts of general data before being adapted to a specific task.
Example:
"the the models will be will be pre-trained and we have General models that that can apply to your domain and"
streaming
B1: To flow in a continuous or steady manner, like a liquid.
Example:
"uh since the launch of chat GPT and kind of the mainstreaming of AI and large"
validating
B1: To render valid.
Example:
"validating data burning resources and crushing morale data folds AI powered"
| Word | CEFR | Definition |
| --- | --- | --- |
observation | B2 | The act of observing, and the fact of being observed (see observance) |
maintains | B1 | To support (someone), to back up or assist (someone) in an action. |
connectivity | B2 | The state of being connected |
incremental | B2 | Pertaining to an increment. |
mainstreaming | B2 | To popularize, to normalize, to render mainstream. |
consciousness | B2 | The state of being conscious or aware; awareness. |
executives | B1 | A chief officer or administrator, especially one who can make significant decisions on their own authority. |
pre-trained | B2 | Trained in advance on large amounts of general data before being adapted to a specific task. |
streaming | B1 | To flow in a continuous or steady manner, like a liquid. |
validating | B1 | To render valid. |
Want more YouTube dictations? Visit our practice hub.
Want to translate multiple languages at once? Visit our multi-language translator.
Grammar & Pronunciation Tips for Dictation
Chunking
Listen for the speaker's pauses after certain phrases; they aid comprehension.
Linking
Listen for connected speech, where words blend together.
Intonation
Notice pitch changes that emphasize key information.
Video Difficulty Analysis & Statistics
Downloadable Dictation Resources
Download Study Materials
Download these resources to practice offline. The transcript helps with reading comprehension, SRT subtitles work with video players, and the vocabulary list is perfect for flashcard apps.
Ready to practice?
Start your dictation practice now with this video and improve your English listening skills.