
The Future of Data Engineering: AI, LLMs, and Automation – YouTube Dictation Transcript & Vocabulary

Welcome to FluentDictation, the best site for YouTube dictation practice. Master the C1-level video "The Future of Data Engineering: AI, LLMs, and Automation" with our interactive transcript and shadowing tools. We have split the video into short segments that are ideal for dictation and pronunciation practice. Read the highlighted transcript, study the key vocabulary, and improve your listening skills. 👉 Start Dictation

Join thousands of learners improving their English listening and writing skills with our YouTube dictation tool.


Interactive Transcript & Highlights

1. [Music] Hello, and welcome to the Data Engineering Podcast, the show about modern data management.

Data migrations are brutal. They drag on for months, sometimes years, burning through resources and crushing team morale. Datafold's AI-powered migration agent changes all that. Their unique combination of AI code translation and automated data validation has helped companies complete migrations up to ten times faster than manual approaches, and they're so confident in their solution, they'll actually guarantee your timeline in writing. Ready to turn your year-long migration into weeks? Visit dataengineeringpodcast.com/datafold today for the details.

Your host is Tobias Macey, and today I'd like to welcome back Gleb Mezhanskiy, and we're going to talk about the work of data engineering to build AI to build better data engineering, and all of the things that come out of that idea. So, Gleb, for folks who haven't heard any of your past appearances, if you could just give a quick introduction.

Yeah, thanks for having me again, Tobias. It's always fun to be on the podcast. I'm Gleb, CEO and co-founder of Datafold. We work on automating data engineering workflows, now also with AI. Prior to starting Datafold, I was a data engineer, data scientist, and data product manager, and I got a chance to build three data platforms pretty much from scratch at three very different companies, including Autodesk and Lyft, where I was one of the first founding data engineers and got to build a lot of pipelines and infrastructure, and also break a lot of pipelines and infrastructure. I've always been fascinated by how important data engineering is to the business, in that it unlocks the delivery of the actual applications that are data-driven, be that dashboards or machine learning models or, now increasingly, also AI applications. At the same time, as a data engineer, I've always been very frustrated with how manual, error-prone, tedious, and toilsome my personal workflow was, and I pretty much started Datafold to solve that problem and remove all the manual work from the data engineering workflow, so that we can ship high-quality data faster and help all the wonderful businesses that are trying to leverage data actually do it. So, excited to chat.

In the context of data engineering and AI, obviously there's a lot of hype being thrown around: oh, you just rub some AI on it and it'll be magical, your problems are solved, you don't need to work anymore, it's going to replace all of your junior engineers, or whatever the current marketing spin is. And it's undeniable that large language models and generative AI, the current era that we're in, have a lot of potential; there are a lot of useful applications. But the work to actually realize those capabilities is often a little bit opaque or misunderstood or confusing. So there are definitely a lot of opportunities for bringing large language models or other generative AI technologies into the context of data engineering work or development environments, but the work of actually getting it to the point where it is more help than hindrance is often where things start to fall apart. I'm wondering if you can just start from the work that you're doing, and the experience you've had of actually incorporating LLMs into some of your products, and some of the lessons learned about what some of those impedance mismatches are, what some of those stumbling blocks are that you're going to run into on the path of saying: I've got a model, I've got a problem, let's put them together.

Yeah, absolutely, and I think that's spot on, Tobias.
There's a lot of noise and hype around AI everywhere, but we don't yet have a really clear idea and consensus on how it actually impacts data engineering. And maybe before we dive into what is actually working, it's worth disambiguating and cutting through the noise a little bit. I've been thinking about this recently, and I think there are probably two main things that everyone gets a bit confused about.

One is the confusion of software engineering and data engineering. Software engineering and data engineering are very related, and in many ways they are similar: in data engineering we ultimately also write code that produces some outcome. But unlike software engineering, typically we're not really building a deterministic application that performs a certain function; we write code that processes large amounts of data, and usually that data is highly imperfect. So we're dealing not just with code, we're dealing also with extremely complex, extremely noisy inputs, and a lot of the time also unpredictable outputs, and that makes the workflow quite different.

One important distinction is when we see lots of different tools, and advancements in tools, that are affecting software engineers and impacting their work for the better. One example is that over the past year we've seen amazing improvement in the co-pilot type of support within the software engineering workflow through various tools. We at Datafold, for example, use the Cursor IDE a lot, and we really like how it seamlessly plugs in and enables our engineers working on the application code to just be more productive and spend less time on a lot of boilerplate, toil tasks. It's really exciting how those tools affect the software engineering workflow.

There's also a huge part of the software engineering space right now that is devoted to agents. So, for example, with Cursor the idea is that you plug it into the IDE at a few touch points for the developer, like code completion, and then it acts as an assistant that helps you mock up and refactor the code. It's very seamless, but it's still part of the core workflow for the human. Then there's a second school of thought where an agent takes a task that can be very loosely defined and basically builds an app from scratch, or takes a, say, Linear ticket and does the work from scratch. That's also very exciting, but I would say, in our experience testing multiple tools, the results there are far less impressive, and the actual impact on the business for us, in terms of software engineering, has been far less impressive than with the more IDE-native enhancements.

All of this is to say that while those tools are really impactful for software engineers, and there's a lot happening also in other parts of the workflow, we've seen very limited impact of those particular tools on the data engineer's workflow. The primary reason is that, although we also write code as data engineers, the tools that are built for software engineers lack very important context about the data. That is a simple idea and a simple statement, but what's underneath is actually quite a bit of complexity, because if you think about what a data engineer needs to do in order to do their job, they have to understand not just the codebase; they also have to have a really good grasp of the underlying data that their codebase is processing, which is actually a very hard task by itself: starting from understanding what data you have in the first place, how the data is computed, where it's coming from, who is consuming it, and what the relationships between all the datasets are.
Absent that context, the tools that you may have supporting your workflow, yes, they can help you generate the code, but the impact of that will be quite limited relative to how complex your workflow is. And I think that means that for data engineers we need to see a specialized class of tools dedicated to improving the data engineer's workflow, tools that excel at doing that by having the context that is critical for a data engineer to do their job. That's one aspect of the confusion: all the advancements in software engineering tools are exciting and inspiring, but it doesn't mean that the data engineer's workflow is impacted as significantly as a software engineer's workflow.

I think the other type of confusion I'm seeing is that there's a lot of talk about AI in the data space, and all the vendors you see out there are, I think, smartly positioning themselves as really relevant and essential to the fundamental tectonic shift we're now seeing in technology, meaning they're trying to position themselves as relevant in a world where LLMs are really providing a big opportunity for businesses to improve and grow and automate a lot of business processes. But if you double-click into what exactly everyone is saying, it's pretty much: we're going to help you, the data team, the data engineer, ship AI to your business and to your stakeholders. Like, we are the best workflow engine so that you can get data delivered for AI, or we are the best data quality vendor that will help you ensure the quality of the data that goes into AI, or we have the most integrations with all the vector databases that are important for AI. By no means is this unimportant; it is definitely important and relevant. But what's interesting is that we're essentially saying: data engineer, you have so many things to do, and now you also have to ship AI; we're going to help you ship AI; it's so important that you ship data for AI applications; we are the best tool to help you ship AI. It almost sounds like data engineers are in the service of AI.

I think what's really interesting to explore and unpack, and what I would personally love for myself as a data engineer, is reversing that question and asking: okay, we have this fundamental shift in technology, amazing capabilities from LLMs, so how does it actually help me in my workflow? What does AI for the data engineer look like? And I think we need much more of that discussion, because if we make the people who are actually working on all these important problems more productive with the help of AI, then they will for sure do amazing things with data. I think that's a really exciting opportunity to explore.

One of the first and most vocal applications of AI in that context of helping data engineers, by maybe taking some of the burden off them, that I've seen is the idea of "talk to your data warehouse in English", or text-to-SQL, or whatever formulation it ends up taking. Rather than saying, oh, now you need to build your complicated star or snowflake schema and then build all of the different dashboards and visualizations for your business intelligence, you just put an AI on top of it, and then your data consumers just talk to the AI and say: hey, what was my net promoter score last quarter, or what's my year-over-year revenue growth, or how much growth can I expect in the next quarter based on current sales?
And it's going to just automatically generate the relevant queries, it's going to generate the visualizations for them, and you as a data engineer, or as an analytics engineer, don't need to worry about it anymore. From the description it sounds amazing: great, okay, job done, I don't need to worry about that toilsome work, I do all of the interesting work of getting the data to where it needs to be, and then the AI does the rest. But then you still have to deal with the issues of making sure that you have the appropriate semantics mapped so that the AI understands what the question actually means in the context of the data that you have, which is the hardest problem in data anyway, no matter what. So the AI doesn't actually solve anything for you; it maybe just exacerbates the problem, because somebody asks the AI a question and the AI gives an answer, but it's answering based on a misunderstanding of the data that you have. And so you still have those issues of hallucination, incorrect data, or variance in the way that the data is being interpreted. I'm wondering what you have seen as far as the actual practical applications of the AI being that simplifying interface, versus the amount of effort that's needed to actually make that useful.

Yeah, I think text-to-SQL is the Holy Grail of the data space, I would say, for as long as I've worked in the space, which is over a decade. People really tried to solve this problem multiple times, and obviously now, in hindsight, it's clear that pre-LLM, all of those approaches using traditional NLP were doomed. Now that we have LLMs, it seems like, okay, finally we can actually solve this problem, and I'm very optimistic that it indeed will help make data way more accessible. I think it eventually will have tremendous impact on how humans interact with data and how data is leveraged. But the how, how it happens and how it's applied, is also very important, because I don't think that the fundamental problem is that people cannot write SQL. SQL is actually not that hard to write and to master. I think the fundamental issue is that, if we think about the life cycle of data in the organization, it's very important to understand that the raw data that gets collected from all the business systems and all the events and logs, everything we have in a data lake, is pretty much unusable, and it's unusable both by machines and AI and by people, if we just try to throw a bunch of queries at it and try to answer really key business questions. In order for the data to become usable, we need what is currently the job of a data engineer: structuring, filtering, merging, and aggregating this data, curating it, and creating a really structured representation of what our business is and what all the entities in the business are that we care about, like customers, products, orders, so that this data can then be fed into all the applications: business intelligence, machine learning, AI. And I don't think that text-to-SQL replaces that, because if we just do that on top of the raw data, we basically get garbage in, garbage out.

I do think that in certain applications we can actually get very good results even today, if we put that kind of system on top of highly curated, semantically structured datasets. So if we have a number of tables that are well defined, that describe how our business works, having a text-to-SQL interface could be actually extremely powerful,
because we know that the questions that are asked, and will be translated into code, will be answered with data which has already been prepared and structured, so it's actually quite easy for the system to make sense of it. But I don't think we are at the point of "you don't need the data team, let's just ask a question"; it's almost guaranteed that the answer will be wrong. So data engineering and data engineers are definitely not going to lose their jobs just because it's now easy to generate SQL from text.

In the context even of that text-to-SQL use case, what I've been hearing a lot is that it's not even very good at that one, because LLMs are bad at math, and SQL is just a manifestation of relational algebra, thereby math. But if you bring a knowledge graph into the system, where the AI uses the knowledge graph to understand the relations between all the different entities and then generates the queries from that, it actually does a much better job. But again, you have to build the knowledge graph first, and I think maybe that's one of the places where bringing AI earlier in the cycle is actually potentially useful: you can use the AI to do some of that rote work of saying, here are all the different representations that I have of this entity or this concept across my different data sources, give me a first pass at what a unified model looks like to represent that, given all of the data that I have about it and all the ways that it's being represented. And I'm wondering what you've seen in that context of bringing the AI into that data modeling, data curation workflow: it's not the end user interacting with it, it's the data engineer using the AI as their co-pilot, if you will, or as their assistant, to do some of that tedious work that would otherwise be: okay, well, I've got 15 different spreadsheets, I need to visually look across them and try to figure out the similarities and differences, etc.

Yeah, that's a great point. I have two thoughts there on how the AI plugs in to actually make text-to-SQL work. Yes, you absolutely need that kind of semantic graph of what datasets you have, how they are related, what all the metrics are, and how those metrics are computed. And in that regard, what's really interesting is the metrics layer, which was at some point a really hot idea in the modern data stack, probably about three to five years ago, and then everyone was really disappointed with how little impact it actually made on a data team's productivity and on the overall data stack. It's almost like now it's the metrics layer's time, because if you take the metrics layer, which gives you a really structured representation of the core entities and the metrics, putting text-to-SQL on top of it is almost the most impactful thing that you can do, because then you have a structured representation of your data model which allows AI to be very, very effective at answering questions while operating on a structured graph. So I think we'll see really exciting applications coming out of the hybrid of that fundamental metrics layer, or semantic graph, and text-to-SQL. We're already seeing the early impacts of that, but I think over the next two years it will probably become a really popular way to open up data for the ultimate stakeholders, instead of the classical BI of drag-and-drop interfaces and passively consumed dashboards.
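To make the semantic-layer idea concrete, here is a minimal editorial sketch in Python (not any specific product's implementation). The table names, metric definitions, and the `complete()` function are hypothetical stand-ins for a real schema and a real LLM API:

```python
# Sketch: text-to-SQL grounded in a curated semantic layer, so the model
# can only answer from data that has already been structured.

SEMANTIC_LAYER = {
    "tables": {
        "dim_customers": "one row per customer (key: customer_id)",
        "fct_orders": "one row per order (keys: order_id, customer_id; columns: order_date, amount)",
    },
    "metrics": {
        "revenue": "SUM(fct_orders.amount)",
        "order_count": "COUNT(DISTINCT fct_orders.order_id)",
    },
}

def complete(prompt: str) -> str:
    """Hypothetical stand-in for any LLM completion API."""
    raise NotImplementedError("plug in your model provider here")

def build_prompt(question: str) -> str:
    # Constrain the model to the curated tables and metric definitions.
    tables = "\n".join(f"- {t}: {d}" for t, d in SEMANTIC_LAYER["tables"].items())
    metrics = "\n".join(f"- {m} = {d}" for m, d in SEMANTIC_LAYER["metrics"].items())
    return (
        "Translate the question into SQL using ONLY these tables and metrics.\n"
        f"Tables:\n{tables}\nMetrics:\n{metrics}\nQuestion: {question}\nSQL:"
    )

def text_to_sql(question: str) -> str:
    return complete(build_prompt(question))
```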
But then the second point, which you made, is basically: can AI actually help us get to that structured representation? And I think absolutely. For the data engineer's workflow, so not for a business stakeholder or someone who is a data consumer, but for the data producer, I think that leveraging LLMs to help you build data models, and especially build them faster, faster in the sense of understanding all the semantic relationships, not just writing code, is a very promising area. And that comes back to my point about how software tools are limited in their help for data engineers: I can write SQL, but if my tool does not understand the relationships between the datasets, then it can't even help me write joins properly.

One of the interesting things we've done at Datafold was actually build a system that essentially infers an entity relationship diagram from the raw data that you have, combined with all the ad hoc SQL queries that have been written by people. Previously that would be a very hard problem to solve, but with the help of LLMs we can actually have a really good shot at understanding what all the entities are that your business has in your data lake and how they are related. And it's almost like a probabilistic graph, because people can be writing joins correctly or incorrectly, you have noisy data, and sometimes keys that you think are primary keys and foreign keys are not perfect. But if you have a large enough dataset of queries that were run against your warehouse, you can actually have a really good shot at understanding what the semantic graph looks like. The context in which we actually did this was to help data teams build testing environments for their data, but the implications of having that knowledge are actually very powerful. To your point, we can use that to also help write SQL. So I'm very bullish on the ability to help data engineers build pipelines by creating a semantic graph without the need for manual curation, because previously that problem was almost pushed to people with all the data governance tools; the idea was, let's have data stewards define all the canonical datasets and all the relationships, and obviously we discovered that is completely non-scalable. So now we're finally at the point where we can automate that kind of semantic data mining with LLMs.
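As an editorial illustration of the query-log mining Gleb describes, the sketch below counts join-key pairs seen in historical SQL; pairs that recur across many queries are treated as probable relationships. The queries and the regex are simplified assumptions (a real system would resolve table aliases with a proper SQL parser):

```python
import re
from collections import Counter

# Hypothetical query log: ad hoc SQL historically run against the warehouse.
QUERY_LOG = [
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN customers c ON o.customer_id = c.id",
    "SELECT * FROM orders o JOIN shipments s ON o.id = s.order_id",
]

# Naive pattern for "ON a.x = b.y"; aliases are kept as-is for brevity.
JOIN_RE = re.compile(r"ON\s+(\w+)\.(\w+)\s*=\s*(\w+)\.(\w+)", re.IGNORECASE)

def infer_edges(queries):
    """Count join-key pairs; frequent pairs are likely foreign-key links."""
    edges = Counter()
    for q in queries:
        for t1, c1, t2, c2 in JOIN_RE.findall(q):
            edges[tuple(sorted([f"{t1}.{c1}", f"{t2}.{c2}"]))] += 1
    return edges

for (a, b), n in infer_edges(QUERY_LOG).most_common():
    print(f"{a} <-> {b}  (seen in {n} queries)")
```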
That brings us back around to another point that I wanted to dig into further, in the context of how to actually integrate LLMs into these different use cases and workflows. You brought up the example of Cursor as an IDE that was built specifically with LLM use cases in mind, as opposed to something like VS Code or Vim or Emacs, where the LLM is a bolt-on, something that you're trying to retrofit into the experience. It can be useful, but it requires a lot more effort to actually set it up, configure it, make it aware of the codebase that you're trying to operate on, etc., versus the prepackaged product. And we're seeing that same type of thing in the context of data, where, as you mentioned, there are all these different vendors saying: hey, we're going to make it super easy for you to make your data ready for AI, or use this AI on your data. But most teams already have some sort of system in place, and they just want to retrofit the LLM into it to start getting some of those gains, with the eventual goal of having the LLM maybe become a core portion of their data system, their data product. And I'm wondering, in that process of bringing in an LLM and retrofitting it onto an existing system, whether that be your code editor, your deployment environment, your data warehouse, what have you, what are some of those impedance mismatches, or some of the issues in conceptual understanding, about how to bring the appropriate knowledge (I'm going to use that word even though it's a bit of a misnomer) into the operating memory of the LLM, so that it can actually do the thing that you're trying to tell it to do?

Yeah, that's a great question, Tobias. I think to answer this we need to go back to what the jobs to be done for a data engineer are and how the data engineer workflow actually looks. If we were to visualize it, it actually looks quite similar to the software engineering workflow, in just the types of tasks that a data engineer does day-to-day to do their work. And by the way, we're saying "data engineer" as sort of a blanket label, but I don't necessarily mean just people who have data engineer in their title, because all the roles that work with data, including data scientists, analysts, analytics engineers, and in many cases software engineers, a lot of them actually do data engineering, in terms of building and developing pipelines, as part of their job. It's just that data engineers probably do this 100% of their time, whereas if I'm a data analyst or data scientist, I might be doing this maybe 30 to 40% of my week.

So if we think about what I need to do to, let's say, ship a new data model, like a table, or extend an existing data model, refactor definitions or add new types of information into an existing model, it starts with planning. I'm doing planning, I'm trying to find the data that I need for my work, and a lot of the time that information can be sourced from documentation, from a data catalog. I think right now data cataloging, in the sense of what datasets I have and what the profile of those datasets is, has been largely solved; there are great tools, some open source, some from vendors, but overall, understanding what datasets you have is now way easier than it was five years ago. You're also probably consulting your tribal knowledge, so you go to Slack and search for certain definitions, and that's also now largely solved with a lot of the enterprise search tools.

And then you go into writing code. I think this is also an important misconception: if you are not really doing this for a living, you think that people spend most of their time actually writing SQL, in the sense of writing SQL for production. In my experience, the actual writing of SQL, or other types of code, is maybe 10 to 15% of my time, whereas all the operational tasks around it, testing it, talking to people to get context, doing code reviews, shipping it to production, monitoring it, remediating issues, talking to more people, is where the bulk of the work happens. And if that's true, then that means that, as we talk about automation, these operational workflows are where the bulk of the lift coming from LLMs can actually happen. So for actually writing code as a data engineer, I would still recommend using the best-in-class software tools these days, like Cursor. Even though it's not aware of the data, it will probably still help you write a lot of boilerplate and will speed up your workflow somewhat. Or you can use other IDEs with, you know, VS Code plus Copilot. I think those tools will just help you speed up the writing of the code itself.
But back to the operational workflows that I think take the majority of the time within any cycle of shipping something. When it comes to what happens after you wrote the code: typically, if you have people who care about the quality of the data, it means you have to do a fair amount of testing of your work. Testing is both about making sure that my code is correct (does it conform to the expectations, does it produce the data that I expect?) but also about understanding potential breakages. Data systems are historically fragile, in the sense that you have layers and layers of dependencies that are often opaque. I can be changing some definition of what an active user is somewhere in the pipeline, while being completely oblivious to the fact that ten jobs down the road, someone built a machine learning model that consumes that definition and tries to automate certain decisions, for example around spend. If I'm not aware of those downstream dependencies, I could actually be causing a massive business disruption just by the sheer fact of changing that metric. So the testing, which involves not just understanding how the data behaves, but also how the data is consumed and what the larger business implications of any modification to the code are, is where a ton of time is spent in data engineering.

What's interesting is that this is the use case where, historically, we at Datafold spent a lot of time thinking, even pre-AI, before LLMs were a thing. What we did there was come up with the concept of data diffing. The idea is that everyone can see a code diff: my code looked like this before I made the change, and now it's a different set of characters. Diffing the code is something that is embedded in GitHub; you can see it. But the very hard question is understanding how the data changes based on the change in the code, because that is not obvious; it only happens once you actually run the code against the database. Data diff allows you to see the impact of a code change on the data. That by itself was quite impactful, and we've seen a lot of teams adopt it, large enterprise teams and fast-moving software startup teams alike. But we were not fully satisfied with the degree of automation that feature alone produced, because people were still required to sift through all the data diffs, explore them for multiple tables, and see how the downstream impacts manifest themselves through lineage. It felt like, okay, now at least we can give people all the information, but they still have to sift through a lot of it, and some of the important details can be missed.

The big unlock that LLMs bring to this particular workflow, once LLMs became pretty good at comprehending code and actually semantically understanding it, which pretty much happened over 2024 with the latest generation of foundational large language models, is that we were able to do two things. One: take a lot of information and condense it into, say, three bullet points, kind of an executive summary. Those bullet points essentially help the data engineer understand, at a high level, the most important impacts to worry about for any given change, and help a code reviewer understand the same. That just helps people get on the same page very quickly and saves everyone a lot of time that otherwise could be spent in meetings or in back-and-forth comments on a code change.
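Here is a minimal editorial sketch of the data diffing idea (not Datafold's implementation): compare two versions of the same table, mocked here as lists of dicts, at both the row level and the metric level:

```python
# Two versions of the same table: main branch vs. developer branch.
main = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 15.0}]
dev  = [{"order_id": 1, "amount": 10.0}, {"order_id": 2, "amount": 18.0},
        {"order_id": 3, "amount": 7.0}]

def data_diff(before, after, key):
    """Value-level diff of two result sets keyed by a primary key."""
    b = {row[key]: row for row in before}
    a = {row[key]: row for row in after}
    added   = sorted(a.keys() - b.keys())
    removed = sorted(b.keys() - a.keys())
    changed = sorted(k for k in b.keys() & a.keys() if b[k] != a[k])
    return added, removed, changed

added, removed, changed = data_diff(main, dev, "order_id")
print(f"rows added: {added}, removed: {removed}, changed: {changed}")

# Metric-level rollup: how does the code change move an aggregate?
print("revenue delta:", sum(r["amount"] for r in dev) - sum(r["amount"] for r in main))
```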
And the second unlock we've seen is the opportunity to drill down, explore all the impacts, and do the testing by essentially chatting with your pull request, chatting with your code. That comes in the form of a chat interface where you're basically speaking to an agent that has full context of your code, full context of the data change, the data diff, and also full context of your lineage, so that it can actually understand how every modified line of code is affecting the data and what that means for the business. You can ask questions, and it produces the answers way faster than you would by manually looking at all the different code changes and data diffs. That ended up saving a lot of time for data teams. And now that I'm describing this, you can feel that it sounds almost like having a buddy that helps you think through the code, almost like having a code reviewer, except with an LLM this is a buddy that's always available to you 24/7 and probably makes fewer mistakes, because it has all the context and can sift through a lot of information really quickly. So that's an example of how an LLM can be applied to an operational use case that has historically been really time-consuming, and take a lot of manual work out of that context.

I really want to dig into one word that you said probably at least a half dozen times, if not a couple dozen, and that is "context". I think that is the key piece that is so critical, and also probably the most difficult part of making AI useful: what context does it need, how do you get that context to it, how do you model that context, how do you keep it up to date? I think that really is where the difference comes in between the Cursor example that we touched on earlier versus retrofitting onto Emacs, or whatever your tool or workflow of choice is: how do you actually get the context to the place that it needs to be? You just discussed the use case you have of using the LLM to interpret the various data diffs and understand the actual ramifications of a change, and I'm wondering if you can talk through some of the lessons learned about how you actually populate and maintain that context, and how you instruct the LLM to take advantage of the context that you've given it.

That's a great question, Tobias. I think what's interesting is that, at face value, it seems like you want to throw all the information you have at the LLM: just tell it everything and then let it figure things out. In fact, it is obviously not as easy as that; it's actually counterproductive to over-supply the LLM with context, in part because the context window of large language models is limited. The trade-off there is: one, you just can't physically fit everything; and two, even if you're dealing with a model that is designed to have a very large context window, if you overuse it and supply too much information, the LLM just gets lost. It also starts being far, far less effective at understanding what's actually important versus not, and the overall effectiveness of your system goes down.
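A rough editorial sketch of that trade-off: rather than dumping everything into the prompt, rank candidate context snippets and pack them greedily under a token budget. The four-characters-per-token heuristic and the example snippets are assumptions made purely for illustration:

```python
def approx_tokens(text: str) -> int:
    # Rough heuristic: roughly four characters per token.
    return max(1, len(text) // 4)

def pack_context(snippets, budget_tokens):
    """snippets: list of (relevance_score, text); pack best-first under budget."""
    packed, used = [], 0
    for score, text in sorted(snippets, key=lambda s: -s[0]):
        cost = approx_tokens(text)
        if used + cost <= budget_tokens:
            packed.append(text)
            used += cost
    return "\n\n".join(packed)

candidates = [
    (0.9, "CODE DIFF: -COUNT(*) +COUNT(DISTINCT order_id) in fct_orders"),
    (0.8, "DATA DIFF: Monday orders metric changed 37 -> 39"),
    (0.5, "LINEAGE: weekly_revenue dashboard reads from fct_orders"),
    (0.1, "STYLE GUIDE: prefer CTEs over nested subqueries"),
]
print(pack_context(candidates, budget_tokens=40))  # drops the least relevant
```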
So, back to your question of what information is actually important to provide as context to the LLM: it really depends on the workflow we're talking about. In the context of code review and testing, we are trying to fundamentally answer three questions. First: if we changed the code, was the change correct relative to what we tried to do, or did we not conform to the business requirement? Second: did we follow the best practices, such as code guidelines and performance guidelines, or not? And third: okay, let's say we conformed to the business requirements and did a good job following our coding best practices, but we may still cause a business disruption just by making a change that comes as a surprise, either to a human consumer of data downstream, or to a machine learning model that was trained on a different distribution of data. These are the three fundamental questions we try to answer, and by the way, even without AI, that's what a good code review done by humans would ultimately accomplish.

So what is the context that is important for the LLM to have here? First, obviously, the code diff. We already know what the original code was and what the new code is, and feeding that in is really important so the model can understand the actual changes in the code itself, in the logic. I won't go into the details here, because obviously the codebase can be very large and sometimes your PR can touch a lot of code, so you have to be quite strategic on the technical side about how you feed that in, but conceptually that's input number one.

The second important input is the data diff: understanding, for the main-branch version of the code, what data it produces and what the metrics show, and then, for a new version of the code, let's call it the developer branch, what data it produces and what the difference in the output is. Let's say with my main-branch code I see that I have 37 orders on Monday, but with the new version of the code I see that I have 39. That already tells me the important impact on the output data and on the metric. That matters both at the value level, understanding how individual cells, rows, and columns are changing, and at the rollup level, understanding the impact on metrics. Coupling that context with the code diff allows us to understand how changes in the code affect the actual data output.

The third really important aspect is the lineage. Lineage is fundamentally understanding how the data flows throughout your system: how it's computed, how it's aggregated, and how it's consumed. The lineage is a graph, and there are two directions of exploration. One is upstream, which helps us understand how the data got to the point where you're looking at it. For example, if I'm looking at the number of orders and I'm changing a formula, where does the information about orders come from in the first place? That is important because it can tell us a lot about how a given metric is computed and what the source of truth is: are we getting it from Salesforce, or from our internal system? The downstream lineage is also important, because it tells us how the data gets consumed, and that is absolutely essential information that can help us understand which downstream systems and metrics will be affected. The lineage graph in itself can be very complex, and building it is actually a tough problem, because you have to essentially scrape all of your data platform information, all the queries, all the BI tools, to understand how data flows and how it's consumed and produced.
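To illustrate the downstream direction, here is a small editorial sketch that walks a toy lineage graph breadth-first to find everything a change could affect; the node names are invented:

```python
from collections import deque

# Toy lineage graph: edges point downstream (producer -> consumers).
LINEAGE = {
    "raw_orders": ["stg_orders"],
    "stg_orders": ["fct_orders"],
    "fct_orders": ["weekly_revenue_dashboard", "churn_model_features"],
    "churn_model_features": ["churn_model"],
}

def downstream_impact(changed_node):
    """Breadth-first traversal collecting every transitive consumer."""
    seen, queue = set(), deque([changed_node])
    while queue:
        node = queue.popleft()
        for consumer in LINEAGE.get(node, []):
            if consumer not in seen:
                seen.add(consumer)
                queue.append(consumer)
    return seen

# Changing stg_orders touches the fact table, the dashboard, and the model.
print(downstream_impact("stg_orders"))
```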
But let's say you have this lineage graph: it's actually also a lot of information by itself. To properly supply that lineage information into an LLM's context, you really need your system to be able to explore the lineage graph on its own, to see: okay, if the developer made a change here, what are the important downstream implications? So now we're talking about a system that's able to traverse that graph and do analysis on its own to build the context. I would say those are the three most important types of context, and then the fourth one is kind of optional: if your team has any best practices, SQL linting rules, or documentation rules, you can also provide them as context, and then your AI code reviewer assistant can help you reason about whether you conformed or not, and if not, make suggestions about what to correct, eventually probably going in and correcting your code itself. I think that's ultimately where this is going, but again, it would pretty much be operating on the same set of input context.

Another interesting element of bringing LLMs into the data engineering workflow and use case: one part is the privacy aspect, which is a whole other conversation, and I don't want to get too deep into that quagmire. But also, when you're working as a data engineer, one of the things you need to think about is: what is my data platform, what are the tools that I rely on, and how do they link together? If you're going to rely on an LLM or generative AI as part of that toolchain, how does it fit into that platform? What is some of the scaffolding, what are some of the workflows, what is some of the custom development that you need to do? A lot of the first-pass and naive use cases for generative AI and LLMs are: oh well, just go open up the ChatGPT UI, or just run LM Studio, or use Claude, or what have you. But if you want to get into anything sophisticated, where you're actually relying on this as a component of your workflow, you want to make sure that it's customized, that you own it in some fashion, and that is likely going to require custom development, using something like a LangChain or a LangGraph or CrewAI or whatever, where you're actually building additional scaffolding logic around just that kernel of the LLM. I'm curious how you're seeing some of the needs and use cases of incorporating the LLM more closely into the actual core capabilities of the data platform, through that effort of customization and software engineering.

That's a great point. I think the models themselves are rapidly getting commoditized, in the sense that the capabilities of the foundational large language models are similar, and their interfaces are very similar. We're seeing a lot of racing between the companies training those models in terms of beating each other on benchmarks; it looks like the whole industry is converging on adding more reasoning, and the way that's happening is also converging on the same experience. The difference is who does it better: who beats the metrics, who provides the cheapest inference, the fastest inference, more intelligence for the same price. To that end, I don't think that the differentiation, or the effectiveness of whatever automation you're trying to bring, really depends on the choice of a model.
Maybe for certain narrow applications, choosing a more specialized model, or fine-tuning a model, would be more applicable, but still, I don't think the model is where the magic really happens these days. The model is important for the magic, but it's not something that allows you to build a really impactful application just by choosing something better than what's available to everyone else. The actual magic, the value-add, and the automation happen in how you leverage that model in your workflow. So all the orchestration: how do you prompt the model, what kind of context do you provide, how do you tune the prompt, how do you tune the inputs, how do you evaluate the performance of the model in production, how do you make various LLM-based actors that may be playing different roles interact with each other? That is where the hard work happens, that is where I think the actual value and impact are created, and that's where all the complexity is. You don't have to have a PhD and really understand how the models are trained, although, just as in computer science generally, it is obviously very helpful to understand how these models are trained, their architectures, and their trade-offs. You don't have to be good at training those models in order to leverage them effectively, but to leverage them, you have to do a lot of work to plug them into the workflows effectively. I think the applications and companies and teams that are thinking about what the workflow is, what the ideal user interface is, and what all the information is that we can gather to make the LLM do a better job, and that are able to rapidly iterate, will ultimately create the most impact with LLMs.

And so, on that note, in your experience of working with the LLMs, working with other data teams, and keeping apprised of the evolution of the space, what are some of the most interesting or innovative or unexpected ways that you've seen teams bring LLMs into that inner loop of building and evolving their data systems?

I think the most obvious realization in hindsight, but not necessarily obvious going in, is that no one really knows how to ship LLM- and AI-based applications. There are obviously guides and tutorials, and there's still a lot you can learn from looking at what people are doing, but the field is evolving so fast that nothing replaces fast experimentation and just building things. It's not that you can just hire someone who worked on building an LLM-based application six months or a year ago and all of a sudden gain a big advantage, as you would with many other technologies. If we were working in the space of video streaming, for instance, it would be very beneficial to have extensive experience with video streaming and codecs; with LLMs, no one really knows exactly how they behave. Even the companies that are shipping them are discovering more and more novel ways of leveraging them more effectively every week. And for the teams that are leveraging LLMs, like Datafold, the thing we found matters most is the ability to, one, stay on top of the field, understanding the most exciting things people are doing, how they relate to our field, and how we can borrow some of those ideas; but most importantly, rapid experimentation, with some sort of methodology that allows you to try new things, measure results quickly, and then be able to scrap an approach you thought was great and just go with a different one.
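One hedged editorial sketch of such a methodology: a tiny evaluation harness that scores any pipeline variant against a fixed set of labeled cases, so a new prompt or model can be compared and discarded quickly. `run_variant` is a hypothetical hook for whatever LLM pipeline is under test:

```python
# Fixed, labeled cases: here, text-to-SQL outputs checked for expected clauses.
CASES = [
    {"input": "orders by day last week", "expected_contains": "GROUP BY"},
    {"input": "total revenue in 2024",   "expected_contains": "SUM"},
]

def evaluate(run_variant, cases) -> float:
    """Return the fraction of cases whose output contains the expected text."""
    hits = 0
    for case in cases:
        output = run_variant(case["input"])
        hits += case["expected_contains"].lower() in output.lower()
    return hits / len(cases)

# Example with a stubbed variant standing in for a real prompt + model combo:
stub = lambda q: "SELECT day, COUNT(*) FROM orders GROUP BY day"
print(f"pass rate: {evaluate(stub, CASES):.0%}")
```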
A lot of times when a new model is released, you have to adjust a lot of things: you have to adjust the prompts, you may even have to re-architect. That is both difficult and incredibly exciting, because the pace of innovation, and what is possible to solve, is evolving extremely fast, I would say the fastest of any previous technological wave of disruption that we've seen.

In your experience, and in your work of investing in this space, figuring out how best to apply LLMs to the problems facing data engineers and how to incorporate that into your products, what are some of the most interesting or unexpected or challenging lessons that you've learned personally?

Yeah, I think the interesting realization was that, specifically for the data engineering domain, if you just take the problem at face value, you think: well, let's just build a co-pilot or an agent that would try to automate the data engineer away. I don't think we have the tech ready for an agent to really take a task and run with it yet. I don't think it's been solved in the software space, and I think it's in some ways even harder to solve in the data space. We'll eventually get there; I don't think we are there yet. And I don't think the biggest impact you can make on the data engineering workflow is having a co-pilot, because that's not where data engineers spend most of their time; it's not writing production code, it's all the operational tasks. And there are certain kinds of problems in the data engineering space where it's not even about day-to-day help, where you save an hour, two hours, three hours; there are certain types of workflows where, to complete a task, a team needs to spend something like 10,000 hours. A good example of such a project is a data platform migration, where, for example, you have millions of lines of code on a legacy database and you have to move them over to a new, modern data warehouse. You have to refactor them, optimize them, and repackage them into a new framework; you may be moving from, say, stored procedures on Oracle to dbt plus Databricks. Doing that requires a certain number of hours for every object, and because you're dealing with a large database, at the enterprise level that sums up to an enormous amount of work. Historically these projects would last years and be done, a lot of the time, by outsourced talent from consultants or systems integrators. For a data engineer, that's probably one of the most miserable projects to do. I've led such a project at Lyft, and it was an absolute grind, where you're not shipping new things, you're not shipping AI, you're not even shipping data pipelines; you're just paying down technical debt for years.

What's interesting is that those types of projects and workflows are actually, I would say, where AI and LLMs can make the most impact today, because we can take a task and reverse-engineer it. We know exactly what the target is: you move the code, you do all these things with the code, and ultimately the data has to be the same. You're going through multiple complex steps, but what's important for the business is that once you move from, let's say, Teradata to Snowflake, your output is the same, because otherwise the business wouldn't accept it. And that allows us to leverage LLMs for a lot of the tasks that historically were manual but also have a really clear objective function for the LLM, like diffing the output of the legacy system against the modern system and using that as a constraint.
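An editorial sketch of that objective function, under the assumption that both systems can return the same query's result set: fingerprint each output (row count plus an order-insensitive checksum) and require the fingerprints to match. Connections and rows here are hypothetical placeholders:

```python
def fingerprint(rows):
    """Order-insensitive fingerprint of a result set: row count + checksum."""
    normalized = sorted(tuple(str(v) for v in row) for row in rows)
    return len(normalized), hash(tuple(normalized))

# Pretend these came from running the same report on each system.
legacy_rows = [(1, "2024-01-01", 10.0), (2, "2024-01-02", 15.0)]  # e.g. legacy warehouse
modern_rows = [(2, "2024-01-02", 15.0), (1, "2024-01-01", 10.0)]  # e.g. modern warehouse

# The migration's hard constraint: outputs must match regardless of row order.
assert fingerprint(legacy_rows) == fingerprint(modern_rows), "outputs diverge"
print("migration validated: outputs match")
```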
If you put those two things together, you have a very powerful system that is extremely flexible and scalable thanks to LLMs, but that can also be constrained to a very objective definition of what's good, unlike a lot of this text-to-SQL generation, which cannot be constrained to a definition of what's good, because how would you know? With a migration, you do know. And that allows AI to make a tremendous impact on the productivity of a data team, by taking a project that would last for years, cost millions of dollars, and go over budget, and compressing it into weeks at just a fraction of the price. I think that is where we can see real impact of AI that's useful and working. We also see the parallels in the software space: there, a lot of the really impactful enterprise applications of AI are about taking legacy codebases and helping teams maintain them or migrate them. I think there are more opportunities like that in the data engineering space where we'll see AI make tremendous impacts.

And as you continue to keep in touch with the evolution of the space, work with data teams, and evaluate the cases where LLMs are beneficial versus where you're better off going with good old human ingenuity, what are some of the things you're keeping a particularly close eye on, or any projects or contexts you're excited to explore, in terms of where you think LLMs would really make a huge impact on the workflow: LLMs in general, how to apply them to data engineering problems, and how to incorporate them more closely, with less legwork, into the actual problem-solving apparatus of an organization?

Yeah, there are a lot of exciting things on multiple levels. For example, being able to prompt an LLM from SQL as a function call, which is available these days in modern data platforms, is incredibly impactful, because in many instances we're dealing with extremely massive data, and instead of having to write complex CASE WHEN statements and regexes and UDFs to clean the data, to classify things, and to untangle the mess, we can now apply LLMs from within SQL, from within the query, to solve that problem. That is incredibly impactful for a whole variety of applications, so I'm very excited about all these capabilities that are now brought by the major data platforms, like Snowflake, Databricks, BigQuery.

And if we go into the workflow itself, what a data engineer does and how to make that work better, I think there's a ton of opportunity to further automate a lot of tasks. A big one is data observability and monitoring. I honestly think that data observability in its current state is a dead end, in the sense of: let's cover all the data with alerts and monitors and then be the first to know about any anomalies. It's useful, but it quickly leads to a lot of noise and alert fatigue, and ultimately could even be net negative for the workflow of a data engineer. I think this is the type of workflow where putting an AI to work investigating those alerts, doing the root cause analysis, and potentially doing remediation is where I see a lot of opportunity for saving a ton of time for a data team, while also improving the SLAs and the overall quality of the output of the data engineering team. That's something we are really excited about, something we're working on at Datafold, and we're excited about it coming later this year.
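Returning to the LLM-from-SQL capability mentioned above: the exact function differs by platform (for example, Snowflake Cortex and Databricks both expose in-warehouse LLM functions), so the sketch below uses a deliberately hypothetical `LLM_COMPLETE` UDF rather than any vendor's real API:

```python
# Sketch: classifying a messy text column from within SQL via an LLM call.
# LLM_COMPLETE is a hypothetical in-warehouse function; check your platform's
# documentation for the actual name and signature before using this pattern.

CLASSIFY_SQL = """
SELECT
    ticket_id,
    LLM_COMPLETE(
        'Classify this support ticket as billing, bug, or feature_request. '
        || 'Reply with one word: ' || ticket_text
    ) AS category
FROM support_tickets
"""

# cursor.execute(CLASSIFY_SQL)  # run with your warehouse's client library
print(CLASSIFY_SQL)
```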
Are there any other aspects of this overall space of using LLMs to improve the lives of data engineers, and of the work that data engineers can do to improve the effectiveness of those LLMs, that we didn't discuss yet that you'd like to cover before we close out the show?

I think we talked a lot about the workflow improvement. Overall, my recommendation to data engineers today would be to learn how to ship LLM applications. It's not that hard: frameworks like LangChain make it very easy to compose multiple blocks together and ship something that works. Whether or not you end up using LangChain or another framework in production, and whether your team allows that, doesn't really matter, but it's really, really useful to try to build and to learn all the components. It's just like software engineering: learning how to code opens up so many opportunities for you to solve problems; you see a problem and you think, I can write a script for that. With LLMs it's almost a new skill that both software engineers and data engineers need to learn, where you see a problem and you think: okay, I actually think I can split this problem into three tasks that I can give to an LLM; one would be extraction, one could be reasoning and classification, and now it just solves the problem. Really learning how to build, and trying, helps you develop that intuition. So my recommendation for all data engineers listening to this is: try to build your own application that solves either a business problem or helps you in your own workflow, because knowing how to build with LLMs just gives you tremendous superpowers and will definitely be helpful in your career in the coming years.
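As an editorial sketch of that decomposition, here is the extraction-then-classification pattern in plain Python; `llm()` is a hypothetical completion function and the prompts are invented:

```python
def llm(prompt: str) -> str:
    """Hypothetical completion call; plug in any LLM API here."""
    raise NotImplementedError

def extract_fields(email: str) -> str:
    # Task 1: pull structured fields out of unstructured text.
    return llm(f"Extract the vendor name and invoice total from:\n{email}")

def classify(fields: str) -> str:
    # Task 2: reason over the extracted fields and assign a label.
    return llm(f"Given these invoice fields, label them 'expected' or 'anomalous':\n{fields}")

def process(email: str) -> str:
    # Chaining small, testable prompts is the point of the decomposition:
    # each step can be evaluated and swapped independently.
    return classify(extract_fields(email))
```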
I definitely would like to reinforce that statement, because despite the AI maximalists and the AI skeptics, no matter what you think about it, LLMs aren't going anywhere. They're going to continue to grow in their usage and their capabilities, so it's worth understanding how to use them and investing in that skill, because it is going to be one of those core tools in your toolbox for many years to come. And so, for anybody who wants to get in touch with you and follow along with the work that you are doing, I'll have you add your preferred contact information to the show notes. As the final question, I'd like to get your current perspective on the biggest gap in the tooling or technology for data management today.

I think there's a lot of skepticism, and some bitterness, around how the modern data stack failed us, in the sense that five years ago we were so excited that the modern data stack would make things so great, and we're kind of disappointed. I'm an optimist here. I think the modern data stack, in the sense of infrastructure, getting a lot of the fundamental challenges out of the way, like running queries, getting data in and out of different databases, visualizing the query outputs, and having amazing notebooks, all of that, which we now take for granted, is actually so great relative to where we were five, seven, eight, ten years ago. But I don't think it's enough. I am with the data practitioners who say: well, it's 2025, we have all these amazing models, why is it still so hard to ship data? I'm absolutely with you. What I'm excited about is that, now that we have this really great foundation with the modern data stack, in the sense of infrastructure, I'm excited about, one, getting everyone onto the modern data stack, to the point about migrations: let's get everyone on modern infrastructure so that they can ship faster, which is obviously a problem I'm really passionate about solving and working on. And second, once you are on the modern data infrastructure, how to keep modernizing your team's workflow, so that the engineers spend more and more time on solving hard problems, on thinking and planning, on the valuable activities that are really worth their time, and less and less on the operational toil that is burnout-inducing and holds everyone back. So I'm excited about the modern data stack renaissance, thanks to the fundamental capabilities of large language models.

Absolutely. Well, thank you very much for taking the time today to join me and share your thoughts and experiences around building with LLMs to improve the capabilities of data engineers. It's definitely an area that we all need to keep track of and invest some time into, so I appreciate the insights that you've been able to share, and I hope you enjoy the rest of your day.

Thank you so much, Tobias.

[Music] Thank you for listening, and don't forget to check out our other shows. Podcast.__init__ covers the Python language, its community, and the innovative ways it is being used, and the AI Engineering Podcast is your guide to the fast-moving world of building AI systems. Visit the site to subscribe to the show, sign up for the mailing list, and read the show notes. And if you've learned something or tried out a project from the show, then tell us about it: email hosts@dataengineeringpodcast.com with your story. And to help other people find the show, please leave a review on Apple Podcasts and tell your friends and coworkers. [Music]


Key Vocabulary (CEFR C1)

fascinated

B1

To evoke an intense interest or attraction in someone.

Example:

"infrastructure and I've always been fascinated by how important is data"

importantly

B2

(sentence adverb) Used to mark a statement as having importance.

Example:

"relate to our field how can we borrow some of those ideas but most importantly is is rapid"

distinction

B2

That which distinguishes; a distinguishing difference or contrast.

Example:

"important distinction is when we see lots of different tools and advancements in tools that are affecting software"

discussion

B2

Conversation or debate concerning a particular topic.

Example:

"discussion because I think that if we make people who are actually working on all these important problems more"

collected

B1

To gather together; amass.

Example:

"of data in the organization it's very important to understand that the raw data that it gets collected from you"

misconception

B2

A mistaken belief, a wrong idea

Example:

"think this is also an important misconception like if you are not really you know doing this for for living you"

maintaining

B2

To keep (something) in operation or in good condition; to keep up.

Example:

"maintaining and evolving their Data Systems I think the most in hindsight"

automating

B1

To replace or enhance human labor with machines.

Example:

"automating data engineering workflows now also with AI prior to starting data"

frustrated

B1

To disappoint or defeat; to vex by depriving of something expected or desired.

Example:

"frustrated with how manual airr prone TVs and toysome my personal workflow was"

consensus

B1

General agreement among the members of a group.

Example:

"around AI everywhere but yet we don't have a really clear idea and consensus"

Want more YouTube dictation practice? Visit our Practice Hub.

Want to translate multiple languages at once? Visit our Multilingual Translator.

Dictation Grammar & Pronunciation Tips

1

Chunking

Notice where the speaker pauses after phrases; these chunks aid comprehension.

2

Linking

Listen for how word sounds link together in connected speech.

3

Intonation

Pay attention to intonation changes that highlight important information.

Video Difficulty Analysis & Stats

Category
science-&-technology
CEFR Level
C1
Duration
3,579 seconds (59 min 39 sec)
Total Words
10,142
Total Sentences
568
Average Sentence Length
18 words


Download Study Materials

Download these resources to practice offline. The transcript helps with reading comprehension, SRT subtitles work with video players, and the vocabulary list is perfect for flashcard apps.

Ready to practice?

Start your dictation practice now with this video and improve your English listening skills.