Development of an AI Data Engineer

Investigating the scalability of data engineering models with machine learning and large language models

Project status

Ongoing

Introduction

It is difficult and expensive to scale the development of data models to tens of thousands, or even millions, of models: there are simply not enough data engineers to do the work manually. Aspects of the data engineering pipeline must therefore be automated, from data processing and machine learning model development to quality assurance and end-user communication. We aim to provide machine learning frameworks that automate data pipelines and suit a wide range of scenarios.

Chart showing the pipeline of a data engineer: from raw data collection, processing, and structuring; to creating and fine-tuning the AI model; and finally to gathering insights and feedback from the end user, which can in turn be used to further fine-tune the model.
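The feedback loop in the chart can be sketched in a few lines. This is an illustrative toy, not part of any released framework: the `Pipeline` class, its stage names, and the frequency-count "model" are all assumptions chosen to make the loop concrete.

```python
from dataclasses import dataclass, field

@dataclass
class Pipeline:
    # Hypothetical three-stage loop: process -> fit -> refine.
    history: list = field(default_factory=list)

    def process(self, raw):
        # Raw data collection, processing, and structuring.
        cleaned = [r.strip().lower() for r in raw if r and r.strip()]
        self.history.append(("process", len(cleaned)))
        return cleaned

    def fit(self, records):
        # Stand-in for model creation/fine-tuning: learn token frequencies.
        model = {}
        for rec in records:
            for tok in rec.split():
                model[tok] = model.get(tok, 0) + 1
        self.history.append(("fit", len(model)))
        return model

    def refine(self, model, feedback):
        # End-user feedback closes the loop: boost terms users flag as useful.
        for tok in feedback:
            model[tok] = model.get(tok, 0) + 1
        self.history.append(("refine", len(feedback)))
        return model

pipe = Pipeline()
records = pipe.process(["  Widget A ", "", "widget B"])
model = pipe.fit(records)
model = pipe.refine(model, ["widget"])
```

The point of the sketch is the shape of the loop, not the model itself: end-user feedback re-enters the pipeline as just another input to fine-tuning.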

Explaining the science

Developing data models for individual products is becoming more feasible for companies. However, scaling this up to thousands or millions of products presents a significant challenge. Imagine needing to create, quality assure, and update models for 10,000 or even 60,000,000 products. This would require an enormous number of data engineers, which is neither financially viable nor practical given the current shortage of skilled professionals.


Large Language Models (LLMs) offer a solution by processing and generating vast amounts of data simultaneously, providing the scalability needed for developing data models at scale. Nevertheless, LLMs need to be fine-tuned to meet specific and flexible requirements for generating data models, storage systems, reasoning, and explanations, as well as accurately evaluating the quality of the generated outputs.
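One concrete piece of the quality-evaluation problem is checking machine-generated outputs before they are used. The sketch below is a hypothetical example: `generate_schema` is a stub standing in for a fine-tuned LLM call, and the required keys and allowed column types are illustrative assumptions.

```python
import json

REQUIRED_KEYS = {"table", "columns"}
ALLOWED_TYPES = {"int", "float", "text", "timestamp"}

def generate_schema(prompt: str) -> str:
    # Stub for a fine-tuned LLM; returns a canned JSON response so the
    # validation step below can be demonstrated end to end.
    return json.dumps({
        "table": "products",
        "columns": [{"name": "id", "type": "int"},
                    {"name": "price", "type": "float"}],
    })

def validate_schema(raw: str) -> dict:
    # Automated quality evaluation: parse the generated output and
    # enforce structural constraints before the schema is accepted.
    schema = json.loads(raw)
    if not REQUIRED_KEYS <= schema.keys():
        raise ValueError("missing required keys")
    for col in schema["columns"]:
        if col["type"] not in ALLOWED_TYPES:
            raise ValueError(f"unsupported column type: {col['type']}")
    return schema

schema = validate_schema(generate_schema("Design a table for product data"))
```

Even this minimal gate illustrates why evaluation must be automated alongside generation: at millions of models, no human can review every generated schema.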


Moreover, it is crucial for end-users to understand and trust the machine-generated outputs. Since people have different communication preferences, models should be developed that infer each user's preferences, allowing LLMs to be personalized to meet these varied needs and ensuring effective, clear communication. Personalizing LLMs for better user interaction becomes increasingly important as AI usage grows.
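A minimal sketch of preference-driven personalization, under strong simplifying assumptions: the message-length heuristic, the threshold, and the two style names are all illustrative, not a proposed method.

```python
def infer_preference(messages):
    # Illustrative assumption: users who write short messages
    # tend to prefer concise answers.
    avg_len = sum(len(m.split()) for m in messages) / len(messages)
    return "concise" if avg_len < 8 else "detailed"

def render(summary: str, detail: str, preference: str) -> str:
    # Adapt the same underlying answer to the inferred style.
    return summary if preference == "concise" else f"{summary} {detail}"

pref = infer_preference(["show revenue", "top products?"])
reply = render("Revenue rose 4%.", "Driven mainly by repeat customers.", pref)
```

In practice the inference step would itself be a learned model over richer signals than message length; the point is only that the same output can be rendered differently per user.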
 

Project aims

The Splunk project aims to pioneer fundamental research on automating the data engineering process. Our focus areas include:

  • Developing scalable and adaptable systems for data storage design
  • Enhancing data pre-processing for improved usability
  • Automating data analysis
  • Making data processing more efficient
  • Enabling seamless end-user communication

Our research initiatives are interdisciplinary, ensuring applicability across various industries. Our primary objective is to publish our findings and translate this research into practical solutions for industry challenges using real-world data. We strongly emphasise collaborating with industry partners, aiming to conduct foundational research that drives tangible, impactful advancements in the industry. Through this collaborative approach, we strive to bridge the gap between academic research and practical implementation, ultimately fostering innovation and efficiency in data engineering processes.

Organisers

Researchers and collaborators

Contact info

 [email protected]