A Glossary with Use Cases for First-Timers in Data Engineering
Are you a data engineering rookie interested in learning more about modern data infrastructures? If you are, this article is for you!
In this guide, data engineering meets Formula 1. But we’ll keep it simple.
Introduction
I strongly believe that the best way to describe a concept is via examples, even though some of my university professors used to say, “If you need an example to explain it, it means you didn’t get it”.
Anyway, I wasn’t paying enough attention during university classes, so today I’ll walk you through data layers using — guess what — an example.
Business Scenario & Data Architecture
Imagine this: next year, a new team on the grid, Red Thunder Racing, will call us (yes, me and you) to set up their new data infrastructure.
In today’s Formula 1, data is at the core, way more than it was 20 or 30 years back. Racing teams are improving performance with a phenomenal data-driven approach, making improvements millisecond by millisecond.
It’s not only about lap times; Formula 1 is a multi-billion-dollar business. Boosting fan engagement and making the sport more attractive aren’t just for fun; these activities generate revenue.
A robust data infrastructure is a must-have to compete in the F1 business.
We’ll build a data architecture to support our racing team starting from the three canonical layers: Data Lake, Data Warehouse, and Data Mart.
Data Lake
A data lake would serve as a repository for raw and unstructured data generated from various sources within the Formula 1 ecosystem: telemetry data from the cars (e.g. tyre pressure per second, speed, fuel consumption), driver configurations, lap times, weather conditions, social media feeds, ticketing, fans registered to marketing events, merchandise purchases, …
All kinds of data can be stored in our consolidated data lake: unstructured (audio, video, images), semi-structured (JSON, XML) and structured (CSV, Parquet, Avro).
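To make the “raw, as received” idea concrete, here is a minimal Python sketch of how a single telemetry record could land in the lake: one date-partitioned folder per source, appended as JSON lines, untouched. The paths, source name and telemetry fields are all hypothetical, and a real setup would typically target object storage (e.g. S3) rather than a local directory:

```python
import json
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def land_raw_record(lake_root: Path, source: str, record: dict) -> Path:
    """Append one raw record, untouched, to a date-partitioned lake path."""
    now = datetime.now(timezone.utc)
    target_dir = lake_root / source / f"dt={now:%Y-%m-%d}"
    target_dir.mkdir(parents=True, exist_ok=True)
    target_file = target_dir / "records.jsonl"
    with target_file.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return target_file

# Hypothetical telemetry payload from one of our cars
lake = Path(tempfile.mkdtemp())
path = land_raw_record(
    lake, "car_telemetry",
    {"car": 11, "speed_kph": 312.4, "tyre_pressure_psi": 21.8},
)
print(path)
```

The key design choice here is that nothing is validated or transformed at this stage: the lake keeps the original payload so downstream layers can always reprocess it.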
We’ll face our first challenge while integrating and consolidating everything in a single place. We’ll create batch jobs extracting records from marketing tools, and we’ll also deal with real-time streaming telemetry data (and rest assured, that will come with very demanding low-latency requirements).
We’ll have a long list of systems to integrate, each supporting a different protocol or interface: Kafka streaming, SFTP, MQTT, REST APIs and more.
We won’t be alone in this data collection; thankfully, there are data integration tools available in the market that can be adopted to configure and maintain ingestion pipelines in one place (e.g. in alphabetical order: Fivetran, Hevo, Informatica, Segment, Stitch, Talend, …).
Instead of relying on hundreds of Python scripts scheduled via crontab, or on custom processes handling data streaming from Kafka topics, these tools will help us simplify, automate and orchestrate all these processes.
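To give a flavour of what one of those batch jobs does under the hood, here is a hypothetical sketch of incremental extraction: keep a watermark of the last loaded timestamp and only pull records newer than it. The in-memory `EVENTS` list stands in for a real marketing API or SFTP drop:

```python
# Hypothetical source data: in practice this would come from a REST call,
# an SFTP pull, or a connector managed by an integration tool.
EVENTS = [
    {"id": 1, "ts": "2024-03-01T10:00:00", "event": "merch_purchase"},
    {"id": 2, "ts": "2024-03-01T11:30:00", "event": "fan_signup"},
    {"id": 3, "ts": "2024-03-02T09:15:00", "event": "merch_purchase"},
]

def extract_batch(since: str) -> list[dict]:
    """Incremental extraction: only records newer than the watermark.
    ISO 8601 timestamps compare correctly as strings."""
    return [e for e in EVENTS if e["ts"] > since]

# The watermark would normally be persisted between runs
watermark = "2024-03-01T12:00:00"
new_records = extract_batch(watermark)
print(new_records)  # only the event after the watermark
```

Managed integration tools essentially run this loop for you, per source, with retries, scheduling and state handling built in.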
Data Warehouse
After a few weeks spent defining all the data streams we need to integrate, we are now ingesting a remarkable variety of data into our data lake. It’s time to move on to the next layer.
The data warehouse is used to clean, structure, and store processed data from the data lake, providing a structured, high-performance environment for analytics and reporting.
At this stage, it’s no longer about ingesting data; we’ll focus more and more on business use cases. We should consider how the data will be utilised by our colleagues, offering regularly refreshed, structured datasets about:
- Car Performance: telemetry data is cleaned, normalised and integrated to provide a unified view.
- Strategy and Trend Review: past race data are used to identify trends, evaluate driver performance and understand the impact of specific strategies.
- Team KPIs: pit stop times, tyre temperatures before pit stops, budget control on car developments.
We’ll have numerous pipelines dedicated to data transformation and normalisation.
As with data integration, there are plenty of products on the market to simplify and efficiently manage data pipelines. These tools can streamline our data processes, reducing operational costs and increasing development effectiveness (e.g. in alphabetical order: Apache Airflow, Azure Data Factory, dbt, Google Dataform, …).
Data Marts
There is a thin line between Data Warehouses and Data Marts.
Let’s not forget that we are working for Red Thunder Racing, a large company, with thousands of employees involved in diverse areas.
Data must be accessible and tailored to specific business units’ requirements. Data models are built around business needs.
Data marts are specialized subsets of data warehouses that focus on specific business functions.
- Car Performance Mart: the R&D team analyses data related to engine efficiency, aerodynamics and reliability. Engineers will use this data mart to optimize the car’s setup for different race tracks, or run simulations to understand the best car configuration for given weather conditions.
- Fan Engagement Mart: the marketing team analyses social media data, fan surveys and viewer ratings to understand fan preferences. It uses this data to shape tailored marketing strategies, develop merchandise and improve its Fan360 knowledge.
- Bookkeeping Analytics Mart: the finance team needs data as well (lots of numbers, I believe!). Now more than ever, racing teams have to deal with budget restrictions and regulations. It’s important to keep track of budget allocations, revenues and costs in general.
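One common way to carve a mart out of the warehouse is a pre-aggregated view over a warehouse table. This sketch uses an in-memory SQLite database as a stand-in for a real warehouse; the table, columns and KPI names are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# A warehouse fact table: one row per pit stop
conn.execute("CREATE TABLE pit_stops (race TEXT, car INTEGER, duration_s REAL)")
conn.executemany(
    "INSERT INTO pit_stops VALUES (?, ?, ?)",
    [("Monza", 11, 2.4), ("Monza", 11, 2.9), ("Monza", 4, 3.1)],
)

# A "Team KPI mart": a focused, pre-aggregated view for one business unit
conn.execute("""
    CREATE VIEW kpi_pit_stop_mart AS
    SELECT race, car, COUNT(*) AS stops,
           ROUND(AVG(duration_s), 2) AS avg_duration_s
    FROM pit_stops
    GROUP BY race, car
""")

for row in conn.execute("SELECT * FROM kpi_pit_stop_mart ORDER BY car"):
    print(row)
```

The view exposes only what that team needs, which is also a natural place to enforce the access restrictions discussed below: grant each team its marts, not the whole warehouse.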
Moreover, it’s often a requirement to ensure that sensitive data remains accessible only to authorised teams. For instance, the Research and Development team may require exclusive access to telemetry information, and they need that data to be analysable through a specific data model. However, they might not be permitted to access (or interested in) financial reports.
Our layered data architecture will enable Red Thunder Racing to leverage the power of data for car performance optimization, strategic decision-making, enhanced marketing campaigns… and beyond!
That’s it?
Absolutely not! We’ve barely scratched the surface of a data architecture. There are probably hundreds of other integration points we should consider; moreover, we didn’t go beyond merely mentioning data transformation and data modeling.
We didn’t cover the data science domain at all, which probably deserves its own article; the same goes for data governance, data observability, data security, and more.
But hey, as they say, “Rome wasn’t built in a day”. We already have quite a lot on our plate for today, including the first draft of our data architecture (below).
Conclusions
Data Engineering is a magical realm, with a plethora of books dedicated to it.
Throughout the journey, data engineers will engage with countless integration tools; diverse data platforms aiming to cover one or more of the layers mentioned above (e.g. in alphabetical order: AWS Redshift, Azure Synapse, Databricks, Google BigQuery, Snowflake, …); business intelligence tools (e.g. Looker, Power BI, Tableau, ThoughtSpot, …); and data pipeline tools.
Our data engineering journey at Red Thunder Racing has just begun, and we should leave plenty of room for flexibility in our toolkit!
Data layers can often be combined, sometimes within a single platform. Data platforms and tools are raising the bar and closing gaps day by day with new feature releases. Competition in this market is intense.
- Do you always need to have a data lake? It depends.
- Do you always need data available as soon as possible (a.k.a. streaming and real-time processing)? It depends: what data freshness do business users require?
- Do you always need to rely on third-party tools for data pipeline management? It depends!
- <Placeholder for any other question you might have>? It depends!
If you have any questions or suggestions, please feel free to reach out to me on LinkedIn. I promise I’ll answer with something different from: It depends!
Opinions expressed in this article are solely my own and do not reflect the views of my employer. Unless otherwise noted, all images are by the author.
The story, all names and incidents portrayed in this article are fictitious. No identification with actual places, buildings, and products is intended or should be inferred.
Data Engineering: A Formula 1-inspired Guide for Beginners was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.