Dataflow Architecture—Derived Data Views and Eventual Consistency
A (not-so) brief history of a health & fitness data pipeline: part ii
Welcome to part ii of our coming-of-age trilogy on a public health and fitness data pipeline.
In this chapter, we reimagine the backend system as a distributed state machine and explore the art of achieving consistency — with a functional flavour.
A quick recap of part i
The evolution of a data pipeline
In part I, we watched SmartGym grow into an integrated health and fitness platform (version 2.1) that streams, processes, and saves data from a range of gym equipment sensors and medical devices. This data provided insights that empowered users to take more active ownership of their personal health and fitness.
As our system evolved from merely saving and retrieving data to responding to real-world events, our architecture had to reflect this paradigm shift — from a request-driven to an event-driven one.
An event-driven ingestion pipeline
There were two pipelines that kept the lights on:
- Streaming pipeline: data is continuously streamed from sensors, processed, and stored in a buffer.
- Saving pipeline: when the user ends a session, the buffered data is processed and saved in the database, as a record that represents the user session (a single workout).
The event-driven architecture was a double-edged sword, introducing both order and chaos. Ultimately, we watched it evolve into a well-oiled machine capable of taming its complexities.
Part ii: the evolution goes on
In this instalment, we’ll explore the next three versions of the system, each enhancing a user’s workout experience in a different way:
- v3.0: fitness as a personalised experience
- v4.0: fitness as a collective experience
- v5.0: fitness as a personalised-collective experience
But first, meet the main character of this article!
The Magic Black Box
Be it distributed, event-driven, or not, we can think of the SmartGym/SENSEI backend system as a magic black box.
Input: We feed this black box new information — e.g. user info, sensor data
State Transition: This new information interacts with its existing state based on a certain logic, resulting in a new internal state.
Output: At any time, we can query its internal state to retrieve relevant information — e.g. user workout information
The Magic Black Box is deterministic: if you take two identical black boxes with the same state and logic, feed them with the same inputs in the same order, both will end up with identical internal states.
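To make this concrete, here is a minimal sketch in Python (the event shapes and state fields are hypothetical, not SmartGym’s actual schema) of the black box as a pure state-transition function: the next state depends only on the current state and the incoming event.

```python
from dataclasses import dataclass, field, replace

@dataclass(frozen=True)
class State:
    users: dict = field(default_factory=dict)    # user_id -> user info
    buffers: dict = field(default_factory=dict)  # user_id -> buffered sensor readings

def transition(state: State, event: dict) -> State:
    """Pure state transition: same state + same event always gives the same next state."""
    if event["type"] == "user_registered":
        return replace(state, users={**state.users, event["user_id"]: event["payload"]})
    if event["type"] == "sensor_reading":
        buffer = state.buffers.get(event["user_id"], ()) + (event["payload"],)
        return replace(state, buffers={**state.buffers, event["user_id"]: buffer})
    return state  # unknown events leave the state untouched

def run(events) -> State:
    """Two boxes fed the same events in the same order end up in identical states."""
    state = State()
    for event in events:
        state = transition(state, event)
    return state
```

Everything that follows in this article is, in one way or another, an effort to preserve this property once the reducer is split across queues, workers, and databases.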
Inside the black box
If we tear it apart, we’ll find there’s nothing truly magical — just a dataflow architecture comprising a bunch of data views and ingestion pipelines.
There are two main types of data views:
1. Source of truth (solid line) — e.g. users, sensor stream
- New data is first written here. These are original, authoritative data — typically represented exactly once in a normalised way.
- The system’s state is a function of these sources of truth and the state transition logic — a reflection of the sequence of events that have transformed it over time.
2. Derived data (dotted line) — e.g. workouts
- This data is processed from existing data in other views, usually involving denormalization, aggregation, or transformation. It is precomputed for efficient future queries.
- Derived data is redundant, effectively “duplicating” existing information; if it is lost, it can always be re-derived from the original source.
The process of deriving data views from sources of truth is known as materialisation, a deterministic task handled by workers in the ingestion pipeline.
All internal state of the magic black box is encapsulated within these data views, while the ingestion pipeline — the machinery behind the materialisation process — remains stateless.
Note that whenever the source of truth changes, the derived data has to be re-derived. Else, the state transition is incomplete and the magic black box is left in an inconsistent state.
In the previous example, we feed the magic black box:
- user details through our CRUD API
- sensor telemetry through an event stream (streaming pipeline)
These inputs serve as sources of truth, reflecting real-world entities or events.
From these two pieces of information, we derive a single workout session record in the saving pipeline.
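As a rough illustration (the field names and record shape here are invented for the example, not SmartGym’s actual schema), the materialisation step in the saving pipeline can be thought of as a deterministic function from the two sources of truth to one derived workout record:

```python
from statistics import mean

def derive_workout(user: dict, buffered_readings: list[dict]) -> dict:
    """Deterministically derive a workout record (derived view) from user info
    and the buffered sensor stream (sources of truth)."""
    heart_rates = [r["heart_rate"] for r in buffered_readings if "heart_rate" in r]
    return {
        "user_id": user["user_id"],
        "machine_id": buffered_readings[0]["machine_id"],
        "started_at": buffered_readings[0]["timestamp"],
        "ended_at": buffered_readings[-1]["timestamp"],
        "total_reps": sum(r.get("reps", 0) for r in buffered_readings),
        "avg_heart_rate": mean(heart_rates) if heart_rates else None,
    }
```

Because the function is deterministic, losing the workout record is not fatal: re-running it over the same inputs reproduces the record exactly.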
Now, you might be offended by Mr. Magic Black Box mansplaining what appears to be common sense to you. But bear with him, because he will prove to be a useful abstraction throughout this article.
v3.0 — Metrics, dashboards, insights: fitness as a personalised experience
With the streaming and saving pipelines working tirelessly to ingest data into the system, our database now houses a wealth of user and workout records. We are empowered to provide meaningful macro insights by analysing trends, groups, averages, and totals for our users and stakeholders.
User insights and fitness metrics
This is where SmartGym’s vision of becoming “every citizen’s #1 fitness companion” begins to take shape. Beyond giving real-time feedback during my workout, recalling my historical performance, and telling me what a great job I’ve done — a diligent fitness companion provides tangible metrics to measure my improvement in performance over time.
By leveraging recent workouts and user information — such as body metrics captured by the SmartGym weighing scale — we can estimate various performance metrics, including:
- 1RM (1 Rep Max) for weight-based workouts (e.g. leg press)
- Max reps per minute for bodyweight workouts (e.g. push-ups)
- VO2 max or MET (Metabolic Equivalent) for cardio workouts (e.g. treadmill)
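As a concrete example of the first metric, a common estimator is the Epley formula, 1RM ≈ w × (1 + reps / 30). Here is a minimal sketch (the workout-set shape is hypothetical) of deriving it from recent weight-based workouts:

```python
def estimate_one_rep_max(weight_kg: float, reps: int) -> float:
    """Epley estimate of the heaviest single lift implied by lifting
    weight_kg for the given number of repetitions."""
    return weight_kg if reps <= 1 else weight_kg * (1 + reps / 30)

def best_one_rep_max(workout_sets: list[dict]) -> float:
    """Best 1RM estimate across recent sets,
    e.g. [{"weight_kg": 80, "reps": 8}, ...] (illustrative shape)."""
    return max(estimate_one_rep_max(s["weight_kg"], s["reps"]) for s in workout_sets)
```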
Deriving both product and fitness metrics typically involves denormalization and aggregation across records, which can be resource-intensive in terms of memory usage, database reads, and network throughput.
Since it doesn’t make sense to run these operations every time a user loads the dashboard, we need a precomputed intermediate data representation, ready for query and visualisation — another derived data view for user fitness metrics!
The periodic processing pipeline
Let’s return to our magic black box.
As new users and workout records continuously flow in, user fitness metrics — a derived data view — require continuous updates in response to changes in the upstream data views they rely on.
To address this, we decided to recompute user fitness metrics periodically, accepting that these metrics may be a few hours stale.
Our ingestion pipeline now includes a cronjob service that triggers batch processing jobs in the periodic pipeline on a preconfigured schedule, ensuring timely updates without overloading the system.
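A sketch of one such batch job (the data-access functions are placeholders for whatever repository layer sits in front of the database):

```python
from datetime import datetime, timezone

def recompute_user_fitness_metrics(fetch_users, fetch_recent_workouts, save_metrics):
    """One batch run, triggered by the cronjob service every few hours:
    rebuild the derived 'user fitness metrics' view from the views it depends on."""
    computed_at = datetime.now(timezone.utc).isoformat()
    for user in fetch_users():
        workouts = fetch_recent_workouts(user["user_id"])
        save_metrics({
            "user_id": user["user_id"],
            "workout_count": len(workouts),
            "computed_at": computed_at,
            # ...estimated 1RM, max reps/min, VO2 max, etc. would be derived here...
        })
```

The computed_at timestamp makes the staleness explicit, so the dashboard can show users how fresh their metrics are.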
v4.0 — Gamification: fitness as a collective experience
Moving from individuals to community
Staying fit is difficult work. But with a healthy dose of competition and collective suffering, it can be an experience that feels larger than life. Imagine if every rep, every set, and every workout session contributed to something greater.
Introducing the exercise leaderboard and fitness challenges.
Exercise leaderboard
The leaderboard showcases users who have logged the most effort over the month — measured by distance run, weights lifted, and more.
Boy, did this feature stroke some egos! Some of the gym regulars renamed their randomly generated usernames to titles like “Beefy” or “Armstrong”. For many others, surveying the leaderboard became the first order of business upon entering the gym, and a post-workout ritual before strutting out with newfound swagger.
Similar to how we compute product and user fitness metrics, the leaderboard data is updated periodically in batches, derived from user profiles and their historical workout data.
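A simplified sketch of that batch derivation (the field names are illustrative): group the month’s workouts by user and rank by total effort.

```python
from collections import defaultdict

def build_leaderboard(monthly_workouts: list[dict], users: dict, top_n: int = 20) -> list[dict]:
    """Derive the leaderboard view from user profiles and the month's workout records."""
    totals = defaultdict(float)
    for workout in monthly_workouts:
        # "effort" stands in for distance run, weight lifted, etc., normalised upstream
        totals[workout["user_id"]] += workout.get("effort", 0.0)

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)[:top_n]
    return [
        {"rank": i + 1, "display_name": users[user_id]["display_name"], "effort": effort}
        for i, (user_id, effort) in enumerate(ranked)
    ]
```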
Fitness challenge system
In collaboration with the gym management team, we launched a fitness challenge to coincide with occasions such as Singapore’s National Day.
Each day, users received a challenge requiring them to complete a specific number of reps on a weighted machine or spend a certain number of minutes on a cardio machine, earning rewards for their hard work.
This kickstarted a series of diverse fitness challenges, each featuring unique gameplay variations in terms of duration, workout types, intensity, streaks, and more.
Configurable rules: rule engines and syntax trees
At its core, a fitness challenge is a unique set of workout requirements, specified by an administrator. By comparing a user’s workout history to these requirements, we can assess their progress and completion status for the challenge.
Instead of cluttering our codebase with a new bunch of if-else statements for each variation of the fitness challenge, we externalised the business logic by representing these logical rules in a syntax tree. During runtime, the rule engine parses this tree and evaluates it against users’ actual workout histories to track their challenge progress.
When the program admin modifies the parameters of a fitness challenge, they are directly updating the underlying rules syntax tree. This same data structure is shared between the backend rules engine and the frontend rules configuration page, ensuring consistency and ease of management.
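A toy version of such a rules engine (the node types and workout fields are illustrative, not the production schema): the challenge is stored as a small syntax tree of plain data, and the engine evaluates it recursively against a user’s workout history.

```python
# e.g. "at least 50 push-up reps AND at least 20 treadmill minutes",
# stored as nested dictionaries shared by the backend engine and the admin UI
CHALLENGE_RULES = {
    "op": "and",
    "children": [
        {"op": "min_total", "field": "reps", "workout_type": "push_up", "threshold": 50},
        {"op": "min_total", "field": "minutes", "workout_type": "treadmill", "threshold": 20},
    ],
}

def evaluate(rule: dict, workouts: list[dict]) -> bool:
    """Recursively evaluate a rules syntax tree against a user's workout history."""
    if rule["op"] == "and":
        return all(evaluate(child, workouts) for child in rule["children"])
    if rule["op"] == "or":
        return any(evaluate(child, workouts) for child in rule["children"])
    if rule["op"] == "min_total":
        total = sum(
            w.get(rule["field"], 0)
            for w in workouts
            if w.get("type") == rule["workout_type"]
        )
        return total >= rule["threshold"]
    raise ValueError(f"unknown rule op: {rule['op']}")
```

Because the tree is plain data, an admin tweaking a threshold is just editing this structure; no new if-else branches are deployed.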
The on-demand processing pipeline
Let’s revisit our magic black box.
User fitness challenge results, derived from the workouts and user fitness challenge data through the rules engine, need to be recalculated whenever changes occur in the upstream data views they depend on — such as each time a user completes a workout set.
Among our enthusiastic users, these fitness challenges are a matter of honour and glory. If they don’t see updated challenge results immediately after finishing a set, they get confused and frustrated. Therefore, we can’t afford to reprocess user fitness challenge results in periodic batches; every change in the workouts data view must be propagated instantly.
To achieve that, we extend our ingestion pipeline with a Change Data Capture mechanism, introducing a service that continuously listens for changes in relevant data views, using built-in database triggers or change streams. This on-demand pipeline triggers a cascade of updates to the downstream derived data views.
In this case, an on-demand worker implements the logic of the rules engine to evaluate user fitness challenge results on the fly.
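A sketch of that on-demand stage, assuming a MongoDB-style change stream (the article doesn’t prescribe a specific database, so the connection string, collection names, and the watch() call are placeholders):

```python
from pymongo import MongoClient

def run_on_demand_worker(evaluate_user_challenges):
    """Listen for changes in the 'workouts' view and immediately re-derive
    the affected user's fitness challenge results downstream."""
    client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
    workouts = client["smartgym"]["workouts"]

    with workouts.watch(full_document="updateLookup") as change_stream:
        for change in change_stream:
            doc = change.get("fullDocument")
            if doc is None:
                continue
            # re-evaluate only the user whose workouts changed (task isolation)
            evaluate_user_challenges(doc["user_id"])
```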
A recap of the different stages of our ingestion pipeline:
- Stream processing: Responsible for ingesting, processing, and storing a stream of live sensor data into a buffer
- Save processing: Responsible for consolidating data from the stream buffer, processing, and saving it into the database as a record that represents a single workout or a user session.
- Periodic processing: Responsible for precomputing derived data views periodically
- On-demand processing: Responsible for propagating updates from upstream data views to derived data views immediately
v5.0 — Recommendation: fitness as a personalised-collective experience
What if we could give the fitness challenges a personal touch?
NSFIT x SmartGym
In late 2021, a team from the Singapore Army described their predicament: each year, servicemen must meet specific fitness benchmarks. If they fall short, they are enrolled in a structured training program, known as NSFIT. However, these sessions were limited to specific times, required sign-ups, and needed staff to facilitate and monitor progress. With the ongoing pandemic and social distancing measures then, gathering servicemen for group sessions wasn’t feasible.
Using the SmartGym fitness challenge system, servicemen could carry out their training on their own schedule — no need for staff to hover over every session. All that’s required is for the staff to verify that the training was completed and the standards were met.
But here’s the twist: servicemen come in all shapes, sizes, and fitness levels. A one-size-fits-all fitness challenge just wouldn’t cut it. They need something that meets them where they are, in order to push their fitness to new heights.
So, why not add a recommendation step before curating a user fitness challenge? By leveraging the fitness metrics we already compute (like in v3.0), we could tailor the intensity of their subsequent training sessions.
Our personalized fitness challenge now follows three key steps:
Step 1 — Profiling
Using historical workout data, we profile each user’s fitness level.
To further optimise this process, we could expand our profiling methods beyond simple heuristics, incorporating machine learning methods to extract more sophisticated features — resulting in new derived data views.
Step 2 — Recommendation
Starting with a generic challenge template (including general information like location, total workouts, participant groups, start/end dates, etc.), the recommendation engine fleshes out this skeletal template into a rules syntax tree tailored to each user’s fitness profile.
For even greater personalization, a domain expert could manually fine-tune the challenge requirements, offering a professional perspective beyond algorithmic deduction.
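A hedged sketch of the recommendation step (the profile fields and the linear scaling rule are invented for illustration): take the generic template and scale each requirement to the user’s estimated fitness level, emitting a personalised rules syntax tree for the engine described in v4.0.

```python
def personalise_challenge(template: dict, fitness_profile: dict) -> dict:
    """Flesh out a generic challenge template into a per-user rules syntax tree
    by scaling each requirement to the user's fitness level."""
    # e.g. fitness_profile = {"level": 0.8}, where 1.0 is the population baseline
    scale = fitness_profile.get("level", 1.0)
    return {
        "op": "and",
        "children": [
            {
                "op": "min_total",
                "field": req["field"],
                "workout_type": req["workout_type"],
                "threshold": round(req["baseline_threshold"] * scale),
            }
            for req in template["requirements"]
        ],
    }
```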
Step 3 — Evaluation
Once personalized parameters are embedded into the rules syntax tree, evaluations can be triggered on-demand after the workout is saved, or even performed in real-time against the sensor stream and displayed on the frontend console.
The dataflow paradigm
The Magic Black Box revisited
Earlier in the article, we mentioned that the magic black box is composed of data views, where the state lives, and a stateless ingestion pipeline.
To get reliable outputs from the magic black box, these data views must be consistent, achieved through a deterministic series of materialisations in the ingestion pipeline.
Buckle up and brace yourself, because we will be diving deep into consistency and determinism — the undercurrents of dataflow.
The challenge of consistency: a necessary evil
At first glance, it might seem simpler to create one massive data view containing every bit of detail from the raw data. With only one data source, consistency is implied. However, there are several reasons why we need derived data views despite the complexity of maintaining consistency:
Data polymorphism
Data can be represented in a myriad of forms — in different combinations and at multiple levels of granularity — each serving their own unique purpose.
For example, it turns out that a user is not interested in knowing whether the 3rd rep of his 2nd set of chest press in September 2020 was fully extended or not — after that real-time window of immediacy, low-level raw details become increasingly irrelevant, while higher-level derived insights become much more valuable.
To avoid having to assume how data will be used and represented in the future, keep it raw, a.k.a. the Sushi Principle (“raw data is better”).
Read-write performance
To that end, we decouple the write model from the range of potential read models, and bridge the gap with a series of materialisation stages. This separation is commonly known as Command Query Responsibility Segregation (CQRS).
Having a materialisation path gives a piece of data the space and time to morph and discover its different personalities, enabling:
- Faster & simpler writes: a shortened write path by deferring data processing and complex data models to a later stage
- Efficient & flexible reads: a shortened read path by computing different derived views in advance
Basis of consistency
By designating the write model as the authoritative source of truth to reason from, it is easier to achieve consistency — without the complexities of having multiple authoritative systems trying to reach a consensus.
There are instances when the raw data grows too rapidly. At one message per second, each treadmill sensor emits 86,400 messages a day; multiply that across the machines of multiple gyms, and the sensor stream gains millions of messages in just one day.
When the sensor stream grows prohibitively large, we can treat workouts records as a “lossy compression” of the sensor stream, purge the processed sensor stream, and promote the derived workouts records as a new authoritative source of truth.
Since the one-way chain of materialisation still starts from a single source of truth, we retain our basis of consistency.
Anti-fragility: fault recovery and application evolution
Derived views offer resilience. If a bug corrupts a derived view, we can roll back to a previous version of the materialisation code and rerun the process, regenerating accurate data from the source of truth.
Derived views also enable gradual evolution of the application. You can introduce new data views without deleting or restructuring the old ones, keeping both as independent views of the same data, with the option of falling back if something goes wrong.
In part one of the series, we saw how the publisher-subscriber (pub/sub) pattern (via a fanout exchange) of the ingestion pipeline makes it easy to extend system functionality in a plug-and-play manner, without disrupting existing pipelines or requiring upstream modifications.
A key to agile development and building anti-fragile systems — those that improve with each bug fix or new feature — is the ease of recovery and evolution. The decoupling enabled by the pub/sub pattern and derived views makes this possible.
The art of consistency and control
Next, let’s dissect the nature of consistency and control flow of the dataflow architecture.
Broadly speaking, distributed systems can be categorised by two consistency levels — strong or eventual; and two types of control flow — orchestration (centralised) or choreography (decentralised).
Strong consistency guarantees that every read reflects the most recent write. It ensures that all data views are updated immediately and accurately after a change. Strong consistency is typically associated with orchestration, since it often relies on a central coordinator to manage atomic updates across multiple data views — either updating all at once, or none at all. Such “over-engineering” may be required for systems where minor discrepancies can be disastrous, e.g. financial transactions, but not in our case.
Eventual consistency allows for temporary discrepancies between data views, but given enough time, all views will converge to the same state. This approach typically pairs with choreography, where each worker reacts to events independently and asynchronously, without needing a central coordinator.
The asynchronous and loosely-coupled design of the dataflow architecture is characterised by eventual consistency of data views, achieved through a choreography of materialisation logic.
And there are perks to that.
Perks: at the system level
Resilience to partial failures: The asynchrony of choreography is more robust against component failures or performance bottlenecks, as disruptions are contained locally. In contrast, orchestration can propagate failures across the system, amplifying the issue through tight coupling.
Simplified write path: Choreography also keeps the write path lean, shrinking the code surface area where bugs could corrupt the source of truth. Conversely, orchestration makes the write path more complex, and increasingly harder to maintain as the number of different data representations grows.
Perks: at the human level
The decentralised control logic of choreography allows different materialisation stages to be developed, specialised, and maintained independently and concurrently.
Determinism: the event-driven work ethic (revisited)
The spreadsheet ideal
A reliable dataflow system is akin to a spreadsheet: when one cell changes, all related cells update instantly — no manual effort required.
In an ideal dataflow system, we want the same effect: when an upstream data view changes, all dependent views update seamlessly. Like in a spreadsheet, we shouldn’t have to worry about how it works; it just should.
But ensuring this level of reliability in distributed systems is far from simple. Network partitions, service outages, and machine failures are the norm rather than the exception, and the concurrency in the ingestion pipeline only adds complexity.
Since message queues in the ingestion pipeline provide reliability guarantees, deterministic retries can make transient faults seem like they never happened. To achieve that, our ingestion workers need to adopt the event-driven work ethic:
Pure functions have no free will
In computer science, pure functions exhibit determinism, meaning their behaviour is entirely predictable and repeatable.
They are ephemeral — here for a moment and gone the next, retaining no state beyond their lifespan. Naked they come, and naked they shall go. And from the immutable message inscribed into their birth, their legacy is determined. They always return the same output for the same input — everything unfolds exactly as predestined.
And that is exactly what we want our ingestion workers to be.
Immutable inputs (statelessness)
Each worker receives an immutable message that encapsulates all the information it needs, removing any dependency on external, changeable data. Essentially, we pass data to the workers by value rather than by reference, so that processing a message tomorrow yields the same result as processing it today.
Task isolation
To avoid concurrency issues, workers should not share mutable state.
Transitional states within the workers should be isolated, like local variables in pure functions — without reliance on shared caches for intermediate computation.
It’s also crucial to scope tasks independently, ensuring that workers don’t share input or output spaces, allowing parallel execution without race conditions. For example, the user fitness profiling task is scoped to a particular user_id, since both its inputs (workouts) and its outputs (user fitness metrics) are tied to a single user.
Deterministic execution
Non-determinism can sneak in easily: reading the system clock, depending on external data sources, or using probabilistic/statistical algorithms that rely on random numbers can all lead to unpredictable results. To prevent this, we embed all “moving parts” (e.g. random seeds or timestamps) directly in the immutable message.
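As a sketch of that discipline (the message fields are hypothetical), the producer stamps the message with every moving part, and the worker touches nothing outside of it:

```python
import random

def handle_message(message: dict) -> dict:
    """Deterministic worker: every 'moving part' comes from the immutable message."""
    rng = random.Random(message["random_seed"])    # seeded, instead of global randomness
    processed_at = message["enqueued_at"]          # carried in the message, not time.time()
    workout_ids = message["workout_ids"]
    sample = rng.sample(workout_ids, k=min(3, len(workout_ids)))
    return {"processed_at": processed_at, "sampled_workouts": sample}
```

Retrying this message a week later, on a different machine, produces exactly the same output.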
Deterministic ordering
Load balancing with message queues (multiple workers per queue) can result in out-of-order message processing, when a message is retried after the next one has already been processed. For example, out-of-order evaluation can make a user’s fitness challenge progress jump from 50% completion to 70% and back to 60%, when it should increase monotonically. For operations that require sequential execution, like inserting a record and then notifying a third-party service, out-of-order processing can break such causal dependencies.
At the application level, these sequential operations should either run synchronously on a single worker or be split into separate sequential stages of materialisation.
At the ingestion pipeline level, we could assign only one worker per queue to ensure serialised processing that “blocks” until retry is successful. To maintain load balancing, you can use multiple queues with a consistent hash exchange that routes messages based on the hash of the routing key. This achieves a similar effect to Kafka’s hashed partition key approach.
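For instance, with RabbitMQ (implied by part I’s fanout exchange, though the exact broker setup here is an assumption on my part), the consistent-hash exchange plugin routes every message with the same routing key, say the user_id, to the same queue, so each user’s events are processed in order by a single worker:

```python
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
channel = connection.channel()

# requires the rabbitmq_consistent_hash_exchange plugin to be enabled
channel.exchange_declare(exchange="workouts.hash", exchange_type="x-consistent-hash")

for i in range(4):
    queue = f"workouts.q{i}"
    channel.queue_declare(queue=queue)
    # for this exchange type, the binding key is a weight, not a routing pattern
    channel.queue_bind(exchange="workouts.hash", queue=queue, routing_key="1")

# messages for the same user always hash to the same queue (and thus one worker)
channel.basic_publish(
    exchange="workouts.hash",
    routing_key="user-42",  # hashed to pick the queue
    body=b'{"user_id": "user-42", "event": "workout_saved"}',
)
```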
Idempotent outputs
Idempotence is the property that executing a piece of code multiple times yields the same result as executing it once.
For example, a plain database “insert” operation is not idempotent, while an “insert if not exists” operation is.
This ensures that you get the same outcome as if the worker only executed once, regardless of how many retries it actually took.
Caveat: Note that unlike pure functions, the worker does not “return” an object in the programming sense. Instead, they overwrite a portion of the database. While this may look like a side-effect, you can think of this overwrite as similar to the immutable output of a pure function: once the worker commits the result, it reflects a final, unchangeable state.
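Concretely, a minimal sketch of an idempotent write (using a MongoDB-style upsert as an assumed example; any database with an equivalent “insert or replace” works):

```python
def save_challenge_result(collection, user_id: str, challenge_id: str, result: dict) -> None:
    """Idempotent write: retrying this any number of times leaves the same record behind."""
    collection.replace_one(
        {"user_id": user_id, "challenge_id": challenge_id},  # natural key of the derived record
        {"user_id": user_id, "challenge_id": challenge_id, **result},
        upsert=True,
    )
```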
Dataflow-ception
Dataflow in client-side applications
Traditionally, we think of web/mobile apps as stateless clients talking to a central database. However, modern “single-page” frameworks have changed the game, offering “stateful” client-side interaction and persistent local storage.
This extends our dataflow architecture beyond the confines of a backend system into a multitude of client devices. Think of the on-device state (the “model” in model-view-controller) as a derived view of server state — the screen displays a materialised view of the local on-device state, which in turn mirrors the central backend’s state.
Push-based protocols like server-sent events and WebSockets take this analogy further, enabling servers to actively push updates to the client without relying on polling — delivering eventual consistency from end to end.
In fact, this real-time synchronization is exactly how we evaluated personalized fitness challenges in the frontend console — as a derived data view residing in a client device.
Dataflow in databases
Even down the stack, we see a semblance of dataflow in databases. Database triggers, stored procedures, and materialised view maintenance routines are not very different from the on-demand and periodic processing pipelines; B-tree indexes and materialised views of a relational database are essentially derived data views — talk about dataflows within dataflows!
Dataflows, dataflows everywhere
“The goal of data integration is to make sure that data ends up in the right form, in all the right places.”
— Designing Data-Intensive Applications (Martin Kleppmann)
As data systems scale, we should progress beyond seeing them as passive databases manipulated by applications like global variables.
Instead, it is useful to reimagine data systems in an organisation as an interplay of data views, one derived from another, with state changes rippling out from a central source of truth, propagated through functional application code. Dataflows, built upon dataflows.
It’s magic black boxes — all the way down.
Wrapping up
Congratulations on making it this far!
In part I, we evolved from a trivial request-response system into an event-driven system that streams, processes, and saves data from a range of gym equipment sensors and medical devices.
In this second instalment, we expanded upon those saved records and processed them periodically and on-demand. This enabled new features that enhance a user’s workout experience into a more collective yet personalised one. As our ingestion pipeline evolved, so did our dataflow architecture, scaling to meet new demands.
The story of our evolution doesn’t end here.
In the next and final part, we explore adding and removing functionality in a plug-and-play manner, paving the way for an ecosystem to emerge from our platform.
Stay tuned…
All images and gifs featured in this article are original works created and captured by the author.
Shoutout to the data engineering bible, a.k.a. Designing Data-Intensive Applications by Martin Kleppmann, for giving me the vocabulary to think about these distributed systems with clarity.
Find out more about the feature development from my teammates:
User insights dashboard
Fitness Lover to Developer: Building a SmartGym Vision
Product metrics dashboard
Leaderboard and fitness challenge analytics dashboard
- My Internship in the SmartGym team
- Don’t throw darts in the dark
- My Most Fulfilling Moment at SmartGym
Personalised workout for physiotherapy
Designing an end to end experience for Physiotherapist and Patients