Crypto analyst Raajeev Anand emphasized the rising optimism surrounding XRP, predicting a potential 1,500% price surge for the crypto asset.
Originally appeared here:
Pundit Outlines Factors That Could Push XRP Price to $10, But There’s a Catch
The U.S. dollar’s dominance in global trade and finance is unparalleled. Around 88% of all foreign exchange transactions involve the U.S. dollar, solidifying its position as the backbone of international markets. For traders in Asia—especially in India and Southeast Asia—understanding the dollar’s movements is not just advisable; it’s essential. Changes in the dollar’s value influence […]
Originally appeared here:
Mastering Dollar Volatility: Key Strategies for Traders in Asia and India from Octa Broker
As Sui (SUI) continues to capture attention with its market performance, savvy adherents are shifting their focus towards GoodEgg (GEGG), a new AI-powered dating cryptocurrency poised for explosive growth. With over 93% of its presale Stage 2 tokens already sold and priced at $0.00021, GoodEgg (GEGG) is drawing attention for its innovative approach, and market […]
Originally appeared here:
Sui’s Bulls Push New A.I Social Dating Coin GoodEgg Currently Priced At $0.00021 Due to Skyrocket By 210%
Welcome to part II of our coming-of-age trilogy on a public health and fitness data pipeline.
In this chapter, we reimagine the backend system as a distributed state machine and explore the art of achieving consistency — with a functional flavour.
In part I, we watched SmartGym grow into version 2.1, an integrated health and fitness platform that streams, processes, and saves data from a range of gym equipment sensors and medical devices. These data provided insights that empowered users to take more active ownership of their personal health and fitness.
As our system evolved from merely saving and retrieving data to responding to real-world events, our architecture had to reflect this paradigm shift: from a request-driven model to an event-driven one.
There were two pipelines that kept the lights on: the streaming pipeline and the saving pipeline.
The event-driven architecture was a double-edged sword, introducing both order and chaos. Ultimately, we watched it evolve into a well-oiled machine capable of taming its complexities.
In this installment, we’ll explore the next three versions of the system, each enhancing a user’s workout experience in a different way:
But first, meet the main character of this article!
Be it distributed, event-driven, or not, we can think of the SmartGym/SENSEI backend system as a magic black box.
Input: We feed this black box new information, e.g. user info, sensor data
State Transition: This new information interacts with its existing state based on a certain logic, resulting in a new internal state.
Output: At any time, we can query its internal state to retrieve relevant information — e.g. user workout information
The Magic Black Box is deterministic: if you take two identical black boxes with the same state and logic, feed them with the same inputs in the same order, both will end up with identical internal states.
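To make that determinism concrete, here is a minimal sketch (not the actual SmartGym code; the event shapes are hypothetical) of the black box as a state machine: feed two boxes the same inputs in the same order, and they end up in identical states.

from dataclasses import dataclass, field

@dataclass
class BlackBox:
    state: dict = field(default_factory=dict)

    def apply(self, event: dict) -> None:
        """State transition: fold a new input into the existing state."""
        key = (event["type"], event["id"])
        self.state[key] = {**self.state.get(key, {}), **event["payload"]}

    def query(self, event_type: str, entity_id: str) -> dict:
        """Output: read the current internal state."""
        return self.state.get((event_type, entity_id), {})

a, b = BlackBox(), BlackBox()
events = [
    {"type": "user", "id": "u1", "payload": {"name": "Alice"}},
    {"type": "sensor", "id": "s1", "payload": {"reps": 10}},
]
for e in events:
    a.apply(e)
    b.apply(e)
assert a.state == b.state  # identical inputs, identical order, identical state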
If we tear it apart, we’ll find there’s nothing truly magical — just a dataflow architecture comprising a bunch of data views and ingestion pipelines.
There are two main types of data views:
1. Source of truth (solid line) — e.g. users, sensor stream
2. Derived data (dotted line) — e.g. workouts
The process of deriving data views from sources of truth is known as materialisation, a deterministic task handled by workers in the ingestion pipeline.
All internal state of the magic black box is encapsulated within these data views, while the ingestion pipeline — the machinery behind the materialisation process — remains stateless.
Note that whenever the source of truth changes, the derived data has to be re-derived. Otherwise, the state transition is incomplete and the magic black box is left in an inconsistent state.
In the previous example, we feed the magic black box two inputs: user information and sensor data.
These inputs serve as sources of truth, reflecting real-world entities or events.
From these two pieces of information, we derive a single workout session record in the saving pipeline.
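As a sketch of that materialisation step (the field names are illustrative, not the actual SmartGym schema), a stateless worker might fold user info and a session's sensor readings into a single workout record:

import pandas as pd

# Hypothetical sources of truth: one user record, one session's sensor readings.
user = {"user_id": "u1", "name": "Alice"}
sensor_stream = pd.DataFrame({
    "user_id": ["u1"] * 4,
    "machine": ["treadmill"] * 4,
    "timestamp": pd.to_datetime(
        ["2024-09-01 08:00", "2024-09-01 08:05",
         "2024-09-01 08:10", "2024-09-01 08:15"]),
    "distance_m": [0, 800, 1650, 2500],
})

def materialise_workout(user: dict, readings: pd.DataFrame) -> dict:
    """Deterministically derive one workout record from the sources of truth."""
    return {
        "user_id": user["user_id"],
        "machine": readings["machine"].iloc[0],
        "started_at": readings["timestamp"].min(),
        "ended_at": readings["timestamp"].max(),
        "total_distance_m": int(readings["distance_m"].max()),
    }

workout = materialise_workout(user, sensor_stream)  # the derived data view's new row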
Now, you might be offended by Mr. Magic Black Box mansplaining what appears to be common sense to you. But bear with him, because he will prove to be a useful abstraction throughout this article.
With the streaming and saving pipelines working tirelessly to ingest data into the system, our database now houses a wealth of user and workout records. We are empowered to provide meaningful macro insights by analysing trends, groups, averages, and totals for our users and stakeholders.
This is where SmartGym’s vision of becoming “every citizen’s #1 fitness companion” begins to take shape. Beyond giving real-time feedback during my workout, recalling my historical performance, and telling me what a great job I’ve done — a diligent fitness companion provides tangible metrics to measure my improvement in performance over time.
By leveraging recent workouts and user information — such as body metrics captured by the SmartGym weighing scale — we can estimate various performance metrics, including:
Deriving both product and fitness metrics typically involves denormalization and aggregation across records, which can be resource-intensive in terms of memory usage, database reads, and network throughput.
Since it doesn’t make sense to run these operations every time a user loads the dashboard, we need a precomputed intermediate data representation, ready for query and visualisation — another derived data view for user fitness metrics!
Let’s return to our magic black box.
As new users and workout records continuously flow in, user fitness metrics — a derived data view — require continuous updates in response to changes in the upstream data views they rely on.
To address this, we decided to recompute user fitness metrics periodically, accepting that these metrics may be a few hours stale.
Our ingestion pipeline now includes a cronjob service that schedules batch processing jobs within the periodic pipeline based on a preconfigured schedule, ensuring timely updates without overloading the system.
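A rough sketch of such a batch job, assuming hypothetical data-access helpers (load_users, load_recent_workouts, save_metrics) rather than SmartGym's real interfaces:

import time
from datetime import datetime, timezone

def recompute_user_fitness_metrics(load_users, load_recent_workouts, save_metrics):
    """One batch run: re-derive the fitness metrics view for every user."""
    for user in load_users():
        workouts = load_recent_workouts(user_id=user["user_id"], days=30)
        save_metrics({
            "user_id": user["user_id"],
            "sessions_last_30d": len(workouts),
            "total_minutes_last_30d": sum(w["minutes"] for w in workouts),
            "computed_at": datetime.now(timezone.utc).isoformat(),
        })

# A minimal stand-in for the cronjob service: trigger the batch on a fixed schedule.
def run_periodic(job, every_seconds=6 * 3600):
    while True:
        job()
        time.sleep(every_seconds)

# e.g. run_periodic(lambda: recompute_user_fitness_metrics(load_users,
#                                                          load_recent_workouts,
#                                                          save_metrics))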
Staying fit is difficult work. But with a healthy dose of competition and collective suffering, it can be an experience that feels larger than life. Imagine if every rep, every set, and every workout session contributed to something greater.
Introducing the exercise leaderboard and fitness challenges.
The leaderboard showcases users who have logged the most effort over the month — measured by distance run, weights lifted, and more.
Boy, did this feature stroke some egos! Some of the gym regulars started renaming their randomly-generated username with titles like “Beefy” or “Armstrong”. For many others, surveying the leaderboard became the first order of business upon entering the gym, and the same post-workout ritual before strutting out with newfound swagger.
Similar to how we compute product and user fitness metrics, the leaderboard data is updated periodically in batches, derived from user profiles and their historical workout data.
In collaboration with the gym management team, we launched a fitness challenge to coincide with periods such as Singapore’s National Day.
Each day, users received a challenge requiring them to complete a specific number of reps on a weighted machine or spend a certain number of minutes on a cardio machine, earning rewards for their hard work.
This kickstarted a series of diverse fitness challenges, each featuring unique gameplay variations in terms of duration, workout types, intensity, streaks, and more.
At its core, a fitness challenge is a unique set of workout requirements, specified by an administrator. By comparing a user’s workout history to these requirements, we can assess their progress and completion status for the challenge.
Instead of cluttering our codebase with a new bunch of if-else statements for each variation of the fitness challenge, we externalised the business logic by representing these logical rules in a syntax tree. During runtime, the rule engine parses this tree and evaluates it against users’ actual workout histories to track their challenge progress.
When the program admin modifies the parameters of a fitness challenge, they are directly updating the underlying rules syntax tree. This same data structure is shared between the backend rules engine and the frontend rules configuration page, ensuring consistency and ease of management.
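To illustrate the idea, here is a toy rules syntax tree and evaluator. The operators and field names are assumptions made for this sketch, not the production schema:

challenge_rules = {
    "op": "and",
    "children": [
        {"op": "total_reps", "machine": "chest_press", "gte": 30},
        {"op": "total_minutes", "machine": "treadmill", "gte": 20},
    ],
}

def evaluate(node: dict, workouts: list[dict]) -> bool:
    """Recursively evaluate a rules syntax tree against a workout history."""
    if node["op"] == "and":
        return all(evaluate(child, workouts) for child in node["children"])
    if node["op"] == "or":
        return any(evaluate(child, workouts) for child in node["children"])
    if node["op"] == "total_reps":
        total = sum(w["reps"] for w in workouts if w["machine"] == node["machine"])
        return total >= node["gte"]
    if node["op"] == "total_minutes":
        total = sum(w["minutes"] for w in workouts if w["machine"] == node["machine"])
        return total >= node["gte"]
    raise ValueError(f"unknown operator {node['op']}")

workouts = [
    {"machine": "chest_press", "reps": 35, "minutes": 10},
    {"machine": "treadmill", "reps": 0, "minutes": 25},
]
completed = evaluate(challenge_rules, workouts)  # True

Because the tree is plain data, the same structure can be stored, versioned, and edited from the admin configuration page without touching the evaluator's code.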
Let's revisit our magic black box.
User fitness challenge results, derived from the workouts and user fitness challenge data through the rules engine, need to be recalculated whenever changes occur in the upstream data views they depend on — such as each time a user completes a workout set.
Among our enthusiastic users, these fitness challenges are a matter of honour and glory. If they don’t see updated challenge results immediately after finishing a set, they get confused and frustrated. Therefore, we can’t afford to reprocess user fitness challenge results in periodic batches; every change in the workouts data view must be propagated instantly.
To achieve that, we extend our ingestion pipeline with a Change Data Capture mechanism, introducing a service that continuously listens for changes in relevant data views, using built-in database triggers or change streams. This on-demand pipeline triggers a cascade of updates to the downstream derived data views.
In this case, an on-demand worker implements the logic of the rules engine to evaluate user fitness challenges results on the fly.
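As one possible shape of that Change Data Capture listener — assuming, purely for illustration, a MongoDB-backed workouts view with change streams enabled — the on-demand pipeline could look like this:

from pymongo import MongoClient

# Illustrative only: database, collection, and field names are assumptions.
client = MongoClient("mongodb://localhost:27017")
db = client["smartgym"]

def on_demand_pipeline(evaluate_user_challenges):
    """Block on the change stream and re-evaluate challenges for each changed workout."""
    with db["workouts"].watch(full_document="updateLookup") as stream:
        for change in stream:
            workout = change.get("fullDocument")
            if workout is not None:
                evaluate_user_challenges(user_id=workout["user_id"])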
A recap of the different stages of our ingestion pipeline:
What if we could give the fitness challenges a personal touch?
In late 2021, a team from the Singapore Army described their predicament: each year, servicemen must meet specific fitness benchmarks. If they fall short, they are enrolled in a structured training program, known as NSFIT. However, these sessions were limited to specific times, required sign-ups, and needed staff to facilitate and monitor progress. With the ongoing pandemic and social distancing measures then, gathering servicemen for group sessions wasn’t feasible.
Using the SmartGym fitness challenge system, servicemen could carry out their training on their own schedule — no need for staff to hover over every session. All that’s required is for the staff to verify that the training was completed and the standards were met.
But here’s the twist: servicemen come in all shapes, sizes, and fitness levels. A one-size-fits-all fitness challenge just wouldn’t cut it. They need something that meets them where they are, in order to push their fitness to new heights.
So, why not add a recommendation step before curating a user fitness challenge? By leveraging the fitness metrics we already compute (like in v3.0), we could tailor the intensity of their subsequent training sessions.
Our personalized fitness challenge now follows three key steps:
Step 1 — Profiling
Using historical workout data, we profile each user’s fitness level.
To further optimise this process, we could expand our profiling methods beyond simple heuristics, incorporating machine learning methods to extract more sophisticated features — resulting in new derived data views.
Step 2 — Recommendation
Starting with a generic challenge template (including general information like location, total workouts, participant groups, start/end dates, etc), the recommendation engine fleshes out this skeletal template into a rules syntax tree tailored to each user’s fitness profile.
For even greater personalization, a domain expert could manually fine-tune the challenge requirements, offering a professional perspective beyond algorithmic deduction.
Step 3 — Evaluation
Once personalized parameters are embedded into the rules syntax tree, evaluations can be triggered on-demand after the workout is saved, or even performed in real-time against the sensor stream and displayed on the frontend console.
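A toy sketch of the recommendation step (Step 2): scale a generic challenge template by a user's fitness profile into a personalised rules tree of the same shape as the one shown earlier. The template fields and the scaling heuristic are illustrative assumptions, not the actual recommendation engine.

generic_template = {
    "name": "NSFIT weekly run",
    "base_minutes": 20,
    "machine": "treadmill",
}

def recommend_rules(template: dict, fitness_profile: dict) -> dict:
    """Flesh out a skeletal template into a personalised rules (sub)tree."""
    # Crude heuristic: fitter users (profile score in [0, 1]) get longer targets.
    scaled_minutes = round(template["base_minutes"] * (1 + fitness_profile["score"]))
    return {
        "op": "total_minutes",
        "machine": template["machine"],
        "gte": scaled_minutes,
    }

personalised = recommend_rules(generic_template, {"user_id": "u1", "score": 0.4})
# -> {"op": "total_minutes", "machine": "treadmill", "gte": 28}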
Earlier in the article, we mentioned that the magic black box is composed of data views, where the state lives, and a stateless ingestion pipeline.
To get reliable outputs from the magic black box, these data views must be consistent, achieved through a deterministic series of materialisation in the ingestion pipeline.
Buckle up and brace yourself, because we will be diving deep into consistency and determinism — the undercurrents of dataflow.
At first glance, it might seem simpler to create one massive data view containing every bit of detail from the raw data. With only one data source, consistency is implied. However, there are several reasons why we need derived data views despite the complexity of maintaining consistency:
Data can be represented in a myriad of forms — in different combinations and at multiple levels of granularity — each serving their own unique purpose.
For example, it turns out that a user is not interested in knowing whether the 3rd rep of his 2nd set of chest press in September 2020 was fully extended. After that real-time window of immediacy, low-level raw details become increasingly irrelevant, while higher-level derived insights become much more valuable.
To avoid having to assume how data will be used and represented in the future, raw is better: this is known as the Sushi Principle.
With that, we decouple the write model from the range of potential read models, and bridge the gap with a series of materialisation stages. This separation is commonly known as Command and Query Responsibility Segregation (CQRS).
Having a materialisation path gives a piece of data the space and time to morph and discover its different personalities, enabling:
By designating the write model as the authoritative source of truth to reason from, it is easier to achieve consistency — without the complexities of having multiple authoritative systems trying to reach a consensus.
There are instances when the raw data grows too rapidly. E.g. at 1 message per second per treadmill sensor, across multiple gyms you end up with millions of messages in the sensor stream in just one day.
When the sensor stream grows prohibitively large, we can treat workouts records as a “lossy compression” of the sensor stream, purge the processed sensor stream, and promote the derived workouts records as a new authoritative source of truth.
Since the one-way chain of materialisation still starts from a single source of truth, we retain our basis of consistency.
Derived views offer resilience. If a bug corrupts the output, we can roll back to previous versions and rerun the materialisation process, ensuring accurate data once again.
Derived views also enable gradual evolution of the application. You can introduce new data views without deleting or restructuring the old ones, keeping both as independent views of the same data, with the option of falling back if something goes wrong.
In part one of the series, we saw how the publisher-subscriber (pub/sub) pattern (via a fanout exchange) of the ingestion pipeline makes it easy to extend system functionality in a plug-and-play manner, without disrupting existing pipelines or requiring upstream modifications.
A key to agile development and building anti-fragile systems — those that improve with each bug fix or new feature — is the ease of recovery and evolution. The decoupling enabled by the pub/sub pattern and derived views makes this possible.
Next, let’s dissect the nature of consistency and control flow of the dataflow architecture.
Broadly speaking, distributed systems can be categorised by two consistency levels — strong or eventual; and two types of control flow — orchestration (centralised) or choreography (decentralised).
Strong consistency guarantees that every read reflects the most recent write. It ensures that all data views are updated immediately and accurately after a change. Strong consistency is typically associated with orchestration, since it often relies on a central coordinator to manage atomic updates across multiple data views — either updating all at once, or none at all. Such “over-engineering” may be required for systems where minor discrepancies can be disastrous, e.g. financial transactions, but not in our case.
Eventual consistency allows for temporary discrepancies between data views, but given enough time, all views will converge to the same state. This approach typically pairs with choreography, where each worker reacts to events independently and asynchronously, without needing a central coordinator.
The asynchronous and loosely-coupled design of the dataflow architecture is characterised by eventual consistency of data views, achieved through a choreography of materialisation logic.
And there are perks to that.
Resilience to partial failures: The asynchrony of choreography is more robust against component failures or performance bottlenecks, as disruptions are contained locally. In contrast, orchestration can propagate failures across the system, amplifying the issue through tight coupling.
Simplified write path: Choreography also reduces the responsibility of the write path, which reduces the code surface area for bugs to corrupt the source of truth. Conversely, orchestration makes the write path more complex, and increasingly harder to maintain as the number of different data representations grows.
The decentralised control logic of choreography allows different materialisation stages to be developed, specialised, and maintained independently and concurrently.
A reliable dataflow system is akin to a spreadsheet: when one cell changes, all related cells update instantly — no manual effort required.
In an ideal dataflow system, we want the same effect: when an upstream data view changes, all dependent views update seamlessly. Like in a spreadsheet, we shouldn’t have to worry about how it works; it just should.
But ensuring this level of reliability in distributed systems is far from simple. Network partitions, service outages, and machine failures are the norm rather than the exception, and the concurrency in the ingestion pipeline only adds complexity.
Since message queues in the ingestion pipeline provide reliability guarantees, deterministic retries can make transient faults seem like they never happened. To achieve that, our ingestion workers need to adopt the event-driven work ethic:
In computer science, pure functions exhibit determinism, meaning their behaviour is entirely predictable and repeatable.
They are ephemeral — here for a moment and gone the next, retaining no state beyond their lifespan. Naked they come, and naked they shall go. And from the immutable message inscribed into their birth, their legacy is determined. They always return the same output for the same input — everything unfolds exactly as predestined.
And that is exactly what we want our ingestion workers to be.
Immutable inputs (statelessness)
Each worker consumes a single immutable message that encapsulates all the necessary information, removing any dependency on external, changeable data. Essentially, we pass data to the workers by value rather than by reference, so that processing a message tomorrow yields the same result as processing it today.
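One way to encode this, sketched with a frozen dataclass (the field names are hypothetical):

from dataclasses import dataclass

@dataclass(frozen=True)
class TaskMessage:
    user_id: str
    workout_id: str
    issued_at: str      # ISO timestamp fixed by the producer, not read by the worker
    random_seed: int    # fixed seed for any stochastic step

msg = TaskMessage(user_id="u1", workout_id="w42",
                  issued_at="2024-09-01T08:15:00+00:00", random_seed=7)
# msg.user_id = "u2"  # would raise FrozenInstanceError: the input cannot mutate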
Task isolation
To avoid concurrency issues, workers should not share mutable state.
Transitional states within the workers should be isolated, like local variables in pure functions — without reliance on shared caches for intermediate computation.
It’s also crucial to scope tasks independently, ensuring that each worker handles tasks without sharing input or output spaces, allowing parallel execution without race conditions, e.g. scoping the user fitness profiling task by a particular user_id, since both the inputs (workouts) and the outputs (user fitness metrics) are tied to a unique user.
Deterministic execution
Non-determinism can sneak in easily: reading the system clock, depending on external data sources, or using probabilistic/statistical algorithms that rely on random numbers can all lead to unpredictable results. To prevent this, we embed all “moving parts” (e.g. random seeds or timestamps) directly in the immutable message.
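A sketch of a worker that draws its clock and randomness from the message instead of the environment, re-using the hypothetical message shape from above:

import random
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class TaskMessage:
    workout_id: str
    issued_at: str
    random_seed: int

def process(msg: TaskMessage) -> dict:
    rng = random.Random(msg.random_seed)           # reproducible randomness
    as_of = datetime.fromisoformat(msg.issued_at)  # no call to "now()" inside the worker
    return {"workout_id": msg.workout_id,
            "as_of": as_of.isoformat(),
            "jitter": rng.uniform(0, 1)}

msg = TaskMessage("w42", "2024-09-01T08:15:00+00:00", random_seed=7)
assert process(msg) == process(msg)  # same message, same result, on every retry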
Deterministic ordering
Load balancing with message queues (multiple workers per queue) can result in out-of-order message processing when a message is retried after the next one has already been processed. For example, out-of-order evaluation of user fitness challenge results could appear to jump from 50% completion to 70% and back to 60%, when it should increase monotonically. For operations that require sequential execution, like inserting a record and then notifying a third-party service, out-of-order processing can break such causal dependencies.
At the application level, these sequential operations should either run synchronously on a single worker or be split into separate sequential stages of materialisation.
At the ingestion pipeline level, we could assign only one worker per queue to ensure serialised processing that “blocks” until retry is successful. To maintain load balancing, you can use multiple queues with a consistent hash exchange that routes messages based on the hash of the routing key. This achieves a similar effect to Kafka’s hashed partition key approach.
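A simplified stand-in for that routing scheme — plain hash-then-modulo rather than a full consistent-hash ring, with illustrative queue names — but enough to show that messages with the same routing key always land on the same single-consumer queue:

import hashlib

QUEUES = ["challenge-eval-0", "challenge-eval-1", "challenge-eval-2"]

def route(routing_key: str, queues=QUEUES) -> str:
    """Map a routing key (e.g. user_id) to a stable queue."""
    digest = hashlib.sha256(routing_key.encode()).hexdigest()
    return queues[int(digest, 16) % len(queues)]

assert route("user-123") == route("user-123")  # stable per key, balanced across keys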
Idempotent outputs
Idempotence is the property that executing a piece of code multiple times yields the same result as executing it once, no matter how many times it actually runs.
For example, a plain database “insert” operation is not idempotent, while an “insert if not exists” operation is.
This ensures that you get the same outcome as if the worker only executed once, regardless of how many retries it actually took.
Caveat: unlike pure functions, workers do not “return” an object in the programming sense. Instead, they overwrite a portion of the database. While this may look like a side effect, you can think of the overwrite as the immutable output of a pure function: once the worker commits the result, it reflects a final, unchangeable state.
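Sketched with a hypothetical MongoDB collection (any store with an upsert or merge primitive works the same way), an idempotent write keyed on the natural identifier might look like this:

from pymongo import MongoClient

# Illustrative only: collection and field names are assumptions.
db = MongoClient("mongodb://localhost:27017")["smartgym"]

def save_challenge_result(user_id: str, challenge_id: str, result: dict) -> None:
    db["user_challenge_results"].update_one(
        {"user_id": user_id, "challenge_id": challenge_id},  # natural key
        {"$set": result},
        upsert=True,  # insert if missing, overwrite otherwise
    )

# Whether the worker runs once or retries five times, the final state is the same.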
Traditionally, we think of web/mobile apps as stateless clients talking to a central database. However, modern “single-page” frameworks have changed the game, offering “stateful” client-side interaction and persistent local storage.
This extends our dataflow architecture beyond the confines of a backend system into a multitude of client devices. Think of the on-device state (the “model” in model-view-controller) as a derived view of the server state: the screen displays a materialised view of the local on-device state, which in turn mirrors the central backend’s state.
Push-based protocols like server-sent events and WebSockets take this analogy further, enabling servers to actively push updates to the client without relying on polling — delivering eventual consistency from end to end.
In fact, this real-time synchronization is exactly how we evaluated personalized fitness challenges in the frontend console — as a derived data view residing in a client device.
Even down the stack, we see a semblance of dataflow in databases. Database triggers, stored procedures, and materialized view maintenance routines are not very different from the on-demand and periodic processing pipelines; the B-tree indexes and materialised views of a relational database are essentially derived data views. Talk about dataflows within dataflows!
“The goal of data integration is to make sure that data ends up in the right form, in all the right places.”
— Designing Data-Intensive Applications (Martin Kleppmann)
As data systems scale, we should move beyond seeing them as passive databases that applications manipulate like global variables.
Instead, it is useful to reimagine data systems in an organisation as an interplay of data views, one derived from another, with state changes rippling out from a central source of truth, propagated through functional application code. Dataflows, built upon dataflows.
It’s magic black boxes — all the way down.
Congratulations on making it this far!
In part I, we evolved from a trivial request-response system into an event-driven system that streams, processes, and saves data from a range of gym equipment sensors and medical devices.
In this second instalment, we expanded upon those saved records, processing them both periodically and on-demand. This enabled new features that make a user’s workout experience more collective, yet more personalised. As our ingestion pipeline evolved, so did our dataflow architecture, scaling to meet new demands.
The story of our evolution doesn’t end here.
In the next and final part, we explore adding and removing functionality in a plug-and-play manner, paving the way for an ecosystem to emerge from our platform.
Stay tuned…
All images and gifs featured in this article are original works created and captured by the author.
Shoutout to the data engineering bible, a.k.a Designing Data-Intensive Applications — by Martin Kleppmann, for giving me the vocabulary to think about these distributed systems with clarity.
Related dashboards and articles:
User insights dashboard
Fitness Lover to Developer: Building a SmartGym Vision
Product metrics dashboard
Leaderboard and fitness challenge analytics dashboard
Personalised workout for physiotherapy
Designing an end to end experience for Physiotherapist and Patients
Dataflow Architecture-Derived Data Views and Eventual Consistency was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Dataflow Architecture-Derived Data Views and Eventual Consistency
Synthetic data serves many purposes and has been gathering attention for a while, partly due to the convincing capabilities of LLMs. But what is “good” synthetic data, and how can we know we managed to generate it?
Synthetic data is data that has been generated with the intent to look like real data, at least in some respects (the schema at the very least, and statistical distributions). It is usually generated randomly, using a wide range of models: random sampling, noise addition, GANs, diffusion models, variational autoencoders, LLMs, and so on.
It is used for many purposes. For instance, it is particularly useful in software testing, and in sensitive domains like healthcare technology: having access to data that behaves like real data without jeopardizing patients’ privacy is a dream come true.
For a sample to be useful, it must in some way look like real data. The ultimate goal is for generated samples to be indistinguishable from real samples: hyper-realistic faces, sentences, medical records, and so on. Obviously, the more complex the source data, the harder it is to generate “good” synthetic data.
In many cases, especially data augmentation, we need more than one realistic sample: we need a whole dataset. And generating a whole dataset is not the same as generating a single sample. The problem is well known under the name of mode collapse, which is especially frequent when training a generative adversarial network (GAN). Essentially, the generator (more generally, the model that generates synthetic data) can learn to produce a single type of sample and totally miss out on the rest of the sample space, leading to a synthetic dataset that is not as useful as the original one.
For instance, if we train a model to generate animal pictures, and it finds a very efficient way to generate cat pictures, it could stop generating anything else than cat pictures (in particular, no dog pictures). Cat pictures would then be the “mode” of the generated distribution.
This type of behaviour is harmful if our initial goal is to augment our data or to create a training dataset. What we need is a dataset that is realistic in itself, which in absolute terms means that any statistic derived from this dataset should be close enough to the same statistic computed on the real data. Statistically speaking, this means that the univariate and multivariate distributions should be the same (or at least “close enough”).
We will not dive too deep into this topic, which would deserve an article of its own. To keep it short: depending on our initial goal, we may have to share the data (more or less publicly), which means, if it is personal data, that it must be protected. For instance, we need to make sure we cannot retrieve any information about any given individual of the original dataset using the synthetic dataset. In particular, that means being cautious about outliers, and checking that no original sample was reproduced verbatim by the generator.
One way to consider the privacy issue is to use the differential privacy framework.
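As a toy illustration of the differential privacy idea (not a production-ready mechanism; a real deployment would use a vetted DP library and carefully chosen parameters), one can release an aggregate with calibrated Laplace noise instead of the raw value:

import numpy as np

def dp_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Release the mean of bounded values under epsilon-differential privacy."""
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / len(clipped)   # L1 sensitivity of the mean
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(clipped.mean() + noise)

values = np.array([4.6, 5.0, 5.4, 6.1, 7.2])       # e.g. a numerical column
private_mean = dp_mean(values, lower=4.0, upper=8.0, epsilon=1.0)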
Let’s start by loading data and generating a synthetic dataset from it. We’ll use the famous `iris` dataset. To generate its synthetic counterpart, we’ll use the Synthetic Data Vault (SDV) package.
pip install sdv
from sklearn.datasets import load_iris
from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata.metadata import Metadata
data = load_iris(return_X_y=False, as_frame=True)
real_data = data["data"]
# metadata of the `iris` dataset
metadata = Metadata().load_from_dict({
"tables": {
"iris": {
"columns": {
"sepal length (cm)": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"sepal width (cm)": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"petal length (cm)": {
"sdtype": "numerical",
"computer_representation": "Float"
},
"petal width (cm)": {
"sdtype": "numerical",
"computer_representation": "Float"
}
},
"primary_key": None
}
},
"relationships": [],
"METADATA_SPEC_VERSION": "V1"
})
# train the synthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=real_data)
# generate samples - in this case,
# synthetic_data has the same shape as real_data
synthetic_data = synthesizer.sample(num_rows=150)
Now, we want to test whether it is possible to tell if a single sample is synthetic or not.
With this formulation, we can easily see that it is fundamentally a binary classification problem (synthetic vs original). Hence, we can train any model to distinguish original data from synthetic data: if this model achieves good accuracy (here, significantly above 0.5), the synthetic samples are not realistic enough. We aim for an accuracy of 0.5 (assuming the test set contains half original and half synthetic samples), which would mean the classifier is making random guesses.
As in any classification problem, we should not limit ourselves to weak models, and we should put a fair amount of effort into hyperparameter selection and model training.
Now for the code:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.ensemble import RandomForestClassifier
def classification_evaluation(
real_data: pd.DataFrame,
synthetic_data: pd.DataFrame
) -> float:
X = pd.concat((real_data, synthetic_data))
y = np.concatenate(
(
np.zeros(real_data.shape[0]),
np.ones(synthetic_data.shape[0])
)
)
Xtrain, Xtest, ytrain, ytest = train_test_split(
X,
y,
test_size=0.2,
stratify=y
)
clf = RandomForestClassifier()
clf.fit(Xtrain, ytrain)
score = accuracy_score(clf.predict(Xtest), ytest)
return score
classification_evaluation(real_data, synthetic_data)
>>> 0.9
In this case, it appears the synthesizer was not able to fool our classifier: the synthetic data is not realistic enough.
If our samples were realistic enough to fool a reasonably powerful classifier, we would need to evaluate our dataset as a whole. This time, it cannot be translated into a classification problem, and we need to use several indicators.
Statistical distributions
The most obvious tests are statistical tests: are the univariate distributions in the original dataset the same as in the synthetic dataset? Are the correlations the same?
Ideally, we would like to test N-variate distributions for any N, which can be particularly expensive for a high number of variables. However, even univariate distributions make it possible to see whether our dataset is subject to mode collapse.
Now for the code:
import pandas as pd
from scipy.stats import ks_2samp
def univariate_distributions_tests(
real_data: pd.DataFrame,
synthetic_data: pd.DataFrame
) -> None:
for col in real_data.columns:
if real_data[col].dtype.kind in "biufc":
stat, p_value = ks_2samp(real_data[col], synthetic_data[col])
print(f"Column: {col}")
print(f"P-value: {p_value:.4f}")
print("Significantly different" if p_value < 0.05 else "Not significantly different")
print("---")
univariate_distributions_tests(real_data, synthetic_data)
>>> Column: sepal length (cm)
P-value: 0.9511
Not significantly different
---
Column: sepal width (cm)
P-value: 0.0000
Significantly different
---
Column: petal length (cm)
P-value: 0.0000
Significantly different
---
Column: petal width (cm)
P-value: 0.1804
Not significantly different
---
In our case, out of the 4 variables, only 2 have similar distributions in the real dataset and in the synthetic dataset. This shows that our synthesizer fails to reproduce basic properties of this dataset.
Visual inspection
Though not a mathematical proof, a visual comparison of the datasets can be useful.
The first method is to plot bivariate distributions (or correlation plots).
We can also represent all of the dataset’s dimensions at once: for instance, given a tabular dataset and its synthetic equivalent, we can plot both datasets using a dimension reduction technique such as t-SNE, PCA or UMAP. With a perfect synthesizer, the scatter plots should look the same.
Now for the code:
pip install umap-learn
import pandas as pd
import seaborn as sns
import umap
import matplotlib.pyplot as plt
def plot(
    real_data: pd.DataFrame,
    synthetic_data: pd.DataFrame,
    kind: str = "pairplot"
):
    assert kind in ["umap", "pairplot"]
    # work on copies so the caller's dataframes are not mutated
    # (otherwise the added "label" column would leak into later evaluations)
    real = real_data.copy()
    synthetic = synthetic_data.copy()
    real["label"] = "real"
    synthetic["label"] = "synthetic"
    X = pd.concat((real, synthetic))
    if kind == "pairplot":
        sns.pairplot(X, hue="label")
    elif kind == "umap":
        reducer = umap.UMAP()
        embedding = reducer.fit_transform(X.drop("label", axis=1))
        plt.scatter(
            embedding[:, 0],
            embedding[:, 1],
            c=[sns.color_palette()[x] for x in X["label"].map({"real": 0, "synthetic": 1})],
            s=30,
            edgecolors="white"
        )
        plt.gca().set_aspect('equal', 'datalim')
        sns.despine(top=True, right=True, left=False, bottom=False)
plot(real_data, synthetic_data, kind="pairplot")
We can already see on these plots that the bivariate distributions are not identical between the real data and the synthetic data, which is one more hint that the synthesizer failed to reproduce higher-order relationships between data dimensions.
Now let’s take a look at a representation of the four dimensions at once:
plot(real_data, synthetic_data, kind="umap")
This image also makes it clear that the two datasets are distinct from one another.
Information
A synthetic dataset should be as useful as the original dataset. In particular, it should be equivalently useful for prediction tasks, meaning it should capture complex relationships between features. Hence a comparison: TSTR vs TRTR, which stand for “Train on Synthetic, Test on Real” and “Train on Real, Test on Real”. What does this mean in practice?
For a given dataset, we take a given task, like predicting the next token or the next event, or predicting a column given the others. For this given task, we train a first model on the synthetic dataset, and a second model on the original dataset. We then evaluate these two models on a common test set, which is an extract of the original dataset. Our synthetic dataset is considered useful if the performance of the first model is close to the performance of the second model, whatever the performance. It would mean that it is possible to learn the same patterns in the synthetic dataset as in the original dataset, which is ultimately what we want (especially in the case of data augmentation).
Now for the code:
import pandas as pd
from typing import Tuple
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
def tstr(
real_data: pd.DataFrame,
synthetic_data: pd.DataFrame,
target: str = None
) -> Tuple[float, float]:
# if no target is specified, use the last column of the dataset
if target is None:
target = real_data.columns[-1]
X_real_train, X_real_test, y_real_train, y_real_test = train_test_split(
real_data.drop(target, axis=1),
real_data[target],
test_size=0.2
)
X_synthetic, y_synthetic = synthetic_data.drop(target, axis=1), synthetic_data[target]
# create regressors (could have been classifiers)
reg_real = RandomForestRegressor()
reg_synthetic = RandomForestRegressor()
# train the models
reg_real.fit(X_real_train, y_real_train)
reg_synthetic.fit(X_synthetic, y_synthetic)
# evaluate
trtr_score = reg_real.score(X_real_test, y_real_test)
tstr_score = reg_synthetic.score(X_real_test, y_real_test)
return trtr_score, tstr_score
tstr(real_data, synthetic_data)
>>> (0.918261846477529, 0.5644428690930647)
It appears clearly that a certain relationship was learnt by the “real” regressor, whereas the “synthetic” regressor failed to learn this relationship. This hints that the relationship was not faithfully reproduced in the synthetic dataset.
Synthetic data quality evaluation does not rely on a single indicator; one should combine metrics to get the whole picture. This article presented some indicators that can easily be built. I hope it gave you some useful hints on how to best evaluate synthetic data in your own use case!
Feel free to share and comment ✨
Evaluating synthetic data was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
Originally appeared here:
Evaluating synthetic data
Originally appeared here:
Introducing SageMaker Core: A new object-oriented Python SDK for Amazon SageMaker
Originally appeared here:
Create a data labeling project with Amazon SageMaker Ground Truth Plus
The celebrity Kim Kardashian has worked with Beats on new variants of its products previously. In the latest and third collaboration, the focus is on the Beats Pill.
The special edition release under the Beats x Kim collection offers the revamped portable speaker in two new hues. This time, they are Light Gray and Dark Gray, selected for function and style.
Originally appeared here:
Beats partners with Kim Kardashian on new Beats Pill colors
Morgan Stanley has previously told investors that the sales of the iPhone 16 have been lower than expected. Now, though, while iPhone 16 lead times are still shorter than at this time for the iPhone 15 range, the iPhone 16 Pro and iPhone 16 Pro Max appear to be selling better.
This is chiefly based on the lead times quoted by Apple, and Apple does not release any details about the numbers of iPhones either sold or produced. In a note to investors seen by AppleInsider, though, Morgan Stanley says it believes that better supply conditions are contributing to the shorter lead times.
Originally appeared here:
iPhone SE 4 may be what Apple needs to make Apple Intelligence a hit