Taking a break from the generative AI hype around LLMs and foundation models, let’s explore how synthetic data created by more traditional generative AI models is set for mainstream adoption.
Data is as valuable as gold, and sharing it responsibly presents both immense opportunities and significant challenges for organizations and society. To ethically process data and avoid legal repercussions, organizations must ensure they do not violate the privacy of individuals contributing their data. Despite the vast potential of data sharing, traditional anonymization methods are becoming increasingly inadequate in our information-saturated digital age. By instead harnessing advanced generative methods, we can create realistic but privacy-compliant synthetic data that retains the utility of the original data. Join us as we unveil the gateway to a wealth of untapped data opportunities.
In this article, we specifically emphasize the use of synthetic data in business contexts, addressing a gap we have identified in existing literature. While our focus here is on the corporate sphere, the insights and applications of synthetic data are equally relevant to other organizations and individuals engaged in data sharing, especially within the research community.
Why do you even need anonymization?
The goal of anonymization is to prevent re-identification of individuals by making it impossible, or at least highly unlikely, to connect the data to or divulge information about a specific individual. Anonymizing data before sharing it has an intrinsic moral value in respecting the privacy of individuals, but as the public becomes more and more concerned with how their data is used, and governments introduce stricter regulations (GDPR, CCPA, etc.), it has become something all organizations need to pay attention to unless they want to risk massive reputational damage, lawsuits, and fines.
At the same time, by not daring to leverage the full potential of big data and data sharing, organizations risk overlooking significant business opportunities, innovative advancements, and potential cost savings. It also hampers our ability to solve larger societal challenges. Utilizing anonymized data presents a secure and compliant way to harness the value of your data, as properly anonymized data is exempt from the restrictions of the GDPR.
The underestimated challenge of anonymization
The task of anonymizing data is a complex and often underestimated challenge. Far too many believe anonymization to be as easy as removing direct identifiers such as name, social security number, and address. However, humans are often more distinguishable than commonly assumed. In a groundbreaking study from the year 2000, computer scientist Latanya Sweeney demonstrated that just three pieces of information — date of birth, gender, and zip code — could uniquely identify 87% of the U.S. population¹. More recently, a 2019 study published in Nature Communications further underscored this point, revealing that in a database of 7 million individuals, merely 15 data points were sufficient to identify 99.98% of them².
In the age of big data and in a time where we both willingly and unwillingly share more information about ourselves than ever before, anonymization is much more fragile and risky than it initially seems.
For a dataset to be adequately anonymized, it must have a low re-identification risk not only when analyzed by itself, but also when cross-referenced with all the other information freely available on the web. This includes publicly available datasets, personal details we willingly share on social platforms, and potentially even stolen sensitive information about us circulating on the dark web. In other words, an anonymized dataset must also be resistant to linkage attacks.
Netflix’s unpleasant experience with failed anonymization should serve as a wake-up call
A textbook example of this occurred in 2006 when Netflix, aiming to enhance its movie recommendation algorithm, released what it believed to be an anonymized dataset for a public competition. The dataset contained ratings from 480,000 users across 18,000 movies. Although user identities were removed and intentional errors were even systematically inserted into the data, these measures proved insufficient. A paper by researchers from the University of Texas demonstrated that many users could easily be re-identified by cross-referencing the dataset with publicly accessible movie ratings on IMDb, inadvertently exposing those users’ complete movie rating histories.
This incident might seem harmless, but bear in mind that our movie tastes can sometimes reveal deep insights into our personal lives, such as sexual orientation or political beliefs. As such, when Netflix attempted to initiate a similar competition in 2009, they were forced to cancel it due to a class-action lawsuit, highlighting the serious privacy risks involved³.
After reviewing the challenges associated with anonymization, it should be unsurprising that conventional anonymization techniques often need to be very invasive to be even marginally effective. Because traditional methods anonymize by removing or obscuring information contained in the original data, the result is often a huge loss of data utility.
Synthetic data — an alternative to traditional anonymization
Artificial intelligence has been used to create synthetic data for a long time, but the inventions of variational autoencoders (VAEs) in 2013, generative adversarial networks (GANs) in 2014, and diffusion models in 2015 were significant milestones on the path to creating realistic synthetic data. Since then, numerous incremental advances from the scientific community have enabled us to precisely capture the complex statistical patterns in our datasets, whether they are tabular, time series, images, or other formats.
The abovementioned models are generative methods. Generative methods are a class of machine learning techniques that can create new data by capturing the patterns and structures of existing data. They do not merely replicate existing data but instead create unique and diverse examples that resemble the original in terms of underlying features and relationships. Think of it as a new generation of data, like how each new generation of humans resembles their ancestors.
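To make the idea concrete, here is a minimal sketch of a generative adversarial network in PyTorch, trained on a toy two-column dataset. Everything here is illustrative: production tabular synthesizers such as CTGAN add conditional generation, mode-specific normalization, and many other refinements.

```python
import torch
from torch import nn

# Toy "real" dataset: 10,000 records with two correlated columns.
real = torch.randn(10_000, 2) @ torch.tensor([[1.0, 0.8], [0.0, 0.6]])

# Generator maps random noise to fake records; discriminator scores realness.
G = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 2))
D = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2_000):
    batch = real[torch.randint(len(real), (128,))]
    fake = G(torch.randn(128, 8))

    # Discriminator learns to tell real records from generated ones.
    opt_d.zero_grad()
    d_loss = bce(D(batch), torch.ones(128, 1)) + bce(D(fake.detach()), torch.zeros(128, 1))
    d_loss.backward()
    opt_d.step()

    # Generator learns to fool the discriminator.
    opt_g.zero_grad()
    g_loss = bce(D(fake), torch.ones(128, 1))
    g_loss.backward()
    opt_g.step()

# New, fictitious records that mimic the statistics of the real data.
synthetic = G(torch.randn(1_000, 8)).detach()
```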
The introduction of generative methods to the mainstream public through OpenAI’s chatbot ChatGPT and image generator DALL-E 2 was nothing short of a tremendous success. People were amazed by the ability of these tools to effectively perform tasks many believed were reserved for human intelligence and creativity. This has propelled generative AI into becoming one of the most used buzzwords of the year. Whilst these new foundation models are game changers and may even revolutionize our society, more traditional generative methods still have a vital role to play. Gartner has estimated that by 2030, synthetic data will completely overshadow real data in AI models⁴, and for data sharing and data augmentation of specific datasets, traditional methods such as GANs, VAEs, and diffusion models (as opposed to foundation models) are, at least for now, still the best choice.
Unlike traditional anonymization techniques, generative methods do not destroy valuable information.
Synthetic data from generative methods thus offers an optimal solution, combining the best of both worlds. Advanced generative methods can learn the complex patterns inherent in real-world data, enabling them to produce realistic yet fictitious new examples. This effectively avoids the risky one-to-one relation to the original dataset that traditional methods suffer from. At an aggregate level, the statistical properties are retained, meaning we can interact with these synthetic datasets as if they were actual data, whether that is to compute summary statistics or train machine learning models.
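As a hedged illustration of what this looks like in practice, the sketch below uses the open-source SDV library (assuming its 1.x API) to fit a GAN-based synthesizer and compare aggregate statistics; customers.csv is a hypothetical sensitive dataset.

```python
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import CTGANSynthesizer

real_df = pd.read_csv("customers.csv")  # hypothetical sensitive dataset

# Describe the table schema, then fit a GAN-based synthesizer to it.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data=real_df)
synthesizer = CTGANSynthesizer(metadata, epochs=300)
synthesizer.fit(real_df)

# Sample a brand-new dataset of the same size as the original.
synthetic_df = synthesizer.sample(num_rows=len(real_df))

# Aggregate statistics should roughly match between the two datasets.
print(real_df.describe())
print(synthetic_df.describe())
```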
Synthetic data will create value for a multitude of industries
The use of AI-generated synthetic data gives privacy-regulated businesses a way to share data that was previously off-limits due to privacy concerns. These industries include, but are by no means limited to:
- Healthcare: Currently, researchers often face lengthy and cumbersome processes to access real patient data, significantly slowing down the pace of medical advancements. Synthetic medical records present a transformative solution for accelerating medical research while safeguarding patient confidentiality. Additionally, generating synthetic data offers an effective way to address biases in healthcare datasets by intentionally augmenting underrepresented groups, thereby contributing to more inclusive research outcomes.
- Financial services: Transactional data, inherently sensitive and identifiable, presents a unique challenge in the financial sector. Synthetic data arises as a key solution, enabling both internal and external data sharing while effectively addressing privacy issues. Moreover, its utility extends to augmenting limited or skewed datasets, an aspect particularly crucial in enhancing fraud detection and anti-money laundering efforts, as the sketch after this list illustrates.
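To illustrate the augmentation point, and continuing the SDV sketch from earlier, rebalancing a skewed fraud dataset might look like the following; the is_fraud column and the row count are hypothetical, and we assume SDV’s conditional-sampling API.

```python
import pandas as pd
from sdv.sampling import Condition

# Ask the fitted synthesizer for extra fraudulent examples to counteract
# the heavy class imbalance typical of transaction data.
fraud_condition = Condition(num_rows=5_000, column_values={"is_fraud": 1})
extra_fraud = synthesizer.sample_from_conditions(conditions=[fraud_condition])

# Augmented training set: original records plus synthetic fraud cases.
augmented_df = pd.concat([real_df, extra_fraud], ignore_index=True)
```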
In general, all businesses can utilize synthetic datasets to improve privacy, and we encourage you to think of how synthetic data can benefit you specifically. To help you grasp its potential, we include a few selected use cases:
- Third-party sharing: In scenarios where a company needs third-party analysis on customer or user data, synthetic datasets provide a viable alternative to sharing sensitive information. This approach can be particularly beneficial during a selection phase when evaluating multiple external partners or to enable immediate project start-up, bypassing the time-consuming legal processes required for sharing real data.
- Internal data sharing: Even internally, navigating the complexities of sharing sensitive information, such as employee and HR data, is often challenging due to strict regulations. Synthetic data provides a solution, allowing company leadership to improve internal knowledge transfer and data sharing while ensuring the privacy of individual employees. The method is equally advantageous for handling datasets with sensitive customer information. By employing synthetic data, organizations can securely distribute these datasets more widely within the company, empowering a larger segment of the organization to engage in problem-solving and decision-making. This boosts overall efficiency and collaboration while upholding the utmost respect for privacy.
- Retain data insights longer: Under the stringent regulations of the GDPR, organizations are required to delete user data once its intended processing purpose is fulfilled or upon user request. This necessary compliance poses the risk of losing valuable insights contained within the data. Synthetic data offers an innovative resolution to this challenge: it captures the essence and utility of the original data while adhering to legal requirements, thereby ensuring that the value of the data remains available for future analytical and AI-driven pursuits.
Combining synthetic data with privacy-enhancing tests and technologies is the future juggernaut
Synthetic data stands as a very promising solution for addressing data privacy and accessibility challenges, yet it is not foolproof. The accuracy of generative models is paramount; a poorly calibrated model can lead to synthetic data that inadequately reflects real-world conditions or, in some cases, too closely resembles the original dataset, thereby jeopardizing privacy. Recognizing this, robust methods have been developed to verify the output quality of synthetic data with respect to both utility and privacy. These critical evaluations are imperative for effectively leveraging synthetic data, ensuring sensitive information is not inadvertently exposed. Most reputable synthetic data providers recognize this necessity and inherently include such quality assurance processes.
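As a simple illustration of such checks, the sketch below hand-rolls one utility metric and one privacy metric for numeric tables; dedicated evaluation packages such as SDMetrics provide far more thorough reports.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.neighbors import NearestNeighbors

def utility_report(real_df, synthetic_df):
    # Utility: per-column Kolmogorov-Smirnov statistic, where 0 means
    # identical marginal distributions and 1 means completely disjoint.
    return {col: ks_2samp(real_df[col], synthetic_df[col]).statistic
            for col in real_df.columns}

def privacy_report(real_df, synthetic_df):
    # Privacy: distance to closest record (DCR). Synthetic rows sitting
    # (near-)exactly on top of real rows are a sign of memorization.
    nn = NearestNeighbors(n_neighbors=1).fit(real_df.to_numpy())
    distances, _ = nn.kneighbors(synthetic_df.to_numpy())
    return {"median_dcr": float(np.median(distances)),
            "exact_copies": float((distances == 0).mean())}
```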
A promising enhancement is the combination of differential privacy with synthetic data generators. Differential privacy is a rigorous mathematical definition of privacy that, if used correctly, provides a strong guarantee that individual privacy is preserved during statistical analysis.
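To make the definition concrete, here is the classic Laplace mechanism applied to a count query, the textbook building block of differential privacy; the predicate and epsilon value below are purely illustrative.

```python
import numpy as np

def dp_count(records, predicate, epsilon):
    """Epsilon-differentially private count via the Laplace mechanism.

    A count query has sensitivity 1: adding or removing one individual
    changes the true answer by at most 1, so Laplace noise with scale
    1/epsilon is sufficient for the epsilon-DP guarantee.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

# e.g. a noisy count of heavy raters, with a privacy budget of epsilon = 0.5:
# dp_count(users, lambda u: u["num_ratings"] > 100, epsilon=0.5)
```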
Differentially private models are machine learning models designed to preserve privacy by incorporating differential privacy techniques during training or inference.
This is particularly beneficial for datasets containing distinct outliers or when an elevated level of privacy assurance is required. Differentially private models also enable sharing of the data synthesizer model itself, not merely the synthetic data it generates. However, it is important to underscore that such sharing necessitates the application of differential privacy methods throughout the model’s training process. In contrast, standard data synthesizers typically cannot be shared safely, as they may inadvertently reveal sensitive information when subjected to machine learning techniques designed to extract it.
To illustrate the principle of differentially private models, consider the Netflix dataset again. The core idea is to cap the influence any single data record can have on the data distribution learned by the differentially private data synthesizer. Put simply, if we were to retrain the model on the same dataset, minus the data from one individual, the resultant data distribution would not show substantial deviation. The maximum influence of a single observation is a quantifiable parameter of the differentially private model. This leads to a trade-off between privacy and utility, but a workable compromise can often be found, ensuring that both are upheld to a satisfactory degree.
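In practice, this cap is typically enforced with differentially private stochastic gradient descent (DP-SGD), which clips each record’s gradient and adds calibrated noise. Below is a minimal sketch using the Opacus library (assuming its 1.x API); the tiny network is a toy stand-in for a real data synthesizer.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from opacus import PrivacyEngine

# Toy stand-in for a data synthesizer's training setup.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = DataLoader(TensorDataset(torch.randn(512, 10), torch.randn(512, 10)),
                    batch_size=64)

privacy_engine = PrivacyEngine()
model, optimizer, loader = privacy_engine.make_private(
    module=model,
    optimizer=optimizer,
    data_loader=loader,
    max_grad_norm=1.0,      # caps each record's influence on the update
    noise_multiplier=1.0,   # Gaussian noise masking individual contributions
)

for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()

# The privacy budget spent so far, for a chosen delta.
epsilon = privacy_engine.get_epsilon(delta=1e-5)
```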
AI-generated synthetic data is ready for mainstream adoption
Synthetic data is rapidly asserting itself as a crucial technology for enhancing privacy, set to become a mainstay in modern data management. Its utility extends beyond simply safeguarding privacy, serving as a conduit to a wealth of untapped data potential, a prospect already being leveraged by numerous forward-thinking businesses. In this article, we have highlighted the benefits of synthetic data in facilitating secure data sharing. Yet its potential in data augmentation is perhaps even more exciting. By enabling data imputation and rebalancing, synthetic data can profoundly boost the efficiency of machine learning algorithms, delivering significant added value with minimal investment of cost or effort. We invite you to explore the myriad ways in which synthetic data can transform your business operations.
About the authors
Arne Rustad is a Data Scientist at BearingPoint in Oslo, with experience from multiple generative AI projects. He wrote his master’s thesis on synthetic data generation for tabular data, developing a new Generative Adversarial Network (GAN) model achieving state-of-the-art performance. Arne obtained his master’s degree in Physics and Mathematics from the Norwegian University of Science and Technology (NTNU). Email address: [email protected].
Helene Semb is a Data Scientist at BearingPoint in Oslo, with machine learning experience in computer vision and object detection. Helene recently obtained her master’s degree in Cybernetics and Robotics from the Norwegian University of Science and Technology (NTNU). Email address: [email protected].
References
[1] L. Sweeney, Simple Demographics Often Identify People Uniquely (2000), Carnegie Mellon University, Data Privacy Working Paper 3
[2] L. Rocher, J. M. Hendrickx and Y.-A. de Montjoye, Estimating the success of re-identifications in incomplete datasets using generative models (2019), Nature Communications
[3] Wired, Netflix Cancels Recommendation Contest After Privacy Lawsuit (2010, March 12)
[4] Gartner, Is Synthetic Data the Future of AI? (2022, June 22)