Researchers found child abuse material in the largest AI image generation dataset

CAT December 20, 2023 3 min read

Researchers from the Stanford Internet Observatory say that a dataset used to train AI image generation tools contains at least 1,008 validated instances of child sexual abuse material. The Stanford researchers note that the presence of CSAM in the dataset could allow AI models that were trained on the data to generate new and even realistic instances of CSAM.

LAION, the non-profit that created the dataset, told 404 Media that it “has a zero tolerance policy for illegal content and in an abundance of caution, we are temporarily taking down the LAION datasets to ensure they are safe before republishing them.” The organization added that, before publishing its datasets in the first place, it created filters to detect and remove illegal content from them. However, 404 points out that LAION leaders have been aware since at least 2021 that there was a possibility of their systems picking up CSAM as they vacuumed up billions of images from the internet.

According to previous reports, the LAION-5B dataset in question contains “millions of images of pornography, violence, child nudity, racist memes, hate symbols, copyrighted art and works scraped from private company websites.” Overall, it includes more than 5 billion images and associated descriptive captions. LAION founder Christoph Schuhmann said earlier this year that while he was not aware of any CSAM in the dataset, he hadn’t examined the data in great depth.

It’s illegal for most institutions in the US to view CSAM for verification purposes. As such, the Stanford researchers used several techniques to look for potential CSAM. According to their paper, they employed “perceptual hash‐based detection, cryptographic hash‐based detection, and nearest‐neighbors analysis leveraging the image embeddings in the dataset itself.” They found 3,226 entries that contained suspected CSAM. Many of those images were confirmed as CSAM by third parties such as PhotoDNA and the Canadian Centre for Child Protection.
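To make the three techniques named in the paper more concrete, here is a minimal sketch of how each kind of matching works in general. It is not the Stanford researchers' code. Every function name, the 64-bit integer hash representation, the distance threshold and the toy embeddings are assumptions made for illustration, and real systems rely on vetted hash lists and tools such as PhotoDNA rather than anything this simple.

```python
import hashlib
import math

# Illustrative sketch only: all names, thresholds and data here are hypothetical.

def sha256_hex(data: bytes) -> str:
    """Cryptographic hash: identical bytes give an identical digest."""
    return hashlib.sha256(data).hexdigest()

def exact_match(data: bytes, known_digests: set[str]) -> bool:
    """Cryptographic hash-based detection: flags only byte-for-byte copies."""
    return sha256_hex(data) in known_digests

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two integer-encoded hashes."""
    return bin(a ^ b).count("1")

def perceptual_match(candidate: int, known_hashes: list[int],
                     max_distance: int = 4) -> bool:
    """Perceptual hash-based detection: tolerates small differences, since
    visually similar images are assumed to produce hashes that differ in
    only a few bits. The threshold of 4 is an arbitrary example value."""
    return any(hamming_distance(candidate, h) <= max_distance
               for h in known_hashes)

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

def nearest_neighbors(query: list[float],
                      embeddings: dict[str, list[float]],
                      k: int = 2) -> list[str]:
    """Nearest-neighbors analysis: return the k entry ids whose embeddings
    are most similar to the query embedding."""
    ranked = sorted(embeddings,
                    key=lambda key: cosine_similarity(query, embeddings[key]),
                    reverse=True)
    return ranked[:k]
```

The design difference matters: a cryptographic hash misses an image that has been resized or re-encoded, a perceptual hash is meant to survive such changes, and a nearest-neighbors search over embeddings can surface entries that resemble an already flagged item without matching any known hash.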

Stability AI founder Emad Mostaque trained Stable Diffusion using a subset of LAION-5B data. Google’s Imagen text-to-image model was trained on a subset of LAION-5B as well as internal datasets. A Stability AI spokesperson told Bloomberg that it prohibits the use of its text-to-image systems for illegal purposes, such as creating or editing CSAM. “This report focuses on the LAION-5B dataset as a whole,” the spokesperson said. “Stability AI models were trained on a filtered subset of that dataset. In addition, we fine-tuned these models to mitigate residual behaviors.”

Stable Diffusion 2 (a more recent version of Stability AI’s image generation tool) was trained on data that substantially filtered out ‘unsafe’ materials from the dataset. That, Bloomberg notes, makes it more difficult for users to generate explicit images. However, it’s claimed that Stable Diffusion 1.5, which is still available on the internet, does not have the same protections. “Models based on Stable Diffusion 1.5 that have not had safety measures applied to them should be deprecated and distribution ceased where feasible,” the Stanford paper’s authors wrote.

This article originally appeared on Engadget at https://www.engadget.com/researchers-found-child-abuse-material-in-the-largest-ai-image-generation-dataset-154006002.html?src=rss


