Tag: AI

  • Evolving Chess Puzzles


    Robert Elmes

    An exploration of Evolutionary AI

    A chess puzzle, generated using the theory of evolution. Checkmate in 2 moves for white…

    Evolutionary Algorithms (EAs) are a subset of AI that solve problems using methods inspired by biological evolution. From optimizing neural networks to resource scheduling, they have a stunning range of applications in the real world. Their beauty emerges through a shift in focus in what’s required to solve a problem. Instead of describing the steps required to reach a goal, EAs describe what the goal looks like.

    In this article I will explore how we can utilize this fantastic AI to generate chess puzzles, the benefits it provides, and the drawbacks we need to consider.

    A chess puzzle is a legal chess position, where one unique combination of moves results in a win, often ending in a checkmate. They are typically found by analysing databases of competitive games between human players.

    By generating my own puzzles using nothing but code, randomness, and a sprinkle of biology, an interesting, diverse database of puzzles can be created. Let’s explore how.

    Evolutionary Algorithms typically work by randomly generating a large population of results, then picking the ‘fittest’ results using a heuristic and finally taking those ‘fittest’ results and generating subsequent random populations. They are inspired by Darwin’s theory of natural selection, where the animals in a population which are more likely to survive are also more likely to pass on their traits to the next generation. After many generations, sometimes hundreds of thousands, the population converges on an optimal result. So how can we apply this to chess?

    Population Generation

    With chess, we can create a population of random legal positions by simulating games in which the program plays random moves for white and black in turn, for a random number of moves. By repeating this process tens of thousands of times, large samples of random positions can be analyzed for fitness.

    Below, you can see a function from my Board class, which returns a list of moves.

    public List<(int[] from, int[] to)> GetAllPotentialMoves(Colour currentColour)
    {
        var activePieces = ActivePieces.Find(p => p.colour == currentColour);
        var allLegalMoves = new List<(int[] from, int[] to)>();

        foreach (var piece in activePieces.pieces)
        {
            var moves = piece.GetLegalMoves(this);

            allLegalMoves.AddRange(moves);
        }

        return allLegalMoves;
    }
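
    For readers who prefer Python, here is a minimal sketch of the same population-generation idea using the python-chess library. This is an illustrative rewrite, not the author’s implementation, and the function names and parameters are my own.

    import random
    import chess

    def random_position(max_moves: int = 40) -> chess.Board:
        """Play a random number of random legal moves from the start position."""
        board = chess.Board()
        for _ in range(random.randint(1, max_moves)):
            if board.is_game_over():
                break
            board.push(random.choice(list(board.legal_moves)))
        return board

    # Generate a large sample of candidate positions to evaluate for fitness.
    population = [random_position() for _ in range(10_000)]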

    Survival of the Fittest

    Once a population of positions has been generated, the really tricky bit starts. The key to any Evolutionary Algorithm is how you evaluate your heuristic. In my case, only positions with a single solution leading to checkmate were considered as puzzles. After narrowing the results down, the heuristic is a measure of how difficult it is to choose the correct moves to win the game. But how can a computer program estimate how difficult it is for a human to interpret a chess position?

    A puzzle generated using a heuristic favoring knights on the board. Checkmate in 2 moves.

    One option is to look at the structure of the puzzle. Is the king safe? Are there moves that don’t solve the puzzle, but look good? Do we sacrifice any material? What pieces are we moving? By evaluating many factors, we can create a measure of difficulty. The issue with this approach is it’s really hard to decide how to create a final score from so many factors. Rigid rules also completely ignore biases in human perception. It might be that even subtle changes to a chess position make it much harder for some individuals to pick the correct move.

    So, how can we get a better idea of human performance? By utilizing large databases filled with real games, machine learning models have been trained to play chess like players of certain levels. Through these models we can get a better idea of how players of different abilities might attempt a puzzle. Can an AI trained on 1200-rated players solve the puzzle? What about 1600, or 1900? The benefit of this approach is that it delves deeper into the minds of real players. However, machine learning models are not without their drawbacks. These AIs don’t play like a real player; they play like an approximation of a player. They’re also trained on real, regular games, meaning they might be unreliable when evaluating randomized chess positions.

    By combining the machine learning models with complex, detailed rule-based evaluation, I created a best-of-both-worlds scenario: a heuristic that understands the composition of the puzzle while also considering how humans might approach it.
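
    As a rough illustration of how such a combined heuristic could be structured, the sketch below blends a rule-based structural score with the pass/fail results of rating-restricted models. Every function name, input and weight here is hypothetical; it shows the shape of the idea rather than the author’s implementation.

    def combined_difficulty(structural_score, solved_by_rating):
        """Blend a rule-based structure score with simulated player results.

        structural_score: hypothetical 0-1 score from rules (king safety,
            sacrifices, tempting-but-wrong moves, ...).
        solved_by_rating: hypothetical results from models imitating players,
            e.g. {1200: False, 1600: False, 1900: True}.
        """
        # Harder puzzles are solved by fewer of the simulated players.
        failure_rate = sum(not solved for solved in solved_by_rating.values()) / len(solved_by_rating)
        # Arbitrary example weighting of the two signals.
        return 0.4 * structural_score + 0.6 * failure_rate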

    The Next Generation

    Once the best puzzles in a population have been found, the next step is to create new generations. This can be done through many evolution inspired techniques. I chose to use crossover and mutation.

    Crossover involves randomly merging the features of two results in the hope you might end up with the best features of both. We can cross over similar chess positions by going back a number of moves to a shared starting place, then picking legal moves used to reach each result. Perhaps moving the queen gave one puzzle a really good property, and moving the knight made another puzzle interesting. By combining both of these features we create an even more compelling problem.

    Similarly, we can mutate puzzles by backtracking a number of moves and then playing forwards again. Depending on how many moves you go backwards and forwards, this can change the puzzle subtly or massively. Too much mutation and you can find the algorithm never improves; too little and your best result could converge on a single value too quickly.
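
    A mutation step of this kind might look like the following sketch, again using python-chess for illustration rather than the author’s own engine code.

    import random
    import chess

    def mutate(board: chess.Board, depth: int = 3) -> chess.Board:
        """Backtrack a few moves, then replay random legal moves forward."""
        mutant = board.copy()
        steps = min(depth, len(mutant.move_stack))
        for _ in range(steps):  # undo the last few moves
            mutant.pop()
        for _ in range(steps):  # replay forward with random choices
            if mutant.is_game_over():
                break
            mutant.push(random.choice(list(mutant.legal_moves)))
        return mutant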

    So… What’s the Catch?

    The most common issue with Evolutionary Algorithms is converging too fast. Initially, the puzzles I was generating stopped improving after only a few generations. In the real world, physical boundaries such as mountains, deserts and seas have prevented populations from crossing over their DNA, allowing genetic diversity to be preserved. Without enough genetic diversity, a population won’t evolve very far. By running smaller populations of chess puzzles in parallel for a while, I gave them enough breathing room to maintain some diversity and avoid converging too early.
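
    This is essentially the ‘island model’ from evolutionary computation. A minimal sketch of the idea, with the per-generation evolution and the merging step left as hypothetical callables, might look like this:

    def evolve_islands(islands, isolation_generations, evolve_one_generation, merge_islands):
        """Evolve several sub-populations in isolation before mixing them.

        islands: list of independent puzzle populations.
        evolve_one_generation, merge_islands: hypothetical callables supplied
            by the caller (selection + crossover + mutation, and merging).
        """
        for _ in range(isolation_generations):
            islands = [evolve_one_generation(population) for population in islands]
        # Only after the isolation period do the islands exchange individuals.
        return merge_islands(islands)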

    Evolutionary Algorithms can also be very slow. Chess is certainly no exception. Running heuristic evaluation on millions of chess positions requires a considerable amount of processing. Generally, the longer you run a chess engine on a position, the more accurately it can predict the next best move. By finding the sweet spot in time spent analysing each position, then picking out the most promising ones and looking at them in much more detail, I could optimise the time spent to a reasonable degree. Deciding when to stop generating is also crucial. If a sample has stopped improving for several generations, then perhaps it’s best to start again with a new random population, as it may be unable to improve much further. After countless optimisations, my home PC is able to generate over 1000 challenging puzzles per day using evolution.

    Finally, diagnosing errors can be incredibly difficult. With many programs you can expect certain outputs given certain inputs. With evolution it’s a different kettle of fish. I spent a lot of time scratching my head wondering why my population was converging too quickly. Was it position generation? Was it the evolutionary methods, or perhaps the heuristic? It can be easy not to notice that something isn’t working as intended when the expected output of a program cannot be clearly defined.

    Results

    Issues aside, however, the power and potential of this AI technique shine bright for all to see. Using just my old PC, I’ve been able to generate almost 50,000 chess puzzles in 3 months, containing an abundance of weird and wonderful positions.

    The random nature of the algorithm means that it creates an incredibly colourful and diverse set of puzzles. Interesting tactical problems we rarely see in chess, such as queen sacrifices, knight promotions and en passant, are easy to find using evolution but difficult to find in databases of real games. However, the nonsensical nature of the puzzles makes them less applicable to real-world scenarios. Although they are great fun, an argument could be made that puzzles based on real games are better for learning common patterns in chess.

    As well as being incredibly productive, the algorithm is also exceptionally flexible. Shatranj, lopsided chess boards: it’s easy to extend the EA to work with any derivative of chess. This extensibility is where the evolutionary technique really excels. You just can’t do this with databases of games, as they simply don’t exist!

    A Shatranj puzzle generated by the algorithm. Can you checkmate the white king in 2 moves?

    To Conclude

    Although Evolutionary Algorithms are a forgotten corner of AI to many, I’ve shown how evolution can be used to create a novel solution to a real-world problem. There’s much unexplored potential in this technology. With generative AI on the rise, I wonder what other funky applications people will find for EAs in the future…

    You can experience the puzzles for yourself on my website, chesspuzzler.com.

    Unless otherwise noted, all images are by the author.



  • Customizing RStudio Container with Docker Compose

    Rami Krispin

    A step-by-step guide for customizing and launching RStudio Server inside a container with Docker Compose

    In this post, we will review the steps of setting up a Docker-Compose workflow to launch an RStudio Server inside a container. We will introduce the Docker Compose settings process and discuss when you should consider using it. Last but not least, we will demonstrate how to transition the docker run workflow that was introduced in the previous post and launch the Rocker RStudio image with the Docker Compose process.

    Related articles:

    Running RStudio Inside a Container

    By the end of this tutorial, you will be able to transition your docker run settings to a docker-compose.yml file and seamlessly launch your RStudio container with the docker-compose command.

    Launching RStudio Server inside a container with Docker Compose (screenshot by the author)

    Motivation

    Before getting started, let’s explain what Docker Compose is and when you should consider using it.

    Starting with the what — Docker Compose is a simple framework for launching one or more containers. It is a wrapper around the docker run command, and it manages the container launch settings using a YAML file. It is recommended to set up a Docker Compose workflow instead of using the docker run command when:

    • The number of arguments of the docker run command increases, and it becomes convoluted to manage them via the CLI
    • You regularly use and work with the container
    • The level of complexity is high (e.g., launching multiple containers in parallel, etc.)

    We use the docker-compose.yml file to set up a Docker Compose framework and launch it with the docker-compose command. This process includes mapping the docker run arguments into YAML format. Simple docker-compose.yml file settings will include the following two arguments:

    • Version — the Docker Compose version, which is currently 3.9
    • Services — a list of containers to launch and their corresponding arguments

    Let’s illustrate the mapping process of the docker run command arguments to the docker-compose.yml file with the below example:

    docker run -argument_a argument_a_values \
        -argument_b argument_b_values \
        -argument_c argument_c_values \
        IMAGE_NAME/IMAGE_TAG

    This command has three arguments (argument_a, argument_b and argument_c) with corresponding values argument_a_values, argument_b_values and argument_c_values, and it calls the image IMAGE_NAME/IMAGE_TAG.

    The below docker-compose.yml represents the mapping of the above docker run arguments:

    version: "3.9"
    services:
    my_service:
    image: "IMAGE_NAME/IMAGE_TAG"
    argument_a:
    - "argument_a_values"
    argument_b:
    - "argument_b_values"
    argument_c:
    - "argument_c_values"

    The version and services arguments, as mentioned above, define the Docker Compose version and the list of images to launch during the run time, respectively. In this case, we use the most recent version, 3.9, and define a single container under the services argument named my_service. Under the my_service section, we define the run time arguments corresponding to the above docker run command arguments following the standard YAML format.

    It is important to note that the naming convention mapping between the docker run command arguments and their settings in the docker-compose.yml file is not always one-to-one. The Docker Compose documentation is a great resource for identifying the argument settings.

    In the next section, we will connect the dots and map the docker run command we set in the previous tutorial to a docker-compose.yml file.

    Setting RStudio with Docker Compose

    Recall that in the previous tutorial, we used the below docker run command to launch the RStudio server inside a container:

    docker run --rm -ti \
        -v .:/home/rstudio \
        -v $HOME/.config/rstudio:/home/rstudio/.config/rstudio \
        -v $HOME/.Renviron:/home/rstudio/.Renviron \
        -e PASSWORD=yourpassword \
        -p 8787:8787 rocker/rstudio

    In short, the above run command uses the following arguments:

    • Volume or v to mount local folders with the container file system
    • Environment or e to set the RStudio server password as an environment variable
    • Port or p to map between the local and container ports

    The below YAML file represents the mapping of the above docker run command:

    version: "3.9"
    services:
    rstudio:
    image: "rocker/rstudio"
    ports:
    - "8787:8787"
    volumes:
    - type: "bind"
    source: "."
    target: "/home/rstudio"
    - type: "bind"
    source: "$HOME/.config/rstudio"
    target: "/home/rstudio/.config/rstudio"
    - type: "bind"
    source: "$HOME/.Renviron"
    target: "/home/rstudio/.Renviron"
    environment:
    - PASSWORD=yourpassword

    Where we set a single service named rstudio under the services argument and defined the corresponding run arguments:

    • image — defines the image name, in this case, using the RStudio Rocker image rocker/rstudio
    • ports — sets the port mapping between the local machine and the container
    • volumes — maps the folders mount, using the type argument to define the type of mount and the source and target arguments to define the local and container folder path mapping. More details on the volume arguments can be found here.
    • environment — defines environment variables; in this case, we set the PASSWORD variable to define the RStudio server password

    Once the YAML file is set, we can use the docker-compose command on the CLI to launch the RStudio container:

    docker-compose up

    Where the up argument is used to launch the container. You should expect the following output:

    [+] Running 2/2
    ✔ Network rstudio-docker_default Created 0.1s
    ✔ Container rstudio-docker-rstudio-1 Created 0.1s
    Attaching to rstudio-docker-rstudio-1
    rstudio-docker-rstudio-1 | [s6-init] making user provided files available at /var/run/s6/etc...
    rstudio-docker-rstudio-1 | exited 0.
    rstudio-docker-rstudio-1 | [s6-init] ensuring user provided files have correct perms...
    rstudio-docker-rstudio-1 | exited 0.
    rstudio-docker-rstudio-1 | [fix-attrs.d] applying ownership & permissions fixes...
    rstudio-docker-rstudio-1 | [fix-attrs.d] done.
    rstudio-docker-rstudio-1 | [cont-init.d] executing container initialization scripts...
    rstudio-docker-rstudio-1 | [cont-init.d] 01_set_env: executing...
    rstudio-docker-rstudio-1 | skipping /var/run/s6/container_environment/HOME
    rstudio-docker-rstudio-1 | skipping /var/run/s6/container_environment/PASSWORD
    rstudio-docker-rstudio-1 | skipping /var/run/s6/container_environment/RSTUDIO_VERSION
    rstudio-docker-rstudio-1 | [cont-init.d] 01_set_env: exited 0.
    rstudio-docker-rstudio-1 | [cont-init.d] 02_userconf: executing...
    rstudio-docker-rstudio-1 | [cont-init.d] 02_userconf: exited 0.
    rstudio-docker-rstudio-1 | [cont-init.d] done.
    rstudio-docker-rstudio-1 | [services.d] starting services
    rstudio-docker-rstudio-1 | [services.d] done.

    After launching the container, you can access the RStudio server from your browser using the local host address with the port number, in this case — http://localhost:8787:

    RStudio Server inside a container (screenshot by the author)

    Note: Once the container is launched with the docker-compose up command, it keeps the CLI attached to the terminal until you stop it. Alternatively, you can add the -d flag to run it in detached mode:

    docker-compose up -d

    Likewise, the docker-compose down command stops the running container.

    Summary

    In this tutorial, we reviewed how to set up a Docker Compose framework to launch your RStudio container. This includes setting a docker-compose.yml file and using the docker-compose command to launch the container concisely.

    The motivations for wrapping your docker run command with Docker Compose are:

    • Efficient and concise — requires a one-time setup; afterward, launching is simple with the docker-compose command (as opposed to a long docker run command)
    • Higher complexity — it simplifies the process of launching one or multiple containers seamlessly. For example, one good use case would be running RStudio and a Postgres database together. In this case, you can set up the Docker Compose process to launch the two containers to work side by side




  • The Death of the Static AI Benchmark


    Sandi Besen

    Benchmarking as a Measure of Success

    Benchmarks are often hailed as a hallmark of success. They are a celebrated way of measuring progress — whether it’s achieving the sub-4-minute mile or the ability to excel on standardized exams. In the context of Artificial Intelligence (AI), benchmarks are the most common method of evaluating a model’s capability. Industry leaders such as OpenAI, Anthropic, Meta and Google compete in a race to one-up each other with superior benchmark scores. However, recent research studies and industry grumblings are casting doubt on whether common benchmarks truly capture the essence of a model’s ability.

    Source: Dalle 3

    Data Contamination Leading to Memorization

    Emerging research points to the probability that the training sets of some models have been contaminated with the very data they are being assessed on — raising doubts about whether their benchmark scores reflect true understanding. Just like actors in films who portray doctors or scientists, they deliver the lines without truly grasping the underlying concepts. When Cillian Murphy played the famous physicist J. Robert Oppenheimer in the movie Oppenheimer, he likely did not understand the complex physics theories he spoke of. Although benchmarks are meant to evaluate a model’s capabilities, are they truly doing so if, like an actor, the model has memorized them?

    Recent findings from the University of Arizona discovered that GPT-4 is contaminated with the AG News, WNLI, and XSum datasets, discrediting their associated benchmarks [1]. Further, researchers from the University of Science and Technology of China found that when they deployed their “probing” technique on the popular MMLU Benchmark [2], results decreased dramatically.

    Their probing techniques included a series of methods meant to challenge the model’s understanding of a question when it is posed in different ways, with different answer options but the same correct answer. Examples of the probing techniques included: paraphrasing questions, paraphrasing choices, permuting choices, adding extra context to questions, and adding a new choice to the benchmark questions.
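
    To make one of these perturbations concrete, the sketch below permutes the answer choices of a multiple-choice item while tracking where the correct answer moves. It is an illustrative rewrite, not code from the cited papers.

    import random

    def permute_choices(question, choices, answer_index, seed=0):
        """Shuffle the answer options and return the new index of the correct one."""
        rng = random.Random(seed)
        order = list(range(len(choices)))
        rng.shuffle(order)
        shuffled = [choices[i] for i in order]
        new_answer_index = order.index(answer_index)
        return question, shuffled, new_answer_index

    # Example: the correct answer "Paris" is still tracked after shuffling.
    q, opts, ans = permute_choices(
        "What is the capital of France?",
        ["London", "Paris", "Rome", "Madrid"],
        answer_index=1,
    )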

    From the graph below, one can gather that although each tested model performed well on the unaltered “vanilla” MMLU benchmark, when probing techniques were added to different sections of the benchmark (LU, PS, DK, All) they did not perform as strongly.

    “Vanilla” represents performance on the unaltered MMLU Benchmark. The other keys represent performance on the altered sections of the MMLU Benchmark: Language Understanding (LU), Problem Solving (PS), Domain Knowledge (DK), All.

    Future Considerations on how to evaluate AI

    This evolving situation prompts a re-evaluation of how AI models are assessed. The need for benchmarks that both reliably demonstrate capabilities and anticipate the issues of data contamination and memorization is becoming apparent.

    As models continue to evolve and are updated to potentially include benchmark data in their training sets, benchmarks will have an inherently short lifespan. Additionally, model context windows are increasing rapidly, allowing a larger amount of context to be included in the model’s response. The larger the context window, the greater the potential for contaminated data to indirectly skew the model’s learning process, biasing it towards the seen test examples.

    The Rise of the Dynamic and Real-World Benchmark

    To address these challenges, innovative approaches such as dynamic benchmarks are emerging, employing tactics like altering questions, complicating questions, introducing noise into the question, paraphrasing the question, reversing the polarity of the question, and more [3].

    The figure below illustrates several methods for altering benchmark questions (either manually or generated by a language model).

    Source: Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation

    As we move forward, the imperative to align evaluation methods more closely with real-world applications becomes clear. Establishing benchmarks that accurately reflect practical tasks and challenges will not only provide a truer measure of AI capabilities but also guide the development of Small Language Models (SLMs) and AI Agents. These specialized models and agents require benchmarks that genuinely capture their potential to perform practical and helpful tasks.



  • Step by Step Basics: Code Autodocumentation

    Lucy Dickinson

    Generate flawless documentation of Python code using Sphinx

    Photo by Dustin Humes on Unsplash

    You can build beautiful, standardised and stylised documentation using just the docstrings in a few simple steps.

    A Data Scientist holds many responsibilities when working on a project, and one that is usually left until the last minute is documentation. Perhaps you’re diligent with writing docstrings for classes and functions (well done!) — but should that be the resting place of your documentation?

    In my opinion, documentation should sit independently from your code. Your team (or you in a few months’ time) shouldn’t have to trawl through hundreds of lines of code in your python modules to understand what’s going on. You can build beautiful, standardised and stylised documentation using just the docstrings in a few simple steps and make your project speak for itself.

    In this article, I’ll focus on using the Sphinx framework for autodocumenting python modules. I’ll also be using the Cookiecutter Data Science project template in Visual Studio Code (VS Code) due to its seamless integration with Sphinx and standardised directory structure. Whilst the official sphinx tutorial documentation is a great resource for those wanting to take a deep dive into this topic, my aim for this article is to be a helpful ‘quick start’ guide to take you through the key steps. Enjoy 🙂

    A note on docstrings

    The key to good documentation is docstrings. These are the comment blocks that sit within each class, class method and function that describe the nature of the code, along with the inputs, outputs and raised errors. There are three core docstring formats. These are Google, reStructuredText (reST) and NumPy. They all contain the same information, the only difference is that they are formatted differently. You can see examples of each docstring format here.

    I’ll be using the Google docstring format as it is easy to read and takes up less space than the others. The below code block is a typical example of a Google docstring:

    """Description of the function, class or method etc.

    Args:
    varA (str): Description of varA
    varB (bool): Description of varB

    Returns:
    list: Description of returned list

    Raises:
    ValueError: Description of raised error
    """

    Top tip. download the ‘autoDocstring — Python Docstring Generator’ in VS Code to automatically generate a docstring when you type three double quotation marks (i.e. the start of a docstring). Be sure to finish writing the function before generating the docstring so that all inputs/outputs/errors get included in the docstring template that gets generated for you!

    Let’s move on to making the documentation!

    For the purpose of this demo, I have created a python module demo.py which contains a class and three basic functions (all annotated with docstrings, with the exception of one function). It is this module that I’ll be building documentation for in this article. The contents of this demo.py module are shown below:

    Contents of demo.py module to be documented. Snapshot taken using CodeSnap extension in VS Code.

    1. Setup

    First thing is to get everything setup. You’ll need to install VS Code and setup a new project along with Sphinx. There are a few options here. You can either a) set up a new project using Cookiecutter (where the relevant Sphinx setup will be generated along with standardised directories) or b) create your own project structure and install sphinx separately.

    option A — install Cookiecutter

    In the terminal, pip install Cookiecutter and then create a new project:

    pip install cookiecutter
    cookiecutter https://github.com/drivendata/cookiecutter-data-science

    Next, answer the questions that appear in the terminal window and your new project will be created. The Sphinx framework will be stored in the /docs directory of your project.

    option B — Sphinx quickstart

    If the Cookiecutter template doesn’t take your fancy, you can create your own project structure from scratch and install sphinx. It is a good idea to make a documentation directory and install sphinx there. In the terminal:

    mkdir docs
    cd docs

    pip install sphinx
    sphinx-quickstart

    2. Understanding Sphinx folder structure

    After you’ve installed Sphinx using one of the options above, there will be some files that appear in the documentation directory in your project. The conf.py file is the key configuration file which you’ll edit to make your documentation bespoke — more detail on this in the next section. The index.rst file acts as a contents for your documentation. You can find more information on the index.rst file here. The getting-started.rst and commands.rst files are suggested templates for your documentation. You can remove these if necessary. The make files (make.bat and Makefile) are used to actually make the documentation. You don’t need to edit these but will call them in the terminal window when you’re ready to make the documentation.

    Default Sphinx files installed

    3. Conf.py file

    The configuration file is where the magic happens. This file is used during the build process and so it is crucial that you have this set up correctly. Below are some steps to modifying the conf.py file:

    Uncomment the sys.path line (line 20 in my setup):

    #  sys.path.insert(0,  os.path.abspath('.')) 

    Change the path in os.path.abspath to the relative location of the code you want documented (relative to the conf.py file). For example, the python modules that I want documented sit within the src/ directory of my project. Hence, I will change os.path.abspath to look in the /src directory, which is located in the parent folder of the conf.py file. You can specify the relative location using the . and / syntax:

    sys.path.insert(0,  os.path.abspath('../src'))

    """
    # you can use the following syntax to specify relative locations:

    '.' # current path of conf.py
    '..' # parent path of conf.py
    '../..' # parent of the parent path of conf.py
    """
    The relative location of the directory containing the python modules to the documentation folder. In this example, ‘demo.py’ is the module to be documented, located in the src/data/ directory.

    Add in the relevant extensions. You’ll need to add in some extensions to the conf.py file to gain extra functionality when creating your documentation. These are all optional and you can have some fun exploring the different extensions available here. Here are the 5 extensions that I recommend at minimum:

    • sphinx.ext.autodoc— use documentation from docstrings
    • autodocsumm— generate a tabular summary of all docstrings at the top of the html page by listing out the docstring summaries only. Useful when you have a lot of docstrings. Note. you will need to pip install autodocsumm in the terminal.
    • sphinx.ext.napoleon — enables Sphinx to parse google docstrings
    • sphinx.ext.viewcode — adds a link to a html page containing the source code for each module
    • sphinx.ext.coverage — provides a summary of how many classes/functions etc have docstrings. Good coverage signifies that a codebase is well explained.

    Here’s how to include these extensions in the conf.py file (line 29 in my setup):

    #  add in the extension names to the empty list variable 'extensions'
    extensions = [
        'sphinx.ext.autodoc',
        'sphinx.ext.napoleon',
        'autodocsumm',
        'sphinx.ext.coverage'
    ]

    # add in this line for the autosummary functionality
    autodoc_default_options = {'autosummary': True}

    Change the theme. The default theme of the documentation is quite clean, although you may prefer to play around with different options by changing the ‘html_theme’ variable (line 94 in my setup) from ‘default’ to one of the standard theme options or some third party options. In this demo, I’ll show the default and Read the Docs themes.

    html_theme = 'sphinx_rtd_theme' # read the docs theme. This variable is 'default' by default.

    Note. you will need to pip install any non-standard (third party) themes.

    4. Make the html pages

    Now that your conf.py file is set up and you have glorious docstrings in your code, we’re ready to do some scraping and build some html pages.

    Generate .rst files of your python packages

    These files are the precursor to the html pages and are the native format for Sphinx. They need to be generated before making the html files. You’ll use the sphinx-apidoc command, which uses the autodoc extension to locate all python modules (e.g. any .py files) within the sys.path location that you specified in the conf.py file. There are some optional parameters to include when using the sphinx-apidoc command, which you can find in the documentation, but I used the following template:

    Note. in the terminal, change directory to the root of the project to run the following code.

    sphinx-apidoc -f -o output_dir module_dir/

    -f (force overwriting any existing generated files).

    -o output_dir (directory to place the output files. If it does not exist, it is created). Note. replace ‘output_dir’ with a directory name of your choice. I set mine to the /docs directory.

    module_dir (location of python packages to document)

    After running this command, there should be newly generated .rst files in the docs folder.

    Contents of documentation folder after running sphinx-apidoc command to generate .rst files

    Notice that two new .rst files have been generated: data.rst and modules.rst. In addition to modules.rst, a .rst file will be generated for each directory that contains at least one python module. In my example, data.rst is generated as I have saved my demo.py file in the src/data directory. If you have multiple directories of python modules within the location you specified in sys.path in the conf.py file, then multiple .rst files will be generated. Note. These files do not contain the scraped documentation just yet, they just contain the information required for autodoc to make the html files in the next step.

    Edit index.rst file

    Remember, index.rst acts as a contents page so we must edit this file to include all python modules we want documenting. Luckily, the modules.rst references the source location of all python modules identified in the sys.path, so you can simply add this file to index.rst.

    To do this, open the index.rst file and add ‘modules’ underneath the toctree (table of contents tree) section. Make sure there is a blank line between the :maxdepth: parameter and the names of the .rst files.
    Note. ‘getting-started’ and ‘commands’ will already be in the index.rst file. You can delete them from this file if you do not want to generate html pages (although a ‘getting-started’ page is probably a good idea!)

    Contents of the index.rst file. I have added in ‘modules’ so that the modules.rst file is used in the html generation process.

    Make html files

    Now we can use the make files in your documentation directory to build the html files. These files will appear in the _build/html/ directory within your documentation folder. You can preview these in VS code if you download the ‘HTML Preview’ extension.

    Change directory to where the make.bat file is located and run the following command in cmd terminal:

    make html

    Note. if you are using windows powershell terminal (rather than cmd), use the following syntax:

    .\make.bat html

    Top tip. if a warning arises when using the make html command that states ‘autodoc: failed to import module’, this is most likely due to autodoc not being able to find your modules as the sys.path has not been configured correctly in conf.py. Make sure this points to the directory where your python modules are located.

    Editing html files

    If you wish to edit your docstrings and update your html files with the changes, then you can do so using the following command:

    make clean html

    Let’s take a look at our documentation!

    As I mentioned above, I have created documentation for my python module demo.py in two different themes, seen in the images below: ‘default’ (left image) and ‘Read the Docs’ (right image). The content is identical, but the look and feel are different. Let’s take note of the core features:

    • Navigation bar on left hand side
    • A summary of all classes or functions belonging to the module in tables at the top of the page (thanks to the ‘autodocsumm’ extension)
    • Detailed list of docstring components for all functions and classes below the summary
    Examples of documentation html pages for a sample python module using default theme (left image) and the read the docs theme (right image), generated using Sphinx.

    Once you’ve created the html pages, you’ll notice a variety of hierarchical html pages will be generated. These will include a home page and pages for each package and module. Have a look around the pages to familiarise yourself with their structure and read the official documentation to see how you can customise them further.

    For example, if you wanted to be able to see the raw code of each function in the documentation, add the extension ‘sphinx.ext.viewcode’ to the conf.py file. This will add a hyperlink next to each function or class method which will reveal the raw code to allow for easy interrogation without having to delve into the codebase.

    Closing thoughts

    And there we have it. Simple and beautiful documentation of your Python modules made in a few easy steps using Sphinx! I hope you have enjoyed learning about autodocumentation and find it to be a useful tool to implement in your projects. If you have any useful tips then feel free to add a comment 🙂



  • Automated Detection of Data Quality Issues


    Simon Grah

    This article is the second in a series about cleaning data using Large Language Models (LLMs), with a focus on identifying errors in tabular data sets.

    The sketch outlines the methodology we’ll explore in this article, which focuses on evaluating the Data Dirtiness Score of a tabular data set with minimal human involvement.

    The Data Dirtiness Score

    Readers are encouraged to first review the introductory article on the Data Dirtiness Score, which explains the key assumptions and demonstrates how to calculate this score.

    As a quick refresher, the Data Dirtiness Score estimates the expected proportion of cells in a data set that contain errors. Here are the key hypotheses behind this metric:

    • Data errors are related to violated constraints.
    • If there are no expectations, there is no effect on the score.
    • Data problems can be pinpointed to specific cells.
    • Each data error is assigned a confidence score.
    • Every cell has an equal impact on the overall score.

    The initial step in this process involves identifying and cataloguing data inaccuracies present within the data set.

    The Importance of Detecting Data Quality Issues Automatically

    Detecting data issues is crucial in the process but challenging due to several factors:

    • High Human Labelling Cost: Identifying data errors often needs significant input from data professionals (like scientists, engineers, and analysts) or subject matter experts (SMEs). This requires a lot of time and is expensive.
    • Lack of Enthusiasm Among Data Practitioners for this Grunt Work: It’s no secret that many in the field view data cleaning as a less appealing aspect of their work. Seen as a precursor to more engaging activities such as modelling, building modern data stacks or answering business queries, data cleaning often falls lower on the priority list, leading to procrastination or, in some cases, to it being ignored entirely until critical issues arise.
    • SME Limitations: SMEs have valuable knowledge but might lack technical skills like SQL or programming. While no-code and low-code tools help to some extent, they haven’t been fully adopted and might not cover all data management aspects, such as version control.
    • The Expertise Gap: Effective data cleaning transcends basic skill sets, requiring specialised expertise. The lack of training and the general disinterest in data preparation mean that many practitioners may only identify superficial errors, missing more complex issues that require a deeper understanding of data cleaning.

    Despite the inherent challenges, advancements in the field of Large Language Models (LLMs) offer promising solutions for automating the identification of straightforward data issues and uncovering more intricate data quality problems.

    Data Error Detection powered by LLMs

    Large language models are becoming invaluable tools in automating the detection of data quality issues, serving as an efficient starting point for a productive human-in-the-loop iterative process. Models, such as those discussed in papers like Jellyfish: A Large Language Model for Data Preprocessing, Can language models automate data wrangling? and Large Language Models as Data Preprocessors, demonstrate their potential to automate constraint generation and data error detection. This automation doesn’t replace human intervention but rather enhances it, allowing for the review and adjustment of automated constraints by either addressing issues directly or modifying confidence scores to reflect the uncertainty inherent in data error detection.

    LLMs are particularly well-suited for detecting data quality issues due to their extensive training on a diverse range of internet content, including a vast array of domain knowledge and numerous examples of code reviews related to data quality issues. This training enables LLMs to identify data errors based on textual content without the need for explicitly defined rules. By converting tabular data sets into plain text (called serialisation), LLMs can scrutinise data much like a team of experienced humans, leveraging their “compressed” internet knowledge to pinpoint errors. This extensive training allows them to identify potential errors in human-readable data sets, such as CSV files, with a level of intuition that mimics human expertise. Moreover, any gaps in domain-specific knowledge can be bridged through techniques like Retrieval-Augmented Generation (RAG) or by tailoring the model’s prompts to the specific nature of the data set.

    Another key advantage of employing LLMs in data error detection is their ability to handle the inherent uncertainty associated with data quality issues. Not all errors are straightforward, and even experts can sometimes disagree on what constitutes a data issue. LLMs can assign confidence scores to their findings, like a human does based on a mix of intuition and experience, reflecting the estimated likelihood of an error.

    The challenge of generalising error detection across diverse data sets and potential issues is considerable. Traditional methods often resort to an extensive set of decision rules or a combination of specialised machine learning models to address various scenarios, such as checking the validity of addresses and phone numbers or anomaly detection. This is where LLMs shine, offering a more adaptable and less labour-intensive alternative. Their ability to understand and identify a wide range of data quality issues without extensive rule-based systems or domain-specific models makes them an invaluable tool. The analogy with the advantages of Machine Learning approaches over traditional business rules or statistical methods is quite intriguing. The adoption of machine learning has been driven by its relative ease of use and adaptability across different use cases, requiring less domain-specific knowledge and time to implement.

    Next, we will demonstrate this approach through a practical example.

    A Case Study

    In the previous article, we explored the concept of the Data Dirtiness Score using a data set example from the book Cleaning Data for Effective Data Science. The data set in question is as follows:

    Student#,Last Name,First Name,Favorite Color,Age
    1,Johnson,Mia,periwinkle,12
    2,Lopez,Liam,blue,green,13
    3,Lee,Isabella,,11
    4,Fisher,Mason,gray,-1
    5,Gupta,Olivia,9,102
    6,,Robinson,,Sophia,,blue,,12

    The data errors were already pointed out there. Now, we want to explore how we can use a Large Language Model, specifically GPT-4, to find these errors automatically. This method offers a modern way to spot issues in data sets but comes with possible risks, such as privacy concerns when using external APIs. However, it can work with any LLM, not just GPT-4, although the effectiveness might vary depending on the model’s capabilities.

    Preliminary Step: Retrieve Table Annotation

    To assist the model in identifying data inconsistencies, it’s beneficial to provide additional context about the data frame. This is precisely the role of a data catalog which, although a broad topic, we will simplify to focus solely on the essential context information that an LLM requires to detect data errors when examining batches of data set rows.

    The key metadata needed includes:

    • An overview of the table, including its description and purpose.
    • A clear understanding of each column’s meaning and type.

    Given the frequent absence of data catalogs or reliable documentation in organisations, we’ll explore how to use LLMs to speed up this process. This process is known as Table Annotation, which involves identifying semantic information about table elements, including columns, their relationships, and the entities within the cells. For further details, refer to sources such as Column Type Annotation using ChatGPT, Annotating Columns with Pre-trained Language Models, or SOTAB: The WDC Schema.org Table Annotation Benchmark.

    Here’s the prompt I use:

    Analyse the table below and provide schema annotations based on Schema.org standards.

    Student#,Last Name,First Name,Favorite Color,Age
    1,Johnson,Mia,periwinkle,12
    2,Lopez,Liam,blue,green,13
    3,Lee,Isabella,,11
    4,Fisher,Mason,gray,-1
    5,Gupta,Olivia,9,102
    6,,Robinson,,Sophia,,blue,,12

    Follow these steps:
    1. Identify the overall semantic type of the table.
    2. Provide a short description of each column.
    3. Annotate each column with its semantic type from Schema.org.
    4. Determine the most suitable data type for each column (after data cleaning).

    Summarise the table schema as follows:
    - Table Semantic Type: <type>
    - Column: <name>, Description: <description>, Semantic Type: <Schema.org type>, Pandas Type: <Pandas data type>

    In the prompt instructions, I direct the model to analyse the provided table (or an overview of the table) and to suggest annotations following the Schema.org standards. Specifically, the output should include:

    • The table’s semantic type
    • A brief description of each column
    • The column’s annotation type from Schema.org, where applicable
    • The ideal or best-suited data types for each column, regardless of data issues in the provided text serialisation

    The response is then formatted to provide a clear and structured summary that can be reused as context in subsequent prompts:

    Summarise the table schema as follows:
    - Table Semantic Type: <type>
    - Column: <name>, Description: <description>, Semantic Type: <Schema.org type>, Pandas Type: <Pandas data type>

    The table is incorporated using a {table} placeholder in the prompt. The typical method involves converting tabular data into text through serialisation, as discussed in Large Language Models(LLMs) on Tabular Data: Prediction, Generation, and Understanding — A Survey.
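
    As a minimal example of this kind of serialisation with pandas (the exact format used in the article’s pipeline may differ):

    import pandas as pd

    # A toy slice of the student table used in the article.
    df = pd.DataFrame(
        {"Student#": [1, 2], "Last Name": ["Johnson", "Lopez"], "Age": [12, 13]}
    )

    # One simple serialisation: CSV text with the row index included.
    table_text = df.to_csv(index=True)

    # The serialised text is then substituted for the {table} placeholder.
    prompt = "Analyse the table below...\n\nTable:\n{table}".format(table=table_text)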

    Here is a sample response from GPT-4:

    - Table Semantic Type: Person
    - Column: Student#, Description: Unique identifier for each student, Semantic Type: identifier, Pandas Type: int
    - Column: Last Name, Description: Family name of the student, Semantic Type: familyName, Pandas Type: string
    - Column: First Name, Description: Given name of the student, Semantic Type: givenName, Pandas Type: string
    - Column: Favorite Color, Description: Preferred color of the student, Semantic Type: color (custom), Pandas Type: string (or list if cleaned for multiple values)
    - Column: Age, Description: Age of the student, Semantic Type: age, Pandas Type: int (after cleaning invalid entries)

    Responses may vary slightly but are generally consistent for such a straightforward example. The aim here is to accelerate the initial process rather than fully automate it. Thus, this can be seen as a preliminary draft, which can then be refined with insights from our knowledge and external context from subject matter experts (SMEs).

    Now, with some context about the table, let’s explore how to automatically identify data quality issues.

    Sniffing Data Errors with LLMs

    To start, I suggest a prompt that will help identify data quality issues in a given table.

    Task: Analyse the provided table to identify and document data quality issues.

    Below are common data quality issues to guide your analysis. However, you may also identify other relevant issues:
    - Ingestion errors
    - Typecasting issues
    - Duplicates
    - Date parsing issues
    - Character encoding problems
    - Missing values
    - Typos/spelling mistakes
    - Anomalies/outliers
    - Conversion errors and inconsistent units
    - Privacy concerns (e.g., exposed PII)
    - Domain-specific errors (e.g., invalid formats for addresses, phone numbers, emails)

    Instructions:
    1. Examine silently the table and its metadata.
    2. Line by line, identify potential data quality issues without coding.
    3. Document each issue, including:
    - Nature and description of the issue
    - Expected correct state
    - Violated constraint
    - Confidence level in your assessment using ordinal categories: `low`, `medium`, `high` and `certain`.
    - Specific location of the issue in the table (use 'None' for table-wide issues): Index and Column names.

    Provided Data:

    Table:
    ,Student#,Last Name,First Name,Favorite Color,Age
    0,1,Johnson,Mia,periwinkle,12
    1,2,Lopez,Liam,blue,green,13
    2,3,Lee,Isabella,,11
    3,4,Fisher,Mason,gray,-1
    4,5,Gupta,Olivia,9,102
    5,6,,Robinson,,Sophia,,blue,,12

    Metadata:
    - Table Semantic Type: Person
    - Column: Student#, Description: Unique identifier for each student, Semantic Type: identifier, Pandas Type: int or string
    - Column: Last Name, Description: Family name of the student, Semantic Type: familyName, Pandas Type: string
    - Column: First Name, Description: Given name of the student, Semantic Type: givenName, Pandas Type: string
    - Column: Favorite Color, Description: Preferred color of the student, Semantic Type: color (custom), Pandas Type: string (or list if cleaned for multiple values)
    - Column: Age, Description: Age of the student, Semantic Type: age, Pandas Type: int (after cleaning invalid entries)

    Detected Data Issues:

    The initial part of the prompt sets the task’s objective and lists examples of common data issues, such as ingestion errors, duplicates, and privacy concerns, among others. This list is not exhaustive, and you’re encouraged to add more relevant types based on your table’s context to guide the analysis.

    Next, the prompt details step-by-step instructions following a Chain-of-Thought approach, ensuring the model methodically analyses the table and its metadata before identifying data issues line by line, mirroring human analysis. This process is meant to be conducted without coding, to maintain simplicity and broad applicability. This is crucial because, although models like GPT-4 with analytics capabilities can perform useful iterative coding sessions, relying solely on textual analysis promotes generalisation.

    Upon detecting a potential data issue, the prompt instructs documenting the following details:

    • The nature and description of the issue
    • The expected correct state
    • The violated constraint
    • A confidence level in the assessment using ordinal categories: low, medium, high and certain.
    • The specific location of the issue in the table, using ‘None’ for table-wide issues, with Index and Column names for reference.

    The table and its metadata are provided within the prompt, with an index added to each row to aid the model in pinpointing the exact locations of errors.

    For large tables, this prompt can be applied in batches to cover the entire data set, with findings aggregated to identify all data quality issues.
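
    A hedged sketch of such a batching loop is shown below; call_llm and prompt_template are placeholders rather than the article’s code.

    def detect_issues_in_batches(df, prompt_template, call_llm, batch_size=50):
        """Run the detection prompt over row batches and collect the findings."""
        findings = []
        for start in range(0, len(df), batch_size):
            batch = df.iloc[start:start + batch_size]
            # Keep the original index so reported locations stay valid.
            prompt = prompt_template.format(table=batch.to_csv(index=True))
            findings.append(call_llm(prompt))
        return findings  # parse and merge these reports afterwards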

    Here is an example of the output this prompt can generate, formatted as a report detailing identified data issues, each with a description, expected state, violated constraint, confidence level, and location.

    1. Issue: Incorrect value format in "Favorite Color" column
    - Nature and Description: The entry "blue,green" in the "Favorite Color" column for the second row appears to contain two colors instead of one, suggesting a possible ingestion or data entry error.
    - Expected Correct State: Each cell under the "Favorite Color" column should contain only one color.
    - Violated Constraint: Single value constraint.
    - Confidence Level: High
    - Specific Location: Index 1, Column "Favorite Color"

    2. Issue: Missing value in "Favorite Color" and "Age" columns
    - Nature and Description: The third row has a missing value for "Favorite Color", and the sixth row has missing values for both "Last Name" and "Favorite Color".
    - Expected Correct State: No missing values in any of the columns.
    - Violated Constraint: Non-null constraint.
    - Confidence Level: Certain
    - Specific Location: Index 2, Column "Favorite Color"; Index 5, Columns "Last Name", "Favorite Color"

    3. Issue: Negative value in "Age" column
    - Nature and Description: The fourth row has an age of "-1", which is not possible and indicates a data entry error.
    - Expected Correct State: Age values should be positive integers.
    - Violated Constraint: Age value range (greater than 0).
    - Confidence Level: Certain
    - Specific Location: Index 3, Column "Age"

    4. Issue: Misplaced values in "Favorite Color" and "Age" columns
    - Nature and Description: In the fifth row, what appears to be an age ("9") is placed in the "Favorite Color" column, and an unlikely age ("102") is in the "Age" column, suggesting a possible shifting or misplacement of values.
    - Expected Correct State: Age should be a reasonable positive integer, and favorite color should be a string denoting a color.
    - Violated Constraint: Data type and value range constraint.
    - Confidence Level: High
    - Specific Location: Index 4, Columns "Favorite Color", "Age"

    5. Issue: Inconsistent formatting in row entries
    - Nature and Description: The sixth row contains additional commas, which might indicate incorrect splitting or entry of the data, leading to misalignment of the values with their respective columns.
    - Expected Correct State: Each row should consistently follow the format defined by the column headers without extra delimiters.
    - Violated Constraint: Data format and consistency constraint.
    - Confidence Level: High
    - Specific Location: Index 5, Entire row

    The next step is converting these identified issues into a Python object for easier calculation of the Data Dirtiness Score.

    Converting Identified Data Issues into the Correct Format

    This section focuses on transforming the previously identified data issues from plain text descriptions into Python objects. These objects should adhere to the structure defined by the DataIssue class:

    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class DataIssue:
        type_of_issue: str
        expectation: str
        constraint_violated: str
        confidence_score: str  # `low`, `medium`, `high` or `certain`
        location: List[Tuple]  # Cell positions as (Index, Column). Use None for row/column-wide issues.

    Here is the prompt I use:

    Task: Document data table issues using the provided `DataIssue` class.

    For each identified issue and its description, document it using the provided `DataIssue` class structure.

    ```python
    from dataclasses import dataclass
    from typing import List, Tuple

    @dataclass
    class DataIssue:
        type_of_issue: str
        expectation: str
        constraint_violated: str
        confidence_score: str  # `low`, `medium`, `high` or `certain`
        location: List[Tuple]  # Cell positions as (Index, Column). Use None for row/column-wide issues.
    ```

    Instructions:
    1. Review all identified issues provided and their descriptions silently.
    2. For each issue, instantiate it using the provided `DataIssue` class structure.
    3. Return only the code.
    4. Once the code has been validated, stop generation.

    Identified issues:
    {issues_found}

    Your code here:

    The goal is to generate Python code representing each data issue as a DataIssue object without redundant text. The identified issues are injected into the prompt using the placeholder {issues_found}.
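
    For completeness, here is a minimal sketch of how such a prompt template could be filled and sent to a chat model. The variable names (prompt_template, issues_found) and the use of the OpenAI Python client are illustrative assumptions, not the exact code used in this project:

    from openai import OpenAI

    client = OpenAI()

    # prompt_template is the prompt shown above, with {issues_found} left as a placeholder,
    # and issues_found is the plain-text issue report produced in the previous step.
    prompt = prompt_template.format(issues_found=issues_found)

    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output helps when generating code
    )
    generated_code = response.choices[0].message.content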

    Example output:

    issue1 = DataIssue(
        type_of_issue="Incorrect value format",
        expectation="Each cell under 'Favorite Color' should contain only one color",
        constraint_violated="Single value constraint",
        confidence_score="high",
        location=[(1, "Favorite Color")]
    )

    issue2 = DataIssue(
        type_of_issue="Missing value",
        expectation="No missing values in any columns",
        constraint_violated="Non-null constraint",
        confidence_score="certain",
        location=[(2, "Favorite Color"), (5, "Last Name"), (5, "Favorite Color")]
    )

    issue3 = DataIssue(
        type_of_issue="Negative value",
        expectation="Age values should be positive integers",
        constraint_violated="Age value range (greater than 0)",
        confidence_score="certain",
        location=[(3, "Age")]
    )

    issue4 = DataIssue(
        type_of_issue="Misplaced values",
        expectation="Age should be a reasonable positive integer, and favorite color should be a string denoting a color",
        constraint_violated="Data type and value range constraint",
        confidence_score="high",
        location=[(4, "Favorite Color"), (4, "Age")]
    )

    issue5 = DataIssue(
        type_of_issue="Inconsistent formatting",
        expectation="Each row should consistently follow the format defined by the column headers without extra delimiters",
        constraint_violated="Data format and consistency constraint",
        confidence_score="high",
        location=[(5, None)]  # None indicates entire row issue
    )

    The final step involves converting the location attribute from lists of tuples to numpy arrays, which is detailed in the appendix.

    With all elements in place, we can now calculate the Data Dirtiness Score.

    Calculation of the Data Dirtiness Score and Comparison with Ground Truth

    Let’s revisit the function from the previous article, compute_data_dirtiness_score, which uses a list of DataIssue objects mentioned earlier.

    compute_data_dirtiness_score(data_issues)

    Data Dirtiness Score: 28.33%

    Using the GPT-4 model, we estimated the score to be around 28% for this sample. This is fairly close to the “ground truth” score of 31.87%.

    To understand the discrepancy between these scores, let’s delve into more detailed metrics on data issue detection. In addition to the overall score, we have matrices of cell issue probabilities for both the ground truth and the model’s estimates.

    Below is the matrix of probabilities estimated by the model, with columns and indices added for clarity:

       Student#  Last Name  First Name  Favorite Color   Age
    0      0.00        0.0        0.00            0.00  0.00
    1      0.00        0.0        0.00            0.75  0.00
    2      0.00        0.0        0.00            1.00  0.00
    3      0.00        0.0        0.00            0.00  1.00
    4      0.00        0.0        0.00            0.75  0.75
    5      0.75        1.0        0.75            1.00  0.75

    And here is the ground truth matrix:

       Student#  Last Name  First Name  Favorite Color   Age
    0       0.0        0.0        0.00          0.0000  0.00
    1       0.0        0.0        0.00          0.7500  0.00
    2       0.0        0.0        0.00          1.0000  0.00
    3       0.0        0.0        0.00          0.0000  1.00
    4       0.0        0.0        0.25          0.8125  0.75
    5       1.0        1.0        1.00          1.0000  1.00

    Though the matrices appear similar at first glance, we can apply threshold-based metrics such as accuracy, recall, precision, and F1-score to get a clearer picture. These metrics provide a straightforward evaluation of the model’s performance by considering a cell problematic if the model’s likelihood exceeds 0. Here are the metrics obtained:

    The model correctly identified 91% of problematic cells (recall), and all of its error predictions were accurate (precision).
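
    As a rough illustration, these threshold-based metrics can be reproduced with scikit-learn, assuming the two matrices above are available as numpy arrays; the variable names below are hypothetical:

    import numpy as np
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

    # ground_truth_matrix and model_matrix are the 6x5 probability matrices shown above.
    # A cell counts as problematic as soon as its estimated likelihood exceeds 0.
    y_true = (ground_truth_matrix > 0).astype(int).ravel()
    y_pred = (model_matrix > 0).astype(int).ravel()

    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
    print(f"Precision: {precision_score(y_true, y_pred):.2f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
    print(f"F1-score:  {f1_score(y_true, y_pred):.2f}")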

    The model missed one particular issue: “The Favorite Color and First Name fields might be swapped, considering Olivia can be both a name and a colour.” In the ground truth, this possible swap was deemed improbable and assigned a low confidence score, since Olivia is more likely to be the First Name than the Favorite Color. Because of that minimal confidence score, the omission has little impact on the overall Data Dirtiness Score, which explains why the two scores remain relatively close despite it.

    In summary, this approach, based on large language models (LLMs), offers a method for detecting data quality issues in a data frame. While this method may not yet be fully automated and might need manual adjustments, it’s hoped that it will expedite the detection of data errors and the calculation of the Data Dirtiness Score for tabular data sets.

    Next Steps and Challenges

    I use a two-step process to generate the issues as code: the model first describes the issues in plain text, then converts them into DataIssue objects. I have found this more stable than an all-in-one approach, where the model scans the data set and its metadata and outputs the data issues directly in the right code format. This doesn’t mean a single-step solution is impossible, but I’ve chosen to split the task into two phases to improve robustness for the time being.

    An issue we face concerns managing large data sets, both in terms of the number of rows and columns. Despite recent advancements, LLMs still have limits on the input context window and the length of generated content. These constraints cap the size of the table that can be serialised into the prompt for analysis, as well as the length of the data issue report the model can produce. How best to split a data frame, given its size and the model’s capabilities, remains an open question; one possible chunking strategy is sketched below.
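
    One pragmatic option is to serialise the table in blocks of rows and columns that fit comfortably into the prompt, run the detection on each block, and merge the resulting cell masks afterwards. The helper below is a hypothetical sketch of such a chunking strategy, with purely illustrative limits:

    import pandas as pd

    def iter_dataframe_blocks(df: pd.DataFrame, max_rows: int = 50, max_cols: int = 20):
        """Yield (row_offset, col_offset, block) triples small enough to serialise into a prompt.

        The limits are placeholders; in practice they would be derived from the
        model's context window and the typical size of the cells.
        """
        for col_start in range(0, df.shape[1], max_cols):
            for row_start in range(0, df.shape[0], max_rows):
                block = df.iloc[row_start:row_start + max_rows,
                                col_start:col_start + max_cols]
                yield row_start, col_start, block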

    In certain scenarios, the lack of general context can be problematic, such as when identifying duplicate rows in a database or detecting spelling errors without a broad understanding of the column values. For instance, in cases where duplicates are not straightforward, a common approach is Entity Matching. This technique is particularly useful in data cleaning processes and has seen advancements through the use of Large Language Models. Relevant research in this area includes studies like Entity Matching using Large Language Models and Can Foundation Models Wrangle Your Data?, along with Large Language Models as Data Preprocessors and Jellyfish: A Large Language Model for Data Preprocessing.

    Ensemble methods in machine learning, which involve combining multiple models, can enhance performance and stability. This approach can be applied by running several LLMs simultaneously to identify issues in a data set. It’s beneficial to vary the prompts and settings for each LLM to ensure a diverse range of insights. Additionally, assigning specific error types, like spelling mistakes, to individual models can make the process more efficient. While this method can lead to more reliable results by dividing the task into smaller parts, it also increases both the cost and the complexity of the software. By gathering all the identified data issues, we can improve our chances of finding errors (increasing recall) but might also identify more false errors (decreasing precision). However, reviewing these identified errors is generally less time-consuming than finding them in the first place.
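
    Since each run already yields a matrix of cell-issue probabilities, a simple way to combine such an ensemble is element-wise aggregation of those matrices. The helper below is a hypothetical sketch, not part of the existing code base:

    import numpy as np

    def combine_probability_matrices(matrices, method: str = "max") -> np.ndarray:
        """Aggregate cell-issue probability matrices from several LLM runs.

        Taking the element-wise maximum keeps every cell flagged by at least one
        run (favouring recall), while averaging down-weights issues reported by
        only a few runs (favouring precision).
        """
        stacked = np.stack(matrices)
        return stacked.max(axis=0) if method == "max" else stacked.mean(axis=0)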

    The ability of LLMs to interact directly with databases, similar to the code analysis capability in ChatGPT-4, opens up a wider range of possibilities for detecting data errors. A challenge here is automating this process, as the model may deviate from its intended path without sufficient guidance.

    Despite all the challenges, what we can already achieve with such a simple approach is quite promising. With more engineering work, I hope we can soon provide a more robust solution that covers larger data sets and fully automates the detection process.

    The next article will discuss automated data repair or, at the very least, suggest solutions for repair pending validation.

    Appendix

    This section explains how to transform the location attribute of a DataIssue object, which comes from an LLM, into a different format. The transformation turns a list of tuples representing cell positions into a numpy array that acts as a mask for those cell positions.
    Here’s a basic example using the Students data set:

    create_mask_from_list_of_cell_positions(
        shape=dataset_shape,
        list_of_cell_positions=[(4, 'Favorite Color'), (4, 'Age')],
        columns=columns
    )
    array([[0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 0, 0],
           [0, 0, 0, 1, 1],
           [0, 0, 0, 0, 0]], dtype=int8)

    Below are the function definitions:

    import numpy as np
    from typing import List, Tuple, Union


    def validate_cell_position(
        cell_position: Union[
            Tuple[int, int], Tuple[None, int], Tuple[int, None], Tuple[None, None]
        ],
        columns: List[str] = None,
    ) -> Tuple[int, int]:
        """
        Validate the cell position and convert column names to indices if necessary.
        """
        if not isinstance(cell_position, tuple):
            raise ValueError("Cell position must be a tuple")

        # Convert column name to index if columns are provided
        if isinstance(cell_position[1], str):
            if columns is None:
                raise ValueError(
                    "Column names must be provided to create a mask based on column names"
                )
            column_index = columns.index(cell_position[1])
            return (cell_position[0], column_index)

        return cell_position


    def set_mask_values(mask: np.ndarray, cell_position: Tuple[int, int]):
        """
        Set values in the mask based on the cell position.
        """
        row_index, col_index = cell_position
        if row_index is None:
            mask[:, col_index] = 1
        elif col_index is None:
            mask[row_index, :] = 1
        else:
            mask[row_index, col_index] = 1


    def create_mask_from_list_of_cell_positions(
        shape: Tuple[int, int],
        list_of_cell_positions: List[Tuple],
        columns: List[str] = None,
    ) -> np.ndarray:
        """
        Create a mask array based on a list of cell positions.
        """
        mask = np.zeros(shape=shape, dtype=np.int8)

        for cell_position in list_of_cell_positions:
            validated_position = validate_cell_position(cell_position, columns)
            set_mask_values(mask, validated_position)

        return mask
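
    For a row-wide issue such as issue5 above, passing None as the column produces a mask covering the whole row. The call below is only an illustration, reusing the shape and column names of the Students data set:

    create_mask_from_list_of_cell_positions(
        shape=(6, 5),
        list_of_cell_positions=[(5, None)],  # None marks the entire sixth row
        columns=["Student#", "Last Name", "First Name", "Favorite Color", "Age"],
    )
    # Expected result: a 6x5 array of zeros with the last row set to 1.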

