Category: Artificial Intelligence

  • Some Thoughts on Operationalizing LLM Applications

    Some Thoughts on Operationalizing LLM Applications

    Matthew Harris

    A few personal lessons learned from developing LLM applications

    Source DALL·E 3 prompted with “Operationalizing LLMs, watercolor”

    It’s been fun posting articles exploring new Large Language Model (LLM) techniques and libraries as they emerge, but most of the time has been spent behind the scenes working on the operationalization of LLM solutions. Many organizations are working on this right now, so I thought I’d share a few quick thoughts about my journey so far.

    Prototypes are Easy … Production is, well, hard

    It’s beguilingly easy to throw up a quick demo to showcase some of the amazing capabilities of LLMs, but anybody tasked with putting them in front of users in the hope of having a discernible impact soon realizes there’s a lot of work required to tame them. Below are some of the key areas that most organizations might need to consider.

    Some of the key areas that should be considered before launching applications that use Large Language Models (LLMs).

    The list isn’t exhaustive (see also Kaddour et al, 2023), and which of the above applies to your application will of course vary, but even solving for safety, performance, and cost can be a daunting prospect.

    So what can we do about it?

    Not all LLM applications are equally scary

    There is much concern about the safe use of LLMs, and quite rightly so. Trained on human output, they suffer from many of the less favorable aspects of the human condition, and the conviction with which they respond raises new issues around safety. However, the risk profile is not the same for all cases; some applications are much safer than others. Asking an LLM to provide answers directly from its training data offers more potential for hallucination and bias than a low-level technical use of an LLM to predict metadata. This is an obvious distinction, but worth considering for anybody about to build LLM solutions: starting with low-risk applications is an obvious first step and reduces the amount of work required for launch.

    How LLMs are used influences how risky it is to use them

    Future-proofing, hedge against hype

    We live in incredibly exciting times, with rapid advances in AI arriving every week, but it sure makes building a roadmap difficult! Several times in the last year a new vendor feature, open-source model, or Python package has been released that changed the landscape significantly. Figuring out which techniques, frameworks, and models to use so that LLM applications maintain value over time is challenging. There’s no point in building something fabulous only to have its capabilities natively supported for free, or at very low cost, in the next six months.

    Another key consideration is to ask whether an LLM is actually the best tool for the job. With all of the excitement in the last year, it’s easy to get swept away and “LLM the heck” out of everything. As with any new technology, using it just for the sake of using it is often a big mistake, and as LLM hype adjusts, you may find your snazzy app becomes obsolete with real-world usage.

    That said, there is no doubt that LLMs can offer some incredible capabilities so if forging ahead, here are some ideas that might help …

    Adopt a “Cheap LLM First” Policy

    In web design there is the concept of mobile-first: develop web applications that work on less capable phones and tablets first, then figure out how to make things work nicely on more flexible desktop browsers. Doing things this way around can sometimes be easier than the converse. A similar idea can be applied to LLM applications: where possible, try to develop them so that they work with smaller, faster, and cheaper models from the outset, such as GPT-3.5-turbo instead of GPT-4. These models are a fraction of the cost and will often force the design process towards more elegant solutions that break the problem down into simpler parts, with less reliance on monolithic, lengthy prompts to expensive and slow models.

    Of course, this isn’t always feasible and those advanced LLMs exist for a reason, but many key functions can be supported with less powerful LLMs: simple intent classification, planning, and memory operations. It may also be the case that careful design of your workflows opens the possibility of different streams, where some use less powerful LLMs and others more powerful ones (I’ll be doing a later blog post on this).
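    As a rough illustration, here is a minimal sketch of what a “cheap LLM first” routing policy might look like. The model names, the task categories, and the call_llm helper are placeholders rather than any particular vendor’s API.

    # A minimal sketch of a "cheap LLM first" routing policy.
    # Model names and call_llm() are placeholders, not a specific vendor SDK.

    CHEAP_MODEL = "gpt-3.5-turbo"      # fast, low cost
    POWERFUL_MODEL = "gpt-4"           # slower, more expensive

    SIMPLE_TASKS = {"intent_classification", "memory_lookup", "routing"}

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for whatever client/SDK the application actually uses."""
        raise NotImplementedError

    def run_task(task_type: str, prompt: str) -> str:
        # Default to the cheaper model; escalate only when the task needs it.
        model = CHEAP_MODEL if task_type in SIMPLE_TASKS else POWERFUL_MODEL
        return call_llm(model, prompt)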

    Down the road when those more advanced LLMs become cheaper and faster, you can then swap out the more basic LLMs and your application may magically improve with very little effort!

    Avoid native APIs, use generic interfaces instead

    It is good software engineering practice to use a generic interface where possible. For LLMs, this can mean using a service or Python module that presents a fixed interface and can interact with multiple LLM providers. A great example is LangChain, which offers integration with a wide range of LLMs. By using LangChain to communicate with LLMs from the outset rather than native LLM APIs, we can swap out different models in the future with minimal effort.
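    To make the idea concrete, below is a minimal sketch of a provider-agnostic interface written in plain Python (you could equally lean on LangChain’s abstractions instead). The class and method names are illustrative, and the vendor calls are left as stubs.

    # Sketch of a provider-agnostic LLM interface. The concrete classes would
    # wrap vendor SDKs (OpenAI, Anthropic, a local model, ...); names are illustrative.
    from typing import Protocol

    class ChatModel(Protocol):
        def complete(self, prompt: str) -> str: ...

    class OpenAIChat:
        def __init__(self, model: str = "gpt-3.5-turbo"):
            self.model = model
        def complete(self, prompt: str) -> str:
            # call the OpenAI SDK here
            raise NotImplementedError

    class LocalChat:
        def complete(self, prompt: str) -> str:
            # call a locally hosted model here
            raise NotImplementedError

    def summarize(llm: ChatModel, text: str) -> str:
        # Application code depends only on the interface, so models can be swapped.
        return llm.complete(f"Summarize the following text:\n{text}")

    Because summarize depends only on the interface, swapping one model for a cheaper or stronger one later becomes a one-line change.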

    Another example of this is to use autogen for agents, even if using OpenAI assistants. That way as other native agents become available, your application can be adjusted more easily than if you had built a whole process around OpenAI’s native implementation.

    Agents or Chains? You can use both!

    A common pattern in LLM development is to break down the workflow into a chain of conditional steps using frameworks such as promptflow. Chains are well-defined, so we know, more or less, what’s going to happen in our application. They are a great place to start and have a high degree of transparency and reproducibility. However, they don’t support fringe cases well; that’s where groups of autonomous LLM agents can work well, as they are able to iterate towards a solution and recover from errors (most of the time). The issue with these is that, for now at least, agents can be a bit slow due to their iterative nature, expensive due to LLM token usage, and prone to being a bit wild at times and failing spectacularly. They are likely the future of LLM applications though, so it’s a good idea to prepare even if you’re not using them in your application right now. By building your workflow as a modular chain, you are in fact doing just that! Individual nodes in the workflow can be swapped out to use agents later, providing the best of both worlds when needed.
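    Here is a minimal sketch of that idea: a chain expressed as an ordered set of plain callables, any of which could later be swapped for an agent-backed implementation. The step names and placeholder logic are purely illustrative.

    # Sketch of a modular chain: an ordered set of steps, each a plain callable.
    # Any step can later be swapped for an agent-backed implementation.
    from typing import Callable, Dict

    Step = Callable[[dict], dict]

    def classify_intent(state: dict) -> dict:
        state["intent"] = "lookup"          # placeholder logic
        return state

    def retrieve_data(state: dict) -> dict:
        state["data"] = []                  # placeholder logic
        return state

    def summarize(state: dict) -> dict:
        state["answer"] = "..."             # placeholder logic
        return state

    CHAIN: Dict[str, Step] = {
        "classify": classify_intent,
        "retrieve": retrieve_data,
        "summarize": summarize,
    }

    def run_chain(user_input: str) -> dict:
        state = {"input": user_input}
        for name, step in CHAIN.items():
            state = step(state)             # e.g. swap CHAIN["retrieve"] for an agent later
        return state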

    It should be noted that there are some limitations with this approach (streaming of the LLM response, for example, becomes more complicated), but depending on your use case the benefits may outweigh these challenges.

    Linking together steps in an LLM workflow with Promptflow. This has several advantages, one being that steps can be swapped out with more advanced techniques in the future.

    Do you really want your application generating code on the fly?

    It is truly amazing to watch autogen agents and OpenAI assistants generate code and automatically debug it to solve tasks; to me it feels like the future. It also opens up exciting opportunities such as LLM As Tool Maker (LATM, Cai et al 2023), where your application can generate its own tools. That said, in my personal experience so far, code generation can be a bit wild. Yes, it’s possible to optimize prompts and implement a validation framework, but even if the generated code runs perfectly, is it right when solving new tasks? I have come across many cases where it isn’t, and the errors are often quite subtle to catch: the wrong scale on a graph, summing across the wrong elements in an array, or retrieving subtly wrong data from an API. I think this will change as LLMs and frameworks advance, but right now I would be very cautious about letting LLMs generate code on the fly in production and would instead opt for some human-in-the-loop review, at least for now.

    Start with LLM-enhanced applications rather than LLM-first applications

    There are of course many use cases that absolutely require an LLM. But to ease into things, it might make sense to choose applications where the LLM adds value to the process rather than being the process. Imagine a web app that presents data to a user and is already useful on its own. It could be enhanced with LLM features for finding and summarizing that data. By placing slightly less emphasis on the LLM, the application is less exposed to issues arising from LLM performance. Stating the obvious of course, but it’s easy to dive into generative AI without first taking baby steps.

    Don’t forget the … errrr … oh yeah, memory!

    Prompting LLMs incurs costs and can result in a poor user experience as people wait for slow responses. In many cases, the prompt is similar or identical to one previously made, so it’s useful to be able to remember past activity for reuse without having to call the LLM again. Some great packages exist, such as memgpt and GPTCache, which use document embedding vector stores to persist ‘memories’. This is the same technology used for common RAG document retrieval; memories are just chunked documents. The slight difference is that frameworks like memgpt do some clever things to use an LLM to self-manage memories.
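    To illustrate the core idea behind these caches, here is a minimal sketch of an embedding-based memory: embed the incoming prompt, look for a sufficiently similar past prompt, and reuse its stored response. The embed function and the similarity threshold are placeholders, not the API of memgpt or GPTCache.

    # Minimal sketch of an embedding-based response cache (the idea behind
    # tools like GPTCache). embed() and the similarity threshold are placeholders.
    import math

    def embed(text: str) -> list[float]:
        """Placeholder for a real embedding model."""
        raise NotImplementedError

    def cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

    class MemoryCache:
        def __init__(self, threshold: float = 0.9):
            self.threshold = threshold
            self.entries: list[tuple[list[float], str]] = []   # (embedding, response)

        def lookup(self, prompt: str) -> str | None:
            if not self.entries:
                return None
            query = embed(prompt)
            sim, response = max((cosine(query, emb), resp) for emb, resp in self.entries)
            return response if sim >= self.threshold else None  # reuse cached answer

        def store(self, prompt: str, response: str) -> None:
            self.entries.append((embed(prompt), response))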

    You may find however that due to a specific use case, you need some form of custom memory management. In this scenario, it’s sometimes useful to be able to view and manipulate memory records without having to write code. A powerful tool for this is pgvector, which combines vector store capabilities with the Postgres relational database, making it easy to query and understand the metadata stored with memories.
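    As a rough sketch of what this can look like, the snippet below stores memories in Postgres with pgvector and retrieves the nearest ones with ordinary SQL. The table and column names (and the 1536-dimension embedding) are illustrative, and psycopg is just one of several client options.

    # Sketch of storing and querying 'memories' in Postgres with pgvector.
    # Table/column names and the embedding dimension are illustrative assumptions.
    import psycopg

    # Run once (e.g. via your migration tooling) to set up the table:
    SCHEMA_SQL = """
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS memories (
        id        bigserial PRIMARY KEY,
        content   text,
        metadata  jsonb,
        embedding vector(1536)
    );
    """

    def nearest_memories(conn: psycopg.Connection, query_embedding: list[float], k: int = 5):
        # '<->' is pgvector's distance operator; smaller distance means more similar.
        with conn.cursor() as cur:
            cur.execute(
                "SELECT content, metadata FROM memories "
                "ORDER BY embedding <-> %s::vector LIMIT %s",
                (str(query_embedding), k),
            )
            return cur.fetchall()

    Because the memories live in an ordinary table, the same records can be inspected, filtered, and corrected with standard SQL alongside the vector search.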

    Test, test, test

    At the end of the day, whether your application uses LLMs or not, it is still a software application and will benefit from standard engineering techniques. One obvious approach is to adopt test-driven development. This is especially important with LLMs provided by vendors, whose performance may vary over time, something you will need to quantify for any production application. Several validation frameworks exist; again, promptflow offers some straightforward validation tools and has native support in Microsoft AI Studio. There are other testing frameworks out there; the point is to use one from the start for a strong foundation in validation.

    That said, it should be noted that LLMs are not deterministic and will provide slightly different results each time depending on the use case. This has an interesting effect on tests in that the expected result isn’t set in stone. For example, testing that a summarization task is working as required can be challenging because the summary will vary slightly each time. In these cases, it’s often useful to use another LLM to evaluate the application LLM’s output. Metrics such as Groundedness, Relevance, Coherence, Fluency, GPT Similarity, and ADA Similarity can be applied; see for example Azure AI Studio’s implementation.
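    As a hedged sketch, a test along these lines might ask a second LLM to score groundedness on a 1–5 scale, similar in spirit to the metrics mentioned above. The call_llm and generate_summary functions are placeholders for the application’s own client and the code under test.

    # Sketch of an "LLM as judge" check inside a normal test suite.
    # call_llm() and generate_summary() are placeholders, not a specific API.

    def call_llm(model: str, prompt: str) -> str:
        """Placeholder for the application's LLM client."""
        raise NotImplementedError

    def generate_summary(text: str) -> str:
        """Placeholder for the application code under test."""
        raise NotImplementedError

    def groundedness_score(source_text: str, summary: str) -> int:
        prompt = (
            "Rate from 1 (ungrounded) to 5 (fully grounded) how well the summary "
            "is supported by the source text. Reply with a single integer.\n\n"
            f"SOURCE:\n{source_text}\n\nSUMMARY:\n{summary}"
        )
        return int(call_llm("gpt-4", prompt).strip())

    def test_summary_is_grounded():
        source = "The farm share delivers produce and dairy from local farms."
        summary = generate_summary(source)
        assert groundedness_score(source, summary) >= 4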

    Once you have a set of amazing tests that confirm your application is working as expected, you can incorporate them into a DevOps pipeline, for example running them in GitHub Actions before your application is deployed.

    Use 3rd party tools and save yourself some work

    No one size fits all of course, but for smaller organizations implementing LLM applications, developing every aspect of the solution may be a challenge. It might make sense to focus on the business logic and work closely with your users, while using enterprise tools for areas such as LLM safety rather than developing them yourself. For example, Azure AI Studio has some great features that enable various safety checks on LLMs with the click of a button, as well as easy deployment to API endpoints with integrated monitoring and safety. Other vendors such as Google have similar offerings.

    There is of course a cost associated with features like this, but it may be well worth it as developing them is a significant undertaking.

    Azure AI Content Safety Studio is a great example of a cloud vendor solution to ensure your LLM application is safe, with no associated development effort

    Human in the loop, always

    LLMs are far from perfect, even the most powerful ones, so any application using them must have a human in the loop to ensure things are working as expected. For this to be effective, all interactions with your LLM application must be logged and monitoring tools put in place. This is of course no different from any well-managed production application; the difference is the new types of monitoring needed to capture performance and safety issues.

    Another key role humans can play is to correct and improve the LLM application when it makes mistakes. As mentioned above, the ability to view the application’s memory can help, especially if the human can make adjustments to the memory, working with the LLM to provide end-users with the best experience. Feeding this modified data back into prompt tuning or LLM fine-tuning can be a powerful tool for improving the application.

    Conclusions

    The above thoughts are by no means exhaustive for operationalizing LLMs and may not apply to every scenario, but I hope they might be useful for some. We are all on an amazing journey right now!

    References

    Challenges and Applications of Large Language Models, Kaddour et al, 2023

    Large Language Models as Tool Makers, Cai et al, 2023.

    Unless otherwise noted, all images are by the author



    Some Thoughts on Operationalizing LLM Applications was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Some Thoughts on Operationalizing LLM Applications


  • Reframing LLM ‘Chat with Data’: Introducing LLM-Assisted Data Recipes

    Reframing LLM ‘Chat with Data’: Introducing LLM-Assisted Data Recipes

    Matthew Harris

    Source DALL·E 3 prompted with “Oil painting of a data chef making data recipes”

    TL;DR

    In this article, we cover some of the limitations in using Large Language Models (LLMs) to ‘Chat with Data’, proposing a ‘Data Recipes’ methodology which may be an alternative in some situations. Data Recipes extends the idea of reusable code snippets but includes data and has the advantage of being programmed conversationally using an LLM. This enables the creation of a reusable Data Recipes Library — for accessing data and generating insights — which offers more transparency for LLM-generated code with a human-in-the-loop to moderate recipes as required. Cached results from recipes — sourced from SQL queries or calls to external APIs — can be refreshed asynchronously for improved response times. The proposed solution is a variation of the LLMs As Tool Makers (LATM) architecture which splits the workflow into two streams: (i) A low transaction volume / high-cost stream for creating recipes; and (ii) A high transaction volume / low-cost stream for end-users to use recipes. Finally, by having a library of recipes and associated data integration, it is possible to create a ‘Data Recipes Hub’ with the possibility of community contribution.

    Using LLMs for conversational data analysis

    There are some very clever patterns now that allow people to ask questions about data in natural language, where a Large Language Model (LLM) generates calls to get the data and summarizes the output for the user. Often referred to as ‘Chat with Data’, I’ve previously posted articles illustrating this technique, for example using OpenAI assistants to help people prepare for climate change. There are many more advanced examples out there, and it can be an amazing way to lower the technical barrier for people to gain insights from complicated data.

    Examples of using LLMs to generate SQL queries from user inputs, and summarize output to provide an answer. Sources: Langchain SQL Agents
    Examples of using LLMs to generate API calls from user inputs, and summarize output to provide an answer. Sources: Langchain Interacting with APIs

    The method for accessing data typically falls into the following categories …

    1. Generating Database queries: The LLM converts natural language to a query language such as SQL or Cypher
    2. Generating API Queries: The LLM converts natural language to text used to call APIs

    The application executes the LLM-provided suggestion to get the data, then usually passes the results back to the LLM to summarize.
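    For illustration, a minimal sketch of that loop might look like the following, with call_llm, the schema, and the SQLite database standing in for whatever model client and data store you actually use.

    # Sketch of the typical 'Chat with Data' loop: the LLM drafts a query, the
    # application runs it, and the result goes back to the LLM for summarization.
    # call_llm(), the schema, and the SQLite file are illustrative placeholders.
    import sqlite3

    SCHEMA = "CREATE TABLE population (country TEXT, year INTEGER, population INTEGER);"

    def call_llm(prompt: str) -> str:
        """Placeholder for the application's LLM client."""
        raise NotImplementedError

    def run_sql(query: str) -> list[tuple]:
        with sqlite3.connect("data.db") as conn:
            return conn.execute(query).fetchall()

    def chat_with_data(question: str) -> str:
        sql = call_llm(f"Schema:\n{SCHEMA}\nWrite a SQL query answering: {question}")
        rows = run_sql(sql)
        return call_llm(f"Question: {question}\nQuery result: {rows}\nAnswer concisely.")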

    Getting the Data Can be a Problem

    It’s amazing that these techniques now exist, but in turning them into production solutions each has its advantages and disadvantages …

    LLMs can generate text for executing database queries and calling external APIs, but each has its advantages and disadvantages

    For example, generating SQL supports all the amazing things a modern database query language can do, such as aggregation across large volumes of data. However, the data might not already be in a database where SQL can be used. It could be ingested and then queried with SQL, but building pipelines like this can be complex and costly to manage.

    Accessing data directly through APIs means the data doesn’t have to be in a database and opens up a huge world of publicly available datasets, but there is a catch. Many APIs do not support aggregate queries like those supported by SQL, so the only option is to extract the low-level data and then aggregate it. This puts more burden on the LLM application and can require extraction of large amounts of data.

    So both techniques have limitations.

    Passing Data Directly through LLMs Doesn’t Scale

    On top of this, another major challenge quickly emerges when operationalizing LLMs for data analysis. Most solutions, such as OpenAI Assistants, can generate function calls for the caller to execute to extract data, but the output is then passed back to the LLM. It’s unclear exactly what happens internally at OpenAI, but it’s not very difficult to pass enough data to cause a token limit breach, suggesting the LLM is being used to process the raw data in a prompt. Many patterns do something along these lines, passing the output of function calling back to the LLM. This, of course, does not scale in the real world, where the data volumes required to answer a question can be large. It soon becomes expensive and often fails.

    LLM Code Generation Can be Slow, Expensive, and Unstable

    One way around this is to instead perform the analysis by having the LLM generate the code for the task. For example, if the user asks for a count of records in a dataset, have the LLM generate a snippet of Python to count records in the raw data, execute it, and pass that information back to the user. This requires far fewer tokens compared to passing in the raw data to the LLM.
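    A hedged sketch of this pattern is shown below: the LLM is asked for a small answer(records) function, the application executes it locally, and only the computed result is returned. call_llm is a placeholder, and in production the exec step would need sandboxing and review.

    # Sketch: ask the LLM for a snippet that computes the answer locally,
    # instead of sending raw data through the prompt. call_llm() is a placeholder,
    # and exec() of generated code needs sandboxing and review in production.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def answer_with_generated_code(question: str, records: list[dict]) -> str:
        prompt = (
            "Write a Python function answer(records) that returns the answer to: "
            f"{question}. records is a list of dicts with keys {list(records[0])}."
        )
        code = call_llm(prompt)
        namespace: dict = {}
        exec(code, namespace)          # unsafe without a sandbox; illustration only
        return str(namespace["answer"](records))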

    It is fairly well established that LLMs are pretty good at generating code. Not yet perfect, for sure, but a lot of the world right now is using tools like GitHub Copilot for software development. It is becoming a common pattern in LLM applications to have them generate and execute code as part of solving tasks. OpenAI’s code interpreter and frameworks such as autogen and OpenAI Assistants take this a step further in implementing iterative processes that can even debug generated code. Also, the concept of LLMs As Tool Makers (LATM) is established (see for example Cai et al, 2023).

    But here there are some challenges too.

    Any LLM process generating code, especially if that process goes through an iterative cycle to debug code, can quickly incur significant costs. This is because the best models needed for high-quality code generation are often the most expensive, and to debug code a history of previous attempts is required at each step in an iterative process, burning through tokens. It’s also quite slow, depending on the number of iterations required, leading to a poor user experience.

    As many of us have also found, code generation is not perfect — yet — and will on occasion fail. Agents can get themselves lost in code debugging loops and though generated code may run as expected, the results may simply be incorrect due to bugs. For most applications, a human still needs to be in the loop.

    Remembering Data ‘Facts’ Has Limitations

    Code generation cost and performance can be improved by implementing some sort of memory where information from previous identical requests can be retrieved, eliminating the requirement for repeat LLM calls. Solutions such as memgpt work with frameworks like autogen and offer a neat way of doing this.

    Two issues arise from this. First, data is often volatile and any specific answer (i.e. a ‘fact’) based on data can change over time. If asked today “Which humanitarian organizations are active in the education sector in Afghanistan?”, the answer will likely be different next month. Various memory strategies could be applied to expire memories after some time, but the most trustworthy method is to simply get the information again.

    Another issue is that our application may have generated an answer for a particular situation, for example, the population of a specific country. The memory will work well if another user asks exactly the same question, but isn’t useful if they ask about a different country. Saving ‘Facts’ is only half of the story if we are hoping to be able to reuse previous LLM responses.

    So What Can We Do About It?

    Given all of the above, we have these key issues to solve:

    • We need an approach that would work with databases and APIs
    • We want to be able to support aggregate queries using API data
    • We want to avoid using LLMs to summarize data and instead use code
    • We want to save on costs and performance by using memory
    • Memory needs to be kept up-to-date with data sources
    • Memory should be generalizable, containing skills as well as facts
    • Any code used needs to be reviewed by a human for accuracy and safety

    Phew! That’s a lot to ask.

    Introducing LLM-Assisted Data Recipes

    Data Recipes architecture: LLM-assisted generation of reusable recipes (skills) which can be used for conversational data analysis

    The idea is that we split the workflow into two streams to optimize costs and stability, as proposed with the LATM architecture, with some additional enhancements for managing data and memories specific to Data Recipes …

    Stream 1: Recipes Assistant

    This stream uses LLM agents and more powerful models to generate code snippets (recipes) via a conversational interface. The LLM is instructed with information about data sources — API specifications and Database Schema — so that the person creating recipes can more easily conversationally program new skills. Importantly, the process implements a review stage where generated code and results can be verified and modified by a human before being committed to memory. For best code generation, this stream uses more powerful models and autonomous agents, incurring higher costs per request. However, there is less traffic so costs are controlled.

    Stream 2: Data Analysis Assistant

    This stream is used by the wider group of end-users who are asking questions about data. The system checks memory to see if their request exists as a fact, e.g. “What’s the population of Mali?”. If not, it checks recipes to see if it has a skill to get the answer, e.g. ‘How to get the population of any country’. If no memory or skill exists, a request is sent to the recipes assistant queue for the recipe to be added. Ideally, the system can be pre-populated with recipes before launch, but the recipes library can actively grow over time based on user telemetry. Note that the end-user stream does not generate code or queries on the fly and therefore can use less powerful LLMs, is more stable and secure, and incurs lower costs.
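    A minimal sketch of that decision flow might look like this, with the lookup and queue helpers left as illustrative placeholders.

    # Sketch of the Data Analysis Assistant flow: fact memory first, then a
    # matching recipe, otherwise queue the request for the Recipes Assistant.
    # The lookup/queue helpers are illustrative placeholders.

    def find_fact(question: str) -> str | None: ...
    def find_recipe(question: str): ...          # returns a callable recipe, or None
    def queue_recipe_request(question: str) -> None: ...

    def answer_question(question: str) -> str:
        fact = find_fact(question)
        if fact is not None:
            return fact                           # cheapest path: reuse a saved fact
        recipe = find_recipe(question)
        if recipe is not None:
            return recipe(question)               # run reviewed, pre-built recipe code
        queue_recipe_request(question)            # human-reviewed creation path
        return "I don't know yet, but I've asked for this to be added."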

    Asynchronous Data Refresh

    To improve response times for end-users, recipes are refreshed asynchronously where feasible. The recipe memory contains code that can be run on a set schedule. Recipes can be preemptively executed to prepopulate the system, for example retrieving the total population of all countries before end-users have requested it. Also, cases that require aggregation across large volumes of data extracted from APIs can be run out-of-hours, mitigating, albeit only in part, the limitation of aggregate queries using API data.

    Memory Hierarchy — remembering skills as well as facts

    The above implements a hierarchy of memory in which saved ‘facts’ can be promoted to more general ‘skills’. Memory retrieval and promotion to recipes are achieved through a combination of semantic search and LLM reranking and transformation, for example prompting an LLM to generate a general intent and code, e.g. ‘Get total population for any country’, from a specific intent and code, e.g. ‘What’s the total population of Mali?’.
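    As a rough sketch, promotion might look like the following, where an LLM is prompted to generalize a specific intent and its code into a parameterized recipe. The prompt format and helper names are illustrative only.

    # Sketch of promoting a specific 'fact' (intent + code) into a general
    # 'skill' (recipe) by asking an LLM to generalize it. call_llm() is a placeholder.

    def call_llm(prompt: str) -> str:
        raise NotImplementedError

    def promote_to_recipe(specific_intent: str, specific_code: str) -> dict:
        prompt = (
            "Generalize this data question and its code so it works for any country.\n"
            f"Question: {specific_intent}\nCode:\n{specific_code}\n"
            "Return the general question on the first line, then the parameterized code."
        )
        reply = call_llm(prompt)
        general_intent, _, general_code = reply.partition("\n")
        return {"intent": general_intent.strip(), "code": general_code}

    # e.g. promote_to_recipe("What's the total population of Mali?", mali_code)
    # might yield {"intent": "Get total population for any country", "code": ...}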

    Additionally, by automatically including recipes as available functions to the code generation LLM, its reusable toolkit grows such that new recipes are efficient and call prior recipes rather than generating all code from scratch.

    Some Additional Benefits of Data Recipes

    By capturing data analysis requests from users and making these highly visible in the system, transparency is increased. LLM-generated code can be closely scrutinized, optimized, and adjusted, and answers produced by such code are well-understood and reproducible. This acts to reduce the uncertainty many LLM applications face around factual grounding and hallucination.

    Another interesting aspect of this architecture is that it captures specific data analysis requirements and the frequency with which these are requested by users. This can be used to invest in the most heavily utilized recipes, bringing benefits to end users. For example, if a recipe for generating a humanitarian response situation report is accessed frequently, the recipe code for that report can be improved proactively.

    Data Recipes Hub

    This approach opens up the possibility of a community-maintained library of data recipes spanning multiple domains — a Data Recipes Hub. Similar to code snippet websites that already exist, it would add the dimension of data as well as help users in creation by providing LLM-assisted conversational programming. Recipes could receive reputation points and other such social platform feedback.

    Data Recipes — code snippets with data, created with LLM assistance — could be contributed by the community to a Data Recipes Hub. Image Source: DALL·E 3

    Limitations of Data Recipes

    As with any architecture, it may not work well in all situations. A big part of data recipes is geared towards reducing costs and risks associated with creating code on the fly and instead building a reusable library with more transparency and human-in-the-loop intervention. It will of course be the case that a user can request something new not already supported in the recipe library. We can build a queue for these requests to be processed, and by providing LLM-assisted programming expect development times to be reduced, but there will be a delay to the end-user. However, this is an acceptable trade-off in many situations where it is undesirable to let loose LLM-generated, unmoderated code.

    Another thing to consider is the asynchronous refresh of recipes. Depending on the amount of data required, this may become costly. Also, this refresh might not work well in cases where the source data changes rapidly and users require this information very quickly. In such cases, the recipe would be run every time rather than the result retrieved from memory.

    The refresh mechanism should help with data aggregation tasks where data is sourced from APIs, but there still looms the fact that the underlying raw data will be ingested as part of the recipe. This of course will not work well for massive data volumes, but it’s at least limiting ingestion based on user demand rather than trying to ingest an entire remote dataset.

    Finally, as with all ‘Chat with Data’ applications, they are only ever going to be as good as the data they have access to. If the desired data doesn’t exist or is of low quality, then perceived performance will be poor. Additionally, common inequity and bias exist in datasets so it’s important a data audit is carried out before presenting insights to the user. This isn’t specific to Data Recipes of course, but one of the biggest challenges posed in operationalizing such techniques. Garbage in, garbage out!

    Conclusions

    The proposed architecture aims to address some of the challenges faced with LLM “Chat With Data”, by being …

    • Transparent — Recipes are highly visible and reviewed by a human before being promoted, mitigating issues around LLM hallucination and summarization
    • Deterministic — Being code, they will produce the same results each time, unlike LLM summarization of data
    • Performant — Implementing a memory that captures not only facts but skills, which can be refreshed asynchronously, improves response times
    • Inexpensive — By structuring the workflow into two streams, the high-volume end-user stream can use lower-cost LLMs
    • Secure — The main group of end-users do not trigger the generation and execution of code or queries on the fly, and any code undergoes human assessment for safety and accuracy

    I will be posting a set of follow-up blog posts detailing the technical implementation of Data Recipes as we work through user testing at DataKind.

    References

    Large Language Models as Tool Makers, Cai et al, 2023.

    Unless otherwise noted, all images are by the author.



    Reframing LLM ‘Chat with Data’: Introducing LLM-Assisted Data Recipes was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.

    Originally appeared here:
    Reframing LLM ‘Chat with Data’: Introducing LLM-Assisted Data Recipes


  • Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

    Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs

    Christopher Rae

    This post provides three guided steps to architect risk management strategies while developing generative AI applications using LLMs. We first delve into the vulnerabilities, threats, and risks that arise from the implementation, deployment, and use of LLM solutions, and provide guidance on how to start innovating with security in mind. We then discuss how building on a secure foundation is essential for generative AI. Lastly, we connect these together with an example LLM workload to describe an approach towards architecting with defense-in-depth security across trust boundaries.

    Originally appeared here:
    Architect defense-in-depth security for generative AI applications using the OWASP Top 10 for LLMs


  • Exploring Public Storage Traces

    Raluca Diaconu

    What are they, where are they, and are they right for you?

    Photo by Hongwei FAN on Unsplash

    Input and output (I/O) operations refer to the transfer of data between a computer’s main memory and various peripherals. Storage peripherals such as HDDs and SSDs have particular performance characteristics in terms of latency, throughput, and request rate, which can influence the performance of the computer system they power. Extrapolating, the performance and design of distributed and cloud-based data storage depends on that of the underlying medium. This article is intended to be a bridge between Data Science and Storage Systems: 1/ I am sharing a few datasets of various sources and sizes which I hope will be novel for Data Scientists, and 2/ I am bringing up the potential for advanced analytics in Distributed Systems.

    Intro

    Storage access traces are “a treasure trove of information for optimizing cloud workloads.” They’re crucial for capacity planning, data placement, and the design and evaluation of systems suited to modern applications. Diverse and up-to-date datasets are particularly needed in academic research to study novel and unintuitive access patterns, and to help design new hardware architectures, new caching algorithms, or hardware simulations.

    Storage traces are notoriously difficult to find. The SNIA website is the best known “repository for storage-related I/O trace files, associated tools, and other related information” but many traces don’t comply with their licensing or upload format. Finding traces becomes a tedious process of scanning the academic literature or attempting to generate one’s own.

    Popular traces which are easier to find tend to be outdated and overused. Traces older than 10 years should not be used in modern research and development due to changes in application workloads and hardware capabilities. Also, an over-use of specific traces can bias the understanding of real workloads so it’s recommended to use traces from multiple independent sources when possible.

    This post is an organized collection of recent public traces I found and used. In the first part I categorize them by the level of abstraction they represent in the IO stack. In the second part I list and discuss some relevant datasets. The last part is a summary of all with a personal view on the gaps in storage tracing datasets.

    Type of traces

    I distinguish between three types of traces based on data representation and access model. Let me explain. A user, at the application layer, sees data stored in files or objects which are accessed by a large range of abstract operations such as open or append. Closer to the media, the data is stored in a continuous memory address space and accessed as blocks of fixed size which may only be read or written. At a higher abstraction level, within the application layer, we may also have a data presentation layer which may log access to data presentation units, which may be, for example, rows composing tables and databases, or articles and paragraphs composing news feeds. The access may be create table, or post article.

    While traces can be taken anywhere in the IO stack and contain information from multiple layers, I am choosing to structure the following classification based on the Linux IO stack depicted below.

    I/O Stack Diagram (adapted from [1], [2] and [3])

    Block storage traces

    The data in these traces is representative of the operations at the block layer. In Linux, this data is typically collected with blktrace (and rendered readable with blkparse), iostat, or dtrace. The traces contain information about the operation, the device, CPU, process, and storage location accessed. The first trace listed is an example of blktrace output.

    The typical information generated by tracing programs may be too detailed for analysis and publication purposes, so it is often simplified. Typical public traces contain operation, offset, size, and sometimes timing. At this layer the operations are only read and write. Each operation accesses the address starting at offset and is applied to a contiguous region of memory specified as a number of blocks (4 KiB blocks on NTFS, for example). For example, a trace entry for a read operation contains the address where the read starts (offset) and the number of blocks read (size). The timing information may contain the time the request was issued (start time), the time it was completed (end time), the processing in between (latency), and the time the request waited (queuing time).

    Available traces sport different features, have wildly different sizes, and are the output of a variety of workloads. Selecting the right one will depend on what you’re looking for. For example, trace replay only needs the order of operations and their size; for performance analysis, timing information is needed.

    Disk access visualization with iowatcher (source)

    Object storage traces

    At the application layer, data is located in files and objects which may be created, opened, appended, or closed, and then discovered via a tree structure. From a user’s point of view, the storage medium is decoupled, hiding fragmentation and allowing random byte access.

    I’ll group together file and object traces despite a subtle difference between the two. Files follow the file system’s naming convention which is structured (typically hierarchical). Often the extension suggests the content type and usage of the file. On the other hand, objects are used in large scale storage systems dealing with vast amounts of diverse data. In object storage systems the structure is not intrinsic, instead it is defined externally, by the user, with specific metadata files managed by their workload.

    Being generated within the application space, typically the result of an application logging mechanism, object traces are more diverse in terms of format and content. The information recorded may be more specific, for example, operations can also be delete, copy, or append. Objects typically have variable size and even the same object’s size may vary in time after appends and overwrites. The object identifier can be a string of variable size. It may encode extra information, for example, an extension that tells the content type. Other meta-information may come from the range accessed, which may tell us, for example, whether the header, the footer or the body of an image, parquet, or CSV file was accessed.

    Object storage traces are better suited for understanding user access patterns. In terms of block access, a video stream and a sequential read of an entire file generate the same pattern: multiple sequential IOs at regular time intervals. But these trace entries should be treated differently if we are to replay them. Accessing video streaming blocks needs to be done with the same time delta between them, regardless of the latency of each individual block, while reading the entire file should happen as fast as possible.

    Access traces

    Specific to each application, data may be abstracted further. Data units may be instances of a class, records in a database, or ranges in a file. A single data access may not even generate a file open or a disk IO if caching is involved. I choose to include such traces because they may be used to understand and optimize storage access, and in particular cloud storage. For example, the access traces from Twitter’s Memcache are useful in understanding popularity distributions and therefore may be useful for data formatting and placement decisions. Often they’re not storage traces per se, but they can be useful in the context of cache simulation, IO reduction, or data layout (indexing).

    Data format in these traces can be even more diverse due to a new layer of abstraction, for example, by tweet identifiers in Memcached.

    Examples of traces

    Let’s look at a few traces in each of the categories above. The list details some of the newer traces — no older than 10 years — and it is by no means exhaustive.

    Block traces

    YCSB RocksDB SSD 2020

    These are SSD traces collected on a 28-core, 128 GB host with two 512 GB NVMe SSD Drives, running Ubuntu. The dataset is a result of running the YCSB-0.15.0 benchmark with RocksDB.

    The first SSD stores all blktrace output, while the second hosts YCSB and RocksDB. YCSB Workload A consists of 50% reads and 50% updates of 1B operations on 250M records. Runtime is 9.7 hours, which generates over 352M block I/O requests at the file system level writing a total of 6.8 TB to the disk, with a read throughput of 90 MBps and a write throughput of 196 MBps.

    The dataset is small compared to all others in the list, and limited in terms of workload, but a great place to start due to its manageable size. Another benefit is reproducibility: it uses open source tracing tools and benchmarking beds atop a relatively inexpensive hardware setup.

    Format: These are SSD traces taken with blktrace and have the typical format after parsing with blkparse: [Device Major Number,Device Minor Number] [CPU Core ID] [Record ID] [Timestamp (in nanoseconds)] [ProcessID] [Trace Action] [OperationType] [SectorNumber + I/O Size] [ProcessName]

    259,2    0    1    0.000000000  4020  Q   R  282624 + 8 [java]
    259,2    0    2    0.000001581  4020  G   R  282624 + 8 [java]
    259,2    0    3    0.000003650  4020  U   N  [java] 1
    259,2    0    4    0.000003858  4020  I   RS 282624 + 8 [java]
    259,2    0    5    0.000005462  4020  D   RS 282624 + 8 [java]
    259,2    0    6    0.013163464     0  C   RS 282624 + 8 [0]
    259,2    0    7    0.013359202  4020  Q   R  286720 + 128 [java]
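    If you want to analyze these traces programmatically, a small parser along the following lines should work; it assumes the column layout shown in the sample above and skips lines (such as unplug events) that don’t carry a sector range.

    # Sketch: parse blkparse-style lines (as in the sample above) into records.
    # Assumes the column layout shown; adjust if your blkparse options differ.
    from dataclasses import dataclass

    @dataclass
    class BlockIO:
        device: str
        timestamp: float   # seconds
        action: str        # Q, G, I, D, C, ...
        op: str            # R, W, RS, WS, ...
        sector: int
        size_blocks: int

    def parse_blkparse_line(line: str) -> BlockIO | None:
        f = line.split()
        # Expect: dev cpu seq timestamp pid action op sector + size [process]
        if len(f) < 10 or f[8] != "+":
            return None                      # e.g. unplug ('U') lines have no sector range
        return BlockIO(
            device=f[0],
            timestamp=float(f[3]),
            action=f[5],
            op=f[6],
            sector=int(f[7]),
            size_blocks=int(f[9]),
        )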

    Where to find it: http://iotta.snia.org/traces/block-io/28568

    License: SNIA Trace Data Files Download License

    Alibaba Block Traces 2020

    The dataset consists of “block-level I/O requests collected from 1,000 volumes, where each has a raw capacity from 40 GiB to 5 TiB. The workloads span diverse types of cloud applications. Each collected I/O request specifies the volume number, request type, request offset, request size, and timestamp.”

    Limitations (from the academic paper)

    • the traces do not record the response times of the I/O requests, making them unsuitable for latency analysis of I/O requests.
    • the specific applications running atop are not mentioned, so they cannot be used to extract application workloads and their I/O patterns.
    • the traces capture the access to virtual devices, so they are not representative of performance and reliability (e.g., data placement and failure statistics) for physical block storage devices.

    A drawback of this dataset is its size. When uncompressed it results in a 751GB file which is difficult to store and manage.

    Format: device_id,opcode,offset,length,timestamp

    • device_id: ID of the virtual disk, uint32
    • opcode: Either ‘R’ or ‘W’, indicating whether this operation is a read or a write
    • offset: Offset of this operation, in bytes, uint64
    • length: Length of this operation, in bytes, uint32
    • timestamp: Timestamp of this operation as received by the server, in microseconds, uint64
    419,W,8792731648,16384,1577808144360767
    725,R,59110326272,360448,1577808144360813
    12,R,350868463616,8192,1577808144360852
    725,R,59110686720,466944,1577808144360891
    736,R,72323657728,516096,1577808144360996
    12,R,348404277248,8192,1577808144361031

    Additionally, there is an extra file containing each virtual device’s id (device_id) together with its total capacity.
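    Given the file size, reading the trace in chunks with pandas is one practical approach. The sketch below assumes the column order described above (and no header row) and shows a simple per-device read/write count as an example analysis.

    # Sketch: load the Alibaba-style CSV trace with pandas. Column names follow
    # the format above; chunked reading keeps memory bounded for the large file.
    import pandas as pd

    COLUMNS = ["device_id", "opcode", "offset", "length", "timestamp"]

    def iter_trace_chunks(path: str, chunksize: int = 1_000_000):
        for chunk in pd.read_csv(path, names=COLUMNS, chunksize=chunksize):
            chunk["timestamp"] = pd.to_datetime(chunk["timestamp"], unit="us")
            yield chunk

    # Example: per-device read/write request counts, accumulated chunk by chunk.
    def request_counts(path: str) -> pd.DataFrame:
        counts = None
        for chunk in iter_trace_chunks(path):
            c = chunk.groupby(["device_id", "opcode"]).size()
            counts = c if counts is None else counts.add(c, fill_value=0)
        return counts.unstack(fill_value=0)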

    Where to find it: https://github.com/alibaba/block-traces

    License: CC-4.0.

    Tencent Block Storage 2018

    This dataset consists of “216 I/O traces from a warehouse (also called a failure domain) of a production cloud block storage system (CBS). The traces are I/O requests from 5584 cloud virtual volumes (CVVs) for ten days (from Oct. 1st to Oct. 10th, 2018). The I/O requests from the CVVs are mapped and redirected to a storage cluster consisting of 40 storage nodes (i.e., disks).”

    Limitations:

    • Timestamps are in seconds, a granularity too coarse to determine the order of operations. As a consequence, many requests appear as if issued at the same time, making this trace unsuitable for queuing analysis.
    • There is no latency information about the duration of each operation, making the trace unsuitable for latency and queuing performance analysis.
    • No extra information about each volume such as total size.

    Format: Timestamp,Offset,Size,IOType,VolumeID

    • Timestamp is the Unix time the I/O was issued in seconds.
    • Offset is the starting offset of the I/O in sectors from the start of the logical virtual volume. 1 sector = 512 bytes
    • Size is the transfer size of the I/O request in sectors.
    • IOType is “Read (0)” or “Write (1)”.
    • VolumeID is the ID number of a CVV.
    1538323200,12910952,128,0,1063
    1538323200,6338688,8,1,1627
    1538323200,1904106400,384,0,1360
    1538323200,342884064,256,0,1360
    1538323200,15114104,8,0,3607
    1538323200,140441472,32,0,1360
    1538323200,15361816,520,1,1371
    1538323200,23803384,8,0,2363
    1538323200,5331600,4,1,3171

    Where to find it: http://iotta.snia.org/traces/parallel/27917

    License: SNIA Trace Data Files Download License

    K5cloud Traces 2018

    This dataset contains traces from virtual cloud storage in the FUJITSU K5 cloud service. The data was gathered during a week, but not continuously, because “one day’s IO access logs often consumed the storage capacity of the capture system.” There are 24 billion records from 3088 virtual storage nodes.

    The data was captured on the TCP/IP network between servers running on hypervisors and storage systems in a K5 data center in Japan. It is split into three datasets by virtual storage volume id; each volume id is unique within a dataset but not across the different datasets.

    Limitations:

    • There is no latency information, so the traces cannot be used for performance analysis.
    • The total node size is missing, but it can be approximated from the maximum offset accessed in the traces.
    • Some applications may require a complete dataset, which makes this one unsuitable due to missing data.

    The fields in the IO access log are: ID,Timestamp,Type,Offset,Length

    • ID is the virtual storage volume id.
    • Timestamp is the time elapsed from the first IO request of all IO access logs in seconds, but with a microsecond granularity.
    • Type is R (read) or W (write).
    • Offset is the starting offset of the IO access in bytes from the start of the virtual storage.
    • Length is the transfer size of the IO request in bytes.
    1157,3.828359000,W,7155568640,4096
    1157,3.833921000,W,7132311552,8192
    1157,3.841602000,W,15264690176,28672
    1157,3.842341000,W,28121042944,4096
    1157,3.857702000,W,15264718848,4096
    1157,9.752752000,W,7155568640,4096

    Where to find it: http://iotta.snia.org/traces/parallel/27917

    License: CC-4.0.

    Object traces

    Server-side I/O request arrival traces 2019

    This repository contains two datasets for IO block traces with additional file identifiers: 1/ parallel file systems (PFS) and 2/ I/O nodes.

    Notes:

    • The access patterns result from the MPI-IO test benchmark run atop Grid5000, a large-scale test bed for parallel and High Performance Computing (HPC). These traces are not representative of general user or cloud workloads but are instead specific to HPC and parallel computing.
    • The setup for the PFS scenario uses OrangeFS as the file system, and the IO nodes use the I/O Forwarding Scalability Layer (IOFSL). In both cases the scheduler was set to the AGIOS I/O scheduling library. This setup is perhaps too specific for most use cases targeted by this article and has been designed to reflect some proposed solutions.
    • The hardware setup for PFS consists of four server nodes with 600 GB HDDs each and 64 client nodes. For IO nodes, it has four server nodes with a similar disk configuration in a cluster, and 32 clients in a different cluster.

    Format: The format is slightly different for the two datasets, an artifact of different file systems. For IO nodes, it consists of multiple files, each with tab-separated values Timestamp FileHandle RequestType Offset Size. A peculiarity is that reads and writes are in separate files named accordingly.

    • Timestamp is a number representing the internal timestamp in nanoseconds.
    • FileHandle is the file handle in hexadecimal of size 64.
    • RequestType is the type of the request, inverted, “W” for reads and “R” for writes.
    • Offset is a number giving the request offset in bytes
    • Size is the size of the request in bytes.
    265277355663    00000000fbffffffffffff0f729db77200000000000000000000000000000000        W       2952790016      32768
    265277587575 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 1946157056 32768
    265277671107 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 973078528 32768
    265277913090 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 4026531840 32768
    265277985008 00000000fbffffffffffff0f729db77200000000000000000000000000000000 W 805306368 32768

    The PFS scenario has two concurrent applications, “app1” and “app2”, and its traces are inside a folder named accordingly. Each row entry has the following format: [<Timestamp>] REQ SCHED SCHEDULING, handle: <FileHandle>, queue_element: <QueueElement>, type: <RequestType>, offset: <Offset>, len: <Size>. The fields that differ from the above are:

    • RequestType is 0 for reads and 1 for writes
    • QueueElement is never used and I believe it is an artifact of the tracing tool.
    [D 01:11:03.153625] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x12986c0, type: 1, offset: 369098752, len: 1048576 
    [D 01:11:03.153638] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x1298e30, type: 1, offset: 268435456, len: 1048576
    [D 01:11:03.153651] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x1188b80, type: 1, offset: 0, len: 1048576
    [D 01:11:03.153664] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0xf26340, type: 1, offset: 603979776, len: 1048576
    [D 01:11:03.153676] REQ SCHED SCHEDULING, handle: 5764607523034233445, queue_element: 0x102d6e0, type: 1, offset: 637534208, len: 1048576

    Where to find it: https://zenodo.org/records/3340631#.XUNa-uhKg2x

    License: CC-4.0.

    IBM Cloud Object Store 2019

    These are anonymized traces from the IBM Cloud Object Storage service collected with the primary goal to study data flows to the object store.

    The dataset is composed of 98 traces containing around 1.6 billion requests for 342 million unique objects; the traces themselves are about 88 GB in size. Each trace contains the REST operations issued against a single bucket in IBM Cloud Object Storage and holds between 22,000 and 187,000,000 object requests. All the traces were collected during the same week in 2019 and contain all data access requests issued over that week by a single tenant of the service. Object names are anonymized.

    Some characteristics of the workload have been published in this paper, although the dataset used was larger:

    • The authors were “able to identify some of the workloads as SQL queries, Deep Learning workloads, Natural Language Processing (NLP), Apache Spark data analytic, and document and media servers. But many of the workloads’ types remain unknown.”
    • “A vast majority of the objects (85%) in the traces are smaller than a megabyte, yet these objects only account for 3% of the stored capacity.” This made the data suitable for a cache analysis.

    Format: <time stamp of request> <request type> <object ID> <optional: size of object> <optional: beginning offset> <optional: ending offset> The timestamp is the number of milliseconds from the point where we began collecting the traces.

    1219008 REST.PUT.OBJECT 8d4fcda3d675bac9 1056
    1221974 REST.HEAD.OBJECT 39d177fb735ac5df 528
    1232437 REST.HEAD.OBJECT 3b8255e0609a700d 1456
    1232488 REST.GET.OBJECT 95d363d3fbdc0b03 1168 0 1167
    1234545 REST.GET.OBJECT bfc07f9981aa6a5a 528 0 527
    1256364 REST.HEAD.OBJECT c27efddbeef2b638 12752
    1256491 REST.HEAD.OBJECT 13943e909692962f 9760

    Where to find it: http://iotta.snia.org/traces/key-value/36305

    License: SNIA Trace Data Files Download License

    Access traces

    Wiki Analytics Datasets 2019

    The wiki dataset contains data for 1/ upload (image) web requests of Wikimedia and 2/ text (HTML pageview) web requests from one CDN cache server of Wikipedia. The most recent dataset, from 2019, contains 21 upload data files and 21 text data files.

    Format: Each upload data file, denoted cache-u, contains exactly 24 hours of consecutive data. These files are each roughly 1.5GB in size and hold roughly 4GB of decompressed data each.

    This dataset is the result of a single type of workload, which may limit the applicability, but it is large and complete, which makes a good testbed.

    Each decompressed upload data file has the following format: relative_unix hashed_path_query image_type response_size time_firstbyte

    • relative_unix: Seconds since start timestamp of dataset, int
    • hashed_path_query: Salted hash of path and query of request, bigint
    • image_type: Image type from Content-Type header of response, string
    • response_size: Response size in bytes, int
    • time_firstbyte: Seconds to first byte, double
    0 833946053 jpeg 9665 1.85E-4
    0 -1679404160 png 17635 2.09E-4
    0 -374822678 png 3333 2.18E-4
    0 -1125242883 jpeg 4733 1.57E-4

    Each text data file, denoted cache-t, contains exactly 24 hours of consecutive data. These files are each roughly 100MB in size and hold roughly 300MB of decompressed data each.

    Each decompressed text data file has the following format: relative_unix hashed_host_path_query response_size time_firstbyte

    4619 540675535 57724 1.92E-4
    4619 1389231206 31730 2.29E-4
    4619 -176296145 20286 1.85E-4
    4619 74293765 14154 2.92E-4

    Where to find it: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching

    License: CC-4.0.

    Memcached 2020

    This dataset contains one-week-long traces from Twitter’s in-memory caching (Twemcache / Pelikan) clusters, published as Anonymized Cache Request Traces from Twitter Production. The data comes from the 54 largest clusters in March 2020.

    Format: Each trace file is a csv with the format: timestamp,anonymized key,key size,value size,client id,operation,TTL

    • timestamp: the time when the cache receives the request, in sec
    • anonymized key: the original key with anonymization where namespaces are preserved; for example, if the anonymized key is nz:u:eeW511W3dcH3de3d15ec, the first two fields nz and u are namespaces, note that the namespaces are not necessarily delimited by :, different workloads use different delimiters with different number of namespaces.
    • key size: the size of key in bytes
    • value size: the size of value in bytes
    • client id: the anonymized clients (frontend service) who sends the request
    • operation: one of get/gets/set/add/replace/cas/append/prepend/delete/incr/decr
    • TTL: the time-to-live (TTL) of the object set by the client, it is 0 when the request is not a write request.
    0,q:q:1:8WTfjZU14ee,17,213,4,get,0
    0,yDqF:3q:1AJrrJ1nnCJKKrnGx1A,27,27,5,get,0
    0,q:q:1:8WTw2gCuJe8,17,720,6,get,0
    0,yDqF:vS:1AJr9JnArxCJGxn919K,27,27,7,get,0
    0,yDqF:vS:1AJrrKG1CAnr1C19KxC,27,27,8,get,0
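    As a quick example of working with this format, the sketch below loads a trace with pandas and computes key popularity for read requests, a common first step in cache analysis. Column names follow the documented format above (with “anonymized key” shortened to “key”), and the file is assumed to have no header row.

    # Sketch: key popularity from the Twemcache/Pelikan CSV format above.
    # Column names follow the documented format; useful as a first step for
    # popularity-distribution or cache hit-rate analysis.
    import pandas as pd

    COLUMNS = ["timestamp", "key", "key_size", "value_size", "client_id", "operation", "ttl"]

    def top_keys(path: str, n: int = 20) -> pd.Series:
        df = pd.read_csv(path, names=COLUMNS)
        reads = df[df["operation"].isin(["get", "gets"])]
        return reads["key"].value_counts().head(n)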

    License: CC-4.0.

    Conclusion

    If you’re still here and haven’t gone diving into one of the traces linked above it may be because you haven’t found what you’re looking for. There are a few gaps that current storage traces have yet to fill:

    • Multi-tenant Cloud Storage: Large cloud storage providers hold some of the richest datasets out there. Their workload reflects the architecture of large-scale systems and is the result of a diverse set of applications. Storage providers are also extra cautious when it comes to sharing this data. There is little or no financial incentive to share it with the public, and a fear of unintended customer data leaks.
    • Full stack. Each layer in the stack offers a different view on access patterns, and none alone is enough to understand cause-and-effect relationships in storage systems. Optimizing a system to suit modern workloads requires a holistic view of data access, which is not publicly available.
    • Distributed tracing. Most data is nowadays accessed remotely and managed in large scale distributed systems. Many components and layers (such as indexes or caching) will alter the access patterns. In such an environment, end-to-end means tracing a request across several components in a complex architecture. This data can be truly valuable for designing large scale systems but, at the same time, may be too specific to the system inspected which, again, limits the incentive to publish it.
    • Data quality. The traces above have limitations due to the level of detail they represent. As we have seen, some have missing data, some have large granularity time stamps, others are inconveniently large to use. Cleaning data is a tedious process which limits the dataset publishing nowadays.


    Exploring Public Storage Traces was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • What is switchback testing for decision models?

    Tiffany Bogich

    What Is Switchback Testing for Decision Models?

    A/B testing for decision models

    Switchback testing for decision models allows algorithm teams to compare a candidate model to a baseline model in a true production environment, where both models are making real-world decisions for the operation. With this form of testing, teams can randomize which model is applied to units of time and/or location in order to mitigate confounding effects (like holidays, major events, etc.) that can impact results when doing a pre/post rollout test.

    Switchback tests can go by several names (e.g., time split experiments), and they are often referred to as A/B tests. While this is a helpful comparison for orientation, it’s important to acknowledge that switchback and A/B tests are similar but not the same. Decision models can’t be A/B tested the same way webpages can be due to network effects. Switchback tests allow you to account for these network effects, whereas A/B tests do not.

    For example, when you A/B test a webpage by serving up different content to users, the experience a user has with Page A does not affect the experience another user has with Page B. However, if you tried to A/B test delivery assignments to drivers — you simply can’t. You can’t assign the same order to two different drivers as a test for comparison. There isn’t a way to isolate treatment and control within a single unit of time or location using traditional A/B testing. That’s where switchback testing comes in.

    Illustration of switchback testing with models A and B represented by bunny shapes. Image from N. Misek and T. Bogich, What is switchback testing for decision models? (2023), Nextmv. Reposted with permission.

    Let’s explore this type of testing a bit further.

    What’s an example of switchback testing?

    Imagine you work at a farm share company that delivers fresh produce (carrots, onions, beets, apples) and dairy items (cheese, ice cream, milk) from local farms to customers’ homes. Your company recently invested in upgrading the entire vehicle fleet to be cold-chain ready. Since all vehicles are capable of handling temperature-sensitive items, the business is ready to remove business logic that was relevant to the previous hybrid fleet.

    Before the fleet upgrade, your farm share handled temperature-sensitive items last-in-first-out (LIFO). This meant that if a cold item such as ice cream was picked up, a driver had to immediately drop the ice cream off to avoid a sad melty mess. This LIFO logic helped with product integrity and customer satisfaction, but it also introduced inefficiencies with route changes and backtracking.

    After the fleet upgrade, the team wants to remove this constraint since all vehicles are capable of transporting cold items for longer with refrigeration. Previous tests using historical inputs, such as batch experiments (ad-hoc tests used to compare one or more models against offline or historical inputs [1]) and acceptance tests (tests with pre-defined pass/fail metrics used to compare the current model with a candidate model against offline or historical inputs before ‘accepting’ the new model [2]), have indicated that vehicle time on road and unassigned stops decrease for the candidate model compared to the production model that has the LIFO constraint. You’ve run a shadow test (an online test in which one or more candidate models is run in parallel to the current model in production but “in the shadows”, not impacting decisions [3]) to ensure model stability under production conditions. Now you want to let your candidate model have a go at making decisions for your production systems and compare the results to your production model.

    For this test, you decide to randomize based on time (every 1 hour) in two cities: Denver and New York City. Here’s an example of the experimental units for one city and which treatment was applied to them.

    Sample plan summary for a switchback test on Nextmv. Image from Nextmv.io Cloud Console (2024). Reposted with permission.
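    Nextmv generates this kind of plan for you, but purely as an illustration of the mechanics (all names and values below are hypothetical, not Nextmv’s API), a random hourly assignment could be sketched like this:

    import random
    import pandas as pd

    # Hypothetical sketch: build hourly experimental units for two cities over 4 weeks
    # and randomly assign the baseline (A) or candidate (B) model to each unit.
    random.seed(42)
    hours = pd.date_range("2024-01-01", periods=24 * 28, freq="h")
    units = [(city, hour) for city in ["Denver", "New York City"] for hour in hours]
    plan = pd.DataFrame(units, columns=["city", "start_time"])
    plan["model"] = [random.choice(["A", "B"]) for _ in range(len(plan))]
    print(plan.head())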

    After 4 weeks of testing, you find that your candidate model outperforms the production model by consistently having lower time on road, fewer unassigned stops, and happier drivers because they weren’t zigzagging across town to accommodate the LIFO constraint. With these results, you work with the team to fully roll out the new model (without the LIFO constraint) to both regions.

    Why do switchback testing?

    Switchback tests build understanding and confidence in the behavioral impacts of model changes when there are network effects in play. Because they use online data and production conditions in a statistically sound way, switchback tests give insight into how a new model’s decision making impacts the real world in a measured way rather than just “shipping it” wholesale to prod and hoping for the best. Switchback testing is the most robust form of testing to understand how a candidate model will perform in the real world.

    This type of understanding is something you can’t get from shadow tests. For example, if you run a candidate model that changes an objective function in shadow mode, all of your KPIs might look good. But if you run that same model as a switchback test, you might see that delivery drivers reject orders at a higher rate compared to the baseline model. There are just behaviors and outcomes you can’t always anticipate without running a candidate model in production in a way that lets you observe the model making operational decisions.

    Additionally, switchback tests are especially relevant for supply and demand problems in the routing space, such as last-mile delivery and dispatch. As described earlier, standard A/B testing techniques simply aren’t appropriate under these conditions because of network effects they can’t account for.

    When do you need switchback testing?

    There’s a quote from the Principles of Chaos Engineering: “Chaos strongly prefers to experiment directly on production traffic” [4]. Switchback testing (and shadow testing) is made for facing this type of chaos. As mentioned in the previous section, there comes a point when it’s time to see how a candidate model makes decisions that impact real-world operations. That’s when you need switchback testing.

    That said, it doesn’t make sense for the first round of tests on a candidate model to be switchback tests. You’ll want to run a series of historical tests such as batch, scenario, and acceptance tests, and then progress to shadow testing on production data. Switchback testing is often a final gate before committing to fully deploying a candidate model in place of an existing production model.

    Illustration of testing workflow that includes switchback testing prior to deploying a new model. Image by Haley Eshagh.

    How is switchback testing traditionally done?

    To perform switchback tests, teams often build out the infra, randomization framework, and analysis tooling from scratch. While the benefits of switchback testing are great, the cost to implement and maintain it can be high and often requires dedicated data science and data engineering involvement. As a result, this type of testing is not as common in the decision science space.

    Once the infra is in place and switchback tests are live, it becomes a data wrangling exercise to weave together the information to understand what treatment was applied at what time and reconcile all of that data to do a more formal analysis of the results.
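    For example, once the assignments and the operational logs are in hand, the reconciliation step might boil down to something like the sketch below (column and file names are hypothetical, not a Nextmv interface):

    import pandas as pd

    # Hypothetical sketch: attach the active treatment to each decision record,
    # then compare key metrics by model.
    decisions = pd.read_csv("decisions.csv", parse_dates=["created_at"])   # hypothetical operational log
    plan = pd.read_csv("switchback_plan.csv", parse_dates=["start_time"])  # hypothetical assignment plan

    decisions["start_time"] = decisions["created_at"].dt.floor("h")        # map each decision to its hourly unit
    joined = decisions.merge(plan, on=["city", "start_time"], how="left")

    print(joined.groupby("model")[["time_on_road", "unassigned_stops"]].mean())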

    A few good points of reference to dive into include blog posts on the topic from DoorDash like this one (they write about it quite a bit) [5], in addition to this Towards Data Science post from a Databricks solutions engineer [6], which references a useful research paper out of MIT and Harvard [7] that’s worth a read as well.

    Conclusion

    Switchback testing for decision models is similar to A/B testing, but allows teams to account for network effects. Switchback testing is a critical piece of the DecisionOps workflow because it runs a candidate model using production data with real-world effects. We’re continuing to build out the testing experience at Nextmv — and we’d like your input.

    If you’re interested in more content on decision model testing and other DecisionOps topics, subscribe to the Nextmv blog.

    Disclosures

    The author works for Nextmv as Head of Product.

    References

    [1] R. Gough, What are batch experiments for optimization models? (2023), Nextmv

    [2] T. Bogich, What’s next for acceptance testing? (2023), Nextmv

    [3] T. Bogich, What is shadow testing for optimization models and decision algorithms? (2023), Nextmv

    [4] Principles of Chaos Engineering (2019), Principles of Chaos

    [5] C. Sneider, Y. Tang, Experiment Rigor for Switchback Experiment Analysis (2019), DoorDash Engineering

    [6] M. Berk, How to Optimize your Switchback A/B Test Configuration (2021), Towards Data Science

    [7] I. Bojinov, D. Simchi-Levi, J. Zhao, Design and Analysis of Switchback Experiments (2020), arXiv


    What is switchback testing for decision models? was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.


  • Exploring a Two-Decade Trend: College Acceptance Rates and Tuition in the U.S.

    Exploring a Two-Decade Trend: College Acceptance Rates and Tuition in the U.S.

    Ryu Sonoda

    Is it harder to get into college these days?

    Background
    As a recent Grinnell College alum, I’ve closely observed and been impacted by significant shifts in the academic landscape. When I graduated, the acceptance rate at Grinnell had plummeted by 15% from the time I entered, paralleled by a sharp rise in tuition fees. This pattern wasn’t unique to my alma mater; friends from various colleges echoed similar experiences.

    This got me thinking: Is this a widespread trend across U.S. colleges? My theory was twofold: firstly, the advent of online applications might have simplified the process of applying to multiple colleges, thereby increasing the applicant pool and reducing acceptance rates. Secondly, an article from the Migration Policy Institute highlighted a doubling in the number of international students in the U.S. from 2000 to 2020 (from 500k to 1 million), potentially intensifying competition. Alongside this, I was curious about tuition fee trends from 2001 to 2022. My aim here is to unravel these patterns through data visualization. For the following analysis, all images, unless otherwise noted, are by the author!

    Dataset
    The dataset I utilized encompasses a range of data about U.S. colleges from 2001 to 2022, covering aspects like institution type, yearly acceptance rates, state location, and tuition fees. Sourced from the College Scorecard, the original dataset was vast, with over 3,000 columns and 10,000 rows. I meticulously selected pertinent columns for a focused analysis, resulting in a refined dataset available on Kaggle. To ensure relevance and completeness, I concentrated on 4-year colleges featured in the U.S. News college rankings, drawing the list from here.

    Change in Acceptance Rates Over the Years
    Let’s dive into the evolution of college acceptance rates over the past two decades. Initially, I suspected that I would observe a steady decline. Figure 1 illustrates this trajectory from 2001 to 2022. A consistent drop is evident until 2008, followed by fluctuations leading up to a notable increase around 2020–2021, likely a repercussion of the COVID-19 pandemic influencing gap year decisions and enrollment strategies.

    import pandas as pd
    import matplotlib.pyplot as plt

    # df_ranked: the subset of the College Scorecard data limited to U.S. News-ranked 4-year colleges
    avg_acp_ranked = df_ranked.groupby("year")["ADM_RATE_ALL"].mean().reset_index()

    plt.figure(figsize=(10, 6)) # Set the figure size
    plt.plot(avg_acp_ranked['year'], avg_acp_ranked['ADM_RATE_ALL'], marker='o', linestyle='-', color='b', label='Acceptance Rate')
    plt.title('Average Acceptance Rate Over the Years') # Set the title
    plt.xlabel('Year') # Label for the x-axis
    plt.ylabel('Average Acceptance Rate') # Label for the y-axis
    plt.grid(True) # Show grid

    # Show a legend
    plt.legend()
    # Display the plot
    plt.show()
    Figure 1

    However, the overall drop wasn’t as steep as my experience at Grinnell suggested. In contrast, when we zoom into the acceptance rates of more prestigious universities (Figure 2), a steady decline becomes apparent. This led me to categorize colleges into three groups based on their 2022 admission rates (Top 10% competitive, top 50%, and others) and analyze the trends within these segments.

    pres_colleges = ["Princeton University", "Massachusetts Institute of Technology", "Yale University", "Harvard University", "Stanford University"]
    pres_df = df[df['INSTNM'].isin(pres_colleges)]
    pivot_pres = pres_df.pivot_table(index="INSTNM", columns="year", values="ADM_RATE_ALL")
    pivot_pres.T.plot(linestyle='-')
    plt.title('Change in Acceptance Rate Over the Years')
    plt.xlabel('Year')
    plt.ylabel('Acceptance Rate')
    plt.legend(title='Colleges')
    plt.show()
    Figure 2

    Figure 3 unveils some surprising insights. Except for the least competitive 50%, colleges have generally seen an increase in acceptance rates since 2001. The fluctuations post-2008 across all but the top 10% of colleges could be attributed to economic factors like the recession. Notably, competitive colleges didn’t experience the pandemic-induced spike in acceptance rates seen elsewhere.

    top_10_threshold_ranked = df_ranked[df_ranked["year"] == 2001]["ADM_RATE_ALL"].quantile(0.1)
    top_50_threshold_ranked = df_ranked[df_ranked["year"] == 2001]["ADM_RATE_ALL"].quantile(0.5)

    top_10 = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] <= top_10_threshold_ranked)]["UNITID"]
    top_50 = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] > top_10_threshold_ranked) & (df_ranked["ADM_RATE_ALL"] <= top_50_threshold_ranked)]["UNITID"]
    others = df_ranked[(df_ranked["year"]==2001) & (df_ranked["ADM_RATE_ALL"] > top_50_threshold_ranked)]["UNITID"]

    top_10_df = df_ranked[df_ranked["UNITID"].isin(top_10)]
    top50_df = df_ranked[df_ranked["UNITID"].isin(top_50)]
    others_df = df_ranked[df_ranked["UNITID"].isin(others)]

    avg_acp_top10 = top_10_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
    avg_acp_others = others_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()
    avg_acp_top50 = top50_df.groupby("year")["ADM_RATE_ALL"].mean().reset_index()

    plt.figure(figsize=(10, 6)) # Set the figure size
    plt.plot(avg_acp_top10['year'], avg_acp_top10['ADM_RATE_ALL'], marker='o', linestyle='-', color='g', label='Top 10%')
    plt.plot(avg_acp_top50['year'], avg_acp_top50['ADM_RATE_ALL'], marker='o', linestyle='-', color='b', label='Top 50%')
    plt.plot(avg_acp_others['year'], avg_acp_others['ADM_RATE_ALL'], marker='o', linestyle='-', color='r', label='Others')
    plt.title('Average Acceptance Rate Over the Years') # Set the title
    plt.xlabel('Year') # Label for the x-axis
    plt.ylabel('Average Acceptance Rate') # Label for the y-axis


    # Show a legend
    plt.legend()
    # Display the plot
    plt.show()
    Figure 3

    One finding particularly intrigued me: when considering the top 10% of colleges, their acceptance rates hadn’t decreased notably over the years. This led me to question whether the shift in competitiveness was widespread or if it was a case of some colleges becoming significantly harder or easier to get into. The steady decrease in acceptance rates at prestigious institutions (shown in Figure 2) hinted at the latter.
    To get a clearer picture, I visualized the changes in college competitiveness from 2001 to 2022. Figure 4 reveals a surprising trend: about half of the colleges actually became less competitive, contrary to my initial expectations.

    pivot_pres_ranked = df_ranked.pivot_table(index="INSTNM", columns="year", values="ADM_RATE_ALL")
    pivot_pres_ranked_down = pivot_pres_ranked[pivot_pres_ranked[2001] >= pivot_pres_ranked[2022]]
    len(pivot_pres_ranked_down)

    pivot_pres_ranked_up = pivot_pres_ranked[pivot_pres_ranked[2001] < pivot_pres_ranked[2022]]
    len(pivot_pres_ranked_up)

    categories = ["Up", "Down"]
    values = [len(pivot_pres_ranked_up), len(pivot_pres_ranked_down)]

    plt.figure(figsize=(8, 6))
    plt.bar(categories, values, width=0.4, align='center', color=["blue", "red"])
    plt.xlabel('Change in acceptance rate')
    plt.ylabel('# of colleges')
    plt.title('Change in acceptance rate from 2001 to 2022')

    # Show the chart
    plt.tight_layout()
    plt.show()
    Figure 4

    This prompted me to explore possible factors influencing these shifts. My hypothesis, reinforced by Figure 2, was that already selective colleges became even more so over time. Figure 5 compares acceptance rates in 2001 and 2022.
    The 45-degree line delineates colleges that became more or less competitive. Those below the line saw reduced acceptance rates. A noticeable cluster in the lower-left quadrant represents selective colleges that became increasingly exclusive. This trend is underscored by the observation that colleges with initially low acceptance rates (left side of the plot) tend to fall below this dividing line, while those on the right are more evenly distributed.
    Furthermore, it’s interesting to note that since 2001, the most selective colleges are predominantly private. To test whether the changes in acceptance rates differed significantly between the top and bottom 50 percentile colleges, I conducted an independent t-test (Null hypothesis: θ_top = θ_bottom). The results showed a statistically significant difference.

    import numpy as np
    import seaborn as sns
    from matplotlib.patches import Ellipse

    pivot_region = pd.merge(pivot_pres_ranked[[2001, 2022]], df_ranked[["REGION","INSTNM", "UNIVERSITY", "CONTROL"]], on="INSTNM", how="right")

    plt.figure(figsize=(8, 8))
    sns.scatterplot(data=pivot_region, x=2001, y=2022, hue='CONTROL', palette='Set1', legend='full')
    plt.xlabel('Acceptance rate for 2001')
    plt.ylabel('Acceptance rate for 2022')
    plt.title('Change in acceptance rate')

    x_line = np.linspace(0, max(pivot_region[2001]), 100) # X-values for the line
    y_line = x_line # Y-values for the line (slope = 1)

    plt.plot(x_line, y_line, label='45-Degree Line', color='black', linestyle='--')
    # Define ellipse parameters (center, width, height, angle)
    ellipse_center = (0.25, 0.1) # Center of the ellipse
    ellipse_width = 0.4 # Width of the ellipse
    ellipse_height = 0.2 # Height of the ellipse
    ellipse_angle = 45 # Rotation angle in degrees

    # Create an Ellipse patch
    ellipse = Ellipse(
    xy=ellipse_center,
    width=ellipse_width,
    height=ellipse_height,
    angle=ellipse_angle,
    edgecolor='b', # Edge color of the ellipse
    facecolor='none', # No fill color (transparent)
    linewidth=2 # Line width of the ellipse border
    )

    # Add the ellipse to the current axes
    plt.gca().add_patch(ellipse)

    plt.legend()
    plt.gca().set_aspect('equal')
    plt.show()
    Figure 5
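    The independent t-test itself isn’t shown in the code above; a minimal sketch, assuming the groups are colleges above and below the median 2001 acceptance rate and the compared quantity is the 2001-to-2022 change in acceptance rate, might look like this:

    from scipy import stats

    # Change in acceptance rate per college between 2001 and 2022
    change = pivot_pres_ranked[2022] - pivot_pres_ranked[2001]
    median_2001 = pivot_pres_ranked[2001].median()

    # Split at the 2001 median: lower acceptance rate = more selective (assumed grouping)
    selective = change[pivot_pres_ranked[2001] <= median_2001].dropna()
    less_selective = change[pivot_pres_ranked[2001] > median_2001].dropna()

    t_stat, p_value = stats.ttest_ind(selective, less_selective)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")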

    Another aspect that piqued my curiosity was regional differences. Figure 6 lists the top 5 colleges with the most significant decrease in acceptance rates (calculated by dividing the 2001 acceptance rate by the 2022 rate, so higher values mean bigger drops).
    It was astonishing to see how high the acceptance rate for the University of Chicago was two decades ago — half of the applicants were admitted then!
    This also helped me understand my initial bias towards a general decrease in acceptance rates; notably, Grinnell College, my alma mater, is among these top 5 with a significant drop in acceptance rate.
    Interestingly, three of the top five colleges are located in the Midwest. My theory is that with the advent of the internet, these institutions, not as historically renowned as those on the West and East Coasts, have gained more visibility both domestically and internationally.

    # Ratio of 2001 to 2022 acceptance rate; values above 1 mean the college became more selective
    pivot_pres_ranked["diff"] = pivot_pres_ranked[2001] / pivot_pres_ranked[2022]
    tmp = pivot_pres_ranked.reset_index()
    tmp = tmp.merge(df_ranked[df_ranked["year"]==2022][["INSTNM", "STABBR", "CITY"]],on="INSTNM")
    tmp.sort_values(by="diff",ascending=False)[["INSTNM", "diff", "STABBR", "CITY"]].head(5)
    Figure 6

    In the following sections, we’ll explore tuition trends and their correlation with these acceptance rate changes, delving deeper into the dynamics shaping modern U.S. higher education.

    Change in Tuition Over the Years
    Analyzing tuition trends over the past two decades reveals some eye-opening patterns. Figure 7 presents the average tuition over the years across different categories: private, public in-state, public out-of-state, and overall. A steady climb in tuition fees is evident in all categories.
    Notably, private universities exhibit a higher increase compared to public ones, and the rise in public in-state tuition appears relatively modest. However, it’s striking that the overall average tuition has more than doubled since 2001, soaring from $15k to $35k.

    avg_tuition = df_ranked.groupby('year')["TUITIONFEE_OUT"].mean().reset_index()
    avg_tuition_private = df_ranked[df_ranked['CONTROL'] != "Public"].groupby('year')["TUITIONFEE_OUT"].mean().reset_index()
    avg_tuition_public_out = df_ranked[df_ranked['CONTROL'] == "Public"].groupby('year')["TUITIONFEE_OUT"].mean().reset_index()
    avg_tuition_public_in = df_ranked[df_ranked['CONTROL'] == "Public"].groupby('year')["TUITIONFEE_IN"].mean().reset_index()

    plt.figure(figsize=(10, 6)) # Set the figure size (optional)
    plt.plot(avg_tuition_public_out['year'], avg_tuition_public_out['TUITIONFEE_OUT'], marker='o', linestyle='-', color='g', label='Out-state Tuition for Public')
    plt.plot(avg_tuition_public_in['year'], avg_tuition_public_in['TUITIONFEE_IN'], marker='o', linestyle='-', color='y', label='In-state Tuition for Public')
    plt.plot(avg_tuition_private['year'], avg_tuition_private['TUITIONFEE_OUT'], marker='o', linestyle='-', color='r', label='Tuition for Private')
    plt.plot(avg_tuition['year'], avg_tuition['TUITIONFEE_OUT'], marker='o', linestyle='-', color='b', label='Tuition for All')

    plt.title('Average Tuition Over the Years') # Set the title
    plt.xlabel('Year') # Label for the x-axis
    plt.ylabel('Average Tuition') # Label for the y-axis

    # Show a legend
    plt.legend()
    # Display the plot
    plt.show()
    Figure 7

    One might argue that this increase is in line with general economic inflation, but a comparison with inflation rates paints a different picture (Figure 8). Except for the last two years, where inflation spiked due to the pandemic, tuition hikes consistently outpaced inflation.
    Although the pattern of tuition increases mirrors that of inflation, it’s important to note that unlike inflation, which dipped into negative territory in 2009, tuition increases never fell below zero. Though the rate of increase has been slowing, the hope is for it to eventually stabilize and halt the upward trajectory of tuition costs.

    # Year-over-year percentage change in average tuition; pct_change() leaves the
    # first year as NaN, which is set to 1 here
    avg_tuition['Inflation tuition'] = avg_tuition['TUITIONFEE_OUT'].pct_change() * 100
    avg_tuition.iloc[0, 2] = 1
    avg_tuition

    plt.figure(figsize=(10, 6)) # Set the figure size
    # df_inflation is assumed to be loaded separately with annual U.S. inflation rates ('year', 'Inflation rate')
    plt.plot(df_inflation['year'], df_inflation['Inflation rate'], marker='o', linestyle='-', color='r', label='Inflation')
    plt.plot(avg_tuition['year'],avg_tuition['Inflation tuition'], marker='o', linestyle='-', color='b', label='Tuition')
    plt.title('Increase in Tuition and Inflation Over the Years') # Set the title
    plt.xlabel('Year') # Label for the x-axis
    plt.ylabel('Rate') # Label for the y-axis


    # Show a legend
    plt.legend()
    # Display the plot
    plt.show()
    Figure 8

    In exploring the characteristics of colleges that have raised tuition fees more significantly, I hypothesized that more selective colleges might exhibit higher increases due to greater demand. Figure 9 investigates this theory. Contrary to expectations, the data does not show a clear trend correlating selectivity with tuition increase. The change in tuition seems to hover around an average of 2.2 times across various acceptance rates. However, it’s noteworthy that tuition at almost all selective universities has more than doubled, whereas the distribution for other universities is more varied. This indicates a lower standard deviation in tuition changes at selective schools compared to their less selective counterparts.

    tuition_pivot = df_ranked.pivot_table(index="INSTNM", columns="year", values="TUITIONFEE_OUT")
    tuition_pivot["TUI_CHANGE"] = tuition_pivot[2022]/tuition_pivot[2001]  # ratio of 2022 to 2001 tuition
    tuition_pivot = tuition_pivot[tuition_pivot["TUI_CHANGE"] < 200]  # drop extreme ratios (likely data errors)
    print(tuition_pivot["TUI_CHANGE"].isnull().sum())  # colleges missing tuition for 2001 or 2022
    tmp = pd.merge(tuition_pivot["TUI_CHANGE"], df_ranked[df_ranked["year"]==2022][["ADM_RATE_ALL", "INSTNM", "REGION", "STABBR", "CONTROL"]], on="INSTNM", how="right")
    plt.figure(figsize=(8, 8))
    sns.scatterplot(data=tmp, x="ADM_RATE_ALL", y="TUI_CHANGE", palette='Set2', legend='full')
    plt.xlabel('Acceptance rate in 2022')
    plt.ylabel('Change in Tuition')
    plt.title('Acceptance rate vs Change in Tuition')

    plt.legend()
    plt.show()
    Figure 9
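    The standard-deviation comparison mentioned above isn’t shown in the code; as a rough check (using an arbitrary 50% acceptance-rate cutoff for illustration), it could look like this:

    # Spread of tuition changes for more vs. less selective colleges (illustrative cutoff)
    selective_mask = tmp["ADM_RATE_ALL"] <= 0.5
    print("Selective std:", tmp.loc[selective_mask, "TUI_CHANGE"].std())
    print("Less selective std:", tmp.loc[~selective_mask, "TUI_CHANGE"].std())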

    After examining the relationship between acceptance rates and tuition hikes, I turned my attention to regional factors. I hypothesized that schools on the West Coast, influenced by the economic surge of tech companies, might have experienced significant tuition increases. To test this, I visualized the tuition growth for each state in Figure 10.
    Contrary to my expectations, the West Coast wasn’t the region with the highest rise in tuition. Instead, states like Oklahoma and Utah saw substantial increases, while South Dakota and New Mexico had the smallest hikes. While there are exceptions, the overall trend suggests that tuition increases in the western states generally outpace those in the eastern states.

    import geopandas as gpd

    sta_tui = tmp.groupby("STABBR")["TUI_CHANGE"].mean()
    sta_tui = sta_tui.reset_index()

    shapefile_path = "path_to_shape_file"
    gdf = gpd.read_file(shapefile_path)

    sta_tui["STUSPS"] = sta_tui["STABBR"]
    merged_data = gdf.merge(sta_tui, on="STUSPS", how="left")
    final = merged_data.drop([42, 44, 45, 38, 13])  # drop rows excluded from the map (indices specific to this shapefile)

    # Plot the choropleth map
    fig, ax = plt.subplots(1, 1, figsize=(16, 20))
    final.plot(column='TUI_CHANGE', cmap="Reds", ax=ax, linewidth=0.3, edgecolor='0.8', legend=True)
    ax.set_title('Average Change in Tuition Across the U.S.')
    plt.axis('off') # Turn off axis; the colorbar added by legend=True serves as the legend
    plt.show()
    Figure 10

    Future Directions and Limitations

    While this analysis provides insights based on single-year comparisons for changes in acceptance rates and tuition, a more comprehensive view could be obtained from a 5-year average comparison. In my preliminary analysis using this approach, the conclusions were similar.
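    A sketch of that comparison (the specific years chosen here are an assumption, not the author’s exact preliminary analysis) could reuse the ranked dataset directly:

    # Compare 5-year average acceptance rates at each end of the period instead of single years
    pivot_acp = df_ranked.pivot_table(index="INSTNM", columns="year", values="ADM_RATE_ALL")
    early = pivot_acp[[2001, 2002, 2003, 2004, 2005]].mean(axis=1)
    late = pivot_acp[[2018, 2019, 2020, 2021, 2022]].mean(axis=1)
    print((late - early).describe())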

    The dataset used also contains many other attributes like racial proportions, mean SAT scores, and median household income. However, I didn’t utilize these due to missing values in older data. By focusing on more recent years, these additional factors could offer deeper insights. For those interested in further exploration, the dataset is available on Kaggle.

    It’s important to note that this analysis is based on colleges ranked in the U.S. News, introducing a certain degree of bias. The trends observed may differ from the overall U.S. college landscape.

    For data enthusiasts, my code and methodology are accessible for further exploration. I invite you to delve into it and perhaps uncover new perspectives or validate these findings. Thank you for joining me on this data-driven journey through the changing landscape of U.S. higher education!

    Sources

    [1] Emma Israel and Jeanne Batalova. “International Students in the United States” (January 14, 2021). https://www.migrationpolicy.org/article/international-students-united-states
    [2] U.S. Department of Education, College Scorecard (last updated October 10, 2023). Public Domain, https://collegescorecard.ed.gov/
    [3] Andrew G. Reiter, “U.S. News & World Report Historical Liberal Arts College and University Rankings” http://andyreiter.com/datasets/


    Exploring a Two-Decade Trend: College Acceptance Rates and Tuition in the U.S. was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
