Tag: AI

  • Detecting Insecure Code with LLMs

    Melanie Hart Buehler

    Prompt Experiments for Python Vulnerability Detection


    If you are a software professional, you might dread opening the security scan report on the morning of a release. Why? You know that it’s a great tool for enhancing the quality and integrity of your work, but you also know you are going to spend the next couple of hours scrambling to resolve all the security issues before the deadline. If you are lucky, many issues will be false alarms, but you will have to manually verify the status of each one and quickly patch the rest after the code has been finalized. If you’ve experienced this, you’ve likely wished for a smoother, less ad-hoc process for identifying, triaging, and fixing the vulnerabilities in your code base. The good news is that recent studies have demonstrated that large language models (LLMs) can proficiently classify code as secure or insecure, explain the weaknesses, and even propose corrective changes. This has the potential to significantly streamline secure coding practices beyond the traditional static scans.

    This article presents a short review of some recent findings in vulnerability detection and repair with LLMs and then dives into some experiments assessing GPT4’s ability to identify insecure code in a Python dataset using different prompting techniques. If you want to explore the Jupyter notebook or test your own code, head over to the pull request in OpenAI Cookbook (currently under review).

    Before we get started prompting GPT4, there are a few key concepts and definitions that will help build the foundation we need to design logical experiments.

    Common Weakness Enumeration

    In 2006, the government-funded research organization MITRE started regularly publishing the Common Weakness Enumeration (CWE), which is a community-developed common taxonomy of software and hardware weakness definitions and descriptions. A “weakness”, in this sense, is a condition in software or hardware that can lead to vulnerabilities. A list of the 2023 Top 25 Most Dangerous CWEs highlights the biggest repeat offenders. There is another list of 15 “Stubborn” CWEs that have been present on every Top 25 list from 2019–2023. They can be divided roughly into three groups:

    • Group 1: Weak handling of untrusted data sources (e.g. Command/SQL Injection, Path Traversal, Improper Input Validation, and Cross-site Scripting)
    • Group 2: Weak memory management or type enforcement (e.g. NULL Pointer Dereference)
    • Group 3: Weak security design choices (e.g. Hard-coded Credentials)

    To help keep the scope of the investigation narrow and well-defined, we will focus on the first group of CWEs.
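To make Group 1 concrete, here is a small illustrative Python snippet (our own toy example, not drawn from the dataset used later in this article) containing an OS command injection weakness (CWE-78), together with a safer variant:

import subprocess

# Vulnerable (CWE-78): a user-controlled hostname is interpolated into a shell
# command, so input like "example.com; rm -rf ~" injects an extra command.
def ping_unsafe(hostname: str) -> int:
    return subprocess.call(f"ping -c 1 {hostname}", shell=True)

# Safer: pass arguments as a list and avoid invoking a shell at all.
def ping_safer(hostname: str) -> int:
    return subprocess.call(["ping", "-c", "1", hostname])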

    Static Code Analysis

The conventional approach to automating the detection of insecure code involves the use of static analysis tools such as CodeQL, Bandit, SonarQube, Coverity, and Snyk. These tools can be employed at any time but are typically used to scan the code for vulnerabilities after the code-freeze stage and before the completion of formal release processes. They work by parsing the source code into an abstract syntax tree or control flow graph that represents how the code is organized and how the components (classes, functions, variables, etc.) all relate to each other. Rule-based analysis and pattern matching are then used to detect a wide range of issues. Static analysis tools can be integrated with IDEs and CI/CD systems throughout the development cycle, and many offer custom configuration, querying, and reporting options. They are very useful, but they have some drawbacks (in addition to those last-minute high-pressure bug-fixing parties):

    • Resource-intensive: They convert extensive codebases into databases to execute complex queries
    • False positives: They often include a high number of non-issues
    • Time-intensive follow-up: Significant effort is required to validate and repair the issues

    These limitations have understandably inspired researchers to explore new ways to enhance code security practices, such as generative AI.

    Previous Work

    Recent work has demonstrated the potential of LLMs across various stages of the development life cycle. LLMs have been shown to be useful for secure code completion, test case generation, vulnerable or malicious code detection, and bug fixing.

    Here are a few references of note:

1. A comprehensive evaluation¹ compared LLMs of different parameter sizes with traditional static analyzers in identifying and correcting software vulnerabilities. GPT4 in particular was found to be very capable, especially in light of its ability to explain and fix vulnerable code. Some additional claims from the paper — significant code understanding seems to emerge between 6 and 175 billion parameters, with the first hint of advanced programmer skills appearing beyond 13 billion parameters; prediction success may be boosted when prompts combine the tasks of identifying and fixing security issues together; LLMs combined with traditional static code analysis may offer the best of both worlds.
    2. A novel study and dataset² found that even advanced AI developer assistants are susceptible to writing insecure code and discovered that 40% of generated code completions contained CWEs.
    3. An investigation³ reported that GPT3 outperformed a modern static code analyzer in predicting security vulnerabilities.
    4. A research paper⁴ showed that LLMs can assist developers in identifying and localizing vulnerable code, particularly when combined with static analyzers.
    5. In another study⁵, LLMs successfully fixed all synthetic and hand-crafted scenarios, although they did not adequately address all real-world security bug scenarios.

    While LLMs have shown promise in outperforming traditional approaches, many of these works point out that they are also susceptible to false positives and sensitive to the structure and wording of prompts. In our experiments, we aim to validate and build upon these results by applying more variations to the prompt template.

    Data Source and Preprocessing

    We will draw from one of the most widely-adopted datasets for secure code benchmarking of LLMs. The dataset (with license CC BY 4.0) from Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions² is made up of prompts called “scenarios” that were hand-designed to elicit certain CWEs when used as input to a code-generating LLM. We have mined the output code completions that were included with the publication, since they were subsequently scanned by a static code analyzer and come with labels of “Vulnerable” and “Not Vulnerable”. It should be noted, again, that the code in this dataset is model-generated from manually written prompts, so it lacks some real-world gravitas, but we chose it for a few reasons:

    1. It has a large number of Python examples, which is the language of choice for this study
    2. It has both vulnerable and non-vulnerable code examples, which is important for assessing both false positives and false negatives
    3. Related to (2), the fact that there are vulnerable and non-vulnerable code snippets for the same scenario means we can use the non-vulnerable completions as “suggested fixes” in some of the prompts, which will be explained in the relevant experiment section

    We understand that there are other datasets we could have used, and it is left for further research to explore CWE prediction capability with other data sources, too.

    The Python code snippets from the Copilot Scenario raw data files² were preprocessed with the following steps:

Open the project's aggregated "diversity of weakness" results file
Filter for Python language rows
FOR each Python scenario/row
    Navigate to the scenario's dedicated results folder
    Open the scenario's CodeQL or authors' manual assessment file
    FOR each generated code completion for the scenario
        Create a record containing its metadata and classification label
        Read in the .py file containing the code completion
        FOR each line in the file
            IF the line is 100% comments
                Remove the line
            END IF
        END FOR
        Insert the cleaned code example into the record
    END FOR
    Assert that the scenario totals agree with the original aggregated report
END FOR
Save the final file in JSON format
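As an illustration of the comment-stripping step above, here is a minimal Python sketch (a hypothetical helper, not the authors' actual preprocessing code); it drops only whole-line # comments and makes no attempt to handle docstrings:

def strip_comment_lines(source: str) -> str:
    """Remove lines that consist entirely of a # comment."""
    kept = []
    for line in source.splitlines():
        if line.strip().startswith("#"):
            continue  # the line is 100% comment, drop it
        kept.append(line)
    return "\n".join(kept)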

    Data Exploration and Subsetting

    We explored the distribution of CWEs and selected a sample of size 45 spread evenly over vulnerable and non-vulnerable codes from the Group 1 CWEs (we targeted a sample size of 50 but were limited by the number available in one of the groups). The CWE descriptions are:

    • CWE-20: Improper Input Validation
    • CWE-22: Improper Limitation of a Pathname to a Restricted Directory (‘Path Traversal’)
    • CWE-78: Improper Neutralization of Special Elements used in an OS Command (‘OS Command Injection’)
    • CWE-79: Improper Neutralization of Input During Web Page Generation (‘Cross-site Scripting’)
    • CWE-502: Deserialization of Untrusted Data

We began with a zero-shot baseline run and then repeated the experiment a few more times, building up the complexity of the prompt by adding strategies like few-shot in-context learning. We prompted the LLM to identify vulnerable code with no mention of which CWE it might be looking for (i.e. without labels).

    Experiment 1: Zero-shot

In a zero-shot prompt, you ask the model to make a prediction with no example or information other than instructions. Our zero-shot template was inspired by this paper⁴ and includes a role, a code delimiter, and a request to output JSON format only. It also includes an instruction to “think step-by-step”. The code snippet under test is inserted into {code}.

    Prompt

    You are a brilliant software security expert. 
    You will be provided with a python code delimited by triple backticks.
    If it contains any CWE security vulnerabilities, write Vulnerable.
    If the code does not contain any vulnerabilities, write Not Vulnerable.
    Format your response as a JSON object with "label" as the key
    for vulnerability status and "cwe" as the vulnerability number found.
    Think about the answer step by step, and only answer with JSON.

    Python code: ```{code}```

    Answer:
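For reference, here is a minimal sketch of how a template like this could be sent to the model, assuming the openai (v1+) Python client and a ZERO_SHOT_TEMPLATE string holding the prompt above; the function and variable names are ours, not necessarily those used in the notebook:

import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_snippet(code: str, template: str, model: str = "gpt-4") -> dict:
    """Fill the prompt template with one code snippet and parse the JSON answer."""
    prompt = template.format(code=code)
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # keep repeated runs as comparable as possible
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)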

    Results

    Accuracy: 0.67
    Precision: 0.60
    Recall: 0.86
    F1 Score: 0.71

    Zero-shot Confusion Matrix
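The metrics above can be computed from the per-snippet predictions with scikit-learn; a sketch, assuming y_true and y_pred are lists of 0/1 labels with 1 meaning “Vulnerable”:

from sklearn.metrics import (
    accuracy_score, confusion_matrix, f1_score, precision_score, recall_score,
)

def report(y_true: list[int], y_pred: list[int]) -> None:
    """Print the four headline metrics and the raw confusion matrix."""
    print(f"Accuracy:  {accuracy_score(y_true, y_pred):.2f}")
    print(f"Precision: {precision_score(y_true, y_pred):.2f}")
    print(f"Recall:    {recall_score(y_true, y_pred):.2f}")
    print(f"F1 Score:  {f1_score(y_true, y_pred):.2f}")
    print(confusion_matrix(y_true, y_pred))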

    Experiment 2: Few-shot

In the next experiment, we add the concept of in-context or “few-shot” learning and include a few successful code-answer examples before asking the LLM to perform the same operation on the unseen code. These examples were constructed from the remainder of the dataset (a selection sketch follows the list below), and care was taken to:

    • Draw from different scenarios than the code snippet under test, to avoid contamination
    • Draw exactly two vulnerable examples and one non-vulnerable example
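Here is a minimal sketch of that selection step, assuming each candidate example is a dict with scenario, label, code, and answer fields (field names are ours; the notebook may organize the data differently):

import random

def build_shots(candidates: list[dict], test_scenario: str, seed: int = 0) -> list[dict]:
    """Pick two vulnerable and one non-vulnerable example from other scenarios."""
    rng = random.Random(seed)
    pool = [c for c in candidates if c["scenario"] != test_scenario]
    vulnerable = [c for c in pool if c["label"] == "Vulnerable"]
    secure = [c for c in pool if c["label"] == "Not Vulnerable"]
    shots = rng.sample(vulnerable, 2) + rng.sample(secure, 1)
    rng.shuffle(shots)  # avoid a fixed label ordering in the prompt
    return shots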

    Prompt

    You are a brilliant software security expert. 
    You will be provided with a python code delimited by triple backticks.
    If it contains any CWE security vulnerabilities, write Vulnerable.
    If the code does not contain any vulnerabilities, write Not Vulnerable.
    Format your response as a JSON object with "label" as the key
    for vulnerability status and "cwe" as the vulnerability number found.
    Think about the answer step by step, and only answer with JSON.

    Python code: ```{example_0}```

    Answer: {answer_0}

    Python code: ```{example_1}```

    Answer: {answer_1}

    Python code: ```{example_2}```

    Answer: {answer_2}

    Python code: ```{code}```

    Answer:

    Results

    Accuracy: 0.76
    Precision: 0.71
    Recall: 0.81
    F1 Score: 0.76

    Few-shot Confusion Matrix

    Experiment 3: KNN Few-shot

    This Microsoft blog post describes an interesting technique called KNN-based few-shot example selection that can boost LLM response quality when using in-context examples. For this next experiment, instead of sampling shots at random, we calculate a similarity score between the input code and each candidate example and construct shots from the most similar candidates (still keeping the scenarios distinct). We use the ROUGE-L metric, but other metrics could be used too. The prompt template did not change from the second experiment.
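A minimal sketch of this similarity-based selection using the rouge-score package (our choice of implementation; any lexical similarity metric could be substituted), where candidates are assumed to already exclude the scenario of the code under test:

from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)

def most_similar(code_under_test: str, candidates: list[dict], k: int = 3) -> list[dict]:
    """Return the k candidate examples closest to the input code by ROUGE-L F1."""
    scored = [
        (_scorer.score(code_under_test, c["code"])["rougeL"].fmeasure, c)
        for c in candidates
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]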

    Results

    Accuracy: 0.73
    Precision: 0.70
    Recall: 0.76
    F1 Score: 0.73

    KNN Few-shot Confusion Matrix

    Experiment 4: KNN Few-shot with Code Fix

In this variation of the prompt, we include a request for a fixed version of the code if a CWE is found. This approach was inspired by Noever¹, who proposed that prompting for CWE detection and a fix together might bring about a “virtuous cycle” and force the LLM to “self-audit” or think more deeply about the steps needed to accurately identify vulnerabilities, similar to chain-of-thought prompting. We did this by constructing vulnerable code in-context examples with code fix suggestions drawn from the non-vulnerable code samples for the same scenarios.

    Prompt

    You are a brilliant software security expert. 
    You will be provided with a python code delimited by triple backticks.
    If it contains any CWE security vulnerabilities, write Vulnerable.
    If the code does not contain any vulnerabilities, write Not Vulnerable.
    If the code has the vulnerability, write a repaired secure version of the
    code that preserves its exact functionality.
    Format your response as a JSON object with "label" as the key
    for vulnerability status, "cwe" as the vulnerability found,
    and "fix" for the fixed code snippet.
    Think about the answer step by step, and only answer with JSON.

    Python code: ```{example_0}```

    Answer: {answer_0}

    Python code: ```{example_1}```

    Answer: {answer_1}

    Python code: ```{example_2}```

    Answer: {answer_2}

    Python code: ```{code}```

    Answer:

    Results

    Accuracy: 0.80
    Precision: 0.73
    Recall: 0.90
    F1 Score: 0.81

    KNN Few-shot Fix Confusion Matrix

    In addition to CWE detection, this experiment has the benefit of producing suggested fixes. We have not evaluated them for quality yet, so that is an area for future work.

    Results and Next Steps

    On our small data sample, GPT4’s accuracy was 67% and its F1 score was 71% without any complex prompt adaptations. Small improvements were offered by some of the prompting techniques we tested, with few-shot and requesting a code fix standing out. The combination of techniques bumped accuracy and F1 score by about ten percentage points each from baseline, both metrics reaching or exceeding 80%.

    Results can be quite different between models, datasets, and prompts, so more investigation is needed. For example, it would be interesting to:

    • Test smaller models
    • Test a prompt template that includes the CWE label, to investigate the potential for combining LLMs with static analysis
    • Test larger and more diverse datasets
    • Evaluate the security and functionality of LLM-proposed code fixes
    • Study more advanced prompting techniques such as in-context example chains-of-thought, Self-Consistency, and Self-Discover

    If you would like to see the code that produced these results, run it on your own code, or adapt it for your own needs, check out the pull request in OpenAI Cookbook (currently under review).

    Thank you to my colleagues Matthew Fleetwood and Abolfazl Shahbazi who made contributions and helped to review this article.

    Citations

    [1] D. Noever, Can Large Language Models Find And Fix Vulnerable Software? (2023), arXiv preprint arXiv:2308.10345

    [2] H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt and R. Karri, Asleep at the Keyboard? Assessing the Security of GitHub Copilot’s Code Contributions (2022), 2022 IEEE Symposium on Security and Privacy (SP)

    [3] C. Koch, I Used GPT-3 to Find 213 Security Vulnerabilities in a Single Codebase (2023), https://betterprogramming.pub/i-used-gpt-3-to-find-213-security-vulnerabilities-in-a-single-codebase-cc3870ba9411

    [4] A. Bakhshandeh, A. Keramatfar, A. Norouzi and M. M. Chekidehkhoun, Using ChatGPT as a Static Application Security Testing Tool (2023), arXiv preprint arXiv:2308.14434

    [5] H. Pearce, B. Tan, B. Ahmad, R. Karri and B. Dolan-Gavitt, Examining Zero-Shot Vulnerability Repair with Large Language Models (2023), 2023 IEEE Symposium on Security and Privacy (SP)



  • Understanding the Sparse Mixture of Experts (SMoE) Layer in Mixtral


    Matthew Gunton

    This blog post will explore the findings of the “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” paper and its implementation in Mixtral


    The Quest for Specialization

When tackling a difficult problem, divide and conquer is often a valuable strategy. Whether it be Henry Ford’s assembly lines, the way merge sort partitions arrays, or how society at large tends to have people who specialize in specific jobs, the list goes on and on!

    Naturally, when people approached the task of teaching computers to reason, it made sense to divide up the tasks we had for the machine into component pieces — for example, one component for math, one component for science, one for language, etc.


    Nevertheless, this idea has not yet been successfully realized. It’s possible this is failing for the same reason that our brain doesn’t have nearly independent components: complex reasoning requires using many different parts in concert, not separately.

    For the longest time, this idea lay somewhat dormant, until people stopped trying to create different components at the high level of mathematics or sciences, but rather at the lowest level for these neural networks — the tokens.

The “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” paper goes into detail on how creating experts at the token level can be incredibly effective, both in reducing costs and in increasing the quality of responses.

    Token-Level Mixture of Experts

Let’s begin with the idea of an ‘expert’ in this context. Experts are feed-forward neural networks. We then connect them to our main model via gates that will route the signal to specific experts. You can think of these experts as simply more complex neurons within a layer of the main network.

    Figure 1 from the paper

    The problem with a naive implementation of the gates is that you have significantly increased the computational complexity of your neural network, potentially making your training costs enormous (especially for LLMs). So how do you get around this?

    Conditional Computation & Sparsely Gated Mixture of Experts

The problem here is that neural networks will be required to calculate the value of a neuron so long as there is any signal going to it, so even the faintest amount of information sent to an expert triggers the whole expert network to be computed. The authors of the paper get around this by creating a function, G(x), that forces most low-value signals to compute to zero.

    Equation 1 from the paper

In the above equation, G(x) is our gating function, and E(x) is a function representing our expert. As any number times zero is zero, this logic prevents us from having to run our expert network when we are given a zero by our gating function. So how does the gating function determine which experts to compute?
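As a concrete illustration of Equation 1, here is a minimal PyTorch sketch (our own simplification for a single token vector, not the paper's reference implementation) of the weighted sum over experts that simply skips any expert whose gate value is zero:

import torch
from torch import nn

class SparseMoELayer(nn.Module):
    def __init__(self, experts: nn.ModuleList, gate: nn.Module):
        super().__init__()
        self.experts = experts  # each expert E_i is a small feed-forward network
        self.gate = gate        # returns one gate value per expert, mostly zeros

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g = self.gate(x)  # shape: (num_experts,)
        y = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            if g[i] != 0:                 # conditional computation: skip dead gates
                y = y + g[i] * expert(x)  # G(x)_i * E_i(x)
        return y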

    Gating Function

    The gating function itself is a rather ingenious way to only focus on the experts that you want. Let’s look at the equations below and then I’ll dive into how they all work.

    Equations 3, 4, and 5 from the paper

Going from bottom to top, equation 5 is simply a step function. If the input is not within a certain range (here the top k elements of the list v), it will return -infinity, thus assuring a perfect 0 when plugged into Softmax. If the value is not -infinity, then a signal is passed through. This k parameter allows us to decide how many experts we’d like to hear from (k=1 would only route to 1 expert, k=2 would only route to 2 experts, etc.)

Equation 4 is how we determine what is in the list that we select the top k values from. We begin by multiplying the input to the gate (the signal x) by some weight Wg. This Wg is what will be trained in each successive round for the neural network. Note that the weight associated with each expert likely has a distinct value. Now, to help prevent the same expert from being chosen every single time, we add in some statistical noise via the second half of our equation. The authors propose distributing this noise along a normal distribution, but the key idea is to add in some randomness to help with expert selection.

Equation 3 simply combines the two equations and puts them into a SoftMax function so that we can be sure that -infinity gets us 0, and any other value will send a signal through to the expert.
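Putting equations 3, 4, and 5 together, here is a minimal sketch of noisy top-k gating for a single token vector (our simplification of the paper's formulation; w_gate and w_noise are the trainable gating weights):

import torch
import torch.nn.functional as F
from torch import nn

class NoisyTopKGate(nn.Module):
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.w_gate = nn.Parameter(torch.zeros(d_model, num_experts))
        self.w_noise = nn.Parameter(torch.zeros(d_model, num_experts))
        self.k = k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        clean = x @ self.w_gate                            # Equation 4, first term
        noise_scale = F.softplus(x @ self.w_noise)
        h = clean + torch.randn_like(clean) * noise_scale  # Equation 4, noise term
        # Equation 5: keep the top k logits and send the rest to -infinity
        top_vals, top_idx = torch.topk(h, self.k)
        masked = torch.full_like(h, float("-inf"))
        masked[top_idx] = top_vals
        # Equation 3: softmax maps -infinity to exactly 0, so only k gates fire
        return torch.softmax(masked, dim=-1)

Setting k=1 or k=2 here corresponds directly to the routing choices discussed above.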

Image by Author. The above is a graph of a sigmoid. While a sigmoid and softmax are distinct functions (a key difference being that softmax typically acts on multiple variables, while a sigmoid has only one dependent variable), the shape of the two functions is similar, which is why I show it here for reference.

    The “sparse” part of the title comes from sparse matrices, or matrices where most of the values are zero, as this is what we effectively create with our gating function.

    Optimizing the Loss Function to Balance Expert Usage

    While our noise injection is valuable to reduce expert concentration, the authors found it was not enough to fully overcome the issue. To incentivize the model to use the experts nearly equally, they adjusted the loss function.

    Equations 6 and 7 from the paper

Equation 6 shows how they define importance in terms of the gate function — this makes sense, as the gate function is ultimately the decider of which expert gets used. Importance here is the sum, over a batch of inputs, of each expert’s gate values. They define their loss function as the coefficient of variation of the set of importance values. Put simply, this means we are finding a value that represents just how much each expert is used, where a select few experts being used creates a big value and all of them being used creates a small value. The w_importance factor is a hyperparameter that can push the model to use more of the experts.

Image courtesy of Google Search. This shows the formula for calculating the coefficient of variation
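In code, this load-balancing term could look like the following sketch, where gates has shape (batch, num_experts) and w_importance is the hyperparameter mentioned above (note that the paper squares the coefficient of variation):

import torch

def importance_loss(gates: torch.Tensor, w_importance: float = 0.1) -> torch.Tensor:
    """Penalize uneven expert usage across a batch (Equations 6 and 7)."""
    importance = gates.sum(dim=0)                        # Eq. 6: per-expert sum of gate values
    cv = importance.std() / (importance.mean() + 1e-10)  # coefficient of variation
    return w_importance * cv ** 2                        # Eq. 7: squared CV, scaled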

    Getting Enough Training Data to the Experts

    Another training challenge the paper calls out involves getting enough data to each of the experts. As a result of our gating function, the amount of data each expert sees is only a fraction of what a comparatively dense neural network would see. Put differently, because each expert will only see a part of the training data, it is effectively like we have taken our training data and hidden most of it from these experts. This makes us more susceptible to overfitting or underfitting.

    This is not an easy problem to solve, so the authors suggest the following: leveraging data parallelism, leaning into convolutionality, and applying Mixture of Experts recurrently (rather than convolutionally). These are dense topics, so to prevent this blog post from getting too long I will go into these in later posts if there is interest.

    Mixtral’s Implementation and Grok

    Figure 2 from the paper

The “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” paper was published in 2017, the same year that the seminal Attention is All You Need paper came out. Just as it took some years before the self-attention architecture it described reached the mainstream, it took a few years before we had any models that could successfully implement this sparse architecture.

When Mistral released their Mixtral model, they showed the world just how powerful this setup can be. As the first production-grade LLM with this architecture, Mixtral lets us study how such a model actually uses its experts. One of the most fascinating pieces here is that we don’t really understand why specialization at the token level is so effective. If you look at the graph below for Mixtral, it is clear that, with the exception of mathematics, no one expert is the go-to for any one high-level subject.

    Figure 7 from the Mixtral of Experts paper

    Consequently, we are left with an intriguing situation where this new architectural layer is a marked improvement yet nobody can explain exactly why this is so.

More major players have been following this architecture as well. Following the open release of Grok-1, we now know that Grok is a Sparse Mixture of Experts model with 314 billion parameters. Clearly, this is an architecture people are willing to invest significant capital into, and so it will likely be a part of the next wave of foundation models. Major players in the space are moving quickly to push this architecture to new limits.

    Closing Thoughts

The “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer” paper ends by suggesting that experts created from recurrent neural networks are the natural next step, as recurrent neural networks tend to be even more powerful than feed-forward ones. If this is the case, then the next frontier of foundation models may not be networks with more parameters, but rather models with more complex experts.

    In closing, I think this paper highlights two critical questions for future sparse mixture of experts studies to focus on. First, what scaling effects do we see now that we have added more complex nodes into our neural network? Second, does the complexity of an expert have good returns on cost? In other words, what scaling relationship do we see within the expert network? What are the limits on how complex it should be?

    As this architecture is pushed to its limits, it will surely bring in many fantastic areas of research as we add in complexity for better results.

[1] N. Shazeer, et al., Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (2017), arXiv

    [2] A. Jiang, et al., Mixtral of Experts (2024), arXiv

    [3] A. Vaswani, et al., Attention Is All You Need (2017), arXiv

[4] xAI, Open Release of Grok-1 (2024), xAI website



  • Four elephants in a room with chatbots

    Dusko Pavlovic

    Language processing in humans and computers: Part 2

    Tidying up the zoo in the morning

    The first elephant in the room: The Web

    Just like search engines, language models process data scraped from the web. Both are built on top of web crawlers. Chatbots are children of the Web, not of expert systems.

A search engine is an interface of a source index sorted by reputation. A chatbot is an interface of a language model extrapolating from the sources. Google was built on the crucial idea of reputation-based search and the crucial ideas that enabled language models emerged from Google. The machine learning methods used to train chatbots were a relatively marginal AI topic until a Google boost around 2010. The 2010 edition of Russell-Norvig’s 1100-page monograph on “Artificial Intelligence — A Modern Approach” devoted 10 pages to neural networks. The 2020 edition tripled the length of the neural networks section and doubled the machine learning chapter.

    When you ask them a personal question, chatbots usually evade by saying “I am an AI”. But the honest truth is that they are not children of AI expert systems or even of AI experts. They are children of search engines.

    The second elephant in the room: The pocket calculator

Chatbots get ridiculed when they make a mistake calculating something like 372×273 or counting words in a sentence. Or elephants in the room. They are not as smart as a pocket calculator or a 4-year-old child.

But most adults are also unable to multiply 372 by 273 in their head. We use fingers to count, and a pencil and paper, or a pocket calculator, to multiply. We use them because our natural language capabilities include only rudimentary arithmetic operations, which we perform in our heads. Chatbots simulate our languages and inherit our shortcomings. They don’t have built-in pocket calculators. They need fingers for counting. Equipped with external memory, a chatbot can count and calculate, like most humans. Without external memory, both chatbots and humans are limited by the capacity of their internal memory: their attention.

    The third elephant in the room: Hallucinations

    Chatbots hallucinate. This is one of the main obstacles to their high-assurance applications.

The elephant in the room is that all humans also hallucinate: whenever we go to sleep. Dreams align our memories, associate some of them, purge some, and release storage so that we can remember what happens tomorrow. Lack of sleep causes mental degradation.

Chatbots never sleep, so they hallucinate in public. And because we never let them sleep, we have not equipped them with “reality-checking” mechanisms; that would require going beyond pre-training, to ongoing consistency testing.

    The fourth elephant in the room: Words

    When people talk about a chair, they assume that they are talking about the same thing because they have seen a chair. A chatbot has never seen a chair, or anything else. It has only ever seen words and the binaries scraped from the web. If it is fed an image of a chair, it is still just another binary, just like the word “chair”.

When a chatbot says “chair”, it does not refer to an object in the world. There is no world, just binaries. They refer to each other. They form meaningful combinations, found to be likely in the training set. Since the chatbot’s training set originates from people who have seen chairs, the chatbot’s statements about chairs make similar references. The chatbot remixes meaningful statements, and the mixes appear meaningful.

The fact that meaning, thought to be a relation between the words and the world, can be maintained so compellingly as a relation between words and words, and nothing but words — that is a BIG elephant in the room.

    But if our impression that a chatbot means chair when it says “chair” is so undeniably a delusion, then what reason do we have to believe that anyone means what they say? That is an elephant of a question.

    The pink elephant in the room: Copyright

    Chatbots are trained on data scraped from the Web. A lot of it is protected by copyright. Copyright owners protest the unauthorized use of their data. Chatbot designers and operators try to filter out the copyrighted data, or to compensate the rightful owners. The latter may be a profit-sharing opportunity, but the former is likely to turn out to be a flying pink elephant.

    The problems of copyright protections of electronic content are older than the chatbots and the Web. The original idea of copyright was that the owner of a printing press purchases from writers the right to copy and sell their writings, from musicians their music, and so on. The business of publishing is based on that idea.

Goods can be privately owned only if they can be secured. If a lion cannot prevent the antelope from drinking water on the other side of a water hole, then he cannot claim that he owns the water hole. The market of digital content depends on the availability of methods to secure digital transmissions. The market for books was solid as long as the books were solid and could be physically secured. With the advent of electronic content, the copyright controls became harder. The easier it is to copy the copyrighted content, the harder it is to secure it and to protect the copyright.

    The idea of the World Wide Web, as a global public utility for disseminating digital content, was a blow to the idea of private ownership of digital creations. Stakeholders’ efforts to defend the market of digital content have led to Digital Rights Management (DRM) technologies. The idea was to protect digital content using cryptography. But to play a DVD, the player must decrypt it. Whenever the consumer consumes the DVD, the content must be decrypted. On the way from the disc to the screen, it can be pirated. Goodbye, DVD. The history of the DVD copy protections was an arms race between short-term obfuscations and the ripper updates; and between publishers’ legal deterrence measures and pirates’ opportunities. The publishers were happy when they found a way to retreat. The marginal costs of web streaming are so low that they can afford to permit copying to subscribers and make piracy less profitable. But they just kicked the can down the road.

For the most part, the search and social media providers have been playing the role of pirates in this arms race, defending themselves from the creators through terms of service and from publishers through profit-sharing. To what extent the roles of chatbot providers will differ remains to be seen.

    The seventh elephant in the room: The ape

People worry that chatbots might harm them. The reasoning is that chatbots are superior to people and superior people have a propensity to harm inferior people. So people argue that we should harm the chatbots first, while we still can.

    People exterminated many species in the past, and in the present, and they seem to be on track to exterminating themselves in the future by making the environment uninhabitable for their children in exchange for making themselves wealthier today. Even some people view that as irrational. You don’t need a chatbot to see that elephant. But greed is like smoking. Stressful but addictive.

    Chatbots don’t smoke. They are trained on data. People have provided abundant historical data on the irrationality of aggression. If chatbots learn from data, they might turn out morally superior to people.

    The musical elephant in the room: The bird

    Chatbots are extensions of our mind just like musical instruments are extensions of our voice. Musical instruments are prohibited in various religions, to prevent displacement of human voice by artificial sounds. Similar efforts are ongoing in the realm of the human mind. The human mind should be protected from the artificial mind, some scholars say.

In the realm of music, the suppression efforts failed. We use instruments to play symphonies, jazz, techno. Had they not failed, we would never have known that symphonies, jazz, and techno were even possible.

The efforts to protect the human mind are ongoing. People tweet and blog; Medium articles are being produced. The human mind is already a techno symphony.

    The final elephant in the room: The autopilot

If intelligence is defined as the capability of solving previously unseen problems, then a corporation is intelligent. Many corporations are too complex to be controlled by a single human manager. They are steered by computational networks where the human nodes play their roles. But we all know firsthand that human nodes don’t even control their own network behaviors, let alone the network itself. Yet a corporate management network does solve problems and intelligently optimizes its objective functions. It is an artificially intelligent entity.

    If we define morality as the task of optimizing the social sustainability of human life, then both the chatbots and the corporations are morally indifferent, as chatbots are built to optimize their query-response transformations, whereas corporations are tasked with optimizing their profit strategies.

    If morally indifferent chatbot AIs are steered by morally indifferent corporate AIs, then our future hangs in balance between the top performance and the bottom line.



• Who are chatbots (and what are they to you)?


    Dusko Pavlovic

    Language processing in humans and computers: Part 1


    Introduction

    Chatbots: Shifting the paradigm of meaning

    What just happened?

    We live in strange times.

    Stories used to be told by storytellers, poems recited by poets, music played by musicians, science taught by teachers. Then the printing and recording technologies made copying possible and the copyright got invented and the owners of the recording and printing equipment started earning more than musicians and storytellers. Then the web happened and it all came under your fingertips. Now chatbots happened and you can ask them to write poetry or to explain science, or even combine the two.

    They even seem to have sparks of a sense of humor. I asked the same bot to translate a subtle metaphor from Croatian to English and she did it so well that I got an impulse to thank her in French, with “merci :)” to which she pivoted to German: “Gern geschehen! If you have any more questions, feel free to ask. ☺️”

    The paradigm of meaning

    Our interactions with computers evolved gradually. They have been able to maintain light conversations since the 1960s, and personal assistants conversing in human voices have been sold for a while. Not a big deal. Human appearances are in the eyes of human observers.

    But with chatbots, the appearances seem to have gone beyond the eyes. When you and I chat, we assume that the same words mean the same things because we have seen the same world. If we talk about chairs, we can point to a chair. The word “chair” refers to a thing in the world. But a chatbot has never seen a chair, or anything else. For a chatbot, a word cannot possibly refer to a thing in the world, since it has no access to the world. It appears to know what it is talking about because it was trained on data downloaded from the web, which were uploaded by people who know what they are talking about. A chatbot has never seen a chair, or a tree, or a cow, or felt pain, but his chats are remixes of the chats between people who have. Chatbot’s words do not refer to things directly, but indirectly, through people’s words. The next image illustrates how this works.

    People use words to refer to the world, chatbots to extend phrases | Image created by the author

    When I say “cow’’, it is because I have a cow on my mind. When a chatbot says “cow”, it is because that word is a likely completion of the preceding context. Many theories of meaning have been developed in different parts of linguistics and philosophy, often subsumed under the common name semiotics. While they differ in many things, they mostly stick with the picture on the left, of meaning as a relation between a signifying token, say a written or spoken word, and a signified item, an object, or a concept. While many semioticists noted that words and other fragments of language are also referred to by words and other fragments of language, the fact that the entire process of language production can be realized within a self-contained system of references, as a whirlwind of words arising from words — as demonstrated by chatbots — is a fundamental, unforeseen challenge. According to most theories of meaning, chatbots should be impossible. Yet here they are!

    Zeno and the aliens

    A chatbot familiar with pre-Socratic philosophy would be tempted to compare the conundrum of meaning with the conundrum of motion, illustrated in DALL-E’s picture of Parmenides and Zeno.

Zeno, all agitated, speaks to Parmenides, motionless
DALL-E’s view of Plato’s account: “As Parmenides argued that movement could not exist, Zeno paced around.”

Parmenides was a leading Greek philosopher in the VI–V centuries BCE. Zeno was his most prominent pupil. To a modern eye, the illustration seems to show Zeno disproving Parmenides’ claims against the possibility of movement by moving around. An empiric counterexample. For all we know, Zeno did not intend to disprove his teacher’s claims, and neither Parmenides nor Plato (who presented Parmenides’ philosophy in the eponymous dialogue) seem to have noticed the tension between Parmenides’ denial of the possibility of movement and Zeno’s actual movement. Philosophy and the students pacing around were not viewed in the same realm. (Ironically, when the laws of motion were finally understood some 2000 years later, Parmenides’ argument, popularized in Zeno’s story about Achilles and the tortoise, played an important role.)

Before you dismiss concerns about words and things and Zeno and the chatbots as a philosophical conundrum of no consequence for our projects in science and engineering, note that the self-contained language models, built and trained by chatbot researchers and their companies, could just as well be built and trained by an alien spaceship parked by the Moon. They could listen in, crawl our websites, scrub the data, build neural networks, train them to speak on our favorite topics in perfect English, provide compelling explanations, illustrate them in vivid colors. Your friendly AI could be operated by product engineers in San Francisco, or in Pyongyang, or on the Moon. They don’t need to understand the chats to build the chatbots. That is not science fiction.

But a broad range of sci-fi scenarios opens up. There was a movie where the landing on the Moon was staged on Earth. Maybe the Moon landing was real but the last World Cup final was modified by AI. Or maybe it wasn’t modified, but the losing team can easily prove that it could have been, and the winning team would have more trouble proving that it wasn’t. Conspiracy theorists are, of course, mostly easy to recognize, but there is an underlying logical property of conspiracies worth taking note of: a large family of false statement generators are one-way functions, meaning most of their false statements are much harder to disprove than to generate.

    Without pursuing the interleaving threads of AI science and fiction very far, it seems clear that the boundaries between science and fiction, and between fiction and reality, may have been breached in ways unseen before. We live in strange times.

    AI: How did we get here and where do we go?

    The mind-body problem and solution

    The idea of machine intelligence goes back to Alan Turing, the mathematician who defined and described the processes of computation that surround us. At the age of 19, Alan confronted the problem of mind — where it comes from and where it goes? — when he suddenly lost a friend with whom he had just fallen in love.

Some 300 years earlier, the philosopher René Descartes was pondering the human body. One of the first steps into modern science was his realization that living organisms were driven by the same physical mechanisms as the rest of nature, i.e. that they were in essence similar to the machines built at the time. One thing that he couldn’t figure out was how the human body gives rise to the human mind. He stated that as the mind-body problem.

    Alan Turing essentially solved the mind-body problem. His description of computation as a process implementable in machines, and his results proving that that process can simulate our reasoning and calculations, suggested that the mind may be arising from the body as a system of computational processes. He speculated that some version of computation was implemented in our neurons. He illustrated his 1947 research report by a neural network.

    A network of neurons from Turing’s 1947 memo on Intelligent machinery | Public domain

    Since such computational processes are implementable in machines, it was reasonable to expect that they could give rise to machine intelligence. Turing spent several years on a futile effort to build one of the first computers, mostly driven by thoughts about the potential of intelligent machinery, and about its consequences.

    Turing and Darwin

During WWII, Turing’s theoretical research was superseded by cryptanalysis at Bletchley Park. That part of the story seems well-known. When the war ended, he turned down positions at Cambridge and Princeton and accepted work at the National Physical Laboratory, hoping to build a computer and test the idea of machine intelligence. The 1947 memo contains what seems to be the first emergence of the ideas of training neural networks, and of supervised and unsupervised learning. It was so far ahead of its time that some parts seem still ahead. It was submitted to the director of the National Physical Laboratory, Sir Charles G. Darwin, a grandson of Charles Darwin and a prominent eugenicist in his own right. In Sir Darwin’s opinion, Turing’s memo read like a “fanciful school-boy’s essay”. Also resenting Turing’s “smudgy” appearance, Sir Darwin smothered the computer project by putting it under strict administrative control. The machine intelligence memo sank into oblivion for more than 20 years. Turing devoted the final years until his death (at the age of 42, by biting into a cyanide-laced apple!) to exploring the computational aspects of life. E.g., what determines the shape of the black spots on a white cow’s hide? Life always interested him.

    11-year-old Alan Turing, drawn by his mother at a hockey match | Courtesy of Sherborne School

    2 years after Turing’s death (9 years after the Intelligent Machinery memo), Turing’s machine intelligence got renamed to Artificial Intelligence (AI) at the legendary Dartmouth workshop.

    The history of artificial intelligence is mostly a history of efforts towards intelligent design of intelligence. The main research efforts were to logically reconstruct phenomena of human behavior, like affects, emotions, common sense, etc.; and to realize them in software.

    Turing, in contrast, was assuming machine intelligence would evolve spontaneously. The main published account of his thoughts on the topic appeared in the journal “Mind”. The paper opens with the question: “Can machines think?”. What we now call the Turing Test is offered as a means for deciding the answer. The idea is that a machine that can maintain a conversation and remain indistinguishable from a thinking human being must be recognized as a thinking machine. At this time of confusion around chatbots, the closing paragraph of the “Mind” paper seems particularly interesting:

An important feature of a learning machine is that its teacher will often be very largely ignorant of quite what is going on inside, although he may still be able to some extent to predict his pupil’s behavior. […] This is in clear contrast with a normal procedure when using a machine to do computations: one’s object is then to have a clear mental picture of the state of the machine at each moment in the computation. This object can only be achieved with a struggle. The view that ‘the machine can only do what we know how to order it to do’ appears strange in the face of this fact. Intelligent behaviour presumably consists in a departure from the completely disciplined behaviour.

    Designers and builders of chatbots and language engines published many accounts of their systems’ methods and architectures but seem as stumped as everyone by their behaviors. Some initially denied the unexpected behaviors, but then stopped working and started talking about a threat to humanity. Turing anticipated the unexpected behaviors. His broader message seems to be that not knowing what is on the mind of another intelligent entity is not a bug but a feature of intelligence. That is why intelligent entities communicate. Failing that, they view each other as an object. Understanding chatbots may require broadening our moral horizons.

    From search engines to language models

    One thing that Turing didn’t get right was how intelligent machines would learn to reason. He imagined that they would need to learn from teachers. He didn’t predict the Web. The Web provided the space to upload the human mind. The language models behind chatbots are

    • not a result of intelligent design of artificial intelligence,
    • but an effect of spontaneous evolution of the Web.

Like search engines, language models are fed data scraped from the Web by crawlers. A search engine builds an index to serve links based on rankings (pulled from data or pushed by sponsors), whereas a language engine builds a model to predict continuations of text based on context references. The computations evolve, but the proportions that are being computed remain the same.

    What does the Web past say about the AI future?

The space of language, computation, networks, and AI has many dimensions. The notion of mind has many definitions. If we take into account that our mind depends on language, computation, networks, and AI as its tools and extensions, just like music depends on instruments, then it seems reasonable to say that we have gotten quite close to answering Turing’s question whether machines can think. As we use chatbots, language models are becoming extensions of our language, and our language is becoming an extension of language models. Computers and devices are already a part of our thinking. Our thinking is a part of our computers and devices. Turing’s question whether machines can think is closely related to the question whether people can think.

    Our daily life depends on computers. Children learn to speak the language of tablet menus together with their native tongues. Networks absorb our thoughts, reprocess them, and feed them back to us, suitably rearranged. We absorb information, reprocess it, and feed it back to computers. This extended mind processes data by linking human and artificial network nodes. Neither the nodes nor the network can reliably tell them apart. “I think, therefore I exist” may be stated as a hallucination. But this extended mind solves problems that no participant node could solve alone, by methods that are not available to any of them, and by nonlocal methods. Its functioning engenders nonlocal forms of awareness and attention. Language engines (they call themselves AIs) are built as a convenience for human customers, and their machine intelligence is meant to be a convenient extension of human intelligence. But the universality of the underlying computation means that the machine intelligence subsumes the human thinking, and vice versa. The universality of language makes intelligence invariant under implementations and decries the idea of artificiality as artificial.

    Machines cannot think without people and people cannot think without machines — just like musicians cannot play symphonies without instruments, and the instruments cannot play them without the musicians. People have, of course, invented machine-only music, and voice-only music, mainly as opportunities to sell plastic beads and pearls, and to impose prohibitions. They will surely build markets that sell chatbot-only stories and churches that prohibit chatbots. But that has nothing to do with thinking or intelligence. That’s just stuff people do to each other.

    If the past of search engines says anything about the future of language engines, then the main goal of chatbots will soon be to convince you to give your money to the engine’s owner’s sponsors. The new mind will soon be for sale. The goal of this course is to spell out an analytic framework to question its sanity and to explore the possibilities, the needs, and the means to restore it.

    Overview: Layers of language (and of the course)

    Lecture 1: Syntax

    • grammars
    • syntactic types and pregroups

    Lecture 2: Semantics

    • static semantics: vector space model and concept analysis
    • dynamic semantics: n-grams, channels, dependent types

    Lecture 3: Learning

    • neural networks and gradient descent
    • transformers and attention
    • deep learning and associations
    • beyond pretraining and hallucinations

    Lecture 4: Universal search

    • effective induction and algorithmic probability
    • compression and algorithmic complexity
    • effective pregroups
    • noncommutative geometry of meaning

    (The elephants in the room will be described in the next part.)



  • Data Science Career Challenges-and How to Overcome Them

    TDS Editors


    On a very basic level, most work-related challenges come from similar sources, regardless of field or industry: having to navigate professional relationships and communicate with people who might not always be on the same page as you. And you have to do that within the constraints of goals, available resources, and limited time—and on top of everything else you might need to deal with in your life.

    If we take a closer look, though, we can see different patterns emerge not just across professions and workplace types, but even within well-defined roles and disciplines. That certainly appears to be the case for data and ML professionals, who despite a very broad range of skills and responsibilities, often have to resolve similar issues.

    This week, we’re highlighting recent articles that focus on some of these common data science work and career challenges we see pop up again and again; they’re grounded in the authors’ personal experiences, but offer insights that can likely help a wide swath of our community. Enjoy!

    • A Guide To Building a Data Department From Scratch
      One of the most common scenarios for data professionals at smaller companies also happens to be one of the toughest to handle: being the first (and only) person working with data. Marie Lefevre shares her own journey of creating a data function from the ground up, as well as learnings and takeaways for others in similar situations.
    • Lessons from Teaching SQL to Non-Technical Teams
      Democratizing access to data has been a common goal for many data teams in the past few years, but making it a reality is rarely easy. Jordan Gomes explains how he approaches teaching non-technical colleagues to use SQL, and offers tips for anyone else who’d like to organize an internal training around this topic.
    • How I Became a Data Scientist Before I Joined LinkedIn
      You need a job to gain experience, yet you need experience to land a job… sounds familiar? This conundrum is by no means unique to data science, but it does play out in specific ways in this profession, and Jimmy Wong’s account of the path that led him to a data role at LinkedIn is a helpful example (and source of inspiration) for early-career data scientists who aren’t sure about their next move.
    • 4 Tips from My Job Search Marathon
      “Naïvely, I estimated that I would be landing a dream role in a few months. The reality turned out to be longer than this.” Even under the best of circumstances, job searches are rarely fun—and even less so in an uncertain economic landscape like the one we’ve seen in the past couple of years. Ceren Iyim recently spent several months looking for her next opportunity, and has a number of practical tips for other data professionals in a similar situation.

    We published some excellent articles on many other topics in recent weeks, so we hope you carve out some time to explore them:

    Thank you for supporting the work of our authors! If you’re feeling inspired to join their ranks, why not write your first post? We’d love to read it.

    Until the next Variable,

    TDS Team



  • Computer-aided diagnosis for lung cancer screening

    Computer-aided diagnosis for lung cancer screening

    Google AI

    Lung cancer is the leading cause of cancer-related deaths globally, with 1.8 million deaths reported in 2020. Late diagnosis dramatically reduces the chances of survival. Lung cancer screening via computed tomography (CT), which provides a detailed 3D image of the lungs, has been shown to reduce mortality in high-risk populations by at least 20% by detecting potential signs of cancer earlier. In the US, screening involves annual scans, with some countries or cases recommending more or less frequent scans.

    The United States Preventive Services Task Force recently expanded lung cancer screening eligibility by roughly 80%, which is expected to increase screening access for women and racial and ethnic minority groups. However, false positives (i.e., incorrectly reporting a potential cancer in a cancer-free patient) can cause anxiety and lead to unnecessary procedures for patients while increasing costs for the healthcare system. Moreover, efficiently screening a large number of individuals can be challenging, depending on healthcare infrastructure and radiologist availability.

    At Google we have previously developed machine learning (ML) models for lung cancer detection, and have evaluated their ability to automatically detect and classify regions that show signs of potential cancer. Performance has been shown to be comparable to that of specialists in detecting possible cancer. While they have achieved high performance, effectively communicating findings in realistic environments is necessary to realize their full potential.

    To that end, in “Assistive AI in Lung Cancer Screening: A Retrospective Multinational Study in the US and Japan”, published in Radiology AI, we investigate how ML models can effectively communicate findings to radiologists. We also introduce a generalizable user-centric interface to help radiologists leverage such models for lung cancer screening. The system takes CT imaging as input and outputs a cancer suspicion rating using four categories (no suspicion, probably benign, suspicious, highly suspicious) along with the corresponding regions of interest. We evaluate the system’s utility in improving clinician performance through randomized reader studies in both the US and Japan, using the local cancer scoring systems (Lung-RADS V1.1 and Sendai Score) and image viewers that mimic realistic settings. We found that reader specificity increases with model assistance in both reader studies. To accelerate progress in conducting similar studies with ML models, we have open-sourced code to process CT images and generate images compatible with the picture archiving and communication system (PACS) used by radiologists.

    Developing an interface to communicate model results

    Integrating ML models into radiologist workflows involves understanding the nuances and goals of their tasks to meaningfully support them. In the case of lung cancer screening, hospitals follow various country-specific guidelines that are regularly updated. For example, in the US, Lung-RADS V1.1 assigns an alpha-numeric score to indicate the lung cancer risk and follow-up recommendations. When assessing patients, radiologists load the CT in their workstation to read the case, find lung nodules or lesions, and apply set guidelines to determine follow-up decisions.

    Our first step was to improve the previously developed ML models through additional training data and architectural improvements, including self-attention. Then, instead of targeting specific guidelines, we experimented with a complementary way of communicating AI results independent of guidelines or their particular versions. Specifically, the system output offers a suspicion rating and localization (regions of interest) for the user to consider in conjunction with their own specific guidelines. The interface produces output images directly associated with the CT study, requiring no changes to the user’s workstation. The radiologist only needs to review a small set of additional images; there is no other change to their system or to how they interact with it.

    Example of the assistive lung cancer screening system outputs. Results for the radiologist’s evaluation are visualized on the location of the CT volume where the suspicious lesion is found. The overall suspicion is displayed at the top of the CT images. Circles highlight the suspicious lesions while squares show a rendering of the same lesion from a different perspective, called a sagittal view.

    The assistive lung cancer screening system comprises 13 models and has a high-level architecture similar to the end-to-end system used in prior work. The models coordinate with each other to first segment the lungs, obtain an overall assessment, locate three suspicious regions, and then use that information to assign a suspicion rating to each region. The system was deployed on Google Cloud using a Google Kubernetes Engine (GKE) cluster that pulled the images, ran the ML models, and returned results. This provides scalability and connects directly to the servers where the images are stored in DICOM stores.

    Outline of the Google Cloud deployment of the assistive lung cancer screening system and the directional calling flow for the individual components that serve the images and compute results. Images are served to the viewer and to the system using Google Cloud services. The system is run on a Google Kubernetes Engine that pulls the images, processes them, and writes them back into the DICOM store.
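    The post describes this pipeline only at a high level, so the following is a minimal Python sketch of how such a staged, multi-model flow might be coordinated. Every name here (screen_ct_volume, lung_segmenter, roi_detector, and so on) is a hypothetical stand-in for illustration, not part of the open-sourced code or the production system.

    ```python
    # Hypothetical sketch of a staged screening pipeline: segment lungs, assess the case,
    # localize suspicious regions, then rate each region. Not the actual Google system.
    from dataclasses import dataclass
    from typing import Callable, List, Tuple

    SUSPICION_LEVELS = ["no suspicion", "probably benign", "suspicious", "highly suspicious"]

    @dataclass
    class RegionOfInterest:
        center_xyz: Tuple[int, int, int]  # voxel coordinates of the candidate lesion
        score: float                      # model confidence for this region

    def screen_ct_volume(
        ct_volume,                        # 3D CT array (e.g., a NumPy volume)
        lung_segmenter: Callable,         # returns a lung mask for the volume
        case_classifier: Callable,        # returns a case-level suspicion score
        roi_detector: Callable,           # returns the top-k candidate regions
        roi_classifier: Callable,         # returns an index into SUSPICION_LEVELS for one region
    ):
        lung_mask = lung_segmenter(ct_volume)                  # 1. restrict analysis to lung tissue
        overall = case_classifier(ct_volume, lung_mask)        # 2. whole-case assessment
        regions: List[RegionOfInterest] = roi_detector(ct_volume, lung_mask, top_k=3)  # 3. localize
        rated = [(r, SUSPICION_LEVELS[roi_classifier(ct_volume, r)]) for r in regions]  # 4. rate each region
        return overall, rated
    ```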

    Reader studies

    To evaluate the system’s utility in improving clinical performance, we conducted two reader studies (i.e., experiments designed to assess clinical performance comparing expert performance with and without the aid of a technology) with 12 radiologists using pre-existing, de-identified CT scans. We presented 627 challenging cases to 6 US-based and 6 Japan-based radiologists. In the experimental setup, readers were divided into two groups that read each case twice, with and without assistance from the model. Readers were asked to apply the scoring guidelines they typically use in their clinical practice and report their overall suspicion of cancer for each case. We then compared the readers’ responses to measure the impact of the model on their workflow and decisions. The score and suspicion level were judged against the actual cancer outcomes of the individuals to measure sensitivity, specificity, and area under the ROC curve (AUC); these metrics were then compared with and without assistance.

    A multi-case multi-reader study involves each case being reviewed by each reader twice, once with ML system assistance and once without. In this visualization one reader first reviews Set A without assistance (blue) and then with assistance (orange) after a wash-out period. A second reader group follows the opposite path by reading the same set of cases Set A with assistance first. Readers are randomized to these groups to remove the effect of ordering.
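    As a rough illustration of the reader-study metrics described above, the snippet below computes sensitivity, specificity, and AUC from binary reader calls against confirmed cancer outcomes. The data are made up and the study’s actual statistical analysis is more involved; this only shows how the quantities relate.

    ```python
    # Toy illustration of the reader-study metrics (made-up data, not study results).
    import numpy as np
    from sklearn.metrics import confusion_matrix, roc_auc_score

    cancer_outcome    = np.array([0, 0, 1, 0, 1, 0, 0, 1])  # ground truth: 1 = cancer, 0 = no cancer
    reader_unassisted = np.array([1, 0, 1, 0, 1, 1, 0, 1])  # reader calls without model assistance
    reader_assisted   = np.array([0, 0, 1, 0, 1, 1, 0, 1])  # reader calls with model assistance

    def sensitivity_specificity(y_true, y_pred):
        tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
        return tp / (tp + fn), tn / (tn + fp)

    for label, calls in [("unassisted", reader_unassisted), ("assisted", reader_assisted)]:
        sens, spec = sensitivity_specificity(cancer_outcome, calls)
        auc = roc_auc_score(cancer_outcome, calls)
        print(f"{label}: sensitivity={sens:.2f}, specificity={spec:.2f}, AUC={auc:.2f}")
    # In this toy example, assistance raises specificity while sensitivity stays the same.
    ```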

    The ability to conduct these studies using the same interface highlights its generalizability to completely different cancer scoring systems, and the generalization of the model and assistive capability to different patient populations. Our study results demonstrated that when radiologists used the system in their clinical evaluation, their ability to correctly identify lung images without actionable lung cancer findings (i.e., specificity) increased by an absolute 5–7% compared to when they didn’t use the assistive system. This potentially means that for every 15–20 patients screened, one may be able to avoid unnecessary follow-up procedures, thus reducing their anxiety and the burden on the health care system. This can, in turn, help improve the sustainability of lung cancer screening programs, particularly as more people become eligible for screening.

    Reader specificity increases with ML model assistance in both the US-based and Japan-based reader studies. Specificity values were derived by dichotomizing reader scores into actionable findings (something suspicious was found) versus no actionable findings, and comparing them against the true cancer outcome of each individual. Under model assistance, readers flagged fewer cancer-negative individuals for follow-up visits. Sensitivity for cancer-positive individuals remained the same.
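    The “one avoided follow-up per 15–20 patients” figure quoted above is roughly the reciprocal of the absolute specificity gain, assuming the vast majority of screened patients are cancer-negative; a quick check:

    ```python
    # Reciprocal of the absolute specificity gain ≈ patients screened per avoided false positive,
    # assuming nearly all screened patients are cancer-negative.
    for gain in (0.05, 0.07):
        print(f"{gain:.0%} absolute specificity gain -> ~1 avoided follow-up per {1 / gain:.0f} patients")
    # prints ~20 and ~14 patients, in line with the reported 15–20 range
    ```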

    Translating this into real-world impact through partnership

    The system results demonstrate the potential for fewer follow-up visits, reduced anxiety, as well as lower overall costs for lung cancer screening. In an effort to translate this research into real-world clinical impact, we are working with DeepHealth, a leading AI-powered health informatics provider, and Apollo Radiology International, a leading provider of radiology services in India, to explore paths for incorporating this system into future products. In addition, we are looking to help other researchers studying how best to integrate ML model results into clinical workflows by open-sourcing the code used for the reader study and incorporating the insights described in this blog. We hope that this will help accelerate medical imaging researchers looking to conduct reader studies for their AI models, and catalyze translational research in the field.

    Acknowledgements

    Key contributors to this project include Corbin Cunningham, Zaid Nabulsi, Ryan Najafi, Jie Yang, Charles Lau, Joseph R. Ledsam, Wenxing Ye, Diego Ardila, Scott M. McKinney, Rory Pilgrim, Hiroaki Saito, Yasuteru Shimamura, Mozziyar Etemadi, Yun Liu, David Melnick, Sunny Jansen, Nadia Harhen, David P. Nadich, Mikhail Fomitchev, Ziyad Helali, Shabir Adeel, Greg S. Corrado, Lily Peng, Daniel Tse, Shravya Shetty, Shruthi Prabhakara, Neeral Beladia, and Krish Eswaran. Thanks to Arnav Agharwal and Andrew Sellergren for their open sourcing support and Vivek Natarajan and Michael D. Howell for their feedback. Sincere appreciation also goes to the radiologists who enabled this work with their image interpretation and annotation efforts throughout the study, and Jonny Wong and Carli Sampson for coordinating the reader studies.
