This devious malware is targeting Facebook accounts to steal credit card data
Python NodeStealer now targets more than Facebook Business accounts. -
Is shared hosting really any good?
Could shared hosting be the right choice for your business? Check out our guide to what it is and whether it's a good fit for your online portfolio. -
Top benefits of managed VPS hosting
VPS hosting offers the flexibility of a dedicated server without the high costs. -
Google's AI-powered bug hunting tool finds a host of concerning open source security flaws
Among the bugs was a flaw in OpenSSL that could leave users vulnerable. -
ChatGPT: Two Years Later
Tracing the impact of the generative AI revolution
(Photo by vackground.com on Unsplash)
Happy birthday, Mr. Chatbot
This November 30 marks the second anniversary of ChatGPT’s launch, an event that sent shockwaves through technology, society, and the economy. The space opened by this milestone has not always made it easy — or perhaps even possible — to separate reality from expectations. For example, this year Nvidia became the most valuable public company in the world during a stunning bullish rally. The company, which manufactures hardware used by models like ChatGPT, is now worth seven times what it was two years ago. The obvious question for everyone is: Is it really worth that much, or are we in the midst of collective delusion? This question — and not its eventual answer — defines the current moment.
AI is making waves not just in the stock market. Last month, for the first time in history, prominent figures in artificial intelligence were awarded the Nobel Prizes in Physics and Chemistry. John J. Hopfield and Geoffrey E. Hinton received the Physics Nobel for their foundational contributions to neural network development.
In Chemistry, Demis Hassabis, John Jumper, and David Baker were recognized for advances in protein structure prediction (AlphaFold) and computational protein design using artificial intelligence. These awards generated surprise on one hand and understandable disappointment among traditional scientists on the other, as computational methods took center stage.
ChatGPT was launched on November 30, 2022 (Photo by Rolf van Root on Unsplash)
In this context, I aim to review what has happened since that November, reflecting on the tangible and potential impact of generative AI to date, considering which promises have been fulfilled, which remain in the running, and which seem to have fallen by the wayside.
D-Day
Let’s begin by recalling the day of the launch. ChatGPT 3.5 was a chatbot far superior to anything previously known in terms of discourse and intelligence capabilities. The difference between what was possible at the time and what ChatGPT could do generated enormous fascination and the product went viral rapidly: it reached 100 million users in just two months, far surpassing many applications considered viral (TikTok, Instagram, Pinterest, Spotify, etc.). It also entered mass media and public debate: AI landed in the mainstream, and suddenly everyone was talking about ChatGPT. To top it off, just a few months later, OpenAI launched GPT-4, a model vastly superior to 3.5 in intelligence and also capable of understanding images.
The situation sparked debates about the many possibilities and problems inherent to this specific technology, including copyright, misinformation, productivity, and labor market issues. It also raised concerns about the medium- and long-term risks of advancing AI research, such as existential risk (the “Terminator” scenario), the end of work, and the potential for artificial consciousness. In this broad and passionate discussion, we heard a wide range of opinions. Over time, I believe the debate began to mature and temper. It took us a while to adapt to this product because ChatGPT’s advancement left us all somewhat offside. What has happened since then?
The day Goliath stumbled
As far as technology companies are concerned, these past two years have been a roller coaster. The appearance on the scene of OpenAI, with its futuristic advances and its CEO with a “startup” spirit and look, raised questions about Google’s technological leadership, which until then had been undisputed. Google, for its part, did everything it could to confirm these doubts, repeatedly humiliating itself in public. First came the embarrassment of Bard’s launch — the chatbot designed to compete with ChatGPT. In the demo video, the model made a factual error: when asked about the James Webb Space Telescope, it claimed it was the first telescope to photograph planets outside the solar system, which is false. This misstep caused Google’s stock to drop by 9% in the following week. Later, during the presentation of its new Gemini model — another competitor, this time to GPT-4 — Google lost credibility again when it was revealed that the incredible capabilities showcased in the demo (which could have placed it at the cutting edge of research) were, in reality, fabricated, based on much more limited capabilities.
The day Goliath stumbled (Photo by Shutter Speed on Unsplash)
Meanwhile, Microsoft — the archaic company of Bill Gates that produced the old Windows 95 and was as hated by young people as Google was loved — reappeared and allied with the small David, integrating ChatGPT into Bing and presenting itself as agile and defiant. "I want people to know we made them dance," said Satya Nadella, Microsoft's CEO, referring to Google. In 2023, Microsoft rejuvenated while Google aged.
This situation persisted, and OpenAI remained for some time the undisputed leader in both technical evaluations and subjective user feedback (known as "vibe checks"), with GPT-4 at the forefront. But over time this changed: just as GPT-4 had achieved unique leadership after its launch in early 2023, by mid-2024 its close successor (GPT-4o) was competing with others of its caliber: Google's Gemini 1.5 Pro, Anthropic's Claude 3.5 Sonnet, and xAI's Grok 2. What innovation gives, innovation takes away.
This scenario could be shifting again with OpenAI’s recent announcement of o1 in September 2024 and rumors of new launches in December. For now, however, regardless of how good o1 may be (we’ll talk about it shortly), it doesn’t seem to have caused the same seismic impact as ChatGPT or conveyed the same sense of an unbridgeable gap with the rest of the competitive landscape.
To round out the scene of hits, falls, and epic comebacks, we must talk about the open-source world. This new AI era began with two gut punches to the open-source community. First, OpenAI, despite what its name implies, was a pioneer in halting the public disclosure of fundamental technological advancements. Before OpenAI, the norms of artificial intelligence research — at least during the golden era before 2022 — entailed detailed publication of research findings. During that period, major corporations fostered a positive feedback loop with academia and published papers, something previously uncommon. Indeed, ChatGPT and the generative AI revolution as a whole are based on a 2017 paper from Google, the famous Attention Is All You Need, which introduced the Transformer neural network architecture. This architecture underpins all current language models and is the “T” in GPT. In a dramatic plot twist, OpenAI leveraged this public discovery by Google to gain an advantage and began pursuing closed-door research, with GPT-4’s launch marking the turning point between these two eras: OpenAI disclosed nothing about the inner workings of this advanced model. From that moment, many closed models, such as Gemini 1.5 Pro and Claude Sonnet, began to emerge, fundamentally shifting the research ecosystem for the worse.
The second blow to the open-source community was the sheer scale of the new models. Until GPT-2, a modest GPU was sufficient to train deep learning models. Starting with GPT-3, infrastructure costs skyrocketed, and training models became inaccessible to individuals or most institutions. Fundamental advancements fell into the hands of a few major players.
But after these blows, and with everyone anticipating a knockout, the open-source world fought back and proved itself capable of rising to the occasion. For everyone’s benefit, it had an unexpected champion. Mark Zuckerberg, the most hated reptilian android on the planet, made a radical change of image by positioning himself as the flagbearer of open source and freedom in the generative AI field. Meta, the conglomerate that controls much of the digital communication fabric of the West according to its own design and will, took on the task of bringing open source into the LLM era with its LLaMa model line. It’s definitely a bad time to be a moral absolutist. The LLaMa line began with timid open licenses and limited capabilities (although the community made significant efforts to believe otherwise). However, with the recent releases of LLaMa 3.1 and 3.2, the gap with private models has begun to narrow significantly. This has allowed the open-source world and public research to remain at the forefront of technological innovation.
LLaMa models are open source alternatives to closed-source corporate LLMs (Photo by Paul Lequay on Unsplash)
Technological advances
Over the past two years, research into ChatGPT-like models, known as large language models (LLMs), has been prolific. The first fundamental advancement, now taken for granted, is that companies managed to increase the context windows of models (how many words they can read as input and generate as output) while dramatically reducing costs per word. We’ve also seen models become multimodal, accepting not only text but also images, audio, and video as input. Additionally, they have been enabled to use tools — most notably, internet search — and have steadily improved in overall capacity.
On another front, various quantization and distillation techniques have emerged, enabling the compression of enormous models into smaller versions, even to the point of running language models on desktop computers (albeit sometimes at the cost of unacceptable performance reductions). This optimization trend appears to be on a positive trajectory, bringing us closer to small language models (SLMs) that could eventually run on smartphones.
On the downside, no significant progress has been made in controlling the infamous hallucinations — false yet plausible-sounding outputs generated by models. Once a quaint novelty, this issue now seems confirmed as a structural feature of the technology. For those of us who use this technology in our daily work, it’s frustrating to rely on a tool that behaves like an expert most of the time but commits gross errors or outright fabricates information roughly one out of every ten times. In this sense, Yann LeCun, the head of Meta AI and a major figure in AI, seems vindicated, as he had adopted a more deflationary stance on LLMs during the 2023 hype peak.
However, pointing out the limitations of LLMs doesn’t mean the debate is settled about what they’re capable of or where they might take us. For instance, Sam Altman believes the current research program still has much to offer before hitting a wall, and the market, as we’ll see shortly, seems to agree. Many of the advancements we’ve seen over the past two years support this optimism. OpenAI launched its voice assistant and an improved version capable of near-real-time interaction with interruptions — like human conversations rather than turn-taking. More recently, we’ve seen the first advanced attempts at LLMs gaining access to and control over users’ computers, as demonstrated in the GPT-4o demo (not yet released) and in Claude 3.5, which is available to end users. While these tools are still in their infancy, they offer a glimpse of what the near future could look like, with LLMs having greater agency. Similarly, there have been numerous breakthroughs in automating software engineering, highlighted by debatable milestones like Devin, the first “artificial software engineer.” While its demo was heavily criticized, this area — despite the hype — has shown undeniable, impactful progress. For example, in the SWE-bench benchmark, used to evaluate AI models’ abilities to solve software engineering problems, the best models at the start of the year could solve less than 13% of exercises. As of now, that figure exceeds 49%, justifying confidence in the current research program to enhance LLMs’ planning and complex task-solving capabilities.
Along the same lines, OpenAI's recent announcement of the o1 model signals a new line of research with significant potential, despite the currently released version (o1-preview) not being far ahead of what's already known. In fact, o1 is based on a novel idea: leveraging inference time — not training time — to improve the quality of generated responses. With this approach, the model doesn't immediately produce the most probable next word but has the ability to "pause to think" before responding. One of the company's researchers suggested that, eventually, these models could use hours or even days of computation before generating a response. Preliminary results have sparked high expectations, as using inference time to optimize quality was not previously considered viable. We now await subsequent models in this line (o2, o3, o4) to confirm whether it is as promising as it currently seems.
Beyond language models, these two years have seen enormous advancements in other areas. First, we must mention image generation. Text-to-image models began to gain traction even before chatbots and have continued developing at an accelerated pace, expanding into video generation. This field reached a high point with the introduction of OpenAI’s Sora, a model capable of producing extremely high-quality videos, though it was not released. Slightly less known but equally impressive are advances in music generation, with platforms like Suno and Udio, and in voice generation, which has undergone a revolution and achieved extraordinarily high-quality standards, led by Eleven Labs.
It has undoubtedly been two intense years of remarkable technological progress and almost daily innovations for those of us involved in the field.
The market boom
If we turn our attention to the financial aspect of this phenomenon, we will see vast amounts of capital being poured into the world of AI in a sustained and growing manner. We are currently in the midst of an AI gold rush, and no one wants to be left out of a technology that its inventors, modestly, have presented as equivalent to the steam engine, the printing press, or the internet.
It may be telling that the company that has capitalized the most on this frenzy doesn’t sell AI but rather the hardware that serves as its infrastructure, aligning with the old adage that during a gold rush, a good way to get rich is by selling shovels and picks. As mentioned earlier, Nvidia has positioned itself as the most valuable company in the world, reaching a market capitalization of $3.5 trillion. For context, $3,500,000,000,000 is a figure far greater than France’s GDP.
We are currently in the midst of an AI gold rush, and no one wants to be left out (Photo by Dimitri Karastelev on Unsplash)
On the other hand, if we look at the list of publicly traded companies with the highest market value, we see tech giants linked partially or entirely to AI promises dominating the podium. Apple, Nvidia, Microsoft, and Google are the top four as of the date of this writing, with a combined capitalization exceeding $12 trillion. For reference, in November 2022, the combined capitalization of these four companies was less than half of this value. Meanwhile, generative AI startups in Silicon Valley are raising record-breaking investments. The AI market is bullish.
While the technology advances fast, the business model for generative AI — beyond the major LLM providers and a few specific cases — remains unclear. As this bullish frenzy continues, some voices, including recent economics Nobel laureate Daron Acemoglu, have expressed skepticism about AI’s ability to justify the massive amounts of money being poured into it. For instance, in this Bloomberg interview, Acemoglu argues that current generative AI will only be able to automate less than 5% of existing tasks in the next decade, making it unlikely to spark the productivity revolution investors anticipate.
Is this AI fever or rather AI feverish delirium? For now, the bullish rally shows no signs of stopping, and like any bubble, it will be easy to recognize in hindsight. But while we’re in it, it’s unclear if there will be a correction and, if so, when it might happen. Are we in a bubble about to burst, as Acemoglu believes, or, as one investor suggested, is Nvidia on its way to becoming a $50 trillion company within a decade? This is the million-dollar question and, unfortunately, dear reader, I do not know the answer. Everything seems to indicate that, just like in the dot com bubble, we will emerge from this situation with some companies riding the wave and many underwater.
Social impact
Let’s now discuss the broader social impact of generative AI’s arrival. The leap in quality represented by ChatGPT, compared to the socially known technological horizon before its launch, caused significant commotion, opening debates about the opportunities and risks of this specific technology, as well as the potential opportunities and risks of more advanced technological developments.
The problem of the future
The debate over the proximity of artificial general intelligence (AGI) — AI reaching human or superhuman capabilities — gained public relevance when Geoffrey Hinton (now a Physics Nobel laureate) resigned from his position at Google to warn about the risks such development could pose. Existential risk — the possibility that a super-capable AI could spiral out of control and either annihilate or subjugate humanity — moved out of the realm of fiction to become a concrete political issue. We saw prominent figures, with moderate and non-alarmist profiles, express concern in public debates and even in U.S. Senate hearings. They warned of the possibility of AGI arriving within the next ten years and the enormous problems this would entail.
(Photo by Axel Richter on Unsplash)
The urgency that surrounded this debate now seems to have faded, and in hindsight, AGI appears further away than it did in 2023. It's common to overestimate achievements immediately after they occur, just as it's common to underestimate them over time. This latter phenomenon even has a name: the AI Effect, where major advancements in the field lose their initial luster over time and cease to be considered "true intelligence." If today the ability to generate coherent discourse — like the ability to play chess — is no longer surprising, this should not distract us from the timeline of progress in this technology. In 1997, the Deep Blue model defeated chess champion Garry Kasparov. In 2016, AlphaGo defeated Go master Lee Sedol. And in 2022, ChatGPT produced high-quality, articulated speech, even challenging the famous Turing Test as a benchmark for determining machine intelligence. I believe it's important to sustain discussions about future risks even when they no longer seem imminent or urgent. Otherwise, cycles of fear and calm prevent mature debate. Whether through the research direction opened by o1 or new pathways, it's likely that within a few years, we'll see another breakthrough on the scale of ChatGPT in 2022, and it would be wise to address the relevant discussions before that happens.
A separate chapter on AGI and AI safety involves the corporate drama at OpenAI, worthy of prime-time television. In late 2023, Sam Altman was abruptly removed by the board of directors. Although the full details were never clarified, Altman’s detractors pointed to an alleged culture of secrecy and disagreements over safety issues in AI development. The decision sparked an immediate rebellion among OpenAI employees and drew the attention of Microsoft, the company’s largest investor. In a dramatic twist, Altman was reinstated, and the board members who removed him were dismissed. This conflict left a rift within OpenAI: Jan Leike, the head of AI safety research, joined Anthropic, while Ilya Sutskever, OpenAI’s co-founder and a central figure in its AI development, departed to create Safe Superintelligence Inc. This seems to confirm that the original dispute centered around the importance placed on safety. To conclude, recent rumors suggest OpenAI may lose its nonprofit status and grant shares to Altman, triggering another wave of resignations within the company’s leadership and intensifying a sense of instability.
From a technical perspective, we saw a significant breakthrough in AI safety from Anthropic. The company achieved a fundamental milestone in LLM interpretability, helping to better understand the "black box" nature of these models. Through their discovery of the polysemantic nature of neurons and a method for extracting neural activation patterns representing concepts, the primary barrier to controlling Transformer models seems to have been broken — at least in terms of their potential to deceive us. The ability to deliberately alter these circuits, actively modifying the models' observable behavior, is also promising, and it brought some peace of mind regarding the gap between the capabilities of the models and our understanding of them.
The problems of the present
Setting aside the future of AI and its potential impacts, let's focus on the tangible effects of generative AI. Unlike the arrival of the internet or social media, this time society seemed to react quickly, demonstrating concern about the implications and challenges posed by this new technology. Beyond the deep debate on existential risks mentioned earlier — centered on future technological development and the pace of progress — the impacts of existing language models have also been widely discussed. The main issues with generative AI include the fear of amplifying misinformation and digital pollution, significant problems with copyright and private data use, and the impact on productivity and the labor market.
Regarding misinformation, this study suggests that, at least for now, there hasn't been a significant increase in exposure to misinformation due to generative AI. While this is difficult to confirm definitively, my personal impressions align: although misinformation remains prevalent — and may have even increased in recent years — it hasn't undergone a significant phase change attributable to the emergence of generative AI. This doesn't mean misinformation isn't a critical issue today. The weaker thesis here is that generative AI doesn't seem to have significantly worsened the problem — at least not yet.
However, we have seen instances of deepfakes, such as recent cases involving AI-generated pornographic material using real people's faces, and more seriously, cases in schools where minors — particularly young girls — were affected. These cases are extremely serious, and it's crucial to bolster judicial and law enforcement systems to address them. However, they appear, at least preliminarily, to be manageable and, in the grand scheme, represent relatively minor impacts compared to the speculative nightmare of misinformation fueled by generative AI. Perhaps legal systems will take longer than we would like, but there are signs that institutions may be up to the task, at least as far as deepfakes involving minors are concerned, as illustrated by the exemplary 18-year sentence handed down in the United Kingdom to a person for creating and distributing this material.
Secondly, concerning the impact on the labor market and productivity — the flip side of the market boom — the debate remains unresolved. It’s unclear how far this technology will go in increasing worker productivity or in reducing or increasing jobs. Online, one can find a wide range of opinions about this technology’s impact. Claims like “AI replaces tasks, not people” or “AI won’t replace you, but a person using AI will” are made with great confidence yet without any supporting evidence — something that ironically recalls the hallucinations of a language model. It’s true that ChatGPT cannot perform complex tasks, and those of us who use it daily know its significant and frustrating limitations. But it’s also true that tasks like drafting professional emails or reviewing large amounts of text for specific information have become much faster. In my experience, productivity in programming and data science has increased significantly with AI-assisted programming environments like Copilot or Cursor. In my team, junior profiles have gained greater autonomy, and everyone produces code faster than before. That said, the speed in code production could be a double-edged sword, as some studies suggest that code generated with generative AI assistants may be of lower quality than code written by humans without such assistance.
If the impact of current LLMs isn’t entirely clear, this uncertainty is compounded by significant advancements in associated technologies, such as the research line opened by o1 or the desktop control anticipated by Claude 3.5. These developments increase the uncertainty about the capabilities these technologies could achieve in the short term. And while the market is betting heavily on a productivity boom driven by generative AI, many serious voices downplay the potential impact of this technology on the labor market, as noted earlier in the discussion of the financial aspect of the phenomenon. In principle, the most significant limitations of this technology (e.g., hallucinations) have not only remained unresolved but now seem increasingly unlikely to be resolved. Meanwhile, human institutions have proven less agile and revolutionary than the technology itself, cooling the conversation and dampening the enthusiasm of those envisioning a massive and immediate impact.
In any case, the promise of a massive revolution in the workplace, if it is to materialize, has not done so in these first two years. Considering the accelerated adoption of this technology (according to this study, more than 24% of American workers today use generative AI at least once a week), and assuming that the earliest adopters are perhaps those who stand to gain the most, we can suppose that we have already seen a good part of this technology's productivity impact. In my professional day-to-day and that of my team, the productivity gains so far, while noticeable and visible, have also been modest.
Another major challenge accompanying the rise of generative AI involves copyright issues. Content creators — including artists, writers, and media companies — have expressed dissatisfaction over their works being used without authorization to train AI models, which they consider a violation of their intellectual property rights. On the flip side, AI companies often argue that using protected material to train models is covered under “fair use” and that the production of these models constitutes legitimate and creative transformation rather than reproduction.
This conflict has resulted in numerous lawsuits, such as Getty Images suing Stability AI for the unauthorized use of images to train models, or lawsuits by artists and authors, like Sarah Silverman, against OpenAI, Meta, and other AI companies. Another notable case involves record companies suing Suno and Udio, alleging copyright infringement for using protected songs to train generative music models.
In this futuristic reinterpretation of the age-old divide between inspiration and plagiarism, courts have yet to decisively tip the scales one way or the other. While some aspects of these lawsuits have been allowed to proceed, others have been dismissed, maintaining an atmosphere of uncertainty. Recent legal filings and corporate strategies — such as Adobe, Google, and OpenAI indemnifying their clients — demonstrate that the issue remains unresolved, and for now, legal disputes continue without a definitive conclusion.
The use of artificial intelligence in the EU will be regulated by the AI Act, the world's first comprehensive AI law (Photo by Guillaume Périgois on Unsplash)
The regulatory framework for AI has also seen significant progress, with the most notable development on this side of the globe being the European Union's approval of the AI Act in March 2024. This legislation positioned Europe as the first bloc in the world to adopt a comprehensive regulatory framework for AI, establishing a phased implementation system to ensure compliance, set to begin in February 2025 and proceed gradually.
The AI Act classifies AI risks, prohibiting cases of “unacceptable risk,” such as the use of technology for deception or social scoring. While some provisions were softened during discussions to ensure basic rules applicable to all models and stricter regulations for applications in sensitive contexts, the industry has voiced concerns about the burden this framework represents. Although the AI Act wasn’t a direct consequence of ChatGPT and had been under discussion beforehand, its approval was accelerated by the sudden emergence and impact of generative AI models.
With these tensions, opportunities, and challenges, it’s clear that the impact of generative AI marks the beginning of a new phase of profound transformations across social, economic, and legal spheres, the full extent of which we are only beginning to understand.
Coming Soon
I approached this article thinking that the ChatGPT boom had passed and its ripple effects were now subsiding, calming. Reviewing the events of the past two years convinced me otherwise: they’ve been two years of great progress and great speed.
These are times of excitement and expectation — a true springtime for AI — with impressive breakthroughs continuing to emerge and promising research lines waiting to be explored. On the other hand, these are also times of uncertainty. The suspicion of being in a bubble and the expectation of a significant emotional and market correction are more than reasonable. But as with any market correction, the key isn’t predicting if it will happen but knowing exactly when.
What will happen in 2025? Will Nvidia’s stock collapse, or will the company continue its bullish rally, fulfilling the promise of becoming a $50 trillion company within a decade? And what will happen to the AI stock market in general? And what will become of the reasoning model research line initiated by o1? Will it hit a ceiling or start showing progress, just as the GPT line advanced through versions 1, 2, 3, and 4? How much will today’s rudimentary LLM-based agents that control desktops and digital environments improve overall?
We’ll find out sooner rather than later, because that’s where we’re headed.
Happy birthday, ChatGPT! (Photo by Nick Stephenson on Unsplash)
ChatGPT: Two Years Later was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
LangChain’s Parent Document Retriever — Revisited
Enhance retrieval with context using your vector database only
TL;DR — We achieve the same functionality as LangChain's Parent Document Retriever (link) by utilizing metadata queries. You can explore the code here.
Introduction to RAG
Retrieval-augmented generation (RAG) is currently one of the hottest topics in the world of LLM and AI applications.
In short, RAG is a technique for grounding a generative model's response on chosen knowledge sources. It comprises two phases: retrieval and generation.
- In the retrieval phase, given a user’s query, we retrieve pieces of relevant information from a predefined knowledge source.
- Then, we insert the retrieved information into the prompt that is sent to an LLM, which (ideally) generates an answer to the user’s question based on the provided context.
A commonly used approach to achieve efficient and accurate retrieval is through the use of embeddings. In this approach, we preprocess users' data (let's assume plain text for simplicity) by splitting the documents into chunks (such as pages, paragraphs, or sentences). We then use an embedding model to create a meaningful, numerical representation of these chunks, and store them in a vector database. Now, when a query comes in, we embed it as well and perform a similarity search using the vector database to retrieve the relevant information.
Image by the author
If you are completely new to this concept, I'd recommend deeplearning.ai's great course, LangChain: Chat with Your Data.
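As a rough sketch of that pipeline (the toy chunks, the fake embedding model, and the query are placeholder assumptions, not taken from the course or the original article):

from langchain_community.embeddings import DeterministicFakeEmbedding
from langchain_community.vectorstores import Chroma
from langchain_core.documents import Document

# Toy "chunks" standing in for preprocessed user data
chunks = [
    Document(page_content="RAG grounds a model's answer on retrieved context."),
    Document(page_content="Embeddings map text chunks to vectors for similarity search."),
]
embedding = DeterministicFakeEmbedding(size=384)  # placeholder for a real embedding model
vectorstore = Chroma.from_documents(documents=chunks, embedding=embedding)
# Retrieval phase: embed the query and find the most similar chunks
relevant = vectorstore.similarity_search("What grounds the answer?", k=1)
# Generation phase: `relevant` would then be inserted into the prompt sent to the LLM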
What is “Parent Document Retrieval”?
"Parent Document Retrieval", or "Sentence Window Retrieval" as it is referred to by others, is a common approach to enhance the performance of retrieval methods in RAG by providing the LLM with a broader context to consider.
In essence, we divide the original documents into relatively small chunks, embed each one, and store them in a vector database. Using such small chunks (a sentence or a couple of sentences) helps the embedding models to better reflect their meaning [1].
Then, at retrieval time, we do not return the most similar chunk as found by the vector database only, but also its surrounding context (chunks) in the original document. That way, the LLM will have a broader context, which, in many cases, helps generate better answers.
LangChain supports this concept via Parent Document Retriever [2]. The Parent Document Retriever allows you to: (1) retrieve the full document a specific chunk originated from, or (2) pre-define a larger “parent” chunk, for each smaller chunk associated with that parent.
Let's explore the example from LangChain's docs:
# Imports for the docs example (LangChain ~0.2; exact package layout may differ in your version,
# and OpenAIEmbeddings requires the langchain-openai package)
from langchain.retrievers import ParentDocumentRetriever
from langchain.storage import InMemoryStore
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter

# This text splitter is used to create the parent documents
parent_splitter = RecursiveCharacterTextSplitter(chunk_size=2000)
# This text splitter is used to create the child documents
# It should create documents smaller than the parent
child_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
# The vectorstore to use to index the child chunks
vectorstore = Chroma(
    collection_name="split_parents", embedding_function=OpenAIEmbeddings()
)
# The storage layer for the parent documents
store = InMemoryStore()
retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=store,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)
retrieved_docs = retriever.invoke("justice breyer")

In my opinion, there are two disadvantages of LangChain's approach:
- The need to manage external storage to benefit from this useful approach, either in memory or another persistent store. Of course, for real use cases, the InMemoryStore used in the various examples will not suffice.
- The “parent” retrieval isn’t dynamic, meaning we cannot change the size of the surrounding window on the fly.
Indeed, a few questions have been raised regarding this issue [3].
Here I'll also mention that LlamaIndex has its own SentenceWindowNodeParser [4], which generally has the same disadvantages.
In what follows, I’ll present another approach to achieve this useful feature that addresses the two disadvantages mentioned above. In this approach, we’ll be only using the vector store that is already in use.
Alternative Implementation
To be precise, we'll be using a vector store that supports performing metadata queries only, without any similarity search involved. Here, I'll present an implementation for ChromaDB and Milvus. This concept can be easily adapted to any vector database with such capabilities. I'll refer to Pinecone as an example at the end of this tutorial.
The general concept
The concept is straightforward:
- Construction: Alongside each chunk, save in its metadata the document_id it was generated from and also the sequence_number of the chunk.
- Retrieval: After performing the usual similarity search (assuming for simplicity only the top 1 result), we obtain the document_id and the sequence_number of the chunk from the metadata of the retrieved chunk. Retrieve all chunks with surrounding sequence numbers that have the same document_id.
For example, assume you've indexed a document named example.pdf into 80 chunks. Then, for some query, you find that the closest vector is the one with the following metadata:
{document_id: "example.pdf", sequence_number: 20}
You can easily get all vectors from the same document with sequence numbers from 15 to 25.
Let’s see the code.
Here, I’m using:
chromadb==0.4.24
langchain==0.2.8
pymilvus==2.4.4
langchain-community==0.2.7
langchain-milvus==0.1.2

The only interesting thing to notice below is the metadata associated with each chunk, which will allow us to perform the search.
from langchain_community.document_loaders import PyPDFLoader
from langchain_core.documents import Document
from langchain_text_splitters import RecursiveCharacterTextSplitter

document_id = "example.pdf"

def preprocess_file(file_path: str) -> list[Document]:
    """Load pdf file, chunk and build appropriate metadata"""
    loader = PyPDFLoader(file_path=file_path)
    pdf_docs = loader.load()
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=0,
    )
    docs = text_splitter.split_documents(documents=pdf_docs)
    # Each chunk carries the source document's id and its position within it
    chunks_metadata = [
        {"document_id": file_path, "sequence_number": i} for i, _ in enumerate(docs)
    ]
    for chunk, metadata in zip(docs, chunks_metadata):
        chunk.metadata = metadata
    return docs

Now, let's implement the actual retrieval in Milvus and Chroma. Note that I'll use LangChain's objects and not the native clients. I do this because I assume developers might want to keep LangChain's useful abstractions. On the other hand, it will require us to perform some minor hacks to bypass these abstractions in a database-specific way, so you should take that into consideration. Either way, the concept remains the same.
Again, let’s assume for simplicity we want only the most similar vector (“top 1”). Next, we’ll extract the associated document_id and its sequence number. This will allow us to retrieve the surrounding window.
from langchain_community.vectorstores import Milvus, Chroma
from langchain_community.embeddings import DeterministicFakeEmbedding

embedding = DeterministicFakeEmbedding(size=384)  # Just for the demo :)

def parent_document_retrieval(
    query: str, client: Milvus | Chroma, window_size: int = 4
):
    top_1 = client.similarity_search(query=query, k=1)[0]
    doc_id = top_1.metadata["document_id"]
    seq_num = top_1.metadata["sequence_number"]
    # range(-window_size, window_size + 1) keeps the window symmetric around the retrieved chunk
    ids_window = [seq_num + i for i in range(-window_size, window_size + 1)]
    # ...

Now, for the window/parent retrieval, we'll dig under the LangChain abstraction in a database-specific way.
For Milvus:
    if isinstance(client, Milvus):
        expr = f"document_id LIKE '{doc_id}' && sequence_number in {ids_window}"
        res = client.col.query(
            expr=expr, output_fields=["sequence_number", "text"], limit=len(ids_window)
        )  # This is Milvus specific
        docs_to_return = [
            Document(
                page_content=d["text"],
                metadata={
                    "sequence_number": d["sequence_number"],
                    "document_id": doc_id,
                },
            )
            for d in res
        ]
    # ...

For Chroma:
    elif isinstance(client, Chroma):
        expr = {
            "$and": [
                {"document_id": {"$eq": doc_id}},
                {"sequence_number": {"$gte": ids_window[0]}},
                {"sequence_number": {"$lte": ids_window[-1]}},
            ]
        }
        res = client.get(where=expr)  # This is Chroma specific
        texts, metadatas = res["documents"], res["metadatas"]
        docs_to_return = [
            Document(
                page_content=t,
                metadata={
                    "sequence_number": m["sequence_number"],
                    "document_id": doc_id,
                },
            )
            for t, m in zip(texts, metadatas)
        ]

And don't forget to sort it by the sequence number:
    docs_to_return.sort(key=lambda x: x.metadata["sequence_number"])
    return docs_to_return

For your convenience, you can explore the full code here.
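To tie the pieces together, a hypothetical end-to-end usage could look like the sketch below; the file path, the query, and the choice of Chroma as the client are assumptions for illustration:

# Index the chunks produced by preprocess_file() and run the window retrieval
docs = preprocess_file(file_path="example.pdf")
vectorstore = Chroma.from_documents(documents=docs, embedding=embedding)  # embedding defined above
window_docs = parent_document_retrieval(
    query="What does the document say about X?", client=vectorstore, window_size=4
)
for d in window_docs:
    print(d.metadata["sequence_number"], d.page_content[:60])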
Pinecone (and others)
As far as I know, there’s no native way to perform such a metadata query in Pinecone, but you can natively fetch vectors by their ID (https://docs.pinecone.io/guides/data/fetch-data).
Hence, we can do the following: each chunk will get a unique ID, which is essentially a concatenation of the document_id and the sequence number. Then, given a vector retrieved in the similarity search, you can dynamically create a list of the IDs of the surrounding chunks and achieve the same result.
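Here is a minimal sketch of that idea; the ID format and helper names are my own assumptions, and the Pinecone fetch call is only indicated in a comment:

# Hypothetical ID scheme: "<document_id>#<sequence_number>", assigned at upsert time
def chunk_id(document_id: str, sequence_number: int) -> str:
    return f"{document_id}#{sequence_number}"

def window_ids(document_id: str, sequence_number: int, window_size: int = 4) -> list[str]:
    """Build the IDs of the surrounding chunks (clamped at 0 on the left)."""
    start = max(sequence_number - window_size, 0)
    return [chunk_id(document_id, i) for i in range(start, sequence_number + window_size + 1)]

# Example: the top similarity hit carried document_id="example.pdf", sequence_number=20
ids = window_ids("example.pdf", 20, window_size=5)
# ['example.pdf#15', ..., 'example.pdf#25']
# These IDs can then be passed to Pinecone's fetch-by-ID endpoint (e.g. index.fetch(ids=ids))
# to pull the surrounding chunks without a metadata query.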
Limitations
It's worth mentioning that vector databases were not designed to perform "regular" database operations and are usually not optimized for them, and each database will perform differently. Milvus, for example, supports building indices over scalar fields ("metadata"), which can optimize these kinds of queries.
Also, note that this approach requires an additional query to the vector database: first we retrieve the most similar vector, and then we perform a second query to get the surrounding chunks in the original document.
And of course, as seen from the code examples above, the implementation is vector-database-specific and is not supported natively by LangChain's abstractions.
Conclusion
In this blog we introduced an implementation of sentence-window retrieval, a useful retrieval technique used in many RAG applications. In this implementation we've used only the vector database that is already in use anyway, and we also support dynamically modifying the size of the surrounding window retrieved.
References
[1] ARAGOG: Advanced RAG Output Grading, https://arxiv.org/pdf/2404.01037, section 4.2.2
[2] https://python.langchain.com/v0.1/docs/modules/data_connection/retrievers/parent_document_retriever/
[3] Some related issues:
– https://github.com/langchain-ai/langchain/issues/14267
– https://github.com/langchain-ai/langchain/issues/20315
– https://stackoverflow.com/questions/77385587/persist-parentdocumentretriever-of-langchain
[4] https://docs.llamaindex.ai/en/stable/api_reference/node_parsers/sentence_window/
LangChain’s Parent Document Retriever — Revisited was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
-
Graph Neural Networks: Fraud Detection and Protein Function Prediction
Understanding AI applications in bio for machine learning engineers
Photo by Conny Schneider on Unsplash
What do a network of financial transactions and a protein structure have in common? They're both poorly modeled in Euclidean (x, y) space and require encoding complex, large, and heterogeneous graphs to truly grok them.
Left: image in Euclidean space. Right: graph in non-Euclidean space. From Graph neural networks: A review of methods and applications.
Graphs are the natural way to represent relational data in financial networks and protein structures. They capture the relationships and interactions between entities, such as transactions between accounts in financial systems or bonds and spatial proximities between amino acids in proteins. However, more widely known deep learning architectures like RNNs/CNNs and Transformers fail to model graphs effectively.
You might ask yourself: why can't we just map these graphs into 3D space? If we were to force them into a 3D grid:
- We would lose edge information, such as bond types in molecular graphs or transaction types.
- Mapping would require padding or resizing, leading to distortions.
- Sparse 3D data would result in many unused grid cells, wasting memory and processing power.
Given these limitations, Graph Neural Networks (GNNs) serve as a powerful alternative. In this continuation of our series on Machine Learning for Biology applications, we’ll explore how GNNs can address these challenges.
As always, we’ll start with the more familiar topic of fraud detection and then learn how similar concepts are applied in biology.
Fraud Detection
To be crystal clear, let's first define what a graph is. We remember plotting graphs on x, y axes in grade school, but what we were really doing there was graphing a function: plotting the points of f(x)=y. When we talk about a "graph" in the context of GNNs, we mean a model of pairwise relations between objects, where each object is a node and the relationships are edges.
In a financial network, the nodes are accounts and the edges are the transactions. The graph would be constructed from related party transactions (RPT) and could be enriched with attributes (e.g. time, amount, currency).
Left: Graph of a function (what we are NOT talking about). (2024, March 15). In Wikipedia. https://en.wikipedia.org/wiki/Graph_of_a_function
Right: A graph with nodes and edges (what we are talking about). (2024, October 25). In Wikipedia. https://en.wikipedia.org/wiki/Graph_theory
Traditional rules-based and machine-learning methods often operate on a single transaction or entity. This limitation fails to account for how transactions are connected to the wider network. Because fraudsters often operate across multiple transactions or entities, fraud can go undetected.
By analyzing a graph, we can capture dependencies and patterns between direct neighbors and more distant connections. This is crucial for detecting laundering where funds are moved through multiple transactions to obscure their origin. GNNs illuminate the dense subgraphs created by laundering methods.
Example of a related party transfers network, from Using GNN to detect financial fraud based on the related party transactions network
Message-passing frameworks
Like other deep learning methods, the goal is to create a representation or embedding from the dataset. In GNNs, these node embeddings are created using a message-passing framework. Messages pass between nodes iteratively, enabling the model to learn both the local and global structure of the graph. Each node embedding is updated based on the aggregation of its neighbors’ features.
A generalization of the framework works as follows (a minimal code sketch follows the list):
- Initialization: Embeddings h_v^(0) are initialized with feature-based embeddings about the node, random embeddings, or pre-trained embeddings (e.g. the account name's word embedding).
- Message Passing: At each layer t, nodes exchange messages with their neighbors. Messages are defined as features of the sender node, features of the recipient node, and features of the edge connecting them combined in a function. The combining function can be a simple concatenation with a fixed-weight scheme (used by Graph Convolutional Networks, GCNs) or attention-weighted, where weights are learned based on the features of the sender and recipient (and optionally edge features) (used by Graph Attention Networks, GATs).
- Aggregation: After the message passing step, each node aggregates the received messages (as simple as mean, max, sum).
- Update: The aggregated messages then update the node's embedding through an update function (potentially an MLP (Multi-Layer Perceptron) with a nonlinearity such as ReLU, a GRU (Gated Recurrent Unit), or an attention mechanism).
- Finalization: Embeddings are finalized, as in other deep learning methods, when the representations stabilize or a maximum number of iterations is reached.
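Here is a minimal sketch of steps 2 to 4, assuming mean aggregation, a ReLU update, and randomly initialized weights; it is illustrative only and not tied to any particular GNN library:

import numpy as np

def message_passing_layer(h, neighbors, w_self, w_neigh):
    """One round of message passing: mean-aggregate neighbor embeddings, then update with ReLU."""
    h_new = np.zeros_like(h)
    for v, neigh in neighbors.items():
        # Aggregation: mean of the neighbors' current embeddings (zero vector for isolated nodes)
        agg = h[neigh].mean(axis=0) if neigh else np.zeros(h.shape[1])
        # Update: combine the node's own embedding with the aggregated message
        h_new[v] = np.maximum(0.0, h[v] @ w_self + agg @ w_neigh)
    return h_new

# Toy financial graph: 3 accounts (nodes), transactions (edges) 0-1 and 1-2
neighbors = {0: [1], 1: [0, 2], 2: [1]}
h = np.random.rand(3, 4)        # initial 4-dimensional node embeddings
w_self = np.random.rand(4, 4)
w_neigh = np.random.rand(4, 4)
h = message_passing_layer(h, neighbors, w_self, w_neigh)  # stack layers to reach multi-hop neighbors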
Node representation update in a Message Passing Neural Network (MPNN) layer: a node receives messages from all of its immediate neighbours, computed via a message function that accounts for the features of both sender and receiver. Graph neural network. (2024, November 14). In Wikipedia. https://en.wikipedia.org/wiki/Graph_neural_network
After the node embeddings are learned, a fraud score can be calculated in a few different ways:
- Classification: where the final embedding is passed into a classifier like a Multi-Layer Perceptron, which requires a comprehensive labeled historical training set.
- Anomaly Detection: where the embedding is classified as anomalous based on how distinct it is from the others. Distance-based scores or reconstruction errors can be used here for an unsupervised approach (see the sketch after this list).
- Graph-Level Scoring: where embeddings are pooled into subgraphs and then fed into classifiers to detect fraud rings (again requiring a labeled historical dataset).
- Label Propagation: A semi-supervised approach where label information propagates based on edge weights or graph connectivity making predictions for unlabeled nodes.
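As a toy illustration of the anomaly detection option (an assumption for illustration, not a method taken from a specific paper), a distance-based score over the learned embeddings could be as simple as:

import numpy as np

def anomaly_scores(embeddings):
    """Score each node by its distance to the centroid, relative to the average distance."""
    centroid = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - centroid, axis=1)
    return dists / (dists.mean() + 1e-9)  # nodes scoring well above 1 are unusually far from the bulk

scores = anomaly_scores(np.random.rand(100, 16))  # 100 accounts, 16-dim GNN embeddings
suspicious = np.argsort(scores)[-5:]              # flag the 5 most atypical accounts for review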
Now that we have a foundational understanding of GNNs for a familiar problem, we can turn to another application of GNNs: predicting the functions of proteins.
Protein Function Prediction
We've seen huge advances in protein folding prediction via AlphaFold 2 and 3 and protein design via RFDiffusion. However, protein function prediction remains challenging. Function prediction is vital for many reasons but is particularly important in biosecurity, for example to predict whether DNA will be pathogenic before sequencing. Traditional methods like BLAST rely on sequence similarity searching and do not incorporate any structural data.
Today, GNNs are beginning to make meaningful progress in this area by leveraging graph representations of proteins to model relationships between residues and their interactions. They are considered well-suited for protein function prediction, as well as for identifying binding sites for small molecules or other proteins and for classifying enzyme families based on active site geometry.
In many examples:
- nodes are modeled as amino acid residues
- edges as the interactions between them
The rationale behind this approach is a graph's inherent ability to capture long-range interactions between residues that are distant in the sequence but close in the folded structure. This is similar to why the Transformer architecture was so helpful for AlphaFold 2, allowing for parallelized computation across all pairs in a sequence.
To make the graph information-dense, each node can be enriched with features like residue type, chemical properties, or evolutionary conservation scores. Edges can optionally be enriched with attributes like the type of chemical bonds, proximity in 3D space, and electrostatic or hydrophobic interactions.
DeepFRI is a GNN approach for predicting protein functions from structure (specifically a Graph Convolutional Network (GCN)). A GCN is a specific type of GNN that extends the idea of convolution (used in CNNs) to graph data.
DeepFRI diagram: (A) LSTM language model, pretrained on ~2 million Pfam protein sequences, used for extracting residue-level features of the PDB sequence. (B) GCN with 3 graph convolutional layers for learning complex structure-to-function relationships. From Structure-Based Function Prediction using Graph Convolutional Networks.
In DeepFRI, each amino acid residue is a node enriched by attributes such as:
- the amino acid type
- physicochemical properties
- evolutionary information from the MSA
- sequence embeddings from a pretrained LSTM
- structural context such as the solvent accessibility.
Each edge is defined to capture spatial relationships between amino acid residues in the protein structure. An edge exists between two nodes (residues) if their distance is below a certain threshold, typically 10 Å. In this application, there are no attributes to the edges, which serve as unweighted connections.
The graph is initialized with node features (the LSTM-generated sequence embeddings along with the residue-specific features) and with edge information created from a residue contact map.
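As a rough sketch of how such a contact map can be built from residue coordinates (the 10 Å threshold follows the text, while using C-alpha positions is an assumption; DeepFRI's exact preprocessing may differ):

import numpy as np

def contact_map(ca_coords, threshold=10.0):
    """ca_coords: (N, 3) array of C-alpha positions; returns an (N, N) 0/1 adjacency matrix."""
    diffs = ca_coords[:, None, :] - ca_coords[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)          # pairwise residue distances in angstroms
    adj = (dists < threshold).astype(np.float32)    # unweighted edge if residues are close in 3D
    np.fill_diagonal(adj, 0.0)                      # no self-edges
    return adj

adj = contact_map(np.random.rand(120, 3) * 50)      # toy coordinates for a 120-residue protein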
Once the graph is defined, message passing occurs through adjacency-based convolutions at each of the three layers. Node features are aggregated from neighbors using the graph’s adjacency matrix. Stacking multiple GCN layers allows embeddings to capture information from increasingly larger neighborhoods, starting with direct neighbors and extending to neighbors of neighbors etc.
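One graph-convolution propagation step over such an adjacency matrix could be sketched as follows (standard symmetric normalization with self-loops; the toy graph and random weights are placeholders, not DeepFRI's trained parameters):

import numpy as np

def gcn_layer(h, adj, w):
    """Propagate node features over the graph: normalize adjacency, aggregate neighbors, project, ReLU."""
    a_hat = adj + np.eye(adj.shape[0])                       # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))   # inverse square root of node degrees
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt                 # symmetric normalization
    return np.maximum(0.0, a_norm @ h @ w)

adj = np.eye(120, k=1) + np.eye(120, k=-1)    # toy chain graph; in practice, use the contact map above
h = np.random.rand(120, 64)                   # residue-level input features (e.g. LSTM embeddings)
w1, w2 = np.random.rand(64, 64), np.random.rand(64, 64)
h = gcn_layer(gcn_layer(h, adj, w1), adj, w2) # stacking layers widens each residue's receptive field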
The final node embeddings are globally pooled to create a protein-level embedding, which is then used to classify proteins into hierarchically related functional classes (GO terms). Classification is performed by passing the protein-level embeddings through fully connected layers (dense layers) with sigmoid activation functions, optimized using a binary cross-entropy loss function. The classification model is trained on data derived from protein structures (e.g., from the Protein Data Bank) and functional annotations from databases like UniProt or Gene Ontology.
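The pooling and classification head might then look roughly like this sketch (the embedding size, the number of GO terms, and the 0.5 threshold are arbitrary assumptions; the real model is trained with binary cross-entropy as described above):

import numpy as np

def predict_go_terms(h, w_out, b_out):
    """Pool residue embeddings into one protein embedding, then score each GO term independently."""
    protein_emb = h.mean(axis=0)              # global mean pooling over residues
    logits = protein_emb @ w_out + b_out      # fully connected layer
    return 1.0 / (1.0 + np.exp(-logits))      # sigmoid: one probability per (non-exclusive) GO term

w_out = np.random.rand(64, 300) * 0.01        # 64-dim protein embedding scored against 300 GO terms
b_out = np.zeros(300)
probs = predict_go_terms(np.random.rand(120, 64), w_out, b_out)
predicted_terms = np.where(probs > 0.5)[0]    # threshold to get the predicted functional classes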
Final Thoughts
- Graphs are useful for modeling many non-linear systems.
- GNNs capture relationships and patterns that traditional methods struggle to model by incorporating both local and global information.
- There are many variations of GNNs, but the most important (currently) are Graph Convolutional Networks and Graph Attention Networks.
- GNNs can be efficient and effective at identifying the multi-hop relationships present in money laundering schemes using supervised and unsupervised methods.
- GNNs can improve on sequence only based protein function prediction tools like BLAST by incorporating structural data. This enables researchers to predict the functions of new proteins with minimal sequence similarity to known ones, a critical step in understanding biosecurity threats and enabling drug discovery.
Cheers and if you liked this post, check out my other articles on Machine Learning and Biology.
Graph Neural Networks: Fraud Detection and Protein Function Prediction was originally published in Towards Data Science on Medium, where people are continuing the conversation by highlighting and responding to this story.
-
Leverage Python Inheritance in ML projects
Learn how to implement coding best practices to avoid tech debts
-
One of the best pool-cleaning robots I’ve tested is $450 off for Prime Day
After a hurricane showered debris into my pool, the Beatbot Aquasense Pro pool cleaner easily tackled the mess — and it's on sale ahead of October Prime Day. -
Apple’s M2 MacBook Air is on sale for $749 for Black Friday
Apple's ultraportable M2 MacBook Air set the standard for portability, and right now, it's down to $749 on Amazon for Black Friday — the lowest price we've seen yet.