Category: Artificial Intelligence

  • Spoiler Alert: The Magic of RAG Does Not Come from AI

    Spoiler Alert: The Magic of RAG Does Not Come from AI

    Frank Wittkampf

    Why retrieval, not generation, makes RAG systems magical

    Quick POCs

Most quick proofs of concept (POCs) that let a user explore data with the help of conversational AI simply blow you away. It feels like pure magic when you can suddenly talk to your documents, your data, or your code base.

These POCs work wonders on small datasets with a limited number of documents. However, as with almost anything you bring to production, you quickly run into problems at scale. When you do a deep dive and inspect the answers the AI gives you, you notice:

• Your agent doesn’t reply with complete information. It misses some important pieces of data
    • Your agent doesn’t reliably give the same answer
    • Your agent isn’t able to tell you how and where it got which information, making the answer significantly less useful

    It turns out that the real magic in RAG does not happen in the generative AI step, but in the process of retrieval and composition. Once you dive in, it’s pretty obvious why…

    * RAG = Retrieval Augmented Generation — Wikipedia Definition of RAG

    RAG process — Illustration

    So, how does a RAG-enabled AI agent answer a question?

A quick recap of how a simple RAG process works (a minimal code sketch follows the list):

    1. It all starts with a query. The user asked a question, or some system is trying to answer a question. E.g. “Does patient Walker have a broken leg?”
2. A search is done with the query. Mostly you’d embed the query and do a similarity search, but you can also do a classic keyword search (e.g., Elasticsearch), a combination of both, or a straight lookup of information
    3. The search result is a set of documents (or document snippets, but let’s simply call them documents for now)
    4. The documents and the essence of the query are combined into some easily readable context so that the AI can work with it
    5. The AI interprets the question and the documents and generates an answer
    6. Ideally this answer is fact checked, to see if the AI based the answer on the documents, and/or if it is appropriate for the audience
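To make these steps concrete, here is a minimal pseudocode-style sketch in Python. The helper functions (embed, vector_search, llm_generate, fact_check) are hypothetical placeholders, not any specific library’s API.

# Minimal sketch of the six RAG steps above.
# embed(), vector_search(), llm_generate() and fact_check() are hypothetical
# placeholders, not a specific library's API.
def answer_with_rag(query: str) -> str:
    # Step 1: start from the user query
    # Step 2: embed the query and run a similarity search
    query_vector = embed(query)
    documents = vector_search(query_vector, top_k=10)

    # Steps 3 and 4: compose the retrieved documents and the query into readable context
    context = "\n\n".join(doc["content"] for doc in documents)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )

    # Step 5: the model interprets the question and documents and generates an answer
    answer = llm_generate(prompt)

    # Step 6: ideally, fact-check the answer against the retrieved documents
    return answer if fact_check(answer, documents) else "No grounded answer found."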

    Where’s the magic?

The dirty little secret of the RAG process is that you have to provide the answer to the AI (before it even does anything) so that it can give you the reply you’re looking for.

    In other words:

• the work that the AI does (step 5) is to apply judgement and properly articulate the answer
• the work that the engineer does (steps 3 and 4) is to find the answer and compose it such that the AI can digest it

Which is more important? The answer is, of course, it depends: if judgement is the critical element, then the AI model does all the magic. But for an endless number of business use cases, finding and properly composing the pieces that make up the answer is the more important part.

What are the typical engineering problems to solve if you want a proper RAG process?

The first set of problems to solve when running a RAG process are the data ingestion, splitting, chunking, and document interpretation issues. I’ve written about a few of these in prior articles, but I’m ignoring them here. For now, let’s assume you have properly solved your data ingestion and have a lovely vector store or search index.

    Typical challenges:

• Duplication — Even the simplest production systems often have duplicate documents. More so when your system is large, you have many users or tenants, you connect to multiple data sources, or you deal with versioning, etc.
    • Near duplication — Documents which largely contain the same data, but with minor changes. There are two types of near duplication:
       — Meaningful — E.g. a small correction, or a minor addition, e.g. a date field with an update
       — Meaningless — E.g.: minor punctuation, syntax, or spacing differences, or just differences introduced by timing or intake processing
    • Volume — Some queries have a very large relevant response data set
• Data freshness vs quality — Which snippets of the response data set have the highest-quality content for the AI to use, and which snippets are most relevant from a time (freshness) perspective?
    • Data variety — How do we ensure a variety of search results such that the AI is properly informed?
• Query phrasing and ambiguity — The prompt that triggered the RAG flow might not be phrased in a way that yields the optimal result, or might even be ambiguous
    • Response Personalization — The query might require a different response based on who asks it

    This list goes on, but you get the gist.

    Sidebar: Don’t unlimited context windows solve this?

    Short answer: no.

The cost and performance impact of using extremely large context windows shouldn’t be underestimated (you easily 10x or 100x your per-query cost), not to mention any follow-up interaction that the user/system has.

However, putting that aside, imagine the following situation.

We put Anne in a room with a piece of paper. The paper says: “patient Joe: complex foot fracture.” Now we ask Anne: does the patient have a foot fracture? Her answer is “yes, he does”.

    Now we give Anne a hundred pages of medical history on Joe. Her answer becomes “well, depending on what time you are referring to, he had …”

    Now we give Anne thousands of pages on all the patients in the clinic…

What you quickly notice is that how we define the question (or the prompt, in our case) starts to become very important. The larger the context window, the more nuance the query needs.

Additionally, as the context window grows, so does the universe of possible answers. This can be a positive thing, but in practice it’s a method that invites lazy engineering behavior and is likely to reduce the capabilities of your application if not handled intelligently.

    Suggested approaches

    As you scale a RAG system from POC to production, here’s how to address typical data challenges with specific solutions. Each approach has been adjusted to suit production requirements and includes examples where useful.

    Duplication

    Duplication is inevitable in multi-source systems. By using fingerprinting (hashing content), document IDs, or semantic hashing, you can identify exact duplicates at ingestion and prevent redundant content. However, consolidating metadata across duplicates can also be valuable; this lets users know that certain content appears in multiple sources, which can add credibility or highlight repetition in the dataset.

import hashlib

# Fingerprinting for deduplication
def fingerprint(doc_content):
    return hashlib.md5(doc_content.encode()).hexdigest()

# Store fingerprints and filter duplicates, while consolidating metadata
fingerprints = {}
unique_docs = []
for doc in docs:
    fp = fingerprint(doc['content'])
    if fp not in fingerprints:
        fingerprints[fp] = [doc]
        unique_docs.append(doc)
    else:
        fingerprints[fp].append(doc)  # Consolidate sources

    Near Duplication

    Near-duplicate documents (similar but not identical) often contain important updates or small additions. Given that a minor change, like a status update, can carry critical information, freshness becomes crucial when filtering near duplicates. A practical approach is to use cosine similarity for initial detection, then retain the freshest version within each group of near-duplicates while flagging any meaningful updates.

from sklearn.cluster import DBSCAN
import numpy as np

# Cluster embeddings with DBSCAN to find near duplicates
clustering = DBSCAN(eps=0.1, min_samples=2, metric="cosine").fit(doc_embeddings)

# Organize documents by cluster label
clustered_docs = {}
for idx, label in enumerate(clustering.labels_):
    if label == -1:
        continue
    if label not in clustered_docs:
        clustered_docs[label] = []
    clustered_docs[label].append(docs[idx])

# Filter clusters to retain only the freshest document in each cluster
filtered_docs = []
for cluster_docs in clustered_docs.values():
    # Choose the document with the most recent timestamp or highest relevance
    freshest_doc = max(cluster_docs, key=lambda d: d['timestamp'])
    filtered_docs.append(freshest_doc)

    Volume

When a query returns a high volume of relevant documents, effective handling is key. One approach is a layered strategy:

    • Theme Extraction: Preprocess documents to extract specific themes or summaries.
    • Top-k Filtering: After synthesis, filter the summarized content based on relevance scores.
    • Relevance Scoring: Use similarity metrics (e.g., BM25 or cosine similarity) to prioritize the top documents before retrieval.

    This approach reduces the workload by retrieving synthesized information that’s more manageable for the AI. Other strategies could involve batching documents by theme or pre-grouping summaries to further streamline retrieval.
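As a rough illustration of the relevance-scoring and top-k filtering parts of this strategy, the sketch below ranks candidate documents (or their pre-extracted summaries) by cosine similarity to the query and keeps only the top k. The summary_embedding field is an assumption about how the preprocessed themes are stored.

import numpy as np

def top_k_by_relevance(query_embedding, docs, k=20):
    """Keep only the k most relevant documents (or summaries) for the query."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Assumes each doc carries an embedding of its pre-extracted theme/summary
    scored = [(cosine(query_embedding, d["summary_embedding"]), d) for d in docs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [d for _, d in scored[:k]]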

    Data Freshness vs. Quality

    Balancing quality with freshness is essential, especially in fast-evolving datasets. Many scoring approaches are possible, but here’s a general tactic:

    • Composite Scoring: Calculate a quality score using factors like source reliability, content depth, and user engagement.
    • Recency Weighting: Adjust the score with a timestamp weight to emphasize freshness.
    • Filter by Threshold: Only documents meeting a combined quality and recency threshold proceed to retrieval.

    Other strategies could involve scoring only high-quality sources or applying decay factors to older documents.
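Here is a hedged sketch of one such composite score. The quality factors, the half-life, and the threshold are placeholders you would tune for your own data.

import math
import time

def composite_score(doc, half_life_days=30, quality_weight=0.6):
    """Blend a quality score with a recency weight; all constants are illustrative."""
    # Quality: assumed to be precomputed from source reliability, depth, engagement, etc.
    quality = doc.get("quality_score", 0.5)                        # 0..1

    # Recency: exponential decay based on the document's age in days
    age_days = (time.time() - doc["timestamp"]) / 86400
    recency = math.exp(-math.log(2) * age_days / half_life_days)   # 0..1

    return quality_weight * quality + (1 - quality_weight) * recency

# Only documents above a combined quality/recency threshold proceed to retrieval
filtered_docs = [d for d in docs if composite_score(d) >= 0.5]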

    Data Variety

    Ensuring diverse data sources in retrieval helps create a balanced response. Grouping documents by source (e.g., different databases, authors, or content types) and selecting top snippets from each source is one effective method. Other approaches include scoring by unique perspectives or applying diversity constraints to avoid over-reliance on any single document or perspective.

# Ensure variety by grouping and selecting top snippets per source

from itertools import groupby

k = 3  # Number of top snippets per source
docs = sorted(docs, key=lambda d: d['source'])

grouped_docs = {key: list(group)[:k] for key, group in groupby(docs, key=lambda d: d['source'])}
diverse_docs = [doc for group in grouped_docs.values() for doc in group]

    Query Phrasing and Ambiguity

Ambiguous queries can lead to suboptimal retrieval results. Using the exact user prompt is often not the best way to retrieve the results the user requires. For example, there might have been a relevant information exchange earlier in the chat, or the user may have pasted a large amount of text along with a question about it.

To ensure that you use a refined query, one approach is to have the RAG tool provided to the model ask it to rephrase the question into a more detailed search query, similar to how one might carefully craft a search query for Google. This approach improves alignment between the user’s intent and the RAG retrieval process. The phrasing below is suboptimal, but it provides the gist of it:

tools = [{
    "name": "search_our_database",
    "description": "Search our internal company database for relevant documents",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "A search query, like you would for a google search, in sentence form. Take care to provide any important nuance to the question."
            }
        },
        "required": ["query"]
    }
}]

    Response Personalization

    For tailored responses, integrate user-specific context directly into the RAG context composition. By adding a user-specific layer to the final context, you allow the AI to take into account individual preferences, permissions, or history without altering the core retrieval process.
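A minimal sketch of that user-specific layer is shown below; the profile fields (role, detail, access_level) are illustrative, not a prescribed schema.

def personalize_context(base_context: str, user_profile: dict) -> str:
    """Append a user-specific layer to the composed RAG context."""
    personalization = (
        f"Audience: {user_profile.get('role', 'general user')}\n"
        f"Preferred detail level: {user_profile.get('detail', 'concise')}\n"
        f"Access level: {user_profile.get('access_level', 'standard')}"
    )
    # The retrieval step stays untouched; only the final context changes
    return f"{base_context}\n\n---\n{personalization}"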

    By addressing these data challenges, your RAG system can evolve from a compelling POC into a reliable production-grade solution. Ultimately, the effectiveness of RAG relies more on careful engineering than on the AI model itself. While AI can generate fluent answers, the real magic lies in how well we retrieve and structure information. So the next time you’re impressed by an AI system’s conversational abilities, remember that it’s likely the result of an expertly designed retrieval process working behind the scenes.

I hope this article provided you with some insight into the RAG process, and why the magic you experience when talking to your data isn’t necessarily coming from the AI model, but is largely dependent on the design of your retrieval process.

    Please comment with your thoughts.



  • Exploring Music Transcription with Multi-Modal Language Models

    Exploring Music Transcription with Multi-Modal Language Models

    Jon Flynn

    Using Qwen2-Audio to transcribe music into sheet music

    Image by author

    Automatic music transcription is the process of converting audio files like MP3 and WAV into sheet music, guitar tablature, and any format a musician may want to learn a song on their instrument.

    We’ll go over the best current tools for doing this, which happen to be deep learning-based, and a novel approach for it.

    Current state of the art

    The current state-of-the-art for this task comes from Magenta, an open-source research project developed by the now defunct (as of April 2023) Google Brain Team.

    They released a paper Sequence-to-Sequence Piano Transcription with Transformers in 2021 which used a T5-inspired transformer model (similar to “t5-small”) with 54 million parameters and the Maestro dataset, achieving great results. The problem is approached as a sequence-to-sequence task using an encoder-decoder Transformer architecture. The encoder processes mel spectrogram frames as input and produces embeddings, while the decoder uses these embeddings via cross-attention to autoregressively generate a sequence of MIDI-like tokens. Their vocabulary consisted of four types of tokens:

    • Note tokens (128 values for MIDI pitches)
    • Velocity tokens (128 values including zero for note-off)
    • Time tokens (6,000 values in 10ms bins for absolute timing)
    • EOS token (to mark sequence end)

    See the image below for a visualisation of the architecture and an example sequence of their custom MIDI tokens:

    Figure 1. from Sequence-to-Sequence Piano Transcription with Transformers paper

Our model is a generic encoder-decoder Transformer architecture where each input position contains a single spectrogram frame and each output position contains an event from our MIDI-like vocabulary. Output tokens are autoregressively sampled from the decoder, at each step taking the token with maximum probability.

In 2022, they released a paper, MT3: Multi-Task Multitrack Music Transcription. This experiment used the same approach as the last one but added additional instrument tokens to represent the different instruments. Again, they used a similar T5 model and achieved great performance on many of the datasets it was trained on, notably Slakh, Maestro and MusicNet.

    MR-MT3 was released the following year as a slight improvement to MT3.

    Why use language models and not continue with these SOTA models?

    Compute/GPU resources

Huge resources were needed to train this from scratch, even though the model is much smaller than even the smallest language models. The 2021 paper noted:

    “We trained all models on 32 TPUv3 cores, resulting in a per-core batch size of 8. Based on validation set results, overfitting did not seem to be a problem, so we allowed training to progress for 400K steps, which took about 2.5 days for our baseline models.”

The MT3 paper doesn’t provide as much detail on training, stating only that they train for 1 million steps.

    Other limitations

    These models have some inherent limitations in their output flexibility. While language models typically have large vocabularies (often 30,000+ tokens) that are extensively pre-trained on diverse natural language data, MT3 and similar music transcription models use a much smaller, specialised token vocabulary (only a few thousand tokens) focused solely on musical events. This specialisation means that adding new tokens, such as for new instruments or playing techniques like palm muting on guitars or pizzicato on violins, is likely not easy — it requires significant retraining to integrate these new tokens effectively with the existing vocabulary, and often requires substantial training data demonstrating these techniques. This differs from large language models which can often describe such musical nuances in natural language without modification, as they’ve encountered these concepts during their broad pre-training.

    Transfer learning and zero-shot

    We can leverage transfer learning from large open-source pre-trained audio and language models. Examples of music generation models include OpenAI’s Jukebox and Meta’s MusicGen.

    Modern multi-modal model architecture

    GPT-4o is designed to handle text, audio and images “natively”. Although OpenAI has not released the technical details on this, it’s assumed that some weights in the network will process all modalities. It’s possible that the model uses a decoder-only architecture like language only GPT models without the need for encoder components to convert different modalities to a dense representation first. This design allows the model to seamlessly process and interpret inputs like text and images together, potentially offering performance benefits both computationally and in terms of model understanding.

    Many multi-modal models take a simpler approach reminiscent of the encoder-decoder architecture: they combine two pre-trained models — an encoder for the specific input modality (like ViT for vision or an audio encoder for sound) and a Large Language Model (such as LLaMA, Gemma, or Qwen). These models are connected through projection layers that align their representations in a shared latent space, often using just a single linear layer. These projection layers learn to convert the encoder’s output into a format that matches the LLM’s expected input dimensions and characteristics. The projection creates new embeddings/tokens from the input modality that can then be injected into the LLM’s input sequence. LLaVA is a prime example of this architecture for vision-language tasks, while Spotify’s Llark and Qwen-Audio apply the same principle using audio encoders instead of vision encoders.

    Here’s some pseudocode on how the models are stitched together:

    # Extract features from final layer of audio encoder
    # Shape: [batch_size, audio_seq_len, encoder_dim=1024]
    audio_features = audio_model(audio_input)

    # Project audio features to match LLM's embedding dimension
    # Shape: [batch_size, audio_seq_len, llm_embed_dim=4096]
    audio_embeddings = projection_layer(audio_features)

    # Get text embeddings from LLM's embedding layer
    # Shape: [batch_size, text_seq_len, llm_embed_dim=4096]
    text_embeddings = llm.embed_text(text_input)

    # Concatenate along sequence length dimension
    # Shape: [batch_size, audio_seq_len + text_seq_len, llm_embed_dim=4096]
    combined_input = concatenate([audio_embeddings, text_embeddings], dim=1)

    # Feed them into the LLM as normal for generation
    output = llm(combined_input)

    Spotify Llark and Qwen2-Audio

    Overview of architecture

    Llark uses OpenAI’s Jukebox and Qwen2-Audio uses OpenAI’s Whisper for the audio towers. Jukebox is a music generation model but it can also take in audio clips as input and outputs a continuation of the audio clip. Whisper is used for transcribing voice to text.

    Given their purpose, the choice of audio module is clear: Llark specialises in music analysis, while Qwen2Audio primarily focuses on responding to voice instructions with some basic audio and music analysis capabilities.

Determining the optimal source for extracting embeddings from large pre-trained models involves research and experimentation. Additionally, deciding whether to fine-tune the entire module or freeze parts of it is a crucial design choice. For instance, LLaVA’s training strategy involves freezing the vision tower and focusing on fine-tuning the projection layer and language model. We’ll go over this aspect of each model below.

    Llark: why Jukebox? Are these embeddings the best as of September 2024?

    Determining the optimal location to extract embeddings from large models typically requires extensive probing. This involves testing various activations or extracted layers of the model on different classification tasks through a process of trial and error. For music generation models, this could include tasks like genre recognition, instrument detection, emotion detection, as well as analysis of harmonic structures and temporal patterns. Many commercial embedding models (like OpenAI’s embedding models) are trained specifically for embedding generation with specialised architectures and training objectives, rather than being fine-tuned versions of existing language models.

    The two largest publicly available music generation and music continuation (i.e.: able to take in audio as input) models are Jukebox and MusicGen. MusicGen is newer and faster, and therefore seemed like it would be the obvious choice to me. However, according to this paper on probing MusicGen, embeddings extracted from Jukebox appear to outperform MusicGen on average in classification tasks. The findings from this paper led to the authors of Llark using the following approach for extracting embeddings:

    1. Embeddings are derived from the output of the 36th layer of the Jukebox encoder following the approach described in Castellon et al. (2021)
    2. Original Jukebox encoding:
      * 4800-dimensional vectors at 345Hz
      * For a 25s clip: over 4.14 * 10⁷ floating-point values
    3. The authors use a downsampling approach: Mean-pooling within 100ms frames, resulting in:
      * Downsampled frequency: 10Hz
      * Embedding size: 1.2 × 10⁶ for a 25s audio clip. That means a 2D array with shape [240, 4800].
      * Retains temporal information (unlike Castellon et al. who average over the time dimension)

    (The downsampled embedding size is approximately 6x larger than CLIP ViT-L14 models used in many multimodal vision models)
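For intuition, here is a rough sketch of that mean-pooling step with NumPy. The exact frame boundaries and any padding the authors use may differ; this is only illustrative.

import numpy as np

def mean_pool_embeddings(embeddings, source_hz=345, target_hz=10):
    """Mean-pool Jukebox encoder outputs over ~100 ms windows (345 Hz -> 10 Hz).

    embeddings: array of shape [num_frames, 4800]. Illustrative only; the paper's
    exact framing/padding may differ.
    """
    window = round(source_hz / target_hz)                  # ~34 frames per 100 ms
    usable = (embeddings.shape[0] // window) * window      # drop the ragged tail
    pooled = embeddings[:usable].reshape(-1, window, embeddings.shape[1]).mean(axis=1)
    return pooled                                          # roughly [seconds * 10, 4800]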

    Qwen2Audio: Whisper

    The embedding extraction for Qwen2Audio isn’t mentioned in detail in the paper. Whisper is an encoder-decoder architecture where the encoder generates deeply learned representations of the audio and the decoder decodes the representations to text (the transcription). In Qwen2Audio, it appears they extract embeddings from the final layer of Whisper’s encoder, although they don’t mention whether they freeze it during training.

    Pre-trained weights, training data and datasets

    Unfortunately Spotify has not provided any datasets or their trained model weights to the public, noting:

“With respect to inputs: the inputs to our model are public, open-source, Creative Commons-licensed audio and associated annotations. However, each individual audio file can have its own, potentially more restrictive license. Many of the audio files include “no derivatives” licenses. We encourage users of the datasets to familiarize themselves with the restrictions of these licenses; in order to honor such licenses, we do not release any derivatives from the training data in this paper (including query-response pairs or trained model weights).”

    They used the following datasets:

    • MusicCaps (Agostinelli et al., 2023)
    • YouTube8M-MusicTextClips (McKee et al., 2023)
    • MusicNet (Thickstun et al., 2017)
    • FMA (Defferrard et al., 2017)
    • MTG-Jamendo (Bogdanov et al., 2019)
    • MagnaTagATune (Law et al., 2009)

Llark details its training data generation process in the following extract:

“We use variants of ChatGPT to extract the instruction-tuning data for all experiments. However, the exact language model used varies by dataset. We select the OpenAI model as follows: We use GPT-4 for all reasoning tasks. We found that GPT-4 was much more adept at following the complex instructions in the Reasoning task family. For datasets with more than 25k samples, we limit Reasoning data to a random subsample of 25k tracks.”

    This results in Q&A data like this:

    Example text inputs and outputs from LLark, for the provided audio.

The datasets used for training Qwen2-Audio are not shared either, but the trained model is widely available and is also implemented in the transformers library.

For this project, fine-tuning from a pre-trained Llark model would have been optimal, given its reportedly good performance on the evaluation benchmarks Spotify stated in the paper.

However, given they didn’t release the weights for it, it’s infeasible to start training a model like this from scratch without a fair bit of expertise and money. Spotify trained it on:

    Our model is trained on 4 80GB NVIDIA A100 GPUs. Training takes approximately 54 hours.

    This would cost around $700 using a provider like LambdaLabs.

    Because of the above, I went with Qwen. However, Qwen2-Audio doesn’t perform that well across basic music tasks like tempo and instrument detection. I detail this below in the evaluation section. This means that the model is probably not large enough or pre-trained enough to achieve this task, but my hope is I could at least set a starting point and framework for fine-tuning on this task in the future. As Alibaba state in their Qwen2-Audio blog post:

    We also plan to build larger Qwen2-Audio models to explore the scaling laws of audio language models.

    For my own learning though, I did have a go at re-creating the model using torch and pre-trained models with the transformers library.

    I also created datasets for Q&A data and embeddings. I generated short form Q&A data for the URMP dataset, e.g.: “What is the tempo of this track”, “What instruments are playing in this audio”.

Here’s a notebook for running Jukebox in a Colab environment to take advantage of the cheap T4 GPUs. I uploaded both Q&A and embeddings datasets to HuggingFace here.

    Here’s a notebook with Llark replicated.

    Training data for music transcription

    Transcription format

    I chose ABC music notation as the output format that the language model is expected to transcribe the music in. Here’s an example of it:

    X:1
    M:4/4
    L:1/16
    K:none
    Q:67

    V:1 name="Electric Bass (finger)"
    %%octave-default C4
    GAA^2E3A2<A^2 | D^D^2E2A2A^4 A^2E2 | A2A^4A^2E2 A2A^4 | A^2E2A2A^4A^2E2A2 |
    A^4 A^2E2 A2A^4A^2 E2 | A2A^4 |

    V:2 name="Bright Acoustic Piano"
    %%octave-default C5
    [E3C3][E3C3][E3C3] [E3C3][A^,2E2A^2] | [E3A^3][E3A^3][E3A^3][E3A^3][E3A^3] |
    [E3A^3][E3A^3][E3A^3] [E3A^3][E3A^3] | [E3A^3][E3A^3][E3A^3][E3A^3][E3A^3] |
    [E3A^3][E3A^3][E3A^3] [E3A^3][E3A^3] | [E3A^3] |

    V:3 name="Electric Guitar (jazz)"
    %%octave-default C5
    E'3C'3A^4E'3C'3 | A^4E'3 C'3A^4E'3C'3 | A^4 E'3C'3A^4 E'3C'3 | A^4E'3C'3A^4E'3C'3 |
    A^4E'3C'3 A^4E'3C'3 | A^4 |

    In this notation we have the time signature and tempo defined at the top denoted by ‘M’ and ‘Q’. The ‘L’ indicates the default note length of the notation, in this case a sixteenth note, which is the norm. We then define each instrument and the default octave they should adhere to when writing the notes for each of them. Here’s a summary of the key syntactical points for writing notes in ABC music notation:

    • Notes are represented by letters A-G, with lowercase letters indicating higher octaves
    • Sharps are denoted by ^ before the note, flats by _
    • Natural signs are represented by =
    • Note length is indicated by numbers after the note (C2 is twice as long as C)
    • Dotted notes use a . after the note (C. is a dotted quarter note)
    • Rests are represented by z, with numbers for duration (z2 is a half rest)
    • Chords are enclosed in square brackets [CEG]
    • Ties are shown with a hyphen –
    • Bar lines are represented by |
    • Broken rhythms use > or < between notes (C>D means dotted-C eighth note followed by D sixteenth note)

    Why ABC?

    The reasons for choosing this notation are:

    1. It’s a minimalist format for writing music
    2. It’s widely used and popular; language models already have good comprehension of ABC notation due to extensive pre-training on it.
    3. It’s flexible and can easily be extended to include tempo changes, time signature changes, additional playing styles like mentioned above, etc…

    I converted the MIDI files provided by the datasets to ABC notation using this library. A notebook for creating the datasets is here.

    Evaluation

    To evaluate both the original model and each stage of fine-tuning I performed thereafter, I randomly selected 30 samples of varying complexity from the URMP dataset and ran the model three times on each sample, manually examining all responses.

Through manual testing, I found the optimal decoding parameters to be a temperature of 0.7 and a top_p of 1.2. The maximum number of tokens to return was capped at 2048; adjusting this maximum seemed to make little difference to performance.

    The original model performed poorly on this evaluation set. While it occasionally predicted the tempo and instruments correctly, it mostly failed to do so. A text file with the evaluation results is available here.

    Given this starting point, it’s unlikely that we’ll see strong results from this experiment without a robust pre-trained model. However, the goal is to develop strategies that can be applied in the future as more advanced pre-trained models become available.

    Fine-tuning strategies

    I first attempted fine-tuning with basic cross-entropy loss. Supervised fine-tuning with cross-entropy loss is a quick way to start teaching the model but a basic loss function like this has limitations as we will see below. The intuition behind this stage of training is that it would nudge the model in the right direction and it would pick up any patterns or any customised ABC notation the dataset may have which the model may not have seen before.

    Cross-entropy loss with teacher forcing

First, we trained it in a typical supervised fine-tuning manner for language models. I used the SFTTrainer from the trl library for this, which uses cross-entropy loss with teacher forcing, defined step by step below (a minimal sketch of the loss computation follows the list):

    1. The model predicts the next token in the sequence.
    2. The loss is calculated based on the difference between the predicted probabilities (logits) and the actual next token.
    3. For the next prediction, the model is given the actual correct token (ground truth), rather than its own prediction. This is known as teacher forcing, it helps stabilise training and significantly speed it up, especially in the early stages.
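To make the mechanics explicit, here is a minimal PyTorch-style sketch of teacher-forced next-token cross-entropy. It assumes a Hugging Face-style causal LM that returns .logits; it is not trl’s actual SFTTrainer internals, just the idea.

import torch.nn.functional as F

def teacher_forced_loss(model, input_ids, attention_mask):
    """Next-token cross-entropy with teacher forcing (illustrative, not trl internals)."""
    # Assumes a Hugging Face-style causal LM whose forward pass returns .logits
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

    # Shift so position t predicts token t+1; the ground-truth token is always
    # fed as the next input (teacher forcing)
    shift_logits = logits[:, :-1, :]
    shift_labels = input_ids[:, 1:]

    return F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
    )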

    The results from this training phase were poor. It degraded the performance of the original model. The model, which previously handled tempo and instrument recognition well, now mostly got these wrong. It also began producing garbled text output with endless repetition. This occurred even when setting a low learning rate, applying gradient clipping, and using low LoRA ranks to mitigate large changes to the model. Overall, it seemed the model was very sensitive to the training applied.

However, while this training phase may offer some improvements, it won’t lead to optimal performance due to the limitations of our basic loss function. This function struggles to fully capture the model’s performance nuances. For example, when using teacher forcing, instrument predictions can yield deceptively low loss across certain token sections. If an instrument name begins with “V”, the model might confidently predict “Violin” or “Viola” based on our dataset, regardless of accuracy. Additionally, the loss function may not accurately reflect near-misses, such as predicting a tempo of 195 instead of 200 — a small difference that’s reasonably accurate but potentially penalised heavily, depending on the distribution of probabilities among the logits. It’s possible that neighbouring numbers also have high probabilities.

    RLHF with PPO

    Because of these limitations, we can create our own custom loss function that can more accurately score the response from the model. That is, given a predicted sequence from the model, the loss function could give it a score between 0 and 1 on how good it is.

    However, integrating this custom loss function into supervised fine-tuning presents a significant challenge. The issue stems from the non-linearity introduced by the custom loss function, which prevents the direct calculation of gradients. Let’s break this down:

    In traditional SFT with cross-entropy loss:

    • The model outputs logits (raw scores) for each token in its vocabulary
    • These logits directly represent the model’s prediction probabilities
    • The loss function compares these probabilities to the ground truth
    • Gradients can be computed directly through this comparison
    • The chain rule of calculus allows us to propagate these gradients back through the model

    With our custom loss function:

    • The model must first generate complete text output
    • This generation process involves sampling from probability distributions
    • Our loss function then analyses this text output (checking tempo, notes, etc.)
    • This creates a non-differentiable step between the model’s logits and our loss calculation
    • The sampling and text analysis steps break the gradient chain needed for backpropagation

    To overcome this, reinforcement learning techniques like Proximal Policy Optimisation (PPO) can be employed. PPO is specifically designed to handle non-differentiable loss functions and can optimise the model by considering the entire policy (the model’s output distribution), rather than relying on gradient information from logits.

Note: there are a lot of great articles here explaining PPO!

    The key insight of PPO is that instead of trying to directly backpropagate through the non-differentiable steps, it:

    1. Treats the model’s outputs as actions in a reinforcement learning framework
    2. Uses the custom loss function as a reward signal
    3. Updates the model’s policy (its probability distributions over tokens) to maximise expected reward
    4. Does this while ensuring the updated policy doesn’t deviate too far from the current one

    This approach allows us to effectively train the model with the custom loss function, ensuring performance improvements without disrupting the core training dynamics. The PPO algorithm’s conservative update strategy helps maintain stability during training, which is particularly important when working with large language models.

Usually, this scoring function would be implemented as a separate LLM in the form of a “reward model”, as is common when fine-tuning models via RLHF, the approach that was popularised when ChatGPT came out. Due to the nature of this task, we can manually write code to score the responses, which uses fewer resources and is quicker.

    For time signature and tempo recognition this is easy to calculate. We extract all predicted items with regex, for example extracting the metre:

import re

# Method excerpt: extract the metre (time signature) field with a regex
def extract_metre(self, abc_string):
    return re.search(r'M:(\S+)', abc_string).group(1)

    The model should learn the syntax and structure we want it to output in the SFT stage. If it outputs something that will cause our regex to not find anything or error, we can just skip that sample, assuming it’s a small minority of the dataset.

    We extract the predicted tempo and write a function that is more forgiving for small errors but penalises larger errors more heavily:

    • For small differences (≤10 BPM), it uses linear scaling.
    • For larger differences, it switches to exponential scaling.
    • The final loss is capped between 0 and 1.
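A rough sketch of such a tempo loss is below. The constants (threshold, decay rate) are illustrative; the actual implementation linked just after may differ.

import math

def tempo_loss(predicted_bpm, true_bpm, linear_threshold=10, decay=20.0):
    """Loss in [0, 1]: linear for small BPM errors, exponential beyond the threshold.

    Constants are illustrative; see the linked custom-loss code for the real version.
    """
    diff = abs(predicted_bpm - true_bpm)
    if diff <= linear_threshold:
        loss = diff / (2 * linear_threshold)               # at most 0.5 for small errors
    else:
        # grows towards 1 as the error increases, continuous at the threshold
        loss = 1 - 0.5 * math.exp(-(diff - linear_threshold) / decay)
    return min(max(loss, 0.0), 1.0)                        # cap between 0 and 1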

    Let’s break down the key components of this custom loss:

    Code for the custom loss is here

    1. Metre Loss

    The metre loss focuses on the time signature of the piece. It compares the predicted metre with the ground truth, considering both the numerator and denominator separately, as well as their ratio. This approach allows for a nuanced evaluation that can handle various time signatures accurately.

    The metre loss uses a combination of linear and exponential scaling to penalise differences. Small discrepancies result in a linear increase in loss, while larger differences lead to an exponential increase, capped at a maximum value of 1.

    2. Tempo Loss

    Tempo loss evaluates the accuracy of the predicted beats per minute (BPM). Similar to the metre loss, it uses a combination of linear and exponential scaling.

    For small tempo differences (≤10 BPM), the function applies linear scaling. Larger differences trigger exponential scaling, ensuring that significant tempo mismatches are penalised more heavily.

    3. Pitch Loss

    The pitch loss is perhaps the most crucial component, as it assesses the accuracy of the transcribed notes. This function uses the Levenshtein distance to compare the sequence of notes in each voice.

    The pitch loss calculation accounts for multiple voices, matching each predicted voice to the closest ground truth voice. This approach allows for flexibility in voice ordering while still maintaining accuracy in the overall pitch content.

    4. Instrument Loss

    The instrument loss evaluates the accuracy of instrument selection for each voice.

    This function considers exact matches, instruments from the same family, and uses string similarity for more nuanced comparisons. It provides a comprehensive assessment of how well the model identifies and assigns instruments to each voice.

    5. Combining the Losses

    The final loss is a weighted combination of these individual components:

total_loss = (0.5 * pitch_loss +
              0.15 * metre_loss +
              0.15 * tempo_loss +
              0.2 * instrument_loss)

    This weighting scheme prioritises pitch accuracy while still considering other important aspects of music transcription.

    Training and hyperparameters

    PPO training generally requires a lot more memory than SFT for a few reasons:

    1. Multiple policy evaluations — PPO needs to maintain both the current policy (model weights) and an “old” policy to compute the probability ratio between them. This effectively doubles the model parameters in memory.
    2. Experience buffer — PPO stores a buffer of experiences (states, actions, rewards, etc.) to perform updates in mini-batches. This buffer can be quite large and takes significant memory.
    3. Advantage estimation — Computing advantages requires keeping track of value estimates and returns across trajectories, adding another layer of memory overhead.
    4. Additional optimisation objectives — PPO tracks multiple loss components (policy loss, value loss, entropy bonus) and their gradients, whereas SFT has a single loss.

    Because of the above, we’re more limited than SFT in the size of the models we can train and how much it costs. Whereas the above training I could do on an A100 40GB in Colab, for the PPO training I needed more memory. I trained on an H100 80GB, which could train a LoRA with a rank of 128 and a batch size of 8.

My hyperparameter sweep was narrow; I went with what seemed most intuitive, using batch sizes ranging from 1 to 16 and learning rates from 2e-5 to 2e-4.

The model made no improvement on the task. The text file with the results is here.

    I tracked various training metrics using Weights & Biases (WandB). Key metrics included the policy loss, value loss, total loss, KL divergence, and the reward model’s score.

For all hyperparameter runs, the logs showed no improvement in the rewards or the loss over time. The KL divergence remained within the pre-defined threshold.

    Conclusion

    While this initial experiment didn’t achieve the desired performance in music transcription, we’ve provided some groundwork for future developments in the space. The challenges encountered have provided valuable insights into both the technical requirements and potential approaches for tackling this complex task. Future work could explore several promising directions:

    • Experimenting with larger pre-trained models as they become available
    • Expanding the training dataset with more diverse musical examples
    • Further refinement of the reward functions to capture more nuanced musical relationships
    • Exploring hybrid approaches that combine traditional music processing techniques with language model capabilities

    Here’s my notebook for running these experiments with Qwen2-Audio!



  • Why Most Cross-Validation Visualizations Are Wrong (And How to Fix Them)

    Why Most Cross-Validation Visualizations Are Wrong (And How to Fix Them)

    Samy Baladram

    MODEL VALIDATION & OPTIMIZATION

    Stop using moving boxes to explain cross-validation!

    You know those cross-validation diagrams in every data science tutorial? The ones showing boxes in different colors moving around to explain how we split data for training and testing? Like this one:

    Have you seen that? Image by author.

    I’ve seen them too — one too many times. These diagrams are common — they’ve become the go-to way to explain cross-validation. But here’s something interesting I noticed while looking at them as both a designer and data scientist.

    When we look at a yellow box moving to different spots, our brain automatically sees it as one box moving around.

    It’s just how our brains work — when we see something similar move to a new spot, we think it’s the same thing. (This is actually why cartoons and animations work!)

    You might think the animated version is better, but now you can’t help following the blue box and starting to forget that this should represent how cross-validation works. Source: Wikipedia

    But here’s the thing: In these diagrams, each box in a new position is supposed to show a different chunk of data. So while our brain naturally wants to track the boxes, we have to tell our brain, “No, no, that’s not one box moving — they’re different boxes!” It’s like we’re fighting against how our brain naturally works, just to understand what the diagram means.

    Looking at this as someone who works with both design and data, I started thinking: maybe there’s a better way? What if we could show cross-validation in a way that actually works with how our brain processes information?

    All visuals: Author-created using Canva Pro. Optimized for mobile; may appear oversized on desktop.

    What’s Cross-Validation Really About?

    Cross-validation is about making sure machine learning models work well in the real world. Instead of testing a model once, we test it multiple times using different parts of our data. This helps us understand how the model will perform with new, unseen data.

    Here’s what happens:

    1. We take our data
    2. Divide it into groups
    3. Use some groups for training, others for testing
    4. Repeat this process with different groupings

    The goal is to get a reliable understanding of our model’s performance. That’s the core idea — simple and practical.

    (Note: We’ll discuss different validation techniques and their applications in another article. For now, let’s focus on understanding the basic concept and why current visualization methods need improvement.)
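For reference, the whole process fits in a few lines of scikit-learn; the dataset and model below are just placeholders.

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)               # placeholder dataset
model = DecisionTreeClassifier(random_state=42)

# Train and test the model on 3 different groupings of the data
scores = cross_val_score(model, X, y, cv=3)
print(scores, scores.mean())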

    What’s Wrong with Current Cross-validation Diagrams?

    Open up any machine learning tutorial, and you’ll probably see these types of diagrams:

    • Long boxes split into different sections
    • Arrows showing parts moving around
    • Different colors showing training and testing data
    • Multiple versions of the same diagram side by side
    Currently, this is similar to the first image you’ll see if you look up “Cross Validation.” (Image by author)

Here are the issues with such diagrams:

    Not Everyone Sees Colors the Same Way

    Colors create practical problems when showing data splits. Some people can’t differentiate certain colors, while others may not see colors at all. The visualization fails when printed in black and white or viewed on different screens where colors vary. Using color as the primary way to distinguish data parts means some people miss important information due to their color perception.

Not everyone sees the same colors. Image by author.

    Colors Make Things Harder to Remember

Another thing about colors: they might look like they help explain things, but they actually create extra work for our brain. When we use different colors for different parts of the data, we have to actively remember what each color represents. This becomes a memory task instead of helping us understand the actual concept. The connection between colors and data splits isn’t natural or obvious — it’s something we have to learn and keep track of while trying to understand cross-validation itself.

    Our brain doesn’t naturally connect colors with data splits.

These are the colors used in the previous diagrams. Why is the original dataset green? And why is it then split into blue and red?

    Too Much Information at Once

    The current diagrams also suffer from information overload. They attempt to display the entire cross-validation process in a single visualization, which creates unnecessary complexity. Multiple arrows, extensive labeling, all competing for attention. When we try to show every aspect of the process at the same time, we make it harder to focus on understanding each individual part. Instead of clarifying the concept, this approach adds an extra layer of complexity that we need to decode first.

    Too many labels, too many colors, too many arrows and it is too hard to focus.

    Movement That Misleads

    Movement in these diagrams creates a fundamental misunderstanding of how cross-validation actually works. When we show arrows and flowing elements, we’re suggesting a sequential process that doesn’t exist in reality. Cross-validation splits don’t need to happen in any particular order — the order of splits doesn’t affect the results at all.

    These diagrams also give the wrong impression that data physically moves during cross-validation. In reality, we’re simply selecting different rows from our original dataset each time. The data stays exactly where it is, and we just change which rows we use for testing in each split. When diagrams show data flowing between splits, they add unnecessary complexity to what should be a straightforward process.

    While diagrams typically flow from top to bottom, it’s hard to follow the sequence of operations. The timing of model training and the calculation results remain unclear. When does the training happen? What results come from each calculation?

    What We Need Instead

    We need diagrams that:

    • Don’t just rely on colors to explain things
    • Show information in clear, separate chunks
    • Make it obvious that different test groups are independent
    • Don’t use unnecessary arrows and movement

    Let’s fix this. Instead of trying to make our brains work differently, why don’t we create something that feels natural to look at?

    A Better Way to Visualize Cross-validation

Let’s try something different. First, this is what data looks like to most people — rows and columns of numbers with an index.

    This is the common dataset I used for my articles on classification algorithms.

Inspired by that structure, here’s a diagram that makes more sense.

    Simpler but clear depiction of cross-validation.

    Here’s why this design makes more sense logically:

    1. True Data Structure: It matches how data actually works in cross-validation. In practice, we’re selecting different portions of our dataset — not moving data around. Each column shows exactly which splits we’re using for testing each time.
    2. Independent Splits: Each split explicitly shows it’s different data. Unlike moving boxes that might make you think “it’s the same test set moving around,” this shows that Split 2 is using completely different data from Split 1. This matches what’s actually happening in your code.
    3. Data Conservation: By keeping the column height the same throughout all folds, we’re showing an important rule of cross-validation: you always use your entire dataset. Some portions for testing, the rest for training. Every piece of data gets used, nothing is left out.
    4. Complete Coverage: Looking left to right, you can easily check an important cross-validation principle: every portion of your dataset will be used as test data exactly once.
    5. Three-Fold Simplicity: We specifically use 3-fold cross-validation here because:
      a. It clearly demonstrates the key concepts without overwhelming detail
      b. The pattern is easy to follow: three distinct folds, three test sets. Simple enough to mentally track which portions are being used for training vs testing in each fold
      c. Perfect for educational purposes — adding more folds (like 5 or 10) would make the visualization more cluttered without adding conceptual value
      (Note: While 5-fold or 10-fold cross-validation might be more common in practice, 3-fold serves perfectly to illustrate the core concepts of the technique.)

    Adding Indices for Clarity

    While the concept above is correct, thinking about actual row indices makes it even clearer:

An enhanced variation with subtle indices, making it easier to see which part of the dataset each fold belongs to. The dashed lines help separate the indices.

Here are some of the ways this visual improves on the previous one:

    • Instead of just “different portions,” we can see that Fold 1 tests on rows 1–4, Fold 2 on rows 5–7, and Fold 3 on rows 8–10
    • “Complete coverage” becomes more concrete: rows 1–10 each appear exactly once in test sets
    • Training sets are explicit: when testing on rows 1–4, we’re training on rows 5–10
• Data independence is obvious: test sets use different row ranges (1–4, 5–7, 8–10)

    This index-based view doesn’t change the concepts — it just makes them more concrete and easier to implement in code. Whether you think about it as portions or specific row numbers, the key principles remain the same: independent folds, complete coverage, and using all your data.
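To see that the diagram matches the code, here is a small scikit-learn sketch on a 10-row dataset. Scikit-learn uses 0-based indices, so row 1 in the diagram corresponds to index 0 here.

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)     # 10 rows, like the diagram

kf = KFold(n_splits=3)               # no shuffling, so folds are contiguous blocks
for fold, (train_idx, test_idx) in enumerate(kf.split(X), start=1):
    # Printed as 1-based row numbers to match the diagram
    print(f"Fold {fold}: test rows {test_idx + 1}, train rows {train_idx + 1}")

# Output (1-based rows): Fold 1 tests on 1-4, Fold 2 on 5-7, Fold 3 on 8-10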

    Adding Some Colors

If you feel the black-and-white version is too plain, this is another acceptable option:

    A variation of the previous diagram, adding color to each fold’s number.

    While using colors in this version might seem problematic given the issues with color blindness and memory load mentioned before, it can still work as a helpful teaching tool alongside the simpler version.

    The main reason is that it doesn’t only use colors to show the information — the row numbers (1–10) and fold numbers tell you everything you need to know, with colors just being a nice extra touch.

    This means that even if someone can’t see the colors properly or prints it in black and white, they can still understand everything through the numbers. And while having to remember what each color means can make things harder to learn, in this case you don’t have to remember the colors — they’re just there as an extra help for people who find them useful, but you can totally understand the diagram without them.

    Just like the previous version, the row numbers also help by showing exactly how the data is being split up, making it easier to understand how cross-validation works in practice whether you pay attention to the colors or not.

    The visualization remains fully functional and understandable even if you ignore the colors completely.

Try the challenge above. With a limited number of colors, it’s easier to track how the positions change.

    Why This Works Better: From Design to Data

Let’s look at why our new design makes sense not just from a UX view, but also from a data science perspective.

    Matching Mental Models: Think about how you explain cross-validation to someone. You probably say “we take these rows for testing, then these rows, then these rows.” Our visualization now matches exactly how we think and talk about the process. We’re not just making it pretty, we’re making it match reality.

    Data Structure Clarity: By showing data as columns with indices, we’re revealing the actual structure of our dataset. Each row has a number, each number appears in exactly one test set. This isn’t just good design, it’s accurate to how our data is organized in code.

Even with shuffling, which is common practice in cross-validation, we can simply change the indices so people understand that the data has been shuffled.

    Focus on What Matters: Our old way of showing cross-validation had us thinking about moving parts. But that’s not what matters in cross-validation. What matters is:

    • Which rows are we testing on?
    • Are we using all our data?
    • Is each row used for testing exactly once?

    Our new design answers these questions at a glance.

    Index-Based Understanding: Instead of abstract colored boxes, we’re showing actual row indices. When you write cross-validation code, you’re working with these indices. Now the visualization matches your code — Fold 1 uses rows 1–4, Fold 2 uses 5–7, and so on.

Using a similar diagram, we can also show how leave-one-out cross-validation works. Only one data point is used in the test set! The split numbering and the chosen index for the test set are also nicely matched.

    Clear Data Flow: The layout shows data flowing from left to right: here’s your dataset, here’s how it’s split, here’s what each split looks like. It matches the logical steps of cross-validation and it’s also easier to look at.

Using arrows to denote the train and test process makes it clearer how many models are trained and what the outputs of the cross-validation are. Note that there is no arrow connecting elements between splits.

    Conclusion: When Visualization Matches Your Code

    Here’s what we’ve learned about the whole redrawing of the cross-validation diagram:

    Match Your Code, Not Conventions: We usually stick to traditional ways of showing things just because that’s how everyone does it. But cross-validation is really about selecting different rows of data for testing, so why not show exactly that? When your visualization matches your code, understanding follows naturally.

Data Structure Matters: By showing indices and actual data splits, we’re revealing how cross-validation really works while also painting a clearer picture. Each row has its place, each split has its purpose, and you can trace exactly what’s happening in each step.

Simplicity Has Its Purpose: It turns out that showing less can actually explain more. By focusing on the essential parts — which rows are being used for testing, and when — we’re not just simplifying the visualization; we’re also highlighting what actually matters in cross-validation.

    Looking ahead, this thinking can apply to many data science concepts. Before making another visualization, ask yourself:

    • Does this show what’s actually happening in the code?
    • Can someone trace the data flow?
    • Are we showing structure, or just following tradition?

    Good visualization isn’t about following rules — it’s about showing truth. And sometimes, the clearest truth is also the simplest.

    About the Illustrations

    Unless otherwise noted, all images are created by the author, incorporating licensed design elements from Canva Pro.



  • From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC – Part 2

    From RAG to fabric: Lessons learned from building real-world RAGs at GenAIIC – Part 2

    Aude Genevay

This post focuses on doing RAG on heterogeneous data formats. We first introduce routers and how they can help manage diverse data sources. We then give tips on how to handle tabular data and conclude with multimodal RAG, focusing specifically on solutions that handle both text and image data.


  • Cohere Embed multimodal embeddings model is now available on Amazon SageMaker JumpStart

    Cohere Embed multimodal embeddings model is now available on Amazon SageMaker JumpStart

    Breanne Warner

    The Cohere Embed multimodal embeddings model is now generally available on Amazon SageMaker JumpStart. This model is the newest Cohere Embed 3 model, which is now multimodal and capable of generating embeddings from both text and images, enabling enterprises to unlock real value from their vast amounts of data that exist in image form. In this post, we discuss the benefits and capabilities of this new model with some examples.
