Video chaptering is the task of segmenting a video into distinct chapters. Besides its use as a navigation aid as seen with YouTube chapters, it is also core to a series of downstream applications ranging from information retrieval (e.g., RAG semantic chunking), to referencing or summarization.
In a recent project, I needed to automate this task and was surprised by the limited options available, especially in the open-source domain. While some professional tools or paid APIs offer such services, I couldn't find any library or tutorial that provided a sufficiently robust and accurate solution. If you know of any, please share them in the comments!
And in case you wonder why not simply copy and paste the transcript into a large language model (LLM) and ask for chapter headings, this won’t be effective for two reasons. First, LLMs cannot consistently preserve timestamp information to link them back to chapter titles. Second, LLMs often overlook important sections when dealing with long transcripts.
I therefore ended up designing a custom workflow that relies on LLMs for the different language processing subtasks (text formatting, paragraph structuring, chapter segmentation and title generation), and on TF-IDF statistics to add timestamps back after the paragraph structuring.
The combination of LLMs and TF-IDF makes it possible to efficiently edit and structure a raw transcript while preserving timestamps
The resulting workflow turns out to work pretty well, often generating chapters that replicate or improve on YouTube's suggested ones. The tool additionally makes it possible to turn poorly formatted transcripts into well-structured documents, as illustrated above and on this HuggingFace space.
This blog post aims to explain its main steps, which are outlined in the diagram below:
Proposed workflow for video chaptering, from raw transcript retrieval to structured markdown and Gradio app.
The key steps of the workflow are structuring the transcript into paragraphs (step 2) and then grouping the paragraphs into chapters from which a table of contents is derived (step 4). Note that these two steps may rely on different LLMs: a fast and cheap LLM such as Llama 3 8B for the simple task of text editing and paragraph identification, and a more sophisticated LLM such as GPT-4o-mini for the generation of the table of contents. In between, TF-IDF is used to add timestamp information back to the structured paragraphs.
The rest of the post describes each step in more detail.
1) Retrieve the transcript

Screenshot of the course YouTube page. Course material is under an MIT licence.
Note that chapters are already provided in the video description.
Chaptering made available in the YouTube description
This provides us with a baseline to qualitatively compare our chaptering later in this post.
YouTube transcript API
For YouTube videos, an automatically generated transcript is usually made available by YouTube. A convenient way to retrieve it is by calling the get_transcript method of the Python youtube_transcript_api library. The method takes the YouTube video_id as argument:
# https://www.youtube.com/watch?v=ErnWZxJovaM
video_id = "ErnWZxJovaM"  # MIT Introduction to Deep Learning - 2024

# Retrieve the transcript with the youtube_transcript_api library
from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
This returns the transcript as a list of text and timestamp key-value pairs:
[{'text': '[Music]', 'start': 1.17}, {'text': 'good afternoon everyone and welcome to', 'start': 10.28}, {'text': 'MIT sus1 191 my name is Alexander amini', 'start': 12.88}, {'text': "and I'll be one of your instructors for", 'start': 16.84}, ...]
The transcript is however poorly formatted: it lacks punctuation and contains typos ('MIT sus1 191' instead of 'MIT 6.S191', or 'amini' instead of 'Amini').
Speech-to-text with Whisper
Alternatively, a speech-to-text library can be used to infer the transcript from a video or audio file. We recommend using faster-whisper, which is a fast implementation of the state-of-the-art open-source Whisper model.

The models come in different sizes. The most accurate is 'large-v3', which is able to transcribe about 15 minutes of audio per minute on a T4 GPU (available for free on Google Colab).
import torch
from faster_whisper import WhisperModel

# Load the Whisper model
whisper_model = WhisperModel(
    "large-v3",
    device="cuda" if torch.cuda.is_available() else "cpu",
    compute_type="float16",
)

# Call the Whisper transcribe function on the audio file
initial_prompt = "Use punctuation, like this."
segments, transcript_info = whisper_model.transcribe(
    audio_file,
    initial_prompt=initial_prompt,
    language="en",
)
The result of the transcription is provided as segments, which can easily be converted into a list of text and timestamps, as with the youtube_transcript_api library.
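For illustration, the conversion could look like the following sketch, assuming the segments iterator returned by faster-whisper (each segment exposing start and text attributes):

# Convert faster-whisper segments into the same structure as the YouTube transcript
transcript = [{"start": round(segment.start, 2), "text": segment.text.strip()}
              for segment in segments]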
Tip: Whisper may sometimes omit punctuation. The initial_prompt argument can be used to nudge the model into adding punctuation by providing a short sentence that contains punctuation.
Below is an excerpt of the transcription of our video example with Whisper large-v3:
[{'start': 0.0, 'text': ' Good afternoon, everyone, and welcome to MIT Success 191.'}, {'start': 15.28, 'text': " My name is Alexander Amini, and I'll be one of your instructors for the course this year"}, {'start': 19.32, 'duration': 2.08, 'text': ' along with Ava.'} ...]
Note that, compared to the YouTube transcript, punctuation is now included. Some transcription errors however still remain ('MIT Success 191' instead of 'MIT 6.S191').
2) Structure the transcript in paragraphs
Once a transcript is available, the second stage consists in editing and structuring the transcript in paragraphs.
Transcript editing refers to changes made to improve readability. This involves, for example, adding punctuation if it is missing, correcting grammatical errors, removing verbal tics, etc.
The structuring in paragraphs also improves readability, and additionally serves as a preprocessing step for identifying chapters in stage 4, since chapters will be formed by grouping paragraphs together.
Paragraph editing and structuring can be carried out in a single operation using an LLM. We illustrate below the expected result of this stage:
Left: Raw transcript. Right: Edited and structured transcript.
This task does not require a very sophisticated LLM since it mostly consists in reformulating content. At the time of writing this article, decent results could be obtained with, for example, GPT-4o-mini or Llama 3 8B, and the following system prompt:
You are a helpful assistant.
Your task is to improve the user input's readability: add punctuation if needed and remove verbal tics, and structure the text in paragraphs separated with '\n\n'.
Keep the wording as faithful as possible to the original text.
Put your answer within <answer></answer> tags.
We rely on an OpenAI-compatible chat completion API for LLM calling, with messages having the roles of either 'system', 'user' or 'assistant'. The code below illustrates the instantiation of an LLM client with Groq, using Llama 3 8B:
from groq import Groq

# Connect to Groq with a Groq API key
llm_client = Groq(api_key=api_key)
model = "llama3-8b-8192"

# Extract text from transcript
transcript_text = ' '.join([s['text'] for s in transcript])
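The chat completion call itself is not shown above. Below is a minimal sketch of what it could look like, assuming the system prompt shown earlier is stored in a system_prompt variable (in practice, the transcript is first split into chunks, as discussed further down):

# Call the LLM with the system prompt and the transcript text (simplified sketch)
chat_completion = llm_client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript_text},
    ],
)
response_content = chat_completion.choices[0].message.content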
print(response_content)
"""
<answer>
Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course.
This is a fast-paced and intense one-week course that we're about to go through together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more.
Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and each year, this lecture is getting harder and harder to teach because it's supposed to cover the foundations of the field.
</answer>
"""
Let us then extract the edited text from the <answer> tags, divide it into paragraphs, and structure the result as a list of dictionaries, each containing a paragraph number and a piece of text:
import re

# Extract the edited text within the <answer></answer> tags
pattern = re.compile(r'<answer>(.*?)</answer>', re.DOTALL)
response_content_edited = pattern.findall(response_content)[0]

# Split into paragraphs and store them as a list of dictionaries
paragraphs = response_content_edited.strip().split('\n\n')
paragraphs_dict = [{'paragraph_number': i, 'paragraph_text': paragraph}
                   for i, paragraph in enumerate(paragraphs)]

print(paragraphs_dict)
[{'paragraph_number': 0, 'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course."}, {'paragraph_number': 1, 'paragraph_text': "This is a fast-paced and intense one-week course that we're about to go through together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more."}, {'paragraph_number': 2, 'paragraph_text': "Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and each year, this lecture is getting harder and harder to teach because it's supposed to cover the foundations of the field."}]
Note that the input should not be too long, as the LLM may otherwise 'forget' part of the text. For long inputs, the transcript must be split into chunks to improve reliability. We noticed that GPT-4o-mini handles inputs of up to about 5,000 characters well, while Llama 3 8B can only handle up to about 1,500 characters. The notebook provides the function transcript_to_paragraphs, which takes care of splitting the transcript into chunks.
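As a rough, hypothetical illustration of the chunking idea (the actual notebook helper may differ), chunks can be built by accumulating transcript segments until a character budget is reached:

# Hypothetical sketch: split the transcript into chunks below a character budget,
# cutting only at segment boundaries
def split_transcript_in_chunks(transcript, max_chars=5000):
    chunks, current = [], ""
    for segment in transcript:
        if current and len(current) + len(segment['text']) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += segment['text'] + " "
    if current:
        chunks.append(current.strip())
    return chunks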
3) Infer paragraph timestamps with TF-IDF
The transcript is now structured as a list of edited paragraphs, but the timestamps have been lost in the process.
The third stage consists in adding back timestamps, by inferring which segment in the raw transcript is the closest to each paragraph.
TF-IDF is used to find which raw transcript segment (right) best matches the beginning of each edited paragraph (left).
We rely for this task on the TF-IDF metric. TF-IDF stands for term frequency–inverse document frequency; combined with cosine similarity, it provides a measure of how similar two pieces of text are. The measure works by counting shared words, giving more weight to words that appear less frequently across the corpus.
As a preprocessing step, we adjust the transcript segments and paragraph beginnings so that they contain the same number of words. The text pieces should be long enough so that paragraph beginnings can be successfully matched to a unique transcript segment. We find that using 50 words works well in practice.
num_words = 50

transcript_num_words = transform_text_segments(transcript, num_words=num_words)
paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']}
                         for p in paragraphs]
paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)
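The transform_text_segments helper used above is not defined in this post. As a rough, hypothetical sketch of what it needs to do: extend each item's text with the text of the following items and truncate the result to num_words words, so that short transcript segments and long paragraphs end up with comparable lengths:

# Hypothetical sketch of a transform_text_segments-like helper
def transform_text_segments(segments, num_words=50):
    # Words of each segment, in order
    words_per_segment = [segment["text"].split() for segment in segments]
    transformed = []
    for i in range(len(segments)):
        # Words of segment i followed by the words of the next segments,
        # truncated to num_words words
        following_words = [w for ws in words_per_segment[i:] for w in ws]
        transformed.append(" ".join(following_words[:num_words]))
    return transformed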
We then rely on the sklearn library and its TfidfVectorizer and cosine_similarity functions to run TF-IDF and compute similarities between each paragraph beginning and transcript segment. Below is an example of code for finding the best-matching transcript segment index for the first paragraph.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Paragraph for which to find the timestamp
paragraph_i = 0

# Compute the TF-IDF vectors for the transcript segments and paragraph beginnings
vectorizer = TfidfVectorizer().fit_transform(transcript_num_words + paragraphs_num_words)
vectors = vectorizer.toarray()

# Extract the TF-IDF vector for the paragraph
paragraph_vector = vectors[len(transcript_num_words) + paragraph_i]

# Calculate the cosine similarity between the paragraph vector and each transcript segment
similarities = cosine_similarity(vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1))

# Find the index of the most similar segment
best_match_index = int(np.argmax(similarities))
We wrapped the process in an add_timestamps_to_paragraphs function, which adds timestamps to the paragraphs, together with the matched segment index and text:
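A minimal sketch of such a function, reusing the TF-IDF matching shown above for every paragraph (the actual implementation may differ):

# Hypothetical sketch of an add_timestamps_to_paragraphs-like helper
def add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50):
    transcript_num_words = transform_text_segments(transcript, num_words=num_words)
    paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']}
                             for p in paragraphs]
    paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)

    # TF-IDF vectors for all transcript segments and paragraph beginnings
    vectors = TfidfVectorizer().fit_transform(
        transcript_num_words + paragraphs_num_words
    ).toarray()

    for i, paragraph in enumerate(paragraphs):
        paragraph_vector = vectors[len(transcript_num_words) + i]
        similarities = cosine_similarity(
            vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1)
        )
        best_match_index = int(np.argmax(similarities))
        paragraph['matched_index'] = best_match_index
        paragraph['matched_text'] = transcript[best_match_index]['text']
        paragraph['start_time'] = int(transcript[best_match_index]['start'])
    return paragraphs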
# Example of output for the first paragraph
print(paragraphs[0])
{'paragraph_number': 0, 'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course.", 'matched_index': 1, 'matched_text': 'good afternoon everyone and welcome to', 'start_time': 10}
In the example above, the first paragraph (numbered 0) is found to match the transcript segment number 1 that starts at time 10 (in seconds).
4) Generate the table of contents
The table of contents is then obtained by grouping consecutive paragraphs into chapters and identifying meaningful chapter titles. The task is mostly carried out by an LLM, which is instructed to transform an input consisting of a list of JSON paragraphs into an output consisting of a list of JSON chapter titles with their starting paragraph numbers:
system_prompt_paragraphs_to_toc = """
You are a helpful assistant.
You are given a transcript of a course in JSON format as a list of paragraphs, each containing 'paragraph_number' and 'paragraph_text' keys.
Your task is to group consecutive paragraphs in chapters for the course and identify meaningful chapter titles.
Here are the steps to follow:
1. Read the transcript carefully to understand its general structure and the main topics covered.
2. Look for clues that a new chapter is about to start. This could be a change of topic, a change of time or setting, the introduction of new themes or topics, or the speaker's explicit mention of a new part.
3. For each chapter, keep track of the paragraph number that starts the chapter and identify a meaningful chapter title.
4. Chapters should ideally be equally spaced throughout the transcript, and discuss a specific topic.
Format your result in JSON, with a list of dictionaries for chapters, with 'start_paragraph_number': integer and 'title': string as key: value.
"""
An important element is to specifically ask for a JSON output, which increases the chances of getting a correctly formatted JSON answer that can later be loaded back into Python.
GPT-4o-mini is used for this task, as it is more cost-effective than OpenAI’s GPT-4o and generally provides good results. The instructions are provided through the ‘system’ role, and paragraphs are provided in JSON format through the ‘user’ role.
import json
from openai import OpenAI

# Connect to OpenAI with an OpenAI API key
llm_client_get_toc = OpenAI(api_key=api_key)
model_get_toc = "gpt-4o-mini-2024-07-18"

# Dump the JSON paragraphs as text
paragraphs_number_text = [{'paragraph_number': p['paragraph_number'],
                           'paragraph_text': p['paragraph_text']} for p in paragraphs]
paragraphs_json_dump = json.dumps(paragraphs_number_text)
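The call itself can then be made through the chat completions endpoint. Below is a minimal sketch; using response_format={"type": "json_object"} (OpenAI's JSON mode) is an assumption on our side that matches the instruction to return JSON:

# Ask GPT-4o-mini to group paragraphs into chapters, enforcing a JSON answer
response = llm_client_get_toc.chat.completions.create(
    model=model_get_toc,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_prompt_paragraphs_to_toc},
        {"role": "user", "content": paragraphs_json_dump},
    ],
)
response = response.choices[0].message.content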
Et voilà! The call returns the list of chapter titles together with the starting paragraph number in JSON format:
print(response)
{ "chapters": [ { "start_paragraph_number": 0, "title": "Introduction to the Course" }, { "start_paragraph_number": 17, "title": "Foundations of Intelligence and Deep Learning" }, { "start_paragraph_number": 24, "title": "Course Structure and Expectations" } .... ] }
As in stage 2, the LLM may struggle with long inputs and overlook part of the input. The solution again consists in splitting the input into chunks, which is implemented in the notebook with the paragraphs_to_toc function and its chunk_size parameter.
5) Output structured chapters
This last stage combines the paragraphs and the table of contents to create a structured JSON file with chapters, an example of which is provided in the accompanying GitHub repository.
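A minimal sketch of this combination step, assuming the table of contents from stage 4 has been parsed from the JSON response (the actual notebook code may differ):

import json

# Parse the table of contents returned in stage 4
toc = json.loads(response)["chapters"]

# Group paragraphs into chapters according to the starting paragraph numbers
chapters = []
for i, entry in enumerate(toc):
    start = entry["start_paragraph_number"]
    end = toc[i + 1]["start_paragraph_number"] if i + 1 < len(toc) else len(paragraphs)
    chapter_paragraphs = paragraphs[start:end]
    chapters.append({
        "title": entry["title"],
        "start_time": chapter_paragraphs[0]["start_time"],  # timestamp from stage 3
        "paragraphs": [p["paragraph_text"] for p in chapter_paragraphs],
    })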
We illustrate below the resulting chaptering (right), compared to the baseline chaptering that was available from the YouTube description (left):
Side-by-side comparison of baseline chaptering from YouTube (left) and ours (right)
The comparison is mostly qualitative as there is no 'ground truth'. Overall, the approach described in this post identified similar chapters but provides a slightly more refined segmentation of the video. A manual check of both chapterings also revealed that the baseline chaptering is off for the course information part, which actually starts at 9:37 and not at 7:25.
A couple of other examples of chaptering are given on this HuggingFace space. The whole workflow is bundled as a Gradio app at the end of the accompanying notebook, making it easier to test it on your own videos.
Gradio app that bundles the different steps and outputs a well-structured document from a raw transcript