Video chaptering is the task of segmenting a video into distinct chapters. Besides its use as a navigation aid as seen with YouTube chapters, it is also core to a series of downstream applications ranging from information retrieval (e.g., RAG semantic chunking), to referencing or summarization.
In a recent project, I needed to automate this task and was surprised by the limited options available, especially in the open-source domain. While some professional tools or paid APIs offer such services, I couldn't find any library or tutorial that provided a sufficiently robust and accurate solution. If you know of any, please share them in the comments!
And in case you wonder why not simply copy and paste the transcript into a large language model (LLM) and ask for chapter headings, this won’t be effective for two reasons. First, LLMs cannot consistently preserve timestamp information to link them back to chapter titles. Second, LLMs often overlook important sections when dealing with long transcripts.
I therefore ended up designing a custom workflow that relies on LLMs for the different language processing subtasks (text formatting, paragraph structuring, chapter segmentation and title generation), and on TF-IDF statistics to add timestamps back after the paragraph structuring.
The combination of LLMs and TF-IDF makes it possible to efficiently edit and structure a raw transcript while preserving timestamps
The resulting workflow turns out to work pretty well, often generating chapters that replicate or improve on YouTube's suggested ones. The tool additionally makes it possible to turn poorly formatted transcripts into well-structured documents, as illustrated above and on this HuggingFace space.
This blog post aims to explain its main steps, which are outlined in the diagram below:
Proposed workflow for video chaptering, from raw transcript retrieval to structured markdown and Gradio app.
The key steps of the workflow are structuring the transcript into paragraphs (step 2) and then grouping the paragraphs into chapters from which a table of contents is derived (step 4). Note that these two steps may rely on different LLMs: a fast and cheap LLM such as Llama 3 8B for the simple task of text editing and paragraph identification, and a more sophisticated LLM such as GPT-4o-mini for the generation of the table of contents. In between, TF-IDF is used to add timestamp information back to the structured paragraphs.
The rest of the post describes each step in more detail.
1) Retrieve the transcript

Screenshot of the course YouTube page. Course material is under an MIT licence.
Note that chapters are already provided in the video description.
Chaptering made available in the YouTube description
This provides us with a baseline to qualitatively compare our chaptering later in this post.
YouTube transcript API
For YouTube videos, an automatically generated transcript is usually made available by YouTube. A convenient way to retrieve it is by calling the get_transcript method of the Python youtube_transcript_api library. The method takes the YouTube video_id as argument:
# https://www.youtube.com/watch?v=ErnWZxJovaM
video_id = "ErnWZxJovaM"  # MIT Introduction to Deep Learning - 2024

# Retrieve the transcript with the youtube_transcript_api library
from youtube_transcript_api import YouTubeTranscriptApi

transcript = YouTubeTranscriptApi.get_transcript(video_id, languages=["en"])
This returns the transcript as a list of text and timestamp key-value pairs:
[{'text': '[Music]', 'start': 1.17}, {'text': 'good afternoon everyone and welcome to', 'start': 10.28}, {'text': 'MIT sus1 191 my name is Alexander amini', 'start': 12.88}, {'text': "and I'll be one of your instructors for", 'start': 16.84}, ...]
The transcript is however poorly formatted: it lacks punctuation and contains typos ('MIT sus1 191' instead of 'MIT 6.S191', or 'amini' instead of 'Amini').
Speech-to-text with Whisper
Alternatively, a speech-to-text library can be used to infer the transcript from a video or audio file. We recommend using faster-whisper, which is a fast implementation of the state-of-the-art open-source Whisper model.

The models come in different sizes. The most accurate is 'large-v3', which is able to transcribe about 15 minutes of audio per minute on a T4 GPU (available for free on Google Colab).
import torch
from faster_whisper import WhisperModel

# Load the Whisper model
whisper_model = WhisperModel(
    "large-v3",
    device="cuda" if torch.cuda.is_available() else "cpu",
    compute_type="float16",
)

# Call the Whisper transcribe function on the audio file
initial_prompt = "Use punctuation, like this."
segments, transcript_info = whisper_model.transcribe(
    audio_file,
    initial_prompt=initial_prompt,
    language="en",
)
The result of the transcription is provided as segments, which can easily be converted into a list of text and timestamps, as with the youtube_transcript_api library.
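For illustration, the conversion could look like the following sketch, assuming the segments iterator returned by faster-whisper (each segment exposing start and text attributes):

# Convert faster-whisper segments into the same structure as the YouTube transcript
transcript = [{"start": round(segment.start, 2), "text": segment.text.strip()}
              for segment in segments]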
Tip: Whisper may sometimes omit punctuation. The initial_prompt argument can be used to nudge the model into adding punctuation by providing a short sentence that contains punctuation.
Below is an excerpt of the transcription of our video example with Whisper large-v3:
[{'start': 0.0, 'text': ' Good afternoon, everyone, and welcome to MIT Success 191.'}, {'start': 15.28, 'text': " My name is Alexander Amini, and I'll be one of your instructors for the course this year"}, {'start': 19.32, 'duration': 2.08, 'text': ' along with Ava.'} ...]
Note that, compared to the YouTube transcript, punctuation is now included. Some transcription errors however still remain ('MIT Success 191' instead of 'MIT 6.S191').
2) Structure the transcript in paragraphs
Once a transcript is available, the second stage consists in editing and structuring the transcript in paragraphs.
Transcript editing refers to changes made to improve readability. This involves, for example, adding punctuation if it is missing, correcting grammatical errors, removing verbal tics, etc.
The structuring in paragraphs also improves readability, and additionally serves as a preprocessing step for identifying chapters in stage 4, since chapters will be formed by grouping paragraphs together.
Paragraph editing and structuring can be carried out in a single operation using an LLM. We illustrate below the expected result of this stage:
Left: Raw transcript. Right: Edited and structured transcript.
This task does not require a very sophisticated LLM since it mostly consists in reformulating content. At the time of writing this article, decent results could be obtained with, for example, GPT-4o-mini or Llama 3 8B, and the following system prompt:
You are a helpful assistant.
Your task is to improve the user input's readability: add punctuation if needed and remove verbal tics, and structure the text in paragraphs separated with '\n\n'.
Keep the wording as faithful as possible to the original text.
Put your answer within <answer></answer> tags.
We rely on an OpenAI-compatible chat completion API for LLM calling, with messages having the roles of either 'system', 'user' or 'assistant'. The code below illustrates the instantiation of an LLM client with Groq, using Llama 3 8B:
from groq import Groq

# Connect to Groq with a Groq API key
llm_client = Groq(api_key=api_key)
model = "llama3-8b-8192"

# Extract text from transcript
transcript_text = ' '.join([s['text'] for s in transcript])
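The chat completion call itself is not shown above. Below is a minimal sketch of what it could look like, assuming the system prompt shown earlier is stored in a system_prompt variable (in practice, the transcript is first split into chunks, as discussed further down):

# Call the LLM with the system prompt and the transcript text (simplified sketch)
chat_completion = llm_client.chat.completions.create(
    model=model,
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": transcript_text},
    ],
)
response_content = chat_completion.choices[0].message.content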
print(response_content)
"""
<answer>
Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course.
This is a fast-paced and intense one-week course that we're about to go through together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more.
Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and each year, this lecture is getting harder and harder to teach because it's supposed to cover the foundations of the field.
</answer>
"""
Let us then extract the edited text from the <answer> tags, divide it into paragraphs, and structure the result as a list of dictionaries, each containing a paragraph number and a piece of text:
import re

# Extract the edited text within the <answer></answer> tags
pattern = re.compile(r'<answer>(.*?)</answer>', re.DOTALL)
response_content_edited = pattern.findall(response_content)[0]

# Split into paragraphs and store them as a list of dictionaries
paragraphs = response_content_edited.strip().split('\n\n')
paragraphs_dict = [{'paragraph_number': i, 'paragraph_text': paragraph}
                   for i, paragraph in enumerate(paragraphs)]

print(paragraphs_dict)
[{'paragraph_number': 0, 'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course."}, {'paragraph_number': 1, 'paragraph_text': "This is a fast-paced and intense one-week course that we're about to go through together. We'll be covering the foundations of a rapidly changing field, and a field that has been revolutionizing many areas of science, mathematics, physics, and more."}, {'paragraph_number': 2, 'paragraph_text': "Over the past decade, AI and deep learning have been rapidly advancing and solving problems that we didn't think were solvable in our lifetimes. Today, AI is solving problems beyond human performance, and each year, this lecture is getting harder and harder to teach because it's supposed to cover the foundations of the field."}]
Note that the input should not be too long, as the LLM may otherwise 'forget' part of the text. For long inputs, the transcript must be split into chunks to improve reliability. We noticed that GPT-4o-mini handles inputs of up to about 5,000 characters well, while Llama 3 8B can only handle up to about 1,500 characters. The notebook provides the function transcript_to_paragraphs, which takes care of splitting the transcript into chunks.
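As a rough, hypothetical illustration of the chunking idea (the actual notebook helper may differ), chunks can be built by accumulating transcript segments until a character budget is reached:

# Hypothetical sketch: split the transcript into chunks below a character budget,
# cutting only at segment boundaries
def split_transcript_in_chunks(transcript, max_chars=5000):
    chunks, current = [], ""
    for segment in transcript:
        if current and len(current) + len(segment['text']) + 1 > max_chars:
            chunks.append(current.strip())
            current = ""
        current += segment['text'] + " "
    if current:
        chunks.append(current.strip())
    return chunks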
3) Infer paragraph timestamps with TF-IDF
The transcript is now structured as a list of edited paragraphs, but the timestamps have been lost in the process.
The third stage consists in adding back timestamps, by inferring which segment in the raw transcript is the closest to each paragraph.
TF-IDF is used to find which raw transcript segment (right) best matches the beginning of each edited paragraph (left).
We rely for this task on the TF-IDF metric. TF-IDF stands for term frequency–inverse document frequency; combined with cosine similarity, it provides a measure of how similar two pieces of text are. The measure works by counting shared words, giving more weight to words that appear less frequently across the corpus.
As a preprocessing step, we adjust the transcript segments and paragraph beginnings so that they contain the same number of words. The text pieces should be long enough so that paragraph beginnings can be successfully matched to a unique transcript segment. We find that using 50 words works well in practice.
num_words = 50

transcript_num_words = transform_text_segments(transcript, num_words=num_words)
paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']}
                         for p in paragraphs]
paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)
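The transform_text_segments helper used above is not defined in this post. As a rough, hypothetical sketch of what it needs to do: extend each item's text with the text of the following items and truncate the result to num_words words, so that short transcript segments and long paragraphs end up with comparable lengths:

# Hypothetical sketch of a transform_text_segments-like helper
def transform_text_segments(segments, num_words=50):
    # Words of each segment, in order
    words_per_segment = [segment["text"].split() for segment in segments]
    transformed = []
    for i in range(len(segments)):
        # Words of segment i followed by the words of the next segments,
        # truncated to num_words words
        following_words = [w for ws in words_per_segment[i:] for w in ws]
        transformed.append(" ".join(following_words[:num_words]))
    return transformed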
We then rely on the sklearn library and its TfidfVectorizer and cosine_similarity functions to run TF-IDF and compute similarities between each paragraph beginning and transcript segment. Below is an example of code for finding the best-matching transcript segment index for the first paragraph.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Paragraph for which to find the timestamp
paragraph_i = 0

# Compute the TF-IDF vectors for the transcript segments and paragraph beginnings
vectorizer = TfidfVectorizer().fit_transform(transcript_num_words + paragraphs_num_words)
vectors = vectorizer.toarray()

# Extract the TF-IDF vector for the paragraph
paragraph_vector = vectors[len(transcript_num_words) + paragraph_i]

# Calculate the cosine similarity between the paragraph vector and each transcript segment
similarities = cosine_similarity(vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1))

# Find the index of the most similar segment
best_match_index = int(np.argmax(similarities))
We wrapped the process in an add_timestamps_to_paragraphs function, which adds timestamps to the paragraphs, together with the matched segment index and text:
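A minimal sketch of such a function, reusing the TF-IDF matching shown above for every paragraph (the actual implementation may differ):

# Hypothetical sketch of an add_timestamps_to_paragraphs-like helper
def add_timestamps_to_paragraphs(transcript, paragraphs, num_words=50):
    transcript_num_words = transform_text_segments(transcript, num_words=num_words)
    paragraphs_start_text = [{"start": p['paragraph_number'], "text": p['paragraph_text']}
                             for p in paragraphs]
    paragraphs_num_words = transform_text_segments(paragraphs_start_text, num_words=num_words)

    # TF-IDF vectors for all transcript segments and paragraph beginnings
    vectors = TfidfVectorizer().fit_transform(
        transcript_num_words + paragraphs_num_words
    ).toarray()

    for i, paragraph in enumerate(paragraphs):
        paragraph_vector = vectors[len(transcript_num_words) + i]
        similarities = cosine_similarity(
            vectors[:len(transcript_num_words)], paragraph_vector.reshape(1, -1)
        )
        best_match_index = int(np.argmax(similarities))
        paragraph['matched_index'] = best_match_index
        paragraph['matched_text'] = transcript[best_match_index]['text']
        paragraph['start_time'] = int(transcript[best_match_index]['start'])
    return paragraphs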
# Example of output for the first paragraph
print(paragraphs[0])
{'paragraph_number': 0, 'paragraph_text': "Good afternoon, everyone, and welcome to MIT 6.S191. My name is Alexander Amini, and I'll be one of your instructors for the course this year, along with Ava. We're really excited to welcome you to this incredible course.", 'matched_index': 1, 'matched_text': 'good afternoon everyone and welcome to', 'start_time': 10}
In the example above, the first paragraph (numbered 0) is found to match the transcript segment number 1 that starts at time 10 (in seconds).
4) Generate the table of contents
The table of contents is then obtained by grouping consecutive paragraphs into chapters and identifying meaningful chapter titles. The task is mostly carried out by an LLM, which is instructed to transform an input consisting of a list of JSON paragraphs into an output consisting of a list of JSON chapter titles with their starting paragraph numbers:
system_prompt_paragraphs_to_toc = """
You are a helpful assistant.
You are given a transcript of a course in JSON format as a list of paragraphs, each containing 'paragraph_number' and 'paragraph_text' keys.
Your task is to group consecutive paragraphs in chapters for the course and identify meaningful chapter titles.
Here are the steps to follow:
1. Read the transcript carefully to understand its general structure and the main topics covered.
2. Look for clues that a new chapter is about to start. This could be a change of topic, a change of time or setting, the introduction of new themes or topics, or the speaker's explicit mention of a new part.
3. For each chapter, keep track of the paragraph number that starts the chapter and identify a meaningful chapter title.
4. Chapters should ideally be equally spaced throughout the transcript, and discuss a specific topic.
Format your result in JSON, with a list of dictionaries for chapters, with 'start_paragraph_number': integer and 'title': string as key: value.
"""
An important element is to specifically ask for a JSON output, which increases the chances of getting a correctly formatted JSON answer that can later be loaded back into Python.
GPT-4o-mini is used for this task, as it is more cost-effective than OpenAI’s GPT-4o and generally provides good results. The instructions are provided through the ‘system’ role, and paragraphs are provided in JSON format through the ‘user’ role.
import json
from openai import OpenAI

# Connect to OpenAI with an OpenAI API key
llm_client_get_toc = OpenAI(api_key=api_key)
model_get_toc = "gpt-4o-mini-2024-07-18"

# Dump the JSON paragraphs as text
paragraphs_number_text = [{'paragraph_number': p['paragraph_number'],
                           'paragraph_text': p['paragraph_text']} for p in paragraphs]
paragraphs_json_dump = json.dumps(paragraphs_number_text)
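The call itself can then be made through the chat completions endpoint. Below is a minimal sketch; using response_format={"type": "json_object"} (OpenAI's JSON mode) is an assumption on our side that matches the instruction to return JSON:

# Ask GPT-4o-mini to group paragraphs into chapters, enforcing a JSON answer
response = llm_client_get_toc.chat.completions.create(
    model=model_get_toc,
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": system_prompt_paragraphs_to_toc},
        {"role": "user", "content": paragraphs_json_dump},
    ],
)
response = response.choices[0].message.content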
Et voilà! The call returns the list of chapter titles together with the starting paragraph number in JSON format:
print(response)
{ "chapters": [ { "start_paragraph_number": 0, "title": "Introduction to the Course" }, { "start_paragraph_number": 17, "title": "Foundations of Intelligence and Deep Learning" }, { "start_paragraph_number": 24, "title": "Course Structure and Expectations" } .... ] }
As in stage 2, the LLM may struggle with long inputs and overlook part of the input. The solution again consists in splitting the input into chunks, which is implemented in the notebook with the paragraphs_to_toc function and its chunk_size parameter.
5) Output structured chapters
This last stage combines the paragraphs and the table of contents to create a structured JSON file with chapters, an example of which is provided in the accompanying GitHub repository.
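A minimal sketch of this combination step, assuming the table of contents from stage 4 has been parsed from the JSON response (the actual notebook code may differ):

import json

# Parse the table of contents returned in stage 4
toc = json.loads(response)["chapters"]

# Group paragraphs into chapters according to the starting paragraph numbers
chapters = []
for i, entry in enumerate(toc):
    start = entry["start_paragraph_number"]
    end = toc[i + 1]["start_paragraph_number"] if i + 1 < len(toc) else len(paragraphs)
    chapter_paragraphs = paragraphs[start:end]
    chapters.append({
        "title": entry["title"],
        "start_time": chapter_paragraphs[0]["start_time"],  # timestamp from stage 3
        "paragraphs": [p["paragraph_text"] for p in chapter_paragraphs],
    })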
We illustrate below the resulting chaptering (right), compared to the baseline chaptering that was available from the YouTube description (left):
Side-by-side comparison of baseline chaptering from YouTube (left) and ours (right)
The comparison is mostly qualitative as there is no 'ground truth'. Overall, the approach described in this post identified similar chapters but provides a slightly more refined segmentation of the video. A manual check of both chapterings also revealed that the baseline chaptering is off for the course information part, which actually starts at 9:37 and not at 7:25.
A couple of other examples of chaptering are given on this HuggingFace space. The whole workflow is bundled as a Gradio app at the end of the accompanying notebook, making it easier to test it on your own videos.
Gradio app that bundles the different steps and outputs a well-structured document from a raw transcript