Everything you need to know about how context windows affect Transformer training and usage
The context window is the maximum sequence length that a transformer can process at a time. With the rise of proprietary LLMs that cap the number of tokens, and therefore the prompt size, as well as growing interest in techniques such as Retrieval Augmented Generation (RAG), understanding the key ideas around context windows and their implications is becoming increasingly important; the context window is one of the specifications most often cited when comparing models.
The transformer architecture is a powerful tool for natural language processing, but it has some limitations when it comes to handling long sequences of text. In this article, we will explore how different factors affect the maximum context length that a transformer model can process, and whether bigger is always better when choosing a model for your task.
At the time of writing, models such as the Llama-2 variants have a context length of 4k tokens, GPT-4 Turbo has 128k, and Claude 2.1 has 200k! From the number of tokens alone, it can be difficult to envisage how this translates into words; whilst it depends on the tokenizer used, a good rule of thumb is that 100k tokens is approximately 75,000 words. To put that in perspective, we can compare this to some popular literature:
The Lord of the Rings (J. R. R. Tolkien): 564,187 words, 752k tokens
Dracula (Bram Stoker): 165,453 words, 220k tokens
Grimms’ Fairy Tales (Jacob Grimm and Wilhelm Grimm): 104,228 words, 139k tokens
Harry Potter and the Philosopher’s Stone (J. K. Rowling): 77,423 words, 103k tokens
Treasure Island (Robert Louis Stevenson): 72,036 words, 96k tokens
The War of the Worlds (H. G. Wells): 63,194 words, 84k tokens
The Hound of the Baskervilles (Arthur Conan Doyle): 62,297 words, 83k tokens
The Jungle Book (Rudyard Kipling): 54,178 words, 72k tokens
To summarise, 100k tokens is roughly equivalent to a short novel, whereas at 200k we can almost fit the entirety of Dracula, a medium-sized volume! To ingest a large volume such as The Lord of the Rings, we would need six requests to GPT-4 Turbo but only four calls to Claude 2.1!
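To get a feel for these numbers on your own text, you can count tokens directly. Below is a minimal sketch using the open-source tiktoken library with its cl100k_base encoding (used by recent OpenAI models); note that Llama-2 and Claude use different tokenizers, so the exact counts will differ, and the file path here is just a placeholder.

```python
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Count tokens using one of OpenAI's open-source tokenizers."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

# "dracula.txt" is a placeholder for any local copy of the text,
# e.g. downloaded from Project Gutenberg.
with open("dracula.txt", encoding="utf-8") as f:
    book = f.read()

n_tokens = count_tokens(book)
print(f"{len(book.split()):,} words -> {n_tokens:,} tokens")
print(f"Fits in a 200k context window: {n_tokens <= 200_000}")
```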
Image generated by the Azure OpenAI Service DALL-E model with the following prompt: “A machine taking a bite out of a book. Hyper-realistic, HQ”
What determines the size of the context window?
At this point, you may be wondering why some models have larger context windows than others.
To understand this, let’s first review how the attention mechanism works in the figure below; if you aren’t familiar with attention, it is covered in detail in my previous article. Recently, there have been several improvements and variants which aim to make this mechanism more efficient, but the key challenges remain the same. Here, we will focus on the original scaled dot-product attention.
The attention mechanism for decoder-only transformer models. The approach of positionally encoding only the Q and K inputs, applied inside each transformer layer, follows modern architectures such as Llama-2.
From the figure above, we can see that the size of the matrix containing our attention scores is determined by the length of the sequences passed into the model and can grow arbitrarily large! In other words, the context window is not determined by the architecture, but rather by the length of the sequences that the model is given during training.
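To make this concrete, here is a minimal PyTorch sketch of causal scaled dot-product attention for a single head; the important detail is that the intermediate score matrix has shape (seq_len, seq_len), so its size is set entirely by the input length rather than by any learned weights.

```python
import math
import torch

def scaled_dot_product_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) for a single head
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])   # (seq_len, seq_len)
    # causal mask: each position may only attend to itself and earlier positions
    seq_len = scores.shape[-1]
    mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)                     # (seq_len, seq_len)
    return weights @ v                                          # (seq_len, head_dim)

q = k = v = torch.randn(4096, 128)
out = scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([4096, 128]); the intermediate scores were 4096 x 4096
```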
This calculation can be incredibly expensive as, without any optimisations, the attention score matrix contains one entry for every pair of positions, so its memory footprint is quadratic in the sequence length (O(n^2)). Put simply, if the length of an input sequence doubles, the amount of memory required quadruples! Therefore, training a model on sequence lengths of 128k will require approximately 1024 times (32^2) the memory for the attention scores compared to training on sequence lengths of 4k!
It is also important to keep in mind that this operation is repeated for every layer and every head of the transformer, which results in a significant amount of computation. As the available GPU memory is also shared with the model parameters, any computed gradients, and a reasonably sized batch of input data, hardware can quickly become a bottleneck on the size of the context window when training large models.
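As a rough, back-of-the-envelope illustration, the snippet below estimates the memory needed just to hold the attention score matrices of every layer and head at once, assuming a Llama-2-7B-style configuration of 32 layers and 32 heads with fp16 values; real training setups do not materialise all of these simultaneously, and optimised kernels such as FlashAttention avoid storing the full matrix at all, so treat this purely as an upper-bound sketch of how the quadratic term grows.

```python
def attention_scores_memory_gib(seq_len: int, n_layers: int = 32,
                                n_heads: int = 32, bytes_per_value: int = 2) -> float:
    """Memory to materialise every head's (seq_len x seq_len) score matrix at once."""
    return n_layers * n_heads * seq_len * seq_len * bytes_per_value / 1024**3

for seq_len in (4_096, 32_768, 131_072):
    print(f"{seq_len:>7} tokens: {attention_scores_memory_gib(seq_len):>10,.0f} GiB")
# 4k   ->       32 GiB
# 32k  ->    2,048 GiB
# 128k ->   32,768 GiB  (1024x the 4k figure, as discussed above)
```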
Can we extend the context window of a pre-trained model?
After understanding the computational challenges of training models on longer sequence lengths, it may be tempting to train a model on short sequences, with the hope that this will generalise to longer contexts.
Image generated by the Azure OpenAI Service DALL-E model with the following prompt: “A robot holding a spanner. Hyper-realistic, HQ”
One obstacle to this is the positional encoding mechanism, used to enable transformers to capture the position of tokens in a sequence. In the original paper, two strategies for positional encoding were proposed. The first was to use learnable embeddings specific to each position in the sequence, which are clearly unable to generalise past the maximum sequence length that the model was trained on. However, the authors hypothesised that their preferred sinusoidal approach may extrapolate to longer sequences; subsequent research has demonstrated that this is not the case.
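For reference, the sinusoidal scheme from the original paper maps each position to a fixed vector of sines and cosines at geometrically spaced frequencies, so no parameters are tied to a maximum length; as noted above, though, models trained this way still fail to extrapolate to unseen positions in practice. A minimal sketch:

```python
import torch

def sinusoidal_positional_encoding(seq_len: int, dim: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/dim)); PE[pos, 2i+1] = cos(pos / 10000^(2i/dim))."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)               # (seq_len, 1)
    inv_freq = 10_000 ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)  # (dim/2,)
    angles = pos * inv_freq                                                     # (seq_len, dim/2)
    pe = torch.zeros(seq_len, dim)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(seq_len=4096, dim=128)
print(pe.shape)  # torch.Size([4096, 128])
```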
In many recent transformer models such as PaLM and Llama-2, absolute positional encodings have been replaced by relative positional encodings, such as RoPE, which aim to preserve the relative distance between tokens after encoding. Whilst these are slightly better at generalising to longer sequences than previous approaches, performance still quickly breaks down for sequence lengths significantly longer than the model has seen before.
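To illustrate the idea, here is a simplified sketch of rotary position embeddings using the ‘rotate half’ formulation found in many open-source implementations (an illustrative reduction rather than the actual Llama-2 code); each pair of features in the query and key vectors is rotated by an angle proportional to the token position, so the dot product between a query and a key depends only on their relative offset.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Rotate feature pairs of x (seq_len, dim) by position-dependent angles."""
    seq_len, dim = x.shape
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)   # (dim/2,)
    angles = torch.outer(torch.arange(seq_len, dtype=torch.float32), inv_freq)
    cos, sin = angles.cos(), angles.sin()                                  # (seq_len, dim/2)
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(8, 64), torch.randn(8, 64)
q_rot, k_rot = apply_rope(q), apply_rope(k)   # applied only to Q and K, as in the figure above
```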
Whilst there are several approaches that aim to change or remove positional encodings completely, these require fundamental changes to the transformer architecture and would require models to be retrained, which is highly expensive and time consuming. As many of the top-performing open-source models at the time of writing are derived from pretrained versions of Llama-2, there is a lot of active research into how to extend the context length of existing models which use RoPE embeddings, with varying success.
Many of these approaches employ some variation of interpolating the input positions: scaling the positional encodings so that they fit within the original context window of the model. The intuition behind this is that it should be easier for the model to fill in the gaps between words than to predict what comes after the words.
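As a toy illustration of the simplest variant, linear position interpolation, we can rescale the position indices before computing the rotary angles so that, for example, a 16k-token sequence is squeezed into the 4k range of positions the model saw during training (the numbers here are arbitrary and chosen only for illustration):

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Rotary angles for the given (possibly fractional) positions."""
    half = dim // 2
    inv_freq = base ** (-torch.arange(half, dtype=torch.float32) / half)
    return torch.outer(positions, inv_freq)   # (seq_len, dim/2)

train_ctx, extended_ctx, dim = 4_096, 16_384, 128
positions = torch.arange(extended_ctx, dtype=torch.float32)

# Linear position interpolation: rescale the positions so that the extended
# sequence still falls inside the range of angles seen during training.
scaled_positions = positions * (train_ctx / extended_ctx)

angles = rope_angles(scaled_positions, dim)
print(scaled_positions[-1])   # tensor(4095.7500), i.e. within the original 4k range
```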
One such approach, known as YaRN, was able to extend the context window of the Llama-2 7B and 13B models to 128k without a significant degradation in performance!
Whilst a definitive approach that works well in all contexts has yet to emerge, this remains an exciting area of research, with big potential implications!
Are longer context windows always better?
Now that we understand some of the practical challenges around training models on longer sequence lengths, and some potential mitigations to overcome this, we can ask another question — is this extra effort worth it? At first glance, the answer may seem obvious; providing more information to a model should make it easier to inject new knowledge and reduce hallucinations, making it more useful in almost every conceivable application. However, things are not so simple.
Image generated by the Azure OpenAI Service DALL-E model with the following prompt: “Two robots, each holding a book, comparing the sizes of the books. Hyper-realistic, HQ”
In the 2023 paper Lost in the Middle, researchers at Stanford and Berkeley investigated how models use and access information provided in their context window, and concluded the following:
“We find that changing the position of relevant information in the input context can substantially affect model performance, indicating that current language models do not robustly access and use information in long input contexts”.
For their experiments, the authors created a dataset where, for each query, they had one document that contains the answer and k - 1 distractor documents which do not; the input context length was adjusted by changing the number of distractor documents. They then modulated the position of the relevant information by reordering the documents to place the answer-bearing document at the beginning, middle, or end of the context, and evaluated whether any of the correct answers appear in the predicted output.
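A rough sketch of how such a prompt could be assembled is shown below; this is an illustrative reconstruction of the setup rather than the authors’ actual code, and the question, documents, and formatting are all made up.

```python
def build_prompt(question: str, gold_doc: str, distractor_docs: list[str],
                 gold_position: int) -> str:
    """Place the answer-bearing document at a chosen index among the distractors."""
    docs = list(distractor_docs)
    docs.insert(gold_position, gold_doc)
    numbered = "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))
    return (f"Answer the question using the documents below.\n\n"
            f"{numbered}\n\nQuestion: {question}\nAnswer:")

distractors = [f"(irrelevant passage {i})" for i in range(19)]   # k - 1 = 19 distractors
gold = "The Eiffel Tower is 330 metres tall."

for position in (0, len(distractors) // 2, len(distractors)):    # start, middle, end
    prompt = build_prompt("How tall is the Eiffel Tower?", gold, distractors, position)
    # ...send `prompt` to the model and check whether the output contains "330"
```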
Specifically, they observed that the models studied performed the best when the relevant information was found at the start or the end of the context window; when the information required was in the middle of the context, performance significantly decreased.
In theory, the self-attention mechanism in a transformer enables the model to consider all parts of the input when generating the next word, regardless of their position in the sequence. As such, I believe that any biases the model has learned about where to find important information are more likely to come from the training data than from the architecture. We can explore this idea further by examining the results that the authors observed when evaluating the Llama-2 family of models on the accuracy of retrieving documents based on their position, which are presented in the figure below.
Looking at the base models, we can clearly observe the authors’ conclusions for the Llama-2 13B and 70B models. Interestingly, the 7B model relies almost exclusively on the end of the context. As much of the unsupervised pre-training is on streams of data scraped from various sources, when a model has relatively few parameters to dedicate to predicting the next word in an ever-changing context, it makes sense to focus on the most recent tokens!
The bigger models also perform well when the relevant information is at the beginning of the text, suggesting that they learn to focus more on the start of the text as they gain more parameters. The authors hypothesise that this is because, during pre-training, the models see a lot of data from sources like StackOverflow which start with important information. I doubt the 13B model’s slight advantage with front-loaded information is significant, as the accuracy is similar in both cases and the 70B model does not show this pattern.
The ‘chat’ models are trained further with instruction tuning and RLHF; they perform better overall and also seem to become less sensitive to the position of the relevant information in the text. This is clearer for the 13B model than for the 70B model, whilst the 7B model does not change much, perhaps because it has fewer parameters. This could mean that these models learn to use information from other parts of the text better after the additional training, but they still prefer the most recent information. Given that the subsequent training stages are significantly shorter, they have not completely overcome the biases from the initial unsupervised training; I suspect that the 70B model may require a larger, more diverse subsequent training stage to exhibit a change of a similar magnitude to that observed for the 13B model here.
Additionally, I would be interested in an investigation which explores the position of the relevant information in the text in the datasets used for SFT. As humans exhibit similar behaviour of being better at recalling information at the start and end of sequences, it would not be surprising if this behaviour is mirrored in a lot of the examples given.
Conclusion
To summarise, the context window is not fixed and can grow as large as you want it to, provided there is enough memory available! However, longer sequences mean more computation, which also results in the model being slower, and unless the model has been trained on sequences of a similar length, the output may not make much sense! Even for models with large context windows, there is no guarantee that they will effectively use all of the information provided to them; there really is no free lunch!