Tag: tech

  • Understanding the Evolution of ChatGPT: Part 1—An In-Depth Look at GPT-1 and What Inspired It

    Shirley Li

    Tracing the roots of ChatGPT: GPT-1, the foundation of OpenAI’s LLMs

    (Image from Unsplash)

    The GPT (Generative Pre-Training) model family, first introduced by OpenAI in 2018, is another important application of the Transformer architecture. It has since evolved through versions like GPT-2, GPT-3, and InstructGPT, eventually leading to the development of OpenAI’s powerful LLMs.

    In other words: understanding GPT models is essential for anyone looking to dive deeper into the world of LLMs.

    This is the first part of our GPT series, in which we walk through the core concepts of GPT-1 as well as the prior work that inspired it.

    Below are the topics we will cover in this article:

    Prior to GPT-1:

    • The Pre-training and Finetuning Paradigm: the journey from CV to NLP.
    • Previous Work: Word2vec, GloVe, and other methods using LM for pretraining.
    • Decoder-only Transformer.
    • Auto-regressive vs. Auto-encoding Language Models.

    Core concepts in GPT-1:

    • Key Innovations.
    • Pre-training.
    • Finetuning.

    Prior to GPT-1

    Pre-training and Finetuning

    The pretraining + finetuning paradigm, which first became popular in Computer Vision, refers to training a model in two stages: pretraining and then finetuning.

    In the pretraining stage, the model is trained on a large-scale dataset related to the downstream task at hand. In Computer Vision, this usually means training an image classification model on ImageNet, whose most commonly used subset (ILSVRC) contains 1K categories with roughly 1K images each.

    Although 1M images doesn’t sound “large-scale” by today’s standards, ILSVRC was truly remarkable a decade ago, and it was much larger than the datasets available for most specific CV tasks.

    The CV community has also explored many ways to move beyond supervised pre-training, for example MoCo (by Kaiming He et al.) and SimCLR (by Ting Chen et al.).

    After pre-training, the model is assumed to have learnt some general knowledge about the task, which could accelerate the learning process on the downstream task.

    Then comes finetuning: in this stage, the model is trained on a specific downstream task with high-quality labeled data, often at a much smaller scale than ImageNet. Here the model picks up domain-specific knowledge related to the task at hand, which helps improve its performance.

    For many CV tasks, this pretraining + finetuning paradigm outperforms training the same model from scratch on the limited task-specific data, especially when the model is complex and hence more likely to overfit. Combined with modern CNN architectures such as ResNet, this led to a performance leap on many CV benchmarks, some of which even approached near-human performance.

    Therefore, a natural question arises: how can we replicate such success in NLP?

    Previous Explorations of Pretraining Prior to GPT-1

    In fact, the NLP community never stopped pushing in this direction, and some of these efforts date back as early as 2013, such as Word2Vec and GloVe (Global Vectors for Word Representation).

    Word2Vec

    The Word2Vec paper “Distributed Representations of Words and Phrases and their Compositionality” was honored with the “Test of Time” award at NeurIPS 2023. It’s really a must-read for anyone not familiar with this work.

    Today it feels so natural to represent words or tokens as embedding vectors, but this wasn’t the case before Word2Vec. At that time, words were commonly represented by one-hot encoding or count-based statistics such as TF-IDF (term frequency-inverse document frequency) or co-occurrence matrices.

    For example, in one-hot encoding, given a vocabulary of size N, each word in this vocabulary is assigned an index i and then represented as a sparse vector of length N in which only the i-th element is set to 1.

    Take the following case as an example: in a toy vocabulary with only four words, the (index 0), cat (index 1), sat (index 2) and on (index 3), each word is represented as a sparse vector of length 4 (the → 1000, cat → 0100, sat → 0010, on → 0001).

    Figure 1. An example of one-hot encoding with a toy vocabulary of 4 words. (image by the author)
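
    To make this concrete, below is a minimal Python sketch of the encoding above (the vocabulary and indices match the toy example; the helper function is just for illustration):

        # Toy vocabulary from the example above: word -> index.
        vocab = {"the": 0, "cat": 1, "sat": 2, "on": 3}

        def one_hot(word, vocab):
            # Return a sparse vector of length len(vocab) with a single 1.
            vec = [0] * len(vocab)
            vec[vocab[word]] = 1
            return vec

        for w in vocab:
            print(w, one_hot(w, vocab))
        # the [1, 0, 0, 0]
        # cat [0, 1, 0, 0]
        # sat [0, 0, 1, 0]
        # on  [0, 0, 0, 1]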

    The problem with this simple method is that, as the vocabulary grows in real-world cases, the one-hot vectors become extremely long. Also, neural networks are not designed to handle these sparse vectors efficiently.

    Additionally, the semantic relationships between related words are lost in this process: since each word’s index is assigned arbitrarily, similar words have no connection in this representation.

    With that, you can better understand the significance of Word2Vec’s contribution: by representing words as dense, continuous vectors such that words appearing in similar contexts have similar vectors, it completely revolutionized the field of NLP.

    With Word2Vec, related words are mapped close together in the embedding space. For example, in the figure below the authors show the PCA projection of word embeddings for some countries and their corresponding capitals, with the country-capital relationship captured automatically by Word2Vec without any supervision.

    Figure 2. PCA projection of countries and capital vectors by Word2Vec. (Image from Word2Vec paper)
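
    If you want to play with this yourself, here is a quick sketch using the gensim library (assuming gensim 4.x is installed; the toy corpus is purely illustrative, and a real corpus would be far larger):

        from gensim.models import Word2Vec

        # Tiny illustrative corpus; Word2Vec needs much more data to be meaningful.
        corpus = [
            ["the", "cat", "sat", "on", "the", "mat"],
            ["the", "dog", "sat", "on", "the", "rug"],
            ["a", "cat", "and", "a", "dog", "played"],
        ]

        # sg=1 selects Skip-Gram; sg=0 would select CBOW instead.
        model = Word2Vec(corpus, vector_size=50, window=2, min_count=1, sg=1)

        print(model.wv["cat"].shape)         # (50,): a dense embedding vector
        print(model.wv.most_similar("cat"))  # nearest words in the embedding space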

    Word2Vec is learnt in an unsupervised manner, and once the embeddings are learnt, they can be easily used in downstream tasks. This is one of the earliest efforts exploring semi-supervised learning in NLP.

    More specifically, it can leverage either the CBOW (Continuous Bag of Words) or Skip-Gram architectures to learn word embeddings.

    In CBOW, the model tries to predict the target word based on its surrounding words. For example, given the sentence “The cat sat on the mat,” CBOW would try to predict the target word “sat” given the context words “The,” “cat,” “on,” “the.” This architecture is effective when the goal is to predict a single word from the context.

    However, Skip-Gram works in the opposite way: it uses a target word to predict its surrounding context words. Taking the same sentence as an example, this time the target word “sat” becomes the input, and the model tries to predict context words like “The,” “cat,” “on,” and “the.” Skip-Gram is particularly useful for capturing rare words by leveraging the contexts in which they appear.

    Figure 3. CBOW vs. Skip-Gram architectures in Word2Vec, where “sat” is the target word. (Image by the author)
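
    The difference is easiest to see in the training pairs each architecture consumes. Below is a small sketch (pair generation only, no model), assuming a symmetric context window of size 2:

        sentence = ["the", "cat", "sat", "on", "the", "mat"]
        window = 2

        def training_pairs(tokens, window, mode):
            pairs = []
            for i, target in enumerate(tokens):
                context = [tokens[j]
                           for j in range(max(0, i - window), min(len(tokens), i + window + 1))
                           if j != i]
                if mode == "cbow":
                    pairs.append((context, target))             # context -> target
                else:
                    pairs.extend((target, c) for c in context)  # target -> each context word
            return pairs

        print(training_pairs(sentence, window, "cbow")[2])
        # (['the', 'cat', 'on', 'the'], 'sat')
        print(training_pairs(sentence, window, "skipgram")[:2])
        # [('the', 'cat'), ('the', 'sat')]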

    GloVe

    Another work along this line of research is GloVe, which is also an unsupervised method to generate word embeddings. Unlike Word2Vec, which focuses on local context windows, GloVe is designed to capture global statistical information by constructing a word co-occurrence matrix over the whole corpus and factorizing it to obtain dense word vectors.

    Figure 4. Illustration of word embedding generation in GloVe. (Image by the author)
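
    As a rough sketch, the first step of GloVe, counting co-occurrences within a context window, looks like this (the distance weighting and the weighted least-squares factorization that follow are omitted here):

        from collections import defaultdict

        corpus = [["the", "cat", "sat", "on", "the", "mat"]]
        window = 2

        cooc = defaultdict(float)  # (word, context_word) -> co-occurrence count
        for tokens in corpus:
            for i, word in enumerate(tokens):
                for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                    if j != i:
                        cooc[(word, tokens[j])] += 1.0

        print(cooc[("cat", "sat")])  # 1.0: "cat" and "sat" co-occur once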

    Note that both Word2Vec and GloVe mainly transfer word-level information, which is often insufficient for complex NLP tasks that require capturing higher-level semantics in the embeddings. This led to more recent explorations of unsupervised pre-training of entire NLP models.

    Unsupervised Pre-Training

    Before GPT, many works explored unsupervised pre-training with different objectives, such as language modeling, machine translation, and discourse coherence. However, each method only outperformed the others on certain downstream tasks, and it remained unclear which optimization objectives were most effective or most useful for transfer.

    You may have noticed that language models had already been explored as training objectives in some of the earlier works, but why didn’t these methods succeed like GPT?

    The answer is Transformer models.

    When those earlier works were proposed, the Transformer did not exist yet, so researchers could only rely on RNN models such as LSTMs for pre-training.

    This brings us to the next topic: the Transformer architecture used in GPT.

    Decoder-only Transformer

    In GPT, the Transformer architecture is a modified version of the original Transformer called the decoder-only Transformer. This is a simplified architecture proposed by Google in 2018, and it contains only the decoder stack.

    Below is a comparison of the encoder-decoder architecture introduced in the original Transformer vs. the decoder-only architecture used in GPT. Basically, the decoder-only variant removes the encoder entirely, along with the cross-attention sub-layers that attend to it, resulting in a simpler architecture.

    Figure 5. Comparison of the encoder-decoder architecture in Transformer vs. decoder-only Transformer in GPT. (Images from the Transformer and GPT papers)
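
    The key mechanical ingredient the decoder keeps is masked (causal) self-attention, which prevents each position from attending to future tokens. Here is a minimal PyTorch sketch of the mask; the shapes and names are illustrative, not taken from the GPT paper:

        import torch

        seq_len = 5
        # True above the diagonal marks the future positions each token must not see.
        causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

        scores = torch.randn(seq_len, seq_len)                 # raw attention scores
        scores = scores.masked_fill(causal_mask, float("-inf"))
        attn = torch.softmax(scores, dim=-1)                   # each row sums to 1
        print(attn)  # upper-triangular entries are exactly 0: no peeking ahead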

    So what’s the benefit of making Transformer decoder-only?

    Compared with encoder-only models such as BERT, decoder-only models often perform better in generating coherent and contextually relevant text, making them ideal for text generation tasks.

    Encoder-only models like BERT, on the other hand, often perform better in tasks that require understanding the input data, like text classification, sentiment analysis, and named entity recognition, etc.

    There is a third type of model that employs both the Transformer encoder and decoder, such as T5 and BART, where the encoder processes the input and the decoder generates the output based on the encoded representation. While such a design makes them more versatile in handling a wide range of tasks, they are often more computationally intensive than encoder-only or decoder-only models.

    In a nutshell, while both are built on the Transformer and leverage the pre-training + finetuning scheme, GPT and BERT chose very different ways to achieve that similar goal. More specifically, GPT conducts pre-training in an auto-regressive manner, while BERT follows an auto-encoding approach.

    Auto-regressive vs. Auto-encoding Language Models

    An easy way to understand their difference is to compare their training objectives.

    In auto-regressive language models, the training objective is typically to predict the next token in the sequence given the previous tokens. This dependency on previous tokens usually leads to a unidirectional (typically left-to-right) approach, as shown on the left of Figure 6.

    By contrast, auto-encoding language models are often trained with objectives like masked language modeling or reconstructing the entire input from a corrupted version. This is usually done in a bi-directional manner, where the model can leverage all the tokens around the masked one, in other words, tokens on both the left and the right. This is illustrated on the right of Figure 6.

    Figure 6. Auto-regressive Language Model vs. Auto-encoding Language Model. (Image by the author)
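
    To make the contrast concrete, here is a small sketch of the training examples each objective would derive from the same sentence (token strings stand in for token ids, and the masked position is chosen arbitrarily):

        tokens = ["the", "cat", "sat", "on", "the", "mat"]

        # Auto-regressive: predict each token from the tokens to its left.
        ar_examples = [(tokens[:t], tokens[t]) for t in range(1, len(tokens))]
        print(ar_examples[2])  # (['the', 'cat', 'sat'], 'on')

        # Auto-encoding (masked LM): corrupt a position, predict it from both sides.
        masked_input = ["the", "cat", "[MASK]", "on", "the", "mat"]
        ae_example = (masked_input, {2: "sat"})  # position -> original token
        print(ae_example)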

    Simply put, auto-regressive LMs are more suitable for text generation, but their unidirectional modeling may limit their ability to understand the full context. Auto-encoding LMs, on the other hand, do a better job at context understanding, but are not designed for generative tasks.

    Core Concepts in GPT-1

    Key Innovations

    Most of the key innovations of GPT-1 have already been covered in the sections above, so I will just list them here as a brief summary:

    • GPT-1 was the first work to successfully leverage auto-regressive language modeling as an unsupervised pre-training task, making the pre-training + finetuning paradigm a standard procedure for NLP tasks.
    • Unlike prior works that relied on RNNs and LSTMs, GPT-1 leverages a decoder-only Transformer architecture, which improves parallelization and long-range dependency handling, leading to better performance.

    Unsupervised Pre-training

    In GPT pre-training, a standard language modeling objective is used, maximizing the log-likelihood of each token given the tokens that precede it:

    L1(U) = Σ_i log P(u_i | u_{i−k}, …, u_{i−1}; θ)

    where U = {u_1, …, u_n} is the unsupervised corpus of tokens, k is the size of the context window, and the conditional probability P is modeled using the decoder-only Transformer with its parameters represented as θ.
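
    In practice this objective reduces to a next-token cross-entropy loss. Below is a minimal PyTorch sketch, where the random tensors stand in for real model outputs and tokenized text:

        import torch
        import torch.nn.functional as F

        batch, seq_len, vocab_size = 2, 8, 100
        logits = torch.randn(batch, seq_len, vocab_size)         # stand-in model output
        tokens = torch.randint(0, vocab_size, (batch, seq_len))  # stand-in token ids

        # Shift by one: the logits at position t predict the token at position t + 1.
        pred = logits[:, :-1, :].reshape(-1, vocab_size)
        target = tokens[:, 1:].reshape(-1)
        loss = F.cross_entropy(pred, target)  # = -mean log P(u_i | previous tokens)
        print(loss.item())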

    Supervised Finetuning

    Once the model is pre-trained, it can be adapted to a specific downstream task by finetuning on a task-specific dataset using a proper supervised learning objective.

    One problem here is that GPT requires a single contiguous sequence of text as input, while some tasks involve more than one input sequence. For example, in entailment we have both the premise and the hypothesis, and in some QA tasks we need to handle three different inputs: the document, the question, and the answer.

    To make it easier to fit different tasks, GPT adopts task-specific input transformations in the finetuning stage, as shown in the figure below:

    Figure 7. Input transformations for fine-tuning on different tasks. (Image from GPT paper)

    More specifically,

    • For the entailment task, the premise and hypothesis sequences are concatenated into a single sequence, with a delimiter token in between.
    • For the similarity task, since there is no inherent ordering between the two sentences, we construct two input sequences, one for each ordering of the sentences, obtain their respective embeddings, and then add the two embeddings element-wise.
    • For the more complex question answering and commonsense reasoning tasks, where we are typically given a context document, a question, and a set of possible answers, we concatenate the document and the question with each possible answer (again with a delimiter token in between), process each of these sequences independently, and then apply a softmax layer to produce the final output distribution over possible answers. These transformations are sketched in code below.
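
    Here is a sketch of these transformations as plain string operations. The special tokens follow the spirit of the start, delimiter, and extract tokens in the GPT paper, but the exact strings and function names here are illustrative:

        def entailment_input(premise, hypothesis):
            return f"<start> {premise} <delim> {hypothesis} <extract>"

        def similarity_inputs(text_a, text_b):
            # Two orderings; their final embeddings are added element-wise downstream.
            return (f"<start> {text_a} <delim> {text_b} <extract>",
                    f"<start> {text_b} <delim> {text_a} <extract>")

        def qa_inputs(document, question, answers):
            # One sequence per candidate answer; a softmax over the per-sequence
            # scores produces the final distribution over answers.
            return [f"<start> {document} {question} <delim> {a} <extract>"
                    for a in answers]

        print(entailment_input("A dog runs.", "An animal moves."))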

    Conclusions

    In this article, we revisited the key techniques that inspired GPT-1 and highlighted its major innovations.

    This is the first part of our GPT series, and in the next article, we will walk through the evolution from GPT-1 to GPT-2, GPT-3, and InstructGPT.

    Thanks for reading!



  • Apple’s second iOS 18.3, iPadOS 18.3 developer betas kick off 2025

    Apple has moved on to its second round of developer betas, issuing fresh builds of iOS 18.3, iPadOS 18.3, macOS 15.3, tvOS 18.3, watchOS 11.3, and visionOS 2.3.

    Examples of Apple Intelligence at work.

    The second round of betas for iOS 18.3, iPadOS 18.3, macOS 15.3, tvOS 18.3, watchOS 11.3, and visionOS 2.3 follows the first round, which landed on December 16.

    The second iOS 18.3 and iPadOS 18.3 developer betas share build number 22D5040d, replacing 22D5034e, the first build. The second macOS Sequoia 15.3 build is 24D5040f, up from 24D5034.


  • CES 2025: The 9 best mobile accessories we’ve seen so far

    ZDNET is live from Las Vegas, covering the best and most innovative mobile accessories at CES 2025. Here are our favorites so far.


  • Stop spam and keep your personal details private with Incogni’s exclusive 55% off deal

    John Alexander

    We’re now into the thick of January, meaning there’s a big chance that you’ve already abandoned whatever self-improvement routine you’d planned for 2025. But, if your plans included reducing those neverending spam calls, keeping important personal details like your Social Security number and address private, keeping your family more secure, and reducing the risk of […]


  • Meta is ditching third-party fact checkers

    Kris Holt

    Meta CEO Mark Zuckerberg has announced a major shift in the company’s approach to moderation and speech. Meta is ditching its fact-checking program and moving to an X-style Community Notes model on Facebook, Instagram and Threads.

    Zuckerberg said in a video that Meta has “built a lot of complex systems to moderate content” in recent years. “But the problem with complex systems is they make mistakes. Even if they accidentally censor one percent of posts, that’s millions of people.” He added that we’re now at a point where there have been “too many mistakes and too much censorship.”

    To that end, he said, “we’re gonna get back to our roots and focus on reducing mistakes, simplifying our policies and restoring free expression on our platforms.” That’s going to start with a switch to “Community Notes, similar to X, starting in the US.”

    Meta’s new Chief Global Affairs Officer (and Nick Clegg’s replacement) Joel Kaplan wrote in a blog post that the company has seen the Community Notes “approach work on X — where they empower their community to decide when posts are potentially misleading and need more context, and people across a diverse range of perspectives decide what sort of context is helpful for other users to see.”

    The company plans to phase in Community Notes in the US over the next few months and iterate on them over this year, all the while removing its fact checkers and ending the demotion of fact-checked content. Meta will also make certain content warning labels less prominent.

    Meta says it will be up to contributing users to write Community Notes and to decide which ones are applied to posts on Facebook, Instagram and Threads. “Just like they do on X, Community Notes will require agreement between people with a range of perspectives to help prevent biased ratings,” Kaplan wrote. “We intend to be transparent about how different viewpoints inform the Notes displayed in our apps, and are working on the right way to share this information.”

    The Community Notes model hasn’t been entirely without issue for X, however. Studies have shown that Community Notes have failed to prevent misinformation from spreading there. Elon Musk has championed the Community Notes approach, but some notes have been applied to his own posts to correct falsehoods that he has posted. After one such incident, Musk accused “state actors” of manipulating the system. YouTube has also tested a Community Notes model.

    FILE - Mark Zuckerberg speaks about the Orion AR glasses at the Meta Connect conference on September 25, 2024, in Menlo Park, California. (AP Photo/Godofredo A. Vásquez, File)
    ASSOCIATED PRESS

    Meanwhile, Zuckerberg had some other announcements to make, including a simplification of certain content policies and ditching “a bunch of restrictions on topics like immigration and gender that are just out of touch with mainstream discourse. What started as a movement to be more inclusive has increasingly been used to shut down opinions and shut out people with different ideas, and it’s gone too far. I wanna make sure that people can share their experiences and their beliefs on our platforms.”

    When asked to provide more details about these policy changes, Meta directed Engadget to Kaplan’s blog post.

    In addition, the filters that Meta had used to search for any policy violations across its platforms will be focused on “illegal and high-severity violations.” These include terrorism, child sexual exploitation, drugs, fraud and scams. For other, less-severe types of policy violations, Meta will rely more on users making manual reports, but the bar for removing content will be higher.

    “We’re going to tune our systems to require a much higher degree of confidence before a piece of content is taken down,” Kaplan wrote. In some cases, that will mean multiple reviewers looking at a certain piece of content before reaching a decision on whether to take it down. Along with that, Meta is “working on ways to make recovering accounts more straightforward and testing facial recognition technology, and we’ve started using AI large language models (LLMs) to provide a second opinion on some content before we take enforcement actions.”

    Last but not least, Meta says it’s taking a more personalized approach to political content across its platforms after attempting to make its platforms politically agnostic for the past few years. So, if you want to see more political stuff in your Facebook, Instagram and Threads feeds, you’ll have the choice to do so.

    As with donating to Donald Trump’s inauguration fund, replacing longtime policy chief Nick Clegg with a former George W. Bush aide and appointing Trump’s buddy (and UFC CEO) Dana White to its board, it’s very difficult to see these moves as anything other than Meta currying favor with the incoming administration.

    Many Republicans have long railed against social media platforms, accusing them of censoring conservative voices. Meta itself blocked Trump from using his accounts on his platforms for years after he stoked the flames of the attempted coup of January 6, 2021. “His decision to use his platform to condone rather than condemn the actions of his supporters at the Capitol building has rightly disturbed people in the US and around the world,” Zuckerberg said at the time. “We believe the risks of allowing the President to continue to use our service during this period are simply too great.” Meta removed its restrictions on Trump’s Facebook and Instagram accounts last year.

    Zuckerberg explicitly said that Trump’s election win is part of the reasoning behind Meta’s policy shift, calling it “a cultural tipping point” on free speech. He said that the company will work with Trump to push back against other governments, such as the Chinese government and some in Latin America, that are “pushing to censor more.”

    He claimed that “Europe has an ever-increasing number of laws institutionalizing censorship and making it difficult to build anything innovative there.” Zuckerberg also took shots at the outgoing administration (over an alleged push for censorship) and third-party fact checkers, who he claimed were “too politically biased and have destroyed more trust than they created.”

    These are all significant changes for Meta’s platforms. On one hand, allowing more types of speech could increase engagement without having to rely on, say, garbage AI bots. But the company may end up driving away many folks who don’t want to deal with the type of speech that could become more prevalent on Instagram, Facebook and Threads now that Meta is taking the shackles off.

    “Now we have an opportunity to restore free expression and I am excited to take it,” Zuckerberg said. While he noted that “it’ll take time to get this right and these are complex systems that are never gonna be perfect,” and that the company will still need to work hard to remove illegal content, “the bottom line is that after years of having our content moderation work focused primarily on removing content, it is time to focus on reducing mistakes, simplifying our systems and getting back to our roots about giving people voice.”

    Update January 7, 2:58PM ET: Noting that Meta responded to our request for comment.

    This article originally appeared on Engadget at https://www.engadget.com/social-media/meta-is-loosening-some-content-policies-and-moving-to-an-x-style-community-notes-system-142330500.html?src=rss


  • Amazon’s M4 iMac sale drops 10-core GPU spec down to $1,349

    Apple deals continue to roll in during CES 2025, with Amazon knocking $150 off the all-in-one M4 iMac 24-inch with a 10-core GPU, eliminating the need to purchase an external display.

    Amazon has dropped prices on Apple’s M4 iMac.

    The standout discount across Amazon’s M4 iMac sale is the $150 markdown on the silver colorway with a 10-core CPU/GPU chip, 16GB of unified memory and 256GB of SSD storage. The upgraded model also provides two extra Thunderbolt ports and a Gigabit Ethernet port compared to the standard spec.


  • Nanoleaf’s new lighting products include a floor lamp and new lightstrips

    Canadian smart home brand Nanoleaf released multiple new smart home products at CES 2025, including its first floor lamp and a new software subscription.

    The new Nanoleaf floor lamp

    Most intriguing for Apple users will be the new Smart Multicolor Floor Lamp, which is outfitted with LEDs capable of more than 16 million colors, music syncing, and gradient effects.

    It has an ultra-slim design to fit into many rooms and is now available to preorder from Nanoleaf’s website for $99.


  • CES 2025: The 12 most impressive products so far

    CES is in full swing, and we’ve seen major announcements from the likes of Samsung, Roborock, MSI, and more. Here’s our roundup of the new tech you don’t want to miss.
