Renowned gaming platform Dragonsworld has unveiled the debut of the world’s first token talismans, dubbed “The Dragons”. The project represents a remarkable fusion of tradition and innovation while seeking to redefine the crypto space. Crafted by a team of crypto enthusiasts, these token amulets aim to infuse the ancient energy of dragons into the digital […]
As Bitcoin (BTC) edges closer to the psychological price of $50,000 thanks to heightened bullish momentum, the leading cryptocurrency has been in the green for a couple of months.
Charles Hoskinson, the founder of Cardano, has expressed confidence in the future of the network, underscoring its potential to surpass Ethereum in the realm of decentralized finance (DeFi) and blockchain innovation.
Crypto expert and former Ethereum advisor Steven Nerayoff has raised concerns about the degree of Chinese influence and control in the Ethereum ecosystem.
Large Language Models, GPT-2 — Language Models Are Unsupervised Multitask Learners
Acing GPT capabilities by turning it into a powerful multitask zero-shot model
Introduction
GPT is a well-known series of models whose last versions are currently dominating in various NLP tasks. The first GPT version was a significant milestone: being trained on enormous 120M parameters, this model demonstrated state-of-the-art performance on top benchmarks. Starting from this point, researchers tried to improve the base version.
In 2019, researchers from OpenAI officially released GPT-2. It was 10 times bigger than GPT-1 which allowed it to improve performance even further. Apart from that, the authors conjectured in their work that LLMs are multitask learners meaning that they can learn to perform several tasks at the same time. This important statement made it possible to further develop LLMs in a much more efficient framework.
In this article, we will refer to the official GPT-2 paper by going through its main aspects and improvements over GPT-1 and understand a novel approach for building LLMs.
Note. This article assumes that you are already familiar with the first version of GPT. If not, check out this article.
It is no secret that with the recent introduction of powerful models like ChatGPT or GPT-4, the first GPT versions no longer attract that much attention and appear obsolete.
Nevertheless, the following reasons explain the important motivation behind studying the GPT evolution.
The first GPT versions introduced language learning concepts that are still used by the most recent models. The best example is GPT-2 innovating the multitask learning technique. Thanks to this concept, the modern GPT models can accurately solve a large variety of NLP tasks.
From the algorithmic perspective, most LLMs already use many advanced techniques and it becomes harder to innovate new efficient methods. That is why NLP researchers focus more on scraping and feeding more high-quality data to models. This detail explains why there is not so much difference between internal working mechanisms in first GPT models, in comparison to ChatGPT-3.5 or GPT-4. As a result, the most principled differences are usually the amount of data fed to them and the complexity of a neural network. By understanding how first GPT models work, you can automatically recognize the working concepts of more advanced models.
Even though there might be some subtle differences in the training process between different GPT models, the aspects contributing the most to the model’s performance is the amount of data fed to it and the neural network’s complexity.
Multitask learning
GPT-2 is built on top of GPT-1 meaning that it has the same architecture. During training, GPT-1 uses the standard log-likelihood language modeling objective:
GPT’s learning objective
This expression can be thought of as an optimization of conditional probability distribution p(output | input) for a given task (in the case of GPT-1, the task consists of predicting the next token). While this approach works well for individual tasks, the model is still not able to learn to perform multiple tasks. For instance, a model trained with the aforementioned objective to predict the next token in the sequence will perform poorly on a sentiment analysis problem without proper fine-tuning.
The GPT-2 authors proposed a novel approach for replacing the common pre-training + fine-tuning framework that would allow a trained model to perform well across different tasks. The idea consists of not modeling the standard probability p(output | input) butincluding task conditioning p(output | input, task) instead. There exist several approaches to incorporating task type into the model. Most of the previous methods considered this information by making changes on the architecture level. Though this approach worked well in the past, it turned out that there would be no need to modify the model’s architecture for task-type incorporation.
The ultimate idea is that taskinformation can be easily incorporated into the input sequence. For example:
If a sentence in language A needs to be translated into the language B, then the input sequence in the dataset will be written as:
Example from the paper demonstrating input adaption for translation tasks
If an answer should be given to a question in a provided context, then the input sequence will take the following form:
Example from the paper demonstrating input adaption for question answering tasks
Surprisingly the described approach was already proven to be competitive in previous works (e.g. MQAN model)! The only main disadvantage is its slow learning speed.
Zero-shot learning is a popular term and designates the ability of a model to perform a certain task without having explicitly received any training examples for it. GPT-2 is an example of a model having this ability.
Dataset
To use the idea of multitask learning from the previous section, for training, we would normally need a dataset whose objects contain task descriptions, text inputs and labels. However, in reality, the authors developed a robust framework which turns this supervised problem into an unsupervised one and does not even need task descriptions!
The researchers conjectured that if a model was trained on a large and diverse dataset, then there would probably be a lot of language demonstration tasks in different domains that would definitely help the model to fully understand them. To validate this hypothesis, the authors designed a web scraping algorithm that collected human responses on Reddit which received at least 3 likes. Collecting all possible Reddit responses would likely have led to data quality issues and also have been too large for a model. As a result, the final dataset version includes 8M documents containing 40GB of text data in total.
Dataset fragment containing a sentence including phrases in English and French. Such text fragments can help the model perform translation tasks. The example is taken from the paper.A similar example to the previous one from the paper.
Since the collected dataset is very diverse, to better account for rare words and characters, the authors incorporated a slightly modified version of Byte-Pair Encoding (BPE) for input representations.
Model
According to the paper, GPT-2 has the same architecture as GPT-1 except for several changes:
Layer normalization was moved to the input of each Transformer block and was added to the final self-attention block.
Weights of residual layers are divided by √N at initialization where (N is the number of residual layers).
Context size is increased from 512 to 1024.
Batch size is augmented from 64 to 512.
Vocabulary size is expanded from 40,000 tokens to 50,257.
Conclusion
By turning a supervised problem into the unsupervised format, multitask learning helps GPT-2 to ace the performance on various downstream tasks (except for text summarization) without explicit fine-tuning. In fact, after several years, this learning framework is still constantly gaining popularity in machine learning.
When a training dataset is sufficiently large and diverse, it allows gigantic models to enrich linguistic knowledge by simply optimizing the log-likelihood language objective. Finally, GPT-2 has become a perfect example of such a model.
The Samsung Galaxy S24 Ultra proves that the OnePlus 12 could be this year’s best-value flagship
Originally appeared here:
The Samsung Galaxy S24 Ultra proves that the OnePlus 12 could be this year’s best-value flagship
We use cookies on our website to give you the most relevant experience by remembering your preferences and repeat visits. By clicking “Accept”, you consent to the use of ALL the cookies.
This website uses cookies to improve your experience while you navigate through the website. Out of these, the cookies that are categorized as necessary are stored on your browser as they are essential for the working of basic functionalities of the website. We also use third-party cookies that help us analyze and understand how you use this website. These cookies will be stored in your browser only with your consent. You also have the option to opt-out of these cookies. But opting out of some of these cookies may affect your browsing experience.
Necessary cookies are absolutely essential for the website to function properly. These cookies ensure basic functionalities and security features of the website, anonymously.
Cookie
Duration
Description
cookielawinfo-checkbox-analytics
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional
11 months
The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance
11 months
This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy
11 months
The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.
Functional cookies help to perform certain functionalities like sharing the content of the website on social media platforms, collect feedbacks, and other third-party features.
Performance cookies are used to understand and analyze the key performance indexes of the website which helps in delivering a better user experience for the visitors.
Analytical cookies are used to understand how visitors interact with the website. These cookies help provide information on metrics the number of visitors, bounce rate, traffic source, etc.
Advertisement cookies are used to provide visitors with relevant ads and marketing campaigns. These cookies track visitors across websites and collect information to provide customized ads.