In recent years, large language models (LLMs) used for natural language processing (NLP) tasks such as question answering and text summarization have grown dramatically in size. Larger models with more parameters, which are on the order of hundreds of billions at the time of writing, tend to produce better […]
Originally appeared here:
Faster LLMs with speculative decoding and AWS Inferentia2