Apple researchers have developed a new multimodal method for efficiently training large language models (LLMs), one that could enable more flexible and capable machine-learning and AI systems.
A research paper posted by the company to the research site arxiv.org earlier this week revealed that Apple has used what it calls a “careful mix” of image-caption, interleaved image-text, and text-only data to train LLMs. The mix of visual and language data allowed the models to handle tasks like intelligently captioning images or inferring natural-language meanings.
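As a rough illustration of what sampling from such a data mixture might look like in practice, the sketch below weights the three data types the paper describes; the proportions, names, and helper function are illustrative assumptions, not details taken from Apple's implementation.

```python
import random

# Illustrative mixture weights for the three data types described in the paper;
# the exact proportions Apple used are not reproduced here.
DATA_MIX = {
    "image_caption": 0.45,            # (image, caption) pairs
    "interleaved_image_text": 0.45,   # web documents with images embedded in text
    "text_only": 0.10,                # plain text, to preserve language ability
}

def sample_source(mix=DATA_MIX):
    """Pick the data source for the next training example according to the mix."""
    sources, weights = zip(*mix.items())
    return random.choices(sources, weights=weights, k=1)[0]

# Example: decide which source each example in a small batch comes from.
batch_sources = [sample_source() for _ in range(8)]
print(batch_sources)
```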
The researchers also determined that the choice of image encoder, and the resolution of the images it processes, has a greater impact on performance than the design of the vision-language connector.
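For readers unfamiliar with these components, the sketch below shows schematically where the image encoder, its input resolution, and the vision-language connector sit in a typical multimodal pipeline; the class, names, and values are hypothetical and not Apple's architecture.

```python
from dataclasses import dataclass

@dataclass
class VisionConfig:
    # The paper reports that encoder choice and input resolution matter most.
    encoder: str = "vit-large"          # illustrative encoder name
    image_resolution: int = 336         # input resolution in pixels (illustrative)
    # The connector maps image features into the LLM's token space;
    # its design mattered comparatively little in the study.
    connector: str = "average-pooling"  # e.g. pooling vs. attention-based resampler

def visual_token_count(cfg: VisionConfig, patch_size: int = 14) -> int:
    """Number of visual tokens produced per image before any pooling."""
    per_side = cfg.image_resolution // patch_size
    return per_side * per_side

cfg = VisionConfig()
print(visual_token_count(cfg))  # 24 * 24 = 576 tokens at 336px with 14px patches
```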
In one instance, a 30-billion-parameter MM1 model was found to have strong in-context learning abilities, meaning it can perform multi-step reasoning over multiple images using few-shot “chain-of-thought” prompting.
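To make few-shot chain-of-thought prompting over multiple images concrete, the snippet below assembles such a prompt as interleaved image placeholders and text; the prompt format, placeholder token, and examples are hypothetical, not Apple's interface.

```python
# Hypothetical few-shot, multi-image chain-of-thought prompt.
# "<image>" marks where image embeddings would be interleaved with the text.
examples = [
    ("<image> How many apples are on the table?",
     "There are three red apples and one green apple, so four in total. Answer: 4"),
    ("<image> <image> Which photo was taken later in the day?",
     "The shadows in the second photo are longer, so it was taken later. Answer: the second"),
]

query = "<image> <image> Do these two receipts add up to more than $50?"

def build_prompt(examples, query):
    """Interleave worked examples (with reasoning steps) before the new question."""
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)

print(build_prompt(examples, query))
```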
Originally appeared here:
New Apple AI training method retains privacy, and could make a future Siri more flexible