Google DeepMind RecurrentGemma Beats Transformer Models via @sejournal, @martinibuster

Google DeepMind published a research paper that proposes a language model called RecurrentGemma that can match or exceed the performance of transformer-based models while being more memory efficient, offering the promise of large language model performance in resource-limited environments.

The research paper offers a brief overview:

“We introduce RecurrentGemma, an open language model which uses Google’s novel Griffin architecture. Griffin combines linear recurrences with local attention to achieve excellent performance on language. It has a fixed-sized state, which reduces memory use and enables efficient inference on long sequences. We provide a pre-trained model with 2B non-embedding parameters, and an instruction tuned variant. Both models achieve comparable performance to Gemma-2B despite being trained on fewer tokens.”

Connection To Gemma

Gemma is an open model that uses Google’s top tier Gemini technology but is lightweight and can run on laptops and mobile devices. Similar to Gemma, RecurrentGemma can also function in resource-limited environments. Other similarities between Gemma and RecurrentGemma are in the pre-training data, instruction tuning and RLHF (Reinforcement Learning From Human Feedback). RLHF is a training technique that uses human feedback on a generative AI model’s outputs to steer it toward the responses people prefer.

Griffin Architecture

The new model is based on a hybrid model called Griffin that was announced a few months ago. Griffin is called a “hybrid” model because it uses two kinds of technologies: one (linear recurrences) allows it to efficiently handle long sequences of information, while the other (local attention) allows it to focus on the most recent parts of the input. Together they give it the ability to process “significantly” more data (increased throughput) in the same time span as transformer-based models and also decrease the wait time (latency).
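The two ingredients, linear recurrences and local attention, can be illustrated with a minimal Python sketch. This is a simplified illustration, not the Griffin implementation: the scalar recurrence, the `decay` parameter and both function names are assumptions made for this example, and the recurrent layer described in the Griffin paper is gated and considerably more elaborate.

```python
def linear_recurrence(inputs, decay=0.9):
    """Scalar linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t.

    The state is a single number no matter how many inputs are processed,
    which is the "fixed-size state" idea in miniature.
    """
    state = 0.0
    states = []
    for x in inputs:
        state = decay * state + (1.0 - decay) * x
        states.append(state)
    return states


def local_attention_positions(seq_len, window):
    """For each position t, list the positions a causal local-attention
    layer may look at: t itself and at most (window - 1) tokens before it."""
    return [list(range(max(0, t - window + 1), t + 1)) for t in range(seq_len)]


states = linear_recurrence([1.0, 1.0, 1.0], decay=0.5)
visible = local_attention_positions(seq_len=5, window=3)
print(states)      # [0.5, 0.75, 0.875]
print(visible[4])  # [2, 3, 4]
```

In this toy version, position 4 can only attend to positions 2 through 4. Anything older has to be carried forward by the recurrent state, which never grows.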

The Griffin research paper proposed two models, one called Hawk and the other named Griffin. The Griffin research paper explains why it’s a breakthrough:

“…we empirically validate the inference-time advantages of Hawk and Griffin and observe reduced latency and significantly increased throughput compared to our Transformer baselines. Lastly, Hawk and Griffin exhibit the ability to extrapolate on longer sequences than they have been trained on and are capable of efficiently learning to copy and retrieve data over long horizons. These findings strongly suggest that our proposed models offer a powerful and efficient alternative to Transformers with global attention.”

The difference between Griffin and RecurrentGemma is a single modification related to how the model processes input data (input embeddings).

Breakthroughs

The research paper states that RecurrentGemma provides similar or better performance than the more conventional Gemma-2B transformer model (which was trained on 3 trillion tokens versus 2 trillion for RecurrentGemma). This is part of the reason the research paper’s title includes the phrase “Moving Past Transformers”: it shows a way to achieve higher performance without the high resource overhead of the transformer architecture.

Another win over transformer models is in the reduction in memory usage and faster processing times. The research paper explains:

“A key advantage of RecurrentGemma is that it has a significantly smaller state size than transformers on long sequences. Whereas Gemma’s KV cache grows proportional to sequence length, RecurrentGemma’s state is bounded, and does not increase on sequences longer than the local attention window size of 2k tokens. Consequently, whereas the longest sample that can be generated autoregressively by Gemma is limited by the memory available on the host, RecurrentGemma can generate sequences of arbitrary length.”
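The contrast between a cache that grows with sequence length and a bounded state can be sketched with simple counting. The function names, the abstract "entries" unit and the single-entry recurrent state are assumptions for illustration. Only the 2k local attention window comes from the paper, taken here as 2048 tokens.

```python
def kv_cache_entries(seq_len):
    """Global attention keeps one cached key/value entry per token seen so far,
    so the cache grows in proportion to sequence length."""
    return seq_len


def bounded_state_entries(seq_len, window=2048, recurrent_state=1):
    """Local attention keeps at most `window` cached tokens, plus a
    fixed-size recurrent state (counted here as one hypothetical entry)."""
    return min(seq_len, window) + recurrent_state


for n in (1_000, 2_048, 8_192, 100_000):
    print(n, kv_cache_entries(n), bounded_state_entries(n))
```

Past the window size, the second count stops growing while the first keeps climbing. That is why a memory-limited host caps generation length for the transformer but not for the bounded-state model.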

RecurrentGemma also beats the Gemma transformer model in throughput (the amount of data that can be processed in a given time; higher is better). The transformer model’s throughput suffers at higher sequence lengths (an increase in the number of tokens or words), but that’s not the case with RecurrentGemma, which is able to maintain a high throughput.

The research paper shows:

“In Figure 1a, we plot the throughput achieved when sampling from a prompt of 2k tokens for a range of generation lengths. The throughput calculates the maximum number of tokens we can sample per second on a single TPUv5e device.

…RecurrentGemma achieves higher throughput at all sequence lengths considered. The throughput achieved by RecurrentGemma does not reduce as the sequence length increases, while the throughput achieved by Gemma falls as the cache grows.”
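Why throughput falls as a cache grows can be shown with a toy cost model. This is not a reproduction of the paper's measurements: the cost constants are arbitrary units chosen for illustration, and the function names and the assumption that each decoding step pays a fixed cost plus a cost per state entry read are hypothetical.

```python
def growing_cache(context_len):
    """State size for global attention: one entry per token in context."""
    return context_len


def bounded_state(context_len, window=2048):
    """State size for local attention: capped at the window size."""
    return min(context_len, window)


def toy_throughput(prompt_len, gen_len, state_size_fn,
                   base_cost=1.0, cost_per_entry=0.001):
    """Tokens generated per unit of time under the toy cost model.

    Each decoding step costs `base_cost` plus `cost_per_entry` for every
    state entry that must be read at that step (arbitrary units).
    """
    total_time = 0.0
    for step in range(gen_len):
        context_len = prompt_len + step
        total_time += base_cost + cost_per_entry * state_size_fn(context_len)
    return gen_len / total_time


for gen_len in (256, 2048, 8192):
    print(gen_len,
          toy_throughput(2048, gen_len, growing_cache),
          toy_throughput(2048, gen_len, bounded_state))
```

With a 2k-token prompt, the bounded-state throughput in this sketch stays flat across generation lengths, while the growing-cache throughput drops as more tokens are generated. That is the same qualitative pattern the paper describes.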

Limitations Of RecurrentGemma

The research paper does show that this approach comes with a limitation of its own: on very long sequences, which transformer models are able to handle, RecurrentGemma’s performance can lag behind traditional transformer models.

According to the paper:

“Although RecurrentGemma models are highly efficient for shorter sequences, their performance can lag behind traditional transformer models like Gemma-2B when handling extremely long sequences that exceed the local attention window.”

What This Means For The Real World

The importance of this approach is that it suggests there are other ways to improve the performance of language models while using fewer computational resources, on an architecture that is not a transformer. It also shows that a non-transformer model can overcome one of the limitations of transformer models: a cache that grows with sequence length and drives up memory usage.

This could lead to applications of language models in the near future that can function in resource-limited environments.

Read the Google DeepMind research paper:

RecurrentGemma: Moving Past Transformers for Efficient Open Language Models (PDF)

