Aiera Year End Language Model Benchmarking Review 2024

December 13, 2024


This week, Google released its Gemini 2.0 Flash Experimental model ahead of the official 2.0 release. For us, it represents another opportunity to understand how the language model landscape is evolving, and this release in particular offers some very interesting insight.

[Written by Ken Sena, Co-Founder and CEO of Aiera | December 13, 2024]

At Aiera, our ML team benchmarks every new language model release. So, we thought this would be a good time to revisit our scoring across the major providers and share what we have learned about performance and price. Specifically, we use these benchmarks to gain insight into:

1. Performance and perceived scale limitations
2. Pricing efficiencies based on newer model architectures being released

Notably, Google’s newest release, Gemini 2.0 Flash Experimental, stands out to us as being among the most economically efficient in the industry, while ranking in our top 5 for absolute performance. Why this is, and how we reach this conclusion, requires an understanding of our benchmarking approach, which we review below.

Aiera LLM Benchmarking

The Aiera ML team has developed proprietary tests to generate benchmarks on a variety of financial domain tasks, including sentiment, summarization, Q&A, and speaker identification. Our benchmarking informs which models we deploy for which tasks, offering key flexibility and responsiveness to our platform and to our users, who rely on the quality of these model-derived outputs. 

As our users tend to be institutional investors, they know well that the quality of a language model's output, measured in sourcing accuracy and citations, requires transparency.

We have developed a standardized view across the major model providers, offering contextually relevant, accurate, and well-cited information tailored to our specific use cases. Further, while benchmark performance does not always translate directly to real-world utility, holding the tests constant across providers does enable direct comparison.

Comparison of Model Providers

Aiera has benchmarked 28 unique models to date, with measured performance reflecting a quantitative assessment of output quality across many critical features of the Aiera platform, including sentiment, summarization, Q&A, and speaker identification.

As part of this process, we derive an Aiera score, which is set on a scale of 0 to 100 and based on the four measured benchmarks.
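
As a rough illustration only, here is a minimal sketch of how a composite score like this might be assembled, assuming each of the four benchmarks already yields a 0-to-100 score and that they are weighted equally; the actual Aiera weighting and normalization are proprietary:

```python
# Hypothetical illustration: combine four benchmark scores (each on a 0-100
# scale) into a single composite score via an equal-weighted average.
# The real Aiera scoring formula and weights are proprietary; this is a sketch.

BENCHMARKS = ("sentiment", "summarization", "qa", "speaker_id")

def composite_score(results: dict[str, float]) -> float:
    """Average the four benchmark scores, assuming each is on a 0-100 scale."""
    return sum(results[b] for b in BENCHMARKS) / len(BENCHMARKS)

# Example usage with made-up numbers:
example = {"sentiment": 82.0, "summarization": 74.5, "qa": 88.0, "speaker_id": 91.0}
print(round(composite_score(example), 1))  # -> 83.9
```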

While each model is scored individually, we can compare providers in terms of the maximum individual scores achieved, where OpenAI, with 11 models evaluated, had the most opportunities, followed by Anthropic and Google with 6 each.

Performance: Are We Seeing Scaling Limitations?

A question that often comes up is whether we are seeing scale limitations based on current benchmarking. This is a difficult question to answer because most models are closed source, so we don't know their exact parameter counts. However, if we take one model series, Anthropic’s Claude, we find performance improvements to be mixed. Specifically, with more recent releases, we see continued improvement in Speaker Identification and Q&A, but diminishing, and in some cases declining, performance in Sentiment and Summarization.

Price: Are We Seeing Efficiency Gains?

Again, large providers such as OpenAI, Anthropic, and Google operate as closed-source entities, offering limited transparency into their pricing models. This makes it challenging to assess the true cost-efficiency or alignment of their resource allocation with the underlying architecture and operational expenses. However, by evaluating the cost of these models relative to our benchmarked performance, we feel we can glean some insight into comparative efficiency. 


Model intelligence is a function of the complexity of the underlying neural network. Larger models perform better than their smaller counterparts with similar architectures due to their increased capacity to capture and generalize complex patterns in data. However, this performance gain often comes at the cost of higher computational requirements, longer training times, and greater energy consumption, underscoring the need for efficient architectures that optimize both size and performance.

Competing architectures, including transformer-based models, RNN variants, and emerging approaches like mixture-of-experts (MoE), aim to balance scalability, interpretability, and computational efficiency. MoE models stand out for their ability to dynamically activate only a subset of parameters, offering strong performance while managing resource consumption more effectively.
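
As a toy illustration of the sparse-activation idea behind MoE (not any provider's actual implementation), a router can score every expert for a given input but evaluate only the top-k:

```python
import numpy as np

# Toy mixture-of-experts routing sketch: the router scores all experts, but
# only the top-k experts are actually evaluated, so compute cost scales with
# k rather than with the total number of experts.
rng = np.random.default_rng(0)

num_experts, d_model, top_k = 8, 16, 2
router_w = rng.normal(size=(d_model, num_experts))           # router weights
experts = [rng.normal(size=(d_model, d_model)) for _ in range(num_experts)]

def moe_forward(x: np.ndarray) -> np.ndarray:
    logits = x @ router_w                                     # score every expert
    top = np.argsort(logits)[-top_k:]                         # indices of top-k experts
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()   # softmax over the chosen few
    # Only the selected experts are evaluated; the rest stay idle.
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

out = moe_forward(rng.normal(size=d_model))
print(out.shape)  # (16,)
```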

In other words, by throwing cost-efficiency into the mix, model selection becomes much more interesting.

Google Gemini 2.0 Flash Experimental: Supporting the Thesis

To calculate cost, we multiply the quantity of input and output tokens by their respective per-million-token prices. We then standardize the cost distribution on a scale of 0 to 100. On this measure, the Gemini Flash models are far less expensive, as they leverage a hybrid architecture combining the strengths of transformer-based architectures (used in models like GPT) with techniques tuned for efficiency and performance. Moreover, whereas these lighter-weight, faster models have historically tended to be less performant than their larger counterparts, we were impressed to find that yesterday’s Gemini 2.0 Flash Experimental release ranked in our top 5 for the year in overall performance, likely at a fraction of the cost.
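
For illustration, here is a minimal sketch of that calculation under stated assumptions: the token counts and per-million-token prices below are made up, the workload is held fixed across models, and the 0-to-100 standardization is a simple min-max scaling (Aiera's actual normalization may differ):

```python
# Hypothetical sketch of the cost comparison described above: multiply the
# input and output token counts used in a benchmark run by each model's
# per-million-token prices, then min-max scale the results onto 0-100.
# Token counts and prices are illustrative, not actual Aiera figures.

def run_cost(input_tokens: int, output_tokens: int,
             price_in_per_m: float, price_out_per_m: float) -> float:
    """Dollar cost of a benchmark run given per-million-token prices."""
    return (input_tokens / 1e6) * price_in_per_m + (output_tokens / 1e6) * price_out_per_m

def scale_0_100(costs: dict[str, float]) -> dict[str, float]:
    """Min-max standardize a set of model costs onto a 0-100 scale."""
    lo, hi = min(costs.values()), max(costs.values())
    return {m: 100 * (c - lo) / (hi - lo) for m, c in costs.items()}

# Example with a fixed benchmark workload and made-up prices (USD per 1M tokens):
workload = (2_000_000, 500_000)  # input tokens, output tokens
prices = {"model_a": (0.10, 0.40), "model_b": (2.50, 10.00), "model_c": (5.00, 15.00)}
costs = {m: run_cost(*workload, *p) for m, p in prices.items()}
print(scale_0_100(costs))  # cheapest model maps to 0, most expensive to 100
```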

We look forward to sharing more model updates to come as we focus on delivering the best possible performance for Aiera and our users.

Enjoy the holidays!

To learn more, visit aiera.com.