Evaluating GPT-4o on Financial Tasks
On May 13, OpenAI announced its new flagship model, GPT-4o, capable of reasoning over visual, audio, and text input in real time. The model brings clear gains in response time and multimodality, with millisecond-scale latency for voice processing, comparable to human conversational cadence. OpenAI has also reported performance comparable to GPT-4 across a set of standard evaluation benchmarks, including MMLU, GPQA, MATH, HumanEval, MGSM, and DROP for text. However, standard model benchmarks often focus on narrow, artificial tasks that don’t necessarily translate to aptitude in real-world applications. For example, both the MMLU and DROP datasets have been reported to be highly contaminated, with significant overlap between training and test samples.
At Aiera, we employ LLMs on a variety of financial-domain tasks, including topic identification, summarization, speaker identification, sentiment analysis, and financial mathematics. We’ve developed our own set of high-quality benchmarks for the evaluation of LLMs in order to perform intelligent model selection for each of these tasks, so that our clients can be confident in our commitment to the highest performance.
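To make tasks like sentiment analysis comparable across models, a benchmark ultimately reduces to scoring model outputs against gold labels. The sketch below is a minimal, hypothetical illustration of that scoring loop (not Aiera’s actual harness; the toy model stands in for a real LLM API call):

```python
# Hypothetical sketch of a sentiment-benchmark scoring loop.
# A real harness would call a model API; toy_model is a stand-in.

def evaluate_sentiment(model_fn, examples):
    """Return accuracy of model_fn over (text, gold_label) pairs."""
    correct = 0
    for text, gold in examples:
        pred = model_fn(text).strip().lower()
        if pred == gold:
            correct += 1
    return correct / len(examples)

def toy_model(text):
    # Trivial stand-in classifier for illustration only.
    return "positive" if "beat" in text.lower() else "negative"

examples = [
    ("Q1 revenue beat consensus estimates.", "positive"),
    ("Margins contracted amid rising input costs.", "negative"),
]
print(evaluate_sentiment(toy_model, examples))  # → 1.0
```

Running the same labeled set against each candidate model yields the per-task scores compared below.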
In our tests, GPT-4o trailed both Anthropic’s Claude 3 models and OpenAI’s prior model releases in several domains, including identifying the sentiment of financial text, performing computations against financial documents, and identifying speakers in raw event transcripts. Its reasoning performance (BBH) exceeded Claude 3 Haiku’s but fell behind the other OpenAI GPT models, Claude 3 Sonnet, and Claude 3 Opus.
[Figure: Spider plot comparison of task performance]
[Table: Raw evaluation scores]
As for speed, we used the LLMPerf library to measure and compare performance across Anthropic’s and OpenAI’s model endpoints. We performed a modest analysis, running 100 synchronous requests against each model with the LLMPerf tooling as below:
python token_benchmark_ray.py \
  --model "gpt-4o" \
  --mean-input-tokens 550 \
  --stddev-input-tokens 150 \
  --mean-output-tokens 150 \
  --stddev-output-tokens 10 \
  --max-num-completed-requests 100 \
  --timeout 600 \
  --num-concurrent-requests 1 \
  --results-dir "result_outputs" \
  --llm-api openai \
  --additional-sampling-params '{}'
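LLMPerf writes its results as JSON into the directory passed via --results-dir. A short sketch for collecting those summaries afterward (the "*_summary.json" naming is an assumption; check the output of your LLMPerf version):

```python
# Hedged sketch: gather LLMPerf summary files from the results directory.
# The "*_summary.json" filename pattern is assumed, not guaranteed.
import glob
import json
import os

def load_summaries(results_dir):
    """Return {filename: parsed JSON} for each summary file found."""
    summaries = {}
    for path in glob.glob(os.path.join(results_dir, "*_summary.json")):
        with open(path) as f:
            summaries[os.path.basename(path)] = json.load(f)
    return summaries

for name, summary in load_summaries("result_outputs").items():
    print(name, summary)
```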
[Table: Performance test results]
Our results reflect the touted speedup over gpt-3.5-turbo and gpt-4-turbo. Despite the improvement, Claude 3 Haiku remains superior on all dimensions except time to first token and, given its performance advantage, remains the best choice for timely text analysis.
Despite its shortcomings on the highlighted tasks, I’d like to note that the OpenAI release remains impressive for its multimodality and as an indication of a world to come. GPT-4o is presumably quantized, given its impressive latency, and may therefore suffer some performance degradation from the reduced precision. The lighter computational burden enables faster processing but introduces potential errors in output accuracy and variations in model behavior across different types of data. These trade-offs necessitate careful tuning of the quantization parameters to maintain a balance between efficiency and effectiveness in practical applications.
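The precision cost of quantization is easy to see in miniature. This toy example (illustrative only; GPT-4o’s internals are not public) round-trips float32 weights through symmetric int8 quantization and measures the reconstruction error, which is bounded by half the quantization step:

```python
# Toy illustration of quantization's precision trade-off: round-trip
# float32 weights through int8 and measure the per-weight error.
import numpy as np

rng = np.random.default_rng(0)
weights = rng.standard_normal(1024).astype(np.float32)

# Symmetric linear quantization: map the max-magnitude weight to 127.
scale = np.abs(weights).max() / 127.0
quantized = np.round(weights / scale).astype(np.int8)
dequantized = quantized.astype(np.float32) * scale

# Round-to-nearest keeps each error within half a quantization step.
max_err = np.abs(weights - dequantized).max()
print(f"max per-weight error: {max_err:.6f} (bound: {scale / 2:.6f})")
```

Storing int8 instead of float32 cuts memory traffic 4x, which is where much of the latency win comes from; the printed error is the accuracy price paid per weight.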
Consistent with OpenAI’s release cadence to date, subsequent versions of the model will roll out in the coming months and will doubtless demonstrate significant performance jumps across domains.
To learn more about how Aiera can support your research, check out our website or send us a note at hello@aiera.com.