July 11, 2024
To date, we’ve combated errors and hallucinations through human-in-the-loop validation and benchmarking. As we expand into new applications, our requirements have grown, and we use internal benchmarks to ensure we’re using the right model in the right place.
Leaderboards rank model performance on popular benchmarks such as ARC, HellaSwag, MMLU, GSM8K, and TruthfulQA. But while these standard benchmarks help assess how well models generalize across a wide range of tasks, they may not effectively measure how well a model performs in areas requiring highly specialized knowledge or skills. This gap can create a false sense of model competency and superiority. Projects like Hugging Face Datasets function as communal repositories for diverse natural language tasks, but community datasets vary significantly in quality and format, may contain errors and inconsistencies, and often lack thorough documentation. In this article, I’ll outline some lessons learned from benchmarking model performance on financial question-and-answer tasks focused on multi-step computation.
Quantitative question answering requires domain comprehension, data extraction, and the execution of numerical operations, making it among the most challenging tasks for LLMs. In 2021, researchers from the University of Pennsylvania, J.P. Morgan, and Amazon published “FinQA: A Dataset of Numerical Reasoning over Financial Data,” introducing a dataset of 8,281 annotated QA pairs built against publicly available earnings reports of S&P 500 companies from 1999 to 2019 (Zheng et al., 2021). Each task is represented as a single question-and-answer pair derived from tabular and textual data in an earnings report. The original formulation distills the answer reasoning into a program of mathematical and tabular operations: add, subtract, multiply, divide, greater, exp, table-max, table-min, table-sum, and table-average.
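For illustration only (this pair is invented, not drawn from the dataset), a FinQA-style task might look like:

Question: what was the percentage change in net revenue from 2018 to 2019?
Program: subtract(1200, 1000), divide(#0, 1000)
Answer: 0.2

where #0 references the result of the first operation.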
For this project, I used the PIXIU FinQA dataset available on Hugging Face here. PIXIU evaluated model responses to the questions for exact-match accuracy, focusing on the final generation rather than the intermediate computation steps. For the purpose of side-by-side model ranking, I only cared about the model’s ability to surface the correct result to the user. Their data pairs the full prompt in a query field with a target answer field.
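If you’d like to inspect the structure yourself, here’s a quick sketch using the Hugging Face datasets library (the query and answer fields and the test split are the ones referenced by the task config below):

# Peek at the test split used by the evaluation task.
from datasets import load_dataset

finqa = load_dataset("TheFinAI/flare-finqa", split="test")
print(finqa)               # features and row count
print(finqa[0]["query"])   # full prompt (context plus question), used as doc_to_text
print(finqa[0]["answer"])  # target answer, used as doc_to_target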
To run the evaluation, I used EleutherAI’s lm-evaluation-harness to execute an evaluation task on the FinQA dataset. For those new to the lm-evaluation-harness, it’s an excellent open-source tool that can be used to template model evaluation tasks. A guide for configuring new tasks can be found in the lm-eval docs here, and users can quickly get started with a number of major model providers. Tasks reference Hugging Face dataset paths and are configurable with a variety of generation and evaluation options. To set up my task, I created a tasks directory in my project and a subdirectory tasks/finqa. Then, I created a yaml spec for the flare_finqa task referencing the original dataset in tasks/finqa/flare_finqa.yaml:
task: flare_finqa
dataset_path: TheFinAI/flare-finqa
training_split: null
validation_split: null
test_split: test
doc_to_text: query
doc_to_target: answer
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 100
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true
I also set up a utils.py file to postprocess model results.
def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])
    # hack fixes to string formatting: pad the target so its precision matches
    # a two-decimal format, e.g. "1.5" -> "1.50" and "14" -> "14.00"
    if target[-2] == ".":
        target = target + "0"
    elif "." not in target:
        target = target + ".00"
    exact_match_manual = 1 if completion == target else 0
    return {
        "exact_match_manual": exact_match_manual
    }
I added a hack fix for the float-to-string formatting in the dataset, which impacts the precision reflected in the target string. Additionally, I noticed that the OpenAI models were prematurely stopping on double newlines (likely a default in the lm-eval-harness), so I added a stop token via the until field in generation_kwargs.
I used the lm-evaluation-harness’s Python API rather than the CLI tools because I wanted to run some tests in a Jupyter notebook. I found the API simple and useful, though the CLI is documented as the default interface.
from lm_eval import tasks
from lm_eval.evaluator import simple_evaluate
from lm_eval.models.openai_completions import OpenaiChatCompletionsLM

task_name = "flare_finqa"
model_name = "gpt-4-turbo-2024-04-09"
model = OpenaiChatCompletionsLM(model_name)

# point the task manager at the directory containing the custom task yaml
task_manager = tasks.TaskManager(include_path="path/to/tasks")

results = simple_evaluate(
    model=model,
    tasks=[task_name],
    num_fewshot=0,
    task_manager=task_manager,
    write_out=True,
)
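simple_evaluate returns a dictionary whose results entry is keyed by task name; printing it is a quick way to see the aggregate scores (the exact metric key names can vary by harness version):

# Aggregate metrics for the task; metric key names depend on the harness version.
print(results["results"][task_name])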
I ran gpt-4-turbo-2024-04-09 against a subset of 100 dataset samples and observed an exact match score of 0.0. Suspecting I’d fumbled something, I logged the completions.
Consistent with past experience using chat models on targeted tasks, I found that models often disregard instructions to report only the result and show a clear preference for including their explanation. Looking back at the dataset query, the prompt also does little to specify the format and precision of the desired result. For the purpose of this test, I decided to allow the models to generate their explanation, but discard that explanation before evaluation. Comparing the verbose gpt-4-turbo-2024-04-09 output with the dataset answers, I found several cases of incorrect calculations in the original dataset. One issue was the conflation of the words portion, ratio, and proportion in calculations reported as a decimal proportion. The semantic difference is small, but portion refers to the quantity allocated: if 30 balls out of a total of 100 are green, the portion of balls that are green is 30, the decimal proportion of green balls is 0.3, and the percentage is 30%. Ratios were also used to mean decimal proportion in the dataset. To give the models the best chance of success, I modified the prompt to specify a decimal percentage as the output.
I added a further specification that the result be unitless and reported to two decimal places of precision; the new prompt appears in the doc_to_text template of the updated task yaml below.
Due to the other errors I discovered, I decided to manually verify the calculations in the set. The verification was arduous, so this dataset is only a small, 91-sample subset of the original test set (available here).
The new yaml for the task is:
task: flare_finqa
dataset_path: Aiera/finqa-verified
training_split: null
validation_split: null
test_split: test
doc_to_target: answer
doc_to_text: "Context:\n{{context}}\n\nGiven the context, \
  {{question}} Report your answer using the following format:\n\
  Explanation: Explanation of calculation\n\
  Formatted answer: Float number to two decimal point precision and no units\n"
process_results: !function utils.process_results_gen
generation_kwargs:
  max_gen_toks: 500
  do_sample: False
  temperature: 0.0
  until:
    - "<s>"
metric_list:
  - metric: exact_match
    aggregation: mean
    higher_is_better: true
    ignore_case: true
    ignore_punctuation: false
  - metric: exact_match_manual
    aggregation: mean
    higher_is_better: true
The doc_to_text field specifies a Jinja prompt template used to compose the prompt from the bracketed dataset columns at runtime. The post-generation processing in my utils.py extracts the formatted answer:
def process_results_gen(doc, results):
    completion = results[0]
    target = str(doc["answer"])
    # keep only the text after the final colon, i.e. the formatted answer
    if "formatted answer:" in completion.lower():
        completion_splits = completion.split(":")
        completion = completion_splits[-1].strip()
    # hack fix for string formatting: pad the target to two decimal places
    if target[-2] == ".":
        target = target + "0"
    elif "." not in target:
        target = target + ".00"
    exact_match_manual = 1 if completion == target else 0
    return {
        "exact_match_manual": exact_match_manual
    }
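As a quick sanity check of the parsing, here’s a hypothetical document and completion run through process_results_gen (the values are invented for illustration and aren’t from the dataset):

# Hypothetical inputs, for illustration only.
doc = {"answer": 0.3}
results = [
    "Explanation: 30 of the 100 balls are green, and 30 / 100 = 0.30\n"
    "Formatted answer: 0.30"
]
print(process_results_gen(doc, results))  # {'exact_match_manual': 1}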
Running the updated task, I found claude-3-opus to be the winner, followed by gemini-1.5-pro, then gpt-4-turbo-2024-04-09.
Because this testing set is a much smaller subset of the original dataset, I wanted to measure how confidently the smaller sample could represent the model’s broader performance. In the yaml, I specified an exact_match evaluation metric that scores each trial as a 1 for a hit (correct computation) or a 0 for a miss (incorrect). The resulting outputs follow a discrete Bernoulli distribution, where the value 1 occurs with probability p and 0 occurs with probability q = 1 - p. Using this distribution, we can establish the minimum dataset size needed to understand the model’s performance on this specific task.
The lm-eval-harness reports the standard error associated with the exact_match calculation. For a Z-score of 1.96 and a margin of error of 0.02 score points, we can calculate the minimum number of samples needed to evaluate performance at the 95% confidence level.
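For reference, the textbook minimum-sample-size formula for estimating a Bernoulli proportion is n = Z² · p · q / E². A minimal sketch, assuming p is taken from the observed exact_match score (this is my reconstruction of the calculation, not the post’s exact code):

import math

def min_samples(p_hat: float, z: float = 1.96, margin: float = 0.02) -> int:
    """Minimum trials needed to estimate a Bernoulli proportion p_hat
    to within +/- margin at the confidence level implied by z."""
    variance = p_hat * (1.0 - p_hat)  # Bernoulli variance p * q
    return math.ceil(z ** 2 * variance / margin ** 2)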
Our 91-sample subset exceeds n across models, so we can be reasonably confident these scores represent model performance on this specific task and dataset. In closing, this sufficiency demonstrates why smaller, high-integrity datasets are the most valuable for evaluating model competence. Natural follow-ups to this brief exploration include evaluations of significant digits and unit comprehension, as well as expansion into other datasets such as ConvFinQA, using few-shot and chain-of-thought prompting.