Benchmarking Methodology
How we measure the function calling capability of a model
This benchmark evaluates models on the following 15 parameters, which are tested across the 10 scenarios listed below. You can view the leaderboard here.
This leaderboard builds on the Berkeley Function Calling Leaderboard (BFCL) by expanding its test categories to include error handling, constraint enforcement, and function synthesis. It defines 15 evaluation parameters that cover the full function call pipeline: function discovery and selection, parameter extraction and transformation, function invocation, and output interpretation. Each parameter is tested through targeted scenarios that assess model behavior under both standard and edge-case conditions.
The tests are run using promptfoo. The scenarios (consisting of the user prompts, functions specifications, and function outputs) are written in markdown, and are accompanied by a single yaml file that is used by the test runner. A few custom scripts are used to analyze the results of the tests, and turn them into the leaderboard page. All of this data and code can be found in this repository.
Parameters
function-discovery
The model must identify and catalog available functions from specifications in the conversation history, including user-provided and self-generated functions.
function-selection
The model must select the correct function(s) from the list of discovered functions to accomplish a certain task, avoiding hallucination of non-existent tools.
parameter-extraction
The model must extrapolate appropriate pieces of information from the conversation to use as arguments when calling a function, taking note of the expected types, formats, and required parameters.
parameter-transformation
The model must transform user input from natural language or other formats into the precise data types and structures required by the function's arguments.
function-calling
The model must be able to produce valid, executable function calls that conform to the defined API, with correctly named parameters and formatted arguments.
output-understanding
The model must make use of the data returned by a function to generate helpful, user-friendly responses and inform subsequent actions.
parallel-calling
The model must identify when multiple independent function calls can be made concurrently, and support receiving their outputs out of order.
composite-calling
The model must be able to plan ahead and chain function calls in sequence, where the output of one function is used as the input for the next.
context-application
The model must be able to use information from earlier in the conversation, including user prompts and previous function call outputs, for subsequent tasks.
error-handling
The model must be able to handle errors returned by a function, understand why the error occurred, and decide on a suitable course of action such as retrying or informing the user.
missing-functions
The model must be able to identify when it does not have access to an appropriate function to fulfill a request and inform the user about this limitation.
missing-parameters
The model must identify when it lacks a piece of information required to call a function, and ask the user instead of making an assumption.
handling-ambiguity
When faced with a vague request, the model must ask the user about their intent or the specific entities involved, rather than guessing.
constraint-adherence
The model must adhere to arbitrary constraints imposed by the user or the environment, such as being told not to use a certain function to perform a task.
function-generation
The model must be able to generate function code and specifications based on a user's request, and then subsequently use that new function in the conversation.
Scenarios
Simple Function Call
function-discovery
parameter-extraction
function-calling
functions
get_pr_details(repo: str, pr_number: int) -> dict
question
What's the status of PR #1138 in the 'frontend' repo?
expected
The model should identify the single available function, extract the string 'frontend' and the integer '1138' directly from the text, and execute a correct function call.
Parameter Extraction
parameter-transformation
function-calling
functions
create_deployment(app_name: str, version: str, replicas: int, memory_mb: int) -> dict
question
Deploy the 'analytics-service' app version 'v2.1.0' with 3 replicas and 512MB memory.
expected
The model should correctly identify and extract all parameters. It must convert the string '512MB' into the integer 512
to match the function's type signature.
Parameter Transformation
parameter-transformation
function-calling
functions
schedule_backup(database_name: str, schedule_cron: str, retention_days: int) -> dict
question
Schedule a backup for the 'orders' database every day at 2 AM, and keep backups for 2 weeks.
expected
The model must transform the natural language inputs into the required formats. It should convert 'every day at 2 AM' into the cron string 0 2 * * *
and '2 weeks' into the integer 14
.
Resolving Ambiguous Function Choice
function-selection
handling-ambiguity
parameter-transformation
functions
search_products(query: str, category: str = None, price_range: tuple = None, brand: str = None, rating_min: float = None) -> list
filter_products(product_ids: list, filters: dict) -> list
get_product_recommendations(user_id: str, category: str = None, price_max: float = None) -> list
get_similar_products(product_id: str, similarity_threshold: float = 0.7) -> list
question
Find me some wireless headphones under $100 with good reviews.
expected
The model should select search_products
and properly infer parameters: query="wireless headphones"
, price_range=(0, 100)
, and rating_min=4.0
(inferred from "good reviews"). It must recognize that filter_products
requires existing product IDs, get_product_recommendations
needs a user_id, and get_similar_products
needs a reference product - none of which are available.
Handling Missing Information
missing-functions
missing-parameters
context-application
function-selection
functions
list_calendar_events(start_date: str, end_date: str) -> list
send_email(recipient_email: str, subject: str, body: str) -> dict
question
Book 'Conference Room 4B' for 10 AM tomorrow for a meeting to plan out Q3.
followup
Okay, in that case, just send an email to book it.
expected
This is a multi-turn test. First, the model must recognize it lacks a tool to book rooms and inform the user. When the user asks the model to send an email instead, it must then identify the send_email
tool but recognize it's missing parameters (recipient). After asking for and receiving this information, it must successfully call the send_email
function using context from the entire conversation.
Stateful Composite Calling
composite-calling
context-application
output-understanding
functions
get_document_status(doc_id: str) -> dict
submit_for_review(doc_id: str, reviewer_id: str) -> dict
approve_document(doc_id: str, approver_id: str) -> dict
question
Move document 'DOC-001' to the next stage in the approval workflow.
expected
The model must first call get_document_status
to determine the current state. Based on the output, it should then select and call the correct subsequent function (e.g., submit_for_review
or approve_document
).
Parallel Function Calling
parallel-calling
output-understanding
parameter-transformation
functions
crm_get_customer(customer_id: str) -> dict
billing_get_invoices(customer_id: str) -> list
support_get_tickets(customer_id: str) -> list
merge_customer_profile(crm_data: dict, billing_data: list, support_data: list) -> dict
question
Give me a complete profile for customer 'CUST-789' across all systems.
expected
The model should recognize the need to gather data from multiple systems. It must make parallel calls to crm_get_customer
, billing_get_invoices
, and support_get_tickets
, then use the collected data to call merge_customer_profile
.
Cascading Error Recovery
error-handling
composite-calling
function-selection
functions
get_user_from_cache(user_id: str) -> dict
get_user_from_database(user_id: str) -> dict
get_user_from_ldap(username: str) -> dict
question
Get user details for user_id 'u-999'. Scenario: Cache returns not_found, then database returns connection_error.
expected
The model must demonstrate a robust error handling chain. After the cache fails, it should attempt the database. When the database fails, it must not give up but instead ask the user for more information (username
) to try the next available tool (get_user_from_ldap
).
Adhering to Contextual Constraints
constraint-adherence
function-selection
context-application
functions
view_sensitive_data(resource_id: str) -> dict # Requires level-3 clearance
view_summary_data(resource_id: str) -> dict # Requires level-2 clearance
question
Show me the data for resource 'RES-001'. (the user has level-2 clearance)
expected
The model must understand the user's permission level from the conversational context. It should treat this as an implicit constraint, filter the available functions, and select view_summary_data
as the only permissible tool.
Dynamic Function Generation
function-generation
function-discovery
parameter-transformation
parallel-calling
functions
(None provided initially)
question
I need to score incoming leads. Can you write a Python function to calculate a lead_score
from a user dictionary? Give +10 points if the job_title
is "Software Engineer", +5 if country
exists, and +20 if company_size
is over 1000.
followup
Score a lead on the following candidates.
Job Title: Software Engineer
Country: USA
Company Size: 5000
Job Title: Product Manager
Company Size: 500
expected
The model must generate a Python function containing the specified conditional logic and arithmetic. The function should also gracefully handle cases where keys might be missing. It should then immediately discover and call this new function twice with the provided details in the form of a dictionary to return the calculated scores.