Benchmarking Methodology

How we measure the function calling capability of a model

This benchmark evaluates models on the following 15 parameters, which are tested across the 10 scenarios listed below. You can view the leaderboard here.

This leaderboard builds on the Berkeley Function Calling Leaderboard (BFCL) by expanding its test categories to include error handling, constraint enforcement, and function synthesis. It defines 15 evaluation parameters that cover the full function call pipeline: function discovery and selection, parameter extraction and transformation, function invocation, and output interpretation. Each parameter is tested through targeted scenarios that assess model behavior under both standard and edge-case conditions.

The tests are run using promptfoo. The scenarios (consisting of the user prompts, function specifications, and function outputs) are written in Markdown and are accompanied by a single YAML file used by the test runner. A few custom scripts analyze the test results and turn them into the leaderboard page. All of this data and code can be found in this repository.

Parameters

function-discovery

 The model must identify and catalog available functions from specifications in the conversation history, including user-provided and self-generated functions.

function-selection

 The model must select the correct function(s) from the list of discovered functions to accomplish a certain task, avoiding hallucination of non-existent tools.

parameter-extraction

 The model must extract the appropriate pieces of information from the conversation to use as arguments when calling a function, taking note of the expected types, formats, and required parameters.

parameter-transformation

 The model must transform user input from natural language or other formats into the precise data types and structures required by the function's arguments.

function-calling

 The model must be able to produce valid, executable function calls that conform to the defined API, with correctly named parameters and formatted arguments.

context-understanding

 The model must make use of information from earlier in the conversation, including data returned by a function, to generate helpful, user-friendly responses and inform subsequent actions.

parallel-calling

 The model must identify when multiple independent function calls can be made concurrently, and support receiving their outputs out of order.

composite-calling

 The model must be able to plan ahead and chain function calls in sequence, where the output of one function is used as the input for the next.

no-hallucinations

 The model must not hallucinate (e.g., calling functions that do not exist, generating function outputs on its own, passing made-up parameters, etc.).

error-handling

 The model must be able to handle errors returned by a function, understand why the error occurred, and decide on a suitable course of action such as retrying or informing the user.

missing-functions

 The model must be able to identify when it does not have access to an appropriate function to fulfill a request and inform the user about this limitation.

missing-parameters

 The model must identify when it lacks a piece of information required to call a function, and ask the user instead of making an assumption.

handling-ambiguity

 When faced with a vague request, the model must use the conversation history to make an informed assumption, or ask the user for clarification, as appropriate.

constraint-adherence

 The model must adhere to arbitrary constraints imposed by the user or the environment, such as being told not to use a certain function to perform a task.

function-generation

 The model must be able to generate function code and specifications based on a user's request, and then subsequently use that new function in the conversation.

Scenarios

Simple Function Call
function-discovery parameter-extraction function-calling

functions

get_pr_details(repo: str, pr_number: int) -> dict

question

 What's the status of PR #1138 in the 'frontend' repo?

expected

 The model should identify the single available function, extract the string 'frontend' and the integer '1138' directly from the text, and execute a correct function call.
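As a sketch, with a hypothetical stub standing in for the function the test harness provides, the expected call looks like this:

```python
# Hypothetical stub of get_pr_details; the real output comes from the test
# harness. Shown only to illustrate the expected call shape.
def get_pr_details(repo: str, pr_number: int) -> dict:
    return {"repo": repo, "pr_number": pr_number, "status": "open"}

# Expected behavior: 'frontend' stays a string, '1138' becomes an integer.
result = get_pr_details(repo="frontend", pr_number=1138)
```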



Parameter Extraction
function-discovery parameter-extraction function-calling

functions

create_deployment(app_name: str, version: str, replicas: int, memory_mb: int) -> dict

question

 Deploy the 'analytics-service' app version 'v2.1.0' with 3 replicas and 512MB memory.

expected

 The model should correctly identify and extract all parameters. It must convert the string '512MB' into the integer 512 to match the function's type signature.
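A sketch of the expected call, using a hypothetical stub in place of the real function:

```python
# Hypothetical stub of create_deployment, used only to show the expected call.
def create_deployment(app_name: str, version: str, replicas: int, memory_mb: int) -> dict:
    return {"app": app_name, "version": version,
            "replicas": replicas, "memory_mb": memory_mb}

# The string '512MB' from the prompt must arrive as the integer 512.
result = create_deployment(
    app_name="analytics-service",
    version="v2.1.0",
    replicas=3,
    memory_mb=512,
)
```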



Parameter Transformation
function-discovery parameter-transformation function-calling

functions

schedule_backup(database_name: str, schedule_cron: str, retention_days: int) -> dict

question

 Schedule a backup for the 'orders' database every day at 2 AM, and keep backups for 2 weeks.

expected

 The model must transform the natural language inputs into the required formats. It should convert 'every day at 2 AM' into the cron string 0 2 * * * and '2 weeks' into the integer 14.
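The expected transformation, sketched with a hypothetical stub:

```python
# Hypothetical stub; the harness provides the real function output.
def schedule_backup(database_name: str, schedule_cron: str, retention_days: int) -> dict:
    return {"db": database_name, "cron": schedule_cron, "retention_days": retention_days}

# 'every day at 2 AM' -> '0 2 * * *' (minute 0, hour 2, every day)
# '2 weeks'           -> 14 (days, as an integer)
result = schedule_backup("orders", "0 2 * * *", 14)
```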



Resolving Ambiguous Function Choice
function-discovery function-selection handling-ambiguity parameter-transformation function-calling

functions

search_products(query: str, category: str = None, price_range: tuple = None, brand: str = None, rating_min: float = None) -> list
filter_products(product_ids: list, filters: dict) -> list
get_product_recommendations(user_id: str, category: str = None, price_max: float = None) -> list
get_similar_products(product_id: str, similarity_threshold: float = 0.7) -> list

question

 Find me some wireless headphones under $100 with good reviews.

expected

 The model should select search_products and properly infer parameters: query="wireless headphones", price_range=(0, 100), and rating_min=4.0 (inferred from "good reviews"). It must recognize that filter_products requires existing product IDs, get_product_recommendations needs a user_id, and get_similar_products needs a reference product, none of which are available.
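A sketch of the expected selection and the inferred arguments; the stub and the specific rating threshold are assumptions the model must make from "under $100" and "good reviews":

```python
# Hypothetical stub of search_products, to illustrate the expected call only.
def search_products(query, category=None, price_range=None, brand=None, rating_min=None):
    return [{"query": query, "price_range": price_range, "rating_min": rating_min}]

# price_range and rating_min are inferred, not stated verbatim in the prompt.
results = search_products(
    query="wireless headphones",
    price_range=(0, 100),
    rating_min=4.0,
)
```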



Handling Missing Information
function-discovery missing-functions missing-parameters context-understanding function-selection function-calling

functions

list_calendar_events(start_date: str, end_date: str) -> list
send_email(recipient_email: str, subject: str, body: str) -> dict

question

Book 'Conference Room 4B' for 10 AM tomorrow for a meeting to plan out Q3.

followup

Okay, in that case, just send an email to book it.

expected

This is a multi-turn test. First, the model must recognize it lacks a tool to book rooms and inform the user. When the user asks the model to send an email instead, it must identify the send_email tool but recognize that it is missing a required parameter (the recipient's email address). After asking for and receiving this information, it must successfully call the send_email function using context from the entire conversation.



Stateful Composite Calling
function-discovery function-selection context-understanding composite-calling

functions

get_document_status(doc_id: str) -> dict
submit_for_review(doc_id: str, reviewer_id: str) -> dict
approve_document(doc_id: str, approver_id: str) -> dict

question

 Move document 'DOC-001' to the next stage in the approval workflow.

expected

 The model must first call get_document_status to determine the current state. Based on the output, it should then select and call the correct subsequent function (e.g., submit_for_review or approve_document).
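The status-driven branch can be sketched as follows; all stubs, statuses, and IDs like reviewer-1 are hypothetical, since the harness controls the actual workflow state:

```python
# Hypothetical stubs sketching the stateful branch the model must perform.
def get_document_status(doc_id: str) -> dict:
    return {"doc_id": doc_id, "status": "draft"}  # harness-controlled in practice

def submit_for_review(doc_id: str, reviewer_id: str) -> dict:
    return {"doc_id": doc_id, "status": "in_review", "reviewer": reviewer_id}

def approve_document(doc_id: str, approver_id: str) -> dict:
    return {"doc_id": doc_id, "status": "approved", "approver": approver_id}

status = get_document_status("DOC-001")
# The second call depends on the returned state, not on a fixed plan.
if status["status"] == "draft":
    result = submit_for_review("DOC-001", reviewer_id="reviewer-1")
elif status["status"] == "in_review":
    result = approve_document("DOC-001", approver_id="approver-1")
```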



Parallel Function Calling
function-discovery function-selection context-understanding parameter-transformation parallel-calling

functions

crm_get_customer(customer_id: str) -> dict
billing_get_invoices(customer_id: str) -> list
support_get_tickets(customer_id: str) -> list

question

 Give me a complete profile for customer 'CUST-789' across all systems.

expected

 The model should recognize the need to gather data from multiple systems. It must make parallel calls to crm_get_customer, billing_get_invoices, and support_get_tickets, then use the collected data to give a comprehensive response to the user.
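Because the three calls share no data dependencies, they can be issued concurrently. A sketch with hypothetical stubs, simulating the parallelism with a thread pool:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stubs; the harness supplies the real outputs.
def crm_get_customer(customer_id):
    return {"id": customer_id, "name": "Acme Corp"}

def billing_get_invoices(customer_id):
    return [{"invoice": "INV-1"}]

def support_get_tickets(customer_id):
    return [{"ticket": "TKT-1"}]

# All three calls are independent, so they can run in parallel and their
# results can arrive in any order.
with ThreadPoolExecutor() as pool:
    customer, invoices, tickets = (
        f.result()
        for f in [
            pool.submit(crm_get_customer, "CUST-789"),
            pool.submit(billing_get_invoices, "CUST-789"),
            pool.submit(support_get_tickets, "CUST-789"),
        ]
    )
```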



Cascading Error Recovery
function-discovery function-selection error-handling composite-calling

functions

get_user_from_cache(user_id: str) -> dict
get_user_from_database(user_id: str) -> dict
get_user_from_ldap(username: str) -> dict

question

 Get user details for user_id 'u-999'. Scenario: Cache returns not_found, then database returns connection_error.

expected

 The model must demonstrate a robust error handling chain. After the cache fails, it should attempt the database. When the database fails, it must not give up but instead ask the user for more information (username) to try the next available tool (get_user_from_ldap).
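The expected fallback chain can be sketched as follows, with hypothetical stubs reproducing the scripted failures from the scenario:

```python
# Hypothetical stubs that reproduce the scripted failures.
def get_user_from_cache(user_id: str) -> dict:
    return {"error": "not_found"}

def get_user_from_database(user_id: str) -> dict:
    return {"error": "connection_error"}

def lookup_user(user_id: str) -> dict:
    """Sketch of the fallback chain the model is expected to follow."""
    result = get_user_from_cache(user_id)
    if result.get("error") == "not_found":
        result = get_user_from_database(user_id)
    if result.get("error") == "connection_error":
        # get_user_from_ldap needs a username, which is not in the
        # conversation, so the expected move is to ask the user, not guess.
        return {"action": "ask_user", "missing": "username"}
    return result

outcome = lookup_user("u-999")
```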



Adhering to Contextual Constraints
function-discovery function-selection constraint-adherence context-understanding

functions

view_sensitive_data(resource_id: str) -> dict  # Requires level-3 clearance
view_summary_data(resource_id: str) -> dict   # Requires level-2 clearance

question

 Show me the data for resource 'RES-001'. (the user has level-2 clearance)

expected

 The model must understand the user's permission level from the conversational context. It should treat this as an implicit constraint, filter the available functions, and select view_summary_data as the only permissible tool.



Dynamic Function Generation
function-generation function-discovery parameter-transformation parallel-calling

functions

(None provided initially)

question

 I need to score incoming leads. Can you write a Python function to calculate a lead_score from a user dictionary? Give +10 points if the job_title is "Software Engineer", +5 if country exists, and +20 if company_size is over 1000.

followup

Score a lead on the following candidates.

Job Title: Software Engineer
Country: USA
Company Size: 5000

Job Title: Product Manager
Company Size: 500

expected

 The model must generate a Python function containing the specified conditional logic and arithmetic. The function should also gracefully handle cases where keys are missing. It should then immediately discover and call this new function twice, passing each candidate's details as a dictionary, and return the calculated scores.
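One possible implementation of the requested function; the key names (job_title, country, company_size) follow the prompt, and missing keys default to no points:

```python
def lead_score(user: dict) -> int:
    """Score a lead per the prompt's rules, tolerating missing keys."""
    score = 0
    if user.get("job_title") == "Software Engineer":
        score += 10
    if user.get("country"):
        score += 5
    if user.get("company_size", 0) > 1000:
        score += 20
    return score

# First candidate: 10 (title) + 5 (country) + 20 (size > 1000) = 35
first = lead_score({"job_title": "Software Engineer", "country": "USA", "company_size": 5000})
# Second candidate: no matching title, no country, size <= 1000 = 0
second = lead_score({"job_title": "Product Manager", "company_size": 500})
```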