Offline Function Calling Leaderboard

A comprehensive evaluation of the function calling capabilities of offline, multimodal large language models across a variety of scenarios.

Evaluation ID: eval-combined-mcyy480a
Generated: 7/11/2025, 8:32:32 PM
Models: 25 · Scenarios: 10 · Parameters: 15 · Total Tests: 650
| Rank | Model | Scenario Score | Parameter Score | Average Latency (sec) ^ | Composite Score * |
|------|-------|----------------|-----------------|-------------------------|-------------------|
| 1 | Gemma 3 27B Instr. Tuned QAT | 0.92 | 0.89 | 27.0145 | 0.87 |
| 2 | Gemma 3 27B Instr. Tuned Q4_K_M | 0.88 | 0.85 | 19.0812 | 0.84 |
| 3 | Gemma 3 27B | 0.85 | 0.85 | 20.1868 | 0.82 |
| 4 | Gemma 3 12B | 0.85 | 0.75 | 8.7492 | 0.79 |
| 5 | Gemma 3 12B Instr. Tuned Q8_0 | 0.81 | 0.70 | 10.9550 | 0.75 |
| 6 | Gemma 3 12B Instr. Tuned Q4_K_M | 0.77 | 0.81 | 8.4982 | 0.79 |
| 7 | Gemma 3 12B Instr. Tuned QAT | 0.77 | 0.67 | 13.1357 | 0.71 |
| 8 | Gemma 3 4B Instr. Tuned FP16 | 0.69 | 0.77 | 6.3084 | 0.73 |
| 9 | Gemma 3 4B Instr. Tuned QAT | 0.65 | 0.74 | 7.3117 | 0.70 |
| 10 | Gemma 3 4B Instr. Tuned Q4_K_M | 0.65 | 0.72 | 3.2875 | 0.69 |
| 11 | Gemma 3 4B | 0.62 | 0.74 | 3.4160 | 0.69 |
| 12 | Gemma 3 4B Instr. Tuned Q8_0 | 0.62 | 0.68 | 4.0405 | 0.66 |
| 13 | Gemma 3n E4B Instr. Tuned Q8_0 | 0.58 | 0.67 | 12.0893 | 0.62 |
| 14 | Gemma 3n E4B Instr. Tuned FP16 | 0.54 | 0.61 | 12.7311 | 0.57 |
| 15 | Gemma 3n E2B | 0.46 | 0.63 | 7.6640 | 0.55 |
| 16 | Gemma 3n E2B Instr. Tuned FP16 | 0.42 | 0.61 | 9.4918 | 0.52 |
| 17 | Gemma 3n E4B | 0.42 | 0.60 | 10.2829 | 0.52 |
| 18 | Gemma 3n E4B Instr. Tuned Q4_K_M | 0.42 | 0.58 | 10.1451 | 0.51 |
| 19 | Gemma 3n E2B Instr. Tuned Q4_K_M | 0.42 | 0.58 | 8.3391 | 0.51 |
| 20 | Gemma 3n E2B Instr. Tuned Q8_0 | 0.38 | 0.60 | 7.9739 | 0.50 |
| 21 | Gemma 3 1B Instr. Tuned FP16 | 0.15 | 0.37 | 2.0871 | 0.29 |
| 22 | Gemma 3 1B Instr. Tuned Q4_K_M | 0.12 | 0.58 | 1.0626 | 0.37 |
| 23 | Gemma 3 1B | 0.12 | 0.40 | 0.9691 | 0.29 |
| 24 | Gemma 3 1B Instr. Tuned Q8_0 | 0.12 | 0.29 | 1.6037 | 0.23 |
| 25 | Gemma 3 1B Instr. Tuned QAT | 0.08 | 0.47 | 3.9768 | 0.30 |
*Composite Score: a weighted score between 0 and 1 that factors in multiple aspects of a model's performance in the tests. It is calculated as (pass rate × 0.48) + (parameter score × 0.48) + (latency score × 0.04), with each component normalized to the 0–1 range before the weighted sum is taken.
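
For illustration, the calculation can be sketched in a few lines of Python. The min-max latency normalization below is an assumption (the page only states that components are normalized), and the figures are taken from the top-ranked row of the table above.

```python
# Hypothetical sketch of the composite score calculation. The min-max
# latency normalization is an assumption; the leaderboard may normalize
# latency differently.

def normalize_latency(latency: float, fastest: float, slowest: float) -> float:
    """Map a latency to [0, 1], where the fastest model scores 1.0."""
    return 1.0 - (latency - fastest) / (slowest - fastest)

def composite_score(pass_rate: float, parameter_score: float,
                    latency_score: float) -> float:
    """Weighted sum from the leaderboard's stated formula."""
    return pass_rate * 0.48 + parameter_score * 0.48 + latency_score * 0.04

# Gemma 3 27B Instr. Tuned QAT (rank 1); latencies taken from the table.
latency_score = normalize_latency(27.0145, fastest=0.9691, slowest=27.0145)
print(round(composite_score(0.92, 0.89, latency_score), 2))  # 0.87
```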

^Average Latency: these tests were run using Ollama on an Apple Mac Mini with 24 GB of RAM and an M4 Pro chip. The latency includes both the time Ollama takes to load the model into memory and the time taken by the (REST) API call.
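
For reference, below is a minimal sketch of how this kind of end-to-end latency can be captured against a local Ollama server; the model name and prompt are placeholders, not the actual test cases used in this evaluation.

```python
# Minimal sketch of end-to-end latency measurement against a local
# Ollama server. Model name and prompt are placeholders.
import time
import requests

def timed_generate(model: str, prompt: str) -> float:
    """Return wall-clock seconds for one /api/generate call.

    If the model is not yet resident in memory, Ollama loads it as part
    of this request, so load time is included in the measurement,
    matching the methodology described above.
    """
    start = time.perf_counter()
    requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=300,
    )
    return time.perf_counter() - start

print(f"{timed_generate('gemma3:27b', 'What is the weather in Paris?'):.4f} sec")
```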