Offline Function Calling Leaderboard

A comprehensive evaluation of the function calling capabilities of offline, multimodal large language models across various scenarios.

Evaluation ID: eval-combined-mdytlb5b
Generated: 8/5/2025, 11:05:33 PM
Models: 25 · Scenarios: 10 · Parameters: 14 · Total Tests: 650
| Rank | Model | Scenario Score | Parameter Score | Average Latency (sec)^ | Composite Score* |
|-----:|-------|---------------:|----------------:|-----------------------:|-----------------:|
| 1 | Gemma 3 27B Instr. Tuned QAT | 0.96 | 0.89 | 28.0758 | 0.89 |
| 2 | Gemma 3 27B | 0.92 | 0.84 | 19.1038 | 0.86 |
| 3 | Gemma 3 27B Instr. Tuned Q4_K_M | 0.85 | 0.79 | 21.7459 | 0.80 |
| 4 | Gemma 3 12B Instr. Tuned Q8_0 | 0.81 | 0.74 | 10.9173 | 0.77 |
| 5 | Gemma 3 12B Instr. Tuned QAT | 0.81 | 0.67 | 12.8965 | 0.73 |
| 6 | Gemma 3 12B Instr. Tuned Q4_K_M | 0.77 | 0.82 | 8.1121 | 0.79 |
| 7 | Gemma 3 12B | 0.77 | 0.70 | 9.6838 | 0.73 |
| 8 | Gemma 3 4B | 0.65 | 0.68 | 3.4277 | 0.68 |
| 9 | Gemma 3n E4B | 0.62 | 0.61 | 10.7790 | 0.61 |
| 10 | Gemma 3n E4B Instr. Tuned FP16 | 0.62 | 0.59 | 12.1621 | 0.60 |
| 11 | Gemma 3 4B Instr. Tuned FP16 | 0.62 | 0.54 | 6.2974 | 0.59 |
| 12 | Gemma 3n E4B Instr. Tuned Q8_0 | 0.58 | 0.65 | 10.4753 | 0.62 |
| 13 | Gemma 3 4B Instr. Tuned Q4_K_M | 0.58 | 0.60 | 3.2700 | 0.60 |
| 14 | Gemma 3 4B Instr. Tuned Q8_0 | 0.58 | 0.57 | 4.2566 | 0.59 |
| 15 | Gemma 3 4B Instr. Tuned QAT | 0.54 | 0.56 | 7.2858 | 0.56 |
| 16 | Gemma 3n E4B Instr. Tuned Q4_K_M | 0.42 | 0.59 | 11.1672 | 0.51 |
| 17 | Gemma 3n E2B Instr. Tuned FP16 | 0.42 | 0.58 | 15.3226 | 0.50 |
| 18 | Gemma 3n E2B Instr. Tuned Q8_0 | 0.42 | 0.52 | 7.7962 | 0.48 |
| 19 | Gemma 3n E2B Instr. Tuned Q4_K_M | 0.35 | 0.61 | 7.9822 | 0.49 |
| 20 | Gemma 3n E2B | 0.35 | 0.48 | 8.4418 | 0.43 |
| 21 | Gemma 3 1B Instr. Tuned FP16 | 0.15 | 0.25 | 2.1435 | 0.23 |
| 22 | Gemma 3 1B Instr. Tuned Q4_K_M | 0.12 | 0.40 | 1.2753 | 0.29 |
| 23 | Gemma 3 1B Instr. Tuned Q8_0 | 0.12 | 0.35 | 1.5114 | 0.26 |
| 24 | Gemma 3 1B Instr. Tuned QAT | 0.08 | 0.38 | 4.1859 | 0.26 |
| 25 | Gemma 3 1B | 0.04 | 0.33 | 1.2161 | 0.22 |
*Composite Score: A weighted score between 0 and 1 that factors in multiple aspects of a model's performance in the tests. It is calculated as (pass rate × 0.48) + (parameter score × 0.48) + (latency score × 0.04), with each component normalized to [0, 1] before the weighted sum is taken.
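
For concreteness, here is a minimal Python sketch of that weighting. The min-max normalization of latency is an assumption (the footnote says only that components are normalized), but it reproduces the composite values in the table above.

```python
# Sketch of the composite-score weighting from the footnote.
# Assumption: latency is min-max normalized so the fastest model
# scores 1.0 and the slowest scores 0.0.

def latency_score(latency: float, min_lat: float, max_lat: float) -> float:
    """Map a latency to a 0-1 score where faster is higher (assumed)."""
    return 1.0 - (latency - min_lat) / (max_lat - min_lat)

def composite_score(pass_rate: float, parameter_score: float, lat_score: float) -> float:
    """Weighted sum from the footnote; the weights sum to 1.0."""
    return pass_rate * 0.48 + parameter_score * 0.48 + lat_score * 0.04

# Rank 2 (Gemma 3 27B): latencies in the table range from 1.2161 to 28.0758 sec.
lat = latency_score(19.1038, 1.2161, 28.0758)
print(round(composite_score(0.92, 0.84, lat), 2))  # 0.86, matching the table
```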

^Average Latency: These tests were run using Ollama on an Apple Mac mini with 24 GB of RAM and an M4 Pro chip. The latency includes both the time Ollama takes to load the model into memory and the time taken by the (REST) API call itself.
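
For reference, a rough sketch of how a single test's latency could be timed against a local Ollama instance is shown below. The endpoint and payload shape follow Ollama's REST API; the model tag and tool definition are illustrative, not the ones used in this evaluation.

```python
# Illustrative timing of one function-calling request to a local Ollama server.
# On a cold start, the elapsed time includes model loading, as noted above.
import time
import requests

payload = {
    "model": "gemma3:27b",  # illustrative model tag
    "messages": [{"role": "user", "content": "What is the weather in Paris?"}],
    "tools": [{
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool for this example
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }],
    "stream": False,
}

start = time.monotonic()
response = requests.post("http://localhost:11434/api/chat", json=payload)
elapsed = time.monotonic() - start
print(f"{elapsed:.4f} sec")
print(response.json()["message"].get("tool_calls"))  # the model's function call, if any
```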