
DAX LLM Benchmark

Which LLM is best at DAX?
DAXBench tests how models understand, write, and reason about DAX and Power BI.
Methodology designed by Maxim Anatsko.

Last updated: Dec 6, 2025

33 models · 30 tasks · Initial Release

Model Leaderboard

Ranked by score

Rank | Model | Provider | Score | Accuracy | Syntax | Tasks Solved
-----|-------|----------|-------|----------|--------|-------------
1 | gpt-oss-120b | OpenAI | 81.6% | 80.0% | 100.0% | 24/30
2 | Gemini 2.5 Flash Preview 09-2025 | Google | 79.7% | 76.7% | 100.0% | 23/30
3 | DeepSeek V3.2 | DeepSeek | 79.4% | 80.0% | 100.0% | 24/30
4 | Claude Opus 4.5 | Anthropic | 78.7% | 76.7% | 100.0% | 23/30
5 | Gemini 3 Pro Preview | Google | 77.9% | 76.7% | 96.7% | 23/30
6 | Claude Sonnet 4.5 | Anthropic | 77.5% | 76.7% | 100.0% | 23/30
7 | Gemini 2.0 Flash Experimental (free) | Google | 76.3% | 73.3% | 100.0% | 22/30
8 | Gemini 2.0 Flash | Google | 76.2% | 73.3% | 100.0% | 22/30
9 | DeepSeek V3.2 Speciale | DeepSeek | 74.5% | 73.3% | 100.0% | 22/30
10 | Kimi K2 Thinking | MoonshotAI | 74.3% | 73.3% | 100.0% | 22/30
11 | GPT-4o-mini (2024-07-18) | OpenAI | 73.6% | 73.3% | 100.0% | 22/30
12 | Gemini 2.5 Flash | Google | 72.0% | 70.0% | 100.0% | 21/30
13 | DeepSeek R1T2 Chimera (free) | TNG | 71.6% | 70.0% | 93.3% | 21/30
14 | DeepSeek V3 0324 | DeepSeek | 70.2% | 70.0% | 100.0% | 21/30
15 | Grok Code Fast 1 | xAI | 69.7% | 66.7% | 100.0% | 20/30
16 | GPT-5.1-Codex-Mini | OpenAI | 69.6% | 70.0% | 100.0% | 21/30
17 | Mistral Large 3 2512 | Mistral | 69.0% | 66.7% | 100.0% | 20/30
18 | MiniMax M2 | MiniMax | 68.1% | 66.7% | 100.0% | 20/30
19 | Qwen3 Coder 480B A35B (free) | Qwen | 67.9% | 66.7% | 100.0% | 20/30
20 | GPT-5.1 | OpenAI | 67.4% | 63.3% | 100.0% | 19/30
21 | Gemini 2.5 Flash Lite Preview 09-2025 | Google | 66.6% | 63.3% | 100.0% | 19/30
22 | Gemini 2.5 Pro | Google | 64.4% | 63.3% | 80.0% | 19/30
23 | GPT-5 Nano | OpenAI | 62.1% | 60.0% | 96.7% | 18/30
24 | GPT-3.5 Turbo | OpenAI | 61.3% | 60.0% | 100.0% | 18/30
25 | GPT-4.1 Mini | OpenAI | 60.7% | 60.0% | 100.0% | 18/30
26 | Gemini 2.0 Flash Lite | Google | 59.0% | 53.3% | 100.0% | 16/30
27 | GPT-5 Mini | OpenAI | 57.1% | 56.7% | 100.0% | 17/30
28 | Claude Haiku 4.5 | Anthropic | 57.0% | 56.7% | 100.0% | 17/30
29 | gpt-oss-20b (free) | OpenAI | 56.7% | 56.7% | 100.0% | 17/30
30 | GLM 4.5 Air (free) | Z.AI | 53.2% | 50.0% | 100.0% | 15/30
31 | Nova 2 Lite (free) | Amazon | 49.5% | 50.0% | 96.7% | 15/30
32 | Gemma 3 27B (free) | Google | 46.6% | 43.3% | 100.0% | 13/30
33 | Phi 4 | Microsoft | 20.2% | 16.7% | 56.7% | 5/30

About This Benchmark

Evaluation Method

Models are tested against DAX tasks of varying complexity using the Contoso sample dataset. Responses are evaluated for syntax correctness and output accuracy.

Scoring System

Harder tasks are worth more points. Correct solutions also earn bonus points for following DAX best practices, writing efficient code, and producing clear, readable output.

Task Categories

Tasks cover aggregation, time intelligence, filtering, calculations, iterators, and context transitions across basic, intermediate, and advanced levels.
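
As a rough illustration of the kinds of measures these categories cover (a sketch only, not an actual benchmark task; the Sales, Customer, and Date table and column names assume a typical Contoso-style model):

Sales Amount :=
SUMX ( Sales, Sales[Quantity] * Sales[Net Price] )  -- aggregation with an iterator

Sales Amount PY :=
CALCULATE ( [Sales Amount], SAMEPERIODLASTYEAR ( 'Date'[Date] ) )  -- time intelligence over the Date table

Avg Sales per Customer :=
AVERAGEX ( Customer, [Sales Amount] )  -- measure reference in row context triggers context transition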