Frontier-Model Benchmark Dashboard

Hands-on comparison of Claude / ChatGPT / Gemini / DeepSeek / GLM / MiniMax across six dimensions, mapped to real product decisions I've made. Not a leaderboard — one usage story per model.

Type
Personal benchmark dashboard
Period
2026.02 — Present (continuously updated)
Team
Solo
Outcome
Living dashboard, drives daily model selection
Stack
Notion · Custom test prompts · Real-task evaluation
My Role
Analyst · Decision Framework Designer · Writer

English summary

Translation in progress

A decision-oriented benchmark covering Claude (Opus / Sonnet), ChatGPT-5, Gemini 2.5, DeepSeek V3, GLM 4.5, and MiniMax M2. Six dimensions: coding, Chinese reasoning, agent orchestration, long-context, multimodal, cost-per-task. Critically, each model gets one real usage story from my OpenClaw stack — the dashboard is built to answer 'which model should I deploy for this task right now', not 'who's the best overall'.
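
The difference between per-task selection and an overall ranking can be shown with a minimal sketch. It assumes each model has a normalized score per dimension and each task weights only the dimensions it cares about. The names `ModelScores`, `pick_model`, `model_a`, and `model_b`, along with every number, are hypothetical placeholders and not results from the dashboard.

```python
from dataclasses import dataclass

# The six dimensions named in the summary above.
DIMENSIONS = (
    "coding",
    "chinese_reasoning",
    "agent_orchestration",
    "long_context",
    "multimodal",
    "cost_per_task",
)


@dataclass(frozen=True)
class ModelScores:
    """Hypothetical record: dimension -> score in [0, 1], higher is better."""
    name: str
    scores: dict[str, float]


def pick_model(candidates: list[ModelScores], weights: dict[str, float]) -> ModelScores:
    """Return the candidate with the highest weighted score for one task."""
    unknown = set(weights) - set(DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")

    def weighted(model: ModelScores) -> float:
        # Dimensions a model has no score for count as 0.
        return sum(w * model.scores.get(dim, 0.0) for dim, w in weights.items())

    return max(candidates, key=weighted)


# Placeholder data, invented purely for illustration.
candidates = [
    ModelScores("model_a", {"coding": 0.9, "long_context": 0.4, "cost_per_task": 0.3}),
    ModelScores("model_b", {"coding": 0.6, "long_context": 0.8, "cost_per_task": 0.9}),
]

coding_task = {"coding": 1.0}
bulk_summarization_task = {"long_context": 0.5, "cost_per_task": 0.5}

print(pick_model(candidates, coding_task).name)              # model_a
print(pick_model(candidates, bulk_summarization_task).name)  # model_b
```

With these placeholder numbers, neither model wins everywhere: the choice flips with the task's weights, which is the question a single leaderboard rank cannot answer.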

The Chinese version below covers the rationale (why most public benchmarks are useless for product decisions), the six-dimension framework, my evaluation methodology, and key findings from running the same test suite quarterly.