Hands-on comparison of Claude / ChatGPT / Gemini / DeepSeek / GLM / MiniMax across six dimensions, mapped to real product decisions I've made. Not a leaderboard — one usage story per model.
A decision-oriented benchmark covering Claude (Opus / Sonnet), ChatGPT-5, Gemini 2.5, DeepSeek V3, GLM 4.5, and MiniMax M2 across six dimensions: coding, Chinese reasoning, agent orchestration, long context, multimodal, and cost per task. Each model also gets one real usage story from my OpenClaw stack, because the dashboard is built to answer 'which model should I deploy for this task right now', not 'who's the best overall'.
The Chinese version below covers the rationale (why most public benchmarks are useless for product decisions), the six-dimension framework, my evaluation methodology, and key findings from running the same test suite quarterly.
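To make the 'which model should I deploy for this task right now' question concrete, here is a minimal sketch of how a per-task pick could be computed from six dimension scores. It is not the dashboard's actual logic: the `pick_model` function, the weighting scheme, the model names `model_a` / `model_b`, and every score below are hypothetical placeholders, not measured results.

```python
# Hypothetical sketch: choose a model for one task by weighting six dimensions.
# All names and numbers here are illustrative assumptions, not benchmark data.

DIMENSIONS = (
    "coding",
    "chinese_reasoning",
    "agent_orchestration",
    "long_context",
    "multimodal",
    "cost_per_task",  # assumed normalized so that higher means cheaper/better
)


def pick_model(scores, weights):
    """Return (model_name, weighted_score) for the best model on this task.

    scores:  {model_name: {dimension: value in [0, 1]}}
    weights: {dimension: non-negative importance for the task at hand}
    """
    total = sum(weights.get(d, 0.0) for d in DIMENSIONS)
    if total <= 0:
        raise ValueError("at least one dimension needs a positive weight")

    def weighted(model_scores):
        # Weighted average over the six dimensions; missing weights count as 0.
        return sum(weights.get(d, 0.0) * model_scores[d] for d in DIMENSIONS) / total

    best = max(scores, key=lambda name: weighted(scores[name]))
    return best, weighted(scores[best])


# Placeholder scores for two made-up models (not real measurements).
PLACEHOLDER_SCORES = {
    "model_a": {d: 0.5 for d in DIMENSIONS} | {"coding": 0.9},
    "model_b": {d: 0.5 for d in DIMENSIONS} | {"long_context": 0.9},
}

# A task that only cares about coding, and one that only cares about long context.
coding_pick = pick_model(PLACEHOLDER_SCORES, {"coding": 1.0})
context_pick = pick_model(PLACEHOLDER_SCORES, {"long_context": 1.0})
```

The point of the sketch is that the answer changes with the task weights, which is why a single overall ranking is less useful for deployment decisions than a per-task view.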