[tools] | 2026-06-21 | 14 min

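# LLM Cost and Deployment Basics: The Model Gives You Capability, the Workflow Decides the Bill

LLM bill = model unit price × token count × workflow amplification factor. For the same review feature, a single-turn chat costs a couple of cents, while an agent loop can cost a couple of dollars.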

[llm-operations][cost][tools]

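For a web developer, integrating an LLM looks like "just calling an API". In practice you are managing the product of four variables: capability, context size, number of calls, and output length.

LLM usage cost = model unit price × token count × workflow amplification factor

The model gives you capability; the workflow decides the bill. Every round of an agent loop is a billable unit, and harness/context management is cost management.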

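Running example: three versions of a code review assistant

V1: single-turn chat (paste the diff)

input ~3.2K + output ~500
Claude Sonnet 4.6: about $0.02 per run

V2: automatic RAG (retrieve related files)

input ~25K + output ~800
About $0.09 per run (≈ 5× V1)

With RAG, input is the bulk of the cost; embedding and indexing fees come on top.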

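V3: agent loop (read the repo, run tests, edit code)

15 rounds × an average of 40K input + 3K output per round
Total input ~600K, output ~45K
Claude Sonnet 4.6: about $2.48 per run
DeepSeek V4-Pro (80% cache hit): about $0.09–0.12 per run

You did not pick a pricier model name; you picked a thicker workflow. Design the thinnest workflow that clears the quality bar.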
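The three estimates follow from plain arithmetic on token counts and per-million prices. A minimal sketch, assuming the Claude Sonnet 4.6 prices from this article's price snapshot ($3 input / $15 output per 1M tokens); `runCost`, `Workload`, and `Price` are illustrative names, not part of any SDK:

```typescript
// Illustrative helper: cost of one run from token counts and $/1M-token prices.
interface Workload {
  name: string;
  inputTokens: number;
  outputTokens: number;
}

interface Price {
  inputPerM: number;  // USD per 1M input tokens
  outputPerM: number; // USD per 1M output tokens
}

function runCost(w: Workload, p: Price): number {
  return (w.inputTokens * p.inputPerM + w.outputTokens * p.outputPerM) / 1_000_000;
}

// Assumed prices, taken from the article's snapshot table.
const sonnet: Price = { inputPerM: 3, outputPerM: 15 };

const versions: Workload[] = [
  { name: "V1 single-turn chat", inputTokens: 3_200, outputTokens: 500 },
  { name: "V2 automatic RAG", inputTokens: 25_000, outputTokens: 800 },
  { name: "V3 agent loop", inputTokens: 15 * 40_000, outputTokens: 15 * 3_000 },
];

for (const v of versions) {
  console.log(`${v.name}: $${runCost(v, sonnet).toFixed(3)}`);
}
```

Under these assumptions the runs come out to roughly $0.017, $0.087, and $2.475. The model never changes across the three rows; only the token volume does.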

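Token billing essentials

Rule	Notes
Separate input / output prices	Output often costs 3–8× input; with agents plus thinking, output can account for 60–80% of spend
The entire prompt bills as input	System prompt, history, RAG chunks, tool results, tool schemas
Prompt caching	Cached reads of a reused, stable prefix are billed at a steep discount
Long-context tiers	A request above the threshold is billed at the higher tier
Batch API	Usually ~50% off for non-real-time jobs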
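The caching and batch rows can be folded into a single "effective input price". A rough sketch, assuming the discounts multiply (whether they stack varies by provider); `effectiveInputPrice` is a hypothetical helper, and the cached-read price and batch discount are inputs you look up on the provider's pricing page:

```typescript
// Illustrative blend of uncached and cached input prices, with an optional batch discount.
function effectiveInputPrice(
  inputPerM: number,     // USD per 1M uncached input tokens
  cachedPerM: number,    // USD per 1M cached-read tokens
  cacheHitRatio: number, // fraction of input tokens served from cache, 0..1
  batchDiscount = 0,     // e.g. 0.5 for a 50% batch discount (assumed to multiply)
): number {
  if (cacheHitRatio < 0 || cacheHitRatio > 1) {
    throw new RangeError("cacheHitRatio must be in [0, 1]");
  }
  const blended = (1 - cacheHitRatio) * inputPerM + cacheHitRatio * cachedPerM;
  return blended * (1 - batchDiscount);
}
```

For example, with a hypothetical $3 input price and $0.30 cached-read price, an 80% hit ratio brings the blended price to about $0.84 per 1M tokens, which is why a stable prefix matters so much in long agent loops.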

Reading `usage` in code

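// Prices are USD per token (per-1M price / 1e6); field names follow the OpenAI-style usage object.
const u = response.usage;
const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
const cost =
  (u.prompt_tokens - cached) * inputPrice + // uncached input
  cached * cachedPrice +                    // cached reads
  u.completion_tokens * outputPrice;        // output

// Tag the metric so spend can be grouped by model and feature.
metrics.record('llm.cost_usd', cost, { model, feature: 'code-review' });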

Tracking cost per task is more meaningful to the business than tracking cost per token.
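Cost per task means summing every call a task triggers, not pricing one call. A minimal sketch, assuming the same OpenAI-style usage fields as the snippet above; `TaskCostTracker`, `callCost`, and `TokenPrices` are hypothetical names introduced for illustration:

```typescript
// Hypothetical aggregator: sums per-call cost under a task id so that
// dashboards can report cost per task rather than cost per token.
interface Usage {
  prompt_tokens: number;
  completion_tokens: number;
  prompt_tokens_details?: { cached_tokens?: number };
}

interface TokenPrices {
  input: number;  // USD per token, uncached input
  cached: number; // USD per token, cached read
  output: number; // USD per token, output
}

function callCost(u: Usage, p: TokenPrices): number {
  const cached = u.prompt_tokens_details?.cached_tokens ?? 0;
  return (u.prompt_tokens - cached) * p.input + cached * p.cached + u.completion_tokens * p.output;
}

class TaskCostTracker {
  private totals = new Map<string, { calls: number; usd: number }>();

  // Add one model call to the running total for a task.
  record(taskId: string, u: Usage, p: TokenPrices): void {
    const t = this.totals.get(taskId) ?? { calls: 0, usd: 0 };
    t.calls += 1;
    t.usd += callCost(u, p);
    this.totals.set(taskId, t);
  }

  // Totals for a task; zeros if nothing was recorded.
  get(taskId: string): { calls: number; usd: number } {
    return this.totals.get(taskId) ?? { calls: 0, usd: 0 };
  }
}
```

A 15-round agent run would call `record` 15 times with the same task id, and the alerting threshold would then apply to `get(taskId).usd`.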

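Five cost knobs

Knob	Cost impact
`max_tokens`	Set 64–256 for classification tasks instead of a 4096 default
System prompt length	Billed every round; keep it stable, lean, and cacheable
History strategy	Resending the full history makes input grow linearly with each turn
RAG top-k / chunk size	The main lever behind the V1→V2 jump
Model routing	Simple tasks → Flash/Turbo tier; complex tasks → Sonnet/Pro tier

Hidden agent costs: tool schemas count as input; tool results are fed back as input; retries double the spend; thinking mode inflates output tokens.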
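The history knob is easy to underestimate: if each request grows linearly, the total billed input over a conversation grows roughly quadratically. A simplified model, assuming every turn adds a constant number of tokens to the history; `cumulativeInputTokens` and its parameters are illustrative:

```typescript
// Illustrative model: every request resends the system prompt, the retained
// history, and the current turn. perTurnTokens is an assumed constant size
// for each exchange; real turns vary.
function cumulativeInputTokens(
  turns: number,
  systemTokens: number,
  perTurnTokens: number,
  keepLastTurns: number = Infinity, // history window; Infinity = resend everything
): number {
  let total = 0;
  for (let t = 0; t < turns; t++) {
    const historyTurns = Math.min(t, keepLastTurns);
    total += systemTokens + historyTurns * perTurnTokens + perTurnTokens;
  }
  return total;
}
```

With 15 turns, a 2,000-token system prompt, and 3,000 tokens per turn, full history bills 390,000 input tokens in total, while keeping only the last 4 turns bills 225,000.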

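Price snapshot (2026-06, $ per 1M tokens)

Role	Representative model	Input / Output
Flagship	GPT-5.5	$5 / $30
Workhorse	Claude Sonnet 4.6	$3 / $15
Routing	Gemini Flash-Lite	$0.10 / $0.40
Best value	DeepSeek V4-Flash	$0.14 / $0.28
China workhorse	Qwen3.5-Plus	~$0.11 / ~$0.67
China coding	GLM-5 / Kimi K2.7	~$0.56–0.95 / ~$2.5–4.0

Chinese platforms often price in tiers by context length; recheck each vendor's official pricing page before you ship.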

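Five optimization levers (ordered by impact)

Model routing: use a small model for classification and extraction
Control context: lower RAG top-k, truncate history, summarize tool results
Constrain output: set max_tokens, turn off thinking, use structured output
Prompt caching: keep a stable system prefix and put dynamic content after it
Instrumentation and alerts: alert when cost_per_task crosses a threshold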
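The first and third levers can live in one routing table that picks both the model tier and the output cap. A minimal sketch; the model ids are placeholders rather than real API identifiers, and the caps other than the 64–256 range for classification are assumed values:

```typescript
// Hypothetical router: maps a task kind to a model tier and a max_tokens cap.
type TaskKind = "classify" | "extract" | "summarize" | "code-review" | "agent";

interface Route {
  model: string;     // placeholder id; substitute your provider's model name
  maxTokens: number; // output cap passed as max_tokens
}

const ROUTES: Record<TaskKind, Route> = {
  classify: { model: "small-fast-model", maxTokens: 64 },
  extract: { model: "small-fast-model", maxTokens: 256 },
  summarize: { model: "small-fast-model", maxTokens: 512 },
  "code-review": { model: "mainline-model", maxTokens: 1024 },
  agent: { model: "mainline-model", maxTokens: 4096 },
};

function routeFor(kind: TaskKind): Route {
  return ROUTES[kind];
}
```

Keeping the table in one place makes it easy to audit which features are allowed to reach the expensive tier.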

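Anti-patterns

Running intent classification with a default max_tokens=4096
Stuffing the entire codebase into the prompt
Agents with no round cap and no token budget
Sending every request to the flagship model
Never reading usage and only looking at the bill at month end
Assuming a Cursor/Copilot subscription means the API is free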
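The "no round cap, no token budget" anti-pattern has a small fix: wrap the loop in a guard. A sketch under stated assumptions: `step` stands in for one model call plus tool execution and is synchronous here for brevity (a real one would be async), and `runWithBudget`, `Budget`, and `StopReason` are hypothetical names:

```typescript
// Illustrative guard: stops an agent loop at a round cap or a token budget.
interface StepUsage {
  inputTokens: number;
  outputTokens: number;
}

interface Budget {
  maxRounds: number;
  maxTotalTokens: number;
}

type StopReason = "done" | "max_rounds" | "token_budget";

function runWithBudget(
  step: (round: number) => { usage: StepUsage; done: boolean },
  budget: Budget,
): { rounds: number; totalTokens: number; reason: StopReason } {
  let totalTokens = 0;
  for (let round = 1; round <= budget.maxRounds; round++) {
    const { usage, done } = step(round);
    totalTokens += usage.inputTokens + usage.outputTokens;
    if (done) return { rounds: round, totalTokens, reason: "done" };
    // The budget is checked after each round, so the last round may overshoot it.
    if (totalTokens >= budget.maxTotalTokens) {
      return { rounds: round, totalTokens, reason: "token_budget" };
    }
  }
  return { rounds: budget.maxRounds, totalTokens, reason: "max_rounds" };
}
```

Returning the stop reason lets the caller log budget-exhausted runs separately from finished ones, which feeds directly into cost-per-task alerting.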

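Private deployment: when to consider it

Driver	Signal
Data compliance	Code or user data must not leave the network
Cost at scale	Monthly API spend is steadily above ¥30,000–50,000 and the model can be pinned
Fixed model	An open-weight Qwen/GLM model is good enough; the latest closed model is not required

Hybrid architectures are the most common: primary inference runs locally, with complex tasks falling back to a cloud API.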

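VRAM estimation (simplified)

GPU VRAM ≈ quantized weights + KV cache + framework overhead (0.5–2 GB)
Doubling the context roughly doubles the KV cache

Configuration	Minimum GPU
7–8B Q4, 8K context	12–16 GB
32B Q4, 8K context	24 GB (4090)
32B Q4, 32K context	48 GB

OOM often comes from the KV cache, not the weights. Use vLLM/SGLang in production and Ollama for trials.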
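The estimate above can be turned into a small calculator. A rough sketch, assuming a standard transformer KV cache (keys and values stored per layer, per KV head, per token); the `ModelShape` fields are supplied by the caller and should be read from the real model config, and the results ignore activation memory and allocator fragmentation:

```typescript
// Rough VRAM estimator: quantized weights + KV cache + fixed overhead.
interface ModelShape {
  paramsBillion: number;  // parameter count in billions
  bitsPerWeight: number;  // e.g. ~4 for a 4-bit quantization
  layers: number;
  kvHeads: number;        // key/value heads (fewer than attention heads under GQA)
  headDim: number;
  kvBytesPerElem: number; // 2 for an FP16 KV cache
}

const GiB = 1024 * 1024 * 1024;

function weightsGiB(m: ModelShape): number {
  return (m.paramsBillion * 1e9 * m.bitsPerWeight) / 8 / GiB;
}

function kvCacheGiB(m: ModelShape, contextTokens: number, concurrentSeqs = 1): number {
  // 2x for keys and values, per layer, per KV head, per head dimension.
  const bytesPerToken = 2 * m.layers * m.kvHeads * m.headDim * m.kvBytesPerElem;
  return (bytesPerToken * contextTokens * concurrentSeqs) / GiB;
}

function vramGiB(m: ModelShape, contextTokens: number, overheadGiB = 1, concurrentSeqs = 1): number {
  return weightsGiB(m) + kvCacheGiB(m, contextTokens, concurrentSeqs) + overheadGiB;
}
```

The KV term scales with both context length and the number of concurrent sequences, while the weights term is fixed. That is why a server that loads the model fine can still run out of memory under load.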

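API vs. local TCO

Break-even: monthly API cost > monthly TCO (local hardware + ops + electricity) → worth a POC

A rough monthly TCO for a 1×4090 server is ~¥570. If your monthly API bill is only ¥8,000 and a DeepSeek-class API is an option, local hardware rarely pays for itself, unless the data must stay on the internal network.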
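The break-even rule is a one-line comparison once monthly TCO is spelled out. A minimal sketch with hypothetical field names; all amounts are caller-supplied and in the same currency, and the API figure should be the bill at the cheapest API that still meets your quality bar:

```typescript
// Illustrative break-even check for local deployment vs. API spend.
interface LocalTco {
  hardwarePrice: number;      // purchase price of the server
  amortizationMonths: number; // months over which the hardware is written off
  monthlyPower: number;       // electricity per month
  monthlyOps: number;         // ops / maintenance labor per month
}

function monthlyTco(t: LocalTco): number {
  return t.hardwarePrice / t.amortizationMonths + t.monthlyPower + t.monthlyOps;
}

// True when the API bill exceeds local TCO, i.e. a POC is worth running.
function worthPoc(apiMonthlyCost: number, t: LocalTco): boolean {
  return apiMonthlyCost > monthlyTco(t);
}
```

Passing the test only justifies a proof of concept; quality on the pinned open-weight model still has to be validated before migrating traffic.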

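Connection to AI4SE

Concept	Cost implication
Agent Loop	Each round of the loop = one billable unit
Harness	Context management = cost management
Coding Agent Benchmark	Look at tokens per task, not only the Index
Tool Calling	Every tool result fed back = input tokens

For how to read benchmarks, see Coding Agent Market and Benchmarks.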