[methodology] | 2026-06-21 | 16 min

# AI4SE Best Practices for End-to-End Software Delivery

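Industry consensus for 2025–2026: AI is an amplifier, not a cure-all. End-to-end success depends on a closed loop of specification → context → execution → verification → measurement, not on in-IDE completion.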

[e2e-delivery][methodology]

In 2025–2026, Anthropic, OpenAI, Cursor, Google (DORA), Microsoft, GitHub, and others converged on a closely aligned set of judgments about AI-assisted software engineering:

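- AI is an amplifier, not a cure-all: an organization's existing delivery capability sets the ceiling on the value AI can add.
- End-to-end success depends on a closed loop: specification → context → execution → verification → measurement.
- Harness and human-machine division of labor: the Planner / Generator / Evaluator split is key to long-horizon delivery.
- Organizational enablement and engineering practice must advance together: policy, Champions, platforms, and the Inner Loop matter equally.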

## AI Amplifies Existing Organizational Capabilities

The core claim of the DORA 2025 report:

> AI’s primary role in software development is that of an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones.

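| Stage | Weak organizational foundation | Strong organizational foundation |
| --- | --- | --- |
| Requirements/specification | AI amplifies ambiguity in requirements | AI helps clarify requirements and acceptance criteria |
| Development | More code that is hard to review | Small batches plus code review (CR) raise throughput |
| Testing/release | Defects and MTTR get worse | Flow efficiency and quality improve together |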

See DORA 2025 AI Amplifier for details.

## DORA AI Capabilities Model (Seven Amplifiers)

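| # | Capability | E2E implication |
| --- | --- | --- |
| 1 | Clear AI stance | Permitted tools, boundaries, and room for experimentation are communicated clearly |
| 2 | Healthy data ecosystems | Quality, consistency, and discoverability of internal data |
| 3 | AI-accessible internal data | The foundation for Context Engineering |
| 4 | Strong version control | Absorbs the change volume and PR size that AI brings |
| 5 | Working in small batches | Offsets the risk of AI generating large blocks of code at once |
| 6 | User-centric focus | Keeps accelerated delivery aligned with business value |
| 7 | Quality internal platforms | Turns individual efficiency into organizational flow efficiency |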

AI tends to increase PR size, so working in small batches is the safeguard.
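One way to make the small-batch safeguard enforceable is a size check in the PR pipeline. The following is a minimal sketch, not something the DORA report prescribes: the thresholds, the `PullRequestStats` type, and the `small_batch_violations` function are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical thresholds; tune them to your team's review capacity.
MAX_CHANGED_LINES = 400
MAX_CHANGED_FILES = 15


@dataclass(frozen=True)
class PullRequestStats:
    """Size figures for one PR, e.g. collected from a diff summary."""
    changed_lines: int
    changed_files: int


def small_batch_violations(stats: PullRequestStats) -> list[str]:
    """Return the reasons a PR exceeds the small-batch budget (empty if none)."""
    violations = []
    if stats.changed_lines > MAX_CHANGED_LINES:
        violations.append(
            f"{stats.changed_lines} changed lines exceeds {MAX_CHANGED_LINES}"
        )
    if stats.changed_files > MAX_CHANGED_FILES:
        violations.append(
            f"{stats.changed_files} changed files exceeds {MAX_CHANGED_FILES}"
        )
    return violations
```

A CI job could fail, or simply request that the PR be split, whenever the returned list is non-empty.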

## Plan First, Then Code

Cursor's official guidance summarizes E2E feature delivery as:

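- Plan Mode: research the codebase, clarify requirements, and produce an editable plan.
- Small, independently verifiable steps: the agent can confirm on its own that each step is complete.
- Roll back when the plan fails: revert and revise the plan rather than patching repeatedly in the wrong direction.
- Five-step TDD: write tests → confirm they fail → implement → pass → commit.
- Verifiable goals: type system, linter, tests.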

> The biggest risk when building features quickly is skipping verification.
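The five TDD steps listed above can be illustrated with a deliberately tiny feature. This is only a sketch: `slugify` is a hypothetical function invented for the example, and the comments mark where each step would occur in a real session.

```python
import re


# Step 1: write the test first, before any implementation exists.
def test_slugify():
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  many   spaces ") == "many-spaces"


# Step 2: run the test and confirm that it fails. At this point in a real
# session `slugify` is not defined yet, so the call raises NameError.

# Step 3: implement the smallest change that satisfies the test.
def slugify(text: str) -> str:
    """Lowercase the text and join its alphanumeric runs with hyphens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return "-".join(words)


# Step 4: rerun the test until it passes.
# Step 5: commit once the test is green.
```

Under pytest, `test_slugify` would be collected automatically; the point of the ordering is that the agent has an objective, runnable definition of success before it writes the implementation.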

Structured specifications and acceptance criteria are the interface across BA → DEV → QA. See SDD Source of Truth.

## Context Engineering

Anthropic defines context as the set of high-signal tokens that fits within a finite attention budget:

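| Component | Principle |
| --- | --- |
| System prompt | Minimal but sufficient, with clearly delimited sections |
| Tools | Few and well chosen; avoid a bloated tool set |
| Rules / CLAUDE.md | Project conventions and commands |
| JIT retrieval | Lightweight references resolved at runtime, rather than preloading the entire codebase |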

Three mechanisms support long-horizon tasks: compaction, structured note-taking (PROGRESS.md), and sub-agents that return summaries to the main agent.
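The JIT retrieval principle can be sketched as follows: the prompt carries only lightweight references, and full content is loaded on demand until a token budget is reached. Everything here is an assumption made for illustration, including the `ContextRef` type, the `load_just_in_time` function, and the rough four-characters-per-token estimate; it is not Anthropic's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class ContextRef:
    """A lightweight pointer kept in the prompt instead of the full content."""
    ref_id: str
    summary: str


def estimate_tokens(text: str) -> int:
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def load_just_in_time(
    refs: Iterable[ContextRef],
    wanted_ids: set[str],
    loader: Callable[[str], str],
    budget_tokens: int,
) -> dict[str, str]:
    """Resolve only the requested references, stopping before the budget is exceeded."""
    loaded: dict[str, str] = {}
    used = 0
    for ref in refs:
        if ref.ref_id not in wanted_ids:
            continue  # never pulled into context unless explicitly requested
        content = loader(ref.ref_id)
        cost = estimate_tokens(content)
        if used + cost > budget_tokens:
            break  # leave the remainder as references only
        loaded[ref.ref_id] = content
        used += cost
    return loaded
```

In practice `loader` might read a file or call a search tool; the design choice is that the budget check happens at load time, so an oversized document stays a one-line summary rather than crowding out higher-signal tokens.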

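The BA's requirements package, the DEV's Context Pack, and the QA's acceptance intent should be designed to be referenceable, versionable, and incrementally loadable. This corresponds to the Middle Loop Harness.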

## Harness and the Three P/G/E Roles

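| Role | Responsibility | E2E counterpart |
| --- | --- | --- |
| Planner | Short prompt → product spec | BA + architecture outline |
| Generator | Implements the spec in small steps | DEV Inner/Middle Loop |
| Evaluator | Independent verification; hard pass/fail thresholds | QA + automation |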

Key insight:

> Separating the agent doing the work from the agent judging it.

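The Evaluator should run runtime tests (Playwright, pytest) rather than rely on LLM self-assessment alone. The sprint contract aligns everyone on what "done" means before coding begins. See Planner-Generator-Evaluator and the HITL/HOTL/HOOL Spectrum.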
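A hard-threshold Evaluator can be sketched as a pure function over measured results and a contract agreed before coding. The `SprintContract` and `RunResult` types and their fields are hypothetical; in a real setup the numbers would be parsed from an actual pytest or Playwright run rather than supplied by the Generator.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SprintContract:
    """Definition of done, fixed before coding starts (hypothetical fields)."""
    min_pass_rate: float
    max_lint_errors: int


@dataclass(frozen=True)
class RunResult:
    """Measurements taken from a real test run, not from LLM self-assessment."""
    tests_passed: int
    tests_total: int
    lint_errors: int


def evaluate(contract: SprintContract, result: RunResult) -> tuple[bool, list[str]]:
    """Return (passed, failures) by comparing measurements with hard thresholds."""
    failures = []
    # An empty test run counts as a 0% pass rate, so it cannot pass by default.
    pass_rate = result.tests_passed / result.tests_total if result.tests_total else 0.0
    if pass_rate < contract.min_pass_rate:
        failures.append(
            f"pass rate {pass_rate:.2f} is below {contract.min_pass_rate:.2f}"
        )
    if result.lint_errors > contract.max_lint_errors:
        failures.append(
            f"{result.lint_errors} lint errors exceeds {contract.max_lint_errors}"
        )
    return (not failures, failures)
```

Because the function sees only the contract and the measurements, the agent that wrote the code has no way to argue its way to a pass.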

## Multi-Layer Verification Stack

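| Layer | Mechanism |
| --- | --- |
| L1 Local | Unit tests, type checking, linter |
| L2 Task | Sprint contract / acceptance criteria |
| L3 Runtime | Playwright, integration tests |
| L4 Process | PR Agent Review |
| L5 Organization | Contextual Evals, golden set |
| L6 Continuous | CI/CD gates, SAST |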

OpenAI Evals proceeds in three stages: Specify → Measure → Improve. Don't hope for “great”; specify it, measure it, improve toward it.
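The Specify → Measure → Improve loop can be sketched with a small golden set. This does not use the OpenAI Evals API; the cases, the `TARGET_ACCURACY` value, and the function names are invented for illustration, and `system` stands in for whatever model or pipeline is under test.

```python
from typing import Callable, Sequence

# Specify: write down what "great" means as concrete cases and a target.
# These cases and the target are invented for illustration.
GOLDEN_SET: list[tuple[str, str]] = [
    ("2 + 2", "4"),
    ("10 / 4", "2.5"),
    ("3 * 3", "9"),
]
TARGET_ACCURACY = 0.9


def measure(system: Callable[[str], str], golden_set: Sequence[tuple[str, str]]) -> float:
    """Measure: fraction of golden cases where the output matches exactly."""
    hits = sum(1 for prompt, expected in golden_set if system(prompt).strip() == expected)
    return hits / len(golden_set)


def meets_target(system: Callable[[str], str]) -> bool:
    """Improve: iterate on prompts, context, or tools until this returns True."""
    return measure(system, GOLDEN_SET) >= TARGET_ACCURACY
```

Exact string matching is the simplest possible grader; real golden sets often need a more tolerant comparison, which is itself something to specify explicitly.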

## End-to-End Reference Architecture

ORGANIZATION LAYER
  AI stance · Enablement · DORA 7 caps
        ↓
MIDDLE LOOP — Spec & Harness
  Planner/spec · Context packs · Review · Permissions
        ↓
   BA/PM    DEV (Gen+CR)    QA (Eval)
        ↓
OUTER LOOP — Delivery System
  Git PR · CI/CD · Platform · Metrics · Feedback → Eval

## Simplest Solution First

Anthropic, "Building Effective Agents":

> Find the simplest solution possible, and only increase complexity when needed.

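| Complexity | Pattern | Best suited to |
| --- | --- | --- |
| Low | Augmented LLM (retrieval + tools) | Single-step classification or generation |
| Medium | Workflow (chaining, routing, parallelization) | Tasks that decompose into fixed subtasks |
| High | Agent (dynamic tool loop) | Open-ended tasks with an unpredictable number of steps |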

Start with a Workflow plus human review, then raise the level of Agent autonomy only as Eval evidence supports it.
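A routing Workflow with mandatory human review might look like the sketch below. The control flow is fixed in code rather than chosen by a model, which is what distinguishes it from an Agent. All names are hypothetical, and `classify` uses keyword matching purely as a runnable stand-in for an LLM classification call.

```python
from typing import Callable

Handler = Callable[[str], str]


def handle_bug(ticket: str) -> str:
    return f"draft fix plan for: {ticket}"


def handle_docs(ticket: str) -> str:
    return f"draft doc update for: {ticket}"


def handle_other(ticket: str) -> str:
    return f"escalate to a human: {ticket}"


# The set of routes is fixed ahead of time; nothing here loops or picks tools.
ROUTES: dict[str, Handler] = {"bug": handle_bug, "docs": handle_docs}


def classify(ticket: str) -> str:
    """Stand-in for an LLM classifier; keyword matching keeps the sketch runnable."""
    lowered = ticket.lower()
    if "error" in lowered or "crash" in lowered:
        return "bug"
    if "readme" in lowered or "docs" in lowered:
        return "docs"
    return "other"


def run_workflow(ticket: str) -> dict[str, object]:
    """Route the ticket to one handler and always flag the draft for human review."""
    label = classify(ticket)
    handler = ROUTES.get(label, handle_other)
    return {"label": label, "draft": handler(ticket), "needs_human_review": True}
```

Relaxing `needs_human_review` for a given route would be the kind of autonomy increase that should wait for Eval evidence on that route.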

## Organizational Adoption Is Change Management

GitHub AI Playbook:

> Companies fail at AI adoption because they treat it like installing software when it’s actually rewiring how people work.

The eight pillars of enablement: Executive support, Policies, AI Advocates, Communities, L&D, DRI, Right-fit tooling, and Metrics. Licenses ≠ value at scale.

## Key Cautions

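| Topic | Observation |
| --- | --- |
| Agent complexity | As models improve, some scaffolding may become obsolete |
| PR size | AI may increase PR size, which conflicts with small batches and CR |
| Individual vs. organizational effectiveness | individual effectiveness ↑ ≠ organizational performance ↑ |
| Number of tools | more tools ≠ better |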