The Chain Rule and Backpropagation

The chain rule dy/dx = (dy/du)·(du/dx) is a basic formula from calculus, but in deep learning it is the core mathematical principle behind the backpropagation algorithm. The gradient that flows from the loss back to layer 1 of a 100-layer ResNet is the product of 100 applications of the chain rule; the gradient flowing back through attention → FFN → LayerNorm in a Transformer is likewise unrolled layer by layer with the chain rule. Vanishing gradients — the Sigmoid derivative is at most 0.25, so after 100 multiplications the gradient decays to roughly 0.25¹⁰⁰ ≈ 10⁻⁶⁰ and the early layers can no longer learn — are a direct consequence of this chain-rule product; the ReLU activation, whose derivative is either 0 or 1, largely removes the problem. Exploding gradients are especially prominent in RNNs, and gradient clipping is the standard countermeasure. A framework's autograd engine is simply the chain rule applied automatically on a computation graph. This section starts from the simplest two-function composition, works step by step up to the full backward pass of a real 2-layer network, and demonstrates the vanishing-gradient phenomenon.

1. The Mathematical Definition of the Chain Rule

Single-variable chain rule:
If y = f(u) and u = g(x), then:
dy/dx = (dy/du) × (du/dx) = f'(u) × g'(x)

Intuition:
A change in x → causes a change in u (g'(x) describes this rate of change)
A change in u → causes a change in y (f'(u) describes this rate of change)
So the overall rate of change of y with respect to x = the product of the two rates

Analogy: Zhang San's height affects his weight (0.5 kg per extra cm), and his weight affects his BMI (0.3 points per extra kg)
→ effect of height on BMI = 0.5 × 0.3 = 0.15 (BMI rises by 0.15 per extra cm)

Multivariable chain rule:
If L = f(a₁, a₂, ..., aₘ) and each aᵢ = gᵢ(x₁, x₂, ..., xₙ), then:
∂L/∂xⱼ = Σᵢ (∂L/∂aᵢ) × (∂aᵢ/∂xⱼ)

When several paths connect xⱼ to L, sum the product of partial derivatives along each path.
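
As a concrete illustration of the multi-path rule (a minimal sketch; the function L = a₁ × a₂ with a₁ = x², a₂ = sin(x) is chosen purely for illustration), the sum over the two paths can be checked against a central finite difference:

// cargo run — two paths from x to L: x → a₁ = x² and x → a₂ = sin(x), with L = a₁ × a₂
fn main() {
    let x = 1.3_f64;
    let (a1, a2) = (x * x, x.sin());

    // sum over both paths: ∂L/∂a₁·∂a₁/∂x + ∂L/∂a₂·∂a₂/∂x = a₂·2x + a₁·cos(x)
    let grad = a2 * 2.0 * x + a1 * x.cos();

    // central finite difference on L(x) = x²·sin(x)
    let l = |x: f64| x * x * x.sin();
    let h = 1e-6;
    let grad_fd = (l(x + h) - l(x - h)) / (2.0 * h);

    println!("multi-path chain rule: {grad:.6}, finite difference: {grad_fd:.6}");
}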

2. Step-by-Step Chain-Rule Calculations

Example 1: y = (3x + 2)⁴

Let u = 3x + 2, so y = u⁴

dy/du = 4u³
du/dx = 3

dy/dx = 4u³ × 3 = 12(3x+2)³

At x = 1: u = 5, dy/dx = 12 × 125 = 1500

Example 2: y = e^(x²+1)

Let u = x² + 1, so y = eᵘ

dy/du = eᵘ
du/dx = 2x

dy/dx = eᵘ × 2x = 2x × e^(x²+1)

At x = 1: dy/dx = 2 × e² = 2 × 7.389 = 14.778

Example 3: three-level composition y = sin(e^(x²))

Let a = x², b = eᵃ, y = sin(b)

dy/db = cos(b)
db/da = eᵃ
da/dx = 2x

dy/dx = cos(b) × eᵃ × 2x = cos(e^(x²)) × e^(x²) × 2x

At x = 0: a = 0, b = 1, dy/dx = cos(1) × 1 × 0 = 0
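
The three results above can be cross-checked numerically (a minimal sketch; the helper check and the step size h = 1e-6 are ad-hoc choices, not from any library):

// cargo run — checks each analytic derivative against a central finite difference
fn check(name: &str, f: impl Fn(f64) -> f64, analytic: f64, x: f64) {
    let h = 1e-6;
    let numeric = (f(x + h) - f(x - h)) / (2.0 * h);
    println!("{name}: analytic = {analytic:.4}, numeric = {numeric:.4}");
}

fn main() {
    // Example 1: y = (3x + 2)⁴,   dy/dx = 12(3x+2)³,             at x = 1 → 1500
    check("Example 1", |x: f64| (3.0 * x + 2.0).powi(4), 1500.0, 1.0);
    // Example 2: y = e^(x²+1),    dy/dx = 2x·e^(x²+1),           at x = 1 → 2e² ≈ 14.778
    check("Example 2", |x: f64| (x * x + 1.0).exp(), 2.0 * std::f64::consts::E.powi(2), 1.0);
    // Example 3: y = sin(e^(x²)), dy/dx = cos(e^(x²))·e^(x²)·2x, at x = 0 → 0
    check("Example 3", |x: f64| (x * x).exp().sin(), 0.0, 0.0);
}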

3. Computation Graphs — the Core Idea Behind Backpropagation

Break a complicated formula into a sequence of simple primitive operations and draw them as a directed acyclic graph (DAG). The forward pass computes the output along the graph; the backward pass walks back from the output, applying the chain rule node by node to propagate gradients.

Full example — backpropagating a simple computation graph by hand:

Compute L = (x × w₁ + w₂)², and find ∂L/∂w₁ and ∂L/∂w₂
Let x = 3, w₁ = 2, w₂ = -1

═══ Forward pass (left to right) ═══
Node a = x × w₁ = 3 × 2 = 6
Node b = a + w₂ = 6 + (-1) = 5
Node L = b² = 25

═══ Backward pass (right to left) ═══

Step 1: the gradient of L with respect to itself is always 1
∂L/∂L = 1

Step 2: L = b² → ∂L/∂b = 2b = 2×5 = 10

Step 3: b = a + w₂ (addition node: the gradient passes through unchanged to both inputs)
∂L/∂a = ∂L/∂b × ∂b/∂a = 10 × 1 = 10
∂L/∂w₂ = ∂L/∂b × ∂b/∂w₂ = 10 × 1 = 10 ← the gradient of w₂!

Step 4: a = x × w₁ (multiplication node: each input's gradient = the other input × the upstream gradient)
∂L/∂x = ∂L/∂a × ∂a/∂x = 10 × w₁ = 10 × 2 = 20
∂L/∂w₁ = ∂L/∂a × ∂a/∂w₁ = 10 × x = 10 × 3 = 30 ← the gradient of w₁!

Check: differentiate L = (3w₁ + w₂)² directly
∂L/∂w₁ = 2(3w₁+w₂) × 3 = 2×5×3 = 30 ✓
∂L/∂w₂ = 2(3w₁+w₂) × 1 = 2×5×1 = 10 ✓
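
The same forward/backward walk can be written line for line in code; this is a direct transcription of the hand calculation above, not a general autograd engine:

// cargo run — forward and backward pass of L = (x·w1 + w2)²
fn main() {
    let (x, w1, w2) = (3.0_f64, 2.0, -1.0);

    // forward pass: one primitive operation per node
    let a = x * w1;      // a = 6
    let b = a + w2;      // b = 5
    let l = b * b;       // L = 25

    // backward pass: apply the local rule at each node, right to left
    let dl_db = 2.0 * b;        // L = b²   → ∂L/∂b = 2b
    let dl_da = dl_db * 1.0;    // add node → gradient passes through unchanged
    let dl_dw2 = dl_db * 1.0;   // add node
    let dl_dx = dl_da * w1;     // mul node → other input × upstream gradient
    let dl_dw1 = dl_da * x;     // mul node

    println!("L = {l}, dL/dw1 = {dl_dw1}, dL/dw2 = {dl_dw2}, dL/dx = {dl_dx}");
}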

4. Backpropagation Through a Real Neural Network (a Complete Numerical Walkthrough)

Full backward pass of a 2-layer network (every step spelled out):

Architecture: 1 input → 1 hidden unit (sigmoid) → 1 output (no activation) → MSE loss
Task: learn y = 2x (the simplest linear function)

Initial parameters: w₁ = 0.5, b₁ = 0.1, w₂ = 0.3, b₂ = 0.2
Training sample: x = 1.0, y = 2.0 (ground truth 2×1 = 2)
Learning rate: α = 0.5

═══ Forward pass ═══
z₁ = w₁×x + b₁ = 0.5×1.0 + 0.1 = 0.6
a₁ = σ(z₁) = σ(0.6) = 1/(1+e⁻⁰·⁶) = 1/(1+0.5488) = 1/1.5488 = 0.6457
z₂ = w₂×a₁ + b₂ = 0.3×0.6457 + 0.2 = 0.1937 + 0.2 = 0.3937
ŷ = z₂ = 0.3937 (prediction)
L = (y - ŷ)² = (2.0 - 0.3937)² = 1.6063² = 2.580

═══ Backward pass ═══

∂L/∂ŷ = -2(y - ŷ) = -2(2.0 - 0.3937) = -2 × 1.6063 = -3.2126

ŷ = z₂ (identity connection)
∂L/∂z₂ = ∂L/∂ŷ × 1 = -3.2126

z₂ = w₂×a₁ + b₂
∂L/∂w₂ = ∂L/∂z₂ × a₁ = -3.2126 × 0.6457 = -2.0744
∂L/∂b₂ = ∂L/∂z₂ × 1 = -3.2126
∂L/∂a₁ = ∂L/∂z₂ × w₂ = -3.2126 × 0.3 = -0.9638

a₁ = σ(z₁), σ'(z₁) = σ(z₁)(1-σ(z₁)) = 0.6457 × 0.3543 = 0.2288
∂L/∂z₁ = ∂L/∂a₁ × σ'(z₁) = -0.9638 × 0.2288 = -0.2205

z₁ = w₁×x + b₁
∂L/∂w₁ = ∂L/∂z₁ × x = -0.2205 × 1.0 = -0.2205
∂L/∂b₁ = ∂L/∂z₁ × 1 = -0.2205

═══ Parameter update ═══
w₁_new = 0.5 - 0.5×(-0.2205) = 0.5 + 0.110 = 0.610
b₁_new = 0.1 - 0.5×(-0.2205) = 0.1 + 0.110 = 0.210
w₂_new = 0.3 - 0.5×(-2.0744) = 0.3 + 1.037 = 1.337
b₂_new = 0.2 - 0.5×(-3.2126) = 0.2 + 1.606 = 1.806

Check: prediction after the update
z₁ = 0.610×1.0 + 0.210 = 0.820
a₁ = σ(0.820) = 0.6944
ŷ = 1.337×0.6944 + 1.806 = 0.928 + 1.806 = 2.734
L = (2.0-2.734)² = 0.539

The loss drops from 2.580 to 0.539 ✓ a 79% reduction!

5. Backward Rules for Basic Operation Nodes

Operation | Forward: y = ?  | Backward: ∂L/∂input                        | Numerical example (∂L/∂y = 0.5)
Add       | y = a + b       | ∂L/∂a = ∂L/∂y, ∂L/∂b = ∂L/∂y               | ∂L/∂a = 0.5, ∂L/∂b = 0.5
Multiply  | y = a × b       | ∂L/∂a = ∂L/∂y × b, ∂L/∂b = ∂L/∂y × a       | a=3, b=4: ∂L/∂a = 2.0, ∂L/∂b = 1.5
Power     | y = xⁿ          | ∂L/∂x = ∂L/∂y × nxⁿ⁻¹                      | x=2, n=3: ∂L/∂x = 0.5×12 = 6
Exp       | y = eˣ          | ∂L/∂x = ∂L/∂y × eˣ                         | x=1: ∂L/∂x = 0.5×e = 1.359
Log       | y = ln(x)       | ∂L/∂x = ∂L/∂y × (1/x)                      | x=2: ∂L/∂x = 0.5×0.5 = 0.25
ReLU      | y = max(0, x)   | ∂L/∂x = ∂L/∂y × (x>0 ? 1 : 0)              | x=3: ∂L/∂x = 0.5; x=-1: ∂L/∂x = 0
Sigmoid   | y = σ(x)        | ∂L/∂x = ∂L/∂y × y(1-y)                     | x=0, y=0.5: ∂L/∂x = 0.5×0.25 = 0.125
MatMul    | Y = WX          | ∂L/∂W = (∂L/∂Y)Xᵀ, ∂L/∂X = Wᵀ(∂L/∂Y)       | full worked numbers in part 4-3 of this section
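
Each of these local rules is a one-liner in code. Below is a minimal sketch (the helper names such as mul_back are ad-hoc, not from any library): every helper takes the upstream gradient ∂L/∂y and returns the gradient(s) with respect to its inputs, reproducing the numbers in the table.

// cargo run — the backward rules from the table, each taking the upstream gradient dl_dy
fn add_back(dl_dy: f64) -> (f64, f64) { (dl_dy, dl_dy) }                         // y = a + b
fn mul_back(dl_dy: f64, a: f64, b: f64) -> (f64, f64) { (dl_dy * b, dl_dy * a) } // y = a × b
fn pow_back(dl_dy: f64, x: f64, n: f64) -> f64 { dl_dy * n * x.powf(n - 1.0) }   // y = xⁿ
fn exp_back(dl_dy: f64, x: f64) -> f64 { dl_dy * x.exp() }                       // y = eˣ
fn ln_back(dl_dy: f64, x: f64) -> f64 { dl_dy / x }                              // y = ln(x)
fn relu_back(dl_dy: f64, x: f64) -> f64 { if x > 0.0 { dl_dy } else { 0.0 } }    // y = max(0, x)
fn sigmoid_back(dl_dy: f64, y: f64) -> f64 { dl_dy * y * (1.0 - y) }             // y = σ(x), uses the output y

fn main() {
    let g = 0.5; // upstream gradient ∂L/∂y, as in the table
    println!("add:     {:?}", add_back(g));
    println!("mul:     {:?}", mul_back(g, 3.0, 4.0));
    println!("pow:     {}", pow_back(g, 2.0, 3.0));
    println!("exp:     {}", exp_back(g, 1.0));
    println!("ln:      {}", ln_back(g, 2.0));
    println!("relu:    {} / {}", relu_back(g, 3.0), relu_back(g, -1.0));
    println!("sigmoid: {}", sigmoid_back(g, 0.5));
}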

6. Vanishing and Exploding Gradients

Vanishing gradients:

The Sigmoid derivative is at most 0.25. In a network with n Sigmoid layers,
backpropagation multiplies the gradient by σ' n times; ignoring the weight factors, the product of the n σ' terms is at most 0.25ⁿ (and usually much smaller).

n = 10 layers: 0.25¹⁰ = 9.5 × 10⁻⁷ (the gradient has all but vanished)
n = 20 layers: 0.25²⁰ = 9.1 × 10⁻¹³ (completely vanished)

Consequence: the early layers' parameters barely update and only the last few layers keep learning → deep networks fail to train
Remedies: ReLU activation (derivative = 1 for positive inputs), residual connections (see the sketch after this list), BatchNorm, the gating mechanisms of LSTM/GRU
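
A toy scalar illustration of why residual connections help (a minimal sketch; the per-layer derivative 0.1 and the depth of 20 are illustrative values only): in a plain chain the gradient is the product of the per-layer derivatives, while with y = x + f(x) each layer contributes a factor 1 + f'(x), so the identity path keeps the product from collapsing.

// cargo run — gradient through 20 plain layers vs 20 residual layers (toy scalar model)
fn main() {
    let n = 20;
    let f_prime = 0.1_f64; // small per-layer derivative f'(x), illustrative value

    // plain chain: each layer multiplies the gradient by f'(x)
    let plain: f64 = (0..n).map(|_| f_prime).product();
    // residual chain: y = x + f(x) → each layer multiplies the gradient by 1 + f'(x)
    let residual: f64 = (0..n).map(|_| 1.0 + f_prime).product();

    println!("plain    ({n} layers): {plain:.3e}");
    println!("residual ({n} layers): {residual:.3e}");
}
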
Exploding gradients:

If each layer's weight matrix amplifies the gradient by a factor of 1.5:
n = 50 layers: 1.5⁵⁰ ≈ 6.4 × 10⁸ (the gradient explodes)

Consequence: parameter updates become huge → parameters turn into NaN → training collapses
Remedies: gradient clipping (e.g. capping the gradient norm at ≤ 1.0; see the sketch after this list),
  sensible weight initialization (Xavier/He), BatchNorm
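
A minimal sketch of clipping by global L2 norm (the gradient values and the 1.0 threshold below are illustrative; deep-learning frameworks ship this as a built-in utility, so this only shows the arithmetic):

// cargo run — scales a gradient vector down if its L2 norm exceeds a threshold
fn clip_by_norm(grad: &mut [f64], max_norm: f64) {
    let norm = grad.iter().map(|g| g * g).sum::<f64>().sqrt();
    if norm > max_norm {
        let scale = max_norm / norm;                 // shrink every component by the same factor
        for g in grad.iter_mut() { *g *= scale; }
    }
}

fn main() {
    let mut grad = vec![3.0, -4.0, 12.0];            // ‖g‖ = 13, well above the threshold
    clip_by_norm(&mut grad, 1.0);                    // cap the norm at 1.0
    let new_norm = grad.iter().map(|g| g * g).sum::<f64>().sqrt();
    println!("clipped gradient: {grad:?}, norm = {new_norm:.3}");
}
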
How ReLU addresses vanishing gradients — a numerical comparison:

A 10-layer network with a single neuron per layer:

Sigmoid: ∂L/∂w₁ = ∂L/∂ŷ × (σ'₁₀)(σ'₉)...(σ'₁)
Even in the best case: ∂L/∂w₁ = 1.0 × 0.25¹⁰ = 0.00000095

ReLU (assuming every unit is active): ∂L/∂w₁ = ∂L/∂ŷ × (1)(1)...(1)
∂L/∂w₁ = 1.0 × 1¹⁰ = 1.0

A ReLU gradient can pass intact through 10 layers, or even 100.
This is why deep networks (such as the 152-layer ResNet) use ReLU rather than Sigmoid.

7. Code Verification (C# / Rust)

C# (.NET 10)

// dotnet run is enough to execute this (top-level statements)
double Sigmoid(double v) => 1.0 / (1.0 + Math.Exp(-v));

double w1 = 0.5, b1 = 0.1, w2 = 0.3, b2 = 0.2;
double x = 1.0, y = 2.0, lr = 0.5;

// ===== Forward pass =====
double z1 = w1 * x + b1;
double a1 = Sigmoid(z1);
double z2 = w2 * a1 + b2;
double yHat = z2;
double loss = Math.Pow(y - yHat, 2);
Console.WriteLine($"forward: z1={z1:F4}, a1={a1:F4}, yHat={yHat:F4}, loss={loss:F4}");

// ===== Backward pass =====
double dL_dy = -2.0 * (y - yHat);
double dL_dz2 = dL_dy;
double dL_dw2 = dL_dz2 * a1;
double dL_db2 = dL_dz2;
double dL_da1 = dL_dz2 * w2;
double sigGrad = a1 * (1.0 - a1);
double dL_dz1 = dL_da1 * sigGrad;
double dL_dw1 = dL_dz1 * x;
double dL_db1 = dL_dz1;
Console.WriteLine($"gradients: dw1={dL_dw1:F4}, db1={dL_db1:F4}, dw2={dL_dw2:F4}, db2={dL_db2:F4}");

// ===== Parameter update =====
w1 -= lr * dL_dw1; b1 -= lr * dL_db1;
w2 -= lr * dL_dw2; b2 -= lr * dL_db2;
Console.WriteLine($"updated: w1={w1:F4}, b1={b1:F4}, w2={w2:F4}, b2={b2:F4}");

// Check: the loss decreases after the update
z1 = w1 * x + b1; a1 = Sigmoid(z1);
yHat = w2 * a1 + b2;
double lossNew = Math.Pow(y - yHat, 2);
Console.WriteLine($"loss after update: {lossNew:F4} (< {loss:F4})");

// ===== Vanishing gradient demo =====
Console.WriteLine("\nVanishing gradient demo (Sigmoid):");
foreach (int n in new[] { 5, 10, 20, 50 })
    Console.WriteLine($"  {n} layers: gradient decays to {Math.Pow(0.25, n):E2}");

Console.WriteLine("\nReLU (all units active):");
foreach (int n in new[] { 5, 10, 20, 50 })
    Console.WriteLine($"  {n} layers: gradient = 1.0");

Rust

// cargo run is enough to execute this
fn sigmoid(x: f64) -> f64 { 1.0 / (1.0 + (-x).exp()) }

fn main() {
    let (mut w1, mut b1, mut w2, mut b2) = (0.5_f64, 0.1, 0.3, 0.2);
    let (x, y, lr) = (1.0_f64, 2.0, 0.5);

    // ===== Forward pass (mul_add = fused multiply-add: one instruction, higher precision) =====
    let mut z1 = w1.mul_add(x, b1);
    let mut a1 = sigmoid(z1);
    let z2 = w2.mul_add(a1, b2);
    let mut y_hat = z2;
    let loss = (y - y_hat).powi(2);
    println!("forward: z1={z1:.4}, a1={a1:.4}, yHat={y_hat:.4}, loss={loss:.4}");

    // ===== Backward pass =====
    let dl_dy = -2.0 * (y - y_hat);
    let dl_dz2 = dl_dy;
    let dl_dw2 = dl_dz2 * a1;
    let dl_db2 = dl_dz2;
    let dl_da1 = dl_dz2 * w2;
    let sig_grad = a1 * (1.0 - a1);
    let dl_dz1 = dl_da1 * sig_grad;
    let dl_dw1 = dl_dz1 * x;
    let dl_db1 = dl_dz1;
    println!("gradients: dw1={dl_dw1:.4}, db1={dl_db1:.4}, dw2={dl_dw2:.4}, db2={dl_db2:.4}");

    // ===== Parameter update =====
    w1 -= lr * dl_dw1; b1 -= lr * dl_db1;
    w2 -= lr * dl_dw2; b2 -= lr * dl_db2;
    println!("updated: w1={w1:.4}, b1={b1:.4}, w2={w2:.4}, b2={b2:.4}");

    // Check: the loss decreases after the update
    z1 = w1.mul_add(x, b1); a1 = sigmoid(z1);
    y_hat = w2.mul_add(a1, b2);
    let loss_new = (y - y_hat).powi(2);
    println!("loss after update: {loss_new:.4} (< {loss:.4})");

    // ===== Vanishing gradient demo =====
    println!("\nVanishing gradient demo (Sigmoid):");
    for n in [5, 10, 20, 50] {
        println!("  {n} layers: gradient decays to {:.2e}", 0.25_f64.powi(n));
    }
    println!("\nReLU (all units active):");
    for n in [5, 10, 20, 50] { println!("  {n} layers: gradient = 1.0"); }
}
