期望、方差与协方差

期望和方差是描述分布最基本的两个统计量,贯穿深度学习的各个环节。BatchNorm 在每一层对 mini-batch 计算均值 μ 和方差 σ²,将激活值标准化为零均值单位方差,加速收敛并稳定训练——ResNet、EfficientNet 等主流 CNN 几乎每个卷积后都接 BatchNorm。Layer Norm 沿特征维度做同样的均值/方差标准化,是 Transformer/GPT/BERT 的默认选择。损失函数的期望 E[L(θ)] 就是经验风险,整个训练过程就是最小化它。协方差矩阵刻画特征之间的相关性:PCA 对协方差矩阵做特征值分解实现降维;金融风控中协方差矩阵评估组合风险;多维高斯分布完全由均值向量和协方差矩阵决定。方差还出现在 Xavier/He 初始化公式中——权重初始值的方差必须匹配层宽度,否则梯度会消失或爆炸。本节用真实数据从概念到计算进行完整推导。

一、期望(均值)E[X]

离散随机变量:E[X] = Σᵢ xᵢ P(X=xᵢ)
连续随机变量:E[X] = ∫ x·p(x) dx

线性性质(极其重要!):
E[aX + b] = aE[X] + b
E[X + Y] = E[X] + E[Y](不需要独立!)
E[XY] = E[X]E[Y](仅当 X, Y 独立时成立)
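
线性性质中"不需要独立"一点容易被忽略,下面用一个简短的 Rust 片段(示意性验证,Y = 7−X 是这里临时构造的完全相关变量,dice_expect 为临时命名的辅助函数)枚举骰子的 6 个等概率取值加以确认:E[X+Y] = E[X]+E[Y] 在 X、Y 完全相关时依然成立,而 E[XY] ≠ E[X]E[Y]。

```rust
// 对公平骰子(1..=6 等概率)计算 E[g(X)]
fn dice_expect(g: impl Fn(f64) -> f64) -> f64 {
    (1..=6).map(|i| g(i as f64) / 6.0).sum()
}

fn main() {
    let ex = dice_expect(|x| x);                 // E[X] = 3.5
    let ey = dice_expect(|x| 7.0 - x);           // Y = 7 - X,与 X 完全相关(不独立)
    let e_sum = dice_expect(|x| x + (7.0 - x));  // E[X+Y]
    let e_prod = dice_expect(|x| x * (7.0 - x)); // E[XY]
    println!("E[X]={ex:.2}, E[Y]={ey:.2}, E[X]+E[Y]={:.2}", ex + ey);
    println!("E[X+Y]={e_sum:.2}(线性性不需要独立)");
    println!("E[XY]={e_prod:.4} ≠ E[X]E[Y]={:.4}(不独立时乘积公式失效)", ex * ey);
}
```
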
示例 1:骰子的期望
X = 骰子点数,每面概率 1/6
E[X] = 1×(1/6) + 2×(1/6) + 3×(1/6) + 4×(1/6) + 5×(1/6) + 6×(1/6)
     = (1+2+3+4+5+6)/6 = 21/6 = 3.5

示例 2:不公平硬币
P(正面=1) = 0.7, P(反面=0) = 0.3
E[X] = 1×0.7 + 0×0.3 = 0.7

示例 3:分类模型预测的期望损失
模型在 100 个测试样本上的交叉熵损失:
30 个样本损失 ≈ 0.1(高置信正确)
40 个样本损失 ≈ 0.5(中等置信度)
20 个样本损失 ≈ 1.5(低置信度)
10 个样本损失 ≈ 3.0(预测错误)

E[L] = 0.1×0.3 + 0.5×0.4 + 1.5×0.2 + 3.0×0.1
     = 0.03 + 0.20 + 0.30 + 0.30 = 0.83
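
示例 3 的加权求和可以直接用几行 Rust 核对(数据即上文的四档损失分布,expected_loss 为此处临时命名的辅助函数):

```rust
// 期望损失:E[L] = Σ P(档位) × 损失,P(档位) = 样本数 / 总样本数
fn expected_loss(buckets: &[(f64, f64)]) -> f64 {
    let total: f64 = buckets.iter().map(|b| b.0).sum();
    buckets.iter().map(|b| b.0 / total * b.1).sum()
}

fn main() {
    // (样本数, 近似损失):与上文示例 3 相同的四档分布
    let buckets = [(30.0, 0.1), (40.0, 0.5), (20.0, 1.5), (10.0, 3.0)];
    println!("E[L] = {:.2}", expected_loss(&buckets)); // 0.83
}
```
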

二、方差 Var(X) 与标准差 σ

方差定义:Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
标准差:σ = √Var(X)

重要性质:
Var(aX + b) = a² Var(X)(平移 b 不改变方差,缩放 a 使方差乘 a²)
Var(X + Y) = Var(X) + Var(Y) + 2Cov(X,Y)
若 X, Y 独立:Var(X + Y) = Var(X) + Var(Y)
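
这些性质可以用数值快速验证。下面的 Rust 片段以 5 个成绩数据为例,确认 Var(aX+b) = a²Var(X)(a=3、b=7 为任意选取的示例值,var_pop 为临时命名的辅助函数):

```rust
// 总体方差:E[(X - E[X])²]
fn var_pop(v: &[f64]) -> f64 {
    let n = v.len() as f64;
    let mean = v.iter().sum::<f64>() / n;
    v.iter().map(|x| (x - mean).powi(2)).sum::<f64>() / n
}

fn main() {
    let x = [72.0, 85.0, 90.0, 68.0, 95.0];
    let (a, b) = (3.0, 7.0);
    // 仿射变换:每个元素变为 aX + b
    let y: Vec<f64> = x.iter().map(|v| a * v + b).collect();
    println!("Var(X)    = {:.1}", var_pop(&x)); // 107.6
    println!("Var(3X+7) = {:.1}", var_pop(&y)); // 968.4 = 9 × 107.6,平移 7 不起作用
    println!("a²Var(X)  = {:.1}", a * a * var_pop(&x));
}
```
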
完整数值计算——5 个学生的考试成绩:
成绩 X = [72, 85, 90, 68, 95]

步骤 1:计算均值
E[X] = (72+85+90+68+95)/5 = 410/5 = 82

步骤 2:计算每个偏差
72 - 82 = -10
85 - 82 = +3
90 - 82 = +8
68 - 82 = -14
95 - 82 = +13

偏差之和 = -10+3+8-14+13 = 0(永远为 0:Σ(xᵢ-x̄) = Σxᵢ - N·x̄ = 0)

步骤 3:偏差平方
(-10)² = 100
(+3)² = 9
(+8)² = 64
(-14)² = 196
(+13)² = 169

步骤 4:方差(总体方差)
Var(X) = (100+9+64+196+169)/5 = 538/5 = 107.6

步骤 5:标准差
σ = √107.6 = 10.37

验算(用第二个公式):
E[X²] = (72²+85²+90²+68²+95²)/5 = (5184+7225+8100+4624+9025)/5 = 34158/5 = 6831.6
Var = E[X²] - (E[X])² = 6831.6 - 82² = 6831.6 - 6724 = 107.6 ✓
总体方差 vs 样本方差:
总体方差:除以 N → σ² = Σ(xᵢ-μ)²/N
样本方差:除以 N-1(贝塞尔校正,修正对总体方差的系统性低估)→ s² = Σ(xᵢ-x̄)²/(N-1)
上例:样本方差 = 538/4 = 134.5
注意:不同库的默认行为不同(例如 NumPy 的 var 默认除以 N,pandas 默认除以 N-1),使用时应确认 ddof 参数。
方差在 BatchNorm 中的核心作用:

BatchNorm 对每个特征做标准化:x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)

假设一个 batch 有 4 个样本,某个神经元的激活值为 [2.0, 4.0, 6.0, 8.0]
μ_B = (2+4+6+8)/4 = 5.0
σ²_B = [(2-5)²+(4-5)²+(6-5)²+(8-5)²]/4 = (9+1+1+9)/4 = 5.0
ε = 1e-5(数值稳定性)

x̂₁ = (2-5)/√5.00001 = -3/2.236 = -1.342
x̂₂ = (4-5)/√5.00001 = -1/2.236 = -0.447
x̂₃ = (6-5)/√5.00001 = 1/2.236 = 0.447
x̂₄ = (8-5)/√5.00001 = 3/2.236 = 1.342

标准化后均值 ≈ 0,方差 ≈ 1。然后 BN 还有可学习参数 γ, β:
yᵢ = γ × x̂ᵢ + β
若 γ=2, β=1:y₁ = 2×(-1.342)+1 = -1.684

三、协方差 Cov(X, Y)

Cov(X, Y) = E[(X-E[X])(Y-E[Y])] = E[XY] - E[X]E[Y]

Cov > 0:X 增大时 Y 倾向增大(正相关)
Cov < 0:X 增大时 Y 倾向减小(负相关)
Cov = 0:不相关(但"不相关"只排除线性关系,仍可能存在非线性依赖!)

相关系数(标准化的协方差):
ρ(X,Y) = Cov(X,Y) / (σ_X × σ_Y),ρ ∈ [-1, 1]
完整数值计算——学时与成绩的关系:
5 个学生的每周学习时数 X 和考试成绩 Y:

学生   X(学时)  Y(成绩)  X-μₓ  Y-μᵧ  (X-μₓ)(Y-μᵧ)
A         5        60      -5    -18           90
B         8        72      -2     -6           12
C        10        80       0      2            0
D        12        85       2      7           14
E        15        93       5     15           75
μₓ = (5+8+10+12+15)/5 = 50/5 = 10
μᵧ = (60+72+80+85+93)/5 = 390/5 = 78

Cov(X,Y) = (90+12+0+14+75)/5 = 191/5 = 38.2

Var(X) = (25+4+0+4+25)/5 = 58/5 = 11.6 → σₓ = 3.406
Var(Y) = (324+36+4+49+225)/5 = 638/5 = 127.6 → σᵧ = 11.30

ρ = 38.2 / (3.406 × 11.30) = 38.2 / 38.49 = 0.992

结论:学习时间与成绩强正相关(ρ≈0.99)。

四、协方差矩阵

n 个特征 X₁, X₂, ..., Xₙ 的协方差矩阵 Σ 是 n×n 矩阵:
Σᵢⱼ = Cov(Xᵢ, Xⱼ)
对角线 Σᵢᵢ = Var(Xᵢ)
Σ 是对称正半定矩阵
Iris 数据集的协方差矩阵(4 个特征):
特征:花萼长 SL、花萼宽 SW、花瓣长 PL、花瓣宽 PW

先提供 6 个样本(简化版):
样本   SL    SW    PL    PW
1     5.1   3.5   1.4   0.2
2     4.9   3.0   1.4   0.2
3     7.0   3.2   4.7   1.4
4     6.4   3.2   4.5   1.5
5     6.3   3.3   6.0   2.5
6     5.8   2.7   5.1   1.9
均值向量:μ = [5.917, 3.150, 3.850, 1.283]
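
这个均值向量可以用几行代码从 6 个样本直接算出核对(mean_vec 为此处临时命名的辅助函数):

```rust
// 按列求均值向量:μⱼ = Σᵢ dataᵢⱼ / n
fn mean_vec(data: &[[f64; 4]]) -> [f64; 4] {
    let mut mu = [0.0_f64; 4];
    for row in data {
        for j in 0..4 { mu[j] += row[j]; }
    }
    for j in 0..4 { mu[j] /= data.len() as f64; }
    mu
}

fn main() {
    // 上表的 6 个简化 Iris 样本:SL, SW, PL, PW
    let data = [
        [5.1, 3.5, 1.4, 0.2],
        [4.9, 3.0, 1.4, 0.2],
        [7.0, 3.2, 4.7, 1.4],
        [6.4, 3.2, 4.5, 1.5],
        [6.3, 3.3, 6.0, 2.5],
        [5.8, 2.7, 5.1, 1.9],
    ];
    let mu = mean_vec(&data);
    println!("μ = [{:.3}, {:.3}, {:.3}, {:.3}]", mu[0], mu[1], mu[2], mu[3]);
    // 输出:μ = [5.917, 3.150, 3.850, 1.283],与正文一致
}
```
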

完整 Iris 数据集(150 样本)的协方差矩阵:
        SL      SW      PL      PW
SL [ 0.686  -0.042   1.274   0.516]
SW [-0.042   0.190  -0.330  -0.122]
PL [ 1.274  -0.330   3.116   1.296]
PW [ 0.516  -0.122   1.296   0.581]

解读:
• Cov(SL,PL)=1.274 > 0:花萼越长,花瓣也越长
• Cov(SW,PL)=-0.330 < 0:花萼越宽,花瓣反而短
• Var(PL)=3.116 最大:花瓣长度变化最大→区分度最高
• PCA 第一主成分主要沿 PL 和 PW 方向
协方差矩阵在 AI 中的应用:

PCA 降维:对协方差矩阵做特征分解,特征向量就是主成分方向
多维高斯分布:N(μ, Σ) 中的 Σ 就是协方差矩阵
马氏距离:d(x,μ) = √((x-μ)ᵀΣ⁻¹(x-μ)),考虑了特征间的相关性
白化(Whitening):使协方差矩阵变为单位矩阵 I,消除特征间相关性
注意力机制:Softmax(QKᵀ/√d) 给出的注意力权重矩阵刻画序列位置两两之间的相似度,作用上类似一个(非对称的)相关性矩阵
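
作为示意,下面用幂迭代(power iteration)对第三节学时-成绩例子的 2×2 总体协方差矩阵求最大特征值和对应特征向量——这正是 PCA 第一主成分的计算核心(实际库会做完整特征分解或 SVD,这里仅为演示;top_eig 为临时命名的函数):

```rust
// 幂迭代:反复计算 v ← Σv / ‖Σv‖,收敛到最大特征值对应的特征向量
fn top_eig(cov: [[f64; 2]; 2]) -> (f64, [f64; 2]) {
    let mut v = [1.0_f64, 1.0];
    let mut lambda = 0.0;
    for _ in 0..100 {
        let w = [
            cov[0][0] * v[0] + cov[0][1] * v[1],
            cov[1][0] * v[0] + cov[1][1] * v[1],
        ];
        let norm = (w[0] * w[0] + w[1] * w[1]).sqrt();
        v = [w[0] / norm, w[1] / norm];
        // Rayleigh 商 vᵀΣv 作为特征值估计
        let sv = [
            cov[0][0] * v[0] + cov[0][1] * v[1],
            cov[1][0] * v[0] + cov[1][1] * v[1],
        ];
        lambda = v[0] * sv[0] + v[1] * sv[1];
    }
    (lambda, v)
}

fn main() {
    // 第三节"学时-成绩"例子的总体协方差矩阵
    let cov = [[11.6, 38.2], [38.2, 127.6]];
    let (l1, pc1) = top_eig(cov);
    println!("最大特征值 λ₁ ≈ {l1:.2}");                      // ≈ 139.05
    println!("第一主成分方向 ≈ [{:.3}, {:.3}]", pc1[0], pc1[1]); // ≈ [0.287, 0.958]
    // 成绩方向的权重远大于学时:方差主要沿成绩轴分布
}
```
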

五、期望和方差在损失函数中的角色

MSE 损失就是误差的期望:
MSE = E[(y - ŷ)²] = (1/N) Σᵢ (yᵢ - ŷᵢ)²

房价预测 4 个样本:
y(真实)= [300, 450, 200, 550](万元)
ŷ(预测)= [320, 430, 210, 500](万元)

误差 = [20, -20, 10, -50]
误差² = [400, 400, 100, 2500]
MSE = (400+400+100+2500)/4 = 3400/4 = 850
RMSE = √850 = 29.15 万元

损失的方差影响训练稳定性:
Var(MSE_batch) 越大 → 梯度波动越大 → 训练越不稳定
batch_size 增大 → 梯度估计的方差减小(独立同分布样本均值的方差满足 Var(X̄) = σ²/N,即 Var ∝ 1/N)
batch_size: 1 → 16 → 64 → 256
梯度方差比: 1 → 1/16 → 1/64 → 1/256
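
"方差 ∝ 1/batch_size"可以用模拟直观看到。下面的 Rust 片段用一个简单的线性同余生成器(LCG,避免引入外部随机数 crate)产生 65536 个伪随机"单样本损失"(示意数据,非真实训练损失),比较不同 batch 大小下批均值的方差:

```rust
// 线性同余伪随机数生成器,返回 [0, 1) 上的数
struct Lcg(u64);
impl Lcg {
    fn next_f64(&mut self) -> f64 {
        // Knuth MMIX 的 LCG 参数,取高 53 位构造 f64
        self.0 = self.0.wrapping_mul(6364136223846793005).wrapping_add(1442695040888963407);
        (self.0 >> 11) as f64 / (1u64 << 53) as f64
    }
}

// 把 losses 按大小为 bs 的批切分,返回各批均值的(总体)方差
fn batch_mean_var(losses: &[f64], bs: usize) -> f64 {
    let means: Vec<f64> = losses
        .chunks_exact(bs)
        .map(|c| c.iter().sum::<f64>() / bs as f64)
        .collect();
    let m = means.iter().sum::<f64>() / means.len() as f64;
    means.iter().map(|x| (x - m).powi(2)).sum::<f64>() / means.len() as f64
}

fn main() {
    let mut rng = Lcg(42);
    let losses: Vec<f64> = (0..65536).map(|_| rng.next_f64()).collect();
    for bs in [1, 16, 64, 256] {
        // 批均值的方差随 batch_size 大致按 1/N 缩小
        println!("batch_size={bs:4} → 批均值方差 ≈ {:.6}", batch_mean_var(&losses, bs));
    }
}
```
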

六、代码验证(C# / Rust)

C#(.NET 10)

// dotnet run 即可执行
// ===== 学生成绩方差 =====
double[] scores = { 72, 85, 90, 68, 95 };
double mean = scores.Average();
double varPop = scores.Select(s => (s-mean)*(s-mean)).Average();
double varSample = scores.Select(s => (s-mean)*(s-mean)).Sum() / (scores.Length-1);
Console.WriteLine($"均值={mean}, 总体方差={varPop}, 样本方差={varSample}, 标准差={Math.Sqrt(varPop):F2}");

// ===== 协方差 =====
double[] X = { 5, 8, 10, 12, 15 };
double[] Y = { 60, 72, 80, 85, 93 };
double mx = X.Average(), my = Y.Average();
double covXY = 0, sx2 = 0, sy2 = 0;
for (int i = 0; i < 5; i++) {
    covXY += (X[i]-mx)*(Y[i]-my);
    sx2 += (X[i]-mx)*(X[i]-mx); sy2 += (Y[i]-my)*(Y[i]-my);
}
covXY /= 5;
double rho = covXY / (Math.Sqrt(sx2/5) * Math.Sqrt(sy2/5));
Console.WriteLine($"\nCov(X,Y) = {covXY:F2}");
Console.WriteLine($"ρ = {rho:F4}");

// 总体协方差矩阵
Console.WriteLine($"\n协方差矩阵(ddof=0):");
Console.WriteLine($"  [{sx2/5,8:F2} {covXY,8:F2}]");
Console.WriteLine($"  [{covXY,8:F2} {sy2/5,8:F2}]");

// ===== 5名学生、3项指标的协方差矩阵 =====
double[,] data3 = {{85,90,88},{70,75,72},{92,88,95},{65,70,68},{78,82,80}};
int n = 5, p = 3;
double[] mu3 = new double[p];
for (int j = 0; j < p; j++) { for (int i = 0; i < n; i++) mu3[j] += data3[i,j]; mu3[j] /= n; }

Console.WriteLine("\n3×3 样本协方差矩阵 (替代 Iris):");
for (int j1 = 0; j1 < p; j1++) {
    for (int j2 = 0; j2 < p; j2++) {
        double c = 0;
        for (int i = 0; i < n; i++) c += (data3[i,j1]-mu3[j1])*(data3[i,j2]-mu3[j2]);
        Console.Write($"{c/(n-1),8:F2}");
    }
    Console.WriteLine();
}

// ===== BatchNorm =====
double[] act = { 2.0, 4.0, 6.0, 8.0 };
double muBn = act.Average();
double vBn = act.Select(v => (v-muBn)*(v-muBn)).Average();
double gamma = 2.0, beta = 1.0;
Console.Write($"\nBatchNorm: μ={muBn}, σ²={vBn}\n标准化: ");
foreach (var v in act) Console.Write($"{(v-muBn)/Math.Sqrt(vBn+1e-5):F3} ");
Console.Write($"\n缩放平移(γ=2,β=1): ");
foreach (var v in act) Console.Write($"{gamma*((v-muBn)/Math.Sqrt(vBn+1e-5))+beta:F3} ");
Console.WriteLine();

// ===== MSE =====
double[] yT = { 300, 450, 200, 550 };
double[] yP = { 320, 430, 210, 500 };
double mse = 0;
for (int i = 0; i < 4; i++) mse += Math.Pow(yT[i]-yP[i], 2);
mse /= 4;
Console.WriteLine($"\nMSE = {mse}, RMSE = {Math.Sqrt(mse):F2}");

Rust

fn main() {
    // ===== 学生成绩方差 =====
    let scores = [72.0, 85.0, 90.0, 68.0, 95.0_f64];
    let n = 5.0_f64;
    let mut mean = 0.0; for i in 0..5 { mean += scores[i]; } mean /= n;
    let mut var_pop = 0.0; for i in 0..5 { var_pop += (scores[i]-mean).powi(2); } var_pop /= n;
    let var_sample = var_pop * n / (n - 1.0);
    println!("均值={mean}, 总体方差={var_pop}, 样本方差={var_sample}, 标准差={:.2}", var_pop.sqrt());

    // ===== 协方差 =====
    let x = [5.0, 8.0, 10.0, 12.0, 15.0_f64];
    let y = [60.0, 72.0, 80.0, 85.0, 93.0_f64];
    let mut mx = 0.0; for i in 0..5 { mx += x[i]; } mx /= 5.0;
    let mut my = 0.0; for i in 0..5 { my += y[i]; } my /= 5.0;
    let (mut cov_xy, mut sx2, mut sy2) = (0.0, 0.0, 0.0_f64);
    for i in 0..5 {
        cov_xy += (x[i]-mx)*(y[i]-my);
        sx2 += (x[i]-mx).powi(2); sy2 += (y[i]-my).powi(2);
    }
    cov_xy /= 5.0;
    let rho = cov_xy / ((sx2/5.0).sqrt() * (sy2/5.0).sqrt());
    println!("\nCov(X,Y) = {cov_xy:.2}");
    println!("ρ = {rho:.4}");
    println!("\n协方差矩阵(ddof=0):");
    println!("  [{:8.2} {:8.2}]", sx2/5.0, cov_xy);
    println!("  [{:8.2} {:8.2}]", cov_xy, sy2/5.0);

    // ===== 3×3 协方差矩阵 =====
    let data3 = [[85.0,90.0,88.0],[70.0,75.0,72.0],[92.0,88.0,95.0],
                  [65.0,70.0,68.0],[78.0,82.0,80.0_f64]];
    let mut mu3 = [0.0_f64; 3];
    for j in 0..3 { for i in 0..5 { mu3[j] += data3[i][j]; } mu3[j] /= 5.0; }
    println!("\n3×3 样本协方差矩阵:");
    for j1 in 0..3 {
        for j2 in 0..3 {
            let mut c = 0.0;
            for i in 0..5 { c += (data3[i][j1]-mu3[j1])*(data3[i][j2]-mu3[j2]); }
            print!("{:8.2}", c / 4.0);
        }
        println!();
    }

    // ===== BatchNorm =====
    let act = [2.0, 4.0, 6.0, 8.0_f64];
    let mut mu_bn = 0.0; for i in 0..4 { mu_bn += act[i]; } mu_bn /= 4.0;
    let mut v_bn = 0.0; for i in 0..4 { v_bn += (act[i]-mu_bn).powi(2); } v_bn /= 4.0;
    let (gamma, beta) = (2.0_f64, 1.0);
    print!("\nBatchNorm: μ={mu_bn}, σ²={v_bn}\n标准化: ");
    for i in 0..4 { print!("{:.3} ", (act[i]-mu_bn)/(v_bn+1e-5).sqrt()); }
    print!("\n缩放平移(γ=2,β=1): ");
    for i in 0..4 { print!("{:.3} ", gamma.mul_add((act[i]-mu_bn)/(v_bn+1e-5).sqrt(), beta)); }
    println!();

    // ===== MSE =====
    let yt = [300.0, 450.0, 200.0, 550.0_f64];
    let yp = [320.0, 430.0, 210.0, 500.0_f64];
    let mut mse = 0.0; for i in 0..4 { mse += (yt[i]-yp[i]).powi(2); } mse /= 4.0;
    println!("\nMSE = {mse}, RMSE = {:.2}", mse.sqrt());
}
