Solomon Kullback, Information Theory and Statistics

If $H_i$, $i = 1, 2$, is the hypothesis that $X$ is from the statistical population with probability measure $\mu_i$, where

$$ d\mu_i(x) = f_i(x)\, d\lambda(x)\,, $$

and $P(H_i)$ is the prior probability of the respective hypothesis, $i = 1, 2$, it follows that, conditional on $x$,

$$ P(H_i \mid x) = \frac{P(H_i)\, f_i(x)}{P(H_1)\, f_1(x) + P(H_2)\, f_2(x)}\,, $$

except on a set of $\lambda$-measure zero.

One obtains

$$ \log \frac{f_1(x)}{f_2(x)} = \log \frac{P(H_1 \mid x)}{P(H_2 \mid x)} - \log \frac{P(H_1)}{P(H_2)}\,, $$

except on a set of $\lambda$-measure zero.

This logarithm of the likelihood ratio, $\log\,[f_1(x)/f_2(x)]$, is defined as the information in $X = x$ for discrimination in favour of $H_1$ against $H_2$.
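
As a quick numerical sketch (mine, not from the book), the snippet below takes two made-up Gaussian densities for $f_1$, $f_2$ and arbitrary priors $P(H_1)$, $P(H_2)$, and checks that the log-likelihood ratio at an observed $x$ equals the log posterior odds minus the log prior odds.

```python
# Check log f1(x)/f2(x) = log posterior odds - log prior odds for a hypothetical example.
import math

def normal_pdf(x, mean, std):
    """Density of a Normal(mean, std^2) evaluated at x."""
    return math.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2 * math.pi))

f1 = lambda x: normal_pdf(x, mean=0.0, std=1.0)   # density under H1 (assumed)
f2 = lambda x: normal_pdf(x, mean=1.0, std=2.0)   # density under H2 (assumed)
p1, p2 = 0.3, 0.7                                 # prior probabilities P(H1), P(H2)

x = 0.8  # an observed value

# Posterior probabilities by Bayes' rule.
evidence = p1 * f1(x) + p2 * f2(x)
post1 = p1 * f1(x) / evidence
post2 = p2 * f2(x) / evidence

lhs = math.log(f1(x) / f2(x))                       # information in X = x for H1 against H2
rhs = math.log(post1 / post2) - math.log(p1 / p2)   # log posterior odds minus log prior odds
print(lhs, rhs)  # the two numbers agree up to rounding
```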

The mean information for discrimination in favour of $H_1$ against $H_2$ over a set $E$ is

$$
I(1:2) =
\begin{cases}
\dfrac{1}{\mu_1(E)} \displaystyle\int_E f_1(x)\, \log \frac{f_1(x)}{f_2(x)}\, d\lambda(x)\,, & \mu_1(E) > 0\,, \\[2ex]
0\,, & \mu_1(E) = 0\,.
\end{cases}
$$

Over the entire space,

$$ I(1:2) = \int \log \frac{f_1(x)}{f_2(x)}\, d\mu_1(x) = \int f_1(x)\, \log \frac{f_1(x)}{f_2(x)}\, d\lambda(x) = \left( \int \log \frac{P(H_1 \mid x)}{P(H_2 \mid x)}\, d\mu_1(x) \right) - \log \frac{P(H_1)}{P(H_2)}\,. $$

It is non-negative: $0 \le I(1:2) \le +\infty$. It is zero if $f_1 = f_2$ except on a set of $\mu_1$-measure zero. It attains $+\infty$ when, for example, the space can be divided into two sets, with $f_1 > 0$, $f_2 = 0$ on one and $f_1 = 0$, $f_2 > 0$ on the other.
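
For a concrete sketch (my own, not Kullback's), take $\lambda$ to be counting measure on a finite set, so the integral becomes a sum. The made-up distributions below illustrate the three cases: $I(1:2) = 0$ when $f_1 = f_2$, a positive finite value in general, and $+\infty$ when $f_1 > 0$ on a set where $f_2 = 0$.

```python
# Discrete directed divergence: I(1:2) = sum_i f1[i] * log(f1[i] / f2[i]).
import math

def directed_divergence(f1, f2):
    """I(1:2) for discrete distributions f1, f2 over the same index set."""
    total = 0.0
    for p, q in zip(f1, f2):
        if p == 0.0:
            continue            # 0 * log(0/q) is taken as 0
        if q == 0.0:
            return math.inf     # f1 > 0 where f2 = 0: I(1:2) = +infinity
        total += p * math.log(p / q)
    return total

f1 = [0.5, 0.3, 0.2]
f2 = [0.1, 0.4, 0.5]
print(directed_divergence(f1, f1))  # 0.0: zero when the distributions coincide
print(directed_divergence(f1, f2))  # positive
print(directed_divergence([0.5, 0.5, 0.0], [0.0, 0.0, 1.0]))  # inf: disjoint supports
```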

The prior information, $\log \frac{P(H_1)}{P(H_2)}$, still appears in the expression for $I(1:2)$.

Likewise,

$$ I(2:1) = \int \log \frac{f_2(x)}{f_1(x)}\, d\mu_2(x) = \int f_2(x)\, \log \frac{f_2(x)}{f_1(x)}\, d\lambda(x)\,. $$

One may define the divergence $J(1,2)$,

$$ J(1,2) = I(1:2) + I(2:1) = \int \bigl(f_1(x) - f_2(x)\bigr)\, \log \frac{f_1(x)}{f_2(x)}\, d\lambda(x) = \int \log \frac{P(H_1 \mid x)}{P(H_2 \mid x)}\, \bigl(d\mu_1(x) - d\mu_2(x)\bigr)\,. $$

If we slightly rewrite it,

$$ J(1,2) = I(1:2) + I(2:1) = \int \bigl(f_1(x) - f_2(x)\bigr)\, \bigl(\log f_1(x) - \log f_2(x)\bigr)\, d\lambda(x) = \int \bigl(\log P(H_1 \mid x) - \log P(H_2 \mid x)\bigr)\, \bigl(d\mu_1(x) - d\mu_2(x)\bigr)\,. $$

If one swaps the positions of hypotheses 1 and 2, one obtains the same quantity: $J(1,2) = J(2,1)$. So it is symmetric, and the prior information has disappeared. Kullback proposes to call $I(1:2)$ or $I(2:1)$ the directed divergence.
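
Continuing the discrete sketch with made-up distributions, the snippet below checks that $J(1,2)$ is symmetric in the two hypotheses, while the directed divergences alone, in general, are not.

```python
# J(1,2) = I(1:2) + I(2:1) under counting measure, for strictly positive f1, f2.
import math

def directed_divergence(f1, f2):
    """I(1:2) = sum_i f1[i] * log(f1[i] / f2[i])."""
    return sum(p * math.log(p / q) for p, q in zip(f1, f2))

def divergence(f1, f2):
    """J(1,2) = sum_i (f1[i] - f2[i]) * log(f1[i] / f2[i])."""
    return sum((p - q) * math.log(p / q) for p, q in zip(f1, f2))

f1 = [0.5, 0.3, 0.2]
f2 = [0.1, 0.4, 0.5]
print(directed_divergence(f1, f2), directed_divergence(f2, f1))  # unequal in general
print(divergence(f1, f2), divergence(f2, f1))                    # equal: J(1,2) = J(2,1)
```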

As $0 \le I(1:2) \le +\infty$ and $0 \le I(2:1) \le +\infty$,

$$ 0 \le J(1,2) \le +\infty\,. $$

Both bounds are attainable.