
Fisher information: Difference between revisions

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.



Revision as of 18:33, 9 September 2004

In statistics, the Fisher information I(θ), thought of as the amount of information that an observable random variable X carries about an unobservable parameter θ upon which the probability distribution of X depends, is the variance of the score (the derivative of the log-likelihood with respect to θ). Because the expectation of the score is zero, this may be written as

$$I(\theta) = E\left(\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]^{2}\right)$$

where f is the probability density function of the random variable X. The Fisher information is thus the expectation of the square of the score. High Fisher information means that the absolute value of the score is frequently large (recall that the expectation of the score is zero).
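As a minimal sketch of this definition, the following computes the mean and the second moment of the score exactly for a single Bernoulli trial. The Bernoulli model and the function names are assumptions chosen for illustration; they are not part of the definition itself.

```python
import math


def bernoulli_pmf(x: int, theta: float) -> float:
    """Probability of outcome x (0 or 1) for success probability theta."""
    return theta if x == 1 else 1.0 - theta


def bernoulli_score(x: int, theta: float) -> float:
    """Derivative of log f(x; theta) with respect to theta for one trial."""
    return x / theta - (1 - x) / (1.0 - theta)


def score_moments(theta: float) -> tuple[float, float]:
    """Return (E[score], E[score^2]) by summing over both outcomes."""
    mean = sum(bernoulli_pmf(x, theta) * bernoulli_score(x, theta) for x in (0, 1))
    second = sum(bernoulli_pmf(x, theta) * bernoulli_score(x, theta) ** 2 for x in (0, 1))
    return mean, second


theta = 0.3
mean, info = score_moments(theta)
print("E[score]   =", mean)   # close to zero
print("E[score^2] =", info)   # the Fisher information of one trial
print("matches 1/(theta(1-theta)):", math.isclose(info, 1.0 / (theta * (1.0 - theta))))
```

Because the mean of the score is zero, the second moment computed here is also its variance.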

This concept is named in honor of the geneticist and statistician Ronald Fisher.

Note that the information as defined above is not a function of a particular observation, as the random variable X has been averaged out. The concept of information is useful when comparing two methods of observation of some random process.

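Under suitable regularity conditions (f twice differentiable in θ, with differentiation and integration interchangeable), information as defined above may also be written as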

$$I(\theta) = -E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right]$$

and is thus the negative of the expectation of the second derivative of the log of f(X; θ) with respect to θ. Information may thus be seen as a measure of the "sharpness" of the support curve (the log-likelihood curve) near the maximum likelihood estimate of θ. A "blunt" support curve (one with a shallow maximum) has an expected second derivative that is small in absolute value, and thus low information, while a sharp one has an expected second derivative that is large in absolute value, and thus high information.
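The agreement between the squared-score form and the second-derivative form can be sketched as follows. The sketch assumes that f is twice differentiable in θ and that differentiation with respect to θ may be exchanged with integration over x; these regularity assumptions are added here for illustration.

$$
\frac{\partial^{2}}{\partial\theta^{2}}\log f
= \frac{1}{f}\frac{\partial^{2} f}{\partial\theta^{2}}
- \left(\frac{\partial}{\partial\theta}\log f\right)^{2},
\qquad
E\left[\frac{1}{f}\frac{\partial^{2} f}{\partial\theta^{2}}\right]
= \int \frac{\partial^{2} f}{\partial\theta^{2}}\,dx
= \frac{\partial^{2}}{\partial\theta^{2}}\int f\,dx
= 0 .
$$

Taking expectations of the first identity and using the second gives

$$
-E\left[\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right]
= E\left(\left[\frac{\partial}{\partial\theta}\log f(X;\theta)\right]^{2}\right).
$$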

Information is additive, in the sense that the information gathered by two independent experiments is the sum of the information of each of them:

$$I_{X,Y}(\theta) = I_{X}(\theta) + I_{Y}(\theta).$$

This is because the score of the joint observation is the sum of the two individual scores, and the variance of the sum of two independent random variables is the sum of their variances. It follows that the information in a random sample of size n is n times that in a sample of size one (if observations are independent).
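Additivity can be checked exactly on a small case. The sketch below assumes two independent binomial experiments that share the same success probability θ (an illustrative choice, with hypothetical function names) and compares the information of the joint observation with the sum of the individual informations.

```python
import math
from itertools import product


def binom_pmf(k: int, m: int, theta: float) -> float:
    """Probability of k successes in m independent trials."""
    return math.comb(m, k) * theta**k * (1.0 - theta) ** (m - k)


def binom_score(k: int, m: int, theta: float) -> float:
    """Derivative of the log pmf with respect to theta."""
    return k / theta - (m - k) / (1.0 - theta)


def info_single(m: int, theta: float) -> float:
    """Fisher information of one binomial experiment: E[score^2]."""
    return sum(binom_pmf(k, m, theta) * binom_score(k, m, theta) ** 2
               for k in range(m + 1))


def info_joint(m_x: int, m_y: int, theta: float) -> float:
    """Fisher information of two independent experiments observed together."""
    total = 0.0
    for kx, ky in product(range(m_x + 1), range(m_y + 1)):
        # Independence: the joint pmf is the product of the marginals.
        p = binom_pmf(kx, m_x, theta) * binom_pmf(ky, m_y, theta)
        # The joint score is the sum of the individual scores.
        s = binom_score(kx, m_x, theta) + binom_score(ky, m_y, theta)
        total += p * s**2
    return total


theta = 0.3
lhs = info_joint(2, 5, theta)
rhs = info_single(2, theta) + info_single(5, theta)
print(lhs, rhs, math.isclose(lhs, rhs))
```

The cross term in the squared joint score vanishes in expectation because each score has mean zero and the two experiments are independent.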

The information provided by a sufficient statistic is the same as that of the sample X. This may be seen by using Fisher's factorization criterion for a sufficient statistic. If T(X) is sufficient for θ, then

$$f(X;\theta) = g(T(X);\theta) \times h(X)$$

for some functions g and h (see sufficient statistic for a more detailed explanation). The equality of information follows from the fact that

$$\frac{\partial}{\partial\theta}\log\left[f(X;\theta)\right] = \frac{\partial}{\partial\theta}\log\left[g(T(X);\theta)\right]$$

(which is the case because h(X) is independent of θ) and the definition for information given above. More generally, if T=t(X) is a statistic, then

$$I_{T}(\theta) \leq I_{X}(\theta)$$

with equality if and only if T is a sufficient statistic.
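The inequality can be illustrated numerically. The sketch below assumes three independent Bernoulli trials and compares the information in the full sample, in the sufficient statistic (the sum), and in a statistic that is not sufficient (the maximum). The derivative of each pmf is approximated by a central finite difference, so the results are approximate; the function names and the choice of statistics are illustrative assumptions.

```python
import math
from itertools import product


def sample_pmf(xs: tuple, theta: float) -> float:
    """Probability of one particular sequence of Bernoulli outcomes."""
    a = sum(xs)
    return theta**a * (1.0 - theta) ** (len(xs) - a)


def statistic_pmf(stat, n: int, theta: float) -> dict:
    """Distribution of stat(X) obtained by enumerating all 2^n samples."""
    pmf = {}
    for xs in product((0, 1), repeat=n):
        t = stat(xs)
        pmf[t] = pmf.get(t, 0.0) + sample_pmf(xs, theta)
    return pmf


def statistic_information(stat, n: int, theta: float, eps: float = 1e-5) -> float:
    """Approximate Fisher information of stat(X): sum over t of (dp_t/dtheta)^2 / p_t."""
    lo = statistic_pmf(stat, n, theta - eps)
    hi = statistic_pmf(stat, n, theta + eps)
    mid = statistic_pmf(stat, n, theta)
    info = 0.0
    for t, p in mid.items():
        dp = (hi[t] - lo[t]) / (2.0 * eps)  # central difference in theta
        info += dp**2 / p
    return info


n, theta = 3, 0.3
info_full = statistic_information(lambda xs: xs, n, theta)  # the whole sample
info_sum = statistic_information(sum, n, theta)             # sufficient statistic
info_max = statistic_information(max, n, theta)             # not sufficient
print(info_full, info_sum, info_max)
```

In this sketch the sum reproduces the information of the full sample up to numerical error, while the maximum, which only records whether any success occurred, carries strictly less.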

The Cramér-Rao inequality states that the reciprocal of the Fisher information is a lower bound on the variance of any unbiased estimator of θ.
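A small exact check of the bound, assuming n independent Bernoulli trials with information n/(θ(1−θ)): the sample proportion attains the bound, while another unbiased estimator (the first observation alone) has larger variance. The estimators and function names are illustrative assumptions.

```python
import math
from itertools import product


def estimator_mean_var(estimator, n: int, theta: float) -> tuple[float, float]:
    """Exact mean and variance of an estimator over all 2^n Bernoulli samples."""
    mean = 0.0
    second = 0.0
    for xs in product((0, 1), repeat=n):
        a = sum(xs)
        p = theta**a * (1.0 - theta) ** (n - a)
        e = estimator(xs)
        mean += p * e
        second += p * e * e
    return mean, second - mean**2


n, theta = 4, 0.3
bound = theta * (1.0 - theta) / n  # reciprocal of n / (theta (1 - theta))

mean_prop, var_prop = estimator_mean_var(lambda xs: sum(xs) / len(xs), n, theta)
mean_first, var_first = estimator_mean_var(lambda xs: xs[0], n, theta)

print("bound:", bound)
print("sample proportion: mean", mean_prop, "variance", var_prop)
print("first observation: mean", mean_first, "variance", var_first)
```

Both estimators are unbiased for θ, so both are subject to the bound; only the sample proportion reaches it in this example.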

Example

The information contained in n independent Bernoulli trials, each with probability of success θ, may be calculated as follows. In the following, a represents the number of successes, b the number of failures, and n = a + b is the total number of trials.

$$
\begin{aligned}
I(\theta) &= -E\left(\frac{\partial^{2}}{\partial\theta^{2}}\log f(X;\theta)\right) \\
&= -E\left(\frac{\partial^{2}}{\partial\theta^{2}}\log\left[\theta^{a}(1-\theta)^{b}\frac{(a+b)!}{a!\,b!}\right]\right) \\
&= -E\left(\frac{\partial^{2}}{\partial\theta^{2}}\left[a\log\theta + b\log(1-\theta)\right]\right) \\
&= -E\left(\frac{\partial}{\partial\theta}\left[\frac{a}{\theta} - \frac{b}{1-\theta}\right]\right) \\
&= +E\left(\frac{a}{\theta^{2}} + \frac{b}{(1-\theta)^{2}}\right) \\
&= \frac{n\theta}{\theta^{2}} + \frac{n(1-\theta)}{(1-\theta)^{2}} \\
&= \frac{n}{\theta(1-\theta)}
\end{aligned}
$$

The first line is just the definition of information; the second uses the fact that the information contained in a sufficient statistic is the same as that of the sample itself; the third expands the log term (and drops a constant); the fourth and fifth carry out the differentiation with respect to θ; the sixth replaces a and b with their expectations, nθ and n(1 − θ); and the seventh is algebraic manipulation.

The overall result, viz

$$I(\theta) = \frac{n}{\theta(1-\theta)}$$

may be seen to be in accord with what one would expect, since it is the reciprocal of the variance of the mean of the n Bernoulli random variables, θ(1 − θ)/n.
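The result of the example can be verified by evaluating the expectation of a/θ² + b/(1−θ)² exactly over the binomial distribution of the number of successes a. This is a minimal sketch; the function name is an assumption for illustration.

```python
import math


def bernoulli_information(n: int, theta: float) -> float:
    """E[a/theta^2 + b/(1-theta)^2] with a ~ Binomial(n, theta) and b = n - a."""
    total = 0.0
    for a in range(n + 1):
        b = n - a
        p = math.comb(n, a) * theta**a * (1.0 - theta) ** b
        total += p * (a / theta**2 + b / (1.0 - theta) ** 2)
    return total


n, theta = 10, 0.3
info = bernoulli_information(n, theta)
closed_form = n / (theta * (1.0 - theta))
variance_of_mean = theta * (1.0 - theta) / n

print(info, closed_form)
print("reciprocal of variance of the mean:", 1.0 / variance_of_mean)
print(math.isclose(info, closed_form))
```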

When the parameter θ is vector-valued, the information is a positive-semidefinite matrix (positive-definite in regular models), which then defines a metric on the parameter space; consequently differential geometry is applied to this topic. See Fisher information metric.
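As an illustration of the matrix case, the sketch below approximates the Fisher information matrix of a normal distribution with parameter vector (μ, σ) by integrating the outer product of the score vector against the density with the trapezoidal rule. The normal model, the integration range, and the function names are assumptions made for this example.

```python
import math


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def normal_score(x: float, mu: float, sigma: float) -> tuple[float, float]:
    """Partial derivatives of log f with respect to mu and sigma."""
    d = x - mu
    return d / sigma**2, -1.0 / sigma + d * d / sigma**3


def fisher_matrix(mu: float, sigma: float, half_width: float = 12.0,
                  steps: int = 20000) -> list[list[float]]:
    """Trapezoidal approximation of E[score score^T] over mu +/- half_width * sigma."""
    lo = mu - half_width * sigma
    h = 2.0 * half_width * sigma / steps
    mat = [[0.0, 0.0], [0.0, 0.0]]
    for k in range(steps + 1):
        x = lo + k * h
        w = h * (0.5 if k in (0, steps) else 1.0)  # trapezoid end-point weights
        p = normal_pdf(x, mu, sigma)
        s = normal_score(x, mu, sigma)
        for i in range(2):
            for j in range(2):
                mat[i][j] += w * p * s[i] * s[j]
    return mat


mat = fisher_matrix(mu=1.0, sigma=2.0)
print(mat)
det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
print("positive-definite:", mat[0][0] > 0 and det > 0)
```

For this model the matrix is approximately diagonal with entries 1/σ² and 2/σ², so it is positive-definite for every σ > 0.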

Physical information

The difference between the Fisher information in data and in the source effect that generated them is called the physical information. When the latter is mathematically extremized through choice of the system probability amplitudes, the approach is called the principle of Extreme physical information. The solution amplitudes define the physics of the source effect.
