Sanov's theorem

In mathematics and information theory, Sanov's theorem gives a bound on the probability of observing an atypical sequence of samples from a given probability distribution. In the language of large deviations theory, Sanov's theorem identifies the rate function for large deviations of the empirical measure of a sequence of i.i.d. random variables.

Let A be a set of probability distributions over an alphabet X, and let q be an arbitrary distribution over X (where q may or may not be in A). Suppose we draw n i.i.d. samples from q, represented by the vector $x^n = x_1, x_2, \ldots, x_n$. Then we have the following bound on the probability that the empirical measure $\hat{p}_{x^n}$ of the samples falls within the set A:

$$q^n(\hat{p}_{x^n} \in A) \leq (n+1)^{|X|}\, 2^{-n D_{\mathrm{KL}}(p^* \| q)},$$

where

  • $q^n$ is the joint probability distribution on $X^n$, and
  • $p^*$ is the information projection of q onto A.
  • $D_{\mathrm{KL}}(P \| Q)$, the KL divergence, is given by
$$D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}.$$

In words, the probability of drawing an atypical distribution is bounded by a function of the KL divergence from the true distribution to the atypical one; in the case that we consider a set of possible atypical distributions, there is a dominant atypical distribution, given by the information projection.
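
For a concrete example (chosen here for illustration; it is not part of the original statement), take q to be a fair coin, $q = (1/2, 1/2)$, and let A be the set of distributions with $p(\text{heads}) \geq 0.7$. The information projection of q onto A is $p^* = (0.7, 0.3)$, and
$$D_{\mathrm{KL}}(p^* \| q) = 0.7 \log_2 \frac{0.7}{0.5} + 0.3 \log_2 \frac{0.3}{0.5} \approx 0.119 \text{ bits},$$
so the probability that at least 70% of n tosses come up heads is at most $(n+1)^2\, 2^{-0.119\, n}$.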

Furthermore, if A is the closure of its interior, then

$$\lim_{n \to \infty} \frac{1}{n} \log q^n(\hat{p}_{x^n} \in A) = -D_{\mathrm{KL}}(p^* \| q).$$
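
This convergence can be checked numerically. The following minimal Python sketch (illustrative, not from the article) continues the coin example above: it computes the exact tail probability, the finite-n Sanov bound for $|X| = 2$, and the empirical exponent $-\frac{1}{n}\log_2 P$, which approaches $D_{\mathrm{KL}}(p^* \| q) \approx 0.119$ bits as n grows.

```python
from math import comb, ceil, log2

def kl_bits(p, q):
    """Bernoulli KL divergence D((p, 1-p) || (q, 1-q)) in bits."""
    return sum(a * log2(a / b) for a, b in [(p, q), (1 - p, 1 - q)] if a > 0)

q, t = 0.5, 0.7        # fair coin; event A = {p : p(heads) >= t} (illustrative)
D = kl_bits(t, q)      # D(p*||q) with p* = (0.7, 0.3); about 0.119 bits

for n in [25, 50, 100, 200, 400]:
    # exact probability that the empirical fraction of heads is >= t
    tail = sum(comb(n, k) * q**k * (1 - q)**(n - k)
               for k in range(ceil(t * n), n + 1))
    bound = (n + 1)**2 * 2**(-n * D)   # Sanov bound: (n+1)^{|X|} 2^{-n D}
    print(f"n={n:4d}  P={tail:.3e}  bound={bound:.3e}  "
          f"-log2(P)/n={-log2(tail) / n:.4f}")
```

The empirical exponent approaches 0.119 from above; the remaining gap is the polynomial prefactor, which vanishes on the exponential scale.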

Technical statement

Define:

  • $\Sigma$ is a finite set with $|\Sigma| \geq 2$, understood as the alphabet.
  • $\Delta(\Sigma)$ is the simplex of probability distributions over the alphabet, a subset of $\mathbb{R}^{\Sigma}$.
  • $L_n$ is a random variable taking values in $\Delta(\Sigma)$: draw n i.i.d. samples from the distribution $\mu$; then $L_n$ is the empirical frequency vector of the sample.
  • $\mathcal{L}_n$ is the set of values that $L_n$ can take, namely
$$\{(a_1/n, \dots, a_{|\Sigma|}/n) : \textstyle\sum_i a_i = n,\ a_i \in \mathbb{N}\}.$$

Then Sanov's theorem states:

  • For every measurable subset $S \subseteq \Delta(\Sigma)$,
$$-\inf_{\nu \in \mathrm{int}(S)} D(\nu \| \mu) \leq \liminf_n \frac{1}{n} \ln P_\mu(L_n \in S) \leq \limsup_n \frac{1}{n} \ln P_\mu(L_n \in S) \leq -\inf_{\nu \in \mathrm{cl}(S)} D(\nu \| \mu).$$
  • For every open subset $U \subseteq \Delta(\Sigma)$,
$$-\lim_n \inf_{\nu \in U \cap \mathcal{L}_n} D(\nu \| \mu) = \lim_n \frac{1}{n} \ln P_\mu(L_n \in U) = -\inf_{\nu \in U} D(\nu \| \mu).$$

Here, $\mathrm{int}(S)$ denotes the interior of $S$, and $\mathrm{cl}(S)$ its closure.
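
As a numerical sanity check, the following minimal Python sketch (illustrative; the distribution $\mu = (0.5, 0.3, 0.2)$ and the set $S = \{\nu : \nu_1 \geq 0.7\}$ are assumptions chosen for this example, not from the article) enumerates the type classes in $\mathcal{L}_n$ for a three-letter alphabet, sums the exact multinomial probability of every type in S, and compares $\frac{1}{n} \ln P_\mu(L_n \in S)$ with $-\inf_{\nu \in S \cap \mathcal{L}_n} D(\nu \| \mu)$.

```python
import numpy as np
from math import lgamma, log

def log_multinomial(n, counts):
    """Natural log of the multinomial coefficient n! / prod(counts!)."""
    return lgamma(n + 1) - sum(lgamma(c + 1) for c in counts)

def kl_nats(nu, mu):
    """KL divergence D(nu || mu) in nats."""
    return sum(a * log(a / b) for a, b in zip(nu, mu) if a > 0)

mu = (0.5, 0.3, 0.2)              # sampling distribution (illustrative)
in_S = lambda nu: nu[0] >= 0.7    # S = {nu : nu_1 >= 0.7} (illustrative)

for n in [20, 100, 500]:
    log_terms, min_kl = [], float("inf")
    # enumerate the type classes L_n = {(a/n, b/n, c/n) : a + b + c = n}
    for a in range(n + 1):
        for b in range(n - a + 1):
            c = n - a - b
            nu = (a / n, b / n, c / n)
            if not in_S(nu):
                continue
            # exact log-probability of this type class under mu^n
            log_terms.append(log_multinomial(n, (a, b, c))
                             + sum(k * log(p) for k, p in zip((a, b, c), mu)
                                   if k > 0))
            min_kl = min(min_kl, kl_nats(nu, mu))
    log_p = np.logaddexp.reduce(log_terms)    # ln P(L_n in S)
    print(f"n={n:4d}  (1/n) ln P = {log_p / n:+.4f}   -min D = {-min_kl:+.4f}")
```

Both printed columns converge to $-D(\nu^* \| \mu) \approx -0.0823$ nats, where $\nu^* = (0.7, 0.18, 0.12)$ is the minimizer of $D(\cdot \| \mu)$ over S.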

References

  • Dembo, Amir; Zeitouni, Ofer (2010). Large Deviations Techniques and Applications. Stochastic Modelling and Applied Probability, vol. 38. Springer. pp. 16–17. doi:10.1007/978-3-642-03311-7. ISBN 978-3-642-03310-0. ISSN 0172-4568.
  • Sanov, I. N. (1957). "On the probability of large deviations of random variables". Matematicheskii Sbornik 42(84), No. 1, 11–44. (In Russian; original title: «О вероятности больших отклонений случайных величин».)

