
Information projection


In information theory, the information projection or I-projection of a probability distribution q onto a set of distributions P is

p^{*} = \arg\min_{p \in P} D_{\mathrm{KL}}(p \| q).

where D_{\mathrm{KL}} is the Kullback–Leibler divergence from q to p. Viewing the Kullback–Leibler divergence as a measure of distance, the I-projection p^{*} is the "closest" distribution to q among all the distributions in P.
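
As a concrete illustration (not from the article), the following minimal Python sketch computes an I-projection numerically on a finite alphabet, taking P to be the set of distributions with a prescribed mean. The alphabet {0, …, 5}, the uniform reference distribution q, and the target mean 1.5 are arbitrary choices made here for the example. For such a linear mean constraint the minimizer is an exponential tilt of q, and the tilt parameter can be found by bisection.

```python
import numpy as np

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) for discrete distributions."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def i_projection_mean(q, xs, target_mean, lo=-50.0, hi=50.0, iters=100):
    """I-projection of q onto P = {p : E_p[X] = target_mean} over alphabet xs.

    For this linear constraint the minimizer is an exponential tilt
    p*(x) proportional to q(x) * exp(lam * x); the mean of the tilt is
    increasing in lam, so lam can be found by bisection.
    """
    def tilted(lam):
        w = q * np.exp(lam * xs)
        p = w / w.sum()
        return p, p @ xs
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        p, m = tilted(mid)
        if m < target_mean:
            lo = mid
        else:
            hi = mid
    return p

xs = np.arange(6.0)               # alphabet {0, ..., 5} (illustrative choice)
q = np.full(6, 1.0 / 6.0)         # uniform reference distribution
p_star = i_projection_mean(q, xs, target_mean=1.5)
print("p*          =", np.round(p_star, 4))
print("mean of p*  =", p_star @ xs)     # ~1.5 by construction
print("D(p*||q)    =", kl(p_star, q))
```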

The I-projection is useful in setting up information geometry, notably because of the following inequality, valid when P is convex:

D_{\mathrm{KL}}(p \| q) \geq D_{\mathrm{KL}}(p \| p^{*}) + D_{\mathrm{KL}}(p^{*} \| q) \quad \text{for every } p \in P.

This inequality can be interpreted as an information-geometric analogue of the Pythagorean theorem, with the KL divergence playing the role of squared Euclidean distance.
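
A quick numerical check of the inequality, continuing the same toy setup (alphabet {0, …, 5}, uniform q, and P given by the mean constraint E_p[X] = 1.5, all illustrative choices). Here the I-projection is obtained with a generic constrained optimizer (SciPy's SLSQP) rather than any method from the article, and p is an arbitrary member of P; for a family defined by linear constraints the inequality is in fact known to hold with equality, so the two sides should agree up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

def kl(a, b):
    """Kullback-Leibler divergence D(a || b) for discrete distributions."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

xs = np.arange(6.0)
q = np.full(6, 1.0 / 6.0)

# I-projection p* = argmin_{p in P} D(p || q), P = {p : sum p = 1, E_p[X] = 1.5}.
cons = [{"type": "eq", "fun": lambda p: p.sum() - 1.0},
        {"type": "eq", "fun": lambda p: p @ xs - 1.5}]
res = minimize(lambda p: kl(p, q), x0=np.full(6, 1.0 / 6.0),
               bounds=[(1e-9, 1.0)] * 6, constraints=cons)
p_star = res.x / res.x.sum()

# Any other member p of P should satisfy D(p||q) >= D(p||p*) + D(p*||q).
p = np.array([0.35, 0.25, 0.15, 0.10, 0.10, 0.05])   # sums to 1, mean 1.5
lhs = kl(p, q)
rhs = kl(p, p_star) + kl(p_star, q)
print(f"D(p||q)             = {lhs:.6f}")
print(f"D(p||p*) + D(p*||q) = {rhs:.6f}")  # (nearly) equal here: linear constraint family
```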

It is worth noting that since D_{\mathrm{KL}}(p \| q) \geq 0 and is continuous in p, if P is closed and non-empty, then at least one minimizer of the optimization problem framed above exists. Furthermore, if P is convex, then the optimal distribution is unique.

The reverse I-projection, also known as the moment projection or M-projection, is

p^{*} = \arg\min_{p \in P} D_{\mathrm{KL}}(q \| p).

Since the KL divergence is not symmetric in its arguments, the I-projection and the M-projection exhibit different behavior. In the I-projection, p(x) typically under-estimates the support of q(x) and locks onto one of its modes, because the minimization forces p(x) = 0 wherever q(x) = 0 in order to keep the divergence finite. In the M-projection, p(x) typically over-estimates the support of q(x), because it must satisfy p(x) > 0 wherever q(x) > 0 in order to keep the divergence finite.
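
The following sketch (an illustration constructed here, not taken from the article or its sources) makes this behavioral difference concrete: a single discretized Gaussian is fitted to a bimodal target q by brute-force search over (mu, sigma), once minimizing D(p‖q) (I-projection) and once minimizing D(q‖p) (M-projection). The grid, the mixture, and the Gaussian family are arbitrary choices; the expected outcome is that the I-projection concentrates on one mode while the M-projection spreads out to cover both.

```python
import numpy as np

def kl(a, b):
    """Kullback-Leibler divergence D(a || b) for discrete distributions."""
    mask = a > 0
    return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

xs = np.linspace(-6, 6, 241)

def gauss(mu, sigma):
    """Gaussian density discretized and normalized on the grid xs."""
    w = np.exp(-0.5 * ((xs - mu) / sigma) ** 2)
    return w / w.sum()

# Bimodal target: equal-weight mixture of two well-separated Gaussians.
q = 0.5 * gauss(-2.0, 0.5) + 0.5 * gauss(2.0, 0.5)

# Candidate family P: single Gaussians on a coarse (mu, sigma) grid.
grid = [(mu, sigma) for mu in np.linspace(-3, 3, 61)
                    for sigma in np.linspace(0.3, 3.0, 28)]

i_proj = min(grid, key=lambda t: kl(gauss(*t), q))   # argmin_p D(p || q)
m_proj = min(grid, key=lambda t: kl(q, gauss(*t)))   # argmin_p D(q || p)

print("I-projection (mu, sigma):", i_proj)  # expected: near one mode, small sigma
print("M-projection (mu, sigma):", m_proj)  # expected: near 0, large sigma covering both modes
```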

The reverse I-projection plays a fundamental role in the construction of optimal e-variables.


The concept of information projection can be extended to arbitrary f-divergences and other divergences.



