Misplaced Pages

Good–Turing frequency estimation: Difference between revisions

Article snapshot taken from Wikipedia with creative commons attribution-sharealike license. Give it a read and then ask your questions in the chat. We can research this topic together.
Browse history interactively← Previous editNext edit →Content deleted Content addedVisualWikitext
Revision as of 09:41, 31 January 2013 editOrenBochman (talk | contribs)Extended confirmed users12,714 edits The method← Previous edit Revision as of 22:53, 19 February 2013 edit undoSrchvrs (talk | contribs)139 edits Fixed an ambiguity.Next edit →
Line 40: Line 40:
:<math>p_0 = \frac{N_1}{N}. </math> :<math>p_0 = \frac{N_1}{N}. </math>


The next step is to find an estimate of probability for species which were seen ''r'' times. This estimate is The next step is to find an estimate of probability for species which were seen ''r'' times. For a '''single''' species this estimate is:
:<math>p_r = \frac{(r+1) S(N_{r+1})}{N S(N_r)}.</math> :<math>p_r = \frac{(r+1) S(N_{r+1})}{N S(N_r)}.</math>
To estimate a probability of encountering any species from this group (i.e., the group of species seen ''r'' times) one can use the following formula:
:<math>\frac{(r+1) S(N_{r+1})}{N}.</math>
Here, the notation <math>S( )</math> means the smoothed or adjusted value of the frequency shown in parenthesis (see also ]). An overview of how to perform this smoothing follows. Here, the notation <math>S( )</math> means the smoothed or adjusted value of the frequency shown in parenthesis (see also ]). An overview of how to perform this smoothing follows.



Revision as of 22:53, 19 February 2013

Good–Turing frequency estimation is a statistical technique for predicting the probability of occurrence of objects belonging to an unknown number of species, given past observations of such objects and their species. (In drawing balls from an urn, the 'objects' would be balls and the 'species' would be the distinct colors of the balls (finite but unknown in number). After drawing R red {\displaystyle R_{\text{red}}} red balls, R black {\displaystyle R_{\text{black}}} black balls and R green {\displaystyle R_{\text{green}}} green balls, we would ask what is the probability of drawing a red ball, a black ball, a green ball or one of a previously unseen color.)

Historical background

Good–Turing frequency estimation was developed by Alan Turing and his assistant I. J. Good as part of their efforts at Bletchley Park to crack German ciphers for the Enigma machine during World War II. Turing at first modeled the frequencies as a multinomial distribution, but found it inaccurate. Good developed smoothing algorithms to improve the estimator's accuracy.

The discovery was recognized as significant when published by Good in 1953, but the calculations were difficult so it was not used as widely as it might have been. The method even gained some literary fame due to the Robert Harris novel Enigma.

In the 1990s, Geoffrey Sampson worked with William A. Gale of AT&T, to create and implement a simplified and easier-to-use variant of the Good–Turing method described below.

The method

First notation and some required data structures are defined:

  • Assuming that X distinct species have been observed, numbered x = 1, ..., X.
  • Then the frequency vector, R ¯ {\displaystyle {\bar {R}}} , has elements R x {\displaystyle R_{x}} that give the number of individuals that have been observed for species x.
  • The frequency of frequencies vector, ( N r ) r = 0 , 1 , {\displaystyle (N_{r})_{r=0,1,\ldots }} , shows how many times the frequency r occurs in the vector R; i.e. among the elements R x {\displaystyle R_{x}} .
N r = | { x | R x = r } | {\displaystyle N_{r}=|\{x|R_{x}=r\}|}

For example N 1 {\displaystyle N_{1}} is the number of species for which only 1 individual was observed. Note that the total number of objects observed, N, can be found from

N = r = 1 r N r . {\displaystyle N=\sum _{r=1}^{\infty }rN_{r}.}

The first step in the calculation is to find an estimate of the total probability of unseen species. This estimate is

p 0 = N 1 N . {\displaystyle p_{0}={\frac {N_{1}}{N}}.}

The next step is to find an estimate of probability for species which were seen r times. For a single species this estimate is:

p r = ( r + 1 ) S ( N r + 1 ) N S ( N r ) . {\displaystyle p_{r}={\frac {(r+1)S(N_{r+1})}{NS(N_{r})}}.}

To estimate a probability of encountering any species from this group (i.e., the group of species seen r times) one can use the following formula:

( r + 1 ) S ( N r + 1 ) N . {\displaystyle {\frac {(r+1)S(N_{r+1})}{N}}.}

Here, the notation S ( ) {\displaystyle S()} means the smoothed or adjusted value of the frequency shown in parenthesis (see also empirical Bayes method). An overview of how to perform this smoothing follows.

We would like to make a plot of log N r {\displaystyle \log N_{r}} versus log r {\displaystyle \log r} but this is problematic because for large r many N r {\displaystyle N_{r}} will be zero. Instead a revised quantity, log Z r {\displaystyle \log Z_{r}} , is plotted versus log r {\displaystyle \log r} , where Zr is defined as

Z r = N r 0.5 ( t q ) , {\displaystyle Z_{r}={\frac {N_{r}}{0.5(t-q)}},}

and where q, r and t are consecutive subscripts having N q , N r , N t {\displaystyle N_{q},N_{r},N_{t}} non-zero. When r is 1, take q to be 0. When r is the last non-zero frequency, take t to be 2r − q.

The assumption of Good-Turing estimation is that the number of occurrence for each species follows a binomial distribution .

A simple linear regression is then fitted to the log–log plot. For small values of r it is reasonable to set S ( N r ) = N r {\displaystyle S(N_{r})=N_{r}} (that is, no smoothing is performed), while for large values of r, values of S ( N r ) {\displaystyle S(N_{r})} are read off the regression line. An automatic procedure (not described here) can be used to specify at what point the switch from no smoothing to linear smoothing should take place. Code for the method is available in the public domain.

See also

References

  1. Good, I.J. (1953). "The population frequencies of species and the estimation of population parameters". Biometrika. 40 (3–4): 237–264. doi:10.1093/biomet/40.3-4.237. JSTOR 2333344. MR 0061330.
  2. Newsise: Scientists Explain and Improve Upon 'Enigmatic' Probability Formula, a popular review of Orlitsky A, Santhanam NP, Zhang J. (2003). "Always Good Turing: asymptotically optimal probability estimation". Science (New York, N.Y.). 302 (5644): 427–31. doi:10.1126/science.1088284.{{cite journal}}: CS1 maint: multiple names: authors list (link)
  3. Sampson, Geoffrey and Gale, William A. (1995) Good‐turing frequency estimation without tears
  4. Gale, William A. (1995). "Good–Turing smoothing without tears". Journal of Quantitative Linguistics. 2: 3. doi:10.1080/09296179508590051.
  5. Lecture 11: The Good-Turing Estimate. CS 6740, Cornell University, 2010
  6. Sampson, Geoffrey (2005) Simple Good–Turing Frequency Estimator (code in C)
Categories: