Sturges's rule

Article snapshot taken from[REDACTED] with creative commons attribution-sharealike license.
Statistical rule of thumb


Sturges's rule is a method to choose the number of bins for a histogram. Given $n$ observations, Sturges's rule suggests using

$$\hat{k} = 1 + \log_2(n)$$

bins in the histogram. This rule is widely available in data analysis software, including Python (NumPy's histogram_bin_edges accepts it as a named option) and R, where it is the default bin selection method of the hist function.
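As a quick illustration of the formula, the following is a minimal Python sketch using only the standard library. The function name `sturges_bins` is hypothetical (it is not taken from any library), and the result is rounded up to an integer, following the convention stated at the end of the derivation in this article.

```python
import math


def sturges_bins(n: int) -> int:
    """Number of histogram bins suggested by Sturges's rule, rounded up.

    Hypothetical helper for illustration; n is the number of observations.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.ceil(1 + math.log2(n))


# The bin count grows only logarithmically with the sample size.
for n in (16, 100, 1_000, 10_000):
    print(n, sturges_bins(n))
```

For 10,000 observations this gives 15 bins, which matches the figure quoted in the Criticisms section.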

Sturges's rule comes from the binomial distribution, which is used as a discrete approximation to the normal distribution. If the function to be approximated $f$ is binomially distributed then

$$f(y) = \binom{m}{y} p^{y} (1-p)^{m-y}$$

where $m$ is the number of trials, $p$ is the probability of success and $y = 0, 1, \ldots, m$. Choosing $p = 1/2$ gives

$$f(y) = \binom{m}{y} 2^{-m}$$

In this form we can consider $2^{-m}$ as the normalisation factor, and Sturges's rule is saying that the sample should result in a histogram with bin counts given by the binomial coefficients. Since the total sample size is fixed to $n$ we must have

$$n = \sum_{y} \binom{m}{y} = 2^{m}$$

using the well-known formula for the sum of the binomial coefficients. Taking base-2 logarithms of both sides gives $m = \log_2(n)$. Since the outcomes $y = 0, 1, \ldots, m$ number $m + 1$ (the count includes the outcome $y = 0$), setting $k = m + 1$ gives Sturges's rule. In general Sturges's rule does not give an integer answer, so the result is rounded up.
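The counting argument above can be checked numerically. The sketch below, in plain Python, builds the idealised bin counts (the binomial coefficients) for a few values of $m$ and confirms that they sum to $2^m$ and occupy $m + 1$ bins. The helper name `ideal_bin_counts` is hypothetical and introduced only for this illustration.

```python
import math


def ideal_bin_counts(m: int) -> list[int]:
    """Binomial coefficients C(m, y) for y = 0..m, read as ideal bin counts."""
    return [math.comb(m, y) for y in range(m + 1)]


for m in (1, 4, 10):
    counts = ideal_bin_counts(m)
    n = sum(counts)   # total sample size implied by these bin counts
    k = len(counts)   # number of bins, one per outcome y = 0..m
    assert n == 2 ** m
    assert k == 1 + math.log2(n)  # Sturges's rule, exact when n is a power of two
    print(f"m={m}: n={n}, k={k}, counts={counts}")
```

For sample sizes that are not powers of two the identity no longer holds exactly, which is why the rule is rounded up in practice.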

Doane's formula

Doane proposed modifying Sturges's formula to add extra bins when the data are skewed. Using the method of moments estimator of the skewness

$$g_1 = \frac{m_3}{m_2^{3/2}} = \frac{\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^3}{\left[\tfrac{1}{n}\sum_{i=1}^{n}(x_i - \overline{x})^2\right]^{3/2}},$$

along with its variance

$$\sigma_{g_1}^2 = \frac{6(n-2)}{(n+1)(n+3)}$$

Doane proposed adding $\log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)$ extra bins, giving Doane's formula

$$\hat{k} = 1 + \log_2(n) + \log_2\left(1 + \frac{|g_1|}{\sigma_{g_1}}\right)$$

For symmetric distributions, $|g_1| \simeq 0$ and this is equivalent to Sturges's rule. For asymmetric distributions a number of additional bins will be used.
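A minimal Python sketch of Doane's formula, written directly from the definitions of $g_1$ and $\sigma_{g_1}$ above, may help make the correction term concrete. The function name `doane_bins` is hypothetical. Rounding the result up and treating constant data as unskewed are assumptions of this sketch rather than part of the formula as stated.

```python
import math


def doane_bins(data) -> int:
    """Bin count from Doane's formula (hypothetical helper, rounded up)."""
    n = len(data)
    if n < 3:
        raise ValueError("need at least 3 observations")
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    # Assumption: constant data (m2 == 0) is treated as having zero skewness.
    g1 = 0.0 if m2 == 0 else m3 / m2 ** 1.5
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    k = 1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1)
    return math.ceil(k)


symmetric = list(range(1, 65))   # 64 evenly spaced values, zero skewness
skewed = [1] * 60 + [100] * 4    # 64 values with a heavy right tail
print(doane_bins(symmetric), doane_bins(skewed))
```

With 64 observations Sturges's rule gives 7 bins. The symmetric sample reproduces that value, while the skewed sample receives additional bins.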

Criticisms

Histogram of 10,000 samples from a Gamma(2,2) distribution. Number of bins suggested by Scott's rule is 61, Doane's rule 21, and Sturges's rule 15.

Unlike the Freedman–Diaconis rule or Scott's rule, Sturges's rule is not based on any sort of optimisation procedure. It is simply posited based on the approximation of a normal curve by a binomial distribution. Hyndman has pointed out that any multiple of the binomial coefficients would also converge to a normal distribution, so any number of bins could be obtained following the derivation above. Scott shows that Sturges's rule in general produces oversmoothed histograms, i.e. histograms with too few bins, and advises against its use in favour of other rules such as the Freedman–Diaconis rule or Scott's rule.
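The oversmoothing on skewed data can be explored with a short Python sketch that compares the three rules on 10,000 gamma-distributed samples. Several things here are assumptions of the sketch and are not stated in this article: the bin width used for Scott's rule, $3.49\,s\,n^{-1/3}$ with $s$ the sample standard deviation; the shape/scale reading of Gamma(2,2); the random seed; and the rounding up of all results. The helper names are hypothetical, and the exact Scott and Doane counts depend on the particular sample drawn.

```python
import math
import random
import statistics


def sturges_bins(n: int) -> int:
    return math.ceil(1 + math.log2(n))


def doane_bins(data) -> int:
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    g1 = 0.0 if m2 == 0 else m3 / m2 ** 1.5  # sample skewness
    sigma_g1 = math.sqrt(6 * (n - 2) / ((n + 1) * (n + 3)))
    return math.ceil(1 + math.log2(n) + math.log2(1 + abs(g1) / sigma_g1))


def scott_bins(data) -> int:
    # Assumption: Scott's bin width h = 3.49 * s * n^(-1/3), which is not
    # given in this article; the bin count is the data range divided by h.
    n = len(data)
    width = 3.49 * statistics.stdev(data) / n ** (1 / 3)
    return math.ceil((max(data) - min(data)) / width)


rng = random.Random(0)  # arbitrary seed, only for repeatability
# gammavariate(alpha, beta) takes shape alpha and scale beta; the bin
# counts below do not change if the data are rescaled.
sample = [rng.gammavariate(2.0, 2.0) for _ in range(10_000)]

print("Sturges:", sturges_bins(len(sample)))
print("Doane:  ", doane_bins(sample))
print("Scott:  ", scott_bins(sample))
```

Sturges's rule depends only on the sample size, so it always gives 15 bins here, whereas the other two rules respond to the shape and spread of the sample.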

References

  1. Sturges, H. A. (1926). "The choice of a class interval". Journal of the American Statistical Association. 21 (153): 65–66. doi:10.1080/01621459.1926.10502161. JSTOR 2965501.
  2. "Numpy.histogram_bin_edges — NumPy v2.1 Manual".
  3. "Hist function - RDocumentation".
  4. Scott, David W. (2009). "Sturges' rule". WIREs Computational Statistics. 1 (3): 303–306. doi:10.1002/wics.35. S2CID 197483064.
  5. Doane, D. P. (1976). "Aesthetic frequency classification". American Statistician. 30: 181–183.
  6. Hyndman, R. J. (1995). "The problem with Sturges' rule for constructing histograms". Monash University. 5 July 1995, pp. 1–2.