Suppose you look at the latest census for wherever you live and count how often each digit from 1 to 9 appears as the leading digit of the recorded population figures. You would expect the counts to be roughly equal, with each digit occurring about 1/9 (roughly 11%) of the time, no?
WRONG.
That is the principle behind Benford’s Law. It states that 1 occurs as the leading digit of a number around 30% of the time, whereas 9 appears only around 5% of the time.
In particular, it gives the probability $P(d)$ that a randomly chosen number starts with the leading digit $d$, where $d \in \{1, 2, \ldots, 9\}$: $$ P(d)=\log_{10}(1+\frac{1}{d}) $$
$d$ | $\log_{10}(1+\frac{1}{d})$ | $P(d)$ |
---|---|---|
$1$ | $\log_{10}(1+\frac{1}{1})$ | $30.1$% |
$2$ | $\log_{10}(1+\frac{1}{2})$ | $17.6$% |
$3$ | $\log_{10}(1+\frac{1}{3})$ | $12.5$% |
$4$ | $\log_{10}(1+\frac{1}{4})$ | $9.7$% |
$5$ | $\log_{10}(1+\frac{1}{5})$ | $7.9$% |
$6$ | $\log_{10}(1+\frac{1}{6})$ | $6.7$% |
$7$ | $\log_{10}(1+\frac{1}{7})$ | $5.8$% |
$8$ | $\log_{10}(1+\frac{1}{8})$ | $5.1$% |
$9$ | $\log_{10}(1+\frac{1}{9})$ | $4.6$% |
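The table can be reproduced in a few lines of Python. This is only a minimal sketch of the formula above; the function name `benford_probability` is made up for this illustration.

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the leading decimal digit equals d (1 to 9)."""
    if not 1 <= d <= 9:
        raise ValueError("d must be between 1 and 9")
    return math.log10(1 + 1 / d)

# Print one row per digit, formatted as a percentage with one decimal place.
for d in range(1, 10):
    print(f"{d}: {benford_probability(d):.1%}")
```

Since the nine probabilities telescope to $\log_{10}(10)$, they add up to 1, as any probability distribution should.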
A number $x$ has leading digit $d$ if and only if $\log_{10}(d) \leqslant \{ \log_{10}(x) \} < \log_{10}(d+1)$.
Note: The curly braces $\{\}$ denote the fractional part, $\{x\} = x-\lfloor x \rfloor$, that is, whatever remains of $x$ once its integer part is removed.
For example, for the number $x=231$, we have $\log_{10}(231) \approx 2.36$, so $\{\log_{10}(231)\} \approx 0.36$, and the leading digit $d$ equals $2$ because $$ \log_{10}(2) \leqslant \{ \log_{10}(231) \} < \log_{10}(3) $$ $$0.30 \leqslant 0.36 < 0.48$$
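The same check can be sketched in Python. The helper name `leading_digit` is an assumption made for this example, and because it relies on floating-point logarithms it may misjudge values that sit exactly on a boundary (such as exact multiples of a power of ten), so treat it as an illustration of the inequality rather than a robust routine.

```python
import math

def leading_digit(x: float) -> int:
    """Leading decimal digit of a positive number, via the fractional part of log10(x)."""
    if x <= 0:
        raise ValueError("x must be positive")
    log_x = math.log10(x)
    frac = log_x - math.floor(log_x)  # {log10(x)}, normally in [0, 1)
    # frac >= log10(1) = 0 always holds, so the first d with
    # frac < log10(d + 1) satisfies log10(d) <= frac < log10(d + 1).
    for d in range(1, 10):
        if frac < math.log10(d + 1):
            return d
    return 9  # only reachable if rounding pushes frac up to 1.0

print(leading_digit(231))     # the worked example above
print(leading_digit(0.0047))  # also works for numbers below 1
```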
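In other words, $\{\log_{10}(x)\}$ lies in an interval of length $\log_{10}(d+1)-\log_{10}(d)$, or $\log_{10}(\frac{d+1}{d}) = \log_{10}(1+\frac{1}{d})$. If the fractional part $\{\log_{10}(x)\}$ is spread uniformly over $[0, 1)$, the probability of landing in that interval equals its length, which is exactly $P(d)$.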
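So far, we have been working with the decimal system, and thus our logarithm is in base 10. In fact, Benford’s law can be generalised to other bases $b$, where $$ P(d)=\log_{b}(1+\frac{1}{d}) $$ and $d$ now ranges over the digits $1, 2, \ldots, b-1$. For example, in the hexadecimal base $b = 16$, the leading digit $1$ occurs with probability $\log_{16}(2) = 25\%$.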
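As a rough sketch of the base-$b$ formula, the Python snippet below lists the probabilities of the fifteen possible leading digits in hexadecimal. The function name `benford_probability_base` is made up for this illustration.

```python
import math

def benford_probability_base(d: int, base: int = 10) -> float:
    """P(d) = log_base(1 + 1/d) for a leading digit d in 1 .. base - 1."""
    if base < 2 or not 1 <= d < base:
        raise ValueError("need base >= 2 and 1 <= d < base")
    return math.log(1 + 1 / d, base)

# Leading hexadecimal digits 1 .. F
for d in range(1, 16):
    print(f"{d:X}: {benford_probability_base(d, 16):.1%}")
```

With `base=10` the function reduces to the decimal formula used in the rest of the article.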
In fact, the formula is not limited to a single leading digit: for any positive integer $n$, $\log_{10}(1+\frac{1}{n})$ gives the probability that a decimal number starts with the string of digits that make up $n$.
For example, the probability of a number starting with $12 \ldots$ is $\log_{10}(1+\frac{1}{12}) \approx 3.5\%$. Likewise, the probability of a number starting with $999$ is $\log_{10}(1+\frac{1}{999}) \approx 0.043\%$.
Furthermore, the above result can be extended to give the probability that a particular digit occurs in the second position. For example, the probability that a $4$ appears as the second digit is given by: $$\log_{10}(1+\frac{1}{14}) + \log_{10}(1+\frac{1}{24}) + \cdots + \log_{10}(1+\frac{1}{94}) \approx 10.03\%$$ Here, we are summing the probabilities of a number beginning with $14 \ldots$, $24 \ldots$, $34 \ldots$, and so on up to $94 \ldots$.
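Both generalisations are easy to check numerically. The sketch below uses two hypothetical helper names, `prefix_probability` and `second_digit_probability`, chosen only for this example.

```python
import math

def prefix_probability(n: int) -> float:
    """Probability that a number's decimal digits start with the digits of n."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return math.log10(1 + 1 / n)

def second_digit_probability(s: int) -> float:
    """Probability that the second digit equals s (0 to 9)."""
    if not 0 <= s <= 9:
        raise ValueError("s must be between 0 and 9")
    # Sum over every two-digit prefix 1s, 2s, ..., 9s.
    return sum(prefix_probability(10 * first + s) for first in range(1, 10))

print(f"starts with 12:    {prefix_probability(12):.2%}")
print(f"starts with 999:   {prefix_probability(999):.4%}")
print(f"second digit is 4: {second_digit_probability(4):.2%}")
```

Note that the second digit can also be $0$, so its distribution runs over ten digits rather than nine, and it is much flatter than the first-digit distribution.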
Some distributions that may not follow the Benford distribution are:
Numbers that represent magnitudes of events, such as populations of cities, flows of water in rivers or sizes of celestial bodies, usually follow the Benford distribution.
Since humans tend to think in terms of uniform or, at best, normal distributions, it is difficult to construct by hand a data set that satisfies Benford’s law. Even if fraudsters deliberately manipulate the first digits of their data, the generalised form of the law above is used in practice to check several leading digits at once, which increases precision. Scientists can therefore flag potentially fraudulent numerical data sets by checking how closely they follow Benford’s law.
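One simple way to put a number on "how closely" is Pearson's chi-square statistic, which compares the observed first-digit counts with the counts Benford's law predicts. The sketch below is an assumption about how such a check might look, not the method used in any particular study: the function names are invented, it only looks at the first digit, and the threshold of 15.507 is the standard 5% critical value of the chi-square distribution with 8 degrees of freedom. A large statistic is a reason to look more closely, not proof of fraud.

```python
import math
from collections import Counter

CRITICAL_5_PERCENT_8_DOF = 15.507  # chi-square critical value, 8 degrees of freedom

def first_digit(value: float) -> int:
    """First significant decimal digit of a non-zero number."""
    # Scientific notation puts the first significant digit first, e.g. '2.31...e+02'.
    return int(f"{abs(value):.16e}"[0])

def benford_chi_square(values) -> float:
    """Pearson chi-square statistic of observed first digits against Benford's law."""
    digits = [first_digit(v) for v in values if v != 0]
    n = len(digits)
    if n == 0:
        raise ValueError("need at least one non-zero value")
    counts = Counter(digits)
    statistic = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        statistic += (counts[d] - expected) ** 2 / expected
    return statistic

# Illustrative inputs only: evenly spread three-digit numbers versus powers of 2.
evenly_spread = range(100, 1000)
powers_of_two = [2 ** k for k in range(200)]
for name, data in [("evenly spread", evenly_spread), ("powers of 2", powers_of_two)]:
    stat = benford_chi_square(data)
    verdict = "suspicious" if stat > CRITICAL_5_PERCENT_8_DOF else "consistent with Benford"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```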
In 2011, scientists analysed 130 different values, such as total debt, government cash reserves and pensions of retired civil servants, from 11 years’ worth of official macroeconomic data reported to Eurostat by the EU Member States. Using Benford’s law, they found that the Greek government’s data deviated significantly from the Benford distribution, particularly in the year 2000, one year before Greece officially joined the eurozone. This suggests that the data may have been manipulated, and that, had Benford’s law been applied at the time, Greece might not have been admitted to the eurozone. Some economists have argued that this simple check could have helped avert the Greek debt crisis.
Electricity bills, loan data, street addresses, election results, stock prices, purchase orders, house prices, population numbers, death rates, lengths of rivers, research data (e.g. regression models), credit card transactions, areas of countries: the sky is the limit! Check the spreadsheet below to see how Benford’s law holds for the population of US counties and the number of workers employed in different jobs in the US.
*All credits go to Professor Karl Schmedders from the University of Zurich who shared this spreadsheet online on his free “Intuitive Introduction to Probability” course on Coursera.
Interestingly, the powers of 2 also follow the Benford distribution:
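A quick way to see this for yourself is to tally the leading digits of the first thousand powers of 2 and set them next to Benford's predictions. This is a minimal sketch; the cut-off of 1000 powers is an arbitrary choice.

```python
import math
from collections import Counter

N = 1000  # number of powers to inspect (an arbitrary choice)

# Python integers are exact, so str() always starts with the true leading digit.
counts = Counter(int(str(2 ** k)[0]) for k in range(N))

print("digit  observed  Benford")
for d in range(1, 10):
    print(f"{d:>5}  {counts[d] / N:>8.1%}  {math.log10(1 + 1 / d):>7.1%}")
```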
In addition, the Fibonacci sequence (1, 1, 2, 3, 5, 8…), the factorials ($1!, 2!, 3!, 4!, \ldots = 1, 2, 6, 24, \ldots$) as well as the powers of nearly any number obey the law. The square roots and reciprocals of successive whole numbers, on the other hand, do not.
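These claims can be explored empirically as well. The sketch below measures, for each sequence, the largest gap between the observed share of any leading digit and Benford's prediction. All function names and the sample sizes are assumptions made for this example, and a finite sample can only suggest, not prove, how a sequence behaves in the long run.

```python
import math
from collections import Counter
from itertools import islice

def fibonacci():
    """Yield 1, 1, 2, 3, 5, 8, ..."""
    a, b = 1, 1
    while True:
        yield a
        a, b = b, a + b

def factorials():
    """Yield 1!, 2!, 3!, ..."""
    value, k = 1, 1
    while True:
        yield value
        k += 1
        value *= k

def max_gap_to_benford(numbers) -> float:
    """Largest absolute gap between observed and Benford leading-digit shares."""
    numbers = list(numbers)
    counts = Counter(int(str(n)[0]) for n in numbers)
    return max(abs(counts[d] / len(numbers) - math.log10(1 + 1 / d))
               for d in range(1, 10))

N = 1000
fib_gap = max_gap_to_benford(islice(fibonacci(), N))
fact_gap = max_gap_to_benford(islice(factorials(), N))
# For n >= 1, floor(sqrt(n)) has the same leading digit as sqrt(n).
sqrt_gap = max_gap_to_benford(math.isqrt(n) for n in range(1, 10 * N + 1))

print(f"Fibonacci:    {fib_gap:.3f}")
print(f"Factorials:   {fact_gap:.3f}")
print(f"Square roots: {sqrt_gap:.3f}")
```

A small gap means the sequence tracks Benford's law closely over the sample; the square roots should stand out with a much larger gap than the other two.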
The proofs for these findings are beyond the scope of this article, but feel free to look them up yourself!
Here is a glossary of the key terms used in this article. You may want to review some of them after you complete your comprehension check.
Word | Definition |
---|---|
Uniform Distribution | Where all outcomes are equally likely. For example, most people would intuitively think that the frequency of the leading digits 1-9 should be about the same, or 11.11%. |
Normal Distribution | Where data near the mean/average occurs more frequently than data far from it (think of a bell-shaped curve). For example, most people are about 160 cm to 180 cm tall, and very few are shorter than 140 cm or taller than 200 cm. |