Saturday, April 18, 2009

Benford's Law

When a number is randomly chosen from a large data set, such as stock quotes, census statistics, or scientific data, what is the probability that the first digit is a 1? Excluding the possibility of a leading 0, the probability would logically seem to be 1 in 9, or about 11.1%.

If you test this hypothesis on real-world data, though, you find that the probability that the first digit is a 1 is actually about 30.1%. The probability that it is a 2 is about 17.6%, the probability that it is a 3 is about 12.5%, and the probabilities keep dropping until the probability that the leading digit is a 9 is only about 4.6%, as the following graph illustrates.



This distribution fits the rule that the probability that the first digit is d is

$p_d = \log_{10}(1 + 1/d)$
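
As a quick check of the formula, here is a minimal Python sketch that tabulates the probability for each leading digit. The function name `benford_probability` is just an illustrative choice, not something from a library.

```python
import math

def benford_probability(d: int) -> float:
    """Probability that the leading digit is d (1-9) under Benford's Law."""
    if not 1 <= d <= 9:
        raise ValueError("d must be a digit from 1 to 9")
    return math.log10(1 + 1 / d)

# Table of probabilities for the digits 1 through 9.
benford_table = {d: benford_probability(d) for d in range(1, 10)}

for d, p in benford_table.items():
    print(f"{d}: {p:.1%}")
```

The nine probabilities sum to 1, because the sum of log10((d + 1)/d) for d from 1 to 9 telescopes to log10(10).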


[Image: Cover of Henry Briggs' original table of logarithms. Source: Eric's computer museum.]

This distribution is called Benford's Law after physicist Frank Benford, who described it in 1938. Benford wasn't the first to notice the distribution, though. Astronomer and mathematician Simon Newcomb made the same discovery 57 years earlier when he noticed that the earlier pages of logarithm tables were dirtier and more worn than the later pages.

Benford tested the distribution of leading digits in more than 20,000 observations drawn from 20 data sets, including geographic data, physical properties of chemicals, baseball statistics, and street addresses. He found the same pattern repeated in all of these seemingly unrelated sets of data.

The following graph shows that the first digits of recent stock prices also closely resemble Benford's distribution.



Properties

Distributions that cover several orders of magnitude are likely to satisfy Benford's Law very closely. This is the case for sets of numbers that are allowed to grow exponentially, such as individual incomes and stock prices.

Another property of Benford's Law is that it is scale invariant. This means that the leading digits of the stock data above, for example, would still follow Benford's Law if the prices were converted to another currency, such as euros or Japanese yen.
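
Scale invariance is easy to try out numerically. The sketch below does not use the stock data from this post. It assumes a synthetic set of "prices" spread evenly on a log scale over six orders of magnitude, and the conversion rates (0.92 and 150) are made-up values chosen only for illustration.

```python
import math
import random
from collections import Counter

def leading_digit(x: float) -> int:
    """First significant digit of a positive number, read from scientific notation."""
    return int(f"{x:.15e}"[0])

def digit_frequencies(values):
    """Share of values that start with each digit 1-9."""
    counts = Counter(leading_digit(v) for v in values)
    return {d: counts[d] / len(values) for d in range(1, 10)}

rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: synthetic prices, log-uniform between 1 and 1,000,000.
prices = [10 ** rng.uniform(0, 6) for _ in range(100_000)]

# Hypothetical conversion rates, not real exchange rates.
rates = {"original ": 1.0, "rate 0.92": 0.92, "rate 150 ": 150.0}
results = {
    name: digit_frequencies([p * rate for p in prices])
    for name, rate in rates.items()
}

for name, freqs in results.items():
    print(name, " ".join(f"{freqs[d]:.3f}" for d in range(1, 10)))
print("Benford  ", " ".join(f"{math.log10(1 + 1 / d):.3f}" for d in range(1, 10)))
```

All three rows should land close to the Benford row. Multiplying by a constant only shifts the data along the log scale, which leaves the share of each leading digit essentially unchanged.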

In addition to being scale invariant, Benford's Law can be shown to be base invariant as well. If you convert the values from a set of data that adheres to Benford's Law to any other base system, the new data set will continue to adhere to the law, with one slight modification. The probability distribution of the digits of the new base can be calculated with the following adjusted formula:

$p_d = \log_b(1 + 1/d)$

where b is the new base and d takes the value of each non-zero digit of the new base (1 through b - 1).
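
A small Python sketch of the adjusted formula follows. The function name `benford_probability_base` and the choice of bases 8 and 16 in the example are illustrative assumptions.

```python
import math

def benford_probability_base(d: int, base: int) -> float:
    """Probability that the leading digit is d when numbers are written in the given base."""
    if base < 2:
        raise ValueError("base must be at least 2")
    if not 1 <= d < base:
        raise ValueError("d must be a non-zero digit of the base")
    return math.log(1 + 1 / d, base)

# Leading-digit distributions for octal and hexadecimal.
for base in (8, 16):
    probs = [benford_probability_base(d, base) for d in range(1, base)]
    print(base, " ".join(f"{p:.3f}" for p in probs))
```

In base 2 the only non-zero digit is 1, so the formula gives a probability of 1, as it must.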

Limitations

Distributions that have a built-in minimum or maximum value are unlikely to satisfy Benford's Law. For example, one might expect a set of numbers representing "small insurance claims" to obey the law. However, if "small" in this instance is defined to be between $50 and $100, then the leading digits 2 through 4 are excluded from the range by definition, and a leading 1 can occur only for a claim of exactly $100.

Distributions that cover only one or two orders of magnitude, or even fewer, are also unlikely to satisfy Benford's Law. Adult IQ scores are an example of a set of data that covers a relatively narrow range, despite having no theoretical maximum.
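
To see how badly a narrow range can miss, here is a rough simulation. It assumes IQ scores are approximately normal with a mean of 100 and a standard deviation of 15, which is the conventional scaling but is not stated in this post, and it uses simulated scores rather than real test data.

```python
import math
import random
from collections import Counter

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumption: simulated scores, normal with mean 100 and standard deviation 15.
scores = [rng.gauss(100, 15) for _ in range(100_000)]
scores = [s for s in scores if s > 0]  # guard: a leading digit needs a positive value

# Read each score's first significant digit from its scientific notation.
counts = Counter(int(f"{s:.15e}"[0]) for s in scores)
iq_freqs = {d: counts[d] / len(scores) for d in range(1, 10)}

for d in range(1, 10):
    print(f"{d}: observed {iq_freqs[d]:.3f}  Benford {math.log10(1 + 1 / d):.3f}")
```

Roughly half of the simulated scores are 100 or higher and so start with a 1, while the digits 2 through 5 almost never appear. That is nowhere near the Benford distribution.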

Explanation

A simple explanation of Benford's Law can be found in inflationary price increases. The price of an item that starts out at a cost of $1.00 with a steady 3% rate of annual inflation will have a leading digit of 1 for 24 years, until the price reaches $2.03 in the 25th year. The leading digit will be 2 for the following 14 years, 3 for the next 9 years, 4 for 8 years, 5 for 6 years...and 9 for only 3 years before the price reaches $10.03 in the 79th year. After that, the leading digit will be a 1 again for another 24 years.
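
The year counts above can be reproduced with a short loop. This sketch uses the same assumptions as the example (a $1.00 starting price and 3% annual inflation). The function name and its parameters are illustrative.

```python
from collections import Counter

def years_per_leading_digit(start_price: float = 1.00, rate: float = 0.03, years: int = 78) -> Counter:
    """Count how many years the price spends with each leading digit."""
    counts = Counter()
    price = start_price
    for _ in range(years):
        # Read the first significant digit from scientific notation.
        counts[int(f"{price:.15e}"[0])] += 1
        price *= 1 + rate  # apply one year of inflation
    return counts

years_by_digit = years_per_leading_digit()
for d in range(1, 10):
    share = years_by_digit[d] / 78
    print(f"{d}: {years_by_digit[d]:2d} years ({share:.1%})")
```

The loop covers the 78 years before the price first passes $10. It also fills in the digits skipped in the text: 6 and 7 each last 5 years, and 8 lasts 4. The leading digit is 1 in 24 of the 78 years, or about 30.8%, close to the 30.1% that Benford's Law predicts.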

When you take into account the fact that inflation affects a wide range of consumer goods, you can see why, if you take a sample of all of the prices in your local grocery store at one specific moment in time, the prices as a group adhere to Benford's Law. Every item in the store has been going through its own exponential price increase over time, so the probability of a randomly selected item's price having a leading digit of 1 at the particular moment that you sample those prices will be roughly 30.1%.

Applications

Benford's Law may seem like a mere mathematical curiosity, but it has some surprisingly practical applications. Based on the assumption that people who attempt to falsify data will tend to distribute digits uniformly, Benford's Law can be used to expose potential fraud in accounting, insurance claims, and tax forms. Other uses, such as analyzing clinical trial data and election results, have also been proposed.
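
One simple way to screen a set of figures is to compare its leading digits with Benford's distribution using a chi-square statistic. The sketch below is only one possible approach, not the method used by any particular auditor or tool. Both data sets are made up for illustration, and the 5% critical value for 8 degrees of freedom (15.507) is hard-coded rather than computed.

```python
import math
from collections import Counter

# Chi-square critical value for 8 degrees of freedom at the 5% level.
CHI2_CRITICAL_8DF_5PCT = 15.507

def benford_chi_square(values) -> float:
    """Chi-square statistic of the observed leading digits against Benford's Law."""
    # Zeros have no leading digit, so they are skipped.
    digits = [int(f"{abs(v):.15e}"[0]) for v in values if v != 0]
    counts = Counter(digits)
    n = len(digits)
    stat = 0.0
    for d in range(1, 10):
        expected = n * math.log10(1 + 1 / d)
        stat += (counts[d] - expected) ** 2 / expected
    return stat

# Made-up data: evenly spaced on a log scale, so it follows Benford's Law closely.
benford_like = [10 ** (k / 1000) for k in range(1000)]
# Made-up data: every leading digit appears equally often, as a naive forger might write.
uniform_digits = [d * 100 + 50 for d in range(1, 10) for _ in range(100)]

for name, data in (("benford-like", benford_like), ("uniform digits", uniform_digits)):
    stat = benford_chi_square(data)
    verdict = "worth a closer look" if stat > CHI2_CRITICAL_8DF_5PCT else "consistent with Benford"
    print(f"{name}: chi-square = {stat:.1f} ({verdict})")
```

A large statistic is only a flag for further review, not proof of fraud. As the Limitations section explains, plenty of honest data sets do not follow Benford's Law in the first place.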

For more information on applications of Benford's Law, see "I've Got Your Number: How a mathematical phenomenon can help CPAs uncover fraud and other irregularities."

8 comments:

John said...

Benford's law is amazing. If I hadn't first seen it in a reputable source (Knuth's TAOCP vol 2) I would have said it can't be true. Even once I accepted that it's legit, I thought I understood it before I actually did. Not sure I really understand it now.

William Shields said...

There are two interesting things about this post:

1. This post. Never actually heard this before. Great explanation too; and

2. You and I (cletus@stackoverflow) have chosen nearly exactly the same Blogger template (www.cforcoding.com). Is there some law for that? :)

Bill the Lizard said...

William (cletus),
1. Thanks for the praise for this post. This is one of my own favorites because it's something I heard about years ago, thought was cool, then learned a lot more about while researching the post.

2. There must be some probabilistic explanation for it. :) I keep thinking the layout needs a few tweaks, but I'm just too lazy to do it.

Mark Nigrini said...

I enjoyed reading your post. I am the author of "Benford's Law: Applications for Forensic Accounting, Auditing, and Fraud Detection" (Wiley, 2012). My book starts off with a few chapters on the underlying maths (including scale invariance and base invariance as is described in the summary above). I then continue with many examples ranging from census data to tax evasion. Stock prices (analyzed above) are not a good example for Benford's Law because the stock price is really the total market capitalization divided by the number of shares outstanding. However, daily returns (e.g., going from 20 to 21 is a daily return of 5.00 percent which has a first digit 5) do conform to Benford and the book talks about how we might be able to detect Ponzi schemes by looking at the digits of their purported returns. The companion website http://www.nigrini.com/benfordslaw.htm has free Excel templates, data sets, photos, and other interesting items relating to Benford's Law. Enjoy.

Jeremy Lynesw said...

You give a partial explanation for the law, but one involving the imposition of a logarithmic factor: the inflation rate. Some websites say that the reason for Benford's Law is still elusive. Is this true?

Cecilia said...

I would like to know if we can use the picture of LOGARITHMICA for a math textbook for high school students.

thank you in advance

Cecilia Blanco

My email is ceblanco@clb.santillana.com

Cecilia said...

I would like to know if we can use the picture of LOGARITHMICA for a math textbook for high school students.

thank you in advance

Cecilia Blanco

mi email is cecibla@gmail.com

Bill the Lizard said...

Cecilia,

It looks like that image was removed from the site that I originally downloaded it from (and linked to). It would probably be better if you got permission from someone who owns the rights to an image of the text.