by Dr. Peter Wayne
In this article we use Alpha Five to explore a counterintuitive fact about numbers.
Frank Benford, an engineer who worked for the General Electric Company in the 1930s, remarked that the first pages of books of logarithms - corresponding to numbers starting with the numeral 1 - were worn more than the later pages. He was not the first to make this observation, but he was the first to show systematically that numbers beginning with "1" make up a disproportionate share of all sorts of collections of numbers. Indeed, Mr. Benford cataloged over 20,000 numbers drawn from some 20 different sources - drainage areas of rivers, baseball statistics, stock ticker prices, addresses - and found that the frequency of leading numerals followed a logarithmic rule, in which the frequency of the leading digit "x" is the base-10 logarithm of (1+1/x). By what has come to be known as Benford's Law, in a typical series of real-world numbers, log(1+1/1) or 30.1% will begin with "1", log(1+1/2) or 17.6% with "2", and so on down to log(1+1/9) or 4.6% with "9".
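The percentages quoted above can be checked directly. What follows is a minimal sketch in Python rather than Xbasic, so that it can be run outside Alpha Five; the function name `benford_expected` is illustrative only and does not come from any library.

```python
import math

def benford_expected(digit):
    """Expected share of numbers whose leading digit is `digit` (1 through 9)."""
    return math.log10(1 + 1 / digit)

# Print the expected share for each possible leading digit.
for d in range(1, 10):
    print(d, f"{benford_expected(d):.1%}")

# The nine shares add up to 1, because the product of (d+1)/d for d = 1..9 is 10.
print(round(sum(benford_expected(d) for d in range(1, 10)), 10))
```

The first two lines printed should match the 30.1% and 17.6% figures given in the text.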
But wait - isn't there something wrong with this? In an unselected series of numbers, shouldn't one-ninth (about 11.1%) begin with "1", one-ninth with "2", and one-ninth with each of the remaining digits? If the numbers were truly random, then certainly they would be equally distributed among all the digits - but real world numbers are not randomly distributed. For example, think of addresses: streets come in all sizes. My office address, for example, is 469 North Broadway. If my office were the last building on the street and every number from 1 to 469 were in use, then 111 addresses would begin with "1" (1, 10-19, 100-199), 111 with "2", 111 with "3", but only 81 with "4" and 11 each with "5", "6", "7", "8", and "9". If we consider many different blocks, we realize that on a street numbered upward from 1 there can never be more addresses beginning with "2" than with "1", and that in general the lower digits will always be at least as common as the higher digits.
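The address arithmetic is easy to confirm by brute force. This is a small Python sketch (not Xbasic), assuming that every number from 1 up to the last address is in use; the function name `leading_digit_counts` is made up for the illustration.

```python
from collections import Counter

def leading_digit_counts(last_address):
    """Count the addresses 1..last_address by their first digit.

    Returns a list of nine counts, for leading digits 1 through 9.
    """
    counts = Counter(int(str(n)[0]) for n in range(1, last_address + 1))
    return [counts[d] for d in range(1, 10)]

print(leading_digit_counts(469))
# [111, 111, 111, 81, 11, 11, 11, 11, 11]
```

Trying other street lengths, for example `leading_digit_counts(150)` or `leading_digit_counts(2500)`, shows the same pattern: a lower digit is never outnumbered by a higher one.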
Although this distribution is understandable with addresses, what about numbers that do not necessarily come in orderly sequences, such as checkbook transactions or stock ticker prices? Surprisingly, the same rules apply. We'll use Alpha Five to experiment with some data files.
I created a table in which to store the results of the numerical analysis. Here is the structure of benford.dbf:
Figure 1. Structure of benford.dbf
This global script, "benford analysis", will analyze the first digit of any character or numeric field:
dim fn as c
dim fld as c
dim nf as n
dim t as p
dim digit as n
dim digitstring as c
dim nrecs as n
dim result as p
dim i as n
dim digits[9] as n
for i=1 to 9
digits[i]=0
next
dim count as n
count=0
dir_put(a5.get_path())
fn=ui_get_file("Table to read","Tables(*.dbf)","","X")
if fn="" then
end
end if
t=table.open(fn,file_ro_shared)
nf=t.fields_get()
dim fieldlist.name[nf] as c
dim fieldlist.type[nf] as c
for i=1 to nf
fieldlist.name[i]=t.field_get(i).name_get()
fieldlist.type[i]=t.field_get(i).type_get()
next
fld=ui_get_list_array("Choose field to analyze",1,"fieldlist.name")
nrecs=t.records_get()
t.fetch_first()
while .not. t.fetch_eof()
fld_type=t.field_get(fld).type_get()
select
case fld_type="N"
digitstring=left(alltrim(str(100*eval("t."+fld),10,0)),1)
digit=val(digitstring)
case fld_type="C"
digit=val(left(eval("t."+fld),1))
case else
ui_msg_box("Invalid field type for analysis",\
"Use only numeric or character fields, please!")
exit while
end select
if digit>0 then
digits[digit]=digits[digit]+1
count=count+1
end if
statusbar.percent(count,nrecs)
t.fetch_next()
end while
t.close()
dir_put(a5.get_path())
found=file.exists("benford.dbf")
if found then
result=table.open("benford",file_rw_exclusive)
result.zap(.t.)
result.close()
else
table.create_begin("digit","N",5,2)
table.field_add("observed","N",8,2)
table.field_add("expected","N",8,2)
table.create_end("benford")
file_add_to_db("benford.dbf")
end if
result=table.open("benford.dbf")
for i=1 to 9
result.enter_begin()
result.digit=i
result.observed=digits[i]
result.expected=count*(log(1+1/i)/log(10))
result.enter_end(.t.)
next
result.close()
end
Script 1. Benford Analysis of table. Alpha Five's "log" function is a natural log, not a base-10 log. To convert the values to base-10 logs I divided by the natural log of 10. That's your high school math!
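The change of base used in the script can be verified in a few lines. This sketch is in Python, whose `math.log` is also a natural log, so the expression has the same shape as the script's `log(1+1/i)/log(10)`; the helper name `log10_via_natural_log` is hypothetical.

```python
import math

def log10_via_natural_log(x):
    """Base-10 logarithm computed as ln(x) / ln(10)."""
    return math.log(x) / math.log(10)

# Compare against Python's built-in base-10 log for each leading digit.
for d in range(1, 10):
    converted = log10_via_natural_log(1 + 1 / d)
    direct = math.log10(1 + 1 / d)
    print(d, round(converted, 4), math.isclose(converted, direct))
```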
I ran this script and selected a table:
Figure 2. Choosing a table to analyze.
My business checkbook transactions are in the table wdddetail.dbf. I then chose to analyze the "amount" numeric field:
Figure 3. Choosing a field to analyze.
Here are the results of the analysis:
Figure 4. Leading digits in the "amount" field of wdddetail.
Notice that there is a close, although not perfect, agreement between the observed distribution of leading digits and the expected values.
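Script 1 multiplies a numeric field by 100 before taking its first character, so that an amount under one dollar, such as 0.45, still contributes its first nonzero digit. The same idea can be sketched in Python as follows. This is not a translation of the script: the function name `leading_digit` is illustrative, and unlike Script 1 this version takes the absolute value, so negative amounts are counted too.

```python
def leading_digit(value):
    """First nonzero digit of an amount rounded to cents, or 0 if there is none."""
    # Format to two decimals, then scan past any zeros and the decimal point.
    for ch in f"{abs(value):.2f}":
        if ch in "123456789":
            return int(ch)
    return 0

for amount in (3441.88, 0.45, -17.25, 0):
    print(amount, leading_digit(amount))
```

A return value of 0 plays the same role as `digit=0` in the script: the record is simply left out of the tally.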
I then ran the analysis on the "address_1" field of my patient table:
Figure 5. Benford analysis of my patient addresses.
I decided to run an analysis on a selection of stock transactions exported from Microsoft Money®. Money exports data in "Quicken® Interchange Format" or .qif files, which are text files with a repeating structure like this:
D9/20/99
T3,441.88
NSell
YOxford Health Plans
I17.25
Q200
O8.12
^
Part of a .qif file
For the purpose of my analysis, I discarded the lines with dates and then looked at any remaining line with a digit in it and took the first nonzero digit for analysis. In this single Money transaction, there are 4 numeric fields: the "T", "I", "Q", and "O" lines. The code to perform a Benford analysis on the .qif file is:
''XBasic
dim digits[9] as n
for i=1 to 9
digits[i]=0
next
dim count as n
count=0
fp=file.open("c:\my documents\my investments.qif",file_ro_shared)
txt=fp.read_line()
while .not. fp.eof()
digit=0
posn=1
if left(txt,1)<>"D" then
' it's not a date field, process it
while posn<=len(txt)
if between(substr(txt,posn,1),"1","9")=.t. then
digit=val(substr(txt,posn,1))
exit while
end if
posn=posn+1
end while
if digit>0 then
digits[digit]=digits[digit]+1
count=count+1
end if
statusbar.robot()
end if
txt=fp.read_line()
end while
fp.close()
dir_put(a5.get_path())
found=file.exists("benford.dbf")
if found then
result=table.open("benford",file_rw_exclusive)
result.zap(.t.)
result.close()
else
table.create_begin("digit","N",5,2)
table.field_add("observed","N",8,2)
table.field_add("expected","N",8,2)
table.create_end("benford")
file_add_to_db("benford.dbf")
end if
result=table.open("benford.dbf")
for i=1 to 9
result.enter_begin()
result.digit=i
result.observed=digits[i]
result.expected=count*(log(1+1/i)/log(10))
result.enter_end(.t.)
next
result.close()
end
Script 2. Benford analysis of a .qif file.
The results of running this script are:
Figure 6. Benford analysis of a .qif file.
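The line-filtering rule described for .qif files (skip date lines, then take the first nonzero digit of every other line) can also be tried outside Alpha Five. This is a Python sketch of that rule, not a translation of Script 2; the function name `qif_leading_digits` is hypothetical, and the input is the single sample transaction shown earlier.

```python
from collections import Counter

def qif_leading_digits(lines):
    """Tally the first nonzero digit of each non-date line."""
    counts = Counter()
    for line in lines:
        if line.startswith("D"):
            continue  # date line, skipped
        for ch in line:
            if ch in "123456789":
                counts[int(ch)] += 1
                break  # only the first nonzero digit on a line counts
    return counts

record = ["D9/20/99", "T3,441.88", "NSell", "YOxford Health Plans",
          "I17.25", "Q200", "O8.12", "^"]
print(sorted(qif_leading_digits(record).items()))
# [(1, 1), (2, 1), (3, 1), (8, 1)]
```

The four digits counted come from the "T", "I", "Q", and "O" lines; the lines without digits contribute nothing.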
Finally, I looked at a large file of stock quotations for a mix of individual stocks and mutual funds. I generated a report from Microsoft Money and wrote the report as a text file to disk. A section of the report looks like this:
NewKidCo International
11/10/99  2 1/4
11/9/99  2.093
11/8/99  2.093
11/5/99  2.156
11/2/99  2.062
10/29/99  2 1/4
10/22/99  2.094
10/21/99  2 1/8
10/20/99  2 1/8
Recent ticker price history for NewKidCo International.
The code to read in and analyze this file is slightly different from the code in Script 2, but generally quite similar:
''XBasic
dim digits[9] as n
for i=1 to 9
digits[i]=0
next
dim count as n
count=0
fp=file.open("c:\my documents\my investments.txt",file_ro_shared)
txt=fp.read_line()
while .not. fp.eof()
select
case txt=""
'
case between(left(txt,1),"A","z")
'
case else
quotation=substr(txt,at(chr(9),txt)+1,at(chr(9),txt,2)-at(chr(9),txt)-1)
if val(quotation)=0 then
while len(quotation)>1 .and. val(quotation)=0
quotation=right(quotation,len(quotation)-1)
end while
end if
if at(chr(32),quotation)>0 then
quoteval=val(quotation)
else
quoteval=eval(remspecial(quotation))
end if
digit=val(left(ltrim(str(quoteval)),1))
if digit>0 then
digits[digit]=digits[digit]+1
count=count+1
end if
end select
statusbar.robot()
txt=fp.read_line()
end while
fp.close()
dir_put(a5.get_path())
found=file.exists("benford.dbf")
if found then
result=table.open("benford",file_rw_exclusive)
result.zap(.t.)
result.close()
else
table.create_begin("digit","N",5,2)
table.field_add("observed","N",8,2)
table.field_add("expected","N",8,2)
table.create_end("benford")
file_add_to_db("benford.dbf")
end if
result=table.open("benford.dbf")
for i=1 to 9
result.enter_begin()
result.digit=i
result.observed=digits[i]
result.expected=count*(log(1+1/i)/log(10))
result.enter_end(.t.)
next
result.close()
end
Script 3. Benford analysis of an exported Money report file.
The result of running this script produces yet another example of how Benford's Law works:
Figure 7. Benford analysis of stock ticker prices.
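The quotations in the report mix two notations, decimals such as 2.093 and fractions such as 2 1/4. One way to reduce both to a leading digit is sketched below in Python. It is an independent illustration under that assumption about the price format, not a translation of Script 3, and the names `parse_quote` and `leading_digit` are made up for the example.

```python
from fractions import Fraction

def parse_quote(text):
    """Convert a quotation such as '2 1/4', '1/8', or '2.093' to a float."""
    # Each space-separated part is a whole number, a decimal, or a fraction.
    return float(sum(Fraction(part) for part in text.split()))

def leading_digit(value):
    """First nonzero digit of a positive price, or 0 if there is none."""
    for ch in f"{value:.4f}":
        if ch in "123456789":
            return int(ch)
    return 0

for quote in ("2 1/4", "2.093", "1/8"):
    price = parse_quote(quote)
    print(quote, price, leading_digit(price))
```

Note that a price below one dollar, such as 1/8, is counted by its first nonzero digit after the decimal point.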
The power of Benford's Law is not limited to the first digit of a number. You can equally apply the analysis to the first 2 digits of a number, e.g., "10", "11", "12" will appear more often than "90", "91", and "92".
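For the first two digits the same logarithmic rule applies, with the two-digit number n (10 through 99) taking the place of the single digit. A short Python sketch, with the hypothetical function name `benford_first_two`, makes the comparison concrete:

```python
import math

def benford_first_two(n):
    """Expected share of numbers whose first two digits are n (10 through 99)."""
    return math.log10(1 + 1 / n)

# Compare the low end of the range with the high end.
for n in (10, 11, 12, 90, 91, 92):
    print(n, f"{benford_first_two(n):.2%}")
```

Numbers beginning with "10" are expected about 4.1% of the time, against roughly 0.5% for numbers beginning with "90".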
Remember that Benford's distribution would not apply to data that would be expected to follow a bell-shaped curve, such as the heights of 4th grade students in a class.
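A quick simulation shows why. The Python sketch below draws hypothetical heights in inches from a bell-shaped curve; the mean of 54 and standard deviation of 2.5 are assumptions chosen purely for illustration, not measurements.

```python
import random
from collections import Counter

random.seed(1999)  # fixed seed so the run is repeatable

# Simulated, made-up class data: heights in inches, mean 54, std. dev. 2.5.
heights = [random.gauss(54, 2.5) for _ in range(1000)]

# Tally the first digit of each simulated height.
counts = Counter(int(str(h)[0]) for h in heights)
print(sorted(counts.items()))
```

Because the values cluster within a narrow range, nearly all of them begin with "5", and none begin with "1".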
I wrote these little routines to analyze tables and files because I was surprised and fascinated by this 60-year-old insight of Frank Benford's. Benford's realization also has a practical use. In any selection of financial data, large deviations from a Benford distribution deserve closer scrutiny. There may be a good reason for the deviation - for example, in analyzing payment receipts for my practice, there is an excess of transactions beginning with 8, 1, and 5, because copayments for some large insurers are $8, $10, and $5. But an unexplained deviation from Benford's distribution can be a powerful signal of fraudulent transactions. Few white collar thieves are sufficiently versed in statistics to generate fictitious transactions that adhere to Benford's Law!
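One simple way to screen for such deviations is to compare each observed count with its Benford expectation and list the digits that are clearly over-represented. The Python sketch below does this under stated assumptions: the function name `flag_digits`, the 1.5 threshold, and the sample counts are all invented for illustration, and the check only looks for excesses, not shortfalls.

```python
import math

def flag_digits(observed, threshold=1.5):
    """Return the leading digits whose count exceeds threshold x the Benford expectation.

    `observed` is a list of nine counts, for leading digits 1 through 9.
    """
    total = sum(observed)
    flagged = []
    for digit, count in enumerate(observed, start=1):
        expected = total * math.log10(1 + 1 / digit)
        if count > threshold * expected:
            flagged.append(digit)
    return flagged

# Made-up counts for digits 1-9, with a deliberate excess of 8s.
sample = [300, 170, 120, 95, 80, 65, 55, 150, 45]
print(flag_digits(sample))
# [8]
```

A flagged digit is only a prompt to look closer; as with the copayments mentioned above, the explanation may be entirely innocent.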
11/12/99