Drilling Through Data
by Stephen Baker
Houghton Mifflin; 244 pp.; $26
The world is buried in data, great banks and drifts of the stuff. In recent years a new technology has emerged: computer programs that will drill through it all to pick out hidden patterns and trends — information that may be useful to marketers, politicians, employers, doctors, match-makers, or national-security analysts. Such programs are extraordinarily sophisticated, and their creators need to be very clever indeed. A doctorate in math or computer science is pretty much required. Stephen Baker calls such whizzes the Numerati. Using "data mining," they seek out veins of useful ore in the mountains of facts that computers accumulate every day.
In The Numerati, Mr. Baker offers a highly readable and fascinating account of the number-driven world we now live in. He shows us, for instance, how political consultants, mining databases that track consumer and "lifestyle" preferences, sort us into tribes by behavioral proxy. Cat owner? Likely Democrat. NRA member? Probably Republican. Mailings and phone calls can then be targeted more accurately. Health professionals, especially when treating older patients, are now monitoring such things as weight, body temperature and pulse by having a computer follow data streams from sensors on clothing or even from sensor-laden "magic carpets" laid around the house. Disturbing patterns prompt the computer to signal a problem. The Numerati are taking over dating services, too. How do you find that special one in a million? By mining the data of the million. How do you improve your own chances of being found? By the same techniques that companies use to show up first in a Google inquiry — "search engine optimization," now a flourishing industry.
The Numerati are even mining the output of bloggers, those stream-of-consciousness online diarists and self-promoters. "What makes the blog world especially valuable to marketers," Mr. Baker writes, is "its unfiltered immediacy." What do consumers think of your new product? What desires are still not satisfied by products of this kind? You can commission a poll or wait for the sales figures to come in … or you can read the blogs. Better yet, you can hire Numerati to write programs that will read them for you, since there are now more than 20 million blogs in the U.S. alone.
There is active advertising to be done on blogs, too. If you read these things, or write one, you know that Google's AdSense service will automatically place context-related ads on a blog page, splitting the click-measured revenue with the blogger. So far, so good. But AdSense has set in motion an ugly arms race online as robot bloggers (clever computer programs) have generated hundreds of thousands of spam blogs, or "splogs."
A splog, though unreadable, is seeded with words that will attract Google ads. A computer user may be annoyed at finding himself staring at a screen full of gibberish but click on an ad anyway, allowing the robot blogger to harvest revenue. This sleight of hand has the Numerati hard at work getting their software to distinguish between a blog and a splog. Mr. Baker gives a helpful sketch of the math involved, in which each blog is reduced to a vector in a space of several dozen dimensions.
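For readers who want to see what "reduced to a vector" might look like in practice, here is a minimal sketch in Python. It is not Mr. Baker's method or anyone's production system: the three features (vocabulary variety, dominance of the most repeated word, share of link tokens), the nearest-centroid rule, and the sample posts are all assumptions chosen for illustration, and a real classifier would use the several dozen dimensions the book mentions.

```python
import math
from collections import Counter


def features(text):
    """Turn a post into a small numeric vector.

    The three features are hypothetical stand-ins for the dozens
    a real system would use.
    """
    words = text.lower().split()
    n = len(words) or 1  # avoid dividing by zero on empty text
    counts = Counter(words)
    unique_ratio = len(counts) / n                    # vocabulary variety
    top_share = max(counts.values(), default=0) / n   # most repeated word
    link_share = sum(w.startswith("http") for w in words) / n
    return [unique_ratio, top_share, link_share]


def centroid(vectors):
    """Average a list of equal-length vectors, column by column."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def classify(text, blog_examples, splog_examples):
    """Label a post by whichever group's average vector lies closer."""
    v = features(text)
    blog_center = centroid([features(t) for t in blog_examples])
    splog_center = centroid([features(t) for t in splog_examples])
    if math.dist(v, splog_center) < math.dist(v, blog_center):
        return "splog"
    return "blog"


# Made-up training posts, for illustration only.
BLOGS = [
    "we took the kids to the lake today and the water was freezing",
    "my new phone has a great camera but the battery dies by noon",
]
SPLOGS = [
    "cheap loans cheap loans mortgage cheap loans http://x.example cheap loans",
    "insurance quotes insurance quotes insurance http://y.example insurance quotes",
]

if __name__ == "__main__":
    print(classify("casino bonus casino bonus casino http://z.example casino bonus",
                   BLOGS, SPLOGS))
    print(classify("our garden finally produced tomatoes after a long wet spring",
                   BLOGS, SPLOGS))
```

The point of the design is only that, once every page is a list of numbers, "Is this a splog?" becomes a question about distance between points.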
In Mr. Baker's chapter on terrorism, we meet Numerati who seek traces of the abnormal and unexpected in their data sets and who must then try to identify the individual "subjects of interest" who are generating those traces. The task of matching abnormal data to actual individuals, though, presents problems — their names, for example. Researching a book about math once, I turned up 32 different Latin-alphabet spellings of the Russian name "Chebyshev." Arabic, Indian, Chinese and African names present especially daunting challenges. Mr. Baker quotes a Numeratus, a Ph.D. in computational linguistics, who has researched the electronic recognition of names for more than 20 years: "Untangling global names," he says, "will continue to confound us for generations."
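The name problem can be made concrete with a short Python sketch. Everything in it is an assumption made for illustration: the handful of respelling rules, the 0.75 threshold, and the sample spellings (a few commonly seen variants, not the 32 mentioned above). It is not the approach of the linguist Mr. Baker quotes. The idea is to rewrite each name into a rough canonical key and, where the keys still differ, fall back on a fuzzy string comparison from the standard library's difflib.

```python
from difflib import SequenceMatcher

# Hypothetical respelling rules, applied in order. They cover only a
# few French and German transliteration habits.
RULES = [
    ("tsch", "ch"),
    ("tch", "ch"),
    ("sch", "sh"),
    ("ff", "v"),
    ("w", "v"),
    ("i", "y"),
]


def key(name):
    """Reduce a name to a rough canonical spelling."""
    s = name.lower()
    for old, new in RULES:
        s = s.replace(old, new)
    return s


def similarity(a, b):
    """Return 1.0 when the keys agree, otherwise a fuzzy ratio in [0, 1]."""
    ka, kb = key(a), key(b)
    if ka == kb:
        return 1.0
    return SequenceMatcher(None, ka, kb).ratio()


def likely_same(a, b, threshold=0.75):
    """Guess whether two spellings name the same person.

    The threshold is an arbitrary illustrative choice.
    """
    return similarity(a, b) >= threshold


if __name__ == "__main__":
    for variant in ["Tchebysheff", "Tschebyscheff", "Tschebyschew", "Tchebichef"]:
        print(variant, key(variant), round(similarity(variant, "Chebyshev"), 3))
```

Three of the four variants collapse to the same key as "Chebyshev"; "Tchebichef" does not, and has to be caught by the fuzzy comparison. That gap, multiplied across scripts and languages, is a small taste of why the problem resists a tidy solution.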
To make things worse, terrorists themselves are data-savvy and skillful exploiters of the Internet. "Hundreds of Dutch Web Sites Hacked by Islamic Hackers" reads the headline on a technical news site I was just reading. Jihadists may want to take us back to the seventh century, but they are willing to detour through the 21st to get us there. It doesn't help that our National Security Agency, the proper home of anti-terrorist Numerati, is restricted to hiring U.S. citizens and paying civil-service salaries while its competitors in recruitment (Yahoo, Google, IBM Research) can cast their net worldwide and engage in bidding wars for top talent.
So the Numerati follow the electronic trails that we all now leave behind us as we work, shop, travel, date, trade, or fall sick: What then of our privacy? What if the NSA, having scrutinized my data and determined that I am not a terrorist, sees that I may be cheating on my taxes? Or that I am running for public office while subscribing to a pornography service? Mr. Baker cites Jeff Jonas, a security Numeratus who got his start working for casinos (places also keen to spot "subjects of interest"). "We technologists," Mr. Jonas warns, "had better spend a little more time thinking about what we're creating." Mr. Baker acknowledges that privacy is a problem — we are, after all, the raw material of data mining. Are we also its beneficiaries? He offers a qualified "yes."