This collection, edited by Toby Segaran and Jeff Hammerbacher and published by O’Reilly, contains 20 recent papers on the theme of helping data tell its own story through an array of data management, data mining, data analysis, data visualization, and data reconciliation techniques. I’ll start by reviewing what I thought were the standouts, and then summarize the rest.
“Cloud storage design in a PNUTShell” gives a concise description of the distributed database architecture deployed at Yahoo! for storing and updating user data. The system must scale horizontally to handle the vast number of transactions and replicate consistently across geographic locations. The article includes a very interesting discussion of the trade-off between availability and consistency, and compares the Yahoo! design with that of other systems, such as Amazon’s Dynamo and Google’s BigTable.
“Information Platforms and the Rise of the Data Scientist” provides some interesting reflections on data warehousing and business intelligence in the context of Facebook. The team there used Hadoop to build a system called Lexicon that processes terabytes of wall posts every day, counting words and phrases. Again, the emphasis is on scale, and the ability to apply relatively simple algorithms to massive amounts of data.
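To give a feel for how simple the underlying algorithm can be, here is a minimal single-process sketch of counting words and phrases in the map/reduce style. It is not Lexicon's code and does not use Hadoop; the function names, the two-word phrase limit, and the sample posts are all invented for illustration.

```python
from collections import Counter
from itertools import chain
import re


def ngrams(post, n):
    """Return all n-word phrases in a post, lowercased."""
    tokens = re.findall(r"[a-z']+", post.lower())
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def map_post(post, max_n=2):
    """Map step: emit a (phrase, 1) pair for every word and short phrase."""
    return [(gram, 1) for n in range(1, max_n + 1) for gram in ngrams(post, n)]


def reduce_counts(pairs):
    """Reduce step: sum the counts emitted for each phrase."""
    totals = Counter()
    for phrase, count in pairs:
        totals[phrase] += count
    return totals


# Hypothetical sample data standing in for a day's wall posts.
POSTS = ["Happy birthday", "happy new year", "Happy birthday to you"]

counts = reduce_counts(chain.from_iterable(map_post(p) for p in POSTS))
print(counts.most_common(3))
```

In a real cluster the map and reduce steps would run in parallel over partitions of the data; the point here is only that the per-record logic stays this small.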
“Data Finds Data” takes us beyond search to a realm in which relationships between key data items are discovered in real-time and immediately relayed to the right users. “Portable Data in Real Time” addresses some of the issues inherent in sharing data between applications, e.g., the Flickr/Friendfeed example, where user behavior needs to propagate between different social media sites. “Surfacing the Deep Web” describes work at Google to make data behind Web forms amenable to search by automatic query generation.
“Natural Language Corpus Data” describes some simple experiments with a trillion-word content set created by Google and now available through the Linguistic Data Consortium. Simple Python programs are used to build a language model of the corpus, i.e., a probability distribution over the words and phrases occurring in the documents. This model can then be applied to common problems like spell correction and spam detection.
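The idea of a language model as a probability distribution over words can be sketched in a few lines. The example below is my own toy illustration, not the chapter's code: it builds a unigram model from a tiny made-up corpus and uses it to pick the most probable known word within one edit of a misspelling. The names `train`, `edits1`, `correct`, and `TOY_CORPUS` are assumptions for this sketch.

```python
from collections import Counter
import re

ALPHABET = "abcdefghijklmnopqrstuvwxyz"


def words(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(text):
    """Unigram model: map each word to its relative frequency in the text."""
    counts = Counter(words(text))
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def edits1(word):
    """All strings one delete, transpose, replace, or insert away from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + replaces + inserts)


def correct(word, model):
    """Return word if known, else the most probable known word one edit away."""
    if word in model:
        return word
    candidates = [w for w in edits1(word) if w in model]
    return max(candidates, key=model.get) if candidates else word


# Hypothetical toy corpus; a real model would be trained on far more text.
TOY_CORPUS = "the quick brown fox jumps over the lazy dog the dog barks"
model = train(TOY_CORPUS)

print(correct("teh", model))
print(correct("dgo", model))
```

With a corpus of realistic size, the same frequency table does most of the work: the more text the counts come from, the better the choice among candidate corrections.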
Other papers focus on particular domains, such as the structure of DNA, image processing on Mars, geographic and demographic data, and drug discovery, and many of these are interesting. The first two and last two chapters are more generic, covering a range of issues such as statistics, visualization, entity resolution, and user interface design.
Altogether, this is a very timely and useful collection of papers. As one might expect, the book contains many URLs for further exploration, and some pointers into the literature. If you care about data, this will make a great addition to your bookshelf, alongside Bill Tancer’s somewhat lightweight but nonetheless engaging “Click” and Ian Ayres’ entertaining and provocative “Super Crunchers”.