Monday, February 7, 2011

Strata Conference on Big Data: Talking Points

Two presentations caught my attention at this two-day meeting.

Benjamin Black of fast_ip described attempts to build a huge store to collect, index and query trillions of records involving multi-dimensional data. A typical application might be running analytics over huge amounts of sensor network output, where the events of interest have many attributes. The challenge is to manage both write performance during data loading and the number of key fetches needed to generate query results.

Speedy response to multi-dimensional queries is really an OLAP (Online Analytical Processing) problem, which has been well studied in the literature. The trick is to precompute many of the results in a hypercube that materializes the most important data relationships. Such an approach finally enabled Black's team to perform most queries quickly within the database, instead of dragging data, kicking and screaming, to the computational engine.
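To make the hypercube idea concrete, here is a minimal sketch in Python. It is not Black's actual design, just an illustration of the general OLAP trick: the event layout, the dimension names and the `build_cube` and `query` helpers are all made up for this example. Each incoming event updates one precomputed cell for every combination of its attributes, so loading costs more writes, but an aggregate query becomes a single key fetch.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical sensor events: (region, sensor_type, hour, reading).
EVENTS = [
    ("east", "temp", 9, 20.0),
    ("east", "temp", 10, 22.0),
    ("west", "temp", 9, 18.0),
    ("west", "humidity", 9, 55.0),
]
DIMENSIONS = ("region", "sensor_type", "hour")


def build_cube(events, dimensions=DIMENSIONS):
    """Precompute [count, sum] for every subset of dimension values."""
    cube = defaultdict(lambda: [0, 0.0])
    indexes = range(len(dimensions))
    for *attrs, value in events:
        # Each event touches 2**len(dimensions) cells: the write-side cost.
        for size in range(len(dimensions) + 1):
            for subset in combinations(indexes, size):
                key = tuple((dimensions[i], attrs[i]) for i in subset)
                cell = cube[key]
                cell[0] += 1
                cell[1] += value
    return cube


def query(cube, **filters):
    """Answer an aggregate query with one key lookup: (count, sum)."""
    key = tuple((d, filters[d]) for d in DIMENSIONS if d in filters)
    count, total = cube.get(key, (0, 0.0))
    return count, total


cube = build_cube(EVENTS)
print(query(cube, region="east"))          # all readings from the east
print(query(cube, region="west", hour=9))  # west readings at hour 9
```

With three dimensions every event fans out to eight cells, which is exactly the trade-off between write performance and key fetches mentioned above. A real system would presumably materialize only the combinations that matter most rather than all of them.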

Werner Vogels (CTO, Amazon) defined data as ‘big’ when you have to innovate to collect, store, organize, analyze and share it. Certainly, the Amazons, Googles, Yahoos, and Facebooks of this world were forced to invent their own solutions to the problems posed by their burgeoning businesses. Conventional database vendors, such as Oracle and IBM, were in no position to support the volumes or velocities of true Web scale.

Vogels’ talk focused primarily upon Amazon Web Services and Elastic MapReduce, which affords any business the ability to run big data workloads on Hadoop. Bringing data to the cloud allows for faster and more flexible processing than most businesses could achieve on their own. Apparently FedExing disks is as good a way as any of delivering the data, and better than sending it down the wires!
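A quick back-of-the-envelope calculation shows why shipping disks can win. The figures below are my own assumptions, not numbers from the talk: a 10 TB data set, a 100 Mbit/s uplink running at 80% efficiency, and a 24-hour courier delivery.

```python
def transfer_hours(data_bytes, link_bits_per_second, efficiency=0.8):
    """Hours needed to push data over a link at the given efficiency."""
    seconds = data_bytes * 8 / (link_bits_per_second * efficiency)
    return seconds / 3600


# Assumed figures for illustration only.
DATA_BYTES = 10e12      # 10 TB
LINK_BPS = 100e6        # 100 Mbit/s uplink
COURIER_HOURS = 24      # overnight delivery of the disks

network_hours = transfer_hours(DATA_BYTES, LINK_BPS)
print(f"network: {network_hours:.0f} h, courier: {COURIER_HOURS} h")
```

Under these assumptions the upload takes well over a week, while the box of disks arrives the next day. The balance shifts as links get faster or data sets get smaller.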

Clients such as Best Buy, Yelp and Etsy run between 100 gigabytes and a terabyte or so of behavioral data through AWS every day. As Amazon’s Jinesh Varia pointed out in a later talk, this is way more cost-effective than buying your own servers and SAN storage. In addition to loading raw data, you can index the aggregated records in parallel, so that execs can query the resultant database and geeks can be constantly running experiments.