Tuesday, October 27, 2009

Book Review: Wired for Innovation

“Wired for Innovation: How Information Technology is Reshaping the Economy” by Erik Brynjolfsson & Adam Saunders, MIT Press, 2010.

The main argument of the book is that the so-called “productivity paradox” of the period 1970-1995, whereby the adoption of IT failed to boost per capita productivity growth in the US, is now seen to be due to a lack of concomitant investment in organizational capital. In other words, people tended to “computerize” existing tasks in a brain-dead way, without investing in new business processes, employee training, and the like. The literature shows that the period 1995-2008 saw far greater gains, but only for those companies that combined new technology with a range of complementary practices around job redesign, openness of information, and empowerment of workers.

These “complementarities” are not just the “best practices” beloved of management consultants of every stripe, but aspects of a company’s culture, including things like profit sharing, flexible work hours and reorganization of workflows to really capitalize on new technology, so that even where there are fewer jobs, these jobs are more interesting than before. These practices are only effective if they work together to reinforce each other, both incenting workers and helping them innovate. Studies show that productivity programs such as TQMS don’t produce results otherwise.

A secondary, related topic is that GDP tends to underestimate the value generated by IT innovations. The argument is that traditional measures of inputs and outputs used by the government do not capture “consumer surplus”, i.e., the net benefit that consumers derive from a product or service after you subtract the amount paid. E.g., various studies claim that eBay alone generates several billion a year in consumer surplus, and Amazon’s used book sales generate tens of millions in surplus, given aggressively low pricing. Another study found that the government’s 10-year regulatory delay in allowing cell phone usage cost consumers about $100 billion in lost surpluses!
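To make the definition concrete, here is a minimal sketch of the arithmetic: surplus is each buyer's willingness to pay minus the price, summed over the people who actually buy. The valuations and price are invented for illustration and are not figures from the book or the studies it cites.

```python
def consumer_surplus(willingness_to_pay, price):
    """Sum of (value - price) over buyers who actually purchase.

    A buyer purchases only if their willingness to pay is at least the price.
    """
    return sum(w - price for w in willingness_to_pay if w >= price)


# Hypothetical valuations (in dollars) of five shoppers for one used book.
valuations = [30, 22, 15, 12, 8]
price = 10

surplus = consumer_surplus(valuations, price)
revenue = price * sum(1 for w in valuations if w >= price)

print(f"revenue that shows up in sales figures: ${revenue}")
print(f"consumer surplus that does not:         ${surplus}")
```

In this toy case the four buyers hand over $40 but walk away with a further $39 of value that no sales-based statistic records, which is the gap the authors are pointing at.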

Chapter 6, on ‘Incentives to Innovate in the Information Economy’, is especially worth reading for an economist’s view of our world. The authors characterize information as a ‘non-rival good’, i.e., if I consume a piece of information, that doesn’t prevent anyone else from consuming it (unlike a piece of cake, say). They also argue that people are reluctant to pay full price for information goods, since they don’t know how useful they will be until they consume them (by which time they have already paid). Bundling is advocated, because it gets you away from the problem of pricing individual pieces of information in a situation where the marginal cost is close to zero (so that formulas like marginal cost plus a markup don’t work).
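One common textbook way to see why bundling helps (a sketch under my own assumptions, not necessarily the authors' example) is to take two readers whose tastes differ and compare the best single price for each item with the best single price for the bundle. The reader names, sections and dollar valuations below are all hypothetical, and marginal cost is assumed to be zero.

```python
def best_revenue(valuations):
    """Best revenue from one posted price when marginal cost is zero.

    Only prices equal to some buyer's valuation need to be checked.
    """
    return max(p * sum(1 for v in valuations if v >= p) for p in set(valuations))


# Hypothetical willingness to pay for two sections of an online publication.
buyers = {
    "reader_a": {"news": 8, "sports": 3},
    "reader_b": {"news": 3, "sports": 8},
}
goods = ("news", "sports")

# Price each section separately, then price the two together.
separate = sum(best_revenue([b[g] for b in buyers.values()]) for g in goods)
bundle = best_revenue([sum(b[g] for g in goods) for b in buyers.values()])

print(f"best revenue selling separately: ${separate}")
print(f"best revenue selling the bundle: ${bundle}")
```

Sold separately, the best the seller can do here is $16; as a bundle both readers value the package at $11, so one price captures $22. The bundle works because valuations for the package vary less across readers than valuations for the individual pieces.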

Another consequence of information being a non-rival good is ‘knowledge spillovers’, which typically mean that the private return for innovation will be less than the return for society as a whole. (Think about cell phones, and the value they create for society compared to the returns for cell phone companies.) The authors argue that this kind of disparity leads to a ‘chronic underinvestment in R&D’ by the private sector. Even within a single corporation, like Thomson Reuters, you can see this play out. Individual business units that invest in R&D often feel that other businesses that benefit from the ‘spillover’ are freeloaders who have not taken the risk or borne the cost of R&D, but nonetheless derived its benefits. This is why companies like 3M try to take a broader view of R&D cost-benefits across the organization as a whole.

Finally, the authors address the issues of price dispersion and low-cost copies of information goods, and how disruptive they really are. With respect to pricing, studies show that Amazon retains market share, even though it doesn’t actually have the lowest online book prices, thanks to its customer experience and reputation. (This doesn’t bode very well for Walmart in its current price war with Amazon over holiday sales.) With respect to low-cost copies, the authors argue that high availability can create demand for publishers’ wares under some circumstances. For example, lending libraries created readership and stimulated book sales rather than depressing them; more recently, VCRs created an insatiable demand for video.

At 150 pages, this is a slim, dense book that contains many insights into the forces behind innovation. Unlike many of the volumes that one finds in airport bookstores, the facts and claims are backed by citations to recent research. Although the main theme is really the relationship between innovation, IT and productivity, there are many interesting reflections upon leadership, change, knowledge, value and the distribution of wealth. It’s not a light read, because it doesn’t shrink from complexity, but it’s not turgid either. The authors are from MIT’s Sloan School and U Penn’s Wharton School, so you are getting an up-to-date view of the very latest thinking at these august institutions.

Thursday, October 8, 2009

Book Review: Free

Chris Anderson of “Long Tail” fame has done it again with his latest book, “Free”. I reviewed his NetGain talk earlier in this blog, so I won’t go through all the basics again. His concept is that, in a highly competitive online market, price falls to the marginal cost of adding a new user, which is close to zero.

He distinguishes between three different models of “free”: the cross-subsidy (basically a come-on, like the free sample or the loss leader), the two-sided market (one customer subsidizes the other, e.g., advertisers subsidizing magazine readers), and the “freemium” model (upgrade users from a free version to a more capable version of your product, or a range of ancillary experiences).

His historical preamble is both amusing and informative, explaining the origins of common phrases such as "free lunch" and "jumping on the bandwagon." There are sidebars containing detailed examples of companies that play in the free space to a non-trivial extent. His comments on the dilemma currently facing the media industry are unsparing and incisive.

"Free" is a good read, being both very clear and non-redundant. Anderson has had a long and successful career in quality publishing, as well as being a well-known blogger. For my money, he is up there with Nicholas Carr and Malcolm Gladwell as a really insightful writer on technology, business and social trends. At about 250 pages, this is a book you can read in a small number of sittings, and then refer back to, as needed.

His message can be summarized as follows: a version of your current digital assets will one day be abundantly available free of charge, so you have to keep moving up the value chain. Furthermore, it often makes sense to give away some part of your asset base to some part of your potential market to gain attention, reputation and goodwill. You can then successfully monetize other, higher-value opportunities arising from all this buzz, possibly by charging just a fraction of your users.

He cites many interesting examples, e.g., Skype giving away computer-to-computer calls, but charging for computer-to-phone calls, Second Life giving away virtual tourism but charging for virtual land, etc. My main problem is that they are mostly drawn from the mass-market, consumer space, not the professional space.

Even in the consumer space, experiments with free are ambiguous. For example, local newspapers have had a mixed experience with free, as Anderson acknowledges. There is the whole issue of how readers (and more to the point advertisers) value free publications, especially those that were once sold at a price. He also acknowledges that free has a habit of turning billion dollar businesses (classified ads in newspapers) into million dollar businesses (like Craigslist). Scale seems to be everything, as well as getting your sums right.

Perhaps the most interesting passages for me concerned Google: its vast economies of scale; its use of free for various purposes (goodwill, data gathering, ad delivery); and the concern of execs like Eric Schmidt that publishing companies may suffer to an extent that they are no longer able to generate quality content to be searched. Not every aspect of the Internet economy is a zero-sum game, and Google probably knows that the sooner the media industry finds new business models to maintain its creative output the better for all concerned.

The book ends with a handy guide to the rules, tactics and business models that seem to work in the free economy. Appropriately enough, free versions of the book are available, e.g., in an abridged form as an audiobook. (For a while, an advance copy was also free on the Web as a PDF.)

Thursday, September 10, 2009

Book Review: Beautiful Data

This collection, edited by Toby Segaran and Jeff Hammerbacher and published by O’Reilly, contains 20 recent papers on the theme of helping data tell its own story - via an array of data management, data mining, data analysis, data visualization, and data reconciliation techniques. I’ll start by reviewing what I thought were the standouts, and then summarize the rest.

“Cloud storage design in a PNUTShell” gives a concise description of the distributed database architecture deployed at Yahoo! for storing and updating user data. The system must scale horizontally to handle the vast number of transactions and replicate consistently across geographic locations. This article contains a very interesting discussion of the trade-off between availability and consistency, and compares their design with that of other systems, such as Amazon’s Dynamo and Google’s BigTable.

“Information Platforms and the Rise of the Data Scientist” provides some interesting reflections on data warehousing and business intelligence in the context of Facebook. They used Hadoop to build a system called Lexicon that processes terabytes of wall posts every day, counting words and phrases. Again, the emphasis is on scale, and the ability to apply relatively simple algorithms to massive amounts of data.
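The counting step really is that simple. Here is a toy, single-machine sketch of the map and reduce stages in Python; it is not Facebook's code, the sample posts are invented, and the function names are my own. Hadoop's contribution is to run the same two stages across many machines.

```python
from collections import Counter
from itertools import chain


def map_post(post):
    """Map step: emit every word and every two-word phrase in one post."""
    words = post.lower().split()
    bigrams = [" ".join(pair) for pair in zip(words, words[1:])]
    return words + bigrams


def reduce_counts(emitted):
    """Reduce step: total the occurrences of each emitted key."""
    return Counter(emitted)


# Hypothetical wall posts standing in for the real daily feed.
posts = ["happy birthday to you", "happy new year", "happy birthday"]

counts = reduce_counts(chain.from_iterable(map_post(p) for p in posts))
print(counts.most_common(3))
```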

“Data Finds Data” takes us beyond search to a realm in which relationships between key data items are discovered in real time and immediately relayed to the right users. “Portable Data in Real Time” addresses some of the issues inherent in sharing data between applications, e.g., the Flickr/FriendFeed example, where user behavior needs to propagate between different social media sites. “Surfacing the Deep Web” describes work at Google to make data behind Web forms amenable to search by automatic query generation.

“Natural Language Corpus Data” describes some simple experiments with a trillion-word content set created by Google and now available through the Linguistic Data Consortium. Simple Python programs are used to build a language model of the corpus, i.e., a probability distribution over the words and phrases occurring in the documents. This model can then be applied to common problems like spell correction and spam detection.
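To give a flavor of the idea, here is a much-reduced sketch of my own (not the chapter's code): a unigram model built from a one-sentence stand-in corpus, used to pick the most probable known word within one edit of a misspelling. The training sentence and function names are assumptions for illustration.

```python
from collections import Counter

LETTERS = "abcdefghijklmnopqrstuvwxyz"


def train(text):
    """Unigram language model: P(word) = count(word) / total words."""
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


def edits1(word):
    """All strings one deletion, transposition, substitution or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in LETTERS]
    inserts = [a + c + b for a, b in splits for c in LETTERS]
    return set(deletes + transposes + replaces + inserts)


def correct(word, model):
    """Return the most probable known word within one edit, else the word itself."""
    if word in model:
        return word
    candidates = [w for w in edits1(word) if w in model]
    return max(candidates, key=model.get) if candidates else word


# A tiny stand-in corpus; the real model is built from vastly more text.
model = train("the data tells the story and the data is the story")

print(correct("dta", model))
print(correct("stroy", model))
```

With a corpus this small the model knows only seven words, so the quality of the corrections depends almost entirely on how much text it has seen, which is the point of working at Google's scale.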

Other papers focused on particular domains, such as the structure of DNA, image processing on Mars, geographic and demographic data, and drug discovery, and many of these were interesting. The first two and last two chapters can be described as more generic, focusing on a range of issues, such as statistics, visualization, entity resolution, and user interface design.

Altogether, this is a very timely and useful collection of papers. As one might expect, the book contains many URLs for further exploration, and some pointers into the literature. If you care about data, this will make a great addition to your bookshelf, alongside Bill Tancer’s somewhat lightweight but nonetheless engaging “Click” and Ian Ayres’ entertaining and provocative “Super Crunchers”.

Monday, July 27, 2009

ICBO Keynote: Harold "Skip" Garner

Skip Garner of the UT SW Medical Center at Dallas gave an interesting talk on text data mining, primarily for drug discovery. Their primary tool seems to be a paragraph-level text similarity tool, called eTBLAST, which powers a search engine with some fairly powerful results post-processing capabilities. One of their goals is to build a literature network based on biomedical entities, and they have a Swanson-style approach in which they look for similarities across different domain literatures. Results are input into their IRIDESCENT hypothesis generation engine, which looks for interesting connections, e.g., among microsatellites, motifs, and other DNA sequencing phenomena in different species.
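For readers unfamiliar with Swanson's idea, here is a minimal sketch of the general "ABC" pattern: if term A co-occurs with B in one literature and B co-occurs with C in another, but A and C never appear together, then A-C is a candidate hypothesis. This is emphatically not how IRIDESCENT is implemented (I don't know its internals); the documents are hand-made term sets based on Swanson's well-known fish oil example, and the function names are my own.

```python
from collections import defaultdict


def cooccurrences(documents):
    """Map each term to the set of terms it shares a document with."""
    links = defaultdict(set)
    for terms in documents:
        for t in terms:
            links[t] |= terms - {t}
    return links


def abc_hypotheses(documents, a):
    """Find terms C linked to A only indirectly, with the bridging B terms."""
    links = cooccurrences(documents)
    hypotheses = defaultdict(set)
    for b in links[a]:
        for c in links[b]:
            # Keep C only if it never co-occurs with A directly.
            if c != a and c not in links[a]:
                hypotheses[c].add(b)
    return dict(hypotheses)


# Each set stands for the key terms of one (hypothetical) abstract.
docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
    {"blood viscosity", "raynaud's syndrome"},
    {"platelet aggregation", "raynaud's syndrome"},
]

result = abc_hypotheses(docs, "fish oil")
for c, bridges in result.items():
    print(f"fish oil -> {c} via {sorted(bridges)}")
```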

Another application area involved coming up with new therapeutic uses for existing drugs. It takes 15 years and costs $500-800M to develop a new drug, so finding new uses for old drugs has the potential to cut costs and boost profitability. Also, many old drugs are now coming off patent, so new uses would allow reformulation and refiling. The classic example of recent years is Sildenafil citrate (a.k.a. Viagra), which began life as a drug for reducing hypertension. IRIDESCENT has discovered that Chlorpromazine (an anti-psychotic) may be efficacious against cardiac hypertrophy (swelling of the heart), and also found antibiotic and anti-epileptic agents that might help, based on side effect data mined from the literature.

Sunday, July 26, 2009

ICBO: Tools for Annotation and Ontology Mapping

The International Conference on Biomedical Ontology was held in Buffalo, NY this weekend. There were a number of papers that addressed the tools side of enriching documents with annotations derived from domain ontologies, using more than one source of terminology.

Michael Bada (from Larry Hunter's group at U. Colo. Denver) described a project called CRAFT (Colorado Richly Annotated Full-Text) to create a corpus of scientific documents annotated with respect to the Gene Ontology (GO) and other sources from ChEBI and NCBI. The annotation tool is called Knowtator, which performs some automatic tagging that can then be corrected by a domain expert in an editorial environment, e.g., Knowtator plugs into Protege. All annotation is 'stand-off', i.e., not inserted into the document as inline tags, but kept in another file. The semantic tagging performed in this effort relies in part upon syntactic annotations performed elsewhere (U. Colo. Boulder).
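For anyone who hasn't met stand-off annotation before, here is a minimal sketch of the idea: the text is left untouched, and each annotation is a separate record holding character offsets and a concept identifier. This is not Knowtator's file format; the class, the sample sentence and the `EXAMPLE:` identifiers are placeholders of my own, not real ontology IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int       # character offset where the span begins
    end: int         # character offset just past the span
    concept_id: str  # ontology identifier (placeholder values below)


def annotate(text, phrase, concept_id):
    """Build a stand-off record for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), concept_id)


text = "A mutation in the gene alters protein folding."

# The annotations live apart from the text and could be saved to another file.
annotations = [
    annotate(text, "mutation", "EXAMPLE:0001"),
    annotate(text, "protein folding", "EXAMPLE:0002"),
]

for a in annotations:
    print(a.concept_id, a.start, a.end, repr(text[a.start:a.end]))
```

Because the document itself never changes, several annotation layers (syntactic, semantic, from different groups) can point into the same text without interfering with one another.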

Various representational issues arise in the course of applying GO to documents. For example, verb nominalizations in English tend to be ambiguous between process and result: "mutation" can refer to either the process by which change occurs or the results of that change. These ambiguities pose difficulties for automatic annotation, and can also perplex human annotators.

Cui Tao (from Chris Chute's group at the Mayo) presented a paper on LexOwl, an information model and API that connects LexGrid's lexical resources with OWL's logic-based representation. LexOwl is being released as a plug-in for the Protege 4 editorial environment. It's heartening to see high-quality tools being made available in a manner that can be shared and deployed by the community as a whole.

ICBO: Panel on Ontology and Publishing

ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY, and this is the second in a series of blog posts on that meeting.

I appeared on a panel with Colin Batchelor (Royal Soc of Chem), Larry Hunter (U. Colo), Alan Ruttenberg (Science Commons), David Shotton (U. Oxford) and Jabe Wilson (Elsevier). The session was chaired by Matt Day of Nature, and the discussion ranged far and wide, but here are the key themes.

On the subject of who should enrich scientific content with tags and other ontological annotations, many agreed with Larry when he said that authors don't have the skills to do knowledge representation. Publishers should hire and train personnel to do this kind of annotation, using industry standard ontologies, where they exist. Others felt that only the authors understand the paper sufficiently to do this task, but this is a strange argument, considering that the purpose of scientific papers is to inform the reader on the chosen topic. Either way, better tools are needed to enable annotation and promote inter-annotator agreement.

The other main topic centered around academic publishers, who are seen as not giving value to the scientific community and not sharing enough with them. Scientific authors provide publishers with their work product for free, but then they and their students have to pay to access it. This growing dissatisfaction has led to the death of some technical journals (e.g., Machine Learning) and their replacement by open access publications. Clearly, publishers need to be adding significant value if they wish to continue to charge high prices, and as a goodwill gesture they need to share more content and data with the community.

Saturday, July 25, 2009

ICBO: The Role of Ontology in Biomedicine

ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY. Ontologies can be defined in various ways; I prefer to think of them as formal systems for describing domain concepts and their instantiations. Topics include: Ontologies for Chemistry and Molecular Biology, Cellular and Anatomical Ontologies, Disease Ontologies, Clinical Ontologies, and IT topics such as computing with ontologies and getting ontologies to work together.

Much of the effort in this area is geared towards extracting biomedical data from various sources and representing it in a computable form. Doing this requires the ability to both represent universal truths about biological systems and describe instances of these truths as applied to individual cases. One can think of ontologies as playing a key role in the irresistible move from dispersed documents to linked data.

Biomedicine contains many difficult modeling problems, e.g., the classification of lipids, where the nomenclature is currently very heterogeneous, and based upon graphical concepts. Even attempts to axiomatize chemistry are full of pitfalls for the unwary, e.g., it doesn't make much sense to talk about a molecule having a boiling point, however tempting it might be to attach the information there, since heat is something that applies to substances, rather than individual molecules. Similarly, the modern view of RNA has changed from its taking a passive role in gene transcription to a more active role in gene expression.