Skip Garner of the UT Southwestern Medical Center at Dallas gave an interesting talk on text data mining, primarily for drug discovery. Their primary tool appears to be a paragraph-level text similarity tool, eTBLAST, which drives a search engine with some fairly sophisticated results post-processing capabilities. One of their goals is to build a literature network based on biomedical entities, and they take a Swanson-style approach in which they look for similarities across the literatures of different domains. Results feed into their IRIDESCENT hypothesis generation engine, which looks for interesting connections, e.g., among microsatellites, motifs, and other DNA sequence phenomena in different species.
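To make the Swanson-style idea concrete, here is a minimal sketch of "ABC" literature linking: find bridging B terms that co-occur with A in one literature and with C in another, even when A and C never appear together directly. This is an illustration of the general technique only, not Garner's IRIDESCENT implementation; the toy corpora and entity names are hypothetical, echoing Swanson's well-known magnesium/migraine example.

```python
from collections import defaultdict
from itertools import combinations

# Toy "literatures": each document is the set of entities it mentions.
# These corpora are hypothetical illustrations, not mined data.
cardiology_docs = [
    {"magnesium", "vascular reactivity"},
    {"magnesium", "blood viscosity"},
]
neurology_docs = [
    {"migraine", "vascular reactivity"},
    {"migraine", "blood viscosity"},
]

def cooccurrence(docs):
    """Count how often two entities appear in the same document."""
    counts = defaultdict(int)
    for doc in docs:
        for a, b in combinations(sorted(doc), 2):
            counts[(a, b)] += 1
    return counts

def abc_links(docs_a, docs_c, min_support=1):
    """Swanson-style linking: collect B terms shared by the A and C literatures.

    Pairs (A, C) linked by many B terms are candidate hypotheses; the usual
    extra filter (A and C never co-occur directly) is omitted for brevity.
    """
    ab = cooccurrence(docs_a)
    bc = cooccurrence(docs_c)
    bridges = defaultdict(set)
    for (x, y), n in ab.items():
        if n < min_support:
            continue
        for (p, q), m in bc.items():
            if m < min_support:
                continue
            for b in {x, y} & {p, q}:
                a = ({x, y} - {b}).pop()
                c = ({p, q} - {b}).pop()
                if a != c:
                    bridges[(a, c)].add(b)
    return bridges

for (a, c), bs in abc_links(cardiology_docs, neurology_docs).items():
    print(f"{a} -?-> {c} via {sorted(bs)}")
```

Run on the toy data, this prints a single candidate link from magnesium to migraine via the shared intermediate terms, which is exactly the shape of connection a hypothesis generation engine would surface for human review.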
Another application area involved finding new therapeutic uses for existing drugs. It takes 15 years and costs $500-800M to develop a new drug, so finding new uses for old drugs has the potential to cut costs and boost profitability. Also, many old drugs are now coming off patent, so new uses would allow reformulation and refiling. The classic example of recent years is Sildenafil citrate (a.k.a. Viagra), which began life as a drug for reducing hypertension. IRIDESCENT has suggested that Chlorpromazine (an anti-psychotic) may be efficacious against cardiac hypertrophy (enlargement of the heart muscle), and also found antibiotic and anti-epileptic agents that might help, based on side-effect data mined from the literature.
Monday, July 27, 2009
Sunday, July 26, 2009
ICBO: Tools for Annotation and Ontology Mapping
The International Conference on Biomedical Ontology was held in Buffalo, NY this weekend. There were a number of papers that addressed the tools side of enriching documents with annotations derived from domain ontologies, using more than one source of terminology.
Michael Bada (from Larry Hunter's group at U. Colo. Denver) described a project called CRAFT (Colorado Richly Annotated Full-Text) to create a corpus of scientific documents annotated with respect to the Gene Ontology (GO) and other sources such as ChEBI and NCBI resources. The annotation tool is Knowtator, which performs some automatic tagging that can then be corrected by a domain expert in an editorial environment (Knowtator plugs into Protege). All annotation is 'stand-off', i.e., kept in a separate file rather than inserted into the document as inline tags. The semantic tagging performed in this effort relies in part upon syntactic annotations performed elsewhere (U. Colo. Boulder).
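For readers unfamiliar with stand-off annotation, the sketch below shows the basic idea: the source text is never modified, and each annotation records a character span plus an ontology term in a separate, independently serializable structure. This is a generic illustration, not the Knowtator or CRAFT file format, and the term identifiers are placeholders rather than looked-up GO/ChEBI IDs.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class Annotation:
    """One stand-off annotation: a character span plus an ontology term."""
    start: int      # offset of the first character of the mention
    end: int        # offset one past the last character
    term_id: str    # ontology identifier (placeholder values below)
    label: str      # preferred name of the term

# The source text stays untouched; annotations live in a separate structure
# that points back into it by character offsets.
text = "We observed a mutation that disrupts DNA repair in the cell nucleus."

annotations = [
    # Identifiers are illustrative placeholders, not real ontology IDs.
    Annotation(14, 22, "TERM:0001", "mutation"),     # ambiguous: process or result? (see below)
    Annotation(37, 47, "TERM:0002", "DNA repair"),
    Annotation(55, 67, "TERM:0003", "cell nucleus"),
]

# Sanity check: each span must reproduce the surface string it annotates.
for a in annotations:
    assert text[a.start:a.end].lower() == a.label.lower()

# Stand-off annotations serialize naturally without altering the document.
print(json.dumps([asdict(a) for a in annotations], indent=2))
```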
Various representational issues arise in applying GO to documents. For example, verb nominalizations in English tend to be ambiguous between process and result: "mutation" can refer either to the process by which a change occurs or to the result of that change. Such ambiguities pose difficulties for automatic annotation, and can also perplex human annotators.
Cui Tao (from Chris Chute's group at the Mayo Clinic) presented a paper on LexOwl, an information model and API that connects LexGrid's lexical resources with OWL's logic-based representation. LexOwl is being released as a plug-in for the Protege 4 editorial environment. It's heartening to see high-quality tools being made available in a manner that can be shared and deployed by the community as a whole.
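As a rough illustration of what bridging a lexical resource to OWL involves (and emphatically not the LexOwl API itself), the sketch below turns a hypothetical terminology record into an OWL class using rdflib, carrying the preferred name as a label and the synonyms as alternative labels. The namespace and term record are made up for the example.

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import OWL, RDF, RDFS

# Hypothetical namespace and terminology record, purely for illustration.
EX = Namespace("http://example.org/terminology#")
SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

lexical_entry = {
    "code": "C0027651",                 # hypothetical source code for the concept
    "preferred_name": "Neoplasm",
    "synonyms": ["Tumor", "Tumour"],
}

g = Graph()
g.bind("owl", OWL)
g.bind("skos", SKOS)

cls = EX[lexical_entry["code"]]
g.add((cls, RDF.type, OWL.Class))                        # the lexical concept becomes an OWL class
g.add((cls, RDFS.label, Literal(lexical_entry["preferred_name"])))
for syn in lexical_entry["synonyms"]:
    g.add((cls, SKOS.altLabel, Literal(syn)))            # synonyms carried as alternative labels

print(g.serialize(format="turtle"))
```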
ICBO: Panel on Ontology and Publishing
ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY, and this is the second in a series of blog posts on that meeting.
I appeared on a panel with Colin Batchelor (Royal Soc of Chem), Larry Hunter (U. Colo), Alan Ruttenberg (Science Commons), David Shotton (U. Oxford) and Jabe Wilson (Elsevier). The session was chaired by Matt Day of Nature, and the discussion ranged far and wide, but here are the key themes.
On the subject of who should enrich scientific content with tags and other ontological annotations, many agreed with Larry when he said that authors don't have the skills to do knowledge representation. Publishers should hire and train personnel to do this kind of annotation, using industry-standard ontologies where they exist. Others felt that only the authors understand a paper sufficiently to do this task, but this is a strange argument, considering that the purpose of a scientific paper is to inform the reader on the chosen topic. Either way, better tools are needed to enable annotation and promote inter-annotator agreement.
The other main topic centered on academic publishers, who are seen as not giving enough value back to the scientific community and not sharing enough with it. Scientific authors provide publishers with their work product for free, but then they and their students have to pay for access. This growing dissatisfaction has led to defections from some technical journals (e.g., the editorial board of Machine Learning, which resigned in favor of an open-access alternative) and to the rise of open-access publications. Clearly, publishers need to add significant value if they wish to continue charging high prices, and as a goodwill gesture they need to share more content and data with the community.
Saturday, July 25, 2009
ICBO: The Role of Ontology in Biomedicine
ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY. Ontologies can be defined in various ways; I prefer to think of them as formal systems for describing domain concepts and their instantiations. Topics include: Ontologies for Chemistry and Molecular Biology, Cellular and Anatomical Ontologies, Disease Ontologies, Clinical Ontologies, and IT topics such as computing with ontologies and getting ontologies to work together.
Much of the effort in this area is geared towards extracting biomedical data from various sources and representing it in a computable form. Doing this requires the ability to both represent universal truths about biological systems and describe instances of these truths as applied to individual cases. One can think of ontologies as playing a key role in the irresistible move from dispersed documents to linked data.
Biomedicine contains many difficult modeling problems, e.g., the classification of lipids, where the nomenclature is currently very heterogeneous and based upon graphical concepts. Even attempts to axiomatize chemistry are full of pitfalls for the unwary: however tempting it might be to attach the information there, it doesn't make much sense to talk about a molecule having a boiling point, since boiling is a property of a bulk substance rather than of an individual molecule. Similarly, the modern view of RNA has shifted from a passive role in gene transcription to a more active role in gene expression.
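The boiling-point pitfall can be made concrete with a toy model: attach bulk properties to the substance, not to the molecular structure. The class names and values below are illustrative only, not a fragment of any published chemistry ontology.

```python
from dataclasses import dataclass

@dataclass
class MoleculeKind:
    """A molecular structure, e.g., H2O: atoms and bonds, but no bulk properties."""
    formula: str

@dataclass
class Substance:
    """A bulk portion of matter made up of molecules of some kind."""
    kind: MoleculeKind
    boiling_point_celsius: float  # meaningful here, at the level of the substance

# The boiling point belongs to water as a substance (at 1 atm),
# not to any single H2O molecule.
water_molecule = MoleculeKind(formula="H2O")
water = Substance(kind=water_molecule, boiling_point_celsius=100.0)
```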
Wednesday, June 24, 2009
"The Big Switch" by Nicholas Carr
Every now and again, you read a book about IT that is well-written, informative, and thought-provoking, while managing to avoid all the usual pitfalls and clichés. ‘The Big Switch’ is definitely one of those books. Nicholas Carr’s previous book (‘Does IT Matter?’) was based on his controversial HBR article, which claimed that IT, though a necessary investment, typically doesn’t convey competitive advantage to companies. In this work, he goes even further, arguing that computing and storage will soon be utilities, like electricity.
Carr explores the parallels with electricity for power and light in the early chapters, explaining how industry switched from generating its own capacity to relying on a grid, once innovations in power transmission, load balancing, and demand metering made this possible. Similarly, he argues that the ability to assemble huge data centers, make software available as a service, and supply cycles and disk space on demand provides companies with the opportunity to get out of the IT business. The resemblance between these two situations, set a century apart, is quite uncanny.
In the later chapters, Carr examines the downsides of IT progress: the stagnation of middle class wages as jobs are computerized, the disembowelment of information and media industries as content is unbundled, as well as subtle cultural and social effects that fall short of the utopian vision. He states that “All technological change is generational change … As the older generations die, they take with them the knowledge of what was lost … and only the sense of what was gained remains.” I think we are all seeing this play out very clearly as the pace of technological change continues to accelerate.
Thursday, May 7, 2009
Judy Estrin: "Closing the Innovation Gap"
Judy Estrin's book, brought out this year by McGraw-Hill, is called "Closing the Innovation Gap: Reigniting the Spark of Creativity in a Global Economy." The subtitle is a bit misleading, since Estrin focuses almost entirely on the US. The "gap" of her title reflects her thesis that we are now living off the fat of the 1950s and 60s, when spending on R&D was much higher than it is today.
Estrin interviewed about 100 business and technology luminaries for her book, including Marc Andreessen, John Seely Brown, and Vint Cerf, to name but a few. Many of the problems she and others identify are cultural, e.g., the get-rich-quick philosophy of Wall Street; the waning of interest in science at school, even as technology becomes more and more a part of our lives; and corporations that would rather offshore R&D and lie in wait for promising start-ups than innovate themselves.
Estrin has an interesting CV that includes successful start-ups, academia, and serving on the boards of major corporations. I was fortunate enough to hear her give a keynote at SIIA's NetGain 2009 in San Francisco, and her talk followed the contents and spirit of the book fairly closely (see my review). Her book ends with 'A Call to Action', but one wonders whether it's already too late, given the erosion of what she calls our "innovation ecosystem."
Tuesday, May 5, 2009
Chris Anderson on "Free" at SIIA NetGain
Chris Anderson of Wired magazine, and author of "The Long Tail", gave a fascinating talk today aligned with the contents of his latest book, "Free". He began by examining the ambiguity inherent in the English word "free", which can mean (among other things) the same as "gratis" (i.e., at no cost or charge), or unrestricted and independent. (It's clear that sayings such as "information wants to be free" are really a pun on these two meanings, and beg many important questions.)
The key question Anderson asks in this talk is: what happens when key resources become so cheap as to be (almost) not worth measuring, when they become essentially free? The answer is that people (rightly) start "wasting" them. Thus the plummeting price of processing power (down 50% every 18 months) has led to the proliferation of personal computers that people didn't know they needed. Cheap storage (about $110 a TB today) makes a mockery of company policies that expect highly-paid professionals to take time out of their day to decide which documents to delete to meet disk quotas. Cheap bandwidth gave rise to mass media, first radio then TV, where the marginal cost of supplying an extra customer with prime-time, lowest common denominator entertainment is essentially $0. Even now, the cost to stream a Netflix movie is about 5 cents and falling.
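A back-of-the-envelope calculation makes the disk-quota point vivid. The $110/TB storage price is from the talk; the hourly rate and the amount of space reclaimed are assumptions chosen for illustration.

```python
# Assumed figures for illustration only.
storage_cost_per_tb = 110.0          # dollars per terabyte (2009-era price cited in the talk)
professional_rate_per_hour = 100.0   # assumed fully loaded hourly cost of the employee
gb_freed = 10                        # space reclaimed by an hour spent deleting old files

cost_of_time = 1.0 * professional_rate_per_hour
cost_of_storage_saved = (gb_freed / 1024) * storage_cost_per_tb

print(f"Hour of professional time: ${cost_of_time:.2f}")          # ~$100 of labor
print(f"Storage actually reclaimed: ${cost_of_storage_saved:.2f}")  # ~$1 of disk
```

Under these assumptions, the policy spends roughly a hundred dollars of labor to recover about a dollar of disk, which is the absurdity Anderson is pointing at.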
Anderson made some interesting statements in the course of this talk which are worth quoting verbatim. Talking about how mammals are unusual in their nurturing attitude towards a small number of young, he states that "Nature wastes life in pursuit of a better life", i.e., the norm is for most of an upcoming generation to perish, while only a few survive and even fewer find better environments. Analogously, people "waste" bandwidth on YouTube, trying anything and everything, just because doing so is a cost-free activity, beyond the investment of time, energy, etc. Of course, survival (measured as popularity) is very highly skewed, but diversity ensures that pretty much every taste is catered for. For example, many 9-year olds would rather watch amateur stop-motion animations of Star Wars made by their friends than watch a conventional Star Wars movie on a plasma TV with surround sound. As Anderson says, "Relevance is the most important determiner of quality", not conventional production values.
The central axiom is the following: "In a competitive market, price falls to the marginal cost." This quote, from the French mathematician and economist Joseph Bertrand, has acquired new meaning in the world of the Web, where the marginal cost of any bit-based product is close to $0. It follows (says Anderson) that, sooner or later, there will be a free version of everything.
How does business survive in a "free" world? One strategy is called "freemium". This inverts the 20th-century marketing model of giving away a few free samples to drive sales volume: instead, one gives a version of the product away to the majority of users in order to drive adoption of a premium version by a minority of users.
How does this work? The assumption is that only 5% or so of users are really needed to cover your marginal costs. Going big via freemium ensures that this is 5% of a large number. You can then try to drive conversion above 5% to pay back your fixed costs and get your business into the black. Sounds easy, doesn't it? Obviously, your sums have to be right, and adopting the strategy takes intestinal fortitude (as well as access to capital).
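Here is the freemium arithmetic as a toy calculation. All numbers are assumptions chosen for illustration, not figures from Anderson's talk; the point is simply that a small paying minority of a large base can cover both marginal and fixed costs.

```python
# Assumed figures for illustration only.
total_users = 1_000_000
conversion_rate = 0.05               # the "5% or so" who pay for the premium version
price_per_paying_user = 10.0         # dollars per month
marginal_cost_per_user = 0.04        # serving cost per user per month, free or paid
fixed_costs = 30_000.0               # salaries, hosting commitments, etc., per month

revenue = total_users * conversion_rate * price_per_paying_user
marginal_costs = total_users * marginal_cost_per_user
profit = revenue - marginal_costs - fixed_costs

print(f"Revenue: ${revenue:,.0f}")                 # $500,000
print(f"Marginal costs: ${marginal_costs:,.0f}")   # $40,000
print(f"Profit after fixed costs: ${profit:,.0f}") # $430,000
```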
The premium version can be differentiated from the free one along any number of dimensions, e.g., it has time/effort saving features, it's limited by number of seats, it's only available to certain users (such as small companies), etc. The lesson from freemium in online gaming is that people will pay to save time, lower risk, enhance their status, and other inducements.
This is certainly a brave new world for the software and information industry, and one that will require many mental (and price) adjustments. But, if Anderson is right, the triumph of "free" is inevitable, in the sense that in future there will tend to be a free version of almost any given resource. I like his account because it also allows for the other reality, the one we live at Thomson Reuters: namely, that people who want more accurate legal research, more up-to-date financial quotes, and deeper analysis of scientific trends will always be prepared to pay for the advantage so conveyed. The trick is to stay ahead of the value curve and provide must-have information that is both intelligible and actionable.
Nature may waste life, but life is still a zero-sum game.