Skip Garner of the UT Southwestern Medical Center at Dallas gave an interesting talk on text data mining, primarily for drug discovery. His group's primary tool seems to be a paragraph-level text similarity engine called eTBLAST, which powers a search engine with some fairly powerful post-processing of results. One of their goals is to build a literature network based on biomedical entities, and they take a Swanson-style approach, looking for similarities across the literatures of different domains. Results are fed into their IRIDESCENT hypothesis generation engine, which looks for interesting connections, e.g., among microsatellites, motifs, and other DNA sequence phenomena in different species.
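For readers unfamiliar with the Swanson-style approach, here is a minimal sketch of the underlying "ABC" idea: if term A co-occurs with B in one literature, and B co-occurs with C in another, but A and C never appear together, then A-C is a candidate hypothesis. This is not how eTBLAST or IRIDESCENT are implemented (the talk did not go into that); the function names and the toy documents are my own assumptions, with the terms borrowed from Swanson's well-known fish oil example.

```python
from collections import defaultdict

def cooccurrences(docs):
    """Map each term to the set of terms it shares a document with."""
    links = defaultdict(set)
    for terms in docs:
        for term in terms:
            links[term] |= terms - {term}
    return links

def abc_candidates(docs_a, docs_c, start):
    """Find terms C linked to `start` only through intermediate terms B.

    docs_a and docs_c are two separate literatures, each a list of
    sets of terms (hypothetical, already-extracted entities).
    """
    links_a = cooccurrences(docs_a)
    links_c = cooccurrences(docs_c)
    # Terms already directly connected to the start term are not new.
    direct = links_a.get(start, set()) | links_c.get(start, set())
    candidates = defaultdict(set)
    for b in links_a.get(start, set()):
        for c in links_c.get(b, set()):
            if c != start and c not in direct:
                candidates[c].add(b)  # remember which B terms bridge A and C
    return dict(candidates)

# Toy data, invented for illustration only.
nutrition_docs = [
    {"fish oil", "blood viscosity"},
    {"fish oil", "platelet aggregation"},
]
vascular_docs = [
    {"blood viscosity", "Raynaud's disease"},
    {"platelet aggregation", "Raynaud's disease"},
]

hypotheses = abc_candidates(nutrition_docs, vascular_docs, "fish oil")
print(hypotheses)
```

A real system would rank the candidates, e.g., by how many bridging terms support each one, rather than simply listing them.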
Another application area involves finding new therapeutic uses for existing drugs. It takes 15 years and costs $500-800M to develop a new drug, so finding new uses for old drugs has the potential to cut costs and boost profitability. Also, many old drugs are now coming off patent, so new uses would allow reformulation and refiling. The classic example of recent years is sildenafil citrate (a.k.a. Viagra), which began life as a drug for treating hypertension. IRIDESCENT has suggested that chlorpromazine (an antipsychotic) may be efficacious against cardiac hypertrophy (enlargement of the heart muscle), and has also found antibiotic and anti-epileptic agents that might help, based on side effect data mined from the literature.
Monday, July 27, 2009
Sunday, July 26, 2009
ICBO: Tools for Annotation and Ontology Mapping
The International Conference on Biomedical Ontology was held in Buffalo, NY this weekend. There were a number of papers that addressed the tools side of enriching documents with annotations derived from domain ontologies, using more than one source of terminology.
Michael Bada (from Larry Hunter's group at U. Colo. Denver) described a project called CRAFT (Colorado Richly Annotated Full-Text) to create a corpus of scientific documents annotated with respect to the Gene Ontology (GO) and other sources from ChEBI and NCBI. The annotation tool is called Knowtator; it performs some automatic tagging, which a domain expert can then correct in an editorial environment, e.g., Knowtator plugs into Protege. All annotation is 'stand-off', i.e., not inserted into the document as inline tags, but kept in another file. The semantic tagging performed in this effort relies in part upon syntactic annotations performed elsewhere (U. Colo. Boulder).
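To make the 'stand-off' idea concrete, here is a minimal sketch in which the document text is never modified and each annotation simply records character offsets plus an ontology identifier. This is not Knowtator's actual file format; the `Annotation` class, the helper functions, and the example sentence are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class Annotation:
    start: int        # character offset where the mention begins
    end: int          # offset just past the end of the mention
    concept_id: str   # identifier in some ontology or terminology

def annotate(text, phrase, concept_id):
    """Create an annotation for the first occurrence of `phrase`."""
    start = text.index(phrase)
    return Annotation(start, start + len(phrase), concept_id)

def covered_text(text, annotation):
    """Recover the annotated mention from the untouched document."""
    return text[annotation.start:annotation.end]

# Example sentence invented for illustration.
text = "The mutation disrupts apoptosis in mouse cells."
annotations = [
    annotate(text, "apoptosis", "GO:0006915"),   # GO: apoptotic process
    annotate(text, "mouse", "NCBITaxon:10090"),  # NCBI Taxonomy: Mus musculus
]

# The annotations would live in a separate file alongside the document.
standoff_file_contents = json.dumps([asdict(a) for a in annotations], indent=2)
print(standoff_file_contents)
```

Because the source text stays untouched, several annotation layers (syntactic, semantic) produced by different groups can point into the same document without interfering with one another.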
Various representational issues arise in the course of applying GO to documents. One is that verb nominalizations in English tend to be ambiguous between process and result, e.g., "mutation" can refer to either the process by which change occurs or the results of that change. These ambiguities pose difficulties for automatic annotation, and can also perplex human annotators.
Cui Tao (from Chris Chute's group at the Mayo Clinic) presented a paper on LexOWL, an information model and API that connects LexGrid's lexical resources with OWL's logic-based representation. LexOWL is being released as a plug-in for the Protege 4 editorial environment. It's heartening to see high-quality tools being made available in a manner that can be shared and deployed by the community as a whole.
ICBO: Panel on Ontology and Publishing
ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY, and this is the second in a series of blog posts on that meeting.
I appeared on a panel with Colin Batchelor (Royal Soc of Chem), Larry Hunter (U. Colo), Alan Ruttenberg (Science Commons), David Shotton (U. Oxford) and Jabe Wilson (Elsevier). The session was chaired by Matt Day of Nature, and the discussion ranged far and wide, but here are the key themes.
On the subject of who should enrich scientific content with tags and other ontological annotations, many agreed with Larry when he said that authors don't have the skills to do knowledge representation. On this view, publishers should hire and train personnel to do this kind of annotation, using industry-standard ontologies where they exist. Others felt that only the authors understand the paper sufficiently to do this task, but this is a strange argument, considering that the purpose of scientific papers is to inform the reader on the chosen topic. Either way, better tools are needed to enable annotation and promote inter-annotator agreement.
The other main topic centered on academic publishers, who are seen as not giving value to the scientific community and not sharing enough with it. Scientific authors provide publishers with their work product for free, but then they and their students have to pay to access it. This growing dissatisfaction has undermined some technical journals (e.g., Machine Learning) and led to their replacement by open access publications. Clearly, publishers need to add significant value if they wish to continue to charge high prices, and as a goodwill gesture they need to share more content and data with the community.
Saturday, July 25, 2009
ICBO: The Role of Ontology in Biomedicine
ICBO is the International Conference on Biomedical Ontology, held this year in Buffalo, NY. Ontologies can be defined in various ways; I prefer to think of them as formal systems for describing domain concepts and their instantiations. Topics at the meeting include: Ontologies for Chemistry and Molecular Biology, Cellular and Anatomical Ontologies, Disease Ontologies, Clinical Ontologies, and IT topics such as computing with ontologies and getting ontologies to work together.
Much of the effort in this area is geared towards extracting biomedical data from various sources and representing it in a computable form. Doing this requires the ability to both represent universal truths about biological systems and describe instances of these truths as applied to individual cases. One can think of ontologies as playing a key role in the irresistible move from dispersed documents to linked data.
Biomedicine contains many difficult modeling problems, e.g., the classification of lipids, where the nomenclature is currently very heterogeneous and based upon graphical concepts. Even attempts to axiomatize chemistry are full of pitfalls for the unwary, e.g., it doesn't make much sense to talk about a molecule having a boiling point, however tempting it might be to attach the information there, since boiling is something that happens to bulk substances, rather than individual molecules. Models also have to keep up with the science: the modern view of RNA has changed from its taking a passive role in gene transcription to a more active role in gene expression.
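The boiling point pitfall can be illustrated with a small sketch, in which properties of a single molecule and properties of a bulk substance are kept on separate types. The class and field names are my own invention for illustration, not taken from any of the chemistry ontologies discussed at the meeting.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Molecule:
    """Properties that make sense for a single molecule."""
    formula: str
    molar_mass_g_per_mol: float

@dataclass(frozen=True)
class Substance:
    """Properties that only make sense for a bulk quantity of matter."""
    name: str
    molecule: Molecule       # what the substance is composed of
    boiling_point_c: float   # degrees Celsius at 1 atm

water_molecule = Molecule(formula="H2O", molar_mass_g_per_mol=18.015)
water = Substance(name="water", molecule=water_molecule, boiling_point_c=100.0)

# The boiling point is reachable from the substance, never from the molecule.
print(water.boiling_point_c, water.molecule.formula)
```

Attaching `boiling_point_c` to `Molecule` would be the tempting shortcut; keeping it on `Substance` encodes the distinction in the model itself.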