This article is part of the supplement: Proceedings of the Fourth International Symposium on Semantic Mining in Biomedicine (SMBM)
An analysis of gene/protein associations at PubMed scale
1 Department of Computer Science, University of Tokyo, Tokyo, Japan
2 School of Computer Science, University of Manchester, Manchester, UK
3 National Centre for Text Mining, University of Manchester, Manchester, UK
Journal of Biomedical Semantics 2011, 2(Suppl 5):S5 doi:10.1186/2041-1480-2-S5-S5Published: 6 October 2011
Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available.
In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology.
We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.