This article is part of the supplement: Proceedings of the Fourth International Symposium on Semantic Mining in Biomedicine (SMBM)

Open Access Research

An analysis of gene/protein associations at PubMed scale

Sampo Pyysalo1*, Tomoko Ohta1 and Jun’ichi Tsujii1,2,3

Author Affiliations

1 Department of Computer Science, University of Tokyo, Tokyo, Japan

2 School of Computer Science, University of Manchester, Manchester, UK

3 National Centre for Text Mining, University of Manchester, Manchester, UK

Journal of Biomedical Semantics 2011, 2(Suppl 5):S5  doi:10.1186/2041-1480-2-S5-S5

Published: 6 October 2011

Abstract

Background

Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available.

Results

In the present study, we aim to estimate how much of the full range of gene/protein association statements in PubMed is covered by existing resources for event extraction. We base our analysis on a recently released corpus that covers the entire PubMed and is automatically annotated for gene/protein entities and syntactic analyses, and we use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements is then manually analyzed with reference to the GENIA ontology.
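
As an illustration of one step named above, the shortest dependency path between two co-occurring gene/protein mentions can be sketched as a breadth-first search over the dependency edges of a sentence, treated as undirected. This is a minimal sketch and not the pipeline used in the study: the function name, the edge representation as (head, dependent) pairs, the placeholder tokens GENE_A and GENE_B, and the toy parse are all assumptions made for illustration, and tokens are assumed to be unique within the sentence.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Return the token path from source to target, or None if unreachable.

    edges: iterable of (head, dependent) pairs (hypothetical representation).
    Edge direction is ignored, so the parse is searched as an undirected graph.
    """
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, set()).add(dependent)
        graph.setdefault(dependent, set()).add(head)
    if source not in graph or target not in graph:
        return None

    # Breadth-first search, remembering each token's predecessor.
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for neighbor in sorted(graph[node]):
            if neighbor not in previous:
                previous[neighbor] = node
                queue.append(neighbor)
    return None


# Toy, hand-written parse of "GENE_A binds to GENE_B" (illustrative only).
EDGES = [("binds", "GENE_A"), ("binds", "to"), ("to", "GENE_B")]
PATH = shortest_dependency_path(EDGES, "GENE_A", "GENE_B")
print(PATH)
```

In this toy case the words on the path between the two mentions ("binds", "to") are the candidate expression of the association; in a setup like the one described, such paths would be collected for co-occurring entity pairs and then ranked by frequency and classifier likelihood.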

Conclusions

We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.