Bioinformatics. Jun 1, 2009; 25(11): 1422–1423.
Published online Mar 20, 2009. doi:  10.1093/bioinformatics/btp163
PMCID: PMC2682512

Biopython: freely available Python tools for computational molecular biology and bioinformatics

Abstract

Summary: The Biopython project is a mature open source international collaboration of volunteer developers, providing Python libraries for a wide range of bioinformatics problems. Biopython includes modules for reading and writing different sequence file formats and multiple sequence alignments, dealing with 3D macromolecular structures, interacting with common tools such as BLAST, ClustalW and EMBOSS, accessing key online databases, as well as providing numerical methods for statistical learning.

Availability: Biopython is freely available, with documentation and source code at www.biopython.org under the Biopython license.

Contact: All queries should be directed to the Biopython mailing lists, see www.biopython.org/wiki/_Mailing_lists (peter.cock@scri.ac.uk).

1 INTRODUCTION

Python (www.python.org) and Biopython are freely available open source tools, available for all the major operating systems. Python is a very high-level programming language, in widespread commercial and academic use. It features an easy-to-learn syntax, object-oriented programming capabilities and a wide array of libraries. Python can interface to optimized code written in C, C++ or even FORTRAN and, together with the Numerical Python project numpy (Oliphant, 2006), makes a good choice for scientific programming (Oliphant, 2007). Python has even been used in the numerically demanding field of molecular dynamics (Hinsen, 2000). There are also high-quality plotting libraries such as matplotlib (matplotlib.sourceforge.net) available.

Since its founding in 1999 (Chapman and Chang, 2000), Biopython has grown into a large collection of modules, described briefly below, intended for computational biology or bioinformatics programmers to use in scripts or incorporate into their own software. Our web site lists over 100 publications using or citing Biopython.

The Open Bioinformatics Foundation (OBF, www.open-bio.org) hosts our web site, source code repository, bug tracking database and email mailing lists, and also supports the related BioPerl (Stajich et al., 2002), BioJava (Holland et al., 2008), BioRuby (www.bioruby.org) and BioSQL (www.biosql.org) projects.

2 BIOPYTHON FEATURES

The Seq object is Biopython's core sequence representation. It behaves very much like a Python string but with the addition of an alphabet (allowing explicit declaration of a protein sequence for example) and some key biologically relevant methods. For example,

[Code listing btp163i1.jpg: image not reproduced]
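In place of the listing, the following is a minimal sketch of the string-like and biological behaviour described above. It assumes a recent Biopython release, in which a Seq is created without an explicit alphabet argument. The sequence itself is a made-up example, and the choice of the bacterial translation table (table 11) is an assumption for illustration.

```python
from Bio.Seq import Seq

# Hypothetical coding sequence, used only for illustration.
gene = Seq("ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTGA")

# String-like behaviour: length and slicing work as for a Python string.
first_codon = gene[:3]

# Biologically relevant methods.
rna = gene.transcribe()              # DNA -> RNA (T replaced by U)
protein = gene.translate(table=11)   # translate using NCBI table 11
rev = gene.reverse_complement()      # reverse complement strand

print(len(gene), first_codon)
print(rna)
print(protein)
```

Each method returns a new Seq object, so calls can be chained, for example `gene.reverse_complement().transcribe()`.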

Sequence annotation is represented using SeqRecord objects which augment a Seq object with properties such as the record name, identifier and description and space for additional key/value terms. The SeqRecord can also hold a list of SeqFeature objects which describe sub-features of the sequence with their location and their own annotation.
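As a rough sketch of how these objects fit together, the example below builds a SeqRecord by hand and attaches one SeqFeature. The identifier, description, annotation key and feature coordinates are all hypothetical values chosen for illustration. In practice such records are usually produced by a file parser rather than constructed manually.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# A record wraps a Seq with an identifier, name and description
# (all values here are made up for illustration).
record = SeqRecord(
    Seq("ATGAAAGCAATTTTCGTACTGAAAGGTTGGTGGCGCACTTGA"),
    id="example_1",
    name="example",
    description="hypothetical record for illustration",
)

# Additional key/value terms go in the annotations dictionary.
record.annotations["source"] = "synthetic"

# A sub-feature with its own location and annotation (qualifiers).
cds = SeqFeature(
    FeatureLocation(0, 42, strand=1),
    type="CDS",
    qualifiers={"transl_table": ["11"]},
)
record.features.append(cds)

# The feature location can be used to pull out the matching subsequence.
cds_seq = cds.extract(record.seq)
print(record.id, cds.type, len(cds_seq))
```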

The Bio.SeqIO module provides a simple interface for reading and writing biological sequence files in various formats (Table 1). Regardless of the file format, the information is held as SeqRecord objects. Bio.SeqIO interprets multiple sequence alignment file formats as collections of equal-length (gapped) sequences. Alternatively, Bio.AlignIO works directly with alignments, including files holding more than one alignment (e.g. re-sampled alignments for bootstrapping, or multiple pairwise alignments). The related module Bio.Nexus, developed for Kauff et al. (2007), supports phylogenetic tools using the NEXUS interface (Maddison et al., 1997) or the Newick standard tree format.
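The sketch below illustrates the two views of the same data: Bio.SeqIO yields individual SeqRecord objects, while Bio.AlignIO treats equal-length sequences as a single alignment. To keep it self-contained, it reads from an in-memory string rather than a file. The FASTA content is made up, and the "fasta" and "tab" format names are used here as examples only.

```python
from io import StringIO

from Bio import AlignIO, SeqIO

# Hypothetical FASTA data; both sequences have the same length.
fasta_text = """>seq1 first hypothetical sequence
ATGAAAGCAATTTTC
>seq2 second hypothetical sequence
GTACTGAAAGGTTGG
"""

# Bio.SeqIO: every entry becomes a SeqRecord, whatever the file format.
records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for rec in records:
    print(rec.id, len(rec.seq))

# Writing the same records in another format (simple tab-separated text).
out = StringIO()
count = SeqIO.write(records, out, "tab")
print(count, "records written")

# Bio.AlignIO: the same equal-length sequences read as one alignment.
alignment = AlignIO.read(StringIO(fasta_text), "fasta")
print(len(alignment), "rows,", alignment.get_alignment_length(), "columns")
```

With real files, the StringIO objects would simply be replaced by file names or open file handles.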

Table 1.
Selected Bio.SeqIO or Bio.AlignIO file formats

Modules for a number of online databases are included, such as the NCBI Entrez Utilities, ExPASy, InterPro, KEGG and SCOP. Bio.Blast can call the NCBI's online BLAST server or a local standalone installation, and includes a parser for their XML output. Biopython has wrapper code for other command line tools too, such as ClustalW and EMBOSS. The Bio.PDB module provides a PDB file parser and functionality related to macromolecular structure (Hamelryck and Manderick, 2003). The Bio.Motif module provides support for sequence motif analysis (searching, comparing and de novo learning). Biopython's graphical output capabilities were recently extended significantly by the inclusion of GenomeDiagram (Pritchard et al., 2006).

Biopython contains modules for supervised statistical learning, such as Bayesian methods and Markov models, as well as unsupervised learning, such as clustering (De Hoon et al., 2004).
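As a hedged illustration of the unsupervised side, the sketch below runs k-means clustering with the kcluster function from Bio.Cluster on a small data matrix. It assumes NumPy is installed. The data values and the choice of two clusters are hypothetical and chosen so that the grouping is obvious.

```python
import numpy as np
from Bio.Cluster import kcluster

# Hypothetical data: four items (rows) measured in two dimensions (columns),
# forming two well-separated groups.
data = np.array(
    [
        [0.0, 0.1],
        [0.1, 0.0],
        [5.0, 5.1],
        [5.1, 5.0],
    ]
)

# k-means with two clusters; npass repeats the algorithm from different
# random starting assignments and keeps the best solution found.
clusterid, error, nfound = kcluster(data, nclusters=2, npass=10)

print("cluster assignment:", list(clusterid))
print("within-cluster sum of distances:", error)
```

The cluster labels themselves are arbitrary, so only the grouping of rows, not the label values, is meaningful.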

The population genetics module provides wrappers for GENEPOP (Rousset, 2007), coalescent simulation via SIMCOAL2 (Laval and Excoffier, 2004) and selection detection based on a well-evaluated Fst-outlier detection method (Beaumont and Nichols, 1996).

BioSQL (www.biosql.org) is another OBF-supported initiative, a collaboration between BioPerl, Biopython, BioJava and BioRuby to support loading and retrieving annotated sequences to and from an SQL database using a standard schema. Each project provides an object-relational mapping (ORM) between the shared schema and its own object model (a SeqRecord in Biopython). As an example, xBASE (Chaudhuri and Pallen, 2006) uses BioSQL with both BioPerl and Biopython.

3 CONCLUSIONS

Biopython is a large open-source application programming interface (API) used in both bioinformatics software development and in everyday scripts for common bioinformatics tasks. The homepage www.biopython.org provides access to the source code, documentation and mailing lists. The features described herein are only a subset; potential users should refer to the tutorial and API documentation for further information.

ACKNOWLEDGEMENTS

The OBF hosts and supports the project. We warmly thank the many Biopython contributors over the years, a list too long to reproduce here.

Funding: Fundacao para a Ciencia e Tecnologia (Portugal) (grant SFRH/BD/30834/2006 to T.A.).

Conflict of Interest: none declared.

REFERENCES

  • Chapman B, Chang J. Biopython: Python tools for computational biology. ACM SIGBIO Newslett. 2000;20:15–19.
  • Chaudhuri RR, Pallen MJ. xBASE, a collection of online databases for bacterial comparative genomics. Nucleic Acids Res. 2006;34:D335–D337. [PMC free article] [PubMed]
  • Bateman A, et al. The Pfam protein families database. Nucleic Acids Res. 2004;32:D138–D141. [PMC free article] [PubMed]
  • Beaumont MA, Nichols RA. Evaluating loci for use in the genetic analysis of population structure. Proc. R. Soc. Lond. B. 1996;263:1619–1626.
  • Benson DA, et al. GenBank. Nucleic Acids Res. 2007;35:D21–D25. [PMC free article] [PubMed]
  • Felsenstein J. PHYLIP - phylogeny inference package (Version 3.2). Cladistics. 1989;5:164–166.
  • Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics. 2003;19:2308–2310. [PubMed]
  • Hinsen K. The molecular modeling toolkit: a new approach to molecular simulations. J. Comp. Chem. 2000;21:79–85.
  • Holland RCG, et al. BioJava: an open-source framework for bioinformatics. Bioinformatics. 2008;24:2096–2097. [PMC free article] [PubMed]
  • De Hoon MJL, et al. Open source clustering software. Bioinformatics. 2004;20:1453–1454. [PubMed]
  • Kauff F, et al. WASABI: an automated sequence processing system for multi-gene phylogenies. Syst. Biol. 2007;56:523–531. [PubMed]
  • Kulikova T, et al. EMBL nucleotide sequence database in 2006. Nucleic Acids Res. 2006;35:D16–D20. [PMC free article] [PubMed]
  • Laval G, Excoffier L. SIMCOAL 2.0: a program to simulate genomic diversity over large recombining regions in a subdivided population with a complex history. Bioinformatics. 2004;20:2485–2487. [PubMed]
  • Maddison DR, et al. NEXUS: an extensible file format for systematic information. Syst. Biol. 1997;46:590–621. [PubMed]
  • Oliphant TE. Guide to NumPy. USA: Trelgol Publishing; 2006.
  • Oliphant TE. Python for Scientific Computing. Comput. Sci. Eng. 2007;9:10–20.
  • Pearson WR, Lipman DJ. Improved tools for biological sequence analysis. PNAS. 1988;85:2444–2448. [PMC free article] [PubMed]
  • Pritchard L, et al. GenomeDiagram: a Python package for the visualisation of large-scale genomic data. Bioinformatics. 2006;22:616–617. [PubMed]
  • Rice P, et al. EMBOSS: the European molecular biology open software suite. Trends Genet. 2000;16:276–277. [PubMed]
  • Rousset F. GENEPOP '007: a complete re-implementation of the GENEPOP software for Windows and Linux. Mol. Ecol. Res. 2007;8:103–106. [PubMed]
  • Stajich JE, et al. The Bioperl toolkit: Perl modules for the life sciences. Genome Res. 2002;12:1611–1618. [PMC free article] [PubMed]
  • The UniProt Consortium. The universal protein resource (UniProt). Nucleic Acids Res. 2007;35:D193–D197. [PMC free article] [PubMed]
  • Thompson JD, et al. CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994;22:4673–4680. [PMC free article] [PubMed]

Articles from Bioinformatics are provided here courtesy of Oxford University Press