Computational Tools and Resources in Plant Genome Informatics

Though all biologists deal with information, only recently have the computational challenges of systematically collecting, storing, organising, manipulating, visualising and analysing large amounts of biological information come to be widely appreciated. The cause is the explosive growth of genomics. The term bioinformatics was originally coined for the application of information technology to large volumes of biological, and particularly genomic, data. The field has since become intermingled with traditional computational biology and biostatistics, which, strictly speaking, are concerned not with how to handle the information itself but with how to extract biological meaning from it. Thus, bioinformatics, in its broad sense, can be seen as providing both the infrastructure and the scientific framework in which biologists take information and use computers to help convert it into knowledge.

Despite the relative youth of the field as a recognised discipline, there is an impressive diversity of bioinformatics resources currently available. By necessity, we focus on only a small slice of this diversity here. We pay particular attention to sequence analysis because of its centrality to genomics. We do not attempt to provide specific protocols, as the needs of users vary greatly. The resources we describe range widely in sophistication, from little-tested programs posted on graduate student web pages to very stable and complex databases maintained by governmental agencies. The better ones typically provide manuals and tutorials, often containing descriptions of the underlying principles. The reader is strongly advised to consult the documentation available for each tool.

Though a wide array of commercial resources exists, some of which are ideally suited to specific tasks, many of the most fundamental and long-lived bioinformatics tools are freely available. For this reason, we describe primarily non-commercial software in this chapter. Many of the databases and analysis tools we describe are hosted by government or academic research centres and can be accessed via user-friendly web interfaces.

Collectively, online databases allow access to a staggering quantity of data. This partly reflects the way much biological data are now collected. Genome projects popularised the concept of high-throughput, highly automated biological data factories, in which data are systematically collected with the express purpose of facilitating as-yet-unknown downstream applications. As a result, the value of such data is realised only when they are made accessible to the research community as a whole.

The growth in the size of Genbank (Benson et al., 2002), the DNA and protein sequence repository jointly maintained by the National Center for Biotechnology Information (NCBI), the European Molecular Biology Laboratory (EMBL) and the DNA Databank of Japan (DDBJ), is legendary. Genbank contained 14.4 billion base pairs by the end of 2001, 200 times the number of base pairs in the database just 10 years earlier. In step with the growth in sequence data, a wide variety of different types of data have become available. These run the gamut from raw sequence data to highly derived computational predictions of protein structure and biomolecular interactions.
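The two figures quoted above are enough for a rough back-of-the-envelope calculation. The sketch below assumes steady exponential growth over the ten-year period, which is a simplification introduced here rather than a claim made by the source; the function and variable names are purely illustrative.

```python
import math

# Figures quoted in the text above.
BASE_PAIRS_END_2001 = 14.4e9   # base pairs held at the end of 2001
FOLD_INCREASE = 200            # fold growth over the preceding period
YEARS = 10                     # length of that period in years


def annual_growth_factor(fold: float, years: float) -> float:
    """Constant yearly multiplier that yields `fold` growth over `years`."""
    return fold ** (1.0 / years)


def doubling_time(fold: float, years: float) -> float:
    """Years needed to double in size, ASSUMING steady exponential growth."""
    return years * math.log(2) / math.log(fold)


# Size implied for ten years earlier, then the derived growth statistics.
size_ten_years_earlier = BASE_PAIRS_END_2001 / FOLD_INCREASE
growth = annual_growth_factor(FOLD_INCREASE, YEARS)
doubling = doubling_time(FOLD_INCREASE, YEARS)

if __name__ == "__main__":
    print(f"Implied size ten years earlier: {size_ten_years_earlier:.3g} bp")
    print(f"Implied annual growth factor:   {growth:.2f}")
    print(f"Implied doubling time:          {doubling:.2f} years")
```

Under that assumption, a 200-fold increase over 10 years corresponds to the database growing by a factor of about 1.7 each year, or doubling roughly every 16 months.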

Unlike Genbank, which archives sequence data from all organisms, many database resources are organism-specific. A variety of crop-specific and model-plant-specific genomic databases are accessible through UKCropNet. These include GrainGenes (which holds molecular and phenotypic information on wheat, barley, oats, rye and sugarcane) and MaizeDB (which performs a similar service for maize). Some databases cover somewhat larger taxonomic assemblages. For example, the Gramene database is a recent effort that aims to integrate genomic information from all grasses, using the rice genomic sequence as a focal point (Ware et al., 2002).

It can be helpful to recognise a distinction between primary data repositories, on the one hand, and derivative databases that offer a regularly updated analysis of data from primary repositories, on the other. Genbank is an example of a primary repository. Pfam, a protein sequence signature database, is an example of a derived one. Derived databases in plant genomics frequently include only those plant systems with the most abundant data. One example is the set of Gene Indices at The Institute for Genomic Research (TIGR), a collection of very focussed databases, each covering a different plant, animal, protist or fungal species (Quackenbush et al., 2001). Each Gene Index computationally assembles the non-redundant set of gene sequences for that organism, with links to expression, homology and other information. Only those plants for which sufficient sequence data are publicly available are included: 14 species at the time of writing. Because it was the first plant nuclear genome to be sequenced in its entirety, Arabidopsis thaliana is sometimes the sole plant representative in other genomic databases. An example is MODBASE, which contains homology-modelled protein structures based on predicted amino acid sequences from a variety of completed genomes.

Plant biologists are, of course, also interested in plant symbionts and disease-causing organisms. A number of plant pathogenic bacteria and fungi either have been sequenced in their entirety, including Agrobacterium tumefaciens (Goodner et al., 2001), Ralstonia solanacearum (Salanoubat et al., 2002) and Xylella fastidiosa (Simpson et al., 2000), or are the subject of ongoing sequencing projects, such as Magnaporthe grisea (Zhu et al., 1997), Pseudomonas syringae pv. tomato and Xanthomonas campestris. Completed sequence is also available for the legume nodule-associated mutualist Sinorhizobium meliloti (Capela et al., 2001). In addition, a variety of plant viral genomes have been deposited in Genbank. The Genomes OnLine Database (GOLD) is a regularly updated online listing of prokaryotic and eukaryotic genome projects that have been completed or are under way. TIGR offers what it calls the Comprehensive Microbial Resource database, which allows exploration and comparison of the annotated microbial sequences. Unfortunately, genomic information for metazoan plant symbionts, such as pathogenic nematodes and insect herbivores, is much less abundant and is likely to remain so for some time.

An excellent guide to the world of genomic databases is the annual database issue of the journal Nucleic Acids Research, published on the 1st of January each year (www3.oup.co.uk/nar/database/c/). In addition to written descriptions of dozens of different databases, a list of links to hundreds of databases, organised by category, is maintained online. Publications describing online databases quickly become obsolete as new databases spring up and old ones change, and no list (online or otherwise) could hope to be comprehensive, but this is a good place to start. Website addresses (URLs) for databases and resources discussed in this chapter are provided in Table 12.1, while major web jump stations for genomics and bioinformatics are given in Table 12.2.

The Growing Role of Standards

The meanings of biological terms are often slippery and operational. For instance, ‘gene function’ can easily mean different things to different practitioners. Although it may be preferable, in some cases, to allow for ambiguity rather than force misguided precision, computers are not at all adept at handling ambiguity. Thus, much effort has been expended on adopting standardised terminologies, with clear relationships defined among the terms. Such language standards are referred to as controlled vocabularies, or ontologies. Ontologies provide transparency of meaning to users and greatly facilitate communication among databases.

One of the oldest systematic attempts to standardise plant gene nomenclature is the Mendel Plant Gene Names Database and its derivatives, which provide a useful categorisation of known plant genes and their sequences (Lonsdale et al., 2001; Price et al., 2001). The Enzyme Commission Database, which is taxonomically broader, offers a heavily used classification system that organises enzymes hierarchically by function. An even more ambitious effort is that of the Gene Ontology (GO) Consortium, which works to produce a dynamic controlled vocabulary, valid across all organisms, that can accommodate accumulating and changing knowledge of gene function (The Gene Ontology Consortium, 2001). GO recognises three independent ontologies for genes and gene products:

Molecular function, which is specific to an individual gene product (e.g. DNA helicase)
Biological process, which is coordinated by multiple products (e.g. mitosis)
Cellular component, which describes the physical localisation of a gene product (e.g. nucleus)
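To make the three-way split concrete, the following minimal sketch shows one way annotations for a gene product could be grouped by ontology. It is an illustration only and does not reflect the GO Consortium's actual file formats or software: the class names, the `terms_by_ontology` helper and the gene product `exampleHelicase1` are hypothetical, and the terms are simply the examples given in the list above.

```python
from dataclasses import dataclass
from enum import Enum


class Ontology(Enum):
    """The three independent GO ontologies named in the text."""
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"


@dataclass(frozen=True)
class Annotation:
    """One controlled-vocabulary term attached to one gene product."""
    gene_product: str
    ontology: Ontology
    term: str


def terms_by_ontology(annotations, gene_product):
    """Group the terms for one gene product under each of the three ontologies."""
    grouped = {ontology: [] for ontology in Ontology}
    for annotation in annotations:
        if annotation.gene_product == gene_product:
            grouped[annotation.ontology].append(annotation.term)
    return grouped


# HYPOTHETICAL gene product, annotated with the example terms from the text.
EXAMPLE_ANNOTATIONS = [
    Annotation("exampleHelicase1", Ontology.MOLECULAR_FUNCTION, "DNA helicase"),
    Annotation("exampleHelicase1", Ontology.BIOLOGICAL_PROCESS, "mitosis"),
    Annotation("exampleHelicase1", Ontology.CELLULAR_COMPONENT, "nucleus"),
]

if __name__ == "__main__":
    grouped = terms_by_ontology(EXAMPLE_ANNOTATIONS, "exampleHelicase1")
    for ontology, terms in grouped.items():
        print(f"{ontology.value}: {', '.join(terms)}")
```

Because the ontologies are independent, a single gene product can carry terms from all three at once, and a query against one ontology does not constrain the others.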

Controlled vocabularies are not restricted to gene or protein function. A number of plant databases, including TAIR (The Arabidopsis Information Resource), Gramene and MaizeDB, are collaborating to provide a controlled vocabulary for plant-specific terms in areas such as anatomy, morphology and development (The Plant Ontology Consortium, in press).

In addition to controlled vocabularies, there is an important role for standards that define the salient features of particular kinds of data. For example, a group has been working to develop a standard for the minimum information about microarray experiments (MIAME). The diversity of experimental and analytical approaches to microarray expression data could potentially be a major barrier to the verification and integration of such data by the research community as a whole. MIAME is a set of evolving guidelines designed to ‘facilitate the establishment of databases and public repositories and enable the development of data analysis tools’ (Brazma et al., 2001).

Each of these approaches to facilitating transparent communication among multiple users and databases has slightly different goals and guiding philosophies. Some of the earliest and most successful initiatives to date in this area have tackled the practical, and limited, goal of establishing concrete relationships among the entities in a small number of related databases. The InterPro database, for example, provides a single point of entry for searching a large number of different protein signature (motif and domain) databases, including PROSITE, PRINTS, ProDom, Pfam, SMART and TIGRFams (Apweiler et al., 2001).
