Though all biologists deal with information, only
recently have the computational challenges of systematically
collecting, storing, organising, manipulating, visualising and
analysing large amounts of biological information come to be widely
appreciated. The cause of this is the explosive growth of genomics. The
term bioinformatics was originally coined for
the application of information technology to large volumes of
biological, and particularly genomic, data. The field of bioinformatics
has come to be intermingled with traditional computational biology and
biostatistics, which are strictly concerned not with how to handle the
information itself, but rather with how to extract biological meaning
from it. Thus, bioinformatics, in its broad sense, can be seen as
providing both the infrastructure and the scientific framework in which
biologists take information and use computers to help convert it into
knowledge.
Despite the relative youth of the field as a recognised
discipline, there is an impressive diversity of bioinformatics
resources currently available. By necessity, we focus on only a small
slice of this diversity here. We pay particular attention to sequence
analysis because of its centrality to genomics. We also do not attempt
to provide specific protocols, as the specific needs of users vary
greatly. The resources we describe range widely in sophistication,
from little-tested programs posted on graduate student web pages to
very stable and complex databases maintained by governmental agencies.
The better ones typically provide manuals and tutorials, often
containing descriptions of the underlying principles. The reader is
strongly advised to consult the documentation available for each tool.
Though a wide array of commercial resources exists, some
of which are ideally suited to specific tasks, many of the most
fundamental and long-lived bioinformatics tools are freely available.
For this reason, we describe primarily non-commercial software in this
chapter. Many of the databases and analysis tools we describe are
hosted by government or academic research centres and can be accessed
via user-friendly web interfaces.
Collectively, online databases allow access to a
staggering quantity of data. This partly reflects the way in which
biological data are now collected. Genome projects popularised the
concept of high-throughput, highly automated biological data factories,
in which data are systematically collected with the express purpose of
facilitating as-yet-unknown downstream applications. As a result, the
value of such data is only realised when they are made accessible to the
research community as a whole.
The growth in the size of Genbank (Benson et al., 2002),
the DNA and protein sequence repository jointly maintained by the
National Center for Biotechnology Information (NCBI), the European
Molecular Biology Laboratory (EMBL) and the DNA Databank of Japan
(DDBJ), is legendary. Genbank contained 14.4 billion base pairs by the
end of 2001, 200 times the number of base pairs in the database just 10
years earlier. In step with the growth in sequence data, a wide variety
of different types of data have become available. These run the gamut
from raw sequence data to highly derived computational predictions of
protein structure and biomolecular interactions.
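The growth figures quoted above imply a doubling time that is easy to work out. The following is a minimal sketch in Python, assuming (purely for illustration) that growth was steadily exponential across the ten years; the function name `doubling_time` is ours and is not part of any Genbank tool.

```python
import math

def doubling_time(fold_increase, years):
    """Years needed to double, assuming constant exponential growth."""
    return years * math.log(2) / math.log(fold_increase)

# Figures quoted in the text: 14.4 billion base pairs at the end of 2001,
# a 200-fold increase over the preceding 10 years.
size_2001 = 14.4e9
fold = 200
span_years = 10

size_1991 = size_2001 / fold                 # implied size ten years earlier
t_double = doubling_time(fold, span_years)   # in years

print(f"Implied size ten years earlier: {size_1991 / 1e6:.0f} million bp")
print(f"Implied doubling time: {t_double * 12:.1f} months")
```

Under that assumption, the database would have held about 72 million base pairs ten years earlier and would have doubled in size roughly every 16 months.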
Unlike Genbank, which archives sequence data from all
organisms, many database resources are organism specific. A variety of
crop and model-plant specific genomic databases are accessible through
UKCropNet. These include GrainGenes (which holds molecular and
phenotypic information on wheat, barley, oats, rye and sugarcane) and
MaizeDB (which performs a similar service for maize). Some databases
are specific to somewhat larger taxonomic assemblages. For example, the
Gramene database is a recent effort that aims to integrate genomic
information from among all grasses using the rice genomic sequence as a
focal point (Ware et al., 2002).
It can be helpful to recognise a distinction between
primary data repositories, on the one hand, and derivative databases
that offer a regularly updated analysis of data from primary
repositories, on the other. Genbank is an example of a primary
repository. Pfam, a protein sequence signature database, is an example
of one that is derived. Derived databases in plant genomics frequently
include only those plant systems with the most abundant data. One
example is the set of Gene Indices at The Institute for Genomic
Research (TIGR), which is a collection of very focussed databases, each
covering a different plant, animal, protist or fungal species
(Quackenbush et al., 2001).
Each Gene Index computationally assembles the non-redundant set of gene
sequences for that organism, with links to expression, homology and
other information. Plants for which sufficient publicly available
sequence data exist are included, 14 species
at the time of writing. Because it was the first plant nuclear genome
to be sequenced in its entirety, Arabidopsis thaliana
is sometimes the sole plant representative in other genomic databases.
An example of this is MODBASE, which contains protein structures
homology-modelled from the predicted amino acid sequences of a variety of
completed genomes.
Plant biologists are, of course, also interested in
plant symbionts and disease-causing organisms. A number of plant
pathogenic bacteria and fungi either have been sequenced in their
entirety, including Agrobacterium tumefaciens (Goodner et al., 2001), Ralstonia solanacearum (Salanoubat et al., 2002) and Xylella fastidiosa (Simpson et al., 2000), or are the subject of ongoing sequencing projects, such as Magnaporthe grisea (Zhu et al., 1997), Pseudomonas syringae pv. tomato and Xanthomonas campestris. Completed sequence is also available for the legume nodule-associated mutualist Sinorhizobium meliloti (Capela et al., 2001).
In addition, a variety of plant viral genomes have been deposited in
Genbank. The Genomes OnLine Database (GOLD) is a regularly updated
on-line listing of prokaryotic and eukaryotic genome projects that have
been completed or that are under way. TIGR offers what it calls the
Comprehensive Microbial Resource database, which allows exploration and
comparison of the annotated microbial sequences. Unfortunately, genomic
information for metazoan plant symbionts, such as pathogenic nematodes
and insect herbivores, is much less abundant and likely to remain that
way for some time.
An excellent guide to the world of genomic databases is the annual database issue of the journal Nucleic Acids Research, published on 1 January each year (www3.oup.co.uk/nar/database/c/).
In addition to written descriptions of dozens of different databases, a
list of links to hundreds of databases, organised by category, is
maintained online. Publications describing online databases quickly
become obsolete as new databases spring up and old ones change, and no
list (online or otherwise) could hope to be comprehensive, but this is
a good place to start. Website addresses (URLs) for databases and
resources discussed in this chapter are provided in Table 12.1, while major web jump stations for genomics and bioinformatics are given in Table 12.2.
The Growing Role of Standards
The meanings of biological terms are often slippery and
operational. For instance, ‘gene function’ can easily mean different
things to different practitioners. Although it may be preferable, in
some cases, to allow for ambiguity rather than force misguided
precision, computers are not at all adept at handling ambiguity. Thus,
there has been much effort expended in adopting standardised
terminologies, with clear relationships defined among the terms. Such
language standards are referred to as controlled vocabularies, or
ontologies. Ontologies provide transparency of meaning to users and
greatly facilitate inter-communication among databases.
One of the oldest systematic attempts to standardise
plant gene nomenclature is the Mendel Plant Gene Names Database and its
derivatives, which provide a useful categorisation of known plant genes
and their sequences (Lonsdale et al., 2001; Price et al., 2001).
The Enzyme Commission Database, which is taxonomically broader, offers
a heavily used classification system that organises enzymes
hierarchically by function. An even more ambitious effort is that of
the Gene Ontology (GO) Consortium, which works to produce a dynamic
controlled vocabulary, valid across all organisms, that can accommodate
accumulating and changing knowledge of gene function (The Gene Ontology
Consortium, 2001). GO recognises three independent ontologies for genes and gene products:
- Molecular function, which is specific to an individual gene product (e.g. DNA helicase)
- Biological process, which is coordinated by multiple products (e.g. mitosis)
- Cellular component, which describes the physical localisation of a gene product (e.g. nucleus)
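The three ontologies can be pictured as separate namespaces from which a gene product receives independent annotations. The sketch below, in Python, is an illustration only: the class names, the `annotate` and `terms_in` helpers and the gene product label are hypothetical, and the term names are simply the examples given in the list above, not identifiers taken from the GO database.

```python
from dataclasses import dataclass, field
from enum import Enum

class Ontology(Enum):
    MOLECULAR_FUNCTION = "molecular function"
    BIOLOGICAL_PROCESS = "biological process"
    CELLULAR_COMPONENT = "cellular component"

@dataclass(frozen=True)
class Term:
    """A vocabulary term that belongs to exactly one ontology."""
    ontology: Ontology
    name: str

@dataclass
class GeneProduct:
    """A gene product carrying annotations from any of the ontologies."""
    label: str
    annotations: set = field(default_factory=set)

    def annotate(self, term):
        self.annotations.add(term)

    def terms_in(self, ontology):
        # Sorted names of this product's terms within one ontology.
        return sorted(t.name for t in self.annotations if t.ontology is ontology)

# Hypothetical gene product annotated with the example terms from the text.
product = GeneProduct("hypothetical helicase")
product.annotate(Term(Ontology.MOLECULAR_FUNCTION, "DNA helicase"))
product.annotate(Term(Ontology.BIOLOGICAL_PROCESS, "mitosis"))
product.annotate(Term(Ontology.CELLULAR_COMPONENT, "nucleus"))

for ontology in Ontology:
    print(ontology.value, "->", product.terms_in(ontology))
```

Keeping the ontology as a field of each term means the three vocabularies stay independent: a query against one never has to consult the other two.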
In addition to controlled vocabularies, there is an
important role for standards that define the salient features of
particular kinds of data. For example, a group has been working to
develop a standard for the minimum information about microarray
experiments (MIAME). The diversity of experimental and analytical
approaches to microarray expression data could potentially be a major
barrier to the verification and integration of such data by the
research community as a whole. MIAME is a set of evolving guidelines
designed to ‘facilitate the establishment of databases and public
repositories and enable the development of data analysis tools’ (Brazma
et al., 2001).
Each of these approaches to facilitating transparent
communication among multiple users and databases has slightly different
goals and guiding philosophies. Some of the earliest and most
successful initiatives to date in this area have tackled the practical,
and limited, goal of establishing concrete relationships among the
entities in a small number of related databases. The InterPro database,
for example, provides a single point of entry for searching a large
number of different protein signature (motif and domain) databases,
including PROSITE, PRINTS, ProDom, Pfam, SMART and TIGRFams
(Apweiler et al., 2001).
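The idea of a single point of entry over several signature databases can be sketched as a simple cross-reference index. The Python below is a toy illustration under stated assumptions, not a description of how InterPro is implemented: the signature and entry identifiers are invented placeholders, and the helper names `build_index` and `lookup` are hypothetical.

```python
from collections import defaultdict

# Placeholder records: (member database, signature id, integrated entry).
# The identifiers are invented for illustration and are not real accessions.
MEMBER_SIGNATURES = [
    ("PROSITE", "sigA", "entry1"),
    ("Pfam", "sigB", "entry1"),
    ("PRINTS", "sigC", "entry2"),
    ("SMART", "sigD", "entry2"),
    ("ProDom", "sigE", "entry3"),
]

def build_index(records):
    """Group member-database signatures under the entry that unites them."""
    index = defaultdict(list)
    for database, signature, entry in records:
        index[entry].append((database, signature))
    return dict(index)

def lookup(records, database, signature):
    """Return the integrated entry for one member signature, or None."""
    for db, sig, entry in records:
        if (db, sig) == (database, signature):
            return entry
    return None

index = build_index(MEMBER_SIGNATURES)
entry = lookup(MEMBER_SIGNATURES, "Pfam", "sigB")
print(entry, index[entry])
```

Starting from a signature in any one member database, a user reaches the shared entry and, through it, the related signatures held in the others.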