Searching through Databases

The first success story

Simian sarcoma virus is a retrovirus that was known to cause cancer in specific
species of monkeys. The oncogene responsible, v-sis, was isolated and
sequenced in 1983. About the same time, a partial amino acid sequence of
an important growth factor, the platelet-derived growth factor (PDGF) was
determined and published. R.F. Doolittle was keeping a home-grown database
of published amino acid sequences (entered by hand, partly with the
aid of family-supplied labor). He had previously entered the translated
amino acid sequence of the v-sis oncogene and when the PDGF sequence
became available he compared it to the sequences in his home-made
database. Surprisingly, he found one region of 31 amino acid residues with
26 exact matches between the PDGF sequence and the v-sis protein sequence.
In another region of 39 residues, he found 35 exact matches. This
first-established connection between an oncogene and a normal protein
affected the way oncogenesis has been seen and understood since then. Many
additional oncogenes have now been shown to be highly similar to genes
that encode growth-regulating proteins in normal cells. The theory is that
a previously harmless virus becomes oncogenic by incorporating the
proto-oncogene of its host into its own genome. In the viral genome, the
proto-oncogene is mutated, or moved to a strong enhancer so that an
excessive amount of proto-oncogene product is produced when the virus
infects a normal cell.

Another nice example of how the
combination of database searches and experimental work in molecular
biology creates interesting discoveries was described in the New York
Times in 1995. Multiple sclerosis (MS) is a debilitating neurological
disease that is not well understood. However, it is understood that MS is
an autoimmune disease, meaning that the immune system incorrectly
identifies native cells as foreign invaders. In MS, the myelin sheath
encasing nerve cells is attacked by the immune system, disrupting the
normal transmission of signals along the nerve. The first line of attack
in the immune system is the T-cells, which identify foreign targets. Once
identified, these targets are attacked and destroyed by other elements of
the immune system. The body develops specific T-cells in reaction to
exposure to different foreign antigens. Specific T-cells were found that
identify proteins or protein segments that appear on the surface of myelin
cells. It was then conjectured that those T-cells had previously been
generated by the immune system to (correctly) identify highly similar
proteins on the surface of bacteria or viruses. In other words, the immune
system attacks the myelin sheath because it mistakes certain proteins on
its surface for proteins on the outer surface of certain bacteria that had
previously infected the individual. But how could this be tested? Which
bacteria and which viruses were involved?

Using the sequences of myelin surface
proteins, a search was conducted in the protein databases for highly
similar proteins in bacteria and viruses. About one hundred proteins were
found. Laboratory work then verified that the specific T-cells that attack
the myelin sheath also attack particular proteins found by the database
search. This combined database/laboratory approach not only confirmed the
general conjecture, but identified the particular bacterial and viral
proteins that are confused with proteins on the myelin surface. The hope
is now that by examining the similarities among those bacterial and viral
protein sequences (an example of multiple sequence comparison), one might
better understand what features of the myelin surface proteins are used by
the T-cells to mistakenly identify myelin cells as foreign.

Over the past few years, our ability
to extract information from protein and DNA databases has improved
dramatically; computers are faster and comparison algorithms are more
effective, in large part because of the incorporation of statistics for
local similarity scores in both heuristic and rigorous sequence
comparison programs. Here, we discuss programs and search strategies for
identifying distantly related protein sequences.

Because of the high rate of data
production and the need for researchers to have rapid access to new data,
public databases have become the major medium through which genome
sequence data are published. Public databases and the data services that
support them are important resources in bioinformatics, and will soon be
essential sources of information for all the molecular biosciences.
However, successful public data services suffer from continually
escalating demands from the biological community. Waterman describes the
current situation in the following way:

It is probably important to
realize from the very beginning that the databases will never completely
satisfy a very large percentage of the user community. The range of
interest within biology itself suggests the difficulty of constructing a
database that will satisfy all the potential demands on it. There is
virtually no end to the depth and breadth of desirable information of
interest and use to the biological community.

EMBL and GenBank are the two major
nucleotide databases. EMBL is the European database and GenBank is its
American counterpart. The two collaborate and synchronize their contents so
that both databases contain the same information. The rate of growth
of DNA databases has been following an exponential trend, with a doubling
time now estimated to be 9-12 months. In January 1998, EMBL contained more
than a million entries, representing more than 15 500 species, although
most data are from model organisms such as Saccharomyces cerevisiae, Homo
sapiens, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana.
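
To make the doubling-time figure concrete, the sketch below projects how a database grows under a constant doubling time. It is a minimal illustration that assumes purely exponential growth; the function name `projected_entries` and the 36-month horizon are choices made for this example, and the starting count of one million is the rounded January 1998 EMBL figure quoted above.

```python
def projected_entries(initial_entries: float, months: float, doubling_months: float) -> float:
    """Entries after `months`, assuming the count doubles every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return initial_entries * 2 ** (months / doubling_months)


if __name__ == "__main__":
    start = 1_000_000  # rounded EMBL entry count, January 1998
    horizon = 36       # months; an arbitrary horizon chosen for illustration
    for doubling in (9, 12):  # the estimated range of doubling times
        factor = projected_entries(1, horizon, doubling)
        total = projected_entries(start, horizon, doubling)
        print(f"doubling every {doubling} months: x{factor:.0f} in 3 years, "
              f"about {total:,.0f} entries")
```

Under these assumptions, three years of growth multiplies the number of entries by a factor of 8 (12-month doubling) to 16 (9-month doubling).
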
These databases are updated on a daily basis, but still you may find that
a sequence referred to in the latest issue of a journal is not accessible.
This is most often because the release date of the entry did
not coincide with the publication date, or because the authors forgot to
tell the databases that the sequences have been published. If you find
such a case, please report it to EMBL and/or GenBank.

Below is an incomplete list of general and more specialized
databases:

GENERAL DATABASES

GenBank - http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html
DNA and RNA sequences, National Center for Biotechnology Information, USA

The principal requirements on the public data services
are:

Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.

Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.

Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.

Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.

Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.

Access to bioinformation

Biological databases are built by
different teams, in different locations, for different purposes, and using
different data models and supporting database-management systems. However,
biological databases are more valuable when interconnected than when
isolated. One approach to database integration is construction of either a
data warehouse or a database containing a combination of datasets from a
variety of primary databases. Annotations and connections may be added,
either automatically (algorithmically) or manually by experts (curators).
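
As a rough illustration of the warehouse idea, the sketch below merges records from two primary sources into one integrated entry per gene and adds cross-references automatically. Everything in it is hypothetical: the record fields, the accession numbers, the gene names, and the `build_warehouse` helper are invented for this example and do not correspond to any real database schema.

```python
from collections import defaultdict

# Hypothetical records from two primary databases (invented for illustration).
nucleotide_records = [
    {"accession": "NUC0001", "gene": "geneA", "organism": "organism A"},
    {"accession": "NUC0002", "gene": "geneB", "organism": "organism A"},
]
protein_records = [
    {"accession": "PRT0001", "gene": "geneA", "function": "sigma factor"},
]


def build_warehouse(nucleotide, protein):
    """Combine both sources into one entry per gene, with cross-references."""
    warehouse = defaultdict(lambda: {"xrefs": [], "annotations": {}})
    for source_name, records in (("nucleotide", nucleotide), ("protein", protein)):
        for record in records:
            entry = warehouse[record["gene"]]
            # Automatic (algorithmic) connection back to the primary record.
            entry["xrefs"].append((source_name, record["accession"]))
            # Carry over every remaining field as an annotation.
            for key, value in record.items():
                if key not in ("accession", "gene"):
                    entry["annotations"][key] = value
    return dict(warehouse)


if __name__ == "__main__":
    merged = build_warehouse(nucleotide_records, protein_records)
    for gene, entry in sorted(merged.items()):
        print(gene, entry["xrefs"], entry["annotations"])
```

A curator's manual annotations could be added to the same `annotations` dictionary; the point of the sketch is only that the integrated entry keeps links back to each primary source.
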
Some examples of integrated bioinformation resources are:

SRS (Sequence Retrieval System) - http://srs.ebi.ac.uk

The alternative approach to physical
integration is for data sources to remain distributed at multiple
geographic sites. The databases can be queried via a network such as the
Internet. Ideally, it should be possible to send queries to each
of the relevant remote databases and the retrieved data should be
integrated into a coherent report for the user. The software modules that
perform these functions have been termed mediators.

The limitation of hypertext
navigation is the difficulty of performing complex queries. A complex query
selects and combines large amounts of information; it uses complex logic
and is processed automatically by a program. Complex queries can be
supported by data warehouses, which can take only minutes to answer
questions that would take days to answer using the manual hypertext
navigation approach. One example of a complex query is: Find examples of
tightly clustered genes that code for enzymes in a single metabolic
pathway. The steps involved in processing this query include enumerating a
list of pathways, finding all enzymes that catalyze the reactions in those pathways, and
finding all genes that encode those enzymes. Further research and
development is needed to provide standard query languages and data
formats.

Searching for sequence similarities

For effective sequence identification,
one should first search protein sequence databases, not DNA sequence
databases. Protein sequence comparisons routinely identify sequences that
shared a common ancestor more than 1 billion years ago. In contrast, it is
often difficult to detect homology in noncoding DNA sequences that
diverged 200 million years ago. Even for protein coding DNA it is rare to
detect significant DNA sequence similarity for sequences that diverged
more than 600 million years ago, whereas significant similarities can
sometimes be detected between protein sequences that diverged more than
2.5 billion years ago. Differences in the performance of sequence
comparison algorithms, scoring matrices, or gap penalties are
insignificant compared to the loss of information in DNA sequence
comparison. Thus, if the biological sequence of interest encodes a
protein, protein sequence comparisons should be done.

BLAST and FASTA are two of the most popular programs for identifying sequences that are homologous.

BLAST = Basic Local Alignment Search Tool

Virtually all sequence similarity
searching today is done with algorithms that calculate a local similarity
score. Such a score identifies the most similar regions shared by the two
proteins without requiring that the similarity extends to the ends of the
sequences. Methods that calculate local sequence similarity scores are
very useful because they can detect homologous protein domains that are
embedded in different sequence environments and because they can be used
with partial sequences. For example, the sequence pair ASCDEFG/ATCEEFG in
the alignment shown below has an optimal local similarity because an
extension in either direction would reduce the similarity score of the two
sequences. A global alignment score derived from these sequences would
require that the alignment begin with the first V-Q pair and end at the
last Y-L.

    VVVVVASCDEFGYYYYY
    QQQQQATCEEFGLLLLL
    -----*-*-***-----

The BLAST package includes a suite of programs with slightly different applications. These were the first rapid sequence comparison programs to incorporate estimates of statistical significance based on an analytical theory for the statistics of similarity scores. Different versions of the program enable different search methods.

BLASTP - compares an amino acid
query sequence against a protein sequence database.

The BLAST program searches for similar
segments between the query sequence and the database sequence and then
evaluates the statistical significance of any matches. It reports only
those matches that satisfy a user-selectable threshold of significance
(output).

First, the BLAST programs identify
pairwise segments that have similar words. BLAST will split an amino acid
query sequence into short words, or N-mers (normally 3-5 aa long). A
nucleotide query sequence will also be split into N-mers
(normally 12 bp long). The database will be scanned for occurrences of
N-mer look-alikes and all reported pairs above a threshold score will be
remembered. After the scanning step, identified regions will be extended
and if the extended region scores above a certain threshold (HSP = high-scoring
segment pair) it will be remembered. The highest scoring of all
possible segment pairs that can be produced from the two sequences (MSP =
maximal scoring segment pair) will be reported. If the score is below a
certain threshold it will not appear in the output file, since it is
thought to be insignificant.

In addition to comparison algorithms, different scoring matrices can be used. Modern scoring matrices have been derived using two different approaches.

PAM - Point Accepted Mutations

The method which was used to develop the BLOSUM series of matrices is based on observed rather than extrapolated transition frequencies. These matrices are derived from blocks of conserved residues that are at least 45% (BLOSUM45), 50% (BLOSUM50), or 62% (BLOSUM62) identical. While it is often stated that differences in scoring matrices can cause large changes in sequence alignments, similarity scores for clearly related sequences are usually not very sensitive to changes in scoring matrix.

How to interpret the result of your search?

Ideally, one would like to have a firm
cut-off level for deciding whether or not to accept a hit. Unfortunately, the situation is
not so simple, reflecting the complexity of biology. As a very approximate
guideline, the P(N) value should be far below 10^-2 for a homologous pair
of sequences. However, the interpretation of the result will depend on a
number of parameters and the user should try to use all available
information in order to judge whether a hit against a sequence in the
public database is likely to represent a true similarity or was the result
of pure chance.

Important factors to keep in mind are
the following:

Overall alignment - Although current versions of
BLASTP are very effective at identifying distantly related sequences,
BLASTP does not produce biologically meaningful alignments because it does
not allow gaps. Distant sequence relationships (<30% identity)
typically extend over entire protein sequences or long protein domains and
require gaps to include the entire homologous regions. Because of its
restriction on gaps, BLASTP may break up long homologous domains into
several HSPs without gaps, which, when combined have significant
similarity. Thus, although BLASTP is effective at identifying distant
relationships, other alignment methods should be used when BLASTP matches
are analyzed and displayed. Homologous sequences are usually similar over
an entire sequence or domain, typically sharing 20-25% or greater identity
for more than 200 residues. Matches that are more than 50% identical in a
20- to 40-amino acid region occur frequently by chance and do not indicate
homology.

Gene duplications - Numerous gene duplications
have occurred throughout evolution. Take for example a gene which codes for
a sigma factor. There is only one such gene (rpoX) in organism A, but five
genes (rpoX1-rpoX5) coding for sigma factors in organism B as a result of
several recent gene duplication events. The five sigma factors in organism
B have been shown experimentally to recognize a variety of different
promoters. A similarity search using rpoX in organism A produces hits of
similar significance to all five sigma factor genes in organism B. In
this case, it is not possible to identify the one gene in organism B which
is the ortholog of rpoX in organism A. It would be better to
simply identify the gene as a sigma factor and postpone a more detailed
description until experimental results are available.

The distance between species - The similarity score will depend
on the divergence time of the two organisms being compared. A comparison
of a bacterial gene to a human homolog will necessarily result in a lower
score than to the corresponding homologs in other bacterial species. Thus,
it is advisable to take into account the evolutionary distance between the
two species when evaluating the results of the similarity
search.

The degree of protein conservation - The similarity score will
also depend on the degree of conservation of the proteins being compared.
A comparison of highly conserved proteins will necessarily be associated
with a higher score than a comparison of poorly conserved proteins from the
same species pair.

Part of a metabolic pathway - Finally, use all of your
biological expertise about the organism from which your sequence was
derived. Does the best hit represent a gene that is likely to be present
in the organism? Is it part of a metabolic pathway that is known from
experimental data to be present in this organism? Have other genes in this
pathway been sequenced? (see the section on metabolic
reconstructions).

Large-scale Sequencing Searches

Computational biologists and genomics
researchers, as well as molecular biologists involved with cloning and
sequencing genes, are all confronted with a worsening situation when
interacting with sequence databases. The main difficulties arise from (a)
the decreasing quality (both in terms of errors and redundancy) of the
data banks and (b) their burgeoning sizes.

The absolute size of the databases has made them impossible to work with or search in a reasonable amount of time, except at a few leading centers in the world. While useful, the E-mail or World Wide Web (WWW) services offered to the public consist of only a selected subset of the fastest algorithms (to be used within a narrow range of options). They also limit the submission of queries from each client to a small, reasonable number. Scientific assessments involving very large-scale comparisons (such as entire databases versus themselves), exotic algorithms, or unusual parameter settings are not possible in this context. In addition, the lack of confidentiality of this mode of operation can be worrisome to some laboratories, and is definitely not acceptable to the private biotechnology industry.

Two concepts, sequence masking and distributed processing, are keys to local (and secure) implementation of effective and flexible large-scale sequence comparison.

Concepts of sequence masking

However, any attempt to compare a
human genomic sequence with EST data quickly reveals that this promising
method, in its simplest form, does not provide meaningful results. To give
an example: the 67-kb sequence from the p22.3 region of chromosome X
contains a total of 1343 distinct putative peptide-encoding sequences of
which at most 10 are expected to be real. The 1343 ORFs were compared
against the public EST data bank (dbEST) which contains approximately
150,000 ESTs of human origin, using the programs BLASTN and TBLASTN. As
expected the fraction of ORFs matching at least one EST rapidly increased
as the required minimal score decreased. For BLASTN scores at or below 100
or TBLASTN scores up to 50, almost all ORFs were found to have a match.
This is in agreement with statistical theory, which predicts a very high
probability of random matches at such low scores.
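
The relationship between score threshold and chance matches can be sketched with the Karlin-Altschul formula for local alignment statistics, E = K * m * n * exp(-lambda * S), where E is the expected number of chance matches scoring at least S when a query of length m is searched against a database of total length n. The sketch below is illustrative only: K and lambda depend on the scoring scheme, and the values used here, like the query and database lengths, are placeholders assumed for this example, so the printed numbers do not reproduce the BLASTN or TBLASTN scores quoted above.

```python
import math


def expected_chance_hits(score, query_len, db_len, k, lam):
    """Karlin-Altschul expectation: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * math.exp(-lam * score)


def chance_probability(e_value):
    """Probability of at least one chance hit: P = 1 - exp(-E)."""
    return -math.expm1(-e_value)


if __name__ == "__main__":
    # Placeholder parameters, assumed for illustration only.
    K, LAMBDA = 0.1, 0.3
    QUERY_LEN, DB_LEN = 300, 50_000_000
    for score in (30, 60, 90, 120):
        e = expected_chance_hits(score, QUERY_LEN, DB_LEN, K, LAMBDA)
        print(f"S >= {score:3d}: E = {e:.3g}, P = {chance_probability(e):.3g}")
```

Because E falls exponentially with S, lowering the minimal score quickly drives the expected number of chance matches above one, at which point nearly every query is expected to find a match.
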
However, as larger minimal scores are imposed, thereby requiring more stringent
local similarity, the number of matching ORFs remains much higher than
expected. For instance, more than 350 candidate exons were identified by
their match to a human EST with BLASTN using a minimal score of 200. Such
an extremely high fraction of false-positive identifications makes the
direct EST lookup method of exon identification totally impractical until
the nature of the problem is understood.

The model fails because it assumes a
randomness in sequence data that is not valid. The problem is that, besides
a small fraction of well-behaved regions, actual sequences, both genomic
and EST, are made up of many different repeats. Repetitive DNA
represents over 50% of the human genome. It has been classified in various
categories such as retroposons, satellites etc. These repeats, present
both in the target EST data bank and in most of the putative ORF queries,
dramatically increase the chance of a fortuitous match. Eventually, this
noise can obscure the few alignments with biological
significance.

As a general solution to this problem,
and as a prerequisite to all large-scale sequence comparisons, the concept
of sequence masking has been introduced. It simply consists of delineating
the various types of repeats and other a priori troublesome segments by ad
hoc programs, and then replacing the corresponding positions with a
special character neutral to the specific scoring scheme (usually X for
proteins and N for DNA sequences). Masking the most frequent repeats (Alu
and simple sequences) can have drastic effects on the distribution of the
number of hits for a given minimal score. In the example
given above, masking reduced the number of hits at a minimal score of 200
to fewer than 50, compared with more than 300 without masking.

Thus, database searching is emerging as
a powerful tool for proposing putative gene functions. Once the remaining
problems and limitations in technology are overcome, the wealth of
experimental data will yield far greater insights. Each molecular biology
database is a lens that can either magnify - or cloud - our view of
experimental data. Extracted from: