
Searching through Databases

 

Table of contents

The first success story
The database industry
Access to bioinformation
Homology searches
Large-scale sequencing searches

 

The First Success Story

Database organization and searching have by now become industries. However, it was only about 20 years ago that the first successful database search was performed. Infection by certain viruses was known to cause particular cells in culture to grow without limit. This cancer-like transformation suggested that a viral infection could be a cause of cancer in animals, but the mechanisms were unknown. It was hypothesized that certain genes carried by the infecting virus (oncogenes), encoding cellular growth factors, might stimulate the growth of a cell colony. Surprisingly, the link between oncogenes and growth factors did not come from experimental work, but from merging two independent sets of data via a computer search.

Simian sarcoma virus is an RNA virus (a retrovirus) that was known to cause cancer in specific species of monkeys. The oncogene responsible, v-sis, was isolated and sequenced in 1983. At about the same time, a partial amino acid sequence of an important growth factor, platelet-derived growth factor (PDGF), was determined and published. R.F. Doolittle was keeping a home-grown database of published amino acid sequences, entered by hand with the partial help of family members. He had previously entered the translated amino acid sequence of the v-sis oncogene, and when the PDGF sequence became available he compared it with the sequences in his home-made database. Surprisingly, he found one region of 31 amino acid residues with 26 exact matches between the PDGF sequence and the v-sis protein sequence. In another region of 39 residues, he found 35 exact matches. This first established connection between an oncogene and a normal protein has shaped the way oncogenesis has been seen and understood ever since. Many additional oncogenes have now been shown to be highly similar to genes that encode growth-regulating proteins in normal cells. The theory is that a previously harmless virus becomes oncogenic by incorporating the proto-oncogene of its host into its own genome. In the viral genome, the proto-oncogene is mutated, or placed under a strong enhancer, so that an excessive amount of proto-oncogene product is produced when the virus infects a normal cell.

Another nice example of how the combination of database searches and experimental work in molecular biology leads to interesting discoveries was described in the New York Times in 1995. Multiple sclerosis (MS) is a debilitating neurological disease that is not well understood. However, it is understood that MS is an autoimmune disease, meaning that the immune system incorrectly identifies native cells as foreign invaders. In MS, the myelin sheath encasing nerve cells is attacked by the immune system, disrupting the normal transmission of signals along the nerve. The immune system's first line of attack is the T-cells, which identify foreign targets. Once a target is identified, other elements of the immune system attack and destroy it. The body develops specific T-cells in reaction to exposure to different foreign antigens.

Specific T-cells were found that identify proteins or protein segments that appear on the surface of myelin cells. It was then conjectured that those T-cells had previously been generated by the immune system to (correctly) identify highly similar proteins on the surface of bacteria or viruses. In other words, the immune system attacks the myelin sheath because it confuses certain proteins on its surface with proteins on the outer surface of certain bacteria that had previously infected the individual. But how could this be tested? Which bacteria and which viruses were involved?

Using the sequences of myelin surface proteins, a search was conducted in the protein databases for highly similar proteins in bacteria and viruses. About one hundred proteins were found. Laboratory work then verified that the specific T-cells that attack the myelin sheath also attack particular proteins found by the database search. This combined database/laboratory approach not only confirmed the general conjecture, but identified the particular bacterial and viral proteins that are confused with proteins on the myelin surface. The hope is now that by examining the similarities among those bacterial and viral protein sequences (an example of multiple sequence comparison), one might better understand what features of the myelin surface proteins are used by the T-cells to mistakenly identify myelin cells as foreign.

Over the past few years, our ability to extract information from protein and DNA databases has improved dramatically; computers are faster and comparison algorithms are more effective, in large part because of the incorporation of statistics for local similarity scores in both heuristic and rigorous sequence comparison programs. Here, we discuss programs and search strategies for identifying distantly related protein sequences.

Top

The Database Industry

Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. Public databases and the data services that support them are important resources in bioinformatics, and will soon be essential sources of information for all the molecular biosciences. However, successful public data services suffer from continually escalating demands from the biological community. Waterman describes the current situation in the following way: "It is probably important to realize from the very beginning that the databases will never completely satisfy a very large percentage of the user community. The range of interest within biology itself suggests the difficulty of constructing a database that will satisfy all the potential demands on it. There is virtually no end to the depth and breadth of desirable information of interest and use to the biological community."

EMBL and GenBank are the two major nucleotide databases. EMBL is the European collection and GenBank the American; the two collaborate and synchronize their databases so that they contain the same information. The growth of the DNA databases has been following an exponential trend, with a doubling time now estimated at 9-12 months. In January 1998, EMBL contained more than a million entries, representing more than 15,500 species, although most data come from model organisms such as Saccharomyces cerevisiae, Homo sapiens, Caenorhabditis elegans, Mus musculus and Arabidopsis thaliana. These databases are updated daily, but you may still find that a sequence referred to in the latest issue of a journal is not accessible. This is most often because the release date of the entry did not coincide with the publication date, or because the authors forgot to tell the databases that the sequences had been published. If you find such a case, please report it to EMBL and/or GenBank.

Below is an incomplete list of general and more specialized databases:

GENERAL DATABASES

GenBank - http://www.ncbi.nlm.nih.gov/Genbank/GenbankOverview.html DNA and RNA sequences, National Center for Biotechnology Information, USA
EMBL - http://www.ebi.ac.uk/embl DNA and RNA sequences; the European Molecular Biology Laboratory, Cambridge
PIR - http://pir.georgetown.edu/ Protein sequences, Protein Identification Resource, USA
SWISS-PROT - http://www.expasy.ch/sprot/sprot-top.html Protein sequences, Switzerland
NRL-3D - http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html Sequences of proteins with known three-dimensional structure, derived from PDB
OWL - http://www.biochem.ucl.ac.uk/bsm/dbbrowser/OWL Non-redundant collection of protein sequences
PROSITE - http://www.expasy.ch/prosite Protein sequence motifs
PRINTS - http://www.bioinf.man.ac.uk/dbbrowser/PRINTS Protein sequence motifs
BLOCKS - http://www.blocks.fhcrc.org/ Protein sequence motifs
SCOP - http://scop.mrc-lmb.cam.ac.uk/scop Proteins classified according to structural similarities
PDB - http://www.biochem.ucl.ac.uk/bsm/pdbsum Macromolecular structures

The principal requirements on the public data services are:

  Data quality - data quality has to be of the highest priority. However, because the data services in most cases lack access to supporting data, the quality of the data must remain the primary responsibility of the submitter.

  Supporting data - database users will need to examine the primary experimental data, either in the database itself, or by following cross-references back to network-accessible laboratory databases.

  Deep annotation - deep, consistent annotation comprising supporting and ancillary information should be attached to each basic data object in the database.

  Timeliness - the basic data should be available on an Internet-accessible server within days (or hours) of publication or submission.

  Integration - each data object in the database should be cross-referenced to representations of the same or related biological entities in other databases. Data services should provide capabilities for following these links from one database or data service to another.

Top

Access to bioinformation

Biological databases are built by different teams, in different locations, for different purposes, and using different data models and supporting database-management systems. However, biological databases are more valuable when interconnected than when isolated. One approach to database integration is the construction of a data warehouse: a database containing a combination of datasets from a variety of primary databases. Annotations and connections may be added, either automatically (algorithmically) or manually by experts (curators). Some examples of integrated bioinformation resources are:

SRS (Sequence Retrieval System) http://srs.ebi.ac.uk
Entrez Browser http://www.ncbi.nlm.nih.gov
ExPASy http://www.expasy.ch
Integrated genome database http://genome.dkfz-heidelberg.de

The popularity of these services indicates the need to query interrelated datasets rather than isolated databases. The advantages of physical integration are that queries can be executed rapidly because all data are located in one place, and that the user sees a homogeneous, integrated data source. However, integration efforts depend on making local copies of data from other databases, which is becoming increasingly difficult because of (a) a proliferation of independently administered biological databases containing relevant data; (b) the absence of a clear boundary to the set of sources that should be integrated; and (c) an accelerating rate of data production that makes manual intervention in the integration process more or less impossible.

The alternative to physical integration is for data sources to remain distributed across multiple geographic sites. The databases can then be queried over a network such as the Internet. Ideally, it should be possible to pose queries to each of the relevant remote databases and have the retrieved data integrated into a coherent report for the user. The software modules that perform these functions have been termed mediators.

The limitation of hypertext navigation is the difficulty of performing complex queries. A complex query selects and combines large amounts of information; it uses complex logic and is processed automatically by a program. Complex queries can be supported by data warehouses, which can answer in minutes questions that would take days to answer by manual hypertext navigation. One example of a complex query is: find examples of tightly clustered genes that code for enzymes in a single metabolic pathway. The steps involved in processing this query include enumerating a list of pathways, finding all enzymes that catalyze the reactions in each pathway, and finding all genes that encode those enzymes. Further research and development is needed to provide standard query languages and data formats.
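
To make the idea of a complex query concrete, here is a minimal Python sketch that chains the three steps above over hypothetical in-memory tables; the table contents, gene coordinates and the 50-kb clustering cutoff are illustrative assumptions, not data from any real warehouse.

pathways = {"glycolysis": ["hexokinase", "phosphofructokinase", "pyruvate kinase"]}
enzyme_genes = {"hexokinase": ["hxk1"], "phosphofructokinase": ["pfk1"],
                "pyruvate kinase": ["pyk1"]}
gene_positions = {"hxk1": 12_000, "pfk1": 18_500, "pyk1": 21_000}   # chromosomal coordinates (bp), assumed

MAX_SPAN = 50_000   # "tightly clustered" = all genes within 50 kb (an assumption)

for pathway, enzymes in pathways.items():
    # steps 1 and 2: pathway -> enzymes -> genes encoding them
    genes = [g for e in enzymes for g in enzyme_genes.get(e, [])]
    positions = [gene_positions[g] for g in genes if g in gene_positions]
    # step 3: test whether those genes lie close together on the chromosome
    if positions and max(positions) - min(positions) <= MAX_SPAN:
        print(pathway, "is encoded by a tightly clustered set of genes:", genes)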

Top

Homology Searches

Searching for sequence similarities
The availability of public databases also allows searches for sequences that are similar to an unknown sequence, with the purpose of identifying homologous sequences (i.e. sequences that share a common ancestor). We infer sequence homology by calculating sequence similarity. Similarity is a quantity (two sequences share 15 or 30% identity), while homology is an inference (two proteins are either homologous or they are not). In general, statistically significant similarity scores can be used to infer homology with a high level of confidence. However, the converse is not true; absence of significant similarity does not guarantee nonhomology.

For effective sequence identification, one should first search protein sequence databases, not DNA sequence databases. Protein sequence comparisons routinely identify sequences that shared a common ancestor more than 1 billion years ago. In contrast, it is often difficult to detect homology in noncoding DNA sequences that diverged 200 million years ago. Even for protein coding DNA it is rare to detect significant DNA sequence similarity for sequences that diverged more than 600 million years ago, whereas significant similarities can sometimes be detected between protein sequences that diverged more than 2.5 billion years ago. Differences in the performance of sequence comparison algorithms, scoring matrices, or gap penalties are insignificant compared to the loss of information in DNA sequence comparison. Thus, if the biological sequence of interest encodes a protein, protein sequence comparisons should be done.

Search Programs: BLAST and FASTA
BLAST and FASTA are two of the most popular programs for identifying sequences that are homologous.

BLAST = Basic Local Alignment Search Tool
FASTA = FAST homology search All sequences

Virtually all sequence similarity searching today is done with algorithms that calculate a local similarity score. Such a score identifies the most similar regions shared by the two proteins without requiring that the similarity extend to the ends of the sequences. Methods that calculate local sequence similarity scores are very useful because they can detect homologous protein domains that are embedded in different sequence environments and because they can be used with partial sequences. For example, the sequence pair ASCDEFG/ATCEEFG in the alignment shown below has an optimal local similarity score, because an extension in either direction would reduce the similarity score of the two sequences. A global alignment of these sequences, in contrast, would be required to begin with the first V-Q pair and end with the last Y-L pair.

VVVVVASCDEFGYYYYY
QQQQQATCEEFGLLLLL
-----*-*-***-----
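
As a concrete illustration of the local-versus-global distinction, the following Python sketch computes a Smith-Waterman-style local similarity score for the two sequences above. The scoring scheme (match +1, mismatch -1, gap -2) is an arbitrary assumption chosen for clarity; real search programs use substitution matrices and tuned gap penalties.

def local_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score (cell scores below zero are reset to zero)."""
    h = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = h[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            h[i][j] = max(0, diag, h[i-1][j] + gap, h[i][j-1] + gap)
            best = max(best, h[i][j])
    return best

# The full sequences and the core region give the same best local score
# (5 matches - 2 mismatches = 3), because extending the alignment into the
# V/Q and Y/L flanks only lowers the score.
print(local_score("VVVVVASCDEFGYYYYY", "QQQQQATCEEFGLLLLL"))
print(local_score("ASCDEFG", "ATCEEFG"))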

The BLAST package
The BLAST package includes a suite of programs with slightly different applications. These were the first rapid sequence comparison programs to incorporate estimates of statistical significance based on an analytical theory for the statistics of similarity scores. Different versions of the program enable different search methods (a sketch of the six-frame conceptual translation used by the last three programs follows the list):

BLASTP - compares an amino acid query sequence against a protein sequence database.
BLASTN - compares a nucleotide query sequence against a nucleotide sequence database.
BLASTX - compares six-frame conceptual translation products of a nucleotide query sequence (both strands) against a protein sequence database.
TBLASTN - compares a protein query sequence against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
TBLASTX - compares six-frame conceptual translation products of a nucleotide query sequence (both strands) against a nucleotide sequence database dynamically translated in all six reading frames (both strands).
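
The following Python sketch illustrates the six-frame conceptual translation that BLASTX, TBLASTN and TBLASTX rely on; the tiny codon table is an illustrative subset of the genetic code, and a real program would of course use the complete table.

CODON_TABLE = {   # illustrative subset of the genetic code (assumption)
    "ATG": "M", "GGA": "G", "TTT": "F", "AAA": "K",
    "GAT": "D", "TAA": "*", "TAG": "*", "TGA": "*",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(seq):
    """Translate one reading frame; codons missing from the toy table become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))

def six_frame(seq):
    """Conceptual translations of the three frames on each strand."""
    reverse_complement = seq.translate(COMPLEMENT)[::-1]
    return [translate(strand[frame:]) for strand in (seq, reverse_complement) for frame in range(3)]

for i, protein in enumerate(six_frame("ATGGGATTTAAA"), start=1):
    print("frame", i, protein)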

The BLAST program searches for similar segments between the query sequence and the database sequences and then evaluates the statistical significance of any matches. It reports only those matches that satisfy a user-selectable threshold of significance.

First, the BLAST programs identify pairwise segments that have similar words. BLAST splits an amino acid query sequence into short words of N-mers (normally 3-5 amino acids long). A nucleotide query sequence is likewise split into words of N-mers (normally about 12 bp long). The database is scanned for occurrences of N-mer look-alikes, and all pairs scoring above a threshold are remembered. After the scanning step, the identified regions are extended, and if an extended region scores above a certain threshold (an HSP, high-scoring segment pair) it is remembered. The highest-scoring of all segment pairs that can be produced from the two sequences (the MSP, maximal-scoring segment pair) is reported. If the score is below a certain threshold, it will not appear in the output file, since it is considered insignificant.
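
The word-based seeding and ungapped extension steps can be sketched as follows in Python. The word length, the identity-based scoring and the extension rule are deliberate simplifications of what BLAST actually does (it uses neighbourhood words scored with a substitution matrix and stops extensions that drop too far below the best score), so this is only meant to convey the idea.

WORD = 3                    # word length; BLAST uses short words (~3) for proteins (assumption here)
MATCH, MISMATCH = 1, -1     # identity scoring instead of a real substitution matrix (assumption)

def word_hits(query, target, w=WORD):
    """All positions (qi, ti) where a length-w word of the query occurs exactly in the target."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    return [(qi, ti) for ti in range(len(target) - w + 1)
            for qi in index.get(target[ti:ti + w], [])]

def extend_hit(query, target, qi, ti, w=WORD):
    """Extend a word hit without gaps in both directions; return the best segment score found."""
    def best_extension(dq, dt, step):
        score, best = 0, 0
        while 0 <= dq < len(query) and 0 <= dt < len(target):
            score += MATCH if query[dq] == target[dt] else MISMATCH
            best = max(best, score)
            dq += step
            dt += step
        return best
    seed = sum(MATCH if query[qi + k] == target[ti + k] else MISMATCH for k in range(w))
    return seed + best_extension(qi + w, ti + w, 1) + best_extension(qi - 1, ti - 1, -1)

query, target = "ASCDEFG", "QQQQQATCEEFGLLLLL"
hsps = [(extend_hit(query, target, qi, ti), qi, ti) for qi, ti in word_hits(query, target)]
print(max(hsps))   # the highest-scoring segment pair found for this toy query/target pair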

Comparison matrices
In addition to comparison algorithms, different scoring matrices can be used. Modern scoring matrices have been derived using two different approaches.

PAM - Point Accepted Mutations
The method used to develop the PAM250 matrix is based on estimated transition frequencies for a small amount of sequence change (typically 1%). An evolutionary distance of 1 PAM corresponds to an amount of change in which 1 point mutation has been accepted per 100 residues. The method then extrapolates the transition frequencies by successive multiplication to matrices that model the distribution of amino acid substitutions after 120% (PAM120), 200% (PAM200), or 250% (PAM250) amino acid substitutions. Although it seems surprising to consider alignments in which two sequences have changed by 250%, such sequences are in fact expected to remain about 20% identical and are thus in the twilight zone.
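
A small numpy sketch of the extrapolation idea: a 1-PAM mutation probability matrix is raised to the 250th power and converted to Dayhoff-style log-odds scores. The two-letter alphabet and all numerical values are toy assumptions; the published matrices are built from a 20x20 matrix of empirically observed substitutions.

import numpy as np

freqs = np.array([0.6, 0.4])              # background frequencies of the two toy residues (assumed)
pam1 = np.array([[0.99, 0.01],            # pam1[i, j] = P(residue j is replaced by residue i)
                 [0.01, 0.99]])           # each column sums to 1; about 1% change per PAM

pam250 = np.linalg.matrix_power(pam1, 250)     # extrapolation: 250 successive rounds of 1-PAM change

# Dayhoff-style log-odds: replacement probability relative to chance, times 10, rounded
scores = np.round(10 * np.log10(pam250 / freqs[:, None]))
print(pam250)
print(scores)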

BLOSUM - BLOcks SUbstitution Matrix
The method which was used to develop the BLOSUM series of matrices is based on observed rather than extrapolated transition frequencies. These matrices are derived from blocks of conserved residues that are at least 45% (BLOSUM45), 50% (BLOSUM50), or 62% (BLOSUM62) identical.

While it is often stated that differences in scoring matrices can cause large changes in sequence alignments, similarity scores for clearly related sequences are usually not very sensitive to changes in scoring matrix.

How to interpret the result of your search?
The result of the search displays the database in which the hit was identified, the accession number, the name of the entry, a short description of the entry and the score for the entry. The P(N) value is an estimate of how probable it is to find such a similar match by pure chance. If the number is low, the match is likely to be real; as it gets larger, the probability increases that the match occurred by chance.

Ideally, one would like a firm cut-off level for deciding whether or not to accept a hit. Unfortunately, the situation is not so simple, reflecting the complexity of biology. As a very approximate guideline, the P(N) value should be far below 10^-2 for a homologous pair of sequences. However, the interpretation of the result will depend on a number of parameters, and the user should try to use all available information in order to judge whether a hit against a sequence in the public database is likely to represent a true similarity or is the result of pure chance.
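
For orientation, the commonly used relationship between a raw similarity score, the expected number of chance matches (E) and the reported probability P of seeing at least one such match can be sketched as below; the statistical parameters K and lambda and the sequence lengths are illustrative assumptions, not values taken from any particular search.

import math

K, lam = 0.13, 0.32            # statistical parameters of the scoring system (assumed values)
m, n = 250, 50_000_000         # query length and total database size in residues (assumed values)

def expected_chance_matches(score):
    """Expected number of chance segment pairs with at least this raw score (the E value)."""
    return K * m * n * math.exp(-lam * score)

def p_value(score):
    """Probability of seeing at least one such chance match: P = 1 - exp(-E)."""
    return 1.0 - math.exp(-expected_chance_matches(score))

for s in (40, 60, 80):
    print(s, expected_chance_matches(s), p_value(s))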

Important factors to keep in mind are the following:

Overall alignment - Although current versions of BLASTP are very effective at identifying distantly related sequences, BLASTP does not produce biologically meaningful alignments because it does not allow gaps. Distant sequence relationships (<30% identity) typically extend over entire protein sequences or long protein domains and require gaps to include the entire homologous regions. Because of its restriction on gaps, BLASTP may break a long homologous domain into several ungapped HSPs which, when combined, have significant similarity. Thus, although BLASTP is effective at identifying distant relationships, other alignment methods should be used when BLASTP matches are analyzed and displayed. Homologous sequences are usually similar over an entire sequence or domain, typically sharing 20-25% or greater identity for more than 200 residues. Matches that are more than 50% identical in a 20- to 40-amino-acid region occur frequently by chance and do not indicate homology.

Gene duplications - Numerous gene duplications have occurred throughout evolution. Take for example a gene that codes for a sigma factor. There is only one such gene (rpoX) in organism A, but five genes (rpoX1-rpoX5) coding for sigma factors in organism B as a result of several recent gene duplication events. The five sigma factors in organism B have been shown experimentally to recognize a variety of different promoters. A similarity search using rpoX from organism A produces hits of similar significance to all five sigma factor genes in organism B. In this case, it is not possible to identify the one gene in organism B that is the ortholog of rpoX in organism A, and it would be better to simply identify the gene as a sigma factor and defer a more detailed description until experimental results are available.

The distance between species - The similarity score will depend on the divergence time of the two organisms being compared. A comparison of a bacterial gene with a human homolog will necessarily give a lower score than a comparison with the corresponding homologs in other bacterial species. Thus, it is advisable to take the evolutionary distance between the two species into account when evaluating the results of the similarity search.

The degree of protein conservation - The similarity score will also depend on the degree of conservation of the proteins being compared. A comparison of highly conserved proteins will necessarily give a higher score than a comparison of weakly conserved proteins from the same species pair.

Part of a metabolic pathway - Finally, use all of your biological expertise about the organism from which your sequence was derived. Does the best hit represent a gene that is likely to be present in the organism? Is it part of a metabolic pathway that is known from experimental data to be present in this organism? Have other genes in this pathway been sequenced? (see the section on metabolic reconstructions).

Top

Large-scale Sequencing Searches

Computational biologists and genomics researchers, as well as molecular biologists involved with cloning and sequencing genes, are all confronted with a worsening situation when interacting with sequence databases. The main difficulties arise from (a) the decreasing quality (both in terms of errors and redundancy) of the data banks and (b) their burgeoning sizes.

The sheer size of the databases has made it impossible to work with or search them in a reasonable amount of time, except at a few leading centers in the world. While useful, the E-mail and World Wide Web (WWW) services offered to the public provide only a selected subset of the fastest algorithms (to be used within a narrow range of options). They also limit the number of queries each client may submit to a small, reasonable number. Scientific assessments involving very large-scale comparisons (such as entire databases against themselves), exotic algorithms, or unusual parameter settings are not possible in this context. In addition, the lack of confidentiality of this mode of operation can be worrisome to some laboratories, and is definitely not acceptable to the private biotechnology industry. Two concepts, sequence masking and distributed processing, are keys to local (and secure) implementation of effective and flexible large-scale sequence comparison.

Concepts of sequence masking
A number of important scientific experiments involve the comparison of a large number of query sequences against an entire sequence database. The recognition of exons in human genomic sequences by database similarity searches against the rapidly growing collection of partial cDNA sequences (expressed sequence tags, or ESTs) is one example. Several thousand ESTs are added to the databases every day, and a nearly complete sampling of all transcripts should be available shortly. It should thus in principle be possible to locate all exons within any human genomic region.

However, any attempt to compare a human genomic sequence with EST data quickly reveals that this promising method, in its simplest form, does not provide meaningful results. To give an example: a 67-kb sequence from the p22.3 region of chromosome X contains a total of 1343 distinct putative peptide-encoding sequences (ORFs), of which at most 10 are expected to be real. The 1343 ORFs were compared against the public EST data bank (dbEST), which contains approximately 150,000 ESTs of human origin, using the programs BLASTN and TBLASTN. As expected, the fraction of ORFs matching at least one EST rapidly increased as the required minimal score decreased. For BLASTN scores at or below 100, or TBLASTN scores up to 50, almost all ORFs were found to have a match. This is in agreement with statistics, which predict a very high probability of random matches at those low scores. However, as larger minimal scores are imposed, and thereby more stringent local similarity, the number of matching ORFs remains much higher than expected. For instance, more than 350 candidate exons were identified by their match to a human EST with BLASTN using a minimal score of 200. Such an extremely high fraction of false-positive identifications makes the direct EST lookup method of exon identification totally impractical until the nature of the problem is understood.

The model fails because it assumes a randomness in sequence data that is not valid. The problem is that, apart from a small fraction of well-behaved regions, actual sequences, both genomic and EST, consist of many different repeats. Repetitive DNA represents over 50% of the human genome. It has been classified into various categories such as retroposons, satellites, etc. These repeats, present both in the target EST data bank and in most of the putative ORF queries, dramatically increase the chance of a fortuitous match. Eventually, this noise can obscure the few alignments with biological significance.

As a general solution to this problem, and as a prerequisite to all large-scale sequence comparisons, the concept of sequence masking has been introduced. It simply consists of delineating the various types of repeats and other a priori troublesome segments with ad hoc programs, and then replacing the corresponding positions with a special character neutral to the scoring scheme (usually X for proteins and N for DNA sequences). Masking the most frequent repeats (Alu and simple sequences) can have drastic effects on the distribution of the number of hits for a given minimal score. For example, in the case described above, masking reduced the number of hits at a minimal score of 200 from more than 300 to fewer than 50.
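
A minimal Python sketch of the masking step itself: regions identified as repeats or low-complexity sequence (here simply hard-coded, in practice reported by dedicated repeat-detection programs) are overwritten with the neutral character before the search is run.

def mask(seq, regions, neutral="N"):
    """Overwrite each (start, end) region (0-based, end-exclusive) with the neutral character."""
    masked = list(seq)
    for start, end in regions:
        masked[start:end] = neutral * (end - start)
    return "".join(masked)

genomic = "ACGTACGTAAAAAAAAAAGGCTTACGT"
repeat_regions = [(8, 18)]              # e.g. a simple-sequence (poly-A) stretch found by a repeat finder
print(mask(genomic, repeat_regions))    # ACGTACGTNNNNNNNNNNGGCTTACGT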

Thus, database searching is emerging as a powerful tool for proposing putative gene functions. Once the remaining problems and limitations in technology are overcome, the wealth of experimental data will yield far greater insights. Each molecular biology database is a lens that can either magnify - or cloud - our view of experimental data.

Extracted from:
Dan Gusfield. 1997. Algorithms on strings, trees, and sequences. Cambridge University Press.
Peter Karp. 1996. Database links are a foundation for interoperability. TIBS 14: 273-279.
Document adapted from Linnaeus Centre for Bioinformatics, Uppsala, Sweden.

Top