SRS and Entrez
Retrieval of sequences and information
Because of the high rate of data production and the need for researchers to
have rapid access to new data, public databases have become the major medium
through which genome sequence data are published. Public databases and the data
services that support them are important resources in bioinformatics, and will
soon be essential sources of information for all the molecular biosciences.
EMBL and GenBank are the two major nucleotide databases, whereas Swissprot is
the major database for protein sequences. Biological databases are built by
different teams, in different locations, for different purposes, and using
different data models and supporting database-management systems. However,
biological databases are more valuable when interconnected than when isolated.
The popularity of integrated retrieval services such as SRS and Entrez indicates
the need for querying interrelated datasets, rather than isolated databases.
The advantages of physical integration are that queries can be executed rapidly
because all data are located in one place, and that the user sees a homogeneous,
integrated data source.
A more complete introduction to this subject
is offered here.
A. Use SRS (Sequence Retrieval System) to solve problems 2-1 through 2-13.
Mark the database or databases you wish to include in your search, then use
the standard query form.
Hints: “all text” is a good search criterion to start out with, but not a very
specific one. Be sure to try more specific ones as well, so that you discover
the differences. Note that species names should preferably be entered in their
Latin forms. Don’t forget to “plug in” all the databases, or the specific
database, that you want to include in your searches.
Search Swissprot to find how many entries there are for each of the following organisms. What search word and criterion did you use? Give one answer: the criterion that best answers the question.
2-1. Caenorhabditis elegans?
2-2. Escherichia coli?
2-3. Arabidopsis thaliana?
2-4. House mouse?
2-5. If you wanted to find not only "House mouse" but all mouse species in 2-4, would it be wise to use "mouse" as the organism search word (organism field)? Why or why not? Try it!
Now we are going to search for tRNAArg sequence entries from E. coli using SRS.
Hints
- Make sure that what you report is actually tRNA to successfully answer the question!
- Consider that the nomenclature in the databases might vary, e.g. tRNA could be written as transfer-RNA or transferRNA. Also, tRNAArg might be written as tRNA-Arg, Arg-tRNA or ArgtRNA.
- Use the | character to express "or" and the & character to express "and" in your search!
- Note that you must use several search fields in order to complete this task.
- tRNA molecules fall within a particular size range. Using the "Sequence Length" field, you can require that your result sequences have a specific length. For example, if you are looking for long proteins of more than 2000 amino acids you can write "2000:", and if you are looking for proteins between 400 and 600 amino acids long you can write "400:600".
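The logic behind the hints above can be sketched in Python, as a way of listing spelling variants and reasoning about length ranges before typing anything into SRS. This is only a rough sketch: the helper names `or_group`, `and_group` and `in_range` are made up for illustration, the strings show the | and & logic rather than the exact syntax of any SRS form field, and treating both ends of a range as inclusive is an assumption.

```python
def or_group(terms):
    """Join alternative spellings with | (or), wrapped in parentheses."""
    return "(" + " | ".join(terms) + ")"


def and_group(parts):
    """Join sub-expressions that must all match with & (and)."""
    return " & ".join(parts)


def in_range(length, spec):
    """Check a length against a range written as 'min:max' or 'min:'.

    Assumption: both bounds are treated as inclusive here.
    """
    low, _, high = spec.partition(":")
    if low and length < int(low):
        return False
    if high and length > int(high):
        return False
    return True


# Spelling variants taken from the hints above.
arg_variants = ["tRNAArg", "tRNA-Arg", "Arg-tRNA", "ArgtRNA"]

# Logical combination only; in SRS each part would go into its own search field.
expression = and_group([or_group(arg_variants), "Escherichia coli"])
print(expression)

# Range checks mirroring the "2000:" and "400:600" examples.
print(in_range(2500, "2000:"), in_range(500, "400:600"), in_range(700, "400:600"))
```

Writing the variants down in one place like this makes it easier to report exactly which searches were tried.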
2-6. Report all correct tRNAArg entries (IDs) you find in EMBL, and the search(es) by which you found them.
2-7. Report all correct tRNAArg entries (IDs) you find in Swissprot, and the search(es) by which you found them.
2-8. Accession no. P20153 leads to which organism?
2-9. And to which protein does P20153 lead?
2-10. How long is the coding sequence for cytochrome b in woolly mammoth?
Use SRS to find in Swissprot all protein sequences of human
hydroxysteroid dehydrogenases. Hydroxysteroid dehydrogenases are enzymes participating
in the metabolism of steroids. Dehydrogenases catalyse oxidation reactions.
Hint: Some hydroxysteroid dehydrogenases have a prefix, like
17beta-. Thus, you should use a wildcard (*) before the word hydroxysteroid in
order to catch all sequences.
2-11. Describe how you search for human hydroxysteroid dehydrogenases. How many did you find?
2-12. For which of these Swissprot sequences are the three-dimensional
structures known (in the PDB database)? Describe your search. Hint: Use the link function in SRS.
2-13. Acetylation is a common post-translational
modification of proteins. Describe how you search for all acetylated human proteins
in Swissprot. How many did you find? (There should be more than two thousand.)
Hint: You could start with an "All text" search in
order to find at least one protein that is acetylated. By examining the Swissprot
entry, you will find out how the modification acetylation is encoded
in the Swissprot database.
B. Use Entrez (http://www.ncbi.nlm.nih.gov/Entrez/)
to solve exercises 2-14 – 2-19.
Your friend the molecular biologist will soon present a
Master's thesis on similarities between “zebrafish” and a species
called “torafugu”. In order to make an educated impression you would
like to know more about this subject before going there.
2-14. The Entrez cross-database search is
a very good starting point when you have limited prior knowledge of the search
terms. Use Entrez to find Latin names and taxonomic IDs for “zebrafish”
and “torafugu”.
Please give the Latin names and taxonomic IDs. Also describe how you found the
answers.
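For those curious about what such a search looks like outside the web interface, the sketch below builds query URLs for NCBI's E-utilities ESearch service, the programmatic interface to Entrez. It is an illustration rather than part of the exercise: the helper name `esearch_url` is made up, no request is actually sent, and the exercise itself is meant to be solved in the Entrez web pages.

```python
from urllib.parse import urlencode

# Base address of the ESearch service in NCBI's E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(db, term):
    """Build an ESearch URL for one Entrez database and one search term."""
    return EUTILS_BASE + "?" + urlencode({"db": db, "term": term})


# Only construct and print the URLs; nothing is fetched over the network.
for common_name in ["zebrafish", "torafugu"]:
    print(esearch_url("taxonomy", common_name))
```

Opening one of the printed URLs in a browser returns an XML document listing the matching record IDs in the chosen database.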
2-15. Which of the two is the more studied organism? Justify your answer.
2-16. How are they related evolutionarily? Give both the trivial
answer in common language, and their smallest common taxon in Latin.
2-17. The taxonomy browser is useful for finding all sequences
of a particular lineage. See if you can find a clever way to find all fish proteins
that are available in Swissprot, using NCBI Taxonomy together with the options
for limiting a search (to search only Swissprot entries). How many fish proteins did you find in Swissprot?
What search query did you use, and how did you arrive at it? (Your answer should be fairly close to 5000.)
On returning from vacation you find a cryptic Post-it™
stuck to your computer, with the text: “AAT74529 – check it out
A.S.A.P. May be disease related /Y”. You recognize the handwriting of
your collaborator, who unfortunately is out of town for the coming weeks. You
will just have to do this yourself.
2-18. The first part is probably some kind of accession number,
but to what? What is it?
2-19. What human disease is it related to?