LiU Institute of Technology - IFM Bioinformatics

Bioinformatics
Introduction and practical applications

Course TFTB29

   
Assignment 2
 
 
  Please submit your answers via the ARMS system
http://bioinfo.ifm.liu.se/edu/ARMS/TFTB29/HT2013/report_assignment_2.html
 

SRS and Entrez
Retrieval of sequences and information

Because of the high rate of data production and the need for researchers to have rapid access to new data, public databases have become the major medium through which genome sequence data are published. Public databases and the data services that support them are important resources in bioinformatics, and will soon be essential sources of information for all the molecular biosciences. EMBL and GenBank are the two major nucleotide databases, whereas Swissprot is the major database for protein sequences. Biological databases are built by different teams, in different locations, for different purposes, and using different data models and supporting database-management systems. However, biological databases are more valuable when interconnected than when isolated. The popularity of integrated retrieval services such as SRS and Entrez indicates the need for querying interrelated datasets rather than isolated databases. The advantages of physical integration are that queries can be executed rapidly, because all data are located in one place, and that the user sees a homogeneous, integrated data source.

A more complete introduction to this subject is offered here.
 

 

A. Use SRS (Sequence Retrieval System) to solve problems 2-1 through 2-13. Mark the database/databases you wish to include in your search, then use the standard query form.

Hints: “all text” is a good search criterion to start out with, but not a very specific one. Be sure to try more specific ones as well, so that you discover the differences. Note that species names are preferably entered in their Latin forms. Don’t forget to “plug in” all the databases, or the specific database, that you want to include in your searches.

Search Swissprot to find out how many entries it contains for each of the following organisms. What search word and criterion did you use? Give one answer: the criterion which best answers the question.

2-1. Caenorhabditis elegans?

2-2. Escherichia coli?

2-3. Arabidopsis thaliana?

2-4. House mouse?


2-5. If you would like to find not only "House mouse" but all mouse species in 2-4, would it be wise to use "mouse" as the organism search word (organism field)? Why/Why not? Try it!


Now we are going to search for tRNAArg sequence entries from E. coli using SRS.

Hints
  • Make sure that what you report is actually tRNA to successfully answer the question!
  • Consider that the nomenclature in the databases might vary, e.g. tRNA could be written as transfer-RNA or transferRNA. Also, tRNAArg might be written as tRNA-Arg, Arg-tRNA or ArgtRNA.
  • Use the | character to express or and & to express and in your search!
  • Note that you must use several search fields in order to complete this task.
  • tRNA molecules fall within a particular size range. With the "Sequence Length" field you can restrict your results to sequences of a given length. For example, to find long sequences of more than 2000 residues you can write "2000:", and to find sequences between 400 and 600 residues you can write "400:600".
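
The hints above about spelling variants, the | and & operators, and length ranges can be sketched in a few lines of Python. This is only an illustration of the logic, not the exact SRS query syntax: in the SRS standard query form each condition is typed into its own field, and the helper names or_group, and_join and parse_length_range are made up for this example.

```python
def or_group(terms):
    """Join alternative spellings with | (or), wrapped in parentheses."""
    return "(" + " | ".join(terms) + ")"


def and_join(conditions):
    """Join independent conditions with & (and)."""
    return " & ".join(conditions)


def parse_length_range(text):
    """Parse a length range: '2000:' -> (2000, None), '400:600' -> (400, 600)."""
    low, sep, high = text.partition(":")
    if not sep:
        raise ValueError("expected a range such as '2000:' or '400:600'")
    return (int(low) if low else None, int(high) if high else None)


# Spelling variants taken from the hint list above.
variants = ["tRNAArg", "tRNA-Arg", "Arg-tRNA", "ArgtRNA"]

# Illustrative combined expression (hypothetical layout, not literal SRS input).
query = and_join([or_group(variants), "Escherichia coli"])
print(query)
print(parse_length_range("2000:"), parse_length_range("400:600"))
```

A bound of None means that side of the range is left open, as in "2000:".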

2-6. Report all correct tRNAArg entries (IDs) that you find in EMBL, and the search(es) by which you found them.


2-7. Report all correct tRNAArg entries (IDs) that you find in Swissprot, and the search(es) by which you found them.


2-8. Accession no. P20153 leads to which organism?


2-9. And to what protein (P20153)?


2-10. How long is the coding sequence for cytochrome b in woolly mammoth?


Use SRS to find in Swissprot all protein sequences of human hydroxysteroid dehydrogenases. Hydroxysteroid dehydrogenases are enzymes participating in the metabolism of steroids. Dehydrogenases catalyse oxidation reactions.
Hint: Some hydroxysteroid dehydrogenases have a prefix, like 17beta-. Thus, you should use a wildcard (*) before the word hydroxysteroid in order to catch all sequences.
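
The effect of a leading wildcard can be tried out in Python with the standard fnmatch module. This is only a stand-in for the SRS wildcard, which is assumed here to behave in a similar way; the three names in the list are illustrative examples, not search results.

```python
from fnmatch import fnmatchcase

# Illustrative protein names (not taken from an actual database search).
names = [
    "hydroxysteroid dehydrogenase",
    "17beta-hydroxysteroid dehydrogenase",
    "alcohol dehydrogenase",
]


def matching(pattern, candidates):
    """Return the candidates that match a shell-style wildcard pattern."""
    return [name for name in candidates if fnmatchcase(name, pattern)]


# Without a leading wildcard, prefixed names such as 17beta- are missed.
without_leading = matching("hydroxysteroid*", names)
# With a leading wildcard, both the plain and the prefixed name are caught.
with_leading = matching("*hydroxysteroid*", names)

print(without_leading)
print(with_leading)
```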

2-11. Describe how you search for human hydroxysteroid dehydrogenases. How many did you find?


2-12. For which of these Swissprot sequences are the three-dimensional structures known (in the PDB database)? Describe your search. Hint: Use the link function in SRS.


2-13. Acetylation is a common post-translational modification of proteins. Describe how you search for all acetylated human proteins in Swissprot. How many did you find? (It should be more than two thousand.)

Hint: You could start with an "All text" search in order to find at least one protein that is acetylated. By examining the Swissprot entry, you will find out how the modification acetylation is encoded in the Swissprot database.

 


B.
Use Entrez (http://www.ncbi.nlm.nih.gov/Entrez/) to solve exercises 2-14 through 2-19.

Your friend the molecular biologist will soon present a Master's thesis on similarities between “zebrafish” and a species called “torafugu”. In order to make an educated impression you would like to know more about this subject before going there.

2-14. The Entrez cross-database search is a very good starting point when you have limited prior knowledge of the search terms. Use Entrez to find Latin names and taxonomic IDs for “zebrafish” and “torafugu”.
Please give the Latin names and taxonomic IDs. Also describe how you found the answers.
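
Besides the web interface, NCBI also offers programmatic access to Entrez through the E-utilities. The exercise itself is meant to be solved in the browser, so the following is only a sketch of how a taxonomy search by common name could be expressed as an ESearch URL. The helper name esearch_url is made up for this example, and the code only builds the URLs without contacting NCBI.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term):
    """Build an ESearch URL for the given Entrez database and search term."""
    return f"{EUTILS_BASE}/esearch.fcgi?" + urlencode({"db": db, "term": term})


# Search the taxonomy database for the two common names from the exercise.
for common_name in ("zebrafish", "torafugu"):
    print(esearch_url("taxonomy", common_name))
```

Opening one of the printed URLs in a browser should return an XML document listing the matching taxonomic IDs.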


2-15. Which of the two is the more studied organism? Justify your answer.


2-16. How are they related evolutionarily? Give both the trivial answer in common language, and their smallest common taxon in Latin.


2-17. The taxonomy browser is useful for finding all sequences of a particular lineage. See if you can find a clever way to retrieve all fish proteins available in Swissprot, using NCBI Taxonomy together with the limit options (to search only Swissprot entries). How many fish proteins did you find in Swissprot? State the search query you used and describe how you arrived at it (your answer should be fairly close to 5000).


When returning from the vacation you find a cryptic Post-it™ stuck to your computer, with the text: “AAT74529 – check it out A.S.A.P. May be disease related /Y”. You recognize the handwriting of your collaborator, who unfortunately is out of town for the coming weeks. You will just have to do this yourself.

2-18. The first part is probably some kind of accession number, but to what? What is it?


2-19. What human disease is it related to?

 

 



Check your marking progress at
http://bioinfo.ifm.liu.se/edu/ARMS/TFTB29/HT2013/check_progress.html

 

Problems?
Questions can be e-mailed to Fredrik Lysholm (frely@ifm.liu.se).

Modified October 2011