Assignment 2 - February 5, 2018

This assignment is worth 5% of the course grade.

Due by 11:59 pm, Wednesday February 14

1. (5 points) Search for and retrieve the pBI121 cloning vector
Inside your PLNT2530 directory, create a sub-directory called as2, to hold materials associated with Assignment 2. Save all files in your PLNT2530/as2 directory.

Based on information in the lab manual, use blncbi (SEQFETCH) to find and retrieve the GenBank entry for the cloning vector pBI121. Before saving, you should view the GenBank entry from bldna to verify that you have the correct file. Next, save the GenBank entry in a file called pBI121.gen.

Take screenshots of the following for your report:

2. (10 points) Search for and retrieve the tan spot necrosis toxin gene, ToxA.

Based on information in the lab manual, use blncbi to find and retrieve the GenBank entries for the Ptr tan spot necrosis toxin gene used in the lab. This is likely to take some experimentation with different search keys and terms.

The goal is to find, using as few search terms as possible, the entries that contain the necrosis toxin protein coding sequence (annotated as a CDS feature in any given GenBank entry) and to exclude false positives. For example, the gene names "toxA", "ToxA", "tox-A", etc. might be used for different genes in different species, and a search on the word "necrosis" will hit any entry in which that word appears. Beware that the title fields shown in column B can be misleading. You must also look at the annotation in the GenBank files (covered in the tutorial). For example, an entry might get a hit because the terms "necrosis" and "toxin" both appear in the title of a literature citation, even though the entry has nothing to do with the tan spot toxin. Proteins annotated by ambiguous terms such as "hypothetical protein" should be considered false positives, unless other evidence in the entry explicitly identifies the protein as the tan spot necrosis toxin.
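If you have many entries to inspect, the annotation check described above can be sketched in Python. This is only an illustration, not part of the required workflow: the function name cds_qualifiers and the sample text are made up, the code assumes the standard GenBank flat-file layout (feature keys indented 5 spaces, qualifiers indented 21 spaces), and it reads only the first line of a multi-line qualifier value.

```python
import re

def cds_qualifiers(text):
    """Collect /gene and /product values for each CDS feature in GenBank text."""
    results = []
    current = None        # dict for the CDS being read, or None for other features
    in_features = False
    for line in text.splitlines():
        if line.startswith("FEATURES"):
            in_features = True
            continue
        if not in_features:
            continue
        if line and not line.startswith(" "):
            break         # ORIGIN or another top-level keyword ends the table
        key = re.match(r"^ {5}(\S+)", line)
        if key:
            # A new feature starts; only CDS features are recorded.
            current = {} if key.group(1) == "CDS" else None
            if current is not None:
                results.append(current)
            continue
        qual = re.match(r'^ {21}/(gene|product)="([^"]*)', line)
        if qual and current is not None:
            current[qual.group(1)] = qual.group(2)
    return results

# Made-up feature table for illustration only; not a real GenBank entry.
pad5, pad21 = " " * 5, " " * 21
sample = (
    "FEATURES             Location/Qualifiers\n"
    + pad5 + "source          1..30\n"
    + pad21 + '/organism="Example organism"\n'
    + pad5 + "CDS             1..30\n"
    + pad21 + '/gene="exampleA"\n'
    + pad21 + '/product="hypothetical protein"\n'
    + "ORIGIN\n"
)

cds = cds_qualifiers(sample)
# Flag CDS features whose product annotation is too vague to trust.
ambiguous = [c for c in cds if "hypothetical" in c.get("product", "").lower()]
print(cds)
print(ambiguous)
```

A listing like this only narrows down what to look at. You still need to read the flagged entries yourself before deciding whether they are false positives.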

It is important to note that different queries may identify different sets of toxA genes. While it would be ideal to find a single query that finds all true toxA genes and no false positives, 2 or 3 independent queries are fine if together they find all the genes.

Once you settle on one or a few good queries, save your query results from blncbi and the corresponding GenBank entries from bldna.

Search tips:
1) blncbi lets you construct complex queries using the Boolean operators AND, OR and NOT, as well as grouping two or more terms using parentheses. For example, a search aimed at retrieving either the pUC18 or pUC19 vectors might use the query

(pUC18 [Title] OR pUC19 [Title]) AND syn [Division]

2) Don't waste your time adding trivial terms to your query to specifically exclude sequences by Accession number, e.g.

(pUC18 [Title] OR pUC19 [Title]) AND syn [Division] NOT (S38358 [Accession] OR
M22135 [Accession] OR X13074 [Accession] OR X13070 [Accession])

If the number of false positives is small, they will be easy to weed out by inspection of the GenBank files.
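To see how the pieces of a query like the one in tip 1 fit together, here is a small Python sketch that assembles the same string from its parts. The helper names field, any_of and all_of are made up for this illustration and are not part of blncbi. You would still type or paste the finished query into blncbi yourself.

```python
def field(term, name):
    """Format one search term with its field tag, e.g. 'pUC18 [Title]'."""
    return f"{term} [{name}]"

def any_of(*clauses):
    """Join clauses with OR and wrap the group in parentheses."""
    return "(" + " OR ".join(clauses) + ")"

def all_of(*clauses):
    """Join clauses with AND."""
    return " AND ".join(clauses)

query = all_of(
    any_of(field("pUC18", "Title"), field("pUC19", "Title")),
    field("syn", "Division"),
)
print(query)
```

Building the query from parts makes the grouping explicit: the parentheses come from any_of, so the OR is evaluated before the AND.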

Saving blncbi results
The results from any blncbi window can be saved by choosing Edit --> Select All, and then choosing File --> Save SELECTION As.

Make sure that the File format is tsv, and give the file a descriptive name with the .tsv file extension.

TSV stands for "Tab-separated values". This is a generic format in which each row in a table has one or more fields (columns). The values on each line are separated by TAB characters, so the fields line up in columns much as tabbed text does in a document. Virtually all spreadsheet programs, such as LibreOffice or MS Excel, can import TSV files.
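Because the format is so simple, TSV files are also easy to read in a program. The Python sketch below uses the standard csv module with a TAB delimiter. The column names and rows in the sample are made up for illustration, so the columns in a real blncbi file will differ.

```python
import csv
import io

# Made-up stand-in for a saved .tsv file; real blncbi columns may differ.
sample = (
    "accession\ttitle\n"
    "ACC0001\tExample entry one\n"
    "ACC0002\tExample entry two\n"
)

def read_tsv(handle):
    """Return the rows of a tab-separated file as lists of field values."""
    return list(csv.reader(handle, delimiter="\t"))

rows = read_tsv(io.StringIO(sample))
for row in rows:
    print(row)
```

To read a file from disk instead of the sample string, pass an open file to read_tsv, for example open("myquery.tsv", newline=""), where myquery.tsv is a placeholder for your own file name.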

Check your files
It is always a good idea to examine your GenBank or TSV files in a text editor to make sure that they contain what you think they contain!
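A quick programmatic check can back up the visual inspection. The Python sketch below assumes the standard GenBank flat-file layout, in which every entry begins with a LOCUS line and ends with a line containing only //. The function name summarize_genbank and the sample entry are made up for illustration.

```python
def summarize_genbank(text):
    """Return the LOCUS names found and the number of '//' entry terminators."""
    lines = text.splitlines()
    locus_names = [
        ln.split()[1]
        for ln in lines
        if ln.startswith("LOCUS") and len(ln.split()) > 1
    ]
    terminators = sum(1 for ln in lines if ln.strip() == "//")
    return locus_names, terminators

# Made-up entry for illustration only; not a real GenBank record.
sample = """LOCUS       EXAMPLE1     12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Made-up entry for illustration.
ORIGIN
        1 acgtacgtac gt
//
"""

names, ends = summarize_genbank(sample)
print(names, ends)
```

If the number of LOCUS names does not match the number of terminators, or the names are not the entries you expected, the file was probably saved incorrectly.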

Take screenshots of the following for your report:

3. (5 points) Create a document to show your results

Use the template document to create a report. The report template is available in LibreOffice and MS-Word formats. Replace the dummy results in the template with your own results. The report should include the following:
If you were unable to exclude all false positives, list the accession numbers of the false positives. Briefly explain why you think they were found, despite being false positives.

Along with your report, you should upload your TSV and GenBank files. Make sure that all files have descriptive names.

Presentation guidelines

Submitting your assignment

Your PDF report, along with associated TSV and GenBank files, is due by 11:59 pm, Wed. February 14 on the PLNT2530 UMLearn dropbox site in the Bioinformatics 2 folder. Files in word processing formats (.doc, .docx, .rtf, .odt) are not acceptable.

If you have questions, it may help to send me a message at