BIRCH

TUTORIAL: CREATING DATASETS OF RELATED SEQUENCES


Oct. 16, 2016

Fristensky, B. (1993) Feature Expressions: Creating and Manipulating Sequence Datasets. Nucleic Acids Res. 21:5997-6003

ENTREZ documentation: http://www.ncbi.nlm.nih.gov/books/NBK44864/
FEATURES documentation: $doc/xylem/features.txt
NCBI BLAST http://www.ncbi.nlm.nih.gov/BLAST
FASTA $doc/fasta/fasta36.1.html


Rationale: It is often necessary to create datasets of sequences or sequence fragments. For example, you may wish to assemble as many examples of a gene or protein as possible and construct a multiple sequence alignment. The multiple alignment may be used to discover conserved regions, or study the structure of the protein. Alternatively, the multiple protein alignment might be used to construct a DNA alignment of the coding sequences for use in phylogenetic analysis or design of PCR primers from conserved regions.

Overview:

Example: Plant Defensins

Goal: To create a dataset of defensin genes for comparison across species. The defensins in plants are 5 kDa Cys-rich proteins that are produced in plants in response to fungi. Defensins have antifungal properties, and are associated with many defense responses in plants. The problem is that the term 'defensin' was adopted after many proteins in this class were published with different names, and that not all workers have chosen to use the term 'defensin'.

Nonetheless, it should be possible to use 'defensin' as a starting point, and build a dataset in several steps.

Note: The sequence dataset constructed in this exercise should be considered a 'first draft'. For example, many of the proteins listed in these GenBank entries are referred to in the annotation as 'defensin-like proteins'. Several rounds of multiple sequence alignment and phylogenetic analysis may be required before a decision can be reached as to which proteins to consider an orthologous group, and which to exclude from the group.

1. Create a directory for the project

I can't repeat this often enough. ALWAYS create a new directory for each project.
mkdir dataset
cd dataset

2. Find genes by keyword and retrieve

The first phase is simply to retrieve GenBank entries containing the keyword 'defensin', using blncbi. blncbi is a BioLegato interface to the NCBI Entrez database search engine. Blncbi complements the Entrez web site by integrating the database search with the other BioLegato tools. The Entrez web search site can be found at
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

Launch blnci by typing

blncbi

For working with tabular data, blncbi uses the BioLegato table canvas, that works similar to a spreadsheet.

In the Database menu, choose Nucleotide - Query NCBI nucleotide database.



The query menus in bldna generate Entrez query statements that are sent to NCBI where the search is performed. Although part of the point of blncbi is that you don't need to learn to write query statements, it is still useful to know a bit about how these statements are structured.


A Quick Lesson on Entrez Query Statements

Entrez lets you do highly specific searches by creating search statements in the following form:

term [field] OPERATOR term [field]

where term is a search string, limited to a specific field. The field must be enclosed in square brackets. One or more term-field statements can be joined using the operators AND, OR or NOT, which must be capitalized. See the Entrez Searching Options for more information on writing search statements.

First, let's try just searching for the key word 'defensin'. By default, Entrez will search all fields.



Click on Run: Output to a new window, and a new blncbi window will pop up with the search results.

The lines beginning with hash signs are pseudo-comments that summarize the results. The COUNT line shows that 8527 entries were found in which the word defensin appears somewhere in a GenBank entry.


This is probably a larger number of hits than we wanted to find, and by default, blncbi will only display the information on specific hits if there are fewer than 500 hits. The actual number of hits retrieved can be changed in the Limits tab.

It is likely that the hits found include defensins from animals as well as plants. As well, the "All Fields" setting in query term 1 would match an entry in which the word defensin was found anywhere in the entry. For example, an unrelated gene might been identified in a publication whose title contained the word 'defensin'.

To do a more precise search for defensin proteins from higher plants, you might use the following search statement at NCBI:

defensin [protein name] AND magnoliophyta [orgn]

To set up the same search in blncbi, set query terms in blncbi as follows:



Note that the Boolean operator "AND" must be set in the left hand column so that Entrez will search for entries with both terms. Click on Run: Output to new window to run the search.



This time, there are 122 hits are found in the current version of the database. A quick examination of the hits indicates that all hits are from plant sequences. As well, the sizes of the sequences are all in the range that would be expected of individual genes, except for some from Arabidopsis that are between 41,000 and 100,000 nucleotides, which are probably BAC clones.


This search report can be saved to a file for future user by choosing Edit --> Select All, and then choosing Save SELECTION As. Type a descriptive filename, such as blncbi.defensin_magnoliophyta.tsv, and set the file format to tsv so that blncbi can read this file if we want to use the results later.



These GenBank entries whose Accession numbers are listed in the uid column can be retrieved directly from blncbi. For this large a number of entries, it is best to retrieve the entire dataset to a file. Choose Database --> SEQFETCH. Click on Output file, and type in a descriptive filename. Since these are preliminary results we can use a name like temp.gen.



The GenBank entries are saved to temp.gen.


Note:
An increasing percentage of GenBank entries contain chromosome-sized sequences, which are usually awkward to work with. blncbi automatically limits the sizes of sequences found in hits to 500,000 bases, which should accommodate most BAC-sized sequences. This limit can changed in the Limits tab if you really want large sequences.

3. Find related genes by sequence similarity

Go into the dataset directory and launch the bioLegato instance bldna by typing 'bldna'. Read in temp.gen using  File --> Open.
 
In principle, it should be straightforward to find other defensin genes in a database search with any of the genes in temp.gen. However, DNA/DNA comparisons are not very sensitive compared to protein/protein searches. Therefore, we need to translate the protein coding regions into protein sequences. In GenBank entries, protein coding sequences are annotated in the Features Table using the 'CDS' feature key. To extract CDS features the sequences from temp.gen, first select ALL the sequences by dragging the cursor down through the list of names as shown, or choosing Edit --> Select All:


Next, choose Database --> FEATURES - extract by feature keys.

Set the 'Single feature key' menu to 'CDS'. Under DATABASE, click on 'Selected sequences' to tell FEATURES to extract subsequences from  the sequences you have selected. (By default,  FEATURES would first retrieve sequences from GenBank, which has already been done, in this case.) The menu should look like this:




The coding sequences will appear in a new bldna window


Consistent with the fact that protein coding regions have been extracted from the total DNA sequences, all of the CDS regions begin with 'atg'. These can be translated in one step by selecting all sequences and choosing DNARNA --> Ribosome. By default, ribosome sends output to a new blprotein window.


To find other proteins related to defensins, we can run BLAST searches of defensin proteins versus translated GenBank sequences. Our first choice of a test sequence will be AY313169, a defensin homologues from the barrel medic (Medicago truncatula).
(Hint: You can find for any sequence in a bioLegato instance window using Edit --> Select by name.) Once you have selected AY313169, open Database --> NCBI TBLASTN.


Since we expect to have greater than 100 sequences by the end of the process, it is not realistic to work at the 5% or 1% statistical level. To ensure that no hits are found due to random chance, we will consider sequences with E > 10-5 not significant. Set #matches expected to 1 x 10-5 to ensure that only signficant matches are found. The choice is somewhat arbitrary, but in practice, hits would later be pruned by other criteria.


 

Be patient! The BLAST search may take several minutes, depending on the load on the server.

Output will appear in two windows.

The HTML BLAST report gives a human-readable list of hits, followed by the best alignment for each hit. An excerpt is shown an right. The list contains links both to the GenBank entry for the hit at NCBI, as well as a link to the alignment.


A tab-separated value (tsv) file with more detailed hit information is displayed in blnfetch, a BioLegato interface specialized for displaying and retrieving hits from nucleotide databases.

blnfetch is a simple spreadsheet that makes it easier to sift through large numbers of hits, and includes sort and extract functions which are valuable for selecting subsets of hits eg. hits from a particular species.


Save the BLAST report using the file name AY313169.tblastn.htm (File --> Save Page As).

Before saving the tsv file, we need to make sure that there aren't any sequences too large for our purposes eg. in a classroom tutorial. This example illustrates the value of having BLAST results in a table. Note that the length of each hit found is given in the field labeled 'subject length' (column I).

We can sort the table based on the 'subject length' by Edit --> Select All, and then Edit --> BLSORT.

Since column I is the ninth column, set the first sort key to column 9, and sort order to descending.

We see that AP01540 is over 33 million bases long, so for the purposes of saving disk space we want to delete that from the file.
Make sure the Selection mode at bottom is set to "by row",  click on AP015040 to select this row, and then delete using Edit --> Delete Rows.

The data from the new blnfetch window can be saved by selecting all hits (Edit --> Select All), and then File --> Save SELECTION As AY313169.tblastn.tsv.

Output:
       AY313169.tblastn.html - program output
         AY313169.tblastn.tsv - table of hits


It is worth sampling as broad a taxonomic range as possible. Both of the previous test sequences came from dicotyledonous plants.  Therefore, we'll show two more examples one from the dicotyledonous plant chick pea (Cicer arietinum) , and the other using a defensin-like sequence from the monocot wheat (Triticum aestivum).

To save a few steps, you can send the output directly to files, bypassing the popup windows.

Go to the Output tab, and set the BLAST REPORT to HTML file, and the TABLE OF HITS to tsvfile. Fill in a base name to be used for the two output files as shown at right.


Repeat this search with the wheat sequence AB089942.

Output:
          DQ288897.tblastn.html - program output
          DQ288897.tblastn.tsv - table of hits
      

Output:
        AB089942.tblastn.html - program output
        AB089942.tblastn.tsv - table of hits


Remember: it will be necessary to open both
DQ288897.tblastn.tsv and AB089942.tblastn.tsv delete any hits that are too large.

4. Create a combined dataset

The easiest way to combine the entries in temp.gen with those listed as significant hits in the .acc files is to create a combined file of Accession numbers. The Accession numbers must be parsed out of the lines in each GenBank entry that contain the Accession numbers. For example:

ACCESSION   KJ939334

We can generate such a file for temp.gen as follows:

grep ACCESSION  temp.gen > temp.ACCESSION
cat temp.ACCESSION |tr -s " " | cut -f2 -d " " >temp.acc


The first command creates a file containing all the  lines from temp.gen. The next command uses this file as input and pipes the output, first to the tr -s command, which compresses multiple blanks into single blanks, and then to cut. cut extracts the second field of the output (where fields are delimited by blanks) and writes the output to temp.acc.

The next step is to create a non-redundant namefile, containing all Accession numbers from all .tsv files. What we want is to first exclude the pseudo-comment lines that begin with hash marks, and then to cut out ACCESSION numbers, which are in the first field of the tsv file. We should be able to specify all tsv output files as "*.tblastn.tsv", but check that first by typing "ls -l *.tblastn.tsv".  The command to extract the ACCESSION numbers from all of the tsv files in a single go would therefore be

grep -v '#' -h *.tblastn.tsv | cut -f1 > tblastn.acc
The -h option is necessary because there is more than one input file. By default, when grep has more than one input file, it will print the name of each file for each line printed. The -h option suppresses printing of filenames. Note that because this is a tsv file (ie. tab-separated value) the delimiter used by the cut command must be the TAB character!  Fortunately, cut defaults to use TAB as a delimiter, so we don't need to specify a delimiter in this particular case.)

Finally, we need to combine the two  .acc files and eliminate the duplicates.
cat temp.acc tblastn.acc | sort | uniq > all.acc
cat effectively combines these files into a single output stream. Sort, as the name implies, sorts the lines in the ouput stream, and uniq eliminates duplicate lines from a sorted file. See the manual pages for each command to get a better idea of what they do (eg. man sort, man uniq).
These tasks could be done in one step by piping output from one command to the next, eliminating the temporary files and saving some typing:


As a check, we can use the wc command to tell us how many lines (ie. accession numbers)
are in the various .acc files:

{mars:/home/psgendb/tutorials/bioLegato/dataset}wc -l *.acc
500 AB089942.tblastn.acc
434 all.acc
102 AY313169.tblastn.acc
85 DQ288897.tblastn.acc
430 tblastn.acc
122 temp.acc
1673 total

Note that the combined lengths of temp.acc and tblastn.acc add up to 552, and all.acc is only 434. This is because the tblastn searches will often find the same sequences, so there is some redundancy in the .acc files.

Next, we want to retrieve the corresponding sequences. The BioLegato interface blnfetch is specialized for retrieving entries from GenBank, using ACCESSION numbers. Type

blnfetch all.acc



Choose Edit --> Select All to select all ACCESSION numbers (blnfetch will ignore the blank line). Next, choose SEQFETCH, and send output to a file called defensin.gen. (For classroom tutorials where disk space is limited: Set Max. Seq. Length to 100,000, so that only shorter sequences will be retrieved.)

5. Large GenBank entries may contain many protein coding sequences. Make sure that your dataset contains only homologues from a single gene family.

As done previously, open defensin.gen in the bioLegato instance bldna, and extract the coding sequences (CDS) using Database --> Features - Extract by Feature Keys. Send output to a new bldna window, which will look something like this:
 

 


Save the sequences in this file by choosing File --> Save All As, and call the file all.CDS.fsa. Make sure you choose the Fasta format for saving the file. This file will be read later by bldna.

Note: Do NOT use Export Foreign Format. This function uses readseq to interconvert formats. However, readseq will truncate sequence names if they are longer than 16 characters. The added ':CDSxx' puts these names over the 16 character limit. Consequently, when there are multiple coding sequences in a given entry, several sequences will end up with the same name.
 

As more and more large genomic fragments are entered into GenBank, FASTA searches will find sequences that match only a small part of the total fragment. That is, the gene you are using as a query sequence will be only one of many genes on the fragment. Thus, FEATURES will generate all protein coding sequences (CDS) for all genes on each fragment. 

We can create a small database of all proteins translated from the CDS sequences, and then search with FASTA to identify which encode defensins.
 
As was done previously, translate all CDS sequences to proteins by choosing Edit --> Select All and  using DNARNA --> Ribosome.


In the new window, select all sequences (Edit -- Select All), and choose File --> Save ALL Ast, and call the file all.pro.fsa.

The file will contain all of the protein sequences.

We can use the same three defensin sequences as used in the original database search to search all.pro.fsa. To start, find the AY313169 amino acid sequence using Edit --> Select sequence by name

Next, choose Database --> Fasta (protein vs. protein database).

This time, instead of searching GenBank, search the data set in all.pro.fsa.

 

Use ssearch, which constructs a rigorous Smith/Waterman alignment for each pairwise comparison.

 
Set the E value, which is the cut off for listing hits. I this case, hits will only be included in the output if they have a probability of matching by random chance of less than 0.0001.



 By default, the FASTA programs calculate statistics assuming that significant hits will occur with only a very small fraction of sequences in the database. This assumption is invalid with the all.pro.fsa dataset.  Setting the -z 11 option tell FASTA to calculate E values for a population in which most sequences are closely-related.  

If you do not set this option, you will miss most of the homologous coding sequences!







Send output to a textedit window.
 

This is what it should look like with all the settings:


Repeat the process for the other two sequences.

The ssearch results in the three output files tell us which CDS sequences to keep, and which to delete from the dataset. For this purpose, it would be useful to have a list of significant matches, which can be created as follows. First use a text editor to create an empty file called all.hits. Now, copy the summary lines, such as


AY437639:CDS1 74 bp ( 74) 525 131.0 1.1e-34
NM_126272:CDS1 78 bp ( 78) 220 56.9 2.5e-12
AJ843264:CDS1 73 bp ( 73) 212 54.8 1e-11
NM_125761:CDS1 74 bp ( 74) 204 52.8 3.9e-11
AF322914:CDS1 78 bp ( 78) 203 52.7 4.4e-11
NM_126271:CDS1 78 bp ( 78) 199 51.8 8.6e-11
NM_104788:CDS1 77 bp ( 77) 193 50.3 2.4e-10
NM_126273:CDS1 78 bp ( 78) 192 50.1 2.8e-10
AF442388:CDS1 79 bp ( 79) 191 49.9 3.3e-10
AY498565:CDS1 79 bp ( 79) 191 49.9 3.3e-10
AY078426:CDS1 80 bp ( 80) 190 49.6 3.8e-10

into all.hits.  Save the file, and then sort the results for readability:

cat all.hits | cut -f1 -d" " | sort | uniq  > all.hits.sort

(Note that the cut command extracts just the names of the sequences.)

Finally, we can use this file to extract only those sequences whose names appear in all.hits.sort into a new dataset. This file contains all the CDS sequences in our dataset, not just those coding for defensin proteins.

Open up a new bldna window and read in all.CDS.fsa. Choose Edit --> Select All to select all sequences.
 
delete.non-hits.png

Now, we'll read in the names of the hits and extract those sequences into a new window. Choose Edit -->  Extract subset of sequences to new window. Under NAMES/ACCESSION #'s, choose File of Names/Acc.#'s. Enter the all.hits.sort on the line next to the text 'File of names/acc#'s'.


Click on Run. bldna will run BLExtractSubset.py and send the results to a new bldna window. The sequences will appear in the order that they are listed in all.hits.sort. Now, only defensin sequences should be in this file.


Save these sequences in File --> Save As, under the name
defensin.CDS.fsa. Make sure to write the file in Fasta format.

This illustrates an important point regarding FASTA searches. Most database searches will only find the BEST alignment between a query sequence and a sequence in the database. In the case where several homologous genes are present in a single database entry, only the best match with the query sequence is likely to be reported, and the others ignored! If we hadn't been careful, we would have missed 3 out of the 4 defensin-related genes in AC005936.

The dataset is complete!

It is now ready for creating multiple sequence alignments, and from there, for phylogenetic analysis.

6. Dot-matrix searches are good for uncovering arrays of related genes on a single sequence

Another way to make sure that we have gotten all copies of a gene is to do a dot-matrix search comparing one or more known defensin genes with the large genomic sequence. This requires that we have both sequences in the same bioLegato window. To do this, first select the genomic sequence AC005936 and choose Edit --> extract to put it into a new bldna window. To copy AC005936:CDS9 into the same window, select it, and then choose Edit --> Copy out. This copies the sequence to the bioLegato clipboard file ($home/.GDEclipboard). Next, move to the new bldna window containing the complete genomic fragment and choose Edit --> Paste in. Both sequences should now be in the window. Finally, make a copy of the defensin sequence and create its inverse complement, so that both strands of defensin can be compared with the genomic sequence.
Recall to inverse a complement you choose
DNARNA --> blrevcomp -->reverse compliment

 

The comparison of the forward strand of AC005936 with the opposite strand of CDS9 shows no compelling similarities, which may not be surprising considering the fact that CDS9-12 were all on the forward strand, according to the Features Table.

 CDS.vs.genomic.d4hom

 CDS-opp.vs.genomic.d4hom

The clustering of members  of the defensin gene family on one strand, within several kilobases is probably evidence that these genes have been duplicated by unequal crossing over in recent evolutionary time.

8. SUMMARY: Guidelines for making datasets of ortholgous genes