TUTORIAL: CREATING DATASETS OF RELATED SEQUENCES

August 6, 2024

Fristensky, B. (1993) Feature Expressions: Creating and Manipulating Sequence Datasets. Nucleic Acids Res. 21:5997-6003

ENTREZ documentation: http://www.ncbi.nlm.nih.gov/books/NBK44864/
FEATURES documentation: $doc/xylem/features.html
NCBI BLAST http://www.ncbi.nlm.nih.gov/BLAST
FASTA $doc/fasta/fasta36.1.html

Rationale: It is often necessary to create datasets of sequences or sequence fragments. For example, you may wish to assemble as many examples of a gene or protein as possible and construct a multiple sequence alignment. The multiple alignment may be used to discover conserved regions, or study the structure of the protein. Alternatively, the multiple protein alignment might be used to construct a DNA alignment of the coding sequences for use in phylogenetic analysis or design of PCR primers from conserved regions.

Overview:

retrieve plant entries from GenBank, using the keyword 'defensin'
identify additional defensin genes based on database similarity searches
combine both sets of entries into a single dataset
extract coding sequences from all entries
where entries have more than one coding sequence (eg. BAC clones) do a similarity search of the coding sequences to eliminate all but those that match known defensin genes

Example: Plant Defensins

Goal: To create a dataset of defensin genes for comparison across species. The defensins in plants are 5 kDa Cys-rich proteins that are produced in plants in response to fungi. Defensins have antifungal properties, and are associated with many defense responses in plants. The problem is that the term 'defensin' was adopted after many proteins in this class were published with different names, and that not all workers have chosen to use the term 'defensin'.

Nonetheless, it should be possible to use 'defensin' as a starting point, and build a dataset in several steps.

Note: The sequence dataset constructed in this exercise should be considered a 'first draft'. For example, many of the proteins listed in these GenBank entries are referred to in the annotation as 'defensin-like proteins'. Several rounds of multiple sequence alignment and phylogenetic analysis may be required before a decision can be reached as to which proteins to consider an orthologous group, and which to exclude from the group.

Note: The sequence dataset constructed in this exercise should be considered a 'first draft'. For example, many of the proteins listed in these GenBank entries are referred to in the annotation as 'defensin-like proteins'. Several rounds of multiple sequence alignment and phylogenetic analysis may be required before a decision can be reached as to which proteins to consider an orthologous group, and which to exclude from the group.

1. Create a directory for the project

I can't repeat this often enough. ALWAYS create a new directory for each project.

mkdir dataset
cd dataset

2. Find genes by keyword and retrieve

The first phase is simply to retrieve GenBank entries containing the keyword 'defensin', using blncbi. blncbi is a BioLegato interface to the NCBI Entrez database search engine. Blncbi complements the Entrez web site by integrating the database search with the other BioLegato tools. The Entrez web search site can be found at
http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi

Launch blnci by typing

blncbi

For working with tabular data, blncbi uses the BioLegato table canvas, that works similar to a spreadsheet.

In the Database menu, choose Nucleotide - Query NCBI nucleotide database.

The query menus in bldna generate Entrez query statements that are sent to NCBI where the search is performed. Although part of the point of blncbi is that you don't need to learn to write query statements, it is still useful to know a bit about how these statements are structured.

A Quick Lesson on Entrez Query Statements

Entrez lets you do highly specific searches by creating search statements in the following form:

term [field] OPERATOR term [field]

where term is a search string, limited to a specific field. The field must be enclosed in square brackets. One or more term-field statements can be joined using the operators AND, OR or NOT, which must be capitalized. See the Entrez Searching Options for more information on writing search statements.

First, let's try just searching for the key word 'defensin'. By default, Entrez will search all fields.

Click on Run: Output to a new window, and a new blncbi window will pop up with the search results.

The lines beginning with hash signs are pseudo-comments that summarize the results. The COUNT line shows that 14797 entries were found in which the word defensin appears somewhere in a GenBank entry.

This is probably a larger number of hits than we wanted to find, and by default, blncbi will only display the information on specific hits if there are fewer than 500 hits. The actual number of hits retrieved can be changed in the Limits tab.

It is likely that the hits found include defensins from animals as well as plants. As well, the "All Fields" setting in query term 1 would match an entry in which the word defensin was found anywhere in the entry. For example, an unrelated gene might been identified in a publication whose title contained the word 'defensin'.

To do a more precise search for defensin proteins from higher plants, you might use the following search statement at NCBI:

defensin [protein name] AND magnoliophyta [orgn]

To set up the same search in blncbi, set query terms in blncbi as follows:

Note that the Boolean operator "AND" must be set in the left hand column so that Entrez will search for entries with both terms. Click on Run: Output to new window to run the search.

This time, there are 133 hits are found in the current version of the database. A quick examination of the hits indicates that all hits are from plant sequences. As well, the sizes of the sequences are all in the range that would be expected of individual genes, except for some that are greater than 80,000 nucleotides, which are probably BAC clones.

This search report can be saved to a file for future user by choosing Edit --> Select All, and then choosing Save SELECTION As. Type a descriptive filename, such as blncbi.defensin_magnoliophyta.tsv, and set the file format to tsv so that blncbi can read this file if we want to use the results later.

These GenBank entries whose Accession numbers are listed in the uid column can be retrieved directly from blncbi. For this large a number of entries, it is best to retrieve the entire dataset to a file. Choose Database --> SEQFETCH. Click on Output file, and type in a descriptive filename. Since these are preliminary results we can use a name like temp.gen.

The GenBank entries are saved to temp .gen.

Note: An increasing percentage of GenBank entries contain chromosome-sized sequences, which are usually awkward to work with. blncbi automatically limits the sizes of sequences found in hits to 500,000 bases, which should accommodate most BAC-sized sequences. This limit can changed in the Limits tab if you really want large sequences.

3. Find related genes by sequence similarity

Go into the dataset directory and launch the bioLegato instance bldna by typing 'bldna'. Read in temp.gen using File --> Open.

In principle, it should be straightforward to find other defensin genes in a database search with any of the genes in temp.gen. However, DNA/DNA comparisons are not very sensitive compared to protein/protein searches. Therefore, we need to translate the protein coding regions into protein sequences. In GenBank entries, protein coding sequences are annotated in the Features Table using the 'CDS' feature key. To extract CDS features the sequences from temp.gen, first select ALL the sequences by dragging the cursor down through the list of names as shown, or choosing Edit --> Select All:

bldna
with all sequences selected

Next, choose Database --> FEATURES - extract by feature keys.

By default, the 'Single feature key' menu is set to 'CDS'. Stay on the "from selected" tab to tell FEATURES to extract subsequences from the sequences you have selected. (If you wanted t FEATURES would first retrieve sequences from GenBank, use the "from NCBI" tab. The "from local files" tab could be used to extract features from a local GenBank dataset.)The menu should look like this:

default menu with CDS
selected in "from selected" tab

The coding sequences will appear in a new bldna window

Consistent with the fact that protein coding regions have been extracted from the total DNA sequences, all of the CDS regions begin with 'atg'. These can be translated in one step by selecting all sequences and choosing DNARNA --> Ribosome. By default, ribosome sends output to a new blprotein window.

To find other proteins related to defensins, we can run BLAST searches of defensin proteins versus translated GenBank sequences. Our first choice of a test sequence will be AY313169, a defensin homologues from the barrel medic (Medicago truncatula).
(Hint: You can find for any sequence in a bioLegato instance window using Edit --> Select by name.) Once you have selected AY313169, open Database --> NCBI TBLASTN.

Since we expect to have greater than 100 sequences by the end of the process, it is not realistic to work at the 5% or 1% statistical level. To ensure that no hits are found due to random chance, we will consider sequences with E > 10^-11 not significant. Set #matches expected to 1 x 10^-11 to ensure that only signficant matches are found. The choice is somewhat arbitrary, but in practice, hits would later be pruned by other criteria.

Finally, because the nt is one of the largest NCBI databases, the search may take awhile. Open the Output tab and choose Notify of completion by email.

Output will appear in three windows.

The HTML BLAST report gives a human-readable list of hits, followed by the best alignment for each hit. An excerpt is shown an right. The list contains links both to the GenBank entry for the hit at NCBI, as well as a link to the alignment.

A tab-separated value (tsv) file with more detailed hit information is displayed in blnfetch, a BioLegato interface specialized for displaying and retrieving hits from nucleotide databases.

blnfetch is a simple spreadsheet that makes it easier to sift through large numbers of hits, and includes sort and extract functions which are valuable for selecting subsets of hits eg. hits from a particular species. Save the BLAST report using the file name AY313169.tblastn.htm (File --> Save Page As).

Before saving the tsv file, we need to make sure that there aren't any sequences too large for our purposes eg. in a classroom tutorial. This example illustrates the value of having BLAST results in a table. Note that the length of each hit found is given in the field labeled 'subject length' (column I).

We can sort the table based on the 'subject length' by Edit --> Select All, and then Edit --> BLSORT.

Since column I is the ninth column, set the first sort key to column 9, and sort order to descending.

We see that 7 sequences over 33 million bases long, so for the purposes of saving disk space we want to delete those from the file.
Make sure the Selection mode at bottom is set to "by row", select these rows, and then delete using Edit --> Delete Rows.

The data from the new blnfetch window can be saved by selecting all hits (Edit --> Select All), and then File --> Save SELECTION As AY313169.tblastn.tsv.

Output:
AY313169.tblastn.html - program output
AY313169.tblastn.tsv - table of hits

Output also appears in BLAST Viewer, giving another avenue for exploring the search results.

It is worth sampling as broad a taxonomic range as possible. Both of the previous test sequences came from dicotyledonous plants. Therefore, we'll show two more examples one from the dicotyledonous plant chick pea (Cicer arietinum) , and the other using a defensin-like sequence from the monocot wheat (Triticum aestivum).

To save a few steps, you can send the output directly to files, bypassing the popup windows.

Go to the Output tab, and set the BLAST REPORT to HTML file, and the TABLE OF HITS to tsvfile. Fill in a base name to be used for the two output files as shown at right.

Repeat this search with the wheat sequence AB089942.

Output:
          DQ288897.tblastn.html - program output
          DQ288897.tblastn.tsv - table of hits


Output:
        AB089942.tblastn.html - program output
        AB089942.tblastn.tsv - table of hits

Remember: it will be necessary to open both DQ288897.tblastn.tsv and AB089942.tblastn.tsv delete any hits that are too large.

4. Create a combined dataset

The easiest way to combine the entries in temp.gen with those listed as significant hits in the .acc files is to create a combined file of Accession numbers. The Accession numbers must be parsed out of the lines in each GenBank entry that contain the Accession numbers. For example:

ACCESSION KJ939334

We can generate such a file for temp.gen as follows:

grep ACCESSION temp.gen > temp.ACCESSION
cat temp.ACCESSION |tr -s " " | cut -f2 -d " " >temp.acc

The first command creates a file containing all the lines from temp.gen. The next command uses this file as input and pipes the output, first to the tr -s command, which compresses multiple blanks into single blanks, and then to cut. cut extracts the second field of the output (where fields are delimited by blanks) and writes the output to temp.acc.

The next step is to create a non-redundant namefile, containing all Accession numbers from all .tsv files. What we want is to first exclude the pseudo-comment lines that begin with hash marks, and then to cut out ACCESSION numbers, which are in the first field of the tsv file. We should be able to specify all tsv output files as "*.tblastn.tsv", but check that first by typing "ls -l *.tblastn.tsv". The command to extract the ACCESSION numbers from all of the tsv files in a single go would therefore be

grep -v '#' -h *.tblastn.tsv | cut -f1 > tblastn.acc

The -h option is necessary because there is more than one input file. By default, when grep has more than one input file, it will print the name of each file for each line printed. The -h option suppresses printing of filenames. Note that because this is a tsv file (ie. tab-separated value) the delimiter used by the cut command must be the TAB character! Fortunately, cut defaults to use TAB as a delimiter, so we don't need to specify a delimiter in this particular case.)

Finally, we need to combine the two .acc files and eliminate the duplicates.

cat temp.acc tblastn.acc | sort | uniq > all.acc

cat effectively combines these files into a single output stream. Sort, as the name implies, sorts the lines in the ouput stream, and uniq eliminates duplicate lines from a sorted file. See the manual pages for each command to get a better idea of what they do (eg. man sort, man uniq).
These tasks could be done in one step by piping output from one command to the next, eliminating the temporary files and saving some typing:

As a check, we can use the wc command to tell us how many lines (ie. accession numbers)
are in the various .acc files:


{mars:/home/psgendb/tutorials/bioLegato/dataset}wc -l *.acc
  500 AB089942.tblastn.acc
  401 all.acc
  102 AY313169.tblastn.acc
   85 DQ288897.tblastn.acc
  409 tblastn.acc
  123 temp.acc
 1620 total

Note that the combined lengths of temp.acc and tblastn.acc add up to 552, and all.acc is only 434. This is because the tblastn searches will often find the same sequences, so there is some redundancy in the .acc files.

Next, we want to retrieve the corresponding sequences. The BioLegato interface blnfetch is specialized for retrieving entries from GenBank, using ACCESSION numbers. Type

blnfetch all.acc

Choose Edit --> Select All to select all ACCESSION numbers (blnfetch will ignore the blank line). Next, choose SEQFETCH, and send output to a file called defensin.gen. (For classroom tutorials where disk space is limited: Set Max. Seq. Length to 100,000, so that only shorter sequences will be retrieved.)

5. Large GenBank entries may contain many protein coding sequences. Make sure that your dataset contains only homologues from a single gene family.

The goal of this section is to create a dataset of protein coding sequences (CDS) from defensin.gen, which only code for defensin, and eliminate those CDS which code for other proteins. The steps are as follows:

extract all CDS sequences to a new file
translate all CDS sequences to protein, and save those to a file
use FASTA to get a list of sequence names for proteins that match defensins
use the list of defensins to extract the defensin CDS sequences, and save those to a new file

As done previously, open defensin.gen in the bioLegato instance bldna,

and extract the coding sequences (CDS) using Database --> Features - Extract by Feature Keys.
Send output to a new bldna window, which will look something like this:

Save the sequences in this file by choosing File --> Save All As, and call the file all.CDS.fsa. Make sure you choose the Fasta format for saving the file. This file will be read later by bldna.

Note: Do NOT use Export Foreign Format. This function uses readseq to interconvert formats. However, readseq will truncate sequence names if they are longer than 16 characters. The added ':CDSxx' puts these names over the 16 character limit. Consequently, when there are multiple coding sequences in a given entry, several sequences will end up with the same name.

As more and more large genomic fragments are entered into GenBank, FASTA searches will find sequences that match only a small part of the total fragment. That is, the gene you are using as a query sequence will be only one of many genes on the fragment. Thus, FEATURES will generate all protein coding sequences (CDS) for all genes on each fragment.

We can create a small database of all proteins translated from the CDS sequences, and then search with FASTA to identify which encode defensins.

As was done previously, translate all CDS sequences to proteins by choosing Edit --> Select All and using DNARNA --> Ribosome.

In the new window, select all sequences (Edit -- Select All), and choose File --> Save ALL As, and call the file all.pro.fsa.

The file will contain the protein sequences from all of the CDS sequences.

We can use the same three defensin sequences as used in the original database search to search all.pro.fsa to find defensins. To start, find the AY313169 amino acid sequence using Edit --> Select sequence by name

Next, choose Database --> Fasta (protein vs. protein database).

This time, instead of searching GenBank, search the data set in all.pro.fsa.
Use ssearch, which constructs a rigorous Smith/Waterman alignment for each pairwise comparison.
Set the E value, which is the cut off for listing hits. I this case, hits will only be included in the output if they have a probability of matching by random chance of less than 0.0001.
By default, the FASTA programs calculate statistics assuming that significant hits will occur with only a very small fraction of sequences in the database. This assumption is invalid with the all.pro.fsa dataset. Setting the -z 11 option tell FASTA to calculate E values for a population in which most sequences are closely-related. If you do not set this option, you will miss most of the homologous coding sequences!

This is what it should look like with all the settings:

Repeat the process for the other two sequences. There should now be three blpfetch windows with the results from each of the searches:

The ssearch results in the three output files tell us which CDS sequences are defensins. The next step is to combine the results into a single window, so that we can create a non-redundant list of sequences.

Simply choose any of the three blpfetch windows to combine all hits, and copy the hits from the other two windows into it. As an example, we'll use the blpfetch window for which the query was AY313169 for combining hits.

Go to the blpfetch window in which the Query on line 3 says "DQ288897" and choose Edit --> Select All, and then Copy out. This copies all lines to the blpfetch clipboard. Next, go to the blpfetch window for Query: AY313169, and choose Edit --> Paste in. Repeat this process to copy all hits from the AB089942 window to the AY313169 window.

To verify that that multiple sets hits are now in the AY313169 window, scroll down. You should see the end of one list, and the beginning of another, as indicated by the comment lines beginning with hash marks (#).

Optional: Save the list as a file.
Edit --> Select All. File -->
Format: tsv
Name: defensin.pro.hits.tsv

Finally, we can create our file of CDS sequences for defensin. If you still have the bldna window with the complete set of CDS sequences, you can proceed now. Otherwise, open a new bldna window, and read in all.CDS.fsa.

Next, we need to get the list of hits to paste into the bldna. Go back to the blpfetch window for AY313169, containing the combined list of hits, and set the Selection mode at the bottom to "by column". Next click in column A. The window should look similar to the one at right.

Copy the selection into the clipboard:

Linux: ctrl-C
Mac: command-C

Next, go to bldna and choose Edit --> Extract subset of sequences to new window and paste the contents of the clipboard into the Names/Accession numbers box.

Linux - ctrl-V
Mac: command-V

If you scroll to the top, the menu window should look similar to the one at right.

Clicking Run will extract only those sequences whose names appear in the list into a new dataset.

Save these sequences in File --> Save As, under the name defensin.CDS.fsa. Make sure to write the file in Fasta format.

This illustrates an important point regarding database searches. Most database searches will only find the BEST alignment between a query sequence and a sequence in the database. In the case where several homologous genes are present in a single database entry, only the best match with the query sequence is likely to be reported, and the others ignored! If we hadn't been careful, we would have missed 3 out of the 4 defensin-related genes in AC005936.

To be complete, let's also save the amino acid sequences corresponding to these CDS sequences. Choose DNARNA --> Ribosome:

Save these sequences as defensin.pro.fsa.

The dataset is complete!

It is now ready for creating multiple sequence alignments, and from there, for phylogenetic analysis.

6. Dot-matrix searches are good for uncovering arrays of related genes on a single sequence

Another way to make sure that we have gotten all copies of a gene is to do a dot-matrix search comparing one or more known defensin genes with the large genomic sequence. This requires that we have both sequences in the same bioLegato window. To do this, first select the genomic sequence AC005936 and choose Edit --> extract to put it into a new bldna window. To copy AC005936:CDS9 into the same window, select it, and then choose Edit --> Copy out. This copies the sequence to the bioLegato clipboard file ($home/.GDEclipboard). Next, move to the new bldna window containing the complete genomic fragment and choose Edit --> Paste in. Both sequences should now be in the window. Finally, make a copy of the defensin sequence and create its inverse complement, so that both strands of defensin can be compared with the genomic sequence.
Recall to inverse a complement you choose DNARNA --> blrevcomp -->reverse compliment

The comparison of the forward strand of AC005936 with the opposite strand of CDS9 shows no compelling similarities, which may not be surprising considering the fact that CDS9-12 were all on the forward strand, according to the Features Table.

CDS.vs.genomic.d4hom

CDS-opp.vs.genomic.d4hom

The clustering of members of the defensin gene family on one strand, within several kilobases is probably evidence that these genes have been duplicated by unequal crossing over in recent evolutionary time.

8. SUMMARY: Guidelines for making datasets of ortholgous genes

Start with a keyword search

Using hits from the keyword search, do database searches using protein sequences for maximum sensitivity

Retrieve statistically significant hits

Some GenBank entries may contain multiple copies of a gene. However only the best will appear on the database search output. Always check the features table, or do a similarity comparison between representative copies of the gene and all CDS features from all entries.

Make sure to do similarity comparisons with the a query sequence versus the genomic sequence, to detect clusters of related genes.

Not all hits that are statistically signficant should appear in the dataset. Several rounds of multiple alignments and phylogeny constructions may be necessary to determine which genes should be kept in the set.

Generally, it is best to eliminate all but one sequence from any set of sequences where all share 100% identity.