TUTORIAL: SEARCHING LOCAL SEQUENCE DATABASES USING BLAST

Oct. 5, 2023

NCBI BLAST Documentation

NCBI BLAST Handbook

BLAST searches through a Data Science Lens (34:28)

Example: Antifreeze Proteins

This tutorial assumes that local copies of the Uniprot (Swissprot), NCBI RefSeq RNA and the Non-redundant nucleotide (nt) databases . If local copies of databases are not available, NCBI databases can be searched remotely using NCBI blastp, NCBI blastn etc.

1. Read in the test sequence and generate the corresponding amino acid sequence

This tutorial will introduce the several types of sequence database searches, using the antifreeze protein from the sea raven, Hemitripterus americanus. The cDNA sequence for this gene is found in SRAAFP.gen.

Create a directory called 'db' and save this sequence to the db directory.

To read this sequence into bldna, choose 'File --> Open', and select SRAAFP.gen.

Since this tutorial will demonstrate database searches using both DNA and protein sequences as queries, we need to generate the amino acid sequence. This is done in two steps. First, we need to extract the protein coding sequence from the cDNA. (Remember that mRNAs contain both 5' and 3' untranslated sequences. If you look at SRAAFP.gen, you'll see that the CDS (protein coding sequence) runs from positions 10 to 597. We will use the FEATURES program to extract the CDS, and then translate the CDS using RIBOSOME.

Make sure SRAAFP is selected, and choose Database --> FEATURES (Extract by feature keys).

Click on 'Run' with the above settings. The CDS will appear in a new bldna window.

Note that the CDS begins with a start codon, indicating that this cDNA contains the entire protein coding region.

Next translate the CDS into protein. Select SRAAFP:CDS1, and choose DNARNA --> Ribosome. By default, Ribosome sends the output to a new blprotein window.

2. Local BLAST Databases

The slowest part of a BLAST database search is simply the process of reading the data from disk. Therefore, it us useful to have an idea of how large a database is before running a BLAST search.

BIRCH maintains a list of locally-installed BLAST databases as a TSV file that can be read by any spreadsheet program. The spreadsheet can be viewed on the BIRCH web site at Local System --> BLAST databases. They can also be viewed in bldna or blprotein in Database --> BLASTDB - report on local BLAST databases.

This spreadsheet is meant as a guide for managing diskspace, but is also useful if we want some idea of the size of various databases before doing a BLAST search.

Some databases are small, such as the UniProt/Swissprot database (356 Mb).

At the other extreme, ref_euk_rep_genomes is about 562 Gb.

Aside from the time required to read a database file from disk, BLAST works most efficiently when the entire database can be loaded into memory (RAM). A good rule of thumb is that a database will require roughly half the RAM of the size of the database file on disk. For example, refseq_rna is about 50 Gb, so it would need about 26 Gb of RAM. Where the database is larger than available RAM, data will have to be swapped between memory and disk, which slows the search substantially.

3. The Database Menu

bldna

blastn (DNA/RNA vs. DNA database) - Compares a DNA or RNA sequence against a DNA database, or a user-created file containing DNA or RNA sequences. DNA/DNA searches usually only detect matches between closely-related sequences.
blastx (cDNA vs. protein) - Translates a DNA query in 3 forward and 3 reverse reading frames. The query is searched against a protein database, or a user-created file of amino acid sequences. The best program to use when the query sequence is likely to contain errors, such as the result from a sequencing reaction eg. Expressed Sequence Tag.
tblastx (translated DNA query vs. translated DNA database) - Translates in all 6 reading frames both the query and the sequences in the database and compares the resultant amino acid sequences.

blprotein

blastp (protein vs. protein) - Compares an amino acid sequence against a protein database, or against a user-created file of amino acid sequences.
tblastn (protein vs. DNA database) - Compares an amino acid sequence with DNA sequences translated in all 6 reading frames. All DNA is translated, even non-coding DNA. This search is more sensitive than a DNA/DNA search.

3. blastp - Protein vs. Protein

Protein vs. Protein searches are both the most sensitive and the fastest types of search. Due to the degeneracy of the genetic code, protein vs. protein searches are more likely than DNA vs. DNA searches to detect more distantly-related genes. As well, since amino acid sequences are only 1/3 the length of the corresponding DNA sequences, protein searches require only (1/3)² = 1/9 the computational time.

In blprotein, select the protein sequence SRAAFP:CDS1, and choose Database --> blastp (protein vs. protein database) or NCBI blastp.

First, choose a database. The quickest one to search is UniProt (swissprot), which contains a small set of carefully-chosen proteins representative of almost all known proteins. This is probably the most efficient database for gene annotation or gene identification ie. which protein does my sequence code for?

For DNA searches a very simple scoring scheme is used, eg. +1 is added to the score for a match, and -1 is added to the score for a mismatch. For protein searches, the greater diversity of amino acids lends itself to more complex scoring schemes. The menu shows a variety of protein scoring matrices. The Blosum matrices were constructed using distantly-related sequences. If you need a highly sensitive search, use a low-numbered Blosum matrix eg. Blosum45. The PAM matrices were constructed from a set of closely-related sequences. Two PAM matrices are also provided for use with short query sequences. More information of scoring matrices can be found in
Pearson WR (2013) Selecting the Right Similarity-Scoring Matrix. Curr Protoc Bioinformatics. 2013; 43: 3.5.1-3.5.9.

Click on Run to start the search.

Output will go to three windows. First, the BLAST report will appear in a web browser,

Note that hits are sorted with the lowest (most significant) E-value first, with decreasing statistical significance going from the top down. All of the matches shown are extremely unlikely to have occurred by random chance, so all are extremely significant. Remember that by definition, the E-value corrects for the size if the data base and the length of the pairwise alignment between the query and each hit.

Next, the hits will pop up in BLAST-Viewer:

If you open the Graphic tab, you will see a how hits map to the query sequence:

Each arrow represents the extent of the alignment between the query and one of the hits. Ideally, one might expect each arrow to span the entire length of the query. What this alignment shows is that the highest-scoring alignments do in span most of the query. The lower scoring alignments are sequences in which the best local alignment spans only part of the query, particularly going from around 60 to 80 aa's in the query. This is consistent with the observation that proteins often allow more rapid sequence divergence in their N-terminal regions.

Click on the MSA tab to see an alignment of the hits with the query, along with a sequence logo of the alignment.

Keep in mind that the multiple alignments produced by BlastViewer are done using some very quick shotcuts and should not be considered to be very reliable. If you want to do a robust multiple alignment, see the BIRCH Multiple Sequence Alignment tutorial.

Finally, the Swissprot hits will appear in blpfetch, a BioLegato application specialized for retrieving database entries.

The results in these windows are in temporary files that will be deleted once each window is closed. To save the BLAST report from the web browser, choose File --> Save As, and type in a filename. A good name that keeps track of the sequence used, the program run, and the type of file would be SRAAFP.blastp.swissprot.html.

Similarly, the results in blpfetch can be saved from blpfetch by choosing File --> Save Selection As SRAFFP.blastp.swissprot.tsv.

A quick word on CSV, TSV and text files, and BioLegato
Most spreadsheet programs can read and write formats called CSV and TSV. CSV files are comma-separated value text files, in which each line in the file represents a row in a spreadsheet. The values for each column in a row are separated by commas. TSV files are TAB-separated value text files. In these files, values in each row are separated by TAB characters. TSV files are probably safer to use, because CSV files make it impossible to include commas within fields.

BioLegato programs such as blnfetch, blpfetch and blncbi have a table canvas, that works like a spreadsheet. In these programs, the Open menu can only read text files with a .csv or .tsv file extension. To read files with other extensions eg. .acc, you need to use File --> Import Data from CSV file. Text files such as the one generated above will be read using this approach.

Retrieving your hits
If you decide to retrieve all the hits, this can be done from blnfetch. Choose Edit --> Select All, and then Database --> SeqFetch. By default, SEQFETCH is set to retrieve GenBank entries to a new blprotein window. If you had a large number of entries, it would probably be best to sent output to a file.

Entries retrieved to a blprotein must be saved if you wish to keep them. An example can be found in SRAAFP.blastn.swissprot.blpfetch.gen.

Hint: Oligopeptides need logarithmic score weighting
For very short query sequences such as oligopeptides, unweighted scores would not exceed the minimum cutoff values needed to appear in the output. The BLAST programs allow you to specify a logarithmic weighting ratio that gives short query sequences higher matches. In the General tab, choose Type of search: blastp, adjust parameters for short queries. The score is weighted by the natural log of the query length divided by the natural log of the database size.

4. blastx - translated DNA vs. Protein

In some cases your might have a cDNA, a gene fragment or even a sequencing read. Single reads are more likely to contain errors, while genomic fragments might contain introns or other non-coding sequences interspersed with coding sequences. Even if you don't know the exact translation start and stop sites or the reading frame, or even the coding strand, it is useful to do a crude translation of the gene fragment into amino acids. BLASTX simply takes each group of 3 nucleotides in each of 3 reading frames on each strand and uses these crude translations as the query.

In this example the query will be the CDS feature that was extracted from the complete cDNA:

Select SRAAFP:CDS1, and choose Database --> BLASTX.

Set Database to Uniprot (swissprot).

Again this should be a very quick search because the database is small.

Output

SRAARFP-CDS1.blastx.report.html - BLAST report
SRAAFP-CDS1.blastx.tsv. table of hits
SRAAFP-CDS1.blastx.xml. - XML file for BlastViewer

In the BlastViewer output, it is worth noting that the extent of the matches between query and database sequences is the same as in the protein vs. protein search. That is, all matches are on the forward strand (left to right arrows). Many of the hits show little or no similarity at the N-termini, which is consistent with the observation that in many protein families, N-terminal sequences diverge rapidly, while core sequences tend to be more conserved.

Note on automatic translation:

Programs that translate sequences on the fly (eg. BLASTX, TBLASTN, TBLASTX) have no knowledge whatsoever about gene structure (ie. exons, introns, 5' UTR, 3'UTR). All these programs do is take every group of 3 nucleotides and assign a codon to it. Even stop codons are represented by an asterisk (*). Consequently, non-coding sequences and non-coding open reading frames are translated into meaningless amino acid sequences.

5. blastn - DNA vs. DNA

DNA vs. DNA searches are far less sensitive than protein vs. protein searches. DNA searches only work for closely-related sequences.

As a query, we'll again use the SRAAFP:CDS1.

Note: The BLAST programs search BOTH strands of DNA database sequences, by default.

The quickest one to search is probably RefSeq RNA, which contains the transcripts and coding regions from most known genes encoding RNAs and proteins. Non-coding genomic regions, which make up the vast majority of eukaryotic genomes, are omitted from this dataset, which is consequently a lot smaller and faster to search.

In your bldna window, select the amino acid sequence SRAAFP:CDS1, and choose Database -->blastn (DNA vs. DNA database).

In order to detect more distantly related sequences, set Type of search to blastn (default is Megablast), and set E: # of matches expected by random chance to 1.0e-5 (stands for 10^-5).

RefSeqRNA is a fairly large database (~50 Gb at this writing.) Since the search may take awhile, let's tell BioLegato to send an email notifying us when the search is complete.
Consequently, it may be convenient to send output directly to files, rather than to the screen. The search can run in the background, and the user can view the files at a later time.

Open the Output tab and set Notify of completion by email to Yes. It is usually most convenient to set the email address to one that will go to your phone or mobile device.

We want output to be saved to files, rather than popping up on the screen. Choose HTML file for the destination of the BLAST report, the tsvfile for the table of hits, and xmlfile to create an XML file that could be read by BLASTviewer. Set the base name for these files to SRAAFP-CDS1.blastn.All filenames will begin with this basename, and will get file extensions .html. .tsv and .xml, respectively.

How many threads should I use?
BLAST can break the database search into 2 or more threads which work simultaneously on different parts of the problem. The more threads, the faster the search will be completed. Most desktop computers and laptops will have at least 2 or 4 CPUs, while the example at right is taken from Biolegato running on a server with 64 CPUs.
In the General menu, BioLegato will set the default number of threads to 1/4 of the number of CPUs available on your computer. If you increase the number of threads, always use at least some free CPUs for other tasks on the system. In a workshop setting where many students are running BLAST searches at the same time, it is best to limit each student to a few threads. Each search will take longer, but that is a trade-off to avoid an overall degradation of system performance.

Hint: For more distantly-related sequences, use Word size = 4 or 6.

In the Scoring tab, the word size is 11 by default. Setting a smaller word size would provide some improvement in sensitivity, at the expense of speed. It is usually better to allow a larger E value (eg. 1.0e-5) to detect more distantly-related matches.

Output files:

BLAST Report - SRAAFP-CDS1.blastn.html
table of hits - SRAAFP-CDS1.blastn.tsv
XML file - SRAAFP-CDS1.blastn.xml

What factors affect BLAST performance?

The BLAST programs are computationally very fast. By far the most important factor affecting performance is the fact that BLAST reads the entire database from disk into memory.

When databases reside on an internal disk drive, the database can be read quickly. Solid-state drives have faster read times than magnetic (HDD) drives. On server systems where files are mounted remotely across the network from a file server using NFS, the time needed to read the database can be very long. For example, This search took 37 min. on a using 20 threads on an 80-core Dell R815 Linux server with 1 Tb of memory, with many sessions running by other users.

(As an interesting aside, NFS filesystems will cache files on the local machine. That means that once you do one search on a particular database, subsequent searches will be lightening fast, because the database is already cached on the local machine.)

RAM also matters. If the size of the database is larger than available memory, BLAST searches may take much longer because the data will need to be swapped between disk and memory.

5. tblastn - Protein vs. Translated DNA

As mentioned previously, DNA vs. DNA searches are not very sensitive for detecting distantly-related sequences. tblastn makes it possible to leverage the sensitivity of protein comparisons using a DNA database.

For this example we will use the translated SRAFFP protein as the query against the Non-redundant nucleotide database (nt).

This is a BIG database, so the search will take a long time!

Each DNA sequence in the nt database is translated on the fly in either 3 or 6 reading frames (ie. one or both strands) . ALL of the sequence is translated, whether coding or non-coding, intron or exon.

By default, tblastn is set to search the GenBank Non-redundant nucleotide database.

Set E: # matches expected by random chance to 1.0e-11.

The nt database contains many genomic sequences with short tandem repeats, which are referred to as low complexity regions. Low complexity sequences might skew the distribution of scores. Therefore, it is best to turn on filtering in the Query filtering tab, as shown at right.

Since the non-redundant nucleotide database is quite large (262 Gb at this writing), the search may take awhile. Consequently, it may be convenient to send output directly to files, rather than to the screen. The search can run in the background, and the user can view the files at a later time.

In the Output tab, choose HTML file for the destination of the BLAST report, the tsvfile for the table of hits, and xmlfile to create an XML file that could be read by BLASTviewer. Set the base name for these files to SRAAFP-nt.tblastn.

Output files:

BLAST Report - SRAAFP-CDS1.tblastn.nt.html
table of hits - SRAAFP-CDS1.tblastn.nt.tsv
XML file - SRAAFP-CDS1.tblastn.nt.xml

6. tblastx - translated cDNA vs. Translated DNA

Finally, tblastx will translate a DNA query, as well as sequences in the database to give the sensitivity of a protein to protein search based on DNA sequences. This example will compare the SRAAFP-CDS with the non-redundant nucleotide database.

The main search parameters remain the same.

Remember to set Filer out low complexity regions with SEG to "Yes" in the Query filtering tab.

Output files:

BLAST Report - SRAAFP-CDS1.tblastx.nt.html
table of hits - SRAAFP-CDS1.tblastx.nt.tsv
XML file - SRAAFP-CDS1.tblastx.nt.xml

It is instructive to see that graphic alignment output from BlastViewer.

Clicking on individual arrows will show that each arrow may come from a different species, and even from different strands. The main reason is that the nt database contains a wide variety of sequences from genomic or cDNA fragments. These fragments may contain genes or only part of genes, and may be on either strand relative to the query. In some cases, a non-overlapping "alignment" may be found between parts of the query, and parts of different genes, which ends up putting them onto the same line. The fact that tblastx also translates the query in all 6 reading frames further contributes to this patchwork of partial hits. This illustrates the compromise we make by translating both the query and database.

The main usefulness of tblastx therefore is for those cases when all you have for a query is a fragment of a cDNA or a gene, you aren't even sure of which strand is the coding strand, and you want to detect distant hits from as broad a range of sources as possible.

7. Retrieving Hits

1. blnfetch, blpfetch - retrieve sequences from NCBI using NCBI Eutils

When you have saved the names or accession numbers of database hits, the sequences can easily be retrieved using blnfetch for nucleotide sequences, and blpfetch for protein sequences.

For example, to retrieve protein sequence hits from the blastp search in part 3 above, start blpfetch at the command line by typing blpfetch.

Read in the names from SRAFFP.blastp.swissprot.tsv using File --> Open.

To retrieve the sequences for these hits, choose Edit --> Select All, and then Database --> SEQFETCH.

Next, choose Database --> SeqFetch - Retrieve protein entries from NCBI. By default, sequences will be retrieved in GenBank format.

Note: When there are a large number of sequences you may wish do select some subset of the hits for retrieval.

The sequences will pop up in a blprotein window:

The exact same procedure can be used to retrieve DNA sequences using a list of accession numbers as input for blnfetch.

2. NCBI Batch Entrez - retrieve sequences from NCBI web site

The NCBI web site offers another way to retrieve the GenBank entries associated with your accession numbers, we'll use Batch Entrez at https://www.ncbi.nlm.nih.gov/sites/batchentrez.

BatchEntrez reads a file of accession numbers and returns sequences. Therefore we first need to save the accession numbers previously selected to a file. In blpfetch, make sure the accession numbers you want are selected and then choose File --> Export accession numbers to a file. Set an output filename eg. SRAAFP.blastp.swissprot.acc.

Go to the batchentrez link above and click on Browse. In the file chooser, select

Make sure Database is set to Protein. Now, click on Retrieve. A message will appear saying something like "Retrieve records for 46 UID(s). (Occasionally, a list of accessions will find additional entries.) Click on that link.

Click on Send to and fill in the following:

Destination: File
Format: GenPept (full)

When you click on Create File, a file chooser will pop up.

The default name will be something like "sequence.gp", but it is good practice to give it a name consistent with the rest of the files in this exercise. A good name would be

SRAAFP.blastp.swissprot.hits.gen

NOTE: Remember to save this file in the current working directory! Your file chooser will show a different location from the one shown a right.

This file will contain full GenBank flatfile entries for all of the blastp hits.