BIRCH

return to tutorials

TUTORIAL: DATA PIPELINING WITH BioLegato


  Nov. 8, 2021


NCBI BLAST Documentation

NCBI BLAST Handbook

NCBI BLAST Topics

VIDEO:BLAST searches through a Data Science Lens (34:28)



Goal: This tutorial illustrates how ad hoc pipelining with BioLegato applications empowers the researcher to address biological questions through as series of tasks. Using the SARS-Covid spike protein as a starting point, the workflow includes:

Example: SARS-Cov2 spike proteins

0. Background

As an example we’ll start with the amino acid sequence for the SARS-Cov-2 Spike (S) protein, which is the best potential vaccine target. Yuan and colleagues have used X-ray crystallography to determine the 3D structure of the spike protein when complexed with a neutralizing antibody from a patient infected with a SARS virus from 2007.

The authors demonstrated that decreased binding of the 2007 antibody to the 2020 Spike protein could be explained by amino acid substitutions at stearicly critical positions in the antibody binding site, compared to the earlier spike protein from 2007.

In panel B we see the antibody heavy chain in gold. The 2007 protein has an Alanine at position 372 creates an N-glycosylation site at posn. 370. Mutation to a Threonine at 370 eliminates that N-glycosylation site (Fig. 2C)
In this light, it would be instructive to compare Spike proteins from a number of SARS strains for a more thorough understanding of amino acid conservation at these positions.
2D - Antibody binding may also be affected by substitution of the Alanine at position 384 in the 2007 protein, with a Proline, which is likely to result in conformational changes.


From an epidemiological perspective, it would be useful to know whether these mutations are very recent, or whether they have been present in distinct strains of the SARS virus for a long time.

For that we need to compare our two spike proteins with spike proteins from other SARS strains in a multiple sequence alignment. The best way to find those sequences is to use the these spike proteins as queries in BLAST searches.

That means that first we have to retrieve the two proteins described in this paper, using the accession numbers from the manuscript.

1. Retrieval of SARS-Cov and SARS-Cov2 using NCBI Entrez Eutils

In your tutorials directory, create a directory called 'pipeline' and go to the pipeline directory

cd tutorials
mkdir pipeline
cd pipeline


Launch blncbi in the background.

blncbi &


Blncbi lets you send keyword queries to the NCBI Entrez service, and retrieve sequences that match your query.

The query builder lets you set values for different data fields, and put together query statements using AND, OR and NOT operators, and group them using parentheses.

Since we already have the accession numbers from the paper, our query is very simple: Just set the query terms to Accession, and paste in the accession numbers
Choose Database --> Protein.

For the first query term, set the field to Accession, and the value to QHD43416

For query term 2, set OR, so that sequences matching either of the queries will be found. Set the field to Accession and the value to ABF65836.

These settings will send the following Entrez search expression to NCBI:

QHD43416 [Accession] OR ABF65836 [Accession]


Click 'Run' to send the query to NCBI.

The results pop up in a new blncbi window.


Choose Database --> PROTFETCH.

Click on Run. The GenBank protein entries will be retrieved to a bldna object.


The GenBank protein entries will be retrieved to a blprotein object.





To save these entries, choose File --> Save ALL As

Set the file name to SpikeQueries.pg and set the file type to GenBank.

At any later time, you could open this file in blprotein. For now, it's worth having a look at the sequences. In the file manager, double click on the file to open it in a text editor.



2. Pairwise similarity alignments to evaluate relatedness of spike proteins from 2007 and 2020.

One of the strengths of BioLegato is that is lets you stop and an point and learn something about your data. First let's run a simple program that gives us some basic protein statistics.

Choose Edit --> Select All and then Protein --> Protein statistics. Set Output to Table, and run the program.


We can see that these are moderately sized proteins, over 1200 amino acids each, or over 13 kDa in molecular weight. Both are slightly acidic, with the 2020 protein being somewhat less acidic than that from 2007.




In blprotein, choose Similarity --> SSEARCH.  SSEARCH performs a rigorus Smith-Waterman alignment between the two proteins, resulting in an optimal alignment. To evaluate the statistical significance of the alignmet, SSEARCH will repeat the alignment 500 times, randomizing  the sequences in each case. If the score of the optimal alignment is better than alignments of randomized sequences, the optimal alignment is considered better than an alignment of proteins of similar length and amino acid composition, by random chance.



A partial output is shown at right, and the entire file is linked below.

We can see that these proteins show high similarity along their entire length (91% similarity over 1277 aa overlap). This indicates that there haven't been any major rearrangements, duplications, deletions or insertions, and likely no new domains since they diverged from a common ancestor.

Even the N-terminal region is highly conserved, which is significant since N-termini of protein families often show tremendous polymorphism.



queries_ssearch.txt 

3. BLAST search to find spike proteins from different species by searching UniProt

In this section we will use the two query sequences to create a dataset of spike proteins for cross-species comparison. We'll then retrieve the sequences found by combining results of the two BLAST searches.

In blprotein, select the 2020 sequence, QHD43416, and choose Database --> BLASTP. (If you don't have a local copy of Uniprot, use NCBI BLASTP instead.)

Click Run to begin the search. Even on personal computers, results should pop up in less than a minute.


The first output to appear is the familiar HTML report. Note that clicking on a score will move to the alignment, and clicking on an accession number will open the GenPept entry at NBCI.




Next, a tabular BLAST report will pop up in blpfetch, a BioLegato application for viewing and retrieving hits.




Finally, BlastViewer opens with numerous options for viewing results.




The spreadsheet lets us quickly scan through hundreds or even thousands of hits, and narrow down the hits to just the ones we're looking for.

Subject line is same for all, so keep it closed. The Title line is more informative, so we open that column wide
DECISION POINT: For the best hits, the alignment length is roughly the same as the subject sequence length. This means that the sequences share significant similarity along their entire lengths. The other sequences have shorter alignment lengths, which means either substantial divergence, sequence rearrangements or maybe recombination products.

For now, we'll stick with the ones that allow a more global alignment. Set "Selection mode: by row" and select the hits whose alignment length is greater than 1200 aa (we'll ignore the mouse hepatitis sequence):


Choose Edit --> Extract to extract these hits to an new blfetch object.


At this point, let's repeat the process using the 2007 sequence (ABF65836) as a query. We'll only show the blpfetch output here:



We can easily combine our two sets of hits using BioLegato's Copy out and Paste in functions.
Again, select by row, and this time do the following:

Edit --> Copy out
Move to the other blpfetch object with hits previously-extracted
Edit --> Paste in



It is apparent that is a lot of redundancy in the dataset. That shouldn't be surprising since the two query sequences are 91% similar. We won't worry about eliminating duplicates now, because seqfetch.py automatically does that. To retrieve these sequences, choose

Edit --> Select All
Database --> Seqfetch
Run

Finally, use Copy out/Paste in to copy the two original queries into this blprotein object. The final set of proteins should look like this:


4. Multiple alignment of spike proteins to determine conservation of antigenic sites

Multiple alignments provide important insights into the critical functional and structural sites in proteins. The better conserved a site is, the more likely it is to have a critical function. Less conserved sites have more freedom to mutate. In the case of antigenic determinants, it is actually the sites that have changed along with antigenicity, that we are interested in.

Before doing a multiple alignment, it's important to eliminate any duplicate sequences. Duplicate sequences would carry more weight than sequences represented only once, thus biasing the alignment.

Select all sequences in blprotein. Next choose Alignment --> cd-hit. Set the sequence identity threshold to 1.0, so that only perfectly matching sequences will be eliminated as duplicates.
Click on Run






For this simple dataset, the report tells us that one pair of duplicate sequences was found. Arbitrarily, QHD43416 was retained, and SPIKE_SARS2 was deleted.

Choose Alignment --> MAFFT. MAFFT has numerous alignment algorithms based which vary the number of iterative improvement to the alignment. As well, while most models consider the entire alignment as a single conserved block, E-INS-i assumes multiple conserved blocks punctuated by unalignable sites, and L-INS-i with 1 conserved block flanked by unalignable sites.

By default, MAFFT will do a global alignment, and will choose the best algorithm based on the number of sequences.


Yuan and co-workers showed that the antibody binds to the Cyan colored sites in the spike protein. You can see four asterisks, which mark the amino acid substitutions that appear to affect the binding affinity between the protein and the antibody.
 
First let’s have a look at the posn. 372, where the 2007 protein has a Threonine, while the 2020 protein has an  Alanine. In the multiple alignment of human and bat proteins, this is at at position 412. The alignment shows us that the Alanine  is unique to the 2020 spike protein (QDH43416). Most of the human and bat proteins have Threonine at this position. This Alanine in the 2020 protein appears to be a fairly recent evolutionary event. Remember that  the Ala substitution introduces an N-glycosylation site at the Asn, 410 in our alignment.


It’s interesting that at  the 424 position the 2020 spike protein has a Proline, also seen in the Bat spike proteins, while the other two human proteins have Alanine at that position.


We may be able to learn more by looking at a sequence logo at these sites.
Select all sequences in the alignment and choose Patterns --> Weblogo.


A sequence logo gives us a measurement of the information content at each site
The Y-axis gives the information content at each position in bits of information. One way of thinking of bit scores is as a measure of the deviation from randomness. High bit scores are associated with strong conservation of amino acids. Conservative replacements result in a moderate decrease in the bit score at a given position, while less conservative replacements result in low bit scores.

The  four critical amino acid substitutions in the heavy chain binding site of the receptor binding domain are circled in the output below:


We can see replacements at the circled positions:

posn. 2007 2020
412 T (Bat-like) A+ N-glycosylation
424 A P (Bat-like)
470 M T (Bat-like)
559 N (Bat-like) H

In all cases, the amino acid substitutions occur at sites of moderate information content, flanked on either side by sites with high information content.

The more ways we can visualize the data, the more we learn. Let's export the alignment to Jalview, which is a comprehensive tool for performing and visulalizing multiple sequence alignments.

Edit --> Select all
Alignment --> Jalview
Format --> Font=20
Color --> Zappo


The Conservation graph at the bottom shows the degree of amino acid conservation at each position.

Click on Conservation bar at 470 to highlight M --> T substitution

This display confirms our other observations, that the critical mutations in the 2020 spike protein occurred in less conserved amino acids surrounded by more conserved amino acids.