de-novo assembly of transcripts

June 18, 2019

References

Spades User's guide
Spades Manual

Trinity Web site
Trinity Manual

SOAPdenovo-Trans User's guide

This tutorial assembles transcriptomes using the trimmed, corrected reads from the Read Correction tutorial.

4. De-novo transcriptome assembly using RNASpades

Rnapades has its own read corrector, and rnaspades seem to construct better assemblies when read correction is done by rnaspades, rather than Rcorrector. However, we will need the files corrected by Rcorrector for use with SOAPdenovo2 and Trinity, later in this tutorial.

For now, start blreads in the reads.trim_galore directory, and run guesspairs to select .fq files.

Choose File --> guesspairs.

As before, we use R1 and R2 to identify paired-end read files, and ask blreads to only select files with the .fq file extension.

A new blreads window appears with the paired-end files in two columns. Select all pairs of reads, and choose Transcriptome --> rnaspades.

Set the output directory ../reads.Trimmomatic.rnaspades.

Since it takes a long time to do an assembly, it is best to set Notify of completion by email to Yes, and type in your email address.

If you had additional files with single-end reads reads or additional paired-end reads those files can be selected in the SE reads or PE reads (not shown) tabs, respectively.

Once parameters are set, click on Run. If you leave blreads running, the a new blreads window will pop up when the assembly is complete. Otherwise, you may close blreads and this point. When your assembly is done open a new blreads window in the ../reads.Trimmomatic.rnaspades directory, and you can examine the results.

The final contigs will be found in contigs.fasta, and scaffolds in transcripts.fasta.

To see a summary of the results, select spades.log and choose File --> View file.

The parameters of the assembly, along with a progress report, are shown spades.log.

These results don't tell us much about the assembly, so in the next step we will run transrate to evaluate the assembly.

Evaluation of assembly quality using transrate

The starting point for transrate is to go back to the same blreads window that has the read pairs (in the reads.Trimmomatic directory and again select the read pairs. (If you closed that window, you will need to open blreads in the reads.Trimmomatic directory, and run guesspairs as before.

Next, choose Transcriptome --> transrate.

The transcriptome assembly file, transcripts.fasta, is in the rnaspades directory from the last step.

By default, output is written to a directory called transrate.

When transrate is complete, files from the transrate directory will pop up in two windows.

transrate.log - the details of the transrate run.

The most important parameter is the TRANSRATE ASSEMBLY SCORE, which for this assembly is 0.35. While the published scores tend to be closer to the OPTIMAL scores, remember that the dataset for this tutorial has only 5% of the total reads, meaning that coverage is less. Assemblies done using higher coverage will give higher scores.

The complete output can be seen in transrate.log.

assemblies.csv - a CSV file with the results in spreadsheet format. These should pop up in the spreadsheet program set in the $BL_Spreadsheet variable eg. LibreOffice.

The CSV files from transrate are very useful when you want to compare assemblies using different read correction methods or assembly methods. Simply paste results from each assemblies.csv file into a single master file, and each parameter can be compared in a spreadsheet. In particular, you can sort assemblies based on assembly score, N50, or any other parameter.

4. De-novo transcriptome assembly using Trinity

For the assemblies using Trinity and SOAPdenovo-Trans, we will use the read files corrected by Rcorrector in the previous section.

Open blreads in reads.Trimmomatic.Rcorrector and run guessreads to get all files with the .fq file extension. Select all read pairs.

Next, open Transcriptome --> Trinity.

For this dataset there is probably no need to change the parameters.

It may be worth trying several k-mer sizes, but keep in mind that for RNAseq, where reads are usually 100 nt, larger k-mers will artificially lower the coverage.

If you wanted to make sure to assemble tRNAs or other small RNAs in the assembly, set min. contig length to the size of the RNA you want to assemble.

In the Output tab, set output directory to ../reads.trim_galore.Rcorrector.Trinity.

A progress report of the assembly can be found in the file trinity.log. Other than verifying that no errors occurred in the assembly, this file doesn't provide a lot of information on the assembly itself.

The final set of transcripts is found in Trinity.fasta.

Genes vs. Transcripts - Remember: In eukaryotes, any gene may potentially produce 2 or more different transcripts as a result of either alternative splicing of introns, or alternative transcription start sites. Consequently, the number of transcripts in a transcriptome will be greater than the number of genes in the genome.

As done for the rnaspades output, run Transrate to evaluate the results.

Select the same read files used in the assembly and choose Transcriptome --> transrate. Select Trinity.fasta as the assembly file.

The TRANSRATE ASSEMBLY SCORE of 0.481 is a substantial improvement over the rnaspades assembly. The p good contigs score of 0.94 is about as high as you ever see for a transcriptome assembly.

5. De-novo transcriptome assembly using SOAPdenovo-Trans

Once again, using the same sets of read files as before, choose Transcriptome --> SOAPdenovo-Trans.

As with Trinity, set shortest contig to 200, keeping in mind that smaller settings are needed if you want to assemble small RNAs.

(Although you can run SOAPdenovo-Trans with k-mers up to 127, it seldom makes sense to do so, since for RNAseq reads are necessarily short eg. 100 nt.)

In the Output tab, set the output directory to ../reads.Trimmomatic.Rcorrector.SOAPdenovo-Trans.

SOAPdenovo-Trans lets you specify a prefix to be used for all output filenames, so in this case we'll set the prefix to "Rdio" for R. diobovatum.

When complete, check bl-soaptrans.log to make sure that no errors occurred during the assembly.

Next, using the same read files as before, run Transcriptome --> transrate, using Rdio.scafSeq as the assembly file.

Output from transrate.log is shown at right.

The TRANSRATE ASSEMBLY SCORE is only 0.16, suggesting that SOAPdenovo-Trans doesn't work as well with low-coverage samples.

6. Comparison of assembly statistics

To make the results easier to compare, title lines for Spades Trinity and SOAPdenovo-Trans have been added. As well, some columns have been hidden to make it easier to see a number of important columns at once.

In summary,

By far, Trinity gives the highest transrate score, at 0.488.
Proportionally, Trinity's transrate score is closest to the calculated optimal score, of the three assemblies.
The probability that reads map well to transcripts is very close for all tree assemblies
The mean ORF percentage is highest for Trinity, at 82%.

By these criteria, the Trinity assembly for these reads is much better than the other assemblies.

Most of the other criteria, while informative, don't really distinguish between the qualities of the assemblies. For example, the Trinity results show the total length of the transcriptome varies between the three methods, but there is no a priori reason to prefer the longest or shortest transcriptome. Similarly, all of the N50 results are reasonable for an mRNA population.

For comparison, results from several assemblies using the full dataset, are given below.

In all cases, the Transrate scores are not as high as those using the smaller dataset. There may be many reasons for that. For example note that n_seqs, the umber of unique transcripts, may include transcripts from more genes with lower expression levels that were not predicted using the smaller dataset with less depth of coverage. Also with higher coverage, we would expect to see more alternative splice products. The lower Transrate scores could therefore be explained by the greater complexity of the transcript population inferred from the reads, making it harder to uniquely assign one some reads to one transcript versus another.

The critical finding from the full dataset is that all Trinity assemblies were substantially better than those constructed by SOAPdenovo-Trans and rnaspades, based on Transrate scores (col AX), probability of good mapping (col X) and percent ORF (col K). The best assembly was obtained running Trinity with a k-mer size of 31, with slight improvement due to Jaccard clipping which corrects for cases in which gene density is high, such that some paired reads have sequences from two adjacent genes. It is also worth mentioning that the Trinity assembly using reads corrected by Rcorrector was far better than the Trinity assembly using reads corrected by rnaspades.

Next: Generation of differential expression data

TUTORIAL: Transcriptome Assembly

de-novo assembly of transcripts

4. De-novo transcriptome assembly using RNASpades

4. De-novo transcriptome assembly using Trinity

5. De-novo transcriptome assembly using SOAPdenovo-Trans

6. Comparison of assembly statistics