MIRA Version 3.2.1
Document revision $Id: book_definitiveguide.xml,v 1.1.2.1 2010-07-04 21:38:21 bach Exp $
Copyright © 2010 Bastien Chevreux
This documentation is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
Table of Contents
This "book" is actually the result of an exercise in self-defense. It contains texts from several years of help files, mails, postings, questions, answers etc. concerning MIRA and assembly projects one can do with it.
I never really intended to push MIRA. It started out as a PhD thesis and I subsequently continued development when I needed something to be done which other programs couldn't do at the time. But MIRA has been available as a binary on the Internet since 1999 ... and as Open Source since 2007. Somehow, MIRA seems to have caught the attention of more than just a few specialised sequencing labs, and over the years I've seen an ever growing number of mails in my inbox and on the MIRA mailing list -- both from people who have been in the sequencing business forever and from labs or people just getting their feet wet in the area.
The help files -- and through them this book -- sort of reflect this development. They contain both very specialised topics as well as step-by-step walkthroughs intended to help people get their assembly projects going. Some parts of the documentation are written in a decidedly non-scientific way. Please excuse this: as time for rewriting mails was somewhat lacking, some texts were re-used almost verbatim.
Nothing is perfect, and both MIRA and this documentation are far from it. If you spot an error either in MIRA or the docs, feel free to report it. Or, even better, correct it if you can. At least with the help files it should be easy, they're just text files.
I hope that MIRA will be as useful to you as it has been to me. Have a lot of fun with it.
Rheinfelden, Summer 2010
Bastien Chevreux
MIRA is a multi-pass DNA sequence data assembler/mapper for whole genome and EST projects. MIRA assembles reads gained by
electrophoresis sequencing (aka Sanger sequencing)
454 pyrosequencing (GS20, FLX or Titanium)
Solexa (Illumina) sequencing
Pacific Biosciences sequencing
into contiguous sequences (called contigs). One can use the sequences of different sequencing technologies either in a single assembly run (a true hybrid assembly), by mapping one type of data to an assembly of another sequencing type (a semi-hybrid assembly or mapping), or by mapping data against consensus sequences of other assemblies (a simple mapping).
The MIRA acronym stands for Mimicking Intelligent Read Assembly and the program pretty well does what its acronym says (well, most of the time anyway). It is the Swiss army knife of sequence assembly that I've used and developed during the past 12 years to get assembly jobs I work on done efficiently - and especially accurately. That is, without me actually putting too much manual work into it.
Over time, other labs and sequencing providers have found MIRA useful for assembly of extremely 'unfriendly' projects containing lots of repetitive sequences. As always, your mileage may vary.
At the last count, this manual had almost 200 pages and this might seem a little bit daunting. However, you very probably do not need to read everything.
You should read most of this introductory chapter though, e.g.,
the part with the MIRA quick tour
the part which gives a quick overview for which data sets to use MIRA and for which not
the part which showcases different features of MIRA (lots of screenshots!)
where and how to get help if things don't work out as you expected
After that, what to read depends on the type of data you intend to work with: there are specific chapters for Sanger, 454, Solexa and PacBio data, all of which contain an overview on how to prepare your data and how to launch MIRA for these data sets. There are also complete walkthroughs which show, from start to end, one way of doing an assembly for a specific data set and what to do with the results of the assembly.
As the aforementioned chapters are geared toward genome assemblies, there is also a chapter going into details on how to use MIRA for EST assemblies. Read that if you're into ESTs.
As the previously cited chapters are more introductory in nature, they do not go into the details of MIRA parametrisation. MIRA has more than 150 switches / parameters with which one can fine-tune almost every aspect of an assembly. A complete description of each and every parameter and how to correctly set parameters for different use cases and sequencing technologies can be found in the reference chapter.
The chapter on working with results of MIRA should again be of general interest to everyone. It describes the structure of output directories and files and gives first pointers on what to find where. Also, converting results into different formats -- with and without filtering for specific needs -- is covered there.
As not every assembly project is simple, there is also a chapter with tips on how to deal with projects which turn out to be "hard." It certainly helps if you at least skim through it even if you do not expect to have problems with your data ... it contains a couple of tricks on what one can see in log and result files which are not explained elsewhere.
As general questions on sequencing pop up on the MIRA talk mailing list from time to time, I have added a chapter with some general musings on what to consider when going into sequencing projects. This is in no way a replacement for an exhaustive talk with a sequencing provider, but it can give a couple of hints on what to take care of.
There is also a FAQ chapter with some of the more frequently asked questions which popped up in the past few years.
Finally, there are also chapters covering some more technical aspects of MIRA: the MAF format and the structure / content of the log directory have their own chapters.
Input can be in various formats like Staden experiment (EXP), Sanger CAF, FASTA, FASTQ or PHD files. Ancillary data containing additional information helpful to the assembly, as contained in, e.g., NCBI traceinfo XML files or Staden EXP files, is also honoured. If present, base qualities in phred style and SCF electrophoresis trace files are used to adjudicate between or even correct contradictory stretches of bases in reads, by either the integrated automatic EdIt editor (written by Thomas Pfisterer) or the assembler itself.
MIRA was conceived especially with the problem of repeats in genomic data and SNPs in EST data in mind. Considerable effort was made to develop a number of strategies -- ranging from standard clone-pair size restrictions to discovery and marking of base positions discriminating the different repeats / SNPs -- to ensure that repetitive elements are correctly resolved and that misassemblies do not occur.
The resulting assembly can be written in different standard formats like CAF, Staden GAP4 directed assembly, ACE, HTML, FASTA, simple text or transposed contig summary (TCS) files. These can easily be imported into numerous finishing tools or further evaluated with simple scripts.
The aim of MIRA is to build the best possible assembly by
having a more or less full overview on the whole project at any time of the assembly, i.e. knowledge of almost all possible read-pairs in a project,
using high confidence regions (HCRs) of several aligned read-pairs to start contig building at a good anchor point of a contig, extending clipped regions of reads on a 'can be justified' basis.
using all available data present at the time of assembly, i.e., instead of relying on sequence and base confidence values only, the assembler will profit from trace files containing electrophoresis signals, tags marking possible special attributes of DNA, information on specific insert sizes of read-pairs etc.
having 'intelligent' contig objects accept or refuse reads based on the rate of unexplainable errors introduced into the consensus
learning from mistakes by discovering and analysing possible repeats differentiated only by single nucleotide polymorphisms. The important bases for discriminating different repetitive elements are tagged and used as new information.
using the possibility given by the integrated automatic editor to correct errors present in contigs (and subsequently reads) by generating and verifying complex error hypotheses through analysis of trace signals in several reads covering the same area of a consensus,
iteratively extending reads (and subsequently contigs) based on
additional information gained by overlapping read pairs in contigs and
corrections made by the automated editor.
MIRA was part of a bigger project that started at the DKFZ (Deutsches Krebsforschungszentrum, German Cancer Research Centre) Heidelberg in 1997: the "Bundesministerium für Bildung, Wissenschaft, Forschung und Technologie" supported the PhD theses of Thomas and myself through grant number 01 KW 9611. Besides an assembler to tackle difficult repeats, the grant also supported the automated editor / finisher EdIt package -- written by Thomas Pfisterer. The strength of MIRA and EdIt is the automatic interaction of both packages, which produces assemblies leaving less work for human finishers.
I'd like to thank everybody who reported bugs to me, pointed out problems, sent ideas and suggestions they encountered while using the predecessors. Please continue to do so, the feedback made this third version possible.
As a general rule of thumb: if you have an organism with more than 100 to 150 megabases or more than 20 to 40 million reads, you might want to try other assemblers first.
For genome assembly, the version 3 series of MIRA (and predecessors of the 2.9.x development tree) have been reported to work on projects with something like a million Sanger reads (~80 to 100 megabases at 10x coverage), five to ten million 454 Titanium reads (~100 megabases at 20x coverage) and 20 to 40 million Solexa reads (enough for de-novo of a bacterium or a small eukaryote with 76mers or 100mers).
Provided you have the memory, MIRA is expected to work in de-novo mode with
Sanger reads: 5 to 10 million
454 reads: 5 to 15 million
Solexa reads: 15 to 20 million
and "normal" coverages, where "normal" means no more than 50x to 70x for genome projects. Higher coverages will also work, but need heavy parametrisation. Lower coverages (<4x for Sanger, <10x for 454) also need special attention in the parameter settings.
As the complexity of mapping is a lot lower than de-novo, one can basically double (perhaps even triple) the number of reads compared to 'de-novo'. The limiting factor will be the amount of RAM though, and MIRA will also need lots of it if you go into eukaryotes.
The main limiting factor regarding time will be the number of reference sequences (backbones) you are using. MIRA being pedantic during the mapping process, it might be a rather long wait if you have more than 500 to 1000 reference sequences.
For EST assembly (be it de-novo or mapping), it is suggested to use only Sanger and/or 454 sequences as MIRA currently won't like coverages exceeding 16383x. This coverage is almost never attained in Sanger sequencing and only rarely occurs in 454 sequencing, even in non-normalised EST libraries.
With Solexa however, getting coverages of >20,000x apparently occurs pretty often in non-normalised EST libraries, so using MIRA for Solexa EST data is currently not recommended.
The default values for MIRA should allow it to work with many EST sets, sometimes even from non-normalised libraries. For extreme coverage cases however (like data sets with a lot of regions at and above 10,000x coverage), one would perhaps need to resort to data reduction routines before feeding the sequences to MIRA.
A few perhaps.
Note: The screenshots in this section show data from assemblies produced with MIRA, but the visualisation itself is done in a finishing program named gap4. Some of the screenshots were edited to show a special feature of MIRA. E.g., in the screenshots with Solexa data, quite a few reads were left out of the view pane as otherwise -- due to the amount of data -- these screenshots would need several pages for a complete printout.
MIRA is an iterative assembler (it works in several passes) and acts a bit like a child exploring the world: it explores the assembly space and is specifically parametrised to allow a couple of assembly errors during the first passes. But after each pass, some routines (the "parents", if you like) check the result, search for those assembly errors and deduce knowledge about specific assemblies MIRA should not have ventured into. MIRA will then prevent these errors from re-occurring in subsequent passes.
One example. Consider the following multiple alignment:
Figure 1.1. How MIRA learns from misassemblies (1). Multiple alignment after 1st pass with an obvious assembly error. Two slightly different repeats were assembled together, notice the column discrepancies.
These kinds of errors can easily be spotted by a human, but are hard to prevent with normal alignment algorithms, as sometimes there's only a single base column difference between repeats (and not several as in this example).
MIRA spots these things (even if it's only a single column), tags the base positions in the reads with additional information and then will use that information in subsequent passes. The net effect is shown in the next two figures:
Figure 1.2. Multiple alignment after last pass where assembly errors from previous passes have been resolved (1st repeat site)
Figure 1.3. Multiple alignment after last pass where assembly errors from previous passes have been resolved (2nd repeat site)
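The core of this detection -- spotting alignment columns where reads disagree consistently rather than as isolated errors -- can be sketched in a few lines of Python. This is only a toy illustration of the principle; the function name, threshold and example data are invented here and this is not MIRA's actual algorithm:

```python
from collections import Counter

def discrepancy_columns(alignment, min_group=2):
    """Toy sketch of repeat-marker detection: report alignment columns
    where at least two reads agree on one base AND at least two other
    reads agree on a different base.  Random sequencing errors are
    usually singletons; consistent repeat differences are not."""
    columns = []
    for pos in range(len(alignment[0])):
        counts = Counter(read[pos] for read in alignment)
        strong = [base for base, n in counts.items() if n >= min_group]
        if len(strong) >= 2:
            columns.append((pos, sorted(strong)))
    return columns

# Two repeat copies differing at column 3 ("A" vs "G"), plus one
# singleton sequencing error at column 6 which is NOT reported.
reads = ["ACGATTA",
         "ACGATTC",
         "ACGGTTA",
         "ACGGTTA"]
print(discrepancy_columns(reads))  # -> [(3, ['A', 'G'])]
```

Positions reported this way would then be tagged and used as extra information in the next assembly pass.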
The ability of MIRA to learn to discern non-identical repeats from each other through column discrepancies is nothing new. Here's the link to a paper from a talk I gave at the German Conference on Bioinformatics in 1999: http://www.bioinfo.de/isb/gcb99/talks/chevreux/
I'm sure you'll recognise the basic principle in figures 8 and 9. The slides from the corresponding talk also look very similar to the screenshots above:
You can get the talk with these slides here: http://chevreux.org/dkfzold/gcb99/bachvortrag_gcb99.ppt
Since the first versions in 1999, the EdIt automatic Sanger sequence editor from Thomas Pfisterer has been integrated into MIRA.
The routines use a combination of hypothesis generation/testing together with neural networks (trained on ABI and ALF traces) for signal recognition to discern between base calling errors and true multiple alignment differences. They go back to the trace data to resolve potential conflicts and eventually recall bases using the additional information gained in a multiple alignment of reads.
Figure 1.6. Sanger assembly without EdIt automatic editing routines. The bases with blue background are base calling errors.
Figure 1.7. Sanger assembly with EdIt automatic editing routines. Bases with pink background are corrections made by EdIt after assessing the underlying trace files (SCF files in this case). Bases with blue background are base calling errors where the trace files did not show enough evidence to allow an editing correction.
With the introduction of 454 reads in 2007, MIRA also got specialised editors to search for and correct typical 454 sequencing problems like homopolymer run over- and undercalls.
While not paramount to assembly quality, both editors provide additional layers of safety for the MIRA learning algorithm to discern non-perfect repeats even on a single base discrepancy. Furthermore, the multiple alignments generated by these two editors are way more pleasant to look at (or automatically analyse) than ones containing all kinds of gaps, insertions, deletions etc.
With the introduction of PacBio strobed reads, MIRA also got an editor to handle "elastic dark inserts" (stretches of unread bases whose length is known only approximately). How this editor works is explained in the chapter on PacBio data, but in essence it allows MIRA to transform this:
into this:
A very useful feature for finishing are the hash frequency (HAF) tags which MIRA sets in the assembly. Provided your finishing editor understands these tags, they'll give you precious insight into where you might want to be cautious when joining two contigs or where you would need to perform some primer walking. MIRA colourises the assembly with the HAF tags to show repetitiveness.
You will need to read about the HAF tags in the reference manual, but in a nutshell: the HAF5, HAF6 and HAF7 tags tell you that you potentially have repetitive to very repetitive read areas in the genome, while HAF2 tags tell you that these areas in the genome have not been covered as well as they should have been.
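As a rough illustration of the idea behind these tags -- and only that: the thresholds below are invented for this example, and MIRA's real HAF tags are computed from hash (k-mer) statistics, not raw coverage -- one could classify a frequency relative to the average like this:

```python
def haf_class(freq, avg):
    """Toy classification of a frequency relative to the average
    coverage.  Thresholds are invented for this illustration; MIRA's
    real HAF tags are computed from hash (k-mer) statistics."""
    ratio = freq / avg
    if ratio < 0.5:
        return "HAF2"  # covered less well than it should be
    if ratio <= 1.5:
        return "HAF3"  # normal coverage
    if ratio <= 2.5:
        return "HAF5"  # potentially repetitive
    if ratio <= 5.0:
        return "HAF6"  # repetitive
    return "HAF7"      # very repetitive

average_coverage = 20
for f in (5, 20, 45, 80, 150):
    print(f, haf_class(f, average_coverage))
```

The point is simply that each read area gets a label describing how its frequency compares to what one would expect from the average.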
One example: the following figure shows the coverage of a contig.
The question now is: why did MIRA stop building this contig at the left end (left oval) and at the right end (right oval)?
Looking at the HAF tags in the contig, the answer quickly becomes clear: the left contig end has HAF5 tags in the reads (shown in bright red in the following figure). This tells you that MIRA stopped because it could not unambiguously continue building this contig. Indeed, if you BLAST the sequence at the NCBI, you will find that this is an rRNA area of a bacterium, of which bacteria normally have several copies in the genome:
Figure 1.13. HAF5 tags (reads shown with red background) covering a contig end show repetitiveness as reason for stopping a contig build.
The right end of the contig however ends in HAF3 tags (normal coverage, bright green in the next figure) and even HAF2 tags (below average coverage, pale green in the next image). This tells you MIRA stopped building the contig at this place simply because there were no more reads to continue. This is a perfect target for primer walking if you want to finish a genome.
Figure 1.14. HAF2 tags covering a contig end show that no more reads were available for assembly at this position.
Many people combine Sanger & 454 -- or nowadays more 454 & Solexa -- to improve the sequencing quality of their project through two (or more) sequencing technologies. To reduce time spent in finishing, MIRA automatically tags those bases in a consensus of a hybrid assembly where reads from different sequencing technologies contradict each other.
The following example shows a hybrid 454 / Solexa assembly where reads from 454 (highlighted read names in the following figure) were not sure whether to have one or two "G" at a certain position. The consensus algorithm would have chosen "two Gs" for 454, obviously a wrong decision, as all Solexa reads at the same spot (not highlighted) show only one "G" for the given position. While MIRA chose to believe Solexa in this case, it tags the position anyway.
Figure 1.15. A "STMS" tag (Sequencing Technology Mismatch Solved, the black square base in the consensus) showing a potentially difficult decision in a hybrid 454 / Solexa de-novo assembly.
This also works for other sequencing technology combinations or in mapping assemblies. The following is an example where, by pure misfortune, all Sanger reads have a base calling error at a given position while the 454 reads show the true sequence.
Figure 1.16. A "STMU" tag (Sequencing Technology Mismatch Unresolved, light blue square in the consensus at lower end of large oval) showing a potentially difficult decision in a hybrid Sanger / 454 mapping assembly.
Quality control is paramount when you do mutation analysis for biologists: I know they'll be on my doorstep the minute they find out that one of the SNPs in the resequencing data wasn't a SNP but a sequencing artefact. And I can understand them: why should they invest -- per SNP -- hours in the wet lab if I can invest a couple of minutes to get them data with false negative rates (and false discovery rates) way below 1%? So, finishing any mapping project is a must.
Both gap4 and consed start to have a couple of problems when projects have millions of reads: you need lots of RAM, and scrolling around the assembly becomes a test of your patience. Still, these two assembly finishing programs are amongst the better ones out there.
So, MIRA reduces the number of reads in Solexa mapping projects without sacrificing information on coverage. The principle is pretty simple: for 100% matching reads, MIRA tracks the coverage of every reference base and creates long synthetic, coverage equivalent reads (CERs) in exchange for the Solexa reads. Reads that do not match 100% are kept as their own entities, so that no information gets lost. The following figure illustrates this:
Figure 1.17. Coverage equivalent reads (CERs) explained.
Left side of the figure: a conventional mapping with eleven reads of size 4 against a consensus (in uppercase). The inverted base in the lowest read depicts a sequencing error.
Right side of the figure: the same situation, but with coverage equivalent reads (CERs). Note that there are fewer reads, but no information is lost: the coverage of each reference base is equivalent to the left side of the figure, and reads with differences to the reference are still present.
This strategy is very effective in reducing the size of a project. As an example, in a mapping project with 9 million Solexa 36mers, MIRA created a project with 1.7 million reads: 700k CER reads representing ~8 million 100% matching Solexa reads, while it kept ~950k mapped reads as they had at least one mismatch (be it sequencing error or true SNP) to the reference. A reduction of 80%.
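The CER construction from the figure can be sketched as follows. This toy version only handles the 100% matching reads (reads with mismatches would be kept unchanged), and the function names and greedy peeling strategy are illustrative assumptions, not MIRA's implementation:

```python
def coverage_equivalent_reads(ref_len, matching_reads):
    """Toy sketch of CER construction: replace 100%-matching reads
    (given as half-open (start, end) intervals on the reference) by
    fewer, longer synthetic reads with the same per-base coverage."""
    cov = [0] * ref_len
    for start, end in matching_reads:
        for i in range(start, end):
            cov[i] += 1
    cers = []
    # Greedily peel off one "layer" of coverage per pass: each maximal
    # run of still-covered positions becomes one long synthetic read.
    while any(cov):
        i = 0
        while i < ref_len:
            if cov[i] > 0:
                start = i
                while i < ref_len and cov[i] > 0:
                    cov[i] -= 1
                    i += 1
                cers.append((start, i))
            else:
                i += 1
    return cers

# Five staggered short reads collapse into four longer synthetic reads
# while the per-base coverage of the reference stays identical.
reads = [(0, 4), (1, 5), (2, 6), (3, 7), (4, 8)]
print(coverage_equivalent_reads(8, reads))  # -> [(0, 8), (1, 7), (2, 6), (3, 5)]
```

On real data the gain is much larger than in this tiny example, because huge numbers of identical-coverage short reads collapse into a handful of long synthetic ones.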
Also, mutations of the resequenced strain now really stand out in the assembly viewer as the following figure shows:
Want to assemble two or several very closely related genomes without reference, but finding SNPs or differences between them?
Tired of looking at some text output from mapping programs and guessing whether a SNP is really a SNP or just some random junk?
MIRA tags all SNPs (and other features like missing coverage etc.) it finds so that -- when using a finishing viewer like gap4 or consed -- one can quickly jump from tag to tag and perform quality control. This works both in de-novo assembly and in mapping assembly, all MIRA needs is the information which read comes from which strain.
The following figure shows a mapping assembly of Solexa 36mers against a bacterial reference sequence, where a mutant has an indel position in a gene:
Figure 1.19. "SROc" tag (Snp inteR Organism on Consensus) showing a SNP position in a Solexa mapping assembly.
Other interesting places like deletions of whole genome parts are also directly tagged by MIRA and noted in diverse result files (and searchable in assembly viewers):
Figure 1.20. "MCVc" tag (Missing CoVerage in Consensus, dark red stretch in figure) showing a genome deletion in a Solexa mapping assembly.
Note: For bacteria -- and if you use annotated GenBank files as reference sequence -- MIRA will also output some nice lists directly usable (in Excel) by biologists, telling them which gene was affected by what kind of SNP, whether it changes the protein, the original and the mutated protein sequence, etc.
Extensive possibilities to clip data if needed: by quality, by masked bases, by A/T stretches, by evidence from other reads, ...
Routines to re-extend reads into clipped parts if multiple alignment allows for it.
Read in ancillary data in different formats: EXP, NCBI TRACEINFO XML, SSAHA2, SMALT result files and text files.
Detection of chimeric reads.
Pipeline to discover SNPs in ESTs from different strains (miraSearchESTSNPs)
Support for many different input and output formats (FASTA, EXP, FASTQ, CAF, MAF, ...)
Automatic memory management (when RAM is tight)
Over 150 parameters to tune the assembly for a lot of use cases, many of these parameters being tunable individually depending on the sequencing technology they apply to.
There are two kinds of MIRA versions that can be compiled from source files: production and development.
Production versions are from the stable branch of the source code. These versions are available for download on the web site of MIRA.
Development versions are from the development branch of the source tree. These are also made available to the public and should be compiled by users who want to test out new functionality or to track down bugs or errors that might arise at a given location. Release candidates (rc) also fall into the development versions: they are usually the last versions of a given development branch before being folded back into the production branch.
MIRA has been put under the GPL version 2.
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA
You may also visit http://www.opensource.org/licenses/gpl-2.0.php at the Open Source Initiative for a copy of this licence.
The documentation pertaining to MIRA is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc-sa/3.0/ or send a letter to Creative Commons, 171 Second Street, Suite 300, San Francisco, California, 94105, USA.
© 1997-2000 Deutsches Krebsforschungszentrum Heidelberg -- Dept. of Molecular Biophysics and Bastien Chevreux (for MIRA) and Thomas Pfisterer (for EdIt)
© 2001-2010 Bastien Chevreux.
All rights reserved.
MIRA uses the excellent Expat library to parse XML files. Expat is Copyright © 1998, 1999, 2000 Thai Open Source Software Center Ltd and Clark Cooper as well as Copyright © 2001, 2002 Expat maintainers.
See http://www.libexpat.org/ and http://sourceforge.net/projects/expat/ for more information on Expat.
Please try to find an answer to your question by first reading the documents provided with the MIRA package (FAQs, READMEs, usage guide, guides for specific sequencing technologies etc.). It's a lot, but then again, they hopefully should cover 90% of all questions.
If you have a tough nut to crack or simply could not find what you were searching for, you can subscribe to the MIRA talk mailing list and send in your question (or comment, or suggestion), see http://www.chevreux.org/mira_mailinglists.html for more information on that. Now that the number of subscribers has reached a good level, there's a fair chance that someone could answer your question before I have the opportunity or while I'm away from mail for a certain time.
Note: Subscribing to the list before sending mails to it is necessary, as messages from non-subscribers will be stopped by the system to keep the spam level low.
To report bugs or ask for new features, please use the new ticketing system at: http://sourceforge.net/apps/trac/mira-assembler/. This ensures that requests do not get lost, and you get the additional benefit of automatically knowing when a bug has been fixed (there won't be separate emails sent; that's what bug trackers are there for).
Please mail the author directly (<bach@chevreux.org>) only if you feel that there's some information you absolutely do not want to share.
Finally, new or intermediate versions of MIRA will be announced on the MIRA announce mailing list. Subscribe if you want to be informed automatically on new versions.
Bastien Chevreux (mira): <bach@chevreux.org>
MIRA can use automatic editing routines for Sanger sequences which were written by Thomas Pfisterer (EdIt): <t.pfisterer@dkfz-heidelberg.de>
Please use these citations:
Chevreux, B., Wetter, T. and Suhai, S. (1999): Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.
Chevreux, B., Pfisterer, T., Drescher, B., Driesel, A. J., Müller, W. E., Wetter, T. and Suhai, S. (2004): Using the miraEST Assembler for Reliable and Automated mRNA Transcript Assembly and SNP Detection in Sequenced ESTs. Genome Research, 14(6)
SourceForge: http://sourceforge.net/projects/mira-assembler/
There you will normally find a couple of precompiled binaries -- usually for Linux, sometimes also for Mac OSX -- or the source package for compiling yourself.
Precompiled binary packages are named in the following way:
mira_<miraversion>_<audience>_<OS-and-binarytype>.tar.bz2
where
<miraversion> is usually a version number in three parts, like 3.0.5, sometimes also followed by a postfix as in 3.2.0rc1 to denote release candidate 1 of the 3.2.0 version of MIRA.
<audience> is either prod or dev, denoting either a production or a development version.
The development version usually contains more checks and more debugging output to catch potential errors, hence it might run slower. Furthermore, development versions may contain some new code which did not get as extensive testing as usual.
<OS-and-binarytype> finally defines for which operating system and which processor class the package is destined. E.g., linux-gnu_x86_64_static contains static binaries for Linux running on a 64 bit processor.
Source packages are usually named mira-<miraversion>.tar.bz2
Examples for packages at SourceForge:
mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2
mira_3.0.5_prod_linux-gnu_i686_32_static.tar.bz2
mira_3.0.5_prod_OSX_snowleopard_x86_64_static.tar.bz2
mira-3.0.5.tar.bz2
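For illustration, the naming convention can be checked mechanically with a small parser. The regular expression below is derived purely from the example names above, not from any official specification:

```python
import re

# Pattern derived from the example package names; purely illustrative.
PKG_RE = re.compile(
    r"^mira_(?P<version>[0-9.]+(?:rc[0-9]+)?)"
    r"_(?P<audience>prod|dev)"
    r"_(?P<os_and_type>.+)\.tar\.bz2$")

def parse_package_name(name):
    """Split a precompiled-binary package name into its parts,
    returning None for names that do not follow the convention
    (e.g. source packages)."""
    m = PKG_RE.match(name)
    return m.groupdict() if m else None

print(parse_package_name("mira_3.0.5_prod_linux-gnu_x86_64_static.tar.bz2"))
# -> {'version': '3.0.5', 'audience': 'prod',
#     'os_and_type': 'linux-gnu_x86_64_static'}
```

Source packages like mira-3.0.5.tar.bz2 do not match the binary-package pattern, which is how one can tell the two apart at a glance.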
Download the package and unpack it. Inside, there is -- beside other directories -- a bin directory. Copy or move the files and soft-links inside this directory to a directory in your $PATH variable.
Additional scripts for special purposes are in the scripts directory. You might or might not want to have them in your $PATH.
Scripts and programs for MIRA from other authors are in the 3rdparty directory. Here too, you may or may not want to have (some of) them in your $PATH.
MIRA sets tags in the assemblies that can be read and interpreted by the Staden gap4 package or consed. These tags are extremely useful to efficiently find places of interest in an assembly (be it de-novo or mapping), but both gap4 and consed need to be told about these tags.
Data files for a correct integration are delivered in the support directory of the distribution. Please consult the README in that directory for more information on how to integrate this information in either of these packages.
mira [-project=<name>] [--job=<arguments>]
     [-fasta[=<filename>] | -fastq[=<filename>] | -caf[=<filename>] | -phd[=<filename>]]
     [-notraceinfo] [-noclipping[=...]]
     [-highlyrepetitive] [-lowqualitydata] [-highqualitydata]
     [-params=<filename>]
     [-GENERAL:<arguments>] [-STRAIN/BACKBONE:<arguments>] [-ASSEMBLY:<arguments>]
     [-DATAPROCESSING:<arguments>] [-CLIPPING:<arguments>] [-SKIM:<arguments>]
     [-ALIGN:<arguments>] [-CONTIG:<arguments>] [-EDIT:<arguments>]
     [-MISC:<arguments>] [-DIRECTORY:<arguments>] [-FILENAME:<arguments>]
     [-OUTPUT:<arguments>]
     [COMMON_SETTINGS | SANGER_SETTINGS | 454_SETTINGS | SOLEXA_SETTINGS | SOLID_SETTINGS]
For an easy introduction on how to use mira, a number of tutorials with step-by-step instructions are available:
mira_usage
for basic Sanger assembly
mira_454
for basic 454 assembly
mira_solexadev
for basic mapping assembly of Solexa data
mira_est
for some advice concerning assembly of EST sequence
(and miraSearchESTSNPs)
mira_hard
some notes on how to assemble 'hard'
data sets: EST data sets or genome projects for eukaryotes, but some
prokaryotes also qualify for this
mira_faq
with some frequently asked questions
To use mira itself, one doesn't need very much:
Sequence data in EXP, CAF, PHD, FASTA or FASTQ format (ideally preprocessed)
Optionally: ancillary information in NCBI traceinfo XML format; ancillary information about strains in tab delimited format, vector screen information generated with ssaha2 or smalt.
Some memory and disk space. Actually, lots of both if you are venturing into 454 or Solexa data.
mira has three basic working modes: genome, EST or EST-reconstruction-and-SNP-detection. From version 2.4 on, there is only one executable, which supports all modes. The name with which this executable is called defines the working mode:
mira for assembly of genomic data as well as assembly of EST data from one or multiple strains / organisms
and
miraSearchESTSNPs for assembly of EST data from different strains (or organisms) and SNP detection within this assembly. This is the former miraEST program which was renamed as many people got confused regarding whether to use mira in est mode or miraEST.
Note that miraSearchESTSNPs is usually realised as a link to the mira executable; the executable decides which module to start based on the name it was called with.
Parameters can be given on the command line or loaded via parameter files.
mira knows two basic parameter types: quick switches and extensive switches.
quick switches, also dubbed DWIM switches (for 'Do-What-I-Mean'), are easy-to-use switches which can be combined and which activate parameter collections for predefined tasks that will suit most people's needs.
extensive switches offer a way to set about any possible parameter to configure mira for any kind of special need. While the format of extensive switches might look a little bit strange, it is borrowed from the SGI C compiler options and allows both compact command lines and readable and / or script-generated parameter files.
Due to the introduction of new sequencing technologies like 454, Solexa and ABI SOLiD, the extensive switches had to be split into two groups:
technology independent switches which control general behaviour of MIRA like, e.g., the number of assembly passes or file names etc.
technology dependent switches which control behaviour of algorithms where the sequencing technology plays a role. Example for this would be the minimum length of a read (like 200 for Sanger reads and 120 for 454 FLX reads).
More on this a bit further down in this documentation.
As example, a typical call of mira using quick switches and some tweaking with extended switches on the command line could look like this:
mira --job=denovo,genome,draft,sanger --fasta SANGER_SETTINGS -ALIGN:min_relative_score=70 -GENERAL:use_template_information=yes -GENERAL:templateinsertsizeminimum=500:templateinsertsizemaximum=2500
or in short form
mira --job=denovo,genome,draft,sanger --fasta SANGER_SETTINGS -AL:mrs=70 -GE:uti=yes:tismin=500:tismax=2500
Please note that it is also perfectly legal to decompose the switches so that they can be used more easily in scripted environments (notice the multiple -GE in the following example):
mira --job=denovo,genome,draft,sanger --fasta SANGER_SETTINGS -AL:mrs=70 -GE:uti=yes -GE:tismin=500 -GE:tismax=2500
These switches are 'Do-What-I-Mean' parameter collections for predefined tasks which should suit most people's needs. You might still need a few of the extensive switches, but not too many anymore.
Important note 1: For de-novo assembly of genomes, these switches are optimised for 'decent' coverages that are commonly seen to get you something useful, i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX or Titanium, ≥ 25x for 454 GS20 and ≥ 30x for Solexa. Should you venture into lower coverage or extremely high coverage (say, ≥ 60x for 454), you will need to adapt a few parameters via extensive switches.
Important note 2: For some switches, the order of appearance in the command line (or parameter file) is important. This is because the quick switches are realised internally as a collection of extensive switches that will overwrite any previously manually set extensive switch. It is generally a good idea to place switches in the order as described in this documentation, that is: first the order dependent quick switches, then other quick switches, then all the other extensive switches.
E.g. always write --job=... -highlyrepetitive and not -highlyrepetitive --job=.... In the same vein, always write --job=... -SK:mnr=yes and not -SK:mnr=yes --job=....
The main one-stop switch for most assemblies. You can choose between two different assembly methods (denovo or mapping), two different assembly types (genome or est), three different quality grades (draft, normal or accurate) and mix different sequencing technologies (sanger, 454, solexa and solid). This switch is explained in more detail in the subsection "The --job= switch in detail".
A modifier switch for genome data that is deemed to be highly repetitive. The assemblies will run slower due to more iterative cycles that give mira a chance to resolve nasty repeats.
Switches off clipping options for given sequencing technologies. Technologies can be sanger, 454, solexa or solid. Multiple entries separated by comma.
Note that [-CL:pec] and the chimera clipping [-CL:ascdc] are not switched off by this parameter and should be switched off separately.
Examples:
Switch off 454 and Solexa clipping (but eventually keep Sanger clipping): --noclipping=454,solexa
Switch off all: --noclipping or --noclipping=all
Switches off loading TRACEINFO ancillary data in XML files for all technologies. Place it after [--fasta] and/or [--job=] quick switches.
Loads parameters from the filename given. Allows a maximum of 10 levels of recursion, i.e., a --params option may appear within a parameter file which in turn loads other parameter files (though I cannot think of useful applications with more than 3 levels).
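As an illustration, a parameter file is just a plain text file containing switches as they would appear on the command line. The following hypothetical file (all values purely illustrative) could then be loaded via --params=<filename>:

```
--job=denovo,genome,draft,sanger
SANGER_SETTINGS
-AL:mrs=70
-GE:uti=yes:tismin=500:tismax=2500
```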
Sets parameters suited for loading sequences from FASTA files. The version with =<filename> will also set the input file to the given filename.
Sets parameters suited for loading sequences from PHD files. The version with =<filename> will also set the input file to the given filename.
Sets parameters suited for loading sequences from CAF files. The version with =<filename> will also set the input file to the given filename.
The following switches can be placed anywhere on the command line without interfering with other switches:
Default is mira. Defines the project name for this assembly. The project name automatically influences the name of input and output files / directories. E.g., in the default setting, the file names for the output of the assembly in FASTA format would be mira_out.fasta and mira_out.fasta.qual. Setting the project name to "MyProject" would generate MyProject_out.fasta and MyProject_out.fasta.qual. See also -FILENAME: and -DIRECTORY: for a list of names that are influenced.
Default is mira. Works like [-project=<name>], but only takes effect on input files.
Default is mira. Works like [-project=<name>], but only takes effect on output files.
Note: A double dash (e.g. --params) may also be used instead of a single one in front of the quick switches.
Examples for using these switches can be found in the documentation files describing mira usage.
This is the main one-stop switch for most assemblies. You need to make your choices in four steps and in the end concatenate them into the [--job=] switch:
are you building an assembly from scratch (choose: denovo) or are you mapping reads to an existing backbone sequence (choose: mapping)? Pick one. Leaving this out automatically chooses denovo as default.
are the data you are assembling forming a larger contiguous sequence (choose: genome) or are you assembling small fragments like in EST or mRNA libraries (choose: est)? Pick one. Leaving this out automatically chooses genome as default.
do you want a quick and dirty assembly for first insights (choose: draft), a reasonably well done assembly (choose: normal) or an assembly that should be able to tackle even most nasty cases (choose: accurate)? Pick one. Leaving this out automatically chooses normal as default.
finally, which sequencing technologies have created your reads: sanger, 454, solexa or solid? You can pick multiple. Leaving this out automatically chooses only sanger as default.
Once you're done with your choices, concatenate everything with commas and you're done. E.g.: '--job=denovo,genome,draft,sanger,454' will give you a de-novo assembly of a genome in draft quality using a hybrid assembly method with Sanger and 454 reads.
Extensive switches open up the full panoply of possibilities the MIRA assembler offers. This ranges from fine-tuning assemblies with the quick switches from above to setting parameters in a way so that mira is suited also for very special assembly cases.
Important note: As soon as you use a quick switch (especially --job), the 'default' settings given for extensive switches in the manual below probably do not apply anymore as the quick switch tweaks a lot of extensive switches internally.
With the introduction of new sequencing technologies, mira also had to be able to set values that allow technology specific behaviour of algorithms. One simple example for this could be the minimum length a read must have to be used in the assembly. For Sanger sequences, having this value to be 150 (meaning a read should have at least 150 unclipped bases) would be a very valid albeit conservative choice. For 454 reads and especially Solexa and ABI SOLiD reads however, this value would be ridiculously high.
To allow very fine grained behaviour, especially in hybrid assemblies, and to prevent the explosion of parameter names, mira uses technology mode switching in the parameter files or on the command line.
Example: assume the following basic command line
mira -fasta -job=denovo,genome,draft,454,solexa
As an example, here is part of the output of the used parameters that mira will show:
...
Assembly options (-AS):
    Number of passes (nop)                    : 1
    Skim each pass (sep)                      : yes
    Maximum number of RMB break loops (rbl)   : 1
    Spoiler detection (sd)                    : no
      Last pass only (sdlpo)                  : yes
    Minimum read length (mrl)                 : [san] 80  [454] 40  [sxa] 20
    Base default quality (bdq)                : [san] 10  [454] 10  [sxa] 10
...
You can see the two different kinds of settings that mira uses: common settings (like [-AS:nop]) and technology dependent settings (like [-AS:mrl]), where for each sequencing technology used in the project, the setting can be different.
How would one set a minimum read length of 80 and a base default quality of 10 for 454 reads, but for Solexa reads a minimum read length of 30 with a base default quality of 15? The answer:
$ mira -job=denovo,genome,draft,454,solexa -fasta 454_SETTINGS -AS:mrl=80:bdq=10 SOLEXA_SETTINGS -AS:mrl=30:bdq=15
Notice the ..._SETTINGS sections in the command line (or parameter file): these tell mira that all the following parameters, until the advent of another such switch, are to be set specifically for the said technology.
Beside common settings there are currently 4 technology settings available:
COMMON_SETTINGS
SANGER_SETTINGS
454_SETTINGS
SOLEXA_SETTINGS
SOLID_SETTINGS
Some settings of mira influence global behaviour and are not related to a specific sequencing technology; these must be set in the COMMON_SETTINGS environment. For example, it would not make sense to try and set a different number of assembly passes for each technology like in
$ mira -job=denovo,genome,draft,454,solexa -fasta 454_SETTINGS -AS:nop=4 SOLEXA_SETTINGS -AS:nop=3
mira will complain about cases like these. Simply set those common settings in an area prefixed with the COMMON_SETTINGS switch like in
$ mira -job=denovo,genome,draft,454,solexa -fasta COMMON_SETTINGS -AS:nop=4 454_SETTINGS ... SOLEXA_SETTINGS ...
Since MIRA 3rc3, the parameter parser will help you by checking whether parameters are correctly defined as COMMON_SETTINGS or technology dependent setting.
General options control the type of assembly to be performed and other switches not belonging anywhere else.
[string]
Same as the quick switch [-project]. Defines the name of your project and influences the naming of your input and output files.
[1 ≤ integer ≤ 256]
Default is 2. Master switch to set the number of threads used in different parts of mira.
Note 1: currently only the SKIM algorithm uses multiple threads, other parts will follow.
Note 2: Although the main data structures are shared between the threads, there's some additional memory needed for each thread.
Note 3: when running the SKIM in parallel threads, MIRA can give different results when started with the same data and same arguments. While the effect could be averted for SKIM, the memory cost for doing so would be an additional 50% for one of the large tables, so this has not been implemented at the moment. Besides, at the latest when the Smith-Watermans run in parallel, this could not be easily avoided at all.
[on|yes|1, off|no|0]
Default is Yes. Defines whether mira tries to optimise the run time of certain algorithms in a space/time trade-off, increasing or reducing some internal tables as memory permits.
Note 1: This functionality currently relies on the /proc file system giving information on the system memory ("MemTotal" in /proc/meminfo) and the memory usage of the current process ("VmSize" in /proc/self/status). If this is not available, the functionality is switched off.
Note 2: The automatic memory management can only work if there actually is unused system memory. It's not a wonder switch which reduces memory consumption. In tight memory situations, memory management has no effect and the algorithms fall back to minimum table sizes. This means that the effective size in memory can grow larger than given in the memory management parameters, but then MIRA will try to keep the additional memory requirements to a minimum.
[0 ≤ integer]
Default is 0. If automatic memory management is used (see above), this number is the size in gigabytes that the MIRA process will use as maximum target size when looking for space/time trade-offs. A value of 0 means that MIRA does not try to keep a fixed upper limit.
Note: when in competition to [-GE:kpmf] (see below), the smaller of both sizes is taken as target. Example: if your machine has 64 GiB but you limit the use to 32 GiB, then the MIRA process will try to stay within these 32 GiB.
[0 ≤ integer]
Default is 10. If automatic memory management is used (see above), this number works a bit like [-GE:mps] but the other way round: it tries to keep x percent of the memory free.
Note: when in competition to [-GE:mps] (see above), the argument leaving the most memory free is taken as target. Example: if your machine has 64 GiB and you limit the use to 42 GiB via [-GE:mps] but have a [-GE:kpmf] of 50, then the MIRA process will try to stay within 64-(64*50%)=32 GiB.
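The interplay of the two notes above can be sketched in a few lines (an illustration of the documented rules only, not MIRA's actual code; the function name is hypothetical):

```python
def effective_target_gib(total_gib, mps=0, kpmf=10):
    """Memory target in GiB for the MIRA process, given the machine's
    total memory and the -GE:mps / -GE:kpmf settings."""
    # -GE:kpmf tries to keep x percent of the machine's memory free
    kpmf_target = total_gib * (100 - kpmf) / 100.0
    # -GE:mps of 0 means "no fixed upper limit"
    if mps == 0:
        return kpmf_target
    # when both compete, the smaller target (leaving more memory free) wins
    return min(mps, kpmf_target)

# Example from the note above: 64 GiB machine, -GE:mps=42, -GE:kpmf=50
print(effective_target_gib(64, mps=42, kpmf=50))  # -> 32.0
```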
[1 ≤ integer ≤ 4]
Default is 1. Controls the starting step of the SNP search in the EST pipeline and is therefore only useful in miraSearchESTSNPs.
EST assembly is a three step process, each step with different settings for the assembly engine, and with the result of each step being saved to disk. If results of previous steps are present in a directory, one can easily "play around" with different settings for subsequent steps by reusing the results of the previous steps and directly starting with step two or three.
[on|yes|1, off|no|0]
Default is Yes. Two reads sequenced from the same clone template form a read pair with a known minimum and maximum distance. This feature will definitely help for contigs containing lots of repeats. Set this to 'yes' if your data contains information on insert sizes (e.g. in paired-end sequencing).
Information on insert sizes can be given via the SI tag in EXP files (for each read pair individually), via insert_size and insert_stdev elements of NCBI TRACEINFO XML files or for the whole project using [-GE:tismin] and [-GE:tismax] (see below).
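For reference, insert size information in a TRACEINFO XML file looks roughly like the following sketch (element names as in the NCBI trace archive format; the read and template names are, of course, illustrative):

```xml
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>myclone.p1</trace_name>
    <template_id>myclone</template_id>
    <insert_size>2000</insert_size>
    <insert_stdev>500</insert_stdev>
  </trace>
</trace_volume>
```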
Additional information to set the orientation of the read-pairs can be given via [-GE:tpbd].
[integer]
Default is -1. The default value for the minimum template size for reads that have no template size in ancillary data. If -1 is used as value, then no default value is given and reads without ancillary data giving this number will behave as if they had no template.
[integer]
Default is -1. The default value for the maximum template size for reads that have no template size in ancillary data. If -1 is used as value, then no default value is given and reads without ancillary data giving this number will behave as if they had no template.
[-1 or 1]
Default is -1 for all sequencing technologies.
This value tells MIRA how read-pairs of a template must be oriented in a contig to be valid. A value of "-1" means the orientation must be 5'-3' to 3'-5', a value of "1" means 5'-3' to 5'-3'.
Set this to "1" if you assemble paired-end 454 data downloaded from the Short Read Archives (SRAs, at the NCBI and EMBL). Set this also to "1" for Solexa data where the paired-end sequencing protocol used creates 5'-3' to 5'-3' pairs.
Note: Although with Solexa it is possible to build libraries in both directions, it is currently not possible with MIRA to mix, within the same sequencing technology, paired-end reads which need "-1" as direction with paired-end reads which need "1" as direction. This will be worked on if the need arises.
[on|yes|1, off|no|0]
Default is yes. Controls whether date and time are printed out during the assembly. Suppressing it is not useful in normal operation, only when debugging or benchmarking.
Here one defines what type of reads to load.
[on|yes|1, off|no|0]
Default is No. Defines whether to load data generated by a given technology.
[fofnexp, fasta, fastq, caf, phd, fofnphd]
Default is fasta. Takes effect only when [-LR:lsd] is 'yes'.
Defines whether to load for assembly from FASTA sequences (<projectname>_in.fasta) and their qualities (<projectname>_in.fasta.qual), from a FASTQ file (<projectname>_in.fastq), from EXP files given in a file of filenames (<projectname>_in.fofn), from a phd file (<projectname>_in.phd) or from a CAF file (<projectname>_in.caf) and assemble or eventually reassemble it.
Note 1: Only Sanger supports all file types. 454 and Solexa support only FASTA and FASTQ.
Note 2: fofnphd is currently not available.
[none, SCF]
Default is SCF. Takes effect only when [-LR:lsd] is 'yes' and for Sanger reads.
Defines the source format for reading qualities from external sources. Normally takes effect only when these are not present in the format of the load_job project (EXP and FASTA can have them, CAF and PHD must have them).
[on|yes|1, off|no|0]
Takes effect only when [-LR:lsd] is 'yes' and for Sanger reads.
Default is no, only takes effect when load_job is fofnexp. Defines whether or not the qualities from the external source override the possibly loaded qualities from the load_job project. This might be of use in case some post-processing software fiddles around with the quality values of the input file but one wants to have the original ones.
[on|yes|1, off|no|0]
Default is yes. Takes effect only when [-LR:lsd] is 'yes' and for Sanger reads.
Should there be a major mismatch between the external quality source and the sequence (e.g. the base sequence read from an SCF file does not match the originally read base sequence), this switch decides whether the read is excluded from the assembly. If it is not excluded, it will use the qualities it had before trying to load the external qualities (either default qualities or the ones loaded from the original source).
[on|yes|1, off|no|0]
Default is yes. When set to yes, MIRA will stop the assembly if there is no quality file for a given sequence file. E.g., if the FASTA quality file is missing when loading from FASTA.
[sanger, tigr, fr, stlouis, solexa]
Default is sanger for Sanger sequencing data, fr for 454 and solexa for Solexa. Defines the read naming scheme for read suffixes. These suffixes can be used by mira to deduce a template name if none is given in ancillary data.
Currently, the Sanger centre, TIGR, simple forward / reverse naming, St. Louis and Solexa/Illumina schemes are supported out of the box.
How to choose: please read the documentation available at the different centres or ask your sequence provider. In a nutshell (and probably over-simplified):
sanger: "somename.[pqsfrw][12][bckdeflmnpt][a|b|c|..." (e.g. U13a08f10.p1ca), but the length of the postfix must be at least 4 characters, i.e., ".p" alone will not be recognised. Usually, ".p" + 3 characters or "f" + 3 characters are used for forward reads, while reverse complement reads take either ".q" or ".r" (+ 3 characters in both cases).
tigr: "somenameTF*|TR*|TA*" (e.g. GCPBN02TF or GCPDL68TABRPT103A58B). Forward reads take "TF*", reverse reads "TR*".
fr: "somename.[fr]*" (e.g. E0K6C4E01DIGEW.f or E0K6C4E01BNDXN.r2nd). ".f*" for forward, ".r*" for reverse.
stlouis: "somename.[sfrxzyingtpedca]*"
solexa: even simpler than the forward/reverse scheme, it allows only for two reads per template: "somename/[12]"
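As a hedged sketch (not MIRA's actual code, covering only two of the schemes above in simplified form), template name deduction from read name suffixes could look like this:

```python
import re

def template_name(read_name):
    # Solexa/Illumina scheme: "somename/1" and "somename/2"
    m = re.match(r'^(.+)/[12]$', read_name)
    if m:
        return m.group(1)
    # simple forward/reverse scheme: "somename.f*" / "somename.r*"
    m = re.match(r'^(.+)\.[fr]\w*$', read_name)
    if m:
        return m.group(1)
    return read_name  # no recognised suffix: read is its own template

print(template_name("E0K6C4E01BNDXN.r2nd"))  # -> E0K6C4E01BNDXN
print(template_name("HWI-EAS4:1:1024/1"))    # -> HWI-EAS4:1:1024
```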
[on|yes|1, off|no|0]
Default is no. This switch applies only for sequences from older Illumina / Solexa sequencing technology when loading from FASTA! Defines whether the FASTA quality file contains Solexa scores (which also have negative values) instead of phred-style quality values. If set to yes, mira will automatically convert the Solexa scores to phred-style quality values.
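The conversion itself is the standard Solexa-score to phred-quality mapping defined for the early Solexa FASTQ variant, shown here only to illustrate what such a conversion does (not as MIRA's exact implementation):

```python
import math

def solexa_to_phred(q_solexa):
    """Convert a Solexa score (may be negative) to a phred-style quality."""
    return 10.0 * math.log10(10.0 ** (q_solexa / 10.0) + 1.0)

print(round(solexa_to_phred(-5), 2))  # -> 1.19
print(round(solexa_to_phred(20), 2))  # -> 20.04
```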
[integer]
Default is 0. This switch applies only for sequences loaded from FASTQ format!
Defines the quality offset used to convert characters into quality values. Usually, 33 is used for FASTQ in Sanger style, Solexa 1.0 format uses 59 (I think) and newer Solexa 1.3 format uses 64.
The default value of 0 switches on routines that try to guess the correct value from the data present in the FASTQ (which they can do when the data contains at least one read with at least one base with quality between 0 and 4).
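A guessing heuristic along the lines described above could be sketched like this (an illustration only, not MIRA's actual routine): quality characters below ASCII 59 cannot occur with a Solexa-style offset of 64, so seeing one proves a Sanger-style offset of 33.

```python
def guess_fastq_offset(quality_lines):
    """Guess the FASTQ quality offset (33 or 64) from quality strings."""
    low = min(min(ord(c) for c in line) for line in quality_lines)
    # offset 33: qualities 0-4 appear as ASCII characters 33-37
    # offset 64: the lowest legal Solexa score (-5) appears as ASCII 59
    return 33 if low < 59 else 64

print(guess_fastq_offset(["II#++", "IIIII"]))  # -> 33
print(guess_fastq_offset(["hhgfe", "hhhhh"]))  # -> 64
```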
[on|yes|1, off|no|0]
Default is no. Some of the file formats above (FASTA, PHD or even CAF and EXP) possibly do not contain all the info necessary or useful for each read of an assembly. Should additional information -- like clipping positions etc. -- be available in an XML trace info file in NCBI format (see File formats), then set this option to yes and it will be merged into all the data loaded, be it for Sanger, 454, Solexa or SOLiD technology. See also -FILENAME: for the name of the XML file to load.
Please note: quality clippings given here will override quality clippings loaded earlier (e.g. in EXP files) or performed by mira. Minimum clippings will still be made by the program, though.
[on|yes|1, off|no|0]
Default is no. If set to yes, the project will not be assembled and no assembly output files will be produced. Instead, the project files will only be loaded. This switch is useful for checking consistency of input files.
General options for controlling the assembly.
[integer > 0]
Default is dependent on the sequencing technology and assembly quality level. Defines how many iterations of the whole assembly process are done.
[on|yes|1, off|no|0]
Default is dependent on the sequencing technology and assembly quality level. Defines whether the skim algorithm (and with it also the recalculation of Smith-Waterman alignments) is called in-between each main pass. If set to no, skimming is done only when needed by the workflow: either when read extensions are searched for ([-DP:ure]) or when possible vector leftovers are to be clipped ([-CL:pvc]).
Setting this option to yes is highly recommended, setting it to no only for quick and dirty assemblies.
[integer > 0]
Default is dependent on the sequencing technology and assembly quality level. Defines the maximum number of times a contig can be rebuilt during a main assembly pass ([-AS:nop]) if misassemblies due to possible repeats are found.
[on|yes|1, off|no|0]
Default is currently yes. Tells mira to use coverage information accumulated over time to more accurately pinpoint reads that are in repetitive regions.
[float > 1.0]
Default is 2.0 for all sequencing technologies in most assembly cases. This option says: if a read has ever been aligned at positions where the total coverage of all reads of the same sequencing technology attained the average coverage times [-AS:ardct] (over a length of [-AS:ardml], see below), then this read is considered to be repetitive.
[integer > 1]
Default is dependent on the sequencing technology, currently 400 for Sanger and 200 for 454.
A coverage must be at least this number of bases higher than [-AS:ardct] before being really treated as repeat.
[integer > 1]
Default is dependent on the sequencing technology.
[on|yes|1, off|no|0]
Default is currently yes for genome assemblies and no for EST assemblies or assemblies with Solexa data.
Takes effect only if uniform read distribution ([-AS:urd]) is on.
When set to yes, mira will analyse the coverage of contigs built at a certain stage of the assembly and estimate an average expected coverage for contigs. This value will be used in subsequent passes of the assembly to ensure that no part of a contig gets significantly more coverage from reads previously identified as repetitive than the estimated average coverage allows for.
This switch is useful to disentangle repeats that are otherwise 100% identical and generally allows to build larger contigs. It is expected to be useful for Sanger and 454 sequences. Usage of this switch with Solexa data is currently not recommended.
It is a real improvement for disentangling repeats, but has the side-effect of creating some "contig debris" (small and low coverage contigs, things you can normally safely throw away as they represent sequence that already has enough coverage).
This switch must be set to no for EST assembly, assembly of transcripts etc. It is recommended to also switch this off for mapping assemblies.
[integer > 0]
Default is dependent on the sequencing technology and assembly quality level. Recommended values are: 3 for an assembly with 3 to 4 passes ([-AS:nop]). Assemblies with 5 passes or more should set the value to the number of passes minus 2.
Takes effect only if uniform read distribution ([-AS:urd]) is on.
[float > 1.0]
Default is 1.5 for all sequencing technologies in most assembly cases. The [--highlyrepetitive] quick-switch sets this to 1.2.
This option says: if mira determined that the average coverage is x, then in subsequent passes it will allow reads determined to be repetitive to be built into the contig only up to a total coverage of x * urdcm. Reads that would bring the coverage above that threshold will be rejected from that specific place in the contig (and either be built into another copy of the repeat somewhere else or end up as contig debris).
Please note that the lower [-AS:urdcm] is, the more contig debris you will end up with (contigs with an average coverage less than half of the expected coverage, mostly short contigs with just a couple of reads).
Takes effect only if uniform read distribution ([-AS:urd]) is on.
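The [-AS:urdcm] rule above can be illustrated with a small sketch (a hypothetical helper, not MIRA code): with an estimated average coverage x, a repetitive read is only accepted at a contig position as long as the total coverage there stays within x * urdcm.

```python
def accept_repetitive_read(current_coverage, avg_coverage, urdcm=1.5):
    """Would adding one more repetitive read stay within avg * urdcm?"""
    return current_coverage + 1 <= avg_coverage * urdcm

print(accept_repetitive_read(44, 30))  # -> True  (45 <= 30 * 1.5)
print(accept_repetitive_read(45, 30))  # -> False (46 >  30 * 1.5)
```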
[on|yes|1, off|no|0]
Default is dependent on the --job quality level: currently no for draft and yes for normal and accurate. Switched off for EST assembly.
Tells mira to keep repeats longer than the length of reads in separate contigs.
[on|yes|1, off|no|0]
Default is dependent on the sequencing technology and assembly quality level. A spoiler is either a chimeric read or a read with long parts of unclipped vector sequence still included (too long for the [-CL:pvc] vector leftover clipping routines). A spoiler typically prevents contigs from being joined; MIRA will cut spoilers back so that they pose no more harm to the assembly.
Recommended for mid- to high-coverage genomic assemblies, not recommended for assemblies of ESTs as one might lose splice variants with it.
A minimum number of two assembly passes ([-AS:nop]) must be run for this option to take effect.
[on|yes|1, off|no|0]
Default is yes. Defines whether the spoiler detection algorithms are run only for the last pass or for all passes ( [-AS:nop]).
Takes effect only if spoiler detection ([-AS:sd]) is on. If in doubt, leave it to 'yes'.
[integer ≥ 20]
Default is dependent on the sequencing technology. Defines the minimum length that reads must have to be considered for the assembly. Shorter sequences will be filtered out at the beginning of the process and won't be present in the final project.
[integer ≥ 1]
Default is dependent on the sequencing technology and the [--job] parameter. For genome assemblies it's usually around 2 for Sanger, 5 for 454, 5 for PacBio and 10 for Solexa. In EST assemblies, it's currently 2 for all sequencing technologies.
Defines the minimum number of reads a contig must have before it is built or saved by MIRA. Overlap clusters with less reads than defined will not be assembled into contigs but reads in these clusters will be immediately transferred to debris.
This parameter is useful to considerably reduce assembly time in large projects with millions of reads (like in Solexa projects) where a lot of small "junk" contigs with contamination sequence or otherwise uninteresting data may be created otherwise.
Note: a value larger than 1 for this parameter interferes with the functioning of [-OUT:sssip] and [-OUT:stsip].
[integer ≥ 0]
Default is currently 10 for all sequencing technologies. Defines the default base quality of reads that have no quality read from file.
[on|yes|1, off|no|0]
Default is yes. When set to yes, MIRA will stop the assembly if any read has no quality values loaded.
[on|yes|1, off|no|0]
Default is yes. MIRA has two different pathfinder algorithms it can choose from to find its way through the (more or less) complete set of possible sequence overlaps: a genomic and an EST pathfinder. The genomic version looks a bit into the future of the assembly and tries to stay on safe grounds, using a maximum of the information already present in the contig being built. The EST version, on the contrary, will directly jump at the complex cases posed by very similar repetitive sequences, try to solve those first, and is willing to fall back to first-come-first-served when really bad cases (like, e.g., coverage with thousands of sequences) are encountered.
Generally, the genomic pathfinder will also work quite well with EST sequences (but might get slowed down a lot in pathological cases), while the EST algorithm does not work so well on genomes. If in doubt, leave on yes for genome projects and set to no for EST projects.
[on|yes|1, off|no|0]
Default is yes. Another important switch if you plan to assemble non-normalised EST libraries, where some ESTs may reach coverages of several hundreds or thousands of reads. This switch lets MIRA save a lot of computational time when aligning those extremely high coverage areas (but only there), at the expense of some accuracy.
[integer > 0]
Default is 500. Defines the number of potential partners a read must have before MIRA switches into emergency search stop mode for that read.
[on|yes|1, off|no|0]
Default is no. Defines whether there is an upper limit of time to be used to build one contig. Set this to yes in EST assemblies where you think that extremely high coverages occur. Less useful for assembly of genomic sequences.
[integer > 0]
Default is 10000. Depending on [-AS:umcbt] above, this number defines the time in seconds allocated to building one contig.
General options for controlling backbone options for mapping assemblies as well as general strain information.
[on|yes|1, off|no|0]
Default is no. Straindata is a key-value file with one read per line: first the name of the read, then the strain name of the organism the read comes from. It is used by the program to differentiate and classify the different types of SNPs appearing in the organisms.
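A straindata file could look like the following sketch (read and strain names purely illustrative; read name and strain name are separated by whitespace):

```
read001.p1c    strainA
read001.q1c    strainA
gel17a04.p1c   strainB
```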
on|yes|1, off|no|0
]
Default is no for de-novo assemblies and yes for mapping.
Defines whether, after having loaded all data from all possible sources, MIRA will assign a strain name to reads which didn't get strain information via said data files (either NCBI TRACEINFO XML files or the simple MIRA straindata files). The strain name to assign is determined via [-SB:dsn] (see below).
string
]
Default is StrainX. Defines the strain name to assign to reads which don't have a strain name after loading, works only if [-SB:ads=yes] (see above).
on|yes|1, off|no|0
]
Default is no. A backbone is a sequence (or a previous assembly) that is used as template for a mapping assembly. The current assembly process will assemble reads first to those loaded backbone contigs before creating new contigs (if any).
This feature is helpful for assembling against previous (and already possibly edited) assembly iterations, or to make a comparative assembly of two very closely related organisms. Please read "very closely related" as in: only SNP mutations or short indels present.
0 < integer
]
Default is dependent on assembly quality level chosen: 0 for 'draft', 1 for 'normal' and [-AS:nop] divided by 2 for 'accurate'.
When assembling against backbones, this parameter defines the pass iteration (see [-AS:nop]) from which on the backbones will be really used. In the passes preceding this number, the non-backbone reads will be assembled together as if no backbones existed. This allows mira to correctly spot repetitive stretches that differ by single bases and tag them accordingly. Note that full assemblies are considerably slower than mapping assemblies, so be careful with this when assembling millions of reads.
Rule of thumb: if the backbones belong to the same strain as the reads to assemble, set to 1. If the backbones are a different strain, set [-SB:sbuip] to 1 lower than [-AS:nop] (example: nop=4 and sbuip=3).
string
]
Default is ReferenceStrain. Defines the name of the strain that the backbone sequences have.
on|yes|1, off|no|0
]
Default is no. Useful when using CAF as input for backbone: forces all reads of the backbone contigs to get assigned the new backbone strain, even if they previously had other strains assigned.
Main usage is in multi-step hybrid assemblies.
string
]
Default is an empty string. Useful when using CAF as input for backbone: when set to a given strain name, mira will internally use only reads from the given strain to build the rails it will use to align reads.
Main usage is in multi-step hybrid assemblies.
fasta, caf, gbf
]
Default is fasta. Defines the filetype of the backbone file given. Currently only FASTA, CAF and GBF files are supported.
When GBF (GenBank files, more commonly named '.gbk') files are loaded, the features within these files are automatically transformed into Staden compatible tags and get passed through the assembly.
0 ≤ integer ≤ 10000
]
Default is 0. Parameter for the internal sectioning size of the backbone to compute optimal alignments. Should be set to two times the length of the longest read in the input data, plus 15%. When set to 0, MIRA will compute optimal values from the data loaded.
0 ≤ integer ≤ 2000
]
Default is 0. Parameter for the internal sectioning size of the backbone to compute optimal alignments. Should be set to length of the longest read. When set to 0, MIRA will compute optimal values from the data loaded.
-1 ≤ integer ≤ 100
]
Default is -1. Defines the default quality that the backbone sequences have if they came without quality values in their files (like in GBF format or when FASTA is used without .qual files). A value of -1 tells mira to use the same default quality for backbones as for reads.
on|yes|1, off|no|0
]
Default is no. Standard mapping assembly mode of the assembler is to map available reads to a backbone and discard reads that do not fit. If set to 'yes', mira will use reads that did not map to the backbone(s) to make new contigs (if possible). Please note: while a simple mapping assembly is comparatively cheap in terms of memory and time consumed, setting this option to 'yes' means that behind the scenes data for a full blown de-novo assembly is generated in addition to the data needed for a mapping assembly, which makes it a bit more costly than a de-novo assembly per se.
Options for controlling some data processing during the assembly.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used: yes for Sanger, no for all others. mira expects the sequences it is given to be quality clipped. During the assembly though, it will try to extend reads into the clipped region and gain additional coverage by analysing Smith-Waterman alignments between reads that were found to be valid. Only the right clip is extended though; the left clip (most of the time containing sequencing vector) is never touched.
integer > 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the window length.
integer > 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines use a sliding window approach on Smith-Waterman alignments. This parameter defines the maximum number of errors (= disagreements) between two alignments in the given window.
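The sliding-window idea behind the read extension can be sketched as follows (a toy illustration, not MIRA's actual implementation; function and parameter names are made up):

```python
def max_extension(read_align, partner_align, window_len=30, max_errors=2):
    """Walk along two aligned sequences and return the last position up
    to which every window of window_len bases holds at most max_errors
    disagreements.  Extension into the clipped region would stop there."""
    assert len(read_align) == len(partner_align)
    good_until = 0
    for start in range(0, len(read_align) - window_len + 1):
        window_errors = sum(
            a != b
            for a, b in zip(read_align[start:start + window_len],
                            partner_align[start:start + window_len])
        )
        if window_errors > max_errors:
            break
        good_until = start + window_len
    return good_until
```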
integer ≥ 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see [-AS:nop]). This parameter defines the first pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the first time before the first assembly pass.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-DP:ure] (see above) is set to yes. The read extension routines can be called before assembly and/or after each assembly pass (see [-AS:nop]). This parameter defines the last pass in which the read extension routines are called. The default of 0 tells mira to extend the reads the last time before the first assembly pass.
Controls for clipping options: when and how sequences should be clipped.
Every option in this section can be set individually for every sequencing technology, giving a very fine grained control on how reads are clipped for each technology.
on|yes|1, off|no|0
]
Default is no. Uses the parameters [-CL:msvsgs:msvsmfg:msvsmeg] (see below).
Before running mira, the ssaha2 or smalt programs from the Sanger centre can be used to detect possible vector sequence stretches in the input data for the assembly. This parameter - if set to yes - will let mira load the result file of a ssaha2 or smalt run and tag the possible vector sequences at the ends of reads.
ssaha2 must be called like this: "ssaha2 <ssaha2options> vector.fasta sequences.fasta" to generate an output that can be parsed by mira. In the above example, replace vector.fasta by the name of the file with your vector sequences and sequences.fasta by the name of the file containing your sequencing data.
smalt must be called like this: "smalt map -f ssaha <smaltoptions> hash_index sequences.fasta"
This makes you basically independent from any other commercial or license-requiring vector screening software. For Sanger reads, a combination of lucy and ssaha2 or smalt together with this parameter should do the trick. For reads coming from 454 pyro-sequencing, ssaha2 or smalt and this parameter will also work very well. See the usage manual for a walkthrough example on how to use SSAHA2 / SMALT screening data.
Note 1: the output format of SSAHA2 must be the native output format (-output ssaha2). For SMALT, the output option -f ssaha must be used. Other formats cannot be parsed by MIRA.
Note 2: when using SSAHA2 results, the input file must be named <projectname>_ssaha2vectorscreen_in.txt. When using SMALT results, the input file must be named <projectname>_smaltvectorscreen_in.txt.
Note 3: if both a ssaha2 and a smalt result file are present, both will be read.
Note 4: I currently use the following SSAHA2 options: -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6
Note 5: Anyone contributing SMALT parameters?
Note 6: the sequence vector clippings generated from SSAHA2 / SMALT data do not replace sequence vector clippings loaded via the EXP, CAF or XML files, they rather extend them.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences, mira will look if it can merge larger chunks of sequencing vector bases that are a maximum of [-CL:msvsgs] apart.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences at the start of a sequence, mira will allow up to this number of non-vector bases in front of a vector stretch.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Takes effect only if [-CL:msvs] is yes. While performing the clip of screened vector sequences at the end of a sequence, mira will allow up to this number of non-vector bases behind a vector stretch.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used: yes for Sanger, no for all others. mira will try to identify possible sequencing vector relics present at the start of a sequence and clip them away. These relics are usually a few bases long and were not correctly removed from the sequence in the data preprocessing steps of external programs.
You might want to turn off this option if you know (or think) that your data contains a lot of repeats and the option below to fine tune the clipping behaviour does not give the expected results.
You certainly want to turn off this option in EST assemblies as it will quite certainly cut back (and thus hide) different splice variants. But then make certain that your pre-processing of Sanger data (sequencing vector removal) is good; other sequencing technologies are not affected.
integer ≥ 0
]
Default is dependent on the sequencing technology used. The clipping of possible vector relics works quite well. Unfortunately, especially the bounds of repeats or differences in EST splice variants sometimes show the same alignment behaviour as possible sequencing vector relics and could therefore also be clipped.
To keep the vector clipping from mistakenly clipping repetitive regions or EST splice variants, this option puts an upper bound on the number of bases a potential clip is allowed to have. If the number of bases is below or equal to this threshold, the bases are clipped. If it exceeds the threshold, the clip is NOT performed.
Setting the value to 0 turns off the threshold, i.e., clips are then always performed if a potential vector was found.
on|yes|1, off|no|0
]
Default is no. This will let mira perform its own quality clipping before sequences are entered into the assembly. The clip function performed is a sequence end window quality clip with back iteration to get a maximum number of bases as useful sequence. Note that the bases clipped away here can still be used afterwards if there is enough evidence supporting their correctness when the option [-DP:ure] is turned on.
Warning: The windowing algorithm works pretty well for Sanger, but apparently does not like 454 type data. It's advisable not to switch it on for 454. Besides, the 454 quality clipping algorithm performs a pretty decent albeit not perfect job, so for genomic 454 data (not! ESTs), it is currently recommended to use a combination of [-CL:emrc] and [-DP:ure].
integer ≥ 15 and ≤ 35
]
Default is dependent on the sequencing technology used. This is the minimum quality that bases in a window must have to be accepted. Please be cautious not to choose extreme values here, because the clipping will then be too lax or too harsh. Values below 15 and above 30-35 are not recommended.
integer ≥ 10
]
Default is dependent on the sequencing technology used. This is the length of a window in bases for the quality clip.
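A minimal sketch of an end-window quality clip as described above (illustrative only; MIRA's "back iteration" logic is more involved, and the function name and defaults are made up):

```python
def right_quality_clip(quals, min_qual=20, window_len=10):
    """Scan windows from the read end back towards the start and return
    the largest clip position (exclusive end of the good part) whose
    last window reaches an average quality of at least min_qual."""
    for end in range(len(quals), window_len - 1, -1):
        window = quals[end - window_len:end]
        if sum(window) / window_len >= min_qual:
            return end
    return 0
```

Scanning from the end downwards keeps the maximum number of bases as useful sequence, in the spirit of the back iteration mentioned above.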
on|yes|1, off|no|0
]
Default is no. This option allows clipping of reads that were not correctly preprocessed and have unclipped bad quality stretches that might prevent a good assembly.
mira will search the sequence in forward direction for a stretch of bases whose average quality is less than a defined threshold and then set the right quality clip of the sequence to cover that stretch.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Defines the minimum average quality a given window of bases must have. If this quality is not reached, the sequence will be clipped at this position.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Defines the length of the window within which the average quality of the bases is computed.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. This will let mira perform a 'clipping' of bases that were masked out (replaced with the character X). It is generally not a good idea to mask bases to remove unwanted portions of a sequence; the EXP file format and the NCBI traceinfo format have excellent possibilities to circumvent this. But because a lot of preprocessing software is built around cross_match-, scylla- and phrap-style base masking, the need arose for mira to be able to handle this, too. mira will look at the start and end of each sequence to see whether there are masked bases that should be 'clipped'.
integer ≥ 0
]
Default is dependent on the sequencing technology used. While performing the clip of masked bases, mira will look if it can merge larger chunks of masked bases that are a maximum of [-CL:mbcgs] apart.
integer ≥ 0
]
Default is dependent on the sequencing technology used. While performing the clip of masked bases at the start of a sequence, mira will allow up to this number of unmasked bases in front of a masked stretch.
integer ≥ 0
]
Default is dependent on the sequencing technology used. While performing the clip of masked bases at the end of a sequence, mira will allow up to this number of unmasked bases behind a masked stretch.
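The three parameters above (gap merging plus front/end allowances) could interact roughly like this toy sketch, which is not MIRA's code; names and defaults are made up:

```python
import re

def masked_clip_points(seq, gap_size=4, front=2, end=2):
    """Toy masked-base ('X') clipping: merge X-stretches that are at
    most gap_size apart, then clip a leading stretch if at most `front`
    unmasked bases precede it, and a trailing stretch if at most `end`
    unmasked bases follow it.  Returns (left_clip, right_clip)."""
    runs = [(m.start(), m.end()) for m in re.finditer(r"X+", seq)]
    merged = []
    for s, e in runs:
        if merged and s - merged[-1][1] <= gap_size:
            merged[-1] = (merged[-1][0], e)   # close enough: merge chunks
        else:
            merged.append((s, e))
    left, right = 0, len(seq)
    if merged and merged[0][0] <= front:
        left = merged[0][1]
    if merged and len(seq) - merged[-1][1] <= end:
        right = merged[-1][0]
    return left, right
```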
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used: on for 454 data, off for all others. This will let mira perform a 'clipping' of bases that are in lowercase at both ends of a sequence, leaving only the uppercase sequence. Useful when handling 454 data that does not have ancillary data in XML format.
on|yes|1, off|no|0
]
Default is no. This option is useful in EST assembly. Poly-A stretches in forward reads and poly-T stretches in reverse reads that were not correctly masked or clipped in preprocessing steps from external programs get clipped or tagged here. The assembler will not use these stretches for critical operations.
on|yes|1, off|no|0
]
Default is no. This option is currently not active (as of version 2.9.22).
In the future, this will allow keeping the poly-A signal in the reads and tagging it. The tags provide a good visual anchor when looking at the assembly with different programs.
integer > 0
]
Default is 10. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the number of 'A' bases (in forward direction) or 'T' bases (in reverse direction) that must be present to be considered a poly-A signal stretch.
integer > 0
]
Default is 1. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the maximum number of errors allowed in the potential poly-A signal stretch. The distribution of these errors is not important.
integer > 0
]
Default is 9. Only takes effect when [-CL:cpat] (see above) is set to yes. Defines the number of bases from the end of a sequence (if masked: from the end of the masked area) within which a poly-A signal stretch is looked for.
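The three poly-A parameters above might combine as in this sketch (an illustration with made-up names, not MIRA's implementation; the exact anchoring of the search window in MIRA may differ):

```python
def find_polya(seq, min_len=10, max_errors=1, window=9):
    """Look for a stretch of min_len bases containing at most
    max_errors non-'A' bases, starting within `window` bases of the
    sequence end.  Returns the start position of the stretch, or -1."""
    earliest = max(0, len(seq) - window - min_len)
    for start in range(earliest, len(seq) - min_len + 1):
        stretch = seq[start:start + min_len]
        if sum(base != "A" for base in stretch) <= max_errors:
            return start
    return -1
```

A reverse read would be scanned for 'T' stretches at the other end in the same fashion.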
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. If on, ensures a minimum left clip on each read according to the parameters in [-CL:mlcr:smlc].
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emlc] is on, checks whether there is a left clip whose length is at least the one specified here.
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emlc] is on and the actual left clip is < [-CL:mlcr], set the left clip of the read to the value given here.
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. If on, ensures a minimum right clip on each read according to the parameters in [-CL:mrcr:smrc].
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emrc] is on, checks whether there is a right clip whose length is at least the one specified here.
integer ≥ 0
]
Default is dependent on the sequencing technology used. If [-CL:emrc] is on and the actual right clip is < [-CL:mrcr], set the length of the right clip of the read to the value given here.
on|yes|1, off|no|0
]
Default is yes for [--job=genome] assemblies and no for [--job=est] assemblies.
The SKIM routines of MIRA can also be used without much time overhead to find chimeric reads. When this parameter is set, MIRA will use that info to cut back chimeras to their longest non-chimeric length.
Warning: when working on low coverage data (e.g., < 5 to 6x Sanger and < 10x 454), you may want to switch off this option if you try to go for the longest contigs. Reason: single reads joining otherwise disjunct contigs will probably be categorised as chimeras.
on|yes|1, off|no|0
]
Default is currently no.
The SKIM routines of MIRA can also be used without much time overhead to find junk sequence at the ends of reads. When this parameter is set, MIRA will use that info to cut back junk in reads.
It is currently suggested to leave this parameter switched off as the routines seem to be a bit too "trigger happy" and also cut back perfectly valid sequences.
on|yes|1, off|no|0
]
Default is dependent on --job quality: currently no for draft and yes for normal and accurate. Switched off for EST assembly.
This implements a pretty powerful strategy to ensure a good "high confidence region" (HCR) in reads, basically eliminating 99.9% of all junk at the 5' and 3' ends of reads. Note that one still must ensure that sequencing vectors (Sanger) or adaptor sequences (454) are "more or less" clipped prior to assembly.
Warning: extremely effective, but should NOT be used for very low coverage genomic data, for EST projects, or if one wants to retain rare transcripts.
on|yes|1, off|no|0
]
Default is yes.
Solexa data has a pretty awful problem in some reads when a GGCxG motif occurs (read more about it in the chapter on Solexa data). In short: the sequencing errors produced by this problem lead to many false positive SNP discoveries in mapping assemblies, or to problems in contig building in de-novo assembly.
MIRA knows about this problem and can look for it in Solexa reads during the proposed end clipping and further clip back the reads, greatly minimising the impact of this problem.
integer ≥ 10
]
Default is dependent on --job: currently 17 for Sanger and 454, 21 for Solexa.
This parameter defines the minimum number of bases at each end of a read that should be free of any sequencing errors. Note that the algorithm is based on SKIM hashing (see below), and compares hashes of all reads with each other. Therefore, using values less than 12 will lead to false negative hits.
Options that control the behaviour of the initial fast all-against-all read comparison algorithm. Matches found here will be confirmed later in the alignment phase. The new SKIM3 algorithm that is in place since version 2.7.4 uses a hash based algorithm that works similarly to SSAHA (see Ning Z, Cox AJ, Mullikin JC; "SSAHA: a fast search method for large DNA databases."; Genome Res. 2001;11;1725-9).
The major differences of SKIM3 and SSAHA are:
the word length n of a hash can be up to 31 bases (in 64 bit versions of MIRA)
SKIM3 uses a maximum fixed amount of RAM that is independent of the word size. E.g., SSAHA would need 4 exabytes to work with a word length of 30 bases ... SKIM3 just takes a couple of hundred MB.
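The hashing idea behind SSAHA (and thus SKIM3) can be sketched in a few lines; this is a conceptual illustration with made-up function names, using a plain Python dict instead of the fixed-size structures that let SKIM3 keep RAM independent of the word length:

```python
def build_kmer_index(seq, k=8, step=1):
    """SSAHA-style lookup table: k-mer -> list of positions in seq."""
    index = {}
    for pos in range(0, len(seq) - k + 1, step):
        index.setdefault(seq[pos:pos + k], []).append(pos)
    return index

def find_seed_hits(index, query, k=8):
    """Return (query_pos, subject_pos) pairs for every exact k-mer match;
    such seed hits are later verified by full alignment."""
    return [(qpos, spos)
            for qpos in range(len(query) - k + 1)
            for spos in index.get(query[qpos:qpos + k], [])]
```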
The parameters for SKIM3:
integer ≥ 1
]
Number of threads used in SKIM, default is 2. A few parts of SKIM are non-threaded, so the speedup is not exactly linear, but it should be very close. E.g., with 2 processors I get a speedup of 180-195%, with 4 between 350 and 395%.
Although the main data structures are shared between the threads, there's some additional memory needed for each thread.
on|yes|1, off|no|0
]
Default is on. Defines whether SKIM searches for matches only in forward/forward direction or whether it also looks for forward/reverse direction.
You usually will not want to touch the default, except for very special application cases where you do not want MIRA to use reverse complement sequences at all.
integer ≥ 1
]
If only Sanger or 454 data is used, default is 14 on 32 bit systems and 16 on 64 bit systems. Controls the number of consecutive bases n which are used as a word hash. The higher the value, the faster the search. The lower the value, the more weak matches are found. Values below 10 are not recommended.
integer ≥ 1
]
Default is 4. This is a parameter controlling the stepping increment s with which hashes are generated. This allows for a more fine grained search as matches are found with at least n+s (see [-SK:bph]) equal bases. The higher the value, the faster the search. The lower the value, the more weak matches are found.
integer ≥ 1
]
Default is dependent on the sequencing technology used and the assembly quality wished. Controls the relative percentage of exact word matches in an approximate overlap that has to be reached to accept this overlap as possible match. Increasing this number will decrease the number of possible alignments that have to be checked by Smith-Waterman later on in the assembly, but it also might lead to the rejection of weaker overlaps (i.e. overlaps that contain a higher number of mismatches).
Note: most of the time it makes sense to keep this parameter in sync with [-AL:mrs].
integer ≥ 1
]
Default is 2000. Controls the maximum number of possible hits one read can transport to the graph edge reduction phase. If more potential hits are found, only the best ones are taken.
In the pre-2.9.x series, this was an important option for tackling projects which contain extreme assembly conditions. It still is if you run out of memory in the graph edge reduction phase. Try then to lower it to 1000, 500 or even 100.
As the assembly increases in passes ([-AS:nop]), different combinations of possible hits will be checked, always the probably best ones first. So the accuracy of the assembly should only suffer when lowering this number too much.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur less than [-SK:fenn] times the average occurrence will be tagged with a HAF2 (less than average) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fenn] but less than [-SK:fexn] times the average occurrence will be tagged with a HAF3 (normal) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fexn] but less than [-SK:fer] times the average occurrence will be tagged with a HAF4 (above average) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fer] but less than [-SK:fehr] times the average occurrence will be tagged with a HAF5 (repeat) tag.
float > 0
]
During SKIM analysis, MIRA will estimate how repetitive parts of reads are. Parts which occur more than [-SK:fehr] but less than [-SK:fecr] times the average occurrence will be tagged with a HAF6 (heavy repeat) tag. Parts which occur more than [-SK:fecr] but less than [-SK:nrr] times the average occurrence will be tagged with a HAF7 (crazy repeat) tag.
on|yes|1, off|no|0
]
Default is dependent on --job type: yes for de-novo, no for mapping. Tells mira to mask during the SKIM phase subsequences of size [-SK:nph] nucleotides that appear more often than the median occurrence of subsequences would otherwise suggest. The threshold from which on subsequences are considered nasty is set by [-SK:nrr] (see below).
There's one drawback though: the smaller the reads you try to assemble with this option turned on, the higher the probability that your reads will not span nasty repeats completely, therefore leading to contig building being aborted at this site.
The masked parts are tagged with "MNRr" in the reads.
This option is extremely useful for assembly of larger projects (fungi-size) with a high percentage of repeats. Or in non-normalised EST projects, to get at least something assembled.
Although it is expected that bacteria will not really need this, leaving it turned on will probably not harm except in unusual cases like several copies of (pro-)phages integrated in a genome.
integer ≥ 2
]
Default is depending on the [--job=...] parameters. Normally it's high (around 100) for genome assemblies, but much lower (20 or less) for EST assemblies.
Sets the ratio from which on subsequences are considered nasty and hidden from the SKIM overlapper with a MNRr tag. The value of 10 means: mask all k-mers of [-SK:bph] length which are occurring more than 10 times more often than the average of the project.
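The ratio-based masking could be sketched like this (a toy single-sequence illustration, not MIRA's code, which works on hashes across the whole project; the text above mentions both median and average occurrence, the average is used here):

```python
from collections import Counter

def mask_nasty_repeats(seq, k=4, ratio=3):
    """Count all k-mers, then replace with 'X' every k-mer occurring
    more than ratio times the average k-mer count."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    average = sum(counts.values()) / len(counts)
    masked = list(seq)
    for i in range(len(seq) - k + 1):
        if counts[seq[i:i + k]] > ratio * average:
            masked[i:i + k] = "X" * k   # equivalent of an MNRr-masked stretch
    return "".join(masked)
```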
integer; 0, 5-8
]
Default is 6. Sets the minimum level of the HAF tags from which on MIRA will report tentatively repetitive sequence in the *_info_readrepeats.lst file of the info directory.
A value of 0 means "switched off". The default value of 6 means that all subsequences tagged with HAF6, HAF7 and MNRr will be logged. If you, e.g., only wanted MNRr logged, you'd use 8 as parameter value.
See also [-SK:fenn:fexn:fer:fehr:mnr:nrr] to set the different levels for the HAF and MNRr tags.
integer ≥ 0
]
Default is 0. If the number of reads identified as megahubs exceeds the allowed ratio, mira will abort.
This is a fail-safe parameter to avoid assemblies where things look fishy. In case you see this, you might want to ask for advice on the mira_talk mailing list. In short: bacteria should never have megahubs (90% of all cases reported were contamination of some sort and the other 10% were due to incredibly high coverage numbers). Eukaryotes are likely to contain megahubs if [-SK:mnr] filtering is not on.
EST projects, however, especially from non-normalised libraries, will very probably contain megahubs. In this case, you might want to think about masking, see [-SK:mnr].
integer ≥ 100000
]
Default is 15000000. Has no influence on the quality of the assembly, only on the maximum memory size needed during the skimming. The default value is equivalent to approximately 500MB.
Note: reducing the number will increase the run time, the more drastically the bigger the reduction. On the other hand, increasing the default value chosen will not result in speed improvements that are really noticeable. In short: leave this number alone if you are not desperate to save a few MB.
integer ≥ 10
]
Default is 1024, 2048 when Solexa sequences are used. Maximum memory used (in MiB) during the reduction of skim hits.
Note: has no influence on the quality of the assembly, reducing the number will increase the runtime, the more drastically the bigger the reduction as hits then must be streamed multiple times from disk.
The default is good enough for assembly of bacterial genomes or small eukaryotes (using Sanger and/or 454 sequences). As soon as assembling something bigger than 20 megabases, you should increase it to 2048 or 4096 (equivalent to 2 or 4 GiB of memory).
The align options control the behaviour of the Smith-Waterman alignment routines. Only read pairs which are confirmed here may be included into contigs. Affects both the checking of possible alignments found by SKIM as well as the phase when reads are integrated into a contig.
Every option in this section can be set individually for every sequencing technology, giving a very fine grained control on how reads are aligned for each technology.
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. The banded Smith-Waterman alignment uses this percentage number to compute the bandwidth it has to use when computing the alignment matrix. E.g., if the expected overlap is 150 bases and bip=10, the banded SW will compute a band of 15 bases to each side of the expected alignment diagonal, thus allowing up to 15 unbalanced inserts / deletes in the alignment. Increasing this number will find more non-optimal alignments, but will also increase the SW runtime between linearly and quadratically; decreasing works the other way round: a few bad alignments might be missed, but speed is gained.
integer > 0
]
Default is dependent on the sequencing technology used. Minimum bandwidth in bases to each side.
integer > 0
]
Default is dependent on the sequencing technology used. Maximum bandwidth in bases to each side.
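How the three bandwidth parameters above could interact can be sketched as follows (the clamping formula is an assumption for illustration; the exact computation inside MIRA may differ, and the default values here are made up):

```python
def banded_sw_bandwidth(expected_overlap, bip=10, bmin=5, bmax=40):
    """Band (in bases, to each side of the expected alignment diagonal)
    for the banded Smith-Waterman: bip percent of the expected overlap,
    clamped to the [bmin, bmax] range."""
    band = expected_overlap * bip // 100
    return max(bmin, min(bmax, band))
```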
integer > 0
]
Default is dependent on the sequencing technology used. Minimum number of overlapping bases needed in an alignment of two sequences for it to be accepted.
integer > 0
]
Default is dependent on the sequencing technology used. Describes the minimum score of an overlap to be taken into account for assembly. mira uses a default scoring scheme for the SW alignment: each match counts 1, a match with an N counts 0, each mismatch with a non-N base -1 and each gap -2. Take a bigger score to weed out a number of chance matches, a lower score to perhaps find the single (short) alignment that might join two contigs together (at the expense of computing time and memory).
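Applying the default scoring scheme stated above to an already gapped alignment looks like this (a sketch to make the scheme concrete; the scoring values are taken from the text, the function itself is not MIRA's code):

```python
def alignment_score(a, b):
    """Score a gapped alignment: match +1, anything against N 0,
    mismatch of non-N bases -1, each gap column -2."""
    score = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            score -= 2            # gap column
        elif x == "N" or y == "N":
            pass                  # N scores 0
        elif x == y:
            score += 1            # match
        else:
            score -= 1            # mismatch
    return score
```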
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. Describes the minimum % of matching between two reads to be considered for assembly. Increasing this number will save memory, but one might lose possible alignments. I propose a maximum of 80 here. Decreasing below 55% will probably make memory and time consumption explode.
Note: most of the time it makes sense to keep this parameter in sync with [-SK:pr].
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. Defines whether or not to increase penalties applied to alignments containing long gaps. Setting this to 'yes' might help in projects with frequent repeats. On the other hand, it is definitely disturbing when assembling very long reads containing multiple long indels in the called base sequence ... although this should not happen in the first place and is a sure sign of problems lying ahead.
When in doubt, set it to yes for EST projects and de-novo genome assembly, set it to no for assembly of closely related strains (assembly against a backbone).
When set to no, it is recommended to have [-CO:amgb] and [-CO:amgbemc] both set to yes.
low|0, medium|1, high|2, split_on_codongaps|10
]
Default is dependent on the sequencing technology used. Has no effect if extra_gap_penalty is off. Defines an extra penalty applied to 'long' gaps. There are these predefined levels: low - use this if you expect your base caller to frequently miss 2 or more bases; medium - use this if your base caller is expected to frequently miss 1 to 2 bases; high - use this if your base caller does not frequently miss more than 1 base.
For some stages of the EST assembly process, a special value split_on_codongaps is used. It's even a tick harsher than the 'high' level.
Also, usage of this parameter is probably a good thing if the repeat marker of the contig is set to not mark on gap bases ([-CO:amgb] set to no). This is generally the case for 454 data.
0 ≤ integer ≤ 100
]
Default is 100. Has no effect if extra_gap_penalty is off. Defines the maximum extra penalty in percent applied to 'long' gaps.
The contig options control the behaviour of the contig objects.
string
]
Default is <projectname>. Contigs will have this string prepended to their names. The [-project=] quick-switch will also change this option.
integer > 0 and ≤100
]
Default is dependent on the sequencing technology used. When adding reads to a contig, reject a read if the drop in the quality of the consensus is > the given value in %. Lower values mean stricter checking. This value is doubled when the read being entered has a template partner (a read pair) at the right distance.
on|yes|1, off|no|0
]
Default is yes. One of the most important switches in MIRA: if set to yes, MIRA will try to resolve misassemblies due to repeats by identifying single base stretch differences and tag those critical bases as RMB (Repeat Marker Base, weak or strong). This switch is also needed when MIRA is run in EST mode to identify possible inter-, intra- and intra-and-interorganism SNPs.
on|yes|1, off|no|0
]
Default is no. Only takes effect when [-CO:mr] (see above) is set to yes. If set to yes, MIRA will not use the repeat resolving algorithm during build time (and therefore will not be able to take advantage of this), but only before saving results to disk.
This switch is useful in some (rare) cases of mapping assembly.
on|yes|1, off|no|0
]
Default is no. Only takes effect when [-CO:mr] (see above) is set to yes; the effect also depends on whether strain data (see [-SB:lsd]) is present or not. Usually, mira will mark bases that differentiate between repeats when a conflict occurs between reads that belong to one strain. If the conflict occurs between reads belonging to different strains, the bases are marked as SNPs. However, if this switch is set to yes, conflicts within a strain are also marked as SNPs.
This switch is mainly used in assemblies of ESTs; it should not be set for genomic assembly.
integer ≥ 2
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] (see above) is set to yes. This defines the minimum number of reads in a group that is needed for the RMB (Repeat Marker Base) or SNP detection routines to be triggered. A group is defined by the reads carrying the same nucleotide at a given position, i.e., an assembly with mrpg=2 will need at least two times two reads with the same nucleotide (having at least the quality defined in [-CO:mgqrt]) for the position to be recognised as a repeat marker or a SNP. Setting this to a low number increases sensitivity, but might produce a few false positives, resulting in reads being thrown out of contigs because of falsely identified repeat markers (or bases wrongly recognised as SNPs).
integer ≥ 10
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. This defines the minimum quality that the neighbouring bases of a base must have for that base to be taken into consideration when deciding whether column base mismatches are relevant or not.
integer ≥ 25
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. This defines the minimum quality of a group of bases for it to be taken into account as a potential repeat marker. The lower the number, the more sensitive you get, but lowering below 25 is not recommended as many wrongly called bases can have a quality approaching this value and you'd end up with a lot of false positives. The higher the overall coverage of your project, the better, and the higher you can set this number. A value of 35 will probably remove most false positives, a value of 40 will probably never show false positives ... but will generate a sizable number of false negatives.
integer ≥ 0
]
Default is dependent on the sequencing technology used. Only takes effect when [-CO:mr] is set to yes. Using the ends of sequences from Sanger type shotgun sequencing is always a bit risky, as wrongly called bases tend to crowd there and some sequencing vector relics hang around. It is even more risky to use these stretches for detecting possible repeats, so one can define an exclusion area in which bases are not used when determining whether a mismatch is due to repeats or not.
on|yes|1, off|no|0
]
Default is yes. When [-CL:pec] is set, the end-read exclusion area can be considerably reduced. Setting this parameter will automatically do this.
Note | |
---|---|
Although the parameter is named "set to 1", the exclusion area may actually be a bit larger (2 to 4), depending on what users report back as the "best" option. |
on|yes|1, off|no|0
]
Default is dependent on the sequencing technology used. Determines whether columns containing gap bases (indels) are also tagged.
Note: it is strongly recommended to not set this to 'yes' for 454 type data.
on|yes|1, off|no|0
]
Default is yes. Takes effect only when [-CO:amgb] is set to yes. Determines whether multiple columns containing gap bases (indels) are also tagged.
on|yes|1, off|no|0
]
Default is yes. Takes effect only when [-CO:amgb] is set to yes. Determines whether, for tagging columns containing gap bases, both strands need to have a gap. Setting this to no is not recommended except when working in desperately low coverage situations.
on|yes|1, off|no|0
]
Default is no for all sequencing types. If set to yes, mira will be forced to choose a consensus base (A, C, G, T or gap) even in unclear cases where it would normally put an IUPAC base. All other things being equal (like the quality of the possible consensus bases), mira will choose a base by looking for a majority vote or, if that also is not clear, by preferring gaps over T over G over C over, finally, A.
mira makes a considerable effort to deduce the right base at each position of an assembly. Only when cases begin to be borderline will it use an IUPAC code to make you aware of potential problems. It is suggested to leave this option at no, as IUPAC bases in the consensus are a sign that - if you need 100% reliability - you really should have a look at this particular place to resolve potential problems. You might want to set this parameter to yes in the following cases: 1) when the tools that use the assembly result cannot handle IUPAC bases and you don't care about being absolutely perfect in your data (by looking over it manually); 2) when you assemble data without any quality values (which you should not do anyway) - this method will then allow you to get a result without IUPAC bases that is "good enough" given that you did not have quality values.
Important note: in case you are working with a hybrid assembly, mira will still use IUPAC bases at places where reads from different sequencing types contradict each other. In fact, when not forcing non-IUPAC bases for hybrid assemblies, the overall consensus will be better and probably have fewer IUPAC bases, as mira can make better use of the available information.
on|yes|1, off|no|0
]
Default is yes for all Solexa data when in a mapping assembly, else it is no. Can only be used in mapping assemblies. If set to yes, mira will merge all perfectly mapping Solexa reads into longer reads while keeping quality and coverage information intact.
This feature hugely reduces the number of Solexa reads and makes assembly results with Solexa data small enough to be handled by current finishing programs (gap4, consed, others) on normal workstations.
General options for controlling the integrated automatic editors. The editors generally do a good job of cleaning up alignments from typical sequencing errors (like base overcalls etc.). However, they may prove tricky in certain situations:
in EST assemblies, they may edit rare transcripts toward almost identical, more abundant transcripts. Usage must be carefully weighed.
the editors will not only change bases, but will also sometimes delete or insert non-gap bases as needed to improve an alignment when facts (trace signals or other) show that this is what the sequence should have been. However, this can make post-processing of assembly results pretty difficult with some formats like ACE, where the format itself has no way to specify certain edits like deletions. There's nothing one can do about it; the only way to get around this problem is to use file formats with more complete specifications like CAF or MAF (and BAF once supported by MIRA).
The following edit parameters are supported:
on|yes|1, off|no|0
]
Default is no. Once contigs have been built, mira can call a built-in version of the automatic contig editors. For Sanger reads this is EdIt, for 454 reads it is a specially crafted editor that knows about deficiencies of the 454 technology (homopolymers).
EdIt will try to resolve discrepancies in the contig by performing trace analysis and correct even hard to resolve errors. This option is always useful, but especially in conjunction with [-AS:nop] and [-DP:ure] (see above).
Notice 1: the current development version has a memory leak in the editor, therefore the option is not automatically turned on.
Notice 2: it is strongly suggested to turn this option on for 454 data as this greatly improves the quality.
on|yes|1, off|no|0
]
Default is yes. Only for Sanger data. If set to yes, the automatic editor will not take error hypotheses with a low probability into account, even if all the requirements to make an edit are fulfilled.
integer, 0 < x ≤ 100
]
Default is 50. Only for Sanger data. The higher this value, the more strict the automatic editor will apply its internal rule set. Going below 40 is not recommended.
Options which would not fit elsewhere.
on|yes|1, off|no|0
]
Default is yes. MIRA will check whether the log directory is running on an NFS mount. If it is and [-MI:sonfs] is active, MIRA will stop with a warning message.
Warning | |
---|---|
You should never ever run MIRA on an NFS mounted directory ... or face the fact that the assembly process may very well take 10 times longer (or more) than normal. You have been warned. |
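As a quick safeguard, you can check the filesystem type of the working directory before launching an assembly. This is a sketch using GNU `df`; the `-T` option is a GNU extension and not available on all platforms:

```shell
# Print the filesystem type of the current working directory; if this
# reports "nfs", put MIRA's log directory somewhere local instead.
df -PT . | awk 'NR==2 {print $2}'
```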
integer > 0
]
Default is 500. This parameter has absolutely no influence whatsoever on the assembly process of MIRA, but is used in the reporting within the *_assembly_info.txt file after the assembly, where MIRA reports statistics on large contigs and on all contigs. [-MI:lcs] is the threshold value for categorising contigs.
integer > 0
]
Default is 5000 for [--job=genome] and 1000 for [--job=est].
This parameter is used for internal statistics calculations and has a subtle influence when in [--job=genome] assembly mode.
MIRA uses the coverage information of an assembly project to find out about potentially repetitive areas in reads (and thus, a genome). To calculate statistics which reflect the approximate truth, the value of [-MI:lcs4s] is used as a cutoff threshold: contigs smaller than this value do not contribute to the calculation of the average coverage while contigs larger than or equal to this value do. This cutoff discards small contigs which tend to muddy the picture of the average coverage of a project.
If in doubt, don't touch this parameter.
General options for controlling where to find or where to write data.
<directoryname>
]
Default is an empty string. When set to a non-empty string, MIRA will create the log directory at the given location instead of using the current working directory.
This option is particularly useful for systems which have solid state disks (SSDs) which can be used for temporary files. Or in projects where the input and output files reside on a NFS mounted directory (current working dir), to put the log directory somewhere outside the NFS (see also: Things you should not do).
In both cases above, and for larger projects, MIRA then runs a lot faster.
<directoryname>
]
Default is gap4da. Defines the extension of the directory where mira will write the result of an assembly ready to import into the Staden package (GAP4) in Direct Assembly format. The name of the directory will then be <projectname>_.<extension>
<directoryname>
]
Default is . (the current working directory). Defines the directory where mira should search for experiment files (EXP).
<directoryname>
]
Default is . (the current working directory). Defines the directory where mira should search for SCF files.
The file options allow you to define your own input and output files.
string
]
Default is <projectname>_in.<seqtype>.fasta. Defines the fasta file to load sequences of a project from.
string
]
Default is <projectname>_in.<seqtype>.fasta.qual. Defines the file containing base qualities. The order of reads in the quality file does not need to be the same as in the FASTA or fofn file (although it saves a bit of time if it is).
string
]
Default is <projectname>_in.<seqtype>.fastq. Defines the fastq file to load sequences of a project from.
string
]
Default is <projectname>_in.<seqtype>.caf. Defines the file to load a CAF project from. Filename must end with '.caf'.
string
]
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the EXP files of a project are located.
string
]
Default is <projectname>_in.fofn. Defines the file of filenames where the names of the PHD files of a project are located. Note: this is currently not available.
string
]
Default is <projectname>_in.phd. Defines the file of where all the sequences of a project are in PHD format.
string
]
Default is <projectname>_straindata_in.txt. Defines the file to load strain data from.
string
]
Default is <projectname>_xmltraceinfo_in.<seqtype>.xml. Defines the file to load a trace info file in XML format from. This can be used both when merging XML data to loaded files or when loading a project from an XML trace info file.
string
]
Default is <projectname>_ssaha2vectorscreen_in.txt. Defines the file to load the info about possible vector sequence stretches from.
string
]
Default is <projectname>_smaltvectorscreen_in.txt. Defines the file to load the info about possible vector sequence stretches from.
string
]
Default is <projectname>_in.<seqtype>.<filetype>. Defines the file to load a backbone from. Note that you still must define the file type with [-SB:bft].
Options for controlling which results to write to which type of files. Additionally, a few options allow output customisation of textual alignments (in text and HTML files).
There are 3 types of results: results, temporary results and extra temporary results. One probably only needs the results. Temporary and extra temporary results are written while building different stages of a contig and are provided as a convenience for trying to find out why mira set some RMBs or disassembled some contigs.
Output can be generated in these formats: CAF, Gap4 Directed Assembly, FASTA, ACE, TCS, WIG, HTML and simple text.
Naming conventions of the files follow the rules described in section Input / Output, subsection Filenames.
on|yes|1,off|no|0
]
Default is no. Controls whether 'unimportant' singlets are written to the result files.
Note | |
---|---|
Note that a value larger than 1 of the [-AS:mrpc] parameter will disable the function of this parameter. |
on|yes|1,off|no|0
]
Default is yes. Controls whether singlets which have certain tags (see below) are written to the result files, even if [-OUT:sssip] (see above) is set.
If one of the (SRMr, CRMr, WRMr, SROr, SAOr, SIOr) tags appears in a singlet, MIRA saw that the singlet had been part of a larger alignment in earlier passes and was even part of a potentially 'important' decision. To give human finishers the possibility to trace back that decision, these singlets can be written to the result files.
Note | |
---|---|
Note that a value larger than 1 of the [-AS:mrpc] parameter will disable the function of this parameter. |
on|yes|1, off|no|0
]
Default is yes. Removes log files once they are not needed anymore during the assembly process.
on|yes|1, off|no|0
]
Default is no. Removes the complete log directory at the end of the assembly process. Some logs contain useful information that you may want to analyse though.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes for projects with only Sanger reads, 'no' as soon as there are 454, Solexa or SOLiD reads involved.
Note | |
---|---|
MIRA will automatically switch this to no (and cannot be forced to 'yes') when 454 or Solexa reads are present in the project, as this ensures that the file system does not get flooded with millions of files. |
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is yes.
Note | |
---|---|
The ACE output of MIRA conforms to the file specification given in the consed documentation. However, due to a bug in consed, consed cannot correctly load tags set by MIRA. There is a workaround: the MIRA distribution comes with a small Tcl script fixACE4consed.tcl; filter the ACE file through it
and then load the resulting outfile into consed. |
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is yes.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
on|yes|1, off|no|0
]
Default is no.
integer > 0
]
Default is 60. When producing an output in text format ( [-OUT:ort|ott|oett]), this parameter defines how many bases each line of an alignment should contain.
integer > 0
]
Default is 60. When producing an output in HTML format, ( [-OUT:orh|oth|oeth]), this parameter defines how many bases each line of an alignment should contain.
<single character>
]
Default is (a blank). When producing an output in text format ( [-OUT:ort|ott|oett]), endgaps are filled up with this character.
<single character>
]
Default is (a blank). When producing an output in HTML format ( [-OUT:orh|oth|oeth]), end-gaps are filled up with this character.
Since version 3.0.0, mira puts all files and directories it generates into one sub-directory named projectname_assembly. This directory contains up to four sub-directories:
projectname_d_results
: this directory contains all the output files of the assembly in different formats.
projectname_d_info
: this directory contains information files of the final assembly. They provide statistics as well as, e.g., information (easily parseable by scripts) on which read is found in which contig etc.
projectname_d_log
: this directory contains log files and temporary assembly files. It can be safely removed after an assembly as there may easily be a few GB of data in there that are normally not needed anymore. In case of problems: please do not delete it, as I will get in touch with you for additional information that might be present in the log directory.
projectname_d_chkpt
: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped.
Note | |
---|---|
The checkpointing functionality has not been completely implemented yet and currently cannot be used. |
The input files must be placed (or linked to) in the directory from which mira is called.
projectname_in.fofn
File of filenames containing the names of the experiment or phd files to assemble when the [-LR:ft=FOFNEXP] option is used. One filename per line, blank lines accepted, lines starting with a hash (#) are treated as comment lines, nothing else. Use [-FN:fofnin] to change the default name.
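The rules above can be illustrated with a small hypothetical fofn (the read names are invented for the example):

```shell
# Create a mock file of filenames following the format described above:
cat > demo_in.fofn <<'EOF'
# lines starting with a hash are comments
read0001.exp
read0002.exp

read0003.exp
EOF
# Count the effective entries, skipping comment and blank lines:
grep -cv -e '^#' -e '^$' demo_in.fofn   # prints 3
```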
projectname_in.phd
File containing the sequences (and their qualities) to assemble in PHD format.
projectname_in.fasta
File containing sequences and ...
projectname_in.fasta.qual
... file containing quality values of sequences for the assembly in FASTA format.
projectname_in.fastq
FASTQ file containing sequences and qualities. MIRA automatically recognises Sanger FASTQ format (base quality offset = 33) and newer Illumina FASTQ format (base quality offset = 64). Old Illumina FASTQ format with negative base qualities (base offset < 64) is not supported anymore.
projectname_in.caf
File containing the sequences (and their qualities) to assemble in CAF format. This format also may contain the result of an assembly (the contig consensus sequences).
These result output files and sub-directories are placed in the projectname_results directory after a run of mira.
projectname_out.<type>
Assembled project written in type = (gap4da / caf / ace / fasta / html / tcs / wig / text) format by mira, final result.
Type gap4da is a directory containing experiment files and a file of filenames (called 'fofn'); all other types are files. gap4da, caf and ace contain the complete assembly information suitable for import into different post-processing tools (gap4, consed and others). html and text contain visual representations of the assembly suited for viewing in browsers or as simple text files. tcs is a summary of a contig suited for "quick" analyses from command-line tools or even visual inspection. wig is a file containing coverage information (useful for mapping assemblies) which can be loaded and shown by different genome browsers (IGB, GMOD, UCSC and probably many more).
fasta contains the contig consensus sequences (and .fasta.qual the consensus qualities). Please note that they come in two flavours: padded and unpadded. The padded versions may contain stars (*) denoting gap base positions where there was some minor evidence for additional bases, but not strong enough to be considered a real base. Unpadded versions have these gaps removed. Padded versions have an additional postfix .padded, while unpadded versions do not have a special postfix.
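As a rough illustration of the relationship between the two flavours, removing the pad characters from the sequence lines of the padded file yields the unpadded sequence. The file content below is mock data; a real conversion would also have to adjust the quality file accordingly:

```shell
# Mock padded consensus; '*' denotes a gap (pad) position:
cat > demo_out.padded.fasta <<'EOF'
>demo_c1
ACGT*ACG*T
EOF
# Strip '*' pads from sequence lines only, leaving header lines untouched:
sed '/^>/!s/\*//g' demo_out.padded.fasta
```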
These information files are placed in the projectname_info directory after a run of mira.
projectname_info_assembly.txt
This file contains basic information about the assembly. MIRA will split the information in two parts: information about large contigs and information about all contigs.
For more information on how to interpret this file, please consult the chapter on "Results" of the MIRA documentation manual.
Note | |
---|---|
In contrast to other information files, this file appears always in the "info" directory, even when just intermediate results are reported. |
projectname_info_contigreadlist.txt
This file contains information on which reads have been assembled into which contigs (or singlets).
projectname_info_contigstats.txt
This file contains statistics about the contigs themselves, their length, average consensus quality, number of reads, maximum and average coverage, average read length, number of A, C, G, T, N, X and gaps in consensus.
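Since the file is plain tab-delimited text, standard tools can rank contigs by any of these statistics. The sketch below assumes, for the sake of the example, that length is the second column; check the header of your actual file for the real column layout:

```shell
# Mock statistics file: contig name and length (all other columns omitted):
printf 'demo_c1\t1200\ndemo_c2\t54000\ndemo_c3\t800\n' > demo_info_contigstats.txt
# Name of the longest contig (numeric sort on column 2, descending):
sort -k2,2nr demo_info_contigstats.txt | head -n 1 | cut -f1   # prints demo_c2
```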
projectname_info_consensustaglist.txt
This file contains information about the tags (and their position) that are present in the consensus of a contig.
projectname_info_readrepeats.lst
Tab delimited file with three columns: read name, repeat level tag, sequence.
This file permits a quick analysis of the repetitiveness of different parts of reads in a project. See [-SK:rliif] to control from which repetitive level on subsequences of reads are written to this file.
Note | |
---|---|
Reads can have more than one entry in this file. E.g., with standard settings (-SK:rliif=6), if the start of a read is covered by MNRr, followed by a HAF3 region, and finally the read ends with HAF6, then there will be two lines in the file: one for the subsequence covered by MNRr, one for HAF6. |
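Because the file is tab-delimited, extracting e.g. all reads carrying a HAF6 subsequence is a one-liner. The data below is mock content with invented read names:

```shell
# Mock read-repeats file: read name, repeat level tag, sequence:
printf 'read1\tMNRr\tACGTACGT\nread1\tHAF6\tTTTTGGGG\nread2\tHAF5\tCCCCAAAA\n' > demo_info_readrepeats.lst
# Names of reads whose subsequences were tagged HAF6:
awk -F'\t' '$2 == "HAF6" {print $1}' demo_info_readrepeats.lst   # prints read1
```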
projectname_info_readstooshort
A list containing the names of those reads that have been sorted out of the assembly before any processing started only due to the fact that they were too short.
projectname_info_readtaglist.txt
This file contains information about the tags and their positions that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the assembly).
projectname_error_reads_invalid
A list of sequences that have been found to be invalid due to various reasons (given in the output of the assembler).
MIRA can write almost all of the following formats and can read most of them.
EXP
Standard experiment files used in genome sequencing. Correct EXP files are expected. Especially the ID record (containing the id of the reading) and the LN record (containing the name of the corresponding trace file) should be correctly set. See http://www.sourceforge.net/projects/staden/ for links to online format description.
SCF
The Staden trace file format that has established itself as compact standard replacement for the much bigger ABI files. See http://www.sourceforge.net/projects/staden/ for links to online format description.
The SCF files should be V2-8bit, V2-16bit, V3-8bit or V3-16bit and can be packed with compress or gzip.
CAF
Common Assembly Format (CAF) developed by the Sanger Centre. http://www.sanger.ac.uk/resources/software/caf.html provides a description of the format and some software documentation as well as the source for compiling caf2gap and gap2caf (thanks to Rob Davies for this).
ACE
The assembly file format used mainly by phrap and consed. Support for .ace output is currently only in test status in mira as documentation on that format is ... sparse, and I currently don't have access to consed to verify my assumptions.
Using consed, you will need to load projects with -nophd to view them. Tags (in reads and consensus) are fully supported. The only hitch: consed has a bug which prevents it from reading consensus tags which are located throughout the whole file (as MIRA writes them per default). The solution is easy: filter the ACE file through the fixACE4consed.tcl script which is provided in the MIRA distributions, then all should be well.
If you don't have consed, you might want to try clview (http://www.tigr.org/tdb/tgi/software/) from TIGR to look at .ace files.
MAF
MIRA Assembly Format (MAF). A faster and more compact form than EXP, CAF or ACE. See documentation in separate file.
HTML
Hypertext Markup Language. Projects written in HTML format can be viewed directly with any table capable browser. Display is even better if the browser knows style sheets (CSS).
FASTA
A simple format for sequence data, see http://www.ncbi.nlm.nih.gov/BLAST/fasta.html. An often used extension of that format is used to also store quality values in a similar fashion, these files have a .fasta.qual ending.
Mira writes two kinds of FASTA files for results: padded and unpadded. The difference is that the padded version still contains the gap (pad) character (an asterisk) at positions in the consensus where some of the reads apparently had some more bases than others but where the consensus routines decided to treat them as artifacts. The unpadded version has the gaps removed.
PHD
This file type originates from the phred base caller and basically contains -- along with some other status information -- the base sequence, the base quality values and the peak indices, but not the sequence traces themselves.
GBF, GBK
GenBank file format as used at the NCBI to describe sequences. mira is able to read this format for using sequences as backbones in an assembly. Features of the GenBank format are also transferred automatically to Staden compatible tags.
traceinfo.XML
XML based file with information relating to traces. Used at the NCBI and ENSEMBL trace archives to store additional information (like clippings, insert sizes etc.) for projects. See further down for a description of the fields used and http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc for a full description of all fields.
TCS
Transpose Contig Summary. A text file as written by mira which gives a summary of a contig in tabular fashion, one line per base. Nicely suited for "quick" analyses from command line tools, scripts, or even visual inspection in file viewers or spreadsheet programs.
In the current file version (TCS 1.0), each column is separated by at least one space from the next. Vertical bars are inserted as visual delimiter to help inspection by eye. The following columns are written into the file:
contig name (width 20)
padded position in contigs (width 3)
unpadded position in contigs (width 3)
separator (a vertical bar)
called consensus base
quality of called consensus base (0-100), but MIRA itself caps at 90.
separator (a vertical bar)
total coverage in number of reads. This number can be higher than the sum of the next five columns if Ns or IUPAC bases are present in the sequence of reads.
coverage of reads having an "A"
coverage of reads having an "C"
coverage of reads having an "G"
coverage of reads having an "T"
coverage of reads having an "*" (a gap)
separator (a vertical bar)
quality of "A" or "--" if none
quality of "C" or "--" if none
quality of "G" or "--" if none
quality of "T" or "--" if none
quality of "*" (gap) or "--" if none
separator (a vertical bar)
Status. This field sums up the evaluation of MIRA whether you should have a look at this base or not. The content can be one of the following:
everything OK: a colon (:)
unclear base calling (IUPAC base): a "!M"
potentially problematic base calling involving a gap or low quality: a "!m"
consensus tag(s) of MIRA that hint to problems: a "!$". Currently, the following tags will lead to this marker: SRMc, WRMc, DGPc, UNSc, IUPc.
list of consensus tags at that position; tags are delimited by a space. E.g.: "DGPc H454"
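Since every status other than "everything OK" starts with an exclamation mark, a rough triage of a TCS file is possible with grep. The two mock lines below only sketch the column layout described above; real files may differ in exact spacing:

```shell
# Mock TCS lines: one clean position, one flagged with an IUPAC call (!M):
cat > demo.tcs <<'EOF'
demo_c1               101  99 | A 90 | 12 12  0  0  0  0 | 90 -- -- -- -- | :
demo_c1               102 100 | N 30 | 12  5  7  0  0  0 | 40 38 -- -- -- | !M
EOF
# Show only the positions MIRA flagged for a closer look:
grep '!' demo.tcs
```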
The current stage of the assembly is written to STDOUT, giving status messages on what mira is doing. Dumping to STDERR is hardly used anymore by MIRA; remnants will disappear over time.
Some debugging information might also be written to STDOUT if mira generates error messages.
On errors, MIRA will dump these also to STDOUT. Basically, three error classes exist:
WARNING: Messages in this error class do not stop the assembly but are meant as information for the user. In some rare cases these errors are due to (an always possible) error in the I/O routines of mira, but nowadays they are mostly due to unexpected (read: wrong) input data and can be traced back to errors in the preprocessing stages. If these errors arise, you definitely DO want to check how and why they came into those files in the first place.
Frequent causes of warnings include missing SCF files, SCF files containing known quirks, EXP files containing known quirks etc.
FATAL: Messages in this error class actually stop the assembly. These are mostly due to missing files that mira needs or to very garbled (wrong) input data.
Frequent causes include naming an experiment file in the 'file of filenames' that could not be found on the disk, same experiment file twice in the project, suspected errors in the EXP files, etc.
INTERNAL: These are true programming errors that were caught by internal checks. Should this happen, please mail the output of STDOUT and STDERR to the author.
MIRA extracts the following data from the TRACEINFO files:
trace_name (required)
trace_file (recommended)
trace_type_code (recommended)
trace_end (recommended)
clip_quality_left (recommended)
clip_quality_right (recommended)
clip_vector_left (recommended)
clip_vector_right (recommended)
strain (recommended)
template_id (recommended for paired end)
insert_size (recommended for paired end)
insert_stdev (recommended for paired end)
machine_type (optional)
program_id (optional)
Other data types are also read, but the info is not used.
Here's the example for a TRACEINFO file with ancillary info:
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>GCJAA15TF</trace_name>
    <program_id>PHRED (0.990722.G) AND TTUNER (1.1)</program_id>
    <template_id>GCJAA15</template_id>
    <trace_direction>FORWARD</trace_direction>
    <trace_end>F</trace_end>
    <clip_quality_left>3</clip_quality_left>
    <clip_quality_right>622</clip_quality_right>
    <clip_vector_left>1</clip_vector_left>
    <clip_vector_right>944</clip_vector_right>
    <insert_stdev>600</insert_stdev>
    <insert_size>2000</insert_size>
  </trace>
  <trace>
    ...
  </trace>
  ...
</trace_volume>
See http://www.ncbi.nlm.nih.gov/Traces/trace.cgi?cmd=show&f=rfc&m=main&s=rfc for a full description of all fields and more info on the TRACEINFO XML format.
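If you want to inspect or sanity-check a TRACEINFO file before handing it to MIRA, a few lines of standard-library Python suffice. This is only a sketch: it extracts exactly the fields listed above and ignores everything else.

```python
# Minimal sketch: pull the fields MIRA uses out of a TRACEINFO XML volume.
# Uses only the Python standard library; the field list follows the one above.
import xml.etree.ElementTree as ET

WANTED = ["trace_name", "trace_file", "trace_type_code", "trace_end",
          "clip_quality_left", "clip_quality_right",
          "clip_vector_left", "clip_vector_right",
          "strain", "template_id", "insert_size", "insert_stdev",
          "machine_type", "program_id"]

def extract_traces(xml_text):
    """Return one dict per <trace> element, keeping only fields MIRA reads."""
    root = ET.fromstring(xml_text)
    traces = []
    for trace in root.iter("trace"):
        info = {}
        for field in WANTED:
            elem = trace.find(field)
            if elem is not None and elem.text:
                info[field] = elem.text.strip()
        traces.append(info)
    return traces
```

Usage would simply be `extract_traces(open("traceinfo.xml").read())`, after which you can check, e.g., that every trace has a `trace_name` and sensible clip values.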
MIRA names contigs the following way: <projectname>_<contigtype><number>. While <projectname> is dictated by the [--project=] parameter and <number> should be clear, the <contigtype> might need additional explanation. There are currently three contig types:
_c: these are "normal" contigs
_rep_c: these are contigs containing only repetitive areas. These contigs had _lrc as type in previous versions of MIRA; this was changed to _rep_c to make things clearer.
_s: these are singlet-contigs. Technically: "contigs" with a single read.
Basically, for genome assemblies MIRA starts to build contigs in areas which seem "rock solid", i.e., not a repetitive region (main decision point) and nice coverage of good reads. Contigs which started like this get a _c name. If during the assembly MIRA reaches a point where it cannot start building a contig in a non-repetitive region, it will name the contig _rep_c instead of _c.
Note | |
---|---|
Although the distinction between _c and _rep_c makes sense only for genome assemblies, EST assemblies also use it (for no better reason than me not having an alternative or better naming scheme there). |
Note | |
---|---|
Depending on the settings of [-AS:mrpc], your project may or may not contain _s singlet-contigs. Also note that reads landing in the debris file will not get assigned to singlet-contigs and hence not get _s names. |
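For downstream scripts it can be handy to take contig names apart again. The following is a minimal sketch of the naming scheme described above; the regular expression and the helper name are illustrative, not part of MIRA.

```python
# Minimal sketch: split a MIRA contig name into project name, contig type and
# number, following the <projectname>_<contigtype><number> scheme.
# _c = normal contig, _rep_c = repetitive contig, _s = singlet-contig.
import re

NAME_RE = re.compile(r"^(?P<project>.+?)_(?P<type>rep_c|c|s)(?P<number>\d+)$")

def parse_contig_name(name):
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError("not a MIRA-style contig name: " + name)
    return m.group("project"), m.group("type"), int(m.group("number"))
```

The non-greedy `.+?` makes the regex prefer the rightmost suffix as the contig type, so project names containing underscores (e.g. `my_project_c1`) are still split correctly.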
In case you used strain information in an assembly, you can recover the consensus for any given strain by using convert_project to convert from a full assembly format which carries strain information (e.g. MAF or CAF) to FASTA. MIRA will automatically detect the strain information and create one FASTA file per strain encountered.
Note | |
---|---|
To be able to distinguish between consensus bases with an 'N' call and areas of a strain which were not covered at all by any read of that strain, MIRA introduces the '@' sign as additional "base". That is, if you see a '@' in the consensus of a given strain, this may be either due to too low coverage -- and therefore a hole -- or to a genuine deletion in your strain. |
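A small helper can locate those '@' stretches in a per-strain consensus sequence, e.g. to generate a list of regions to inspect. This is only a sketch; deciding whether a stretch is a coverage hole or a genuine deletion still requires looking at the assembly itself.

```python
# Minimal sketch: locate the '@' stretches in a per-strain consensus sequence.
# Each stretch is either a coverage hole of that strain or a genuine deletion.
def at_sign_stretches(consensus):
    """Return (start, end) pairs (0-based, end exclusive) of '@' runs."""
    stretches = []
    start = None
    for i, base in enumerate(consensus):
        if base == "@":
            if start is None:
                start = i
        elif start is not None:
            stretches.append((start, i))
            start = None
    if start is not None:
        stretches.append((start, len(consensus)))
    return stretches
```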
MIRA uses and sets a couple of tags during the assembly process. That is, if information is known before the assembly, it can be stored in tags (in the EXP and CAF formats) and will be used in the assembly.
This section lists "foreign" tags, i.e., tags whose definition was made by software packages other than MIRA.
ALUS, REPT: Sequence stretches tagged as ALUS (ALU Sequence) or REPT (general repetitive sequence) will be handled with extreme care during the assembly process. The allowed error rate after automatic contig editing within these stretches is normally far below the general allowed error rate, leading to much higher stringency during the assembly process and subsequently to a better repeat resolving in many cases.
FpAS: GenBank feature for a poly-A signal. Used in EST, cDNA or transcript assembly. Either read in from the input files or set when using [-CL:cpat]. This allows keeping the poly-A signal in the reads during assembly without them interfering as massive repeats or as mismatches.
FCDS, Fgen: GenBank features as described in GBF/GBK files or set in the Staden package are used to make some SNP impact analysis on genes.
other: All other tags in reads will be read and passed through the assembly without being changed; they currently do not influence the assembly process.
This section lists tags which MIRA sets (and reads of course), but that other software packages might not know about.
UNSr, UNSc: UNSure in Read respectively Contig. These tags denote positions in an assembly with conflicts that could not be resolved automatically by mira. These positions should be looked at during the finishing process.
For assemblies using good sequences and enough coverage, roughly 0.01% of the consensus positions have such a tag (e.g. ~300 UNSc tags for a genome of 3 megabases).
SRM, WRM: Strong Repeat Marker and Weak Repeat Marker. These tags come in two flavours: as SRMr and WRMr when set in reads, and as SRMc and WRMc when set in the consensus. These tags are used on an individual per base basis for each read. They denote bases that have been identified as crucial for resolving repeats, often denoting a single SNP within several hundreds or thousands of bases. While a SRM is quite certain, the WRM really is either weak (there wasn't enough comforting information in the vicinity to be really sure) or involves gap columns (which is always a bit tricky).
mira will automatically set these tags when it encounters repeats and will tag exactly those bases that can be used to discern the differences.
Seeing such a tag in the consensus means that mira was not able to finish the disentanglement of that special repeat stretch or that it found a new one in one of the last passes without having the opportunity to resolve the problem.
DGPc: Dubious Gap Position in Consensus. Set whenever the gap to base ratio in a column of 454 reads is between 40% and 60%.
SAO, SRO, SIO: SNP intrA Organism, SNP inteR Organism, SNP Intra- and inter-Organism. As for SRM and WRM, these tags have an r appended when set in reads and a c appended when set in the consensus. These tags denote SNP positions.
mira will automatically set these tags when it encounters SNPs and will tag exactly those bases that can be used to discern the differences. They denote SNPs as they occur within an organism (SAO), between two or more organisms (SRO) or within and between organisms (SIO).
Seeing such a tag in the consensus means that mira set this as a valid SNP in the assembly pass. Seeing such tags only in reads (but not in the consensus) shows that in a previous pass, mira thought these bases to be SNPs but that in later passes, this SNP does not appear anymore (perhaps due to resolved misassemblies).
STMS: (only hybrid assemblies). The Sequencing Type Mismatch Solved tag is set at positions in the assembly where the consensus of reads from different sequencing technologies (Sanger, 454, Solexa, SOLiD) differs, but mira thinks it found the correct solution. Often this is due to low coverage of one of the types combined with a base calling error.
Sometimes this depicts real differences where possible explanation might include: slightly different bugs were sequenced or a mutation occurred during library preparation.
STMU: (only hybrid assemblies). The Sequencing Type Mismatch Unresolved tag is set at positions in the assembly where the consensus of reads from different sequencing technologies (Sanger, 454, Solexa, SOLiD) differs, but mira could not find a good resolution. Often this is due to low coverage of one of the types combined with a base calling error.
Sometimes this depicts real differences where possible explanation might include: slightly different bugs were sequenced or a mutation occurred during library preparation.
MCVc: Missing CoVerage in Consensus. Set in assemblies with more than one strain. If a strain has no coverage at a certain position, the consensus gets tagged with this tag (and the name of the strain which misses this position is put in the comment). Additionally, the sequence in the result files for this strain will have an @ character at this position.
MNRr: (only with [-SK:mnr] active). The Masked Nasty Repeat tags are set over those parts of a read that have been detected as being many more times present than the average sub-sequence. mira will hide these parts during the initial all-against-all overlap finding routine (SKIM3) but will otherwise happily use these sequences for consensus generation during contig building.
FpAS: See "Tags read (and used)" above.
ED_C, ED_I, ED_D: EDit Change, EDit Insertion, EDit Deletion. These tags are set by the integrated automatic editor EdIt and show which edit actions have been performed.
HAF2, HAF3, HAF4, HAF5, HAF6, HAF7. These are HAsh Frequency tags which show the status of read parts in comparison to the whole project. Only set if [-AS:ard] is active (default for genome assemblies).
More info on how to use the information conveyed by HAF tags can be found in the section dealing with repeats and HAF tags in finishing programs further down in this manual.
HAF2: coverage below average (standard setting: < 0.5 times average)
HAF3: coverage at average (standard setting: ≥ 0.5 and ≤ 1.5 times average)
HAF4: coverage above average (standard setting: > 1.5 and < 2 times average)
HAF5: probably repeat (standard setting: ≥ 2 and < 8 times average)
HAF6: 'heavy' repeat (standard setting: ≥ 8 and < 20 times average)
HAF7: 'crazy' repeat (standard setting: ≥ 20 times average)
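As a tiny worked example, the classification can be written down as a function of the ratio between observed k-mer frequency and average coverage. Note the exact band boundaries used below (0.5, 1.5, 2, 8 and 20 times average) are the standard settings as I read them from the list above; the upper bounds of the HAF5 and HAF6 bands are an assumption in so far as they are taken to be contiguous with the neighbouring classes.

```python
# Minimal sketch: map a k-mer frequency to the HAF class it would fall into,
# relative to the average coverage. Band boundaries are assumed contiguous.
def haf_class(frequency, average):
    ratio = frequency / average
    if ratio < 0.5:
        return "HAF2"   # coverage below average
    if ratio <= 1.5:
        return "HAF3"   # coverage at average
    if ratio < 2:
        return "HAF4"   # coverage above average
    if ratio < 8:
        return "HAF5"   # probably repeat
    if ratio < 20:
        return "HAF6"   # 'heavy' repeat
    return "HAF7"       # 'crazy' repeat
```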
At the start, things are simple: a read either aligns with other reads or it does not. Reads which align with other reads form contigs, and MIRA will save these in the results with a contig name containing _c.
However, not all reads can be placed in an assembly. This can have several reasons and these reads may end up at two different places in the result files: either in the debris file, then just as a name entry, or as singlet (a "contig" with just one read) in the regular results.
reads are too short and get filtered out (before or after the MIRA clipping stages). These invariably land in the debris file.
reads are real singlets: they contain genuine sequence but have no overlap with any other read. These get caught either by the [-CL:pec] clipping filter or during the SKIM phase.
reads contain mostly or completely junk.
reads contain chimeric sequence (therefore: they're also junk)
MIRA filters out these reads in different stages: before and after read clipping, during the SKIM stage, during the Smith-Waterman overlap checking stage or during contig building.
Where exactly a single read lands depends on why it does not align with other reads.
MIRA is able to find and tag SNPs in any kind of data -- be it genomic or EST -- in both de-novo and mapping assemblies ... provided it knows which read in an assembly is coming from which strain, cell line or organism.
The SNP detection routines are based on the same routines as the routines for detecting non-perfect repeats. In fact, MIRA can even distinguish between bases marking a misassembled repeat from bases marking a SNP within the same project.
All you need to do to enable this feature is to set [-CO:mr=yes] (which is standard in all --job=... incantations of mira and in some steps of miraSearchESTSNPs). Furthermore, you will need:
to provide a straindata file for the reads or have the strain information in ancillary NCBI TRACEINFO XML files.
to provide a straindata file for the reads and also give the reference sequence(s) (backbone(s)) a strain name via the [-SB:bsn] parameter.
The effect of using strain names attached to reads can be described briefly like this. Assume that you have 6 reads (called R1 to R6), three of them having an A at a given position, the other three a C:
R1 ......A......
R2 ......A......
R3 ......A......
R4 ......C......
R5 ......C......
R6 ......C......
Note | |
---|---|
This example is just that: an example. It uses just 6 reads, with two times three reads as read groups for demonstration purposes and without looking at qualities. For MIRA to recognise SNPs, a few things must come together (e.g. for many sequencing technologies it wants forward and backward reads when in de-novo assembly) and a couple of parameters can be set to adjust the sensitivity. Read more about the parameters: [-CO:mrpg:mnq:mgqrt:emea:amgb:amgbemc:amgbnbs] |
Now, assume you did not give any strain information. MIRA will most probably recognise a problem and, having no strain information, assume it made an error by assembling two different repeats of the same organism. It will tag the bases in the reads with repeat marker tags (SRMr) and the base in the consensus with a SROc tag (to point at an unresolved problem). In a subsequent pass, MIRA will then not assemble these six reads together again, but create two contigs like this:
Contig1:
R1 ......A......
R2 ......A......
R3 ......A......

Contig2:
R4 ......C......
R5 ......C......
R6 ......C......
The bases in the reads will keep their SRMr tags, but the consensus base of each contig will not get a SROc tag as there is no conflict anymore.
Now, assume you gave reads R1, R2 and R3 the strain information "human", and reads R4, R5 and R6 "chimpanzee". MIRA will then create this:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......C......
R5 (chi) ......C......
R6 (chi) ......C......
Instead of creating two contigs, it will again create one contig ... but it will tag the bases in the reads with a SROr tag and the position in the contig with a SROc tag. The SRO tags (SNP inteR Organism) tell you: there's a SNP between those two (or more) strains/organisms/whatever.
Changing the above example a little, assume you have this assembly early on during the MIRA process:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......
Because "chimp" has a SNP within itself (A versus C) and there's a SNP between "human" and "chimp" (also A versus C), MIRA will see a problem and set a tag, this time a SIOr tag: SNP Intra- and inter-Organism.
MIRA does not like conflicts occurring within an organism and will try to resolve these cleanly. After setting the SIOr tags, MIRA will, in subsequent passes, re-assemble this as:
Contig1:
R1 (hum) ......A......
R2 (hum) ......A......
R3 (hum) ......A......
R4 (chi) ......A......
R5 (chi) ......A......
R6 (chi) ......A......

Contig2:
R7 (chi) ......C......
R8 (chi) ......C......
R9 (chi) ......C......
The reads in Contig1 (hum+chi) and Contig2 (chi) will keep their SIOr tags, the consensus will have no SIOc tag as the "problem" was resolved.
When presented with conflicting information regarding SNPs and possible repeat markers or SNPs within an organism, MIRA will always first try to resolve the repeat marker. Assume the following situation:
R1 (hum) ......A...T......
R2 (hum) ......A...G......
R3 (hum) ......A...T......
R4 (chi) ......C...G......
R5 (chi) ......C...T......
R6 (chi) ......C...G......
While the first discrepancy column can be "explained away" by a SNP between organisms (it will get a SROr/SROc tag), the second column cannot and will get a SIOr/SIOc tag. After that, MIRA opts to get the SIO conflict resolved:
Contig1:
R1 (hum) ......A...T......
R3 (hum) ......A...T......
R5 (chi) ......C...T......

Contig2:
R2 (hum) ......A...G......
R4 (chi) ......C...G......
R6 (chi) ......C...G......
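The classification logic behind these examples can be expressed compactly. The sketch below is deliberately simplified (it ignores base qualities, read directions and gap columns, all of which MIRA does take into account) and classifies a single alignment column given the strain of each read:

```python
# Minimal sketch: classify one discrepancy column, following the SAO/SRO/SIO
# logic described above. Simplified: no qualities, directions or gap handling.
def classify_snp_column(bases_by_strain):
    """bases_by_strain: dict strain -> list of bases at this column.
    Returns None (no SNP), 'SAO', 'SRO' or 'SIO'."""
    # variation among reads of the same strain?
    within = any(len(set(bases)) > 1 for bases in bases_by_strain.values())
    # do different strains show different sets of bases?
    strain_sets = [frozenset(bases) for bases in bases_by_strain.values()]
    between = len(set(strain_sets)) > 1
    if within and between:
        return "SIO"   # SNP intra- and inter-organism
    if within:
        return "SAO"   # SNP intra-organism
    if between:
        return "SRO"   # SNP inter-organism
    return None
```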
The default parameters for MIRA assemblies work best when given real sequencing data; they even expect the data to behave like real sequencing data. But some assembly strategies work in multiple rounds, using so-called "artificial" or "synthetic" reads in later rounds, i.e., data which was not generated by sequencing machines but might be something like the consensus of previous assemblies.
If one doesn't take utter care to make these artificial reads at least behave a little bit like real sequencing data, a number of quality assurance algorithms of MIRA might spot that they "look funny" and trim back these artificial reads ... sometimes even removing them completely. The following list gives a short overview of what these synthetic reads should look like and which MIRA algorithms to switch off in certain cases:
Forward and reverse complement directions: most sequencing technologies and strategies yield a mixture of reads with both forward and reverse complement direction to the DNA sequenced. In fact, having both directions allows for a much better quality control of an alignment as sequencing technology dependent sequencing errors will often affect only one direction at a given place and not both (the exception being homopolymers and 454).
The MIRA proposed end clipping algorithm [-CL:pec] uses this knowledge to initially trim back ends of reads to an area without sequencing errors. However, if reads covering a given area of DNA are present in only one direction, then these reads will be completely eliminated.
If you use only artificial reads in an assembly, then switch off the proposed end clipping [-CL:pec=no].
If you mix artificial reads with "normal" reads, make sure that every part of an artificial read is covered by some other read in reverse complement direction (be it a normal or artificial read). The easiest way to do that is to add a reverse complement for every artificial read yourself, though if you use an overlapping strategy with artificial reads, you can calculate the overlaps and reverse complements of reads so that every second artificial read is in reverse complement to save time and memory afterwards during the computation.
Sequencing type/technology: MIRA currently knows Sanger, 454, Solexa and PacBio as sequencing technologies; every read entered in an assembly must be one of those.
Artificial reads should be classified depending on the data they were created from, that is, Sanger for a consensus of Sanger reads, 454 for a consensus of 454 reads etc. However, should reads created from Illumina consensus be much longer than, say, 200 or 300 bases, you should treat them as Sanger reads.
Quality values: be careful to assign decent quality values to your artificial reads as several quality clipping or consensus calling algorithms make extensive use of qualities. Pay attention to values of [-CL:qc:bsqc] as well as to [-CO:mrpg:mnq:mgqrt].
Read lengths: the internal maximum read length for MIRA is around ~30kb. However, to keep some safety margin, MIRA currently allows a maximum read length of only 20kb.
MIRA treats ploidy differences as repeats and will therefore build separate contigs for the reads of a ploidy that has a difference to the other ploidy/ploidies.
There is simply no other way to handle ploidy while retaining the ability to separate repeats based on differences of only a single base; everything else would be guesswork. I thought for some time about doing a coverage analysis around the potential repeat/ploidy site, but came to the conclusion that, due to the stochastic nature of sequencing data, this would very probably make wrong decisions in too many cases to be acceptable.
If someone has a good idea, I'll be happy to hear it.
Under the assumption that reads in a project are uniformly distributed across the genome, MIRA will enforce an average coverage and temporarily reject reads from a contig when this average coverage multiplied by a safety factor is reached at a given site. This strategy reduces overcompression of repeats during the contig building phase and keeps reads in reserve for other copies of that repeat.
It's generally a very useful tool to disentangle repeats, but it has a slight secondary effect: rejection of otherwise perfectly good reads. The assumed uniformity of the read distribution is the big problem here: of course it's not really valid. You sometimes have less, and sometimes more, than "the average" coverage. Furthermore, the newer sequencing technologies - 454 perhaps, but certainly the ones from Solexa - show that you also have a skew towards the site of replication origin.
Warning: Solexa data from late 2009 and 2010 shows a high GC content bias. This bias can reach 200 or 300%, i.e., the coverage of sequence parts with low GC content can deviate that strongly from the average.
One example: let's assume the average coverage of a project is 8 and by chance at one place there are 17 (non-repetitive) reads. Then the following happens:
(Note: $p$ is the parameter [-AS:urdsip])
Pass 1 to $p-1$: MIRA happily assembles everything together and calculates a number of different things, amongst them an average coverage of ~8. At the end of pass $p-1$, it will announce this average coverage as first estimate to the assembly process.
Pass $p$: MIRA has still assembled everything together, but at the end of each pass the contig self-checking algorithms now include an "average coverage check". They'll invariably find the 17 reads stacked and decide (looking at the [-AS:ardct] parameter which is assumed to be 2 for this example) that 17 is larger than 2*8 and that this very well may be a repeat. The reads get flagged as possible repeats.
Pass $p+1$ to end: the "possibly repetitive" reads get a much tougher treatment in MIRA. Amongst other things, when building a contig, the contig now checks that "possibly repetitive" reads do not stack higher than the average coverage multiplied by a safety value ([-AS:urdcm]), which we'll assume to be 1.5 in this example. So, at a certain point, say when read 14 or 15 of that possible repeat wants to be aligned to the contig at this given place, the contig will flatly refuse and tell the assembler to please find another place for it, be it in the contig currently being built or any other that follows. Of course, if the assembler cannot comply, reads 14 to 17 will end up as a contiglet (contig debris, if you want); if it was only one read that got rejected like this, it will end up as a singlet or in the debris file.
Tough luck. I do have ideas on how to reintegrate those reads at the end of an assembly, but I have deferred doing this as in every case I looked at, adding those reads to the contigs wouldn't have changed anything ... there's already enough coverage.
What should be done in those cases is simply to filter the contiglets (defined as being of small size and having an average coverage below the average coverage of the project divided by 3 (or 2.5)) out of the project.
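Such a filter is easy to script. The sketch below assumes you have already extracted (name, length, average coverage) triples for each contig, e.g. from MIRA's contig statistics output; the tuple layout and the 1000-base size cutoff are illustrative assumptions you should adapt to your project.

```python
# Minimal sketch: drop "contiglets" from a list of (name, length, avg_coverage)
# tuples, following the rule of thumb above: small contigs whose average
# coverage is below the project average divided by 3 (or 2.5).
def filter_contiglets(contigs, project_avg_cov, min_length=1000, divisor=3.0):
    kept = []
    for name, length, avg_cov in contigs:
        is_contiglet = length < min_length and avg_cov < project_avg_cov / divisor
        if not is_contiglet:
            kept.append((name, length, avg_cov))
    return kept
```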
Since version 2.9.36, MIRA has had a feature to keep long repeats in separate contigs ([-AS:klrs]). Due to algorithm changes, this behaviour is now standard (even if the command line parameter is still present). The effect is that contigs with non-repetitive sequence will stop at a 'long repeat' border, including only the first few bases of the repeat. Long repeats will be kept as separate contigs.
This has been implemented to get a clean overview on which parts of an assembly are 'safe' and which parts will be 'difficult'. For this, the naming of the contigs has been extended: contigs named with a '_c' at the end are contigs which contain mostly 'normal' coverage. Contigs with "rep_c" are contigs which contain mostly sequence classified as repetitive and which could not be assembled together with a 'c' contig.
The question remains: what are 'long' repeats? MIRA defines these as repeats that are not spanned by any read that has non-repetitive parts at its ends. So, basically, the mean length of the reads that go into the assembly defines the length of 'long' repeats that have to be kept in separate contigs.
It has to be noted that when using paired-end (or template) sequencing, 'long' repeats which can be spanned by read-pairs (or templates) are mostly integrated into 'normal' contigs as MIRA can correctly place them most of the time.
HAF tags (HAsh Frequency) are set by MIRA when the option to colour reads by hash frequency ([-GE:crhf], on by default in most --job combinations) is on. These tags show the status of k-mers (stretches of bases of a given length $k$) in read sequences: whether MIRA recognised them as being present in sub-average, average, above-average or repetitive numbers.
When using a finishing program which can display tags in reads (and using the proposed tag colour schemes for gap4 or consed), the assembly will light up in colours ranging from light green to dark red, indicating whether a certain part of the assembly is deemed non-repetitive to extremely repetitive.
One of the biggest advantages of the HAF tags is the implicit information they convey on why the assembler stopped building a contig at an end.
if the read parts composing a contig end are mostly covered with HAF2 tags (below average frequency, coloured light-green), then one very probably has a hole in the contig due to coverage problems which means there are no or not enough reads covering a part of the sequence.
if the read parts composing a contig end are mostly covered with HAF3 tags (average frequency, coloured green), then you have an unusual situation as this should only very rarely occur. The reason is that MIRA saw that there are enough sequences which look the same as the one from your contig end, but that these could not be joined. Likely reasons for this scenario include non-random sequencing artifacts (seen in 454 data) or also non-random chimeric reads (seen in Sanger and 454 data).
if the read parts composing a contig end are mostly covered with HAF4 tags (above average frequency, coloured yellow), then the assembler stopped in a grey zone where the coverage is not normal anymore, but not quite repetitive yet. This can happen in cases where the read coverage is very unevenly distributed across the project. The contig end in question might be a repeat occurring two times in the sequence, but having fewer reads than expected. Or it may be non-repetitive coverage with an unusual excess of reads.
if the read parts composing a contig end are mostly covered with HAF5 (repeat, coloured red), HAF6 (heavy repeat, coloured darker red) and HAF7 tags (crazy repeat, coloured very dark red), then there is a repetitive area in the sequence which could not be uniquely bridged by the reads present in the assembly.
This information can be especially helpful when joining reads by hand in a finishing program. The following list gives you a short guide to cases which are most likely to occur and what you should do.
the proposed join involves contig ends mostly covered by HAF2 tags. Joining these contigs is probably a safe bet. The assembly may have missed this join because of too many errors in the read ends or because sequence having been clipped away which could be useful to join contigs. Just check whether the join seems sensible, then join.
the proposed join involves contig ends mostly covered by HAF3 tags. Joining these contigs is probably a safe bet. The assembly may have missed this join because of several similar chimeric reads or reads with similar, severe sequencing errors covering the same spot. Just check whether the join seems sensible, then join.
the proposed join involves contig ends mostly covered by HAF4 tags. Joining these contigs should be done with some caution, it may be a repeat occurring twice in the sequence. Check whether the contig ends in question align with ends of other contigs. If not, joining is probably the way to go. If potential joins exist with other contigs, then it's a repeat (see below).
the proposed join involves contig ends mostly covered by HAF5, HAF6 or HAF7 tags. Joining these contigs should be done with utmost caution, you are almost certainly (HAF5) and very certainly (HAF6 and HAF7) in a repetitive area of your sequence. You will probably need additional information like paired-end or template info in order to join your contigs.
MIRA goes a long way to calculate a consensus which is as correct as possible. Unfortunately, communication with finishing programs is a bit problematic as there is currently no standard way to say which read comes from which sequencing technology.
It is therefore often the case that finishing programs calculate their own consensus when loading a project assembled with MIRA; this is the case, e.g., for gap4. This consensus may then not be optimal.
The recommended way to deal with this problem is: import the results from MIRA into your finishing program like you always do. Then finish the genome there, export the project from the finishing program as CAF, and finally use convert_project (from the MIRA package) with the "-r" option to recalculate the optimal consensus of your finished project.
E.g., assuming you have just finished editing the gap4 database DEMO.3, do the following. First, export the gap4 database back to CAF:

$ gap2caf -project DEMO -version 3 >demo3.caf
Then, use convert_project with option '-r' to convert it into any other format that you need. Example for converting to CAF and FASTA formats with correct consensus:

$ convert_project -f caf -t caf -t fasta -r c demo3.caf final_result
mira cannot work with EXP files resulting from GAP4 that have already been edited. If you want to reassemble an edited GAP4 project, convert it to CAF format and use the [-caf] option to load it.
As also explained earlier, mira relies on sequencing vector being recognised in preprocessing steps by other programs. Sometimes, when a whole stretch of bases is not correctly marked as sequencing vector, the reads might not be aligned into a contig although they would otherwise match quite well. You can use [-CL:pvc] and [-CO:emea] to address problems with incomplete clipping of sequencing vectors. Having the assembler work with less strict parameters may also help.
mira has been developed to assemble shotgun sequencing or EST sequencing data. There are no explicit limitations concerning length or number of sequences. However, there are a few implicit assumptions that were made while writing portions of the code:
Sequence data produced by electrophoresis rarely surpasses 1000 usable bases and I have never heard of, let alone seen, more than 1100. The fast SKIM filtering relies on the fact that sequences will never exceed 10000 bases in length.
The next problem that might arise with 'unnatural' long sequence reads will be my implementation of the Smith-Waterman alignment routines. I use a banded version with linear running time (linear to the bandwidth) but quadratic space usage. So, comparing two 'reads' of length 5000 will result in memory usage of 100MB. I know that this could be considered as a flaw. On the other hand - unless someone comes up with electrophoresis producing reads with more than 2000 usable bases - I see no real need to change this as long as there are more important things on the TODO list. Of course, if anyone is willing to contribute a fast banded SW alignment routine which runs in linear time and space, just feel free to contact the author.
Current data structures allow for a worst-case read coverage of maximally 16384 reads on top of each other.
Note: this limit used to be more than enough for about any kind of genome sequencing, but since people started sequencing non-normalised EST libraries with 454 and Solexa, it can be reached all too often. This will change in future releases.
the 32-bit Linux version is limited by the memory made available by the Linux kernel (somewhere around 2.3 to 2.7GB).
the 64-bit Linux version has no implicit memory limits, although the total number of bases of all reads may not surpass 2,147,483,648. With that, even aliens with a genome ~800 times bigger than the human one could be tackled (if it were not for other limitations, mainly RAM and processing power).
mira is not fully multi-threaded (yet), but even Sanger projects for bigger bacteria can be assembled in ~2-3 hours on current hardware. Fungi may take two or three days.
For 454 genome projects, bacteria should be done in about a day at most; fungi could take about 10 days.
to reduce memory overhead, the following assumptions have been made:
a project does not contain sequences from more than 255 different:
sequencing machine types
primers
strains (in mapping mode: 7)
base callers
dyes
process status
a project does not contain sequences from more than 65535 different:
clone vectors
sequencing vectors
Note: Versions with uneven minor numbers (e.g. 1.1.x, 1.3.x, ..., 2.1.x, etc.) are development versions which might be unstable in parts (although I don't think so). But to catch possible bugs, development versions of mira are distributed with tons of internal checks compiled into the code, making them somewhere between 10% and 50% slower than they could be.
Of course one can run MIRA atop an NFS mount (a "disk" mounted over a network using the NFS protocol), but performance will go down the drain as the NFS server, respectively the network, will not be able to cope with the amount of data MIRA needs to shift to and from disk (writes/reads to the log directory). Slowdowns by a factor of 10 and more have been observed. In case you have no other possibility, you can force MIRA to run atop NFS using [-MI:sonfs=no], but you have been warned.
In case you want to keep input and output files on NFS, you can use [-DI:lrt] to redirect the log directory to a local filesystem. Then MIRA will run at almost full speed.
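Before starting a large assembly, it can be worth checking which filesystem type the working directory actually sits on. This is a sketch using GNU coreutils stat (the -f / -c %T combination is a Linux/coreutils assumption, not a MIRA feature); if it prints nfs, consider redirecting the log directory with [-DI:lrt] as described above:

```shell
# Print the filesystem type of the current working directory,
# e.g. 'ext2/ext3', 'xfs' ... or 'nfs', in which case the MIRA log
# directory should better be redirected to a local disk.
stat -f -c %T .
```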
Assembling sequences without quality values is like ... like ... like driving a car down a sinuous mountain road at 200 km/h without brakes, airbags or a steering wheel. With a ravine on one side and a rock face on the other. Did I mention the missing seat-belts? You might get down safely, but experience tells the result will rather be a bloody mess.
All MIRA routines are internally geared toward quality values guiding decisions. No one should ever assemble anything without quality values. Never. Ever. Even if quality values are sometimes inaccurate, they do help.
Now, there are very rare occasions where getting quality values is not possible. If you absolutely cannot get them, and I mean only in this case, use these switches: --noqualities[=SEQUENCINGTECHNOLOGY] SEQUENCINGTECHNOLOGY_SETTINGS -AS:bdq=30. E.g.:

--noqualities=454 454_SETTINGS -AS:bdq=30

or

--noqualities SANGER_SETTINGS -AS:bdq=30 454_SETTINGS -AS:bdq=30
This tells MIRA not to complain about missing quality values and to fake a quality value of 30 for all reads having no qualities, allowing some MIRA routines (in standard parameter settings) to start disentangling your repeats.
Warning: Doing the above has some severe side-effects. You will be, e.g., at the mercy of non-random sequencing errors. I suggest combining the above with a [-CO:mrpg=4] or higher. You also may want to tune the [-AS:bdq] parameter together with [-CO:mnq] and [-CO:mgqrt] in cases where you mix sequences with and without quality values.
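For orientation: quality values are on the phred scale, where a value q corresponds to an error probability of 10^(-q/10), so the faked quality of 30 means 1 expected error in 1000 bases. A tiny sketch of the conversion:

```shell
# phred scale: error probability p = 10^(-q/10)
q=30
awk -v q="$q" 'BEGIN { printf "q%d => error probability %.4f (1 in %d)\n", q, 10^(-q/10), 10^(q/10) }'
```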
Viewing the results of a mira assembly or preprocessing the sequences for an assembly can be done with a number of different programs. The following ones are just examples; there are a lot more packages available:
If you have really nothing else as viewer, a browser that understands tables is needed to view the HTML output. A browser that knows style sheets (CSS) is recommended, as different tags will be highlighted. Konqueror, Opera, Mozilla, Netscape and Internet Explorer all do fine; lynx is not really ... optimal.
You'll want GAP4 (generally speaking: the Staden package) to preprocess the sequences, visualise and eventually rework the results when using gap4da output. The Staden package comes with a fully featured sequence preparing and annotating engine (pregap4) that is very useful to preprocess your data (conversion between file types, quality clipping, tagging etc.).
See http://www.sourceforge.net/projects/staden/ for further information and also a possibility to download precompiled binaries for different platforms.
Reading result files from ssaha2 or smalt from the Sanger centre is supported directly by mira to perform a fast and efficient tagging of sequencing vector stretches. This makes you basically independent from any other commercial or license-requiring vector screening software. For Sanger reads, a combination of lucy (see below), ssaha2 or smalt together with the mira parameters for SSAHA2 / SMALT support ( [-CL:msvs]) and quality clipping ( [-CL:qc]) should do the trick. For reads coming from 454 pyro-sequencing, ssaha2 or smalt and the SSAHA2 / SMALT support also work pretty well.
See http://www.sanger.ac.uk/resources/software/ssaha2/ and / or http://www.sanger.ac.uk/resources/software/smalt/ for further information and also a possibility to download the source or precompiled binaries for different platforms.
lucy from TIGR (now JCVI) is another useful sequence preprocessing program. Lucy is a utility that prepares raw DNA sequence fragments for sequence assembly. The cleanup process includes quality assessment, confidence reassurance, vector trimming and vector removal.
There's a small script in the MIRA 3rd party package which converts the clipping data from the lucy format into something mira can understand (NCBI Traceinfo).
See ftp://ftp.tigr.org/pub/software/Lucy/ to download the source code of lucy.
Viewing .ace file output without consed can be done with clview from TIGR. See http://www.tigr.org/tdb/tgi/software/.
Tablet http://bioinf.scri.ac.uk/tablet/ may also be used for this.
The Integrated Genome Browser (IGB) of the GenoViz project at SourceForge (http://sourceforge.net/projects/genoviz/) is just perfect for loading a genome and looking at mapping coverage (provided by the wiggle result files of MIRA).
TraceTuner (http://sourceforge.net/projects/tracetuner/) is a tool for base and quality calling of trace files from DNA sequencing instruments. Originally developed by Paracel, this code base was released as open source in 2006 by Celera.
phred (basecaller) - cross_match (sequence comparison and filtering) - phrap (assembler) - consed (assembly viewer and editor). This is another package that can be used for this type of job, but requires more programming work. The fact that sequence stretches are masked out (overwritten with the character X) if they shouldn't be used in an assembly doesn't really help and is considered harmful (but it works).
Note the bug of consed when reading ACE files, see more about this in the section on file types (above) in the entry for ACE.
See http://www.phrap.org/ for further information.
A text viewer for the different textual output files.
As always, most of the time a combination of several different packages is possible. My currently preferred combo for genome projects is ssaha2 or smalt and/or lucy (vector screening), MIRA (assembly, of course) and gap4 (assembly viewing and finishing).
For re-assembling projects that were edited in gap4, one will also need the gap2caf converter. The source for this is available at http://www.sanger.ac.uk/resources/software/caf.html.
Since the V2.9.24x3 version of mira, there is a miramem program. When called from the command line, it will ask a number of questions and then print out an estimate of the amount of RAM needed to assemble the project. Take this estimate with a grain of salt: depending on the sequence properties, it can vary by +/- 30% for bacteria and 'simple' eukaryotes. The more repeats there are, the more likely you will need to restrict memory usage in some way or another.
Here's the transcript of a session with miramem:
This is MIRA V3.2.0rc1 (development version).

Please cite: Chevreux, B., Wetter, T. and Suhai, S. (1999), Genome Sequence Assembly Using Trace Signals and Additional Sequence Information. Computer Science and Biology: Proceedings of the German Conference on Bioinformatics (GCB) 99, pp. 45-56.

To (un-)subscribe the MIRA mailing lists, see:
http://www.chevreux.org/mira_mailinglists.html
After subscribing, mail general questions to the MIRA talk mailing list:
mira_talk@freelists.org
To report bugs or ask for features, please use the new ticketing system at:
http://sourceforge.net/apps/trac/mira-assembler/
This ensures that requests don't get lost.

[...]

miraMEM helps you to estimate the memory needed to assemble a project.
Please answer the questions below.
Defaults are give in square brackets and chosen if you just press return.
Hint: you can add k/m/g modifiers to your numbers to say kilo, mega or giga.

Is it a genome or transcript (EST/tag/etc.) project? (g/e/) [g] g
Size of genome? [4.5m] 9.8m
9800000
Size of largest chromosome? [9800000] 9800000
Is it a denovo or mapping assembly? (d/m/) [d] d
Number of Sanger reads? [0] 0
Are there 454 reads? (y/n/) [n] y
y
Number of 454 GS20 reads? [0] 0
Number of 454 FLX reads? [0] 0
Number of 454 Titanium reads? [0] 750k
750000
Are there PacBio reads? (y/n/) [n] n
Are there Solexa reads? (y/n/) [n] n

************************* Estimates *************************
The contigs will have an average coverage of ~ 30.6 (+/- 10%)
RAM estimates:
  reads+contigs (unavoidable): 7.0 GiB
  large tables (tunable):      688. MiB
                               ---------
  total (peak):                7.7 GiB
  add if using -CL:pvlc=yes :  2.6 GiB

Estimates may be way off for pathological cases.
Note that some algorithms might try to grab more memory if the need arises
and the system has enough RAM. The options for automatic memory management
control this: -AS:amm, -AS:kpmf, -AS:mps
Further switches that might reduce RAM (at cost of run time or accuracy):
-SK:mhim, -SK:mchr (both runtime); -SK:mhpr (accuracy)
*************************************************************
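The coverage figure in the estimate above is essentially total read bases divided by genome size. As a back-of-the-envelope check for this session, assuming an average Titanium read length of ~400 bases (an assumed number, not something miramem told us):

```shell
# 750,000 Titanium reads of ~400 bases each on a 9.8 Mb genome.
reads=750000; avglen=400; genome=9800000
awk -v r="$reads" -v l="$avglen" -v g="$genome" \
    'BEGIN { printf "expected average coverage: ~%.1fx\n", r*l/g }'
```

which agrees with the ~30.6 printed by miramem.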
If your RAM is not large enough, you can still assemble projects by using disk swap. Up to 20% of the needed memory can be provided by swap without the speed penalty getting too large. Going above 20% is not recommended though, above 30% the machine will be almost permanently swapping at some point or another.
NEW since 2.7.4: The new SKIM3 algorithm (initial all-against-all read comparison) is now approximately 60 times faster than the SKIM algorithm of earlier versions. E.g., SKIMming of 53,000 Sanger type shotgun reads now takes a bit more than a minute instead of 62 minutes.
The times given below are only approximate and were gathered on my home development box (Athlon 4800+) using a single core and minimal debug code compiled in, somewhat slowing down the whole process.
Example 1: a small genomic project with 720 reads forming 35k bases of contig sequences. Using --job=denovo,genome,normal,sanger and resolving minor repeat misassemblies, full read extension and automatic contig editing takes 19 seconds.
Example 2: a bacterial genome project with two very closely related strains, 53000 Sanger reads forming a bit more than 3 megabases of contig sequences for each strain. Using --job=denovo,genome,accurate,sanger (four main passes, read extension, clipping of vector remnants), resolving repeat misassemblies (mostly RNA stretches, but also some very closely related genes) takes 1hr and 48 minutes and uses a maximum of 1.2GB of RAM (miramem estimated the usage to be 1.5GB).
Example 3: Here are the times for miraSearchESTSNPs in a non-normalised (thus very repetitive) EST project, 9747 reads with an average length of 674 used bases:
The fast filtering algorithm performs about 12 million sequence comparisons per second (8 seconds).
Banded Smith-Waterman performs around 750 sequence alignments per second (with a 15% band to each side, which is quite generous), 4:07 for about 182000 alignment checks.
The three steps of miraSearchESTSNPs (each one again subdivided into a number of MIRA passes), including resolving very high coverage contigs (>500 sequences) in multiple passes and splitting them into different SNP and splice variants, take about 20 minutes.
File Input / Output:
mira can only read unedited EXP files.
There is sometimes a (rather important) memory leak when using the assembly-integrated Sanger read editor. I have not been able to trace the reason yet.
convert_project shows unexpected slowness when converting larger projects (e.g. 2 million reads) with more than one contig.
Assembly process:
The routines for determining Repeat Marker Bases (SRMr) are sometimes too sensitive, which can lead to excessive base tagging and prevent correct assemblies in subsequent assembly passes. The parameters you should look at for this problem are [-CO:mrc:nrz:mgqrt:mgqwpc]. Also look at [-CL:pvc] and [-CO:emea] if you have a lot of sequencing vector relics at the end of the sequences.
EST projects with Solexa data tend to easily reach a coverage larger than the largest allowed coverage at the moment (16383). Using MIRA with Solexa data from non-normalised EST libraries may therefore lead to some unexpected results in these areas and hence cannot be recommended.
The assignment of reads to debris or singlets and whether or not they are put into the final result is messy; the statistics about this are sometimes even wrong. Needs to be redone.
These are some of the topics on my TODO list for the next revisions to come:
Making parts of the process multi-threaded (currently stopped due to other priorities like Solexa etc.)
Less disk usage when using EST assembly on 10 or more million Solexa reads
Other nifty ideas that I have not completely thought out yet.
Note: description is old and needs to be adapted to the current 2.9.x / 3.x line.
To avoid the "garbage-in, garbage-out" problem, mira uses a 'high quality alignments first' contig building strategy. This means that the assembler will start with those regions of sequences that have been marked as good quality (high confidence region - HCR) with low error probabilities (the clipping must have been done by the base caller or other preprocessing programs, e.g. pregap4) and then gradually extends the alignments as errors in different reads are resolved through error hypothesis verification and signal analysis.
This assembly approach relies on some of the automatic editing functionality provided by the EdIt package which has been integrated in parts within mira.
This is an approximate overview of the steps that are executed while assembling:
All the experiment / phd / fasta sequences that act as input are loaded (or the CAF project). Qualities for the bases are loaded from the FASTA or SCF if needed.
the ends of the reads are cleaned to ensure they have a minimum stretch of bases without sequencing errors
The high confidence region (HCR) of each read is compared with a quick algorithm to the HCR of every other read to see if it could match and have overlapping parts (this is the 'SKIM' filter).
All the reads which could match are being checked with an adapted Smith-Waterman alignment algorithm (banded version). Obvious mismatches are rejected, the accepted alignments form one or several alignment graphs.
Optional pre-assembly read extension step: mira tries to extend HCR of reads by analysing the read pairs from the previous alignment. This is a bit shaky as reads in this step have not been edited yet, but it can help. Go back to step 2.
A contig gets made by building a preliminary partial path through the alignment graph (through in-depth analysis up to a given level) and then adding the most probable overlap candidates to a given contig. Contigs may reject reads if these introduce too many errors into the existing consensus. Errors in regions known as dangerous (for the time being only ALUS and REPT) get additional attention by performing simple signal analysis when alignment discrepancies occur.
Optional: the contig can be analysed and corrected by the automatic editor ("EdIt" for Sanger reads, or the new MIRA editor for 454 reads).
Long repeats are searched for; bases in reads of different repeats that have been assembled together but differ sufficiently (for EdIt so that they didn't get edited, and by phred quality value) get tagged with special tags (SRMr and WRMr).
Go back to step 5 if there are reads present that have not been assembled into contigs.
Optional: Detection of spoiler reads that prevent joining of contigs. Remedy by shortening them.
Optional: Write out a checkpoint assembly file and go back to step 2.
The resulting project is written out to different output files and directories.
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do. Furthermore, it is advised to read through the main documentation of the assembler as this is really just a getting started guide.
For working parameter settings for assemblies involving 454 and / or Solexa data, please also read the MIRA help files dedicated to these platforms.
This example assumes that you have a few sequences in FASTA format that may or may not have been preprocessed - that is, where sequencing vector has been cut back or masked out. If quality values are also present in a FASTA-like format, so much the better.
We need to give a name to our project: throughout this example, we will assume that the sequences we are working with are from Bacillus chocorafoliensis (or short: Bchoc); a well known, chocolate-adoring bug from the Bacillus family which is able to make a couple of hundred grams of chocolate vanish in just a few minutes.
Our project will therefore be named 'bchoc'.
"Do I have enough memory?" used to be one of the most frequently asked questions. To answer it, please use miramem, which will give you an estimate. Basically, you just need to start the program and answer the questions; for more information please refer to the corresponding section in the main MIRA documentation.
Take this estimate with a grain of salt, depending on the sequences properties, variations in the estimate can be +/- 30%.
The following steps will allow you to quickly start a simple assembly if your sequencing provider gave you data which was pre-clipped or pre-screened for vector sequence:

$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ mira --project=bchoc --job=denovo,genome,normal,sanger --fasta
Explanation: we created a directory for the assembly, copied the sequences into it (to make things easier for us, we named the file directly in a format suitable for mira to load it automatically) and we also copied quality values for the sequences into the same directory. As last step, we started mira with options telling it that
our project is named 'bchoc' and hence, input and output files will have this as prefix;
the data is in a FASTA formatted file;
the data should be assembled de-novo as a genome at an assembly quality level of normal and that the reads we are assembling were generated with Sanger technology.
By giving mira the project name 'bchoc' (--project=bchoc) and naming the sequence file with an appropriate extension (_in.sanger.fasta), mira automatically loaded that file for assembly. When there are additional quality values available (bchoc_in.sanger.fasta.qual), these are also automatically loaded and used for the assembly.
Note: If there is no file with quality values available, MIRA will stop immediately. You will need to provide parameters on the command line which explicitly switch off loading and using quality files.
Warning: Not using quality values is NOT recommended. Read the corresponding section in the MIRA reference manual.
If your sequencing provider gave you data which was NOT pre-clipped for vector sequence, you can do this yourself in a pretty robust manner using SSAHA2 -- or the successor, SMALT -- from the Sanger Centre. You just need to know which sequencing vector the provider used and have its sequence in FASTA format (ask your provider).
For SSAHA2 follow these steps (most are the same as in the example above):
$ mkdir bchoc_assembly1
$ cd bchoc_assembly1
bchoc_assembly1$ cp /your/path/sequences.fasta bchoc_in.sanger.fasta
bchoc_assembly1$ cp /your/path/qualities.someextension bchoc_in.sanger.fasta.qual
bchoc_assembly1$ ssaha2 -output ssaha2 -kmer 8 -skip 1 -seeds 1 -score 12 -cmatch 9 -ckmer 6 /path/where/the/vector/data/resides/vector.fasta bchoc_in.sanger.fasta > bchoc_ssaha2vectorscreen_in.txt
bchoc_assembly1$ mira -project=bchoc -job=denovo,genome,normal,sanger -fasta SANGER_SETTINGS -CL:msvs=yes
Explanation: there are just two differences to the example above:
calling SSAHA2 to generate a file which contains information on the vector sequence hitting your sequences.
telling mira with SANGER_SETTINGS -CL:msvs=yes to load this vector screening data for Sanger data
Note: I need an example for SMALT ...
Mira can be used in many different ways: building assemblies from scratch, performing reassembly on existing projects, assembling sequences from closely related strains, assembling sequences against an existing backbone (mapping assembly), etc.pp. Mira comes with a number of quick switches, i.e., switches that turn on parameter combinations which should be suited for most needs.
E.g.: mira --project=foobar --job=sanger --fasta --highlyrepetitive

The line above tells mira that our project will have the general name foobar and that the sequences are to be loaded from FASTA files, the sequence input file being named foobar_in.sanger.fasta (and the sequence quality file, if available, foobar_in.sanger.fasta.qual). The reads come from Sanger technology and mira is prepared for the genome containing nasty repeats. The result files will be in a directory named foobar_results, and statistics about the assembly will be available in the foobar_info directory, e.g., a summary of contig statistics in foobar_info/foobar_info_contigstats.txt. Notice that the --job= switch is missing some specifications; mira will automatically fill in the remaining defaults (i.e., denovo,genome,normal in the example above).
E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive

This is the same as the previous example except mira will perform a mapping assembly in 'accurate' quality of the sequences against backbone sequence(s). mira will therefore additionally load the backbone sequence(s) from the file foobar_backbone_in.fasta (FASTA being the default type of backbone sequence to be loaded) and, if existing, quality values for the backbone from foobar_backbone_in.fasta.qual.
E.g.: mira --project=foobar --job=mapping,accurate,sanger --fasta --highlyrepetitive -SB:bft=gbf

As above, except we have added an extensive switch ([-SB:bft]) to tell mira that the backbones are in a GenBank format file (GBF). MIRA will therefore load the backbone sequence(s) from the file foobar_backbone_in.gbf. Note that the GBF file can also contain multiple entries, i.e., it can be a GBFF file.
This feature is in its infancy; presently only the SKIM algorithm uses multiple threads. Setting the number of processes for this stage can be done via the [-GE:not] parameter, e.g. -GE:not=4 to use 4 threads.
A simple GAP4 project will do nicely. Please take care of the following: you need already preprocessed experiment / fasta / phd files, i.e., at least the sequencing vector should have been tagged (in EXP files) or masked out (FASTA or PHD files). It would be nice if some kind of not too lazy quality clipping had also been done for the EXP files; pregap4 should do this for you.
Step 1: Create a file of filenames (named mira_in.fofn) for the project you wish to assemble. The file of filenames should contain the newline-separated names of the EXP files and nothing else.
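A file of filenames is nothing more than one EXP file name per line. A minimal sketch with made-up read names (in a real project you would generate the list from your actual EXP files, e.g. with ls *.exp > mira_in.fofn):

```shell
# Create a toy file of filenames; the read names are invented for illustration.
printf '%s\n' read001.exp read002.exp read003.exp > mira_in.fofn
cat mira_in.fofn
```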
Step 2: Execute the mira assembly, eventually using command line options or output redirection:
$ /path/to/the/mira/package/mira ... other options ...

or simply

$ mira ... other options ...
if MIRA is in a directory which is in your PATH. The result of the assembly will now be in a directory named mira_results where you will find mira_out.caf, mira_out.html etc., or in gap4 direct assembly format in the mira_out.gap4da sub-directory.
Step 3a: (This is not recommended anymore) Change to the gap4da directory and start gap4:
$ cd mira_results/mira_out.gap4da
mira_results/mira_out.gap4da$ gap4
choose the menu 'File->New' and enter a name for your new database (like 'demo'). Then choose the menu 'Assembly->Directed assembly'. Enter the text 'fofn' in the entry labelled 'Input readings from List or file name' and enter the text 'failures' into the entry labelled 'Save failures to List or file name'. Press "OK".
That's it.
Step 3b: (Recommended) As an alternative to step 3a, one can use the caf2gap converter (see below)
mira_results$ caf2gap -project demo -version 0 -ace mira_out.caf
mira_results$ gap4 DEMO.0
Out-of-the-box example: MIRA comes with a few really small toy projects to test usability on a given system. Go to the minidemo directory and follow the instructions given in the section for own projects above, but start with step 2. You might want to start mira while redirecting the output to a file for later analysis.
It is sometimes desirable to reassemble a project that has already been edited, for example when hidden data in reads has been uncovered or when some repetitive bases have been tagged manually. The canonical way to do this is by using CAF files as data exchange format and the caf2gap and gap2caf converters available from the Sanger Centre (http://www.sanger.ac.uk/Software/formats/CAF/).
Warning: The project will be completely reassembled; contig joins or breaks that have been made in the GAP4 database will be lost. You will get an entirely new assembly with what mira determines to be the best assembly.
Step 1: Convert your GAP4 project with the gap2caf tool. Assuming that the assembly is in the GAP4 database CURRENT.0:

$ gap2caf -project CURRENT -version 0 -ace > newstart_in.caf
The name "newstart" will be the project name of the new assembly project.
Step 2: Start mira with the -caf option and tell it the name of your new reassembly project:
$ mira -caf=newstart
(and other options like --job etc. at will.)
Step 3: Convert the resulting CAF file newstart_assembly/newstart_d_results/newstart_out.caf to a gap4 database format as explained above and start gap4 with the new database:

$ cd newstart_assembly/newstart_d_results
newstart_assembly/newstart_d_results$ caf2gap -project reassembled -version 0 -ace newstart_out.caf
newstart_assembly/newstart_d_results$ gap4 REASSEMBLED.0
One useful feature of mira is the ability to assemble against already existing reference sequences or contigs (also called a mapping assembly). The parameters that control the behaviour of the assembly in these cases are in the [-STRAIN/BACKBONE] section of the parameters.
Please have a look at the example in the minidemo/bbdemo2 directory, which maps sequences from C.jejuni RM1221 against (parts of) the genome of C.jejuni NCTC1168.
There are a few things to consider when using backbone sequences:
Backbone sequences can be as long as needed! They are not subject to normal read length constraints of a maximum of 10k bases. That is, if one wants to load one or several entire chromosomes of a bacterium or lower eukaryote as backbone sequence(s), this is just fine.
Backbone sequences can be single sequences like provided by, e.g., FASTA, FASTQ or GenBank files. But backbone sequences also can be whole assemblies when they are provided as, e.g., CAF format. This opens the possibility to perform semi-hybrid assemblies by assembling first reads from one sequencing technology de-novo (e.g. 454) and then map reads from another sequencing technology (e.g. Solexa) to the whole 454 alignment instead of mapping it to the 454 consensus.
A semi-hybrid assembly will therefore contain, like a hybrid assembly, the reads of both sequencing technologies.
Backbone sequences will not be reversed! They will always appear in forward direction in the output of the assembly. Please note: if the backbone sequence consists of a CAF file that contains contigs which contain reversed reads, then the contigs themselves will be in forward direction. But the reads they contain that are in reverse complement direction will of course stay in reverse complement direction.
Backbone sequences will not be assembled together! That is, if a sequence of the backbones has a perfect overlap with another backbone sequence, they will still not be merged.
Reads are assembled to backbones in a first come, first served scattering strategy.
Suppose you have two identical backbones and one read which would match both, then the read would be mapped to the first backbone. If you had two (almost) identical reads, the first read would go to the first backbone, the second read to the second backbone. With three almost identical reads, the first backbone would get two reads, the second backbone one read.
Only in backbones loaded from CAF files: contigs made out of single reads (singlets) lose their status as backbones and will be returned to the normal read pool for the assembly process. That is, these sequences will be assembled to other backbones or with each other.
Examples for using backbone sequences:
Example 1: assume you have the genome of an existing organism. From that, a mutant has been made by mutagenesis and you are skimming the genome in shotgun mode for mutations. For this you would generate a straindata file that gives the name of the mutant strain to the newly sequenced reads and simply assemble those against your existing genome, using the following parameters:
-SB:lsd=yes:lb=yes:bsn=nameOriginalStrain:bft=caf|fasta|gbf
When loading backbones from CAF, the qualities of the consensus bases will be calculated by mira according to normal consensus computing rules. When loading backbones from FASTA or GBF, one can set the expected overall quality of the sequences (e.g. 1 error in 1000 bases = quality of 30) with [-SB:bbq=30]. It is recommended to have the backbone quality at least as high as the [-CO:mgqrt] value, so that mira can automatically detect and report SNPs.
Example 2: suppose that you are in the process of performing a shotgun sequencing and you want to determine the moment when you got enough reads. One could make a complete assembly each day when new sequences arrive. However, starting with genomes the size of a lower eukaryote, this may become prohibitive from the computational point of view. A quick and efficient way to resolve this problem is to use the CAF file of the previous assembly as backbone and simply add the new reads to the pool. The number of singlets remaining after the assembly versus the total number of reads of the project is a good measure for the coverage of the project.
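As a rough, hedged rule of thumb (the idealised Lander-Waterman model, which MIRA itself does not compute): the fraction of the genome still uncovered at average coverage c is about e^-c, which is why the singlet count drops quickly once coverage becomes adequate. A sketch of the arithmetic:

```shell
# Idealised Lander-Waterman estimate: uncovered genome fraction ~ exp(-c).
for c in 2 4 8; do
    awk -v c="$c" 'BEGIN { printf "coverage %2dx => ~%.5f of genome uncovered\n", c, exp(-c) }'
done
```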
Example 3: in EST assembly with miraSearchESTSNPs, existing cDNA sequences can also be useful when added to the project during step 3 (in the file step3_in.par). They will provide a framework against which mRNA-contigs built in previous steps will be assembled, allowing for a fast evaluation of the results. Additionally, they provide a direction for the assembled sequences so that one does not need to invert single contigs by hand afterwards.
(To be expanded)
This can have two causes:
if you work with a 32 bit executable of caf2gap, it might very well be that the converter needs more memory than can be handled by 32 bit. Only solution: switch to a 64 bit executable of caf2gap.
you compiled caf2gap with a caftools version prior to 2.0.1 and then caf2gap throws segmentation errors. Simply grab the newest version of the caftools (at least 2.0.2) at ftp://ftp.sanger.ac.uk/pub/PRODUCTION_SOFTWARE/src/ and compile the whole package. caf2gap will be contained therein.
caf2gap currently (as of version 2.0.2) has a bug that puts all features in reverse direction during the conversion from CAF to a gap4 project. There is a fix available; please contact me for further information (until I find time to describe it here).
MIRA can assemble 454 type data either on its own or together with Sanger or Solexa type sequencing data (true hybrid assembly). Paired-end sequences coming from genomic projects can also be used if you take care to prepare your data the way MIRA needs it.
MIRA goes a long way to assemble sequence in the best possible way: it uses multiple passes, learning in each pass from errors that occurred in the previous passes. There are routines specialised in handling oddities that occur in different sequencing technologies.
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step by step walkthroughs on how to setup your 454 data and then perform an assembly, this guide expects you to read at some point in time
the "Caveats when using 454 data" section of this document (just below). This. Is. Important. Read. It!
the mira_usage introductory help file so that you have a basic knowledge on how to set up projects in mira for Sanger sequencing projects.
the GS FLX Data Processing Software Manual from Roche Diagnostics (or the corresponding manual for the GS20 or Titanium instruments).
and last but not least the mira_reference help file to look up some command line options.
If you want to jump into action, I suggest you walk through the "Walkthrough: combined unpaired and paired-end assembly of Brucella ceti" section of this document to get a feeling on how things work. That particular walkthrough is with paired and unpaired 454 data from the NCBI short read archive, so be prepared to download a couple of hundred MiBs.
But please do not forget to come back to the "Caveats" section just below later; it contains pointers to common traps lurking in the depths of high throughput sequencing.
Please take some time to read this section. If you're really eager to jump into action, then feel free to skip forward to the walkthrough, but make sure to come back later.
Or at least use the vector clipping info provided in the SFF file and have it put into a standard NCBI TRACEINFO XML format. Yes, that's right: vector clipping info.
Here's the short story: 454 reads can contain a kind of vector sequence. To be more precise, they can - and very often do - contain the sequence of the (A or B)-adaptors that were used for sequencing.
To quote a competent bioinformatician who thankfully dug through quite some data and patent filings to find out what is going on: "These adaptors consist of a PCR primer, a sequencing primer and a key. The B-adaptor is always in because it's needed for the emPCR and sequencing. If the fragments are long enough, then one usually does not reach the adaptor at all. But if the fragments are too short - tough luck."
Basically, it's tough luck for a lot of the 454 sequencing projects I have seen so far, both for public data (sequences available at the NCBI trace archive) and non-public data.
Tip
Use the sff_extract script from Jose Blanca at the University of Valencia. The home of sff_extract is http://bioinf.comav.upv.es/sff_extract/index.html, and I am thankful to Jose for giving permission to distribute the script in the MIRA 3rd party package (separate download).
Some labs use specially designed tags for their sequencing (I've heard of cases with up to 20 bases). Since the tag sequences are always identical, they behave like vector sequences in an assembly. As with any other assembler: if you happen to get such a project, you must take care that those tags are filtered out of, or masked in, your sequences before going into an assembly. If you don't, the results will be messy at best.
Tip
Put your FASTAs through SSAHA2 or, better, SMALT with the sequence of your tags as masking target. MIRA can read the SSAHA2 output (or SMALT output when using "-f ssaha") and mask internally using the MIRA [-CL:msvs] parameter and the options pertaining to it.
Sequences coming from the GS20, FLX or Titanium usually have pretty good clip points set by the Roche/454 preprocessing software. There is, however, a tendency to overestimate the quality towards the end of the sequences and declare sequence parts as 'good' which really shouldn't be.
Sometimes, these bad parts toward the end of sequences are so annoyingly bad that they prevent MIRA from correctly building contigs, that is, instead of one contig you might get two.
MIRA has the [-CL:pec] clipping option to deal with these annoyances (standard for all --job=genome assemblies). This algorithm performs a proposed end clipping which guarantees that the ends of reads are clean when the coverage of a project is high enough.
For genomic sequences: the term 'enough' being somewhat fuzzy ... everything above a coverage of 15x should be no problem at all, coverages above 10x should also be fine. Things start to get tricky below 10x, but give it a try. Below 6x however, switch off the [-CL:pec] option.
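A quick back-of-envelope calculation helps decide whether [-CL:pec] should stay switched on. The numbers below (1.2 million FLX reads of ~230 bases on a hypothetical 4.0 Mb genome) are purely illustrative, not from a real project:

```shell
# expected coverage = (number of reads * mean read length) / genome size
# all three numbers are made-up example values
awk 'BEGIN { printf "estimated coverage: %.0fx\n", (1200000 * 230) / 4000000 }'
```

Anything clearly above 15x, as in this made-up example, is safe territory for [-CL:pec].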
"Do I have enough memory?" used to be one of the most frequently asked questions. To answer it, please use miramem, which will give you an estimate. Basically, you just need to start the program and answer the questions; for more information please refer to the corresponding section in the main MIRA documentation.
Take this estimate with a grain of salt: depending on the properties of the sequences, the estimate can vary by +/- 30%.
Take these estimates with an even larger grain of salt for eukaryotes. Some of them are incredibly repetitive, which currently leads to the explosion of some secondary tables in MIRA. I'm working on it.
The basic data type you will get from the sequencing instruments will be SFF files. Those files contain almost all information needed for an assembly, but they need to be converted into more standard files before mira can use this information.
Let's assume we just sequenced a bug (Bacillus chocorafoliensis) and internally our department uses the short bchoc mnemonic for the project/organism/whatever. So, whenever you see bchoc in the following text, you can replace it with whatever name suits you.
For this example, we will assume that you have created a directory myProject for the data of your project and that the SFF files are in there. Doing a ls -lR should give you something like this:
arcadia:/path/to/myProject$
ls -lR
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff
As you can see, this sequencing project has 3 SFF files.
The information contained in the SFF file must be extracted to a FASTA, a FASTA quality and a NCBI TRACEINFO XML file. We'll use the sff_extract script to do that. We'll name the output files in a way that makes them immediately suitable for MIRA input.
Note 1: make sure you have Python installed on your system
Note 2: make sure you have the sff_extract script in your path (or use absolute path names)
arcadia:/path/to/myProject$
sff_extract -s bchoc_in.454.fasta -q bchoc_in.454.fasta.qual
  -x bchoc_traceinfo_in.454.xml EV10YMP01.sff EV5RTWS01.sff EVX95GF02.sff
Note
The above command has been split in multiple lines for better overview but should be entered in one line.
This can take some time; the 1.2 million FLX reads from this example need approximately 9 minutes for conversion. Your directory should now look something like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 231698898 2007-10-21 15:16 bchoc_in.454.fasta
-rw-r--r-- 1 bach users 661132502 2007-10-21 15:16 bchoc_in.454.fasta.qual
-rw-r--r-- 1 bach users 193962260 2007-10-21 15:16 bchoc_traceinfo_in.454.xml
-rw-rw-rw- 1 bach users 475849664 2007-09-23 10:10 EV10YMP01.sff
-rw-rw-rw- 1 bach users 452630172 2007-09-25 08:59 EV5RTWS01.sff
-rw-rw-rw- 1 bach users 436489612 2007-09-21 08:39 EVX95GF02.sff
At this point, the SFFs are not needed anymore. You can remove them from this directory if you want.
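Before removing them, a cheap sanity check is to verify that the FASTA and its quality file contain the same number of reads. The snippet below demonstrates the idea on a tiny made-up FASTA; on real data you would run the grep on bchoc_in.454.fasta and bchoc_in.454.fasta.qual:

```shell
# every read starts with a '>' header line, so counting headers counts reads
printf '>r1\nACGT\n>r2\nGGCC\n' > demo.fasta
grep -c '^>' demo.fasta
rm demo.fasta
```

If the two counts differ, something went wrong during extraction.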
Starting the assembly is now just a matter of one line with some parameters set correctly:
arcadia:/path/to/myProject$
mira --project=bchoc --job=denovo,genome,normal,454 >& log_assembly.txt
Note
The above command has been split in multiple lines for better overview but should be entered in one line.
Now, that was easy, wasn't it? In the above example - for assemblies having only 454 data and if you followed the walkthrough on how to prepare the data - the only options you might want to adapt at first are the following:
--project (for naming your assembly project)
--job (perhaps to change the quality of the assembly to 'accurate')
Of course, you are free to change any option via the extended parameters, but this is covered in the MIRA main reference manual.
Preparing the data for a Sanger / 454 hybrid assembly takes some more steps but is not really more complicated than a normal Sanger-only or 454-only assembly.
In the following sections, the example project is named bchoc_hyb, simply for us to remember that we did a hybrid assembly there.
Files with 454 input data will have .454. in the name, files with Sanger data will have .sanger. in the name.
Please proceed exactly in the same way as described for the assembly of 454-only data in the section above, that is, without starting the actual assembly.
In the end you should have three files (FASTA, FASTA quality and TRACEINFO) for the 454 data ready.
There are quite a number of sequencing providers out there, all with different pre-processing pipelines and different output file-types. MIRA supports quite a number of them, the three most important would probably be
(preferred option) FASTA files which are coupled with FASTA quality files and ancillary data in NCBI TRACEINFO XML format.
(preferred option) CAF (from the Sanger Institute) files that contain the sequence, quality values and ancillary data like clippings etc.
(secondary option, not recommended) EXP files as the Staden pregap4 package writes.
Your sequencing provider MUST have performed at least a sequencing vector clip on this data. A quality clip is also best done by the provider, as they usually know best what quality to expect from their instruments (although MIRA can do this too if you want).
You can either perform clipping the hard way by physically removing all clipped bases from the input (this is called trimming), or you can keep the clipped bases in the input file and provide clipping information in ancillary data files. This clipping information then MUST be present in the ancillary data (either the TRACEINFO XML, or the combined CAF, or the EXP files), together with other standard data like, e.g., mate-pair information when using a paired-end approach.
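For orientation, here is a minimal sketch of what such a TRACEINFO XML file looks like. The read name and clip values are made up; the tag names are those of the NCBI trace archive ancillary format as MIRA reads it:

```xml
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>EV10YMP01ABC12</trace_name>
    <clip_vector_left>5</clip_vector_left>
    <clip_quality_right>238</clip_quality_right>
  </trace>
</trace_volume>
```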
This example assumes that the data is provided as FASTA together with a quality file and ancillary data in NCBI TRACEINFO XML format.
Put these files (appropriately renamed) into the directory with the 454 data.
Here's how the directory with the preprocessed data should look now (note that we changed the bchoc mnemonic to bchoc_hyb just for fun ... and to make a distinction from the 454-only assembly above):
arcadia:/path/to/myProject$
ls -l
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.454.fasta
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.454.fasta.qual
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_traceinfo_in.454.xml
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.sanger.fasta
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_in.sanger.fasta.qual
-rwxrwxrwx 1 bach 2007-10-13 22:44 bchoc_hyb_traceinfo_in.sanger.xml
The following command line starts a basic, but normally quite respectable hybrid 454 and Sanger assembly of a genome:
arcadia:/path/to/myProject$
mira --fasta --project=bchoc_hyb --job=denovo,genome,normal,sanger,454 >& log_assembly.txt
The only change compared to starting an assembly with only 454 data is the addition of "sanger" to the --job= option.
Here's a walkthrough which should help you set up your own assemblies. You do not need to set up your directory structure as I do, but for this walkthrough it could help.
Please make sure that sff_extract is working properly and that you have at least version 0.2.1 (use sff_extract -v). Please also make sure that SSAHA2 can be run correctly (test this by running ssaha2 -v).
Note: this is how I set up a project, feel free to implement whatever structure suits your needs.
$
mkdir bceti_assembly
$
cd bceti_assembly
bceti_assembly$
mkdir origdata data assemblies
Your directory should now look like this:
arcadia:bceti_assembly$
ls -l drwxr-xr-x 2 bach users 48 2008-11-08 16:51 assemblies drwxr-xr-x 2 bach users 48 2008-11-08 16:51 data drwxr-xr-x 2 bach users 48 2008-11-08 16:51 origdata
Explanation of the structure:
the origdata directory will contain the 'raw' result files that one might get from sequencing.
the data directory will contain the preprocessed sequences we will use for the assembly.
the assemblies directory will contain the assemblies we make with our data (we might want to make more than one).
Note
Since early summer 2009, the NCBI does not offer SFF files anymore, which is a pity. This guide will nevertheless allow you to perform similar assemblies on your own data.
Please browse to
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005481&cmd=viewer&m=data&s=viewer
and
http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?run=SRR005482&cmd=viewer&m=data&s=viewer
and download the SFF files to the origdata
directory (press the download button on those pages).
En passant, note the following: SRR005481 is described as a 454 FLX data set where the library contains unpaired data ("Library Layout: SINGLE"). SRR005482 also contains 454 FLX data, but this time it's paired-end data ("Library Layout: PAIRED (ORIENTATION=forward)"). Knowing this will be important later on in the process.
arcadia:bceti_assembly$
cd origdata
arcadia:bceti_assembly/origdata$
ls -l
-rw-r--r-- 1 bach users 240204619 2008-11-08 16:49 SRR005481.sff.gz
-rw-r--r-- 1 bach users 211333635 2008-11-08 16:55 SRR005482.sff.gz
We need to unzip those files:
arcadia:bceti_assembly/origdata$
gunzip *.gz
And now this directory should look like this
arcadia:bceti_assembly/origdata$
ls -l
-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff
-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff
Now move into the (still empty) data directory:
arcadia:bceti_assembly/origdata$
cd ../data
We will first extract the data from the unpaired experiment (SRR005481), the generated file names should all start with bceti:
arcadia:bceti_assembly/data$
sff_extract -o bceti ../origdata/SRR005481.sff
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.
********************************************************************************
WARNING: weird sequences in file ../origdata/SRR005481.sff

After applying left clips, 307639 sequences (=99%) start with these bases:
TCTCCGTC

This does not look sane.

Countermeasures you *probably* must take:
 1) Make your sequence provider aware of that problem and ask whether this
    can be corrected in the SFF.
 2) If you decide that this is not normal and your sequence provider does
    not react, use the --min_left_clip of sff_extract.
    (Probably '--min_left_clip=13' but you should cross-check that)
********************************************************************************
(Note: I got this on the SRR005481 data set downloaded in October 2008. In the meantime, the sequencing center or NCBI may have corrected the error.)
Wait a minute ... what happened here?
We launched a pretty standard extraction of reads where the whole sequences are extracted and saved in the FASTA and FASTA quality files, with clipping information given in the XML. Additionally, the clipped parts of every read are shown in lower case in the FASTA file.
After two or three minutes, the directory looked like this:
arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 91863124 2008-11-08 17:15 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:15 bceti.fasta.qual
-rw-r--r-- 1 bach users 52197816 2008-11-08 17:15 bceti.xml
In the example above, sff_extract discovered an unusual sequence pattern and gave a (stern) warning: almost all the sequences written to the FASTA file start with exactly the same bases.
Let's have a look at the first 30 bases of the first 20 sequences of the FASTA that was created:
arcadia:bceti_assembly/data$
head -40 bceti.fasta | grep -v ">" | cut -c 1-30
tcagTCTCCGTCGCAATCGCCGCCCCCACA
tcagTCTCCGTCGGCGCTGCCCGCCCGATA
tcagTCTCCGTCGTGGAGGATTACTGGGCG
tcagTCTCCGTCGGCTGTCTGGATCATGAT
tcagTCTCCGTCCTCGCGTTCGATGGTGAC
tcagTCTCCGTCCATCTGTCGGGAACGGAT
tcagTCTCCGTCCGAGCTTCCGATGGCACA
tcagTCTCCGTCAGCCTTTAATGCCGCCGA
tcagTCTCCGTCCTCGAAACCAAGAGCGTG
tcagTCTCCGTCGCAGGCGTTGGCGCGGCG
tcagTCTCCGTCTCAAACAAAGGATTAGAG
tcagTCTCCGTCCTCACCCTGACGGTCGGC
tcagTCTCCGTCTTGTGCGGTTCGATCCGG
tcagTCTCCGTCTGCGGACGGGTATCGCGG
tcagTCTCCGTCTCGTTATGCGCTCGCCAG
tcagTCTCCGTCTCGCATTTTCCAACGCAA
tcagTCTCCGTCCGCTCATATCCTTGTTGA
tcagTCTCCGTCCTGTGCTGGGAAAGCGAA
tcagTCTCCGTCTCGAGCCGGGACAGGCGA
tcagTCTCCGTCGTCGTATCGGGTACGAAC
What you see is the following: the leftmost 4 characters tcag of every read are the last bases of the standard 454 sequencing adaptor A. The fact that they are given in lower case means that they are clipped away in the SFF (which is good).
However, if you look closely, you will see that there is something peculiar: after the adaptor sequence, all reads seem to start with exactly the same sequence TCTCCGTC. This is *not* sane.
This means that the left clip of the reads in the SFF has not been set correctly. The reason for this is probably a wrong value which was used in the 454 data processing pipeline. This seems to be a problem especially when custom sequencing adaptors are used.
In this case, the result is pretty catastrophic: out of the 311201 reads in the SFF, 307639 (98.85%) show this behaviour. We will certainly need to get rid of these first 12 bases.
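By the way, the check sff_extract performs here is easy to reproduce yourself: count how often the same start bases occur after the left clip. The three demo reads below are made up; on real data you would point the pipeline at the extracted FASTA file:

```shell
# count the most frequent 8-base read start; a single dominant entry covering
# nearly all reads is the warning sign described above
printf '>r1\nTCTCCGTCAAAA\n>r2\nTCTCCGTCGGGG\n>r3\nACGTACGTACGT\n' > demo.fasta
grep -v '^>' demo.fasta | cut -c 1-8 | sort | uniq -c | sort -rn | head -1
rm demo.fasta
```

On the demo data this prints the count 2 next to TCTCCGTC, i.e. two of three reads share the same start.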
Now, in cases like these, there are three steps that you really should follow:
Is this something that you expect from the experimental setup? If yes, then all is OK and you don't need to take further action. But I suppose that for 99% of all people, these abnormal sequences are not expected.
Contact. Your. Sequence. Provider! The underlying problem is something that *MUST* be resolved on their side, not on yours. It might be a simple human mistake, but it might very well be a symptom of a deeper problem in their quality assurance. Notify. Them. Now!
In the meantime (or if the sequencing provider does not react), you can use the [--min_left_clip] command line option of sff_extract as suggested in the warning message.
So, to correct for this error, we will redo the extraction of the sequence from the SFF, this time telling sff_extract to set the left clip starting at base 13 at the lowest:
arcadia:bceti_assembly/data$
sff_extract -o bceti --min_left_clip=13 ../origdata/SRR005481.sff
Working on '../origdata/SRR005481.sff':
Converting '../origdata/SRR005481.sff' ... done.
Converted 311201 reads into 311201 sequences.
arcadia:sff_from_ncbi/bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 91863124 2008-11-08 17:31 bceti.fasta
-rw-r--r-- 1 bach users 264238484 2008-11-08 17:31 bceti.fasta.qual
-rw-r--r-- 1 bach users 52509017 2008-11-08 17:31 bceti.xml
This concludes the small intermezzo on how to deal with wrong left clips.
Let's move on to the paired-end data. While I would recommend that you do some kind of data checking when working on your own data, I'll spare you that step for this walkthrough; just believe me that I did it and found nothing really suspicious.
The paired-end protocol of 454 will generate reads which contain the forward and reverse direction in one read, separated by a linker. You have to know the linker sequence! Ask your sequencing provider to give it to you. If standard protocols were used, then the linker sequence for GS20 and FLX will be
>flxlinker GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
while for Titanium data, you need to use two linker sequences
>titlinker1 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >titlinker2 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
In this case, the center apparently used the standard unmodified 454 FLX linker. Put that linker sequence into a FASTA file and copy it to wherever you like ... in this walkthrough I put it into the origdata directory (not the data directory where we currently are).
arcadia:bceti_assembly/data$
cp /from/whereever/your/file/is/linker.fasta ../origdata
arcadia:bceti_assembly/data$
ls -l ../origdata
-rw-r--r-- 1 bach users 53 2008-11-08 17:32 linker.fasta
-rw-r--r-- 1 bach users 544623256 2008-11-08 16:49 SRR005481.sff
-rw-r--r-- 1 bach users 476632488 2008-11-08 16:55 SRR005482.sff
arcadia:bceti_assembly/data$
cat ../origdata/linker.fasta
>flxlinker GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
There's one thing that must be found out yet: what was the size of the paired-end library which was constructed, and what is the estimated standard deviation of the sizes? Normally, you will get this information from your sequence provider (if you didn't decide it for yourself). As we're working from a data set deposited at the NCBI, this information should also be available in the accompanying documentation there. But it isn't.
For this walkthrough, we'll simply take a library size of 3000 and an estimated standard deviation of 900.
Now let's extract the paired end sequences, and this may take eight to ten minutes.
arcadia:bceti_assembly/data$
sff_extract -o bceti -a -l ../origdata/linker.fasta -i "insert_size:3000,insert_stdev:900" ../origdata/SRR005482.sff
Testing whether SSAHA2 is installed and can be launched ... ok.
Working on '../origdata/SRR005482.sff':
Creating temporary file from sequences in '../origdata/SRR005482.sff' ... done.
Searching linker sequences with SSAHA2 (this may take a while) ... ok.
Parsing SSAHA2 result file ... done.
Converting '../origdata/SRR005482.sff' ... done.
Converted 268084 reads into 415327 sequences.
The above text tells you that the conversion process saw 268084 reads in the SFF. After searching for the paired-end linker and removing it, 415327 sequences were created. Obviously, some reads had either no linker or the linker was at the far edge of the read, so that the 'split' resulted in just one sequence.
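Since each linker hit splits one read into two sequences, you can work out how many reads actually contained a usable linker directly from the numbers above:

```shell
# split reads = sequences out - reads in (numbers taken from the sff_extract
# output above); every split read contributes exactly one extra sequence
awk 'BEGIN { print 415327 - 268084, "reads contained a linker and were split" }'
```

So roughly 147000 of the 268084 reads yielded a read pair; the rest stayed unpaired.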
The directory will now look like this:
arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 170346423 2008-11-08 17:55 bceti.fasta
-rw-r--r-- 1 bach users 483048864 2008-11-08 17:55 bceti.fasta.qual
-rw-r--r-- 1 bach users 165413112 2008-11-08 17:55 bceti.xml
We're almost done. As a last step, we will rename the files into a scheme that suits MIRA (we could have used the -s, -q and -x options of sff_extract directly, but I wanted to keep the example straightforward).
arcadia:bceti_assembly/data$
mv bceti.fasta bceti_in.454.fasta
arcadia:bceti_assembly/data$
mv bceti.fasta.qual bceti_in.454.fasta.qual
arcadia:bceti_assembly/data$
mv bceti.xml bceti_traceinfo_in.454.xml
arcadia:bceti_assembly/data$
ls -l
-rw-r--r-- 1 bach users 170346423 2008-11-08 17:55 bceti_in.454.fasta
-rw-r--r-- 1 bach users 483048864 2008-11-08 17:55 bceti_in.454.fasta.qual
-rw-r--r-- 1 bach users 165413112 2008-11-08 17:55 bceti_traceinfo_in.454.xml
That's it.
Preparing an assembly is now just a matter of setting up a directory and linking the input files into that directory.
arcadia:bceti_assembly/data$
cd ../assemblies/
arcadia:bceti_assembly/assemblies$
mkdir arun_08112008
arcadia:bceti_assembly/assemblies$
cd arun_08112008
arcadia:assemblies/arun_08112008$
ln -s ../../data/* .
arcadia:bceti_assembly/assemblies/arun_08112008$
ls -l
lrwxrwxrwx 1 bach users 29 2008-11-08 18:17 bceti_in.454.fasta -> ../../data/bceti_in.454.fasta
lrwxrwxrwx 1 bach users 34 2008-11-08 18:17 bceti_in.454.fasta.qual -> ../../data/bceti_in.454.fasta.qual
lrwxrwxrwx 1 bach users 33 2008-11-08 18:17 bceti_traceinfo_in.454.xml -> ../../data/bceti_traceinfo_in.454.xml
Note
Please consult the corresponding section in the mira_usage document; it contains much more information than this stub.
But basically, after the assembly has finished, you will find four directories. The log directory can be deleted without remorse as it contains logs and a tremendous amount of temporary data (dozens of gigabytes for bigger projects). The info directory has some text files with basic statistics and other informative files. Start by having a look at *_info_assembly.txt, it'll give you a first idea of how the assembly went.
The results directory finally contains the assembly files in different formats, ready to be used for further processing with other tools.
If you used the uniform read distribution option, you will inevitably need to filter your results, as this option produces larger and better alignments, but also more 'debris contigs'. For this, use the convert_project program which is distributed together with the MIRA package.
Also very important when analysing 454 assemblies: screen the small contigs (< 1000 bases) for abnormal behaviour. You wouldn't be the first to have some human DNA contamination in a bacterial sequencing project. Or some herpes virus sequence in a bacterial project. Or some bacterial DNA in a human data set. Check these small contigs for
whether they have a different GC content than the large contigs
whether a BLAST of these sequences against some selected databases brings up hits in other organisms that you certainly were not sequencing.
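A quick, rough way to compute the GC content of contigs for such a comparison is sketched below. It is demonstrated on a one-sequence made-up file; in practice you would point it at your real contig FASTA:

```shell
# overall GC%: count G/C bases against all bases of the non-header lines
printf '>ctg1\nGGCCGGCCAT\n' > demo.fasta
awk '!/^>/ { n += length($0); gc += gsub(/[GCgc]/, "") } \
     END { printf "GC: %.0f%%\n", 100 * gc / n }' demo.fasta
rm demo.fasta
```

A small contig whose GC% is far off the large contigs' average deserves a BLAST check.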
Notes of caution:
this guide is still not finished (and may in parts contain old information regarding read lengths), but it should cover most basic use cases.
you need lots of memory ... ~ 1 to 1.5 GiB per million Solexa reads. Using mira for anything more than a plate of Solexa (~60 to 80 million reads) is probably not a good idea.
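That rule of thumb as plain arithmetic, taking the upper bound of ~1.5 GiB per million reads for a full plate of 60 million reads:

```shell
# 60 million reads * 1.5 GiB per million reads (rule-of-thumb upper bound)
awk 'BEGIN { printf "roughly %.0f GiB of RAM\n", 60 * 1.5 }'
```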
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step by step instructions on how to set up your Solexa data and then perform an assembly, this guide expects you to read at some point in time
the MIRA reference manual file to look up some command line options as well as general information on what tags MIRA uses in assemblies, files it generates etc.pp
the short usage introduction to MIRA3 so that you have a basic knowledge on how to set up projects in mira for Sanger sequencing projects.
Solexa reads are great for mapping assemblies. I simply love them as you can easily spot differences in mutant organisms ... or boost the quality of a newly sequenced genome to Q60.
Regarding de-novo assemblies ... well, from an assembler's point of view, short reads are a catastrophe. Be they Helicos (currently 25bp), ABI SOLiD (currently 35-50bp) or Solexa (36-80bp). This has two reasons:
Repeats. The problem of repetitive sequences (e.g. rRNA stretches in bacteria) gets worse the shorter the read lengths get.
Amount of data. As mira is at heart an assembler designed to resolve difficult repetitive problems as they occur in Sanger and 454 reads, it drags along quite a lot of ancillary information which is useless in Solexa assemblies ... but still eats away memory.
Things look better for the now available 'longer' Solexa reads. Starting with a length of 75bp and paired-end data, de-novo for bacteria is not that bad at all. The first Solexas with a length of ~110 bases are appearing in public, and these are about as good for de-novo as the first 454 GS20 reads were.
Here's the rule of thumb I use: the longer, the better. If you have to pay a bit more to get longer reads (e.g. Solexa 76mers instead of 36mers), go get the longer reads. With these, the results you generate are way(!) better than with 36 or even 50mers ... both in mapping and de-novo. Don't try to save a couple of hundred bucks in sequencing, you'll pay dearly afterwards in assembly.
Note: This section contains things I've seen in the past and simply jotted down. You may have different observations.
For 36mers and the MIRA proposed end clipping, even with the old pipeline I get about 90 to 95% of reads matching a reference without a single error. For 72mers, the number is approximately 5% lower, for 100mers another 5% less. Still, these are great numbers.
The new base calling pipeline (1.4 or 2.4?) rolled out by Illumina in Q1/Q2 2009 typically yields 20-50% more data from the very same images. Furthermore, the base calling is way better than in the old pipeline. For Solexa 76 mers, after trimming I get only 1% real junk, between 85 and 90% of the reads are matching to a reference without a single error. Of the remaining reads, roughly 50% have one error, 25% have two errors, 12.5% have three errors etc.
It is worthwhile to re-analyse your old data if the images are still around.
Long homopolymers (stretches of identical bases in reads) can be a slight problem for Solexa. However, it must be noted that this is a problem for all sequencing technologies on the market so far (Sanger, Solexa, 454). Furthermore, the problem is much less pronounced in Solexa than in 454 data: in Solexa, the first problems may appear in stretches of 9 to 10 bases, while in 454 a stretch of 3 to 4 bases may already be problematic in some reads.
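If you want to check how your reads look in that respect, finding the longest homopolymer run per read is a one-liner. The demo sequence below is made up and contains a run of 10 T's, i.e. around the length where Solexa base calling may start to struggle:

```shell
# track the current run of identical bases and remember the maximum
echo "ACGTTTTTTTTTTAC" | awk '{ max = run = 1;
  for (i = 2; i <= length($0); i++) {
    run = (substr($0, i, 1) == substr($0, i - 1, 1)) ? run + 1 : 1;
    if (run > max) max = run }
  print "longest run:", max }'
```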
Another problem is the GGCxG or even GGC motif in the 5' to 3' direction of reads. This one is particularly annoying, and it took me quite a while to circumvent in MIRA the problems it causes.
Simply put: at some places in a genome, base calling after a GGCxG or GGC motif is particularly error prone, and the number of reads without errors declines markedly. Repeated GGC motifs worsen the situation. The following screenshots of a mapping assembly illustrate this.
The first example is a GGCxG motif (in the form of a GGCTG) occurring in approximately one third of the reads at the shown position. Note that all but one read with this problem are in the same (plus) direction.
The next two screenshots show the GGC motif, once for forward direction reads and once for reverse direction reads:
Places in the genome that have GGCGGC.....GCCGCC (a motif, perhaps even repeated, then some bases and then an inverted motif) almost always have a very, very low number of good reads. Especially when the motif is GGCxG.
Things get especially difficult when these motifs occur at sites where users may have a genuine interest. The following example is a screenshot from the Lenski data (see walk-through below) where a simple mapping reveals an anomaly which -- in reality -- is an IS insertion (see http://www.nature.com/nature/journal/v461/n7268/fig_tab/nature08480_F1.html) but could also look like a GGCxG motif in forward direction (GGCCG) and at the same time a GGC motif in reverse direction:
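To eyeball such regions in your own data, you can simply pull out reads containing the motif (GGC, any base, then G). The two demo reads below are made up; on real data you would pipe in the read sequences of the region in question:

```shell
# grep -E 'GGC.G' matches the GGCxG motif; only the first demo read contains it
printf 'ACGGCTGACGT\nACGTACGTACG\n' | grep -E 'GGC.G'
```

Remember to check the reverse complement direction as well; the motif may sit on the other strand.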
Here I'm recycling a few slides from a couple of talks I held in 2010.
Things used to be so nice and easy with the early Solexa data I worked with (36 and 44mers) in late 2007 / early 2008. When sample taking was done right -- e.g. for bacteria: in stationary phase -- and the sequencing lab did a good job, the read coverage of the genome was almost even. I did see a few papers claiming to see non-trivial GC bias back then, but after having analysed the data I worked with I dismissed them as "not relevant for my use cases." Have a look at the following figure showing exemplarily the coverage of a 45% GC bacterium in 2008:
Figure 6.5. Example for no GC coverage bias in 2008 Solexa data. Apart from a slight smile shape of the coverage -- indicating the sample taking was not 100% in stationary phase of the bacterial culture -- everything looks pretty nice: the average coverage is at 27x, and when looking at potential genome duplications at twice the coverage (54x), there's nothing apart a single peak (which turned out to be a problem in a rRNA region).
Things changed starting sometime in Q3 2009; at least that's when I got some data which made me notice a problem. Have a look at the following figure, which shows exactly the same organism as in the figure above (bacterium, 45% GC):
Figure 6.6. Example for GC coverage bias starting Q3 2009 in Solexa data. There's no smile shape anymore -- the people in the lab learned to pay attention to sampling in 100% stationary phase -- but something else is extremely disconcerting: the average coverage is at 33x, and when looking at potential genome duplications at twice the coverage (66x), there are several dozen peaks crossing the 66x threshold over several kilobases (in one case over 200 Kb) all over the genome. As if several small genome duplications happened.
By the way, the figures above are just examples: I saw over a dozen sequencing projects in 2008 without GC bias and several dozen in 2009 / 2010 with GC bias.
Checking the potential genome duplication sites, they all looked "clean", i.e., the typical genome insertion markers are missing. Poking around at possible explanations, I looked at GC content of those parts in the genome ... and there was the explanation:
Figure 6.7. Example for GC coverage bias, direct comparison 2008 / 2010 data. The bug has 45% average GC, areas with above average read coverage in 2010 data turn out to be lower GC: around 33 to 36%. The effect is also noticeable in the 2008 data, but barely so.
Why the GC bias suddenly became so strong is unknown to me. The people in the lab have been using the same protocol for several years to extract the DNA, and the sequencing providers claim to always use the standard Illumina protocols.
But obviously something must have changed. Current ideas about possible reasons include
If anyone has a good explanation or, better, a recipe to go back to the nice 2008 coverage distribution ... feel free to mail me.
This part will show you, step by step, how to get your data together for a simple mapping assembly.
I'll make up an example using an imaginary bacterium: Bacillus chocorafoliensis (or short: Bchoc).
In this example, we assume you have two strains: a wild type strain of Bchoc_wt and a mutant which you perhaps got from mutagenesis or other means. Let's imagine that this mutant needs more time to eliminate a given amount of chocolate, so we call the mutant Bchoc_se ... SE for 'slow eater'.
You want to know which mutations might be responsible for the observed behaviour. Assume the genome of Bchoc_wt is available to you as it was published (or you previously sequenced it), so you resequenced Bchoc_se with Solexa to examine the mutations.
You need to create (or get from your sequencing provider) the sequencing data in either FASTQ or FASTA + FASTA quality format. The following walkthrough uses what most people nowadays get: FASTQ.
Put the FASTQ data into an empty directory and rename the file so that it looks like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
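As a quick sanity check on such a file, you can count the reads it contains: a FASTQ record is exactly four lines (header, sequence, '+' separator, qualities), so the read count is the line count divided by four. A self-contained miniature demonstration on a made-up two-read file:

```shell
# A FASTQ record spans four lines, so reads = lines / 4. Demo data only.
printf '@r1\nACGT\n+\nIIII\n@r2\nTTAA\n+\nIIII\n' > demo_in.solexa.fastq
echo $(( $(wc -l < demo_in.solexa.fastq) / 4 ))   # prints 2
```

Run the same `wc -l` arithmetic on your real bchocse_in.solexa.fastq to verify it matches what your sequencing provider promised.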
The reference sequence (the backbone) can be in a number of different formats: FASTA, GenBank, CAF. The latter two have the advantage of being able to carry additional information like, e.g., annotation. In this example, we will use a GenBank file like the ones one can download from the NCBI. So, let's assume that our wild type strain is in the following file: NC_someNCBInumber.gbk. Copy this file to the directory (you may also set a link), renaming it to bchocse_backbone_in.gbf.
arcadia:/path/to/myProject$
cp /somewhere/NC_someNCBInumber.gbk bchocse_backbone_in.gbf
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 6543511 2008-04-08 23:53 bchocse_backbone_in.gbf
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is an example of what I use when mapping onto a reference sequence in GenBank format:
arcadia:/path/to/myProject$
mira --project=bchocse --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:bsn=bchoc_wt:bft=gbf:bbq=30 SOLEXA_SETTINGS -SB:ads=yes:dsn=bchocse >&log_assembly.txt
Note 1: The above command has been split across multiple lines for better overview but should be entered as one line.
Note 2: Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome; in one pass; the name of the backbone strain is 'bchoc_wt'; the file type containing the backbone is a GenBank file; the base qualities for the backbone are to be assumed Q30; for Solexa data: assign a default strain name to reads which have no ancillary data with strain info, and that default strain name should be 'bchocse'.
Note 3: For a bacterial project having a backbone of ~4 megabases and ~4.5 million Solexa 36mers, MIRA needs some ~21 minutes on my development machine. A yeast project with a genome of ~20 megabases and ~20 million 36mers needs 3.5 hours and 28 GiB RAM.
For this example - if you followed the walk-through on how to prepare the data - the only things you might want to adapt the first time are the following options:
--project (for naming your assembly project)
-SB:bsn to give the backbone strain (your reference strain) another name
-SB:bft to load the backbone sequence from another file type, say, a FASTA
-SB:dsn to give the Solexa reads another strain name
Of course, you are free to change any option via the extended parameters, but this will be the topic of another FAQ.
MIRA will make use of ancillary information when present. The strain name is one such piece of ancillary information. That is, we can tell MIRA the strain of each read we use in the assembly. In the example above, this information was given on the command line as all the reads to be mapped had the same strain information. But what to do if one wants to map reads from several strains?
We could generate a TRACEINFO XML file with all bells and whistles, but for strain data there's an easier way: the straindata file. It's a simple key-value file, one line per entry, with the name of the read as key (first entry in line) and, separated by a blank, the name of the strain as value (second entry in line). E.g.:
1_1_207_113 strain1
1_1_61_711 strain1
1_1_182_374 strain2
...
2_1_13_654 strain2
...
Etcetera. You will obviously replace 'strain1' and 'strain2' with your strain names.
This file can be generated quickly and automatically, by extracting the read names from the FASTQ file and rewriting them a little bit. Here's how:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq
arcadia:/path/to/myProject$
grep "^@" bchocse_in.solexa.fastq | sed -e 's/@//' | cut -f 1 | cut -f 1 -d ' ' | sed -e 's/$/ bchocse/' > bchocse_straindata_in.txt
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 494282343 2008-03-28 22:11 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 134822451 2008-03-28 22:13 bchocse_straindata_in.txt
Note 1: The above command has been split across multiple lines for better overview but should be entered as one line.
Note 2: For larger files, this can run for a minute or two.
Note 3: As you can also assemble sequences from more than one strain, the read names in the straindata file may, of course, be assigned different strain names.
This creates the needed data in the file bchocse_straindata_in.txt (well, it's one way to do it, feel free to use whatever suits you best).
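To see what the grep/sed pipeline actually produces, here is a self-contained miniature run on a made-up two-read FASTQ file (the read names are invented; the pipeline itself is the one shown above):

```shell
# Two fake reads; quality lines here deliberately do not start with '@'.
printf '@1_1_207_113\nACGT\n+\nIIII\n@1_1_61_711\nTTGC\n+\nIIII\n' > demo_in.solexa.fastq
# Same pipeline as above: take headers, strip '@', keep the name, append strain.
grep "^@" demo_in.solexa.fastq | sed -e 's/@//' | cut -f 1 | cut -f 1 -d ' ' \
  | sed -e 's/$/ bchocse/' > demo_straindata_in.txt
cat demo_straindata_in.txt
```

The resulting file contains "1_1_207_113 bchocse" and "1_1_61_711 bchocse". One caveat: `grep "^@"` relies on quality lines not starting with '@'; this holds for many Illumina pipelines but is not guaranteed by the FASTQ format, so double-check on your own data.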
When using paired-end data, you must decide whether you want to use the MIRA feature of creating long 'coverage equivalent reads' (CERs), which saves a lot of memory (both in the assembler and later on in an assembly editor) but loses the paired-end information, or whether you want to keep the paired-end information at the expense of larger memory requirements, both in MIRA and in assembly editors afterwards.
The Illumina pipeline generally gives you two files for paired-end data: a project-1.fastq and a project-2.fastq. The first file contains the first read of each read-pair, the second file the second read. Depending on the preprocessing pipeline of your sequencing provider, the names of the reads can either be the very same in both files or already have a /1 or /2 appended.
Note: For running MIRA, you must concatenate all sequence input files into one file.
If the read names do not follow the /1 and /2 scheme, you must obviously rename them in the process. A little sed command can do this automatically for you. Assuming your reads all have the prefix SRR_something_, the following line appends /1 to all lines which begin with @SRR_something_:
arcadia:/path/to/myProject$
sed -e 's/^@SRR_something_/&\/1/' input.fastq >output.fastq
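If your read names have no common prefix that lends itself to such a sed pattern, an alternative sketch (an assumption of mine, not a MIRA-provided recipe) is to append /1 to every header line, i.e., every fourth line starting from the first. The demo input below is made up:

```shell
# One fake FASTQ record with a description after the read name.
printf '@SRRX.42 some description\nACGT\n+\nIIII\n' > demo1.fastq
# Header lines are lines 1, 5, 9, ...; strip any description, then append /1.
awk 'NR % 4 == 1 { sub(/ .*/, ""); $0 = $0 "/1" } { print }' demo1.fastq > demo1.renamed.fastq
head -1 demo1.renamed.fastq   # prints @SRRX.42/1
```

Run the same command with /2 on the reverse-read file before concatenating both into one input file. This assumes four-line FASTQ records (no wrapped sequence lines), which is the case for Illumina data.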
If you don't care about the paired-end information, you can start the mapping assembly exactly like an assembly for data without paired-end info (see section above).
In case you want to keep the paired-end information, here's the command line (again an example when mapping against a GenBank reference file, assuming that the library insert size is ~500 bases):
arcadia:/path/to/myProject$
mira --project=bchocse --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:lsd=yes:bsn=bchoc_wt:bft=gbf:bbq=30 SOLEXA_SETTINGS -CO:msr=no -GE:uti=no:tismin=250:tismax=750 -SB:ads=yes:dsn=bchocse >&log_assembly.txt
Note 1: For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., with '/1' appended to the name of one read and '/2' to the name of the other. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.
Note 2: Please look up the parameters used in the main manual. The ones above basically say: make an accurate mapping of Solexa reads against a genome, in one pass; load additional strain data; the name of the backbone is 'bchoc_wt'; the file type containing the backbone is a GenBank file; the base qualities for the backbone are to be assumed Q30. Additionally, only for Solexa reads: do not merge short reads into the contig, use template size information, and set the minimum and maximum expected distances to 250 and 750 respectively.
Note 3: You will want to use values other than 250 and 750 if your Solexa paired-end library does not have insert sizes of approximately 500 bases.
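The 250/750 pair corresponds to a simple rule of thumb I am assuming here (insert size ± 50%, which reproduces the values used above for a ~500 base library); if your library spread is known to be tighter or wider, adjust accordingly. A two-line shell calculation gives the bounds:

```shell
# Hypothetical helper: derive tismin/tismax from the insert size using
# the assumed "insert size +/- 50%" rule of thumb.
insert=500
tismin=$(( insert / 2 ))
tismax=$(( insert * 3 / 2 ))
echo "-GE:uti=no:tismin=${tismin}:tismax=${tismax}"   # prints -GE:uti=no:tismin=250:tismax=750
```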
Comparing this command line with a command line for unpaired data, a few parameters were added in the section for Solexa data:
-CO:msr=no tells MIRA not to merge reads that are 100% identical to the backbone. This also allows MIRA to keep the template information for the reads.
-GE:uti=no switches off checking of template sizes when inserting reads into the backbone. At first glance this might seem counter-intuitive, but it's absolutely necessary to spot, e.g., genome re-arrangements or indels in data analysis after the assembly.
The reason is that if template size checking were on, the following would happen at, e.g. sites of re-arrangement: MIRA would map the first read of a read-pair without problem. However, it would very probably reject the second read because it would not map at the specified distance from its partner. Therefore, in mapping assemblies with paired-end data, checking of the template size must be switched off.
-GE:tismin:tismax were set to give the minimum and maximum distance paired-end reads may be away from each other. Though this information is not used by MIRA in the assembly itself, it is stored in the result files and can be used afterwards by analysis programs which search for genome re-arrangements.
Note: for other influencing factors you might want to change depending on the size of your Solexa reads, see the section above on mapping of unpaired data.
This section just gives a short overview of the tags you might find interesting. For more information, especially on how to configure gap4 or consed, please consult the mira usage document and the mira manual.
In file types that allow tags (CAF, MAF, ACE), SNPs and other interesting features will be marked by MIRA with a number of tags. The following sections give a brief overview. For a description of what the tags are (SROc, WRMc etc.), please read up the section "Tags used in the assembly by MIRA and EdIt" in the main manual.
Note: Screenshots in this section are taken from the walk-through with the Lenski data (see below).
the SROc tag will point to most SNPs. Should you assemble sequences of more than one strain (I cannot really recommend such a strategy), you might also encounter SIOc and SAOc tags.
the WRMc tags might sometimes point to SNPs involving indels of one or two bases.
Large deletions: the MCVc tags point to deletions in the resequenced data, where no read is covering the reference genome.
Figure 6.10. "MCVc" tag (dark red stretch in figure) showing a genome deletion in Solexa mapping assembly.
Insertions, small deletions and re-arrangements: these are harder to spot. In unpaired data sets they can be found by looking at clusters of SROc, SRMc, WRMc, and / or UNSc tags.
More massive occurrences of these tags lead to a rather colourful display in finishing programs, which is why these clusters are also sometimes called Xmas-trees.
In sets with paired-end data, post-processing software (or alignment viewers) can use the read-pair information to guide you to these sites (MIRA doesn't set tags at the moment).
the UNSc tag points to areas where the consensus algorithm had trouble choosing a base. This happens in low coverage areas, at places of insertions (compared to the reference genome), or sometimes also in places where repeats with a few bases difference are present. Often enough, these tags are in areas with sequences problematic for the Solexa sequencing technology like, e.g., a GGCxG or even GGC motif in the reads.
the SRMc tag points to places where repeats with a few bases difference are present. Here too, sequence stretches problematic for the Solexa technology are likely to have caused base calling errors and, subsequently, the setting of this tag.
Biologists are not really interested in SNP coordinates, and why should they be? They're more interested in where SNPs are, how good they are, which genes or other elements they hit, whether they have an effect on a protein sequence, whether they may be important, etc. For organisms without intron/exon structure or splice variants, MIRA can generate pretty comprehensive tables and files if an annotated GenBank file was used as reference and strain information was given to MIRA during the assembly.
Well, MIRA does all that automatically for you if the reference sequence you gave was annotated.
For this, convert_project should be used with the asnp format as target and a CAF file as input:
$
convert_project -f caf -t asnp input.caf output
Note that it is strongly suggested to perform a quick manual cleanup of the assembly prior to this: in rare cases (mainly at sites of small indels of one or two bases), MIRA will not tag SNPs with a SNP tag (SROc, SAOc or SIOc) but will be fooled into setting a tag denoting unsure positions (UNSc). This can be quickly corrected manually. See further down in this manual in the section on post-processing.
After conversion, you will have four files in the directory which you can all drag-and-drop into spreadsheet applications like OpenOffice Calc or Excel.
The files should be pretty self-explanatory, here's just a short overview:
output_info_snplist.txt is a simple list of the SNPs, with their positions relative to the reference sequence (in bases and map degrees on the genome) as well as the GenBank features they hit.
output_info_featureanalysis.txt is a much extended version of the list above. It puts the SNPs into the context of the features (proteins, genes, RNAs, etc.) and gives a nice list, SNP by SNP, of what might cause bigger changes in proteins.
output_info_featuresummary.txt looks at the changes (SNPs, indels) the other way round. It gives an excellent overview of which features (genes, proteins, RNAs, intergenic regions) you should investigate.
There's one column (named 'interesting') which pretty much summarises everything you need into three categories: yes, no, and perhaps. 'Yes' is set if indels were detected, an amino acid changed, a start or stop codon changed, or for SNPs in intergenic regions and RNAs. 'Perhaps' is set for SNPs in proteins that change a codon, but not an amino acid (silent SNPs). 'No' is set if no SNP is hitting a feature.
output_info_featuresequences.txt simply gives the sequences of each feature of the reference sequence and of the resequenced strain.
I've come to realise that people who don't handle data from NextGen sequencing technologies on a regular basis (e.g., many biologists) don't want to be bothered with learning to handle specialised programs to have a look at their resequenced strains. Be it because they don't have time to learn how to use a new program or because their desktop is not strong enough (CPU, memory) to handle the data sets.
Something even biologists know how to operate is a web browser. Therefore, convert_project has the option to load a CAF file of a mapping assembly and output to HTML those areas which are interesting to biologists. It uses the tags SROc, SAOc, SIOc and MCVc and outputs the surrounding alignment of these areas together with a nice overview and links to jump from one position to the previous or next.
This is done with the '-t hsnp' option of convert_project:
$
convert_project -f caf -t hsnp input.caf output
Note: I recommend doing this only if the resequenced strain is a very close relative of the reference genome, else the HTML gets pretty big. But for a couple of hundred SNPs it works great.
convert_project can also dump a coverage file in WIG format (using '-t wig'). This comes pretty handy for searching genome deletions or duplications in programs like the Affymetrix Integrated Genome Browser (IGB, see http://igb.bioviz.org/).
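A quick command-line scan of such a coverage file is also possible. The sketch below assumes a fixedStep WIG track with step 1 and flags positions at or above twice an assumed 30x average coverage (the threshold and the miniature track are made up; the header details of real MIRA WIG output may differ, so adapt as needed):

```shell
# Made-up miniature fixedStep coverage track.
cat > demo.wig <<'EOF'
fixedStep chrom=chr1 start=1 step=1
28
31
65
70
29
EOF
# Print positions whose coverage reaches 2x the assumed 30x average.
awk '/^fixedStep/ { pos = 0; next } { pos++; if ($1 >= 60) print pos, $1 }' demo.wig
```

On the demo track this reports positions 3 and 4 (values 65 and 70), i.e., exactly the kind of above-2x stretches that would hint at a duplication (or, as discussed earlier, at GC bias).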
We're going to use data published by Richard Lenski in his great paper "Genome evolution and adaptation in a long-term experiment with Escherichia coli". This shows how MIRA finds all mutations between two strains and how one would need just a few minutes to know which genes are affected.
Note: All steps described in this walkthrough are present in ready-to-be-run scripts in the solexa3_lenski demo directory of the MIRA package.
Note: This walkthrough takes a few detours which are not really necessary, but they show how things can be done: it reduces the number of reads, it creates a strain data file, etc. Actually, the whole demo could be reduced to two steps: downloading the data (naming it correctly) and starting the assembly with a couple of parameters.
We'll use the reference genome E.coli B REL606 to map one of the strains from the paper. For mapping, I picked strain REL8593A more or less at random. All the data needed is fortunately at the NCBI, let's go and grab it:
the NCBI has REL606 named NC_012967. We'll use the RefSeq version and the GenBank formatted file you can download from ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_B_REL606/NC_012967.gbk
the Solexa re-sequencing data you can get from ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX012/SRX012992/. Download both FASTQ files, SRR030257_1.fastq.gz and SRR030257_2.fastq.gz.
If you want more info regarding these data sets, have a look at http://www.ncbi.nlm.nih.gov/sra/?db=sra&term=SRX012992&report=full
In this section we will setup the directory structure for the assembly and pre-process the data so that MIRA can start right away.
Let's start with setting up a directory structure. Remember: you can set up the data almost any way you like, this is just how I do things.
I normally create a project directory with three sub-directories: origdata, data, and assemblies. In origdata I put the files exactly as I got them from the sequencing or data provider, without touching them and even removing write permissions from these files so that they cannot be tampered with. After that, I pre-process them and put the pre-processed files into data. Pre-processing can be a lot of things, starting from having to re-format the sequences, or renaming them, perhaps also doing clips, etc. Finally, I use these pre-processed data in one or more assembly runs in the assemblies directory, perhaps trying out different assembly options.
arcadia:/some/path/$
mkdir lenskitest
arcadia:/some/path/$
cd lenskitest
arcadia:/some/path/lenskitest$
mkdir data origdata assemblies
arcadia:/some/path/lenskitest$
ls -l
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 assemblies
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 data
drwxr-xr-x 2 bach bach 4096 2009-12-06 16:06 origdata
Now copy the files you just downloaded into the directory origdata.
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/SRR030257_1.fastq.gz origdata
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/SRR030257_2.fastq.gz origdata
arcadia:/some/path/lenskitest$
cp /wherever/the/files/are/NC_012967.gbk origdata
arcadia:/some/path/lenskitest$
ls -l origdata
-rw-r--r-- 1 bach bach 10543139 2009-12-06 16:38 NC_012967.gbk
-rw-r--r-- 1 bach bach 158807975 2009-12-06 15:15 SRR030257_1.fastq.gz
-rw-r--r-- 1 bach bach 157595587 2009-12-06 15:21 SRR030257_2.fastq.gz
Great, let's preprocess the data. For this you must know a few things:
the standard Illumina naming scheme for Solexa paired-end reads is to append /1 to forward read names and /2 to reverse read names. The reads are normally put into at least two different files (one for forward, one for reverse). Now, the Solexa data stored in the Short Read Archive at the NCBI also has forward and reverse files for paired-end Solexas. That's OK. What's a bit less good is that the read names there DO NOT have /1 appended to the names of forward reads, or /2 to the names of reverse reads. The forward and reverse reads in both files are just named exactly the same. We'll need to fix that.
while Sanger and 454 reads should be preprocessed (clipping sequencing vectors, perhaps quality clipping, etc.), reads from Solexa do not need to be. Some people perform quality clipping or clipping of reads with too many 'N's in the sequence, but this is not needed when using MIRA. In fact, MIRA will perform everything needed for Solexa reads itself and will generally do a much better job, as the clipping performed is independent of the Solexa quality values (which are not always the most trustworthy ones).
for a mapping assembly, it's good to give the strain name of the backbone and the strain name for the reads being mapped against it. The former can be done via the command line, the latter is done for each read individually in a key-value file (the straindata file).
So, to pre-process the data, we will need to
put the reads of the NCBI forward and reverse pairs into one file
append /1 to the names of forward reads, and /2 to the names of reverse reads
create a straindata file for MIRA
To ease things for you, I've prepared a small script which will do everything for you: copy and rename the reads as well as creating strain names. Note that it's a small part of a more general script which I use to sometimes sample subsets of large data sets, but the Lenski data set is small enough that everything is taken.
Create a file prepdata.sh in the directory data and copy-paste the following into it:
######################################################################
#######
####### Prepare paired-end Solexa downloaded from NCBI
#######
######################################################################

# srrname: is the SRR name as downloaded from NCBI SRA
# numreads: maximum number of forward (and reverse) reads to take from
#           each file. Just to avoid bacterial projects with a coverage
#           of 200 or so.
# strainname: name of the strain which was re-sequenced

srrname="SRR030257"
numreads=5000000
strainname="REL8593A"

################################

numlines=$((4*${numreads}))

# put "/1" Solexa reads into file
echo "Copying ${numreads} reads from _1 (forward reads)"
zcat ../origdata/${srrname}_1.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/1/' >${strainname}-${numreads}_in.solexa.fastq

# put "/2" Solexa reads into file
echo "Copying ${numreads} reads from _2 (reverse reads)"
zcat ../origdata/${srrname}_2.fastq.gz | head -${numlines} | sed -e 's/SRR[0-9.]*/&\/2/' >>${strainname}-${numreads}_in.solexa.fastq

# make file with strainnames
echo "Creating file with strain names for copied reads (this may take a while)."
grep "@SRR" ${strainname}-${numreads}_in.solexa.fastq | cut -f 1 -d ' ' | sed -e 's/@//' -e "s/$/ ${strainname}/" >>${strainname}-${numreads}_straindata_in.txt
Now, let's create the needed data:
arcadia:/some/path/lenskitest$
cd data
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
arcadia:/some/path/lenskitest/data$
sh prepdata.sh
Copying 5000000 reads from _1 (forward reads)
Copying 5000000 reads from _2 (reverse reads)
Creating file with strain names for copied reads (this may take a while).
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach 218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt
Last step, just for the sake of completeness, link in the GenBank formatted file of the reference strain, giving it the same base name so that everything is nicely set up for MIRA.
arcadia:/some/path/lenskitest/data$
ln -s ../origdata/NC_012967.gbk REL8593A-5000000_backbone_in.gbf
arcadia:/some/path/lenskitest/data$
ls -l
-rw-r--r-- 1 bach bach 1349 2009-12-06 17:05 prepdata.sh
lrwxrwxrwx 1 bach bach 25 2009-12-06 16:39 REL8593A-5000000_backbone_in.gbf -> ../origdata/NC_012967.gbk
-rw-r--r-- 1 bach bach 1553532192 2009-12-06 15:36 REL8593A-5000000_in.solexa.fastq
-rw-r--r-- 1 bach bach 218188232 2009-12-06 15:36 REL8593A-5000000_straindata_in.txt
arcadia:/some/path/lenskitest/data$
cd ..
arcadia:/some/path/lenskitest$
Perfect, we're ready to start assemblies.
arcadia:/some/path/lenskitest$
cd assemblies
arcadia:/some/path/lenskitest/assemblies$
mkdir 1sttest
arcadia:/some/path/lenskitest/assemblies$
cd 1sttest
arcadia:/some/path/lenskitest/assemblies/1sttest$
lndir ../../data
arcadia:/some/path/lenskitest/assemblies/1sttest$
ls -l
lrwxrwxrwx 1 bach bach 22 2009-12-06 17:18 prepdata.sh -> ../../data/prepdata.sh
lrwxrwxrwx 1 bach bach 43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach 43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach 45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt
Oooops, we don't need the link prepdata.sh here, just delete it.
arcadia:/some/path/lenskitest/assemblies/1sttest$
rm prepdata.sh
Perfect. Now then, start a simple mapping assembly:
arcadia:/some/path/lenskitest/assemblies/1sttest$
mira --fastq --project=REL8593A-5000000 --job=mapping,genome,accurate,solexa -SB:lsd=yes:bsn=ECO_B_REL606:bft=gbf >&log_assembly.txt
Note 1: The above command has been split across multiple lines for better overview but should be entered as one line. It basically says: load all data in FASTQ format; the project name is REL8593A-5000000 (and therefore all input and output files will have this prefix by default if not chosen otherwise); we want an accurate mapping of Solexa reads against a genome; load strain data from a separate strain file ([-SB:lsd=yes]); the strain name of the reference sequence is 'ECO_B_REL606' ([-SB:bsn=ECO_B_REL606]) and the file type containing the reference sequence is in GenBank format ([-SB:bft=gbf]). Last but not least, redirect the progress output of the assembler to a file named log_assembly.txt.
Note 2: The above assembly takes approximately 35 minutes on my computer (i7 940 with 12 GB RAM) when using 4 threads (I have '-GE:not=4' additionally). It may be faster or slower on your computer.
Note 3: You will need some 10.5 GB of RAM to get through this. You might get away with a bit less RAM by using swap, but less than 8 GB RAM is not recommended.
Let's have a look at the directory now:
arcadia:/some/path/lenskitest/assemblies/1sttest$
ls -l
-rw-r--r-- 1 bach bach 1463331186 2010-01-27 20:41 log_assembly.txt
drwxr-xr-x 6 bach bach 4096 2010-01-27 20:04 REL8593A-5000000_assembly
lrwxrwxrwx 1 bach bach 43 2009-12-06 16:40 REL8593A-5000000_backbone_in.gbf -> ../../data/REL8593A-5000000_backbone_in.gbf
lrwxrwxrwx 1 bach bach 43 2009-12-06 15:39 REL8593A-5000000_in.solexa.fastq -> ../../data/REL8593A-5000000_in.solexa.fastq
lrwxrwxrwx 1 bach bach 45 2009-12-06 15:39 REL8593A-5000000_straindata_in.txt -> ../../data/REL8593A-5000000_straindata_in.txt
Not much has changed. All files created by MIRA will be in the REL8593A-5000000_assembly directory. Going one level down, you'll see four sub-directories:
arcadia:/some/path/lenskitest/assemblies/1sttest$
cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$
ls -l
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:29 REL8593A-5000000_d_chkpt
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:40 REL8593A-5000000_d_info
drwxr-xr-x 2 bach bach 4096 2010-01-27 20:30 REL8593A-5000000_d_log
drwxr-xr-x 2 bach bach 4096 2010-01-27 21:19 REL8593A-5000000_d_results
You can safely delete the log and the chkpt directories, in this walkthrough they are not needed anymore.
Results will be in a sub-directory created by MIRA. Let's go there and have a look.
arcadia:/some/path/lenskitest/assemblies/1sttest$
cd REL8593A-5000000_assembly
arcadia:.../1sttest/REL8593A-5000000_assembly$
cd REL8593A-5000000_d_results
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach 972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach 569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach 4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach 14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach 4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach 14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach 13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
You can see that MIRA has created output in many different formats suited for a number of different applications. Most commonly known will be ACE and CAF for their use in finishing programs (e.g. gap4 and consed).
In a different directory (the info directory) there are also files containing all sorts of statistics and useful information.
arcadia:.../REL8593A-5000000_d_results$
cd ../REL8593A-5000000_d_info/
arcadia:.../REL8593A-5000000_d_info$
ls -l
-rw-r--r-- 1 bach bach 2256 2010-01-27 20:40 REL8593A-5000000_info_assembly.txt
-rw-r--r-- 1 bach bach 124 2010-01-27 20:04 REL8593A-5000000_info_callparameters.txt
-rw-r--r-- 1 bach bach 37513 2010-01-27 20:37 REL8593A-5000000_info_consensustaglist.txt
-rw-r--r-- 1 bach bach 28522692 2010-01-27 20:37 REL8593A-5000000_info_contigreadlist.txt
-rw-r--r-- 1 bach bach 176 2010-01-27 20:37 REL8593A-5000000_info_contigstats.txt
-rw-r--r-- 1 bach bach 15359354 2010-01-27 20:40 REL8593A-5000000_info_debrislist.txt
-rw-r--r-- 1 bach bach 45802751 2010-01-27 20:37 REL8593A-5000000_info_readtaglist.txt
Just have a look at them to get a feeling for what they show. You'll find more information regarding these files in the main manual of MIRA. At the moment, let's just make a quick assessment of the differences between the Lenski reference strain and the REL8593A strain by counting how many SNPs MIRA thinks there are (marked with SROc tags in the consensus):
arcadia:.../REL8593A-5000000_d_info$
grep -c SROc REL8593A-5000000_info_consensustaglist.txt
102
102 positions are marked with such a tag. You will later see that this is an overestimation due to several insertion sites and deletions, but it's a good first approximation.
Let's count how many potential deletion sites REL8593A has in comparison to the reference strain:
arcadia:.../REL8593A-5000000_d_info$
grep -c MCVc REL8593A-5000000_info_consensustaglist.txt
48
This number, too, is a slight overestimation due to cross-contamination with a sequenced strain which did not have these deletions, but it's also a good first approximation.
To have a look at your project in gap4, use the caf2gap program (you can get it at the Sanger Centre), and then gap4:
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach 972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach 569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach 4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach 14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach 4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach 14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach 13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
arcadia:.../REL8593A-5000000_d_results$
caf2gap -project REL8593A -ace REL8593A-5000000_out.caf >&/dev/null
arcadia:.../REL8593A-5000000_d_results$
ls -l
-rw-r--r-- 1 bach bach 1233494048 2010-01-27 20:43 REL8593A.0
-rw-r--r-- 1 bach bach  233589448 2010-01-27 20:43 REL8593A.0.aux
-rw-r--r-- 1 bach bach  455087340 2010-01-27 20:40 REL8593A-5000000_out.ace
-rw-r--r-- 1 bach bach  972479972 2010-01-27 20:38 REL8593A-5000000_out.caf
-rw-r--r-- 1 bach bach  569619434 2010-01-27 20:38 REL8593A-5000000_out.maf
-rw-r--r-- 1 bach bach    4708371 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta
-rw-r--r-- 1 bach bach   14125036 2010-01-27 20:39 REL8593A-5000000_out.padded.fasta.qual
-rw-r--r-- 1 bach bach  472618709 2010-01-27 20:39 REL8593A-5000000_out.tcs
-rw-r--r-- 1 bach bach    4707025 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta
-rw-r--r-- 1 bach bach   14120999 2010-01-27 20:39 REL8593A-5000000_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   13862715 2010-01-27 20:39 REL8593A-5000000_out.wig
arcadia:.../REL8593A-5000000_d_results$
gap4 REL8593A.0
Search for the tags set by MIRA which denote features or problems in the assembly (SROc, WRMc, MCVc, UNSc, IUPc; see the main manual for the full list), and edit accordingly. Save your gap4 database as a new version (e.g., REL8593A.1), then exit gap4.
Then use the gap2caf command (also from the Sanger Centre) to convert the gap4 database back to CAF.
arcadia:.../REL8593A-5000000_d_results$
gap2caf -project REL8593A.1 >rel8593a_edited.caf
As gap4 jumbles the consensus (it does not know about different sequencing technologies), having convert_project recalculate the consensus (with the "-r c" option) is generally a good idea.
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t caf -r c rel8593a_edited.caf rel8593a_edited_recalled
You will have to use either CAF or MAF as input, either of which can be the direct result from the MIRA assembly or an already cleaned and edited file. For the sake of simplicity, we'll use the file created by MIRA in the steps above.
Let's start with a HTML file showing all positions of interest:
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t hsnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$
ls -l *html
-rw-r--r-- 1 bach bach 5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html
But MIRA can do even better: create tables ready to be imported into spreadsheet programs.
arcadia:.../REL8593A-5000000_d_results$
convert_project -f caf -t asnp REL8593A-5000000_out.caf rel8593a
arcadia:.../REL8593A-5000000_d_results$
ls -l rel8593a*
-rw-r--r-- 1 bach bach    25864 2010-01-27 20:48 rel8593a_info_featureanalysis.txt
-rw-r--r-- 1 bach bach 12402905 2010-01-27 20:48 rel8593a_info_featuresequences.txt
-rw-r--r-- 1 bach bach   954473 2010-01-27 20:48 rel8593a_info_featuresummary.txt
-rw-r--r-- 1 bach bach  5198791 2010-01-27 20:49 rel8593a_info_snpenvironment.html
-rw-r--r-- 1 bach bach    13810 2010-01-27 20:47 rel8593a_info_snplist.txt
Have a look at all the files, perhaps starting with the SNP list, then the feature analysis, then the feature summary (your biologists will love that one, especially when combined with filters in the spreadsheet program) and then the feature sequences.
This is actually quite straightforward if you name your reads according to the MIRA standard for input files. Assume you have the following files (bchocse being an example for your mnemonic for the project):
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
Here's the simplest way to start the assembly:
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa >&log_assembly.txt
Of course, you can add any other switch you want like, e.g., changing the number of processors used, adding default strain names etc.pp
If you have only one library with one insert size, you just need to tell MIRA the minimum and maximum distance the reads should be from each other. In the following example I have a library size of 500 bp and have set the minimum and maximum distance to +/- 50% (you might want to use other modifiers):
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -GE:tismin=250:tismax=750 >&log_assembly.txt
Note: For this example to work, make sure that the read pairs are named using the Solexa standard, i.e., having /1 for one read and /2 for the other read. If yours have a different naming scheme, look up the -LR:rns parameter in the main documentation.
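Before starting such an assembly, it can pay off to verify that every read actually has its /1 or /2 partner. A hypothetical little helper (the function name is my invention; Solexa-style naming is assumed, read names below are taken from the XML example of this guide):

```python
def unpaired_reads(names):
    """Return read names whose /1 or /2 partner is missing (Solexa naming)."""
    seen = set(names)
    partner = {"1": "2", "2": "1"}
    missing = []
    for n in names:
        base, _, suffix = n.rpartition("/")
        # Reads without a /1 or /2 suffix are treated as unpaired on purpose.
        if suffix in partner and base + "/" + partner[suffix] not in seen:
            missing.append(n)
    return missing

names = ["1_17_510_1281/1", "1_17_510_1281/2", "2_17_857_850/1"]
print(unpaired_reads(names))  # ['2_17_857_850/1']
```

In practice you would feed it the read names parsed from the FASTQ header lines.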
To tell MIRA exactly which reads have which insert size, one must use an XML file containing ancillary data in NCBI TRACEINFO format. In case you don't have such a file, here's a very simple example containing only insert sizes for reads (lane 1 has a library size of 500 bases and lane 2 a library size of 2 Kb):
<?xml version="1.0"?>
<trace_volume>
  <trace>
    <trace_name>1_17_510_1281/1</trace_name>
    <insert_size>500</insert_size>
    <insert_stdev>100</insert_stdev>
  </trace>
  <trace>
    <trace_name>1_17_510_1281/2</trace_name>
    <insert_size>500</insert_size>
    <insert_stdev>100</insert_stdev>
  </trace>
  ...
  <trace>
    <trace_name>2_17_857_850/1</trace_name>
    <insert_size>2000</insert_size>
    <insert_stdev>300</insert_stdev>
  </trace>
  <trace>
    <trace_name>2_17_857_850/2</trace_name>
    <insert_size>2000</insert_size>
    <insert_stdev>300</insert_stdev>
  </trace>
  ...
</trace_volume>
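If you have to build such a file yourself, e.g. from a list of template names and library sizes, it is easy to script. A sketch — the helper name traceinfo_xml is my invention, and it emits only the three elements shown in the example above:

```python
from xml.sax.saxutils import escape

def traceinfo_xml(libraries):
    """libraries: iterable of (template_name, insert_size, insert_stdev).
    Emits one <trace> entry per read of each /1,/2 pair."""
    out = ['<?xml version="1.0"?>', "<trace_volume>"]
    for template, size, stdev in libraries:
        for suffix in ("/1", "/2"):
            out += ["  <trace>",
                    "    <trace_name>%s</trace_name>" % escape(template + suffix),
                    "    <insert_size>%d</insert_size>" % size,
                    "    <insert_stdev>%d</insert_stdev>" % stdev,
                    "  </trace>"]
    out.append("</trace_volume>")
    return "\n".join(out)

print(traceinfo_xml([("1_17_510_1281", 500, 100)]))
```

Redirect the output into the *_traceinfo_in.solexa.xml file of your project directory.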
So, if your directory looks like this:
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach users 263985896 2008-03-28 21:49 bchocse_in.solexa.fastq
-rw-r--r-- 1 bach users 324987513 2008-04-01 13:24 bchocse_traceinfo_in.solexa.xml
then starting the assembly is done like this (note the additional [-LR:mxti] parameter in the Solexa settings section):
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,accurate,solexa SOLEXA_SETTINGS -LR:mxti=yes >&log_assembly.txt
Two strategies come to mind for assembling genomes using a combination of Solexa and other (longer) reads: either using all reads for a full de-novo assembly, or first assembling the longer reads and using the resulting assembly as a backbone to map Solexa reads against. Both strategies have their pros and cons.
Throwing all reads into a de-novo assembly is the most straightforward way to get 'good' assemblies. This strategy is also the one which - in most cases - yields the longest contigs as, in many projects, parts of a genome not covered by one sequencing technology will probably be covered by another. Furthermore, having the consensus covered by more than one sequencing technology makes base calling pretty robust: if MIRA finds disagreements it cannot resolve easily, the assembler at least leaves a tag in the assembly to point human finishers to these positions of interest.
The downside of this approach, however, is that the sheer amount of data in Solexa sequencing projects makes life difficult for de-novo assemblers, especially for MIRA, which keeps quite a lot of additional information in memory during de-novo assemblies and tries to use algorithms that are as exact as possible during contig construction. Therefore, MIRA sometimes still runs into data sets which make it behave quite badly with respect to assembly time and memory consumption (but this is being constantly improved).
Full de-novo hybrid assemblies can be recommended only for bacteria at the moment, although lower eukaryotes should also be feasible on larger machines.
Starting the assembly is now just a matter of a simple command line with some parameters set correctly. The following is a de-novo hybrid assembly with 454 and Solexa reads.
arcadia:/path/to/myProject$
mira --project=bchocse --job=denovo,genome,normal,454,solexa >&log_assembly.txt
This strategy works in two steps: first assembling long reads, then mapping short reads to the full alignment (not just a consensus sequence). The result will be an assembly containing 454 (or Sanger) and Solexa reads.
Assemble your data just as you would when assembling 454 or Sanger data.
This step fetches the 'long' contigs from the previous assembly. The idea is to get all contigs larger than 500 bases.
$
convert_project -f caf -t caf -x 500 assemblyresult.caf hybrid_backbone_in.caf
You may also want to add an additional filter for minimum average coverage. If your project has an average coverage of 24, you could filter for a minimum average coverage of 33% of that (i.e., coverage 8; you might want to try out higher values) like this:
$
convert_project -f caf -t caf -x 500 -y 8 assemblyresult.caf hybrid_backbone_in.caf
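The combined effect of -x (minimum contig length) and -y (minimum average coverage) is simply a logical AND of two thresholds. As a sketch of the selection criterion — contig names, lengths and coverages below are made up:

```python
def keep_contig(length, avg_coverage, minlen=500, mincov=8):
    """Mirror of the -x (min length) and -y (min average coverage)
    filters: a contig is kept only if it passes both thresholds."""
    return length >= minlen and avg_coverage >= mincov

# Invented contigs: (name, length in bases, average coverage)
contigs = [("c1", 120000, 24.1), ("c2", 740, 5.2), ("c3", 310, 22.0)]
kept = [name for name, ln, cov in contigs if keep_contig(ln, cov)]
print(kept)  # ['c1']
```

c2 fails the coverage threshold and c3 the length threshold, so only c1 survives — exactly what convert_project does with the whole CAF file.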
Copy the hybrid backbone to a new empty directory, add in the Solexa data, start a mapping assembly using the CAF as input for the backbone. If you assembled the 454 / Sanger data with strain info, the Solexa data should also get those (as described above).
arcadia:/path/to/myProject$
ls -l
-rw-r--r-- 1 bach bach 1159280980 2009-10-31 19:46 hybrid_backbone_in.caf
-rw-r--r-- 1 bach bach  338430282 2009-10-31 20:31 hybrid_in.solexa.fastq
arcadia:/path/to/myProject$
mira --project=hybrid --job=mapping,genome,accurate,solexa -AS:nop=1 -SB:bft=caf >&log_assembly.txt
This section is a bit terse; you should also read the chapter on working with the results of MIRA3.
When working with resequencing data and a mapping assembly, I always load finished projects into an assembly editor and perform a quick cleanup of the results.
For close relatives of the reference strain this doesn't take long as MIRA will have set tags (see section earlier in this document) at all sites you should have a look at. For example, very close bacterial mutants with just SNPs or simple deletions and no genome reorganisation I usually clean up in 10 to 15 minutes. That gives the last boost to data quality, and your users (biologists etc.) will thank you for it as it reduces their work in analysing the data (be it looking at the data or performing wet-lab experiments).
Assume you have the following result files in the result directory of a MIRA assembly:
arcadia:/path/to/myProject/newstrain_d_results$
ls -l
-rw-r--r-- 1 bach bach 312607561 2009-06-08 14:57 newstrain_out.ace
-rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf
...
The general workflow I use is to convert the CAF file to a gap4 database and start the gap4 editor:
arcadia:newstrain_d_results$
caf2gap -project NEWSTRAIN -ace newstrain_out.caf >& /dev/null
arcadia:newstrain_d_results$
gap4 NEWSTRAIN.0
Then, in gap4, I
quickly search for the UNSc and WRMc tags and check whether they could be real SNPs that were overlooked by MIRA. In that case, I manually set an SROc (or SIOc) tag in gap4 via hotkeys defined to set these tags.
sometimes also quickly clean up reads that are causing trouble in alignments and lead to wrong base calling. These can be found at sites with UNSc tags; most of the time they contain the 5' to 3' GGCxG motif which can cause trouble for Solexa.
look at sites with deletions (tagged with MCVc) and check whether I should clean up the borders of the deletion.
After this, I convert the gap4 database back to CAF format:
$
gap2caf -project NEWSTRAIN >newstrain_edited.caf
But beware: gap4 does not have the same consensus calling routines as MIRA and will have saved its own consensus in the new CAF. In fact, gap4 performs rather badly in projects with multiple sequencing technologies. So I use convert_project from the MIRA package to recall a good consensus (and save it in MAF as it's more compact and a lot faster to handle than CAF):
$
convert_project -f caf -t maf -r c newstrain_edited.caf newstrain_edited_recalled
And from this file I can then convert with convert_project to any other format I or my users need: CAF, FASTA, ACE, WIG (for coverage analysis) etc.pp.
I can also generate tables and HTML files with SNP analysis results (with the "-t asnp" and "-t hsnp" options of convert_project).
As the result file of MIRA de-novo assemblies contains everything down to 'contigs' with just two reads, it is advised to first filter out all contigs which are smaller than a given size or have a coverage lower than 1/3 to 1/2 of the overall coverage.
Filtering is performed by convert_project using CAF file as input. Assume you have the following file:
arcadia:/path/to/myProject/newstrain_d_results$
ls -l
...
-rw-r--r-- 1 bach bach 655176303 2009-06-08 14:56 newstrain_out.caf
...
Let's say you have a hybrid assembly with an average coverage of 50x. I normally filter out all contigs which have an average coverage of less than 1/3 of that (here: 17x) or are smaller than 500 bases. These are mostly junk contiglets remaining from the assembly and can be more or less safely ignored. This is done the following way:
arcadia:newstrain_d_results$
convert_project -f caf -t caf -x 500 -y 17 newstrain_out.caf newstrain_filterx500y17
From there on, convert the filtered CAF file to anything you need to continue finishing the genome (gap4 database, ACE, etc.pp).
These apply to version 3 of MIRA and might or might not have been addressed in later versions.
Bugs:
mapping of paired-end reads with one read being in a non-repetitive area and the other in a repeat is not as effective as it should be. The optimal strategy would be to first map the non-repetitive read and then the read in the repeat. Unfortunately, this is not yet implemented in MIRA.
Problems:
the textual output of results is really slow with such massive amounts of data as with Solexa projects. If Solexa data is present, it's turned off by default at the moment.
Pacific Biosciences looks like the new kid on the block of sequencing providers. They seem to have, for the first time since Sanger sequencing, something which is able to produce sequences actually longer than Sanger reads. They also have something new: strobed sequencing. That technique alone was reason enough for me to see whether it could be of any use. After a couple of modifications to the MIRA assembly engine, I think I can say that "yes, it very well can be."
One could feed strobed PacBio sequences to MIRA 3.0.0 and the 2.9.x line before and get some results out of it by faking them to be Sanger, though the results were not always pretty.
The first version of MIRA to officially support sequences from Pacific Biosciences is MIRA 3.2. Versions in the 3.0.1 to 3.0.5 range and 3.1.x had different degrees of support, but were never advertised having it.
I am not affiliated with Pacific Biosciences nor do I -- unfortunately -- have early access to their data. Due to extreme secrecy, almost no one outside the company has actually seen their sequencing data. So some of what this guide contains is a bit of guesswork, reading through dozens and dozens of conference reports, blogs, press releases, tweets and whatever not.
But maybe I got some things right.
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do.
While there are step by step walk-throughs on how to setup your data for Sanger, 454 and Solexa in other MIRA guides, this guide is (currently) a bit more terse. You are expected to read at some point in time
the mira_reference help file to look up some command line options.
for hybrid assemblies of PacBio data with Sanger, 454, Solexa the corresponding mira_usage, mira_454 or mira_solexa help files to look up how to prepare the different data sets.
Let's first have a look at what sequencing (either paired or unpaired) meant until now. I won't go into the details of conventional sequencing as this is covered elsewhere in the MIRA manuals (and in the Web).
In conventional, unpaired sequencing, you have a piece of DNA (a DNA template) which a machine reads out and then gives you the sequence back. Assume your piece of DNA to be 10 kilobases long, but your machine can read only 1000 bases. Then what you get back (DNA below is the DNA template, R1 is a read) is this:
DNA: actgttg...gtgcatgctgatgactgact.........gactgtgacgtactgcttga...actggatctg
R1 : actgttg...gtgcatgct
     \_________________/
              |
         ~1000 bases
In conventional paired-end sequencing, you still can read only 1000 bases, but you can do it at the beginning and at the end of a DNA template. This looks like this:
DNA: actgttg...gtgcatgctgatgactgact.........gactgtgacgtactgcttga...actggatctg
R1 : actgttg...gtgcatgct
     \_________________/       R2 :          gcttga...actggatctg
              |                              \_________________/
         ~1000 bases                                  |
                                                 ~1000 bases
While you still have just two reads of approximately 1000 bases, you know one additional thing: these two reads are approximately 10000 bases apart. This additional information is very useful in assembly as it helps to resolve problematic areas.
Enter Pacific Biosciences with their strobed sequencing. With this approach, you can still sequence only a given number of bases (they claim between 1000 and 3000), but you can sort of "distribute" the bases you want to read across the DNA template.
Warning: Overly simplified and probably totally inaccurate description ahead! Furthermore, the extremely short read and gap lengths in these examples serve only for demonstration purposes.
Here's a simple example: assume you could read around 40 bases with your machinery, but that the DNA template is some ~80 bases. And assume you could tell your machine to read between 6 and 8 bases at a time, then leave out the next 6 to 8 bases, then read again etc. Like so:
DNA: actgttggtgcatgctgatgactgactgactgtgacgtacttgactgactggatctgtgactgactgtgactgactg
R1a: actgttg
R1b:                 gatgactgac
R1c:                                    cgtacttga
R1d:                                                     atctgtgac
R1e:                                                                     gactgactg
While in the example above we still read only 44 bases, these 44 bases span 77 bases on the DNA template. Furthermore, we have the additional information that the sequence of reads is R1a, R1b, R1c, R1d and R1e and, because we asked the machine to read in such a pattern, we expect the gaps between the reads to be between 6 and 8 bases wide.
This is actually possible with the system of PacBio. It streams the DNA template through a detection system which reads out the bases only if, and only while, a light source (a laser) is switched on. Therefore, while streaming the template through the system, you read the DNA while the laser is on and you don't read anything while it's off ... meanwhile the template is still streamed through.
Now, why would one want to turn the laser off?
It seems as if the light source is actually also the major limiting factor, as it has the nasty side-effect of degrading the DNA it should still read. A real bummer: after 1000 to 3000 bases (sometimes more, sometimes less), the DNA being read is probably so degraded and error-ridden (perhaps even physically broken) that it makes no sense to continue reading.
Here comes the trick: instead of reading, say, 1000 bases in a row, you can read them in strobes: you switch the light on and start reading a couple of bases (say: 100), switch the light off, wait a bit until some bases (again, let's say approximately 100) have passed by, switch the light back on and read again ~100 bases, then switch off ... etc.pp until you have read your 1000 bases, or, more likely, as long as you can. But, as shown in the example above, these 1000 bases will be distributed across a much larger span on the original DNA template: in a pattern of ~100 bases read and ~100 bases not read, the smaller read-fragments span ~1800 to ~2000 bases.
Cool ... this is actually something an assembler can make real use of.
A more conventional approach could be: you switch the light on and start reading a couple of bases (say: 500), switch the light off, wait a bit until some bases (again, let's say approximately 10000) have passed by, switch the light back on and read again ~500 bases. This would be equivalent to a "normal" paired-end read with an insert size of 11Kb. But assemblers also can make good use of that.
Although Pacific Biosciences keeps pretty quiet on this topic, missed bases seem to be quite a problematic point. A bit like the 454 homopolymer problem but without homopolymers. From http://scienceblogs.com/geneticfuture/2010/02/pacific_biosciences_session_at.php
“Turner [the presenter from PacBio] said nothing concrete about error rates during his presentation, but this issue dominated the questions from the audience. Turner skilfully equivocated, steering clear of providing any hard numbers on the raw error rates and focusing on the system's ability to generate accurate consensus sequences through circular reads. Still, it's clear that deletion errors due to missing bases will pose a non-trivial problem for the system: Turner referred to algorithms for assembling sequence dominated by insertion/deletion errors currently in development.”
Someone else made a nice comment on this (from http://omicsomics.blogspot.com/2010/02/pacbios-big-splash.html):
“Well, not much on error rates from PacBio (apparently in the Q&A their presenter executed a jig, tango, waltz & rumba when asked).”
Astute readers will have noted that in the section on sequencing a DNA template with PacBio, I wrote “approximately” when defining the length of the stretch of non-read bases in strobed sequencing. According to conference reports, the length can only be estimated with a variance of 10-20%. From http://www.genomeweb.com/sequencing/pacbio-says-strobe-sequencing-increases-effective-read-length-single-molecule-se:
“There is uncertainty regarding the size of the "dark" inserts, owing to "subtle fluctuations" in the DNA synthesis speed, he said, but it becomes smaller with longer inserts. For example, with 400-base inserts, the coefficient of variation of its size is 20 percent, but it decreases to 10 percent with 1,600 bases.”
The reports are a bit contradictory regarding achievable read lengths. While PacBio mentions they have attained read lengths of up to 20Kb in their labs and expect to be able to go up to 50Kb, the first generation machines are marketed with much lower expectations. However, "much lower" in this context still means: at least 1 Kb and very good chances to have a good percentage of reads in the 3 to 5 Kb range.
The strobed sequencing method should allow a couple of interesting things. First off, simulating conventional paired-end sequencing. Then, going into real strobed sequencing, extending the length reads span over a DNA template (perhaps doubling or tripling it) will be extremely useful to cross all but the most annoying repeats one would encounter in prokaryotes ... and probably also eukaryotes once PacBio regularly achieves lengths of 10000 bases.
MIRA currently knows two ways to handle strobed reads:
a more traditional approach, using two strobes at a time as a read pair
the "elastic dark insert" approach, where all strobes are put in one read and connected by stretches of N representing the dark inserts. "Elastic" means that -- the initial lengths of the dark inserts being a rough estimate -- the lengths of the inserts are then corrected in the iterative assembly passes of MIRA.
The elastic dark insert approach has an invaluable advantage: it keeps the small strobes connected in order in a read. This considerably reduces sources of errors when highly repetitive data is to be assembled where paired-end approaches also have limits.
Keeping the dark inserts as an integral part of the reads, however, also poses a couple of problems. Traditional Smith-Waterman alignments are likely to give some of these alignments a bad score, as there will invariably be a number of overlaps where the estimated length of the dark stretch is so different from the real length that an alignment algorithm needs to counterbalance this with a considerable number of inserted gaps. This in turn can lower the Smith-Waterman score to a level where the needed identity thresholds are not met. The following example shows an excerpt of a case where a read whose dark insert length was estimated too low aligns against a read without a dark insert:
...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
While MIRA's algorithms have methods to counterbalance this kind of scoring malus (e.g., by simply not counting gap scores in dark strobe inserts), another effect then appears: multiple alignment disorders. Like in many other assemblers, the construction of multiple alignments is done iteratively by aggregating new reads to an existing contig, aligning them against a temporary consensus. As the misestimation of dark insert lengths can reach comparatively high numbers like 20 to 100 bases or more, problems can arise if several misestimated dark inserts in reads come together at one place. A simple example: assume the following scenario, where reads 1, 2, 3 and 4 form a contig by being assembled in exactly this order (1, 2, 3, 4):
Read1 ...TGACTGAnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
then
Read1 ...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
then
Read1 ...TGACTGA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACTGATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
then
Read1 ...TGACT*****GA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
This can lead to severe misaligns in multiple alignments with several reads, as the following screenshot exemplifies.
Figure 7.1. Multiple alignment with PacBio elastic dark inserts, initial status with severe misalignments
However, MIRA is an iterative assembler working in multiple passes and iterations within a pass. This allows for a strategy of iterative correction of the estimated dark insert lengths. Like with every sequencing technology it knows, MIRA analyses the multiple alignment of a contig in several ways and searches for, e.g., misassembled repeats (for more information on this, please refer to the MIRA manual). When having reads with the technology from Pacific Biosciences, MIRA also analyses the elastic dark inserts to see whether their length as measured in the multiple alignment fits the estimated length. If not, the length of the dark insert will be corrected up or down for the next pass, the correction factor being two thirds of the difference between the measured and the estimated length of the dark insert.
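The correction rule just described can be sketched in a few lines. This is a hedged model, not MIRA's actual code: the exact rounding and termination behaviour inside MIRA is not documented here, so this sketch truncates the two-thirds step toward zero and (as an assumption of mine) always moves by at least one base while a difference remains:

```python
def correct(estimated, measured):
    """One correction pass: move the dark-insert length estimate two
    thirds of the way toward the length measured in the alignment."""
    diff = measured - estimated
    step = int(2 * diff / 3)            # truncates toward zero
    if step == 0 and diff != 0:
        step = 1 if diff > 0 else -1    # assumed: always make progress
    return estimated + step

# Read 1 of the example below: 21 n's estimated, true dark length 37.
length, history = 21, [21]
while length != 37:
    length = correct(length, 37)
    history.append(length)
print(history)  # [21, 31, 35, 36, 37]
```

Note how the first pass adds 10 Ns and the second 4 Ns, matching the worked example in this section, and how the estimate converges on the true length within a handful of passes.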
Coming back to the example used previously:
Read1 ...TGACT*****GA****************nnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*****GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
You will note that there are basically two elastic dark insert stretches. The first, in read 1, has an underestimation of the dark insert size of 16 bases; the second has an overestimation of five bases.
Accordingly, MIRA will add two thirds of 16 ≈ 10 Ns to the estimated dark insert in read 1 and remove 3 Ns (two thirds of 5) from read 4:
Read1 old ...TGACTGAnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read1 new ...TGACTGANNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read4 old ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
Read4 new ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
These new reads will be used in the next (sub-)passes of MIRA. Continuing the example from above, the next multiple alignment of all four reads would look like this:
Read1 ...TGACT**GA******NNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT**GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT**GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
Again, the dark inserts would be corrected by MIRA, this time adding 4 Ns to read 1 and removing one N from read 4, so that the next multiple alignment is this:
Read1 ...TGACT*GA**NNNNNNNNNNNNNNnnnnnnnnnnnnnnnnnnnnnTCAGTTGAT...
Read2 ...TGACT*GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read3 ...TGACT*GATGACTTTATCTATGGAGCTTATATGCGTCGAGCTTGGTCAGTTGAT...
Read4 ...TGACTnnnnnnnnnnnnnnnnnnnnnnnnnATGCGTCGAGCTTGGTCAGTTGAT...
From there it is trivial to see that one just needs two more iterations to replace the initial estimated length of the dark insert with its true length. The next screenshot continues the live example shown previously after the second pass of MIRA (remember that each pass can have multiple sub-passes):
One pass (and multiple sub-passes) later, the elastic dark inserts in this example have reached their true lengths. The multiple alignment is as good as it can get as the following figure shows:
The elastic dark insert strategy is quite successful at resolving most problems but sometimes fails to find a perfect solution. However, the remaining multiple alignment is -- in most cases -- good enough for a consensus algorithm to find the correct consensus, as the next screenshot shows:
MIRA will happily read data in several different formats (FASTA, FASTQ, etc.). For the sake of simplicity, this guide will use FASTQ as demonstration format, but most of the time not add the quality line.
This is actually quite simple. Just put your reads as FASTQ in a file and you are done. No need to bother about read naming conventions or similar things. Like so:
@readname_001
ACGTTGCAGGGTCATGCAGT...
@readname_002
...
You have two possibilities for that. If the "dark insert" is not too long and about the same size or shorter than the average sequenced length at the ends, you can put the data into one read and fill up the estimated dark insert length with the 'n' character. The following example shows this, using lengths of 10 for the sequenced parts and a dark insert size of 10:
@readname_001
ACGTTGCAGGnnnnnnnnnnGTCATGCAGT
@readname_002
...
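Producing such records from strobe pairs is a one-liner; a minimal sketch (the function name is my invention, and, as in the examples of this guide, no quality line is written):

```python
def one_read_record(name, strobe1, strobe2, dark_estimate):
    """Join two strobes into one FASTQ-style record, filling the
    estimated dark insert with 'n' characters."""
    return "@%s\n%s%s%s" % (name, strobe1, "n" * dark_estimate, strobe2)

print(one_read_record("readname_001", "ACGTTGCAGG", "GTCATGCAGT", 10))
```

This reproduces the record shown above: two 10-base strobes joined by 10 n's.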
In case you have long "dark inserts", it is preferable to keep both parts physically separated in different reads. In this case, read naming becomes important for an assembler. Pick any paired-end naming scheme you want and name your reads accordingly. The following example shows the same data as above, split in two reads and using the Solexa naming scheme to denote the first read of a pair by appending /1 to the read name and the second part by appending /2:
@readname_001/1
ACGTTGCAGG
@readname_001/2
GTCATGCAGT
@readname_002/1
...
Note: The example above used the Solexa naming scheme to denote paired-end partner reads. You can use any naming scheme you want as long as MIRA knows it, e.g., the forward/reverse, Sanger or TIGR naming schemes; you will just need to tell MIRA about it with the [-LR:rns] parameter. As soon as first data sets with PacBio are available, MIRA will also implement their naming scheme.
Should all your reads have approximately the same total length of first part (/1) + dark insert + second part (/2), then you don't need to create an additional file with information about the expected distance between the parts; you can use [-GE:tismin:tismax] to tell MIRA about it. In case you have different sizes because, e.g., you have sequenced different libraries, then you will need to tell MIRA which reads have which distance from each other. You can do this in an XML file in NCBI TRACEINFO format. There will be other means in the future, but these have not been implemented yet.
Like in the case with two strobes, you have the choice between putting all strobes in one read ... or to separate the strobes in multiple reads. The following example shows the case where all strobes are in one read:
@readname_001
ACGTTGCAGGnnnnnnnnnnGTCATGCAGTnnnnnnnnnnnnnnnnnnnnnnnnTATGCACTGACnnnnnTAGCTGA
@readname_002
...
Note that the "dark inserts" do not necessarily need to be of the same length, even within a read. Indeed, depending on your sequencing strategy they can have widely varying lengths, although one should take care that these inserts are not much longer than the longest strobes (or the longest unstrobed read) in your data set.
In case you have long dark inserts, you should split the parts separated by these long inserts into different reads. E.g., if your strobes are 500 bases long, but separated by dark inserts > 1Kb, split them. You are free to split them however you like, in sub-pairs, in single strobes or whatever. For the example given above, this could be done like this:
@readname_001/1
ACGTTGCAGGnnnnnnnnnnGTCATGCAGT
@readname_001/2
TATGCACTGACnnnnnTAGCTGA
@readname_002
...
Note the dark inserts remaining in each read of the "virtual" read-pair. The same sequences could also be split like this:
@readname_001a/1
ACGTTGCAGG
@readname_001a/2
GTCATGCAGT
@readname_001b/1
TATGCACTGAC
@readname_001b/2
TAGCTGA
@readname_002
...
which would then be two read-pairs: the first and second strobes are paired, as well as the third and fourth. Here too, you can use any combination of strobes to pair with each other (or use them without pair information).
Combining the first and third strobe as well as the second and fourth would look like this:
@readname_001a/1
ACGTTGCAGG
@readname_001b/1
TAGCTGA
@readname_001b/2
GTCATGCAGT
@readname_001a/2
TATGCACTGAC
@readname_002
...
Note that in this case you probably need to provide paired-end information in a NCBI TRACEARCHIVE XML file to tell MIRA about the different insert sizes.
Finally, you can put the reads all in one template like this:
@readname_001.f1
ACGTTGCAGG
@readname_001.f2
GTCATGCAGT
@readname_001.f3
TATGCACTGAC
@readname_001.f4
TAGCTGA
@readname_002
...
Note the subtle change in the naming of the reads, where I switched to a different postfix naming. This is because the Solexa naming scheme currently does not (officially) allow for more than two reads per DNA template (well, /1 and /2). The forward/reverse naming scheme as implemented by MIRA, however, does allow this.
This has just one drawback: currently MIRA will not be able to store the distances between the strobes when they are all in one template. This is being worked on and will be possible in a future version.
Create a directory where you copy your input data into (or where you set a soft-link where it really resides).
Currently (as of version 3.2.0) MIRA allows one input file per sequencing technology (one for Sanger, one for 454, one for Solexa and one for PacBio). This will change in the future, but for the moment it is how it is.
While you could name your input files whatever you like and pass these as parameters to MIRA, it is easier to follow a simple naming scheme that allows MIRA to find everything automatically. This scheme is

projectname_in.sequencingtechtype.filetypepostfix

The projectname is a free string which you decide to give to your project. The sequencingtechtype can be one of "sanger", "454", "solexa" or "pacbio". Finally, the filetypepostfix is either "fasta" (together with "fasta.qual"), "fastq" or any other type supported by MIRA.
Note that MIRA supports loading a lot of other information files (XML TRACEINFO, strain data etc.), please consult the reference manual for more information.
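The naming scheme is easy to script against. A small sketch (the helper name is mine, not part of MIRA) that composes the auto-discovery file name from its three components:

```python
def mira_input_filename(projectname, techtype, postfix):
    """Compose projectname_in.sequencingtechtype.filetypepostfix,
    the name MIRA uses to find input files automatically."""
    if techtype not in ("sanger", "454", "solexa", "pacbio"):
        raise ValueError("unknown sequencing technology: %s" % techtype)
    return "%s_in.%s.%s" % (projectname, techtype, postfix)

print(mira_input_filename("bs168pe1k_10k", "pacbio", "fasta"))
# bs168pe1k_10k_in.pacbio.fasta
```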
In the most basic incantation, you will need to tell MIRA just five things:
the name of your project.
whether you want a "genome" or "EST" assembly
whether it is a denovo or mapping assembly
which quality level (draft, normal or accurate)
which sequencing technologies are involved (sanger, 454, solexa, pacbio)
Using the most basic quick switches of MIRA, the command line for an accurate denovo genome with PacBio data then looks like this:
mira --project=yourname --job=genome,denovo,accurate,pacbio

or for a hybrid PacBio / Solexa assembly of the above:

mira --project=yourname --job=genome,denovo,accurate,pacbio,solexa
Note | |
---|---|
MIRA has -- at the last count -- more than 150 parameters one can use to fine tune almost every aspect of an assembly, from data loading options to results saving, from data preprocessing to results interpretation, from simple alignment parameters to parametrisation of internal misassembly decision rules ... and much more. Many of these parameters can even be set individually for each sequencing technology they apply to. Example given: in an assembly with Solexa, Sanger, 454 and PacBio data, the minimum read length for Solexa could be set to 30, while for 454 it could be 80, Sanger 100 and PacBio 150. Please refer to the reference manual for a full overview on how to use quick switches and extended switches. |
Whole genome sequencing of bacteria will probably be amongst the first applications for which the long PacBio reads will have an impact. Simply put: the repeat structure -- like rRNA stretches, (pro)phages and/or duplicated genes/operons -- of bacteria is such that most genomes known so far can be assembled and/or scaffolded with paired-end libraries between 6Kb and 10Kb. Cite paper ...!
Well, using strobed reads where a DNA template is sequenced in several strobes and the dark inserts have approximately the same length as a strobe, the initial PacBio data should be capable of generating strobed data from DNA templates with a total span between 2000 and 6000 bases.
Furthermore, strobed reads can be used to generate traditional paired-end sequence with large insert sizes like 10Kb or more.
In the first few examples showing assembly with only PacBio data, we will use the genome of Bacillus subtilis 168, which is a long-standing model organism for systems biology and is also used in biotechnology. From a complexity point of view, the genome has some interesting features. As an example, there are 11 rRNA stretches, some of them clustered together, which probably comes from the fact that Bsub evolved under laboratory conditions to become a fast grower. The most awful multiple rRNA cluster is the one starting at ...Kb and is ... Kb long.
In the examples afterwards we will work with Escherichia coli ... (Eco), another model organism in the bacterial community. This time we will mix simulated low-coverage PacBio data with real Solexa data deposited at the NCBI Short Read Archive (SRA).
Note | |
---|---|
Currently this section contains examples with real Solexa reads but only simulated PacBio reads as I do not have early access to real PacBio data. However, I think that these examples show the possibilities such a technology could have. |
Everyone (or every sequencing group / center) has more or less their own standard on how to organise directories and data prior to an assembly. Here's how I normally do it and how the following examples will be -- more or less -- set up: one top directory with the name of the project containing three specific sub-directories; one for original data, one for possibly reformatted data and one for assemblies. That looks a bit like this:
$ mkdir myproject
$ cd myproject
myproject$ mkdir origdata data assemblies
The origdata directory contains whatever data file (or links to those) I have for that project: sequencing files from the provider, reference genomes from databases etc.pp. The general rule: no other files, and these files are generally write protected and unchanged from the state of delivery.
The data directory contains the files as MIRA will want to use them, possibly reformatted, reworked or bundled together with other data. E.g.: if your provider delivered several data files with sequence data for PacBio, you currently need to combine them into one file, as MIRA currently reads only one input file per sequencing technology.
The assemblies directory finally contains sub-directories with the different assembly trials I make. Every sub-directory is quickly set up by creating it, linking the data files from the data directory to it and then starting MIRA. Continuing the example from above:
myproject$ cd assemblies
myproject/assemblies$ mkdir firstassembly
myproject/assemblies$ cd firstassembly
myproject/assemblies/firstassembly$ lndir ../../data
myproject/assemblies/firstassembly$ mira --project=...
That strategy keeps things nice and tidy, and allows for maximum flexibility while testing out a couple of settings.
Set up directories and fetch genome of Bacillus subtilis 168 from GenBank
$ mkdir bsubdemo1
$ cd bsubdemo1
bsubdemo1$ mkdir origdata data assemblies
bsubdemo1$ cd origdata
bsubdemo1/origdata$ wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.fna
bsubdemo1/origdata$ ls -l
-rw-r--r-- 1 bach bach 4275918 2010-06-06 00:34 AL009126.fna
After that, we'll prepare the simulated PacBio data by running a script which creates paired reads like we would expect from sequencing with PacBio with the following properties: DNA templates have 10k bases or more, we sequence the first 1000 bases in a strobe, let approximately 8000 bases pass, then sequence another 1000 bases.
bsubdemo1/origdata$ cd ../data
bsubdemo1/data$ fasta2frag.tcl -l 1000 -i 230 -p 1 -insert_size 10000 -pairednaming 454 -P 0 -r 2 -infile ../origdata/AL009126.fna -outfile bs168pe1k_10k_in.pacbio.fasta
no ../origdata/AL009126.fna.qual
fragging gi|225184640|emb|AL009126.3|
bsubdemo1/data$ ls -l
-rw-r--r-- 1 bach bach  38971051 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta
-rw-r--r-- 1 bach bach   1642208 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.bambus
-rw-r--r-- 1 bach bach   1440988 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.pairs
-rw-r--r-- 1 bach bach 111214374 2010-06-06 18:27 bs168pe1k_10k_in.pacbio.fasta.qual
The command line given above will create an artificial data set with equally distributed PacBio "paired-end" reads with an average coverage of 8.6 across the genome. Note that the *.bambus and *.pairs files are not needed for MIRA; the Tcl script generates these for some other use cases.
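The quoted average coverage can be sanity-checked with a little arithmetic. Assuming my reading of the fasta2frag.tcl switches is correct (a new 10Kb template is started every 230 bases, and two 1000-base reads are sequenced from each template), the expected coverage is:

```python
read_len = 1000         # bases per simulated PacBio read (-l 1000), assumed
reads_per_template = 2  # paired-end: the /1 and /2 reads
step = 230              # a new template is started every 230 bases (-i 230), assumed

coverage = reads_per_template * read_len / float(step)
print(round(coverage, 1))  # ~8.7, matching the ~8.6 quoted above up to edge effects
```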
Next, we move to the assembly directory, make a new one to run a first assembly and link all the needed input files for MIRA into this new directory:
bsubdemo1/data$ cd ../assemblies
bsubdemo1/assemblies$ mkdir firsttest
bsubdemo1/assemblies$ cd firsttest
bsubdemo1/assemblies/firsttest$ ln -s ../../data/bs168pe1k_10k_in.pacbio.fasta .
bsubdemo1/assemblies/firsttest$ ln -s ../../data/bs168pe1k_10k_in.pacbio.fasta.qual .
bsubdemo1/assemblies/firsttest$ ls -l
lrwxrwxrwx 1 bach bach 39 2010-06-06 01:01 bs168pe1k_10k_in.pacbio.fasta -> ../../data/bs168pe1k_10k_in.pacbio.fasta
lrwxrwxrwx 1 bach bach 44 2010-06-06 01:01 bs168pe1k_10k_in.pacbio.fasta.qual -> ../../data/bs168pe1k_10k_in.pacbio.fasta.qual
We're all set up now, just need to start the assembly:
bsubdemo1/assemblies/firsttest$ mira --project=bs168pe1k_10k --job=genome,denovo,accurate,pacbio --notraceinfo -GE:not=4 PACBIO_SETTINGS -GE:tpbd=1:tismin=9000:tismax=11000 -LR:rns=fr >&log_assembly.txt
The command above told MIRA
the name (bs168pe1k_10k) you chose for your project. MIRA will search for input files with this prefix as well as write output files and directories with that prefix.
the assembly job MIRA should perform. In this case a de-novo genome assembly at accurate level with PacBio data.
some additional information: MIRA should not search for additional ancillary information in NCBI TRACEINFO XML files.
the number of threads which MIRA should run at most in parallel.
that the following switches apply to reads in the assembly which are from Pacific Biosciences.
that both reads of a PacBio read-pair should assemble in the same direction in a contig, and that the distance between the outer read ends should be at minimum 9000 bases and at maximum 11000 bases.
that the read naming scheme for a PacBio read-pair is "forward/reverse", i.e., the first read has ".f" appended to its name, the second read ".r".
that the standard output of MIRA should be redirected to a file named log_assembly.txt.
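The effect of -GE:tpbd=1:tismin=9000:tismax=11000 can be illustrated with a toy check (a simplified model of my own, not MIRA's actual code): given the contig positions of both reads of a pair, the outer end-to-end distance must fall within the given window and both reads must point the same way.

```python
def pair_placement_ok(f_start, r_start, r_len, same_dir,
                      tismin=9000, tismax=11000):
    """Toy model of the template checks behind
    -GE:tpbd=1:tismin=9000:tismax=11000 (not MIRA's actual code)."""
    outer = (r_start + r_len) - f_start  # outer end-to-end distance
    return same_dir and tismin <= outer <= tismax

print(pair_placement_ok(0, 9000, 1000, True))   # outer distance 10000: OK
print(pair_placement_ok(0, 12000, 1000, True))  # outer distance 13000: too far
```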
Some 12 to 13 minutes later, the data set will be assembled. Though you should note that in real life projects with sequencing errors, MIRA will take perhaps 3 to 4 times longer. Have a look at the information files in the directory bs168pe1k_10k_assembly/bs168pe1k_10k_d_info/, there especially at the files bs168pe1k_10k_info_assembly.txt and bs168pe1k_10k_info_contigstats.txt which give a first overview on how the assembly went.
In short: this assembly went -- unsurprisingly -- quite well: the complete chromosome of Bacillus subtilis 168 has been reconstructed into one contig. There are just two minor flaws. First, a few (twelve) repetitive reads could not be placed and form a second small contig of 2Kb. Second, the reconstructed chromosome contains 4 single-base differences with respect to the original Bsub chromosome. It is left as an exercise to the reader to find out that this is due to almost identical rRNA repeats where two almost adjacent elements lie within the expected template insert size of the simulated PacBio reads and therefore troubled the assembler a bit.
Your next stop would then be the directory bs168pe1k_10k_assembly/bs168pe1k_10k_d_results/ which contains the assembly results in all kinds of formats. If a format you need is missing, have a look at convert_project from the MIRA package; it may be that the format you need can be generated with it.
Set up directories and fetch genome of Bacillus subtilis 168 from GenBank
$ mkdir bsubdemo2
$ cd bsubdemo2
bsubdemo2$ mkdir origdata data assemblies
bsubdemo2$ cd origdata
bsubdemo2/origdata$ wget ftp://ftp.ncbi.nih.gov/genbank/genomes/Bacteria/Bacillus_subtilis/AL009126.fna
bsubdemo2/origdata$ ls -l
-rw-r--r-- 1 bach bach 4275918 2010-06-06 00:34 AL009126.fna
After that, we'll prepare the simulated PacBio data by running a script which creates strobed reads like we would expect from sequencing with PacBio with the following properties: DNA templates are 6k bases or more, we sequence the first ~100 bases in a strobe, let approximately 100 bases pass, and repeat until we have 3000 bases in strobes.
bsubdemo2/origdata$ cd ../data
bsubdemo2/data$ fasta2frag.tcl -l 3000 -i 150 -r 2 -s 1 -strobeon 100 -strobeoff 100 -infile ../origdata/AL009126.fna -outfile bs168_3ks_100_100_in.pacbio.fasta
no ../origdata/AL009126.fna.qual
fragging gi|225184640|emb|AL009126.3|
bsubdemo2/data$ ls -l
-rw-r--r-- 1 bach bach 166909136 2010-06-06 19:18 bs168_3ks_100_100_in.pacbio.fasta
-rw-r--r-- 1 bach bach 416614472 2010-06-06 19:18 bs168_3ks_100_100_in.pacbio.fasta.qual
The command line given above will create an artificial data set with equally distributed PacBio strobed reads with an average coverage of ~20 across the genome, of which only half is filled with sequence data so that the "real" coverage is ~10.
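Here too the coverage figures follow from the fasta2frag.tcl switches, under two assumptions of mine: that a new read starts every 150 bases, and that each read spans ~3000 bases of which the 100-on/100-off strobe pattern fills only half with sequence.

```python
read_span = 3000    # total span of one strobed read, assumed from -l 3000
step = 150          # a new read starts every 150 bases (-i 150), assumed
on, off = 100, 100  # strobe pattern: 100 bases on, 100 bases off

nominal = read_span / float(step)      # read-span coverage across the genome
real = nominal * on / float(on + off)  # only half of each read carries sequence
print(nominal, real)  # ~20x nominal, ~10x real, as quoted above
```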
Next, we move to the assembly directory, make a new one to run a first assembly and link all the needed input files for MIRA into this new directory:
bsubdemo2/data$ cd ../assemblies
bsubdemo2/assemblies$ mkdir firsttest
bsubdemo2/assemblies$ cd firsttest
bsubdemo2/assemblies/firsttest$ ln -s ../../data/bs168_3ks_100_100_in.pacbio.fasta .
bsubdemo2/assemblies/firsttest$ ln -s ../../data/bs168_3ks_100_100_in.pacbio.fasta.qual .
bsubdemo2/assemblies/firsttest$ ls -l
lrwxrwxrwx 1 bach bach 39 2010-06-06 01:01 bs168_3ks_100_100_in.pacbio.fasta -> ../../data/bs168_3ks_100_100_in.pacbio.fasta
lrwxrwxrwx 1 bach bach 44 2010-06-06 01:01 bs168_3ks_100_100_in.pacbio.fasta.qual -> ../../data/bs168_3ks_100_100_in.pacbio.fasta.qual
We're all set up now, just need to start the assembly:
bsubdemo2/assemblies/firsttest$ mira --project=bs168_3ks_100_100 --job=genome,denovo,accurate,pacbio --notraceinfo --noclipping -GE:not=4 -GO:mr=no PACBIO_SETTINGS -AL:egp=no >&log_assembly.txt
The command above told MIRA
the name (bs168_3ks_100_100) you chose for your project. MIRA will search for input files with this prefix as well as write output files and directories with that prefix.
the assembly job MIRA should perform. In this case a de-novo genome assembly at accurate level with PacBio data.
some additional information: MIRA should not search for additional ancillary information in NCBI TRACEINFO XML files.
the number of threads which MIRA should run at most in parallel.
that a MIRA parameter called mark repeats should be switched off for PacBio reads. This is absolutely necessary when you have strobed reads with elastic dark inserts, as MIRA otherwise gets somewhat confused due to the alignment problems shown earlier in this guide.
that the following switches apply to reads in the assembly which are from Pacific Biosciences.
that a MIRA parameter called extra gap penalty should be switched off for PacBio reads. This is necessary when you have strobed reads with elastic dark inserts, as otherwise alignment problems with larger gaps lead to unnecessary rejection of alignments.
that the standard output of MIRA should be redirected to a file named log_assembly.txt.
Wait approximately 4.5 hrs for MIRA to complete. Using elastic dark inserts is a pretty expensive feature from a computational perspective: all the passes and sub-passes MIRA needs to move from an estimated length to an actually correct value mean building and breaking apart all the contigs and starting anew.
Bad news first: looking at the results and info directories, you will see that one single contig with a length of 4199898 bases was created. The original B. subtilis genome we used for this walkthrough is 4215426 bases, so it looks like some 15.5Kb are "missing." But, and this is the good news, the contig which was created represents the B. subtilis genome pretty faithfully: a check with MUMMER confirms that no misassembly or re-ordering of genome elements occurred.
Note | |
---|---|
The following will need MUMMER3 installed on your system. Fetch it here: http://mummer.sourceforge.net/ |
bsubdemo2/assemblies/firsttest$ cd bs168_3ks_100_100_assembly/bs168_3ks_100_100_d_results
../bs168_3ks_100_100_d_results$ ls -l
-rw-r--r-- 1 bach bach 280894715 2010-06-08 04:53 bs168_3ks_100_100_out.ace
-rw-r--r-- 1 bach bach 776536315 2010-06-08 04:52 bs168_3ks_100_100_out.caf
-rw-r--r-- 1 bach bach 461365272 2010-06-08 04:52 bs168_3ks_100_100_out.maf
-rw-r--r-- 1 bach bach   4347658 2010-06-08 04:52 bs168_3ks_100_100_out.padded.fasta
-rw-r--r-- 1 bach bach  13040564 2010-06-08 04:52 bs168_3ks_100_100_out.padded.fasta.qual
-rw-r--r-- 1 bach bach 436189259 2010-06-08 04:53 bs168_3ks_100_100_out.tcs
-rw-r--r-- 1 bach bach   4269919 2010-06-08 04:52 bs168_3ks_100_100_out.unpadded.fasta
-rw-r--r-- 1 bach bach  12808422 2010-06-08 04:52 bs168_3ks_100_100_out.unpadded.fasta.qual
-rw-r--r-- 1 bach bach   3203036 2010-06-08 04:53 bs168_3ks_100_100_out.wig
../bs168_3ks_100_100_d_results$ nucmer -maxmatch -c 100 -p nucmer ../../../../origdata/AL009126.fna bs168_3ks_100_100_out.unpadded.fasta
1: PREPARING DATA
2,3: RUNNING mummer AND CREATING CLUSTERS
[... some lines left out ...]
4: FINISHING DATA
../bs168_3ks_100_100_d_results$ delta-filter -q -l 1000 nucmer.delta > nucmer.delta.q
../bs168_3ks_100_100_d_results$ show-coords -r -c -l nucmer.delta.q > nucmer.coords
../bs168_3ks_100_100_d_results$ cat nucmer.coords
NUCMER

   [S1]     [E1]  |    [S2]     [E2]  | [LEN 1]  [LEN 2] | [% IDY] | [LEN R]  [LEN Q] | [COV R] [COV Q] | [TAGS]
===============================================================================================================================
    181  4215606  | 4199698        1  | 4215426  4199698 |   99.62 | 4215606  4199898 |  100.00  100.00 | gi|225184640|emb|AL009126.3|  bs168_3ks_100_100_c1
As already said: not 100% perfect on a base-by-base basis, but good enough for use as a reference sequence in subsequent mapping assemblies to get all the bases right.
TO BE EXPANDED: no real walkthrough yet, just a few hints.
Prepare your PacBio data as explained in this guide.
Prepare your other data (Sanger, 454, Solexa, or any combination thereof) as explained in the respective MIRA guides.
Start MIRA with, e.g., --job=denovo,genome,accurate,pacbio,solexa (and any other parameter you need) for a denovo genome assembly at accurate level with PacBio and Solexa data.
This document is not complete yet and some sections may be a bit unclear. I'd be happy to receive suggestions for improvements.
Some reading requirements | |
---|---|
This guide assumes that you have basic working knowledge of Unix systems, know the basic principles of sequencing (and sequence assembly) and what assemblers do. Basic knowledge on mRNA transcription and EST sequences should also be present. While there are step by step walkthroughs on how to setup your EST data and then perform assemblies regarding different requirements, this guide expects you to read at some point in time
|
Assembling ESTs can be, from an assembler's point of view, pure horror. E.g., it may be that some genes have thousands of transcripts while other genes have just one single transcript in the sequenced data. Furthermore, the presence of 5' and 3' UTRs, transcription variants, splice variants, homologues, SNPs etc.pp complicates the assembly in some rather interesting ways.
Poly-A tails are part of the mRNA and therefore also part of the sequenced data. They can occur as poly-A or poly-T, depending on from which direction and which part of the mRNA was sequenced. Having poly-A/T tails in the data is something of a double-edged sword. More specifically, if the 3' poly-A tail is kept unmasked in the data, transcripts having this tail will very probably not align with similar transcripts from different splice variants (which is basically good). On the other hand, homopolymers (multiple consecutive bases of the same type) like poly-As are features that are pretty difficult to get correct with today's sequencing technologies, be it Sanger, Solexa or, with even more problems, 454. So slight errors in the poly-A tail could lead to wrongly assigned splice sites ... and wrongly split contigs.
This is the reason why many people cut off the poly-A tails. Which in turn may lead to transcripts from different splice variants being assembled together.
Either way, it's not pretty.
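As a toy illustration of the tail-clipping option (a deliberately naive sketch of my own; real tools like SeqClean or SeqTrim also cope with sequencing errors inside the tail):

```python
import re

def trim_polya_tails(seq, min_run=8):
    """Naively cut a trailing poly-A run or a leading poly-T run,
    the two orientations in which mRNA tails show up in reads."""
    seq = re.sub("[aA]{%d,}$" % min_run, "", seq)
    seq = re.sub("^[tT]{%d,}" % min_run, "", seq)
    return seq

print(trim_polya_tails("GCATGCATAAAAAAAAAA"))  # trailing poly-A cut
print(trim_polya_tails("TTTTTTTTTTGCATGCAT"))  # leading poly-T cut
```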
Single transcripts (or very lowly expressed transcripts) containing SNPs, splice variants or similar differences to other, more highly expressed transcripts are a problem: it's basically impossible for an assembler to distinguish them from reads containing junky data (e.g. reads with a high error rate or chimeras). The standard setting of many EST assemblers and clusterers is therefore to remove these reads from the assembly set. MIRA handles things a bit differently: depending on the settings, single transcripts with sufficiently large differences are either treated as debris or can be saved as singlets.
Chimeras are sequences containing adjacent base stretches which do not occur in the organism as sequenced, neither as DNA nor as (m)RNA. Chimeras can be created through recombination effects during library construction or sequencing. Chimeras can, and often do, lead to misassemblies of sequence stretches into one contig although they do not belong together. Have a look at the following example where two stretches (denoted by x and o) are joined by a chimeric read r4 containing both stretches:
r1 xxxxxxxxxxxxxxxx
r2   xxxxxxxxxxxxxxxxx
r3     xxxxxxxxxxxxxxxxx
r4    xxxxxxxxxxxxxxxxxxx|oooooooooooooo
r5                        ooooooooooo
r6                          ooooooooooo
r7                           ooooooooo
The site of the recombination event is denoted by x|o in read r4. MIRA does have a chimera detection -- which works very well in genome assemblies due to high enough coverage -- by searching for sequence stretches which are not covered by overlaps. In the above example, the chimera detection routine will almost certainly flag read r4 as chimera and only use a part of it: either the x or the o part, depending on which part is longer. There is always a chance that r4 is a valid read though, but that's a risk to take.
Now, that strategy would also work totally fine in EST projects if one did not have to account for lowly expressed genes. Imagine the following situation:
s1 xxxxxxxxxxxxxxxxx
s2         xxxxxxxxxxxxxxxxxxxxxxxxx
s3                          xxxxxxxxxxxxxxx
Look at read s2; from an overlap coverage perspective, s2 could also very well be a chimera, leading to a break of an otherwise perfectly valid contig if s2 were cut back accordingly. This is why chimera detection is switched off by default in MIRA.
Warning | |
---|---|
When starting an EST assembly via the It is up to you to decide what you want or need. |
Another interesting problem for de-novo assemblers are non-normalised EST libraries. In each cell, the number of mRNA copies per gene may differ by several orders of magnitude, from a single transcript to several tens of thousands. Pre-sequencing normalisation is a wet-lab procedure to approximately equalise those copy numbers. This can, however, introduce other artifacts.
If an assembler is fed with non-normalised EST data, it may very well be that an overwhelming number of the reads comes from only a few genes (house-keeping genes). In Sanger sequencing projects this could mean a couple of thousand reads per gene. In 454 sequencing projects, this can mean several tens of thousands of reads per gene. With Solexa data, this number can grow to something close to a million.
Several effects then hit a de-novo assembler, the three most annoying being (in ascending order of annoyance): a) non-random sequencing errors then look like valid SNPs, b) sequencing and library construction artefacts start to look like valid sequences if the data set was not cleaned "enough" and more importantly, c) an explosion in time and memory requirements when attempting to deliver a "good" assembly. A sure sign of the latter are messages from MIRA about megahubs in the data set.
Note | |
---|---|
The guide on how to tackle hard projects with MIRA gives an overview on how to hunt down sequences which can lead to the assembler getting confused, be it sequencing artefacts or highly expressed genes. |
With contributions from Katrina Dlugosch
EST sequences necessarily contain fragments of vectors or primers used to create cDNA libraries from RNA, and may additionally contain primer and adaptor sequences used during amplification-based library normalisation and/or high-throughput sequencing. These contaminant sequences need to be removed prior to assembly. MIRA can trim sequences by taking contaminant location information from a SSAHA2 or SMALT search output, or users can remove contaminants beforehand by trimming sequences themselves or masking unwanted bases with lowercase or other characters (e.g. 'x', as with cross_match). Many folks use preprocessing trimming/masking pipelines because it can be very important to try a variety of settings to verify that you've removed all of your contaminants (and fragments thereof) before sending them into an assembly program like MIRA. It can also be good to spend some time seeing what contaminants are in your data, so that you get to know what quality issues are present and how pervasive they are.
Two features of next generation sequencing can introduce errors into contaminant sequences that make them particularly difficult to remove, arguing for preprocessing: First, most next-generation sequence platforms seem to be sensitive to excess primers present during library preparation, and can produce a small percentage of sequences composed entirely of concatenated primer fragments. These are among the most difficult contaminants to remove, and the program TagDust (http://genome.gsc.riken.jp/osc/english/dataresource/) was recently developed specifically to address this problem. Second, 454 EST data sets can show high variability within primer sequences designed to anchor to polyA tails during cDNA synthesis, because 454 has trouble calling the length of the necessary A and T nucleotide repeats with accuracy.
A variety of programs exist for preprocessing. Popular ones include cross_match (http://www.phrap.org/phredphrapconsed.html) for primer masking, and SeqClean (http://compbio.dfci.harvard.edu/tgi/software/), Lucy (http://lucy.sourceforge.net/), and SeqTrim (http://www.scbi.uma.es/cgi-bin/seqtrim/seqtrim_login.cgi) for both primer and polyA/T trimming. The pipeline SnoWhite (http://evopipes.net) combines Seqclean and TagDust with custom scripts for aggressive sequence and polyA/T trimming (and is tolerant of data already masked using cross_match). In all cases, the user must provide contaminant sequence information and adjust settings for how sensitive the programs should be to possible matches. To find the best settings, it is helpful to look directly at some of the sequences that are being trimmed and inspect them for remaining primer and/or polyA/T fragments after cleaning.
Warning | |
---|---|
When using mira or miraSearchESTSNPs with the simplest parameter calls (using the "--job=..." quick switches), the default settings used include pretty heavy sequence pre-processing to cope with noisy data. Especially if you have your own pre-processing pipeline, you must then switch off the different clip algorithms that you might have applied previously yourself. Especially poly-A clips should never be run twice (by your pipeline and by mira) as they invariably lead to too many bases being cut away in some sequences. |
Note | |
---|---|
Here too: In some cases MIRA can get confused if something with the pre-processing went wrong because, e.g., unexpected sequencing artefacts like unknown sequencing vectors or adaptors remain in data. The guide on how to tackle hard projects with MIRA gives an overview on how to hunt down sequences which can lead to the assembler getting confused, be it sequencing artefacts or highly expressed genes. |
MIRA in its base settings is an assembler and not a clusterer, although it can be configured as such. As an assembler, it will split up read groups into different contigs if it thinks there is enough evidence that they come from different RNA transcripts.
Imagine this simple case: a gene has two slightly different alleles and you've sequenced this:
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
Depending on base qualities and settings used during the assembly like, e.g., [-CO:mr:mrpg:mnq:mgqrt:emea:amgb] MIRA will recognise that there's enough evidence for a T and also enough evidence for a G at that position and create two contigs, one containing the "T" allele, one the "G". The consensus will be >99% identical, but not 100%.
Things become complicated if one has to account for errors in sequencing. Imagine you sequenced the following case:
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
It shows very much the same situation as the one above, except that there's only one read with a "G" instead of 4 reads. MIRA will, when using standard settings, treat this as an erroneous base and leave all these reads in one contig. It will likewise not mark it as SNP in the results. However, this could also very well be a lowly expressed transcript with a single base mutation. It's virtually impossible to tell which of the possibilities is right.
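The decision sketched in these two examples boils down to counting evidence per base at the alignment column (a toy model of my own of the idea behind [-CO:mrpg:mnq:mgqrt], not MIRA's actual rules):

```python
def bases_with_enough_evidence(column, min_reads_per_group=2):
    """Bases at an alignment column backed by at least
    min_reads_per_group reads; each such base founds its own read group."""
    counts = {}
    for base in column:
        counts[base] = counts.get(base, 0) + 1
    return sorted(b for b, n in counts.items() if n >= min_reads_per_group)

print(bases_with_enough_evidence("TTTTTGGGG"))  # ['G', 'T']: split into two contigs
print(bases_with_enough_evidence("TTTTTG"))     # ['T']: lone 'G' looks like an error
```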
Note | |
---|---|
You can of course force MIRA to mark situations like the one depicted above by, e.g., changing the parameters for [-CO:mrpg:mnq:mgqrt]. But this may have the side-effect that sequencing errors get an increased chance of getting flagged as SNP. |
Further complications arise when SNPs and potential sequencing errors meet at the same place. Consider the following case:
A1-1 ...........T...........
A1-2 ...........T...........
A1-3 ...........T...........
A1-4 ...........T...........
A1-5 ...........T...........
B2-1 ...........G...........
B2-2 ...........G...........
B2-3 ...........G...........
B2-4 ...........G...........
E1-1 ...........A...........
This example is exactly like the first one, except an additional read E1-1 has made its appearance and has an "A" instead of a "G" or "T". Again it is impossible to tell whether this is a sequencing error or a real SNP. MIRA handles these cases in the following way: it will recognise two valid read groups (one having a "T", the other a "G") and, in assembly mode, split these two groups into different contigs. It will also play safe and define that the single read E1-1 will not be attributed to either one of the contigs but, if it cannot be assembled to other reads, form an own contig ... if need be, even as a single read (a singlet).
Note | |
---|---|
Depending on some settings, singlets may either appear in the regular results or end up in the debris file. |
Gaps in alignments of transcripts are handled very cautiously by MIRA. The standard settings will lead to the creation of different contigs if three or more consecutive gaps are introduced in an alignment. Consider the following example:
A1-1 ..........CGA..........
A1-2 ..........*GA..........
A1-3 ..........**A..........
B2-1 ..........***..........
B2-2 ..........***..........
Under normal circumstances, MIRA will use the reads A1-1, A1-2 and A1-3 to form one contig and put B2-1 and B2-2 into a separate contig. MIRA would do this even if there were only one of the B2 reads.
The reason behind this is that the probability of having gaps of three or more bases purely due to sequencing errors is pretty low. MIRA will therefore treat reads with such differences as coming from different transcripts and not assemble them together, though this can be changed using the [-AL:egp:egpl] parameters of MIRA if wanted.
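The "three or more consecutive gaps" rule can be sketched as follows. This is an illustrative simplification with made-up helper names, assuming a pairwise alignment where gaps are denoted by '*'; MIRA's real implementation works on full multi-read alignments and tunable penalty levels:

```python
def max_gap_run(aligned_a, aligned_b):
    """Longest run of columns where exactly one of the two aligned
    reads has a gap ('*') while the other has a base."""
    best = run = 0
    for x, y in zip(aligned_a, aligned_b):
        if (x == '*') != (y == '*'):
            run += 1
            best = max(best, run)
        else:
            run = 0
    return best

def same_transcript(aligned_a, aligned_b, max_allowed_gaps=2):
    """Toy version of the rule above: reads differing by a run of
    three or more consecutive gaps go into separate contigs."""
    return max_gap_run(aligned_a, aligned_b) <= max_allowed_gaps

# A1-1 vs A1-2 differ by a single gap -> same contig:
print(same_transcript("..CGA..", "..*GA.."))   # True
# A1-1 vs B2-1 differ by a run of three gaps -> separate contigs:
print(same_transcript("..CGA..", "..***.."))   # False
```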
Problems with homopolymers, especially in 454 sequencing | |
---|---|
As 454 sequencing has a general problem with homopolymers, this rule will sometimes lead to the formation of more contigs than expected due to sequencing errors at "long" homopolymer sites ... where "long" starts at ~7 bases. Though MIRA knows about the homopolymer problem of 454 data and has some routines which try to mitigate it, this is not always successful. |
The assembly of ESTs can be done in two ways when using the MIRA3 system: by using mira or miraSearchESTSNPs.
If one has data from only one strain, mira using the "--job=est" quickmode switch is probably the way to go as it's easier to handle.
For data from multiple strains where one wants to search SNPs, miraSearchESTSNPs is the tool of choice. It's an automated pipeline that is able to assemble transcripts cleanly according to given organism strains. Afterwards, an integrated SNP analysis highlights the exact nature of mutations within the transcripts of different strains.
Using mira in EST projects is quite useful to get a first impression of a given data set or when used in projects that have no strain or only one strain.
It is recommended to use 'est' in the [-job=] quick switch to get good initial default settings and then, if needed, adapt them with your own settings.
Note that by their nature, single transcripts end up in the debris file as they do not match any other reads and therefore cannot be aligned.
An interesting approach to find differences in multiploid genes is to use the result of an "mira --job=est ..." assembly as input for the third step of the miraSearchESTSNPs pipeline.
Like for EST assembly, it is recommended to use 'est' in the [-job=] quick switch to get a good initial settings default. Then however, one should adapt a couple of switches to get a clustering like alignment:
-AL:egp=no
switching off extra gap penalty in alignments allows assembly of transcripts having gap differences of more than 3 bases
-AL:egpl=...
In case [-AL:egp] is not switched off, the extra gap penalty level can be fine tuned here.
-AL:megpp=...
In case [-AL:egp] is not switched off, the maximum extra gap penalty in percent can be fine tuned here. Together with [-AL:egpl] (see above), this allows MIRA to accept alignments with gaps which are two or three bases longer than the 3 bases rejection criterion of the standard [-AL:egpl=split_on_codongaps] in EST assemblies.
-CO:asir=yes
This forces MIRA to assume that valid base differences (occurring in several reads) in alignments are SNPs and not repeats/marker bases for different variants. Note that depending on whether you have only one or several strains in your assembly, you might want to enable or disable this feature to allow/disallow clustering of reads from different strains.
-CO:mrpg:mnq:mgqrt
With these three parameters you can adjust the sensitivity of the repeat / SNP discovery algorithm.
-AL:mrs=...
When [-CO:asir=no] and [-AL:egp=no], MIRA has lost two of its most potent tools for preventing the alignment of complete nonsense. In those cases, you should increase the minimum relative score required in Smith-Waterman alignments to levels higher than the usual MIRA defaults. 90 or 95 might be a good start for testing.
-CO:rodirs=...
Like [-AL:mrs] above, [-CO:rodirs] is a fallback mechanism to prevent the building of completely nonsensical contigs when [-CO:asir=no] and [-AL:egp=no]. You should decrease [-CO:rodirs] to anywhere between 10 and 0.
Please look up the complete description of the above mentioned parameters in the MIRA reference manual; they are listed here only with the reason why one should change them for a clustering assembly.
Note | |
---|---|
Remember that some of the parameters above can be set independently for reads of different sequencing technologies. E.g., when assembling EST sequences from Sanger and 454 sequencing technologies, it is absolutely possible to allow the 454 sequences to have large gaps in alignments (to circumvent the homopolymer problem), but to disallow Sanger sequences from having them. As --job=est already presets -AL:egp=yes:egpl=split_on_codongaps for all technologies, it suffices to override the extra gap penalty for the 454 reads only (454_SETTINGS -AL:egp=no).
|
miraSearchESTSNPs is a pipeline that reconstructs the pristine mRNA transcript sequences gathered in EST sequencing projects of more than one strain, which can serve as a reliable basis for subsequent analysis steps like clustering or exon analysis. This means that even genes that contain only one transcribed SNP on different alleles are first treated as different transcripts. The optional last step of the assembly process can be configured as a simple clusterer that assembles transcripts containing the same exon sequence -- but differing only at SNP positions -- into one consensus sequence. Such SNPs can then be analysed, classified and reliably assigned to their corresponding mRNA transcriptome sequence. However, it is important to note that miraSearchESTSNPs is an assembler and not a full blown clustering tool.
Generally speaking, miraSearchESTSNPs is a three-stage assembly system that was designed to catch SNPs in different strains and reconstruct the mRNA present in those strains. That is, one really should have different strains to analyse (and the information provided to the assembler) to make the most out of miraSearchESTSNPs. Here is a quick overview on what miraSearchESTSNPs does:
Step 1: assemble everything together, not caring about strain information. Potential SNPs are not treated as SNPs, but as possible repeat marker bases and are tagged as such (temporarily) to catch each and every possible sequence alignment which might be important later. As a result of this stage, the following information is written out:
Into step1_snpsinSTRAIN_<strainname>.caf go all the sequences of a given strain that are in contigs (can be aligned with at least one other sequence) and also all sequences that are singlets BUT were previously tagged as containing bases showing that they aligned previously (even to other strains) but were torn apart due to the SNP bases.
Into step1_nosnps_remain.caf go all the remaining singlets.
Obviously, if one did not provide strain information to the assembly of step 1, all the sequences belong to the same strain (named "default"). The CAF files generated in this step are the input sequences for the next step.
Note | |
---|---|
If you want to apply clippings to your data (poly-A/T or read clipping information from SSAHA2 or SMALT), then do this only in step 1! Do not try to re-apply them in step 2 or 3 (or only if you think you have very good reasons to do so). Once loaded and/or applied in step 1, the clipping information is carried on by MIRA to steps 2 and 3. |
Step 2: Now, miraSearchESTSNPs assembles each strain independently from the others. Again, sequences containing SNPs are torn apart into different contigs (or singlets) to give a clean representation of the "really sequenced" ESTs. In the end, each of the contigs (or singlets) coming out of the assemblies for the strains is a representation of the mRNA that was floating around in the given cell/strain/organism. The results of this step are written out into one big file (step2_reads.caf) and a new straindata file that goes along with those results (step2_straindata.txt).
Step 3: miraSearchESTSNPs takes the result of the previous step (which should now be clean transcripts) and assembles them together, this time allowing transcripts from different strains with different SNP bases to be assembled together. The result is then written to step3_out.* files and directories.
miraSearchESTSNPs can also be used for EST data of a single strain or when no strain information is available. In this case, it will cleanly sort apart transcripts of almost identical genes or, when eukaryotic ESTs are assembled, sort transcripts according to their respective allele when these contain mutations.
Like the normal mira, miraSearchESTSNPs keeps track of a lot of things and writes out quite a lot of additional information files after each step. Results and additional information of step 1 are stored in step1_* directories. Results and information of step 2 are in <strainname>_* directories. For step 3, it's step3_* again.
Each step of miraSearchESTSNPs can be configured exactly like mira via command line parameters.
The pipeline of miraSearchESTSNPs is almost as flexible as mira itself: if the defaults set by the quick switches are not right for your use case, you can change about any parameter you wish via the command line. There are only two things you need to pay attention to:
a straindata file must be present for step 1 (*_straindata_in.txt), but it can very well be an empty file.
the naming of the result files is fixed (for all three steps), you cannot change it.
These walkthroughs use "msd" as project name (acronym for My Simple Dataset), please replace that with your own project name according to the MIRA naming convention.
Given is just a FASTA and FASTA quality file, where the Sanger sequencing vector sequences and problematic things (like bad quality) have been either completely removed from the data or were masked with "X". Apart from that, no further processing (poly-A removal etc.) was done. Your directory looks like this:
bach@arcadia:$ ls -l
-rwxr--r-- 1 bach bach 15486163 2009-02-22 21:01 msd_in.sanger.fasta
-rwxr--r-- 1 bach bach 38017687 2009-02-22 21:01 msd_in.sanger.fasta.qual
Then, use this command:
$ mira --project=msd --job=denovo,est,normal,sanger SANGER_SETTINGS -CL:qc=no >& log_assembly.txt
We switch off the Sanger quality clips because bad quality is already trimmed away by your pipeline.
Like above, but this time with 454 sequencing, and the FASTA files contain everything (including remaining adaptors and bad quality), but there's an XML file with ancillary data which contains all necessary clips (as generated by, e.g., sff_extract):
bach@arcadia:$ ls -l
-rwxr--r-- 1 bach bach 15486163 2009-02-22 21:01 msd_in.454.fasta
-rwxr--r-- 1 bach bach 38017687 2009-02-22 21:01 msd_in.454.fasta.qual
-rwxr--r-- 1 bach bach 10433244 2009-02-22 21:01 msd_traceinfo_in.454.xml
Then, use this command:
bach@arcadia:$ mira --project=msd --job=denovo,est,normal,454 454_SETTINGS -CL:qc=no >& log_assembly.txt
We just switch off our quality clip for 454 (and load the quality clips from the XML); poly-A removal is performed by MIRA. Loading of TRACEINFO XML data does not need to be switched on as it is the default for 454 data.
Like above, but this time the data was pre-processed by another program to mask the poly-A stretches with X:
bach@arcadia:$ ls -l
-rwxr--r-- 1 bach bach 15486163 2009-02-22 21:01 msd_in.454.fasta
-rwxr--r-- 1 bach bach 38017687 2009-02-22 21:01 msd_in.454.fasta.qual
-rwxr--r-- 1 bach bach 10433244 2009-02-22 21:01 msd_traceinfo_in.454.xml
Then, use this command:
bach@arcadia:$ mira --project=msd --job=denovo,est,normal,454 454_SETTINGS -CL:qc=no:cpat=no >& log_assembly.txt
We just switch off our quality clip (and load the quality clips from the XML) and also switch off poly-A clipping. Remember, never perform poly-A/T clipping twice on a data set.
Like above, but this time we assign reads to different strains. This can happen either by putting the strain information into the XML file (using the strain field of the NCBI TRACEINFO format definition) or by using a two column, tab-delimited file which mira loads on request.
As written, when using XML no change to the command line from the last example would be needed. This example uses the extra file with strain information. The file msd_straindata_in.txt contains key value pair information on the relationship of reads to strains and looks like this (gnlti* are names of reads):
bach@arcadia:$ cat msd_straindata_in.454.txt
gnlti136478626 tom
gnlti136479357 tom
gnlti136479063 tom
gnlti136478624 jerry
gnlti136479522 jerry
gnlti136477918 jerry
Then, use this command (note the additional [-LR:lsd] option):
bach@arcadia:$ mira --project=msd --job=denovo,est,normal,454 454_SETTINGS -LR:lsd=yes -CL:qc=no:cpat=no >& log_assembly.txt
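Before starting a run with [-LR:lsd=yes], it can be worth sanity-checking the strain file with a short script. This is an illustrative helper, not part of MIRA; the function name is made up:

```python
def read_straindata(lines):
    """Parse two-column "readname <tab> strain" lines of the kind
    used in *_straindata_in.txt files (illustrative only)."""
    strains = {}
    for lineno, line in enumerate(lines, 1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 columns, got {len(fields)}")
        readname, strain = fields
        strains[readname] = strain
    return strains

# The example file from above maps each read name to its strain:
mapping = read_straindata(["gnlti136478626\ttom", "gnlti136478624\tjerry"])
print(sorted(set(mapping.values())))   # ['jerry', 'tom']
```

Reading the real file is then just `read_straindata(open("msd_straindata_in.txt"))`.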
Given just a FASTA and FASTA quality file, where the Sanger sequencing vectors and all sequencing related things (like bad quality) have been either completely removed from the data or were masked with "X". Apart from that, no further processing (poly-A removal etc.) was done.
You have n strains (in this example n=2) called "tom" and "jerry"
Your directory looks like this:
bach@arcadia:$ ls -l
-rw-r--r-- 1 bach bach  5276 2009-02-22 21:23 msd_in.sanger.fasta
-rw-r--r-- 1 bach bach 13827 2009-02-22 21:23 msd_in.sanger.fasta.qual
-rw-r--r-- 1 bach bach   120 2009-02-22 21:27 msd_straindata_in.txt
The file msd_straindata_in.txt contains key value pair information on the relationship of reads to strains and looks like this (gnlti* are names of reads):
bach@arcadia:$ cat msd_straindata_in.txt
gnlti136478626 tom
gnlti136479357 tom
gnlti136479063 tom
gnlti136478624 jerry
gnlti136479522 jerry
gnlti136477918 jerry
To assemble, use this:
bach@arcadia:$ miraSearchESTSNPs --project=msd --job=denovo,normal,sanger,esps1 >& log_assembly_esps1.txt
Note that the results of this first step are in sub-directories prefixed with "step1".
When the first step finished, continue with this (note that no "--project" is given here):
bach@arcadia:$ miraSearchESTSNPs --job=denovo,normal,esps2 >& log_assembly_esps2.txt
Note that the results of this second step are in sub-directories prefixed with "tom", "jerry" and "remain". You will find in each directory the clean transcripts from every strain/organism.
To see which SNPs exist between both "tom" and "jerry", launch the third step:
bach@arcadia:$ miraSearchESTSNPs --job=denovo,normal,esps3 >& log_assembly_esps3.txt
Note that the results of this third step are in sub-directories prefixed with "step3".
In the step3_d_results directory, for example, you can transform the CAF file into a gap4 database and then look at the SNPs by searching for the tags SROr, SIOr and SAOr.
MIRA makes results available in quite a number of formats: CAF, ACE, FASTA and a few others. The preferred formats are CAF and MAF, as these formats can be translated into any other supported format.
For the assembly, MIRA creates a directory named projectname_assembly in which a number of sub-directories will have appeared.
Note | |
---|---|
The projectname is determined by the mira parameter --project=... or, if used, the specific --proout=... parameter. |
These sub-directories (and files within) contain the results of the assembly itself, general information and statistics on the results and -- if not deleted automatically by MIRA -- a log directory with log files and temporary data:
projectname_d_results: this directory contains all the output files of the assembly in different formats.
projectname_d_info: this directory contains information files of the final assembly. They provide statistics as well as, e.g., information (easily parseable by scripts) on which read is found in which contig etc.
projectname_d_log: this directory contains log files and temporary assembly files. It can be safely removed after an assembly as there may easily be a few GB of data in there that are normally not needed anymore.
projectname_d_chkpt: this directory contains checkpoint files needed to resume assemblies that crashed or were stopped (not implemented yet, but soon).
The following files in projectname_d_results contain the results of the assembly in different formats. Depending on the output options of MIRA, some files may or may not be there. As long as the CAF or MAF format is present, you can translate your assembly later on into about any supported format with the convert_project program supplied with the MIRA distribution:
projectname_out.txt: this file contains in a human readable format the aligned assembly results, where all input sequences are shown in the context of the contig they were assembled into. This file is just meant as a quick way for people to have a look at their assembly without specialised alignment finishing tools.
projectname_out.padded.fasta: this file contains as FASTA sequences the consensus of the contigs that were assembled in the process. Positions in the consensus containing gaps (also called 'pads', denoted by an asterisk) are still present. The computed consensus qualities are in the corresponding projectname_out.padded.fasta.qual file.
projectname_out.unpadded.fasta: as above, this file contains as FASTA sequences the consensus of the contigs that were assembled in the process, but positions in the consensus containing gaps were removed. The computed consensus qualities are in the corresponding projectname_out.unpadded.fasta.qual file.
projectname_out.caf: this is the result of the assembly in CAF format, which can be further worked on with, e.g., tools from the caftools package from the Sanger Centre and later on be imported into, e.g., the Staden gap4 assembly and finishing tool.
projectname_out.ace: this is the result of the assembly in ACE format. This format can be read by viewers like the TIGR clview or by consed from the phred/phrap/consed package.
projectname_out.gap4da: this directory contains the result of the assembly suited for the direct assembly import of the Staden gap4 assembly viewer and finishing tool.
The following files in projectname_d_info contain statistics and other information files of the assembly:
projectname_info_assembly.txt: This file should be your first stop after an assembly. It will tell you some statistics as well as whether or not problematic areas remain in the result.
projectname_info_callparameters.txt: This file contains the parameters as given on the mira command line when the assembly was started.
projectname_info_contigstats.txt: This file contains in tabular format statistics about the contigs themselves: their length, average consensus quality, number of reads, maximum and average coverage, average read length, number of A, C, G, T, N, X and gaps in the consensus.
projectname_info_contigreadlist.txt: This file contains information about which reads have been assembled into which contigs (or singlets).
projectname_info_consensustaglist.txt: This file contains information about the tags (and their positions) that are present in the consensus of a contig.
projectname_info_readstooshort: A list containing the names of those reads that were sorted out of the assembly before any processing started, only due to the fact that they were too short.
projectname_info_readtaglist.txt: This file contains information about the tags and their positions that are present in each read. The read positions are given relative to the forward direction of the sequence (i.e. as it was entered into the assembly).
projectname_error_reads_invalid: A list of sequences that have been found to be invalid for various reasons (given in the output of the assembler).
Once finished, have a look at the file *_info_assembly.txt in the info directory. The assembly information given there is split into three major parts:
some general assembly information (number of reads assembled etc.). This part is quite short at the moment and will be expanded in the future.
assembly metrics for 'large' contigs.
assembly metrics for all contigs.
The part for large contigs contains several sections. The first of these shows what MIRA counts as a large contig for this particular project. As an example, this may look like this:
Large contigs:
--------------
With Contig size     >= 500
AND (Total avg. Cov  >= 19
  OR Cov(san)        >= 0
  OR Cov(454)        >= 8
  OR Cov(pbs)        >= 0
  OR Cov(sxa)        >= 11
  OR Cov(sid)        >= 0
)
The above is for a 454 and Solexa hybrid assembly in which MIRA determined large contigs to be contigs with a length of at least 500 bp and with a total average coverage of at least 19x, an average 454 coverage of at least 8x, or an average Solexa coverage of at least 11x.
The second section is about length assessment of large contigs:
Length assessment:
------------------
Number of contigs:  44
Total consensus:    3567224
Largest contig:     404449
N50 contig size:    186785
N90 contig size:    55780
N95 contig size:    34578
In the above example, 44 contigs totalling 3.56 megabases were built, the largest contig being 404 kilobases long; the N50, N90 and N95 numbers give the respective contig lengths.
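The Nxx values follow the usual definition: the length L such that all contigs of length ≥ L together cover at least xx% of the total consensus. A minimal sketch of this standard definition (MIRA's exact computation may differ slightly, e.g. in tie handling):

```python
def nxx(contig_lengths, fraction):
    """Smallest contig length L such that contigs of length >= L
    together cover at least `fraction` of the total consensus."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    return 0

lengths = [100, 80, 60, 40, 20]   # total consensus: 300
print(nxx(lengths, 0.50))  # N50 -> 80 (100 + 80 >= 150)
print(nxx(lengths, 0.90))  # N90 -> 40 (100 + 80 + 60 + 40 >= 270)
```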
The next section shows information about the coverage assessement of large contigs. An example:
Coverage assessment:
--------------------
Max coverage (total): 563
Max coverage
  Sanger: 0
  454:    271
  PacBio: 0
  Solexa: 360
  Solid:  0
Avg. total coverage (size >= 5000): 57.38
Avg. coverage (contig size >= 5000)
  Sanger: 0.00
  454:    25.10
  PacBio: 0.00
  Solexa: 32.88
  Solid:  0.00
Maximum coverage attained was 563, the maximum for 454 alone 271 and for Solexa alone 360. The average total coverage (computed from contigs with a size ≥ 5000 bases) is 57.38. The average coverage by sequencing technology (in contigs ≥ 5000 bases) is 25.10 for 454 and 32.88 for Solexa reads.
Note | |
---|---|
The value for "Avg. total coverage (size >= 5000)" is currently always calculated for contigs having 5000 or more consensus bases. While this gives a very effective measure for genome assemblies, EST assemblies will often have totally irrelevant values here as most genes in eukaryotes (and prokaryotes) tend to be smaller than 5000 bases. |
The last section contains some numbers useful for quality assessment. It looks like this:
Quality assessment:
-------------------
Average consensus quality:                   90
Consensus bases with IUPAC:                  11  (you might want to check these)
Strong unresolved repeat positions (SRMc):    0  (excellent)
Weak unresolved repeat positions (WRMc):     19  (you might want to check these)
Sequencing Type Mismatch Unsolved (STMU):     0  (excellent)
Contigs having only reads wo qual:            0  (excellent)
Contigs with reads wo qual values:            0  (excellent)
Besides the average quality of the contigs and whether they contain reads without quality values, MIRA shows the number of different tags in the consensus which might point at problems.
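If you want to pull these counts into a script, the block can be parsed with a few regular expressions. The patterns and key names below are a guess at the layout shown above, not an official parser:

```python
import re

def parse_quality_section(text):
    """Extract counts from the 'Quality assessment' block of an
    *_info_assembly.txt file (illustrative patterns only)."""
    patterns = {
        "avg_quality": r"Average consensus quality:\s+(\d+)",
        "iupac": r"Consensus bases with IUPAC:\s+(\d+)",
        "srmc": r"Strong unresolved repeat positions \(SRMc\):\s+(\d+)",
        "wrmc": r"Weak unresolved repeat positions \(WRMc\):\s+(\d+)",
    }
    return {key: int(m.group(1))
            for key, pat in patterns.items()
            if (m := re.search(pat, text))}

sample = """Quality assessment:
-------------------
Average consensus quality:                   90
Consensus bases with IUPAC:                  11  (you might want to check these)
Strong unresolved repeat positions (SRMc):    0  (excellent)
Weak unresolved repeat positions (WRMc):     19  (you might want to check these)
"""
print(parse_quality_section(sample))
# {'avg_quality': 90, 'iupac': 11, 'srmc': 0, 'wrmc': 19}
```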
The above mentioned sections (length assessment, coverage assessment and quality assessment) for large contigs are then repeated for all contigs, this time also including contigs which MIRA did not count as large.
The gap4 program from the Staden package is a pretty useful finishing tool and assembly viewer. It has its own database format which MIRA does not read or write, but interconversion is possible via the CAF format and the caf2gap and gap2caf utilities.
Conversion is pretty straightforward. From MIRA to gap4, it's like this:
$ caf2gap -project YOURGAP4PROJECTNAME -ace mira_result.caf >& /dev/null
Note | |
---|---|
Don't be fooled by the -ace parameter of caf2gap. It needs a CAF file as input, not an ACE file. |
From gap4 to CAF, it's like this:
$ gap2caf -project YOURGAP4PROJECTNAME > tmp.caf
$ convert_project -f caf -t caf -r c tmp.caf somenewname
Note | |
---|---|
Using gap2caf, be careful to use the simple > redirection to file and not the >& redirection. |
Note | |
---|---|
Using first gap2caf and then convert_project is needed because gap4 writes its own consensus to the CAF file, which is not necessarily the best one. Indeed, gap4 does not know about different sequencing technologies like 454 and treats everything as Sanger. Therefore, using convert_project with the [-r c] option recalculates a MIRA consensus during the "conversion" from CAF to CAF. |
convert_project is a tool in the MIRA package which reads and writes a number of formats, ranging from full assembly formats like CAF and MAF to simple output view formats like HTML or plain text.
Have a look at convert_project -h, which lists all possible formats and other command line options.
It is important to remember that some assembly options of mira improve the overall assembly while increasing the amount of contig debris, i.e. small contigs with low coverage that can usually be discarded. One infamous example is the uniform read distribution option ([-AS:urd]), which helps to reconstruct identical repeats across multiple locations in the genome, but as a side effect some redundant reads will end up as typical contig debris. You probably do not want to look at contig debris when finishing a genome unless you are really, really, really picky.
By default, the result files of MIRA contain everything which might play a role in automatic assembly post-processing pipelines as most sequencing centers have implemented.
Many people prefer to just go on with what would be large contigs. Therefore the convert_project program from the MIRA package can selectively filter CAF or MAF files for contigs with a certain size, average coverage or number of reads.
The file *_info_assembly.txt in the info directory at the end of an assembly might give you first hints on what could be suitable filter parameters. For example, in assemblies made in a normal (whatever this means) fashion, I routinely only consider contigs which are larger than 500 bases and have at least one third of the average coverage of the N50 contigs.
Here's an example: In the "Large contigs" section, there's a "Coverage assessment" subsection. It looks a bit like this:
...
Coverage assessment:
--------------------
Max coverage (total): 43
Max coverage
  Sanger: 0
  454:    43
  Solexa: 0
  Solid:  0
Avg. total coverage (size ≥ 5000): 22.30
Avg. coverage (contig size ≥ 5000)
  Sanger: 0.00
  454:    22.05
  Solexa: 0.00
  Solid:  0.00
...
This project was obviously a 454 only project, and the average coverage for it is ~22. This number was estimated by MIRA by taking only contigs of at least 5kb into account, which for sure left out everything which could be categorised as debris. It's a pretty solid number.
Now, depending on how much time you want to invest performing some manual polishing, you should extract contigs which have at least the following fraction of the average coverage:
2/3 if a quick and "good enough" result is what you want and you don't want to do any manual polishing. In this example, that would be around 14 or 15.
1/2 if you want to have a "quick look" and eventually perform some contig joins. In this example the number would be 11.
1/3 if you want to be quite accurate and be sure not to lose any possible repeat. That would be 7 or 8 in this example.
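The arithmetic behind these rules of thumb is simple; a small sketch (the labels are made up, the fractions are those from the list above):

```python
def coverage_cutoffs(avg_coverage):
    """Suggested minimum-average-coverage cutoffs for contig
    filtering, following the rules of thumb above."""
    return {
        "quick, no polishing (2/3)": round(avg_coverage * 2 / 3),
        "quick look, some joins (1/2)": round(avg_coverage / 2),
        "accurate, keep all repeats (1/3)": round(avg_coverage / 3),
    }

# For the 454 example above with an average coverage of ~22,
# this yields cutoffs of 15, 11 and 7 respectively:
print(coverage_cutoffs(22))
```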
Example (useful with assemblies of Sanger data): extracting only contigs ≥ 1000 bases and with a minimum average coverage of 4 into FASTA format:
$ convert_project -f caf -t fasta -x 1000 -y 4 sourcefile.caf targetfile.fasta
Example (useful with assemblies of 454 data): extracting only contigs ≥ 500 bases into FASTA format:
$ convert_project -f caf -t fasta -x 500 sourcefile.caf targetfile.fasta
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 500 bases and with an average coverage ≥ 15 reads into CAF format, then converting the reduced CAF into a Staden GAP4 project:
$ convert_project -f caf -t caf -x 500 -y 15 sourcefile.caf tmp.caf
$ caf2gap -project somename -ace tmp.caf
Example (e.g. useful with Sanger/454 hybrid assemblies): extracting only contigs ≥ 1000 bases and with ≥ 10 reads from MAF into CAF format, then converting the reduced CAF into a Staden GAP4 project:
$ convert_project -f maf -t caf -x 1000 -z 10 sourcefile.maf tmp
$ caf2gap -project somename -ace tmp.caf
Start convert_project with the -h option for help on available options.
MIRA sets a number of different tags in the resulting assemblies. They can be set in reads (in which case their names mostly end with an 'r') or in the consensus (then ending with a 'c').
If you use the Staden gap4 or consed assembly editor to tidy up the assembly, you can directly jump to places of interest that MIRA marked for further analysis by using the search functionality of these programs.
You should search for the following "consensus" tags for finding places of importance (in this order).
IUPc
UNSc
SRMc
WRMc
STMU (only hybrid assemblies)
MCVc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SROc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SAOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SIOc (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
STMS (only hybrid assemblies)
Of lesser importance are the "read" versions of the tags above:
UNSr
SRMr
WRMr
SROr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SAOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
SIOr (only when assembling different strains, i.e., mostly relevant for mapping assemblies)
In normal assemblies (only one sequencing technology, just one strain), search for the IUPc, UNSc, SRMc and WRMc tags.
In hybrid assemblies, searching for the IUPc, UNSc, SRMc, WRMc, and STMU tags and correcting only those places will allow you to have a qualitatively good assembly in no time at all.
Columns with SRMr tags (SRM in reads) in an assembly without an SRMc tag at the same consensus position show where mira was able to resolve a repeat during the different passes of the assembly ... you don't need to look at these. SRMc and WRMc tags, however, mean that there may be unresolved trouble ahead; you should take a look at these.
Especially in mapping assemblies, columns with the MCVc, SROx, SIOx and SAOx tags are extremely helpful in finding places of interest. As they are only set if you gave strain information to MIRA, you should always do that.
For more information on tags set/used by MIRA and what they exactly mean, please look up the according section in the reference chapter.
The read coverage histogram as well as the template display of gap4 will help you to spot other places of potential interest. Please consult the gap4 documentation.
I recommend investing a couple of minutes (in the best case) to a few hours in joining contigs, especially if the uniform read distribution option of MIRA was used (but first filter for large contigs). This way, you will reduce the number of "false repeats" and improve the overall quality of your assembly.
Joining contigs at repetitive sites of a genome is always a difficult decision. There are, however, two rules which can help:
The following screenshot shows a case where one should not join as the finishing program (in this case gap4) warns that no template (read-pair) span the join site:
Figure 9.1. Join at a repetitive site which should not be performed due to missing spanning templates.
The next screenshot shows a case where one should join as the finishing program (in this case gap4) finds templates spanning the join site and all of them are good:
Figure 9.2. Join at a repetitive site which should be performed due to spanning templates being good.
Remember that MIRA takes a very cautious approach in contig building, and sometimes creates two contigs when it could have created one. Three main reasons can be the cause for this:
when using uniform read distribution, some non-repetitive areas may have generated so many more reads that they start to look like repeats (so-called pseudo-repeats). In this case, reads that are above a given coverage are shaved off (see [-AS:urdcm]) and kept in reserve to be used for another copy of that repeat ... which in case of a non-repetitive region will of course never arrive. So at the end of an assembly, these shaved-off reads will form short, low coverage contig debris which can more or less be safely ignored and sorted out via the filtering options ([-x -y -z]) of convert_project.
Some 454 library construction protocols -- especially, but not exclusively, for paired-end reads -- create pseudo-repeats quite frequently. In this case, the pseudo-repeats are characterised by several reads starting at exactly the same position but which can have different lengths. Should MIRA have separated these reads into different contigs, these can be -- most of the time -- safely joined. The following figure shows such a case:
For Solexa data, a non-negligible GC bias has been reported in genome assemblies since late 2009. In genomes with moderate to high GC, this bias actually favours regions with lower GC. Examples were observed where regions with an average GC of 10% less than the rest of the genome had between two and four times more reads than the rest of the genome, leading to false "discovery" of duplicated genome regions.
when using unpaired data, the above described possibility of having "too many" reads in a non-repetitive region can also lead to a contig being separated into two contigs in the region of the pseudo-repeat.
a number of reads (sometimes even just one) can contain "high quality garbage", that is, nonsense bases which got - for some reason or another - good quality values. This garbage can be distributed on a long stretch in a single read or concern just a single base position across several reads.
While MIRA has some algorithms to deal with the disrupting effects of reads like these, the algorithms are not always 100% effective and some garbage might slip through the filters.
For some data sets you might want to assemble, MIRA will take too long or the available memory will not be sufficient. For genomes, this can be the case for eukaryotes and plants, but also for some bacteria which contain a high number of (pro-)phages, plasmids or engineered operons. For EST data sets, this concerns all projects with non-normalised libraries.
This guide is intended to get you through these problematic genomes. It is not (and cannot be) exhaustive, but it should get you going.
Use [-SK:mnr=yes:nrr=10] and give it a try. If that does not work, decrease [-SK:nrr] to anywhere between 5 and 9. If it worked well enough, increase the [-SK:nrr] parameter up to 15 or 20. But please also read on to see how to choose the "nrr" threshold.
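As a rough sketch (the project name and the quick-switch job type are placeholders, not taken from any concrete project), such an invocation could look like this:

```shell
# Hedged sketch: '--project' and '--job' values are placeholders; combine
# the masking switches with whatever quick switches you normally use.
mira --project=myproject --job=denovo,genome,accurate,454 \
     -SK:mnr=yes:nrr=10

# If memory is still insufficient, lower the ratio, e.g.:
#   -SK:mnr=yes:nrr=7
```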
The SKIM phase (all-against-all comparison) will report almost every potential hit to be checked with Smith-Waterman further downstream in the MIRA assembly process. While this is absolutely no problem for most bacteria, some genomes (eukaryotes, plants, some bacteria) have so many closely related sequences (repeats) that the data structures needed to take up all information might get much larger than your available memory. In those cases, your only chance to still get an assembly is to tell the assembler it should disregard extremely repetitive features of your genome.
There is, in most cases, one problem: one doesn't know beforehand which parts of the genome are extremely repetitive. But MIRA can help you here as it produces most of the needed information during assembly and you just need to choose a threshold from where on MIRA won't care about repetitive matches.
The key to this are the two fail-safe command line parameters which will mask "nasty" repeats from the quick overlap finder (SKIM): [-SK:mnr] and [-SK:nrr]. ([-SK:bph] also plays a role in this, but I'll come back to that later.)
If switched on [-SK:mnr=yes], MIRA will use SKIM3 k-mer statistics to find repetitive stretches. K-mers are nucleotide stretches of length k. In a perfectly sequenced genome without any sequencing error and without sequencing bias, the k-mer frequency can be used to assess how many times a given nucleotide stretch is present in the genome: if a specific k-mer is present as many times as the average frequency of all k-mers, it is a reasonable assumption to estimate that the specific k-mer is not part of a repeat (at least not in this genome).
Following the same line of thinking, if a specific k-mer frequency is now two times higher than the average of all k-mers, one would assume that this specific k-mer is part of a repeat which occurs exactly two times in the genome. For a 3x k-mer frequency, a repeat is present three times, and so on. MIRA will merge information on single k-mer frequencies into larger 'repeat' stretches and tag these stretches accordingly.
Of course, low-complexity nucleotide stretches (like poly-A in eukaryotes), sequencing errors in reads and non-uniform distribution of reads in a sequencing project will weaken the initial assumption that a k-mer frequency is representative for repeat status. But even then the k-mer frequency model works quite well and will give a pretty good overall picture: most repeats will be tagged as such.
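The frequency reasoning above can be illustrated with a toy k-mer count. This is an illustration only, not MIRA's actual implementation: the sequence, the tiny k=4 and the 2x threshold (mirroring the 'probably repeat' level described further down) are all assumptions for the example.

```shell
# Illustration only (not MIRA code): count k-mer frequencies in a toy
# "genome" and flag every k-mer occurring at >= 2x the average frequency.
# The poly-T stretch should be flagged; the unique k-mers should not.
seq="ACGTACGTTTTTTTTTTACGTACGT"
result=$(echo "$seq" | awk -v k=4 '{
    n = length($0) - k + 1
    for (i = 1; i <= n; i++) cnt[substr($0, i, k)]++   # count all k-mers
    for (m in cnt) { total += cnt[m]; kinds++ }
    avg = total / kinds                                # average frequency
    for (m in cnt)
        if (cnt[m] >= 2 * avg)
            printf "%s occurs %d times (avg %.2f) -> repeat candidate\n", \
                   m, cnt[m], avg
}')
echo "$result"
```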
Note that the parts of reads tagged as "nasty repeat" will not get masked per se; the sequence will still be present. The stretches dubbed repetitive will get the "MNRr" tag. They will still be used in Smith-Waterman overlaps and will generate a correct consensus if included in an alignment, but they will not be used as seeds.
Some reads will invariably end up being completely repetitive. These will not be assembled into contigs as MIRA will not see overlaps as they'll be completely masked away. These reads will end up as debris. However, note that MIRA is pretty good at discerning 100% matching repeats from repeats which are not 100% matching: if there's a single base with which repeats can be discerned from each other, MIRA will find this base and use the k-mers covering that base to find overlaps.
The ratio from which on the MIRA SKIM algorithm won't report matches is set via [-SK:nrr]. E.g., using [-SK:nrr=10] will hide all k-mers which occur at a frequency 10 times (or more) higher than the median of all k-mers.
The nastiness of a repeat is difficult to judge, but starting with 10 copies in a genome, things can get complicated. At 20 copies, you'll have some troubles for sure.
The default of 10 for the [-SK:nrr] parameter is a pretty good 'standard' value which can be tried for an assembly before trying to optimise it via studying the hash statistics calculated by MIRA. For the latter, please read the section 'Examples for hash statistics' further down in this guide.
If [-SK:mnr=yes] is used, MIRA will write an additional file into the log directory: <projectname>_int_skimmarknastyrepeats_nastyseq_preassembly.0.lst
The "nastyseq" file makes it possible to try and find out what makes sequencing data nasty. It's a key-value file with the name of the sequence as "key" and the nasty sequence as "value". "Nasty" in this case means everything which was masked via [-SK:mnr=yes].
The file looks like this:
read1 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
read2 CCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGCCGAAGC ...
read2 AAAAAAAAAAAAAAAAAAAAAAAAAAAA ...
read3 GCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCTTCGGCT ...
... etc.
Note that reads can have several disjunct nasty repeats, hence they can occur more than one time in the file as shown with read2 in the example above.
One will need to search some databases with the "nasty" sequences and find vector sequences, adaptor sequences or even human sequences in bacterial or plant genomes ... or vice versa, as this type of contamination happens quite easily with data from new sequencing technologies. After a while one gets a feeling for what constitutes the largest part of the problem and one can start to think of taking countermeasures like filtering, clipping, masking etc.
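To search databases with these sequences, the key-value file first needs to be turned into FASTA. The snippet below is one way of doing that; it is an assumption/sketch, not an official MIRA tool, and the toy input file stands in for a real nastyseq listing. Reads occurring several times get numbered suffixes so the FASTA names stay unique.

```shell
# Toy stand-in for a real nastyseq listing (key-value: name, sequence).
cat > nastyseq.lst <<'EOF'
read1 GCTTCGGCTTCGGCTTCGGCTTCGG
read2 CCGAAGCCGAAGCCGAAGCCGAAG
read2 AAAAAAAAAAAAAAAAAAAAAAAA
EOF

# Convert to FASTA; '++seen[...]' numbers repeated read names.
awk '{ printf ">%s_%d\n%s\n", $1, ++seen[$1], $2 }' \
    nastyseq.lst > nastyseq.fasta
```

The resulting nastyseq.fasta can then be fed to BLAST or a similar search tool.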
During SKIM phase, MIRA will assign frequency information to each and every k-mer of all reads in a sequencing project, giving them different status. Additionally, tags are set in the reads so that one can assess reads in assembly editors that understand tags (like gap4, gap5, consed etc.). The following tags are used:
coverage below average (default: < 0.5 times average)
coverage at average (default: ≥ 0.5 times average and ≤ 1.5 times average)
coverage above average (default: > 1.5 times average and < 2 times average)
probably repeat (default: ≥ 2 times average and < 5 times average)
'crazy' repeat (default: > 5 times average)
stretches which were masked away by [-SK:mnr=yes] as being more than [-SK:nrr=...] times repetitive.
Selecting the right ratio so that an assembly fits into your memory is not straightforward. But MIRA can help you a bit: during assembly, some frequency statistics are printed out (they'll probably end up in some info file in later releases). Search for the term "Hash statistics" in the information printed out by MIRA (this happens quite early in the process).
Some explanation how bph affects the statistics and why it should be chosen >=16 for [-SK:mnr]
This example is taken from a pretty standard bacterium where Sanger sequencing was used:
Hash statistics:
=========================================================
Measured avg. coverage: 15

Deduced thresholds:
-------------------
  Min normal cov: 7
  Max normal cov: 23
      Repeat cov: 29
       Crazy cov: 120
        Mask cov: 150

Repeat ratio histogram:
-----------------------
     0    475191
     1   5832419
     2    181994
     3      6052
     4      4454
     5       972
     6         4
     7         8
    14         2
    16        10
=========================================================
The above can be interpreted like this: the expected coverage of the genome is 15x. Starting with an estimated hash frequency of 29, MIRA will treat a k-mer as 'repetitive'. As shown in the histogram, the overall picture of this project is pretty healthy:
only a small fraction of k-mers have a repeat level of '0' (these would be k-mers in regions with quite low coverage or k-mers containing sequencing errors)
the vast majority of k-mers have a repeat level of 1 (so that's non-repetitive coverage)
there is a small fraction of k-mers with repeat level of 2-10
there are almost no k-mers with a repeat level >10
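When in doubt about a cutoff, one can also sum the histogram tail to see how many k-mers a given [-SK:nrr] value would hide. The numbers below are copied from the first (Sanger) example above; the little awk sketch is an assumption of mine, not a MIRA utility.

```shell
# Histogram copied from the Sanger example: "repeat ratio  count".
cat > ratio_histogram.txt <<'EOF'
0 475191
1 5832419
2 181994
3 6052
4 4454
5 972
6 4
7 8
14 2
16 10
EOF

# Sum all counts at or above the chosen -SK:nrr cutoff.
nrr=10
masked=$(awk -v nrr=$nrr '$1+0 >= nrr+0 { s += $2 } END { print s+0 }' \
         ratio_histogram.txt)
echo "k-mers masked with -SK:nrr=$nrr: $masked"
```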
Here's in comparison a profile for a more complicated bacterium (454 sequencing):
Hash statistics:
=========================================================
Measured avg. coverage: 20

Deduced thresholds:
-------------------
  Min normal cov: 10
  Max normal cov: 30
      Repeat cov: 38
       Crazy cov: 160
        Mask cov: 0

Repeat ratio histogram:
-----------------------
     0   8292273
     1   6178063
     2    692642
     3     55390
     4     10471
     5      6326
     6      5568
     7      3850
     8      2472
     9       708
    10       464
    11       270
    12       140
    13       136
    14       116
    15        64
    16        54
    17        54
    18        52
    19        50
    20        58
    21        36
    22        40
    23        26
    24        46
    25        42
    26        44
    27        32
    28        38
    29        44
    30        42
    31        62
    32       116
    33        76
    34        80
    35        82
    36       142
    37       100
    38       120
    39        94
    40       196
    41       172
    42       228
    43       226
    44       214
    45       164
    46       168
    47       122
    48       116
    49        98
    50        38
    51        56
    52        22
    53        14
    54         8
    55         2
    56         2
    57         4
    87         2
    89         6
    90         2
    92         2
    93         2
  1177         2
  1181         2
=========================================================
The difference from the first bacterium shown is pretty striking:
first, the number of k-mers at repeat level 0 (below average) is higher than the number at level 1! This points to a higher number of sequencing errors in the 454 reads than in the Sanger project shown previously. Or to a more uneven distribution of reads (but not in this special case).
second, the repeat level histogram does not trail off at a repeat frequency of 10 or 15, but has a long tail up to the fifties, even having a local maximum at 42. This points to a small part of the genome being heavily repetitive ... or to (a) plasmid(s) in high copy numbers.
Should MIRA ever have problems with this genome, switch on the nasty repeat masking and use a level of 15 as cutoff. In this case, 15 is OK to start with as a) it's a bacterium, it can't be that hard and b) the frequencies above level 5 are in the low thousands and not in the tens of thousands.
Finally, here is the profile of an E. coli K12 MG1655 strain sequenced with Solexa:

Hash statistics:
=========================================================
Measured avg. coverage: 23

Deduced thresholds:
-------------------
  Min normal cov: 11
  Max normal cov: 35
      Repeat cov: 44
       Crazy cov: 184
        Mask cov: 0

Repeat ratio histogram:
-----------------------
     0   1365693
     1   8627974
     2    157220
     3     11086
     4      4990
     5      3512
     6      3922
     7      4904
     8      3100
     9      1106
    10       868
    11       788
    12       400
    13       186
    14        28
    15        10
    16        12
    17         4
    18         4
    19         2
    20        14
    21         8
    25         2
    26         8
    27         2
    28         4
    30         2
    31         2
    36         4
    37         6
    39         4
    40         2
    45         2
    46         8
    47        14
    48         8
    49         4
    50         2
    53         2
    56         6
    59         4
    62         2
    63         2
    67         2
    68         2
    70         2
    73         4
    75         2
    77         4
=========================================================
These hash statistics show that MG1655 is pretty boring (from a repetitive point of view). One might expect a few repeats but nothing fancy: the repeats are actually the rRNA and sRNA stretches in the genome plus some intergenic regions.
the number of k-mers at repeat level 0 (below average) is considerably lower than at level 1, so the Solexa sequencing quality is pretty good and there shouldn't be too many low coverage areas.
the histogram tail shows some faint traces of possibly highly repetitive k-mers, but these are false positive matches due to some standard Solexa base-calling weaknesses of earlier pipelines like, e.g., adding poly-A, poly-T or sometimes poly-C and poly-G tails to reads when spots in the images were faint and the base calls of bad quality.
Well, duh! But it's interesting what kind of mails I sometimes get. Like in:
“We've sequenced a one gigabase, diploid eukaryote with Solexa 36bp paired-end with 200bp insert size at 25x coverage. Could you please tell us how to assemble this data set de-novo to get a finished genome?”
A situation like the above should have never happened. Good sequencing providers are interested in keeping customers long term and will therefore try to find out what exactly your needs are. These folks generally know their stuff (they're making a living out of it) and most of the time propose you a strategy that fulfills your needs for a near minimum amount of money.
Listen to them.
If you think they are trying to rip you off or are overselling their competencies (which most providers I know won't even think of trying, but there are some), ask for a quote from a couple of other providers. You'll see pretty quickly if something is not right.
Note: As a matter of fact, a rule which has saved me time and again for finding sequencing providers is not to go for the cheapest provider, especially if their price is far below quotes from other providers. They're cutting corners somewhere others don't cut for a reason.
For de-novo assembly of genomes, the MIRA quick switches (--job=...) are optimised for 'decent' coverages which are commonly used to get something useful, i.e., ≥ 7x for Sanger, ≥ 18x for 454 FLX, ≥ 25x for 454 GS20. Should you venture into lower coverages, you will need to adapt a few parameters (clipping etc.) via extensive switches.
There's one thing to be said about coverage and de-novo assembly: especially for bacteria, getting more than 'decent' coverage with 454 FLX or Titanium is cheap. Every assembler I know will be happy to assemble de-novo genomes with coverages of 25x, 30x, 40x ... and the number of contigs will still drop dramatically between a 15x 454 and a 30x 454 project.
With the introduction of the Titanium series, a full 454 plate may seem to be too much: you should get at least 200 megabases out of a plate; press releases from 454 seem to suggest 400 to 600 megabases.
In any case, do some calculations: if the coverage you expect to get reaches 50x (e.g. 200MB raw sequence for a 4MB genome), then you (respectively the assembler) can still throw away the worst 20% of the sequence (with lots of sequencing errors) and concentrate on the really, really good parts of the sequences to get you nice contigs.
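The back-of-the-envelope calculation from above, written out (the numbers are the ones used in the text, not from any concrete project):

```shell
# expected coverage = raw bases / genome size
raw=200000000     # 200 MB of raw sequence
genome=4000000    # 4 MB genome
coverage=$((raw / genome))
echo "expected coverage: ${coverage}x"   # -> 50x
```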
Including library prep, a full 454 plate will cost you between 8000 to 12000 bucks (or less). The price for a Solexa lane is also low ... 2000 or less. Then you just need to do the math: is it worth investing 10, 20, 30 or more days of wet lab work (designing primers, doing PCR sequencing etc.) trying to close remaining gaps or hunt down sequencing errors because you went for a 'low' coverage or a non-hybrid sequencing strategy? Or do you invest a few thousand bucks to get some additional coverage and considerably reduce the uncertainties and gaps which remain?
Remember, you probably want to do research on your bug and not research on how to best assemble and close genomes. So even if you put (PhD) students on the job, it's costing you time and money if you wanted to save money earlier in the sequencing. Penny-wise and pound-foolish is almost never a good strategy :-)
I do agree that with eukaryotes, things start to get a bit more interesting from the financial point of view.
Warning: There is, however, a catch-22 situation with coverage: too much coverage isn't good either. Without going into details: sequencing errors sometimes interfere heavily when coverage exceeds 80x.
So, you have decided that sequencing your bug with 454 paired-end and unpaired 454 data (or Sanger and 454, or Sanger and Solexa, or 454 and Solexa or whatever) may be a viable way to get the best bang for your buck. Then please follow this advice: prepare enough DNA in one go for the sequencing provider so that they can sequence it with all the technologies you chose without you having to prepare another batch ... or even grow another culture!
The reason: as soon as you grow another culture or prepare another batch, the probability that there is a mutation somewhere that your first batch did not have is not negligible. And if there is a mutation, even if it is only one base, there is a >95% chance that MIRA will find it, think it is some repetitive sequence (like a duplicated gene with a mutation in it) and split contigs at those places.
Now, there are times when you cannot completely be sure that different sequencing runs did not use slightly different batches (or even strains).
One example: the SFF files for SRA000156 and SRA001028 from the NCBI short read archive should both contain E. coli K12 MG1655 (two unpaired half plates and a paired-end plate). However, they contain DNA from different cultures. Furthermore, the DNA was prepared by different labs. The net effect is that the sequences in the paired-end library contain a few mutations distinct from the sequences in the two unpaired half-plates. Furthermore, the paired-end sequences contain sequences from phages that are not present in the unpaired sequences.
In those cases, provide strain information to the reads so that MIRA can discern possible repeats from possible SNPs.
This is a source of interesting problems and furthermore gets people wondering why MIRA sometimes creates more contigs than other assemblers when it usually creates less.
Here's the short story: there are data sets which include one or several high-copy plasmid(s). Here's a particularly ugly example: SRA001028 from the NCBI short read archive, which contains a plate of paired-end reads for E. coli K12 MG1655-G (ftp://ftp.ncbi.nih.gov/pub/TraceDB/ShortRead/SRA001028/).
The genome is sequenced at ~10x coverage, but during the assembly, three intermediate contigs of ~2kb each attain a silly maximum coverage of ~1800x. This means that there were ~540 copies of this plasmid (or these plasmids) in the sequencing.
When using the uniform read distribution algorithm - which is switched on by default when using "--job=" and the quality level of 'normal' or 'accurate' - MIRA will find out about the average coverage of the genome to be at ~10x. Subsequently this leads MIRA to dutifully create ~500 additional contigs (plus a number of contig debris) with various incarnations of that plasmid at an average of ~10x, because it thought that these were repetitive sites within the genome that needed to be disentangled.
Things get even more interesting when some of the plasmid / phage copies are slightly different from each other. These too will be split apart and when looking through the results later on and trying to join the copies back into one contig, one will see that this should not be done because there are real differences.
DON'T PANIC!
The only effect this has on your assembly is that the number of contigs goes up. This in turn leads to a number of questions in my mailbox why MIRA is sometimes producing more contigs than Newbler (or other assemblers), but that is another story (hint: Newbler either collapses repeats or leaves them completely out of the picture by not assembling repetitive reads).
What you can do is the following:
either you assemble everything together and then join the plasmid contigs manually after assembly, e.g. in gap4 (drawback: with really high copy numbers, MIRA will work quite a bit longer ... and you will have a lot of fun joining the contigs afterwards)
or, after you found out about the plasmid(s) and know the sequence, you filter out reads in the input data which contain this sequence and assemble the remaining reads.
This list is a collection of frequently asked questions and answers regarding different aspects of the MIRA assembler.
Note: This document needs to be overhauled.
12.1.1. Test question 1
Test answer 1
12.1.2. Test question 2
Test answer 2
I have a project which I once started quite normally via "--job=denovo,genome,accurate,454" and once with explicitly switching off the uniform read distribution "--job=denovo,genome,accurate,454 -AS:urd=no" I get less contigs in the second case and I wonder if that is not better. Can you please explain?
Since 2.9.24x1, MIRA has a feature called "uniform read distribution" which is normally switched on. This feature reduces overcompression of repeats during the contig building phase and makes sure that, e.g., a rRNA stretch which is present 10 times in a bacterium will also be present approximately 10 times in your result files.
It works a bit like this: under the assumption that reads in a project are uniformly distributed across the genome, MIRA will enforce an average coverage and temporarily reject reads from a contig when this average coverage multiplied by a safety factor is reached at a given site.
It's generally a very useful tool to disentangle repeats, but it has a slight secondary effect: rejection of otherwise perfectly good reads. The assumption of read distribution uniformity is the big problem we have here: of course it's not really valid. You sometimes have less, and sometimes more than "the average" coverage. Furthermore, the new sequencing technologies - 454 perhaps, but especially the microreads from Solexa and probably also SOLiD - show that you also have a skew towards the site of replication origin.
One example: let's assume the average coverage of your project is 8 and by chance at one place you have 17 (non-repetitive) reads, then the following happens:
p = value of the -AS:urdsip parameter

Pass 1 to p-1: MIRA happily assembles everything together and calculates a number of different things, amongst them an average coverage of ~8. At the end of pass p-1, it will announce this average coverage as a first estimate to the assembly process.

Pass p: MIRA has still assembled everything together, but at the end of each pass the contig self-checking algorithms now include an "average coverage check". They'll invariably find the 17 reads stacked and decide (looking at the -AS:urdct parameter, which I now assume to be 2) that 17 is larger than 2*8 and that this very well may be a repeat. The reads get flagged as possible repeats.

Pass p+1 to end: the "possibly repetitive" reads get a much tougher treatment in MIRA. Amongst other things, when building the contig, the contig now checks that "possibly repetitive" reads do not overstack beyond the average coverage multiplied by a safety value (-AS:urdcm), which I'll assume in this example to be 1.5. So, at a certain point, say when read 14 or 15 of that possible repeat wants to be aligned to the contig at this given place, the contig will just flatly refuse and tell the assembler to please find another place for it, be it in the contig being built or any other that follows. Of course, if the assembler cannot comply, reads 14 to 17 will end up as a contiglet (contig debris, if you want), or if it was only one read that got rejected like this, it will end up as a singlet or in the debris file.
Tough luck. I do have ideas on how to reintegrate those reads at the end of an assembly, but I have deferred doing this as in every case I had looked up, adding those reads to the contigs wouldn't have changed anything ... there's already enough coverage. What I do in those cases is simply filter away the contiglets (defined as being of small size and having an average coverage below the average coverage of the project divided by 3 (or 2.5)) from a project.
When using uniform read distribution there are too many contigs with low coverage which I don't want to integrate by hand in the finishing process. How do I filter for "good" contigs?
OK, let's get rid of the cruft. It's easy, really: you just need to look up one number, take two decisions and then launch a command.
The first decision you need to take is on the minimum average coverage the contigs you want to keep should have. Have a look at the file *_info_assembly.txt which is in the info directory after assembly. In the "Large contigs" section, there's a "Coverage assessment" subsection. It looks a bit like this:
...
Coverage assessment:
--------------------
  Max coverage (total): 43
  Max coverage
     Sanger: 0
     454:    43
     Solexa: 0
     Solid:  0
  Avg. total coverage (size ≥ 5000): 22.30
  Avg. coverage (contig size ≥ 5000)
     Sanger: 0.00
     454:    22.05
     Solexa: 0.00
     Solid:  0.00
...
This project was obviously a 454 only project, and the average coverage for it is ~22. This number was estimated by MIRA by taking only contigs of at least 5Kb into account, which for sure left out everything which could be categorised as debris. It's a pretty solid number.
Now, depending on how much time you want to invest performing some manual polishing, you should extract contigs which have at least the following fraction of the average coverage:
2/3 if a quick and "good enough" result is what you want and you don't want to do much manual polishing. In this example, that would be around 14 or 15.
1/2 if you want to have a "quick look" and eventually perform some contig joins. In this example the number would be 11.
1/3 if you want to be quite accurate and for sure not lose any possible repeat. That would be 7 or 8 in this example.
The second decision you need to take is on the minimum length your contigs should have. This decision is a bit dependent on the sequencing technology you used (the read length). The following are some rules of thumb:
Sanger: 1000 to 2000
454 GS20: 500
454 FLX: 1000
454 Titanium: 1500
Let's assume we decide for an average coverage of 11 and a minimum length of 1000 bases. Now you can filter your project with convert_project
convert_project -f caf -t caf -x 1000 -y 11 sourcefile.caf filtered.caf
I would like to find those places where MIRA wasn't sure and give it a quick shot. Where do I need to search?
Search for the following tags in gap4 or any other finishing program for finding places of importance (in this order).
IUPc
UNSc
SRMc
WRMc
STMU (only hybrid assemblies)
STMS (only hybrid assemblies)
12.2.1. What are little boys made of?
Snips and snails and puppy dog tails.
12.2.2. What are little girls made of?
Sugar and spice and everything nice.
I need the .sff files for MIRA to load ...
Nope, you don't, but it's a common misconception. MIRA does not load SFF files; it loads FASTA, FASTA qualities, FASTQ, XML, CAF, EXP and PHD. The reason why one should start from the SFFs is that those files can be used to create an XML file in TRACEINFO format. This XML contains the absolutely vital clipping information for the 454 adaptors (the sequencing vector of 454, if you want).
For 454 projects, MIRA will then load the FASTA, FASTA quality and the corresponding XML. Or from CAF, if you have your data in CAF format.
How do I extract the sequence, quality and other values from SFFs?
Use the sff_extract script from Jose Blanca at the University of Valencia to extract everything you need from the SFF files (sequence, qualities and ancillary information). The home of sff_extract is: http://bioinf.comav.upv.es/sff_extract/index.html but I am thankful to Jose for giving permission to distribute the script in the MIRA 3rd party package (separate download).
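A sketch of such an invocation follows. The option names are assumptions from one sff_extract version and may have changed; check sff_extract --help before relying on them.

```shell
# Hedged sketch: '-c' is assumed to apply the clipping information and
# '-o' to set the output base name; verify with 'sff_extract --help'.
# The intent is to produce FASTA, FASTA quality and TRACEINFO XML
# from one or more SFF files for MIRA to load.
sff_extract -c -o myproject run1.sff run2.sff
```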
Do I need the sfftools package from Roche?

No, not anymore. Use the sff_extract script to extract your reads. Though the Roche sfftools package contains a few additional utilities which could be useful.
I am trying to use MIRA to assemble reads obtained with the 454 technology but I can't combine my sff files since I have two files obtained with GS20 system and 2 others obtained with the GS-FLX system. Since they use different cycles (42 and 100) I can't use the sfffile to combine both.
You do not need to combine SFFs before translating them into something MIRA (or other software tools) understands. Use sff_extract which extracts data from the SFF files and combines this into input files.
I have no idea about the adaptor and the linker sequences, could you send me the sequences please?
Here are the sequences as filed by 454 in their patent application:
>AdaptorA CTGAGACAGGGAGGGAACAGATGGGACACGCAGGGATGAGATGG >AdaptorB CTGAGACACGCAACAGGGGATAGGCAAGGCACACAGGGGATAGG
However, looking through some earlier project data I had, I also retrieved the following (by simply making a consensus of sequences that did not match the target genome anymore):
>5prime454adaptor??? GCCTCCCTCGCGCCATCAGATCGTAGGCACCTGAAA >3prime454adaptor??? GCCTTGCCAGCCCGCTCAGATTGATGGTGCCTACAG
Go figure, I have absolutely no idea where these come from as they also do not comply with the "tcag" ending the adaptors should have.
I currently know one linker sequence (454/Roche also calls it a spacer) for GS20 and FLX paired-end sequencing:
>flxlinker GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
For Titanium data using standard Roche protocol, you need to screen for two linker sequences:
>titlinker1 TCGTATAACTTCGTATAATGTATGCTATACGAAGTTATTACG >titlinker2 CGTAATAACTTCGTATAGCATACATTATACGAAGTTATACGA
Warning: Some sequencing labs modify the adaptor sequences for tagging and similar things. Ask your sequencing provider for the exact adaptor and/or linker sequences.
Another question I have is: do the read pair sequences have further adaptors/vectors in the forward and reverse strands?
As for normal 454 reads, the normal A and B adaptors can be present in paired-end reads. In theory this could look like this:
A-Adaptor - DNA1 - Linker - DNA2 - B-Adaptor.
It's possible that one of the two DNA fragments is *very* short or is missing completely, then one has something like this:
A-Adaptor - DNA1 - Linker - B-Adaptor
or
A-Adaptor - Linker - DNA2 - B-Adaptor
And then there are all intermediate possibilities with the read not having one of the two adaptors (or both). Though it appears that the majority of reads will contain the following:
DNA1 - Linker - DNA2
There is one caveat: according to current paired-end protocols, the sequences will NOT have the direction
---> Linker <---
as one might expect when being used to Sanger Sequencing, but rather in this direction
<--- Linker --->
Is there a way I can find out which protocol was used?
Yes. The best thing to do is obviously to ask your sequencing provider.
If this is - for whatever reason - not possible, this list might help.
Are the sequences ~100-110 bases long? It's GS20.
Are the sequences ~220-250 bases long? It's FLX.
Are the sequences ~350-450 bases long? It's Titanium.
Do the sequences contain a linker (GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC)? It's a paired end protocol.
If the sequences left and right of the linker are ~29bp, it's the old short paired-end protocol (SPET, so it's most probably from a GS20). If longer, it's long paired-end (LPET, from a FLX).
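The checklist above can be condensed into a small heuristic. This is only a sketch: the function and the exact length boundaries are mine, not an official classification, and assume you look at typical read lengths of the data set:

```python
def guess_454_protocol(read_len, has_linker=False, flank_len=None):
    """Rough protocol guess following the checklist above (boundaries
    are approximate; illustrative helper, not a MIRA function)."""
    if has_linker:
        if flank_len is not None and flank_len <= 35:
            return "SPET"  # short paired-end, most probably from a GS20
        return "LPET"      # long paired-end, from a FLX
    if 100 <= read_len <= 110:
        return "GS20"
    if 220 <= read_len <= 250:
        return "FLX"
    if 350 <= read_len <= 450:
        return "Titanium"
    return "unknown"
```

Asking your sequencing provider remains the reliable way; this only automates the rule of thumb.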
I have two datasets of ~500K sequences each and the sequencing company already did an assembly (using MIRA) on the basecalled and fully processed reads (using of course the accompanying *qual file). Do you suggest that I should redo the assembly after filtering out sequences being shorter than a certain length (eg those that are <200bp)? In other words, am I taking into account low quality sequences if I do the assembly the way the sequencing company did it (fully processed reads + quality files)?
I don't think that filtering out "shorter" reads will bring much improvement. If the sequencing company used the standard Roche/454 pipeline, the quality cut-offs are already quite good; the remaining sequences, even those < 200bp, should not be of bad quality, simply a bit shorter.
Worse, you might even introduce a bias by filtering out short sequences: chemistry and library construction being what they are (rather imprecise and sometimes problematic), some parts of DNA/RNA yield smaller sequences per se ... and filtering those out might not be the best move.
You might consider redoing the assembly if the company used a rather old version of MIRA (<3.0.0 for sure, perhaps also <3.0.5).
Suppose you ran the genome of a strain that had one or more large deletions. Would it be clear from the data that a deletion had occurred?
In the question above, I assume you'd compare your strain X to a strain Ref and that X had deletions compared to Ref. Furthermore, I base my answer on data sets I have seen, which presently were 36 and 76 mers, paired and unpaired.
Yes, this would be clear. And it's a piece of cake with MIRA.
Short deletions (1 to 10 bases): they'll be tagged SROc or WRMc. General rule: deletions of up to 10 to 12% of the length of your read should be found and tagged without problem by MIRA, above that it may or may not, depending a bit on coverage, indel distribution and luck.
Long deletions (longer than read length): they'll be tagged with an MCVc tag by MIRA in the consensus. Additionally, when looking at the FASTA files generated by running the CAF result through convert_project: long stretches of sequence without coverage (the @ sign in the FASTAs) in X show missing genomic DNA.
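Finding those long zero-coverage stretches in the convert_project FASTA output amounts to looking for runs of the @ character. A hypothetical helper, not part of MIRA, with a made-up minimum length threshold:

```python
import re

def uncovered_stretches(consensus, min_len=50):
    """Return 1-based (start, end) intervals of '@' runs (zero coverage)
    of at least min_len bases in a consensus written by convert_project."""
    return [(m.start() + 1, m.end())
            for m in re.finditer(r"@+", consensus)
            if m.end() - m.start() >= min_len]

# Tiny made-up consensus: a 120-base uncovered stretch flanked by bases.
gaps = uncovered_stretches("ACGT" + "@" * 120 + "TTGA")
```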
Suppose you ran the genome of a strain X that had a plasmid missing from the reference sequence. Alternatively, suppose you ran a strain that had picked up a prophage or mobile element lacking in the reference. Would that situation be clear from the data?
Short insertions (1 to 10 bases): they'll be tagged SROc or WRMc. General rule: insertions of up to 10 to 12% of the length of your read should be found and tagged without problem by MIRA; above that it may or may not, depending a bit on coverage, indel distribution and luck.
Long insertions: it's a bit more work than for deletions. But if you ran a de-novo assembly on all reads not mapped against your reference sequence, chances are good you'd get good chunks of the additional DNA put together.
Once the Solexa paired-end protocol is completely rolled out and used on a regular basis, you would even be able to place the additional element into the genome (approximately).
Any chance you could assemble de-novo the sequence of a strain from just the Solexa data?
Warning | |
---|---|
Highly opinionated answer ahead, your mileage may vary. |
Allow me to make a clear statement on this: maybe.
But the result would probably be nothing I would call a good assembly. If you used anything below 76mers, I'm highly sceptical towards the idea of de-novo assembly with Solexa (or ABI SOLiD) reads that are in the 30 to 50bp range. They're really too short for that, even paired end won't help you much (especially if you have library sizes of just 200 or 500bp). Yes, there are papers describing different draft assemblers (SHARCGS, EDENA, Velvet, Euler and others), but at the moment the results are less than thrilling to me.
If a sequencing provider came to me with N50 numbers for an assembled genome in the 5-8 Kb range, I'd laugh in their face. Or weep. I wouldn't dare to call this even 'draft'. I'd just call it junk.
On the other hand, this could be enough for some purposes like, e.g., getting a quick overview on the genetic baggage of a bug. Just don't expect a finished genome.
Hybrid assemblies are assemblies where one used more than one sequencing technology. E.g.: Sanger and 454, or 454 and Solexa, or Sanger and Solexa etc.pp
Basically, one can choose two routes: multi-step or all-in-one-go.
Multi-step means: assemble the reads from one sequencing technology (ideally the shorter one, e.g. Solexa), fragment the resulting contigs into pseudo-reads of the longer tech and assemble these with the real reads from the longer tech (e.g. 454). The advantage of this approach is that it will probably be quite a bit faster than the all-in-one-go approach. The disadvantage is that you lose a lot of information when using only the consensus sequence of the shorter read technology for the final assembly.
All-in-one-go means: use all reads in one single assembly. The advantage of this is that the resulting alignment will be made of true reads with a maximum of information contained to allow a really good finishing. The disadvantage is that the assembly will take longer and will need more RAM.
In EST projects, do you think that the highly repetitive option will get rid of the repetitive sequences without going to the step of repeat masking?
For eukaryotes, yes. Please also consult the -SK:mnr option.
Remember: you still MUST have sequencing vectors and adaptors clipped! In EST sequences the poly-A tails should also be clipped (or let MIRA do it).
For prokaryotes, I'm a big fan of having a first look at unmasked data. Just try to start MIRA without masking the data. After something like 30 minutes, the all-vs-all comparison algorithm should be through with a first comparison round. grep the log for the term "megahub" ... if it doesn't appear, you probably don't need to mask repeats.
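The grep can of course also be done programmatically; a trivial sketch with a made-up log excerpt:

```python
def contains_megahubs(log_lines):
    """True if any line of a MIRA log mentions 'megahub'."""
    return any("megahub" in line.lower() for line in log_lines)

# Illustrative log lines (invented, not real MIRA output):
log = ["pass 1: comparing reads ...", "read xyz declared megahub"]
```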
I want to mask away some sequences in my input. How do I do that?
First, if you want to have Sanger sequencing vectors (or 454 adaptor sequences) "masked", please note that you should rather use ancillary data files (CAF, XML or EXP) and use the sequencing or quality clip options there.
Second, please make sure you have read and understood the documentation for all -CL parameters in the main manual, but especially -CL:mbc:mbcgs:mbcmfg:mbcmeg as you might want to switch it on or off or set different values depending on your pipeline and on your sequencing technology.
You can without problem mix your normal repeat masking pipeline with the FASTA or EXP input for MIRA, as long as you mask and not clip the sequence.
An example:
>E09238ARF0 tcag GTGTCAGTGTTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
(spaces inserted just as visual helper in the example sequence, they would not occur in the real stuff)
The XML will contain the following clippings: left clip = 4 (clipping away the "tcag", which are the last four bases of the adaptor used by Roche), right clip = ~90 (clipping away the "tgctgac..." lower case sequence on the right side of the sequence above).
Now, on the FASTA file that was generated with reads_sff.py or with the Roche sff* tools, you can run, e.g., a repeat masker. The result could look like this:
>E09238ARF0 tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT tgctgacgcacatgatcgtagc
The part with the Xs was masked away by your repeat masker. Now, when MIRA loads the FASTA, it will first apply the clippings from the XML file (they're still the same). Then, if the option to clip away masked areas of a read is on (-CL:mbc, which is normally on for EST projects), it will search for the stretches of X and internally also set clips on the sequence. In the example above, only the following sequence would remain as "working sequence" (the clipped parts would still be present, but not used for any computation):
>E09238ARF0 ...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA GTCAGTACAAAAAAAAAAAAAAAAAAAAGTACGT........................
Here you can also see the reason why your filters should mask and not clip the sequence: if you change the length of the sequence, the clips in the XML would not be correct anymore, wrong clippings would be made, wrong sequence reconstructed, chaos ensues and the world would ultimately end. Or something.
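The coordinate argument can be sketched in a few lines (function names and the example sequence are illustrative only): masking keeps the read length, and thus the XML clip positions, intact; clipping shifts everything downstream.

```python
def mask_region(seq, start, end):
    """Masking: replace bases with 'X'; length stays the same, so the
    XML clip coordinates stay valid (0-based, end-exclusive here)."""
    return seq[:start] + "X" * (end - start) + seq[end:]

def clip_region(seq, start, end):
    """Clipping: remove bases; everything downstream shifts left, and
    the XML clips no longer point at the right positions."""
    return seq[:start] + seq[end:]

seq = "tcagAAAAAGTACGTATGG"      # made-up read
masked = mask_region(seq, 4, 9)   # mask the A-run
clipped = clip_region(seq, 4, 9)  # clip the A-run -- coordinates break
```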
IMPORTANT! It might be that you do not want MIRA to merge the masked part of your sequence with a left or right clip, but that you want to keep it something like DNA - masked part - DNA. In this case, consult the manual for the -CL:mbc switch, either switch it off or set adequate options for the boundaries and gap sizes.
Now, if you look at the sequence above, you will see two possible poly-A tails ... at least the real poly-A tail should be masked, else you will get megahubs with all the other reads having the poly-A tail.
You have two possibilities: either you mask them yourself with your own program or you let MIRA do the job (-CL:cpat, which should normally be on for EST projects, but I forgot to set the correct switch in versions prior to 2.9.26x3, so you need to set it manually for 454 EST projects there).
IMPORTANT! Never ever use two poly-A tail maskers (your own and the one from MIRA): you would risk masking too much. Example: assume you masked the above read with a poly-A masker. The result would very probably look like this:
>E09238ARF0 tcag XXXXXXXXX TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA GTCAGTAC XXXXXXXXXXXXXXXXXXXX GTACGT tgctgacgcacatgatcgtagc
And MIRA would internally make the following out of it after loading:
>E09238ARF0 ...............TTGACTGTAAAAAAAAAGTACGTATGGACTGCATGTGCATGTCATGGTACGTGTCA GTCAGTAC..................................................
and then apply the internal poly-A tail masker:
>E09238ARF0 ...............TTGACTGT................................................ ..........................................................
You'd be left with ... well, a fragment of your sequence.
I looked in the log file and that term "megahub" you told me about appears pretty much everywhere. First of all, what does it mean?
Megahub is MIRA's internal term for a read that is massively repetitive with respect to the other reads of the project, i.e., a read that is a megahub connects to an insane number of other reads.
This is a clear sign that something is wrong. Or that you have a quite repetitive eukaryote. But most of the time it's sequencing vectors (Sanger), A and B adaptors or paired-end linkers (454), unmasked poly-A signals (EST) or non-normalised EST libraries which contain high amounts of housekeeping genes (always the same or nearly the same).
Countermeasures to take are:
set clips for the sequencing vectors (Sanger) or Adaptors (454) either in the XML or EXP files
for ESTs, mask poly-A in your input data (or let MIRA do it with the -CL:cpat parameter)
only after the above steps have been taken, use the -SK:mnr switch to let MIRA automatically mask nasty repeats; adjust the threshold with -SK:rt
if everything else fails, filter out or mask sequences yourself in the input data that come from housekeeping genes or nasty repeats.
While processing some contigs with repeats i get "Accepting probably misassembled contig because of too many iterations." What is this?
That's quite normal in the first few passes of an assembly. During each pass (-AS:nop), contigs get built one by one. After a contig has been finished, it checks itself to see whether it can find misassemblies due to repeats (and marks these internally). If there is no misassembly, perfect, build the next contig. But if there is, the contig requests immediate re-assembly of itself.
But this can happen only a limited number of times (governed by -AS:rbl). If there are still misassemblies, the contig is stored away anyway ... chances are good that in the next full pass of the assembler, enough knowledge has been gained to correctly place the reads.
So, you need to worry only if these messages still appear during the last pass. The positions that cause this are marked with "SRMc" tags in the assemblies (CAF, ACE in the result dir; and some files in the info dir).
What are the debris composed of?
sequences too short (after trimming)
megahubs
sequences almost completely masked by the nasty repeat masker ([-SK:mnr])
singlets, i.e., reads that after an assembly pass did not align into any contig (or were rejected from every contig).
sequences that form a contig with less reads than defined by [-AS:mrpc]
I do not understand why ... happened. Is there a way to find out?
Yes. The log directory contains, besides temporary data, a number of log files with more or less readable information. While development versions of MIRA keep this directory after finishing, production versions normally delete this directory after an assembly. To keep the logs also in production versions, use "-OUT:rld=no".
As MIRA also tries to save as much disk space as possible, some logs are rotated (which means that old logs get deleted). To switch off this behaviour, use "-OUT:rrol=no". Beware, the size of the log directory will increase, sometimes dramatically so.
How MIRA clipped the reads after loading them can be found in the file mira_int_clippings.0.txt. The entries look like this:
load: minleft. U13a01d05.t1 Left: 11 -> 30
Interpret this as: after loading, the read "U13a01d05.t1" had a left clipping of eleven. The "minleft" clipping option of MIRA did not like it and set it to 30.
load: bad seq. gnl|ti|1133527649 Shortened by 89 New right: 484
Interpret this as: after loading, the read "gnl|ti|1133527649" was checked with the "bad sequence search" clipping algorithm which determined that there apparently is something dubious, so it shortened the read by 89 bases, setting the new right clip to position 484.
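If you want to post-process mira_int_clippings.0.txt, a very rough parse of the two line shapes shown above could look like this. Note that this log format is not formally specified and may change between MIRA versions:

```python
def parse_clipping_line(line):
    """Split a 'load: <reason>. <readname> ...' log line into
    (reason, readname). Sketch only; assumes the shapes shown above."""
    head, rest = line.split(". ", 1)
    reason = head.replace("load: ", "")
    readname = rest.split()[0]
    return reason, readname

entry = parse_clipping_line("load: minleft. U13a01d05.t1 Left: 11 -> 30")
```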
Also, is MIRA available on a Windows platform?
As a matter of fact: it was and may be again. While I haven't done it myself, according to reports I got, compiling MIRA 2.9.3* in a Cygwin environment was actually painless. But since then BOOST and multi-threading have been included and I am not sure whether it is still as easy.
I'd be thankful for reports :-)
Table of Contents
This document describes the purpose and format of the MAF format, version 1.
I had been on the hunt for some time for a file format that allows MIRA to quickly save and load reads and full assemblies. There are currently a number of alignment file formats on the market and MIRA can read and/or write most of them. Why not take one of these? It turned out that all of them (well, the ones I know: ACE, BAF, CAF, CALF, EXP, FRG) have some kind of no-go 'feature' (or problem or bug) that makes one's life pretty difficult if one wants to write or parse that given file format.
What I needed for MIRA was a format that:
is easy to parse
is quick to parse
contains all needed information of an assembly that MIRA and many finishing programs use: reads (with sequence and qualities) and contigs, tags etc.pp
MAF is not a format with the smallest possible footprint (though it fares quite well in comparison to ACE, CAF and EXP), but as it's meant as an interchange format, it'll do. It can be easily indexed and does not need string lookups during parsing.
I took the liberty to combine many good ideas from EXP, BAF, CAF and FASTQ while defining the format and if anything is badly designed, it's all my fault.
This describes version 1 of the MAF format. If the need arises, enhancements like metadata about total number of contigs and reads will be implemented in the next version.
MAF ...
... has for each record a keyword at the beginning of the line, followed by exactly one blank (a space or a tab), then followed by the values for this record. At the moment keywords are two-character keywords, but keywords with other lengths might appear in the future.
... is strictly line oriented. Each record is terminated by a newline, no record spans across lines.
All coordinates start at 1, i.e., there is no 0 value for coordinates.
Here's an example for a simple read, just the read name and the sequence:
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
ER
Reads start with RD and end with ER, the RD keyword is always followed by the name of the read, ER stands on its own. Reads also should contain a sequence (RS). Everything else is optional. In the following example, the read has additional quality values (RQ), template definitions (name in TN, minimum and maximum insert size in TF and TT), a pointer to the file with the raw data (SF), a left clip which covers sequencing vector or adaptor sequence (SL), a left clip covering low quality (QL), a right clip covering low quality (QR), a right clip covering sequencing vector or adaptor sequence (SR), alignment to original sequence (AO), a tag (RT) and the sequencing technology it was generated with (ST).
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6≤3327<7A1/,,).('..7=@E8:
TN U13a05e07
DI F
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
QL 7
QR 30
SR 32
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
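Because each record is one "keyword blank value" line, with RD opening and ER closing a read, a read parser stays short. A minimal sketch, splitting on the first space (a full parser would also accept a tab as the blank, and handle many more keys):

```python
def parse_maf_reads(lines):
    """Collect RD..ER blocks into dicts; RQ is decoded from FASTQ-style
    (quality value + 33) characters. Sketch only, not a full MAF parser."""
    reads, current = [], None
    for line in lines:
        key, _, value = line.rstrip("\n").partition(" ")
        if key == "RD":
            current = {"name": value}
        elif key == "ER":
            reads.append(current)
            current = None
        elif key == "RQ" and current is not None:
            current["quals"] = [ord(c) - 33 for c in value]
        elif current is not None:
            current[key] = value
    return reads

records = parse_maf_reads([
    "RD U13a05e07.t1",
    "RS CTTGCATG",
    "RQ !5I",
    "SL 4",
    "ER",
])
```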
RD string: readname
RD followed by the read name starts a read.
LR integer: read length
The length of the read can be given optionally in LR. This is meant to help the parser perform sanity checks and possibly pre-allocate memory for sequence and quality.
MIRA at the moment only writes LR lines for reads with more than 2000 bases.
RS string: DNA sequence
Sequence of a read is stored in RS.
RQ string: qualities
Qualities are stored in FASTQ format, i.e., each quality value + 33 is written as a single ASCII character.
SV string: sequencing vector
Name of the sequencing vector or adaptor used in this read.
TN string: template name
Template name. This defines the DNA template a sequence comes from. In its simplest form, a DNA template is sequenced only once. In paired-end sequencing, a DNA template is sequenced once in forward and once in reverse direction (Sanger, 454, Solexa). In Sanger sequencing, several forward and/or reverse reads can be sequenced from a DNA template. In PacBio sequencing, a DNA template can be sequenced in several "strobes", leading to multiple reads on a DNA template.
DI character: F or R
Direction of the read with respect to the template. F for forward, R for reverse.
TF integer: template size from
Minimum estimated size of a sequencing template. In paired-end sequencing, this is the minimum distance of the read pair.
TT integer: template size to
Maximum estimated size of a sequencing template. In paired-end sequencing, this is the maximum distance of the read pair.
SF string: sequencing file
Name of the sequencing file which contains raw data for this read.
SL integer: seqvec left
Clip left due to sequencing vector. Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.
QL integer: qual left
Clip left due to low quality. Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.
CL integer: clip left
Clip left (any reason). Assumed to be 1 if not present. Note that left clip values are excluding, e.g.: a value of '7' clips off the left 6 bases.
SR integer: seqvec right
Clip right due to sequencing vector. Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.
QR integer: qual right
Clip right due to low quality. Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.
CR integer: clip right
Clip right (any reason). Assumed to be the length of the sequence if not present. Note that right clip values are including, e.g., a value of '10' leaves the bases 1 to 9 and clips at and including base 10 and higher.
AO four integers: x1 y1 x2 y2
AO stands for "Align to Original". The interval [x1 y1] in the read as stored in the MAF file aligns with [x2 y2] in the original, unedited read sequence. This allows modelling insertions and deletions in the read while still being able to find the correct position in the original, base-called sequence data.
A read can have several AO lines which together define all the edits performed to this read.
Assumed to be "1 x 1 x" if not present, where 'x' is the length of the unclipped sequence.
RT string + 2 integers + optional string: type x1 y1 comment
Read tags are given by naming the tag type, the positions in the read that the tag spans as the interval [x1 y1], and afterwards optionally a comment. As MAF is strictly line oriented, newline characters in the comment are encoded as \n.
If x1 > y1, the tag is in reverse direction.
The tag type can be a free form string, though MIRA will recognise and work with tag types used by the Staden gap4 package (and of course the MIRA tags as described in the main documentation of MIRA).
ST string: sequencing technology
Currently, the following technologies can be defined: Sanger, 454, Solexa, SOLiD.
SN string: strain name
Strain name of the sample that was sequenced, this is a free form string.
MT string: machine type
Machine type which generated the data, this is a free form string.
IB boolean (0 or 1): is backbone
Whether the read is a backbone. Reads used as reference (backbones) in mapping assemblies get this attribute.
IC boolean (0 or 1)
Whether the read is a coverage equivalent read (e.g. from mapping Solexa). This is internal to MIRA.
IR boolean (0 or 1)
Whether the read is a rail. This also is internal to MIRA.
ER
This ends a read and is mandatory.
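As an illustration of the AO records described above, here is a sketch mapping a 1-based position in the edited MAF read back to the original base-called read. Positions falling between AO intervals (i.e., bases inserted during editing) map to nothing; the function and example intervals are mine:

```python
def edited_to_original(pos, ao_intervals):
    """ao_intervals: list of (x1, y1, x2, y2) AO tuples, where [x1 y1]
    in the edited read aligns with [x2 y2] in the original read.
    Returns the original position for an edited position, or None."""
    for x1, y1, x2, y2 in ao_intervals:
        if x1 <= pos <= y1:
            return x2 + (pos - x1)
    return None

# Made-up example: one base inserted at edited position 11.
ao = [(1, 10, 1, 10), (12, 20, 11, 19)]
```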
Every left and right clipping pair (SL & SR, QL & QR, CL & CR) forms a clear range in the interval [left right[ in the sequence of a read. E.g. a read with SL=4 and SR=10 has the bases 1,2,3 clipped away on the left side, the bases 4,5,6,7,8,9 as clear range and the bases 10 and following clipped away on the right side.
The left clip of a read is determined as max(SL,QL,CL) (the rightmost left clip) whereas the right clip is min(SR,QR,CR).
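The combination rule can be written down directly. A sketch; as an assumption on my part it treats a missing right clip as length + 1, so that the last base stays inside the clear range [left right[:

```python
def clear_range(seq_len, SL=1, QL=1, CL=1, SR=None, QR=None, CR=None):
    """Clear range [left, right[ from MAF clip pairs: the left clip is
    the rightmost of SL/QL/CL, the right clip the leftmost of SR/QR/CR."""
    no_clip = seq_len + 1  # assumption: 'not present' means nothing clipped
    right = min(SR or no_clip, QR or no_clip, CR or no_clip)
    return max(SL, QL, CL), right

# Values from the example read above (length 40, SL 4, QL 7, QR 30, SR 32):
cr = clear_range(40, SL=4, QL=7, SR=32, QR=30)
```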
Contigs are not much more than containers for reads, plus some additional information. Contrary to CAF or ACE, MAF does not first store all reads in single containers and then define the contigs. In MAF, contigs are defined as the outer container and within those, the reads are stored like normal reads.
The above example for a read can be encased in a contig like this (with two consensus tags gratuitously added in):
CO contigname_s1
NR 1
LC 24
CS TGCCTGCAGGTCGACTCTAGAAGG
CQ -+/,36;:6≤3327<7A1/,,).
CT COMM 5 8 Some comment to this consensus tag.
CT COMM 7 12 Another comment to this consensus tag.
\\
RD U13a05e07.t1
RS CTTGCATGCCTGCAGGTCGACTCTAGAAGGACCCCGATCA
RQ ,-+*,1-+/,36;:6≤3327<7A1/,,).('..7=@E8:
TN U13a05e07
TF 1200
TT 1800
SF U13a05e07.t1.scf
SL 4
SR 32
QL 7
QR 30
AO 1 40 1 40
RT ALUS 10 15 Some comment to this read tag.
ST Sanger
ER
AT 1 24 7 30
//
EC
Note that the read shown previously (and now encased in a contig) is absolutely unchanged. It has just been complemented with a bit of data describing the contig, as well as with a one-liner placing the read into the contig.
CO string: contig name
CO starts a contig, the contig name behind is mandatory but can be any string, including numbers.
NR integer: num reads in contig
This is optional but highly recommended.
LC integer: contig length
Note that this length defines the length of the 'clear range' of the consensus. It is 100% equal to the length of the CS (sequence) and CQ (quality) strings below.
CT string + 2 integers + optional string: identifier x1 y1 comment
Consensus tags are defined like read tags but apply to the consensus. Here too, the interval [x1 y1] is including and if x1 > y1, the tag is in reverse direction.
CS string: consensus sequence
Sequence of a consensus is stored in CS.
CQ string: qualities
Consensus qualities are stored in FASTQ format, i.e., each quality value + 33 is written as a single ASCII character.
\\
This marks the start of the read data of this contig. After this, all reads are stored one after the other, each followed by an "AT" line (see below).
AT Four integers: x1 y1 x2 y2
The AT (Assemble_To) line defines the placement of the read in the contig and immediately follows the closing "ER" of a read, so that parsers do not need to perform time consuming string lookups. Every read in a contig has exactly one AT line.
The interval [x2 y2] of the read (i.e., the unclipped data, also called the 'clear range') aligns with the interval [x1 y1] of the contig. If x1 > y1 (the contig positions), then the reverse complement of the read is aligned to the contig. For the read positions, x2 is always < y2.
//
This marks the end of the read data.
EC
This ends a contig and is mandatory.
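To close the format description, here is a sketch of how an AT line places read bases into contig coordinates, including the reverse-complement case (x1 > y1). Function name and bounds handling are illustrative only:

```python
def read_pos_to_contig(read_pos, at):
    """Map a read position within the clear range [x2, y2] to its contig
    position, given an AT record (x1, y1, x2, y2). No bounds checks."""
    x1, y1, x2, y2 = at
    if x1 <= y1:                       # forward placement
        return x1 + (read_pos - x2)
    return x1 - (read_pos - x2)        # reverse complement placement

# AT 1 24 7 30 from the contig example above:
first = read_pos_to_contig(7, (1, 24, 7, 30))   # first clear-range base
last = read_pos_to_contig(30, (1, 24, 7, 30))   # last clear-range base
```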
Table of Contents
The log directory used by mira (usually <projectname>_log in stable versions and <projectname>_d_log in development versions) may contain a number of files with information which could be interesting for other uses than the pure assembly. This guide gives a short overview.
Note | |
---|---|
This guide is probably the least complete and most out-of-date as it is updated only very infrequently. If in doubt, ask on the MIRA talk mailing list. |
Warning | |
---|---|
Please note that the format of these files may change over time, although I try very hard to keep changes reduced to a minimum. |
Remember that mira has two options that control whether log files get deleted: while [-OUT:rld] removes the complete log directory after an assembly, [-OUT:rrol] removes only those log files which are not needed anymore for the continuation of the assembly. Setting both options to no will keep all log files.
A simple list of those reads that were invalid (no sequence or similar problems).
A simple list of those reads that were sorted out because the unclipped sequence was too short as defined by [-AS:mrl].
If read extension is used ([-DP:ure]), this file contains the read name and the number of bases by which the right clipping was extended.
If any of the [-CL:] options leads to the clipping of a read, this file will tell when, which clipping, which read and by how much (or to where) the clippings were set.
Note: replace the X by the pass of mira. Should any read be categorised as megahub during the all-against-all search (SKIM3), this file will tell you which.
After the initial all-against-all search (SKIM3), this file tells you how many other reads each read has overlaps with. Furthermore, reads that have more overlaps than expected are tagged with "mc" (multicopy).
Note: replace the X by the pass of mira. Similar to mira_int_posmatch_multicopystat_preassembly.0.txt, this counts the hash hits of each read to other reads. This time, however, per pass.
Note: replace the X by the pass of mira. Only written if [-SK:mnr] is set to yes. This file contains a histogram of hash occurrences encountered by SKIM3.
Note: replace the X by the pass of mira. Only written if [-SK:mnr] is set to yes. One of the more interesting files if you want to know which repetitive sequences cause the assembly to be really difficult: for each masked part of a read, the masked sequence is shown here.
E.g.
U13a04h11.t1 TATATATATATATATATATATATA
U13a05b01.t1 TATATATATATATATATATATATA
U13a05c07.t1 AAAAAAAAAAAAAAA
U13a05e12.t1 CTCTCTCTCTCTCTCTCTCTCTCTCTCTC
Simple repeats like the ones shown above will certainly pop up there, but a few other sequences (like, e.g., SINEs and LINEs in eukaryotes) will also appear.
Nifty thing to try out if you want to have a more compressed overview: sort and unify by the second column.
sort -k 2 -u mira_int_skimmarknastyrepeats_nastyseq_pass.X.lst
Note: replace the X by the pass of mira. Only written if [-CL:pvlc] is set to yes. Tells you where possible sequencing vector (or adaptor) leftovers were found and clipped (or not clipped).
Note: replace the X by the pass of mira. Which read aligns with Smith-Waterman against which other read, 'forward-forward' and 'forward-complement'.
Note: replace the X by the pass of mira. Which possible read overlaps failed the Smith-Waterman alignment check.