Change file for development version of MIRA
===========================================

Please note: the 3.1.x series should currently NOT be used for production assembly. It is meant as a testing ground for some new developments in MIRA.

Versions that contain "rc" are release candidates of an upcoming major version/revision of MIRA.

Versions that contain other characters (e.g. 3.1.1x1) are intermediate development versions or versions made available to fix a quirk. These are not as well tested as the versions for the general audience but should work as intended as they normally fix very specific problems.

*******************************************************************************
**************************** Development version ****************************
*******************************************************************************

3.2.0rc3a
---------
- fixed error: setting -MI:sonfs from the command line did not work
- added code to catch erroneous "-- something" in parameters and give an appropriate error message
- PacBio now does not expect XML data as default

3.2.0rc3
--------
- regression problem: changes for convert_project led to incredibly long running times for computing consensus in mapping assemblies. Fixed.
- fixed configuration error for miraSearchESTSNPs: using --job=esps2 did not load sequences from step1
- new parameter -SK:acrc for switching on/off searching of reverse complement hits

3.2.0rc2
--------
- new parameter category "MISC" with first new parameter "stop_on_nfs" (-MI:sonfs)
- added -DI:lrt to be able to put the log directory in other locations (i.e., support SSDs or take the heat off of NFS mounts)
- added -CL:pechsgp to be able to switch the handling of the Solexa GGCxG problem if wanted.
- larger work on convert_project functionality which should now behave a bit more as users might expect (see below)
- when used on an assembly file with multiple strains (caf or maf), "convert_project -t fasta" now also creates files with combined output of all strains
- reversed default and meaning of '-u' in convert_project: per default filling of strain is off, can be switched on
- '-u' in convert_project now also fills the '@' "base" (which stands for "no coverage by this strain")
- '-v' in convert_project now works "per strain" and not "per total coverage"
- added check for zlib in "./configure"
- changed behaviour: -SK:mnr is now switched on by default for EST projects
- changed behaviour: the *nastyseq* file in the log directory has been upgraded to an "info" file, i.e., it is now in the info directory. Furthermore, it can log (on demand) not only sequence parts covered by MNRr tags, but also HAF5, HAF6 and HAF7. New parameter: -SK:rliif
- bugfix: -OUT:rrol did not work, old logs were not removed (bug introduced in 3.1.7)
- updated docs. Section on "things you should not do", added description of a couple of parameters which had not made it into the documentation yet, etc.
- HTML documentation lacked underline-emphasis, fixed.

3.2.0rc1
--------
- completed change to new DocBook manual system, revamped and extended manuals.

3.1.16
------
- change: for genome assemblies, MIRA now builds longer contigs when large paired-end libraries are present (10k, 20k or more)
- change: filter for bad Solexa reads now writes names to clipping log instead of to standard output.
- fasta2frag.tcl: added -P, changed -r to work also on paired-ends
- bugfix: PacBio data without elastic dark inserts led to segfault
- bugfix: -SK:mnr was always treated as "yes", even if set to "no".
- bugfix: fixed miraSearchESTSNPs stopping in step2 if step1 result files were empty.
- bugfix: in 3.1.15, the logic to automatically set -CO:emea=1 worked only for Sanger reads, now for all reads.
- bugfix: read tag comments in ACE file were missing a newline

3.1.15
------
- new parameter -CO:emeas1clpec. Automatically sets emea to 1 if proposed end clipping is used (ends will be "clean"). Improves recognition of misassemblies in cases where only the outer fringes of reads differ.
- change in template handling: to be lenient, MIRA internally added/subtracted 10% of the given insert sizes (or at least 1kb). Not anymore! This would give problems with very small libraries (Solexa) or when the given values were "lenient enough" and were made "too lenient" by this and subsequently flagged in different post-processing tools.
- change in handling template insert size info from XML: previously, MIRA set stdev to a minimum of 500 bases and used 2*stdev to calculate minimum and maximum insert sizes. The 500 bases minimum rule has been removed, and 3*stdev is now used
- new parameter: -GE:tpbd to give template partner build direction on the command line. Defines whether the template partner of a read (in a read-pair) must have the same direction (1) or reverse direction (-1) in a contig.
- change: when --job=...,454 is used, the default minimum overlap is not 40 anymore, but 20. 40 was too conservative, overlaps at weak contig joins were discarded too often.
- improved graph reduction algorithm: some more small overlaps at low coverage sites are taken to Smith-Waterman. This helps to find some more weak contig joins.

3.1.14
------
- speed up of routine to find and mark IUPAC bases and unsure bases (IUPc & UNSc). Very noticeable when using annotated genomes as mapping reference.
- bugfix: IUPc & UNSc were not searched for anymore (introduced in 3.1.12 with the -CO:asir bugfix)
- re-activated '-d' in convert_project
- adjusted miramem estimator for mapping of Solexa reads

3.1.13
------
- improvements for large assemblies with millions of reads where setting up data for new contigs during build is sped up.
  Especially noticeable in EST assemblies, but also genome assemblies with Solexa.

3.1.12
------
- new option to speed up assemblies with millions of reads: -AS:mrpc controls the minimum number of reads a contig must potentially have before it is really assembled. This prevents all the small junk contigs with very low numbers of reads in, e.g., Solexa sequencing from being assembled and can speed up the assembly by days.
- MIRA now uses the tcmalloc library from Google perftools if available. It is highly recommended as it optimises memory allocation and saves a lot of memory on multiple pass assemblies.
  E.g., memory usage for 810k 454 FLX reads, 45x coverage, 5 pass genome de-novo accurate:
    3.0.5            8272988 kB
    3.1.11           8273012 kB
    3.1.12           9492956 kB
    3.1.12tcmalloc   6758916 kB
- change: adapted some estimators in miramem, hopefully giving better estimates for RAM usage during MIRA assemblies.
- bugfix: array iterator overrun in contig building which probably had no noticeable effect. If any, then perhaps rejecting weak matches it would otherwise have barely accepted.
- bugfix: -CO:asir sometimes set repeat markers instead of SNP markers.
- bugfix: mira could try to check physical presence of SCF data even for non-Sanger reads

3.1.11
------
- optimisation: memory pre-allocation routines for read growth help to reduce memory fragmentation and hence memory requirements overall. Especially noticeable in high coverage 454 sequencing or with strobed PacBio reads.
- bugfix: -CO:mr=no was not fully respected. While not used during contig building, possible repeats were always marked in result files and then transferred to following iterations.
- bugfix extendADS(): acquireSequences() could throw due to 0 length of a sequence

3.1.9
-----
- change: mira will stop immediately if it is launched with parameters that suggest miraSearchESTSNPs should be used instead
- bugfix: EST assembly used genomic pathfinding routines instead of EST routines, leading to more contigs with almost identical consensus.
- bugfix: miraSearchESTSNPs pipeline for steps 2 and 3 did not load results from previous steps.
- bugfix: fastqselect.tcl script printed out the name of the first read twice

3.1.8
-----
- changed wiggle file: more info on strains in the description and smaller files, using a stepping and span of 4 instead of 1
- new script: fastqselect.tcl (like fastaselect.tcl, but works on fastq)
- new 3rd party scripts by Tony Travis (qual2ball) and Lionel Guy (caf2aceMiraConsed.pl) to simplify integration of MIRA assemblies in Consed.
- updated the 3rd party documentation "Instructions for scaffolding MIRA contigs using paired ends" from Gregory Harhay

3.1.7
-----
- changed method to remove old files which hopefully minimises the number of files which fail to be removed during a run
- test changes to adapt to >= 2^32 skim hits
- adapted post-SW scoring for PacBio
- fixed bug: array underrun in alignment code, introduced in 3.1.3 (don't I love valgrind :-)

3.1.6
-----
- changed automatic memory management to use all memory minus 15% instead of minus 10%.
- speedup of SKIM when running in multiple threads: removed unnecessary call of a mutex lock (leftover from debugging code). Very noticeable when running at higher thread numbers.
- bugfix: race condition in SKIM leading to wrong assessment of memory needs
- bugfix: calculation for assessment of memory needs was faulty, leading to similar problems as the race condition in SKIM above
- workaround: mapping with data containing artificial reads with lengths of several kilobases led to too high values for rail read length being computed. Fixed by capping at 18kb.
- temporarily switched off skim junk detection, it might be faulty at high coverages

3.1.5
-----
- bugfix: reading SSAHA2 data gave an error for Solexa reads beginning with an 'N' (now really)
- bugfix: some SSAHA2 input files led to infinite loops
- calculation of SW alignment score sped up very slightly

3.1.4
-----
- bugfix: when read extension was used for any sequencing technology, it was also applied for reads of technologies where read extension was not wished.
- fixed compilation: new use of the stringcontainer could lead to a static initialisation fiasco (dependent on linker used at compile time) and subsequent crashes directly after start.

3.1.3
-----
- added first support for PacBio
- fasta2frag.tcl gets a mode to simulate strobed data
- reduced hits reported by SKIM when a read fully covers another. Especially useful for hybrid assemblies of short / long reads.
- slight improvement of SW parametrisation and alignment algorithm (for strobed reads)
- fixed error with read names when using mapping mode
- fixed potential unwanted increase in memory consumption while loading SKIM hits.
- fixed compile problem of ./src/caf/caf_flexer.flex on CentOS
- fixed small bug in ./configure.in where rescue values for BOOST paths were not set correctly for some systems

3.0.1 (backported from 3.1.2)
-----------------------------
This version fixes a few quirks and problems of the initial 3.0.0 release, some of them leading to MIRA aborting or even hard crashes. MIRA also was a bit too picky in 3.0.0 for joining some reads.

Due to changes and algorithm optimisation, there should be notable improvements in contig lengths (N50 etc.) in genome assemblies with bad data. In EST assemblies, chimera cutbacks are now disabled by default, leading to fewer cutbacks.

Important note for users of sff_extract in paired-end projects: please switch to the newest version of sff_extract (>= 0.2.8) as the old ones contain a bug and do not reverse quality values for reverse reads.
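
Background to the sff_extract note above: when a read is reverse-complemented, its per-base quality values have to be reversed as well, otherwise each quality value ends up attached to the wrong base. The following is a minimal Python sketch of that idea only. The helper name is hypothetical and this is not code taken from sff_extract or MIRA.

```python
# Illustration only: hypothetical helper, not code from sff_extract or MIRA.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement_with_quals(seq, quals):
    """Return the reverse complement of seq together with the quality
    values reordered so that each value still belongs to its base."""
    if len(seq) != len(quals):
        raise ValueError("sequence and quality lengths differ")
    rc_seq = seq.translate(_COMPLEMENT)[::-1]   # complement, then reverse
    rc_quals = list(reversed(quals))            # qualities follow their bases
    return rc_seq, rc_quals


if __name__ == "__main__":
    print(reverse_complement_with_quals("ACGGT", [40, 38, 30, 12, 5]))
```

If the quality list were left in its original order, the low-quality tail of the read would be reported at the wrong end after reversal, which is the kind of error the sff_extract note warns about.
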
- changed SSAHA2 parser to allow for pathological case of empty vector names
- changed method for average coverage estimation slightly to better cope with extremely skewed distributions (seen in some EST data)
- added workaround to allow usage of SSAHA2 screening data with Solexa reads
- improved speed of pathfinder algorithm for repetitive 454 reads
- improved concurrency of SKIM output, better use of available thread capacity
- added method to propose smaller cuts at the end of reads in SKIM (-SK:asjdc)
- added flags to control chimera cutbacks (-SK:ascdc) and junk cutbacks (-SK:asjdc). On by default for genome assemblies, off for EST assemblies
- increased speed of SKIM hit reduction for assemblies with long and short reads (Solexa & ...)
- improved handling of reads with problematic ends which could lead to premature stop of contig building
- reduced memory need for internal read structure. As part of this and as the only user-visible effect (if at all), the Staden-ID of CAF files is not supported anymore
- reduced memory needs for tags. Side-effect: slight speed improvement of algorithms using tags
- bugfix: consensus of Solexa bases only with N now results in N instead of a space
- bugfix: FASTA files with multiple equal read names now lead to MIRA stopping
- fixed critical buffer overruns that could lead to weird errors or even to MIRA crashing hard with segmentation faults
- fixed the annoying "len1 or len2 == 0 ?" bug (turned out to be side-effect of chimera clipping)
- fixed error in SKIM parametrisation which could lead in some cases to long run times, excessive memory consumption and data corruption.

3.0.0
-----
The MIRA 3 versions are the result of a long development to make assembly of Sanger, 454 and Solexa (Illumina) data as easy and straightforward as possible while keeping a maximum accuracy. Another focus was to make it possible to use results from Solexa mapping projects in current finishing programs, not only viewers.
MIRA introduces for that the notion of coverage equivalent reads (CER), which reduces the data volume by 70 to 90%. This allows painless use of such data sets in gap4 and consed.

A lot has changed since the 2.8.x series of MIRA, the following list has just a few highlights which came in during the 2.9.x series:

- sequencing technologies: MIRA handles different sequencing technologies independently from each other and has specialised routines for working with each of them.
- command-line parameters: MIRA now has a handful of "Do-What-I-Mean" one-stop switches which allow configuring the assembler for 90% of all use cases. Furthermore, many parameters can be adjusted for each sequencing technology so that the assembly engine can be tweaked for very specialised cases if needed.
- all sequencing technologies (Sanger, 454, Solexa) have now
  - recognition of chimeras
  - new assembly routines for improved repeat resolving
  - improved data preprocessing that gets rid of low quality data and sequencing errors at ends of reads ... even when no quality data is available.
- 454 data:
  - fully developed capability for de-novo and mapping assembly of 454 data (paired and non-paired)
  - automated contig editor to remove most obvious and/or annoying sequencing errors
  - improved consensus calling streamlined to minimise the dreaded homopolymer problem
- Solexa data
  - can handle Solexa data of any length, no restriction to very short sequences.
- memory/space saving: MIRA has a special mapping mode which creates data so that widely used finishing tools like gap4 and consed can load these projects and still be fairly quick
- alignments enriched with features: MIRA adds information like repetitiveness or repeat marker bases as tags in the assembly so that these can be used during finishing
- assembly information files: MIRA writes more information files which can be easily parsed and/or read
- mapping assemblies: MIRA has a full SNP analysis for prokaryotic data
- comprehensive tables and HTML result files (mapping assembly): the convert_project program can now create easy to use tables and HTML files which show the data in a way suited for less computer-interested people (biologists etc. :-)
- memory management: MIRA can now be told to use an upper limit of RAM.
- file formats: MIRA can now parse or write more file formats. Notable are the change from SSAHA to SSAHA2 for clipping, FASTQ for data input and MAF format for output.
- MacOS support: MIRA now compiles on MacOS-X
- speed and memory: compared to 2.8.x, MIRA now uses way less memory and is a lot faster.
- tons of other features, tweaks and bug fixes. See the CHANGES_old.txt file

2.9.59 (3.0.0)
--------------
- moved all output directories into one directory named _assembly
- added 3rd party documentation to distribution packages
- mapping 454 reads in 'accurate' mode now does not automatically switch on the feature to also build new contigs (which comes rather unexpectedly for users and also completely changes runtime behaviour)
- base jiggling in homopolymers should be further reduced (problem of 454 data)
- added FASTQ as conversion target to convert_project
- added quickswitch -noquality
- renamed -LR:llc to -LR:lcc (corrected typo)
- clip lowercases (-LR:lcc) now does not clip if all sequences to be clipped contain just lower case
- SNP evaluation routines now handle feature rich GenBank files more gracefully.
- streamlined tagging: in mapping assemblies, positions having received SNP consensus tags now don't have less important tags denoting problematic positions (UNSc, IUPc etc.)
- improved alignment and consensus calling routines reduce homopolymer errors in 454 data by ~30% - 60%
- clearer error message from MIRA when FASTQ file cannot be loaded
- changed behaviour of -CL:msvs to now load SSAHA2 data instead of SSAHA data
- changed behaviour: MIRA now stops if it encounters input reads with no names
- changed behaviour: backbone reads in mapping assemblies now count as normal coverage.
- changed behaviour: miraMem now calculates with Titanium average length of 400 instead of 475
- changed behaviour: loading of Solexa data now defaults to "new" phred style qualities (-LR:ssiqf=no)
- changed behaviour: for Solexa data, MIRA now calls deletions more aggressively when in a tie with another base.
- changed behaviour: for Solexa data, MIRA now calls fewer IUPAC bases.
- changed behaviour: the HAsh Frequency (HAF) tags are now set slightly differently to show the potentially dangerous sites. Furthermore, HAF6 is now defined as 'heavy' repeat, HAF7 as 'crazy'
- changed naming of contigs with only repetitive sequence: now named *_rep_c instead of *_lrc
- fixed bug: automatically calculated values for -SB:bro and -SB:brl were a factor 2 too low.
- fixed bug: de-novo assembly with more than 8 strains led to MIRA stopping where it should not.
- fixed bug: log file "Edit.log" from the Sanger EdIt now created only when really needed
- fixed bug: --job=esps2 and esps3 failed to start because of flawed default parameters
- fixed bug: convert_project did not properly convert singlets into EXP files ... the directory was not created.
- fixed bug: convert_project -m replaced qualities of reads with '30' even if those had been present in the input.
- fixed automatic recognition of Sanger FASTQ format where some data sets led to slightly offset quality values
- fixed bug when pre-loading EXP files that contain quality values (thanks to David Phillip Judge for mailing the bug ... AND the fix)
- fixed small bug in loading EXP files with certain entries which went unnoticed for 11 years *sigh*
- fixed small bug in CAF and MAF output when gaps are present at the 3' end of reads.
- fixed bug that could lead to segmentation faults when calculating assembly statistics info (thank you Valgrind)
- fixed bug in how MIRA checked whether it runs in 32 or 64 bit.
- fixed Trac bug #5 (sometimes error in transferContigReadTagsToReadpool())
- fixed bug: in mapping assemblies, not all possible alignments of repetitive parts of reads were given back by SKIM
- fixed bug: in rare cases (mostly projects without templates or paired-end), MIRA joined repeats where it should not
- fixed bug: in mapping assemblies with the option to build new contigs, MIRA often preferred to make new contigs from reads that had a difference to the reference sequence
- fixed inconsistency in calculation of N50/N90/N95

2.9.58 (aka V3rc4)
------------------
- MIRA now also compiles again on Apple Mac OSX (yay!)
- scripts added to distribution: fixACE4consed.tcl, fastaselect.tcl and fasta2frag.tcl
- new clipping: clip lowercases (-LR:llc). Made for 454 data, but can be used with other sequencing types, too.
- MIRA now switches off automatic memory management when system information cannot be gathered
- change to get SKIM working on 31 base hashes also on 32 bit machines where the compiler knows 64 bit data types
- slightly changed messages for STDOUT log when writing cluster info to disk
- added option "-r f" to convert_project
- convert_project can now convert from and to MAF
- changed -GE:gbmf to -GE:kpmf (sorry for the inconvenience)
- miraSearchESTSNPs does not load "me_stepX.par" files anymore (it parses the command line like normal mira now)
- default parameters for miraSearchESTSNPs have been changed in step 1 and 2. Notably, the automatic editor is switched off in step 1, but switched on in step 2. Furthermore, both steps change from single pass/multi-loop (-AS:nop=1:rbl=...) to multipass/multi-loop (-AS:nop=...:rbl=...) setup
- fixed bug in automatic memory management where RAM allocated was actually less than minimum asked for and could lead to disastrous assembly results
- fixed bug that prevented recalculation of contig consensus when loading CAF
- fixed bug in contig building where in rare cases a missed alignment led to MIRA stopping
- fixed bug that led to extremely long run times and suboptimal contigs in rare cases
- fixed bug that led to MIRA stopping in rare cases during mapping of Solexa reads
- fixed bug that led to MIRA missing internal functionality of merging short reads after loading contigs from CAF or MAF
- fixed segmentation fault in parsing the "--job" parameter when only "est" was given
- fixed bug that led MIRA to want to load a file named ".maf" when trying to load a CAF format.
- fixed bug where loading of sequences in CAF failed as MIRA thought there was no data in the file
- added logfile to replay eventual errors during contig building

2.9.57 (aka V3rc3)
------------------
- added -GE:amm:gbfm:mps as first trials for automatic memory management
- added -FN:bbin for naming backbone input files
- added -LR:wqf and -AS:epoq for more thorough checking of presence of quality values. By default, MIRA now stops if quality files are expected but not found or when reads with no quality values are present in the assembly.
- parameter parsing now checks that parameters that are specific for sequencing types are in a correct section (SANGER_SETTINGS, 454_SETTINGS, etc.) and that common parameters are in a COMMON_SETTINGS section. SANGER_SETTINGS is now no longer an alias to COMMON_SETTINGS. Updated documentation to reflect changes.
- reduced number of memory allocations for Smith-Waterman alignments
- estimator for internal memory usage got better
- fixed bug in wrong parameter combination for --job=est assemblies
- fixed bug in contig building routine that sometimes stumbled over IUPACs in mapping mode ("logical error 2")

2.9.56 (aka V3rc2hf3)
---------------------
- reduced memory needs of SKIM results for projects containing lots of small reads combined with lots of larger reads (i.e.
  Solexa + Sanger and/or 454)
- fixed small bug in logfile tracking that stored too many copies of given logfile names and unnecessarily gobbled up hundreds of megabytes of RAM for projects with large numbers of contigs (*sigh*)
- (for compiling) configure script now checks presence of expat

2.9.55 (aka V3rc2hf2)
---------------------
- (for compiling) tweaked configure script to better handle cases with programs/include files installed in non-standard locations
- (for compiling) configure script now checks presence of flex++

2.9.54 (aka V3rc2hf1)
---------------------
- changed loading of CAF files to behave more like other file types
- fixed floating point error in miramem
- (for compiling) removed unnecessary (and sometimes counterproductive) .flex.C files from source distribution.

2.9.53 (aka V3rc2)
------------------
- fixed bug in ACE output of consensus tags (had "C}}" instead of correct "C}\n}" for closing tags)
- fixed another bug in ACE output, this time read tags (forward tags were sometimes written out in reverse)
- re-enabled output of results as HTML. Not ideal, but works.
- re-enabled "-t html" in convert_project
- due to introduction of FASTQ as input format, the abbreviation switch for naming FASTA quality input (-FI:fqi) needs to be renamed to "-FI:fqui" as "-FI:fqi" now names FASTQ input files.
- reduction of memory usage for cases where the possible_vector_clip is not needed (memory is not allocated, default for all assemblies without Sanger data)

2.9.52 (aka V3rc1b2)
--------------------
- implemented aggressive memory saving for Solexa reads which reduces the data stored per read base from 9 bytes to 5 bytes (45% reduction). The downside: the result files which have alignment positions (CAF, ACE etc.) do not show insertions and deletions anymore in the coordinates. I.e., edits cannot be traced back. As no other assembler has this info and no finishing program I know uses this info anyway, I guess this is ok.
- optimisation: MIRA now uses less memory constructing coverage equivalent reads (mapping assemblies -CO:msr=yes)
- bugfix: on large contigs with lots of reads, MIRA now uses significantly less temporary memory for all mapping assemblies (e.g. 3GB less when mapping 6m Solexas)
- compile instructions for NetBSD (courtesy of Thomas Vaughan)
- re-activated and upgraded TCS output (transposed contig summary)
- implemented "-a" parameter for convert_project
- wrote documentation for the changes

2.9.51 (aka V3rc1b1)
--------------------
- reduced overhead of reads by 32 bytes (on 64 bit architectures). That's 320MB for 10m reads :-)
- testing majority vote of 66% for gaps in consensus calling of 454 reads. (NOTE: that code is faulty as it always evaluates to true and sometimes leads even to a division by zero error. Just disregard this version.)

2.9.50 (aka V3rc1b)
-------------------
- bugfix: mixed FASTA and CAF loading works again
- bugfix: "--" now sets file type for all sequencing technologies as intended.

2.9.49 (aka V3rc1a)
-------------------
- bugfix: CAF loading works again
- new tag: DGPc (Dubious Gap Position on Consensus). Set when the number of gaps at a consensus position is between 40% and 60% of the next most frequent base. E.g.: A/* = 10/7. But also when A/C/* = 9/10/7

2.9.48 (aka V3rc1)
------------------
- bugfix for hybrid assemblies involving Solexa reads: some Solexa reads did not make it into contigs
- optimised SKIM reduction routine for hybrid assemblies involving Solexa: the overlap graph generated uses less memory.
- optimised SKIM/Pathfinder interaction for short Solexa reads (<60 bases, longer reads were not that much of a problem) which allows better de-novo assembly with Solexa.
- major speed increase in pathfinder module for large de-novo assemblies with millions of reads. E.g. 2 pass de-novo with 6m Solexa paired-end 36mers goes down from 4 hours 20 minutes to ~2 hours.
- major speed increase for hybrid assemblies involving Solexas in pathfinder module: a 5 pass de-novo with 800k 454 reads and 3.3m Solexas goes down from >1.5 days to 12hrs
- reduction of memory needs for Solexa data (e.g. ~1.2GB for 7.3m Solexa 40mers). (There was an omission of container capacity reservation since 2.9.44x1)
- convenience change: -SB:brl:bro can now be set to 0 for automatic determination of optimal values by MIRA in mapping assemblies (now default).
- convenience change: -AL:shme can now be set to 0 for automatic determination of reasonable value (now default)
- change for genome assemblies via quick switch: masking of nasty repeats is now turned on, copy threshold at 100x expected frequency
- change for genome assemblies with 454 data via quick switch: proposed end clips are now more stringent and enforce 27 instead of 17 bases clear space

2.9.47
------
- small change in default parameters when using Solexa data (alone or hybrid) to better adapt to larger difference to reference (mapping) or low coverage (mapping and de-novo).
- reworked small Solexa examples in minidemo directory (mapping and denovo)

2.9.46x3
--------
- unified parameters for loading different sequencing technologies: added --fastq, -LR:lsd:ft:fqqo and -FI:fqi; removed -LR:lsand:l454d:lsxad:lsidd:sanft
- MIRA can now load data in FASTQ format, routines are courtesy of Heng Li at the Sanger Centre. For Solexa data there's also an automatic recognition of whether it's in Sanger, Solexa 1.0 or Solexa 1.3 format.
- convert_project can subsequently also load FASTQ and gets an additional -o parameter.
- debris file does not contain same read name multiple times
- for build process: moved check for isblank(3) to configure.in
- new requirement for compiling: zlib
- MIRA now gets compiled with -O3 by default on most platforms
- further changes to configure script to allow correct compilation of 64 bit on platforms that compile with 32 bit per default
- MIRA confirmed to compile on OpenSolaris with BOOST (yay!)

2.9.46x2
--------
- some more tweaks to ./configure (better lex/flex handling, better expat recognition, more chatty in case of boost problems)

2.9.46x1
--------
- just some tweaks in the build process to fix reported problems during compilation or linking

2.9.46
------
- updated major parts of the documentation for anything related to using Solexa reads
- new help file on how to assemble 'hard' genomes (mostly geared towards eukaryotes, but some prokaryotes also have a tendency to be nasty)

2.9.45x4
--------
- removed testcode that limited coverage to 5000x, now back to theoretical limit of 16383x.
- reactivated "html" as convert option for convert_project.
- activated "asnp" and "hsnp" as convert option for convert_project.
- removed dumping of debug information when using -t asnp or hsnp in convert_project

2.9.45x3
--------
- tuning of repeat detection for Solexa reads leading to fewer false positive repeat markers for typical Solexa miscalls
- tuning of base calling for Solexa reads leading to fewer IUPAC bases at places with typical Solexa miscalls

2.9.45x2
--------
- Probable fix for compiling src/mira/dataprocessing.C on Red Hat systems (inclusion of BOOST header file)
- fixed configure script to better differentiate between a working BOOST environment and a possibly problematic one
- improved ACE output for tags containing comments

2.9.45x1
--------
- fixed bug where a mapping assembly of 454 or Sanger sequences led to segmentation faults in some cases (*sigh*).
- fixed bug in output of assembly as HTML or TEXT format where only 21 bases of each contig were given (some test code had not been removed)
- fixed bug where MIRA sometimes cut back some backbone rails it thought to be possible chimeras. This could happen for organisms that are further apart from each other than just a few SNPs here and there. While this had no effect whatsoever on the assembly, it still was something of an unclean thing.
- test for consed compatibility: added newlines in ACE files to read and consensus tags

2.9.45
------
- to accommodate the Solexa paired-end naming scheme, CAF files now allow the "/" character in identifiers (like read names).
- -SK:rt has been renamed to -SK:nrr and the meaning has changed (please read changed documentation). This gives an easier control in handling of repetitive sequences.
- skimming for nasty sequences (-SK:mnr) now uses the same algorithms as -CL:pec which are faster and better than the old ones.
- new parameter -CL:pecbph
- SKIM3 now removes some massive temporary files from the log directory
- MRMr tags renamed MNRr
- updated support files GTAGDB and consedtaglib.txt

2.9.44x7
--------
- speed up of SKIM hit reduction. Important for large eukaryotic assemblies or de-novo prokaryotic Solexa assemblies, reducing the time of that step from several hours to under one hour or even minutes.

2.9.44x6
--------
- added "solexa" as naming scheme to -GE:rns (using "/1" and "/2" to distinguish forward and reverse reads)
- added -GE:crhf to color reads by hash frequency. Very handy for finishing. Needs tags "HAF0" to "HAF7" to be defined for gap4 (or consed or other finishing tools)
- new log file: "miralog.usedids" which logs all reads (after clipping etc.)
  which go into contig assembly
- statistics regarding the read pool are now printed out after all operations
  that might change read lengths (read extension or clipping)

2.9.44x5
--------
- added unpadded read position to "*_info_readtaglist.txt"
- -SK:pr can now be set individually by sequencing technology

2.9.44x4
--------
- bugfix in chimera search: some chimeras were not recognised, this has been
  fixed. Downside: a few more reads that are not really chimeras or where the
  info is inconclusive are now categorised as such. Should however have no
  influence on the assembly itself.

2.9.44x3
--------
- change of parameter: the "--noclipping" now takes optional technologies to
  which it should apply. E.g.: "--noclipping=454,solexa". A plain
  "--noclipping" is equivalent to switching off clipping for all
  technologies.
- speed optimisation in pathfinder for de-novo assemblies with Solexa and
  SOLiD.
- bugfix: fixed some pathfinder logic where sequencing errors in repetitive
  areas led MIRA to perform alignment of reads it shouldn't have.
- bugfix: setting -SK:mchr to values >4095 led to an integer overflow and
  subsequent poor assemblies ... or no assembly at all
- bugfix: when using Solexa CER mappings on multiple backbone sequences, the
  numbering scheme led to illegal CAF files (and hence illegal gap4
  databases)
- bugfix: division by zero error in statistics calculation of empty read
  pools

2.9.44x2
--------
- speed optimisations in new assembly engine. 5x-10x speed improvement for
  large contigs compared to 2.9.44x1
- increased threshold for megahub detection (not sure whether it's a good
  decision, must test)
- for 454 assemblies, adapted -AS:nop down to 4 for normal and 5 for accurate
  mode (improved repeat resolver and taking FLX reads as quality standard
  allows for this)

2.9.44x1
--------
- major change of assembly engine, geared towards "100% certain" contigs
  without misassembly.
  May lead to shorter (albeit better) contigs when no paired end reads are
  used, but leads to longer contigs for paired-ends. Currently very slow for
  large contigs (>150k reads)
- more lenient treatment of megahubs in SKIM. If possible, only skims with
  non-repetitive parts of other reads are taken.
- added searches dedicated to hunt chimeras. This was necessary as new
  assembly engine is more prone to falling into chimera traps than the old
  one.
- further improved HTML output of SNP surrounding
- SKIM now honours the -AL:mo values (minimum length of overlaps) and rejects
  overlaps below these values (important for de-novo of short reads)
- routine loading Sanger type data now gives a clearer error message if the
  file type given in -LR:snft is unknown

2.9.43
------
- fixed bug in new -CL:pec routines that led to a core dump (struck only in
  cases where not a single overlap appeared in the whole project)
- clarified docs regarding usage of ssaha(1)
- fixed problem leading to long run times and high memory requirements when
  masking of nasty repeats (-SK:mnr) was used on high coverage genomes (100x)

2.9.42x1
--------
- improved HTML output for resequenced genomes
- slightly improved logging of values when loading FASTA data
- added /proc/meminfo as dump in memory self assessments

2.9.42
------
- renamed "miraEST" to "miraSearchESTSNPs"
- internal changes to get the miraSearchESTSNPs pipeline working again in the
  2.9.x line (alpha test)
- bugfix: loading FASTA projects containing more quality entries than
  sequences led to core dump
- rewrote -CL:pec routines. Faster, and fixes errors of old version.
- change: contigs with only Solexa reads do not trigger editing of contig
  (temporary trial)
- first tests of new statistics module
- memory needs increased by 24 byte per read (12 byte on 32 bit systems) and
  one byte per raw read base
- changed default setting of poly-AT length from 10 to 12
- internal version only. Migration to gcc 4.3.2 partly done.
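
As a rough illustration of the memory note in the 2.9.42 entry above (24 bytes
more per read on 64 bit systems, 12 bytes on 32 bit systems, plus one byte per
raw read base), the extra requirement can be estimated with a few lines of
Python. This is only a back-of-the-envelope sketch and not part of MIRA or
miramem; the function name and its arguments are hypothetical.

```python
def extra_memory_bytes(num_reads, raw_bases, pointer_bits=64):
    """Estimate the additional memory described in the 2.9.42 entry.

    num_reads    -- number of reads in the project
    raw_bases    -- total number of raw (unclipped) read bases
    pointer_bits -- 64 or 32, selects the per-read overhead (24 or 12 bytes)
    """
    if pointer_bits not in (32, 64):
        raise ValueError("pointer_bits must be 32 or 64")
    per_read = 24 if pointer_bits == 64 else 12
    # One extra byte is needed for every raw read base.
    return num_reads * per_read + raw_bases


if __name__ == "__main__":
    # Example: 1 million reads with 250 raw bases each (assumed numbers).
    extra = extra_memory_bytes(1_000_000, 250_000_000)
    print(f"{extra / 1e6:.0f} MB additional memory")
```

For a project dominated by long reads, the per-base byte clearly outweighs the
per-read overhead.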

2.9.40
------
- new parameter: -SK:mchr to cap maximum memory in hit reduction algorithm.
  This is experimental and will need some refining.
- when no clean ends are found by proposed cutbacks, the reads are completely
  removed from the assembly. This eliminates short reads (i.e. Solexa) with
  too many errors and which aren't really useful anyway.
- Using -AL:shme led to a parsing error. Fixed.
- Solexa reads no longer need a minimum left clip to be set, this is handled
  internally
- miramem now gives a better estimate for mapping of Solexa reads (the old
  values were way too high)
- miramem now tries to split memory needs into "unavoidable" (for sequencing
  data etc.) and "tunable" (via a number of parameters)
- -SB:bbq now defaults to '30'
- most error messages now dumped to STDOUT instead of STDERR

2.9.39
------
- in mapping assemblies, repetitive reads are now distributed evenly and not
  stochastically distributed over the backbone repeats
- mapping assemblies with Solexa now have some adjusted default parameters
  for "normal" and "accurate" levels. They run a bit slower but will squeeze
  a maximum out of your data.
- read clustering now temporarily needs more memory, but runs in a few
  seconds instead of hours for projects with 10 million reads
- new parameter -AL:shme (a temporary hack to handle Solexa reads more
  thoroughly)
- to counter a current deficiency of the Solexa technology, a new clipping
  filter for Solexa data now filters out reads that have stretches of 20 or
  more "A" bases or stretches of 12 or more "A" bases and more than 80% "A"
  in total.
- on "out-of-memory" errors, MIRA now dumps a self assessment on where the
  memory went to get an idea what really happened.
  Note 1: this is bound to happen only with eukaryotes or on very small
  machines.
  Note 2: development versions of MIRA by default dump some assessments also
  during the assembly.
- documented -OUT:sssip:stsip (which appeared in 2.9.12, my apologies)
- changed documentation for 454 assembly to point at publicly available data
  instead of the spneu project (which put too much strain on my website).
- renamed -CL:prc to -CL:pec to reflect its use on both ends of a read

2.9.38x1
--------
- -CL:prc now also clips left (will have to rename that option). This catches
  very efficiently vector leftovers in Sanger reads and adaptor leftovers in
  454 reads (which also can occur there).
- -CL:prc now also clips when a non-ACGT base is at the ends
- bugfix: saving as gap4 directory did not save the first contig due to wrong
  handling of directory creation.
- bugfix: convert_project now sets the minimum coverage to 1 to circumvent a
  quirk in the computation of "Large contigs" of the assembly info display.
  Better fix in the future.
- version 1a: testing new pathfinder algorithm enabled

2.9.37
------
- new option: -CL:prc (propose right clip). This is a new strategy to ensure
  a good "high confidence region" (hcr) in reads, basically eliminating all
  junk at the 3' end of reads. Extremely effective, but should not be used
  for very low coverage data or for EST projects. This option is now default
  for genome assemblies in "normal" or "accurate" mode.

2.9.36
------
- renamed -AS:urdufrd:urdrdct to -AS:ard:ardct
- added -AS:ardml:ardgl. This allows for a better control of which reads are
  defined as repeats.
- added -AS:klrs. Needs testing, is not switched on by default.
- bugfix: number of large contigs was reported too high in the report of the
  assembly ... because of a really dumb bug in the statistics calculation
  routine. This had no effect on the assembly itself, just on the
  *_info_assembly.txt report and also on the summary given after the usage of
  "convert_project".
- bugfix: SSAHA clips were wrongly logged to file
- change: log file with clips more verbose
- change: 454 reads without explicit forward/reverse naming scheme (e.g.
  "somename" instead of "somename.f") are now considered to be forward

2.9.35x2
--------
- when running the SKIM in parallel threads, MIRA can give different results
  when started with the same data and same arguments. The effect is now
  reduced (it is still present), but at the price of a table loaded after
  SKIM ran through now being 25% larger, but this can not be helped.
- a few fixes in "convert_project" to allow conversion of assemblies in CAF
  format into clippedfasta and maskedfasta (was previously allowed only for
  single reads)
- typo fix: -OUT:rrol:rld were shown as sequencing type dependent while they
  are not.

2.9.35x1
--------
- CAF files with 454 data now contain the necessary info to allow gap4
  opening the flowgrams. Works only for reads that are NOT paired-end.
- slight tweak in the pathfinder that should enhance the assembly with
  paired-end in a few cases
- changed sff_extract so that it runs again with the Python 2.4 series

2.9.34
------
- bugfix: fixed bug while reading quoted text for "Clone", "Staden_id" and
  "Template" lines in CAF. This affected mostly users of convert_project and
  users using CAF as input format for mira.

2.9.33
------
- new output file: wiggle (-OUT:orw). Can be loaded into the IGB viewer
  together with a GFF or FASTA of the backbone sequence(s). Very useful for
  resequencing experiments / mapping assemblies with 454 / Solexa data, when
  the "view coverage" of gap4 gets really really slow.
- convert_project can now convert to multiple targets at once (multiple -t)
- fixed wrong reporting of 454 reads without clips in the statistics display
  after loading of reads.
- improved sff_extract to handle paired-end reads, rewrote 454 manual to
  reflect this.

2.9.32
------
- tweak: --job=...,est,...
  now switches off -CL:emrc per default

2.9.31
------
- reworked info file for contig statistics: contig lengths are reported for
  the ungapped contigs; added GC content (removed A/C/G/T counts); format is
  now more easily readable but can still be easily parsed.
- fixed bug where reads with very short "good" parts could lose their right
  vector clip when using -CL:emrc
- bugfix: progress indicator for files >2GB did not work correctly

2.9.30
------
- added warn message if skim was parametrised in a way that makes it run
  slowly
- optimised some input/output of temporary files to be faster (using C
  instead of C++ functions *sigh*)
- name scheme of reads now allows for ":" in names (to accept the original
  Solexa name scheme)
- tweak: a mapping assembly that has Solexa data will now filter more
  strongly during the SKIM phase, saving some memory afterwards in the whole
  assembly
- bugfix: -AS:bdq did work only with partial FASTA quality files, not with
  empty or non-existing files
- bugfix: marking repeats of Solexa reads did not completely honour -CO:mrpg,
  fixed
- bugfix: when performing Solexa mappings, MIRA sometimes created almost
  empty CER (coverage equivalent reads) without strain information.

2.9.29x6
--------
- tweak: adjusted parameters for Solexa mapping to be more lenient in
  alignments
- bugfix: --noclipping now also correctly switches off -CL:emrc
- bugfix: --job in parameter files was still not correctly parsed in some
  cases (bug in flex *sigh*)
- bugfix: threaded skim sometimes did not exit when given less sequences than
  threads*5000

2.9.29x5
--------
- removed the "sffinfo2mirafiles.tcl" script from the distribution.
  "sff_extract" from Jose Blanca (in the 3rd party package) is taking over
  this part: more versatile, faster and removes the need for the sff* tools
  from Roche.
- MIRA now stops if the ratio of megahubs is larger than -SK:mmhr
- tweak: EST mode now does not enforce minimum right clip per default
- new tag: MCVc (Missing CoVerage in Consensus).
  Set when a strain has no coverage (previously UNSc and UNSr were set, now
  they are set only when 'unsure').
- bugfix: non-paired-end 454 data read from CAF was not recognised as
  non-paired-end during -CO:emrc, fixed.
- bugfix: --job in parameter files was not correctly parsed
- bugfix: when -OUT:sssip was used, parts of the singlets were still assigned
  to the debris file.

2.9.29x4
--------
- the cutback strategy for 454 reads introduced in 2929x1 has been eased a
  bit: paired-end reads do not get cut back and if read lengths would fall
  below the minimum required length, they don't get cut back either.
- bugfix: the bad sequence search was not performed if the minimum left clip
  was set to "no"
- MACHINE_TYPE, PROGRAM_ID and STRAIN are now read from TRACEINFO XML files.
- fixed a couple of bugs that led to an abort of MIRA when writing SNP files.
  Bug struck only on very rare boundary cases.

2.9.29x3
--------
- implemented a couple of memory reduction strategies for the read objects.
  This reduces overhead of every read by almost 30% (264 instead of 368) and
  additionally saves memory of cached values (strain, basecaller, machine
  type, paths etc.). This should also reduce memory fragmentation a bit. In a
  typical 454 project with 1 million reads, this amounts to 208-400 MB
  savings of RAM.
- miramem now knows about 454 Titanium reads
- convert_project has new command line parameter: -r

2.9.29x2
--------
- additional algorithms to search and mark repeat marker bases that existing
  routines missed in 454 data.

2.9.29x1
--------
- MIRA now uses full overlap graph repeat resolving algorithms which leads to
  better and quicker resolving of repeats in bacteria. May be slower for
  eukaryotes, more tests needed.
- new clipping options: -CL:emrc:mrcr:smrc
- for 454 reads, MIRA now follows a strategy of cut back first (-CL:emrc),
  uncover afterwards via read extension. Highly recommended.
- default parameter -CO:mrpg=5 for repeat marker base detection in 454 data
  was too lax, changed back to 4.
- fixed bug: when mapping microread data (Solexa, SOLiD), -SB:sbuip was
  wrongly interpreted and de-novo algorithm started instead of mapping (error
  introduced in 2.9.28x4)
- change: when not being able to delete a temporary log file, MIRA now gives
  a warning but does not abort

2.9.28x7
--------
- added quality information of consensus sequence to output of CAF files.

2.9.28x6
--------
- Premiere for MIRA: multi-threading makes its appearance. At the moment only
  for the SKIM algorithm as it's the easiest part and no adverse effects are
  expected. New parameter -SK:not is for controlling the number of threads.
- Test: MIRA now saves more information on failed alignments to build a
  better overlap graph in following passes. The overall assembly quality
  gains, but memory consumption rises unpredictably. This may become a
  problem for highly repetitive genomes of eukaryotic size. To be monitored.
- the rawhashhit log file is not written anymore as it was useful only for
  debugging and just ate memory and time of SKIM.
- bugfix: the new read mapping chooser sometimes led to an abort() of the
  process (error introduced in 2.9.28x4)

2.9.28x5
--------
- renamed 'est_splitsplices' of the -AL:egpl parameter to 'reject_codongaps'
- when 454 data is used via the --job=...,454,... switch,
  -AL:egp=yes:egpl=reject_codongaps are now set for *all* technologies

2.9.28x4
--------
- first version which allows Solexa de-novo. Albeit *very* slow at the
  moment, do not use for anything else than bacteria (1 week of computation
  or more, sorry).
- new functionality: MIRA now marks IUPAC positions in the consensus as tag
  "IUPc"
- the info_assembly.txt file now gives info on the number of positions in the
  consensus where sequencing methods disagree.
- bugfix in clipping: when reads have no ancillary data (and thus no good
  left clip), the clipping of bad quality stretches could lead to the
  complete read being clipped if the bad quality on the left was long enough
- bugfix: for 454 data, -CO:mrpg and -CO:mgqrt were not honoured but some
  fixed values used instead.
- bugfix: average total coverage of contigs was wrongly reported in
  statistics and furthermore only as integer, not as double
- bugfix: -OUT:sssip:stsip could not be set for sequencing technologies other
  than Sanger.
- optimisation: when mapping against multiple backbones, reads will now be
  mapped to the best matching backbone instead of suboptimal mapping to
  backbones earlier in the list.

2.9.28x3
--------
!!!! Do not use this version with Solexa data, some code changes are not
completed !!!!
- changed graph pruning algorithm to work less aggressively so that
  454/Sanger hybrid situations, where there's a low Sanger coverage (0.5x to
  2x), now should work better than before
- fixed bug in assembly info which led to wrong information being displayed
  which struck when using 454 data
- reduced memory requirements of one of the main overlap storage tables by
  10%

2.9.28x2
--------
- fixed ugly bug in new assembly info routines that led to an abort of the
  assembly.

2.9.28x1
--------
- new file in *_info directory: _info_assembly.txt. This file gives basic
  information on how the assembly went (assembled reads, contig sizes,
  N50/N90/N95, coverage, qualities, possible problems)
- convert_project now dumps the same info as above on CAF input
- for EST assemblies: -job=est now uses per default a poly-AT clipping that
  preserves the poly-A/T signal
- for EST assemblies: renamed -GE:ess to -GE:esps
- for EST assemblies: added -job=esps[1-3]
- made tagsnp working again (for testing)

2.9.27x2
--------
- new parameter to force consensus to be A, C, G or T and not an IUPAC code
  (-CO:fnicpst)
- enhanced handling of partly masked reads.
- implemented alternative SKIM repeat threshold calculation to deal with
  highly repetitive eukaryotes (temp. fix, needs revisiting later)
- temporarily took out default -SK:mnr again

2.9.27x1
--------
- fixed bug in ACE output which was introduced in 26x6
- fixed small bug when "--job=mapping" was used (introduced in 2924x3)
- activated SKIM routines to mask nasty repeats, added -SK:mnr:rt
- massively reduced disk usage as MIRA can now remove unused log files and
  the complete log directory on request (added -OUT:rrol:rld)
- started documentation for log files written by MIRA

2.9.26x6
--------
- fixed bug in output of read and consensus tag comments in ACE files
- fixed bug in output of BS lines of ACE files
- added error messages naming the faulty read during loading of CAF files
- added -AS:urdcm:urdufrd:urdrdct
- while building contigs using backbones, MIRA now tries to guess memory
  usage and uses preallocation to reduce memory footprint. Though the
  guesstimate may be wrong a few times which then leads to increased memory
  usage.
- added tag type MRMr (MIRA Repeat Marker)
- SKIM now honours "MRMr" and "FPaS" tags in reads and does not use these
  stretches to find potential overlaps.
- SKIM now dynamically adapts the hashes it needs to save to non-ACGT bases
  occurring in sequences. This leads to slightly improved detection of
  possible overlaps in sequences with "N" or other IUPAC codes

2.9.26x5
--------
- fixed bug that was introduced in 26x2 where sometimes during a mapping
  assembly, all contig positions were tagged as unsure.
- fixed bug that -FN:xtii could not be set for sequencing types other than
  Sanger.
- fixed bug: quality clip and clip of masked bases were performed for all
  sequencing technologies even when only one requested it.
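
The two SKIM notes under 2.9.26x6 above (tagged stretches are not used to find
overlaps, and non-ACGT bases are taken into account when hashing) can be
pictured with a small k-mer sketch. This is not MIRA's SKIM code: the function
name, the half-open (start, end) interval convention for tagged stretches and
the choice to simply skip every k-mer that touches a tagged or non-ACGT base
are all assumptions made for illustration.

```python
def usable_kmers(seq, k, masked=()):
    """Yield (position, kmer) for every k-mer that may be hashed.

    seq    -- read sequence as a string
    k      -- k-mer (hash) length
    masked -- iterable of half-open (start, end) intervals, e.g. stretches
              covered by a repeat tag (hypothetical representation)
    """
    blocked = [False] * len(seq)
    # Block every base inside a tagged stretch.
    for start, end in masked:
        for i in range(max(0, start), min(len(seq), end)):
            blocked[i] = True
    # Block every base that is not A, C, G or T (e.g. N or IUPAC codes).
    for i, base in enumerate(seq.upper()):
        if base not in "ACGT":
            blocked[i] = True
    run = 0  # length of the current run of unblocked bases ending at i
    for i in range(len(seq)):
        run = 0 if blocked[i] else run + 1
        if run >= k:
            yield i - k + 1, seq[i - k + 1:i + 1]


if __name__ == "__main__":
    for pos, kmer in usable_kmers("ACGTNACGT", 3, masked=[(0, 2)]):
        print(pos, kmer)
```

Tracking the length of the current clean run keeps the scan linear in the read
length instead of re-checking each window of k bases.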

2.9.26x4
--------
- added log file for clippings on load
- fixed bug for minimum left clip function
- fixed small bug in "make install"
- testing new workaround for linking on MacOS X (10.5) *sigh*

2.9.26x3
--------
- fixed nasty misconfiguration: --job=accurate,454 did not switch on 454
  editing
- fixed nasty misconfiguration: --job=est,454 did not use optimal
  parametrization for 454 EST data

2.9.26x2
--------
- major change for mapping assemblies: strains that do not cover the
  backbones will get the '@' as consensus character. This also appears in
  result files (FASTAs) specific to the different strains! Therefore, some
  post-processing may now be needed.
- revamped and improved strain difference analysis routines (for mapping
  assemblies)

2.9.26x1
--------
- tweaked consensus routines for hybrid assemblies
- added the St. Louis read naming convention as read naming scheme
  (-LR:rns=stlouis)
- updated and expanded some docs (mira manual, usage and 454)
- paired-end reads that have a partner now do not get thrown out at the
  beginning of the assembly on the minimum length criterion (-AS:mrl). This
  is to accommodate 454 paired-ends where one read of the pair sometimes
  might be really, really short. Same thing applies for Solexa.

2.9.25
------
- fixed small bug for mapping assembly: reads smaller -PF:bqoml were not
  mapped at all.
- the automatic error editing routines for overcalls for new sequencing
  technologies data previously worked only with 454 data.
  They now edit in the following combinations:
    1) 454 only
    2) Solexa only
    3) Hybrid assembly of Sanger with (454 and / or Solexa) or 454 with
       Solexa
- the contig statistics file now contains a column for non-covered backbone
  positions (important for backbone assemblies)
- fixed small bug: sometimes singlets were saved in projects even when not
  requested (-OUT:sssip:stsip)
- fixed annoying bug: MIRA (and "convert_project") lost information about
  backbones or merged short reads
- -OUT:sssip:stsip can now be set dependent on sequencing technology
- fixed bug: Solexa consensus sometimes chose the wrong base in cases of
  conflict

2.9.24x3
--------
- added switches and documentation for uniform read distribution
  (-AS:urd:urdsip).
- improved uniform read distribution a bit and made it default for genome
  assemblies (when there is no Solexa data, not tested on that yet)
- added "miramem" as program call to help estimate memory needed for an
  assembly

2.9.24x2
--------
- test version with trial for uniform read distribution
- fixed bug that led to MIRA aborting contig assembly in rare cases
  (triggered through quite repetitive sequences). Bug was introduced after
  2.9.17.
- take back too daring optimisation for Solexa mappings (did sometimes not
  map important data)
- merged short reads now get strain information attached (only one strain at
  the moment)
- added -m to convert_project

2.9.24x1
--------
Bugfixes/tweaks when using more than one strain (bringing back and improving
functionality that was lost in the 2.8.x -> 2.9.x changes)
- added -SB:bsnffa:brfs
- added --notraceinfo quickswitch
- bugfix: searching for SNPs now only done when having multiple strains in
  assembly
- testing: now also setting tags for "weak" SNPs (for catching indels with
  454 reads)
- bugfix: now correctly aligns Solexa reads on backbones containing gaps (as
  encountered in CAFs when using a given strain as rail constructor).
- a few more bugfixes and tweaks concerning assembly with multiple strains
- tweak: gap2caf sets "clone" information to 'unknown'. MIRA now treats this
  as "not set" instead of having an additional "unknown" strain.

2.9.23
------
- fixed bug that led to quality clipping even if -CL:qc was set to no in
  cases where the clip to masked characters (-CL:mc) was active
- fixed bug that made convert_project "forget" to write strain info back to
  caf

2.9.22x4
--------
- improved again tagging of SNPs when using multiple strains and sequencing
  technologies (now also adds "medium" SNPs)
- fixed bug that sometimes led to additional rounds of repeat disentangling
  alignments not being called

2.9.22x3
--------
- tweaked/improved Solexa base calling
- improved tagging of SNPs when using multiple strains and sequencing
  technologies
- added "-noclipping" quickswitch that switches off every clipping option
- added "-lowqualitydata" quickswitch
- added "-notraceinfo" quickswitch
- added -CO:mroir
- changed tagging/clipping of poly-A signal to perform as full blown clipping
  routine. Renamed -DP:tpae and related options to -CL:cpat and related
- bugfix / fallout from changes in 2.9.20x1: the clipping routines for the
  following options now honour the sequencing technology specific settings:
  -CL:msvs:emlc:bsqc:qc:mbc:cpat
- bugfix: -OUT:oet* is now working again
- bugfix: when using multiple strains, the new consensus routine sometimes
  returned '?'.
- re-activated "miraclip" (needs testing)
- re-activated "mirapre" (needs real testing)
- documented -CL:bsqc
- brought most of the main documentation up-to-date to 2.9.22

2.9.22x2
--------
- tweaked routines for calculation of Solexa consensus
- bugfix: statistics calculation of Solexa data was not correct in 2.9.22x1
- small fixes around the code

2.9.22x1
--------
- added support for simple forward / reverse read naming scheme
- fixed bug in template strand assignment for Sanger read naming scheme
- revamped contig statistics in logfile (info file will follow in future)
- renamed "RT=" to "ST=" in MINF read tags
- changed sequencing technology "454GS20" to "454GS" in MINF tags

2.9.21
------
- speed improvement of the SKIM algorithm when only mapping reads to (a)
  backbone(s). Reduced complexity from quadratic to linear, SKIMing a few
  million Solexa reads against a backbone now just takes a few minutes
  instead of one hour or more.
- speed improvement of mapping phase, mapping a few million Solexa reads now
  takes a few minutes per round instead of hours.
- automatic Solexa read clip back mechanism activated that honours the
  quality of mismatches as well as the necessity of having all data at the
  given mapping place.
- added a small hack to be able to use the full Solexa read length and still
  hide the MINF tags in GAP4: reads now get a ´N´ added as first base just to
  get clipped away in quality clipping (MINF tags should be replaced by notes
  when I have time to do that)
- the CAF loader now gracefully handles sequences without quality values
  (although this should not happen) by setting default qualities
- tweaked standard parameters for the different read types
- small fix to make .ace files immediately readable by consed
- a number of smaller bugfixes (like correcting typos etc.pp)

2.9.20x2
--------
- bugfix of a rare error during alignment of reads to contig (new from
  2.9.19x1)
- can now load Solexa quality scores in FASTA quality files and convert them
  to phred-style quality values (new parameter -LR:ssiqf)
- statistics of reads in assembly are now given for each sequencing type
  separately
- added COMMON_SETTINGS as alias to SANGER_SETTINGS, should help to clarify
  things a bit in parameter files
- removed -horrid
- replaced STRM tags with STMS and STMU
- added possibility to set different input / output project names
  (-projectin= and -projectout=), the (still existing) -project= is simply a
  combination of in/out
- improved repeat tagging when different sequencing technologies are
  involved.

2.9.20x1 (do not use with Solexa data)
--------------------------------------
- major rework of parameters information display, now separate info for each
  sequencing type is shown where appropriate
- added SANGER_SETTINGS, 454_..., SOLEXA_..., SOLID_... This allows setting
  all parameters for all sequencing types in one file (or the command line).
  Or distribute the settings across different files, whatever one wants.
- removed hack to load parameters specifically for 454 data (file
  "454params.par") as functionality is given by the ..._SETTINGS above.
- new -LR category (merged -454 and -SR, together with some -GE)

2.9.19x1 (do not use with Solexa data)
--------------------------------------
- fixed minor bug that led to bases sometimes being aligned more against gap
  columns than against consensus bases
- minor changes in quality computation towards end of reads
- changed quality computation for consensus with multiple sequencing types:
  now not best quality only, but additive quality
- fixed bug when computing Solexa only consensus (appeared in 2.9.18)

2.9.18
------
Preparing for simple Solexa / SOLiD mapping assembly:
- added -SR
- added -PF:bqoml
- added -SB:bro
- fixed bug that led to inclusion of 100% potential overlaps smaller than
  minimum allowed overlap (had probably only very small effect on assemblies)
  Bug appeared in 2.9.17 (SW not being recomputed).

2.9.17
------
- improved template handling: TRACE_END is now read from TRACEINFO XML
- improved template handling: "Strand" is now read/written from/to CAFs,
  leading to correct information in Staden projects when caf2gap is used
- increased speed of SW alignment phase: perfect matches are not computed
  again. Saves 50-70% of SW alignments for a typical project with 454 GS20
  data, with Sanger data it's still ~25%.
- increased speed of adding aligned reads to contigs. Important for contigs
  >50k reads, saving ~20-30% time.
- fixed bug in routine that should have sped up endgame of an assembly (came
  in in 2.9.14b) but led to inconsistent exclusion of reads.
- added hack to load parameters specifically for 454 data (file
  "454params.par") (later removed in 2.9.20x1)
- (internal: added timing measurements for contig and pathfinder objects to
  search for unfriendly runtime behaviour)

2.9.16x1
--------
Experimental version for improved hybrid assemblies
- fixed bug that led to short contigs when performing a hybrid assembly with
  low Sanger coverage
- added -PF:uqr:qrml1:qrms1:qrml2:qrms2

2.9.15
------
- fixed bug in SequenceVector handling of CAF reading routine that was
  triggered by 454 type data

2.9.14b
-------
- rewrote large parts of the 454 assembly tutorial
- reworked the spneut4demo_assemblies package
- added "sffinfo2mirafiles.tcl" script to the distribution
- adapted a few internal parameters for 454 assembly (a bit more stringent,
  uses a bit less memory and is slightly faster), using similar standard
  parameters as Newbler.
- routine to speed up of endgame of an assembly: intermediate singlets are
  converted into debris. Speeds up assembly of SpneumoniaeT4 with a 7-3
  scheme from ~47hrs to ~21hrs runtime.
- de-activated old reads-only editing routines (-454:soer:soemq)
- renamed -FILE (-FI) to -FILENAME (-FN) to reflect internal structure
- fixed output file name of miraclip to name given in parameters (or
  constructed from project name)

2.9.14a
-------
Feature freeze for 2.9.15 (target: mira easily usable for 454 and 454 /
Sanger)
- moved -CO:dismin:dismax to -AS:tismin:tismax
- change in behaviour: template insert sizes now do not get assigned anymore
  by default, only on request (-AS:tismin:tismax being unequal to -1). That
  is, reads that do not have this information as ancillary data will not get
  default insert sizes assigned unless expressly wished so. This fixes the
  bug of 454 reads without template information getting insert sizes
  assigned.
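
The change in behaviour described under 2.9.14a above amounts to treating -1 as
a "not set" marker for -AS:tismin:tismax. The following Python sketch shows one
way to read that rule; it is not MIRA's implementation, and the function name,
the use of None for missing ancillary data and the requirement that both
values be set are assumptions made for illustration.

```python
UNSET = -1  # value of -AS:tismin / -AS:tismax meaning "do not assign defaults"


def template_insert_size(read_min, read_max, tismin=UNSET, tismax=UNSET):
    """Return the (min, max) insert size to use for a read, or None.

    read_min, read_max -- insert size from ancillary data, None if absent
    tismin, tismax     -- project-wide defaults, -1 if not requested
    """
    # Ancillary data of the read always takes precedence.
    if read_min is not None and read_max is not None:
        return (read_min, read_max)
    # Defaults are applied only when explicitly requested.
    if tismin != UNSET and tismax != UNSET:
        return (tismin, tismax)
    # Otherwise the read stays without insert size information.
    return None


if __name__ == "__main__":
    print(template_insert_size(None, None))             # no defaults requested
    print(template_insert_size(None, None, 500, 5000))  # defaults requested
    print(template_insert_size(2000, 3000, 500, 5000))  # read data wins
```

With this reading, a 454 read without template information simply keeps no
insert size unless the defaults were set on purpose.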

2.9.14
------
- new parameters -CLIP:bsqc*
- fixed bug: multicopy reads were not detected after early SKIM phase
  (present since 2.9.12)
- small adaptions for SKIM hit reduction for 454 only assemblies

2.9.13
------
major adaptation and feature enhancement of convert_project to changes after
2.9.8 (not finished, but usable again)
- bugfix: convert_project and other tools now keep name of contig and do not
  rename this to stdname_... anymore
- bugfix: convert_project and other tools now do not recalculate the
  consensus when loading from CAF. Note however that consensus qualities must
  be recalculated when they are not stored in CAF files. This recalculation
  may lead to slightly different quality values.
- convert_project and other tools now keep order of reads when writing back
  to CAF files
- when loading a CAF with contigs, convert_project now only needs enough
  memory to convert one contig at a time, not the complete project
- workaround: caf2gap failed to convert CAF files where the contig name is
  equal to the name of a read

2.9.12
------
- bugfix: insert size standard deviation was not read due type from XML
  TRACEINFO files (and hence the standard of 500 used, which was sometimes
  not enough for libraries >3kb)
- reduced memory footprint of pathfinder algorithm, important for high
  coverage 454 areas.
- reduced memory footprint of alignment storage by 16% on 64 bit
  architectures (less for 32 bit architectures)
- both reductions now allow de-novo hybrid assembly of S.pneumoniae (1.1
  million reads) with 4 GB RAM (and some 2GB free swap).
- speed increase by factor 2-5 (depending on repetitive area) in pathfinder
  algorithm for 454 reads.
  Effectively almost halving total time needed in hybrid and highly stacked
  454 assemblies
- improved pathfinder algorithm for better bridging of repeats
- storage of singlets in project results can now be controlled by
  -OUT:sssip:stsip

2.9.11
------
- bugfix: cut down overzealous editing of 454 reads in misassembled parts of
  a contig
- enhanced internal handling of sequencing data types. Routines now use
  dedicated parameter sets for each type (cannot be set from command line or
  parameter file yet)

2.9.10
------
- major fixes in alignment of contigs which allows better hybrid assemblies
  (more reads added)
- major rework of 454 read editing. Kicks out most of the obvious sequencing
  errors and now pre-assembly editing reads-only is not required anymore.
- major new memory saving option: -GE:kcim (not compatible with spoiler
  detection -AS:sd)
- new routines to thin out overlap graph which reduces number of initial
  Smith-Waterman overlap alignments by 80%-90% (increasing speed in this part
  by 5x - 10x). Drawback (mainly 454 data): really highly repetitive areas
  with complicated solution space will not get optimally solved and more than
  expected reads of these areas will turn into singlets.

2.9.9
-----
- MIRA now supports merging results from a SSAHA vector screen run. This
  makes you basically independent from any other commercial or
  license-requiring vector screening software. For Sanger reads, a
  combination of "lucy" and "ssaha" together with this parameter should do
  the trick. For reads coming from 454 pyro-sequencing, ssaha(1) and this
  parameter will work very well. New parameters:
  -CL:msvs:msvsgs:msvsmfg:msvsmeg:msvssfc:msvssec and -FI:svsi
- along with the above: a new mira program "miraCLIP" has been created that
  just clips data from loaded files and dumps them to CAF format. The
  "miraPRE" program should now be used only for a first repeat-disentangling
  assembly.
- renamed -CL:pvc to -CL:pvlc to make function clearer
- Work in progress: new functionality that reduces number of hits that must be checked by SW alignment when working with 454 data, especially useful for hybrid Sanger / 454 assemblies. First tests very promising
- the "|" characters in the name of reads in fasta files are now kept and not replaced anymore with "_", only in the EXP file names written to disk they are replaced
- optimised memory handling when loading CAF files, leading to decreased memory footprint of sequences loaded via CAF (important for millions of reads)

2.9.8b,c,d,e,f
--------------
- polishing of the MIRA build process, added flex version check, added --enable-static flag to ./configure
- fixed build process for MacOS X (Darwin)
- bugfix in alignment routine: hitting band encasing was not discovered in some cases (present since adding feature in 2.9.2)
- fixed minor code ambiguities
- bugfix: minimum left clips (-CL:emlc) were not performed for 454 data
- added mira_454dev help file as a preliminary guide for 454 assembly

2.9.8
-----
- new program: "miraPRE". miraPRE is a preprocessing step that performs the preprocessing of reads (clipping, read extension, simple 454 editing) together with a first "reconnaissance" assembly of the most repetitive regions. The following "real" assembly is then faster (allowing for playing around with some more options) and more accurate for repeats
- bugfix for hybrid assemblies: repeats found were not correctly accounted for, leading to the assembler not correctly recognising when to re-assemble
- minor bugfix: the logfiles for rejected alignments contained binary data

2.9.7
-----
- bugfix: 454 reads now are not put through the possible sequence vector clipping routines
- bugfix: assemblies with several backbone sequences could lead to the assembler aborting with an error. Fixed. Present since 2.7.5, but different bug than the one fixed in 2.7.6.
- improved new consensus computation routines introduced in the 2.9 series (needed because of hybrid assemblies) to better handle aberrant Sanger cases
- improved consensus computation for hybrid Sanger / 454 assemblies
- improved loading speed of qualities in fasta files where the reads are not in the same order as in the sequence files (mostly noticed with 454 data files having millions of reads)
- renamed output directories to _d_info, _d_results and _d_log
- small change in output behaviour: contigs are now put first in output files (CAF, GAP4DA, FASTA etc.), then follow singlets
- change in behaviour: debris are now left out from result files of an assembly. Debris are reads that are too short or do not align to any other read in the data set. Also, 454 reads that could not be assembled into a contig are treated as debris (even if they potentially aligned to other reads in the assembly).
- automatically switching off output of GAP4DA format if 454 type data is present (you *really* do not want millions of files in a directory)

2.9.6
-----
- adapted internal multicopy detection for hybrid assemblies
- adapted simple 454 overcall editing for hybrid assemblies
- enhanced XML traceinfo support for files directly from the NCBI trace archive:
  - support for XML "ti": now, XMLs & FASTAs from NCBI need not be rewritten (changing names etc.)
  - support for XML trace_type_code (now possible to mix Sanger type & 454 type in one fasta and XML traceinfo respectively)
- fixed bug in XML traceinfo routines: XML elements with uppercase letters were not recognised
- fixed ugly bug that led in very rare cases to suboptimal or missed alignments in banded SW. Bug triggered by short read data (454 type), present since MIRA V1.1.1, correction attempt in V2.2.2/2.3.3 only partly successful (ooops).
- fixed bug that led to slow SW band alignments in development version (introduced through bugfix in V2.9.2)

2.9.5
-----
- tweaked internal memory handling (STL), reducing memory footprint of MIRA

2.9.4
-----
- fixed bug in tag type set for SNPs between strains (introduced in 2.9.3)
- improved repeat resolving when using low number of passes (e.g. only 2) or for highly complicated repetitive projects by adding selective SW alignment iteration after each pass.
- improved interaction between contig building and 454 contig editing process, leading to fewer building loops needed to achieve the same result

2.9.3
-----
- introduced "Carbon-copy Repeat Marker in Reads" (CRMr) which tremendously helps assembling repetitive 454 sequences (also good for Sanger sequences)
- reactivated Sanger repeat recognition with new repeat handling routines
- fixed bug: when loading sequences with gaps (like in assembled CAF projects), existing gaps in reads were not removed prior to re-assembly
- fixed bug that prevented mapping assemblies against backbones to be fast

2.9.2
-----
- added -454:soer:soemq for "simple" editing of errors in 454 reads.
- MIRA now stores additional information needed for assemblies of strains and/or 454 data in CAF and EXP files in a way that resists different transformations to and from gap4 databases (MINF tags)
- bugfix when loading CAF: some attributes were not reset, leading to some reads having attributes of preceding reads in CAF file.
- fixed bugs that sometimes led to suboptimal alignments when building contig alignments.
  This became apparent with assembly of highly stacked 454 data
- first working version of repeat discovery in 454 data
- first working version of "tricky overcall editing" in 454 data (these things are responsible for most of the frameshift errors)

2.9.1
-----
- test activation of 454 read-only-editing routines


Change file for MIRA 2.8.3
==========================

2.8.3
-----
- fixed bug in XML traceinfo routines: XML elements with uppercase letters were not recognised (backport from the 2.9 development line)

2.8.2
-----
- polishing of the MIRA build process, added flex version check, added --enable-static flag to ./configure

2.8.1
-----
- fixed bug that prevented mapping assemblies against backbones to be fast (backported from 2.9.3)
- bugfix when loading CAF: some attributes were not reset, leading to some reads having attributes of preceding reads in CAF file (backported from 2.9.2)

2.8.0
-----
- replaced tcl dependency in compile process with perl dependency

2.6.x -> 2.7.x -> 2.8.0rc2
--------------------------
The 2.8.x series of MIRA is an intermediate step towards MIRA 3.0. However, two major highlights (besides the usual stream of small improvements and bugfixes) justify a new production release:

1) Important speedups in a few central places (Reads, Contigs). Additionally, the all-against-all read comparison algorithm (SKIM) has been sped up by a factor of >60 (for large number of reads).
2) MIRA is going Open Source! More specifically, MIRA is being put under the GPL (version 2) as I kindly received the authorisation from both the DKFZ Heidelberg (Deutsches Krebsforschungszentrum, German Cancer Research Center) and from Thomas Pfisterer (the author of the EdIt part in MIRA) to release the code.

Please note that the 2.8 line still does not officially support assembly of 454 data. These routines are still under development and will be made available in the 2.9 development line.
Important notice: a few parameter switches were changed since the 2.6.x releases, existing parameter files may have to be changed. Please consult the documentation for the new names of the parameters.

Changes in detail:
==================

2.7.8
-----
- Major bugfix: assembling with more than one strain without backbone produced less than sub-optimal solutions.

2.7.7
-----
- internal changes
- fixed error in assembly class in handling bases tagged as SIOx within a strain. (had only minor repercussions in assemblies)
- fixed bug in skim output: permbans were not updated correctly for the summary. (had no effect whatsoever on assemblies, just in the printed skim summary of the MIRA log)

2.7.6
-----
- smaller internal changes (removed deprecated strstream constructs, removed old code)
- also put EdIt and other code from Thomas under GPL
- fixed bug when loading backbone sequences that have no sequence
- fixed bug when assembling with more than one backbone sequence (introduced in 2.7.5)
- added -AS:bdq
- brought documentation up-to-date

2.7.5
-----
- Larger internal changes. Reduced memory footprint for alignment checks: results are now temporarily written to disks instead of being kept in memory. Useful when working with millions of reads.
- speedups in contig: moved some redundant costly internal checks into versions compiled only for special bughunting
- removed -SK:im:mc
- added -SK:bph:hss:mhim
- new log file containing SKIM raw hash hit numbers in log directory
- new log file containing simple readpool info in log directory
- squashed a small bug in alignment code that sneaked in in 2.7.2
- switched miraEST on again (switched off in 2.7.4). SNP analysis still the old one though.
- the pathfinder object is now faster for backbone assembly.
- put MIRA code under GPL

2.7.4
-----
- added -454:c454cq (not public yet)
- change: MIRA now writes files into three directories below the starting directory: _log, _info and _results.
  This helps to keep things a bit cleaner in the directory.
- due to the above: temporarily switched off miraEST (also wrt the fact that SNP analysis is going to undergo a major overhaul)
- sped up endgame (coping with remaining singlets) of assemblies that have a large number of reads.
- *major* speedup of SKIM (all against all comparison) routine. E.g. SKIMming of 53,000 reads now takes a minute instead of 62 minutes.
- reduced memory footprint for SKIM: results are now written to disks instead of being kept in memory. Useful when working with millions of reads.

2.7.3
-----
- first trial version that assembles 454 consensus data (not real 454 data yet)
- changed pathfinder strategy. Now uses a less aggressive way of determining next read to add. Should improve all "difficult" assemblies with many repeats a bit
- reworked/changed all existing -454 parameters
- Change: there is no more default source for the reads to be loaded. It now must be explicitly set, either via -GE:lj or the quickswitches like -fasta etc.
- Name change: main assembly 'loops' are now called assembly 'passes' for better distinction to PRMB break loops. Renamed -AS:nol:sel:sdllo to -AS:nop:sep:sdlpo, -SB:sbuil to -SB:sbuip, -DP:feil:leil to -DP:feip:leip
- MIRA now writes clear parameter parsing error messages (if needed) on startup

2.7.2
-----
- speedups in assembly & output: moved some redundant costly internal checks into versions compiled only for special bughunting
- changed consensus quality computation (faster). Results are mostly the same or a tad higher than for the old routines (a tad lower for quality values <= 10 or so).

2.7.1
-----
- fixed small bug in handling of tags for alignments

2.7.0
-----
- initial takeover from 2.6.0

2.4.x -> 2.5.x -> 2.6.0
-----------------------
Main development focus for the 2.6 production release of the MIRA assembly tools was to realise improvements in speed and memory footprint compared to the 2.4 line.
All changes were extensively tested in the 2.5.x development versions and were ported to the new 2.6.0 production version.

Highlights:
- new, easy to use-and-combine parameter switches for predefined tasks: quick switches. (also called dwim: Do-What-I-Mean switches)
- constant memory SKIM routine for fast all against all overlap checking. This was needed when assembling larger bacteria or lower eukaryotes in a limited amount of memory. As bonus, this new SKIM is 40% to 60% faster than the old one.
- reduced memory footprint for stored alignments. 37% reduction in this part, important for big assemblies
- enhanced read extension routines. The new routines are now quite efficient in extending reads as much as possible while leaving really bad quality parts untouched
- new type of output: GBF (GenBank file). Extremely useful when performing assemblies against a backbone (read mapping strategy) which in itself may be a GenBank file containing features.
- enhanced convert_project utility. Converts more assembly file formats
- a number of small bugfixes and other improvements

Please note that the 2.6 line does not officially support assembly of 454 data. These routines are still under development and will be made available in the 2.7 development line.

Important notice: a few parameter switches were changed since the 2.4.x releases, existing parameter files may have to be changed. Please consult the documentation for the new names of the parameters.

Focus for next development cycle:
- multithreading
- 454 data

2.2.8 to 2.4.0
--------------
The new major 2.4 release line of the MIRA assembly tools opens a whole new set of possibilities for sequence assembly. Starting with V2.4.0rc1 (corresponds to 2.3.31 of the development line), there were no more restrictions built into the binary regarding time or number of sequences that MIRA can handle. Your available memory will be the limit.
Starting with 2.4.0, binaries are now made available for both 32 and 64 bit platforms of x86 Linux.

MIRA has learned a number of useful new tricks like assembling against other sequences (backbones), usage of strain information in genomic assembly (closely related strains can now be assembled in one go), SNP analysis, optimised alignments (no more gap base jiggling), loading sequences gained from the NCBI trace archive etc.

Compared to the 2.2.x line, speed has increased (sometimes quite drastically) and memory requirements have decreased a bit. Several smaller and bigger bugs have been fixed. I highly recommend upgrading to this version as soon as possible, even if parameters could not be kept 100% backward compatible.

New / changed features
......................
- added possibility to load "backbones" and assemble against those sequences. Backbones can be in FASTA, CAF, EXP or even Genbank (GBF, GBK) format. Sequence features / tags are honoured in backbones.
- enhanced inference of previously undetected repeat marker bases to include inference of IUPAC support.
- sequence alignments get nicer for "long" indel regions.
- alignment scoring function now per default assigns decreasing gap extension penalties. Eases life for assembling against genomic backbones. Drawback: -AL:egp must now be manually selected for EST assembly and for "real" genome assembly, -CO:amgb:amgbemc are recommended.
- enhanced handling of repetitive sequences characterised not by bases, but by insertions and deletions.
- enhanced contig tagging mechanism
- added counts of IUPAC and funny characters in contig statistics and _info_contigstats.txt files
- small change in output when parameter parsing failed: usage is now printed before analysis of error cause
- changed position columns in different _info and _out files so that now padded and unpadded positions are given.
- small cosmetic changes in different output files
- renamed FASTA output files: "raw" files are now named "padded" while previously 'normal' FASTA files without special extension are "unpadded". (getting some consistency with gap4)
- new result file type TCS: Transposed Contig Summary. Idea "borrowed" *cough* from TIGR .tcov files. Nicely suited for "quick" analyses from commandline tools or even visual inspection. Written only as final result with appendix "_out.tcs". New parameter: -OUT:ors
- first draft of SNP analysis function, saved in assembly information file "_info_snpanalysis.txt"
- Can now load GenBank files as backbone reads (new -SB:bft parameter value: "GBF"). Also load the features as GAP4 compatible tags from that file.
- larger changes in the tag naming scheme that also have repercussions in the parameter options. This was needed to simplify searches for problematic assemblies in editors (like e.g. gap4). Repeat Marker Bases (RMB) are now split into Strong/Weak types and also whether they occur in reads or in the consensus. PRMB becomes SRMr or SRMc, WRMB becomes WRMr or WRMc. The tags PAOS, PIOS and PROS for SNPs are now SAOr/c, SIOr/c and SROr/c. To keep parameter options naming scheme consistent, some parameters had to be renamed: -AS:pbl to -AS:rbl, and -CO:mpc:npz:mgqpt:mgqwpc to -CO:mrc:nrz:mgqrt:mgqwsc
- SRMc, SROc, SIOc and SAOc tags now get the group quality for each base as additional output in the "_info_consensustags.txt" file
- cleaned up "_info_consensustags.txt" and "_info_readtags.txt" a bit
- cleaned up error messages when SCF data is not found
- standard deviation of inserts is now read from NCBI traceinfo files. Minimum and maximum insert sizes are now calculated as insert_size -/+ 4*stddev.
- MIRA now automatically corrects sequence names in sequences downloaded from the NCBI traceinfo archive. It replaces the "gnl|ti|....." name with the "real" name (the one after the " name:" string).
  This allows using FASTA files from the trace archive directly without further preprocessing.
- if strain names are given, MIRA now also creates extra strain files in FASTA format as result of the assembly
- the parameters with which MIRA was called are now written at the start of a project into _info_callparameters.txt

Performance
...........
- removed debugging code that was wrongly left activated in calculation of dynamic programming matrix (for alignments). Speedup in alignment calculation: factor ~6. Speedup in typical assembly project: factor >2. Bug was introduced in V2.2.3 (*sigh*)
- reduced memory consumption of sequences
- reduced memory consumption in assembly process: clipping of vectors (-CL:pvc) inflicted a huge memory penalty. This has been resolved.
- optimised assembly when loading backbones (rails are not aligned anymore)
- improved handling of similar sequences having (certain) indels: these are now treated as real indels. Takes effect when -CO:amgb is on.
- improved genome building anchors by starting in non-multicopy sites
- optimised skimming evaluation
- small internal speed optimisations
- speed enhancements: reads that have contradicting PRMBs will now be excluded from the SKIM and alignment phases in subsequent loops
- major speed increase (> factor 10) when loading larger CAF files

Tools
.....
- streamlined tools: merged several small utilities into 'scftools' and 'fastatools'

Options
.......
- removed unused options: -AL:emp* (unused for a long time), -CO:mgqwsc:nrz (disappeared in 2.3.29)
- renamed -CO:ismin:ismax to -CO:dismin:dismax
- added -OUT:ots for tcs output of temporary results
- added -CO:np
- -CO:asir is now also settable by commandline (was reserved for setting only by miraEST)
- added -SB: options. Also -AL:megpp -CO:amgb:amgbemc
- renamed -GE:ess:lb:bft:brl parameters to new category -SB
- renamed -EG:ess to -GE:ess.
  Moved -EG:lsd to -SB:lsd (and disbanded -EG category)
- added -AS:sel -SK:mhpr to optimize performance for really deep repeats. Only "n" best hits are given to the SW alignment checks
- added logic to improve assembly when some reads have too high quality values for wrongly called bases
- added quick switches -estmode and -horrid
- quick switches on the command line now print out what they are setting

Bugfixes
........
- base positions now won't get multiple equal tags.
- SNPs were sometimes wrongly disrupting contig building
- -CO:amgbemc was (still) not honoured. Fixed.
- the problem of slow loading EXP files has been resolved
- tags lost their direction when saved as CAF or ACE, fixed.
- tags got wrong direction when loaded from CAF
- qualities in CAF files were put in one single large line, fixed to multiline
- -CO:also_mark_gap_bases and -CO:also_mark_gap_bases_even_multicolumn did not work as advertised
- several small bugfixes in output functions that led MIRA to abort on rare occasions while saving results to files.
- CAF and ACE files now get "correct" multiline tags
- TG tags in EXP files were all converted to be on both directions
- fixed small bug while parsing -SB parameters
- fixed ugly bug in template handling (introduced 2.3.11, the 2.2.x line was fortunately not affected). This led to really bad assemblies when template information was used.
- potential problem fix: changed output in EXP files for ON entries to be now multiline, so that the Staden iolib can cope with large entries
- the unpadded fasta quality result file contained, in fact, the padded fasta results, fixed.
- in rare cases, low quality bases were taken into account when searching for Possible Repeat Marker Bases (PRMBs). Fixed.
- in rare cases, mira would stop in loops>1 when internal tag handling discovered an error. Error cause has been fixed.
- skimmer: some hits for non-exact matches were not found.
- CAF files containing "Ligation_no" lines caused errors while reading them
- HTML output function crashed on some systems, this should now not happen anymore.
- progress counter during contig building makes "nicer" progress report
- several small typos: parameter options should now be in sync again with the documentation (man pages etc.)

Changes in detail:
==================

2.6.0rc1 to 2.6.0rc2:
---------------------
- improved contig building time by reducing need for alignment recalculation
- fixed rare problem that led to mira aborting in contig building

2.5.12 to 2.6.0rc1:
-------------------
- introduced DWIM (Do What I Mean) parameter switches:
    -genomedraft, -genomenormal, -genomeaccurate
    -mappingdraft, -mappingnormal, -mappingaccurate
    -clippinglight, -clippingnormal, -clippingheavy
    -highlyrepetitive
    -highqualitydata
- fixed dumb error in vector clipping (-CL:pvc) that clipped away too much when vector clipping was performed on a read
- fixed rare problem where mira aborted while extending reads. (-DP:ure)
- optimised partitioned skim parameters for present day quality Sanger type shotgun sequences. Runtime -18%. No effect with 454 type data.
- fixed errors in parameter parsing: -OUT:org:orf -AS:sel
- quick switches for loading files (e.g. -fasta, -phd etc.) have been cut back to exactly this functionality (loading), without further side effects (like switching of read extension etc.). Functionality transferred to DWIM switches (see above)

2.5.12
------
- improved algorithm for computing read extensions on Smith-Waterman aligned sub-sequences, now switched on per default for pre-assembly read extension
- added -DP:rewl:rewme:feil:leil
- added -CO:amgbnbs

2.5.11
------
- fixed a problem while loading CAF files with exposed sequence vectors that led to wrong sequence positioning in a contig
- when CAF files loaded as backbone do not contain contigs, then the reads themselves are used as backbones.
- added special mode for backbone assembly, faster and more accurate
- changed genomic assembly mode, faster and more accurate

2.5.10
------
- deleted -CO:mrc parameter
- added -CO:mrpg parameter

2.5.9
-----
- reworking of internal data structures leads to 37% decrease of memory consumption of stored alignment data. Important for 454 type data, e.g. project with 250000 reads now uses 850M for this part instead of 1350M.
- improved alignment algorithms lead to better alignments of 454 type data (improvements also noticeable for Sanger type data, but less so)
- bugfix: in very rare cases, sequences could be wrongly inserted into a consensus, leading to slight misalignments. Problem seen first with 454 type data
- new parameter: -SB:abnc

2.5.8
-----
- railreads and backbones in contigs now have no SCF (and EXP) files assigned
- new parameters: -454:mdis454:hybrid (partly functional)

2.5.7
-----
- renamed -CO:uti to -GE:uti
- MIRA switches off usage of template information when no useful information for this is present in data
- changes in pathfinder to speed up building contigs with many reads (454 data) or when genome sized backbones are used
- new "hidden" parameter: -PF:swcs
- CAF sequence names may now contain "#" and "|" as characters (the latter perhaps not being very useful)
- FASTA sequence names may now contain "#" as character
- FASTA sequence names containing "|" are rewritten to contain "_"

2.5.6
-----
- added additional Genbank tags as "dont analyse for SNPS" in assout::saveFeatureAnalysis()
- first test versions capable of acceptable 454 data assembly

2.5.5
-----
- adjusted typos in documentation: -CO:dismin and dismax lacked the "d" in some parts of the docs (due to a parameter rename earlier *sigh*)
- fixed typos in -CL:mlcr:smlc documentation

2.5.4
-----
- new partitioned skim routine: faster, less memory consumption. Current drawback: number of hits cannot currently be limited per read (-SK:mhpr has no effect)
- new quick switch: -454data.
  Please note that as of this writing, mira is not yet optimally suited to handle this type of data well, there is still some development needed in this area. Relying on the consensus of MIRA is NOT recommended at this time!
- omission in manual corrected: -CL:emlc:mlcr:smlc were not described.

2.5.3
-----
- GBF output now complete: protein translation also written when Genbank features are present as tags in the assembly
- convert_project: fasta output now also writes out the consensus of contigs in an assembly when the input was CAF
- convert_project: added -q
- convert_project: added caf2fasta, caf2gbf, caf2text, caf2html, gbf2caf and gbf2fasta as aliases to convert_project which have -f and -t already set accordingly

2.5.2
-----
- decreased memory usage for skim routine by 50%, barely noticeable speed penalty

2.5.0 ->
--------
- ace2caf: new option -F and small improvements
- small bugfixes
- first GBF output

2.4.0rc2h
---------
- added -SB:sbuil parameter option
- fixed bug when reading GBF files where positions have a ">" in front
- change: assembly of multiple strains improved
- change: /note entries are now also read from GBF files and put into tags
- change: tags for Repeat Marker Bases and SNPs in reads now do not get any comment, only the consensus tag gets the full comment. Saves quite some space in .exp and .caf files for projects with many such tags.
- added GBF as outtype for the "convert_project" program
- added demodata for backbone assembly

2.4.0rc2c to 2.4.0rc2g
----------------------
- added logic to improve assembly when some reads have too high quality values for wrongly called bases
- added quick switches -estmode and -horrid
- quick switches on the command line now print out what they are setting
- small bugfix in determination of multicopy reads when using backbone assembly
- internal rearrangements

2.4.0rc2 to 2.4.0rc2c
---------------------
- the parameters with which MIRA was called are now written at the start of a project into _info_callparameters.txt
- improved pathfinder that now better bridges repetitive sequences
- fixed bug that struck the pathfinder (endless loop) when a) disk was full and b) -AS:max_contig_buildtime was used
- major speed increase in pathfinder when using backbone assembly. Rearranged pruning in pathfinder, leading to dropping untaken paths earlier in the evaluation. Results of the pathfinder are the same as in earlier versions, no change there.
- improvement in handling of repeats characterised only by indels: these are now correctly treated as are repeats characterised by base changes. Leads to vastly improved assemblies of repetitive regions. This is also due to SRMx tags triggered by gaps now inducing multibase (multicolumn) tags. These are also set on the ends of stretches of equal bases
- backbone contigs consisting of a single sequence now get the name of the sequence as name of the contig
- if strain names are given, MIRA now also creates extra strain files in FASTA format as result of the assembly
- major speed increase (> factor 10) when loading larger CAF files

2.4.0rc2
--------
Mostly bugfixes and last tweaking on small things. This should be the last release candidate before 2.4.0, which will include more documentation and examples.

- bugfix: a fatal bug in 2.4.0rc1 prevented traceinfo XML files from being correctly read.
  So, when one relied on the XML files, this led to sequencing vector not being clipped, bad quality being included etc. This has been fixed.
- bugfix: HTML output function crashed on some systems, this should now not happen anymore.
- progress counter during contig building makes "nicer" progress report
- added documentation entry for -AS:ugpf
- standard deviation of inserts is now read from NCBI traceinfo files. Minimum and maximum insert sizes are now calculated as insert_size -/+ 4*stddev.
- MIRA now automatically corrects sequence names in sequences downloaded from the NCBI traceinfo archive. It replaces the "gnl|ti|....." name with the "real" name (the one after the " name:" string). This allows using FASTA files from the trace archive directly without further preprocessing.

2.3.30
------
- removed unused options: -AL:emp* (unused for a long time), -CO:mgqwsc:nrz (disappeared in 2.3.29)
- enhanced inference of previously undetected repeat marker bases to include inference of IUPAC support.
- removed debugging code that was wrongly left activated in calculation of dynamic programming matrix (for alignments). Speedup in alignment calculation: factor ~6. Speedup in typical assembly project: factor >2. Bug was introduced in V2.2.3 (*sigh*)

2.3.29
------
- alignment scoring function now per default assigns decreasing gap extension penalties. Eases life for assembling against genomic backbones. Drawback: -AL:egp must now be manually selected for EST assembly and for "real" genome assembly, -CO:amgb:amgbemc are recommended.
- sequence alignments get nicer for "long" indel regions.
- enhanced handling of repetitive sequences characterised not by bases, but by insertions and deletions.
- enhanced contig tagging mechanism
- bugfix: base positions now won't get multiple equal tags.

2.3.28
------
- further working on SNP/Feature analysis
- bugfix: SNPs were sometimes wrongly disrupting contig building
- streamlined tools: merged several small utilities into 'scftools' and 'fastatools'
- renamed -CO:ismin:ismax to -CO:dismin:dismax
- bugfix: -CO:amgbemc was (still) not honoured. Fixed.

2.3.27
------
- further working on SNP/Feature analysis
- bugfix: the problem of slow loading EXP files has been resolved
- bugfix: tags lost their direction when saved as CAF or ACE, fixed.
- bugfix: tags got wrong direction when loaded from CAF
- first internal changes to make compiling -Wall ... etc. proof

2.3.26
------
- bugfix: qualities in CAF files were put in one single large line, fixed to multiline
- multiple internal changes (char * to string, shifting of functions into namespaces etc.)

2.3.25
------
- bugfix: -CO:also_mark_gap_bases and -CO:also_mark_gap_bases_even_multicolumn did not work as advertised
- reduced memory consumption of sequences
- reduced memory consumption in assembly process: clipping of vectors (-CL:pvc) inflicted a huge memory penalty. This has been resolved.

2.3.24
------
- several small bugfixes in output functions that led MIRA to abort on rare occasions while saving results to files.
- added counts of IUPAC and funny characters in _info_contigstats.txt files

2.3.23
------
- small change in output when parameter parsing failed: usage is now printed before analysis of error cause
- changed position columns in different _info and _out files so that now padded and unpadded positions are given.
- small cosmetic changes in different output files

2.3.22
------
- added -OUT:ots for tcs output of temporary results
- fleshed out the SNP analysis
- renamed FASTA output files: "raw" files are now named "padded" while previously 'normal' FASTA files without special extension are "unpadded".
  (getting some consistency with gap4)

2.3.21
------
- bugfix: CAF and ACE files now get "correct" multiline tags
- new result file type TCS: Transposed Contig Summary. Idea "borrowed" *cough* from TIGR .tcov files. Nicely suited for "quick" analyses from commandline tools or even visual inspection. Written only as final result with appendix "_out.tcs". New parameter: -OUT:ors
- first draft of SNP analysis function, saved in assembly information file "_info_snpanalysis.txt"

2.3.20
------
- bugfix: TG tags in EXP files were all converted to be on both directions
- Can now load GenBank files as backbone reads (new -SB:bft parameter value: "GBF"). Also load the features as GAP4 compatible tags from that file.
- optimised assembly when loading backbones (rails are not aligned anymore)

2.3.19
------
- larger changes in the tag naming scheme that also have repercussions in the parameter options. This was needed to simplify searches for problematic assemblies in editors (like e.g. gap4). Repeat Marker Bases (RMB) are now split into Strong/Weak types and also whether they occur in reads or in the consensus. PRMB becomes SRMr or SRMc, WRMB becomes WRMr or WRMc. The tags PAOS, PIOS and PROS for SNPs are now SAOr/c, SIOr/c and SROr/c. To keep parameter options naming scheme consistent, some parameters had to be renamed: -AS:pbl to -AS:rbl, and -CO:mpc:npz:mgqpt:mgqwpc to -CO:mrc:nrz:mgqrt:mgqwsc
- the -SB:bol option has been removed

2.3.18
------
- fixed small bug while parsing -SB parameters
- improved handling of similar sequences having (certain) indels: these are now treated as real indels. Takes effect when -CO:amgb is on.

2.3.17
------
- fixed ugly bug in template handling (introduced 2.3.11, the 2.2.x line was fortunately not affected). This led to really bad assemblies when template information was used.
- improved genome building anchors by starting in non-multicopy sites
- new utility program: scf_remix.
  Useful for "fixing" broken SCFs or SCFs that are out of sync with other data sources

2.3.16
------
- added -CO:np
- -CO:asir is now also settable from the command line (was reserved for setting only by miraEST)
- PRMB, PROS, PIOS and PAOS tags now get the group quality for each base as additional output in the "_info_consensustags.txt" file
- cleaned up "_info_consensustags.txt" and "_info_readtags.txt" a bit
- added counts for IUPAC bases and funny characters in contig statistics

2.3.15
------
- added -SB:bol:bbq -AL:megpp -CO:amgb:amgbemc
- cleaned up error messages when SCF data is not found
- potential problem fix: changed output in EXP files for ON entries to be now multiline, so that the Staden iolib can cope with large entries

2.3.14
------
- renamed -GE:ess:lb:bft:brl parameters to new category -SB
- added possibility to load FASTA files as backbone (-SB:bft=FASTA)
- added possibility to give backbone sequences a strain name (-SB:bn)
- bugfix: the fasta quality result file contained, in fact, the raw fasta results, fixed.
- bugfix: in rare cases, low quality bases were taken into account when searching for Possible Repeat Marker Bases (PRMBs). Fixed.

2.3.13
------
- added possibility to load "backbones" (CAF) and assemble against those. New parameters -GE:lb:bft:brl
- renamed -EG:ess:lsd to -GE:ess:lsd (and disbanded -EG category)
- bugfix: in rare cases, mira would stop in loops>1 when internal tag handling discovered an error. Error cause has been fixed.

2.3.12
------
- bugfix in skimmer: some hits for non-exact matches were not found.
- optimised skimming evaluation
- added -AS:sel -SK:mhpr to optimize performance for really deep repeats.
  Only "n" best hits are given to the SW alignment checks
- small internal speed optimisations

2.3.11
------
- small bugfix: CAF files containing "Ligation_no" lines caused errors while reading them
- speed enhancements: reads that have contradicting PRMBs will now be excluded from the SKIM and alignment phases in subsequent loops
- several small typos: parameter options should now be in sync again with the documentation (man pages etc.)

2.2.7 to 2.2.8 (2.3.10)
-----------------------
This is an intermediate optimisation release. Although I wanted to build in some more (exciting) new features, the spoiler detection and the speed improvements in particular justify releasing the improvements as they are. They make assembly of genomes from 1 to 10 Mb a bit more fun.

- added -AS:sd:sdllo to detect and remedy assembly "spoilers". Only recommended for assembly of genomic sequences! These spoilers can be either chimeric reads or reads with long parts of unclipped vector sequence (that was too long for the -CL: vector leftover clippings). These spoilers typically prevent contigs from being joined; MIRA will cut them back so that they do no more harm.
- added -GE:rns to support naming schemes of different sequencing centers. Sanger and TIGR naming schemes are now supported.
- added -GE:pd flag for controlling date output in output log
- major speed improvements for projects where large contigs (>500kb) with many reads are built.
- minor bugfix: some overlaps were not correctly recognised by the SKIMmer
- minor bugfix: now got the percentage progress report bar right, it sometimes showed false status.
- minor bugfix: added -AS:umcbt:bts to adapt for larger assemblies on slower machines. More useful for EST assembly than for genomic.
- starting with gcc 3.4, mira is now compiled with -O3 as standard optimisation level. Gain of ~5-10% in many algorithms
- known issue: I apparently "optimised" some pathfinding routines too much for EST data.
  Sometimes, for genomic data, some contigs are not at their optimal length. This most likely occurs in low coverage shotguns (<=4), high coverage (>=6) should be ok. I'm working on it.

2.2.6 to 2.2.7 (2.3.9)
----------------------
- fixed serious bug that led to suboptimal assembly of genomic sequences. Upgrade to this version *highly* recommended.
- added -OUT:oetas, exttmp singlets now not saved by default
- exttemp contigs now not saved as "post" when no change (either repeat marked or edits) happened
- *sigh* fixed error in parameter setting for extended_gap_penalty: _long_ gaps sometimes were given a lower penalty than expected
- fixed error in .txt output of a contig: some HTML was thrown in sometimes

2.2.5 to 2.2.6
--------------
- fixed a bug in .ace files that prevented consed from loading them ("clview" from TIGR was not affected)
- fixed rare bug that led to an assembler panic and subsequent abort of the assembly process.
- brought the provided demonstration parameter file up-to-date

2.2.4 to 2.2.5
--------------
- fixed typo in miraEST internal standard parameters, miraEST would not start

2.2.3 to 2.2.4 (2.3.7)
----------------------
- added ability to merge data from NCBI trace info files in XML format (-GE:mxti and -FI:xtii)
- put -GE:lsd:ess to new group -ESTGENERAL (-EG:lsd:ess)
- general update and overhaul of documentation

2.2.2 to 2.2.3
--------------
- fixed typos again in on-screen text

2.2.1 to 2.2.2 (2.3.3)
----------------------
- fixed ugly bug in banded Smith-Waterman that led to misses in some cases

2.2.0 to 2.2.1
--------------
- fixed typos in on-screen text

2.1.22 declared 2.2.0, forked 2.3 branch
----------------------------------------

2.1.21 to 2.1.22
----------------
- fixed minor bug in computation of alignment scores. It sometimes led to suboptimal alignments at the ends of an overlap.
  Effect was frequently seen in EST projects
- new parameters -AL:extra_mlsmatch_penalty:emp*
- new FASTA output for the consensus: the raw format, with gaps, lowercase for normal consensus, upper case for special features like PRMB, WRMB, PAOS, PROS, PIOS. Files are named .raw.fasta

2.1.20 to 2.1.21
----------------
- miraEST gets starting step as parameter: -GE:ess
- fine tuning of gap penalty level for est_splitsplices variant

2.1.19 to 2.1.20
----------------
- optimised memory requirements when reading FASTA files
- slight optimisation of memory requirements for reads
- fixed bug when reading FASTA quality files that had more sequences than the FASTA files themselves
- fixed a number of minor internal bugs that were found with valgrind which had no traceable effects on the assembly
- introduced the TEST versions of MIRA and miraEST

2.1.18 to 2.1.19
----------------
- fixed small bug that caused some reads with PRMB/WRMB tags that matched ok to be rejected as overlap. Influence on assemblies: light, but annoying

2.1.17 to 2.1.18
----------------
- fixed dumb bug in SKIM which caused suboptimal hit numbers *deepsigh*
- fixed small memory leaks here and there (valgrind rocks)

2.1.16 to 2.1.17
----------------
- fixed (dumb dumb dumb) bug: check of minimum alignment score did not take the score multiplier into account *sigh*. This resulted in good, but somewhat shorter, alignments being rejected
- slight tweak in consensus algorithm (less IUPAC codes)
- new parameter: -DP:pvcmla. Enables quite effective sequencing vector leftover clipping without losing splice variants (variants with lower number of bases than -DP:pvcmla will get lost though).
- Sequences not adhering to Sanger (and probably St. Louis) naming scheme now lose template information. Allows better assembly for projects that don't have this scheme.
- miraEST comes with some enhanced standard parametersets

2.1.15 to 2.1.16
----------------
- changed behaviour of contig: when assume_snp_instead_prmb is set, now also tags PROS as PRMB
- polybase masking now uses an enhanced algorithm, -DP: options changed too to reflect this
- minor enhancements in IUPAC consensus computation

2.1.14 to 2.1.15
----------------
- Reading of SCF V2 was broken on x86, fixed

2.1.13 to 2.1.14
----------------
- fixed bug that affected consensus: in (really) rare cases, a base in the consensus was replaced by another base in the consensus output

2.1.12 to 2.1.13
----------------
- improved consensus quality calculation: supporting reads add a bit more to a quality
- fixed dumb bug *sigh* that led to suboptimal assembly results in some cases involving PRMB tags
- during assembly, contigs are now only edited when no unresolved misassembly was detected in that contig. (TODO: also when there are only WRMBs?)

2.1.11 to 2.1.12
----------------
- added POLY and IUPAC tags for HTML output
- small bugfix in HTML output for MISM tag
- bugfix in HTML output: tags in consensus are now shown
- progressbar when loading FASTA files re-enabled
- miraEST now names single-read-contigs (result of step 2) _Singlet instead of _Contig
- helper programs (scf2other etc.)
  now mention the MIRALIB version in their usage text

2.1.10 to 2.1.11
----------------
- adapted base version for diss

2.1.9 to 2.1.10
---------------
- tweaked consensus base probabilities
- marking of repeats: only when dubious bases are surrounded by good quality (new parameter -CO:mnq)

2.1.8 to 2.1.9
--------------
- improved consensus algorithm when non-clipped vector leftovers occur
- improved tagging of possibly misassembled repeats: single read misassemblies now better under control
- fixed off-by-one bugs in tagging of polybases at read ends
- new parameter class -SKIM

2.1.7 to 2.1.8
--------------
- improved consensus algorithm for uncertain base/gap candidates
- renamed -AL:gpl=est_default to est_splitsplices
- new parameter class -DATAPROCESSING
- moved -AS:mr to -CO:mr and -AS:ure to -DP:ure
- new parameter option -DP:tpae to enable/disable tagging of poly-A/T at read ends, options for polybase tagging in DP

2.1.2 through 2.1.7
-------------------
- Consensus disregards bases that are 'masked from consensus' (for the time being the tag POLY for poly-A or poly-T at ends of reads)
- Consensus is now given with IUPAC bases if base evidence is contradictory
- Overall improved IUPAC support
- bugfix: clusters were wrongly computed (affected only output)
- SKIM algorithm can be parameterised
- new extra gap penalty level (egp): 10 (est_default)

2.0.1 to 2.1.2
--------------
- bugfix: parameters for clippings (qual and masked chars) were not used (only 'defaults')
- added computing and output of possible clusters
- better alignments for difficult cases
- writes clustering logfiles
- emergency search stops now work on time dependent basis
- reworked blacklisting of reads for EST assemblies

2.0.0 to 2.0.1
--------------
- fixed typos
- -AS:uess:esspd for restraining computing time on pathological cases of coverages are now functional
- fixed bug, standard parameters for third step of miraEST were in wrong section
- Fixed error in contig: wrong assumption
  about insert sizes led to a halt.
- New Pathfinder algorithm (faster in resolving) CHECKME!
- New banning strategy for found misassemblies CHECKME!
- New parameters -CO:emea, -DI:gap4da
- Added .ace output (alpha) (Tags?)
- Added gap4 directed assembly output
- Screened out bases (Xs) are now passed through unchanged and no longer transformed to Ns in reads.
- Sequences loaded as FASTA can now also fall back on SCF files for qualities and editing if those are present
- standard filenames for in and out changed to "mira"
- fixed bug in EdIt that caused crashes when SCF was not present (thank you valgrind! :).
- added -GE:project to quickly change standard filenames for in and out.
- added quickparams --fasta, --project, --phd
- quietened EdIt when analysing stretches containing reads with no SCF data
- switching automatic contig editing off when no SCF present for the reads
- added possibility to save contig consensus (and qualities) as FASTA: parameters -GE:orf:otf
- Singlets are now named "Singlet" instead of "Contig" in result files. They still get the same continuous number as if they were contigs though.
- contigs are now more permissive on errors when template partners are in range (rodirs*2)
- added -CO:ismin:ismax for controlling default template insert size
- PHD files can now be read, added -FI:pi:fpi (fofnphd does not work yet)
- relocated output parameters to -OUTPUT, added extended temporary output flags
- when loading from fasta or phd, template names are now deduced from read name if they're in Sanger Centre scheme
- added options to perform clipping on reads by quality (-CL:qc:qcmq:qcwl)
- added options to perform clipping on reads by masked bases (-CL:mbc:mbcgs:mbcmfg:mbcmeg)
- repositioned -GE:cpv to -CL:pvc
- added _reads_invalid and _reads_too_short as output files
- added several info and error files as output (for statistics etc.)
- cleaned output as text and put into file
- sped up read extension
- SCF files are now found even if the filenames differ from given names by appending a .Z or .gz, .scf, .scf.Z etc.
- practically doubled speed of banded SW alignment using memrecache algorithms (yeeehaah!)
- added -CO:mgcpt:mgcwpc
- added possibility to load parameters from file (-params)
- added -GE:discard_read_on_eq_error
- write a lot of statistics files at the end of an assembly
- added clustering log files

TODO: statistics and lists (orphans / singlets / clusters?)
TODO: implement fofnphd
TODO: load multiple input files; count the number of reads beforehand and allocate via reserve()
TODO: separate singlets into other files?
TODO: contigs.C: improve template handling in addRead
TODO: look for hidden STL container memory leaks (insufficient reserve())

*Changes from V1.5.2 to V1.5.3
- Added -GE:cpv for clipping possible sequencing vector leftovers in reads (just on the left side at the moment)

*Changes from V1.5.1 to V1.5.2
- Added new parameter -DIR:exp:scf:log to specify input and output directories (log doesn't work yet)
- Fixed a bug that caused segmentation faults when SCF files with 0 bases were used.

*Changes from V1.5 to V1.5.1
- Added new parameters -AS:pbl (maximum prmb break loops) -CO:npz (num_prmb_zones)
- new function to transfer sequencing vectors expressed as tags in EXPs to clips: searches with a tolerance from clips and start/end of read, transfers tags found there to clips.
- New! EST assembly now supported by usage of strains. Added new parameters -GE:lsd and -FI:sdi
- New routine for finding possible repeats (PRMB) and possible SNPs (PSNP)
- Added WRMB as weak repeat marker bases
- Comments are now allowed in the file of filenames (fofn)

*Changes from V1.4.1 to V1.5
- New read comparison routine (experimental): Skim. Speed factor to the Zebra routines: 10 to 50, depending on memory.
  Drawbacks: probably isn't as sensitive as ZEBRA, no possibility to subdivide the search space for the time being.
- Change in behaviour while loading SCF files: 'fatal' errors in SCFs now do not lead to a halt, but are logged (and the reads concerned excluded from the assembly).
- option -FI:fastaqualin added. New FASTA reading routines now also load quality files in FASTA format.
- fixed template handling bug: distance of reads was calculated wrongly in the contig (affected assembly only when -CO:uti was on).
- New building mechanism using automake and autoconf.

*Changes from V1.4.0rc2 to V1.4.1
- the option to load FASTA files was lost somewhere in earlier revisions, thanks to the people who pointed that out.

*Changes from V1.4.0rc1 to V1.4.0rc2
- option -GE:filecheck_only added
- logfile "log.scfread_fail" added
- Bug in EdIt removed that sometimes caused crashes on unclipped sequences
- Bug in parameter parsing removed that caused wrong parameters not to be recognised.

*Changes from V1.3.20 to V1.4.0rc1
- Reworked HTML format a bit.
- Merged tools for project conversion into convert_project.
- MIRA complained when it encountered PHRED SCF files that contained irregularities/errors. It will now 'correct' the error internally and continue.
- Read extension is now additionally performed _before_ the first assembly (if read extension is enabled)
- new command line option '-borg'. This will trigger a lot of parameters to be set into a mode where MIRA is likely to assemble everything that might look ok to assemble. This slows down the assembly _a lot_, though.
- Integrated editor had a few bugs fixed.
- A few code cleanups

*Changes from V1.3.19 to V1.3.20 (maintenance release)
- The EdIt routines for ALF data (mira_l) had not been updated and were not working right: a lot of bases that could have been corrected were not corrected.
- In rare cases, buggy SCF files caused the integrated editor to crash. Fixed by augmenting the ability to recognise buggy SCFs.
- Added optional HTML output for contigs
- Added -GE:orh, -GE:otc and -GE:oth (see man page) for controlling html output and temporary CAF|HTML output files

*Changes from V1.3.18 to V1.3.19
- Added -FILE options. File and project names can now be freely chosen
- CAF read routine had a small bug. Affected people who worked from the very first base in reads (that wasn't clipped off through quality and/or sequencing vector)

*Changes from V1.3.17 to V1.3.18
- Small changes in EdIt. Bugfixes in MIRA and EdIt. If MIRA didn't crash on you, you probably weren't affected.

*Changes from V1.3.16 to V1.3.17
- Argl, bad bug in 1.3.16 while loading files which caused mira to crash. Sorry.

*Changes from V1.3.15 to V1.3.16
- Integrated EdIt had memory leak and crashed
- Minor bug in CAF writing fixed for Solaris and Linux version. Bug did not affect quality of assembly, it's just that non-existent Clonevec names were replaced by the string "(null)".

*Changes from V1.3.14 to V1.3.15
- The EdIt (automatic editor) routines contained an error that struck in very rare cases and crashed the assembler when -GE:ace was set. Fixed.
- Memory requirements decreased again
- Small bugfixes
- SGI version now runs in true 64 bit mode

*Changes from V1.3.13 to V1.3.14
- Fixed error in handling of repeat marker bases (this could have led to a crash)

*Changes from V1.3.12 to V1.3.13
- Substantially decreased memory requirements phase. Well, requirements decreased incredibly, drastically, dramatically, ... you get the picture.
- Fixed small bug when compiling with gcc: filenames were sometimes garbled (Linux, Solaris)

*Changes from V1.3.11 to V1.3.12
- added extended checkpointing: each contig is now saved separately during the assembly in log.loop_W_cbX_iY_Z.caf where W is the numeric loop number, X is the numeric contig number in this loop, Y is the numeric iteration number for this contig in this loop, Z is either 'pre' or 'post' - indicating before the contig has been edited or after.
- Inserted or changed bases now get a quality value != 0. The quality is interpolated from neighbouring non-N and non-gap bases. Rough, but works.

*Changes from V1.3.10 to V1.3.11
- Added checkpointing capability (files: bla_out_loop.X.caf) where X stands for the loop number.
- Fixed severe bug that caused MIRA to stop. Introduced somewhere in 1.3.x

*Changes from V1.3.9 to V1.3.10
- First prototype of automatic repeat marker

*Changes from V1.3.8 to V1.3.9
- Added -AS:nol, -AS:ure and -AS:ace
- Added -CO:uti to try making use of template information (insert size)

*Changes from V1.2 through V1.3.8
- First integration of MIRA with EdIt, the automatic editor.
- Added template handling
- Added consensus tags
- Removed memory leaks
- bugfixes
- some more bugfixes
- tons of bugfixes *sigh* (note: if the program did not stop or crash in previous version, you were NOT affected, all of your assemblies were correct)

*Changes from V1.1.1 to V1.2
- Added -AL:bip, -AL:bmin and -AL:bmax options, to make banded SW configurable

*Changes from V1.0.1 to V1.1.1
- Now using banded Smith-Waterman alignment functions. Speed increase between 300% and 700% in the alignment phases. BSW functions might miss a valid alignment, but only in very very rare cases. BSW were needed as labs increasingly show up with read length between 400 and 1000 bases.
- added IUPAC uncertainty codes for EXP, SCF and CAF reading routines. These will be treated as N internally and appear as N in the resulting alignment.
- fixed bug: the signal analysis routines were never called (oooops). This bug appeared probably in 0.99b6.
- fixed bug: temporary files in the SCF load function were not removed when an error occurred

*Changes from V1.0 to V1.0.1
- added -AL:mo parameter
- version schemes now similar to the linux kernel. Even major numbers represent 'stable' versions, odd ones are 'test' versions with features that weren't tested thoroughly on real data sets.
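
Illustration for the V1.1.1 entry above (not MIRA source code): a minimal Python sketch of what a banded Smith-Waterman does in general. Only cells within a fixed distance of the main diagonal are computed, which is where the speed comes from and also why a valid alignment lying far off the diagonal can be missed. The function name, the band width and the scoring values are all assumptions chosen for the example; they are not MIRA's -AL:bip/bmin/bmax settings or its scoring scheme.

```python
def banded_smith_waterman(a, b, band=5, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a vs. b, filling only cells with |i - j| <= band.

    Hypothetical helper for illustration; scores and band width are made up.
    Cells outside the band are treated as 0, i.e. as places where a local
    alignment may start but through which no alignment can pass.
    """
    best = 0
    prev = {}  # scores of row i-1, keyed by column index j
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i - band)          # leftmost column inside the band
        hi = min(len(b), i + band)     # rightmost column inside the band
        for j in range(lo, hi + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            diag = prev.get(j - 1, 0) + step   # extend along the diagonal
            up = prev.get(j, 0) + gap          # gap in b
            left = cur.get(j - 1, 0) + gap     # gap in a
            score = max(0, diag, up, left)     # 0 = start a new local alignment
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best
```

Calling it with a band at least as wide as the longer sequence fills the whole matrix and gives the ordinary Smith-Waterman score. A narrow band visits far fewer cells, but an overlap that starts many bases into one read and at the beginning of the other falls outside the band and is not found.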
*Changes from V0.99b7 to V1.0 (not publicly released)
- Faster filter functions with increased sensitivity and specificity built in. Filtering is now done with Zebra-Blocking instead of DNASAND. This is a major improvement in terms of speed (roughly 4x) in the filter phase.
- Memory consumption in the assembly phase has been significantly reduced. It should now be perfectly possible to assemble projects with 50,000 to 100,000 reads (though _PLEASE_ contact the author before doing this, so that tips in speed enhancement can be given).
- removed SANDSIEVE parameter options
- added -AL:egp and gpl options
- added ZEBRABLOCKING options
- Used parameters are now dumped to stdout when MIRA starts
- Unknown identifiers in EXP files do not generate warnings anymore
- fixed reported bugs

*Changes from V0.99b6 to V0.99b7:
- experiment files now don't need to be available when loading CAF projects
- bug fixed in parsing command line options: -GE:lj=FOFNEXP wasn't recognised
- some debug output removed that happened to be printed when loading CAF files
- bugfix when reading experiment files: one line tags like TG WARN - 127..167 "POSSIBLY VECTOR: puc18 289 249 2686" were misinterpreted
- can now read quality values in EXP files
- added -GE:eq and -GE:eqo options to specify quality sources

*Changes from V0.99b5 to V0.99b6:
- bug fixed in CAF loading routines
- potential bug fixed in contig handling
- fixed bug in parsing command line options introduced in 0.99b3

*Changes from V0.99b4 to V0.99b5:
- changed logic for analysing danger zones (ALUS and REPT): checking should be stricter now
- fixed bug in contig: in some rare cases, a division by zero error occurred

*Changes from V0.99b3 to V0.99b4:
- fixed bug in dynamic programming algorithm: * in reads are now treated like N

*Changes from V0.99b2 to V0.99b3:
- switched on (experimental) possibility to reassemble CAF projects
- added -GENERAL parameter options
- temporary files are now removed automatically after the assembly.
  Use -GENERAL:clean_tmp_files=off if you plan to experiment with different assembly options.
- fixed a bug: the -CONTIG:rej_on_dropinrelscore given as parameter was ignored
- the -CONTIG:rej_on_dropinrelscore default is 7%, not 5 as I wrote