Version 3.6 of the FASTA programs is a significant update over version 3.5. It uses the same underlying structure as FASTA35 (specifically the strategies for ensuring accurate statistics), but it allows multiple high-scoring alignments to be shown, rather than just one. This was the main functional difference between FASTA and BLAST: BLAST could show multiple HSPs, FASTA could not.

>>Aug. 9, 2019 [src/ncbl2_mlib.c, ncbl2_head.h]
Modest extensions made to support reading makeblastdb format v5 databases. Changes have only been made to read the db.pin file, but things work in simple tests.

>>July 16, 2019 [src/comp_lib9.c]
Fixed a memory leak when searching large libraries that could be memory mapped (libraries with .xin index files). If the library did not fit in memory, the program kept allocating new memory. By default, the largest database that fits in memory must be less than 16 GB. Larger libraries will be re-read, which slows down multi-query searches considerably. To increase the size of the library allowed in memory, use the option "-X M32G" to fit 32 GB libraries.

>>Mar. 8, 2019 [src/initfa.c,faatran.c,dropfx2.c]
Modify translation table 1 to allow selenocysteine translation (TGA->'U'), and modify scoring matrices to give positive scores to '*':'U'. The translation modification ONLY works with "-t 1". In addition, BLAST BTOP alignments (-m 8CB) convert a 'U' aligned with a '*' to a '*', so the end of the alignment is '**' rather than 'U*' (fastx36) or '*U' (tfastx36).
dropfx2.c (fastx36/tfastx36) and dropfz3.c (fasty36/tfasty36) did not properly switch protein and translated DNA codes with -m 8CB -- fixed.
Version date updated to Mar, 2019.

>>Feb. 26, 2019 [scripts/get_genome_seq.py]
Added get_genome_seq.py as a replacement for get_hg38_bed.py; removed get_hg38_bed.py. 'get_genome_seq.py --genome mm10' also produces sequences from mouse mm10 (and can now do any genome that bedtools can read).

>>Feb.
23, 2019 [src/comp_lib9.c, mshowbest.c]
Modify repeat_thresh so that alignments with poor scores (E() > ppst->e_cut_r, typically -E-threshold/10.0) are not searched for additional alignments.

>>Feb. 21, 2019 [src/nmgetaa.c, scaleswn.c, scripts/get_protein.py, get_hg38_bed.py]
Modify nmgetaa.c to ignore ':'s (for sequence subsets) in scripts. The script can do the subsetting.
Modify scripts/get_protein.py to provide subsetting.
Add scripts/get_hg38_bed.py to extract fasta sequences using the format "chr2:123456-543210".
Modify scaleswn.c to estimate Altschul-Gish parameters when gap and extension penalties do not match exactly.

>>Feb. 6, 2019 [src/compacc2e.c, nmgetaa.c]
Modify build_link_data() to allow '+' for space in scripts. Ensure that lib_type is properly initialized (open_lib.c()).

>>Jan. 23, 2019 [nmgetaa.c]
Fix bug introduced when checking for lib_type.

>>Jan. 15, 2019 [src/upam.h, altlib.h, nmgetaa.c] [scripts/rename_exons.py, map_exons_coords.py, get_uniprot.py, get_refseq.py, get_proteins.py]
Bug fixes:
The VT10, VT20, etc. scoring matrices did not have scores for '*:*' alignments, used with FASTX/TFASTX for extending alignments through the termination codon. As a result, searches with '-t t' did not extend through the termination codon, even though they should have. This has been fixed.
Enhancements:
FASTA can now download both query and library sequences using a script, by specifying file type 9. Thus:
  fasta36 "../scripts/get_uniprot.py+P09488 9" /seqlib/swissprot.fasta
will run the script "get_uniprot.py" with the argument "P09488" and use the output of the script as the query sequence. In this example, the library type (9) is specified by the " 9" (this space cannot be replaced with a '+' character). Alternatively, library type '9' can be specified by putting a '!' before the script file name:
  fasta36 \!../scripts/get_uniprot.py+P09488 /seqlib/swissprot.fasta
Scripts can be used to produce query or library sequences, or both.
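The script-argument conventions described above ('+' standing in for a space, a trailing " 9" or a leading '!' selecting library type 9) can be illustrated with a short Python sketch. This is only an illustration of the documented behavior, not the C code in build_link_data(); the function name parse_script_spec and its default_lib_type parameter are hypothetical.

```python
def parse_script_spec(spec, default_lib_type=0):
    """Split a FASTA-style script argument into (argv, library type).

    Hypothetical helper: it mirrors the conventions described in the
    notes, not the actual implementation in the FASTA sources.
    """
    lib_type = default_lib_type
    spec = spec.strip()
    # A leading '!' marks the file name as a script (library type 9).
    if spec.startswith("!"):
        lib_type = 9
        spec = spec[1:]
    # A space separates the script string from an explicit library type.
    if " " in spec:
        spec, type_str = spec.rsplit(" ", 1)
        lib_type = int(type_str)
    # '+' stands in for a space between the script and its arguments.
    argv = spec.split("+")
    return argv, lib_type


if __name__ == "__main__":
    print(parse_script_spec("../scripts/get_uniprot.py+P09488 9"))
    print(parse_script_spec("!../scripts/get_uniprot.py+P09488"))
```

Both calls yield the script path, the accession argument, and library type 9, which matches the two equivalent command lines shown above.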
Three scripts that download sequences from the NCBI and Uniprot have been added in the "scripts" directory: "get_uniprot.py" takes Uniprot accessions as arguments, "get_refseq.py" takes RefSeq accessions (protein or mRNA), and "get_protein.py" gets both Uniprot and RefSeq protein sequences.
rename_exons.py and map_exons_coords.py can take annotated BTOP alignments with genome coordinates and map exons to the alternative genome.

>>Jan. 2, 2019 [src/mshowbest.c]
Fix problems with site annotation when dom_info is provided with -m8CBL.
[scripts/ann_exons_up_sql.pl, ann_exons_up_www.pl]
Make scripts more robust to missing chromosome information and reverse-strand coordinates.

>>Dec. 11, 2018 [scripts/ann_exons_up_www.pl, ann_exons_up_sql.pl]
Add the option "--gen_coord" to report exon start ('<') and end ('>') genome coordinates as features of exons.

>>Nov. 14, 2018 [scripts/rename_exons.py, relabel_domains.py, compacc2e.c]
Two new scripts, rename_exons.py and relabel_domains.py, take a blast tabular output file with domain alignment annotations (and possibly raw domain information) and modify the names (rename_exons.py) or colors (relabel_domains.py).
rename_exons.py takes the exon numbering associated with the query sequence and maps it onto the subject alignments.
relabel_domains.py can be used to assign different color numbers to homologous and non-homologous domains.
Both of these programs modify blast tabular output files, which can then be merged back into an alignment display using merge_blastp_annot.pl or merge_fasta_annot.pl.
compacc2e.c:build_link_data() has been modified to convert '+' in the script string to ' ', to allow passing command line options. A space in the script string is used to separate the script from the library type of the file returned by the script.

>>Nov.
6-7, 2018 [doinit.c, mshowbest.c, mshowalign2.c, defs.h, structs.h]
(a) Add options to provide query and subject sequence lengths and raw domain coordinates in BLASTP tabular output with the options -m 8CBl and -m 8CBL. If domain annotations are available, -m 8CBL also provides the raw domain coordinates (not just those included in the alignment) in the form
  |DX:1-100;C=PF12345|XD:1-100;C=PF12345
where |DX indicates a query annotation and |XD indicates a subject annotation. -m 8CBl (lower-case L) shows the sequence lengths, but not the raw domain info.
(b) Parse the annotation program strings so that '+' are converted to ' '. This greatly simplifies passing arguments to the annotation scripts. Thus:
  -V \!ann_pfam_sql.pl --db=pfam31 --neg --vdoms
can be written as:
  -V \!ann_pfam_sql.pl+--db=pfam31+--neg+--vdoms
(likewise for -V q\!ann_pfam...)
(c) Provide an option to remove region/feature annotations from non-m8 (blast-tabular) output. This simplifies the process of using scripts/merge_fasta_btab.pl to use .bl_tab (-m 8CBL) files to inject sub-alignment scores and domain information.

>>Nov. 1, 2018 [doinit.c]
Allow -m F#=file.name in addition to -m "F# file.name" to address problems I had with spaces in shell scripts.

>>Oct. 23, 2018 [re-released as fasta-36.3.8g] (see README_v36.3.8g.md) [make/Makefiles*,psisearch2/m89_btop_msa2.pl]
Add options to psisearch2/m89_btop_msa2.pl to provide a clustalw header (--clustal), require a minimum coverage of the query sequence (--min_align 0.8), and edit sequence identifiers to remove database and accession (--trunc_acc).
Remove -lz dependency from non-debug Makefiles.

>>Aug. 5, 2018 [re-released as fasta-36.3.8g] [lib_sel.c]
Make lib_select.c more robust to missing indirect name files.
[scripts/ann*.pl]
Update various annotation scripts to use https:// instead of http://

>>April 3, 2018 [initfa.c, comp_lib.c, dropfx2.c]
Changes to (a) ensure that the "-t t" option correctly inserts and aligns a termination codon '*'.
(b) Changes to -m 8CB, -m8CC, and -m9C so that aligned termination codons are indicated as "**" (-m8CB) or "*1" (-m8CC, -m9C).

>>Mar. 9, 2018 [scripts/annot_blast_btop2.pl, merge_blast_btab.pl, blastp_annot_cmd.sh]
Code is now in place to provide sub-alignment scoring using domain annotations with blastp searches (BLOSUM62 only). blastp_annot_cmd.sh runs blast and produces both a standard HTML and a tabular output file. It then runs annot_blast_btop2.pl to add sub-alignment scoring to the tabular output file, and then merge_blast_btab.pl merges the domain-annotated blast tabular file with the HTML output file. When combined in this way, the FASTA web server (fasta.bioch.virginia.edu) can produce blastp searches with domain highlights/scoring.

>>Feb. 6, 2018 [initfa.c, doinit.c, mshowbest.c, mshowalign2.c]
Add a new extended option, -XB, which causes percent identity, percent similarity, and alignment length to be presented using the BLAST model, which does not count gaps in the alignment length.

>>Dec. 30, 2017 [released as fasta-36.3.8g] [scaleswn.c]
Replace np_to_z() with np1_to_z(), which does not subtract low probability from 1.0, thus allowing accurate z-values for very low probabilities.

>>Sept. 26, 2017 [comp_lib9.c, compacc2e.c]
Previously, if the query sequence was all lower-case letters (seg-ed) and the '-S' option was specified, the search would effectively be done with a zero-length sequence, which broke the statistics. The code has been modified to convert all lower-case queries to upper-case when -S is used.
[scaleswn.c]
Fixed problem with scaleswn.c/ag_stats() not setting parameters properly when the matrix was unknown.

>>May 23, 2017 [released as fasta-36.3.8f] [url_subs.c]
A small but important change in the output available to the $SRCH_URL and $SRCH_URL2 strings, which are used to enable re-searching, and now pairwise alignment. (It would be better to provide a json string of the information, rather than using fprintf().)
An additional value, the name of the query sequence, is provided to these urls so that pairwise alignment becomes possible.

>>May 23, 2017 [scripts/ann_feats2ipr.pl,ann_feats_up_www2.pl,test_ann_scripts.sh src/defs.h]
Changes to ensure that EBI format databases, which place the ID before the accession, e.g. SP:GSTM1_HUMAN P09488, can be processed properly by annotation scripts. This involved displaying more of the description line, so that the accession field is included, in the annot_XXXXX file.

>>May 8, 2017 [compacc2e.c]
Address problem where the initial domain annotation similarity score/identity was not properly reset.
[scripts/annot_blast_btop2.pl]
Fix various problems with domain scores, particularly in gaps, and domain coordinates.
Modify version string to May, 2017.

>>April 18, 2017 [cal_cons2.c]
Address problem where the identity count was not correctly assigned to the N-terminal domain at the end of a domain.

>>April 14, 2017 [src/compacc2e.c, scripts/ann_exons_up_www.pl]
Provide a new script to annotate exon positions in Uniprot proteins (scripts/ann_exons_up_www.pl) that uses the EBI proteins/api/coordinate service. Provide additional error checking on annotations to ensure that domain start is always <= domain end.

>>Jan 17, 2017 [scripts/ann_pfam30_tmptbl.pl]
ann_pfam30_tmptbl.pl is a modification of ann_pfam30.pl that loads a temporary table of accessions to be annotated, rather than asking for one sequence at a time.

>>Dec 14, 2016 [initfa.c/scaleswn.c]
Change required shuffle count (down to 100) and introduce a median/IQR strategy to robustly estimate mean and S.D. for ggsearch (normal) comparisons (-z 3, in place of Altschul-Gish statistics). Modify version string to Dec., 2016.

>>Nov 18, 2016 [build_ares.c]
Fix sequence encoding memory leak.

>>Sept 30, 2016 [released as fasta-36.3.8e] [psisearch2/]
Added a new sub-directory, psisearch2/, which includes scripts and documentation for the new iterative psisearch2_msa.pl and psisearch2_msa.py programs.
These programs perform iterative PSIBLAST (or SSEARCH) searches, but with an option (--query_seed) that dramatically reduces false-positives.
Modified most of the scripts/ann_*.pl files to work with the new NCBI Swissprot accession format.
Modified scripts/ann_feats_up_www2.pl and scripts/ann_upfeats_pfam_e.pl to work with JSON format Uniprot descriptions.

>>July 28, 2016 [src/pssm_asn_subs.c]
Fix another problem with binary ASN.1 file processing where the asnp->abp buffer was not refilled in time.

>>July 12, 2016 [src/mshowbest.c]
Modified -m8/-m 8CB output to include "eval2" when a second E()-value is available (when -z > 20). "eval2" is shown after the bit score, but before BTOP and annotations.

>>May 25, 2016 [scripts/ann_pfam28.pl]
Implement --split_over command option, which takes overlapping domains and produces virtual like domains from the overlap region.

>>Apr. 12, 2016 [released as fasta-36.3.8d] [src/pssm_asn_subs.c]
Fix another problem with binary ASN.1 file processing where the asnp->abp buffer was not refilled in time.
[initfa.c] - version date updated to Apr, 2016
[upam.h] - changes to default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from -14/-2 to -11/-1), and VT120 (from -10/-1 to 11/-1).

>>Mar. 30, 2016 [scripts/m9B_btop_msa.pl]
Provide --bound_file_only, --bound_file_in, --bound_file_out. Ensure that alignments outside boundaries are NOT included in the MSA.

>>Mar. 22, 2016 [scripts/m8_btop_msa.pl, m9B_btop_msa.pl]
Ensure that the full length query sequence is included in the MSA.
[pssm_asn_subs.c]
Fixes to allow IUPACAA sequences in ASN.1 PSSM. Other fixes to ensure that arrays not allocated are not freed when wfreqs2d[] is not available.

>>Mar. 18, 2016 [scripts/m8_btop_msa.pl, m9B_btop_msa.pl]
scripts/m8_btop_msa.pl takes a fasta36 -m 8CB output file and produces a multiple sequence alignment that can be used with psi-blast.
scripts/m9B_btop_msa.pl takes a fasta36 -m 9B output file and produces a multiple sequence alignment that can be used with psi-blast.

>>Feb. 15, 2016 [mshowbest.c, compacc2e.c, cal_cons2.c, dropfx2.c, dropfz3.c]
Modify logic for calculating percent identity in sub-alignments to use the BLASTP strategy, which does not count gapped regions as part of the alignment length.
Fix the -m 8 display (BLAST tabular output) to use ungapped alignment length for percent identity (as -m BB does).
[initfa.c] - version date updated to Feb, 2016

>>Feb. 12, 2016 [compacc2e.c, cal_cons2.c, dropfx2.c, dropfz3.c]
Modify display_push_features() to use both the rst.score[score_ix], which is used to calculate the zscore and bitscore, and also sw_score, which is the correct divisor for sub-alignment scores. Previously, only the rst.score[score_ix] was used, which caused some bit scores to be out of range, and produced erroneous Q-value scores for sub-alignments.

>>Jan. 24, 2016 [cal_cons2.c]
Ensure left_domain_link[01] is set to NULL before it is initialized.
Rename ann_feats2l.pl to ann_feats_up_sql.pl for consistency with ann_feats_up_www2.pl. ann_feats_up_www2.pl no longer works because of changes at the EBI.

>>Dec. 15, 2015 [re-released as fasta-36.3.8c] [pssm_asn_subs.c]
Fixed another problem parsing ASN.1 because of reading past the end of the buffer.
[cal_cons2.c]
Fix a serious bug that prevented display of annotated sites using -m9c/-m8CC.

>>Nov. 24, 2015 [re-released as fasta-36.3.8c] [mshowalign2.c]
Correct first_line logic to display >>seqid description on the first alignment line, but >- on remaining lines.

>>Nov. 23, 2015 [released as fasta-36.3.8c] [cal_cons2.c, mshowalign2.c, scripts/annot_blast_btop.pl, scripts/ann*_e.pl]
Fix the problem that lalign36 no longer displayed the library/subject accession/description.
Correct some problems introduced with BTOP alignment encoding.
A new script, scripts/annot_blast_btop.pl, is available to provide -V type sub-alignment scoring to BLASTP BTOP alignments stored in tabular files. In addition, the scripts/ann*.pl scripts were modified to work as part of a unix pipe, and the ann*_e.pl scripts replace the older non "_e.pl" scripts, and were renamed without the "_e" (thus, ann_pfam_www.pl was removed, and ann_pfam_www_e.pl was renamed ann_pfam_www.pl).

>>Nov. 6, 2015 [cal_cons2.c, initfa.c, mshowbest.c, dropfx2.c, dropfz3.c]
Implement BLAST+ BTOP alignment format, available with -m 8CB or -m 9B. Convert previously static calc_code alignment strings to dynamic strings.

>>Oct. 13, 2015 [released as fasta-36.3.8b] [initfa.c, pssm_asn_subs.c]
Fix problems encountered when reading in a binary ASN.1 file produced by datatool. Previous versions did not use the final score data provided by the tool; this version now uses that information if it is available. If it is not available, the PSSM integer values are calculated from the frequency data.

>>Oct. 8, 2015 [pssm_asn_subs.c]
Fix a rare condition where the pssm_asn parser reads past the asn buffer.

>>Sep. 28, 2015 [comp_lib9.c, scaleswn.c, dropnfa.c, dropfx2.c dropfz3.c]
(1) [scaleswn.c] -- changes to drop back to Altschul-Gish statistics when other strategies fail.
(2) Fix to ensure that adler32() is calculated correctly for 1-residue library sequences; definition of adler32() added to drop*.c files.

>>Sep. 7, 2015 [Makefile.nmk_icl, Makefile.nm_pcomp, doinit.c, readme.win32]
Automatic detection of thread/core number on Windows. Changes to readme.w32 documentation; Windows programs no longer require sse2 in the name (since all modern x86 processors have it).

>>Sep. 4, 2015 [comp_lib9.c, cal_cons2.c, dropfx2.c, dropfz3.c]
(1) Fix bug with overlapping domains when a domain ends exactly where the alignment starts.
(2) Provide command line in -m 8CC output with -DPGM_DOC.

>>Aug.
31, 2015 [git v36.3.8_30Jul15] [cal_cons2.c, dropfx2.c, dropfz3.c, mshowbest.c, build_ares.c, doinit.c, comp_lib9.c]
Modifications to enhance the independence of annotation output to different files. Earlier, annotations could not be properly output to different files in different formats. For example, -m 9c prevented -m "F8CC output.m8CC" -m "F9I output.m9I". Annotation output formats are now more independent. They are not fully independent, however. Thus, if CIGAR format is used for one output, it will be used in all other alignment encoding outputs.

>>Aug. 21, 2015 [cal_cons.c, dropfx2.c, dropfz3.c, mshowbest.c, build_ares.c, doinit.c]
Add -m 9I to -m 9i. -m 9i reports identity and variation (based on annotation scripts). -m 9I also reports domain content on the initial summary line.

>>Aug. 20, 2015 [fasta-36.3.8a] [mshowalign2.c]
Fixed bug in lalign36 E()-value and bit score calculations for the highest scoring non-identical alignment by reverting to older code. This bug was introduced in fasta-36.3.6d in January, 2014.

>>Jul. 21, 2015 [fasta-36.3.8] [compacc2e.c, cal_cons2.c, dropfx2.c dropfz3.c, param.h]
Fixed a major bug in the annotation code that had been added to accommodate overlapping domains. The original implementation was not thread-safe, because the array of annotations was modified during the scoring, but was also shared by threads. The new version keeps independent scoring arrays.

>>Jun. 23, 2015 [released as fasta-36.3.7b] [dropnnw2.c]
Fix problem where glsearch reset (ignored) the -M sequence limit.

>>Jun. 18, 2015 [dropfx.c, dropgsw.c, dropfx.c, dropfx2.c, dropfz3.c]
Fix problem in do_walign.c with comparison to score_thresh during recursive alignment.

>>May 21, 2015 [compacc2e.c]
Add additional checks to ensure that annotations are within the sequence boundaries.

>>Jan. 26, 2015 [ re-released as fasta-36.3.7a] [compacc2e.c]
Fix problem with domain boundary calculations for subsets of sequences.

>>Jan.
21, 2015 [ released as fasta-36.3.7a] [calc_cons2.c, dropfx2.c, dropfy3.c]
Fix problems with -m 9c / -m 9C alignment encodings in version 36.3.7. Apparently, the Nov. 25, 2014 fix was not committed properly. In addition, make certain that the query sequence is ALWAYS the reference sequence, particularly in translated alignments. As a result, the insertion/deletion codes are now reversed for fast[xy]36 and tfast[xy]36.

>>Jan. 6, 2014 [data/VTML_*.mat]
Provided scoring matrix files for the VTML_10,20,40,80,120,160,200 matrices available internally.

>>Nov. 25, 2014 [ released as fasta-36.3.7] [cal_cons.c, dropfx2.c, dropfz3.c]
Fix problem that prevented -m 9c and -m 8CC unless annotations were present.
Added approved copyright notice and Apache 2.0 license to appropriate files.

>>Nov. 19, 2014 [mshowbest.c]
Add alignment (CIGAR) string and annotation string to BLAST tabular (-m 8) alignments with -m 8C[cCdD]. To get alignment and annotation encoding without BLAST comments, use -m 8X[cCdD].

>>Nov. 10, 2014 [cal_cons2.c, dropfx2.c, dropfz3.c]
Ensure that site annotations are shown when annotations are embedded in a sequence, not provided by a script.

>>Oct. 27, 2014 [cal_cons2.c]
Fix a bug in the annotation alignment that put annotation symbols off by one (or more) in the coordinate lines. Add annotations that align in gaps.

>>Oct. 6, 2014 [most source files]
The copyright notice for fasta-36.3.7 has been updated to include an open software license, Apache2.0, for redistribution.

>>Sept. 28, 2014 [url_subs.c]
Substitute annot_p->s_annot_arr_p[] for annot_p->domain_arr_p[i] in display_domains(), encode_json_str(). Remove domain_arr_p from struct annot_entry. With domain_arr_p gone, n_domains is less useful, but it is still available, and used for checking for domain graphics. encode_json_domains() also now uses annot_p->n_annots, and skips over non-domains.

>>Sept.
19, 2014 [dropfx2.c, dropfz3.c]
Fixes to produce correct coordinates with forward and reverse complement [t]fast[x,y].

>>Sept. 17, 2014 [new version, fasta-36.3.7] [compacc2e.c, cal_cons2.c, dropfx2.c, dropfz3.c]
The annotation domain scoring/plotting strategy has been extended to allow overlapping domains. To accommodate overlapping domain annotations, the annotation file format (e.g. gstm1_human.annot) has been extended to accept the form:
  >sp|P09388|GSTM1_HUMAN
  1 - 88 Glutathione_S-Trfase_N :1
  7 V F Mutagen: Reduces catalytic activity 100-fold.
  90 - 208 Glutathione-S-Trfase_C-like :2
  108 V Q Mutagen: Reduces catalytic activity by half.
where a "-" in the second field indicates that the first and third fields specify the beginning and end of the domain.
In previous versions, a '[' specified the beginning of a domain, and a ']' on a later line specified the end of the domain. '[' and ']' on separate lines required that domains not overlap (so that the '[' and ']' could be paired). fasta-36.3.7 will still read this format, but the "start - stop" format is both simpler and more flexible.
Four new annotation scripts are available that use the new domain notation: ann_feats2ipr_e.pl, ann_feats_up_www2_e.pl, ann_pfam_e.pl, and ann_pfam_www_e.pl. All four scripts will report overlapping domains.
Overlapping domains also allow domain annotations from different sources to be combined (e.g. InterPro Pfam, Panther, and Superfamily domain annotations), as well as domain annotations of different types, e.g. Uniprot domain and secondary structure annotations.

>>Aug. 28, 2014 [re-released as fasta-36.3.6f] [ncbl2_mlib.c]
The code used to parse blastfmtdb sequence description lines has not kept up with NCBI's use of ASN.1 in sequence descriptions. This code has been updated, and now works properly with the protein and DNA sequence databases.
[comp_lib9.c]
Fixed a seg-fault that occurred when an open-file error occurred.

>>Aug.
22, 2014 [released as fasta-36.3.6f] [mshowbest.c]
Change alignment summary display for lalign to not show the identical alignment score unless the '-J' option is used. Add "The best non-identical alignments" when no "-J".
[ann_pfam_www.pl] Fix bugs.
[ncbl2_mlib.c] Modified to read NCBI ambiguity codes in blastdbfmt/formatdb nucleotide databases. Not extensively tested.

>>Aug. 20, 2014 [compacc2.c, cal_cons.c, dropfx.c, dropfz2.c]
Modify sub-alignment score report to calculate the bit score by scaling the total alignment bit score by the ratio of the sub-alignment raw score to the total alignment raw score. This produces a bit score that is much more sensible than the previous strategy, which calculated a z-score from the sub-alignment.

>>Aug. 18, 2014 [compacc2.c, cal_cons.c]
Undo removal of '[]' from aa0a/aa1a (they are required to visualize domain boundaries in alignment). cal_cons.c now uses PSSMs when they are available.

>>Aug. 8, 2014 [comp_lib9.c, compacc2.c]
Move the call to get query annotations via scripts out of compacc2.c and into comp_lib9.c.

>>July 29, 2014 [comp_lib9.c, mshowbest.c, mshowalign2.c]
Enable high scoring alignment display (like high scoring sequences) with lalign36, when the -m 9 (-m 9c/d/C/D) option is provided, or with -m 8. This allows lalign36 to provide a compact, tabular list of non-overlapping local alignments.

>>June 30, 2014 [pssm_asn_subs.c]
Update the code for parsing ASN.1 binary PSSM files produced by psiblast+. The new code reads more of the optional fields in pssm_intermediate_data(). The fields are not used, but broke the earlier parser.

>>June 11, 2014 [cal_cons.c, initfa.c, dropfx.c, dropfz2.c]
Extend the match/mismatch encoding provided by -m 9c and -m 9C with -m 9d and -m 9D. The -m 9d/D options provide mismatch locations as well as insertion/deletion locations. For -m 9d, the list of codes has expanded from '=\/*' to '=\/*x'; for -m 9D, 'MDIMX'. Current implementation works for all programs except [t]fast[fms].
Updated version strings to June, 2014.
>>May 28, 2014 [mshowalign2.c, mshowbest.c, initfa.c, structs.h]
Add the command line option -XI, which changes the calculation of percent identity to ensure that a single mismatch in a long sequence with > 99.9% identity is displayed as 99.9% (0.999) identity, rather than 100.0% identity. Without this option, a single mismatch in 10,000 residues displays 100% identity; with the option, 99.9% identity is displayed (even though the identity is 99.99%).
[cal_consf.c]
Fix the false error message "code begins with 0" in cal_consf.c.

>>Feb. 12, 2014 [compacc2.c]
When providing "sequence length" to annotation scripts, add offsets. Also modify scripts to allow sequence lengths to increase.

>>Jan. 28, 2014 (re-released as fasta-36.3.6d/Jan 2014) [dropfs2.c, calconsf.c, tatstats.c]
The coordinate fix for fasts36/fastm36 (Dec 18, 2013) broke some fasts/fastm alignments. The alignment code has been reverted to the "classic" code that has been used for more than 10 years. However, that code always marked the first aligned residue as 1, even when the first part of the query did not align. The initial coordinate offset has been fixed; the coordinate is now the position in the first aligned fragment. This may be confusing, because with fasts, the first aligned fragment may not be the first fragment in the query list. The coordinate always gives the offset from the beginning of the first fragment in the alignment, not the first fragment in the list.
This fix required changes to the definition of calc_astruct(), which required changes to build_ares.c, mshowalign.c, calc_cons.c, dropfx.c, and dropfz2.c.

>>Jan. 24, 2014 [mshowalign2.c]
Add checks to the assumption that '>gi|12345' is an NCBI library entry.
[nmgetlib.c]
Fix for nmgetlib.c with -DMYSQL_DB.
Some cleanup of old Makefiles.

>>Jan. 1, 2014 [url_subs.c]
Fix off by one in domain coordinates in display_domains().

>>Dec.
18, 2013 [dropfs2.c, cal_consf.c]
Fix problem with alignment display when query sequence is much longer than library sequence.

>>Dec. 11, 2013 [compacc2.c]
Modified save_best2() to correctly exclude sequences outside -M n1_low-n1_high limits.

>>Nov. 8, 2013 (re-released as fasta-36.3.6d) [ncbl2_mlib.c]
Fix problem with src_long8_read() where int/uint64_t seems to cause problems with Linux intel icc. Using int/unsigned int solves the problem.

>>Nov. 1, 2013 [apam.c, ncbl2_mlib.c, map_db.c]
[apam.c] Fix problem with query sequences and libraries that do not end in newline ('\n').
[ncbl2_mlib.c, map_db.c] Provide grouping for shifts for byte extraction in src_int4/long8_read() to remove compiler warnings.
[map_db.c] Fix problem reading sequences for indexing that caused a crash.

>>Oct. 8, 2013 (released as fasta-36.3.6d) [comp_lib9.c, initfa.c]
Modify initfa.c/re_ascii() function to avoid qascii[] characters that had been remapped for annotations.

>>Oct. 4, 2013 [nmgetlib.c, ncbl2_mlib.c]
Modify nmgetlib.c/re_openlib() to re-use memory mapped file arrays. This had been the intention for some time, but a check for libf != 0 prevented the memory mapped arrays from being reused. libf is no longer checked, just mm_flag.

>>Sep. 26, 2013 [ncbl2_mlib.c]
Fix a bug in ncbl2_mlib.c/parse_fastadl_asn() that prevented accessions longer than 20 characters in description lines from BLAST formatted libraries.
[compacc2.c]
Fix a bug in compacc2.c/comment_var() that showed the wrong original sequence in qVariant changes.

>>Sep. 2, 2013 [dropfs2.c]
Fix bug in dropfs2.c/init_work() that prevents correct tatusov statistics with -z >10.

>>Aug. 21, 2013 (released as fasta-36.3.6c) [comp_lib9.c]
Fix bug in comp_lib9.c/new_seqr_chain() that prevented memory from being allocated to the chain if a memory mapped database was followed by a non-memory mapped database.

>>Aug. 9, 2013 [scaleswn.c]
Ensure shift to MLE_STATS if too many scores are excluded by trimming.
>>July 31, 2013 (released as fasta-36.3.6b) [url_subs.c]
Make JSON output for -m 6 (html) dependent on $ENV{JSON_HTML}. JSON output is not currently used.

>>July 26, 2013 [mshowalign2.c, scripts/lavplt_svg.pl]
Correct offsets in -m 11 lav plots, and modify lav2plt.pl/ lavplt_svg.pl/ lavplt_ps.pl to reflect the corrections. Move all perl scripts out of /src into /scripts.

>>July 19, 2013 (released as fasta-36.3.6a) [compacc2.c, cal_cons.c, dropfx.c, dropfz2.c, build_ares.c]
Provide dynamic string allocation/dyn_strcat for annotation string output. This fixes problems with long proteins with many domains or other annotations, which were too long for the fixed annotation output storage.
Version date updated to July, 2013. Compiled and tested on Windows32.

>>July 8, 2013 [cal_cons.c, dropfx.c, dropfz2.c]
Properly terminate annotations with offsets [cal_cons.c], and with domains beyond alignment [dropfx.c, dropfz2.c].

>>July 5, 2013 (released as fasta-36.3.6) [comp_lib9.c, doinit.c, dropfx.c, dropfz2.c]
Fix conflict between -m 9 and -z -1; fix annotation display using non-script annotations. Stop using calc_last_set in dropfx/fz2.c.

>>June 24, 2013 [scripts/ann_feats_up_www2.pl]
Add script (ann_feats_up_www2.pl) for annotating UniProt sequences using: "http://www.ebi.ac.uk/Tools/dbfetch/dbfetch/uniprotkb".

>>June 6, 2013 [compacc2.c, cal_cons.c, initfa.c, dropfx.c, dropfz2.c]
Provide the -XNS/-XXS/-XN+/-XX+ and -XND/-XXD/-XN-/-XX- options that specify how N:N and X:X alignments are counted for similarity and identity. By default, N:N (DNA) and X:X (protein) alignments are considered identical, but not similar (because their scores are typically negative to address statistical issues). -XNS/-XXS/-XN+/-XX+ cause N:N/X:X alignments to be counted as similar, even though their alignment scores are negative. Likewise, -XND/-XXD/-XN-/-XX- cause N:N and X:X alignments to be considered non-identical (and non-similar).
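The three ways of counting N:N and X:X positions described in the June 6, 2013 entry can be sketched in Python. This is a minimal illustration, not the FASTA implementation: the function name, the nx_mode values, the toy scoring function, and the rule that a position is "similar" when its score is positive are all assumptions made for the example.

```python
def count_identity_similarity(pairs, score, nx_mode="default", wildcard="X"):
    """Count identical and similar positions in aligned residue pairs.

    nx_mode is a hypothetical stand-in for the option groups:
      "default"  wildcard:wildcard is identical but not similar
      "similar"  (-XXS/-XX+ or -XNS/-XN+) it is also counted as similar
      "differ"   (-XXD/-XX- or -XND/-XN-) it is neither identical nor similar
    """
    n_ident = n_sim = 0
    for a, b in pairs:
        if a == wildcard and b == wildcard:
            if nx_mode != "differ":
                n_ident += 1
            if nx_mode == "similar":
                n_sim += 1
            continue
        if a == b:
            n_ident += 1
        # Assumption: a position is "similar" when its score is positive.
        if score(a, b) > 0:
            n_sim += 1
    return n_ident, n_sim


def toy_score(a, b):
    # Illustrative scores only: X:X is negative, as the notes describe.
    if a == "X" and b == "X":
        return -1
    return 5 if a == b else -4


if __name__ == "__main__":
    pairs = list(zip("ACXXD", "ACXXE"))
    for mode in ("default", "similar", "differ"):
        print(mode, count_identity_similarity(pairs, toy_score, mode))
```

For DNA comparisons the same sketch applies with wildcard="N".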
>>May 28, 2013 [url_subs.c]
do_url1() has been modified to:
(1) require env($REF_URL, $SRCH_URL, $SRCH_URL1) for these links to produce printout.
(2) Link text is surrounded by .
(3) do_url1() now produces output automatically, which can be used to get all the information provided by earlier URL links.

>>May 29, 2013 [mshowalign2.c]
Re-instate code in showalign() to ensure that the original bbp->rst is used for the first alignment, rather than that calculated by CHECK_SCORE (which is used for later sub-HSP's). The CHECK_SCORE -S alignment score is based on the non-S alignment, and is then re-scored with the low-complexity -S matrix. But the best alignment excluding low-complexity can have a higher score than the best all-complexity alignment rescored with -S.

>>May 27, 2013 [mshowalign2.c, url_subs.c]
The plot_domain.cgi SVG code has been expanded to allow the domain structure of the entire query and library sequence, not just the aligned regions, to be displayed. Showing domains above the query or below the library takes an additional 18 px in each direction (36 total); this size needs to be provided in the format string that is provided in $DOMAIN_PLOT_URL.
Right now, the argument to $DOMAIN_PLOT_URL can get very long with lots of aligned domain (region), and query and library domain information. It would be better to provide this in some separate way. YAML might also be a more efficient strategy.

>>May 9, 2013 [dropfx.c, dropfz2.c, compacc2.c, url_subs.c]
The web infrastructure for domain plots has been completed -- plot_domain2.cgi, which generates SVG for domain plots, now understands reverse-complement cDNA fastx/y alignments, and plots coordinates accordingly.
Testing with fastx36/fasty36 revealed some memory errors, which have been fixed. In addition, dropfz2.c has been updated to properly treat some region/alignment-boundary conditions; dropfx.c and dropfz2.c provide equivalent sub-alignment scores.
[../scripts, ../misc] A new directory, ./scripts, has been created to collect the scripts used for sequence library expansion and domain/feature annotation. ../scripts/README.scripts provides more information. Modify code to allow expansion scripts (-e) to start with '\!', like annotation scripts.

>>Apr. 15, 2013 (compacc2.c, cal_cons.c, dropfx.c, dropfz2.c, mshowalign2.c) Modifications to properly deal with sequence and coordinate offsets in annotation alignments. compacc2.c/get_annot_list() has been modified to only print/read an annotation once (the same sequence may appear twice with fastx/fasty). mshowalign2.c now includes and in HTML mode. These comments are not on their own line, to save output space, so the remainder of the line should be captured.

>>Apr. 5, 2013 (doinit.c) Add the ability to specify HTML output using the -m '0H' option. This addresses the problem that -m "F6" does not fully specify the output format. In addition, -m 6 should probably explicitly set -m 0 (if it has not been set), rather than simply 'or'ing it, but right now we do not know when it is set.

>>Mar. 17, 2013 (compacc2.c, url_subs.c, plot_domain.cgi, ann_feats2l.pl) Modifications to url_subs.c to support SVG domain maps in HTML output. A new environment variable has been defined, DOMAIN_PLOT_URL, which can be used to plot (using SVG or PNG) a map of the domains on the library sequence. The argument to DOMAIN_PLOT_URL is the concatenated list of annotations provided by the -V options. All annotations (including sites) are passed; non-alpha-numeric characters are URL encoded.
plot_domain.cgi is an example of a script that can be passed as DOMAIN_PLOT_URL. To use this script: $ENV{DOMAIN_PLOT_URL}="\n";
ann_feats2l.pl has been extended to allow the --neg (or --neg-dom) option, which puts a NODOM domain annotation between the domain annotations provided by the database.

>>Mar.
7, 2013 (cal_cons.c) Modify update code to properly begin global alignments that start with insertions or deletions.

>>Feb. 20, 2013 (compacc2.c) Annotation scripts (-V \!ann_feats.pl) were being inactivated if no annotations were returned, fixed.

>>Feb. 2, 2013 (comp_lib9.c) Prevent premature termination of query title in -m 9 mode (guarantees the full >accession text to first space is preserved).
(compacc2.c) Provide domain information (;C=PF00016) in -m9 domain scoring.

>>Jan 7-9, 2013 (initfa.c, pssm_asn_subs.c) Modify pssm_asn_subs.c to properly parse binary PssmWithParameters produced by NCBI asntool from psiblast (blast+) text ASN.1 output. The text ASN.1 uses a binary encoded query sequence; get_lambda() in initfa.c was modified to work with a binary encoded query sequence (the query is used to find the p_i from rrcounts[query[i]]).
Modify pssm_asn_subs.c to set query=NULL when PSSM does not include query sequence. Modify read_asn_pssm() to set query=aa0 if query==NULL.

>>Dec. 14, 2012 (cal_cons.c, dropfx.c, dropfz2.c) Enable percent identity calculation on domains. Merge cal_cons.c/calc_code() strategies into dropfx.c, dropfz2.c.

>>Dec. 6, 2012 (comp_lib8.c, comp_lib9.c, nmgetlib.c) Fix code in close_lib_list() that did not properly re-initialize files for re-reading (not seen when library is in memory, or for single sequence search).

>>Dec 2, 2012 (wm_align.c, Makefiles) CHECK_SCORE() in wm_align.c must return different scores for local and global (#define GGSEARCH in wm_align.c). Requires modified Makefiles.

>>Sep 24, 2012 (doinit.c, compacc2.c, cal_cons.c) Fix bugs introduced with next_annot_entry() strategy for reallocating annot_arr[]; find a bug in cal_cons.c where i1_annot was indexing annot0_arr_p[]; ensure that m_msg.ann_arr_def[] is appropriately initialized.

>>Sep 17, 2012 (lav2plt.pl, lavplt_ps.pl, lavplt_svg.pl, lav_defs.pl, l_feat_dom.pl) Convert the lav*.c programs to perl. This simplifies adding the ability to script domain annotation.
The format for domain annotations for the lav2plt.pl programs differs slightly from the current up_feats_dom.pl program, because it requires a beginning and end for each domain, e.g.:

  >sp|Q14247.2|SRC8_HUMAN
  80 [] 116 Cortactin 1.
  117 [] 153 Cortactin 2.
  154 [] 190 Cortactin 3.
  191 [] 227 Cortactin 4.
  228 [] 264 Cortactin 5.
  265 [] 301 Cortactin 6.
  302 [] 324 Cort. 7; trunc.
  492 [] 550 SH3.

and takes a single accession from the command line, e.g.: "l_annot_dom.pl sp|P09488" rather than reading a file.

>>Sep 4, 2012 (doinit.c, compacc2.c, fasta_guide.tex) Annotations can now be provided within a sequence (-V '%#!'), by a script (-V '\!up_feats.pl'), or from a file (-V '

>>Aug 31, 2012 (cal_cons.c, compacc2.c, dropfx.c, dropfz2.c) The region score calculations have been corrected to include regions that overlap alignment boundaries, and regions that start in gaps.

>>Aug 10, 2012 (cal_cons.c, compacc2.c, dropfx.c, dropfz2.c) Introduce a second kind of annotation feature, the "Region" (denoted by '[' and ']'), that specifies a region that should be scored separately. These regions cannot be nested, each residue can belong to only one region. However, the scores in these regions can be calculated (perhaps percent identity and length later), and are displayed:

  >>sp|P09488|gstm1_human GLUTATHIONE S-TRANSFERASE MU 1 ( (218 aa)
   Site:* : 23Y=23Y : MOD_RES: Phosphotyrosine (By similarity).
   Site:* : 33Y=33Y : MOD_RES: Phosphotyrosine (By similarity).
   Site:* : 34T=34T : MOD_RES: Phosphothreonine (By similarity).
   Region : 3-82 : score=547; bits=146.4 : GST_N
   Site:^ : 116Y=116Y : BINDING: Substrate.
   Region : 104-171 : score=465; bits=125.8 : GST_C

All information about the region should be provided with the '[' (start) symbol.

>>Aug 1, 2012 (dropfx.c, dropfz2.c, c_dispn.c) Fix some very old bugs that caused errors in coordinate displays of reverse-complement fastx/fasty alignments. Fix BLAST alignment display coordinates.
Enable variant calculations for FASTY (dropfz2.c), and simplify calculations for dropfx.c.

>>Jul 29, 2012 (doinit.c, compacc2.c, comp_lib9.c) Allow annotation descriptions to be delivered by annotation script, denoted by '=' in first line, e.g.:

  =*:phosphorylation
  =^:binding site
  =@:active site
  >gi|121735|sp|P09488.3|GSTM1_HUMAN
  7 V F Mutagen: Reduces catalytic activity 100-fold.
  23 * - MOD_RES: Phosphotyrosine (By similarity).
  33 * - MOD_RES: Phosphotyrosine (By similarity).
  34 * - MOD_RES: Phosphothreonine (By similarity).

Remove requirement for leading space before annotation script, e.g.: -V '\!up_feats_c.pl'

>>Jul 27, 2012 (compacc2.c, cal_cons.c, dropfx.c)
(1) Allow comments/descriptions on features other than type 'V' (variant) to be displayed with alignment. If a '@' SITE feature has a comment provided by the annotation script, the comment will be displayed in the alignment description, e.g.:

  >>sp|P28161.2|GSTM2_HUMAN Glutathione S-transf (218 aa)
   ^ :116Y=116Y: BINDING: Substrate (By similarity).
   @ :210S+210T: SITE: Important for substrate specificity.
   initn: 632 init1: 632 opt: 632 Z-score: 1414.3 bits: 268.8 E(450603): 2.6e-71
  Smith-Waterman score: 945; 75.2% identity (93.6% similar) in 218 aa overlap (1-218:1-218)

If no comment is provided, the annotation will only appear in the coordinate line. This provides a way to show annotation locations in BLAST output.
(2) Also add code to ensure that symbols returned by annotation scripts are displayed on the coordinate line.
(3) Add environment variable substitution to =${TMP_D}/annot.defs and \!${TMP_D}/up_feats_c.pl parsing.

>>Jul 24, 2012 (uascii.h, map_db.c) Modify NANN, a value one more than the largest amino-acid encoding value, increasing it from 50 (too small for NCBIStdaa_ext_n) to 60; ESS changed to 59.
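The annotation-script output shown in the Jul 29, 2012 entry above can be read with a few lines of Python. This is a minimal sketch, not part of the FASTA distribution: the function name is hypothetical, and it assumes the fields on each feature line (position, label, value, comment) are separated by tabs, as stated for the tab-delimited format in the Jun. 8, 2012 entry.

```python
def parse_annotations(text):
    """Parse annotation-script output into symbol definitions and features.

    Returns (symbol_defs, features):
      symbol_defs maps a symbol such as '*' to its description;
      features maps a sequence id to a list of
      (position, label, value, comment) tuples.
    """
    symbol_defs = {}
    features = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith("="):
            # e.g. "=*:phosphorylation"
            symbol, _, descr = line[1:].partition(":")
            symbol_defs[symbol] = descr
        elif line.startswith(">"):
            # e.g. ">gi|121735|sp|P09488.3|GSTM1_HUMAN"
            current = line[1:].split()[0]
            features.setdefault(current, [])
        else:
            fields = line.split("\t")
            pos, label = int(fields[0]), fields[1]
            value = fields[2] if len(fields) > 2 else ""
            comment = fields[3] if len(fields) > 3 else ""
            features.setdefault(current, []).append((pos, label, value, comment))
    return symbol_defs, features
```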
>>Jul 20, 2012 (mshowalign2.c, mshowbest.c, compacc2.c, comp_lib8.c) (transferred from fasta-36.3.5)
(a) Fix bug in mshowalign2.c that occurred because of re-use of the "tmp_len" variable when adding '\n' to -L long descriptions. This typically occurred with -m 10.
(b) Modify logic used to capture if an alignment had been calculated, reducing dramatically the number of re-alignments with multiple -m "F" output files.

>>Jun 30, 2012 (mshowbest.c) Ensure that opt score and E()-value are based on initial scan score, not later alignment score. score_delta is used to increment initial scan score. However, currently the E()-value of the alignment score is displayed in the alignment list, so the -m 9 and showalign() E()-values can be inconsistent.

>>Jun 29, 2012 (from fasta-36.3.5c) (pssm_asn_subs.c) Add chk_asn_buf() before getting RPSPARAMS_MATRIX.

>>Jun. 27, 2012 (from fasta-36.3.5c) (nmgetlib.c, compacc2.c) Fix bug that allocated unnecessary space for re-loading sequences in pre_load_best() (compacc2.c). Ensure that closed/NULL memory mapped file descriptors are not returned.

>>Jun. 18, 2012 (compacc2.c) Modify pre_load_best() to allocate memory for sequences to be aligned only if the sequences are not already in memory. (Searches against hg18 with repetitive queries caused very large amounts of memory to be allocated in duplicate.)

>>Jun. 12, 2012 (compacc2.c, doinit.c, dropfx.c, cal_consf.c) Implement variant scoring for fastx36. Also address problems with annotation location when -m markx is not set. Check function definitions for other drop functions where variant scoring is not yet implemented.

>>Jun. 9, 2012 (defs.h, doinit.c, c_dispn.c) Add 'M' and 'B' options to -m 0,1 to specify annotation location. For example, -m 0M (-m1) causes the annotation to be inserted in the "middle" alignment line, rather than in the coordinate line (making the sequence with the annotated feature ambiguous).
-m 0B, -m1B puts the annotation in both the middle (alignment) line and the coordinate line.

>>Jun. 8, 2012 (doinit.c, compacc2.c, build_ares.c, mshowbest.c, mshowalign2.c, structs.h and others) Implement a script-driven strategy for feature annotation in alignments. In addition to: fasta36 -V '*%^@', which extracts the annotation characters from the library sequences, we can also do: fasta36 -V '*%^@ \!feature_script.pl' which expects the same annotation characters ('*%^@'), but expects them from the script 'feature_script.pl'. This script gets the sequence description line, e.g.: "gi|121746|sp|P09211|GSTP1_HUMAN Glutathione S-transferase P (GST class-pi) (GSTP1-1)", and is expected to return a tab-delimited file:

  ====
  pos label value
  23 *
  33 *
  34 *
  116 ^
  173 V N
  210 V T
  ====

Currently, the "value" is ignored unless the label is "V", for variant. If 'V' annotations are present, then the alternative amino-acid residue values are tested in alignments; if the variant residue improves the score, the score is updated and the variant sequence is displayed, and a 'V' indicates the variant in the coordinate line. Currently, variant annotations can only affect library sequences.
By default, annotation symbols are shown in the coordinate line for -m 0 (default) and -m 1 (difference) alignments, sometimes overwriting the coordinate. Annotation symbols (from either sequence) can be shown in the middle alignment line by specifying -m 0M or -m 1M, or in both the middle alignment line and the coordinate line with -m 0B, -m 1B.

>>May 5, 2012 (dropnnw2.c) Enable rev-comp for ggsearch/glsearch.

>>Mar. 13, 2012 (defs.h) Increase default file name length to 256 from 120 to accommodate long file names at the EBI. Also allow much longer command line arguments argv_line[MAX_LSTR=4096] to be reported.

>>Jan. 30, 2012 (nmgetlib.c, altlib.h) Read .fastq sequence libraries (ignoring quality information) as library type '7'.

>>Dec.
21, 2011 (released as fasta-36.3.5c) (nmgetlib.c) Fixed a problem reading multiple library files that produced segmentation faults because a data buffer was free()ed and then re-used.

>>Nov. 17, 2011 (initfa.c, mshowalign.c) (from fasta-36.3.5b) Fix problem with ppst->e_cut_r for LALIGN DNA sequences (set improperly to 0.001). Add ':' to s_bits: in -m 10 output. Also remove "score" from "lsw_s-w opt" score description (not present in non-LALIGN -m 10).

>>Nov. 9, 2011 (from fasta-36.3.5b) (lavplt_svn.c, lavplt_ps.c, ncbl2_mlib.c) Fix buffer overrun for lav legend. Fix old problem re-opening NCBI blastdbfmt indirect OID files.

>>Oct. 30, 2011 (comp_lib9.c) Correct re-initialization bug that prevented the second query sequence from seeing the entire library.
[from fasta-36.3.5a_svn] (comp_lib9.c, comp_lib8.c, ncbl2_mlib.c, nmgetlib.c) Address out-of-memory problems when searching memory mapped, and fix problem using fopen()/fread() rather than mmap for NCBI DNA databases. On 32-bit machines, NCBI database files cannot be left open, and are now more aggressively closed. However, searches that produce very large numbers of alignments may still run out of memory on low-memory 32-bit machines.
(compacc2.c, comp_lib8.c, comp_lib9.c, htime.c) Correct problems that produce negative scan times.

>>Oct. 21, 2011 (pcomp_subs2.c, work_thr2.c, mshowalign2.c, make/Makefile.mp_com2, Makefile.fcom) Fixes to re-enable MPI compilation and execution.

>>Oct. 18, 2011 (compacc2.c, mshowbest.c, comp_lib8.c, comp_lib9.c, initfa.c) Fix the logic for specifying the number of alignments displayed with the -b 123, -b '>123', -b '=123', -b '$' options, particularly when statistics are not used.
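The -b forms above follow the semantics given in the May 25, 2011 entry further down ('-b 10' shows at most 10 hits limited by -E, '-b =10' shows exactly 10, '-b >1' shows at least 1). A minimal Python sketch of that selection logic follows; it is an illustration rather than the code in mshowbest.c, the function name is hypothetical, and the '$' form is left out because its meaning is not described in these notes.

```python
def select_hits(evalues, b_opt, e_cut):
    """Return the E()-values of the hits that would be displayed.

    evalues: E()-values sorted from best (smallest) to worst.
    b_opt:   the -b argument as a string: "10", "=10" or ">1".
    e_cut:   the -E threshold.
    """
    n_sig = sum(1 for e in evalues if e < e_cut)
    if b_opt.startswith("="):
        # exactly N results, regardless of E()-value
        n_show = int(b_opt[1:])
    elif b_opt.startswith(">"):
        # at least N results, otherwise limited by the threshold
        n_show = max(int(b_opt[1:]), n_sig)
    else:
        # no more than N results, and only those under the threshold
        n_show = min(int(b_opt), n_sig)
    return evalues[:n_show]
```

If the library holds fewer sequences than requested, the slice simply returns what is available, which matches the "if the library is large enough" caveat in the May 25, 2011 entry.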
>>September 21, 2011 (initfa.c, apam.c, scaleswn.c, compacc2.c) Two major problems have been addressed (which also affect fasta-36.3.5 and earlier versions): (a) specifying a -s dna.mat DNA matrix did not work properly; (b) too few shuffles, particularly with DNA sequences, were produced with pairwise comparisons.
The problem with scoring matrix files was exacerbated by the use of fixed library alphabets. initfa.c has been modified to recognize that when a DNA scoring matrix is specified, the "-n" option is set.
The shuffling problem appeared when, for pairwise DNA comparisons, fewer than 50 shuffles were reported. This occurred because the buffers used to communicate with threads no longer have a fixed amount of sequence buffer associated with them.

>>August 23, 2011 (tatstats.c, upam.h, apam.c) The remapping of the amino-acid encoding to NCBIstdaa broke some assumptions in tatstats.c, and elsewhere. In addition to the simple mapping problem, which changed the counts[] assignment in tatstats.c/calc_priors(), the fact that NCBIstdaa does not have contiguous real amino acids (e.g. B is at position 2), broke the generate_tatprobs() function because of a very old bug where priorptr was not always incremented.
Some of the drop*.c functions have been updated to ensure that the space allocated for rapid pam[][] score lookup includes space for lower-case characters, which can be present in pseg'ed "map_db -b" libraries. In addition, binary format (currently all mmap'ed) libraries cannot include annotations, because common annotation values ('*', '&') overlap the range of the NCBIstdaa_l (lowercase) mapping.

>>August 1, 2011 (map_db.c) map_db.c has been modified to provide a more efficient memory mapping for FASTA format files.
map_db -b works like map_db, but, in addition to writing the .xin index file of descriptions and sequences in the FASTA library, it also produces a new protein_library.bsq file and protein_library.xin_b that contains binary encodings of the databases and an index for this file. The binary encoding can be memory mapped, so that database searches can proceed directly from memory.
map_db -b .bsq files are very similar to the blastfmtdb files, except that they accommodate lower-case letters (masked) in the sequences. The implementation of blastfmtdb lower-case masking prevents it from being used in directly memory mapped files.
map_db.c introduces a new memory mapped format encoding, MP2. I expect this format to be extended to allow not only directly memory mapped files, but also directly memory mapped lookup tables. A database can be hashed, and the hash and link files written to a library file, which can then be used for searches without the need to re-calculate the hash/link tables.
(comp_lib9.c, mmgetaa.c, ncbl2_mlib.c, initfa.c, dropfz.c) Modifications to allow memory mapped files to be read and processed directly. Databases with lower-case characters can be memory mapped, which means that lower-case characters are coming into the alignment programs even when -S is not specified. As a result, all the protein scoring matrices must be built-out to allow lower-case characters. Likewise, the dropfz2.c matrices built by init_weights() must always be set for lower-case characters.

>>July 20, 2011 (mshowbest.c, mshowalign2.c) gi|12345 numbers are no longer shown in the list of best hits unless -m 8 or -m 9 are used. They are never shown in the alignments.
(dropfz2.c) Modify MAX_UC, MAX_LC to be consistent with NCBIstdaa alphabet. Modify <= nsq for init_weights().

>>July 16, 2011 fasta-36.3.6 (comp_lib9.c, drop*.c, cal_cons*.c) The internal encoding of amino-acids has changed to NCBIstdaa throughout the programs.
This allows the programs to use memory mapped NCBI blastdbfmt libraries directly, without re-encoding, but lower-case low-complexity mapping is not recognized. This allows substantial speedup in single query searching. However, to allow low-complexity searches, a new memory mapped format/encoding will be required.

>>July 5, 2011 fasta-36.3.6 (compacc2.c) Modify save_best2() logic for identifying scores to be used for statistics. An is_valid_stat is set for multi-frame results that specify which scores can be used for the stats[] and qstats[] arrays. Modifications to buf_do_work(), buf_shuf_work(), and buf_qshuf_work() to cause the calculation to be done in the thread, rather than the main program. Fix some bugs in the qshuffle code to ensure that all valid shuffles up to maxshuff are saved.
(complib5e.c, complib7e.c, complib8.c) Fix -m 9c/C core dump with -z -1.
(cal_cons.c, cal_consf.c) Reverse 'I', 'D' with CIGAR string.

>>June 26, 2011 (comp_lib8.c, compacc2.c) Added the ability to search a library produced/specified by a script. Like the "-e expand_script.sh", searching against a library that begins with a '!', e.g. '!library_script.sh', causes the library_script.sh to be executed, producing a temporary file from stdout, which is then scanned as the database. As with expansion files, all the standard library syntax can be included. Thus, if cat_db.sh contains the command 'echo /seqdb/swissprot.lseg', the command:

  fasta36 query.aa '\!@cat_db.sh'

will cause cat_db.sh to produce a temporary file with the line "swissprot.lseg"; the temporary file will be interpreted as an indirect file of filenames; and swissprot.lseg will be searched. Note that in Unix systems, the '!' must be preceded by a '\' as shown above, so that it is not interpreted by the shell.
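The prefix syntax in the June 26, 2011 entry ('!' for a script, '@' for an indirect file of file names) can be illustrated with a small Python sketch. The function name and the returned dictionary are assumptions for the example, not the parsing code in the FASTA sources; the input is the library argument as the program is assumed to receive it after the shell has processed it, i.e. beginning with '!'.

```python
def parse_library_spec(spec):
    """Split a library argument into its prefix flags and file name.

    '!name'  -> run 'name' as a script and search its output
    '@name'  -> 'name' is an indirect file of file names
    '!@name' -> the script's output is treated as an indirect file
    """
    is_script = spec.startswith("!")
    if is_script:
        spec = spec[1:]
    is_indirect = spec.startswith("@")
    if is_indirect:
        spec = spec[1:]
    return {"script": is_script, "indirect": is_indirect, "name": spec}
```

For the example in the entry, "!@cat_db.sh" is classified as a script whose output is an indirect file of file names.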
>>June 23,24 2011 (compacc2.c, comp_lib8.c, mysql_lib.c) A new save_best2() function in compacc2.c has been designed to simplify the logic involved in saving best scores, with the goal of moving some of the save_best() calculations into individual threads.
mysql_lib.c has a new command, close_tables, that allows a script to remove a table after it has been used. (It might make more sense to add this to the extension script option.)

>>June 14, 2011 (released as fasta-36.3.5a June, 2011) (comp_lib7e.c, comp_lib8.c, compacc2.c) Fix a serious bug in next_sequence_p() that caused a portion of the library to be missed when long sequences filled the sequence buffer before the slots were filled. Make certain that thread buffers are cleared when running an expansion script. Return an extra '\n' before the final summary for consistency with earlier versions.

>>June 2, 2011 (released as fasta-36.3.5 June, 2011) (comp_lib8.c, comp_lib5e.c, comp_lib7e.c) Fix a bug that indicated that linked expanded sequences were pre-loaded for alignment when they were not.

>>May 24, 2011 (released as fasta-36.3.5) (comp_lib8.c, comp_lib7e.c, comp_lib5e.c, mshowalign2.c, compacc2.c, initfa.c, param.h, scaleswn.c) The in-memory versions of the program are allocating much more memory than they actually use, causing the memory limits to cut in too soon. Fix this by using a smaller MAXLIB_P (36000) for searches against protein libraries, and expanding/contracting the aa1b_size more sensibly. Also add lost_memK value to track lost memory. For protein searches, lost memory is now around 15% of allocated memory (down from 40%).
Numerous fixes to improve formatting of HTML output. Full statistics parameters are now available with the fdata output.
Add fset_vars() to comp_lib8.c to set m_msg.max_memK properly. Parameters have been modified to ensure less memory waste (all buffers have 1000 sequences); Drop default 64-bit library memory limit to 8GB (-XM8G, LIB_MEMK=8G).
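The -XM8G / LIB_MEMK=8G setting above uses the syntax described in the May 19, 2011 entry: a plain number is megabytes, a trailing 'G' means gigabytes, and -1 makes all machine memory available. A minimal Python sketch of that conversion follows. The function name is hypothetical, and returning kilobytes with 1 MB = 1024 KB is an assumption suggested by the names LIB_MEMK and max_memK; the real parsing in the FASTA sources may differ.

```python
def parse_lib_mem(value):
    """Convert a -XM / LIB_MEMK value to kilobytes.

    "8G"   -> 8 gigabytes, returned in KB
    "4096" -> 4096 megabytes, returned in KB
    "-1"   -> None, meaning no limit (all machine memory)
    """
    value = value.strip()
    if value == "-1":
        return None
    if value[-1] in "Gg":
        return int(value[:-1]) * 1024 * 1024
    return int(value) * 1024
```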
>>May 25, 2011 (comp_lib8.c, comp_lib7e.c, comp_lib5e.c, mshowbest.c) Add the '-b >1' option, which guarantees that at least 1 result is shown, but otherwise limits by E()-value. '-b =10' guarantees to show exactly 10 results (never more or less if the library is large enough), '-b 10' will show no more than 10 results, limited by -E e_cut, and '-b >1' will show at least 1 result, but is otherwise limited by -E e_cut.

>>May 19, 2011 (comp_lib8.c, compacc2.c, param.h) comp_lib8.c is a version of comp_lib7e.c that keeps sequences in memory over multiple searches, but returns seqr_chains of buffers of sequences as they are read, rather than waiting for everything to be read.
comp_lib8.c will automatically allocate up to 2 GB (32-bit machines) or 8 GB (64-bit machines) to hold the sequence database in a multiple query search. This number can be increased or decreased using the -XM# (megabytes) or -XM#G (gigabytes) option, or by setting the LIB_MEMK environment variable. -XM4G (LIB_MEMK=4G) makes 4GB available for sequence libraries; -XM-1 makes all machine memory available.

>>May 5 2011 (mshowbest.c) Fix problems that prevented "-b align_number" from properly limiting output with "-z -1". "-z -1" also broke multiple HSPs (since no threshold could be calculated); fixed.
(dropnfa.c) Fix some offset arithmetic that prevented FASTA alignments from extending to full length in do_walign().

>>May 4, 2011 (scaleswn.c) Provide additional checks for division by low numbers in fit_llen2() and fit_llens(). The similarities between fit_llen(), fit_llens(), and fit_llen2() have been highlighted, and their differences documented. scaleswn.c now provides pstat_info, which writes all the values required to re-calculate zscores or E()-values from raw scores.

>>May 2, 2011 (dropnfa.c) Fix a problem with the traditional cgap(join)/optcut(opt) thresholds (no longer used by default) caused by allowing ktup=3 for proteins. The ktup=3 modification increased the cgap/opt thresholds by 6.
(comp_lib5e.c, comp_lib7e.c, comp_lib8.c) Confirm identity of -m # and -m "F3 file.out". Small differences fixed.
(mshowbest.c, mshowalign2.c) Remove gi|12345 information from -m B, -m BB blast-like output. NCBI Blast does not display gi numbers.

>>Apr. 22, 2011 (doinit.c, initfa.c) Several of the less common options have been changed to expanded options, changing the meaning of -X (which now specifies expanded options), as well as -o, -1, -B, -x, and -y. -o now provides the offset coordinates previously specified with -X; -B is now -XB, -o -Xo, -x -Xx1,-1, and -y -Xy, e.g. -Xy32.

>>Apr. 19, 2011 (comp_lib7e.c, comp_lib5e.c, doinit.c, mshowbest.c) Test latest version with -I interactive mode. Modifications required to ensure that alignments go to outfd, not stdout, when filename is entered. In addition, in interactive mode there can be more scores shown than e_cut, so bbp->repeat_thresh must be set in showbest() not main() program.

>>Apr. 17, 2011 (comp_lib7e.c, doinit.c, compacc.c) The FASTA programs now support multiple output files with different -m out_fmt types using the -m "F# out_file" or -m "F#,#,# out_file" option. Normally, the -m out_fmt option applies to the default output file, which is either stdout, or specified with -O out_file (or within the program in interactive mode). With -m F, an output format can be associated with a separate output file, which will contain a complete FASTA program output. Thus,

  ssearch36 -m 9c -m "FBB blast.out_file" -m "F10 m10.out_file" query library

will send the -m 9c output to stdout, but will also send -m BB output to blast.out_file, and -m 10 output to m10.out_file.
Consistent -m out_fmt commands can be sent to the same file by separating them with ','; e.g.:

  ssearch36 -m 9c -m "F9c,10 m9c_10.out_file" query library

Producing alternative format alignments in different files has little additional computational cost.
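The -m "F#,#,# out_file" argument syntax described above can be split into its parts with a short Python sketch. The function name is hypothetical and the error handling is an assumption; this illustrates the argument format only and is not the option parser in doinit.c.

```python
def parse_m_file_option(arg):
    """Split a -m "F#,#,# out_file" argument into (formats, out_file).

    Example: "F9c,10 m9c_10.out_file" -> (["9c", "10"], "m9c_10.out_file")
    """
    if not arg.startswith("F"):
        raise ValueError("not a -m F option: %r" % arg)
    fmt_part, _, out_file = arg[1:].partition(" ")
    formats = fmt_part.split(",")
    out_file = out_file.strip()
    if not out_file or not all(formats):
        raise ValueError("expected 'F<fmt>[,<fmt>...] <file>': %r" % arg)
    return formats, out_file
```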
One of the shortcomings of this approach is that it affects only the output format, not the other options that modify the amount of output. Thus, if you specify -E 0.001, that expect threshold will be used for all the output files. When a -m option can modify the output (e.g. -m 8 sets -d 0), that modification persists only for that file.

>>Apr. 14, 2011 (initfa.c) Fix bugs in e_cut_r calculation that made it much too low for lalign36, and used the >1.0 divisor improperly for all programs (change from e_cut_r = e_cut_r/divisor to e_cut_r = e_cut/divisor).

>>Apr. 11, 2011 (comp_lib5e.c, comp_lib7e.c, compacc.c) The non-preload version of FASTA (comp_lib5.c) has been extended to allow script expansion (comp_lib5e.c). To do this, the central score calculation loops have been moved to getlib_buf_work(), just as seqr_chain_work() was created for comp_lib7e.c. Moreover, the function used to build the link_file names, build_link_data(), is now in compacc.c. Differences between comp_lib5e.c and comp_lib7e.c have been reduced.

>>Apr. 5, 2011 (comp_lib7e.c) Fix issue with closing unopened link_lib_list_p when no results are found. Remove no-sequence error message for link library file.

>>Apr. 1, 2011 (comp_lib7e.c) The -e script.sh has been generalized to have all the capabilities of a library file, in particular '@' specifies an indirect file, and "script.sh #" allows a library type to be specified. Thus, the script.sh invoked by "@script.sh" should not produce a fasta file; it should produce a file that contains the name of a fasta file (or possibly some other format). If '@' is used, the link_lib file written to stdout will be prepended with '@', and treated as an indirect file of file names.
(comp_lib5.c, comp_lib7.c, comp_lib7e.c) Fix problem with null refstr (no Please cite:).

>>Mar. 31, 2011 (comp_lib7.c, comp_lib7e.c) close_lib() was being called after each query.
This is incorrect for versions (like comp_lib7) that keep the entire database in memory; the files must be kept open to allow ranlib() to get long descriptions (alternatively, a long description could be read initially).
(comp_lib5.c, comp_lib7.c, comp_lib7e.c) Fix query offset coordinates for long queries that are broken up. Allow query library to have zero-length sequences without stopping (queries now stop when end-of-file is reached).
(upam.h) Fix gap penalties for BLOSUM80 matrix (change from -14, -2 to -10, -2).

>>Mar. 29, 2011 (comp_lib7e.c, doinit.c) Add the ability to search an expanded set of sequences based on the accessions from the initial search using "-e expand.sh" option. If "-e expand_script.sh" is specified, the command:

  expand.sh link_acc_file > link_lib_file

is run by the program (fasta36, ssearch36, fastx36, etc), where link_acc_file and link_lib_file are temporary file names produced by the program. (The location of the temporary files can be specified with the $TMP_DIR environment variable.) link_acc_file contains a list of accession strings for the statistically significant hits - the information in the description line to the first space, e.g.

  gi|121719|sp|P08010|GSTM2_RAT
  gi|121746|sp|P09211|GSTP1_HUMAN

from a search against my pir1.lseg library.
"expand.sh" then reads that file, extracts the accession information, expands the accessions to a new set of accessions, extracts the expanded set of accessions from a database and writes them to standard output (which is saved in the temporary link_lib_file name).
The sequences in expanded link_lib_file are then added to the initial search, and included in the list of best scores (and alignments) if their scores are statistically significant. The additional sequences do not change the initial library size.
To test the expansion capability, use an expand.sh script that simply cat's a file of homologs to stdout (which will go to link_lib_file and be read), e.g.
expand.sh contains "cat ../seq/gst.lib".
Building a program that can take an arbitrary list of accessions and produce a library of homologs is more complicated (and slower), but will allow a smaller database to be searched yet produce results similar to those found from a larger database.

>>Mar. 24, 2011 (released as fasta-36.3.4) (comp_lib7.c, dropfx.c, dropfz2.c, doinit.c) Fix a bug in the new help display; identify and correct various memory leaks and references to uninitialized data.

>>Mar. 15, 2011 (doc/fasta3x.me, fasta3x.tex) The ancient, rarely updated, fasta3x.me has been replaced with fasta3x.tex, with the goal of producing a more up-to-date, accurate, and comprehensive document describing the capabilities of the FASTA programs. In addition, fasta36.1 has been updated/corrected.
(make/Makefile.os_x86_64) Mac OS X clang 2.0, distributed with Xcode4.0, does not properly optimize the smith_waterman_sse2_word() in smith_waterman_sse2.c when clang -O is used to compile.

>>Mar. 4, 2011 (doinit.c) Histograms are now turned off by default. -H shows histograms for all programs, not just the *_mpi (PCOMPLIB) programs.

>>Feb. 27, 2011 (make/Makefile36m.common, Makefile.pcom_t, Makefile.pcom_s) The threaded programs are now the default, and the *_t versions of programs have been removed from the Unix and Unix-like (Mac OS X) distributions. Windows versions can have either threaded or non-threaded versions, since the threaded Windows programs require an additional library. Serial versions of the programs can still be built by editing the make/Makefile36m.common file, and using include Makefile.pcom_s instead of include Makefile.pcom_t. The documentation has been edited to reflect these changes.

>>Feb. 24, 2011 (comp_lib5.c, comp_lib7.c, doinit.c, initfa.c, structs.h) The FASTA programs have a much more informative help system. If the -DSHOW_HELP option is included in the Makefile, the following changes occur:
(1) the program is no longer interactive by default.
To get interaction, use the -I option (-I previously meant showing the identity alignment in lalign; that option is now available with -J).
(2) fasta36 and fasta36 -h present a short help message.
(3) fasta36 -help provides a complete list of options with a more complete set of options.
The getopt() option strings are now built dynamically.

>>Feb. 18-21, 2011 (doinit.c) Fix missing -m 9i percent identity/alignment length. Fix issues with short sequence description in -m 6 (html) mode.

>>Feb. 17, 2011 (comp_lib5.c, comp_lib7.c, doinit.c) Implementation of -m BB which provides completely BLAST-like output (not just alignments).
Modification of the -b ### option. Previously, -b 100 guaranteed 100 alignments; now -b 100 limits to 100 alignments if more than 100 alignments have E()-values less than the -E threshold. An '=' symbol before the number reverts to the previous behavior; e.g. -b =100 guarantees 100 alignments, regardless of E()-value (-b =100 is equivalent to -b 100 -E 100000.0, and disables other setting of the E()-value threshold).

>>Feb. 10, 2011 (doinit.c, mshowalign2.c, c_dispn.c) The FASTA programs have a new alignment option, "-m B", which shows alignments in BLAST format (no context, coordinates on the same line, BLAST symbols for matches and mismatches.) This version does not change the descriptions of the alignments, which are still FASTA like, but the alignments themselves should look just like BLAST alignments.
Option -m BB makes output even more blast-like, showing not only the alignments, but the initial set of high scoring sequences, and other initial information, like BLAST+.

>>Feb. 9, 2011 released as fasta-36.3.3 (dropfs2.c, initfa.c, comp_lib*.c) Modify fasts36/fastm36 to allow up to ktup=3 for proteins; ktup=6 for DNA (previously the max was ktup=2 for both). Modify version string to match release version number.

>>Feb. 6, 2011 (initfa.c) Fix bug that prevented fastm36 from working properly with DNA queries.

>>Jan.
31, 2011 (pcomp_subs2.c, work_thr2.c) Fixes to fasty36_mpi/tfastx36_mpi problem. Only fasty needs pascii[] for alignments, but it wasn't being sent to workers. Fixed. The MPI versions of the programs have now been tested much more thoroughly. >>Jan. 29, 2011 (comp_lib5.c, comp_lib6.c, comp_lib7.c, work_thr2.c, initfa.c, param.h, dropfs2.c, scaleswt.c, dropfx.c) Translated DNA shuffles (tfastx36, tfasty36) now shuffle DNA as codons. (1) Modify param.h pstruct to include shuffle_dna3, initialized in resetp() [initfa.c]. (2) Modify buf_shuf_work() to use ppst->zs_win and ppst->shuffle_dna3. (3) Add ppst->zs_off=0 to scaleswt.c/process_hist(). (4) Fix some memory leaks in dropfx.c. (5) Fix some other memory leaks in dropfs2.c. >>Jan. 28, 2011 (initfa.c, scaleswn.c, mshowalign2.c) Address crashes that occurred when novel scoring matrices and gap penalties were specified, particularly for DNA. Fix memory problem with long (-L) sequence descriptions. >>Jan. 23, 2011 (comp_lib7.c) comp_lib7.c uses a more efficient strategy for reading chunks of sequences that ensures that sequence data is contiguous for *_mpi programs. comp_lib7.c replaces comp_lib6.c, which will be removed. >>Jan. 22, 2011 (many files) Replace "mw.h" with "best_stats.h", a much more informative name. (drop*.c, p_mw.h, w_mw.h) Remove p_mw.h, w_mw.h from code base and update_params() from drop*.c. These files are left over from the old p2_complib.c parallel programs. >>Jan. 21, 2011 released as fasta-36.3.2 (comp_lib5.c, comp_lib6.c, pcomp_subs2.c) Fixes for MPI version of programs. Earlier versions did not handle DNA/translated DNA comparisons properly, because duplicated sequences (forward/reverse strand) were not handled properly. The current code produces the correct scores and alignments, but is probably much less efficient than it should be. >>Jan. 11, 2011 (initfa.c, scaleswn.c) Re-enable DNALIB_LC (read lower-case DNA sequences as lower case).
Reset ktup to default after change for short query in multi-query searches. Address multiple issues associated with variable scoring matrices, i.e. -s '?BP62'. Introduce pst->pam_name for the actual scoring matrix, to distinguish it from pst->pam_file, which can correspond to the std_pam->abbrev, for values like BP62 (which encodes both a matrix and a specific set of gap penalties). Ensure that the new scoring matrix is initialized and extended correctly. Fix some issues with scoring matrix names in scaleswn.c >>Jan. 5, 2010 (dropnnw2.c, dropgsw2.h, global_sse2.c,h, glocal_sse2.c,h) Include SSE2 optimization for global/global and global/local alignments provided by Michael Farrar. Global and glocal alignments are now 20X faster. >>Jan. 5, 2011 re-released as fasta-36.3.1 (initfa.c, last_tat.c) Fix bug resetting pst.e_cut_r for DNA sequences. Modify last_tat.c code to use pre-loaded sequence if available. Remove last_tat.c PCOMPLIB code. >>Jan. 3, 2011 released as fasta-36.3.1 (comp_lib5.c, comp_lib6.c) Add >>><<<, >>>/// to -m 9,10 output for separating multiple query searches. Also clean up extra >>>query line before alignments when no alignments are shown. >>Dec. 16, 2010 (dropgsw2.c, dropnnw2.c, dropnsw.c, comp_lib5.c, comp_lib6.c) Fix bug that caused ssearch to not invert coordinates for reverse-complement DNA alignments (I never imagined using ssearch for DNA) in dropgsw2.c, dropnnw2.c, and dropnsw.c. Add SEQ_PAD to aa0[1] (rev-comp copy) in comp_lib5.c, comp_lib6.c. >>Dec. 14, 2010 Modify CIGAR strings for frameshifts, including 1F and 1R for forward and reverse frameshifts. Extensive documentation updates. doc/fasta36.1 is the most comprehensive and accurate description of FASTA options. >>Dec. 1, 2010 (drop*.c, comp_lib5.c, comp_lib6.c) Correct problems with copying for recursive sub-alignments. Correct bug in adler32_crc calculation that suggested a problem with continued library sequences that did not exist. 
(initfa.c, defs.h) Use MAXLIB, rather than MAXLIB+MAXTST for comp_lib6.c, which pre-allocates the sequence database. Increase MAXLIB. >>Nov. 24, 2010 (drop*.c, drop_func.h) Modify drop*.c functions that do recursive sub-alignments to avoid modifying the aa1[] sequence array, which conceivably could be in use by other threads. do_walign() now has const *aa0 AND const *aa1. To prevent modification of aa1, sub-regions of aa1 are now copied into newly allocated arrays. >>Nov. 20, 2010 (cal_cons.c, mshowbest.c, mshowalign2.c, doinit.c) The -m 9C option displays an alignment code in CIGAR format. (-m 9c shows the older alignment encoding.) >>Nov. 16, 2010 (beginning of fasta-36.3.*, verstr 36.07) (initfa.c, apam.c, upam.h, param.h) Provide the ability to adjust the scoring matrix based on the length of the query sequence for alignments using a protein alphabet (this could certainly be extended to DNA as well). By including a '?' before the scoring matrix, e.g. -s '?BP62', a shallower matrix will be chosen if the entropy of the selected matrix (i.e. bit score per aligned position) times the length of the protein query is <= DEF_MIN_BITS (defs.h; currently 40, although this value should be set based on the library size). The FASTA programs include BLOSUM50 (0.49 bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44 bits/position). The variable scoring matrix option searches down the list of scoring matrices to find one with information content high enough to produce a 40 bit alignment score. This option is included primarily for metagenomics scans, which can include relatively short DNA reads, and correspondingly short protein translations. Also correct the short-query modification to ktup, so that it works properly with translated FASTX/FASTY searches (ktup is set to 1 when the query_length/3 <= 20). (dropnfa.c, dropfx.c, dropfz2.c) Shuffled sequence alignment scores are calculated identically to library alignment scores.
Previously, optimized scores were calculated for all shuffled sequences for FASTA type alignments, even though typically 20 - 40% of library sequences were optimized. Now the two sampling strategies are consistent, though this may cause problems when only a small fraction of sequences are optimized. Small changes to provide consistent dropnfa.c, dropfx.c, dropfz2.c parameter display, and fix display with -m 10. >>Nov. 15, 2010 (initfa.c) Enable statistical thresholds by default (previously, they were enabled with -c -1 or -c 0.01 or anything < 1.0). The "classical" join/opt threshold behavior can be restored with -c O (upper case letter O), or by providing an optimization threshold > 1.0. Statistical thresholds dramatically speed up searches (typically 2-fold), and provide more accurate statistical estimates. The old join/optimization thresholds were optimized for BLOSUM50, and other 1/3-bit scaled scoring matrices, and did not work well with BLOSUM62. Statistical thresholds have been tested extensively, particularly with -z 21, and produce much more reliable statistical estimates. >>Oct. 14, 2010 (Makefile.fcom, cal_cons.c) Edits to re-enable compilation and successful execution of tfasta36(_t). tfasta36 has been superseded by tfastx36(_t), which is faster, and treats frameshifts as a different type of gap. >>Oct. 13, 2010 (mshowbest.c) Make it more difficult to request more descriptions/scores than are available. >>Sep. 30, 2010 (released as fasta-36.2.7) (comp_lib5.c, comp_lib6.c, dropnfa.c, dropfx.c, dropfz2.c) Fix bugs in DEBUG versions with adler32_crc calculations on overlapping sequences. Add more informative error messages when debugging. Fix a problem with hist2.hist_a != NULL with some compilers. Fix formats for some debugging error messages in dropnfa.c, dropfx.c, and dropfz2.c. Also fix repeat_threshold calculation for very short sequences, to guarantee that all matches as good as the best match with the sequence are found.
Fix some problems that prevented FASTA from finding short repeats with short queries. This version of the FASTA36 package offers an alternate main program file, comp_lib6.c, which reads the entire database into memory before doing the search. Using comp_lib6.c can dramatically speed up searches with multiple queries (there is no advantage with single query sequences) on large multi-core computers, as each search is done without re-reading the database. On a 48-core processor, we see speedups greater than 40X with ssearch36_t and fastx36_t. To enable comp_lib6.c, edit the make/Makefile36m.common file to comment out lines referring to comp_lib5.c and un-comment lines referring to comp_lib6.c. >>Sep. 29, 2010 (comp_lib5.c, comp_lib6.c, mshowbest.c) Added -m 8C option, which mimics BLAST+ tabular with comment lines format. >>Sep. 17, 2010 (dropfx.c) Fix a bug in dropfx.c/do_walign() that modified library sequences. (This only caused a problem with comp_lib6.c, which reads the entire database into memory and re-uses sequence buffers.) Check sequence consistency with adler32 CRC calculation. >>Sep. 15, 2010 (mshowbest.c, mshowalign2.c) Change the output format slightly. E2() expect values (-z 21+) no longer contain the library size (which is always the same as the E(library_size) value), and the -m 9 +- line no longer contains the frame information, since it is redundant. (The redundant rev-comp remains on the >-- HSP lines.) >>Sep. 14, 2010 (comp_lib5.c, mshowbest.c, drop*.c, cal_cons[f].c, etc.) Implement BLAST -m 8 tabular output. >>Sep. 9, 2010 (compacc.c) Fix a bug in pre_load_best() that disabled -L long sequence descriptions. (doinit.c) Fix a bug that prevented non-overlapping alignments from being displayed when the -E threshold was changed. Previously, -E 0.001 would disable additional alignments. Now, -E "0.001 0" is required to disable the additional alignments.
(drop*.c) The display of search parameters has changed to ensure that gap penalties are displayed on the same line as the scoring matrix. Previously, the FASTA "Parameters:" section looked like: Parameters: BL50 matrix (15:-5)xS ktup: 2 join: 42 (0.0944), opt: 30 (0.601), open/ext: -10/-2, width: 16 Scan time: 0.450 With fasta-36.2.7 (and later), the Parameters: section is: Parameters: BL50 matrix (15:-5), open/ext: -10/-2 ktup: 2, join: 42 (0.102), opt: 30 (0.574), width: 16 The [T]FAST[X/Y] Parameters: section includes the frameshift/substitution penalties (tfasty36): Parameters: BL50 matrix (15:-5) open/ext: -12/ -2 shift: -20, subs: -24 ktup: 2, E-join: 0.5 (0.224), E-opt: 0.1 (0.0536), width: 16 >>Aug. 3, 2010 (released as fasta-36.2.6) (scaleswn.c) Modifications to calc_thresh(), proc_hist_ml(), to better accommodate search strategies (fast?? with statistical thresholds) that provide complete scores only for a high-scoring fraction of sequences. For some query sequences, the E()-values from the database were sometimes much "worse" than E2()-values, an observation that is counter-intuitive (if parameters are estimated against shuffled related sequences, the E()-values should get worse, not better). For some queries, the result was very dramatic (E() < 1E-80, E2() < 1E-150). This error appears to occur because the z-trim or mle_cen thresholds are including many related sequences. -z 2 was modified to censor more sequences when only a subset are scored, and -z 1 was modified to adjust z-trim more carefully. As a result, z-trim was reduced, excluding more sequences. If too many sequences are excluded, then regression statistics do not work, and the program fails over to Altschul-Gish statistics. -z 21+ modified so that MLE statistics are used for shuffle E2() values if Altschul-Gish statistics are used for the library E()-values. >>July 30, 2010 (comp_lib5.c, pcomp_subs2.c) Fix bug in buf_align_seq() that allowed buffer over-runs with long DNA sequences with MPI.
Checks on buffer over-runs are now included in pcomp_subs2.c/put_rbuf(),get_wbuf(). Aug. 1, 2010, fixed similar bug in buf_shuf_seq(). -z 21 now works with long DNA sequences. >>July 28, 2010 (mshowalign2.c) Fix lalign36/showalign() to show best sub-optimal E()-value, not bptr[0] E()-value (often identical). >>July 19, 2010 (released as fasta-36.2.5) (wm_align.c, dropfx.c,dropfz2.c) Fix some off-by-one boundary calculations to ensure that every query that can fit into a library is aligned correctly. >>May 18, 2010 Implement comp_lib5.c, which simplifies the structure of comp_lib4.c by moving some calculations into functions. >>May 10, 2010 Fix problem setting nshow with small library in interactive mode. >>May 5, 2010 fasta-36.2.3 Fix bug that prevented shuffled scores from being used properly for small databases (prss capability was lost). >>May 2, 2010 fasta-36.2.2 Fix problem with tat_score values from fasts and fastm. fasta35 did not re-calculate the z-score after last_stats(). fasta36 does, so it must ensure that the e-value (sometimes p-value) is used correctly. >>Apr. 29, 2010 More extensive testing of the MPI-PCOMPLIB programs revealed some problems sending sequences when two (or more) frames for the same sequence were used. This problem has been addressed, and large scale testing of fastx36_mpi (with 100K sequence queries in a run) works. >>Apr. 16,19, 2010 (pcomp_subs2.c, comp_lib4.c, work_thr2.c) The MPI-PCOMPLIB parallel version of the FASTA36 programs is working. This PCOMPLIB version takes a very different approach from the older PVM/MPI parallel programs (p2_complib2.c/p2_workcomp2.c) - it works virtually identically to the threaded programs (sharing the same work_thr2.c code and get_rbuf/put_rbuf() (manager) and get_wbuf/put_wbuf() (worker/thread) functions). As a result, in this initial version, the database is NOT distributed to the nodes. During multiple searches, the library is re-read each time.
However, load is distributed to workers exactly the way it would be for the threaded system, so the workload should scale. To distinguish them from the earlier mp35compsw, mp35compfa, etc, the new versions are ssearch36_mpi, fasta36_mpi, etc. The programs work with multiple queries, produce multiple sub-alignments, and work with -m 9c encodings. >>Apr. 7, 2010 (various Makefiles, comp_lib4.c, pcomp_subs2.c, thr_bufs2.h, thr_buf_structs.h) The MPI version of the threaded programs, ssearch36_mp, now compiles. pcomp_subs2.c replaces pthr_subs2.c, and thr_bufs.h -> thr_buf_structs.h, thr.h -> thr_bufs2.h, and pcomp_bufs2.h has been added as the equivalent of thr_bufs2.h for PCOMPLIB. >>Apr. 2, 2010 (comp_lib4.c, work_thr2.c, compacc.c) Implement init_aa0(), which isolates code that calls init_work and sets up aa0s, aa1s, f_str[1] (reverse complement) and qf_str so that the same code is used by the serial, threaded, and (future) PCOMP versions. (work_thr2.c) work_thr2.c now contains code for either threaded or PCOMPLIB processes. Threaded processes get stuff from work_info; PCOMPLIB processes get the same information via messages sent from init_thr() called by main(). >>Mar. 30, 2010 (comp_lib4.c, work_thr2.c, thr_bufs.c +pcomp_subs2.c) The data buffers used to communicate between workers and threads have been restructured to separate the old buf2_str, which contained sequence, score results, and alignment results, into three buffers, buf2_data_s, buf2_res_s, and buf2_ares_s, separating sequence data from scores and alignments. This was done to simplify communication in the MPI/PVM environment. Workers should be able to return results directly into the appropriate buffer. >>Mar. 25, 2010 fasta-36.2.1 (dropfx.c, dropfz2.c) Found/removed two "static" declarations in small_global that caused problems with [t]fastx/y with threaded alignments. >>Mar.
24, 2010 (now version 36.06 with threaded alignments) (dropnfa.c) The DNA band aligner in dropnfa.c was not thread safe. This has been fixed. >>Mar. 23, 2010 Code for pre-loading/threaded-aligning sequences has been significantly cleaned up. Checks are made before RANLIB() and re_getlib() in showbest() and showalign() that should be consistent with annotations AND functions that cannot encode alignments. Add mshowalign2.c (which does not do PCOMPLIB) to provide threaded alignments. build_ares_code() and buf_do_align() modified to ignore MX_M9SUMM so that alignments are produced whenever demanded (still does not do alignment if a_res is available). >>Mar. 22, 2010 (comp_lib4.c, work_thr2.c, thr_bufs.h) comp_lib4.c has been modified to thread the alignment encoding (build_ares) for -m 9c. If m_msg.quiet and alignments are required for showbest(), then the program identifies the number of alignments required, reads the sequences (and annotations) into a buffer, and sends them to the threads to be encoded. Then, when showbest() is called, bbp->have_ares has been set, and the alignments are not re-calculated. This should be extended to thread actual alignment production, and additional work is required to clean-up the sequence and bline(description) buffers before a second search. >>Mar. 17, 2010 (comp_lib4.c, dropnfa,fx,fz2.c) Modifications to provide more sensible E2() statistical estimates with threshold-heuristic comparison functions and -z 21. Also fixed bug that caused the wrong zs_off to be used with -z 21. dropnfa,fx,fz2.c now optimize all scores when shuff_flg is set. >>Mar. 16, 2010 (comp_lib4.c, scaleswn.c, drop*.c) A new, relatively consistent, statistical estimation strategy has been introduced for the heuristic programs that optimize only a fraction of scores (fasta36, [t]fast[xy]36). 
Statistics-based heuristic thresholds can increase search speed 2 - 4-fold by doing band optimization on only a small fraction of library sequences (with the -c -1 option, about 10% of alignments are band-optimized, compared with more than 50% with the classic thresholds). However, optimizing only a small part of the library produces two classes of scores, optimized (10% or less) and non-optimized, with different statistical properties. fasta36 addresses this problem by calculating statistical estimates only for the optimized scores, and then correcting the significance of the score by accounting for the frequency of optimization. For example, sampling only 5% of scores increases the z-value (std. deviations above the mean) by -ln(0.05)*sqrt(6)/Pi = 2.34, which offsets the z-score by 23.4 (z-scores are scaled so that one standard deviation is 10 units). This effect is only seen when the -c option is used to specify statistical thresholds, and is most apparent when looking at the histogram, which will be offset by the appropriate z-score. This strategy appears to produce more accurate statistics in general, but can produce less accurate statistics for the heuristic programs when the -z 21 option is used. >>Mar. 3, 2010 (comp_lib4.c) Fix the new stats[] sampling strategy to sample >60K sequences more uniformly. The old code massively over-sampled later sequences, because of several bugs. The new code works as expected. The first 60K sequences are represented about 30% more than the rest, but after 60K, sequences are sampled moderately uniformly. The older SAMP_STATS_MORE is uniform across all the scores. (build_ares.c) Move code to produce chains of alignments (a_res) produced by do_walign, followed by subsequent calls to calc_id, calc_code, into a new function, build_ares_code(), which is shared by the serial/threaded and parallel (p2_workcomp.c) programs. This is a first step towards having the parallel programs produce multiple HSP alignments. >>Feb.
27, 2010 (lib_sel.c) Fix problem with new chained library access that prevented more than two files from being searched. Also, library name string has been lengthened to allow a list of libraries to be displayed. >>Feb. 26, 2010 Parallel programs have been tested in both PVM and MPI versions, and some additional bugs have been fixed. Currently, the PVM/MPI versions are fully functional, but only with FASTA35 capabilities. The new multiple HSP alignments and best-shuffle E2() scores are not yet available. >>Feb. 24, 2010 Fix some leaks, largely due to more complex alignment data structures for multiple alignments. Currently, all the major leaks are in data structures allocated in main(), and which I don't bother to de-allocate (mostly library buffer memory). Change zsflag > 10 to zsflag >= 10 && zsflag < 20 in three places. Too many shuffles were being done with zsflag==21. >>Feb. 22, 2010 Begin conversion of p2_complib2.c/p2_workcomp.c. Very old code to allocate aln_d_base removed from v35 and v36. No code for best list shuffle, or multiple high-scoring alignments. However, the code now works properly with statistical thresholds. (Changes made to p2_complib2.c, p2_workcomp.c to update pst struct after last_param.()). >>Feb. 19, 2010 fasta-36x6 Fix issues with -z 26 statistics. Add description of E2() statistics. Added option to specify statistics routine for best-shuffled statistics independently of library statistics by specifying a second -z option. Thus, -z "21 2" uses regression scaled statistics for the library estimate, and MLE statistics for the best-shuffled estimates. >>Feb. 17, 2010 fasta-36x5 Some of the simplifications dealing with threads in comp_lib4.c failed on some compilers and architectures. The code for terminating threads has been modified to allow sequence buffers with zero entries, to simplify the empty_buffer logic. There is now an explicit option to terminate threads by setting lib_bhead_p->stop_thread.
However, this flag is never set, as rbuf_done() stops the threads instead. Also fix problem with stats_idx being associated with wrong buf2_p in two frame searches. >>Feb. 15, 2010 fasta-36x4 fasta36 can now calculate both "search" (E()) and "shuffled" (E2()) E()-values and display them in the best scores and alignments. If the -z option is greater than 20, then two E()-values are calculated, one from the search (e.g. -z 1 uses regression scaled scores) and a second derived from shuffling the high scoring sequences. The high-scoring sequence shuffled scores are approximately equivalent to doing a PRSS (pairwise shuffle), but more efficient. High-scoring shuffled E()-values (labeled E2()) are typically 2 - 5-fold more conservative for average composition proteins, and 10 - 20X more conservative for biased composition proteins. Fix another bug in -S alignment scores vs opt scores in ssearch36 (see Feb. 8). >>February 12, 2010 (prev. version 142) Create comp_lib4.c (from comp_lib3.c), which simplifies some of the processes for handling buffers of results (no more empty_reader_bufs) and enables shuffles of high-scoring sequences to evaluate significance. >>February 8, 2010 Fix a problem with scores and E()-values for SSEARCH sub-alignments when the -S option is used. When the -S option was used to ignore lower-case residues in query or library for the initial score, the final alignments include the lower-case masked residues. SSEARCH36 was using the non-masked alignment score, rather than the original score (FASTA36, and [T]FAST[XY]36 used the masked score). This was incorrect, as the statistics are calculated for masked sequences. The corrected version calculates both a non-masked and a masked score, where the masked score (for subalignments) uses the non-masked alignment. [T]FAST[XY]36 had a related problem, which is that when multiple sequences are in the query with the same pam2p[0] (no -S) score, then the wrong alignment could be shown with the initial scores.
Fixing this requires that the alignment routine only work on the region specified from the initial band (fixed in dropnfa.c, dropfx.c, and dropfz2.c). >>February 4, 2010 The more efficient statistical thresholds in fasta36 have been disabled by default. They can be turned on with -c -1, or by setting thresholds (-c "0.05 0.2" would set E_band_opt to 0.05 - target 5% of sequences - and E_join at 20% target). My initial implementation produced very inaccurate statistics, presumably because only a small fraction of unrelated sequences were being band-optimized (fasta35 typically optimized about 60% of library sequences, fasta36 with statistical thresholds optimizes about 2%, which causes a 2 - 3X speed increase). The sampling strategy for fasta36, and [t]fast[xy]36 scores has been adjusted to provide relatively accurate scores for searches that optimize only a small fraction of sequences. In the cases I have tested, statistical accuracy is comparable to, or better than, the version 35 programs, but probably not as robust as ssearch estimates. >>January 29, 2010 The logic to predetermine where scores went for shuffling breaks when some scores are not calculated (e.g. -M 200 - 300). Fix by using nstats as the index for nstats < MAX_STATS, and then use stats_idx afterwards. Provide more efficient score sampling logic. The old method (left over from fasta34 or earlier) generated a random number for every sequence after MAX_STATS; if it was less than MAX_STATS, the sample was used. This logic is still available with -DSAMP_STATS_MORE. The new logic samples every other sequence between MAX_STATS and 2*MAX_STATS, every third between 2*MAX_STATS and 3*MAX_STATS, etc, and randomly replaces one of the stats scores. For 430K SwissProt, this reduces the number of samples from 178K to about 145K, and reduces the number of calls to the random number generator from 430K to 85K.
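The sampling rule in the January 29, 2010 entry can be illustrated with a short C sketch. This is not the comp_lib code: stats_slot(), lcg_next(), and their arguments are hypothetical names, the stride formula is one reading of "every other, every third, etc.", and the sample counts it produces will not necessarily match the SwissProt figures quoted above. The generator keeps its state in a caller-owned variable, so each thread could hold its own copy.

```c
#include <assert.h>
#include <stdint.h>

/* Small linear congruential generator with caller-owned state
   (hypothetical; constants are the common Numerical Recipes pair). */
static uint32_t lcg_next(uint32_t *state)
{
    *state = *state * 1664525u + 1013904223u;
    return *state >> 8;          /* drop the weakest low-order bits */
}

/* Decide where the score for the n_seen-th library sequence (0-based)
   goes in a stats[] array of max_stats entries.
   Returns a slot index, or -1 if this sequence is not sampled. */
long stats_slot(long n_seen, long max_stats, uint32_t *rng_state)
{
    long stride;

    assert(n_seen >= 0 && max_stats > 0 && rng_state != 0);

    /* The first max_stats sequences fill the array in order. */
    if (n_seen < max_stats)
        return n_seen;

    /* Afterwards: every 2nd sequence in [max_stats, 2*max_stats),
       every 3rd in [2*max_stats, 3*max_stats), and so on. */
    stride = n_seen / max_stats + 1;
    if (n_seen % stride != 0)
        return -1;               /* skipped: no random number drawn */

    /* A sampled sequence replaces a randomly chosen earlier score. */
    return (long)(lcg_next(rng_state) % (uint32_t)max_stats);
}
```

A caller would keep one uint32_t seed per thread and store the score only when the returned slot is non-negative. Skipped sequences never touch the generator, which is the saving the entry describes.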
>>January 28, 2010 (comp_lib3.c, mrandom.c) Tests of ssearch36 statistical accuracy suggest that the default statistical estimates (-z 1) are not as accurate as they should be with BLOSUM62, -11/-1. Both -z 11 and -z 2 work better. In FASTA35, -z 11 - 15 caused a 2X-slowdown (actually more) because EVERY library sequence was shuffled, even though only a fraction of the sequences (for libraries > 60,000) would be used for the statistical calculation. comp_lib3.c uses a more sophisticated strategy for sampling scores after 60,000 so that sequences are only shuffled and aligned if they will be used in the statistical calculation. Doing this on SwissProt, with 430,000 sequences, means that ~180,000 additional shuffle alignments are done, not 430,000 additional. However, using -z 11 with the threaded program was much more than 2X-slower -- random() is not re-entrant, and is designed to provide a consistent set of random numbers over threads, so threads were waiting on the random number generator, with a big performance penalty. Using code from Wikipedia, I implemented a random number generator (mrandom.c) that saves a local copy of state, so threaded -z 11 has the correct performance penalty. >>January 25, 2010 (initfa.c 36.04 January 2010) (dropfz2.c, aln_struct.h) At long last, tfasty36 correctly produces multiple alignments on the reverse strand. (Jan. 26, 2010) Fixed a newly introduced bug in fasty36 that used the wrong offset in recursion. >>January 17, 2010 Extensive changes have been made to all the drop_* functions, so that multiple alignment results are properly sorted from highest to lowest sw_score. dropnfa.c, dropgsw2.c, dropfx.c and dropfz2.c now all use similar strategies to calculate non-overlapping alternative alignments. score_thresh thresholds are applied to rst.score[ppst->score_ix] appropriately for all recursive functions. >>August 24, 2009 Statistical thresholds have been adjusted to produce approximately the correct number of joins/band optimizations.
The approximate fraction of joins/band optimizations is now shown in the results. >>August 21, 2009 fasta/fastx/fasty/tfastx/tfasty now use statistically based thresholds for joining short segments and deciding to do a band optimization -- similar to the threshold strategy used by BLAST. The statistical thresholds used are set with the -c option, which used to be used to set optcut. The -c option now has three ranges: -c < 0 -- use the old FASTA thresholds, calculated in the same way; 0 < -c < 1.0 -- use the statistical thresholds and set E_opt_cut; -c >= 1.0 -- use the old FASTA threshold, and specify it. For 0 < -c < 1.0, a second argument can be supplied (-c "0.02 0.1") for the joining E()-threshold. If this value is < 1.0, it is used as E_join; if it is > 1.0, E_opt_cut is multiplied by the value to get E_join. >>August 19, 2009 Implement Lambda/K/H based c_gap, opt_cut in dropnfa.c, dropfx.c (fastx), and dropfz2.c (fasty). Add ELK_to_s() to scaleswn.c. >>August 11, 2009 Fix bug in dropfx.c that used the wrong variables for calculating offsets into a long DNA sequence for subset alignments. Stop putting sw_score in score[0] when no score[0] was calculated. Use 0 instead. >>July 31, 2009 (dropgsw2.c) Fix problems with dropgsw2.c that allowed poor sub-alignments to be shown. Consolidate merge_ares_acc() for all the functions. Add pst.do_rep to disable multiple alignments. >>July 6, 2009 (initfa.c, apam.c, complib2.c, p2_complib.c) move changes for validate_novel_aa() from fasta35. (initfa.c) Enable checks for unusual characters ('Uu' in proteins) for many more programs with the -p option. >>June 16, 2009 Modify statistical sampling strategy to greatly simplify the calculation. >>May 15, 2009 Fix bug in lav2ps.c, lav2svg.c that occurred when displaying very long sequence alignments (e.g. genome alignments). The maximum coordinate is set properly now.
>>May 5, 2009 (initfa.c) Fix bug (int e_cut in pgm_def_arr[]) that prevented e_cut from being set properly for lalign for DNA. >>May 4, 2009 The functions that return multiple sub-alignments (HSPs) after the best alignment have been modified to ensure that alignments are returned sorted by score, by merging the list of alignments found to the left and right of the best alignment. >>April 28, 2009 (p2_complib2.c, p2_workcomp2.c, mshowbest.c, mshowalign.c) modified to support new coordinate system, preliminary work on multiple HSPs in parallel environment. >>April 14, 2009 (comp_lib2.c, nmgetaa.c) Comprehensive restructuring of library file list from a fixed length array to a variable length linked list. The linked list allows library files to insert additional files into the list, so that, for example, a file of accession numbers can refer to a list of files for the accessions. Eventually, this should allow FASTA to support .pal/.nal files from the NCBI, and to support files of file names in most places where file names are allowed. >>April 2, 2009 (from fasta35) (structs.h, comp_lib2.c, doinit.c, mshowbest.c, mshowalign.c) The code that selects the number of high scores to display has been reorganized to support the -F e_low option (which was not implemented properly if -b and -d were specified). The code is simplified; m_msg.nshow is used to specify the number of best scores listed, and min(m_msg.nshow, m_msg.ashow) is used to specify the number of alignments shown. >>March 26, 2009 (from fasta35 - fa35_04_07) (initfa.c) Fix problems with 'U' recognition in DNA pam matrix, correct implementation of -r +mat/-mis. Previous versions of fasta35 may not have used the correct DNA matrix when the -r +mat/-mis option was specified. >>March 23, 2009 (initfa.c verstr -> 36.02) (mshowbest.c, aln_structs.h) Add loop for displaying multiple aligned regions with -m 9, -m 9i, and -m 9c in mshowbest.c.
>>March 22, 2009 (dropgsw2.c, dropnnw2.c, wm_align.c) Rearrange code in dropgsw2.c, dropnnw2.c (which replaces dropnnw.c) so that a single function, wm_align.c:nsw_malign() is responsible for recursive alignments for both dropgsw2.c (sw_walign) and dropnnw2.c (nw_walign). The strategy for these (Smith-Waterman, Global-Local) alignments is identical. nsw_malign() uses a function pointer that calculates S-W or N-W that it gets from dropgsw2.c or dropnnw2.c. It might make sense to use a similar strategy for the recursive translated alignments. >>March 19, 2009 (map_db.c, mm_file.h) Fix another bug in map_db.c that appears for sequence files larger than 2Gb. MM_OFF is now consistently used in more of the places where an int64_t is required. >>March 17, 2009 (list_db.c) Fix a bug in list_db that caused it to misread the maximum sequence length, and then be off by 4 bytes for all the offsets. Include list_db with map_db in the list of auxiliary programs. >>Mar. 8, 2009 fa35_04_06 (comp_lib2.c, pthr_subs2.c, pthr_subs.h, doinit.c, dec_pthr_subs.c) Dynamically allocate pthread_t *fa_threads, rather than limit it to MAX_WORKERS. MAX_WORKERS is no longer used in the Unix environment; it gets its value from sysconf(_SC_NPROCESSORS_CONF). If sysconf() is not available, MAX_WORKERS is used. The threaded programs should now automatically adjust the number of threads to the number of processors. Moreover, the number of threads can be set to more than the number of processors with -T #threads. Also, max_workers was renamed fa_max_workers, and pthread_t *threads is now *fa_threads. >>Mar. 6, 2009 copied comp_lib2.c from v35 (fix for query offset coordinates) >>Oct. 22, 2008 The programs that allow multiple alignments to be found include: ssearch36(_t) fasta36(_t) fastx36(_t) fasty36(_t) fasts and fastf will probably not be updated in this way, because of the difficulty in reconstructing alignments, but fastm may be.
Right now, the pvm/mpi versions of the programs do not support multiple sub-alignments. >>Sep. 25, 2008 Modify the syntax for the -E option to allow the repeat E()-value cutoff to be specified in either of two ways. -E "e_cut e_rep" If the value of e_rep is less than one, it is taken as the absolute E()-value threshold for additional local domains, for example: -E "1.0 0.05" says use 1.0 for the main E()-value threshold, and 0.05 as the threshold for additional local alignments. Alternatively, if e_rep >= 1.0, it is taken as a divisor for the E()-value threshold, thus: -E "1.0 10.0" Sets the E()-value threshold for additional local alignments to 1.0/10.0 = 0.1. Finally, if e_rep <= 0.0, no multiple alignments are done (equivalent to previous versions of FASTA).
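The three cases for the second -E value can be written out as a small C function. This is a sketch of the rule as stated in the Sep. 25, 2008 entry, not the option-parsing code in doinit.c; repeat_threshold() and its do_rep flag argument are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Resolve the E()-value threshold for additional local alignments
   from the two values given as -E "e_cut e_rep".
   Sets *do_rep to 0 when multiple alignments are disabled, else 1.
   Hypothetical helper; the names are not from the FASTA sources. */
double repeat_threshold(double e_cut, double e_rep, int *do_rep)
{
    assert(do_rep != 0);

    if (e_rep <= 0.0) {          /* -E "1.0 0": no additional alignments */
        *do_rep = 0;
        return 0.0;
    }
    *do_rep = 1;
    if (e_rep < 1.0)             /* -E "1.0 0.05": absolute threshold */
        return e_rep;
    return e_cut / e_rep;        /* -E "1.0 10.0": divisor, 1.0/10.0 = 0.1 */
}
```

With the values from the entry, repeat_threshold(1.0, 0.05, &flag) gives 0.05 and repeat_threshold(1.0, 10.0, &flag) gives 0.1, while any e_rep at or below zero clears the flag.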