Metadata-Version: 1.0
Name: pyfasta
Version: 0.4.5
Summary: fast, memory-efficient, pythonic (and command-line) access to fasta sequence files
Home-page: http://github.com/brentp/pyfasta/
Author: brentp
Author-email: bpederse@gmail.com
License: MIT
Description:
==================================================
pyfasta: pythonic access to fasta sequence files.
==================================================

:Author: Brent Pedersen (brentp)
:Email: bpederse@gmail.com
:License: MIT

.. contents ::

Implementation
==============

Requires Python >= 2.5. Stores a flattened version of the fasta file without
spaces or headers and uses either a mmap of numpy binary format or fseek/fread
so the *sequence data is never read into memory*. Saves a pickle (.gdx) of the
start, stop (for fseek/mmap) locations of each header in the fasta file for
internal use.

Usage
=====

::

    >>> from pyfasta import Fasta

    >>> f = Fasta('tests/data/three_chrs.fasta')
    >>> sorted(f.keys())
    ['chr1', 'chr2', 'chr3']

    >>> f['chr1']
    NpyFastaRecord(0..80)

Slicing
-------

::

    >>> f['chr1'][:10]
    'ACTGACTGAC'

    # get the 1st basepair in every codon (it's python yo)
    >>> f['chr1'][::3]
    'AGTCAGTCAGTCAGTCAGTCAGTCAGT'

    # can query by a 'feature' dictionary (note this is one based coordinates)
    >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9})
    'CTGACTGA'

    # same as:
    >>> f['chr1'][1:9]
    'CTGACTGA'

    # use python, zero based coords
    >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9}, one_based=False)
    'TGACTGA'

    # with reverse complement (automatic for - strand)
    >>> f.sequence({'chr': 'chr1', 'start': 2, 'stop': 9, 'strand': '-'})
    'TCAGTCAG'

Key Function
------------

Sometimes your fasta will have a long header like:
"AT1G51370.2 | Symbols: | F-box family protein | chr1:19045615-19046748 FORWARD"
when you only want to key off: "AT1G51370.2".
In this case, specify the key_fn argument to the constructor:

::

    >>> fkey = Fasta('tests/data/key.fasta', key_fn=lambda key: key.split()[0])
    >>> sorted(fkey.keys())
    ['a', 'b', 'c']

Numpy
=====

The default is to use a memmapped numpy array as the backend, in which case
it's possible to get back an array directly...

::

    >>> f['chr1'].tostring = False
    >>> f['chr1'][:10] # doctest: +NORMALIZE_WHITESPACE
    memmap(['A', 'C', 'T', 'G', 'A', 'C', 'T', 'G', 'A', 'C'], dtype='|S1')

    >>> import numpy as np
    >>> a = np.array(f['chr2'])
    >>> a.shape[0] == len(f['chr2'])
    True

    >>> a[10:14] # doctest: +NORMALIZE_WHITESPACE
    array(['A', 'A', 'A', 'A'], dtype='|S1')

mask a sub-sequence

::

    >>> a[11:13] = np.array('N', dtype='S1')
    >>> a[10:14].tostring()
    'ANNA'

Backends (Record class)
=======================

It's also possible to specify another record class as the underlying work-horse
for slicing and reading. The available record classes are:

  * NpyFastaRecord (the default), which uses numpy memmap
  * FastaRecord, which uses fseek/fread
  * MemoryRecord, which reads everything into memory and must reparse the
    original fasta every time.
  * TCRecord, which is identical to NpyFastaRecord except that it saves the
    index in a TokyoCabinet hash database, for cases when there are enough
    records that loading the entire index from a pickle into memory is unwise.
    (Note that the sequence is not loaded into memory in either case.)

It's possible to specify the class used with the `record_class` kwarg
to the `Fasta` constructor:

::

    >>> from pyfasta import FastaRecord # default is NpyFastaRecord
    >>> f = Fasta('tests/data/three_chrs.fasta', record_class=FastaRecord)
    >>> f['chr1']
    FastaRecord('tests/data/three_chrs.fasta.flat', 0..80)

Other than the repr, it should behave exactly like the Npy record class backend.
It's possible to create your own using a sub-class of FastaRecord; see the
source in pyfasta/records.py for details.

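The flat-file-plus-offsets idea behind the fseek/fread style of backend can be sketched without pyfasta. This is an illustrative sketch only, not pyfasta's actual code: the helper names `flatten_fasta` and `read_slice` are hypothetical, the index is kept as a plain dict rather than a pickled .gdx file, and the sequence is assumed to be plain ASCII.

```python
import os
import tempfile


def flatten_fasta(fasta_path, flat_path):
    """Write the sequence data, minus headers and newlines, to flat_path.

    Returns {header: (start, stop)} offsets into the flat file.
    Hypothetical helper for illustration; not part of pyfasta.
    """
    index = {}
    pos = 0          # current offset in the flat file
    current = None   # header of the record being read
    start = 0        # offset where the current record begins
    with open(fasta_path) as src, open(flat_path, "w") as flat:
        for line in src:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                if current is not None:
                    index[current] = (start, pos)
                current = line[1:].strip()
                start = pos
            else:
                flat.write(line)
                pos += len(line)
        if current is not None:
            index[current] = (start, pos)
    return index


def read_slice(flat_path, index, key, begin, end):
    """Return record[begin:end] by seeking; only the slice is read."""
    start, stop = index[key]
    # Clamp to the record so a slice never runs into the next record.
    begin = min(start + begin, stop)
    end = min(start + end, stop)
    with open(flat_path, "rb") as flat:
        flat.seek(begin)
        return flat.read(end - begin).decode("ascii")


# Tiny demonstration on a temporary two-record fasta file.
workdir = tempfile.mkdtemp()
fasta_path = os.path.join(workdir, "demo.fasta")
flat_path = fasta_path + ".flat"
with open(fasta_path, "w") as fh:
    fh.write(">chr1\nACTGACTGAC\nTGACTG\n>chr2\nAAAANNNN\n")

index = flatten_fasta(fasta_path, flat_path)
first = read_slice(flat_path, index, "chr1", 1, 9)
second = read_slice(flat_path, index, "chr2", 2, 6)
print(index, first, second)
```

Because each record's start and stop offsets are known up front, a slice costs one seek and one short read regardless of how large the fasta file is.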

Flattening
==========

In order to efficiently access the sequence content, pyfasta saves a separate,
flattened file with all newlines and headers removed from the sequence.
In the case of large fasta files, one may not wish to save 2 copies of a 5GB+
file. In that case, it's possible to flatten the file "inplace", keeping all
the headers, and retaining the validity of the fasta file -- with the only
change being that the new-lines are removed from each sequence.
This can be specified via `flatten_inplace` = True

::

    >>> import os
    >>> os.unlink('tests/data/three_chrs.fasta.gdx') # cleanup non-inplace idx
    >>> f = Fasta('tests/data/three_chrs.fasta', flatten_inplace=True)
    >>> f['chr1']  # note the difference in the output from above.
    NpyFastaRecord(6..86)

    # sequence is the same as when requested from the non-flat file above.
    >>> f['chr1'][1:9]
    'CTGACTGA'

    # the flattened file is kept as a placeholder without the sequence data.
    >>> open('tests/data/three_chrs.fasta.flat').read()
    '@flattened@'

Command Line Interface
======================

There's also a command line interface to manipulate / view fasta files.
The `pyfasta` executable is installed via setuptools; running it will show
help text.

split a fasta file into 6 new files of relatively even size:

  $ pyfasta **split** -n 6 original.fasta

split the fasta file into one new file per header with "%(seqid)s" being
filled into each filename:

  $ pyfasta **split** --header "%(seqid)s.fasta" original.fasta

create 1 new fasta file with the sequence split into 10K-mers:

  $ pyfasta **split** -n 1 -k 10000 original.fasta

2 new fasta files with the sequence split into 10K-mers with 2K overlap:

  $ pyfasta **split** -n 2 -k 10000 -o 2000 original.fasta

show some info about the file (and show gc content):

  $ pyfasta **info** --gc tests/data/three_chrs.fasta

**extract** sequence from the file. use the header flag to make
a new fasta file. the args are a list of sequences to extract.


  $ pyfasta **extract** --header --fasta tests/data/three_chrs.fasta seqa seqb seqc

**extract** sequence from a file using a file containing the headers *not*
wanted in the new file:

  $ pyfasta extract --header --fasta input.fasta --exclude --file seqids_to_exclude.txt

**extract** sequence from a fasta file with complex keys where we only want to
look up based on the part before the space.

  $ pyfasta extract --header --fasta input.with.keys.fasta --space --file seqids.txt

**flatten** a file inplace, for faster later use by pyfasta, and without
creating another copy. (`Flattening`_)

  $ pyfasta flatten input.fasta

Cleanup
=======

(though for real use these will remain for faster access)

::

    >>> os.unlink('tests/data/three_chrs.fasta.gdx')
    >>> os.unlink('tests/data/three_chrs.fasta.flat')

Testing
=======

There is currently > 99% test coverage for the 2 modules and all included
record classes. To run the tests:

::

  $ python setup.py nosetests

Changes
=======

0.4.5
-----
pyfasta split can handle > 52 files. (thanks Devtulya)

0.4.4
-----
fix pyfasta extract

0.4.3
-----
Add 0 or 1-based intervals in sequence() thanks @jamescasbon

0.4.2
-----
update for latest numpy (can't close memmap)

0.4.1
-----
check for duplicate headers.

0.4.0
-----
* add key_fn kwarg to constructor

0.3.9
-----
* only require 'r' (not r+) for memory map.

0.3.8
-----
* clean up logic for mixing inplace/non-inplace flattened files.
  if the inplace is available, it is always used.

0.3.6/7
-------
* don't re-flatten the file every time!
* allow spaces before and after the header in the original fasta.

0.3.5
-----
* update docs in README.txt for new CLI stuff.
* allow flattening inplace.
* get rid of memmap (results in faster parsing).

0.3.4
-----
* restore python2.5 compatibility.
* CLI: add ability to exclude sequence from extract
* CLI: allow splitting based on header.

0.3.3
-----
* include this file in the tar ball (thanks wen h.)


0.3.2
-----
* separate out backends into records.py
* use nosetests (python setup.py nosetests)
* add a TCRecord backend for next-gen sequencing, available if tc is
  (easy-)installed.
* improve test coverage.

Keywords: bioinformatics blast fasta
Platform: UNKNOWN
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics