Updated June 6, 2021

NAME

fastalen.py - separate fasta sequences by length

SYNOPSIS

fastalen.py [--filelist] infile --split <int>|--gte <int>|--lt <int>|--between <int> <int>

DESCRIPTION

fastalen.py provides several ways to write sequences from a fasta file to one or more output files, depending on sequence length. Output filenames are generated automatically from the input filename.
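As an illustration of what "sequence length" means here, the following is a minimal sketch of reading fasta records and measuring them. It is not taken from fastalen.py; the function name read_fasta is hypothetical, and the sketch assumes that header lines start with ">" and that a sequence may span several lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of fasta lines.

    Hypothetical helper for illustration; not the parser used by fastalen.py.
    """
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            # Sequence data may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    demo = [">seq1 example", "ACGTACGT", "ACGT", ">seq2", "MKV"]
    for name, seq in read_fasta(demo):
        print(name, len(seq))
```

The length of a record is the number of characters in the joined sequence, regardless of how the lines were wrapped in the input file.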

OPTIONS
--filelist infile - If specified, names of files to be processed are read from infile. Otherwise, infile is assumed to be a fasta sequence file.

infile - If --filelist is specified, infile is assumed to be a list of filenames for fasta files to be processed. Files are listed one name per line. Example:

seq1.fa
seq2.fa
seq3.fa

If --filelist is not specified, infile is assumed to be a fasta sequence file. Sequences may be DNA, RNA or protein.

--split int - Split sequences into two output files, one containing sequences whose length is greater than or equal to int, and the other containing sequences whose length is less than int.

--gte int - Write a single output file containing only sequences whose length is greater than or equal to int.

--lt int - Write a single output file containing only sequences whose length is less than int.

--between int1 int2 - Write a single output file containing only sequences whose length is greater than or equal to int1 and less than or equal to int2.

Note: --split, --gte, --lt and --between are mutually exclusive.
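The options above could be parsed along the following lines. This is a sketch using Python's argparse module, not the argument handling in fastalen.py itself. The function names build_parser and input_files are hypothetical. Requiring exactly one length option and skipping blank lines in the file list are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the SYNOPSIS (illustrative only)."""
    parser = argparse.ArgumentParser(prog="fastalen.py")
    parser.add_argument("infile")
    parser.add_argument("--filelist", action="store_true")
    # argparse rejects any command line that combines two of these options.
    # required=True is an assumption: the sketch insists on exactly one.
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument("--split", type=int, metavar="INT")
    group.add_argument("--gte", type=int, metavar="INT")
    group.add_argument("--lt", type=int, metavar="INT")
    group.add_argument("--between", type=int, nargs=2, metavar=("INT1", "INT2"))
    return parser


def input_files(args):
    """Return the fasta files to process, honouring --filelist."""
    if not args.filelist:
        return [args.infile]
    with open(args.infile) as handle:
        # One filename per line; blank lines are skipped (assumption).
        return [line.strip() for line in handle if line.strip()]


if __name__ == "__main__":
    args = build_parser().parse_args(["seqs.fa", "--between", "10000", "50000"])
    print(input_files(args), args.between)
```

With this layout, a command line such as "seqs.fa --gte 1 --lt 2" is rejected by argparse with a usage error before any file is read.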

OUTPUT FILES

Output filenames are built from the input filename. The basename is the input filename minus the first file extension, if it exists. For example, if the input filename is seqs.fa, the basename is "seqs". Sequences are written 100 characters per line.
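The basename rule and the 100-character line width can be sketched as follows. This is an illustration, not the code in fastalen.py. The function names output_basename, wrap_sequence and format_record are hypothetical. The sketch assumes that "file extension" means the last dot-separated suffix, as removed by os.path.splitext; for a name such as seqs.fa this gives "seqs" either way.

```python
import os


def output_basename(infile):
    """Strip one file extension, if present (assumes the last suffix)."""
    return os.path.splitext(infile)[0]


def wrap_sequence(seq, width=100):
    """Break a sequence into lines of at most `width` characters."""
    return [seq[i:i + width] for i in range(0, len(seq), width)]


def format_record(header, seq, width=100):
    """Return one fasta record as text, wrapped at `width` characters."""
    return "\n".join([">" + header] + wrap_sequence(seq, width)) + "\n"


if __name__ == "__main__":
    print(output_basename("seqs.fa"))
    print(format_record("seq1", "ACGT" * 60), end="")
```

A 240-character sequence is therefore written as two lines of 100 characters followed by one line of 40.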

EXAMPLES

1. --split

fastalen.py seqs.fa --split 5000 

Will create output files seqs.gte5000.fa and seqs.le5000.fa, containing sequences of length >= 5000 and < 5000 nucleotides or amino acids, respectively.

2. --gte

fastalen.py seqs.fa --gte 5000

Will create output file seqs.gte5000.fa with sequences of length >= 5000.

3. --lt

fastalen.py seqs.fa --lt 5000

Will create output file seqs.lt5000.fa with sequences of length < 5000.

4. --between

fastalen.py seqs.fa --between 10000 50000

Will create output file seqs.10000-50000.fa containing sequences whose length satisfies 10000 <= length <= 50000.
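The length tests behind the four examples can be summarized in a short sketch. It follows the comparisons stated under OPTIONS (>= for --gte, < for --lt, inclusive at both ends for --between), but it is not the code in fastalen.py. The function names matches and split_by_length are hypothetical.

```python
def matches(length, option, values):
    """Return True if a sequence of this length is written for the option.

    `option` is "gte", "lt" or "between"; `values` holds the integer
    argument(s) given on the command line.
    """
    if option == "gte":
        return length >= values[0]
    if option == "lt":
        return length < values[0]
    if option == "between":
        # Both ends are inclusive.
        return values[0] <= length <= values[1]
    raise ValueError("unknown option: " + option)


def split_by_length(lengths, cutoff):
    """Partition lengths as --split does: (>= cutoff, < cutoff)."""
    at_least = [n for n in lengths if n >= cutoff]
    below = [n for n in lengths if n < cutoff]
    return at_least, below


if __name__ == "__main__":
    lengths = [4999, 5000, 10000, 50000, 50001]
    print(split_by_length(lengths, 5000))
    print([n for n in lengths if matches(n, "between", [10000, 50000])])
```

Every sequence falls on exactly one side of a --split cutoff, so the two output files together contain all of the input sequences.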

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2
brian.fristensky@umanitoba.ca
http://home.cc.umanitoba.ca/~frist