skipredundant

 

Wiki

The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki.

Please help by correcting and extending the Wiki pages.

Function

Remove redundant sequences from an input set

Description

Redundancy in a database or other collection of sequences occurs when one or more similar sequences are present. The inclusion of very similar sequences in certain analyses will introduce undesirable bias. For example, a family may possess 100 sequences in the sequence database, but 90 of these might be essentially the same sequence, e.g. very close relatives or mutations of a single sequence. Although 100 sequences are known, the family only contains 11 sequences that are essentially unique. For many applications it is desirable or even essential to remove redundant sequences from a set in order to produce a smaller set that is representative of the whole. SEQNR removes redundancy from an input file of sequences, either at a single threshold of sequence similiarty (e.g. 40%) or within a threshold range of sequence similiarty (e.g. 40% - 70%).

Algorithm

Redundancy is calculated for each pair of sequences in turn.

Usage

Here is a sample session with skipredundant


% skipredundant -threshold 80 -redundant "" 
Remove redundant sequences from an input set
Input sequence set: globins.fasta
Redundancy removal options
         1 : Single threshold percentage sequence similarity
         2 : Outside a range of acceptable threshold percentage similarities
Select number [1]: 
Gap opening penalty [10.0]: 
Gap extension penalty [0.5]: 
output sequence(s) [globins.keep]: 

Go to the input files for this example
Go to the output files for this example

Command line arguments

Remove redundant sequences from an input set
Version: EMBOSS:6.4.0.0

   Standard (Mandatory) qualifiers (* if not always prompted):
  [-sequences]         seqset     Sequence set filename and optional format,
                                  or reference (input USA)
   -mode               menu       [1] This option specifies whether to remove
                                  redundancy at a single threshold percentage
                                  sequence similarity or remove redundancy
                                  outside a range of acceptable threshold
                                  percentage similarity. All permutations of
                                  pair-wise sequence alignments are calculated
                                  for each set of input sequences in turn
                                  using the EMBOSS implementation of the
                                  Needleman and Wunsch global alignment
                                  algorithm. Redundant sequences are removed
                                  in one of two modes as follows: (i) If a
                                  pair of proteins achieve greater than a
                                  threshold percentage sequence similarity
                                  (specified by the user) the shortest
                                  sequence is discarded. (ii) If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range (specified by the user) the shortest
                                  sequence is discarded. (Values: 1 (Single
                                  threshold percentage sequence similarity); 2
                                  (Outside a range of acceptable threshold
                                  percentage similarities))
*  -threshold          float      [95.0] This option specifies the percentage
                                  sequence identity redundancy threshold. The
                                  percentage sequence identity redundancy
                                  threshold determines the redundancy
                                  calculation. If a pair of proteins achieve
                                  greater than this threshold the shortest
                                  sequence is discarded. (Any numeric value)
*  -minthreshold       float      [30.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (lower limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
*  -maxthreshold       float      [90.0] This option specifies the percentage
                                  sequence identity redundancy threshold
                                  (upper limit). The percentage sequence
                                  identity redundancy threshold determines the
                                  redundancy calculation. If a pair of
                                  proteins have a percentage sequence
                                  similarity that lies outside an acceptable
                                  range the shortest sequence is discarded.
                                  (Any numeric value)
   -gapopen            float      [10.0 for any sequence] The gap open penalty
                                  is the score taken away when a gap is
                                  created. The best value depends on the
                                  choice of comparison matrix. The default
                                  value assumes you are using the EBLOSUM62
                                  matrix for protein sequences, and the
                                  EDNAFULL matrix for nucleotide sequences.
                                  (Floating point number from 1.0 to 100.0)
   -gapextend          float      [0.5 for any sequence] The gap extension,
                                  penalty is added to the standard gap penalty
                                  for each base or residue in the gap. This
                                  is how long gaps are penalized. Usually you
                                  will expect a few long gaps rather than many
                                  short gaps, so the gap extension penalty
                                  should be lower than the gap penalty. An
                                  exception is where one or both sequences are
                                  single reads with possible sequencing
                                  errors in which case you would expect many
                                  single base gaps. You can get this result by
                                  setting the gap open penalty to zero (or
                                  very low) and using the gap extension
                                  penalty to control gap scoring. (Floating
                                  point number from 0.0 to 10.0)
  [-outseq]            seqoutall  [.] Sequence set(s)
                                  filename and optional format (output USA)
  [-redundantoutseq]   seqoutall  [.] Redundant sequences
                                  (optional)

   Additional (Optional) qualifiers:
   -datafile           matrixf    [EBLOSUM62 for protein, EDNAFULL for DNA]
                                  This is the scoring matrix file used when
                                  comparing sequences. By default it is the
                                  file 'EBLOSUM62' (for proteins) or the file
                                  'EDNAFULL' (for nucleic sequences). These
                                  files are found in the 'data' directory of
                                  the EMBOSS installation.

   Advanced (Unprompted) qualifiers:
   -feature            toggle     Sequence feature information will be
                                  retained if this option is set.

   Associated qualifiers:

   "-sequences" associated qualifiers
   -sbegin1            integer    Start of each sequence to be used
   -send1              integer    End of each sequence to be used
   -sreverse1          boolean    Reverse (if DNA)
   -sask1              boolean    Ask for begin/end/reverse
   -snucleotide1       boolean    Sequence is nucleotide
   -sprotein1          boolean    Sequence is protein
   -slower1            boolean    Make lower case
   -supper1            boolean    Make upper case
   -sformat1           string     Input sequence format
   -sdbname1           string     Database name
   -sid1               string     Entryname
   -ufo1               string     UFO features
   -fformat1           string     Features format
   -fopenfile1         string     Features file name

   "-outseq" associated qualifiers
   -osformat2          string     Output seq format
   -osextension2       string     File name extension
   -osname2            string     Base file name
   -osdirectory2       string     Output directory
   -osdbname2          string     Database name to add
   -ossingle2          boolean    Separate file for each entry
   -oufo2              string     UFO features
   -offormat2          string     Features format
   -ofname2            string     Features file name
   -ofdirectory2       string     Output directory

   "-redundantoutseq" associated qualifiers
   -osformat3          string     Output seq format
   -osextension3       string     File name extension
   -osname3            string     Base file name
   -osdirectory3       string     Output directory
   -osdbname3          string     Database name to add
   -ossingle3          boolean    Separate file for each entry
   -oufo3              string     UFO features
   -offormat3          string     Features format
   -ofname3            string     Features file name
   -ofdirectory3       string     Output directory

   General qualifiers:
   -auto               boolean    Turn off prompts
   -stdout             boolean    Write first file to standard output
   -filter             boolean    Read first file from standard input, write
                                  first file to standard output
   -options            boolean    Prompt for standard and additional values
   -debug              boolean    Write debug output to program.dbg
   -verbose            boolean    Report some/full command line options
   -help               boolean    Report command line options and exit. More
                                  information on associated and general
                                  qualifiers can be found with -help -verbose
   -warning            boolean    Report warnings
   -error              boolean    Report errors
   -fatal              boolean    Report fatal errors
   -die                boolean    Report dying program messages
   -version            boolean    Report version number and exit

Qualifier Type Description Allowed values Default
Standard (Mandatory) qualifiers
[-sequences]
(Parameter 1)
seqset Sequence set filename and optional format, or reference (input USA) Readable set of sequences Required
-mode list This option specifies whether to remove redundancy at a single threshold percentage sequence similarity or remove redundancy outside a range of acceptable threshold percentage similarity. All permutations of pair-wise sequence alignments are calculated for each set of input sequences in turn using the EMBOSS implementation of the Needleman and Wunsch global alignment algorithm. Redundant sequences are removed in one of two modes as follows: (i) If a pair of proteins achieve greater than a threshold percentage sequence similarity (specified by the user) the shortest sequence is discarded. (ii) If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range (specified by the user) the shortest sequence is discarded.
1 (Single threshold percentage sequence similarity)
2 (Outside a range of acceptable threshold percentage similarities)
1
-threshold float This option specifies the percentage sequence identity redundancy threshold. The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins achieve greater than this threshold the shortest sequence is discarded. Any numeric value 95.0
-minthreshold float This option specifies the percentage sequence identity redundancy threshold (lower limit). The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range the shortest sequence is discarded. Any numeric value 30.0
-maxthreshold float This option specifies the percentage sequence identity redundancy threshold (upper limit). The percentage sequence identity redundancy threshold determines the redundancy calculation. If a pair of proteins have a percentage sequence similarity that lies outside an acceptable range the shortest sequence is discarded. Any numeric value 90.0
-gapopen float The gap open penalty is the score taken away when a gap is created. The best value depends on the choice of comparison matrix. The default value assumes you are using the EBLOSUM62 matrix for protein sequences, and the EDNAFULL matrix for nucleotide sequences. Floating point number from 1.0 to 100.0 10.0 for any sequence
-gapextend float The gap extension, penalty is added to the standard gap penalty for each base or residue in the gap. This is how long gaps are penalized. Usually you will expect a few long gaps rather than many short gaps, so the gap extension penalty should be lower than the gap penalty. An exception is where one or both sequences are single reads with possible sequencing errors in which case you would expect many single base gaps. You can get this result by setting the gap open penalty to zero (or very low) and using the gap extension penalty to control gap scoring. Floating point number from 0.0 to 10.0 0.5 for any sequence
[-outseq]
(Parameter 2)
seqoutall Sequence set(s) filename and optional format (output USA) Writeable sequence(s) <*>.format
[-redundantoutseq]
(Parameter 3)
seqoutall Redundant sequences (optional) Writeable sequence(s) <*>.format
Additional (Optional) qualifiers
-datafile matrixf This is the scoring matrix file used when comparing sequences. By default it is the file 'EBLOSUM62' (for proteins) or the file 'EDNAFULL' (for nucleic sequences). These files are found in the 'data' directory of the EMBOSS installation. Comparison matrix file in EMBOSS data path EBLOSUM62 for protein
EDNAFULL for DNA
Advanced (Unprompted) qualifiers
-feature toggle Sequence feature information will be retained if this option is set. Toggle value Yes/No No
Associated qualifiers
"-sequences" associated seqset qualifiers
-sbegin1
-sbegin_sequences
integer Start of each sequence to be used Any integer value 0
-send1
-send_sequences
integer End of each sequence to be used Any integer value 0
-sreverse1
-sreverse_sequences
boolean Reverse (if DNA) Boolean value Yes/No N
-sask1
-sask_sequences
boolean Ask for begin/end/reverse Boolean value Yes/No N
-snucleotide1
-snucleotide_sequences
boolean Sequence is nucleotide Boolean value Yes/No N
-sprotein1
-sprotein_sequences
boolean Sequence is protein Boolean value Yes/No N
-slower1
-slower_sequences
boolean Make lower case Boolean value Yes/No N
-supper1
-supper_sequences
boolean Make upper case Boolean value Yes/No N
-sformat1
-sformat_sequences
string Input sequence format Any string  
-sdbname1
-sdbname_sequences
string Database name Any string  
-sid1
-sid_sequences
string Entryname Any string  
-ufo1
-ufo_sequences
string UFO features Any string  
-fformat1
-fformat_sequences
string Features format Any string  
-fopenfile1
-fopenfile_sequences
string Features file name Any string  
"-outseq" associated seqoutall qualifiers
-osformat2
-osformat_outseq
string Output seq format Any string  
-osextension2
-osextension_outseq
string File name extension Any string  
-osname2
-osname_outseq
string Base file name Any string  
-osdirectory2
-osdirectory_outseq
string Output directory Any string  
-osdbname2
-osdbname_outseq
string Database name to add Any string  
-ossingle2
-ossingle_outseq
boolean Separate file for each entry Boolean value Yes/No N
-oufo2
-oufo_outseq
string UFO features Any string  
-offormat2
-offormat_outseq
string Features format Any string  
-ofname2
-ofname_outseq
string Features file name Any string  
-ofdirectory2
-ofdirectory_outseq
string Output directory Any string  
"-redundantoutseq" associated seqoutall qualifiers
-osformat3
-osformat_redundantoutseq
string Output seq format Any string  
-osextension3
-osextension_redundantoutseq
string File name extension Any string  
-osname3
-osname_redundantoutseq
string Base file name Any string  
-osdirectory3
-osdirectory_redundantoutseq
string Output directory Any string  
-osdbname3
-osdbname_redundantoutseq
string Database name to add Any string  
-ossingle3
-ossingle_redundantoutseq
boolean Separate file for each entry Boolean value Yes/No N
-oufo3
-oufo_redundantoutseq
string UFO features Any string  
-offormat3
-offormat_redundantoutseq
string Features format Any string  
-ofname3
-ofname_redundantoutseq
string Features file name Any string  
-ofdirectory3
-ofdirectory_redundantoutseq
string Output directory Any string  
General qualifiers
-auto boolean Turn off prompts Boolean value Yes/No N
-stdout boolean Write first file to standard output Boolean value Yes/No N
-filter boolean Read first file from standard input, write first file to standard output Boolean value Yes/No N
-options boolean Prompt for standard and additional values Boolean value Yes/No N
-debug boolean Write debug output to program.dbg Boolean value Yes/No N
-verbose boolean Report some/full command line options Boolean value Yes/No Y
-help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose Boolean value Yes/No N
-warning boolean Report warnings Boolean value Yes/No Y
-error boolean Report errors Boolean value Yes/No Y
-fatal boolean Report fatal errors Boolean value Yes/No Y
-die boolean Report dying program messages Boolean value Yes/No Y
-version boolean Report version number and exit Boolean value Yes/No N

Input file format

The input is a standard EMBOSS sequence query (also known as a 'USA').

Major sequence database sources defined as standard in EMBOSS installations include srs:embl, srs:uniprot and ensembl

Data can also be read from sequence output in any supported format written by an EMBOSS or third-party application.

The input format can be specified by using the command-line qualifier -sformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: gff (gff3), gff2, embl (em), genbank (gb, refseq), ddbj, refseqp, pir (nbrf), swissprot (swiss, sw), dasgff and debug.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Input files for usage example

File: globins.fasta

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

Output file format

The output is a standard EMBOSS sequence file.

The results can be output in one of several styles by using the command-line qualifier -osformat xxx, where 'xxx' is replaced by the name of the required format. The available format names are: embl, genbank, gff, pir, swiss, dasgff, debug, listfile, dbmotif, diffseq, excel, feattable, motif, nametable, regions, seqtable, simple, srs, table, tagseq.

See: http://emboss.sf.net/docs/themes/SequenceFormats.html for further information on sequence formats.

Output files for usage example

File: globins.keep

>HBB_HUMAN Sw:Hbb_Human => HBB_HUMAN
VHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQRFFESFGDLSTPDAVMGNPKV
KAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGK
EFTPPVQAAYQKVVAGVANALAHKYH
>HBA_HUMAN Sw:Hba_Human => HBA_HUMAN
VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGK
KVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPA
VHASLDKFLASVSTVLTSKYR
>MYG_PHYCA Sw:Myg_Phyca => MYG_PHYCA
VLSEGEWQLVLHVWAKVEADVAGHGQDILIRLFKSHPETLEKFDRFKHLKTEAEMKASED
LKKHGVTVLTALGAILKKKGHHEAELKPLAQSHATKHKIPIKYLEFISEAIIHVLHSRHP
GDFGADAQGAMNKALELFRKDIAAKYKELGYQG
>GLB5_PETMA Sw:Glb5_Petma => GLB5_PETMA
PIVDTGSVAPLSAAEKTKIRSAWAPVYSTYETSGVDILVKFFTSTPAAQEFFPKFKGLTT
ADQLKKSADVRWHAERIINAVNDAVASMDDTEKMSMKLRDLSGKHAKSFQVDPQYFKVLA
AVIADTVAAGDAGFEKLMSMICILLRSAY
>LGB2_LUPLU Sw:Lgb2_Luplu => LGB2_LUPLU
GALTESQAALVKSSWEEFNANIPKHTHRFFILVLEIAPAAKDLFSFLKGTSEVPQNNPEL
QAHAGKVFKLVYEAAIQLQVTGVVVTDATLKNLGSVHVSKGVADAHFPVVKEAILKTIKE
VVGAKWSEELNSAWTIAYDELAIVIKKEMNDAA

File: globins.redundant

>HBB_HORSE Sw:Hbb_Horse => HBB_HORSE
VQLSGEEKAAVLALWDKVNEEEVGGEALGRLLVVYPWTQRFFDSFGDLSNPGAVMGNPKV
KAHGKKVLHSFGEGVHHLDNLKGTFAALSELHCDKLHVDPENFRLLGNVLVVVLARHFGK
DFTPELQASYQKVVAGVANALAHKYH
>HBA_HORSE Sw:Hba_Horse => HBA_HORSE
VLSAADKTNVKAAWSKVGGHAGEYGAEALERMFLGFPTTKTYFPHFDLSHGSAQVKAHGK
KVGDALTLAVGHLDDLPGALSNLSDLHAHKLRVDPVNFKLLSHCLLSTLAVHLPNDFTPA
VHASLDKFLSSVSTVLTSKYR

Data files

For protein sequences EBLOSUM62 is used for the substitution matrix. For nucleotide sequence, EDNAFULL is used. Others can be specified.

EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA.

To see the available EMBOSS data files, run:

% embossdata -showall

To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run:


% embossdata -fetch -file Exxx.dat

Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata".

The directories are searched in the following order:

Notes

None.

References

None.

Warnings

None.

Diagnostic Error Messages

None.

Exit status

It always exits with status 0.

Known bugs

None.

See also

Program name Description
aligncopy Reads and writes alignments
aligncopypair Reads and writes pairs from alignments
biosed Replace or delete sequence sections
codcopy Copy and reformat a codon usage table
cutseq Removes a section from a sequence
degapseq Removes non-alphabetic (e.g. gap) characters from sequences
descseq Alter the name or description of a sequence
entret Retrieves sequence entries from flatfile databases and files
extractalign Extract regions from a sequence alignment
extractfeat Extract features from sequence(s)
extractseq Extract regions from a sequence
featcopy Reads and writes a feature table
featreport Reads and writes a feature table
feattext Return a feature table original text
listor Write a list file of the logical OR of two sets of sequences
makenucseq Create random nucleotide sequences
makeprotseq Create random protein sequences
maskambignuc Masks all ambiguity characters in nucleotide sequences with N
maskambigprot Masks all ambiguity characters in protein sequences with X
maskfeat Write a sequence with masked features
maskseq Write a sequence with masked regions
newseq Create a sequence file from a typed-in sequence
nohtml Remove mark-up (e.g. HTML tags) from an ASCII text file
noreturn Remove carriage return from ASCII files
nospace Remove whitespace from an ASCII text file
notab Replace tabs with spaces in an ASCII text file
notseq Write to file a subset of an input stream of sequences
nthseq Write to file a single sequence from an input stream of sequences
nthseqset Reads and writes (returns) one set of sequences from many
pasteseq Insert one sequence into another
revseq Reverse and complement a nucleotide sequence
seqcount Reads and counts sequences
seqret Reads and writes (returns) sequences
seqretsetall Reads and writes (returns) many sets of sequences
seqretsplit Reads sequences and writes them to individual files
sizeseq Sort sequences by size
skipseq Reads and writes (returns) sequences, skipping first few
splitsource Split sequence(s) into original source sequences
splitter Split sequence(s) into smaller sequences
trimest Remove poly-A tails from nucleotide sequences
trimseq Remove unwanted characters from start and end of sequence(s)
trimspace Remove extra whitespace from an ASCII text file
union Concatenate multiple sequences into a single sequence
vectorstrip Removes vectors from the ends of nucleotide sequence(s)
yank Add a sequence reference (a full USA) to a list file

Author(s)

Jon Ison
European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Please report all bugs to the EMBOSS bug team (emboss-bug © emboss.open-bio.org) not to the original author.

History

Target users

This program is intended to be used by everyone and everything, from naive users to embedded scripts.

Comments

None