uniqid.py - Encode and decode sequence names for use with Phylip programs

update February 17, 2021

NAME

uniqid.py - Encode sequence names by replacing them with a random numerical code, or decode names by restoring the original names.

SYNOPSIS

uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encode sourcein sourceencoded tsvout

uniqid.py [-f list_of_fields] [-s separator] [-nf string] --encodesame sourcein sourceencoded tsvin

uniqid.py [-f list_of_fields] [-s separator] [-nf string] --decode sourceencoded sourceout tsvin

DESCRIPTION

The Phylip programs truncate sequence names longer than 10 characters. uniqid.py --encode replaces each sequence name with a unique numerical code 9 characters long, and writes a separate TSV file that maps the encoded names to the original names. The encoded file can then be used as input for Phylip programs, and uniqid.py --decode uses the TSV file to restore the original names in their output files.
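The encoding step described above can be sketched in Python as follows. This is a minimal illustration, not the code of uniqid.py itself: the function name encode_fasta, the 7-digit width (chosen so that a 2-character prefix gives a 9-character code) and the shortened sequences are all assumptions made for the example.

```python
import random


def encode_fasta(lines, prefix="!%", digits=7, rng=None):
    """Replace each FASTA definition line with prefix + random digits.

    Returns (encoded_lines, mapping), where mapping is a list of
    (code, original_definition_line) pairs in input order.
    Hypothetical helper; uniqid.py may work differently.
    """
    rng = rng or random.Random()
    used = set()
    encoded, mapping = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Redraw until the fixed-width code has not been used before.
            while True:
                code = prefix + str(rng.randrange(10 ** digits)).zfill(digits)
                if code not in used:
                    break
            used.add(code)
            mapping.append((code, line[1:]))
            encoded.append(">" + code)
        else:
            # Sequence lines pass through unchanged.
            encoded.append(line)
    return encoded, mapping


# Shortened placeholder sequences, for illustration only.
fasta = [">NP_001149637", "MA---YCAFAP", ">ACG36197", "MA---YCAFAP"]
encoded, mapping = encode_fasta(fasta)
for code, definition in mapping:
    print(code + "\t" + definition)  # one TSV row per sequence
```

Each printed row pairs a 9-character code with the definition line it replaced, which is the information a later decoding step needs.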

IMPORTANT - While this script was written for use with Phylip files, you can decode ANY text file produced by any program or series of programs, as long as the first step was to encode the sequence names in the original fasta file.

OPTIONS

--encode (default) - The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output sequence file in which each description line is replaced with a unique ID; and tsvout, a tab-separated value file containing each unique identifier and the corresponding definition line.

--encodesame - Encode another file, substituting in the same random names from a previous run using --encode. This makes it possible to encode two or more files using the same random names, so that all output files generated can be decoded with a single TSV file.

The first three filenames on the command line are read as sourcein, the original source file; sourceencoded, the output file in which each description line is replaced with the unique ID generated previously by --encode; and tsvin, the tab-separated value file from that run, containing each unique identifier and the corresponding definition line.
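As a rough sketch of what reusing names from a previous run involves, the Python below reads the TSV rows, inverts them into a lookup from definition line to code, and applies that lookup to a second file. The function name encode_same, the exact-match lookup on the whole definition line and the error raised for an unknown name are assumptions for illustration; the script's actual matching rules are not described on this page.

```python
def encode_same(lines, tsv_lines, separator="\t"):
    """Encode FASTA lines using codes recorded in an earlier TSV file.

    Hypothetical helper: each TSV row is assumed to hold a code, the
    separator, then the original definition line.
    """
    definition_to_code = {}
    for row in tsv_lines:
        code, sep, definition = row.rstrip("\n").partition(separator)
        if sep:
            definition_to_code[definition] = code

    encoded = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            definition = line[1:]
            if definition not in definition_to_code:
                # Assumption: fail loudly rather than leave a long name in place.
                raise ValueError("no code recorded for " + repr(definition))
            encoded.append(">" + definition_to_code[definition])
        else:
            encoded.append(line)
    return encoded


tsv_rows = ["!_7662224\tNP_001149637\n", "!_7203510\tACG36197\n"]
second_file = [">ACG36197", "MAYC", ">NP_001149637", "MAYC"]  # placeholder sequences
print("\n".join(encode_same(second_file, tsv_rows)))
```

Because both files end up carrying the same codes, one TSV file is enough to decode any output derived from either of them.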

--decode - The first three filenames on the command line are read as sourceencoded, any text file containing unique IDs generated by a previous run using --encode; sourceout, the output file in which each unique ID is replaced by the original name, or by the name plus parts of the definition line; and tsvin, the TSV file generated by that previous --encode run. If the options -f, -s and -nf were set during encoding, the same options must be used when decoding.
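Decoding amounts to a search and replace over arbitrary text, which is why it works on any file that still contains the codes. The Python sketch below shows one way to do this with a single regular expression. The function name decode_text is hypothetical, the TSV rows are assumed to be code, separator, name, and field selection with -f is left out.

```python
import re


def decode_text(text, tsv_lines, separator="\t"):
    """Replace every code found in text with its original name.

    Hypothetical helper; not taken from uniqid.py.
    """
    code_to_name = {}
    for row in tsv_lines:
        code, sep, name = row.rstrip("\n").partition(separator)
        if sep:
            code_to_name[code] = name
    if not code_to_name:
        return text
    # Longest codes first, so one code can never shadow a longer one.
    codes = sorted(code_to_name, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(code) for code in codes))
    return pattern.sub(lambda match: code_to_name[match.group(0)], text)


tsv_rows = [
    "!_7662224\tNP_001149637",
    "!_7203510\tACG36197",
    "!_2958197\tNP_001151815",
    "!_3585127\tACG44408",
]
tree = "(((!_3585127,!_2958197),!_7203510),!_7662224);"
print(decode_text(tree, tsv_rows))
# prints (((ACG44408,NP_001151815),ACG36197),NP_001149637);
```

The replacement is purely textual, so it does not matter whether the input is a tree file, an alignment or a distance matrix.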

-f list_of_fields - Similar to -f in the Unix cut command. A comma-separated list of fields to be written to sourceout when decoding files, e.g. -f 1,2 writes out both fields 1 and 2. Default: -f 1

-s separator - The character used as the separator when parsing a definition line into fields. Default: "\t", a TAB character. For example, -s "," creates a comma-separated file in place of the tab-separated one.
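The interaction between -f and -s can be pictured with a small Python sketch in the spirit of the Unix cut command: split the definition line at the separator, then keep the requested 1-based fields. The function names, the choice to rejoin the kept fields with the same separator, and the sample definition line are assumptions for illustration only.

```python
def parse_field_list(spec):
    """Turn a string such as "1,2" into the list [1, 2]."""
    return [int(item) for item in spec.split(",") if item.strip()]


def select_fields(definition, fields=(1,), separator="\t"):
    """Keep the requested 1-based fields of a definition line.

    Hypothetical helper: fields beyond the end of the line are skipped,
    and the kept fields are rejoined with the same separator (as cut does).
    """
    parts = definition.split(separator)
    chosen = [parts[n - 1] for n in fields if 1 <= n <= len(parts)]
    return separator.join(chosen)


definition = "seq1,field two,field three"  # made-up definition line
print(select_fields(definition, parse_field_list("1,2"), separator=","))
# prints seq1,field two
```

With the default of field 1 only, the same call would return just the name at the start of the definition line.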

-nf string - (default: !%) String is one or more characters that begin the unique identifier with which the definition line is replaced. Choose characters that are not expected to occur anywhere in the input file. Some Phylip programs appear to translate the underscore character '_' to a blank, which would prevent decoding, so don't use '_' as part of the string.
Note: Phylip programs don't allow special characters such as :[]() in sequence names. Never use these characters as part of the string.
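The advice above can be turned into a quick check before encoding. The Python sketch below is an illustration, not part of uniqid.py: the function name check_prefix is hypothetical, and the set of risky characters contains only the ones this page names, so it is likely incomplete.

```python
# Characters this page warns against. Other special characters may also
# cause trouble, so treat this set as a starting point only.
RISKY_CHARS = set("_:[]()")


def check_prefix(prefix, input_text):
    """Return a list of reasons why prefix is a poor choice (empty if none)."""
    problems = []
    if not prefix:
        problems.append("prefix is empty")
    bad = sorted(set(prefix) & RISKY_CHARS)
    if bad:
        problems.append("prefix contains risky characters: " + " ".join(bad))
    if prefix and prefix in input_text:
        # A prefix that already occurs in the input risks false matches when decoding.
        problems.append("prefix already occurs in the input file")
    return problems


sample = ">NP_001149637\nMA---YCAFAP\n"  # shortened placeholder input
print(check_prefix("!%", sample))  # prints []
print(check_prefix("!_", sample))
print(check_prefix("NP", sample))
```

An empty list means the prefix passed these particular checks; it does not guarantee that every downstream program will preserve it.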

EXAMPLE

The file test2.fsa is a FASTA file with long sequence names:

>NP_001149637
MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQNGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANTPVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMGTVKRTEVVYKDNTRISELKIHAYFAPVN----------------
>ACG36197
MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQNGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANTPVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMGTVKRTEVVYKDNTRISELKIHAYFAPVN----------------
>NP_001151815
MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGTNGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATYPDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKGTVRRTEISYTGNIRISELNIHVLYTPM-----------------
>ACG44408
MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGTNGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATYPDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKGTVRRTEISYTGNIRISELNIHVLYTPM-----------------

To encode this file,

uniqid.py --encode test2.fsa test2.uniq.fsa test2.tsv

test2.uniq.fsa:

>!_7662224
MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQNGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANTPVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMGTVKRTEVVYKDNTRISELKIHAYFAPVN----------------
>!_7203510
MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQNGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANTPVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMGTVKRTEVVYKDNTRISELKIHAYFAPVN----------------
>!_2958197
MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGTNGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATYPDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKGTVRRTEISYTGNIRISELNIHVLYTPM-----------------
>!_3585127
MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGTNGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATYPDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKGTVRRTEISYTGNIRISELNIHVLYTPM-----------------

test2.tsv:

!_7662224	NP_001149637
!_7203510	ACG36197
!_2958197	NP_001151815
!_3585127	ACG44408

To use these sequences with Phylip, run

phylcnv.py -inf fasta -outf pint test2.uniq.fsa test2.uniq.pint

test2.uniq.pint is a Phylip interleaved sequence file that will work with any Phylip sequence program:

4 194
!_7662224 MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQ
!_7203510 MA-------------YCAFAP---SAPPQYSELKMTLYTNKEVYRSGPDQ
!_2958197 MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGT
!_3585127 MA------------------PPSYSIAPVQSELNMTLY-NKEVY-GGRGT

NGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANT
NGV-TITE--RSNMGTTWVFSWPVADGPT--PDANIVGQLQGTSVQVANT
NGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATY
NGVTTLVN--RGPIGTTWVFSWPVTDGPAGGADANVVGHLQGTGVQVATY

PVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMG
PVVVYHYSLGLVF-EDKRFNGSTLQIQGTSQINGEWSIVGGTGQLTMAMG
PDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKG
PDYMWHYSLGLVFGQGSRFNGSTLQIQGTSKINGEWSIVGGTGELAMAKG

TVKRTEVVYKDNTRISELKIHAYFAPVN----------------
TVKRTEVVYKDNTRISELKIHAYFAPVN----------------
TVRRTEISYTGNIRISELNIHVLYTPM-----------------
TVRRTEISYTGNIRISELNIHVLYTPM-----------------

If you ran protpars, the resultant parsimony tree would be written to a file called outtree:

(((!_3585127,!_2958197),!_7203510),!_7662224);

To restore the original sequence names,

uniqid.py --decode outtree test2.outtree test2.tsv

test2.outtree:

(((ACG44408,NP_001151815),ACG36197),NP_001149637);

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist