update May 28, 2011
NAME
phylcnv.py - Convert Phylip files into other formats
SYNOPSIS
phylcnv.py [-ini|-ins] [-ocsv|-otsv] [-inv] [infile] [outfile]
DESCRIPTION
This script reads a file containing sequence data or discrete data
(e.g. molecular markers) in Phylip format and writes it in the
indicated format. The output formats currently supported are
comma-separated value (.csv) and tab-separated value (.tsv). A csv file
is a comma-separated value file of the type produced by exporting a
spreadsheet (e.g. OpenOffice Calc or MS Excel) to a .csv file. If a csv
file is specified, the script will read the csv file and write a Phylip
file, removing the .csv or .CSV extension (if any) and replacing it
with .phyl.
If no input or output file names are specified, input is read from the
standard input and output is written to the standard output.
OPTIONS
-ini - input is interleaved (default)
-ins - input is sequential
-ocsv - output is csv format, i.e. the field separator is a comma
-otsv - output is tsv format, i.e. the field separator is a TAB character
(default)
-inv - invert characters so that 0 -> 1 and 1 -> 0
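The effect of -inv can be illustrated with a minimal Python sketch. The helper name invert_markers is hypothetical and is not taken from phylcnv.py; the sketch also assumes that characters other than 0 and 1 (such as '?' for missing data) are left unchanged, which the option description does not state.

```python
# Illustrative sketch only: invert_markers is a hypothetical helper,
# not a function taken from phylcnv.py.
_SWAP = str.maketrans("01", "10")


def invert_markers(markers: str) -> str:
    """Swap 0 and 1; leave every other character (e.g. '?') unchanged.

    Leaving other characters alone is an assumption of this sketch.
    """
    return markers.translate(_SWAP)


print(invert_markers("0110?1"))  # prints 1001?0
```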
INPUT
The input is a Phylip data file. There are two Phylip formats,
interleaved and sequential.
Interleaved - The first line has two integers giving the number of
isolates/species/strains and the number of markers. Each subsequent
line has an isolate/species/strain name of exactly 10 characters,
followed by marker data. If the name is longer than 10 characters, it
is truncated. If it is shorter than 10 characters, it is padded with
blanks to 10 characters.
3 319
G-A1      01001100110011000011100101011001001001011010000011
G-A2      01001000110111010011101101010000000001111001001011
G-A3      01000100110011000011100000011010101001011100111011
11010000010111011011101001110101110001001100010011
00010000000100111011101001100011110011011110011010
00110000010100111011100011100011100011011101010010
01000111111111011010101011100111111111000101100011
00000110010111110100000010110101001100000000010111
010001100101110000000010101001111111000???????????
01101011000101110011110100110101111101000001010111
01111011000100010101110000111101011100000001010011
??????????????????????????111101111000000001010011
11010110001110011110000000001001101100001000000000
11110111001111010110001010000001100010100000000111
10010110001110010110001010001000101111011001101111
00000000000000101011000101000101100100011000100001
11111001011000000000000101000101100000011100011101
01111000010000000010000101000001100100111000001101
0000100000000010100
0000101100001110100
0000101100100010100
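The interleaved layout above can be read with a short Python sketch. It is not the code used by phylcnv.py: the function name read_interleaved is hypothetical, and the sketch assumes that names occupy the first 10 columns of the first block, that later blocks carry marker data only, and that blank lines may be ignored.

```python
# Illustrative sketch only: read_interleaved is a hypothetical name and
# this is not the parser used by phylcnv.py.
def read_interleaved(lines):
    """Return a list of (name, markers) pairs from interleaved Phylip lines."""
    rows = [ln.rstrip("\n") for ln in lines if ln.strip()]
    n_taxa, n_markers = (int(x) for x in rows[0].split()[:2])
    names, seqs = [], []
    for i, row in enumerate(rows[1:]):
        if i < n_taxa:
            # First block: 10-character name field, then marker data.
            names.append(row[:10].strip())
            seqs.append("".join(row[10:].split()))
        else:
            # Later blocks: marker data only, in the same row order.
            seqs[i % n_taxa] += "".join(row.split())
    if len(names) != n_taxa:
        raise ValueError(f"expected {n_taxa} names, found {len(names)}")
    for name, seq in zip(names, seqs):
        if len(seq) != n_markers:
            raise ValueError(
                f"{name}: expected {n_markers} markers, found {len(seq)}"
            )
    return list(zip(names, seqs))


example = [
    "2 6",
    "G-A1      010",
    "G-A2      110",
    "0?1",
    "001",
]
print(read_interleaved(example))
# prints [('G-A1', '0100?1'), ('G-A2', '110001')]
```

Checking each row against the marker count from the first line catches a block that is missing or short.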
Sequential - The first line has two integers giving the number of
isolates/species/strains and the number of markers. Each sequence is
represented by a sequence name, on one line, followed by one or more
lines of sequence. Unlike FASTA format, which uses a '>' character
to indicate a new sequence name, Phylip sequential format detects the
end of a sequence by counting non-blank characters until it reaches
the second number on the first line. This is an inherently dangerous
file format for that reason: nothing in the file marks where one
sequence ends and the next name begins.
3 319
G-A1
01001100110011000011100101011001001001011010000011
11010000010111011011101001110101110001001100010011
01000111111111011010101011100111111111000101100011
01101011000101110011110100110101111101000001010111
11010110001110011110000000001001101100001000000000
00000000000000101011000101000101100100011000100001
0000100000000010100
G-A2
01001000110111010011101101010000000001111001001011
00010000000100111011101001100011110011011110011010
00000110010111110100000010110101001100000000010111
01111011000100010101110000111101011100000001010011
11110111001111010110001010000001100010100000000111
11111001011000000000000101000101100000011100011101
0000101100001110100
G-A3
01000100110011000011100000011010101001011100111011
00110000010100111011100011100011100011011101010010
010001100101110000000010101001111111000???????????
??????????????????????????111101111000000001010011
10010110001110010110001010001000101111011001101111
01111000010000000010000101000001100100111000001101
0000101100100010100
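The counting rule described for sequential format can be sketched in Python as follows. This is an illustration, not the code in phylcnv.py: the name read_sequential is hypothetical, and the sketch assumes the layout shown above, with each name on a line of its own and blank lines ignored.

```python
# Illustrative sketch only: read_sequential is a hypothetical name and
# this is not the parser used by phylcnv.py.
def read_sequential(lines):
    """Return a list of (name, markers) pairs from sequential Phylip lines."""
    rows = [ln.strip() for ln in lines if ln.strip()]
    n_taxa, n_markers = (int(x) for x in rows[0].split()[:2])
    records = []
    pos = 1
    for _ in range(n_taxa):
        if pos >= len(rows):
            raise ValueError(f"expected {n_taxa} sequences, found {len(records)}")
        name = rows[pos]
        pos += 1
        seq = ""
        # Keep reading lines until the marker count from line 1 is reached.
        while len(seq) < n_markers:
            if pos >= len(rows):
                raise ValueError(
                    f"{name}: file ended after {len(seq)} of {n_markers} markers"
                )
            seq += "".join(rows[pos].split())
            pos += 1
        if len(seq) != n_markers:
            raise ValueError(f"{name}: read {len(seq)} markers, expected {n_markers}")
        records.append((name, seq))
    return records


example = ["2 5", "G-A1", "010", "01", "G-A2", "1?0", "11"]
print(read_sequential(example))
# prints [('G-A1', '01001'), ('G-A2', '1?011')]
```

Because the only end-of-sequence signal is the count, the sketch raises an error when a sequence overshoots or the file ends early.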
OUTPUT
The output file consists of one or more lines of comma-separated (or,
with -otsv, tab-separated) marker data, in which the first field is
the name of the marker and every other field is a single character.
Example:
LR210,1,0,1,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1
LR211,0,1,1,1,0,1,1,0,1,1,0,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,1
LR212,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,1,1,0,0,1,0,1,1,1,1,0,1
LR213,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,1,0,1
LR214,0,0,1,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
LR215,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1
LR216,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,1
LR217,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,0,1
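Lines of this shape can be produced with Python's standard csv module. The sketch below is an assumption about one way to do it, not the code in phylcnv.py; write_delimited and its parameters are hypothetical names. The default delimiter is a TAB, matching the default listed under OPTIONS.

```python
import csv
import io


# Illustrative sketch only: write_delimited is a hypothetical helper,
# not a function taken from phylcnv.py.
def write_delimited(records, out, delimiter="\t"):
    """Write one line per (name, markers) pair: the name, then one field
    per marker character."""
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for name, markers in records:
        writer.writerow([name, *markers])


buf = io.StringIO()
write_delimited([("LR210", "101"), ("LR211", "0?1")], buf, delimiter=",")
print(buf.getvalue(), end="")
# prints:
# LR210,1,0,1
# LR211,0,?,1
```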
NOTES
1. This script is used by blmarker for
File --> Import Discrete Data from CSV file.
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist