update September 24, 2011
phylcnv.py - Convert Phylip files into other formats
phylcnv.py [-inf format] [-outf format] [-inv] infile outfile
This script reads
a file containing sequence data or discrete data (e.g. molecular
markers) in Phylip format and writes it in the indicated
format. The output formats currently supported are comma-separated
value (.csv) or tab-separated value (.tsv). The csv file is a
comma-separated value file of the type produced by exporting a
spreadsheet (e.g. OpenOffice Calc or MS Excel) to a .csv file.
If a csv file is specified, the script will read the csv file and
write a Phylip file, removing the .csv or .CSV extension (if any) and
replacing it with .phyl. Otherwise, phylcnv.py will take input from
the standard input and write to the standard output.
If no input or output file names are specified, input is from the standard input and output is the standard output.
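The extension handling described above (dropping .csv or .CSV and appending .phyl) could be sketched as follows. This is only an illustration: the function name phylip_name is hypothetical and is not taken from phylcnv.py, and appending .phyl to a name that has no csv extension is an assumption based on the words "if any".

```python
import os


def phylip_name(csv_path: str) -> str:
    """Derive a Phylip file name from a csv file name (hypothetical helper)."""
    root, ext = os.path.splitext(csv_path)
    # Only the two spellings named in the documentation are treated as csv.
    if ext in (".csv", ".CSV"):
        return root + ".phyl"
    # Assumption: with no csv extension, .phyl is simply appended.
    return csv_path + ".phyl"


print(phylip_name("markers.csv"))
print(phylip_name("markers"))
```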
-inf - input file format. pint (default): Phylip interleaved format; pseq: Phylip sequential format; csv: comma-separated value; tsv: tab-separated value
-outf - output file format. csv (default): comma-separated value; tsv:
tab-separated value; pint: Phylip interleaved; pseq: Phylip
sequential; fasta: Fasta; flatdna, flatpro, flattext:
BioLegato flat file formats.
-inv - invert marker characters so that 0 -> 1 and 1 -> 0; + -> - and - -> +
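The inversion performed by -inv amounts to a character-for-character swap. A minimal sketch, assuming that any character other than 0, 1, + and - is passed through unchanged (the documentation does not say); the names INVERT_TABLE and invert_markers are hypothetical and not taken from phylcnv.py:

```python
# Translation table for the two swaps described above: 0 <-> 1 and + <-> -.
INVERT_TABLE = str.maketrans("01+-", "10-+")


def invert_markers(markers: str) -> str:
    """Invert a string of marker characters (hypothetical helper)."""
    # Assumption: characters outside the table, such as '?', are left as-is.
    return markers.translate(INVERT_TABLE)


print(invert_markers("0110"))
print(invert_markers("+-+-"))
```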
csv and tsv are the comma-separated value and tab-separated value formats, as generated by most spreadsheets. When exporting to these formats, many spreadsheets will enclose each item in double quotes. phylcnv.py strips out double quotes during input, and does not write them during output.
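One simple way to strip the quotes on input is to delete every double quote before splitting on the delimiter. The sketch below takes that approach as an assumption; it is not necessarily how phylcnv.py does it, and the name split_fields is hypothetical. Note that this naive approach would mis-split a quoted field that itself contains the delimiter.

```python
def split_fields(line: str, delimiter: str = ",") -> list:
    """Split one csv/tsv line into fields, discarding double quotes."""
    # Remove the trailing newline and all double quotes, then split.
    cleaned = line.rstrip("\n").replace('"', "")
    return [field.strip() for field in cleaned.split(delimiter)]


print(split_fields('"M1","0","1"\n'))
print(split_fields('"M1"\t"+"\t"-"', delimiter="\t"))
```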
The input is a
Phylip data file. There are two Phylip formats, interleaved and
sequential. Examples are shown for molecular marker data, but DNA or
protein sequence files would be handled in exactly the same way.
Interleaved - The first line has two integers giving the number of isolates/species/strains and the number of markers. Each subsequent line has an isolate/species/strain name of exactly 10 characters, followed by marker data. If the name is longer than 10 characters, it is truncated. If it is shorter than 10 characters, it is padded with blanks to 10 characters.
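The fixed-width name field is the key detail of this layout. The following sketch covers only what is described above (a header line, then one line per isolate); it does not handle the additional blocks that longer interleaved files can contain. The names format_name and parse_interleaved are hypothetical, and treating blanks inside the marker data as ignorable is an assumption.

```python
def format_name(name: str) -> str:
    """Truncate or blank-pad a name to exactly 10 characters."""
    return name[:10].ljust(10)


def parse_interleaved(text: str) -> dict:
    """Read the header and one data line per isolate (single block only)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_markers = (int(tok) for tok in lines[0].split()[:2])
    records = {}
    for line in lines[1:1 + n_taxa]:
        name = line[:10].strip()           # columns 1-10 hold the name
        data = "".join(line[10:].split())  # assumption: blanks in data are ignored
        if len(data) != n_markers:
            raise ValueError(f"{name}: expected {n_markers} markers, found {len(data)}")
        records[name] = data
    return records


sample = (
    "2 4\n"
    + format_name("isolateA") + "0110\n"
    + format_name("verylongname") + "1 0 0 1\n"
)
print(parse_interleaved(sample))
```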
Sequential - The first line has two integers giving the number of isolates/species/strains and the number of markers. Each sequence is represented by a sequence name on one line, followed by one or more lines of sequence. Unlike fasta format, which uses a '>' character to indicate a new sequence name, Phylip sequential format detects the end of the sequence by reading the number of non-blank characters given by the second number on the first line. This makes the format inherently dangerous: if that count is wrong, or a character is missing or added, the reader loses track of where one sequence ends and the next begins.
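Reading by character count can be sketched as below. This follows the layout as described here (name on its own line, then sequence lines) and is an illustration only; the name parse_sequential is hypothetical, and raising an error when the count does not match is a design choice of the sketch, not documented behaviour of phylcnv.py.

```python
def parse_sequential(text: str) -> dict:
    """Read name/sequence records, ending each after n_markers characters."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n_taxa, n_markers = (int(tok) for tok in lines[0].split()[:2])
    records = {}
    pos = 1
    for _ in range(n_taxa):
        if pos >= len(lines):
            raise ValueError("file ended before all names were read")
        name = lines[pos].strip()
        pos += 1
        chars = []
        # Keep consuming lines until the expected number of non-blank
        # characters has been read; nothing else marks the end of a record.
        while len(chars) < n_markers:
            if pos >= len(lines):
                raise ValueError(f"{name}: expected {n_markers} characters, found {len(chars)}")
            chars.extend(ch for ch in lines[pos] if not ch.isspace())
            pos += 1
        if len(chars) > n_markers:
            raise ValueError(f"{name}: read {len(chars)} characters, expected {n_markers}")
        records[name] = "".join(chars)
    return records


sample = "2 6\nisolateA\n011\n010\nisolateB\n1 0 0 1 1 0\n"
print(parse_sequential(sample))
```

If a single character is dropped from the first record, the loop above consumes the next name line as sequence data, which illustrates why the count-based layout is risky.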
The output file
consists of one or more lines of comma-separated marker data,
in which the first field is the name of the marker, and all other
fields are single
1. This script is used by blmarker for File --> Import Discrete Data from CSV file.
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2