Updated June 20, 2020
NAME
phylcnv.py - Convert Phylip files into other formats
SYNOPSIS
phylcnv.py [-inf format] [-outf format] [-inv] infile outfile
DESCRIPTION
This script reads a file containing sequence data or discrete data
(e.g. molecular markers) in Phylip format or fasta format and writes
it in the indicated format. The spreadsheet-style output formats are
comma-separated value (.csv, the default) and tab-separated value
(.tsv); the other formats accepted by -outf are listed under OPTIONS.
A csv file is a comma-separated value file of the type produced by
exporting a spreadsheet (e.g. OpenOffice Calc or MS-Excel) to a .csv
file.
If a csv file is specified, the script will read the csv file and
write a Phylip file, removing the .csv or .CSV extension (if any) and
replacing it with .phyl. If no input or output file names are
specified, input is from the standard input and output is the
standard output.
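The file naming rule above can be sketched as follows. This is only an illustration, not the code used by phylcnv.py; the function name phylip_output_path is hypothetical.

```python
import os

def phylip_output_path(csv_path):
    """Hypothetical helper: derive the Phylip file name from a csv file name.

    A trailing .csv or .CSV extension is removed before .phyl is added;
    any other name simply gets .phyl appended.
    """
    root, ext = os.path.splitext(csv_path)
    if ext in (".csv", ".CSV"):
        return root + ".phyl"
    return csv_path + ".phyl"

print(phylip_output_path("markers.csv"))  # markers.phyl
print(phylip_output_path("markers"))      # markers.phyl
```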
OPTIONS
-inf - input file format. pint (default): Phylip interleaved format; pseq: Phylip sequential format; fasta: fasta format, allowing line wrapping of sequences; csv: comma-separated value; tsv: tab-separated value
-outf - output file format. csv (default): comma-separated value; tsv: tab-separated value; pint: Phylip interleaved; pseq: Phylip sequential; fasta: Fasta; flatdna, flatpro, flattext: BioLegato flat file formats.
-inv - invert marker characters so that 0 -> 1 and 1 -> 0; + -> - and - -> +
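The effect of -inv can be illustrated with a short Python sketch. The helper name invert_markers is hypothetical, and the sketch assumes that characters other than 0, 1, + and - (such as '?') are left unchanged, which the text above does not state explicitly.

```python
# Hypothetical illustration of the -inv mapping; not taken from phylcnv.py.
INVERT_TABLE = str.maketrans("01+-", "10-+")

def invert_markers(markers):
    """Swap 0 with 1 and + with -, leaving all other characters as they are."""
    return markers.translate(INVERT_TABLE)

print(invert_markers("0101??+-"))  # 1010??-+
```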
INPUT
CSV, TSV
Comma-separated value and tab-separated value, as generated by most
spreadsheets. When exporting to these formats, many spreadsheets will
enclose each item in double quotes. phylcnv.py strips out double
quotes during input, and does not write them during output.
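A minimal sketch of the quote handling described above. The function name split_fields is hypothetical, and the sketch assumes that no field contains the delimiter itself, since simply deleting the quotes would not protect such a field.

```python
def split_fields(line, delimiter=","):
    """Hypothetical helper: drop every double quote, then split on the delimiter.

    Assumes no field contains the delimiter inside its quotes.
    """
    return line.rstrip("\r\n").replace('"', "").split(delimiter)

print(split_fields('"LR210","1","0","1"\n'))    # ['LR210', '1', '0', '1']
print(split_fields('LR211\t0\t1\n', "\t"))      # ['LR211', '0', '1']
```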
Phylip
The input is a Phylip data file. There are two Phylip formats,
interleaved and sequential. Examples are shown for molecular marker
data, but DNA or protein sequence files would be handled exactly the
same way. If sequences were read from a fasta file, the names will go
to Phylip output without the '>'. As well, if the name line contains
information other than the name itself, e.g.
>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H
ONLY the name will be written, and not the title, i.e.
AI352966
The name is defined as the string between the '>' and the first blank
on the fasta title line. As well, Phylip sequence formats require
that the names be exactly 10 characters long. Shorter names will be
padded with blanks.
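The naming rule can be sketched in Python as below. The function name phylip_name is hypothetical; truncation of names longer than 10 characters follows the rule given for the interleaved format.

```python
def phylip_name(title_line, width=10):
    """Hypothetical helper: turn a fasta title line into a Phylip name.

    The name is the text between '>' and the first blank, truncated or
    padded with blanks to exactly `width` characters.
    """
    name = title_line.strip().lstrip(">").split(" ", 1)[0]
    return name[:width].ljust(width)

title = ">AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H"
print(repr(phylip_name(title)))  # 'AI352966  '
```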
Interleaved - The first line has integers telling the number of isolates/species/strains, and the number of markers. Each subsequent line has an isolate/species/strain name of exactly 10 characters, followed by marker data. If the name is greater than 10 characters, it is truncated. If it is less than 10 characters, it is padded with blanks to 10 characters.
3 319
G-A1      01001100110011000011100101011001001001011010000011
G-A2      01001000110111010011101101010000000001111001001011
G-A3      01000100110011000011100000011010101001011100111011
11010000010111011011101001110101110001001100010011
00010000000100111011101001100011110011011110011010
00110000010100111011100011100011100011011101010010
01000111111111011010101011100111111111000101100011
00000110010111110100000010110101001100000000010111
010001100101110000000010101001111111000???????????
01101011000101110011110100110101111101000001010111
01111011000100010101110000111101011100000001010011
??????????????????????????111101111000000001010011
11010110001110011110000000001001101100001000000000
11110111001111010110001010000001100010100000000111
10010110001110010110001010001000101111011001101111
00000000000000101011000101000101100100011000100001
11111001011000000000000101000101100000011100011101
01111000010000000010000101000001100100111000001101
0000100000000010100
0000101100001110100
0000101100100010100
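One way to read the interleaved layout is sketched below. It is not the code used by phylcnv.py; the function name read_interleaved is hypothetical, and the sketch assumes that names occupy exactly the first 10 columns of the first block and that blank lines can be ignored.

```python
def read_interleaved(lines):
    """Hypothetical reader: return (names, sequences) from interleaved lines."""
    rows = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    n_taxa, n_chars = (int(x) for x in rows[0].split()[:2])
    names, seqs = [], []
    for i, row in enumerate(rows[1:]):
        if i < n_taxa:
            # First block: 10-character name followed by marker data.
            names.append(row[:10].strip())
            seqs.append(row[10:].replace(" ", ""))
        else:
            # Later blocks: data only, in the same order as the first block.
            seqs[i % n_taxa] += row.replace(" ", "")
    for name, seq in zip(names, seqs):
        if len(seq) != n_chars:
            raise ValueError("%s: expected %d characters, found %d"
                             % (name, n_chars, len(seq)))
    return names, seqs

sample = [
    "2 8\n",
    "G-A1      0100\n",
    "G-A2      0110\n",
    "1?01\n",
    "0011\n",
]
print(read_interleaved(sample))
# (['G-A1', 'G-A2'], ['01001?01', '01100011'])
```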
Sequential - The first line has integers telling the number of isolates/species/strains, and the number of markers. Each sequence is represented by a sequence name, on one line, followed by one or more lines of sequence. Unlike fasta format, which uses a '>' character to indicate a new sequence name, Phylip sequential format detects the end of a sequence by counting non-blank characters until the count reaches the second number on the first line. This is an inherently dangerous file format for that reason: a single missing or extra character makes the reader run into the next record.
3 319
G-A1
01001100110011000011100101011001001001011010000011
11010000010111011011101001110101110001001100010011
01000111111111011010101011100111111111000101100011
01101011000101110011110100110101111101000001010111
11010110001110011110000000001001101100001000000000
00000000000000101011000101000101100100011000100001
0000100000000010100
G-A2
01001000110111010011101101010000000001111001001011
00010000000100111011101001100011110011011110011010
00000110010111110100000010110101001100000000010111
01111011000100010101110000111101011100000001010011
11110111001111010110001010000001100010100000000111
11111001011000000000000101000101100000011100011101
0000101100001110100
G-A3
01000100110011000011100000011010101001011100111011
00110000010100111011100011100011100011011101010010
010001100101110000000010101001111111000???????????
??????????????????????????111101111000000001010011
10010110001110010110001010001000101111011001101111
01111000010000000010000101000001100100111000001101
0000101100100010100
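The counting rule for the sequential format can be sketched as follows. The function name read_sequential is hypothetical, and the sketch assumes each name sits on its own line, as in the example above.

```python
def read_sequential(lines):
    """Hypothetical reader: return a list of (name, sequence) tuples.

    A sequence ends when n_chars non-blank characters have been read,
    where n_chars is the second integer on the first line.
    """
    rows = [ln.strip() for ln in lines if ln.strip()]
    n_taxa, n_chars = (int(x) for x in rows[0].split()[:2])
    records, pos = [], 1
    for _ in range(n_taxa):
        if pos >= len(rows):
            raise ValueError("file ended before all names were read")
        name = rows[pos][:10].strip()
        pos += 1
        seq = ""
        while len(seq) < n_chars:
            if pos >= len(rows):
                raise ValueError("%s: sequence is too short" % name)
            seq += rows[pos].replace(" ", "")
            pos += 1
        if len(seq) != n_chars:
            raise ValueError("%s: sequence overruns its record" % name)
        records.append((name, seq))
    return records

sample = ["2 6\n", "G-A1\n", "010\n", "011\n", "G-A2\n", "1?0\n", "001\n"]
print(read_sequential(sample))
# [('G-A1', '010011'), ('G-A2', '1?0001')]
```

Because the only record boundary is the character count, the two explicit checks are the sketch's way of catching a miscounted file early.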
FASTA - Input files can have sequence on a single line, or wrapped as shown below.
>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H
guaagcuaugaagggagggaugauguuuauggugaauuggaacccagagg
cuuaacaggugacucguuaagaaaacuaccaugcuauaucaugucaagug
agaugaccaagaagcaaaucauucacugcacuauuugucuucaggacauu
gcaguaggcgaaaucacacgaaguuuaccgagaugugaccauacguuuca
ccugguuuguguugauaaauggcucaucagacauggaucaugccccauuu
gcagacaggccguuaaagauuaaaaagcaccauugguguccgaggagugu
acguagcaaaaauccauuguccuuauauguuguuguaagucucugaaucc
uuguuuuagucucucuuuguuacuuuuacuuauagcaucauccauagguu
ucuacuuuugaauguauacuauuguagacaugaauaauancaccuacagu
uauguuggagaaaaaaauauagaacucagauuaaguuaugcacug
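Reading wrapped fasta input can be sketched as below. The function name read_fasta is hypothetical; the sketch keeps only the name (the text between '>' and the first blank) and joins all sequence lines that follow it.

```python
def read_fasta(lines):
    """Hypothetical reader: return a list of (name, sequence) tuples."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: keep only the name, discard the rest of the title.
            records.append([line[1:].split(" ", 1)[0], ""])
        elif records:
            # Wrapped sequence line: append to the current record.
            records[-1][1] += line
    return [(name, seq) for name, seq in records]

sample = [">AI352966 - MB75-5H\n", "guaagc\n", "uaugaa\n", ">seq2\n", "acgu\n"]
print(read_fasta(sample))
# [('AI352966', 'guaagcuaugaa'), ('seq2', 'acgu')]
```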
OUTPUT
The output file consists of one or more lines of comma-separated
marker data, in which the first field is the name of the marker, and
all other fields are single characters.
Example:
LR210,1,0,1,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1
LR211,0,1,1,1,0,1,1,0,1,1,0,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,1
LR212,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,1,1,0,0,1,0,1,1,1,1,0,1
LR213,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,1,0,1
LR214,0,0,1,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
LR215,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1
LR216,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,1
LR217,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,0,1
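Lines of this form can be produced with a sketch like the one below. The function name to_delimited_lines is hypothetical; passing a tab as the delimiter gives the tsv equivalent.

```python
def to_delimited_lines(records, delimiter=","):
    """Hypothetical writer: one line per record, name first, then one
    single-character field per marker. No quotes are written."""
    return [delimiter.join([name, *seq]) for name, seq in records]

for line in to_delimited_lines([("LR210", "1011"), ("LR211", "0111")]):
    print(line)
# LR210,1,0,1,1
# LR211,0,1,1,1
```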
NOTES
1. This script is used by blmarker for File --> Import Discrete Data from CSV file.
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist