phylcnv.py - Convert Phylip files into other formats

Updated June 20, 2020

NAME

phylcnv.py - Convert Phylip files into other formats

SYNOPSIS

phylcnv.py [-inf format] [-outf format] [-inv] infile outfile

DESCRIPTION

This script reads a file containing sequence data or discrete data (e.g. molecular markers) in Phylip format or fasta format and writes it in the indicated format. Output formats include comma-separated value (.csv) and tab-separated value (.tsv); see OPTIONS for the full list. A csv file is a comma-separated value file of the type produced by exporting a spreadsheet (e.g. OpenOffice Calc or MS Excel) to a .csv file. If a csv file is specified, the script will read the csv file and write a Phylip file, removing the .csv or .CSV extension (if any) and replacing it with .phyl. Otherwise, phylcnv.py will take input from the standard input and write to the standard output.

If no input or output file names are specified, input is from the standard input and output is the standard output.

OPTIONS

-inf - input file format. pint (default): Phylip interleaved format; pseq: Phylip sequential format; fasta: fasta format, allowing line wrapping of sequences; csv: comma-separated value; tsv: tab-separated value

-outf - output file format. csv (default): comma-separated value; tsv: tab-separated value; pint: Phylip interleaved; pseq: Phylip sequential; fasta: fasta; flatdna, flatpro, flattext: BioLegato flat file formats.

-inv - invert marker characters so that 0 -> 1 and 1 -> 0; + -> - and - -> +
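
The character swap performed by -inv can be sketched in a few lines of Python. This is an illustration only, not the code used in phylcnv.py; the name invert_markers is hypothetical, and leaving all other characters (such as '?') unchanged is an assumption.

```python
# Hypothetical helper illustrating -inv; not taken from phylcnv.py.
# Translation table: 0 -> 1, 1 -> 0, + -> -, - -> +
INVERT_TABLE = str.maketrans("01+-", "10-+")

def invert_markers(markers: str) -> str:
    """Swap 0/1 and +/-. Any other character (e.g. '?') is assumed to pass through."""
    return markers.translate(INVERT_TABLE)

if __name__ == "__main__":
    print(invert_markers("0110?+-"))
```

A translation table makes each swap happen in one pass, which avoids the mistake of replacing 0 with 1 and then turning every 1 back into 0.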

INPUT

CSV, TSV

Comma-separated value and tab-separated value, as generated by most spreadsheets. When exporting to these formats, many spreadsheets will enclose each item in double quotes. phylcnv.py strips out double quotes during input, and does not write them during output.
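
The quote handling described above might look like the following sketch. The function name split_fields is hypothetical, and the sketch assumes that no field contains the delimiter itself, which holds for single-character marker data.

```python
# Hypothetical helper; assumes fields never contain the delimiter.
def split_fields(line: str, delimiter: str = ","):
    """Drop all double quotes from a csv/tsv line, then split it into fields."""
    return line.rstrip("\r\n").replace('"', "").split(delimiter)

if __name__ == "__main__":
    print(split_fields('"LR210","1","0","1"'))
    print(split_fields('"LR210"\t"1"\t"0"', delimiter="\t"))
```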

Phylip

The input is a Phylip data file. There are two Phylip formats, interleaved and sequential. Examples are shown for molecular marker data, but DNA or protein sequence files are handled in exactly the same way. If sequences are read from a fasta file, the names are written to Phylip output without the '>'. If the name line contains information other than the name itself, e.g.

>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H

only the name will be written, and not the rest of the title, i.e.

AI352966

The name is defined as the string between the '>' and the first blank on the fasta title line. As well, Phylip sequence formats require that the names be exactly 10 characters long. Shorter names will be padded with blanks.
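
As a rough illustration of the naming rule, the sketch below takes the text between '>' and the first blank, then pads or truncates it to 10 characters. The function name phylip_name is hypothetical and is not part of phylcnv.py.

```python
# Hypothetical helper showing how a fasta title line becomes a Phylip name.
def phylip_name(title_line: str, width: int = 10) -> str:
    """Return the name from a fasta title line, exactly `width` characters long."""
    # Text between '>' and the first blank.
    name = title_line[1:].rstrip("\r\n").split(" ", 1)[0]
    # Truncate long names, pad short ones with blanks.
    return name[:width].ljust(width)

if __name__ == "__main__":
    title = ">AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H"
    print(repr(phylip_name(title)))
```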

Interleaved - The first line has two integers: the number of isolates/species/strains, and the number of markers. Each line in the first block has an isolate/species/strain name of exactly 10 characters, followed by marker data. The remaining blocks contain marker data only, with the isolates in the same order. If the name is longer than 10 characters, it is truncated. If it is shorter than 10 characters, it is padded with blanks to 10 characters.

          3   319
          G-A1      01001100110011000011100101011001001001011010000011
          G-A2      01001000110111010011101101010000000001111001001011
          G-A3      01000100110011000011100000011010101001011100111011
          11010000010111011011101001110101110001001100010011
          00010000000100111011101001100011110011011110011010
          00110000010100111011100011100011100011011101010010
          01000111111111011010101011100111111111000101100011
          00000110010111110100000010110101001100000000010111
          010001100101110000000010101001111111000???????????
          01101011000101110011110100110101111101000001010111
          01111011000100010101110000111101011100000001010011
          ??????????????????????????111101111000000001010011
          11010110001110011110000000001001101100001000000000
          11110111001111010110001010000001100010100000000111
          10010110001110010110001010001000101111011001101111
          00000000000000101011000101000101100100011000100001
          11111001011000000000000101000101100000011100011101
          01111000010000000010000101000001100100111000001101
          0000100000000010100
          0000101100001110100
          0000101100100010100
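
One way to read the interleaved layout is sketched below. This is not the parser in phylcnv.py; read_interleaved is a hypothetical name, and the sketch assumes that names begin in column 1 (the example above is indented only for display) and that blank lines between blocks can be ignored.

```python
# Hypothetical reader for Phylip interleaved data; a sketch, not phylcnv.py's parser.
def read_interleaved(lines):
    """Return (names, sequences) from the lines of an interleaved Phylip file."""
    lines = [ln.rstrip("\r\n") for ln in lines if ln.strip()]
    nseq, nchar = (int(x) for x in lines[0].split()[:2])
    names, seqs = [], []
    # First block: 10-character name, then data.
    for ln in lines[1:1 + nseq]:
        names.append(ln[:10].strip())
        seqs.append("".join(ln[10:].split()))
    # Later blocks: data only, one line per isolate, in the same order.
    for i, ln in enumerate(lines[1 + nseq:]):
        seqs[i % nseq] += "".join(ln.split())
    for name, seq in zip(names, seqs):
        if len(seq) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(seq)}")
    return names, seqs

if __name__ == "__main__":
    sample = ["    2    8", "G-A1      0100", "G-A2      0111", "", "1100", "00??"]
    print(read_interleaved(sample))
```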

Sequential - The first line has two integers: the number of isolates/species/strains, and the number of markers. Each sequence is represented by a sequence name, on one line, followed by one or more lines of sequence. Unlike fasta format, which uses a '>' character to indicate a new sequence name, Phylip sequential format detects the end of the sequence by reading the number of non-blank characters given by the second number on the first line. This makes the format inherently dangerous: if that number does not match the data, the reader loses track of where each sequence ends and the next name begins.

                   3       319
          G-A1       
          01001100110011000011100101011001001001011010000011
          11010000010111011011101001110101110001001100010011
          01000111111111011010101011100111111111000101100011
          01101011000101110011110100110101111101000001010111
          11010110001110011110000000001001101100001000000000
          00000000000000101011000101000101100100011000100001
          0000100000000010100
          G-A2       
          01001000110111010011101101010000000001111001001011
          00010000000100111011101001100011110011011110011010
          00000110010111110100000010110101001100000000010111
          01111011000100010101110000111101011100000001010011
          11110111001111010110001010000001100010100000000111
          11111001011000000000000101000101100000011100011101
          0000101100001110100
          G-A3       
          01000100110011000011100000011010101001011100111011
          00110000010100111011100011100011100011011101010010
          010001100101110000000010101001111111000???????????
          ??????????????????????????111101111000000001010011
          10010110001110010110001010001000101111011001101111
          01111000010000000010000101000001100100111000001101
          0000101100100010100
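
The counting rule for sequential format can be sketched as follows. Again this is only an illustration; read_sequential is a hypothetical name, names are assumed to start in column 1, and data that follows the name on the same line is also accepted.

```python
# Hypothetical reader for Phylip sequential data; a sketch, not phylcnv.py's parser.
def read_sequential(lines):
    """Return a list of (name, sequence) pairs from a sequential Phylip file."""
    it = iter(ln.rstrip("\r\n") for ln in lines if ln.strip())
    nseq, nchar = (int(x) for x in next(it).split()[:2])
    records = []
    for _ in range(nseq):
        header = next(it)
        name = header[:10].strip()
        data = "".join(header[10:].split())
        # Keep reading until nchar non-blank characters have been collected.
        while len(data) < nchar:
            data += "".join(next(it).split())
        if len(data) != nchar:
            raise ValueError(f"{name}: expected {nchar} characters, got {len(data)}")
        records.append((name, data))
    return records

if __name__ == "__main__":
    sample = ["  2  6", "G-A1", "0100", "11", "G-A2", "01??", "00"]
    print(read_sequential(sample))
```

Because the only end-of-sequence signal is the character count, the length check is the single place where a wrong header value can be detected.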

FASTA - Input files can have each sequence on a single line, or wrapped over several lines as shown below.

>AI352966 - MB75-5H PZ204.BNlib Brassica napus cDNA clone pMB75-5H
guaagcuaugaagggagggaugauguuuauggugaauuggaacccagagg
cuuaacaggugacucguuaagaaaacuaccaugcuauaucaugucaagug
agaugaccaagaagcaaaucauucacugcacuauuugucuucaggacauu
gcaguaggcgaaaucacacgaaguuuaccgagaugugaccauacguuuca
ccugguuuguguugauaaauggcucaucagacauggaucaugccccauuu
gcagacaggccguuaaagauuaaaaagcaccauugguguccgaggagugu
acguagcaaaaauccauuguccuuauauguuguuguaagucucugaaucc
uuguuuuagucucucuuuguuacuuuuacuuauagcaucauccauagguu
ucuacuuuugaauguauacuauuguagacaugaauaauancaccuacagu
uauguuggagaaaaaaauauagaacucagauuaaguuaugcacug
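
Handling of wrapped fasta sequences might be sketched like this: every line up to the next '>' is appended to the current sequence. The name read_fasta is hypothetical and the code is not taken from phylcnv.py.

```python
# Hypothetical fasta reader that joins wrapped sequence lines; a sketch only.
def read_fasta(lines):
    """Return a list of (name, sequence) pairs; wrapped lines are concatenated."""
    records = []
    for ln in lines:
        ln = ln.strip()
        if not ln:
            continue
        if ln.startswith(">"):
            # Name is the text between '>' and the first blank.
            records.append([ln[1:].split(" ", 1)[0], []])
        elif records:
            records[-1][1].append(ln)
    return [(name, "".join(parts)) for name, parts in records]

if __name__ == "__main__":
    sample = [">AI352966 - MB75-5H", "guaagcuaug", "aagggaggga", ">seq2", "acgu"]
    print(read_fasta(sample))
```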

OUTPUT

The output file consists of one or more lines of comma-separated marker data, in which the first field is the name of the marker, and all other fields are single characters.

Example:
LR210,1,0,1,1,0,1,0,0,0,1,0,0,1,0,0,0,1,0,0,1,0,1,1,0,1,0,1
LR211,0,1,1,1,0,1,1,0,1,1,0,0,1,0,0,0,1,0,0,1,0,1,1,1,1,0,1
LR212,0,0,1,1,0,1,0,0,1,1,0,0,1,0,0,1,1,0,0,1,0,1,1,1,1,0,1
LR213,1,0,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,0,0,0,0,1,0,1,1,0,1
LR214,0,0,1,1,1,0,0,1,1,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0
LR215,1,0,1,1,1,1,0,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,0,1,1,0,1
LR216,1,1,1,1,0,1,1,0,1,1,1,0,0,0,0,1,1,0,0,1,1,1,1,1,0,0,1
LR217,1,1,1,1,1,1,0,1,1,1,1,1,1,1,1,1,1,1,0,0,0,1,0,1,1,0,1
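
A line of this form can be produced from a name and a string of marker characters as sketched below. The function name to_delimited is hypothetical; stripping the blanks used to pad Phylip names is an assumption.

```python
# Hypothetical writer for one csv/tsv output line; a sketch only.
def to_delimited(name: str, markers: str, delimiter: str = ",") -> str:
    """Join the name and each single marker character with the delimiter."""
    # Padding blanks from 10-character Phylip names are assumed to be dropped.
    return delimiter.join([name.strip(), *markers])

if __name__ == "__main__":
    print(to_delimited("LR210     ", "1011010"))
    print(to_delimited("LR210     ", "101", delimiter="\t"))
```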

NOTES

1. This script is used by blmarker for File --> Import Discrete Data from CSV file.

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist