Updated July 19, 2008
NAME
BLExtractSubset.py - Extract a subset of sequences from a GDE flatfile
SYNOPSIS
python BLExtractSubset.py namefile infile outfile

DESCRIPTION

This script reads infile, which contains sequences in GDE flatfile format, and writes the subset of those sequences listed in namefile to outfile.


NAMEFILE
The namefile consists of a list of names for sequences in a GDE flatfile, with one name per line.

Example:
A30238
A31075
A34313
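
Reading a namefile of this shape can be sketched as follows. This is a minimal illustration, not the code used by BLExtractSubset.py: the function name read_names is hypothetical, and stripping whitespace and skipping blank lines are assumptions that the format description does not spell out.

```python
def read_names(lines):
    """Return the sequence names found in an iterable of text lines.

    Hypothetical helper: surrounding whitespace is stripped and blank
    lines are skipped, which is an assumption about the namefile format.
    """
    names = []
    for line in lines:
        name = line.strip()
        if name:
            names.append(name)
    return names


# Any iterable of lines works, including an open file object.
example = ["A30238\n", "A31075\n", "A34313\n"]
print(read_names(example))
```

With a real file, the same function could be called as read_names(open("namefile")).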

INPUT
The input file is a GDE flatfile. Each entry consists of a name line followed by one or more lines containing the sequence or text itself. The type of each entry is indicated by the flag character that begins its name line, as used by GDE: # for DNA or RNA, % for protein, or " for text.

Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A30839
NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM
VKGYPPA
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
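
One way to split a flatfile like the example above into entries is sketched below. The names FLAGS and parse_flatfile are hypothetical and are not taken from BLExtractSubset.py. The sketch assumes that every line starting with one of the three flag characters begins a new entry, and that lines appearing before the first name line can be ignored.

```python
# Flag characters described above: DNA/RNA, protein, text.
FLAGS = ("#", "%", '"')


def parse_flatfile(lines):
    """Return a list of (flag, name, body_lines) tuples.

    Assumption: a line whose first character is a flag character always
    starts a new entry; every other line belongs to the current entry.
    """
    records = []
    for raw in lines:
        line = raw.rstrip("\n")
        if line[:1] in FLAGS:
            # Name line: the first character is the flag, the rest is the name.
            records.append((line[0], line[1:].strip(), []))
        elif records and line:
            # Sequence or text line: attach it to the most recent entry.
            records[-1][2].append(line)
    return records


sample = ["%A30839\n", "NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM\n", "VKGYPPA\n"]
for flag, name, body in parse_flatfile(sample):
    print(flag, name, len("".join(body)))
```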


OUTPUT
The output is a GDE flatfile containing only the sequences specified in namefile. If two or more sequences in infile have the same name, only the first is written, and an error message is written to the standard output.

Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
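
The behaviour described in this section, including the handling of duplicate names, can be sketched as below. This is an illustration under stated assumptions rather than the actual implementation: the function name extract_subset and the wording of the message are hypothetical, and the sketch assumes that entries are written in the order they appear in infile, which the description does not specify.

```python
import io
import sys

FLAGS = ("#", "%", '"')  # DNA/RNA, protein, text


def extract_subset(names, infile, outfile, log=sys.stdout):
    """Copy the entries whose names are in `names` from infile to outfile.

    Only the first entry with a given name is written; later entries with
    the same name are skipped and reported on `log` (standard output by
    default). Assumption: output follows the order of infile.
    """
    wanted = set(names)
    written = set()
    keep = False
    for line in infile:
        if line[:1] in FLAGS:
            name = line[1:].strip()
            if name in wanted and name not in written:
                written.add(name)
                keep = True
            else:
                if name in written:
                    # Hypothetical message text; the real wording is not documented.
                    print("Duplicate name, entry skipped: " + name, file=log)
                keep = False
        if keep:
            outfile.write(line)


# Small in-memory demonstration using StringIO in place of real files.
source = io.StringIO("%A1\nMKS\n%A2\nNQA\n%A1\nMLT\n")
result = io.StringIO()
extract_subset(["A1"], source, result)
print(result.getvalue(), end="")
```

With real files, the three arguments would correspond to the names read from namefile and to open file objects for infile and outfile.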

NOTES
1. This script is used by GDE for Edit --> Extract subset.
 
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist