Updated October 2, 2020
NAME
BLExtractSubset.py - Extract a subset of sequences from a GDE flatfile
SYNOPSIS
python BLExtractSubset.py namefile infile outfile

DESCRIPTION

This script reads infile, which contains sequences in GDE flatfile format, and writes the subset of those sequences listed in namefile to outfile.


NAMEFILE
The namefile consists of a list of names for sequences in a GDE flatfile, with one name per line.

Example:
A30238
A31075
A34313

The namefile can also be a tab-separated values (.tsv) file in which the first column contains names or accession numbers. For example, BioLegato writes output from SSEARCH as a multicolumn file:

# ssearch3 -T 16 -Q -p -b %NUMOFSCORES% -d 20 -m 0 -z 11 -E 0.0001 -m F8C bio6395951563558234905.tmp.tsv -m F6 bio6395951563558234905.tmp.html bio6395951563558234905.tmp /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa 3                              
# SSEARCH 36.3.8h Aug, 2019                              
# Query: DQ288897:CDS1 - 74 aa                              
# Database: /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa                              
# 50 hits found                              
# subject id     % identity     alignment length     mismatches     gap opens     q. start     q. end
DQ288897:CDS1    100.00    74    0    0    1    74
DQ342338:CDS1    98.65    74    1    0    1    74
NM_001365279:CDS1    97.30    74    2    0    1    74
XM_027486857:CDS1    87.84    74    9    0    1    74
XM_020385060:CDS1    85.14    74    11    0    1    74
AY907349:CDS1    89.19    74    8    1    1    74

Lines beginning with hashmarks (#) are ignored, and the leftmost column is used as the list of names or accession numbers.
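
The rule above can be sketched in Python as follows. This is only an illustration, not the code used in BLExtractSubset.py: the function name read_names is hypothetical, and splitting on TAB only (rather than on any whitespace) is an assumption.

```python
def read_names(lines):
    """Return the leftmost column of every line that is not a comment.

    Hypothetical helper for illustration; not taken from BLExtractSubset.py.
    """
    names = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and lines beginning with a hashmark.
        if not line.strip() or line.startswith("#"):
            continue
        # Assumption: columns are separated by TAB characters only.
        names.append(line.split("\t")[0].strip())
    return names


example = [
    "# subject id\t% identity\talignment length\n",
    "DQ288897:CDS1\t100.00\t74\n",
    "DQ342338:CDS1\t98.65\t74\n",
    "A30238\n",
]
print(read_names(example))
```

The same function handles both namefile layouts, because a line with no TAB is returned whole as a single name.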

INPUT
The input file is a GDE flatfile. Each entry consists of a name line followed by the sequence or text string, which may span several lines. The first character of the name line is the GDE flag character that indicates the type of the string: # for DNA or RNA, % for protein, or " for text.

Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A30839
NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM
VKGYPPA
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
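
As a rough sketch of how a file like the one above could be read, the following Python splits the lines into (flag, name, sequence) records. The function name parse_gde is hypothetical, and the sketch assumes that any line starting with one of the three flag characters begins a new entry and that no sequence line starts with a flag character. It is not the parser used in BLExtractSubset.py.

```python
FLAGS = ("#", "%", '"')  # DNA or RNA, protein, text


def parse_gde(lines):
    """Split GDE flatfile lines into (flag, name, sequence) tuples.

    Hypothetical helper for illustration only.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0] in FLAGS:
            # A name line: flag character, then the name.
            records.append([line[0], line[1:].strip(), []])
        elif records:
            # A continuation line belonging to the most recent entry.
            records[-1][2].append(line)
    return [(flag, name, "".join(parts)) for flag, name, parts in records]


sample = ["%A30238\n", "MKSAILTG\n", "IVGKQ\n", "#seq2\n", "ACGT\n"]
for record in parse_gde(sample):
    print(record)
```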


OUTPUT
The output is a GDE flatfile containing only the sequences specified in namefile. If two or more sequences in infile have the same name, only the first is written, and an error message is written to standard output.

Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
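
The selection and duplicate handling described in this section might look like the sketch below. The function name extract_subset and the wording of the message are hypothetical, and keeping the records in infile order is an assumption; the actual script may order its output differently.

```python
def extract_subset(names, records):
    """Keep the first record for each requested name.

    records is a list of (flag, name, sequence) tuples.
    Hypothetical helper for illustration; not taken from BLExtractSubset.py.
    """
    wanted = set(names)
    seen = set()
    kept = []
    messages = []
    for flag, name, sequence in records:
        if name not in wanted:
            continue
        if name in seen:
            # Later copies are skipped and reported instead of written.
            messages.append("Duplicate name " + name + ": only the first sequence was written")
            continue
        seen.add(name)
        kept.append((flag, name, sequence))
    return kept, messages


records = [
    ("%", "A30238", "MKS"),
    ("%", "A30839", "NQA"),
    ("%", "A31075", "MKV"),
    ("%", "A30238", "DUP"),
]
kept, messages = extract_subset(["A30238", "A31075"], records)
for flag, name, sequence in kept:
    print(flag + name)
    print(sequence)
for message in messages:
    print(message)
```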

NOTES
1. This script is used by BioLegato for Edit --> Extract subset.
 
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB  Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist