BLExtractSubset.py - Extract a subset of sequences from a GDE flatfile

Example of a namefile, with one name or accession number per line:
A30238
A31075
A34313

The namefile can also be a TAB-separated value (.tsv) file, in which the first column contains names or accession numbers. For example, BioLegato writes output from SSEARCH as a multicolumn file:

# ssearch3 -T 16 -Q -p -b %NUMOFSCORES% -d 20 -m 0 -z 11 -E 0.0001 -m F8C bio6395951563558234905.tmp.tsv -m F6 bio6395951563558234905.tmp.html bio6395951563558234905.tmp /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa 3
# SSEARCH 36.3.8h Aug, 2019
# Query: DQ288897:CDS1 - 74 aa
# Database: /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa
# 50 hits found
# subject id    % identity    alignment length    mismatches    gap opens    q. start    q. end
DQ288897:CDS1    100.00    74    0    0    1    74
DQ342338:CDS1    98.65    74    1    0    1    74
NM_001365279:CDS1    97.30    74    2    0    1    74
XM_027486857:CDS1    87.84    74    9    0    1    74
XM_020385060:CDS1    85.14    74    11    0    1    74
AY907349:CDS1    89.19    74    8    1    1    74
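
Both namefile layouts described above can be read the same way, by taking the first TAB-separated field of each line. The following is a minimal sketch of that idea, not the code used by BLExtractSubset.py. The function name read_names is hypothetical, and skipping blank lines and lines that begin with # is an assumption based on the SSEARCH header lines shown above.

```python
def read_names(lines):
    """Return the names found in the first TAB-separated column, in file order."""
    names = []
    for line in lines:
        line = line.rstrip("\n")
        # Assumption: blank lines and '#' header lines carry no names.
        if not line.strip() or line.startswith("#"):
            continue
        # A plain one-name-per-line file has no TAB, so split() returns the whole line.
        names.append(line.split("\t")[0].strip())
    return names


example = [
    "# 50 hits found\n",
    "DQ288897:CDS1\t100.00\t74\t0\t0\t1\t74\n",
    "DQ342338:CDS1\t98.65\t74\t1\t0\t1\t74\n",
    "A30238\n",
]
print(read_names(example))
```

Because a line without a TAB is returned whole, the same function handles a simple list of names and a multicolumn .tsv file.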

The input file is a GDE flatfile. Each entry consists of a line containing a name, followed by one or more lines containing the sequence or text. The type of each entry is indicated by the flag character that GDE places in front of the name: # for DNA or RNA, % for protein, or " for text.

Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A30839
NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM
VKGYPPA
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
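
To illustrate the layout described above, the following sketch splits a GDE flatfile into entries. It is an illustration under stated assumptions, not the parser used by BLExtractSubset.py: the names GDE_FLAGS and parse_gde_flat are hypothetical, the short records in the example are made up, and any line beginning with a flag character is treated as the start of a new entry, which would misread a text entry whose body contains such a line.

```python
# Flag characters as described in the text above.
GDE_FLAGS = {"#": "DNA/RNA", "%": "protein", '"': "text"}


def parse_gde_flat(lines):
    """Return a list of (flag, name, sequence) tuples in file order."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0] in GDE_FLAGS:
            # Name line: the flag character, then the name.
            entries.append([line[0], line[1:].strip(), ""])
        elif entries:
            # Continuation line: append to the sequence of the current entry.
            entries[-1][2] += line.strip()
    return [tuple(entry) for entry in entries]


# Made-up records, for illustration only.
example = ["%seq1\n", "MKSA\n", "ILTG\n", "#seq2\n", "ACGT\n"]
for flag, name, seq in parse_gde_flat(example):
    print(GDE_FLAGS[flag], name, seq)
```

Sequence lines are joined here for convenience. A tool that must reproduce the original line breaks would keep the lines separate instead.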

The output is a GDE flatfile containing only the sequences whose names are listed in the namefile. If two or more sequences in the input file have the same name, only the first is written to the output, and an error message is written to the standard output.

Example:

%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
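
The selection rule, including the handling of duplicate names, can be sketched as follows. This is a minimal illustration and not the actual BLExtractSubset.py code. The function name extract_subset, the wording of the message, and the made-up sequence strings are hypothetical. Writing the entries in the order in which they occur in the input file is also an assumption, since the example above has the same order in both files.

```python
def extract_subset(names, flat_lines):
    """Return (kept_lines, messages) for the entries whose names are wanted."""
    wanted = set(names)
    written = set()
    kept, messages = [], []
    keep = False
    for line in flat_lines:
        line = line.rstrip("\n")
        if line[:1] in ("#", "%", '"'):
            # Name line: decide whether this entry is copied to the output.
            name = line[1:].strip()
            if name in wanted and name in written:
                # Only the first entry with a given name is written.
                messages.append("duplicate name skipped: " + name)
                keep = False
            else:
                keep = name in wanted
                if keep:
                    written.add(name)
        if keep:
            kept.append(line)
    return kept, messages


# Made-up sequence strings, for illustration only.
flat = ["%A30238", "MKSA", "%A30839", "NQAS",
        "%A31075", "MKSV", "%A30238", "XXXX"]
kept, messages = extract_subset(["A30238", "A31075"], flat)
print("\n".join(kept))
print("\n".join(messages))
```

In this example the second A30238 entry is dropped and reported, while A30839 is skipped silently because it is not in the list of names.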