Updated October 2, 2020
NAME
BLExtractSubset.py - Extract a subset of sequences from a GDE flatfile
SYNOPSIS
python BLExtractSubset.py namefile infile outfile
DESCRIPTION
This script reads infile, which contains sequences in GDE flatfile
format, and writes the subset of those sequences listed in namefile
to outfile.
NAMEFILE
The namefile consists of a list of
names for sequences in a GDE flatfile, with one name per line.
Example:
A30238
A31075
A34313
The namefile can also be a TAB-separated value (.tsv) file in which
the first column contains names or accession numbers. For example,
BioLegato writes output from SSEARCH as a multicolumn file:
# ssearch3 -T 16 -Q -p -b %NUMOFSCORES% -d 20 -m 0 -z 11 -E 0.0001 -m F8C bio6395951563558234905.tmp.tsv -m F6 bio6395951563558234905.tmp.html bio6395951563558234905.tmp /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa 3
# SSEARCH 36.3.8h Aug, 2019
# Query: DQ288897:CDS1 - 74 aa
# Database: /home/psgendb/BIRCHDEV/public_html/tutorials/bioLegato/dataset/all.pro.fsa
# 50 hits found
# subject id         % identity  alignment length  mismatches  gap opens  q. start  q. end
DQ288897:CDS1        100.00      74                0           0          1         74
DQ342338:CDS1        98.65       74                1           0          1         74
NM_001365279:CDS1    97.30       74                2           0          1         74
XM_027486857:CDS1    87.84       74                9           0          1         74
XM_020385060:CDS1    85.14       74                11          0          1         74
AY907349:CDS1        89.19       74                8           1          1         74
Lines beginning with a hash mark (#) are ignored, and the leftmost
column is used as the list of names or accession numbers.
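The namefile rules above can be sketched in Python as follows. This is a
minimal illustration, not the code used by BLExtractSubset.py: the function
name read_names and the choice to skip blank lines are assumptions.

```python
import io


def read_names(handle):
    """Collect names from the first TAB-separated column of a namefile.

    Hypothetical helper: lines starting with '#' are skipped, as the
    manual describes; skipping blank lines is an added assumption.
    """
    names = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        # A plain one-name-per-line file has no TAB, so split() returns
        # the whole line as the first (and only) field.
        names.append(line.split("\t")[0].strip())
    return names


sample = io.StringIO(
    "# 50 hits found\n"
    "DQ288897:CDS1\t100.00\t74\n"
    "DQ342338:CDS1\t98.65\t74\n"
    "A30238\n"
)
print(read_names(sample))
```

Because a line without a TAB is treated as a single field, the same function
handles both the simple list and the multicolumn .tsv form.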
INPUT
The input file is a GDE flatfile. Each entry consists of a line
holding a flag character followed by the name, and then the sequence
itself on one or more lines. The flag character is the one used by
GDE to indicate the type of the sequence:
# for DNA or RNA, % for protein, or " for text.
Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A30839
NQASVVANQLIPINTALTLVMMRSEVVTPVGIPAEDIPRLVSMQVNRAVPLGTTLMPDM
VKGYPPA
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
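For readers who want to see the layout in code, here is a rough sketch of a
reader for this format. It is not taken from BLExtractSubset.py; the name
parse_gde_flat is hypothetical, and the sketch assumes that a header line
holds only the flag character and the name, and that no sequence line begins
with one of the flag characters.

```python
import io

# Flag characters described above: '#' DNA/RNA, '%' protein, '"' text.
FLAGS = ("#", "%", '"')


def parse_gde_flat(handle):
    """Return a list of (flag, name, sequence) tuples in file order.

    Hypothetical helper written for illustration only.
    """
    records = []
    for line in handle:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line[0] in FLAGS:
            # Header line: start a new record.
            records.append([line[0], line[1:].strip(), []])
        elif records:
            # Continuation line: append to the current record.
            records[-1][2].append(line)
    return [(flag, name, "".join(chunks)) for flag, name, chunks in records]


sample = io.StringIO("%A30238\nMKSAIL\nTGLLFV\n%A30839\nNQASVV\n")
for flag, name, seq in parse_gde_flat(sample):
    print(flag, name, len(seq))
```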
OUTPUT
The output is a GDE flatfile containing only the sequences specified
in namefile. If two or more sequences in the input have the same
name, only the first is written, and an error message is written to
the standard output.
Example:
%A30238
MKSAILTGLLFVLLCVDHLSSASQSVVATQLIPINTALTPIMMKGQVVNPAGIPFAEMSQ
IVGKQVNRPVAKDETLMPNMVKTYRAAK
%A31075
MKSVILTGLLFVLLCVDHMTASQSVVATQLIPMNSALTPVMMEGKVTNPIGIPFAEMSQM
VGKQVNRPVAKGQTIMPNMVKTYAAGK
%A34313
MLTVSLLVCAMMALTQANDDKILKGTATEAGPVSQRAPPNCPAGWQPLGDRCIYYETTAM
TWALAETNCMKLGGHLASIHSQEEHSFIQTLNAGVVWIGGSACLQAGAWTWSDGTPMNFR
SWCSTKPDDVLAACCMQMTAAADQCWDDLPCPASHKSVCAMTF
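The selection rule, including the handling of duplicate names, might look
like the sketch below. Several details are assumptions rather than documented
behavior: the function names, writing records in input-file order, the
60-character line width, and the wording of the message.

```python
def extract_subset(records, names):
    """Keep the first record for each wanted name.

    records: iterable of (flag, name, sequence) tuples.
    Returns (kept, duplicates), where duplicates lists the names of
    later records that were skipped. Hypothetical helper.
    """
    wanted = set(names)
    seen = set()
    kept, duplicates = [], []
    for flag, name, seq in records:
        if name not in wanted:
            continue
        if name in seen:
            duplicates.append(name)
            continue
        seen.add(name)
        kept.append((flag, name, seq))
    return kept, duplicates


def format_gde_flat(records, width=60):
    """Render records as GDE flatfile text (line width is an assumption)."""
    lines = []
    for flag, name, seq in records:
        lines.append(flag + name)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)


records = [("%", "A1", "AAAA"), ("%", "A2", "CCCC"), ("%", "A1", "GGGG")]
kept, duplicates = extract_subset(records, ["A1"])
for name in duplicates:
    # Illustrative message only; the script's actual wording is not documented here.
    print("Duplicate name skipped: " + name)
print(format_gde_flat(kept), end="")
```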
NOTES
1. This script is used by BioLegato for Edit --> Extract subset.
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist