Updated January 23, 2019
NAME
blsort.py - Sort a BioLegato table
SYNOPSIS
blsort.py infile outfile -cols integer[,integer] [-descending] [-sep separator]
DESCRIPTION
blsort.py sorts a BioLegato table file using one or
more columns as the sort key.
In the current implementation, columns containing floats, integers,
or strings are sorted according to their type. All other data types
(e.g. date, currency) are sorted as strings.
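The type handling described above can be illustrated with a minimal sketch. The helper names column_type and sort_key are hypothetical and are not taken from blsort.py; the sketch assumes a column counts as integer or float only if every value in it parses as such, and is otherwise treated as a string.

```python
def column_type(values):
    """Guess a column's type: 'int', 'float', or 'str' (hypothetical helper)."""
    def all_parse(cast):
        try:
            for v in values:
                cast(v)
            return True
        except ValueError:
            return False

    if all_parse(int):
        return "int"
    if all_parse(float):
        return "float"
    # Dates, currency, and anything else fall back to string comparison.
    return "str"


def sort_key(value, kind):
    """Convert a field to a comparable key for the inferred column type."""
    if kind == "int":
        return int(value)
    if kind == "float":
        return float(value)
    return value


slen = ["734", "2368", "594"]
kind = column_type(slen)
ordered = sorted(slen, key=lambda v: sort_key(v, kind))
print(kind, ordered)
```

Sorting the same values as plain strings would place "2368" before "594", which is why numeric columns need a typed key.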
Comments may be included in the file, as illustrated in the
example. A comment is any line beginning with the hash symbol (#).
Comments are not sorted, but are simply echoed to the output file.
All comments will be written to the beginning of the output file,
regardless of where they appear in the input.
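As a rough sketch of the comment handling described above (split_comments is a hypothetical name, not necessarily how blsort.py is written), comment lines can be separated from data lines before sorting and written out first:

```python
def split_comments(lines):
    """Separate lines starting with '#' from data lines, keeping input order."""
    comments = [ln for ln in lines if ln.startswith("#")]
    rows = [ln for ln in lines if not ln.startswith("#")]
    return comments, rows


lines = ["#COUNT: 2", "b\t2", "#a late comment", "a\t1"]
comments, rows = split_comments(lines)
# Comments are echoed first, unsorted; only the data rows are sorted.
output = comments + sorted(rows)
print("\n".join(output))
```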
blsort.py is a sort function for BioLegato, but can be run
as a standalone command.
Before sorting, any rows that have fewer fields than the widest
row will be padded with empty fields so that all rows have the
same number of fields. As well, rows with empty fields in one of
the sort columns specified in -cols will not be included in the
sort process. If -descending is set, those unsortable rows will be
placed at the end of the table. Otherwise, unsortable rows will be
placed after the header lines, if any.
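The padding and placement rules above might look like the following sketch. The function names are hypothetical, rows are assumed to be already split into lists of fields, and for brevity the sort keys are compared as strings rather than by inferred type.

```python
def pad_rows(rows):
    """Pad short rows with empty fields so all rows match the widest row."""
    width = max((len(r) for r in rows), default=0)
    return [r + [""] * (width - len(r)) for r in rows]


def arrange(rows, cols, descending=False):
    """Sort rows on 1-based columns `cols`, setting aside unsortable rows."""
    rows = pad_rows(rows)
    sortable, unsortable = [], []
    for r in rows:
        # A row with an empty field in any sort column is left out of the sort.
        if any(r[c - 1] == "" for c in cols):
            unsortable.append(r)
        else:
            sortable.append(r)
    sortable.sort(key=lambda r: [r[c - 1] for c in cols], reverse=descending)
    # Descending: unsortable rows go last. Ascending: they come first.
    return sortable + unsortable if descending else unsortable + sortable


table = [["b", "2"], ["a"], ["c", "1"]]
print(arrange(table, [2]))
print(arrange(table, [2], descending=True))
```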
OPTIONS
infile - input file in BioLegato table format.
The file begins with an optional set of comments, beginning with
the hash symbol (#) in the first column. These are copied to
outfile and are not processed as part of the sort.
outfile - output file, also in BioLegato table format.
-cols integer[,integer] - A comma-separated list of
integers, telling which column(s) to use for sorting. By default,
only column 1 is used. If two or more integers are given, sort
priority goes left to right through the list. Invalid column numbers
are ignored, e.g. -cols 5 would be ignored if there are only four
columns.
-descending - (default False). Sort in descending, rather
than ascending order.
-sep separator - (default TAB) - Character to use
as a column separator, both for input and output. Common
alternatives include comma (,) and pipe (|). By convention,
tab-separated tables typically have the .tsv extension, while
comma-separated files have the .csv extension.
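The interaction of -cols and -sep can be sketched as follows. This is an illustration only: parse_cols and sort_table are hypothetical names, all rows are assumed to have the same number of fields, and keys are compared as strings. What blsort.py does when every requested column is invalid is not stated here, so the sketch simply leaves the rows in their input order in that case.

```python
def parse_cols(spec, ncols):
    """Turn a -cols value such as '3,4' into valid 1-based column numbers."""
    cols = []
    for item in spec.split(","):
        try:
            n = int(item)
        except ValueError:
            continue  # not an integer: ignore
        if 1 <= n <= ncols:
            cols.append(n)  # out-of-range columns are ignored
    return cols


def sort_table(lines, spec, sep="\t"):
    """Split lines on `sep`, sort with left-to-right key priority, rejoin."""
    rows = [ln.split(sep) for ln in lines]
    ncols = max((len(r) for r in rows), default=0)
    cols = parse_cols(spec, ncols)
    # A tuple key gives the first listed column the highest priority.
    rows.sort(key=lambda r: tuple(r[c - 1] for c in cols))
    return [sep.join(r) for r in rows]


lines = ["b|2|x", "a|2|y", "c|1|z"]
result = sort_table(lines, "2,1,9", sep="|")
print(result)
```

Here column 9 does not exist and is dropped, so the rows are ordered by column 2 first and column 1 second.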
EXAMPLE
Given the tab-separated input file example.tsv:
#ncbiquery.py
#DATABASE: nuccore
#QUERY: Fristensky [AUTH] AND Pisum [ORGN] AND 1:500000[SLEN]
#FILTER:
#WEBENV: NCID_1_1256580311_130.14.22.215_9001_1407873498_1239736160_0MetA0
#COUNT: 18
#uid	Title	BioMol	Slen
22552	P. sativum disease resistance response protein (PI49) mRNA	mRNA	734
20657	P.sativum Cab II gene for chlorophyll a/b-binding protein	genomic	2368
169079	Pea (P.sativum) disease resistance response protein (PI206)	mRNA	594
169086	Pea ferredoxin I (Fed-1)gene, complete cds	genomic	1995
169060	Pisum sativum chlorophyll a/b-binding protein (Cab9) gene,	genomic	1919
the command
blsort.py example.tsv example_sorted.tsv -cols 3,4
would produce the following output file:
#ncbiquery.py
#DATABASE: nuccore
#QUERY: Fristensky [AUTH] AND Pisum [ORGN] AND 1:500000[SLEN]
#FILTER:
#WEBENV: NCID_1_1256580311_130.14.22.215_9001_1407873498_1239736160_0MetA0
#COUNT: 18
#uid	Title	BioMol	Slen
169060	Pisum sativum chlorophyll a/b-binding protein (Cab9) gene,	genomic	1919
169086	Pea ferredoxin I (Fed-1)gene, complete cds	genomic	1995
20657	P.sativum Cab II gene for chlorophyll a/b-binding protein	genomic	2368
169079	Pea (P.sativum) disease resistance response protein (PI206)	mRNA	594
22552	P. sativum disease resistance response protein (PI49) mRNA	mRNA	734
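The ordering in this example can be checked with a short sketch that is independent of blsort.py. It assumes column 3 (BioMol) is compared as a string and column 4 (Slen) as an integer, with column 3 taking priority. Titles are omitted for brevity.

```python
# (uid, BioMol, Slen) for the five data rows of example.tsv, in input order.
rows = [
    ("22552", "mRNA", "734"),
    ("20657", "genomic", "2368"),
    ("169079", "mRNA", "594"),
    ("169086", "genomic", "1995"),
    ("169060", "genomic", "1919"),
]

# Equivalent of -cols 3,4: BioMol first, then Slen as a number.
ordered = sorted(rows, key=lambda r: (r[1], int(r[2])))
uids = [r[0] for r in ordered]
print(uids)
```

The genomic rows come first, ordered by increasing Slen, followed by the mRNA rows, matching the output file shown above.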
AUTHOR
Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist