blsort.py - Sort a BioLegato table

Updated January 23, 2019

NAME

blsort.py - sort a BioLegato table

SYNOPSIS

blsort.py infile outfile -cols integer[,integer] [-descending] [-sep separator]

DESCRIPTION

blsort.py sorts a BioLegato table file using one or more columns as the sort key.

In the current implementation, columns containing floats, integers, or strings are sorted according to their type. All other data types (e.g. date, currency) are sorted as strings.
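The difference between numeric and string ordering can be illustrated with a minimal sketch. It assumes a column is treated as integer if every value parses as an integer, as float if every value parses as a float, and as string otherwise. The function name infer_column_type is hypothetical and is not taken from blsort.py, whose actual type detection may differ.

```python
def infer_column_type(values):
    """Return int, float, or str: the first type that every value parses as.

    Hypothetical helper for illustration; not part of blsort.py.
    """
    for candidate in (int, float):
        try:
            for value in values:
                candidate(value)
        except ValueError:
            continue  # at least one value failed; try the next type
        return candidate
    return str


# Values from the Slen column of the example table.
slen = ["734", "2368", "594"]

# Sorted using the inferred numeric type as the key.
as_numbers = sorted(slen, key=infer_column_type(slen))

# Sorted as plain strings, which is how dates or currency would be handled.
as_strings = sorted(slen)
```

Sorted as numbers, 594 comes first; sorted as strings, "2368" comes first because the comparison is character by character.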

Comments may be included in the file, as illustrated in the example. A comment is any line beginning with the hash symbol (#). Comments are not sorted, but are simply echoed to the output file. All comments will be written to the beginning of the output file, regardless of where they appear in the input.
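The comment handling described above can be sketched as follows. This is an illustration only: the function name split_comments is hypothetical, and the data rows are sorted as plain strings to keep the example short.

```python
def split_comments(lines):
    """Separate comment lines (starting with '#') from data lines, keeping order.

    Hypothetical helper for illustration; not part of blsort.py.
    """
    comments, rows = [], []
    for line in lines:
        (comments if line.startswith("#") else rows).append(line)
    return comments, rows


lines = ["#COUNT: 2", "b\t2", "#a late comment", "a\t1"]
comments, rows = split_comments(lines)

# All comments go first, in their original order, followed by the sorted rows.
output = comments + sorted(rows)
```

The comment that appeared between the two data rows in the input ends up at the top of the output, together with the other comment.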

blsort.py is a sort function for BioLegato, but can be run as a standalone command.

Before sorting, any rows that have fewer fields than the widest row are padded with empty fields so that all rows have the same number of fields. In addition, rows with an empty field in any of the sort columns specified in -cols are excluded from the sort. If -descending is set, these unsortable rows are placed at the end of the table. Otherwise, they are placed after the header lines, if any, and before the sorted rows.
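The padding and placement rules can be sketched as below. The helper names pad_rows and partition_sortable are hypothetical, rows are assumed to be already split into lists of fields, and the single sort column is assumed to hold integers.

```python
def pad_rows(rows):
    """Pad every row with empty fields up to the width of the widest row."""
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


def partition_sortable(rows, cols):
    """Split rows by whether every sort column (1-based) is non-empty."""
    sortable, unsortable = [], []
    for row in rows:
        if any(row[c - 1] == "" for c in cols):
            unsortable.append(row)
        else:
            sortable.append(row)
    return sortable, unsortable


rows = pad_rows([["b", "2"], ["a"], ["c", "1"]])
sortable, unsortable = partition_sortable(rows, cols=[2])


def by_col2(row):
    """Sort key: column 2 as an integer."""
    return int(row[1])


# Ascending: unsortable rows come first, ahead of the sorted rows.
ascending = unsortable + sorted(sortable, key=by_col2)

# Descending: unsortable rows go to the end of the table.
descending = sorted(sortable, key=by_col2, reverse=True) + unsortable
```

Here the short row ["a"] is padded to ["a", ""], found to have an empty sort field, and kept out of the sort in both directions.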

OPTIONS

infile - input file in BioLegato table format. The file begins with an optional set of comments, beginning with the hash symbol (#) in the first column. These are copied to outfile and are not processed as part of the sort.

outfile - output file, also in BioLegato table format.

-cols integer[,integer] - A comma-separated list of integers specifying which column(s) to use for sorting. By default, only column 1 is used. If two or more integers are given, sort priority goes left to right through the list. Invalid column numbers are ignored; e.g. -cols 5 would be ignored if there are only four columns.
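A minimal sketch of how a -cols value might be interpreted is shown below. The names parse_cols and sort_rows are hypothetical, and the per-column converters are supplied by hand rather than detected. The rows are taken from the example table with the Title column left out for brevity, so BioMol is column 2 and Slen is column 3.

```python
def parse_cols(spec, ncols):
    """Parse a -cols value, dropping column numbers outside 1..ncols."""
    cols = [int(token) for token in spec.split(",")]
    return [c for c in cols if 1 <= c <= ncols]


def sort_rows(rows, cols, converters, descending=False):
    """Sort on several 1-based columns; earlier columns take priority."""
    def key(row):
        return tuple(converters[c](row[c - 1]) for c in cols)
    return sorted(rows, key=key, reverse=descending)


rows = [
    ["22552", "mRNA", "734"],
    ["20657", "genomic", "2368"],
    ["169079", "mRNA", "594"],
    ["169086", "genomic", "1995"],
    ["169060", "genomic", "1919"],
]

# Column 7 does not exist in a three-column table, so it is dropped.
cols = parse_cols("2,3,7", ncols=3)

# BioMol is compared as a string, Slen as an integer.
result = sort_rows(rows, cols, converters={2: str, 3: int})
uids = [row[0] for row in result]
```

Building a tuple key gives the left-to-right priority directly: rows are grouped by BioMol first, and Slen only decides the order within each group.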

-descending - (default False). Sort in descending rather than ascending order.

-sep separator - (default TAB) - Character to use as a column separator, both for input and output. Common alternatives include comma (,) and pipe (|). By convention, tab-separated tables typically have the .tsv extension, while comma-separated files have the .csv extension.
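Using one separator for both reading and writing can be sketched as follows. The helper names parse_line and format_row are hypothetical, and the sketch assumes fields never contain the separator character itself; whether blsort.py handles quoted or escaped separators is not stated here.

```python
def parse_line(line, sep="\t"):
    """Split one table line into fields on the given separator."""
    return line.rstrip("\n").split(sep)


def format_row(fields, sep="\t"):
    """Join fields back into one output line with the same separator."""
    return sep.join(fields) + "\n"


# A pipe-separated line is read and written back with the same separator.
fields = parse_line("169079|mRNA|594\n", sep="|")
line_out = format_row(fields, sep="|")
```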

EXAMPLE

Given the tab-separated input file example.tsv:

#ncbiquery.py
#DATABASE: nuccore
#QUERY: Fristensky [AUTH] AND Pisum [ORGN]        AND 1:500000[SLEN]
#FILTER:
#WEBENV: NCID_1_1256580311_130.14.22.215_9001_1407873498_1239736160_0MetA0
#COUNT: 18
#uid    Title    BioMol    Slen
22552    P. sativum disease resistance response protein (PI49) mRNA    mRNA    734
20657    P.sativum Cab II gene for chlorophyll a/b-binding protein    genomic    2368
169079    Pea (P.sativum) disease resistance response protein (PI206)     mRNA    594
169086    Pea ferredoxin I (Fed-1)gene, complete cds    genomic    1995
169060    Pisum sativum chlorophyll a/b-binding protein (Cab9) gene,    genomic    1919

blsort.py example.tsv example_sorted.tsv -cols 3,4

would produce the following output file:

#ncbiquery.py
#DATABASE: nuccore
#QUERY: Fristensky [AUTH] AND Pisum [ORGN]        AND 1:500000[SLEN]
#FILTER:
#WEBENV: NCID_1_1256580311_130.14.22.215_9001_1407873498_1239736160_0MetA0
#COUNT: 18
#uid    Title    BioMol    Slen
169060    Pisum sativum chlorophyll a/b-binding protein (Cab9) gene,    genomic    1919
169086    Pea ferredoxin I (Fed-1)gene, complete cds    genomic    1995
20657    P.sativum Cab II gene for chlorophyll a/b-binding protein    genomic    2368
169079    Pea (P.sativum) disease resistance response protein (PI206)     mRNA    594
22552    P. sativum disease resistance response protein (PI49) mRNA    mRNA    734

AUTHOR

Dr. Brian Fristensky
Department of Plant Science
University of Manitoba
Winnipeg, MB Canada R3T 2N2
frist@cc.umanitoba.ca
http://home.cc.umanitoba.ca/~frist