TUTORIAL: Extracting features from text files

Sep. 16, 2022

ARTEMIS Web Site: http://www.sanger.ac.uk/science/tools/artemis
ARTEMIS Manual

The software gap problem in bioinformatics
Although Artemis has a wealth of analytical features, no program will be able to address every possible question. There will always be gaps in the analytical pipeline, for which no software tool exists.

Artemis does have the ability to save annotations as text files, which in principle could be analyzed in a spreadsheet to ask questions such as

Is there a transcriptional strand bias for protein coding genes?
What percentage of the genome codes for proteins?
Are repeated sequences evenly distributed across the genome, or are there repeate rich and repeat poor regions?

The problem is that the feature information exported by Artemis is not directly machine readable. Although the data appear in columns, spreadsheets can't directly read these files.

As a case in point, we will show how one can leverage Unix commands to bridge these gaps. The first two questions above could be answered if we could read the information for all CDS features into a spreadsheet. We will use a series of commands to read a text file in which columns in the table are created by blank spaces, and create a TAB-separated value (TSV) file that can be read by spreadsheets.
(TAB is a non-printing character that is seen as a fixed space in documents, rather than as a printed character.)

This tutorial assumes you have already done the BIRCH DNAPlotter tutorial.

To begin the tutorial, go to the dnaplotter directory created in the previous tutorial.

cd $HOME/tutorials/artemis/dnaplotter

1. Save a file of features from Artemis.

artemis

launch artemis

Read in Acidanus_hospitalis.gen using File --> Open.

Go to the Feature window at the bottom of Artemis and right click on the mouse. To save these features in a file, right-click and choose Save List to File. Name the file Acidianus_hospitalis_fea.txt.

2. Extraction of features to a TSV file using Unix commands

We can convert the text file of features into a tab-separated value file, in which each the fields are separated by TAB characters, using the following four steps:

step	command	explanation
1	`grep CDS <` `Acidianus_hospitalis_fea.txt > out1.txt`	grep reads lines from Acidianus_hospitalis_fea.txt and prints only those lines containing the string 'CDS' to out1.txt.
2	`cut -c 1-33 < out1.txt > out2.txt`	Cut out columns 1 through 33 from out1.txt and write them to out2.txt.
3	`tr -s ' ' <` `out2.txt > out3.txt`	tr reads lines from out1.txt and prints those lines with all multiple blanks translated into single blanks (the -s option)
4	`tr ' ' '\t' < out3.txt > out4.txt`	tr reads lines from out2.txt and prints to out3.txt those lines with each blank translated into a tab character, represented by \t in the command. Both \t and the blank must be enclosed in quotes.
5	`sed -e 's%\t$%\tf%' < out4.txt >Acidianus_hospitalis_fea.tsv`	The Stream Editor, sed, reads lines from out3.txt and prints those lines, modified according to the expression after the -e option. This expression says substitute (s) each occurrence of tab followed by the end of a line (\t$) with a tab followed by the letter f. Output is written to Acidianus_hospitalis_fea.tsv.

Notes:
Step 1 - < means redirection of the contents of a file to the standard input. > means redirection of output of a program to the standard output.

Step 2 - The free text in column 5 presents a problem, because it is not computer-readable. Also, note that column 4, which indicates the strand on which the feature is transcribed, is blank for the forward strand. After we compress the blanks, that would shift the text into column 4. For our purposes, it is better to get rid of column 5 at this point.

Step 4 - We're most of the way there. You can see in the output file that the TAB character forces each field to appear in its own column. The columns are still only one character wide, but appear to be a fixed width. However, we are still left with the problem that the field for the forward strand is empty, so the line will end in the TAB character. We can use sed to search of all occurrences of TAB followed by the end of line, and change those to have a TAB, followed by an f, and then the end of line.

Step 5 - sed substitutes one regular expression for another. The two regular expressions are separated by '%' characters. Thus, the first expression is \t$, where '$' stands for the end of line. Whenever this is found, an the matching expression is replaced by \tf, TAB followed by the letter f. (The end of line is not replaced).

Doing the task as a single command - Remember that in Unix, pipes (|) tell the shell to redirect the standard output from one command the standard input for the next command. This eliminates the need to save intermediate steps in temporary files. The series of steps listed above could be precisely replicated with the following command line:

cat Acidianus_hospitalis_fea.txt | grep CDS | cut -c 1-33 | tr -s ' ' \ | tr ' ' '\t' | sed -e 's%\t$%\tf%' > Acidianus_hospitalis_fea.tsv

Note: when a command is very long, you can break it up into two or more lines by inserting a left slash '\', which tells the shell to continue reading on the next line.

3. Using the data in a spreadsheet

At this point, you should be able to read the file using any spreadsheet program. On Linux, this is typically LibreOffice. You may be able to view the file in LibreOffice by double-clicking on Acidianus_hospitalis_fea.tsv in the file manager.

Alternatively, to the Applications menu on your Linux desktop and choose Office --> LibreOffice Calc. To read the file, choose File --> Open, and select Acidianus_hospitalis_fea.tsv.

The import window should look like this:

Make sure that Seperator Options is set to "Tab", and click on OK. Your spreadsheet should appear:

The spreadsheet could now be used to calculate useful statistics on coding sequences in A. hospitalis. Examples of information that could be derived from such a spreadsheet:

a histogram showing the distribution of transcript sizes, and an average transcript size
the percentage of the genome represented by CDS sequences
is there a strand bias for transcription between forward and reverse strands ie. is there a major difference in the number of transcripts on the forward strand versus the reverse strand?
are there regions with a strong strand bias?

Obviously, a spreadsheet of this type could be made for any annotated feature in a GenBank entry.

Not all text files are easily machine-parsable - Like many bioinformatics programs, the text files in which Artemis saves feature information were not explicitly designed to be machine-parsable. For example, when Artemis saves features from the Zea mays entry KR014666, an excerpt from the file has the following lines:

gap 1063651 1063750 gap 1071134 1071233 CDS 1091603 1093045 c similar to B73 protein GRMZM2G010987_T01 mRNA <1091603 >1093045 c CDS 1109880 1110602 c similar to B73 protein GRMZM5G835704_T01 mRNA <1109880 >1110602 c gap 1114991 1115090 gap 1124793 1124892 gap 1127331 1127430 gap 1131616 1131715

The fixed column format that Artemis uses did not take into account the GenBank convention that uses < and > to indicate that the precise start and stop sites of a feature are unknown, but are assumed to be respectively upstream and downstream of the listed coordinates. Since these characters shift columns 4 and 5 to the right,
the cut command we used above would not correctly cut out the first 4 columns. A similar problem would occur if features were saved for sequences greater than 9,999,999 nucleotides in length. For a more general solution to parsing these files, more sophisticated pattern recognition methods, probably involving regular expressions, would have to be employed.