BIRCH

TUTORIAL: Basic shell scripting


Sept. 28, 2016


BASH Programming - Introduction HOW-TO [http://tldp.org/HOWTO/Bash-Prog-Intro-HOWTO.html]


Chadwick, R (2015) Bash Scripting Tutorial - Ryan's Tutorials [http://ryanstutorials.net/bash-scripting-tutorial/]

Gite, V.  (2015) Learning bash scripting for beginners [http://www.cyberciti.biz/open-source/learning-bash-scripting-for-beginners/]



This tutorial is designed users who already have some familiarity with the command line to learn the very useful skill of scripting. It should be noted, however, that shell languages such as the Bash shell used here are not designed as programming languages. Bash syntax is not as rigorously defined as in languages such as Java, and has only limited functionality, compared to most languages for which literally thousands of API methods and objects are available. Users wishing to learn programming in a formal way, to write programs, should learn languages such as Java or Python.

In essence, shell scripting amounts to putting a series of Unix commands into a file. The file behaves as a program, executing all commands in order.

To do this tutorial, we begin by creating a directory in which to work:

cd tutorials
mkdir scripting
cd scripting


Next, download the following files to your scripting directory:

To be executable, all programs and scripts must have their execute permissions set

chmod a+x testprot.sh

The goal of this tutorial is to create a script that can read a Fasta file and determine whether the file contains DNA/RNA or protein sequences.

While you could edit this file in any text editor, it is best to use an editor that supports syntax highlighting for various programming languages.

On Linux, a good choice is Gedit. At the command line, type 'gedit testprot.sh &' to run the editor in the background. Or, launch Gedit from the Applications menu and open testprot.sh from the File menu.

line 1 - the "shebang" line - tells which shell to use when running the script
lines beginning with '#' - these are comment lines. They are ignored by the shell. However, the best programmers organize their programs by writing comments first, explaining the major steps, and then filling in code.

So far, this script contains only two executable lines. "infile=$1" creates a variable called infile and sets it to hold the name of the first argument typed after the name of the program. "cat $infile" - the shell substitutes the value of $infile into the cat command, and then executes cat using the file specified as infile. The result is that cat prints the contents of $infile to the standard output. Note that when you are setting a variable, you just use the name of the variable. When you are referencing a variable (ie. you want it's value returned) you preceed the variable with a dollar sign ($). So in the statment infile=$1, the shell is referencing the value of $1, the first command line argument, and assigning that value to the infile variable.

To get an idea of how the script works, try typing the following two commands:

./testprot.sh dna.fsa

./testprot.sh pro.fsa


The first command will print out the contents of dna.fsa, and the second will print out the contents of pro.fsa. (Since the current working directory is not usually in your $PATH, we, we need to preceed the name of the script with './', which appends the path to the current working directory to the beginning of the command. That way, the shell can find your script.

We need a strategy to somehow read infile, and check to see if it contains protein or amino acid sequences. We do know that both nucleic acid and protein sequences use different alphabets, as described in Sequence File Formats. Inspection of both lists of symbols show that while there is some overlap between the two, (Y stands for pyrimidine and also for tyrosine), there are some symbols that are unique to proteins: FPEJLZOIQ*X. The script can search for these characters within sequences, and if it finds any of them, it must be protein, and if none are found, it is almost certainly nucleic acid. While it is possible that some proteins will not contain any of these amino acids (say, a protein with only A, G, C, and T amino acids), we will not worry about that rare possibility.

There is a second problem. The FASTA name lines, beginning with '>', could contain any of the amino acid symbols in plain text. The script therefore needs to have a way to filter out these lines so that only sequence lines are checked.

Thus if the file contains protein, one or more lines will be printed. For a DNA sequence, no output lines should be printed. We should therefore have a way to count the number of lines sent to the output.

In formal computer terms, this sort of strategy is called an algorithm. The algorithm for determining whether a file contains nucleic acids or proteins could be described in pseudo-code like this:

eliminate lines beginning with '>'
print only lines containing any of 'FPEJLZOIQ*X'
count the number of output lines
if the number of output lines is > 0
then
    the file contains proteins
else
    the file contains DNA



The first step, then, is to filter out name lines. Change line 12 to

cat $infile | grep -v '^>'

To make your changes take effect, you must Save the file every time you make changes.

Rerun the two commands above, to see the output. You'll notice that this time, the name lines are not printed to the output. Output from the cat command has been piped as input to the grep command, using the pipe '|' character. Normally, grep prints all lines that match the search pattern. However, with the -v option, grep prints every line except those that match the pattern. In this case, the pattern to match is the right arrow character '>', preceeded by a caret '^', which indicates that the pattern must be found at the beginning of a line.

To find lines containing any of the protein-specific symbols, we can use another instance of grep. The -e option of grep tells grep to search with a regular expression. In this case, the regular expression [FPEJLZOIQ*X] matches any of these symbols. Any line with at least one of these symbols will be printed. To take care of the possibility that a sequence might be in lowercase, we include the -i option, which tells grep to search in a case-insensitive manor. The line now becomes

cat $infile | grep -v '^>' | grep -i -e [FPEJLZOIQ*X]

Run the script on both files, to see the results.

Finally, to count the number of lines of output, we use the wc command with the l option, which counts lines in an input stream:

cat $infile | grep -v '^>' | grep -i -e [FPEJLZOIQ*X] |wc -l

This time, testprot.sh will print a number, rather than printing the output lines. Run the script to test that it does this.

We would like to capture the output of this line in a variable.  In Unix, enclosing a command in reverse quotes (`) returns the output of the command. This character is on the same key as the tilda (~), usually in the upper left hand corner of the keyboard. This is NOT the same as the regular quote (').  Thus, we can create a variable called result, and assign it the value of the output. The code to do this, and then print the value of the result variable is

result=`cat $infile | grep -v '^>' | grep -i -e [FPEJLZOIQ*X] |wc -l`
echo $result


As modified, testprot.sh should still print a number. However, restructured in this way, we now can work with the result variable. In the algorithm, we see an if-then statement, which does something different, depending on whether amino acid codes are found or not. Add he following lines below the two lines we just described:

if (($result > 0))
then
    msg="$infile contains protein."
else
    msg="$infile contains DNA."
fi


This if-then statement begins with the word 'if', and ends with 'fi' (the opposite of if). The if line contains a condition. If the condition is true, the code immediately after 'then' is executed. If the line is false, the code after the else line is executed. (An else line is only needed if you want to do something in both cases.) The condition evaluated is $result > 0. If the value of result is greater than 0 the codition is true, which means we have a protein. If the result is not greater than 0, the condition is false, and we have DNA. Note that when Bash evaluates arithmetic expressions, you must enclose the condition in double parentheses.

Potentially, there are many things we might do in the then and else blocks. You could put as many lines of code in the then and else sections as you wish. In this example, by the way, we have indented the then and else blocks to make it easier to see the code which is executed in each of these blocks. However, the shell ignores indentation.

In this case, we assign a string to a variable called msg. To make the messages more informative, we not only tell whether the file contains protein or DNA, but we also tell the shell to include the name of the input file, $infile, in the message.

Finally, we need to add a line  that prints the message:

# output the result
echo $msg


Run the scripts again, to see the output.

We often want to store the output from a command in a file. To do this, we might want the script to read the name of an output file as a command line argument, and then use that output filename in when we finally print the message. This requires two steps.

First, to add a new argument to the command line add a new line after line 7, so that lines 7 and 8 look like this:

infile=$1
outfile=$2


This new lines creates a variable called outfile, and assigns to it the value of the second command line argument.

We can use this to write the message to the output file by modifying the last line as

echo $msg > $outfile

To run the scripts now, we need to have two command line arguments. We can call the output files anything we want, but the following is one example:

./testprot.sh dna.fsa dna.out

./testprot.sh pro.fsa pro.out


If you run these commands, you can then see the output in the files dna.out and pro.out.

If you want to see an example of the complete script, you can look at testprot_answer.sh.