This assignment is worth 20% of the course grade.
Due: Thursday October 4, 2018.
DNA replication in circular prokaryotic
genomes begins at a single origin of replication,
often visualized as being at 12 o'clock on the circle. Two
replication forks propagate in opposite directions,
and replication continues until they meet at the terminus,
visualized as being at 6 o'clock on the circle.
In bidirectional replication, each replication fork has both a leading and lagging strand. As the two strands f and c are "peeled apart", DNA synthesis on the leading strand proceeds uninterrupted, while on the lagging strand, DNA must be replicated in short stretches, referred to as Okazaki fragments, which are initiated as the DNA duplex opens up.
Thus, as shown in Figure 2, we can define two
regions of a circular chromosome, arbitrarily designated A
and B. If the total length of the bacterial chromosome is L,
then region A spans coordinates 1 to L/2, and region B spans
coordinates (L/2+1) to L.
Referring to Figure 1, in region A the f strand is the template for leading strand synthesis, and in region B the c strand is the template for leading strand synthesis.
Go to your as1 directory and create a symbolic link to the genomes directory:
ln -s /home/plants/frist/courses/bioinformatics/as1/genomes
a. DNAPlotter Map
For each genome in your list create a DNAPlotter Map. Make sure to include a track for each of the following:
We can automate procedures using Unix commands by writing those
commands in a file referred to as a script. A script implementing
steps 1-5 from the tutorial "Extracting features from text files"
can be found in the file fea2tsv.sh. In this exercise, we'll modify
the script with some improvements on the original protocol.
The problem is as follows: In the Background section above, we
have oversimplified things by assuming that for all prokaryotic
genomes, the replication origin will always be annotated as
starting at position 1. However, in many genome projects the
choice of which nucleotide in a circular genome is designated
position 1 is arbitrary. Consequently, many prokaryotic genomes
place the replication origin at a different position. For this
reason, it would be far easier to calculate the f and c values if
our TSV file contains annotation for both CDS and rep_origin
features.
Fortunately, the grep command can read a file containing a list
of patterns, each on a separate line. The output from grep will be
any lines from the input that match any of the patterns. For
example, you could create a pattern file called fealist.txt
containing the patterns ^CDS and ^rep_origin, one per line.
If we typed
grep -f fealist.txt < Corynebacterium_ulcerans0102.fea
any lines beginning with either CDS or rep_origin would be
printed to the standard output. (In regular expressions, '^'
indicates the beginning of a line. If '^' was not included in the
expression, the search would match any line that included CDS or
rep_origin anywhere within a line.)
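As a minimal sketch of the pattern-file approach, the commands below build a small, made-up feature file and a pattern file in a scratch directory, then run grep -f on them. The file sample.fea and its contents are hypothetical stand-ins; a real .fea file will have a different layout. The second grep shows what changes when the '^' anchor is left out.

```shell
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical feature file; real .fea files will look different.
printf '%s\n' \
  'CDS 100..400' \
  'tRNA 450..520' \
  'rep_origin 600..800' \
  'misc_feature near CDS 900' > sample.fea

# Pattern file: one regular expression per line.
printf '%s\n' '^CDS' '^rep_origin' > fealist.txt

# Anchored patterns: only lines that begin with CDS or rep_origin.
anchored=$(grep -f fealist.txt < sample.fea)
echo "$anchored"

# Unanchored patterns also match CDS in the middle of a line,
# so the misc_feature line is counted as well.
printf '%s\n' 'CDS' 'rep_origin' > fealist_unanchored.txt
unanchored_count=$(grep -c -f fealist_unanchored.txt < sample.fea)
echo "$unanchored_count"
```

With the anchored patterns, only the CDS and rep_origin lines are printed; without the anchor, the misc_feature line is matched too.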
To get started, save this file in your as1 directory. You
will need to make this file executable in order to run the script:
By default, the current working directory is not in your
$PATH. Therefore, when we run a script in the current
directory, we have to precede its name with ./
to tell the shell that the script is in the current directory.
In this example, output would be written to a file called Corynebacterium_ulcerans0102.fea.tsv.
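The two steps described above (making a script executable, then running it with a leading ./) can be sketched as follows. The script hello.sh is a hypothetical stand-in created in a scratch directory; it is not fea2tsv.sh, and chmod u+x is one of several ways to add execute permission.

```shell
# Work in a scratch directory with a stand-in script (hello.sh is hypothetical).
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' '#!/bin/sh' 'echo "script ran"' > hello.sh

# Add execute permission for the file's owner.
chmod u+x hello.sh

# The leading ./ tells the shell to look in the current directory
# instead of searching the directories listed in $PATH.
output=$(./hello.sh)
echo "$output"
```

The same two steps would presumably apply to fea2tsv.sh once it is saved in your as1 directory.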
Your job is to modify the script as follows:
output file name
For each genome, the goal is to quantify the transcriptional
bias, that is, the tendency for coding sequences to be transcribed
on either the forward or reverse strands, for each of the two
regions, A and B, as illustrated in Figure 2 above.
a. Decide on a cutoff row that
delineates the junction between regions A and B
You can find the length of the genome on the LOCUS line of the
GenBank file for each genome. This information is also found in
Artemis, using View --> Overview. The replication
terminus can be assumed to be at the halfway point on the circle,
opposite the replication origin. That is, if the origin is
position 1, and the sequence is length L, then the terminus would
be at position L/2. For example, if the sequence was 2,500,000
bases long, the terminus would be at position 1,250,000.
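As a rough sketch, the genome length can also be pulled from the LOCUS line on the command line, since the length is the third whitespace-separated field of that line. The file sample.gbk below is a made-up two-line stand-in for a GenBank file; the accession and date are hypothetical.

```shell
# Hypothetical LOCUS line; the accession and date are made up for illustration.
workdir=$(mktemp -d)
printf '%s\n' \
  'LOCUS       XX000001             2500000 bp    DNA     circular BCT 01-JAN-2018' \
  'DEFINITION  Example genome.' > "$workdir/sample.gbk"

# The sequence length is the third whitespace-separated field of the LOCUS line.
length=$(awk '/^LOCUS/ { print $3; exit }' "$workdir/sample.gbk")

# With the origin at position 1, the terminus is taken to be at L/2
# (integer division, so an odd length rounds down).
halfway=$((length / 2))
echo "L = $length, terminus = $halfway"
```

For the 2,500,000-base example this prints a terminus of 1250000.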
If the replication origin is at
a position other than 1, you will have to calculate the
terminus based on that position.
Given the halfway point H = L/2 and the replication origin at
position R, the location of the terminus T is
if R > H, T = R - H
otherwise, T = R + H
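The terminus calculation can be sketched as a small shell function. This is an illustration under stated assumptions, not the course's official method: R denotes the origin position, the terminus is taken to lie half a circle away from R, so T = R - H when R > H and T = R + H otherwise. The function name terminus and the example positions are hypothetical.

```shell
# terminus R L: print the position opposite origin R on a circle of length L.
# Assumption: T = R - H when R > H, otherwise T = R + H, with H = L/2.
terminus() {
    origin=$1
    length=$2
    half=$((length / 2))
    if [ "$origin" -gt "$half" ]; then
        echo $((origin - half))
    else
        echo $((origin + half))
    fi
}

t_late=$(terminus 2000000 2500000)   # origin in the second half of the sequence
t_early=$(terminus 500000 2500000)   # origin in the first half of the sequence
echo "$t_late $t_early"
```

Under this sketch an origin at position 1 gives L/2 + 1 rather than L/2, a one-base difference that is negligible at genome scale.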
Next, scroll down the rows of your spreadsheet to roughly the
halfway point. Look for a coding sequence whose coordinates
overlap the terminus. This row would be the last row in region A.
The next row would be the starting point for region B. For
example, if there were 4000 CDS sequences total, region A might
span rows 1 through 1987 in the spreadsheet, and region B would
span rows 1988 through 4000.
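If you would rather not scroll, the cutoff row can be located with awk. The sketch below assumes a tab-separated file with the start coordinate in column 2, the end coordinate in column 3 and the strand in column 4, sorted by start coordinate, with the origin at position 1. The column layout, the file sample.tsv and the terminus value are all hypothetical, so adjust them to match your own TSV file.

```shell
workdir=$(mktemp -d)
# Hypothetical TSV: feature, start, end, strand (tab-separated), sorted by start.
printf 'CDS\t%s\t%s\t%s\n' \
  100 900 f \
  1000 1900 f \
  2000 2600 c \
  2700 3500 c \
  3600 4800 c > "$workdir/sample.tsv"

terminus=2500   # hypothetical terminus position T

# Last row whose start coordinate is at or before T: either the CDS that
# overlaps the terminus, or the last CDS before it if T falls between genes.
cutoff=$(awk -F '\t' -v t="$terminus" '$2 <= t { row = NR } END { print row }' "$workdir/sample.tsv")
echo "Region A: rows 1-$cutoff; region B starts at row $((cutoff + 1))"
```

Here the third row (2000 to 2600) overlaps the terminus, so it is reported as the last row of region A.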
b. Calculate the transcriptional bias
for regions A and B.
The transcriptional strand bias (TSB) is the degree to which the direction of transcription is skewed either to the forward strand or the reverse strand. It could be calculated as a ratio of the difference between numbers of CDS features on the forward and reverse strands to the total number of coding sequences in each region. That is,
TSB = (f-c)/(f+c)
Suppose you had the following data in your spreadsheet:
If the CDS sequences in region A spanned the first 5 rows, and
those in region B spanned the last 5 rows, you could calculate the
TSB values for each region.
For example, to calculate TSB for region A in LibreOffice Calc, you could count the number of CDS features on the forward strand using the formula =COUNTIF(D1:D5,"f"). Similarly, the number of CDS features on the complementary strand would be counted using =COUNTIF(D1:D5,"c"). Region B would be calculated similarly. The results would be shown in a table for each species.
f(1:5) = 5
c(1:5) = 0
f(6:10) = 1
c(6:10) = 4
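As a cross-check on the spreadsheet formulas, the same counts and TSB values can be computed with awk. This sketch uses a made-up 10-row file, sample.tsv, with the strand in column 4 (matching column D in the COUNTIF formulas) and treats region B as rows 6 through 10, the last five rows. The coordinates and the helper name tsb are hypothetical.

```shell
workdir=$(mktemp -d)
# Hypothetical 10-row TSV with the strand (f or c) in column 4.
# Coordinates are made up; only the strand column matters here.
printf 'CDS\t%s\t%s\t%s\n' \
  100 400 f   500 900 f   1000 1300 f   1400 1800 f   1900 2200 f \
  2300 2600 c 2700 3000 c 3100 3500 f   3600 3900 c   4000 4400 c \
  > "$workdir/sample.tsv"

# tsb FIRST LAST FILE: count f and c in column 4 for rows FIRST..LAST
# and print f, c and TSB = (f-c)/(f+c).
# Assumes the row range contains at least one f or c entry.
tsb() {
    awk -F '\t' -v first="$1" -v last="$2" '
        NR >= first && NR <= last && $4 == "f" { f++ }
        NR >= first && NR <= last && $4 == "c" { c++ }
        END { printf "f=%d c=%d TSB=%.2f\n", f, c, (f - c) / (f + c) }
    ' "$3"
}

region_a=$(tsb 1 5 "$workdir/sample.tsv")
region_b=$(tsb 6 10 "$workdir/sample.tsv")
echo "A: $region_a"
echo "B: $region_b"
```

With these counts, region A gives TSB = (5-0)/(5+0) = 1.00 and region B gives TSB = (1-4)/(1+4) = -0.60.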
Save your spreadsheet in LibreOffice Calc format, by choosing
File --> Save As. In the Save window, choose "ODF Spreadsheet
(.ods)" as the format, and Save. For example, if the file you read
into the spreadsheet was Corynebacterium_ulcerans0102.fea.CDS.tsv,
the file would be exported to
6. (3 points) Presentation.
Part of the grade will be determined by the quality of your
web page(s) for the assignment, including:
2. Do all work in the as1
directory. That way, all your files will already be where they
need to be.
Your report should include the following:
1. Links to your tracks file and your fea2tsv.sh script, and a link to your AS1seqs_1.nam file.
2. For each genome, present your results in a 2-column table, as shown in the sample file. You are expected to follow the file naming conventions used above.
3. A discussion of the main findings from your data. The questions in part 5 above are a starting point, but feel free to add additional observations, explanations, or to state hypotheses arising from the observations. Feel free to make tables, charts, graphs, or anything that will get your points across.
1. Create a new HTML file called as1/as1.html. Your web page for Assignment 1 should take the form of a report that makes it easy to figure out what you did.
2. Make all files in the as1 directory world-readable. (chmod a+r *)
3. Edit either PLNT4610.html or PLNT7690.html to include a link to as1/as1.html.
4. In the Firefox or SeaMonkey browser, go to your home page and follow all hypertext links to your assignment, and test all links to your output files.
On the day the assignments are due, I should be able to just go
to each person's web site and find the output. You don't need to
send me an email message saying that your assignment is complete.
If you choose not to hand in this assignment, you don't need to do
1. Your work is assumed to be your own original work. All University policies regarding academic integrity apply.
2. Show your work. For example, a spreadsheet that had the final answer typed in, rather than calculated by a formula, would not provide any evidence that you had actually done the work.