Assignment 2 (Oct. 17,
This assignment is worth 20% of the course grade.
Due: Thursday November 2, 2017.
Goal: To develop a strategy that will estimate the
number of members for any middle repetitive sequence in a
genome, given a representative sample of sequences from that genome.
According to Venter et al. [1], the AluI family of
interspersed repetitive sequences makes up about 10% of the human
DNA, which is 3 x 10^9 bp/haploid genome. In-situ hybridization
to chromosome spreads also shows that AluI sequences are present
almost everywhere in the human genome, on all chromosomes.
However, for most species, a complete genomic sequence is not
available. A fundamental principle of statistics is that given a
sufficiently large sample, the statistical properties of the sample
should approximate the statistical properties of the population from
which it was taken. Given only a sample of a larger genome, can you
develop a strategy to estimate the number of members for any middle
repetitive sequence? That is, we don't always have the luxury of
sequencing a complete genome. How can we get the same result from
looking at a sample? Trying this strategy on the human genome
should calibrate the reliability of the approach. Think of the
computer as a laboratory, and sequence data as a model that can be
manipulated and tested experimentally.
1. (3 points) Construct a synthetic
control sequence containing AluI repeats interspersed among
The first part of any experiment is to design a good control that
will demonstrate that the methodology is working properly. In this
case, we need to have a control sequence that can be used to assess
whether we can find all copies of an AluI family sequence within
larger sequences. In eukaryotic genomes, middle repetitive sequences
are typically found interspersed among single-copy sequences.
For this purpose, you are asked to concatenate together a set of
randomized sequences, interspersed with copies of the AluI subfamily
consensus sequences, found in the file AluI-subfamily-consensus.gen.
This file contains 8 AluI family sequences, representing subfamilies
of the AluI sequence found in the human genome.
Your test sequence might be organized something like this:
Start by saving all AluI consensus sequences in a fasta-format file
Next, we need to create some randomized sequences to represent the
unique-sequence DNA between Alu sequences. This can be done by
randomizing any sequence you wish (e.g., choose any DNA sequence we've
used in our tutorials) by reading a sequence into bldna and using Similarity
--> Shuffle. By default, SHUFFLE will do local
shuffling, but in our case it is better to shuffle the entire
sequence as a whole by setting WINDOW to the highest possible
setting. You will need to create enough randomized sequences to
separate the AluI sequences from each other. Save these in a file
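
For comparison, the idea of shuffling a sequence as a whole can be sketched on the command line. This is not the BioLegato SHUFFLE program, only an illustration of what whole-sequence shuffling does. It assumes the GNU tools fold and shuf are installed; the function name shuffle_seq and the short example sequence are made up.

```shell
# shuffle_seq SEQ: print a random permutation of the residues in SEQ.
# Hypothetical helper; assumes GNU coreutils (fold, shuf).
shuffle_seq() {
    # One residue per line, shuffle the lines, then join them again.
    printf '%s\n' "$1" | fold -w1 | shuf | tr -d '\n'
    printf '\n'
}

seq="ACGTACGTTTGACCA"            # made-up sequence, for illustration only
shuffled=$(shuffle_seq "$seq")
echo "$shuffled"
```

The shuffled sequence has the same length and base composition as the original, but any similarity to real sequences is destroyed, which is the property you want in spacer DNA for a control.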
To create your complete test sequence, simply copy and paste your
sequences one after the other into the file, until all sequences are
present. You could also do this in BioLegato, using Cut and Paste to
intersperse the different copies of Alu with different random
sequences, and then saving in FASTA format. In the FASTA file, you
will need to delete the name lines between sequences. Finally, add a
name line as the first line of the FASTA file, giving the sequence
whatever name you want. The final file should be called
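
If you prefer the command line to copy and paste, the steps of removing the name lines and adding a single new one might look like the sketch below. It assumes the pieces are already in the desired order in one multi-FASTA file; the file name pieces.fsa, the function merge_fasta and the toy sequences are all hypothetical.

```shell
# merge_fasta NAME FILE: write one FASTA record named NAME that contains
# every sequence line of FILE, with the original name lines removed.
merge_fasta() {
    printf '>%s\n' "$1"
    grep -v '^>' "$2"
}

# Toy demonstration with made-up sequences (not real Alu or genomic data).
tmp=$(mktemp -d)
cat > "$tmp/pieces.fsa" <<'EOF'
>random1
ACGTTGCA
>alu_copy1
GGCCGGGC
>random2
TTGACCAT
EOF
merge_fasta synthetic "$tmp/pieces.fsa" > "$tmp/merged.fsa"
cat "$tmp/merged.fsa"
```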
2. (4 points) Compare programs and
settings to find the best method for detecting AluI repeats
We have tried many programs for similarity searches. You will need
to do a bit of experimentation to see whether there are any
differences in terms of which ones find all Alu sequences, and which
miss some. The choice of program, and the settings used, may give
different results. Ideally, we want to find all Alu sequences,
without finding any false positives.
This is most easily done using bldna to compare an Alu consensus
sequence with your synthetic sequence. Note that programs in the
Similarity menu require you to read both sequences into BioLegato,
whereas those in the Database menu require the query sequence to be
in BioLegato, and the database set to "User-created database", i.e.,
your synthetic sequence is the "database".
Based on your experiments, choose a program for use in later steps.
Explain your choice.
3. (4 points) Construct a target
The idea is to assemble a representative sample of human sequences
chosen from NCBI. In a carefully chosen set of sequences, even
a dataset of 1000 kb is adequate to make the estimate. How
you choose your sequences is far more important. See the tutorial on
of Related Sequences for examples of how to find and retrieve
sequences. You should use blncbi to construct a query that will
retrieve a list of suitable sequences, from which you can
choose a representative sample.
Points to consider:
- Which types of sequences are best for this purpose? eg.
genomic, cDNA, EST, GSS, BACs, RefSeq Genomic, RefSeq RNA etc.
Which ones should definitely NOT be used?
- How do you narrow down a list of sequences?
- Does the size of fragments matter? Chromosomal location?
Number of sequences?
- How do we avoid introducing systematic biases into our
- Are there sequencing gaps ie. stretches of sequences with
multiple "N"s added to indicate gaps between contigs?
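
One quick way to check a candidate file for sequencing gaps is to count the N characters in its sequence lines. The sketch below assumes a FASTA-format file; the function name count_gap_bases and the toy entry are made up for illustration.

```shell
# count_gap_bases FILE: count N/n characters in the sequence lines of a
# FASTA file. Name lines (starting with '>') are skipped so that letters
# in sequence names are not counted.
count_gap_bases() {
    grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}

# Toy demonstration with a made-up entry.
tmp=$(mktemp -d)
cat > "$tmp/sample.fsa" <<'EOF'
>toy_entry made-up example
ACGTNNNNNNACGT
ttgannTTGA
EOF
count_gap_bases "$tmp/sample.fsa"   # prints 8
```

A large count relative to the length of the entry suggests that much of the sequence is gap filler rather than real data.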
4. (4 points) Test your approach on your
Your goal is to answer the question, "What percentage of the sample
consists of AluI family sequences?" If the sample is representative,
then it should provide a good estimate for the genome as a whole.
It may be instructive to look at some of the sample sequences using
DXHOM, since presumably, all similarities will show up as diagonals.
DXHOM would therefore serve as a check on whether the more automated
ways of looking for repetitive sequences are actually finding them.
If you are missing a significant number of AluI sequences, it may be
possible to modify your search, either by using a different
program, or changing search parameters.
How to count AluI sequences in GenBank entries
Since you may have to examine a lot of output it would be
useful to have a way to easily count the number of
Alu sequences annotated in a GenBank file, or the number
of hits in an output file. You will need to inspect
GenBank or output files to see how annotated Alu
sequences, or hits, are listed, respectively. For example,
in GenBank entry AE006463, AluI sequences are annotated
with the feature qualifier /rpt_family="SINE/Alu".
Therefore, the following statement could be used to
count the number of AluI sequences annotated in a file
containing this entry:
grep '/rpt_family="SINE/Alu"' AE006463.gen | wc -l
wc is a command that reports the
number of words, characters and lines in a file. Used with
the -l option, it only reports the number of lines found
in a file. Since the actual terms used to annotate
sequences vary from author to author, you would have to
look at a given GenBank file to see how sequences were
annotated. It is also important to realize that not all
authors of sequences will bother to annotate AluI repeats,
even when they are present.
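
As a small worked example, the sketch below runs the grep and wc pipeline on a toy feature table, and also shows grep -c, which counts matching lines in a single step. The excerpt is made up for illustration and is not copied from AE006463.

```shell
tmp=$(mktemp -d)
# Made-up feature-table excerpt; real entries will be laid out differently.
cat > "$tmp/toy.gen" <<'EOF'
     repeat_region   101..400
                     /rpt_family="SINE/Alu"
     repeat_region   900..1150
                     /rpt_family="LINE/L1"
     repeat_region   2000..2290
                     /rpt_family="SINE/Alu"
EOF

# Pipeline form: grep prints the matching lines, wc -l counts them.
grep '/rpt_family="SINE/Alu"' "$tmp/toy.gen" | wc -l

# Single-command form: grep -c prints the number of matching lines.
alu_count=$(grep -c '/rpt_family="SINE/Alu"' "$tmp/toy.gen")
echo "$alu_count"
```

Both forms count lines, not repeats, so they only agree with the number of annotated repeats when each repeat is annotated on exactly one matching line.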
How to count hits in your output files
The above method will let you count the annotated AluI
sequences in your test data, for comparison with your own
results. Different programs display hits in different ways.
Based on inspection of your output files, choose an
appropriate regular expression to use with grep that will
enable you to easily count hits using grep and wc.
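
As an illustration only, suppose your chosen program writes one line per hit, each beginning with the name of the query sequence, after a header line. That layout, the query name AluQuery and the file search.out are assumptions made for this sketch; inspect your own output before choosing a pattern.

```shell
tmp=$(mktemp -d)
# Made-up output: a header line, then one line per hit.
printf '%s\n' \
  '# query  subject  start  end' \
  'AluQuery sample1  120    410' \
  'AluQuery sample1  2300   2590' \
  'AluQuery sample2  55     340' > "$tmp/search.out"

# Anchor the pattern at the start of the line (^) so that only hit
# lines are counted, not headers or lines mentioning the query elsewhere.
hits=$(grep -c '^AluQuery' "$tmp/search.out")
echo "$hits"
```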
Based on the number of hits found in your sample sequences using the
program you have chosen, it is straightforward to calculate the
percentage of AluI sequences in the sample.
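
One way to set up the calculation is to divide the total number of bases covered by hits by the total number of bases in the sample. The sketch below does this with awk. The function name percent_alu and all of the numbers are made up; in particular, the mean hit length is something you would measure from your own output, not a value given here.

```shell
# percent_alu ALU_BP TOTAL_BP: percentage of the sample covered by hits.
percent_alu() {
    awk -v alu="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * alu / total }'
}

# Made-up numbers: 250 hits averaging 280 bp in a 1000 kb sample.
hits=250
mean_len=280
total_bp=1000000
percent_alu $((hits * mean_len)) "$total_bp"   # prints 7.0
```

Multiplying the hit count by a mean length is a simplification. Overlapping or partial hits would be counted more accurately by summing the actual hit lengths.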
5. (3 points) Conclusions - What have you
- Summarize your results in whatever way gets the point across
most clearly, using tables or figures if that helps. Show all
- The meaning of your results.
- What important assumptions went into the experiment? How
might these assumptions have influenced your results? For
example, did your search method work as well with real
sequences, compared to the results from your synthetic test
- If your results differ from the reported value of 10%, can
you hypothesize why that might be? That is, if you missed a
significant number of AluI sequences, can you make any
generalizations about why?
- Do the results tell you anything else that is significant?
6. (2 points) Presentation.
Part of the grade will be determined by the quality of your
web page(s) for the assignment, including:
- The assignment page(s) must be accessible at
No other URL will be accepted.
- All links must work, and all graphics must display. Each time
I have to contact you to fix something, 1 point will be
deducted. You get no credit for anything I can't access.
- Pay attention to the organizational and stylistic hints found
2. Do what it takes to make it easy to read and to
understand the points you wish to get across.
How to get
1. Create a directory called either
public_html/PLNT4610/as2 or public_html/PLNT7690/as2. Make this
directory world-readable and world executable.
2. Do all work in the as2
directory. That way, all your files will already be where they
need to be.
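
The directory setup in step 1 might be done as in the sketch below. The function name make_as2_dir is hypothetical, and the demonstration runs in a scratch directory so that it is safe to try; for real use you would pass your home directory as the first argument and your own course number as the second.

```shell
# make_as2_dir BASE COURSE: create BASE/public_html/COURSE/as2 and make it
# readable and executable (searchable) by everyone.
make_as2_dir() {
    dir="$1/public_html/$2/as2"
    mkdir -p "$dir" && chmod a+rx "$dir"
    printf '%s\n' "$dir"
}

# Demonstration in a scratch directory; for real use, pass "$HOME" as BASE.
scratch=$(mktemp -d)
as2=$(make_as2_dir "$scratch" PLNT4610)
ls -ld "$as2"
```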
What you need to complete your assignment
Your report should include the following:
- Links to data:
- your AluI-subfamily-consensus.fsa, random.fsa, and
- your target dataset
- links to any output files used to generate your results.
These could include the results of NCBI searches, or
similarity search output.
- Enough detail on your experimental methods that someone
skilled in bioinformatics could reproduce your work. This
requires a bit of judgement. Too much detail can result in a
report that is impenetrable.
- Your results. It will probably be useful to include a table
summarizing your results, along with a description of your
How to post it
1. Create a new HTML file called as2/as2.html. Your web page
for Assignment 2 should take the form of a report that makes it
easy to figure out what you did.
2. Make all files in the as2 directory world-readable. (chmod
3. Edit either PLNT4610.html or PLNT7690.html to include a link
4. In the Firefox or SeaMonkey browser, go to your home page
and follow all hypertext links to your assignment, and test all
links to your output files.
5. If you paste excerpts of output into a web page, change the
output section to a fixed font such as Courier, or
set the style to "Preformat". The output from most sequence
programs assumes that each character takes up an equal amount of
width, which is not true for proportional fonts such as Helvetica or Times.
Academic integrity: Your work is assumed to be your own
original work. All University policies regarding academic
On the day the assignments are due, I should be able to just go
to each person's web site and find the output. You don't need to
send me an email message saying that your assignment is complete.
If you choose not to hand in this assignment, you don't need to do
[1] Venter et al. (2001) The sequence of the human genome.
Science 291: 1304-1351.