PLNT4610/PLNT7690 BIOINFORMATICS

Assignment 2

Genome Organization

This assignment is worth 20% of the course grade.

Due: Tuesday November 5, 2024.

Goal: To develop a strategy that will estimate the number of members for any middle repetitive sequence in a genome, given a representative sample of sequences from that genome.

Rationale

According to Venter et al.¹, the AluI family of interspersed repetitive sequences makes up about 10% of human DNA, which totals 3 x 10⁹ bp per haploid genome. In-situ hybridization to chromosome spreads also shows that AluI sequences are present almost everywhere in the human genome, on all chromosomes. However, for most species, a complete genomic sequence is not available. A fundamental principle of statistics is that, given a sufficiently large sample, the statistical properties of the sample should approximate those of the population from which it was taken. Given only a sample of a larger genome, can you develop a strategy to estimate the number of members for any middle repetitive sequence? In other words, since we don't always have the luxury of sequencing a complete genome, how can we get the same result from a sample? Trying this strategy on the human genome, where the answer is already known, should calibrate the reliability of the approach. Think of the computer as a laboratory, and sequence data as a model that can be manipulated and tested experimentally.

1. (3 points) Construct a synthetic control sequence containing AluI repeats interspersed among random sequence.

The first part of any experiment is to design a good control that will demonstrate that the methodology is working properly. In this case, we need to have a control sequence that can be used to assess whether we can find all copies of an AluI family sequence within larger sequences. In eukaryotic genomes, middle repetitive sequences are typically found interspersed among single-copy sequences. For this purpose, you are asked to concatenate together a set of randomized sequences, interspersed with copies of the AluI subfamily consensus sequences, found in the file AluI-subfamily-consensus.gen. This file contains 8 AluI family sequences, representing subfamilies of the AluI sequence found in the human genome.

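Your test sequence might be organized something like this: a stretch of random sequence, then an AluI copy, then more random sequence, then another AluI copy, and so on.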

Start by saving all AluI consensus sequences in a fasta-format file AluI-subfamily-consensus.fsa.

Next, we need to create some randomized sequences to represent the unique-sequence DNA between Alu sequences. This can be done by randomizing any sequence you wish (e.g., any DNA sequence we've used in our tutorials): read the sequence into bldna and use Similarity --> Shuffle. By default, SHUFFLE will do local shuffling, but in our case it is better to shuffle the entire sequence as a whole by setting WINDOW to the highest possible setting. You will need to create enough randomized sequences to separate the AluI sequences from each other. Save these in a file called random.fsa.

To create your complete test sequence, simply copy and paste your sequences one after the other into the file, until all sequences are present. You could also do this in BioLegato, using Cut and Paste to intersperse the different copies of Alu with different random sequences, and then saving in FASTA format. In the FASTA file, you will need to delete the name lines between sequences. Finally, add a name line as the first line of the FASTA file, giving the sequence whatever name you want. The final file should be called synthetic.fsa.
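The copy-and-paste procedure above can also be sketched on the command line. This is only one possible approach, not a required method: the function name build_synthetic and the sequence name "synthetic" are made up for illustration, and it assumes random.fsa and AluI-subfamily-consensus.fsa are ordinary FASTA files. It alternates one random sequence with one AluI sequence, drops all the original name lines, and writes a single name line at the top.

```sh
# build_synthetic RANDOM_FASTA ALU_FASTA
# Hypothetical helper: interleave random spacers with AluI copies and
# print one FASTA record (random1, Alu1, random2, Alu2, ...).
build_synthetic() {
    awk '
        # A name line starts a new record; records are counted per input file.
        /^>/ { n[FILENAME]++; next }
        # Sequence lines are appended to the current record of that file.
        { gsub(/[ \t\r]/, "")
          seq[FILENAME, n[FILENAME]] = seq[FILENAME, n[FILENAME]] $0 }
        END {
            rnd = ARGV[1]; alu = ARGV[2]
            max = (n[rnd] > n[alu]) ? n[rnd] : n[alu]
            out = ""
            for (i = 1; i <= max; i++) {
                if (i <= n[rnd]) out = out seq[rnd, i]
                if (i <= n[alu]) out = out seq[alu, i]
            }
            print ">synthetic"                 # single name line for the whole record
            for (p = 1; p <= length(out); p += 60)
                print substr(out, p, 60)       # wrap the sequence at 60 characters per line
        }
    ' "$1" "$2"
}
```

For example, `build_synthetic random.fsa AluI-subfamily-consensus.fsa > synthetic.fsa` would produce the test file. Because the positions of the AluI copies are known by construction, the output is a convenient control for step 2.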

2. (4 points) Compare programs and settings to find the best method for detecting AluI repeats

We have tried many programs for similarity searches. You will need to do a bit of experimentation to see whether there are any differences in terms of which ones find all Alu sequences, and which miss some. The choice of program, and the settings used, may give different results. Ideally, we want to find all Alu sequences, without finding any false positives.

This is most easily done using bldna to compare an Alu consensus sequence with your synthetic sequence. Note that programs in the Similarity menu require you to read both sequences into BioLegato, whereas those in the Database menu require the query sequence to be in BioLegato and the database to be set to "User-created database", i.e., your synthetic sequence is the "database".

Based on your experiments, choose a program for use in later steps. Explain your choice.

3. (4 points) Construct a target sequence dataset

The idea is to assemble a representative sample of human sequences chosen from NCBI. In a carefully chosen set of sequences, even a dataset of 1000 kb is adequate to make the estimate. How you choose your sequences is far more important. See the tutorial on Creating Datasets of Related Sequences Part 2 for examples of how to find and retrieve sequences. You should use blncbi to construct a query that will retrieve a list of suitable sequences, from which you can choose a representative sample.

Points to consider:

Which types of sequences are best for this purpose, e.g., genomic, cDNA, EST, GSS, BACs, RefSeq RNA? Which ones should definitely NOT be used?
How do you narrow down a list of sequences?
Does the size of fragments matter? Chromosomal location? Number of sequences?
How do we avoid introducing systematic biases into our dataset?
Are there sequencing gaps, i.e., stretches of "N"s added to indicate gaps between contigs?
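The last point can be checked quickly before committing to a dataset. The sketch below is one way to do it, assuming the target dataset is a FASTA file; the function name n_report is made up for illustration. For each sequence it prints the name, the total length, and the number of N characters, separated by tabs, so that entries dominated by gaps stand out.

```sh
# n_report FASTA
# Hypothetical helper: print "name <TAB> length <TAB> number of N/n" for each record.
n_report() {
    awk '
        function report() { if (name != "") printf "%s\t%d\t%d\n", name, len, ns }
        # Name line: print the previous record, then start a new one.
        /^>/ { report(); name = substr($1, 2); len = 0; ns = 0; next }
        # Sequence line: add its length, then count the Ns by deleting them.
        { gsub(/[ \t\r]/, ""); len += length($0); ns += gsub(/[Nn]/, "") }
        END { report() }
    ' "$1"
}
```

The same per-sequence lengths also show at a glance whether the fragments in the sample are of comparable size.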

4. (4 points) Test your approach on your target dataset

Your goal is to answer the question, "What percentage of the sample consists of AluI family sequences?" If the sample is representative, then it should provide a good estimate for the genome as a whole.

It may be instructive to look at some of the sample sequences using DXHOM, since presumably all similarities will show up as diagonals. DXHOM would therefore serve as a check on whether the more automated ways of looking for repetitive sequences are actually finding them. If you are missing a significant number of AluI sequences, it may be possible to modify your search, either by using a different program or by changing search parameters.

How to count AluI sequences in GenBank entries

Since you may have to examine a lot of output, it would be useful to have a way to easily count the number of AluI sequences annotated in a GenBank file, or the number of hits in an output file. You will need to inspect GenBank or output files to see how annotated AluI sequences, or hits, are listed, respectively. For example, in GenBank entry AE006463, AluI sequences are annotated with the feature qualifier /rpt_family="SINE/Alu". Therefore, the following statement could be used to count the number of AluI sequences annotated in a file containing this entry:

{mars:/home/plants/frist/courses/bioinformatics/as2/2015revision}grep '/rpt_family="SINE/Alu"' AE006463.gen | wc -l
317

wc is a command that reports the number of words, characters and lines in a file. Used with the -l option, it reports only the number of lines. Since the actual terms used to annotate sequences vary from author to author, you would have to look at a given GenBank file to see how sequences were annotated. It is also important to realize that not all authors of sequences will bother to annotate AluI repeats, even when they are present.
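When the dataset contains many GenBank files, the same count can be repeated for each file. The sketch below is a minimal example; the function name count_annotated_alu is made up for illustration, and the search string is the one seen in AE006463, so it will report 0 for entries whose authors used a different annotation or none at all. grep -c prints the number of matching lines, which gives the same number as piping grep into wc -l.

```sh
# count_annotated_alu FILE...
# Hypothetical helper: print "file <TAB> count" for each GenBank file given.
count_annotated_alu() {
    for f in "$@"; do
        # grep -c counts the lines that contain the qualifier
        printf '%s\t%s\n' "$f" "$(grep -c '/rpt_family="SINE/Alu"' "$f")"
    done
}
```

For example, `count_annotated_alu *.gen` lists one count per file, which makes it easy to spot entries with no Alu annotation.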

How to count hits in your output files

The above method will let you count the annotated AluI sequences in your test data, for comparison with your own results. Various similarity programs display hits in different ways. Some programs will give you a summary of the number of hits. For those that do not, you can probably use grep and 'wc -l' to count hits. Based on inspection of your output files, you would need to choose an appropriate regular expression to use with grep that will enable you to easily count hits.

Based on the number of hits found in your sample sequences using the program you have chosen, it is straightforward to calculate the percentage of AluI sequences in the sample.
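One way to make the calculation explicit is to work in bases rather than hit counts: percentage = 100 x (bases covered by hits) / (total bases in the sample). The sketch below assumes you have reduced your search output to a hypothetical three-column file (sequence name, start, end, with 1-based inclusive coordinates and start <= end); no program produces exactly this file, so you would have to extract the columns yourself. All function names are made up for illustration. Overlapping hits on the same sequence are merged so that no base is counted twice.

```sh
# total_bases FASTA: number of sequence characters, ignoring name lines.
total_bases() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# covered_bases HITS: bases covered by hits, merging overlaps per sequence.
covered_bases() {
    sort -k1,1 -k2,2n "$1" | awk '
        # New sequence, or a hit starting beyond the open interval: close it.
        $1 != seq || $2 > hi {
            if (seq != "") total += hi - lo + 1
            seq = $1; lo = $2; hi = $3; next
        }
        # Overlapping hit: extend the open interval if it reaches further.
        $3 > hi { hi = $3 }
        END { if (seq != "") total += hi - lo + 1; print total + 0 }
    '
}

# percent_alu HITS FASTA: percentage of the sample covered by hits.
percent_alu() {
    awk -v c="$(covered_bases "$1")" -v t="$(total_bases "$2")" \
        'BEGIN { if (t > 0) printf "%.2f\n", 100 * c / t; else print "NA" }'
}

# estimate_copies PERCENT GENOME_BP REPEAT_BP: implied number of family members.
estimate_copies() {
    awk -v p="$1" -v g="$2" -v r="$3" 'BEGIN { printf "%.0f\n", p * g / (100 * r) }'
}
```

As a rough check on the arithmetic, taking the 10% figure and the genome size of 3 x 10⁹ bp from the Rationale, and assuming a typical AluI length of roughly 300 bp (a value not given in this assignment), `estimate_copies 10 3000000000 300` gives about one million copies.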

5. (3 points) Conclusions - What have you learned?

Summarize your results in whatever way gets the point across most clearly, using tables or figures if that helps. Show all calculations.
Discuss the meaning of your results:

What important assumptions went into the experiment? How might these assumptions have influenced your results? For example, did your search method work as well with real sequences, compared to the results from your synthetic test sequence?
If your results differ from the reported value of 10%, can you hypothesize why that might be? That is, if you missed a significant number of AluI sequences, can you make any generalizations about why?
Do the results tell you anything else that is significant?

6. (2 points) Presentation.

Part of the grade will be determined by the quality of your web page(s) for the assignment, including:

The assignment page(s) must be accessible at
http://home.cc.umanitoba.ca/~userid/PLNT4610/as2/as2.html or http://home.cc.umanitoba.ca/~userid/PLNT7690/as2/as2.html
No other URL will be accepted.
All links must work, and all graphics must display. Each time I have to contact you to fix something, 1 point will be deducted. You get no credit for anything I can't access.
Pay attention to the organizational and stylistic hints found in Lecture 2. Do what it takes to make it easy to read and to understand the points you wish to get across.

How to get started

1. Create a directory called either public_html/PLNT4610/as2 or public_html/PLNT7690/as2. Make this directory world-readable and world-executable.

2. Do all work in the as2 directory. That way, all your files will already be where they need to be.
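The two steps above might look like this on the command line. This is a sketch only: the function name setup_as2 and its optional second argument are made up for illustration, and it assumes that the parent directories (public_html and the course directory) are already accessible to the web server from earlier work.

```sh
# setup_as2 COURSE [BASE]
# Hypothetical helper: create BASE/public_html/COURSE/as2 (BASE defaults to
# $HOME), make it world-readable and world-executable, and print its path.
setup_as2() {
    course=$1
    base=${2:-$HOME}
    dir="$base/public_html/$course/as2"
    mkdir -p "$dir" || return 1
    chmod a+rx "$dir"        # r: list the directory, x: enter it
    printf '%s\n' "$dir"
}
```

For example, `cd "$(setup_as2 PLNT4610)"` creates the directory if needed and moves into it, so that all later files are saved in the right place.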

What you need to complete your assignment

Your report should include the following:

Links to data:

your AluI-subfamily-consensus.fsa, random.fsa, and synthetic.fsa files.

your target dataset

links to any output files used to generate your results. These could include the results of NCBI searches, or similarity search output.

Enough detail on your experimental methods that someone skilled in bioinformatics could reproduce your work. This requires a bit of judgement. Too much detail can result in a report that is impenetrable.

Your results. It will probably be useful to include a table summarizing your results, along with a description of your findings.

How to post it

1. Create a new HTML file called as2/as2.html. Your web page for Assignment 2 should take the form of a report that makes it easy to figure out what you did.

2. Make all files in the as2 directory world-readable. (chmod a+r *)

3. Edit either PLNT4610.html or PLNT7690.html to include a link to as2/as2.html.

4. In the Firefox or SeaMonkey browser, go to your home page and follow all hypertext links to your assignment, and test all links to your output files.

5. If you paste excerpts of output into a web page, change the output section to a fixed font such as Courier, or set the style to "Preformat". The output from most sequence programs assumes that each character takes up an equal amount of width, which is not true for proportional fonts such as Helvetica or Times.

Academic integrity: Your work is assumed to be your own original work. All University policies regarding academic integrity apply.

Evidence of cheating will include, but is not limited to:

identical files submitted
identical wording
identical results

On the day the assignments are due, I should be able to just go to each person's web site and find the output. You don't need to send me an email message saying that your assignment is complete. If you choose not to hand in this assignment, you don't need to do anything.

References

1. Venter et al. (2001) The sequence of the human genome. Science 291: 1304-1351.