Optimal Syntenic Layouter

Summary

We developed the software OSLay for ordering and sorting contigs of unfinished genome assemblies by exploiting the synteny between related sequences at the nucleotide level.
In contrast to existing tools, OSLay can even use fragmented assemblies as a reference to detect supercontigs. Our approach enables the meta-assembly of a genome, i.e. the use of data obtained from two different sequencing techniques to close gaps. Besides facilitating the visual validation of the current assembly status, results can be imported directly into the assembly editing tool Consed to support the generation of applicable primer pairs.

Idea

Generally speaking, the OSL algorithm tries to find a contig layout by moving and flipping contigs of a target assembly, for a fixed ordering of the reference assembly, so as to elongate local diagonals provided by overlapping sequence matches.
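
To illustrate what flipping a contig means for its matches, the following minimal Python sketch mirrors the x coordinates of a match inside a contig. The names flip_match and slope_sign, the tuple layout and the 1-based coordinates are assumptions made for this illustration; they are not taken from the OSLay source.

```python
def flip_match(match, contig_length):
    """Mirror the x coordinates of a match inside a flipped contig.

    match is (x1, y1, x2, y2) with 1-based positions (hypothetical format).
    """
    x1, y1, x2, y2 = match
    return (contig_length - x1 + 1, y1, contig_length - x2 + 1, y2)


def slope_sign(match):
    """Return +1 for a rising diagonal and -1 for a falling one."""
    x1, y1, x2, y2 = match
    return 1 if (y2 - y1) * (x2 - x1) > 0 else -1


falling = (1, 500, 200, 301)        # diagonal running from upper left to lower right
rising = flip_match(falling, 1000)  # the contig of length 1000 is flipped
```

After the flip, the falling diagonal becomes a rising one, so it can line up with the rising diagonals of neighbouring contigs.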

Download...

Installers for several operating systems can be downloaded here.

Installation

Uninstallation

Start OSLay:

Four windows will appear, providing different views and options. An additional message window is available by clicking Window->Messages... in the result window.

A typical workflow looks like this:

  1. Import all files or open an OSLay .cgv file
  2. Adapt parameters
  3. Obtain final result
  4. Configure visual appearance if needed
  5. Grab and export your output files
  6. Save your project
Find some useful hints on applying OSLay here.

Import Files:

You need to have three files at hand to get OSLay running.

  • Query FASTA File:
    Your multiFASTA file which contains DNA sequences of contigs you want to order.
  • Subject FASTA File:
    Your FASTA (multiFASTA) file of the subject sequence (assembly) which guides the contig layout process.
  • BLAST Matches file:
    A BLAST file (standard or tabular format) containing the alignment of the query FASTA file against the subject FASTA file, computed with BLASTN. (A NUCmer .coords file is also readable.)
Once all filenames are provided, click 'Apply' and OSLay starts to compute a syntenic layout.
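
If you want to inspect a tabular BLAST file before importing it, a short Python sketch like the one below may help. It assumes the standard 12-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, e-value, bit score). The Match type and the function parse_blast_tabular are hypothetical names; OSLay's own parser may work differently.

```python
from typing import Iterable, List, NamedTuple


class Match(NamedTuple):
    query: str
    subject: str
    q_start: int
    q_end: int
    s_start: int
    s_end: int


def parse_blast_tabular(lines: Iterable[str]) -> List[Match]:
    """Read match coordinates from 12-column tabular BLAST output."""
    matches = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        f = line.split("\t")
        matches.append(Match(f[0], f[1], int(f[6]), int(f[7]), int(f[8]), int(f[9])))
    return matches


# Made-up example lines, not real data.
example = [
    "# a comment line",
    "contig1\tref1\t98.5\t1200\t18\t0\t1\t1200\t5001\t6200\t0.0\t2100",
    "contig2\tref1\t97.0\t800\t24\t0\t1\t800\t7799\t7000\t0.0\t1400",
]
matches = parse_blast_tabular(example)
# A match on the reverse strand has a subject start larger than its subject end.
reverse = [m for m in matches if m.s_start > m.s_end]
```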

Result Window:


Three views are shown in the result window:
  • Original Data
    This view shows the raw data as it is parsed from the BLAST result file. Thin blue lines (horizontal and vertical) indicate the contig borders. Contigs are sorted by their length.
  • Summarized Matches with Connectors
    This view shows the filtered and cleaned data. Match diagonals are smoother than in the raw data view because the numerous, short BLAST matches are now substituted by one-piece diagonals, the so-called summarized matches.
    Some cells in the comparison grid are shaded yellow, indicating that a summarized match contained in them gives rise to a connector.
    The green dots (red dots) represent locations where summarized matches touch or would touch the contig border of the target (reference) contigs on the xAxis (yAxis). They are used exclusively to order the xAxis (yAxis) assembly.
  • Syntenic Layout
    This view contains the original matches, but the connectors no longer appear because a syntenic layout has been established.
    Green boxes enclose ordered and sorted supercontigs of the target assembly.
    In case you sorted the yAxis as well, red boxes represent the computed yAxis supercontigs.
There are several ways to analyze your data visually. Use the tools provided in the menubar.
You can (from left to right):
  • undo and redo any kind of action
  • select any data object and see its coordinates and info in the status bar by clicking the black arrow
  • move any view
  • zoom in and out ('-' and '+' on keyboard)
  • zoom the data so that it fits the view window
Click on (yellow shaded) cells containing matches to get information about contigs, number of matches, etc.

Analyzing results and adapting parameters - Parameters Window:

The following settings can all be adjusted in the Parameters Window.
Typical values (bp) are given in brackets.
Generally, small values are more suitable for bacterial genome sizes (up to ~6 Mb), larger values for e.g. mammalian genomes (~120 Mb).
Hovering the mouse over the parameter names in the window shows tooltips explaining them.

After adjusting parameters, it is recommended to check visually or statistically whether the changes improved your results.

|| Summarizing Matches ||

Minimal Match Size:

Every sequence match smaller than this value will be removed from the view.

Hint: Use this filter if a lot of noise matches appear in the Original Data view which might complicate the computation.

[~100-500 bp]
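
The effect of this filter can be sketched in a few lines of Python. The tuple layout (x_start, x_end, y_start, y_end) and the choice to measure the match size on the x axis are assumptions made for this example; the name filter_short_matches is hypothetical.

```python
def filter_short_matches(matches, min_size):
    """Keep matches whose extent on the x axis is at least min_size bp."""
    return [m for m in matches if abs(m[1] - m[0]) + 1 >= min_size]


# Made-up matches as (x_start, x_end, y_start, y_end).
matches = [(1, 80, 501, 580), (100, 1300, 600, 1800), (2000, 2450, 9000, 9450)]
kept = filter_short_matches(matches, 300)  # drops the 80 bp match
```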

Maximal Distance to Contig Border:

Only summarized matches lying within this distance give rise to a connector and therefore can be locally extended.

Hint1: If fewer connectors and matches appear in the Summarized Matches view than you expected, try increasing this value.
Hint2: In some cases, decreasing this value might facilitate the computation of a contig layout because the number of false-positive connectors is reduced.

[~1000-15000 bp]
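
One way to picture how a connector position could be derived is to extend a summarized match along its diagonal until it reaches the contig border. The sketch below does this for the right border only and assumes a diagonal with slope +1 or -1; the function name right_border_connector and its arguments are hypothetical and do not describe OSLay internals.

```python
def right_border_connector(x_end, y_end, contig_length, max_distance, slope=1):
    """Project a match ending at (x_end, y_end) onto the right contig border.

    Returns the y position where the extended diagonal would touch the
    border, or None if the match ends too far away from the border.
    """
    distance = contig_length - x_end
    if distance < 0:
        raise ValueError("match ends beyond the contig border")
    if distance > max_distance:
        return None  # too far from the border: no connector
    return y_end + slope * distance


near = right_border_connector(9500, 20000, 10000, max_distance=1000)
far = right_border_connector(9500, 20000, 10000, max_distance=400)
flipped = right_border_connector(9500, 20000, 10000, max_distance=1000, slope=-1)
```

Here the match ends 500 bp before the border, so it yields a connector with a maximal distance of 1000 bp but not with 400 bp.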

Width of Diagonal Search Space:

Longer stretches of similar DNA sequences in the dot-plot usually consist of a bunch of separated, short BLAST matches. These match clusters are substituted by a one-piece diagonal for easier handling.
Only BLAST matches found within an imaginary diagonal search space get summarized.

Hint: If summarized matches appear to be shorter than the corresponding BLAST matches in the Original Data view, increase this value to extend the search space.

[~4000-15000 bp]

Maximal Gap Length between Consecutive Matches:

Sequence matches that were previously assigned to a diagonal are now traversed one after another, trying to elongate the summarized match as much as possible.
However, if the gap length between two consecutive matches is too large, the traversing stops and the summarized match ends at that position.

Hint: If summarized matches appear to be shorter than expected, increase this value.
[~10000-20000 bp]
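
The interplay of the diagonal search space and the maximal gap length can be sketched as a simple greedy chaining in Python. This is only an illustration under stated assumptions: matches are forward-strand tuples (x_start, x_end, y_start, y_end), the search space is taken as a band of band_width bp centred on the diagonal of the first match in a chain, and the gap is measured on the x axis. The function name summarize_matches is hypothetical and the actual OSLay procedure may differ.

```python
def summarize_matches(matches, band_width, max_gap):
    """Greedily merge short matches into one-piece diagonals."""
    chains = []  # each chain is [x_start, x_end, y_start, y_end]
    for x1, x2, y1, y2 in sorted(matches):
        offset = y1 - x1  # identifies the diagonal a match lies on
        for chain in chains:
            chain_offset = chain[2] - chain[0]
            gap = x1 - chain[1]
            if abs(offset - chain_offset) <= band_width / 2 and gap <= max_gap:
                # Same diagonal band and close enough: elongate the chain.
                chain[1] = max(chain[1], x2)
                chain[3] = max(chain[3], y2)
                break
        else:
            chains.append([x1, x2, y1, y2])  # start a new summarized match
    return [tuple(c) for c in chains]


# Made-up matches: three on one diagonal, one far away on the same
# diagonal, and one on a completely different diagonal.
matches = [
    (0, 1000, 5000, 6000),
    (1200, 2000, 6210, 7010),
    (2500, 3000, 7500, 8000),
    (40000, 41000, 45000, 46000),
    (500, 900, 30000, 30400),
]
summarized = summarize_matches(matches, band_width=4000, max_gap=10000)
```

The first three matches collapse into one summarized match. The fourth lies on the same diagonal but is separated by a gap larger than max_gap, so it starts a new summarized match; raising max_gap above 37000 would merge it as well.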

|| Contig Layout ||

Maximal Height Difference between Connectors:

Only connectors coming from different contigs and lying within this distance can be connected.

Hint: This value represents the permitted gap size between two contigs. Be careful when choosing large values: they might bias the result, since short contigs originally positioned between these two contigs could be skipped.

[~2000-30000 bp]
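
The condition described above can be expressed as a small Python sketch that lists all pairs of connectors which come from different contigs and lie within the permitted height difference. Connectors are modelled here as (contig_name, position) tuples; this representation and the name connectable_pairs are assumptions for illustration only, and the sketch ignores which contig end a connector belongs to.

```python
from itertools import combinations


def connectable_pairs(connectors, max_height_diff):
    """Return (contig_a, contig_b, difference) for connectable connectors."""
    pairs = []
    for (c1, p1), (c2, p2) in combinations(connectors, 2):
        diff = abs(p1 - p2)
        if c1 != c2 and diff <= max_height_diff:
            pairs.append((c1, c2, diff))
    return pairs


# Made-up connectors as (contig name, position on the other axis).
connectors = [("A", 10000), ("B", 10800), ("C", 50000), ("A", 10500)]
pairs = connectable_pairs(connectors, max_height_diff=2000)
```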

Trim Unmatched Contig Ends:

Unmatched contig ends (grey region in picture) appear, for instance, due to inserts of foreign DNA in only one genome. They often bias the positioning of connectors, i.e. the calculation of the point where a match would touch the contig border.
This option allows you to ignore these unmatched end regions and may thereby promote contig border connections.

Hint: This option is useful for genomes containing known inserts, which normally "break" contigs into parts. This is observable when matches do not touch the contig border but end within the contig without giving rise to a connector.

Filter Repeats:

Summarized Matches starting or ending at nearly the same coordinates (x and y axis) get removed.
If there is only a partial repeat, the longer match is kept.

Hint: Removing repetitive regions located near contig borders can facilitate the layout process because it eliminates unwanted connectors.

Avoid Weak Extensions:

Contig ends might contain short match segments which align to other regions than the rest of the target contig. If these short segments give rise to connectors, this can mislead the layout process.
This option only keeps the connector derived from the longest marginal match. Others are removed.
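
As a rough illustration of this option, the following Python sketch keeps, for each contig end, only the connector that stems from the longest match. The tuple layout (contig, end, position, match_length) and the name keep_strongest_connectors are hypothetical and chosen for this example only.

```python
def keep_strongest_connectors(connectors):
    """Keep one connector per contig end: the one from the longest match."""
    best = {}
    for conn in connectors:
        contig, end, _position, match_length = conn
        key = (contig, end)
        if key not in best or match_length > best[key][3]:
            best[key] = conn
    return list(best.values())


# Made-up connectors as (contig, end, position, match_length).
connectors = [
    ("A", "right", 20500, 8000),
    ("A", "right", 91000, 300),   # short segment aligning elsewhere
    ("A", "left", 12000, 8000),
    ("B", "left", 20900, 5000),
]
strongest = keep_strongest_connectors(connectors)
```

The connector derived from the 300 bp segment at the right end of contig A is dropped, so it can no longer mislead the layout.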

Show Only Cells containing Recombinations:

If checked, only cells containing "broken" matches are shown.
Considering each cell on its own, these are matches lying too far apart from each other and/or showing different slopes.

Hint: This option does not modify the contig layout! It just keeps the affected cells visible whereas the others are hidden. It gives the user the possibility to check whether recombinations or misassemblies exist.

Compute Layout for xAxis (rather than yAxis):

If checked, only the vertical (green) connectors are used for computing a layout for the xAxis assembly.
If unchecked, only the horizontal (red) connectors are used for computing a layout for the yAxis assembly.

Compute Layout for Both Axis:

If checked, both axes are sorted and oriented one after the other. The layout computation for one axis is completely independent of the other axis.
If unchecked, only one axis is laid out.

Export Syntenic Layout:

If checked, a window appears providing several possibilities to write results to files.


Visualization Parameters

The following settings can all be adjusted in the Visualizations Window.
  • Show only Boundaries of Contigs larger than:
    If the data set contains many short contigs, their borders, normally shown as thin blue lines, may appear as an opaque blue bar depending on the zoom factor.
    If you type a number, borders are drawn only for contigs longer than this value.
  • Query Label and Subject Label:
    Choose a name for the x and y axis.
  • Antialiasing:
    Check this for smoother rendering, which improves the appearance of exported result views.
  • xAxis (yAxis) Supercontigs:
    Choose a color for the boxes indicating the supercontigs on xAxis (yAxis).
  • Background Color Summarized Matches View:
    Choose a color for cell background in Summarized Matches View.
  • Background Color Syntenic Layout View:
    Choose a color for cell background in Syntenic Layout View.
  • Window Background:
    Choose a color for the background in the result window.
  • Visibility Options:
    Check this if a particular view should be hidden in the result window as well as in exported pictures.

Message Window

Once opened by selecting Window > Messages... in the result window, the user is provided with the statistical output of each run.
Get information about
  • number of matches
  • number of summarized matches
  • number of connectors
  • names of contigs containing broken matches
  • ordered list of contig names
  • percentage of contigs successfully laid out
  • ...

Output

OSLay exports the results into several files which can be chosen from the following window:


Handling project files


Useful Hints:

The goal of the program is to obtain one or several straight match diagonals in the dot-plot running from the bottom left to the upper right corner. The contig order of the target assembly is determined by the reference genome or assembly.
The presence of extended match diagonals implies that OSLay could sort and orient the contigs by exploiting the collinearity of the two genomes.

Author: Daniel C. Richter  -  Last change: 01/08/07
The OSLay software you are using is still beta and under development.