BIRCH

TUTORIAL: FINDING AND RETRIEVING COMPLETE EUKARYOTIC GENOMES


Dec. 5, 2016

ENTREZ documentation: http://www.ncbi.nlm.nih.gov/books/NBK44864/


Rationale: There are numerous reasons why it is important to be able to retrieve complete genomes. These include genome assembly and genome annotation, transcriptome projects, and analysis of genome structure and evolution.

Often there are several alternative assemblies of a genome, as well as contig and scaffold files covering parts of chromosomes. Some genome assemblies consist of hundreds of thousands of contigs, and frequently it is not known which chromosome each contig belongs to. In principle, prokaryotic genomes should be easy to find and download, since they are typically single, circular DNA molecules. In reality, even for prokaryotes, the vast majority of genomes are available not as single sequences but as collections of contigs.

The most comprehensive source for genomic sequences is the Genome repository at the NCBI. To illustrate the process of finding and downloading genomes, good test subjects are the species in the genus Saccharomyces. These fungi, referred to as yeasts, are eukaryotes but have small genomes that make them easy to work with.

Overview:
Goal: To find and download genomes from two Saccharomyces species, and save them in formats that facilitate further genome analysis.

1. Find genome files at the NCBI Genomes site

I can't repeat this often enough. ALWAYS create a new directory for each project.
mkdir getgenome
cd getgenome

The best starting point is the Genome page at NCBI [https://www.ncbi.nlm.nih.gov/genome].

Click on Browse by Organism and type the genus name "Saccharomyces" into the Search by organism box.



Note that for some species, there are many assemblies from different sequencing projects, and often from different strains. As we'll see next, NCBI has designated a reference assembly for each genome.

Click on "Saccharomyces cerevisiae" to see the genome data for the baker's yeast.

This page has links for downloading sequences and annotation in a variety of formats.
If you need annotation... It should be emphasized that FASTA files do not contain any annotation information. For cases in which you want to utilize this information, for example, in browsing a genome, you would need to download either the GFF file or the GenBank file. GenBank files contain both annotation and sequence for each chromosome.

Links in the box at the top of the window download single files containing the sequences or annotation for all chromosomes. Data for individual chromosomes can be downloaded as separate files using links at the bottom of the page (not shown).

To download a FASTA file containing all chromosomal sequences for S. cerevisiae, go to the box at the top, under Reference Genome, and click on "Download sequences in FASTA format". Save this file, called GCF_000146045.2_R64_genomic.fna.gz, to your getgenome directory. The .gz file extension indicates that this file has been compressed using gzip. Compression reduces network load during download. To uncompress this file, type

gunzip GCF_000146045.2_R64_genomic.fna.gz

The file is uncompressed and replaced with a text file in FASTA format called GCF_000146045.2_R64_genomic.fna.
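
If you would rather keep the compressed download, the following is a minimal sketch of two alternatives to plain gunzip. It assumes the standard gzip options -d (decompress) and -c (write to standard output), and it runs on a tiny stand-in file, demo.fna, which is a hypothetical name used only so that the example can be run anywhere.

```shell
# Build a tiny stand-in FASTA file (demo.fna is a hypothetical name) and compress it.
workdir=$(mktemp -d)
printf '>seq1 demo sequence\nACGTACGT\n' > "$workdir/demo.fna"
gzip "$workdir/demo.fna"          # leaves only demo.fna.gz

# Alternative 1: write an uncompressed copy and keep the .gz file as well.
gzip -dc "$workdir/demo.fna.gz" > "$workdir/demo.fna"

# Alternative 2: peek at the start of the compressed file without writing a new file.
first_line=$(gzip -dc "$workdir/demo.fna.gz" | head -n 1)
echo "$first_line"
```

For the yeast genome, substitute GCF_000146045.2_R64_genomic.fna.gz for demo.fna.gz.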

2. What's in the file?

Linux gives you some easy ways of finding out what is in a file.

head - The head command lets you see the first few lines of a file, just to have some idea of what is in it. For example,

head GCF_000146045.2_R64_genomic.fna

will print the first 10 lines of this file:

>NC_001133.9 Saccharomyces cerevisiae S288c chromosome I, complete sequence
ccacaccacacccacacacccacacaccacaccacacaccacaccacacccacacacacacatCCTAACACTACCCTAAC
ACAGCCCTAATCTAACCCTGGCCAACCTGTCTCTCAACTTACCCTCCATTACCCTGCCTCCACTCGTTACCCTGTCCCAT
TCAACCATACCACTCCGAACCACCATCCATCCCTCTACTTACTACCACTCACCCACCGTTACCCTCCAATTACCCATATC
CAACCCACTGCCACTTACCCTACCATTACCCTACCATCCACCATGACCTACTCACCATACTGTTCTTCTACCCACCATAT
TGAAACGCTAACAAATGATCGTAAATAACACACACGTGCTTACCCTACCACTTTATACCACCACCACATGCCATACTCAC
CCTCACTTGTATACTGATTTTACGTACGCACACGGATGCTACAGTATATACCATCTCAAACTTACCCTACTCTCAGATTC
CACTTCACTCCATGGCCCATCTCTCACTGAATCAGTACCAAATGCACTCACATCATTATGCACGGCACTTGCCTCAGCGG
TCTATACCCTGTGCCATTTACCCATAACGCCCATCATTATCCACATTTTGATATCTATATCTCATTCGGCGGTcccaaat
attgtataaCTGCCCTTAATACATACGTTATACCACTTTTGCACCATATACTTACCACTCCATTTATATACACTTATGTC
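
The number of lines that head prints can be changed with the -n option. Here is a brief sketch, using a small stand-in file (demo.fna, a hypothetical name) in place of the genome file so that it can be run without the download:

```shell
workdir=$(mktemp -d)
printf '>seq1 demo sequence\nACGT\nGGCC\nTTAA\n' > "$workdir/demo.fna"

# Print only the first 2 lines instead of the default 10.
first_two=$(head -n 2 "$workdir/demo.fna")
echo "$first_two"

# tail is the counterpart of head: print only the last line of the file.
last_line=$(tail -n 1 "$workdir/demo.fna")
echo "$last_line"
```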

You could see a list of the sequences in this file by using the grep program to search for lines containing the '>' character:

grep '>' GCF_000146045.2_R64_genomic.fna

produces the output

>NC_001133.9 Saccharomyces cerevisiae S288c chromosome I, complete sequence
>NC_001134.8 Saccharomyces cerevisiae S288c chromosome II, complete sequence
>NC_001135.5 Saccharomyces cerevisiae S288c chromosome III, complete sequence
>NC_001136.10 Saccharomyces cerevisiae S288c chromosome IV, complete sequence
>NC_001137.3 Saccharomyces cerevisiae S288c chromosome V, complete sequence
>NC_001138.5 Saccharomyces cerevisiae S288c chromosome VI, complete sequence
>NC_001139.9 Saccharomyces cerevisiae S288c chromosome VII, complete sequence
>NC_001140.6 Saccharomyces cerevisiae S288c chromosome VIII, complete sequence
>NC_001141.2 Saccharomyces cerevisiae S288c chromosome IX, complete sequence
>NC_001142.9 Saccharomyces cerevisiae S288c chromosome X, complete sequence
>NC_001143.9 Saccharomyces cerevisiae S288c chromosome XI, complete sequence
>NC_001144.5 Saccharomyces cerevisiae S288c chromosome XII, complete sequence
>NC_001145.3 Saccharomyces cerevisiae S288c chromosome XIII, complete sequence
>NC_001146.8 Saccharomyces cerevisiae S288c chromosome XIV, complete sequence
>NC_001147.6 Saccharomyces cerevisiae S288c chromosome XV, complete sequence
>NC_001148.4 Saccharomyces cerevisiae S288c chromosome XVI, complete sequence
>NC_001224.1 Saccharomyces cerevisiae S288c mitochondrion, complete genome
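
For scripting, it is often handy to have just the accession numbers rather than the full title lines. One possible way, sketched here under the assumption that the accession is the first space-delimited word of each title line (as in the output above), combines grep with cut and tr. The pattern '^>' is used so that only lines beginning with '>' match. The file demo.fna is a hypothetical stand-in for the genome file.

```shell
workdir=$(mktemp -d)
printf '>NC_001133.9 Saccharomyces cerevisiae S288c chromosome I, complete sequence\nACGT\n>NC_001134.8 Saccharomyces cerevisiae S288c chromosome II, complete sequence\nGGCC\n' > "$workdir/demo.fna"

# Keep the title lines, take the first word of each, then strip the leading '>'.
accessions=$(grep '^>' "$workdir/demo.fna" | cut -d ' ' -f 1 | tr -d '>')
echo "$accessions"
```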


3. Retrieve the corresponding sequence information for S. arboricola.

If you go back to the page listing Saccharomyces species, you can find a link to S. arboricola. The page for this species is shown below:


As before, download the Representative Genome in FASTA format, and uncompress the file with gunzip. Use the grep command to list the sequences in this file:

>NC_026171.1 Saccharomyces arboricola H-6 chromosome I, whole genome shotgun sequence
>NC_026172.1 Saccharomyces arboricola H-6 chromosome II, whole genome shotgun sequence
>NC_026173.1 Saccharomyces arboricola H-6 chromosome III, whole genome shotgun sequence
>NC_026174.1 Saccharomyces arboricola H-6 chromosome IV, whole genome shotgun sequence
>NC_026175.1 Saccharomyces arboricola H-6 chromosome V, whole genome shotgun sequence
>NC_026176.1 Saccharomyces arboricola H-6 chromosome VI, whole genome shotgun sequence
>NC_026177.1 Saccharomyces arboricola H-6 chromosome VII, whole genome shotgun sequence
>NC_026178.1 Saccharomyces arboricola H-6 chromosome VIII, whole genome shotgun sequence
>NC_026179.1 Saccharomyces arboricola H-6 chromosome IX, whole genome shotgun sequence
>NC_026180.1 Saccharomyces arboricola H-6 chromosome X, whole genome shotgun sequence
>NC_026181.1 Saccharomyces arboricola H-6 chromosome XI, whole genome shotgun sequence
>NC_026182.1 Saccharomyces arboricola H-6 chromosome XII, whole genome shotgun sequence
>NC_026183.1 Saccharomyces arboricola H-6 chromosome XIII, whole genome shotgun sequence
>NC_026184.1 Saccharomyces arboricola H-6 chromosome XIV, whole genome shotgun sequence
>NC_026185.1 Saccharomyces arboricola H-6 chromosome XV, whole genome shotgun sequence
>NC_026186.1 Saccharomyces arboricola H-6 chromosome XVI, whole genome shotgun sequence
>NW_011644260.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold1, whole genome shotgun sequence
>NW_011644261.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold2, whole genome shotgun sequence
>NW_011644262.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold3, whole genome shotgun sequence
>NW_011644263.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold4, whole genome shotgun sequence
>NW_011644264.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold5, whole genome shotgun sequence
>NW_011644265.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold6, whole genome shotgun sequence
>NW_011644266.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold7, whole genome shotgun sequence
>NW_011644267.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold8, whole genome shotgun sequence
>NW_011644268.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold9, whole genome shotgun sequence
>NW_011644269.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold10, whole genome shotgun sequence
>NW_011644270.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold11, whole genome shotgun sequence
>NW_011644271.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold12, whole genome shotgun sequence
>NW_011644272.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold13, whole genome shotgun sequence
>NW_011644273.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold14, whole genome shotgun sequence
>NW_011644274.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold15, whole genome shotgun sequence
>NW_011644275.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold16, whole genome shotgun sequence
>NW_011644276.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold17, whole genome shotgun sequence
>NW_011644277.1 Saccharomyces arboricola H-6 unplaced genomic scaffold SU7_scaffold18, whole genome shotgun sequence
>NW_011644278.1 Saccharomyces arboricola H-6 mitochondrion, whole genome shotgun sequence

Note that this file has a number of scaffolds that could not be matched to any of the chromosomes. This indicates that many of the chromosomes are probably incomplete scaffolds, missing substantial amounts of sequence.
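
One way to judge how much sequence sits in the unplaced scaffolds is to tabulate the length of every sequence in the file. The awk script below is a rough sketch and not part of the procedure described here; it assumes a plain FASTA file with Unix line endings, and demo.fna is a hypothetical stand-in for the S. arboricola genome file.

```shell
workdir=$(mktemp -d)
printf '>chrI demo chromosome\nACGTAC\nGT\n>scaffold1 demo scaffold\nGGC\n' > "$workdir/demo.fna"

# For each title line, remember the name (first word, without '>');
# for every other line, add its length to the running total.
lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len
           name = substr($1, 2); len = 0; next }
         { len += length($0) }
    END  { if (name != "") print name "\t" len }
' "$workdir/demo.fna")
echo "$lengths"
```

Each output line holds a sequence name and its length in bases, separated by a tab.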

This is not a particularly big file, but many genomes have a large number of contigs or scaffolds. An easy way to count them is to pipe the output of the grep command into the wc command, which counts lines, words, and bytes in its input. To make wc count only lines, use wc -l. Thus:

grep '>' GCF_000292725.1_SacArb1.0_genomic.fna | wc -l
35

Instead of printing the matching lines, the output is simply the number of lines that grep found, which is the number of sequences in the file.
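
grep can also do the counting itself: the -c option prints the number of matching lines instead of the lines. A minimal sketch on a hypothetical stand-in file, demo.fna:

```shell
workdir=$(mktemp -d)
printf '>seq1\nACGT\n>seq2\nGGCC\n>seq3\nTTAA\n' > "$workdir/demo.fna"

# -c counts matching lines; '^>' matches only lines that start with '>'.
n_seqs=$(grep -c '^>' "$workdir/demo.fna")
echo "$n_seqs"
```

Run against GCF_000292725.1_SacArb1.0_genomic.fna, the same command should give the count reported by the grep and wc pipeline.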