cai Wiki The master copies of EMBOSS documentation are available at http://emboss.open-bio.org/wiki/Appdocs on the EMBOSS Wiki. Please help by correcting and extending the Wiki pages. Function Calculate codon adaptation index Description cai calculates the Codon Adaptation Index for a given nucleotide sequence, given a reference codon usage table. The CAI index is a simple, effective measure of synonymous codon usage bias. It index assesses the extent to which selection has been effective in moulding the pattern of codon usage. In that respect it is useful for predicting the level of expression of a gene, for assessing the adaptation of viral genes to their hosts, and for making comparisons of codon usage in different organisms. The index may also give an approximate indication of the likely success of heterologous gene expression. Algorithm The CAI index uses a reference set of highly expressed genes from a species to assess the relative merits of each codon. A score for a gene sequence is calculated from the frequency of use of all codons in that gene sequence. Usage Here is a sample session with cai % cai TEMBL:AB009602 Calculate codon adaptation index Codon usage file [Eyeast_cai.cut]: Output file [ab009602.cai]: Go to the input files for this example Go to the output files for this example Command line arguments Calculate codon adaptation index Version: EMBOSS:6.4.0.0 Standard (Mandatory) qualifiers: [-seqall] seqall Nucleotide sequence(s) filename and optional format, or reference (input USA) -cfile codon [Eyeast_cai.cut] Codon usage table name [-outfile] outfile [*.cai] Output file name Additional (Optional) qualifiers: (none) Advanced (Unprompted) qualifiers: (none) Associated qualifiers: "-seqall" associated qualifiers -sbegin1 integer Start of each sequence to be used -send1 integer End of each sequence to be used -sreverse1 boolean Reverse (if DNA) -sask1 boolean Ask for begin/end/reverse -snucleotide1 boolean Sequence is nucleotide -sprotein1 boolean Sequence is protein -slower1 boolean Make lower case -supper1 boolean Make upper case -sformat1 string Input sequence format -sdbname1 string Database name -sid1 string Entryname -ufo1 string UFO features -fformat1 string Features format -fopenfile1 string Features file name "-cfile" associated qualifiers -format string Data format "-outfile" associated qualifiers -odirectory2 string Output directory General qualifiers: -auto boolean Turn off prompts -stdout boolean Write first file to standard output -filter boolean Read first file from standard input, write first file to standard output -options boolean Prompt for standard and additional values -debug boolean Write debug output to program.dbg -verbose boolean Report some/full command line options -help boolean Report command line options and exit. More information on associated and general qualifiers can be found with -help -verbose -warning boolean Report warnings -error boolean Report errors -fatal boolean Report fatal errors -die boolean Report dying program messages -version boolean Report version number and exit Input file format cai reads a nucleic acid sequence of a gene. Input files for usage example Database entry: TEMBL:AB009602 ID AB009602; SV 1; linear; mRNA; STD; FUN; 561 BP. XX AC AB009602; XX DT 15-DEC-1997 (Rel. 53, Created) DT 14-APR-2005 (Rel. 83, Last updated, Version 2) XX DE Schizosaccharomyces pombe mRNA for MET1 homolog, partial cds. XX KW MET1 homolog. XX OS Schizosaccharomyces pombe (fission yeast) OC Eukaryota; Fungi; Dikarya; Ascomycota; Taphrinomycotina; OC Schizosaccharomycetes; Schizosaccharomycetales; Schizosaccharomycetaceae; OC Schizosaccharomyces. XX RN [1] RP 1-561 RA Kawamukai M.; RT ; RL Submitted (07-DEC-1997) to the EMBL/GenBank/DDBJ databases. RL Makoto Kawamukai, Shimane University, Life and Environmental Science; 1060 RL Nishikawatsu, Matsue, Shimane 690, Japan RL (E-mail:kawamuka@life.shimane-u.ac.jp, Tel:0852-32-6587, Fax:0852-32-6499) XX RN [2] RP 1-561 RA Kawamukai M.; RT "S.pmbe MET1 homolog"; RL Unpublished. XX FH Key Location/Qualifiers FH FT source 1..561 FT /organism="Schizosaccharomyces pombe" FT /mol_type="mRNA" FT /clone_lib="pGAD GH" FT /db_xref="taxon:4896" FT CDS <1..275 FT /codon_start=3 FT /transl_table=1 FT /product="MET1 homolog" FT /db_xref="GENEDB:SPCC1739.06c" FT /db_xref="GOA:O74468" FT /db_xref="InterPro:IPR000878" FT /db_xref="InterPro:IPR003043" FT /db_xref="InterPro:IPR006366" FT /db_xref="InterPro:IPR006367" FT /db_xref="InterPro:IPR012066" FT /db_xref="InterPro:IPR014776" FT /db_xref="InterPro:IPR014777" FT /db_xref="InterPro:IPR016040" FT /db_xref="UniProtKB/Swiss-Prot:O74468" FT /protein_id="BAA23999.1" FT /translation="SMPKIPSFVPTQTTVFLMALHRLEILVQALIESGWPRVLPVCIAE FT RVSCPDQRFIFSTLEDVVEEYNKYESLPPGLLITGYSCNTLRNTA" XX SQ Sequence 561 BP; 135 A; 106 C; 98 G; 222 T; 0 other; gttcgatgcc taaaatacct tcttttgtcc ctacacagac cacagttttc ctaatggctt 60 tacaccgact agaaattctt gtgcaagcac taattgaaag cggttggcct agagtgttac 120 cggtttgtat agctgagcgc gtctcttgcc ctgatcaaag gttcattttc tctactttgg 180 aagacgttgt ggaagaatac aacaagtacg agtctctccc ccctggtttg ctgattactg 240 gatacagttg taataccctt cgcaacaccg cgtaactatc tatatgaatt attttccctt 300 tattatatgt agtaggttcg tctttaatct tcctttagca agtcttttac tgttttcgac 360 ctcaatgttc atgttcttag gttgttttgg ataatatgcg gtcagtttaa tcttcgttgt 420 ttcttcttaa aatatttatt catggtttaa tttttggttt gtacttgttc aggggccagt 480 tcattattta ctctgtttgt atacagcagt tcttttattt ttagtatgat tttaatttaa 540 aacaattcta atggtcaaaa a 561 // Output file format cai writes the Codon Adaptation Index to the output file. Output files for usage example File: ab009602.cai Sequence: AB009602 CAI: 0.188 Data files cai requires a reference codon usage table prepared from a set of genes which are known to be highly expressed. This is specified by the -cfile option and must exist in the EMBOSS data directory. The default codon usage table Eyeastcai.cut is the standard set of Saccharomyces cerevisiae highly expressed gene codon frequiencies. Another table (Eschpo_cai.cut) was prepared from a set of Schizosaccharomyces pombe genes by Peter Rice for the S. pombe sequencing team at the Sanger Centre, and is available in the EMBOSS data directory. You should prepare your own codon usage table for your organism of interest. EMBOSS data files are distributed with the application and stored in the standard EMBOSS data directory, which is defined by the EMBOSS environment variable EMBOSS_DATA. To see the available EMBOSS data files, run: % embossdata -showall To fetch one of the data files (for example 'Exxx.dat') into your current directory for you to inspect or modify, run: % embossdata -fetch -file Exxx.dat Users can provide their own data files in their own directories. Project specific files can be put in the current directory, or for tidier directory listings in a subdirectory called ".embossdata". Files for all EMBOSS runs can be put in the user's home directory, or again in a subdirectory called ".embossdata". The directories are searched in the following order: * . (your current directory) * .embossdata (under your current directory) * ~/ (your home directory) * ~/.embossdata Notes Codons are nucleotide triplet that encode an amino acid residue in a polypeptide chain. There are four possible nucleotides in DNA; adenine (A), guanine (G), cytosine (C) and thymine (T), therefore 64 possible triplets to encode the 20 amino acids plus the translation termination signal. The encoding is therefore redundant, with all but two amino acids coded for by more than one triplet. Organisms often have a particular preference for one of the possible codons for a given amino acid. Codon preferences reflect a balance between mutational bias and selection for efficiency of translation. In fast-growing microorganisms there are optimal codons that reflect the composition of the genomic tRNA pool and probably help achieve faster translation rates and high accuracy. Such selection is expected to be strong in highly expressed genes, as is the case for Escherichia coli or Saccharomyces cerevisiae. In contrast, codon usage optimization is normally absent in organisms with slower growing rates such as Homo sapiens (human), where codon preferences are determined by mutational biases characteristic to a particular genome. Various factors are thought to influence codon usage bias in baceteria, including gene expression level already mentioned, %G+C composition (reflecting horizontal gene transfer or mutational bias), GC skew (reflecting strand-specific mutational bias), amino acid conservation, protein hydropathy, transcriptional selection, RNA stability, and optimal growth temperature. Various methods have been used to analyze codon usage bias. CAI and methods such as the 'frequency of optimal codons' (Fop) are commonly used to predict gene expression levels. Others such as the 'effective number of codons' (Nc) and Shannon entropy are used to measure codon usage evenness, whereas multivariate statistical methods, iincluding correspondence analysis and principal component analysis, may be used to analyze variations in codon usage between genes. References 1. Sharp PM., Li W-H. "The codon adaptation index - a measure of directional synonymous codon usage bias, and its potential applications." Nucleic Acids Research 1987 vol 15, pp 1281-1295. 2. Synonymous codon usage in bacteria. Curr Issues Mol Biol. 2001 Oct;3(4):91-7. Warnings None. Diagnostic Error Messages None. Exit status It always exits with status 0. Known bugs None. See also Program name Description chips Calculates Nc codon usage statistic codcmp Codon usage table comparison codcopy Copy and reformat a codon usage table cusp Create a codon usage table from nucleotide sequence(s) syco Draw synonymous codon usage statistic plot for a nucleotide sequence Author(s) Alan Bleasby European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK Please report all bugs to the EMBOSS bug team (emboss-bug (c) emboss.open-bio.org) not to the original author. History Written (March 2001) - Alan Bleasby. Target users This program is intended to be used by everyone and everything, from naive users to embedded scripts. Comments None