Calculating codon-usage bias statistics with kmerdb
Chi-square tests, ENC, CAI, and other statistics
Preface
Around May of 2025, I began wondering what else would be simple to do with a k-mer counting library. I’ve had an interest in completing a Djikstra or De Bruijn solver for a while, or completing my aligner using the minimizer functionality I recently implemented. After thinking about this for a while, I’ve decided to implement some simple functions for obtaining codon counts from coding sequences (CDS), validating or ignoring non-canonical CDS sequences (such as those who use non-standard (ATG) start codons).
What is codon usage bias?
Codon usage bias is a term that is fairly self explanatory: it is a preference for the use of certain codons (or not) in certain scenarios. In one sense, you may consider that not all proteins need to have the same translational efficiency, while some CDSes may need very rapid translational efficiency to perform their function in cells with changing needs. Codon usage is often pigeonholed as a synthetic biology problem for increasing or modulating the translational efficiency of transgenes by modifying or “optimizing” a gene for a certain codon usage pattern for desired effects.
However, codon usage has implications for a variety of problems in genetic biology beyond translational efficiency. With sequencing technologies on the rise, there are more questions about situations where codon usage modulates the effects of mutation, paralogy, horizontal-gene transfer, and more. Codon usage studies are truly in their infancy, and while we have some metrics to date that have utility such as the Effective Number of Codons, the Codon Adaptation Index, Frequency of Optimal Codons, and Relative Synonymous Codon Usage, the point of having effective metrics is to address novel hypotheses.
Hypothesis
First let me state a premise. I think that codon usage patterns in proteins/CDS of metabolic genes between two very different species of bacteria will be useful in distinguishing those organisms that global codon usage patterns alone. I believe that codon usage patterns are different between certain species of bacteria. The H_{0} will be that codon usage patterns in metabolic genes of two similar organisms will be indistinguishable via chi-square test.
Data acquisition
Data such as fasta sequences, .gff3/.gtf files and CDS sequences can be downloaded directly from NCBI using the datasets
command. This allows you to access genome annotations, sequences, and more. GO terms can be accessed through the NCBI Genes database, UniProt, and geneontology.org. Mapping to RefSeq IDs can be done with a mixture of Python code, NCBI EUtils, and other gtf/gff parsing or Gene Ontology tools.
1
datasets download genome accession $ACCESSION --include gff3,rna,cds,protein,genome,seq-report
kmerdb codons
- codon count table
The kmerdb codons command will count codons from valid CDSes. By default, it will print a nx64 table, including the ochre, amber, and opal stop codons
This may be problematic for some species of bacteria where the gene begins with a residue other than Methionine or a stop codon other than Ochre, Amber, or Opal stop codons. In those cases, the CDS will be considered ‘invalid’ and the program will either throw an error about the invalid CDS, or it will include the row/gene if the
--include-noncanonicals
parameter is supplied.
1
2
3
4
5
6
7
kmerdb codons my_cds_sequences.fna > cds_codon_counts.tsv
head cds_codon_counts.tsv
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
lcl|NC_000913.3_cds_NP_414542.1_1 1 1 0 0 1 7 0 0 0 1 0 0 0 10 3 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 00 0 0 0 0 1 0 0 1 0 2 0 0 0 0 0 0 0 00 0 0 0 0 0 0 0 0 0 0 0
lcl|NC_000913.3_cds_NP_414543.1_2 22 16 12 22 2 19 8 5 0 12 2 3 1 15 22 30 11 6 19 8 2 6 18 3 3 19 4 18 2 13 43 840 14 13 30 14 36 26 15 9 22 10 22 5 18 27 19 0 8 012 6 10 9 11 0 9 4 3 10 19 13 11
lcl|NC_000913.3_cds_NP_414544.1_3 3 6 7 6 2 3 3 1 0 4 1 3 0 99 7 3 3 11 3 4 1 9 2 2 4 5 5 0 5 15 2 15 5 6 8 6 9 17 5 2 12 5 8 3 5 7 8 0 3 05 1 6 3 0 0 6 4 3 3 4 6 6
lcl|NC_000913.3_cds_NP_414545.1_4 17 11 5 8 2 11 6 4 0 6 0 3 0 97 8 5 5 16 5 9 3 11 0 1 7 0 9 1 6 29 1 19 4 10 20 5 8 24 10 2 11 4 11 2 5 15 5 0 5 04 3 2 5 0 0 1 3 3 5 11 12 13
The result is a .tsv file with codon counts from each codon in the CDS except for the start and stop codons, unless --include-start-codons
or --include-stop-codons
parameters are supplied. The default is a n x 64 table of codon counts, one row for each of the ‘n’ genes. You can omit the stop codon columns on your own or using --no-stop-codons-in-table
to make the table n x 61.
The goal is to allow users to run their own analysis such as Effective Codon Number, or Relative Synonymous Codon Usage using the count data provided.