Post

Calculating codon-usage bias statistics with kmerdb

Chi-square tests, ENC, CAI, and other statistics

Preface

Around May of 2025, I began wondering what else would be simple to do with a k-mer counting library. I’ve had an interest in completing a Djikstra or De Bruijn solver for a while, or completing my aligner using the minimizer functionality I recently implemented. After thinking about this for a while, I’ve decided to implement some simple functions for obtaining codon counts from coding sequences (CDS), validating or ignoring non-canonical CDS sequences (such as those who use non-standard (ATG) start codons).

What is codon usage bias?

Codon usage bias is a term that is fairly self explanatory: it is a preference for the use of certain codons (or not) in certain scenarios. In one sense, you may consider that not all proteins need to have the same translational efficiency, while some CDSes may need very rapid translational efficiency to perform their function in cells with changing needs. Codon usage is often pigeonholed as a synthetic biology problem for increasing or modulating the translational efficiency of transgenes by modifying or “optimizing” a gene for a certain codon usage pattern for desired effects.

However, codon usage has implications for a variety of problems in genetic biology beyond translational efficiency. With sequencing technologies on the rise, there are more questions about situations where codon usage modulates the effects of mutation, paralogy, horizontal-gene transfer, and more. Codon usage studies are truly in their infancy, and while we have some metrics to date that have utility such as the Effective Number of Codons, the Codon Adaptation Index, Frequency of Optimal Codons, and Relative Synonymous Codon Usage, the point of having effective metrics is to address novel hypotheses.

Hypothesis

First let me state a premise. I think that codon usage patterns in proteins/CDS of metabolic genes between two very different species of bacteria will be useful in distinguishing those organisms that global codon usage patterns alone. I believe that codon usage patterns are different between certain species of bacteria. The H_{0} will be that codon usage patterns in metabolic genes of two similar organisms will be indistinguishable via chi-square test.

Data acquisition

Data such as fasta sequences, .gff3/.gtf files and CDS sequences can be downloaded directly from NCBI using the datasets command. This allows you to access genome annotations, sequences, and more. GO terms can be accessed through the NCBI Genes database, UniProt, and geneontology.org. Mapping to RefSeq IDs can be done with a mixture of Python code, NCBI EUtils, and other gtf/gff parsing or Gene Ontology tools.

1
datasets download genome accession $ACCESSION --include gff3,rna,cds,protein,genome,seq-report

kmerdb codons - codon count table

The kmerdb codons command will count codons from valid CDSes. By default, it will print a nx64 table, including the ochre, amber, and opal stop codons

This may be problematic for some species of bacteria where the gene begins with a residue other than Methionine or a stop codon other than Ochre, Amber, or Opal stop codons. In those cases, the CDS will be considered ‘invalid’ and the program will either throw an error about the invalid CDS, or it will include the row/gene if the --include-noncanonicals parameter is supplied.

1
2
3
4
5
6
7
kmerdb codons my_cds_sequences.fna > cds_codon_counts.tsv
head cds_codon_counts.tsv
	AAA	AAC	AAG	AAT	ACA	ACC	ACG	ACT	AGA	AGC	AGG	AGT	ATA	ATC	ATG	ATT	CAA	CAC	CAG	CAT	CCA	CCC	CCG	CCT	CGA	CGC	CGG	CGT	CTA	CTC	CTG	CTT	GAA	GAC	GAG	GAT	GCA	GCC	GCG	GCT	GGA	GGC	GGG	GGT	GTA	GTC	GTG	GTT	TAA	TAC	TAG	TAT	TCA	TCC	TCG	TCT	TGA	TGC	TGG	TGT	TTA	TTC	TTG	TTT
lcl|NC_000913.3_cds_NP_414542.1_1	1	1	0	0	1	7	0	0	0	1	0	0	0	10	3	0	0	0	0	0	0	0	0	0	1	0	0	0	0	0	0	00	0	0	0	0	1	0	0	1	0	2	0	0	0	0	0	0	0	00	0	0	0	0	0	0	0	0	0	0	0
lcl|NC_000913.3_cds_NP_414543.1_2	22	16	12	22	2	19	8	5	0	12	2	3	1	15	22	30	11	6	19	8	2	6	18	3	3	19	4	18	2	13	43	840	14	13	30	14	36	26	15	9	22	10	22	5	18	27	19	0	8	012	6	10	9	11	0	9	4	3	10	19	13	11
lcl|NC_000913.3_cds_NP_414544.1_3	3	6	7	6	2	3	3	1	0	4	1	3	0	99	7	3	3	11	3	4	1	9	2	2	4	5	5	0	5	15	2	15	5	6	8	6	9	17	5	2	12	5	8	3	5	7	8	0	3	05	1	6	3	0	0	6	4	3	3	4	6	6
lcl|NC_000913.3_cds_NP_414545.1_4	17	11	5	8	2	11	6	4	0	6	0	3	0	97	8	5	5	16	5	9	3	11	0	1	7	0	9	1	6	29	1	19	4	10	20	5	8	24	10	2	11	4	11	2	5	15	5	0	5	04	3	2	5	0	0	1	3	3	5	11	12	13

The result is a .tsv file with codon counts from each codon in the CDS except for the start and stop codons, unless --include-start-codons or --include-stop-codons parameters are supplied. The default is a n x 64 table of codon counts, one row for each of the ‘n’ genes. You can omit the stop codon columns on your own or using --no-stop-codons-in-table to make the table n x 61.

The goal is to allow users to run their own analysis such as Effective Codon Number, or Relative Synonymous Codon Usage using the count data provided.

This post is licensed under CC BY 4.0 by the author.