Implementing D2 Metrics In Cython For Kmer Count Profile Distance

1 minute read

Published: October 12, 2024

D2 metrics to compare sequences from kmer frequency vectors

Blaisdell, Reinert, Shepperd, and others have used the D2 family of metrics to compare genomic sequences from their kmer count or frequency vectors.

Alongside alignment (the direct, exact solution to local alignment problems, which grows more complex with gap penalties), sequence comparison needs some alignment-free concept of sequence similarity. One method is to decompose the genome, or sequence under consideration, into the frequencies of its words, or kmers, and then compare the vectors of counts as a proxy for how many words or similar substrings the sequences share.
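As a rough illustration of the word-counting step, here is a minimal plain-Python sketch (not the Cython implementation). The function names `kmer_index` and `count_kmers` are hypothetical, and the choices to count overlapping words over the ACGT alphabet and to skip words containing other symbols are assumptions of this sketch.

```python
from itertools import product

ALPHABET = "ACGT"


def kmer_index(k):
    """Map every possible k-mer over ACGT to a fixed position in the profile."""
    return {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}


def count_kmers(seq, k):
    """Return a length-4**k list of overlapping k-mer counts for seq."""
    index = kmer_index(k)
    counts = [0] * len(index)
    seq = seq.upper()
    # A sequence of length n contains n - k + 1 overlapping words of length k.
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip words containing N or other symbols
            counts[pos] += 1
    return counts


profile = count_kmers("ACGTACGT", 2)
```

Two sequences counted with the same k share the same word order, so their profiles can be compared position by position.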

D2 family metrics offer features that are desirable in any pairwise metric: self-normalization (D2S) and improved statistical power (D2*). D2 metrics have also been shown to be asymptotically normal as sequence length grows. With long sequences, the puzzle-piece method of comparing k-mer frequency vectors may provide advantages over alignment-based similarities.

More to come…

D2S

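$X_w$ is the count of word $w$ in the first sequence, in this case equivalent to a k-mer count from a k-mer count vector/profile; $Y_w$ is the corresponding count in the second sequence. The centered counts are

$$ \tilde{X}_w = X_w - \hat{n} p_w^X, \qquad \tilde{Y}_w = Y_w - \hat{m} p_w^Y $$

[NOTE]: $E(X_w) = \hat{n} p_w^X$, so each centered count is the observed count minus its expected value. Here $p_w^X$ and $p_w^Y$ are the probabilities of word $w$ under each sequence's background model, and

$$ \hat{n} = n - k + 1, \qquad \hat{m} = m - k + 1 $$

are the numbers of overlapping k-mers in sequences of length $n$ and $m$.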
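The centering step can be sketched in plain Python as follows. This is an illustration under stated assumptions: word probabilities come from an i.i.d. background model estimated from the sequence's own letter frequencies, the number of overlapping words is taken as n - k + 1, and the helper names `word_probabilities` and `centered_counts` are hypothetical.

```python
from collections import Counter
from itertools import product


def word_probabilities(seq, k):
    """p_w under an i.i.d. model: the product of the letter frequencies of w."""
    letters = Counter(c for c in seq.upper() if c in "ACGT")
    total = sum(letters.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G, or T")
    freq = {c: letters[c] / total for c in "ACGT"}
    probs = []
    for word in product("ACGT", repeat=k):  # same word order as the profile
        p = 1.0
        for c in word:
            p *= freq[c]
        probs.append(p)
    return probs


def centered_counts(counts, probs, n, k):
    """Subtract the expected count n_hat * p_w from each observed count."""
    n_hat = n - k + 1  # assumed number of overlapping k-mers
    return [x - n_hat * p for x, p in zip(counts, probs)]


seq = "ACGTACGT"
k = 2
counts = [0] * 16
counts[1], counts[6], counts[11], counts[12] = 2, 2, 2, 1  # AC, CG, GT, TA
probs = word_probabilities(seq, k)
x_tilde = centered_counts(counts, probs, len(seq), k)
```

Because the probabilities sum to one and the counts sum to n - k + 1, the centered counts sum to zero in this example.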

From these centered counts we get the revised D2 metrics, which trace back to Blaisdell (1986).

$$ D_2^S = \sum_w \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\tilde{X}_w^2 + \tilde{Y}_w^2}} $$
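A minimal plain-Python sketch of this sum, assuming the two centered count vectors have already been computed and share the same word order. The function name `d2s` is hypothetical, and skipping words where both centered counts are zero (to avoid dividing by zero) is a choice made for this sketch.

```python
import math


def d2s(x_tilde, y_tilde):
    """Sum over words of x*y / sqrt(x^2 + y^2) for centered counts x, y."""
    if len(x_tilde) != len(y_tilde):
        raise ValueError("profiles must have the same length")
    total = 0.0
    for x, y in zip(x_tilde, y_tilde):
        denom = math.sqrt(x * x + y * y)
        if denom > 0.0:  # both centered counts zero: term contributes nothing
            total += x * y / denom
    return total


score = d2s([3.0, -1.0], [4.0, 0.0])
```

Each term is normalized by the magnitude of its own pair of centered counts, which is where the self-normalizing behavior comes from.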

D2*

The power-optimized D2* statistic is meant to carry over to settings where D2S falls short in statistical power (1 - beta).

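$$ D_2^* = \sum_w \frac{\tilde{X}_w \tilde{Y}_w}{\sqrt{\hat{m} \hat{n} p_w^X p_w^Y}} $$

Here $\hat{m}$ is the counterpart of $\hat{n}$ for the second sequence, and $p_w^X$, $p_w^Y$ are the probabilities of word $w$ under each sequence's background model.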
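The same pattern gives a plain-Python sketch of D2*, assuming the centered counts and per-word probabilities for both sequences are already available in the same word order. The function name `d2_star` and its argument names are hypothetical, and words with zero probability under either model are skipped as a choice of this sketch.

```python
import math


def d2_star(x_tilde, y_tilde, px, py, n_hat, m_hat):
    """Sum over words of x*y / sqrt(m_hat * n_hat * px_w * py_w)."""
    total = 0.0
    for x, y, p, q in zip(x_tilde, y_tilde, px, py):
        denom = math.sqrt(m_hat * n_hat * p * q)
        if denom > 0.0:  # skip words with zero expected count
            total += x * y / denom
    return total


score = d2_star([1.0, -1.0], [2.0, -2.0], [0.5, 0.5], [0.5, 0.5], 4, 4)
```

Unlike D2S, the denominator here depends only on the expected counts, so rare words are weighted up relative to common ones.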

For readers who find the notation above difficult, let’s begin with D2.
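As a starting point, the unnormalized D2 statistic is usually written as the inner product of the two raw count vectors, D2 = sum over w of Xw * Yw. A minimal plain-Python sketch follows, with the function name `d2` being hypothetical and both profiles assumed to use the same k and word order.

```python
def d2(x_counts, y_counts):
    """Inner product of two k-mer count profiles of equal length."""
    if len(x_counts) != len(y_counts):
        raise ValueError("profiles must have the same length")
    return sum(x * y for x, y in zip(x_counts, y_counts))


score = d2([2, 0, 1], [1, 3, 4])
```

A word contributes to the score only when it appears in both sequences, and frequent shared words contribute the most.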

Check back in.


Thoughts on ML deployments, containerized workflows, and notebooks.

6 minute read

Published: February 06, 2025

This article hopes to bring the reader up to date (ca. 2017-2022) on modern cloud-native and scalable solutions for data science and natural science research application stacks using the Docker container standard for container specification (vs Singularity, Podman, or containerd containers that are equally valid). First I will provide a brief description of the goal of Docker containers. Next I’ll touch on the Kubernetes architecture for distributed data processing and application service management. Finally, I’ll describe code repositories, container registries, and Markdown/Rmarkdown/LaTeX documentation as they pertain to a service’s lifespan w.r.t. notebooks and documentation of custom services and their orchestration.

Matt Ralston
