Data Analysis

"DNA - Blue" by Spanish Flea is licensed under CC BY-NC-ND 2.0

How do I analyze k-mer profiles?

First, I suggest starting with a solid review of the Wikipedia page for the fundamentals. As I understand it, k-mers have dozens of applications in bioinformatics. Without constraining the final application area or suggesting a particular use, we begin with what a k-mer profile can represent and then suggest some ways to generate matrices for further analysis.
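As a point of reference for what a profile represents, here is a minimal sketch of k-mer counting in plain Python. It is not how kmerdb builds its profiles: the function name `kmer_profile` is hypothetical, and the sketch assumes a single sequence, a single strand, and no special handling of ambiguous bases.

```python
from collections import Counter


def kmer_profile(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in a single sequence."""
    sequence = sequence.upper()
    # Slide a window of width k across the sequence, one base at a time.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Toy sequence for illustration only.
profile = kmer_profile("ATGATGCA", k=3)
for kmer, count in sorted(profile.items()):
    print(kmer, count)
```

A sequence of length n yields n - k + 1 overlapping k-mers, so the counts in the toy profile sum to 6. Stacking one such count vector per sample gives a matrix of the kind discussed in the next section.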

The fundamental matrix X

In this case we generate a matrix X that can be unnormalized, normalized, or reduced in dimensionality via PCA or t-SNE. Of course, additional transformations may be applied to the normalized or unnormalized X, by consuming the matrix X via the PyPI package kmerdb and then re-writing the transformed matrix X' to X.tsv.

#Check 'kmerdb matrix -h' for more details on usage

# Generating k-mer profiles
# Single profiles
kmerdb profile -k $k input1.fa input1.kdb # $k represents some common choice of k.
kmerdb profile -k $k input2.fa input2.kdb # Repeat via bash code as necessary to generate all inputs
# Compound profiles
#kmerdb profile -k $k -p 3 input1.fa input2.fa input3.fa first_three.kdb
#kmerdb profile -k $k -p 3 input4.fa input5.fa input6.fa second_three.kdb

# Unnormalized count matrix
kmerdb matrix -p $cores pass *.$k.kdb > X.tsv 
# In this case, X.tsv is a tsv for import into Pandas

Then implement a custom normalization, transformation, or processing step.

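import pandas as pd

# Load the count matrix written by 'kmerdb matrix'
df = pd.read_csv("X.tsv", sep="\t")

# Perform some normalization on X, and print as tsv
final_df = df.copy()  # placeholder: replace with your own normalization of df
final_df.to_csv("X1.tsv", sep="\t", index=False)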
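One possible custom normalization is to convert raw counts into per-sample relative frequencies. The sketch below assumes that each column of the matrix is one sample, each row is one k-mer, and every cell is numeric; the function name `to_relative_frequencies` and the toy data are hypothetical, and this is not necessarily the normalization kmerdb itself applies.

```python
import pandas as pd


def to_relative_frequencies(counts: pd.DataFrame) -> pd.DataFrame:
    """Scale each column (assumed: one sample) so that its counts sum to 1."""
    totals = counts.sum(axis=0)
    # Leave all-zero columns at zero rather than dividing by zero.
    return counts.div(totals.where(totals != 0, 1), axis=1)


# Hypothetical 4 k-mers x 2 samples count matrix standing in for X.tsv.
df = pd.DataFrame({"sample1": [10, 30, 40, 20], "sample2": [5, 5, 0, 10]})
final_df = to_relative_frequencies(df)
print(final_df)
```

Relative frequencies make samples of different sequencing depth or genome size comparable, at the cost of discarding the absolute counts.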

Then, you can use the tsv as input to further exploratory steps such as clustering or distance matrix generation. All commands (matrix, distance, kmeans, hierarchical) are mostly designed to read and write tsv/csv through STDIN/STDOUT and are thus pipeable.

# Generate a Spearman correlation coefficient distance matrix
kmerdb distance spearman X.tsv
# Calculate the ssxy/sqrt(ssxx*ssyy) Pearson correlation coefficient
kmerdb distance pearson X.tsv
# Use SciPy to calculate correlation distance
kmerdb distance correlation X.tsv
#cat X.tsv | kmerdb distance spearman STDIN # or '/dev/stdin'; the syntax for reading from STDIN is admittedly bizarre
# kmerdb distance -h

kmerdb kmeans -k 20 -i X.tsv sklearn
#cat X.tsv | kmerdb kmeans -k 20 Biopython
# kmerdb kmeans -h
kmerdb hierarchical -i X.tsv
#cat X.tsv | kmerdb hierarchical
# kmerdb hierarchical -h
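The ssxy/sqrt(ssxx*ssyy) expression in the Pearson comment above can be worked through by hand. The following is a minimal standard-library sketch of that formula, with Spearman obtained by applying it to ranks. The function names and the two toy vectors are hypothetical, and the code illustrates the coefficients only, not kmerdb's implementation or how it turns a coefficient into a distance.

```python
from math import sqrt


def pearson(x, y):
    """Pearson r computed as ssxy / sqrt(ssxx * ssyy)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    ssxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ssxx = sum((a - mean_x) ** 2 for a in x)
    ssyy = sum((b - mean_y) ** 2 for b in y)
    # Undefined (division by zero) if either vector is constant.
    return ssxy / sqrt(ssxx * ssyy)


def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for idx in order[i:j + 1]:
            result[idx] = (i + j) / 2 + 1
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rho is Pearson r applied to the ranks of x and y."""
    return pearson(ranks(x), ranks(y))


# Two toy count vectors (hypothetical samples) with a monotonic, non-linear relationship.
sample_a = [1, 2, 3, 4, 5]
sample_b = [1, 4, 9, 16, 25]
print("pearson: ", pearson(sample_a, sample_b))
print("spearman:", spearman(sample_a, sample_b))
```

Because the relationship between the two vectors is monotonic but not linear, the Spearman coefficient is exactly 1 while the Pearson coefficient is slightly lower, which is one reason to prefer a rank-based measure for skewed count data.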

Transforming the data matrix with Unix pipes

# Create the initial data matrix X to use as tsv input to other commands
kmerdb matrix Unnormalized sample1.kdb sample2.kdb ... sampleN.kdb > X.tsv
# or, again with the awful syntax for reading from STDIN.
kmerdb matrix Unnormalized *.kdb | kmerdb matrix Normalized STDIN | kmerdb distance spearman STDIN | kmerdb kmeans -k 10 sklearn STDIN
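A custom transformation can take part in such a pipeline if it reads a tsv from one stream and writes a tsv to another. The sketch below is one hedged way to do that with the standard library. The function name `log_transform` is hypothetical, the choice of log2(count + 1) is only an example, and the code assumes the first row is a header and every other cell is numeric.

```python
import csv
import io
import math


def log_transform(in_stream, out_stream):
    """Copy a TSV from in_stream to out_stream, replacing each count c with log2(c + 1)."""
    reader = csv.reader(in_stream, delimiter="\t")
    writer = csv.writer(out_stream, delimiter="\t", lineterminator="\n")
    header = next(reader, None)  # assumed: first row holds the sample names
    if header is None:
        return  # empty input, nothing to write
    writer.writerow(header)
    for row in reader:
        writer.writerow([f"{math.log2(float(cell) + 1):.6f}" for cell in row])


# In-memory demonstration with a hypothetical two-sample matrix.
demo_in = io.StringIO("sample1\tsample2\n0\t1\n3\t7\n")
demo_out = io.StringIO()
log_transform(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```

To use it between commands, a script would call `log_transform(sys.stdin, sys.stdout)` after `import sys`. Whether the downstream kmerdb command accepts the transformed floating-point values is an assumption that should be checked against the CLI documentation.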

I encourage you to check out the CLI documentation for details on what functions are used for normalization, dimensionality reduction, and distance matrix generation.

Also, please take a look at the example_report README.md for more details about how to populate the report with metadata about an analysis of samples via kdb.

Then, please check out the template index.Rmd for information about the statistical analyses performed and how these become the primary index.html page for the results.

And finally, please check out the finished report.
