Quickstart

Introduction
Install
Usage
Documentation
Development
- Unit testing
API
License
Acknowledgements

Introduction

What is `kmerdb`?

The K-mer database project kmerdb consists of a file format (.kdb) and command-line utility (CLI) for creating k-mer count vector profiles from sequencing data. kmerdb views k-mers and their De Bruijn graphs as fundamental tools for biological sequence abundances and sequence similarities. K-mer methods have been noted as fast and are used in areas from NGS quality control, sequence alignment, phylogenomic investigations, and more.

kmerdb is based on the block GNU-zip file (bgzf) standard. Each .kdb file has a metadata header section, much like .bam files. It is essentially a tab-delimited format with a YAML header. Input file checksums, unique and total k-mer counts, and nullomer counts are stored in the metadata block at the top of the file.

Please visit the Install page for details on installation.

See the commands in the Usage section for an idea of what functionality is built in to kdb.

See the original blog post on the concept here.

The kdb project was designed to facilitate conversation between heavily optimized legacy codebases without much public attention, like Jellyfish, regarding the utility of standardizing k-mer frameworks. These frameworks are used throughout assembly and alignment hashing/seed-matching strategies. The primary goal of this project is documenting data shapes, compression strategies (which of course related to efficiency of storage, transmission, rapid access, etc.), and anticipating UI possibilities with the increases in read/write speeds afforded by improving SSD technologies and utilization of more channels of more rapid interfaces for data transmission (i.e. m2, NVMe, PCIx).

`kmerdb` makes k-mer count vectors

The k-mer database count vector profile file format is rather simple. It contains a YAML metadata header section, followed by a tab delimited format. Both sections are combined in a compressed layer facilitated by Biopython’s bio.bgzf module.

Each file can be inspected with the ‘view’ and ‘header’ commands detailed in the section below.

Install

The current version on PyPI is shown above.

pip install kmerdb

See the install page for development install instructions.

Usage

How to view `-h|--help`

Use ‘-h’ to view detailed usage information about the subcommands

About `--debug`

A debug mode is included to omit the error handling and condensing for the purposes of reporting an error exactly as raised. This doesn’t happen at all times due to the nature of exception handling in kmerdb. If the program doesn’t produce a description of what it’s doing with -v|-vv vebose log settings, then please use -vv --debug options to report the issue to the issue tracker.

# Usage    -h|--help, -v|-vv verbosity, --debug mode, --quiet mode
kmerdb --help # [+ --debug mode]
kmerdb usage graph

 o-O      |||
o---O     |||             [|[          kmerdb           ]|]
O---o     |||
 O-o      |||        version :     v0.8.13
  O       |||
 o-O      |||        GitHub  : https://github.com/MatthewRalston/kmerdb/issues
o---O     |||         PyPI   : https://pypi.org/project/kmerdb/
O---o     |||      Website   : https://matthewralston.github.io/kmerdb
 O-o      |||









# |||||||||||||||||||||||||||||||||||||||
#      [ Usage ] :        |||||||||||||||
# |||||||||||||||||||||||||||||||||||||||


#  Check test/data for example fasta files.


# -----------------------------
# Generate k-mer count profiles
# -----------------------------
kmerdb profile -k 12 -o profile_1 input_1.fa.gz [input_2.fq] ...
...


# -----------------------------
# Merge profiles
# -----------------------------
kmerdb matrix pass profile_1.12.kdb profile_2.12.kdb profile_3.12.kdb ... > count_matrix.tsv

# -----------------------------
# Generate inter-profile distances
# -----------------------------
kmerdb distance pearson count_matrix.tsv


# -----------------------------
# Pipeline form
# -----------------------------

kmerdb matrix pass ... | kmerdb distance pearson STDIN > pearson_correlation_matrix.tsv

# -----------------------------
# Okay, how about PCA, t-SNE?
# -----------------------------
kmerdb matrix PCA -n 25 ... > 25_pseudomer_count_profiles.tsv
# kmerdb matrix tSNE -n 25


# -----------------------------
# k-means clustering?
# -----------------------------
kmerdb kmeans <Biopython|sklearn> -k 4 --distance e -i input_distance_matrix.tsv # Using 'e' for Euclidean distance with Biopython. Check the source, Biopython RTD, and sklearn RTD.
# Produces

# -----------------------------
# okay, now straight-up hierarchical clustering:
# -----------------------------
kmerdb hierarchical -i input_distance_matrix.tsv --method complete # Uses complete linkage



                                                                       lang :         python
                                                                          v :      >= v3.7.4

                      package manger : pip
                        version      : >= 24.0
        package root : /home/matt/.local/share/virtualenvs/kdb-A0LvysMk/lib/python3.12/site-packages/kmerdb
        exe file     : /home/matt/.local/share/virtualenvs/kdb-A0LvysMk/lib/python3.12/site-packages/kmerdb/__init__.py

                      required packages : 11
                   development packages : 8

           ARGV : ['/home/matt/.local/share/virtualenvs/kdb-A0LvysMk/bin/kmerdb', 'usage', 'graph']
        
O---o
 O-o
  O
 o-O
o---O
O---o
 O-o
  O
 o-O
o---O
O---o
 O-o
  O
 o-O
o---O




Beginning program...
========================================
[ DEBUG ] : Default is warning-only logging. [-v,-vv, --debug, --quiet]
[ INFO  ] : Default is warning-only logging. [-v,-vv, --debug, --quiet]
========================================
WARNING. Some features are experimental. Note: you are tracking kmerdb/main. Visit the README.md header, usage section, or quickstart page for additl.
----------------------------------------




                          [ name ] :         graph

                   description : create a edge list in (block) .gz format from .fasta|.fa or .fastq format.




   :     4 column output : [ row idx | k-mer id node #1 | k-mer id node #2 | edge weight (adjacency count) ]

   :  make a deBruijn graph, count the number of k-mer adjacencies,  printing the edge list to STDOUT




                  +=============+====================+====================+=================================+
                  <    row idx  |  k-mer id node #1  |  k-mer id node #2  |  edge weight (adjacency count)  >
                  |             |                    |                    |                                 |
                  |             +
                  |
                  |
                  |
                  |
                  |





--------------------------


                    kmerdb graph -k 12 input_1.fa [example_2.fastq] output.12.kdbg

                    [-]    inputs : 

                           Input file can .fastq (or .fa).   - gzip.  Output is a weighted edge list in .kdb format (gzipped .csv with YAML header)

                    [-]    parameters : 

                           uses < -k > for k-mer size, --quiet to reduce runtime, -v, -vv to control logging.



                    [-]    [ usage ]  :  kmerdb graph -k $K --quiet <input_1.fa.gz> [input_2.fq.gz] <output_edge_list_file.12.kdbg>















name: arguments
type: array
items:
- name: k
  type: int
  value: choice of k-mer size
- name: quiet
  type: flag
  value: Write additional debug level information to stderr?




name: inputs
type: array
items:
- name: <.fasta|.fastq>
  type: array
  value: gzipped or uncompressed input .fasta or .fastq file(s)
- name: .kdbg
  type: file
  value: Output edge-list filepath.




name: features
type: array
items:
- name: k-mer count arrays, linear, produced as file is read through sliding window.
    (Un)compressed support for .fa/.fq.
  shortname: parallel faux-OP sliding window k-mer shredding
  description: Sequential k-mers from the input .fq|.fa files are added to the De
    Bruijn graph. In the case of secondary+ sequences in the .fa or considering NGS
    (.fq) data, non-adjacent k-mers are pruned with a warning. Summary statistics
    for the entire file are given for each file read, + a transparent data structure.
- name: k-mer neighbors assessed and tallied, creates a unsorted edge list, with weights
  shortname: weighted undirected graph
  description: an edge list of a De Bruijn graph is generated from all k-mers in the
    forward direction of .fa/.fq sequences/reads. i.e. only truly neighboring k-mers
    in the sequence data are added to the tally of the k-mer nodes of the de Bruijn
    graph and the edges provided by the data.




name: steps
type: array
items:
- name: read input file(s) from filesystem into k-mer arrays
  shortname: shred inputs into k-mer count arrays
  description: shred input sequences into k-mer count vector
- name: merge k-mer arrays and aggregate metadata
  shortname: merge k-mer count arrays for aggregate metadata (header)
  description: merge counts of nullomers, unique kmers, and total kmers.
- name: collate the weighted edge lists after reading multiple files. Output data
    consists of a edge_list, analogous metadata-header as YAML, kmer_counts, and nullomer_ids.
  shortname: extract undirected weighted graph
  description: consists of info from a .kdb file and a .kdbg file. The node IDs, the
    edges, and the number of times the pair was observed from forward sequences in
    the provided dataset
- name: print 'table' Final stats and close output file
  shortname: metrics and shutdown
  description: print final statistics, typically metadata values, and ensure file
  is closed.

  
...



#   +

# [ 3 main features: ]     k-mer counts (kmerdb profile -k 12 <input.fa|.fq> [<input.fa|.fq>])    'De Bruijn' graph (kmerdb graph)         [matrices, distances, and clustering!]

# Create a [composite] profile of k-mer counts from sequence files. (.fasta|.fastq|.fa.gz|.fq.gz)
kmerdb profile -k 8 example_1.fq.gz [example_2.fq.gz] profile_1.8.kdb

# Build a weighted edge list (+ node ids/counts = De Bruijn graph)
kmerdb graph -k 12 example_1.fq.gz example_2.fq.gz edges_1.kdbg

# View k-mer count vector
kmerdb view profile_1.8.kdb # -H for full header

# Note: zlib compatibility
#zcat profile_1.8.kdb

# View header (config.py[kdb_metadata_schema#L84])
kmerdb header profile_1.8.kdb

## Optional normalization, dim reduction, and distance matrix features:

# K-mer count matrix - Cython Pearson coefficient of correlation [ ssxy/sqrt(ssxx*ssyy) ]
kmerdb matrix from *.8.kdb | kmerdb distance pearson STDIN
# 
# kmerdb matrix DESeq2 *.8.kdb
# kmerdb matrix PCA *.8.kdb
# kmerdb matrix tSNE *.8.kdb
#   # <from> just makes a k-mer count matrix from k-mer count vectors.
# 

# Distances on count matrices [ SciPy ]  pdists + [ Cython ] Pearson correlation, scipy Spearman and scipy correlation pdist calculations are available ]
kmerdb distance -h
# 
#
# usage: kmerdb distance [-h] [-v] [--debug] [-l LOG_FILE] [--output-delimiter OUTPUT_DELIMITER] [-p PARALLEL] [--column-names COLUMN_NAMES] [--delimiter DELIMITER] [-k K]
#                       {braycurtis,canberra,chebyshev,cityblock,correlation,cosine,dice,euclidean,hamming,jaccard,jensenshannon,kulsinski,mahalanobis,matching,minkowski,pearson,rogerstanimotorusselrao,seuclidean,sokalmichener,sokalsneath,spearman,sqeuclidean,yule} [<kdbfile1 kdbfile2 ...|input.tsv|STDIN> ...]

# +

#    Kmeans (sklearn, BioPython)
kmerdb kmeans -k 4 -i dist.tsv
#    BioPython Phylip tree + upgma
kmerdb hierarchical -i dist.tsv

kmerdb minimizers

The kmerdb minimizers command produces a plain-text index file from either a samplesheet or one-or-more .fa|.fq files. The output is a table of the fasta sequence identifier, the length of the sequence, the coordinate of the k-mer in the fasta sequence, the k-mer id corresponding to that position, and a boolean whether or not the k-mer id is chosen as a minimizer.

kmerdb minimizers --help

usage: kmerdb minimizers [-h] -w WINDOW_SIZE [-a] [-v] [--quiet] [--debug] fasta kdb

positional arguments:
  fasta                 A fasta file (.fa) to generate minimizers from
  kdb                   A .kdb file to generate minimizers

options:
  -h, --help            show this help message and exit
  -w WINDOW_SIZE, --window-size WINDOW_SIZE
                        Choice of window size (in base-pairs) to select minimizers with
  -a, --include-unselected-minimizers
                        Include minimizer positions that were not selected as minimizers (length: (N-k+1)/window_size)
  -v, --verbose         Prints warnings to the console by default
  --quiet               Do not produce additional stdout
  --debug               Debug mode. Do not format errors and condense log

kmerdb alignment

TBD, Work in progress, based off of Smith-Waterman alignment using minimizer seed regions and dynamic programming seed-extension patterns.

Reference

Smith, Temple F. & Waterman, Michael S. (1981). “Identification of Common Molecular Subsequences”. Journal of Molecular Biology. 147 (1): 195–197.

kmerdb profile

A typical workflow first requires the generation of k-mer count vector profiles. The following command will generate multiple profiles at $K-mer resolution simultaneously.

parallel 'kmerdb profile -k $K -o {.} {}' ::: $(/bin/ls test/data/*.fasta.gz)

NOTE: the test datasets include with the kmerdb git repository may be used to practice creating profiles, count matrices, distance matrices, and clustering output from small microbial genomic datasets.

kmerdb view

As mentioned before under KDB format, the kdb file consists of a header or metadata section, followed by data blocks until the end of the file. The header is YAML formatted and the data blocks are formatted as tab-separated value files (.tsv), with the last/right-most column being a JSON formatted metadata column. For developers, the YAML schema can be found in the config.py file.

# This should display the entire header of most files
>zcat example.8.kdb | head -n 30 
# This will also display just the header
>kmerdb header example.8.kdb
# The -H flag includes the header in the uncompressed output
>kmerdb view -H example.8.kdb
version: 0.8.3
metadata_blocks: 1
k: 8
total_kmers: 4132866
unique_kmers: 64103
unique_nullomers: 1433
metadata: false
sorted: false
kmer_ids_dtype: uint64
profile_dtype: uint64
count_dtype: uint64
frequencies_dtype: float64
tags: []
files:
- filename: test/data/Cacetobutylicum_ATCC824.fasta.gz
  md5: 0a0f73e1c8b8285703e29279bafaabef
  sha256: f9081291b62ff3387f1ca6ee2484669c849ed1840fdf2dd9dc3a0c93e9e87951
  total_reads: 2
  total_kmers: 4132866
  unique_kmers: 64103
  nullomers: 1433
header_offset: 324


========================

0	0	572	0.00013840274521361204
1	1	454	0.00010985112994227251
2	2	716	0.0001732453943582976
3	3	1256	0.00030390532865086844

     ...

kmerdb matrix

Input to the matrix command is either a space separated list (positional arguments) of .kdb files, OR a input .tsv file OR otherwise, a .tsv can be read from STDIN by using ‘STDIN’ as the final positional argument.

The matrix subcommand generates the count matrix either un-normalized, normalized (via DESeq2), or with PCA or t-SNE dimensionality reduction applied. Note that default behavior of PCA if -n is not specified is to generate an elbow graph for the user to pick the appropriate choice of principal components for downstream analyses. The -n parameter is passed to the n_components parameter of sklearn.decomposition.PCA, which is commonly used for PCA in Python.

>kmerdb matrix -h
usage: kmerdb matrix [-h] [-v] [--debug] [-l LOG_FILE] [-nl {20,50,100,200}]
                     [-p PARALLEL] [--with-index]
                     [--column-names COLUMN_NAMES] [--delimiter DELIMITER]
                     [--output-delimiter OUTPUT_DELIMITER]
                     [--no-normalized-ints] [-k K] [-n N]
                     [--perplexity PERPLEXITY]
                     {PCA,tSNE,DESeq2,pass,Frequency}
                     [<kdbfile1 kdbfile2 ...|input.tsv|STDIN> ...]

positional arguments:
  {PCA,tSNE,DESeq2,from}
                        Choice of dimensionality reduction, normalization
                        method (DESeq2), or matrix-from (collate data only)
  <kdbfile1 kdbfile2 ...|input.tsv|STDIN>
                        Two or more .kdb files, or another count matrix in
                        tsv/csv

options:
  -h, --help            show this help message and exit
  -v, --verbose         Prints warnings to the console by default
  --debug               Debug mode. Do not format errors and condense log
  -l LOG_FILE, --log-file LOG_FILE
                        Destination path to log file
  -nl {20,50,100,200}, --num-log-lines {20,50,100,200}
                        Number of logged lines to print to stderr. Default: 50
  -p PARALLEL, --parallel PARALLEL
                        Read files in parallel
  --with-index          Print the row indices as well
  --column-names COLUMN_NAMES
                        A filepath to a plaintext flat file of column names.
  --delimiter DELIMITER
                        The choice of delimiter to parse the input .tsv with.
                        DEFAULT: ' '
  --output-delimiter OUTPUT_DELIMITER
                        The output delimiter of the final csv/tsv to write.
                        DEFAULT: ' '
  --no-normalized-ints  Don't round normalized counts to the nearest integer
  -k K                  The k-dimension that the files have in common
  -n N                  The number of dimensions to reduce with PCA or t-SNE.
                        DEFAULT: an elbow graph will be generated if -n is not
                        provided to help the user choose -n
  --perplexity PERPLEXITY
                        The choice of the perplexity for t-SNE based
                        dimensionality reduction
>kmerdb matrix -n 3 PCA test/data/*.$K.kdb # Will use principal components analysis/SVD to reduce the dimensions of the count matrix

kmerdb distance

Similar to the matrix subcommand, arguments to ‘kmerdb distance’ are multiple .kdb files, a input.tsv file, or the keyword ‘STDIN’ to read from standard input.

Suppose you want a distance matrix between profiles; this is made easy with the distance command, and prints the distance matrix as .tsv to STDOUT

>kmerdb distance correlation test/data/*.$K.kdb

The result is a symmetric matrix in tsv format with column headers formed from the filenames with the extensions/suffixes removed.

kmerdb kmeans

Now you would like to cluster your count matrix, perhaps after reducing dimensionality of the dataset. tSNE is a recommended choice for using the count matrix to differentiate between strains or species. PCA is equally useful in understanding the differences between species and reducing the dimensionality to something reasonable is an important first step before clustering.

# You may supply any arbitrary count tsv to cluster, and it will run that directly.
>kmerdb kmeans -k 3 -i example.tsv

# Alternatively, you may pipe the result from kdb matrix directly to kdb cluster.
# The matrix command will not produce PCA reduced dimension matrix unless the -n parameter is passed as well.
# The choice of dimensions, n, can be identified from the elbow graph produced if -n is not passed.
>kmerdb matrix PCA -n 3 test/data/*.$K.kdb | kmerdb kmeans -k 3 -i STDIN

# Alternatively, t-SNE may be used to project the data into 2 dimensions for visualization.
>kmerdb matrix tSNE -n 2 test/data/*.$K.kdb | kmerdb cluster -k 3 -i STDIN

kmerdb hierarchical

Usage for the ‘hierarchical’ subcommand is very similar to ‘kmeans’. Be sure to use -i (or STDIN) to specify input, as its not positional in this command.

Documentation

Documentation for the functions, parameters, positional arguments, and steps used by the subcommands may be obtained from the argparse -h,–help arguments, the ‘usage’ command (e.g. kmerdb usage profile), the README.md file, or the module’s docstrings in the ReadTheDocs documentation.

Development

Unit testing

The repository features Travis-CI pytest unit tests as the primary unit testing methodology for the functions, but not for all of the more complex IO-related operations, like those found in bin/kdb.

The method for installation and unit tests in a new development environment can be seen in ‘.travis.yml’. Currently, only unit tests are available for the suite. Acceptance testing has not been implemented. Until then, all acceptance testing is done manually prior to commits, squashes, rebases, and merges. Unit tests may be run with the following:

python setup.py test

API

Importing

The following section is designed to give developers a flavor of how to use the kdb modules for development purposes. The first topic of the API is importing.

```python
# Simple full module import

# specific modules
from kmerdb import distance
from kmerdb import fileutil, graph

````

Reading

The next obvious topic is reading kdb files. It’s very easy to read them line by line, so you should have no problems inventing your own incremental and custom python distance metrics for a reader that is capable of compression but uses a generator style. Note that the lines are parsed raw for speed, and they don’t form any memory intensive objects. You are free to make the processing of each line as complex or simple as you want.

```python

import yaml from kdb import fileutil

files = [“foo.kdb”, “bar.kdb”, “baz.kdb”]

profiles = [fileutil.open(f, ‘r’, slurp=True) for f in files]

print(yaml.dump(profiles[0].metadata)) # Prints the YAML header

foo_profile = profiles[0]

foo.metadata # YAML metadata

foo.kmer_ids # kmer_ids

foo.counts # a NumPy array of the counts

for i, c in enumerate(o.counts): print(i, c) # Calculate something with each k-mer count 0 56 1 24

Check out the readthedocs or __init__ source file for more details on module usage.

License

Created by Matthew Ralston - Scientist, Programmer, Musician - Email

Distributed under the Apache v2.0 license. See ‘LICENSE.txt’ for the copy distributed with this project. Open source software is not for everyone.

Acknowledgements

Thank you to the authors of kPAL and Jellyfish for the early inspiration. And thank you to others for the encouragement along the way, who shall remain nameless. I wanted this library to be a good strategy for assessing these k-mer profiles, in a way that is both cost aware of the analytical tasks at play, capable of storing the exact profiles in sync with the current assemblies, and then updating the kmer databases only when needed to generate enough spectral signature information.

The intention is that more developers would want to add functionality to the codebase or even just utilize things downstream, but to build out directly with numpy and scipy/scikit as needed to suggest the basic infrastructure for the ML problems and modeling approaches that could be applied to such datasets. This project has begun under GPL v3.0 and hopefully could gain some interest.

Thanks to free-icon for the download icon used on the home page. It is also cited at the bottom of each page.

Thanks to my former mentors BC, MR, IN, CR, and my newer bosses PJ and KL. Thanks of course to Liftco Gymdustries for the facelift. Thanks to the Pap lab and the DOE for the first dataset that I continue to use. Thank you to Ryan for the food and stuff. Thanks to Blahah for tolerating someone snooping and imitating his Ruby style. Thanks to Erin for getting my feet wet in this new field. Thanks to Rachel for the good memories and friendship. Thanks to Yasmeen for the usual banter. Thanks to Max, Robin, and Robert for the halfway decent memories in St. Louis. And thanks to my family and friends. Go Blue Hens 2021.

Quickstart

Introduction

What is `kmerdb`?

`kmerdb` makes k-mer count vectors

Install

Usage

How to view `-h|--help`

About `--debug`

kmerdb minimizers

kmerdb alignment

kmerdb profile

kmerdb view

kmerdb matrix

kmerdb distance

kmerdb kmeans

kmerdb hierarchical

Documentation

Development

Unit testing

API

Importing

Reading

foo.metadata # YAML metadata

foo.kmer_ids # kmer_ids

foo.counts # a NumPy array of the counts

License

Acknowledgements

Email

Phone

Address

Introduction

What is kmerdb?

kmerdb makes k-mer count vectors

Install

Usage

How to view -h|--help

About --debug

kmerdb minimizers

kmerdb alignment

kmerdb profile

kmerdb view

kmerdb matrix

kmerdb distance

kmerdb kmeans

kmerdb hierarchical

Documentation

Development

Unit testing

API

Importing

Reading

foo.metadata # YAML metadata

foo.kmer_ids # kmer_ids

foo.counts # a NumPy array of the counts

License

Acknowledgements

Email

Phone

Address

What is `kmerdb`?

`kmerdb` makes k-mer count vectors

How to view `-h|--help`

About `--debug`