Developing
Introduction
What is .kdb, or what are k-mers? Please refer to the quickstart for a basic guide to command-line usage.
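For readers new to the term, a k-mer is a substring of length k taken from a sequence. The following is a minimal sketch of k-mer counting in plain Python. The function name `count_kmers` is hypothetical, and this is not how kmerdb itself computes profiles.

```python
from collections import Counter


def count_kmers(sequence: str, k: int) -> Counter:
    """Count every overlapping substring of length k in `sequence`."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    # A sequence of length n contains n - k + 1 overlapping k-mers.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = count_kmers("ACGTACGT", 3)
print(counts["ACG"])  # ACG occurs at positions 0 and 4
```

A sequence shorter than k simply yields an empty counter.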
Developing
This guide is meant to encourage other developers to leverage the capabilities of the .kdb bgzf spec.
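BGZF is a blocked variant of gzip: a file is a series of small, independently compressed gzip members laid end to end, so standard gzip tools can still decompress it. The sketch below illustrates only that property, using Python's standard library. It writes plain concatenated gzip members, not true BGZF blocks (which carry an extra header field), and the payload lines are made up rather than real .kdb content.

```python
import gzip
import io

# Two independently compressed gzip members, concatenated back to back.
# Real BGZF blocks also store their own size in an extra header field,
# which this illustration omits.
member_one = gzip.compress(b"first block\n")
member_two = gzip.compress(b"second block\n")
blocked = member_one + member_two

# A standard gzip reader treats the concatenation as one continuous stream.
with gzip.open(io.BytesIO(blocked), "rt") as handle:
    lines = handle.read().splitlines()

print(lines)
```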
The key file to watch is config.py, which keeps the version number, itself a piece of metadata about the metadata format. This project makes no warranties about the stability of the format specification. It is unclear whether the project goal is to store tabular count matrices or to develop ingestion for graph databases stored as Cypher text. At this point I can’t decide which features to develop, since the project receives so little attention. For this reason, the metadata formats are likely to change. Perhaps someone would like to contribute to the jsonschema definitions?
This guide shows you how to do basic operations with the kmerdb submodules.
Install
pip install kmerdb
git clone
git clone git@github.com:MatthewRalston/kmerdb.git
cd kmerdb
# Installing this way may be prone to fail; if it does, I need to know. I always install from setup.py.
pip install --no-cache-dir . # python setup.py install
Usage
After installation is complete, you can import the package directly or import only the submodules you need.
fileutil
The main module you will use to interact with .kdb format files is kmerdb.fileutil, which provides the open method and the KDBReader/KDBWriter classes.
We will use this in the section labelled read to discuss the lazy-loading behavior of KDBReader objects.
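Lazy loading here means that a reader can be opened cheaply, with only enough read to inspect it, and the bulk data is read later on request. The class below is a toy illustration of that pattern, not the KDBReader implementation. `LazyTable`, its `loaded` flag, and its `rows` attribute are hypothetical names, and it reads an in-memory text stream rather than a .kdb file.

```python
import io


class LazyTable:
    """Toy reader: parses a header eagerly, defers the rows until slurp()."""

    def __init__(self, handle, slurp=False):
        self._handle = handle
        # The first line stands in for metadata and is always read.
        self.header = handle.readline().rstrip("\n")
        self.rows = None
        self.loaded = False
        if slurp:
            self.slurp()

    def slurp(self):
        """Read the remaining lines once; later calls reuse the result."""
        if not self.loaded:
            self.rows = [line.rstrip("\n").split("\t") for line in self._handle]
            self.loaded = True
        return self.rows


table = LazyTable(io.StringIO("k=3\nAAA\t2\nAAC\t1\n"))
print(table.header, table.loaded)  # header is available before any rows are read
print(table.slurp())
```

The benefit of the pattern is that cheap checks, such as comparing a header value across many files, can run before any large read is paid for.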
read
First we will import the fileutil module to access individual columns from bgzf-compressed .kdb format files.
from kmerdb import fileutil
onefile
Here we can view the contents of one kmerdb file in Python as NumPy arrays. As of 0.6.5, the default settings are uint64 arrays for counts and indices, and float64 for frequencies.
from kmerdb import fileutil

k = 12  # placeholder: the k you expect this file to use

with fileutil.open("example.kdb", 'r', slurp=False) as kdb:
    # Check the metadata
    assert kdb.metadata["k"] == k, "Assertion failed to verify chosen k"
    kdb.slurp()  # Actually read the data from the disk as needed.
    print(kdb.profile)
    print(kdb.kmer_ids)
    print(kdb.counts)
    print(kdb.frequencies)
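To make the array types above concrete: a k-mer over the four-letter DNA alphabet can be mapped to an integer between 0 and 4^k - 1, which fits in an unsigned 64-bit integer for k up to 32, and a frequency is a count divided by the total count. The sketch below uses plain Python for both. The A, C, G, T ordering and the function names are assumptions made for illustration and may not match the encoding kmerdb uses.

```python
# Assumed 2-bit encoding for illustration; kmerdb's own ordering may differ.
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}


def kmer_to_id(kmer: str) -> int:
    """Interpret a k-mer as a base-4 number, most significant base first."""
    kmer_id = 0
    for base in kmer.upper():
        kmer_id = kmer_id * 4 + BASE_TO_BITS[base]
    return kmer_id


def frequencies(counts):
    """Divide each count by the total; return zeros if the total is zero."""
    total = sum(counts)
    return [c / total if total else 0.0 for c in counts]


print(kmer_to_id("AAC"))                   # 0*16 + 0*4 + 1
print(kmer_to_id("T" * 32) == 2**64 - 1)   # largest 32-mer id fills 64 bits
print(frequencies([2, 1, 1]))
```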
multiple
from multiprocessing import Pool

import numpy as np

from kmerdb import fileutil

# Placeholder inputs: replace these with your own values.
myFiles = ["example1.kdb", "example2.kdb"]
k = 12
cores = 4
dtype = "uint64"

# Do some sanity checks
files = [fileutil.open(f, 'r', slurp=False) for f in myFiles]
assert all(kdbrdr.k == k for kdbrdr in files), "Couldn't validate a uniform choice of k"

# Read in parallel
file_reader = fileutil.FileReader()
if cores > 1:  # read in parallel
    with Pool(processes=cores) as pool:
        files = pool.map(file_reader.load_file, myFiles)
else:  # read serially
    files = [fileutil.open(f, 'r', slurp=True) for f in myFiles]
data = np.array([kdbrdr.slurp() for kdbrdr in files], dtype=dtype)
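One detail behind the parallel branch above is that multiprocessing's Pool.map pickles the callable it sends to worker processes, so a lambda cannot be used there, while a module-level function or a bound method of a picklable object can. The following is a minimal stdlib-only sketch of the same serial-or-parallel pattern. `line_count` and `load_all` are hypothetical helpers that count lines in ordinary text files and do not touch kmerdb.

```python
import os
import tempfile
from multiprocessing import Pool


def line_count(path):
    """Module-level function, so it can be pickled and sent to workers."""
    with open(path, "r") as handle:
        return sum(1 for _ in handle)


def load_all(paths, cores=1):
    """Map line_count over paths, in parallel when cores > 1."""
    if cores > 1:
        with Pool(processes=cores) as pool:
            return pool.map(line_count, paths)
    return [line_count(p) for p in paths]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, text in [("a.txt", "x\ny\n"), ("b.txt", "z\n")]:
            path = os.path.join(tmp, name)
            with open(path, "w") as handle:
                handle.write(text)
            paths.append(path)
        print(load_all(paths, cores=2))
```

The `if __name__ == "__main__":` guard matters on platforms that start workers by re-importing the main module, where unguarded top-level code would run again in every worker.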