Taking A Break From Shell Scripting
I don’t need much of an opinion on Bowtie or BWA-MEM. I read the BBMap paper.
I didn’t have an opinion on the performance or the heuristics. HISAT2 claims some impressive throughput from a condensed index, with high per-read alignment rates. Bowtie2 is plenty sufficient, with similar performance from its index, an FM-index built on the Burrows-Wheeler transform. STAR is very popular and consumes a lot of memory; it is written in C++, so that footprint comes from its uncompressed suffix-array index rather than from GC or managed memory. I’m not looking at the source code.
I like vsearch. I’ve been trying to find the paper in my reference management system to read.
vsearch takes a more generic approach to alignment: a heuristic expansion on the traditional Smith-Waterman and Needleman-Wunsch dynamic-programming strategies.
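For reference, the Smith-Waterman recurrence can be sketched in a few lines. This is the textbook dynamic-programming version with a linear gap penalty, not vsearch's implementation; the function name and the scoring values (match, mismatch, gap) are arbitrary assumptions for illustration.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Clamping at 0 is what makes the alignment local:
            # a bad prefix is dropped rather than carried along.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Dropping the clamp at zero and reading the score from the bottom-right cell (with gap-initialized borders) would give the global Needleman-Wunsch variant instead.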
Part of my goal in kmerdb is to incorporate both “local” alignment strategies (the kmerdb aligner) and minimizer quantitation (lol, no) to support seed selection heuristics, and to provide a proper way of adjusting those heuristics as needed.
The goal of the assembly strategy is to provide a more “approximate” approach to sequence assembly from first principles, using shortcuts in the alignment steps to make the graph traversal more robust.
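To make the minimizer idea concrete, here is a minimal sketch of (w, k)-minimizer selection using plain lexicographic ordering. It is not how kmerdb does it; the function name and the choice of ordering are assumptions, and real tools usually hash the k-mers first to avoid bias toward low-complexity sequence.

```python
def minimizers(seq: str, k: int, w: int) -> list[tuple[int, str]]:
    """Return sorted, de-duplicated (position, k-mer) minimizers of seq.

    Every window of w consecutive k-mers contributes its lexicographically
    smallest k-mer (leftmost on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        pos = min(window, key=lambda i: (kmers[i], i))
        picked.add((pos, kmers[pos]))
    return sorted(picked)
```

Adjacent windows often share a minimizer, which is why the selected set is much smaller than the full k-mer list and works as a set of seeds.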
I occasionally read papers on integrative analysis, but my interests remain in the functional genomics space: key genes and metabolic pathways. I’ve tried linking my overrepresentation analysis to Gene Ontology annotations from Ensembl and GenBank, as well as from the GO Consortium via the RDF format specification, the basis for true (by principle) web3. RDF is a graph specification and allows for simple lookups and dynamic AJAX expansion of information. That would be interesting if you wanted something at the CLI to narrow or expand group definitions, associate all downstream genes with a wider set of biological phenomena by lookup through the GO terms, and select housekeeping genes or gene sets known to be within a certain expression tolerance in order to compare transcripts.
RDF dumps of the ontologies would provide a richer experience and eliminate network access during lookups, whether the microservice is serving key CLI programs, workflows, or web services.
It didn’t work. I still think GO lookup via GenBank or NCBI BLAST datasets is the way forward for anything related to quantitation from genetic or transcriptomic datasets.
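The lookup described here, walking from genes through GO terms to broader groupings, reduces to a traversal over subject-predicate-object triples. A minimal sketch with a toy in-memory triple list follows. Every identifier and predicate name in it is a made-up placeholder, not real GO data, and a real RDF dump would be parsed with a library such as rdflib rather than typed in by hand.

```python
from collections import defaultdict

# Toy (subject, predicate, object) triples. All identifiers are placeholders.
TRIPLES = [
    ("EX:glycolysis", "is_a", "EX:carbohydrate_metabolism"),
    ("EX:carbohydrate_metabolism", "is_a", "EX:metabolism"),
    ("EX:translation", "is_a", "EX:metabolism"),
    ("geneA", "annotated_with", "EX:glycolysis"),
    ("geneB", "annotated_with", "EX:translation"),
    ("geneC", "annotated_with", "EX:signaling"),
]

def ancestors(term, triples):
    """All terms reachable from term via is_a edges, including term itself."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "is_a":
            parents[s].add(o)
    seen, stack = {term}, [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_under(term, triples):
    """Genes annotated with term or with any of its descendants."""
    return sorted(
        s for s, p, o in triples
        if p == "annotated_with" and term in ancestors(o, triples)
    )
```

Expanding a group definition is then a matter of calling `genes_under` with a broader term, and narrowing it means calling it with a more specific one. With a local dump, none of this touches the network.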
A microservice would be neat, as it would drive traffic and could provide simpler services via the command line. A delegation or “worker” description could be useful, but running CLI services via a microservice is expensive on compute and ugly. It’s cheap for the web server, though, since all it needs to do is delegate appropriately to compute infra and then sleep the thread.
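The "delegate, then sleep" pattern can be sketched with Python's asyncio, where the handler suspends while a child process does the work. This is one possible shape under stated assumptions, not a design for the actual service: `delegate` is a hypothetical name, and a coroutine yielding to the event loop stands in for a sleeping thread.

```python
import asyncio

async def delegate(cmd: list[str]) -> str:
    """Run cmd as a child process and return its stdout.

    While the child runs, this coroutine is suspended, so the server
    can keep handling other requests at almost no cost.
    """
    proc = await asyncio.create_subprocess_exec(
        *cmd,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    out, err = await proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(err.decode().strip())
    return out.decode()
```

The expensive part stays in whatever `cmd` launches, which is where the compute cost of wrapping CLI tools shows up.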
A systemd service that could consume sets of genes or identifiers, or sequences without qualities, and deliver the associated gene ontology terms could extend or feed from NCBI’s command-line services, namely the ‘datasets’ utility.
Additionally, sequence datasets such as SRA lookups can be downloaded manually or, for high-performance workloads, the downloads can be run on local workstations to keep network load off of the primary compute servers.
Keeping such .fa or .txt files in a cache system could reduce the technical burden.
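A file cache of this kind can be very small. The sketch below assumes a cache directory keyed by accession and a caller-supplied `fetch` function that returns FASTA text. Both `cached_fasta` and `fetch` are hypothetical names and do not wrap any real SRA or NCBI API.

```python
from pathlib import Path
from typing import Callable

def cached_fasta(accession: str, fetch: Callable[[str], str],
                 cache_dir: Path) -> Path:
    """Return the path to a cached .fa file for accession.

    fetch(accession) is called only on a cache miss.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{accession}.fa"
    if not path.exists():
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(fetch(accession))
        # Rename into place so readers never see a half-written file.
        tmp.replace(path)
    return path
```

Passing the download step in as a function keeps the cache logic testable offline and leaves the choice of download tool open.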
SRA lookups should be limited to specific specimens, so that effort is spent only on the lookups that matter.
Larger data sweeps can be useful, especially if the test datasets are properly annotated and preserved, and the documentation describes those static datasets well enough that the effort is not repeated and technical debt stays low.
Once it’s downloaded, transfer is cheap.