M.Sc. Bioinfo + 5 years industry exp. - - - Looking for a position in biosciences. I'm gifted at the command line or computer terminal, but I perform best on my feet, standups, cleaning, guitar with samples in the centrifuge, more in the water bath, data being processed on the computer while I sleep, and other assays to plan. Currently writing a DNA sequence graph algorithm and manuscript (for bioRxiv). I play math rock on my guitar, I have some open-source software on GitHub, I pay my bills in the DE-NJ-PA tri-state area near Delco, and I'm launching a DE-based LLC for Univ. of Delaware and local NGS sequencing projects.
Hello everyone! Today we'll be talking about essential array/list methods in different languages, reminding our readers of differences across languages. As beginner programmers, many of us are not aware of differences in the implementations of fundamental structures between languages. While we occasionally hear generalizations (perl is faster than python for strings etc.), benchmarking is more common among mature libraries/packages than standard library or primitive data structures.
An algorithm's cost depends on its data structures and their instance methods. For example, appending items, iterating, or performing normalizations on a list. As a beginner, I was not aware of the differences between lists, linked-lists, arrays, and hashmaps. Additionally, list/array methods perform differently in speed between languages and sometimes have multiple approaches (push, append, concat, extend, etc). Our first example in this tutorial will be about appending to a basic list/array. Skip ahead to Binary Search if you feel comfortable with this concept already.
Array/List Append
Some of these operations modify the structure in place (mutability) while some return a duplicated structure with the modifications.
Appending to an array is a basic operation that is fundamental to many basic computational operations and applications. It is important to understand that the plurality of arrays (or lists if you prefer) means that array methods are extremely common in our codebases. One such fundamental operation, the append, is essential for processing datasets and working iteratively. In early programming languages, the basic 'last in, first out' (LIFO) stack was typically a collection of integers (or other primitives) simply thrown into a collection for later use. Importantly, the most simple 'list', the stack may or may not have a limited capacity and subject to overflow limits. A principle difference between the 'array' and the linked-list is the way the "next" element of the collection may be accessed. For the array, the next element is simply the next address in physical memory. The linked-list however, stores a pointer to the next element along with the current element.
Now let's think about access performance for the array or list. For example, if the memory addresses 2000-2256 referred to some data, then you could append (or prepend) to a stack by assigning a new integer to address 1998 with an array implementation such that the array pointer references 1998. In contrast, what if you needed to access the bottom of the stack, or the first element added to it (to determine array length, for example)? In this case, you'd need to traverse the length of the whole array to find where the array ends. The cost of this operation grows with the size of the array, N. Now imagine how expensive it would be to grow an array if the language's implementation (FIFO behaviour, I suppose) required appending to address 2258, then 2260, etc. Each append would make subsequent appends progressively expensive. This is not usually or always the case, but I include it to recommend users experiment with different data structures and methods to determine the best performance.
In fact, we're not even gonna test this one. Haskell has one clear way to append and the data structure 'list' is heavily optimized.
Map, filter, reduce and higher-order functions
I'm going to quote Learn You a HaskellM. Lipovaca for the description of the map function:
Map takes a function and a list and applies that function to every element in the list, producing a new list.
Map, filter, and reduce are simple higher-order functions that are often not taught in tutorials. Novice programmers may make the mistake of using a for loop to increment through a csv to perform mutation/selection of certain lines. Unfortunately, imperative/procedural style tends to teach students that data comes from the state/position of your location in the file, rather than a holistic view of data as resulting from discrete/piecewise transformations on an input set. I'm not going to start talking about monads and why learning abstract algebra is going to transform your code, but I will say that higher-order functions help you think about making code modular, simple, and readable to understand the transforms and aggregations of your data. Higher-order functions behave just like normal functions, but they can use and/or return anonymous functions, allowing more complex operations to be created. Think of how hard filtering a list would be if you had to write your own filtering function for each different criterion. Instead, higher-order functions allow us to simply provide a conditional, or a boolean-returning lambda/callback/anonymous function. As you can see from the examples below, there are few to no alternatives and thus benchmarking is not an issue within a language.
Map in various languages
Map is a simple function or array method that applies a transformation to data in an iterable, array, or list format. Interestingly, some languages or environments have lazy and/or parallel versions of the map function, providing a condiderable speedup over a procedural for-loop. Simple functions are usually declared anonymously or through a lambda convention and are tightly coupled to the transformation. Maps can often be chained with filter steps and additional mapped transformations. For example, you could read lines of a csv into an array, map to the third column, convert the type to float, round to 2 decimal places and assign that to a variable. Then you could map a median normalization across the array, before calculating aggregate stats or generating a histogram. Objectively, the functional method is easier to read and interpret.
Searching for an element in an array is a common task in algorithms and it appears more often than you may think. For example, the common conditional 'if varname in listname:' uses a linear search to find an element in an array. In Python, 'in' is an infix style of '__contains__' and is a linear search. If you're running the 'in' command on an unsorted list for each of 1M records, you can see how non-optimal detection of presence/abscence could heavily impact performance of your script. So what *should* a programmer use?
There are many different search algorithms that work on 1D or 2D structures. Of course the simplest method in 2018 may be to use a hashmap based data structure for direct access. However, the 1D binary search algorithm is illustrated below. In this diagram, the search is for the element '27'. The algorithm begins by bisecting the array such that 21 is compared to 27. As the desired element is larger than the element in the middle of the sorted array, the algorithm then bisects the upper half of the array, recursing until it locates the element 27. Also note that a linear search (shown on the bottom) takes a larger number on average than the binary search to locate an item in the array.
At this point, you have seen options for appending, searching, and mapping transforms on arrays in a few languages. The append example showed how much variation there can be in methodology and performance for fundamental array operations. The search method showed how the 'in' operator can become taxing, and how simple sorting can dramatically improve performance. Arrays are important data structures for all scripts, packages/libraries, and apps. Some simple curiosity can yield performance gains and makes coding a more scholastic practice. I hope you enjoyed my first introductory coding post and I'll see you next time!
I want to discuss linear runtimes and what that means in alignment-free methods for bioinformatics and sequence alignments and quasi alignments. First, it is the splitting of the sequences, as they are read, into, let’s say, ‘a’ De Bruijn graph. This graph consists of the k-mers, their neighborhoods, and of course the walks or paths through the graph that constitute optimal criteria and local maxima of course for traversal and contig/walk/path maximization. Typically, a search through the De Bruijn structure may be Breadth-First to find optimal depths for traversal of the path through the De Bruijn structure, optimizing for creating some sequence. This leads to read collapse along the sequence unidirectionally (bidirectionally in a unidimensional space) along the sequence space.