Tags: beginner  python  ruby  haskell  javascript 

Hello everyone! Today we'll be talking about essential array/list methods in different languages, reminding our readers of differences across languages. As beginner programmers, many of us are not aware of differences in the implementations of fundamental structures between languages. While we occasionally hear generalizations (perl is faster than python for strings etc.), benchmarking is more common among mature libraries/packages than standard library or primitive data structures.

An algorithm's cost depends on its data structures and their instance methods. For example, appending items, iterating, or performing normalizations on a list. As a beginner, I was not aware of the differences between lists, linked-lists, arrays, and hashmaps. Additionally, list/array methods perform differently in speed between languages and sometimes have multiple approaches (push, append, concat, extend, etc). Our first example in this tutorial will be about appending to a basic list/array. Skip ahead to Binary Search if you feel comfortable with this concept already.

Array/List Append

Some of these operations modify the structure in place (mutability) while some return a duplicated structure with the modifications. Appending to an array is a basic operation that is fundamental to many basic computational operations and applications. It is important to understand that the plurality of arrays (or lists if you prefer) means that array methods are extremely common in our codebases. One such fundamental operation, the append, is essential for processing datasets and working iteratively. In early programming languages, the basic 'last in, first out' (LIFO) stack was typically a collection of integers (or other primitives) simply thrown into a collection for later use. Importantly, the most simple 'list', the stack may or may not have a limited capacity and subject to overflow limits. A principle difference between the 'array' and the linked-list is the way the "next" element of the collection may be accessed. For the array, the next element is simply the next address in physical memory. The linked-list however, stores a pointer to the next element along with the current element.

Now let's think about access performance for the array or list. For example, if the memory addresses 2000-2256 referred to some data, then you could append (or prepend) to a stack by assigning a new integer to address 1998 with an array implementation such that the array pointer references 1998. In contrast, what if you needed to access the bottom of the stack, or the first element added to it (to determine array length, for example)? In this case, you'd need to traverse the length of the whole array to find where the array ends. The cost of this operation grows with the size of the array, N. Now imagine how expensive it would be to grow an array if the language's implementation (FIFO behaviour, I suppose) required appending to address 2258, then 2260, etc. Each append would make subsequent appends progressively expensive. This is not usually or always the case, but I include it to recommend users experiment with different data structures and methods to determine the best performance.

Comparison of list data structures[wiki]

Linked list Array Dynamic array Balanced tree Random access list Hashed array tree
Indexing Θ(n) Θ(1) Θ(1) Θ(log n) Θ(log n)[10] Θ(1)
Insert/delete at beginning Θ(1) N/A Θ(n) Θ(log n) Θ(1) Θ(n)
Insert/delete at end Θ(1) when last element is known;
Θ(n) when last element is unknown
N/A Θ(1) amortized Θ(log n) Θ(log n) updating Θ(1) amortized
Insert/delete in middle search time + Θ(1)[11][12][13] N/A Θ(n) Θ(log n) Θ(log n) updating Θ(n)
Wasted space (average) Θ(n) 0 Θ(n)[14] Θ(n) Θ(n) Θ(n)

Map, filter, reduce and higher-order functions

Map takes a function and a list and applies that function to every element in the list, producing a new list.

Map, filter, and reduce are simple higher-order functions that are often not taught in tutorials. Novice programmers may make the mistake of using a for loop to increment through a csv to perform mutation/selection of certain lines. Unfortunately, imperative/procedural style tends to teach students that data comes from the state/position of your location in the file, rather than a holistic view of data as resulting from discrete/piecewise transformations on an input set. I'm not going to start talking about monads and why learning abstract algebra is going to transform your code, but I will say that higher-order functions help you think about making code modular, simple, and readable to understand the transforms and aggregations of your data. Higher-order functions behave just like normal functions, but they can use and/or return anonymous functions, allowing more complex operations to be created. Think of how hard filtering a list would be if you had to write your own filtering function for each different criterion. Instead, higher-order functions allow us to simply provide a conditional, or a boolean-returning lambda/callback/anonymous function. As you can see from the examples below, there are few to no alternatives and thus benchmarking is not an issue within a language.

Map in various languages

Map is a simple function or array method that applies a transformation to data in an iterable, array, or list format. Interestingly, some languages or environments have lazy and/or parallel versions of the map function, providing a condiderable speedup over a procedural for-loop. Simple functions are usually declared anonymously or through a lambda convention and are tightly coupled to the transformation. Maps can often be chained with filter steps and additional mapped transformations. For example, you could read lines of a csv into an array, map to the third column, convert the type to float, round to 2 decimal places and assign that to a variable. Then you could map a median normalization across the array, before calculating aggregate stats or generating a histogram. Objectively, the functional method is easier to read and interpret.

Binary Search

Searching for an element in an array is a common task in algorithms and it appears more often than you may think. For example, the common conditional 'if varname in listname:' uses a linear search to find an element in an array. In Python, 'in' is an infix style of '__contains__' and is a linear search. If you're running the 'in' command on an unsorted list for each of 1M records, you can see how non-optimal detection of presence/abscence could heavily impact performance of your script. So what *should* a programmer use?

There are many different search algorithms that work on 1D or 2D structures. Of course the simplest method in 2018 may be to use a hashmap based data structure for direct access. However, the 1D binary search algorithm is illustrated below. In this diagram, the search is for the element '27'. The algorithm begins by bisecting the array such that 21 is compared to 27. As the desired element is larger than the element in the middle of the sorted array, the algorithm then bisects the upper half of the array, recursing until it locates the element 27. Also note that a linear search (shown on the bottom) takes a larger number on average than the binary search to locate an item in the array.