Beginner Array/List Methods

[ beginner  python  ruby  haskell  javascript  r  ]

Hello everyone! Today we'll be talking about essential array/list methods in different languages, reminding our readers of differences across languages. As beginner programmers, many of us are not aware of differences in the implementations of fundamental structures between languages. While we occasionally hear generalizations (perl is faster than python for strings etc.), benchmarking is more common among mature libraries/packages than standard library or primitive data structures.

An algorithm's cost depends on its data structures and their instance methods. For example, appending items, iterating, or performing normalizations on a list. As a beginner, I was not aware of the differences between lists, linked-lists, arrays, and hashmaps. Additionally, list/array methods perform differently in speed between languages and sometimes have multiple approaches (push, append, concat, extend, etc). Our first example in this tutorial will be about appending to a basic list/array. Skip ahead to Binary Search if you feel comfortable with this concept already.

Array/List Append

Some of these operations modify the structure in place (mutability) while some return a duplicated structure with the modifications.

Appending to an array is a basic operation that is fundamental to many basic computational operations and applications. It is important to understand that the plurality of arrays (or lists if you prefer) means that array methods are extremely common in our codebases. One such fundamental operation, the append, is essential for processing datasets and working iteratively. In early programming languages, the basic 'last in, first out' (LIFO) stack was typically a collection of integers (or other primitives) simply thrown into a collection for later use. Importantly, the most simple 'list', the stack may or may not have a limited capacity and subject to overflow limits. A principle difference between the 'array' and the linked-list is the way the "next" element of the collection may be accessed. For the array, the next element is simply the next address in physical memory. The linked-list however, stores a pointer to the next element along with the current element.

Now let's think about access performance for the array or list. For example, if the memory addresses 2000-2256 referred to some data, then you could append (or prepend) to a stack by assigning a new integer to address 1998 with an array implementation such that the array pointer references 1998. In contrast, what if you needed to access the bottom of the stack, or the first element added to it (to determine array length, for example)? In this case, you'd need to traverse the length of the whole array to find where the array ends. The cost of this operation grows with the size of the array, N. Now imagine how expensive it would be to grow an array if the language's implementation (FIFO behaviour, I suppose) required appending to address 2258, then 2260, etc. Each append would make subsequent appends progressively expensive. This is not usually or always the case, but I include it to recommend users experiment with different data structures and methods to determine the best performance.

Tabe 1: Comparison of list data structures[wiki]

Linked list Array Dynamic array Balanced tree Random access list Hashed array tree
Indexing Θ(n) Θ(1) Θ(1) Θ(log n) Θ(log n)[10] Θ(1)
Insert/delete at beginning Θ(1) N/A Θ(n) Θ(log n) Θ(1) Θ(n)
Insert/delete at end Θ(1) when last element is known;
Θ(n) when last element is unknown
N/A Θ(1) amortized Θ(log n) Θ(log n) updating Θ(1) amortized
Insert/delete in middle search time + Θ(1)[11][12][13] N/A Θ(n) Θ(log n) Θ(log n) updating Θ(n)
Wasted space (average) Θ(n) 0 Θ(n)[14] Θ(n) Θ(n) Θ(n)

Append/prepend in various languages

list, llist, numpy.array, deque
>import collections
>exampleList = [0,1,2]
>exampleList.append(3) # instance method append
>exampleList.insert(len(exampleList), 4) # Direct index insert method
>exampleList += [5] # plus equals; note that singleton array instantiation is required
>exampleList.extend([6]) # instance method extend, singleton instantiation *again*
>npArray = numpy.array([0,1,2])
>npArray.append(3) # very slow
>exampleLList = llist.sllist([0,1,2])
>exampleLList.append(1) # suprisingly, less optimal than a basic list...
>exampleDeque = collections.deque([1])
>exampleDeque.appendleft(0)
# P E R F O R M A N C E
# 2.7.10, GCC 4.2.1, Apple LLVM 8.0 on 2011 OSX
>import numpy
>import timeit

>emptyList   = 's=[]'
>emptyLList  = 'import llist; t=llist.sllist()'
>emptyNp     = 'import numpy; u=numpy.array([], dtype=int)'
>emptyDeque  = 'import collections; v=collections.deque([])'
>listAppend  = timeit.Timer('s.append(1)', emptyList)
>listInsert  = timeit.Timer('s.insert(len(s), 1)', emptyList)
>plusEquals  = timeit.Timer('s += [1]', emptyList)
>listExtend  = timeit.Timer('s.extend([1])', emptyList)
>llistAppend = timeit.Timer('t.append(1)', emptyLList)
>npAppend    = timeit.Timer('numpy.append(u, 1)', emptyNp)
>dequeAppend = timeit.Timer('v.appendleft(1)', emptyDeque)

>print numpy.mean([listAppend.timeit() for x in range(1000)])
0.19117085886001586
>print numpy.mean([listInsert.timeit() for x in range(1000)])
0.32939401173591615
>print numpy.mean([plusEquals.timeit() for x in range(1000)])
0.42768072748184205
>print numpy.mean([listExtend.timeit() for x in range(1000)])
0.42768072748184205
>print numpy.mean([llistAppend.timeit() for x in range(1000)])
0.558116751909256
>print numpy.mean([npAppend.timeit() for x in range(1000)])
9.360302472352982 # This is similarly ~2 orders slower than list append
# according to this stackoverflow thread https://stackoverflow.com/a/7134033/2005704
>print numpy.mean([dequeAppend.timeit() for x in range(1000)])
0.12067595839500428
Array, Queue
>exampleArray = [0,1,2]

>exampleArray.push(3)     # First 'native' type of append
>exampleArray << 4        # Second 'native' type of append
>exampleArray[exampleArray.length] = 5 # Direct index assignment
>exampleArray.concat([6]) # Analogous to Python's List#extend
>exampleArray.insert(exampleArray.length, 7) # Direct index insert
>exampleArray.unshift(-1) # Prepend
>exampleQueue = Queue.new
  
# P E R F O R M A N C E
# 2.1.2p95 on 2011 OSX
>require 'benchmark'
>require 'thread'

>iterations = 1000 # number of times to run each method

>Benchmark.bm do |bm|
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray.push(0) }
>  end
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray << 0 }
>  end
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray[exampleArray.length] = 0 }
>  end
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray.concat([0]) }
>  end
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray.unshift(0) }
>  end
>  bm.report do
>    exampleArray = []
>    iterations.times { exampleArray.insert(exampleArray.length, 0) }
>  end
>  bm.report do
>    exampleQueue = Queue.new
>    iterations.times { exampleQueue << 0 }
>  end
>  bm.report do
>    exampleQueue = Queue.new
>    iterations.times { exampleQueue.push(0) }
>  end
>end
 user     system      total        real
 0.000000   0.000000   0.000000 (  0.000268) #      Array#push    ...
 0.000000   0.000000   0.000000 (  0.000194) #      Array#<<      ...
 0.000000   0.000000   0.000000 (  0.000290) #      x[x.length] = ...
 0.000000   0.000000   0.000000 (  0.000534) #      Array.concat([...])
 0.000000   0.000000   0.000000 (  0.000247) #      Array.unshift(...)
 0.000000   0.000000   0.000000 (  0.000546) # x.insert(x.length, ...)
 0.010000   0.000000   0.010000 (  0.010391) #      Queue#<<      ...
 0.010000   0.000000   0.010000 (  0.005754) #      Queue#push    ...
Array
>

# P E R F O R M A N C E
# NodeJS v8.9.4 on 2011 OSX
const Benchmark = require('benchmark')
var suite = new Benchmark.Suite

function pushArray(arr){
  arr.push(0);
}
function indexAppend(arr){
  arr[arr.length] = 0;
}
function unshiftArray(arr){
  arr.unshift(0);
}
function concatArray(arr){
  arr.concat([0]);
}

# First, test the methods without varying the array size 'parametrically'
function growArrays(){
  // Growing arrays
  exampleArray = [];
  suite.add("Growing an array with continuous 'Array#push' ", pushArray(exampleArray))
  exampleArray = [];
  suite.add("Growing an array with direct index assignment (x[x.length] = ...)", indexAppend(exampleArray))
  exampleArray = [];
  suite.add("Growing an array with Array#unshift", unshiftArray(exampleArray))
  exampleArray = [];
  suite.add("Growing an array with continuous 'Array#concat' of singletons", concatArray(exampleArray));
  return suite;
}
function constantSuite(){
  // Fresh arrays
  suite.add("Array#push on fresh arrays (extra cost of array instantiation)", pushArray([]));
  suite.add("Direct index assignment/concatenation on fresh arrays (extra cost of array instantiation)", indexAppend([]));
  suite.add("Array#unshift on fresh arrays (extra cost of array instantiation)", unshiftArray([]));
  suite.add("Array#concat on fresh arrays (extra cost of array instantiation)", concatArray([]));
  return suite;
}


function modArray(description, min, max, callback){
  for (var i = min; i < max; i++) {
    var e = Array(i).fill(1);
    suite.add(description + ` with array size ${i}`, ()=>callback(e));
  }
}


function modArraySize(){
  modArray("Array#push as a function of array size", 0, 1000, pushArray);
  modArray("Direct index assignment/concatenation as a function of array size", 0, 1000, indexAppend);
  modArray("Array#unshift as a function of array size", 0, 1000, unshiftArray);
  modArray("Array#concat as a function of array size", 0, 1000, concatArray);
  return suite;
}


suite.on("cycle", function(event){
  console.log(String(event.target));
}).on("complete", function(){
  console.log("Fastest is " + this.filter("fastest").map("name"));
});
ga=growArrays().run();
cs=constantSuite().run();
ma=modArraySize().run();

// Constant time appends (+ cost of array instantiation, in hz)
>console.log(cs.map((x) => x.hz));			
[ 17761430.053821966,   // Array#push
  15674404.910573114,   // Direct index concatenation
   6827994.338145904,   // Array#unshift
   2680587.6144321645 ] // Array#concat
// Sadly my laptop started throwing heap/memory issues with Benchmarkjs, so we couldn't complete the experiment.
list append

In fact, we're not even gonna test this one. Haskell has one clear way to append and the data structure 'list' is heavily optimized.

>liszt = [1,2,3]         -- An optimized FILO array. 
>example1 = 0 : liszt    -- The ultra-fast recommended way to do prepending. 
>example2 = liszt ++ [4] -- Technically a new singleton-array instantiation

Map, filter, reduce and higher-order functions

I'm going to quote Learn You a HaskellM. Lipovaca for the description of the map function:

Map takes a function and a list and applies that function to every element in the list, producing a new list.

Map, filter, and reduce are simple higher-order functions that are often not taught in tutorials. Novice programmers may make the mistake of using a for loop to increment through a csv to perform mutation/selection of certain lines. Unfortunately, imperative/procedural style tends to teach students that data comes from the state/position of your location in the file, rather than a holistic view of data as resulting from discrete/piecewise transformations on an input set. I'm not going to start talking about monads and why learning abstract algebra is going to transform your code, but I will say that higher-order functions help you think about making code modular, simple, and readable to understand the transforms and aggregations of your data. Higher-order functions behave just like normal functions, but they can use and/or return anonymous functions, allowing more complex operations to be created. Think of how hard filtering a list would be if you had to write your own filtering function for each different criterion. Instead, higher-order functions allow us to simply provide a conditional, or a boolean-returning lambda/callback/anonymous function. As you can see from the examples below, there are few to no alternatives and thus benchmarking is not an issue within a language.

Map in various languages

Map is a simple function or array method that applies a transformation to data in an iterable, array, or list format. Interestingly, some languages or environments have lazy and/or parallel versions of the map function, providing a condiderable speedup over a procedural for-loop. Simple functions are usually declared anonymously or through a lambda convention and are tightly coupled to the transformation. Maps can often be chained with filter steps and additional mapped transformations. For example, you could read lines of a csv into an array, map to the third column, convert the type to float, round to 2 decimal places and assign that to a variable. Then you could map a median normalization across the array, before calculating aggregate stats or generating a histogram. Objectively, the functional method is easier to read and interpret.

>e = [0, 1, 2]
>map(lambda x: x + 1, e) # Map is disfavored in Python vs the list comprehension. Lambda expressions are limited.
[1, 2, 3]
>[x + 1 for x in e]      # Simple expressions or function names can be applied to each element in list comprehensions.
[1, 2, 3]
>[0,1,2].map{|x| x+1}
 => [1, 2, 3]
>[0,1,2].map((x) => x + 1)
[ 1, 2, 3 ]
>map (+1) [0,1,2]
[1,2,3]
>e <- c(0, 1, 2)
>sapply(e, function(x) x + 1)
[1] 1 2 3
>Map(function(x) x + 1, e)
[1] 1 2 3  

Filter in various languages

>e = [0, 1, 2, 3, 4, 5]
>filter(lambda x: x > 2, e) # Filter is also disfavored in Python vs the list comprehension. 
[3, 4, 5]
>[x for x in e if x > 2]
[3, 4, 5]
>e = [0,1,2,3,4,5]
>e.filter{|x| x > 2} # Could be aliases to the same function
 => [3, 4, 5]
>e.select{|x| x > 2}
 => [3, 4, 5]
>[0,1,2,3,4,5].filter((x) => x > 2)
[ 3, 4, 5 ]
>filter (>2) [0,1,2,3,4,5]
[3,4,5]
>e <- c(0,1,2,3,4,5)
>e[e > 2]
[1] 3 4 5
>subset(e, e > 2)
[1] 3 4 5
>Filter(function(x) x > 2, e)
[1] 3 4 5  

Reduce in various languages

>e = [0, 1, 2]
>reduce(lambda x, y: x + y, e)
3
>e = [0,1,2]
>e.reduce{|x,y| x + y} 
 => 3
>[0,1,2].reduce((x,y) => x + y)
3
>foldr (+) [0,1,2] # See the following for more details: https://wiki.haskell.org/Foldr_Foldl_Foldl%27
3
>e <- c(0,1,2)
>Reduce(function(x,y) x + y, e)
[1] 3

Binary Search

Searching for an element in an array is a common task in algorithms and it appears more often than you may think. For example, the common conditional 'if varname in listname:' uses a linear search to find an element in an array. In Python, 'in' is an infix style of '__contains__' and is a linear search. If you're running the 'in' command on an unsorted list for each of 1M records, you can see how non-optimal detection of presence/abscence could heavily impact performance of your script. So what *should* a programmer use?

There are many different search algorithms that work on 1D or 2D structures. Of course the simplest method in 2018 may be to use a hashmap based data structure for direct access. However, the 1D binary search algorithm is illustrated below. In this diagram, the search is for the element '27'. The algorithm begins by bisecting the array such that 21 is compared to 27. As the desired element is larger than the element in the middle of the sorted array, the algorithm then bisects the upper half of the array, recursing until it locates the element 27. Also note that a linear search (shown on the bottom) takes a larger number on average than the binary search to locate an item in the array.

Binary search tree example

Search in various languages

>import bisect
>e = [0, 1, 2]
>eSet = set(e)
>frozenESet = frozenset(e)
>def index(a, x):
.  i = bisect_left(a, x)
.  if i != len(a) and a[i] == x:
.    return i
.  raise ValueError

# Presence/absence
>1 in e # in is an infix method for lists that uses '__contains__'
>1 in eSet
>1 in frozenSet
# Numerical index
>e.index(1)
>index(e, 1)
# Retrieval
>next(filter(lambda x: x == 1, e), None) # Returns 1 or None
>require 'set'
>e = [0,1,2]
>eSet = Set.new(e)
# Presence/absence
>e.include?(1)
>e.member?(1)
>eSet.include?(1)
>eSet.member?(1)
# Retrieval
>e.index(1)
>e.find_index(1)
>e.any?{|x| x == 1}
>e.find{|x| x == 1}
>e.sort.bsearch{|x| x >= 1 } # ONLY if sorted
>function bSearch(array, item){
>  let low = 0;
>  let high = array.length - 1;
>  while (low <= high) {
>    mid = (low + high) / 2;
>    if (array[mid] > value) high = mid - 1;
>    else if (array[mid] < value) low = mid + 1;
>    else return mid;	
>  }
>  return -1;
}
>e[0,1,2]
>e.includes(1)
>e.indexOf(1) // 1 or -1 if DNE
>e = [0, 1, 2]
>import Data.Array (Array, Ix, (!), listArray, bounds)
 
>:{bSearch
>  :: Integral a
>  => (a -> Ordering) -> (a, a) -> Maybe a
>bSearch p (low, high)
>  | high < low = Nothing
>  | otherwise =
>    let mid = (low + high) `div` 2
>    in case p mid of
>         LT -> bSearch p (low, mid - 1)
>         GT -> bSearch p (mid + 1, high)
>         EQ -> Just mid
>:} 

>bSearchArray
>  :: (Ix i, Integral i, Ord e)
>  => Array i e -> e -> Maybe i
>bSearchArray a x = bSearch (compare x . (a !)) (bounds a)
>:{
>axs =
  listArray
    (0, 11)
    [ "alpha"
    , "beta"
    , "delta"
    , "epsilon"
    , "eta"
    , "gamma"
    , "iota"
    , "kappa"
    , "lambda"
    , "mu"
    , "theta"
    , "zeta"
    ]
>:}
>bSearchArray axs "mu"
(9, "mu")

At this point, you have seen options for appending, searching, and mapping transforms on arrays in a few languages. The append example showed how much variation there can be in methodology and performance for fundamental array operations. The search method showed how the 'in' operator can become taxing, and how simple sorting can dramatically improve performance. Arrays are important data structures for all scripts, packages/libraries, and apps. Some simple curiosity can yield performance gains and makes coding a more scholastic practice. I hope you enjoyed my first introductory coding post and I'll see you next time!