Choosing the right data structure

Introduction

In data processing, a data structure is a logical structure intended to contain data, in order to give them an organization enabling them to be processed. Knowledge of them makes it possible to choose the one that best responds to a given problem. My goal is not to do an algorithm course but only to detail some of the implementations chosen by Microsoft to realize the classes of the .Net Framework and to be able to choose the one that best corresponds to our need. If you want to understand in more detail how a data structure works, I advise you to go on your favorite search engine and type its name.

The notations

To measure the effectiveness of an algorithm, one will measure the time it uses. Just count the instructions and weight them by their cost. There are several types of operations:

  • Arithmetic operations
  • Read / assign variables
  • Comparisons

To note the complexity of the algorithms the notion of O(n) (as in maths) is used. O(n) means "at worst". Here are some examples for everyone to understand. Let us imagine that we are a list containing 1 million elements. If an algorithm is

  • O(1): Whatever the number of elements, the time is always identical.
  • O(ln(n)): For our list it means that ln(1000000)≈14 operations to realize the algorithm

In order of complexity (from the fastest to the slowest): 1, ln(n), n, n ln(n), n2, en

LinkedList<T>

It is a twice-chained circular list. This makes it possible to make head and tail insertions at 0(1). Indeed to insert a element in tail, it is enough to insert it before the head element because the list is circular. To search for an item you have to browse the entire list item by item until you find the right one, which is not the most efficient.

List<T>

This is a array that increases its size when needed. The initial size of the array by default is 4. When you try to insert an element while the array is full, a new array is allocated. This one is twice as big as the previous one. The items of the old array are copied to the new array and the new item can be added. To insert an element at a precise index, you must first move all the following elements of the array one space and then put the new element in its place. Removal involves moving all of the following items back one space. To search for an item you have to browse the entire list item by item until you find the right one. Again, this is not the most effective.

SortedList<TKey, TValue>

Its operations are identicals to List. However, the items in the table are sorted in ascending order. When adding, it is necessary to start by looking for his place in the table. For this, a dichotomous search is performed (O(log(n)). The rest of the operations are strictly identical to the insertion in an unsorted list.

HashSet<T>

The HashSet is a hash table with resolution of collisions by coalesced chaining. To insert an element, the function starts by computing the hash value using the GetHashCode method and thus deduces the index of the element in the array (hash modulo size of the array). If there are already an element at this position (this is called conflict), a chaining principle is used. That is, we look for an empty bucket from the end of the table and add the element. At the position we should have inserted the element we added information indicating the index that was used to store a value having the same hash value. To find an element, you calculate its hash and fetch the element at the corresponding index. If the value in this box matches the value you are looking for: Bingo. Otherwise you look at whether there was a collision (there is a link to another element) and you follow the link. If it is still not the right value, you keep going until there is no more link to follow. A hash function is considered good if there are no more than 5 elements in the collection having the same hash value. On average the search is in O(1). However, if the hash function is bad and generates too much collision, the search can be in O(n).

Dictionary<TKey, TValue>

The dictionary works like the HashSet except that the hash value is calculated from the key (TKey) and not from the value (TValue).

SortedSet<T>

The SortedSet is implemented with a Red-Black tree (or two-color tree). This is a balanced search binary tree. All operations are done in O(log(n)).

SortedDictionary<TKey, TValue>

In short, it is a SortedSet<KeyValuePair<TKey, TValue>>. It therefore has the same operating mode as the SortedSet.

Summary

Insert

GeneralHeadTail
LinkedListO(n)O(1)O(1)
ListO(n)O(n)O(1)
SortedDictionnaryO(log(n))N/AN/A
DictionnaryO(1)N/AN/A
HashSetO(1)N/AN/A
SortedSetO(log(n))N/AN/A
SortedListO(n)N/AN/A

Remove

GeneralHeadTail
LinkedListO(n)O(1)O(1)
ListO(n)O(n)O(1)
SortedDictionnaryO(log(n))N/AN/A
DictionnaryO(1)N/AN/A
HashSetO(1)N/AN/A
SortedSetO(log(n))N/AN/A
SortedListO(n)N/AN/A

Get by index

General
LinkedListO(n)
ListO(1)
SortedListO(1)

Get by value

General
LinkedListO(n)
ListO(n)
SortedDictionnaryO(log(n))
DictionnaryO(1)
HashSetO(1)
SortedSetO(log(n))
SortedListO(log(n))

Measures

Add First

Add Last

Contains

Remove

Remove Last

Conclusion

I hope to have succeeded in demonstrating that each data structure has a different purpose and good performance for some operations and less good performance for others. So, you have to make the right choice according to your needs.

Source code of the tests: Test Performance

Leave a reply