The in Operator
Introduction
If you use the in operator in Python, testing whether an element is present in a collection is almost as easy as asking a question.
>>> 'one' in ['one', 'two', 'three']
True
>>> 'four' in ('one', 'two', 'three')
False
>>> 'five' in {'one', 'two', 'three'}
False
>>> 'three' in {'one': 1, 'two': 2, 'three': 3}
True
From the examples above, we can see that the in operator works with various collection types: lists, tuples, sets, and dictionaries. What about performance? If you frequently had to test membership against a collection of a million elements, which data structure would you choose and why?
Memory Footprint and Performance
Lists and Tuples
To create and store elements in a list, Python has to allocate some memory for the list itself, and memory for the pointers to the elements it stores. To prevent a memory allocation on each list.append, Python allocates some extra memory to keep appends fast (amortized O(1)). So while there may be a single element in a list, there is actually room for 4. Once those 4 places are occupied, Python will perform a reallocation making room for another 4. You can test this out for yourself by calling list.__sizeof__() after each append.
l = []
print(l.__sizeof__())  # allocation for the list itself
for i in range(5):
    l.append(i)  # only when the fifth item is appended will the list change size
    print(len(l), l.__sizeof__())
Tuples, on the other hand, have a fixed size. That means there is no reallocation and no extra layer of indirection: the pointers to the elements are stored directly in the tuple struct, making tuples very lean, memory-wise.
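To get a rough feeling for that difference, here is a small sketch comparing the sizes reported by __sizeof__; the exact byte counts depend on your CPython version, but a tuple should come out leaner than a list grown by appending.

t = tuple(range(1000))

l = []
for i in range(1000):
    l.append(i)  # grown by appending, so the list is over-allocated

print('tuple:', t.__sizeof__())
print('list: ', l.__sizeof__())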
Finding an element in a tuple or a list potentially requires traversing the whole collection and comparing each element in the collection against the object whose membership you are testing. The following code snippet illustrates that:
def contains(collection, target):
    for item in collection:
        if target == item:
            return True
    return False
If the target object is not present in the collection, the whole collection will be traversed before False is returned. The complexity of such an operation is O(n): for a list twice as long, membership testing will take twice as long. If membership testing happens often on large lists or tuples, Python will spend quite some time traversing those lists and tuples, over and over again.
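To make the question from the introduction concrete, here is a minimal micro-benchmark sketch. Timings vary from machine to machine, but for a million elements the set lookup is typically orders of magnitude faster than the list scan, and the next sections explain why.

import timeit

data_list = list(range(1_000_000))
data_set = set(data_list)

# Look up a value sitting at the very end: the worst case for the linear
# scan over the list, an ordinary case for the set.
print(timeit.timeit('999_999 in data_list', globals=globals(), number=100))
print(timeit.timeit('999_999 in data_set', globals=globals(), number=100))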
Sets and Dictionaries
What about sets and dictionaries? To understand those data structures, we first need to understand the basics of hash maps.
Custom Hash Map
Not satisfied with having to traverse the whole collection to find the element of interest, we wonder whether there is a way to simply compute the index of the element we are looking for. So, if we have a list l = ['one', 'two', 'three'] and we want to know whether 'two' is in our list, we need a function that maps the element 'two' to the index 1, because l[1] == 'two'. A hash function does precisely that. Let’s try to implement our own hash map so that we can qualitatively understand the memory footprint and performance.
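For intuition, Python already ships a built-in hash function, and the whole idea of a hash map is to turn hash(key) into a position in an array of buckets, for example by taking it modulo the number of buckets. A tiny sketch (the concrete indices will differ between runs, because string hashing is randomized, but equal strings always land in the same bucket within one run):

for key in ('one', 'two', 'three'):
    print(key, hash(key) % 8)  # pick one of 8 buckets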
The hash map stores key-value pairs in buckets. Each bucket keeps its key-value pairs in an array, and the buckets themselves are also stored in an array; for our Python equivalent, we will use lists for both.
def naive_hash(value: str):
    # Maps every string starting with 'a' or 'A' to 0, 'b' or 'B' to 1, etc.
    if value[0] not in 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ':
        raise ValueError('naive_hash supports only ascii letters')
    return ord(value[0].upper()) - 65

class KeyValue:
    def __init__(self, key, value):
        self.key = key
        self.value = value

    def __repr__(self):
        cls = self.__class__.__name__
        return f'<{cls}(key={self.key}, value={self.value})>'

class Bucket:
    def __init__(self):
        # Bucket holds a list of key-value pairs.
        self._keys_values: list[KeyValue] = []

    def put(self, target: KeyValue):
        # Check whether this key is already present. If it is, replace its value.
        for key_value in self._keys_values:
            if key_value.key == target.key:
                key_value.value = target.value
                return
        # This key is not present, just append the key-value pair.
        self._keys_values.append(target)

    def get(self, key: str) -> KeyValue:
        # If there are many key-value pairs in a bucket, we have to check
        # each one for the key we are searching for.
        for key_value in self._keys_values:
            if key_value.key == key:
                return key_value
        raise KeyError(f'no key "{key}"')

    def delete(self, key: str):
        for key_value in self._keys_values:
            if key_value.key == key:
                self._keys_values.remove(key_value)
                return
        raise KeyError(f'no key "{key}"')

    def contains(self, key) -> bool:
        for key_value in self._keys_values:
            if key_value.key == key:
                return True
        return False

    def __bool__(self):
        return bool(self._keys_values)

    def __repr__(self):
        cls = self.__class__.__name__
        return f'<{cls} {repr(self._keys_values)}>'

class HashMap:
    def __init__(self):
        # HashMap holds a list of buckets.
        # We initialize the list with empty buckets to avoid memory
        # reallocation. There are 26 ascii letters, so we create 26 buckets.
        self._buckets: list[Bucket] = [Bucket() for _ in range(26)]

    def put(self, key_value: KeyValue):
        # We calculate the index, i.e. the corresponding bucket for this key.
        hash_value = naive_hash(key_value.key)
        self._buckets[hash_value].put(key_value)

    def get(self, key: str):
        hash_value = naive_hash(key)
        return self._buckets[hash_value].get(key)

    def delete(self, key: str):
        hash_value = naive_hash(key)
        self._buckets[hash_value].delete(key)

    def contains(self, key) -> bool:
        hash_value = naive_hash(key)
        return self._buckets[hash_value].contains(key)

    def __repr__(self):
        cls = self.__class__.__name__
        buckets = ', '.join(
            f'{idx}: {bucket}' for idx, bucket in enumerate(self._buckets)
            if bucket
        )
        if buckets:
            return f'<{cls} {buckets}>'
        return f'<{cls}>'
There is a lot to digest in the code sample above, so let’s repeat the most important steps and points. The main idea of the hash map is to find the corresponding bucket for a given key. We do this by using our naive_hash function, which produces a bucket index for any string starting with an ascii letter. There is one subtle point to be made regarding our naive hash function: it produces the same hash value, i.e. the same bucket index, for all strings starting with the same letter. This is called a hash collision, and when that happens, the corresponding bucket stores multiple key-value pairs. Let’s see an example.
>>> hm = HashMap()
>>> hm
<HashMap> # our hash map is empty
>>> hm.put(KeyValue('one', 1))
>>> hm
<HashMap 14: <Bucket [<KeyValue(key=one, value=1)>]>> # bucket 14 got an item
>>> hm.put(KeyValue('two', 2))
>>> hm
<HashMap 14: <Bucket [<KeyValue(key=one, value=1)>]>, 19: <Bucket [<KeyValue(key=two, value=2)>]>>
>>> hm.put(KeyValue('three', 3)) # hash collision
>>> hm
<HashMap 14: <Bucket [<KeyValue(key=one, value=1)>]>, 19: <Bucket [<KeyValue(key=two, value=2)>, <KeyValue(key=three, value=3)>]>>
>>> naive_hash('one'), naive_hash('two'), naive_hash('three')
(14, 19, 19)
From the above example we can see that 'one' goes to bucket 14, while 'two' and 'three' produce a hash collision and end up in the same bucket, bucket 19.
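Even with the collision, lookups still work: the hash narrows the search down to bucket 19, and the bucket then compares the stored keys one by one.
>>> hm.get('three')
<KeyValue(key=three, value=3)>
>>> hm.contains('three')
True
>>> hm.contains('ten')  # also hashes to bucket 19, but the key is not stored
False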
Hash Map Performance
Because of the underlying hash function, finding the appropriate bucket is a single operation, which means retrieving elements from the hash map is very fast. In fact, it is O(1), i.e. independent of the number of elements in the hash map, unless collisions are so frequent that many elements end up in the same bucket. That’s why production-ready hash maps reallocate memory, keep track of bucket occupancy, and redistribute elements across buckets to keep retrieval O(1). To prevent hash collisions from happening too frequently, Python keeps roughly one third of the underlying array unoccupied. This increases the memory footprint, but it gives a huge performance gain.
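You can observe this trade-off from Python itself: as a set grows, the size it reports jumps ahead of its length, because the underlying table is resized before it fills up. A small sketch (the exact byte counts and resize points vary between CPython versions):

s = set()
print(len(s), s.__sizeof__())
for i in range(20):
    s.add(i)
    print(len(s), s.__sizeof__())  # the size jumps long before the set is full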
Although sets and dictionaries have a memory overhead, they are data structures optimized for fast membership checking, and you should use them with the in operator whenever possible.
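The one requirement is that the elements must be hashable. Mutable built-ins such as lists are not, which is why they cannot be put into a set in the first place:
>>> {['one', 'two']}
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'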
What is a Collection
Recalling our example from the introduction, you might believe that improving the performance of the in operator amounts to changing braces: from square and round to curly. Actually, that’s not far from the truth. The question is, why does it work? And, maybe more importantly, how can you define your own objects that support membership checking via the in operator?
This works because lists, tuples, dictionaries, and sets are all instances of the collections.abc.Collection class. That is an abstract base class used to define a protocol that its instances should implement. If you check the documentation, you will see that Collection inherits from Sized, Iterable, and Container. Inheriting from Sized means that you can ask how many elements there are in the collection by implementing __len__. Inheriting from Iterable means that the collection can be iterated over using a for loop, by implementing __iter__. And finally, inheriting from Container means that you can test for membership using the in operator by implementing the __contains__ special method. This is what all collections have in common, and this is what makes Python easy to use. Let’s explore this through some code samples.
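As a first sample, we can confirm these relationships directly in the interpreter using the abstract base classes from the standard library:
>>> from collections.abc import Collection, Container, Iterable, Sized
>>> all(isinstance(obj, Collection) for obj in ([1], (1,), {1}, {'one': 1}))
True
>>> issubclass(Collection, Sized), issubclass(Collection, Iterable), issubclass(Collection, Container)
(True, True, True)
Now let’s define our own container and give it membership testing by implementing __contains__.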
class ContainerTest:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __contains__(self, item):
        return item in (self.x, self.y)
>>> ct = ContainerTest(1, 2)
>>> 1 in ct
True
>>> 2 in ct
True
>>> 3 in ct
False
In the example above, Python will actually call the __contains__ special method like this:
ct.__contains__(1)
That’s fairly straightforward, but what is perhaps less known is that Python will try to support the in operator even if __contains__ is not implemented. If __contains__ is missing, Python will try iterating over the collection, provided the object supports iteration via __iter__. If that is not available either, it will fall back to membership testing using the legacy __getitem__ protocol. Let’s take a look at a few code examples.
class IterableTest:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __iter__(self):
        yield from (self.x, self.y)
This class supports iteration, and that is enough for membership testing using the in operator. That’s the reason why the in operator works with generators as well.
>>> it = IterableTest(1, 2)
>>> 1 in it
True
>>> 2 in it
True
>>> 3 in it
False
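The same holds for a generator expression, with one caveat worth knowing: the membership test consumes the generator up to and including the first match.
>>> gen = (n * n for n in range(5))
>>> 9 in gen
True
>>> 9 in gen  # the generator has already been consumed past 9
False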
Finally, if neither __contains__ nor __iter__ is available, Python will try the legacy __getitem__ protocol for membership testing (and iteration).
class GetItemTest:
    def __init__(self, x, y):
        self.x = x
        self.y = y

    def __getitem__(self, item):
        if item == 0:
            return self.x
        elif item == 1:
            return self.y
        else:
            raise IndexError
The __getitem__ protocol is used to support the indexing operation, such as you would apply to a list.
>>> l = [1, 2, 3]
>>> l[0] # l.__getitem__(0)
1
Let’s verify that __getitem__ supports membership testing.
>>> gt = GetItemTest(1, 2)
>>> 1 in gt
True
>>> 2 in gt
True
>>> 3 in gt
False
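As a bonus, the same legacy protocol also makes the object iterable, because Python keeps calling __getitem__ with 0, 1, 2, ... until IndexError is raised.
>>> list(gt)
[1, 2]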
Conclusion
We have seen that the in operator is compatible with many data structures. Python will try to support that operator first by checking whether the __contains__ special method is implemented. If it is not, it will try iterating through the collection by using __iter__, or __getitem__ if the former is not available. If the elements of a collection are hashable, the in operator will be most performant with sets and dictionaries.