Performance: Improving memory needs and dict lookup time with string interning

written by Javier Cordero Martinez on 2016-11-03

In this article I will talk about a method called string interning and how it can help us reduce the memory consumed when we need to handle a large amount of strings with many repeated values. This technique also speeds up lookups in big dictionaries.

The last time I needed this technique, we had to build a new dictionary by merging data from two different dictionaries, each one holding results from a different NLP analysis: a sentiment analysis and a category analysis. Both dictionaries could be of any size, with many duplicated sentences, categories and terms, so building the final dictionary required a lot of lookups and string comparisons.

What is string interning?

String interning is a method that stores only one copy of each distinct string value. With it, some string processing becomes faster and needs less space. In Python this is possible because strings are immutable; it is also available in Java, Ruby, .NET and others.

When string interning is used, the interpreter keeps a table with a pointer to each interned string. This way, when a duplicated string assigned to two variables is interned, we end up with two variables pointing to the same table entry. String interning is used massively in Python internals: for names, for the dictionary keys that hold module, class and instance attributes, and so on.

It is important to know that we keep handling strings as always; everything is magically handled by the interpreter.

The improvement in dictionary lookup is achieved because, after hashing, only a pointer comparison is done instead of a full string comparison. The same applies to most string operations; some examples will be shown later.
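To see the dictionary fast path in action, here is a micro-benchmark sketch (timings are machine-dependent, and the variable names are mine): a dict holding one long interned key is queried first with an equal but distinct string, then with the interned object itself.

```python
import sys
import timeit

# One long interned key stored in the dict.
stored = sys.intern(''.join(['k'] * 10_000))
d = {stored: 1}

dup = ''.join(['k'] * 10_000)  # equal content, but a distinct object
hash(dup)                      # pre-cache the hash so only the comparison differs
same = sys.intern(dup)         # returns the already-interned key object
assert same is stored

# Looking up with `dup` matches the hash, then must compare all 10,000
# characters; looking up with `same` hits the identity check immediately.
slow = timeit.timeit('d[k]', globals={'d': d, 'k': dup}, number=100_000)
fast = timeit.timeit('d[k]', globals={'d': d, 'k': same}, number=100_000)
print('equal key:    {0:.4f}s'.format(slow))
print('interned key: {0:.4f}s'.format(fast))
```

The absolute numbers will differ per machine, but the interned lookup should be noticeably cheaper because the per-character comparison is skipped.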

Using intern()

Python 2

We can enable string interning with the built-in function intern. One pitfall of Python 2 is that only ASCII strings are supported, so Unicode strings cannot be interned.

a = intern('this text will use string interning using built-in')
Python 3

In this version, the intern built-in has been removed and now lives in the sys module as sys.intern(). In this case, Unicode strings are supported.

import sys
a = sys.intern('this text will use string interning using sys')


I will use Python 3, but the examples work with Python 2 too. In this example we will see that two variables with the same string content are different objects (the is expression checks whether two variables are the same object); after calling intern() on both, a and b point to the same object:

>>> a = 'this string is foo'
>>> b = 'this string is foo'
>>> a is b  # different objects
False
>>> a == b  # with same content
True
>>> import sys
>>> a = sys.intern(a)  # after intern(), string operations work the same
>>> a == b
True
>>> a is b
False
>>> b = sys.intern(b)
>>> a is b  # both strings have been interned, the vars now have the same pointer
True
>>> a == b
True
>>> a  # text is still returned, like always
'this string is foo'

As we might expect, some operations like string comparison will be faster because pointers are compared instead of looking at each character, so we can spend a minimal amount of time calling intern() for bigger gains later.
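As a rough illustration (timings vary per machine; the names below are mine), here is a sketch comparing equality checks on two long, equal strings before and after interning. CPython's string equality starts with an identity check, so the interned pair returns immediately:

```python
import sys
import timeit

# Two equal 10,000-character strings built at runtime, so they are distinct objects.
a = ''.join(['x'] * 10_000)
b = ''.join(['x'] * 10_000)
assert a == b and a is not b

# After interning both, the two names point at the same object.
ai = sys.intern(a)
bi = sys.intern(b)
assert ai is bi

# Comparing distinct objects walks the characters; comparing the interned
# pair takes the identity fast path.
plain = timeit.timeit('a == b', globals={'a': a, 'b': b}, number=100_000)
interned = timeit.timeit('a == b', globals={'a': ai, 'b': bi}, number=100_000)
print('distinct objects: {0:.4f}s'.format(plain))
print('interned objects: {0:.4f}s'.format(interned))
```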

Interned strings can be garbage collected if no reference to the return value of sys.intern() is kept, so it is important to save it.
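A small sketch of the pitfall (CPython behavior; the variable names are mine): if an equal string is already in the intern table, sys.intern() returns that earlier copy, so discarding the return value leaves you holding a duplicate:

```python
import sys

# Build the strings at runtime so the compiler doesn't intern them for us.
canonical = sys.intern(''.join(['intern', '-demo']))  # first copy enters the table

dup = ''.join(['intern', '-demo'])  # equal content, distinct object
sys.intern(dup)                     # return value discarded: dup stays a duplicate
assert dup is not canonical

dup = sys.intern(dup)               # re-bind the name to the canonical copy
assert dup is canonical
```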

Internal details of Python implementation

Obviously, string interning is possible in every Python implementation, but there is some implementation-specific behavior that is good to know. I will talk about CPython (because it is the only one I'm using): small strings are automatically interned, but only when they are compile-time constants:

>>> import sys
>>> a = 'example'
>>> b = a + 's'  # an expression instead of a constant
>>> a is 'example'  # a is a compile-time constant, is already interned
True
>>> b is 'examples'  # b isn't a compile-time constant, is not interned automatically
False
>>> b == 'examples'
True
>>> sys.intern(b) is 'examples'
True

Memory measure

I will use Pympler, a very easy to use memory measurement tool compatible with Python 3. In this example, two dictionaries will be created, both with the same number of keys and the same repeated string values, but one of them will use interned strings.

>>> import sys
>>> from pympler import asizeof
>>> not_interned = {str(x): 'a'*(x % 50) for x in range(1000)}  # a dict with 1000 keys and 50 distinct values, each repeated 20 times
>>> not_interned_size = asizeof.asized(not_interned)  # memory consumed by the dictionary, in bytes
>>> not_interned_size.format()
"{'29': 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa'....1': 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'} size=180240 flat=49248"
>>> interned = {str(x): sys.intern('a'*(x % 50)) for x in range(1000)}  # same as not_interned, but the values are interned strings
>>> interned_size = asizeof.asized(interned)
>>> interned_size.format()
"{'29': 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaa'....1': 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'} size=109104 flat=49248"
>>> print('{0} bytes'.format(not_interned_size.size - interned_size.size))
71136 bytes
  • size is the sum of flat size plus the sizes of any referents (in bytes), this is the real size of the object.
  • flat is the sum of the basic size of the object plus the item size times the number of allocated items (in bytes).

As we can see, the non-interned dict has a size of 180240 bytes and the interned one 109104 bytes, so about 69.5 KB were saved just by using sys.intern in this example with 50 distinct strings repeated 20 times each (note that the intern table will grow, because it now holds those 50 strings, but without repetition). Not much, but it was a small example to show Pympler.

Example: Reducing memory needs

For this example, I will use a .csv from Kaggle. This dataset (147.5 MB in total) has all baby names used in each state of the USA from 1910 to 2014. For example, the name Anna was used 6 times in the state of Alaska in each of the years 1945 and 1946:

Id,Name,Year,Gender,State,Count

948,Anna,1945,F,AK,6
1004,Anna,1946,F,AK,6

So we can assume there will be a lot of repeated strings: the same name every year for each state, and the gender letter and the state code are strings too. Let's see if string interning is useful!
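Before running the real thing, a miniature sketch (with made-up rows in the same format) shows why interning should pay off here: parsing creates a fresh string object for every occurrence of a name, and interning collapses all of them into one:

```python
import sys

# A hypothetical miniature of the dataset: the same strings occur in every
# row, but parsing builds a fresh str object for each occurrence.
lines = ['948,Anna,1945,F,AK,6', '1004,Anna,1946,F,AK,6'] * 500
rows = [line.split(',') for line in lines]
print(len({id(row[1]) for row in rows}))  # 1000 distinct 'Anna' objects

# Interning the name and state fields collapses all copies into one object each.
interned = [[row[0], sys.intern(row[1]), row[2],
             row[3], sys.intern(row[4]), row[5]] for row in rows]
print(len({id(row[1]) for row in interned}))  # 1
```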

Obviously, you would normally use something like Pandas to process this data, but assume we are receiving it (from a DB, for example) and need to hold it in memory before doing anything else. We are going to focus only on reducing Virtual Memory (VM, the memory allocated by the Python interpreter) and the size of the list stored in the Python heap. Remember that memory no longer used and garbage collected doesn't reduce the VM size; the interpreter keeps it to avoid asking for more memory again, so VM size will grow or stay the same even when objects go out of scope.

For this example I will use these three functions: mem_summary() shows the memory status, load_csv() reads the already downloaded csv and yields each row as a tuple, and intern_csv() interns some of the strings from those rows so we can see the improvement:

import sys
import csv
import psutil
from pympler import summary, muppy, asizeof


def mem_summary():
    summary.print_(summary.summarize(muppy.get_objects()), limit=5)
    print('Virtual memory size:', psutil.Process().memory_info().vms / (1024 * 1024), 'MB')


def load_csv():
    """ Load the csv, yielding each row as a tuple """
    with open('StateNames.csv', 'r') as csvfile:
        next(csvfile)  # skip the header row
        csvreader = csv.reader(csvfile)
        for row in csvreader:
            yield (int(row[0]), row[1], int(row[2]), row[3], row[4], int(row[5]))


def intern_csv(gen):
    """ Given an iterable of row tuples, intern some of their string fields """
    for row in gen:
        yield (row[0], sys.intern(row[1]), row[2], sys.intern(row[3]), sys.intern(row[4]), row[5])

Without string interning

>>> mem_summary()  # initial memory summary and VM size
         types |   # objects |   total size
============== | =========== | ============
  <class 'dict |        1747 |      1.32 MB
   <class 'str |       11288 |      1.30 MB
  <class 'type |         512 |    504.47 KB
  <class 'code |        3438 |    483.64 KB
   <class 'set |         370 |    141.44 KB
Virtual memory size: 98.93359375 MB
>>> loaded_data = list(load_csv())  # csv just loaded as list of tuples
>>> mem_summary()  # after data is loaded, check memory again
         types |   # objects |   total size
============== | =========== | ============
  <class 'list |         464 |     46.11 MB
  <class 'dict |        1747 |      1.32 MB
   <class 'str |       11288 |      1.30 MB
  <class 'type |         512 |    504.75 KB
  <class 'code |        3438 |    483.64 KB
Virtual memory size: 1667.3828125 MB
>>> print('Real list memory usage', asizeof.asizeof(loaded_data)/(1024*1024), 'MB')
Real list memory usage 1522.7899169921875 MB

With a fresh process we already have 98.93 MB allocated as VM. After loading the csv, the VM size has increased to 1.62 GB and our list of tuples uses 1.48 GB (the 46.11 MB shown in the summary table is just the size of the list itself, without the tuples inside it).

With string interning

>>> mem_summary()  # initial memory summary and VM size
         types |   # objects |   total size
============== | =========== | ============
  <class 'dict |        1747 |      1.32 MB
   <class 'str |       11288 |      1.30 MB
  <class 'type |         512 |    504.47 KB
  <class 'code |        3438 |    483.64 KB
   <class 'set |         370 |    141.44 KB
Virtual memory size: 98.93359375 MB
>>> loaded_data_interned = list(intern_csv(load_csv()))  # csv loaded as list of tuples with strings interned
>>> mem_summary()  # after data is loaded, check memory again
         types |   # objects |   total size
============== | =========== | ============
  <class 'list |         464 |     46.11 MB
  <class 'dict |        1747 |      1.32 MB
   <class 'str |       11288 |      1.30 MB
  <class 'type |         512 |    504.75 KB
  <class 'code |        3438 |    483.64 KB
Virtual memory size: 1057.33984375 MB
>>> print('Real list memory usage', asizeof.asizeof(loaded_data_interned)/(1024*1024), 'MB')
Real list memory usage 916.0488128662109 MB

Again we start with the same 98.93 MB of VM, but now we see some improvement: after loading the dataset and interning its strings, the allocated VM was reduced to 1.03 GB and the real size of the list is now 0.89 GB.

In conclusion, just by using string interning, the total memory allocated by the Python process is now 37% smaller, and the memory needed in the Python heap to store the list was reduced by 40%.

Conclusions

As shown, this simple technique can help us improve our code in two ways:

  • Faster dictionary lookups and string operations (Python itself uses interning internally for some data structures).
  • Reduced memory needs when working with a lot of repeated strings.