Prerequisites

Download the CPython source and build it in the debug mode to step into the interpreter’s internals.

git clone https://github.com/python/cpython.git
cd cpython

# Configure with debug options
./configure --with-pydebug

# Build (feel free to adjust -j)
make -j8

This will produce a local python binary.

Introduction

CPython is one of the many implementations of the Python runtime written in C code. There are other implementations such as PyPy, Jython. PyPy is a python compiler written in Python. PyPy’s logo is an Ouroboros to represent the self-hosting nature of the compiler. In this blog post, I summarize how object memory is alllocated and freed and how CPython manages memory leaks.

Python is a dynamically-typed language. The size of variables can’t be calculated at compile-time. Most of Python’s core types are dynamically sized. The list type can be of any size, a dict can have any number of keys, and even int is dynamic. The user never has to specify the size of these types. Names in Python can be reused for values of different types.

a_value = 1
a_value = "Now, I'm a string"
a_value = ["Now", "I", "am", "a", "list"]

To overcome these constraints, CPython relies heavily on dynamic memory allocation, but adds safety rails to automate the freeing of memory using the garbage collection and reference counting algorithms.

Instead of the Python developer having to allocate memory, Python object memory is allocated automatically by a single, unified API. This design requires that the entire CPython standard library and core modules(written in C) use this API.

Allocation Domains

The raw domain is used for allocation from the system heap and large, or non-object related memory.
The object domain is used for the allocation of all Python object related memory.
The PyMem domain is the same as PYMEM_DOMAIN_OBJ. this exists for legacy purposes.

Each domain implements the same interface of frunctions:

_Alloc(size_t size) allocates memory of size bytes and returns a pointer.
_Calloc(size_t nelem, size_t el_size) allocates nelem elements each of size el_size and returns a pointer.
_Realloc(void* ptr, size_t new_size) reallocates memory of size new_size.
_Free(void* ptr) frees memory at ptr back to the heap.

The PyMemAllocatorDomain enumeration represents the three domains in CPython as PYMEM_DOMAIN_RAW, PYMEM_DOMAIN_OBJ and PYMEM_DOMAIN_MEM.

CPython uses \(2\) memory allocators:

malloc is the operating system allocator for the raw memory domain.
pymalloc is the CPython allocator the PyMem and object domains.

The CPython memory allocator

The CPython memory alloctor sit on top of the OS memory allocator and has an algorithm for allocation.

Most of the memory allocation requests are small and of a fixed size, because PyObject is \(16\) bytes, PyASCIIObject is \(42\) bytes, PyCompactUnicodeObject is \(72\) bytes.
The pymalloc allocator allocates memory blocks only upto \(256\) KB. Anything larger is sent to the OS allocator.
pymalloc uses the GIL instead of the system thread-safety check.

To help clarify this situation, you can imagine a sports stadium, home of CPython FC, as an analogy. To help manage crowds, CPython FC has implemented a system breaking the stadium up into sections A to E, each with seating in rows \(1\) to \(40\).

At the front of the stadium, rows \(1\) to \(10\) are the roomier premium seats, with \(80\) seats in each row. At the back rows \(31\) to \(40\) are the conomy seats with \(150\) seats per row.

The Python memory allocation algorithm has similar characteristics:

Just like the stadium as seats, the pymalloc algorithm has memory blocks.
Just like the seats can be premium, regular, or economy, memory blocks are all of range of fixed sizes. You can’t bring your desk chair!
Just like seats of the same size are put into rows, blocks of the same size are put into pools.

A central register keeps a record of where blocks are and the number of blocks available in a pool, just as the stadium allocates seating. When a row in the stadium is full, the next row is used. When a pool of blocks is full, the next pool is used. Pools are groups into arenas, just like the stadium groups the rows into sections.

%load_ext itikz

There are several advantages to this strategy:

The algorithm is more performant for CPython’s main use case: small, short-lived objects.
The algorithm uses the GIL instead of system thread-lock detection.
The algorithm uses memory mapping mmap() syscall instead of heap allocations.

The requested memory is always matched to a block size. Blocks of the same size are all put into the same pool of memory. Pools are grouped into arenas.

Blocks, Pools and Arenas

pymalloc allocates large memory regions called arenas from the system, then divides them into pools.

Arenas

Arenas are allocated from the system heap using mmap when available otherwise malloc. CPython arenas are \(256\)-KB on 32-bit systems and \(1\) MB on 64-bit systems. Each arena contains multiple pools.

#ifdef USE_LARGE_ARENAS
#define ARENA_BITS              20                    /* 1 MiB */
#else
#define ARENA_BITS              18                    /* 256 KiB */
#endif
#define ARENA_SIZE              (1 << ARENA_BITS)
#define ARENA_SIZE_MASK         (ARENA_SIZE - 1)

Arenas are scattered in the main memory. Each arena is a contiguous chunk of memory. CPython maintains a doubly linked list usable_arenas that tracks all the arena objects, that have atleast one pool available for allocation. The usable_arenas list is maintained in ascending order of each arena’s nfreepools (number of free pools). This sorting ensures that allocations preferentially use arenas with fewest free pools.

Our main book-keeping data structure of interest is the arena_object. The usable_arenas doubly linked list strings together arena_object nodes . These contain meta-information of each arena with prevarena and nextarena pointers. Each of these nodes have a address field which is a pointer to the actual contiguous block of memory (\(256\)KB / \(1\)MB) returned by malloc.

/* Record keeping for arenas. */
struct arena_object {
    /* The address of the arena, as returned by malloc.  Note that 0
     * will never be returned by a successful malloc, and is used
     * here to mark an arena_object that doesn't correspond to an
     * allocated arena.
     */
    uintptr_t address;

    /* Pool-aligned pointer to the next pool to be carved off. */
    pymem_block* pool_address;

    /* The number of available pools in the arena:  free pools + never-
     * allocated pools.
     */
    uint nfreepools;

    /* The total number of pools in the arena, whether or not available. */
    uint ntotalpools;

    /* Singly-linked list of available pools. */
    struct pool_header* freepools;
    /* Whenever this arena_object is not associated with an allocated
     * arena, the nextarena member is used to link all unassociated
     * arena_objects in the singly-linked `unused_arena_objects` list.
     * The prevarena member is unused in this case.
     *
     * When this arena_object is associated with an allocated arena
     * with at least one available pool, both members are used in the
     * doubly-linked `usable_arenas` list, which is maintained in
     * increasing order of `nfreepools` values.

     *
     * Else this arena_object is associated with an allocated arena
     * all of whose pools are in use.  `nextarena` and `prevarena`
     * are both meaningless in this case.
     */

    struct arena_object* nextarena;
    struct arena_object* prevarena;
};

pool_address serves as the arena’s high water mark pointer that tracks where to carve off the next pool. It’s initialized when an arena is created and advances as pools are allocated.

Each arena_object node also contains a field freepools of the type pool_header* which points to the head of a linked list which tracks available pools in the arena.

When an arena’s nfreepools reaches zero (all pools allocated), it’s removed from usable_arenas linked list. Conversely, when a pool becomes empty and is added to an arena’s freepools, the arena may be added back to usable_arenas if it was previously full.

Pools

Pools are \(4\)KB subdivisions (or \(16\) Kb on large systems) of arenas. Normally, the size of the pool is equal to the size of a memory page. Limiting pool to the fixed size of blocks helps with fragmentation. Each pool manages blocks of single size class and can be in three states:

Used : Partially allocated blocks.
Full : All blocks allocated.
Empty : All blocks available.

/*
 * Size of the pools used for small blocks.  Must be a power of 2.
 */

#ifdef USE_LARGE_POOLS
#define POOL_BITS               14                  /* 16 KiB */
#else
#define POOL_BITS               12                  /* 4 KiB */
#endif
#define POOL_SIZE               (1 << POOL_BITS)

Each pool has a special header structure:

/* Pool for small blocks. */
struct pool_header {

    union { pymem_block *_padding;
            uint count; } ref;          /* number of allocated blocks    */

    pymem_block *freeblock;             /* pool's free list head         */

    struct pool_header *nextpool;       /* next pool of this size class  */

    struct pool_header *prevpool;       /* previous pool of this size class                             */

    uint arenaindex;                    /* index into arenas of base adr */

    uint szidx;                         /* block size class index        */

    uint nextoffset;                    /* bytes to virgin block         */

    uint maxnextoffset;                 /* largest valid nextoffset      */

};

Pools of the same sized blocks are linked together using doubly linked list (the nextpool and prevpool fields). The szidx field keeps the size class index, whereas ref.count keeps the number of used blocks. The arenaindex stores the number of an arena in which this pool was created.

In order to efficient manage pools, a pool table called usedpools is maintained. A pool table is an array of pointers. It is segmented by the size class index i. For an index i, usedpools[i+1] points to the head of linked list of all partial used pools of the same size class i. Each node of this linked list is of type pool_header.

If a full pool has a block freed, then the pool is put back in the used state. The newly freed pool is added to the front of the approapriate usedpools[] list, so the next allocation for its size class will use the freed block.

On transition to empty, a pool is unlinked from its usedpools[] list and added to the front of the arena’s singly linked freepools list.

Blocks

Blocks are the actual allocated units, with sizes ranging from \(8\) to \(512\) bytes in \(8\)-byte increments.

Request in bytes	Size of the allocated block	Size class index
1-8	8	0
9-16	16	1
17-24	24	2
…	…	…
505-512	512	63

On a 64-bit system, since an arena is \(1\) MB, there will always be \(64\) pools each of size \(16\)KB.

Within a pool, blocks of fixed size can be allocated and freed. Available blocks within a pool are listed in the singly linked list freeblock. Whenever a block is freed, its inserted at the front of the freeblock list. When a pool is initialized, only the first two blocks are linked within the freeblock list.

Block allocation API

When a block of memory is requested by a memory domain that uses pymalloc, the API function pymalloc_alloc() is invoked. This function is a good place to insert a break-point and step through the code to test your knowledge of blocks, pools and arenas.

CPython interning

The general rule is that objects are allocated on assignment. Variables just point to objects. There is an exception to this general rule. This is python runtime implementation specific. Often commonly used objects are preallocated and are shared instead of costly new allocations. This is mainly done for performance optimization.

x = 255
y = 255
print(x is y, x == y)
x = 1024
y = 1024
print(x is y, x == y)

True True
False True

ints in the range \([-5,257)\), empty tuples and all empty or single length strings are preallocated.

`string` interning explained

String interning is a method of storing only one copy of each distinct string value, which must be immutable. In Python3, we have a function sys.intern() and if we use this function, we can enter a string in the table of interned strings. We get a reference to the interned string.

So, we can gain a little extra performance on dictionary lookup (key comparisons after hashing can be done by a pointer compare instead of a string compare).

Names used in programs are automatically interned. Dictionaries used to hold module, class or instance attributes have interned keys.

from sys import intern
a, b = "strin", "string"
print(f"{a + 'g' is b}")    # Returns false
intern(a + 'g') is intern(b) # returns True

False

True

Mutable containers memory allocationn strategy

There are different mutable containers in Python - lists, sets and dicts. Behind the scenes, there is a strategy for allocating these containers.

A good strategy will plan for growth and shrinkage. It will slightly overallocate memory needed by the container, to leave room for growth. Each time we append to a list, we won’t have to reallocate the memory. We also have to remember that sometimes we have to shrink the memory for a mutable container. This can help reduce the number of expensive function calls for instance to realloc or memcpy.

List allocation strategy

Lists are stored as fixed length array of pointers. So, we just point to objects. By design, we overallocate for list growth by append.

The capacity growth pattern is roughly \(4,8,16,25,35,46,\ldots\)

If we put something at the end of the list, these operations are cheap. But, if we put something in the middle or the beginning, we’ll have to copy/shift the elements in memory to perform this operation.

On my 64-bit machine, with Python 3.13.7, the list allocation size is:

64 bits : 56 + 8 * length

import sys
l = []
for i in range(17):
    l.append(i+1)
    print(f"List len = {len(l)}, size = {sys.getsizeof(l)} bytes")

List len = 1, size = 88 bytes
List len = 2, size = 88 bytes
List len = 3, size = 88 bytes
List len = 4, size = 88 bytes
List len = 5, size = 120 bytes
List len = 6, size = 120 bytes
List len = 7, size = 120 bytes
List len = 8, size = 120 bytes
List len = 9, size = 184 bytes
List len = 10, size = 184 bytes
List len = 11, size = 184 bytes
List len = 12, size = 184 bytes
List len = 13, size = 184 bytes
List len = 14, size = 184 bytes
List len = 15, size = 184 bytes
List len = 16, size = 184 bytes
List len = 17, size = 248 bytes

There is a fixed overhead of \(56\) bytes (the object header, type pointer, reference count ) etc.

For each size tier, the calculation is approximately:

88 bytes = 56 + 32(4 slots)
120 bytes = 56 + 64(8 slots)
184 bytes = 56 + 128(16 slots)
248 bytes =56 + 192(24 slots)

We shrink when the list size goes below \(1/2\) of the allocated storage. This is why appending to Python lists is \(O(1)\) amortized time - most appends just fill unused capacity, and only occasional appends trigger a reallocation.

Overallocation of dictionaries and sets

These are represented as fixed-length hash tables. We overallocate when we reach \(2/3\)rd of the capacity of a dictionary of set. For small dictionaries or sets, we quadruple the capacity else we double the capacity.