Memory Fragmentation in tcmalloc

One of Deep’s strengths is fast ingest of data.

While the main driver behind the performance comes from our append-only data representation on the filesystem, we still have to load up and index a lot of data in memory before committing them to disk. That means we need a fast memory allocator.

Enter tcmalloc. We picked tcmalloc for its speed and its good handling of small objects, which make up a large part of a database. This works well with a standard schema of an integer primary key and some small-sized secondary keys. However, in some test cases, we encounter tremendous tcmalloc fragmentation. When we use our big data bench tool on certain schemas, we saw a huge permanent increase in tcmalloc’s freelist, which smacks of fragmentation.

The Problem

Here’s a schema which caused high fragmentation (note that this is a very uncommon schema that we cooked up to specifically reproduce memory fragmentation):

CREATE TABLE IF NOT EXISTS `table1` (
  `id` char(80) NOT NULL,
  `column1` int(10) unsigned NOT NULL,
  `column2` varchar(40) NOT NULL,
  `column3` date NOT NULL,
  `column4` datetime NOT NULL,
  `column5` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `column6` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `column7` char(6) NOT NULL,
  `column8` tinytext NOT NULL,
  `column9` varchar(40) NOT NULL,
  `column10` tinyblob NOT NULL,
  `column11` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index1` (`column1`,`column9`,`column5`),
  KEY `column4` (`column4`),
  KEY `index2` (`column9`,`column5`)
) DEFAULT CHARACTER SET=utf8;

Its most noticeable attribute is the huge primary key. It’s also peppered with columns of various sizes, some tiny and some huge. We ran several hours of data ingest into this schema, with randomized data on a 40-core machine with 40 threads, culminating in roughly 900 million rows by the end of the run. What happened to our memory profile when we were ingesting this bad boy?

Our total consumed (allocated) memory hovers roughly around the allowed limit set by the client, but from an outside point of view our memory consumption creeps up because our freelist is growing. Ouch. Let’s take a closer look at the freelist.

The freelist grows and grows, in a generally upward trend. If the test ran for longer, it would have kept on growing and we would see a similar graph with an even larger freelist. At the same time, we kept the Deep’s memory usage constant, so the increased freelist is not coming from generally increased memory usage. That means the low amount of unmapped memory, compared to the freelist, shows that tcmalloc is not coalescing its spans and releasing the memory to the OS.

That can only mean that our memory is fragmented. Where does tcmalloc think the fragmentation is occuring?

------------------------------------------------
MALLOC:    53816178376 (51323.1 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +  20005946928 (19079.2 MiB) Bytes in central cache freelist
MALLOC: +            0 (    0.0 MiB) Bytes in transfer cache freelist
MALLOC: +     23438088 (   22.4 MiB) Bytes in thread cache freelists
MALLOC: +    521449624 (  497.3 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  74367013016 (70921.9 MiB) Actual memory used (physical + swap)
MALLOC: +    434388992 (  414.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  74801402008 (71336.2 MiB) Virtual address space used
MALLOC:
MALLOC:        8931934              Spans in use
MALLOC:             56              Thread heaps in use
MALLOC:           8192              Tcmalloc page size
------------------------------------------------

The fragmentation is happening in the central freelist! This isn’t supposed to happen.

The Mystery

Almost all Deep objects fall under what tcmalloc considers ‘small objects.’ For small objects, tcmalloc operates with two types of caches for freed objects: a thread-local cache (called the thread cache, as in tcmalloc) freelist where objects are allocated from and freed to, and a central cache that thread cache freelists can draw from or give to. For example, when trying to allocate an object if the thread cache freelist is empty, it will simply ask the central cache for more freed objects. The central cache will scrounge up a freelist, whether by requisitioning existing freelists or allocating more memory. Similarly, if a thread cache freelist is too large and has been inactive for sometime, it will return portions of it back to the central cache.

tcmalloc divides the central cache into size classes, and each size class has a corresponding freelist. Think of the central cache as a chained hash table where the hash function is the size of the objected rounded up to the nearest size class. This structure greatly helps reduce fragmentation: if I were to allocate a 7-byte object, for example, tcmalloc would actually allocate 8 bytes for it and place it in the 8-byte size class. If I free the 7-byte object and then later want to allocate an 8-byte object, tcmalloc can just give me the freed 7-byte (actually 8-byte) object and thereby avoid an awkward 7-byte hole in memory.

The question is, why are we having so much trouble with central freelist fragmentation? Deep seems to be the optimal use case for tcmalloc: almost everything we allocate is a small object, and on top of that we allocate lots of small objects of the same size class (keys and values that are of the same size as defined the schema). Moreover, the example above showed roughly 19 GB of freelist – you would imagine that if we were to allocate an object, it would at least match the size of one of the hundreds of billions of objects in the huge freelist, so why does the freelist keep on growing?

The Answer

In order to understand why we’re seeing a slowly growing central cache freelist, it’s important to understand a bit more about where the fragmentation is occurring. tcmalloc object allocation uses spans, which are sets of contiguous pages that are carved up into objects of a certain size class. In the case of small objects, the span length is 1 page, so feel free to imagine that the spans are pages when thinking about fragmentation. We modified tcmalloc to provide extra debug information: how many spans were created, how many spans were nonempty (meaning they contain free objects that can be returned if the user wants to allocate something in that span’s size class), and how many spans were empty (meaning they are completely allocated and contain no free objects). Here’s an example of what the size class breakdown looks like for a standard test run with the schema described at the top (this is a different test run than the one I’ve graphed above):

------------------------------------------------
Size class breakdown
------------------------------------------------
class   1 [        8 bytes ] : 16960049 objs; 129.4 MiB; 129.4 cum 
MiB 16959268 free;    18494 spans;    18343 nonempty spans;      151 
empty spans;    18494 pages
class   2 [       16 bytes ] : 108521605 objs; 1655.9 MiB; 1785.3 
cum MiB 108510096 free;   490439 spans;   390614 nonempty spans;    
99825 empty spans;   490439 pages
class   3 [       32 bytes ] : 107720688 objs; 3287.4 MiB; 5072.7 
cum MiB 107713824 free;   792443 spans;   637021 nonempty spans;   
155422 empty spans;   792443 pages
class   4 [       48 bytes ] :  8376842 objs; 383.5 MiB; 5456.1 cum
 MiB  8375100 free;    49975 spans;    49937 nonempty spans;       38 
empty spans;    49975 pages
class   5 [       64 bytes ] : 54495148 objs; 3326.1 MiB; 8782.3 
cum MiB 54473328 free;  1257781 spans;   792331 nonempty spans;   465450
 empty spans;  1257781 pages
class   6 [       80 bytes ] :      681 objs;   0.1 MiB; 8782.3 cum
 MiB      228 free;       19 spans;        7 nonempty spans;       12 
empty spans;       19 pages
class   7 [       96 bytes ] :      350 objs;   0.0 MiB; 8782.3 cum
 MiB      106 free;       15 spans;        5 nonempty spans;       10 
empty spans;       15 pages
class   8 [      112 bytes ] :      206 objs;   0.0 MiB; 8782.4 cum
 MiB      135 free;        3 spans;        3 nonempty spans;        0 
empty spans;        3 pages
class   9 [      128 bytes ] :      457 objs;   0.1 MiB; 8782.4 cum
 MiB       94 free;   256018 spans;       10 nonempty spans;   256008 
empty spans;   256018 pages
class  10 [      144 bytes ] :      400 objs;   0.1 MiB; 8782.5 cum
 MiB      107 free;       10 spans;        8 nonempty spans;        2 
empty spans;       10 pages
class  11 [      160 bytes ] :      147 objs;   0.0 MiB; 8782.5 cum
 MiB       89 free;       12 spans;        7 nonempty spans;        5 
empty spans;       12 pages
class  12 [      176 bytes ] :      363 objs;   0.1 MiB; 8782.6 cum
 MiB       98 free;        8 spans;        5 nonempty spans;        3 
empty spans;        8 pages
class  13 [      192 bytes ] :      487 objs;   0.1 MiB; 8782.6 cum
 MiB       18 free;     1738 spans;        6 nonempty spans;     1732 
empty spans;     1738 pages
class  14 [      208 bytes ] :      234 objs;   0.0 MiB; 8782.7 cum
 MiB       75 free;        7 spans;        7 nonempty spans;        0 
empty spans;        7 pages
class  15 [      224 bytes ] :       24 objs;   0.0 MiB; 8782.7 cum
 MiB       21 free;        3 spans;        1 nonempty spans;        2 
empty spans;        3 pages
class  16 [      240 bytes ] :  3026471 objs; 692.7 MiB; 9475.4 cum
 MiB  3018141 free;   819323 spans;   206730 nonempty spans;   612593 
empty spans;   819323 pages
class  17 [      256 bytes ] :  2184075 objs; 533.2 MiB; 10008.6 
cum MiB  2150899 free;  2630804 spans;   328800 nonempty spans;  2302004
 empty spans;  2630804 pages
class  18 [      288 bytes ] :  8533773 objs; 2343.9 MiB; 12352.5 
cum MiB  8528103 free;   437696 spans;   396063 nonempty spans;    41633
 empty spans;   437696 pages
class  19 [      320 bytes ] :    10880 objs;   3.3 MiB; 12355.8 
cum MiB     9945 free;   168084 spans;      426 nonempty spans;   167658
 empty spans;   168084 pages
class  20 [      352 bytes ] :  3691551 objs; 1239.2 MiB; 13595.0 
cum MiB  3690602 free;   247431 spans;   225891 nonempty spans;    21540
 empty spans;   247431 pages
class  21 [      384 bytes ] :   303763 objs; 111.2 MiB; 13706.3 
cum MiB   303185 free;    22958 spans;    19184 nonempty spans;     3774
 empty spans;    22958 pages
class  22 [      416 bytes ] :       46 objs;   0.0 MiB; 13706.3 
cum MiB       18 free;        3 spans;        2 nonempty spans;        1
 empty spans;        3 pages
class  23 [      448 bytes ] :      135 objs;   0.1 MiB; 13706.4 
cum MiB       75 free;       16 spans;        8 nonempty spans;        8
 empty spans;       16 pages
class  24 [      480 bytes ] :       58 objs;   0.0 MiB; 13706.4 
cum MiB       23 free;        4 spans;        3 nonempty spans;        1
 empty spans;        4 pages
class  25 [      512 bytes ] :      104 objs;   0.1 MiB; 13706.4 
cum MiB       52 free;       61 spans;        7 nonempty spans;       54
 empty spans;       61 pages
class  26 [      576 bytes ] :      576 objs;   0.3 MiB; 13706.8 
cum MiB       64 free;      116 spans;       15 nonempty spans;      101
 empty spans;      116 pages
class  27 [      640 bytes ] :   728279 objs; 444.5 MiB; 14151.3 
cum MiB   724342 free;   337907 spans;   146561 nonempty spans;   191346
 empty spans;   337907 pages
class  28 [      704 bytes ] :       28 objs;   0.0 MiB; 14151.3 
cum MiB       11 free;        3 spans;        2 nonempty spans;        1
 empty spans;        3 pages
class  29 [      768 bytes ] :       78 objs;   0.1 MiB; 14151.3 
cum MiB       16 free;        8 spans;        4 nonempty spans;        4
 empty spans;        8 pages
...

Certain size classes have tremendous fragmentation. Several of them correspond with the sizes of keys and values defined in our schema. A certain amount of fragmentation is to be expected, because a span can be deleted if and only if a span only has no allocated objects. If our allocation patterns leave behind a dangling object in every span, for example, we’ll be heavily fragmented. But on the flip side, we’ll also have a ton of free objects, so part of the mystery is why our small object allocations weren’t hitting the existing freelist but contributing to a creeping growth of the freelist.

If we break the freelist growth down by size class, the answer appears. Let’s look at the 256 and 288-byte size classes (just to pick two to compare – pick any set of size classes and you’ll see a similar pattern).

For one thing, even though the 256-byte size class is not using a lot of memory now, at one point its freelist grew to ~6 GB! For another, it shows that individual size classes freelists are growing and shrinking as expected. The freelist for the 256-byte size class grows as Deep uses less of the object, and shrinks as Deep uses more. The % free graph is especially telling, because it shows that about 90% of the 288-byte size class is free whereas less than 10% of the 256-byte size class is free.

What this means is that the creeping fragmentation we see comes from the aggregate fragmentation of different size classes. The comparison above shows that at one point we allocated a bunch of 256-byte size class objects, then freed them up to make room for 288-byte size class objects (we need to ‘make room’ because our total allocated memory is bound by the settings given by the client), then switched back to allocating 256-byte size class objects. Because we honor the client’s memory desires, we switch between which size class objects are being used, and so we compound fragmentation over the course of multiple size classes. Even when we’re leaning heavily on one size class and reducing its free list through allocations, we’re freeing up memory in several other classes that stay around forever because they’re fragmented. When we hit different size classes heavily, they fragment and stay that way afterward, which accounts for the slowly creeping freelist (upper bounded by the number of size classes we can possibly hit, times the maximum amount of memory we are allowed to allocate, which results in a pretty large number).

The Solution

Let me stress that this is a problem only in testing, with a custom schema we’ve created to trigger this condition. Our field deployments have never seen this problem. That said, this is a good opportunity to fix a problem before it hits our customers. We’ve just recently finished the investigation phase of figuring out why we have a creeping freelist, but our gears are turning about how solve this problem. Currently, we think that a periodic reshuffling of memory, to move objects to tightly pack spans and free the rest up for release back to the OS, is the best option. We’re looking to POC this not in tcmalloc, but jemalloc, where user control over arenas give us the ability to easily make this move (basically we’ll be copying a bunch of objects over to a new arena, all on the same page, and freeing the old ones from their spans so those pages can be released).