One of Deep’s strengths is fast ingest of data.
While the main driver behind that performance is our append-only data representation on the filesystem, we still have to load and index a lot of data in memory before committing it to disk. That means we need a fast memory allocator.
Enter tcmalloc. We picked tcmalloc for its speed and its good handling of small objects, which make up a large part of a database. This works well with a standard schema of an integer primary key and some small secondary keys. However, in some test cases we encountered tremendous tcmalloc fragmentation. When we ran our big data bench tool on certain schemas, we saw a huge, permanent increase in tcmalloc's freelist, which smacks of fragmentation.
Here’s a schema which caused high fragmentation (note that this is a very uncommon schema that we cooked up to specifically reproduce memory fragmentation):
CREATE TABLE IF NOT EXISTS `table1` (
  `id` char(80) NOT NULL,
  `column1` int(10) unsigned NOT NULL,
  `column2` varchar(40) NOT NULL,
  `column3` date NOT NULL,
  `column4` datetime NOT NULL,
  `column5` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `column6` timestamp NOT NULL DEFAULT '0000-00-00 00:00:00',
  `column7` char(6) NOT NULL,
  `column8` tinytext NOT NULL,
  `column9` varchar(40) NOT NULL,
  `column10` tinyblob NOT NULL,
  `column11` tinyint(1) NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `index1` (`column1`,`column9`,`column5`),
  KEY `column4` (`column4`),
  KEY `index2` (`column9`,`column5`)
) DEFAULT CHARACTER SET=utf8;
Its most noticeable attribute is the huge primary key. It’s also peppered with columns of various sizes, some tiny and some huge. We ran several hours of data ingest into this schema, with randomized data on a 40-core machine with 40 threads, culminating in roughly 900 million rows by the end of the run. What happened to our memory profile when we were ingesting this bad boy?
Our total consumed (allocated) memory hovers roughly around the allowed limit set by the client, but from an outside point of view our memory consumption creeps up because our freelist is growing. Ouch. Let’s take a closer look at the freelist.
The freelist grows and grows, in a generally upward trend. If the test had run longer, it would have kept growing and we would see a similar graph with an even larger freelist. At the same time, we kept Deep's memory usage constant, so the growing freelist is not coming from generally increased memory usage. And the low amount of unmapped memory, compared to the size of the freelist, shows that tcmalloc is not coalescing its spans and releasing the memory back to the OS.
That can only mean that our memory is fragmented. Where does tcmalloc think the fragmentation is occurring?
------------------------------------------------
MALLOC:    53816178376 (51323.1 MiB) Bytes in use by application
MALLOC: +            0 (    0.0 MiB) Bytes in page heap freelist
MALLOC: +  20005946928 (19079.2 MiB) Bytes in central cache freelist
MALLOC: +            0 (    0.0 MiB) Bytes in transfer cache freelist
MALLOC: +     23438088 (   22.4 MiB) Bytes in thread cache freelists
MALLOC: +    521449624 (  497.3 MiB) Bytes in malloc metadata
MALLOC:   ------------
MALLOC: =  74367013016 (70921.9 MiB) Actual memory used (physical + swap)
MALLOC: +    434388992 (  414.3 MiB) Bytes released to OS (aka unmapped)
MALLOC:   ------------
MALLOC: =  74801402008 (71336.2 MiB) Virtual address space used
MALLOC:
MALLOC:        8931934 Spans in use
MALLOC:             56 Thread heaps in use
MALLOC:           8192 Tcmalloc page size
------------------------------------------------
The fragmentation is happening in the central freelist! This isn’t supposed to happen.
Almost all Deep objects fall under what tcmalloc considers 'small objects.' For small objects, tcmalloc operates two levels of caches for freed objects: a thread-local freelist (the 'thread cache,' in tcmalloc parlance) that objects are allocated from and freed to, and a central cache that thread cache freelists can draw from or give back to. For example, when a thread tries to allocate an object and its thread cache freelist is empty, the thread cache simply asks the central cache for more freed objects. The central cache scrounges up a freelist, whether by requisitioning existing freelists or by allocating more memory. Similarly, if a thread cache freelist grows too large and has been inactive for some time, the thread cache returns portions of it to the central cache.
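The two-level flow above can be sketched as a toy model. To be clear, the batch size and return threshold below are assumptions chosen for illustration, not tcmalloc's actual tuning:

```python
# Toy two-level freelist: per-thread caches refill from a shared central
# cache when empty, and return objects when they grow too large. The
# batch size and threshold are assumptions for illustration.
class CentralCache:
    def __init__(self):
        self.free = []

    def fetch(self, n):
        """Hand n freed objects to a thread cache, minting fresh ones if short."""
        batch, self.free = self.free[:n], self.free[n:]
        while len(batch) < n:
            batch.append(object())  # stands in for carving up a new span
        return batch

    def give_back(self, objs):
        self.free.extend(objs)

class ThreadCache:
    BATCH = 4   # objects moved per central-cache transfer (assumption)
    MAX = 8     # freelist length that triggers a return (assumption)

    def __init__(self, central):
        self.central = central
        self.free = []

    def alloc(self):
        if not self.free:  # empty: refill from the central cache
            self.free = self.central.fetch(self.BATCH)
        return self.free.pop()

    def release(self, obj):
        self.free.append(obj)
        if len(self.free) > self.MAX:  # too large: return a batch
            self.central.give_back(self.free[:self.BATCH])
            self.free = self.free[self.BATCH:]
```

The point of the model is the direction of flow: allocations pull batches from the central cache into a thread's freelist, and frees eventually push batches back, so the central cache ends up holding whatever the threads no longer need.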
tcmalloc divides the central cache into size classes, and each size class has a corresponding freelist. Think of the central cache as a chained hash table where the hash function is the size of the object rounded up to the nearest size class. This structure greatly helps reduce fragmentation: if I were to allocate a 7-byte object, for example, tcmalloc would actually allocate 8 bytes for it and place it in the 8-byte size class. If I free the 7-byte object and then later want to allocate an 8-byte object, tcmalloc can just give me the freed 7-byte (actually 8-byte) object and thereby avoid an awkward 7-byte hole in memory.
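The rounding is simple enough to model directly. The real tcmalloc class table is more elaborate and generated at startup; the handful of classes below are assumptions for illustration:

```python
# Illustrative model of tcmalloc-style size-class rounding. The real
# class table is larger and computed at startup; these few classes
# are assumptions for illustration.
SIZE_CLASSES = [8, 16, 32, 48, 64, 80, 96, 112, 128]

def size_class(request_bytes):
    """Round a small allocation request up to the nearest size class."""
    for cls in SIZE_CLASSES:
        if request_bytes <= cls:
            return cls
    raise ValueError("not a small object in this toy model")

# A 7-byte request is served from the 8-byte class, so a freed 7-byte
# object can satisfy any later request of up to 8 bytes.
print(size_class(7))    # -> 8
print(size_class(100))  # -> 112
```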
The question is, why are we having so much trouble with central freelist fragmentation? Deep seems to be the optimal use case for tcmalloc: almost everything we allocate is a small object, and on top of that we allocate lots of small objects of the same size class (keys and values whose sizes are fixed by the schema). Moreover, the example above showed roughly 19 GB of freelist – you would imagine that any object we allocate would match the size of one of the hundreds of millions of objects in that huge freelist, so why does the freelist keep growing?
In order to understand why we’re seeing a slowly growing central cache freelist, it’s important to understand a bit more about where the fragmentation is occurring. tcmalloc object allocation uses spans, which are sets of contiguous pages that are carved up into objects of a certain size class. In the case of small objects, the span length is 1 page, so feel free to imagine that the spans are pages when thinking about fragmentation. We modified tcmalloc to provide extra debug information: how many spans were created, how many spans were nonempty (meaning they contain free objects that can be returned if the user wants to allocate something in that span’s size class), and how many spans were empty (meaning they are completely allocated and contain no free objects). Here’s an example of what the size class breakdown looks like for a standard test run with the schema described at the top (this is a different test run than the one I’ve graphed above):
------------------------------------------------
Size class breakdown
------------------------------------------------
class  1 [   8 bytes ] :  16960049 objs;  129.4 MiB;   129.4 cum MiB;  16960049 objs;  16959268 free;   18494 spans;  18343 nonempty spans;  151 empty spans;   18494 pages
class  2 [  16 bytes ] : 108521605 objs; 1655.9 MiB;  1785.3 cum MiB; 108510096 free;  490439 spans; 390614 nonempty spans;   99825 empty spans;  490439 pages
class  3 [  32 bytes ] : 107720688 objs; 3287.4 MiB;  5072.7 cum MiB; 107713824 free;  792443 spans; 637021 nonempty spans;  155422 empty spans;  792443 pages
class  4 [  48 bytes ] :   8376842 objs;  383.5 MiB;  5456.1 cum MiB;   8375100 free;   49975 spans;  49937 nonempty spans;      38 empty spans;   49975 pages
class  5 [  64 bytes ] :  54495148 objs; 3326.1 MiB;  8782.3 cum MiB;  54473328 free; 1257781 spans; 792331 nonempty spans;  465450 empty spans; 1257781 pages
class  6 [  80 bytes ] :       681 objs;    0.1 MiB;  8782.3 cum MiB;       228 free;      19 spans;      7 nonempty spans;      12 empty spans;      19 pages
class  7 [  96 bytes ] :       350 objs;    0.0 MiB;  8782.3 cum MiB;       106 free;      15 spans;      5 nonempty spans;      10 empty spans;      15 pages
class  8 [ 112 bytes ] :       206 objs;    0.0 MiB;  8782.4 cum MiB;       135 free;       3 spans;      3 nonempty spans;       0 empty spans;       3 pages
class  9 [ 128 bytes ] :       457 objs;    0.1 MiB;  8782.4 cum MiB;        94 free;  256018 spans;     10 nonempty spans;  256008 empty spans;  256018 pages
class 10 [ 144 bytes ] :       400 objs;    0.1 MiB;  8782.5 cum MiB;       107 free;      10 spans;      8 nonempty spans;       2 empty spans;      10 pages
class 11 [ 160 bytes ] :       147 objs;    0.0 MiB;  8782.5 cum MiB;        89 free;      12 spans;      7 nonempty spans;       5 empty spans;      12 pages
class 12 [ 176 bytes ] :       363 objs;    0.1 MiB;  8782.6 cum MiB;        98 free;       8 spans;      5 nonempty spans;       3 empty spans;       8 pages
class 13 [ 192 bytes ] :       487 objs;    0.1 MiB;  8782.6 cum MiB;        18 free;    1738 spans;      6 nonempty spans;    1732 empty spans;    1738 pages
class 14 [ 208 bytes ] :       234 objs;    0.0 MiB;  8782.7 cum MiB;        75 free;       7 spans;      7 nonempty spans;       0 empty spans;       7 pages
class 15 [ 224 bytes ] :        24 objs;    0.0 MiB;  8782.7 cum MiB;        21 free;       3 spans;      1 nonempty spans;       2 empty spans;       3 pages
class 16 [ 240 bytes ] :   3026471 objs;  692.7 MiB;  9475.4 cum MiB;   3018141 free;  819323 spans; 206730 nonempty spans;  612593 empty spans;  819323 pages
class 17 [ 256 bytes ] :   2184075 objs;  533.2 MiB; 10008.6 cum MiB;   2150899 free; 2630804 spans; 328800 nonempty spans; 2302004 empty spans; 2630804 pages
class 18 [ 288 bytes ] :   8533773 objs; 2343.9 MiB; 12352.5 cum MiB;   8528103 free;  437696 spans; 396063 nonempty spans;   41633 empty spans;  437696 pages
class 19 [ 320 bytes ] :     10880 objs;    3.3 MiB; 12355.8 cum MiB;      9945 free;  168084 spans;    426 nonempty spans;  167658 empty spans;  168084 pages
class 20 [ 352 bytes ] :   3691551 objs; 1239.2 MiB; 13595.0 cum MiB;   3690602 free;  247431 spans; 225891 nonempty spans;   21540 empty spans;  247431 pages
class 21 [ 384 bytes ] :    303763 objs;  111.2 MiB; 13706.3 cum MiB;    303185 free;   22958 spans;  19184 nonempty spans;    3774 empty spans;   22958 pages
class 22 [ 416 bytes ] :        46 objs;    0.0 MiB; 13706.3 cum MiB;        18 free;       3 spans;      2 nonempty spans;       1 empty spans;       3 pages
class 23 [ 448 bytes ] :       135 objs;    0.1 MiB; 13706.4 cum MiB;        75 free;      16 spans;      8 nonempty spans;       8 empty spans;      16 pages
class 24 [ 480 bytes ] :        58 objs;    0.0 MiB; 13706.4 cum MiB;        23 free;       4 spans;      3 nonempty spans;       1 empty spans;       4 pages
class 25 [ 512 bytes ] :       104 objs;    0.1 MiB; 13706.4 cum MiB;        52 free;      61 spans;      7 nonempty spans;      54 empty spans;      61 pages
class 26 [ 576 bytes ] :       576 objs;    0.3 MiB; 13706.8 cum MiB;        64 free;     116 spans;     15 nonempty spans;     101 empty spans;     116 pages
class 27 [ 640 bytes ] :    728279 objs;  444.5 MiB; 14151.3 cum MiB;    724342 free;  337907 spans; 146561 nonempty spans;  191346 empty spans;  337907 pages
class 28 [ 704 bytes ] :        28 objs;    0.0 MiB; 14151.3 cum MiB;        11 free;       3 spans;      2 nonempty spans;       1 empty spans;       3 pages
class 29 [ 768 bytes ] :        78 objs;    0.1 MiB; 14151.3 cum MiB;        16 free;       8 spans;      4 nonempty spans;       4 empty spans;       8 pages
Certain size classes have tremendous fragmentation. Several of them correspond to the sizes of keys and values defined in our schema. A certain amount of fragmentation is to be expected, because a span can be returned to the page heap only when it contains no allocated objects at all. If our allocation patterns leave behind a single dangling object in every span, for example, we'll be heavily fragmented. But on the flip side, we'll also have a ton of free objects, so part of the mystery is why our small object allocations weren't hitting the existing freelist but were instead contributing to its creeping growth.
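The dangling-object effect is easy to quantify with a toy model. The page and object sizes below are assumptions picked for round numbers (an 8 KiB page of 256-byte objects):

```python
# Toy model of span-level fragmentation. A span here is one 8 KiB page
# carved into 256-byte objects (sizes are assumptions for illustration).
# tcmalloc can only return a span's pages to the page heap once every
# object in the span has been freed.
PAGE = 8192
OBJ = 256
PER_SPAN = PAGE // OBJ  # 32 objects per span

def span_stats(num_spans, live_per_span):
    """Free objects on the freelist vs. bytes still pinned in spans."""
    free_objs = num_spans * (PER_SPAN - live_per_span)
    pinned_bytes = num_spans * PAGE if live_per_span > 0 else 0
    return free_objs, pinned_bytes

# One dangling object per span: 1000 spans hold 31,000 free objects,
# yet all ~8 MiB stays pinned because no span is completely free.
free_objs, pinned = span_stats(1000, 1)
print(free_objs, pinned)  # -> 31000 8192000
```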
If we break the freelist growth down by size class, the answer appears. Let’s look at the 256 and 288-byte size classes (just to pick two to compare – pick any set of size classes and you’ll see a similar pattern).
For one thing, even though the 256-byte size class is not using a lot of memory now, at one point its freelist grew to ~6 GB! For another, it shows that individual size classes' freelists are growing and shrinking as expected: the freelist for the 256-byte size class grows as Deep allocates fewer objects of that size, and shrinks as Deep allocates more. The % free graph is especially telling, because it shows that about 90% of the 288-byte size class is free whereas less than 10% of the 256-byte size class is free.
What this means is that the creeping fragmentation we see comes from the aggregate fragmentation of different size classes. The comparison above shows that at one point we allocated a bunch of 256-byte size class objects, then freed them up to make room for 288-byte size class objects (we need to 'make room' because our total allocated memory is bounded by the settings given by the client), then switched back to allocating 256-byte size class objects. Because we honor the client's memory limit, we switch between which size classes are in heavy use, and so we compound fragmentation across multiple size classes. Even while we're leaning heavily on one size class and shrinking its freelist through allocations, we're freeing up memory in several other classes, and that memory sticks around forever because those classes are fragmented. When we hit different size classes heavily, they fragment and stay that way afterward, which accounts for the slowly creeping freelist (upper bounded by the number of size classes we can possibly hit times the maximum amount of memory we are allowed to allocate – a pretty large number).
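This compounding effect can be sketched with another toy model. The 90% residue fraction is an assumption standing in for "most freed objects are stranded in fragmented spans"; the real fraction varies per class:

```python
# Toy model of aggregate fragmentation across size classes under a fixed
# memory budget. Each phase leans on one size class, then frees most of
# it to make room for the next; fragmented spans strand the freed
# objects on that class's freelist. The residue fraction is an assumption.
RESIDUE = 0.9  # fraction of a class's peak left stranded on its freelist

def stranded_freelist_bytes(class_sizes, budget_objs):
    stranded = {}
    for cls in class_sizes:
        # allocate ~budget_objs objects of this class, then free them;
        # dangling objects keep the spans alive, so the bytes stay put
        stranded[cls] = int(budget_objs * cls * RESIDUE)
    return sum(stranded.values())

# Four phases over four size classes: each phase respects the memory
# cap, but the total freelist grows with every class we touch.
total = stranded_freelist_bytes([256, 288, 320, 352], budget_objs=1000)
print(total)  # -> 1094400
```

No single phase exceeds the budget, yet the aggregate freelist keeps ratcheting up – which is exactly the slow creep in the graphs.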
Let me stress that this is a problem only in testing, with a custom schema we created to trigger this condition. Our field deployments have never seen this problem. That said, this is a good opportunity to fix a problem before it hits our customers. We've just recently finished the investigation phase of figuring out why we have a creeping freelist, but our gears are turning on how to solve it. Currently, we think that a periodic reshuffling of memory – moving objects to tightly pack spans and freeing the rest for release back to the OS – is the best option. We're looking to POC this not in tcmalloc but in jemalloc, where user control over arenas gives us the ability to easily make this move (basically, we'll copy a bunch of objects over to a new arena, all on the same pages, and free the old ones from their spans so those pages can be released).
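The reshuffling idea reduces to a compaction pass. Here's a hypothetical sketch of the bookkeeping (a toy model, not jemalloc's actual API): copy the live objects out of fragmented spans into fresh, tightly packed spans, then free the old spans so their pages can go back to the OS.

```python
# Sketch of the compaction idea (hypothetical model, not jemalloc's
# actual API): gather live objects from fragmented spans and repack
# them densely, leaving the old pages completely free.
PER_SPAN = 32  # objects per span in this toy model (assumption)

def compact(spans):
    """spans: list of spans, each a list of live objects; returns packed spans."""
    live = [obj for span in spans for obj in span]
    return [live[i:i + PER_SPAN] for i in range(0, len(live), PER_SPAN)]

# 100 spans, each pinned by a single live object...
fragmented = [["obj%d" % i] for i in range(100)]
packed = compact(fragmented)
# ...compact into ceil(100/32) = 4 spans; the other 96 pages become
# completely free and can be unmapped.
print(len(packed))  # -> 4
```

The catch, of course, is that "copying objects" in a real allocator means every pointer to the moved object must be updated, which is why we'd drive this from Deep's own data structures rather than from inside the allocator.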