In a previous post, I’ve written about how to check and enable transparent hugepages in Linux globally.

Although that post is useful if you actually have a use case for hugepages, I’ve seen multiple people get fooled by the prospect that hugepages will magically increase performance. However, hugepaging is a complex topic and, if used in the wrong way, might easily decrease overall performance.

This post attempts to explain the advantages, disadvantages and caveats of using hugepages – not at all, globally, or selectively for certain memory regions. As a technology-heavy or pedantically exact post is likely to be inaccessible to the users who often get fooled by hugepages, I’ll sacrifice accuracy for simplicity. Just bear in mind that most of the topics are really complex and therefore grossly simplified.

Note that we’re talking about 64-bit x86 systems running Linux here and that I just assume the system implements transparent hugepages (i.e. it is not a disadvantage that hugepages are not swappable) as this is the case with almost any recent Linux environment.

More technical descriptions will be provided in the links below.

Virtual memory

If you’re a C++ programmer, you know that objects in memory have certain addresses (i.e. the value of a pointer).

However, these addresses do not necessarily represent physical addresses (i.e. an address in RAM). They represent addresses in virtual memory. Your CPU has an MMU (memory management unit) – hardware that assists the kernel in mapping virtual memory to a physical location.

This approach has numerous advantages, but mainly it is useful for

• Performance (for various different reasons)
• Isolating programs, i.e. no program can just read the memory of another program

What are pages?

The virtual memory space is divided into pages.

Each individual page points to some physical memory – it might point to a section of physical RAM, but it might also point to an address assigned to a physical device such as a graphics card.

Most pages you’re dealing with point either to the RAM or are swapped out, i.e. stored on a HDD or an SSD.

The kernel manages the physical location for each page. If a page which has been swapped out is accessed, the kernel stops the thread trying to access the memory, reads the page from the HDD/SSD into RAM and subsequently continues running the thread.

This process is transparent to the thread, i.e. it doesn’t have to read explicitly from the HDD/SSD.

Normal pages are 4096 bytes long. Hugepages have a size of 2 Megabytes.

The Translation Lookaside Buffer (TLB)

When a program accesses some memory page, the CPU needs to know which physical page to read the data from (i.e. a virtual-to-physical address map).

The kernel contains a data structure (the page table) that contains all information about all the pages in use. Using this data structure, we could map the virtual address to a physical address.

However, the page table is pretty complex and slow to traverse, and we simply can’t walk the entire data structure every time a process accesses memory.

Luckily, our CPU contains hardware – the TLB – that caches the virtual-to-physical address mapping. This means that although you have to parse the page table the first time you access the page, all subsequent accesses to the page can be handled by the TLB, which is really fast!

But as it is implemented in hardware (which makes it fast in the first place), it also only has a limited capacity. So if you access a larger number of pages, the TLB can’t store the mapping for all of them. This will make your program much slower.

Huge pages to the rescue

So what can we do (assuming the program still needs the same amount of memory) to avoid the TLB being full?

This is where hugepages come in. Instead of 4096 bytes “consuming” one TLB entry, one TLB entry can now point to a whopping 2 Megabytes.

So if we assume the TLB has 512 entries, without hugepages we can map
$$4096\ \text{bytes} \cdot 512 = 2\ \text{MB}$$
but with hugepages we can map
$$2\ \text{MB} \cdot 512 = 1\ \text{GB}$$

So hugepages are great – they can lead to greatly increased performance almost without effort. But they don’t come without caveats.

Swapping hugepages

Your kernel automatically tracks how often each memory page is used. If there is an insufficient amount of physical memory (i.e. RAM) available, your kernel will move less-important (i.e. less often used) pages to your hard drive to free up some RAM for more important pages.

In principle, the same goes for hugepages. But the kernel can only swap entire pages – not individual bytes.

Let’s assume we have a program like this:

#include <stdio.h>
#include <stdlib.h>

char* mymemory = malloc(2*1024*1024); // We'll assume this is one hugepage!
// Fill mymemory with some data
// Do lots of other things,
// causing the mymemory page to be swapped out
// ...
// Access only the first byte
putchar(mymemory[0]);

In that case, the kernel will need to swap in (i.e. read) the entire 2 Megabytes from the HDD/SSD just for you to read a single byte. With normal pages, only 4096 bytes need to be read from the HDD/SSD.

So if a hugepage is swapped out, reading it back is only faster if you need to access almost the entire hugepage. That means if you randomly access different parts of the memory and just read a couple of kilobytes, you should just use plain old normal pages and not worry about hugepages.

On the other hand, if you need to access a large portion of memory sequentially, hugepages may increase your performance. Still, you need to benchmark this using your program (not some abstract benchmark software!) and see whether it is faster with or without hugepages enabled.

Memory allocation

As a C programmer, you know that you can request arbitrarily small (or almost arbitrarily large) amounts of memory from the heap using malloc().

Let’s assume you’re requesting 30 bytes of memory:

char* mymemory = malloc(30);

To the programmer, it might look like you’re “requesting” 30 bytes of memory from the operating system and get back a pointer to some virtual memory.

But in reality, malloc() is just a C function that internally uses system calls such as brk, sbrk and mmap to request or release memory from the operating system.

However, it’s inefficient to request more and more memory from the OS for every little allocation – more likely than not, a memory segment has been free()d and we can re-use that memory. malloc() implements rather complex algorithms to re-use free()d memory.

But this is all transparent to you – so why should you care? Because it also means that calling free() does not necessarily mean that the memory is returned to the operating system immediately.

There is a type of issue called memory fragmentation. In extreme cases there are segments of the heap where only a few bytes are used while everything in between has been free()d.

In most cases, the kernel will just swap out most of the unused memory, because it is rarely (if ever) accessed.

If you are using hugepages for the entire memory of your program (i.e. not selectively as shown below), this might increase your memory usage: a fragmented region that contains only a few live bytes still occupies an entire 2 Megabyte hugepage, and the kernel can neither swap out nor reclaim the unused parts individually.

Note that memory fragmentation is an incredibly complex issue and even small changes to a program may affect memory fragmentation significantly. In most cases programs don’t cause significant memory fragmentation, but you should keep in mind that, if there are memory fragmentation issues in some area of the heap, hugepages can make the issue worse.

Selective hugepaging

Using the information in this article, you may have identified some parts of your program that could benefit from hugepages, while other parts won’t. Should you enable hugepages or not?

Luckily, you can use madvise() to enable hugepaging only for memory areas where you benefit from it.

First, check that hugepages are enabled in madvise mode (see this blogpost).

Then, use madvise() to tell the kernel that hugepages shall be used for that specific memory region:

#include <stdlib.h>
#include <sys/mman.h>

// Allocate a large amount of memory for which you have some use
size_t size = 256*1024*1024;
char* mymemory = malloc(size);
// Enable hugepages for this region ...
madvise(mymemory, size, MADV_HUGEPAGE);
// ... and optionally also advise the kernel about the access pattern.
// Note that madvise() advice values are not flags and must not be OR'd
// together - each piece of advice needs its own call.
madvise(mymemory, size, MADV_SEQUENTIAL);
Refer to the madvise manpage for more details – there’s a lot to learn about memory management and madvise, and the topic has an incredibly steep learning curve. So if you intend to go down this route, prepare to read, test & benchmark for a few weeks before expecting any positive result.