There are essentially two ways technology, including computing, makes progress: by boosting performance or by improving efficiency. Any and all such optimizations are welcomed by the community.
Speaking of optimization, Intel's kernel test robot recently spotted a massive performance improvement in the Linux kernel achieved by a single-line code commit. A whopping 3,889%, or nearly 40 times higher, throughput was seen in the "will-it-scale" scalability suite's memory allocation test case (malloc1). The test was run on a 4-socket Intel Xeon Platinum 8380H (Cooper Lake) test bed with 224 threads in total (each 8380H chip is a 28-core, 56-thread SKU).
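For context, will-it-scale measures how per-process (or per-thread) operation throughput scales as more workers are added. The snippet below is only a rough, self-contained sketch of what such an allocation test loop looks like; it is not the actual malloc1 source from the will-it-scale repository, and the buffer size and iteration count are placeholders.

```c
/* Rough sketch of a will-it-scale-style allocation loop (illustrative
 * only, not the real malloc1 test case). The real harness runs one
 * worker per CPU and periodically samples each worker's counter. */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define ALLOC_SIZE (16UL * 1024 * 1024)   /* placeholder buffer size */
#define ITERATIONS 1000                   /* placeholder; real workers loop until stopped */

int main(void)
{
    unsigned long long iterations = 0;

    while (iterations < ITERATIONS) {
        char *buf = malloc(ALLOC_SIZE);

        assert(buf);
        memset(buf, 0, ALLOC_SIZE);   /* touch the memory so pages are faulted in */
        free(buf);
        iterations++;
    }

    printf("completed %llu iterations\n", iterations);
    return 0;
}
```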
kernel test robot noticed a 3888.9% improvement of will-it-scale.per_process_ops on:
commit: d4148aeab412432bf928f311eca8a2ba52bb05df ("mm, mmap: limit THP alignment of anonymous mappings to PMD-aligned sizes")
In addition, the bot also saw a "significant impact" on a Sapphire Rapids Xeon Platinum 8480+ system during stress-ng runs. If you are not familiar, stress-ng is a stress-testing tool that reports throughput in "Bogo ops", or bogus operations, per second.
For those wondering, the commit in question touches the kernel's memory management (mm) and memory mapping (mmap) code, specifically how anonymous mappings are aligned for Transparent Hugepages (THP) at Page Middle Directory (PMD) boundaries.
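In practical terms, the patch narrows the conditions under which the kernel picks a PMD-aligned address for an anonymous mmap(). The snippet below is only a simplified illustration of that policy, distilled from the commit message rather than copied from mm/mmap.c; the function names here are made up for clarity.

```c
/* Simplified illustration of the alignment policy, based on the commit
 * message; not actual kernel source. On x86-64 with 4 KiB base pages,
 * PMD_SIZE is 2 MiB. */
#include <stdbool.h>

#define PMD_SIZE (2UL * 1024 * 1024)

/* Old rule (efa7df3e3bb5): any anonymous mapping of at least PMD_SIZE
 * was placed at a PMD-aligned address so THP could back it. */
static bool thp_align_old(unsigned long len)
{
    return len >= PMD_SIZE;
}

/* New rule (d4148aeab412): only mappings whose length is an exact
 * multiple of PMD_SIZE get the PMD-aligned placement; odd-sized
 * mappings keep the default placement and can merge with neighbours. */
static bool thp_align_new(unsigned long len)
{
    return (len % PMD_SIZE) == 0;
}
```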
The upshot is that, going forward, only anonymous mappings whose size is a multiple of the PMD size will be THP-aligned, which fixes the earlier performance regressions caused by Translation Lookaside Buffer (TLB) and cache aliasing and conflicts:
Since commit efa7df3e3bb5 ("mm: align larger anonymous mappings on THP boundaries") a mmap() of anonymous memory without a specific address hint and of at least PMD_SIZE will be aligned to PMD so that it can benefit from a THP backing page.
However this change has been shown to regress some workloads significantly. [1] reports regressions in various spec benchmarks, with up to 600% slowdown of the cactusBSSN benchmark on some platforms. The benchmark seems to create many mappings of 4632kB, which would have merged to a large THP-backed area before commit efa7df3e3bb5 and now they are fragmented to multiple areas each aligned to PMD boundary with gaps between. The regression then seems to be caused mainly due to the benchmark's memory access pattern suffering from TLB or cache aliasing due to the aligned boundaries of the individual areas.
Another known regression bisected to commit efa7df3e3bb5 is darktable [2] [3] and early testing suggests this patch fixes the regression there as well.
To fix the regression but still try to benefit from THP-friendly anonymous mapping alignment, add a condition that the size of the mapping must be a multiple of PMD size instead of at least PMD size. In case of many odd-sized mapping like the cactusBSSN creates, those will stop being aligned and with gaps between, and instead naturally merge again.
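To put numbers on it, a PMD covers 2 MiB (2048 kB) on x86-64 with 4 KiB base pages, so the 4632 kB mappings cactusBSSN creates are not a PMD multiple and, under the new rule, are no longer forced onto PMD boundaries with gaps between them. The small userspace sketch below, an illustration put together for this article rather than part of the patch or the benchmark, maps two anonymous regions, one sized at an exact PMD multiple and one at 4632 kB, and reports whether the kernel returned a 2 MiB-aligned address; the exact output depends on the running kernel version and THP configuration.

```c
/* Quick userspace probe: map two anonymous regions and check whether
 * the kernel handed back a 2 MiB (PMD) aligned address. Illustrative
 * only; behaviour varies with kernel version and THP settings. */
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>

#define PMD_SIZE (2UL * 1024 * 1024)

static void probe(const char *label, size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return;
    }
    printf("%s: %p (%s2 MiB aligned)\n", label, p,
           ((uintptr_t)p % PMD_SIZE) == 0 ? "" : "not ");
    munmap(p, len);
}

int main(void)
{
    probe("4096 kB (exact PMD multiple)", 4096UL * 1024); /* still THP-aligned */
    probe("4632 kB (odd size)", 4632UL * 1024);           /* no longer forced to align */
    return 0;
}
```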
Please note that the immense improvement found here is in a synthetic test case, so real-world workloads are unlikely to see such enormous gains.