Page fracturing on Intel CPUs
During one of my studies, I encountered a strange phenomenon: it seemed as if every TLB flush in my virtual machine, even a flush of a single page, caused a full TLB flush. Understanding this issue required some effort, and the findings were eventually summarized in a single sentence in the academic paper I was working on. Since this micro-architectural behavior is not documented, it seemed appropriate to share the analysis in more detail. I break the story into background, investigation and conclusion sections. Feel free to skip directly to the latter, which describes this undocumented behavior.
Background
As we know, all modern CPUs use virtual memory, and translate virtual memory addresses into physical ones before they actually carry out the memory accesses. This translation is performed according to an architectural, memory-resident data structure, which is set by the operating system and maps virtual pages into physical frames. This data structure is a radix tree, also known as a page-table hierarchy, in many architectures, including the x86 architecture, which is the subject of this discussion. Translating a virtual address by "walking" a page-table hierarchy requires multiple memory accesses, and since the CPU performs many memory accesses, it is essential to perform these translations quickly.
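As a minimal illustration (a sketch of the address decoding, not how any real MMU is implemented), here is how a 4-level x86-64 walk splits a virtual address; each level's index selects an entry that must be read from memory before the next level can be consulted:

```c
#include <stdint.h>
#include <stdio.h>

/* Each of the four levels (PML4, PDPT, PD, PT) consumes 9 bits of the
 * virtual address; the low 12 bits are the offset within the 4KB page. */
#define PT_INDEX(va, level) (((va) >> (12 + 9 * (level))) & 0x1ff)

int main(void)
{
	uint64_t va = 0x00007f0123456789ull;

	for (int level = 3; level >= 0; level--)
		printf("level %d index: %u\n", level, (unsigned)PT_INDEX(va, level));
	printf("page offset: 0x%x\n", (unsigned)(va & 0xfff));
	return 0;
}
```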
For this purpose, CPUs hold a cache of the virtual-to-physical mappings, known as the translation lookaside buffer (TLB). Unlike some other common caches, the TLB is not coherent: when the operating system changes the mappings in memory, it must initiate a TLB flush operation to invalidate the cached entries. In the x86 architecture, two types of TLB flushes exist: a selective flush of a single page-table entry (PTE), which maps a single page, and a full flush that invalidates all the entries of the address space[^1].
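In code, the two flavors look roughly as follows; these are privileged instructions, so this sketch assumes ring-0 code (e.g., a kernel or a kvm-unit-tests guest):

```c
#include <stdint.h>

/* Selective flush: invalidate the TLB entries that map 'addr'. */
static inline void invlpg(void *addr)
{
	asm volatile("invlpg (%0)" : : "r"(addr) : "memory");
}

/* Full flush: writing CR3 invalidates all non-global TLB entries
 * of the current address space. */
static inline void flush_tlb_full(void)
{
	uint64_t cr3;

	asm volatile("mov %%cr3, %0" : "=r"(cr3));
	asm volatile("mov %0, %%cr3" : : "r"(cr3) : "memory");
}
```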
When hardware virtualization is used, the physical addresses of the (guest) virtual machine are not the same as the (host) machine physical addresses, since virtualization requires an additional level of memory indirection to allow the hypervisor to provision its physical memory among virtual machines. Modern CPUs perform the two levels of address translation in hardware: guest virtual addresses (GVA) are translated into guest physical addresses (GPA) using the virtual machine's page tables, and the resulting GPA is then translated into a host physical address (HPA) using the nested page tables. This HPA is the address that is actually used for the memory access. Such two-dimensional page walks are expensive, since each guest page-table level must itself be translated through the nested tables. Again, to make these translations efficient, the TLB caches translations from GVA directly into HPA[^2].
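To see why caching matters so much here, a back-of-the-envelope count of the worst case (assuming four levels in both the guest and the nested page tables, as on x86-64):

```c
#include <stdio.h>

int main(void)
{
	int guest_levels = 4;	/* guest page-table levels (x86-64) */
	int nested_levels = 4;	/* nested (EPT) page-table levels */

	/* Each guest page-table entry is read at a guest-physical address,
	 * which itself must be translated through the nested tables; the
	 * final GPA needs one more nested walk of its own. */
	int accesses = guest_levels * (nested_levels + 1) + nested_levels;
	printf("worst-case memory accesses per translation: %d\n", accesses); /* 24 */
	return 0;
}
```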
Investigation
During my study it seemed that the TLB was fully flushed very frequently for no apparent reason. For brevity, I spare you what led me to think so. After some experiments, I started to suspect that invalidations of a single page (using the invlpg instruction) caused a full TLB flush instead. However, I had to ensure that the behavior I saw was not a result of cache evictions from the TLB or some unrelated operating-system events. To confirm my suspicions I therefore needed a small, confined and controlled virtual-machine environment. KVM-unit-tests, an environment for creating unit-tests that run inside virtual machines and check that the KVM hypervisor behaves correctly, fitted these needs, although this is not its intended use. Using this environment, I created a small test that measures the time it takes to perform memory accesses when a TLB flush is occasionally performed. I compared the time it takes to perform the accesses when the TLB flush is of a single page with the time it takes when a full TLB flush is performed. Note that the single page flushed in the test was not part of the test working-set, and actually should not have been cached at all.
The core of the test repeatedly accessed several (50) memory pages and then invalidated a completely different page:

```c
t_start = rdtsc();
for (i = 0; i < ITERATIONS; i++) {
	invlpg(another_addr);
	for (j = 0; j < N_PAGES; j++)
		v = buf[PAGE_SIZE * j];
}
t_single = rdtsc() - t_start;
printf("with invlpg: %lu\n", t_single);
```

Then, I ran a similar test that performed a full TLB flush (by reloading the CR3 register) instead of the single-page invalidation.
To perform a fair comparison, we need to take care of one additional point: a full TLB flush by itself (disregarding the memory accesses) takes more time than a flush of a single page. I therefore measured the time the flushes alone take and subtracted it for the comparison. The results on my Haswell machine confirmed my suspicions (results are formatted manually for readability, with some comments added):
measurement | cycles | notes |
---|---|---|
with invlpg | 948,965,249 | |
with full flush | 1,047,927,009 | |
invlpg only | 127,682,028 | |
full flushes only | 224,055,273 | |
access net | 107,691,277 | considerably lower than the overhead of the flushes |
w/full flush net | 823,871,736 | |
w/invlpg net | 821,283,221 | almost identical to full-flush net |
As seen, the net time that the memory accesses take is almost identical whether full TLB flushes are performed or a single page is flushed. Performing the memory accesses by themselves (when no TLB flushes are initiated) takes roughly eight times less time, so the observed behavior cannot be attributed to measurement errors.
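To make the derivation of the "net" rows explicit, here is a sketch of the accounting; the measure_*() helpers are hypothetical stand-ins for the RDTSC-measured loops shown earlier:

```c
#include <stdint.h>

/* Hypothetical helpers: each returns the cycle count of one test variant
 * (accesses with invlpg, accesses with full flushes, or the flushes alone). */
uint64_t measure_with_invlpg(void);
uint64_t measure_with_full_flush(void);
uint64_t measure_invlpg_only(void);
uint64_t measure_full_flush_only(void);

void compute_net(void)
{
	/* Subtract the bare cost of the flushes, so only the memory accesses
	 * (and the TLB misses they incur) are compared. */
	uint64_t net_invlpg = measure_with_invlpg() - measure_invlpg_only();
	uint64_t net_full   = measure_with_full_flush() - measure_full_flush_only();

	(void)net_invlpg;
	(void)net_full;
}
```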
What about AMD? Paolo Bonzini, the maintainer of KVM, was kind enough to run the same experiment on an AMD machine:
measurement | cycles |
---|---|
with invlpg | 285,639,427 |
with full flush | 584,419,299 |
invlpg only | 70,681,128 |
full flushes only | 265,238,766 |
access net | 242,538,804 |
w/full flush net | 319,180,533 |
w/invlpg net | 214,958,299 |
AMD's behavior is reasonable[^3]. As the flushed page is not part of the working-set, the runtime of the test when no flushes take place ("access net") and when a single page is flushed ("w/invlpg net") are almost identical, and over 30% faster than when full TLB flushes take place.
Does it always happen? Could it be that Intel missed such an issue, which does not affect correctness but can hurt performance? Such a performance hit is not theoretical: it can have real impact on some common workloads, for example the Apache webserver, which frequently maps files to memory and then unmaps them, triggering TLB flushes. Further analysis was needed. Now that I had confirmed the phenomenon is real, I could start to look at the performance counters that measure TLB misses[^4]. After a few experiments, I realized that the behavior is indeed specific to virtual machines and depends on the page sizes in the guest and host (nested) page tables. Here is a table that summarizes the number of dTLB misses, as reported by the performance counters, according to the different page sizes:
| | Host page | Guest page | Full flush | Selective flush |
|---|---|---|---|---|
| VM | 4KB | 4KB | 103,008,052 | 93,172 |
| VM | 4KB | 2MB | 102,022,557 | 102,038,021 |
| VM | 2MB | 4KB | 103,005,083 | 2,888 |
| VM | 2MB | 2MB | 4,002,969 | 2,556 |
| Bare-metal | 4KB | N/A | 50,000,572 | 789 |
| Bare-metal | 2MB | N/A | 1,000,454 | 537 |
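For reference, counts of this kind can be collected with the perf tool; this is just a sketch, the event name below is the Haswell-era one (it differs across microarchitectures), and ./tlb_test is a hypothetical stand-in for the measured workload:

```
$ sudo perf stat -e dtlb_load_misses.miss_causes_a_walk -- ./tlb_test
```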
Ok, so finally we understand that the problem only happens when 2MB pages are used in the virtual machine and are mapped with 4KB pages in the nested page-tables. Note that these page sizes are not related to the size of the flushed page: the flushed page was not even mapped in the page-tables, so it could not have been cached in the TLB.
Conclusion
We have come to understand that when a VM uses 2MB pages in its page-tables and the hypervisor uses 4KB pages in the nested page-tables, selective TLB flushes cause full TLB flushes. But why? Through some private communication with Intel, I understood that this behavior is not a bug, but a feature that is intended to deal with what Intel calls "page fracturing". What is page fracturing? Let's look at an excerpt of a virtual address space, as it is mapped in the guest and host page-tables:
```
                      A                       B
                      |                       |
                      v                       v
                   +-----------------------------+-----+
Guest page tables: |             2MB             | 4KB |
                   +-----+-----+-----+-----+-----+-----+
Host page tables:  | 4KB | 4KB | ... | 4KB | 4KB | 4KB |
                   +-----+-----+-----+-----+-----+-----+
```

In such a scenario, the guest-virtual to host-physical address translation of each 4KB page is held in a separate TLB entry: a single 2MB guest page is fractured into up to 512 separate 4KB translations. Obviously, some of these translations may be cached in the TLBs, while others are not. Let's assume the mapping of A is cached and the one of B is not, and that the OS directs the CPU to flush the mapping of page A. Since A lies within a 2MB guest page, the flush must invalidate every TLB entry that corresponds to that 2MB translation, not just the entry of A itself; the CPU apparently cannot locate the fractured entries selectively, so it flushes the entire TLB instead. I guess they need to do it to follow the SDM, Section 4.10.4.1 (regarding pages larger than 4 KBytes):

> The INVLPG instruction and page faults provide the same assurances that they provide when a single TLB entry is used: they invalidate all TLB entries corresponding to the translation specified by the paging structures.

This behavior explains the full flushes I experienced: my hypervisor backed the guest's 2MB pages with 4KB host pages.
Several months ago, everybody was surprised to find out that CPU speculative execution can be exploited to leak privileged information, using attacks that were named Spectre and Meltdown. Technical blogs and even the mainstream media reported broadly on these vulnerabilities. CPU vendors struggled to provide, in a timely manner, firmware patches that prevent the attacks. OS and other software providers introduced software solutions, such as the retpoline and page-table isolation (PTI), to protect against these vulnerabilities. All of these mitigations caused performance degradation, required extensive engineering effort, and caused various problems.
So half a year later, are you protected? Probably not. Recently I found that the Linux protection against Spectre v2 is broken in virtual machines, and Dave Hansen found a bug in the Meltdown protection. Spectre v1 was never considered fully resolved, with mitigations still coming in, and even the existing ones were found to be buggy.
There is clearly a problem, as it is currently hard for people to realize whether they are protected against these attacks. Sure, one can check whether the OS or another piece of software reports that it is protected, but these reports might be wrong. There is a fundamental problem with these protections that, in a way, is the same one that caused Linus Torvalds to politely (yes, politely) decline a Meltdown mitigation technique we proposed:
> Sure, I can see it working, but it's some really shady stuff, and now the scheduler needs to save/restore/check one more subtle bit.
>
> And if you get it wrong, things will happily work, except you've now defeated PTI. But you'll never notice, because you won't be testing for it, and the only people who will are the black hats.
>
> This is exactly the "security depends on it being in sync" thing that makes me go "eww" about the whole model. Get one thing wrong, and you'll blow all the PTI code out of the water.
Linus's criticism of our work is valid, yet it does not seem that other protection mechanisms against these vulnerabilities are much better. And even if the OS is well protected against these vulnerabilities, nobody guarantees that the system will remain safe after, for example, an out-of-tree module is loaded. All it takes for the Spectre v2 protection to be broken is a single indirect branch that was not converted into a retpoline.
It seems that in order to make the protection work, independent tools that validate the protection mechanisms are needed. I found the Spectre v2 issue by using the hardware performance counters to count indirect branches executed by the kernel, and finding that the count was not zero. Dumping the page-tables and tracing translation lookaside buffer (TLB) invalidations can be used to find PTI bugs. Anti-malware tools should take up the gauntlet and perform these checks.
Yet, perhaps there is an additional problem of over-hyping Spectre. Side-channel attacks were known long before Spectre, and invoking them using speculative execution may not be such a game-changer. Unlike Meltdown, which is a real CPU bug, the Spectre family of vulnerabilities may pose lower risk as they are not easily exploitable. The Spectre v2 proof-of-concept exploited some Linux wrongdoings (e.g., not zeroing registers after a context switch from a virtual machine), which were relatively easily fixed and became a good mitigation against other OS bugs. Some new Spectre attacks were not even reported to be successfully exploitable other than in artificial demos.
It might be that the industry over-reacted to Spectre. Even if Spectre vulnerabilities are addressed, software might still leak privileged data through side-channels, so it is not as if the existing protection schemes are complete. Now that the media frenzy is gone, perhaps it is time to reconsider whether paying in performance for questionable “generic” protection schemes against these attacks makes sense, or whether protection should be done on a case-by-case basis.
Update: I wonder whether Windows is indeed safe. Windows uses retpolines, but it is not clear whether they are used exclusively or alongside alternative solutions that use hardware mitigations (IBPB/IBRS). Anyhow, measuring the performance counters of Windows 10 (running in a VM) raises some questions, as they show that indirect branches are executed inside the Windows kernel. Here are the performance counters as measured in a KVM guest:
```
$ sudo perf stat -e br_inst_exec.taken_indirect_jump_non_call_ret:Gk \
       -e br_inst_exec.taken_indirect_near_call:Gk -a -- sleep 5

 Performance counter stats for 'system wide':

         1,682,939      br_inst_exec.taken_indirect_jump_non_call_ret:Gk
         1,102,037      br_inst_exec.taken_indirect_near_call:Gk

       5.001077704 seconds time elapsed
```
Footnotes

[^1]: Ignoring INVPCID flavors and global-page flushes for simplicity.

[^2]: There are additional schemes.

[^3]: The results appear to show that memory accesses are faster when TLB flushes of a single page are performed than when no TLB flushes take place. This does not make much sense, and appears to be an artifact of the test measurement scheme, which disregards out-of-order execution behavior. The tests were only intended to provide a qualitative indication of whether full TLB flushes take place.

[^4]: Usually, I try to avoid relying on such counters as the sole indication of hardware behavior, as it is known that some of them can provide incorrect results in certain cases.