With the introduction of 256-bit registers, AVX also introduced masked loads and stores as a way to handle the remaining values at the end of a loop that do not fill a whole YMM register. With a masked load, one can load a partial vector from memory, while the masked-out lanes are set to zero. This greatly simplifies loop tails and removes the need for an extra branch: if the remaining data does not fill a 256-bit register, the missing lanes can simply be masked out. To prevent segmentation faults when such an access touches unmapped memory, the instruction suppresses the page fault for every lane whose mask is set to zero.
Let’s first take a look at the behaviour of masked loads and stores in AVX and AVX2. While AVX512 has eight dedicated mask registers, AVX and AVX2 instead store the mask in an XMM/YMM register.
So what would happen if the mask of a load instruction is set to zero? The effect we would expect is behaviour identical to xor ymm, ymm: the register is simply cleared to zero. If we perform a store with the mask set to zero, we would expect it to behave like a nop, simply because nothing is being stored.
If we test the behaviour of the load instruction on Intel CPUs from Sandy Bridge up to Cannonlake, we see that executing a load instruction with the mask set to zero still results in a memory access and a full page walk. That means that if we first access an unknown memory location to trigger a page walk and a TLB fill, the next access to it should be quicker if the address is mapped. To be on the safe side, we execute the instruction 16 times and average the timings. Using this gadget, we can resolve the complete mapped address space if we have code execution. We did not see this behaviour on AMD Bulldozer and AMD Zen CPUs.
If we perform the same test with the store instruction, we see a different pattern. On pre-Zen AMD CPUs, there is a clear timing difference between a mapped and an unmapped address. In addition, the store instruction takes considerably longer to execute than the load instruction if the mapping is invalid. As a result, we can reduce the number of instructions we test down to four. On Intel CPUs, we also see a clear difference in behaviour between pre-Skylake and post-Skylake CPUs: on pre-Skylake CPUs we see no difference at all between valid and invalid stores, whereas on post-Skylake CPUs we can actually see a clear difference.
Upon contacting Intel about this issue, they pointed us to the AVX masked load chapter in the optimization manual, which documents this behaviour and also explains the change between pre-Skylake and post-Skylake CPUs. They state there that this is expected behaviour and not a bug. In CPUs before Skylake, the fetch to the cache is only issued after the mask and the address have been resolved. In Skylake and newer, however, the fetch is issued before the mask is actually known. In our tests we found that even if the mask is already known, the µ-op is still emitted.
If we take a look at the AVX512 behaviour, we see the same pattern as on Skylake: even though the mask is stored in the dedicated mask registers, the data is still fetched before the mask is checked.
This gadget gives us some nice properties. First of all, we can scan a complete address range for mapped pages with a small set of instructions in a relatively short time. It also gives us the ability to execute side-channel attacks on systems before Coffee Lake Refresh without the use of an interrupt handler or TSX.