Managing Information Leaks in the Spectre Era
As the final semester is about to kick off, I too have to face finishing up my Master's degree with a thesis. Fortunately for me, many topics in the field of security engineering align closely to my professional work and personal interests, so that I can write about them here. This post marks the start of a series of posts that cover the process of researching and prototyping object-based data isolation.
In the world of systems security, temporal and spatial memory errors have long been the leading cause of actively exploited software vulnerabilities, and there seems no reason to assume that this will change any time soon. These errors mostly stem from the fact that many systems rely on memory that is managed manually by the programmer, and the programmer can make mistakes. The desire for manual memory management is in turn based on the idea that systems should be fast and memory-efficient, and that only by manually managing memory, one can optimally allocate the available resources to meet the ever-increasing performance requirements.
Despite increasing adoption trends in languages that discourage manual memory management (such as garbage collection in Go and C#), and languages that try to verify that memory management is performed safely (such as Rust's borrow-checker), many systems that are in production today still rely on codebases that consist largely of unsafe C and C++ code. One option is to migrate (crucial) systems to safer languages - this is part of my professional work - but unfortunately, this is hard to scale and cannot be done in one night, one year or even one decade (despite what some Microsoft engineers seem to think). As such, much research is dedicated to finding new solutions for existing systems and codebases, to detect (and possibly mitigate) memory errors at the scale of thousands to millions of lines of code.
Architectural Information Leaks
Software vulnerabilities are heterogeneous in both their forms and effects, but in general the exploitation goal is to "bypass a security policy". This is often referred to as violating the CIA triangle: breaking confidentiality, integrity, availability or a combination thereof. In this work, I am particularly interested in mitigating information leaks that breach confidentiality.
Such information leaks are often associated with the aforementioned memory errors. Common cases include buffer overreads and use-after-free vulnerabilities.
void
int*
These types of vulnerabilities are well-known and well-understood, and arguably easy to spot in the trivial examples above. Yet, most production systems are highly complex, and memory allocations and accesses are often spaced far apart in both time and space. As such, it is hard and often infeasible to exhaustively identify memory errors from manual inspection or static analysis alone. This was confirmed once more by the recently published MongoDB buffer overread vulnerability.
To identify information leaks at scale, research has hence been put towards building memory sanitizers that can detect memory errors at runtime. Implementations vary, but the common theme is to introduce additional metadata for allocations, frees and pointers that are then used in runtime validity checks. For example:
- Introduce poisoned redzones around allocated memory. A buffer overread would then read a poison value, which would be detected
- Mark pointers after being freed. When a pointer is reused, its mark is detected
- Quarantine and poison freed memory so that it is not allocated again (immediately). A use-after-free would then read a poison value, which would then be detected
To allow for this, the program is instrumented at compile-time, and then runs with a modified memory allocator. When talked about memory sanitizers, Address Sanitizer (ASan) is often meant. It was introduced by Google in 2011, and became part of LLVM and GCC shortly after. From their wiki:
The run-time library replaces the malloc and free functions. The memory around malloc-ed regions (red zones) is poisoned. The free-ed memory is placed in quarantine and also poisoned. Every memory access in the program is transformed by the compiler in the following way:
Before:
*address = ...; // or: ... = *address;After:
if (IsPoisoned(address)) { ReportError(address, kAccessSize, kIsWrite); } *address = ...; // or: ... = *address;
This instrumentation adds significant overhead in terms of memory (due to the additional metadata) and runtime (due to the additional checks). So significant even, that instrumented systems are rarely deployed in production, but rather only used when debugging and testing.
Sanitizing At Scale
Even with a sanitizer that can accurately find memory errors for a given input, the input space is often enormous. To draw meaningful conclusion about memory safety, it is thus important to cover as much of this input space as possible. For this, industry has turned to conducting fuzzing campaigns. A fuzzer repeatedly feeds different inputs to a system and then relies on the sanitizer to decide if that particular input reveals memory errors. Sometimes fuzzers are described as "random input generators" but nowadays that rarely holds true. Similar to evolution in biology, modern fuzzers often rely on fitness (or rather, interestingness) to decide the inputs that they generate: interesting inputs are further mutated, while uninteresting inputs are discarded earlier on. To decide what is interesting, fuzzers can rely on white-box heuristics such as line or branch coverage, or utilize system-specific heuristics in a more black-box setting.
Two common methods exist to improve a fuzzer: (1) have it cover more inputs and (2) have it cover more relevant inputs. The latter can be done by improving a fuzzer's input evolution algorithms or the heuristics they rely on. The first can be done by increasing a fuzzer's throughput. Since fuzzers depend on memory sanitizers and memory sanitizers decrease throughput due to the overhead they introduce, much effort has been put into making memory sanitizers more efficient by improving on the metadata that they store or optimizing their checks.
Recent work has improved on this by exploiting under-utilized exiisting hardware components (FloatZone), as well as utilizing new hardware primitives for very efficient bookkeeping and validity checking (RangeSanitizer).
From RangeSanitizer: Runtime overhead comparison of RSan and the state of the art sanitizers on SPEC CPU2006 and CPU2017.
But Then There Was the Micro-Architecture
As unfortunate as it is, even if memory sanitizers would be perfect and fuzzers could cover the entire input space somehow, we could not stop worrying about information leaks. The disclosure of spectre, has revealed an entirely new attack surface that stems from micro-architectural speculative execution optimizations. The vulnerabilities are less well-known than the traditional memory errors, and their omni-presence in today's systems justifies a brief revisit of their cause and effect.
In the complexity of modern day computing, there are many aspects at many different layers that can be optimized to make systems fast - or perhaps more importantly, faster. Assembly is often referred to as the lowest layer in this regard (ignoring machine code for simplicty). Its instructions are the primitive, yet most complete truth we have to understand what the processor will execute for us. These instructions are considered to be at the architectural level, since they are generated by a compiler based on the Instruction Set Architecture (ISA) for this specific CPU. This is where the aforementioned "classic" memory errors (buffer overreads and use-after-frees) happen.
Though to run these architectural instructions faster, CPU vendors have not just concerned themselves with increasing the transistor counts to keep up with Moore's law. They have also been focussed on better utilizing the transistors that are already present by parallelizing many operations. Large performance gains were achieved by pipelining a CPU's execution unit, allowing a CPU, for instance, to already start decoding the next instruction, while it is busy executing the current instruction, parallelizing the available execution steps.
Another major performance improvement was achieved by exploiting the fact that instructions that do not depend on each other, can be executed out of order. A "slow" instruction that is waiting for its data to become available (i.e. by reading from memory) can be overtaken by another instruction if there is a processing unit available.
; Example of a simple addition program that computes
; int res = (10 + 12) + (5 + 8)
; Load the operands
; These mov instructions may be executed in any order
; without changing the architectural behavior
mov r0, #10
mov r1, #12
ldr r12 [sp, #12] ; load from memory, slow!
mov r3, #8
; Sum the operands
; These additions may be executed in any order
; without changing the architectural behavior
add r0, r0, r1 ; r0 = 10 + 12
add r2, r2, r3. ; r2 = 5 + 8
; The final add cannot be reordered, because it depends on r0 and r2
add r0, r0, r2 ; r0 = r0 + r2
Finally, the last major category of optimizations falls under speculative execution. In its simplest form, this allows a CPU to start executing multiple branches in parallel when encountering a conditional jump. After determining the condition, only the computation result of the correct branch will be preserved. Execution results of the other, wrongfully executed branches are discarded, never to be used again.
// This condition will take a few cycles to compute
bool condition = ....;
int i = 0;
if else
// The results of the wrong branch are discarded
// and the correct number is printed
;
A CPU has the freedom to rearrange instructions and speculatively execute them, as long as it adheres to its part of the contract: operating on the ISA as expected. It should, in other words, not violate any architectural constraints.
In the case of speculative execution, since both branches have been already speculatively executed, the processor can just continue with the correct one which enables significant speedups compared to waiting for the condition to be fully evaluated and only then starting execution from the right branch. At the same time, since the wrong result is discarded, the program still behaves correctly and will never know about the wrongfully executed and then discarded state, in theory. How instructions are executed - and thus how speculative execution is implemented - is dictated by a CPU's micro-architecture, implemented in both its hardware and microcode.
Traces Everywhere
It that turns out that, even though speculative execution adheres to the architectural rules, the (micro-architectural) speculative paths that are taken leave traces that can be observed at the architectural level. A popular example of an observable trace is cache presence.
Memory lookups are orders of magnitudes slower than cache lookups, so depending on a program's memory access patterns, caching data from memory improves performance considerably. However, by timing memory accesses, one can now determine if a value was cached or not. A popular way to do so is by performing a FLUSH+RELOAD attack (although many other methods exist). An example is shown below.
int data;
// FLUSH: make sure that no item in this array is cached
for
// ...
// some actions are performed on the data
// ...
// RELOAD: access all values and see how long this takes per value
for
A FLUSH+RELOAD attack is already quite powerful, as it allows attackers to leak information (such as crypto keys) across shared pages on multi-tenant systems like cloud servers. This attack surface is widened when speculative execution is added to the mix:
int
The example above is "safe" at the architectural level: if the password is incorrect, the function will return -1 and will not leak any information. Yet, at the micro-architectural level both branches have been speculatively executed! Even though the result (i.e. setting the value of dest[secret] to 1) has been rolled back, dest[secret] is now in the cache, and an attacker could perform FLUSH+RELOAD to now reveal this cache state, and thus the value of secret. The cache operates as a side-channel for the micro-architectural state.
The code snippet is an example of Spectre-V1, also known as bounds check bypass. This particular snippet is contrived, but real-world "spectre gadgets" continue to be found, even in high-level languages like JavaScript. Patching spectre is nearly impossible, as removing speculative execution from a CPU would incur an unacceptably large a performance penalty. Besides, speculative execution is implemented mostly in hardware, which is infeasible to update for all existing systems. Detecting exploitation of spectre gadgets is hard, and many more side-channels exist than just the cache, such as power usage and variable instruction timing differences.
As if leaking cache states was not concerning enough, speculative execution gadgets have also proven to be useful for breaking modern hardware memory protections. Apple M1 memory protections have been shown vulnerable, due to the exploitation of speculatively executed code as a memory integrity "oracle".
Unfortunately, there is no rest for security engineers. Even if programs are considered safe at the architectural level, information might still be leaked through the micro-architecture.
Acceptable Mitigations
Since its publication, it has been clear that spectre is here to stay, and that effective yet acceptable mitigations must be found in software. The most straight-forward fix is to disable speculative execution altogether by having the compiler add lfence instructions to branches and jump targets. As already mentioned, this would cause too drastic of a performance decrease to be applied everywhere, but it can be selectively be applied to manually identified spectre gadgets or code that is considered so vital to security, that the decreased performance is justified. This however, relies mostly on manual inspection, which does not particularly scale well.
Hence, other mitigation options have been, and continue to be actively researched. As usual, the goal is to find mitigations that have their performance penalty legitimized by strong enough security improvements, while minimizing the compatibility issues that arise from applying them at scale. LLVM has introduced speculative load hardening which is considered to be the state-of-the-art spectre-v1 mitigation at the time of writing.
SLH is based on the principle of detecting wrongful speculative execution at runtime. When detected, the loaded data values are scrambled, which "poisons" the cache, so that nothing useful can be deduced from it anymore, even when its state is leaked. In 2023, ultimate SLH was published to improve on existing SLH solutions by also poisoning values for arithmetic operations with variable load times (dependent on their operands) and the results of some obscure x86 repeat instructions.
From USLH: Runtime overhead comparison of modern spectre mitigations on SPEC CPU2017.
Though SLH incurs (much) less overhead than lfence-ing, its security benefits do still not come cheaply. Hence, SLH too, must be selectively applied and is not enabled at scale.
Embracing Spectre
Instead of costly measures that prevent spectre from leaking any information at all (such as lfence and SLH), recent work has focussed on constraining the information that can be leaked to "acceptable" levels. At VUSec - the research group at the VU that organizes my Master's degree - the idea of Type-Based Data Isolation has been explored to this end.
As the name suggests, information (data) is isolated based on its type. Each type has its own allocation arena, and using both compile-time and runtime instrumentation data accesses are constrained to their own arena at the architectural and micro-architectural level. This implies that, even if one manages to leak information, the leak is restricted to that type only. The granularity of types can vary from very coarse (e.g. "safe" and "unsafe" data) to more fine forms (such as one type per primitive). The idea is quite powerful and has a much lower performance overhead, as also underpinned by Apple's recent adoption in the iPhone 17.
From Type-Based Data Isolation: Runtime overhead comparison of Type-Based Data Isolation (blue) and Speculative Load Hardening (red) on SPEC CPU2017.
Yet, the effectiveness of type-based data isolation still largely depends on its implementation details. The hardest challenge is to decide on the types to use: can they be determined accurately, also for existing (and old) codebases? If type information is not present, or very coarse-grained, the guarantees that TDI provide are weakened (i.e. there are few, very large arenas that allow for a lot of information to be leaked intra-arena).
And even if types were fine-grained, compatible and accurate, TDI would still allow for inter-arena information leaks by being able to, for instance, exploit an intra-domain buffer overflow. The guarantees hence depend on the specific target program that is instrumented, and its (architectural) security mitigations.
An intra-arena buffer-overflow of char arr[8] into void *ptr might still overwrite a pointer that allows inter-arena information leaking.
Again, mitigations that are applied at scale are normally weighed on their security improvements versus their performance costs and compatibility issues. TDI scores great on all three: for single-digit overhead, systems can enjoy serious security improvements with relatively little compatibility concerns. Though, what happens if we turn the knobs?
Expanding the Tradeoff Space
With that introduction out of the way, the foundation for a research topic has been laid. In the coming months, I will explore the tradeoff space further to find solutions that have stronger security guarantees than TDI, and less overhead than SLH. To do so, I will base my prototype on novel structures in the state-of-the-art memory sanitizing developments, combined with data isolation methods that have already proven themselves worthy in production systems. More to follow.