Shared Virtual-Memory Objects for Disaggregated Memory with Limited Coherency
- Type of thesis: Bachelor's/Master's thesis
- Status: ongoing
- Projects: ParPerOS
- Supervisors: Alexander Halbuer, Daniel Lohmann
- Student: Daria Richter

Disaggregated memory pools [Generated with AI]
This topic can be explored in a master's thesis or split into two bachelor's theses. See the sections at the bottom for details.
Context
The emerging Compute Express Link (CXL) standard extends the boundaries of main memory. In a nutshell, it provides byte-granular memory access over the PCIe interface, enabling devices (e.g., GPUs) to access and cache host memory and hosts to extend their memory capacity with expansion cards. More recent versions of the CXL standard (2.0/3.0) go even further, allowing for memory disaggregation with centralized memory pools. In combination with coherent access across machines, this allows for efficient communication via shared memory. Unfortunately, due to the high cost of tracking cache-line states, we expect only a small fraction of a memory pool to be cache-coherent across machines. For the remaining, dominant part of the memory, software mechanisms must ensure synchronization.
Problem
With Morsels, we introduced a novel memory-management paradigm that shifts from the management of individual pages to larger virtual-memory objects, technically represented as subtrees of the page-table hierarchy. This reduces management overhead and enables very fast transfer between address spaces. As shared CXL memory pools extend the memory domain, we want to extend the Morsel concept accordingly. Shared memory objects should reside entirely in the memory pool (including their page tables), and multiple hosts should be able to interact with such an object simultaneously. To cope with the limited coherency, the idea is to place page tables in coherent memory for synchronization and to implement an ownership model for data pages at the software level.
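To give a feel for the sizes involved when a memory object is a page-table subtree, the following minimal sketch computes how much virtual address space one subtree spans and how many page-table pages it contains when fully populated. It assumes x86-64 style paging with 4 KiB base pages and 512 entries per table; the function names are hypothetical and not part of the Morsels implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages and 512 (2^9) entries per page table. */
enum { PAGE_SHIFT = 12, BITS_PER_LEVEL = 9 };

/* Bytes of virtual address space covered by a subtree whose root table is at
 * the given level (1 = last-level page table, 2 = page directory, ...). */
static uint64_t subtree_span_bytes(unsigned level)
{
    assert(level >= 1 && level <= 4);
    return (uint64_t)1 << (PAGE_SHIFT + BITS_PER_LEVEL * level);
}

/* Page-table pages in a fully populated subtree of that level:
 * 1 + 512 + 512^2 + ... with `level` terms. */
static uint64_t subtree_table_pages(unsigned level)
{
    assert(level >= 1 && level <= 4);
    uint64_t total = 0;
    uint64_t tables = 1;
    for (unsigned l = level; l >= 1; l--) {
        total += tables;
        tables <<= BITS_PER_LEVEL;
    }
    return total;
}
```

Under these assumptions, a subtree rooted at a last-level table spans 2 MiB and one rooted a level higher spans 1 GiB, yet either is referenced by a single entry in its parent table.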
Goal
On the implementation side, this could be achieved with so-called Overlay-Morsels, one per host. Initially, all parts within this overlay are shared read-only with the authoritative truth in the memory pool. The first write access to a page triggers a page fault, which leads to acquiring ownership of this specific page, meaning that it is mapped writable by the overlay. The remaining parts of the memory object stay unaffected. Additionally, the page-fault handler must ensure that other hosts can no longer access the page, by clearing its present bit in the authoritative truth (and possibly in other overlays) and performing a flush. To perform such flushes/invalidations, we expect a mechanism to send interrupts to other attached hosts. This mechanism will also be used to initiate write-backs if another machine requests an exclusively owned page.
The main difference between this approach and existing RDMA approaches is that the data always resides in the shared memory pool and accesses are performed at cache-line granularity, not page-wise. Due to the lack of compatible hardware implementing the CXL 3.0 standard, the evaluation will be based on a multi-NUMA server system emulating the performance characteristics of CXL-attached memory.
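The write-fault path described above can be illustrated with a small user-space model of a single data page. This is only a sketch under simplifying assumptions: the struct, the fixed host count, and all function names are hypothetical, the per-host overlay state is reduced to one access flag, and interrupts and flushes are replaced by counters.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_HOSTS 4
#define NO_OWNER  (-1)

enum overlay_access { ACCESS_NONE, ACCESS_RO, ACCESS_RW };

/* One data page as seen by the pool and by each host's overlay. */
struct shared_page {
    bool present_in_pool;                   /* present bit in the authoritative mapping */
    int owner;                              /* host that maps the page writable, or NO_OWNER */
    enum overlay_access overlay[NUM_HOSTS];
    unsigned writebacks;                    /* simulated write-back requests */
    unsigned shootdowns;                    /* simulated remote flushes */
};

static void page_init(struct shared_page *p)
{
    p->present_in_pool = true;
    p->owner = NO_OWNER;
    for (int h = 0; h < NUM_HOSTS; h++)
        p->overlay[h] = ACCESS_RO;          /* initially shared read-only everywhere */
    p->writebacks = 0;
    p->shootdowns = 0;
}

/* Stand-in for the interrupt asking the current owner to write back its
 * dirty cache lines and give up the writable mapping. */
static void request_writeback(struct shared_page *p)
{
    assert(p->owner != NO_OWNER);
    p->overlay[p->owner] = ACCESS_NONE;
    p->owner = NO_OWNER;
    p->writebacks++;
}

/* Write-fault path: make `host` the exclusive owner of the page. */
static void handle_write_fault(struct shared_page *p, int host)
{
    assert(host >= 0 && host < NUM_HOSTS);
    if (p->owner == host)
        return;                             /* already writable on this host */
    if (p->owner != NO_OWNER)
        request_writeback(p);               /* previous owner writes back first */

    p->present_in_pool = false;             /* clear present bit in the authoritative truth */
    for (int h = 0; h < NUM_HOSTS; h++) {
        if (h != host && p->overlay[h] != ACCESS_NONE) {
            p->overlay[h] = ACCESS_NONE;    /* other hosts lose access ... */
            p->shootdowns++;                /* ... and must drop stale translations */
        }
    }
    p->overlay[host] = ACCESS_RW;           /* map writable in the faulting host's overlay */
    p->owner = host;
}
```

In this model, a first write fault on host 1 revokes read access on the three other hosts, and a later write fault on host 2 triggers exactly one write-back request to host 1 before ownership moves. Whether a page returns to the shared read-only state in between is a design decision the model leaves open.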
For Bachelor Thesis: Morsel Views
The first part is the implementation of the Overlay-Morsel mechanism. The original Morsel shall only be used read-only. To gain write access to a portion of a Morsel, we want to use the overlay. The overlay initially shares all parts read-only with the original Morsel but can make parts of it writable by:
- unmapping the respective part from the original Morsel,
- performing a TLB flush to prevent further accesses to the unmapped part,
- mapping the part writable into the overlay.
With this procedure, we want to ensure multiple-reader, single-writer semantics. Reverting the operation performs the same steps in reverse order.
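The three steps and their reversal can be sketched as a user-space model that tracks, for each part of a Morsel, whether it is still mapped read-only in the original and which overlay (if any) maps it writable. All names are hypothetical, the TLB flush is a counter, and refusing a second writer is just one possible policy for conflicting overlays.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_PARTS 8
#define NO_WRITER (-1)

struct morsel {
    bool mapped_ro[NUM_PARTS];  /* part is mapped read-only in the original Morsel */
    int writer[NUM_PARTS];      /* overlay that maps the part writable, or NO_WRITER */
    unsigned tlb_flushes;       /* simulated TLB flushes */
};

static void morsel_init(struct morsel *m)
{
    for (int i = 0; i < NUM_PARTS; i++) {
        m->mapped_ro[i] = true;
        m->writer[i] = NO_WRITER;
    }
    m->tlb_flushes = 0;
}

/* Hypothetical stand-in for a TLB flush covering one part. */
static void tlb_flush_part(struct morsel *m, int part)
{
    (void)part;
    m->tlb_flushes++;
}

/* Returns true if `overlay` holds the part writable afterwards. */
static bool overlay_make_writable(struct morsel *m, int overlay, int part)
{
    assert(part >= 0 && part < NUM_PARTS);
    if (m->writer[part] != NO_WRITER)
        return m->writer[part] == overlay;  /* at most one writer per part */
    m->mapped_ro[part] = false;             /* 1. unmap from the original Morsel */
    tlb_flush_part(m, part);                /* 2. flush so no stale read mapping remains */
    m->writer[part] = overlay;              /* 3. map writable into the overlay */
    return true;
}

/* Reverse order: drop the writable mapping, flush, map read-only again. */
static bool overlay_revert(struct morsel *m, int overlay, int part)
{
    assert(part >= 0 && part < NUM_PARTS);
    if (m->writer[part] != overlay)
        return false;                       /* only the writer may revert */
    m->writer[part] = NO_WRITER;
    tlb_flush_part(m, part);
    m->mapped_ro[part] = true;
    return true;
}
```

The ordering matters in both directions: the flush sits between removing the old mapping and installing the new one, so a part is never readable through the original while it is writable through an overlay.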
Challenges are:
- tracking the overlays,
- synchronizing TLB flushes across all overlays,
- handling multiple overlays that access the same part (at least one of them as a writer).
For Master Thesis Only (or Second Bachelor Thesis)
With the basic mechanism implemented, we now want to use it to solve the initial problem of limited coherence. Our assumptions are:
- The original Morsel is located on disaggregated and shared CXL memory.
- Each host has a single Overlay-Morsel.
- Only page tables are cache-coherent; data pages must be explicitly evicted from the caches when a host returns its write capability.
- There is a mechanism to trigger remote interrupts to perform remote TLB shootdowns and request access to exclusively owned parts.
To emulate the characteristics of remote CXL memory, we want to use a system with two NUMA nodes. The first NUMA node represents the remote memory and the second node the local host memory.
Schedule
The thesis will follow these key steps (bachelor: 1-3, master: 1-6):
1. Getting started: Familiarize yourself with kernel development, set up a suitable development environment, and establish a functional test setup.
2. Implementation: Develop the Overlay-Morsel mechanism.
3. Evaluation 1: Analyze the additional bookkeeping overhead in suitable synthetic scenarios.
4. Emulate the CXL characteristics: Adapt the implementation of the overlay mechanism to match the assumptions above.
5. Evaluation 2a: Analyze the Overlay-Morsel mechanism in an emulated CXL environment using synthetic benchmarks to quantify synchronization costs.
6. Evaluation 2b: Run and measure real-world applications to evaluate how well the mechanism overcomes the limitations of restricted hardware cache coherence.
Topics: CXL, paging, disaggregated memory, Linux kernel
References
Web Links
An Introduction to the Compute Express Link (CXL) Interconnect: General introduction to the CXL concept
