Memory Channel Controller
Far, disaggregated memory is transforming data centers, but it comes with performance overheads. The old idea of near-data processing (NDP) is being revived to address this challenge. While hardware vendors are rushing to design NDP hardware, a critical piece of the puzzle is missing: the operating system abstraction. We propose Memory Channel Controllers – a modern take on mainframe I/O channels – to make NDP portable, virtualizable, and capable of fine-grained cooperation with the host CPU.
New interconnects, new hardware
Driven by protocols like CXL, data centers are moving towards disaggregated “far” memory to increase capacity and efficiency. However, this comes at a cost. Accessing this “far memory” incurs a significant latency tax compared to local DRAM, often reducing bandwidth and stalling the CPU.
When moving data is expensive, the logical solution is to move the computation. The industry knows this:
- Marvell is integrating ARM Neoverse cores into their memory extension cards.
- Samsung is demonstrating “Processing-Near-Memory” (PNM) engines within CXL devices.
But here is the problem: To the programmer, these systems look like two separate computers. We currently lack a unified model to access, program, and multiplex these remote processors. Without a proper OS abstraction, these powerful accelerators risk being isolated, hard-to-manage islands of compute.
Our take: an OS-centric abstraction
If we look back to the days of mainframe computers, there was a class of storage with similar characteristics – larger and cheaper than memory, but slower: the spinning disk. In those IBM mainframes, the answer to the I/O bottleneck was the channel controller – a powerful, programmable unit that offloaded I/O tasks from the CPU.
Inspired by that, we propose the Memory Channel Controller (MCC) as the standard abstraction for near-data processing on far, disaggregated memory.
The MCC acts as a programmable unit that sits between the CPU and far memory. Instead of the CPU fetching data byte-by-byte (and stalling on latency), the application sends Channel Programs to the MCC. By executing logic directly next to the data, the MCC eliminates costly round-trips over the CXL link, hiding the high latency of far memory while freeing up the host CPU for other tasks.
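To make the model concrete, here is a minimal sketch of what submitting a Channel Program could look like from the application's side. Everything here is illustrative: the function names (`mcc_open`, `mcc_prog_compile`, `mcc_submit`) and the mini-DSL are placeholders, not the actual interface from the paper.

```c
/* Hypothetical host-side view of submitting a Channel Program to an MCC.
 * These are not real APIs – they only illustrate the model: the CPU describes
 * the computation once, the MCC runs it next to far memory, and only the
 * (small) result crosses the CXL link. */
#include <stddef.h>
#include <stdint.h>

typedef struct mcc_chan mcc_chan_t;   /* a channel to an MCC (OS-provided) */
typedef struct mcc_prog mcc_prog_t;   /* a compiled Channel Program        */

/* Hypothetical OS/runtime interface. */
extern mcc_chan_t *mcc_open(void);
extern mcc_prog_t *mcc_prog_compile(const char *dsl_source);
extern int         mcc_submit(mcc_chan_t *ch, mcc_prog_t *prog,
                              void *result, size_t result_len);

/* Example: count rows matching a predicate in a table that lives in far
 * memory. Without an MCC the CPU would stream the whole table over the link;
 * with one, only an 8-byte count comes back. */
uint64_t count_expensive_orders(void)
{
    mcc_chan_t *ch   = mcc_open();
    mcc_prog_t *prog = mcc_prog_compile(
        "scan table=orders; filter price > 100; reduce count;");

    uint64_t count = 0;
    mcc_submit(ch, prog, &count, sizeof(count));   /* executes near the data */
    return count;
}
```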
We prioritize virtualization and portability as core design choices for this abstraction. Rather than exposing raw hardware constraints to the user, the OS manages MCCs as virtual resources: allocating, managing and multiplexing them. This allows developers to write portable Channel Programs (e.g., in a high-level DSL) that are decoupled from the specific underlying hardware implementation.
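Continuing the sketch above, the virtualization angle could look roughly like this: the application asks the OS for a virtual MCC, the OS decides which physical device backs it (and may multiplex one engine among many tenants), and the same portable program runs regardless of vendor. Again, all names are made up for illustration.

```c
/* Hypothetical illustration of MCCs as OS-managed virtual resources.
 * The application never names a physical device; it requests a virtual MCC
 * and the OS maps it onto whatever hardware is available. */
#include <stddef.h>
#include <stdio.h>

typedef struct vmcc vmcc_t;   /* handle to a virtual MCC */

/* Hypothetical OS interface. */
extern vmcc_t     *vmcc_alloc(size_t far_mem_bytes);     /* pick / share a device */
extern const char *vmcc_backend_name(const vmcc_t *v);   /* for introspection     */
extern int         vmcc_run(vmcc_t *v, const char *portable_dsl_prog);
extern void        vmcc_free(vmcc_t *v);

int main(void)
{
    /* Ask the OS for a virtual MCC fronting 4 GiB of far memory. */
    vmcc_t *v = vmcc_alloc((size_t)4 << 30);

    /* The program is written once against the DSL; the OS/runtime lowers it to
     * the backing hardware (embedded ARM cores, a PNM engine, an FPGA, ...). */
    vmcc_run(v, "foreach node in graph: sum += node.weight;");

    printf("ran on backend: %s\n", vmcc_backend_name(v));
    vmcc_free(v);
    return 0;
}
```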
For more details, check out our APSys’25 paper!
Beyond “dump and wait”
Most existing accelerators (like GPUs) follow a “bulk offload” model: the CPU sends a massive chunk of data, waits, and gets a result back.
The other familiar extreme is no offload at all: the CPU pulls every piece of data across the link itself. MCCs enable a third way: fine-grained interaction.
As modern interconnects (like CXL or Enzian’s ECI) expose memory transactions, it is possible to build hardware-based message-passing channels for efficient, fine-grained data movement. This enables chatty channel programs: for example, the MCC can handle the latency-sensitive part of the workload (e.g. pointer chasing) while the CPU handles complex business logic, with the two parts communicating interactively in real time.
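To give a feel for this interaction pattern – independent of any real MCC hardware or message-channel API – here is a small, self-contained pthreads simulation. One thread plays the MCC, chasing a linked list that stands in for far memory; the other plays the host CPU, applying the business logic; they exchange one small message per node instead of shipping the whole data structure. The list, the predicate, and the one-slot mailbox are all invented for illustration.

```c
/* Software-only simulation of the "chatty channel program" pattern.
 * A real MCC would use hardware message-passing channels over CXL/ECI;
 * the pthread mailbox here only mimics the fine-grained interaction. */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>

typedef struct node { int value; int next; } node_t;   /* "far memory" list */
static node_t far_mem[5] = { {10, 1}, {7, 2}, {42, 3}, {3, 4}, {25, -1} };

/* One-slot mailbox standing in for a hardware message channel. */
static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv  = PTHREAD_COND_INITIALIZER;
static bool has_msg = false, done = false;
static int  msg_value;

static void *mcc_thread(void *arg)            /* latency-sensitive part */
{
    (void)arg;
    for (int i = 0; i >= 0; i = far_mem[i].next) {      /* pointer chasing */
        pthread_mutex_lock(&mtx);
        while (has_msg) pthread_cond_wait(&cv, &mtx);    /* wait for consumer */
        msg_value = far_mem[i].value;                    /* send one small msg */
        has_msg = true;
        pthread_cond_signal(&cv);
        pthread_mutex_unlock(&mtx);
    }
    pthread_mutex_lock(&mtx);
    done = true;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&mtx);
    return NULL;
}

int main(void)                                /* host CPU: business logic */
{
    pthread_t t;
    pthread_create(&t, NULL, mcc_thread, NULL);

    long total = 0;
    for (;;) {
        pthread_mutex_lock(&mtx);
        while (!has_msg && !done) pthread_cond_wait(&cv, &mtx);
        if (has_msg) {
            if (msg_value > 5) total += msg_value;       /* "complex" logic */
            has_msg = false;
            pthread_cond_signal(&cv);
            pthread_mutex_unlock(&mtx);
        } else {                                         /* chase finished */
            pthread_mutex_unlock(&mtx);
            break;
        }
    }
    pthread_join(t, NULL);
    printf("sum of values > 5: %ld\n", total);
    return 0;
}
```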
Real hardware, not just simulation
We are not just simulating this. We are prototyping the MCC architecture on Enzian (check out other articles on this website!).
Using Enzian’s coherent interconnect (ECI), we emulate future CXL hardware today, allowing us to build and test the actual compiler, OS support, and hardware logic before commercial CXL processors hit the market.
Join the Discussion
This is an active project at the NetOS Group @ ETH Zurich. We are currently exploring applications in database processing, graph analytics and more.
If you are interested in how we can rethink near-data processing for the modern era, please reach out!