The protocol documentation for Unified Bus has finally been released. Most of the initial design work for the protocol was done four or five years ago, and I haven’t worked on interconnects for more than two years. Yet reading this 500+ page document today still feels very familiar.

As with most protocol documents, the UB documentation presents a wealth of details about the Unified Bus protocol, but rarely touches on the thinking behind its design. As a small foot soldier who participated in UB in its early days, I’ll share some of my personal thoughts. The productized UB today may differ in many ways from what we designed back then, so don’t take this as an authoritative guide—just read it as anecdotes.

Why UB

To understand the inevitability of Unified Bus (UB), we must return to a fundamental contradiction in computer architecture: the split between the Bus and the Network.

For a long time, the computing world has been divided into islands by these two completely different interconnect paradigms.

  • Inside an island (for example, within a single server or a chassis), we use bus technologies such as PCIe or NVLink. They are designed for tightly coupled systems; devices share a unified physical address space, communication latency can be on the order of nanoseconds, and bandwidth is extremely high. This is a performance paradise, but its territory is very limited—the physical distance and the number of devices a bus can connect are strictly constrained.
  • Between islands, we rely on network technologies such as Ethernet or InfiniBand. They are born for loosely coupled systems, excel at connecting tens of thousands of nodes, and have superb scalability. But that scalability comes at a cost: complex protocol stacks, additional forwarding overhead, and latencies in the microsecond or even millisecond range create an orders-of-magnitude gap compared with buses.

This “inside vs. outside” architecture worked well for a long time. However, a specter began to haunt the computing world—Scaling Law.

About 10 years ago, researchers in deep learning discovered a striking regularity: as long as you keep increasing model size, data, and compute, model performance predictably and steadily improves. This discovery changed the game. What used to be a “good enough” single machine with 8 GPUs suddenly became a drop in the bucket in the face of models with tens or hundreds of billions of parameters.

At that moment, a clear and urgent need presented itself to system architects everywhere: can we tear down the wall between buses and networks? Can we create a unified interconnect that offers bus-level programming simplicity and extreme performance, while also providing network-level massive scalability?

This is UB’s core mission. It’s not merely a patch or improvement on existing protocols but a thorough rethinking. UB aims to build a true “datacenter-scale computer,” seamlessly connecting heterogeneous compute, memory, and storage across the entire cluster into a unified, programmable whole. In this vision, accessing memory on a remote server should be as simple and natural as accessing local memory; tens of thousands of processors should collaborate as efficiently as if they were on a single chip.

Master–Slave Architecture vs. Peer Architecture

In traditional computer systems, CPUs and other devices (such as memory, storage, NICs) are typically organized in a master–slave architecture. The CPU is the master, initiating and controlling all data transfers, while other devices are slaves that passively respond to the CPU’s commands. PCIe and RDMA are products of this master–slave approach. When CPU performance was sprinting ahead under Moore’s Law, this architecture had historical advantages. But as heterogeneous computing has become mainstream, it has increasingly become a bottleneck for modern systems.

  • Performance bottleneck: Every I/O operation requires CPU involvement. As the number and speed of devices grow, the CPU becomes the system bottleneck.
  • Higher latency: The data path is long and traverses multiple software layers, adding software overhead and copies that increase latency. Even if technologies like RDMA allow user-space software on the CPU to bypass the kernel to reach the NIC, they are still constrained by many limitations of PCIe uncacheable accesses and cannot realize true distributed shared memory.
  • Poor scalability: In heterogeneous computing, many GPUs, NPUs, and other accelerators need to communicate with the CPU. The master–slave model scales poorly and struggles to enable efficient “horizontal” data exchange among devices.

To break this bottleneck, UB proposes a peer architecture. In UB’s world, all devices are equal and can be viewed as memory regions. Any device can, using load/store memory semantics, directly access the memory of other devices as if it were local, without intervention from the remote CPU. This lets the data path bypass the operating system entirely, enabling zero-copy and microsecond-level ultra-low latency.

This peer architecture brings many benefits. For example, memory on different servers can form a shared memory pool; idle memory on a compute-intensive application server can be efficiently used by a memory-intensive application server. Heterogeneous compute and storage resources can also be pooled and dynamically composed according to application needs, improving resource utilization and reducing unnecessary data movement.

Bus and Network

To understand UB’s design philosophy, we need to understand the fundamental differences between buses and networks. Of course, we shouldn’t get bogged down in hair-splitting—modern buses (such as PCIe) borrow switching ideas from networks—but in terms of design goals and scale, their paradigms differ significantly.

  • Design paradigm. Bus: designed for in-node communication, a tightly coupled system; devices share physical lines and use arbitration to decide usage. Network: designed for inter-node communication, a loosely coupled system; data is segmented into packets and store-and-forwarded by switches.
  • Address space. Bus: typically a unified physical address space; the CPU accesses devices via MMIO. Network: each node has an independent address space; messages are exchanged using independent network addresses (e.g., IP).
  • Congestion control. Bus: flow control via low-level hardware arbitration and credit mechanisms; relatively simple. Network: congestion is the norm; complex end-to-end congestion control (e.g., TCP, UB C-AQM) is required to ensure stability and fairness.
  • Strengths. Bus: ultra-low latency and very high bandwidth. Network: excellent scalability; can connect tens of thousands of nodes.
  • Weaknesses. Bus: poor scalability; physical distance and device count are very limited. Network: complex protocol stacks; higher forwarding and processing overhead.

Traditionally, within a “supernode” (e.g., a server or chassis) we use bus technology to chase extreme performance; between supernodes, we use network technology to pursue large-scale expansion. These are two entirely different stacks and programming abstractions.

UB’s core value is that it achieves unification at the architectural and programming abstraction levels. Whether the physical medium is a high-speed electrical backplane within a supernode or long-distance fiber between supernodes, UB presents a uniform memory semantic to applications.

This means UB acknowledges that, at the physical layer, intra–supernode interconnects (more bus-like) and inter–supernode interconnects (more network-like) may differ, but it provides a unifying abstraction that hides these differences from applications. This delivers the best of both worlds: bus-level programming simplicity and high performance potential, combined with network-level massive scalability.

The difference between bus and network isn’t about right or wrong, but about paradigms at different scales. Newtonian mechanics is accurate and simple enough for the macroscopic, low-speed world; only as we approach the speed of light or delve into the microscopic do we need relativity and quantum mechanics. For a long time, we comfortably used the classic bus paradigm within the “chassis-scale” macroscopic world, and relied on networks at the “data center” scale. However, AI’s Scaling Law acts like a new observational instrument that pushes computational demand to the extreme, making the “crack” between the two scales—the communication chasm—impossible to ignore. That is the historical inevitability of UB’s birth: we need a new paradigm that unifies these two scales.

There Is Nothing New Under the Sun

There is nothing new under the sun. After working in a field for a while, you realize solving problems is like building with blocks: first list the key issues, then pick a solution for each from the existing toolbox, and combine them.

For networking, the key questions are just a few:

  • What programming abstraction do we offer to applications?
  • At what layer is the abstraction implemented, and how do we split responsibilities across hardware, operating system, programming languages and runtimes, and applications?
  • Given that split, how do we design the interfaces between software and hardware?
  • Who manages each device? Which devices power on and boot together?
  • At what granularity are packets segmented and transmitted on the network?
  • How are addresses allocated?
  • What network topology is used?
  • How do nodes discover each other in the network?
  • Once addresses exist, how is routing done? Should multipath be supported?
  • Point-to-point: should we do per-link flow control, and how?
  • End-to-end: should we do congestion control across multiple links, and how?
  • Do we provide reliable transport semantics? If so, how do we detect loss and retransmit? How are other failures handled and reported?
  • Do we provide in-order delivery semantics? If so, how is it implemented?
  • Do we provide byte-stream semantics or message semantics?
  • Do we provide shared memory semantics? If so, do we provide cache coherence? Can shared-memory access be done with a single hardware instruction, or does it require software to execute multiple instructions?
  • If other semantics are included in the programming abstraction, how are they implemented?
  • How are authentication, authorization, and encryption handled?

Once these questions are thought through and answered, the design is mostly done. A similar approach works in other domains, too. For example, today’s AI agents are essentially about choosing the model, how to implement user memory, how to implement a knowledge base, which context-engineering techniques to use, what toolset to include, and which workflows should be extracted into sub-agents.

One-sided Semantics and Two-sided Semantics

One-sided semantics (memory semantics)

In “The Return of the Condor Heroes,” the sixteen-year pact between Yang Guo and Xiaolongnü is a good example of one-sided semantics. Poisoned by the Love Flowers at the bottom of the Passionless Valley and knowing her days were numbered, Xiaolongnü wanted Yang Guo to seek the antidote and find the will to live on. Before leaping from the Brokenhearted Cliff, she carved into the cliff face: “Sixteen years hence, meet here; as husband and wife, do not break this promise.” By leaving this inscription, she wanted Yang Guo to believe she was still alive and, with that faith, patiently wait sixteen years. Only after carving it did she leap from the cliff.

Xiaolongnü’s carving on the cliff face is a one-sided “write” operation; she did not need Yang Guo present to confirm it. Sixteen years later, Yang Guo returned as agreed, saw the words on the cliff, and finally reunited with Xiaolongnü at the bottom of the valley. Yang Guo’s “read” operation is likewise one-sided: he merely read the information from the cliff without needing Xiaolongnü to be present. In computer networking, this communication pattern is called “one-sided semantics.” The sender (Xiaolongnü) can write data directly into a location accessible to the receiver (Yang Guo) (the cliff), and the receiver can read it at their convenience, with no need for both parties to be online simultaneously.

Because one-sided semantics are primarily read/write operations, they are also called memory semantics.

Note that the target of one-sided semantic reads/writes is not necessarily a memory address. Anything that relies on shared storage for communication falls under one-sided semantics. For example, Redis and other key-value stores also provide a kind of one-sided semantics, where the key is not a memory address but a string.

From the story of Yang Guo and Xiaolongnü, we can also see a drawback of one-sided semantics: it has no way to notify the receiver, and the sender cannot know whether the receiver has received the information. If Yang Guo doesn’t pay attention, he will miss the words on the cliff. Whether Yang Guo saw the words on the cliff, Xiaolongnü cannot know either.

Two-Sided Semantics (Message Semantics)

To address this drawback, two-sided semantics that require cooperation between sender and receiver emerged. The earliest semantics in computer networks were two-sided: first the sending and receiving of raw network packets, which later evolved into sending and receiving data over connections.

Because two-sided semantics are mainly message send/receive operations, they are also called message semantics.

Sharp readers may notice that writing to a memory address and sending a message to the peer application look quite similar, no? A memory address is a number, and sending a message to the other side is an IP address and port number—it seems there’s no difference, right?

The key difference lies in the semantics of the “write” operation. When writing to a memory address, each address can hold only one piece of data; when new data is written, the old data is overwritten. Sending messages is different: although it also only needs a destination address, all messages sent will be retained on the peer side. If an application only needs to receive messages from a fixed sender, message semantics can be easily implemented using memory semantics—as long as the sender ensures that the data does not overwrite itself. But if multiple senders need to send messages to the same receiver at unpredictable times and also need to notify the receiver in a timely manner, pure memory semantics become troublesome—how to coordinate these senders so that their writes to memory addresses do not conflict? In this case, message semantics are more appropriate.
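
As a small illustration of that last point, here is a minimal C sketch of message passing built purely on one-sided memory semantics for a single fixed sender. The `remote_write` and `remote_read_head` primitives are hypothetical stand-ins for one-sided writes/reads; the point is that the sender alone decides where to write and never overwrites a slot the receiver has not consumed.

```c
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 64
#define SLOT_SIZE  256

/* Layout of a ring that lives in the receiver's memory. The single
 * sender remotely writes `slots` and `tail`; only the receiver
 * advances `head` after it has consumed a slot. */
struct msg_ring {
    volatile uint64_t head;                 /* receiver-owned */
    volatile uint64_t tail;                 /* sender-owned   */
    uint8_t slots[RING_SLOTS][SLOT_SIZE];
};

/* Hypothetical one-sided primitives (no receiver CPU involvement). */
void     remote_write(uint64_t remote_offset, const void *buf, size_t len);
uint64_t remote_read_head(void);

/* Sender side: message semantics layered on memory semantics.
 * Returns 0 on success, -1 if the message is too big or the ring is
 * full (i.e., writing now would overwrite unread data). */
int ring_send(uint64_t *local_tail, const void *msg, size_t len)
{
    if (len > SLOT_SIZE)
        return -1;
    if (*local_tail - remote_read_head() >= RING_SLOTS)
        return -1;

    uint64_t slot = *local_tail % RING_SLOTS;
    remote_write(offsetof(struct msg_ring, slots) + slot * SLOT_SIZE, msg, len);

    /* Publish the new tail only after the payload write, so the
     * receiver never observes a slot index before the slot contents. */
    uint64_t new_tail = *local_tail + 1;
    remote_write(offsetof(struct msg_ring, tail), &new_tail, sizeof(new_tail));
    *local_tail = new_tail;
    return 0;
}
```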

Message semantics sound nice, but in high-performance networks they often lead to performance issues. Each time a message is received, the receiver’s CPU needs to process it. If you only want to read a block of data but need to bother the receiver’s CPU, performance will certainly suffer.

More importantly, message semantics require the receiver to pre-post memory buffers. If the receiver does not know in advance how large the incoming message will be, how many buffers should be prepared? If the receiver needs to receive messages from multiple senders, it must also prepare for the possibility that multiple senders send almost simultaneously. Once receive buffers are insufficient, sends will fail.

From first principles, two-sided message semantics are better suited for notifications, while one-sided memory semantics are better for transferring large chunks of data. It’s like when I want to send a large file to someone: I’ll most likely upload the file to a cloud drive, then send an email to notify the other party to download it from the drive, rather than attaching the large file to the email. Uploading to and downloading from the cloud drive correspond to one-sided semantics, while sending the email is two-sided semantics.

The UB protocol provides precisely these one-sided memory operations, allowing one server to directly read and write another server’s memory without the peer CPU’s involvement, thereby achieving extremely high data transfer efficiency and very low latency.

For two-sided semantics, it is important to recognize that their primary role is to notify applications. If an application has multiple messages waiting to be processed, just put them into a queue and wake up once. After the application wakes up, it will naturally process all messages in the queue in order. Traditionally, these packet handling and process event notifications are accomplished via interrupts and the operating system. In UB, hardware can complete most tasks on the data plane, greatly reducing OS overhead.

Of course, anyone who has studied distributed systems knows that memory semantics and message semantics can be implemented in terms of each other. But being able to implement them does not mean the implementation is efficient. Therefore, in different scenarios, both semantics have value. The key in UB is to provide efficient memory semantics so that transferring large data blocks and accessing shared data are more efficient.

Fusion of Semantics: Efficient Notifications with Immediate Values

We hope applications use one-sided memory semantics to transfer data and two-sided message semantics to send notifications. In practice, however, the two are often coupled: after completing a large data write, an application almost always needs to initiate a separate message to notify the peer “the data is ready.” This two-step “write data first, then send notification” introduces extra network latency and software overhead.

To eliminate this inefficiency, the UB protocol introduces an elegant innovation: operations with immediate values (with immediate). It fuses data transfer and lightweight notification into a single hardware primitive, completely removing the need for the application layer to issue a second notification.

1. Write with Immediate

The standard Write is a purely one-sided operation; data is written “silently” into the target memory, and the receiver’s application is unaware of it. Our proposed Write with Immediate, however, notifies the peer application after the write completes:

  • First, like a normal Write, hardware performs an efficient one-sided memory write that bypasses the peer CPU.
  • The key is that after the data write completes, UB hardware generates a completion event in the receiver’s JFC (Jetty For Completion, UB’s completion queue, described in detail later).
  • This completion event acts like a two-sided message, notifying the peer application.
  • Furthermore, this completion event generated remotely by the Write operation can carry an 8-byte immediate value supplied by the initiator.

Write with Immediate is like a cloud drive automatically sending the recipient a notification email with a note (the immediate value) after the file upload succeeds. It fuses the two logical steps of “transfer large data” and “send notification” into a single atomic operation at the hardware level, greatly simplifying application logic and avoiding reordering between one-sided write messages and two-sided notification messages.
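
From the initiator’s point of view, the fused operation might look like the following sketch. The names (`ub_post_write_imm`, `struct ub_write_imm_wr`) and the structure layout are my own illustration, not the actual UB interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative work request for a one-sided write that carries an
 * 8-byte immediate value; not the real UB structure layout. */
struct ub_write_imm_wr {
    const void *local_buf;    /* source data on the initiator */
    size_t      len;
    uint64_t    remote_addr;  /* target memory on the remote node */
    uint32_t    token_id;     /* identifies the remote memory segment */
    uint64_t    immediate;    /* delivered inside the remote JFC entry */
};

/* Hypothetical call: enqueue the request into the local JFS; hardware
 * performs the remote write, then raises a completion event carrying
 * the immediate value in the receiver's JFC. */
int ub_post_write_imm(int jetty, const struct ub_write_imm_wr *wr);

void publish_block(int jetty, const void *data, size_t len,
                   uint64_t remote_addr, uint32_t token_id, uint64_t seq)
{
    struct ub_write_imm_wr wr = {
        .local_buf   = data,
        .len         = len,
        .remote_addr = remote_addr,
        .token_id    = token_id,
        .immediate   = seq,   /* e.g., tells the peer which block just arrived */
    };
    /* One call: bulk data transfer and peer notification are fused. */
    ub_post_write_imm(jetty, &wr);
}
```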

2. Send with Immediate

This is an enhancement to standard two-sided messaging. When initiating a Send operation, in addition to the usual data buffer, the application can attach an 8-byte “immediate value.” This immediate value is not written into the receiver’s memory buffer but is delivered by hardware directly into the completion record in the receiver’s JFC corresponding to that message. This means that at the moment the receiver’s application gets the message completion notification, it can immediately read this 8-byte metadata from the completion record, without extra memory reads. This is very efficient for conveying small metadata such as message type tags.
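
On the receiving side, the notification path for both variants is the same: poll the JFC and read the metadata straight out of the completion record. Again, the types and functions below (`struct ub_cqe`, `ub_poll_jfc`, `handle_event`) are illustrative assumptions, not the real UB API:

```c
#include <stdint.h>

/* Minimal, illustrative view of a completion record in the JFC. */
struct ub_cqe {
    uint32_t opcode;       /* e.g., write-with-immediate vs. send */
    uint64_t buf_addr;     /* receive buffer (two-sided) or 0 (one-sided write) */
    uint32_t byte_len;
    uint64_t immediate;    /* the 8 bytes supplied by the initiator */
};

int  ub_poll_jfc(int jetty, struct ub_cqe *cqe, int max);                  /* hypothetical */
void handle_event(uint32_t op, uint64_t imm, uint64_t addr, uint32_t len); /* app callback */

void drain_completions(int jetty)
{
    struct ub_cqe cqe;
    while (ub_poll_jfc(jetty, &cqe, 1) == 1) {
        /* The metadata is already in the completion record itself:
         * no extra memory read is needed to learn what happened. */
        handle_event(cqe.opcode, cqe.immediate, cqe.buf_addr, cqe.byte_len);
    }
}
```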

Connection-Oriented and Connectionless Semantics: The Jetty Abstraction

Scalability Challenges of RDMA “Connections”

Before the emergence of UB’s disruptive paradigm, engineers in networking were like scientists in the “normal science” phase, striving to solve problems within the existing “connection-oriented” paradigm. RDMA was itself a great success, but as data centers scaled out, its inherent scalability issues gradually became a new “puzzle.” In RDMA, communication must first “establish a connection,” whose concrete entity is the Queue Pair (QP). Each QP contains a Send Queue (SQ) and a Receive Queue (RQ), along with a whole set of associated state machines to handle packet ordering, retransmission, acknowledgments, and other complex reliability logic.

The cost of this design is that the state of each QP must be kept entirely in the NIC’s on-chip memory (SRAM) so that hardware can process at line rate. In small-scale HPC clusters this is fine. But when we apply this model to data centers with tens of thousands of servers, each running hundreds or thousands of application processes, the model hits the “scalability ceiling”:

  1. Hardware resources exhausted: For a server to communicate with 1,000 other servers, it must maintain 1,000 QPs. On-chip NIC memory is extremely precious and will be quickly exhausted.
  2. Management complexity explodes: Applications and the OS need to manage massive amounts of connection state, which itself incurs huge software overhead.

To tackle this puzzle, the community put in tremendous effort and developed technologies such as XRC (eXtended Reliable Connection) and SRQ (Shared Receive Queue).

  • SRQ allows multiple QPs to share the same Receive Queue, reducing receive buffer memory usage to some extent, but the sender still needs to maintain a separate QP for each peer.
  • XRC goes further by allowing multiple remote nodes to share the same target QP, further reducing connection state.

However, these techniques are essentially “patches” on the original “connection-oriented” model. They make the model more complex but do not fundamentally solve the problem. When the major anomaly of the Scaling Law emerged, we realized that what was needed was not a more delicate patch but a full-blown paradigm revolution—as long as communication requires applications to explicitly create and manage a “connection” state, the scalability ceiling will always be there.

From “Connection” to “Jetty”

We realized then that we had to completely abandon the connection-oriented mental model—strike at the root. This idea ultimately gave birth to UB’s core abstraction: Jetty.

The traditional “connection” model is like opening an exclusive, point-to-point private shipping lane between two ports. From first principles, the essence of communication is simply “to deliver a piece of information reliably from point A to point B.” Many concepts in networking—such as port (harbor), beacon (lighthouse), ping (the sound emitted by sonar), gateway (a navigable waterway or strait), firewall (a watertight, fireproof bulkhead on a ship)—originate from maritime terms. Professor Cheng Chuanning gave our connectionless abstraction the name jetty; its original meaning is an artificial structure extending from the shore into the sea, such as a breakwater or a pier.

We deliberately chose the word “Jetty” (pier) rather than reusing common networking terms. Kuhn noted in The Structure of Scientific Revolutions that the establishment of a new paradigm is often accompanied by the birth of a new language. Old words like “connection” carry too much inertia from the old paradigm. Creating a new term forces us to think with an entirely new worldview—not point-to-point “private shipping lanes,” but many-to-many “public docks.” This new vocabulary forms the jargon of UB’s new paradigm; they are the textbooks for entering this new world.

Initially, we envisioned a simpler model. A Jetty is like a kayak launch point. An application thread (a kayaking enthusiast) places a request (a kayak) into the JFS and can leave immediately; the launch point is instantly available for the next person. This sounds very efficient because it completely decouples hardware and software.

However, this seemingly simple design hides a fatal flaw: it cannot realize reliable software–hardware flow control. Hardware may complete tasks faster than software can process completion events. If hardware keeps posting completion events (CQEs) to the JFC while software cannot retrieve them in time, the JFC will soon fill up. Once the JFC overflows, subsequent completion events will be dropped, leading to disastrous consequences—the software will never learn that certain operations have completed.

To solve this problem, the final design adopts a more sophisticated “berth” model. We can think of a Jetty as a public dock with multiple berths. Each request that needs to set sail (an individual two-sided message or a memory read/write request) is like a ship and must first apply for a berth at the dock. This berth is an occupied resource slot within the Jetty. In UB’s concrete implementation, when an application submits a request (WQE) to JFS (Jetty For Send), that request occupies one slot in the JFS.

The key is that this slot is not fleeting. Each slot in JFS corresponds one-to-one to a slot in JFC (Jetty For Completion). When the hardware completes the network transfer and the remote operation, it places a completion event (Completion) into the corresponding slot in JFC. Only after the application has processed this completion event is the slot (the “berth”) fully released, becoming idle and available for the next request. This JFS–JFC pairing also forms a delicate hardware flow control between the CPU and the NIC: completion events in JFC that have not been processed by software will in turn prevent the hardware from accepting new requests into JFS.

Therefore, different requests in Jetty are indeed more akin to the concept of berths at a ship terminal. From submission, through network transfer, to the initiator’s software finally processing the completion event, a request occupies a berth at the dock throughout its entire lifecycle. Although this design is more complex than the “kayak launch point” model, it establishes a back-pressure (Back Pressure) mechanism via the one-to-one mapping between JFS and JFC, fundamentally addressing the issue of event loss caused by mismatched software–hardware speeds.

The fundamental advantage of this model is that it reduces an N × N “private channel” management problem to the management of N “public docks,” solving the scalability challenge.
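
A minimal sketch of the berth bookkeeping, assuming in-order consumption of completions and ignoring all hardware details, might look like this; the back-pressure falls out of the single occupancy counter shared by JFS and JFC:

```c
#include <stdbool.h>
#include <stdint.h>

#define JETTY_DEPTH 1024            /* number of "berths" in this Jetty */

/* One berth: a JFS slot and its one-to-one JFC slot share an index
 * and a lifecycle. (Software-side bookkeeping only.) */
struct berth {
    bool in_flight;                 /* WQE posted, completion not yet consumed */
};

struct jetty {
    struct berth berths[JETTY_DEPTH];
    uint64_t     posted;            /* requests handed to hardware so far */
    uint64_t     consumed;          /* completions already processed by software */
};

/* Post a request only if a berth is free; otherwise the caller is
 * back-pressured until it drains its JFC. */
int jetty_try_post(struct jetty *j)
{
    if (j->posted - j->consumed >= JETTY_DEPTH)
        return -1;                           /* all berths occupied: back-pressure */
    uint64_t slot = j->posted % JETTY_DEPTH;
    j->berths[slot].in_flight = true;
    j->posted++;
    /* ...hand the WQE in `slot` to hardware here... */
    return (int)slot;
}

/* Called only after software has processed the completion event in the
 * matching JFC slot; only then is the berth released. (For simplicity,
 * completions are assumed to be consumed in posting order.) */
void jetty_release(struct jetty *j, uint64_t slot)
{
    j->berths[slot].in_flight = false;
    j->consumed++;
}
```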

Jetty’s Core Innovation: Decoupling the Transaction Layer from the Transport Layer

The key innovation in Jetty’s design is to completely decouple the transaction layer from the transport layer. This decoupling is not only a separation of the software–hardware interface and transport-layer protocols, but also a fundamental architectural breakthrough that brings revolutionary advantages for large-scale multi-core, multi-application scenarios.

In UB’s layered architecture:

  • Jetty is the abstraction of the transaction layer: Applications interact with hardware via JFS/JFC/JFR; each Jetty corresponds to an application’s communication needs and handles transaction-level operations.
  • TP (TransPort) is the entity of the transport layer: TP represents the end-to-end link between two transport endpoints (also called a Transport Channel, TPC), providing reliable communication services to the transaction layer. TP maintains transport state (congestion-control state, sequence numbers, bitmaps, etc.) and is transparent to the upper transaction layer.

Some readers may question: Given that transport state (congestion-control state, bitmaps, etc.) must be kept in hardware in both the traditional QP model and the Jetty model, what fundamental advantage does Jetty have over RDMA’s traditional QP?

The key lies in decoupling the transport and transaction layers, and resource sharing in multi-core, multi-application scenarios. Let’s understand this with a concrete example:

Assume a typical datacenter scenario:

  • Locally, there is one server with 128 CPU cores running 128 application threads
  • On the remote side there are 1,024 servers, each also with 128 threads
  • All threads need to communicate in an all-to-all pattern

The dilemma of the traditional QP model:

  • The QP in traditional RDMA tightly couples the transaction and transport layers
  • Each communicating pair of threads needs a separate QP (Queue Pair)
  • Total number of QPs required: 128 × 1024 × 128 = 16,777,216 QPs
  • Each QP must maintain a full connection state, including send queue, receive queue, sequence numbers, retransmission state, etc. While this state can be kept in host memory, it must be loaded into the NIC’s on-die memory when sending data
  • Managing state at this scale quickly exhausts NIC hardware resources

The elegant approach of the Jetty model:

  • Transaction layer: Each application thread creates one Jetty, for a total of 128 Jetties
  • Transport layer: The NIC maintains 1,024 TPs (TransPorts), each TP corresponding to one remote server
  • Key point: Different threads on the same local server, when accessing the same remote server, use their own independent Jetties at the transaction layer but share the same TP at the transport layer
  • Total resources needed: 128 Jetties (transaction layer) + 1,024 TPs (transport layer)

With this layered design, Jetty achieves:

  1. Flexibility at the transaction layer: Each application has its own Jetty with independent send/receive queues, isolated from one another
  2. Efficiency at the transport layer: Underlying TPs are shared among all applications targeting the same destination; transport state (congestion control, loss recovery, packet sequence numbers, etc.) is centrally managed by the shared TP
  3. A leap in scalability: Per-NIC hardware resource needs drop from O(N²M) connection states (one QP per communicating thread pair, 16,777,216 in the example above) to O(N + M) (N Jetties plus M TPs, i.e., 128 + 1,024 here), where N is the number of threads per server and M is the number of remote machines

The essence of this architecture is recognizing that the “transaction abstraction needed by applications” (Jetty) and the “transport state the network must maintain” (TP) are concerns at two different layers and should not be forcibly bound together as in traditional QP. By decoupling the transaction and transport layers, we preserve application independence and isolation while achieving efficient sharing of underlying transport resources, thereby maintaining excellent performance and scalability even in extreme scenarios like multithreaded all-to-all communication.
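
In code, the decoupling can be pictured roughly as follows (all names are illustrative, not UB’s actual data structures): the per-thread Jetty holds only transaction-level queues, while a per-destination TP table holds the transport state that every Jetty shares.

```c
#include <stdint.h>

#define NUM_LOCAL_THREADS 128
#define NUM_REMOTE_NODES  1024

/* Transport-layer state: one per remote node, shared by every local
 * Jetty that talks to that node (congestion window, packet sequence
 * numbers, loss-recovery bitmap, ...). */
struct tp {
    uint32_t remote_node;
    uint32_t cwnd;
    uint64_t next_psn;
};

/* Transaction-layer state: one per application thread. It holds the
 * Jetty queues (JFS/JFR/JFC) but no per-destination transport state. */
struct jetty_ctx {
    uint32_t jetty_id;
    /* JFS / JFR / JFC descriptors would live here */
};

static struct tp        tp_table[NUM_REMOTE_NODES];   /* 1,024 TPs   */
static struct jetty_ctx jetties[NUM_LOCAL_THREADS];   /* 128 Jetties */

struct jetty_ctx *my_jetty(uint32_t thread_id)
{
    return &jetties[thread_id];
}

/* When any thread sends to remote node `n`, its WQE is tagged with its
 * own Jetty but is carried by the shared transport for that node. */
struct tp *select_transport(uint32_t remote_node)
{
    return &tp_table[remote_node];
}
```

Under the traditional QP model, the same NIC would instead have to juggle 128 × 1,024 × 128 per-pair connection states.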

Other Advantages of Jetty: Connectionless Semantics and a Simplified Programming Model

Beyond the resource sharing enabled by decoupling the transaction and transport layers, Jetty also brings many improvements in the programming model and system reliability:

1. Truly connectionless semantics

  • Applications do not need to explicitly establish and manage connections; they can send requests directly to any target
  • Avoids the cumbersome connection setup (QP creation, address exchange, state synchronization) and teardown in traditional RDMA
  • Especially suited to dynamic, short-lived communication scenarios, such as request–response patterns in microservice architectures

2. Lower connection management overhead

  • No need to pre-establish connections for every potential communicating pair
  • Saves substantial connection setup time and signaling overhead
  • Clear advantages when communication patterns change dynamically (e.g., the set of nodes an application will talk to is not known at runtime)

3. Simpler failure handling and recovery

  • No connection state to rebuild; recovery after node failures is simpler
  • The application layer need not detect and handle the complex logic of disconnect/reconnect
  • Failures in the TP layer can be handled transparently to the upper Jetty; reconstruction of underlying transport channels does not affect the application’s transaction abstraction

4. A simplified programming model

  • Application developers need not manage complex connection lifecycles
  • Reduces common issues such as connection leaks and misconfigured connection counts
  • Lowers the development and debugging burden for distributed applications, letting developers focus on business logic itself

Practical Considerations in the Jetty Model: HOL Blocking, Fairness, and Isolation

Of course, the “public dock” model must also face real-world complexities.

First is the issue of Head-of-Line Blocking (HOL). Because each berth (a JFS/JFC resource pair) in Jetty must be occupied for the entire lifecycle of a request, HOL blocking objectively exists. In a FIFO queue, if the head-of-queue item is a huge, time-consuming task (e.g., an extremely large data send), it will occupy a berth for a long time. If many small sends are placed into the same JFS after this huge send, they may fill up the entire queue (the ring buffer). In that case, even if some small sends have already completed, new sends cannot be enqueued into JFS because the first huge send still has not completed.

However, this usually does not cause serious problems in practice. First, a Jetty can have a very large number of “berths,” up to the thousands. Second, UB is a very fast network, and most requests have extremely short lifetimes. Therefore, in most scenarios the probability that HOL will fill all berths is not high.

Second are the issues of Fairness and Isolation. Since all departing ships leave from the same dock (a single JFS), you cannot guarantee fairness among ships bound for different destinations or with different priorities. A “crazy” shipper (an application) might keep piling up cargo at the dock, consuming all resources and leaving other shippers’ vessels no chance to depart.

As for HOL blocking, fairness, and isolation, the Jetty model offers a unified and flexible solution: when needed, applications can create multiple Jetties.

  • Mitigate HOL blocking: If an application needs to handle large requests alongside many small ones, a best practice is to use different Jetties for them, splitting the “slow ships” for large requests and the “speedboats” for small requests to different docks.
  • Need isolation: If a critical application does not want its send/receive traffic interfered with by any other app, it can create its own dedicated one-to-one Jetty (a JFS/JFR pair), which logically partially reverts to the “connection” abstraction, using an independent “private dock” to guarantee QoS.
  • Need fairness: If a service must fairly handle requests from multiple tenants, it can create different Jetties per tenant or per class of request, and then perform its own polling or scheduling at the application layer.

This is precisely the elegance of the Jetty abstraction: it offers an extremely simple and scalable “connectionless” model as the default, while returning to the application the choice of “how much isolation and traffic splitting is needed.” Applications can choose the best trade-off for their needs between “fully shared” and “fully isolated.”

Implementing One- and Two-Sided Semantics under the Jetty Abstraction

The Jetty abstraction can implement both one-sided and two-sided core semantics efficiently using a unified queue model.

1. One-sided memory semantics (One-Sided)

One-sided operations (such as RDMA Read/Write) behave like a memory access: the initiator provides the address and data and does not require the remote application CPU’s involvement. In the Jetty model, the process is greatly simplified:

  • The initiator application submits a “write” request (including target address, data, etc.) into JFS.
  • The UB hardware pulls the request from JFS and completes reliable delivery to the target.
  • The target-side UB hardware writes the data directly into the specified memory address.
  • On the initiator, the UB hardware places a completion event (CQE) into JFC. The initiator learns the operation has completed by checking JFC.

Throughout, the initiator application does not even need a receive queue (JFR), because it does not “receive” any application-layer message; it only cares whether its own operation has “completed”.

2. Two-sided message semantics (Two-Sided)

Two-sided operations (such as Send/Receive) require participation from both applications. The working mechanism of JFR (Jetty for Receive) is similar to RDMA’s receive queue (RQ); its core is to decouple message arrival from the application’s buffer management:

  • First, the target-side application needs to pre-post several “receive” requests to JFR. Each receive request points to a memory buffer provided by the application.
  • When the initiating application submits a “send” request to JFS, the UB hardware reliably delivers it to the target.
  • After the target-side UB hardware receives the message, it consumes a pre-posted receive request from JFR and writes the received data directly into the buffer associated with that request.
  • Once the data is stored, the hardware places a completion event in JFC (Jetty for Completion) to notify the target application. This completion event informs the application of the buffer address, data size, and other information.
  • The target application checks its JFC, discovers this completion event, and can then process the new message in the corresponding buffer.
  • Meanwhile, on the initiator side, the UB hardware also places a completion event in its JFC to indicate that the “send” operation has succeeded.

This pre-posted receive-buffer model leads to a classic problem in message semantics: if the receiver cannot know a message’s size in advance, how should it manage buffers efficiently? To meet this challenge, Jetty provides flexible buffer splitting and coalescing mechanisms. For example, a large receive buffer can be split to receive multiple small messages; conversely, a large message can be “scattered” across multiple pre-posted small buffers.

However, this flexibility introduces extra complexity and can more easily cause subtle performance issues. Complex buffer management not only increases the burden on application software, but may also interfere with JFC flow control because a single logical message maps to multiple hardware operations. Therefore, from a design and best-practice perspective, we recommend a simpler, more efficient one-to-one model: prepare a sufficiently large receive buffer for each message. This again reinforces UB’s core design philosophy: two-sided message semantics are born for efficient “notifications,” while bulk data transfer should be handled by one-sided memory semantics.
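
Here is a sketch of the recommended one-buffer-per-message receive loop under the Jetty model, with hypothetical `ub_post_recv`/`ub_poll_jfc` calls and a minimal view of the completion record:

```c
#include <stdint.h>

#define RECV_SLOTS 256
#define MAX_MSG    4096           /* one "sufficiently large" buffer per message */

struct ub_cqe { uint64_t buf_addr; uint32_t byte_len; };   /* minimal, illustrative */

int  ub_post_recv(int jetty, void *buf, uint32_t len);     /* hypothetical: post to JFR */
int  ub_poll_jfc(int jetty, struct ub_cqe *cqe, int max);  /* hypothetical */
void process_message(const void *buf, uint32_t len);       /* application logic */

void receiver_loop(int jetty)
{
    static uint8_t bufs[RECV_SLOTS][MAX_MSG];

    /* 1. Pre-post one full-size buffer per expected message. */
    for (int i = 0; i < RECV_SLOTS; i++)
        ub_post_recv(jetty, bufs[i], MAX_MSG);

    /* 2. Each completion names the buffer hardware filled and how much. */
    for (;;) {
        struct ub_cqe cqe;
        if (ub_poll_jfc(jetty, &cqe, 1) != 1)
            continue;
        process_message((const void *)(uintptr_t)cqe.buf_addr, cqe.byte_len);

        /* 3. Re-post the buffer so this "berth" becomes available again. */
        ub_post_recv(jetty, (void *)(uintptr_t)cqe.buf_addr, MAX_MSG);
    }
}
```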

3. Efficient application wake-up mechanism

Applications chasing peak performance continuously poll (Polling) JFC/JFR to obtain completion status and new messages. But in many scenarios, the application may be sleeping. If every completion event triggers an interrupt to wake the CPU, the overhead is too high. Jetty achieves an efficient asynchronous notification mechanism through the cooperation of JFC and the EQ (Event Queue): when submitting a request, the application can set a flag to ask the hardware to trigger an Event after the transaction completes. The hardware places this Event into the EQ, and multiple Events can correspond to a single interrupt. After the process is awakened, it only needs to check the EQ to learn that an “event” has occurred, and then batch-handle the multiple completion messages accumulated in JFC and JFR. This turns the potential “once per message” wake-up overhead into “once per batch,” greatly improving efficiency.
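
A sketch of this batched wake-up loop, with hypothetical `ub_arm_notification`/`ub_wait_event` calls standing in for the real JFC/EQ interface (and glossing over the re-arm race that a production loop must close):

```c
#include <stdint.h>

struct ub_cqe { uint64_t buf_addr; uint32_t byte_len; uint64_t immediate; };

/* Hypothetical calls, named for illustration only. */
void ub_arm_notification(int jetty);            /* ask HW to raise one Event on the next completion */
int  ub_wait_event(int eq, int timeout_ms);     /* block on the EQ (interrupt-backed) */
int  ub_poll_jfc(int jetty, struct ub_cqe *cqe, int max);
void handle(const struct ub_cqe *cqe);

void event_driven_loop(int jetty, int eq)
{
    for (;;) {
        ub_arm_notification(jetty);             /* at most one wake-up per batch */
        if (ub_wait_event(eq, -1) <= 0)
            continue;

        /* Woken once: drain everything that has accumulated since the
         * last pass, then go back to sleep. */
        struct ub_cqe cqe;
        while (ub_poll_jfc(jetty, &cqe, 1) == 1)
            handle(&cqe);
    }
}
```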

In summary, the Jetty abstraction is the cornerstone of UB’s connectionless design philosophy. It replaces the complex, connection-oriented state-machine model in traditional networks with a simple, connectionless “jetty-queue” model, offloading heavy work to hardware and ultimately providing upper-layer software with an extremely simple, extremely high-performance, and extremely scalable programming interface.

Strong transactional order and weak transactional order

In distributed systems, order is central to consistency, yet also a shackle on performance.

Message semantics: breaking free from the shackles of byte streams

Traditional network communication, exemplified by TCP, provides a reliable point-to-point byte-stream (Byte Stream) abstraction. This is a powerful model that ensures no loss, no duplication, and in-order delivery. However, when we carry multiple independent business “messages” over a single connection, this strict byte-stream ordering becomes a performance bottleneck. If the packet carrying the first message is lost, TCP’s reliability mechanism stalls the entire connection until the packet is retransmitted successfully. All subsequent messages—even if logically independent—must wait. This is the well-known “Head-of-Line Blocking” (HoL Blocking).

Modern network protocol design fully recognizes this issue. Whether it is QUIC and HTTP/3—the future of the Web—or UB for high-performance data centers, one of the core shifts is replacing “byte-stream semantics” with “message semantics.” By building multiple independent logical flows atop connectionless UDP or similar lower-layer protocols, a problem with one message’s delivery no longer blocks other unrelated messages. This provides the necessary foundation to discuss ordering at a higher level—namely, between transactions.

The dream of strong ordering: the allure and challenges of global total order

Since we can logically distinguish independent transactions, a natural ultimate ideal emerges: can we build a communication system that extends ordering guarantees from a single point-to-point connection to the entire network?

The earliest inspiration for this idea came from a simple physical intuition: the effect of an event in the network is like ripples from a stone dropped into water, propagating from one network node (switch or host) to subsequent nodes. Each packet forwarding step resembles the advancement of a wavefront. If we could capture the relative order in which these “waves” propagate across the network, wouldn’t we naturally obtain a globally consistent order for all events?

This was the research project I collaborated on during my PhD with Zuo Gefei, Bai Wei, and my advisor Zhang Lintao (1Pipe: Scalable Total Order Communication in Data Center Networks). The core motivation was that if the network could provide all nodes with a “One Big Pipe” abstraction, strictly ordering all transmitted transactions (unicast or multicast) in a single virtual global sequence, then many distributed systems problems would have more efficient and simpler solutions, for example:

  • Distributed transactions: An atomic write spanning multiple nodes can be packaged as a totally ordered “scattering” message. The protocol ensures all nodes see this write at the same “logical moment,” naturally achieving atomicity without complex two-phase commit or locks.
  • State-machine replication: The core of consensus algorithms (e.g., Paxos/Raft) is to agree on a single order for the operation log. If the network itself provides total order, replication would only need to handle failures, and the complexity would drop dramatically.
  • Memory consistency: In distributed shared-memory systems, globally total-ordered communication can avoid many types of data hazards (read/write ordering conflicts), solving the ordering of memory updates.

This ideal is essentially an attempt at an engineering realization of the ultimate strengthening of the happened-before relation that Leslie Lamport described in his landmark paper inspired by special relativity: a global total order that is compatible with all causal relationships.

However, from the first submission in 2018 to the final publication in 2021, the 1Pipe work went through five submissions, and its core challenges became clear: real-world networks are full of failures (Failure), and a purely total-order system is extremely fragile in the face of failures. Merely assigning a globally unique sequence number to a message is not enough, because you cannot guarantee reliable delivery and atomicity. What happens if a sender assigns a sequence number but crashes before the message is acknowledged by all receivers? If a receiver is permanently down, how do you ensure the integrity of this “atomic” scattering operation? To address these issues, we were forced to add complex failure detection and recovery mechanisms on top of the idealized total-order model, greatly increasing system complexity.

The way of weak ordering: a new paradigm embracing uncertainty

This difficult research experience made me reflect deeply: do we really need such an expensive and complex strongly causal, strongly ordered system?

A glimpse of the answer unexpectedly came from two seemingly unrelated fields: fundamental physics and artificial intelligence. A colleague told me that cutting-edge experiments in physics, such as the “quantum switch,” have revealed that at the finest scales, the universe’s causal structure may not be as “rigid” as in our macroscopic world. Causal order itself may be indeterminate and superposed; the determinism we experience may just be a statistical average of the microscopic probabilistic world at macroscopic scales.

This idea resonates with the intrinsic nature of modern AI systems. Today, our largest-scale compute workloads—training and inference of deep neural networks—have core algorithms (e.g., stochastic gradient descent) that are probabilistic and naturally tolerate, even exploit, “noise.” In a system that is itself probabilistic, small reordering or latency introduced by communication is simply another form of noise that the algorithm can digest.

If the universe’s underlying rules may not be strongly causal and our most important applications can tolerate weak causality, then is it necessary—or is it over-engineering—to force a perfect, strongly consistent communication layer on unreliable hardware?

This is the philosophical foundation of “weak transactional order” in the Unified Bus. UB recognizes that different applications have vastly different requirements for order and consistency. Therefore, it does not provide a single rigid ordering model, but rather a graded set of transactional ordering primitives that applications can choose as needed.

UB transactional order: execution order and completion order

UB decomposes transactional ordering guarantees into two orthogonal dimensions: execution order and completion order.

Transaction execution order (Execution Order)

It defines the order in which requests are executed on the Target side and is central to ensuring consistency.

  • NO (No Order): The default, with the highest performance. Transactions are completely independent, and the Target may execute them in any order. Suitable for stateless queries, independent log uploads, etc.
  • RO (Relaxed Order): The core of weak ordering. It guarantees that a chain of transactions from the same Initiator, marked RO or SO, will execute in the order sent. But it does not block other transactions unrelated to this chain. It maximizes parallelism while ensuring no reordering within the causal chain.
  • SO (Strong Order): The guarantee of strong ordering. A transaction marked SO must wait until all prior RO and SO transactions from the same Initiator have completed execution before it starts. This provides a strong sequencing point, suitable for key operations requiring strict serialization.
  • Fence: A special barrier mechanism. It ensures that only after all preceding transactions (of any type) have completed can subsequent transactions begin. It is used to establish clear boundaries between different, logically independent transaction “batches.”
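
A common way to use these primitives is shown in the sketch below: stream the bulk payload with RO, then publish a “ready” flag with SO so the flag can never overtake the data. The flag names and the `ub_post_write` signature are illustrative only, not UB’s actual encoding.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative ordering flags mirroring the primitives above; the real
 * encodings are defined by the UB spec. */
enum ub_order { UB_NO, UB_RO, UB_SO, UB_FENCE };

int ub_post_write(int jetty, uint64_t raddr, const void *buf,
                  size_t len, enum ub_order order);              /* hypothetical */

/* Stream the payload with RO (no reordering within this chain, but
 * unrelated traffic is not blocked), then publish a "ready" flag with
 * SO so it cannot overtake any part of the payload. */
void publish(int jetty, uint64_t data_addr, uint64_t flag_addr,
             const void *chunks[], const size_t lens[], int n)
{
    uint64_t off = 0;
    for (int i = 0; i < n; i++) {
        ub_post_write(jetty, data_addr + off, chunks[i], lens[i], UB_RO);
        off += lens[i];
    }

    static const uint64_t ready = 1;
    ub_post_write(jetty, flag_addr, &ready, sizeof(ready), UB_SO);
}
```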

Transaction completion order (Completion Order)

It defines the order in which completion notifications (CQEs) are produced, decoupled from execution order. This allows more flexible optimizations. For example, transactions may execute in order but complete out of order (e.g., a write can be reported complete once it is persisted to a log without waiting for data to be flushed to storage).

By composing these primitives into different transaction service modes (such as ROI, ROT, ROL), UB empowers upper-layer applications to make the wisest, fine-grained choices between performance and consistency according to their own business logic. For distributed databases that require strong consistency, SO and Fence may be used more; whereas for large-scale AI training, the vast majority of gradient updates can use RO or even NO, thereby pushing system throughput to the extreme. This design philosophy is a systematic answer to the “ordering” problem in distributed systems, from theoretical reflection to engineering practice.

Load/Store and Read/Write: Two worldviews of memory access

In the design philosophy of the Unified Bus, access to remote memory is not limited to a single approach, but provides two core, complementary programming paradigms: one is Load/Store, which is deeply integrated with the processor instruction set, and the other is the more flexible, software-defined Read/Write. These two paradigms represent two different “worldviews”, and behind them are deep trade-offs across multiple dimensions such as programming model, performance, consistency, and degree of hardware coupling.

Two paradigms: synchronous Load/Store and asynchronous Read/Write

To understand the fundamental differences between these two paradigms, we first return to a basic question: how exactly does a remote memory access take place at the application and hardware levels?

What is Load/Store?

When discussing any system semantics, we must first clarify the level at which it is implemented; otherwise it is easy to talk past each other. The core of the Load/Store semantics lies in whether it is directly supported by processor hardware instructions.

  • In the classic Turing machine model, after a Load instruction completes, the next instruction can begin.
  • In modern processors’ out-of-order execution models, after a Load instruction is issued, unrelated subsequent instructions may continue executing, but any instruction that depends on that Load’s result will be automatically stalled by hardware until the data returns.
  • In some specialized processors (such as NPUs), a single Load instruction can even move a non-contiguous, large block of data (such as a slice of a Tensor).

All of these fall under Load/Store semantics because they are directly initiated and managed by hardware instructions. In contrast, memory accesses initiated by software and encapsulated by the runtime (for example, the software constructs a work request, sends it to the NIC via a driver, and then polls a completion queue) are generally not considered Load/Store semantics, even if they ultimately implement a remote read of data.

Synchronous vs. Asynchronous

Load/Store is essentially a synchronous memory access model, while traditional RDMA Read/Write is the epitome of an asynchronous model.

A typical asynchronous RDMA write operation is cumbersome and lengthy:

  1. Software constructs a Work Queue Element (WQE) in memory.
  2. Software notifies the NIC of a new task by “ringing the doorbell” (Doorbell).
  3. The NIC reads the WQE from memory into its on-chip memory.
  4. The NIC DMAs user data from host memory to the NIC according to the information in the WQE.
  5. The NIC packages the data into network packets and sends them to the remote side.
  6. The remote NIC receives the packets and writes the data into the target memory.
  7. The remote NIC returns an acknowledgment message.
  8. The initiating NIC receives the acknowledgment and writes a Completion Queue Element (CQE) into memory.
  9. Software needs to actively poll the CQE to finally confirm the operation has completed.

The entire process involves multiple DMAs and complex software-hardware interactions. In UB, Read/Write borrows techniques such as NVIDIA BlueFlame, allowing a small amount of information to be carried when “ringing the doorbell”, writing the WQE directly into the NIC’s device address space, thereby removing steps 1 and 3 and saving two DMAs, but its asynchronous nature does not change.
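
For readers who have never seen this flow in code, here is roughly what the asynchronous path looks like with the RDMA verbs API, assuming the QP, CQ, and memory registration were set up during connection establishment and the remote address and rkey were exchanged out of band:

```c
#include <infiniband/verbs.h>
#include <stdint.h>
#include <string.h>

/* Asynchronous RDMA write with the verbs API. Assumes `qp`, `cq`, and
 * `mr` (a registered local buffer) were created during setup, and that
 * `remote_addr`/`rkey` were exchanged out of band. */
int rdma_write_async(struct ibv_qp *qp, struct ibv_cq *cq, struct ibv_mr *mr,
                     void *local_buf, uint32_t len,
                     uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)local_buf,
        .length = len,
        .lkey   = mr->lkey,
    };

    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id               = 1;
    wr.sg_list             = &sge;
    wr.num_sge             = 1;
    wr.opcode              = IBV_WR_RDMA_WRITE;
    wr.send_flags          = IBV_SEND_SIGNALED;   /* ask for a CQE (steps 8-9) */
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    /* Steps 1-2: build the work request and ring the doorbell. */
    if (ibv_post_send(qp, &wr, &bad_wr))
        return -1;

    /* Step 9: software polls the completion queue for the CQE. */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                                         /* busy-wait */
    return wc.status == IBV_WC_SUCCESS ? 0 : -1;
}
```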

By contrast, a synchronous Load/Store operation is extremely simplified:

  1. The application executes a Load or Store instruction.
  2. The CPU’s network module (or a tightly integrated coprocessor) directly converts that instruction into network packets.
  3. The remote network module completes the memory read/write and returns the result (or acknowledgment) over the network.
  4. The Initiator CPU’s instruction completes, and the pipeline continues executing.

Modern CPU pipeline mechanisms can effectively hide part of the latency of synchronous accesses. Although an overly long remote Load may stall parts of the pipeline and reduce parallelism, its end-to-end latency and software overhead are far lower than the asynchronous model, making it especially suitable for small, latency-sensitive accesses.

Summary of pros and cons

Synchronous remote memory access (Load/Store)

  • Advantages: Simple process with extremely low latency; transparent to applications and can be used directly for memory expansion; high efficiency for small data accesses; can support hardware cache coherence.
  • Disadvantages: High hardware requirements (requires tight CPU integration); small data per access (typically cache-line granularity); large reliability “blast radius” (a node failure may drag down all other nodes accessing that node’s memory); high overhead for large-scale cache coherence.

Asynchronous remote memory access (Read/Write)

  • Advantages: Flexible specification of data size with high throughput for large transfers; low hardware requirements and good decoupling; exceptions can be captured by software with good fault isolation.
  • Disadvantages: Complex process with higher latency; not transparent to applications, requires explicit programming; no hardware cache coherence, must be ensured in software.

Precisely because both models have their strengths, the UB protocol chooses to provide both, allowing developers to choose as needed based on the scenario.

Remote memory addressing

Having provided two programming paradigms, the next key question is: how to “name” and “locate” remote memory—i.e., addressing? UB’s memory management mechanism is built around how to efficiently support these two modes, Load/Store and Read/Write.

In UB’s world, the basic unit of memory management is the Memory Segment. A memory segment is a contiguous region of physical memory. When a node (Home) wishes to share part of its memory with other nodes (Initiators), it creates a memory segment and assigns it a unique identifier (TokenID). For an Initiator to access this remote memory, it must first obtain from the Home the information about this segment, including TokenID, base address (UBA), and size (Len).
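
Conceptually, the information the Initiator obtains from the Home can be pictured as a small descriptor like the one below; the field layout is my own illustration, while the real wire format is defined by the UB spec.

```c
#include <stdint.h>

/* Illustrative descriptor for a shared memory segment, as the Home
 * might hand it to an Initiator. */
struct ub_segment_desc {
    uint32_t token_id;   /* TokenID: unique identifier of the segment */
    uint64_t uba;        /* UBA: base address of the segment          */
    uint64_t len;        /* Len: size of the segment in bytes         */
    uint32_t perm;       /* e.g., read-only vs. read-write mapping    */
};
```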

After obtaining this information, the Initiator faces a key choice: how to “translate” this remote memory address into one it can understand and use? Industry and academia have explored multiple paths, each with its pros and cons:

  1. Unified physical memory addressing (Globally Shared Physical Memory)

This approach is the simplest and most straightforward, commonly seen in traditional tightly coupled HPC systems. Across the entire system, all nodes’ physical memories are mapped into a single globally unified physical address space. When any node accesses an address, hardware can directly resolve which node and which piece of physical memory it belongs to. The advantage is simple hardware implementation. But its fatal drawback is extremely poor scalability. As the number of nodes grows, maintaining a globally consistent view of physical addresses becomes exceedingly difficult and costly.

  2. Network address + remote virtual address (Network Address + Remote VA)

This is a more flexible and scalable scheme. Accessing a remote memory location requires a “pair”: the target node’s network address and the virtual address of that memory on the target node. This decouples address spaces, with each node maintaining its own address space, yielding excellent scalability. The read and write transactions in the UB protocol support this mode of access.

However, this network address + virtual address pair is typically very long (for example, 128 bits), and cannot be directly used as a memory address by existing CPU instructions. To initiate a remote read/write, the CPU must execute special instructions that package this long address together with the operation type into a request and hand it to the NIC hardware. Another important characteristic of this approach is that it is essentially a non-cacheable (Non-cacheable) access mode. Each read/write goes straight to the peer’s memory, with no data cached locally. The upside is a simple model with no cache-coherence issues, because there is no cache. But the downside is also obvious: every access must pay the full network latency.

  3. Mapping to local virtual address (Mapping to Local VA)

This is the core mechanism provided by the UB protocol in pursuit of extreme performance, supported by the load and store instructions. In this mode, the Initiator maps the obtained remote memory segment into its own process’s virtual address space via the local memory management unit (UBMMU). Once mapped, the CPU can access this remote memory using standard load and store instructions just like local memory.

The performance advantage of this approach is enormous because it “localizes” remote memory, allowing the CPU pipeline to handle remote accesses seamlessly without special instructions or software overhead. More importantly, this mode naturally supports caching (Cacheable). When the CPU executes a load on a remote address, the data can be cached in the local cache, and subsequent accesses can complete with extremely low latency.
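
A sketch of what this looks like from the programmer’s side, assuming a hypothetical ub_map_segment() wrapper around the UBMMU setup: after the mapping, the remote segment is just a pointer, and repeated reads can be served from the local cache.

```c
/* Sketch of the "map to local VA" path. ub_map_segment() is a hypothetical
 * wrapper around the UBMMU setup described in the text; once mapped, the
 * remote segment is accessed through ordinary loads. */
#include <stdint.h>
#include <stddef.h>

extern void *ub_map_segment(uint32_t token_id, uint64_t uba, size_t len,
                            int writable);   /* hypothetical UBMMU mapping call */

double sum_remote(uint32_t token, uint64_t uba, size_t n) {
    /* Map the remote segment read-only; read-only mappings keep coherence simple. */
    const double *v = ub_map_segment(token, uba, n * sizeof(double), 0);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += v[i];    /* plain loads; repeated accesses can hit the local cache */
    return s;
}
```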

Of course, introducing caching brings new challenges: how to ensure cache coherence? The UB protocol designs a delicate cache-coherence mechanism for this. By setting different permissions (such as read-only or read-write) when mapping memory segments, the system leaves ample design space for subsequent cache-coherence management. For example, a memory segment mapped by multiple nodes in read-only mode has relatively simple cache management; once writes are allowed, more complex mechanisms are needed to ensure data consistency.

In summary, UB’s memory management provides a layered and flexible solution. It offers a simple read/write mode that requires no consideration of coherence, and also a high-performance load/store mode deeply integrated with the CPU instruction set and supporting caching, allowing applications to make the most appropriate choices among ease of use, performance, and consistency according to their needs.

Cache coherence

When multiple nodes can map the same remote memory segment locally and cache it, cache coherence shifts from an “optional” to a “must-answer” question. If not managed properly, data replicas in different nodes’ caches may diverge, causing programs to read stale data and leading to catastrophic consequences. As a system intended to provide memory-level semantics, the UB protocol must offer a clear and reliable cache-coherence solution.

Designing a cache-coherence protocol is essentially a trade-off among performance, complexity, and the strength of coherence. Industry has developed various models with different strengths and implementation approaches. Below we discuss several typical schemes and analyze them in light of UB’s design philosophy:

  1. Any Node, Strong Consistency (Dynamic Sharer List)

This is the most idealized model: it allows multiple nodes to cache a copy of the data simultaneously and guarantees that the access experience is exactly the same as accessing memory on a local multi-core processor (i.e., strong consistency). When one node modifies the data, the protocol must, via something like “broadcast” or “multicast,” either invalidate the cached replicas on all other nodes (Invalidate), or propagate the update to them (Update). The key challenge of this model is to maintain a dynamic sharer list (Sharer List). Because any node may join or leave at any time, the size and membership of this list are not fixed. Having hardware efficiently manage such a dynamic list is highly complex and poses scalability challenges.

  2. Multiple Readers, Single Writer

This is the most common and classic consistency model in practice, and it underpins modern multi-core CPU protocols such as MSI/MESI/MOESI. It stipulates that at any moment, a piece of data may be held as read-only caches by any number of nodes, but at most one node may hold write permission. When a node wants to write, it must first obtain the unique write permission, and the prerequisite for obtaining that permission is that all other read-only cached replicas in the network must be invalidated. The Ownership concept and the transitions among the Invalid/Read/Write states described in the UB protocol document embody this idea. This model is relatively simple to implement, performs very well in read-mostly scenarios, and strikes a solid balance between performance and complexity.
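
The following toy model (not the UB protocol’s actual state encoding) captures the invariant: any number of read-only copies, at most one writer, and a write may proceed only after every other copy has been invalidated.

```c
/* Toy model of the "multiple readers, single writer" rule. States and helpers
 * are illustrative; writeback of dirty data is omitted for brevity. */
typedef enum { INVALID, READ_SHARED, WRITE_OWNED } line_state_t;

#define MAX_NODES 8

typedef struct {
    line_state_t state[MAX_NODES];   /* per-node cached state for one line */
} line_t;

/* Grant write ownership to `writer`: all other copies must be invalidated first. */
void acquire_write(line_t *l, int writer) {
    for (int n = 0; n < MAX_NODES; n++)
        if (n != writer)
            l->state[n] = INVALID;      /* invalidate remote read-only replicas */
    l->state[writer] = WRITE_OWNED;     /* at most one writer at any moment */
}

/* Grant a read-only copy: a current writer must first give up exclusivity. */
void acquire_read(line_t *l, int reader) {
    for (int n = 0; n < MAX_NODES; n++)
        if (l->state[n] == WRITE_OWNED)
            l->state[n] = READ_SHARED;  /* demote the writer; the line is now shared */
    l->state[reader] = READ_SHARED;
}
```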

  3. Exclusive Access/Ownership Migration

This is a special case of the “multiple readers, single writer” model. In this mode, at any time only a single node is allowed to access (cache) a memory region. When a node (Borrower) needs access, it “borrows” ownership from the original holder (Home) and becomes the new Owner. During this period, the original Home node temporarily loses access to that memory region until the Borrower “returns” it. This model is the simplest to implement because it completely avoids consistency issues among multiple replicas. It suits memory-borrowing scenarios in which the Home node “rents out” idle memory to form a large memory pool, and other nodes short on memory obtain memory from the pool to make up for local shortages.

  4. Limited Nodes, Strong Consistency (Fixed Sharer List)

This is a simplification and compromise of the first ideal model. It also provides strong consistency but limits the number of nodes that can share a piece of data simultaneously. Because the number of sharers is capped, hardware can maintain a fixed-size list, greatly reducing design complexity. However, this model is not very practical because it introduces an unnatural constraint at the application layer and struggles to accommodate general-purpose and dynamically changing computational needs.

  5. Software-Managed Consistency (Explicit Cache Management)

This approach delegates part or all of the responsibility for maintaining consistency to software. Hardware still provides caching to accelerate access, but it does not automatically guarantee consistency among caches on different nodes. When an application needs to ensure it reads the latest data, software must explicitly perform a refresh cache (or invalidate) operation to proactively discard potentially stale local caches. When an application modifies data and wants it to be visible to other nodes, it must also explicitly perform a write back (or flush) operation to write the local cache back to main memory. This model gives software maximum flexibility, but it demands a lot from programmers and is error-prone.
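
A sketch of what this looks like in code, with ub_cache_invalidate() and ub_cache_writeback() as hypothetical stand-ins for whatever maintenance primitives a platform exposes: the reader invalidates before reading, the writer writes back after writing, and forgetting either call silently breaks correctness.

```c
/* Sketch of software-managed coherence: the application brackets its accesses
 * with explicit cache maintenance. The two maintenance calls are hypothetical
 * stand-ins, not a real API. */
#include <stddef.h>
#include <stdint.h>

extern void ub_cache_invalidate(const void *addr, size_t len); /* drop stale local copies */
extern void ub_cache_writeback(const void *addr, size_t len);  /* push dirty lines to the Home */

void consume_latest(const uint64_t *shared, size_t n, uint64_t *out) {
    ub_cache_invalidate(shared, n * sizeof *shared);  /* must not read a stale cache */
    for (size_t i = 0; i < n; i++)
        out[i] = shared[i];
}

void publish_update(uint64_t *shared, size_t n, uint64_t value) {
    for (size_t i = 0; i < n; i++)
        shared[i] = value;
    ub_cache_writeback(shared, n * sizeof *shared);   /* make the write visible to others */
}
```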

  6. Non-cacheable Mode

This is the simplest and most straightforward “consistency” approach: without caches, there is naturally no need to consider consistency. As mentioned earlier, UB’s read/write transactions fall into this mode. Each access goes directly to the target main memory, ensuring that what is read is always the latest data. The cost is that applications must implement data movement themselves—moving data from the Home node to local memory—to enjoy the access efficiency benefits of caching.

The design of the UB protocol is the result of seeking an optimal balance among the above possibilities. With the “multiple readers, single writer” ownership model at its core, it provides strong hardware consistency guarantees for load/store cache accesses, while retaining the non-cacheable read/write path as a complement, enabling different applications to find the best balance of consistency and performance for their needs.

The Killer App for Memory Pools: KV Cache

When we first conceived a UB-based memory pool five years ago, we had a powerful solution in hand but were searching for a truly matching problem. We envisioned aggregating the memory of thousands of servers via UB to build an unprecedented, unified, massive memory pool, where any data in the pool could be accessed with ultra-low latency close to local memory. That was exciting technically, but a practical question lingered: “What kind of application actually needs such a vast, ultra-high-performance shared memory pool?”

The explosion of LLM inference services brought the KV Cache challenge. When generating text, LLMs must cache a huge amount of intermediate state (the KV Cache), often tens to hundreds of GB—far beyond a single GPU’s VRAM. More critically, this data must be accessed at high frequency during the generation of every token, making it extremely sensitive to latency and bandwidth. Suddenly, everything we envisioned five years ago—massive capacity, ultra-low latency, and efficient sharing—found a perfect fit in the KV Cache problem.

1. Prefill-Decode Separation

An LLM handles a request in two phases:

  • Prefill phase: Ingests the user’s prompt, conversation history, or agent tool-call trace, and computes the KV Cache for all tokens in parallel. This is compute-bound.
  • Decode phase: Generates new tokens one by one. For each token, it must read the full KV Cache (including the prompt and all previously generated tokens). This is memory-bandwidth-bound.

Because these phases have very different computational characteristics, large-scale LLM inference frameworks widely adopt the Prefill-Decode (PD) separation scheduling strategy. The system aggregates many Prefill requests into a large batch for computation, while aggregating Decode requests into another batch. This separation can significantly improve GPU utilization and overall system throughput.

2. Prefix KV Cache

In many application scenarios, different user requests often contain the same “prefix.” For example, in multi-turn conversations and agent workflows, subsequent requests fully include the prior conversation or tool-call history.

Recomputing the KV Cache for the same prefixes is a huge waste. The Prefix Caching technique emerged to address this. Its core idea is to store computed prefix KV Caches in a global memory pool. When a new request arrives, the system checks whether its prefix matches a cached entry. If so, it retrieves the shared KV Cache from the pool and continues computing from the end of the prefix. This dramatically reduces Time to First Token (TTFT) and saves substantial compute.

This memory-pool-based Prefix Caching mechanism is essentially a cross-request reuse of computation results. The global memory pool and low-latency memory borrowing advocated by the UB protocol provide an ideal hardware foundation for building an efficient, cross-server global KV Cache pool.
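
As a toy illustration of the lookup step (real systems typically use radix trees or hashed block tables rather than the flat array assumed here), prefix matching boils down to finding the longest cached token prefix that matches the head of the new request and reusing its KV blocks:

```c
/* Toy sketch of prefix reuse: find the longest cached prefix whose token IDs
 * match the head of the new request, so prefill only starts from the first
 * unmatched token. The flat array of entries is illustrative. */
#include <stddef.h>
#include <stdint.h>

typedef struct {
    const int32_t *tokens;    /* token IDs of the cached prefix */
    size_t         len;       /* number of tokens in the prefix */
    uint64_t       kv_handle; /* handle of the KV blocks in the global pool */
} prefix_entry_t;

/* Returns the matched length; *handle_out is 0 if nothing matched. */
size_t match_prefix(const prefix_entry_t *cache, size_t n_entries,
                    const int32_t *req, size_t req_len, uint64_t *handle_out) {
    size_t best = 0;
    *handle_out = 0;
    for (size_t e = 0; e < n_entries; e++) {
        size_t m = 0, limit = cache[e].len < req_len ? cache[e].len : req_len;
        while (m < limit && cache[e].tokens[m] == req[m]) m++;
        if (m == cache[e].len && m > best) {   /* the whole cached prefix matches */
            best = m;
            *handle_out = cache[e].kv_handle;
        }
    }
    return best;   /* prefill resumes from token index `best` */
}
```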

At a deeper level, the success of KV Cache may be one of the most fundamental contributions from computer systems to AI. The attention mechanism in Transformer models can be viewed as a novel, differentiable “key-value store.” In this paradigm, the query vector (Query) is the “key” we’re looking up, while every token in the context provides its own “key” (Key) and “value” (Value). Unlike traditional systems that perform exact, discrete matching via a hash table map[key] -> value, attention performs a fuzzy, continuous “soft” matching (the “soft” in softmax). It computes similarity (attention scores) between the current Query and all Keys, then uses those scores to take a weighted sum over all Values—effectively “reading” the entire database at once, weighted by relevance.
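
Written out, this “soft” lookup is the standard scaled dot-product attention, where $d_k$ is the key dimension:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V$$

Each query is compared against every key, and the softmax weights determine how much of each value flows into the result.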

Summary: Load/Store and Read/Write

In summary, the Load/Store and Read/Write provided by UB are by no means redundant; they are two abstractions required by different scenarios.

  • Load/Store delivers ultra-low latency and programming transparency, seamlessly integrating remote memory into the processor’s native instruction set—ideal for building high-performance, fine-grained shared-memory applications. It does, however, add hardware implementation complexity.
  • Read/Write offers a more traditional and flexible asynchronous access model that decouples hardware and software, simplifies the consistency model, and is better suited for bulk data movement and latency-insensitive scenarios.

URMA: Unified Remote Memory Access

At this point, we’ve explored many key decisions behind UB’s design: from the “peer” architecture that breaks CPU bottlenecks, to the “Jetty connectionless model” that addresses large-scale scalability, and to “weak transaction ordering” and “Load/Store semantics” for performance optimization.

URMA (Unified Remote Memory Access) is the top-level concept that unifies all these design philosophies, proposed by Dr. Kun Tan, director of the Distributed and Parallel Software Lab. URMA is precisely the unified programming abstraction and set of core semantics that the UB protocol offers to upper-layer applications.

URMA was born from deep insight into future computing patterns. In future data centers and high-performance computing clusters, heterogeneous compute units such as CPUs, GPUs, and NPUs will collaborate as peers to handle complex tasks. To fully unleash the potential of such heterogeneous compute, the underlying communication protocol must meet several stringent requirements:

  1. Direct communication among heterogeneous compute: Different types of compute units must be able to communicate directly as peers, bypassing traditional master-slave bottlenecks, thereby enabling efficient parallelism and collaboration on fine-grained tasks.
  2. Extreme scalability: The protocol must efficiently support communication from within a single node to ultra-large clusters, easily handling interconnection across tens of thousands of nodes.
  3. Maximized network efficiency: The protocol needs built-in flexible routing and transport mechanisms—such as multipathing and out-of-order delivery—to fully utilize expensive data center network bandwidth and safeguard real-time performance.

URMA is precisely the purpose-built answer to these three demands. It aims to efficiently and reliably enable communication between any two UB entities (UB Entity), whether it is unilateral memory access (DMA) or bilateral message send/receive. The key features we discussed earlier are ultimately reflected in URMA’s design:

  1. Peer-to-Peer Access: This is the cornerstone of URMA. Any heterogeneous compute device can achieve direct communication via URMA without CPU involvement, echoing our initial vision of a “peer architecture”.
  2. Inherently Connectionless: Through the Jetty abstraction, URMA allows applications to directly reuse the reliable services provided by the UB transport layer without establishing and maintaining end-to-end connection state. This fundamentally solves the scalability issues of traditional RDMA and is key to supporting ultra-large-scale deployments.
  3. Weak Ordering: URMA allows applications to configure ordering behavior according to their needs. By default it permits out-of-order execution and out-of-order completion, which not only avoids the head-of-line blocking problem, but also unleashes the underlying hardware’s potential for multipath transmission and parallel processing, thereby significantly improving end-to-end efficiency.

In short, URMA is not just an API or protocol specification; it represents a forward-looking, unified communication paradigm for heterogeneous computing. It fuses the high performance of a bus with the flexibility of a network, and through a set of innovative designs, hides the complexity of the underlying hardware from applications, ultimately providing a simple, efficient, and highly scalable programming interface. This is the ultimate meaning behind the name Unified Bus “统一总线”.

Of course, like any paradigm shift, URMA’s unified abstraction is not without cost. It pushes part of the complexity previously handled by operating systems and software (e.g., memory management and coherence) down into hardware, creating major hardware design challenges. At the same time, it exposes more options and responsibilities to upper-layer applications, such as choices around ordering and load-balancing strategies. But the success of a new paradigm does not lie in solving every old problem without introducing any new ones; it lies in addressing the core contradiction of its era, which for UB is the tension between performance and scale.

Congestion Control

The history of congestion control is, in essence, a chronicle of our struggle with the “queue” demon.

Initially, the designers of TCP/IP viewed the network as “best effort”, with packet loss as the signal of congestion. Therefore, TCP adopted the classic “Additive Increase, Multiplicative Decrease” (AIMD) algorithm: slowly increase the send window when no loss occurs; once loss happens, cut the window in half aggressively. This model succeeded on the wide-area Internet, but fundamentally it is an “after-the-fact” control. It relies on queues filling and eventually dropping packets to sense congestion, which inevitably leads to two problems: 1) high latency, i.e., “Bufferbloat”; 2) severe oscillations in network utilization.

To address this, people proposed Active Queue Management (AQM), represented by RED (Random Early Detection). The idea of RED is not to wait until the queue is full to drop packets, but to probabilistically drop packets at random when the queue length starts to increase, proactively signaling congestion to the sender. This alleviates Bufferbloat to some extent, but “packet loss” as a congestion signal is still too crude.

A gentler approach is Explicit Congestion Notification (ECN). ECN allows routers, upon detecting queue buildup, to mark the packet header instead of dropping the packet. When the sender receives this mark, it knows congestion has occurred and proactively reduces its sending rate. ECN avoids unnecessary loss and retransmissions and is a standard feature of modern congestion control.

However, in data centers that pursue extreme low latency and high throughput, these TCP-based congestion control schemes are still not fine-grained enough. Especially for RDMA, which is connectionless and extremely loss-sensitive, faster and more precise control is needed. RoCEv2 networks therefore designed DCQCN (Data Center Quantized Congestion Notification). DCQCN combines ECN with rate-based control; when the switch marks congestion, the NIC quickly reduces the sending rate by certain steps, achieving faster convergence and lower queue occupancy.

UB’s C-AQM (Confined Active Queue Management) pushes this fine-grained control even further. DCQCN is still a passive “congest first, then slow down” mode, whereas C-AQM’s core idea is “allocate on demand, proactively grant”, targeting “near-zero queues”. The biggest advantage behind this is Huawei’s “end–network co-design” capability as an end-to-end (NIC + switch) network equipment provider.

The working mechanism of C-AQM embodies the finesse of this coordination:

  1. Sender (UB Controller) actively requests: When sending data, the sender can set the I (Increase) bit to 1 in the packet header to request more bandwidth from the network.
  2. Switch (UB Switch) precise feedback: When the switch receives this request, it evaluates the congestion status of its egress port. If it believes increasing bandwidth would cause congestion, it not only sets the C (Congestion) bit in the header to reject the request, but also provides a suggested bandwidth value in the Hint field. This Hint value is computed by the switch based on its precise queue and bandwidth utilization; it tells the sender: “You can’t increase further; you should adjust your rate to this suggested value.”
  3. Sender fast response: Upon receiving this feedback containing the precise Hint value, the sender can immediately adjust its sending rate to the level suggested by the switch.

This process is like an intelligent traffic control system. A driver (the sender) wants to accelerate and first uses a signal light (the I bit) to ask the traffic cop ahead (the switch). Based on real-time traffic at the intersection, the cop not only turns on a red light (the C bit) to say “no,” but also uses a radio (Hint field) to directly tell the driver “keep 30 km/h.”

Through this “request–precise feedback” closed loop, C-AQM precisely matches the sender’s rate to the service the network can provide, so packets “arrive and depart,” and switch queues remain at an extremely low, near-zero level. This not only eliminates the high latency caused by Bufferbloat, but also maximizes effective network throughput. This near-zero-queueing design philosophy is one of the key cornerstones enabling UB’s microsecond-level end-to-end latency.
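
A minimal sketch of the sender side of this loop: the I, C, and Hint fields come from the text, while the structures and the rate arithmetic are assumptions made for illustration. The sender requests an increase only when it actually needs more bandwidth, and on a rejection it jumps directly to the rate the switch suggested.

```c
/* Sketch of the sender side of the C-AQM request/feedback loop. Field and
 * function names are illustrative, not the UB wire format. */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    bool     cong_bit;    /* C: the switch rejected the increase request */
    uint32_t hint_rate;   /* suggested rate (e.g. in Mbps); 0 if not provided */
} caqm_feedback_t;

typedef struct {
    uint32_t rate;        /* current sending rate */
    uint32_t step;        /* additive increase step used when the network allows it */
} caqm_sender_t;

/* Decide whether to set I=1 on the next packet. */
bool caqm_want_increase(const caqm_sender_t *s, uint32_t demand) {
    return s->rate < demand;             /* ask for more only when we need more */
}

/* Adjust the rate when feedback arrives. */
void caqm_on_feedback(caqm_sender_t *s, const caqm_feedback_t *fb) {
    if (fb->cong_bit && fb->hint_rate)
        s->rate = fb->hint_rate;         /* jump straight to the switch's suggestion */
    else if (!fb->cong_bit)
        s->rate += s->step;              /* request granted: ramp up gradually */
}
```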

Reliable Transport

When building any reliable network protocol, handling packet loss is a central topic. The textbook TCP/IP playbook—“slow start, congestion avoidance, fast retransmit, fast recovery”—is well known. However, in data center networks, to maximize bandwidth utilization, Equal-Cost Multi-Path routing (ECMP) is widely used. Traffic is spread across multiple physical paths, which inevitably makes packet arrival order differ from send order. A fundamental tension emerges: How can we accurately identify packet loss in a network full of “ordered disorder”?

Traditional fast retransmit relies on the core rule that “three duplicate ACKs imply a lost packet.” This assumption is reasonable on a single path, but under ECMP such “reordering” easily fools the mechanism, causing many cases of Spurious Retransmission. The network didn’t actually drop packets—some packets just took a shortcut and arrived first—yet the protocol mistakes it for loss and aggressively triggers unnecessary retransmissions, wasting precious bandwidth and potentially even causing real congestion due to the extra traffic.

UB’s Load-Balancing Mechanisms

Unlike the “leave it to hash and hope” load balancing of traditional ECMP, UB gives the choice to the application, allowing it to make informed trade-offs between performance and ordering at different granularities of load balancing. UB supports load balancing at two levels:

1. Transaction-level load balancing: a TPG-based “convoy” mode

UB introduces the concept of a TPG (Transport Protocol Group). Think of a TPG as a logistics company that transports batches of “cargo” (i.e., transactions) from point A to point B. To increase capacity, the company can use multiple highways simultaneously, each highway being a TP Channel.

When a transaction (e.g., a large RDMA Write) needs to be sent, the TPG chooses one TP Channel for it. Once chosen, all packets (TP Packets) of this transaction are carried on that fixed “highway.” This is like a large convoy, with all vehicles staying in the same lane.

The big advantage of this transaction-level load balancing is its simplicity and orderliness. Because all packets of the same transaction follow the same path, they naturally arrive in order, fundamentally avoiding packet-level reordering. This lets the upper-layer reliable transport confidently use the most efficient fast retransmit mechanisms, since any “duplicate ACK” signal is very likely to indicate a real loss rather than reordering. It is a safe, stable, and easy-to-manage parallelization strategy suitable for most scenarios requiring reliable transport.

2. Packet-level load balancing: a “racing” mode for ultimate performance

For applications pursuing maximal network utilization and lowest latency, UB offers a more aggressive packet-level load balancing mechanism. In this mode, the system allows packets of the same transaction to be “sharded” across multiple TP Channels, and even to be dynamically steered onto different physical paths at the switch by modifying the LBF field in the packet header.

It’s like a road race: to reach the finish line as fast as possible, each race car (TP Packet) can freely choose the least congested lane, dynamically overtaking and changing lanes.

This mode can best “fill” all available bandwidth in the network, achieving unmatched throughput. However, it inevitably introduces a “side effect”: reordering. Later packets may arrive before earlier ones by choosing a faster path.
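
The difference between the two granularities can be sketched as two channel-selection policies; the channel count and load counters below are illustrative, not part of the protocol.

```c
/* Sketch of the two load-balancing granularities: per-transaction pinning
 * (the "convoy") versus per-packet selection (the "race"). */
#include <stdint.h>

#define N_CHANNELS 4

static uint32_t channel_load[N_CHANNELS];   /* e.g. in-flight bytes per TP Channel,
                                               updated on send/completion (not shown) */

/* Transaction-level: hash the transaction ID once; all its packets follow. */
int choose_channel_per_txn(uint64_t txn_id) {
    return (int)(txn_id % N_CHANNELS);       /* stable choice => in-order arrival */
}

/* Packet-level: each packet independently picks the emptiest channel. */
int choose_channel_per_pkt(void) {
    int best = 0;
    for (int c = 1; c < N_CHANNELS; c++)
        if (channel_load[c] < channel_load[best])
            best = c;                        /* highest utilization, but reordering */
    return best;
}
```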

Loss retransmission under multipath

Under multipath transmission, determining “packet loss” must be more cautious and intelligent. A one-size-fits-all approach won’t work. Therefore, we did not design a universal retransmission algorithm, but instead handed the decision back to users along two dimensions:

  1. Retransmission scope: “collective punishment for one mistake,” or a “precision strike”?

    • GoBackN (GBN): A simple and classic strategy. Once loss is detected, the sender retransmits starting from the lost packet, including all subsequent packets that were already sent. It is easy to implement and the receiver keeps minimal state, but in high-latency, high-bandwidth, high-loss networks it may retransmit many packets that actually arrived correctly, making it inefficient.
    • Selective Retransmission: A finer-grained strategy. The sender only retransmits packets confirmed as lost. The receiver must maintain more complex state (e.g., a bitmask) to tell the sender which packets were received and which were not. This approach is the most efficient but also more complex to implement.
  2. Trigger mechanism: “in a blazing hurry,” or “think twice before acting”?

    • Fast Retransmit: Similar to TCP’s mechanism, it rapidly triggers retransmission via redundant acknowledgments (e.g., error responses in UB) without waiting for a full timeout period. Its advantage is fast response, significantly reducing latency when packets are lost. But as noted earlier, it is very sensitive to reordering.
    • Timeout Retransmit: The most conservative and reliable mechanism. The sender starts a timer for each packet sent; if no acknowledgment is received within the preset time (RTO), it deems the packet lost and retransmits. Its advantage is that it ultimately covers all loss scenarios and is unaffected by reordering. The downside is that RTO is typically calculated conservatively, so waiting for a timeout adds significant latency.

By combining strategies along these two dimensions, UB offers four retransmission modes to fit different network scenarios:

| Retransmission algorithm | Applicable scenarios | Network packet loss rate | Design rationale |
| --- | --- | --- | --- |
| GoBackN + Fast Retransmit | Single-path transmission, such as per-flow load balancing | Very low | The most classic and efficient mode. When the path is stable and reordering risk is extremely low, we should fix the very rare losses as quickly as possible. |
| GoBackN + No Fast Retransmit | Multi-path transmission, such as per-packet load balancing | Very low | When we know the network itself introduces substantial reordering (e.g., ECMP), we must disable the reordering-sensitive fast retransmit and rely entirely on timeout to ensure reliability, avoiding a spurious retransmit storm. |
| Selective Retransmission + Fast Retransmit | Single-path transmission, such as per-flow load balancing | Low | In a stable single-path network, if packet loss becomes non-negligible, selective retransmission provides a clear efficiency gain over GBN by avoiding unnecessary retransmissions. |
| Selective Retransmission + No Fast Retransmit | Multi-path transmission, such as per-packet load balancing | Low | The most complex yet most adaptable mode. It offers the most efficient and robust reliable transport for complex networks that have both multi-path reordering and a certain level of packet loss. |
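
For a feel of what selective retransmission asks of the receiver, here is a minimal sketch of the PSN bitmap it might keep; window handling is simplified and the structure is illustrative, not UB’s actual format.

```c
/* Sketch of receiver-side state for selective retransmission: a bitmap over a
 * window of PSNs records which packets arrived, so the sender can be told
 * exactly which ones to resend. */
#include <stdint.h>
#include <stdbool.h>

#define WINDOW 64                     /* illustrative window size, in packets */

typedef struct {
    uint32_t base_psn;                /* first PSN not yet fully received */
    uint64_t bitmap;                  /* bit i set => base_psn + i has arrived */
} rx_window_t;

void rx_mark(rx_window_t *w, uint32_t psn) {
    uint32_t off = psn - w->base_psn;
    if (off < WINDOW)
        w->bitmap |= 1ULL << off;
    /* Slide the window past any contiguous run of received packets. */
    while (w->bitmap & 1ULL) {
        w->bitmap >>= 1;
        w->base_psn++;
    }
}

/* True if `psn` is a hole: inside the window, below the highest PSN seen,
 * but not yet received. */
bool rx_is_missing(const rx_window_t *w, uint32_t psn, uint32_t highest_seen) {
    uint32_t off = psn - w->base_psn;
    return psn < highest_seen && off < WINDOW && !(w->bitmap & (1ULL << off));
}
```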

Coordination of retransmission and transaction ordering

Earlier we discussed how UB uses “weak transaction order” to break unnecessary ordering shackles and maximize the efficiency of parallel transport. A natural and profound question follows: if the system allows transactions to execute out of order, what is the relationship between a retransmitted packet of an “old” transaction and the packet(s) of a “new” transaction?

Our core design philosophy: Reliability at the transport layer and ordering at the transaction layer are orthogonally decoupled by design.

  • Mission of the transport layer: In its world there are only TP Packets and their sequence numbers (PSN). Its sole objective is, via GBN or selective retransmission and the like, to ensure that all TP Packets belonging to a transaction are eventually delivered intact and correct from the Initiator’s transport layer to the Target’s transport layer. It handles physical packet loss in the network, achieving “data must arrive.”
  • Mission of the transaction layer: In its world there are only transactions (Transaction) and their ordering tags (NO, RO, SO). Its work begins only after the transport layer confirms it has received all data for a transaction. Based on the transaction’s ordering tag, it decides when to execute this transaction that has just “collected” all its pieces. It handles business-logic dependencies, achieving “preserve order as needed.”

Let’s understand how this decoupling works through a concrete scenario:

  1. Send: The Initiator sends two transactions in sequence:

    • Transaction A (RO): split into three packets A1, A2, A3.
    • Transaction B (RO): split into two packets B1, B2.
  2. Loss: During network transmission, A2 is unfortunately lost. A1, A3, B1, and B2 all arrive at the Target successfully.

  3. Transport-layer response: The Target’s UB transport layer gets to work.

    • For transaction B, it received B1 and B2 and, checking via PSN, finds the data complete. It then delivers the reassembled transaction B up to the transaction layer, declaring its job done.
    • For transaction A, it received A1 and A3, but via PSN or a bitmap discovers A2 is missing. It quietly buffers A1 and A3, and via TPNAK or by waiting for timeout, informs the Initiator’s transport layer: “A2 is lost, please retransmit.”
  4. Transaction-layer decision: At this moment, the Target’s transaction layer starts to work as well.

    • It has received the complete transaction B delivered by the transport layer.
    • It checks B’s ordering tag, which is RO (Relaxed Order). This means B does not need to wait for any transactions sent before it.
    • Therefore, the transaction layer immediately schedules transaction B for execution, completely ignoring transaction A, which is still waiting for the retransmitted A2.
  5. Retransmission and final execution:

    • Shortly thereafter, the retransmitted A2 packet finally arrives.
    • The Target’s transport layer merges it with the previously buffered A1 and A3, reassembles complete transaction A, and delivers it to the transaction layer.
    • The transaction layer receives transaction A, checks its RO tag, and likewise executes it immediately.

In this process, the loss and retransmission of an “old” transaction (A) does not block the execution of a “new” transaction (B) at all. This is the power of weak transaction order working in concert with reliable transport.

So how would Strong Order (SO) change all this?

If in the above scenario transaction B is marked SO (Strong Order), then in step 4 “transaction-layer decision,” the situation is completely different. After the transaction layer receives complete transaction B, it checks the SO tag and realizes it must wait for all transactions before B (i.e., transaction A) to complete. Thus, even though all of B’s data is ready, the transaction layer can only keep it on standby. Only after the retransmitted A2 arrives and transaction A is successfully executed can transaction B be executed.
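
The transaction-layer decision in this scenario can be sketched as follows. The structures are illustrative; the only rule encoded is the one described above: a complete RO transaction runs immediately, while a complete SO transaction waits until everything sent before it has executed.

```c
/* Toy model of the transaction-layer scheduler: the transport layer hands up
 * only complete transactions; RO runs immediately, SO waits for its
 * predecessors. Structures are illustrative, not the UB wire format. */
#include <stdbool.h>
#include <stdint.h>

typedef enum { ORDER_RO, ORDER_SO } order_tag_t;

typedef struct {
    uint32_t    seq;        /* position in the Initiator's send order */
    order_tag_t tag;
    bool        complete;   /* transport layer has delivered all of its packets */
    bool        executed;
} txn_t;

/* Called whenever the transport layer completes a transaction (including after
 * a retransmitted packet fills a hole). Returns how many transactions ran. */
int txn_schedule(txn_t *t, int n) {
    int ran = 0;
    for (int i = 0; i < n; i++) {
        if (!t[i].complete || t[i].executed) continue;
        if (t[i].tag == ORDER_SO) {
            bool blocked = false;
            for (int j = 0; j < n; j++)                 /* SO: wait for all earlier txns */
                if (t[j].seq < t[i].seq && !t[j].executed) { blocked = true; break; }
            if (blocked) continue;
        }
        t[i].executed = true;                           /* RO: runs as soon as complete */
        ran++;
    }
    return ran;
}
```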

In summary, UB’s decoupled design achieves extreme efficiency:

  • Network-layer issues are solved at the network layer: The transport layer can aggressively use advanced techniques (such as selective retransmission) to fight loss as efficiently as possible, without worrying that its behavior will interfere with higher-layer business logic ordering.
  • Ordering at the business layer is decided by the business itself: The transaction layer is freed from network details and can focus on the application’s real needs to decide whether to wait and what strength of ordering guarantees are required, avoiding unnecessary head-of-line blocking.

This clear division of responsibility gives the system maximum parallelism and performance by default, while retaining the ability to build strict ordering for applications that need strong consistency. It is an elegant and efficient answer to the complexity of modern data center networks.

Deadlock Avoidance

In any lossless communication network, deadlock is a specter that must be confronted. When the network uses credit-based flow control or backpressure, if the dependency relationships of resources (such as buffers) form a cycle, a “circular wait” can occur where all nodes are waiting for others to release resources while holding on to their own—this is deadlock.

A typical scenario: there are four switches in the network forming a cyclic dependency. UB Switch 1’s egress buffer is full because data destined for UB Switch 2 can’t be sent out; Switch 2’s buffer is full because it is waiting on Switch 3; Switch 3 is waiting on Switch 4; and Switch 4 is waiting on Switch 1. All buffers are occupied, data cannot flow, and the entire system grinds to a halt.

To avoid this disastrous situation, the circular-wait condition must be broken by design. In UB, we use the following classic deadlock-avoidance schemes:

  1. Routing-based deadlock avoidance

This approach tackles the root cause by designing the routing algorithm so that loops cannot form. A classic example is tree-based “Up/Down Routing.” First, select a root in the network topology and construct a spanning tree. The routing rule is then constrained: a packet may first travel “up” toward the root and then “down” away from it, but once it has taken a “down” hop it is never allowed to turn back “up.” This simple restriction breaks any potential routing cycles and thereby avoids deadlock. The method is simple and efficient, but it may not fully utilize all available paths in the network, sacrificing some flexibility and performance.
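
A sketch of the rule as a legality check over a candidate route (the hop encoding is illustrative): once a route has descended, any further ascent makes it illegal, which is exactly the turn restriction that breaks cycles.

```c
/* Sketch of the Up/Down rule: a route may climb toward the root and then
 * descend, but a "down" hop followed by an "up" hop is a forbidden turn.
 * Checking every candidate route against this rule keeps the route set
 * cycle-free. */
#include <stdbool.h>
#include <stddef.h>

typedef enum { HOP_UP, HOP_DOWN } hop_dir_t;   /* relative to the spanning-tree root */

bool route_is_legal(const hop_dir_t *hops, size_t n) {
    bool gone_down = false;
    for (size_t i = 0; i < n; i++) {
        if (hops[i] == HOP_DOWN)
            gone_down = true;
        else if (gone_down)          /* an "up" hop after a "down" hop: illegal */
            return false;
    }
    return true;
}
```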

  2. Virtual Channels/Lanes (VC/VL)

This is the most common and flexible deadlock-avoidance mechanism in modern high-performance networks (e.g., InfiniBand). The core idea is to divide each physical link into multiple logically independent virtual channels (VCs), each with its own buffer resources.

Although the physical topology may contain cycles, we can design rules for using VCs to build an acyclic “VC dependency graph.” For example, we can divide VCs into different “levels” and stipulate that packets may only move from a lower-level VC to a higher-level VC. When a packet circulates along a loop, each hop must enter a higher-level VC. Since the number of VC levels is finite, a packet cannot keep moving indefinitely, which breaks circular wait. The VC mechanism decouples resource dependencies from the physical-link level to the finer-grained virtual-channel level, greatly improving routing flexibility and utilization of network resources.

  3. Timeout-based deadlock recovery

Unlike the first two “avoidance” strategies, this is a “detect and recover” strategy. The system sets a timer for each packet. If a packet stays in a buffer beyond a certain threshold, the system assumes a deadlock may have occurred. Once a deadlock is detected, it takes measures to break it—the simplest and bluntest method is to drop one or more “old” packets, freeing their buffers so other packets can proceed. This method is typically used as a complement to other deadlock-avoidance mechanisms, a last-resort safety net, because it violates the lossless property of the network.

Deadlock avoidance in memory access

In a complex system like UB, a seemingly simple memory access may hide a series of complex chain reactions behind it. A native memory operation (such as a Load instruction) may trigger secondary memory operations (such as handling page faults and address translation during memory borrowing). When cyclic dependencies in resources or process flow form between these native and secondary operations, the system may fall into deadlock.

Below are three typical memory access deadlock scenarios:

  1. Memory Pooling/Borrowing: In a peer architecture, each UBPU is both a “memory consumer” and a “memory provider.” When two nodes borrow memory from each other, a deadlock may form. For example, node A borrows memory from node B, while node B also borrows memory from node A. When both A and B need to update the peer memory via Writeback and wait for the peer’s acknowledgment (TAACK), if both acknowledgments are mutually blocked due to resource contention, a deadlock occurs.
  2. Page Table Access: When a memory access requires address translation through the UMMU, if the UMMU’s page table entries themselves are stored on remote, borrowed memory, then reading the page table entry—a secondary operation—must again initiate a remote memory access through the same port. This may contend with the original memory access and cause deadlock.
  3. Page Fault Handling: UB supports dynamic memory management, which means a memory access may trigger a page fault. Handling a page fault may require accessing external storage or fetching data from another UBPU. If the secondary operation for handling the fault forms a hardware dependency with the native operation it serves, a deadlock may result.

To address these complex deadlock scenarios, UB provides a combination of measures:

  • Request retry: Allows operations to fail under resource constraints and be retried by upper layers.
  • Virtual channel isolation: Allocates different virtual channels for different traffic types (e.g., native accesses, page table accesses, page fault handling) to break resource dependency cycles at the hardware level.
  • Transaction type differentiation: Distinguishes transaction types and applies different handling policies.

In addition, implementers can use simpler strategies, such as ensuring that page tables are always stored locally, to fundamentally avoid some deadlock scenarios.

Deadlock Avoidance in Message Communication

UB’s bilateral message communication, such as Send/Receive on Jetty, is queue-based. When queue resources are insufficient, message communication is blocked. If the message queues on different UBPUs form cyclic dependencies by sending messages to each other, deadlocks may occur.

For example, node A and node B both send a large number of Send transactions to each other. Due to many other nodes also sending requests to A and B, their receive queues (Jetty) are saturated. At this point, both A and B send “receiver not ready” (RNR) acknowledgment messages (TAACK) to each other. If the TAACK from A to B is blocked by A’s Send data stream to B, and the TAACK from B to A is also blocked by B’s Send data stream to A, then neither side can handle the other’s RNR message or free its receive queue, forming a deadlock.

To avoid such message communication deadlocks, UB provides three fundamental mechanisms:

  1. Separation of transport and transaction layers: A key decoupling design. Even if upper-layer message transaction handling is blocked by a shortage of functional resources (e.g., busy application logic), the underlying transport protocol layer can still run independently and is not blocked. This prevents a single point of application-layer congestion from spreading into wide-ranging, link-level backpressure across the network.
  2. Transaction-layer responses return resource status: When the transaction layer cannot process a request due to resource shortages, it explicitly returns the resource status (e.g., “busy”) to the initiator via a response message. Upon receiving such a response, the initiator can decide to retry or adopt other strategies, avoiding deadlock waits on network links.
  3. Timeout mechanism: Set timeouts for message communication. If an operation does not complete for a long time, the system deems it failed and releases the resources it holds. This is a final safeguard to ensure that even if a deadlock occurs, the system can recover on timeout and keep links unblocked.

URPC: Remote Procedure Call for Heterogeneous Hardware

At this point, we have built a solid foundation for the Unified Bus. URMA provides powerful, peer remote memory access capabilities. Whether deeply integrated with the CPU instruction set via Load/Store or with more flexible asynchronous Read/Write, it paves the way for data exchange between hardware units. However, this memory-semantics abstraction is still too low-level for application developers. It’s like giving developers a powerful set of assembly instructions but lacking a compiler for a high-level language.

When application developers think about business logic, the most natural model is “function calls,” not “memory reads and writes.” We need a way to encapsulate UB’s powerful memory-level communication into a higher-level, easier-to-understand-and-use abstraction. This is the original purpose of URPC (Unified Remote Procedure Call).

The Renaissance of RPC: From Software Services to Hardware Functions

Traditional RPC frameworks, such as gRPC and Thrift, have long been the bedrock of distributed software development. Their core idea is to make cross-machine function calls as simple as local calls. However, these frameworks’ design philosophy is deeply rooted in a “CPU-centric” worldview:

  1. The communication endpoints are software processes: Both the initiator and the executor of RPC run as software services on CPUs.
  2. The data path depends on the operating system: All data I/O must go through the kernel TCP/IP stack, incurring inevitable copies and context-switch overhead.
  3. Parameter passing is dominated by pass-by-value: Without shared memory, all parameters, no matter how large, must be serialized, copied, transmitted over the network, and then deserialized.

In the heterogeneous, peer computing vision depicted by UB, this traditional model struggles. What we need to call is no longer just a software function on another server, but potentially:

  • An application on the CPU needs to call a hardware-accelerated operator on an NPU.
  • A kernel on the GPU needs to call a data-preparation function on a remote memory controller.
  • A data processing unit (DPU) needs to call a network function on another DPU in the cluster.

URPC’s core mission is to provide a standard “function call” abstraction for high-performance, fine-grained, direct communication between heterogeneous hardware. It is not merely a communication protocol between software components; it is a function-layer protocol for collaboration between heterogeneous hardware units.

Pass-by-Reference: When Pointers Can Cross Machines

The most revolutionary aspect of URPC’s design is that it natively and efficiently supports pass-by-reference. This is unimaginable in the traditional RPC world, but under the UB system, it is natural.

We can achieve this because a URPC “reference” is not an ordinary virtual address meaningful only within a local process. It is a globally valid UB address. When one UBPU (e.g., a CPU) initiates a URPC call to another UBPU (e.g., a GPU) and passes a reference to a data structure, it passes a “key” that works end-to-end. The remote GPU hardware, upon receiving this address, can, without CPU intervention, directly access the initiator’s memory across the network using the underlying URMA Load/Store hardware instructions.

This capability is immensely valuable. Imagine an AI training job calling a function to process a model’s weights or dataset of hundreds of GBs, while the function only needs to read or modify a small portion of the data.

  • In traditional RPC, this implies copying hundreds of GBs of data. To reduce copies, the caller would need to extract exactly what the callee needs, causing architectural coupling between caller and callee.
  • In URPC, we only pass a reference to the entire data structure. The callee can fetch precisely and only the data it needs on demand, avoiding massive, unnecessary data movement.

Even more interesting, this design opens the door to finer performance optimizations. If the upper-level programming language or API can distinguish between a read-only reference (&T) and a mutable reference (&mut T), URPC can propagate this information all the way down to the UB hardware. Faced with a read-only reference, the hardware knows the data will not be modified and can confidently enable more aggressive caching strategies without worrying about complex cache-coherence overhead.
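
As a rough sketch of what a caller-side invocation could look like (ub_global_ref_t and urpc_call() are hypothetical, not the URPC API): the argument that crosses the network is a cluster-wide address plus access rights, never a serialized copy of the data.

```c
/* Hypothetical sketch of a URPC pass-by-reference call. The reference is a
 * cluster-wide UB address plus access rights; the callee pulls the bytes it
 * needs via URMA instead of receiving a copy. */
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t token_id;     /* identifies the exported memory segment */
    uint64_t uba;          /* UB address of the referenced object */
    uint64_t len;          /* size of the referenced region */
    uint8_t  writable;     /* 0 = read-only (&T), 1 = mutable (&mut T) */
} ub_global_ref_t;

extern int urpc_call(uint32_t target_node, uint32_t func_id,
                     const ub_global_ref_t *args, size_t n_args); /* hypothetical */

/* Ask a remote NPU to normalize a huge weight tensor in place: only the
 * reference crosses the network; the NPU fetches exactly what it needs. */
int normalize_remote(uint32_t npu, uint32_t func_id, ub_global_ref_t weights) {
    weights.writable = 1;                     /* the callee will modify the tensor */
    return urpc_call(npu, func_id, &weights, 1);
}
```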

Pass-by-Value: When Copying Is Unavoidable

Of course, pass-by-reference is not a panacea. In many scenarios, we still need traditional pass-by-value. For example:

  • Heterogeneous data structures: When a call happens between different programming languages or hardware architectures, their definitions of memory layout may be completely different. In this case, format conversion and data copying are required.
  • Small parameters: For very small parameters (e.g., configuration items, scalar values), the overhead of setting up a remote memory mapping and then reading via Load/Store may exceed simply packing the data into the request and sending it once.

URPC fully recognizes this and provides efficient support for pass-by-value scenarios. But “efficient” here is fundamentally different from protobuf or JSON serialization in traditional RPC frameworks. URPC’s serialization/deserialization (SerDes) is designed for hardware. Its goal is an extremely minimal format and the lowest computational complexity, so the process can be maximally offloaded into hardware, freeing valuable CPU resources from tedious packing/unpacking.
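
As an illustration of what “designed for hardware” can mean in practice (the layout below is an assumption, not URPC’s actual wire format): fixed offsets and fixed widths let a NIC or DMA engine parse the arguments without interpreting any schema.

```c
/* Sketch of a hardware-friendly pass-by-value layout: fixed offsets, fixed
 * widths, no variable-length schema, so (de)serialization is a plain copy
 * that can be offloaded. The layout itself is illustrative. */
#include <stdint.h>

typedef struct {
    uint32_t func_id;      /* which remote function to invoke */
    uint32_t flags;        /* small scalar options packed inline */
    uint64_t scalar_arg0;  /* small by-value arguments travel in the request */
    uint64_t scalar_arg1;
} urpc_value_args_t;

/* All fields are naturally aligned, so the structure has a fixed 24-byte
 * layout that hardware can parse at constant offsets. */
_Static_assert(sizeof(urpc_value_args_t) == 24, "fixed-size argument layout");
```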

UB SuperNode: The Third Pole of the Software-Hardware Ecosystem for Large Models

At this point, we have explored in depth the long journey of the Unified Bus, from design philosophy to key technical implementations. However, any technology—no matter how exquisite its design—must be validated in a concrete, tangible system to show its true value. That system is the UB-architecture-based SuperNode (SuperPoD). It is not just a product; it is the ultimate answer to all our initial thinking, arguments, and persistence.

Looking back at the origin of the UB project, our initial dream was to break down the clear-cut barrier between buses and networks and create a new interconnect paradigm that combines bus-level performance with network-level scale. We firmly believe that future computing models will require us to treat the data center’s resources—compute, memory, and storage—as a unified whole, building a single logical “giant computer.”

At the time, this idea was dismissed by many as a pipe dream. The prevailing view was that intra-node interconnects in 8-card servers were sufficient, and cross-node communication didn’t need such extreme performance. However, when the Scaling Law became the AI field’s ironclad “law of physics,” people finally realized that the compute capacity of a single node had long hit the ceiling; tens of thousands of processors had to collaborate with unprecedented efficiency, and the interconnect technology linking them became the decisive factor in the success or failure of the entire system.

It was against this backdrop that the UB Supernode emerged. It is no longer a paper protocol, but a large-scale computing system that has undergone extensive validation in production scenarios. The architects of the UB Supernode turned our former vision into reality through a series of key technical features:

  1. Large-scale networking: This is the Supernode’s core capability. To support ultra-large-scale model training, the Supernode must break through the scalability bottlenecks of traditional networks. To this end, we designed the UB-Mesh networking architecture, whose core is a topology called nD-FullMesh. It fully exploits the traffic locality of AI training workloads and uses high-density, short-reach direct connections to link massive numbers of nodes at extremely low cost and latency. On this basis, with a hybrid topology of 2D-FullMesh and Clos, the Supernode achieves over 90% linear speedup at an 8192-card scale, and it reserves interfaces for a future UBoE (UB over Ethernet) solution that scales to million-card clusters.
  2. Bus-class interconnect: Even at massive scale, the Supernode maintains bus-class, extreme performance. UB provides synchronous, memory-semantic access on the order of hundreds of nanoseconds (for latency-critical Load/Store instructions) and asynchronous, memory-semantic access of 2–5 microseconds (for large-block Read/Write), with inter-node bandwidth reaching the TB/s level.
  3. Full pooling and peer collaboration: Under UB, all resources across the entire Supernode—whether NPU and CPU compute, DRAM memory, or SSD storage (SSU)—are aggregated into a unified resource pool. More importantly, these resources are peers: any NPU can directly access another node’s memory, bypassing the CPU to enable decentralized collaboration.
  4. Protocol unification: Underpinning all of this is UB’s unified protocol stack and programming model. It eliminates the translation overhead and management complexity caused by the coexistence of multiple protocols such as PCIe, Ethernet, and InfiniBand in traditional architectures, allowing upper-layer applications to efficiently harness the cluster’s heterogeneous resources with a single, unified set of semantics.
  5. High availability: For a system with tens of thousands of optical modules, reliability is a formidable challenge. UB addresses this with layered reliability mechanisms: at the link layer, LLR (Link Layer Retry) handles transient bit errors; at the physical layer, support for lane degradation (Lane Degradation) and 2+2 optical module backup enables hitless service recovery from failures; at the transport layer, end-to-end retransmission is the final safeguard. Together, these mechanisms ensure that, even at an 8192-card scale, the optical interconnect achieves a mean time between failures (MTBF) exceeding 6000 hours.

Globally, hardware ecosystems capable of supporting ultra-large-scale AI model training and inference are few and far between, because this requires deep coordination across chips, networks, and even operating systems—a daunting task that only companies with full-stack software and hardware capabilities can accomplish. Until recently, there were only two main actors on this stage:

  • NVIDIA: Centered on its GPUs, it filled out its InfiniBand NIC and switch portfolio by acquiring Mellanox, launched the Grace CPU and attempted to acquire ARM, continually reinforcing its “GPU + DPU + CPU” “three-chip” strategy, and ultimately built the powerful DGX SuperPOD ecosystem.
  • Google: As the other giant, its TPU hardware is deeply integrated with its internal software ecosystem, forming another closed yet efficient realm. Many of the world’s SOTA models with the fastest inference speed (tokens output per second) run on Google TPUs.

Each has, in its own way, solved the scalability and efficiency challenges at the scale of tens of thousands of cards, thereby defining the compute landscape of this era.

As an ordinary engineer who was involved from the very beginning, I feel a surge of emotion looking back on this journey. Our early persistence stemmed from a different worldview—we believed that the future of computing would inevitably be built on a logically unified “data center computer.” At first, only a dozen or so architects gathered around a whiteboard to discuss prototypes; we were more like evangelists then, striving to persuade others to believe in a future that had not yet arrived.

The real turning point came after GPT-3 was released in 2020. With indisputable performance, it demonstrated the power of the Scaling Law and won wide recognition within the company for the vision we had upheld. Since then, UB has received greater investment, and the small team of just over a dozen quickly grew into a massive project with thousands directly involved.

Today, with the UB Supernode in mass production at scale and the Unified Bus protocol officially released, the communication primitives and architectures we once drew on whiteboards have finally been realized and opened to a broader ecosystem. The birth of UB adds the most critical piece of the puzzle to the Ascend ecosystem, marking the emergence—after NVIDIA’s GPUs and Google’s TPUs—of the world’s third complete hardware–software ecosystem capable of supporting top-tier large-model training and inference.

This vision, once seen as a fantasy, has become a new reality with the rollout of the UB Supernode. Perhaps this is the allure of technological evolution: like a scientific revolution, it begins with a few people reimagining what the world “ought to be,” and ultimately becomes a new industry-wide consensus on what the world “is.”
