The protocol documentation for Unified Bus has finally been released. Most of the initial design of the protocol dates back four or five years. I also haven’t worked on interconnects for more than two years, but reading this 500+ page document today still feels very familiar.

Like most protocol documents, the UB document introduces a wealth of details about the Unified Bus protocol, but says little about the thinking behind its design. As a foot soldier who participated in the early UB project, I’ll share some of my personal thoughts. The UB product as it exists today may differ in many ways from what we designed back then, so don’t take this as an authoritative guide. Treat it as a collection of anecdotes.

Why Build UB

To understand the inevitability of Unified Bus (UB), we must return to a fundamental contradiction in computer architecture: the split between the Bus and the Network.

For a long time, the computing world has been divided into islands by these two distinct interconnect paradigms.

  • Inside an island (for example, within a single server or chassis), we use bus technologies such as PCIe or NVLink. They are designed for tightly coupled systems: devices share a unified physical address space, communication latency can reach the nanosecond level, and bandwidth is extremely high. This is a paradise of performance—but the territory of this paradise is very limited: the physical distance and number of devices a bus can connect are strictly constrained.
  • Between islands, we rely on network technologies such as Ethernet or InfiniBand. They were born for loosely coupled systems, excel at connecting tens of thousands of nodes, and have tremendous scalability. But that scalability comes at a cost: a complex protocol stack, additional forwarding overhead, and microsecond- to millisecond-level latency create an orders-of-magnitude gap in performance compared with buses.

This “inside/outside divide” worked well for a long time. However, a specter began to haunt the computing world—Scaling Law.

Around 10 years ago, researchers in deep learning discovered a striking regularity: as long as you keep increasing model size, data, and compute, model performance will predictably and continuously improve. This discovery changed the game. The once “good enough” 8-GPU single machine configuration suddenly became a drop in the bucket in the face of hundred-billion or even trillion-parameter models.

At that moment, a clear and urgent need confronted system architects: can we tear down the wall between buses and networks? Can we create a unified interconnect that combines bus-level ease of programming and extreme performance with network-level massive scalability?

That is the core mission of UB. It is not merely a patch or refinement of existing protocols, but a thorough rethinking. UB’s goal is to build a true datacenter-scale computer, seamlessly connecting the cluster’s heterogeneous compute, memory, and storage into a unified, programmable whole. In this vision, accessing memory on a remote server should feel as simple and natural as accessing local memory; tens of thousands of processors collaborating should be as efficient as if they were on a single chip.

Master–Slave Architecture and Peer-to-Peer Architecture

In traditional computer systems, the relationship between the CPU and other devices (such as memory, storage, and NICs) is usually master–slave. The CPU is the master, initiating and controlling all data transfers, while other devices are slaves that passively respond to the CPU’s commands. PCIe and RDMA are products of this master–slave architecture. Decades ago, when CPU performance surged ahead under Moore’s Law, the master–slave model had historical advantages. But as heterogeneous computing has become mainstream, the master–slave approach has increasingly become a bottleneck in modern systems.

  • Performance bottleneck: All I/O operations require CPU involvement. As the number and speed of devices increase, the CPU becomes the system bottleneck.
  • Higher latency: Data paths are long and pass through multiple layers of software, incurring extra software overhead and data copies. Even with technologies like RDMA, which let user-space software bypass the operating system and the remote CPU on the data path, you are still constrained by many PCIe limitations (such as uncacheable access to device memory) and cannot realize true distributed shared memory.
  • Poor scalability: In heterogeneous scenarios, large numbers of GPUs, NPUs, and other accelerators must communicate with the CPU. The master–slave model struggles to scale efficiently and cannot support high-performance “horizontal” data exchange among devices.

To break this bottleneck, UB proposes a peer-to-peer architecture. In the world of UB, all devices are equals and can be viewed as memory regions. Any device can use load/store memory semantics to access other devices’ memory as if it were local, without intervention from the other side’s CPU. This allows the data path to bypass the operating system entirely, enabling zero-copy and microsecond-level ultra-low latency.

This peer-to-peer architecture brings many benefits. For example, memory across different servers can form a shared memory pool. Idle memory on a compute-intensive application server can be efficiently utilized by a memory-intensive one. Heterogeneous compute and storage resources can be pooled and dynamically composed according to application needs, improving utilization and reducing unnecessary data movement.

Bus and Network

To understand UB’s design philosophy, you need to grasp the fundamental differences between buses and networks. Of course, we shouldn’t get bogged down in hair-splitting: modern buses (such as PCIe) borrow switching ideas from networks. But in terms of goals and scale, the paradigms differ markedly.

  • Design paradigm. Bus: designed for in-node communication in a tightly coupled system; devices share physical lines and use arbitration to decide access. Network: designed for inter-node communication in a loosely coupled system; data is segmented into packets and store-and-forwarded by switches.
  • Address space. Bus: typically a unified physical address space; the CPU accesses devices via memory-mapped I/O (MMIO). Network: each node has an independent address space; messages are exchanged using separate network addresses (e.g., IP).
  • Congestion control. Bus: flow control via low-level hardware arbitration and credit mechanisms; relatively simple. Network: congestion is the norm; complex end-to-end congestion control (e.g., TCP, UB C-AQM) is required to ensure stability and fairness.
  • Advantages. Bus: extremely low latency and very high bandwidth. Network: excellent scalability; can connect tens of thousands of nodes.
  • Disadvantages. Bus: poor scalability; physical distance and device count are very limited. Network: complex protocol stack; relatively high forwarding and processing overhead.

Traditionally, we use bus technology within a “supernode” (such as a single server or chassis) to pursue extreme performance, and network technology between supernodes to pursue large-scale expansion. These are two entirely different stacks and programming abstractions.

UB’s core value is that it achieves unification at the architectural and programming-abstraction levels. Whether the physical medium is an intra-supernode high-speed electrical backplane or an inter-supernode long-distance optical link, UB provides a unified memory semantic to upper-layer applications.

This means that UB acknowledges that, at the physical layer, the interconnect inside a supernode (more bus-like) and between supernodes (more network-like) can differ, but it hides those physical differences from applications through a unified abstraction. This gives you the best of both worlds: bus-level ease of programming and performance potential, plus network-level massive scalability.

The difference between bus and network is not about right or wrong, but about paradigms at different scales. Just as Newtonian mechanics is sufficiently precise and simple in the macroscopic, low-speed world, we only need relativity and quantum mechanics when approaching the speed of light or the microscopic realm. For a long time, we were content to use the classic bus paradigm in the “inside the chassis” macroscopic world, and rely on networks at the “data center” relativistic scale. However, AI’s Scaling Law is like a new instrument: it pushes computing demands to the extreme and makes the “crack” between the two scales—the communication chasm—impossible to ignore. This is the historical inevitability of UB’s birth: we need a new paradigm that unifies the two scales.

There Is Nothing New Under the Sun

There is nothing new under the sun. After working in a field for a while, you realize solving problems is like building with blocks: list the key problems, then, for each one, choose an existing solution and compose them.

For networks, the key questions are just a few:

  • What programming abstraction do we provide to applications?
  • At what layer is that abstraction implemented, and how do we split responsibilities across hardware, operating system, programming languages and runtimes, and applications?
  • Given that split, how do we design the hardware–software interface?
  • Who manages each device? Which devices power on and boot together?
  • At what granularity are packets segmented and transmitted across the network?
  • How are addresses assigned?
  • What topology does the network use?
  • How do nodes in the network discover each other?
  • Once we have addresses, how do we route? Do we support multipathing?
  • For point-to-point links, do we implement per-link flow control, and how?
  • End-to-end across multiple links, do we implement congestion control, and how?
  • Do we provide reliable transport semantics? If so, how do we detect loss and retransmit? How do we handle and report other failures?
  • Do we provide in-order delivery semantics? If so, how is it implemented?
  • Do we provide byte-stream semantics or message semantics?
  • Do we provide shared-memory semantics? If so, do we provide cache coherence? Can shared-memory access be done with a single hardware instruction, or does it require multiple software instructions?
  • If the programming abstraction provides other semantics, how are they implemented?
  • How do we handle authentication, authorization, and encryption?

Think through and answer these, and you’re 70–80% of the way to a design. A similar approach applies in other domains. For example, today’s AI agents are “just” choices among which model to use, how to implement user memory, how to implement a knowledge base, which context-engineering techniques to use, what tool set to provide, and which workflows should be factored into sub-agents.

One-sided Semantics and Two-sided Semantics

One-sided semantics (memory semantics)

In The Return of the Condor Heroes, the sixteen-year pact between Yang Guo and Xiaolongnü is a great example of one-sided semantics. Poisoned by love flowers at the bottom of the Passionless Valley and knowing her days were numbered, Xiaolongnü sought both the antidote and a way to spur Yang Guo to live on. Before leaping from Broken Heart Cliff, she carved on the cliff face: “Sixteen years later, meet here; our love is deep—do not break the promise.” By leaving those words, she hoped Yang Guo would believe she still lived and, with that faith, patiently wait out the sixteen years. Only after carving them did she jump off the cliff.

Xiaolongnü’s carving on the cliff is a one-sided “write” operation; she didn’t need Yang Guo present to acknowledge it. Sixteen years later, Yang Guo arrived as promised, saw the words on the cliff, and finally reunited with Xiaolongnü at the valley bottom. Yang Guo’s “read” operation is likewise one-sided: he simply read the information on the cliff without needing Xiaolongnü to be there. In computer networks, this mode of communication is called “one-sided semantics.” The sender (Xiaolongnü) can write data directly to a location the receiver (Yang Guo) can access (the cliff), and the receiver can read it at their convenience without both parties being online simultaneously.

Because one-sided semantics are mainly read/write operations, they are also known as memory semantics.

Note that the objects read and written by one-sided semantics are not necessarily memory addresses. Anything that relies on shared storage for communication falls under one-sided semantics. For example, Redis and other key-value stores also provide a kind of one-sided semantics, where the key is no longer a memory address but a string.

From the story of Yang Guo and Xiaolongnü, we can also see a drawback of one-sided semantics: the sender cannot actively notify the receiver and has no way of knowing whether the receiver has received the information. If Yang Guo had not looked carefully, he would have missed the words on the cliff face; and whether he ever saw them remained unknown to Xiaolongnü.

Two-sided semantics (message semantics)

To address this drawback, two-sided semantics, which require cooperation between sender and receiver, emerged. The earliest semantics in computer networks were two-sided: first the sending and receiving of individual packets, later evolving into sending and receiving data over connections.

Because two-sided semantics are mainly message send/receive operations, they are also called message semantics.

A keen reader will notice that writing to a memory address looks quite similar to sending a message to a peer application, right? A memory address is a number, and sending a message uses an IP address and port number—on the surface, what’s the difference?

The key difference lies in the semantics of “write.” When writing to a memory address, each address can hold only one datum; new data overwrite old data. Sending messages is different: although there is only a single destination address, all messages sent will be retained on the remote side. If an application only needs to receive messages from a fixed sender, message semantics can be easily implemented using memory semantics—so long as the sender ensures that data do not overwrite each other. But if multiple senders need to send messages to the same receiver at unpredictable times, and the receiver needs to be notified promptly, pure memory semantics become troublesome—how do we coordinate these senders to avoid conflicts when writing to memory addresses? In such cases, message semantics are more appropriate.
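
To make the single-sender case concrete, here is a minimal sketch of message semantics layered on memory semantics, under the assumption of exactly one writer. The shared region is modeled as an ordinary local struct; on UB or RDMA the two stores marked below would instead be one-sided remote writes to the same layout in the receiver’s memory.

    /* Sketch: message semantics built on one-sided memory semantics (single sender).
     * The sender appends records at non-overlapping offsets and then publishes them
     * by bumping a monotonically increasing tail counter; the receiver polls tail.
     * Wraparound flow control (not overwriting unconsumed slots) is omitted. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define RING_SLOTS 8
    #define SLOT_SIZE  64

    struct ring {
        volatile uint64_t tail;              /* number of messages published so far */
        char slots[RING_SLOTS][SLOT_SIZE];   /* fixed-size message slots */
    };

    /* Sender: write the payload first, then publish it by bumping tail. */
    static void send_msg(struct ring *r, const char *msg)
    {
        uint64_t t = r->tail;
        memcpy(r->slots[t % RING_SLOTS], msg, strlen(msg) + 1);  /* one-sided "write" #1 */
        r->tail = t + 1;                                         /* one-sided "write" #2 */
    }

    /* Receiver: poll tail and consume anything new, in order. */
    static uint64_t poll_msgs(struct ring *r, uint64_t seen)
    {
        while (seen < r->tail) {
            printf("got: %s\n", r->slots[seen % RING_SLOTS]);
            seen++;
        }
        return seen;
    }

    int main(void)
    {
        struct ring r = { 0 };
        uint64_t seen = 0;
        send_msg(&r, "hello");
        send_msg(&r, "world");
        seen = poll_msgs(&r, seen);
        (void)seen;
        return 0;
    }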

Message semantics sound great, but in high-performance networks they often lead to performance issues. On each message reception, the receiver’s CPU must process the message. If all you want is to read a block of data, but you still have to bother the receiver’s CPU, performance can’t be very high.

More importantly, message semantics require the receiver to pre-allocate memory buffers. But if the receiver doesn’t know in advance how large the incoming message will be, how much buffer should be prepared? If the receiver needs to receive messages from multiple senders, it must also prepare buffers for the possibility that several senders send almost simultaneously. Once receive buffers are insufficient, sends will fail.

From first principles, two-sided message semantics are better suited for notifications, whereas one-sided memory semantics are better for transferring large chunks of data. It’s like sending a large file to someone: most likely you upload the file to a cloud drive and then send an email notifying them to download it, rather than attaching the file directly to the email. Uploading to and downloading from the cloud drive correspond to one-sided semantics, whereas sending the email notification corresponds to two-sided semantics.

The UB protocol provides exactly this kind of one-sided memory operation: it allows one server to directly read and write another server’s memory without intervention by the remote CPU, achieving extremely high throughput and very low latency.

For two-sided semantics, it’s crucial to recognize that their most important role is to notify the application. If an application has multiple messages waiting to be processed, it can enqueue them and issue a single wake-up. After being awakened, the application will naturally process all messages in the queue in order. Traditionally, packet processing and process event notifications have been handled via interrupts and the operating system. In UB, however, the hardware can perform most tasks on the data plane, greatly reducing OS overhead.

Of course, anyone who has studied distributed systems knows that memory semantics and message semantics can each be implemented in terms of the other. But being able to emulate one with the other does not mean doing so is efficient, so both semantics have value in different scenarios. The key for UB is to provide efficient memory semantics so that transferring large data blocks and accessing shared data become more efficient.

Connection-Oriented and Connectionless Semantics: The Jetty Abstraction

The Scalability Challenges of RDMA “Connections”

Before the disruptive paradigm of UB emerged, network engineers were like scientists in the “normal science” phase, striving to solve problems within the existing “connection-oriented” paradigm. RDMA itself was a huge success, but as data centers grew, its inherent scalability issues gradually became a new “puzzle.” In RDMA, communication must first “establish a connection,” whose concrete entity is the Queue Pair (QP). Each QP includes a Send Queue (SQ) and a Receive Queue (RQ), along with a complete set of state machines to handle ordering, retransmission, acknowledgment, and other complex reliability logic.

The cost of this design is that the state of each QP must be fully stored in the NIC’s on-chip memory (SRAM) so that the hardware can process at line rate. In small-scale high-performance computing clusters, this is not a problem. But when we apply this model to ultra-large-scale data centers with tens of thousands of servers, each running hundreds or thousands of application processes, the model hits a “scalability ceiling”:

  1. Hardware resources are exhausted: a server that communicates with 1,000 other servers needs to maintain 1,000 QPs. The NIC’s on-chip memory is extremely precious and will be consumed quickly.
  2. Management complexity explodes: applications and the operating system need to manage massive amounts of connection state, which itself introduces huge software overhead.

To untie this knot, the community has invested great effort, developing technologies like XRC (eXtended Reliable Connection) and SRQ (Shared Receive Queue).

  • SRQ allows multiple QPs to share a single receive queue, which reduces receive-buffer memory consumption to some extent, but the sender still needs to maintain a separate QP for each peer.
  • XRC goes further by allowing multiple remote nodes to share the same target QP, further reducing connection state.

However, these techniques are essentially “patches” to the original connection-oriented model. They make the model more complex but do not fundamentally solve the problem. When the massive “anomaly” of the Scaling Law appeared, we realized that what we needed was not a more delicate patch, but a wholesale paradigm revolution—as long as communication requires applications to explicitly create and manage a “connection” state, the scalability ceiling will always be there.
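
To get a feel for how much patching the connection-oriented model needs, here is a sketch using the standard ibverbs API (plain RDMA, not UB): even with a Shared Receive Queue, the sender side still has to create, connect, and maintain one reliable-connection QP per peer, each carrying its own state in the NIC. The protection domain, completion queue, and QP array are assumed to have been set up elsewhere.

    /* Sketch: an SRQ shares receive buffers, but per-peer QP state remains. */
    #include <infiniband/verbs.h>

    static int create_per_peer_qps(struct ibv_pd *pd, struct ibv_cq *cq,
                                   struct ibv_qp **qps, int num_peers)
    {
        /* One shared receive queue serves all peers... */
        struct ibv_srq_init_attr srq_attr = {
            .attr = { .max_wr = 4096, .max_sge = 1 },
        };
        struct ibv_srq *srq = ibv_create_srq(pd, &srq_attr);
        if (!srq)
            return -1;

        /* ...but a separate reliable-connection QP is still needed per peer,
         * each holding ordering/retransmission state in NIC on-chip memory. */
        for (int i = 0; i < num_peers; i++) {
            struct ibv_qp_init_attr qp_attr = {
                .send_cq = cq,
                .recv_cq = cq,
                .srq     = srq,
                .cap     = { .max_send_wr = 128, .max_send_sge = 1 },
                .qp_type = IBV_QPT_RC,
            };
            qps[i] = ibv_create_qp(pd, &qp_attr);
            if (!qps[i])
                return -1;
        }
        return 0;
    }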

From “Connections” to “Jetty”

At that time, we realized we had to completely abandon the connection-oriented mental model—attack the root. This idea ultimately gave birth to the core abstraction of UB: Jetty.

The traditional “connection” model is like opening an exclusive, point-to-point private shipping lane between two ports. From first principles, the essence of communication is simply “reliably delivering a piece of information from point A to point B.” Many concepts in communications—port, beacon, ping, gateway, firewall—originate in navigation and seafaring. Professor Cheng Chuanning gave our connectionless abstraction the name jetty, which literally refers to a man-made structure projecting from shore into the sea, such as a breakwater or pier.

We deliberately chose the word “Jetty,” rather than reusing common networking terms. As Kuhn notes in The Structure of Scientific Revolutions, the establishment of a new paradigm is often accompanied by the birth of a new language. Old words like “connection” carry too much inertia from the old paradigm. Creating a new term forces us to think with a completely new worldview—not point-to-point “private shipping lanes,” but many-to-many “public jetties.” This new vocabulary constitutes the “jargon” of UB’s new paradigm; it is the “textbook” for entering this new world.

Initially, we envisioned a simpler model. A Jetty is like a kayak launch point. An application thread (the kayaker) places a request (the kayak) into the send queue, the JFS (Jetty For Send), and can leave immediately; the launch point is instantly available for the next person. This sounds very efficient, because it decouples hardware and software completely.

However, this seemingly simple design hides a fatal flaw: it cannot implement reliable flow control between software and hardware. The hardware may complete tasks faster than software can process completion events. If the hardware keeps posting completion events (CQEs) into the completion queue, the JFC (Jetty For Completion), while the software cannot keep up, the JFC will soon fill up. Once the JFC overflows, subsequent completion events will be dropped, leading to disastrous consequences: the software will never learn that certain operations have completed.

To solve this, the final design adopts a more sophisticated “berth” model. We can think of a Jetty as a public pier with multiple berths. Each request that needs to set sail (an independent two-sided message or a memory read/write request) is like a ship that must first claim a berth at the pier. This berth is an occupied resource slot in the Jetty. In UB’s concrete implementation, when an application submits a request (WQE) to the JFS, the request occupies one slot in the JFS.

The key is that this slot is not fleeting. Each slot in the JFS corresponds one-to-one with a slot in the JFC. When the hardware finishes the network transmission and the remote operation, it places a completion event (Completion) into the corresponding slot in the JFC. Only after the application has processed this completion event is the slot (the “berth”) fully released and returned to the free state for the next request. This JFS–JFC pairing also creates a natural hardware flow-control loop between the CPU and the NIC: completion events in the JFC that have not yet been processed by software prevent the hardware from accepting new requests in the JFS.

Therefore, different requests in a Jetty are indeed more akin to berths in a ship pier. From submission, through network transmission, to the initiator’s software finally processing the completion event, a request will occupy a berth on the pier for its entire lifecycle. Although more complex than the “kayak launch point” model, this design establishes a backpressure (Back Pressure) mechanism via the one-to-one relationship between JFS and JFC, fundamentally preventing event loss caused by mismatched hardware/software speeds.
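
The berth model can be pictured with a toy data structure. The types and field names below are made up for illustration and are not the actual UB queue layout; the point is only that the number of in-flight requests is bounded by the slot count, and a slot returns to the free pool only after software consumes the matching completion entry.

    /* Toy model of JFS/JFC slot pairing and the resulting backpressure. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define JETTY_SLOTS 1024

    struct jetty {
        bool     busy[JETTY_SLOTS];   /* berth occupied from submit until its CQE is consumed */
        uint32_t free_slots;
    };

    static void jetty_init(struct jetty *j)
    {
        memset(j->busy, 0, sizeof j->busy);
        j->free_slots = JETTY_SLOTS;
    }

    /* Submitting a request claims a berth; with no free berths, the hardware
     * refuses new work, which is exactly the backpressure described above. */
    static int jetty_submit(struct jetty *j /* , request description */)
    {
        if (j->free_slots == 0)
            return -1;
        for (int s = 0; s < JETTY_SLOTS; s++) {
            if (!j->busy[s]) {
                j->busy[s] = true;
                j->free_slots--;
                return s;             /* the JFS slot index also names the paired JFC slot */
            }
        }
        return -1;
    }

    /* Consuming the completion in the paired JFC slot releases the berth. */
    static void jetty_complete(struct jetty *j, int slot)
    {
        j->busy[slot] = false;
        j->free_slots++;
    }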

The fundamental advantage of this model is that it reduces an N x N “private shipping lane” management problem to the management of N “public jetties,” thereby solving the scalability challenge.

Practical Considerations of the Jetty Model: HOL Blocking, Fairness, and Isolation

Of course, the “public jetty” model must also face real-world complexity.

First is the Head-of-Line (HOL) blocking problem. Because each berth (JFS/JFC resource pair) in a Jetty must be occupied throughout a request’s lifecycle, HOL blocking objectively exists. In a FIFO queue, if the head of the queue is a huge and time-consuming task (such as sending a very large amount of data), it will occupy a berth for a long time. If multiple such large requests pile up at the front, they may fill all available berths in the Jetty, causing all subsequent tasks—even tiny, fast ones—to have to wait and be unable to depart.

However, this issue usually does not cause serious trouble in practice. First, a Jetty can have a very large number of “berths,” up to the thousands. Second, UB is a very fast network, and most requests are extremely short-lived. Therefore, in most scenarios, the probability that HOL fills all berths is not high.

Next come the issues of Fairness and Isolation. Since all outbound ships depart from the same pier (a single JFS), there is no way to guarantee fairness across ships headed to different destinations or with different priorities. A “crazy” shipper (an application) might keep piling cargo onto the pier, occupying all resources so that other shippers’ vessels have no chance to depart.

To address HOL blocking, fairness, and isolation, the Jetty model offers a unified and flexible solution: when needed, an application can create multiple Jettys.

  • Mitigate HOL blocking: If an application needs to handle a mix of large requests and many small requests, a best practice is to use different Jettys for them, shunting the “slow ships” (big requests) and “speedboats” (small requests) to different piers.
  • Need isolation: If a critical application does not want its send/receive traffic to be disturbed by any other application, it can create its own dedicated one-to-one Jetty (a JFS paired with a JFR, the Jetty For Receive queue), which logically reverts, in part, to the “connection” abstraction, using a dedicated “private pier” to ensure QoS.
  • Need fairness: If a service needs to handle requests from multiple tenants fairly, it can create a different Jetty for each tenant or each type of request, and then do round-robin or scheduling at the application layer.

This is precisely the elegance of the Jetty abstraction: it provides an extremely simple and scalable “connectionless” model as the default, while handing back to the application the choice of “how much isolation and traffic splitting” it needs. Applications can, based on their needs, make the most suitable trade-off between “fully shared” and “fully isolated.”

Implementing One- and Two-Sided Semantics under the Jetty Abstraction

The Jetty abstraction can use a unified queue model to efficiently implement the two core semantics: one-sided and two-sided.

1. One-sided memory semantics (One-Sided)

One-sided operations (such as RDMA Read/Write) behave like a memory access: the initiator only needs to provide the address and data, without the remote application CPU’s involvement. In the Jetty model, this flow is greatly simplified:

  • The initiating application submits a “write” request (including target address, data, etc.) to the JFS.
  • UB hardware fetches the request from the JFS and completes reliable delivery to the target.
  • The target-side UB hardware writes the data directly to the specified memory address.
  • On the initiator side, the UB hardware places a completion event (CQE) into the JFC. The initiator application learns the operation has completed by checking the JFC.

Throughout, the initiator application does not even need a receive queue (JFR), because it does not “receive” any application-layer message; it only cares whether its operation has “completed.”
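
Put together, the initiator side of a one-sided write reduces to “post to the JFS, poll the JFC.” The sketch below uses invented names (ub_post_write, ub_poll_jfc, and the structs around them); the real URMA interface will differ, but the shape of the flow is the same.

    /* Sketch of the initiator side of a one-sided write under Jetty.
     * All ub_* names are invented for illustration, not the actual URMA API. */
    #include <stdint.h>
    #include <stddef.h>

    struct ub_write_req {
        uint64_t    token_id;      /* identifies the remote memory segment */
        uint64_t    remote_addr;   /* target address within that segment */
        const void *local_buf;
        size_t      len;
    };

    struct ub_completion {
        int      status;
        uint64_t user_tag;
    };

    /* Hypothetical primitives: submit into the JFS, harvest from the JFC.
     * ub_poll_jfc() returns 1 when a completion was harvested, 0 when empty. */
    int ub_post_write(void *jfs, const struct ub_write_req *req, uint64_t user_tag);
    int ub_poll_jfc(void *jfc, struct ub_completion *cqe);

    static int write_remote(void *jfs, void *jfc, uint64_t token, uint64_t raddr,
                            const void *buf, size_t len)
    {
        struct ub_write_req req = {
            .token_id = token, .remote_addr = raddr,
            .local_buf = buf, .len = len,
        };
        if (ub_post_write(jfs, &req, /* user_tag = */ 42) != 0)  /* claims a JFS berth */
            return -1;

        struct ub_completion cqe;
        while (ub_poll_jfc(jfc, &cqe) == 0)   /* wait for the paired JFC entry; the   */
            ;                                 /* remote CPU is never involved at all. */
        return cqe.status;
    }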

2. Two-sided message semantics (Two-Sided)

Two-sided operations (such as Send/Receive) require participation from both applications.

  • The initiating application submits a “send” request to the JFS.
  • UB hardware reliably delivers it to the target.
  • The target-side UB hardware places the received message into the target Jetty’s JFR.
  • The target application discovers a new message by checking its JFR and processes it.
  • On the initiator side, the UB hardware places a completion event into the JFC, notifying that the “send” has succeeded.

3. Efficient application wake-up mechanism

The highest-performance applications continuously poll the JFC/JFR to obtain completion status and new messages. But in many scenarios, the application may be sleeping. If every completion event triggered an interrupt to wake the CPU, the overhead would be too high. Jetty, through the cooperation of the JFC and the event queue (EQ), provides an efficient asynchronous notification mechanism: when submitting a request, the application can set a flag asking the hardware to trigger an Event after the transaction completes. The hardware places this Event into the EQ, and multiple Events can correspond to a single interrupt. After the process is awakened, it only needs to check the EQ to learn that something happened, and can then batch-process the completion entries and messages accumulated in the JFC and JFR. This turns a potential “one wake-up per message” cost into a “one wake-up per batch” cost, greatly improving efficiency.
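
The resulting application loop looks roughly like the sketch below, again with invented ub_* names. The application arms the event queue, sleeps, and after a single wake-up drains everything that accumulated in the JFC and JFR; the re-arm/re-poll race handling a real implementation needs is omitted.

    /* Sketch of the "one wake-up per batch" pattern (invented ub_* names). */
    #include <stddef.h>
    #include <stdint.h>

    struct ub_completion { int status; uint64_t user_tag; };

    int ub_arm_eq(void *eq);                                    /* request an event on next completion */
    int ub_wait_eq(void *eq);                                   /* block until the EQ has an event */
    int ub_poll_jfc(void *jfc, struct ub_completion *cqe);      /* 1 = harvested, 0 = empty */
    int ub_poll_jfr(void *jfr, void *msg_buf, size_t buf_len);  /* 1 = got a message, 0 = empty */

    void handle_completion(const struct ub_completion *cqe);    /* application callbacks */
    void handle_message(const void *msg);

    static void event_loop(void *eq, void *jfc, void *jfr)
    {
        char msg[4096];
        struct ub_completion cqe;

        for (;;) {
            ub_arm_eq(eq);
            ub_wait_eq(eq);                         /* one wake-up ...                   */

            while (ub_poll_jfc(jfc, &cqe) == 1)     /* ... then batch-process everything */
                handle_completion(&cqe);            /*     that piled up while we slept. */
            while (ub_poll_jfr(jfr, msg, sizeof msg) == 1)
                handle_message(msg);
        }
    }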

In summary, the Jetty abstraction is the cornerstone of UB’s connectionless design philosophy. It uses a simple, connectionless “dock–queue” model to replace the complex, connection-oriented state-machine model of traditional networks, pushes the heavy lifting into hardware, and ultimately provides upper-layer software with an extremely simple, extremely performant, and extremely scalable programming interface.

Strong vs. weak transaction ordering

In distributed systems, order is the core of consistency but also a shackle on performance. How to efficiently establish correct event order across nodes is the ultimate topic for all distributed protocols. This chapter dives into the underlying principles and design trade-offs of transaction ordering, explaining why Unified Bus (UB) provides a flexible mechanism that encompasses both strong and weak modes.

Message semantics: breaking free from the byte stream

Traditional network communication, represented by TCP, provides us with a reliable, point-to-point byte stream abstraction. This is a powerful model that guarantees no loss, no duplication, and in-order delivery. However, when we pass multiple independent business “messages” over a single connection, this strict byte-stream ordering becomes a performance bottleneck. If the packet carrying the first message is lost, TCP’s reliability mechanism stalls the entire connection until that packet is successfully retransmitted. All subsequent messages—even if logically independent—must wait. This is the famous “Head-of-Line Blocking” (HoL Blocking).

Modern network protocol design recognizes this limitation deeply. Whether it is QUIC and HTTP/3, which carry the future of the Web, or UB designed for high-performance data centers, one of the core transformations is to replace “byte-stream semantics” with “message semantics.” By building multiple independent logical flows on top of connectionless UDP or a similar substrate, the transmission problem of one message no longer blocks other unrelated messages. This provides the necessary foundation for discussing order at a higher level—namely, between transactions.

The dream of strong ordering: the allure and challenge of global total order

Since we can logically distinguish independent transactions, a natural ultimate ideal emerges: can we build a communication system that extends ordering guarantees from a single point-to-point connection to the entire network?

The earliest inspiration for this idea comes from a very physical intuition: the effect of an event in the network is like waves generated by a stone dropped into water, propagating from one network node (switch or host) to subsequent nodes. Each packet forwarding step is like the advance of a wavefront. If we could capture the sequence in which these “waves” propagate across the entire network, wouldn’t we naturally be able to define a globally consistent order for all events?

This was a research topic I worked on during my PhD with Zuo Gefei, Bai Wei, and my advisor Zhang Lintao (1Pipe: Scalable Total Order Communication in Data Center Networks). The core motivation of this work is that if the network could provide all nodes with a “One Big Pipe” abstraction, letting all transmitted transactions (whether unicast or multicast) be strictly ordered in a virtual global sequence, then many long-standing challenges in distributed systems would be greatly simplified:

  • Distributed transactions: An atomic write spanning multiple nodes can be packaged as a totally ordered “scattering” message; the protocol ensures all nodes see this write at the same “logical instant,” thereby achieving atomicity naturally, without complex two-phase commits or locks.
  • State-machine replication: The core of consensus algorithms (such as Paxos/Raft) is to agree on a single order for the operation log. If the network itself provides total order, the complexity of replication will be greatly reduced.
  • Memory consistency: In distributed shared memory systems, the order of memory reads and writes is key to consistency. Globally total-ordered communication can directly resolve the ordering of memory updates.

Essentially, this ideal attempts to engineer the ultimate strengthening of the happens-before relation that Leslie Lamport, drawing on special relativity, described in his landmark paper: a global total-order relation that is compatible with all causal relations.

However, from the first submission in 2018 to final publication, years of refinement also exposed the core challenge: real-world networks are full of failures, and a pure total-order system is extremely fragile in the face of failures. Merely assigning a globally unique sequence number to messages is not enough, because you cannot guarantee reliable delivery and atomicity. What happens if a sender crashes after assigning a sequence number but before the message is acknowledged by all receivers? If a receiver goes down permanently, how is the integrity of this “atomic” scatter operation ensured? To solve these problems, we were forced to add complex failure detection and recovery mechanisms on top of the idealized total-order model, and system complexity increased accordingly.

The way of weak ordering: embracing a new paradigm of uncertainty

This difficult research experience led me to reflect deeply: do we really need such an expensive and complex strongly causal, strongly ordered system?

A glimmer of an answer came unexpectedly from two seemingly unrelated fields: fundamental physics and artificial intelligence. A colleague told me that cutting-edge experiments in physics, such as the “quantum switch,” have revealed that at the finest scales the universe’s causal structure may not be as “solid” as in our macroscopic world. Causal order itself may be uncertain and superposed; the determinism we experience may be a statistical average of a probabilistic micro-world at macroscopic scales.

This idea resonates intriguingly with the nature of modern AI systems. Today, our largest computational workloads—training and inference of deep neural networks—are driven by algorithms (such as stochastic gradient descent) that are themselves probabilistic and that naturally tolerate, even exploit, “noise.” In a system that already allows “jitter” and “error,” slight reordering or delay introduced by communication is merely another kind of noise the algorithms can digest.

Since the universe’s underlying rules may not be strongly causal, and our most important applications can tolerate weak causality, is it really necessary to construct a perfect, strongly consistent communication layer atop unreliable hardware—or is that an unnecessary, over-engineered design?

This is precisely the philosophical foundation of the “weak transactional order” design in Unified Bus. UB recognizes that different application scenarios have vastly different requirements for ordering and consistency. Therefore, it does not provide a single, rigid ordering model, but instead offers a set of graded transactional-order primitives that applications can choose from as needed.

UB Transaction Order: Execution Order and Completion Order

UB decomposes transactional ordering guarantees into two orthogonal dimensions: execution order and completion order.

Transaction Execution Order (Execution Order)

It defines the order in which requests are executed at the Target side and is central to ensuring consistency.

  • NO (No Order): The default option with the highest performance. Transactions are completely independent, and the Target can execute them in any order. Suitable for stateless queries, independent log uploads, and similar scenarios.
  • RO (Relaxed Order): The core of relaxed ordering. It ensures that, for the same Initiator, the chain of transactions marked RO or SO will be executed in the order sent. But it does not block other transactions unrelated to this chain. It maximizes parallelism while ensuring “no reordering within the causal chain.”
  • SO (Strong Order): Strong-order guarantee. Transactions marked SO must wait until all prior RO and SO transactions from that Initiator have finished executing before they can start. This provides a strong sequencing point, suitable for critical operations requiring strict serialization.
  • Fence: A special barrier mechanism. It ensures that only after all preceding transactions (of any type) have finished executing can subsequent transactions begin. It is used to establish a clear boundary between different, logically independent transaction “batches.”
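
In code, these execution-order primitives are naturally expressed as a per-request attribute rather than a per-connection one. The enum below is purely illustrative; the actual UB encoding of these bits differs.

    /* Illustrative only: attaching an execution-order primitive to each request. */
    enum ub_exec_order {
        UB_ORDER_NO,      /* no order: the Target may execute it in any order             */
        UB_ORDER_RO,      /* relaxed: ordered against prior RO/SO from the same Initiator */
        UB_ORDER_SO,      /* strong: starts only after all prior RO/SO have executed      */
        UB_ORDER_FENCE,   /* barrier: starts only after ALL prior transactions finish     */
    };

    struct ub_request {
        enum ub_exec_order order;   /* chosen per request, not per connection */
        /* ... target segment, address, payload ... */
    };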

Transaction Completion Order (Completion Order)

It defines the order in which completion notifications (CQEs) are generated, decoupled from execution order. This allows the system to perform more flexible optimizations. For example, transactions can execute in order but complete out of order (e.g., a write can be reported complete once it is persisted to a log, without waiting for data to be flushed).

By composing these primitives into different transaction service modes (e.g., ROI, ROT, ROL), UB empowers upper-layer applications to make the most informed, fine-grained choices between performance and consistency according to their business logic. Systems requiring strong consistency, such as distributed databases, may use SO and Fence more often; whereas for large-scale AI training, the vast majority of gradient updates can use RO or even NO, pushing system throughput to the extreme. This design philosophy is a systematic answer—from theoretical reflection to engineering practice—to the “ordering” problem in distributed systems.

Load/Store and Read/Write: Two Worldviews of Memory Access

In the design philosophy of Unified Bus, there is no single way to access remote memory; instead, UB offers two core, complementary programming paradigms: Load/Store, which is deeply integrated with the processor instruction set, and Read/Write, which is more flexible and software-defined. These two paradigms represent two different “worldviews,” reflecting deep trade-offs across the programming model, performance, consistency, and degree of hardware coupling.

Two paradigms: synchronous Load/Store and asynchronous Read/Write

To understand the fundamental difference between the two paradigms, we must first return to a basic question: how does a remote memory access actually occur at the application and hardware levels?

What is Load/Store?

When discussing any system semantics, we must first clarify the level at which they are implemented, or we will end up talking past each other. The defining feature of Load/Store semantics is that the access is directly initiated and supported by processor hardware instructions.

  • In the classic Turing machine model, the next instruction can start only after the Load instruction completes.
  • In modern out-of-order processors, after a Load is issued, unrelated subsequent instructions may proceed, but any instruction that depends on that Load’s result is automatically stalled by hardware until the data returns.
  • In some specialized processors (such as NPUs), a single Load can even move a non-contiguous, very large block of data (e.g., a slice of a tensor).

All of these belong to Load/Store semantics because they are directly initiated and managed by hardware instructions. In contrast, memory accesses initiated by software and wrapped by a runtime (for example, the software constructs a work request, sends it to the NIC via a driver, and then polls a completion queue) are usually not considered Load/Store semantics, even if they ultimately perform a remote read.

Synchronous vs. Asynchronous

The essence of Load/Store is a synchronous memory access model, whereas traditional RDMA Read/Write is the archetype of an asynchronous model.

A typical asynchronous RDMA write involves a lengthy and intricate sequence:

  1. Software constructs a Work Queue Element (WQE) in memory.
  2. Software notifies the NIC of a new task via a “doorbell.”
  3. The NIC reads the WQE from memory into its on-chip memory.
  4. The NIC, per the WQE, DMAs user data from host memory into the NIC.
  5. The NIC packages the data into network packets and sends them to the remote side.
  6. The remote NIC receives the packets and writes the data into the target memory.
  7. The remote NIC returns an acknowledgment.
  8. The initiator NIC, after receiving the acknowledgment, writes a Completion Queue Element (CQE) in memory.
  9. Software must actively poll the CQE to finally confirm completion.

This whole process involves multiple DMAs and complex software–hardware interactions. In UB, Read/Write borrows techniques such as NVIDIA BlueFlame, allowing a small payload to be included with the “doorbell” so that the WQE is written directly into the NIC’s device address space, eliminating steps 1 and 3 and saving two DMAs, but the interaction remains fundamentally asynchronous.
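
For comparison, here is what the nine-step asynchronous flow looks like in the classic ibverbs API (RDMA, not UB): software builds a work request, rings the doorbell via ibv_post_send, and later polls the completion queue. The qp, cq, mr, raddr, and rkey are assumed to have been set up during connection establishment.

    /* Classic asynchronous RDMA write with ibverbs: post a WQE, then poll for the CQE. */
    #include <infiniband/verbs.h>
    #include <stdint.h>

    static int rdma_write_sync(struct ibv_qp *qp, struct ibv_cq *cq,
                               struct ibv_mr *mr, void *buf, uint32_t len,
                               uint64_t raddr, uint32_t rkey)
    {
        struct ibv_sge sge = {
            .addr   = (uintptr_t)buf,
            .length = len,
            .lkey   = mr->lkey,
        };
        struct ibv_send_wr wr = {
            .wr_id      = 1,
            .sg_list    = &sge,
            .num_sge    = 1,
            .opcode     = IBV_WR_RDMA_WRITE,
            .send_flags = IBV_SEND_SIGNALED,       /* ask for a CQE (steps 8-9) */
            .wr.rdma    = { .remote_addr = raddr, .rkey = rkey },
        };
        struct ibv_send_wr *bad_wr;

        if (ibv_post_send(qp, &wr, &bad_wr))       /* steps 1-2: build WQE, ring doorbell */
            return -1;

        struct ibv_wc wc;
        int n;
        do {                                       /* step 9: software polls for the CQE */
            n = ibv_poll_cq(cq, 1, &wc);
        } while (n == 0);
        return (n > 0 && wc.status == IBV_WC_SUCCESS) ? 0 : -1;
    }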

By contrast, a synchronous Load/Store operation is radically simplified:

  1. The application executes a Load or Store instruction.
  2. The CPU’s networking module (or a tightly integrated coprocessor) directly converts the instruction into network packets.
  3. The remote networking module completes the memory read/write and returns the result (or an acknowledgment) over the network.
  4. The initiating CPU’s instruction completes, and the pipeline continues.

Modern CPU pipelines can hide part of the latency of synchronous accesses quite well. Although an overly long remote Load may stall parts of the pipeline and reduce parallelism, its end-to-end latency and software overhead are far lower than in the asynchronous model, making it especially suitable for latency-sensitive small-block accesses.
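
For contrast, once a remote segment has been mapped into the local virtual address space, the synchronous path is literally an ordinary pointer dereference. The ub_map_segment name below is hypothetical and stands in for whatever maps a remote segment into the local address space.

    /* Sketch: a synchronous remote access is just a load or store once mapped.
     * ub_map_segment() is a made-up name for the mapping step. */
    #include <stdint.h>
    #include <stddef.h>

    volatile uint64_t *ub_map_segment(uint64_t token_id, size_t len);   /* hypothetical */

    static uint64_t remote_exchange(uint64_t token, size_t len)
    {
        volatile uint64_t *remote = ub_map_segment(token, len);

        remote[0] = 42;      /* a single store instruction becomes a UB write on the wire */
        return remote[1];    /* a single load; only instructions that use the value stall */
    }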

Pros and cons summary

Synchronous remote memory access (Load/Store)

  • Advantages: Simple process, extremely low latency; transparent to applications and can be used directly for memory extension; high efficiency for small transfers; can support hardware cache coherence.
  • Disadvantages: High hardware demands (requires tight CPU integration); small per-access data size (typically cache-line granularity); large reliability “blast radius” (a node failure can drag down all other nodes accessing that node’s memory); high overhead for large-scale cache coherence.

Asynchronous remote memory access (Read/Write)

  • Advantages: Flexible control of access size with high throughput for large transfers; lower hardware requirements, good decoupling; exceptions can be caught in software, offering good fault isolation.
  • Disadvantages: Complex process with higher latency; not transparent to applications, requiring explicit programming; no hardware cache coherence, so software must ensure it.

Because each of these two modes has its own strengths, the UB protocol chooses to provide both, allowing developers to select as needed based on the scenario.

Remote memory addressing

With two programming paradigms in place, the next key question is: how to “name” and “locate” remote memory—that is, addressing? UB’s memory management mechanisms are built around efficiently supporting Load/Store and Read/Write.

In UB, the basic unit of memory management is the Memory Segment. A memory segment is a contiguous region of physical memory. When a node (Home) wants to share part of its memory with other nodes (Initiators), it creates a memory segment and assigns it a unique identifier (TokenID). For an Initiator to access this remote memory, it must first obtain information about this segment from the Home, including the TokenID, base address (UBA), and size (Len).
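
The information the Initiator obtains can be pictured as a small descriptor with exactly these three fields; the struct below is illustrative, not the wire format.

    /* Illustrative descriptor for a shared memory segment (not the actual UB encoding). */
    #include <stdint.h>

    struct ub_segment_info {
        uint64_t token_id;   /* TokenID: unique identifier assigned by the Home node */
        uint64_t uba;        /* UBA: base address of the segment                     */
        uint64_t len;        /* Len: size of the segment in bytes                    */
    };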

After obtaining this information, the Initiator faces a key choice: how to “translate” this remote memory address into one it can understand and use? Industry and academia have explored multiple paths, each with pros and cons:

  1. Unified physical memory addressing (Globally Shared Physical Memory)

This approach is the simplest and most direct, and is common in traditional tightly coupled HPC systems. Across the entire system, all nodes’ physical memories are mapped into a single globally unified physical address space. When any node accesses an address, the hardware can directly resolve which node and which chunk of physical memory it belongs to. The advantage is simple hardware implementation. The fatal drawback is extremely poor scalability. As the number of nodes grows, maintaining a globally consistent physical address view becomes exceedingly difficult and expensive.

  2. Network address + remote virtual address (Network Address + Remote VA)

This is a more flexible and scalable solution. Accessing a remote memory address requires a “pair”: the target node’s network address and the memory’s virtual address on that node. This decouples address spaces so each node maintains its own, delivering excellent scalability. The read and write transactions in the UB protocol support this access method.

However, this network-address + virtual-address pair is often very long (e.g., 128 bits) and cannot be directly used as a memory address by existing CPU instructions. To initiate a remote read/write, the CPU must execute special instructions that package this long address together with the operation type into a request for the NIC to process. Another important characteristic is that this is essentially a non-cacheable (Non-cacheable) access mode. Each read/write goes straight to the peer memory, with no local caching. The benefit is a simple model with completely no cache-coherence issues, because there is no cache in the first place. But the drawback is also obvious: each access must bear the full network latency.

  3. Mapping to local virtual address (Mapping to Local VA)

This is the core mechanism the UB protocol provides in pursuit of extreme performance, supported by the load and store instructions. In this mode, the Initiator maps the obtained remote memory segment, via the local memory management unit (UBMMU), into its own process’s virtual address space. Once mapped, the CPU can access this remote memory just like local memory, using standard load and store instructions.

The performance advantage of this approach is tremendous because it “localizes” remote memory, allowing the CPU pipeline to handle remote accesses seamlessly without special instructions or software overhead. More importantly, this mode natively supports caching (Cacheable). When the CPU issues a load to a remote address, the data can be cached in the local cache, so subsequent accesses can complete with very low latency.

Of course, introducing caches also brings new challenges: how to ensure cache coherence? The UB protocol designs an elegant cache-coherence mechanism for this. By setting different permissions (such as read-only or read-write) when mapping memory segments, the system leaves ample design space for subsequent cache-coherence management. For example, a memory segment mapped read-only by multiple nodes has relatively simple cache management; once writes are allowed, more complex mechanisms are required to ensure data consistency.

In summary, UB’s memory management mechanism offers a layered and flexible solution. It provides both a simple read/write mode that does not need to consider coherence, and a high-performance load/store mode that integrates deeply with the CPU ISA and supports caching, allowing applications to make the most appropriate trade-offs among usability, performance, and consistency according to their needs.

Cache Coherence

When multiple nodes can map the same segment of remote memory locally and cache it, cache coherence turns from an “option” into a “must-answer problem.” If mismanaged, the copies in different nodes’ caches may diverge, leading programs to read stale data and causing catastrophic outcomes. As a system intended to provide memory-level semantics, the UB protocol must offer a clear and reliable cache-coherence solution.

Designing a cache-coherence protocol is essentially a trade-off among performance, complexity, and consistency strength. The industry has developed various models with different strengths and implementations. Below we discuss several archetypal schemes and analyze them in light of UB’s design philosophy:

  1. Any node, strong consistency (dynamic sharer list)

This is the most idealized model: it allows multiple nodes to cache a copy of the data simultaneously while ensuring that the access experience is identical to accessing memory in a local multi-core processor (i.e., strong consistency). When a node modifies the data, the protocol must, via broadcast- or multicast-like mechanisms, either invalidate all other nodes’ cache copies (Invalidate), or propagate the update to them (Update). The key challenge is maintaining a dynamic sharer list (Sharer List). Since any node may join or leave at any time, the size and membership of this list are not fixed; managing such a dynamic list efficiently in hardware is highly complex and challenges scalability.

  2. Multiple readers, single writer (Multiple Readers, Single Writer)

This is the most common and classic coherence model in practice and the foundation of protocols such as MSI/MESI/MOESI in modern multi-core CPUs. It stipulates that at any moment a piece of data can be held in read-only caches by any number of nodes, but at most one node may hold write permission. When a node wants to write the data, it must first obtain the unique write permission, and the prerequisite for obtaining it is that all other read-only cache copies in the network must be invalidated (Invalidate). The concept of Ownership and the transitions among the three states Invalid/Read/Write described in the UB protocol documentation embody exactly this idea. This model is relatively simple to implement, delivers high performance in read-heavy scenarios, and strikes a good balance between performance and complexity.

  3. Exclusive access/ownership migration (Ownership Migration)

This is a special case of the “multiple readers, single writer” model. In this mode, only one node is allowed to access (cache) a memory segment at any given time. When a node (Borrower) needs access, it “borrows” ownership from the original holder (Home) and becomes the new Owner. During this period, the original Home node temporarily loses access to that memory until the Borrower “returns” it. This model is the simplest to implement because it completely avoids coherence among multiple replicas. It suits memory-borrowing scenarios, where the Home node “rents out” idle memory to form a large memory pool, and other nodes short on memory request memory from the pool to make up for local shortages.

  4. Limited nodes, strong consistency (fixed sharer list)

This is a simplification and compromise of the first ideal model. It also offers strong consistency but limits the number of nodes that can share a copy at the same time. Because the number of sharers has an upper bound, hardware can maintain it with a fixed-size list, greatly reducing design complexity. However, this model is not very practical, because it introduces an unnatural constraint at the application level and struggles to fit general, dynamically changing computing needs.

  5. Software-managed coherence (Explicit Cache Management)

This approach assigns part or all of the responsibility for maintaining coherence to software. Hardware still provides caching to accelerate access, but it does not automatically ensure coherence among caches on different nodes. When an application needs to ensure it reads the latest data, software must explicitly perform a refresh cache (or invalidate) operation to proactively discard potentially stale local cache. When the application modifies data and wants it visible to other nodes, it must explicitly perform a write back (or flush) operation to write the local cache back to main memory. This model gives software maximum flexibility but demands a lot from programmers and is error-prone.

  6. Non-cacheable mode (Non-cacheable)

This is the simplest and most straightforward “coherence” scheme: no cache, hence no coherence to worry about. As mentioned above, UB’s read/write transactions fall into this mode. Every access goes directly to the target main memory, ensuring that what is read is always the freshest data. The cost is that applications must implement data movement themselves—moving data from the Home node to the local node—in order to enjoy the efficiency gains that caching brings.

The design of the UB protocol is the result of seeking the best balance among the above possibilities. Centered on the “multiple readers, single writer” ownership model, it provides strong hardware coherence guarantees for load/store cached accesses, while retaining the non-cacheable read/write path as a complement, enabling different applications to find the best balance between consistency and performance for their needs.

The killer app for the memory pool: KV Cache

When we first conceived a UB-based memory pool five years ago, we held a powerful solution in our hands but kept searching for a problem that truly fit it. We envisioned using UB to aggregate the memory of thousands of servers to build an unprecedented, unified, massive memory pool in which any data could be accessed with latency close to local memory. This was technically exciting, but a practical question lingered: “What kind of application would actually need such a huge, ultra-high-performance shared memory pool?”

The explosion of LLM inference services brought the challenge of KV cache. When generating text, LLMs need to cache vast intermediate states (i.e., KV Cache), often tens to hundreds of GB in size—far beyond a single GPU’s VRAM. More crucially, this data must be accessed at high frequency during each token’s generation, making it extremely sensitive to latency and bandwidth. Suddenly, all the ideas we proposed five years ago—huge capacity, ultra-low latency, efficient sharing—found a perfect application in the KV Cache problem.

1. Prefill-Decode separation

An LLM processes a request in two stages:

  • Prefill stage: It takes the user’s prompt, dialog history, or Agent tool-call traces, and computes the KV Cache for all tokens in parallel. This is a compute-intensive (Compute-bound) process.
  • Decode stage: It generates new tokens one by one. For each token, it needs to read the full KV Cache (including the prompt and all previously generated tokens). This is a memory-bandwidth-intensive (Memory-bound) process.

Because the computational characteristics of these two stages differ dramatically, large-scale LLM inference frameworks commonly adopt Prefill-Decode (PD) separation scheduling. The system batches many Prefill requests into one large batch for computation, while aggregating Decode requests into another batch. This separation can significantly improve GPU utilization and overall system throughput.

2. Prefix KV Cache

In many application scenarios, different user requests often contain the same “prefix” (Prefix). For example, in multi-turn dialogs and Agents, subsequent requests entirely include the previous conversation history or tool invocation history.

Recomputing the KV Cache for these identical prefixes is a huge waste. Prefix Caching arises to address this. Its core idea is to store the computed prefix KV Cache in a global memory pool. When a new request arrives, the system checks whether its prefix matches an entry in the cache. If it matches, it directly finds the shared KV Cache from the pool and then continues computation from the end of the prefix. This greatly reduces the time to first token (Time to First Token, TTFT) and saves considerable compute resources.

This memory-pool-based Prefix Caching mechanism is essentially a form of cross-request reuse of computation results. The global memory pool and low-latency memory borrowing advocated by the UB protocol provide an ideal hardware foundation for implementing an efficient, cross-server global KV Cache pool.
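A minimal sketch of what such a prefix lookup against a global pool might look like; the block size, hashing scheme, and the GlobalKVPool interface are all hypothetical stand-ins, not the actual UB or inference-framework API:

```python
# Minimal sketch of prefix KV Cache reuse against a global pool (hypothetical interface).
import hashlib

BLOCK = 16  # tokens per cache block (illustrative)

class GlobalKVPool:
    def __init__(self):
        self.blocks = {}          # block-chain hash -> handle of KV data in the pool

    def _key(self, tokens):
        return hashlib.sha256(str(tokens).encode("utf-8")).hexdigest()

    def put(self, tokens, handle):
        self.blocks[self._key(tokens)] = handle

    def longest_prefix(self, tokens):
        """Return (matched_token_count, handles) for the longest cached prefix."""
        handles, matched = [], 0
        for end in range(BLOCK, len(tokens) + 1, BLOCK):
            h = self.blocks.get(self._key(tokens[:end]))
            if h is None:
                break
            handles.append(h)
            matched = end
        return matched, handles

pool = GlobalKVPool()
pool.put(list(range(16)), handle="kv-block-0")       # a previous turn's prefix
matched, handles = pool.longest_prefix(list(range(40)))
print(matched, handles)   # -> 16 ['kv-block-0']: prefill resumes from token 16
```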

At a deeper level, the success of the KV Cache may be one of the most fundamental contributions from computer systems to the field of AI. The attention mechanism in Transformer models can be viewed as a novel, differentiable key–value store. In this paradigm, the query vector is the “key” we are looking up, while every token in the context provides its own key and value. Unlike traditional systems that perform exact, discrete matches via hash tables, map[key] -> value, attention performs a fuzzy, continuous “soft” match (the “soft” in softmax is precisely this). It computes the similarity (attention score) between the current query and all keys, and then takes a weighted sum of all values according to these scores. This amounts to reading the entire database at once, weighted by relevance.
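A few lines of plain numpy make the analogy concrete. This is just standard single-head attention, nothing UB-specific:

```python
# Attention as a "soft" key-value lookup: a weighted read over all values,
# instead of an exact hash-table hit. Single head, no masking or batching.
import numpy as np

d = 4
rng = np.random.default_rng(0)
K = rng.normal(size=(8, d))      # one key per context token
V = rng.normal(size=(8, d))      # one value per context token
q = rng.normal(size=(d,))        # the query we are "looking up"

scores  = K @ q / np.sqrt(d)                     # similarity of q to every key
weights = np.exp(scores) / np.exp(scores).sum()  # the "soft" in softmax
out     = weights @ V                            # read the whole "database", weighted
print(weights.round(2), out.round(2))
```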

Summary: Load/Store and Read/Write

In summary, the Load/Store and Read/Write semantics provided by UB are not redundant; they are two abstractions, each necessary for different scenarios.

  • Load/Store delivers ultra-low latency and programming transparency, seamlessly integrating remote memory into the processor’s native instruction set. It is a powerful tool for building high-performance, fine-grained shared-memory applications. However, it also increases hardware implementation complexity.
  • Read/Write offers a more traditional and flexible asynchronous access model. It decouples software and hardware, simplifies the consistency model, and is more suitable for bulk data movement and scenarios less sensitive to latency.

URMA: Unified Remote Memory Access

At this point, we have explored many key decisions behind UB’s design: from the “peer architecture” that breaks the CPU bottleneck, to the “Jetty connectionless model” that addresses large-scale scalability challenges, to the “weak transaction ordering” and “Load/Store semantics” that optimize performance.

URMA (Unified Remote Memory Access, 统一远程内存访问), proposed by Dr. Kun Tan, Director of the Distributed and Parallel Software Lab, is the top-level concept that fuses all of these design philosophies. Concretely, URMA is the collection of unified programming abstractions and core semantics that the UB protocol provides to upper-layer applications.

URMA was born from a profound insight into future computing paradigms. In future data centers and high-performance computing clusters, heterogeneous compute units such as CPUs, GPUs, and NPUs will collaborate as peers to handle complex computational tasks. To fully unleash the potential of such heterogeneous compute power, the underlying communication protocol must satisfy several stringent demands:

  1. Direct communication among heterogeneous compute: It must allow different types of compute units to communicate directly as peers, bypassing traditional master–slave bottlenecks, thereby enabling efficient parallelism and collaboration on fine-grained tasks.
  2. Extreme scalability: The protocol must efficiently support communication from within a single node to ultra-large-scale clusters, easily handling interconnection needs across tens of thousands of nodes.
  3. Maximized network efficiency: The protocol needs built-in flexible routing and transport mechanisms—such as multipath and out-of-order transport—to fully utilize expensive data center network bandwidth and ensure real-time performance.

URMA is the answer tailored precisely to these three demands. It aims to efficiently and reliably complete communication between any two UB entities, whether one-sided memory access (DMA) or two-sided message send/receive. The key characteristics we discussed earlier are ultimately reflected in URMA’s design:

  1. Peer-to-Peer Access: This is the cornerstone of URMA. Any heterogeneous compute device can use URMA to achieve direct communication without CPU involvement, echoing our original vision of a “peer architecture.”
  2. Inherently Connectionless: Through the Jetty abstraction, URMA lets applications directly reuse the reliable services provided by the UB transport layer, without establishing and maintaining end-to-end connection state. This fundamentally resolves the scalability issues of traditional RDMA and is key to supporting ultra-large-scale deployments.
  3. Flexible Ordering (Weak Ordering): URMA allows applications to configure ordering behavior according to their own needs. By default it permits out-of-order execution and out-of-order completion, which not only avoids head-of-line blocking but also unleashes the underlying hardware’s vast potential for multipath transport and parallel processing, significantly improving end-to-end efficiency.

In short, URMA is not just an API or a protocol specification; it represents a future-oriented, unified communication paradigm for heterogeneous computing. It combines the high performance of buses with the flexibility of networks, and through a series of innovative designs shields upper-layer applications from hardware complexity, ultimately providing a simple, efficient, and highly scalable programming interface. This is the ultimate meaning of the name Unified Bus “统一总线.”

Of course, as with any paradigm shift, URMA’s unified abstraction does not come without cost. It pushes some complexities previously handled by operating systems and software (such as memory management and consistency) down into hardware, bringing significant hardware design challenges. At the same time, it exposes more options and responsibilities to upper-layer applications, such as choices around ordering and load-balancing strategies. The victory of a new paradigm does not lie in solving all old problems without introducing any new ones, but in addressing the core contradiction—in UB’s case, the contradiction between ‘performance’ and ‘scale’—that is the most urgent and important issue of our time.

Congestion Control

The history of congestion control is, in essence, a history of our struggle against the “queue” demon.

Originally, the designers of TCP/IP considered the network “best-effort,” with packet loss as the signal of congestion. Therefore TCP adopted the classic Additive Increase, Multiplicative Decrease (AIMD) algorithm: gradually increase the send window when no packet is lost; once a packet is lost, cut the window in half aggressively. This model achieved great success on the wide-area Internet, but its fundamental problem is that it is an “after-the-fact” control. It relies on queues filling up and ultimately dropping packets to sense congestion, which inevitably leads to two issues: 1) high latency, the so-called “Bufferbloat”; and 2) severe oscillations in network utilization.
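Schematically, the AIMD rule fits in a few lines; this is a per-RTT toy, illustrative only:

```python
# The classic AIMD loop in a few lines (schematic, per-RTT granularity).
def aimd(cwnd, loss_detected, add=1.0, mult=0.5):
    """Additive Increase, Multiplicative Decrease on the congestion window."""
    return max(1.0, cwnd * mult) if loss_detected else cwnd + add

cwnd, trace = 1.0, []
for rtt in range(20):
    loss = cwnd > 16          # pretend loss appears once the pipe overflows
    cwnd = aimd(cwnd, loss)
    trace.append(round(cwnd, 1))
print(trace)                  # the familiar sawtooth: climb slowly, halve on loss
```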

To address this, Active Queue Management (AQM) was proposed, represented by RED (Random Early Detection). The idea of RED is: don’t wait until the queue is full to drop packets; instead, as the queue length begins to increase, randomly drop packets with a certain probability to proactively signal congestion to the sender. This alleviates Bufferbloat to some extent, but “packet loss” as a congestion signal is still too crude.

A gentler approach is Explicit Congestion Notification (ECN). ECN allows routers to mark the packet header when they detect that queues are building up, instead of dropping the packet. Upon receiving this mark, the sender knows there is congestion in the network and proactively reduces the sending rate. ECN avoids unnecessary packet loss and retransmission and is standard in modern congestion control.

However, in data centers where ultra-low latency and high throughput are paramount, these TCP-based congestion control schemes are still not fine-grained enough. RDMA-class applications in particular, which are extremely sensitive to packet loss, need faster and more precise congestion control. RoCEv2 networks therefore adopted DCQCN (Data Center Quantized Congestion Notification), which combines ECN with rate-based control: after switches mark congestion, the NIC quickly reduces its sending rate in quantized steps, achieving faster convergence and lower queue occupancy.

UB’s C-AQM (Confined Active Queue Management) pushes this fine-grained control to the extreme. DCQCN is still a passive “congest first, then slow down” mode, whereas the core idea of C-AQM is “allocate on demand, grant proactively,” aiming for near-zero queues. What makes this possible is Huawei’s “end–network coordination” capability: as an end-to-end (NIC + switch) network equipment provider, it can co-design both ends of the feedback loop.

C-AQM’s operation embodies the finesse of this coordination:

  1. Sender (UB Controller) proactively requests: When sending data, the sender can set the I (Increase) bit in the packet header to 1 to request more bandwidth from the network.
  2. Switch (UB Switch) provides precise feedback: Upon receiving this request, the switch evaluates congestion on its egress port. If it deems that increasing bandwidth would cause congestion, it not only sets the C (Congestion) bit in the header to reject the request, but also provides a suggested bandwidth value in the Hint field. This Hint is computed by the switch based on precise queue and bandwidth utilization; it tells the sender: “You can’t increase further; you should adjust your rate to this suggested value.”
  3. Sender responds quickly: After receiving this feedback containing the precise Hint, the sender can immediately adjust its sending rate to the level suggested by the switch.

This process is like a smart traffic control system. The driver (sender) wants to accelerate and first asks the traffic cop (switch) ahead via the signal light (the I bit). Based on real-time traffic at the intersection, the cop not only turns on the red light (the C bit) to say “no,” but also tells the driver over the radio (the Hint field) to “hold at 30 km/h.”

Through this “request–precise feedback” closed loop, C-AQM precisely matches the sender’s rate to the service capacity the network can provide, so packets “arrive and depart immediately,” and switch queues remain at an extremely low, near-zero watermark. This not only eliminates the high latency caused by Bufferbloat, but also maximizes effective network throughput. This near-zero-queuing design philosophy is one of the key cornerstones enabling UB’s microsecond-level end-to-end latency.
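A schematic of this request–feedback loop might look like the following. The I/C bits and the Hint field come from the description above; the units, thresholds, and the switch's headroom arithmetic are invented stand-ins, not the real C-AQM algorithm:

```python
# Schematic sender/switch exchange in the C-AQM "request - precise feedback" loop.
LINK_CAPACITY = 100.0   # Gbps, hypothetical egress port

def switch_feedback(request_increase, current_load):
    """Switch evaluates its egress port and answers an I-bit request."""
    headroom = LINK_CAPACITY - current_load
    if request_increase and headroom < 5.0:          # would congest: refuse
        return {"C": 1, "Hint": max(1.0, headroom)}  # and suggest a safe rate
    return {"C": 0, "Hint": None}

def sender_adjust(rate, feedback, step=2.0):
    if feedback["C"]:
        return feedback["Hint"]      # jump straight to the suggested rate
    return rate + step               # request granted: ramp up

rate, load_from_others = 10.0, 88.0
for _ in range(6):
    fb = switch_feedback(request_increase=True,
                         current_load=load_from_others + rate)
    rate = sender_adjust(rate, fb)
    print(round(rate, 1))
# The sender tracks the switch's Hint instead of overshooting into a queue.
```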

Reliable Transport

When building any reliable network protocol, how to handle packet loss is a central topic. The textbook TCP/IP solution—“slow start, congestion avoidance, fast retransmit, fast recovery”—is well known. However, in data center networks, Equal-Cost Multipath routing (ECMP) is widely used to maximize bandwidth utilization. Traffic is spread across multiple physical paths, which inevitably causes arrival order to differ from send order. A fundamental contradiction emerges: how can we accurately detect loss in a network full of “orderly disorder”?

Traditional fast retransmit hinges on the logic that “upon receiving three duplicate ACKs, a packet is considered lost.” This assumption is basically reasonable on a single path, but in an ECMP environment such reordering easily fools the mechanism, leading to a large number of spurious retransmissions. The network didn’t actually drop packets; some packets just took a shortcut and arrived earlier. Yet the protocol mistakenly assumes loss and triggers unnecessary retransmissions, wasting precious bandwidth and potentially even causing real congestion by injecting extra traffic.

UB’s Load Balancing Mechanism

Unlike traditional networks that rely on ECMP hashing and a hit-or-miss style of load balancing, UB hands the choice to applications, letting them make informed trade-offs between performance and ordering requirements at different granularities. UB supports two levels of load balancing:

1. Transaction-level load balancing: the “convoy” mode based on TPG

UB introduces the concept of TPG (Transport Protocol Group). You can think of a TPG as a logistics company responsible for transporting batches of “cargo” (i.e., transactions) from point A to point B. To increase capacity, the company can use multiple highways simultaneously, with each highway being a TP Channel.

When a transaction (for example, a large RDMA Write) needs to be sent, the TPG selects a TP Channel for it. Once selected, all TP Packets of this transaction will be transmitted on this fixed “highway.” This model is like a massive convoy where all vehicles stay in the same lane.

The great advantage of this transaction-level load balancing lies in its simplicity and ordering. Since all packets of the same transaction travel along the same path, they inherently arrive in order, fundamentally avoiding packet-level reordering. This allows the upper-layer reliable transport protocol to confidently use the most efficient fast retransmit mechanism, because any “duplicate ACK” signal is highly likely to indicate an actual loss rather than reordering. This is a safe, stable, and easy-to-manage parallel strategy suitable for most scenarios requiring reliable transport.

2. Packet-level load balancing: the “racing” mode for extreme performance

For applications that pursue extreme network utilization and minimal latency, UB provides a more aggressive packet-level load balancing mechanism. In this mode, the system allows packets from the same transaction to be “scattered” across multiple TP Channels, and even, by modifying the LBF field in the packet header, to be dynamically directed to different physical paths at the switch layer.

It’s like a highway race: to reach the finish line fastest, each race car (TP Packet) can freely choose the least congested lane and dynamically overtake and change lanes.

This mode can maximally “fill” all available bandwidth in the network, achieving unparalleled throughput. However, it inevitably brings a “side effect”: reordering. Later packets may well arrive ahead of earlier ones by taking a faster path.
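The contrast between the two granularities can be sketched as follows. The channel names and the "least-loaded channel" heuristic are illustrative placeholders, not the actual TPG selection logic:

```python
# Contrast of the two load-balancing granularities described above (schematic).
CHANNELS = ["TP-Channel-0", "TP-Channel-1", "TP-Channel-2", "TP-Channel-3"]

def transaction_level(tx_id, packets):
    """'Convoy' mode: pick one TP Channel per transaction, pin every packet to it."""
    ch = CHANNELS[hash(tx_id) % len(CHANNELS)]
    return [(pkt, ch) for pkt in packets]

def packet_level(packets):
    """'Racing' mode: each packet independently takes the least-loaded channel."""
    load = {ch: 0 for ch in CHANNELS}
    out = []
    for pkt in packets:
        ch = min(load, key=load.get)       # stand-in for real congestion feedback
        load[ch] += 1
        out.append((pkt, ch))
    return out

pkts = [f"pkt-{i}" for i in range(6)]
print(transaction_level("txA", pkts))   # all six packets share one channel: in order
print(packet_level(pkts))               # packets spread across channels: may reorder
```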

Retransmission under multipath

Under multipath transmission, determining “packet loss” must be more cautious and intelligent; a one-size-fits-all approach won’t work. Therefore, we did not design a single universal retransmission algorithm. Instead, we hand the decision back to users as a combination of choices along two dimensions:

  1. Retransmission scope: “one errs, all are punished,” or a “precise strike”?

    • GoBackN (GBN): A simple and classic strategy. Once loss is detected, the sender retransmits the lost packet and all subsequently sent packets after it. The upside is simplicity, with minimal state required at the receiver. But in high-latency, high-bandwidth, and high-loss networks, it may retransmit many packets that actually arrived correctly, leading to inefficiency.
    • Selective Retransmission: A more fine-grained strategy. The sender retransmits only those packets confirmed lost. The receiver must maintain more complex state (e.g., a bitmask) to inform the sender which packets were received and which were not. This approach is the most efficient but also more complex to implement.
  2. Triggering mechanism: “rush headlong,” or “think twice”?

    • Fast Retransmit: Similar to TCP, it uses redundant acknowledgments (e.g., error responses in UB) to trigger retransmission quickly, without waiting for a full timeout period. Its advantage is fast response, which can significantly reduce latency upon loss. But as mentioned earlier, it is very sensitive to reordering.
    • Timeout Retransmit: The most conservative and reliable mechanism. The sender starts a timer for each packet sent; if no acknowledgment is received before the preset time (RTO), the packet is considered lost and retransmitted. Its advantage is that it ultimately covers all loss scenarios and is unaffected by reordering. The downside is that RTO calculation is typically conservative, and waiting for timeout introduces longer latency.

By combining strategies along these two dimensions, UB provides four retransmission modes to suit different network scenarios:

  • GoBackN + Fast Retransmit. Scenario: single-path transport (e.g., per-flow load balancing); loss rate: very low. Rationale: this is the most classic and efficient mode. When the network path is stable and reordering risk is minimal, the rare losses should be repaired as quickly as possible.
  • GoBackN + No Fast Retransmit. Scenario: multipath transport (e.g., per-packet load balancing); loss rate: very low. Rationale: when the network itself introduces substantial reordering (e.g., ECMP), fast retransmit, which is sensitive to reordering, must be disabled; reliability relies entirely on timeouts, avoiding spurious retransmission storms.
  • Selective Retransmission + Fast Retransmit. Scenario: single-path transport (e.g., per-flow load balancing); loss rate: low. Rationale: in a stable single-path network where loss becomes non-negligible, selective retransmission offers clear efficiency gains over GBN by avoiding unnecessary retransmissions.
  • Selective Retransmission + No Fast Retransmit. Scenario: multipath transport (e.g., per-packet load balancing); loss rate: low. Rationale: this is the most complex yet most adaptable mode. For complex networks with both multipath reordering and some loss, it provides the most efficient and robust reliable transport.
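The practical difference between the two retransmission scopes is easy to see in a toy example; the PSNs and the bitmap encoding here are simplified stand-ins for the real protocol state:

```python
# What the sender retransmits under the two scope strategies (schematic).
def go_back_n(sent_psns, lost):
    """Retransmit the first lost PSN and everything sent after it."""
    first = min(lost)
    return [p for p in sent_psns if p >= first]

def selective(sent_psns, receiver_bitmap):
    """Retransmit only PSNs the receiver's bitmap reports as missing."""
    return [p for p in sent_psns if not receiver_bitmap.get(p, False)]

sent = list(range(10))                    # PSNs 0..9
lost = {3, 7}
bitmap = {p: (p not in lost) for p in sent}

print(go_back_n(sent, lost))              # [3, 4, 5, 6, 7, 8, 9] -> 7 packets resent
print(selective(sent, bitmap))            # [3, 7]                -> 2 packets resent
```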

Coordination between retransmission and transaction ordering

Earlier we discussed how UB uses “weak transaction ordering” to break unnecessary ordering shackles and maximize the efficiency of parallel transmission. A natural and profound question follows: if the system allows transactions to execute out of order, what is the relationship between a “retransmitted packet” of an “old” transaction and the packets of a “new” transaction?

We have a core design philosophy: Reliability at the transport layer and ordering at the transaction layer are orthogonal and decoupled by design.

  • Mission of the transport layer: In its world there are only TP Packets and their sequence numbers (PSN). Its sole goal is to use mechanisms such as GBN or selective retransmission to ensure that for a transaction, all TP Packets are ultimately delivered intact and correct from the Initiator’s transport layer to the Target’s transport layer. It is responsible for handling physical packet loss in the network, achieving “data must arrive.”
  • Mission of the transaction layer: In its world there are only transactions and their ordering tags (NO, RO, SO). Its work begins only after the transport layer confirms that all data of a transaction have been collected. Based on the transaction’s ordering tag, it decides when to execute this transaction that has just “assembled all its parts.” It is responsible for handling business-logic dependencies, achieving “ordering on demand.”

Let’s understand how this decoupling works through a concrete scenario:

  1. Send: The Initiator successively sends two transactions:

    • Transaction A (RO): split into three packets A1, A2, A3.
    • Transaction B (RO): split into two packets B1, B2.
  2. Loss: During network transmission, A2 is unfortunately lost. A1, A3, B1, and B2 all arrive at the Target successfully.

  3. Transport-layer response: The Target’s UB transport layer gets to work.

    • For transaction B, it received B1 and B2 and, by checking PSNs, found the data complete. It then hands the reassembled transaction B up to the transaction layer, declaring the task complete.
    • For transaction A, it received A1 and A3 but discovered via PSN or bitmap that A2 is missing. It quietly buffers A1 and A3 and, via TPNAK or by waiting for a timeout, informs the Initiator’s transport layer: “A2 is lost, please retransmit.”
  4. Transaction-layer decision: At this point, the Target’s transaction layer also begins to work.

    • It received the complete transaction B handed up from the transport layer.
    • It checks the ordering tag of transaction B, which is RO (Relaxed Order). This means B does not need to wait for any transactions sent before it.
    • Therefore, the transaction layer immediately dispatches transaction B for execution, completely ignoring transaction A, which is still anxiously awaiting the retransmitted packet A2.
  5. Retransmission and final execution:

    • Later, the retransmitted A2 packet finally arrives.
    • The Target’s transport layer merges it with the previously buffered A1 and A3, reassembles the complete transaction A, and delivers it to the transaction layer.
    • The transaction layer receives transaction A, checks its RO tag, and likewise dispatches it for immediate execution.

In this process, loss and retransmission of an “old” transaction (A) do not block the execution of a “new” transaction (B). This is the power of weak transaction ordering working in concert with reliable transport.

So how would Strong Order (SO) change all this?

If in the scenario above, transaction B is marked SO (Strong Order), then in step 4, “transaction-layer decision,” the situation is entirely different. When the transaction layer receives the complete transaction B, it checks its SO tag and realizes it must wait for all transactions before B (i.e., transaction A) to complete. Therefore, even if all of B’s data are ready, the transaction layer can only let it “stand by” until the retransmitted A2 arrives and transaction A completes, after which transaction B can execute.
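A toy model of this decoupling, reusing the A/B scenario above; the class and method names are invented for illustration, and real UB hardware obviously does not look like this:

```python
# Transport layer only reassembles; transaction layer decides when to execute.
from collections import defaultdict

class Target:
    def __init__(self):
        self.rx = defaultdict(set)   # tx_id -> received packet indices
        self.completed = []          # execution order seen by the application
        self.waiting_so = []         # SO transactions parked behind older ones

    def deliver(self, tx_id, pkt_idx, total, order, older_done=lambda: True):
        self.rx[tx_id].add(pkt_idx)                   # transport layer: buffer
        if len(self.rx[tx_id]) < total:
            return                                    # still missing packets
        if order == "SO" and not older_done():        # transaction layer: ordering
            self.waiting_so.append(tx_id)
            return
        self.completed.append(tx_id)

t = Target()
# A = 3 packets (A2 lost for now), B = 2 packets, both tagged RO.
t.deliver("A", 1, 3, "RO"); t.deliver("A", 3, 3, "RO")
t.deliver("B", 1, 2, "RO"); t.deliver("B", 2, 2, "RO")
print(t.completed)            # ['B']  -- B runs even though A is still incomplete
t.deliver("A", 2, 3, "RO")    # the retransmitted A2 finally arrives
print(t.completed)            # ['B', 'A']
```

If B were instead tagged SO, passing order="SO" with an older_done callback that checks whether A has completed would park B in waiting_so until the retransmitted A2 arrives, exactly as described above.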

In summary, this decoupled design in UB achieves extreme efficiency:

  • Problems at the network layer are solved at the network layer: The transport layer can aggressively use advanced techniques (such as selective retransmission) to combat loss most efficiently, without worrying that its behavior will interfere with the upper-layer business logic ordering.
  • Ordering at the business layer is decided by the business itself: The transaction layer can free itself from network details and focus on deciding, based on real application needs, whether to wait and what strength of ordering guarantees are needed, thus avoiding unnecessary “head-of-line blocking.”

This clear division of responsibility gives the system maximal parallelism and performance by default, while preserving the ability to build strict ordering for applications that require strong consistency. It is an elegant and efficient answer to the complexity of modern data center networks.

Deadlock avoidance

Deadlock avoidance at the data link layer

In any communication network that guarantees lossless delivery, deadlock is a specter that must be faced. When the network uses credit-based flow control or backpressure, and the dependency relationships among resources (such as buffers) form a cycle, a “circular wait” can occur in which every node holds resources while waiting for others to release theirs—this is a deadlock.

A typical scenario is: there are four switches in the network forming a cyclic dependency. The egress buffer of UB Switch 1 is full because it cannot send data destined for UB Switch 2; Switch 2’s buffer is also full because it is waiting for Switch 3; Switch 3 is waiting for Switch 4; and Switch 4 is waiting for Switch 1. All buffers are occupied, packets cannot flow, and the entire system stalls.

To prevent such catastrophic situations from occurring, we must break the conditions for circular wait at the design level. In UB’s design, the following classic deadlock-avoidance schemes are used:

  1. Routing-based deadlock avoidance

This method tackles the problem at its root by designing an acyclic routing algorithm so that deadlock simply cannot occur. A classic example is spanning-tree-based “Up/Down Routing.” First, select a root node in the network topology and construct a spanning tree. Routing is then constrained so that every legal path first travels “up” (toward the root) for zero or more hops and then “down” (away from the root) for zero or more hops; a packet is never allowed to turn back “up” after it has started going “down.” This simple restriction effectively breaks any potential routing cycles, thereby avoiding deadlock. The advantage of this method is simplicity and efficiency; the downside is that it may not fully utilize all available paths in the network, sacrificing some flexibility and performance. A minimal legality check for this rule is sketched after this list.

  2. Virtual Channels/Lanes (VC/VL)

This is the most commonly used and most flexible deadlock-avoidance mechanism in modern high-performance networks (such as InfiniBand). The core idea is to partition each physical link into multiple logically independent virtual channels (VCs). Each VC has its own dedicated buffer resources.

Although the physical topology may contain cycles, we can construct an acyclic “VC dependency graph” by carefully designing the rules for VC usage. For example, we can divide VCs into different “levels” and require that packets only transition from lower-level VCs to higher-level VCs. When a packet circulates around a loop, each transition must enter a higher-level VC. Since the number of VC levels is finite, the packet will eventually be unable to find a higher-level VC and thus cannot continue, breaking the circular wait. The VC mechanism decouples resource dependencies from the physical link level down to the finer-grained virtual channel level, greatly improving routing flexibility and network resource utilization.

  3. Timeout-based deadlock recovery

Unlike the previous two “avoidance” strategies, this is a “detect and recover” strategy. The system sets a timer for each packet. If a packet remains in a buffer longer than a certain threshold, the system assumes a deadlock may have occurred. Once a deadlock is detected, measures are taken to break it; the simplest and most direct approach is to drop one or more “old” packets, freeing their occupied buffers so other packets can proceed. This method is usually used as a complement to other deadlock-avoidance mechanisms, serving as a final safety net because it compromises the lossless nature of the network.
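As promised above, here is a minimal legality check for the Up/Down rule. It assumes we only know each node's hop distance from the spanning-tree root, and for simplicity treats sideways hops between nodes at the same depth as "down"; real Up/Down implementations break such ties with node identifiers:

```python
# Legality check for Up/Down routing (sketch): a route may climb toward the
# root and then descend, but must never go up again after it has gone down.
def is_legal_updown(path, depth):
    """path: list of node ids; depth[n]: hop distance from the spanning-tree root."""
    descended = False
    for a, b in zip(path, path[1:]):
        going_up = depth[b] < depth[a]
        if going_up and descended:
            return False          # up after down: this is what creates cycles
        if not going_up:
            descended = True      # same-depth hops count as "down" here
    return True

depth = {"root": 0, "s1": 1, "s2": 1, "leaf1": 2, "leaf2": 2}
print(is_legal_updown(["leaf1", "s1", "s2", "leaf2"], depth))   # True: up, then down
print(is_legal_updown(["s1", "leaf1", "s1", "s2"], depth))      # False: down, then up
```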

Deadlock avoidance in memory access

In a complex system like UB, a seemingly simple memory access can hide a chain of complex side effects. A primary memory operation (such as a Load instruction) may trigger secondary memory operations (such as handling page faults or address translation during memory borrowing). When circular dependencies in resources or processes form between these primary and secondary operations, the system may fall into deadlock.

Here are three typical deadlock scenarios in memory access:

  1. Memory Pooling/Borrowing: In a peer architecture, each UBPU is both a “memory consumer” and a “memory provider.” Deadlock can arise when two nodes borrow memory from each other. For example, node A borrows memory from node B while node B borrows memory from node A. When both A and B need to update the peer memory via Writeback and wait for the other’s acknowledgment (TAACK), if the acknowledgments are mutually blocked due to resource contention, a deadlock results.
  2. Page Table Access: When a memory access requires address translation via the UMMU, and the UMMU’s page table entries are themselves stored in remote, borrowed memory, the secondary operation of reading those entries must again issue a remote memory access through the same port. This can contend with the primary memory access and cause deadlock.
  3. Page Fault Handling: UB supports dynamic memory management, meaning memory accesses may trigger page faults. Handling a page fault may require accessing external storage or retrieving data from another UBPU. If the secondary operation for handling the fault forms a hardware-level dependency with the primary operation it serves, a deadlock may occur.

To address these complex deadlock scenarios, UB provides a toolkit:

  • Request retry: Allow operations to fail under resource constraints and be retried by upper layers.
  • Virtual channel isolation: Assign different virtual channels to different traffic types (e.g., primary access, page-table access, page-fault handling) to break resource dependency cycles at the hardware level.
  • Transaction-type differentiation: Distinguish between transaction types and apply different handling policies.

In addition, implementers can adopt simpler strategies—such as ensuring page tables are always stored locally—to fundamentally avoid certain deadlock scenarios.

Deadlock avoidance in message communication

UB’s bilateral message communication, such as Send/Receive on Jetty, is queue-based. When queue resources are insufficient, message communication becomes blocked. If message queues on different UBPUs form a closed dependency cycle by sending messages to each other, a deadlock can occur.

For example, both node A and node B are sending a large number of Send transactions to each other. Because many other nodes may also be sending requests to A and B, the receive queues (Jetty) of both A and B become full. At this point, A and B both send “Receiver Not Ready” (RNR) acknowledgments (TAACK) to each other. If A’s TAACK to B is blocked by A’s Send data flow to B, and B’s TAACK to A is simultaneously blocked by B’s Send data flow to A, neither side can process the other’s RNR message, nor can they free their own receive queues, resulting in deadlock.

To avoid such message-communication deadlocks, UB provides three basic mechanisms:

  1. Separation of transport and transaction layers: This is a key decoupling design. Even if upper-layer message transactions are blocked due to insufficient functional resources (e.g., busy application logic), the underlying transport protocol layer can continue to run independently without being blocked. This prevents a single point of application-layer congestion from spreading into large-scale, link-level network backpressure.
  2. Transaction-layer responses return resource status: When the transaction layer cannot process a request due to resource shortage, it explicitly returns the resource status (e.g., “busy”) to the initiator via a response message. Upon receiving such a response, the initiator can decide to retry or take other actions, avoiding deadlock waits on the network link.
  3. Timeout mechanism: Set timeouts for message communication. If an operation does not complete for a long time, the system deems it failed and releases the resources it holds. This is a final safeguard to ensure that even if a deadlock occurs, the system can recover on timeout and keep links unblocked.

URPC: Remote Procedure Call designed for heterogeneous hardware

Up to this point, we have built a solid foundation for the Unified Bus. URMA provides powerful, peer-to-peer remote memory access capabilities. Whether deeply integrated Load/Store with the CPU instruction set or more flexible asynchronous Read/Write, both pave the way for data exchange between hardware units. However, this memory-semantics abstraction is still too low-level for application developers. It’s like giving developers a powerful set of assembly instructions but no high-level language compiler.

When application developers think about business logic, the most natural model is “function calls,” not “memory reads/writes.” We need a way to wrap UB’s powerful memory-level communication capabilities into a higher-level abstraction that is easier to understand and use. This is the original motivation behind URPC (Unified Remote Procedure Call).

The renaissance of RPC: from software services to hardware functions

Traditional RPC frameworks, such as gRPC and Thrift, have long been the cornerstone of distributed software development. Their core idea is to make cross-machine function calls as simple as local calls. However, the design philosophy of these frameworks is deeply rooted in a “CPU-centric” worldview:

  1. The communicating parties are software processes: Both the initiator and executor of RPC are software services running on CPUs.
  2. The data path depends on the operating system: All data send/receive must go through the kernel TCP/IP stack, incurring unavoidable copy and context-switch overhead.
  3. Parameter passing is dominated by pass-by-value: Without shared memory, all parameters—no matter how large—must be serialized, copied, transmitted over the network, and deserialized.

In UB’s vision of heterogeneous, peer computing, this traditional model struggles. What we need to call is no longer just a software function on another server, but potentially:

  • An application on a CPU needs to invoke a hardware-accelerated operator on an NPU.
  • A kernel on a GPU needs to invoke a data-shaping function on a remote memory controller.
  • A data processing unit (DPU) needs to invoke a network function on another DPU in the cluster.

URPC’s core mission is to provide a standard “function call” abstraction for high-performance, fine-grained, direct communication among heterogeneous hardware. It is not merely a communication protocol between software components, but a “functionality-layer” protocol for cooperation among heterogeneous hardware units.

Pass-by-reference: when pointers can cross machines

The most revolutionary aspect of URPC’s design is its native, efficient support for pass-by-reference. This would be unimaginable in the traditional RPC world, but within UB’s framework it is perfectly natural.

We can achieve this because a URPC “reference” is not an ordinary virtual address meaningful only within a local process. It is a globally valid UB address. When one UBPU (e.g., a CPU) initiates a URPC call to another UBPU (e.g., a GPU) and passes a reference to a data structure, it is handing over a “key” with unobstructed access. Once the remote GPU hardware receives this address, it can, without CPU intervention, directly use the underlying URMA Load/Store hardware instructions to cross the network and access the caller’s memory.

The value of this capability is immense. Imagine an AI training job that needs to call a function to process hundreds of gigabytes of model weights or datasets, while the function only needs to read or modify a small portion of that data.

  • In traditional RPC, this implies copying hundreds of gigabytes of data. To reduce copies, the caller would have to “carve out” exactly the data the callee needs to access, creating architectural coupling between caller and callee.
  • In URPC, we only need to pass a reference to the entire data structure. The callee can fetch precisely the subset it needs, on demand, thereby avoiding massive, unnecessary data movement.

Even more interestingly, this design opens the door to finer-grained performance optimizations. If the higher-level programming language or API can distinguish between a read-only reference (&T) and a writable reference (mut &T), URPC can propagate this information all the way down to the UB hardware. When faced with a read-only reference, the hardware knows the data will not be modified, so it can confidently enable more aggressive caching strategies without worrying about the overhead of complex cache-coherence maintenance.
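To make the contrast concrete, here is a conceptual sketch with made-up types and function names (GlobalRef, call_by_value, call_by_reference). It is not the real URPC API; it only illustrates what changes once a reference is valid across machines:

```python
# Conceptual contrast between pass-by-value and pass-by-reference RPC.
from dataclasses import dataclass

@dataclass
class GlobalRef:
    ub_addr: int        # globally valid UB address (hypothetical encoding)
    length: int
    writable: bool      # a read-only ref lets the callee cache more aggressively

def call_by_value(func_name, payload: bytes):
    # Traditional RPC: serialize and copy the whole payload, however large.
    return {"func": func_name, "bytes_on_wire": len(payload)}

def call_by_reference(func_name, ref: GlobalRef):
    # URPC-style: only the reference travels; the callee Loads what it needs.
    return {"func": func_name, "bytes_on_wire": 24}   # roughly addr + len + flags

weights = bytes(200 * 2**20)          # stand-in for a 200 MiB tensor
print(call_by_value("quantize", weights))                       # copies 200 MiB
print(call_by_reference("quantize",
                        GlobalRef(ub_addr=0x4000_0000,
                                  length=len(weights),
                                  writable=False)))             # copies ~24 bytes
```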

Pass-by-value: when copying is unavoidable

Of course, pass-by-reference is not a silver bullet. In many scenarios, we still need traditional pass-by-value. For example:

  • Heterogeneous data structures: When the call occurs between two different programming languages or hardware architectures, their definitions of data-structure memory layouts may be entirely different. In this case, format conversion and data copying are required.
  • Small parameters: For some very small parameters (such as configuration items and scalar values), the overhead of establishing a remote memory mapping and then reading via Load/Store may be greater than simply packaging the data in the request and sending it all at once.

URPC fully recognizes this, so it also provides efficient support for pass-by-value scenarios. But the “efficient” here is fundamentally different from protobuf or JSON serialization in traditional RPC frameworks. URPC’s serialization/deserialization (SerDes) mechanism is designed for hardware. Its goal is an extremely minimal format and the lowest computational complexity, so that this process can be maximally offloaded to hardware, thereby freeing precious CPU resources from tedious data packing/unpacking.

UB Supernode: The world’s third pole of the large-model software–hardware ecosystem

Up to this point, we have explored in depth the long road of Unified Bus—from design philosophy to key technical implementations. However, any technology, no matter how exquisite its design, must ultimately be validated in a concrete, tangible system to reveal its true value. That system is the supernode (SuperPoD) based on the UB architecture. It is not merely a product; it is the ultimate answer to all our initial thinking, arguments, and persistence.

Looking back at the origin of the UB project, our original dream was to break down the clear divide between buses and networks and create a new interconnect paradigm that combines bus-level performance with network-level scale. We firmly believe that future computing models will inevitably require us to view an entire data center’s resources—compute, memory, storage—as a unified whole, constructing a logically “giant computer.”

At the time, many regarded this idea as a pipe dream. The mainstream view was that intra-node interconnect within an 8-card server was sufficient, and cross-node communication did not require such extreme performance. However, when the Scaling Law became the irrefutable “law of physics” in AI, people finally realized that single-node compute had already hit the ceiling; tens of thousands of processors must collaborate with unprecedented efficiency, and the interconnect that links them became the decisive factor for the success or failure of the entire system.

It was in this context that the UB supernode came into being. It is no longer a paper protocol, but a large-scale computing system subjected to extensive validation in production scenarios. Through a series of key technical features, the architects of the UB supernode turned our vision into reality:

  1. Large-scale networking: This is the supernode’s core capability. To support ultra-large-scale model training, the supernode must break through the scaling bottlenecks of traditional networks. We designed the UB-Mesh networking architecture, whose core is a topology called nD-FullMesh. It fully leverages the traffic locality of AI training workloads and connects a massive number of nodes through high-density, short-reach direct links at extremely low cost and latency. Building on this, with a hybrid topology of 2D-FullMesh and Clos, the supernode can achieve over 90% linear scaling efficiency at 8,192 cards, and it reserves interfaces for a future UBoE (UB over Ethernet) solution that scales to million-card clusters.
  2. Bus-class interconnect: On top of massive scale, the supernode still maintains bus-class extreme performance. UB provides hundreds-of-nanoseconds synchronous memory-semantic access (for latency-critical Load/Store instructions) and 2–5 microseconds asynchronous memory-semantic access (for large-block Read/Write), with inter-node bandwidth reaching the TB/s level.
  3. Full pooling and peer collaboration: Under UB’s connectivity, all resources of the entire supernode—whether NPU and CPU compute, or DRAM memory and SSD storage (SSU)—are aggregated into a unified resource pool. More importantly, these resources are peers: any NPU can directly access another node’s memory, bypassing the CPU to achieve decentralized collaboration.
  4. Protocol unification: Underpinning all this is UB’s unified protocol stack and programming model. It eliminates the conversion overhead and management complexity caused by multiple coexisting protocols such as PCIe, Ethernet, and InfiniBand in traditional architectures, enabling upper-layer applications to efficiently harness the entire cluster’s heterogeneous resources with a single unified set of semantics.
  5. High availability: A system with tens of thousands of optical modules faces enormous reliability challenges. UB addresses this with layered reliability mechanisms: at the link layer, LLR (Link Layer Retry) handles transient bit errors; at the physical layer, lane degradation and 2+2 optical module redundancy enable service-transparent recovery from faults; at the transport layer, end-to-end retransmission is the last line of defense. Together, these mechanisms ensure that at the ultra-large scale of 8,192 cards, the optical interconnect achieves a mean time between failures (MTBF) exceeding 6,000 hours.

Globally, hardware ecosystems capable of supporting ultra-large-scale AI model training and inference are few, because this requires deep collaboration across chips, networks, and even operating systems—a formidable task that only companies with full-stack hardware–software capabilities can accomplish. Previously, there were only two protagonists on this stage:

  • NVIDIA: Centered on its GPUs, it acquired Mellanox to complete the InfiniBand NIC and switch landscape, launched the Grace CPU and attempted to acquire ARM, continually reinforcing its “GPU + DPU + CPU” three-chip strategy, and ultimately built the powerful DGX SuperPOD ecosystem.
  • Google: As another giant, its TPU hardware is deeply bound to its internal software ecosystem, forming another closed yet efficient kingdom. Many of the world’s SOTA models with the fastest inference speed (tokens per second) run on Google TPUs.

They have, in their own ways, solved scalability and efficiency at the tens-of-thousands-card scale, thereby defining the compute landscape of this era.

As an ordinary engineer who was involved from the beginning, I feel a flood of emotions looking back on this journey. Our initial persistence stemmed from a different worldview—we believed that future computing would inevitably be built atop a logically unified ‘data-center computer.’ In the beginning, only a dozen architects gathered around a whiteboard to discuss prototypes; we were more like evangelists, striving to persuade others to believe in a future that had not yet arrived.

The real turning point came after GPT-3 was released in 2020. With indisputable performance, it demonstrated the power of the Scaling Law and led to broad internal recognition of the vision we had upheld. Since then, UB has received greater investment, and the small team of just over a dozen rapidly grew into a vast project with thousands directly involved.

Today, with the large-scale mass production of UB supernodes and the official release of the Unified Bus protocol, the communication primitives and architectures we once sketched on whiteboards have finally landed and opened to a broader ecosystem. The birth of UB has supplied the most critical missing piece for the Ascend ecosystem, marking the emergence—after NVIDIA’s GPUs and Google’s TPUs—of the world’s third complete hardware–software ecosystem capable of supporting top-tier large-model training and inference.

This vision, once regarded as a fantasy, has become a new ‘reality’ with the rollout of UB supernodes. Perhaps this is the charm of technological evolution: like a scientific revolution, it begins with a few people reimagining what the world ‘ought to be,’ and ultimately becomes the industry’s new consensus about what the world ‘is.’
