Bojie Li (李博杰)
2021-08-24
1Pipe is a communication primitive that scatters groups of messages through a data center network with causal and total ordering. Leveraging in-network computation on Barefoot and Arista switches, 1Pipe achieves scalability and high performance with low CPU and network overheads. Published in SIGCOMM’21.
2021-07-06
AKG (Auto Kernel Generator) is a tensor compiler for NPUs. AKG leverages polyhedral schedulers to perform a much wider class of transformations, and extends the semantics of the polyhedral representation to combine complex tiling techniques with hierarchical fusion strategies. Published in MICRO’20 and PLDI’21.
2021-02-26
This article is reposted from the Microsoft Research Asia (MSRA) WeChat account. Thanks to MSRA for the invitation!
Pursuing a Ph.D. is a lonely and challenging journey, and guidance from senior students who have walked the same road can lead you out of confusion and help you choose bravely. On February 8, five alumni of the MSRA joint Ph.D. program shared online their first-hand reflections on their Ph.D. years and subsequent careers, answering questions and encouraging junior students to hold firm to their dreams. The event was organized and hosted by Lijun Sun, Senior Academic Manager at MSRA, and Anqi Dou, lead of the internship program.
May these alumni's stories encourage and inspire everyone on the Ph.D. journey, and help you enjoy this wonderful and challenging time.
Speakers:
- Xiaoming Fu, 2016 USTC-MSRA joint Ph.D., now Associate Researcher at the University of Science and Technology of China
- Chi Zhang, 2017 Sun Yat-sen University-MSRA joint Ph.D., now Co-Founder and R&D Director at DeepMotion
- Danqing Huang, 2019 Sun Yat-sen University-MSRA joint Ph.D., now Researcher at Microsoft Research Asia
- Bojie Li, 2019 USTC-MSRA joint Ph.D., now Senior Research Engineer at Huawei 2012 Labs
- Xiao Li, 2019 USTC-MSRA joint Ph.D., now Researcher at Microsoft Research Asia
2019-12-08
USTC Doctoral Dissertation, Author: Bojie Li (李博杰)
Chinese Version: 基于可编程网卡的高性能数据中心系统 (PDF, 8 MB)
Unofficial AI-Translated English Version: High Performance Data Center Systems with Programmable Network Interface Cards (PDF, 8 MB)
Publication Date: 2019-05-26.
2019-08-19
Communication-intensive applications on hosts with multi-core CPUs and high-speed networking hardware often put considerable stress on an OS's native socket system. Existing socket replacements often leave significant performance on the table, and also have limitations in compatibility and isolation.
In this paper, we describe SocksDirect, a high-performance user-space socket system. SocksDirect is fully compatible with Linux sockets and can be used as a drop-in replacement with no modification to existing applications. To achieve high performance, SocksDirect leverages RDMA for inter-host communication and shared memory (SHM) for intra-host communication. To bridge the semantic gap between sockets and RDMA/SHM, we optimize for the common cases while maintaining compatibility in general. SocksDirect achieves isolation by employing a trusted monitor daemon to handle control-plane operations such as connection establishment and access control. The data plane is peer-to-peer between processes, where we remove multi-thread synchronization, buffer management, large-payload copy, and process wakeup overheads in the common cases. Experiments show that SocksDirect achieves 7 to 20x higher message throughput and 17 to 35x lower latency than Linux sockets, and reduces Nginx HTTP latency by 5.5x.
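The lock-free peer-to-peer data plane described above can be illustrated with a minimal sketch, not SocksDirect's actual code: a single-producer/single-consumer queue, as could live in a shared-memory region between two processes, needs no locks because each index has exactly one writer. All names and sizes here are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

#define RING_CAP 1024  /* capacity; a power of two keeps the modulo cheap */

/* One direction of a hypothetical peer-to-peer queue. Only the producer
 * writes head; only the consumer writes tail, so no lock is needed. */
struct ring {
    _Atomic size_t head;           /* total items pushed */
    _Atomic size_t tail;           /* total items popped */
    uint64_t slot[RING_CAP];
};

/* Producer side: returns 0 on success, -1 if the ring is full. */
int ring_push(struct ring *r, uint64_t v) {
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_CAP)
        return -1;                 /* full */
    r->slot[head % RING_CAP] = v;
    /* release: the slot write becomes visible before the new head */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 0;
}

/* Consumer side: returns 0 on success, -1 if the ring is empty. */
int ring_pop(struct ring *r, uint64_t *out) {
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail)
        return -1;                 /* empty */
    *out = r->slot[tail % RING_CAP];
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 0;
}
```

In a real system the struct would be placed in a shared-memory mapping and each socket direction would get its own ring, so two communicating processes never contend on a lock in the common case.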
2019-08-08
2019-08-07
Hardware-based transports, such as RDMA, are becoming prevalent because of their low latency, high throughput, and low CPU overhead. However, current RDMA NICs have limited on-NIC memory for per-flow transport state. When the number of flows exceeds the memory capacity, the NIC must swap flow state to host memory via PCIe, degrading performance.
This paper presents a hardware-based transport without per-flow state. At its core, flow state bounces between the two end hosts along with a data packet, analogous to a thread whose state is always in flight. To enable multiple in-flight packets, each thread is assigned a distinct subsequence of packets to send. Each thread can fork, throttle, and merge independently, which effectively simulates a window-based congestion control mechanism. For loss recovery, we design an epoch-based loss detector shared by all flows that enables selective retransmission, with storage proportional to the number of packets lost in a round trip. When there are more losses than the NIC can handle, the receiver CPU is notified to recover them.
We design and implement RDMA, TCP, and TLS transports without per-flow state in an FPGA prototype. The transports incur little network bandwidth and CPU overhead. Simulations and testbed experiments show that flows share network bandwidth fairly in a multi-bottleneck network, and that our design solves the incast problem even better than DCTCP and DCQCN. With a large number of concurrent flows, the throughput of our stateless hardware-based TLS transport is 100x that of a stateful hardware-based transport and 50x that of a software-based transport.
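One way to picture the fork and merge operations above is through the sequence-number arithmetic: a bouncing "thread" can own an arithmetic progression of packet numbers, forking splits it into two disjoint halves, and merging recombines siblings. This is an illustrative toy sketch under assumed names (`seq_fork`, `seq_merge`), not the paper's actual design.

```c
#include <stdint.h>

/* Hypothetical model: a bouncing thread owns the progression
 * next, next+stride, next+2*stride, ... of packet sequence numbers. */
struct seq {
    uint64_t next;   /* next sequence number this thread will send */
    uint64_t stride; /* gap between consecutive numbers it owns */
};

/* Fork: split the parent's progression into two disjoint halves.
 * The parent keeps the even positions; the child takes the odd ones,
 * so together they still cover exactly the original sequence. */
void seq_fork(struct seq *parent, struct seq *child) {
    child->next   = parent->next + parent->stride;
    child->stride = parent->stride * 2;
    parent->stride *= 2;
}

/* Merge: recombine two sibling progressions (same stride, offset by
 * half a stride) back into one, undoing a fork. */
struct seq seq_merge(const struct seq *a, const struct seq *b) {
    struct seq m;
    m.next   = a->next < b->next ? a->next : b->next;
    m.stride = a->stride / 2;
    return m;
}
```

Forking every thread doubles the number of independent in-flight subsequences, and merging halves it, which is the sense in which fork/throttle/merge can mimic the window increase and decrease of a conventional congestion control loop.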