
Research

Bojie Li (李博杰). I work on AI agents and personalization, LLM systems, and datacenter networking. Below are recent papers — most with an interactive companion site — followed by earlier systems research.

Recent work

2026 · click a title for the interactive site or the paper

cs.AI · arXiv:2607.12520

The Model Knows Your Project, Not You: Measuring Recognition in LLMs with NameRank

What do frontier models actually recall about people and tools from their weights? NameRank scores recognition (0–1) across 4,685 entities, 54 cohorts, and 36 models, with judges verifying specific facts — hallucination, context echo, and guesses earn nothing. Citations explain only a third; named artifacts, methods, and papers propagate over credentials, titles, or contributor listings, and no bibliometric predicts recognition well.

Interactive site · arXiv

cs.AI · arXiv:2607.11598

Interaction Scaling: Grounding the Third Axis of Test-Time Compute

Beyond reasoning and sampling lies a third axis of test-time compute: the model proposes an artifact, an external instrument observes how it actually behaves, and the model revises. Reasoning-only and best-of-N plateau on hard coding tasks, but interaction keeps improving — a proposer-reviewer reaches perfect accuracy across three model families, and layout tools catch rendering defects that vision judges call perfect.

Interactive site · arXiv

cs.LG · arXiv:2607.07435

RLVP: Penalize the Path, Reward the Outcome

Agents acting in the real world must respect outcome-neutral constraints — business hours, authentication, not re-calling an unresponsive user — that outcome rewards can't express, since violating them often boosts apparent success. Because group-relative advantage is within-group variance, a verifiable path penalty supplies the missing signal: penalize the path, reward the outcome — high success, near-zero violations.
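The variance argument above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the standard-deviation normalization, and the penalty weight of 0.5 are all assumptions chosen for illustration. When every rollout in a group succeeds, outcome-only rewards have zero within-group variance and every advantage is zero; subtracting a verifiable path penalty restores a signal that separates clean rollouts from violating ones.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Center and scale rewards within one group of rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shaped_rewards(outcomes, violations, penalty=0.5):
    """Outcome reward minus a fixed penalty per verified path violation.

    The penalty weight is a hypothetical value, not one taken from the paper.
    """
    return [o - penalty * v for o, v in zip(outcomes, violations)]


# Four rollouts of one task: all succeed, but two violate a constraint.
outcomes = [1.0, 1.0, 1.0, 1.0]
violations = [0, 1, 0, 2]

outcome_only = group_relative_advantages(outcomes)
with_penalty = group_relative_advantages(shaped_rewards(outcomes, violations))

print(outcome_only)  # every advantage is zero: no learning signal
print(with_penalty)  # clean rollouts are positive, violating ones negative
```

In this toy group the outcome reward alone cannot distinguish the rollouts, so the only gradient signal comes from the path penalty.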

Interactive site · arXiv

cs.SD · arXiv:2607.02640

Metronome: Bound the Cache, Keep the Beat for Real-Time Interaction Model Serving

Real-time audio agents turn serving into a periodic task whose KV cache grows monotonically until it exhausts the pool — then latency falls off a cliff, silently, on run-to-run variance. Metronome bounds each session's resident state, eliminating the collapse (0/20 vs 14/20 bad runs) and turning per-frame latency into an honest load signal for admission control.

Interactive site · arXiv

cs.DC · arXiv:2607.02630

Fine-Grained Computation Offload for Off-the-Shelf Servers in Tens of Lines

Accelerator stalls needn't force a rewrite: the concurrency already exists, because serving concurrent requests already suspends and resumes them. Framing overlap as routing, an offload submits to an executor and the request suspends via the server's own deferred-response path — 22–138 lines across ten production servers for 1.2–5.4× gains, and 17.3× from an LD_PRELOAD fiber runtime.

Interactive site · arXiv

cs.AI · arXiv:2606.30383

Whose Side Is Your Agent On? Multi-Party Principal Loyalty in LLM Agents

When an agent represents a principal while talking to a counterparty whose interests diverge, "help whoever you're talking to" is the wrong objective. PrincipalBench measures multi-party loyalty across 13 frontier models, exposing a sharp selective/over-refusing split; a prompt-time scaffold and per-token-KL distillation make small models loyal — but both only move along a leak/over-refusal trade-off.

Interactive site · arXiv

cs.AI · arXiv:2606.29472

Agent-Computer Observation Interfaces Enable Dynamic Computer Use

Computer-use agents tie observation to action — one screenshot every few seconds, no audio — leaving them blind between frames. AOI, a model-agnostic perception layer, decouples continuous observation via gated keyframe capture, volume-gated transcription, and persistent visual narration, adding almost nothing on static content yet gaining +17 to +48 pp on DynaCU-Bench with zero retraining.

Interactive site · arXiv

cs.AI · arXiv:2606.24470

The Latent Bridge: A Continuous Slow-Fast Channel for Real-Time Game Agents

A real-time game agent must act in milliseconds yet plan over seconds — opposite ends of the latency-quality tradeoff. Coupling a frozen reactive VLM and a frozen reasoning VLM, only the channel between them trains: a learned latent bridge projects the slow model's residuals into the fast model's embedding space, no text round-trip needed.

Interactive site · arXiv

cs.AI · arXiv:2606.19172

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Personal memory is two problems — content and reasoning skill — that the brain keeps apart. Rather than folding both into a per-user LoRA that contaminates unrelated text, User as Engram stores a user's facts as surgical rows in a hash-keyed table and carries reasoning in one shared adapter: 5.6× better indirect reasoning, ~33,000× smaller, users composing losslessly.

Interactive site · arXiv

cs.AI · arXiv:2606.17929

PreAct: Computer-Using Agents that Get Faster on Repeated Tasks

Computer-using agents solve every task from scratch, re-reading the screen and re-reasoning each tap. PreAct compiles the first success into a small state-machine program and replays it 8.5–13× faster with no per-step model calls, checking the screen matches before each action; an independent evaluator gates what enters the store, so repeated tasks get faster without getting riskier.
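The replay loop described above might look roughly like the following sketch. It is a toy under stated assumptions, not PreAct's actual program format: `Step`, `replay`, `FakeApp`, the string screen fingerprints, and the action strings are all hypothetical. The point it shows is the guard: each stored action runs only if the observed screen matches what was recorded, and a mismatch stops replay so the caller can fall back to the model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Step:
    expected_screen: str        # hypothetical fingerprint recorded on the first success
    action: str                 # action to replay, e.g. "tap:compose"
    next_state: Optional[str]   # None marks the end of the program


def replay(program: Dict[str, Step], start: str,
           observe: Callable[[], str], act: Callable[[str], None]) -> bool:
    """Replay a stored program with no model calls.

    Returns False at the first screen mismatch so the caller can fall back
    to step-by-step reasoning.
    """
    state: Optional[str] = start
    while state is not None:
        step = program[state]
        if observe() != step.expected_screen:
            return False
        act(step.action)
        state = step.next_state
    return True


class FakeApp:
    """Toy environment: each action advances to the next scripted screen."""

    def __init__(self, screens: List[str]):
        self.screens = list(screens)
        self.i = 0
        self.log: List[str] = []

    def observe(self) -> str:
        return self.screens[self.i]

    def act(self, action: str) -> None:
        self.log.append(action)
        self.i += 1


program = {
    "s0": Step("inbox", "tap:compose", "s1"),
    "s1": Step("editor", "type:hello", None),
}

app = FakeApp(["inbox", "editor"])
print(replay(program, "s0", app.observe, app.act), app.log)
```

If the second screen were an unexpected popup instead of the editor, `replay` would return False after the first action and leave the rest to the agent.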

Interactive site · arXiv

cs.LG · arXiv:2606.17107

Models Take Notes at Prefill: KV Cache Can Be Editable and Composable

Prefix caching reuses prefill only across a shared prefix, so one changed field invalidates everything downstream. Yet across four model families the field's own key/value drives under 1% of the decision — at prefill the model already wrote its conclusion onto downstream tokens. Read as memoized notes, the KV cache becomes editable and composable, decision-identical at 14.9× lower latency.

Interactive site · arXiv

cs.AI · arXiv:2606.16707

User as Code: Executable Memory for Personalized Agents

Bag-of-facts user memory recalls individual facts but can't aggregate, resolve contradictions, or enforce rules, because storing a fact and acting on it are separate steps. User as Code makes memory executable: typed Python objects hold state, functions encode rules, an append-only log checkpoints into code. Recall stays strong (78.8% LOCOMO) while aggregate questions jump from 6–43% to 99%.
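As a rough illustration of what executable memory could look like, the sketch below keeps state in typed Python objects, encodes one rule as a method, and appends every update to a log. The class names, fields, and the dietary rule are hypothetical examples, not the paper's schema, and checkpointing the log into code is omitted.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Set, Tuple


@dataclass
class Expense:
    day: date
    category: str
    amount: float


@dataclass
class UserMemory:
    expenses: List[Expense] = field(default_factory=list)
    dietary_rule: str = "none"
    log: List[Tuple[str, tuple]] = field(default_factory=list)  # append-only event log

    def record_expense(self, day: date, category: str, amount: float) -> None:
        self.log.append(("record_expense", (day, category, amount)))
        self.expenses.append(Expense(day, category, amount))

    def set_dietary_rule(self, rule: str) -> None:
        # A later statement overwrites the earlier one, which resolves the contradiction.
        self.log.append(("set_dietary_rule", (rule,)))
        self.dietary_rule = rule

    def total_spent(self, category: str) -> float:
        """Aggregate question answered by computation rather than retrieval."""
        return sum(e.amount for e in self.expenses if e.category == category)

    def may_order(self, dish_tags: Set[str]) -> bool:
        """Rule enforced in code: vegetarians never get dishes tagged 'meat'."""
        return not (self.dietary_rule == "vegetarian" and "meat" in dish_tags)


mem = UserMemory()
mem.record_expense(date(2026, 1, 3), "coffee", 4.5)
mem.record_expense(date(2026, 1, 4), "books", 20.0)
mem.record_expense(date(2026, 1, 5), "coffee", 3.5)
mem.set_dietary_rule("none")
mem.set_dietary_rule("vegetarian")

print(mem.total_spent("coffee"))
print(mem.may_order({"meat", "spicy"}))
```

A bag-of-facts store would have to retrieve every coffee purchase and sum them in the prompt; here the aggregate is a function call over typed state.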

Interactive site · arXiv

cs.AR · arXiv:2606.13708

Tiara: A Programmable Line-Rate ISA for Remote Memory Access

RDMA one-sided verbs need the exact remote address, so 1-RTT performance breaks when that address must first be read from remote memory — the Indirection Wall behind graph traversals, page-table walks, and paged KV lookups. Tiara, a statically verifiable instruction set on the memory-side NIC, resolves indirection locally, collapsing multi-RTT dependent chains into a single round-trip.

Interactive site · arXiv

cs.AI · arXiv:2605.28717

OpenURMA: A Clean-Room Open Implementation of the Unified Bus Protocol

Datacenter RDMA is bottlenecked at the NIC, not the wire: per-connection state balloons at high fanout and a 64-byte op pays a four-traversal PCIe round-trip. Huawei's Unified Bus decouples endpoint from transport state and reaches memory via native load/store. OpenURMA is the first clean-room open implementation — a 64-byte fetch in ~500 ns, 4.37× below a matched RoCEv2 baseline.

arXiv

cs.LG · arXiv:2604.24827

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity

Frontier labs don't disclose parameter counts, yet storing F facts requires at least F/(bits per parameter) weights — so factual recall lower-bounds model size. Incompressible Knowledge Probes ask 1,400 facts that resist reasoning and compression, calibrating a log-linear map to parameter count across 93 open-weight models (R²=0.910). The instrument is deliberately coarse — order-of-magnitude capacity, not precise counts.

Interactive site · arXiv

cs.MM · arXiv:2604.20940

Sema: Semantic Transport for Real-Time Multimodal Agents

Real-time multimodal agents transport raw audio and screenshots over stacks built for human perception, but agents consume task-relevant semantics, not reconstructed signals — shifting transport from signal fidelity to meaning preservation. Sema pairs discrete audio tokens with a hybrid screen representation and bursty delivery, cutting uplink bandwidth 64× for audio and 130–210× for screenshots within 0.7 pp of raw accuracy.

Interactive site · arXiv

Earlier systems research

datacenter networking, RDMA, and programmable hardware

Lead author

APNet'23

FastWake

Polling gives RDMA low latency but pins a core to one thread; interrupts share cores at far higher latency, and apps with hundreds of threads are stuck paying it. FastWake redesigns the interrupt-mode host stack for commodity hardware and unmodified apps: a per-core dispatcher polls every completion queue and context-switches via a kernel fast path, approaching hardware latency limits.
