The Full Story of the Storage Performance Issue in the Course Review Community

This month, the Course Review Community encountered a storage performance issue that lasted nearly two weeks, causing slow service responses and degraded user experience. This post documents how the issue was discovered, investigated, and resolved, covering NFS performance, ZFS logs, Proxmox VE virtualization storage configuration, and more.

December 9: Sudden Drop in Storage Node Performance

Starting from the afternoon of December 9, the performance of the NFS storage node dropped sharply. Testing disk performance on the NFS host:

debian@debian100:~$ dd if=/dev/zero of=test bs=64k count=16k conv=fdatasync
^C5985+0 records in
5985+0 records out
392232960 bytes (392 MB, 374 MiB) copied, 48.6483 s, 8.1 MB/s

Write speed was only 8.1 MB/s, and many nfsd processes were in D state (uninterruptible sleep, waiting for I/O):

PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
288788 root   20   0       0      0      0 D  20.6  0.0 110:25.67 nfsd
288785 root   20   0       0      0      0 D  18.6  0.0  52:12.57 nfsd
288786 root   20   0       0      0      0 D  11.3  0.0  61:36.45 nfsd

The Course Review Community service became very slow, but did not completely go down (quite a few new reviews were still added that day); we just kept receiving monitoring alerts.

Temporary Workaround

After talking with Big Squirrel that night (the Course Review Community servers are in a rack in a data center he manages; he provides the bare metal compute and storage nodes), we decided to temporarily create a VM on a new compute node backed by local SSD, sync the database and static files over to it, and get the service running there first.

Because NFS performance was so poor, the sync process was extremely slow. But after 2 hours, the database and static files were finally synced, and that night we switched traffic to the temporary server via Cloudflare.

However, the temporary server had no mail service or backup service configured, so to fully restore the service we still needed to fix the old server.

Original Architecture and Analysis

Big Squirrel and I analyzed issues in the original architecture.

Original Architecture

The Course Review Community servers run on a Proxmox VE cluster, with the storage architecture as follows:

NFS storage node: a server with 12 HDDs forming a ZFS storage pool
Compute nodes: run Proxmox VE and mount the storage via NFS
VM disks: raw disk image files stored on NFS

This means the access path for a VM’s rootfs is:

1	VM filesystem → VirtIO block device → disk image file → NFS → network → ZFS → HDD

This Block Device over NFS architecture is inherently risky for performance.

Why Does Block Device over NFS Have Performance Problems?

The core problem of Block Device over NFS is I/O amplification caused by file fragmentation.

When the filesystem inside the VM writes data, it thinks it’s writing to contiguous disk blocks. But in reality:

“Contiguous” writes in the VM → map to some offset in the disk image file
The disk image file may itself be fragmented on the NFS server’s ZFS
The file’s blocks on ZFS are further spread across multiple HDDs

The result: I/O that appears contiguous inside the VM turns into a large amount of random I/O on the underlying physical disks.

VM 内连续写入 4KB × 100 次：
  VM 视角：连续写入 400KB
  实际情况：100 次随机磁盘寻道

HDD 顺序写入：~150 MB/s
HDD 随机写入：~1 MB/s（受限于寻道时间）

This I/O amplification is particularly bad on HDDs, because HDD random I/O performance is over 100× worse than sequential I/O.

Besides fragmentation, there are also issues such as multi-layer cache invalidation, synchronous write overhead, and stacked latency.

The Right Way to Use NFS

NFS should be used as file storage, similar to Google Cloud Storage or AWS S3:

✅ Correct use: mount NFS directly inside the VM and store user uploads, logs, backups, etc.
❌ Incorrect use: mount NFS on the host and then put VM disk image files on it

If you really need to expose a block device over the network, you should use protocols designed for that purpose:

iSCSI: a dedicated network block device protocol that exposes a block device directly to clients
Ceph RBD: distributed block storage designed for virtualization

These protocols operate directly at the block device layer and avoid the filesystem-level fragmentation problem.

Therefore, Big Squirrel suggested that going forward, VM rootfs should all be placed on local SSD, not on NFS. But since the NFS hardware issue was still unresolved, copying rootfs was extremely slow, and it wasn’t feasible to copy all data out quickly.

December 12: Migrating Rootfs to Local SSD

After a painfully long data copy, the old server’s rootfs was finally fully migrated from NFS to local SSD. The system was rebooted to use the new local storage.

I synced data back from the temporary server to the old server, but after syncing, when I tested service performance, I found that disk performance was still unstable. The reason was that there were still other VMs on that physical machine that hadn’t been migrated yet and were still using NFS. NFS I/O blocking was affecting the entire Proxmox VE I/O thread—no performance isolation.

This is a common issue in virtualized environments: when a VM’s storage backend has problems, it can affect other VMs on the same host.

We decided to wait until all NFS data on the old machine had been migrated off before moving the service back.

December 16: Worsening NFS Issues

The NFS performance problems hadn’t been resolved over the past few days; in fact, they got worse. I tried to move other VMs’ data on the old machine out of NFS, but found that NFS access had become even slower.

Even mounting a qcow2 disk image from NFS on the NFS host itself would hang:

1
2
3

# Try mounting a qcow2 image on NFS
qemu-nbd -c /dev/nbd0 /nfs/vm-disk.qcow2
# Directly hangs; after 120 seconds the kernel starts timing out and printing dmesg warnings

I emailed Big Squirrel:

We really need to fix NFS ASAP. Right now NFS access is extremely slow. I tried to mount a qcow2 disk in NFS that has data on it, and it just hung. I’m worried we might lose user data stored in it.

On the NFS host, the rootfs is also very slow—installing a package takes forever.

December 17: Root Cause Identified — ZFS Log on a Failing Disk

Big Squirrel found the root cause:

The NFS storage node used the ZFS filesystem, and the ZFS log (ZIL/SLOG) happened to be on an HDD that was close to failing.

ZFS write workflow:

Data is first written to the ZIL (ZFS Intent Log)
Once the ZIL confirms, the write is reported as successful
A background process writes the data into the main storage pool

By default, the ZIL (ZFS Intent Log) is stored on each disk. But to improve synchronous write performance, ZFS allows you to place the ZIL on a dedicated fast device, known as SLOG (Separate LOG).

SLOG plays a role similar to a database’s WAL (Write-Ahead Log):

Receives all synchronous write requests
Quickly acknowledges successful writes (since SLOG is typically a fast SSD)
Then, in the background, writes data into the main pool

SLOG performance directly determines the latency of synchronous writes.

When the disk holding the ZIL has abnormally high latency, all synchronous writes get blocked. Moreover, I had mounted NFS with the sync option, causing every write to wait for ZIL confirmation, making performance plummet.

Historical SLOG Configuration

Why was SLOG configured on a single HDD? There’s some background here.

Originally, SLOG was placed on a 10K RPM high-speed HDD. Ideally, SLOG should be on NVMe SSD for best performance. However, a few months ago the NVMe drive in that machine was removed, and this high-speed HDD was used as a temporary replacement.

At that time, the reasons for choosing HDD instead of SSD were:

The machine’s SSDs had relatively small capacity
SLOG write volume is very high, raising concerns that frequent writes would wear out the SSD quickly
SSD endurance is roughly proportional to capacity; small SSDs are easier to wear out

This decision was reasonable a few months ago. But later, the machine had already been upgraded with large SSDs, while the SLOG configuration hadn’t been updated accordingly—this performance incident exposed that hidden risk.

Solution

Big Squirrel migrated the ZFS log device from the failing HDD to an SSD. After the migration, NFS performance returned to normal.

December 21: Local SSD Performance Issues

Work was extremely busy this week, and on December 20 a user DM’d me asking why they weren’t receiving registration emails—because the temporary server had no mail service deployed.

On December 21 I finally had time to handle this. While preparing to move the service back to the old server (now using local SSD), I found that local SSD performance was also poor.

Using iostat to check disk performance:

1
2
3

$ iostat -x 5
Device    r/s   rkB/s   w/s   wkB/s  w_await  %util
sdb      0.60   12.80  53.20  1469.60  1266.87  89.20

Write latency was 1266ms! That’s unacceptable for an SSD. Normal SSD write latency should be under 1ms.

Issue 1: Proxmox VE Disk Cache Configuration

Checking the VM config:

1 2	cat /etc/pve/qemu-server/100.conf \| grep scsi1 # scsi1: local-lvm:vm-100-disk-0,iothread=1,size=640G

I found that the disk had no cache mode configured, so it defaulted to none (no cache).

Differences between cache modes:

none (default): every write goes straight to the physical disk and waits for disk acknowledgment before returning. Safest but slowest.
writeback: writes go into the host’s memory cache first and are immediately acknowledged. Disk is written asynchronously in the background. Best performance, but recent writes may be lost if the host crashes.
writethrough: writes go to both cache and disk and wait for disk acknowledgment. In-between option.

Fix:

# 编辑 VM 配置
nano /etc/pve/qemu-server/100.conf

# 修改磁盘配置，添加 cache=writeback 和 discard=on
scsi1: local-lvm:vm-100-disk-0,cache=writeback,discard=on,iothread=1,size=640G

discard=on enables TRIM support, so when the VM deletes files it notifies the underlying storage to reclaim space, which benefits both SSD lifespan and performance.

Issue 2: Physical disk write cache was disabled

After modifying the VM configuration, performance was still not ideal. Continuing the investigation, I tested the raw write speed of the physical disk:

1 2	dd if=/dev/zero of=/dev/sda bs=1M count=1000 oflag=direct # 310378496 bytes copied, 22.5299 s, 13.8 MB/s

13.8 MB/s! Is this the speed an SSD should have? A normal SATA SSD should have 400–500 MB/s sequential write speed.

Check the disk write cache status:

$ hdparm -W /dev/sda
/dev/sda:
 write-caching =  0 (off)

$ hdparm -W /dev/sdb
/dev/sdb:
 write-caching =  0 (off)

Write cache was disabled! This is the root cause of the poor performance.

Role of disk write cache:

Enabled: write operations are first written into the disk’s internal DRAM cache and immediately reported as successful. The disk controller then writes data to flash in the background.
Disabled: every write operation must wait until the data is actually written to flash before returning.

Enterprise SSDs (such as the Toshiba THNSNJ1T02CSY in this machine) often disable write cache by default for data safety, but in an environment protected by RAID1 it is safe to enable it.

Fix:

# 启用写缓存
hdparm -W1 /dev/sda
hdparm -W1 /dev/sdb

# 验证
hdparm -W /dev/sda
# write-caching =  1 (on)

Set to automatically enable at boot:

# 创建 udev 规则
cat > /etc/udev/rules.d/69-hdparm.rules << 'EOF'
ACTION=="add|change", KERNEL=="sd[ab]", RUN+="/sbin/hdparm -W1 /dev/%k"
EOF

After enabling write cache, disk performance returned to normal:

1
2
3

$ iostat -x 5
Device    r/s   rkB/s   w/s    wkB/s   w_await  %util
sdb      3.00  2820.80 127.40 14824.00   2.06   18.48

Write latency dropped from 1266ms to 2ms, finally back to normal.

Issue 3: Leftover LVM metadata volume

There was also an old, unused LVM metadata volume on the physical machine. After deleting it, performance improved further.

Migrating services without downtime

After fixing the storage performance issues, I began migrating services from the temporary VM created on December 10 (hereafter the source server) back to the repaired old server (hereafter the target server). To minimize downtime, I used a two-stage synchronization method.

Preparation

On the target server, first shut down MySQL and Web services to ensure that the data directory will not be written to:

1
2
3

# On the target server
systemctl stop gunicorn
systemctl stop mysql

1. Pre-sync user-uploaded data (source server services remain running)

1 2	# On the source server, sync most static files to the target server rsync -avz --progress /path/to/uploads/ target-server:/path/to/uploads/

2. Pre-sync database files (source server services remain running)

1
2
3

# On the source server, first perform a full sync of the database directory
# At this time the source server’s services are still running, so the data may be inconsistent, but that’s okay
rsync -avz --progress /var/lib/mysql/ target-server:/var/lib/mysql/

3. Lock the database and do incremental sync (downtime)

# 在源服务器上执行，锁定数据库，阻止写入
mysql -u root -p -e "FLUSH TABLES WITH READ LOCK;"

# 在源服务器上执行，增量同步数据库（只传输变化的文件，非常快）
rsync -avz --progress /var/lib/mysql/ 目标服务器:/var/lib/mysql/

# 在源服务器上执行，解锁数据库
mysql -u root -p -e "UNLOCK TABLES;"

Because most of the data has already been synchronized, this incremental sync only takes a few dozen seconds.

4. Incremental sync of user-uploaded data (source server services continue running)

1 2	# On the source server, perform another incremental sync of user-uploaded data rsync -avz --progress /path/to/uploads/ target-server:/path/to/uploads/

5. Start services on the target server

# 在目标服务器上执行
chown -R mysql:mysql /var/lib/mysql
systemctl start mysql
systemctl start gunicorn

6. Verify that services on the target server are working properly

1 2	# Check the Web service curl -vv http://localhost:3000/

7. Switch traffic

In Cloudflare, point the domain name to the target server’s IP. Then check the target and source server logs to ensure that traffic has switched to the target server.

The total downtime was only a few dozen seconds (during the database lock period).

Lessons learned

Avoid block device over NFS: A VM’s rootfs should be on local storage or dedicated block storage (such as iSCSI, Ceph RBD), not an image file on NFS. NFS is suitable for mounting inside the VM for file sharing.
Storage performance isolation: When a single VM’s storage has problems, it should not affect other VMs. Consider using independent storage backends or I/O limits.
Put ZIL/SLOG on fast devices: ZFS’s log device determines synchronous write performance. Although SLOG sees heavy write volume, modern high-capacity SSDs (whose lifespan scales with capacity) can easily handle it. If you use an HDD as SLOG and that disk has issues, the write performance of the entire pool will be affected. Ideally, use an NVMe SSD.
Monitor disk health: Regularly check SMART data and catch problems before disks completely fail.
Configure disk cache appropriately: Choose the appropriate cache mode based on the characteristics of the storage backend. For redundant storage, writeback is usually a better choice. Physical disks should have write cache enabled.

Timeline review

Date	Event
12/9	NFS performance dropped sharply, services became slow, temporarily migrated to a local SSD server
12/12	Old server’s rootfs migrated to local SSD
12/16	NFS issue worsened, other services on the old server that hadn’t been migrated were almost unreachable
12/17	Located ZFS log on a bad disk and fixed NFS
12/21	Discovered local SSD performance issue, fixed it, then migrated services back to the old server

From the onset of the issue to the final resolution took nearly two weeks. During this period, on December 9, service access was slow and occasional alerts were received, but there was no complete outage. After December 10, services did not go down, but registration emails could not be sent.

This incident is a reminder: storage is the cornerstone of service stability; performance problems at any layer will be amplified step by step and ultimately affect user experience.