The Node That Never Sleeps

I. Where Things Stand

Right now, all twenty threads of ms-01-1 are pinned at full load, spinning on nothing, the CPU holding steady at 77°C. It isn’t computing anything. It’s just busy staying awake. These cores should be asleep when there’s no work; instead a busy-wait loop keeps every one of them occupied, for the sole purpose of never letting the CPU enter a sleep state.

Another machine watches it from the other end of the network. The moment it locks up, that machine cuts its power remotely and brings it back.

ms-01-1 is one node in the small cluster I run at home, a Minisforum MS-01 mini PC. Right now it’s serving normally: its VMs are running as usual, the data it holds is online, the cluster is healthy. The busy-wait loop isn’t some extra workload I’ve added, and it doesn’t fight the VMs for CPU — it only occupies time that would otherwise be idle, and yields the instant real work arrives. It’s also a machine I’ve given up on actually fixing.

The fault is in the hardware. What I can say for certain is that it isn’t in the DIMMs themselves, but somewhere in the machine that can’t be replaced. My closest read on it is the link that carries addresses and data between the CPU and memory: every so often, somewhere in transit, it flips a single bit. By that account the memory cells are sound the whole time; the error happens only in the instant of transfer, and the next read “heals” it on its own. And this machine runs non-ECC memory[1], which has no mechanism to detect such an error, let alone report it, so the flip leaves behind not one line of log, not one counter tick, and stays completely invisible to the system. Either nothing happens at all, or a single process dies abruptly to a segfault, or the entire machine locks up without warning.

Not letting it sleep is the one thing that still works, after I’d tried everything else. Swap the DIMMs, swap the power supply, flash the BIOS, drop the memory frequency — I went through them one by one, and none of it helped. If it could be fixed, I’d have fixed it long ago; but it’s out of warranty, and the broken part can’t be replaced on its own, so no matter how I run the numbers it isn’t worth it. So it stays where it is, kept running by this one setting that keeps it hot all day, with a watchdog on the machine across from it as a last line of defense.

The thing that finally saved it — don’t let the CPU sleep — occurred to me early on. I just didn’t chase it down right away.

How it got to this point starts with a tripped breaker a month ago.

II. The Breaker

Late one night a month ago, the breaker tripped at home. Seven servers lost power hard, all at once. My UPS, of all things, wasn’t on the network — it couldn’t raise an alert, never mind trigger a graceful shutdown — so I didn’t find out until I woke up the next morning.

Those seven machines are a small cluster I built at home: Proxmox VE for virtualization, Ceph for storage, deployed hyperconverged — compute and storage running on the same set of machines — hosting my own GitLab, three Kubernetes clusters, internal DNS, single sign-on, Time Machine backups for my Mac, and a string of other services.

I don’t run all this because the services on top matter that much. Almost everything is cattle, not pets — when one breaks I don’t nurse it, I rebuild it. The config all lives in Git; I’ve rebuilt the entire GitLab from scratch once already, and the backups have backups. The banner I started under was suitably high-minded, of course: data sovereignty, keeping my own data in my own hands. But after a few years, if I’m honest, what actually draws me in is the tinkering with infrastructure itself — and the software layer most of all. Declarative config, Nix, GitOps, Terraform: that’s where the fun is. As for the services that were supposed to carry the whole “data sovereignty” banner, only a handful ever went live or got used. Hardware interests me even less. And precisely because it doesn’t matter if the workloads on top go down, I didn’t care whether the memory had ECC — these mini PCs don’t support it anyway — and precisely because it was cheap, buying a machine like this was a trade-off I was happy to make.

ms-01-1 had thrown the occasional segfault before — sporadic, nothing that got in the way, so I hadn’t paid much attention. After that power loss, the symptoms got an order of magnitude worse. Over the next two weeks, the segfaults went from once in a while to dozens on every boot; the share of storage it carried wouldn’t come up, and the whole machine started locking up without warning every few days. It was always the same handful of processes going down, though that didn’t stand out at the time. I can’t prove the power loss broke it. It had been faulty to begin with; all I can be sure of is the order in time. I kept putting off dealing with it properly, until two weeks ago I finally took it on.

Here’s the odd part. Two of the seven are identical MS-01s — same CPU, same BIOS, same memory model, same config, bought in adjacent batches. The other one, ms-01-2, never had a single problem, start to finish. Both lost power that night; same model, the same non-ECC memory — and only ms-01-1 went wrong.

III. Withholding Judgment

I’d guessed the rough direction early: C-states — the power-saving sleep levels an individual core drops into when it has nothing to do, deeper meaning more savings and a slower wake-up. It’s the core that sleeps, not the whole machine: the moment a core runs out of work, it nods off on its own, while the machine keeps running and serving as normal.

The crashes had a pattern: they struck when the machine was idle or lightly loaded, and almost never under full load. That “the more idle it is, the more likely it breaks” shape points naturally at these sleep levels. And the MS-01 had form here: early BIOS versions (around 1.22) had a stability issue tied to C-states and other power-saving features, fixed in later releases. Two signs stacked up, pointing the same way.

But knowing there’s a switch you could throw, and reaching out to throw it, are two different things. I didn’t throw it.

The reasoning is simple: some change happening to stop the crashes is not the same as my having found the cause. That change could easily be suppressing the symptom while the real fault sits untouched underneath. For a fault that occurs at a steady rate, “I changed it and it got better” counts as at least a little evidence. But this machine’s fault rate itself drifts: same config, under one percent on one boot, drifting past fifty percent some time later. Against that background noise, whether it crashes after a change tells you nothing. No crash might mean fixed, or it might mean I caught a quiet stretch. A crash reads two ways too, and neither beats the other. A test that every outcome confirms is no test at all.
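The arithmetic behind that worry can be sketched in a few lines of shell. This is an illustration under a simplifying assumption that is not the essay's: each round fails independently with a fixed probability p, so n clean rounds in a row occur with probability (1 - p)^n. The function name and the example rates are hypothetical.

```shell
#!/bin/sh
# Probability of n consecutive clean rounds when each round fails
# independently with probability p: (1 - p)^n.
# Assumption: a fixed, independent per-round rate. The real rate drifts,
# which only makes a clean streak harder to interpret.
clean_run_probability() {
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.3f\n", (1 - p) ^ n }'
}

# Illustrative rates only: a quiet stretch versus a bad one.
clean_run_probability 0.01 200   # prints 0.134
clean_run_probability 0.50 200   # prints 0.000
```

At a one percent failure rate, two hundred clean rounds still turn up about 13% of the time, so from the outside a quiet stretch and a real fix look the same.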

So more than making it stop crashing on the surface, I cared about where it was actually broken. It comes down to one rule: until I’d ruled out the other possibilities one by one, this machine was running sick in my eyes. Strictly speaking, the sickness wasn’t the machine’s but mine — but as long as the doubt held, I treated it as sick.

That rule set the course for everything that followed. The string of seemingly futile tests to come was all deliberate elimination, one suspect at a time. Unlike the power-saving switch I didn’t throw, every one of these tests could give an answer I wouldn’t like: move the DIMMs to the twin and the fault might follow them; memtest might actually report errors; some frequency might turn out to be the culprit. It’s precisely because they could fail that their results meant anything.

And this whole scheme of elimination held together because of the healthy twin, ms-01-2, as a control. One broken, one sound, everything else identical. It let me sidestep the unanswerable question — did my change actually do anything — and swap in an answerable one: move a part between the two machines, and watch whether the fault follows the part or stays with the original. That one I can answer, because I’m only watching which side the fault lands on, regardless of how high its rate happens to have drifted.

IV. Ruling Things Out

The first suspect, naturally, was the DIMMs themselves.

The standard way to vet memory is to run memtest86+, which works outside the operating system, reading and writing every memory cell over and over and comparing the results. Just getting it to boot ate a whole night: disabling Secure Boot, chainloading it from GRUB, getting stuck on POST several times along the way with a black screen, nothing to do but cut the power remotely and start over. Once it was actually running, I didn’t dare stop at two passes — experience says ten at least. So it ran for twenty-one hours and forty-eight minutes, a full ten passes, the temperature riding up around 90°C the whole time.

The result was zero errors.

It was the first dead end of the whole investigation. A test passing perfectly clean, and yet ruling nothing out — at most it pushed one possibility, that the memory cells were physically damaged, down to very unlikely. What memtest reads, writes, and compares is whether data is stored correctly, and this machine’s problem wasn’t there.

I tried a second tool of the same kind: stressapptest, which saturates memory bandwidth to force out flaws in the high-speed signaling. Four hours, more than seven hundred terabytes moved, and again zero events. Two tools, the same clean bill.

Clean, and still the machine crashed. I stepped back and started doubting the premise itself: maybe it wasn’t hardware at all, but software — some kernel bug. The kernel version it was on did carry a few known regressions in memory management. I upgraded it to a newer version that fixed those regressions. That evening the reproducer ran a couple hundred rounds without a single crash; I nearly took it for over and done, and left it running overnight. By the next morning, some twelve hundred rounds and two hours and fifty minutes of continuous running later, the whole machine had frozen there without warning.

Only later did I understand this step was not like the others. The new kernel really did fix a software bug — one that was real, and an entirely separate matter. It fired fast and often, taking the blame for every crash, so I kept assuming the crashes were a software problem. Once it was fixed, the hardware fault hiding behind it — far slower to fire — finally showed its face. That software bug was never the fault itself, only a smokescreen drawn across the hardware fault. The upgrade fixed nothing; it just cleared the smoke. Those couple hundred clean rounds were exactly the kind of “no crash after a change” I’d been most wary of — and the hardest kind to see through. The other steps gave me at most a useless clean result; this one nearly became a false finish line. The smoke gone, and a low stretch of the hardware fault happening to coincide — the two together almost had me call it fixed and walk away.

At the time, of course, I didn’t see it that clearly. With the kernel failing to settle it, my suspicion was still circling outside the hardware — and firmware falls in that same band. I flashed the BIOS to the latest version. That old 1.22 C-state business had pointed this way to begin with. After the flash, it still froze; the release notes had not one entry touching memory, DDR5, or the memory controller. That old lead fell apart too: the community’s firmware-level C-state bug, the one that hit every machine, was a different matter from mine — the twin ran the same BIOS and was perfectly fine. By this point I’d done everything that could be done from the keyboard without opening the machine: software ruled out, firmware ruled out, the memory-checking tools all run clean. The fault really did live in this machine’s hardware — and hardware isn’t a single block.

To tell which block had gone bad, all that was left was to get my hands in. Up to here, through flashing the BIOS, booting memtest, and running those reproducers, I’d never had to leave my study: these machines have no BMC[2], but each has an IP-KVM (which brings the display and keyboard out over the network for remote control), and a hard power cycle relies on a switched PDU in the rack (a networked power strip you can toggle remotely). All of it doable from the keyboard. Pulling hardware was the first thing the IP-KVM couldn’t help with. It could put a keyboard and a screen in front of me, but it couldn’t turn a screw for me. I’d put this step off to the very end, because the cost of it, for me, is staggering. This machine and three other mini PCs are crammed onto the same 3D-printed rack kit I made myself, each with four fibers, a power cable, an Ethernet cable, an HDMI cable, and a USB cable running off the back. To touch the hardware of any one of them, I have to open the side panel of the rack, put on a headlamp, note where every cable goes, pull them out one by one, then take a cordless screwdriver and lift the whole kit off the rack. The kit fits tight; putting it back is more work than taking it out. A round trip like that runs over an hour.

The first thing a hardware swap could distinguish was whether the DIMMs were bad or the machine itself. I switched the memory between the two MS-01s: ms-01-1’s DIMMs into the healthy ms-01-2, ms-01-2’s into ms-01-1. The logic is simple: if the fault followed the DIMMs to ms-01-2, the DIMMs were the problem; if it stayed with ms-01-1, the machine was. The fault stayed with ms-01-1: with a known-good pair of DIMMs in it, it crashed all the same. So not the DIMMs, but the machine itself: the power delivery on the board, the memory controller inside the CPU, the CPU socket, that class of thing, all either soldered down or beyond fixing by swapping a part. Given how punishing the teardown was, I just left this swap in place rather than reverse it.

The power supply was next. Strictly, having narrowed it to the machine itself, the supply counts as one part of that, but it hadn’t been checked on its own, and there was reason to suspect it: the fault picked idle moments to strike, which fits a flaw where the machine jolts from idle into activity and the power delivery doesn’t keep up. So I opened it up again, connected ms-01-1 to ms-01-2’s known-good power adapter, and ran the same test. It still crashed at almost exactly the same rate; the supply was fine. I could have skipped this teardown — I’d already narrowed it to the machine itself — but skip it, and the supply would never truly have been checked, and the machine would still be running sick in my eyes.

Last was the memory frequency. I dropped it from 5600 all the way to 4400, then tried 5200. On the surface the numbers differed from one step to the next, but once you account for that self-drifting fault rate, no real difference could be told between them. Frequency wasn’t the knob that moved the outcome either.

Whether the data was stored correctly; whether it was software or firmware outside the hardware; which block of the machine it was — I’d put each of these three layers to the question. But each layer answered at most “very unlikely,” never “impossible”: a test only pushed one way-of-being-broken down far enough to set aside; none was truly ruled out. The one exception was the DIMM swap: the fault didn’t follow them, and that one is certain. What’s left standing is a fault that genuinely lives in the machine itself, yet that no test ever truly saw.

V. The Invisible Fault

The reason those “zero error” tests passed clean one after another is that they were all the same kind of test — and that kind is, by its nature, nearly blind to this fault.

To see this, you first have to separate two ways of being broken. One is that after a value is stored in memory, the storage cell itself goes wrong, and what was written doesn’t match what’s read back; this is exactly what a conventional memory test assumes. This machine isn’t that case: the memory cells were sound the whole time. The best account I can give is that the error is in the carrying, not the storing: a value travels back and forth between CPU and memory, and now and then a single bit goes astray in transit — but once it settles, what’s stored is correct, and reads back correct. If that’s so, the error happens only along the way, with both ends clean: what’s written is right, and what’s read is right.

Tools like memtest and stressapptest are all, at bottom, doing the same thing: write a value in, read it back a while later, and compare the two. But that “write it in, read it back, compare once” approach is precisely what misses a fault that errs in the instant of transfer and has already recovered by the time it’s read: by the time you compare, it has long since healed, leaving not a trace in the data. They have another thing in common: they run memory and CPU at full load, while this fault is at its most active exactly when the machine is idle. A test that is congenitally blind to the fault, and that also happens to suppress its trigger condition, naturally turns up nothing.

I even wrote a small tool with Claude Code, named it memchase, to try to force it out. The tool has no other use: it builds an enormous ring of pointers in memory, then chases it around at the highest density I could manage. Each pointer read is used at once to jump to the next — a true “read one, use one” — and every jump is checked on the spot against an independent table, so a bit flipped in transit gets caught red-handed. To meet the “only fires when idle” condition, it also leaves a gap between each run of the chase, letting the core fall asleep and wake again. Pointer density maxed out, used the instant it’s read, idle gaps and all, it ran more than eight hundred million steps and never made a sound. This time, “caught nothing” carried the most weight of all: even a tool aimed squarely at it, that genuinely did sleep, that took pointer-chasing to its limit, couldn’t force it out. It proved nothing, but it squeezed the “broken in the storage cells” path down to a sliver: the fault almost certainly isn’t in the stored data, but somewhere along the road the data travels. As for why taking pointer-chasing to the limit still wasn’t enough — I wouldn’t understand that until I remembered something else.

What finally brought it into the open wasn’t another tool, but a detail I remembered.

I’d noticed the crashes seemed to come whenever I ran a certain class of command: the Proxmox VE management tools — qm, pct, pveversion, that lot. They had one thing in common: every one is a Perl script.

This was exactly what memchase had missed. The key was never “lots of pointers” — memchase had already maxed that out, to no effect. What mattered was that Perl, every time it runs, rebuilds its whole structure from scratch. The Perl interpreter is forever dereferencing pointers and following them inward; and each call to these PVE tools forks a new process, reloads a large batch of modules, and throws up great pointer structures in freshly allocated memory, then uses them right away. memchase chased a ring built once and never changed again; Perl, by contrast, ceaselessly manufactures new, used-in-an-instant pointers: a value has just arrived, no time yet to self-heal, and it has already jumped in. Flip one bit on that very jump, and it charges straight into a garbage address and crashes on the spot. A workload like this — constantly creating, using at once, sliced into idleness by fork after fork — happens to hit the fault’s trigger condition and its most lethal use at the same time.

So the final reproducer hardly counts as a memory test at all. It’s nothing more than putting one such Perl tool in a loop, calling it over and over, and counting the crashes:

i=0; while :; do i=$((i+1)); pveversion >/dev/null 2>&1 || echo "FAIL $i"; done

pveversion is the simplest of those Perl tools, and the script just runs it endlessly and prints a line whenever it crashes. On ms-01-1 it crashed on the second round, and crashed steadily after that — roughly one failure in every two. On the healthy twin ms-01-2, three hundred calls in a row, not one crash. Orders of magnitude faster than any memory test before it.
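Because the fault rate drifts, a bounded run that reports a count is easier to compare between machines than an endless loop. The following is a minimal sketch of that idea, not the script used above; the function name `failure_rate` and the round count are hypothetical.

```shell
#!/bin/sh
# Run a command a fixed number of times and report how often it fails.
# Usage: failure_rate ROUNDS COMMAND [ARGS...]
failure_rate() {
    rounds=$1
    shift
    fails=0
    i=0
    while [ "$i" -lt "$rounds" ]; do
        i=$((i + 1))
        # Any non-zero exit counts as a failure, including death by a
        # signal such as SIGSEGV.
        "$@" >/dev/null 2>&1 || fails=$((fails + 1))
    done
    echo "$fails/$rounds"
}

# On a Proxmox VE node, for example:
#   failure_rate 300 pveversion
```

Running the same call on both twins gives two fractions to set side by side, which is the comparison the control depends on.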

The same fault, but its consequences fall into tiers, depending on where the flipped pointer lands. Land in a user process, and that process dies to a segfault and exits while the machine lives on — the most common tier. Land on a kernel pointer, and the kernel oopses (a crash in kernel space), leaving a task that can’t be killed by any means. Land deeper still, on the scheduler or some global lock, and the whole machine freezes without warning: no log, no panic, nothing left behind.

That kind of freeze had one strange detail: the machine was stone dead and yet still answered a ping. I’d ping it and ICMP came back inside a millisecond; and yet at the same time SSH wouldn’t connect, even a raw connection to its SSH port timed out, and the picture on the monitor was frozen. It was still answering ICMP while everything above that had stopped responding. Which layer down there was still moving, I didn’t dig into.

By this point I knew what it was, how to force it out, and what it could do. None of which was a fix.

VI. Limping Along

What finally held it down was the very thing I’d pointed at from the start: C-states. After two weeks of going around, the answer landed back on the same spot as the first hour’s guess.

But “guessing it” and “acting on it correctly” are two different things, and the distance between them was very nearly the whole investigation.

The easiest move is to disable the deep sleep states outright and pin the CPU to the shallowest sleep level. I tried it. For the first two hours it was flawless, not a single crash. Had I stopped there, this article would end here. But I let it keep running, and in a fresh round it froze again. Those two “flawless” hours were nothing but a trough the fault rate had drifted into on its own — the very trap I’d refused to step into at the start. This time, the same trap sprang shut on me, exactly as before.

Digging further, the mechanism came clear. Cap it at the shallowest level and the core still sleeps — just not deeply — and it froze all the same. So what triggers the fault is the act of the core sleeping itself, regardless of how deep. As long as it still closes its eyes, even the lightest doze, the fault still has its chance. The one road that avoids it entirely is to not let the core sleep at all. I switched the kernel’s idle policy to idle=poll, so that even when a core has nothing to do it doesn’t sleep but spins on a busy-wait loop, the whole CPU awake without pause. This fixed nothing — the broken hardware is untouched — it just removed the triggering act wholesale. One CPU spinning hot forever like this is hardly elegant, but it works. It’s the exact inverse of that earlier reproducer: the reproducer used relentless forking to send most cores in and out of idle; this keeps every core awake at all times, with not a sliver of idle left.
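idle=poll is a Linux kernel boot parameter, so where it gets set depends on the bootloader. The sketch below assumes GRUB, which is an assumption about this node's boot setup; other setups keep the kernel command line elsewhere. The helper `has_idle_poll` is a hypothetical name, and it only checks that the running kernel actually received the parameter.

```shell
#!/bin/sh
# With GRUB (assumed), the parameter would be appended in
# /etc/default/grub, followed by update-grub and a reboot, e.g.:
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet idle=poll"

# Succeed if the given kernel command line contains idle=poll
# as a whole, space-separated word.
has_idle_poll() {
    case " $1 " in
        *" idle=poll "*) return 0 ;;
        *) return 1 ;;
    esac
}

# After the reboot, confirm the running kernel got the parameter.
if [ -r /proc/cmdline ]; then
    if has_idle_poll "$(cat /proc/cmdline)"; then
        echo "idle=poll is active"
    else
        echo "idle=poll is NOT active"
    fi
fi
```

A check like this is worth running after every kernel or bootloader update, since a silently dropped parameter would put the cores back to sleep.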

This time the fault rate dropped to zero. This zero I trust more than those two peaceful hours before it: the shallow-sleep run merely happened not to fire, the cores still sleeping, the risk there the whole time; now the cores never close their eyes at all, and the triggering act is gone at the source. But zero across any number of runs is still only not having seen it happen — it can’t prove it never will.

The price is heat. Run every core spinning around the clock and this i9, packed into its little chassis, runs hot — so I set a power cap on the whole CPU, holding the steady-state temperature at 77°C, a dozen-odd degrees of margin from the thermal-throttle line. Full load, hot, never asleep.
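How the cap was set is not stated here. One common route on Intel machines running Linux is the kernel's powercap (RAPL) sysfs interface, sketched below under that assumption; a limit set in the BIOS would be an equally plausible route. The zone path is the usual location for the CPU package, the 35 W figure is a placeholder rather than the value used on this machine, and the function names are hypothetical.

```shell
#!/bin/sh
# Assumed path: the package-level RAPL zone exposed by the intel_rapl driver.
RAPL_ZONE=${RAPL_ZONE:-/sys/class/powercap/intel-rapl:0}

# Convert whole watts to the microwatts the sysfs file expects.
watts_to_uw() {
    echo $(( $1 * 1000000 ))
}

# Write a package power limit to constraint_0, which is typically the
# long-term limit (constraint_0_name in the same directory says which).
# Needs root on a real system.
set_power_cap() {
    watts_to_uw "$1" > "$RAPL_ZONE/constraint_0_power_limit_uw"
}

# Example (35 W is a placeholder, not the author's value):
#   set_power_cap 35
```

A value written this way does not survive a reboot, so it would have to be reapplied at boot, for instance from a systemd unit.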

Precisely because there’s no guaranteeing it never happens again, there’s that last line of defense. A machine that’s truly frozen needs someone to cut its power once, hard, to bring it back. And this machine has no out-of-band management; the only thing that can hard-reboot it from afar is that switched PDU in the rack. So on another machine I set up a watchdog, watching ms-01-1. That strange detail I hadn’t dug into earlier became a design constraint here: a frozen machine can still answer a ping, but answering a ping isn’t the same as being alive. So the watchdog doesn’t rely on ping; it probes two layers at once — whether SSH can log in, and whether the bare port 22 will accept a connection. Only when neither answers does it judge the machine truly dead, then cut and restore power through the PDU to bring it back.
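The decision rule described above can be sketched as follows. This is an illustration, not the watchdog actually running: the timeouts are placeholders, and `pdu_cycle` is a stand-in, since how the power cycle is issued depends entirely on the PDU's own interface.

```shell
#!/bin/sh
# Hypothetical names and values: TARGET, pdu_cycle, and the timeouts.
TARGET=${TARGET:-ms-01-1}

# Layer 1: can we log in over SSH and run a trivial command?
probe_ssh_login() {
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$TARGET" true >/dev/null 2>&1
}

# Layer 2: does port 22 accept a bare TCP connection?
probe_tcp_22() {
    nc -z -w 5 "$TARGET" 22 >/dev/null 2>&1
}

# Placeholder: replace with whatever the PDU actually offers.
pdu_cycle() {
    echo "power-cycling $TARGET via PDU"
}

# The machine counts as dead only when both probes fail.
# Ping is deliberately not consulted: a frozen node still answers it.
check_once() {
    if probe_ssh_login || probe_tcp_22; then
        echo "alive"
    else
        pdu_cycle
    fi
}
```

Calling `check_once` on a timer is enough for a sketch. A version meant to run unattended would probably also want several consecutive failures before acting, plus a grace period after each power cycle so the node has time to boot.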

By this point, this is about all that can be done. If I could RMA it, I’d have RMA’d it long ago — but it’s out of warranty; the one “real fix” left is to send the whole board out for third-party repair, which isn’t worth it. So it sits in this state: a node still in production service, with another machine beside it ready at any moment to cut its power and bring it back.

VII. Certainty

From these two weeks of detours I didn’t come away with a way to cure it. What I came away with was certainty: certainty that the problem is in the machine itself and nothing else; certainty that no swap, no flash, no downclock will help; certainty that this is what it truly is. For a machine whose breaking doesn’t matter, this certainty is exactly what I’d wanted all along.

Laid out plainly, the whole thing is out of proportion: two solid weeks of hardware forensics — twin control, cross-swapping parts, a purpose-built tool, a dozen-odd reboots — spent on a node that is cattle through and through. But that disproportion is probably the natural state of a certain kind of homelabber: what they’re serious about is the tinkering itself, rigorous in method, while what runs on top, and whether it breaks, hardly matters. It’s that same preference that led me to pick this non-ECC mini PC in the first place and plant the flaw in it — and then, after the fact, made me willing to spend two weeks running that flaw to ground.

If it were someone else’s machine, I’d probably tell them to let it go.

This one is mine.

Footnotes

  1. ECC: Error-Correcting Code memory. It checks data on every read and write, catching and correcting single-bit errors and reporting them to the system. Consumer platforms, and this mini PC, use plain non-ECC memory: without that check, an error is neither corrected nor recorded, completely invisible to anything above it.

  2. BMC (Baseboard Management Controller), or out-of-band management: a management chip independent of the operating system and separately powered, which lets you power-cycle a machine, view its console, and install an OS even when it’s dead. Proper servers mostly have one; these mini PCs don’t, so once one freezes, the only way in is to cut power at the source.