[Part 2] SQL Server 2022 Crashes — When the Fix Doesn’t Hold: A Second Root Cause Hiding Behind the First

15 minutes ago
9 min read

In my last post I traced a wave of EXCEPTION_ACCESS_VIOLATION (0xc0000005) crashes on our SQL Server 2022 fleet to an EDR heap-hook incompatibility on Windows Server 2025, and walked through the cross-server comparison that pinned it down. That conclusion held up — for that crash signature. What it didn’t do was stop the bleeding.

A few nodes kept crashing after the EDR remediation. Same exception code, but when I actually read the dumps frame by frame, it was a different fault — different module, different faulting address, different trigger. The EDR hook was one root cause. It wasn’t the only one. This post is the part of the story where the “solved” case reopened, and how a tool I built specifically for this investigation cut what would have been weeks of manual log-spelunking down to a single afternoon of queries.

The Setup

The estate is eight SQL Server 2022 Enterprise nodes (major version 16) spread across four regions, hosted on a mix of GCP and AWS compute — self-managed VMs, not a managed database service, which matters later. Each region runs an Always On Availability Group. To keep the regions anonymized I’ll refer to the nodes that mattered as:

Node A1 / Node A2 — an AG pair in one region, both producing dumps
Node B5 — an AG secondary in a second region, the most persistent offender
Node B4 — Node B5’s in-region twin, identical build and config, has not crashed since last November (this turns out to be the most important node in the whole investigation)
Node C2 — a node in a third region with an occasional, separate crash flavor

The Tell: The Same Fix Didn’t Fix Everything

After the EDR work, I expected the dump folders to go quiet. Most did. But Node A1, Node A2 and Node B5 kept dropping new SQLDump files. When I opened them, the signature was clearly not the heap-corruption pattern from the first investigation:

The original EDR-class crashes faulted deep in ntdll heap routines (and once in vcruntime140) — classic heap corruption, many spids reading the same poisoned address.
These new crashes faulted at a fixed engine offset — SqlDK rva 0x53530, reached through a sqllang → sqlmin → SqlDK call path — while reading address 0x28. That’s a NULL + 0x28 dereference: a null/invalid struct-field read inside the engine, not a corrupted heap block.

The cross-node fingerprint was the part that made me sit up. Two independent hosts, on different days, faulted at the identical read address 0x28:

Node A1   SqlDK rva 0x6903   read 0x0000000000000028   <-- initiating
Node B5   SqlDK rva 0x68E8   read 0x0000000000000028   <-- initiating
...later in each storm:
          SqlDK rva 0xAC46   read 0xFFFFFFFFFFFFFFFF    <-- cascade as the process tears down

A shared fault offset across two separate machines is far stronger “same bug” evidence than “the RVAs are within a few bytes of each other,” which is the kind of soft correlation I’d leaned on earlier. This was a single, reproducible engine code path.

What Was Actually Pulling the Trigger

Every initiating dump — the first fault in each storm, before the cascade — captured an input buffer that was a monitoring query from our Telegraf [[inputs.sqlserver]] collector. They’re unmistakable: each one carries the collector’s distinctive prologue.

SET DEADLOCK_PRIORITY -10;
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
-- ...DMV-heavy collector query (database state pivots, recent-backup history, perf counters)

That explained the thing that had been nagging me from the start: why is a quiet AG secondary with no application traffic crashing? Because the monitoring collector polls every node on a timer — primaries and readable secondaries alike. The trigger never needed app traffic. It rode in on the telemetry pipeline.

Reading across the fleet, two distinct trigger classes fell out of the dumps:

Trigger (from dump input buffer)	Faulting path	Nodes
Telegraf RecentBackups collector query	sqllang → sqlmin → SqlDK rva 0x53530	Node A1, Node A2, Node B5
Batch-mode monitoring query (a WhoIsActive-style DMV scan)	SqlTsEs → SqlDK rva 0x71DA	Node C2

And Yes — I Re-Checked the EDR

After the last investigation, the obvious question was whether the EDR hook had crept back in. It hadn’t. The sensor’s injected DLL was still loaded in-process on these nodes — that’s expected — but across every one of these new dumps it never appears in a faulting frame. The fault is in Microsoft’s own engine modules, on a path entered by a monitoring query. This is a second, independent defect that happened to be living in the same fleet, masked by the louder EDR crashes until those were cleared. Two bugs, one symptom code.

The Control Node That Reframed the Whole Thing

Here’s the wrinkle that keeps this honest. Node B4 — Node B5’s in-region twin — runs the identical RecentBackups collector query every single minute, and it has not produced a crash dump since last November. Same build, same config, same query, same cadence. One twin crashes; the other doesn’t.

So the monitoring query is the necessary trigger but not the sufficient cause. Something node-specific — a plan shape, a data distribution, a build-state subtlety — differentiates the nodes that fault on SqlDK 0x53530 from the ones that run the same path clean. That node-specific factor is the open question, and it’s exactly what I’ll be taking to Microsoft. It would have been very easy (and wrong) to declare “the Telegraf query is the root cause” and stop. The control twin says otherwise.

Engine Knobs First — and Why They Weren’t Enough

Before excluding any monitoring, I tried to close the path from inside the engine, because that’s the less invasive lever:

Trace flag 12836 ON across all eight nodes, and
BATCH_MODE_ON_ROWSTORE = OFF in the affected database on all eight nodes.

The batch-mode knob genuinely addressed Node C2’s separate SqlTsEs flavor. But for the main SqlDK 0x53530 path, the proof came the hard way: Node B5 crashed again with both knobs already in place. Fresh dumps, same faulting frame, same Telegraf input buffer. That recrash is the cleanest possible evidence that TF 12836 and the batch-mode setting do not touch this code path. The engine defect is real and it’s upstream of those switches.

The Mitigation That Held — With an Operational Gotcha

This stretch was a lot of casting and reeling in nothing. We kept fishing — align the builds, tune the memory, flip the engine knobs — and the line stayed slack every time. With the knobs ruled out for this path, the last lever left was to remove the trigger entirely: stop the collector queries from ever reaching the engine. That’s the cast that finally hooked it. I added the offending collectors to the Telegraf exclude_query list:

exclude_query = ["SQLServerPerformanceCounters",
                 "SQLServerProperties",
                 "SQLServerRequests",
                 "SQLServerRecentBackups"]

Then the gotcha that cost a few hours of confusion: exclude_query only takes effect when Telegraf is reloaded. The config landed on disk and I confirmed it was correct — but the live agents kept happily polling every node for another half hour. Config-on-disk is not the same as runtime state. Only after restarting Telegraf on both collector hosts did the RecentBackups query actually stop fleet-wide; I could watch its last-execution timestamp freeze and stay frozen across all nodes, with the agents still reporting healthy. The known trigger surface is now closed.

The exclusion stopped the bleeding, but it’s a tourniquet — it just stops one named query from running. The durable fix sat one layer up: upgrade the Telegraf sqlserver input plugin to the latest version. The pleasant surprise was that the newer plugin had dropped the deprecated collector queries entirely — the exact legacy RecentBackups-style query that was driving the engine onto the SqlDK 0x53530 path no longer exists in the current plugin. So the upgrade removes the offending query at the source instead of relying on me to keep an exclusion list accurate forever. The version we’d been running was old enough to still carry collector code the maintainers had since retired — sometimes the cleanest remediation is simply catching up to a release where the bad behavior was already removed.

As an engine-side insurance policy I’m also adding a Resource Governor classifier that caps the monitoring login to MAXDOP 1, so that if any other collector query ever wanders onto a parallel plan over this path, it can’t.

Where It Stands

Item	Status
Telegraf collector queries excluded + agents restarted on both collector hosts; verified stopped fleet-wide	Done
TF 12836 ON and BATCH_MODE_ON_ROWSTORE = OFF across all eight nodes (closes the separate batch-mode flavor)	Done
Upgrade the Telegraf sqlserver input plugin to latest — the new release dropped the deprecated collector queries that drove the AV (durable fix, supersedes the manual exclusion)	Done
Resource Governor classifier capping the monitoring login to MAXDOP 1	In progress
Microsoft support case on the SqlDK rva 0x53530 access violation	Not filed yet — the next step; reproducer is captured and ready
Per-node dump watch with alerting on the first new stack dump anywhere	Active

Because these are self-managed SQL Server VMs rather than a managed database service, an access violation inside SqlDK is squarely a Microsoft engine matter — the cloud provider’s support covers the VM and hypervisor, not the in-process engine fault. To be clear about where things actually stand: I haven’t opened that case yet. That’s the next move now that the fleet is quiet enough to work it calmly. The reproducer is already captured and waiting: the exact monitoring input buffer, the SqlDK 0x53530 frame, the shared 0x28 fault address across nodes, the build number, and the killer detail that one twin node faults while its identical sibling runs the same query clean.

An honest caveat I keep on every update: the absence of new dumps since the restart is encouraging, but it’s early, and it’s not proof of a fix. The trigger is removed; the underlying engine defect is not yet confirmed resolved by Microsoft. Trigger gone, bug still open.

The Tool That Made This Tractable: A SQL Server MCP

None of this would have moved at the pace it did without a tool I built specifically for the job: a custom SQL Server MCP server — a read-only interface that lets an AI agent query the entire fleet through one safe, governed surface. And it’s not limited to T-SQL: it’s smart enough to reach the artifacts that actually matter during a crash investigation — it reads the memory dumps, the SQL Server ERRORLOG, the Windows event logs, and the Windows failover-cluster log — and correlates them with live DMV state. Instead of RDP-ing into eight boxes, hunting for dump folders that don’t live where the error log says they should, and eyeballing stacks one file at a time, I could ask across all nodes at once:

Enumerate every SQLDump on disk per node and diff it against the last known count, so a new fault anywhere surfaces immediately — then read the input buffer straight out of each dump, which is what surfaced the Telegraf prologue as the common trigger.
Pull the ERRORLOG and the Windows event logs around a crash window and line them up against the dump timestamps — so “what did the engine and the OS each say in those 30 seconds?” is one question, not three logins.
Generate and read the failover-cluster log to confirm whether an AG actually moved (or didn’t) when a node went down — the difference between “customer impact” and “contained”.
Compare sys.dm_os_loaded_modules side by side across nodes — the same move that cracked the first investigation, now run fleet-wide in seconds.
Pull session, memory and query-stats DMVs from a crashing node and its clean twin together, which is what turned Node B4 from a footnote into the control group.

Every figure in my status reports comes from live read-only queries through that MCP — nothing estimated, nothing stale. The control-group comparison, the trigger identification, and the “is the collector actually stopped everywhere now?” verification were all minutes of work instead of days. I’m writing a dedicated post on how I built it — the design, the safety model, and the lessons from putting an AI agent in front of production databases — so stay tuned for that one.

Takeaways

“Solved” is per-signature, not per-symptom. One exception code (0xc0000005) covered two genuinely different defects. Clearing the loud one revealed the quiet one. Read the faulting frame, not just the error number, before you close a case.
A loaded module is not a guilty module. The EDR DLL sat in every dump, but it never once appeared in a faulting frame on this second crash class. Presence in the process is not participation in the crash.
The query in the input buffer is the witness, not always the culprit. The monitoring query was the necessary trigger — but the twin node running the identical query every minute, crash-free, proves it isn’t sufficient on its own. Keep a control before you name a cause.
Config on disk is not runtime state. The exclusion did nothing until the agent restarted. Always verify the live process actually picked up the change — the same lesson, in a different costume, as checking the in-memory DLL rather than the file on disk.
Engine knobs are cheap to try and quick to disprove. A recrash with the trace flag already on is the most useful negative result you can get — it rules out a whole class of fix in one event.
Build the diagnostic leverage once, reuse it forever. A fleet-wide, read-only query surface paid for itself the first afternoon. The investigations that drag on are usually the ones where every data point costs a manual login.

This post continues a real production incident. The earlier EDR heap-hook finding stands for its crash class; this follow-up covers a second, distinct engine defect uncovered afterward, mitigated by removing its trigger, and headed to Microsoft next. Server names, regions, and customer-specific details have been anonymized.