
SQL Server 2022 Engine Crashes on Windows Server 2025 — Tracing the Cause to an EDR Heap Hook Incompatibility

  • May 25
  • 13 min read

[UPDATED]

In March 2026, our production messaging-database cluster started crashing again. Same exception code as the incident we resolved last year (0xc0000005, Access Violation), but a different stack signature, different servers, different cloud, and ultimately a completely different root cause. This time the bug wasn't in SQL Server. It wasn't in the storage drivers. It wasn't in our queries. It was in something we'd never have suspected without comparing the crashing servers against a fleet member that didn't crash — our endpoint detection and response (EDR) agent.


What follows is the full investigation, from the first dump file on the floor to the comparison query that pinpointed the differentiator. If you're running SQL Server 2022 on Windows Server 2025 with a third-party EDR injected into sqlservr.exe, this story may save you a few weeks.


The Environment

  • SQL Server 2022 Enterprise Edition (16.x), builds ranging from CU22-GDR (16.0.4230.2) to CU24-GDR (16.0.4252.3)

  • Windows Server 2025 Datacenter (Build 26100) on all production messaging DBs

  • Mixed cloud footprint: AWS EC2 (r5d.xlarge) and GCP (n2-standard-4) hosts, split across two production regions

  • AlwaysOn Availability Groups in each region, plus cross-region AG listeners

  • In-process linked-server providers: MSDASQL with Simba ODBC for Google BigQuery, MSOLEDBSQL for cross-region AG references, plus a CData SQL Gateway for downstream analytics

  • Endpoint protection: Carbon Black Cloud Sensor 4.0.3.2029 (deployed enterprise-wide, including on every DB host)


The Symptoms

The crashes followed a recognizable pattern across three of our four production servers. Single-spid AVs scattered over weeks, occasionally erupting into a "storm" — tens of spids crashing in the same minute, sometimes taking the AG with them.

SqlDumpExceptionHandler: Process 226 generated fatal exception c0000005
EXCEPTION_ACCESS_VIOLATION reading address 000001A2EB3C0000

Exception Address = ntdll.dll RVA 0x1659E7
Stack: ntdll!RVA 0x1659E7 ← SqlDK!... ← sqllang!...

A few things stood out about the pattern:

  • The exception always pointed into ntdll.dll at the same RVA — an internal heap-manager routine. The actual SQL Server stack frames above it were just unwinding through normal query execution.

  • During the storms, dozens of spids hit the AV reading the same corrupted memory address. In one of our incidents, fifteen separate spids on a single primary node all faulted reading 0x000001A2EB3C0000 within a five-minute window. That's not a query bug — that's a single object in the heap that something else has freed or overwritten.

  • Just before each storm, the ERRORLOG filled up with OS Error: There was AV or Assert in Resource Monitor entries. Resource Monitor is SQLOS's internal memory-broker thread; access violations there mean the heap itself is corrupt and the broker is the unlucky thread that noticed first.

  • Module list in every dump included ctiuser.dll, sitting at a standard library load address inside the sqlservr.exe process. We knew it wasn't a Microsoft DLL but didn't immediately know what it was.

  • Patching didn't help. One server crashed for the first time after we upgraded it to CU24-GDR. So the crashes weren't a regression introduced by a specific build.

The first major incident on one of our regional AG pairs took ~50 spids down in 90 seconds. Both Availability Groups in that region failed over to their secondaries, and the receiving secondary then crashed itself. Customer-facing impact was contained by the failover, but only barely.
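The "many spids, one address" pattern is easy to check mechanically once the log lines are extracted. The following is a minimal sketch, assuming the two line shapes shown in the excerpt above; the sample log (other than the first pair) and the function names are invented for illustration.

```python
import re
from collections import defaultdict

# Line shapes follow the excerpt above; real ERRORLOG lines may carry extra prefixes.
PROCESS_RE = re.compile(r"Process (\d+) generated fatal exception c0000005")
ADDRESS_RE = re.compile(r"EXCEPTION_ACCESS_VIOLATION reading address ([0-9A-Fa-f]+)")


def spids_by_fault_address(lines):
    """Map each faulting address to the set of spids that faulted on it."""
    hits = defaultdict(set)
    current_spid = None
    for line in lines:
        proc = PROCESS_RE.search(line)
        if proc:
            current_spid = int(proc.group(1))
            continue
        addr = ADDRESS_RE.search(line)
        if addr and current_spid is not None:
            hits[addr.group(1).upper()].add(current_spid)
            current_spid = None
    return dict(hits)


def shared_addresses(hits, min_spids=2):
    """Keep only addresses that at least min_spids distinct spids faulted on."""
    return {a: s for a, s in hits.items() if len(s) >= min_spids}


# Hypothetical sample: spids 301 and 87 and the second address are made up.
sample_log = [
    "SqlDumpExceptionHandler: Process 226 generated fatal exception c0000005",
    "EXCEPTION_ACCESS_VIOLATION reading address 000001A2EB3C0000",
    "SqlDumpExceptionHandler: Process 301 generated fatal exception c0000005",
    "EXCEPTION_ACCESS_VIOLATION reading address 000001A2EB3C0000",
    "SqlDumpExceptionHandler: Process 87 generated fatal exception c0000005",
    "EXCEPTION_ACCESS_VIOLATION reading address 000001A2F0000010",
]

hits = spids_by_fault_address(sample_log)
print(shared_addresses(hits))
```

An address that shows up under several unrelated spids points at shared heap state rather than at any one query.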


What We Tried (and What Didn't Work)

This is where the investigation got expensive. Many of these felt like reasonable hypotheses at the time, but in hindsight every one of them was based on either a co-occurrence in the dumps or a piece of conventional wisdom that didn't apply to our specific failure mode.

1. Disable in-process linked-server providers (AllowInProcess = 0)

An earlier investigation had flagged our in-process Simba ODBC linked server as the prime suspect. The theory: Simba runs inside sqlservr.exe when AllowInProcess is set to 1, so any crash in the provider takes the engine down with it. Setting AllowInProcess = 0 on the MSDASQL registry key was supposed to push the provider out-of-process and contain the failure.

The problem: Simba ODBC for BigQuery only works in-process. Setting AllowInProcess = 0 broke every linked-server query that touched BigQuery. We reverted to AllowInProcess = 1 within a day. The crashes had nothing to do with whether linked servers ran in-process or not.

2. Cumulative Update alignment

Two of our four servers were on CU22-GDR while the other two had been upgraded to CU24-GDR. We initially thought the CU mismatch across AG replicas could be contributing, since Microsoft only officially supports CU mismatch as a transient state during rolling upgrades.

We aligned one of the secondaries to CU24-GDR on 26 April. It crashed two weeks later on 10 May. So CU24-GDR didn't fix it either. The same was true for a peer in another region: patched to CU24-GDR on 13 March, crashed on 16 May. Patching wasn't the answer.
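The gaps between patch and crash are worth computing rather than eyeballing. A small sketch, using only the dates quoted above (year 2026); the dictionary labels are shorthand, not real host names.

```python
from datetime import date

# (patched to CU24-GDR, first crash afterwards), as stated in the text.
events = {
    "aligned secondary": (date(2026, 4, 26), date(2026, 5, 10)),
    "peer in another region": (date(2026, 3, 13), date(2026, 5, 16)),
}


def days_to_crash(patched, crashed):
    """Whole days between the patch date and the next crash."""
    return (crashed - patched).days


gaps = {host: days_to_crash(p, c) for host, (p, c) in events.items()}
for host, days in gaps.items():
    print(f"{host}: crashed {days} days after patching")
```

Both gaps are positive, so on both hosts the crash came after the newest CU was already installed.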

3. Memory configuration

One of the dumps showed MemoryLoad = 96% and only 644 MB available physical memory on a 16 GB box at crash time. That made memory pressure look like a likely culprit. We checked:

  • max server memory — already tuned per dbatools recommendation on all four servers (11261 MB on the 16 GB GCP hosts, 24000 MB on the 32 GB AWS host)

  • Lock Pages in Memory — already granted to the SQL service account on every server

  • Page file — misconfigured (system-managed and sharing the tempdb drive), but other servers in the fleet had the same misconfiguration and weren't crashing

So memory pressure was a symptom, not a cause. Something inside sqlservr.exe outside the buffer pool (which Lock Pages in Memory protects) was growing and squeezing out the rest of the process. We didn't yet know what.
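The numbers from that dump can be turned into a rough lower bound on how much memory sat outside the configured cap. This is a back-of-the-envelope sketch that assumes the 16 GB host exposes exactly 16384 MB and that SQL Server had committed no more than its max server memory setting; both are simplifications.

```python
TOTAL_MB = 16 * 1024          # assumed: 16 GB host reported as 16384 MB
AVAILABLE_MB = 644            # available physical memory from the dump
MAX_SERVER_MEMORY_MB = 11261  # configured cap on the 16 GB GCP hosts

used_mb = TOTAL_MB - AVAILABLE_MB
load_pct = 100 * used_mb / TOTAL_MB

# Everything in use beyond the cap belongs to the OS, other processes,
# or allocations inside sqlservr.exe that the cap does not govern.
outside_cap_mb = used_mb - MAX_SERVER_MEMORY_MB

print(f"used: {used_mb} MB ({load_pct:.0f}% load)")
print(f"at least {outside_cap_mb} MB in use outside max server memory")
```

The computed load agrees with the 96% MemoryLoad in the dump, and the remainder is the amount that needs an explanation other than the buffer pool.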

4. Disabling monitoring jobs

One of the storm dumps showed EXEC dbo.CaptureWhoIsActive in the input buffer at crash time. We briefly considered pausing the job, but stepped back. sp_WhoIsActive is a read-only DMV query — there's no plausible mechanism by which it corrupts a heap. It runs every minute, so by simple base-rate logic, of course it shows up in the input buffer of some spid when any crash happens. Pausing it would have removed our best source of session-state evidence during the next crash. We left it running.
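The base-rate argument can be made concrete with a rough calculation. The one-minute interval is from the text; the job duration and the number of crash events are hypothetical values chosen only to show the shape of the reasoning.

```python
JOB_INTERVAL_S = 60  # the job runs every minute (from the text)
JOB_DURATION_S = 5   # hypothetical run time per execution


def p_in_flight(duration_s, interval_s):
    """Chance the job is executing at a uniformly random instant."""
    return min(duration_s / interval_s, 1.0)


def p_seen_at_least_once(p, crash_events):
    """Chance the job is caught in at least one of several independent crashes."""
    return 1.0 - (1.0 - p) ** crash_events


p = p_in_flight(JOB_DURATION_S, JOB_INTERVAL_S)
seen = p_seen_at_least_once(p, 10)  # 10 crash events is an assumed count
print(f"per crash: {p:.1%}, at least once in 10 crashes: {seen:.1%}")
```

Even with a short job, the chance of finding it in some dump grows quickly with the number of crashes, which is why its presence carries little evidential weight.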

5. Page file restructuring on tempdb-shared drives

Worth fixing for performance hygiene (tempdb shouldn't share I/O with the page file on a "dedicated" tempdb NVMe), but we confirmed other production servers in our fleet had the same page-file layout and weren't crashing. So this wasn't causal either. We parked it for a separate work item.


The Breakthrough — Compare Against a Server That Doesn't Crash

We had been so focused on what was wrong with the crashing servers that we'd skipped the most useful diagnostic move: look at a server in the same fleet that has the same config and doesn't crash, and find the one difference.

We picked a peer in a third region — same business application, same linked-server providers (including Simba ODBC for BigQuery), same Carbon Black sensor deployment policy, heavier active workload (it was sitting on 13.9 GB of query workspace memory at the time of the snapshot), and zero crash dumps in the last six months. We ran a comparison query across both:

SELECT @@SERVERNAME, @@VERSION;

SELECT name, description, company, file_version
FROM sys.dm_os_loaded_modules
WHERE name NOT LIKE '%\Microsoft SQL Server\%'
ORDER BY name;

SELECT name, product, provider, data_source, provider_string
FROM sys.servers WHERE is_linked = 1;

SELECT TOP 15 type, SUM(pages_kb)/1024 AS mb
FROM sys.dm_os_memory_clerks
WHERE type NOT IN ('MEMORYCLERK_SQLBUFFERPOOL')
GROUP BY type ORDER BY mb DESC;

Side-by-side, almost everything matched. Same SQL Server CU on one comparison pair. Same Carbon Black sensor version (4.0.3.2029) across all four. Same Simba ODBC. Same MSOLEDBSQL. Same SQLNCLI. Same Windows trace flags. Same Lock Pages in Memory configuration.


One thing was different.

| Attribute | Crashing servers (4) | Control server (no crashes) |
| --- | --- | --- |
| Operating system | Windows Server 2025 (Build 26100) | Windows Server 2022 (Build 20348) |
| SQL Server build | CU22-GDR / CU24-GDR | CU24-GDR (matches one crashing server) |
| Carbon Black sensor | 4.0.3.2029 | 4.0.3.2029 (identical) |
| Linked-server provider mix | SQLNCLI, MSOLEDBSQL, MSDASQL/Simba | SQLNCLI, MSOLEDBSQL, MSDASQL/Simba |
| Workload | Light at sample time | Heavy (13.9 GB query workspace) |

The only factor that perfectly separated crashing from not-crashing was the OS version. The Carbon Black sensor version was identical — same DLL, same product version, same vendor signature. The CU was the same on one comparison pair. The workload was actually heavier on the server that didn't crash. So the cause wasn't Carbon Black alone, and it wasn't a specific SQL Server build, and it wasn't workload.

It was Carbon Black 4.0.3.2029 plus Windows Server 2025 Build 26100.
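The "find the one difference" step is a set comparison, sketched below. The attribute values come from the comparison; the host labels are placeholders, only two crashing hosts are listed for brevity, and workload is left out because it is a point-in-time observation rather than a configuration attribute.

```python
PROVIDERS = ("SQLNCLI", "MSOLEDBSQL", "MSDASQL/Simba")

# Placeholder host labels; values follow the comparison above.
servers = [
    {"host": "crash-1", "crashes": True, "os_build": 26100,
     "sql_cu": "CU22-GDR", "sensor": "4.0.3.2029", "providers": PROVIDERS},
    {"host": "crash-2", "crashes": True, "os_build": 26100,
     "sql_cu": "CU24-GDR", "sensor": "4.0.3.2029", "providers": PROVIDERS},
    {"host": "control", "crashes": False, "os_build": 20348,
     "sql_cu": "CU24-GDR", "sensor": "4.0.3.2029", "providers": PROVIDERS},
]


def separating_attributes(rows, outcome="crashes", ignore=("host",)):
    """Attributes whose values never overlap between the two outcome groups."""
    keys = [k for k in rows[0] if k != outcome and k not in ignore]
    result = []
    for key in keys:
        bad = {r[key] for r in rows if r[outcome]}
        good = {r[key] for r in rows if not r[outcome]}
        if bad and good and bad.isdisjoint(good):
            result.append(key)
    return result


print(separating_attributes(servers))
```

An attribute that appears in this list is only a candidate. It becomes a cause once a mechanism, such as the hook and heap-layout interaction described next, explains it.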


What changed in Windows Server 2025

Build 26100 is the same kernel base as Windows 11 24H2. It includes significant rework of the user-mode heap allocator — updates to segment heap promotion, low-fragmentation heap metadata layout, and the heap block header format. These changes are normally invisible to user-mode code, but they break any third-party agent that has inline-hooked heap APIs based on assumptions about the older WS2022 layout.


Carbon Black, like most modern EDRs, does exactly that. It hooks memory allocation and free operations inside sqlservr.exe via DLL injection (the ctiuser.dll we kept seeing in every dump). Sensor version 4.0.3.2029 was validated against Windows Server 2022 (Build 20348). On Build 26100, those hooks misread heap block headers, leak references, or write to memory the heap manager has already reclaimed. The result is exactly the crash signature we were seeing — AVs deep in ntdll heap routines, reused corrupt addresses across many spids, Resource Monitor AV/Asserts as the canary thread, and a load-bearing third-party DLL sitting innocently in the module list.


The Fix

Two parallel remediations, in priority order:


Action 1: Upgrade the Carbon Black sensor

We opened a case with Broadcom support to confirm the minimum sensor version that's been validated against Windows Server 2025 / Build 26100. The 4.0.x line predates WS2025 GA. The 4.1.x line (we already had 4.1.0.5463 installed on engineering workstations and weren't seeing any crashes there) is the next supported branch. Rollout to all four production DBs is scheduled.


Action 2 (as needed): Add SQL Server process bypass in Carbon Black

This is the faster of the two and provides immediate mitigation. In the Carbon Black Cloud console, under the policy that covers our DB hosts, we added "Performs any operation" bypass entries for:

  • **\sqlservr.exe

  • **\sqlwriter.exe

  • **\sqlagent.exe

  • **\SQLDumper.exe

  • **\fdhost.exe, **\fdlauncher.exe (full-text search)


After the policy syncs, restart the SQL Server service on each host so ctiuser.dll is no longer mapped into the process. The DLL only loads at process start, so a restart is required for the bypass to take effect — checking sys.dm_os_loaded_modules after restart confirms it's gone.
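The post-restart check can be scripted against an export of the name column from sys.dm_os_loaded_modules. This is a minimal sketch; the module paths are hypothetical sample data and the helper name is invented.

```python
import ntpath


def edr_dll_loaded(module_paths, dll_name="ctiuser.dll"):
    """True if any loaded module's file name matches dll_name, ignoring case."""
    target = dll_name.lower()
    return any(ntpath.basename(p).lower() == target for p in module_paths)


# Hypothetical exports taken before and after the service restart.
before_restart = [
    r"C:\Windows\System32\ntdll.dll",
    r"C:\Windows\System32\ctiuser.dll",
]
after_restart = [
    r"C:\Windows\System32\ntdll.dll",
]

print("before:", edr_dll_loaded(before_restart))
print("after:", edr_dll_loaded(after_restart))
```

ntpath is used so the Windows-style paths parse the same way regardless of where the script runs.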


Validation Plan

Each of our previously-crashing servers had been crashing at roughly a 1–4 week cadence. After applying the process bypass on a previously-crashing server, we'd monitor for 14 days. If no AVs occurred in that window on a server that previously had multiple, the hypothesis was confirmed. The same change would then roll out to the remaining servers, and the Carbon Black sensor upgrade would follow behind as the durable fix.


The Recurrence (and What It Actually Taught Us)

We rolled out the process bypass, restarted SQL Server on the first server we touched, and that node entered a quiet validation window. Around the same time, our security team pushed the Carbon Black sensor upgrade to the next major version across all of our affected database hosts.


Eleven days later, one of the still-on-the-old-CU servers crashed again. Four EXCEPTION_ACCESS_VIOLATION dumps in 33 seconds, same process ID across all four (i.e. one storm event, not multiple restarts). New crash signature this time — the exception address was in vcruntime140.amd64 instead of ntdll heap routines — but the same heap-corruption root-cause class.


The first instinct was "the upgrade didn't take." But the disk side checked out: the Carbon Black sensor's ctiuser.dll on disk was indisputably the new version. So why was the process still crashing?


The answer was in sys.dm_os_loaded_modules:

| Field | Pre-fix dump (May) | Recurrence dump (June, "after" upgrade) |
| --- | --- | --- |
| ctiuser.dll base address | 0x00007FFBDF9A0000 | 0x00007FFBDF9A0000 |
| ctiuser.dll end address | 0x00007FFBDFD08FFF | 0x00007FFBDFD08FFF |
| ctiuser.dll size | 0x00369000 (3,575,808 bytes) | 0x00369000 (3,575,808 bytes) |

The base address, end address, and size were identical. A genuine sensor-version replacement almost always changes the file size (different code, different optional features, different bundled libraries). A replaced DLL is also a different image file, so under ASLR it would be expected to map at a different base address once loaded into a fresh process. The unchanged base and size therefore meant the sqlservr.exe process had not restarted since before the upgrade.

SQL Server uptime on the crashing host: 253 hours. The original 2026-05-25 bypass-and-restart had landed cleanly, but no subsequent restart had happened on that server, and the Carbon Black sensor upgrade rolled through while the SQL Server process was still running. The new DLL was on disk, but the running process kept using the old in-memory mapping. Windows does not hot-swap loaded DLLs.
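Both halves of that reasoning reduce to simple arithmetic. The sketch below uses the addresses from the two dumps and the 253-hour uptime; the observation time and the sensor upgrade time are hypothetical, since the text only says the upgrade happened after the bypass restart.

```python
from datetime import datetime, timedelta

# Addresses as recorded in the pre-fix and recurrence dumps.
pre_fix = {"base": 0x00007FFBDF9A0000, "end": 0x00007FFBDFD08FFF}
recurrence = {"base": 0x00007FFBDF9A0000, "end": 0x00007FFBDFD08FFF}


def image_size(module):
    """Mapped size in bytes; the end address is inclusive."""
    return module["end"] - module["base"] + 1


same_mapping = pre_fix == recurrence
size = image_size(recurrence)

# Uptime is from the text; both timestamps below are assumed for illustration.
observed_at = datetime(2026, 6, 5, 12, 0)
sensor_upgraded_at = datetime(2026, 5, 28, 9, 0)
process_start = observed_at - timedelta(hours=253)
started_before_upgrade = process_start < sensor_upgraded_at

print(f"size: {size:#x} ({size:,} bytes), same mapping: {same_mapping}")
print(f"process started {process_start}, before upgrade: {started_before_upgrade}")
```

A process start time that precedes the upgrade is enough on its own to explain why the old DLL was still in memory.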

Indirect confirmation of the diagnosis: a peer server that had been restarted after the sensor upgrade (it was patched to the new SQL CU two days earlier, which forced a service restart as part of the install) had been stable for ~38 hours at this point and remained stable.


Fix: roll the pending CU on the recurring server

The recurring server had a CU upgrade waiting in the rolling-patch queue anyway. The CU installer stops SQL Server, replaces binaries, and restarts the service — which is exactly what was needed to drop the old ctiuser.dll mapping and load the new sensor DLL at startup. One operation closed both the CU mismatch with its peer and the EDR remediation gap.

After the patch and restart, sys.dm_os_loaded_modules showed the expected change: different base_address (ASLR-picked) and the new file_version. Validation resumed.


What we would have caught earlier with the right verification

The original rollout plan checked the Carbon Black agent version on disk after the upgrade. That check passed on every host, including the one that recurred. The check that would have caught the gap is the one against the live process — sys.dm_os_loaded_modules — which tells you what is actually mapped into sqlservr.exe, not what the file system has. We've added that as an explicit verification step in our EDR-upgrade procedure for any host running SQL Server.
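As a sketch of what that verification step could look like, the function below classifies a host by comparing the on-disk version with the version reported for the live process. The host names and the "new" version string are placeholders, and the code assumes both version values have already been normalized to the same format.

```python
from dataclasses import dataclass


@dataclass
class HostCheck:
    host: str
    disk_version: str    # from the file system or the vendor console
    loaded_version: str  # file_version from sys.dm_os_loaded_modules


def upgrade_status(check, target_version):
    """Classify a host as not installed, restart pending, or complete."""
    if check.disk_version != target_version:
        return "not installed"
    if check.loaded_version != target_version:
        return "restart pending"
    return "complete"


TARGET = "new"  # placeholder for the validated sensor version
checks = [
    HostCheck("host-a", disk_version="new", loaded_version="new"),
    HostCheck("host-b", disk_version="new", loaded_version="4.0.3.2029"),
    HostCheck("host-c", disk_version="4.0.3.2029", loaded_version="4.0.3.2029"),
]

for c in checks:
    print(c.host, upgrade_status(c, TARGET))
```

The middle state is the one the disk-only check could not see: the file is upgraded but the running process still holds the old mapping.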


If a crash does recur on any host where the live-process check confirms the new DLL is loaded, the next escalation step is a full memory dump (SQLDUMPER_FULL_DUMP=1 as a service environment variable, or DBCC TRACEON(2544, -1)) to identify exactly what object lives at the recurring corrupt address. We haven't needed it yet.


Full Timeline

| Date | Event |
| --- | --- |
| Mar 2, 2026 | First isolated AV on a Cluster A primary (spid 469) |
| Mar 7, 2026 | First storm on Cluster A primary: ~30 spids in 2 minutes |
| Mar 13, 2026 | Cluster B secondary patched to CU24-GDR (KB5083252) |
| Apr 9, 2026 | Multiple AVs on Cluster A primary, different stack address pattern |
| Apr 26, 2026 | Cluster A secondary patched to CU24-GDR (KB5089900) |
| May 1–3, 2026 | Isolated AVs continue on Cluster A primary |
| May 10, 2026 | First AV on Cluster A secondary, 14 days after CU24-GDR patch |
| May 16, 2026 | Major storm on Cluster B secondary: ~50 spids in 90 seconds, both AGs failed over |
| May 24, 2026 | Second storm on Cluster A primary: ~20 spids reading the same corrupt address |
| May 25–26, 2026 | Cross-server comparison run against control server; OS version identified as sole differentiator. Sensor version upgraded; 14-day monitoring window begins |
| Jun 3, 2026 | Peer server (Cluster A secondary) patched to the newer SQL CU; service restart picked up the new sensor DLL |
| Jun 5, 2026 | Recurrence on Cluster A primary: 4 AVs in 33 seconds. Diagnosed: same in-memory ctiuser.dll (identical base address and size) because the SQL service had not been restarted since the sensor upgrade rolled through. Fix: roll the pending CU patch to force the restart. |

Key Takeaways

  1. Always look at a peer that isn't broken. We spent weeks examining the crashing servers in detail. The one query that actually solved it took 30 seconds to write and 5 seconds to run, against a server in the same fleet that wasn't crashing. If you're stuck on a recurring issue and you have a control group somewhere in your estate, use it before anything else.

  2. Module list in a crash dump is signal, but not proof of cause. ctiuser.dll sat in every one of our dumps from day one. That made it suspicious. But "loaded in the process" and "caused the crash" aren't the same thing — the dump module list will show dozens of third-party DLLs that have nothing to do with the failure. We needed the comparison-with-control step to promote a suspicion into a hypothesis.

  3. Input buffer at crash time is base-rate biased. Whatever query is running every minute will be in someone's input buffer when the crash happens. Don't treat that as a cause — treat it as the unlucky witness.

  4. Patching is necessary but rarely sufficient. Both of our patched servers crashed after the patch. Don't let "they're on the latest CU" close the case.

  5. Memory pressure is usually a symptom, not a cause. A server at 96% MemoryLoad is telling you something is eating memory inside sqlservr.exe outside the buffer pool. If your max server memory is already tuned and Lock Pages in Memory is enabled, the answer isn't to lower the cap further — it's to find what's growing on top of the cap. In our case it was the EDR agent's working set plus its hooks, but it could just as easily be CLR, an in-process linked-server provider, or extended stored procedures.

  6. EDR sensors are load-bearing third-party code inside your SQL Server process. Treat them with the same change-management rigor you'd apply to a SQL Server CU. Confirm the sensor version is validated against your specific OS build — not just "Windows Server" or "Windows Server 2025 supported." Build numbers matter.

  7. When you find the fix, capture the evidence trail. If you don't write down which servers crashed, which didn't, and what the single difference was, you'll have a hard time getting your security team to approve the sensor upgrade or the process bypass. We kept a one-page summary of the comparison results — that did more for the policy conversation than any of the dump files.

  8. An EDR sensor upgrade is not complete until every injected process has restarted. Disk-level file replacement is a no-op for processes that already have the old DLL mapped, because Windows does not hot-swap loaded DLLs. If your sensor is injecting into long-running services (SQL Server, IIS, message brokers), the upgrade procedure must include a follow-on service restart on every host. We learned this the hard way when one host recurred 11 days after the "upgrade" because the SQL Server process had not restarted since the sensor upgrade.

  9. Verify on-disk and in-memory separately. (Get-Item ctiuser.dll).VersionInfo and the EDR vendor console both tell you the file-system state. sys.dm_os_loaded_modules (or Process Explorer against sqlservr.exe) tells you what is actually mapped into the running process. For anything injected, only the in-memory check is authoritative. Compare base_address and file_version across a known-restart boundary: a replaced DLL loaded into a fresh process will almost always map at a different address and report a new version, so an unchanged base address and version are a smoking gun that no restart happened.

  10. A scheduled CU rollout is a useful forcing function. When in-process components (EDR hook DLLs, ODBC drivers, CLR assemblies) need to take effect on a long-running SQL service, batching the EDR refresh into the next CU window combines the necessary restart with planned maintenance and avoids a second outage. We used the pending CU patch on our recurring host to force the EDR DLL refresh; one operation, two gaps closed.


Quick Test (If You're Seeing Something Similar)

If you're on SQL Server 2022, on Windows Server 2025 (Build 26100), with any third-party EDR injected into sqlservr.exe, and you're seeing recurring EXCEPTION_ACCESS_VIOLATION crashes with stack traces ending in ntdll heap routines, run this on a crashing server and a non-crashing peer:

SELECT @@SERVERNAME, @@VERSION;

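-- Non-Microsoft modules loaded from System32 into sqlservr.exe
SELECT name, file_version, company, description
FROM sys.dm_os_loaded_modules
WHERE name LIKE '%\System32\%'
  -- ISNULL keeps modules that carry no company string at all
  AND ISNULL(company, '') NOT LIKE 'Microsoft%'
  AND name NOT LIKE '%\Microsoft SQL Server\%'
ORDER BY name;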

If you see an EDR DLL in the list (Carbon Black ctiuser.dll, CrowdStrike CSAgent.sys companion DLLs, SentinelOne SentinelHelperService.exe companion DLLs, Trend Micro tmcomm.dll, etc.) on both servers but only one is crashing, check the OS build via Get-ComputerInfo OSBuildNumber. If the crashing server is on 26100 and the non-crashing one is on an older build, you're likely hitting the same incompatibility we did.
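Once both result sets are exported, the comparison described above can be scripted. This is a rough sketch; the rows are hypothetical sample data, the function names are invented, and the build threshold simply encodes the 26100-versus-older pattern from this incident.

```python
def shared_third_party_modules(crashing_rows, peer_rows):
    """File names of non-Microsoft modules present on both servers."""
    def third_party(rows):
        return {
            r["name"].rsplit("\\", 1)[-1].lower()
            for r in rows
            # A missing company string is treated as non-Microsoft.
            if not (r.get("company") or "").startswith("Microsoft")
        }
    return third_party(crashing_rows) & third_party(peer_rows)


def matches_pattern(crashing_build, peer_build, shared, suspect_build=26100):
    """Same injected module on both hosts, but only the crashing one on 26100."""
    return bool(shared) and crashing_build == suspect_build and peer_build < suspect_build


# Hypothetical rows shaped like the query output above.
crashing = [
    {"name": r"C:\Windows\System32\ctiuser.dll", "company": None},
    {"name": r"C:\Windows\System32\ntdll.dll", "company": "Microsoft Corporation"},
]
peer = [
    {"name": r"C:\Windows\System32\ctiuser.dll", "company": None},
]

shared = shared_third_party_modules(crashing, peer)
print(shared, matches_pattern(26100, 20348, shared))
```

A True result only says the hosts fit the same pattern; the vendor still has to confirm which sensor versions are validated for the build.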


Immediate mitigation: Upgrade the EDR sensor to its latest WS2025-compatible version and restart the SQL Server service so the new DLL is actually loaded. Durable fix: confirm with your EDR vendor the minimum sensor version validated against Build 26100, and roll it out.


This post is based on a real production incident, resolved by cross-server comparison after several earlier hypotheses were ruled out. Server identifiers, region names, and customer-specific details have been anonymized.



© 2025 by Renz Bagasbas. All rights reserved.
