SQL Server 2022 Enterprise Repeated Engine Crashes (Access Violation Exceptions)

Feb 27
Updated: Mar 12

In late October 2025, our production SQL Server 2022 Enterprise cluster running on AWS EC2 started crashing repeatedly with Access Violation (0xc0000005) exceptions. This wasn't just any database going down — this cluster is the heart of the business. It powers the core platform that our customers depend on every day: real-time communications services, call routing, messaging, and the APIs that integrate with everything downstream. When this database cluster goes offline, the entire production pipeline stalls — customer-facing services degrade, revenue stops flowing, and every second of downtime counts.

What followed was a five-week investigation involving AWS Enterprise Support and Microsoft SQL Server Engineering — ultimately traced to a confirmed engine-level bug. This is the full story, including the workaround that stabilized our environment and the official bug IDs for tracking.

The Environment

SQL Server 2022 Enterprise Edition (16.x), initially on CU21
Windows Server 2022 Datacenter on AWS EC2 (r6idn.4xlarge)
AlwaysOn Availability Group with 2 nodes
Deployed from the official AWS EC2 SQL Server Enterprise AMI
Production workload serving the core business platform — customer-facing communications, real-time APIs, and reporting services
Trace flags enabled: 174, 1448, 1800, 2610, 3226, 11042
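
When opening a support case, it helps to capture the build and trace flags exactly as the engine reports them rather than from memory. A minimal sketch, using only built-in functions and DBCC commands; the example values in the comments are illustrative, not output from our servers:

```sql
-- Report edition, build number, and servicing level as the engine sees them.
SELECT
    SERVERPROPERTY('Edition')            AS edition,
    SERVERPROPERTY('ProductVersion')     AS product_version,  -- 16.0.x for SQL Server 2022
    SERVERPROPERTY('ProductLevel')       AS product_level,    -- e.g. RTM
    SERVERPROPERTY('ProductUpdateLevel') AS update_level;     -- e.g. CU21

-- List the trace flags that are currently enabled globally.
DBCC TRACESTATUS (-1);
```

Run it on every AG node, since replicas can drift apart in both build and startup parameters.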

The Symptoms

Starting 22 October 2025, the SQL Server engine process began terminating unexpectedly on the primary AG node. The crash dumps and ERRORLOG entries all pointed to the same pattern:

SqlDumpExceptionHandler: Process generated fatal exception c0000005
EXCEPTION_ACCESS_VIOLATION writing address 0x0000012F418CD070
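
If you want to check your own instance for the same message, the error log can be searched from T-SQL. This is a rough sketch, assuming you have the permissions sp_readerrorlog requires and that an archived log file number 1 exists:

```sql
-- Search the current SQL Server error log (log 0, type 1 = SQL Server log)
-- for the dump handler message shown above.
EXEC sp_readerrorlog 0, 1, N'SqlDumpExceptionHandler';

-- Repeat against the previous log file, because a service restart
-- cycles the error log and the crash entries may have rolled over.
EXEC sp_readerrorlog 1, 1, N'SqlDumpExceptionHandler';
```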

Key observations:

Crashes were hitting multiple worker threads (spids 44, 30, 29, 199, etc.)
Stack signatures were consistent across every dump — always the same Access Violation pattern
After the initial crash in an episode, SQL Server kept generating dumps roughly every 5 seconds (about 318 in one episode) until the service was manually restarted
The instance became effectively hung during these episodes — even basic AG operations like ALTER DATABASE ... SET HADR SUSPEND would block
Crash dumps consistently pointed to SQL Server engine code, not user or external modules
The repeated terminations were triggering automatic AG failovers, directly impacting production availability and revenue
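
Since each termination could trigger a failover, it is worth checking which replica currently holds the primary role and how each one is doing. A sketch along these lines, using the standard AG catalog views and DMVs, should work on either node, though a secondary only reports full state for itself:

```sql
-- Show each replica's current role and health for every availability group.
SELECT
    ag.name                          AS ag_name,
    ar.replica_server_name,
    rs.role_desc,                    -- PRIMARY / SECONDARY / RESOLVING
    rs.operational_state_desc,
    rs.connected_state_desc,
    rs.synchronization_health_desc
FROM sys.availability_groups AS ag
JOIN sys.availability_replicas AS ar
    ON ar.group_id = ag.group_id
JOIN sys.dm_hadr_availability_replica_states AS rs
    ON rs.replica_id = ar.replica_id
ORDER BY ag.name, ar.replica_server_name;
```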

The very first occurrence was notable: it happened approximately 12 seconds after we decreased the max server memory setting from 105 GB to 95 GB. Later crash events, however, occurred on their own with no configuration changes, ruling out that specific change as the root cause.
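
To correlate a crash with a memory setting change, it can help to record the configured and in-use values alongside what the engine has actually committed. A small sketch, assuming nothing beyond the standard catalog views:

```sql
-- Configured vs. running value of max server memory (in MB).
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = N'max server memory (MB)';

-- Memory the engine has currently committed and its target, in MB.
SELECT
    committed_kb / 1024        AS committed_mb,
    committed_target_kb / 1024 AS committed_target_mb
FROM sys.dm_os_sys_info;
```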

What We Tried (and What Didn't Work)

Our team went through a systematic series of troubleshooting steps before arriving at a resolution. Here's what we tried and what we ruled out:

1. CU Downgrade — CU21 back to CU18

Our first instinct was to suspect a regression in a recent Cumulative Update. We rolled back from CU21 to the previously stable CU18 and applied the latest Windows updates to bring the OS fully current. The crashes continued, which told us the issue was not introduced by CU21 specifically.

2. Trace Flag 11042 (SESSION_CONTEXT)

We enabled TF 11042 as a precaution based on a known SESSION_CONTEXT issue documented by Microsoft. No impact on the crashes.

3. AWS Driver Updates

AWS Support identified that the NVMe and ENA drivers on our EC2 instances were outdated. We updated all drivers to the latest available versions:

AWS NVMe drivers → updated to 1.7.0
Amazon ENA drivers → updated to 2.11.0

The crashes continued after the driver updates. AWS also confirmed via internal health checks that the underlying hardware was healthy with no system status failures in the preceding weeks.

4. Upgraded Back to CU21 (then CU22)

After the CU18 downgrade didn't help, we upgraded back to CU21, and later to CU22 when it was released. The same Access Violation pattern persisted on the latest public build. This confirmed we were dealing with something deeper in the engine.

5. Trace Flag 12836

Microsoft's initial analysis of our dump files matched a similar known issue that had been resolved by enabling Trace Flag 12836. We applied it:

Added -T12836 as a startup parameter in SQL Server Configuration Manager
Also activated it on the running instance using DBCC TRACEON (12836, -1)
Validated it was active on both AG nodes via DBCC TRACESTATUS
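
The three steps above can be sketched in T-SQL roughly as follows. The startup parameter itself still has to be added in SQL Server Configuration Manager; the last query only reads it back, on the assumption that startup arguments show up as SQLArg entries in sys.dm_server_registry:

```sql
-- Enable the flag globally on the running instance (does not survive a restart).
DBCC TRACEON (12836, -1);

-- Confirm it is active globally; run this on every AG node.
DBCC TRACESTATUS (12836, -1);

-- Read back the startup parameters to confirm -T12836 is registered,
-- so the flag persists across service restarts.
SELECT value_name, value_data
FROM sys.dm_server_registry
WHERE value_name LIKE N'SQLArg%';
```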

Unfortunately, a crash occurred just 30 minutes after enabling the trace flag. The stack signature was identical. TF 12836 alone did not resolve the issue.

The Breakthrough

After further escalation within Microsoft, the engineering team came back with a more targeted analysis. They determined that our dump signatures, while similar to the earlier match, pointed to a known issue involving sqlmin!ListBase::Delete. The recommended fix involved two actions:

Action 1: Disable BATCH_MODE_ON_ROWSTORE

-- Run in the context of each user database; the setting applies only to the current database
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = OFF;

This was the key change. Batch Mode on Rowstore is a performance feature introduced in SQL Server 2019 that allows batch mode processing for analytical queries even on traditional rowstore tables. It turns out that under certain conditions, this feature was triggering the Access Violation.

Our workload included ad-hoc aggregation queries from reporting clients that occasionally requested large memory grants. These queries were benefiting from BATCH_MODE_ON_ROWSTORE, but the feature was also the source of the crash. Disabling it meant a potential performance trade-off on those specific queries, but it eliminated the engine crash entirely.
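
Because the setting is database-scoped, applying it everywhere means running the statement once per user database. One way to do that is to generate the script first and review it before executing. This is only a sketch: it assumes database IDs 1 to 4 are the system databases, that you run it on the primary replica (AG secondaries are not writable), and that you exclude anything like SSISDB or a distribution database yourself.

```sql
DECLARE @sql nvarchar(max);

-- Build one "USE <db>; ALTER ..." line per online, writable user database.
SELECT @sql = STRING_AGG(
           CAST(N'USE ' + QUOTENAME(name)
                + N'; ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = OFF;'
                AS nvarchar(max)),
           NCHAR(13) + NCHAR(10))
FROM sys.databases
WHERE database_id > 4            -- assumption: IDs 1-4 are the system databases
  AND state_desc = N'ONLINE'
  AND is_read_only = 0;

-- Review the generated script first (PRINT truncates very long strings).
PRINT @sql;

-- Uncomment to apply once the script looks right.
-- EXEC sys.sp_executesql @sql;
```

Generating and printing before executing keeps the change auditable, which matters when you are already in the middle of an incident.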

Action 2: Verify No Extended Events Collecting additional_memory_grant

Microsoft also asked us to confirm that no Extended Events sessions were capturing the additional_memory_grant event, as this could contribute to the issue. We verified that none were active.
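
A check like this can be done against both the defined sessions and the ones currently running. A minimal sketch using the standard Extended Events catalog views and DMVs; an empty result from both queries means no session references the event:

```sql
-- Defined sessions (started or not) that include the event.
SELECT s.name AS session_name, e.package, e.name AS event_name
FROM sys.server_event_sessions AS s
JOIN sys.server_event_session_events AS e
    ON e.event_session_id = s.event_session_id
WHERE e.name = N'additional_memory_grant';

-- Sessions that are running right now and collecting the event.
SELECT s.name AS session_name, e.event_name
FROM sys.dm_xe_sessions AS s
JOIN sys.dm_xe_session_events AS e
    ON e.event_session_address = s.address
WHERE e.event_name = N'additional_memory_grant';
```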

Stabilization

After applying BATCH_MODE_ON_ROWSTORE = OFF on all user databases, we monitored the environment closely. Seven days passed with zero Access Violation exceptions. The SQL Server cluster was fully stable and operational.
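
One simple way to watch for regressions during a monitoring window like this is to count the memory dumps the engine has written since the change. A rough sketch: the cutoff variable is a hypothetical placeholder you would set to your own change time, and the DMV only lists dump files still present in the dump directory.

```sql
-- Hypothetical cutoff: set this to when the workaround was applied.
DECLARE @workaround_applied datetime2 = '2025-11-27T00:00:00';

-- Count dumps written since the cutoff and show the most recent one.
-- creation_time is a datetimeoffset, so treat the comparison as
-- approximate unless you account for the server's time zone.
SELECT
    COUNT(*)           AS dumps_since_workaround,
    MAX(creation_time) AS most_recent_dump
FROM sys.dm_server_memory_dumps
WHERE creation_time >= @workaround_applied;
```

A count that stays at zero is what you want to see; anything else is worth checking against the error log.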

We confirmed the issue was resolved and proceeded with case closure, with a plan to revert the workaround (re-enable BATCH_MODE_ON_ROWSTORE and remove TF 12836) once Microsoft releases the official fix in a future Cumulative Update.
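
When the time comes, the revert could look roughly like the sketch below. It assumes you have first confirmed from the CU release notes that your build contains the fixes; the statements are the inverse of the ones used for the workaround.

```sql
-- 1. Check the servicing level before reverting anything.
SELECT SERVERPROPERTY('ProductUpdateLevel') AS update_level;

-- 2. Re-enable batch mode on rowstore; run in the context of each user database.
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = ON;

-- 3. Turn the trace flag off on the running instance. Removing -T12836 from
--    the startup parameters is a separate step in SQL Server Configuration Manager.
DBCC TRACEOFF (12836, -1);
```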

Microsoft Bug IDs

Microsoft confirmed two bugs associated with this issue:

Bug ID	Component	Details
3653653	Trace Flag 12836	Originally implemented as a fix for assertion dumps. The product group (PG) is considering enabling it by default. Expected to be on by default in SQL Server 2022 CU24.
4868207	BATCH_MODE_ON_ROWSTORE	Access Violation issue. Changes from SQL Server 2025 need to be ported back to SQL Server 2022. Expected in CU24, possibly CU23.

Full Timeline

Date	Event
Oct 22, 2025	First Access Violation crash on primary AG node
Early Nov	CU downgrade to CU18 — crashes persist
Nov 7, 2025	AWS support case opened with full environment details, dumps, and logs
Nov 8–11	AWS investigates — driver updates applied, EC2rescue logs collected, hardware confirmed healthy
Nov 11	Another crash on secondary node — ~318 recurring dumps generated
Nov 12	AWS escalates to Microsoft Engineering
Nov 13	Microsoft recommends Trace Flag 12836
Nov 14	TF 12836 enabled — crash recurs 30 minutes later
Nov 15–20	Upgraded to CU22, crashes continue; issue escalated to Microsoft product group
Nov 22	Microsoft suspects memory corruption / heap memory issues based on dump analysis
Nov 27	Microsoft provides targeted fix: disable BATCH_MODE_ON_ROWSTORE
Nov 27	Workaround applied on all user databases
Dec 4, 2025	7 days stable — zero new Access Violation exceptions
Dec 10, 2025	Microsoft confirms Bug IDs 3653653 and 4868207 — fixes targeting CU23/CU24
Dec 12, 2025	Case closed

Key Takeaways

Crash dumps are your best friend. Consistent stack signatures across multiple dumps made it possible for Microsoft to identify the root cause. Always collect and preserve them.
CU downgrades don't always help. When the bug exists across multiple builds, rolling back wastes time. It's still worth trying as a first step, but don't get stuck on it.
Don't overlook database-scoped configurations. Features like BATCH_MODE_ON_ROWSTORE are enabled by default and can silently cause issues at the engine level — especially under specific memory grant patterns.
AWS-to-Microsoft escalation takes time. Because SQL Server on EC2 runs under AWS's license, support requests to Microsoft must flow through AWS. This adds communication overhead. For future cases, plan for multi-week resolution timelines on complex engine bugs.
AG resilience has limits. While AlwaysOn AG provided failover capability, the crashes were occurring on both nodes (primary and secondary). A two-node AG doesn't help when the bug is in the engine itself and both replicas are running the same build.
Document everything in real-time. Keeping a detailed log of every action, every dump upload, and every configuration change made it possible to communicate clearly with multiple support engineers across shift changes on both the AWS and Microsoft sides.
Track the bug IDs. If you're on SQL Server 2022 and experiencing similar Access Violation crashes, check Microsoft's CU release notes for Bug IDs 3653653 and 4868207. The fixes are expected in CU23 or CU24.

Immediate Workaround (If You're Experiencing This)

If you're seeing repeated EXCEPTION_ACCESS_VIOLATION (0xc0000005) crashes on SQL Server 2022 with stack traces in engine code that include sqlmin!ListBase::Delete, try the following:

-- Run on each user database
ALTER DATABASE SCOPED CONFIGURATION SET BATCH_MODE_ON_ROWSTORE = OFF;
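
To confirm the setting took effect, you can read it back from the catalog. A minimal sketch; like the ALTER statement, it reports on the current database only, so run it in each user database:

```sql
-- value = 0 means the feature is OFF for the current database.
SELECT
    DB_NAME() AS database_name,
    name,
    value,
    value_for_secondary
FROM sys.database_scoped_configurations
WHERE name = N'BATCH_MODE_ON_ROWSTORE';
```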

Monitor for 7 days. If the crashes stop, you've likely hit the same bug. Keep the workaround in place until Microsoft releases the official fix in a future CU, then revert to default behavior.

This post was written based on a real production incident resolved through collaboration with AWS Enterprise Support and Microsoft SQL Server Engineering. All environment-specific identifiers have been anonymized.