Introduction
The SQL Server Error 19407: The lease between availability group and the Windows Server Failover Cluster has expired indicates a communication failure between SQL Server and the Windows Server Failover Cluster (WSFC), causing an Always On Availability Group (AG) to transition to a resolving state or fail over. This error can disrupt critical applications, such as financial or e-commerce systems, by taking databases offline temporarily. This guide explores the causes of Error 19407, diagnostic steps, and actionable solutions, including scripts and real-world scenarios, to restore AG stability and prevent future occurrences.
What Causes SQL Server Error 19407?
The Error 19407 occurs when the lease mechanism, a heartbeat between SQL Server and the WSFC, fails to renew within the default 20-second timeout period. Common causes include:
High CPU Usage: CPU utilization near 100%, preventing the lease thread from responding.
Memory Pressure: Low virtual or physical memory causing SQL Server or WSFC processes (e.g., RHS.exe) to be paged out.
I/O Bottlenecks: Slow disk I/O, often from maintenance tasks like backups, causing delays in lease renewal.
Memory Dumps: SQL Server generating diagnostic dump files (e.g., due to deadlocked schedulers or access violations), freezing the lease thread.
Virtual Machine Issues: VM migrations (e.g., Hyper-V Live Migration, VMware vMotion) or backups suspending the VM, disrupting communication.
Network Issues: Connectivity problems between SQL Server and WSFC, such as network latency or quorum loss.
Cluster Misconfiguration: Incorrect quorum settings or resource monitor issues affecting AG health checks.
Example Error Message:
2025-09-02 10:00:00.00 Server Error: 19407, Severity: 16, State: 1.
The lease between availability group 'AG_Name' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster.
Real-Life Scenario: In a financial database, the AG FinanceAG goes offline with Error 19407 during a nightly backup, causing application downtime. The DBA suspects high I/O load or CPU pressure.
Step 1: Diagnosing Error 19407
To resolve the error, identify the root cause by analyzing logs and system metrics.
1.1 Check SQL Server Error Log
Review the SQL Server error log for Error 19407 and related messages (e.g., Error 19419 or dump generation).
Steps:
Open SQL Server Management Studio (SSMS).
Navigate to Management > SQL Server Logs.
Look for Error 19407 and preceding events:
SELECT * FROM sys.fn_get_audit_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*', DEFAULT, DEFAULT)
WHERE event_time >= '2025-09-02 00:00:00'
AND error = 19407;
Sample Output:
2025-09-02 10:00:00.00 Server Error: 19407, Severity: 16, State: 1.
The lease between availability group 'FinanceAG' and the Windows Server Failover Cluster has expired.
2025-09-02 09:59:50.00 Server **Dump thread - spid = 0, EC = 0x0000000000000000
Observation: A memory dump preceded the lease timeout, indicating a potential SQL Server issue.
1.2 Check Cluster Log
Export and analyze the WSFC log to correlate with SQL Server events.
PowerShell Command:
Get-ClusterLog -Node SQLNode1 -Destination C:\Logs\Cluster.log
Sample Output:
2025/09/02-10:00:00.587 INFO [RCM] rcm::RcmApi::FailResource: (FinanceAG)
2025/09/02-10:00:00.588 ERR [RES] SQL Server Availability Group <FinanceAG>: [hadrag] Lease Thread terminated
Observation: The cluster log confirms the lease expired, with a 10-second delay after the expected renewal.
1.3 Monitor System Resources
Check CPU, memory, and I/O metrics during the error time.
CPU Usage (PowerShell):
Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 30 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp, CookedValue
I/O Waits (T-SQL):
SELECT
wait_type,
wait_time_ms,
wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM
sys.dm_os_wait_stats
WHERE
wait_type IN ('PREEMPTIVE_HADR_LEASE_MECHANISM', 'PREEMPTIVE_SP_SERVER_DIAGNOSTICS')
AND wait_time_ms > 0;
Sample Output:
wait_type wait_time_ms avg_wait_ms
------------------------------- ------------ -----------
PREEMPTIVE_HADR_LEASE_MECHANISM 526000 526000
PREEMPTIVE_SP_SERVER_DIAGNOSTICS 526000 526000
Observation: High wait times for lease-related tasks suggest system resource contention.
Real-Life Finding: In the financial database, high CPU usage (95%) and I/O waits during the backup window correlate with the lease timeout.
Step 2: Resolving Error 19407
Based on the diagnosis, apply these solutions to restore AG functionality.
2.1 Address High CPU Usage
High CPU can prevent the lease thread from responding.
Steps:
Identify high-CPU processes:
Get-Process | Sort-Object CPU -Descending | Select-Object -First 5
If SQL Server is the culprit, analyze queries:
SELECT
spid,
cpu,
sql_text = SUBSTRING(t.text, (r.statement_start_offset/2)+1,
((CASE r.statement_end_offset
WHEN -1 THEN DATALENGTH(t.text)
ELSE r.statement_end_offset
END - r.statement_start_offset)/2) + 1)
FROM
sys.sysprocesses p
CROSS APPLY sys.dm_exec_sql_text(p.sql_handle) t
WHERE
cpu > 1000
ORDER BY
cpu DESC;
Optimize or kill problematic queries:
KILL 123; -- Replace 123 with problematic SPID
Pros:
Reduces CPU contention, restoring lease renewal.
Immediate performance improvement.
Cons:
Killing sessions may disrupt applications.
Requires query tuning for long-term fix.
Alternative: Increase CPU resources or adjust MAXDOP to limit query parallelism:
EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;
2.2 Mitigate Memory Pressure
Low memory can cause paging, delaying lease renewal.
Steps:
Check memory usage:
Get-Counter -Counter "\Memory\Available MBytes" -SampleInterval 5 -MaxSamples 30
Adjust SQL Server max memory:
EXEC sp_configure 'max server memory', 8192; -- Set to 8GB, adjust as needed
RECONFIGURE;
Pros:
Prevents paging of SQL Server or WSFC processes.
Improves overall system stability.
Cons:
Requires sufficient physical memory.
May need application downtime for reconfiguration.
2.3 Optimize I/O Performance
Slow I/O, often from backups or maintenance, can delay lease renewal.
Steps:
Identify I/O bottlenecks:
SELECT
database_name,
file_type,
io_stall_ms / NULLIF(num_of_reads + num_of_writes, 0) AS avg_io_stall_ms
FROM
sys.dm_io_virtual_file_stats(NULL, NULL)
WHERE
num_of_reads + num_of_writes > 0;
Reschedule maintenance tasks (e.g., backups, index rebuilds) to off-peak hours:
USE msdb;
EXEC sp_update_job
@job_name = 'NightlyBackup',
@new_name = 'NightlyBackup',
@enabled = 1,
@schedule_name = 'OffPeakSchedule'; -- Adjust to off-peak schedule
Pros:
Reduces I/O contention during peak hours.
Minimizes lease timeout risk.
Cons:
Requires coordination with maintenance schedules.
May need storage upgrades for persistent issues.
Real-Life Fix: In the financial database, rescheduling backups to 3 AM reduced I/O waits, preventing lease timeouts.
2.4 Handle Memory Dumps
Memory dumps freeze SQL Server, causing lease timeouts.
Steps:
Check for dump files:
Get-ChildItem -Path "C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\LOG" -Filter "*.mdmp"
Analyze dump causes:
SELECT
event_type,
event_time,
message
FROM
sys.fn_xe_file_target_read_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*.xel', NULL, NULL, NULL)
WHERE
event_type LIKE '%dump%';
Address root causes (e.g., deadlocked schedulers) by optimizing queries or applying SQL Server patches.
Pros:
Resolves underlying SQL Server issues.
Prevents recurring dumps.
Cons:
Dump analysis may require Microsoft support.
Patching may involve downtime.
2.5 Fix Virtual Machine Issues
VM migrations or backups can suspend SQL Server, causing timeouts.
Steps:
Check for VM migration events:
Get-WinEvent -LogName "Microsoft-Windows-Hyper-V-VMMS" | Where-Object { $_.Id -eq 20000 } | Select-Object TimeCreated, Message
Coordinate with VM admins to avoid migrations during peak hours.
Disable VM backups during critical windows:
Set-VM -Name SQLServerVM -AutomaticCheckpointsEnabled $false
Pros:
Prevents VM-related lease timeouts.
Improves AG reliability.
Cons:
Requires VM admin coordination.
May delay backup schedules.
2.6 Adjust Lease Timeout Settings
Increase the lease timeout to make AGs more resilient to transient issues.
Steps:
Check current lease timeout:
Get-ClusterResource -Name "FinanceAG" | Get-ClusterParameter
Increase lease timeout to 40 seconds:
Get-ClusterResource -Name "FinanceAG" | Set-ClusterParameter -Name LeaseTimeout -Value 40000
Apply changes (no restart needed):
Stop-ClusterResource -Name "FinanceAG"
Start-ClusterResource -Name "FinanceAG"
Increase HealthCheckTimeout:
ALTER AVAILABILITY GROUP FinanceAG MODIFY REPLICA ON 'SQLNode1' WITH (FAILURE_CONDITION_LEVEL = 3, HEALTH_CHECK_TIMEOUT = 60000);
Pros:
Reduces false positives for transient issues.
No node restart required.
Cons:
Delays detection of real failures.
May mask underlying issues.
Step 3: Preventing Error 19407
Prevent future lease timeouts with these strategies.
3.1 Monitor System Resources
Implement monitoring to detect CPU, memory, and I/O issues early.
T-SQL Script:
SELECT
cpu_count,
physical_memory_kb / 1024 AS memory_mb,
(SELECT AVG(CookedValue) FROM (Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 10 | Select-Object -ExpandProperty CounterSamples) AS CPU) AS avg_cpu_percent
FROM
sys.dm_os_sys_info;
PowerShell Script:
$monitor = {
$cpu = Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 1 | Select-Object -ExpandProperty CounterSamples | Select-Object CookedValue
if ($cpu.CookedValue -gt 90) { Send-MailMessage -To "dba@company.com" -From "alert@sqlserver.com" -Subject "High CPU Alert" -Body "CPU usage exceeded 90%." -SmtpServer "smtp.company.com" }
}
Start-Job -ScriptBlock $monitor -Name "CPUMonitor"
Pros: Proactive issue detection. Cons: Requires monitoring setup and maintenance.
3.2 Optimize Maintenance Schedules
Schedule I/O-intensive tasks during low-traffic periods.
Example:
USE msdb;
EXEC sp_add_jobschedule
@job_name = 'NightlyBackup',
@name = 'OffPeakSchedule',
@freq_type = 4, -- Daily
@freq_interval = 1,
@active_start_time = 030000; -- 3:00 AM
3.3 Separate Resource Monitors
Run each AG on a separate Resource Monitor to isolate issues.
Steps:
Create a new Resource Monitor:
Add-ClusterResource -Name "FinanceAG_Monitor" -ResourceType "Resource Monitor"
Assign AG to the new monitor:
Set-ClusterResource -Name "FinanceAG" -Parameter ResourceMonitorName -Value "FinanceAG_Monitor"
Pros: Isolates AG failures. Cons: Increases cluster complexity.
3.4 Enable Quorum Witness
Add a file share or cloud witness to improve quorum stability.
Steps:
Configure a file share witness:
Set-ClusterQuorum -FileShareWitness "\\FileServer\Witness"
Verify quorum configuration:
Get-ClusterQuorum
Pros: Enhances failover reliability. Cons: Requires additional infrastructure.
Pros and Cons of Resolution Techniques
Technique | Pros | Cons |
---|---|---|
Address CPU Usage | Improves system performance | May disrupt sessions |
Mitigate Memory Pressure | Prevents paging issues | Requires sufficient memory |
Optimize I/O | Reduces contention | Needs schedule adjustments |
Handle Memory Dumps | Resolves SQL Server issues | May require Microsoft support |
Fix VM Issues | Prevents VM-related timeouts | Requires VM admin coordination |
Adjust Timeout Settings | Increases AG resilience | Delays real failure detection |
Best Practices and Standards
Monitor Resources: Use tools like SQL Server Agent alerts or System Center Operations Manager.
Schedule Maintenance: Perform backups and index rebuilds during off-peak hours.
Apply Patches: Keep SQL Server and Windows updated to avoid dump-related issues.
Use Quorum Witness: Ensure robust quorum configuration for failover stability.
Audit AG Health: Regularly check AG status:
SELECT
ag.name,
ar.replica_server_name,
ar.availability_mode_desc,
ar.failover_mode_desc
FROM
sys.availability_groups ag
JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id;
Document Changes: Log all configuration changes for troubleshooting.
Real-Life Case Study
Scenario: The FinanceAG AG in a financial database goes to resolving state with Error 19407 during nightly backups.
Actions:
Checked error log:
SELECT * FROM sys.fn_get_audit_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*', DEFAULT, DEFAULT)
WHERE error = 19407;
-- Output: High I/O waits detected
Analyzed I/O:
SELECT database_name, io_stall_ms / NULLIF(num_of_reads + num_of_writes, 0) AS avg_io_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
-- Output: High I/O stalls during backup
Rescheduled backups:
USE msdb;
EXEC sp_update_jobschedule
@job_name = 'NightlyBackup',
@name = 'OffPeakSchedule',
@active_start_time = 030000;
Increased lease timeout:
Get-ClusterResource -Name "FinanceAG" | Set-ClusterParameter -Name LeaseTimeout -Value 40000
Added a file share witness:
Set-ClusterQuorum -FileShareWitness "\\FileServer\Witness"
Result: Lease timeouts ceased, and AG stability was restored with no further downtime.
Conclusion
The SQL Server Error 19407: Always On Availability Group Lease Timeout is caused by resource contention, memory dumps, or VM issues disrupting the lease mechanism. By analyzing logs, optimizing system resources, adjusting timeout settings, and implementing preventive measures like quorum witnesses, you can resolve and prevent this error. Use the provided scripts to diagnose and fix issues, ensuring robust Always On Availability Group operations.
No comments:
Post a Comment
Thanks for your valuable comment...........
Md. Mominul Islam