Md Mominul Islam | Software and Data Enginnering | SQL Server, .NET, Power BI, Azure Blog

while(!(succeed=try()));

LinkedIn Portfolio Banner

Latest

Home Top Ad

Responsive Ads Here

Post Top Ad

Responsive Ads Here

Tuesday, September 2, 2025

Resolving SQL Server Error 19407: Always On Availability Group Lease Timeout Solutions

 


Introduction

The SQL Server Error 19407: The lease between availability group and the Windows Server Failover Cluster has expired indicates a communication failure between SQL Server and the Windows Server Failover Cluster (WSFC), causing an Always On Availability Group (AG) to transition to a resolving state or fail over. This error can disrupt critical applications, such as financial or e-commerce systems, by taking databases offline temporarily. This guide explores the causes of Error 19407, diagnostic steps, and actionable solutions, including scripts and real-world scenarios, to restore AG stability and prevent future occurrences.


What Causes SQL Server Error 19407?

The Error 19407 occurs when the lease mechanism, a heartbeat between SQL Server and the WSFC, fails to renew within the default 20-second timeout period. Common causes include:

  1. High CPU Usage: CPU utilization near 100%, preventing the lease thread from responding.

  2. Memory Pressure: Low virtual or physical memory causing SQL Server or WSFC processes (e.g., RHS.exe) to be paged out.

  3. I/O Bottlenecks: Slow disk I/O, often from maintenance tasks like backups, causing delays in lease renewal.

  4. Memory Dumps: SQL Server generating diagnostic dump files (e.g., due to deadlocked schedulers or access violations), freezing the lease thread.

  5. Virtual Machine Issues: VM migrations (e.g., Hyper-V Live Migration, VMware vMotion) or backups suspending the VM, disrupting communication.

  6. Network Issues: Connectivity problems between SQL Server and WSFC, such as network latency or quorum loss.

  7. Cluster Misconfiguration: Incorrect quorum settings or resource monitor issues affecting AG health checks.

Example Error Message:

2025-09-02 10:00:00.00 Server Error: 19407, Severity: 16, State: 1.
The lease between availability group 'AG_Name' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster.

Real-Life Scenario: In a financial database, the AG FinanceAG goes offline with Error 19407 during a nightly backup, causing application downtime. The DBA suspects high I/O load or CPU pressure.


Step 1: Diagnosing Error 19407

To resolve the error, identify the root cause by analyzing logs and system metrics.

1.1 Check SQL Server Error Log

Review the SQL Server error log for Error 19407 and related messages (e.g., Error 19419 or dump generation).

Steps:

  1. Open SQL Server Management Studio (SSMS).

  2. Navigate to Management > SQL Server Logs.

  3. Look for Error 19407 and preceding events:

SELECT * FROM sys.fn_get_audit_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*', DEFAULT, DEFAULT)
WHERE event_time >= '2025-09-02 00:00:00' 
AND error = 19407;

Sample Output:

2025-09-02 10:00:00.00 Server Error: 19407, Severity: 16, State: 1.
The lease between availability group 'FinanceAG' and the Windows Server Failover Cluster has expired.
2025-09-02 09:59:50.00 Server **Dump thread - spid = 0, EC = 0x0000000000000000

Observation: A memory dump preceded the lease timeout, indicating a potential SQL Server issue.

1.2 Check Cluster Log

Export and analyze the WSFC log to correlate with SQL Server events.

PowerShell Command:

Get-ClusterLog -Node SQLNode1 -Destination C:\Logs\Cluster.log

Sample Output:

2025/09/02-10:00:00.587 INFO [RCM] rcm::RcmApi::FailResource: (FinanceAG)
2025/09/02-10:00:00.588 ERR [RES] SQL Server Availability Group <FinanceAG>: [hadrag] Lease Thread terminated

Observation: The cluster log confirms the lease expired, with a 10-second delay after the expected renewal.

1.3 Monitor System Resources

Check CPU, memory, and I/O metrics during the error time.

CPU Usage (PowerShell):

Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 30 | Select-Object -ExpandProperty CounterSamples | Select-Object TimeStamp, CookedValue

I/O Waits (T-SQL):

SELECT 
    wait_type,
    wait_time_ms,
    wait_time_ms / NULLIF(waiting_tasks_count, 0) AS avg_wait_ms
FROM 
    sys.dm_os_wait_stats
WHERE 
    wait_type IN ('PREEMPTIVE_HADR_LEASE_MECHANISM', 'PREEMPTIVE_SP_SERVER_DIAGNOSTICS')
    AND wait_time_ms > 0;

Sample Output:

wait_type                        wait_time_ms  avg_wait_ms
-------------------------------  ------------  -----------
PREEMPTIVE_HADR_LEASE_MECHANISM  526000        526000
PREEMPTIVE_SP_SERVER_DIAGNOSTICS 526000        526000

Observation: High wait times for lease-related tasks suggest system resource contention.

Real-Life Finding: In the financial database, high CPU usage (95%) and I/O waits during the backup window correlate with the lease timeout.


Step 2: Resolving Error 19407

Based on the diagnosis, apply these solutions to restore AG functionality.

2.1 Address High CPU Usage

High CPU can prevent the lease thread from responding.

Steps:

  1. Identify high-CPU processes:

Get-Process | Sort-Object CPU -Descending | Select-Object -First 5
  1. If SQL Server is the culprit, analyze queries:

SELECT 
    spid,
    cpu,
    sql_text = SUBSTRING(t.text, (r.statement_start_offset/2)+1, 
        ((CASE r.statement_end_offset 
            WHEN -1 THEN DATALENGTH(t.text)
            ELSE r.statement_end_offset 
        END - r.statement_start_offset)/2) + 1)
FROM 
    sys.sysprocesses p
    CROSS APPLY sys.dm_exec_sql_text(p.sql_handle) t
WHERE 
    cpu > 1000
ORDER BY 
    cpu DESC;
  1. Optimize or kill problematic queries:

KILL 123; -- Replace 123 with problematic SPID

Pros:

  • Reduces CPU contention, restoring lease renewal.

  • Immediate performance improvement.

Cons:

  • Killing sessions may disrupt applications.

  • Requires query tuning for long-term fix.

Alternative: Increase CPU resources or adjust MAXDOP to limit query parallelism:

EXEC sp_configure 'max degree of parallelism', 4;
RECONFIGURE;

2.2 Mitigate Memory Pressure

Low memory can cause paging, delaying lease renewal.

Steps:

  1. Check memory usage:

Get-Counter -Counter "\Memory\Available MBytes" -SampleInterval 5 -MaxSamples 30
  1. Adjust SQL Server max memory:

EXEC sp_configure 'max server memory', 8192; -- Set to 8GB, adjust as needed
RECONFIGURE;

Pros:

  • Prevents paging of SQL Server or WSFC processes.

  • Improves overall system stability.

Cons:

  • Requires sufficient physical memory.

  • May need application downtime for reconfiguration.

2.3 Optimize I/O Performance

Slow I/O, often from backups or maintenance, can delay lease renewal.

Steps:

  1. Identify I/O bottlenecks:

SELECT 
    database_name,
    file_type,
    io_stall_ms / NULLIF(num_of_reads + num_of_writes, 0) AS avg_io_stall_ms
FROM 
    sys.dm_io_virtual_file_stats(NULL, NULL)
WHERE 
    num_of_reads + num_of_writes > 0;
  1. Reschedule maintenance tasks (e.g., backups, index rebuilds) to off-peak hours:

USE msdb;
EXEC sp_update_job 
    @job_name = 'NightlyBackup',
    @new_name = 'NightlyBackup',
    @enabled = 1,
    @schedule_name = 'OffPeakSchedule'; -- Adjust to off-peak schedule

Pros:

  • Reduces I/O contention during peak hours.

  • Minimizes lease timeout risk.

Cons:

  • Requires coordination with maintenance schedules.

  • May need storage upgrades for persistent issues.

Real-Life Fix: In the financial database, rescheduling backups to 3 AM reduced I/O waits, preventing lease timeouts.

2.4 Handle Memory Dumps

Memory dumps freeze SQL Server, causing lease timeouts.

Steps:

  1. Check for dump files:

Get-ChildItem -Path "C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\LOG" -Filter "*.mdmp"
  1. Analyze dump causes:

SELECT 
    event_type,
    event_time,
    message
FROM 
    sys.fn_xe_file_target_read_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*.xel', NULL, NULL, NULL)
WHERE 
    event_type LIKE '%dump%';
  1. Address root causes (e.g., deadlocked schedulers) by optimizing queries or applying SQL Server patches.

Pros:

  • Resolves underlying SQL Server issues.

  • Prevents recurring dumps.

Cons:

  • Dump analysis may require Microsoft support.

  • Patching may involve downtime.

2.5 Fix Virtual Machine Issues

VM migrations or backups can suspend SQL Server, causing timeouts.

Steps:

  1. Check for VM migration events:

Get-WinEvent -LogName "Microsoft-Windows-Hyper-V-VMMS" | Where-Object { $_.Id -eq 20000 } | Select-Object TimeCreated, Message
  1. Coordinate with VM admins to avoid migrations during peak hours.

  2. Disable VM backups during critical windows:

Set-VM -Name SQLServerVM -AutomaticCheckpointsEnabled $false

Pros:

  • Prevents VM-related lease timeouts.

  • Improves AG reliability.

Cons:

  • Requires VM admin coordination.

  • May delay backup schedules.

2.6 Adjust Lease Timeout Settings

Increase the lease timeout to make AGs more resilient to transient issues.

Steps:

  1. Check current lease timeout:

Get-ClusterResource -Name "FinanceAG" | Get-ClusterParameter
  1. Increase lease timeout to 40 seconds:

Get-ClusterResource -Name "FinanceAG" | Set-ClusterParameter -Name LeaseTimeout -Value 40000
  1. Apply changes (no restart needed):

Stop-ClusterResource -Name "FinanceAG"
Start-ClusterResource -Name "FinanceAG"

Increase HealthCheckTimeout:

ALTER AVAILABILITY GROUP FinanceAG MODIFY REPLICA ON 'SQLNode1' WITH (FAILURE_CONDITION_LEVEL = 3, HEALTH_CHECK_TIMEOUT = 60000);

Pros:

  • Reduces false positives for transient issues.

  • No node restart required.

Cons:

  • Delays detection of real failures.

  • May mask underlying issues.


Step 3: Preventing Error 19407

Prevent future lease timeouts with these strategies.

3.1 Monitor System Resources

Implement monitoring to detect CPU, memory, and I/O issues early.

T-SQL Script:

SELECT 
    cpu_count,
    physical_memory_kb / 1024 AS memory_mb,
    (SELECT AVG(CookedValue) FROM (Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 10 | Select-Object -ExpandProperty CounterSamples) AS CPU) AS avg_cpu_percent
FROM 
    sys.dm_os_sys_info;

PowerShell Script:

$monitor = {
    $cpu = Get-Counter -Counter "\Processor(_Total)\% Processor Time" -SampleInterval 5 -MaxSamples 1 | Select-Object -ExpandProperty CounterSamples | Select-Object CookedValue
    if ($cpu.CookedValue -gt 90) { Send-MailMessage -To "dba@company.com" -From "alert@sqlserver.com" -Subject "High CPU Alert" -Body "CPU usage exceeded 90%." -SmtpServer "smtp.company.com" }
}
Start-Job -ScriptBlock $monitor -Name "CPUMonitor"

Pros: Proactive issue detection. Cons: Requires monitoring setup and maintenance.

3.2 Optimize Maintenance Schedules

Schedule I/O-intensive tasks during low-traffic periods.

Example:

USE msdb;
EXEC sp_add_jobschedule 
    @job_name = 'NightlyBackup',
    @name = 'OffPeakSchedule',
    @freq_type = 4, -- Daily
    @freq_interval = 1,
    @active_start_time = 030000; -- 3:00 AM

3.3 Separate Resource Monitors

Run each AG on a separate Resource Monitor to isolate issues.

Steps:

  1. Create a new Resource Monitor:

Add-ClusterResource -Name "FinanceAG_Monitor" -ResourceType "Resource Monitor"
  1. Assign AG to the new monitor:

Set-ClusterResource -Name "FinanceAG" -Parameter ResourceMonitorName -Value "FinanceAG_Monitor"

Pros: Isolates AG failures. Cons: Increases cluster complexity.

3.4 Enable Quorum Witness

Add a file share or cloud witness to improve quorum stability.

Steps:

  1. Configure a file share witness:

Set-ClusterQuorum -FileShareWitness "\\FileServer\Witness"
  1. Verify quorum configuration:

Get-ClusterQuorum

Pros: Enhances failover reliability. Cons: Requires additional infrastructure.


Pros and Cons of Resolution Techniques

Technique

Pros

Cons

Address CPU Usage

Improves system performance

May disrupt sessions

Mitigate Memory Pressure

Prevents paging issues

Requires sufficient memory

Optimize I/O

Reduces contention

Needs schedule adjustments

Handle Memory Dumps

Resolves SQL Server issues

May require Microsoft support

Fix VM Issues

Prevents VM-related timeouts

Requires VM admin coordination

Adjust Timeout Settings

Increases AG resilience

Delays real failure detection


Best Practices and Standards

  1. Monitor Resources: Use tools like SQL Server Agent alerts or System Center Operations Manager.

  2. Schedule Maintenance: Perform backups and index rebuilds during off-peak hours.

  3. Apply Patches: Keep SQL Server and Windows updated to avoid dump-related issues.

  4. Use Quorum Witness: Ensure robust quorum configuration for failover stability.

  5. Audit AG Health: Regularly check AG status:

SELECT 
    ag.name,
    ar.replica_server_name,
    ar.availability_mode_desc,
    ar.failover_mode_desc
FROM 
    sys.availability_groups ag
    JOIN sys.availability_replicas ar ON ag.group_id = ar.group_id;
  1. Document Changes: Log all configuration changes for troubleshooting.


Real-Life Case Study

Scenario: The FinanceAG AG in a financial database goes to resolving state with Error 19407 during nightly backups.

Actions:

  1. Checked error log:

SELECT * FROM sys.fn_get_audit_file('C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Log\*', DEFAULT, DEFAULT)
WHERE error = 19407;
-- Output: High I/O waits detected
  1. Analyzed I/O:

SELECT database_name, io_stall_ms / NULLIF(num_of_reads + num_of_writes, 0) AS avg_io_stall_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL);
-- Output: High I/O stalls during backup
  1. Rescheduled backups:

USE msdb;
EXEC sp_update_jobschedule 
    @job_name = 'NightlyBackup',
    @name = 'OffPeakSchedule',
    @active_start_time = 030000;
  1. Increased lease timeout:

Get-ClusterResource -Name "FinanceAG" | Set-ClusterParameter -Name LeaseTimeout -Value 40000
  1. Added a file share witness:

Set-ClusterQuorum -FileShareWitness "\\FileServer\Witness"

Result: Lease timeouts ceased, and AG stability was restored with no further downtime.


Conclusion

The SQL Server Error 19407: Always On Availability Group Lease Timeout is caused by resource contention, memory dumps, or VM issues disrupting the lease mechanism. By analyzing logs, optimizing system resources, adjusting timeout settings, and implementing preventive measures like quorum witnesses, you can resolve and prevent this error. Use the provided scripts to diagnose and fix issues, ensuring robust Always On Availability Group operations.

No comments:

Post a Comment

Thanks for your valuable comment...........
Md. Mominul Islam

Post Bottom Ad

Responsive Ads Here