GitHub Copilot incident

Incident with multiple GitHub services

Severity: Critical · Status: Resolved

Started

April 23, 2026 at 04:12 PM UTC

Duration

1h 18m

Resolved

April 23, 2026 at 05:30 PM UTC

Updates timeline

  1. Investigating

    We are investigating reports of degraded availability for Copilot and Webhooks.

  2. Investigating

    We are investigating multiple unavailable services.

  3. Investigating

    Actions is experiencing degraded performance. We are continuing to investigate.

  4. Update

    We have identified the root cause and are working on mitigation.

  5. Update

    The degradation affecting Actions and Copilot has been mitigated. We are monitoring to ensure stability.

  6. Update

    Many services have been mitigated, and we are validating the remaining services.

  7. Update

    Webhooks is operating normally.

  8. Resolved

    On April 23, 2026, between 16:03 UTC and 17:27 UTC, multiple GitHub services experienced elevated error rates and degraded performance due to DNS resolution failures originating from the DNS infrastructure in our VA3 datacenter. Approximately 5–7% of overall traffic was affected during the impact window:

    - Webhooks: ~0.35% of API requests returned 5xx (peak ~0.39%). ~0.88% of requests exceeded 3s latency; at peak, responses slower than 3s represented ~10% of Webhooks API traffic.
    - Copilot Metrics: ~9% of Copilot Insights dashboard requests returned 5xx.
    - Copilot cloud agents: ~10% of cloud agent sessions failed.
    - Octoshift: 0.88% of active repo migrations failed, and 79% saw elevated durations (avg. 5.2 min) during this period.
    - Git Operations: error rates averaged 1.25% over the incident, peaking at 2.07%.
    - Actions: workflow run status updates were delayed by up to ~8s during the incident window.

    Our DNS infrastructure in VA3 entered a degraded state and began intermittently returning NXDOMAIN responses and timing out on lookups for both internal service discovery and external endpoints, causing cascading impact across the dependent services listed above.

    We identified a specific load pattern under which our DNS resolvers began failing. The evidence points to a recently introduced traffic-balancing mechanism, rolled out progressively to support our growth, as the root cause. We have since reverted this change.

    We are prioritizing investments in a more controlled rollout and validation process, including a dedicated environment that safely shadows production DNS traffic, so that we can detect these failure modes before they affect production.
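To make the proposed shadow-validation step concrete, here is a minimal sketch of replaying a query against both the production resolver and a candidate resolver and flagging disagreements. The resolver addresses and query name are placeholders, and the sketch uses the dnspython library; it illustrates the idea only and is not GitHub's tooling.

```python
# Minimal shadow-comparison sketch (illustrative only, not GitHub's tooling).
# Requires dnspython: pip install dnspython
import dns.exception
import dns.resolver

PROD_RESOLVER = "10.0.0.53"       # placeholder: current production resolver
CANDIDATE_RESOLVER = "10.0.1.53"  # placeholder: resolver running the new config

def answers(nameserver: str, qname: str, rdtype: str = "A") -> list[str]:
    """Return the sorted answer set, or a sentinel for NXDOMAIN/timeouts."""
    resolver = dns.resolver.Resolver(configure=False)
    resolver.nameservers = [nameserver]
    resolver.lifetime = 2.0  # fail fast: a hung lookup is itself a signal
    try:
        return sorted(rr.to_text() for rr in resolver.resolve(qname, rdtype))
    except dns.resolver.NXDOMAIN:
        return ["NXDOMAIN"]
    except dns.exception.Timeout:
        return ["TIMEOUT"]

def shadow_compare(qname: str) -> None:
    """Flag any divergence between the production and candidate resolvers."""
    prod = answers(PROD_RESOLVER, qname)
    cand = answers(CANDIDATE_RESOLVER, qname)
    if prod != cand:
        print(f"MISMATCH {qname}: prod={prod} candidate={cand}")

# Example (hypothetical name): shadow_compare("service.internal.example")
```

In a real shadow environment the comparison would run continuously over mirrored production query streams rather than single lookups, but the pass/fail signal is the same: any divergence in answers, NXDOMAIN behavior, or timeouts from the candidate is caught before promotion.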

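On the client side, the failure mode described in the report (intermittent NXDOMAIN answers and lookup timeouts) suggests a defensive pattern for services that depend on DNS: treat a single failed lookup as potentially transient. The sketch below is a generic illustration in Python, not code from any affected service; the hostname is hypothetical.

```python
import socket
import time

def resolve_with_retry(hostname: str, attempts: int = 3, base_delay: float = 0.2):
    """Resolve a hostname, retrying transient failures with backoff.

    Under resolver degradation like the incident above, a single lookup may
    spuriously time out or return NXDOMAIN even though the name exists, so
    one failure is not treated as authoritative. Persistent failures still
    surface as an exception after the final attempt.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            # socket.gaierror covers both NXDOMAIN and resolution timeouts
            return socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
        except socket.gaierror as err:
            last_err = err
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff
    raise last_err

# Example (hypothetical name):
# addrs = resolve_with_retry("copilot-agents.internal.example")
```

Note that retrying on NXDOMAIN is only appropriate when resolver health is suspect, as it was here; under normal operation a negative answer should be trusted rather than retried.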