OSMOS Events API 5XX
Resolved
Mar 17, 2026 at 10:00am UTC
OSMOS Events API Incident
Date: 17 March 2026
Incident Duration: 9:45 AM UTC – 12:50 PM UTC
Overview
On 17 March 2026, OSMOS experienced a temporary disruption affecting the Events API. During this period, clients observed elevated 5xx error rates and increased latency. The issue was fully mitigated, and system stability was restored.
Incident Summary
Start Time: 9:45 AM UTC
Peak Impact: 11:15 AM UTC
Recovery Start: 12:15 PM UTC
Service Restored: 12:50 PM UTC
Full Stability: 1:05 PM UTC
Root Cause
The incident was caused by a failure in a dependent internal OSMOS service used by the Events API. The Events API relies on several internal services to process incoming requests. One of these services became unstable, leading to high outbound connection load, increased CPU and memory utilization, and instance unavailability in the affected region. This triggered a cascading failure: Events API instances became inaccessible, requests were rerouted across regions (increasing latency), and 5xx errors rose system-wide.
Impact
Elevated 503 (Service Unavailable) and 504 (Gateway Timeout) errors were observed, and increased latency primarily affected the South Asia region. Click and impression events that failed to be recorded led to inaccurate ad-spend reporting, and real-time reporting was impacted. Most 503 failures were mitigated through recovery mechanisms, although real-time reports remained affected afterward; requests that failed with 504 errors could not be fully recovered.
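For clients sending events during an incident like this, retrying transient 503/504 responses with exponential backoff can reduce data loss. The sketch below is purely illustrative and is not part of any official OSMOS SDK; the endpoint URL, function name, and parameters are hypothetical assumptions, and it uses Python's requests library.

import time
import requests

EVENTS_ENDPOINT = "https://api.osmos.ai/v1/events"  # hypothetical endpoint, for illustration only

def post_event_with_retry(payload, max_attempts=5, base_delay=0.5):
    """Send an event, retrying transient 503/504 responses with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            resp = requests.post(EVENTS_ENDPOINT, json=payload, timeout=10)
            if resp.status_code not in (503, 504):
                return resp  # success, client error, or a non-retryable server error
        except requests.RequestException:
            pass  # network failure or timeout; treat as retryable
        time.sleep(base_delay * (2 ** attempt))  # back off before the next attempt
    raise RuntimeError("Events API still unavailable after retries")

A call such as post_event_with_retry({"event": "impression"}) would then absorb short-lived 503/504 bursts instead of dropping the event on the first failure.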
Detection
The issue was identified through OSMOS automated alerting systems, the OSMOS API status page (https://api-status.osmos.ai/), and reports from retailers using the Events API.
Resolution & Recovery Timeline
Phase I — Initial Degradation (9:45 AM UTC): Approximately 30% of requests were failing, mainly in the South Asia region. Traffic was rerouted to other regions, which increased latency; high CPU usage, memory usage, and outbound connection counts were observed.
Phase II — Widespread Outage (11:15 AM UTC): Errors spread to all regions, with a significant increase in latency and 5xx responses. Instances became difficult to access due to the cascading failures. The OSMOS team identified and applied immediate mitigations focused on reducing load on the impacted internal services, stabilizing the API infrastructure, and restoring partial availability. Due to system complexity and cascading effects, recovery required manual intervention alongside these fixes.
Phase III — Recovery (12:15 PM – 12:50 PM UTC): The error rate dropped from approximately 80% to 3% as fixes were fully applied and optimized, and system stability progressively improved.
Phase IV — Full Recovery (by 1:05 PM UTC): The success rate returned to approximately 99.999%, and the system entered a monitoring and stabilization phase.
Corrective Actions
Immediate actions taken included stabilizing affected internal service dependencies, restoring Events API instances, and optimizing traffic routing and failover handling.
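As a conceptual illustration of the failover pattern mentioned above (this is not OSMOS's internal routing logic; the regional endpoint URLs and function name are hypothetical assumptions), a caller can try a primary regional endpoint and fall back to alternates when it times out or returns a server error:

import requests

REGIONAL_ENDPOINTS = [
    "https://events.ap-south.osmos.ai/v1/events",  # hypothetical primary region
    "https://events.eu-west.osmos.ai/v1/events",   # hypothetical fallback regions
    "https://events.us-east.osmos.ai/v1/events",
]

def post_event_with_failover(payload, timeout=5):
    """Try each regional endpoint in order, moving on when one is unreachable or returns 5xx."""
    last_error = None
    for endpoint in REGIONAL_ENDPOINTS:
        try:
            resp = requests.post(endpoint, json=payload, timeout=timeout)
            if resp.status_code < 500:
                return resp  # this region handled the request
            last_error = RuntimeError(f"{endpoint} returned {resp.status_code}")
        except requests.RequestException as exc:
            last_error = exc  # region unreachable or timed out; try the next one
    raise last_error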
Affected Services
Events API