**Incidents | osmos.ai**

Incidents reported on the status page for osmos.ai: https://api-status.osmos.ai/

**OSMOS Events API 5XX**

https://api-status.osmos.ai/incident/851978
Tue, 17 Mar 2026 10:00:00 -0000

**Incident Date:** 17 March 2026
**Incident Duration:** 9:45 AM UTC – 12:50 PM UTC

**Overview**

On 17 March 2026, OSMOS experienced a temporary disruption impacting the Events API. During this period, clients observed elevated error rates (5xx) and increased latency. The issue was fully mitigated and system stability was restored.

**Incident Summary**

* Start Time: 9:45 AM UTC
* Peak Impact: 11:15 AM UTC
* Recovery Start: 12:15 PM UTC
* Service Restored: 12:50 PM UTC
* Full Stability: 1:05 PM UTC

**Root Cause**

The incident was caused by a failure in an internal OSMOS service that the Events API depends on to process incoming requests. That service became unstable, leading to high outbound connection load, increased CPU and memory utilization, and instance unavailability in the region. The result was a cascading failure: Events API instances became inaccessible, requests were rerouted across regions (increasing latency), and 5xx errors rose system-wide.

**Impact**

* Elevated 503 (Service Unavailable) and 504 (Gateway Timeout) error rates were observed.
* Increased latency primarily affected the South Asia region.
* Click and impression reporting produced inaccurate ad-spend figures, and real-time reporting was impacted.
* 503 errors were largely mitigated through recovery mechanisms, but real-time reports remained impacted afterward; 504 errors could not be fully recovered.
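Clients hit by transient 503/504 responses like these typically retry with exponential backoff and jitter, so that waves of synchronized retries do not add load to an already struggling service. A minimal sketch (the `send` callable stands in for the real Events API request; delays and attempt counts are illustrative, not OSMOS guidance):

```python
import random
import time

RETRYABLE = {503, 504}  # transient statuses seen during this incident

def send_with_retry(send, max_attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or attempts run out."""
    status = None
    for attempt in range(max_attempts):
        status = send()
        if status not in RETRYABLE:
            return status
        if attempt < max_attempts - 1:
            # Backoff doubles each attempt; jitter spreads clients apart.
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    return status  # still failing after max_attempts

# Stand-in for the real HTTP call: two transient failures, then success.
responses = iter([503, 504, 200])
print(send_with_retry(lambda: next(responses), sleep=lambda _: None))  # 200
```

Injecting `sleep` as a parameter keeps the demo (and any tests) instant while real callers get actual backoff.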
**Detection**

The issue was identified through OSMOS automated alerting systems, the OSMOS API status page (https://api-status.osmos.ai/), and reports from retailers using the Events API.

**Resolution & Recovery Timeline**

* **Phase I — Initial Degradation (9:45 AM UTC):** Approximately 30% of requests were failing, mainly in the South Asia region. Traffic was rerouted to other regions, causing higher latency; high CPU usage, memory usage, and outbound connection counts were observed.
* **Phase II — Widespread Outage (11:15 AM UTC):** Errors expanded across all regions with a significant increase in latency and 5xx responses. Instances became difficult to access due to cascading failures. The OSMOS team identified and applied immediate fixes focused on reducing load on the impacted internal services, stabilizing API infrastructure, and restoring partial availability. Due to system complexity and cascading effects, recovery required manual intervention alongside these fixes.
* **Phase III — Recovery (12:15 PM – 12:50 PM UTC):** The error rate fell from approximately 80% to 3%. Fixes were fully applied and optimized, and system stability progressively improved.
* **Phase IV — Full Recovery (by 1:05 PM UTC):** The success rate reached approximately 99.999%, and the system entered a monitoring and stabilization phase.

**Corrective Actions**

Immediate actions included stabilizing the affected internal service dependencies, restoring Events API instances, and optimizing traffic routing and failover handling.

**Incident Report: Service Downtime Due to Google Cloud Outage**

https://api-status.osmos.ai/incident/602396
Thu, 12 Jun 2025 21:30:00 -0000

**Incident Summary:**

On June 13th, 2025, between 12:00 AM IST and 2:00 AM IST, all dashboards and campaign management systems were unavailable due to a regional outage in Google Cloud services.
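Automated alerting of the kind that surfaced both of these incidents can be approximated with a sliding-window error-rate check that fires once recent 5xx responses cross a threshold. A minimal sketch (window size and threshold are hypothetical, not OSMOS's actual alerting configuration):

```python
from collections import deque

class ErrorRateAlert:
    """Sliding-window 5xx-rate monitor: fires when the recent error
    rate reaches a threshold. Parameter values are illustrative."""

    def __init__(self, window=100, threshold=0.30):
        self.outcomes = deque(maxlen=window)  # True = 5xx response
        self.threshold = threshold

    def record(self, status):
        self.outcomes.append(status >= 500)

    def error_rate(self):
        return sum(self.outcomes) / len(self.outcomes) if self.outcomes else 0.0

    def firing(self):
        return self.error_rate() >= self.threshold

# Roughly the Phase I picture: about 30% of recent requests failing.
alert = ErrorRateAlert()
for status in [200] * 70 + [503] * 30:
    alert.record(status)
print(alert.firing())  # True
```

A bounded `deque` keeps only the most recent outcomes, so the alert clears on its own as healthy responses push failures out of the window.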
**Impact:**

* Users were unable to access core platform functionalities, including performance dashboards, campaign monitoring, and management tools.
* Campaign API calls and automated workflows dependent on Google Cloud infrastructure failed intermittently during the outage window.
* Monitoring alerts were triggered across multiple services, confirming system-wide inaccessibility of user-facing modules.

**Root Cause:**

The outage was attributed to a service disruption within Google Cloud Platform (GCP) affecting compute and storage components in our region. This degraded the availability of key microservices deployed on Google Kubernetes Engine (GKE), as well as access to Cloud SQL and BigQuery.

**Resolution:**

* Platform availability began recovering at 2:00 AM IST, once GCP services stabilized.
* Health checks and manual validations confirmed the restoration of all affected services.

**Campaign Manager API Upgrade**

https://api-status.osmos.ai/incident/427075
Tue, 10 Sep 2024 12:45:00 -0000

We will be upgrading the Campaign Manager API to improve performance and introduce new features. During this time, users may experience brief interruptions in API service.
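Cascading failures like those described above are commonly contained with a circuit breaker: after repeated failures against a dependency (an internal service, or a degraded cloud component), callers stop sending traffic for a cooldown period, shedding load while the dependency recovers. A minimal sketch (thresholds and timeouts are illustrative, not OSMOS's actual failover mechanism):

```python
import time

class CircuitBreaker:
    """Open the circuit (shed calls) after repeated dependency failures,
    then allow a single probe after a cooldown. Parameters are illustrative."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        if self.clock() - self.opened_at >= self.reset_timeout:
            # Half-open: permit one probe call to test recovery.
            self.opened_at = None
            self.failures = 0
            return True
        return False

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

# Demo with a fake clock so the cooldown is deterministic.
now = [0.0]
cb = CircuitBreaker(failure_threshold=3, reset_timeout=30.0, clock=lambda: now[0])
for _ in range(3):
    cb.record_failure()
print(cb.allow())  # False: circuit open, calls are shed
now[0] += 31.0
print(cb.allow())  # True: cooldown elapsed, probe allowed
```

Passing the clock in as a callable makes the cooldown behavior testable without real waiting, the same trick as the injected `sleep` in a backoff helper.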