Service Unavailable: Portal and Endpoints

Incident Report for Pricemoov

Postmortem

📝 Postmortem: Pricemoov Database Outage and Performance Degradation

Incident Date: July 10–11, 2025

Duration: ~14 hours

Impacted Services: Pricemoov Portal, Pricing Automation, API

Status: Resolved

🧭 Summary

On July 10, Pricemoov experienced a significant production outage and service degradation affecting the Pricemoov Portal, Pricing Automation workflows, and public APIs. The incident was traced to inefficient insert operations generated by our recently launched simulation feature, which inserted large JSONB payloads under a faulty uniqueness constraint. These inserts caused excessive database contention, ultimately blocking write operations.

The issue was partially mitigated by launching a recovery instance, and fully resolved once the original instance was upgraded to PostgreSQL 15, which provides improved handling of uniqueness constraints, and rebooted into a corrected configuration.

🔍 Root Cause

The incident resulted from the interaction of four compounding factors:

  1. 🛠️ Faulty Uniqueness Constraint
    A uniqueness constraint introduced in a patch two weeks prior was logically flawed. It applied to records in which certain fields could be NULL, unintentionally treating semantically distinct records as duplicates and creating contention across what should have been disjoint records (a hypothetical sketch of this pattern follows this list).
  2. 🧪 High-Volume Inserts from Simulation Feature
    The strategy recommendation infos feature, which also models simulated pricing scenarios, generated a growing number of insert operations each day. These inserts included large JSONB columns, making each one expensive in terms of memory, WAL writes, and index maintenance. The cost was not visible during initial testing because data volume grew non-linearly in production over time.
  3. 🔁 Gradual System Degradation
    Because the volume of inserted data increased progressively, the inefficiencies were not immediately visible. Over time, locking behavior and LWLock:BufferContent waits accumulated until the database hit 100% CPU and I/O utilization, blocking further transactions.
  4. ⏱️ Extended Recovery Time
    During the recovery process, Pricemoov Engineering attempted to provision a recovery instance based on a recent snapshot. However, the desired AWS instance class was unavailable at the time, causing a significant delay. A process that typically completes within 35 minutes extended to over 3 hours, from 14:21 to 17:43 CEST.
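
For illustration only, here is a minimal sketch of the pattern described in points 1 and 2, using hypothetical table and column names (the actual schema and constraint are not shown in this report). An expression-based unique index that collapses a NULLable field to a sentinel treats semantically distinct simulation rows as duplicates, so concurrent inserts of large JSONB payloads contend on the same index entries:

    -- Hypothetical illustration; real table, column, and constraint names differ.
    CREATE TABLE simulation_recommendation (
        id          bigserial PRIMARY KEY,
        product_id  bigint NOT NULL,
        scenario_id bigint,               -- NULL for non-simulated records
        payload     jsonb  NOT NULL,      -- large simulated pricing document
        created_at  timestamptz NOT NULL DEFAULT now()
    );

    -- Flawed pattern: COALESCE collapses every NULL scenario_id to the same
    -- sentinel, so distinct rows are treated as potential duplicates and
    -- concurrent inserts must wait on each other to resolve the conflict.
    CREATE UNIQUE INDEX simulation_recommendation_uniq
        ON simulation_recommendation (product_id, (COALESCE(scenario_id, 0)));

Under this kind of constraint, two sessions inserting rows with the same product_id and a NULL scenario_id block one another until the first transaction commits or aborts, which is one plausible source of the accumulated lock waits described above.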

🧨 Timeline of Events

The detailed chronology is captured in the posted status updates at the end of this report, from the first investigation notice at 06:50 EDT on July 10 through resolution at 19:34 EDT.

✅ Resolution

  • 🆙 Upgraded PostgreSQL to version 15, enabling correct handling of uniqueness constraints with nullable fields (see the sketch after this list).
  • ♻️ Rebooted and promoted a clean writer instance with a corrected parameter group.
  • 🟢 Temporarily launched a degraded fallback instance to restore the APIs while the PostgreSQL upgrade continued on the main instance.
  • ✅ Final upgrade completed and confirmed at 19:34 EDT / 01:34 CEST (July 11).
  • 🚫 Removed the flawed uniqueness constraint and restructured the insert logic for simulation data.
  • 📉 Purged excessive daily records generated during the incident period.
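
The first bullet most likely refers to PostgreSQL 15's new NULLS NOT DISTINCT option, which lets a uniqueness constraint state explicitly how NULLs participate instead of relying on expression-based workarounds; whether a replacement constraint was ultimately added is not covered by this report. A sketch, reusing the hypothetical names from the root-cause section:

    -- After dropping the flawed expression index shown earlier...
    DROP INDEX IF EXISTS simulation_recommendation_uniq;

    -- PostgreSQL 15+: declare NULL handling explicitly instead of via COALESCE.
    -- The default (NULLS DISTINCT) allows many rows with scenario_id IS NULL;
    -- NULLS NOT DISTINCT treats them as equal and permits only one such row.
    ALTER TABLE simulation_recommendation
        ADD CONSTRAINT simulation_recommendation_scenario_uniq
        UNIQUE NULLS NOT DISTINCT (product_id, scenario_id);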

📈 Impact

  • Availability: Pricemoov Portal and Pricing Automation were unavailable or read-only for ~14 hours.
  • Performance: The Pricemoov API was initially down but stabilized after the fallback database was activated.
  • Data Integrity: No data loss occurred, as customers did not make changes during this window; WAL replay completed successfully.
  • Business Impact: The unresponsive API directly affected the availability of some customer websites. Simulation jobs stalled, and users experienced failed automation runs and an unresponsive Portal UI.

🛡️ Actions Taken

  • ✅ Upgraded database engine to PostgreSQL 15 to benefit from modern uniqueness semantics.
  • ✅ Added alerting thresholds for insert latency, LWLock wait times, and bulk JSONB insert anomalies (a sample wait-event query follows this list).
  • ✅ Documented and validated DB reboot escalation paths when parameter group changes are applied.
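
As a starting point for the LWLock alerting above, a simple query against pg_stat_activity surfaces which wait events currently dominate; the actual alert thresholds live in our monitoring stack and are not shown here:

    -- Snapshot of current wait events; spikes in LWLock:BufferContent during
    -- bulk JSONB inserts were the signature of this incident.
    SELECT wait_event_type,
           wait_event,
           count(*) AS backends
    FROM   pg_stat_activity
    WHERE  state = 'active'
      AND  wait_event IS NOT NULL
    GROUP  BY wait_event_type, wait_event
    ORDER  BY backends DESC;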

🔍 Preventive Measures

  1. Query/Lock Profiling Dashboards: add automated detection for lock-heavy patterns (e.g., LWLock:BufferContent waits or repeated index retries).
  2. Reboot Resilience: improve operational playbooks to detect and recover from Aurora reboot edge cases such as parameter group drift.
  3. Data Growth Monitoring: visualize daily record growth for simulation-related tables and track their impact on WAL volume and I/O (a sample sizing query follows this list).
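
For the data growth monitoring item, a lightweight query over pg_stat_user_tables can feed a daily dashboard; note that n_tup_ins is cumulative since the last statistics reset, so daily growth comes from diffing successive snapshots:

    -- Per-table insert counters and on-disk size; sustained growth in
    -- simulation-related tables is the leading indicator to chart.
    SELECT relname,
           n_tup_ins                                     AS rows_inserted_total,
           pg_size_pretty(pg_total_relation_size(relid)) AS total_size
    FROM   pg_stat_user_tables
    ORDER  BY pg_total_relation_size(relid) DESC
    LIMIT  20;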

📌 Takeaways

  • Large JSONB inserts + uniqueness constraints = explosive DB contention.
  • Gradual degradation can be more dangerous than immediate failure if observability isn’t precise.
  • Always test feature rollout against realistic production data patterns, especially when inserting large structured documents.

🏁 Current Status

  • ✅ All systems operational as of 19:34 EDT / 01:34 CEST (July 11)
  • ✅ No data loss
  • 🛠️ Monitoring and engineering improvements in progress
Posted Jul 11, 2025 - 07:09 EDT

Resolved

Issue fixed. The database is running properly.
Posted Jul 10, 2025 - 19:34 EDT

Monitoring

Because the original database instance size was unavailable during recovery, we launched another instance in degraded mode.

✅ APIs are operational
⚠️ Pricemoov Portal and Pricing Automation are still running with degraded performance

Our engineering team continues to monitor the situation closely and will provide further updates as we restore full service levels.

We appreciate your patience and understanding.
Posted Jul 10, 2025 - 11:43 EDT

Update

We attempted a manual reboot of our production database following stability issues. Unfortunately, the reboot operation did not complete successfully.

Our engineering team has identified that this issue now requires intervention from AWS.

We are working closely with our cloud partner and AWS to trigger a forced reboot and restore service as quickly as possible.

We will provide further updates as soon as we have confirmation from AWS.

Thank you for your continued patience.
Posted Jul 10, 2025 - 11:18 EDT

Update

Our engineering team has identified that the writer database instance remained in a “pending reboot” state following the upgrade, preventing full recovery and resumption of normal operations. This was due to a parameter group change requiring a manual reboot, as confirmed by our cloud provider's support team (DoiT/AWS).

We are now proceeding with a controlled reboot of the affected instance to complete the upgrade process and restore full functionality. Please note that no recent data changes have occurred during this maintenance window, and no unconfirmed transactions are at risk.
We expect the service to gradually recover shortly after the reboot. We will continue to monitor closely and provide another update as soon as the platform is fully operational again.

Thank you for your continued patience.
Posted Jul 10, 2025 - 10:55 EDT

Update

Our primary database is undergoing a planned upgrade. Both writer and reader instances are currently in the "Upgrading" state. We’ve confirmed there are no failure or rollback flags, indicating the upgrade process is still progressing normally.

Aurora upgrades can take extended time due to:

  • A distributed snapshot followed by WAL (write-ahead log) replay before upgrading system catalogs.
  • Sequential restarts of multiple instances, which increases total duration.

We are closely monitoring the upgrade and will provide updates as it completes. No data loss or failure has occurred. Thank you for your continued patience.
Posted Jul 10, 2025 - 09:38 EDT

Update

As database recovery is taking longer than expected, the Pricemoov Engineering team has initiated a dual recovery process. Starting at 14:00 UTC+2, we are restoring service from a secondary database instance to accelerate full recovery and minimize disruption.

We are closely monitoring the situation and will provide further updates shortly.

Thank you for your continued patience and understanding.
Posted Jul 10, 2025 - 08:32 EDT

Update

The database is now in the WAL (Write-Ahead Log) recovery phase before the instance is brought back online.
Posted Jul 10, 2025 - 08:21 EDT

Identified

We are currently experiencing an issue with our primary database, which is undergoing an automatic reboot following an elevated load condition. This may result in temporary unavailability or degraded performance of our services.

Our engineering team is actively monitoring the situation and working to restore normal operations as quickly as possible.

We will provide an update within the next 15 minutes or as soon as new information is available.

Thank you for your patience and understanding.
Posted Jul 10, 2025 - 08:01 EDT

Investigating

We are currently investigating this issue.
Posted Jul 10, 2025 - 06:50 EDT
This incident affected: Pricemoov Portal, API, and Pricing Automation.