📝 Postmortem: Pricemoov Database Outage and Performance Degradation
Incident Date: July 10–11, 2025
Duration: ~14 hours
Impacted Services: Pricemoov Portal, Pricing Automation, API
Status: Resolved
🧭 Summary
On July 10, Pricemoov experienced a significant production outage and service degradation affecting the Pricemoov Portal, Pricing Automation workflows, and public APIs. The incident was traced to inefficient insert operations generated by our recently launched simulation feature, which inserted large JSONB payloads under a faulty uniqueness constraint. These inserts caused excessive database contention, ultimately blocking write operations.
The issue was partially mitigated after a recovery instance was launched, and fully resolved once the original instance was upgraded to PostgreSQL 15 (which introduces improved handling of uniqueness constraints over nullable columns) and rebooted into a correct configuration.
🔍 Root Cause
The incident resulted from the interaction of four compounding factors:
- 🛠️ Faulty Uniqueness Constraint
A uniqueness constraint introduced in a patch two weeks prior was logically flawed. It applied to records in which certain fields could be NULL, unintentionally treating semantically distinct records as duplicates. This created contention across what should have been disjoint records (see the constraint sketch after this list).
- 🧪 High-Volume Inserts from Simulation Feature
The strategy recommendation feature, which also models simulated pricing scenarios, generated a growing number of insert operations each day. These inserts included large JSONB columns, making each one expensive in terms of memory, WAL writes, and index maintenance. The cost was not visible during initial testing because data volume scaled non-linearly in production over time.
- 🔁 Gradual System Degradation
Because the volume of inserted data increased progressively, the inefficiencies were not immediately visible. Over time, locking behavior and LWLock:BufferContent waits accumulated until the database reached 100% CPU and I/O utilization, blocking further transactions.
- ⏱️ Extended Recovery Time
During the recovery process, Pricemoov Engineering attempted to provision a recovery instance based on a recent snapshot. However, the desired AWS instance class was unavailable at the time, causing a significant delay. A process that typically completes within 35 minutes extended to over 3 hours, from 14:21 to 17:43 CEST.
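For reference, the sketch below illustrates the two uniqueness semantics involved, using a hypothetical simulation_results table (the real table and column names differ). On PostgreSQL 14 and earlier, a plain UNIQUE constraint always treats NULLs as distinct from one another, so uniqueness over nullable columns is easy to get wrong; PostgreSQL 15 lets the intent be stated explicitly with NULLS NOT DISTINCT.

```sql
-- Hypothetical table and column names, shown for illustration only.

-- PostgreSQL 14 and earlier: NULLs are always treated as distinct,
-- so rows that differ only by a NULL scenario_id never conflict.
ALTER TABLE simulation_results
  ADD CONSTRAINT simulation_results_uniq
  UNIQUE (product_id, scenario_id, price_date);

-- PostgreSQL 15+: the intended NULL semantics can be made explicit,
-- treating NULLs as equal when checking uniqueness.
ALTER TABLE simulation_results
  ADD CONSTRAINT simulation_results_uniq
  UNIQUE NULLS NOT DISTINCT (product_id, scenario_id, price_date);
```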
🧨 Timeline of Events
✅ Resolution
- 🆙 Upgraded PostgreSQL to version 15, enabling correct handling of uniqueness constraints with nullable fields.
- ♻️ Rebooted and promoted a clean writer instance with a corrected parameter group.
- 🟢 Launched a degraded fallback instance temporarily to restore the APIs, while continuing the PostgreSQL upgrade on the main instance.
- ✅ Final upgrade completed and confirmed at 19:34 EDT / 01:34 CEST (July 11).
- 🚫 Removed the flawed uniqueness constraint and restructured the insert logic for simulation data (see the sketch after this list).
- 📉 Purged excessive daily records generated during the incident period.
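The cleanup steps above looked roughly like the following; the constraint, table, and column names are hypothetical, and the date range is only indicative of the incident window.

```sql
-- Hypothetical names; shown for illustration only.

-- Drop the flawed uniqueness constraint on the simulation table.
ALTER TABLE simulation_results
  DROP CONSTRAINT IF EXISTS simulation_results_uniq;

-- Remove the excess simulation records created during the incident window.
DELETE FROM simulation_results
WHERE created_at >= TIMESTAMPTZ '2025-07-10 00:00+02'
  AND created_at <  TIMESTAMPTZ '2025-07-12 00:00+02';
```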
📈 Impact
- Availability: Pricemoov Portal and Pricing Automation were unavailable or read-only for ~14 hours.
- Performance: The Pricemoov API was initially down, then stabilized after the fallback database was activated.
- Data Integrity: No data loss occurred, as customers did not take any actions during this window; WAL replay completed successfully.
- Business Impact: The unresponsive API directly impacted the availability of some customer websites. Simulation jobs stalled, and users experienced failed automation runs and an unresponsive Portal UI.
🛡️ Actions Taken
- ✅ Upgraded database engine to PostgreSQL 15 to benefit from modern uniqueness semantics.
- ✅ Added alerting thresholds for insert latency, LWLock wait times, and bulk JSONB insert anomalies (see the query sketch after this list).
- ✅ Documented and validated DB reboot escalation paths when parameter group changes are applied.
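As one example of how such alerting can be fed, a periodic query against pg_stat_activity surfaces backends stuck on lightweight-lock waits. This is a sketch only; the exact thresholds and dashboards are not shown here.

```sql
-- Count active backends currently waiting on lightweight locks,
-- grouped by the specific wait event (e.g. BufferContent).
SELECT wait_event,
       count(*) AS waiting_backends
FROM pg_stat_activity
WHERE wait_event_type = 'LWLock'
  AND state = 'active'
GROUP BY wait_event
ORDER BY waiting_backends DESC;
```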
🔍 Preventive Measures
- Query/Lock Profiling Dashboards: Add automated detection for lock-heavy patterns (e.g., LWLock:BufferContent waits or repeated index retries).
- Reboot Resilience: Improve operational playbooks to detect and recover from Aurora reboot edge cases such as parameter group drift.
- Data Growth Monitoring: Visualize daily record growth for simulation-related tables and track their impact on WAL volume and I/O (see the sizing query after this list).
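A scheduled query over the catalog size functions is one way to feed such a dashboard. The sketch below assumes the simulation tables share a `simulation` name prefix, which is an assumption rather than the actual schema.

```sql
-- Snapshot the on-disk size (heap + indexes + TOAST) of simulation-related
-- tables so that daily growth can be charted over time.
SELECT relname AS table_name,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size,
       pg_total_relation_size(oid) AS total_bytes
FROM pg_class
WHERE relkind = 'r'
  AND relname LIKE 'simulation%'
ORDER BY total_bytes DESC;
```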
📌 Takeaways
- Large JSONB inserts + uniqueness constraints = explosive DB contention.
- Gradual degradation can be more dangerous than immediate failure if observability isn’t precise.
- Always test feature rollout against realistic production data patterns, especially when inserting large structured documents.
🏁 Current Status
- ✅ All systems operational as of 19:34 EDT / 01:34 CEST (July 11)
- ✅ No data loss
- 🛠️ Monitoring and engineering improvements in progress