Delays in creating and retrieving documents and threads.

Incident Report for Workiro

Postmortem

Summary

From 15:04 to 21:48 UTC on 2025-09-22, a third-party provider experienced delayed data writes across all regions. This affected a subset of their customers, including Workiro, and caused lengthy delays when our application read and wrote document and thread data. As a result, users were unable to determine whether documents had been uploaded or threads created successfully. Whilst the third-party systems were recovering and data was replicating geographically, documents and threads were displayed intermittently.

Root Cause and Remediation

This incident was triggered when another customer of the third party accidentally initiated a resync of their full production dataset. These operations did not use the recommended method for large updates, and the latency of data writes increased dramatically from the usual 160ms. When the third party's on-call team was made aware of the issue, they identified the customer responsible for the large volume of expensive writes and worked with them to identify and stop the operations that had erroneously introduced these requests (had the operations not been identifiable, the team was prepared to enforce hard write limits on that environment to stabilize the platform). To preserve data integrity, their team waited for multiple regions to catch up and remain caught up. Once they had validated sustained low write delays in those regions, the team initiated a manual operation to bring the remaining delayed regions into sync with the recovered ones.
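
For illustration, the following hedged sketch contrasts per-record writes with the kind of batched method typically recommended for large updates. The provider and its API are not named in this report, so client.write and client.bulk_write below are hypothetical placeholders, not a real interface.

    # Hypothetical sketch: the provider's API is not named in this report,
    # so `client.write` and `client.bulk_write` are illustrative only.

    def resync_per_record(client, records):
        # Anti-pattern: one queued write per record. A full-dataset resync
        # issued this way floods the shared write queue, and every
        # customer's writes wait behind it.
        for record in records:
            client.write(record)  # ~160ms each under normal load

    def resync_batched(client, records, batch_size=500):
        # Recommended shape for large updates: group records into bulk
        # requests so the queue holds a few large items rather than
        # millions of small ones.
        for start in range(0, len(records), batch_size):
            client.bulk_write(records[start:start + batch_size])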

Throughout the incident, the Workiro product and engineering teams worked with the third party to understand the situation and the timeline for a fix, whilst testing our own application to assess the impact on user experience.

Follow-up actions

  • Near term

    • Confirm that the third party has migrated Workiro to a dedicated queue for processing writes, isolating our workload so that another customer's write volume cannot affect Workiro in this way again (see the queue sketch after this list).
    • Implement an in-app banner linking to our status page, to keep Workiro users better informed during any future incident (see the banner sketch after this list).
  • Medium term

    • Improve the user experience in error cases where a document has been uploaded to Workiro but success is not immediately clear, preventing duplicate documents from being created through repeated upload attempts (see the idempotency sketch after this list).
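
The dedicated-queue action above is, in essence, per-customer workload isolation. As a hedged illustration of the idea (we have no visibility into the provider's actual implementation, so all names below are hypothetical), a queue per customer drained round-robin prevents one customer's backlog from blocking anyone else's:

    from collections import deque

    class PerCustomerQueues:
        # Illustrative only: one shared queue lets a single customer's bulk
        # resync delay everyone; a queue per customer keeps each backlog
        # contained.
        def __init__(self):
            self.queues = {}  # customer_id -> deque of pending writes

        def enqueue(self, customer_id, write):
            self.queues.setdefault(customer_id, deque()).append(write)

        def drain_one_round(self, apply_write):
            # Round-robin across customers: a huge backlog in one queue no
            # longer blocks the head of any other customer's queue.
            for queue in self.queues.values():
                if queue:
                    apply_write(queue.popleft())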
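
For the in-app banner, a minimal sketch follows, assuming our status page is hosted on Atlassian Statuspage, which publishes a public v2 JSON summary endpoint; the URL shown is a placeholder rather than Workiro's real status page:

    import requests

    STATUS_URL = "https://example.statuspage.io/api/v2/status.json"  # placeholder

    def incident_banner_text():
        # Returns banner text while an incident is ongoing, otherwise None.
        status = requests.get(STATUS_URL, timeout=5).json()["status"]
        if status["indicator"] != "none":  # "minor", "major" or "critical"
            return "Service alert: {} - see our status page".format(status["description"])
        return None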
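
One common way to achieve the duplicate-prevention action above is a client-generated idempotency key: the same key is sent with every retry of a logical upload, so the server can deduplicate. This is a hedged sketch only; client.upload and its idempotency_key parameter are hypothetical, not Workiro's actual API:

    import uuid

    def upload_with_retry(client, file_bytes, filename, attempts=3):
        # Hypothetical client API: the same idempotency key is reused on
        # every retry, so a retried upload cannot create a duplicate.
        key = str(uuid.uuid4())
        for _ in range(attempts):
            try:
                return client.upload(file_bytes, filename, idempotency_key=key)
            except TimeoutError:
                continue  # safe to retry: server ignores duplicates of `key`
        raise RuntimeError("upload still unconfirmed after retries")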
Posted Sep 25, 2025 - 17:53 BST

Resolved

Thank you for your patience. We have been informed that latencies in all affected environments have recovered, and we have confirmed that Workiro is now functioning as normal.
Posted Sep 22, 2025 - 22:11 BST

Monitoring

We are continuing to monitor the recovery with our provider. The application is expected to return to normal gradually over the next hour, with full recovery estimated within two hours.
Posted Sep 22, 2025 - 20:37 BST

Identified

We have identified that the issue is related to a problem being experienced by a third-party service. They are aware of the situation and have confirmed that their engineers are working on it with the highest priority. They have identified the source as a backlog of writes in their processing queue and a sequential bottleneck affecting some environments.
Posted Sep 22, 2025 - 17:17 BST

Investigating

Our engineers are currently investigating reports of degraded service, specifically when uploading documents or creating threads.
Posted Sep 22, 2025 - 16:40 BST