Incident Title: Unplanned Outage Due to Recursion Issue in Salesforce Integration
Date: 3/15/2025
Duration: Approximately 5 hours
Impact: Intermittent errors in token creation affecting internal and external users
The incident was triggered by a planned upgrade to a new version ("Winter 24") of a merchant's Salesforce integration with Authvia. One significant change in that version modified the ContactHandler to run as a Queueable job instead of using future methods.
During the rollout, a custom flow within that merchant's Salesforce system, named "Contact Update - Provision Contact with AuthVia," inadvertently initiated a recursive process. The recursion occurred when the trigger was executed by users who lacked the necessary permissions in Authvia's Salesforce integration, and it ran within the "ContactTrigger" class, which calls the "ContactHandler."
The recursive loop resulted in intermittent failures during token creation, impacting both internal and external user operations.
The incident was identified at 15:10 UTC, when John observed an unusually high error rate in the api-base-authorizer during a routine check in New Relic. At the same time, a partner reported multiple errors affecting a significant portion of their API interactions.
A rapid response involved multiple team members:
By 16:30 UTC, disabling the application resolved the immediate instability.
The final resolution came at 19:18 UTC with a deployed code fix.
A detailed timeline from the JWT token creation issue at 13:10 UTC to the incident resolution at 20:00 UTC outlines the sequence of events, detection, response, and resolution stages.
The recursive triggering in Salesforce was traced to the executing user lacking the appropriate permissions. This condition had not been anticipated, and the resulting recursive calls overloaded the system.
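One way to guard against this failure mode is to check the executing user's permission before any work is queued, and to cap how deep a chain of re-queued jobs may go. The sketch below models both checks in Python under stated assumptions: the permission name authvia_integration, the MAX_CHAIN_DEPTH value, and the should_enqueue function are hypothetical and are not taken from the deployed fix.

```python
from dataclasses import dataclass
from typing import FrozenSet

MAX_CHAIN_DEPTH = 3  # hypothetical cap on chained re-queues
REQUIRED_PERMISSION = "authvia_integration"  # hypothetical permission name


@dataclass(frozen=True)
class User:
    name: str
    permissions: FrozenSet[str] = frozenset()


def should_enqueue(user: User, chain_depth: int) -> bool:
    """Decide whether a follow-up job may be queued for this user.

    Fails fast when the user lacks the required permission, and refuses
    to extend a chain that has already reached MAX_CHAIN_DEPTH.
    """
    if REQUIRED_PERMISSION not in user.permissions:
        # A missing permission ends processing instead of retrying.
        return False
    return chain_depth < MAX_CHAIN_DEPTH
```

The depth cap is a backstop: even if an unanticipated condition causes jobs to re-queue themselves, the chain stops after a fixed number of hops rather than growing without bound.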
The incident was caused by an unexpected volume of requests generated by a recursive process in the Salesforce integration. To prevent a recurrence, a new library fix will be implemented to improve cross-lambda invocations.
This incident underscores the importance of thorough testing, especially when implementing changes in complex integrated systems. By addressing the identified root cause and implementing the recommended changes, we aim to prevent similar incidents in the future.