Authorization is intermittently failing
Incident Report for AuthVia
Postmortem

Incident Overview

Incident Title: Unplanned Outage Due to Recursion Issue in Salesforce Integration
Date: 3/15/2024
Duration: Approximately 5 hours
Impact: Intermittent errors in token creation affecting internal and external users

Leadup

The incident was triggered by a planned upgrade of a merchant's Salesforce integration with Authvia to a new version ("Winter 24"). One significant change in this version moved the ContactHandler from Future methods to Queueable execution.

Fault

During the upgrade, a custom flow in that merchant's Salesforce system, named "Contact Update - Provision Contact with AuthVia," inadvertently initiated a recursive process. The recursion occurred when the trigger was executed by users who lacked the necessary permissions in Authvia's Salesforce integration, and it ran through the "ContactTrigger" class, which calls the "ContactHandler."
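The failure mode described above, a handler whose side effect fires the same trigger again, can be modeled with a small re-entrancy guard. This is a minimal sketch in Python rather than Apex, and it is not the fix that was deployed: the class name is borrowed from the report for readability, while `on_contact_update`, `provisioning_flow`, and the contact id are hypothetical.

```python
# Illustrative model only. All method names and ids below are assumptions,
# not AuthVia's or Salesforce's actual code.


class ContactHandler:
    """Stand-in for a handler whose side effects can re-fire its trigger."""

    def __init__(self):
        self._in_progress = set()  # contact ids currently being handled
        self.provision_calls = 0   # counts outbound provisioning requests

    def on_contact_update(self, contact_id, flow):
        # Re-entrancy guard: if this contact is already being handled,
        # return instead of recursing.
        if contact_id in self._in_progress:
            return
        self._in_progress.add(contact_id)
        try:
            self.provision_calls += 1
            # The flow writes to the contact again, which re-fires the trigger.
            flow(self, contact_id)
        finally:
            self._in_progress.discard(contact_id)


def provisioning_flow(handler, contact_id):
    """Models a flow that updates the contact and so re-enters the handler."""
    handler.on_contact_update(contact_id, provisioning_flow)


handler = ContactHandler()
handler.on_contact_update("003-example", provisioning_flow)
print(handler.provision_calls)  # 1: the nested call was suppressed
```

Without the guard, the same call would recurse until the interpreter's recursion limit is hit. Note that this sketch only covers the synchronous case; when the work is handed to asynchronous jobs, in-memory guard state does not carry over between jobs, so the guard would have to be stored somewhere that does.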

Impact

The recursive loop resulted in intermittent failures during token creation, impacting both internal and external user operations.

Detection

The incident was identified at 15:10 UTC when John observed an unusually high error rate in the api-base-authorizer during a routine check in New Relic. At the same time, a partner reported multiple errors affecting a significant portion of their API interactions.

Response

A rapid response involved multiple team members:

  • 15:10 UTC: John immediately alerted Matt Fillion.
  • 15:14 UTC: Matt reached out to Jeffrey Morales.
  • 15:16 UTC: A P2 level was declared, and a dedicated Slack channel was established for incident tracking.
  • By 16:30 UTC: The problematic application for Authvia in Salesforce was disabled, stabilizing the system.
  • 19:18 UTC: A code fix was implemented and deployed.

Recovery

By 16:30 UTC, disabling the application resolved the immediate instability.

The final resolution came at 19:18 UTC with a deployed code fix.

Timeline

The incident ran from the first JWT token creation failures at 13:10 UTC to resolution at 20:00 UTC, spanning the detection, response, and recovery stages described above.
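The postmortem uses UTC while the status updates at the end of this report are stamped in PDT. The following sketch converts the status-update times to UTC so the two can be compared. It assumes a fixed UTC-7 offset, which holds for PDT on March 15, 2024; the helper name `pdt_to_utc` is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed UTC-7 offset, valid for PDT on the incident date.
PDT = timezone(timedelta(hours=-7), "PDT")


def pdt_to_utc(hhmm, day="2024-03-15"):
    """Convert a PDT wall-clock time on the given day to an 'HH:MM UTC' string."""
    local = datetime.strptime(f"{day} {hhmm}", "%Y-%m-%d %H:%M").replace(tzinfo=PDT)
    return local.astimezone(timezone.utc).strftime("%H:%M UTC")


# Status labels and PDT times copied from the status updates in this report.
status_posts = [
    ("Investigating", "06:10"),
    ("Identified", "09:38"),
    ("Monitoring", "12:18"),
    ("Resolved", "13:00"),
]

for label, posted in status_posts:
    print(f"{label}: {posted} PDT = {pdt_to_utc(posted)}")
```

The first and last posts map to 13:10 UTC and 20:00 UTC, and the Monitoring post maps to 19:18 UTC, the time of the deployed code fix.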

Root Cause Analysis

The recursive triggering in Salesforce was traced back to the triggering user lacking the appropriate permissions, a condition that had not been anticipated. The resulting recursive function calls overloaded the system.

Blameless Root Cause

The incident was caused by an unexpected volume of requests generated by a recursive process in the Salesforce integration. Future prevention will involve implementing a new library fix to improve cross-lambda invocations.

Lessons Learned and Follow-up

  • Improvements: Enhancing cross-lambda invocation mechanisms.
  • Follow-up Task: Implement caching for certain services to reduce unnecessary inter-service requests. A Jira issue will be created to track this task.
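As an illustration of the caching follow-up, here is a minimal time-to-live cache sketch. It is one possible shape for the idea, not the planned implementation: `TTLCache`, `fetch_permissions`, the 30-second TTL, and the merchant id are all hypothetical, and the report does not say which services or data would be cached.

```python
import time


class TTLCache:
    """Small in-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock   # injectable so expiry can be tested
        self._entries = {}    # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]   # still fresh: skip the inter-service call
        value = loader(key)   # expired or missing: call the other service
        self._entries[key] = (now + self._ttl, value)
        return value


calls = []


def fetch_permissions(merchant_id):
    """Hypothetical stand-in for a request to another internal service."""
    calls.append(merchant_id)
    return {"merchant": merchant_id, "can_create_token": True}


cache = TTLCache(ttl_seconds=30)
first = cache.get_or_load("merchant-123", fetch_permissions)
second = cache.get_or_load("merchant-123", fetch_permissions)
print(len(calls))  # 1: the second lookup was served from the cache
```

A short TTL limits how long stale data can be served, which matters if cached values feed authorization decisions.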

Conclusion

This incident underscores the importance of thorough testing, especially when implementing changes in complex integrated systems. By addressing the identified root cause and implementing the recommended changes, we aim to prevent similar incidents in the future.

Posted Mar 26, 2024 - 14:19 PDT

Resolved
This incident has been resolved.
Posted Mar 15, 2024 - 13:00 PDT
Monitoring
We have released a fix that will resolve Authvia for Salesforce issues. We will continue to monitor errors.
Posted Mar 15, 2024 - 12:18 PDT
Update
All services are looking normal with the exception of Salesforce.
Posted Mar 15, 2024 - 10:28 PDT
Update
An unprecedented amount of traffic from one of our integration partners overloaded one of our services. We have disabled access to this partner and are working on a more permanent solution to avoid this problem going forward.
Posted Mar 15, 2024 - 09:39 PDT
Identified
The issue has been identified and a fix is being implemented.
Posted Mar 15, 2024 - 09:38 PDT
Update
We are continuing to investigate this issue.
Posted Mar 15, 2024 - 09:20 PDT
Investigating
Our system is intermittently failing to provide a token during JWT creation.
Posted Mar 15, 2024 - 06:10 PDT
This incident affected: lock.authvia.com, app.authvia.net, Integrations (Salesforce Bill N Pay), and api.authvia.com (authorization-service).