Summary
Starting at 13:15 UTC on Monday, February 12th, 2024, users of the 3DEXPERIENCE platform SaaS Public Cloud experienced service unavailability across all regions. This issue prevented access to the service, resulting in an access error. The incident lasted for 2 hours and 25 minutes, until 15:40 UTC.
Furthermore, a second occurrence of this incident took place on Tuesday, February 13th, from 07:20 UTC to 09:20 UTC, lasting 2 hours.
Both incidents were detected immediately by the Cloud Operations team and managed through to resolution.
Symptom
Users across all regions of the 3DEXPERIENCE platform SaaS Public Cloud who tried to connect to the services during the incident periods were affected and received an access error.
Causes & Response
Introduction
The service that verifies users' licensing rights stopped responding:
- All application servers belonging to this service were unavailable and had all been excluded from load balancing.
- The computing resources on which the application runs were all unresponsive.
- The application crashed because of a race condition in its thread management.
- The crash was caused by a flaw in the thread concurrency implementation combined with heavy load on the service (an illustrative sketch of this class of defect follows this list).
- The conditions needed to reproduce this issue were not present during our qualification phase, which is why it was not revealed earlier.
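The faulty code itself is not reproduced here. Purely as a hypothetical illustration of this class of defect, the minimal Java sketch below shows how an unsynchronized structure shared by request threads can behave correctly under light traffic yet lose or corrupt data once many threads update it concurrently; the class and method names (LicenseSessionCache, checkOut) are invented for the example and do not correspond to our actual code.

    import java.util.HashMap;
    import java.util.Map;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Hypothetical illustration only: a license-session cache shared by request threads.
    // HashMap is not thread-safe, so concurrent updates can lose increments or corrupt
    // the map's internal state. With a single caller this never fails; under heavy
    // concurrent load it does -- the same class of defect as a race condition that
    // only appears under production traffic.
    public class LicenseSessionCache {
        private final Map<String, Integer> activeSessions = new HashMap<>(); // not thread-safe

        public void checkOut(String userId) {
            // Read-modify-write without synchronization: two threads can interleave here.
            activeSessions.merge(userId, 1, Integer::sum);
        }

        public static void main(String[] args) throws InterruptedException {
            LicenseSessionCache cache = new LicenseSessionCache();
            ExecutorService pool = Executors.newFixedThreadPool(32);
            for (int i = 0; i < 100_000; i++) {
                final int n = i;
                pool.submit(() -> cache.checkOut("user-" + (n % 100)));
            }
            pool.shutdown();
            pool.awaitTermination(1, TimeUnit.MINUTES);
            // The total is usually lower than 100,000 because concurrent updates were lost.
            int total = cache.activeSessions.values().stream().mapToInt(Integer::intValue).sum();
            System.out.println("Recorded check-outs: " + total + " (expected 100000)");
        }
    }

Defects of this kind typically pass functional qualification, where calls are sequential or only lightly concurrent, and only surface under sustained production load.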
Recovery Time Objective
Recovery took longer than it should have and fell outside the objectives defined in our Service Level Agreement.
Due to the heavy load on the computing resources combined with the application crash, the computing resources became unresponsive, which prevented the automated recovery processes from operating.
- The first step of the automated operations consists of decommissioning the unhealthy resources that cause errors for end users, while adding new resources in parallel (a simplified sketch of this step follows this list). This first step failed:
- The unhealthy computing resources were busy writing debugging information to the filesystem, generating a heavy volume of I/O on the underlying block storage and preventing the normal detachment of the virtual storage.
- All computing resources were affected in this way.
- Because the automated recovery process failed, we had to switch to manual recovery operations:
- Manually clean up the unhealthy resources,
- Deploy new resources to replace the previous ones,
- Check overall service consistency.
- Once the consistency validation was completed, the service returned to normal.
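For illustration only, the sketch below shows the general shape of such an automated recovery pass. It is a simplified, hypothetical model, not our actual tooling: the Resource and Provisioner interfaces are invented for the example, and resources are handled one by one for readability, whereas the real process adds new resources in parallel. The key point is that when decommissioning a host hangs, for example because its virtual storage cannot be detached, the automated pass cannot complete and manual recovery becomes necessary.

    import java.util.List;

    // Simplified, hypothetical sketch of an automated recovery pass (not our actual tooling).
    // Unhealthy resources are taken out of load balancing, replacements are provisioned,
    // and the unhealthy hosts are decommissioned. If decommissioning hangs because the
    // host is unresponsive (e.g. its virtual storage cannot be detached), the pass fails
    // and operators must fall back to manual recovery.
    public class RecoveryPass {

        interface Resource {
            String id();
            boolean isHealthy();                    // e.g. an HTTP health probe
            void detachFromLoadBalancer() throws Exception;
            void decommission() throws Exception;   // detaches storage, releases the host
        }

        interface Provisioner {
            Resource provisionReplacement() throws Exception;
        }

        public void run(List<Resource> fleet, Provisioner provisioner) {
            for (Resource resource : fleet) {
                if (resource.isHealthy()) {
                    continue;
                }
                try {
                    resource.detachFromLoadBalancer();   // stop routing user traffic to it
                    provisioner.provisionReplacement();  // bring up a replacement resource
                    resource.decommission();             // the step that blocked during the incident
                } catch (Exception e) {
                    // Surface the failure so operators can switch to manual recovery.
                    System.err.println("Automated recovery failed for " + resource.id() + ": " + e);
                }
            }
        }
    }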
Additional Occurrence of the Incident on the Next Day
After the service was restored, the development team took immediate action to address the root cause of the outage:
- The faulty code responsible for the issue was identified, and the team worked on developing a permanent fix.
- This fix was completed and delivered by Monday night and went through the expedited validation process specifically designed for critical fixes. Its deployment was scheduled for the next morning to prevent any further occurrence of the issue.
- In addition, we added extra computing resources to support the load, but the issue impacted all computing resources in the same way.
- However, before the fix could be deployed, the service experienced another, similar incident.
- Following the same recovery protocol, we restored the service after a similar delay.
- We then applied the developed fix. This corrected the situation and stabilized the service, preventing any recurrence of this incident (a hypothetical sketch of this type of correction is shown below).
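For contrast with the earlier sketch, and again purely as a hypothetical illustration rather than the actual patch, the variant below replaces the unsynchronized map with a ConcurrentHashMap so that the per-key read-modify-write is performed atomically and concurrent check-outs can no longer lose updates or corrupt the map, even under heavy load.

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Hypothetical illustration of the kind of correction involved (not the actual patch):
    // the shared map is replaced by a concurrent implementation, making the per-key
    // read-modify-write in checkOut() atomic even under heavy concurrent load.
    public class LicenseSessionCacheFixed {
        private final Map<String, Integer> activeSessions = new ConcurrentHashMap<>();

        public void checkOut(String userId) {
            // ConcurrentHashMap.merge applies the remapping function atomically per key.
            activeSessions.merge(userId, 1, Integer::sum);
        }
    }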
Prevention
The Root Cause Analysis (RCA) has been initiated to understand why these issues occurred.
In Closing
Finally, we sincerely apologize for the inconvenience this unprecedented event may have caused you.
We know how important the 3DEXPERIENCE platform SaaS is to our users and their businesses. We will make sure to learn from this event in order to maintain our customers’ trust and to continue improving the availability of our online services even further.
Need Assistance?
Our support team is here to help you make the most of our software. Whether you have a question, encounter an issue, or need guidance, we've got your back.