
TWID 06/29/2023

Server stability response

TWID
Andy

Server Stability

I'll copy and paste this section verbatim from the blog, because developer insights are important to hear directly and unsummarized.

First up, we have an update from the Engineering team:

The services that power Destiny 2 are a critical part of every player’s experience, and we are committed to improving stability and reliability long-term. Last month, we spoke about the backend upgrades we implemented to prepare for the year of Lightfall, and service stability issues that have recently affected players. Today, we’d like to dive into some of the technical details around Destiny 2’s service stability and provide a deeper layer of transparency about what we are doing to improve it throughout the upcoming Seasons.

In preparation for Lightfall and for future large releases, we invested in a wide effort to update our internal services with the long-term goal of improving scalability, maintenance, and levels of support for our service infrastructure. Specifically, we want to ensure that we have improved stability during moments or events of high player concurrency. This work resulted in the successful launches of both Lightfall and the Root of Nightmares World First Race. Both performed more smoothly than prior launches, and we expect that trend to continue for future releases. However, we’ve identified several issues with those changes that we are working to correct.

Improving our Infrastructure

When gameplay messages from Destiny 2 are received, they are sent to a key service called “Claims,” which then routes them onto the server that is responsible for your player data. This is an essential service for keeping the client and server in sync during every moment of gameplay. As you can imagine, this means Claims handles a tremendous amount of volume, routing every single kill, orb, or unit of Glimmer in Destiny 2 to the correct recipient.
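
To make the routing idea concrete, here is a minimal Python sketch of a service that forwards each gameplay message to the server holding that player's data. This is an illustration only and not Bungie's code: `GameplayMessage`, `owner_for`, `route`, and the server names are hypothetical, and choosing the owner by hashing the player ID is an assumption, since the blog does not say how ownership is assigned.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class GameplayMessage:
    player_id: str   # hypothetical identifier for the player who owns the data
    kind: str        # e.g. "kill", "orb", "glimmer"
    amount: int = 1


def owner_for(player_id: str, servers: list[str]) -> str:
    """Pick the server that holds this player's data.

    Uses a stable hash so the same player always maps to the same server
    while the server list is unchanged (an assumed scheme, for illustration).
    """
    digest = hashlib.sha256(player_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


def route(
    messages: list[GameplayMessage], servers: list[str]
) -> dict[str, list[GameplayMessage]]:
    """Group incoming messages by the server responsible for each player."""
    outbox: dict[str, list[GameplayMessage]] = {name: [] for name in servers}
    for message in messages:
        outbox[owner_for(message.player_id, servers)].append(message)
    return outbox
```

Calling `route(messages, ["world-a", "world-b", "world-c"])` returns one outgoing batch per server, with every message for a given player landing in the same batch and in arrival order.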

Ahead of Lightfall’s launch, a few improvements were made to the Claims service. We made updates to some of its underlying communication technology and made changes that allowed Claims to scale out to more servers so that during moments or events of high concurrency, the service could make use of those extra resources and avoid getting bogged down. While the updates achieved our scale goals, we discovered issues around the service’s error recovery functionality.

Normally, if Claims has its communication channels disrupted to other services, it is designed to automatically restore these connections. These disruptions can happen for a wide variety of reasons, including hardware failures, network hitches, or problems with other services. However, despite rigorous testing, the updated system is not always recovering as expected in our live game environment. If these channels are permanently disrupted, this can be one of the causes behind Weasel, Baboon, or other error codes for a large subset of the player base. In these cases, even a rolling restart of our Claims service is not always enough to restore the service. Instead, a full restart of our Destiny 2 services must be performed to restore the Claims system, which we are rapidly working to correct.
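
The "automatically restore these connections" behavior is commonly built as a retry loop with exponential backoff. The Python sketch below shows that general pattern under stated assumptions: `reconnect_with_backoff` and its parameters are hypothetical, and nothing in the blog confirms that Claims uses this exact approach.

```python
import random
import time
from typing import Callable


def reconnect_with_backoff(
    connect: Callable[[], bool],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Try to restore a channel, waiting longer after each failure.

    Returns True once `connect` succeeds, or False when every attempt fails,
    which is the point where a caller would have to escalate (for example,
    to a restart).
    """
    for attempt in range(max_attempts):
        try:
            if connect():
                return True
        except ConnectionError:
            pass  # treat a raised error the same as a failed attempt
        if attempt < max_attempts - 1:
            delay = min(max_delay, base_delay * (2 ** attempt))
            # Jitter keeps many clients from retrying at the same instant.
            sleep(delay * random.uniform(0.5, 1.0))
    return False
```

A `False` return corresponds to the situation described above, where the channel stays disrupted and something heavier than an automatic retry is needed.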

Fixing these Claims issues is the very top priority of our Services organization right now, but we must do it very carefully. Done incorrectly, we could unintentionally make stability for players worse or create new issues. This is not a process that can happen overnight, and we must make sure that as we make these fixes, we guarantee that your gameplay messages get routed reliably so that your Destiny 2 experience remains smooth and stable.

Although Claims is just one of many services which are receiving ongoing updates and maintenance to improve the stability and reliability of Destiny 2, the improvements we are making are a crucial step in both addressing the immediate issues players are experiencing as well as better equipping us to deal with any potential future issues that may arise.

The Road Ahead

To give everyone a clear understanding of our next steps to steadily improve game stability, we’ve developed a roadmap with key milestones over the next two Seasons:

  • Ongoing:

    • We will continue to make improvements to our production and deployment processes to reduce the risk of disruptions for players while reducing our maintenance and deployment downtime windows.
    • We will also constantly improve our response procedures for incident recovery to bring Destiny 2 back online as quickly as possible when incidents occur.
    • This work has been ongoing and will continue throughout this timeline.
  • Mid-Season 21 (update 7.1.5):

    • We will make targeted improvements to our logging and alerting systems, allowing us to diagnose issues more quickly with Claims and related systems.
    • These changes are designed to minimize the risk of further degrading stability, while helping us to confirm the effectiveness of fixes further out on the roadmap.
  • Season 22 Launch (update 7.2.0):

    • We are deploying a large set of improvements meant to improve the “self-healing” ability of Claims and reduce the odds of us needing to bring Destiny 2 temporarily offline when an issue occurs.

      • We are adding functionality for services to detect Claims services that are in an unhealthy state and send their messages to healthy services instead.
      • We are additionally making six targeted fixes to Claims systems where we have identified issues that could impact Claims stability or recovery.
      • We are making an improvement to better evict old gameplay messages in our pipelines, which should help with faster recoveries and reduce the chance of a “death spiral” of slow messages causing more slow messages.
    • We are also deploying improvements that will help us to make Destiny 2 services even more robust in the future.

      • We are adding improved support for targeted “Chaos Testing” against our services, allowing us to better simulate different failure modes for our services.
      • We are adding more logging for non-Claims portions of our messaging pipelines to detect other issues that could lead to connection problems for players.
    • Finally, to reduce the odds of introducing new problems with these changes, we are also updating and expanding our Claims Unit Tests. This is automated testing that verifies code is behaving the way we expect it should.

  • Season 23 Launch (update 7.3.0):

    • Based on the results of our 7.2.0 updates and improved logging, we will be targeting deeper and broader architectural improvements to improve the service stability and rapid recovery of Destiny 2, which will include a range of additional improvements.
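
Two of the Season 22 items in the roadmap above, sending messages only to healthy Claims services and evicting old messages, can be sketched together in Python. This is an assumption-heavy illustration and not Bungie's design: `Dispatcher`, `Instance`, `QueuedMessage`, the `max_age` threshold, and the round-robin choice are all hypothetical.

```python
import time
from collections import deque
from dataclasses import dataclass, field


@dataclass
class QueuedMessage:
    payload: str
    enqueued_at: float  # seconds, from the same clock the dispatcher uses


@dataclass
class Instance:
    name: str
    healthy: bool = True
    delivered: list = field(default_factory=list)


class Dispatcher:
    """Sends queued messages to healthy instances and drops stale ones."""

    def __init__(self, instances, max_age=5.0, clock=time.monotonic):
        self.instances = instances
        self.max_age = max_age  # assumed cutoff, in seconds
        self.clock = clock
        self.queue = deque()
        self.evicted = 0
        self._next = 0

    def submit(self, payload):
        self.queue.append(QueuedMessage(payload, self.clock()))

    def _pick_healthy(self):
        # Round-robin over instances, skipping any marked unhealthy.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next % len(self.instances)]
            self._next += 1
            if candidate.healthy:
                return candidate
        return None

    def drain(self):
        now = self.clock()
        while self.queue:
            if now - self.queue[0].enqueued_at > self.max_age:
                # Evict stale work so a backlog cannot keep slowing new messages.
                self.queue.popleft()
                self.evicted += 1
                continue
            target = self._pick_healthy()
            if target is None:
                break  # nothing healthy: keep messages queued for a later drain
            target.delivered.append(self.queue.popleft().payload)
```

Dropping messages past `max_age` is what keeps a recovering system from spending its first minutes on work nobody is waiting for anymore, which is the "death spiral" the roadmap mentions.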

We hope this sheds new light on what’s been causing some recent stability issues, the methods we’re using to track them, what we’re doing to improve game stability in the short/medium/long term, and when to expect those updates to go live.

Please note that some of these changes will be structural in nature and could introduce additional instability as we initially roll them out. As always, we will work to minimize this risk with deep testing and around-the-clock monitoring, and will implement the most appropriate response needed to get everyone back online as quickly as possible when instability occurs. Our goal is to give Destiny 2 players the best possible experience, and we look forward to seeing the results of the work we’ve been doing since Lightfall’s launch to improve stability. Thank you for your patience and feedback as we work toward achieving the milestones above.

Lucky Week

  • Starting at Tuesday reset, exotic fish will have a doubled catch rate across every fishing zone.

The Witness Cutscene

  • The origins cutscene for the Witness that was available in game last week is now available on YouTube here
  • This cutscene will be available after this year of seasons concludes.

Prime Gaming Extension

  • The current Prime Gaming rewards are receiving an extension and will be available through July 19th. You can claim these rewards here if you have an Amazon Prime account.
  • See the rewards below: Prime Gaming Rewards

Community Emblem Competition

  • Bungie is holding its first-ever emblem competition; the winning design will be the featured emblem in a future charity campaign.
  • Submissions should be posted to Twitter, tagged with the hashtag #Emblematic.
  • Submissions will be open from today until July 13, with the winner being announced on July 20.
  • The following guidelines apply:
    • Nameplate dimensions are: 474 x 96 px. (Other components of full emblem design (the icon and the header) are not required for submission.)
    • There are no theme or color restrictions. The world is your oyster.
    • Art must be 2D and non-animated.
    • Submissions must include the hashtag #Emblematic to be included in the competition.
    • Branding, logos, or any third-party intellectual property cannot be included.
    • Any hateful or explicit imagery will lead to immediate disqualification.
    • This is not a popularity contest. All submissions will be presented to the voting panel without identifying information about the original artist.
    • All submissions are subject to this policy: https://www.bungie.net/7/en/Legal/submittingidea

 

Did we miss anything? Let us know on Twitter or in the #tips channel on our Discord.

