Astria Devnet: Dusk-2 Postmortem & Releasing Dusk-3


Jan 22, 2024

On December 26, 2023, the Astria devnet (dusk-2) experienced an unexpected shutdown, which required us to turn down the network and spin up the third devnet (dusk-3). 

Our team is committed to transparency and accountability, and this post aims to provide a detailed overview of the events and our response, as well as steps we are taking to prevent similar occurrences in the future. Furthermore, it outlines key changes that accompany dusk-3.

Timeline of Events

  • On December 26 at 17:46 UTC, an automated monitor on the Astria shared sequencer RPC reported a 503 error and raised the incident. The network halted.

  • At 17:47 UTC, the incident was acknowledged by on-call and response began.

  • At 18:11 UTC, the initial response found that a single node in the sequencer validator rotation had gone offline and that its automated restart had failed. An issue with blocksync after restart was identified, patched, and shipped. The network moved forward 1 block after the restart before halting again.

  • At 23:25 UTC, communications were sent via social media to share that there were issues with the network.

  • On December 27 at 00:43 UTC, we decided to shut down the network and investigate once the core devs returned from the holiday break.

Between January 2 and 4, as the core devs came back online:

  • A non-deterministic bug was found related to the execution of multi-proposal rounds.

  • A fix was implemented, and the network showed initial signs of recovery. However, one of the three validating nodes was found to have invalid app state related to the previous non-determinism. The node was promptly cleaned and began syncing.

  • Efforts to sync the full nodes were undertaken, but syncing halted at block height 921,976. Further investigation showed that the consensus reached on that block was incorrect due to the non-determinism.

  • Validation confirmed the non-deterministic bug, prompting the decision to start a new network with enhanced releases.

Our Response and Next Steps

To remain transparent about how the network is running at any given time, we have set up a status page and incident response plans to speed up communication when issues do occur.

New Devnet & Infrastructure Upgrade

The new network, dusk-3, comes with the following improvements: 

  • The issue was exacerbated because dusk-2 ran only 3 validating nodes, so a single node going down was enough to halt the network. To remedy this, new devnets will run a minimum of 4 nodes.

  • Improved stability and security measures (Fixes 1, 2, 3)
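
The move from 3 to 4 validating nodes can be sanity-checked with a little arithmetic. The sketch below is illustrative only. It assumes every validator has equal voting power and that a block commits only when strictly more than two-thirds of the validators are online and voting, as in Tendermint-style BFT consensus. The function names are hypothetical and are not part of any Astria API.

```python
def can_commit(total_validators: int, online_validators: int) -> bool:
    """Return True if the online validators hold strictly more than 2/3 of the vote.

    Assumes equal voting power per validator (an assumption, not a
    statement about how any particular devnet is configured).
    """
    # Integer form of: online / total > 2 / 3
    return 3 * online_validators > 2 * total_validators


def tolerated_failures(total_validators: int) -> int:
    """Largest number of offline validators that still allows blocks to commit."""
    return (total_validators - 1) // 3


for n in (3, 4):
    print(
        f"{n} validators: tolerates {tolerated_failures(n)} offline, "
        f"commits with one offline: {can_commit(n, n - 1)}"
    )
```

Under these assumptions, 2 of 3 validators hold exactly two-thirds of the vote, which falls short of the strict threshold, so losing one node stops block production. With 4 validators, 3 of 4 is above the threshold, so the network can keep moving with one node offline.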

Although the RaaS remains disabled for now, rollups can still be deployed on our devnet. Visit the docs to get started.

