Astria Devnet: Dusk-2 Postmortem & Releasing Dusk-3

Eshita

Jan 22, 2024

On December 26, 2023, the Astria devnet (dusk-2) experienced an unexpected shutdown, which required us to turn down the network and spin up the third devnet (dusk-3). 

Our team is committed to transparency and accountability, and this post aims to provide a detailed overview of the events and our response, as well as steps we are taking to prevent similar occurrences in the future. Furthermore, it outlines key changes that accompany dusk-3.


Timeline of Events

  • On December 26 at 17:46 UTC, an automated monitor on the Astria shared sequencer RPC reported a 503 error and raised the incident. The network halted.

  • At 17:47 UTC, the incident was acknowledged by on-call and response began.

  • At 18:11 UTC, the initial response found that a single node in the sequencer validator rotation had gone offline and that its automated restart had failed. An issue with blocksync after restart was identified, patched, and shipped. The network moved forward 1 block after the restart before halting again.

  • At 23:25 UTC, communications were sent via social media to share that there were issues with the network.

  • On December 27 at 00:43 UTC, a decision was made to shut down the network and investigate once the core devs returned from the holiday break.


Between January 2 and 4, as the core devs came back online:

  • A non-deterministic bug was found related to the execution of multi-proposal rounds.

  • A fix was implemented, and the network showed initial signs of recovery. However, one of the three validating nodes was found to have invalid app state related to the previous non-determinism. The node was promptly cleaned and began syncing.

  • Efforts to sync the full nodes were undertaken, but syncing halted at block height 921,976. Further investigation showed that the consensus reached on that block was incorrect due to the non-determinism.

  • Validation confirmed the non-deterministic bug, prompting the decision to start a new network with enhanced releases.
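One way non-determinism of this kind tends to surface is as nodes reporting different application state hashes at the same block height. The following is a minimal sketch of such a comparison, not the tooling used during this incident. The function name, node names, and hash values are hypothetical, and how the hashes are collected from each node is left out.

```python
from collections import Counter


def find_divergent_nodes(app_hashes: dict[str, str]) -> list[str]:
    """Return the nodes whose app hash differs from the most common hash.

    `app_hashes` maps a node name to the app hash that node reports for one
    fixed block height. All names and values here are illustrative.
    """
    if not app_hashes:
        return []
    # The most frequently reported hash is used as the reference value.
    reference_hash, _ = Counter(app_hashes.values()).most_common(1)[0]
    return sorted(
        name for name, app_hash in app_hashes.items() if app_hash != reference_hash
    )


# Hypothetical hashes reported by three validators at the same height.
reported = {
    "validator-0": "AB12",
    "validator-1": "AB12",
    "validator-2": "FF00",
}
divergent = find_divergent_nodes(reported)
```

Note that the most common hash is only a reference point. If the majority itself executed non-deterministically, agreement among nodes does not prove the state is correct, so a check like this can flag divergence but cannot decide which side is right.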


Our Response and Next Steps

To remain transparent about how the network is running at any given time, we have set up a status page and incident response plans to speed up communication when issues occur.


New Devnet & Infrastructure Upgrade

The new network, dusk-3, comes with the following improvements: 

  • The issue was exacerbated because dusk-2 ran only 3 validating nodes, so a single node going down halted the network. To remedy this, new devnets will run a minimum of 4 nodes.

  • Improved stability and security measures (Fixes 1, 2, 3)
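The move from 3 to 4 validating nodes described above can be illustrated with a short arithmetic sketch. It assumes equally weighted validators and the quorum rule common to Tendermint-style BFT consensus, where a block needs votes from strictly more than two-thirds of the voting power. Both are assumptions made for illustration rather than details stated in this post, and the function names are hypothetical.

```python
def has_quorum(online: int, total: int) -> bool:
    """True if `online` of `total` equally weighted validators hold
    strictly more than 2/3 of the voting power (assumed quorum rule)."""
    # Integer form of online / total > 2 / 3, avoiding floating point.
    return 3 * online > 2 * total


def max_offline_tolerated(total: int) -> int:
    """Largest number of validators that can be offline while the
    remaining ones still form a quorum."""
    offline = 0
    # Keep taking one more validator offline while the rest still have quorum.
    while offline < total and has_quorum(total - offline - 1, total):
        offline += 1
    return offline


for total in (3, 4, 7):
    print(f"{total} validators tolerate {max_offline_tolerated(total)} offline")
```

Under these assumptions, 2 of 3 validators hold exactly two-thirds of the voting power, which does not exceed the threshold, so a 3-node network tolerates no outage. With 4 nodes, the remaining 3 hold three-quarters of the voting power, so one node can go offline without halting the network.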

Although the RaaS remains disabled for now, rollups can still be deployed on our devnet. Visit the docs to get started.


For the latest updates, follow us on Twitter. Join our community Discord to share any feedback.

© 2024. All Rights Reserved. Astria.org
