[Nectar Notice] [monash-01] Unscheduled Outage - Hypervisor Hardware Failure - rccomdc3rc12-01

Posted almost 4 years ago by Shahaan Ayyub

Dear User,

This notice concerns the investigation into the hardware failure on hypervisor rccomdc3rc12-01 that occurred on 29 August 2022. The investigation is complete. Unfortunately, both disks, which were configured as a redundant RAID 1 pair and hosted the root and ephemeral disks of the instances, have failed.

Impact

As a result, we are unable to recover the data on those drives. All data on the instances' OS (root) disks and ephemeral disks is lost. However, any data stored on persistent volume storage is still intact.

What should I do?

We advise users to:

Delete the old VMs (snapshots are not possible).
Create new VMs.
Attach any previously attached volumes to the newly created VMs.
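
For users who prefer the command line, the steps above can be sketched with the standard OpenStack client. The short Python helper below only builds and prints the commands in order; it does not run them. The function name and all server, flavor, image, key and volume values are hypothetical placeholders, and the sketch assumes you have already noted which volumes were attached (for example from the output of "openstack volume list"). Options such as network, security groups and availability zone are omitted and may be needed for your project.

```python
import shlex


def recovery_commands(old_server, new_server, flavor, image, key_name, volume_ids):
    """Return the openstack CLI commands, in order, for replacing one lost VM.

    All arguments are placeholders supplied by the caller; nothing is executed.
    """
    commands = [
        # Step 1: delete the old VM whose root and ephemeral disks were lost.
        ["openstack", "server", "delete", "--wait", old_server],
        # Step 2: create the replacement VM.
        [
            "openstack", "server", "create",
            "--flavor", flavor,
            "--image", image,
            "--key-name", key_name,
            "--wait",
            new_server,
        ],
    ]
    # Step 3: re-attach each previously attached persistent volume.
    for volume_id in volume_ids:
        commands.append(
            ["openstack", "server", "add", "volume", new_server, volume_id]
        )
    return commands


if __name__ == "__main__":
    # Hypothetical example values; replace them with your own.
    plan = recovery_commands(
        old_server="my-old-vm",
        new_server="my-new-vm",
        flavor="EXAMPLE_FLAVOR",
        image="EXAMPLE_IMAGE",
        key_name="EXAMPLE_KEY",
        volume_ids=["EXAMPLE_VOLUME_ID"],
    )
    for command in plan:
        print(shlex.join(command))
```

Review the printed commands before running any of them, since deleting a VM cannot be undone.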

Improvement

We have treated this catastrophic hardware failure as a major incident and put measures in place to prevent a recurrence. One of the major improvements is moving away from local-disk-backed storage in the infrastructure for all new instances.

Your new VMs will land on the new HCI-based infrastructure. The new hypervisors are powered by AMD EPYC processors, which provide significantly improved performance and security. Additionally, this infrastructure offers higher availability and failproof evacuation and live migration of VMs in the event of hypervisor failures.

We sincerely apologize for any inconvenience caused by this outage. Please be aware that the research cloud compute infrastructure is not designed to be highly available. Users should take this opportunity to ensure that appropriate backup strategies are in place for critical data and services.

Kind regards,

Monash Nectar Research Cloud Team
