[COMPLETE][Scheduled outage] monash-02 data-centre move & brief instance downtime (Feb - Mar 2017)
Monash University is establishing a modern environment for future research computing needs. To do this, Monash eSolutions & eResearch are moving some eResearch infrastructure to a new data centre. The monash-02 availability-zone of the research cloud will be impacted by this activity.
During February and March 2017, users with research cloud server instances in the monash-02 AZ will experience a brief downtime for each instance (expected to be no more than 1 hour) as it is offline-migrated to the new data-centre. Furthermore, users will be unable to create new server instances in the monash-02 AZ from Friday the 10th of February until the end of the relocation (if you have an urgent need to do so, please contact Monash Research Cloud support via the helpdesk or email).
GPU and high-memory instances cannot be migrated in this way and will experience a longer outage of approximately 2 days; their downtime will depend on physical relocation activities.
There may also be periods of degraded volume storage performance; however, the volume service will remain operational throughout.
The specific timing of the outage for individual instances will be confirmed as prerequisite works are completed. Impacted users are encouraged to subscribe to this forum thread (https://support.ehelp.edu.au/discussions/topics/6000045078/) to receive updates. In particular, the approximate timing of downtime for each affected server will be posted and updated there.
Further details and updates, particularly regarding other impacted services, can be found on the Monash eSolutions website. This announcement will be updated as new information becomes available so you are encouraged to follow/subscribe for updates.
Planning for minimised impact / what do I need to do
Whilst a great deal of effort is going into minimising interruption to users, we must acknowledge that there will be heightened operational risk to monash-02 over this period. If you are running critical services that can be relocated to another availability-zone, you are encouraged to do so. You should also ensure you have snapshots and/or other backups of any important services and data.
If you have dependent services running in monash-02 and have a preference for a concurrent outage of all instances (e.g. a cluster), or conversely, a non-concurrent outage of a particular set of instances (e.g. load-balanced workers or DR standbys), please contact support to provide details.
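For users who do choose to relocate a service, the move can be scripted with the standard OpenStack CLI. The sketch below is a dry run (each command is echoed rather than executed), and the snapshot name, flavour and target availability-zone are hypothetical placeholders; substitute your own values and remove the leading `echo` to run it for real.

```shell
# Hypothetical names -- substitute your own snapshot, flavour and target AZ.
SNAPSHOT="my-service-snapshot"
FLAVOR="m1.medium"
NEW_AZ="melbourne-qh2"   # any availability-zone other than monash-02

# Dry run: remove the leading 'echo' to actually boot a replacement
# instance from the snapshot in the new availability-zone.
echo openstack server create \
  --image "$SNAPSHOT" \
  --flavor "$FLAVOR" \
  --availability-zone "$NEW_AZ" \
  my-service-relocated
```

Note that a relaunched instance will receive a new IP address, so update any DNS records or client configuration pointing at the old one.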
Support / questions
Please follow/subscribe to this forum (click the star button or press the 'w' key) to be notified of updates to this announcement - we will use it to communicate updated information and inform users of any related incidents.
Please direct all support requests related to the data-centre move and the monash-02 research cloud availability-zone to the Nectar eHelp desk (firstname.lastname@example.org).
- Q: Will IP addresses change when instances are moved to the new data-centre?
- A: No
- Q: Should I shut down my own instance before the outage, or will it be done for me?
- A: We will be unable to provide exact timing (to the hour) for each instance, so Monash admins will shut down any instance running at the time of its migration. However, users are encouraged to shut down their own instances to ensure a clean shutdown.
- Q: How will I know when my instance/s are ready?
- A: After migration, instances will be returned to the state they were in prior to migration (i.e. active or shutoff), and users will be contacted by support to confirm the instance is ready to use.
- Q: Do I need to backup my data?
- A: If you have important data on any research cloud infrastructure then you should already be backing it up: the research cloud has no implicit backup regime for e.g. instance virtual disks or volumes, and users are entirely responsible for their own data-protection measures. No instance storage data loss is expected in the migration; however, before the outage we recommend you take a snapshot of all impacted instances (note that a snapshot will not include data stored on secondary, non-root, ephemeral drives - these must be backed up by other means).
- Q: What is a "high memory instance"?
- A: Any instance type with more than 64GB of RAM. The standard publicly visible set of m1.* and m2.* instance types are not considered high memory.
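The clean-shutdown and snapshot steps from the answers above can be sketched with the standard OpenStack CLI. The instance name below is a hypothetical placeholder, and the commands are shown as a dry run (echoed rather than executed) so the sketch can be adapted safely to your own project.

```shell
# Hypothetical instance name -- substitute your own.
INSTANCE="my-instance"
SNAPSHOT="${INSTANCE}-pre-migration"

# Dry run: remove the leading 'echo' on each line to execute against
# your project. First shut the instance down cleanly, then snapshot it
# (the snapshot is stored as an image, which can later be booted from).
echo openstack server stop "$INSTANCE"
echo openstack server image create --name "$SNAPSHOT" --wait "$INSTANCE"
```

Remember that, as noted above, a snapshot captures only the root disk; data on secondary ephemeral drives needs a separate backup.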
The majority of monash-02 research cloud instances have now been migrated to new equipment in Monash's new datacentre. This was made possible thanks to a strategic investment from Monash University, made in order to minimise the impact of the overall datacentre move on the research community.
During the first wave of migrations an issue was discovered with the temporary cross-datacentre networking configuration that is in place to facilitate the migration. This issue particularly impacted instances using volume storage and resulted in those instances being unable to boot in the new datacentre. At this point migrations were halted and some impacted instances were rolled back to the old datacentre.
Since then we have been working closely with our network hardware and software vendors to identify the root cause of the issues and are pleased (read: relieved) to announce that these have been identified and a workaround implemented.
The schedule has been updated with the current state, and all not-yet-migrated instances have new migration dates. Specifically, all remaining hosts other than GPU and high-memory hosts will be migrated tomorrow (Thursday the 2nd of March). The remaining instances with hardware-specific dependencies will be physically migrated on Friday the 3rd of March.
Please note the schedule now has a filter set so that already migrated instances are no longer displayed. If you wish to see these please go to: "Data -> Filter views... -> Create new temporary filter view".
A draft schedule of impact is now available here:
- highlighted lines indicate servers supported by the Monash Research Cloud team (where we will be managing restoration of any impacted services)
- the minimum listed downtime is 1 hour; however, where possible servers will be migrated without any downtime
- GPU and high memory servers will be offline for approximately two days as their physical servers must be relocated