[Complete] NSP XenServer Upgrade - Saturday 05 December between 5:00am and 7:00pm AEDT
Posted about 9 years ago by Nick Golovachenko
Maintenance Work: Upgrade - XenServer 6.2 to XenServer 6.5
Affected Zone: NSP-NP, NSP-Q2 (all NSP zones)
Date: Saturday 05/12/2015
Scheduled Maintenance Window: 5:00 AM - 7:00 PM AEDT. NB: The estimated impact to running VMs will be only 1 to 2 seconds.
Note: This change involves a small network outage with low impact and potentially no downtime for users.
Impact to users:
With assistance from the vendor, this change has been assessed as low risk, with no impact expected on running virtual machines during its application. During the change window, users may experience an outage due to a live migration failure; this may affect approximately 1% of users.
If your machine does fail to live migrate, NSP support staff will attempt to cleanly restart your VM on another server.
You are not required to do or change anything during this time, as your VMs will continue to run as normal.
Maintenance Work: Network Link Aggregation - doubling AARNet4 bandwidth to 20Gb/s
Affected Zone: NSP-NP, NSP-Q2 (ALL NSP) - 115.146.80.0/20, 103.6.252.0/22
Date: 05/12/2015
Impact to users:
The upgrade will double our bandwidth to 20Gb/s by combining our existing 10Gb/s link with our current (but unused) RDSI-provided 10Gb/s AARNet4 link. During this time, internet connectivity to all NSP instances will be lost.
You are not required to do or change anything during this time; your VMs will continue to run as normal, but without internet connectivity.
XenServer Details
Current Version: 6.2 SP1 Patch 1025
Proposed Version: 6.5 SP1 Patch 1004
XenServer 6.5 was released in January 2015 and was available in pre-release form under the project name Creedence. Citrix recommends that all new XenServer installations use 6.5 and that existing deployments upgrade to 6.5. XenServer 6.2 was originally released in June 2013 and was the first release of XenServer under the open source initiative. While XenServer 6.2 remains a platform supported by Citrix, existing users are encouraged to upgrade to XenServer 6.5 at their earliest convenience.
During this window we will be upgrading the XenServer hypervisor to the latest stable version. The upgrade addresses a low-memory issue in the 32-bit XenServer Dom0 domain by moving to a 64-bit kernel.
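As a quick sanity check, the Dom0 kernel architecture can be confirmed from the host console before and after the upgrade (a minimal sketch; the expected outputs are an assumption based on 6.2's 32-bit Dom0 and 6.5's 64-bit Dom0):
uname -m     # expected to report i686 on the XenServer 6.2 Dom0 and x86_64 after the upgrade to 6.5
free -lm     # on 6.2, the "Low" row shows how little low (kernel) memory the 32-bit Dom0 has to work with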
Change procedure
Each XenServer hypervisor host will be placed into maintenance mode, and its virtual machines will automatically be live migrated to another XenServer host.
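For reference, the per-host steps look roughly like the following (a sketch using the standard xe CLI; the host UUID is a placeholder and the actual upgrade step depends on the installation media used):
xe host-disable uuid=<host-uuid>     # take the host out of service (maintenance)
xe host-evacuate uuid=<host-uuid>    # live migrate all running VMs to other hosts in the pool
# ...apply the XenServer 6.5 upgrade and reboot the host...
xe host-enable uuid=<host-uuid>      # return the upgraded host to service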
The upgrade has been successfully applied, tested, and documented on the NSP test system, and includes a rollback procedure.
Improvements:
1. All 64-bit, avoiding the low kernel-space memory issue we have seen in version 6.2 causing OOM conditions and instability.
2. XenServer 6.5 includes the latest Open vSwitch version, OVS 2.1.3, which supports megaflows. Megaflows reduce the number of entries required in the flow table for most common situations and improve Dom0's ability to handle many server VMs connected to a large number of clients. We have seen a large number of packet flows in 6.2; this is resolved in 6.5.
3. XenServer 6.5 contains a new DVSC version from Nicira (DVSC-Controller-37734.1) and includes platform-related security fixes (for example, OpenSSL and Bash Shellshock).
4. To meet dynamic capacity requirements, you may wish to add capacity to the storage array and increase the size of the LUN provisioned to the XenServer host. The Live LUN Expansion feature allows you to increase the size of the LUN without any VM downtime (see the sketch after this list).
5. Significant performance improvements (reference: https://www.citrix.com/blogs/2015/01/13/xenserver-6-5/); network drivers and storage HBA drivers are now included in the kernel.
These changes, as well as a rollback, have been successfully tested on the NSP test cluster.
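As an illustration of item 4, a minimal sketch of how a live LUN expansion is expected to look from the CLI once the LUN has been grown on the storage array (the SR name and UUID are placeholders; the exact behaviour should be confirmed against the XenServer 6.5 documentation):
xe sr-list name-label="<SR name>" --minimal               # find the SR UUID
xe sr-scan uuid=<sr-uuid>                                 # rescan so the SR picks up the enlarged LUN
xe sr-param-get uuid=<sr-uuid> param-name=physical-size   # confirm the new size is visible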
Issues resolved by installing XenServer 6.5 and CloudPlatform 4.5.1:
SR#70144254, 70241942, 70283558 - Live storage migration and associated problems.
Based on the logs we have collected, the following problems related to storage live migration and snapshotting are evident:
The pool displays the problem identified and resolved in CS-42031 (CCP 4.5.0.0 HF 5, refer to http://support.citrix.com/article/CTX142584; also included in 4.5.1).
The problem occurs when a scheduled snapshot overlaps the duration of the actual copy to secondary storage. This results in an SR marked with a cross remaining attached to the pool, interfering with the SR scan and garbage collection processes, and is identified by messages like the following in the SMlog:
Mar 19 15:26:24 qh2-nsp01 SMGC: [14862] SR ed3c ('14a68eb4-62aa-4beb-91d5-e5070ecd46a9/var/cloud_mount/b9d64979-0c47-31fe-a496-6943e9dfca63/snapshots/41/1134') (15 VDIs in 15 VHD trees): no changes
Mar 19 15:26:24 qh2-nsp01 SM: [14840] lock: released /var/lock/sm/e72886f4-8fd5-4623-aa75-94f8d4bea4c2/sr
Mar 19 15:26:24 qh2-nsp01 SM: [14840] * sr_scan: EXCEPTION XenAPI.Failure, ['INTERNAL_ERROR', 'Db_exn.Uniqueness_constraint_violation("VDI", "uuid", "5a7a6434-c3c1-4295-b9c9-83e9d325c097")']
Errors regarding live storage migration: failure to remove an unused VDI, fixed in XS62ESP1011 (ref CA-126097):
Jun 2 11:04:40 qh2-nsp02 SM: [27365] FAILED in util.pread: (rc 5) stdout: '', stderr: ' Can't remove open logical volume "VHD-c739e89f-9791-4ea6-81d2-59c121ab2a52"
Jun 2 11:04:40 qh2-nsp02 SM: [27365] '
Jun 2 11:04:40 qh2-nsp02 SM: [27365] *** lvremove failed on attempt #0
Jun 2 11:04:40 qh2-nsp02 SM: [27365] ['/usr/sbin/lvremove', '-f', '/dev/VG_XenStorage-dc0e0a15-a520-9861-edb2-a808fb095df5/VHD-c739e89f-9791-4ea6-81d2-59c121ab2a52']
However, after installing hotfix 11, live migration might still fail with the error bridge_not_available appearing in the xensource logs. Hotfix 16 for SP1 resolves that problem: http://support.citrix.com/article/CTX141779
Another possible reason for VM migration to fail is the error SR_BACKEND_FAILURE_80, which may occur during live migration under high I/O utilization of the storage and can leave the original disk intact even though the migration succeeds. In CloudPlatform logs, this failed migration is recorded as follows:
2015-03-19 00:03:13,820 ERROR [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-148:ctx-69de1066 job-135107/job-149952 ctx-4194bf22) Invocation exception, caused by: com.cloud.utils.exception.CloudRuntimeException: Source and destination host are not in same cluster, unable to migrate to host: 178
2015-03-19 00:03:13,820 INFO [c.c.v.VmWorkJobHandlerProxy] (Work-Job-Executor-148:ctx-69de1066 job-135107/job-149952 ctx-4194bf22) Rethrow exception com.cloud.utils.exception.CloudRuntimeException: Source and destination host are not in same cluster, unable to migrate to host: 178
This is addressed in hotfix XS62ESP1031: http://support.citrix.com/article/CTX201763 - "When migrating VMs by using Storage XenMotion, the virtual disks are not deleted."
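To confirm which of these 6.2 hotfixes are already present before the upgrade, the standard xe CLI can be queried (a sketch; filtering by name-label is illustrative and the patch names are those referenced above):
xe patch-list name-label=XS62ESP1011     # shows the patch record if it is known to the pool
xe patch-list name-label=XS62ESP1031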
In CloudPlatform, cross-storage migration could also fail due to a problem serializing API requests to the source and destination pools; this was addressed in HF 5 for 4.5.0.0 and in CCP 4.5.1: http://support.citrix.com/article/CTX142584
CS-41703: Problem: In a setup with multiple clusters, live migration of a volume attached to a running VM from one primary storage to another may fail if the request is forwarded from one Management Server to another Management Server.
Root Cause: This occurs when the request is placed on the first Management Server and has to be forwarded to a second Management Server because the host is managed by that second Management Server. CloudPlatform fails to serialize the request because it contains a map of non-native data types.
Solution: Fixed by updating the code to use a list of pairs instead of a map of non-native data types. This also fixes the issue for migrating a virtual machine with attached volumes.
Finally, if large volumes are snapshotted and backed up, the operation can fail if it exceeds the default timeouts (6 hours) and leave the snapshot status in an inconsistent state (for example, stuck in the "BackingUp" state).
To address this, the following CloudPlatform global settings (stored in the MySQL database) need to be adjusted:
job.cancel.threshold.minutes - value is in minutes; can be increased to 720 (12 hours) or 1440 (24 hours)
backup.snapshot.wait - value is in seconds; can be increased to 43200 (12 hours) or 86400 (24 hours)
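These can normally be changed from the CloudPlatform Global Settings page; the sketch below does the same thing directly against the database, assuming the default cloud database and user names (a management server restart is usually required before new values take effect, and the service name may differ):
mysql -u cloud -p cloud -e "UPDATE configuration SET value='720' WHERE name='job.cancel.threshold.minutes';"
mysql -u cloud -p cloud -e "UPDATE configuration SET value='43200' WHERE name='backup.snapshot.wait';"
service cloudstack-management restart     # assumed service name; restart so the new values are picked up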
Other related issues addressed by this upgrade:
Backend snapshot bug causing intermittent failures (tapdisk not closed, requiring manual intervention).
[Automation] Scale VM failed with the error "Unable to serialize". This issue was fixed in the 4.3.0.2 release (bug ID CS-25817); release notes: http://support.citrix.com/article/CTX141663
XS62ESP1031 - "When migrating VMs by using Storage XenMotion, the virtual disks are not deleted" (the SR_BACKEND_FAILURE_80 scenario described above): http://support.citrix.com/article/CTX201763
CLOUDSTACK-7833 - VM async work jobs log "Was unable to find lock for the key vm_instance" errors.
CLOUDSTACK-5356 - XenServer: failed to create a snapshot when the secondary store was made unavailable.
This change is part of a series of fixes that will improve the stability and quality of the NSP service.
If you have any concerns or questions about the upgrade, please contact us at support@nsp.nectar.org.au.