Troubleshooting SSH access to a NeCTAR instance

Modified on Thu, 3 Apr at 1:36 PM

Introduction

There are a number of possible causes for SSH issues, each of which requires a different solution. This page will help you troubleshoot these issues. It assumes that you have already attempted to follow the instructions for new users on how to use SSH, and how to launch a NeCTAR instance.

Note: the following does not apply to accessing your instance via VNC; for that, go to this tutorial instead.

A quick checklist

Before we launch into the full diagnostic procedure, here is a quick checklist of common problems:

Have you set up the instance’s security groups?
Did you supply / select a key-pair when launching the instance?
Are you using the correct private key (and is this spelled correctly in your command) when you connect?
Are you using the correct user account name?
Are you SSH’ing to the correct hostname or IP address for the instance?
Is the instance running? (Check what the Dashboard says.)
For MacOS/Linux, if you set up / modified the user account’s “~/.ssh” directory and contents by hand, did you remember to set the directory and file modes?

Diagnostic key

The following “key” relates the primary symptom that you observe when attempting to SSH to an instance, and the possible problems that could cause this.

Symptom / message	Explanation	Possible problems
Connection timed out	When your SSH client attempted to connect to the server, there was no response.	1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 25
Connection refused	When your SSH client attempted to connect to the server, the server refused the connection without even trying to check your credentials.	10, 11, 12, 25, 27
No route to host	When your SSH client attempted to connect, the network where the instance resides reported that the IP address is not active.	3, 4, 25, 27
No route to network	When your SSH client attempted to connect, your local network or an intermediate network said that it could not find a way to the network where the IP would be found.	1, 2, 25
Connection reset by peer	Your SSH client established a network connection to the remote instance, but it was closed unexpectedly by the remote SSH daemon.	9, 25
Warning: Remote host identification has changed	Your SSH client established a network connection to the remote instance, and started the SSH negotiation. The remote SSH daemon has sent a hostkey that is different to the one that the client-side has recorded as the expected hostkey.	22, 25
Warning: Unprotected private key	The SSH client has detected that your local key storage is insecure. The client has not attempted to connect at all.	18, 25
Unable to negotiate with <IP> port <PORT> no matching host key type found. Their offer: ssh-dss	The remote SSH server is only offering a DSA host key. The SSH client will not accept this and has abandoned the attempt to connect.	23, 25
Permission denied (publickey, ...).	The remote SSH daemon has not accepted the supplied key as valid for the account name you used, and has refused your login.	10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 24, 25
Asks for a password unexpectedly	The remote SSH daemon has not accepted the supplied key as valid for the account name you used. It has fallen back to asking for a password. (There should be no password that you can give at this point.)	10, 11, 12, 13, 14, 15, 16, 17, 19, 20, 21, 24, 25
Unable to use key file ... OpenSSH SSH-2 private key (old PEM format)	You are attempting to use the wrong kind of SSH private key with Putty	26

Problems and solutions

Problem: Your PC / laptop machine is not connected to the network (#1)

Explanation: If your system is not properly connected to the local network (at your end) then you will not be able to connect to a NeCTAR instance.
Symptoms: Various networking related errors are possible, including connection timeouts and “no route to network”.
Diagnosis: Use your web browser to see if you can use local services (e.g. your university website) and sites on the public internet (e.g. Google). If any of these work, this is not the problem. (There are various techniques you can use for isolating the problem, but it is complicated and beyond the scope of this document.)
Solution: Check the status of your PC’s physical and/or wireless network connections. Disconnecting and reconnecting the network, or rebooting may help. If that fails, then escalate to your local IT support, or to your ISP. (This is not a NeCTAR problem.)

Problem: Your local (e.g. institutional) network is down or there are networking problems on AARNET (#2)

Explanation: If there are local networking problems (at your end) then you may be unable to connect to a NeCTAR instance.
Symptoms: Various networking related errors are possible, including connection timeouts and “no route to network”.
Diagnosis: As for the previous problem. If you have some knowledge of your local network topology you can get a rough idea of what is broken based on which sites you can contact.
Solution: There probably isn't anything you can do to solve the problem. It is worth checking for network outage notices / notifications in the relevant sites, provided that you can connect to them. Alternatively, talk to your local IT support or ISP.

Problem: There is a networking outage in NeCTAR itself (#3)

Explanation: If there are networking issues in the data centre where your NeCTAR instance runs, or in the centre's links to AARNET, then you may be unable to connect to the instance.
Symptoms: Various networking related errors are possible, including connection timeouts and “no route to network”.
Diagnosis: As for the previous problem. If you have some knowledge of NeCTAR network topology you can get a rough idea of what is broken based on which NeCTAR IP addresses you can contact.
Solution: There probably isn't anything you can do to solve the problem. Check for outage notices in the NeCTAR Support Announcements page. Alternatively, raise a support ticket with NeCTAR Support or the NeCTAR Node operators.

Problem: Your instance no longer exists (#4)

Explanation: It is possible that you are trying to connect to an instance that no longer exists. It might have been deleted by someone else with "member" rights to the tenant, or it might have been terminated due to an earlier security incident.
Symptoms: If you attempt to connect to a NeCTAR instance that has been deleted, then one of the following will happen.

If the IP address for your deleted instance has not been recycled, then your SSH client is likely to get a “No route to host” message.
If the IP address has been recycled, then you should see a warning that the host key has changed. If you “accept” the new host key, you may then get a variety of symptoms, depending on how the new instance (probably not yours!) has been configured.

Diagnosis: Use the NeCTAR Dashboard to check that the instance exists, and that you are using the correct IP address for it.
Solution: Create a new instance or try again with the correct IP address.

Problem: Your instance is paused / suspended / shut down (#5)

Explanation: It is possible that you are trying to connect to an instance that is not currently running. Instances can go into paused, suspended or shutdown states for a variety of reasons. These include direct action by yourself or the NeCTAR administrators, a compute node that has rebooted spontaneously, and failed OpenStack requests such as snapshots and volume attach / detach requests.
Symptoms: Your SSH client is likely to get either a “No route to host” message or a "Connection timed out" message.
Diagnosis: Use the NeCTAR Dashboard to check that the instance is in ACTIVE / RUNNING state.
Solution: Use the NeCTAR Dashboard to unpause / resume / start the instance. If that fails, raise a support ticket. Note that if your instance has been paused and locked as a result of an earlier security incident, you will not be able to restart it.

Problem: Your instance’s (virtual) network interface is broken (#6)

Explanation: It is possible that your instance's network interface is incorrectly configured. This can happen at two levels:

The network interface could have become disabled at the OpenStack / hypervisor level; i.e. in the virtualization layers.
The network interface could have become disabled within your instance itself; i.e. by the network initialization scripts and/or the NetworkManager application.

Symptoms: Connection attempts time out. Ping fails, even if you have enabled ICMP ingress.
Diagnosis: When the instance reboots, the “cloud-init” log messages will list the network interface (typically “eth0”) as off.
Solution: It is possible that rebooting the instance will clear the problem. If that fails, raise a support ticket. (If the problem is inside your instance, fixing it may be very difficult unless you have "root" access via the instance's console.)

Problem: Your instance’s security groups don’t have the SSH port open (#7)

Explanation: Network access to an instance is mediated by the OpenStack Security Groups and their access rules. By default, a NeCTAR instance has no access whatsoever. To allow network access, groups / rules need to be configured to "open" ports for access from external IP addresses. For SSH access, TCP port 22 must be open for (at least) the machine you are appearing to connect from.
Symptoms: Connection attempts time out. Ping may succeed (if ICMP ingress is allowed by the security groups).
Diagnosis: The NeCTAR Dashboard details page for an instance lists the access rules for all security groups that apply to the instance. Look for entries like this:

 ALLOW IPv4 22/tcp from 0.0.0.0/0
 ALLOW IPv4 22/tcp from 130.102.111.53/32

The first of these allows SSH access from anywhere, and the second allows access from a single IP address. Note that if your local machine has a private IP address, then the IP address that needs to be granted access is the IP address of your network's NAT gateway.
Solution: If required, use the Dashboard either to edit one of the instance's existing security groups to add extra rules, or to create and add a new security group to the instance. If you are uncertain as to which IP addresses to grant access to, contact your local IT support.
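
Before changing any rules, it can help to confirm from your own machine how port 22 actually responds. The following is a minimal sketch, not part of any NeCTAR or OpenSSH tooling: the function name "probe_port" is hypothetical, 203.0.113.10 is a placeholder address, and the sketch assumes that bash and the GNU coreutils "timeout" command are available locally.

```shell
# Hypothetical helper: report how a TCP port responds.
# $1 = host or IP, $2 = port (defaults to 22), $3 = timeout in seconds (defaults to 5)
probe_port() {
    # bash's /dev/tcp pseudo-device opens a TCP connection; GNU timeout exits
    # with status 124 if the attempt is still hanging when the time limit expires.
    if timeout "${3:-5}" bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "${2:-22}" 2>/dev/null; then
        echo "open"
    elif [ $? -eq 124 ]; then
        echo "timed out"
    else
        echo "refused or unreachable"
    fi
}

# Example call (placeholder address, not run here):
#   probe_port 203.0.113.10 22
```

A result of "timed out" is consistent with the port being filtered (this problem, or #8), whereas "refused or unreachable" points towards the problems listed for "Connection refused" or "No route to host" in the diagnostic key.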

Problem: Your instance has an internal firewall that is blocking SSH (#8)

Explanation: A firewall can also be configured within an instance; e.g. using "iptables". This is advisable if you need to protect against the OpenStack security groups mechanism "leaking". When you have an internal firewall in your instance and you want to connect to it via SSH, then access through the firewall must be granted on the port you use for SSH.
Symptoms: Depending on the internal firewall configuration, the most likely symptom is that connection attempts time out.
Diagnosis: Login via the VNC console and see if the instance has an internal firewall enabled. (For example, running “sudo iptables-save | less” on a typical Linux system will list any IP-tables firewall rules.)
Solution: This will depend on how your instance's internal firewall has been implemented. Please check the documentation. Note that if you do accidentally block SSH access, then recovery will be difficult unless you have "root" access via the instance's console.

Problem: Problems with non-standard SSH ports (#9)

Explanation: Sometimes people configure SSH to listen for connections on a different port to the standard one (port 22).
Symptoms: The symptoms are typically the same as for normal firewall problems, except that opening the firewalls does not help.
Diagnosis: Login via the Console and check the "/etc/ssh/sshd_config" file to see if a non-standard port has been configured.
Solution: There is generally a security-related reason for using a non-standard SSH port. The solution is generally to specify the non-standard port number in your SSH client when connecting. The "ssh" and "putty" clients both provide mechanisms for "remembering" the port number for a connection.

Problem: You are using an incorrect IP address or DNS name (#10)

Explanation: If you misremember, mistype or miscopy the IP address or DNS name for your instance when you invoke the SSH client, it will try to connect to the wrong instance, or no instance at all.
Symptoms: Various, depending on what is behind the (incorrect) IP address or DNS name.
Diagnosis: If you are using a DNS name, perform a DNS lookup to find out what IP address it resolves to. Then use the NeCTAR Dashboard to check whether the IP address is the correct one for the instance.
Solution: Figure out what the correct name / address is, and use it.

Problem: You are using a stale DNS name (#11)

Explanation: NeCTAR does not currently provide so-called "floating IP addresses" or custom DNS names for NeCTAR instances. As an alternative, some people implement DNS names using "CNAME" records in an external DNS service that point to the "synthetic" DNS entry for an instance. Problems can arise when you rebuild or migrate an instance so that it has a new IP address. Unless it is updated, the DNS record will now refer to the IP of a non-existent instance, or a new instance that belongs to someone else.
Symptoms: Various, depending on the IP address that the stale DNS name currently resolves to.
Diagnosis: Perform a DNS lookup to find out what IP address the name resolves to. Then use the NeCTAR Dashboard to check that the resolved IP address is the correct one.
Solution: If the DNS address is stale (i.e. it refers to the wrong IP address) you need to get whoever manages the DNS record to update it for you. As a temporary work-around, you could use the (real) IP address instead of the (stale) DNS name.

Problem: Your instance’s SSH daemon is not running / working (#12)

Explanation: The SSH daemon process ("sshd") on your instance accepts incoming SSH connections from clients. If it is not running, the operating system on your instance will "refuse" connection requests.
Symptoms: "Connection refused" error messages.
Diagnosis: Login via the VNC console and use “ps” to check if there is an “sshd” daemon running. If not, check the relevant system log files for clues as to why it didn’t start.
Solution: Restarting the instance’s network services should cause “sshd” to start. If you cannot login via the Console to do this, then a hard instance reboot might fix the problem. If neither of those works, then this could be difficult to solve.

Problem: You did not supply a key-pair when launching the instance (#13)

Explanation: When you launch a NeCTAR instance with one of the standard images, you need to supply an SSH public key. If you neglect to do that, the admin account for your instance will be set up without any way to authenticate. (For other images, this may or may not be a concern, depending on whether the image has preconfigured a public key or password.)
Symptoms: You get a "Permission denied" message, or an unexpected prompt for a password.
Diagnosis: Use the NeCTAR Dashboard to check that there is a key name associated with the instance.
Solution: The simplest solution is to Terminate the instance and launch a new one with a public key.

Problem: Your public key was not injected into the instance (#14)

Explanation: Sometimes a new NeCTAR instance can launch without properly injecting the public key into the admin account. (This can happen if there is a problem with the metadata service that the instance uses to get configuration information.)
Symptoms: You get a "Permission denied" message, or an unexpected prompt for a password.
Diagnosis: If the instance has failed to talk to the metadata service, this will show up in the console logs. Unfortunately, figuring out why the instance was unable to contact the metadata service can be difficult. (Note that this affects instances in the QRIScloud AZ more than others. Instances in QRIScloud may have two active network interfaces, and depending on various issues with the instance OS, requests to the metadata server IP address can be incorrectly routed.)
Solution: If the problem with the metadata service was transitory, a hard reboot of the instance may be sufficient to cause the public key to be injected properly. Otherwise, you should raise a NeCTAR support ticket.

Problem: The user account you are trying to login to is incorrectly configured (#15)

Explanation: When you set up additional user accounts on an instance and configure them for SSH access, it is easy to get the permissions incorrect. If you do this, the "sshd" will not allow SSH login using the public key.
Symptoms: You get a "Permission denied" message, or an unexpected prompt for a password.
Diagnosis: This problem should not occur with the default admin account that is set up by "cloud-init"; e.g. when you use one of the NeCTAR standard images. It can occur if you have added extra user accounts using "adduser" or "useradd" or similar. There are two scenarios:

You have attempted to set a password for the account (using the "passwd" command). That password will not work for login using SSH, by design. Your instance is sitting on the public internet, and it is vulnerable to people breaking in by guessing account names and passwords. NeCTAR images are configured so that the SSH daemon will not accept password login.
You have attempted to set up an SSH key for an account, but the file and directory permissions are wrong. The SSH daemon performs some security checks to make sure that a user's "~/.ssh/authorized_keys" file cannot be updated or replaced by another (non-privileged) user. Specifically: 1) the user's home directory must be writable by the user only, 2) the user's "~/.ssh" directory must be writable by the user only, 3) the user's "~/.ssh/authorized_keys" file must be writable by the user only, and 4) the "~/.ssh/authorized_keys" file must contain public keys, not private keys.

Solution: Create the ".ssh" directory, add the user's public key to the authorized keys file, and make sure that the file / directory ownership and access modes are correct. (Use the "chown" and "chmod" commands to correct the ownership and permissions.)
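
The steps in the solution above can be sketched as follows. This is an illustration only: the function name "install_pubkey", the account name "alice" and the key text are placeholders, and the demonstration runs against a scratch directory so that no real account is touched. On a real instance you would run the equivalent commands as root against the account's actual home directory and then fix the ownership as noted in the comment.

```shell
# Hypothetical helper: add a public key for an account and set the modes
# that the SSH daemon's checks expect.
# $1 = the account's home directory, $2 = one public key line
install_pubkey() {
    mkdir -p "$1/.ssh"
    printf '%s\n' "$2" >> "$1/.ssh/authorized_keys"
    chmod go-w "$1"                        # home directory: not writable by group / others
    chmod 700 "$1/.ssh"                    # ".ssh" directory: owner access only
    chmod 600 "$1/.ssh/authorized_keys"    # key file: owner read / write only
    # On a real account, also make the account the owner, for example:
    #   chown -R alice:alice /home/alice/.ssh
}

# Demonstration on a scratch directory standing in for a home directory.
demo_home="$(mktemp -d)"
install_pubkey "$demo_home" "ssh-ed25519 AAAAC3...placeholder alice@example"
ls -ld "$demo_home/.ssh" "$demo_home/.ssh/authorized_keys"
```

The "ls -ld" output should show "drwx------" for the directory and "-rw-------" for the file.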

Problem: You are using the wrong account name (#16)

Explanation: If you use the wrong account name when you try to login, the SSH daemon will attempt to log you in anyway, and fall back to asking for a password when your offered key is found to be incorrect. (The login mechanism is being deliberately obtuse to avoid giving clues to people who are trying to guess user names and passwords.)
Symptoms: You get a "Permission denied" message, or an unexpected prompt for a password.
Diagnosis: There is no simple way to diagnose for this problem, apart from knowing what account name you should be using.
Solution: On an instance launched from one of the NeCTAR standard images, the default admin account name will be:

System types	Account name
Ubuntu	"ubuntu"
Debian	"debian"
CentOS, Scientific Linux, Fedora, openSUSE	"ec2-user"

For other (e.g. non-NeCTAR) images, you need to consult the documentation or ask the person who created or shared the image. Note that some images may allow or require password-based SSH access. Any image that does this is NOT suitable for running on the NeCTAR cloud, unless you use the instance Security Groups to lock down SSH access to specific (trusted) IP addresses or network ranges.

Problem: You are using the wrong private key (#17)

Explanation: The private key that you use on the client side must match one of the public keys that has been injected (or added manually) for the account that you are trying to use. (The login mechanism is being deliberately obtuse to avoid giving clues to people who are trying to guess user names and passwords.)
Symptoms: You get a "Permission denied" message, or an unexpected prompt for a password.
Diagnosis: There is no simple way to diagnose this problem, apart from knowing what private key you should be using. Use the "Checking the fingerprint of your SSH key pair" guide to check that the fingerprints of your SSH client key and the instance key are the same.
Solution: Be careful to record the public / private key-pairs that you should be using for each instance. If you lose the private key that you should be using then there is no guaranteed way for you to regain access.
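
One way to compare fingerprints from the client side is sketched below. It assumes that OpenSSH's "ssh-keygen" is installed locally, and that the fingerprint shown for the key pair in the Dashboard is in the colon-separated MD5 style; if it starts with "SHA256:", pass "sha256" as the second argument. The function name "key_fingerprint" is hypothetical, and the demonstration uses a throwaway key generated in a scratch directory rather than one of your real keys.

```shell
# Hypothetical helper: print the fingerprint of the public key that belongs
# to a given private key file.
# $1 = private key file, $2 = hash name (md5 or sha256, defaults to md5)
key_fingerprint() {
    pub="$(mktemp)"
    # "-y" derives the public key from the private key; "-l" prints its fingerprint.
    ssh-keygen -y -f "$1" > "$pub" && ssh-keygen -l -E "${2:-md5}" -f "$pub"
    rm -f "$pub"
}

# Demonstration with a throwaway, passphrase-less key in a scratch directory.
if command -v ssh-keygen >/dev/null 2>&1; then
    scratch="$(mktemp -d)"
    ssh-keygen -q -t ed25519 -N '' -f "$scratch/demo_key"
    key_fingerprint "$scratch/demo_key"
    # The fingerprint of the stored ".pub" file should be identical.
    ssh-keygen -l -E md5 -f "$scratch/demo_key.pub"
fi
```

If the fingerprint printed for your private key does not match the one recorded for the instance's key pair, you are using the wrong key.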

Problem: Your private key is not properly secured (#18)

Explanation: The SSH command attempts to protect you from letting other people see your private key.
Symptom: The SSH command warns you that the "permissions are too open" for a private key file and skips it. Login will then fail because your client does not offer the required key to the SSH server.
Diagnosis: Check that the private key file that you are using on the client side can only be read by you. When you run “ls -l” on the key file, it should display the file “mode” as “-rw-------”. Specifically, the last six characters should be hyphens, indicating no read, write or execute permissions for “group” or “world”.
Solution: Set the file access on the private key file to “-rw-------” by running “chmod 600 <keyfile>” where “<keyfile>” is the keyfile pathname. (The detailed error message for this error provides more instructions.) If you know or suspect that other people can login to your machine, then you should treat the key as compromised and generate a new keypair.

Problem: Too many failed login attempts causes "fail2ban" banning (#19)

Explanation: The standard NeCTAR images are configured with "fail2ban" to protect against brute-force attempts to break authentication. This is recommended security practice for any system that is exposed to potential attack by hackers. Fail2ban works by temporarily "banning" attempts to authenticate (for example, login) if it detects more than a given number of failed login attempts from the same source in a short space of time. While the ban is in place, all login attempts will fail, irrespective of whether you have given the correct credentials.
Symptoms: The actions performed by "fail2ban" are configurable, resulting in a variety of different observable symptoms. Typically you will be prompted for a password unexpectedly.
Diagnosis: It is difficult to reliably diagnose this unless you can login to the instance via the console. If you can login that way, check the fail2ban logfile which is typically "/var/log/fail2ban.log".
Solution: Wait for the ban to expire, and then try again using the correct credentials. The actual ban time is typically 10 minutes, but this is configurable as well.

Problem: Too many private keypairs causes "fail2ban" banning (#20)

Explanation: This problem is caused by an unfortunate interaction between the Linux / Mac "ssh" command on your home machine and "fail2ban" on the server. The default behavior of "ssh" is to find all of the private key files in your "~/.ssh" directory and offer them to the server as login keys. If you have more than (typically) 3 keys, it may take more than 3 attempts before "ssh" offers the right one. Unfortunately, each "offer" of a key that the remote server doesn't accept counts as an authorization failure, so you can get "banned" before you can login.
Symptoms: Your login fails unexpectedly, with the same symptoms as #19 above.
Diagnosis: Run the "ssh" command with the "-v -v -v" options, then check the output to see how many private keys the command finds. If the logging shows "ssh" scanning "~/.ssh" for private keys and then "offering" them to the remote ssh service, one at a time, then this is possibly your problem.
Solution: There are two aspects to solving this:

You need to tell ssh to use a specific private key. One way to do that is to use the "-i" option. Another way is to put the details of the host and the corresponding key into your "~/.ssh/config" file. It is advisable to use an absolute path for the key file. The ssh command has a tendency to "keep going" if it cannot find a named key file; e.g. because you are in the wrong directory.
In addition, you need to either minimize the number of private keys in the "~/.ssh" directory (e.g. by putting them somewhere else), or use "-o IdentitiesOnly=yes" to stop ssh from trying keys that you don't specifically request to be tried.
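
Both aspects can be combined in one "~/.ssh/config" entry. The sketch below writes an example stanza to a scratch file so that it can be inspected safely; in real use the stanza would go into "~/.ssh/config". The alias "my-instance", the address 203.0.113.10, the user name and the key path are all placeholders.

```shell
# Write an example stanza to a scratch file (stand-in for ~/.ssh/config).
cfg="$(mktemp)"
cat > "$cfg" <<'EOF'
# "my-instance" is a hypothetical alias; address, user and key path are placeholders.
Host my-instance
    HostName 203.0.113.10
    User ubuntu
    IdentityFile /home/alice/.ssh/nectar_key
    IdentitiesOnly yes
EOF
cat "$cfg"

# Optional check: "ssh -G" prints the settings ssh would use for the alias
# without connecting to anything.
if command -v ssh >/dev/null 2>&1; then
    ssh -F "$cfg" -G my-instance | grep -Ei '^(hostname|user|identityfile|identitiesonly) ' || true
fi
```

With such an entry in place, "ssh my-instance" should offer only the named key, and the absolute path avoids the "wrong directory" pitfall described above.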

Problem: You are using the wrong password (#21)

Explanation: While we strongly recommend that you don't do this, it is possible to change the SSH configurations on a NeCTAR instance to accept passwords as authentication. Also, some third-party images are built with password authentication enabled.
Symptoms: You are unexpectedly prompted for a password.
Diagnosis: Unfortunately, the only way to diagnose this problem requires that you can login as "root".
Solution: Find out what the correct username and password are and use them. If this is a system that someone has set up for you, ask them. If you are launching a third-party image, check the documentation to see if there is a default account and password. (And if there is, change it as soon as possible. If you can find it, then so can the hacker community!)

Problem: Unexplained “remote host identification has changed” error / warning (#22)

Explanation: The host key mechanism is intended to ensure that you are connecting to the host that you are expecting to. This failure is a sign that either you are connecting to a different host, or something about the host has changed.
Symptoms: A “remote host identification has changed” warning.
Diagnosis: There are a variety of possible causes for this. Most of them are benign, but this is a possible symptom of a security issue. See the "Host keys and security" section below for more details.
Solution: If you are confident that the explanation is benign, you can clear the problem by updating the file indicated by the error / warning message: typically your “~/.ssh/known_hosts” file.
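
Rather than editing the file by hand, the stale entry can be removed with "ssh-keygen -R". The sketch below demonstrates this on a scratch copy of a "known_hosts" file, assuming OpenSSH's "ssh-keygen" is installed; the addresses are placeholders and the host key is a throwaway generated for the demonstration. Against your real file the command would simply be "ssh-keygen -R" followed by the instance's IP address or host name.

```shell
if command -v ssh-keygen >/dev/null 2>&1; then
    scratch="$(mktemp -d)"
    # Generate a throwaway key to stand in for a recorded host key.
    ssh-keygen -q -t ed25519 -N '' -f "$scratch/old_host_key"
    key="$(cut -d' ' -f1,2 "$scratch/old_host_key.pub")"
    # Build a two-entry known_hosts file in the scratch directory.
    printf '%s %s\n' 203.0.113.10 "$key" 203.0.113.20 "$key" > "$scratch/known_hosts"
    # Remove only the entry for 203.0.113.10; "-f" selects the file to edit.
    ssh-keygen -R 203.0.113.10 -f "$scratch/known_hosts"
    cat "$scratch/known_hosts"
fi
```

Only the entry for the named host is removed; the next connection to that host will ask you to confirm the new host key, which is the point at which to check that the change really is benign.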

Problem: Your instance is offering a DSA host key (#23)

Explanation: The SSH daemon can use a variety of public / private key algorithms. One of the algorithms (DSA) that was used in older versions of Linux is now deemed to be insufficiently secure. In particular:

Starting with version 7.0, OpenSSH disables DSA host keys by default; see https://www.gentoo.org/support/news-items/2015-08-13-openssh-weak-keys.html.
Starting with the Sierra release, OpenSSH in Mac OS disables DSA host keys by default; see http://jeffreifman.com/2016/10/01/fix-macos-sierra-upgrade-breaking-ssh-keys.

Symptoms: "Unable to negotiate with <IP> port <PORT> no matching host key type found. Their offer: ssh-dss"
Diagnosis: The problem can be confirmed by running ssh with "-v -v -v", or by looking at the host keys and configs on the instance. They may be found in "/etc/ssh".
Solution: There are two solutions:

You can (temporarily) tell the SSH client to accept DSA host keys anyway by adding a "HostKeyAlgorithms +ssh-dss" line to the client configs. (Check the ssh manual entry!)
If you can login to the instance with root access, change directory to "/etc/ssh" and check for an "ssh_host_rsa_key" and "ssh_host_rsa_key.pub":
- If they don't exist, create them using "ssh-keygen -f /etc/ssh/ssh_host_rsa_key -t rsa -N '' -b 4096". Then restart the instance's "sshd" service, as appropriate for its version of Linux.
- If they do exist, check that the "/etc/ssh/sshd_config" file does not specify the wrong host key file; i.e. the DSA one.

NB: Read what the "General Advice" section (below) says about precautions to take when modifying SSH configurations on a NeCTAR instance.

Problem: The "ssh" command cannot find your key (#24)

Explanation: The "ssh" command allows you to specify the private key file with the "-i" option; e.g. "-i ~/.ssh/someKey". The problem arises when "ssh" cannot find the keyfile that you specified. Instead of giving up, "ssh" falls back to its default strategy of loading all of the keys in "~/.ssh". If the required key is not there, or if there are too many keys in the directory (see #20), then login will fail.
Symptoms: Login fails unexpectedly.
Diagnosis: Check that you are using the correct pathname for the key file. (Don't forget that relative pathnames are resolved in the current directory!) You can see which file "ssh" attempts to load by adding "-v -v -v" to the command line.
Solution: Use the correct keyfile pathname.

Problem: SSH issues with systems managed by someone else (#25)

Explanation: There are potentially other reasons for SSH connections to fail.
Symptoms: Various.
Diagnosis: Not applicable
Solution: Contact the people who manage the system. The eHelp support desk may be able to help in some cases, but if you know who manages the system / service it may be easier to contact them directly.

Problem: Key needs converting for use with Putty (#26)

Explanation: Different formats exist for storing ssh keys. Putty does not accept PEM format.
Symptoms: You get an error message: Unable to use key file "C:\path\to\key.pem" (OpenSSH SSH-2 private key (old PEM format))
Diagnosis: Not applicable
Solution: See the Windows Authentication section at https://support.ehelp.edu.au/support/solutions/articles/6000077794-getting-started Basically, you need to convert the key using puttygen, which you can obtain from the same place where you got Putty: https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html

Problem: The server's network interface is down (#27)

Explanation: If the server was unable to determine its IP address at boot time, or if its network configs are broken in other ways, it may fail to enable the network interface (NIC) for its primary external IP address.
Symptoms: Various, including "Connection Refused".
Diagnosis: Check the console logs from the last reboot for indications that the external interface didn't come up. If you can login to the machine, use "ifconfig -a" or "ip link show" to figure out which NICs are active. On the OpenStack side, the server's primary "port" will be DOWN.
Solution: Someone with strong Linux system admin skills needs to figure out from the system logs why the NIC didn't start.

If the root problem is that the instance was unable to get its IP address and netmask via DHCP to 169.254.169.254, then the DHCP service might be not working, or access to it may be blocked. Check the Security Group egress and ingress rules for UDP ports 67 and 68 respectively. If necessary, escalate to the eHelp support desk.

Host keys and security

The internet which you are using to connect to your NeCTAR VM is built on top of network protocols that use packet switching to transfer data from one place to another. At the lower levels, everything is broken down into packets. Each packet consists of a data payload, together with control information including a source address and a destination address. These addresses are the numeric IP addresses that you see, and often have to use; e.g. in an SSH command.

In an ideal world, each IP address would uniquely identify one system in the entire internet. When a packet was sent to a given IP address, you would expect it to be delivered to that system, and that system only.
In reality, the Internet is beyond the point where there are enough IP addresses (specifically IPv4 addresses) to address all systems that need to be on the net.
One consequence of this is that IP addresses often get recycled. When a system is shut down, its old IP address is likely to be reallocated to a new system, either immediately or sometime later. This means that if you try to use an IP address that you wrote down some time ago, it could now refer to a different system.

A second issue is that IP-based protocols are not secure at the base level. A person with the appropriate access to a network can: look at the contents of any packet, change the contents of any packet, alter the addressing information in packets, or interfere with the routing of packets. Given this, it is feasible for a third party to intercept or interfere with a communication stream between two systems. This could potentially happen at any point in the network, and it is impossible to fix at the network level.
SSH attempts to address network-level insecurity using public key encryption. Public key encryption is based on having a pair of related keys, one to encrypt and the other to decrypt:

If you have the private key: you can encrypt data so that it can only be decrypted using the public key, and you can decrypt data that was encrypted using the public key.
If you have the public key: you can encrypt data so that it can only be decrypted using the private key, and you can decrypt data that was encrypted using the private key.
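
The relationship between the two keys can be illustrated with textbook RSA on deliberately tiny numbers. This is a minimal sketch for intuition only: the primes, exponents, message and the "apply_key" helper are illustrative values invented for this example, real keys are thousands of bits long, and the algorithms that SSH actually uses are more elaborate than this.

```python
# Toy RSA key pair built from two tiny primes (illustrative values only;
# never use numbers this small for anything real).
p, q = 61, 53
n = p * q                      # modulus shared by both keys
phi = (p - 1) * (q - 1)
e = 17                         # public exponent
d = 2753                       # private exponent
assert (e * d) % phi == 1      # the two exponents are inverses modulo phi

public_key = (e, n)
private_key = (d, n)


def apply_key(key, value):
    """Transform a number (0 <= value < n) with one half of the key pair."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


message = 65

# Encrypt with the public key, decrypt with the private key.
sealed = apply_key(public_key, message)
assert apply_key(private_key, sealed) == message

# Encrypt with the private key, decrypt with the public key.
signed = apply_key(private_key, message)
assert apply_key(public_key, signed) == message

print("both directions round-trip correctly")
```

Whatever one key does, only the other key of the same pair can undo. That is the property the host key check relies on.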

As part of the system setup, each SSH client and server host generates a public / private keypair. The public key of the keypair is known as the host key. When the SSH client connects to an SSH server, one of the first things that the server does is to send its host key to the client. The SSH client uses this as follows:

If the client can find a system-wide or user-specific "known_hosts" file, the server's IP is looked up there.
If there is an entry for the IP, and the host keys match, the client proceeds.
If there is no entry, the client will warn you, and then ask you if you want to connect anyway. This typically saves the server host key in the user's "known_hosts" file.
If there is an entry in the file, but the keys don't match, the SSH client will refuse to continue. It typically suggests that you edit the relevant "known_hosts" file to update the host key that was stored there.
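The SSH client generates a random secret, encrypts it with the server's host key and sends it to the server.
The SSH server decrypts the secret with its private key, re-encrypts it with the client's host key and sends it back to the client.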
The SSH client decrypts the secret and checks that it matches. If it does, then the client knows that it is talking to the correct server.
A similar procedure is used to agree on an encryption algorithm, and to create and securely exchange an encryption key for the session.
Finally, the client and server switch to using the agreed encryption algorithm and session key.
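
The "known_hosts" lookup at the start of this procedure can be sketched in a few lines of Python. This is a simplified model, not how any particular SSH client is implemented: it assumes plain (unhashed) host names, skips marker lines, and ignores wildcard patterns and non-default ports. The function names and the sample entries (with placeholder key strings) are made up for illustration.

```python
from collections import defaultdict


def parse_known_hosts(text):
    """Map each plain host name or IP to its {key_type: base64_key} entries."""
    entries = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, marker lines (starting with "@") and
        # hashed host names (starting with "|"), which this sketch ignores.
        if not line or line[0] in "#@|":
            continue
        fields = line.split()
        if len(fields) < 3:
            continue
        hosts, key_type, key = fields[0], fields[1], fields[2]
        for host in hosts.split(","):
            entries[host][key_type] = key
    return entries


def check_host_key(entries, host, key_type, offered_key):
    """Return "match", "unknown" or "mismatch" for the key a server offers."""
    known = entries.get(host, {})
    if key_type not in known:
        return "unknown"       # the client would warn and ask whether to continue
    if known[key_type] == offered_key:
        return "match"         # the client proceeds
    return "mismatch"          # the client refuses to continue


SAMPLE = """\
# illustrative entries only; the key strings are placeholders
203.0.113.10 ssh-ed25519 AAAAexampleKeyOne
203.0.113.11,myhost.example ssh-ed25519 AAAAexampleKeyTwo
"""

entries = parse_known_hosts(SAMPLE)
print(check_host_key(entries, "203.0.113.10", "ssh-ed25519", "AAAAexampleKeyOne"))
print(check_host_key(entries, "203.0.113.10", "ssh-ed25519", "AAAAsomethingElse"))
print(check_host_key(entries, "203.0.113.99", "ssh-ed25519", "AAAAexampleKeyOne"))
```

The three calls print "match", "mismatch" and "unknown" respectively, corresponding to the three outcomes described in the list above.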

As you can see, the host key plays an important part in ensuring that you are connecting to the right server, and not some other server that is impersonating the real one. It also ensures that there is no "man in the middle" snooping on the conversation between the client and the (real) server. However, this does depend on you being careful. If you blindly replace host keys, you may end up talking to the fake server.
In the NeCTAR context, a newly launched instance should always have a host key that your SSH client has not encountered before. Therefore, if you know that this is an instance that you have just launched, or if you know that you have never connected to it before, then it is safe to accept the unknown host key. If not, it still may be safe. Certain things are known to cause an instance's host key to change:

You can regenerate the host keys explicitly.
If you (or something else) delete the host keys, they will be regenerated on the next reboot.
Certain changes to the sshd configurations will cause new keys to be generated. For example, changing the host key type will do this.
Applying a system patch has been known to cause the host keys to change.

There are a couple of things that you can do to authenticate the instance in the case of an (apparent) host key change:

If you have set a password on the "root" account, you can use the NeCTAR Dashboard to get to the instance's console, log in as root, and check the host keys from there.
If you have not set a root password, some versions of Linux output the fingerprints for the host keys to the console log. You can view an instance's console log via the NeCTAR Dashboard.
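
To compare a host key with a fingerprint shown on the console, you need the fingerprint of the key your client was offered or has stored. The sketch below computes the SHA256 form of the fingerprint (the SHA-256 digest of the decoded key, base64 encoded with the trailing "=" removed) from a public key line or a "known_hosts" line. The function names are hypothetical, and the key used here is an all-zero placeholder built purely for the demonstration. The format in which fingerprints appear in the console log depends on the image, so treat this as a starting point.

```python
import base64
import hashlib
import struct


def sha256_fingerprint(key_line):
    """Return the SHA256 fingerprint of a public key or known_hosts line."""
    # Pick out the base64 key field. SSH key blobs start with "AAAA"
    # because they begin with a 4-byte length whose leading bytes are zero.
    blob_b64 = next(f for f in key_line.split() if f.startswith("AAAA"))
    digest = hashlib.sha256(base64.b64decode(blob_b64)).digest()
    # The digest is shown in base64 with the "=" padding stripped.
    return "SHA256:" + base64.b64encode(digest).decode("ascii").rstrip("=")


def dummy_ed25519_line(host="203.0.113.10"):
    """Build a syntactically valid but meaningless known_hosts line."""
    key_type = b"ssh-ed25519"
    blob = (struct.pack(">I", len(key_type)) + key_type
            + struct.pack(">I", 32) + bytes(32))   # all-zero placeholder key
    return f"{host} ssh-ed25519 {base64.b64encode(blob).decode('ascii')}"


print(sha256_fingerprint(dummy_ed25519_line()))
```

In practice you would pass in the relevant line from your own "known_hosts" file instead of the dummy line. If the result matches a fingerprint that you obtained from the instance itself, the key can be trusted.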

Diagnostic tools and techniques

SSH debug
The first technique for diagnosing problems with SSH is to run your ssh client with maximum debug output (e.g. "ssh -v -v -v ...") and read the output. The debug output is not designed to be "friendly" to non-expert users, but if you compare it with what the log says when you connect successfully (e.g. to another NeCTAR instance or a different system entirely) there should be clues as to what the problem is.
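
If you save the debug output from a working and a failing connection to two files (the ssh client writes its debug output to standard error, so redirect that with "2>"), the comparison can be partly automated. The following is a rough sketch, assuming two log files whose names you pass on the command line; the function names are hypothetical. IPv4 addresses are masked so that they do not show up as differences.

```python
import difflib
import re
import sys

IPV4 = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def normalise(line):
    """Strip the line ending and mask IPv4 addresses."""
    return IPV4.sub("<ip>", line.rstrip())


def diff_logs(good_lines, bad_lines):
    """Return a unified diff between a working and a failing debug log."""
    return list(difflib.unified_diff(
        [normalise(line) for line in good_lines],
        [normalise(line) for line in bad_lines],
        fromfile="working", tofile="failing", lineterm="", n=1))


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage: python compare_logs.py working.log failing.log
    with open(sys.argv[1]) as good, open(sys.argv[2]) as bad:
        for line in diff_logs(good.readlines(), bad.readlines()):
            print(line)
```

Lines prefixed with "-" appear only in the working log, and lines prefixed with "+" appear only in the failing one. The first point at which the two logs diverge is usually the most informative.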

The NeCTAR Dashboard
There are a number of things that you can check using the NeCTAR Dashboard, if you are a member of the NeCTAR project that manages the instance:

You can check to see if the instance that you are attempting to connect to still exists, and if it is currently in ACTIVE state.
You can check that you are using the correct IP address.
You can check the access rules for the instance in the instance's Details page.
You can check the name of the SSH keypair that you should be using. (Note that this should be treated as a hint only. It depends on you, or the person who launched the instance, keeping track of what the names correspond to.)
You can view the instance's primary logfile.
You can access the instance's virtual console. If you have an account with a password, you can log in that way. Even if you don't, the console can give you an indication of whether the instance is "alive".

Ping and traceroute tools
These tools work best if you have configured ICMP access in the instance's Security Groups. If an instance responds to "ping", then you know that there is basic networking connectivity to the instance, and that the return path is also working. Ping can fail if either instance networking is "broken", or if ICMP traffic is not enabled in the security groups. In either case, "traceroute" can at least tell you whether network packets are routed to the data centre where the instance runs.

When running traceroute, be aware that the default behavior for this command is typically to use UDP packets, and to pick a port where it is "unlikely" that any service is listening. Furthermore, when you specify a port with "-p <port>", this is actually a base port number: the destination port is incremented on each probe. The net result is that the UDP traceroute probes will typically be blocked at your instance's firewall.

One alternative is to use TCP traceroute, which uses TCP "SYN" messages for probing. Unfortunately, this feature is not available with all versions of "traceroute", and on the versions that support it, it typically requires admin access. (There are "security issues".)

A second alternative if you have enabled ICMP in the Security Groups is to try ICMP traceroute. This uses ICMP "echo" packets.

Notes:

There are a variety of different ping and traceroute tools, depending on your platform. Unfortunately, it is not always easy to find the best one to use. (For example, the standard Windows and Mac OS versions don't support TCP tracerouting.)
The output from traceroute can be a bit tricky to interpret. Internet routing is sometimes unstable. In other cases, packets can inexplicably go through different paths within a data centre; e.g. due to traffic going through different redundant switches. Finally, as noted above, depending on how you use the tools, probe packets may be routinely dropped by your instance firewalls.

Using telnet to make a TCP/IP connection
If you have a "telnet" command installed, you can use it to test whether it is possible to establish a TCP/IP connection on a particular port. The network port to try for SSH is 22. (Refer to your system's telnet command documentation on how to specify the port.)
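
If no "telnet" command is available, the same test can be done with a few lines of Python. This is a minimal sketch using only the standard library; the function name "probe_tcp" and the 5 second default timeout are choices made for this example. The return values are worded to line up with the symptoms in the diagnostic key.

```python
import errno
import socket
import sys


def probe_tcp(host, port=22, timeout=5.0):
    """Try to open a TCP connection and describe what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except socket.timeout:
        return "timed out"
    except ConnectionRefusedError:
        return "refused"
    except OSError as exc:
        if exc.errno == errno.EHOSTUNREACH:
            return "no route to host"
        if exc.errno == errno.ENETUNREACH:
            return "network unreachable"
        return f"failed: {exc}"


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python probe.py <instance-ip> [port]
    target_port = int(sys.argv[2]) if len(sys.argv) > 2 else 22
    print(probe_tcp(sys.argv[1], target_port))
```

A result of "connected" only shows that something is accepting TCP connections on that port. It does not prove that the SSH service itself is healthy, or that your credentials will be accepted.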

General advice

Use a private key passphrase

It is advisable to put a passphrase on your private key files. You can do this when you create the key pair: the "ssh-keygen" command will prompt you for a passphrase. Alternatively, you can use "ssh-keygen -p -f <private-key-filename>" to set or update the passphrase on an existing private key file.

The passphrase for your private key should be especially strong. We recommend the following:

It should be at least 12 characters long.
Use a mixture of letters (upper and lower case), digits and symbols.
Do not use family names, phone numbers, car registration numbers, or other things that someone who knows you could easily guess.
Avoid words in the dictionary.
Avoid popular, commonly used, or easily guessable names or phrases; e.g. "Never going to give you up".
If you need to write your passphrase down on paper, make sure that it is stored securely.
Avoid entering your passphrase on a system that cannot be trusted. (Beware of keystroke logging hardware or software, which could capture your passphrase while you are typing it.)
Don't recycle passwords / passphrases, or use the same one for different purposes.

The last point is particularly important. If your passphrase does get stolen or compromised, you need to be able to limit the damage.
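
The first two recommendations in the list are mechanical and can be checked by a small script. This sketch is illustrative only, and the function name and the sample strings are made up for this example. It cannot judge the other recommendations (guessable names, dictionary words, reuse), so a passphrase that passes is not necessarily strong. Only run something like this on a system that you trust.

```python
import string


def passphrase_problems(passphrase, min_length=12):
    """Return a list of the mechanical rules that the passphrase fails."""
    problems = []
    if len(passphrase) < min_length:
        problems.append(f"shorter than {min_length} characters")
    if not any(c.islower() for c in passphrase):
        problems.append("no lower case letter")
    if not any(c.isupper() for c in passphrase):
        problems.append("no upper case letter")
    if not any(c.isdigit() for c in passphrase):
        problems.append("no digit")
    if not any(c in string.punctuation for c in passphrase):
        problems.append("no symbol")
    return problems


# Sample strings for demonstration only; do not use them as passphrases.
for candidate in ("tooshort", "Tr4il-mix&Lanterns"):
    print(candidate, "->", passphrase_problems(candidate) or "passes the mechanical checks")
```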

Finally, it should be noted that if you either lose your private key entirely, or forget the passphrase, the chances that you can "crack" the keys are so close to zero that it is not worth trying. You will need another way to regain access.

Set a root password

One of the problems with cloud computing (compared to using a PC or laptop) is that if something goes wrong you cannot get physical access to the system. If you get locked out of an instance, regaining access can be difficult and time consuming. The best way to protect yourself against getting locked out is to set a password on either the root account or an account with (full) sudo privilege. (Aside: setting a root password is better than setting a password on a sudo account. If you make a mistake when editing the "/etc/sudoers" file, it is possible to lock out sudo access. Only the root user can fix that.) If you have an account with root privilege and a password, then you will be able to use it to log in via the instance's virtual console. That should work even if the instance's virtual networking is not working, or if the SSH or sudoers configurations are damaged. It may even allow access if your instance needs to be booted into "single user" mode to fix something.

Take precautions when making potentially dangerous changes

While you always need to be careful when doing Linux system administration, certain changes to a NeCTAR instance have significant risk associated with them due to the "lockout" problem. Here are some precautions you should take:

Make sure you have set the root password; see above.
If there is valuable data on the system, or if you have spent a lot of time setting the system up, make sure that you have a recent backup before doing something that is potentially risky.
If you are changing SSH or Network configurations:
- Use two SSH sessions: one with a root shell to make the changes, and a second one to test them out.
- Be aware that if you "stop" the networking or sshd services, that is likely to disconnect your SSH sessions. Think carefully. It is generally advisable to do a service "reload" or "restart" rather than a "stop" followed by a "start".
If you are changing the "/etc/sudoers" file, or modifying passwords or groups in a way that might affect sudo access:
- Use "sudo bash" to get yourself a root shell and use that to make the changes. (If you break "sudo", then you typically cannot use it to repair the damage, whereas a root shell that is already open keeps working.)
- Use the "visudo", "vipw" or "vigr" commands if you are manually editing your instance's "sudoers", "password", "shadow-password", "group" or "shadow-group" files. They implement some sanity checks that can make the changes a bit safer.

Recovering an instance from a lockout

Depending on what exactly has caused the lockout, it may be possible to regain access to a locked-out instance. However, it is complicated and there is no guarantee that it will work. Here are some of the possible techniques that might work depending on the situation:

Use your "root" account to log in via the OpenStack Console for the instance. You can access the Console via the NeCTAR Dashboard, or using the openstack command line tools. Note that this will only work if you had the foresight to set a root password before you got locked out of your instance; see "Set a root password" above. There is no default "root" password set on instances. (Why not? Well, think about the security implications!)
Use the openstack command tools to boot the instance into "rescue" mode. If this works, you should be able to repair the problem with your instance's root image and then "unrescue" to boot normally. However, there are a few scenarios where rescue won't work; e.g. it won't work if you have lost your private key, or if the instance is "boot from volume".
Use the openstack command tools to snapshot the instance and download the snapshot to your computer. Then repair the image on your computer and upload the repaired image. Finally, boot a new instance from the repaired image. Unfortunately, this will give you a new instance with a new IP address and MAC address, and you will have lost your ephemeral volume. Furthermore, this is unlikely to work with a "boot from volume" instance because the snapshotting is likely to fail.

For more information on the last two, please refer to this article: Recovery Options When You Cannot Access Your Instance