Trouble Shooting


---

Troubleshooting

A rough outline of the steps to take when things go wrong while the sysadmin is out of the lab.

Misc. notes

  • dave is the lab router, that little white box on the switch next to the laser printer, and should be in your /etc/hosts file; you can ping it.
  • babylon.biowiki.org and garibaldi are the same machine with two IP addresses/network cards/Ethernet connections: babylon.biowiki.org faces the Internet, is used to SSH in from off-site, and does not respond to ping; garibaldi faces the cluster subnet, is used when you are on the lab/cluster VPN, and does respond to ping.
  • If the below stuff doesn't solve your problems, contact me.

Misc "How To..."

Remount (that is, umount, then mount again) the NFS

Read this: http://biowiki.org/ClusterNFS

Or, consult the end of the VPN tunnel guide I gave you.

If the NFS is giving you crap, go down to the datacenter and reboot lorien. I recommend doing it in person instead of over SSH with shutdown -r now, because you really want to see the messages that come up during booting. Also, the machine might reboot but its network drivers might fail (it has happened before), and over SSH you would have no idea what happened, because the machine is dead to networking. Warning: the "enabling swap space" step takes a really long time (maybe even 15-20 minutes, I don't remember exactly), but that is OK.

Important note: when rebooting lorien, the loopback mount will fail. The loopback mount is the NFS server mounting the NFS-shared directory onto itself; the only reason for it is consistency, so that all the nodes, including the NFS server, have the same directory structure (they all have a /mnt/nfs/, with /nfs/ symlinking to it). It fails because, oddly enough, the OS tries to mount NFS filesystems before the NFS server is brought up... I guess they weren't prepared for loopback mounts of an NFS server onto itself (or more likely, I'm wrong). So you can either (1) ignore it and just not use /mnt/nfs/ or /nfs/ on lorien, or (2) mount it manually after booting is complete, like this:

$ su -

$ mount -t nfs lorien:/home/ /mnt/nfs/
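If you go with option (2), you can check whether the mount is already there before mounting again. This is only a sketch, assuming lorien is a Linux box where /proc/mounts lists the active mounts; the function name nfs_is_mounted and its optional second argument are made up for illustration:

```shell
# Hypothetical helper: succeed if something is mounted on the given directory.
# The mount table defaults to /proc/mounts (an assumption: Linux); the second
# argument exists only so the function can be pointed at a different file.
nfs_is_mounted() {
    nim_dir="${1%/}"                 # strip one trailing slash: /mnt/nfs/ -> /mnt/nfs
    nim_table="${2:-/proc/mounts}"
    # Field 2 of each line in the mount table is the mount point.
    awk -v d="$nim_dir" '$2 == d { found = 1 } END { exit !found }' "$nim_table"
}

# Usage on lorien, as root, after booting is complete:
#   nfs_is_mounted /mnt/nfs/ || mount -t nfs lorien:/home/ /mnt/nfs/
```

The mount command in the usage line is the same one shown above; the check just keeps you from mounting twice.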

Rebuild the VPN tunnel

There are two components to rebuilding the tunnel:

  1. The stuff that happens on xander, the mouth of the tunnel, which should be done automatically now.
  2. The stuff that happens on your machine, which is not automatic (but will be someday).

So, to rebuild the tunnel, the only thing you (might) have to do is tell your Mac that xander is the mouth of the tunnel. Do this on your Mac:

$ su -

$ route delete 192.168.0.0 xander 255.255.255.0 # this might give an error, but that's OK

$ route add 192.168.0.0 xander 255.255.255.0

This will add an entry to the routing table that says packets bound for the cluster subnet should be routed through xander. If there is already an entry there, it will exit with an error - that is OK, it tells us the routing table is already correct.

You may have to remount the NFS after this (see above).

If Thing #1 above fails (i.e. xander does not rebuild the tunnel automatically - you can tell that's the case if you can't ping anything on the cluster from xander), do all the steps in the VPN tunnel guide I gave you manually.
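The two route commands from the Mac step can be wrapped in one function so you don't have to remember them. A rough sketch, not something installed on any lab machine: the function name rebuild_route and the DRY_RUN variable are my own inventions, and the subnet and netmask are simply copied from the commands above.

```shell
# Hypothetical wrapper around the two route commands shown above.
# With DRY_RUN=1 it only prints what it would run; otherwise run it as root.
rebuild_route() {
    rr_gateway="${1:-xander}"        # mouth of the tunnel
    rr_subnet="192.168.0.0"          # cluster subnet, as in the commands above
    rr_netmask="255.255.255.0"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "route delete $rr_subnet $rr_gateway $rr_netmask"
        echo "route add $rr_subnet $rr_gateway $rr_netmask"
        return 0
    fi
    # Deleting a route that is not there fails; that is expected, so ignore it.
    route delete "$rr_subnet" "$rr_gateway" "$rr_netmask" 2>/dev/null || true
    route add "$rr_subnet" "$rr_gateway" "$rr_netmask"
}

# Usage (after su -):
#   DRY_RUN=1 rebuild_route     # just show the commands
#   rebuild_route               # actually change the routing table
```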

Install new programs on the cluster (in a way that all cluster nodes can use them)

$ ssh lorien

$ su -

$ cd /home/src/

$ mkdir <your program directory>

$ cd <your program directory>

$ wget <URL of the program files you want to download>

(This step can be replaced by using sftp or scp to copy the files into your program directory.)

If it's a tar/gzip archive, unpack it:

$ gunzip <archive>.tar.gz

$ tar xvf <archive>.tar
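The two unpacking steps can also be done in one go, which has the side effect of keeping the original .tar.gz around (gunzip replaces it with the .tar). A small sketch, assuming a tar that understands the z flag, as GNU tar does; the function name unpack_tgz is made up:

```shell
# Hypothetical helper: unpack a .tar.gz archive in one step and leave the
# archive itself in place.
unpack_tgz() {
    # x = extract, z = filter through gzip, f = read from the named file
    tar xzf "$1"
}

# Usage, from inside your program directory:
#   unpack_tgz <archive>.tar.gz
```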

Now here things get different for different cases, so I will describe just the one where programs come with a configure script. DO NOT run that script with the defaults! It will do a local install - by which I mean it will put binaries/executables into /usr/local/bin/, and so on - not on the NFS where they belong so that other nodes can use them! What you want to do is run configure like this:

$ ./configure --prefix=<your program directory>

so now the binaries/executables will be put into <your program directory>/bin/, libraries into <your program directory>/lib/, and so on - and because /home/ on lorien (which contains /home/src/, where you are) is NFS-mounted as /mnt/nfs/ (or /nfs/) everywhere, everything on the cluster can use your program!
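To guard against accidentally configuring a local install, you can check the prefix before running configure. This is a sketch under assumptions: the function name prefix_is_shared is made up, and the make / make install steps in the usage comment are the usual sequence for configure-based programs, not something this page spells out, so check the program's own README or INSTALL file.

```shell
# Hypothetical guard: accept a prefix only if it lies under the NFS-shared
# source tree (/home/src/ on lorien, as in the steps above).
prefix_is_shared() {
    case "$1" in
        /home/src/?*) return 0 ;;   # something below /home/src/
        *)            return 1 ;;   # /usr/local, relative paths, empty, ...
    esac
}

# Usage, from inside the unpacked source directory (make and make install
# are assumed here, see the note above):
#   prefix="<your program directory>"
#   prefix_is_shared "$prefix" && ./configure --prefix="$prefix" && make && make install
```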

It is nice to add it to your path as well. Put the following into your home directory's .bash_profile file:

export PATH=${PATH}:<your program directory>/bin

This has the added benefit that when you submit SGE jobs as yourself, they will not have to use the complete path to run your program - just the binary/executable name will do.

If there are any Perl libraries in the program, add them to your .bash_profile as well, like this:

export PERL5LIB=${PERL5LIB}:<your program's Perl libraries directory>
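One wrinkle with plain appending: every time .bash_profile is sourced again the same directory gets added again, and if the variable started out empty (PERL5LIB often does) you end up with a leading colon. A minimal sketch of a helper that avoids both; the name append_once is made up and nothing like it is installed on the cluster:

```shell
# Hypothetical helper: print a colon-separated list with a directory appended,
# unless that directory is already in the list.
append_once() {
    ao_list="$1"
    ao_dir="$2"
    case ":$ao_list:" in
        *":$ao_dir:"*) echo "$ao_list" ;;                   # already present: unchanged
        *)             echo "${ao_list:+$ao_list:}$ao_dir" ;;  # append; no leading colon if empty
    esac
}

# In .bash_profile:
#   export PATH="$(append_once "$PATH" "<your program directory>/bin")"
#   export PERL5LIB="$(append_once "$PERL5LIB" "<your program's Perl libraries directory>")"
```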

Tackle Sun Grid Engine problems

I wrote a wiki page about this here: http://biowiki.org/twiki/bin/viewauth/Main/HowToAdministerSunGridEngine#TroubleShooting.

These are also helpful:

Restore files from backup (including files you accidentally deleted)

Consult me, or at least make me aware you are trying to do this yourself. If you are feeling adventurous, read the section "3. Using the TSM client," subsection "Restoring files from backup (via the command line, of course)," here: http://biowiki.org/TSMBackupSystem.

You can "undelete" files you accidentally deleted this way, as long as they existed at 9:30PM (when the nightly backup takes place), so a copy was made onto tape backup.

---

Problems you may have and how to tackle them

You are at the lab

Your machine's networking is completely dead

Symptoms: can't ping any other lab machines (such as giles or xander) or the router (dave); no wireless networks show up in the Mac AirPort drop-down menu at the top right corner of the screen; can't print.

Resolution: dave is misbehaving. To punish it, unplug it from its power supply, wait a few seconds, plug it back in, wait a few more seconds, and see if the DaveNet wireless network reappears. If it does, you can select it (you may have to enter the wireless password) and use the network.

Caveats: You may have to rebuild the tunnel and unmount/remount the NFS (see "Misc How To..." above).

Your machine can reach other lab machines, but cannot reach the cluster

Symptoms: you can ping and SSH into lab machines (such as giles or xander), but you cannot ping or SSH into any cluster machines; your shell hangs whenever you go into /nfs/ or /mnt/nfs/, or try to read any file from there, or do anything involving the NFS in general.

Resolution:

  • If you can SSH into babylon.biowiki.org:
    • ...then the tunnel is probably down. Rebuild the tunnel (see "Misc How To..." above).
  • If you cannot SSH into babylon.biowiki.org:
    • ...something is wrong with garibaldi. This is kind of serious. Go down to the datacenter and reboot it.
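The checks in this section and the previous one boil down to a small decision tree. Here it is as a sketch, in case you want it in script form; the function name diagnose and its yes/no arguments are made up, and the advice strings just restate what this page already says:

```shell
# Hypothetical helper: turn three yes/no observations into the advice on this page.
# Arguments, in order:
#   1. can you reach lab machines (giles, xander, dave)?
#   2. can you reach cluster machines?
#   3. can you SSH into babylon.biowiki.org?
diagnose() {
    dg_lab="$1"; dg_cluster="$2"; dg_babylon="$3"
    if [ "$dg_lab" != "yes" ]; then
        echo "lab network down: power-cycle dave"
    elif [ "$dg_cluster" = "yes" ]; then
        echo "cluster reachable: nothing to fix"
    elif [ "$dg_babylon" = "yes" ]; then
        echo "tunnel probably down: rebuild the tunnel"
    else
        echo "garibaldi problem: reboot it in the datacenter"
    fi
}

# Usage: make the three observations by hand (ping giles, ping a cluster
# machine, ssh babylon.biowiki.org), then for example:
#   diagnose yes no yes
```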

There was a power outage

Symptoms: you show up and all the machines are at their login screens; some of the Macs may be off; you log in and cannot reach the cluster, and things hang when you use the NFS.

Resolution: everything should be brought back up automatically. You will have to rebuild the tunnel. Ensure that the following things are working: www.biowiki.org, genome.biowiki.org, and the CVS repository.

Side note: by default, Macs in the lab won't reboot after power loss. You can fix this by going to: Finder -> Applications -> System Preferences -> Hardware -> Energy Saver -> Options -> Other Options -> Restart automatically after a power failure. The "Energy Saver" settings are also where you can disable the computer, or the hard disks, from falling asleep. All of the above is really handy if you want to reliably log in remotely at any time.

You are out of the lab

You can't SSH into...

babylon.biowiki.org

Go down to the datacenter and reboot garibaldi. If that doesn't fix it, contact me.

---

-- Created by: Andrew Uzilov on 14 Jul 2006