Upgrade Fritz's RAM and attempt to restore KvmAccess.
Contents
1. People
2. Goals
- Restore KvmAccess
- Upgrade fritz to 24G of RAM
3. Outcome
3.1. Short Version
- RAM upgrade was a success
- Downtime was ~2h instead of 30 minutes
- Belkin KVM appears to not be functioning correctly
- Deleuze rebooted without (major) incident
3.2. Long Version
The RAM upgrade itself went smoothly. Fritz was down for ~15 minutes for hardware surgery, beating the expected time by a wide margin. The problems developed once it booted. None were fatal, but combined they caused fritz to take ~1h15m before afs was restored and nearly another hour before all services were restored.
Attempts to restore KvmAccess ended with only hopper accessible; the other machines appeared not to work through the KVM. ClintonEbadi requested that hopper's cables be swapped onto fritz, but fritz still displayed nothing. The KVM may be dying.
We managed to reboot deleuze. It required manually pressing the power button to make it finish halting, but it booted cleanly and quickly afterward.
Problems:
- We hit the ext3 mandatory file system check interval, which added ~40 minutes to the reboot.
- /var/lib/libvirt (/dev/md3) was not auto-detected and failed to come up, requiring manual intervention to continue booting.
  - It appears the partition type was not set correctly (Linux instead of Linux raid autodetect). The partition type was updated, but there is a high chance that this alone will not fix the boot process.
- libnss-afs, nscd, nsswitch.conf, and the init order are interacting badly, causing fritz to pause for an additional 15-20 minutes during boot.
- The Belkin KVM is behaving oddly.
TBD
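To avoid being surprised by the forced ext3 check again, the check state can be read ahead of a planned reboot from `tune2fs -l` output. The following is a minimal sketch, not something that was run during this maintenance: the helper names are hypothetical, the sample values are invented for illustration and are not fritz's, and it assumes the usual `Mount count`, `Maximum mount count`, and `Next check after` fields printed by e2fsprogs.

```python
from datetime import datetime

# Illustrative excerpt in the style of `tune2fs -l /dev/...` output.
# These numbers are made up; they are NOT taken from fritz.
SAMPLE = """\
Mount count:              38
Maximum mount count:      38
Last checked:             Sat Jan  5 12:00:00 2013
Check interval:           15552000 (6 months)
Next check after:         Thu Jul  4 12:00:00 2013
"""

def parse_tune2fs(text):
    """Turn 'Key:   value' lines into a dict (splits on the first colon only)."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def fsck_reasons(text, now):
    """Return the reasons a forced check looks likely at the next boot."""
    fields = parse_tune2fs(text)
    reasons = []
    # A maximum mount count of -1 or 0 means the mount-count check is disabled.
    max_mounts = int(fields.get("Maximum mount count", "-1"))
    mounts = int(fields.get("Mount count", "0"))
    if max_mounts > 0 and mounts >= max_mounts:
        reasons.append("mount count")
    # The "Next check after" line is only present when a time interval is set.
    next_check = fields.get("Next check after")
    if next_check:
        due = datetime.strptime(next_check, "%a %b %d %H:%M:%S %Y")
        if now >= due:
            reasons.append("check interval")
    return reasons

if __name__ == "__main__":
    print(fsck_reasons(SAMPLE, datetime(2013, 6, 1)))
```

If the returned list is non-empty, the downtime estimate for the reboot should include the time for a full file system check.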
Fixes:
- fritz's slow boot (UBIK timeouts before the network is able to come up) appears to be caused by nscd not using a persistent cache for the user and group databases. The persistent cache was enabled for both.
- /dev/md3 not being auto-detected may have a few causes.
  - The partition type was Linux instead of Linux raid autodetect. The partition type for both components of the array was updated.
  - There may still be a problem with md metadata for the partitions. Further investigation is needed.
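The partition type fix can be double-checked by reading the type byte of each primary partition straight from the disk's first sector. This is a hedged sketch that assumes a classic MBR partition table (not GPT) and primary partitions only; the function names are hypothetical. The two type codes are the standard ones: 0x83 for Linux and 0xfd for Linux raid autodetect.

```python
LINUX = 0x83                   # plain "Linux" partition type
LINUX_RAID_AUTODETECT = 0xFD   # "Linux raid autodetect" partition type

def mbr_partition_types(sector):
    """Return the type byte of each of the four primary MBR entries."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    types = []
    for i in range(4):
        # The partition table starts at byte 446; each entry is 16 bytes,
        # and the type code is the byte at offset 4 within the entry.
        entry = sector[446 + 16 * i : 446 + 16 * (i + 1)]
        types.append(entry[4])
    return types

def not_autodetect(sector):
    """List 1-based primary partition numbers still typed as plain Linux."""
    return [n + 1 for n, t in enumerate(mbr_partition_types(sector))
            if t == LINUX]

def read_first_sector(device):
    """Read the first 512 bytes of a block device (needs root)."""
    with open(device, "rb") as disk:
        return disk.read(512)
```

A typical use would be `not_autodetect(read_first_sector(path))` for each disk that holds a component of the array, where `path` is whatever device node applies; the actual device names on fritz are not recorded here. An empty list only confirms the type bytes, so it does not rule out the md metadata problem noted above.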
3.3. Next Steps
- Decide whether to try to make the Belkin KVM work properly or to acquire some other means of remote console access.
- If we go with KvmAccess, we should likely purchase new VGA and PS/2 cables, as the cables we have now may have become unreliable. They are all 5+ years old and have been in an environment that encourages plastics to become brittle.
4. Supporting Material
Excerpts from Maintenance Manual relevant to installing memory
5. Itinerary
5.1. Upgrade Fritz Memory
SERVICE IMPACTING. Goal: 30 minutes downtime. Stretch: One hour. (Fritz can take up to 15 minutes to reboot)
- Remove Belkin KVM from rack to allow access to fritz
- Power fritz down
- Install memory into fritz
- Power fritz up, ensure that POST succeeds and system boots
5.2. Restore KVM
Not Service Impacting
- Trace and re-attach cables going into Belkin KVM
- Ensure that all machines can be controlled using KVM
- Re-rack Belkin KVM
- Double check no cables were jostled loose while re-racking
5.3. Reboot Deleuze
Minor Service Impact (Mail will be rejected briefly)
- After (or while) restoring KVM, reboot deleuze.