Diff for "OnSiteVisits/20140417"

Differences between revisions 4 and 5
Revision 4 as of 2014-04-16 18:45:25
Size: 1095
Editor: ClintonEbadi
Comment: include a few pieces of useful documentation
Revision 5 as of 2014-04-18 05:41:38
Size: 2806
Editor: ClintonEbadi
Comment: things that happened
Line 2: Line 2:
Planned next visit
Line 4: Line 3:
Upgrade Fritz's RAM and attempt to restore KvmAccess.

<<TableOfContents>>
Line 14: Line 16:

== Outcome ==

=== Short Version ===

 * RAM upgrade was a success
 * Downtime was ~2h instead of 30 minutes
 * Belkin KVM appears to not be functioning correctly
 * Deleuze rebooted without (major) incident

=== Long Version ===

The RAM upgrade itself went smoothly: fritz was down for only ~15 minutes of hardware surgery, beating the expected time by a wide margin. The problems began when it booted. None were fatal, but together they meant it took ~1h15m before AFS was restored and nearly another hour before all services were back.

Attempts to restore KvmAccess ended with hopper being accessible and the other machines appearing to not work. ClintonEbadi requested that hopper's cables be swapped onto fritz, but fritz continues to not display anything. The KVM may be dying.

We managed to reboot deleuze. It required manually pressing the power button to make it finish halting, but it booted cleanly and quickly afterward.

Problems:

 * We hit the ext3 mandatory file system check interval, which added ~40 minutes to the reboot.
 * `/var/lib/libvirt` (`/dev/md3`) was not auto-detected and failed to come up, requiring manual intervention to continue booting.
   * It appears the partition type was not set correctly (`Linux` instead of `Linux RAID autodetect`). The partition type was updated, but there is a high chance that this alone will not fix the boot process.
 * libnss-afs, nscd, nsswitch.conf, and the init order are interacting badly, causing fritz to pause for an additional 15-20 minutes during boot.
 * The Belkin KVM is behaving oddly.
 * '''TBD'''

Upgrade Fritz's RAM and attempt to restore KvmAccess.

1. People

2. Goals

  • Restore KvmAccess

  • Upgrade fritz to 24G of RAM

3. Outcome

3.1. Short Version

  • RAM upgrade was a success
  • Downtime was ~2h instead of 30 minutes
  • Belkin KVM appears to not be functioning correctly
  • Deleuze rebooted without (major) incident

3.2. Long Version

The RAM upgrade itself went smoothly: fritz was down for only ~15 minutes of hardware surgery, beating the expected time by a wide margin. The problems began when it booted. None were fatal, but together they meant it took ~1h15m before AFS was restored and nearly another hour before all services were back.

Attempts to restore KvmAccess ended with hopper being accessible and the other machines appearing to not work. ClintonEbadi requested that hopper's cables be swapped onto fritz, but fritz continues to not display anything. The KVM may be dying.

We managed to reboot deleuze. It required manually pressing the power button to make it finish halting, but it booted cleanly and quickly afterward.

Problems:

  • We hit the ext3 mandatory file system check interval, which added ~40 minutes to the reboot.
  • /var/lib/libvirt (/dev/md3) was not auto-detected and failed to come up, requiring manual intervention to continue booting.

    • It appears the partition type was not set correctly (Linux instead of Linux RAID autodetect). The partition type was updated, but there is a high chance that this alone will not fix the boot process.

  • libnss-afs, nscd, nsswitch.conf, and the init order are interacting badly, causing fritz to pause for an additional 15-20 minutes during boot.
  • The Belkin KVM is behaving oddly.
  • TBD
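
The partition type problem noted above can be checked without rebooting. The following is a minimal sketch, assuming the disks use a classic MBR partition table (where type `0x83` is "Linux" and `0xfd` is "Linux raid autodetect"); whether fritz's disks are actually laid out this way is an assumption. The function names and the sample sector builder are hypothetical and exist only for illustration.

```python
from typing import List

# MBR partition type codes (assumes an MBR, not GPT, partition table).
LINUX = 0x83
LINUX_RAID_AUTODETECT = 0xFD

def mbr_partition_types(sector: bytes) -> List[int]:
    """Return the type byte of each of the four primary MBR entries."""
    if len(sector) < 512 or sector[510:512] != b"\x55\xaa":
        raise ValueError("not a valid MBR sector")
    types = []
    for i in range(4):
        # The partition table starts at byte 446; each entry is 16 bytes
        # and the type code is the fifth byte (offset 4) of the entry.
        entry = sector[446 + 16 * i:446 + 16 * (i + 1)]
        types.append(entry[4])
    return types

def plain_linux_partitions(sector: bytes) -> List[int]:
    """Return 1-based numbers of partitions typed as plain Linux (0x83).

    On a disk whose partitions are all meant to be RAID members, these
    are the ones the kernel's RAID autodetection would skip.
    """
    return [i + 1 for i, t in enumerate(mbr_partition_types(sector))
            if t == LINUX]

def make_sample_sector(types: List[int]) -> bytes:
    """Build a synthetic 512-byte MBR sector (illustration only)."""
    sector = bytearray(512)
    for i, t in enumerate(types[:4]):
        sector[446 + 16 * i + 4] = t
    sector[510:512] = b"\x55\xaa"
    return bytes(sector)

if __name__ == "__main__":
    sample = make_sample_sector([LINUX_RAID_AUTODETECT, LINUX, 0, 0])
    for number in plain_linux_partitions(sample):
        print("partition %d is typed Linux, not Linux raid autodetect" % number)
```

To inspect a real disk, the first 512 bytes of the device would be read as root and passed to `plain_linux_partitions`; the device path depends on the machine and is not given here.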

4. Supporting Material

inside-fritz.png

5. Itinerary

5.1. Upgrade Fritz Memory

SERVICE IMPACTING. Goal: 30 minutes downtime. Stretch: One hour. (Fritz can take up to 15 minutes to reboot)

  • Remove Belkin KVM from rack to allow access to fritz
  • Power fritz down
  • Install memory into fritz
  • Power fritz up, ensure that POST succeeds and system boots

5.2. Restore KVM

Not Service Impacting

  • Trace and re-attach cables going into Belkin KVM
  • Ensure that all machines can be controlled using KVM
  • Re-rack Belkin KVM
  • Double check no cables were jostled loose while re-racking

5.3. Reboot Deleuze

Minor Service Impact (Mail will be rejected briefly)

  • After (or while) restoring KVM, reboot deleuze.

OnSiteVisits/20140417 (last edited 2014-04-18 19:19:34 by ClintonEbadi)