<?xml version="1.0" encoding="utf-8"?><!DOCTYPE article  PUBLIC '-//OASIS//DTD DocBook XML V4.4//EN'  'http://www.docbook.org/xml/4.4/docbookx.dtd'><article><articleinfo><title>OnSiteVisits/20140417</title><revhistory><revision><revnumber>6</revnumber><date>2014-04-18 19:19:34</date><authorinitials>ClintonEbadi</authorinitials><revremark>in theory, fritz boots cleanly now</revremark></revision><revision><revnumber>5</revnumber><date>2014-04-18 05:41:38</date><authorinitials>ClintonEbadi</authorinitials><revremark>things that happened</revremark></revision><revision><revnumber>4</revnumber><date>2014-04-16 18:45:25</date><authorinitials>ClintonEbadi</authorinitials><revremark>include a few pieces of useful documentation</revremark></revision><revision><revnumber>3</revnumber><date>2014-04-15 16:52:26</date><authorinitials>ClintonEbadi</authorinitials><revremark>more details</revremark></revision><revision><revnumber>2</revnumber><date>2014-04-15 16:46:16</date><authorinitials>ClintonEbadi</authorinitials><revremark>date changed</revremark></revision><revision><revnumber>1</revnumber><date>2014-04-05 04:34:21</date><authorinitials>ClintonEbadi</authorinitials><revremark>stub</revremark></revision></revhistory></articleinfo><para>Upgrade Fritz's RAM and attempt to restore <ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/KvmAccess#">KvmAccess</ulink>. 
</para><section><title>People</title><itemizedlist><listitem><para><ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/AnishJacob#">AnishJacob</ulink> </para></listitem><listitem><para><ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/SrikanthSastry#">SrikanthSastry</ulink> </para></listitem></itemizedlist></section><section><title>Goals</title><itemizedlist><listitem><para>Restore <ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/KvmAccess#">KvmAccess</ulink> </para></listitem><listitem><para>Upgrade fritz to 24G of RAM </para></listitem></itemizedlist></section><section><title>Outcome</title><section><title>Short Version</title><itemizedlist><listitem><para>RAM upgrade was a success </para></listitem><listitem><para>Downtime was ~2h instead of 30 minutes </para></listitem><listitem><para>Belkin KVM appears not to be functioning correctly </para></listitem><listitem><para>Deleuze rebooted without (major) incident </para></listitem></itemizedlist></section><section><title>Long Version</title><para>The actual RAM upgrade for fritz seemed to go quite smoothly. Fritz was down for ~15 minutes for hardware surgery, beating the expected time by a great margin. The problems developed when it booted. None were fatal, but combined they caused fritz to take ~1h15m before AFS was restored and nearly another hour before all services were restored. </para><para>Attempts to restore <ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/KvmAccess#">KvmAccess</ulink> ended with hopper being accessible and the other machines apparently not working. <ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/ClintonEbadi#">ClintonEbadi</ulink> requested that hopper's cables be swapped onto fritz, but fritz still does not display anything. The KVM may be dying. </para><para>We managed to reboot deleuze. It required manually pressing the power button to make it finish halting, but it booted cleanly and quickly afterward. 
</para><para>Problems: </para><itemizedlist><listitem><para>We hit the ext3 mandatory file system check interval, which added ~40 minutes to the reboot. </para></listitem><listitem><para><code>/var/lib/libvirt</code> (<code>/dev/md3</code>) was not auto-detected and failed to come up, requiring manual intervention to continue booting. </para><itemizedlist><listitem><para>It appears the partition type was not set correctly (<code>Linux</code> instead of <code>Linux raid autodetect</code>). The partition type was updated, but there is a good chance that this alone will not fix the boot process. </para></listitem></itemizedlist></listitem><listitem><para>libnss-afs, nscd, nsswitch.conf, and the init order are interacting badly, causing fritz to pause for an additional 15-20 minutes during boot. </para></listitem><listitem><para>The Belkin KVM is behaving oddly. </para></listitem><listitem><para><emphasis role="strong">TBD</emphasis> </para></listitem></itemizedlist><para>Fixes: </para><itemizedlist><listitem><para>fritz's slow boot (UBIK timeouts before the network is able to come up) appears to be caused by <code>nscd</code> not using a persistent cache for the user and group databases. <code>persistent-cache</code> was enabled for both. </para></listitem><listitem><para><code>/dev/md3</code> not being auto-detected may have a few causes. </para><itemizedlist><listitem><para>Partition type was <code>Linux</code> instead of <code>Linux raid autodetect</code>. The partition type for both components of the array was updated. </para></listitem><listitem><para>There may still be a problem with the md metadata for the partitions. Further investigation is needed. 
</para></listitem></itemizedlist></listitem></itemizedlist></section><section><title>Next Steps</title><itemizedlist><listitem><para>Decide whether to try to make the Belkin KVM work properly or to acquire some other means of remote console access </para><itemizedlist><listitem><para>If we go with <ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/KvmAccess#">KvmAccess</ulink>, we should likely purchase new VGA and PS/2 cables, as the cables we have now may have become unreliable (they are, after all, all 5+ years old and have been in an environment that encourages plastics to become brittle) </para></listitem></itemizedlist></listitem></itemizedlist></section></section><section><title>Supporting Material</title><itemizedlist><listitem><para><ulink url="https://wiki.hcoop.net/OnSiteVisits/20140417/OnSiteVisits/20140417?action=AttachFile&amp;do=get&amp;target=ram-upgrade-guide.pdf">Excerpts from Maintenance Manual</ulink> relevant to installing memory </para></listitem></itemizedlist><para><inlinemediaobject><imageobject><imagedata fileref="https://wiki.hcoop.net/OnSiteVisits/20140417?action=AttachFile&amp;do=get&amp;target=inside-fritz.png"/></imageobject><textobject><phrase>inside-fritz.png</phrase></textobject></inlinemediaobject> </para></section><section><title>Itinerary</title><section><title>Upgrade Fritz Memory</title><para><emphasis role="strong">SERVICE IMPACTING</emphasis>. Goal: 30 minutes downtime. Stretch: One hour. 
(Fritz can take up to 15 minutes to reboot) </para><itemizedlist><listitem><para>Remove Belkin KVM from rack to allow access to fritz </para></listitem><listitem><para>Power fritz down </para></listitem><listitem><para>Install memory into fritz </para></listitem><listitem><para>Power fritz up, ensure that POST succeeds and system boots </para></listitem></itemizedlist></section><section><title>Restore KVM</title><para><emphasis>Not Service Impacting</emphasis> </para><itemizedlist><listitem><para>Trace and re-attach cables going into Belkin KVM </para></listitem><listitem><para>Ensure that all machines can be controlled using KVM </para></listitem><listitem><para>Re-rack Belkin KVM </para></listitem><listitem><para>Double check no cables were jostled loose while re-racking </para></listitem></itemizedlist></section><section><title>Reboot Deleuze</title><para><emphasis>Minor Service Impact</emphasis> (Mail will be rejected briefly) </para><itemizedlist><listitem><para>After (or while) restoring KVM, reboot deleuze. </para></listitem></itemizedlist></section></section></article>