1. Preparations

1) Have a root shell open on the other machines (i.e. outpost, mire). In general this should not be necessary, but in the January 2010 outage, after a period of Deleuze unavailability, Mire stopped accepting new logins, so we had no way to restart services except a reboot.

2) Have the KvmAccess, IpAddresses and RebootingMireSp wiki pages handy (either printed or from our Wiki backup: http://main.zgib.net/hcoop/).

3) Have the HCoop passwords list handy; it goes hand in hand with the above wiki pages.

2. Deleuze reboot

If it's a clean reboot, first shut down all the services you can, primarily those that depend on AFS. The others are also important, as basically everything depends on user lookups and libnss-afs. The more you shut down beforehand, the more controlled and smooth the reboot will be.

To reboot, hook up to the IPKVM, open the channel connected to the Deleuze console, and from there either reboot with sudo reboot as usual or, if it's hung, invoke the SysRq reboot as follows:

3. Booting

Rebooting deleuze is problematic because of the way AFS starts. On boot, you see a message that OpenAFS is starting up, and then the boot proceeds with the services that follow it. However, if the server was rebooted in response to a crash or had an unclean shutdown, AFS will salvage the vice partitions (that is, it'll run a filesystem check).

After a clean shutdown, there are no problems.

BUT if it was a crash or similar and the salvager does start, the startup script did not originally wait for the salvager to finish; it would just continue, allowing other services to start. The problem is that while a salvage is running on a partition, all volumes on that partition are inaccessible. In our case that means all volumes are inaccessible, as they're on a single partition, and all services that want to use AFS then start improperly because AFS is not yet available. We've recently updated the OpenAFS startup script to wait while the salvager is running. It is in alpha state; I think it should be working, but it prints some harmless error lines when it's done. So some of the issues with services starting without AFS should now be solved, because it *does* block until the salvager is done. Still, you can go over the items below to double-check:

From the KVM console or an SSH login, you can run "bos status deleuze" to see whether the fileserver is salvaging. If it is, the only thing you can really do is shut down the services that started after it and surely didn't start properly because AFS is (still) inaccessible. Those include:

nscd mysql postgres apache2 domtool-server cron spamassassin courier-authdaemon openbsd-inetd

You close them down with:

/etc/init.d/SERVICE_NAME stop   (init.d approach)
sv stop SERVICE_NAME            (runit approach)

killall SERVICE_NAME            (to be sure it's down)
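
The stop commands above can be looped over the service list. This is a minimal sketch, assuming each name in the list is either an init.d script or a runit service and that the process name matches the service name (which is not true for every daemon, so check with ps afterwards). RUN and INIT_DIR are hypothetical knobs introduced for this example; by default the functions only print what they would do.

```shell
#!/bin/sh
# Sketch only: stop the services that started before AFS was usable.
# RUN and INIT_DIR are illustrative knobs, not part of any HCoop tooling.
RUN=${RUN-echo}                    # default: dry run (print commands); RUN= executes them
INIT_DIR=${INIT_DIR:-/etc/init.d}
SERVICES="nscd mysql postgres apache2 domtool-server cron spamassassin courier-authdaemon openbsd-inetd"

stop_service() {
    svc=$1
    if [ -x "$INIT_DIR/$svc" ]; then
        $RUN "$INIT_DIR/$svc" stop     # init.d approach
    else
        $RUN sv stop "$svc"            # runit approach
    fi
    # Make sure nothing is left behind.
    # Assumption: the process name equals the service name.
    $RUN killall "$svc" || true
}

stop_all() {
    for svc in $SERVICES; do
        stop_service "$svc"
    done
}
```

Calling stop_all prints the commands for review; running the script with RUN set to the empty string would execute them instead.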

It is important to verify that the service is really down, especially in the case of courier-authdaemon, which won't restart cleanly with sv restart courier-authdaemon if a previous, improperly initialized instance is still running.

The salvager will take about 20 minutes. When it is done, "bos status deleuze" will no longer report salvager running, and on the console you'll probably get a couple of "waiting for busy volume..." messages which are alright.
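
Rather than re-running the status command by hand, the wait can be polled. This is a minimal sketch, assuming the status output mentions the word "salvag" in some form while the salvager runs; that match, and the STATUS_CMD and POLL_SECONDS names, are assumptions made for this example.

```shell
#!/bin/sh
# Sketch only: poll until `bos status` stops mentioning the salvager.
# STATUS_CMD and POLL_SECONDS are illustrative; matching on "salvag"
# is an assumption about the wording of the status output.
STATUS_CMD=${STATUS_CMD:-bos status deleuze}
POLL_SECONDS=${POLL_SECONDS:-60}

salvage_in_progress() {
    # Reads status text on stdin; succeeds if it mentions salvaging.
    grep -qi 'salvag'
}

wait_for_salvage() {
    while $STATUS_CMD 2>&1 | salvage_in_progress; do
        echo "salvager still running, checking again in ${POLL_SECONDS}s"
        sleep "$POLL_SECONDS"
    done
    echo "salvager no longer reported"
}
```

Note that the loop also ends if the status command itself fails (for example, if the bosserver cannot be contacted), so a quick manual "bos status deleuze" afterwards is still worthwhile.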

Then start all the services that were stopped (with init.d start or sv start), watching for any errors.

In general there should be none, except for MySQL saying things like "InnoDB: Crash recovery may have failed for some .ibd files!". This is alright; on inspection, these messages refer to people who no longer have an account at HCoop and whose databases have been removed.
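
Starting the services again can be scripted so that failures are collected in one place. This is a minimal sketch; START_CMD is a hypothetical hook introduced here, and by default it only prints the command it would run. The service order is the one from the list above, which keeps postgres ahead of domtool-server.

```shell
#!/bin/sh
# Sketch only: start the stopped services again and list any that fail.
# START_CMD is an illustrative hook; by default it only prints the command.
SERVICES="nscd mysql postgres apache2 domtool-server cron spamassassin courier-authdaemon openbsd-inetd"

# Replace with e.g. START_CMD="sv start", or a wrapper that calls
# /etc/init.d/NAME start for services managed by init.d.
START_CMD=${START_CMD:-echo sv start}

start_all() {
    failed=""
    for svc in $SERVICES; do
        if ! $START_CMD "$svc"; then
            failed="$failed $svc"
        fi
    done
    if [ -n "$failed" ]; then
        echo "failed to start:$failed" >&2
        return 1
    fi
}
```

An exit status of zero only means the start command returned success; the service logs still need a look for errors.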

4. Post-boot

Once things appear to be working again, DOUBLE-CHECK that Postgres in particular has started properly. If it has, it will appear in ps aux | grep postgr, and you'll see its processes along with, probably, some user connections that have already contacted the database. As said, double-check this: Postgres is known, in our setup at least, to need one, two or more restarts before it really starts properly.

ps aux | egrep 'postmaste()r' > /dev/null || echo 'Postgres not running!'
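
Since Postgres may need more than one restart, the check above can be wrapped in a retry loop. This is a minimal sketch; CHECK_CMD, RESTART_CMD, MAX_TRIES and WAIT_SECONDS are hypothetical names introduced here, the restart command defaults to a dry run, and the "postmaster" process name is taken from the check above (newer PostgreSQL releases name the process "postgres").

```shell
#!/bin/sh
# Sketch only: check for a running postmaster and retry the restart a few times.
# CHECK_CMD, RESTART_CMD, MAX_TRIES and WAIT_SECONDS are illustrative names.
CHECK_CMD=${CHECK_CMD:-pgrep -x postmaster}
RESTART_CMD=${RESTART_CMD:-echo sv restart postgres}   # dry run by default
MAX_TRIES=${MAX_TRIES:-3}
WAIT_SECONDS=${WAIT_SECONDS:-10}

ensure_postgres() {
    try=0
    while ! $CHECK_CMD >/dev/null 2>&1; do
        if [ "$try" -ge "$MAX_TRIES" ]; then
            echo "Postgres not running after $MAX_TRIES restart attempts!" >&2
            return 1
        fi
        try=$((try + 1))
        echo "Postgres not running, restart attempt $try"
        $RESTART_CMD
        sleep "$WAIT_SECONDS"      # give the server time to come up
    done
    echo "Postgres is running"
}
```

A running process is only a first sign of health; connecting to a database is a stronger check.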

After Postgres, restart domtool-server (it uses Postgres).

Do a test configuration of a domain with domtool (doesn't matter how simple it is) to ensure that all domtool servers are working.

4.1. Other systems

Since Deleuze is the main server, its period of unavailability will affect other machines. Specifically, the web server needs to be restarted, or even rebooted if SSH stops taking logins (this happened in the January 20, 2010 outage).
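
Whether the other machines still accept logins can be checked quickly from a working shell. This is a minimal sketch, assuming the short host names mire and outpost from the Preparations section resolve from where it is run and that key-based login is set up; HOSTS and SSH_CMD are hypothetical names introduced here.

```shell
#!/bin/sh
# Sketch only: verify that the other machines still accept SSH logins.
# HOSTS and SSH_CMD are illustrative; adjust the host names as needed.
HOSTS=${HOSTS:-mire outpost}
SSH_CMD=${SSH_CMD:-ssh -o BatchMode=yes -o ConnectTimeout=10}

check_logins() {
    bad=""
    for host in $HOSTS; do
        # Run a no-op remote command; stdin is closed so ssh cannot consume it.
        if $SSH_CMD "$host" true </dev/null >/dev/null 2>&1; then
            echo "$host: login ok"
        else
            echo "$host: login FAILED"
            bad="$bad $host"
        fi
    done
    [ -z "$bad" ]    # non-zero exit status if any host failed
}
```

A failed login here is the signal to fall back on the root shells opened beforehand or, failing that, a reboot of the affected machine.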

Also, restart domtool-slave processes on all machines that have it.

Also, during the last Deleuze unavailability, DNS on outpost stopped receiving updates; it needed a service restart (domtool and/or bind9).

So the bottom line is, after rebooting deleuze, re-check everything ;-)

CategorySystemAdministration CategoryOutdated CategoryHistorical

RebootingDeleuze (last edited 2018-10-20 04:03:42 by ClintonEbadi)