Diff for "RebootingDeleuze"

Differences between revisions 2 and 3

This page describes the steps that admins need to take after rebooting deleuze.

Rebooting deleuze is problematic because of the way AFS starts. On boot, you see a message that OpenAFS is starting up, and the boot then proceeds with the services that follow it. However, if the server was rebooted in response to a crash or had an unclean shutdown, AFS will salvage the vice partitions (that is, it will run a filesystem check). Unlike, e.g., an ext3 fsck, the startup script does not wait for the salvage to finish; it simply continues, allowing other services to start. The problem is that while a salvage is running on a partition, all volumes on that partition are inaccessible. In our case, this means all volumes are inaccessible, since they are on a single partition.

So from the KVM console or an SSH login, you can run "bos status deleuze" to see whether the fileserver is salvaging. If it is, the only thing you can really do is shut down the services that started after AFS, which almost certainly did not start properly because AFS was (still) inaccessible. Those include:

nscd mysql postgres apache2 domtool-server cron spamassassin courier-authdaemon

You close them down with:

/etc/init.d/SERVICE_NAME stop   (init.d approach)
sv stop SERVICE_NAME            (runit approach)

killall SERVICE_NAME            (to be sure it's down)
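
The stop sequence above could be scripted roughly as follows. This is a minimal sketch, not an existing HCoop script: the function name stop_service, the RUN dry-run variable, and the "initd"/"sv" argument are assumptions made for illustration, and the real init script and process names on deleuze may differ from the service names listed above.

```shell
#!/bin/sh
# Sketch only: stop one service via init.d or runit, then killall as a backstop.
# RUN is a hypothetical dry-run hook: set RUN=echo to print the commands
# instead of executing them.

stop_service() {
    name=$1      # service name, e.g. mysql
    kind=$2      # "initd" or "sv"; which one applies per service is an assumption

    case $kind in
        initd) $RUN "/etc/init.d/$name" stop ;;
        sv)    $RUN sv stop "$name" ;;
        *)     echo "unknown kind: $kind" >&2; return 2 ;;
    esac

    # Backstop: kill anything still running under that process name.
    # killall exits non-zero when nothing matched, which is fine here.
    $RUN killall "$name" || true
}
```

For example, "RUN=echo stop_service cron sv" prints the commands for review, and "stop_service cron sv" runs them.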

It is important to verify that each service is really down, especially courier-authdaemon, which will not restart cleanly with "sv restart courier-authdaemon" if a previous, improperly initialized instance is still running.

The salvager takes about 20 minutes. When it is done, "bos status deleuze" will no longer report the salvager running, and on the console you will probably see a couple of "waiting for busy volume..." messages, which are harmless.
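
Rather than re-running "bos status deleuze" by hand, the wait could be sketched as a small polling loop. This is an illustration under stated assumptions: it assumes the bos status output contains the text "salvag" (as in "salvaging") while the salvager runs, which should be checked against the real output on deleuze; the function names and the BOS override variable are hypothetical.

```shell
#!/bin/sh
# Sketch only: poll "bos status" until its output no longer mentions salvaging.
# Assumption: while the salvager runs, the output contains the string "salvag".

# Reads bos status output on stdin; succeeds if it mentions salvaging.
is_salvaging() {
    grep -qi 'salvag'
}

# $1 = server name, $2 = seconds between polls (default 60).
# BOS is a hypothetical override for the bos binary, useful for testing.
# Note: if bos itself fails, the loop also ends, so confirm the result
# with a manual "bos status" before restarting services.
wait_for_salvage() {
    server=$1
    interval=${2:-60}
    while ${BOS:-bos} status "$server" 2>&1 | is_salvaging; do
        echo "salvage still running on $server; sleeping ${interval}s"
        sleep "$interval"
    done
}
```

A call such as "wait_for_salvage deleuze 60" would then return once the salvage no longer shows up in the status output.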

Then start all the services that were stopped, using init.d start or sv start, and watch for any errors.
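
To make start-up errors harder to miss, the restart could be sketched as a loop that records which start commands failed. As before, this is only an illustration: start_service, start_all, the RUN hook, and the name:kind argument format are hypothetical, and a zero exit status from a start command does not guarantee the service is healthy.

```shell
#!/bin/sh
# Sketch only: start each service and report the ones whose start command failed.
# RUN is a hypothetical hook (RUN=echo prints commands instead of running them).

start_service() {
    name=$1
    kind=$2      # "initd" or "sv"
    case $kind in
        initd) $RUN "/etc/init.d/$name" start ;;
        sv)    $RUN sv start "$name" ;;
        *)     echo "unknown kind: $kind" >&2; return 2 ;;
    esac
}

# Arguments are name:kind pairs, e.g. mysql:initd cron:sv
start_all() {
    failed=""
    for pair in "$@"; do
        name=${pair%%:*}    # text before the first colon
        kind=${pair##*:}    # text after the last colon
        start_service "$name" "$kind" || failed="$failed $name"
    done
    if [ -n "$failed" ]; then
        echo "failed to start:$failed" >&2
        return 1
    fi
}
```

For example, "start_all mysql:initd cron:sv" starts both services and lists any whose start command returned an error.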

In general, there should be none, apart from messages such as MySQL's "InnoDB: Crash recovery may have failed for some .ibd files!". This is harmless: on inspection, these messages refer to people who no longer have an account at HCoop and whose databases have been removed.

-  ⇤ ← Revision 2 as of 2008-10-26 11:53:37 → 
  Size: 817
  Editor: AdamChlipala
  Comment: SpamAssassin
+   ← Revision 3 as of 2009-12-24 12:42:03 → ⇥
  Size: 2202
  Editor: DavorOcelic
  Comment:
-Deletions are marked like this.
+Additions are marked like this.
 Line 3:
-. Wait at least 20 minutes, and likely no more than 40 minutes, for AFS to come online and finish its consistency checks.
 2. Restart mysql, stopping explicitly first, then starting:
 {{{
/etc/init.d/mysql stop
killall mysql
/etc/init.d/mysql start
+Rebooting deleuze is problematic because of the way how AFS starts. On boot, you see a message of OpenAFS starting up, and then it proceeds with other services that follow after it. However,  if the server was rebooted in response to a crash or had an unclean shutdown, AFS will salvage the vice partitions (that is, it'll run a filesystem check). BUT, the startup script won't wait for it to finish as i.e. an ext3.fsck would, it will just continue, allowing other services to start. The problem is, when salvage is running on a partition, all volumes from the partition are inaccessible. In our case, it means all volumes are inaccessible as they're on a single partition.

So from the KVM console or SSH login, you can run "bos status deleuze" to see whether the fileserver is salvaging. If yes, really the only thing you can do is shut down  the services which started after it, and surely didn't start properly because AFS is (still) inaccessible. Those include: {{{
nscd mysql postgres apache2 domtool-server cron spamassassin courier-authdaemon
-Line 10:
+Line 8:
-. Restart postgresql:
 {{{
/etc/init.d/postgresql-8.1 stop
killall postmaster
/etc/init.d/postgresql-8.1 start
+You close them down with: {{{
/etc/init.d/SERVICE_NAME stop   (init.d approach)
sv stop SERVICE_NAME            (runit approach)

killall SERVICE_NAME            (to be sure it's down)
-Line 16:
+Line 15:
-. Restart apache:
 {{{
/etc/init.d/apache2 stop
killall apache2
/etc/init.d/apache2 start
}}}
 5. Restart domtool:
 {{{
/etc/init.d/domtool-server stop
/etc/init.d/domtool-server start
}}}
 6. Restart cron:
 {{{
sv stop cron
killall cron
sv start cron
}}}
 7. Restart Spam``Assassin
 {{{
/etc/init.d/spamassassin restart
 }}}
+It is important to verify that the service is really down; especially in case of courier-authdaemon which won't want to restart cleanly using sv restart courier-authdaemon if a previous improperly initialized instance is running.

The salvager will take about 20 minutes. When it is done, "bos status deleuze" will no longer report salvager running, and on the console you'll probably get a couple of "waiting for busy volume..." messages which are alright.

Then, you init.d start or sv start all those services that were stopped, watching for any errors.

In general, there should be none, except for things like MySQL saying things like "InnoDB: Crash recovery may have failed for some .ibd files!". This is alright; looking at it, one sees these are messages for people who no longer have an account at HCoop and their databases have been removed.
