1. What Debian version do we run on each server?
AdamChlipala suggests stable on Main and testing on Dynamic and Shell because:
- We want our primary services to be as reliable as possible.
- Members will want to use some cutting-edge stuff for running their dynamic web sites and custom daemons, and stable doesn't keep up very well with the cutting edge. On the other hand, unstable just seems too risky.
- If Shell is used as a testing environment for services later pushed to Dynamic, then it should have the same software versions as Dynamic.
Update: We're currently planning stable on Main and Dynamic, since testing too often has catastrophic upgrade failures in practice.
2. What resource limits are imposed on the different servers?
2.1. Decisions that we've agreed on
- We don't need explicit limits on usage of Main's local resources, because only admins will be able to control them.
2.2. Questions to be resolved
- Do we impose ulimits and related stuff on Dynamic?
AdamChlipala says:
- We need some measures in place to prevent runaway processes from crashing everyone's dynamic web sites. The question is, do we use automated measures or do we just monitor closely and intervene manually when needed? A bad runaway process can take the server down quickly, so I think it's necessary to use ulimits and their ilk.
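If we do go the automated route, the usual Debian mechanism is pam_limits via /etc/security/limits.conf. The following is a minimal sketch only; the group name and every numeric value are placeholders, not numbers we've agreed on:

```
# /etc/security/limits.conf on Dynamic (hypothetical values)
# <domain>   <type>  <item>  <value>
@member      hard    nproc   100      # max simultaneous processes per member
@member      hard    as      524288   # address space cap, in KB (512 MB)
@member      hard    cpu     30       # CPU time per process, in minutes
```

For these limits to take effect on login, pam_limits must be enabled in the relevant /etc/pam.d service files (it is by default in Debian's common-session).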
- How do we control resource usage on Shell?
AdamChlipala says:
- I think I'm in favor of no ulimits or similar on Shell, relying instead on monitoring and manual intervention to deal with runaway processes and other horrors. We've already had some folks unable to use certain implementations of non-mainstream programming languages because those implementations can't cope with our resource limits... and, if you know me, you can probably guess that that Just Breaks My Heart!
- Where we do decide to use monitoring and manual intervention, what monitoring tools can best help us do it?
DavorOcelic says:
- I've talked about this multiple times before, and I'm still interested in doing something real in this area. First of all, there's a log parser I've written, which is very similar to Logsurfer (or Logsurfer+, for that matter) but resolves some of their crucial limitations; we'd definitely turn the Main machine into a common loghost, so that would be a good place to deploy it. The second tool I have in mind is Nagios, a ping/service/anything monitoring tool. The third is the excellent Puppet (a kind of new-generation cfengine), which we can script to test and fix stuff on our systems.
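To make the Nagios suggestion concrete, here is a sketch of what checking one of our services might look like. The host name, address, and templates are illustrative, not an agreed-on configuration:

```
# Hypothetical Nagios object definitions for watching Dynamic's web server.
define host {
    use         generic-host        ; stock template shipped with Nagios
    host_name   dynamic
    address     dynamic.hcoop.net   ; placeholder address
}

define service {
    use                 generic-service
    host_name           dynamic
    service_description HTTP
    check_command       check_http  ; standard plugin from nagios-plugins
}
```

Nagios then pings the host and polls the service on a schedule, and can page admins when a check fails, which is exactly the "monitor closely and intervene manually" model discussed above.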
3. Who can log into which servers?
3.1. Decisions that we've agreed on
- Only admins can log into Main
- Everyone can log into Shell
DavorOcelic says:
- This is a good general rule. For any exceptions, both the usual Unix auth mechanism and LDAP allow great flexibility (per-user list of allowed machines and also per-machine list of allowed users).
3.2. Questions to be resolved
- Can everyone log into Dynamic, too?
AdamChlipala says:
- I think it is important to allow this. My mental model has Shell made deliberately unstable because we don't know how to impose automatic limits that allow all of the stuff that people want to do. I know that a lot of the people involved in this planning aren't particularly interested in using non-mainstream programming languages and other things that conventional hosting providers are never going to support, but for me and several other members this is one of the defining aspects of HCoop. That means that we need to be able to go crazy with Shell, while committing to keeping Dynamic up all the time. If Shell is down, members need to be able to use Dynamic to configure their services. That doesn't mean that they can't use the development-production split model when Shell is up, logging in only there.
4. How are we going to handle the basic logistics of a shared filesystem and logins?
4.1. Decisions that we've agreed on
- We're going to use the AFS filesystem and Kerberos (AFS mandates the use of Kerberos).
- We're going to use LDAP for logins (it plays well with AFS and Kerberos, no worries).
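For a sense of how the three pieces fit together, creating one member account touches all of them. This is a hedged sketch; the principal name and file name are illustrative, and the exact commands depend on the Kerberos implementation and LDAP layout we choose:

```
# Hypothetical account-creation steps under Kerberos + AFS + LDAP.
kadmin -q "addprinc newmember"     # Kerberos principal: how the member authenticates
pts createuser newmember           # AFS protection-database entry: how AFS knows them
ldapadd -f newmember.ldif          # LDAP posixAccount entry: uid/gid/shell for logins
```

The point of the split is that passwords live only in Kerberos, file permissions live in AFS's protection database, and LDAP just supplies the POSIX account details that login services need.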
4.2. Questions to be resolved
Everything else!
5. How are we going to charge members accurately for their disk usage (monetarily, or just to have a sense of who is using what)?
There are a lot of issues here. We provide a number of shared services that create files on behalf of members but whose files are, by default, owned by a single UNIX user. Examples include PostgreSQL and MySQL databases, virtual mailboxes, Mailman mailing lists, and domtool configuration files. Any of these can grow so large as to use up all disk space on a volume, through either malicious action or accidental runaway processes.
Right now we use this gimpy scheme of group quotas on /home, storing all of these files on that partition with group ownership indicating which member is responsible for them. I think AFS provides a nicer way of doing this. With the way we do it now, we are constantly fighting the out-of-the-box Debian packages, which set permissions differently from how we need them. With AFS, I think we can separate permissions from locations.
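Concretely, the AFS approach could give each member one volume whose quota caps everything that member is responsible for, with ACLs granting service daemons access without any chown games. A sketch, where the cell name, partition, user, and quota are all placeholders:

```
# Hypothetical per-member volume setup in AFS (names and sizes illustrative).
vos create afsserver /vicepa user.jdoe                 # one volume per member
fs mkmount /afs/hcoop.net/user/jdoe user.jdoe          # mount it in the tree
fs setquota /afs/hcoop.net/user/jdoe -max 500000       # quota, in KB
fs setacl /afs/hcoop.net/user/jdoe/mail postfix write  # service access via ACL, not ownership
vos examine user.jdoe                                  # report usage against quota
```

Because AFS ACLs are per-directory and independent of UNIX ownership, a service like the mail system can write into a member's volume (counting against that member's quota) without the files having to be owned by the member's UNIX uid, which is exactly the permissions-versus-locations separation described above.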