SoftwareArchitecturePlans

This page was meant to organize a discussion and is not the canonical reference on our organizational decisions. It may often be out of date.

1. Terminology

To save space below, we'll use the following working names for the different pieces of hardware involved:

2. The Big List of Scary Things

These are the issues that we're dealing with for the first time in our new set-up, meaning that we should pay special attention to them.

3. The Big Questions

3.1. What Debian version do we run on each server?

AdamChlipala suggests stable on Main and testing on Dynamic and Shell because:

Update: We're currently planning stable on Main and Dynamic, since testing too often has catastrophic upgrade failures in practice.

3.2. What resource limits are imposed on the different servers?

3.2.1. Decisions that we've agreed on

3.2.2. Questions to be resolved

  1. Do we impose ulimits and related stuff on Dynamic?

    AdamChlipala says:

    • We need some measures in place to prevent runaway processes from crashing everyone's dynamic web sites. The question is, do we use automated measures or do we just monitor closely and intervene manually when needed? A bad runaway process can take the server down quickly, so I think it's necessary to use ulimits and their ilk.
  2. How do we control resource usage on Shell?

    AdamChlipala says:

    • I think I'm in favor of no ulimits or similar on Shell, relying on monitoring and manual intervention to deal with runaway processes and other horrors. We've already had some folks unable to use some implementations of non-mainstream programming languages because these implementations aren't able to deal with our resource limits... and, if you know me, you can probably guess that that Just Breaks My Heart!
  3. Where we do decide to use monitoring and manual intervention, what monitoring tools can best help us do it?

    DavorOcelic says:

    • I've talked about this multiple times before, and I'm still interested in doing something real in this area. First of all, there's a log parser I've written, which is very similar to Logsurfer (or Logsurfer+ for that matter), but which resolves some of their crucial limitations; we'd definitely turn the Main machine into a common loghost, so that would be a good place to deploy it. Another good thing I have in mind is Nagios, a ping/service/anything monitoring tool. The third tool I have in mind is the excellent Puppet (a kind of new-generation cfengine) that we can script to test and fix stuff on our systems.

3.3. Who can log into which servers?

3.3.1. Decisions that we've agreed on

3.3.2. Questions to be resolved

  1. Can everyone log into Dynamic, too?

    AdamChlipala says:

    • I think it is important to allow this. My mental model has Shell made deliberately unstable because we don't know how to impose automatic limits that allow all of the stuff that people want to do. I know that a lot of the people involved in this planning aren't particularly interested in using non-mainstream programming languages and other things that conventional hosting providers are never going to support, but for me and several other members this is one of the defining aspects of HCoop. That means that we need to be able to go crazy with Shell, while committing to keeping Dynamic up all the time. If Shell is down, members need to be able to use Dynamic to configure their services. That doesn't mean that they can't use the development-production split model when Shell is up, logging in only there.

3.4. How are we going to handle the basic logistics of a shared filesystem and logins?

3.4.1. Decisions that we've agreed on

3.4.2. Questions to be resolved

Everything else!

3.5. How are we going to charge (monetarily or just to have a sense of who is using what) members accurately for their disk usage?

There are a lot of issues here. We provide a number of shared services whose default models create files on the behalf of members but that are (by default) owned by a single UNIX user. Examples include PostgreSQL and MySQL databases, virtual mailboxes, Mailman mailing lists, and domtool configuration files. Any of these can grow so large as to use up all disk space on a volume, through either malicious action or accidental runaway processes.

Right now we use this gimpy scheme of group quotas on /home, storing all of these files on that partition with group ownership telling which member is responsible for them. I think AFS provides a nicer way of doing this. With the way we do it now, we are constantly fighting the behavior of the out-of-the-box Debian packages to set permissions differently than how we need them to be. With AFS, I think we can separate permissions from locations.

4. Daemons shared by members

4.1. Off-site file back-up services

4.1.1. Questions to be resolved

4.2. DNS

4.2.1. Decisions that we've agreed on

Update: Scrap that! We're using BIND on Main and Dynamic, since it's so much better supported throughout the 'net, makes master/slave configurations easier, etc. In the future, we want to expand to include a tertiary DNS server in a different geographic location and on an entirely different network.

4.2.2. Questions to be resolved

  1. How do we arrange redundant DNS infrastructure?

JustinLeitgeb says:

4.2.3. References to how we do things now

DnsConfiguration, DomainRegistration

4.3. FTP

4.3.1. Decisions that we've agreed on

4.3.2. References to how we do things now

FtpConfiguration, FileTransfer

4.4. HTTP

4.4.1. Decisions that we've agreed on

4.4.2. Questions to be resolved

  1. Do we completely separate administrative web sites from the rest, or do we allow any member static web site to be served by Main?

    DavorOcelic says:

    • Well, I don't think we have enough administrative web sites (nor are the ones we have used heavily enough) to justify complete separation. It should be OK to run static web sites from Main, I believe. We could create default web spaces for users, like ~/public_html/ served from Dynamic, and ~/static_html/ served from Main, or something like that. (Please give more input on this.)
      • I think it would be better to have a domtool directive that chose which machine the site was served on (e.g. ServedOn static|dynamic) and then let members choose how to lay out their own directories. -- ClintonEbadi

4.4.3. References to how we do things now

UserWebsites, DynamicWebSites, VirtualHostConfiguration

4.5. IMAP/POP

4.5.1. Decisions that we've agreed on

4.5.2. Questions to be resolved

  1. Do we keep using Courier IMAP or do we switch to something like Cyrus?

4.5.3. References to how we do things now

UsingEmail, EmailConfiguration

4.6. Jabber

4.6.1. Decisions that we've agreed on

4.6.2. Questions to be resolved

4.6.3. References to how we do things now

JabberServer

4.7. Mailing lists

4.7.1. Decisions that we've agreed on

4.7.2. Questions to be resolved

  1. How/where do we store mailing list data so that it is appropriately charged towards a member's storage quota?

4.7.3. References to how we do things now

MailingListConfiguration

4.8. Relational database servers

4.8.1. Decisions that we've agreed on

4.8.2. Questions to be resolved

  1. Are we satisfied with the latest versions from Debian stable, or do we want to do something special?
  2. Do remote PostgreSQL authentication (from Dynamic, etc.) via the ident method? DavorOcelic thinks it's OK.

4.8.3. References to how we do things now

UsingDatabases

4.9. SMTP

4.9.1. Decisions that we've agreed on

4.9.2. Questions to be resolved

  1. Run secondary MX on Dynamic or elsewhere?

4.9.3. References to how we do things now

UsingEmail, EmailConfiguration

4.10. Spam detection

4.10.1. Decisions that we've agreed on

4.10.2. References to how we do things now

UsingEmail, SpamAssassin, FeedingSpamAssassin, SpamAssassinAdmin

4.11. SSH

4.11.1. Decisions that we've agreed on

4.11.2. References to how we do things now

SshConfiguration

4.12. SIP Redirection

5. Services run on top of these daemons

5.1. Domtool

Everyone's favorite spiffy system for letting legions of users manage the same daemons securely.

AdamChlipala says:

JustinLeitgeb says:

5.1.1. References to how we do things now

DomainTool

5.2. Portal

5.2.1. Decisions that we've agreed on

5.2.2. References to how we do things now

The portal

5.3. Web e-mail client

5.3.1. Decisions that we've agreed on

5.3.2. References to how we do things now

SquirrelMail

5.4. Webmin/Usermin

5.4.1. Decisions that we've agreed on

5.4.2. References to how we do things now

Usermin

5.5. Wiki

5.5.1. Decisions that we've agreed on

5.5.2. Questions to be resolved

5.5.3. References to how we do things now

This wiki

6. Security

Here are the security issues we need to worry about, sorted by resource categories of varying abstraction levels. What we mostly deal with here is avoiding negative consequences of actions by members with legitimate access to our servers.

6.1. CPU time

We haven't really encountered any trouble with this literal resource yet. However, potential problems come in when we're talking about user dynamic web site programs called by a shared Apache daemon. Apache allocates a fixed set of child processes, and each pending dynamic web site program takes up one child process for the duration of its life. Enough infinite-looping or slow CGI scripts can bring Apache down for everyone.

6.1.1. Current remedies

As per ResourceLimits, we use patched suexec programs to limit dynamic page generation programs to 10 seconds of running time. We also have a time-out for mod_proxy accesses, which we provide to allow members to implement dynamic web sites through their own daemons that the main Apache proxies.
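
To make the idea concrete, here is a minimal sketch of capping a child program's running time with a kernel resource limit. It is not the patched suexec itself: the function names are made up for illustration, it assumes Linux, and it caps CPU time (RLIMIT_CPU) rather than wall-clock time, which may or may not match what our patch measures.

```python
import resource
import subprocess
import sys


def cpu_limiter(seconds):
    """Return a function that caps CPU time for the process that calls it.

    Hypothetical helper for illustration; not part of suexec or domtool.
    """
    def apply():
        _, hard = resource.getrlimit(resource.RLIMIT_CPU)
        # Keep the hard limit one second above the soft limit so the kernel
        # sends SIGXCPU first, but never try to raise an inherited hard limit.
        new_hard = seconds + 1
        if hard != resource.RLIM_INFINITY:
            new_hard = min(new_hard, hard)
        resource.setrlimit(resource.RLIMIT_CPU, (min(seconds, new_hard), new_hard))
    return apply


def run_with_cpu_limit(argv, seconds=10):
    """Run argv under a CPU-time cap and return its exit status.

    A negative status means the child was killed by a signal.
    """
    # preexec_fn runs in the child between fork() and exec().
    return subprocess.run(argv, preexec_fn=cpu_limiter(seconds)).returncode


if __name__ == "__main__":
    # An infinite loop stands in for a runaway CGI script.
    spin = [sys.executable, "-c", "while True: pass"]
    print(run_with_cpu_limit(spin, seconds=1))
```

A limit like this frees the Apache child once the program is killed, but it does nothing for a script that merely sleeps or blocks on I/O, since that consumes almost no CPU time.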

6.2. Disk usage

We can't let one person use up all of the disk space, now can we?

6.2.1. Current remedies

We use group quotas so that members can be charged for files that they don't own. This is still hackish and allows some unintended behaviors. DaemonFileSecurity has more detail.
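
As a rough illustration of what charging by group ownership means, the sketch below totals file sizes per group ID under a directory. The function name is hypothetical, and it sums apparent file sizes, whereas real quota accounting counts allocated blocks, so treat the numbers as an approximation only.

```python
import os
import stat
from collections import defaultdict


def usage_by_group(root):
    """Return {gid: total bytes} for regular files under root.

    Hypothetical helper: the file's group, not its owner, decides who is
    charged, which mirrors the group-quota idea described above.
    """
    totals = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # lstat so that symlinks are not followed and counted twice.
            info = os.lstat(os.path.join(dirpath, name))
            if stat.S_ISREG(info.st_mode):
                totals[info.st_gid] += info.st_size
    return dict(totals)
```

Mapping each group ID back to a member name (for example with the standard grp module) is left out to keep the sketch short.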

6.3. Network bandwidth

We don't do a thing to limit this now, since our current host provides significantly more bandwidth than we need.

6.3.1. Questions to be resolved

  1. Should we start doing anything beyond monitoring?

6.4. Network connection privileges

It's good to follow least privilege in who is allowed to connect to/listen on which ports.

6.4.1. Current remedies

We have a firewall system in place now. It uses a custom tool documented partially on FirewallRules.

6.5. Number of processes

Fork bombs are no fun, and many resource limiting schemes are per-process and so require a limit on process creation to be effective.

6.5.1. Current remedies

As per ResourceLimits, we use the nproc ulimit.
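
For readers who have not used it, the nproc limit corresponds to RLIMIT_NPROC. The sketch below, with made-up function names and assuming Linux, lowers that limit in a child process and reads back what the child actually sees. Note that the kernel counts processes per user rather than per process tree, and generally does not enforce the limit for root.

```python
import resource
import subprocess
import sys


def nproc_limiter(max_procs):
    """Return a function that lowers RLIMIT_NPROC for the calling process."""
    def apply():
        _, hard = resource.getrlimit(resource.RLIMIT_NPROC)
        # Never try to exceed an inherited hard limit; that would fail.
        new = max_procs if hard == resource.RLIM_INFINITY else min(max_procs, hard)
        resource.setrlimit(resource.RLIMIT_NPROC, (new, new))
    return apply


def effective_nproc(max_procs):
    """Start a child under the limit and return the soft limit it reports."""
    probe = "import resource; print(resource.getrlimit(resource.RLIMIT_NPROC)[0])"
    result = subprocess.run(
        [sys.executable, "-c", probe],
        preexec_fn=nproc_limiter(max_procs),  # runs in the child before exec()
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout)
```

Every process the child later forks inherits the same limit, which is what stops a fork bomb from growing without bound.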

6.6. RAM

This is probably the most surprising thing for novices to the hosting co-op planning biz. If you would classify yourself as such, then I bet you would leave RAM off your list of resources that need to be protected with explicit security measures!

Nonetheless, it may just be the most critical resource to control. In our experience back when everything ran on Abulafia, the most common cause of system outage was some user running an out-of-control process that allocated all available memory, causing other processes to drop dead left and right as memory allocation calls failed. We're letting people run their own daemons 24/7, so this just can't be ignored.

6.6.1. Current remedies

As per ResourceLimits, we use the as ulimit to put a cap on how much virtual memory a process can allocate.
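
The as ulimit corresponds to RLIMIT_AS. Here is a small sketch of its effect, assuming Linux (some other systems do not enforce this limit) and using hypothetical function names: once the cap is in place, an oversized allocation fails inside the offending process instead of starving everything else on the machine.

```python
import resource
import subprocess
import sys


def memory_limiter(max_bytes):
    """Return a function that caps virtual memory for the calling process."""
    def apply():
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        # Never try to exceed an inherited hard limit; that would fail.
        new = max_bytes if hard == resource.RLIM_INFINITY else min(max_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (new, new))
    return apply


def run_with_memory_cap(argv, max_bytes):
    """Run argv under an address-space cap and return its exit status."""
    return subprocess.run(
        argv,
        preexec_fn=memory_limiter(max_bytes),  # runs in the child before exec()
        stderr=subprocess.DEVNULL,             # hide the child's traceback
    ).returncode


if __name__ == "__main__":
    cap = 512 * 1024 * 1024
    # The child asks for 2 GiB, which is more than the 512 MiB cap allows.
    hog = [sys.executable, "-c", "bytearray(2 * 1024**3)"]
    print(run_with_memory_cap(hog, cap))
```

Because the cap applies to virtual address space rather than resident memory, programs that reserve large address ranges up front can fail under it even when they would use little real RAM, which is the kind of breakage mentioned for some language implementations on Shell.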


CategorySystemAdministration CategoryHistorical

SoftwareArchitecturePlans (last edited 2018-04-22 01:34:40 by ClintonEbadi)