Monday, May 17, 2010

(More) Email Server Issues

This past weekend, we had another email outage. This time it looks like the server took about 12 hours to rebuild itself. Time for a backup/restore or defrag. But the PHB says he's going to purchase an Exchange 2010 setup at the start of the next fiscal year and doesn't want to have the downtime. Meanwhile, my faithful Frankenserver takes another hit. This server has been in service since 2005 and ran fine for years until recently.

When I left on Friday afternoon, everything looked fine, but I had a feeling that it wasn't. At 6:00 Saturday morning, I got up and checked the server. Sure enough, the db was corrupted again. So I did the usual rebuild:

# serveradmin stop mail
# mv /var/imap /var/imap.old
# mkdir /var/imap
# /usr/bin/cyrus/tools/mkimap
reading configure file...
i will configure directory /var/imap.
i saw partition /var/spool/imap.
done
configuring /var/imap...
creating /var/spool/imap...
done
# chown -R cyrusimap:mail /var/imap
# sudo -u cyrusimap /usr/bin/cyrus/bin/reconstruct
# serveradmin start mail


The reconstruct step is usually done with the -i option, but I usually skip it and perform the user folder rebuilds manually with:

# cd /var/spool/imap
# su cyrusimap
$ /usr/bin/cyrus/bin/reconstruct -r -f user/username
user/username
user/username/Deleted Messages
user/username/Drafts
user/username/Sent Messages


I have a text file on my server that has all of the usernames. I wrote a quick script to parse the file and rebuild each user.
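A loop along these lines would do the job. This is a minimal sketch rather than the exact script: it assumes one username per line in the text file, and the function name rebuild_users and the RECONSTRUCT override variable are made up for illustration.

```shell
#!/bin/sh
# Path to reconstruct as used above; can be overridden from the environment.
RECONSTRUCT="${RECONSTRUCT:-/usr/bin/cyrus/bin/reconstruct}"

# rebuild_users FILE (hypothetical helper): run reconstruct for every
# username listed in FILE, one per line. Blank lines and lines starting
# with '#' are skipped. Returns non-zero if any rebuild failed.
rebuild_users() {
    userlist="$1"
    failed=0
    # The "|| [ -n ... ]" part picks up a last line with no trailing newline.
    while IFS= read -r username || [ -n "$username" ]; do
        case "$username" in
            ''|'#'*) continue ;;
        esac
        # </dev/null keeps the command from swallowing the rest of the list.
        if ! "$RECONSTRUCT" -r -f "user/$username" </dev/null; then
            echo "reconstruct failed for $username" >&2
            failed=$((failed + 1))
        fi
    done < "$userlist"
    [ "$failed" -eq 0 ]
}
```

As with the manual rebuild, it would be run as the cyrusimap user from /var/spool/imap, for example: rebuild_users /path/to/usernames.txt (the file path here is a placeholder).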

After looking at various log files, it looks like Mailman is causing the issues. We have a list that goes out to all of our users (about 2300). (Send to one address and this spiders out to the lists for each location. The Mailman docs refer to this as an umbrella list.)

Our HR department decided to "go green" by sending out death notices and other items via this list. Even though they've been told to do a delay-send until the evening hours, they keep "forgetting" and sending them in the middle of the day. This usually ends up tying up the server for a couple of hours, during which time messages continue to come into the server, but nothing goes out until it finishes processing the list.

At this point I decided to hold off on the rebuild until I had talked to the boss. He agreed to wait until we can clean up the server and verify that it is actually stable (this time).

I spent an hour or so this morning researching various Mailman options, such as whether we can delay the sending until X o'clock. Turns out others have asked the same question, and the answer ranges from "not easily" to "no." I ran through the various configuration options and found that my SMTP_MAX_RCPTS was set to 500. This setting caps the number of recipients Mailman hands to the mail server in a single SMTP transaction, so each batch was going out to as many as 500 addresses. I changed this to the recommended 10. We shall see if this works.
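To get a feel for what the change means, here is some rough arithmetic. It assumes all 2300 or so recipients pass through a single delivery and ignores however Mailman groups recipients internally, so treat the numbers as ballpark only. The helper name batches is made up for this example.

```shell
#!/bin/sh
# batches RECIPIENTS MAX_RCPTS (hypothetical helper): number of SMTP
# transactions needed to cover RECIPIENTS, rounded up.
batches() {
    echo $(( ($1 + $2 - 1) / $2 ))
}

batches 2300 500   # old setting: prints 5
batches 2300 10    # new setting: prints 230
```

So the same mailing goes from a handful of very large transactions to a couple hundred small ones, which should give the mail server a chance to work on other messages in between.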

I was also able to get him to agree to let me purge some of the older emails on the server and set up some quotas.

Stay tuned.....
