Thursday, May 27, 2010

E-Mail Server Update

On the afternoon my last post, the server went down again. This time it happened as I was "tail-ing" the log file and saw what was happening:

May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19636]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: read: 0x401410, 28: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Ignoring log file: /var/imap/db/log.0000000001: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: DB_ENV->log_put: 1: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR: error exiting application: No space left on device
May 17 13:16:27 server master[14002]: service lmtpunix pid 19636 in READY state: terminated abnormally
May 17 13:16:27 server lmtpunix[19637]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19637]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:28 server lmtpunix[19637]: DBERROR db4: read: 0x401410, 28: No space left on device


Notice the 5th line: No space left on device. This can't be right, I have about 16 GB of free space (according to df -h). But, lets think about this for a moment. I've been chasing a database corruption problem. If the OS thinks that it is out of space, of course it won't be able to update the db.

Looking back at my notes from a couple of months ago, before this started, I had about 30 GB free. Where did all of the free space go? I talked to the PHB and he finally agreed to let me take the server down over the weekend to work on it. In preparation for this, I also looked at the folders in the mail store and compared them to the FileMaker Pro database that I use to track all of the account info. (For you privacy nuts, I don't keep passwords, if you forget it, I can change it but I cannot see what it is.) I also looked at tuning options for Postfix, Cyrus and Mailman.

Fast forward to Friday night. I purged the mail queue, the stopped the mail server and manually ran the garbage cleanup script. Our webmail users can't understand that unless they click "Purge" the mail doesn't actually get deleted. So, I wrote a script that deletes the files from their Trash folder and rebuilds the index. This happens quietly behind the scenes and they never know it happened unless they try to retrieve something from the Trash. I actually once had someone tell me that's where they store their important emails. I asked them: Do you keep your checkbook in the garbage can at home? And, they went away muttering to themselves.

I rebooted the server into single user mode and fsck'd the drive. It took forever to get through the various steps, to the point that I stopped watching and forced myself to wait 10 minutes between checks. When it got to "Checking Volume Information" it said:

Invalid volume free block count
(It should be 8456596 instead of 4562047)
Repairing Volume


After it was done (roughly 3 hours from start time) I then had 32 G free, not 16 as reported before I started. Yea!!!

From here, I set about rebuilding the Cyrus database and the user databases. I restarted the mail server (sudo serveradmin start) and watched the mail begin to flow. When I felt comfortable, I restarted Mailman and watched it fly through the postings that were waiting to be sent.

Its now a couple of days later and everything seems to be stable.... Except, now the boss want me to get to 100 GB free!!! (le sigh...) Stay tuned for more updates

No comments: