Thursday, May 27, 2010

E-Mail Server Update

On the afternoon of my last post, the server went down again. This time it happened as I was "tail-ing" the log file, and I saw what was happening:

May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19636]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: read: 0x401410, 28: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Ignoring log file: /var/imap/db/log.0000000001: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: DB_ENV->log_put: 1: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR: error exiting application: No space left on device
May 17 13:16:27 server master[14002]: service lmtpunix pid 19636 in READY state: terminated abnormally
May 17 13:16:27 server lmtpunix[19637]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19637]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:28 server lmtpunix[19637]: DBERROR db4: read: 0x401410, 28: No space left on device


Notice the 5th line: No space left on device. This can't be right; I have about 16 GB of free space (according to df -h). But let's think about this for a moment. I've been chasing a database corruption problem. If the OS thinks that it is out of space, of course it won't be able to update the db.

Looking back at my notes from a couple of months ago, before this started, I had about 30 GB free. Where did all of the free space go? I talked to the PHB and he finally agreed to let me take the server down over the weekend to work on it. In preparation for this, I also looked at the folders in the mail store and compared them to the FileMaker Pro database that I use to track all of the account info. (For you privacy nuts, I don't keep passwords. If you forget yours, I can change it, but I cannot see what it is.) I also looked at tuning options for Postfix, Cyrus and Mailman.

Fast forward to Friday night. I purged the mail queue, then stopped the mail server and manually ran the garbage cleanup script. Our webmail users can't understand that unless they click "Purge" the mail doesn't actually get deleted. So, I wrote a script that deletes the files from their Trash folder and rebuilds the index. This happens quietly behind the scenes and they never know it happened unless they try to retrieve something from the Trash. I actually once had someone tell me that's where they store their important emails. I asked them: Do you keep your checkbook in the garbage can at home? And they went away muttering to themselves.
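A minimal sketch of what a cleanup script like that might look like. This is not the actual script. It assumes that Cyrus stores each message as a file named with a number followed by a dot (for example "42."), that the webmail Trash folder is literally named Trash under /var/spool/imap/user/username, and that it runs as the cyrusimap user. The function name and the SPOOL and RECONSTRUCT variables are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: empty one user's Trash folder and rebuild its index.
# Assumptions (not taken from the real script): the folder is named "Trash"
# and message files have names that start with a digit and end with a dot.

SPOOL="${SPOOL:-/var/spool/imap}"
RECONSTRUCT="${RECONSTRUCT:-/usr/bin/cyrus/bin/reconstruct}"

purge_trash() {
    user="$1"
    trash="$SPOOL/user/$user/Trash"

    # Nothing to do if this user has no Trash folder.
    [ -d "$trash" ] || return 0

    # Delete only the message files, leaving the cyrus.* metadata in place.
    find "$trash" -maxdepth 1 -type f -name '[0-9]*.' -exec rm -f {} +

    # Rebuild that one mailbox so the index matches what is left on disk.
    "$RECONSTRUCT" "user/$user/Trash"
}
```

Calling purge_trash once per username (for example from a nightly job) would give the "quietly behind the scenes" behavior described above. Try it against a copy of a mailbox first, because the deletion is not reversible.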

I rebooted the server into single user mode and fsck'd the drive. It took forever to get through the various steps, to the point that I stopped watching and forced myself to wait 10 minutes between checks. When it got to "Checking Volume Information" it said:

Invalid volume free block count
(It should be 8456596 instead of 4562047)
Repairing Volume


After it was done (roughly 3 hours from start time) I had 32 GB free, not the 16 GB reported before I started. Yea!!!
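As a sanity check, the two block counts from the fsck message can be converted to sizes. This is only a sketch: the 4096-byte allocation block size is an assumption (fsck did not print it), and the function name is made up for illustration.

```shell
#!/bin/sh
# Convert an allocation-block count to whole MiB (integer division).
# The default 4096-byte block size is an assumption, not something fsck reported.
blocks_to_mib() {
    blocks="$1"
    block_size="${2:-4096}"
    echo $(( blocks * block_size / 1048576 ))
}

reported=4562047   # free block count recorded on the volume before the repair
actual=8456596     # free block count fsck said it should be

echo "before repair: $(blocks_to_mib "$reported") MiB"
echo "after repair:  $(blocks_to_mib "$actual") MiB"
echo "recovered:     $(blocks_to_mib $(( actual - reported ))) MiB"
```

Under that assumption the counts work out to roughly 17.4 GiB before and 32.3 GiB after, which is in line with the free space figures above.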

From here, I set about rebuilding the Cyrus database and the user databases. I restarted the mail server (sudo serveradmin start) and watched the mail begin to flow. When I felt comfortable, I restarted Mailman and watched it fly through the postings that were waiting to be sent.

It's now a couple of days later and everything seems to be stable.... Except, now the boss wants me to get to 100 GB free!!! (le sigh...) Stay tuned for more updates.

Wednesday, May 26, 2010

It's Racing Season

Believe it or not, the Netman isn't all work; he knows how to have fun also. One of the things that my wife and I like to do is go to races. She's mostly a NASCAR fan, but goes to other types of races to entertain me. We have our most extensive season planned and started it off on Mother's Day.

We went to see the Trans-Am series at New Jersey Motorsports Park. I remember growing up watching Mustangs, Corvettes, Camaros and other cars on TV. This was the first time that I've had a chance to see this series and I was excited.

We stumbled onto the Club Diner in Bellmawr on our first trip to the track and it's become part of our ritual. Usually we have to figure out how to get there. I must have set it as a waypoint in the GPS the last time we went because it took us right to the door. Strange, but I digress.

The entry list had about 20 entries, but when I checked the practice times, most of them didn't have a time listed. When we got there we learned that only 8 cars showed up, but for some reason, they list 12 cars getting points.

Here are a few of my pictures from the race:

Here's Tomy Drissi and Tony Ave getting ready to go out for practice and qualifying. My wife can't wait for the Marmaduke movie to come out and decided that she was going to root for Tomy.
Drissi won the race (surprise, surprise), RJ Lopez came in second and Ave was third. Here they are on the podium spraying champagne and doing the "hat dance."

My personal highlight of the day was getting to see Drissi's owner Paul Gentilozzi. (Below talking on the phone.) I had a chance to talk to him a couple of times and he's much cooler than the TV reporters make him out to be.

Here's a link to my Picasa album. It's not my best work, but here it is. Next up - ARCA and Sprint Cup at Pocono.

Monday, May 17, 2010

(More) Email Server Issues

This past weekend, we had another email outage. This time it looks like the server took about 12 hours to rebuild itself. Time for a backup/restore or defrag. But the PHB says he's going to purchase an Exchange 2010 setup at the start of the next fiscal year and doesn't want to have the downtime. Meanwhile, my faithful Frankenserver takes another hit. This server has been in service since 2005 and ran fine for years until recently.

When I left on Friday afternoon, everything looked fine, but I had a feeling that it wasn't. At 6:00 Saturday morning, I got up and checked the server. Sure enough, the db was corrupted again. So I did the usual rebuild:

# serveradmin stop mail
# mv /var/imap /var/imap.old
# mkdir /var/imap
# /usr/bin/cyrus/tools/mkimap
reading configure file...
i will configure directory /var/imap.
i saw partition /var/spool/imap.
done
configuring /var/imap...
creating /var/spool/imap...
done
# chown -R cyrusimap:mail /var/imap
# sudo -u cyrusimap /usr/bin/cyrus/bin/reconstruct
# serveradmin start mail


The reconstruct step is usually done with the -i option, but I usually skip it and perform the user folder rebuilds manually with:

# cd /var/spool/imap
# su cyrusimap
$ /usr/bin/cyrus/bin/reconstruct -r -f user/username
user/username
user/username/Deleted Messages
user/username/Drafts
user/username/Sent Messages


I have a text file on my server that has all of the usernames. I wrote a quick script to parse the file and rebuild each user.
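A rough sketch of a script along those lines, not the actual one. It assumes the text file holds one username per line and that the script runs as the cyrusimap user. The function name and the RECONSTRUCT variable are hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: run reconstruct once for every username in a text file.
# Assumes one username per line; blank lines and lines starting with # are skipped.

RECONSTRUCT="${RECONSTRUCT:-/usr/bin/cyrus/bin/reconstruct}"

rebuild_users() {
    userlist="$1"
    # The "|| [ -n ... ]" part picks up a last line that has no trailing newline.
    while IFS= read -r name || [ -n "$name" ]; do
        case "$name" in
            ''|'#'*) continue ;;
        esac
        # Same flags as the manual rebuild shown above. Redirecting stdin keeps
        # reconstruct from swallowing the rest of the username list.
        "$RECONSTRUCT" -r -f "user/$name" < /dev/null
    done < "$userlist"
}
```

Usage would be something like rebuild_users /path/to/usernames.txt, where the path is whatever file holds the list.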

After looking at various log files, it looks like Mailman is causing the issues. We have a list that goes out to all of our users (about 2300). (Send to one address and this spiders out to the lists for each location. The Mailman docs refer to this as an umbrella list.)

Our HR department decided to "go green" by sending out death notices and other items via this list. Even though they've been told to do a delay-send until the evening hours, they keep "forgetting" and sending them in the middle of the day. This usually ends up tying up the server for a couple of hours, during which time messages continue to come into the server, but nothing goes out until it finishes processing the list.

At this point I decided to hold off on the rebuild and wait until I talked to the boss. I got him to agree to hold off on the rebuild until we can clean up the server and verify that it is actually stable (this time).

I spent an hour or so this morning researching various Mailman options, such as: can we delay the sending until X o'clock? Turns out others have asked the same question and the answer ranges from "not easily" to "no." I ran through the various configuration options and found that my SMTP_MAX_RCPTS was set to 500. This setting caps the number of recipients Mailman hands to the mail server in a single SMTP transaction, so each batch was carrying up to 500 recipients. I changed this to the recommended 10. We shall see if this works.
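To get a feel for what that change means, here is a back-of-the-envelope sketch. It treats the roughly 2300 users as one flat list, which ignores how the umbrella list fans out to the per-location lists and any grouping Mailman does before sending. The function name is made up for illustration.

```shell
#!/bin/sh
# Number of SMTP transactions needed for one posting, rounding up.
# recipients: total addresses; max_rcpts: the SMTP_MAX_RCPTS value.
smtp_transactions() {
    recipients="$1"
    max_rcpts="$2"
    echo $(( (recipients + max_rcpts - 1) / max_rcpts ))
}

echo "SMTP_MAX_RCPTS=500: $(smtp_transactions 2300 500) transactions"
echo "SMTP_MAX_RCPTS=10:  $(smtp_transactions 2300 10) transactions"
```

The smaller setting means many more, much smaller transactions. The usual argument for it is that the mail server can start delivering each small batch right away instead of chewing on a few huge ones, but whether that helps on this particular server is something only testing will show.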

I was also able to get him to agree to let me purge some of the older emails on the server and set up some quotas.

Stay tuned.....

Tuesday, May 11, 2010

PICC Day 2

Day 2 of the PICC conference was a little more scattered and varied than the first one. The first session was "The Evolution of Storage Networking and the Current Trends in the Industry." Jacob Farmer, CTO of Cambridge Computer, led this session. Jacob's talk was exactly what the title says. He talked about storage virtualization and how abstraction layers interact with the operating system and the file system. Most of the concepts were as I imagined they would be, although it's nice to know the "real" names for things. I learned that the EqualLogic SANs that I have at work are using "spindle virtualization." In short, this means that you don't have to worry about carving up your SAN into various LUNs, each (possibly) having different RAID configurations. When you set up the EqualLogic, you format it (RAID 5, 6, 10, or whatever you want) and then only worry about the size of the partitions that you create. This allows you to easily add additional storage by adding another unit to the "cluster." Personally, I think this is a cooler method of dealing with SANs.

I only stayed for the first half of this talk because I wanted to go to a couple of other sessions. Next I went to "Pushing Boulders Uphill: NOAA Updates - High Performance Computing across the WAN." [In my best "Staples" commercial voice] "Wow, that's a long title." The speakers gave an overview of how NOAA deals with their datasets. It's amazing how much data they move around their network on a daily basis. More importantly, some of the discussion was about how they grew their network and some of the tools they use. It was amazing to see how "simple" it was, albeit from a very high level overview. Even better, they use some of the same (open source) tools that I use to monitor their network. I must be doing something right.

Directly after this, Matt Simmons led a discussion on "Keeping Nagios Sane." More than any other, this session demonstrated what community is about. Matt started the session by stating that he wasn't an expert, but these were the things that worked for him. Most of the participants in the session use Nagios and contributed some of their ideas during the discussion. Could any of us be considered "experts"? Maybe. Depending on your definition.

Sometimes, I get a little annoyed when my coworkers or others call me an "expert." What does that mean? I know what I know. But, that's not everything there is to know. Usually I know more about my field than they do. Do I know everything? Of course not. I think what separates us from the paper admins is our ability to get up to speed quickly in a variety of areas. Often, this is driven by our quest for knowledge. Sometimes, we actually need to do something with what we learn. What separates this profession from most others is that there are always several ways to accomplish a task. Some of them are more elegant than others and some are just crazy ideas that lead to some cool discovery. I should probably finish this post before I go off on a rant. BTW, Matt, check out Cacti for trending.

Next I went to a session on Budgeting for System Administrators given by Adam Moskowitz. Adam gave some very cogent advice to us regarding the budget process. Sometimes we tend to forget that those above us may not understand the buzzwords. We need to show them what our needs are in a language that they can understand and relate to. Also, they may not fully understand the inter-relatedness (is that a word?) of the various parts of our projects. Finally, sometimes we have to be willing to compromise and either scale back or phase in our projects.

After this, my brain needed a break and I attended an "Unconference." This was probably the closest to what most people would imagine would happen if a group of techies got together in a room to just discuss a topic, except most of us were totally worn out at that point. We ended up talking about DNS, specifically, how we each manage it. We were shown Carnegie-Mellon's NetReg by one of the people who wrote it. How cool is that? I had looked at it a couple of times before, but didn't realize its power. It's not something that I would use in my current environment, but I know it's out there if the need arises.

The last session I attended was "An overview of Google's Technologies...." given by Tom Limoncelli. It was kind of fitting that I started and ended with the same speaker. Tom gave a quick overview of Google and how they grew their infrastructure. More cool stuff to keep in the back of my head.

Overall, it was a good conference and I had a blast. I met some new and interesting people. (And not the weirdos that are usually drawn to me.) Kudos to William Bilancio, Tom Limoncelli, and the entire planning and organizing committees. They've already begun to consider the next conference and I'm sure it will be even better.

Friday, May 7, 2010

LOPSA-NJ’s PICC

I'm at LOPSA-NJ’s PICC conference in sunny New Brunswick, NJ. I've taken Matt Simmons' challenge to blog about the conference (and restart this long neglected blog). The first session I'm attending is Tom Limoncelli's Time Management for System Administrators - A New Approach. This talk was a whirlwind tour of his classic book Time Management for System Administrators, but slightly rearranged, redone and updated.

This was the first time I was able to hear Tom speak formally. I was especially looking forward to this talk since it is based upon the book that radically changed my outlook on doing my job. It was everything I was expecting and more. Sometimes, a refresher course in a subject that you're familiar with can change your way of thinking. A couple of tips that I picked up were:
  • Keep a single calendar with your work and personal events. And, be sure to write everything down.
  • A reminder that I have to find a todo list manager or GTD app for my new Droid Incredible. Since the Android platform is a competitor to the iPhone, I don't see Apple supporting syncing to the phone in the near future. (More on the phone in a later post.)
  • Go over your notes at the end of the day and move incomplete tasks to the next day (or even further out).
  • Remind my co-workers about the 30 minute rule.
  • Tom mentioned that he keeps a lab notebook. I tend to keep everything on one pad and things can sometimes get lost.
  • An interesting idea was to put contact information for the helpdesk onto the desktop wallpaper. I'll have to play around with this.
The second session was also hosted by Tom Limoncelli. This one was titled: Help, Everyone hates our IT Department. This was sort of an extension of the first session. The key bullet points for me were:
  • How do users get help? And do they know who to call and when?
  • Has the scope of support been clearly defined?
  • What is the definition of an Emergency?
The major takeaway that I had from this session was: If you can manage expectations and communications, it will help to manage the perception of the IT department. (I will probably revise this statement later, after I've had a chance to think about it.)

The evening ended with a talk by David Blank-Edelman - Through the Lens Geeky: How SysAdmins are Portrayed in Pop Culture. His presentation showed several film clips portraying SysAdmins from the 50's through recent films. The good, the bad, the unrealistic (of both sides) were shown.

During the second session I texted one of my friends the title of the session I was in. Her response was: Everyone doesn't hate the IT department, they just misunderstand you. Both Tom's and David's presentations drove this point home. When I get back, I'll be musing over this. How do I change perception? How can we (better) publicize what we do?