EDIT: Whoops! I never published this blog post.
Wow, it's been a while since my last post. Anyway, if you were looking for an example of Murphy's law, I've got one for you.
We had 2 simple goals: upgrade our internet service and bring up our new WAN connections. This adventure starts on Tuesday, April 19, 2011.
Background:
In the city where I work, we are really locked into our choice of providers. Because of the city's infrastructure, we have 2 options: the incumbent telecom provider and the local cable company, which has no interest in providing service to us.
This install is an upgrade to our existing service and also a relocation to another building. It was originally supposed to take place last November. As we were ramping up for the install, we discovered that my former boss never placed the order. (This after I received a couple of phone calls asking why we were requesting service at 2 sites during the summer.) After a couple of angry phone calls to the telco rep, the order was placed. We were given an install date in early February. A week before the date arrived, I sent my weekly pester-the-project-manager email and was told that they were having "unforeseen issues and would be unable to meet the date." (According to the rep, they ran out of capacity.) The new date they gave me was in June. I called the rep, explained to him how this was unacceptable, and said I was going to recommend that we take action to terminate our existing contracts. He was then able to get us a March date. As we got close to the March date, we were told that date was only for our end of the connection and it would take a few weeks more to get the other end installed. Finally, in late March, we were told that everything was in place, and we looked at the calendar for a suitable date that wouldn't cause too much disruption. We decided on spring break week, since the majority of our employees were off anyway.
Tuesday Afternoon:
I met with an engineer from my ISP and one from my integrator to go over the connections and site prep prior to the installation. The ISP engineer checked the configuration on the router as it was shipped and made some changes/corrections. He made a call to try to bring up the connection - even though it wouldn't be routing any traffic at that point - just to make sure everything was good on the telco end. After spending some time on hold and talking to several people, he was told that our appointment wasn't until 9:00 the next morning and nothing could be done until then. I found this rather entertaining since he works for the same company. While he was doing this, the integrator and I busied ourselves with racking switches, connecting cables, checking connections, etc. A couple of hours later we went over our game plan for the installation and left for the day.
Wednesday Morning:
We met at the appointed time and set out to make magic!! Our enthusiasm was soon dashed when we couldn't get a link to the demarc. We did the usual: switched the polarity of the fiber pair, tried new/different cables, tried different SFP modules - nothing worked. At some point we noticed that we didn't have a link light on the WAN router either, even though that connection was up and passing traffic. The ISP engineer continued to work with the telco side to troubleshoot the issue.
During a trip to the demarc to bang my head against the wall, I looked at the fiber box that went to the MDF - it read "SM Fiber to MDF". I thought: SM, as in single-mode fiber? This should be multimode fiber. In disbelief, I opened the box and looked inside. Sure enough, it was single-mode. WTF! (Bang head on wall.) The funny thing is that the vendor that installed the fiber was hired by the telco on the premise that they wouldn't install the line unless dedicated fiber was in place. Now, the demarc happens to also be in one of the IDFs in the building. Couldn't we just use that? We were told no. After we put our heads together for a few minutes, I said "F it, this is the only opening we have for over a month where we can do this during the day." I then connected to the fiber that was in the IDF, ran upstairs, moved the cross-connects and bingo - the WAN came up with a link light.
While this was going on, the telco engineer had moved the router to the demarc to try to get a connection there. After a while, he was able to get the line up and brought the router back upstairs. We re-racked the router and fired it up. He was unable to pass any traffic across the connection. He got on the phone again, called me over after about 15 minutes, and put his cell on speaker. The person on the other end said, "Oh, you want to switch your internet service to that line?" I asked him why the heck he thought we were going through all of this. His answer (I wish I were making this up): "Oh, I thought you were just testing the line." As Bill Engvall says, "I didn't start out the day wanting to be a jackass, but you just pushed my jackass button." Needless to say, we had internet in a couple of minutes. Elapsed time: 3.5 hours.
I wish this was the end of the story, but it's not.
While we did have internet, traffic was sporadic. Our initial troubleshooting led us to believe it was a DNS issue. And, indeed, we were having trouble with DNS resolution. But when we tried to use a public DNS server, it was only marginally better. It worked fine from the DMZ and from outside. We made a few changes to the routers and firewall and were able to get access - at first. After about 10 minutes, it stopped working again. We undid the router and firewall changes and it worked again for about 10 minutes.
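For anyone chasing something similar, this is the kind of quick check that separates "our DNS server is broken" from "the path itself is broken." Just a sketch - 8.8.8.8 and example.com are stand-ins:

# Resolve through whatever resolvers the host is configured to use:
dig example.com
# Bypass internal DNS and ask a public resolver directly; if this
# also stalls intermittently, the problem is the path, not DNS:
dig @8.8.8.8 example.com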
Hours of troubleshooting later, we were able to find the issue, but not a clear solution. When we started routing traffic through the new WAN, it introduced routing loops at almost every point in the network. Considering we have 22 locations, the problem was quite massive.
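The telltale symptom, for the record, is a repeating pair of hops in a traceroute - a minimal illustration (the target address is arbitrary):

# A routing loop shows up as the same hops cycling until the TTL
# expires, e.g. router-a -> router-b -> router-a -> router-b ...
traceroute 8.8.8.8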
Tuesday, October 25, 2011
Making corrections after deployment (or whoops!)
Like clockwork, almost as soon as we deployed the new workstations, we began receiving complaints about things that weren't right. Most of the initial complaints were "I need the administrator password so that I can install software and make changes." This was met with a resounding "NO!" It's been an uphill battle, and we admittedly have some work to do to repair our department's reputation. It was most entertaining to get a phone call with a complaint and be able to tell them, "Look at the computer - I'm logged into it right now. And that program you said doesn't work does."
The last week or so has been pretty quiet. After they began to see printers and software magically appear on their computers, they stopped complaining.
There was one item that slipped by us: some of the laptops would be going home with teachers and administrators, and they would need to be able to connect to their home wireless networks. We had this locked down to prevent students from switching to an open network they can see from the school.
After hours of digging and false starts, I found the preference file that needed to be changed:
/Library/Preferences/SystemConfiguration/preferences.plist
My first reaction was: it's a plist, it should be easy to script the change. Wrong! It doesn't act like any of the other plist files. Using a "defaults write" command would put the setting into the wrong place in the file. I tried to open it in several plist editors, and they only found some of the settings, none of which I needed to change.
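For the record, this is roughly the kind of command that misfires - a sketch; the key name is real, but defaults writes it at the top level of the plist rather than in the nested section where the AirPort settings actually live:

# Looks reasonable, doesn't work - the key lands in the wrong place:
sudo defaults write /Library/Preferences/SystemConfiguration/preferences RequireAdmin -bool NO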
After I got home, I spent a couple of hours trying to use sed and awk to change the file, but I couldn't get the regex right. I was about to give up when I decided to give Google another try and found my answer:
/usr/libexec/airportd
I ran it with the -h flag to see the options available and I found what I needed:
RequireAdmin (Boolean)
RequireAdminIBSS (Boolean)
RequireAdminNetworkChange (Boolean)
RequireAdminPowerToggle (Boolean)
I could use "/usr/libexec/airportd en1 prefs RequireAdmin=NO" to change all of the settings at once, or I could use the others to get more granular if I chose. I wrapped this in a quick bash script and tested - success!! Here's the code that I created.
#!/bin/sh
#
# Script to allow Users to change the wireless network without requiring a password
#
#
# Turn off all settings
/usr/libexec/airportd en1 prefs RequireAdmin=NO
# To allow granular settings you could do the following:
#/usr/libexec/airportd en1 prefs RequireAdminIBSS=
#/usr/libexec/airportd en1 prefs RequireAdminNetworkChange=
#/usr/libexec/airportd en1 prefs RequireAdminPowerToggle=
exit 0
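One sanity check worth running afterward - going from memory here, so treat it as a hedge: I believe airportd prints the current preference values if you leave off the assignment.

# If memory serves, omitting the assignment lists the current values:
/usr/libexec/airportd en1 prefs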
Tomorrow I get to see if it works with Lion.
Adventures in Managed Computing
Over the last month we've been working on deploying over 700 new workstations and laptops. This time around, I was able to get support from the upper administration to make them fully managed. I was also able to convince them that we needed to pay for installation. After a few months of negotiating with Apple and my local VAR, we were able to settle on an SOW for the project.
The VAR would:
- Install the management and imaging software on a brand new server.
- Install and configure the pile of servers that I had in my office.
- Image, deliver and deploy the new computers.
- Ensure that the computers were integrated into the management system and our online tracking system.
I'll post about my experiences with Casper in a later post.
That is all for now. Later.
Thursday, May 27, 2010
E-Mail Server Update
On the afternoon of my last post, the server went down again. This time it happened as I was tailing the log file, so I saw what was happening:
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19636]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19636]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: read: 0x401410, 28: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: Ignoring log file: /var/imap/db/log.0000000001: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR db4: DB_ENV->log_put: 1: No space left on device
May 17 13:16:27 server lmtpunix[19636]: DBERROR: error exiting application: No space left on device
May 17 13:16:27 server master[14002]: service lmtpunix pid 19636 in READY state: terminated abnormally
May 17 13:16:27 server lmtpunix[19637]: DBERROR db4: Logging region out of memory; you may need to increase its size
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: Cannot allocate memory
May 17 13:16:27 server lmtpunix[19637]: DBERROR: opening /var/imap/deliver.db: cyrusdb error
May 17 13:16:27 server lmtpunix[19637]: FATAL: lmtpd: unable to init duplicate delivery database
May 17 13:16:28 server lmtpunix[19637]: DBERROR db4: read: 0x401410, 28: No space left on device
Notice the 5th line: "No space left on device." This can't be right - I have about 16 GB of free space (according to df -h). But let's think about this for a moment. I've been chasing a database corruption problem. If the OS thinks it is out of space, of course it won't be able to update the db.
Looking back at my notes from a couple of months ago, before this started, I had about 30 GB free. Where did all of the free space go? I talked to the PHB, and he finally agreed to let me take the server down over the weekend to work on it. In preparation, I also looked at the folders in the mail store and compared them to the FileMaker Pro database that I use to track all of the account info. (For you privacy nuts: I don't keep passwords. If you forget yours, I can change it, but I cannot see what it is.) I also looked at tuning options for Postfix, Cyrus and Mailman.
Fast forward to Friday night. I purged the mail queue, then stopped the mail server and manually ran the garbage cleanup script. Our webmail users can't understand that unless they click "Purge," the mail doesn't actually get deleted. So I wrote a script that deletes the files from their Trash folder and rebuilds the index. This happens quietly behind the scenes, and they never know it happened unless they try to retrieve something from the Trash. I actually once had someone tell me that's where they store their important emails. I asked them: do you keep your checkbook in the garbage can at home? They went away muttering to themselves.
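I never posted that script, so here's a minimal sketch of the idea. It assumes the Cyrus layout shown in my earlier post below ("Deleted Messages" is the Trash folder in our setup), and the file-matching pattern is my assumption about Cyrus's numbered message files:

#!/bin/sh
# Empty each user's Trash and rebuild that mailbox's index.
cd /var/spool/imap/user || exit 1
for userdir in *; do
    trash="$userdir/Deleted Messages"
    [ -d "$trash" ] || continue
    # Cyrus stores one message per numbered file, e.g. "123."
    find "$trash" -maxdepth 1 -type f -name '[0-9]*.' -delete
    # Rebuild the index for just this mailbox
    sudo -u cyrusimap /usr/bin/cyrus/bin/reconstruct "user/$userdir/Deleted Messages"
done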
I rebooted the server into single-user mode and fsck'd the drive. It took forever to get through the various steps, to the point that I stopped watching and forced myself to wait 10 minutes between checks. When it got to "Checking Volume Information," it said:
Invalid volume free block count
(It should be 8456596 instead of 4562047)
Repairing Volume
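For anyone who hasn't done this on a Mac server: the check itself is just the stock fsck from single-user mode. A sketch - double-check the flags against your OS version:

# Boot holding Cmd-S for single-user mode, then:
/sbin/fsck -fy
# -f forces a check even if the volume is marked clean;
# -y answers "yes" to every repair prompt.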
After it was done (roughly 3 hours from start time), I had 32 GB free, not the 16 reported before I started. Yea!!!
From here, I set about rebuilding the Cyrus database and the user databases. I restarted the mail server (sudo serveradmin start mail) and watched the mail begin to flow. When I felt comfortable, I restarted Mailman and watched it fly through the postings that were waiting to be sent.
It's now a couple of days later and everything seems to be stable... Except now the boss wants me to get to 100 GB free!!! (le sigh...) Stay tuned for more updates.
Wednesday, May 26, 2010
It's Racing Season
Believe it or not, the Netman isn't all work; he knows how to have fun also. One of the things that my wife and I like to do is go to races. She's mostly a NASCAR fan, but goes to other types of races to entertain me. We have our most extensive season yet planned and started it off on Mother's Day.
We went to see the Trans-Am series at New Jersey Motorsports Park. I remember growing up watching Mustangs, Corvettes, Camaros and other cars on TV. This was the first time that I've had a chance to see this series in person, and I was excited.
We stumbled onto the Club Diner in Bellmawr on our first trip to the track, and it's become part of our ritual. Usually we have to figure out how to get there. I must have set it as a waypoint in the GPS the last time we went, because it took us right to the door. Strange, but I digress.
The entry list had about 20 entries, but when I checked the practice times, most of them didn't have a time listed. When we got there, we learned that only 8 cars showed up - but for some reason, they list 12 cars getting points.
Here are a few of my pictures from the race:
Here's Tomy Drissi and Tony Ave getting ready to go out for practice and qualifying. My wife can't wait for the Marmaduke movie to come out and decided that she was going to root for Tomy.
Drissi won the race (surprise, surprise), RJ Lopez came in second and Ave was third. Here they are on the podium spraying champagne and doing the "hat dance."
My personal highlight of the day was getting to see Drissi's owner Paul Gentilozzi. (Below talking on the phone.) I had a chance to talk to him a couple of times and he's much cooler than the TV reporters make him out to be.
Here's a link to my Picasa album. It's not my best work, but here it is. Next up - ARCA and Sprint Cup at Pocono.
Monday, May 17, 2010
(More) Email Server Issues
This past weekend, we had another email outage. This time it took the server about 12 hours to rebuild itself. Time for a backup/restore or defrag. But the PHB says he's going to purchase an Exchange 2010 setup at the start of the next fiscal year and doesn't want the downtime. Meanwhile, my faithful Frankenserver takes another hit. This server has been in service since 2005 and ran fine for years until recently.
When I left on Friday afternoon, everything looked fine, but I had a feeling that it wasn't. At 6:00 Saturday morning, I got up and checked the server. Sure enough, the db was corrupted again. So I did the usual rebuild:
# serveradmin stop mail
# mv /var/imap /var/imap.old
# mkdir /var/imap
# /usr/bin/cyrus/tools/mkimap
reading configure file...
i will configure directory /var/imap.
i saw partition /var/spool/imap.
done
configuring /var/imap...
creating /var/spool/imap...
done
# chown -R cyrusimap:mail /var/imap
# sudo -u cyrusimap /usr/bin/cyrus/bin/reconstruct
# serveradmin start mail
The last step is usually done with the -i option, but I usually skip it and perform the user folder rebuilds manually with:
# cd /var/spool/imap
# su cyrusimap
$ /usr/bin/cyrus/bin/reconstruct -r -f user/username
user/username
user/username/Deleted Messages
user/username/Drafts
user/username/Sent Messages
I have a text file on my server that has all of the usernames. I wrote a quick script to parse the file and rebuild each user.
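That script didn't survive either, so here's a minimal reconstruction - the path to the username file is made up; adjust to taste:

#!/bin/sh
# Rebuild every mailbox listed in a text file, one username per line.
USERLIST=/var/root/usernames.txt    # hypothetical path
cd /var/spool/imap || exit 1
while read -r username; do
    [ -n "$username" ] || continue
    sudo -u cyrusimap /usr/bin/cyrus/bin/reconstruct -r -f "user/$username"
done < "$USERLIST"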
After looking at various log files, it appears Mailman is causing the issues. We have a list that goes out to all of our users (about 2300). (Send to one address and it spiders out to the lists for each location. The Mailman docs refer to this as an umbrella list.)
Our HR department decided to "go green" by sending out death notices and other items via this list. Even though they've been told to delay-send until the evening hours, they keep "forgetting" and send them in the middle of the day. This usually ends up tying up the server for a couple of hours, during which time messages continue to come into the server, but nothing goes out until it finishes processing the list.
At this point I decided to hold off on the rebuild and wait until I talked to the boss. I got him to agree to hold off on the rebuild until we can clean up the server and verify that it is actually stable (this time).
I spent an hour or so this morning researching various Mailman options, such as: can we delay the sending until X o'clock? Turns out others have asked the same question, and the answer ranges from "not easily" to "no." I ran through the various configuration options and found that my SMTP_MAX_RCPTS was set to 500. This causes the server to attempt to deliver to 500 recipients in each batch. I changed this to the recommended 10. We shall see if this works.
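If you want to make the same change, it's a one-line override - the paths here are from memory, so check where your install keeps mm_cfg.py:

# Defaults.py holds the stock value; local overrides go in mm_cfg.py:
grep -n SMTP_MAX_RCPTS /usr/share/mailman/Mailman/Defaults.py
# Add this line to /usr/share/mailman/Mailman/mm_cfg.py:
#   SMTP_MAX_RCPTS = 10
# Then restart Mailman so the queue runners pick up the change:
sudo /usr/share/mailman/bin/mailmanctl restart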
I was also able to get him to agree to let me purge some of the older emails on the server and set up some quotas.
Stay tuned.....
Tuesday, May 11, 2010
PICC Day 2
Day 2 of the PICC conference was a little more scattered and varied than the first. The first session was "The Evolution of Storage Networking and the Current Trends in the Industry," led by Jacob Farmer, CTO of Cambridge Computer. Jacob's talk was exactly what the title says. He talked about storage virtualization and how abstraction layers interact with the operating system and the file system. Most of the concepts were as I imagined they would be, although it's nice to know the "real" names for things. I learned that the EqualLogic SANs that I have at work use "spindle virtualization." In short, this means you don't have to worry about carving up your SAN into various LUNs, each (possibly) having a different RAID configuration. When you set up the EqualLogic, you format it (RAID 5, 6, 10, or whatever you want) and then only worry about the size of the partitions that you create. This allows you to easily add storage by adding another unit to the "cluster." Personally, I think this is a cooler way of dealing with SANs.
I only stayed for the first half of this talk because I wanted to go to a couple of other sessions. Next I went to "Pushing Boulders Uphill: NOAA Updates - High Performance Computing across the WAN." [In my best "Staples" commercial voice] "Wow, that's a long title." The speakers gave an overview of how NOAA deals with their datasets. It's amazing how much data they move around their network on a daily basis. More importantly, some of the discussion was about how they grew their network and some of the tools they use. It was amazing to see how "simple" it was, albeit from a very high-level overview. Even better, they use some of the same (open source) tools to monitor their network that I use to monitor mine. I must be doing something right.
Directly after this, Matt Simmons led a discussion on "Keeping Nagios Sane." More than any other, this session demonstrated what community is about. Matt started by stating that he wasn't an expert, but these were the things that worked for him. Most of the participants use Nagios and contributed some of their own ideas during the discussion. Could any of us be considered "experts"? Maybe, depending on your definition.
Sometimes I get a little annoyed when my coworkers or others call me an "expert." What does that mean? I know what I know, but that's not everything there is to know. Usually I know more about my field than they do. Do I know everything? Of course not. I think what separates us from the paper admins is our ability to get up to speed quickly in a variety of areas. Often this is driven by our quest for knowledge; sometimes we actually need to do something with what we learn. What separates this profession from most others is that there are always several ways to accomplish a task. Some are more elegant than others, and some are just crazy ideas that lead to some cool discovery. I should probably finish this post before I go off on a rant. BTW, Matt: check out Cacti for trending.
Next I went to a session on Budgeting for System Administrators given by Adam Moskowitz. Adam gave some very cogent advice regarding the budget process. Sometimes we tend to forget that those above us may not understand the buzzwords. We need to show them what our needs are in a language that they can understand and relate to. Also, they may not fully understand the inter-relatedness (is that a word?) of the various parts of our projects. Finally, sometimes we have to be willing to compromise and either scale back or phase in our projects.
After this, my brain needed a break, and I attended an "unconference." This was probably the closest to what most people imagine would happen if a group of techies got together in a room to just discuss a topic - except most of us were totally worn out at that point. We ended up talking about DNS; specifically, how we each manage it. We were shown Carnegie Mellon's NetReg by one of the people who wrote it. How cool is that? I had looked at it a couple of times before, but didn't realize its power. It's not something I would use in my current environment, but I know it's out there if the need arises.
The last session I attended was "An overview of Google's Technologies...." given by Tom Limoncelli. It was kind of fitting that I started and ended with the same speaker. Tom gave a quick overview of Google and how they grew their infrastructure. More cool stuff to keep in the back of my head.
Overall, it was a good conference and I had a blast. I met some new and interesting people (and not the weirdos that are usually drawn to me). Kudos to William Bilancio, Tom Limoncelli, and the entire planning and organizing committees. They've already begun to consider the next conference, and I'm sure it will be even better.