Tuesday, October 25, 2011

(Mis)Adventures in Networking

EDIT:  Whoops! I never published this blog post.

Wow, it's been a while since my last post.  Anyway, if you were looking for an example of Murphy's law, I've got one for you.

We had 2 simple goals:  upgrade our internet service and bring up our new WAN connections.  This adventure starts on Tuesday, April 19, 2011.

Background:

In the city where I work, we are really locked into our choice of providers.  Because of the city's infrastructure, we have 2 options: the incumbent telecom provider and the local cable company, which has no interest in providing service to us.

This install is an upgrade to our existing service and also a relocation to another building.  It was originally supposed to take place last November. As we were ramping up for the install, we discovered that my former boss hadn't placed the order. (This was after I had received a couple of phone calls asking why we were requesting service at 2 sites during the summer.) After a couple of angry phone calls to the telco rep, the order was placed, and we were given an install date in early February.

A week before the date arrived, I sent my weekly pester-the-project-manager email and was told that they were having "unforeseen issues and would be unable to meet the date." (According to the rep, they ran out of capacity.) The new date they gave me was in June.  I called the rep, explained that this was unacceptable, and told him I was going to recommend that we take action to terminate our existing contracts. He was then able to get us a March date. As we got close to the March date, we were told that the date was only for our end of the connection and that it would take a few more weeks to get the other end installed. Finally, in late March, we were told that everything was in place, and we looked at the calendar to find a suitable date that wouldn't cause too much disruption.  We decided on spring break week, since the majority of our employees were off anyway.

Tuesday Afternoon:
I met with an engineer from my ISP and my integrator to go over the connections and site prep prior to the installation. The engineer checked the configuration on the router as it was shipped and made some changes/corrections.  He made a call to try to bring up the connection, even though it wouldn't be routing any traffic at that point - just to make sure everything was good on the telco end.  After spending some time on hold and talking to several people, he was told that our appointment wasn't until 9:00 the next morning and there was nothing that could be done until then. I found this rather entertaining, since he works for the same company.  While he was doing this, the integrator and I busied ourselves with racking switches, connecting cables, checking connections, etc. A couple of hours later, we went over our game plan for the installation and left for the day.

Wednesday Morning:
We met at the appointed time and set out to make magic!! Our enthusiasm was soon dashed when we couldn't get a link to the demark. We did the usual: switching the polarity of the cables, trying new/different cables, trying different SFP modules. Nothing was working.  At some point during all this, we noticed that we didn't have a link light on the WAN router either - but the connection was up and passing traffic. The ISP engineer continued to work with the telco side to try to troubleshoot the issue.

During a trip to the demark to bang my head against the wall, I looked at the fiber box that went to the MDF. It read "SM Fiber to MDF". I thought, SM - as in Single Mode Fiber? This should be MultiMode Fiber.  In disbelief, I opened the box and looked inside.  Sure enough, it was Single Mode. WTF! (Bang head on wall.) The funny thing is that the vendor that installed the fiber was hired by the telco, on the premise that they wouldn't install the line unless dedicated fiber was in place. Now, the demark also happens to be in one of the IDFs in the building. Couldn't we just use that? We were told no.  After we put our heads together for a few minutes, I said, "F it, this is the only opening we have for over a month that we can do this during the day." I then connected to the fiber that was in the IDF, ran upstairs, moved the cross-connects and bingo, the WAN came up with a link light.

While this was going on, the telco engineer had moved the router to the demark to try to get a connection there. After a while, he was able to get the line up and brought the router back upstairs.  We re-racked the router and fired it up. He was unable to pass any traffic across the connection.  He got on the phone again, and after about 15 minutes he called me over and put his cell on speaker.  The person on the other end of the line said, "Oh, you want to switch your internet service to that line?" I asked him why the heck he thought we were going through all of this. His answer (I wish I was making this up): "Oh, I thought you were just testing the line." As Bill Engvall says, "I didn't start out the day wanting to be a jackass, but you just pushed my jackass button." Needless to say, we had internet in a couple of minutes. Elapsed time: 3.5 hours.

I wish this was the end of the story, but it's not.

While we did have internet, traffic was sporadic. Our initial troubleshooting led us to believe it was a DNS issue.  And, indeed, we were having trouble with DNS resolution.  But when we tried to use a public DNS server, it was only marginally better. It worked fine from the DMZ and from outside.  We made a few changes to the routers and firewall and were able to get access at first. After about 10 minutes, it stopped working again. We undid the router and firewall changes and it worked again for about 10 minutes.

Hours of troubleshooting later, we were able to find the issue, but not a clear solution. When we started routing traffic through the new WAN, it introduced routing loops at almost every point in the network.  Considering we have 22 locations, the problem was quite massive.
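
For anyone who hasn't had the pleasure, a routing loop is what you get when routers keep handing a packet to each other and it never reaches its destination. Here is a minimal Python sketch of how you could spot one by following next hops. The router names and the next-hop table are made up for illustration; this is not our actual topology, and it is not the tool we used.

```python
def find_routing_loop(next_hop, start, destination):
    """Follow next hops from start toward destination.

    next_hop maps each router to the router it forwards to (hypothetical data).
    Returns the routers that form a loop, or None if the destination is
    reached or the path dead-ends.
    """
    path = []   # routers visited so far, in order
    seen = {}   # router -> its position in path
    current = start
    while current != destination:
        if current in seen:
            # We have come back to a router we already visited: a loop.
            return path[seen[current]:] + [current]
        seen[current] = len(path)
        path.append(current)
        current = next_hop.get(current)
        if current is None:
            return None  # no route from here, but not a loop
    return None


# Hypothetical table: the core sends traffic to the new WAN router,
# and the new WAN router sends it right back.
looped = {"site_a": "core", "core": "new_wan", "new_wan": "core"}
print(find_routing_loop(looped, "site_a", "internet"))
```

With the made-up table above, the function reports the core and the new WAN router bouncing traffic between them. Multiply that by 22 locations and you get the idea.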
