On the morning of Wednesday, May 30, we received an email from one of our long-time customers stating her site was down.
After confirming that, yes, it was down for us too (sometimes people just get locked out due to misbehaving browsers or too many failed login attempts), we tested a couple of other sites. Two out of three sites we tested showed an “internal server error 500,” which is tech jargon for “we don’t know what’s wrong but something is really wrong.”
We immediately called the datacenter, where we learned that a virtual memory shortage had taken down the majority of our customers’ websites. We believe the issue was corrected within 15-30 minutes for most customers.
While we’ve been hosting websites a long time, we’ve never run across this one.
The immediate fix was simply to raise the virtual memory allocation for the affected sites above the default 1GB, after which they were able to load again.
Still, how do we keep this from happening in the future? More on what the long-term fixes look like in a moment.
First, a brief-yet-morbid snapshot of three other oh-man moments we’ve seen in past years:
- We once had a server administrator attempt to adjust user permissions on a single website, but he accidentally blew out user permissions across an entire server. All sites went down. I can still remember the last keystroke before the moment of silence punctuated with a quiet, “Oh no.” Permissions had to be reconstructed by hand, which took the entire night and half of the next day. To make the timing worse, the client who had the unfortunate honor of being “patient zero” had a paid ad campaign funneling traffic to their extinct site, making the temperature in the server room a few degrees hotter. No pressure.
- Our datacenter once had an internal router go down within their network. Fortunately, they had a backup router waiting in the wings for just such a moment. Unfortunately, the routing table on the replacement router was out-of-date, which wreaked havoc on that segment of the network. Fortunately, the issue was corrected within a few hours and procedures have been put into place to ensure routing tables are kept current.
- A client called once about slow web performance. When we tested our sites, we found that they were indeed slow to respond; really slow. It turned out one of the datacenter’s major telco carriers was under a massive DDoS attack. (I believe it was AT&T under fire at the time.) The only reason sites were able to load at all was that the datacenter also had fiber optic connections to two other carriers leading into its facility. (More carriers have been added since.)
This is all to say we know the unforeseen happens. It’s kind of an occupational hazard in the web hosting industry.
Now, about those long-term fixes...
Fix #1: Site Monitoring
It’s never a good feeling to find out a client’s site is down, but it’s even worse when the client is the first to notice. In fact, it’s an awful feeling.
Servers can send out email notifications to raise the alarm when services stop working. However, there are no server-side monitors to report whether webpages are actually being served.
Seems like a big gap, right? It is.
To counter this, we are partnering with the site monitoring company SiteUptime to monitor all Platinum (now marketed as Revelation) web hosting plans. Customers on smaller hosting accounts benefit by proxy, since a server-wide problem that hits their larger neighbors will alert us on their behalf as well.
Of course, a problem experienced by one site doesn’t mean a problem will be experienced by all sites, but this goes a long way toward ensuring a fire in town is noticed quickly.
Smaller accounts have the option of having site monitoring added for an extra $2.00/month.
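For the technically curious, here is a minimal sketch of what an outside-in check can look like: a small Python script, run from a machine other than the web server, that requests a page and reports whether it actually came back. This is only an illustration of the idea, not how SiteUptime works internally; the function name `check_site`, the 10-second timeout, and the rule that any 2xx or 3xx response counts as "up" are all our own assumptions.

```python
import urllib.error
import urllib.request


def check_site(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return (is_up, detail) for one page, judged from outside the server.

    `opener` is a hypothetical hook so the check can be exercised without
    real network access; by default it is the standard library's urlopen.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx, so a "500" page lands here.
        return False, f"HTTP {err.code}"
    except (urllib.error.URLError, OSError) as err:
        # DNS failure, refused connection, timeout, and similar problems.
        return False, f"unreachable: {err}"
    # Assumption: any 2xx or 3xx status means the page was served.
    return (200 <= status < 400), f"HTTP {status}"
```

Something like `check_site("https://www.example.com/")` could be run every few minutes from a scheduler, sending an alert whenever the first value comes back `False`. The point is that the request travels the same path a visitor's browser does, which is exactly the part a server-side service check cannot see.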
Fix #2: Offsite Backups
As our web hosting customer, you’ve always had a pretty robust backup routine supporting you, whether you knew it or not. We have nightly, weekly and monthly backup routines for every web hosting account we carry. The next level in maintaining your data integrity is to ensure a higher degree of safety for those backups.
95% of the time, your current backups are enough, but if the whole server died in some dramatic, fiery way, you (and we) would be sorely out of luck. Such destruction is rare, but the potential exists.
By next week, we will have an additional layer of redundancy in place. That extra protection is called Guardian Backup & Recovery. Essentially, additional snapshots of the server are made and securely stored at a completely different datacenter facility.
We figure this should close the risk gap another 4%, leaving the last 1% up to catastrophic acts of God and nuclear strikes. If these events happen, we have other things to worry about besides our websites.
Fix #3: Server Upgrade
This is probably the most attractive, most meaningful adjustment we’re making. While the server affected by this outage was only four years old, that’s something like 28 years old in tech years.
Sparing you the techno-babble specifics, suffice it to say we’re moving to solid-state drives (really fast) and quadrupling the memory (really, really fast) while upgrading the entire server to something more current and cutting-edge.
With this server upgrade, we should also be able to safely double the virtual memory limits for all sites.
So, this has been a long way of saying, “Sorry about that.” Again, it never feels good to get those “site down” calls, and I hope you can accept my personal apology.
It is my hope and plan that this extra investment in infrastructure will ensure solid, steady performance for your digital marketing storefront for many years moving forward.
P.S. There will be no increase in your regular hosting bill as a result of these advancements—in case you were wondering. (We’ll just need to adopt a few additional hosting customers, that’s all.)
Please contact us if you have any questions.