Uh oh. Something’s not working – and as a digital product store owner, all of the pressure is on you to fix it. What do you do?
It’s never fun when a website goes down, but with a digital store, you’ve got revenue, the happiness of your customers, and your brand reputation on the line. The good news is, there are some precautions you can take to lessen the risk of a site crash happening in the first place, as well as some steps to take if the worst does happen.
In this week’s edition of The EDDit, we discuss what to do if your digital product site goes down, so that you feel more confident – and prepared to handle it!
Verify that there’s a problem
It’s important to verify that there is a problem before going into more intensive troubleshooting mode. Occasionally, you might get word that your site is “down”, when there are actually other factors at play.
To assess the situation, ask the following questions:
Does your site load normally from your own devices? Check desktop and mobile versions. You can even use tools like the Uptrends website availability test, Where’s it Up, and Down for Everyone, or Just Me? to check how your site is loading in different parts of the world.
Website availability test (Uptrends)
Is there any issue with the user’s device? There’s not necessarily much you can do if this is the case.
Is there a problem with the browser being used? Load your site on multiple browsers to check.
Does the user have a stable internet connection? Can they visit other sites without a problem?
Is the user seeing a cached version of your site? A hard refresh bypasses the cached copy of the page and forces the browser to load the most recent version. This can be done by:
holding Shift + the Reload button on Mac, or
holding Ctrl + the Reload button (or Ctrl + F5) on Windows / Linux, or
other key combinations, depending on the OS and browser.
Once you’ve ruled out user, browser, connectivity, and caching issues, you’ll want to move on to further troubleshooting.
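If you'd rather script the first check than load the page by hand, here is a minimal sketch in Python using only the standard library. The function name check_site and the User-Agent string are placeholders we made up for illustration. It only tests from the machine it runs on, so it complements the multi-location tools above rather than replacing them.

```python
import urllib.request


def check_site(url: str, timeout: float = 10.0) -> tuple[bool, str]:
    """Send one GET request and return (is_up, detail)."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "uptime-check/0.1"},  # hypothetical identifier
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return True, f"HTTP {response.status}"
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4XX or 5XX status code.
        return False, f"HTTP {err.code}"
    except OSError as err:
        # DNS failure, refused connection, timeout, and similar problems.
        return False, f"no response: {err}"
```

Calling check_site("https://example.com/") with your own store's URL in place of the example returns a pair such as (True, "HTTP 200"), which is easy to log or paste into a support ticket.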
Identify the source of the problem
Know your HTTP errors
Modern browsers will try to tell you a little more about a request when it fails. Even so, it is important to know what HTTP status codes stand for and how they are grouped before you start the process of debugging.
In particular, if you’re receiving a 4XX or 5XX error, it’s important to understand what that means if you want to accurately assess (and address) the problem.
4XX client errors
These errors indicate a problem with the request sent by the client (typically the user’s browser), rather than with the server itself. The most common 4XX errors are:
404 not found. In this case, the resource requested by the user was not found on the server.
403 forbidden. This is a permissions-based error, meaning the client (the browser user) does not have adequate permission to access the resource. This could mean the resource requires authentication, or that the files on the server have incorrect permissions assigned to them.
If you want to familiarize yourself more with different types of 4XX errors, you can refer to Wikipedia’s list here.
5XX server errors
Unlike 4XX errors, 5XX errors occur on the server side. There are several common 5XX errors, including:
500 internal server error. A code-level error, or other nonspecific server-level error.
502 bad gateway. When the server is acting as a gateway or proxy, this error occurs when it receives an invalid response from an upstream server in the network hierarchy.
503 service unavailable. This error happens when the server is unable to handle the request, either because it’s down for maintenance, or because it’s handling too many requests at once.
504 gateway timeout. When the server is acting as a gateway or proxy, requests can time out if the upstream server takes too long to respond, producing a 504 error.
If you encounter a 500 error, you’ll want to check your error logs, as your code most likely produced an error. For 502 and 503 errors, check that your web server service (such as Nginx, Apache, Node.js, etc.) is running, and that all dependent services (database, PHP, etc.) are active.
When it comes to 504 errors, your server simply took too long to process the request, and stopped it altogether. This could be due to several reasons – slow database queries, an external service you require that’s not responding, or server resources that are maxed out. Each of these needs to be handled slightly differently.
You can learn more about 5XX errors over here.
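To keep the groupings above handy during an outage, you could encode them in a small lookup. The sketch below is only an illustration: the function name triage and the hint wording are ours, and the hints simply restate the advice from this section.

```python
# First-step hints for the status codes discussed in this section.
# The wording is our own summary, not an official definition.
HINTS = {
    403: "permissions: check authentication and file permissions on the server",
    404: "not found: check that the requested resource exists at that URL",
    500: "check your error logs for a code-level error",
    502: "check that the web server and upstream services are running",
    503: "check for maintenance mode or too many simultaneous requests",
    504: "look for slow queries, unresponsive external services, or maxed-out resources",
}


def triage(status_code: int) -> str:
    """Return a short first-step hint for an HTTP status code."""
    if status_code in HINTS:
        return HINTS[status_code]
    if 400 <= status_code <= 499:
        return "client error: check the request, URL, and permissions"
    if 500 <= status_code <= 599:
        return "server error: check error logs and service status"
    return "not an error status"
```

For example, triage(502) points you at your web server and upstream services, while an unlisted code such as 418 falls back to the generic client error hint.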
Know where your logs are
It’s important that you either memorize the error log locations for your servers, or securely store a resource that lists them. Since time is critical when your eCommerce store is down, quickly identifying the problem is key.
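One way to keep those locations within reach is a small script that stores them and prints the most recent lines of a given log. This is a rough sketch, not a recommendation of specific paths: the entries in LOG_LOCATIONS are placeholders you would replace with the real paths on your own servers, and the helper name tail is our own.

```python
from collections import deque

# Placeholder paths for illustration only; replace with your real locations.
LOG_LOCATIONS = {
    "web server": "/path/to/your/web-server/error.log",
    "application": "/path/to/your/application/error.log",
}


def tail(path: str, lines: int = 20) -> list[str]:
    """Return the last `lines` lines of a text file, oldest first."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent lines while reading.
        return [line.rstrip("\n") for line in deque(handle, maxlen=lines)]
```

Running tail(LOG_LOCATIONS["web server"]) on the server then shows the latest entries without opening the whole file.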
If you’re using monitoring tools like Rollbar, it can be as simple as logging into your account and seeing your error logs.
Check your server’s load
All servers will have a way for you to view the current load, which is a listing of all the resources being used at that moment – and in some cases, historically. Know how to view these to determine if there is something that is using too much of a given resource.
Three primary resources you’ll want to pay attention to are CPU, memory, and disk I/O. You’ll want to focus primarily on CPU and memory, as these are the resources usually affected by web-based traffic.
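If you have shell access to a Linux server, a quick look at CPU and memory can be scripted. The sketch below assumes a Unix-like system for os.getloadavg and a Linux /proc/meminfo file for the memory figures; the function names are our own, and what counts as "too high" is a judgment call for your setup.

```python
import os


def load_per_core() -> float:
    """1-minute load average divided by CPU count (Unix-like systems only)."""
    one_minute, _five, _fifteen = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def memory_available_fraction(meminfo_text: str) -> float:
    """Parse the text of Linux /proc/meminfo and return MemAvailable / MemTotal."""
    values = {}
    for line in meminfo_text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])  # values are reported in kB
    return values["MemAvailable"] / values["MemTotal"]
```

On the server itself you could pass open("/proc/meminfo").read() to memory_available_fraction. A load per core that stays well above 1.0, or a very small available memory fraction, suggests requests are starting to queue.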
Did you know?
Your CPU is what handles all of the requests to your server, and it does all the processing of those requests, too. When your traffic spikes, your CPU usage will spike as well, as your server attempts to handle more and more requests. If there are more requests than your CPU can handle, the requests begin to queue, and will be handled using a First In, First Out (FIFO) method.
Memory usage is another common bottleneck that can cause your site to slow down, or come to a halt completely. When your server runs out of memory, requests sit and wait until more memory is available before they can be completed. As with CPU, the two options are either adding more memory, or reducing the amount of memory each request requires.
Both CPU and memory have two basic methods to help sustain traffic spikes or high-consuming requests:
Add more of the needed resource
Optimize your code or database to require fewer resources
There are entire books written about these two topics, but we’ll just leave it at this: The quick and easy solution is to add more resources, which can cost more money. The proper long-term solution is to identify code and database queries that are unnecessary, and either temporarily disable them or optimize them.
Monitoring tools like NewRelic can actually give you insights into what processes, code, and database queries are consuming the most of your resources, which can help you figure out whether you should optimize or increase your resources. We’ve actually used this a number of times to identify code that was causing 504 errors (timeouts) on our own sites!
Contact your hosting company
Most hosting companies use automated monitoring, so if the issue is with your host, there’s a good chance they already know about it and are actively working to fix it. However, if you’re not sure, you should contact them to let them know your site is down, and inform them of the specific error you’re getting.
Needless to say, it’s important to choose a host with a good reputation when it comes to support. When researching hosting providers, be sure to check the terms of service (TOS) and service level agreement (SLA) to gauge things like the technical support, guaranteed uptime, server availability, and monitoring you can expect from them.
Make an announcement on social media
If you’re experiencing more than momentary downtime, you might want to make an announcement on social media – especially if you’re running a large-scale business with more than a few people having problems. For example, Twitter is a common way for companies to communicate to their users quickly when there’s an issue:
Twitter announcement to customers (Comcast)
Get purchased products to your customers
If you have customers who have ordered products from your site, but haven’t received them due to your site being down, you’ll need a way to deliver those products in a timely manner. Depending on the scale of your store, keeping copies of your products in Dropbox or Google Drive can be a good option; this way, you can easily send a private download link to the customer. You can ask the customer to notify you once they’ve downloaded the product, so that you can promptly delete the unique link.
Use best practices for prevention
There are many reasons why a site can go down, but as they say: prevention is the best medicine. So, what are some of the ways you can guard against these potential issues?
First, you’ll want to be sure you’re monitoring for site downtime; after all, it’s better to find out about it yourself than be told by a coworker, or worse – a customer.
Always back up everything
Anytime you are going to make a change to your site, be sure to make a backup. Whether you want to push new code, update plugins, themes, your CMS, or do anything else, backing up your site is absolutely essential.
It’s also important to have a predetermined (and tested) ‘rollback’ plan. Before you push that shiny button to make changes, be aware of what steps you would need to take in order to reverse those changes – and make sure to test them in a staging environment.
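As a rough illustration of a backup and rollback step, the sketch below archives a site directory with a timestamp and restores it elsewhere. It is a minimal example with assumptions: the function names are ours, it covers files only (your database needs its own export), and the backup directory should live outside the directory being archived.

```python
import tarfile
import time
from pathlib import Path


def backup_directory(source, backup_dir) -> Path:
    """Create a timestamped .tar.gz archive of `source` inside `backup_dir`."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    archive = backup_dir / f"{source.name}-{stamp}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Store paths relative to the site folder's own name.
        tar.add(source, arcname=source.name)
    return archive


def restore_backup(archive, target_dir) -> None:
    """Unpack an archive created by backup_directory into `target_dir`."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(target_dir)
```

Restoring into a staging directory first, rather than straight over the live site, is one way to test the rollback before you need it.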
Follow service providers on social media
Some service providers announce outages publicly, so following them on social media can help you stay informed if anything happens. Turn on push notifications for those accounts to be the first to know about any issues.
Some providers even have convenient status pages for their infrastructure. Bookmark them!
Consider using monitoring tools
NodePing. This server monitoring service can hit your homepage from multiple geographic locations to alert you of downtime. You can set up alerts for your homepage and your checkout to look for specific strings of text on the page, such as footer text on the homepage and text on the purchase button at the checkout. NodePing will alert you via SMS (or a number of other methods) when one of these checks fails.
Rollbar. Effective for error tracking and crash reporting, Rollbar monitors errors in real time, and groups and catalogs your error logs into a real-time feed and searchable web interface. More importantly, it can notify you if any specific errors start to trend. This helps you know if an update to your site is causing problems.
NewRelic. NewRelic integrates with servers directly to send near real-time stats about your server to their logging platform. This can help you determine slow requests, database queries, inefficient code (which it can break down line by line) – ultimately, what exactly is causing errors or pages to load slowly.
With these three services, you can detect code-level errors and outages and be notified in real time. This way, you’re far less likely to have an outage you are not aware of, and you can react swiftly.
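To get a feel for the "look for specific strings of text" style of check described above, here is a small sketch in Python. It is not how NodePing or any other service works internally; it is just a standard-library illustration, and the names fetch_text and page_has_text are our own.

```python
import urllib.request
from typing import Callable


def fetch_text(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its body as text."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def page_has_text(
    url: str,
    expected: str,
    fetch: Callable[[str], str] = fetch_text,
) -> bool:
    """Return True if the page loads and contains `expected`."""
    try:
        return expected in fetch(url)
    except OSError:
        # Covers HTTP error statuses, timeouts, and connection failures.
        return False
```

You might run page_has_text against your homepage with a snippet of footer text, and against your checkout with the purchase button text, on a schedule. The fetch parameter exists so the check can be tried out with a stand-in function instead of a live request.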
Have a plan of action
Knowing what to do if your site goes down is half of the battle – and having a plan in place can make things a lot easier for you and your customers, should problems occur. Hopefully this post has given you some guidance to refer to, and some insights that can help you prevent site downtime in the first place!
How have you handled downtime for your own digital product store? What steps have you taken to restore your site and prevent future issues? Let’s hear what you have to say. Leave a comment below!
Illustration by Jessica Johnston.