It’s the things you don’t know you don’t know…

A few weeks ago I started a project to set up a new home for my various web projects in the AWS cloud. All these sites use WordPress, and I wanted to make sure I built them on a robust and scalable architecture.

The tricky thing with WordPress is that it was designed around a traditional web server model, where one physical (or virtual) machine serves one instance of the code. When you update themes or plugins, or upload content, you change the data on that instance's locally attached storage. I wanted to be able to scale horizontally, and WordPress's design makes that difficult.

Horizontal scaling is achieved by adding server instances to spread the load, whereas vertical scaling relies on beefing up the one machine with more CPU or memory. Horizontal scaling is generally preferable: not only is it more flexible (you can scale out during periods of peak demand and scale back in later), it also gives you the added benefit of redundancy – if I have three machines serving my web sites, I can lose two and still be online, albeit perhaps in a degraded state.

To achieve this horizontal scaling in AWS, we need a way for many machines to share the same set of code. Amazon's Elastic File System (EFS) allows us to do that by providing an NFSv4-compatible file system that can be mounted simultaneously on many EC2 instances. All the instances can read and write the same filesystem – problem solved.
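
Mounting the shared volume is a one-liner on each instance. A sketch, assuming a made-up file system ID of fs-12345678 in us-east-1 and /var/www/html as the mount point (substitute your own):

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2 \
    fs-12345678.efs.us-east-1.amazonaws.com:/ /var/www/html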

So, I built a machine image for my WordPress solution that, on boot, would update itself and mount the EFS volume, ready to serve the shared code within. I set up an Auto Scaling group and an Application Load Balancer to distribute traffic to the instances. I moved my domains to Route 53 and bound them via alias records to the load balancer's DNS name. I used AWS Certificate Manager to create (free!) SSL certificates that are automatically bound to the load balancers. I backed this WordPress Taj Mahal with an Aurora RDS instance and enabled CloudFront as a CDN.
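
The boot-time logic lives in the instance user data. A rough sketch, assuming Amazon Linux 2, Apache, and the same made-up file system ID as above:

#!/bin/bash
# Hypothetical user-data sketch: patch the instance, mount EFS as the
# document root via fstab, then start the web server.
yum update -y
mkdir -p /var/www/html
echo 'fs-12345678.efs.us-east-1.amazonaws.com:/ /var/www/html nfs4 nfsvers=4.1,hard,timeo=600,retrans=2 0 0' >> /etc/fstab
mount -a
systemctl start httpd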

It was all too easy, and the result was exactly what I wanted. I sat back feeling smug.


Of course, that didn’t last long. A few days ago I got a Pingdom alert saying a site was down. Then another, and another…what was going on?

I went online to check… "504 Gateway Time-out". Oh crap. Hm.

So what was failing? EC2 instances? Load balancer? Some weird WordPress problem? I checked them all and was none the wiser. Unfortunately for me, this happened at 8am on a workday, so I had to leave everything in a broken state and come back to it later.

When I logged in later, I noticed that requests to read the content on the EFS volume mounted on the EC2 instances took a long time to respond – several seconds in some cases. Hm, dodgy EFS? Nope, all the data seemed fine and there was nothing in the EFS console that looked like an alarm.

So, I tried that classic IT trick of turning it off and on again. I rebooted EC2 instances, I unmounted and remounted the volumes… nothing helped.

Having exhausted all the other possibilities, I concluded that the EFS performance issue had to be the cause of all my problems, so I posted an issue with AWS support. The next morning, I had my answer.

“You’ve run out of EFS Burst Credits.”

Huh?

So, I guess I hadn't been paying as much attention as I should have, and this detail had passed me by… EFS volumes earn "burst credits" at a rate proportional to the amount of data stored on them. And when you run out of burst credits, your volume is throttled back to its baseline throughput – low enough, in my case, for the whole system to fail.

Here’s how my EFS burst credit level dropped over the preceding two weeks:

As my EFS volume stores less than a GB of data, it earns very little in the way of burst credits. You can check out the details in the EFS performance documentation.

You can also see that BurstCreditBalance was the ONLY metric that mattered here. And boy, did it cause me some grief.
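
If you'd rather not find out the hard way, the metric is right there in CloudWatch. Something like this will pull it from the CLI (the file system ID, dates, and region are placeholders):

aws cloudwatch get-metric-statistics \
    --namespace AWS/EFS --metric-name BurstCreditBalance \
    --dimensions Name=FileSystemId,Value=fs-12345678 \
    --start-time 2017-04-01T00:00:00Z --end-time 2017-04-14T00:00:00Z \
    --period 3600 --statistics Minimum --region us-east-1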

So, my new plan is to keep using EFS, but to use rsync, on boot, to clone the data from the EFS mount to the locally attached EBS volume and use that as the web server's document root. After that, a scheduled rsync in each direction will pick up changes on either side and keep the two copies in sync.
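
Roughly this, with made-up paths:

# On boot: seed the local document root from the EFS mount
rsync -a /mnt/efs/wordpress/ /var/www/html/

# Afterwards, on a schedule (eg every few minutes from cron), sync newer
# files in both directions. Note that -u skips files that are newer on the
# receiving side, and that deletions are not propagated.
rsync -au /var/www/html/ /mnt/efs/wordpress/
rsync -au /mnt/efs/wordpress/ /var/www/html/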

Note to the folks at AWS – put the burst credit metric on the EFS console! Don’t make schmucks like me find out the hard way!

Making decisions out of opinions

It’s a common problem in technology companies – people have opinions that they hold strongly because they are passionate about the technology they like, know, or believe is the best. The passion is good, because it drives enthusiasm and engagement, but how do we resolve differences of opinion, when those differences threaten to derail desired business outcomes?

The first step to achieving agreement is to ensure everyone is sharing the same understanding about what the business is trying to do. Without this crucial alignment, there can’t be agreement because people will be pulling in different directions. In organisations where the vision has not been clearly communicated this can be a challenge, but it’s a first principle on which everything else is founded: agree the vision.

So, let’s assume we’ve got the vision, and everyone understands it. Has everyone committed to it? If not, there’s our next problem. If people aren’t committed to the vision, and aren’t acting in accordance with it, then our team is broken. We need to find out why people aren’t committing to the vision and fix it.

Assuming we have the vision and everyone is committed to it, things should get easier from here on. We can now make our decisions based on how well they help us achieve the vision. How do we do that?

Most of us make decisions daily based solely on our prior knowledge and intuition. This is known as the recognition-primed decision (RPD) model. If we understand and are committed to the vision, there is nothing wrong with this – it's fast, and it will give us a right-enough answer most of the time. The only caveat is that it relies on our ability (and willingness) to accurately compile knowledge. Given that we all suffer from "myside" bias to some degree, the quality of decisions made this way will vary with how accurately we catalogue and draw upon our experiences.

RPD works well when we have the needed expertise and the decision scope is limited to things we are solely responsible for. Even when responsibility spans multiple people or groups, we may still be able to work this way, so long as there is general agreement to start with (everyone is committed to the same vision), and we are collaborative in how we work. The key here is to make sure that everyone who could be affected by our decisions is kept informed and given the opportunity to speak up with any concerns about the choices we are making. Feedback needs to be sought and considered.

If we are working instead in an environment where communication is not optimal, we may need to engage in a more structured process for making decisions.

One way to do this is to use a decision matrix. A popular version of this is the Kepner-Tregoe decision analysis (KTDA) process, which aims to guide us through steps that are designed to lead us to a rational decision.

The first step in the KTDA is to write a concise “decision statement” about what it is we want to decide. An example is “What sort of pet should I get?”

Next, we specify the objectives of the decision – what the decision needs to deliver in terms of results. Each objective is classified as either a "must" (the decision must deliver this objective) or a "want". Objectives are then weighted to indicate their importance, eg, if it is very important that my pet has soft fur, I will weight that objective as a 10.

It's important, when weighting objectives, to be as honest as possible about the true importance of each one, as this is where we will often attempt to "stack the deck" in favour of our preferred option. Get a multi-person consensus on the weightings before you move to the next stage.

Once we have our decision statement and our weighted objectives agreed, we can begin evaluating each alternative (eg, cat, dog, turtle). To do this, each alternative is scored, from 0-10, on how well it delivers each objective. Again, this is a moment where our preferences can bias our answers, so it's important to gain consensus on the scores.

Once you have all your scores in place, the final score for each alternative is calculated by multiplying each objective's score by its weighting and summing the results – eg, if soft fur is weighted 10 and the cat scores 9 on it, that objective contributes 90 points to the cat's total. Alternatives that fail to deliver a "must" objective are excluded. The alternative with the highest total is the one that, rationally, best delivers the objectives and is the one that should be chosen.

Here’s a worked example:

In this example, even though the dog rates very highly on home security and companionship, because I didn't weight home security as highly as soft fur, the cat ended up with the highest score. And because I considered soft fur a must, the turtle had to be excluded.

Here, kitty kitty.

Crucial to any decision-making process is its ability to minimise the influence of unfounded beliefs and prejudices. It should also aim to remove emotional heat from the process by allowing everyone to see that their preference has been evaluated fairly and objectively.

Of course, humans aren’t always rational and objectivity is hard, so even with the best of intentions any rational process can be subverted. By following the processes here we can at least provide a paper trail as to how decisions were made. Later, if we find we made the wrong call we can always go back and see how we arrived at the wrong conclusion and learn from it.

Hm, perhaps I should have gotten a hamster.

Using Node-RED to fix stuff

Node-RED is a truly awesome tool that allows you to very quickly build an app that can talk to IoT hardware (eg devices like a Raspberry Pi), your local machine, and online services. In a matter of minutes you can hook all these things together and get them doing something useful.

At my place we have two Internet connections. One of them is hooked up via a router that is the best part of a decade old. It works pretty reliably – when it works – but every 36-48 hours it locks up and has to be rebooted.

I’ve put up with the inconvenience of this for years, but tonight I decided I’d finally had enough and it was time to solve the problem.

If you want to give this a try at home, the first step is to install Node.js. You can get that from nodejs.org. Then, install Node-RED and some other npm modules we'll want (these instructions work on macOS; you may need to vary them for your system):

sudo npm install -g node-red node-red-node-twilio pm2

Then, once everything is installed, run node-red using pm2:

pm2 start /usr/local/bin/node-red -- -v
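
If you want Node-RED to survive reboots as well, pm2 can arrange that – pm2 startup prints a platform-specific command for you to run:

pm2 save
pm2 startup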

If all has gone well, you should now be able to open a web browser, point it at http://localhost:1880 on your machine, and see this:

If that’s what you see, you’re ready to build!

Building an app in Node-RED involves creating a “flow”. A flow is simply a set of nodes (those things in the list on the left), “wired” together and configured to do what you need.

A node can be an input node, an output node, or a function node through which messages flow both in and out. Messages originate at input nodes, travel through zero or more function nodes, and are emitted at an output node. Each node has the opportunity to modify the message payload before it is passed to the next node. Function nodes can have more than one output, which allows you to create branching logic.

Nodes are configured by double-clicking them, which opens a panel that allows you to set parameters for that node. “Wires” connect the output from one node to the input of the next. This is all done by pointing and clicking and dragging. Couldn’t get much simpler!

So, what I need is a flow that will attempt to access the Internet via the flaky router. If it succeeds, all is well and I don’t need to do anything more. If it fails, I want to call the web UI on the router and tell it to reset the router. Then I want an SMS notification to be sent to my phone, letting me know of the outage and router reset.

Here’s how to do this with Node-RED:

You can see I have three input nodes – two are used to trigger test scenarios, so I won’t describe them here. The one that matters is the “Check every 30 seconds” input node that I have configured to inject a message into my flow every 30 seconds. This message flows to the http request node which is configured (when triggered by a message arriving) to do a GET call on www.microsoft.com (initially I used Google, but they don’t like being used this way). The data returned from that request gets loaded into the message object’s payload slot and passed to the next node.

The "No ping?" node is a JavaScript function that looks at the message payload from the http request and checks whether it looks like it came from the pinged site.

If the data doesn’t contain the string “Microsoft”, and the device isn’t currently being rebooted, the function emits a message out of its second output that flows into and triggers the “Reset Modem” HTTP request node.

The HTTP request simply emulates the web call that my router’s web UI makes when I click the “Reset” button on it.
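
In my router's case that boils down to a single POST. Something like this curl call (the address and endpoint here are made up – yours will almost certainly differ):

curl -d 'factory=E0' http://192.168.1.1/goform/reset

Here's the code inside the "No ping?" function node: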

// This function node has three outputs:
//   1. status messages (wired to a debug node)
//   2. the trigger for the "Reset Modem" http request node
//   3. the text for the SMS notification
var rebooting = flow.get('rebooting') || false;
// if an output's message is null, nothing is sent on that output
var msg2 = null; // output 2: reset trigger
var msg3 = null; // output 3: SMS text

if (msg.payload.match(/Microsoft/)) {
    if (!rebooting) {
        msg.payload = 'OK';
    }
    else {
        node.warn("Reboot complete");
        var currentTime = new Date().toLocaleTimeString();
        msg.payload = 'Reboot complete at ' + currentTime;
        flow.set('rebooting', false);
        var smsMessage = 'Router down from ';
        smsMessage += flow.get('rebootStart');
        smsMessage += ' to ' + currentTime;
        msg3 = { payload: smsMessage };
    }
}
else {
    if (rebooting === false) {
        node.warn("Rebooting");
        var currentTime = new Date().toLocaleTimeString();
        msg2 = { payload: 'factory=E0' };
        flow.set('rebooting', true);
        flow.set('rebootStart', currentTime);
        msg.payload = 'Requesting Reboot at ' + currentTime;
    }
    else {
        msg.payload = 'Reboot in progress';
    }
}

return [msg, msg2, msg3];

If the data does contain the expected string, then either the router is still working fine, in which case the function emits an "OK" message to the console, or the router has just come back up after a reboot.

I’m storing state in the flow context so that I don’t trigger additional reboots when a reboot is already in progress. I also use the stored state to determine when a reboot is complete, and when it is I use a third output to send an SMS telling me the start and end time of the outage.

Now, whenever my router goes down, it’ll automatically get reset, and once it’s back up I’ll get an SMS to let me know what happened, and how long the outage lasted.

Much better!

Have a play with Node-RED and let me know what you think in the comments.

PS: Node-RED has other nodes for getting data and working with it in a myriad of ways, including support for different kinds of protocols, storage engines, cloud services, and home automation gear. All of it is Open Source and free to use. Check it out.

But does it work in IE?

The state of the OS and web browser market

UPDATE: Android surpasses Windows as the world’s most popular operating system for the first time. Windows’ decline continues.

Original article follows…

An update for you with some interesting stats for Internet-using computers of all shapes and sizes. You will see they are continuing to follow a long-term trend. (TL;DR: people are continuing to move from desktop PCs to devices, and Windows market share is dropping as a result.)

Firstly, it looks like Android may surpass Windows as the most popular OS for Internet users (globally) sometime in the next few months.

Of course, these figures are somewhat skewed by the bazillions of users in Asia, but the trend is also visible in other markets as people turn away from desktops and towards devices.

Europe remains strongest for Windows, overall, though its decline there is still pretty consistent.

North America reveals, perhaps predictably, a higher share of Apple devotees than other regions. iOS is the second-most popular OS.

Think about that for a moment.

In the US, 38% of users are now using an Apple-branded computer or device to access the Internet. 60% are now using an OS other than Windows.

A mere 5 years ago Windows had 75% of the market, and such a decline would have seemed unthinkable. If the long-term trend continues, in 6 months its share will be half that.

If you’re a desktop app developer, you might want to consider what is compelling about the desktop environment and make sure you play to its strengths. You might also consider getting some experience in developing for mobile.

Turning to the Web Browser market, we see the market share of IE is now below 5% globally. Millions of web developers cry “hurrah!”.

Perhaps surprisingly, globally, Edge does not even rate a mention yet, and is lagging behind even the perpetual bridesmaid Opera in popularity. Chrome continues to grow in popularity and market share.

Even in the US, traditionally a market with strong IE support, IE is continuing its steady downward trend and now sits at 8.1%. Edge is growing, but almost imperceptibly – certainly not as fast as IE is shedding market share – and now sits at 3.4%. By contrast, check out those Safari numbers!

In Europe, IE and Edge combined are at about 8% of the market, with Edge again barely growing share as IE continues to drop year on year.

What does it all mean?

Well, if you're a web developer it's all pretty good news, showing that the worst browsers ever invented™ continue their slide into obscurity. It also tells us that if we plan to target China and India, we probably need to start testing web apps in UC Browser on Android devices.

A counter-balance to all this: while we continue to service customers whose corporate networks are slow to embrace change, we can take some comfort in that same conservatism acting as a brake on those customers asking for anything different.

Unfortunately, it also means we’ll keep getting the question “yes, but does it work in IE?”.

As my teenagers would say, “kill me now”.