Beyond competency tests and riddles: hiring people with real-world skills


Hiring is a frequent topic in the tech community. It’s a seller’s market, where talent is in high demand. Hiring also carries great responsibility, because a mediocre hire can become an albatross around the team’s neck. Stories are common of days-long interviews, with multiple rounds in front of different groups of people.

When I came to work at OpenRoad, my interview was nine months long, and involved a major project for one of our biggest clients.

Fortunately, it was paid work, and it was ideal from a hiring perspective. OpenRoad had an extended period in which to evaluate me: they received real code from me, they interacted with me as part of a team, and they saw how I handled deadlines and project pressures. When they hired me, I needed no ramp-up. On my side, I got to experience real working conditions and get to know my potential colleagues at OpenRoad.

Not all businesses have the wherewithal to hire someone as a contractor for most of a year just to evaluate them.

Other fields have more obvious markers of professional credibility, such as cases won for lawyers or papers published for academics. We still don’t know how to judge developers other than by working with them and seeing what they deliver. The discussion in the tech community about how to hire is really about how to give the hiring process a certainty that our profession particularly lacks.

The age of competency tests

In the ’90s, formal proficiency tests and certifications tried to fill the role of badges of professional ability. If a candidate for a Java development role wasn’t also a Sun Certified Java Programmer, or a network engineer also a Cisco Certified Technician, then the same vendors were happy to sell exams that could be administered as part of the interview. These were multi-page affairs that took half a day or more, and they were high stakes: the potential employer had paid a lot of money for the test, and was biased towards treating it as distilled credibility. You could lose marks because you didn’t remember an obscure corner case that never occurs in practice.

Even then, a candidate could be a lemon. Answering a lot of narrow questions correctly implies a broad grasp of the subject, but that’s not guaranteed and can even be misleading. In 2006, I earned my Certified Information Systems Security Professional (CISSP), a highly regarded security certification that requires passing a six-hour multiple-choice exam covering ten different knowledge domains. Some of it is deeply technical, like cryptography algorithms; other parts are about crafting policies in a corporate environment, or secure development processes. Typically a quarter to a third of the test takers fail, so my employer paid to fly employees to its headquarters for a week-long boot camp before the exam.

What did I learn from that boot camp? That a secure fence is eight feet tall. That was an actual question on my test: how high is a secure fence? Twelve feet, another choice, is the wrong answer. The test was full of questions like that, where the right answer had to be given to you. There was a rationale behind the right answer, in this case that eight feet with triple-strand razor wire was considered sufficient to deter casual intruders, while six feet without razor wire was not (and twelve feet was unnecessarily expensive, when eight feet would do). Knowing the rationale in order to reason to the correct answer (which seems a much more valuable expertise for a consultant) was actively punished by the multiple-choice format. We were told outright that any answer reading “the support of upper management” was always correct, because the test designers held as axiomatic that upper management’s full support was the foundation of all security.

Replace competency tests with… FizzBuzz?

Exhaustive tests have not proven to be reliable indicators of professional ability. However, there still has to be a way to filter out, during the interviewing phase, people who are simply unable to do the job. For software development roles, we use a common test called FizzBuzz. It’s a smoke test, a basic competency check that really should rule you out if you can’t pass it. It performs the filtering function that certifications and professional exams do, at far lower cost in both time and money, and without signalling a false positive of general ability if the candidate aces it.

The test is this: write a program that prints the numbers from 1 to 100, except that when the number is a multiple of 3, print “fizz”; if a multiple of 5, print “buzz”; if a multiple of both 3 and 5, print “fizzbuzz”.

[Figure: FizzBuzz test illustrated]

I actually flunked this the first time I tried it. We were discussing it, I blurted out some pseudocode on the whiteboard, and my colleague pointed out that I wasn’t supposed to print the number if I printed “fizz” or “buzz”. Whoops.
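For reference, here is a minimal sketch of a passing answer in Python. The language and the function names are assumptions for illustration; the test as stated prescribes neither. The detail that is easy to rush past sits in the last branch: the number is printed only when neither word applies.

```python
def fizzbuzz_line(n: int) -> str:
    """Return the single line that FizzBuzz prints for n."""
    if n % 15 == 0:  # multiple of both 3 and 5
        return "fizzbuzz"
    if n % 3 == 0:
        return "fizz"
    if n % 5 == 0:
        return "buzz"
    return str(n)  # the number appears only when no word was printed


def fizzbuzz(limit: int = 100) -> list[str]:
    """Return the lines for 1..limit, inclusive."""
    return [fizzbuzz_line(n) for n in range(1, limit + 1)]


if __name__ == "__main__":
    print("\n".join(fizzbuzz()))
```

Checking the combined case first keeps the branches independent; testing for 3 and 5 separately and concatenating the words would work just as well.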

Tech interviewing in the early aughts shifted from “test that they know this” to “test how they think”. One vein of that was riddle interviewing, exemplified by Google’s process, where a bunch of engineers would ask you to figure out the most efficient way to move a mountain one inch to the south, and then judge you on how you worked the problem. This method has its own issues, such as cram sites full of these questions, clever answers included. More seriously, as we’ve started to broadly understand the sociological context of interviewing and its effect on candidates and interviewers, the deeper problem becomes evident: if I ask you to judge someone’s thinking, you’ll tend to be biased towards thinking like yours. At least the professional exams had objective answers like “eight feet”. I’d hate to think I blew an interview because the interviewer disliked my whiteboarding.

A better direction, in my opinion, is towards simpler tests that are an obvious “no-hire” if the candidate flat out fails them, but that offer a chance to pass if the work is interesting, all while being relatively low stakes. FizzBuzz does nicely at that: it takes 10 minutes at most, 20 if you want to explore it to see if optimizations are available or to discuss intricacies of the language, but it’s still a small part of a larger interview process. More importantly, my failure at FizzBuzz wasn’t an inability to program; it was rushing through the specification. “He can code but needs to be more careful on requirements” is good data for a candidate who otherwise performs well in the interview. Recognizing weaknesses is as valuable as verifying strengths.

But how do you do that for roles that aren’t about programming?

FizzBuzz for non-programmers: the Smashificator

We were hiring for a Quality Assurance role, initially for manual testing, but we also wanted an experienced QA person who could build up our quality assurance practices at OpenRoad. How do you do a basic, fast competency test for a QA candidate?

You build a buggy app. We called it the OpenRoad Dualie Addend Smashificator.

The idea came like most good ones do, as an offhand comment in a meeting. The president joked that we should build a poor calculator app and see if the candidate could find the bugs. We kept coming back to that idea: why not build a simple app in a web page, give a brief requirements list, and give the candidate 30 minutes with it?

So I set aside an afternoon and built a single-page app that took two numbers and added them. It had two text inputs, plus and minus buttons for each, and a button to trigger calculation. The exercise came as two pages: the first had the requirements and candidate instructions, the second was the app itself.

As a programming exercise, it’s interesting to try to build something buggy. If something works that shouldn’t, you’ve got a bug, but it’s also a requirement if it doesn’t work as it shouldn’t—it can’t not work in a way it’s not supposed to not work. Good times.

Our candidate instructions were these:

  • Read the requirements
  • Create a test plan
  • Launch the app and execute the test plan
  • File bugs in JIRA, our issue tracker

Here were the requirements:

  • The adder is a self-contained web page
  • The adder has two inputs that accept numeric values only
  • The range of acceptable inputs is 1 through 10, inclusive
  • Input is by keyboard or by increment/decrement buttons
  • On valid input to both text boxes and pressing the calculate button, the total is displayed
  • The design is responsive

Our app had many deliberate flaws in it, but we were careful to vary the flaws. Some were deviations from the requirements, such as auto-calculating a total after entering a second number, rather than waiting for the button press. There were math errors: if the second input was 1, it was treated as zero. There were boundary errors: you couldn’t type a number outside of 1 to 10, but with the increment/decrement buttons you could go straight through those bounds. And there were ambiguous cases: if you used the buttons to push an input outside the acceptable range, the addition was still correct, so should that fail or succeed?
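To make the flaw types concrete, here is a minimal Python sketch of such an adder. It is not the original app, which was a web page; the class name, the method names, and the exact way each flaw is planted are assumptions for illustration. It models only the math and boundary flaws, not the auto-calculation or the responsive design.

```python
class BuggyAdder:
    """Hypothetical model of a two-input adder with planted flaws."""

    LOW, HIGH = 1, 10  # requirement: acceptable inputs are 1 through 10

    def __init__(self) -> None:
        self.inputs = [self.LOW, self.LOW]

    def type_value(self, index: int, text: str) -> bool:
        """Keyboard entry, validated against the range. Returns True if accepted."""
        if not (text.isascii() and text.isdigit()):
            return False
        value = int(text)
        if not (self.LOW <= value <= self.HIGH):
            return False
        self.inputs[index] = value
        return True

    def step(self, index: int, delta: int) -> None:
        """Increment/decrement button. Planted boundary bug: no range check."""
        self.inputs[index] += delta

    def total(self) -> int:
        first, second = self.inputs
        # Planted math bug: a 1 in the second input is treated as zero.
        if second == 1:
            second = 0
        return first + second


if __name__ == "__main__":
    adder = BuggyAdder()
    adder.type_value(0, "4")
    adder.type_value(1, "1")
    print(adder.total())               # 4, not 5: the math bug
    print(adder.type_value(0, "11"))   # False: typing out of range is rejected
    adder.type_value(0, "10")
    adder.step(0, +1)
    print(adder.inputs[0])             # 11: the button walks past the bound
```

The ambiguous case falls out of the same model: once a button has pushed an input to 11, `total()` still adds it correctly, and the requirements do not say whether that counts as a failure.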


Promising candidates did a few good things immediately. Of the 30 minutes given, they would spend the first half creating a simple test plan that covered a good breadth of issues for the short time available. The plan was usually a set of simple one-liners, which is fine for shaping a limited amount of QA attention (like “test entering values outside the acceptable ranges in both inputs”, and “verify that addition is correct in all cases of valid inputs”). When they found a problem, they stopped and created an issue ticket that described the bug and the circumstances, then resumed their test plan.
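The exercise was manual testing, but a plan line like “verify that addition is correct in all cases of valid inputs” is small enough to check exhaustively: there are only 100 valid pairs. A hedged sketch in Python follows; `failing_cases` and `stand_in_adder` are hypothetical names, and the stand-in plants just one flaw so the check has something to find.

```python
from itertools import product
from typing import Callable

VALID = range(1, 11)  # requirement: inputs 1 through 10, inclusive


def failing_cases(add: Callable[[int, int], int]) -> list[tuple[int, int, int]]:
    """Return (a, b, actual) for every valid pair whose total is wrong."""
    failures = []
    for a, b in product(VALID, VALID):
        actual = add(a, b)
        if actual != a + b:
            failures.append((a, b, actual))
    return failures


def stand_in_adder(a: int, b: int) -> int:
    # Hypothetical system under test with one planted flaw:
    # a 1 in the second input is treated as zero.
    return a + (0 if b == 1 else b)


if __name__ == "__main__":
    for a, b, actual in failing_cases(stand_in_adder):
        print(f"{a} + {b} gave {actual}, expected {a + b}")
```

Every failure this reports has 1 as the second input, which is the kind of pattern that makes for a good bug ticket: one description of the circumstances rather than ten separate reports.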

Afterwards, we would spend around an hour evaluating the exercise. We would ask them to prioritize the tickets to see if they gave the math errors a higher priority than cosmetic issues, such as a red border that didn’t disappear. If they didn’t finish their test plan, it wasn’t a problem unless they got stuck on a particular bug that ate all their time. All QA involves cost-benefit analysis, and given a short time, it’s better to have a broad survey of many bugs than a full characterization of just one.

Within a limited scope, we saw whether a candidate’s abilities exceeded a reasonable baseline, without creating a pass/fail moment in a long process. The test’s limits were useful in themselves: nobody could fully evaluate even the toy case we presented in half an hour, so the way candidates handled their limited time showed us whether they could prioritize adequately.

Hiring for real-world skills

Many jobs have implicit FizzBuzz tests. Our design department includes, as part of its interviewing process, working the pipeline: create this, then hand it off to someone else. If a candidate can’t find that someone else in an office of forty people after they’ve already been introduced, that’s a bit of a red flag.

The important thing to remember about FizzBuzz tests is that all tests are limited in what they can tell you, so smaller and more varied tests tell you more while decreasing the odds of being misled by any one of them. Be conscious of the FizzBuzzes in the process, and if you can, build your own: a narrow, contrived test that’s a good fit with the actual duties can be cheap to make, simple to execute, and tell you a lot more than you might think if it’s treated as a data-gathering moment, rather than a gateway.

And if you get a chance to build something called “the Smashificator”, grab it.

Justin Johnson is a senior software developer at OpenRoad. He plays a lead role in our hiring process for developers and QA engineers, including building the Smashificator.