Uncovering Test Secrets, Part 1

This might seem like a no-brainer, but many tests used for selection/promotion have no validity. In lay terms, the scores predict absolutely nothing! Not only do these tests fail their basic purpose, but they invite legal challenges, favor the inept, and eliminate the qualified. That’s why validation is so important. We all know personal opinions and unstructured interviews are lousy tests. Test scores (including interview scores) are supposed to accurately predict job performance.

Tell Me About Yourself

Asking someone to “Tell me about yourself” does not sound like a test question. But, what would you call asking a question, evaluating the answer, and making a decision? It makes no difference if it’s written on paper or verbal. If you make a decision based on a candidate’s answer, it’s a test. Now, how about this kind of test question:

“Give me an example of when you solved a difficult problem. Tell me what the problem was, what you did, and the result.”

Better, yes? But only if the interviewer knows that problem solving is important to the job, knows the kind of problem solving required, knows the difference between good and bad problem solving, can pry details from the candidate, and uses a standardized scoring sheet. Why get all nitpicky? Structure is the best way to make an interview question a good predictor of job performance.

You may think unstructured interviews are the bread-and-butter sandwich of recruiting, but look between the slices: usually you’ll find mold and rancid butter. Interview questions need validation just as much as formal tests.

You cannot trust any selection tool that is not valid! Consider Merriam-Webster’s definition of valid: having authority, relevant and meaningful, appropriate to the objective, supported by truth, basis for flawless reasoning, evidence, justifiable. Now, isn’t that something you want?

False Sense of Security

Once upon a time I worked for a large consulting company. It was filled with administrative assistants who could not spell and used bad grammar. After listening to client complaints, I went to the internal HR department and asked if they gave AA applicants a spelling and grammar test. “Yes,” they answered, “We use one developed by the owner.” A quick examination showed the test looked OK (i.e., was face valid) so I asked if I could see the scores from the last 100 hires. Guess what? Passing rates averaged about 95%!

High AA scores might have given HR a warm and fuzzy feeling, but it was just another case of organizational incontinence. I considered giving them a box of departmental-sized Depends, but I think they would have missed the point. Their test did not test anything! It was not valid. While worker bees labored to present a professional image to big-buck clients, incompetent AAs were misspelling words and using atrocious grammar. I’m sure the president would have responded if he had known, but HR was not going to rock the boat if their life depended on it.

Self-developed tests may seem like a good idea, but they are usually inaccurate, invalid, or poorly maintained. Bogus tests harm the organization because they give a false sense of security, while actually doing nothing to improve quality of hire. This is especially true when managers get frustrated and decide unilaterally to make up their own test. If it’s important enough to test, it is important enough to validate. Otherwise, forget it. Even high-profile assessment organizations make foolish decisions.
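
A rough way to see whether a test like the spelling test above is doing anything is to look at two numbers: the pass rate and the correlation between test scores and later job performance. The sketch below uses made-up scores and supervisor ratings purely for illustration; the cutoff of 70, the data, and the function names are all assumptions, and a real validation study involves far more than a single correlation.

```python
from math import sqrt

def pass_rate(scores, cutoff):
    """Fraction of scores at or above the cutoff."""
    return sum(s >= cutoff for s in scores) / len(scores)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical data: test scores at hire and later supervisor ratings (1-5).
test_scores = [92, 95, 88, 97, 90, 94, 96, 91, 93, 89]
job_ratings = [2, 4, 3, 2, 5, 3, 1, 4, 2, 5]

print(f"pass rate: {pass_rate(test_scores, cutoff=70):.0%}")
print(f"score/performance correlation: {pearson_r(test_scores, job_ratings):.2f}")
```

If nearly everyone passes and the correlation hovers around zero (or goes negative), the test is not separating good performers from poor ones, whatever it looks like on paper.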

From the Frying Pan Into the Fire

Math and reading are becoming problematic. I’ve heard from dozens of organizations about employees who cannot read, calculate, or write. This is an issue when becoming automated, adopting computer-driven equipment, or encountering frequent or steep learning curves. In response, unknowing people think grabbing a test off the shelf will solve their problems. I’ve even seen some who used reading tests developed for placing students in the right English class.

Testing studies show a three-bears effect. That is, human KSAs come in sizes: too little, too much, and just right for the job. For example, we know intelligent people tend to perform better than unintelligent ones; and, intelligent people tend to score higher on abstract verbal and numerical tests. But now life gets challenging …

It’s a fact of life (at the group level) that intelligence test scores cluster into different curves depending on demographics. There are plenty of theories why, but we’ll conveniently ignore them. Let’s just say we have five demographic groups: Pandas, Penguins, Puppies, Kittens, and Bunnies. Pandas score an average of 85, with 2/3 falling between 70 and 100; Penguins average 93 with 2/3 between 77 and 107; Puppies average 100 with 2/3 between 85 and 115; Kittens average 107 with 2/3 between 100 and 122; and Bunnies average 114 with 2/3 between 107 and 129.

Demographic membership does not force someone to be smart or dull. Individual Pandas can still score substantially higher than individual Bunnies, and individual Bunnies can score lower than an individual Penguin. At the group level, there will just not be as many high-scoring Pandas and Penguins as high-scoring Kittens and Bunnies. Now this next part is important!

We don’t need to be rocket scientists to know that low scores lead to mistakes, bad products, and safety violations, while high scores usually lead to boredom and turnover. Balancing demographic differences with our need for “just-right” intelligence, how do we establish and defend cutoff scores?
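
To make the cutoff problem concrete, here is a small sketch of what a “just-right” score band does to each group. It assumes each group’s scores are roughly normal with a standard deviation of 15 (an assumption suggested by the Panda and Puppy ranges above, not something stated exactly), and the band of 95 to 120 is a hypothetical pair of cutoffs chosen only for illustration.

```python
from math import erf, sqrt

def normal_cdf(x, mean, sd):
    """Probability that a normal(mean, sd) score falls at or below x."""
    return 0.5 * (1 + erf((x - mean) / (sd * sqrt(2))))

def share_in_band(mean, sd, low, high):
    """Share of a normal(mean, sd) group scoring between low and high."""
    return normal_cdf(high, mean, sd) - normal_cdf(low, mean, sd)

# Group averages from the article; the standard deviation is an assumption.
GROUP_MEANS = {"Pandas": 85, "Penguins": 93, "Puppies": 100,
               "Kittens": 107, "Bunnies": 114}
ASSUMED_SD = 15

# Hypothetical "too little" and "too much" cutoffs for some job.
LOW_CUTOFF, HIGH_CUTOFF = 95, 120

rates = {group: share_in_band(mean, ASSUMED_SD, LOW_CUTOFF, HIGH_CUTOFF)
         for group, mean in GROUP_MEANS.items()}
best = max(rates.values())

for group, rate in rates.items():
    # The ratio compares each group's rate with the highest group's rate.
    print(f"{group:9s} in band: {rate:5.1%}   ratio to highest group: {rate / best:.2f}")
```

Moving either cutoff changes every group’s share at once, which is why the cutoffs have to be tied to what the job actually requires. Regulators commonly compare group selection rates with a ratio like the one printed here; the EEOC’s Uniform Guidelines treat a ratio below four-fifths as a rule-of-thumb signal of adverse impact.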

No organization I know is forced to hire unqualified people. But the EEOC and OFCCP expect you to show there is a business need and job requirement. Oh yes, and it is incumbent on employers to give new employees “reasonable” time to learn the necessary skills. If you eliminate new employees based on something they could learn in a reasonable time, you had better be able to explain why.

A well-done validation study keeps the input funnel filled; employs only fully qualified employees; keeps training times reasonable; minimizes adverse impact at the group level; and maintains both business necessity and job requirements.

In the next part, I’ll discuss a few differences between validation and litigation.


14 Comments on “Uncovering Test Secrets, Part 1”

  1. Good stuff Dr. Williams! Throw a little reliability into the mix and you have the makings of an entry level statistics course. At the very least, you’ve begun the primer on test construction. My only hesitation is that recruiters might forget that people are still people.

    I call your attention to one of the most “studied” and familiar tests in the US – the SAT. Depending upon the study, only 10 to 20% of the variance in first-year GPA can be explained by SAT scores. Common sense says the courses being taken play a huge role in this discrepancy. In other words, unless your new hire is only doing one thing over and over again (one “course”), they are bound to excel in some areas and not in others.

    IMHO – use a valid and RELIABLE test for 50% of the decision and your gut for the other 50%. Unless lawsuits rule your world, people are still people.

  2. Based on my recollection of the research, the only thing SATs and GPAs predict is success in more school (not employment). They are not perfect (nothing is) but are often required for a school to maintain its accreditation status.

    When people tell me they like to use their gut, I suggest that EVERY decision is “gut”…In my experience, guts that are well-informed with trustworthy data make better decisions than uninformed ones.

  3. Was reading this and curious what folks think about Topgrading. I hadn’t heard much about it until yesterday. I spoke with Rosetta Stone yesterday – it has about 1,600 employees (about ½ in kiosks, but the corporate side is expanding about 25% a year). It went to “topgrading” (more info at topgrading.com) a couple of years ago. Basically what it does is instead of a job description, Rosetta develops a scorecard of what the person who gets the job needs to accomplish over the next year, and what competencies are needed to accomplish that. Then it does a panel interview to see if a candidate has the competencies to meet those accomplishments. This is followed by a “tandem interview” going through the candidate’s college and work history play by play, the clubs and committees they were on, and more. That’s it in short. (The CEO, Tom Adams, also asks questions about candidates’ favorite foods and favorite books, and why, but I have a funny feeling, Wendell, I know how you feel about that.)

  4. Dr. Williams – Although I think you realize it, I wasn’t making a relationship between SAT’s and work success. My point, I think, is in sync with yours. That is, use the right tool for the right job… even if the tool is a test. Then, use common sense (a compilation of valid and reliable data, gut feelings, past history) in making your decision.

    Thanks again for a great article.

    BTW – GPA’s have a much higher predictive value than SAT’s on success in school… yet another example of “my gut tells me” common sense!

  5. Todd – Without knowing a thing about “topgrading”, my gut tells me they are looking for evidence that the candidate can actually DO the job they are applying for. What a novel idea… and one that my 89 year old father taught me about 50 years ago. My father took one college course, decided it wasn’t for him, and ended his second career (out of 3) teaching people how to fly 747’s. His philosophy on passing or failing his 747 students (there was no in-between) was, “If my life depended on it, who would pass?”

  6. Topgrading is an interesting concept that requires more than a short reply..On the plus side it has considerable structure…this is a good thing…Otherwise, it has more than a few flaws…I’ll cover these points in an upcoming article.

  7. Dr. Williams raises an issue, lack of test validity, that I/O Psychologists and sophisticated recruitment and selection professionals are well aware of. That said, these groups also know that a valid and reliable assessment (or test) is profit generating because it lowers litigation risk (whether people want to admit it or not, there are significant risks associated with the standard, unstructured interview and untrained interviewers) and increases average new hire productivity. I think people just need to understand the risks and rewards associated with using, or not using, assessment tools to make an educated decision.

    From my perspective I think there are four reasons companies fail to validate assessments. First, companies often lack objective, quantifiable employee performance measures and therefore cannot determine the validity or reliability of any assessment. If you are reading this and are currently considering incorporating assessments into your selection process, or changing your current assessment process, I strongly urge you to define in quantifiable terms high, average, and low performance for the employee population you are thinking about before doing anything else. It is no surprise that some of the largest revenue generators for assessments are sales, customer service, and executives. These employees are measured constantly! If the big push towards “performance-driven cultures” is sustained I imagine this problem will go away or be minimal within a few years anyway.

    Second, those involved in hiring fall victim to the “Illusory Superiority” cognitive bias, or above average effect. Basically, these people think they are better than the average person at selecting employees and therefore don’t need help. I find this more prevalent in more senior leaders, particularly those who believe their level of success is largely due to their ability to convince people to work for them (which is more of a good sales skill than an assessment skill). If your hiring manager or recruiters fall into this category, the only successful way I’ve found to change their minds is to show them their level of success at predicting performance, which is really hard to do if you also lack objective, quantifiable data. I did this once with the VP of Sales for a multi-billion dollar computer software company who was reluctant to implement a work styles assessment into his organization’s selection process. I gathered information on the size of the sales organizations and annual sales for his top three competitors and showed him that each of these firms had higher average sales per sales rep than his organization did.

    Third, they lack funding and/or the ability to sell the value of paying for the validation study. To be fair, much of the blame here might fall on the assessment provider. I used to work for SHL, a leading global provider of assessment solutions, and this problem came up often. I think the lowest cost validation study I ever sold was around $25,000, and a number of them ran well into six figures. From the customer’s point of view this is a really big pill to swallow when you have no idea if the results will be positive. There are a number of ways to approach this particular problem, but let me warn you that many vendors will likely be too conservative to try these solutions. You might try getting the vendor to agree to provide you with the assessment tool in trial form for free for a short period of time. During that time use it for the target population but don’t even look at the results (or use them in any way to make hiring decisions). Once you have collected enough performance data on the target population that took the assessment you can then examine the relationship between assessment scores and actual performance. This also provides a great way to address the fourth problem discussed below. This approach might take a lot longer than typical validation studies, but it removes the financial burden and most of the risk from the decision making process, and unless the assessment is in paper format it should cost the vendor about 20-30% of a full-blown validation study. This could be a great solution if you are considering multiple assessments at the same time and have a large enough population to test portions of it with different tools at the same time. Another option could be to agree to some sort of cost sharing agreement for the validation study and in return offer a higher price and/or long-term contract to the vendor. If they are serious about their business and believe in their products they should consider something like this. Software companies do this type of thing all the time with customization projects.

    Fourth is the fact that most positions in companies are not linked to profits, revenues or costs, and without such a link it is impossible for the decision makers to put a financial value on a change in the validity of the selection process. In other words, if the current process has a validity coefficient of .25 and a vendor claims their tool can increase the validity coefficient to .3, the buyer does not understand the value of that change. The only way to solve this is to put a financial value on the output of the target population.

    If you are currently considering assessment solutions, or seeking to improve the quality of your hires, I recommend you start by defining performance, linking performance to profits, revenues, or costs, and measuring the validity of the current selection process. This will give you all the tools you need to evaluate any change in the selection process accurately and quantify that change in a financial way so you can determine if the change is worth investing in.

    Best of luck,

    CorDell Larkin

  8. With the exception of the “gut” comments, which are problematic because they are not duplicable or scalable (i.e., it’s nice if you’ve got tons of experience and expertise because your gut is just bringing your subconscious to your attention, but totally useless if you lack that expertise and experience), the Dave Pollock comments are perhaps some of the most insightful in a roundabout way. GPA vs. SAT, track record vs. test: the former wins every time, if you can be objective and specific about it. That leads to the second point: GPA is a poor indicator of job performance, especially over long periods of time and across a broad range of industries and positions. That comes back to a point that is very much appreciated, “test for what you are hiring for.” If you’re hiring for landscaping positions, then testing word processing skills is useless.

    Top-grading is an outstanding process…but it’s only as good as the user. It is good at focusing on some critical data points such as “why did you leave?” but without someone who is going to dig into those circumstances and pull out specific stories it’s not that valid. I’ve had many discussions with a huge variety of people regarding Top-Grading, including the authors of “Who?” (sort of a sequel to Top-Grading). The summation is essentially this:
    – it’s very time consuming and consequently costly
    – it’s only as good as the person using it
    – it’s designed to gather a massive amount of data points but it needs work when it comes to the interpretation of those datapoints

    We’ve found in practice, implementing it in my company, that it’s easier said than done, but also very worthwhile for the right candidates in the right position where you can justify the cost if it’s executed effectively.

    Bottom line:
    – Effective screening varies from position to position because the key considerations vary (for example, hire for attitude vs. skill is great when you’re hiring for McDonald’s and less so when you’re hiring an IT Infrastructure Architect or LEED Building Engineer)

    – Desire plays a HUGE role

    – There are well researched behavioral patterns that result in expertise in a field and you should always look for them

    – Past performance/measurable results are always the best indicator of future performance/results

    Final question: is anyone familiar with the Gallup concept of “hiring for talent” described in the book “First, Break All the Rules”? I’m curious about testing it out and would like to learn more, as well as hear whether anyone has tried it and what results they’ve gotten.

  9. I have to agree with Michael. The emphasis on data above all else in the hiring process is problematic. It is the interpretation of that data that matters. And measuring the potential of a human being is more complicated than plotting data points on a graph. Much of this grows out of a perceived need to make everything into a “science” to make it legitimate. The computer programmers who were once gods have lost their jobs to India, but that way of approaching business still comfortably resides here in the states.

    I can tell you from my experience with Rosetta Stone that top-grading is not the silver bullet. Despite the growth of that company, there is a great deal of turmoil within the rank and file, with some of the best people leaving only to be replaced by applicants with less expertise. The people who get hired in the topgrading system are the people who go to the topgrading website to prepare for the interview. Rather than getting the best person for the JOB, companies end up getting the person who has figured out how to answer questions in the topgrade system. At Rosetta Stone, they have made a conscious effort to reduce salaries and incentives in order to feed executive bonuses, and they are using topgrading as a way of justifying this course of action. Good news for their competitors, as a lot of talent has left or is preparing to leave. Going public will end up being the ruin of Rosetta Stone, like so many other companies, who see short-term profits as more important than the long-term development of their workforce. At the rate they are losing people, they will not even be a major player in 5 years.

  10. I find the negative comments about “data points” and “science” really disturbing…it’s as if accurate and complete data about a candidate’s skills were less important than (or threaten to eliminate) personal opinion…

    Who among the critics is willing to argue the benefits of knowing as little as possible about a candidate before making a gut-decision about hiring or promotion? Who is willing to argue that testing (yes, even interviews are tests) need not be proven effective BEFORE using it?

    The response that validated tests will somehow minimize the effectiveness of the decision-making process is as problematic as a physician who does not believe in using diagnostic tests.

  11. Dr. Williams, I don’t think that’s the statement at all. Obviously, there can be no interpretation without data, and all interpretation has to come from valid data; however, that data itself is not the answer. This is one of the reasons personality tests so often fail within small organizations: they simply lack a baseline of top performance to compare the scores to, so they think that because someone is a high D or I or C (or whatever, depending on the type of test), this means something it does not. Generalizations are rarely a good idea in making hiring decisions; people are not statistics.

    More data is also not necessarily better; you can waste a ton of time just gathering meaningless data, which becomes a problem with most unstructured interviews. I’ve chatted with people on many occasions who think they are hot shot interviewers and say they’ll ask one question (frequently something stupid like “what do you do in your spare time”) and think they can tell everything they need from the answer. The problem is there is absolutely no relationship between the answer and the candidate’s performance.

    We need to look for two things: accuracy and relevance, ideally with as little data as possible in order to maximize efficiency. Hence my comments about critical data points. It turns out that the length of time someone has stayed at their last 5-10 jobs is a good indicator of what you can expect in their next job, because people are creatures of habit, and why they left or the circumstances surrounding leaving is very telling. What they wore, how often they skipped lunch, how many years of experience they have, etc. are not. We then need to focus on interpretation of the data. A common example: they were fired. Is that a good thing or a bad thing? Conventional wisdom states it’s a red flag, but Henry Ford II fired Iacocca and he went on to lead one of the most spectacular turnarounds in corporate history.

    It gets more complicated when you start to consider that an individual could be horrible in one organization and outstanding in another simply based on structure and culture. (A friend of mine was a disaster working for a real estate firm, then went to Canon and became the region’s top salesperson in less than a year. If you’d looked at his record at the previous place of employment you’d have written him off. I can give you hundreds of examples like that.) Combine this with the need of organizations in some cases to hire individuals who already have the expertise and in other cases to find individuals they can mentor and groom for success. You can see the same among students: some of the dropouts are by far the most effective in the working world, and some top students will do poorly with a teacher they dislike (or who dislikes them).

    I’m not a fan of gut or opinion. I believe at the end of the day you can break it down to externally verifiable, measurable evidence; however, I think this component of interpretation is by far more difficult than gathering the data itself.

    There is one final concern about getting overly “standardized”, and it’s one I hesitate to bring up, but it is the issue of incremental rather than exponential increases in performance. The reality is if you always do what you’ve always done you’ll get the same results you’ve always gotten. By taking your top performers, using them as a baseline and refusing to deviate from their type, you are dramatically decreasing the likelihood of adopting some new practice that will provide a dramatic jump in results. I always have mixed feelings around this point because there are dangers associated with such deviation, so they have to be calculated.

    Final note, there is no replacement for passion and drive within your people.

  12. I give up…my comments keep being interpreted in the wrong way…You keep doing your own thing, but as someone who has been head of two training departments, a hiring manager, a user of professional recruiting services, a senior line manager, and a senior consultant for a large HR consulting company, who went back to grad school three times to study selection and has watched these systems work in some of the largest organizations in the world, I think I know a little about the effectiveness of my recommendations.

  13. Apparently it’s a two way street because I don’t think anyone’s disagreeing with you, just hitting different angles.
