How Valid Is This Test?

No business wants to spend time and money on a measurement method that does not work. This is why most businesses know to ask this basic question: “How valid is this method or test?” The challenge only begins here, though, because you then need to be able to understand and evaluate the answer. To help you, try following these seven tactics.

(Excerpted with permission from the publisher, Wiley, from Talent Intelligence: What You Need to Know to Identify and Measure Talent by Nik Kinley and Shlomo Ben-Hur. Copyright © 2013.)

Ask for Evidence. We were recently looking at the validity of a popular U.S. interviewing system that described itself as being accurate and valid. On a Web page entitled “Validity,” the vendor described a wide variety of research showing that interviews can be valid predictors of success. Yet there was not a single mention of any research that the vendor had conducted into the validity of its own system. So rule No. 1 is that you need to get specific and ask vendors for the evidence that their particular method or tool is valid. And beware of statements such as, “The test is predictive,” that do not come with any specific validity figures or evidence.

Ask What Is Meant by Validity. Validity figures are not always what they appear to be. For starters, there is no one way for vendors to measure or report validity. When you are told that a measurement method has 80 percent validity, it could mean many different things. Classically, validity refers to whether the ratings and scores that people achieve on particular measures can predict their performance in a business. And by and large, this is what you should expect to hear. Yet we have seen some vendors define validity as whether individuals agree with the results. So when a vendor tells you that a particular measure is valid, you need to ask, “In what way?”

In response to this question, you may sometimes hear phrases such as “content validity,” “criterion validity,” and “construct validity.” For many people, though, this kind of technical jargon can be confusing and can put them off from delving more deeply into the subject. But it need not do so. All you need to remember is that you are essentially trying to find out two things: “How do you know that the method or tool measures what it is supposed to?” and, “What business outcomes do results with this method predict, and to what degree?”

It is worth noting here that “performance” can mean different things. It can mean actual results (such as sales figures), managers’ appraisal ratings of individuals, and even self-ratings of performance. Beyond task performance, it can mean contribution to team performance or organizational citizenship behavior.

Furthermore, just because a measure can predict performance in skilled and semiskilled workers does not mean that it can also predict performance in managers. There are additional questions that you need to ask when told that a measure can predict performance: “What types of performance?” and, “In what types of people?” Moreover, with measures of potential, extra questions to ask are, “How far ahead can it predict performance?” and, “After how long?”

Beware of Very High Validity Figures. When judging how well a method or tool can predict outcomes, keep in mind that even the single best predictor of performance, intelligence, achieves maximum validities of only 0.5 to 0.6. If you hear anything more than that, start asking questions.
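To see what a figure such as 0.5 means in practice, it helps to assume (as is the common convention, although vendors do not always say so) that validity is reported as a correlation between test scores and a performance measure. The Python sketch below is illustrative only: the function names are invented for this example and are not taken from any vendor's method.

```python
import math


def pearson_r(test_scores, performance):
    """Pearson correlation between test scores and a performance measure.

    Assumption: the "validity" being quoted is this kind of correlation.
    """
    if len(test_scores) != len(performance) or len(test_scores) < 2:
        raise ValueError("need two equally long lists with at least 2 values")
    n = len(test_scores)
    mean_x = sum(test_scores) / n
    mean_y = sum(performance) / n
    # Sum of cross-products and the two sums of squared deviations.
    cross = sum((x - mean_x) * (y - mean_y)
                for x, y in zip(test_scores, performance))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in test_scores))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in performance))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return cross / (spread_x * spread_y)


def variance_explained(r):
    """Share of variation in performance accounted for by the test."""
    return r ** 2
```

Squaring the coefficient gives the share of variation in performance that the test accounts for, so a validity of 0.5 corresponds to 25 percent and 0.6 to 36 percent. Seen this way, a claimed validity of 0.8 or 0.9 would mean a single test explains most of what drives performance, which is why such figures deserve questions.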

Check How Many People the Tool Has Been Validated With. One essential question to ask is, “How many people?” For instance, if you are told that a measure can predict, say, absenteeism in semiskilled workers, you need to ask how many people were tested. If the answer comes back with anything fewer than 100, then the results may not be reliable. For psychometric tests, ideally you should be looking for two thousand or more people to have been tested.
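One way to see why the number of people matters is to look at how uncertain a validity figure is at different sample sizes. The sketch below assumes the validity is a correlation and uses the standard Fisher z approximation for a 95 percent confidence interval. This is our illustration of the point, not a procedure described by any vendor, and the function name is hypothetical.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for a correlation r based on n people.

    Uses the Fisher z transformation; z_crit=1.96 gives a 95% interval.
    """
    if n <= 3:
        raise ValueError("need more than 3 people")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                # transform r to the z scale
    se = 1 / math.sqrt(n - 3)        # standard error on the z scale
    low = math.tanh(z - z_crit * se)   # transform back to the r scale
    high = math.tanh(z + z_crit * se)
    return low, high
```

For a reported validity of 0.3, `correlation_ci(0.3, 50)` gives an interval running from roughly 0.02 to 0.53, so the true figure could be almost anything. With 2,000 people, `correlation_ci(0.3, 2000)` narrows it to roughly 0.26 to 0.34.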

If the Method or Tool Uses Norm Groups, Check the Quality and Relevance of Them. Not all methods and tools use norm groups, but some rely on them. Norm groups are comparison groups, a kind of benchmark. They enable you to compare the score of a particular individual on a certain test or measurement method with the scores of other people who have also done the test. This is particularly useful with ability tests, such as measures of intelligence and physical fitness, as it can help you understand what scores mean. For example, an individual may get a score of 25 out of 30 on an intelligence test, which sounds good. But if you then find out that the average score is 27, that score of 25 does not look so good after all. We need to know how well others usually perform to understand precisely how good a score is.
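The comparison a norm group makes possible can be sketched in a few lines of Python. The example below computes a percentile rank, one common way of placing an individual's score against a norm group. The function name and the sample scores in the usage note are made up for illustration.

```python
def percentile_rank(score, norm_scores):
    """Percentage of the norm group scoring below `score`.

    People with exactly the same score count as half below, a common
    convention for percentile ranks.
    """
    if not norm_scores:
        raise ValueError("norm group is empty")
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(norm_scores)
```

With a hypothetical norm group of `[24, 25, 27, 28, 29]`, a score of 25 out of 30 lands at the 30th percentile: good in absolute terms, but below most of the comparison group.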

As useful as norm groups may sound, the science of developing them and where they should and should not be used are much-debated issues. If you are going to use norm groups, then they should be good ones: if they are not, they may be misleading.

So what counts as a “good” norm group? You need to look for two qualities. The first is size: the number of people in the group. Simply put, the bigger, the better. With competency ratings from individual psychological assessments, the norm group may be very small (under 100). For psychometrics, however, it will ideally be in the thousands.

The second quality you should look for is relevance. Having a norm group of two thousand white males from Scandinavia is impressive, but if you are trying to interpret the scores of Singaporean women, it is of no use. To be effective, then, a norm group needs to be representative of the people you are assessing. This can be in terms of gender, age, ethnicity, and education level. It can also be in terms of industry, function, and type of role. The more relevant, the better. For job applicants being tested with an intelligence test, for example, the best norm group is not the scores of people already employed, but other applicants for the same type of roles.

One quick way to evaluate the quality of a norm group you are already using is to look at how many of the people you are assessing score above the average for the norm group. If the norm group is perfect, then 50 percent of your people will score above the norm average and 50 percent will score below it. If almost everyone is scoring above or below the norm average, then you know that the norm group may not be relevant enough.
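The quick check just described can be sketched as follows. This is a minimal illustration, assuming you have the scores of the people you assessed and the norm group's published average. The function name is hypothetical.

```python
def share_above_norm_mean(scores, norm_mean):
    """Fraction of assessed people who score above the norm group average."""
    if not scores:
        raise ValueError("no scores to check")
    above = sum(1 for s in scores if s > norm_mean)
    return above / len(scores)
```

A result near 0.5 is what a well-matched norm group would produce. For example, `share_above_norm_mean([25, 28, 29, 30], 27)` returns 0.75, and a value close to 0 or 1 suggests the norm group may differ from the people being assessed. It is only a rough signal, though: a group that has been preselected on a related quality can legitimately sit above or below the norm average.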

Moreover, for larger organizations it may be worthwhile trying to create your own norm groups specific to your business. The absolute minimum you need for competency and individual psychological assessment ratings is around 50 people. This is low, though, and you would need to be a little cautious about comparisons. For psychometrics, the minimum is around 150 people, although once again this is low. A number you could be completely confident in would be around 2,000, so our suggestions are absolute minimums. Some vendors will try to charge you for creating a specific norm group for your business. Others do not charge. Obviously, we recommend the latter.

Remember Reliability. For relatively objective methods such as psychometric tests and SJTs, you do not need to ask about reliability. A test cannot be valid without also being reliable, so asking about validity is enough. However, for more subjective methods such as assessment centers and individual psychological assessment, it is important to ask about inter-rater reliability. This is the degree to which two assessors agree (or disagree) in their ratings and judgments about people. The less reliability and agreement there is between assessors, the less likely results are to be accurate.
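Agreement between two assessors can be put into a number. The sketch below uses Cohen's kappa, one widely used statistic for two raters assigning categories, which adjusts raw agreement for the agreement expected by chance. It is our choice for illustration, not a method named in the text, and the rating labels in the usage note are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two assessors' category ratings."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equally long, non-empty lists of ratings")
    n = len(rater_a)
    # Proportion of people on whom the two assessors gave the same rating.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from how often each rater uses each category.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1:
        raise ValueError("kappa is undefined when both raters use one category")
    return (observed - expected) / (1 - expected)
```

A kappa of 1 means the assessors agree on everyone, and a value near 0 means they agree no more often than chance would produce. For instance, two assessors rating four candidates as `["hi", "hi", "lo", "lo"]` and `["hi", "lo", "hi", "lo"]` agree on half of them, yet kappa is 0 because that is exactly the chance level.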

Look for Independent Reviews. This final step is an important one: always look for independent evidence of whether measures work. An easy place to start here is to ask the vendor if any such research exists. You can also do a Web search for the name of the tool. Moreover, with psychometric tests, probably the best thing you can do is to check one of the independent, nonprofit bodies that publish test reviews. The national psychology associations or societies of many countries provide this kind of service. By far our favorite is provided by the University of Nebraska’s Buros Institute. Its reviews can contain some deeply technical information, but they also contain some clear and no-nonsense recommendations on whether to use tests.

These, of course, are just questions about validity. However, businesses need to think more broadly about the issue of whether measures work. We have discussed, for example, the need to ask about incremental validity. Yet businesses also need to think about what measures need to do over and above merely predicting performance. This could include things like helping managers engage potential new employees, identifying areas new employees may need support with, and helping plan for individuals’ development. Validity, then, is not the be-all and end-all, and the most valid test is sometimes not the one that will work best for your business. Nevertheless, it is a good place to start: a test that is not valid will not be able to do much for your business.

Nik Kinley is a London-based independent consultant who has specialized in talent measurement and behavior change for more than 20 years. He was the global head of assessment for the BP Group, head of learning for Barclays GRBF, and a senior consultant with YSC, the leading European assessment firm. Shlomo Ben-Hur is an organizational psychologist and professor of leadership and organizational behavior at the IMD business school in Switzerland. He has more than 20 years' experience in senior executive positions, including vice president of leadership development and learning for the BP Group and chief learning officer for DaimlerChrysler Services.


8 Comments on “How Valid Is This Test?”

  1. Generally good advice.

    I presume “SJT” stands for “subjective judgment test”.

    The suggestion: “One quick way to evaluate the quality of a norm group …” can mislead. 50/50 (above/below) will only hold true (statistically) if choice of assessment takers introduced no bias relative to the norm group. If, for example, the norm group is a large random sample of working adults and the job requires a BS in Engineering, it would be reasonable to expect applicants who earned the required degree to exhibit an upward bias in measurements of numerical ability and numeric reasoning. That does not condemn the norm group; instead, it allows the employer to see where, on these two constructs, those engineering job applicants, who were selected for assessment, score relative to the population of working adults.

    Another important point to make is that validity addresses the extent to which an instrument measures what it purports to measure. A tape measure provides high-validity measurements of short distances. Employers must also be able to demonstrate the job-relatedness of their measures – the candidate’s height might be very important for an NBA center, but probably not for a loan officer. Employers should conduct Job Analyses (one for each job) to establish the job-relatedness of the measures they use for making selection decisions.

  2. Great summary. I concur with these points and have discussed many of them over the years. It is great to see more interest and information around the appropriate ways to measure and evaluate talent. Can’t wait to order this book.

  3. @ Nik and Shlomo: Thank you. The chief validator for many assessment tests is “how well does it sell”?


  4. First, thanks all for the kind comments. Keith: I couldn’t agree more – far too often the chief validator for vendors is how well a test sells, while for purchasing firms it can be just how well known a test or vendor is (what Peter Saville calls ‘faith validity’). Richard: I certainly agree with your caveat regarding the quality of the norm group. I guess we were assuming size and relevance and were just trying to provide a quick and dirty method of gauging norm group quality beyond these two basics. And of course, your point about job relevance is spot on.
    If you liked this post we do have more: we provide weekly commentary on the latest research and some practical how-to guides at

  5. Richard – Apologies I forgot to mention: SJT stands for Situational Judgement Test. Sometimes called “low fidelity simulations,” SJTs entail presenting people with realistic work scenarios and then asking questions about them. The scenarios can be presented as written descriptions, videos or animations and answers are selected from a multiple choice list. Some situational judgment tests can adapt to the test taker, too, in that responses to one scenario determine which situation is presented next.

  6. Thanks for your excellent summary. I’m still wondering why the scientific practitioner gap is still that wide. The weirdest tests are still used all over the world. Expertise like yours must be spread in easy words – I hope your book will soon be translated into German 🙂

  7. @ Nik: Thank you.
    @ Sara: “why the scientific practitioner gap is still that wide.”
    Because salesmanship and marketing trump science any day.

