How much rigor do you use when analyzing your employee engagement survey results? The truth is, most organizations don’t dig deep enough.
Let me share an example that will help clarify what I mean. The other day my seven-year-old came to me and told me he was sick and couldn’t go to school. Now, I love my kid, but let’s just say that when it comes to school attendance, he hasn’t earned my unadulterated trust just yet. So I checked his throat and, wouldn’t you know it, it was bright red. Flaming red. So, (a) I felt bad for questioning him, and (b) I immediately went to get a thermometer. His temperature was higher than normal, but nothing I would usually worry about. Coupled with how bad his throat looked, though, it was enough to put us on the road to an urgent care facility and the doctor running some strep tests.
So, back to your engagement survey. There are three basic steps when analyzing results of any kind (we’ll focus this article on differences in employee engagement scores). First, identify a difference. Second, determine if it’s a real difference. And third, if it is real, determine how meaningful it is.
In the example, my son came to me with a difference (he felt different from when he was healthy). I could have taken these results at face value, but I’m sure you can come up with a number of reasons why that would be inadequate parenting. So my next step was to determine whether that was a real difference. A visual check and a temperature check indicated that perhaps there was a difference from his normal functioning. At this point, it seemed likely that he should stay home from school. (OK, he’s starting to earn a bit of that academic trust back.)
Likewise, most organizations stop at step one or two and make the mistake of drawing conclusions and taking actions without understanding whether their engagement results are real and meaningful.
Let’s look at each step in more detail.
Step 1: Identify Changes in Scores (“Differences”)
A typical employee engagement survey consists of participants answering questions along some sort of scale. This is your basic quantitative survey. Reporting generally falls along the lines of taking the percentage of respondents who answered a certain way, and ranking those numbers against the other questions or comparing them to past years or to industry norms. “We see a 5% increase in item X as compared to last year and we have lost 3.2% in item Y.” Does this sound familiar at all? Likely. We seldom see a company that doesn’t make this comparison.
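For reference, here is a minimal sketch of that kind of report in Python; the item names and response counts are invented for illustration.

```python
# A minimal sketch of the typical "percent favorable" report (hypothetical data).
items = {
    # item: (favorable_this_year, total_this_year, favorable_last_year, total_last_year)
    "I feel valued at work":  (412, 600, 389, 580),
    "I understand our goals": (540, 600, 551, 580),
}

for item, (fav_now, n_now, fav_then, n_then) in items.items():
    pct_now = 100 * fav_now / n_now
    pct_then = 100 * fav_then / n_then
    print(f"{item}: {pct_now:.1f}% favorable ({pct_now - pct_then:+.1f} pts vs. last year)")
```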
This method of interpreting results is intuitive; it makes sense to almost any audience. The problem here is in the fundamental nature of measuring humans. We’re not atomic particles. You can ask us a question one day and get a completely different answer than you got when you asked us last week. We’re fickle and capricious and easily swayed. And even when we try to be consistent, we’re not. So any survey of employee attitudes and perceptions is going to have some natural fluctuations one way or the other.
Flip a coin 100 times and I’ll bet you $100 that it doesn’t come up exactly 50-50. It will more likely be off balance, something like 53-47 or 45-55. Flip another 100 and you might drop from 53 heads to 45. If you then said, “Looks like we lost 8 percentage points from our first round of flips,” you’d technically be correct. But does that loss mean anything?
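If you want to see this for yourself, a quick simulation makes the point. Here’s a minimal sketch (in Python, using NumPy) that runs two rounds of 100 fair coin flips and reports the swing between them; the numbers will change every time you run it, which is exactly the point.

```python
import numpy as np

rng = np.random.default_rng()

# Two independent rounds of 100 fair coin flips (1 = heads, 0 = tails).
round_1 = rng.integers(0, 2, size=100).sum()
round_2 = rng.integers(0, 2, size=100).sum()

print(f"Round 1: {round_1} heads out of 100")
print(f"Round 2: {round_2} heads out of 100")
print(f"Apparent 'change': {round_2 - round_1:+d} percentage points -- pure chance")
```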
So when we say we “lost 3.2% in item Y from last year” not only are we being overly precise on something that shouldn’t be measured with such granularity, we’re not even sure if that’s a real difference or if it’s just random chance. This method is not robust enough to draw any conclusions yet. We need to know if the difference is real and significant.
Step 2: Test the Significance of the Difference
Fortunately, there are statistical analyses we can use to determine whether those differences are likely due to chance or to some sort of meaningful difference. Data are compared, and a score is spat out. That score corresponds to the likelihood that a difference of that size would show up by chance alone. The typical cutoff used by statisticians and social scientists when dealing with people is 5%, which for our purposes here means that if there were no real difference, we would expect to see a gap this large less than 5% of the time.
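As a concrete sketch, here is one common way to run that kind of test: a chi-square test on favorable vs. unfavorable counts from two survey years, using Python and SciPy. The counts are hypothetical, and your survey platform may use a different test under the hood.

```python
from scipy.stats import chi2_contingency

# Hypothetical counts: favorable vs. unfavorable responses for one item, two years.
this_year = [412, 188]   # 600 respondents, ~68.7% favorable
last_year = [389, 191]   # 580 respondents, ~67.1% favorable

chi2, p_value, dof, expected = chi2_contingency([this_year, last_year])

print(f"p-value: {p_value:.3f}")
if p_value < 0.05:
    print("The year-over-year change is unlikely to be random noise alone.")
else:
    print("The change is well within what random variation could produce.")
```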
This level of rigorous testing is critical for interpreting data meaningfully. We have to know if a real difference exists before we start investing in something that was never actually a problem to begin with. However, statistical testing is not without its concerns.
Statisticians are quite good at what they do. As such, they have developed tests that are incredibly sensitive in picking up differences between groups, even when the groups are quite small. In industry, we often are testing large groups of employees with relatively sensitive tests, which can lead to complications.
Let’s look at this in real life. A 2013 Business Insider poll asked which American states were the most arrogant. New York won, although followed closely by Texas. Let’s say I want to examine whether New Yorkers have a reason to be more arrogant than Texans, and I choose “average IQ” as my measure. I want to make sure I capture the effect, so I test 4,000 people from each state—a statistically representative sample of the overall population. Lo and behold, I find that New Yorkers do, in fact, have a reason to be arrogant, as they have significantly higher IQs than Texans (corroborated by a 2006 study). After all, as Babe Ruth said, “It ain’t braggin’ if you can back it up.”
A deeper dive into the data, however, reveals that New Yorkers averaged a 100.7 on the test, and Texans an even 100. That means in a 75-minute test, with hundreds of activities and tasks, the average Texan possibly defined one fewer word, or could only recite six numbers backwards instead of seven. In other words, despite the statistical test indicating there was a significant difference, that difference is meaningless. Texans redeemed.
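To see how a large sample can flag a trivial gap as “significant,” here is a small simulation in the spirit of that example. The 0.7-point gap and group sizes come from the story above; the standard deviation of 15 is a typical assumption for IQ scores, and the data themselves are simulated.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(seed=42)

# Simulated IQ scores: a 0.7-point gap in means, typical IQ spread (SD of 15).
new_york = rng.normal(loc=100.7, scale=15, size=4000)
texas = rng.normal(loc=100.0, scale=15, size=4000)

t_stat, p_value = ttest_ind(new_york, texas)
print(f"Mean gap: {new_york.mean() - texas.mean():.2f} IQ points")
print(f"p-value: {p_value:.3f}")
# With 4,000 people per group, even a gap this tiny will frequently come back
# as "statistically significant" -- which says nothing about whether it matters.
```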
Significance testing can work if the conditions are just right. But there are enough potential pitfalls that we need something a bit more informative.
Step 3: Measure the Effect Size
This is where effect size calculations come in. Effect size measures the magnitude of the difference between two groups. This simple procedure can shed a great deal of light on the true nature of the differences and allow more meaning to be drawn from the results. The interpretation of effect size numbers is very straightforward:
- A score between 0 and .2 is a trivial difference.
- A score between .2 and .5 is a moderate difference, and the point at which you want to start paying attention.
- Any score above .5 is substantial and should be either a cause for concern or a cause for celebration.
For the IQ example, the effect size was .02, i.e. it shouldn’t merit a second thought.
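If you want to compute this yourself, a common effect size for comparing two group means is Cohen’s d: the difference between the means divided by the pooled standard deviation. Here is a minimal sketch with hypothetical 1-to-5 survey scores; the thresholds listed above are then applied to the result.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
        / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical 1-5 scores for one survey item, this year vs. last year.
this_year = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4]
last_year = [4, 4, 3, 3, 4, 4, 3, 3, 4, 3]

d = cohens_d(this_year, last_year)
print(f"Cohen's d: {d:.2f}")  # compare against the .2 / .5 rules of thumb above
```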
Effect size can always be measured, and it is independent of significance: every comparison has an effect size, but a statistically significant difference does not necessarily come with a meaningful one.
Sometimes, when looking at the difference between two groups, it is readily apparent that the difference is so small as to be negligible (as in the IQ example). However, often it is not. Statistics take into account both the average of a group and also the variation of scores around that average. This can reveal insights that might go undetected.
Recently, I completed an analysis for one of our employee engagement client partners. When comparing their survey items to our national norms, several items jumped off the page as needing immediate attention. But one that was lurking in the wings was a question that asked whether they “…[understood] how [their] work contributes to the overall goals of the organization.” This particular item typically receives fairly high scores across industries, and while this organization came in lower than the industry benchmark, its scores were still relatively high. Under normal circumstances, this would have been treated as a non-factor.
However, the effect size was unusually large. Further analysis revealed that scores on this question were tightly clustered around the average (there wasn’t much variance in employee opinion). Because of that, scoring lower than the benchmark was a more meaningful difference than it would have been for a question with a larger gap but answers ranging all over the scale. In other words, it wasn’t just a few employees telling the company they were having a “below average employee experience”; it was most employees. Dealing with this issue ended up being one of this company’s highest priorities to come out of the survey, and it wouldn’t have even registered without checking the effect size.
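A small sketch makes the mechanics of that finding visible: a modest gap between tightly clustered scores produces a larger effect size than a bigger gap between widely scattered ones. The scores below are invented for illustration, and the helper is the same Cohen’s d calculation from the earlier sketch, repeated so this snippet stands on its own.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Difference in means divided by the pooled standard deviation."""
    a, b = np.asarray(group_a, dtype=float), np.asarray(group_b, dtype=float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) \
        / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Scenario A: answers cluster tightly, and the company sits only 0.3 below the benchmark.
benchmark_tight = [4.4, 4.5, 4.3, 4.4, 4.5, 4.4, 4.3, 4.4]
company_tight = [4.1, 4.0, 4.2, 4.1, 4.1, 4.0, 4.2, 4.1]

# Scenario B: answers range all over the scale, and the company sits nearly a point below.
benchmark_wide = [5, 2, 4, 1, 5, 3, 5, 2]
company_wide = [4, 1, 3, 1, 4, 2, 3, 2]

print(f"Tight scores, small gap:  d = {abs(cohens_d(company_tight, benchmark_tight)):.2f}")
print(f"Wide scores, bigger gap:  d = {abs(cohens_d(company_wide, benchmark_wide)):.2f}")
```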
The reality is that any question you ask is going to have a distribution of answers. Ignoring that distribution in favor of the simple comparisons we typically see from most surveys is not only inadequate, it’s potentially misleading. Bad data result in bad decisions. The effect size is a straightforward calculation that accounts for sample size and the spread of the distributions being compared, and it has the added bonus of making it easier to rank your priorities.
The next time you analyze results, ask yourself if you’d rather:
- Make your job harder and get poor conclusions, but at least do it the way it’s always been done, OR
- Expend the same amount of effort in calculating, less effort in interpreting, and draw more accurate conclusions.
Without answering each question in the three-step process, organizations can be distracted into focusing on the wrong issue or miss an issue altogether. Remember:
- Is there a difference?
- Is that difference real?
- Is that difference meaningful?
The end of my story is that when the doctor swabbed my son’s throat for a strep test, the redness came off on the swab. It turns out my wonderful progeny had decided he didn’t want to go to school. So at breakfast, he ate around all the red marshmallows from his cereal until he was done and then swallowed them last, hence the bright red throat. I was too impressed to be angry. And more germane to this article, had I not followed the three-step process (i.e. seeing if the apparent difference was meaningful), my conclusion would have been incorrect. Maybe we all need to quit faking it and get back to school.