The Vanishing Findings: What Happens When You Apply Real Statistics
22 CEO traits narrowed to 9 with raw significance, 4 after FDR correction, and just 2 that survived era-robustness checks. Medical-grade rigor applied to CEO research.
Verata Research
2025-04-16

The Finding
We began with 22 testable CEO background traits. After standard logistic regression with controls, 9 showed raw statistical significance at p<0.05. After false discovery rate (FDR) correction for multiple comparisons, 4 survived. After era-robustness checks -- testing whether the effect held across different vintage-year windows -- just 2 remained. 22 became 9 became 4 became 2.
The two survivors: general management background (OR 1.15, 95% CI 1.06-1.25) and years of experience (OR 1.07 per decade, 95% CI 1.02-1.13). Every other trait -- including those that dominate PE search mandates and populate the covers of business magazines -- vanished under the application of standard statistical methodology.
This is not a novel statistical technique. FDR correction has been standard practice in medical research, genomics, and the social sciences for decades. It addresses a simple mathematical fact: when you test 22 hypotheses simultaneously, you expect roughly one to appear significant by chance alone at the p<0.05 level. Without correction, you are guaranteed to find 'results' that are artifacts of multiple testing. The CEO selection literature has historically not applied this correction. Our study does.
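The arithmetic behind this is easy to verify with a short simulation (a sketch for illustration, not part of the study's pipeline): draw p-values for 22 tests in which every null hypothesis is true — under the null, p-values are uniform on [0, 1] — and count how many clear the p<0.05 bar by chance.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sims, n_traits, alpha = 10_000, 22, 0.05

# Simulate many studies of 22 traits where NO trait has a real effect:
# under the null hypothesis, p-values are uniform on [0, 1].
p = rng.uniform(size=(n_sims, n_traits))

false_positives = (p < alpha).sum(axis=1)
print(false_positives.mean())         # ~1.1 "significant" traits per study
print((false_positives >= 1).mean())  # ~0.68 chance of at least one
```

With 22 uncorrected tests, the typical study finds about one spurious "trait," and roughly two in three such studies find at least one.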
Why This Matters
Every few years, someone publishes a study claiming to have identified the traits of successful CEOs. The findings receive coverage in the business press, inform executive search practices, and shape the mental models of investors and boards. Then, a few years later, the next study fails to replicate those findings. Different traits emerge; the old ones disappear. Nobody remarks on the inconsistency, because each study is treated as an independent event rather than a test of a persistent claim.
This pattern -- striking initial findings that fail to replicate -- is the hallmark of the multiple comparisons problem. When researchers test many variables and report only the significant ones, the published record accumulates false positives. Each individual study looks compelling. The aggregate record is noise.
The drug trial standard illustrates the gap. A pharmaceutical company testing a new compound must pre-register its hypotheses, correct for every comparison it runs, and demonstrate replication across independent samples before the FDA will approve the drug. The CEO selection industry's most consequential talent decision is based on criteria that would not pass the lowest bar of evidence-based medicine. We do not hold CEO research to this standard. The vanishing findings demonstrate what happens when we do.
What the Data Shows
The cascade from 22 to 2 is worth examining in detail, because it illustrates how each layer of statistical rigor eliminates findings that initially appeared robust.
Stage 1: Raw regression (22 traits tested, 9 significant). At this stage, operations background, MBA, prior CEO title, and several other traits show p<0.05 in logistic models controlling for sector, deal size, and vintage. This is the stage at which most prior CEO studies stop. It is also the stage most susceptible to false positives.
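A Stage 1 model of this kind can be sketched in pure NumPy (a minimal illustration, not the study's actual specification: the trait and control variables below are simulated, and the real models control for sector, deal size, and vintage):

```python
import numpy as np

def logit_fit(X, y, iters=25):
    """Fit logistic regression by Newton's method; returns coefficients,
    intercept first. exp(coef) is the odds ratio for that column."""
    X = np.column_stack([np.ones(len(X)), X])      # prepend intercept
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))        # predicted success prob
        grad = X.T @ (y - p)                       # score vector
        H = X.T @ (X * (p * (1 - p))[:, None])     # observed information
        beta += np.linalg.solve(H, grad)           # Newton update
    return beta

# Simulated data: one binary trait plus one continuous control.
rng = np.random.default_rng(0)
n = 20_000
trait = rng.integers(0, 2, n)
control = rng.standard_normal(n)
log_odds = -0.2 + 0.3 * trait + 0.1 * control      # true model
y = rng.random(n) < 1 / (1 + np.exp(-log_odds))
beta = logit_fit(np.column_stack([trait, control]), y)
print(np.exp(beta[1]))  # estimated trait odds ratio; true value exp(0.3) ~ 1.35
```

Running 22 such models — one per trait — and keeping everything with p<0.05 is exactly the procedure the simulation above shows to be unreliable without correction.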
Stage 2: FDR correction (9 to 4). The Benjamini-Hochberg procedure adjusts p-values for the number of simultaneous tests. MBA drops out (adjusted p=0.06). Several functional background traits drop out. Four traits remain: general management, years of experience, finance background, and one interaction term.
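The Benjamini-Hochberg step is simple enough to sketch directly (the p-values below are illustrative placeholders, not the study's actual values): sort the m p-values, find the largest rank k with p_(k) <= (k/m)·q, and reject the k smallest.

```python
import numpy as np

def benjamini_hochberg(pvals, q=0.05):
    """Boolean mask of hypotheses rejected at FDR level q (BH step-up)."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    # Largest rank k whose sorted p-value clears its BH threshold (k/m)*q;
    # reject the k smallest p-values.
    below = p[order] <= (np.arange(1, m + 1) / m) * q
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()
        reject[order[: k + 1]] = True
    return reject

# 22 hypothetical p-values: a few strong effects plus a spread of nulls.
pvals = [0.0001, 0.0004, 0.001, 0.003,
         0.02, 0.06, 0.08, 0.11, 0.15, 0.19, 0.24, 0.29, 0.33,
         0.38, 0.45, 0.52, 0.58, 0.64, 0.71, 0.79, 0.86, 0.93]
print(benjamini_hochberg(pvals).sum())  # 4 of 22 survive
```

Note that the trait with raw p=0.02 — "significant" by the usual standard — does not survive, because with 22 simultaneous tests its BH threshold is far below 0.05.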
Stage 3: Era-robustness checks (4 to 2). We split the dataset by vintage-year windows (2000-2008, 2009-2018) and required the effect to be consistent across both eras. Finance background, which showed significance in the early era, lost its effect in the later era -- suggesting a cohort effect rather than a stable predictor. The interaction term similarly failed to replicate across eras.
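A minimal version of such a check can be sketched with 2x2-table odds ratios and Wald intervals rather than the full adjusted models (the function names, split year default, and consistency criterion here are illustrative assumptions):

```python
import numpy as np

def odds_ratio_ci(trait, success, z=1.96):
    """Odds ratio of success given trait from a 2x2 table, with a 95% Wald CI."""
    trait, success = np.asarray(trait, bool), np.asarray(success, bool)
    a = np.sum(trait & success)      # trait present, successful exit
    b = np.sum(trait & ~success)
    c = np.sum(~trait & success)
    d = np.sum(~trait & ~success)
    log_or = np.log((a * d) / (b * c))
    se = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of the log odds ratio
    return np.exp(log_or), np.exp(log_or - z * se), np.exp(log_or + z * se)

def era_robust(trait, success, vintage, split_year=2009):
    """Pass only if the CI excludes 1.0 (on the positive side) in BOTH eras."""
    trait, success = np.asarray(trait, bool), np.asarray(success, bool)
    early = np.asarray(vintage) < split_year
    return all(odds_ratio_ci(trait[e], success[e])[1] > 1.0
               for e in (early, ~early))
```

A cohort effect like the one described for finance background — strong in 2000-2008, absent in 2009-2018 — fails this check even though it looks significant in the pooled data.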
- Survivors: General management (OR 1.15), Years of experience (OR 1.07 per decade)
- Eliminated at FDR: MBA, prior CEO, MBB consulting
- Eliminated at era-robustness: Finance background, interaction terms
- Never significant: Operations, tech, FAANG, sales, sector-specific matches
The prior widely-cited finding that operations background predicts CEO success was, on closer examination, an artifact of tag contamination in functional classification. When properly coded, the effect vanishes entirely.
The Counterargument
One reasonable objection is that FDR correction is too conservative -- that it discards real effects along with false ones. This is a known tradeoff in statistics, and it is worth addressing directly.
FDR correction controls the expected proportion of false discoveries among rejected hypotheses. It is, in fact, less conservative than the Bonferroni correction that is standard in many fields. If anything, our approach is generous: a stricter correction would eliminate even more findings. The traits that survived our pipeline cleared a bar that is lower than what most clinical research requires.
Another objection: perhaps the effects are real but small, and our sample size is insufficient to detect them. This is testable. With 12,174 observations, we have statistical power to detect odds ratios as small as 1.10 for traits with 30%+ prevalence. The traits that failed significance did not fail because our sample was too small. They failed because the effect sizes are genuinely near zero. The confidence intervals tell this story clearly: they are not wide intervals that happen to cross the null. They are tightly centered on 1.0.
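A power claim of this kind can be sanity-checked by simulation (a rough sketch under stated assumptions: a 50% base success rate, which the article does not specify, and an unadjusted two-sided Wald test on a 2x2 table rather than the full controlled model):

```python
import numpy as np
from math import log, sqrt

rng = np.random.default_rng(1)
n, prevalence, base_rate, target_or = 12_174, 0.30, 0.50, 1.10

# Success probability for trait holders implied by the target odds ratio.
odds1 = target_or * base_rate / (1 - base_rate)
p1 = odds1 / (1 + odds1)

n_trait = int(n * prevalence)
n_rest = n - n_trait

n_sims, hits = 2000, 0
for _ in range(n_sims):
    a = rng.binomial(n_trait, p1)          # trait holders who succeed
    c = rng.binomial(n_rest, base_rate)    # non-holders who succeed
    b, d = n_trait - a, n_rest - c
    se = sqrt(1/a + 1/b + 1/c + 1/d)       # SE of the log odds ratio
    if abs(log(a * d / (b * c))) / se > 1.96:
        hits += 1

power = hits / n_sims
print(power)  # roughly two-thirds under these assumptions
```

Under these simplifying assumptions an OR of 1.10 is detected in roughly two out of three simulated studies, so a true effect of that size would be unlikely to produce the tightly null-centered intervals observed.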
The findings do not vanish because the statistical method is too strict. They vanish because they were never real. They were artifacts of a research tradition that did not correct for the most basic source of false positives in observational data.
What This Means for Your Firm
The vanishing findings have a direct operational implication: most of what you have been told about what makes a successful PE CEO is statistically unsupported. This is not a claim about any specific study or consultant. It is a mathematical consequence of how the research has been conducted.
For your firm, the action items are specific and implementable.
- Apply skepticism to CEO trait claims. When a search firm, consultant, or internal team asserts that a given trait predicts CEO success, ask for the evidence. Specifically, ask whether the finding survived correction for multiple comparisons and whether it replicated across independent samples. If the answer is no -- or if the question has never been asked -- the claim is unvalidated.
- Demand replication, not novelty. The valuable finding is not the new trait that emerges from the latest study. It is the trait that has replicated across multiple studies with proper correction. In our data, only two traits clear this bar.
- Recognize the asymmetry. Your competitors are making CEO selection decisions based on the full, uncorrected set of 'findings.' If you make decisions based on only the validated subset, you are operating with a cleaner signal. This is a structural advantage.
- Build internal evidence. Track your own hires against your own outcomes. Over time, you will accumulate a proprietary dataset that lets you distinguish signal from noise within your specific portfolio context.
The drug trial analogy is not hyperbole. You are making a decision that will affect hundreds of millions of dollars in enterprise value. The question is whether you want to base that decision on evidence that would survive FDA review or evidence that would not.
Get the Full Research Report
This insight is from “From Pedigree to Performance” — the complete analysis of 12,174 CEO appointments. Download the full report with methodology, statistical tables, and recommendations.
Related Insights
The Flatline: Every CEO Trait's Effect on Exit Outcomes
A forest plot of every testable CEO trait shows nearly every confidence interval crosses the 1.0 'no effect' line. The resume doesn't differentiate outcomes.
Less Than 1% of PE Exit Outcomes Are Explained by CEO Background
The flagship finding: all observable CEO career traits combined explain less than 1% of PE exit variance across 12,174 appointments.
How Bad Is 0.22? Putting PE CEO Selection Accuracy in Context
The best ML model achieves d-prime of 0.22 and AUC of 0.562 -- ranking the successful CEO higher only 56.2% of the time vs 50% for a coin flip.
Ready to Move Beyond Resume-Based Selection?
See how Verata helps PE firms make better executive hiring decisions with relationship intelligence.