When a Good Trial Estimate Fits the Wrong Women: Auditing Eligibility, Attrition, and Missing Data in Women’s Health

Based in Western Europe, I'm a tech enthusiast with a track record of successfully leading digital projects for both local and global companies.

A randomized trial can be well run and still give you a “good” estimate for the wrong group of women. That mismatch is the external-validity gap Rothwell highlighted in The Lancet: internal validity can be strong while applicability quietly fails (Rothwell, 2005). This is a methodological audit problem, not a “science is bad” problem: before you get pulled into p-values and subgroup plots, ask for whom the effect was estimated, compared to what (comparator and care context), and what’s missing (who never made it into analysis). Often the fastest audit tool is hiding in plain sight: the CONSORT flow diagram, showing who was screened, excluded, randomized, lost, and analyzed—and whether “who counted” shifted over time (CONSORT, 2010).

Women’s health is especially vulnerable to this because routine design choices can filter out hormonally complex, clinically common life stages: pregnancy and lactation exclusions, contraception mandates, “regular cycles only” criteria, and age windows that skip postpartum or perimenopause (STRAW+10, 2012). These choices are often justified as safety or endpoint “cleanliness,” but they also define the treatment-effect question itself, the estimand, by determining which intercurrent events (like pregnancy, discontinuation, rescue meds) are allowed and how data after them are handled (ICH E9(R1), 2019). Even among eligible participants, outcome data often go missing for structural reasons—visit frequency, childcare, travel time, language barriers—not simply “noncompliance.” Regulators flag these burdens as predictable threats to representative participation (FDA Diversity Guidance, 2020).

This article gives you a research-literate, clinic-usable way to spot when a trial’s estimate may be accurate yet misapplied. You’ll learn to separate two filters that often remove the same women—selection (who gets in) versus missing outcomes (who stays measurable)—and to interpret missingness using the practical map of MCAR, MAR, and MNAR (NRC, 2010). We’ll also cover credibility checks: what to look for in CONSORT attrition reporting (CONSORT, 2010), how to read design trade-offs through PRECIS-2 (PRECIS-2, 2015), why complete-case and LOCF analyses can mislead when dropout is informative (NRC, 2010; Little et al., 2012), and what stronger approaches and sensitivity analyses look like when MNAR is plausible (NRC, 2010; ICH E9(R1), 2019).

The goal isn’t to make you distrust research. It’s to help you use it precisely. If you’ve ever left an appointment feeling like your symptoms were waved off—or scrolled past five conflicting “expert” takes—this is how you sanity-check what the evidence actually applies to. Working in research methods (and growing up around NHS midwifery), I’ve learned that “good trials” can still miss the women actually in front of us. By the end, you’ll have a short bias-audit workflow—eligibility funnel, follow-up integrity, missingness suspicion—plus phrasing you can bring to appointments when the headline result doesn’t seem to match real-world women’s bodies or real-world constraints. Accurate numbers can still mislead if they describe the wrong women, or only the subset who could stay measured.

When a “Good” Estimate Describes the Wrong Women

Why women’s health is especially vulnerable to selection and missingness

In women’s health, routine design choices can narrow “who counts” in predictable ways. Many trials exclude people who are pregnant or lactating, require contraception or frequent pregnancy testing, restrict enrollment to “regular cycles,” or use age windows that skip postpartum or perimenopause. These choices are often framed as safety or endpoint “cleanliness,” but they systematically remove hormonally complex life stages that are central to real clinical care. ICH E9(R1) clarifies the underlying issue: eligibility rules and protocol responses to events like pregnancy define the estimand, the precise treatment-effect question, by determining what intercurrent events are allowed and how data after them are handled (ICH E9(R1), 2019). STRAW+10 reinforces why crude age bands miss meaningful variability: perimenopause is staged and heterogeneous, not a simple before/after state (STRAW+10, Menopause, 2012).

Even when people are eligible and enroll, outcomes can go missing for reasons that have little to do with “motivation”: visit frequency, travel time, childcare, shift work, language barriers, and repeated monitoring burdens. FDA diversity guidance flags these burdens as predictable threats to representative participation because they are unevenly distributed across socioeconomic groups (FDA Diversity Guidance, 2020). If you’re reading UK-based research or NHS-facing guidance, look for the same principle: whether recruitment and follow-up burdens systematically exclude people by language, caregiving demands, or travel constraints. PRECIS-2 makes the design trade-off easier to see: tighter Eligibility and more intensive Follow-up push trials toward controlled conditions that tend to lose participants who cannot sustain the schedule (PRECIS-2, 2015). Retention strategies like reducing visit burden and making follow-up easier to complete (for example, more flexible scheduling or simpler data collection) have evidence behind them, so attrition is not just “participant behavior.” It is partly a design choice with fixable levers.

The bias point is simple: routine exclusions and routine dropout can move an estimate away from the truth for the target population. Differential missingness can even change interpretation when missing outcomes relate to benefit or harm. The National Research Council stresses that missing data can materially alter conclusions and should be tested with sensitivity analyses rather than brushed aside (NRC, 2010). Meta-epidemiologic evidence also points in the same direction: trials with risk-of-bias features, including incomplete outcome data, tend to show more favorable effects on average (Savović et al., Ann Intern Med, 2012). Practically, it helps to separate two filters that are often blended: who got in versus who stayed measurable.

Selection bias vs missing outcomes: two filters that often remove the same women

Selection: the “easier-to-study” sample

Selection bias is a systematic mismatch between who ends up in the study and the population a reader cares about, driven by eligibility criteria, recruitment channels, and requirements like “able to comply” or “able to attend frequent visits” that preferentially enroll easier-to-study participants (Rothwell, 2005). The “healthy volunteer” pattern is well documented: volunteer cohorts can be healthier and less deprived than the general population (Fry et al., 2017). But even a well-recruited sample can become distorted after randomization if outcomes go missing.

Missing outcome data: when “no measurement” isn’t neutral

Missing outcome data means endpoint measurements are not observed. It becomes informative when the chance of being missing is related to symptoms, benefit, harm, or prognosis. If people who feel worse or experience side effects are more likely to stop returning, “no measurement” can tilt results. That is why the NRC emphasizes sensitivity analyses rather than treating missingness as a footnote (NRC, 2010). In women’s health, the same constraints that narrow eligibility (pregnancy risk, cycle requirements, visit burden) often also raise dropout risk, creating a double distortion.

A missingness map you can use: MCAR, MAR, MNAR

Before trusting a complete-case or “as observed” result, it helps to map the most plausible missingness mechanism.

MCAR: missing like a coin flip (rare)

Missing completely at random (MCAR) means missingness is unrelated to measured or unmeasured data. It mainly reduces precision, not validity. Near-MCAR tends to be mechanical (lost samples, device failure). CONSORT’s requirement to report reasons for losses lets readers judge whether the “coin flip” story fits. Attrition percentage alone is a blunt heuristic (CONSORT, 2010). Treat MCAR as an argument that needs support: similar missingness across arms, balanced non-outcome-related reasons, and no sign that early response or adverse effects predict dropout (RoB 2, the Cochrane risk-of-bias tool for randomized trials).

MAR: explained by what we measured (manageable, not magic)

Missing at random (MAR) means missingness depends on observed data (for example, baseline severity), not on the unobserved outcome itself. It can be plausible, but it can also be fragile (van Buuren, 2018; Carpenter & Kenward, 2013). Under MAR, complete-case analysis can skew results by changing who is being compared: if baseline factors predict both missingness and outcomes, analyzing only completers effectively reweights the sample (White et al., 2011). Complete-case analysis can still be unbiased in some settings where missingness depends only on covariates already in the analysis model, but papers should justify those conditions rather than assume them.

Reader-focused checks: look for likelihood-based repeated-measures models (often MMRM) or multiple imputation, and whether variables that predict missingness were included (van Buuren, 2018; NRC, 2010). ICH E9(R1) adds the higher-level test: does the missing-data approach match the estimand the paper claims to answer (ICH E9(R1), 2019)?

MNAR: missing because of the outcome (often plausible, highest risk)

Missing not at random (MNAR) means missingness depends on the unobserved outcome. People stop reporting because symptoms worsened, side effects emerged, or benefit did not materialize. In that setting, naive analyses can be wrong in direction, not just magnitude (NRC, 2010; Diggle & Kenward, 1994). Women’s-health trials have built-in MNAR pathways because discontinuation often tracks tolerability and life-stage events. When dropout plausibly follows adverse effects, lack of efficacy, or pregnancy-related protocol stops, the key question is not whether dropout occurred but whether the authors tested how it could change the interpretation (Marrazzo et al., NEJM, 2015).
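The three mechanisms can be contrasted in a toy simulation. This is a minimal sketch under invented assumptions, not a model of any real trial: the function name `complete_case_vs_full`, the missingness probabilities, and the link between baseline severity and outcome are all hypothetical, and it looks at a single-arm mean rather than a treatment effect to keep the logic visible.

```python
import random
from statistics import mean


def complete_case_vs_full(mechanism: str, n: int = 20000, seed: int = 0):
    """Return (mean of all outcomes, mean of observed outcomes only)."""
    rng = random.Random(seed)
    all_outcomes, observed = [], []
    for _ in range(n):
        baseline = rng.gauss(0, 1)                  # baseline severity, always recorded
        outcome = 0.5 * baseline + rng.gauss(0, 1)  # follow-up symptom score, higher = worse
        if mechanism == "MCAR":
            p_missing = 0.3                           # same chance for everyone
        elif mechanism == "MAR":
            p_missing = 0.5 if baseline > 0 else 0.1  # depends on a measured variable
        elif mechanism == "MNAR":
            p_missing = 0.5 if outcome > 0 else 0.1   # depends on the outcome itself
        else:
            raise ValueError(f"unknown mechanism: {mechanism}")
        all_outcomes.append(outcome)
        if rng.random() >= p_missing:                 # outcome is only seen if not missing
            observed.append(outcome)
    return mean(all_outcomes), mean(observed)


for mechanism in ("MCAR", "MAR", "MNAR"):
    full, completers = complete_case_vs_full(mechanism)
    print(f"{mechanism}: all participants {full:+.3f}, completers only {completers:+.3f}")
```

In this setup the completers-only mean should track the full mean under MCAR and look better than the truth under MAR and MNAR, because participants with worse scores are more likely to vanish. The difference is that under MAR the driver (baseline severity) is in the dataset, so an analysis that conditions on it has something to work with; under MNAR nothing observed fully explains who left.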

Credibility markers: reasons reported by arm, efforts to collect outcomes after discontinuation when feasible, and MNAR sensitivity analyses aligned to the estimand (ICH E9(R1), 2019; Little et al., NEJM, 2012; NRC, 2010). These analyses may include delta-adjusted or controlled multiple imputation, reference-based imputation, or tipping-point analysis. In plain terms, these are stress tests: authors assume the missing outcomes were a bit worse (or better) than the observed data and check whether the bottom-line result still holds.
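The stress-test idea can be sketched with simple arithmetic. The example below is a deliberately stripped-down illustration, assuming a continuous outcome where higher means more improvement; the function names, the counts, and the choice to shift only the treatment arm are hypothetical. Published tipping-point analyses usually work through multiple imputation and track where a confidence interval or p-value crosses a threshold, whereas this sketch follows only the point estimate.

```python
def delta_adjusted_difference(treat_mean, treat_n_obs, treat_n_miss, ctrl_mean, delta):
    """Treatment-minus-control mean difference if missing treatment-arm outcomes
    averaged (treat_mean + delta) instead of treat_mean.

    Assumptions of this sketch: missing control-arm outcomes match the observed
    control mean, and no uncertainty (standard error, interval) is computed.
    """
    n_total = treat_n_obs + treat_n_miss
    treat_full = (treat_mean * treat_n_obs + (treat_mean + delta) * treat_n_miss) / n_total
    return treat_full - ctrl_mean


def tipping_delta(treat_mean, treat_n_obs, treat_n_miss, ctrl_mean):
    """Delta at which the adjusted difference reaches exactly zero."""
    if treat_n_miss == 0:
        raise ValueError("no missing treatment-arm outcomes: nothing to stress-test")
    n_total = treat_n_obs + treat_n_miss
    return -(treat_mean - ctrl_mean) * n_total / treat_n_miss


# Hypothetical numbers: 80 observed and 20 missing in the treatment arm.
for delta in (0.0, -2.0, -5.0, -10.0, -12.0):
    diff = delta_adjusted_difference(4.0, 80, 20, 2.0, delta)
    print(f"delta {delta:+.1f} -> adjusted difference {diff:+.2f}")

print("tipping delta:", tipping_delta(4.0, 80, 20, 2.0))
```

With these invented inputs, the apparent benefit disappears only if the missing participants did 10 points worse on average than those who stayed. Whether a shift that large is clinically believable is the judgment a tipping-point analysis is meant to put on the table.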

The women’s-health exclusion checklist: who was filtered out before the first data point?

A two-minute eligibility scan can prevent over-generalizing from a narrow sample (Van Spall et al., JAMA, 2007; FDA Diversity Guidance, 2020):

  • pregnancy/lactation exclusions and contraception mandates
  • “regular cycles only” criteria
  • tight age windows (skipping postpartum/perimenopause)
  • comorbidity/medication exclusions
  • prior-treatment washouts
  • “able/willing to attend frequent visits” requirements (PRECIS-2, 2015)

Then ask what these exclusions likely do to baseline risk and the apparent benefit-harm balance. A common (not universal) pattern is: exclude complexity, see lower baseline symptom and adverse-event risk and higher adherence, then get more optimistic efficacy and tolerability than routine care may deliver (Savović et al., 2012).

Quantify the funnel with CONSORT

Use CONSORT counts (CONSORT, 2010): funnel yield = randomized ÷ assessed for eligibility, or randomized ÷ eligible when the paper reports how many people met the criteria. Low yield is a generalizability warning flag, not an automatic disqualifier. Some questions legitimately require tight criteria. If useful, situate the trial using PRECIS-2’s pragmatic versus explanatory framing (PRECIS-2, 2015). A compact annotation that travels across papers:

  • top 3 exclusions (verbatim)
  • follow-up/visit burden
  • which life stages are structurally missing (for example, postpartum/lactation, perimenopause per STRAW+10) (STRAW+10, 2012)
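The funnel-yield arithmetic described above can be written down in a few lines. This is a small sketch; the function name `funnel_yield`, the dictionary keys, and the example counts are hypothetical and not taken from any real trial.

```python
from typing import Optional


def funnel_yield(randomized: int, screened: int, eligible: Optional[int] = None) -> dict:
    """Return funnel yields from CONSORT flow-diagram counts.

    randomized / screened is always returned; randomized / eligible is added
    only when the paper reports how many screened people met the criteria.
    """
    if screened <= 0 or not (0 <= randomized <= screened):
        raise ValueError("expected 0 <= randomized <= screened and screened > 0")
    result = {"randomized_per_screened": randomized / screened}
    if eligible is not None:
        if eligible <= 0 or not (randomized <= eligible <= screened):
            raise ValueError("expected randomized <= eligible <= screened")
        result["randomized_per_eligible"] = randomized / eligible
    return result


# Hypothetical counts read off a flow diagram.
print(funnel_yield(randomized=300, screened=1200, eligible=400))
```

Reporting both ratios separates two stories: a low randomized-per-screened yield points at the eligibility criteria, while a low randomized-per-eligible yield points at consent, logistics, or visit burden.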

Attrition isn’t a footnote: who left, why, and what that does to the result

Start with the CONSORT diagram and record, by arm: randomized, followed up, analyzed, and how many outcomes are missing and when losses happened (CONSORT 2010, Items 13a/13b/16). Then interpret through RoB 2’s missing-outcome domain (RoB 2). Differential dropout often matters more than the headline percentage because randomization protects baseline comparability, not comparability after selective disappearance (NRC, 2010). Simple heuristics can triage (about 5% often negligible, about 20% often concerning, about 50% usually very serious), but they are not laws. Caution rises for subjective outcomes (pain, bleeding scores, mood), which are more vulnerable to bias if dropout tracks worsening symptoms or adverse effects (Wood et al., BMJ, 2008).

With the paper in hand, do three quick moves:

  • Circle the first timepoint where losses spike (early dropout often signals tolerability or early non-benefit).
  • Check whether reasons are given by arm (and whether “withdrew consent” is doing too much work).
  • Look for any comparison of completers vs non-completers (even a baseline table note can hint at informative dropout).
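The by-arm bookkeeping can be sketched as follows. The function names and example counts are hypothetical, and the cut points in `triage_label` are only one rough reading of the approximate 5% / 20% / 50% heuristics mentioned above; they are an assumption for illustration, not thresholds from CONSORT or RoB 2.

```python
def missing_fraction(randomized: int, outcome_observed: int) -> float:
    """Share of randomized participants with no outcome measurement."""
    if randomized <= 0 or not (0 <= outcome_observed <= randomized):
        raise ValueError("expected 0 <= outcome_observed <= randomized and randomized > 0")
    return (randomized - outcome_observed) / randomized


def triage_label(fraction: float) -> str:
    """Map a missing fraction to a rough triage label (assumed cut points)."""
    if fraction >= 0.50:
        return "usually very serious"
    if fraction >= 0.20:
        return "often concerning"
    if fraction <= 0.05:
        return "often negligible"
    return "intermediate: judge by reasons, timing, and symmetry"


def attrition_summary(arms: dict) -> dict:
    """arms maps arm name -> (randomized, participants with the outcome observed)."""
    fractions = {name: missing_fraction(*counts) for name, counts in arms.items()}
    values = list(fractions.values())
    return {
        "missing_by_arm": fractions,
        "label_by_arm": {name: triage_label(f) for name, f in fractions.items()},
        "differential": max(values) - min(values),  # gap between worst and best arm
    }


# Hypothetical counts: 200 randomized per arm.
summary = attrition_summary({"treatment": (200, 150), "control": (200, 180)})
print(summary)
```

The `differential` field is there because, as noted above, an imbalance between arms often matters more than the headline percentage. The labels are a starting point for reading the reasons for dropout, not a verdict.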

When analysis choices quietly change the answer

Complete-case and LOCF are easy to run and easy to misuse. Complete-case is unbiased only under strong assumptions (MCAR or limited covariate-dependent settings that require justification) (van Buuren, 2018). Last observation carried forward (LOCF) freezes participants at an earlier value and treats imputed values as observed, overstating certainty. Re-analyses show LOCF can meaningfully change inferences compared with modern methods (NRC, 2010; Little et al., NEJM, 2012).
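To make the “freezing” concrete, here is a minimal sketch of what LOCF does to one participant’s record. The helper name `locf` and the scores are hypothetical.

```python
from typing import List, Optional


def locf(values: List[Optional[float]]) -> List[Optional[float]]:
    """Fill each missing visit (None) with the most recent observed value.

    Visits before the first observation stay None: there is nothing to carry.
    """
    filled = []
    last = None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


# Hypothetical symptom scores (higher = worse) at four visits; dropout after visit 2.
observed = [6.0, 4.0, None, None]
print(locf(observed))  # [6.0, 4.0, 4.0, 4.0]
```

If this participant left because symptoms rebounded after visit 2, the filled-in 4.0s record an improvement that may never have been sustained. Once stored, they also look identical to measured values unless they are flagged, which is how the apparent certainty gets inflated.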

Per-protocol and as-treated analyses can also induce bias by conditioning on post-randomization behavior (adherence, switching, discontinuation) that is influenced by prognosis and side effects (Hernán, Brumback & Robins, 2000). The “healthy adherer” pattern, in which adherent participants do better even on placebo, was seen in the Coronary Drug Project (Coronary Drug Project Research Group, NEJM, 1980) and again in the Beta-Blocker Heart Attack Trial (Horwitz, Viscoli, Berkman et al., Lancet, 1990). It is a useful reminder that adherence can reflect underlying prognosis rather than treatment effect. If PP/AT results are presented, look for explicit estimands and methods that address informative censoring rather than assuming discontinuation was random (ICH E9(R1), 2019; RoB 2).

A 10-minute bias-audit worksheet: three checks, one usable bottom line

1) Eligibility funnel check (applicability)

Using CONSORT counts plus PRECIS-2 Eligibility/Follow-up (CONSORT, 2010; PRECIS-2, 2015), note: (1) funnel yield, (2) biggest exclusions (reproductive status, comorbidity/meds), (3) visit burden and compliance rules.

2) Follow-up integrity check (outcome capture)

Write one line: outcome capture %, timing of losses, symmetry by arm. Interpret via RoB 2 and CONSORT’s expectation of reasons by arm (CONSORT, 2010; RoB 2). Useful recording format: Arm A: missing (early/mid/late); Arm B: missing (early/mid/late).
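The one-line recording format can be generated from a list of dropout times. This is a sketch under stated assumptions: the function name `capture_line` is hypothetical, and splitting follow-up into equal thirds for early/mid/late is an arbitrary choice made for illustration.

```python
from typing import List


def capture_line(arm: str, randomized: int, dropout_weeks: List[float], total_weeks: float) -> str:
    """Summarize outcome capture and timing of losses for one arm.

    dropout_weeks holds, for each participant missing the final outcome,
    the week of their last measurement. Early/mid/late are thirds of follow-up.
    """
    if randomized <= 0 or len(dropout_weeks) > randomized or total_weeks <= 0:
        raise ValueError("inconsistent counts or follow-up length")
    early = sum(1 for week in dropout_weeks if week < total_weeks / 3)
    late = sum(1 for week in dropout_weeks if week >= 2 * total_weeks / 3)
    mid = len(dropout_weeks) - early - late
    captured = 100 * (randomized - len(dropout_weeks)) / randomized
    return (f"{arm}: {captured:.0f}% captured; missing {len(dropout_weeks)} "
            f"(early {early} / mid {mid} / late {late})")


# Hypothetical 12-week trial, 50 randomized per arm.
print(capture_line("Arm A", 50, [1, 2, 2, 3, 5, 8, 11], 12))
print(capture_line("Arm B", 50, [6, 10], 12))
```

Putting both arms side by side in the same format makes asymmetry easy to spot, for instance early losses clustering in one arm only.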

3) Missingness suspicion check (MNAR plausibility)

Keep it human, then technical: are people dropping out for reasons that are likely tied to how they’re doing? In ICH E9(R1) terms, this is about intercurrent events and whether the analysis matches the question the authors say they’re answering (ICH E9(R1), 2019). Ask: did they (1) report dropout reasons that track outcomes (adverse events, lack of efficacy, pregnancy, rescue meds)? (2) collect outcomes after discontinuation where feasible? (3) run MNAR-aware sensitivity analyses (delta, tipping point, reference-based MI)? (NRC, 2010; Little et al., 2012)

Translation scripts (calibrated, not binary)

GRADE supports being precise about uncertainty, downgrading for risk of bias and indirectness rather than treating estimates as universal (GRADE).

A helpful shorthand is to label the evidence you’re holding:

  • Gold standard: multiple well-run trials, low missingness, and participants who look like the people you’re treating.
  • Promising: a single good trial, but with some indirectness (narrow eligibility, different care context) or moderate attrition.
  • Theoretical: strong mechanism or lab rationale, but weak or thin clinical outcome data.

Examples:

  • “This result is reliable for a narrower group similar to the trial participants; applicability to my situation is uncertain.”
  • “This is directionally suggestive, but missing outcomes could be biasing the size (or direction) of effect.”
  • “This risks misleading certainty because key sensitivity analyses and missingness details aren’t reported.”

How to raise applicability and missingness without sounding “anti-science”

Clinic-usable questions framed as decision support (CONSORT, 2010; RoB 2; NRC, 2010):

  • “Who was excluded (pregnancy/lactation, irregular cycles, comorbidities) and does that match someone like me?”
  • “Were missing outcomes similar across arms, or did one group lose more follow-up?”
  • “What were the reasons for dropout in each arm (side effects, lack of benefit, logistics)?”
  • “Did the study still collect outcomes after people stopped or switched treatment?”
  • “What did sensitivity analyses show if missing outcomes weren’t random, and if none were done, why?”

If evidence is narrow or MNAR-type dropout is plausible, it is often more realistic to shift from a binary verdict to a monitoring decision: a time-limited plan with a clinician that tracks outcomes you care about and sets stop or switch thresholds. Accurate numbers can still mislead if they describe the wrong women, or only the subset who stayed measurable.


A trial can be impeccably randomized and still miss the women you’re trying to treat. The practical takeaway is to audit applicability: who was filtered out by eligibility rules, what care context the trial compares against, and who disappeared from measurement over time (Rothwell, 2005; CONSORT, 2010). In women’s health, routine exclusions around pregnancy, lactation, cycle regularity, and age windows can quietly redefine the estimand and leave out clinically common life stages (ICH E9(R1), 2019; STRAW+10, 2012). Then attrition and missing outcomes can tilt results, especially when missingness is plausibly MNAR and tied to symptoms or side effects (NRC, 2010). The goal is precision, not cynicism: use the funnel-and-follow-up checks, look for modern missing-data methods and sensitivity analyses, and bring calibrated questions to appointments.

On your next paper, start with Step 1 (Eligibility funnel), then do Step 2 (Follow-up integrity)—and only then decide whether you need Step 3 (MNAR suspicion and sensitivity).
