When Studies Disagree: A Practical Protocol for Auditing Effect Sizes, Uncertainty, and False Conflicts

“Study A says yes, Study B says no” is a common way health evidence gets turned into a debate. It is also rarely as simple as headlines suggest. Most apparent contradictions do not come from two studies showing opposite effects. They come from summaries that treat “statistically significant” as a final verdict, instead of reading what the study actually estimated: the effect size and how uncertain it is. A wide confidence interval often signals absence of evidence, not evidence of absence (Altman & Bland, 1995). Major statistical guidance has also warned against treating p-values as bright-line truth tests (ASA, 2016; ASA, 2019).

This article is a practical, research-literate protocol for deciding what is really “conflicting” when studies seem to disagree, and what to do with uncertainty without sliding into wellness-trend thinking or cynical dismissal. The goal is not to pick a side. It is to extract a comparable claim, check whether the studies are even answering the same question, and then classify the uncertainty in a way that supports defensible decisions.

You’ll learn how to:

Separate headline contradiction from result-level disagreement by focusing on effect estimates plus precision.
Translate “same topic” into the same question using PICOTS: Population, Intervention/Exposure, Comparator, Outcome, Timeframe, Study design, so mismatches stop masquerading as conflict (Richardson et al., 1995; PRISMA 2020; Cochrane Handbook).
Check whether studies are targeting different estimands (for example, intention-to-treat vs per-protocol, different follow-up rules), which can legitimately produce different answers even with similar PICO (ICH E9(R1), 2019).
Walk through a fast “reconciliation ladder” that looks for common drivers of apparent disagreement: surrogate endpoints, exposure windows, baseline-risk differences, bias profiles, multiplicity, selective emphasis, and low precision.

A special focus runs through the piece: why women’s-health evidence often feels inconsistent in practice. Not because women’s physiology is “mysterious,” but because methods can misclassify hormonal status, studies often lack power for subgroup effects, or researchers rely on proxies that do not map cleanly onto outcomes people actually care about (STRAW+10; Elliott-Sale et al., 2021). If you have ever brought a well-sourced question to a clinical appointment and been met with a shrug, or with a confident claim that does not match the literature, this framework is meant to help you audit the claim, not just accept authority.

If you want a simple way to use this in a GP appointment (or any time you’re being given a confident one-liner), pick two or three of these:

“What’s the absolute difference for someone with my baseline risk?”
“Is that result from the pre-specified primary outcome and timepoint, or one of many analyses?”
“Was the analysis intention-to-treat or per-protocol (and how did they handle switching or nonadherence)?”

By the end, you’ll have a one-page “conflict audit” worksheet: what to extract from each paper, what thresholds matter for decisions, and how to report the most honest current stance: best estimate, plausible range, and what evidence would change the view, without overstating certainty. This is also the same extraction grid I use before I let myself share a study in my own “papers to practice” workflow.

What’s Actually “Conflicting” When Studies Seem to Disagree?

Headline contradiction vs estimate contradiction

Most “Study A says yes, Study B says no” fights are headline-level, not result-level. A real comparison starts when you ignore the press-release verdict and extract the effect estimate plus its uncertainty, for example, RR 0.92 (95% CI 0.74 to 1.15) instead of “no benefit.” A wide CI often signals absence of evidence, not evidence of absence (Altman & Bland, 1995). This is why binary “significant/non-significant” framing creates false drama. The ASA has warned against threshold interpretations of p-values (ASA, 2016; ASA, 2019). Reporting standards also push readers toward effect sizes with precision, not conclusions-only summaries (ICMJE Recommendations).

What qualifies as a real result-level disagreement?

A genuine conflict exists when two studies estimate the same underlying question, meaning aligned population and comparator, comparable outcome definition, similar exposure or dose, and similar timeframe, and the estimates differ enough to change an evidence-based decision. “One is significant and the other isn’t” is not automatically a contradiction. Similar point estimates can produce different p-values because of sample size and random error (ASA, 2016; ASA, 2019). CONSORT emphasizes effect sizes with CIs for pre-specified primary endpoints as what readers are meant to interpret. A useful decision lens is compatibility: are the estimates plausibly consistent given uncertainty, and do their CIs cross decision-relevant thresholds (GRADE imprecision; Guyatt et al., 2011)?

Worked mini-audit (hypothetical numbers):

Study A: RR 0.85 (95% CI 0.70 to 1.03)
Study B: RR 0.90 (95% CI 0.78 to 1.04)
Headlines might read “benefit” vs “no effect,” but the point estimates are similar and the CIs overlap heavily. That’s not a real contradiction; it’s Bucket 3 (same question, imprecise evidence): both studies are compatible with a modest benefit and with little/no benefit.

Contrast that with:

Study A: RR 0.70 (95% CI 0.60 to 0.82)
Study B: RR 1.05 (95% CI 0.92 to 1.20)
If PICOTS and estimands truly match, those ranges do not overlap (Study A's upper limit, 0.82, sits below Study B's lower limit, 0.92), and the difference could change decisions. Now you treat it as a potential result-level disagreement and walk the ladder (bias profile, outcome definition, timeframe, baseline risk, multiplicity).
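One way to make "compatible given uncertainty" concrete is to compare the two estimates on the log scale, in the spirit of the difference-between-two-estimates approach described by Altman & Bland (2003). The sketch below is illustrative, not a definitive method: it assumes each 95% CI is symmetric on the log scale (so a standard error can be backed out of the interval), that the two studies are independent, and that PICOTS and estimands already match. The function names are hypothetical.

```python
import math

Z95 = 1.96  # normal critical value for a two-sided 95% interval

def log_se_from_ci(lower: float, upper: float) -> float:
    """Back out the SE of log(RR) from a 95% CI, assuming log-scale symmetry."""
    return (math.log(upper) - math.log(lower)) / (2 * Z95)

def ratio_of_rrs(rr_a, ci_a, rr_b, ci_b):
    """Return RR_A / RR_B with an approximate 95% CI (independent studies assumed)."""
    diff = math.log(rr_a) - math.log(rr_b)
    se = math.sqrt(log_se_from_ci(*ci_a) ** 2 + log_se_from_ci(*ci_b) ** 2)
    return (math.exp(diff), math.exp(diff - Z95 * se), math.exp(diff + Z95 * se))

# The two hypothetical pairs from the mini-audit above
first_pair = ratio_of_rrs(0.85, (0.70, 1.03), 0.90, (0.78, 1.04))
second_pair = ratio_of_rrs(0.70, (0.60, 0.82), 1.05, (0.92, 1.20))

for label, (ratio, lo, hi) in [("First pair", first_pair), ("Second pair", second_pair)]:
    includes_one = lo <= 1.0 <= hi
    print(f"{label}: ratio of RRs {ratio:.2f} "
          f"(95% CI {lo:.2f} to {hi:.2f}); includes 1: {includes_one}")
```

For the first pair the interval around the ratio of RRs includes 1, so the two results are statistically compatible. For the second pair it excludes 1, which is the cue to walk the ladder rather than a verdict on which study is right.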

Minimal “intake form” before calling it disagreement

Before deciding two papers “disagree,” extract the same minimum fields:

Primary endpoint (pre-specified) and how it was measured (CONSORT)
Point estimate + 95% CI (adjusted and unadjusted when relevant)
Absolute risk/absolute effect when possible, not ratios alone (Woloshin, Schwartz & Welch, 2000)
Time horizon and analysis population tied to the endpoint (CONSORT)

This standard extraction can then be translated into a simple comparable claim.

Turn “Same Topic” Into a Comparable Claim: PICOTS + the Estimand

PICOTS verifies whether you’re comparing the same question

When studies “conflict,” the mismatch is often upstream. They are not asking the same question, even if they share a topic label. A compact fix is PICOTS: Population, Intervention/Exposure, Comparator, Outcome, Timeframe, Study design (Richardson et al., 1995). PRISMA 2020 effectively requires this discipline for synthesis decisions (PRISMA 2020), and the Cochrane Handbook flags checks for clinical and methodological diversity before treating results as comparable. Protocol: write each PICOTS element as a noun phrase.

Even with matching PICOTS, the “effect” may differ: align the estimand

Even with similar PICOTS, studies may target different estimands, for example, intention-to-treat versus per-protocol, or different rules for nonadherence, treatment switching, and follow-up (ICH E9(R1), 2019). The target trial framing makes the logic explicit: specify the causal question, then check whether the analysis really answers it (Hernán & Robins). In plain language, the same intervention on paper can lead to different estimated effects if follow-up rules and intercurrent events differ.

The Reconciliation Ladder: A 10–15 Minute Protocol for Explaining “Conflicting” Results

If you’re short on time: do Step 1 (same question?), Step 6 (absolute effects + adjustment + estimand), and Step 7 (precision). Use Steps 2–5 when something still doesn’t make sense.

Step 1 — Same question, or just the same topic?

Check for indirectness: different populations (baseline risk), comparators, settings, or versions and doses of the exposure. If PICOTS does not line up in a way that would change a decision, it is not a contradiction. It is two answers to two questions (GRADE indirectness; Cochrane Handbook; PRISMA 2020).

Step 2 — Same outcome definition, or a surrogate?

A surrogate endpoint (often a biomarker) stands in for patient-important outcomes such as symptoms, function, or mortality. The bar is not correlation. The key question is whether the surrogate reliably captures the intervention’s effect on the clinical endpoint (Prentice, 1989). Trial-level surrogacy matters: do treatment effects on the surrogate predict treatment effects on clinical outcomes across trials (Buyse et al., 2000; IOM/National Academies, 2010)?

Surrogate failure is well documented in clinical research: a trial can “win” on a marker and still fail to improve (or even worsen) patient-important outcomes. For example, in CAST, drugs that suppressed arrhythmias did not improve survival; mortality was higher with encainide or flecainide than with placebo (CAST). Treat surrogate-only “wins” as provisional.

Step 3 — Same exposure window and follow-up timeframe?

Short follow-up can capture acute effects while missing delayed benefit or harm. Long follow-up can dilute early effects if adherence falls or switching occurs. These choices change what is being estimated through the handling of intercurrent events (ICH E9(R1), 2019). Use CONSORT flow and methods to confirm follow-up and attrition, then ask: what window does this estimate represent?

Step 4 — Same design and bias profile?

Before debating biology, scan for predictable distortion. This is also why two clinicians can sound equally confident while steering you in opposite directions: they may be implicitly trusting different study designs, endpoints, or bias profiles.

RCTs can still be biased through problems with randomization, deviations from intended intervention, missing outcomes, outcome measurement, and selective reporting (RoB 2; Sterne et al., 2019). Observational studies add confounding, selection bias, and measurement error (ROBINS-I; Sterne et al., 2016). Practical cues: read the methods section for how eligibility, measurement, and verification were handled, and the flow diagram for loss to follow-up. Missing data can shift estimates under plausible assumptions (Akl et al., 2012; NRC, 2010).

Step 5 — Different baseline risk (or credible effect modification)?

Opposing averages can both be true if baseline risk or effect modifiers differ across samples. Subgroup claims need credibility checks: pre-specified, supported by an interaction test (not “significant in one subgroup only”), few hypotheses, consistency, and plausibility (Sun et al., 2010; Altman & Bland, 2003; Rothwell, 2005).

Step 6 — Different statistical lens?

Translate relative effects into absolute effects using baseline risk. Interpretation changes when results are expressed as relative risk reduction versus absolute risk reduction (Naylor et al., 1992; Forrow et al., 1992), and abstracts often emphasize ratios (Schwartz et al., 2006). Treat the adjustment set as part of the claim. STROBE expects adjusted and unadjusted estimates with precision. Also re-check the estimand (ITT vs per-protocol; how switching and nonadherence were handled) (ICH E9(R1), 2019).

Where to get baseline risk in practice: start with the control-group event rate in the paper (for trials), or use a guideline risk calculator for your context, then sanity-check whether your tracked context (age, symptoms, cycle status, relevant labs) makes you more like the trial’s average participant or not.

A practical extraction format is: “absolute difference per 1,000 people over 1 year” with the CI translated when possible.
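The relative-to-absolute translation is simple arithmetic, sketched below. The baseline of 100 events per 1,000 is a made-up value for illustration; in practice it would come from the control-group event rate or a risk calculator, as described above. The sketch assumes the risk ratio is roughly constant across baseline risks and that the baseline risk refers to the same time horizon as the trial estimate. The function names are hypothetical.

```python
def absolute_effect_per_1000(baseline_per_1000, rr, ci_lower, ci_upper):
    """Convert a risk ratio and its CI limits into absolute differences per 1,000.

    Negative values mean fewer events than baseline; positive values mean more.
    """
    def diff(ratio):
        return baseline_per_1000 * (ratio - 1.0)
    return diff(rr), diff(ci_lower), diff(ci_upper)

def describe(delta):
    """Render a signed difference as 'N fewer' or 'N more'."""
    return f"{round(abs(delta))} {'fewer' if delta < 0 else 'more'}"

# Hypothetical inputs: RR 0.85 (95% CI 0.70 to 1.03), baseline 100 per 1,000
est, low, high = absolute_effect_per_1000(100, 0.85, 0.70, 1.03)
print(f"{describe(est)} per 1,000 (CI: {describe(low)} to {describe(high)})")
# prints "15 fewer per 1,000 (CI: 30 fewer to 3 more)"
```

The same RR applied to a baseline of 10 per 1,000 would give about 1.5 fewer per 1,000, which is why the baseline risk you plug in belongs on the worksheet next to the estimate.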

Step 7 — Low precision vs real contradiction

A “null” can reflect a wide CI and too few events (GRADE imprecision; Guyatt et al., 2011). Early significant findings are also prone to exaggerated effects in underpowered studies (Button et al., 2013; Ioannidis, 2008). Gelman and Carlin describe Type M (magnitude) error: even when the direction is correct, the size can be overstated (Gelman & Carlin, 2014). When results “conflict,” check whether one estimate comes from a small, early, unstable study. And yes—this is the part that gets all of us, because a clean “p<0.05” result feels like certainty even when it isn’t.
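The Type M idea can be seen in a small simulation, written in the spirit of Gelman and Carlin's design analysis but not taken from their code. Everything here is an assumption for illustration: a true effect of 0.1 and a standard error of 0.2 (arbitrary units, for example log risk ratios), a normal sampling distribution, and a two-sided 5% significance rule. The function name is hypothetical.

```python
import random
import statistics

def design_analysis(true_effect, se, n_sims=100_000, z_crit=1.96, seed=1):
    """Simulate replications of one study and summarize the 'significant' ones.

    Returns (power, type_s, type_m):
      power  - share of replications reaching |estimate| > z_crit * se
      type_s - share of significant estimates with the wrong sign
      type_m - mean |significant estimate| divided by the true effect size
    """
    rng = random.Random(seed)
    draws = [rng.gauss(true_effect, se) for _ in range(n_sims)]
    significant = [d for d in draws if abs(d) > z_crit * se]
    power = len(significant) / n_sims
    type_s = sum(1 for d in significant if d * true_effect < 0) / len(significant)
    type_m = statistics.fmean([abs(d) for d in significant]) / abs(true_effect)
    return power, type_s, type_m

power, type_s, type_m = design_analysis(true_effect=0.1, se=0.2)
print(f"power: {power:.2f}, wrong-sign share: {type_s:.2f}, exaggeration ratio: {type_m:.1f}")
```

In this setup a result has to exceed 1.96 × 0.2 = 0.392 in size to be called significant, so any significant estimate overstates the assumed true effect of 0.1 by nearly a factor of four at minimum. That is the mechanism behind exaggerated early "wins."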

When “Conflict” Is Manufactured: Spin, Multiplicity, and Selective Emphasis

Spin presents findings as more favorable or certain than the data support, especially when the primary endpoint is null or clinically small (Boutron et al., 2010; Boutron et al., 2014). Press releases that exaggerate causality are associated with exaggerated news claims (Sumner et al., 2014).

Micro-example (what “anchor it to the primary endpoint CI” looks like):

Press release headline: “New supplement cuts migraine risk by 30%.”
Paper’s primary endpoint: RR 0.70 (95% CI 0.45 to 1.08) at the pre-specified timepoint.
More honest read: “The best estimate suggests fewer migraines, but the uncertainty includes anything from a large reduction to no clear effect.”

Multiplicity can also create apparent disagreement. Protocol and registration audits repeatedly find outcome switching and selective emphasis (Chan et al., 2004; Chan et al., 2008; Mathieu et al., 2009; Dwan et al., 2013). Reader check: open the trial registry entry, copy the pre-specified primary outcome and timepoint, and confirm the paper’s abstract headline matches that exact outcome/timepoint.

Many tests also mean many chances for borderline positives. In aggregate, clusters of p-values just under 0.05 are a signal consistent with selective reporting (Head et al., 2015; Ioannidis & Trikalinos, 2007; Simonsohn et al., 2014). Composite endpoints add another trap: a “positive” composite may be driven by frequent, less important components while patient-important components show little change (Montori et al., 2005; Ferreira-González et al., 2007; Freemantle et al., 2003).
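The "many tests, many chances" point can be put in numbers. The sketch below computes the chance of at least one false positive across several tests, under two simplifying assumptions that rarely hold exactly: the tests are independent, and every null hypothesis is true. The function name is hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one p < alpha across n independent tests of true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error_rate(k):.0%} chance of at least one false positive")
```

With 10 outcomes, timepoints, or subgroups, the chance of at least one nominally significant result is about 40% even when nothing is going on. Correlated outcomes change the exact figure but not the direction of the problem.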

Why Women’s-Health Evidence “Conflicts” More Often (Methodology, Not Mystery)

Hormonal status misclassification

In cycle-based research, “cycle day” is a noisy proxy for hormone exposure. Calendar counting can misclassify follicular vs luteal phase because ovulation timing varies within and between people (Fehring, 2006; Hicks, 2017). Misclassification tends to weaken true effects toward the null and increase between-study variability, which can look like conflict (Elliott-Sale et al., 2021; Hicks, 2017). Menopause research has a parallel issue when staging definitions drift. A practical check is whether studies use STRAW+10 and report how staging was operationalized, rather than inferring status from age bands or self-labels (Harlow et al., 2012). If verification of cycle phase, contraceptive use, or menopausal stage differs, treat it as a PICOTS mismatch (Cochrane Handbook; GRADE indirectness). In the ladder, this most often shows up as Step 1 (same question?) and Step 6 (baseline risk/context).

Underpowering and unstable subgroup headlines

Even when women are included, studies are rarely powered for interaction effects (treatment-by-sex, treatment-by-cycle-stage). Low power leads to unstable subgroup estimates and exaggerated “wins” when something barely clears a threshold (Sun et al., 2010; Button et al., 2013; Gelman & Carlin, 2014). Quick sanity checks include subgroup CI width, event counts, and how many subgroup cuts were attempted. Public sources like FDA Drug Trials Snapshots and NIH ORWH can help contextualize whether sex-stratified claims were realistically testable. When this is the driver, you often land in Bucket 3 (imprecision) unless there is a strong, replicated interaction with credible testing.

If the Conflict Doesn’t Resolve: Classify the Uncertainty and Still Make a Defensible Call

Four uncertainty buckets

Bucket 1: different questions. If PICOTS does not match, stop forcing a comparison and decide which question matters (population stage, outcome type, timeframe) (Richardson et al., 1995; PRISMA 2020; GRADE indirectness).

Bucket 2: same question, different bias. Prefer the estimate with fewer predictable distortions (credible outcome ascertainment, lower or less differential attrition, more plausible confounding control) over the one with the boldest narrative (RoB 2; ROBINS-I). Missing data can plausibly flip conclusions, so attrition is part of credibility (Akl et al., 2012).

Bucket 3: same question, imprecise evidence. Wide CIs mean many true effects remain compatible, including benefit, trivial effect, or harm (GRADE imprecision; Guyatt et al., 2011). Underpowered “wins” are a reason to wait for larger preregistered evidence rather than updating strongly on a single result (Button et al., 2013).

Bucket 4: true heterogeneity. Effects can vary by setting, baseline risk, implementation, or effect modifiers. The task becomes mapping where and for whom (Cochrane Handbook). In meta-analysis terms, a CI describes uncertainty in the mean effect, while a prediction interval reflects what might happen in a new setting. That range is often what matters for decisions when heterogeneity exists (Higgins, Thompson & Spiegelhalter, 2009; Riley, Higgins & Deeks, 2011; Borenstein et al., 2009).
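For reference, the two intervals differ by one term. The notation below is introduced here for illustration: the pooled random-effects mean is written as mu-hat (on the log scale for ratio measures), its standard error as SE, the estimated between-study variance as tau-hat squared, and k is the number of studies. The prediction interval follows the approximate form given by Higgins, Thompson & Spiegelhalter (2009), which uses a t distribution with k − 2 degrees of freedom.

```latex
% 95% confidence interval for the mean effect
\hat{\mu} \pm 1.96 \,\mathrm{SE}(\hat{\mu})

% Approximate 95% prediction interval for the effect in a new setting
\hat{\mu} \pm t_{k-2,\,0.975} \sqrt{\hat{\tau}^{2} + \mathrm{SE}(\hat{\mu})^{2}}
```

When the between-study variance is near zero the two intervals are similar. When it is large, the prediction interval can be much wider than the CI, and it is only defined when there are at least three studies.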

A calibrated takeaway template

To avoid “picking sides,” report (1) the best estimate, (2) the plausible range (CI, and PI if heterogeneity is credible), and (3) whether that range crosses decision-relevant thresholds, not whether it crosses a p-value cutoff (ASA, 2016; ASA, 2019; GRADE imprecision; Guyatt et al., 2011). Template: “Best estimate: __. Plausible range: __ (CI; PI if applicable). Because this range [does/does not] include clinically important benefit/harm, the most defensible current stance is __.”

Add an update rule: “Revise this view if __” (for example, a larger preregistered RCT, independent replication, or a PRISMA-guided synthesis with pre-specified heterogeneity checks) (PRISMA 2020; Cochrane Handbook). Record funding and conflicts of interest as context. Industry sponsorship is associated with more favorable efficacy conclusions on average, so it belongs in the distortion log without becoming an automatic disqualification (Lundh et al., 2017; ICMJE COI standards).
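The range-versus-threshold check in the template can be sketched in a few lines. The thresholds used here (RR at or below 0.90 counts as clinically important benefit, RR at or above 1.10 as clinically important harm) are invented for illustration; real thresholds depend on the outcome, baseline risk, and the person's priorities. The sketch assumes a risk ratio for an unwanted outcome, so lower is better. The function name is hypothetical.

```python
def regions_in_ci(ci_lower, ci_upper, benefit_at_or_below, harm_at_or_above):
    """List which decision-relevant regions a CI for a risk ratio overlaps."""
    regions = []
    if ci_lower <= benefit_at_or_below:
        regions.append("clinically important benefit")
    if ci_upper > benefit_at_or_below and ci_lower < harm_at_or_above:
        regions.append("little or no effect")
    if ci_upper >= harm_at_or_above:
        regions.append("clinically important harm")
    return regions

# Hypothetical primary endpoint: RR 0.70 (95% CI 0.45 to 1.08)
rr, ci = 0.70, (0.45, 1.08)
regions = regions_in_ci(*ci, benefit_at_or_below=0.90, harm_at_or_above=1.10)
print(f"Best estimate: RR {rr:.2f}. "
      f"Plausible range: {ci[0]:.2f} to {ci[1]:.2f} (95% CI). "
      f"This range includes: {', '.join(regions)}.")
```

With these assumed thresholds the range includes both important benefit and little or no effect, which supports a "promising but unconfirmed" stance rather than "works" or "doesn't work."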

The One-Page “Conflict Audit” Worksheet

A fast extraction makes comparisons reproducible (CONSORT; STROBE; ICH E9(R1)).

Example filled-in line (hypothetical):

Primary endpoint (verbatim): “Migraine days/month at 12 weeks”; Effect: RR 0.85 (95% CI 0.70 to 1.03); Absolute: about 10 fewer per 1,000 over 12 weeks (CI: about 20 fewer to 2 more); Baseline risk used: control group 65/1,000; Bucket: 3 (imprecise)

Use this as your blank template:

Citation (author/year/journal)
PICOTS (Richardson et al., 1995)
Primary endpoint (verbatim) + measurement (CONSORT)
Effect estimate + 95% CI (adjusted + unadjusted if relevant)
Absolute risk/absolute effect if reported/derivable (Woloshin, Schwartz & Welch, 2000)
Baseline risk used (from ___): control-group event rate / guideline calculator / other (note what you used)
Estimand clues: ITT vs per-protocol, follow-up window, handling of switching/nonadherence/missingness (ICH E9(R1), 2019)

Add quick bias flags:

Missing data / attrition (CONSORT; Akl et al., 2012)
Outcome subjectivity + blinding/measurement (RoB 2)
Multiplicity (outcomes/timepoints/subgroups)
Funding + sponsor role / COI (ICMJE; Lundh et al., 2017)

Then label the outcome distance from real-world benefit:

Patient-important
Surrogate
Composite (IOM/National Academies, 2010; Prentice, 1989; Buyse et al., 2000)

Finally, pick one uncertainty bucket and record low/moderate/high confidence using CI width and event counts (imprecision), predictable bias threats, and whether the full CI would change a decision (Guyatt et al., 2011). If a random-effects meta-analysis exists, record both pooled CI and prediction interval. If the PI crosses no effect, write “effects likely vary by setting” rather than forcing a universal verdict (Higgins et al., 2009; Riley et al., 2011; Borenstein et al., 2009; Cochrane Handbook).
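If you keep audits in notes or a script rather than on paper, the worksheet maps onto a small record type. The sketch below is one possible layout, not a standard schema: the field names, the example studies, and the exact-match comparison of PICOTS phrases are all assumptions made for illustration. Real PICOTS entries need human judgment about whether a difference would change a decision.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

PICOTS_KEYS = ("population", "intervention", "comparator",
               "outcome", "timeframe", "study_design")

@dataclass
class AuditRecord:
    """One worksheet row per study (hypothetical field names)."""
    citation: str
    picots: Dict[str, str]            # one short noun phrase per PICOTS element
    primary_endpoint: str             # copied verbatim from the paper or registry
    estimate: float                   # e.g. a risk ratio
    ci95: Tuple[float, float]
    estimand_notes: str = ""          # ITT vs per-protocol, switching, missingness
    bias_flags: List[str] = field(default_factory=list)
    outcome_type: str = "patient-important"  # or "surrogate" / "composite"
    bucket: Optional[int] = None      # uncertainty bucket 1 to 4, once decided

def picots_mismatches(a: AuditRecord, b: AuditRecord) -> List[str]:
    """Return PICOTS elements whose phrases differ (crude exact-text comparison)."""
    def norm(rec: AuditRecord, key: str) -> str:
        return rec.picots.get(key, "").strip().lower()
    return [k for k in PICOTS_KEYS if norm(a, k) != norm(b, k)]

# Two made-up studies that share a topic but not a timeframe
shared = {"population": "adults with episodic migraine",
          "intervention": "supplement X", "comparator": "placebo",
          "outcome": "any migraine attack", "study_design": "RCT"}
study_a = AuditRecord("Study A (hypothetical)", {**shared, "timeframe": "12 weeks"},
                      "any migraine attack at 12 weeks", 0.85, (0.70, 1.03), bucket=3)
study_b = AuditRecord("Study B (hypothetical)", {**shared, "timeframe": "52 weeks"},
                      "any migraine attack at 52 weeks", 0.90, (0.78, 1.04))

print(picots_mismatches(study_a, study_b))  # elements to resolve before comparing
```

Filling the same fields for both studies before comparing them is the point of the exercise. A non-empty mismatch list is a prompt to check Bucket 1 before arguing about the numbers.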

Next time you see a “Study A vs Study B” claim, pick that single claim and do a two-study audit: fill the worksheet for Study A and Study B, label the uncertainty bucket, then write the calibrated takeaway in two sentences (best estimate + plausible range + what would change your mind). You end up with something you can actually use—on a news scroll, in a GP visit, or in your own notes—without pretending the uncertainty isn’t there.

Women's Health Unfiltered: Evidence, Protocols, and Real Stories
