How to Read a Diabetes Research Study: What the Headline Leaves Out
Why a frightening headline is rarely what it seems, told plainly with Jude, then the appraisal toolkit with Grace, then the four lenses applied to a real study with John. Stop wherever you have enough.
How we teach: three rules, borrowed from Taleb
You earn each level by showing you understand it, not by scrolling past it. We only teach what we would use on ourselves and the people we love.
Understanding beats memory and luck, so the checks reshuffle every time you retry. A pass means you got it, not that you guessed it. And we teach you to tell a trend (signal) from one reading (noise).
We give you the scaffolding and get out of your way. Roam where your curiosity leads, go as deep as you want, and ask Grace anything. We will not teach a bird how to fly.
Reading a study and not sure what the numbers really say? Ask Grace, then take it to your care team.
One page, three depths
This guide compounds: each layer rests on the one beneath it. Read Jude’s plain version, then pass a short understanding check to open Grace, then another to open John. You can roam freely within a layer; you cannot skip ahead a layer, because the next one would not make sense and you would be standing on a gap.
Why a scary headline is rarely the whole story
A headline says a new pill “halves your risk”. It sounds enormous. Then you read on and find the risk went from two people in a thousand down to one in a thousand. That is still a halving, and it is true, but for any one person it is a very small change. The big-sounding number and the small real change are the same fact, said two ways. The trick is knowing which way you are being shown.
There is a kind question to ask of any health story, and it keeps everyone honest: out of how many people, and over how long? “Halves your risk” tells you nothing on its own. “One fewer person in a thousand, over ten years” tells you what it would mean for someone like you. The first is built for sharing; the second is built for deciding.
Two more gentle habits. First, one study is not the final word; a single result can be a fluke, and the strongest findings are the ones repeated by different teams. Second, the people in a study are rarely exactly you; a trial often picks the most motivated volunteers, so the everyday result is usually a little smaller. None of this means research cannot be trusted. It means you read it the way you read the weather: useful, honest about its limits, and never a promise about your own day.
The same change, said two ways. The shape of risk, not your personal probability. The big-sounding version is built to share; the small honest version is built to decide. Always ask: out of how many, over how long?
Does this match the life of the person living it? A number that frightens you off a sensible choice, or rushes you into a risky one, is being used on you, not for you. Slow down, ask the kind question, and take it to your team.
The Pemberton lens, lived recognisability, one of the four GNL appraisal lenses.
The appraisal toolkit
Relative risk, absolute risk, and number needed to treat
Take a real teaching case. In the ERSPC prostate-screening trial, screening cut prostate-cancer deaths by a relative 20%; the same data, expressed in absolute terms, came to about one death prevented per 1,000 men screened over roughly nine years.¹ Both numbers are correct. Only the second helps a man decide. The bridge between them is the number needed to treat (NNT): how many people must take the treatment for one to benefit, calculated as 1 divided by the absolute risk reduction. The smaller the absolute change, the larger the NNT, and the more carefully the benefit must be weighed against the cost and the harms.
| How it is framed | What it says | What it leaves out |
|---|---|---|
| Relative (“20% fewer deaths”) | The proportion the risk fell by | The baseline; how big the risk was to begin with |
| Absolute (“~1 fewer death per 1,000”) | The real change for a person like you | Nothing essential; this is the deciding number |
| NNT (“~1,000 screened for 1 death prevented”) | How many are treated for one to gain | Reads alongside number needed to harm |
The Goldacre rule: a percentage risk claim with no baseline is incomplete and possibly misleading. “Reduces risk by 50%” is not yet a fact; it becomes one only when you know the risk it started from.²
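The arithmetic behind the rule is short enough to sketch. The example below is a minimal illustration, not a clinical calculator: `risk_summary` is a hypothetical helper name, the "rare" case uses the two-in-a-thousand headline example from this guide, and the "common" case is an invented contrast with a higher baseline.

```python
from fractions import Fraction

def risk_summary(baseline_risk, treated_risk):
    """Return (relative reduction, absolute reduction, NNT) for two risks."""
    arr = baseline_risk - treated_risk   # absolute risk reduction
    rrr = arr / baseline_risk            # relative risk reduction
    nnt = 1 / arr                        # number needed to treat
    return rrr, arr, nnt

# The headline example from this guide: 2 in 1,000 falls to 1 in 1,000.
rare = risk_summary(Fraction(2, 1000), Fraction(1, 1000))

# Hypothetical contrast: the same "halving" from a much higher baseline.
common = risk_summary(Fraction(200, 1000), Fraction(100, 1000))

for label, (rrr, arr, nnt) in [("rare", rare), ("common", common)]:
    print(f"{label}: relative {float(rrr):.0%}, "
          f"absolute {float(arr * 1000):g} per 1,000, NNT {float(nnt):g}")
```

Both cases report the same 50% relative reduction, yet the first works out to 1 fewer person per 1,000 (NNT 1,000) and the second to 100 fewer per 1,000 (NNT 10). The percentage alone cannot tell them apart; the baseline can.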
The study hierarchy, and why it is only a rule of thumb
Different questions need different designs. For “does this treatment work”, a randomised controlled trial sits high, because randomising who gets what balances out the hidden differences between people (confounding). For “what is it like to live with this device”, a qualitative study is the right tool, and an RCT answers nothing. A systematic review that pools many trials sits at the top, but only if its question and methods were fixed in advance.
| Question | Best-suited design | Main weakness to watch |
|---|---|---|
| Does the treatment work? | Randomised controlled trial | May enrol the most motivated volunteers |
| What causes harm over time? | Prospective cohort study | Confounding; the groups differ in other ways |
| What is the whole picture? | Systematic review of trials | Only as good as the trials it pools |
| What is it like to live with? | Qualitative study | Not designed to give an effect size |
Greenhalgh’s warning matters as much as the ladder itself: do not apply the hierarchy mechanically. A sloppy meta-analysis does not outrank a large, well-designed cohort study; the design is a starting prior, not a verdict.³
What a headline hides
Three quiet failures account for most misleading health stories. Outcome switching: the planned main result was disappointing, so a secondary one is promoted to the headline. Publication bias: the trials that found nothing were never published, so the survivors paint too rosy a picture. Surrogate endpoints: the study measured a stand-in (a blood marker) rather than the thing that matters to a person (feeling well, living longer). When you meet a striking claim, ask what was actually measured, and whether the trials you can read are all the trials that were run.
Publication bias is a survivor effect: the trials that disagreed are missing from the shelf, so the published picture overstates the benefit. Reboxetine is the textbook case (six of seven trials unpublished).² Counts illustrate the pattern, not one trial’s exact tally.
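The survivor effect can be seen in a small simulation. This is a sketch under loud assumptions, not a model of any real drug: the treatment is assumed to have no effect at all, every trial is assumed equally noisy, and the publication rule (only clearly positive results reach the shelf) is invented for illustration.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable

TRUE_EFFECT = 0.0      # assumption: the treatment does nothing
STANDARD_ERROR = 1.0   # assumption: every trial is equally noisy
N_TRIALS = 1000

# Each trial's observed effect is the true effect plus sampling noise.
all_trials = [random.gauss(TRUE_EFFECT, STANDARD_ERROR) for _ in range(N_TRIALS)]

# Hypothetical publication rule: only clearly "positive" trials are published.
published = [e for e in all_trials if e > 1.645 * STANDARD_ERROR]

mean_all = statistics.mean(all_trials)
mean_published = statistics.mean(published)

print(f"trials run: {len(all_trials)}, published: {len(published)}")
print(f"mean effect, all trials:       {mean_all:+.2f}")
print(f"mean effect, published trials: {mean_published:+.2f}")
```

Across every trial that was run, the average effect sits close to zero, as it should. Across the published survivors alone it looks like a clear benefit, even though nothing real is there.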
A big-sounding percentage is not the same as a big difference. Whenever you are handed a risk number, ask the two questions that keep everyone honest: out of how many, and over how long? And ask whether the studies you can see are all the studies that were done.
The Goldacre lens, evidence-grade discipline, one of the four GNL appraisal lenses.
The four GNL lenses, worked on a real study
The method: four lenses, applied in turn
Appraisal at GNL runs through four lenses. Each asks a different question; a claim that survives all four is one we will build on, and a claim that fails one earns a caveat or a refusal.
| Lens | The question it asks | What it catches |
|---|---|---|
| Taleb | Is it robust to the rare, ruinous case, and does the author carry the downside? | Tail risk hidden by an average; advice from people with no skin in the game |
| Goldacre | Out of how many, over how long, and are all the trials on the shelf? | Relative dressed as absolute; publication bias; surrogate endpoints |
| Pemberton | Does it match the life of the person living it? | A trial cohort that is not the real-world patient; numbers that shame |
| Hayes | Is the method sound, and are the assumptions named? | A model sold as the territory; a single fragile data source |
A worked appraisal: a Grade A method that returns low-certainty evidence
Consider a systematic review of artificial-intelligence tutoring in undergraduate health-professions education (Lai 2026), chosen because its method is textbook and its honest conclusion is still cautious.⁴ It was registered in advance, reported to PRISMA standard, used a formal risk-of-bias tool, and pooled 66 randomised trials across 4,911 students. Every upstream box is ticked. Yet the certainty of the evidence (graded with GRADE) came back Low to Very Low on nearly every outcome. Run the four lenses and you can see why, and the discipline transfers cleanly to any diabetes-technology study.
| Lens | What it surfaces in this study |
|---|---|
| Taleb | The pooled average hides the tails. Most trials ran under three weeks; none reached real-world practice change. An average across short, shallow trials says little about the durable case that matters. |
| Goldacre | One feasible sensitivity analysis moved a “significant improvement” to “no significant difference” once high-risk trials were removed. The headline effect was fragile; out-of-how-many and over-how-long both undercut it. |
| Pemberton | The evidence stopped at learner reaction and knowledge; it never reached whether anyone practised differently or any patient was better off. The outcome that matters to a life was not measured. |
| Hayes | The method was sound; the upstream trials were the wrong shape (median 63 participants, allocation concealment adequate in under a quarter). A clean review can only return what its inputs support, and it said so plainly. |
The systematic review is a Grade A design; its GRADE certainty output is Low to Very Low, which is the teaching point, not a flaw. A well-conducted review returns only what the upstream evidence supports.⁴
A strong method does not guarantee strong evidence. When a “promising” review lands, do not stop at the design label; ask what shape the underlying trials were, how long they ran, and whether they measured anything that matters to a person. The same questions catch the over-sold device review and the over-sold supplement study alike.
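A sensitivity analysis of the kind described under the Goldacre lens can be sketched in a few lines. The trial values below are invented for illustration and are not taken from the review, and the pooling is simple inverse-variance fixed-effect pooling, which is an assumption made here for brevity rather than the review's own model.

```python
import math

def pool_fixed_effect(trials):
    """Inverse-variance fixed-effect pooling; returns (estimate, ci_low, ci_high)."""
    weights = [1 / se ** 2 for _, se, _ in trials]
    weighted_sum = sum(w * effect for w, (effect, _, _) in zip(weights, trials))
    estimate = weighted_sum / sum(weights)
    pooled_se = math.sqrt(1 / sum(weights))
    return estimate, estimate - 1.96 * pooled_se, estimate + 1.96 * pooled_se

# Hypothetical trials: (effect size, standard error, high risk of bias?)
trials = [
    (0.80, 0.30, True),
    (0.70, 0.35, True),
    (0.10, 0.20, False),
    (0.05, 0.25, False),
    (0.15, 0.30, False),
]

all_est, all_low, all_high = pool_fixed_effect(trials)

# Sensitivity analysis: drop the high-risk trials and pool again.
low_risk_only = [t for t in trials if not t[2]]
lr_est, lr_low, lr_high = pool_fixed_effect(low_risk_only)

print(f"all trials:    {all_est:.2f} (95% CI {all_low:.2f} to {all_high:.2f})")
print(f"low-risk only: {lr_est:.2f} (95% CI {lr_low:.2f} to {lr_high:.2f})")
```

With these invented numbers the interval for all trials excludes zero, while the interval for the low-risk trials alone crosses it. That is the shape of a fragile headline effect: it depends on the trials least able to support it.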
Where to land: living honestly in the gap
Three more real cases sharpen the habit. Ergodicity (the population-vs-individual gap): DiRECT reported about a 46% one-year remission rate for type 2 diabetes, which is a fact about a hundred people, not a 46% chance for any one of them.⁵ The trial owns the average; the person walks a single path. The trial-vs-real-world gap: hybrid closed-loop systems report 70 to 78% time in range in pivotal trials, and real-world cohorts of the same systems often sit lower, because real users miss carbs, sleep badly, and live their lives. The device is doing its job; the cohort was not the world. And the bottom of any curve is rarely free: the trial that proved tight glucose control works also recorded about three times the rate of severe hypoglycaemia at the lowest targets.⁶ The honest reader names what the evidence forces them to say, and what it cannot, and sets the decision with the care team.
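The population-vs-individual gap can be made concrete with two invented groups. Both are hypothetical and neither describes the DiRECT participants; the point is only that one group rate is compatible with very different individual chances.

```python
from fractions import Fraction

# Hypothetical group 1: everyone has the same 46% chance of remission.
uniform_group = [Fraction(46, 100)] * 100

# Hypothetical group 2: 46 people certain to remit, 54 with no chance at all.
split_group = [Fraction(1)] * 46 + [Fraction(0)] * 54

def group_rate(individual_chances):
    """Expected remission rate for the group as a whole."""
    return sum(individual_chances) / len(individual_chances)

print(float(group_rate(uniform_group)))
print(float(group_rate(split_group)))
```

Both groups would report a 46% remission rate, yet in the second no individual has a 46% chance. A trial's headline figure cannot tell you which kind of group you are in.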
Calibration, not certainty: a band of what the evidence supports, not a single point. The shape of confidence, not a personal probability. Over-claim on the left, dismiss on the right; the honest reader lands in the middle and sets decisions with the care team.
It is the rare, ruinous case that decides a strategy, not the average Tuesday. An average that wins ninety-nine times and loses everything once is a losing strategy over time. Read every study for what it says about the tail, and trust the author who carries the downside of being wrong.
The Taleb lens, robustness to outliers and skin in the game, one of the four GNL appraisal lenses.
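The "wins ninety-nine times, loses everything once" point is a short calculation. The sketch below assumes each round is independent and carries the same 1-in-100 chance of ruin; both are simplifying assumptions, and `survival_probability` is a hypothetical helper name.

```python
def survival_probability(p_ruin_per_round, rounds):
    """Chance of never hitting the ruinous outcome across independent rounds."""
    return (1 - p_ruin_per_round) ** rounds

# Hypothetical strategy: ruin has a 1-in-100 chance on every round.
for rounds in (1, 100, 500):
    chance = survival_probability(0.01, rounds)
    print(f"{rounds} rounds: {chance:.1%} chance of still being in the game")
```

A single round looks safe at 99%. Repeat it a hundred times and the chance of surviving falls to roughly 37%; after five hundred rounds it is under 1%. The average round never changed; the repeated exposure did.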
A model is only as honest as its assumptions. A review is only as strong as the trials it pools; a risk figure only as solid as the cohort behind it. Name what would strengthen the claim, and never sell the model as the territory.
The Hayes lens, technical and methodological rigour, one of the four GNL appraisal lenses.
The whole guide, summarised
Glucose never lies; neither should the study that claims to read it. Ask the kind question, weigh the tails, and live honestly in the gap.
References
Evidence grades A (strongest) to D (editorial or working analysis). Books cited as teaching scaffolds, not as clinical evidence.
1. Schröder FH, et al; ERSPC Investigators. Screening and prostate-cancer mortality in a randomized European study. N Engl J Med. 2009;360(13):1320-1328 (and 2014 follow-up, Lancet. 2014;384:2027-2035). Canonical relative-vs-absolute teaching case (~20% relative reduction, ~1 death prevented per 1,000 over ~9 years). Grade A.
2. Goldacre B. Bad Pharma: How Drug Companies Mislead Doctors and Harm Patients. Fourth Estate, 2012 (reboxetine: six of seven trials unpublished); and I Think You’ll Find It’s a Bit More Complicated Than That. Fourth Estate, 2014 (relative-vs-absolute, the baseline rule). Grade D as cited evidence; Grade A teaching scaffold.
3. Greenhalgh T, Dijkstra P. How to Read a Paper: The Basics of Evidence-Based Medicine and Healthcare, 7th edition. Wiley-Blackwell, 2025 (study-design hierarchy as a rule of thumb, Ch 3). Grade A.
4. Lai NM, Lim YS, Win MT, Bhargava P, Thomas P, Ong QC. The Effectiveness of Artificial Intelligence in Undergraduate Health Professions Education: Systematic Review and Meta-Analysis of RCTs. JMIR Medical Education. 2026;12:e88933. DOI 10.2196/88933. Grade A method, Low-to-Very-Low GRADE output (the worked teaching case). Grade A.
5. Lean MEJ, et al. Primary care-led weight management for remission of type 2 diabetes (DiRECT). Lancet. 2018;391:541-551 (~46% one-year remission); 5-year follow-up, Lancet Diabetes Endocrinol. 2024;12:233-246. Grade A.
6. DCCT Research Group. The effect of intensive treatment of diabetes on the development and progression of long-term complications. N Engl J Med. 1993;329(14):977-986 (about three-fold higher severe hypoglycaemia at the lowest HbA1c). Grade A.
One page, three voices: Jude, Grace, John. Critical appraisal, built on real studies.
