“Alarmingly high” number of interpretive errors and inconsistencies in MRI examinations


Magnetic resonance imaging (MRI) results play a crucial role in patient diagnosis and treatment, as well as in the decisions of insurance companies and other payers to approve procedures. A study conducted by Hospital for Special Surgery (HSS, New York City, USA) and published in The Spine Journal has found “marked variability” and an “alarmingly high” number of interpretive errors among 10 distinct reports on MRI examinations of one patient’s lower back, performed over a period of three weeks.

“The study subject was a 63-year-old woman with a history of low back pain and right L5 radicular symptoms,” noted Richard Herzog, the principal investigator and executive director of the Spreemo Health Quality Research Institute (New York City, USA). As well as experiencing weakened toe and ankle reflexes, she “had a positive dural tension (seated slump) sign on the right.”

Twelve MRI examinations of the patient’s lumbar spine were performed; 10 at separate accredited imaging centres within or near New York City, and two control MRIs at HSS to establish reference findings and ensure that no changes in the patient’s findings occurred over the course of the study. The imaging centres were blinded to their participation in the clinical study.

Review of the reference MRI examinations was performed by Richard Herzog, who also serves as director of Spine Imaging at HSS, and Adam Flanders, director of Neuroradiology at Thomas Jefferson University Medical Centre, Philadelphia, USA.

Aside from prior discussion and agreement with respect to the grading of stenosis, the reviewers independently read the reference HSS MRI examinations to establish the reference findings. In this process, three minor disagreements were observed between Herzog and Flanders relating to “the severity of neural foraminal stenosis,” but the two were in agreement on the remainder of the findings.

“An error in interpretation in a study exam was considered present if there was no mention in the report of a reference finding,” the authors wrote of the comparison between reference MRIs and those gathered at separate imaging centres. “Any positive finding reported in a study examination that was not present in the reference finding was also considered an error in interpretation.” No true positive finding was observed in a study centre report which was not found in the reference examinations.
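The two error definitions quoted above amount to a set comparison between the reference findings and the findings in a study report. The following is a minimal sketch of that comparison, not the authors' own method; the function name `tally_errors` and the example findings are hypothetical and are not taken from the patient's reports.

```python
def tally_errors(reference, reported):
    """Compare one study report against the reference findings.

    Each finding is a (pathology, motion segment) pair. A reference
    finding absent from the report counts as a miss; a reported finding
    absent from the reference counts as an over-call.
    """
    reference = set(reference)
    reported = set(reported)
    missed = reference - reported        # false negatives
    overcalled = reported - reference    # false positives
    return {
        "true_positives": reference & reported,
        "missed": missed,
        "overcalled": overcalled,
        "error_count": len(missed) + len(overcalled),
    }


# Hypothetical findings, for illustration only.
reference_findings = {
    ("disc herniation", "L4-L5"),
    ("anterior spondylolisthesis", "L4-L5"),
    ("central canal stenosis", "L3-L4"),
}
report_findings = {
    ("anterior spondylolisthesis", "L4-L5"),
    ("central canal stenosis", "L5-S1"),
}

result = tally_errors(reference_findings, report_findings)
print(result["error_count"])
```

In this made-up report, two reference findings are missed and one finding is over-called.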

To avoid over-reporting errors, the authors took several measures, including recording stenosis severity discrepancies only when they spanned two grades or more (for example, “severe” in a control report and “mild” in a centre report).

Forty-nine distinct findings were reported among the 10 examinations “related to the presence of a distinct pathology at a specific motion segment.” Not one of these findings was unanimously noted in all 10 reports, with only one finding—anterior spondylolisthesis—found in nine of the 10. Almost a third of interpretive findings (32.7%) appeared only once across all study reports.

“The fact that no interpretive finding was reported unanimously by the radiologist at all centres and that one-third of all reported findings only appeared once across all 10 study examination reports indicates that there is at best significant difference in the standards employed by radiologists…and at worst significant prevalence of interpretive errors,” the authors wrote.

The authors used the Fleiss kappa statistic to measure inter-rater agreement (1 = perfect agreement; >0.75 = excellent agreement; >0.4 to ≤0.75 = intermediate to good agreement; >0 to ≤0.4 = poor agreement; ≤0 = agreement no better than chance). Agreement across interpretive findings in all 10 reports was poor, at 0.2±0.03. Giving an example of the extent of the discrepancies between the reports, the authors wrote, “The number of examinations reporting the presence of a disc herniation at a given motion segment ranged from 70% at L3–L4 to 20% at L5–S1; two examinations reported a disc herniation at all five motion segments and one examination did not report a disc herniation at any motion segment.”
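For readers unfamiliar with the statistic, the sketch below shows the standard Fleiss kappa calculation together with the interpretation bands listed above. It is an illustration only: the function names and the small ratings table are hypothetical, and the table is not the study's data.

```python
def fleiss_kappa(table):
    """Fleiss kappa for table[i][j] = number of raters who placed
    item i in category j. Every item needs the same number of raters."""
    n_items = len(table)
    n_raters = sum(table[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in table):
        raise ValueError("each item needs the same number (>= 2) of ratings")
    n_categories = len(table[0])
    total = n_items * n_raters

    # Share of all ratings that fall in each category.
    p_j = [sum(row[j] for row in table) / total for j in range(n_categories)]
    # Observed agreement among rater pairs, per item.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ]
    p_bar = sum(p_i) / n_items          # mean observed agreement
    p_e = sum(p * p for p in p_j)       # agreement expected by chance
    if p_e == 1:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1 - p_e)


def interpret_kappa(kappa):
    """Map kappa to the bands quoted in the article."""
    if kappa >= 1:
        return "perfect"
    if kappa > 0.75:
        return "excellent"
    if kappa > 0.4:
        return "intermediate/good"
    if kappa > 0:
        return "poor"
    return "no better than chance"


# Hypothetical table: four findings, 10 reports each,
# columns are [reported, not reported].
example = [[9, 1], [7, 3], [2, 8], [1, 9]]
kappa = fleiss_kappa(example)
print(round(kappa, 3), interpret_kappa(kappa))
```

Each row here treats one finding as an item and each of the 10 reports as a rater, which is one plausible way to set up the calculation; the article does not describe how the authors structured theirs.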

In addition to high levels of variability among the MRI examinations, the authors found a “high rate of interpretive errors” when comparing the separate centre examinations with the reference examinations. Analysis of interpretive false negatives revealed “miss” rates ranging from 10% (one instance of anterior spondylolisthesis; found by 9/10 centres) to 72.5% (four instances of nerve root involvement; 11 correctly reported among the findings, 3 incorrectly reported, 29 missed).
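The two quoted miss rates can be reproduced with simple arithmetic, on the assumption that each reference instance counts once per study report (so four instances across 10 reports give 40 opportunities, which matches the 11 correct plus 29 missed). The helper name `miss_rate` is hypothetical.

```python
def miss_rate(reference_instances, n_reports, missed):
    """Fraction of detection opportunities that were missed.

    Assumes every reference instance could have been reported once
    in each study report.
    """
    opportunities = reference_instances * n_reports
    if not 0 <= missed <= opportunities:
        raise ValueError("missed must lie between 0 and the number of opportunities")
    return missed / opportunities


N_REPORTS = 10

# One instance of anterior spondylolisthesis, missed by one centre.
spondylolisthesis = miss_rate(1, N_REPORTS, missed=1)
# Four instances of nerve root involvement, 29 misses in total.
nerve_root = miss_rate(4, N_REPORTS, missed=29)

print(f"{spondylolisthesis:.1%}")
print(f"{nerve_root:.1%}")
```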

The average interpretive error count was 12.5±3.2 per examination. More pathologies were missed (average 10.9±2.9 per examination) than over-reported (average 1.6±0.9 per examination), giving an average true-positive rate of 56.4±11.7% and an average false-negative rate of 43.6±11.7%. According to the authors, the high false-negative rate suggests that “important pathologies [such as disc herniation] are routinely underreported,” while “high false-positive rates for specific pathologies indicate that diagnostic findings, such as central canal stenosis, may be routinely overcalled.”

Whilst spinal surgeons may be better able to interpret MRI examinations themselves and act on them with less reliance on the interpretive report, the authors wrote, physicians involved in the acute stages of caregiving, such as general practitioners, may not be so well trained. This can lead to diagnoses and treatment plans based on “inaccurate MRI interpretation, resulting in incorrect treatment recommendations, delayed recovery, or poor outcome,” as well as an “over-reliance” on MRI reports in the case of less invasive procedures.

“It would be an omission,” the authors wrote, “not to add that the payer community heavily relies upon MRI reports during utilisation and authorisation review procedures.” Errors, they stated, can “significantly delay authorisation of appropriate care.”

The authors acknowledge that their study is limited in a number of ways. For example, only one MRI examination was performed at each centre, so a single report cannot be taken as representative of that centre’s overall performance. The use of a single patient also means that the results may not be fully generalisable.

Concluding, the authors note that their study “highlights critical issues” and provides “novel insights and perspective.” To reduce variability and error at the crucial stage of diagnosis, the authors state that “broad acceptance of the prevalence of errors and their potential impact on care is a critical first step toward [a] system capable of providing industry-wide, standardised measurement of diagnostic MRI quality.”

