Ossiculoplasty Atlas · Outcomes, Prognosis & Complications · Module 08

8 Applying and Validating Prognostic Scoring Systems

Using Austin-Kartush, OOPS and MERI in practice and judging how well each actually predicts the air-bone gap achieved after ossiculoplasty.

Why score an ear at all?

A prognostic score is a way of turning the messy, individual reality of a diseased middle ear into a single number that can be written down, compared and acted on. The promise is appealing: read off a few features at the microscope, add them up, and obtain an estimate of the air–bone gap (ABG) the reconstruction is likely to leave behind. That estimate feeds three jobs at once: counselling the patient before surgery, planning whether to reconstruct now or stage, and reporting outcomes in a language other surgeons can understand [2001].

But a score is only as good as its agreement with what actually happens. This module is not about any one system in isolation; the Austin–Kartush grades, the Ossiculoplasty Outcome Parameter Staging (OOPS) index and the Middle Ear Risk Index (MERI) each have their own chapter. It is about using them together in practice and about the harder, more important question that follows: when you place a number on an ear, how much should you trust it? The honest answer, as we will see, is “enough to counsel, not enough to promise.”

The three systems on one bench

Before comparing how well they predict, it helps to remember what each of the three workhorses is actually measuring, because they were built to answer slightly different questions.

System	What it captures	Output
Austin–Kartush	Ossicular remnants only — malleus and stapes superstructure	Four types (A–D)
OOPS	Surgery type, malleus, drainage, mucosa, mastoidectomy	Cumulative 0–9
MERI	Otorrhoea, perforation, cholesteatoma, ossicular status, granulation, prior surgery, smoking	Cumulative, banded mild/moderate/severe

Austin–Kartush is the narrowest: it grades only the ossicular chain, sorting ears into four quadrants by whether the malleus handle and the stapes arch survive [1971]. OOPS is a statistically derived cumulative score that deliberately scores the environment the reconstruction must heal in, and famously omits the stapes superstructure and cholesteatoma because they did not survive its multivariate analysis [2001]. MERI, originally from Kartush and later refined by Becvarovski and Kartush to add smoking, is the broadest: it sums a weighted list of disease and host factors and counts both the stapes and cholesteatoma that OOPS discards [1994, 2001]. Black’s SPITE method — Surgical, Prosthetic, Infection, Tissue and Eustachian-tube factors — sits alongside them as an earlier statistical attempt at the same task [1992].

Scoring the same ear three ways

Because the three systems weigh different variables, the same ear can land in very different places on each scale. Consider a revision ear with a present malleus, an absent stapes superstructure, a dry but fibrotic middle ear and no residual cholesteatoma. On Austin–Kartush it is a Type B (malleus present, stapes absent). On OOPS the missing stapes earns no points at all, so its score is driven by the revision, the fibrosis and the malleus, landing it in the intermediate band. On MERI the same absent stapes does add weight, nudging it toward moderate–severe. None of these is “wrong”; they are simply answering different questions. Build an ear below and watch the three readings diverge.

[Interactive scorer. Inputs: malleus handle, stapes superstructure, surgery, middle ear, mucosa, cholesteatoma.]

Austin–Kartush quadrant is set purely by malleus and stapes status. OOPS uses Dornhoffer & Gardner weights (surgery, malleus, drainage, mucosa; mastoid fixed as “none” here) — note it ignores the stapes and cholesteatoma. The MERI column is an illustrative subset (otorrhoea, cholesteatoma, ossicular status, revision, fibrosis) of Kartush’s weighted index, which does count both. The point is that the same ear can read “favourable” on one scale and “high risk” on another. Teaching tool, not the full published instruments. Verified against Dornhoffer & Gardner 2001 and Kartush 1994.

The exercise makes a practical point that trips up trainees: you cannot quote a patient “their score” without naming the system, and you cannot compare two papers’ outcomes unless they used the same instrument. The most striking divergence is the stapes superstructure — central to Austin–Kartush and MERI, invisible to OOPS — which is exactly the kind of disagreement that validation studies exist to adjudicate [2001, 1994].
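The divergence on the stapes can also be shown in a few lines of code. The sketch below is illustrative only and is not the published instruments. It assumes Austin's original lettering (A: malleus and stapes present; B: malleus only; C: stapes only; D: neither) and an OOPS weighting as it is commonly tabulated from Dornhoffer & Gardner (drainage 1, fibrotic mucosa 2, malleus present 1 or absent 2, canal-wall-up 1 or canal-wall-down 2, revision 2). Those weights, the `Ear` class and the function names are assumptions of this example and should be checked against the original paper before any real use.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Ear:
    """Hypothetical container for the findings used in this sketch."""
    malleus_present: bool
    stapes_present: bool
    revision: bool
    fibrotic_mucosa: bool
    drainage: bool = False
    mastoidectomy: str = "none"  # "none", "canal_wall_up" or "canal_wall_down"


def austin_kartush_type(ear: Ear) -> str:
    """Quadrant from malleus handle and stapes superstructure only."""
    if ear.malleus_present:
        return "A" if ear.stapes_present else "B"
    return "C" if ear.stapes_present else "D"


# ASSUMED weights (verify against Dornhoffer & Gardner 2001 before use).
MASTOID_POINTS = {"none": 0, "canal_wall_up": 1, "canal_wall_down": 2}


def oops_score(ear: Ear) -> int:
    """Cumulative OOPS-style score; note the stapes is never consulted."""
    points = 1 if ear.drainage else 0
    points += 2 if ear.fibrotic_mucosa else 0
    points += 1 if ear.malleus_present else 2
    points += MASTOID_POINTS[ear.mastoidectomy]
    points += 2 if ear.revision else 0
    return points


# The worked ear from this section: revision, malleus present, stapes absent,
# dry but fibrotic middle ear, no mastoidectomy.
worked_ear = Ear(malleus_present=True, stapes_present=False,
                 revision=True, fibrotic_mucosa=True)
print(austin_kartush_type(worked_ear), oops_score(worked_ear))  # prints: B 5
```

Flipping `stapes_present` changes the Austin–Kartush letter but leaves the OOPS-style total untouched, which is the disagreement described above.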

What “validated” really means

Surgeons use the word “validated” loosely, but it has a precise meaning, and the distinctions matter when you weigh which index to trust.

Derivation versus external validation. A score built and tested on the same cohort will always flatter itself; the index has been tuned to that data. The real test is whether it holds up in other surgeons’ hands and other populations. OOPS was derived on 200 ears by a single surgeon, then tested externally by independent groups [2001, 2021].
Discrimination versus calibration. Discrimination asks whether the score can rank ears correctly: do higher-scoring ears really do worse? It is usually summarised by the area under the receiver-operating-characteristic curve (ROC AUC), where 0.5 is a coin toss and 1.0 is perfect. Calibration asks whether the predicted gap matches the observed gap in absolute decibels. An index can discriminate yet be poorly calibrated, and vice versa.
Correlation strength. Series that correlate a cumulative score with the achieved ABG report the correlation coefficient. Across the literature these are statistically significant but modest: the score and the outcome move together far more often than chance, yet much of the variability in an individual ear stays unexplained [2009, 2022].
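The difference between discrimination and calibration is easier to see with numbers. The sketch below computes the ROC AUC directly from its rank interpretation (the probability that a randomly chosen poor-outcome ear carries a higher score than a randomly chosen good-outcome ear, ties counting half) and a simple mean calibration error in decibels. All scores and gaps are invented toy values, and the function names are hypothetical.

```python
def roc_auc(scores_poor, scores_good):
    """AUC as the share of (poor, good) pairs ranked correctly; ties count 0.5."""
    wins = 0.0
    for p in scores_poor:
        for g in scores_good:
            if p > g:
                wins += 1.0
            elif p == g:
                wins += 0.5
    return wins / (len(scores_poor) * len(scores_good))


def mean_calibration_error(predicted_db, observed_db):
    """Average of (observed - predicted) gap in dB; 0 means calibrated on average."""
    diffs = [o - p for p, o in zip(predicted_db, observed_db)]
    return sum(diffs) / len(diffs)


# Toy scores (hypothetical): ears with a poor result tend to score higher.
toy_poor = [5, 6, 7, 4]
toy_good = [2, 3, 4, 6]
print(roc_auc(toy_poor, toy_good))  # prints: 0.8125

# Toy gaps (hypothetical): perfectly ranked, yet every prediction is 5 dB too low.
toy_predicted = [15, 20, 25]
toy_observed = [20, 25, 30]
print(mean_calibration_error(toy_predicted, toy_observed))  # prints: 5.0
```

The second toy set is the "discriminates but is poorly calibrated" case: the ordering is perfect, but the absolute predictions are systematically off.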

Keeping these apart prevents the commonest error in this field: treating a statistically significant correlation as if it licensed a precise individual prediction. It does not. A p-value tells you a relationship is real; it says nothing about how tight it is.
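A short calculation shows why significance and tightness come apart. The sketch uses the standard t statistic for a Pearson correlation, t = r·sqrt((n − 2) / (1 − r²)), alongside r² as the share of variance explained. The values r = 0.3 and n = 200 are hypothetical and are not taken from any of the cited series.

```python
import math


def t_statistic(r, n):
    """t statistic for testing a Pearson correlation r from n paired observations."""
    return r * math.sqrt((n - 2) / (1 - r * r))


def variance_explained(r):
    """Share of outcome variance accounted for by a linear relation with the score."""
    return r * r


# Hypothetical example: a modest correlation in a reasonably large series.
r_example, n_example = 0.3, 200
print(f"t = {t_statistic(r_example, n_example):.2f}")
print(f"variance explained = {variance_explained(r_example):.0%}")
```

With these assumed values the t statistic sits well above the usual two-sided 5% threshold of roughly 2, so the relationship is statistically real, yet the score accounts for only about a tenth of the variation in outcome.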

How well each one predicts

When the systems are scored head-to-head on the same ears, a consistent picture emerges. In a study of 526 chronic-otitis-media ears treated with tympanoplasty, every case was scored with both OOPS and MERI and the two were compared on how well they discriminated the hearing outcome. OOPS came out ahead at both three and twelve months, with an ROC area of about 0.64 against roughly 0.55 for MERI, the latter barely better than chance [2021]. The chart sets the two against the 0.5 chance line.

Larger benchmarks tell the same story with the same caveat. The multi-institutional Ear Environment Risk study, the biggest yet at 1,679 ears, confirmed that OOPS out-predicts MERI, though all of the classic indices correlated with outcome only weakly — the study’s own purpose-built scale edged ahead of both [2025]. A single retrospective series that scored the same ears on MERI, SPITE and OOPS found every one of the three significantly but modestly correlated with the achieved ABG, with no single method dominating [2009]. And a systematic review of MERI across middle-ear surgery concluded the same: the cumulative score is a useful, inversely proportional prognostic guide, not a precise predictor [2022].

The practical synthesis is therefore twofold. First, OOPS is the better of the two classic cumulative indices for discriminating hearing outcome. Second, and more important, all of them are only modestly predictive. A rising score reliably tracks a rising mean gap across groups of ears, but the spread within any band is wide. The plot below makes that tension explicit.
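The same tension can be sketched numerically. The example below summarises invented air–bone gaps in three risk bands; the band labels and every value are hypothetical and chosen only to show the pattern of rising means with overlapping ranges.

```python
from statistics import mean, stdev

# Hypothetical post-operative air-bone gaps (dB) grouped by risk band.
toy_gaps_db = {
    "low": [8, 12, 18, 25, 10, 15],
    "intermediate": [12, 20, 28, 16, 30, 22],
    "high": [18, 26, 35, 22, 40, 30],
}


def summarise(gaps_by_band):
    """Return (mean, SD, min, max) of the gap for each band."""
    return {
        band: (mean(gaps), stdev(gaps), min(gaps), max(gaps))
        for band, gaps in gaps_by_band.items()
    }


summary = summarise(toy_gaps_db)
for band, (avg, sd, lowest, highest) in summary.items():
    print(f"{band:>12}: mean {avg:.1f} dB, SD {sd:.1f} dB, range {lowest}-{highest} dB")
```

In this toy data the mean gap climbs from band to band, yet the best ear in the high band (18 dB) does better than the worst ear in the low band (25 dB), which is why the band supports a statement about groups and not a promise to an individual.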

Using the score at the bedside

How, then, should a thoughtful surgeon use these instruments, knowing they are real but loose?

Counsel in bands, never in decibels. The value of a modestly correlated score lives at the level of groups. You cannot tell a patient their gap will be 19 dB, but you can tell the patient whose ear scores in the high-risk band that ears like theirs, on average, end up with a substantially larger gap than a favourable ear, and that staging the reconstruction or accepting more modest goals is reasonable. The score turns a vague sense that “this is a difficult ear” into a defensible, comparable statement [2001, 2021].

Pick one system and use it consistently. Because the same ear reads differently on each scale, the only way to make your own outcomes interpretable, and comparable with the literature, is to score every case with the same instrument and report it. OOPS is the reasonable default given its discrimination, but a unit that already records MERI loses little by continuing, provided it does so for every ear [2021, 2025].

Let the index structure judgement, not replace it. None of these scores captures the microscopic realities that often decide an individual case (fibrotic adhesions, the true mobility of the footplate, round-window patency, the quality of the malleus as an anchor), and several of their inputs carry interobserver subjectivity [2022]. Use the score to standardise counselling and outcome reporting and to flag the high-risk ear for a staged or conservative plan. Do not use it to promise a decibel, to refuse an operation outright, or to override the audiogram and your own eyes at the microscope. Within those limits, validated prognostic scoring is one of the few genuinely evidence-derived instruments the ossiculoplasty surgeon owns [2001, 2025].

Case 8.8

A 47-year-old woman is booked for revision tympanoplasty with ossiculoplasty for recurrent chronic otitis media. At surgery the malleus handle is intact, the stapes superstructure has been eroded away, the middle ear is dry but the mucosa is fibrotic, there is no residual cholesteatoma, and no canal-wall-down cavity is needed. Your resident scores the ear and tells the patient beforehand that 'the scoring system predicts a 14 dB air-bone gap'.

What is the most appropriate correction to make to the resident's counselling?

Self-assessment: Applying and Validating Prognostic Scoring Systems (4 questions)

Question 1 · Foundation

What does a prognostic ossiculoplasty score such as OOPS primarily aim to predict?

Question 2 · Foundation

On which variable do the Austin-Kartush and MERI systems disagree most sharply with OOPS?

Question 3 · Trainee

A score is described as having an ROC area under the curve of 0.55 for predicting hearing success. What does this mean?

Question 4 · Clinician

Given the validation evidence, how should a surgeon best use OOPS or MERI when counselling a patient?
