15 Limits of Classification: Toward International Outcome Comparison
Why no single classification yet enables global outcome comparison in ossiculoplasty, and what an ideal unified middle-ear surgery scheme would require.
Why classification was supposed to help
A classification is a promise. When Wullstein numbered his tympanoplasty types in the 1950s, the promise was that two surgeons anywhere in the world could write down the same operative finding, recognise it in each other’s reports, and so build a shared body of evidence about what works [1956]. Austin made the same promise for ossicular defects, reducing the question to the malleus handle and the stapes superstructure [1971]; Kartush extended it and then went further, trying to turn the finding into a number that predicts the result [1994]. Seventy years on, the promise is only half kept. We have many languages, and that is precisely the problem.
This module steps back from the individual schemes covered earlier in the chapter and asks an uncomfortable question: can we yet compare ossiculoplasty outcomes across centres and countries? The honest answer is: not reliably. Three separate fault lines run through the field. The classifications disagree about what to record; the audiometric reports disagree about how to measure the result; and the prognostic indices, when tested, predict that result only weakly. Understanding these limits is not academic pessimism: it is what tells you how much to trust a published success rate, and what a better system would have to do.
What each scheme can and cannot say
The first fault line is one of scope. The schemes you have met are not rival descriptions of the same thing; they describe different parts of the elephant. Wullstein’s types and the Austin–Kartush grade are anatomical: they code which ossicles survive and nothing else. They are quick, reproducible and useful at the operating microscope, but a type A ear with healthy mucosa and a type A ear that is draining, scarred and previously operated carry the same label and wildly different prospects.
The risk indices (MERI, OOPS and Black’s SPITE) were invented to fix exactly this blind spot by folding the ear environment into the score: mucosal disease, otorrhoea (often graded by Bellucci [1973]), cholesteatoma, perforation, previous surgery and even smoking [1994, 1992, 2001]. But they predict an outcome; they do not standardise how that outcome is measured. And a third, separate instrument, the AAO-HNS reporting guideline, standardises the audiogram but classifies neither the anatomy nor the risk. No single scheme spans all three domains: anatomy, risk and outcome measurement.
The lesson is structural. Because each scheme is partial, a complete description of an ear today requires several of them stapled together: an Austin type, a MERI or OOPS score, and an AAO-HNS-compliant audiogram. There is no agreed way to combine them, so every unit improvises, and the improvisations do not match.
The reporting problem: incomparable audiograms
Even if the classifications agreed, the outcome itself is reported in inconsistent units. The functional result of ossiculoplasty is the postoperative air–bone gap (ABG), but “the ABG” is not one number. Some authors average three frequencies (500, 1000, 2000 Hz); the 1995 AAO-HNS Committee on Hearing and Equilibrium recommended a four-frequency average that adds 3000 Hz; others substitute 4000 Hz for 3000 Hz. Some quote the gap against the preoperative bone line, others against the postoperative bone line. And the definition of success drifts between papers: closure to within 10 dB in one series, 20 dB in another, 30 dB in a third.
These are not pedantic differences. A change from a three- to a four-tone average, or moving the success threshold from 20 to 30 dB, can shift a reported “success rate” by tens of percentage points without a single patient hearing any differently. The scale of the problem has been quantified: in a review of 169 tympanoplasty and ossiculoplasty studies published between 2005 and 2015, not one met all ten of the AAO-HNS reporting criteria, and only about a tenth met seven to nine of them [2017]. When the majority of the literature does not report results in a common, complete format, pooling those results — the whole point of a meta-analysis or an international registry — becomes statistically unsound.
| Reporting choice | How it varies between papers | Why it breaks comparison |
|---|---|---|
| Frequencies averaged | 3-tone (0.5/1/2 kHz) vs 4-tone (adds 3 or 4 kHz) | High-frequency gaps persist after surgery, so adding them worsens the apparent result |
| Bone-conduction reference | Pre-operative vs post-operative bone line | Overclosure or drill-related shifts change the computed gap |
| Success threshold | ABG within 10, 20 or 30 dB | A looser threshold inflates the success rate for identical hearing |
| Follow-up interval | Weeks to years; fibrosis and retraction evolve | Early results flatter techniques that deteriorate late |
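The effect of the first and third rows of the table can be made concrete with a minimal sketch. The four ears below are hypothetical thresholds invented purely for illustration, not data from any cited series, and the function names are assumptions of this example. The sketch computes the ABG as mean air conduction minus mean bone conduction over a chosen frequency set, then counts an ear as a success when the gap is at or below a chosen threshold.

```python
from statistics import mean

# Hypothetical postoperative thresholds (dB HL), invented for illustration only.
# "ac" = air conduction, "bc" = bone conduction, keyed by frequency in Hz.
EARS = [
    {"ac": {500: 25, 1000: 25, 2000: 25, 3000: 45},
     "bc": {500: 10, 1000: 10, 2000: 10, 3000: 15}},
    {"ac": {500: 30, 1000: 30, 2000: 30, 3000: 55},
     "bc": {500: 10, 1000: 10, 2000: 10, 3000: 15}},
    {"ac": {500: 40, 1000: 35, 2000: 30, 3000: 50},
     "bc": {500: 10, 1000: 10, 2000: 10, 3000: 10}},
    {"ac": {500: 50, 1000: 50, 2000: 45, 3000: 60},
     "bc": {500: 15, 1000: 15, 2000: 15, 3000: 20}},
]

THREE_TONE = (500, 1000, 2000)
FOUR_TONE = (500, 1000, 2000, 3000)


def air_bone_gap(ear, freqs):
    """Mean air-conduction threshold minus mean bone-conduction threshold."""
    return mean(ear["ac"][f] for f in freqs) - mean(ear["bc"][f] for f in freqs)


def success_rate(ears, freqs, threshold_db):
    """Fraction of ears whose ABG is at or below threshold_db."""
    closed = sum(1 for ear in ears if air_bone_gap(ear, freqs) <= threshold_db)
    return closed / len(ears)


if __name__ == "__main__":
    for label, freqs in (("3-tone", THREE_TONE), ("4-tone", FOUR_TONE)):
        for threshold in (20, 30):
            rate = success_rate(EARS, freqs, threshold)
            print(f"{label}, ABG <= {threshold} dB: {rate:.0%}")
```

With these invented ears the reported success rate runs from 25% (four-tone average, 20 dB threshold) to 75% (30 dB threshold), although the hearing of each ear is identical in every row.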
Do the risk indices actually predict?
Suppose we set the reporting problem aside and ask only whether the prognostic indices do their core job — predicting the postoperative gap. Here the evidence is sobering. When MERI, OOPS and SPITE are tested head-to-head in the same patients, all three correlate with the actual outcome only weakly. In one tertiary-centre series of 179 ears, the correlations were of the order of r = 0.19–0.27 — statistically significant, but explaining only a small fraction of the variance, and with no index overwhelmingly ahead of the others [2020].
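How small that fraction is follows from squaring the coefficient. Assuming the reported values are ordinary linear correlation coefficients (the source does not name the statistic), the share of variance in the postoperative gap accounted for by an index is $r^2$:

$$
r^2 = 0.19^2 \approx 0.036 \qquad\text{to}\qquad r^2 = 0.27^2 \approx 0.073
$$

On that assumption, each index accounts for roughly 4 to 7 per cent of the variation in outcome, leaving more than nine-tenths unexplained.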
Worse for the dream of a single universal score, the ranking of the indices changes with the population. In a Korean cohort the OOPS index out-predicted MERI (receiver-operating-characteristic area roughly 0.64 versus 0.55 at twelve months), a difference the authors attributed to OOPS capturing the extent of inflammation more fully [2021]. So the “best” index is partly an artefact of which factors happen to drive outcome in a given case mix. An index built on a draining, cholesteatomatous population will look good there and mediocre in a dry, primary-surgery practice. This is why comparative reviews repeatedly conclude that whichever scheme most fully captures the inflammatory and surgical burden of a particular cohort performs best in that cohort [2001]: a moving target, not a constant.
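To read the ROC areas quoted above, it helps to recall that the area equals the probability that a randomly chosen poor-outcome ear carries a higher risk score than a randomly chosen good-outcome ear, so 0.5 is chance and 1.0 is perfect separation. The sketch below computes the area by direct pairwise comparison; the scores are hypothetical values invented for illustration, not figures from the cited cohort, and `roc_auc` is a name chosen for this example.

```python
def roc_auc(scores_poor, scores_good):
    """Area under the ROC curve by pairwise comparison.

    Returns the probability that a randomly chosen poor-outcome ear has a
    higher risk score than a randomly chosen good-outcome ear, counting
    ties as half a win.
    """
    wins = 0.0
    for p in scores_poor:
        for g in scores_good:
            if p > g:
                wins += 1.0
            elif p == g:
                wins += 0.5
    return wins / (len(scores_poor) * len(scores_good))


# Hypothetical risk scores, invented for illustration only.
poor_outcome = [7, 5, 4, 3]   # ears that did not close the gap
good_outcome = [6, 4, 2, 1]   # ears that did

if __name__ == "__main__":
    print(roc_auc(poor_outcome, good_outcome))
```

Read this way, an area near 0.55 means the index ranks a failing ear above a succeeding one only slightly more often than a coin toss would.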
Why no scheme has won
If the limits are this clear, why has the field not simply adopted one system? Several forces keep the languages fragmented. The first is proliferation: rather than validate an existing scheme, groups have tended to publish new ones (Yung’s classification is one of several additions from the 2000s [2003]), each reasonable in isolation but collectively splintering the evidence base. The second is biological irreducibility: outcome depends on mucosal healing, aeration, fibrosis and Eustachian tube function, which evolve over months and resist capture by any fixed intraoperative score, so even a perfect classifier of the ear at surgery cannot fully predict the ear at one year.
The third is the absence of demonstrated inter-observer reliability for many schemes. A grade is only a shared language if two surgeons assign the same value to the same ear, yet for several systems this has rarely been formally tested. The fourth is case-mix dependence, seen above: an index calibrated in one population mis-ranks ears in another. Together these mean that no scheme is wrong so much as incomplete and local: each is a good answer to a slightly different question, which is exactly why none can serve as the single global standard the field needs.
What an ideal unified scheme would require
The constructive question is what a genuinely comparable system would have to deliver. Drawing the threads together, an ideal unified middle-ear surgery classification would need to be multi-axial, coding three things at once that the present schemes split apart.
- The ossicular defect, in the reproducible anatomical shorthand the Austin–Kartush scheme already provides — malleus handle and stapes superstructure present or absent [1971, 1994].
- The ear environment, in the risk-factor vocabulary of MERI, OOPS and SPITE — mucosa, drainage, cholesteatoma, ventilation, revision status — but harmonised into one agreed factor list rather than three competing ones [2001, 1992].
- The audiometric outcome, in a single mandated format: a defined frequency set, a defined bone-conduction reference, an agreed success threshold and a stated follow-up interval, as the AAO-HNS guideline intended but the literature has not adopted [2017].
Two further properties are non-negotiable. The scheme must have demonstrated inter-observer reliability (published agreement statistics, not just face validity), and it must be validated across diverse populations so its predictions hold from a dry primary practice to a draining tertiary referral base. Only then could an audit in one country be laid alongside an audit in another and the difference attributed to surgery rather than to bookkeeping.
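The three axes listed above can be sketched as a single record type. This is an illustrative data structure only, not a published or proposed standard: every class name and field name is hypothetical, the risk-factor list simply mirrors the factors named in the text, and treating two outcomes as comparable only when all four reporting conventions match exactly is a deliberate simplification.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class OssicularDefect:
    """Axis 1: anatomy, in malleus-handle / stapes-superstructure shorthand."""
    malleus_handle_present: bool
    stapes_superstructure_present: bool


@dataclass(frozen=True)
class EarEnvironment:
    """Axis 2: risk factors. Hypothetical list; no harmonised one exists yet."""
    mucosal_disease: bool
    otorrhoea: bool
    cholesteatoma: bool
    poor_ventilation: bool
    revision_surgery: bool


@dataclass(frozen=True)
class AudiometricOutcome:
    """Axis 3: the hearing result plus the conventions used to compute it."""
    frequencies_hz: Tuple[int, ...]
    bone_reference: str            # "preoperative" or "postoperative"
    success_threshold_db: float
    follow_up_months: int
    air_bone_gap_db: float

    def conventions(self):
        """The reporting choices that must match before results are pooled."""
        return (self.frequencies_hz, self.bone_reference,
                self.success_threshold_db, self.follow_up_months)

    def is_success(self) -> bool:
        return self.air_bone_gap_db <= self.success_threshold_db


@dataclass(frozen=True)
class MiddleEarRecord:
    defect: OssicularDefect
    environment: EarEnvironment
    outcome: AudiometricOutcome

    def comparable_with(self, other: "MiddleEarRecord") -> bool:
        """True only if both outcomes were computed under the same conventions."""
        return self.outcome.conventions() == other.outcome.conventions()


# Two hypothetical records that differ only in the frequency set averaged.
defect = OssicularDefect(malleus_handle_present=True,
                         stapes_superstructure_present=False)
environment = EarEnvironment(False, False, False, False, False)
record_a = MiddleEarRecord(defect, environment, AudiometricOutcome(
    (500, 1000, 2000), "postoperative", 20.0, 12, 15.0))
record_b = MiddleEarRecord(defect, environment, AudiometricOutcome(
    (500, 1000, 2000, 3000), "postoperative", 20.0, 12, 18.75))

if __name__ == "__main__":
    print(record_a.comparable_with(record_b))
```

Storing the conventions alongside the result is the design point: the record itself states whether two gaps were measured in the same units, rather than leaving a later reader to infer it.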
Until such a system exists and is adopted, the practical discipline for the clinician is defensive. Record the ossicular type, the risk factors and an AAO-HNS-style audiogram in full, so your data can be re-pooled however the consensus eventually falls. Read published success rates with the reporting conventions in mind, asking which frequencies, which threshold and which follow-up produced them. And quote outcomes to patients from cohorts that resemble their ear, not from the most flattering series in the literature. The limits of classification are real, but knowing them is itself a form of rigour, and the first step toward the unified language that genuine international comparison still awaits.
What is the principal reason these two reports cannot be directly compared, despite describing very similar ears?
Why does an Austin–Kartush ossicular grade, on its own, fail to predict the hearing result of an ossiculoplasty?
A meta-analysis finds it cannot pool ossiculoplasty hearing results from different papers. Which methodological problem is most often responsible?
Head-to-head studies comparing MERI, OOPS and SPITE in the same cohorts have generally shown what?
What would an ideal unified middle-ear surgery classification need to provide that no current single system does?