Observed over expected (O/E) analysis is commonly misapplied to performance comparisons. Please don’t.

A few years ago, I had a shocking and enlightening discussion about analytic methods with a group of epidemiologists and biostatisticians from Blue Cross Blue Shield of Michigan.

PGIP Quarterly Meeting in Lansing

We were sitting around a table at a conference center in Lansing, where we were debriefing from a meeting of the Physician Group Incentive Program. We were talking about the methods for performance comparison. Everyone knew that we needed to “risk adjust” to take into account differences in patient characteristics when comparing the performance of different physicians, clinic practice units, and physician organizations. If we failed to properly risk adjust, the poorly performing providers would surely argue “my patients are sicker.”

Traditional Risk Adjustment using Standardization

When epidemiologists want to compare the performance of two health care providers on a level playing field, the traditional method of risk adjustment is an approach called standardization.  The concept is to determine which patient or environmental variables influence the outcome of interest.  These are called confounding variables, because differences in the mix of patients with respect to these variables can confound the performance comparison unless they are taken into consideration.  Examples of such confounding variables include age, gender, and the presence of co-morbid conditions.  If any of the confounding variables are continuous, like age, the epidemiologist must first convert them to discrete categories, or groupings.  For example, if age were the confounding variable, the epidemiologist might define categories for “adults” and “children.”  Or, the epidemiologist might break age into ten-year ranges.  If there is more than one confounding variable, the categories are defined based on the combinations of values, such as “adult females,” “adult males,” etc.  These categories are sometimes called “risk strata” or “risk cells.”  Then, for each of these categories, for each of the providers being compared, the epidemiologist calculates the outcome measure of interest, such as the mortality rate or the total cost of care.  The result is a matrix of measure values for each risk cell for each provider.  This matrix can be conceptualized as a “model” of the actual performance of each provider.

To create a level playing field for comparisons, the epidemiologist then creates a standardized population.  The standardized population is simply the number or percentage of patients in each of the risk cells.  Then, the model of each provider’s performance is applied to that standardized population to determine what the outcomes would have been if that provider had cared for the standardized mix of patients.  For each of the risk cells, the standardized number of patients is multiplied by the actual performance that provider achieved for such patients.  Then, the results for all the risk cells are aggregated to obtain that provider’s risk-adjusted performance. Another way of thinking about this is that the risk adjusted outcome is the weighted average outcome for all the risk cells, where the weights are the proportion of patients in that risk cell in the standardized, level-playing-field population.  If the provider’s actual mix of patients was “sicker” or “higher risk” than the standardized population, then the risk adjusted outcome will be more favorable than the unadjusted outcome.
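The arithmetic of direct standardization can be sketched in a few lines of Python. All of the risk cells, weights, and rates below are hypothetical, chosen only to illustrate the weighted-average calculation:

```python
# Direct standardization as a weighted average over risk cells.
# All numbers below are hypothetical, for illustration only.

# Standardized population: the share of patients in each risk cell.
std_weights = {"adult female": 0.35, "adult male": 0.35,
               "girl": 0.15, "boy": 0.15}

# One provider's actual observed mortality rate in each risk cell.
provider_rates = {"adult female": 0.08, "adult male": 0.11,
                  "girl": 0.02, "boy": 0.03}

# Apply the provider's performance model to the standardized mix.
adjusted = sum(std_weights[cell] * provider_rates[cell]
               for cell in std_weights)
print(f"Risk-adjusted mortality rate: {adjusted:.4f}")  # 0.0740
```

The same calculation is repeated for each provider against the same standardized weights, which is what makes the resulting numbers comparable.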

“Observed Over Expected” Analysis

In the literature, even in respected journals, I have seen many examples of performance comparisons that used a different analytic approach called “observed over expected” or “O/E,” rather than the traditional standardization approach.  A recent example is a paper regarding the mortality-avoidance performance of children’s hospitals.  Just as with standardization, the O/E method begins by identifying confounding variables — the patient characteristics that are predictors of the outcome of interest.  With O/E analysis, confounding variables that are continuous, like age, do not have to be converted to discrete categories or groupings.  All the confounding variables are used as independent variables in a regression model.  Then, the resulting regression model is applied to each individual patient observation, inserting the values of the predictor variables for that patient into the regression formula to obtain the “expected” value of the outcome of interest.  At that point, you have the actual observed value and the expected value for each patient (or case).  Then, you sum up the observed values for all the patients for a given provider.  And, you sum up the expected values for all the patients for a given provider.  Finally, you divide the sum of observed values by the sum of expected values to get the O/E ratio.  If the ratio is greater than one, that is interpreted to mean that the provider has a higher-than-expected value for that outcome.  If the outcome variable is undesirable, such as mortality rate, complication rate or cost, an O/E ratio greater than one is interpreted to mean that the provider performed poorly compared to the other providers.  People have been routinely using O/E analysis as if it were a way of doing risk-adjusted performance comparisons — as a way of “leveling the playing field” to do performance comparisons that take into account differences in patient characteristics.
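The mechanics can be sketched as follows. This is a toy illustration with synthetic data, not the model from any published study; for simplicity, the “expected” model here is an ordinary least-squares regression of the outcome on age, fit to all patients pooled:

```python
import numpy as np

# Synthetic data (purely illustrative): age in years, death = 1 if
# the patient died, provider = which provider treated the patient.
rng = np.random.default_rng(0)
age = rng.uniform(1, 90, size=500)
death = (rng.random(500) < 0.02 + 0.002 * age).astype(float)
provider = rng.choice(["A", "B"], size=500)

# The "expected" model: a regression fit on all patients pooled,
# using age as a continuous predictor (no binning needed for O/E).
b1, b0 = np.polyfit(age, death, deg=1)
expected = b0 + b1 * age

# Each provider's O/E ratio: sum of observed outcomes divided by
# sum of model-expected outcomes over that provider's own patients.
oe = {}
for p in ("A", "B"):
    mask = provider == p
    oe[p] = death[mask].sum() / expected[mask].sum()
    print(f"Provider {p}: O/E = {oe[p]:.2f}")
```

Note that summed over all patients together, observed equals expected by construction (a property of least-squares fitting with an intercept), so the ratios only apportion credit and blame relative to the pooled model, applied to each provider's own mix of patients.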

But, sitting around the table in Lansing, I was shocked to realize that O/E analysis is not actually applicable for this purpose.  Why? Because O/E analysis does not actually create a level playing field.

On the contrary, O/E analysis is conceptually the opposite of the traditional standardization approach.  In traditional standardization, a model of each provider’s actual performance is applied to a standardized population.  In O/E analysis, the regression model is essentially a model of typical performance.  That regression model is applied to the actual population that received care from a particular provider.  The problem is that different providers can see a different mix of patients.  Consider the following simple calculations.

In this simple example, we are comparing just two providers.  We are considering just one confounding variable, age.  And, we are breaking that variable into just two categories, adults and children.  As shown in the example, provider A sees mostly adults, while provider B sees mostly children.  Provider B performs poorly in those children, but actually performs better than A in the adult population.  Because provider B sees more children, the poor performance in children dominates the O/E calculation, so provider B looks bad, with an O/E ratio of 1.09.  But, since there are more adults than children in the overall population, which we are using as the “standardized” population, provider B’s superior performance in adults dominates the risk adjustment.  So, provider B has a risk-adjusted mortality rate that is better than provider A’s.  In other words, if you use O/E analysis for level-playing-field performance comparisons, you may get the wrong answer.
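The same phenomenon can be reproduced with a short calculation. The numbers below are made up for illustration (so the ratios differ from the 1.09 in the example above), but the structure is the same: provider B sees mostly children and underperforms in that stratum, yet outperforms provider A in adults:

```python
# Hypothetical numbers, for illustration only: one confounder (age),
# two risk strata (adults and children).

# "Expected" per-stratum mortality rates, standing in for the
# typical-performance regression model used in O/E analysis.
expected = {"adult": 0.10, "child": 0.10}

# Each provider's panel: stratum -> (number of patients, observed rate).
panels = {
    "A": {"adult": (90, 0.10), "child": (10, 0.10)},
    "B": {"adult": (20, 0.05), "child": (80, 0.15)},
}

# Standardized population: the combined panel of both providers
# (110 adults, 90 children -- mostly adults).
std_pop = {s: sum(panels[p][s][0] for p in panels) for s in expected}
total = sum(std_pop.values())

results = {}
for p, cells in panels.items():
    observed = sum(n * rate for n, rate in cells.values())
    exp_deaths = sum(n * expected[s] for s, (n, _) in cells.items())
    adjusted = sum(std_pop[s] * cells[s][1] for s in std_pop) / total
    results[p] = (observed / exp_deaths, adjusted)
    print(f"Provider {p}: O/E = {results[p][0]:.2f}, "
          f"risk-adjusted mortality = {results[p][1]:.3f}")
```

With these numbers, provider B gets an O/E ratio of 1.30 versus 1.00 for A, yet a risk-adjusted mortality rate of 9.5% versus A’s 10.0%.  B’s poorly-served children dominate its own panel, and hence the O/E ratio; but adults dominate the standardized population, where B’s superior adult performance wins out.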

Risk adjustment using standardization is not computationally difficult.  But, it does require a bit of programming work to convert continuous variables such as age into categorical variables, such as breaking age into ten-year ranges.  In my opinion, O/E analysis should not be used for risk-adjusted performance comparisons.  O/E analysis does have the advantage of being computationally more convenient, particularly when you have many confounding variables and when some of them are continuous variables.  But, it is not that much more work to do it right.
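That bit of programming work is genuinely small.  A hypothetical helper for the binning step (not from any particular package) might look like this:

```python
# Convert a continuous age into a ten-year categorical risk stratum,
# as direct standardization requires. Hypothetical helper function.
def age_stratum(age: float, width: int = 10) -> str:
    lo = int(age // width) * width
    return f"{lo}-{lo + width - 1}"

print(age_stratum(7))     # 0-9
print(age_stratum(34))    # 30-39
print(age_stratum(65.5))  # 60-69
```

With multiple confounders, the risk cell is simply the tuple of each variable's category (e.g., `("30-39", "female")`).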


5 Responses to Observed over expected (O/E) analysis is commonly misapplied to performance comparisons. Please don’t.

  1. Darline El Reda says:

I remember this conversation very well and was quite surprised to learn about the frequent use of O/E metrics when, in reality, what we need are risk adjusted metrics. A must-have conversation for anyone in the business of assessing and comparing performance.

  2. L. G. Hunsicker, M.D., Professor of Medicine, U. Iowa College of Medicine says:

I am not a (formal) statistician, so I respond to this analysis with some temerity. It seems to me that Dr. Ward’s numerical analysis is correct, but that his conclusion that O-E should not be used to evaluate performance of hospitals/practitioners may not be the end of the discussion. The issue, at root, is over which patient population one wants to evaluate a physician’s/institution’s performance – the average population overall, or the average population that the physician/institution sees. The problem arises from the fact that there may be an interaction between physician performance and the characteristics of the patient population. In his conundrum, Dr. Ward’s Provider A sees mostly adults, but does less well with these patients than expected, but better with children. Conversely, Provider B sees mostly children, does less well with the children than expected, but better with adults. (This may be the reverse of what might have been expected. We might have guessed that each provider would have done better on average with the patients that he sees more of – the pediatrician with children and the internist with adults.) But we want an “overall” measure of performance, not a separate evaluation for each patient population. So we have to construct a weighted average performance – and here is the basis for the difference between Dr. Ward’s two computations. If we weight the average performance by the characteristics of the general population – asking the question how well each practitioner would do if (s)he saw a patient population typical of the population as a whole – we get Dr. Ward’s preferred answer. But we may prefer to weight the average by the patient population that the provider actually sees – in essence asking how his/her performance compares with a hypothetical/counterfactual average provider seeing the kind of patients that the practitioner being evaluated actually sees. In that case we get the evaluation that Dr. Ward doesn’t like.
Two different questions. Two different answers. Which question is the relevant one? I think that many would say that the relevant question is how well does each practitioner perform in caring for the kind of patients (s)he generally sees. Consider one likely consequence of taking Dr. Ward’s point of view, and consider the likely situation with a pediatrician who sees mostly children and whose care of children is superior. But his/her care of adults (whom (s)he rarely sees) is less good. Because the patient population at large has more adults than children, the pediatrician’s good outcomes with children would be down-weighted and his/her less good outcomes with adults would be weighted more heavily. If the reverse is true for the internists, in this comparison the pediatricians are likely, as a class, to look less good than the internists. (Incidentally, I am an internist, not a pediatrician.) I personally think that this approach does more damage to my sense of fairness than the alternative of evaluating each provider relative to the patients that (s)he sees. I would agree with Dr. Ward that this does not permit an overall ranking of all providers without reference to special expertise. Each provider is only reasonably compared with other providers caring for similar patients. But then, I suspect that most will think it a somewhat irrelevant exercise to rate providers on their (hypothetical) outcomes with patients that they don’t actually see very often.

  3. Dr. Ward says:

    Dr. Hunsicker, thank you for taking the time to submit your thoughtful post. I agree with your point that it would be inappropriate to try to use standardization to compare the performance of a pediatrician to the performance of an internist. But, the problem with such a comparison is not because standardization is not asking the right question. The problem is more fundamental.

    Before doing any comparisons of providers’ performance, the analyst must first be comfortable that the providers being compared are from the same population. Risk adjustment is only appropriate for the purposes of adjusting for relatively small differences in the mix of characteristics of the members of the population of providers. Since pediatricians and internists focus on non-comparable populations of patients, they are different populations of physicians. It is not appropriate to use risk adjustment to try to extrapolate results from one population to another population or to draw conclusions from comparisons of different populations.

Unfortunately, there are no formal methods or established criteria (at least none that I’ve ever heard of) for making this determination of same vs. different population. This is primarily determined by human judgment and convention. However, when doing risk adjustment, I like to check the difference between adjusted and unadjusted measures. If the difference between the two is more than about 20%, it makes me suspicious that we are dealing with different populations, and not merely adjusting for a different mix within the same population.

If everyone agrees that we are comparing performance of providers within the same provider population, then I still feel that O/E analysis is not an appropriate method for the purpose of making fair, level-playing-field comparisons of provider performance, since the measured differences between providers are still potentially confounded by differences in the mix of patient characteristics, as demonstrated in the example calculations. As I pointed out in a subsequent post, methods analogous to O/E are useful for identifying and prioritizing opportunities for improvement for each provider.

  4. VC says:

    If you constrain risk adjustment to similar physicians, and define “similar” as meaning that the O/E result is close to the direct risk adjustment result, is O/E still wrong? You have made sure these 2 methods are giving very similar answers when you limit the comparison this way. It seems that if you apply this constraint, then O/E is an approximation to direct standardization.

One issue with direct standardization is that it tends to break down when you look at smaller numbers of patients. If you have physicians that overall treat similar patients, but then look at the care of a specific condition, you start to get different populations, if only by chance. Also, once you get to smaller numbers, the direct standardization score for a physician can be dominated by a cell with a small number of patients, with more populated cells contributing little to the physician’s score. If a physician treats 100 patients, but his score is dominated by 2 of these, this seldom seems fair to a provider. They may feel “punished” by taking on sicker patients (even if they are only sicker within some subgroup). Also, focusing the provider’s attention on a small sub-group of patients doesn’t do much to improve quality or reduce costs in the big picture. The direct standardization may tell you of the opportunity for improvement if the physician treated the average population – this may be very different from an improvement opportunity in reality.

    I think the purpose of the comparison has to be taken into account also. Do you really want a definite answer to the question “is physician A better than B?” or do you want to get an idea of whether a given physician has opportunities for improvement, and an idea of the magnitude of those opportunities?

    I am not a statistician either, but I have seen very strange results with direct standardization in some cases, with a handful of patients dominating scores, especially when you get a high-utilizing young male or two by chance in your group. This can happen even when overall patient counts are pretty high for a provider. As a result of this I struggle with the “right” adjustment.

  5. Dr. Ward says:


    Sorry I have not been keeping up with replies to comments. Thanks for taking the time to write such a thoughtful reply. Some thoughts for your consideration:

    1) Regarding the scenario when O/E and direct standardization give similar answers, and if one agrees that direct standardization is the “right” answer, then in that scenario, O/E happened to be “right” too. But I don’t see the point of using O/E as an approximation, since the programming for computing direct standardization is not very complex. Why approximate, when you can have the real thing?

    2) Regarding the scenario when one of the risk cells in direct standardization dominates the risk adjusted value, I don’t see that as a problem caused by direct standardization, nor is it remedied by doing O/E (indirect standardization). Remember that risk adjustment is about dealing with potential sources of bias caused by the presence of confounding variables. The problem of a few outlier observations dominating the population-level measure is about variability, rather than bias. There are different methods to address variability such as (a) increasing your sample size, (b) considering censoring or winsorization of outlier observations, (c) applying tests of statistical significance of differences when doing comparisons, etc.

    3) Regarding the need to consider the purpose of the analysis to select the correct method, I enthusiastically agree. As I explained in the subsequent blog post (http://rewardhealth.com/archives/1747 ), opportunity analysis is a good use for O/E analysis. (In fact, it is the only use that I know of.)


    Rick Ward

