Observed over expected (O/E) analysis is commonly misapplied to performance comparisons. Please don’t.

A few years ago, I had a shocking and enlightening discussion about analytic methods with a group of epidemiologists and biostatisticians from Blue Cross Blue Shield of Michigan.

[Photo: PGIP Quarterly Meeting in Lansing]

We were sitting around a table at a conference center in Lansing, where we were debriefing from a meeting of the Physician Group Incentive Program. We were talking about the methods for performance comparison. Everyone knew that we needed to “risk adjust” to take into account differences in patient characteristics when comparing the performance of different physicians, clinic practice units, and physician organizations. If we failed to properly risk adjust, the poorly performing providers would surely argue “my patients are sicker.”

Traditional Risk Adjustment using Standardization

When epidemiologists want to compare the performance of two health care providers on a level playing field, the traditional method is to do risk adjustment using an approach called standardization. The concept is to determine which patient or environmental variables influence the outcome of interest. These are called confounding variables, because differences in the mix of patients based on these variables can confound the performance comparison unless they are taken into consideration. Examples of such confounding variables include age, gender, and the presence of co-morbid conditions.

If any of the confounding variables are continuous numbers, like age, the epidemiologist must first convert them to discrete categories, or groupings. For example, if age were the confounding variable, the epidemiologist might define categories for “adults” and “children.” Or, the epidemiologist might break age into ten-year ranges. If there is more than one confounding variable, the categories are defined based on the combinations of values, such as “adult females,” “adult males,” etc. These categories are sometimes called “risk strata” or “risk cells.”

Then, for each of these categories, for each of the providers being compared, the epidemiologist calculates the outcome measure of interest, such as the mortality rate or the total cost of care. The result is a matrix of measure values for each of the risk cells for each provider. This matrix can be conceptualized as a “model” of the actual performance of each provider.

To create a level playing field for comparisons, the epidemiologist then creates a standardized population. The standardized population is simply the number or percentage of patients in each of the risk cells. Then, the model of each provider’s performance is applied to that standardized population to determine what the outcomes would have been if that provider had cared for the standardized mix of patients. For each of the risk cells, the standardized number of patients is multiplied by the actual performance that provider achieved for such patients. Then, the results for all the risk cells are aggregated to obtain that provider’s risk-adjusted performance. Another way of thinking about this is that the risk adjusted outcome is the weighted average outcome for all the risk cells, where the weights are the proportion of patients in that risk cell in the standardized, level-playing-field population. If the provider’s actual mix of patients was “sicker” or “higher risk” than the standardized population, then the risk adjusted outcome will be more favorable than the unadjusted outcome.
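The weighted-average view of standardization described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the two risk cells, and all counts and rates are invented for the example and are not taken from any real provider.

```python
def direct_standardized_rate(cell_rates, cell_counts):
    """Weighted average of cell-specific rates.

    cell_rates:  {risk_cell: outcome rate the provider achieved in that cell}
    cell_counts: {risk_cell: number of patients in that cell of the
                  population used as weights}
    """
    total = sum(cell_counts.values())
    return sum(cell_rates[cell] * n / total for cell, n in cell_counts.items())


# Hypothetical numbers, invented for illustration only.
rates_a = {"adult": 0.04, "child": 0.01}        # provider A's actual performance
actual_mix_a = {"adult": 1800, "child": 200}    # provider A's own patients
standard_mix = {"adult": 1900, "child": 1100}   # standardized population

# Weighting by the provider's own mix gives the unadjusted rate;
# weighting by the standardized population gives the risk-adjusted rate.
unadjusted_a = direct_standardized_rate(rates_a, actual_mix_a)
adjusted_a = direct_standardized_rate(rates_a, standard_mix)

print(f"unadjusted: {unadjusted_a:.2%}, risk-adjusted: {adjusted_a:.2%}")
```

In this made-up case provider A's own patients are weighted toward the higher-risk cell, so the risk-adjusted rate comes out more favorable than the unadjusted rate, which is the behavior described in the paragraph above.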

“Observed Over Expected” Analysis

In the literature, even in respected journals, I have seen many examples of performance comparisons that used a different analytic approach called “observed over expected” or “O/E,” rather than using the traditional standardization approach. A recent example is a paper regarding the mortality-avoidance performance of children’s hospitals. Just as with standardization, the O/E method begins by identifying confounding variables, the patient characteristics that are predictors of the outcome of interest. With O/E analysis, confounding variables that are continuous, like age, do not have to be converted to discrete categories or groupings. All the confounding variables are used as independent variables in a regression model.

Then, the resulting regression model is applied to each individual patient observation, inserting the values of the predictor variables for that patient into the regression formula to obtain the “expected” value of the outcome of interest. At that point, you have the actual observed value and the expected value for each patient (or case). Then, you sum up the observed values for all the patients for a given provider, and you sum up the expected values for all the patients for that provider. Finally, you divide the sum of observed values by the sum of expected values to get the O/E ratio.

If the ratio is greater than one, that is interpreted to mean that the provider has a higher-than-expected value for that outcome. If the outcome variable is undesirable, such as mortality rate, complication rate or cost, an O/E ratio of greater than one is interpreted to mean that the provider performed poorly compared to the other providers. People have been routinely using O/E analysis as if it were a way of doing risk-adjusted performance comparisons, that is, as a way of “leveling the playing field” to do performance comparisons that take into account differences in patient characteristics.
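The summing-and-dividing step of O/E analysis can be sketched as follows. This is a minimal illustration, not anyone's production method: the record layout and the function name are assumptions, and the per-patient expected values are made-up stand-ins for predictions from a regression model that would be fitted separately.

```python
from collections import defaultdict


def oe_ratios(records):
    """Compute an O/E ratio per provider.

    records: iterable of (provider, observed, expected) tuples, one per patient.
      observed is the actual outcome (1 = died, 0 = survived);
      expected is the model-predicted probability for that patient.
    """
    observed_sum = defaultdict(float)
    expected_sum = defaultdict(float)
    for provider, observed, expected in records:
        observed_sum[provider] += observed
        expected_sum[provider] += expected
    return {p: observed_sum[p] / expected_sum[p] for p in observed_sum}


# Hypothetical per-patient records, invented for illustration only.
records = [
    ("A", 1, 0.30), ("A", 0, 0.10), ("A", 0, 0.20), ("A", 0, 0.40),
    ("B", 1, 0.25), ("B", 1, 0.15), ("B", 0, 0.05), ("B", 0, 0.05),
]

ratios = oe_ratios(records)
for provider, ratio in sorted(ratios.items()):
    print(f"provider {provider}: O/E = {ratio:.2f}")
```

Note that each provider's expected total is computed over that provider's own patients, which is the property the rest of this post takes issue with.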

But, sitting around the table in Lansing, I was shocked to realize that O/E analysis is not actually applicable for this purpose. Why? Because O/E analysis does not actually create a level playing field.

On the contrary, O/E analysis is conceptually the opposite of the traditional standardization approach. In traditional standardization, a model of each provider’s actual performance is applied to a standardized population. In O/E analysis, the regression model is essentially a model of typical performance. That regression model is applied to the actual population that received care from a particular provider. The problem is that different providers can see a different mix of patients. Consider the following simple calculations.

In this simple example, we are comparing just two providers. We are considering just one confounding variable, age. And, we are breaking that variable into just two categories, adults and children. As shown in the example, provider A sees mostly adults, while provider B sees mostly children. Provider B performs poorly in those children, but actually performs better than A in the adult population. Because provider B sees more children, the poor performance in children dominates the O/E calculation, so provider B looks bad, with an O/E ratio of 1.09. But, since there are more adults than children in the overall population, which we are using as the “standardized” population, provider B’s superior performance in adults dominates the risk adjustment. So, provider B has a risk-adjusted mortality rate that is better than provider A’s. In other words, if you use O/E analysis for level-playing-field performance comparisons, you may get the wrong answer.
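The disagreement between the two methods can be reproduced with a small Python sketch. The counts below are invented for illustration and are NOT the figures from the example described above, so the ratios will not match the 1.09 quoted there. As a further assumption, the “typical performance” model is taken to be the pooled mortality rate in each age group, which is what a regression with age group as its only predictor would yield.

```python
# Hypothetical counts, invented for illustration only.
# provider -> age group -> (patients, deaths)
data = {
    "A": {"adult": (1800, 72), "child": (200, 2)},   # A sees mostly adults
    "B": {"adult": (100, 3), "child": (900, 18)},    # B sees mostly children
}
cells = ["adult", "child"]

# Model of typical performance: pooled mortality rate in each risk cell.
pooled_n = {c: sum(data[p][c][0] for p in data) for c in cells}
pooled_deaths = {c: sum(data[p][c][1] for p in data) for c in cells}
typical_rate = {c: pooled_deaths[c] / pooled_n[c] for c in cells}

# Standardized population: the overall patient mix of both providers.
total_n = sum(pooled_n.values())
weights = {c: pooled_n[c] / total_n for c in cells}

results = {}
for provider, strata in data.items():
    # O/E: typical rates applied to the provider's own patients.
    observed = sum(deaths for _, deaths in strata.values())
    expected = sum(n * typical_rate[c] for c, (n, _) in strata.items())
    # Direct standardization: the provider's own rates applied to the
    # standardized population.
    adjusted = sum(weights[c] * deaths / n for c, (n, deaths) in strata.items())
    results[provider] = {"oe": observed / expected, "adjusted": adjusted}
    print(f"{provider}: O/E = {results[provider]['oe']:.2f}, "
          f"standardized mortality = {adjusted:.2%}")
```

With these made-up counts, provider B's O/E ratio is above one and provider A's is below one, yet provider B's standardized mortality rate is the lower of the two (roughly 2.6% versus 2.9%). The two methods rank the same pair of providers in opposite order.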

Risk adjustment using standardization is not computationally difficult. But, it does require a bit of programming work to convert continuous variables such as age into categorical variables, such as breaking age into ten-year ranges. In my opinion, O/E analysis should not be used for risk-adjusted performance comparisons. O/E analysis does have the advantage of being computationally more convenient, particularly when you have many confounding variables and when some of them are continuous variables. But, it is not that much more work to do it right.

Dr. Ward

Richard E. Ward, MD, MBA, CEO of Reward Health. Also physician, health care analytics and informatics innovator, husband, father, tenor & gradually improving swimmer

9 thoughts on “Observed over expected (O/E) analysis is commonly misapplied to performance comparisons. Please don’t.”

Darline El Reda
October 11, 2011 at 9:42 am

I remember this conversation very well and was quite surprised to learn about the frequent use of O/E metrics when, in reality, what we need are risk adjusted metrics. A must have conversation for anyone in the business of assessing and comparing performance.
Pingback: So, is there any good use of O/E analysis? Yes. It’s called Benchmark Opportunity Analysis.
L. G. Hunsicker, M.D., Professor of Medicine, U. Iowa College of Medicine
April 15, 2012 at 6:13 pm

I am not a (formal) statistician, so I respond to this analysis with some temerity. It seems to me that Dr. Ward’s numerical analysis is correct, but that his conclusion that O/E should not be used to evaluate performance of hospitals/practitioners may not be the end of the discussion. The issue, at root, is over which patient population one wants to evaluate a physician’s/institution’s performance: the average population overall, or the average population that the physician/institution sees. The problem arises from the fact that there may be an interaction between physician performance and the characteristics of the patient population. In his conundrum, Dr. Ward’s Provider A sees mostly adults, but does less well with these patients than expected, but better with children. Conversely, Provider B sees mostly children, does less well with the children than expected, but better with adults. (This may be the reverse of what might have been expected. We might have guessed that each provider would have done better on average with the patients that he sees more of: the pediatrician with children and the internist with adults.) But we want an “overall” measure of performance, not a separate evaluation for each patient population. So we have to construct a weighted average performance, and here is the basis for the difference between Dr. Ward’s two computations. If we weight the average performance by the characteristics of the general population, in effect asking how well each practitioner would do if (s)he saw a patient population typical of the population as a whole, we get Dr. Ward’s preferred answer. But we may prefer to weight the average by the patient population that the provider actually sees, in essence asking how his/her performance compares with a hypothetical/counterfactual average provider seeing the kind of patients that the practitioner being evaluated actually sees. In that case we get the evaluation that Dr. Ward doesn’t like.
Two different questions. Two different answers. Which question is the relevant one? I think that many would say that the relevant question is how well does each practitioner perform in caring for the kind of patients (s)he generally sees. Consider one likely consequence of taking Dr. Ward’s point of view, and consider the likely situation with a pediatrician who sees mostly children and whose care of children is superior. But his/her care of adults (which (s)he rarely sees) is less good. Because the patient population at large has more adults than children, the pediatrician’s good outcomes with children would be down-weighted and his/her less good outcomes with adults would be weighted more heavily. If the reverse is true for the internists, in this comparison the pediatricians are likely, as a class, to look less good than the internists. (Incidentally, I am an internist, not a pediatrician.) I personally think that this approach does more damage to my sense of fairness than the alternative of evaluating each provider relative to the patients that (s)he sees. I would agree with Dr. Ward that this does not permit an overall ranking of all providers without reference to special expertise. Each provider is only reasonably compared with other providers caring for similar patients. But then, I suspect that most will think it a somewhat irrelevant exercise to rate providers on their (hypothetical) outcomes with patients that they don’t actually see very often.
Dr. Ward
April 16, 2012 at 1:09 pm

Dr. Hunsicker, thank you for taking the time to submit your thoughtful post. I agree with your point that it would be inappropriate to try to use standardization to compare the performance of a pediatrician to the performance of an internist. But, the problem with such a comparison is not because standardization is not asking the right question. The problem is more fundamental.

Before doing any comparisons of providers’ performance, the analyst must first be comfortable that the providers being compared are from the same population. Risk adjustment is only appropriate for the purposes of adjusting for relatively small differences in the mix of characteristics of the members of the population of providers. Since pediatricians and internists focus on non-comparable populations of patients, they are different populations of physicians. It is not appropriate to use risk adjustment to try to extrapolate results from one population to another population or to draw conclusions from comparisons of different populations.

Unfortunately, there are no formal methods or established criteria (at least none that I’ve ever heard of) for making this determination of same vs. different population. This is primarily determined by human judgment and convention. However, when doing risk adjustment, I like to check the difference between adjusted and unadjusted measures. If the difference between the two is more than about 20%, it makes me suspicious that we are dealing with different populations, and not merely adjusting for a different mix within the same population.
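The rule of thumb above is simple enough to sketch in Python. Two assumptions are baked in that the paragraph does not spell out: the “difference” is read as a relative difference measured against the unadjusted value, and the threshold is treated as a strict cut-off at 20%. The function names and the example rates are hypothetical.

```python
def adjustment_shift(unadjusted, adjusted):
    """Relative change introduced by risk adjustment (assumed definition)."""
    return abs(adjusted - unadjusted) / unadjusted


def suspect_different_populations(unadjusted, adjusted, threshold=0.20):
    """Flag a comparison whose adjustment shift exceeds the rule-of-thumb threshold."""
    return adjustment_shift(unadjusted, adjusted) > threshold


# Hypothetical rates, invented for illustration only.
large_shift = suspect_different_populations(0.037, 0.029)  # shift of about 22%
small_shift = suspect_different_populations(0.030, 0.029)  # shift of about 3%
print(large_shift, small_shift)
```

A flag from a check like this is a prompt for human judgment about whether the providers belong in the same comparison, not a formal test.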

If everyone agrees that we are comparing performance of providers within the same provider population, then I still feel that O/E analysis is not an appropriate method for the purpose of making fair, level-playing-field comparisons of provider performance, since the measured differences between providers are still potentially confounded by differences in the mix of patient characteristics, as demonstrated in the example calculations. As I pointed out in a subsequent post, methods analogous to O/E are useful for identifying and prioritizing opportunities for improvement of each provider.
VC
August 29, 2016 at 4:39 pm

If you constrain risk adjustment to similar physicians, and define “similar” as meaning that the O/E result is close to the direct risk adjustment result, is O/E still wrong? You have made sure these 2 methods are giving very similar answers when you limit the comparison this way. It seems that if you apply this constraint, then O/E is an approximation to direct standardization.

One issue with direct standardization is that it tends to break down when you look at smaller numbers of patients. If you have physicians that overall treat similar patients, but then look at the care of a specific condition, you start to get different populations, if only by chance. Once you get to smaller numbers also, the direct standardization score for a physician can be dominated by a cell with a small number of patients, with more populated cells contributing little to the physician’s score. If a physician treats 100 patients, but his score is dominated by 2 of these, this seldom seems fair to a provider. They may feel “punished” by taking on sicker patients (even if they are only sicker within some subgroup). Also, focusing the provider’s attention on a small sub-group of patients doesn’t do much to improve quality or reduce costs in the big picture. The direct standardization may tell you of the opportunity for improvement if the physician treated the average population, which may be very different from an improvement opportunity in reality.

I think the purpose of the comparison has to be taken into account also. Do you really want a definite answer to the question “is physician A better than B?” or do you want to get an idea of whether a given physician has opportunities for improvement, and an idea of the magnitude of those opportunities?

I am not a statistician either, but I have seen very strange results with direct standardization in some cases, with a handful of patients dominating scores, especially when you get a high-utilizing young male or two by chance in your group. This can happen even when overall patient counts are pretty high for a provider. As a result of this I struggle with the “right” adjustment.
Dr. Ward
January 14, 2017 at 5:33 pm

VC,

Sorry I have not been keeping up with replies to comments. Thanks for taking the time to write such a thoughtful reply. Some thoughts for your consideration:

1) Regarding the scenario when O/E and direct standardization give similar answers, and if one agrees that direct standardization is the “right” answer, then in that scenario, O/E happened to be “right” too. But I don’t see the point of using O/E as an approximation, since the programming for computing direct standardization is not very complex. Why approximate, when you can have the real thing?

2) Regarding the scenario when one of the risk cells in direct standardization dominates the risk adjusted value, I don’t see that as a problem caused by direct standardization, nor is it remedied by doing O/E (indirect standardization). Remember that risk adjustment is about dealing with potential sources of bias caused by the presence of confounding variables. The problem of a few outlier observations dominating the population-level measure is about variability, rather than bias. There are different methods to address variability such as (a) increasing your sample size, (b) considering censoring or winsorization of outlier observations, (c) applying tests of statistical significance of differences when doing comparisons, etc.

3) Regarding the need to consider the purpose of the analysis to select the correct method, I enthusiastically agree. As I explained in the subsequent blog post (https://rewardhealth.com/archives/1747), opportunity analysis is a good use for O/E analysis. (In fact, it is the only use that I know of.)
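As an aside on option (b) in point 2 above, winsorization can be sketched in plain Python. This is one simple variant under stated assumptions: it clamps a fixed number of observations at each end to the nearest remaining value, rather than using percentile cut-offs, and the cost figures are invented for illustration.

```python
def winsorize(values, k=1):
    """Clamp the k smallest and k largest values to the nearest remaining value."""
    if len(values) <= 2 * k:
        raise ValueError("need more than 2*k observations")
    ordered = sorted(values)
    low, high = ordered[k], ordered[-k - 1]
    return [min(max(v, low), high) for v in values]


# Hypothetical annual costs for one risk cell, with one extreme outlier.
costs = [1200, 900, 1500, 1100, 250000, 1300, 1000, 950]
winsorized = winsorize(costs, k=1)

mean_before = sum(costs) / len(costs)
mean_after = sum(winsorized) / len(winsorized)
print(f"mean before: {mean_before:.2f}, mean after: {mean_after:.2f}")
```

This addresses variability only. It does nothing about confounding, so it complements risk adjustment rather than replacing it.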

Regards,

Rick Ward
PW
March 15, 2018 at 11:55 am

I think the risk-adjusted mortality rate using standardization is inconclusive. If you change the standardized population, your conclusion will change too. If you apply both physicians’ performance to a different population, let’s say, 20% adults and 80% children, then physician A will have an adjusted mortality rate of 3.3% and physician B 4.0%, which leads to completely opposite conclusions. The O/E method, on the other hand, can tell us how much better or worse a physician is doing compared against the average, given his or her own case mix.
PW
March 15, 2018 at 1:01 pm

I agree that the O/E method, being indirect standardization in essence, suffers from the principal drawback of the indirect standardization method: the O/E (SMR) from one population cannot be compared to the O/E from another population unless the two populations have a similar/balanced distribution of all variables of interest. Therefore in O/E calculations, we cannot simply say one physician is better than the other by comparing their O/E ratios, because the O/E ratios are not directly comparable given their very different underlying case mixes. It is appropriate to say, though, that physician A has an N-fold higher mortality rate than expected on average for his case mix. If the case mixes are pretty balanced for the two physicians, then it would be ok to compare their O/E ratios. But if their case mixes are balanced, why would adjustment be needed in the first place lol…
Dr. Ward
September 12, 2020 at 3:36 pm

PW. Thank you for your comments.

I agree with your assertion that the conclusions to be drawn from direct standardization can change if you select a different definition of the standard population. My preferred practice is to use the overall patient mix of all the providers being compared as the standard population. If comparing trends over time, it is important that the same standard is used for all time periods. In any case, when reporting results of a comparison using direct standardization, it is important to choose the standard carefully and disclose the standard in the report. It is part of the definition of the metric.
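The sensitivity to the choice of standard population is easy to demonstrate with a short Python sketch. The cell-specific rates and the two candidate standards below are invented for illustration; they are not the figures from the original example or from the comment above.

```python
def standardized_rate(cell_rates, weights):
    """Weighted average of cell-specific rates; weights are proportions summing to 1."""
    return sum(cell_rates[cell] * w for cell, w in weights.items())


# Hypothetical cell-specific mortality rates, invented for illustration only.
rates = {
    "A": {"adult": 0.04, "child": 0.01},
    "B": {"adult": 0.03, "child": 0.02},
}

# Two candidate standard populations.
standards = {
    "mostly adults": {"adult": 0.8, "child": 0.2},
    "mostly children": {"adult": 0.2, "child": 0.8},
}

best = {}
for name, weights in standards.items():
    adjusted = {p: standardized_rate(r, weights) for p, r in rates.items()}
    best[name] = min(adjusted, key=adjusted.get)  # provider with the lower rate
    shown = ", ".join(f"{p} = {v:.2%}" for p, v in adjusted.items())
    print(f"{name}: {shown}; lower rate: {best[name]}")
```

Because the ranking flips between the two standards in this made-up case, the standard population has to be treated as part of the metric definition and reported along with the results.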

Regarding your second comment, I think when our purpose is to do level playing field comparisons of the performance of providers, we should distinguish between three scenarios:

(1) Different Populations: The patients receiving care from one provider have a mix of characteristics that is sufficiently different from the characteristics of those receiving care from the other providers that we should consider the providers to be in different populations, and therefore the comparison is inappropriate with or without risk adjustment.

(2) Nearly Identical Populations: The mix of patients receiving care from one provider is so similar to that of the other providers that we don’t consider it necessary to add the extra work or explanatory complexity associated with risk adjustment. This can be the case in a randomized clinical trial and in a walk-in/urgent care or emergency room setting. But, in my experience, it is rarely the case in routine practice.

(3) Same Population, But Differences Worthy of Risk Adjustment: This is the middle category, where we want to do the work of risk adjustment to assess the potential impact of confounding variables on the comparison. If we find that risk adjustment changes the result by a small but meaningful amount, we should emphasize the risk-adjusted results in our report. If we find that risk-adjusted results are not meaningfully different from unadjusted results, we may choose to emphasize the simpler unadjusted results in our report, and keep risk-adjusted results in our back pocket in case the question is raised. If we find that risk-adjusted results are dramatically different from unadjusted results, then we should question our own judgement that the providers were all in the same population and consider the possibility that the comparison is inappropriate.

Regarding O/E (indirect standardization) versus direct standardization, I stand by my assertion that O/E is not the right method for performance comparisons, despite the fact that many peer-reviewed papers continue to use what I consider to be the wrong method after all these years!

Reading between the lines of some of the comments, it seems there may be some antipathy to the purpose of comparing the performance of providers. I share some of that antipathy, particularly if performance comparisons are done in a punitive manner and are insufficiently integrated into a framework for clinical process improvement. But the question of whether provider performance comparison is a good purpose can be separated from the question of which method of risk adjustment is right for that purpose.
