In a recent paper in JAMA Internal Medicine, Andrew Wong and his colleagues from the University of Michigan reported their “external validation” of the Epic Sepsis Model (ESM), a proprietary predictive model deployed in hundreds of US hospitals. The authors reported a sensational finding that thrilled people who love to hate on Epic Systems Corporation, the developer of the model and the dominant electronic medical record (EMR) vendor, particularly among academic medical centers. They accused Epic’s ESM of “poor discrimination” and “poor performance.” They used impressively sophisticated methods, but did only half the job. They probably failed to do the rest not out of laziness, but because they made an error common among health services researchers: they conceptualized a tool as if it were a treatment.
How does the Epic Sepsis Model work?
The ESM is a predictive model that takes pieces of data that are routinely captured and stored in the Epic EMR to produce a number called the ESM score. The ESM can be compared to a laboratory assay that produces a numeric result. They are both tests. Tests are tools. A proper and useful evaluation of a test can only be done in the context of the intended use of the test and the impact of this intended use on the relevant outcomes. Asking if a test is “effective” does not make any more sense than asking if a scalpel is effective. Obviously, that depends on what you’re planning to cut.
The Epic EMR calculates the ESM score every 15 minutes throughout a hospital admission. A higher ESM score indicates a higher probability that the patient has or is likely to develop sepsis, a dangerous systemic response to infection that requires urgent antibiotic treatment. If doctors were to wait for blood culture test results to confirm the existence of sepsis, too much time would pass. Therefore, doctors initiate antibiotic treatment when they assess that the patient is likely to have sepsis. The sooner such patients can be identified, the sooner antibiotic treatment can be initiated, and the less likely the sepsis will kill the patient. The purpose of the ESM is to save lives by identifying hospitalized patients likely to develop sepsis earlier than their physicians would otherwise have identified them, leading to earlier initiation of antibiotic treatment. But, as with laboratory tests, the ESM score itself does not directly cause the initiation of antibiotic treatment. Rather, the Epic EMR can be configured to compare the ESM score to a threshold value and to generate an alert message when the threshold is exceeded. The alert is offered within the EMR to the physicians caring for the patient, thereby augmenting the information that is already in the physicians’ heads. The alert and the score may or may not influence the physician’s decision to order antibiotic treatment. The key point is that neither the ESM nor the alert is a treatment. Therefore, they do not produce treatment outcomes that can be evaluated.
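To make the tool-versus-treatment distinction concrete, the alert logic can be sketched in a few lines. This is an illustrative sketch only: the function names are invented and the threshold value is merely the one discussed later in this article, not Epic's actual implementation. The point is that the score only becomes an alert through a configured comparison, and only a clinician turns an alert into a treatment decision.

```python
# Illustrative sketch of threshold-based alerting. All names and the
# threshold value are assumptions for exposition, not Epic's code.

ALERT_THRESHOLD = 6.0  # configurable per deployment

def should_alert(esm_score: float, already_alerted: bool) -> bool:
    """Return True if an alert should be surfaced to the clinician.

    The score itself is just a number. Only the configured comparison
    against a threshold turns it into an alert, and only the clinician
    turns the alert into a treatment decision.
    """
    return esm_score > ALERT_THRESHOLD and not already_alerted
```

Recomputed every 15 minutes, this comparison either surfaces an alert or stays silent; nothing in it constitutes a treatment or an outcome.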
Protocols and Processes
In the medical field, there are only two things that can be evaluated or assessed:
- protocols (which are designs for care processes that are intended to produce theoretical outcomes) and
- instances of care processes that produce real-world outcomes.
Protocols and processes can be simple, or they can be complex. When a protocol or process is very simple, we may describe it as a “guideline” or “standard of care.” When it gets a bit more complex, we may describe it as a “care map” or “pathway.” When a protocol or process gets very complex, we may describe it as a “system of care” or a “care model.” In any case, the assessment of real-world process instances is accomplished by measuring the magnitude of the various outcomes that are actually produced. Ideally, the real-world process instances are carried out in the context of a randomized clinical trial (RCT), making the comparison of outcomes more convincing. If the real-world processes are just part of routine practice, the evaluator must consider whether the real-world process is being executed as designed or if the process failed to conform to the design (i.e. it is not “in control”). The assessment of protocols is accomplished by comparing them to alternative protocols in terms of the magnitude and uncertainty of the various outcomes that are expected to be produced, using mathematical models to calculate such expected outcomes.
A protocol or process may incorporate the use of a predictive model, such as by specifying the initiation of a particular treatment when a predictive model output value exceeds a particular threshold value. A predictive model is only “good” or “bad” to the degree it is incorporated into a protocol that is expected to achieve favorable outcomes or a process that is measured to achieve favorable outcomes. The same predictive model can produce favorable or unfavorable outcomes when used in different protocols or processes.
The fundamental principle that tools cannot be assessed for effectiveness also applies to papers that purport to be evaluations of the effectiveness of any decision support, lab test, telehealth sensor, information system, artificial intelligence algorithm, machine learning algorithm, patient risk assessment instrument, or improvement methodology (like lean, six sigma, CQI, TQM, QFD, etc.). Even the most respectable health services research journals are clogged with examples of papers that violate this principle with great bravado, often obscuring the violation with a thick layer of fancy statistical methods. The principle remains that tools can only produce evaluable outcomes when put to use in protocols and processes, and any tool is only effective to the degree it enables a protocol or process that produces good outcomes.
Requirements for Evaluating an ESM-based Protocol
Therefore, to evaluate the ESM, the authors needed to first clarify how they intended to incorporate the ESM into a clinical protocol. Presumably, the protocol being assessed in the Wong analysis has something to do with the decision-making process during an inpatient encounter when the clinician repeatedly decides between two decision alternatives: (1) initiating sepsis treatment or (2) not initiating sepsis treatment. But, the Wong analysis failed to flesh out some essential details about the intended protocol. Unanswered questions include:
- Did the authors intend to compare “usual care” (current practice not using any sepsis predictive model) to an alternative protocol in which the clinicians relied exclusively on the ESM alert at a threshold value of 6?
- Or did they intend to compare usual care to a protocol in which the clinicians initiated treatment when either the ESM was positive or the clinician otherwise suspected sepsis?
- Or did they intend to compare usual care to a care process in which the clinicians considered both the ESM score and all the other data elements in some undefined subjective decision-making process?
- Or did they intend to compare all four of these alternative protocols?
- Also, for the protocols that utilized the ESM, did they also intend to compare different versions of the protocol using different threshold values for the ESM alert?
In addition to clarifying the clinical protocol, the authors needed to clarify the types of outcomes they were considering to be relevant to their assessment. Unanswered questions include:
- Were they considering only deaths from sepsis, or were they also considering other health outcomes, such as suffering extra days in the hospital or suffering side effects from treatment?
- In addition to health outcomes, were they considering economic outcomes, such as the cost of treatment or the cost of extra days in the hospital?
If the authors were only considering deaths from delayed treatment of sepsis, then we already knew the answer before doing any analysis. The optimal protocol would always be to treat every patient throughout every day of hospitalization, so as never to experience any delay in treatment. No clinical judgment and no predictive model required! Obviously, if the authors were considering the use of the ESM, they must have been considering multiple outcomes, such as the cost or side effects of initiating antibiotic therapy in patients who do not actually suffer from sepsis.
But, if there were more than one outcome being considered, then they needed to clarify how they were going to incorporate multiple outcomes to determine which protocol was the best. Unanswered questions include:
- Were they going to calculate some summary metric to be maximized or minimized?
- If so, were they going to assign utilities to the various health outcomes to allow them to generate a summary measure of the health outcomes, such as by generating a “quality adjusted life year” (QALY) metric?
- If they were considering economic outcomes, were they going to combine health and economic outcomes into a single metric, such as by generating cost-effectiveness ratios or assigning a dollar value to the health outcomes, or by asserting some dollars per QALY opportunity cost to allow the dollars to be converted to QALYs?
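One conventional way to answer these questions is to collapse health and economic outcomes into a single net monetary benefit. The sketch below is illustrative only; the QALY gains, costs, and willingness-to-pay value are invented placeholders, not estimates for any sepsis protocol.

```python
# Illustrative net monetary benefit (NMB) calculation. The QALY gains,
# costs, and willingness-to-pay value are invented placeholders.

WTP_PER_QALY = 100_000  # assumed willingness-to-pay, in $ per QALY

def net_monetary_benefit(qalys: float, cost: float,
                         wtp: float = WTP_PER_QALY) -> float:
    """NMB = (QALYs gained x $/QALY) - cost; higher is better."""
    return qalys * wtp - cost

# Comparing two hypothetical protocols, per admitted patient:
usual_care = net_monetary_benefit(qalys=0.010, cost=500)    # 500.0
esm_protocol = net_monetary_benefit(qalys=0.012, cost=900)  # 300.0
best = "usual care" if usual_care >= esm_protocol else "ESM protocol"
```

Under these made-up numbers usual care wins; the point is only that some single, rankable metric is what lets a comparison of protocols yield a decision.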
What the Wong study authors did instead
Unfortunately, the authors did not clarify the protocol, nor did they clarify the outcomes considered. Instead, they used an alternative method that could be utilized without taking a stand on the intended protocol or the relevant outcomes. They generated a Receiver Operating Characteristic (ROC) curve, which is a graph showing the trade-off between sensitivity and specificity of a test when different threshold values are used.
Virtually all introductory epidemiology textbooks describe ROC curves. As with a prior evaluation of the ESM, Wong et al calculated the Area Under the ROC Curve (AUC), asserting that this metric provides “external validation” because they considered it to be a measure of “model discrimination.” The authors clarified that the AUC “represents the probability of correctly ranking 2 randomly chosen individuals (one who experienced the event and one who did not).” Although this definition is mathematically correct, in my opinion the AUC is more appropriately understood as a general reflection of the favorability of the available trade-offs between sensitivity and specificity across many different threshold values. “Sensitivity” is the percentage of people with the predicted condition who are identified by a “positive” test, defined as a test value greater than a particular threshold. Sensitivity is about avoiding missed cases of sepsis (false negatives). At the same threshold, “specificity” is the percentage of people without the predicted condition who are identified by a “negative” test, defined as a test value at or below the threshold. Specificity is about avoiding false alarms (false positives). As the threshold is decreased to make the test more sensitive in detecting impending sepsis, you have to put up with more false alarms (decreased specificity). The ROC curve traces the achievable combinations of sensitivity and specificity as the threshold is varied across its range. A greater AUC means that the ROC curve is generally bowed toward the upper left, offering more favorable combinations of sensitivity and specificity.
It is important to note, however, that when one test has a larger AUC than another test, it does not necessarily follow that it has a superior specificity for every value of sensitivity, nor a superior value of sensitivity for every specificity. As shown in the diagram above, the ROC curves for two tests can cross one another. Therefore, a test with a smaller AUC may actually perform better for a specific use in a clinical protocol.
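The mechanics behind the ROC curve and the AUC are simple enough to sketch directly. The code below sweeps a threshold over toy scores, using the same “greater than the threshold” definition of a positive test as above; the data are invented for illustration and must contain both classes.

```python
# Minimal sketch of how sensitivity, specificity, and AUC fall out of
# sweeping a threshold. The scores and labels are toy data; a test is
# "positive" when the score exceeds the threshold, as in the text.

def roc_points(scores, labels, thresholds):
    """For each threshold, return (1 - specificity, sensitivity)."""
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
        tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 0)
        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        points.append((1 - specificity, sensitivity))
    return points

def auc_trapezoid(points):
    """Area under the ROC curve by the trapezoid rule."""
    pts = sorted(points)
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

# Toy example: four patients with perfectly separable scores.
points = roc_points([1, 2, 3, 4], [0, 0, 1, 1], [0, 1, 2, 3, 4])
auc = auc_trapezoid(points)  # 1.0 for this perfectly separable data
```

Plotting `roc_points` for two competing models makes the crossing-curves caveat visible: the model with the larger AUC need not offer the better sensitivity at the specificity a given protocol actually requires.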
So why do analysts report AUC values? Because they have not done the work to develop a decision analytic model to allow comparison of expected outcomes for alternative protocols using different predictive models with different threshold values. That’s like building a bridge part-way over the river and declaring it “done.”
So, how can the bridge be completed?
A complete analysis would compare the expected outcomes of alternative protocols. The analytic methodology appropriate to this purpose is called decision analytic modeling. In any type of analysis, it is always a good practice to start with the end in mind and work back to determine the necessary steps. I like to follow the lead of one of my healthcare heroes, David Eddy, MD, PhD. Eddy uses the term “clinical policies” to describe protocols and other designs for clinical practices. He advocates for the presentation of the results of decision analytic models in what he describes as a “balance sheet,” with columns corresponding to the policy alternatives considered and rows corresponding to the outcomes thought to be relevant and materially different across those policy alternatives. (Students of accounting will note that such a table is more conceptually aligned with the concept of an income statement than a balance sheet, but Eddy’s PhD was in mathematics, not business!) A balance sheet to compare alternative protocols for the initiation of antibiotic therapy for presumed sepsis would need to compare “usual care” (not using the ESM) to various alternative protocols that do use it. The alternative ESM-based protocols could differ in terms of the threshold values applied to the ESM score and the intended impact of the ESM alert on the ordering behavior of the physician, leading to a balance sheet that looks something like the following:
The decision analytic model should be designed to fill in the values of such a table, providing estimates of the magnitudes of the outcomes and the range of uncertainty surrounding those estimates. The calculations for such a model can be done in many different ways, such as using traditional continuous decision analytic models or newer discrete agent-based simulation models. A traditional model could be conceptualized as follows:
As shown in the diagram above, the Wong paper authors provided only some of the numbers needed. Note that in this tree diagram, some of the branches terminate with “no impact of ESM,” meaning that for that subset of patients, the outcomes are expected to be the same under usual care and under the ESM protocol being evaluated. The tree diagram allows the analyst to “prune” the “no impact” branches of the tree and focus on the subsets of patients for which outcomes are expected to differ.
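As a minimal illustration of the pruned-tree calculation, the sketch below computes expected deaths averted in the one subset of patients whose management actually changes. Every probability and mortality figure is an invented placeholder; a real model would draw them from the literature and from data such as Wong's.

```python
# Toy pruned-tree calculation: expected deaths averted by an ESM-based
# protocol vs usual care. Every number is an invented placeholder.

P_MANAGEMENT_CHANGED = 0.05       # assumed share of admissions where the
                                  # alert actually changes management
P_SEPSIS_GIVEN_CHANGED = 0.20     # assumed sepsis prevalence in that subset
MORTALITY_EARLY_TREATMENT = 0.10  # assumed mortality with earlier antibiotics
MORTALITY_LATE_TREATMENT = 0.15   # assumed mortality with later antibiotics

def expected_deaths_averted(n_admissions: int) -> float:
    """Expected incremental lives saved in the pruned branch.

    The "no impact" branches contribute zero to the comparison and
    are dropped; only the subset with changed management remains.
    """
    affected = n_admissions * P_MANAGEMENT_CHANGED * P_SEPSIS_GIVEN_CHANGED
    return affected * (MORTALITY_LATE_TREATMENT - MORTALITY_EARLY_TREATMENT)

# For a hypothetical 10,000 admissions:
# 10,000 x 0.05 x 0.20 x (0.15 - 0.10) = 5 expected deaths averted
```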
One might object to the fact that such outcomes calculations usually rely on some assumptions. People sometimes attack models by saying “Aha! You made assumptions!” But, as long as the assumption values are based on the current-best-available information, and the analyst acknowledges the uncertainty in those values (ideally quantitatively through sensitivity analysis), the model is serving its purpose. The assumptions are not flaws in the model. They are features. The assumptions complete the bridge across the river, supporting protocol selection decisions that need to be made now, while also pointing the way to further research to provide greater certainty for the assumptions that are found to be most influential to protocol selection decisions to be made in the future.
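A one-way sensitivity analysis of this kind can be sketched directly: hold the model fixed, sweep one uncertain assumption across its plausible range, and see whether the protocol choice flips. The toy model and every number below are invented placeholders, not estimates from any real study.

```python
# One-way sensitivity analysis sketch. The toy model and all numbers
# are invented placeholders, not estimates from any real study.

VALUE_PER_LIFE_SAVED = 1_000_000  # assumed dollar valuation of the outcome
COST_PER_EVALUATION = 50.0        # assumed cost of working up one alert

def net_benefit_per_admission(p_life_saved: float,
                              p_alert: float = 0.10) -> float:
    """Toy per-admission net benefit of an ESM protocol vs usual care."""
    return (p_life_saved * VALUE_PER_LIFE_SAVED
            - p_alert * COST_PER_EVALUATION)

# Sweep the most uncertain assumption across its plausible range; the
# protocol choice flips where net benefit crosses zero (at 5e-6 here).
sweep = {p: net_benefit_per_admission(p)
         for p in (1e-6, 5e-6, 1e-5, 1e-4)}
```

If the decision flips within the plausible range of an assumption, that assumption is exactly where further research buys the most certainty.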
Reporting Number Needed to Treat (NNT) is like building a bridge almost across the river.
For decades, it has been popular for health services researchers who were not fully on board with the principles of decision analytic modeling to go part-way by reporting a “number needed to treat” (NNT) statistic. The NNT statistic is the number of patients that need to undergo a treatment to save one life (or to achieve one unit of whatever primary outcome the treatment was intended to change). Such a statistic acknowledges a trade-off between two outcomes, or at least between two statistics that are proxies for causally “downstream” outcomes. In the case of the Epic ESM evaluation by Wong et al, the authors adapted this NNT concept as a baby step toward acknowledging the fundamental trade-off between successfully identifying impending sepsis and chasing after false-alarm alerts. In this portion of their paper, they vaguely described two alternative protocols. In the first, the attending physicians were to ignore all but the first ESM alert message and perform some unspecified sepsis “evaluation” process for patients for whom a first ESM alert was generated before the physicians had already initiated antibiotic therapy. They calculated they would “need to evaluate 8 patients to identify a single patient with eventual sepsis.” If they used an alternative protocol in which the “evaluation” was to be redone after every alert (not just the first), the number needed to evaluate jumped to 109.
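The arithmetic behind a “number needed to evaluate” is worth making explicit: it is simply the reciprocal of the alert's positive predictive value, which in turn follows from sensitivity, specificity, and prevalence via Bayes' rule. The values in the sketch below are placeholders, not figures from the Wong paper.

```python
# Where a "number needed to evaluate" comes from: the reciprocal of
# the alert's positive predictive value (PPV). The sensitivity,
# specificity, and prevalence values are placeholders, not figures
# from the Wong paper.

def ppv(sensitivity: float, specificity: float, prevalence: float) -> float:
    """Positive predictive value via Bayes' rule."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

def number_needed_to_evaluate(sensitivity: float, specificity: float,
                              prevalence: float) -> float:
    """Alerts that must be worked up to find one true sepsis case."""
    return 1 / ppv(sensitivity, specificity, prevalence)

# With placeholder values: PPV = 0.08 / 0.17, so about 2.1 evaluations
# are needed per true case.
nne = number_needed_to_evaluate(sensitivity=0.8, specificity=0.9,
                                prevalence=0.1)
```

Note how steeply the number needed to evaluate grows as prevalence falls, which is exactly why repeating the evaluation after every alert inflated the figure from 8 to 109.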
The reporting of these statistics is telling. It acknowledges one trade-off, but still stops short of estimating the outcomes of interest. How costly is that evaluation? Is it just an instance of physician annoyance leading to some “alert fatigue,” or does it involve ordering costly tests? How many lives would be saved? Would there be side effects from the tests or the treatment? How expensive would the treatment be? Is 8 a reasonable number of “evaluations” to do? Is 109 reasonable?
Interestingly, the NNT statistics did not make it into the abstract. Instead, the less useful AUC statistic was elevated to the headline, perhaps because it seemed to the authors to be less tangled up in protocols and therefore somehow more rigorous, objective, and “scientific.”
Wong and colleagues’ stated conclusion was that medical professional organizations constructing national guidelines should be cognizant of the broad use of proprietary predictive models like the Epic ESM and make formal recommendations about their use. I strongly agree with that conclusion.
I would add a second conclusion of my own: peer reviewers for professional journals such as JAMA Internal Medicine should insist that published evaluations of predictive models include:
- Documentation to clarify one or more specific protocols utilizing the model (and “usual care”),
- Estimates of all the relevant outcomes thought to be materially different across the protocols, and
- Documentation of the assumption values used in the model generating the outcomes estimates.
In other words, stop publishing predictive model “validations.” Instead, insist that investigators finish the job and publish decision analytic models comparing clinical protocols, some of which may incorporate predictive models.