Henry Ford observational study that shows 66% in-hospital mortality reduction from hydroxychloroquine seems to be guilty of the “heroic propensity score” problem

During the 1990s, I served as the director of the Center for Clinical Effectiveness at the Henry Ford Health System in Detroit.  I was lucky to have an opportunity to hang around with a really talented group of people there, some of whom are still there.  Back in the day, I was among the loudest voices advocating for the value of retrospective observational research using databases.  But, at the same time, I was also one of the co-investigators in the largest clinical recruitment site for the largest randomized clinical trial (RCT) done up to that point – the prostate, lung, colorectal and ovarian (PLCO) cancer screening trial.  So I have a foot in both RCT and observational research canoes. Both are important tools that can be used or misused.

Fast forward to the COVID-19 pandemic. Henry Ford established a system-wide COVID-19 Task Force, reviewed the paucity of evidence available, developed consensus on clinical protocols for treatment, and implemented those protocols across six hospitals. Based on its commitment to science and improvement, it also launched randomized clinical trials, including one focused on the efficacy of hydroxychloroquine in preventing SARS-CoV-2 infections in first responders. It also conducted observational research on the effectiveness of hydroxychloroquine and azithromycin, alone or together, in treating hospitalized COVID-19 patients. Samia Arshad, Marcus Zervos and colleagues took the time to share that observational research in a peer-reviewed journal. That’s how a serious integrated delivery system is supposed to act, and it made me proud to have been associated with them in the past.

As shown in the graph above, the study showed that hydroxychloroquine reduced in-hospital mortality by 66%. More carefully stated, the in-hospital mortality “hazard ratio” was reduced by 66% (a hazard ratio of roughly 0.34) when comparing COVID-19 inpatients who received hydroxychloroquine (and not azithromycin) with COVID-19 inpatients who received neither hydroxychloroquine nor azithromycin. Because of strong interest in COVID-19 treatment in general and political controversies surrounding President Trump’s advocacy for hydroxychloroquine in particular, the results were covered in the media, and the paper became controversial enough to prompt the Chief Clinical Officer and the Chief Academic Officer to publish an open letter saying:

Unfortunately, the political climate that has persisted has made any objective discussion about this drug impossible, and we are deeply saddened by this turn of events. Our goal as scientists has solely been to report validated findings and allow the science to speak for itself, regardless of political considerations. To that end, we have made the heartfelt decision to have no further comment about this outside the medical community – staying focused on our core mission in the interest of our patients, our community, and our commitment to clinical and academic integrity.

I admire how the Henry Ford team handled the controversy, and I would defend the value of publishing sincere study findings without regard to the political winds that are blowing.

However, on the question of the value of this particular observational study, I come down on the side of holding out for randomized clinical trials before suggesting changes in treatment protocols, even given the urgent circumstances of a novel disease advancing to pandemic status quickly. Yes, RCTs can be annoyingly slow. But, just as vaccine development has been fast-tracked, so could RCTs of COVID-19 therapeutics. As a life-long advocate and practitioner of observational research, I have a healthy skepticism about all the ways that such research can go astray. Experience has shown that even in ideal circumstances, observational research is tricky business.

A careful read of this particular Henry Ford study shows that it does not qualify as ideal circumstances.

In the paper, the authors describe one of the strengths of the study as the fact that the clinical practices studied were based on “regularly updated and standardized institutional clinical treatment guidelines.” Guidelines and protocols are good for reducing unwarranted variation in practice. But they constitute a limitation with regard to the conclusiveness of the study, not a strength.

In any retrospective observational study, the biggest concern is that there may be sources of bias (factors that make the different treatment groups less comparable), particularly factors that cannot be subject to “control” or “adjustment” in the analysis. The ideal scenario, sometimes described as a “natural experiment,” is when you have different sites of care, each of which happens to have a consistent practice pattern that differs from the pattern observed at other sites. Such a scenario resembles a study in which the sites were actually randomized to different treatment protocols. As long as you don’t have big concerns about the comparability of the populations at those different sites, such an analysis can be persuasive, especially if you can “mop up” the remaining small amount of confounding variation using statistical methods such as risk stratification, risk adjustment or propensity matching.

In the case of the Henry Ford study, the investigators had the opposite scenario: all sites were using the same treatment protocol, the purpose of which is to intentionally give different treatments based on different patient characteristics. Consequently, the patients receiving different treatments are intentionally non-comparable. The investigators used propensity matching in an attempt to reduce the potential bias, implicitly asserting that if two patients had the same “propensity score,” they were in fact comparable patients, despite the fact that the treatment protocol and the clinical judgment of the attending physicians determined that they needed different treatments. From the perspective of the investigators, they did the best they could under the circumstances, and I am not suggesting any intent to do hand-waving to distract from the underlying sources of bias. But I do know from experience that statisticians hired by some “wellness” and “disease management” program vendors have intentionally done exactly that kind of hand-waving over the years. I call this the “heroic propensity matching” problem, and it is on my “watch out” list along with the “regression to the mean” problem, the “volunteer bias” problem, and my favorite, the “risk factor switcheroo” problem.

My view has always been that statistical methods to deal with biases in observational studies are only appropriate for use in “mopping up” small amounts of bias.  You generally can’t compare humans to mice, nor can you compare pediatric patients to adult patients thinking that propensity matching or risk-adjusting is going to solve the comparability problem.  Consequently, I’m skeptical about results based on a comparison of treatment groups where the patients are intentionally different by virtue of a formal practice protocol, in addition to the usual problematic differences due to clinical judgments — and I’m reluctant to accept propensity matching as an appropriate or adequate remedy.
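To make the “mopping up” point concrete, here is a minimal simulation sketch in Python. It is not taken from the Henry Ford paper or its data. Everything in it is an assumption invented for illustration: a recorded covariate called age, an unrecorded factor called frailty that clinicians can see at the bedside, a made-up judgment_weight that controls how strongly frailty steers treatment, and a drug that has no effect at all on mortality. The sketch estimates a propensity score from the recorded covariate only, does 1:1 nearest-neighbor matching with replacement, and compares mortality between treated patients and their matched controls.

```python
import bisect
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_propensity(ages, treated, steps=100, lr=1.0):
    """Estimate P(treated | age) with logistic regression fit by gradient descent."""
    b0 = b1 = 0.0
    n = len(ages)
    for _ in range(steps):
        g0 = g1 = 0.0
        for x, t in zip(ages, treated):
            err = sigmoid(b0 + b1 * x) - t
            g0 += err
            g1 += err * x
        b0 -= lr * g0 / n
        b1 -= lr * g1 / n
    return [sigmoid(b0 + b1 * x) for x in ages]


def simulate(judgment_weight, n=6000, seed=1):
    """Return (naive, matched) mortality differences, treated minus untreated.

    The simulated drug does nothing, so an unbiased estimate is close to 0.
    All coefficients below are hypothetical.
    """
    rng = random.Random(seed)
    ages, treated, died = [], [], []
    for _ in range(n):
        age = rng.gauss(0.0, 1.0)      # recorded in the database
        frailty = rng.gauss(0.0, 1.0)  # visible at the bedside, NOT recorded
        p_treat = sigmoid(0.8 * age - judgment_weight * frailty)
        p_die = sigmoid(-1.5 + 0.8 * age + 1.0 * frailty)  # no treatment term
        ages.append(age)
        treated.append(1 if rng.random() < p_treat else 0)
        died.append(1 if rng.random() < p_die else 0)

    # Propensity score uses only what the database contains (age).
    scores = fit_propensity(ages, treated)
    rows = list(zip(scores, treated, died))
    controls = sorted((s, d) for s, t, d in rows if t == 0)
    control_scores = [s for s, _ in controls]

    n_treated = treated_deaths = matched_deaths = 0
    for s, t, d in rows:
        if t == 0:
            continue
        # 1:1 nearest-neighbor match on the propensity score, with replacement
        i = bisect.bisect_left(control_scores, s)
        nearby = [j for j in (i - 1, i) if 0 <= j < len(controls)]
        j = min(nearby, key=lambda k: abs(control_scores[k] - s))
        n_treated += 1
        treated_deaths += d
        matched_deaths += controls[j][1]

    control_deaths = sum(d for _, d in controls)
    naive = treated_deaths / n_treated - control_deaths / len(controls)
    matched = (treated_deaths - matched_deaths) / n_treated
    return naive, matched


# Scenario A: treatment ignores the unrecorded factor.
naive_a, matched_a = simulate(judgment_weight=0.0)
# Scenario B: clinical judgment about frailty strongly steers treatment.
naive_b, matched_b = simulate(judgment_weight=2.0)
print(f"treatment ignores frailty:   naive {naive_a:+.3f}, matched {matched_a:+.3f}")
print(f"treatment driven by frailty: naive {naive_b:+.3f}, matched {matched_b:+.3f}")
```

In this toy setup, when treatment ignores the unrecorded factor, matching should pull the estimate back toward zero, which is the “mopping up” case. When judgment about frailty drives who gets treated, matching on the recorded covariate cannot fix the comparison, and a drug that does nothing looks protective. The exact numbers depend on the seed and on the invented coefficients, so treat them as illustrative only.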

So, in this scenario, I favor going through the front door and doing real RCTs as quickly as possible.


