Al Lewis calls workplace wellness programs “get well quick schemes”

Al Lewis is an actuarial consultant who has long focused on challenging wellness and care management vendors to prove their value. He founded the “Disease Management Purchasing Consortium” and established a training and certification program for “critical outcomes report analysis.”

Al has been calling out the methodological carelessness and dirty tricks of wellness and care management vendors and health plans for years. These shoddy and unethical methods produce deceptively optimistic results, often to the delight of the customers of the programs, who crave evidence that they made a wise choice. Many of the methods were discredited long ago, but like cockroaches and ants, they just keep coming back. Faced with this unsavory state of affairs over many years, poor Al has resorted to sarcasm, probably partly to avoid getting bitter, and partly to keep his audience awake long enough to absorb the otherwise dry, tedious concepts.

He recently collaborated with Vik Khanna on a blog post in Health Affairs that focused on a particular type of wellness and care management program: workplace wellness, now a $6B industry. Such programs typically are funded and sponsored by employers, and involve incentivizing employees to complete a health risk assessment and then, hopefully, pursue healthier lifestyle behaviors. Employers purchasing these programs typically believe they will lead to substantial, short-term increases in worker productivity and decreases in health care costs. The blog post is definitely worth reading.

To summarize:

  1. Both workplace wellness program vendors and the benefit consultants who advocate for them have conflicts of interest which lead them to use deceptive methods and publish papers and marketing material which claim implausible levels of savings and return-on-investment.
  2. Although health plans often sell workplace wellness programs to self-insured employers (for a profit), virtually none of them believes they really produce savings, so they don’t spend the money on such programs for the fully-insured business for which the health plan itself bears the risk.  Health plans don’t eat their own dog food.
  3. The most common trick is to compare the outcomes for highly motivated employees who choose to complete the health risk assessments and participate in wellness interventions to the outcomes for poorly motivated employees who do not.  Epidemiologists call this “volunteer bias.”  It is a problem in evaluation studies of all types of member/patient-facing programs, but is obviously an even bigger problem with workplace wellness, when motivation to change behavior is the whole point of the program.
  4. Other common tricks include taking credit for improvements that occur due to random variation (“regression to the mean”), or taking credit for improvements that occurred before the program actually started — as was the case with the widely-touted results from Safeway’s famous workplace wellness program.
  5. The authors recommend that employers avoid these “get well quick” schemes and, instead, do the harder work of creating a deep culture change promoting wellness. If employers want to try workplace wellness programs, they should at least commit to identifying and then counting the events that the wellness program is intended to reduce, to see whether those events really decrease across the entire work force after the program is implemented.
  6. Lastly, they point out that the workplace wellness industry convinced the federal government to include taxpayer-financed wellness incentives in the Affordable Care Act. The Federal Employee Plan is in the process of picking a wellness vendor. The authors recommend dropping federally-funded wellness programs until valid evaluations show they work.
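
The volunteer bias described in item 3 can be illustrated with a small simulation. This is only a sketch under made-up assumptions: the "motivation" score, the cost formula, and every number in it are hypothetical, and the simulated program has no effect at all. It shows how comparing participants to non-participants can still produce apparent savings.

```python
import random
from statistics import mean


def simulate_volunteer_bias(n_employees=20000, seed=1):
    """Simulate a wellness program with ZERO true effect.

    Returns the mean annual cost of participants and of non-participants.
    All parameters are illustrative assumptions, not real data.
    """
    rng = random.Random(seed)
    participants, non_participants = [], []
    for _ in range(n_employees):
        # Hypothetical latent motivation; higher means healthier habits.
        motivation = rng.gauss(0.0, 1.0)
        # Cost falls with motivation whether or not the employee enrolls.
        cost = max(0.0, 5000 - 800 * motivation + rng.gauss(0.0, 2000))
        # More motivated employees are more likely to volunteer.
        volunteers = motivation + rng.gauss(0.0, 1.0) > 0.5
        (participants if volunteers else non_participants).append(cost)
    return mean(participants), mean(non_participants)


participant_mean, nonparticipant_mean = simulate_volunteer_bias()
print(f"participants:     {participant_mean:,.0f}")
print(f"non-participants: {nonparticipant_mean:,.0f}")
print(f"apparent savings: {nonparticipant_mean - participant_mean:,.0f}")
```

Because enrollment and cost both depend on the same underlying motivation, the participant group looks cheaper even though the program did nothing.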

Klar 3: The necessity of re-qualifying the population to avoid regression-toward-the-mean bias in historical comparison groups

Ron Klar, MD, MPH

Ron Klar, MD, MPH is a health care consultant with a long history of involvement in federal health care policy and health plan innovation. He recently published a series of three posts regarding the draft rules for the Medicare Shared Savings Program (MSSP) in the Health Affairs Blog, an influential forum for debating health policy issues. This is the third in my series of four posts describing areas of agreement and disagreement with Dr. Klar. In my first post, I described areas of agreement. In my second, I covered my disagreements about Dr. Klar’s proposed changes regarding care relationship derivation. In this post, I will describe my disagreement regarding Klar’s proposed changes to the approach to selecting a comparison group for savings assessment.

In the draft MSSP rules, CMS proposed two “options” for methods of selecting the comparison group for determining savings. The rules, following the lead of the health reform legislation language, mislabel the comparison group as a “benchmark.” CMS is not really trying to determine if an ACO is better than or comparable to the best-performing provider organization, as is implied by using the term “benchmark.” What CMS really intends is to compare the actual cost to the cost that would have been expected to occur if the same beneficiaries had been cared for by non-ACO providers. CMS indicates in the draft rule that it prefers option 1, which involves using the same assignment algorithm in the prior time period as is used for the accountability/performance period. This approach is described as “requalification” in the care management evaluation standards published by the Disease Management Association of America (DMAA). Option 2, for which CMS is seeking feedback, involves using historical information for the cohort of beneficiaries that was actually assigned to the ACO.

In Dr. Klar’s first post, he explained that he prefers option 2, arguing that option 1 has no “face validity” because the individual beneficiaries will be different. I strongly disagree.

As I noted in my blog post last week, when claims-based patient selection logic is applied, the selection is determined based not only on unchanging characteristics of the person (like gender), but also on data regarding health care events that happened at particular points in time. The person-months in the years before meeting the assignment criteria do not have the same risk as the person-months after meeting the assignment criteria. There is randomness in the timing of events, as people experience peaks and valleys of individual risk. When you select people based on recent health care events, you are not selecting randomly. You are preferentially picking people who tend to be in a risk peak as evidenced by recent health care utilization. Without any intervention, continuing random variation will cause the risk of the selected population to decrease over time, toward the mean risk of the overall population. This is known as a regression-toward-the-mean bias.  This type of bias is strongest when the patient is being purposefully selected based on being a high risk outlier, such as when a predictive model is used to generate a risk score used to select patients to be targeted for outreach for a care management program.  But, this type of bias exists in a weaker form for any patient selection based on recent health care utilization.  Patients naturally have higher risk in the time periods just before and after health care utilization, since they seek health care in response to illness episodes that drive cost.  To avoid regression-toward-the-mean bias, I prefer option 1, which offers a symmetrical selection process for the ACO intervention population and the historical comparison population.

Dr. Klar correctly points out that if no risk adjustment is done, ACOs could be incentivized to preferentially seek care relationships with lower risk patients. I feel this should be solved by doing risk adjustment (as has been proposed in the rule), rather than by using option 2.

Klar goes on to propose a variety of additional modifications to the rules that illustrate the complications of using the option 2 pre-post design, such as having to apply a weighted average scheme to deal with people with different numbers of years of available history and people who died during the performance period.


Identifying and Understanding Analysis Tricks: Regression Toward the Mean

Imagine that you are a new homeowner, shopping for insurance for your new house.  You live in an area prone to earthquakes, and you are not a big risk-taker.  You decide that you should have earthquake insurance.  You are on the web researching earthquake insurance policies. You come across the web site of Acme Insurance, an international leader in earthquake damage coverage.  The web site says they are the best earthquake insurance company because they not only pay for earthquake damage, they have an innovative program to actually prevent earthquakes experienced by their beneficiaries. The program involves assigning an earthquake prevention coordinator (EPC) to each homeowner.  The EPC does one session of telephonic earthquake prevention coaching, sends some earthquake prevention educational materials by e-mail, and makes a follow-up call to assure that the homeowner is exhibiting good earthquake prevention behaviors.   This is a proprietary program, so more details are only available to Acme Insurance beneficiaries.  The program is proven to reduce earthquakes by 99%.  You click on the link to view the research study with the documented proof.

The study was conducted by Acme Analysis, a wholly-owned earthquake informatics subsidiary of Acme Insurance. The study begins by noting an amazing discovery. When Acme analyzed its earthquake claims for 2010, it noted that 90% of its earthquake damage cost occurred in only 10% of its beneficiaries. It noted that these high cost beneficiaries were living in particular cities. For example, it noted high earthquake claims cost in Port au Prince, Haiti for damage incurred during the January 12, 2010 earthquake there. It developed an innovative high risk classification approach based on the zodiac sign of the homeowners’ birth date and the total earthquake claims cost for damage incurred in the prior month. On February 1, 2010, it applied this risk classification to identify high risk homeowners, most of whom were Libras or Geminis living in Port au Prince. It targeted 100 of those high risk homeowners for its earthquake prevention program. The EPCs sprang into action, making as many earthquake prevention telephone coaching calls and sending as many earthquake prevention e-mails as they could, considering the devastated telecommunications infrastructure in Port au Prince.

The program evaluation team then compared the rate of earthquakes exceeding 6.0 on the Richter scale and average earthquake damage claims for those 100 people for the pre-intervention period vs. the post intervention period. Among the 100 beneficiaries targeted by the program, the average number of major earthquakes plummeted from 1 in the pre-intervention period (January, 2010) to 0 in the post-intervention period (March, 2010), and the number of minor earthquakes (including aftershocks) dropped from 20 down to just 10. But the program was not just good for the beneficiaries wanting to avoid earthquakes.  It was a win-win for Acme Insurance.  Earthquake damage claims had dropped from an average of $20,000 per beneficiary during the January, 2010 pre-intervention period to an average of just $200 for damage incurred during the post-intervention period in March, 2010, when two of the targeted beneficiaries experienced damage from an aftershock.  The program effectiveness was therefore 1 – (200/20,000) = 0.99.  That means the innovative program was 99% effective in preventing earthquake damage claims cost.  After considering the cost of the earthquake prevention coordinators and their long-distance telephone bills, the program return on investment (ROI) was calculated to be 52-to-1.  The program was a smashing success, proving that Acme Insurance is the right choice for earthquake coverage.

Can you spot the problem? Can you extrapolate this insight to the evaluation of health care innovations such as disease management, care coordination, utilization management, patient-centered-medical home, pay-for-performance, accountable care organizations, etc.?

The problem is called “regression toward the mean.” It is a type of bias that can affect the results of an analysis, leading to incorrect conclusions. The problem occurs when a sub-population is selected from a larger population based on having extreme values of some measure of interest. The fact that the particular members had an extreme value at that point in time is partly a result of their underlying unchanging characteristics, and partly a matter of chance (random variation). Port au Prince, like certain other cities along tectonic plate boundaries, is earthquake prone. This is an unchanging characteristic. But, it was a matter of chance that a major earthquake hit Port au Prince in the particular month of January, 2010. If you track Port au Prince in subsequent months, its theoretical risk of an earthquake will be somewhat higher than that of a typical city because it is still an earthquake prone area. But, chances are that, in any typical month, Port au Prince will not have a major earthquake.

An analogous effect can be observed when you identify “high risk” patients based on having recently experienced extremely high rates of health care utilization and associated high cost. The high cost of such patients is partly driven by the underlying characteristics of the patients (e.g., age, gender, chronic disease status), and partly based on random chance. If you track such patients over time, their cost-driving characteristics will lead them to have somewhat higher costs than the overall population. But, the chance component will not remain concentrated in the selected patients. It will be spread over the entire population. As a result, the cost for the identified “high risk” patients will decrease substantially. It will “regress” toward the mean. With high risk classification methods typically used in the health care industry, my experience is that this regression is in the 20-60% range over a 3-12 month period, without any intervention at all. Of course, the overall population cost will continue to follow its normal inflationary trend line.
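
The split between a stable component and a chance component can be sketched in a few lines of simulation. The lognormal distributions and their parameters below are arbitrary assumptions chosen for illustration, not estimates from real claims data, and no intervention of any kind is simulated.

```python
import random
from statistics import mean


def simulate_two_years(n_members=50000, seed=7):
    """Return (year1_cost, year2_cost) per member, with no intervention."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        # Stable component: stands in for age, chronic disease status, etc.
        stable = rng.lognormvariate(8.0, 0.7)
        # Chance component: drawn independently in each year.
        year1 = stable * rng.lognormvariate(0.0, 0.7)
        year2 = stable * rng.lognormvariate(0.0, 0.7)
        members.append((year1, year2))
    return members


members = simulate_two_years()

# Label the top 10% of year-1 spenders as "high risk".
cutoff = sorted(m[0] for m in members)[int(0.9 * len(members))]
high_risk = [m for m in members if m[0] >= cutoff]

pre = mean(m[0] for m in high_risk)
post = mean(m[1] for m in high_risk)
overall_pre = mean(m[0] for m in members)
overall_post = mean(m[1] for m in members)

print(f"high-risk cohort: {pre:,.0f} -> {post:,.0f}")
print(f"whole population: {overall_pre:,.0f} -> {overall_post:,.0f}")
```

In this toy model the selected cohort's average cost drops sharply in the second year while staying above the population average, and the population average itself barely moves. The size of the drop depends entirely on the assumed mix of stable and chance variation.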

This regression-toward-the-mean phenomenon has been at play in many, many evaluations of clinical programs of health plans and wellness and care management vendors. Sometimes unwittingly. Sometimes on purpose. Starting back in the 1990s, disease management vendors were fond of negotiating “guarantees” and “risk sharing” arrangements with managed care organizations where they would pick high risk people and guarantee that their program would reduce cost by a certain amount. Thanks to regression toward the mean, the vendor could rely on the laws of probability to achieve the promised result, regardless of the true effectiveness of their program. The vendor would get their negotiated share of the savings. Good work if you can get it. It lasted for a few years until the scheme was widely discredited. But not widely enough, it appears. Wellness and care management vendors still routinely compare cost before and after their intervention for a cohort of patients selected based on having extremely high cost in the pre-intervention period. Health plans and employers eat up the dramatic savings numbers, happy to see that the data “proved” that they made wise investments.

Study designs suffering from obvious regression-toward-the-mean bias will usually be excluded from publication in peer-reviewed scientific journals.  But, they do show up in less formally-reviewed clinical program evaluations by hospitals and physician organizations.  For example, in a recent analysis of a patient-centered medical home (PCMH) pilot, the authors concluded that the program had caused a “48 percent reduction in its high-risk patient population and a 35 percent reduction in per-member-per-month costs” as shown in the following graphic.

In this PCMH program, a total of 46 “high risk poly” members were selected based on having high recent health care utilization involving 15 or more health care providers. The intervention consisted of assigning a personal health nurse who developed a personal health plan, providing a personal health record (based on health plan data), and providing reimbursement for monthly 1-hour visits with a primary care physician. The analysis involved tracking the risk category (based on health plan claims data) and the per-member-per-month (PMPM) cost for the cohort of 46 patients, comparing the pre-intervention period (2009) to the intervention period (2010). I’m sure the program designers and evaluators for this PCMH pilot are well meaning and not trying to mislead anybody. I share their enthusiasm for the PCMH model of primary care delivery. But, I think the evaluation methodology is not up to the task of proving whether the program did or did not save money. Furthermore, even with a better study design to solve the problem of regression-to-the-mean bias, the random variation in health care costs is far too large to be able to detect even a strong effect of a PCMH program in a population of only 46 patients. Or even 4,600 patients for that matter. I’d guess that proper power calculations would reveal that at least 46,000 patients would be required to have a chance of proving PCMH effectiveness in reducing cost.
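
For readers who want to see what such a power calculation looks like, here is a minimal sketch using the standard normal-approximation formula for comparing the means of two equal-sized groups. The mean cost, standard deviation, and effect sizes in the usage lines are placeholder assumptions, not figures from the pilot, and real cost data are skewed enough that a serious analysis would need more than this approximation.

```python
from math import ceil
from statistics import NormalDist


def patients_per_group(mean_cost, sd_cost, relative_reduction,
                       alpha=0.05, power=0.80):
    """Approximate patients needed in EACH arm of a two-arm comparison.

    Normal-approximation formula for a two-sided test of a difference
    in means: n = 2 * (z_alpha + z_beta)^2 * sd^2 / delta^2.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    delta = mean_cost * relative_reduction  # absolute cost difference to detect
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd_cost ** 2 / delta ** 2)


# Hypothetical inputs: annual cost averaging 10,000 with a standard
# deviation three times the mean, and two candidate effect sizes.
for reduction in (0.05, 0.10):
    n = patients_per_group(10_000, 30_000, reduction)
    print(f"{reduction:.0%} reduction: about {n:,} patients per group")
```

The required sample size grows with the square of the ratio of cost variability to the effect being sought, which is why halving the detectable effect roughly quadruples the number of patients needed.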

So, how do you solve the problem of Regression Toward the Mean?

As with any type of bias, the problem is with the comparability of the comparison group.  The gold standard study design is, of course, a randomized controlled trial (RCT), where the comparability of the comparison group is assured by randomly assigning subjects between the intervention group and the comparison group (also called the “control group”).

If randomization is not possible, one can try to find a concurrent comparison group that is not eligible for the program and which is thought to be otherwise similar to the eligible population. The same selection criteria that are applied to the eligible population to pick targets are also used in the ineligible population to pick “simulated targets” for a comparison group. Note that in such a concurrent control design, the comparison should be between targets and the simulated targets, without considering which of the targets volunteered to actually participate in the intervention. This aspect of the design is called an “intention to treat” approach, intended to avoid another type of bias called “volunteer bias.” (More on that in a future post.)

Often, the evaluators do not have access to concurrent data from an ineligible population to provide a concurrent comparison group. In such a case, an alternative is “requalification” of the population in the pre-intervention period and the post-intervention period. Requalification involves taking the exact same selection criteria used to pick targets at baseline and shifting them in time to pick a comparison group. The result will be a comparison group that is a different list of patients than the ones picked for the intervention. Some of the targets of the intervention may be requalified to be in the comparison group. Others will not. Some members of the comparison group will be members that did not qualify to be in the intervention group. It is counter-intuitive to some people that such an approach creates a better comparison group than just tracking the intervention group over time. But, with requalification, you are assured that the same influence that luck had in selecting people based on recent utilization will be present in the comparison group. The idea is to avoid bias in the selection process by trying to make the selection process symmetrical between intervention and comparison groups.
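
The difference between tracking a cohort and requalifying can be sketched with a simulation in which the program has no effect at all. This is one possible reading of requalification, under assumed parameters: the same "top 10% of prior-year cost" rule is applied in two different time windows, and each selected group's cost is measured in the year after it qualified. The three-year cost model and all numbers are hypothetical.

```python
import random
from statistics import mean


def simulate_three_years(n_members=50000, seed=11):
    """Return a (year0, year1, year2) cost tuple per member; no program effect."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_members):
        stable = rng.lognormvariate(8.0, 0.7)  # persistent risk
        rows.append(tuple(stable * rng.lognormvariate(0.0, 0.7)  # yearly luck
                          for _ in range(3)))
    return rows


def top_decile(rows, year):
    """Apply the same selection rule to any year: top 10% by that year's cost."""
    cutoff = sorted(r[year] for r in rows)[int(0.9 * len(rows))]
    return [r for r in rows if r[year] >= cutoff]


def apparent_savings(rows):
    targets = top_decile(rows, 1)            # qualified on year 1, "treated" in year 2
    target_outcome = mean(r[2] for r in targets)

    # Naive pre-post: the same cohort's qualifying-year cost is the baseline.
    naive = 1 - target_outcome / mean(r[1] for r in targets)

    # Requalified: a historical group qualified on year 0, measured in year 1.
    comparison = top_decile(rows, 0)
    requalified = 1 - target_outcome / mean(r[1] for r in comparison)
    return naive, requalified


naive, requalified = apparent_savings(simulate_three_years())
print(f"naive pre-post 'savings': {naive:.0%}")
print(f"requalified 'savings':    {requalified:.0%}")
```

With the selection rule applied symmetrically, both groups carry the same dose of luck, so the requalified estimate stays near zero while the naive pre-post comparison credits the nonexistent program with large savings.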

If I apply these remedies for regression toward the mean bias, does that mean I will be assured of reliable evaluation results?

Unfortunately, no. The bad news is that clinical programs are devilishly hard to properly evaluate. There are many other sources of bias intrinsic to many evaluation designs. And, the random variation in the measures of interest is often very large compared to the relatively weak effects of most clinical programs. This creates a “signal to noise” problem that is particularly bad when trying to evaluate small pilot programs or the performance of individual providers over short time periods.

If you really want to create a learning loop within your organization, there is no alternative to building a team that includes people with the expertise required to identify and solve thorny evaluation and measurement problems.
