Data Scientist Beware – Your Models Aren’t That Good – Here’s Why

Written by Damian Mingle, Chief Data Scientist, SwitchPoint Ventures and Empiric Data Science Team

Many organizations think all Data Scientists are created equal (after all, they all have the same bag of tricks), and companies apply pressure to put models into production. Unfortunately, many Data Scientists are not challenged to justify their results, and the organization is left to deploy models it may have to trust blindly. In some industries, like finance, getting it wrong can be very expensive. In other industries, like healthcare, getting it wrong could have dire human consequences.

Shuffling the Deck Matters

Imagine a scenario where your IT department passes data to your Data Scientist to build a model. The Data Scientist may launch right into building a model, without considering how the source system may have produced the data. Seems harmless, right? Wrong. Like a casino dealer, a good Data Scientist needs to be a good shuffler.

In a typical healthcare-related process, the business tells IT they want patients with a particular condition:

  • Patients with the condition are marked as “X”
  • Patients without the condition are marked as “Not-X”

Let’s further say that there are 10,000 observations—8,000 marked “X” and 2,000 marked “Not-X”. The “X” observations are grouped and the “Not-X” observations are also grouped. When the Data Scientist builds on this training set, the resulting model is 84% accurate. However, when deployed in production, the model accuracy falls to 62%. Why?

In truth, the data rows need to be shuffled before training. This simple technique reflects what actually happens in the real world: patients present randomly with the condition, not grouped as a series of “X” and then a series of “Not-X”. The NumPy library in Python can be used to randomly shuffle a data array.
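The original code listing did not survive in this copy of the article. What follows is a minimal sketch of one way to do the shuffle with NumPy, not necessarily the author's code. The toy arrays, the variable names, and the seed are assumptions made for illustration.

```python
import numpy as np

# Hypothetical toy data: 8 rows labelled "X" (1) followed by 2 rows labelled
# "Not-X" (0), mimicking the grouped ordering produced by the source system.
features = np.arange(20).reshape(10, 2)
labels = np.array([1] * 8 + [0] * 2)

# Draw one random permutation of the row indices and apply it to both arrays,
# so that every feature row stays paired with its own label.
rng = np.random.default_rng(seed=42)  # arbitrary seed, fixed for reproducibility
order = rng.permutation(len(labels))
shuffled_features = features[order]
shuffled_labels = labels[order]
```

Indexing both arrays with the same permutation matters: shuffling features and labels independently would break the pairing between each patient and the recorded condition.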


In this example, shuffling the training data would result in a model that is 81% accurate and does not swing more than 1% when deployed in the real world. Shuffling is good practice in every situation, but it is explicitly required when performing a train, test, holdout split of the data (in preparation for model training) and when training any model that uses stochastic or batch gradient descent (support vector machines, logistic regression, neural networks, etc.).
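To see why the split depends on shuffling, the sketch below splits the 8,000 / 2,000 example from above into train, test, and holdout sets. The function name shuffled_split, the 60/20/20 proportions, and the seed are assumptions chosen for illustration; the article does not specify them.

```python
import numpy as np

def shuffled_split(n_rows, train_frac=0.6, test_frac=0.2, seed=0):
    """Return shuffled row indices for the train, test, and holdout sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_rows)
    n_train = int(n_rows * train_frac)
    n_test = int(n_rows * test_frac)
    train = order[:n_train]
    test = order[n_train:n_train + n_test]
    holdout = order[n_train + n_test:]  # whatever remains after train and test
    return train, test, holdout

# Labels ordered as in the example: 8,000 "X" (1) followed by 2,000 "Not-X" (0).
labels = np.array([1] * 8000 + [0] * 2000)
train_idx, test_idx, holdout_idx = shuffled_split(len(labels))

# Slicing the unshuffled rows would put only "Not-X" rows in the last 20%.
unshuffled_holdout_rate = labels[8000:].mean()
# After shuffling, the holdout set contains roughly the overall share of "X".
shuffled_holdout_rate = labels[holdout_idx].mean()
```

Without the permutation, the last 20% of rows contains no “X” patients at all, so any metric computed on that slice says nothing about how the model handles the condition.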

The Definition of Success

How you measure the reliability of a model is incredibly important, and often overlooked.  It is critically important to know how the results of your models will be used, and what “success” actually looks like to the practitioners using the results.  This is particularly important with imbalanced classes and multiclass models.

Say we have the set of preliminary diagnoses for all patients in a hospital, and we want to predict the likelihood of performing a specific surgery.  If we measure the success of each prediction against all patient encounters, we could get a deceptively encouraging accuracy and recall.

However, this is not a relevant way of measuring the utility of the model. It is clinically obvious that the majority of the encounters have diagnoses that are unrelated to the surgery we want to predict. This makes it very easy for the model to discern what should NOT be predicted, which in turn could weaken its ability to accurately identify the cases that are clinically relevant. A more astute Data Scientist would winnow down the patients to only those with clinically relevant diagnoses. From there, the machine learning model will need to work harder to differentiate between the cases that end up requiring surgery and those that do not. From this we would expect less impressive accuracy and recall, but a more relevant precision.

In short, more data is not always better.  If the data is irrelevant to the model, then it is easy to dilute your results and inflate some of the success metrics.
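A small worked example can make the dilution effect concrete. The confusion-matrix counts below are invented for illustration and do not come from any real hospital data. The sketch holds the model's predictions fixed and only changes which encounters are scored, so it isolates the accuracy inflation; it does not show the effect of retraining on the narrower population.

```python
def metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
    }

# Hypothetical counts for the 1,000 encounters with clinically relevant diagnoses.
relevant = dict(tp=120, fp=150, fn=80, tn=650)

# Add 9,000 unrelated encounters that the model trivially labels "no surgery".
# They only increase the true negatives.
all_encounters = dict(relevant, tn=relevant["tn"] + 9000)

m_rel = metrics(**relevant)        # accuracy 0.77
m_all = metrics(**all_encounters)  # accuracy 0.977
```

Under these assumed counts, accuracy rises from 77% to 97.7% purely because easy negatives were added, while precision and recall on the surgery class do not move at all.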

Your Data Is Leaking

An organization may not realize this, but a Data Scientist goes through periods of being either Picasso or Einstein. In the Picasso phase, the imagination may run wild – blending data features, adding new data sources, and creating new functions. In the Einstein phase, the Data Scientist adheres to a strict scientific methodology, constantly challenging outcomes and documenting the approach to ensure that the experiment can be reproduced by any other competent Data Scientist. The best results occur when the Data Scientist behaves equally as both a Picasso and an Einstein. When one approach is favored more heavily than the other, data leakage can occur.

Imagine your organization wants to predict the likelihood of a patient having a specific medical condition upon entering the waiting room of a hospital emergency department. The Data Scientist has created a machine learning model that is 94% accurate. Upon presenting these results to the executive team, the business is convinced that the next 12 months are going to be the best in company history and launches an aggressive sales campaign.

But beware: if it seems too good to be true, it often is.

For example, we may have a scenario where the solution is only 66% accurate in production, not 94% as expected. This could be the result of a data leak, which an astute Data Scientist should have observed when preparing the data in training and validation.

Before exploring what caused the data leak, let’s review the continuum of data obtained throughout the patient’s Emergency Department (ED) encounter below. This is the historical data provided to the Data Scientist for model training and validation. Each box represents a group of data obtained during a specific portion of the patient’s ED visit. The model is designed to predict what is in the green box—whether or not the patient was diagnosed with the specific medical condition for which we are looking. The data leak occurred when the Data Scientist included data from the orange box, Patient Discharge Information.


There is data in the Patient Discharge Information that creates “forward-looking bias”: the model can intuit the presence of a condition based on discharge data. But the discharge data is not available at admission, which is the point at which the model needs to make a prediction. Excluding the patient discharge information from training would have required the model to focus on other data provided before or during the patient’s visit. Doing so would result in a model with a lower success metric, perhaps in the range of 84% instead of 94%, but in production the model would deviate just 1% in either direction.
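One simple guard against this kind of leak is to tag every candidate feature with the stage of the visit at which it is recorded, and to keep only the features available when the prediction must be made. The sketch below assumes a hand-maintained catalogue; the column names, the stage names, and the helper split_leaky_features are all hypothetical and are not taken from the article's figure.

```python
# Hypothetical feature catalogue: each column is tagged with the stage of the
# ED visit at which it is recorded. All names here are illustrative only.
FEATURE_STAGE = {
    "age": "arrival",
    "chief_complaint": "arrival",
    "triage_heart_rate": "arrival",
    "discharge_disposition": "discharge",
    "discharge_medications": "discharge",
}

# Stages whose data exists at the moment the model has to make its prediction.
AVAILABLE_AT_PREDICTION = {"arrival"}

def split_leaky_features(feature_stage, allowed_stages):
    """Separate columns that are safe to train on from those that leak."""
    safe = sorted(c for c, s in feature_stage.items() if s in allowed_stages)
    leaky = sorted(c for c, s in feature_stage.items() if s not in allowed_stages)
    return safe, leaky

safe_columns, leaky_columns = split_leaky_features(
    FEATURE_STAGE, AVAILABLE_AT_PREDICTION
)
```

Making availability an explicit property of each column turns the leak into something a reviewer can check, rather than something discovered after deployment.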



Clean, Clean, Clean

Because Data Scientists construct models so that a machine can learn on its own, it is important that model inputs map to organizational outputs. Data Scientists need to be good housekeepers—ensuring that training data is clean and will result in a more accurate model.

Imagine the Data Scientist is asked to build a linear regression model with a single variable. The chart below (a) is a result of “dirty” data. There is systematic bias in the data, causing model results to shift left, away from the true, or “clean”, values.



If a machine learning model is trained on this dirty data, then it will not do well on clean data seen in a real-world situation. More experienced Data Scientists may discover this bias after observing just a few samples of data and attempt to fix it. With limited exposure, the Data Scientist could easily detect one source of data errors but fail to fix a second, unrelated source of errors. After this fix is put into place, model results look like those in (b) below. In this scenario, it might have been better to do nothing at all rather than fix just a few data samples.


Another strategy is to clean a subset of data rows and then train a model on just the clean records. That is not guaranteed to work and, in many cases, overfits to the subset, producing results like those in (c) below.



With so many risks present from ignoring or insufficiently cleaning data, many Data Scientists simply do the basics and hope for the best. However, there are more sophisticated solutions to this problem. Imagine a solution that includes code to review data on a row-by-row basis, according to a user-defined function, which does one of two things:

  1. Cleans the row
  2. Eliminates the row

This approach would sample, clean, update, detect, and estimate data according to the flow below:


This approach allows organizations and Data Scientists to train a model while progressively cleaning data and preserving convergence guarantees. More simply stated, this approach closes the gap on the error rate between dirty and clean data. The detectors and estimators allow for optimizations that can improve model accuracy significantly.
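The row-by-row step described above, where a user-defined function either cleans a row or eliminates it, could be sketched as follows. This covers only that one step, not the full sample, update, detect, and estimate loop, and it is not the SampleClean API. The function names, the unit-conversion rule, and the sample rows are assumptions made for illustration.

```python
def clean_rows(rows, cleaner):
    """Apply a user-defined cleaner to each row.

    The cleaner returns a cleaned row, or None to eliminate the row.
    """
    kept, dropped = [], 0
    for row in rows:
        cleaned = cleaner(row)
        if cleaned is None:
            dropped += 1
        else:
            kept.append(cleaned)
    return kept, dropped

def example_cleaner(row):
    """Hypothetical rule: drop unusable weights, convert pounds to kilograms."""
    weight = row.get("weight")
    if weight is None or weight <= 0:
        return None  # eliminate the row
    if row.get("unit") == "lb":
        # Clean the row by normalising the unit instead of discarding it.
        return {**row, "weight": round(weight * 0.45359237, 2), "unit": "kg"}
    return row

rows = [
    {"weight": 70.0, "unit": "kg"},
    {"weight": 154.0, "unit": "lb"},
    {"weight": None, "unit": "kg"},
    {"weight": -5.0, "unit": "kg"},
]
kept, dropped = clean_rows(rows, example_cleaner)
```

Keeping the rule in a separate function means the cleaning logic can be reviewed and tested independently of the model that consumes the cleaned rows.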

An in-depth technical overview of cleaning techniques and optimization approaches is presented by the SampleClean project, a collaboration among Data Scientists at UC Berkeley, Brown University, Tel Aviv University, and Columbia University.

Data Scientists are good at leveraging machine learning models to solve a problem. The approach described above incorporates a data cleansing component, which is a very different way to pre-process the data than most Data Scientists are familiar with.

Now What

To provide reliable value to their company, Data Scientists need to have the shuffling skills of a good casino dealer, a definition of success that is clinically relevant, the artistic temperament of Picasso, the scientific rigor of Einstein, and fastidious cleaning habits. Data-driven organizations that employ and develop this level of data science talent can see spectacular results with quantifiable returns on investment. Organizations should hold Data Science teams accountable to these ideas by asking questions beyond the measurable successes of their models. Doing so allows both parties, Data Scientists and their organizations, to deploy reliable solutions that produce a serious competitive advantage.
