Why & When: Cross Validation in Practice

Nearly every time I lead a machine learning course, it becomes clear that there is a fundamental acceptance that cross validation should be done…and almost no understanding as to why it should be done or how it is actually done in a real-world workflow. I’ve finally decided to move my answers from the whiteboard to a blog post. Hope this helps!

Cross validation is like standing in a line in San Francisco: Everyone’s doing it, so everybody does it, and if you ask someone about it, chances are they don’t know why they’re doing it.

Fortunately for those who might be doing so without fully knowing why, there is very good reason to cross validate (which is generally not true about joining any random Bay Area queue).

So, other than our now ingrained inclination as data scientists to do the thing that everyone else is doing…why should we cross validate?

The Why & the When

To discuss why we perform cross-validation, it’s easiest to review how and when we incorporate cross validation into a data science workflow.

To set the stage, let’s pretend we’re not doing a GridSearchCV. Just good old fashioned cross validation.

And, let’s forget about time. Time of observation usually matters quite a bit…and that complicates things…quite a bit. For right now let’s pretend time of observation doesn’t matter.

First Partition of Data

Divide your data into a holdout set (maybe you prefer to call it a testing set?) and the rest of your data.

What’s a good practice here? If your data isn’t too big, import your entire_dataset and randomly sample [your chosen holdout percent]% of observation ids (or row indices). Create two new data frames, holdout_data and rest_of_data. Clear your variable entire_dataset. Export holdout_data to the file or storage type of your convenience, and clear your variable holdout_data.

If your data is too big, adapt a similar workflow by using indices instead of data frames and only loading data when necessary.

To clear a variable in Python, you’re welcome to del your variable or set it to None, as you’d like. Just make sure you can’t access the data by accident.
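As a minimal sketch of that workflow in pandas: the 20% holdout fraction, the file names, and the random seed below are placeholder choices, not recommendations.

```python
import pandas as pd

# Load everything once; the file name is a placeholder.
entire_dataset = pd.read_csv("entire_dataset.csv")

# Randomly sample the holdout rows (20% chosen purely as an example).
holdout_data = entire_dataset.sample(frac=0.20, random_state=42)
rest_of_data = entire_dataset.drop(holdout_data.index)

# Clear the full dataset so it can't be touched by accident.
del entire_dataset

# Persist the holdout set somewhere convenient, then clear it too.
holdout_data.to_csv("holdout_data.csv", index=False)
del holdout_data
```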

EDA & Feature Engineering

Let’s assume we then follow good practices around exploratory data analysis and creating a reusable feature engineering pipeline.

Notice that we’re only using the rest_of_data for EDA and designing our feature engineering pipeline. Let’s think about this for a second. Only doing this on the rest_of_data allows us to test the entire modeling process when we test on the holdout data. Clever, right?

Now, we’ve got our data ready to go and we have an automated way to process new data, via our pipeline.
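To make that concrete, a reusable pipeline might look something like the sketch below, using sklearn’s Pipeline and ColumnTransformer; the column names and transformation choices are invented for illustration and stand in for whatever your EDA actually suggests.

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column names, purely for illustration.
numeric_cols = ["weekly_spend", "tenure_days"]
categorical_cols = ["plan_type", "region"]

# One object that can transform rest_of_data now and any new data later.
feature_pipeline = ColumnTransformer([
    ("numeric", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_cols),
    ("categorical", OneHotEncoder(handle_unknown="ignore"), categorical_cols),
])
```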

Specify a Model

We need to specify a model, e.g. choose to use ‘random forest’. For this non-grid-search scenario, we’re also going to say this is where hyperparameter tuning occurs.
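Continuing the sketch, “specifying a model” might look like the snippet below: a classification problem is assumed, the feature_pipeline comes from the previous snippet, and the hyperparameter values are arbitrary stand-ins for whatever tuning you do at this step.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# The "specified model": feature engineering plus a random forest,
# with hand-picked example hyperparameters standing in for tuning.
specified_model = Pipeline([
    ("features", feature_pipeline),
    ("model", RandomForestClassifier(n_estimators=500, max_depth=10, random_state=42)),
])
```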

Cross Validation

Now we cross-validate.

You know the idea: take the rest_of_data and partition it into k folds. For fold j, create a [j]_train_set (all the data except the jth fold) and a [j]_test_set (the jth fold). Train on the training set, test on the test set, record the test performance, and throw away the trained model. Do this for all k folds.

Or, more likely, have sklearn do it for you.
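For instance, sklearn’s cross_val_score does the folding, training, scoring, and discarding in one call. The sketch below reuses the specified_model pipeline from above and assumes a made-up target column called churned.

```python
from sklearn.model_selection import cross_val_score

# Split rest_of_data into features and target ("churned" is a made-up name).
X = rest_of_data.drop(columns=["churned"])
y = rest_of_data["churned"]

# k = 5 folds; each fold's model is trained, scored, and thrown away.
scores = cross_val_score(specified_model, X, y, cv=5)

print(scores.mean(), scores.std())  # expected performance and its spread
```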

Go ahead, we’ll wait.

Now we have k data points of out-of-sample performance for our specified model.

We inspect these results by looking at the mean (or median) score and the variance. This is the key to why we cross validate.

Why We Perform Cross Validation:

We want to get an idea of how well the specified model can generalize to unseen data and how reliable that performance is. The mean (or median) gives us something like expected performance. The variance gives us an idea of how likely the model is to actually perform near that ‘expected’ performance.

I’m not making this up–you can look at the sklearn docs and they, too, look at the variance. Some people even plot out the performances.

If you’re coming from a traditional analytics or statistics background, you might think “don’t we have theories for that? don’t we have p-values for that? don’t we have modeling assumptions for that?” The answer is, generally, no. We don’t have the same strong assumptions around the distribution of data or the performance of models in broader machine learning as we do in, say, linear models or GLMs. As a result, we need to engineer what we can’t theory [sic]. Cross validation is the engineered answer to “how well will this perform?”

Assuming the performance measures are acceptable, we’ve now found a specification of a model we like. Huzzah! (Otherwise, go back and specify a different model and repeat this process.)

Note: we did not get a trained model; we got a bunch of trained models, none of which we will actually use and all of which we will immediately discard. What we found was that the “specified model”, e.g. random forest, was a good choice for this problem.

Get the Performance Estimate

Armed with a specified model we like, we train this specified model on the entirety of rest_of_data. Sweet.

Now we load the holdout_data and run it through the feature engineering pipeline. We test our trained-on-rest-of-data specified model on the holdout_data. This gives us our performance estimate.
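Sticking with the hypothetical names from the earlier sketches (and accuracy as a stand-in for whatever metric you actually care about), this step might look like:

```python
import pandas as pd
from sklearn.metrics import accuracy_score

# Train the specified model (feature pipeline included) on all of rest_of_data.
specified_model.fit(X, y)

# Bring the holdout back; predict() runs it through the same feature pipeline.
holdout_data = pd.read_csv("holdout_data.csv")
X_holdout = holdout_data.drop(columns=["churned"])
y_holdout = holdout_data["churned"]

# This single number is the performance estimate.
holdout_estimate = accuracy_score(y_holdout, specified_model.predict(X_holdout))
```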

We hold a silent sort of intuition about performance at this step because rest_of_data is necessarily larger than any training set used during cross validation.

The intuition is that the specified model, trained on this larger set, should not show higher variation in performance and should, on average, perform at least as well as the average performance seen during cross validation. To actually check this, however, we would need to repeat the steps from “First Partition of Data” through this one multiple times.

NOTE: We do not assume it will perform at or better than average performance seen in cross validation. That is a very different assumption.

NOTE: we did not get a trained model; we got a trained model which we will immediately discard. What we found was a better estimate for in-production performance of the “specified model”, e.g. random forest.

Train Your Final Model

Assuming our performance estimate is sufficient, now we train the specified model on the entire_dataset. This is the trained model we use for production and/or decision making. This is our final model.
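One last sketch with the same made-up names; here the full dataset is re-assembled from its two pieces rather than re-read from disk.

```python
import pandas as pd

# Re-assemble entire_dataset from rest_of_data and holdout_data.
entire_dataset = pd.concat([rest_of_data, holdout_data], ignore_index=True)

X_all = entire_dataset.drop(columns=["churned"])
y_all = entire_dataset["churned"]

# The trained model we actually keep: the specified model fit on everything.
final_model = specified_model.fit(X_all, y_all)
```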

Monitoring & Maintenance

Until, as always, we iterate.

Hope this gave a little more insight into cross-validation in practice!

Talk the Talk: A Short Story on Why You Need at Least the Bare Bones of Data Science Jargon

This is a true, grimace-inducing story from a past position. Names and details have been changed to protect the innocent…and everyone else involved.

I’m running a model and taking a late lunch in a near empty cafeteria when my (then) boss slacks me. He wants an immediate video call. Like many, Adam was an engineering manager-turned-data science manager. He was very hands-off for the most part. But, he was prone to sudden bursts of interest.

“Where are you in your project?” His tone and lack of greeting make it clear he is stressed.

Adam’s background didn’t lend itself to data science jargon, so I had long since learned to explain my work in straightforward language–particularly when he was already frustrated.

“I’m running a model now. I just finished making more columns–err–input variables–to improve predictions.”

“How long will that take?”

“We’re already using predictions from an early model, but I’ve been working back and forth with some of the subject matter experts, creating new input data, training models, improving performance…”

“Well, when are you going to have a deliverable?” He’s agitated.

“We’re already using the predictions. I’m just working on improving–“

“But when will you have something to ship?” he interrupts. “James has already shipped four features this quarter.”

I stare at the screen. He glares back at me from beneath the finger-print smudges, brows knit and demanding an explanation.

James was a data scientist on a different team. We were friendly and caught up regularly around the modern day water cooler (READ: pamplemousse La Croix-filled fridge). As is often the case, we kept abreast of each other’s work out of vague collaborative intention, common interest in the company, and a general sense of camaraderie. Also, snacks.

So, I happen to know that James had already accomplished much more that quarter than building four columns.

But why is Adam so impressed by feature engineering a few columns? And why is he calling them “shipped”?

“I need to see what features you’ve shipped.” He’s looking down at his phone, the edge of which is in view of the camera.

Why is he calling them “shipped”?

“Are you working on the website or the app?”

What?

“You need to be shipping features.” He gestures with his phone still in his hand.

Why…….ohhhhh shit.

Where do I begin?

“So…a feature–in data science–is a column of input data,” I explain.

Blank stare.

A pair of analysts I know sit at my table. Extra shit. I check my pocket for headphones. No luck.

“What do you mean, a feature in data science?” His aggression is waning, but I scoot to the edge of the table nonetheless.

“It’s a column of data–the input values for a model.” I pause to see where that lands. “And feature engineering means building columns of input data. James built four columns of input data.” And to give credit where it’s due, “…He’s also done a lot more than that.”

Adam’s looking at me suspiciously, but not altogether in disbelief.

“In data science a feature isn’t like…a button on an app or a drop-down menu or something.” The change in his facial expression tells me–yep, that’s what he was thinking. His phone drops out of sight. “A feature is a predictor–an input value.” I search for another way to say it. “A column of X.”

Nothing.

“Ok, so today I finished creating change-in-weekly-spend columns because I think they might help predict churn.”

He looks at me blankly. This is getting nowhere.

“So, the features are change-in-weekly-spend. I made one for each of the 10 weeks prior to a set of given dates.” Pause. “So I feature engineered those input values. I’m training a model using them now.”

Silence.

“I shipped 10 features this morning.”

“Ok, see, this is what I need to know.” He’s relieved. “I need to be showing your value. You should be telling me this. I would have said so this morning.”

After a quick bit of chat commiserating about meetings, the call ends.

I slump on the bench. What else has been lost in translation? I think to myself. What does he think I do? What is he saying at those–

“Did he just ask you what a feature was?” One of the analysts asks.