Minimum data needs for ML models

“How much data do we need to build a predictive model?”

I get this question all the time when working with new clients. And the honest (and slightly unsatisfying) answer is: we don’t know upfront.

Unlike building software, where an engineer can often give a crisp yes or no, predictive modeling lives in the land of “it depends.”

Take returns, for example. Let’s say you want to predict whether a shopper will return an item, so you can proactively educate or reach out.

Sometimes a few thousand labeled orders with clear shopper behavior are plenty to get signal. I’ve even seen cases where a hundred examples were enough to build a surprisingly solid baseline model.

Other times? Returns are rare, so even thousands of orders contain only a handful of actual returns, or the patterns are too messy, and you’ll need far more data to get anything reliable.
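
To put rough numbers on the rare-returns case, here is a back-of-the-envelope sketch. The return rates and the target of 50 returned orders are made-up inputs for illustration, not benchmarks, and `orders_needed` is a hypothetical helper name.

```python
import math


def orders_needed(target_returns: int, return_rate: float) -> int:
    """Expected number of labeled orders needed to observe `target_returns` returns."""
    if not 0 < return_rate <= 1:
        raise ValueError("return_rate must be in (0, 1]")
    return math.ceil(target_returns / return_rate)


# Hypothetical inputs: aim for 50 returned orders at several return rates.
for rate in (0.25, 0.10, 0.02):
    print(f"return rate {rate:.0%}: about {orders_needed(50, rate):,} orders")
```

The only point here is the scaling: the rarer the return, the more orders you have to label before the model sees enough of them.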

I recommend starting small, with as few as 100 to 200 labeled examples. See if that’s enough for a baseline model, and go from there.
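
One way to run that check is to train the same simple model on growing slices of your labeled data and watch whether the score is still climbing. The sketch below assumes scikit-learn and uses a synthetic dataset as a stand-in for real labeled orders. The logistic regression baseline, the ROC AUC metric, the slice sizes, and the `baseline_scores` helper are all my assumptions for illustration, so swap in whatever fits your problem.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score


def baseline_scores(X, y, sizes, seed=0):
    """Mean cross-validated ROC AUC of a simple baseline at each dataset size."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))  # shuffle once, then take growing prefixes
    results = {}
    for n in sizes:
        idx = order[:n]
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        model = LogisticRegression(max_iter=1000)
        fold_scores = cross_val_score(model, X[idx], y[idx], cv=cv, scoring="roc_auc")
        results[n] = float(fold_scores.mean())
    return results


# Synthetic stand-in for labeled orders (assumption): 2,000 rows, about 20% "returned".
X, y = make_classification(
    n_samples=2000,
    n_features=10,
    n_informative=4,
    n_clusters_per_class=1,
    weights=[0.8, 0.2],
    random_state=0,
)

scores = baseline_scores(X, y, sizes=[100, 200, 500, 1000, 2000])
for n, auc in scores.items():
    print(f"{n:>5} examples: mean ROC AUC = {auc:.3f}")
```

If the score has flattened by a few hundred examples, more labeling probably won’t buy much for this model. If it is still rising at the largest size, that is the signal to go collect more.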

The only way to know? Roll up your sleeves and see.