Minimum data needs for ML models
“How much data do we need to build a predictive model?”
I get this question all the time when working with new clients. And the honest (and slightly unsatisfying) answer is: we don’t know upfront.
Unlike building software, where an engineer can often give a crisp yes/no, predictive modeling lives in the land of “it depends.”
Take returns, for example. Let’s say you want to predict whether a shopper will return an item, so you can proactively educate them or reach out.
Sometimes a few thousand labeled orders with clear shopper behavior is plenty to get signal. I’ve even seen cases where a hundred examples were enough to build a surprisingly solid baseline predictive model.
Other times? Returns are rare, or the patterns are too messy, and you’ll need way more data to get anything reliable.
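To make the "returns are rare" case concrete, here is a quick back-of-the-envelope sketch. The return rates and the target of 100 returned orders are hypothetical numbers I picked for illustration, not benchmarks; the point is only that the number of positive examples, not the total row count, is what runs out first.

```python
from fractions import Fraction
import math

def expected_returns(n_orders, return_rate):
    """Expected number of returned orders in a sample of n_orders."""
    return n_orders * return_rate

def orders_needed(target_returns, return_rate):
    """Orders you would need to label to expect target_returns returned ones."""
    return math.ceil(target_returns / return_rate)

TARGET_RETURNS = 100  # hypothetical target, chosen only for illustration

for pct in (20, 5, 1):  # hypothetical return rates, in percent
    rate = Fraction(pct, 100)  # exact arithmetic, avoids float rounding
    in_200 = float(expected_returns(200, rate))
    needed = orders_needed(TARGET_RETURNS, rate)
    print(f"{pct:>2}% return rate: 200 orders hold about {in_200:.0f} returns; "
          f"{TARGET_RETURNS} returns takes about {needed} orders")
```

At a 1% return rate, 200 labeled orders contain only about 2 returns, which is why the same dataset size can be plenty for one business and hopeless for another.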
I recommend starting small, with as few as 100-200 labeled examples. See if that’s enough for a baseline model that beats a naive guess, and go from there.
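Here is a minimal sketch of what "start small and see" can look like in practice. Everything in it is an assumption for illustration: the data is synthetic (`make_orders` is a hypothetical stand-in for your labeled orders), the two features are made up, and the model is a hand-rolled logistic regression so the example runs with nothing but the standard library. On a real project you would likely reach for an established ML library instead. The check itself is the useful part: compare cross-validated accuracy against always guessing the majority class, at a few dataset sizes.

```python
import math
import random

def make_orders(n, seed=0):
    """Synthetic stand-in for labeled orders: ([features], returned?) pairs."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x1 = rng.gauss(0, 1)  # hypothetical feature, e.g. scaled basket size
        x2 = rng.gauss(0, 1)  # hypothetical feature, e.g. scaled past-return rate
        p = 1 / (1 + math.exp(-(0.5 * x1 + 1.5 * x2 - 1.0)))
        rows.append(([x1, x2], 1 if rng.random() < p else 0))
    return rows

def score(w, b, x):
    """Linear score; positive means 'predict return'."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def fit_logistic(rows, epochs=200, lr=0.1):
    """Logistic regression fitted with plain batch gradient descent."""
    w, b, n = [0.0, 0.0], 0.0, len(rows)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for x, y in rows:
            err = 1 / (1 + math.exp(-score(w, b, x))) - y
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

def cv_scores(rows, k=5):
    """Mean accuracy over k folds: (model, always-guess-majority baseline)."""
    model_acc, base_acc = [], []
    for i in range(k):
        test = rows[i::k]
        train = [r for j, r in enumerate(rows) if j % k != i]
        w, b = fit_logistic(train)
        model_acc.append(sum((score(w, b, x) > 0) == (y == 1) for x, y in test) / len(test))
        majority = 1 if 2 * sum(y for _, y in train) > len(train) else 0
        base_acc.append(sum(y == majority for _, y in test) / len(test))
    return sum(model_acc) / k, sum(base_acc) / k

orders = make_orders(200)
results = []
for n in (100, 150, 200):
    model, baseline = cv_scores(orders[:n])
    results.append((n, model, baseline))
    print(f"{n:>4} examples: model {model:.2f} vs majority-class {baseline:.2f}")
```

If the model's score sits clearly above the majority-class number and keeps improving as you add rows, you have signal and a reason to label more. If the two numbers stay glued together, more of the same data may not help, and the features or the labels deserve a second look first.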
The only way to know? Roll up your sleeves and see.