Data debt is the accumulated, mostly invisible cost an organization carries when its data has never been put into a state that AI can use, trace, and reproduce. The term borrows from technical debt: a shortcut that saves time now accrues interest later, and the same holds for data. A dataset that looks clean in a demo slice can be missing context, carry undocumented state, and have no record of which version produced a result, and that cost comes due once a model reaches production.
Data debt is not a storage problem, and it is not the same as poor data quality. Data can be well stored and broadly accurate yet still be unusable or unreproducible for a specific AI run. It surfaces when a result changes between runs and no one can say which data state produced the earlier one, or when an audit asks which dataset produced which result and the trail is missing.
Reducing data debt means moving data into an AI-ready state and keeping that state stable from one run to the next: capturing the exact data a run used, holding it as a version, and being able to replay it later so the earlier result still holds.
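As a rough illustration of the capture-and-replay idea, the sketch below fingerprints the files a run read, writes that fingerprint to a small manifest, and later checks whether the same data state is still present before a replay. The function names (`fingerprint`, `record_run`, `can_replay`) and the manifest layout are hypothetical, chosen only for this example, and the approach assumes the data lives in local files small enough to read into memory. It records and detects a data state; actually holding a version would also require keeping an immutable copy or snapshot of the files, which is not shown here.

```python
import hashlib
import json
from pathlib import Path


def fingerprint(paths):
    """Return one SHA-256 digest covering the names and contents of the files.

    Hypothetical helper: each file is hashed on its own, then the sorted
    (name, digest) pairs are hashed together so file order does not matter.
    """
    entries = sorted(
        (Path(p).name, hashlib.sha256(Path(p).read_bytes()).hexdigest())
        for p in paths
    )
    return hashlib.sha256(json.dumps(entries).encode("utf-8")).hexdigest()


def record_run(manifest_path, run_id, paths):
    """Write a manifest tying a run ID to the exact data state it used."""
    manifest = {
        "run_id": run_id,
        "files": sorted(str(p) for p in paths),
        "data_version": fingerprint(paths),
    }
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return manifest


def can_replay(manifest_path):
    """Return True only if the files on disk still match the recorded state."""
    manifest = json.loads(Path(manifest_path).read_text())
    return fingerprint(manifest["files"]) == manifest["data_version"]
```

With a manifest like this, the question "which dataset produced which result" has a recorded answer, and `can_replay` returns `False` as soon as any recorded file's bytes differ from what the run originally read.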