Data ingestion is the process of moving data from its sources into a system where it can be stored, processed, or analyzed. It covers both batch ingestion, where data arrives on a schedule, and streaming ingestion, where it flows in continuously. For example, sales records might be ingested in nightly batches, while card transactions are ingested as a continuous stream. Ingestion is often confused with data acquisition. Acquisition is about obtaining the data in the first place; ingestion is the step that brings it into the destination and shapes it for use. For AI, the part that gets overlooked is state. The exact condition of the data at the moment it is ingested (its schema, its distributions, and the transformations applied to it) is what a model actually runs on, so a result can be reproduced later only if that ingested state was captured rather than assumed.
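One way to make "captured rather than assumed" concrete is to write a small manifest alongside every ingested batch. The sketch below is illustrative only and uses just the Python standard library. The function name ingest_batch, the to_cents transform, the sample sales records, and the manifest fields are all hypothetical, chosen for this example rather than taken from any particular tool. It records the schema, simple distribution statistics, the names of the transformations applied, and a content hash of the ingested rows.

```python
import hashlib
import json
import statistics
from datetime import datetime, timezone


def ingest_batch(records, transforms):
    """Apply named transforms to each record and return (rows, manifest).

    `transforms` is a list of (name, function) pairs. Hypothetical helper:
    it assumes every record is a flat dict of JSON-serializable values.
    """
    rows = []
    for record in records:
        for _name, fn in transforms:
            record = fn(record)
        rows.append(record)

    # Schema: column name -> sorted list of Python type names seen in it.
    seen = {}
    for row in rows:
        for key, value in row.items():
            seen.setdefault(key, set()).add(type(value).__name__)
    schema = {key: sorted(types) for key, types in sorted(seen.items())}

    # Distributions: basic summary statistics for numeric columns only.
    distributions = {}
    for key in schema:
        values = [
            row[key]
            for row in rows
            if isinstance(row.get(key), (int, float))
            and not isinstance(row.get(key), bool)
        ]
        if values:
            distributions[key] = {
                "count": len(values),
                "min": min(values),
                "max": max(values),
                "mean": statistics.fmean(values),
            }

    # Content hash over a canonical JSON form, so the same rows in the
    # same order always produce the same digest.
    canonical = json.dumps(rows, sort_keys=True, separators=(",", ":"))
    manifest = {
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "row_count": len(rows),
        "schema": schema,
        "distributions": distributions,
        "transforms": [name for name, _ in transforms],
        "content_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
    return rows, manifest


def to_cents(record):
    """Example transform: add an integer amount in cents."""
    return {**record, "amount_cents": round(record["amount"] * 100)}


# Made-up nightly sales batch, for illustration only.
sales = [{"store": "A", "amount": 12.5}, {"store": "B", "amount": 7.25}]
rows, manifest = ingest_batch(sales, [("to_cents", to_cents)])
print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the data means a later run can recompute the hash and statistics and compare them with what was recorded, instead of trusting that the data has not changed. A streaming pipeline could apply the same idea per window or per micro-batch, though that is not shown here.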
Frequently asked questions
What is data ingestion?
The process of moving data from its sources into a system where it can be stored, processed, or analyzed, in batch or streaming form.
How is data ingestion different from data acquisition?
Acquisition is about obtaining data in the first place. Ingestion brings that data into a destination system and shapes it for use.
Why does data ingestion matter for AI?
A model runs on the exact state of the data at ingestion, so a result is only reproducible later if that state was captured rather than assumed.