What is a data lake?

A data lake is a central store that holds raw data of any shape (structured, semi-structured, and unstructured) at scale and low cost. Schema is applied on read rather than on write, in contrast to a data warehouse, which enforces a schema at load time.

A data lake is a storage repository, usually built on low-cost object storage such as Amazon S3, that holds large volumes of raw data in its native format: tables, JSON logs, images, audio, Parquet files, and so on. Its defining trait is schema-on-read. Unlike a data warehouse, which requires you to model and clean data into a strict schema before loading it (schema-on-write), a lake lets you land everything first and impose structure only when you query it. That flexibility makes lakes the default landing zone for high-volume, heterogeneous data (machine-generated logs, event streams, ML training sets) where modeling everything up front is impractical. The classic tradeoff is governance: without discipline, a lake degrades into a "data swamp" of undocumented, unqueryable files. The modern answer is the lakehouse: table formats such as Delta Lake, Apache Iceberg, and Apache Hudi add warehouse-style transactions, schema enforcement, and time travel on top of lake storage, blurring the line between lake and warehouse. In a typical pipeline, raw data lands in the lake (often via ELT), then curated subsets are modeled into a warehouse for analytics. Agents querying organizational data usually hit the modeled warehouse layer, not the raw lake, through servers like BigQuery, Snowflake, or ClickHouse.
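To make schema-on-read concrete, here is a minimal sketch in Python using only the standard library. It assumes a temporary local directory standing in for object storage, and every name in it (RAW_EVENTS, land, read_with_schema, demo) is hypothetical, chosen for illustration rather than taken from any real lake tooling. Records of inconsistent shape are written untouched, and a schema is imposed only when they are read back.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical raw events with inconsistent shapes: a string where a number
# is expected, a missing field, an extra nested field.
RAW_EVENTS = [
    '{"user": "ada", "ms": 120, "path": "/home"}',
    '{"user": "bob", "ms": "95"}',
    '{"user": "cy", "extra": {"debug": true}}',
]


def land(lake_dir, lines):
    """Land raw records as-is: no validation and no modeling at write time."""
    target = Path(lake_dir) / "events.jsonl"
    target.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return target


def read_with_schema(path, schema):
    """Apply a schema at read time.

    `schema` maps field name -> cast function. Fields not in the schema are
    ignored; fields missing from a record come back as None.
    """
    rows = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            row[field] = cast(value) if value is not None else None
        rows.append(row)
    return rows


def demo():
    # A temporary directory stands in for the lake's object storage (assumption).
    with tempfile.TemporaryDirectory() as tmp:
        path = land(tmp, RAW_EVENTS)
        return read_with_schema(path, {"user": str, "ms": int})


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The same files could be read later with a different schema, for example one that also pulls out "path", without rewriting anything already stored. A schema-on-write system would instead have rejected or transformed the second and third records at load time.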