There’s a misunderstanding hiding beneath a lot of the excitement around large language models (LLMs) and AI. It’s not the obvious stuff about hype or overblown promises, though those are rampant right now with a study suggesting 95% of AI projects have zero returns. It’s much much quieter than that. It’s the assumption that these systems know things in the way humans know things. It’s the idea of Artificial general intelligence (AGI) – that these systems are somehow accumulating understanding over time, staying current as the world changes, and forming an internal model of reality as they go. The fact is, they aren’t, and the gap between that assumption and how these systems actually work explains a huge misunderstanding with a lot of people.
A language model is trained on a massive snapshot of data that ends at a very specific moment in time. Go look at any company’s LLM model page to find these dates. That snapshot is baked into the model and then frozen. From that point forward, nothing inside the model updates unless it is retrained. No matter how fluent, confident, or authoritative an answer sounds, the model has no native awareness of anything that happened after its cutoff date or outside of the data that it was trained on.
You can see this limitation with a simple example:


This is where things go off the rails. People notice that AI systems can answer questions about recent events or live systems. From the outside, it looks like continuous learning and actual up-to-date knowledge. What’s actually happening is the model’s data is being augmented. The model itself hasn’t learned anything new. It’s been given access to external information and asked to reason over it.
Most real-world systems do this through retrieval‑augmented generation, or RAG. A query comes in, relevant documents or records are retrieved from a database, a search engine, or an API, and those results are injected into the model’s context window. The model then operates on all the data and comes up with an answer based on that material. When it works well, it feels like the model is suddenly intelligent and up-to-date. In reality, it’s just better informed in that moment.
All of this matters because it shifts the center of gravity away from the model, the idea of artificial intelligence, and toward the data. If the data is organized, structured, well‑labeled, and explicit about things like time, ownership, and authority, the model can do impressive work. It can connect dots across documents, explain why decisions were made, surface inconsistencies, and summarize complex systems faster than any human could.
If the data is not well organized, the failure mode is hallucinations because the models still answer the question, because answering is what the models are optimized to do. Accuracy is not guaranteed.
An LLM has no inherent understanding of what your data means. It doesn’t know which document is authoritative, which version is obsolete, even how the data was derived. Unless you encode those distinctions explicitly, the model will infer them. And inference, in this context, is just guesswork dressed up as confidence.
I’ll go through an e-commerce example below to demonstrate what I mean.
Once you stop treating an LLM like a knowledge store and start treating it like a reasoning engine sitting on top of your information architecture, the ideas will fall into place much easier. The model doesn’t save you from poor structure though, it magnifies it. Good data modeling, clear schemas, explicit metadata, and disciplined information hygiene don’t just make AI safer, they determine whether it’s useful at all.
This is why AI hasn’t reduced the importance of experienced engineers and technical leaders. Someone still has to decide what is true, what is current, what matters, and what must never be inferred. AI can help you explore the space faster. Humans are still responsible for deciding which parts of that space correspond to reality.
Here’s a very simplified e-commerce example that’ll help AI better answer questions about data.
Note, the below code is for MySQL/MariaDB.
Layer 0: Raw events
This is the raw layer. It may contain duplicates, messy data. This is the layer you want to be able to cite as source-of-truth.
Table definition + sample data
DROP TABLE IF EXISTS l0_raw_order_events;
CREATE TABLE l0_raw_order_events (
raw_event_id BIGINT PRIMARY KEY,
event_type VARCHAR(32) NOT NULL,
order_id VARCHAR(32) NOT NULL,
order_line_id VARCHAR(32) NOT NULL,
customer_id VARCHAR(32) NOT NULL,
product_sku_raw VARCHAR(64) NOT NULL,
quantity INT NOT NULL,
unit_price_cents INT NOT NULL,
event_ts DATE NOT NULL
);
INSERT INTO l0_raw_order_events VALUES
(1001, 'ORDER_PLACED', 'O-1', 'O-1-1', 'C-10', 'sku001', 2, 1500, '2026-02-10'),
(1002, 'ORDER_PLACED', 'O-1', 'O-1-1', 'C-10', 'SKU001', 2, 1500, '2026-02-10'), -- duplicate of 1001
(1003, 'ORDER_PLACED', 'O-2', 'O-2-1', 'C-10', 'SKU002', 1, 2500, '2026-02-11'),
(1004, 'REFUND_ISSUED', 'O-1', 'O-1-1', 'C-10', 'SKU001', 1, 1500, '2026-02-12'), -- partial refund
(1005, 'ORDER_PLACED', 'O-3', 'O-3-1', 'C-20', 'sku-003', 3, 800, '2026-02-12'); -- new customer + messy SKU
Layer 0 table (results)
| raw_event_id | event_type | order_id | order_line_id | customer_id | product_sku_raw | quantity | unit_price_cents | event_ts |
|---|---|---|---|---|---|---|---|---|
| 1001 | ORDER_PLACED | O-1 | O-1-1 | C-10 | sku001 | 2 | 1500 | 2026-02-10 |
| 1002 | ORDER_PLACED | O-1 | O-1-1 | C-10 | SKU001 | 2 | 1500 | 2026-02-10 |
| 1003 | ORDER_PLACED | O-2 | O-2-1 | C-10 | SKU002 | 1 | 2500 | 2026-02-11 |
| 1004 | REFUND_ISSUED | O-1 | O-1-1 | C-10 | SKU001 | 1 | 1500 | 2026-02-12 |
| 1005 | ORDER_PLACED | O-3 | O-3-1 | C-20 | sku-003 | 3 | 800 | 2026-02-12 |
Lineage at Layer 0
None. This is the source.
Layer 1: Cleaned event facts (normalized + deduped)
Layer 1 creates normalized data while staying close to the original events.
SQL to derive Layer 1 tables
DROP TABLE IF EXISTS l1_clean_order_events;
CREATE TABLE l1_clean_order_events (
l1_event_id BIGINT AUTO_INCREMENT PRIMARY KEY,
raw_event_id BIGINT NOT NULL,
order_id VARCHAR(32) NOT NULL,
order_line_id VARCHAR(32) NOT NULL,
customer_id VARCHAR(32) NOT NULL,
product_sku VARCHAR(64) NOT NULL,
qty INT NOT NULL,
unit_price DECIMAL(10,2) NOT NULL,
event_type VARCHAR(32) NOT NULL,
event_ts DATE NOT NULL
);
-- Lineage table L1 -> L0
DROP TABLE IF EXISTS lineage_l1_to_l0;
CREATE TABLE lineage_l1_to_l0 (
l1_event_id BIGINT NOT NULL,
raw_event_id BIGINT NOT NULL,
PRIMARY KEY (l1_event_id, raw_event_id)
);
-- Populate the normalized data (fix SKU names, deduplicate records, etc.
INSERT INTO l1_clean_order_events (
raw_event_id, order_id, order_line_id, customer_id, product_sku, qty, unit_price, event_type, event_ts
)
WITH normalized AS (
SELECT
raw_event_id,
order_id,
order_line_id,
customer_id,
REGEXP_REPLACE(UPPER(TRIM(product_sku_raw)), '[^A-Z0-9]', '') AS product_sku,
quantity AS qty,
CAST(unit_price_cents / 100.0 AS DECIMAL(10,2)) AS unit_price,
event_type,
event_ts,
ROW_NUMBER() OVER (
PARTITION BY
order_id, order_line_id, customer_id,
REGEXP_REPLACE(UPPER(TRIM(product_sku_raw)), '[^A-Z0-9]', ''),
quantity, unit_price_cents, event_type, event_ts
ORDER BY raw_event_id
) AS rn
FROM l0_raw_order_events
)
SELECT
raw_event_id, order_id, order_line_id, customer_id, product_sku, qty, unit_price, event_type, event_ts
FROM normalized
WHERE rn = 1;
-- Populate lineage using the raw_event_id we carried forward
INSERT INTO lineage_l1_to_l0 (l1_event_id, raw_event_id)
SELECT l1_event_id, raw_event_id
FROM l1_clean_order_events;
Layer 1 table (results)
| l1_event_id | raw_event_id | order_id | order_line_id | customer_id | product_sku | qty | unit_price | event_type | event_ts |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 1001 | O-1 | O-1-1 | C-10 | SKU001 | 2 | 15.00 | ORDER_PLACED | 2026-02-10 |
| 2 | 1003 | O-2 | O-2-1 | C-10 | SKU002 | 1 | 25.00 | ORDER_PLACED | 2026-02-11 |
| 3 | 1004 | O-1 | O-1-1 | C-10 | SKU001 | 1 | 15.00 | REFUND_ISSUED | 2026-02-12 |
| 4 | 1005 | O-3 | O-3-1 | C-20 | SKU003 | 3 | 8.00 | ORDER_PLACED | 2026-02-12 |
(Notice raw_event_id 1002 is removed by dedupe.)
Lineage: lineage_l1_to_l0 (results)
| l1_event_id | raw_event_id |
|---|---|
| 1 | 1001 |
| 2 | 1003 |
| 3 | 1004 |
| 4 | 1005 |
Layer 2: Analytics-ready net order lines (net revenue + net quantity)
Layer 2 is where the system becomes opinionated: it converts event-level reality (sales + refunds) into net business facts that dashboards and AI will ask about. This layer must be attributable because it is derived, aggregated truth.
SQL to derive Layer 2 tables
DROP TABLE IF EXISTS l2_order_line_net;
CREATE TABLE l2_order_line_net (
l2_line_id BIGINT AUTO_INCREMENT PRIMARY KEY,
order_id VARCHAR(32) NOT NULL,
order_line_id VARCHAR(32) NOT NULL,
customer_id VARCHAR(32) NOT NULL,
product_sku VARCHAR(64) NOT NULL,
net_qty INT NOT NULL,
net_revenue DECIMAL(12,2) NOT NULL,
effective_day DATE NOT NULL
);
-- Lineage table L2 -> L1
DROP TABLE IF EXISTS lineage_l2_to_l1;
CREATE TABLE lineage_l2_to_l1 (
l2_line_id BIGINT NOT NULL,
l1_event_id BIGINT NOT NULL,
contribution_type VARCHAR(16) NOT NULL, -- 'SALE' or 'REFUND'
PRIMARY KEY (l2_line_id, l1_event_id)
);
-- Derivation: aggregate by order line
-- These derivations can be anything, but it's important that you can attribute them back
INSERT INTO l2_order_line_net (order_id, order_line_id, customer_id, product_sku, net_qty, net_revenue, effective_day)
SELECT
order_id,
order_line_id,
customer_id,
product_sku,
SUM(CASE WHEN event_type='ORDER_PLACED' THEN qty
WHEN event_type='REFUND_ISSUED' THEN -qty
ELSE 0 END) AS net_qty,
SUM(CASE WHEN event_type='ORDER_PLACED' THEN qty*unit_price
WHEN event_type='REFUND_ISSUED' THEN -qty*unit_price
ELSE 0 END) AS net_revenue,
MAX(event_ts) AS effective_day
FROM l1_clean_order_events
GROUP BY order_id, order_line_id, customer_id, product_sku
ORDER BY order_id, order_line_id;
-- Populate lineage: map each L2 row back to contributing L1 rows
INSERT INTO lineage_l2_to_l1 (l2_line_id, l1_event_id, contribution_type)
SELECT
l2.l2_line_id,
l1.l1_event_id,
CASE WHEN l1.event_type='ORDER_PLACED' THEN 'SALE' ELSE 'REFUND' END
FROM l2_order_line_net l2
JOIN l1_clean_order_events l1
ON l1.order_id = l2.order_id
AND l1.order_line_id = l2.order_line_id
AND l1.customer_id = l2.customer_id
AND l1.product_sku = l2.product_sku;
Layer 2 table (results)
| l2_line_id | order_id | order_line_id | customer_id | product_sku | net_qty | net_revenue | effective_day |
|---|---|---|---|---|---|---|---|
| 1 | O-1 | O-1-1 | C-10 | SKU001 | 1 | 15.00 | 2026-02-12 |
| 2 | O-2 | O-2-1 | C-10 | SKU002 | 1 | 25.00 | 2026-02-11 |
| 3 | O-3 | O-3-1 | C-20 | SKU003 | 3 | 24.00 | 2026-02-12 |
Lineage: lineage_l2_to_l1 (results)
| l2_line_id | l1_event_id | contribution_type |
|---|---|---|
| 1 | 1 | SALE |
| 1 | 3 | REFUND |
| 2 | 2 | SALE |
| 3 | 4 | SALE |
End-to-end lineage: Layer 2 → Layer 0 (final derived fact → original raw)
Even with only 3 layers, you still want an ability to attribute the final data back to the raw data, so an LLM can cite original records without walking multiple joins every time.
SQL to build end-to-end lineage
DROP TABLE IF EXISTS lineage_l2_to_l0;
CREATE TABLE lineage_l2_to_l0 (
l2_line_id BIGINT NOT NULL,
raw_event_id BIGINT NOT NULL,
PRIMARY KEY (l2_line_id, raw_event_id)
);
INSERT INTO lineage_l2_to_l0 (l2_line_id, raw_event_id)
SELECT DISTINCT
l2.l2_line_id,
l0.raw_event_id
FROM lineage_l2_to_l1 l2
JOIN lineage_l1_to_l0 l0
ON l0.l1_event_id = l2.l1_event_id;
lineage_l2_to_l0 (results)
| l2_line_id | raw_event_id |
|---|---|
| 1 | 1001 |
| 1 | 1004 |
| 2 | 1003 |
| 3 | 1005 |
Meaning:
- Net line O-1/O-1-1 derived from raw events 1001 (sale) and 1004 (refund)
- Net line O-2/O-2-1 derived from raw event 1003 (sale)
- Net line O-3/O-3-1 derived from raw event 1005 (sale)
Meta Data Tables
These tables exist for machines, not humans.
ai_table_catalog
This is what you pass to the model so it can:
- discover tables
- understand the purpose and grain
- find primary keys and join keys
- know which tables are “source of truth”
- know which tables are safe for analytics vs raw
DROP TABLE IF EXISTS ai_table_catalog;
CREATE TABLE ai_table_catalog (
table_name VARCHAR(128) PRIMARY KEY,
layer VARCHAR(8) NOT NULL, -- L0, L1, L2
table_role VARCHAR(64) NOT NULL, -- 'raw', 'clean_facts', 'analytics'
description TEXT NOT NULL, -- human + AI-readable
grain VARCHAR(255) NOT NULL, -- e.g. 'one row per ingested event'
primary_key VARCHAR(255) NOT NULL, -- e.g. 'raw_event_id'
natural_key VARCHAR(512) NULL, -- useful for dedupe / uniqueness explanation
important_columns TEXT NOT NULL, -- short list of key columns and meaning
join_guidance TEXT NOT NULL, -- how to join to other tables
freshness_policy VARCHAR(255) NOT NULL, -- e.g. 'append-only daily'
authoritative_source TINYINT(1) NOT NULL DEFAULT 0, -- 1 if source-of-truth
pii_notes TEXT NULL, -- what to be careful about
query_examples TEXT NULL -- small examples agents can use
);
INSERT INTO ai_table_catalog (
table_name, layer, table_role, description, grain, primary_key, natural_key,
important_columns, join_guidance, freshness_policy, authoritative_source, pii_notes, query_examples
) VALUES
(
'l0_raw_order_events','L0','raw',
'Raw order events exactly as ingested. Contains duplicates and inconsistent SKUs. Use for audit and citations.',
'One row per ingested event record (may include duplicates).',
'raw_event_id',
'order_id + order_line_id + customer_id + normalized product_sku + quantity + unit_price_cents + event_type + event_ts',
'raw_event_id (unique id), event_type (ORDER_PLACED/REFUND_ISSUED), order_id/order_line_id, customer_id, product_sku_raw (messy), quantity, unit_price_cents, event_ts',
'Prefer joining upward via lineage_l1_to_l0 rather than joining directly to L2; raw has duplicates.',
'Append-only; no updates expected.',
1,
'customer_id may be considered sensitive depending on org policy.',
'SELECT * FROM l0_raw_order_events WHERE order_id=''O-1'';'
),
(
'l1_clean_order_events','L1','clean_facts',
'Cleaned event-level facts derived from L0. SKUs normalized, duplicates removed. Each row maps to exactly one representative raw_event_id.',
'One row per deduplicated event fact.',
'l1_event_id',
'order_id + order_line_id + customer_id + product_sku + qty + unit_price + event_type + event_ts',
'l1_event_id (surrogate PK), raw_event_id (pointer to source), product_sku (normalized), qty, unit_price, event_type, event_ts',
'Join to L0 through lineage_l1_to_l0. Join to L2 through lineage_l2_to_l1 or by (order_id, order_line_id, customer_id, product_sku).',
'Rebuilt from L0; deterministic dedupe.',
0,
'customer_id may be considered sensitive depending on org policy.',
'SELECT * FROM l1_clean_order_events WHERE customer_id=''C-10'' ORDER BY event_ts;'
),
(
'l2_order_line_net','L2','analytics',
'Net order-line facts for analytics. Sales and refunds combined into net_qty and net_revenue.',
'One row per (order_id, order_line_id, customer_id, product_sku).',
'l2_line_id',
'order_id + order_line_id + customer_id + product_sku',
'net_qty (sales-refunds), net_revenue (signed revenue), effective_day (max event_ts for the line)',
'Trace back to contributing events via lineage_l2_to_l1, and all the way to raw via lineage_l2_to_l0.',
'Rebuilt from L1; deterministic aggregation.',
0,
'customer_id may be considered sensitive depending on org policy.',
'SELECT * FROM l2_order_line_net WHERE effective_day=''2026-02-12'';'
);
ai_column_catalog
This gives an agent column-level semantics without reading schema DDL.
DROP TABLE IF EXISTS ai_column_catalog;
CREATE TABLE ai_column_catalog (
table_name VARCHAR(128) NOT NULL,
column_name VARCHAR(128) NOT NULL,
data_type VARCHAR(64) NOT NULL,
description TEXT NOT NULL,
semantic_type VARCHAR(64) NOT NULL, -- 'id', 'timestamp', 'money', 'quantity', 'category', etc.
is_primary_key TINYINT(1) NOT NULL DEFAULT 0,
is_foreign_key TINYINT(1) NOT NULL DEFAULT 0,
fk_table VARCHAR(128) NULL,
fk_column VARCHAR(128) NULL,
example_values TEXT NULL,
PRIMARY KEY (table_name, column_name)
);
INSERT INTO ai_column_catalog VALUES
('l2_order_line_net','l2_line_id','BIGINT','Surrogate primary key for net order-line row.','id',1,0,NULL,NULL,'1,2,3'),
('l2_order_line_net','order_id','VARCHAR(32)','Order identifier from commerce system.','id',0,0,NULL,NULL,'O-1,O-2'),
('l2_order_line_net','order_line_id','VARCHAR(32)','Order line identifier within an order.','id',0,0,NULL,NULL,'O-1-1'),
('l2_order_line_net','customer_id','VARCHAR(32)','Customer identifier (may be PII depending on org policy).','id',0,0,NULL,NULL,'C-10,C-20'),
('l2_order_line_net','product_sku','VARCHAR(64)','Normalized product SKU (uppercase, alphanumeric).','id',0,0,NULL,NULL,'SKU001,SKU002'),
('l2_order_line_net','net_qty','INT','Net quantity = sales quantity minus refunded quantity.','quantity',0,0,NULL,NULL,'1,3'),
('l2_order_line_net','net_revenue','DECIMAL(12,2)','Net revenue = sum(qty*unit_price) with refunds treated as negative.','money',0,0,NULL,NULL,'15.00,24.00'),
('l2_order_line_net','effective_day','DATE','Effective day for the net line (MAX(event_ts) from contributing events).','date',0,0,NULL,NULL,'2026-02-11,2026-02-12');
ai_lineage_catalog
DROP TABLE IF EXISTS ai_lineage_catalog;
CREATE TABLE ai_lineage_catalog (
lineage_table_name VARCHAR(128) PRIMARY KEY,
from_table_name VARCHAR(128) NOT NULL,
to_table_name VARCHAR(128) NOT NULL,
from_key_column VARCHAR(128) NOT NULL,
to_key_column VARCHAR(128) NOT NULL,
relationship VARCHAR(64) NOT NULL, -- 'many-to-many', 'many-to-one', etc.
description TEXT NOT NULL,
query_example TEXT NULL
);
INSERT INTO ai_lineage_catalog VALUES
(
'lineage_l1_to_l0',
'l1_clean_order_events',
'l0_raw_order_events',
'l1_event_id',
'raw_event_id',
'many-to-one',
'Maps each cleaned event (L1) to the representative raw event (L0) it came from.',
'SELECT l0.* FROM lineage_l1_to_l0 x JOIN l0_raw_order_events l0 ON l0.raw_event_id=x.raw_event_id WHERE x.l1_event_id=1;'
),
(
'lineage_l2_to_l1',
'l2_order_line_net',
'l1_clean_order_events',
'l2_line_id',
'l1_event_id',
'many-to-many',
'Maps each net order-line (L2) to the contributing cleaned events (L1). Includes contribution_type (SALE/REFUND).',
'SELECT l1.*, x.contribution_type FROM lineage_l2_to_l1 x JOIN l1_clean_order_events l1 ON l1.l1_event_id=x.l1_event_id WHERE x.l2_line_id=1;'
),
(
'lineage_l2_to_l0',
'l2_order_line_net',
'l0_raw_order_events',
'l2_line_id',
'raw_event_id',
'many-to-many',
'End-to-end mapping from net order-line (L2) directly to raw events (L0) for citations and audit.',
'SELECT l0.* FROM lineage_l2_to_l0 x JOIN l0_raw_order_events l0 ON l0.raw_event_id=x.raw_event_id WHERE x.l2_line_id=1;'
);
lineage_transform_meta
DROP TABLE IF EXISTS lineage_transform_meta;
CREATE TABLE lineage_transform_meta (
transform_id VARCHAR(32) PRIMARY KEY,
from_table VARCHAR(128) NOT NULL,
to_table VARCHAR(128) NOT NULL,
description VARCHAR(255) NOT NULL,
logic_summary TEXT NOT NULL,
version VARCHAR(16) NOT NULL,
updated_at DATETIME NOT NULL
);
INSERT INTO lineage_transform_meta VALUES
('L1_CLEAN','l0_raw_order_events','l1_clean_order_events',
'Normalize and deduplicate raw events',
'Normalize SKU (UPPER/TRIM/strip non-alphanumeric); dedupe by natural key using ROW_NUMBER; convert cents->dollars; keep representative raw_event_id for citation.',
'v1.0','2026-02-14 00:00:00'),
('L2_NET','l1_clean_order_events','l2_order_line_net',
'Compute net qty and net revenue per order line',
'Aggregate by (order_id, order_line_id, customer_id, product_sku). Treat refunds as negative. effective_day = MAX(event_ts). Populate lineage_l2_to_l1 and lineage_l2_to_l0 for traceability.',
'v1.0','2026-02-14 00:00:00');
Next Steps + Example
From this point, creating AI code that can read the metadata and decide on how to query the data is a pretty straight forward process of feeding in the ai_* tables as system prompts and asking it to create SQL queries to fetch data as it needs. Executing those SQL continuously (within limits) as long as the AI wants to keep exploring can also easily be coded.
Here are two example questions:
Question 1: How do you think the data in L2 was derived from L0?
Answer 1: L2 represents net business outcomes aggregated from cleaned L1 events, which themselves are normalized and deduplicated versions of raw L0 data.
Question 2: give me the lineage of O-1 from L2 to L0?
Answer 2: L2 O-1 is derived from one cleaned sale event and one cleaned refund event in L1, which trace back to raw events 1001 and 1004 in L0, producing:
net_qty = 2 − 1 = 1
net_revenue = 30 − 15 = 15
All that’s to say, doing the work upfront with data makes the LLM side of things significantly easier.
