{"id":244,"date":"2026-02-16T11:59:48","date_gmt":"2026-02-16T16:59:48","guid":{"rendered":"https:\/\/www.parhammofidi.com\/blog\/?p=244"},"modified":"2026-02-16T15:00:54","modified_gmt":"2026-02-16T20:00:54","slug":"when-llms-know-and-dont-know","status":"publish","type":"post","link":"https:\/\/www.parhammofidi.com\/blog\/index.php\/2026\/02\/16\/when-llms-know-and-dont-know\/","title":{"rendered":"When LLMs Know (And Don&#8217;t Know)"},"content":{"rendered":"\n<p class=\"wp-block-paragraph\">There\u2019s a misunderstanding hiding beneath a lot of the excitement around large language models (LLMs) and AI. It\u2019s not the obvious stuff about hype or overblown promises, though those are rampant right now with a study suggesting <a href=\"https:\/\/mlq.ai\/media\/quarterly_decks\/v0.1_State_of_AI_in_Business_2025_Report.pdf\">95% of AI projects have zero returns<\/a>. It\u2019s much much quieter than that. It\u2019s the assumption that these systems <em>know<\/em> things in the way humans know things. It&#8217;s the idea of Artificial general intelligence (AGI) &#8211; that these systems are somehow accumulating understanding over time, staying current as the world changes, and forming an internal model of reality as they go. The fact is, they aren\u2019t, and the gap between that assumption and how these systems actually work explains a huge misunderstanding with a lot of people.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">A language model is trained on a massive snapshot of data that ends at a very specific moment in time. Go look at any company&#8217;s LLM model page to find these dates. That snapshot is baked into the model and then frozen. From that point forward, nothing inside the model updates unless it is retrained. No matter how fluent, confident, or authoritative an answer sounds, the model has no native awareness of anything that happened after its cutoff date or outside of the data that it was trained on.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">You can see this limitation with a simple example:<\/p>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"812\" height=\"260\" src=\"https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image.png\" alt=\"\" class=\"wp-image-249\" srcset=\"https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image.png 812w, https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image-300x96.png 300w, https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image-768x246.png 768w\" sizes=\"auto, (max-width: 812px) 100vw, 812px\" \/><figcaption class=\"wp-element-caption\">OpenAI ChatGPT<\/figcaption><\/figure>\n\n\n\n<figure class=\"wp-block-image aligncenter size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"741\" height=\"380\" src=\"https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image-1.png\" alt=\"\" class=\"wp-image-250\" srcset=\"https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image-1.png 741w, https:\/\/www.parhammofidi.com\/blog\/wp-content\/uploads\/2026\/02\/image-1-300x154.png 300w\" sizes=\"auto, (max-width: 741px) 100vw, 741px\" \/><figcaption class=\"wp-element-caption\">Anthropic Claude<\/figcaption><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">This is where things go off the rails. People notice that AI systems <em>can<\/em> answer questions about recent events or live systems. From the outside, it looks like continuous learning and actual up-to-date knowledge. What\u2019s actually happening is the model&#8217;s data is being augmented. The model itself hasn\u2019t learned anything new. It\u2019s been given access to external information and asked to reason over it.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Most real-world systems do this through <a href=\"https:\/\/www.parhammofidi.com\/blog\/index.php\/2025\/08\/16\/building-a-retrieval-augmented-generation-rag-system-to-supplement-ai-models\/\">retrieval\u2011augmented generation, or RAG<\/a>. A query comes in, relevant documents or records are retrieved from a database, a search engine, or an API, and those results are injected into the model\u2019s context window. The model then operates on all the data and comes up with an answer based on that material. When it works well, it feels like the model is suddenly intelligent and up-to-date. In reality, it\u2019s just better informed in that moment.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">All of this matters because it shifts the center of gravity away from the model, the idea of artificial intelligence, and toward the data. If the data is organized, structured, well\u2011labeled, and explicit about things like time, ownership, and authority, the model can do impressive work. It can connect dots across documents, explain why decisions were made, surface inconsistencies, and summarize complex systems faster than any human could.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">If the data is not well organized, the failure mode is hallucinations because the models still answer the question, because answering is what the models are optimized to do. Accuracy is not guaranteed.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">An LLM has no inherent understanding of what your data <em>means<\/em>. It doesn\u2019t know which document is authoritative, which version is obsolete, even how the data was derived. Unless you encode those distinctions explicitly, the model will infer them. And inference, in this context, is just guesswork dressed up as confidence.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">I&#8217;ll go through an e-commerce example below to demonstrate what I mean.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Once you stop treating an LLM like a knowledge store and start treating it like a reasoning engine sitting on top of your information architecture, the ideas will fall into place much easier. The model doesn\u2019t save you from poor structure though, it magnifies it. Good data modeling, clear schemas, explicit metadata, and disciplined information hygiene don\u2019t just make AI safer, they determine whether it\u2019s useful at all.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">This is why AI hasn\u2019t reduced the importance of experienced engineers and technical leaders. Someone still has to decide what is true, what is current, what matters, and what must never be inferred. AI can help you explore the space faster. Humans are still responsible for deciding which parts of that space correspond to reality.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<p class=\"wp-block-paragraph\">Here&#8217;s a very simplified e-commerce example that&#8217;ll help AI better answer questions about data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Note, the below code is for MySQL\/MariaDB.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Layer 0: Raw events<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">This is the <strong>raw<\/strong> layer. It may contain duplicates, messy data. This is the layer you want to be able to <strong>cite as source-of-truth<\/strong>.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Table definition + sample data<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS l0_raw_order_events;\nCREATE TABLE l0_raw_order_events (\n  raw_event_id      BIGINT PRIMARY KEY,\n  event_type        VARCHAR(32) NOT NULL,\n  order_id          VARCHAR(32) NOT NULL,\n  order_line_id     VARCHAR(32) NOT NULL,\n  customer_id       VARCHAR(32) NOT NULL,\n  product_sku_raw   VARCHAR(64) NOT NULL,\n  quantity          INT NOT NULL,\n  unit_price_cents  INT NOT NULL,\n  event_ts          DATE NOT NULL\n);\n\nINSERT INTO l0_raw_order_events VALUES\n(1001, 'ORDER_PLACED', 'O-1', 'O-1-1', 'C-10', 'sku001', 2, 1500, '2026-02-10'),\n(1002, 'ORDER_PLACED', 'O-1', 'O-1-1', 'C-10', 'SKU001', 2, 1500, '2026-02-10'),  -- duplicate of 1001\n(1003, 'ORDER_PLACED', 'O-2', 'O-2-1', 'C-10', 'SKU002', 1, 2500, '2026-02-11'),\n(1004, 'REFUND_ISSUED', 'O-1', 'O-1-1', 'C-10', 'SKU001', 1, 1500, '2026-02-12'), -- partial refund\n(1005, 'ORDER_PLACED', 'O-3', 'O-3-1', 'C-20', 'sku-003', 3, 800, '2026-02-12');  -- new customer + messy SKU<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Layer 0 table (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>raw_event_id<\/th><th>event_type<\/th><th>order_id<\/th><th>order_line_id<\/th><th>customer_id<\/th><th>product_sku_raw<\/th><th>quantity<\/th><th>unit_price_cents<\/th><th>event_ts<\/th><\/tr><\/thead><tbody><tr><td>1001<\/td><td>ORDER_PLACED<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td><code>sku001<\/code><\/td><td>2<\/td><td>1500<\/td><td>2026-02-10<\/td><\/tr><tr><td>1002<\/td><td>ORDER_PLACED<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td><code>SKU001<\/code><\/td><td>2<\/td><td>1500<\/td><td>2026-02-10<\/td><\/tr><tr><td>1003<\/td><td>ORDER_PLACED<\/td><td>O-2<\/td><td>O-2-1<\/td><td>C-10<\/td><td><code>SKU002<\/code><\/td><td>1<\/td><td>2500<\/td><td>2026-02-11<\/td><\/tr><tr><td>1004<\/td><td>REFUND_ISSUED<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td><code>SKU001<\/code><\/td><td>1<\/td><td>1500<\/td><td>2026-02-12<\/td><\/tr><tr><td>1005<\/td><td>ORDER_PLACED<\/td><td>O-3<\/td><td>O-3-1<\/td><td>C-20<\/td><td><code>sku-003<\/code><\/td><td>3<\/td><td>800<\/td><td>2026-02-12<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Lineage at Layer 0<\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">None. This is the source.<\/p>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Layer 1: Cleaned event facts (normalized + deduped)<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Layer 1 creates normalized data while staying close to the original events.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">SQL to derive Layer 1 tables<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS l1_clean_order_events;\nCREATE TABLE l1_clean_order_events (\n  l1_event_id      BIGINT AUTO_INCREMENT PRIMARY KEY,\n  raw_event_id     BIGINT NOT NULL,\n  order_id         VARCHAR(32) NOT NULL,\n  order_line_id    VARCHAR(32) NOT NULL,\n  customer_id      VARCHAR(32) NOT NULL,\n  product_sku      VARCHAR(64) NOT NULL,\n  qty              INT NOT NULL,\n  unit_price       DECIMAL(10,2) NOT NULL,\n  event_type       VARCHAR(32) NOT NULL,\n  event_ts         DATE NOT NULL\n);\n\n-- Lineage table L1 -&gt; L0\nDROP TABLE IF EXISTS lineage_l1_to_l0;\nCREATE TABLE lineage_l1_to_l0 (\n  l1_event_id    BIGINT NOT NULL,\n  raw_event_id   BIGINT NOT NULL,\n  PRIMARY KEY (l1_event_id, raw_event_id)\n);\n\n-- Populate the normalized data (fix SKU names, deduplicate records, etc.\nINSERT INTO l1_clean_order_events (\n  raw_event_id, order_id, order_line_id, customer_id, product_sku, qty, unit_price, event_type, event_ts\n)\nWITH normalized AS (\n  SELECT\n    raw_event_id,\n    order_id,\n    order_line_id,\n    customer_id,\n    REGEXP_REPLACE(UPPER(TRIM(product_sku_raw)), '&#91;^A-Z0-9]', '') AS product_sku,\n    quantity AS qty,\n    CAST(unit_price_cents \/ 100.0 AS DECIMAL(10,2)) AS unit_price,\n    event_type,\n    event_ts,\n    ROW_NUMBER() OVER (\n      PARTITION BY\n        order_id, order_line_id, customer_id,\n        REGEXP_REPLACE(UPPER(TRIM(product_sku_raw)), '&#91;^A-Z0-9]', ''),\n        quantity, unit_price_cents, event_type, event_ts\n      ORDER BY raw_event_id\n    ) AS rn\n  FROM l0_raw_order_events\n)\nSELECT\n  raw_event_id, order_id, order_line_id, customer_id, product_sku, qty, unit_price, event_type, event_ts\nFROM normalized\nWHERE rn = 1;\n\n-- Populate lineage using the raw_event_id we carried forward\nINSERT INTO lineage_l1_to_l0 (l1_event_id, raw_event_id)\nSELECT l1_event_id, raw_event_id\nFROM l1_clean_order_events;<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Layer 1 table (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>l1_event_id<\/th><th>raw_event_id<\/th><th>order_id<\/th><th>order_line_id<\/th><th>customer_id<\/th><th>product_sku<\/th><th>qty<\/th><th>unit_price<\/th><th>event_type<\/th><th>event_ts<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>1001<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td>SKU001<\/td><td>2<\/td><td>15.00<\/td><td>ORDER_PLACED<\/td><td>2026-02-10<\/td><\/tr><tr><td>2<\/td><td>1003<\/td><td>O-2<\/td><td>O-2-1<\/td><td>C-10<\/td><td>SKU002<\/td><td>1<\/td><td>25.00<\/td><td>ORDER_PLACED<\/td><td>2026-02-11<\/td><\/tr><tr><td>3<\/td><td>1004<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td>SKU001<\/td><td>1<\/td><td>15.00<\/td><td>REFUND_ISSUED<\/td><td>2026-02-12<\/td><\/tr><tr><td>4<\/td><td>1005<\/td><td>O-3<\/td><td>O-3-1<\/td><td>C-20<\/td><td>SKU003<\/td><td>3<\/td><td>8.00<\/td><td>ORDER_PLACED<\/td><td>2026-02-12<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">(Notice raw_event_id <strong>1002<\/strong> is removed by dedupe.)<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">Lineage: <code>lineage_l1_to_l0<\/code> (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>l1_event_id<\/th><th>raw_event_id<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>1001<\/td><\/tr><tr><td>2<\/td><td>1003<\/td><\/tr><tr><td>3<\/td><td>1004<\/td><\/tr><tr><td>4<\/td><td>1005<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Layer 2: Analytics-ready net order lines (net revenue + net quantity)<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Layer 2 is where the system becomes opinionated: it converts event-level reality (sales + refunds) into <strong>net business facts<\/strong> that dashboards and AI will ask about. This layer must be attributable because it is derived, aggregated truth.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">SQL to derive Layer 2 tables<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS l2_order_line_net;\nCREATE TABLE l2_order_line_net (\n  l2_line_id      BIGINT AUTO_INCREMENT PRIMARY KEY,\n  order_id        VARCHAR(32) NOT NULL,\n  order_line_id   VARCHAR(32) NOT NULL,\n  customer_id     VARCHAR(32) NOT NULL,\n  product_sku     VARCHAR(64) NOT NULL,\n  net_qty         INT NOT NULL,\n  net_revenue     DECIMAL(12,2) NOT NULL,\n  effective_day   DATE NOT NULL\n);\n\n-- Lineage table L2 -&gt; L1\nDROP TABLE IF EXISTS lineage_l2_to_l1;\nCREATE TABLE lineage_l2_to_l1 (\n  l2_line_id         BIGINT NOT NULL,\n  l1_event_id        BIGINT NOT NULL,\n  contribution_type  VARCHAR(16) NOT NULL, -- 'SALE' or 'REFUND'\n  PRIMARY KEY (l2_line_id, l1_event_id)\n);\n\n-- Derivation: aggregate by order line\n-- These derivations can be anything, but it's important that you can attribute them back\nINSERT INTO l2_order_line_net (order_id, order_line_id, customer_id, product_sku, net_qty, net_revenue, effective_day)\nSELECT\n  order_id,\n  order_line_id,\n  customer_id,\n  product_sku,\n  SUM(CASE WHEN event_type='ORDER_PLACED'  THEN qty\n           WHEN event_type='REFUND_ISSUED' THEN -qty\n           ELSE 0 END) AS net_qty,\n  SUM(CASE WHEN event_type='ORDER_PLACED'  THEN qty*unit_price\n           WHEN event_type='REFUND_ISSUED' THEN -qty*unit_price\n           ELSE 0 END) AS net_revenue,\n  MAX(event_ts) AS effective_day\nFROM l1_clean_order_events\nGROUP BY order_id, order_line_id, customer_id, product_sku\nORDER BY order_id, order_line_id;\n\n-- Populate lineage: map each L2 row back to contributing L1 rows\nINSERT INTO lineage_l2_to_l1 (l2_line_id, l1_event_id, contribution_type)\nSELECT\n  l2.l2_line_id,\n  l1.l1_event_id,\n  CASE WHEN l1.event_type='ORDER_PLACED' THEN 'SALE' ELSE 'REFUND' END\nFROM l2_order_line_net l2\nJOIN l1_clean_order_events l1\n  ON l1.order_id = l2.order_id\n AND l1.order_line_id = l2.order_line_id\n AND l1.customer_id = l2.customer_id\n AND l1.product_sku = l2.product_sku;<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\">Layer 2 table (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>l2_line_id<\/th><th>order_id<\/th><th>order_line_id<\/th><th>customer_id<\/th><th>product_sku<\/th><th>net_qty<\/th><th>net_revenue<\/th><th>effective_day<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>O-1<\/td><td>O-1-1<\/td><td>C-10<\/td><td>SKU001<\/td><td>1<\/td><td>15.00<\/td><td>2026-02-12<\/td><\/tr><tr><td>2<\/td><td>O-2<\/td><td>O-2-1<\/td><td>C-10<\/td><td>SKU002<\/td><td>1<\/td><td>25.00<\/td><td>2026-02-11<\/td><\/tr><tr><td>3<\/td><td>O-3<\/td><td>O-3-1<\/td><td>C-20<\/td><td>SKU003<\/td><td>3<\/td><td>24.00<\/td><td>2026-02-12<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<h4 class=\"wp-block-heading\">Lineage: <code>lineage_l2_to_l1<\/code> (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>l2_line_id<\/th><th>l1_event_id<\/th><th>contribution_type<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>1<\/td><td>SALE<\/td><\/tr><tr><td>1<\/td><td>3<\/td><td>REFUND<\/td><\/tr><tr><td>2<\/td><td>2<\/td><td>SALE<\/td><\/tr><tr><td>3<\/td><td>4<\/td><td>SALE<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>End-to-end lineage: Layer 2 \u2192 Layer 0 (final derived fact \u2192 original raw)<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">Even with only 3 layers, you still want an ability to attribute the final data back to the raw data, so an LLM can cite original records without walking multiple joins every time.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\">SQL to build end-to-end lineage<\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS lineage_l2_to_l0;\nCREATE TABLE lineage_l2_to_l0 (\n  l2_line_id     BIGINT NOT NULL,\n  raw_event_id   BIGINT NOT NULL,\n  PRIMARY KEY (l2_line_id, raw_event_id)\n);\n\nINSERT INTO lineage_l2_to_l0 (l2_line_id, raw_event_id)\nSELECT DISTINCT\n  l2.l2_line_id,\n  l0.raw_event_id\nFROM lineage_l2_to_l1 l2\nJOIN lineage_l1_to_l0 l0\n  ON l0.l1_event_id = l2.l1_event_id;<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>lineage_l2_to_l0<\/code> (results)<\/h4>\n\n\n\n<figure class=\"wp-block-table\"><table class=\"has-fixed-layout\"><thead><tr><th>l2_line_id<\/th><th>raw_event_id<\/th><\/tr><\/thead><tbody><tr><td>1<\/td><td>1001<\/td><\/tr><tr><td>1<\/td><td>1004<\/td><\/tr><tr><td>2<\/td><td>1003<\/td><\/tr><tr><td>3<\/td><td>1005<\/td><\/tr><\/tbody><\/table><\/figure>\n\n\n\n<p class=\"wp-block-paragraph\">Meaning:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>Net line <strong>O-1\/O-1-1<\/strong> derived from raw events <strong>1001 (sale)<\/strong> and <strong>1004 (refund)<\/strong><\/li>\n\n\n\n<li>Net line <strong>O-2\/O-2-1<\/strong> derived from raw event <strong>1003 (sale)<\/strong><\/li>\n\n\n\n<li>Net line <strong>O-3\/O-3-1<\/strong> derived from raw event <strong>1005 (sale)<\/strong><\/li>\n<\/ul>\n\n\n\n<hr class=\"wp-block-separator has-alpha-channel-opacity\"\/>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Meta Data Tables<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">These tables exist for machines, not humans.<\/p>\n\n\n\n<h4 class=\"wp-block-heading\"><code>ai_table_catalog<\/code><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">This is what you pass to the model so it can:<\/p>\n\n\n\n<ul class=\"wp-block-list\">\n<li>discover tables<\/li>\n\n\n\n<li>understand the purpose and grain<\/li>\n\n\n\n<li>find primary keys and join keys<\/li>\n\n\n\n<li>know which tables are \u201csource of truth\u201d<\/li>\n\n\n\n<li>know which tables are safe for analytics vs raw<\/li>\n<\/ul>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS ai_table_catalog;\nCREATE TABLE ai_table_catalog (\n  table_name            VARCHAR(128) PRIMARY KEY,\n  layer                 VARCHAR(8) NOT NULL,              -- L0, L1, L2\n  table_role            VARCHAR(64) NOT NULL,             -- 'raw', 'clean_facts', 'analytics'\n  description           TEXT NOT NULL,                    -- human + AI-readable\n  grain                 VARCHAR(255) NOT NULL,            -- e.g. 'one row per ingested event'\n  primary_key           VARCHAR(255) NOT NULL,            -- e.g. 'raw_event_id'\n  natural_key           VARCHAR(512) NULL,                -- useful for dedupe \/ uniqueness explanation\n  important_columns     TEXT NOT NULL,                    -- short list of key columns and meaning\n  join_guidance         TEXT NOT NULL,                    -- how to join to other tables\n  freshness_policy      VARCHAR(255) NOT NULL,            -- e.g. 'append-only daily'\n  authoritative_source  TINYINT(1) NOT NULL DEFAULT 0,    -- 1 if source-of-truth\n  pii_notes             TEXT NULL,                        -- what to be careful about\n  query_examples        TEXT NULL                         -- small examples agents can use\n);<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>INSERT INTO ai_table_catalog (\n  table_name, layer, table_role, description, grain, primary_key, natural_key,\n  important_columns, join_guidance, freshness_policy, authoritative_source, pii_notes, query_examples\n) VALUES\n(\n  'l0_raw_order_events','L0','raw',\n  'Raw order events exactly as ingested. Contains duplicates and inconsistent SKUs. Use for audit and citations.',\n  'One row per ingested event record (may include duplicates).',\n  'raw_event_id',\n  'order_id + order_line_id + customer_id + normalized product_sku + quantity + unit_price_cents + event_type + event_ts',\n  'raw_event_id (unique id), event_type (ORDER_PLACED\/REFUND_ISSUED), order_id\/order_line_id, customer_id, product_sku_raw (messy), quantity, unit_price_cents, event_ts',\n  'Prefer joining upward via lineage_l1_to_l0 rather than joining directly to L2; raw has duplicates.',\n  'Append-only; no updates expected.',\n  1,\n  'customer_id may be considered sensitive depending on org policy.',\n  'SELECT * FROM l0_raw_order_events WHERE order_id=''O-1'';'\n),\n(\n  'l1_clean_order_events','L1','clean_facts',\n  'Cleaned event-level facts derived from L0. SKUs normalized, duplicates removed. Each row maps to exactly one representative raw_event_id.',\n  'One row per deduplicated event fact.',\n  'l1_event_id',\n  'order_id + order_line_id + customer_id + product_sku + qty + unit_price + event_type + event_ts',\n  'l1_event_id (surrogate PK), raw_event_id (pointer to source), product_sku (normalized), qty, unit_price, event_type, event_ts',\n  'Join to L0 through lineage_l1_to_l0. Join to L2 through lineage_l2_to_l1 or by (order_id, order_line_id, customer_id, product_sku).',\n  'Rebuilt from L0; deterministic dedupe.',\n  0,\n  'customer_id may be considered sensitive depending on org policy.',\n  'SELECT * FROM l1_clean_order_events WHERE customer_id=''C-10'' ORDER BY event_ts;'\n),\n(\n  'l2_order_line_net','L2','analytics',\n  'Net order-line facts for analytics. Sales and refunds combined into net_qty and net_revenue.',\n  'One row per (order_id, order_line_id, customer_id, product_sku).',\n  'l2_line_id',\n  'order_id + order_line_id + customer_id + product_sku',\n  'net_qty (sales-refunds), net_revenue (signed revenue), effective_day (max event_ts for the line)',\n  'Trace back to contributing events via lineage_l2_to_l1, and all the way to raw via lineage_l2_to_l0.',\n  'Rebuilt from L1; deterministic aggregation.',\n  0,\n  'customer_id may be considered sensitive depending on org policy.',\n  'SELECT * FROM l2_order_line_net WHERE effective_day=''2026-02-12'';'\n);<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>ai_column_catalog<\/code><\/h4>\n\n\n\n<p class=\"wp-block-paragraph\">This gives an agent column-level semantics without reading schema DDL.<\/p>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS ai_column_catalog;\nCREATE TABLE ai_column_catalog (\n  table_name        VARCHAR(128) NOT NULL,\n  column_name       VARCHAR(128) NOT NULL,\n  data_type         VARCHAR(64) NOT NULL,\n  description       TEXT NOT NULL,\n  semantic_type     VARCHAR(64) NOT NULL,      -- 'id', 'timestamp', 'money', 'quantity', 'category', etc.\n  is_primary_key    TINYINT(1) NOT NULL DEFAULT 0,\n  is_foreign_key    TINYINT(1) NOT NULL DEFAULT 0,\n  fk_table          VARCHAR(128) NULL,\n  fk_column         VARCHAR(128) NULL,\n  example_values    TEXT NULL,\n  PRIMARY KEY (table_name, column_name)\n);<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>INSERT INTO ai_column_catalog VALUES\n('l2_order_line_net','l2_line_id','BIGINT','Surrogate primary key for net order-line row.','id',1,0,NULL,NULL,'1,2,3'),\n('l2_order_line_net','order_id','VARCHAR(32)','Order identifier from commerce system.','id',0,0,NULL,NULL,'O-1,O-2'),\n('l2_order_line_net','order_line_id','VARCHAR(32)','Order line identifier within an order.','id',0,0,NULL,NULL,'O-1-1'),\n('l2_order_line_net','customer_id','VARCHAR(32)','Customer identifier (may be PII depending on org policy).','id',0,0,NULL,NULL,'C-10,C-20'),\n('l2_order_line_net','product_sku','VARCHAR(64)','Normalized product SKU (uppercase, alphanumeric).','id',0,0,NULL,NULL,'SKU001,SKU002'),\n('l2_order_line_net','net_qty','INT','Net quantity = sales quantity minus refunded quantity.','quantity',0,0,NULL,NULL,'1,3'),\n('l2_order_line_net','net_revenue','DECIMAL(12,2)','Net revenue = sum(qty*unit_price) with refunds treated as negative.','money',0,0,NULL,NULL,'15.00,24.00'),\n('l2_order_line_net','effective_day','DATE','Effective day for the net line (MAX(event_ts) from contributing events).','date',0,0,NULL,NULL,'2026-02-11,2026-02-12');<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>ai_lineage_catalog<\/code><\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS ai_lineage_catalog;\nCREATE TABLE ai_lineage_catalog (\n  lineage_table_name   VARCHAR(128) PRIMARY KEY,\n  from_table_name      VARCHAR(128) NOT NULL,\n  to_table_name        VARCHAR(128) NOT NULL,\n  from_key_column      VARCHAR(128) NOT NULL,\n  to_key_column        VARCHAR(128) NOT NULL,\n  relationship         VARCHAR(64) NOT NULL,   -- 'many-to-many', 'many-to-one', etc.\n  description          TEXT NOT NULL,\n  query_example        TEXT NULL\n);<\/code><\/pre>\n\n\n\n<pre class=\"wp-block-code\"><code>INSERT INTO ai_lineage_catalog VALUES\n(\n  'lineage_l1_to_l0',\n  'l1_clean_order_events',\n  'l0_raw_order_events',\n  'l1_event_id',\n  'raw_event_id',\n  'many-to-one',\n  'Maps each cleaned event (L1) to the representative raw event (L0) it came from.',\n  'SELECT l0.* FROM lineage_l1_to_l0 x JOIN l0_raw_order_events l0 ON l0.raw_event_id=x.raw_event_id WHERE x.l1_event_id=1;'\n),\n(\n  'lineage_l2_to_l1',\n  'l2_order_line_net',\n  'l1_clean_order_events',\n  'l2_line_id',\n  'l1_event_id',\n  'many-to-many',\n  'Maps each net order-line (L2) to the contributing cleaned events (L1). Includes contribution_type (SALE\/REFUND).',\n  'SELECT l1.*, x.contribution_type FROM lineage_l2_to_l1 x JOIN l1_clean_order_events l1 ON l1.l1_event_id=x.l1_event_id WHERE x.l2_line_id=1;'\n),\n(\n  'lineage_l2_to_l0',\n  'l2_order_line_net',\n  'l0_raw_order_events',\n  'l2_line_id',\n  'raw_event_id',\n  'many-to-many',\n  'End-to-end mapping from net order-line (L2) directly to raw events (L0) for citations and audit.',\n  'SELECT l0.* FROM lineage_l2_to_l0 x JOIN l0_raw_order_events l0 ON l0.raw_event_id=x.raw_event_id WHERE x.l2_line_id=1;'\n);<\/code><\/pre>\n\n\n\n<h4 class=\"wp-block-heading\"><code>lineage_transform_meta<\/code><\/h4>\n\n\n\n<pre class=\"wp-block-code\"><code>DROP TABLE IF EXISTS lineage_transform_meta;\nCREATE TABLE lineage_transform_meta (\n  transform_id   VARCHAR(32) PRIMARY KEY,\n  from_table     VARCHAR(128) NOT NULL,\n  to_table       VARCHAR(128) NOT NULL,\n  description    VARCHAR(255) NOT NULL,\n  logic_summary  TEXT NOT NULL,\n  version        VARCHAR(16) NOT NULL,\n  updated_at     DATETIME NOT NULL\n);\n\nINSERT INTO lineage_transform_meta VALUES\n('L1_CLEAN','l0_raw_order_events','l1_clean_order_events',\n 'Normalize and deduplicate raw events',\n 'Normalize SKU (UPPER\/TRIM\/strip non-alphanumeric); dedupe by natural key using ROW_NUMBER; convert cents-&gt;dollars; keep representative raw_event_id for citation.',\n 'v1.0','2026-02-14 00:00:00'),\n('L2_NET','l1_clean_order_events','l2_order_line_net',\n 'Compute net qty and net revenue per order line',\n 'Aggregate by (order_id, order_line_id, customer_id, product_sku). Treat refunds as negative. effective_day = MAX(event_ts). Populate lineage_l2_to_l1 and lineage_l2_to_l0 for traceability.',\n 'v1.0','2026-02-14 00:00:00');<\/code><\/pre>\n\n\n\n<h3 class=\"wp-block-heading\"><strong>Next Steps + Example<\/strong><\/h3>\n\n\n\n<p class=\"wp-block-paragraph\">From this point, creating AI code that can read the metadata and decide on how to query the data is a pretty straight forward process of feeding in the <code>ai_*<\/code> tables as system prompts and asking it to create SQL queries to fetch data as it needs. Executing those SQL continuously (within limits) as long as the AI wants to keep exploring can also easily be coded.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">Here are two example questions:<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Question 1<\/strong>: How do you think the data in L2 was derived from L0?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Answer 1<\/strong>: L2 represents net business outcomes aggregated from cleaned L1 events, which themselves are normalized and deduplicated versions of raw L0 data.<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Question 2<\/strong>: give me the lineage of O-1 from L2 to L0?<\/p>\n\n\n\n<p class=\"wp-block-paragraph\"><strong>Answer 2<\/strong>: L2 O-1 is derived from one cleaned sale event and one cleaned refund event in L1, which trace back to raw events 1001 and 1004 in L0, producing:<br>net_qty = 2 \u2212 1 = 1<br>net_revenue = 30 \u2212 15 = 15<\/p>\n\n\n\n<p class=\"wp-block-paragraph\">All that&#8217;s to say, doing the work upfront with data makes the LLM side of things significantly easier.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>There\u2019s a misunderstanding hiding beneath a lot of the excitement around large language models (LLMs) and AI. It\u2019s not the obvious stuff about hype or overblown promises, though those are rampant right now with a study suggesting 95% of AI projects have zero returns. It\u2019s much much quieter than that. It\u2019s the assumption that these [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":254,"comment_status":"closed","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[165,48],"tags":[25,191,192,188,52,145,189,190,187,166],"class_list":["post-244","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-ai","category-code-development","tag-ai","tag-aiengineering","tag-data","tag-dataarchitecture","tag-database","tag-databasedesign","tag-dataengineering","tag-datalineage","tag-llm","tag-rag"],"_links":{"self":[{"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/244","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/comments?post=244"}],"version-history":[{"count":14,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/244\/revisions"}],"predecessor-version":[{"id":262,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/posts\/244\/revisions\/262"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/media\/254"}],"wp:attachment":[{"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/media?parent=244"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/categories?post=244"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.parhammofidi.com\/blog\/index.php\/wp-json\/wp\/v2\/tags?post=244"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}