
My Notes After Databricks Data and AI Summit 2026

The author argues that the data layer is the most undervalued part of the AI stack but will become critical as AI moves into production. AI agents expose data pipeline flaws, and Databricks is heading in the right direction but its architecture is still incomplete. The article explores the evolving role of data infrastructure and the necessary features of an AI-native data system.

Source: Hacker News AI · Author: redskyluan

A Few Notes from Databricks Data + AI Summit 2026: Why the Data Layer Matters Again

Jun 30, 2026 · 13 min read

After this year’s Databricks Data + AI Summit, I found myself thinking less about any single announcement and more about a question that has been sitting with me for a while:

When AI really moves into production, what does the data layer become?

My current answer is simple, though the implications are not: in this cycle, the data layer is the part of the AI stack that has been repriced the slowest. That is starting to change.

Data: the part of the AI stack the market has not priced yet

Algorithms have been repriced in public. Models improve quickly, and the industry can see the progress almost every week. Compute has been repriced by NVIDIA, the cloud providers, and the capital markets. Everyone understands that GPUs matter.

Data has moved more slowly. Not because it matters less. The opposite is true. Data is slow to reprice because it is hard to talk about and even harder to fix. Enterprise data is messy, scattered, duplicated, stale, and full of permissions that nobody fully understands. Business semantics do not line up cleanly across systems. The thing people call “real time” is often still a scheduled job that ran sometime last night.

That work is painful. It is also not very glamorous. But once AI moves from demos into production, the pain becomes impossible to hide.

In conversations with people building and training models, including those at OpenAI and Anthropic, the discussion often comes back to the same point. Models are converging. Compute can be bought, at least if you have enough money. Increasingly, the defensible layer is the data: its quality, its freshness, the permissions around it, and the speed at which it can be turned into useful context.

This is not only an application-layer problem. Inside model companies, model quality still depends heavily on the data pipeline. A training run may require days of preparation before the first serious experiment begins. If an upstream field is dirty, a batch is mislabeled, or a filtering rule is wrong, days of compute and waiting can disappear before anyone notices the loss curve has drifted.

AI agents make the data problem impossible to hide

Agents expose the same problem in a more operational form.

When AI agents fail in production, the first cause is often not that the model is incapable. It is that the model is acting on the wrong context: a record it cannot access, a document that expired six hours ago, a data source that quietly changed overnight, or a retrieval path that is too expensive to use often enough. I recently saw a strong team lose nearly a week to a stale context pipeline. The agent was confidently answering yesterday’s question. The model was not dumb. The context was wrong, and the system had no clean way to prove where the error entered the loop.

That is the failure mode that matters. The next infrastructure bottleneck is not simply better reasoning. It is fresh, trusted, cheap, and auditable context at the moment a model or agent makes a decision.

That is why I think the data layer is the next part of the AI stack to be repriced.
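The stale-context failure described above can be made concrete with a small guard that refuses to hand old context to the model. This is only a sketch under stated assumptions: `ContextItem`, `split_by_freshness`, the one-hour budget, and the sample records are all hypothetical names and values chosen for illustration, not part of any Databricks API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ContextItem:
    """One piece of retrieved context (hypothetical structure)."""
    source: str
    text: str
    fetched_at: datetime  # when the pipeline last refreshed this item


def split_by_freshness(items, now, max_age):
    """Return (fresh, stale); an item is stale if it is older than max_age."""
    fresh, stale = [], []
    for item in items:
        if now - item.fetched_at <= max_age:
            fresh.append(item)
        else:
            stale.append(item)
    return fresh, stale


# Illustrative data: one record refreshed 5 minutes ago, one document 6 hours old.
now = datetime(2026, 7, 1, 12, 0, tzinfo=timezone.utc)
items = [
    ContextItem("orders_db", "order 17 shipped", now - timedelta(minutes=5)),
    ContextItem("policy_doc", "refund window is 30 days", now - timedelta(hours=6)),
]

fresh, stale = split_by_freshness(items, now, max_age=timedelta(hours=1))
print("fresh:", [i.source for i in fresh])  # fresh: ['orders_db']
print("stale:", [i.source for i in stale])  # stale: ['policy_doc']
```

The point of returning the stale list instead of silently dropping it is that the system can then say where the bad context would have entered the loop.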

Databricks is aiming at the right problem

I am skeptical of many products that call themselves “AI data platforms.” Too often the story arrives before the system.

Databricks is different enough that I think it deserves serious attention. Two things stood out to me at the Summit.

The first is still the engineering culture. At Databricks’ scale, it would be easy for the company to become purely sales-driven. Yet the founders are still on stage talking about execution engines, transactions, real-time analytics, and the pipes underneath the product. I respect that. You can feel when a company still has product and engineering intuition at its core. It shows up in small architectural decisions long before it shows up in a keynote.

The second is the customer base. The users I spoke with at the Summit were not talking about AI as a demo layer. They were trying to push AI into production systems, and the problems they described were much more concrete: agents need to read and write business state; real-time analytics cannot keep paying the tax of moving data; pipelines need to become more autonomous; agent behavior needs governance at runtime, not only after the fact.

That is why announcements such as Lakebase, Lakehouse//RT, data agents, and AI governance matter. The names are less important than the direction. Put transactions closer to the lake. Pull real-time analytics back toward the same data foundation. Automate more of the pipeline. Extend governance from “who can see this dataset” to “what is this agent allowed to do in this specific step?”

I do not see that as a wrong turn. I see it as evidence that many of us are looking at the same future from different angles.

The database is expanding. It is no longer only a place to store and query data. It is becoming the foundation for facts, state, semantics, governance, and action.
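The shift from dataset-level visibility to step-level permission, mentioned above, might look roughly like the following. It is a minimal sketch, not Databricks' governance model: `StepRequest`, `POLICY`, and `is_allowed` are hypothetical names, and the policy table is invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StepRequest:
    """What an agent wants to do in one step (hypothetical structure)."""
    agent_id: str
    step: str      # e.g. "retrieve" or "write_state"
    resource: str  # e.g. a table name
    action: str    # "read" or "write"


# Hypothetical policy table: (step, resource) -> set of allowed actions.
POLICY = {
    ("retrieve", "customers"): {"read"},
    ("write_state", "tickets"): {"read", "write"},
}


def is_allowed(req: StepRequest) -> bool:
    """Deny by default; allow only actions listed for this step and resource."""
    return req.action in POLICY.get((req.step, req.resource), set())


# The same agent can read customers while retrieving, but cannot write to them.
print(is_allowed(StepRequest("agent-1", "retrieve", "customers", "read")))   # True
print(is_allowed(StepRequest("agent-1", "retrieve", "customers", "write")))  # False
```

The key is that the check is keyed on the step as well as the resource, so the answer can change as the agent moves through its loop.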

The map is good. But it is not finished.

Databricks is right about the direction. That does not mean the architecture has reached its final form.

Photo: The Known Data Realm · Databricks Data + AI Summit 2026

I see three areas where the map is still incomplete.

The lakebase itself.

Starting with Postgres is a smart entry point. Developers know it. The ecosystem is huge. Compatibility lowers adoption friction. That matters.

But the architecture that gets people in the door is not always the architecture that wins the final workload.

AI-era operational systems need transactions, memory, vectors, multimodal data, traces, branching, rollback, and very fine-grained tenant isolation. A traditional relational core can expose some of these through extensions and surrounding services, but that does not make them native. Classic Postgres was not designed for cloud-native distributed scale, nor for agents that create short-lived databases, fork state, write to memory, generate traces, and disappear.

Moving Postgres closer to object storage does not erase those questions. Object storage is cheap and reliable, but it is not low-latency by default. To make it feel fast, you need a cache layer that is both aggressive and correct. A cache that stays stable under real transactional load is one of the hardest systems problems in databases. So my honest question about Lakebase is not whether the demo is impressive. It is whether the system can carry real OLTP workloads at production scale without turning that cache into the thing that wakes people up at 3 a.m.

The multimodal data.

Databricks has drawn a strong map across OLTP, warehousing, real-time analytics, data science, and governance. But AI applications increasingly consume text, images, audio, video, embeddings, behavior logs, and agent traces. Those are not just objects sitting next to tables. They are the data that agents retrieve, reason over, transform, and write back.

If multimodal data remains outside the core map, then the most important AI data assets still live in the margins.

The default user.

Much of the product surface still assumes a human user: dashboards, natural-language BI, Excel-style workflows, and analyst-facing experiences. Those are valuable. But agents use databases differently.

An agent does not open a dashboard once a day. It runs in a loop. It retrieves context, makes a decision, calls a tool, writes state, checks a policy, and repeats. Every step may need to be audited. Every retrieval may influence the next action. Every write may need rollback. Every permission check may need to happen at runtime.

That is a different database workload.
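To show how different that access pattern is from a daily dashboard query, here is a toy version of the loop described above: retrieve, decide, check policy, call a tool, write state, repeat. Every name in it (`run_agent`, the `lookup` and `delete` tools, the stub policy) is a hypothetical stand-in, and the "model" is a hard-coded function.

```python
def run_agent(goal, retrieve, decide, tools, is_allowed, state, max_steps=5):
    """Run a bounded agent loop and return (final state, audit log)."""
    audit = []
    for step in range(max_steps):
        context = retrieve(goal, state)       # read context
        action = decide(goal, context)        # make a decision
        if action is None:
            break
        tool_name, arg = action
        allowed = is_allowed(tool_name, arg)  # runtime policy check
        audit.append({"step": step, "context": context,
                      "action": action, "allowed": allowed})
        if not allowed:
            break                             # stop before the tool runs
        state[tool_name] = tools[tool_name](arg)  # call tool, write state
    return state, audit


# Stubs standing in for retrieval, the model, the tools, and the policy engine.
def retrieve(goal, state):
    return sorted(state)  # context = names of results written so far


def decide(goal, context):
    if "lookup" not in context:
        return ("lookup", goal)
    if "delete" not in context:
        return ("delete", goal)
    return None


tools = {"lookup": lambda arg: f"found {arg}",
         "delete": lambda arg: f"deleted {arg}"}


def is_allowed(tool_name, arg):
    return tool_name != "delete"  # hypothetical rule: agents may not delete


state, audit = run_agent("order 17", retrieve, decide, tools, is_allowed, {})
print(state)
for entry in audit:
    print(entry)
```

Even in this toy, one run produces two retrievals, two policy checks, one write, and two audit entries. Multiply that by every agent run and the workload stops resembling a human opening a dashboard.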

Photo: Unity AI Gateway · Governance · Databricks Data + AI Summit 2026

When the database user is an agent

For decades, a database could mostly focus on one question: how to execute this query correctly and quickly.

In the agent era, the question becomes broader:

How does an agent get the freshest, most trusted, lowest-cost, and most auditable context at the moment it makes a decision?

That is not just a query optimization problem. It is a systems problem across storage, indexing, governance, lineage, replay, cost control, and runtime policy enforcement.

This is where the category starts to shift. A data system can no longer be only an intelligence system: you ask a question, it returns an answer. It has to become closer to an operating system for AI: the place where agents read context, make decisions, call tools, write state, and leave behind a trace that humans and other systems can inspect.

Auditability cannot be bolted on after the fact. If an agent gives the wrong answer, takes the wrong action, or spends too much money, the first question will be: what exactly did it see at that moment?

To answer that, the system needs to know which documents were retrieved, which vectors were matched, which metadata filters were applied, which reranker changed the order, which tool was called, what policy was enforced, and what state was written back. Debugging and governance become the same workflow.

That is the architecture I do not think anyone has fully solved yet.
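The full problem is far larger than a data structure, but the shape of the record is easy to sketch. The example below assumes a per-step trace with one field for each question listed above; `StepTrace`, `TraceLog`, and all field names are hypothetical, and an in-memory list stands in for durable storage.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class StepTrace:
    """Everything one agent step saw and did (hypothetical schema)."""
    run_id: str
    step: int
    retrieved_doc_ids: list
    metadata_filter: dict
    reranked_order: list
    tool_called: str
    policy_decision: str
    state_written: dict


class TraceLog:
    """Append-only log; an in-memory list stands in for real storage."""

    def __init__(self):
        self._events = []

    def record(self, event: StepTrace) -> None:
        # Store a JSON snapshot so later mutation of the event cannot alter the log.
        self._events.append(json.dumps(asdict(event), sort_keys=True))

    def what_did_it_see(self, run_id: str, step: int):
        """Return the recorded step as a dict, or None if it was never logged."""
        for raw in self._events:
            event = json.loads(raw)
            if event["run_id"] == run_id and event["step"] == step:
                return event
        return None


log = TraceLog()
trace = StepTrace("run-42", 0, ["doc-a", "doc-b"], {"tenant": "acme"},
                  ["doc-b", "doc-a"], "create_ticket", "allow", {"ticket": "open"})
log.record(trace)
trace.retrieved_doc_ids.append("doc-c")  # mutate after the fact

seen = log.what_did_it_see("run-42", 0)
print(seen["retrieved_doc_ids"])  # ['doc-a', 'doc-b']
```

The same record serves the engineer debugging a wrong answer and the auditor asking what policy was enforced, which is the sense in which debugging and governance become one workflow.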

What “AI-native” should actually mean

“AI-native” is becoming one of those phrases that can mean almost anything. I do not think there is a clean definition yet. But if we work backward from real agent workloads, an AI-native data system has to do at least a few things well.

Multimodal data has to be first-class

Text, images, audio, video, embeddings, logs, and traces should not be scattered across a relational table, a vector column, an object bucket, and several side indexes. They need to live in one logical system where retrieval, filtering, ranking, and governance can happen together.

The hard part is not storing these assets. The hard part is making them queryable together without turning the architecture into another pipeline problem.

Elasticity has to start from the workload

Agent traffic is bursty. A system may be quiet for an hour and then receive a flood of retrieval, memory, and tool-use requests. The data lake or object store should become the durable foundation: cheap, reliable, and decoupled from compute.

But compute should not remain expensive just because the corpus exists. If nobody is searching, the system should spend very little. If a workload wakes up, compute should arrive quickly. In that world, the natural pricing unit is not always a permanent cluster. It may be the query, the session, or the minute of active compute.

Multi-tenancy has to move to the agent level

Traditional multi-tenant systems often assume a manageable number of large tenants. Agentic systems may create millions or billions of tiny, short-lived, isolated states. Each agent may carry its own memory, permissions, traces, temporary branches, and write paths.

A design built for thousands of large tenants will struggle when the tenant becomes the agent run itself.

Branching and rollback become core database features

Agents will write the wrong thing. That is not a corner case. It is part of the workload.

A useful AI data layer needs Git-like branching and fast rollback for data state. An agent run should be able to fork a working branch, test an action, write temporary state, and discard or promote it. If a bad update lands, the system should be able to return to a known good point quickly.

Versioning is not only an analytics convenience anymore. It becomes an operational safety mechanism.
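A rough illustration of fork, promote, discard, and rollback over a key-value state follows. It is a sketch under heavy simplification: `BranchingStore` and `Branch` are hypothetical names, the whole state is copied on every promote, and there is no concurrency control, none of which would survive a real workload.

```python
class BranchingStore:
    """Main state plus a stack of snapshots taken before each promote."""

    def __init__(self):
        self._main = {}
        self._history = []

    def get(self, key):
        return self._main.get(key)

    def fork(self):
        return Branch(self)

    def rollback(self):
        """Restore main to the snapshot taken before the latest promote."""
        if self._history:
            self._main = self._history.pop()


class Branch:
    """Overlay of uncommitted writes on top of the main state."""

    def __init__(self, store):
        self._store = store
        self._writes = {}

    def put(self, key, value):
        self._writes[key] = value  # main is untouched until promote

    def get(self, key):
        if key in self._writes:
            return self._writes[key]
        return self._store.get(key)

    def promote(self):
        # Snapshot main first so a bad promote can be undone.
        self._store._history.append(dict(self._store._main))
        self._store._main.update(self._writes)
        self._writes = {}

    def discard(self):
        self._writes = {}


store = BranchingStore()
setup = store.fork()
setup.put("price", 100)
setup.promote()

run = store.fork()
run.put("price", 1)                       # the agent writes a bad value
print(run.get("price"), store.get("price"))  # 1 100
run.promote()                             # the bad update lands
store.rollback()                          # return to the known good point
print(store.get("price"))                 # 100
```

A production design would need cheap copy-on-write snapshots rather than full copies, but the interface is the part that matters here: the agent works on a branch, and main only changes on an explicit promote.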

Trace and deterministic replay are mandatory

When an agent fails, the question is not just “what was the final answer?” It is “what did the agent see, retrieve, rank, decide, call, and write?”

That requires a trace of every meaningful step. More importantly, it requires replay. The system should be able to reconstruct the decision context as it existed at the time, not as it looks after a documen
