Data Pipelines Are Decisions. Treat Them That Way.

Nobody calls a meeting to decide how to handle nulls in the customer address field. Nobody writes a design doc for whether the pipeline should deduplicate at ingestion or at transformation. Nobody reviews the choice to use a left join instead of an inner join on a table where 8% of records don't match.

These decisions get made at 4pm on a Wednesday by an engineer who needs to ship the pipeline by Friday. They're reasonable decisions in the moment. They become load-bearing decisions the moment someone builds a dashboard, a model, or a quarterly report on top of them.

Eighteen months later, when someone asks why the revenue numbers don't match between two reports, the answer is in a join condition that nobody documented, nobody reviewed, and nobody remembers choosing.

Pipelines are architecture

We don't let engineers ship production services without a design review. We don't let them choose a database without evaluating tradeoffs. We don't let them define an API contract without discussion. But we routinely let engineers define the shape of an organization's analytical reality — what gets included, what gets excluded, how things get counted — without any of that rigor.

The pipeline that lands customer data decides: what is a customer? The pipeline that joins transactions to accounts decides: which transactions count? The pipeline that filters out records with missing fields decides: who is invisible in every downstream analysis?
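To make the join question concrete, here is a minimal sketch using Python's built-in sqlite3 module. The tables, column names, and amounts are invented for illustration; the point is only that the same data yields two different revenue totals depending on a single keyword in the join.

```python
import sqlite3

# Hypothetical schema and data, made up for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE transactions (
        txn_id INTEGER PRIMARY KEY,
        account_id INTEGER,
        amount_cents INTEGER
    );
    INSERT INTO accounts VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO transactions VALUES (10, 1, 5000), (11, 2, 3000), (12, 99, 2000);
""")
# Transaction 12 points at account 99, which has no row in accounts.


def total_revenue_cents(join_type: str) -> int:
    """Sum transaction amounts after joining to accounts with the given join type."""
    # Only two fixed keywords are accepted, so building the SQL string is safe here.
    if join_type not in ("INNER", "LEFT"):
        raise ValueError(f"unsupported join type: {join_type}")
    query = f"""
        SELECT COALESCE(SUM(t.amount_cents), 0)
        FROM transactions AS t
        {join_type} JOIN accounts AS a ON a.account_id = t.account_id
    """
    return conn.execute(query).fetchone()[0]


inner_total = total_revenue_cents("INNER")  # unmatched transaction is dropped
left_total = total_revenue_cents("LEFT")    # unmatched transaction is kept

print(inner_total, left_total)  # 8000 10000
```

Neither total is wrong. Each one answers a different question about which transactions count, and a reader of the downstream report has no way to tell which question was asked unless someone wrote it down.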

These aren't background decisions. They're business logic encoded in SQL and Python, running on a schedule, with no review process and no documentation beyond the code itself. And code is a terrible form of documentation for business logic, because it tells you what happens but not why it was chosen.

The inheritance problem

Every decision made in a pipeline is inherited by everything downstream. The schema you chose becomes the schema every analyst queries. The grain you picked becomes the grain every report uses. The null handling you selected — or didn't select — becomes the null handling every model consumes.

I've inherited pipelines where the original engineer handled a tricky data quality issue with a WHERE clause that filtered out roughly 6% of records. The filter was correct at the time — those records were genuinely bad. But the upstream source fixed the issue eight months later. The filter kept running. For over a year, every report and every model built on that pipeline was silently missing 6% of the data. Not because anyone decided to exclude it. Because nobody re-examined a decision that had calcified into infrastructure.
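One way to keep a filter like that from calcifying is to attach the reason and a review date to the filter itself, and to report how much it drops on every run. The sketch below is one possible shape, not a description of the pipeline in the story; the DocumentedFilter class, the field names, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable


@dataclass(frozen=True)
class DocumentedFilter:
    """A filter that carries its own rationale and a date to re-examine it."""
    name: str
    reason: str
    added_on: date
    review_by: date
    keep: Callable[[dict], bool]  # returns True for rows that should pass through


def apply_filter(rows: list, rule: DocumentedFilter, today: date):
    """Apply the filter and return the kept rows plus a small report."""
    kept = [row for row in rows if rule.keep(row)]
    dropped = len(rows) - len(kept)
    report = {
        "filter": rule.name,
        "reason": rule.reason,
        "dropped": dropped,
        "drop_rate": dropped / len(rows) if rows else 0.0,
        "review_overdue": today > rule.review_by,
    }
    return kept, report


# Hypothetical rule: drop rows where the source sent no region.
rule = DocumentedFilter(
    name="drop_missing_region",
    reason="Upstream export omits region for some records (illustrative reason).",
    added_on=date(2023, 3, 1),
    review_by=date(2023, 9, 1),
    keep=lambda row: row.get("region") is not None,
)

# 50 made-up rows, 3 of which have no region.
rows = [{"id": i, "region": None if i < 3 else "EU"} for i in range(50)]
kept, report = apply_filter(rows, rule, today=date(2024, 6, 1))

print(report["dropped"], report["review_overdue"])  # 3 True
```

The report does not decide anything on its own. It puts the drop rate and the overdue review in front of someone, which is the step that was missing for over a year in the case above.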

This is how pipelines work. Temporary fixes become permanent fixtures. Workarounds become assumptions. And assumptions become invisible, because the pipeline runs, the data arrives, and the dashboard renders. Nothing looks broken from the outside.

The cost of implicit decisions

Explicit decisions can be revisited. Implicit decisions can't, because nobody knows they were made.

When a pipeline engineer chooses to truncate timestamps to daily granularity, that's a decision. When they round financial figures to two decimal places during transformation, that's a decision. When they use the created_date field instead of the effective_date field, that's a decision. Every one of those choices constrains what's possible downstream — and if the choice was never documented, the constraint is invisible until someone runs into it and spends two days debugging.
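The rounding case is easy to demonstrate. The sketch below uses Python's decimal module with three made-up line items and half-up rounding (the rounding mode is an assumption; a real pipeline might use a different one). Rounding each row during transformation and rounding once after aggregation give totals that differ by a cent.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_to_cents(amount: Decimal) -> Decimal:
    """Round to two decimal places, with halves rounded up."""
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


# Hypothetical line items carrying a third decimal place.
line_items = [Decimal("10.005"), Decimal("20.005"), Decimal("30.005")]

# Option A: round each row in the transformation step, then sum downstream.
rounded_then_summed = sum((round_to_cents(a) for a in line_items), Decimal("0"))

# Option B: keep full precision through the pipeline, round once at the end.
summed_then_rounded = round_to_cents(sum(line_items, Decimal("0")))

print(rounded_then_summed, summed_then_rounded)  # 60.03 60.02
```

A one-cent gap on three rows is trivial. The same choice applied to millions of rows is the kind of discrepancy that shows up as two reports that don't match.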

The cost isn't the decision itself. Most of these decisions are fine. The cost is the accumulated weight of hundreds of undocumented choices that interact in ways nobody can predict, because nobody can see them all at once.

I've watched teams spend entire sprints tracking down data discrepancies that originated in pipeline logic written by someone who left the company a year ago. The logic wasn't wrong, exactly. It was a reasonable choice for a context that no longer existed. But because the context was never recorded — only the implementation — the team had to reverse-engineer the reasoning from the code, the data, and whatever institutional memory remained.

That's expensive. And it's entirely preventable.

What a design review for pipelines looks like

It doesn't look like a two-hour architecture review board. It looks like three questions, answered before the pipeline ships:

What business logic does this pipeline encode? Not the technical implementation — the business decisions. What gets included, what gets excluded, how things get counted, what "current" means, what happens when data is missing. Write it in plain language. If you can't explain it in plain language, the logic is more complex than you think.

What assumptions does this pipeline make about its inputs? What schema does it expect? What volume range is normal? What happens if a field that's always been populated shows up null? What happens if the source system sends duplicates? If the answer to any of these is "I don't know," that's the first thing to figure out — not the last.

What does this pipeline guarantee about its outputs? Not what it produces today, but what downstream consumers can rely on. Freshness. Completeness. Grain. Schema stability. If you can't state the guarantee, consumers will infer one from observation, and when reality diverges from their inference, that's an incident.

These questions take 30 minutes to answer. A lightweight review takes an hour. The alternative is a quarter of rework 18 months from now, when the decisions have compounded and the person who made them is gone.
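The answers to the three questions can live next to the pipeline code as a small record. What follows is a minimal sketch of one possible format, assuming Python 3.9 or later; the PipelineDecisionRecord class, the pipeline name, and the example answers are hypothetical, and a shared document or a YAML file would serve the same purpose.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineDecisionRecord:
    """Plain-language answers to the three review questions."""
    pipeline: str
    business_logic: list[str]      # what is included, excluded, and how it is counted
    input_assumptions: list[str]   # schema, volume, nulls, duplicates
    output_guarantees: list[str]   # freshness, completeness, grain, schema stability

    def unanswered(self) -> list[str]:
        """Return the names of the questions that have no answer yet."""
        sections = {
            "business_logic": self.business_logic,
            "input_assumptions": self.input_assumptions,
            "output_guarantees": self.output_guarantees,
        }
        return [name for name, answers in sections.items() if not answers]


def count_missing(rows: list[dict], required: list[str]) -> dict[str, int]:
    """Count rows where a required field is absent or None, to test one input assumption."""
    return {f: sum(1 for row in rows if row.get(f) is None) for f in required}


# Hypothetical record for an invented pipeline.
record = PipelineDecisionRecord(
    pipeline="customer_orders_daily",
    business_logic=["Orders with no matching account are excluded from revenue."],
    input_assumptions=["customer_id is never null."],
    output_guarantees=[],  # not yet written down
)

gaps = record.unanswered()
null_counts = count_missing(
    [{"customer_id": 1}, {"customer_id": None}, {}],
    required=["customer_id"],
)

print(gaps)         # ['output_guarantees']
print(null_counts)  # {'customer_id': 2}
```

The record is deliberately prose in a thin wrapper. Its value is that an empty section is visible before the pipeline ships, and that a stated assumption such as "customer_id is never null" can be turned into a check instead of staying a belief.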

The real issue

The real issue isn't that engineers make bad decisions in pipelines. They mostly make good ones. The real issue is that we've built a culture where pipeline work is treated as invisible infrastructure — connective tissue that nobody examines — rather than as the place where an organization's analytical decisions actually get made.

Every schema encodes a worldview, every transformation is an editorial choice, and every filter is an inclusion decision. If you wouldn't ship a product feature without a design review, you shouldn't ship a pipeline without one either.

The data your organization runs on is shaped by decisions that someone made in a pipeline. The question is whether those decisions were deliberate, documented, and reviewed — or whether they were made at 4pm on a Wednesday and never thought about again.