Coalesce
A Practitioner Guide

Agentic Data Engineering

Going from ad hoc prompts to production playbooks with the Coalesce MCPs.

Transform Catalog Quality
Hands-on with Coalesce MCPs

Contents

00 Introduction: Why this guide, and who it is for.
01 Fundamentals: What an MCP is, the server, its tools, and the framework.
02 Prompting: Ad hoc MCP use, one question at a time.
03 Skills: Packaging workflows into reproducible recipes.
04 Playbooks and Built-In Agents: Putting workflows into production.

00 · Introduction

Introduction

Agentic data engineering is about using AI in your day-to-day data workflows. This guide focuses on bringing AI into the tools you already use through MCPs. While it centers on the Coalesce product suite (Transform, Catalog, and Quality), most of the lessons apply no matter which tools you're using.

We focus on the workflows you find yourself doing most often: planning a change and checking its downstream impact, getting to the root cause of an anomaly, keeping a catalog governed, or generating the same report every week. This approach runs alongside the agent tooling already in the products, from CoCo in Snowflake to Copilot in Coalesce Transform and Scout in Coalesce Quality.

This guide covers everything from a single ad hoc prompt to a production playbook. Whether you are exploring on your own, building skills for your team, or putting workflows into production, you will find practical patterns and real examples throughout.

What you'll learn
  • What an MCP is, and what the Coalesce MCP server exposes
  • How to write prompts that are reliable without over-specifying
  • How to package repeatable workflows into reproducible skills
  • How to test, iterate, and put workflows into production
Who this is for
  • Data and analytics engineers looking to speed up core workflows
  • Data leaders looking for ways to systemize the way they use AI across DataOps workflows
  • Governance teams looking to standardize how they work with their data platform
What you'll get out of this guide
By the end of this guide, you'll understand how to get started with the Coalesce MCPs, best practices for prompting, how to design and test skills, and what to consider before putting them into practice.
Chapter 01

Fundamentals

What an MCP is, the server, its tools, and the framework.

01 What an MCP is
02 How to understand MCP tools
03 Three levels of working with the MCPs

What an MCP is

MCP stands for Model Context Protocol. It is an open standard that lets a client like Claude or Cursor give a large language model (LLM) structured context from other tools and data sources. This gives the model specific context and metadata that would otherwise not be available to it.

At Coalesce, we have an MCP for each of Transform, Catalog, and Quality. Each one exposes many of the capabilities you'd otherwise access through the UI. For example, with the Catalog MCP you can extract the column-level lineage programmatically, to get the exact impact before making a change. With the Transform MCP, you can ask it to implement those changes. And with the Quality MCP, you can have it run a full root cause analysis, extracting metadata around similar issues, recent code changes, logs, and errors.

MCP clients connect through the Coalesce MCP server to a set of tools
Fig 1.1 — The Coalesce Quality MCP server taking in "instructions" from Claude and Cursor, and picking the right tools to answer questions that'd otherwise only be available in the platform UI.

How to understand MCP tools

LLMs have a tendency to make things up. That can be acceptable, and you can get far by using the MCPs out of the box. But it often pays to understand the inner workings of the platform by looking at which tools each MCP has. A tool is an API call the MCP can make on your behalf. It can fetch data, change state, or trigger a workflow. The set of tools defines the boundary of what the MCP can and cannot do.

Every Coalesce MCP tool ships with a description. It tells you not just what each tool does, but where its limits sit. The more specific the instructions baked into a tool, the more deterministic the output. Open-ended API calls leave all the interpretation to the LLM, which can quickly take things off the rails.

Example tool
catalog_get_column_lineage walks the column lineage graph exhaustively with no depth cap, but caps out at 10,000 nodes by default to avoid pathological graphs.
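
To make "no depth cap, but a node cap" concrete, here is a minimal sketch of a breadth-first walk over a lineage graph that stops once it has collected a maximum number of nodes. It is not the implementation behind catalog_get_column_lineage; the function name walk_lineage, the max_nodes parameter, and the sample graph are assumptions made up for illustration.

```python
from collections import deque


def walk_lineage(graph, start, max_nodes=10_000):
    """Breadth-first walk of downstream columns with no depth limit.

    `graph` maps a column name to the columns that read from it.
    Returns (visited_columns, truncated), where `truncated` is True
    when the walk stopped because it reached `max_nodes`.
    """
    visited = [start]
    seen = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for child in graph.get(current, []):
            if child in seen:
                continue
            if len(visited) >= max_nodes:
                # The cap bounds the size of the result, not the depth.
                return visited, True
            seen.add(child)
            visited.append(child)
            queue.append(child)
    return visited, False


# Hypothetical column-level lineage, for illustration only.
lineage = {
    "raw_trips.trip_id": ["stg_trips.trip_id"],
    "stg_trips.trip_id": ["fct_trips.trip_id", "dim_vehicle.last_trip_id"],
    "fct_trips.trip_id": ["rpt_weekly.trip_count"],
}

columns, truncated = walk_lineage(lineage, "raw_trips.trip_id")
capped_columns, capped = walk_lineage(lineage, "raw_trips.trip_id", max_nodes=2)
```

The useful design point is the truncated flag: a caller can tell a complete impact analysis apart from one that was cut short by the cap.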

Tools also map onto the steps a person would take. For example, if you ask the Quality MCP to investigate potential root causes of an issue, it works through the same metadata a data engineer would, one tool at a time, using its best understanding of how each tool should be used:

  • It reviews historical issues to see how similar problems were resolved before.
  • It reads upstream code changes across the affected models.
  • It detects schema changes anywhere across the stack.
  • It runs queries to analyze the actual data and what happened.
  • It retrieves data profiles to compare before and after distributions.
  • It samples rows on the affected tables to find the exact records that failed.

Each of those is a separate tool with a defined scope. Reading the tool descriptions tells you which of these the MCP can actually do, and where it will stop.

Unless you specify how the work should be done, there's no guarantee that you'll get the exact same result the next time you ask the LLM the same question. A useful trick is to ask the LLM to list exactly which tools it used, and in which order. With that in hand, you can create reproducible prompts that give more deterministic results (more on this in Chapter 03: Skills).
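
If you drive tools from your own code rather than a chat client, the same idea can be sketched as a small recorder that logs each call in order. This is an illustrative sketch only: ToolTrace is a hypothetical helper, and list_similar_issues and get_code_changes are stand-in functions, not actual Coalesce MCP tool names.

```python
import json


class ToolTrace:
    """Records each tool call in order so a run can be reproduced later."""

    def __init__(self):
        self.calls = []

    def wrap(self, name, func):
        """Return a version of `func` that logs its name and arguments."""
        def traced(**kwargs):
            self.calls.append(
                {"step": len(self.calls) + 1, "tool": name, "args": kwargs}
            )
            return func(**kwargs)
        return traced

    def to_json(self):
        return json.dumps(self.calls, indent=2)


# Stand-in functions; real tools would be called through an MCP client.
def list_similar_issues(issue_id):
    return []


def get_code_changes(table):
    return []


trace = ToolTrace()
similar = trace.wrap("list_similar_issues", list_similar_issues)
changes = trace.wrap("get_code_changes", get_code_changes)

similar(issue_id="ISSUE-1")
changes(table="raw_trips")
print(trace.to_json())
```

The recorded order of tools and arguments is what you would later write down as the steps of a reproducible prompt or skill.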

A terminal trace showing each tool call the MCP made while investigating a root cause
Fig 1.2 — A full trace of tools used to investigate the root causes. Useful for reproducing the results of the prompt.

Three levels of working with the MCPs

You can work with MCPs through one-off prompts, but there are also cases where it makes sense to productionize workflows through skills or playbooks.

The three levels: prompt engineering, agent, and playbook — from more exploratory to more deterministic
Fig 1.3 — The same request expressed at three levels of structure, from exploratory prompt to deterministic playbook.
  • Level 1, prompts. Ad hoc interactions to speed up data engineering and governance. For example, an ad hoc prompt to understand the potential root causes of an issue.
  • Level 2, skills. Key workflows packaged as reproducible skills. For example, a way to generate triage and data quality reporting across the team.
  • Level 3, playbooks and agents. Structured playbooks, often with governance tooling and orchestration. For example, a background agent running workspace analytics on customer issues each evening.
Chapter 02

Prompting

Ad hoc MCP use, one question at a time.

01 Choosing your LLM tool
02 Writing good prompts
03 Extend and chain MCPs to automate workflows
04 Limitations of prompting
05 Managing permissions and data access

Choosing your LLM tool

You're probably already using prompts in tools like ChatGPT and Claude Desktop, and likely also in coding tools such as VS Code or Cursor.

There are two main interfaces you'll typically use to engage with MCPs: code editors and chat interfaces.

Code editors such as Cursor and Claude Code, versus chat interfaces such as OpenAI and Claude Desktop
Fig 2.1 — Two main interfaces for working with MCPs: code editors and chat interfaces.

For use cases such as data modeling, you're most likely using a code editor, while for ad hoc exploration a chat interface often works best. It's worth considering the user experience of both, especially if you start developing tools and skills used by a wider audience.

Writing good prompts

A good prompt gives the model enough structure to be reliable without specifying every detail.

What a good prompt includes

You do not need all of these every time, but naming them removes guesswork.

  • A role lets the model answer at the right level. "You are a data governance lead working in Coalesce" pulls more careful answers than no role at all.
  • A goal works best stated as the outcome you want rather than the steps to reach it.
  • Constraints are the rules the model has to respect, like leaving passthrough columns untested or keeping Tier-1 coverage capped.
  • An output format makes the result land in a shape you can use, whether that is a model file, a set of tests, or a short table.
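
The four parts above can be sketched as a small prompt builder. This is only one way to lay a prompt out, not a required format: build_prompt and its "Role / Goal / Constraints / Output format" labels are assumptions for illustration, and only the goal is mandatory, matching the point that you do not need every part every time.

```python
def build_prompt(goal, role=None, constraints=(), output_format=None):
    """Assemble a prompt from the four parts; only the goal is required."""
    parts = []
    if role:
        parts.append(f"Role: {role}")
    parts.append(f"Goal: {goal}")
    if constraints:
        parts.append("Constraints:")
        parts.extend(f"- {rule}" for rule in constraints)
    if output_format:
        parts.append(f"Output format: {output_format}")
    return "\n".join(parts)


prompt = build_prompt(
    role="You are a data governance lead working in Coalesce",
    goal="Add tests to the table raw_trips",
    constraints=[
        "Do not re-test passthrough columns already tested at the source",
        "Prefer table stats monitors over freshness monitors",
    ],
    output_format="A short table with one row per test and why it exists",
)
print(prompt)
```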

Finding the right balance

Below are three examples of prompts to add tests to a data source: one that's too specific, one that's too vague, and one that strikes the right balance.

Too specific

Add not_null tests on trip_id, vehicle_id, and trip_start_time to the raw_trips table. Enforce unique on trip_id. Use accepted_values for vehicle_type restricted to car, bike, scooter. Include a freshness test with a 4-hour SLA and a row-count anomaly monitor allowing a 2% deviation

It names every test, threshold, and column. That leaves no room for judgement and assumes that you got everything right on the first go.
Too vague

Add tests to make sure data looks correct and updates as expected. You can add some checks for missing values or old data. Start simple and add more coverage as your data processes become more stable.

No specifics, so the output is generic and you end up doing the real thinking yourself.
Just right

Add tests to the table raw_trips. If something is already tested at the source (for example not null, unique) and the column is passthrough, with no transformations or joins, do not apply the same test again downstream. Prefer table stats monitors over freshness monitors. Add a comment for why each test exists, and whether it should be removed if the column is passthrough.

It gives a specific set of guidelines (for example, don't apply tests if the column is passthrough) and leaves the model enough freedom to come up with ideas.

Give the model room to ask and disagree

For anything non-trivial, tell the model to ask questions before it writes anything, and to push back if your instruction will not hold up.

Is there any information or data I can share with you to help improve the tests you suggested?

This helps ensure that the LLM has the right level of detail before adding the tests.

Finally, you can help the model avoid context overload. For example, parsing an entire project or every schema risks burying the request, and the model spends its attention on material it does not need. Point it at the one model and the handful of upstream tables that matter.

Extend and chain MCPs to automate workflows

There's no shortage of questions you can ask. To understand where you can expect good answers, it helps to first build an understanding of what tools the different MCPs offer, and to look in the documentation for example questions.

Transform, Catalog and Quality each expose example questions and tools you can reach through natural language
Fig 2.2 — Transform, Catalog, and Quality — each with its own MCP, all speaking the same language.

You don't have to stop here. MCPs become more useful when you combine them with the other MCPs your team already uses. Most clients let you connect several at once, and the model will call across them in a single prompt.

Here are a few examples of how you can use Coalesce MCPs with other tools.

  • Investigate and resolve an issue. Have the Quality MCP investigate the root cause of an issue and the Transform MCP implement a fix using the root cause information carried over.
  • Data quality report in Slack. Ask for a summary of open issues older than seven days using the Quality MCP and have it posted to a specific channel in Slack.
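
The Slack example above boils down to a filter and a formatting step once the issues have been fetched. The sketch below shows only that logic; in practice the model would fetch the issues through the Quality MCP and post through a Slack integration. The issue fields (table, title, status, opened_at) and the sample data are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def stale_issues(issues, now, min_age_days=7):
    """Return open issues opened more than `min_age_days` ago, oldest first."""
    cutoff = now - timedelta(days=min_age_days)
    stale = [
        issue for issue in issues
        if issue["status"] == "open" and issue["opened_at"] < cutoff
    ]
    return sorted(stale, key=lambda issue: issue["opened_at"])


def format_summary(issues, now, min_age_days=7):
    """Build a plain-text message suitable for posting to a channel."""
    lines = [f"{len(issues)} open issue(s) older than {min_age_days} days"]
    for issue in issues:
        age = (now - issue["opened_at"]).days
        lines.append(f"- {issue['table']}: {issue['title']} ({age} days old)")
    return "\n".join(lines)


# Made-up issues, for illustration only.
utc = timezone.utc
now = datetime(2025, 1, 15, tzinfo=utc)
issues = [
    {"table": "raw_trips", "title": "Row count drop", "status": "open",
     "opened_at": datetime(2025, 1, 2, tzinfo=utc)},
    {"table": "dim_vehicle", "title": "Null vehicle_type", "status": "open",
     "opened_at": datetime(2025, 1, 12, tzinfo=utc)},
    {"table": "fct_trips", "title": "Late load", "status": "resolved",
     "opened_at": datetime(2025, 1, 1, tzinfo=utc)},
]

message = format_summary(stale_issues(issues, now), now)
print(message)
```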

Limitations of prompting

Starting with basic prompting is a good way to get familiar with MCP capabilities and see results right away. It is worth keeping an eye on a few common limitations.

  • Results vary run to run. Ask the same question twice and you will sometimes get slightly different answers. For exploration this is fine, but for other workflows it can lead to inconsistencies.
  • Models may answer even when the tool is not available. If the MCP does not expose the capability you are asking about, the model will sometimes improvise rather than stop. Knowing how to understand tools is the best guardrail.
  • Everyone starts from scratch. Each person figures out their own prompts, workflows, and patterns. That makes prompting harder to scale, and it works best for people who are already proficient with MCPs.

Managing permissions and data access

If you are using the Coalesce MCPs in production workflows, it is worth being deliberate about permissions, especially around read versus write access.

Read

Read access lets the MCP inspect lineage, catalog metadata, quality status, and code.

Write

Write access lets it make changes directly, assigning owners, creating tests and monitors, or committing model code.

Below are a few recommendations to consider as a starting point:

  • Start with read-only access, managed at the token level, and keep it that way for most users.
  • Treat MCP write tokens the way you treat production database credentials. The same person who approves one should approve the other.
  • Gate raw warehouse tools behind per-call approval. An unbounded SELECT can run up warehouse costs or pull customer PII into a context window where you did not expect it.
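
The recommendations above can be sketched as a small policy check that runs before each tool call. This is an assumption-laden illustration, not how Coalesce enforces permissions: the tool names, the "read" and "write" token scopes, and the decide function are all hypothetical.

```python
READ_TOOLS = {"get_lineage", "list_issues"}        # hypothetical tool names
WRITE_TOOLS = {"assign_owner", "create_monitor"}   # hypothetical tool names
RAW_WAREHOUSE_TOOLS = {"run_sql"}                  # hypothetical tool name


def decide(tool, token_scope, approved=False):
    """Return 'allow', 'deny', or 'needs_approval' for a single tool call."""
    if tool in RAW_WAREHOUSE_TOOLS:
        # Raw queries are gated per call, whatever the token scope.
        return "allow" if approved else "needs_approval"
    if tool in READ_TOOLS:
        return "allow"
    if tool in WRITE_TOOLS:
        # Write tools require a token that was explicitly granted write access.
        return "allow" if token_scope == "write" else "deny"
    # Unknown tools are denied by default.
    return "deny"


print(decide("assign_owner", token_scope="read"))
print(decide("run_sql", token_scope="write"))
```

Denying unknown tools by default means a newly added tool stays blocked until someone classifies it.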

From prompt to skill

Anthropic describes the relationship with a kitchen analogy: MCPs provide the professional kitchen and tools, and skills provide the recipes that tell the MCPs what to do and in what order. The good news is that all the lessons you've learned here about writing and designing good prompts also hold true for skills.

Chapter 03

Skills

Packaging workflows into reproducible recipes.

01 Introduction to skills
02 Designing a good skill
03 Planning a skill suite
04 Creating a real-world skill
05 Evaluating & sharing skills

Introduction to skills

A skill is a markdown file (SKILL.md), plus optional supporting files, that provides instructions on how to perform a task. This helps you encode workflows so they're performed the same way each time, and share standardized ways of working across your team. For the rest of this chapter, we'll focus on skills as used in Anthropic's products, although they're starting to become popular in other tools as well.

To use a cooking reference, the best way to think about a skill is as a recipe. A good skill specifies the following four parameters, just as you would if you were writing an instruction document for a colleague.

  • What the task is
  • What tools to use
  • What good output looks like
  • What to avoid

When a new conversation starts, Claude automatically compares your request against the descriptions of the available skills to see if one fits. You can also force it by typing / to bring up the available skills.

Note: Skills are only as good as the data they can access, so the more context you pass through the MCPs, the better the skills built on top of them. If you've already set up and connected your MCP tools, you've done the hard part and are ready to go.

Designing a good skill

The first part of a skill is the description. This tells Claude when to load the skill and is important to get right. A good description has three parts: what it does, when to use it, and its key capabilities.

Skills also rely on progressive disclosure. The header is always loaded, the body loads when the skill looks relevant, and any linked files load only as needed. That keeps token use low while keeping the expertise available when it matters.
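
As a rough mental model of progressive disclosure, the sketch below builds up context in three levels: header always, body when relevant, linked files only when asked for. It is a toy model and not how Claude is implemented; the Skill class, build_context, and the sample skill contents are assumptions for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Skill:
    name: str
    description: str
    body: str
    files: dict = field(default_factory=dict)  # filename -> contents


def build_context(skill, relevant=False, needed_files=()):
    """Return the pieces of text loaded at each disclosure level."""
    # Level 1: the header (name and description) is always loaded.
    parts = [f"{skill.name}: {skill.description}"]
    if relevant:
        # Level 2: the body is loaded only when the skill looks relevant.
        parts.append(skill.body)
        # Level 3: linked files are loaded one by one, only as needed.
        parts.extend(skill.files[name] for name in needed_files)
    return parts


skill = Skill(
    name="weekly-data-quality-report",
    description="Generates a weekly data quality report and posts it to Slack.",
    body="Confirm the Slack channel, pull open incidents, then post a summary.",
    files={"report-template.md": "TL;DR, then issues by severity."},
)

idle = build_context(skill)
active = build_context(skill, relevant=True, needed_files=["report-template.md"])
print(len(idle), len(active))
```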

The better the description, the better the first iteration of the skill.

Below is an example of a good description that's specific and actionable, and that clearly specifies a trigger.

A good description

Generates a weekly data quality report from Coalesce Quality and posts it to Slack. Use whenever the user asks for a "weekly data quality report," "Coalesce Quality weekly summary," "data health digest," even if they don't use those exact phrases. Covers open incidents, issues by severity, failing monitors, and affected entities over 7 days.

The two below are weaker starting points.

Descriptions that need work

"Create a weekly data quality report."

Says nothing about when to trigger or what it contains, and leaves too much up to interpretation about which metrics to use.

"Use list_incidents ... to create a weekly data quality report ..."

Too technical for a first draft. Claude captures a lot of this from a good natural-language description. Go back and edit the skill with the technical detail later.

When designing the core of a skill, start with the use case in mind and write the instructions as you would for a colleague. This is where you specify what the user wants to accomplish, what best practices should be encoded, and which specific tools from the MCPs should be used.

With this, you get repeatable results, unlike ad hoc prompts, which may generate different results each time.

Below is an example of what a good use case description looks like.

Use case
Weekly Data Quality Report
  1. Confirm the Slack channel and whether to review first or send directly
  2. Pull open incidents using the list_incidents API from Coalesce Quality (via MCP), separating new from stale
  3. Pull open issues and bucket them by severity
  4. Rank the top 5 issues by downstream impact
  5. Summarise execution runs into a headline pass rate, deduped by monitor
  6. Post a five-line summary to the channel with the detail in threaded replies
Result: Weekly data quality digest posted to the chosen channel, with empty sections kept visible
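
To show how a description and a use case like the one above fit together in a file, here is a minimal sketch that renders them into SKILL.md text with `name` and `description` frontmatter followed by numbered steps. The render_skill helper, the skill name, and the exact body layout are assumptions for illustration, and the description is shortened; in practice you would usually write the file by hand or with skill-creator.

```python
import json


def render_skill(name, description, title, steps, result):
    """Render SKILL.md text: YAML frontmatter, a title, then numbered steps."""
    lines = [
        "---",
        f"name: {name}",
        # json.dumps yields a double-quoted string, which is also valid YAML
        # and stays safe if the description contains colons or quotes.
        f"description: {json.dumps(description)}",
        "---",
        "",
        f"# {title}",
        "",
    ]
    lines += [f"{number}. {step}" for number, step in enumerate(steps, start=1)]
    lines += ["", f"Result: {result}"]
    return "\n".join(lines) + "\n"


skill_md = render_skill(
    name="weekly-data-quality-report",
    description=(
        "Generates a weekly data quality report from Coalesce Quality "
        "and posts it to Slack."
    ),
    title="Weekly Data Quality Report",
    steps=[
        "Confirm the Slack channel and whether to review first or send directly",
        "Pull open incidents, separating new from stale",
        "Pull open issues and bucket them by severity",
    ],
    result="Weekly data quality digest posted to the chosen channel",
)
print(skill_md)
```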

Below is an example of a complete skill designed with best practices in mind.

Anatomy of a skill: a SKILL.md file with a trigger description, an ordered workflow, and a worked example
Fig 3.1 — An example skill for building a structured weekly data quality report. The description is a brief way to instruct Claude for when to use the skill without scanning the entire content.

Planning a skill suite

For data roles specifically, think of data skills as jobs to be done. These can be grouped into areas of responsibility such as data quality, data modeling, analytics, and data governance, with each area containing specific jobs.

This is powerful. For example, it lets us encode our testing philosophy as a skill, so each time data engineers want to add new tests, they can use the add-tests-to-node skill that follows our standards.

A skill suite organised as folders — data-quality, data-modelling, governance, reporting — each holding job-specific skills
Fig 3.2 — Example skill suite for a data team.

As a rule of thumb, the more often you find yourself doing a workflow, and the more time it takes, the more likely it is to benefit from being converted into a skill.

Each skill is written with a description that helps Claude quickly navigate to the right one for the job. So when someone asks to "triage this issue," the model consistently points them to that skill and they get all the right context, instead of the model chaining LLM calls and reinventing the workflow each time. That's the benefit of building out a deliberate skill suite, and a good way to figure out which skills you should build.

Creating a real-world skill

The steps above (integrating MCPs, writing clear descriptions and use cases, and specifying the expected result) can take some work to get right. Luckily, there are tools to help you get started. Claude's skill-creator is excellent for kickstarting the process, and only needs a description as input. With this, you can iterate fast and fine-tune your skill so it produces the output you expect.

Back to the example of the data quality report, this is the result we ended up with.

A weekly data quality report posted as a Slack thread with headline counts and detail in the replies
Fig 3.3 — The weekly data quality report we ended up with, posted to Slack.

Getting there was an iterative process, as it often is. Here are some things we could only give feedback on after seeing the first version of the report:

  • What the structure of the report should look like, and what happens when there are hundreds of errors.
  • What counts as a data issue: all severities or only errors.
  • Whether it should be actionable: tagging owners, linking data products, and suggesting next steps.
  • Whether it posts directly or goes for review, which Slack channel to use and what the fallback is, and whether the window is adjustable, say monthly instead of weekly.

The first version was too dense, so we moved to a brief summary with the evidence in the Slack thread.

Extending skills with other assets

You can extend skills with assets beyond the main content of the SKILL.md file. This can be anything from scripts you want to execute as part of the skill to references, templates, and so on.

your-skill-name/
├── SKILL.md            # Required - main skill file
├── scripts/            # Optional - executable code
│   ├── process_data.py # Example
│   └── validate.sh     # Example
├── references/         # Optional - documentation
│   ├── api-guide.md    # Example
│   └── examples/       # Example
└── assets/             # Optional - templates, etc.
    └── report-template.md # Example

Evaluating our skill

In many cases eyeballing the results tells you how well a skill works. Sometimes it makes sense to be more systematic, through three lenses.

  • Does it trigger on the right requests? (If you run 20 sample prompts that should trigger the skill, for how many of them is the skill actually triggered?)
  • Does it produce what we expect it to? (Are all elements there, and does it look like we expected?)
  • How does it handle edge cases? (What if there are more than 20 errors, or no owner is set?)

Claude's skill-creator comes with built-in testing. It shouldn't fully replace testing the skill manually yourself, but it can give you a more systematic way of assessing skill quality, especially as your library of skills grows.

We ran a small benchmark: three test cases, each run twice, once with the skill loaded and once without. Each run was graded against five objective assertions, such as whether it contains a TL;DR, deduplicates the noisy monitor, calls out P1 downstream impact, and is Slack-formatted.
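
Objective assertions of this kind can be written as simple checks on the report text. The sketch below is not the benchmark harness we used or the one built into skill-creator; the three checks, their names, and the two sample reports are assumptions for illustration, and the numbers it prints say nothing about the real results.

```python
import re

# Each assertion is a yes/no check on the report text (hypothetical checks).
ASSERTIONS = {
    "has_tldr": lambda report: "TL;DR" in report,
    "mentions_p1_impact": lambda report: "P1" in report,
    # Slack renders *text* as bold, so look for a line that is a bold header.
    "slack_formatted": lambda report: re.search(
        r"^\*[^*\n]+\*$", report, re.MULTILINE
    ) is not None,
}


def grade(report):
    """Run every assertion against one report."""
    return {name: check(report) for name, check in ASSERTIONS.items()}


def pass_rate(reports):
    """Share of all assertions that pass across a list of reports."""
    results = [ok for report in reports for ok in grade(report).values()]
    return sum(results) / len(results) if results else 0.0


# Made-up outputs, for illustration only.
with_skill = "*Weekly data quality report*\nTL;DR: 3 open incidents, 1 with P1 impact"
without_skill = "Weekly report\nThere are 3 open incidents."

print(pass_rate([with_skill]), pass_rate([without_skill]))
```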

Benchmark table comparing pass rate, time and tokens with and without the skill loaded
Fig 3.4 — Benchmark results comparing runs with and without the skill loaded.

In short, the skill gives us better output (more of the assertions we're running pass), but on average it also adds to the time and tokens it takes to complete the task. The extra tokens are the price of the skill pulling more data (get_issue_impact per top issue, deeper execution history) and following a stricter template. For a once-a-week report, that's fine.

With skills, the variance also drops significantly, because they make each task execution much more predictable. Without one, Claude makes structural decisions fresh every time: sometimes it writes a great report, sometimes it skips the TL;DR or forgets to deduplicate noise. With a skill, the template is enforced. That's the actual case for skills: without one, every run is a roll of the dice. We shouldn't read too much into these numbers with such a small sample, but they still illustrate the point.

Sharing skills

A skill earns its keep when the whole team runs the same one.

There are several ways to share it, depending on your team's level of maturity:

  • In the repo. Commit the skill folder to version control and anyone who pulls the repo has it. This fits data teams already working in a repo, and it gives you history, review, and rollback for free.
  • For one person. Upload the skill folder in Claude's settings to use it yourself, which is handy for trying a skill before the team adopts it.
  • Across a workspace. Upload through the Skills API so every member of the workspace gets it, with versioning and central management. This is the route for production, where you pin a version and roll out updates deliberately.

For applications, agents, or scheduled workflows, skills run through the API and need the code execution environment to run. The same skills work in the Agent SDK when you build custom agents. Treat a shared skill like software you install. Pin versions, evaluate before each rollout, and adopt skills only from sources you trust, since a skill can run code and call tools against your data.

Chapter 04

Playbooks and Built-In Agents

Putting workflows into production.

01 Productionizing MCP workflows
02 Built-in agents: Copilot and Scout
03 Production-grade playbooks and skills
04 When to use which

Productionizing MCP workflows

There are several ways the MCP workflows can be productionized, either as playbooks or built-in AI agents in products. Below are a few practical ways we're doing this at Coalesce. Some workflows are core enough to build directly into the product. Others are domain-specific, and we've built production-grade playbooks and skills on top of our MCPs.

A quadrant showing where each capability sits: Copilot, Scout, Catalog and Triage across who owns it and where it runs
Fig 4.1 — While built-in agents require significant product work, production-grade playbooks can be developed by domain experts such as solution engineers who often sit with the best understanding of customer problems.

Built-in agents: Copilot and Scout

Built-in agents are workflows that are part of the core product, with the guardrails already built in. They run where the work already happens, so you do not have to assemble the context yourself.

Copilot, in Transform. Lives in the development workspace. You describe what you want in natural language, and it turns that into governed transformations, writing SQL and building staging, dimension, fact, and view layers, keeping lineage intact under the roles and audit already in place. It captures context from the subgraph, and a read/write toggle controls whether it can change anything or only propose.

Coalesce Transform showing the node graph with the Copilot chat panel open on the right
Fig 4.2 — Copilot in Coalesce Transform, working alongside the node graph in the development workspace.

Scout, in Quality. Coalesce Quality's Scout works in the background as an always-on data SRE. It runs as asynchronous agents that kick off their own jobs, working through every open issue around the clock, triaging each by importance, and surfacing a recommended action with the evidence alongside the issue. By the time you open an issue the investigation is usually already done, so you stay in the loop by reviewing conclusions rather than chasing every alert. In one example, an anomaly monitor fired on a sharp drop in row count, and within seconds Scout had traced it to an intentional code change, linked the relevant commit, and proposed marking the issue as expected.

While a lot of the same analysis is available through the MCP, running the investigations in the background as an agent allows for new workflows, such as presenting users with a sorted list of recommended actions across all issues.

Scout's auto-triaged issues list, each with a recommended action such as declare incident or no action needed
Fig 4.3 — Scout's auto-triaged issues, each with a recommended action and the evidence alongside it.

Production-grade playbooks and skills

Not every production workflow belongs in the product. When something is specific to how your team works, it is often a better fit to build it on top of the MCPs as a playbook, often running on a set schedule.

Customer issue triage. Runs automatically each morning against the workspace that monitors our production systems: the ClickHouse tables, Postgres, and job queues. It groups related issues, queries the data, reads logs, drafts likely root causes ready to share with customers, and tags the person on rota that week. It replaces an otherwise lengthy manual investigation.

A Quality Triage app message in Slack drafting likely root causes for two customers with a suggested relay
Fig 4.4 — The customer issue triage playbook posting drafted root causes to Slack, ready to share.

Governance rollout. Takes a team from zero to a governed catalog in 8 to 12 weeks. It writes descriptions before assigning owners, classifies assets into tiers, caps Tier-1 at roughly 5% from real usage signals, and gives each phase an owner, an effort estimate, and a measurable exit criterion the agent can check itself.
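
The Tier-1 cap can be sketched as a ranking problem: order assets by a usage signal and let only the top share into Tier-1. This is an illustration, not the playbook's actual logic; the assign_tiers function, the use of query counts as the usage signal, and the "Tier-2" label for everything else are assumptions.

```python
def assign_tiers(usage, tier1_percent=5):
    """Rank assets by usage and put at most `tier1_percent` of them in Tier-1.

    `usage` maps an asset name to a usage signal such as a query count.
    Integer arithmetic keeps the cap from being exceeded by rounding.
    """
    ranked = sorted(usage, key=usage.get, reverse=True)
    tier1_count = len(ranked) * tier1_percent // 100
    return {
        asset: "Tier-1" if rank < tier1_count else "Tier-2"
        for rank, asset in enumerate(ranked)
    }


# Made-up usage counts for 40 tables, for illustration only.
usage = {f"table_{i}": i for i in range(40)}
tiers = assign_tiers(usage)
tier1 = sorted(asset for asset, tier in tiers.items() if tier == "Tier-1")
print(tier1)
```

Because the count is rounded down, a very small catalog ends up with no Tier-1 assets at all, which is a reminder that the cap is a ceiling and the final call still belongs to the owners.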

When to use which

We start from a prompt for exploration, then build skills for repeatable tasks, playbooks for repeatable workflows, and finally a built-in agent for the workflows that fit better into the product.

The capability ladder: prompt, skill, playbook, built-in agent — from hands-on per request to hands-off continuous
Fig 4.5 — The capability ladder — from a one-off prompt to an agent maintained for you.
  • Built-in agents. Make sense when they capture context from the UI and underlying platforms, the subgraph you are editing or the full state of the monitoring environment, so you are not feeding it context by hand.
  • Building on the MCPs. More flexible, and better when the workflow is specific to a team or needs to span tools. You encode your own sequencing, anti-patterns, and exit criteria. The catalog rollout is a good example, since the phases, tiers, and order are specific to each organization.
  • Do not over-build. Building a playbook for a low-frequency, low-stakes task is usually wasted effort. A prompt or a skill is enough.

From a single ad hoc prompt to a production playbook.


Coalesce Transform · Catalog · Quality