
Building a Claude Code Plugin That Actually Works: What We Learned the Hard Way

· 14 min read

We built a Claude Code plugin for Supaflow that manages data pipelines through natural language. It took three complete rewrites to get it right. This post covers the failures, the architecture decisions, and the constraints that made it work.

The final plugin is open source: supaflow-labs/supaflow-claude-plugin. Clone it, modify it, use it as a starting point for your own.

What we were building

Supaflow is a data integration platform. Users create datasources (connections to databases and SaaS apps), build pipelines between them, run syncs, and monitor jobs. All of this is available through a CLI (@getsupaflow/cli), and we wanted Claude Code to be able to drive the entire workflow.

This means Claude needs to:

  • Run CLI commands with exact flags and parse JSON responses
  • Follow multi-step workflows in the right order
  • Ask for user confirmation before creating permanent resources
  • Handle errors without making things worse

Simple enough, right?

The first attempt: skills only

Our first plugin had 6 skills covering auth, datasources, pipelines, jobs, schedules, and a quickstart walkthrough. Each skill was a markdown file with step-by-step instructions telling Claude how to use the CLI.

Here's what the quickstart skill looked like:

---
name: supaflow-quickstart
description: Use when the user asks to set up a pipeline from scratch...
---

## Step 6: Select Objects and Create Pipeline

supaflow pipelines create \
  --name "<Pipeline Name>" \
  --source <source> \
  --project <project> \
  --objects objects.json \
  --json

It shipped. We tested it. It broke immediately.

Session 1: Everything went wrong

We asked Claude: "build a data pipeline from SQL Server to Snowflake using Change Tracking."

Claude read the quickstart skill and went to work. Here's what happened:

Problem 1: Claude skipped pipelines init entirely.

The skill said to use pipelines init to generate a config file with capability-aware defaults (ingestion mode, load mode, schema evolution, pipeline prefix). Claude skipped it and went straight to pipelines create without showing the user any configuration. The pipeline was created with defaults the user never saw or approved.

Why? The skill had this line:

If the user accepted all defaults, you can skip pipelines init and omit --config.

Claude interpreted "accepted all defaults" as "the user didn't complain" and skipped the entire config step.

Problem 2: Claude invented JSON fields that don't exist.

When polling job status, Claude wrote code like this:

print(f"Phase: {d['phase']}")
print(f"Duration: {d['duration']}")
print(f"Progress: {d['progress']}%")

None of these fields exist. The actual jobs status response has exactly 4 fields: id, job_status, status_message, job_response. Claude hallucinated phase, duration, progress, completed_at, and eta -- fields that sound reasonable but aren't in the API.

The output showed Phase: ? and Progress: ?% for every poll, but Claude kept going as if everything was fine.
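A parser that respects the contract touches only those 4 fields and nothing else. A minimal sketch (the payload values here are illustrative; the field names are the real contract):

```shell
# Illustrative captured response -- only the 4 contract fields exist:
# id, job_status, status_message, job_response.
response='{"id":"abc-123","job_status":"running","status_message":"Loading orders","job_response":null}'

job_status=$(printf '%s' "$response" | python3 -c \
  "import json,sys; print(json.load(sys.stdin)['job_status'])")
status_message=$(printf '%s' "$response" | python3 -c \
  "import json,sys; print(json.load(sys.stdin)['status_message'])")

echo "Status: $job_status"
echo "Message: $status_message"
```

Anything like `d['phase']` or `d['progress']` would raise a KeyError here, which is exactly the failure mode the parser contracts later in this post are designed to prevent.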

Problem 3: Claude used status as a variable name in zsh.

The polling loop looked like this:

for i in $(seq 1 20); do
  status=$(supaflow jobs status $job_id --json | python3 -c "...")
  if [ "$status" = "completed" ]; then break; fi
  sleep 5
done

This fails silently in zsh because status is a read-only builtin (it's the equivalent of $?). The variable assignment throws read-only variable: status and the loop runs all 20 iterations without ever checking the actual job status.
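The fix is simply a variable name zsh doesn't reserve. A sketch of the corrected loop, with `supaflow` mocked as a function so it runs standalone (the real loop calls the actual CLI):

```shell
# Mocked CLI so the sketch is self-contained; the mock returns a
# terminal state immediately.
supaflow() {
  echo '{"id":"j1","job_status":"completed","status_message":"done","job_response":null}'
}

job_id="j1"
for i in $(seq 1 20); do
  # 'job_status', not 'status' -- zsh treats 'status' as a read-only
  # alias of $?, so assigning to it fails.
  job_status=$(supaflow jobs status "$job_id" --json \
    | python3 -c "import json,sys; print(json.load(sys.stdin)['job_status'])")
  if [ "$job_status" = "completed" ] || [ "$job_status" = "failed" ]; then
    break
  fi
  sleep 5
done
echo "final job_status: $job_status"
```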

Problem 4: Claude silently renamed after a duplicate error.

When pipelines create failed with a unique constraint violation, Claude tried my_pipeline_2, then my_pipeline_3, without telling the user. The skill said "handle errors gracefully." Claude interpreted "gracefully" as "fix it silently."

What we learned from session 1

Every problem came from the same root cause: skills are suggestions, not constraints. A skill can say "always do X" but Claude can rationalize its way out of X. It can say "use these fields" but Claude can invent fields that sound right. It can say "ask for confirmation" but Claude can treat silence as confirmation.

We needed something stronger than prose instructions.

The architecture that worked

We studied the superpowers plugin -- the most mature Claude Code plugin in the ecosystem. Two patterns from superpowers shaped our approach:

  1. SessionStart injection: Superpowers injects its entire using-superpowers skill into Claude's system context at session start, before the user says anything. This shapes behavior from the first message.

  2. Hard rules with rationalization blockers: Instead of just saying "do X," superpowers includes tables of common rationalizations and explicitly calls them out as bugs.

Superpowers itself deprecated commands in favor of skills -- its problem domain is workflow discipline, not CLI tool integration. But Claude Code's plugin system supports allowed-tools in command frontmatter, which physically restricts which tools a command can use. We leaned into this for our CLI integration layer.

We combined these ideas into a three-layer architecture:

Layer 1: using-supaflow (policy router)

A skill injected at session start via the SessionStart hook. It establishes hard rules, maps user intents to commands, and blocks common rationalizations:

## HARD RULES

1. Use commands when they exist. Do not freestyle the same workflow.
2. Ask one question at a time.
3. Do not create before explicit confirmation.
4. Do not guess CLI output fields.
5. Stop on errors. Never silently retry, rename, or auto-increment.
6. Always run pipelines init before pipelines create.
7. Object scope is a required decision.

## Red Flags

| Thought | Reality |
|---|---|
| "I can probably use the defaults" | Use pipelines init and show actual values. |
| "The user only changed one field, so the rest is fine" | Partial edits are not approval. |
| "I can retry with a different name" | Silent retries are a bug. Stop and ask. |
| "I already know the job fields" | The CLI contract is the source of truth. |

Layer 2: Commands (execution)

9 slash commands that own specific workflows. Each command has:

  • allowed-tools in frontmatter that physically restricts what Claude can run
  • Inline parser contracts that list exact field names and forbidden alternatives
  • Embedded guardrails derived from real session failures

Example from /create-pipeline:

---
allowed-tools: Bash(supaflow *), Read, Edit, Write
argument-hint: [source-datasource] [destination-datasource]
---

### Parser Contract -- pipelines list

Use nested fields: source.name, source.datasource_id, destination.name
NEVER use flat fields like source_name or project_api_name

### Duplicate Constraint Handling

If pipelines create fails with a duplicate error:
STOP immediately. Show the error verbatim. Ask what to do.
NEVER silently rename. NEVER auto-increment.

Layer 3: Domain skills (reference only)

The original skills (datasources, pipelines, jobs, schedules) became reference material. They kept all CLI documentation, field names, and config tables, but lost all workflow orchestration language. Their descriptions changed from workflow triggers to reference triggers:

# Before (workflow owner):
description: Use when the user asks to create a pipeline, sync data, run a pipeline...

# After (reference):
description: Use when you need reference information about pipeline configuration,
schema management, sync modes, or pipeline lifecycle

The hardening loop

The three-layer architecture was the foundation, but it took four more test sessions to get the guardrails right. Each session revealed new failure modes.

Session 2: Object selection was silent

Claude created a pipeline with all objects selected without asking the user. The command said --objects was optional, so Claude skipped it.

Fix: Made object scope a mandatory question with two explicit paths:

## Step 9: Object Scope (REQUIRED)

Ask: "Do you want to sync all discovered objects, or choose a subset?"

### Path A: All Objects
State clearly: "The pipeline will include all discovered objects.
The --objects flag will be omitted."

### Path B: Choose Subset
Export catalog, show objects, ask which to INCLUDE.

Session 3: Include vs exclude confusion

We added the object scope question, but Claude asked "which objects to exclude?" The user said "just orders and customers" -- meaning include only those two. Claude interpreted it as exclude, producing the opposite selection.

Fix: Changed the question to always ask "which to include" and added an explicit warning:

IMPORTANT: When the user says "just X and Y", that means INCLUDE only
X and Y. It does NOT mean exclude X and Y. This is the most common
misinterpretation.

Session 4: Wrong sync command

After creating a pipeline, Claude tried supaflow jobs run (doesn't exist), checked --help, then found supaflow pipelines sync. Three wasted tool calls.

Fix: Created a dedicated /sync-pipeline command with the correct CLI syntax, response parser, and polling loop. Changed /create-pipeline to point to it instead of leaving sync as a freeform operation.

Session 4 (continued): Stopped at polling, no final results

Claude polled with jobs status until the job completed, then told the user "sync completed" -- without ever running jobs get to fetch per-object details and row counts.

Fix: Added a HARD GATE in /sync-pipeline:

## Step 4: Final Results (MANDATORY)

HARD GATE: When polling reaches a terminal state, you MUST run
jobs get before responding to the user. Do not end on jobs status
alone. jobs status is only for polling.

Parser contracts: the most important guardrail

The single most effective guardrail in the plugin is inline parser contracts -- explicit lists of which JSON fields to use and which to never reference.

Without a parser contract, Claude invents plausible-sounding field names. It's not lying -- it's pattern-matching from training data. If it's seen job tracking UIs with "duration" and "progress" fields, it'll use those names even though the actual API returns execution_duration_ms and has no progress field at all.

Here's the parser contract from /check-job:

### Parser Contract -- jobs status

Extract ONLY: id, job_status, status_message, job_response

NEVER reference these fields (they do not exist):
phase, duration, completed_at, progress, percent, eta

And from /sync-pipeline:

### Parser Contract -- pipelines sync response

The sync response returns EXACTLY these 3 fields:
- job_id -- the job UUID to track
- pipeline_id -- the pipeline UUID
- status -- always "queued" on success

NEVER reference: job_status, name, message

We also discovered that different CLI commands return the same concept with different field names. schedules list returns cron_schedule, but schedules create returns cron. Without explicit contracts for each command, Claude will use the wrong field name and get a KeyError.
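This is why each command gets its own extractor. A sketch -- the surrounding payload shapes below are assumptions, but the divergent field names (cron_schedule vs cron) are from the real contracts:

```shell
# Illustrative payloads: the wrapper shapes are assumed; the point is
# that the same concept has a different field name per command.
list_response='[{"id":"s1","cron_schedule":"0 * * * *"}]'
create_response='{"id":"s2","cron":"30 2 * * *"}'

# One extractor per command -- never reuse a field name across commands.
cron_from_list=$(printf '%s' "$list_response" | python3 -c \
  "import json,sys; print(json.load(sys.stdin)[0]['cron_schedule'])")
cron_from_create=$(printf '%s' "$create_response" | python3 -c \
  "import json,sys; print(json.load(sys.stdin)['cron'])")

echo "list: $cron_from_list / create: $cron_from_create"
```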

Testing: a 4-tier approach

We built a test suite that catches regressions at different levels:

Fast tests (offline, under 1 minute)

No Claude, no API key, no network. Pure file-system checks:

# Does every command have required frontmatter?
for cmd in commands/*.md; do
  has_frontmatter "$cmd"
  get_frontmatter_field "$cmd" "allowed-tools"
  get_frontmatter_field "$cmd" "argument-hint"
done

# Does the hook produce valid JSON? (simplified -- the real test
# controls PATH fully, symlinks python3 into the mock dir, and
# uses run_hook_with_mocks() to isolate from system CLIs)
mock_dir=$(create_mock_cli)
output=$(run_hook_with_mocks "$mock_dir")
echo "$output" | python3 -c "
import json, sys
d = json.load(sys.stdin)
assert d['hookSpecificOutput']['hookEventName'] == 'SessionStart'
"

Medium tests (fixture-based, ~2 minutes)

Feed saved CLI JSON output through the same parsers the commands prescribe. Catches field name regressions:

# Does jobs-status.json have the right fields?
assert_json_has_field "fixtures/jobs-status.json" "d['job_status']"
assert_json_missing_field "fixtures/jobs-status.json" "d['phase']"
assert_json_missing_field "fixtures/jobs-status.json" "d['duration']"

# Does create-pipeline.md contain the right guardrails?
assert_file_contains "commands/create-pipeline.md" "pipelines init"
assert_file_contains "commands/create-pipeline.md" "NEVER silently rename"
assert_file_contains "commands/create-pipeline.md" "warehouse_datasource_id"

Slow tests (claude -p, ~10 minutes)

Actually invoke slash commands via Claude in headless mode and assert on behavior:

output=$(claude -p "/create-pipeline")
assert_contains "$output" "datasources list" # checks existing
assert_contains "$output" "confirm" # asks for approval
assert_not_contains "$output" "auto-generated" # doesn't guess defaults

Live tests (real CLI, opt-in)

Run actual CLI commands and compare output shapes against committed fixtures. When a live test fails, it means the CLI changed its output and the fixture needs updating.

The SessionStart hook

The hook is the entry point for the entire plugin. It runs when Claude starts a new session and injects the using-supaflow skill content into Claude's system context:

#!/usr/bin/env bash
set -euo pipefail

# Read the using-supaflow skill
content=$(cat "$PLUGIN_ROOT/skills/using-supaflow/SKILL.md")

# Check CLI setup
warnings=()
if ! command -v supaflow &>/dev/null; then
  warnings+=("[SETUP] Supaflow CLI not installed")
fi
# ... more checks ...

# Build the context: skill content plus any setup warnings
context="$content"
for w in "${warnings[@]-}"; do
  context+=$'\n'"$w"
done

# Escape for JSON and emit the structured hook output on a single line
escaped=$(printf '%s' "$context" | python3 -c \
  "import json,sys; print(json.dumps(sys.stdin.read())[1:-1])")

printf '{"hookSpecificOutput":{"hookEventName":"SessionStart","additionalContext":"%s"}}\n' "$escaped"

What we learned the hard way:

  • The JSON must include both hookEventName and additionalContext -- in our testing, omitting hookEventName caused the output to be silently ignored by Claude Code
  • We found that printf is more reliable than heredocs for large content -- the superpowers plugin notes a bash 5.3+ heredoc hang with content over ~512 bytes, and we hit similar issues
  • We switched from bash parameter substitution (${s//$'\n'/\\n}) to python3 json.dumps() for escaping after finding the bash approach produced unescaped output on some systems

What we stole from superpowers (and what we didn't)

Copied:

  • SessionStart injection of a router skill
  • Red flags / rationalization tables
  • HARD RULES / HARD GATE language
  • One-question-at-a-time discipline
  • Test suite structure (helpers, runner, fast/slow split)

Didn't copy:

  • Deprecating commands. Superpowers deprecated all commands in favor of skills. We kept commands because our problem is different -- we need tool restrictions and exact CLI contracts, which commands enforce better than skills.
  • Visual companion. Superpowers has a browser-based mockup system for brainstorming. Not relevant for CLI tool integration.
  • Cross-platform hook wrappers. Superpowers has a polyglot bash/batch wrapper for Windows. We only support macOS/Linux.

The final architecture

supaflow-claude-plugin/
  .claude-plugin/plugin.json       # Manifest
  skills/
    using-supaflow/                # Router (injected at SessionStart)
      SKILL.md                     # Hard rules, command table, parser contracts
      cli-reference.md             # JSON output contracts, auth, error codes
    supaflow-datasources/SKILL.md  # Reference: connector properties, env files
    supaflow-pipelines/SKILL.md    # Reference: config fields, schema management
    supaflow-jobs/SKILL.md         # Reference: job lifecycle, per-object metrics
    supaflow-schedules/SKILL.md    # Reference: cron syntax, timezone handling
  commands/
    create-datasource.md           # Reads connector docs, validates prerequisites
    edit-datasource.md             # Edit with confirmation gate
    create-pipeline.md             # The most guardrail-heavy command
    edit-pipeline.md               # Edit config or object selection
    delete-pipeline.md             # Permanent deletion with confirmation
    sync-pipeline.md               # Trigger sync, poll, mandatory final results
    check-job.md                   # Job status with pipeline name resolution
    explain-job-failure.md         # Diagnosis from jobs get + logs
    create-schedule.md             # Cron schedule with duplicate detection
  hooks/
    check-setup.sh                 # SessionStart: inject using-supaflow + setup checks
    hooks.json                     # Hook configuration
  tests/
    fast/                          # Offline: structure, frontmatter, hook output
    medium/                        # Fixtures + guardrail grep
    slow/                          # claude -p workflow tests
    live/                          # Real CLI contract validation

What would we do differently?

Start with commands, not skills. We built skills first and spent weeks watching Claude drift. Commands with tool restrictions would have caught most issues from day one.

Write parser contracts before writing any CLI integration code. Run every command with --json, capture the output, commit it as a fixture, and write the parser contract from the fixture. Don't rely on memory or documentation -- the actual output is the contract.

Test with real sessions early. Our automated tests caught structural issues (wrong frontmatter, missing fields). But the behavioral issues (skipping confirmation, inventing fields, wrong routing) only showed up in real conversations. Review the JSONL session logs.

Don't say "optional." Any step marked "optional" or "if needed" will be skipped. If a step matters, make it mandatory with a hard gate.

Try it yourself

# Install the CLI
npm install -g @getsupaflow/cli

# Clone and add the plugin
git clone https://github.com/supaflow-labs/supaflow-claude-plugin.git
claude plugin add ./supaflow-claude-plugin

# Start Claude Code and try it
claude
> build a data pipeline from postgres to snowflake

The plugin will inject its router skill, check your CLI setup, and guide you through the workflow using commands with embedded guardrails. If you want to build your own plugin for a different CLI tool, fork the repo and replace the Supaflow-specific content -- the architecture (router skill, commands with parser contracts, test suite) transfers directly.

For the full CLI setup guide, see the CLI getting started guide. For the plugin's marketing overview and architecture, see the Supaflow Claude Code plugin page.


The Supaflow plugin is open source at github.com/supaflow-labs/supaflow-claude-plugin. The CLI is available at npmjs.com/package/@getsupaflow/cli.