Building a Claude Code Plugin That Actually Works: What We Learned the Hard Way
We built a Claude Code plugin for Supaflow that manages data pipelines through natural language. It took three complete rewrites to get it right. This post covers the failures, the architecture decisions, and the constraints that made it work.
The final plugin is open source: supaflow-labs/supaflow-claude-plugin. Clone it, modify it, use it as a starting point for your own.
What we were building
Supaflow is a data integration platform. Users create datasources (connections to databases and SaaS apps), build pipelines between them, run syncs, and monitor jobs. All of this is available through a CLI (@getsupaflow/cli), and we wanted Claude Code to be able to drive the entire workflow.
This means Claude needs to:
- Run CLI commands with exact flags and parse JSON responses
- Follow multi-step workflows in the right order
- Ask for user confirmation before creating permanent resources
- Handle errors without making things worse
Simple enough, right?
The first attempt: skills only
Our first plugin had 6 skills covering auth, datasources, pipelines, jobs, schedules, and a quickstart walkthrough. Each skill was a markdown file with step-by-step instructions telling Claude how to use the CLI.
Here's what the quickstart skill looked like:
---
name: supaflow-quickstart
description: Use when the user asks to set up a pipeline from scratch...
---
## Step 6: Select Objects and Create Pipeline
supaflow pipelines create \
--name "<Pipeline Name>" \
--source <source> \
--project <project> \
--objects objects.json \
--json
It shipped. We tested it. It broke immediately.
Session 1: Everything went wrong
We asked Claude: "build a data pipeline from SQL Server to Snowflake using Change Tracking."
Claude read the quickstart skill and went to work. Here's what happened:
Problem 1: Claude skipped pipelines init entirely.
The skill said to use pipelines init to generate a config file with capability-aware defaults (ingestion mode, load mode, schema evolution, pipeline prefix). Claude skipped it and went straight to pipelines create without showing the user any configuration. The pipeline was created with defaults the user never saw or approved.
Why? The skill had this line:
If the user accepted all defaults, you can skip pipelines init and omit --config.
Claude interpreted "accepted all defaults" as "the user didn't complain" and skipped the entire config step.
Problem 2: Claude invented JSON fields that don't exist.
When polling job status, Claude wrote code like this:
print(f"Phase: {d.get('phase', '?')}")
print(f"Duration: {d.get('duration', '?')}")
print(f"Progress: {d.get('progress', '?')}%")
None of these fields exist. The actual jobs status response has exactly 4 fields: id, job_status, status_message, job_response. Claude hallucinated phase, duration, progress, completed_at, and eta -- fields that sound reasonable but aren't in the API.
The output showed Phase: ? and Progress: ?% for every poll, but Claude kept going as if everything was fine.
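A parser contract can also be enforced mechanically: accept only the documented fields and fail loudly on anything else. A sketch in Python (the helper name and error behavior are ours, not part of the plugin):

```python
# The 4 fields the jobs status response actually returns.
ALLOWED = {"id", "job_status", "status_message", "job_response"}

def parse_job_status(payload: dict) -> dict:
    # Fail loudly on anything outside the contract rather than
    # printing placeholder values and carrying on.
    unknown = set(payload) - ALLOWED
    if unknown:
        raise ValueError(f"fields outside the contract: {sorted(unknown)}")
    missing = ALLOWED - set(payload)
    if missing:
        raise ValueError(f"contract fields missing: {sorted(missing)}")
    return {k: payload[k] for k in ALLOWED}
```

A hallucinated phase or progress field then crashes the poll immediately instead of degrading into ? placeholders.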
Problem 3: Claude used status as a variable name in zsh.
The polling loop looked like this:
for i in $(seq 1 20); do
status=$(supaflow jobs status $job_id --json | python3 -c "...")
if [ "$status" = "completed" ]; then break; fi
sleep 5
done
This fails silently in zsh because status is a read-only builtin (it's the equivalent of $?). The variable assignment throws read-only variable: status and the loop runs all 20 iterations without ever checking the actual job status.
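The immediate fix is any non-reserved variable name; the more robust fix is to move the loop logic out of the shell entirely. A hedged sketch in Python, with the status fetch injected so the loop can be exercised without the real CLI (the function name and the terminal-state set are our assumptions):

```python
import time

def poll_until_terminal(fetch_status, interval=5, max_polls=20):
    """Poll a job until it reaches a terminal state.

    fetch_status is injected (e.g. a wrapper around
    `supaflow jobs status <id> --json`), so the loop logic has no
    shell-builtin pitfalls and can be tested with a fake fetcher.
    """
    terminal = {"completed", "failed", "cancelled"}  # assumed terminal states
    for _ in range(max_polls):
        state = fetch_status()
        if state in terminal:
            return state
        time.sleep(interval)
    return None  # gave up without reaching a terminal state
```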
Problem 4: Claude silently renamed after a duplicate error.
When pipelines create failed with a unique constraint violation, Claude tried my_pipeline_2, then my_pipeline_3, without telling the user. The skill said "handle errors gracefully." Claude interpreted "gracefully" as "fix it silently."
What we learned from session 1
Every problem came from the same root cause: skills are suggestions, not constraints. A skill can say "always do X" but Claude can rationalize its way out of X. It can say "use these fields" but Claude can invent fields that sound right. It can say "ask for confirmation" but Claude can treat silence as confirmation.
We needed something stronger than prose instructions.
The architecture that worked
We studied the superpowers plugin -- the most mature Claude Code plugin in the ecosystem. Two patterns from superpowers shaped our approach:
- SessionStart injection: Superpowers injects its entire using-superpowers skill into Claude's system context at session start, before the user says anything. This shapes behavior from the first message.
- Hard rules with rationalization blockers: Instead of just saying "do X," superpowers includes tables of common rationalizations and explicitly calls them out as bugs.
Superpowers itself deprecated commands in favor of skills -- its problem domain is workflow discipline, not CLI tool integration. But Claude Code's plugin system supports allowed-tools in command frontmatter, which physically restricts which tools a command can use. We leaned into this for our CLI integration layer.
We combined these ideas into a three-layer architecture:
Layer 1: using-supaflow (policy router)
A skill injected at session start via the SessionStart hook. It establishes hard rules, maps user intents to commands, and blocks common rationalizations:
## HARD RULES
1. Use commands when they exist. Do not freestyle the same workflow.
2. Ask one question at a time.
3. Do not create before explicit confirmation.
4. Do not guess CLI output fields.
5. Stop on errors. Never silently retry, rename, or auto-increment.
6. Always run pipelines init before pipelines create.
7. Object scope is a required decision.
## Red Flags
| Thought | Reality |
|---|---|
| "I can probably use the defaults" | Use pipelines init and show actual values. |
| "The user only changed one field, so the rest is fine" | Partial edits are not approval. |
| "I can retry with a different name" | Silent retries are a bug. Stop and ask. |
| "I already know the job fields" | The CLI contract is the source of truth. |
Layer 2: Commands (execution)
9 slash commands that own specific workflows. Each command has:
- allowed-tools in frontmatter that physically restricts what Claude can run
- Inline parser contracts that list exact field names and forbidden alternatives
- Embedded guardrails derived from real session failures
Example from /create-pipeline:
---
allowed-tools: Bash(supaflow *), Read, Edit, Write
argument-hint: [source-datasource] [destination-datasource]
---
### Parser Contract -- pipelines list
Use nested fields: source.name, source.datasource_id, destination.name
NEVER use flat fields like source_name or project_api_name
### Duplicate Constraint Handling
If pipelines create fails with a duplicate error:
STOP immediately. Show the error verbatim. Ask what to do.
NEVER silently rename. NEVER auto-increment.
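A parser following that contract is short; this Python sketch (the helper name is ours) reads only the nested fields the contract names:

```python
def summarize_pipeline(entry: dict) -> dict:
    # Use the nested fields the contract names; never flat variants
    # like source_name, which do not exist in the response.
    return {
        "source": entry["source"]["name"],
        "source_id": entry["source"]["datasource_id"],
        "destination": entry["destination"]["name"],
    }
```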
Layer 3: Domain skills (reference only)
The original skills (datasources, pipelines, jobs, schedules) became reference material. They kept all CLI documentation, field names, and config tables, but lost all workflow orchestration language. Their descriptions changed from workflow triggers to reference triggers:
# Before (workflow owner):
description: Use when the user asks to create a pipeline, sync data, run a pipeline...
# After (reference):
description: Use when you need reference information about pipeline configuration,
schema management, sync modes, or pipeline lifecycle
The hardening loop
The three-layer architecture was the foundation, but it took four more test sessions to get the guardrails right. Each session revealed new failure modes.
Session 2: Object selection was silent
Claude created a pipeline with all objects selected without asking the user. The command said --objects was optional, so Claude skipped it.
Fix: Made object scope a mandatory question with two explicit paths:
## Step 9: Object Scope (REQUIRED)
Ask: "Do you want to sync all discovered objects, or choose a subset?"
### Path A: All Objects
State clearly: "The pipeline will include all discovered objects.
The --objects flag will be omitted."
### Path B: Choose Subset
Export catalog, show objects, ask which to INCLUDE.
Session 3: Include vs exclude confusion
We added the object scope question, but Claude asked "which objects to exclude?" The user said "just orders and customers" -- meaning include only those two. Claude interpreted it as exclude, producing the opposite selection.
Fix: Changed the question to always ask "which to include" and added an explicit warning:
IMPORTANT: When the user says "just X and Y", that means INCLUDE only
X and Y. It does NOT mean exclude X and Y. This is the most common
misinterpretation.
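The include-only interpretation can also be enforced in code rather than prose. A sketch (the helper name is ours) that treats the user's list strictly as the objects to keep:

```python
def select_objects(catalog, include):
    """Keep only the named objects: 'just X and Y' means include X and Y."""
    names = set(include)
    missing = names - set(catalog)
    if missing:
        # Asking beats guessing: an unknown name means a typo or
        # a misunderstanding, so surface it instead of dropping it.
        raise ValueError(f"not in catalog: {sorted(missing)}")
    return [obj for obj in catalog if obj in names]
```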
Session 4: Wrong sync command
After creating a pipeline, Claude tried supaflow jobs run (doesn't exist), checked --help, then found supaflow pipelines sync. Three wasted tool calls.
Fix: Created a dedicated /sync-pipeline command with the correct CLI syntax, response parser, and polling loop. Changed /create-pipeline to point to it instead of leaving sync as a freeform operation.
Session 4 (continued): Stopped at polling, no final results
Claude polled with jobs status until the job completed, then told the user "sync completed" -- without ever running jobs get to get per-object details and row counts.
Fix: Added a HARD GATE in /sync-pipeline:
## Step 4: Final Results (MANDATORY)
HARD GATE: When polling reaches a terminal state, you MUST run
jobs get before responding to the user. Do not end on jobs status
alone. jobs status is only for polling.
Parser contracts: the most important guardrail
The single most effective guardrail in the plugin is inline parser contracts -- explicit lists of which JSON fields to use and which to never reference.
Without a parser contract, Claude invents plausible-sounding field names. It's not lying -- it's pattern-matching from training data. If it's seen job tracking UIs with "duration" and "progress" fields, it'll use those names even though the actual API returns execution_duration_ms and has no progress field at all.
Here's the parser contract from /check-job:
### Parser Contract -- jobs status
Extract ONLY: id, job_status, status_message, job_response
NEVER reference these fields (they do not exist):
phase, duration, completed_at, progress, percent, eta
And from /sync-pipeline:
### Parser Contract -- pipelines sync response
The sync response returns EXACTLY these 3 fields:
- job_id -- the job UUID to track
- pipeline_id -- the pipeline UUID
- status -- always "queued" on success
NEVER reference: job_status, name, message
We also discovered that different CLI commands return the same concept with different field names. schedules list returns cron_schedule, but schedules create returns cron. Without explicit contracts for each command, Claude will use the wrong field name and get a KeyError.
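One defensive pattern is a per-command field map, so each parser reads the name that command actually returns. A Python sketch built from the two names above (any other entry in such a map would be an assumption):

```python
# Each CLI command gets its own field map; schedules list and
# schedules create name the cron expression differently.
CRON_FIELD = {
    "schedules list": "cron_schedule",
    "schedules create": "cron",
}

def cron_from(command: str, payload: dict) -> str:
    # A wrong command key raises KeyError here, in the map,
    # rather than deep inside a parser.
    return payload[CRON_FIELD[command]]
```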
Testing: a 4-tier approach
We built a test suite that catches regressions at different levels:
Fast tests (offline, under 1 minute)
No Claude, no API key, no network. Pure file-system checks:
# Does every command have required frontmatter?
for cmd in commands/*.md; do
has_frontmatter "$cmd"
get_frontmatter_field "$cmd" "allowed-tools"
get_frontmatter_field "$cmd" "argument-hint"
done
# Does the hook produce valid JSON? (simplified -- the real test
# controls PATH fully, symlinks python3 into the mock dir, and
# uses run_hook_with_mocks() to isolate from system CLIs)
mock_dir=$(create_mock_cli)
output=$(run_hook_with_mocks "$mock_dir")
echo "$output" | python3 -c "
import json,sys; d=json.load(sys.stdin)
assert d['hookSpecificOutput']['hookEventName'] == 'SessionStart'
"
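The frontmatter check is easy to express outside bash as well; a Python sketch of the same idea (not the plugin's actual test helper):

```python
def check_frontmatter(text, required=("allowed-tools", "argument-hint")):
    """Return a list of problems with a command file's frontmatter."""
    problems = []
    if not text.startswith("---"):
        return ["missing frontmatter block"]
    # Frontmatter is everything up to the closing --- delimiter.
    body = text[3:].split("\n---", 1)[0]
    for key in required:
        if f"{key}:" not in body:
            problems.append(f"missing {key}")
    return problems
```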
Medium tests (fixture-based, ~2 minutes)
Feed saved CLI JSON output through the same parsers the commands prescribe. Catches field name regressions:
# Does jobs-status.json have the right fields?
assert_json_has_field "fixtures/jobs-status.json" "d['job_status']"
assert_json_missing_field "fixtures/jobs-status.json" "d['phase']"
assert_json_missing_field "fixtures/jobs-status.json" "d['duration']"
# Does create-pipeline.md contain the right guardrails?
assert_file_contains "commands/create-pipeline.md" "pipelines init"
assert_file_contains "commands/create-pipeline.md" "NEVER silently rename"
assert_file_contains "commands/create-pipeline.md" "warehouse_datasource_id"
Slow tests (claude -p, ~10 minutes)
Actually invoke slash commands via Claude in headless mode and assert on behavior:
output=$(claude -p "/create-pipeline")
assert_contains "$output" "datasources list" # checks existing
assert_contains "$output" "confirm" # asks for approval
assert_not_contains "$output" "auto-generated" # doesn't guess defaults
Live tests (real CLI, opt-in)
Run actual CLI commands and compare output shapes against committed fixtures. When a live test fails, it means the CLI changed its output and the fixture needs updating.
The SessionStart hook
The hook is the entry point for the entire plugin. It runs when Claude starts a new session and injects the using-supaflow skill content into Claude's system context:
#!/usr/bin/env bash
set -euo pipefail
# Read the using-supaflow skill
content=$(cat "$PLUGIN_ROOT/skills/using-supaflow/SKILL.md")
# Check CLI setup
warnings=()
if ! command -v supaflow &>/dev/null; then
warnings+=("[SETUP] Supaflow CLI not installed")
fi
# ... more checks ...
# Build context and output structured JSON
escaped=$(printf '%s' "$context" | python3 -c \
"import json,sys; print(json.dumps(sys.stdin.read())[1:-1])")
printf '{"hookSpecificOutput":{"hookEventName":"SessionStart",
"additionalContext":"%s"}}\n' "$escaped"
What we learned the hard way:
- The JSON must include both hookEventName and additionalContext -- in our testing, omitting hookEventName caused the output to be silently ignored by Claude Code
- We found that printf is more reliable than heredocs for large content -- the superpowers plugin notes a bash 5.3+ heredoc hang with content over ~512 bytes, and we hit similar issues
- We switched from bash parameter substitution (${s//$'\n'/\\n}) to python3 json.dumps() for escaping after finding the bash approach produced unescaped output on some systems
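Taken to its conclusion, the hook could hand the entire payload to json.dumps instead of escaping a string and splicing it into printf. A sketch of that approach (not the shipped hook):

```python
import json

def session_start_payload(context: str) -> str:
    # json.dumps handles newlines, quotes, and unicode escaping in
    # one step, so there is no partially-escaped string to splice
    # into a format template.
    return json.dumps({
        "hookSpecificOutput": {
            "hookEventName": "SessionStart",
            "additionalContext": context,
        }
    })
```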
What we stole from superpowers (and what we didn't)
Copied:
- SessionStart injection of a router skill
- Red flags / rationalization tables
- HARD RULES / HARD GATE language
- One-question-at-a-time discipline
- Test suite structure (helpers, runner, fast/slow split)
Didn't copy:
- Deprecating commands. Superpowers deprecated all commands in favor of skills. We kept commands because our problem is different -- we need tool restrictions and exact CLI contracts, which commands enforce better than skills.
- Visual companion. Superpowers has a browser-based mockup system for brainstorming. Not relevant for CLI tool integration.
- Cross-platform hook wrappers. Superpowers has a polyglot bash/batch wrapper for Windows. We only support macOS/Linux.
The final architecture
supaflow-claude-plugin/
.claude-plugin/plugin.json # Manifest
skills/
using-supaflow/ # Router (injected at SessionStart)
SKILL.md # Hard rules, command table, parser contracts
cli-reference.md # JSON output contracts, auth, error codes
supaflow-datasources/SKILL.md # Reference: connector properties, env files
supaflow-pipelines/SKILL.md # Reference: config fields, schema management
supaflow-jobs/SKILL.md # Reference: job lifecycle, per-object metrics
supaflow-schedules/SKILL.md # Reference: cron syntax, timezone handling
commands/
create-datasource.md # Reads connector docs, validates prerequisites
edit-datasource.md # Edit with confirmation gate
create-pipeline.md # The most guardrail-heavy command
edit-pipeline.md # Edit config or object selection
delete-pipeline.md # Permanent deletion with confirmation
sync-pipeline.md # Trigger sync, poll, mandatory final results
check-job.md # Job status with pipeline name resolution
explain-job-failure.md # Diagnosis from jobs get + logs
create-schedule.md # Cron schedule with duplicate detection
hooks/
check-setup.sh # SessionStart: inject using-supaflow + setup checks
hooks.json # Hook configuration
tests/
fast/ # Offline: structure, frontmatter, hook output
medium/ # Fixtures + guardrail grep
slow/ # claude -p workflow tests
live/ # Real CLI contract validation
What would we do differently?
Start with commands, not skills. We built skills first and spent weeks watching Claude drift. Commands with tool restrictions would have caught most issues from day one.
Write parser contracts before writing any CLI integration code. Run every command with --json, capture the output, commit it as a fixture, and write the parser contract from the fixture. Don't rely on memory or documentation -- the actual output is the contract.
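That workflow can even be mechanized: derive the allowed field set from the committed fixture and diff it against the fields a parser uses. A sketch (the helper names are ours):

```python
import json

def contract_from_fixture(fixture_text: str) -> set:
    """Derive the allowed field set from captured --json output."""
    return set(json.loads(fixture_text))

def check_parser_fields(used_fields, fixture_text):
    # Any field the parser reads but the fixture lacks is a
    # hallucination waiting to happen.
    return set(used_fields) - contract_from_fixture(fixture_text)
```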
Test with real sessions early. Our automated tests caught structural issues (wrong frontmatter, missing fields). But the behavioral issues (skipping confirmation, inventing fields, wrong routing) only showed up in real conversations. Review the JSONL session logs.
Don't say "optional." Any step marked "optional" or "if needed" will be skipped. If a step matters, make it mandatory with a hard gate.
Try it yourself
# Install the CLI
npm install -g @getsupaflow/cli
# Clone and add the plugin
git clone https://github.com/supaflow-labs/supaflow-claude-plugin.git
claude plugin add ./supaflow-claude-plugin
# Start Claude Code and try it
claude
> build a data pipeline from postgres to snowflake
The plugin will inject its router skill, check your CLI setup, and guide you through the workflow using commands with embedded guardrails. If you want to build your own plugin for a different CLI tool, fork the repo and replace the Supaflow-specific content -- the architecture (router skill, commands with parser contracts, test suite) transfers directly.
For the full CLI setup guide, see the CLI getting started guide. For the plugin's marketing overview and architecture, see the Supaflow Claude Code plugin page.
The Supaflow plugin is open source at github.com/supaflow-labs/supaflow-claude-plugin. The CLI is available at npmjs.com/package/@getsupaflow/cli.
