Export Salesforce to S3 with Lambda and Step Functions: A Complete Open-Source Solution
Export Salesforce to S3 and query it in Athena — no ETL tools, no long‑running servers.
This open‑source, serverless pipeline uses AWS Lambda, Step Functions, and the Salesforce Bulk API 2.0 to export every Salesforce object to S3. After each run, it automatically updates the AWS Glue Data Catalog, so your data is immediately queryable in Athena.
After a single deployment, you can query your Salesforce data directly in Athena:
SELECT * FROM salesforce_export.account_raw LIMIT 10;
Why We Built This
At Supaflow, we already offer a 3–5× cost advantage over traditional Salesforce ETL tools by eliminating per‑row pricing and long‑running infrastructure.
But we also know a simple truth: not every team can justify adding another service, even if it’s cheaper than the alternatives.
Sometimes you just need:
- Raw Salesforce data in S3
- Something reliable and incremental
- No vendor lock‑in
- And no ongoing subscription
So we built this.
This project is a fully open‑source, self‑hosted Salesforce → S3 pipeline you can run for free (aside from nominal AWS infrastructure costs). It uses the same architectural principles we believe in at Supaflow — serverless execution, controlled rate‑limit handling, and schema‑aware exports — but without any managed service layer.
If you later need managed scheduling, monitoring, transformations, or bi‑directional syncs, Supaflow builds on these same foundations. But if all you need is a clean, cost‑effective export you fully control, this repo stands on its own.
What You Get
Incremental Sync
The pipeline uses SystemModstamp (falling back to LastModifiedDate or CreatedDate) to export only records changed since the previous run. The first run exports everything; subsequent runs are fast and incremental.
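The cursor-field fallback and incremental query construction can be sketched in Python. This is a hypothetical illustration, not the pipeline's actual code; `pick_cursor_field` and `build_query` are made-up helper names.

```python
# Hypothetical sketch of the incremental-sync cursor logic.
# The fallback order mirrors the standard Salesforce fields above.
CURSOR_FIELDS = ["SystemModstamp", "LastModifiedDate", "CreatedDate"]

def pick_cursor_field(object_fields):
    """Return the first available cursor field, or None for no-cursor objects."""
    for field in CURSOR_FIELDS:
        if field in object_fields:
            return field
    return None

def build_query(object_name, object_fields, last_cursor=None):
    """Build a Bulk API SOQL query; a full export when no cursor is stored yet."""
    soql = f"SELECT {', '.join(object_fields)} FROM {object_name}"
    cursor_field = pick_cursor_field(object_fields)
    if cursor_field and last_cursor:
        soql += f" WHERE {cursor_field} > {last_cursor}"
    return soql

print(build_query("Account", ["Id", "Name", "SystemModstamp"],
                  last_cursor="2025-01-02T00:00:00Z"))
# SELECT Id, Name, SystemModstamp FROM Account WHERE SystemModstamp > 2025-01-02T00:00:00Z
```

Objects without any of the three fields fall through to `None` and are handled by the full-refresh path described later.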
Parallel Processing
Jobs are submitted in parallel (up to 15 concurrent, matching Salesforce's limit) and results are fetched in parallel (4 concurrent). A full sync of 100+ objects completes in minutes, not hours.
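The fan-out can be sketched with Python's standard thread pool. `submit_job` is a stub standing in for the real Bulk API call; the 15-worker cap mirrors Salesforce's concurrent-job limit mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sketch: submit Bulk API jobs with at most 15 in flight.
MAX_SUBMIT_CONCURRENCY = 15

def submit_job(object_name):
    # In the real pipeline this would create a Bulk API 2.0 query job;
    # here we just return a fake job id for illustration.
    return f"job-{object_name}"

def submit_all(object_names):
    """Submit one job per object, bounded to 15 concurrent submissions."""
    job_ids = {}
    with ThreadPoolExecutor(max_workers=MAX_SUBMIT_CONCURRENCY) as pool:
        futures = {pool.submit(submit_job, name): name for name in object_names}
        for fut in as_completed(futures):
            job_ids[futures[fut]] = fut.result()
    return job_ids

print(submit_all(["Account", "Contact", "Lead"]))
```

The same pattern with `max_workers=4` covers the result-fetching side.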
Rate Limit Handling
When Salesforce throttles you, the pipeline exits gracefully with a RATE_LIMITED status. Cursors aren't advanced, so the next scheduled run picks up exactly where this one left off.
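The key invariant is easy to show in a sketch: cursors for unfinished objects are never touched on a rate-limit exit. `RateLimited`, `run_export`, and `export_one` are hypothetical names for illustration only.

```python
# Hypothetical sketch of the RATE_LIMITED exit path.

class RateLimited(Exception):
    """Raised when Salesforce reports its request limit was exceeded."""

def run_export(objects, cursors, export_one):
    """export_one(name, cursor) -> new cursor value; may raise RateLimited."""
    for name in objects:
        try:
            cursors[name] = export_one(name, cursors.get(name))
        except RateLimited:
            # Cursors for unfinished objects are left unchanged, so the
            # next scheduled run resumes from exactly the same point.
            return {"status": "RATE_LIMITED", "cursors": cursors}
    return {"status": "COMPLETED", "cursors": cursors}

def demo_export(name, cursor):
    if name == "Lead":
        raise RateLimited()
    return "2025-01-03T00:00:00Z"

print(run_export(["Account", "Lead", "Contact"], {"Lead": "old"}, demo_export))
```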
Automatic Schema Discovery
After each successful export, the pipeline automatically updates the Glue Data Catalog, so your data is immediately queryable in Athena — no manual DDL, no schema maintenance. It will:
- Create tables for new objects
- Add partitions for each run
- Infer column types (boolean, integer, string, etc.)
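Type inference on CSV exports boils down to sampling string values per column. A minimal sketch of that idea (not the Glue Crawler's actual algorithm; `infer_type` is a made-up helper):

```python
# Hypothetical sketch: map sampled string values to a Glue-compatible type.

def infer_type(values):
    """Infer a column type from sampled string values of a CSV export."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "string"          # nothing to go on; string is the safe default
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    try:
        for v in non_null:
            int(v)
        return "bigint"
    except ValueError:
        pass
    try:
        for v in non_null:
            float(v)
        return "double"
    except ValueError:
        return "string"

print(infer_type(["true", "false"]))  # boolean
print(infer_type(["1", "42", ""]))    # bigint
print(infer_type(["Acme Corp"]))      # string
```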
Point-in-Time Queries
Each run writes to a run=<timestamp> partition. This makes it easy to debug historical changes, reproduce past reports, or build slowly changing dimension (SCD) tables directly from raw exports. You can:
- Query a specific snapshot: WHERE run = '20250103T120000Z'
- Get the latest version of each record using ROW_NUMBER()
- Build SCD Type 2 tables from the partition history
Architecture
The solution uses a batch polling pattern with Step Functions:
AcquireLock
|
StartJobs (15 concurrent)
| Submit Bulk API jobs for all objects
v
PollLoop:
CheckJobsBatch
| Check status of all pending jobs
v
FetchCompleted (4 concurrent)
| Download results, upload to S3
v
Has pending? --> Wait 10s --> PollLoop
|
No
v
ReleaseLock --> StartGlueCrawler (update Glue tables + partitions)
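The poll loop above can be sketched in plain Python for readability. In the real pipeline each step is a separate short-lived Lambda orchestrated by Step Functions, and the 10-second wait is a Wait state rather than a sleep; `check_status` and `fetch_result` are hypothetical callbacks.

```python
import time

# Plain-Python sketch of the batch polling pattern shown in the diagram.

def poll_until_done(pending_jobs, check_status, fetch_result, wait_seconds=10):
    """check_status(job) -> 'JobComplete' | 'InProgress' | 'Failed'."""
    while pending_jobs:
        still_pending = []
        for job in pending_jobs:
            state = check_status(job)
            if state == "JobComplete":
                fetch_result(job)          # download result pages, upload to S3
            elif state == "InProgress":
                still_pending.append(job)  # re-check on the next loop
            # 'Failed' jobs are dropped here; the real pipeline records them
        pending_jobs = still_pending
        if pending_jobs:
            time.sleep(wait_seconds)       # Wait state in Step Functions
```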
Why Step Functions?
- Short-lived Lambdas: Each step runs for seconds, not minutes. No timeout issues.
- Visible state: The Step Functions console shows exactly where each run is.
- Built-in retries: Transient failures are automatically retried with backoff.
- Cost-effective: You pay for state transitions, not idle compute.
Quick Start
Prerequisites
- AWS CLI and SAM CLI installed
- Salesforce org with API access (Bulk API 2.0)
- Salesforce credentials (username, password, security token)
Deploy in a Few Commands
No AWS console clicks required.
# Clone the repository
git clone https://github.com/supaflow-labs/supaflow-salesforce-to-s3.git
cd supaflow-salesforce-to-s3
# Run the interactive installer
./install.sh
The installer will:
- Verify your AWS credentials
- Store Salesforce credentials in SSM Parameter Store
- Create an S3 bucket (or use an existing one)
- Deploy Lambda, Step Functions, DynamoDB, and EventBridge
- Optionally start a test execution
Start Your First Export
./scripts/start.sh
Check Status
./scripts/status.sh
Output:
Run: 20250103T042422Z
Status: COMPLETED
Started: 2025-01-03T04:24:22Z
Ended: 2025-01-03T04:31:15Z
Objects: 22/22 completed, 0 failed, 0 skipped
Data: 847,293 records, 156 pages, 124.5 MB
Glue Crawler: salesforce-export-crawler
State: READY
Last: SUCCEEDED
Changes: 22 tables created, 0 updated
Querying Your Data
After the Glue Crawler completes, tables appear in the salesforce_export database.
Basic Queries
-- List all tables
SHOW TABLES IN salesforce_export;
-- Query Account data
SELECT id, name, industry, annualrevenue
FROM salesforce_export.account_raw
LIMIT 100;
-- Count records per run
SELECT run, COUNT(*) as records
FROM salesforce_export.contact_raw
GROUP BY run
ORDER BY run DESC;
Get Latest Version of Each Record
Since each run creates a new partition, you may have multiple versions of the same record. Use ROW_NUMBER() to get the latest:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY run DESC) as _row_num
FROM salesforce_export.account_raw
) WHERE _row_num = 1;
Query a Point-in-Time Snapshot
SELECT * FROM salesforce_export.opportunity_raw
WHERE run = '20250103T042422Z';
Operational Commands
# Start a new export
./scripts/start.sh
# Check status of recent runs
./scripts/status.sh
# Stop a running execution
./scripts/stop.sh
# Reset all state (forces full re-sync)
./scripts/reset.sh
# Completely uninstall
./scripts/uninstall.sh
Configuration Options
Most users can start with the defaults. The options below are only needed if you want to customize scheduling or sync behavior.
| Parameter | Default | Description |
|---|---|---|
| ScheduleExpression | rate(1 day) | How often to run (EventBridge syntax) |
| ObjectAllowlist | 20 core objects | Which objects to sync (empty = all) |
| MaxRecordsPerPage | 10000 | Bulk API page size |
| LookbackSeconds | 0 | Overlap window for incremental runs |
| FullRefreshIntervalMinutes | 1440 | How often to full-refresh no-cursor objects |
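To illustrate LookbackSeconds: the stored cursor can be shifted back by the overlap window so records modified just before the previous run's high-water mark are re-exported rather than missed. A sketch under that assumption, not the pipeline's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: apply LookbackSeconds when storing the next cursor.

def next_cursor(high_water_mark: str, lookback_seconds: int) -> str:
    """Shift the high-water mark back by the configured overlap window."""
    ts = datetime.strptime(high_water_mark, "%Y-%m-%dT%H:%M:%SZ")
    return (ts - timedelta(seconds=lookback_seconds)).strftime("%Y-%m-%dT%H:%M:%SZ")

print(next_cursor("2025-01-03T04:31:15Z", 300))  # 2025-01-03T04:26:15Z
```

Since each run lands in its own partition, the duplicates this overlap creates are handled by the ROW_NUMBER() deduplication query shown earlier.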
Sync Specific Objects
sam deploy --parameter-overrides \
ObjectAllowlist="Account,Contact,Lead,Opportunity"
Sync All Objects
sam deploy --parameter-overrides ObjectAllowlist="ALL"
Default Objects
By default, only 20 core CRM objects are synced:
Account, Contact, Lead, Opportunity, Case, User, UserRole,
Profile, Group, GroupMember, PermissionSet, PermissionSetAssignment,
Product2, Pricebook2, PricebookEntry, Document, EmailTemplate,
Solution, Entitlement, BusinessHours
This keeps initial runs fast and avoids hitting API limits while testing.
Handling Objects Without an Update Timestamp
Some Salesforce objects don't have a reliable update timestamp such as SystemModstamp or LastModifiedDate. By default, these are full-refreshed every 24 hours.
Configure with FullRefreshIntervalMinutes:
- -1: Skip these objects entirely
- 0: Full refresh every run
- 1440 (default): Full refresh once per day
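The three settings reduce to a small decision function, sketched here for illustration (`should_full_refresh` is a made-up name, not part of the repo):

```python
# Hypothetical sketch of the FullRefreshIntervalMinutes decision for
# objects with no usable cursor field.

def should_full_refresh(minutes_since_last, interval_minutes):
    if interval_minutes == -1:
        return False                 # skip these objects entirely
    if interval_minutes == 0:
        return True                  # full refresh every run
    return minutes_since_last >= interval_minutes

print(should_full_refresh(1500, 1440))  # True: daily window has elapsed
print(should_full_refresh(60, 1440))    # False: refreshed recently
print(should_full_refresh(60, -1))      # False: object is skipped
```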
Security
- Credentials in SSM: Salesforce password and token stored as SecureStrings
- Encrypted S3: Server-side encryption enabled by default
- Scoped IAM: Lambda only has access to its own resources
- No public access: S3 bucket blocks all public access
Source Code
The complete solution is open-source:
GitHub: github.com/supaflow-labs/supaflow-salesforce-to-s3
Contributions welcome. MIT License.
Have a feature request? Open an issue on GitHub.
Get Started
Clone the repo and run ./install.sh. In about 10 minutes, you'll have raw Salesforce data in S3 and fully queryable tables in Athena -- without paying for an ETL tool.
Questions or feature requests? Open an issue on GitHub or email support@supa-flow.io.
Need More Than a Raw Export?
If you need managed scheduling, monitoring, dbt transformations, or reverse ETL back into Salesforce, Supaflow builds on these same foundations as a fully managed platform.
- Start your free trial -- 31 days, no credit card required
- Book a demo -- see Supaflow handle Salesforce ingestion, transformation, and activation in one platform
