Export Salesforce to S3 with Lambda and Step Functions: A Complete Open-Source Solution
Export Salesforce to S3 and query it in Athena — no ETL tools, no long‑running servers.
This open‑source, serverless pipeline uses AWS Lambda, Step Functions, and the Salesforce Bulk API 2.0 to export every Salesforce object to S3. After each run, it automatically updates the AWS Glue Data Catalog, so your data is immediately queryable in Athena.
After a single deployment, you can query your Salesforce data directly in Athena:
SELECT * FROM salesforce_export.account_raw LIMIT 10;
Why We Built This
At Supaflow, we already offer a 3–5× cost advantage over traditional Salesforce ETL tools by eliminating per‑row pricing and long‑running infrastructure.
But we also know a simple truth: not every team can justify adding another service, even if it’s cheaper than the alternatives.
Sometimes you just need:
- Raw Salesforce data in S3
- Something reliable and incremental
- No vendor lock‑in
- And no ongoing subscription
So we built this.
This project is a fully open‑source, self‑hosted Salesforce → S3 pipeline you can run for free (aside from nominal AWS infrastructure costs). It uses the same architectural principles we believe in at Supaflow — serverless execution, controlled rate‑limit handling, and schema‑aware exports — but without any managed service layer.
If you later need managed scheduling, monitoring, transformations, or bi‑directional syncs, Supaflow builds on these same foundations. But if all you need is a clean, cost‑effective export you fully control, this repo stands on its own.
What You Get
Incremental Sync
The pipeline uses SystemModstamp (falling back to LastModifiedDate or CreatedDate) to export only records changed since the previous run. The first run exports everything; subsequent runs are fast and incremental.
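The cursor-field fallback and incremental query construction can be sketched in Python. This is a hypothetical illustration, not the pipeline's actual code; `pick_cursor_field` and `build_query` are made-up helper names.

```python
# Hypothetical sketch of the incremental-sync cursor logic.
# The fallback order mirrors the standard Salesforce fields above.
CURSOR_FIELDS = ["SystemModstamp", "LastModifiedDate", "CreatedDate"]

def pick_cursor_field(object_fields):
    """Return the first available cursor field, or None for no-cursor objects."""
    for field in CURSOR_FIELDS:
        if field in object_fields:
            return field
    return None

def build_query(object_name, object_fields, last_cursor=None):
    """Build a Bulk API SOQL query; a full export when no cursor is stored yet."""
    soql = f"SELECT {', '.join(object_fields)} FROM {object_name}"
    cursor_field = pick_cursor_field(object_fields)
    if cursor_field and last_cursor:
        soql += f" WHERE {cursor_field} > {last_cursor}"
    return soql

print(build_query("Account", ["Id", "Name", "SystemModstamp"],
                  last_cursor="2025-01-02T00:00:00Z"))
# SELECT Id, Name, SystemModstamp FROM Account WHERE SystemModstamp > 2025-01-02T00:00:00Z
```

Objects without any of the three fields fall through to `None` and are handled by the full-refresh path described later.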
Parallel Processing
Jobs are submitted in parallel (up to 15 concurrent, matching Salesforce's limit) and results are fetched in parallel (4 concurrent). A full sync of 100+ objects completes in minutes, not hours.
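The fan-out can be sketched with Python's standard thread pool. `submit_job` is a stub standing in for the real Bulk API call; the 15-worker cap mirrors Salesforce's concurrent-job limit mentioned above.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

# Hypothetical sketch: submit Bulk API jobs with at most 15 in flight.
MAX_SUBMIT_CONCURRENCY = 15

def submit_job(object_name):
    # In the real pipeline this would create a Bulk API 2.0 query job;
    # here we just return a fake job id for illustration.
    return f"job-{object_name}"

def submit_all(object_names):
    """Submit one job per object, bounded to 15 concurrent submissions."""
    job_ids = {}
    with ThreadPoolExecutor(max_workers=MAX_SUBMIT_CONCURRENCY) as pool:
        futures = {pool.submit(submit_job, name): name for name in object_names}
        for fut in as_completed(futures):
            job_ids[futures[fut]] = fut.result()
    return job_ids

print(submit_all(["Account", "Contact", "Lead"]))
```

The same pattern with `max_workers=4` covers the result-fetching side.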
Rate Limit Handling
When Salesforce throttles you, the pipeline exits gracefully with a RATE_LIMITED status. Cursors aren't advanced, so the next scheduled run picks up exactly where this one left off.
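The key invariant is easy to show in a sketch: cursors for unfinished objects are never touched on a rate-limit exit. `RateLimited`, `run_export`, and `export_one` are hypothetical names for illustration only.

```python
# Hypothetical sketch of the RATE_LIMITED exit path.

class RateLimited(Exception):
    """Raised when Salesforce reports its request limit was exceeded."""

def run_export(objects, cursors, export_one):
    """export_one(name, cursor) -> new cursor value; may raise RateLimited."""
    for name in objects:
        try:
            cursors[name] = export_one(name, cursors.get(name))
        except RateLimited:
            # Cursors for unfinished objects are left unchanged, so the
            # next scheduled run resumes from exactly the same point.
            return {"status": "RATE_LIMITED", "cursors": cursors}
    return {"status": "COMPLETED", "cursors": cursors}

def demo_export(name, cursor):
    if name == "Lead":
        raise RateLimited()
    return "2025-01-03T00:00:00Z"

print(run_export(["Account", "Lead", "Contact"], {"Lead": "old"}, demo_export))
```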
Automatic Schema Discovery
After each successful export, the pipeline automatically updates the Glue Data Catalog, so your data is immediately queryable in Athena — no manual DDL, no schema maintenance. It will:
- Create tables for new objects
- Add partitions for each run
- Infer column types (boolean, integer, string, etc.)
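Type inference on CSV exports boils down to sampling string values per column. A minimal sketch of that idea (not the Glue Crawler's actual algorithm; `infer_type` is a made-up helper):

```python
# Hypothetical sketch: map sampled string values to a Glue-compatible type.

def infer_type(values):
    """Infer a column type from sampled string values of a CSV export."""
    non_null = [v for v in values if v not in ("", None)]
    if not non_null:
        return "string"          # nothing to go on; string is the safe default
    if all(v.lower() in ("true", "false") for v in non_null):
        return "boolean"
    try:
        for v in non_null:
            int(v)
        return "bigint"
    except ValueError:
        pass
    try:
        for v in non_null:
            float(v)
        return "double"
    except ValueError:
        return "string"

print(infer_type(["true", "false"]))  # boolean
print(infer_type(["1", "42", ""]))    # bigint
print(infer_type(["Acme Corp"]))      # string
```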
Point-in-Time Queries
Each run writes to a run=<timestamp> partition. This makes it easy to debug historical changes, reproduce past reports, or build slowly changing dimension (SCD) tables directly from raw exports. You can:
- Query a specific snapshot: WHERE run = '20250103T120000Z'
- Get the latest version of each record using ROW_NUMBER()
- Build SCD Type 2 tables from the partition history
Architecture
The solution uses a batch polling pattern with Step Functions:
AcquireLock
|
StartJobs (15 concurrent)
| Submit Bulk API jobs for all objects
v
PollLoop:
CheckJobsBatch
| Check status of all pending jobs
v
FetchCompleted (4 concurrent)
| Download results, upload to S3
v
Has pending? --> Wait 10s --> PollLoop
|
No
v
ReleaseLock --> StartGlueCrawler (update Glue tables + partitions)
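The poll loop above can be sketched in plain Python for readability. In the real pipeline each step is a separate short-lived Lambda orchestrated by Step Functions, and the 10-second wait is a Wait state rather than a sleep; `check_status` and `fetch_result` are hypothetical callbacks.

```python
import time

# Plain-Python sketch of the batch polling pattern shown in the diagram.

def poll_until_done(pending_jobs, check_status, fetch_result, wait_seconds=10):
    """check_status(job) -> 'JobComplete' | 'InProgress' | 'Failed'."""
    while pending_jobs:
        still_pending = []
        for job in pending_jobs:
            state = check_status(job)
            if state == "JobComplete":
                fetch_result(job)          # download result pages, upload to S3
            elif state == "InProgress":
                still_pending.append(job)  # re-check on the next loop
            # 'Failed' jobs are dropped here; the real pipeline records them
        pending_jobs = still_pending
        if pending_jobs:
            time.sleep(wait_seconds)       # Wait state in Step Functions
```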
Why Step Functions?
- Short-lived Lambdas: Each step runs for seconds, not minutes. No timeout issues.
- Visible state: The Step Functions console shows exactly where each run is.
- Built-in retries: Transient failures are automatically retried with backoff.
- Cost-effective: You pay for state transitions, not idle compute.
Quick Start
Prerequisites
- AWS CLI and SAM CLI installed
- Salesforce org with API access (Bulk API 2.0)
- Salesforce credentials (username, password, security token)
Deploy in a Few Commands
No AWS console clicks required.
# Clone the repository
git clone https://github.com/supaflow-labs/supaflow-salesforce-to-s3.git
cd supaflow-salesforce-to-s3
# Run the interactive installer
./install.sh
The installer will:
- Verify your AWS credentials
- Store Salesforce credentials in SSM Parameter Store
- Create an S3 bucket (or use an existing one)
- Deploy Lambda, Step Functions, DynamoDB, and EventBridge
- Optionally start a test execution
Start Your First Export
./scripts/start.sh
Check Status
./scripts/status.sh
Output:
Run: 20250103T042422Z
Status: COMPLETED
Started: 2025-01-03T04:24:22Z
Ended: 2025-01-03T04:31:15Z
Objects: 22/22 completed, 0 failed, 0 skipped
Data: 847,293 records, 156 pages, 124.5 MB
Glue Crawler: salesforce-export-crawler
State: READY
Last: SUCCEEDED
Changes: 22 tables created, 0 updated
Querying Your Data
After the Glue Crawler completes, tables appear in the salesforce_export database.
Basic Queries
-- List all tables
SHOW TABLES IN salesforce_export;
-- Query Account data
SELECT id, name, industry, annualrevenue
FROM salesforce_export.account_raw
LIMIT 100;
-- Count records per run
SELECT run, COUNT(*) as records
FROM salesforce_export.contact_raw
GROUP BY run
ORDER BY run DESC;
Get Latest Version of Each Record
Since each run creates a new partition, you may have multiple versions of the same record. Use ROW_NUMBER() to get the latest:
SELECT * FROM (
SELECT *,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY run DESC) as _row_num
FROM salesforce_export.account_raw
) WHERE _row_num = 1;
Query a Point-in-Time Snapshot
SELECT * FROM salesforce_export.opportunity_raw
WHERE run = '20250103T042422Z';
Operational Commands
# Start a new export
./scripts/start.sh
# Check status of recent runs
./scripts/status.sh
# Stop a running execution
./scripts/stop.sh
# Reset all state (forces full re-sync)
./scripts/reset.sh
# Completely uninstall
./scripts/uninstall.sh
Configuration Options
Most users can start with the defaults. The options below are only needed if you want to customize scheduling or sync behavior.
| Parameter | Default | Description |
|---|---|---|
| ScheduleExpression | rate(1 day) | How often to run (EventBridge syntax) |
| ObjectAllowlist | 20 core objects | Which objects to sync (empty = all) |
| MaxRecordsPerPage | 10000 | Bulk API page size |
| LookbackSeconds | 0 | Overlap window for incremental runs |
| FullRefreshIntervalMinutes | 1440 | How often to full-refresh no-cursor objects |
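To illustrate LookbackSeconds: the stored cursor can be shifted back by the overlap window so records modified just before the previous run's high-water mark are re-exported rather than missed. A sketch under that assumption, not the pipeline's actual code:

```python
from datetime import datetime, timedelta

# Hypothetical sketch: apply LookbackSeconds when storing the next cursor.

def next_cursor(high_water_mark: str, lookback_seconds: int) -> str:
    """Shift the high-water mark back by the configured overlap window."""
    ts = datetime.strptime(high_water_mark, "%Y-%m-%dT%H:%M:%SZ")
    return (ts - timedelta(seconds=lookback_seconds)).strftime("%Y-%m-%dT%H:%M:%SZ")

print(next_cursor("2025-01-03T04:31:15Z", 300))  # 2025-01-03T04:26:15Z
```

Since each run lands in its own partition, the duplicates this overlap creates are handled by the ROW_NUMBER() deduplication query shown earlier.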
Sync Specific Objects
sam deploy --parameter-overrides \
ObjectAllowlist="Account,Contact,Lead,Opportunity"
Sync All Objects
sam deploy --parameter-overrides ObjectAllowlist="ALL"
Default Objects
By default, only 20 core CRM objects are synced:
Account, Contact, Lead, Opportunity, Case, User, UserRole,
Profile, Group, GroupMember, PermissionSet, PermissionSetAssignment,
Product2, Pricebook2, PricebookEntry, Document, EmailTemplate,
Solution, Entitlement, BusinessHours
This keeps initial runs fast and avoids hitting API limits while testing.
Handling Objects Without an Update Timestamp
Some Salesforce objects don't have a reliable update timestamp such as SystemModstamp or LastModifiedDate. By default, these are full-refreshed every 24 hours.
Configure with FullRefreshIntervalMinutes:
- -1: Skip these objects entirely
- 0: Full refresh every run
- 1440 (default): Full refresh once per day
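The three settings reduce to a small decision function, sketched here for illustration (`should_full_refresh` is a made-up name, not part of the repo):

```python
# Hypothetical sketch of the FullRefreshIntervalMinutes decision for
# objects with no usable cursor field.

def should_full_refresh(minutes_since_last, interval_minutes):
    if interval_minutes == -1:
        return False                 # skip these objects entirely
    if interval_minutes == 0:
        return True                  # full refresh every run
    return minutes_since_last >= interval_minutes

print(should_full_refresh(1500, 1440))  # True: daily window has elapsed
print(should_full_refresh(60, 1440))    # False: refreshed recently
print(should_full_refresh(60, -1))      # False: object is skipped
```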
Security
- Credentials in SSM: Salesforce password and token stored as SecureStrings
- Encrypted S3: Server-side encryption enabled by default
- Scoped IAM: Lambda only has access to its own resources
- No public access: S3 bucket blocks all public access
Source Code
The complete solution is open-source:
GitHub: github.com/supaflow-labs/supaflow-salesforce-to-s3
Contributions welcome. MIT License.
Have a feature request? Open an issue on GitHub.
Get Started
Clone the repo and run ./install.sh. In about 10 minutes, you'll have raw Salesforce data in S3 and fully queryable tables in Athena -- without paying for an ETL tool.
Questions or feature requests? Open an issue on GitHub or email support@supa-flow.io.
Need More Than a Raw Export?
If you need managed scheduling, monitoring, dbt transformations, or reverse ETL back into Salesforce, Supaflow builds on these same foundations as a fully managed platform.
- Start your free trial -- 31 days, no credit card required
- Book a demo -- see Supaflow handle Salesforce ingestion, transformation, and activation in one platform
