Legacy Connector

This connector is superseded by the Amazon S3 Data Lake connector, which supports both Parquet and Apache Iceberg table formats. New pipelines should use S3 Data Lake. Existing S3 pipelines will continue to work.

Amazon S3 Destination

Load data into Amazon S3 as Parquet files with optional AWS Glue Data Catalog integration for querying with Athena, Spark, or other analytics tools.

Prerequisites

Before you begin, ensure you have:

  • An AWS account with IAM permissions to create roles and policies
  • An S3 bucket where Supaflow will write data
  • AWS Glue access (optional) for Data Catalog table metadata

AWS Setup

Supaflow uses cross-account IAM role assumption to securely access your S3 bucket. You will:

  1. Create IAM policies for S3 (required) and Glue (optional)
  2. Create an IAM role with a trust policy allowing Supaflow to assume it
  3. Attach the policies to the role
  4. Provide the role ARN and external ID in Supaflow

Step 1: Create S3 Permissions Policy

  1. Log in to the AWS Console → IAM → Policies → Create policy

  2. Click the JSON tab. Copy the following policy and paste it into the JSON editor:

S3 Permissions Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowBucketListForPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME>"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "<OPTIONAL_PREFIX>/*"
          ]
        }
      }
    },
    {
      "Sid": "AllowObjectReadWriteInPrefix",
      "Effect": "Allow",
      "Action": [
        "s3:DeleteObjectTagging",
        "s3:ReplicateObject",
        "s3:PutObject",
        "s3:GetObjectAcl",
        "s3:GetObject",
        "s3:DeleteObjectVersion",
        "s3:PutObjectTagging",
        "s3:DeleteObject",
        "s3:PutObjectAcl",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Resource": [
        "arn:aws:s3:::<YOUR_BUCKET_NAME>/<OPTIONAL_PREFIX>/*"
      ]
    }
  ]
}
  3. IMPORTANT: Modify these values:

    • Replace <YOUR_BUCKET_NAME> with your S3 bucket name
    • Replace <OPTIONAL_PREFIX> with a path prefix, or remove the Condition block entirely for full bucket access
  4. Click Next → Name the policy (e.g., SupaflowS3Policy) → Create policy

Prefix Restriction

The S3 policy includes an optional prefix condition. This restricts Supaflow to only write within a specific path (e.g., data/supaflow/*). Remove the entire Condition block from the first statement if you want to allow access to the entire bucket.


Step 2: Create Glue Permissions Policy (Optional)

If you want Supaflow to register tables in AWS Glue Data Catalog (recommended for querying with Athena):

  1. Navigate to IAM → Policies → Create policy

  2. Click the JSON tab. Copy the following policy and paste it into the JSON editor:

Glue Permissions Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowCatalogRead",
      "Effect": "Allow",
      "Action": [
        "glue:GetCatalogImportStatus"
      ],
      "Resource": [
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:catalog"
      ]
    },
    {
      "Sid": "AllowDbAndTableReadWriteForSupaflowPrefix",
      "Effect": "Allow",
      "Action": [
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:CreateDatabase",
        "glue:UpdateDatabase",

        "glue:GetTable",
        "glue:GetTables",
        "glue:CreateTable",
        "glue:UpdateTable",

        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:CreatePartition",
        "glue:BatchCreatePartition"
      ],
      "Resource": [
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:catalog",
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:database/supaflow*",
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:table/supaflow*/*"
      ]
    },
    {
      "Sid": "AllowResetOperationsForSupaflowPrefix",
      "Effect": "Allow",
      "Action": [
        "glue:DeleteTable",
        "glue:BatchDeleteTable",
        "glue:DeleteDatabase"
      ],
      "Resource": [
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:catalog",
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:database/supaflow*",
        "arn:aws:glue:<YOUR_REGION>:<YOUR_ACCOUNT_ID>:table/supaflow*/*"
      ]
    }
  ]
}
  3. IMPORTANT: Modify these values:

    • Replace <YOUR_REGION> with your AWS region (e.g., us-east-1)
    • Replace <YOUR_ACCOUNT_ID> with your 12-digit AWS account ID
    • (Optional) Replace supaflow* with your preferred database prefix
  4. Click Next → Name the policy (e.g., SupaflowGluePolicy) → Create policy

Glue Database Prefix

The policy uses supaflow* as the default database prefix. You can change this to any prefix you prefer (e.g., mycompany_*). In Supaflow configuration, use the literal prefix (for example, mycompany) without the *.


Step 3: Create IAM Role with Trust Policy

  1. Navigate to IAM → Roles → Create role

  2. Select Custom trust policy. Copy the following trust policy and paste it into the JSON editor:

Trust Policy
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::805595753828:root"
      },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": {
          "sts:ExternalId": "<YOUR_EXTERNAL_ID>"
        }
      }
    }
  ]
}
  3. IMPORTANT: Modify this value:

    • Replace <YOUR_EXTERNAL_ID> with a unique secret string (e.g., my-company-supaflow-2024)
    • This acts as a shared secret between your AWS account and Supaflow
    • Keep this value secure — you'll enter it in Supaflow configuration
  4. Click Next

  5. Attach the policies you created:

    • Search for and select SupaflowS3Policy
    • Search for and select SupaflowGluePolicy (if created)
  6. Click Next

  7. Role name: Enter a descriptive name (e.g., SupaflowS3Access)

  8. Click Create role


Step 4: Copy Role ARN

  1. Find the role you just created and click on it

  2. Copy the Role ARN from the summary page

    • Example: arn:aws:iam::123456789012:role/SupaflowS3Access
  3. You'll need this ARN and your external ID for the Supaflow configuration


Configuration

In Supaflow, create a new S3 destination with these settings:

Connection

S3 Bucket Name*

Name of the S3 bucket
Example: my-data-lake

Bucket Region*

AWS region of the bucket
Example: us-east-1

S3 Path Prefix

Optional path prefix for all files. Leave empty to write to bucket root.
Example: data-lake/production


Authentication

IAM Role ARN*

ARN of the IAM role to assume
Example: arn:aws:iam::123456789012:role/SupaflowS3Role

External ID*

External ID for role assumption (security measure to prevent confused deputy problem). Must match the value in your trust policy.
This value is encrypted and stored securely


AWS Glue

Sync Schema with AWS Glue Catalog

Automatically create/update Glue tables and partitions
Default: Disabled

Glue Database Prefix

Prefix for Glue database names. Databases will be created as prefix_schema. Must match the prefix in your Glue IAM policy.
Example: supaflow
Default: supaflow


Advanced Settings

Partition Strategy

Time-based partitioning strategy for organizing files
Options: NONE, DAY, HOUR, SYNC_ID
Default: HOUR

File Name Pattern

Pattern for file names. Supports variables: {seq}, {uuid}, {timestamp}, {supa_job_id}, {date}
Default: part-{supa_job_id}-{seq}

Upload Parallelism

Number of concurrent S3 uploads
Range: 1-10
Default: 5

Custom S3 Endpoint

Custom S3-compatible endpoint URL for MinIO, DigitalOcean Spaces, etc. Leave empty for AWS S3.
Example: https://s3.example.com


Test & Save

After configuring all required properties, click Test & Save to verify your connection and save the destination.


S3 Path Layout and Glue Catalog Naming

S3 Object Key Format and File Naming

Supaflow writes data to S3 using the following path layout:

s3://{bucket}/{prefix}/{schema}/{table}/{partition}/{file}.gzip.parquet
Component | Description | Example
--------- | ----------- | -------
bucket | Your S3 bucket name | my-data-lake
prefix | Optional S3 path prefix (configurable) | data/production
schema | Database/schema name (see below) | supaflow_postgres_public
table | Normalized table name | accounts
partition | Time-based partition (based on strategy) | 2024/01/15/10
file | Parquet file with GZIP compression (fixed, not configurable) | part-abc123-00000.gzip.parquet

Example paths:

s3://my-bucket/data/supaflow_postgres_public/accounts/2024/01/15/10/part-abc123-00000.gzip.parquet
s3://my-bucket/data/supaflow_salesforce/contact/2024/01/15/part-def456-00000.gzip.parquet

Example folder layout (multiple connectors and tables under one bucket/prefix):

s3://my-bucket/
└── data/
    ├── supaflow_postgres_public/
    │   ├── tenants/
    │   │   └── 2025/
    │   │       └── 12/
    │   │           └── 31/
    │   │               └── 00/
    │   │                   └── part-<supa_job_id>-00000.gzip.parquet
    │   └── accounts/
    │       └── 2025/
    │           └── 12/
    │               └── 31/
    │                   └── 00/
    │                       └── part-<supa_job_id>-00001.gzip.parquet
    ├── supaflow_salesforce/
    │   └── contact/
    │       └── 2025/
    │           └── 12/
    │               └── 31/
    │                   └── 00/
    │                       └── part-<supa_job_id>-00000.gzip.parquet
    └── supaflow_s3/
        └── ext_contacts/
            └── 2025/
                └── 12/
                    └── 31/
                        └── 00/
                            └── part-<supa_job_id>-00000.gzip.parquet
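
The path layout above can be sketched as a simple join of non-empty components. This is an illustrative sketch, not Supaflow's actual code; the helper name `build_object_key` is hypothetical.

```python
# Illustrative sketch: how the documented path components combine into a
# full S3 object key. Empty components (e.g. no prefix, or the NONE
# partition strategy) are skipped so the key has no empty segments.
def build_object_key(prefix, schema, table, partition, file_name):
    parts = [p for p in (prefix, schema, table, partition, file_name) if p]
    return "/".join(parts)

key = build_object_key(
    prefix="data",
    schema="supaflow_postgres_public",
    table="accounts",
    partition="2024/01/15/10",
    file_name="part-abc123-00000.gzip.parquet",
)
# → data/supaflow_postgres_public/accounts/2024/01/15/10/part-abc123-00000.gzip.parquet
```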

File name pattern

Supports variables: {seq}, {uuid}, {timestamp}, {supa_job_id}, {date}. Default: part-{supa_job_id}-{seq}. Compression is fixed to GZIP and cannot be changed.
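
A sketch of how these pattern variables might expand. The zero-padding width and the exact formats of {timestamp} and {date} are assumptions here, inferred from the example file names elsewhere on this page; only the variable names themselves come from the documentation.

```python
import uuid
from datetime import datetime, timezone

# Illustrative sketch of file-name pattern expansion; formats for
# {timestamp} and {date} are assumptions, not Supaflow's documented behavior.
def render_file_name(pattern, job_id, seq, now=None):
    now = now or datetime.now(timezone.utc)
    values = {
        "seq": f"{seq:05d}",                      # zero-padded sequence number
        "uuid": str(uuid.uuid4()),                # random per-file UUID
        "timestamp": now.strftime("%Y%m%dT%H%M%S"),
        "supa_job_id": job_id,
        "date": now.strftime("%Y-%m-%d"),
    }
    name = pattern
    for key, value in values.items():
        name = name.replace("{" + key + "}", value)
    return name + ".gzip.parquet"                 # compression suffix is fixed

print(render_file_name("part-{supa_job_id}-{seq}", job_id="abc123", seq=0))
# → part-abc123-00000.gzip.parquet
```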

Append-Only Write Semantics and Safety Guarantees

Supaflow writes to Amazon S3 using strict append-only semantics:

  • Supaflow never overwrites or mutates existing S3 objects.
  • Each pipeline run produces new Parquet files with unique file names.
  • By default, uniqueness is guaranteed by {supa_job_id} (a UUID). If you customize the file name pattern, keep a unique token (for example {supa_job_id} or {uuid}).

Partitions are time-based and deterministic:

  • Runs that fall in the same partition window (for example, the same hour when using HOUR partitioning) write multiple files into the same partition directory.
  • This is expected and safe because uniqueness is enforced at the file level, not the partition level.

First-run safety check (optional):

  • When Destination Table Handling is set to FAIL, Supaflow verifies the destination table path (bucket + prefix + schema + table) is empty on the initial run.
  • If any existing files are detected under that table path, the run fails to prevent mixing Supaflow-managed data with pre-existing datasets.

After the initial run succeeds, Supaflow appends new immutable files on subsequent runs, including additional files in existing partitions.

Partition Strategies

Choose a partition strategy based on your query patterns:

Strategy | Path Format | Best For
-------- | ----------- | --------
HOUR (default) | yyyy/MM/dd/HH | High-frequency syncs, hourly analysis
DAY | yyyy/MM/dd | Daily syncs, daily reporting
SYNC_ID | {supa_job_id} | Full refresh pipelines, point-in-time queries
NONE | (no partition) | Small datasets, infrequent updates

Partitions are reused across runs that fall within the same partition window; uniqueness is enforced at the file level, not the partition level.
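
The partition directory each strategy produces can be sketched as follows; the format strings are taken directly from the table above, and the helper name is hypothetical.

```python
from datetime import datetime, timezone

# Illustrative sketch of the partition path each strategy produces.
def partition_path(strategy, run_time, job_id):
    if strategy == "HOUR":
        return run_time.strftime("%Y/%m/%d/%H")   # yyyy/MM/dd/HH
    if strategy == "DAY":
        return run_time.strftime("%Y/%m/%d")      # yyyy/MM/dd
    if strategy == "SYNC_ID":
        return job_id                             # one partition per sync
    if strategy == "NONE":
        return ""                                 # no partition directory
    raise ValueError(f"unknown strategy: {strategy}")

t = datetime(2024, 1, 15, 10, 30, tzinfo=timezone.utc)
print(partition_path("HOUR", t, "abc123"))  # → 2024/01/15/10
print(partition_path("DAY", t, "abc123"))   # → 2024/01/15
```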

Glue Data Catalog Database and Table Naming

The schema name (which becomes the Glue database name when Glue is enabled) is constructed from multiple components:

{glue_prefix}_{destination_prefix}_{source_schema} — empty components, and their separators, are omitted
Component | Source | Default | Example
--------- | ------ | ------- | -------
Glue Prefix | Connector config | supaflow | supaflow
Destination Prefix | Pipeline config (see Step 2: Configure Pipeline) | Empty unless set (UI often sets to connector type) | postgres, salesforce
Source Schema | Source database | Varies by source | public, dbo

Examples:

Source | Destination Prefix | Source Schema | Glue Prefix | Resulting Schema
------ | ------------------ | ------------- | ----------- | ----------------
PostgreSQL | postgres | public | supaflow | supaflow_postgres_public
Salesforce | salesforce | (none) | supaflow | supaflow_salesforce
PostgreSQL | (blank) | public | supaflow | supaflow_public
PostgreSQL | (blank) | (blank) | supaflow | supaflow
PostgreSQL | postgres | public | (blank) | postgres_public

Schema Segment Behavior
  • If the schema resolves to blank and Glue is disabled, the schema segment is omitted from the S3 path
  • If Glue is enabled with a prefix, the schema becomes at minimum the Glue prefix
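
The naming rule from the examples above boils down to joining the non-empty components with underscores. A minimal sketch (the function name `resolve_schema` is hypothetical):

```python
# Illustrative sketch of the Glue database / schema-segment naming rule:
# non-empty components are joined with underscores, per the examples above.
def resolve_schema(glue_prefix, destination_prefix, source_schema):
    parts = [p for p in (glue_prefix, destination_prefix, source_schema) if p]
    return "_".join(parts)

assert resolve_schema("supaflow", "postgres", "public") == "supaflow_postgres_public"
assert resolve_schema("supaflow", "salesforce", "") == "supaflow_salesforce"
assert resolve_schema("supaflow", "", "public") == "supaflow_public"
assert resolve_schema("", "postgres", "public") == "postgres_public"
```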

Tracking Columns

Supaflow automatically adds tracking columns to every table for data lineage and deduplication:

Column | Type | Description
------ | ---- | -----------
_supa_synced | timestamp | When the record was synced (UTC)
_supa_deleted | boolean | Soft delete marker from source
_supa_index | bigint | Record position within the sync batch (1-based)
_supa_id | string | SHA256 hash of primary key fields (or unique/all fields if no PK)
_supa_job_id | string | Unique job/sync identifier
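
To illustrate the _supa_id column: the table above says it is a SHA-256 hash of the primary key fields, but the exact serialization Supaflow applies before hashing is not documented, so the separator choice below is an assumption.

```python
import hashlib

# Illustrative sketch only: _supa_id is documented as a SHA-256 hash of
# the primary key fields; the serialization used here (unit-separator-joined
# string values) is an assumption, not Supaflow's actual scheme.
def supa_id(pk_values):
    # Join key values with a separator unlikely to appear in the data,
    # so composite keys like (1, 2) and (12,) hash differently.
    serialized = "\x1f".join(str(v) for v in pk_values)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()

print(supa_id([42]))       # 64-character hex digest, stable per key
```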

Table and Column Normalization

Table and column names are normalized for Parquet/Glue compatibility:

  • ASCII only: Non-ASCII characters (日本語, Δ, etc.) → underscore
  • Accents dropped: crème → creme
  • Valid characters: Letters (a-z, A-Z), digits (0-9), underscores only
  • Leading digit: Prepend underscore (123abc → _123abc)
  • Spaces/special chars: Converted to underscore
  • Multiple underscores: Collapsed to single underscore
  • Max length: 255 characters (Glue limit)

Collision handling: If normalization creates duplicates, numeric suffixes are added (name, name_1, name_2).
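
The rules above (minus collision handling) can be sketched as a single function. This is an illustrative re-implementation; Supaflow's actual normalizer may differ in edge cases.

```python
import re
import unicodedata

# Illustrative re-implementation of the normalization rules listed above.
def normalize_name(name):
    # Accents dropped: decompose, then strip combining marks (crème -> creme).
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Spaces, special chars, and remaining non-ASCII -> underscore.
    cleaned = re.sub(r"[^A-Za-z0-9_]", "_", stripped)
    # Multiple underscores collapsed to one.
    cleaned = re.sub(r"_+", "_", cleaned)
    # Leading digit: prepend underscore.
    if cleaned and cleaned[0].isdigit():
        cleaned = "_" + cleaned
    return cleaned[:255]  # Glue's 255-character limit

assert normalize_name("crème brûlée") == "creme_brulee"
assert normalize_name("123abc") == "_123abc"
assert normalize_name("日本語") == "_"
```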


Querying with Athena

If Glue Catalog is enabled, you can query your data directly with Amazon Athena:

By default, Athena queries return all data written for the table (initial + incremental) across all partitions and files. To query a specific sync, filter on _supa_job_id. To get a "current" view, deduplicate by _supa_id ordered by _supa_synced.

-- Query latest sync
SELECT *
FROM supaflow_postgres_public.accounts
WHERE _supa_job_id = 'latest-job-id'
LIMIT 100;

-- Find all synced records for a table
SELECT _supa_job_id, COUNT(*) as record_count, MIN(_supa_synced) as sync_time
FROM supaflow_postgres_public.accounts
GROUP BY _supa_job_id
ORDER BY sync_time DESC;

-- Deduplicate using _supa_id (latest version of each record)
SELECT * FROM (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY _supa_id ORDER BY _supa_synced DESC) AS rn
  FROM supaflow_postgres_public.accounts
)
WHERE rn = 1 AND _supa_deleted = false;

Troubleshooting

Access Denied when assuming role

Problem:

  • "AccessDenied" error when testing connection
  • "User is not authorized to perform sts:AssumeRole"

Solutions:

  1. Verify the trust policy:
    • Ensure Principal is exactly arn:aws:iam::805595753828:root
    • Check External ID matches what you entered in Supaflow
  2. Check role ARN:
    • Ensure you copied the complete ARN
    • Verify the role exists in your account
  3. Wait a few minutes:
    • IAM changes can take a few minutes to propagate

Access Denied on S3 operations

Problem:

  • "AccessDenied" when writing to S3
  • "Bucket does not exist" error

Solutions:

  1. Verify bucket name:
    • Check for typos in the bucket name
    • Ensure bucket exists in the specified region
  2. Check S3 policy:
    • Verify <YOUR_BUCKET_NAME> was replaced correctly
    • If using prefix restriction, ensure the prefix matches your configuration
  3. Verify bucket permissions:
    • Check if bucket has a bucket policy that might deny access
    • Ensure bucket is not configured for Requester Pays

Access Denied on Glue operations

Problem:

  • "AccessDeniedException" when creating Glue tables
  • Tables not appearing in Glue catalog

Solutions:

  1. Verify Glue policy is attached:
    • Check the role has the Glue permissions policy
  2. Check region and account ID:
    • Ensure <YOUR_REGION> matches your bucket region
    • Verify <YOUR_ACCOUNT_ID> is your 12-digit AWS account ID
  3. Verify database prefix:
    • Ensure your Glue policy prefix matches your Supaflow destination configuration
    • Default is supaflow* but you can use any prefix

Invalid External ID

Problem:

  • "The security token included in the request is invalid"
  • External ID mismatch error

Solutions:

  1. Check for typos:
    • External ID is case-sensitive
    • Check for leading/trailing spaces
  2. Verify trust policy:
    • Open the role in AWS Console
    • Click "Trust relationships" tab
    • Confirm the External ID matches exactly

Connection timeout

Problem:

  • Connection test times out
  • Slow response from AWS

Solutions:

  1. Check AWS service status:
    • Check the AWS Health Dashboard for issues affecting S3 or STS in your region
  2. Verify region:
    • Ensure the region in Supaflow matches your bucket's region
  3. Try again:
    • Temporary network issues may cause timeouts

Support

Need help? Contact us at support@supa-flow.io