MongoDB Source
Sync collections from MongoDB Atlas or self-managed MongoDB clusters into your warehouse. A single source can discover and sync multiple databases at once.
For an overview of capabilities and use cases, see the MongoDB connector page.
At a Glance
- Document Packing Mode. By default the connector keeps embedded documents and arrays as JSON values on the parent row (`PACKED`). Switch to `UNPACKED` to expand embedded documents into `parent__leaf` columns and promote arrays of objects into related child outputs.
- Per-object incremental cursor. During schema discovery, datetime root fields on each collection are flagged as cursor candidates; you choose the cursor in the wizard.
- Sample-based schema. MongoDB collections are schemaless, so the connector samples documents per collection to infer field types.
Prerequisites
Before you begin, ensure you have:
- A reachable MongoDB cluster -- MongoDB Atlas or self-managed MongoDB
- A MongoDB connection string for the cluster
- A read-only MongoDB user with `read` access to the databases you want to sync
Supported Objects
Supaflow discovers collections dynamically at sync time -- you do not list them up front.
- Top-level collections become root objects. They land in the destination as `database.collection`, so collections of the same name in different databases stay distinct.
- Nested document structures (embedded documents and arrays) follow your Document Packing Mode setting (see Configuration). In `PACKED` mode they remain as JSON values on the parent row -- one row per Mongo document, no child tables. In `UNPACKED` mode embedded documents flatten into `parent__leaf` columns and arrays of objects become related child outputs (e.g., `sales.orders__line_items`).
- System databases -- `admin`, `local`, and `config` -- are skipped by default. System collections (`system.*`) are skipped during implicit discovery.
The connector default is `PACKED`. If you want flattened columns (`parent__leaf`) and arrays of objects surfaced as related child outputs, set Document Packing Mode to `UNPACKED` in the Advanced Settings of your MongoDB source.
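To make the two modes concrete, here is a minimal Python sketch of the packing behaviour described above. The `pack`/`unpack` helpers and the sample document are illustrative only; the connector's actual implementation and type handling may differ.

```python
import json

def pack(doc):
    """PACKED: embedded documents and arrays stay as JSON values on the parent row."""
    return {key: json.dumps(value) if isinstance(value, (dict, list)) else value
            for key, value in doc.items()}

def unpack(doc, prefix=""):
    """UNPACKED: flatten embedded documents into parent__leaf columns and
    split arrays of objects out as related child outputs."""
    row, children = {}, {}
    for key, value in doc.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            sub_row, sub_children = unpack(value, prefix=f"{name}__")
            row.update(sub_row)
            children.update(sub_children)
        elif isinstance(value, list) and value and all(isinstance(v, dict) for v in value):
            children[name] = value  # becomes a child output, e.g. orders__line_items
        else:
            row[name] = value
    return row, children

doc = {"_id": 1, "customer": {"name": "Ada"}, "line_items": [{"sku": "A"}, {"sku": "B"}]}
print(pack(doc))
print(unpack(doc))
```

Either way the parent keeps one row per Mongo document; the modes differ only in how nested structures are represented.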
Incremental Sync
MongoDB collections are schemaless, so cursors are not declared up front. After schema discovery, every datetime-typed root field on a collection is offered as a cursor candidate. In the pipeline wizard you select one cursor per collection, and Supaflow only fetches documents whose cursor field has advanced since the last run.
Collections without any datetime root field stay in full-refresh mode. Once documents in those collections gain a datetime root field, Supaflow surfaces it as a cursor candidate on the next schema refresh.
Cursor state is persisted per object, so different collections can use different cursor fields and operate on independent windows.
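Conceptually, each incremental run turns the persisted per-object cursor state into a range query on the chosen field. A minimal sketch, assuming an illustrative state shape (the real persistence format is internal to Supaflow):

```python
from datetime import datetime, timezone

# Illustrative per-object cursor state; each collection keeps its own field and window.
cursor_state = {
    "sales.orders": {"field": "updated_at", "last": datetime(2024, 5, 1, tzinfo=timezone.utc)},
    "crm.contacts": {"field": "modified", "last": datetime(2024, 4, 20, tzinfo=timezone.utc)},
}

def incremental_filter(obj):
    """Build the find() filter that fetches only documents whose cursor field advanced."""
    state = cursor_state.get(obj)
    if state is None:
        return {}  # no cursor selected: full refresh
    return {state["field"]: {"$gt": state["last"]}}

print(incremental_filter("sales.orders"))
print(incremental_filter("sales.events"))  # no cursor state: empty filter, full refresh
```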
Authentication
Supaflow connects with a MongoDB connection string. Two patterns are supported:
- Credentials in the URI -- e.g., `mongodb+srv://user:pass@cluster0.example.mongodb.net/`. Useful for the simplest setups and copy-paste from MongoDB Atlas.
- Split credentials -- the URI carries the host and TLS options, and the Username, Password, and Authentication Database are entered in dedicated fields. Recommended; passwords are stored encrypted and are not embedded in the saved connection URL or in the connector's displayed configuration.
If credentials appear in both places the connector fails fast rather than silently choosing one.
Permissions
The MongoDB user needs:
- `read` on every database the connector should discover
- Either cluster-wide `listDatabases` (when you want implicit discovery of all readable databases) or per-database `listCollections` if you supply explicit Database Filters
For long-term stability, use a dedicated read-only service user rather than an individual's credentials.
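Such a user can be created in `mongosh`; a sketch with placeholder names, using MongoDB's built-in roles (run it against your authentication database, usually `admin`):

```javascript
// Run in mongosh against the authentication database (usually "admin").
// The user name is a placeholder; passwordPrompt() keeps the password out of shell history.
db.createUser({
  user: "supaflow_reader",
  pwd: passwordPrompt(),
  roles: [
    // read access to every non-system database (implicit discovery)
    { role: "readAnyDatabase", db: "admin" }
    // ...or scope it down per database when you use explicit Database Filters:
    // { role: "read", db: "sales" }, { role: "read", db: "crm" }
  ]
})
```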
Configuration
In Supaflow, create a new MongoDB source with these settings.
Authentication
MongoDB Connection String (required)
The MongoDB URI for your cluster or deployment. Supports replica sets, TLS, read preference, direct-connection, and other standard MongoDB URI options.
Examples:
- `mongodb+srv://cluster0.example.mongodb.net/`
- `mongodb://host1:27017,host2:27017/?replicaSet=rs0`
Stored encrypted
Username
Optional MongoDB username. Use this with Password to keep credentials out of the URI.
Password
Optional MongoDB password for the supplied Username.
Stored encrypted
Authentication Database
Database that stores the supplied user. Most Atlas deployments authenticate against `admin`.
Default: admin
If your URI already specifies an `authSource` query option, that value wins.
Sync Settings
Database Filters
Comma- or newline-separated list of database names to discover. Leave blank to discover all readable non-system databases.
Example: `sales, crm`
Collection Filters
Comma- or newline-separated list of fully qualified `database.collection` names to sync. Leave blank to sync every collection in the selected database scope.
Example: `sales.orders, sales.users, crm.contacts`
Bare collection names (without a database) are not accepted. Two databases can contain the same collection name, so qualifying with the database is required to avoid ambiguity.
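The accepted filter syntax can be mimicked with a few lines of Python; `parse_collection_filters` is a hypothetical helper showing the split-and-qualify rule, not the connector's code:

```python
import re

def parse_collection_filters(raw):
    """Split a comma- or newline-separated filter list and enforce
    fully qualified database.collection entries."""
    entries = [entry.strip() for entry in re.split(r"[,\n]", raw) if entry.strip()]
    for entry in entries:
        if "." not in entry:
            raise ValueError(f"unqualified collection filter: {entry!r}")
    return entries

print(parse_collection_filters("sales.orders, sales.users\ncrm.contacts"))
```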
Advanced Settings
Document Packing Mode
Controls how nested document structures land in your destination.
Options:
- `PACKED` (default) -- Embedded documents and arrays stay as JSON values on the parent row. No child tables, no flattened `parent__leaf` columns. Use this when your dbt or warehouse model already expects raw documents.
- `UNPACKED` -- Embedded documents flatten into `parent__leaf` scalar columns; arrays of objects promote into related child outputs (e.g., `sales.orders__line_items`).
Default: `PACKED`. Changing this on an existing source restructures the destination on the next sync; coordinate with downstream consumers before switching.
Schema Sample Size
Maximum number of documents sampled per collection during schema discovery. Higher values improve field-type coverage on collections with sparse fields, at the cost of slower discovery.
Default: 1000. Min: 10. Max: 100000.
Seconds to roll the cursor back on each incremental run. Useful when writes can arrive slightly out of order on the cursor field.
Default: 0 (no lookback)
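The lookback simply shifts the effective cursor back before the incremental range query is built; a sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta, timezone

def effective_cursor(last_seen, lookback_seconds):
    """Roll the persisted cursor back so late-arriving writes are re-read."""
    return last_seen - timedelta(seconds=lookback_seconds)

last = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
print(effective_cursor(last, 300))  # re-reads the last 5 minutes of the previous window
```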
Schema Refresh Interval
How often Supaflow re-samples schemas before running the pipeline.
Options:
- 0 -- Refresh before every pipeline execution. Recommended because MongoDB schemas are inferred from sampled documents.
- -1 -- Disable refresh. Use only when you know your document shapes are stable.
- Positive value -- Refresh interval in minutes (e.g., 60 = hourly, 1440 = daily).
Default: 0
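The three option values reduce to a small decision rule; `should_refresh` below is an illustrative reading of the setting, not Supaflow internals:

```python
def should_refresh(interval_minutes, minutes_since_last_refresh):
    """Interpret the Schema Refresh Interval setting for one pipeline execution."""
    if interval_minutes == 0:
        return True   # refresh before every execution
    if interval_minutes == -1:
        return False  # refresh disabled
    return minutes_since_last_refresh >= interval_minutes

print(should_refresh(60, 90))  # hourly interval, 90 minutes elapsed -> refresh
```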
Test & Save
After configuring the connection, click Test & Save. Supaflow validates the URI and credentials with a lightweight ping, then runs schema discovery on the next step.
Schema Evolution
Schemas are re-discovered on the cadence set by Schema Refresh Interval.
- New top-level fields appear in subsequent syncs once schema refresh runs.
- New repeated nested arrays can appear as new child objects on the next refresh when Document Packing Mode is `UNPACKED`. Under the default `PACKED` mode, the new array stays on the parent row as a JSON column.
- Removed fields stop populating new rows; the column remains in the destination.
- New collections added to a discovered database appear on the next refresh.
- New databases added to the cluster are picked up on the next refresh when implicit discovery is in use, or once you list them in Database Filters.
Because MongoDB types are inferred from sampled documents, increasing Schema Sample Size can stabilize types on collections with rare-but-typed fields.
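Sample-based inference means a field's type coverage depends entirely on which documents the sample catches. A toy sketch of the idea (not the connector's actual inference logic):

```python
from collections import defaultdict

def infer_schema(sampled_docs):
    """Map each root field seen in the sample to the set of observed types.
    Fields absent from every sampled document are simply never discovered."""
    seen = defaultdict(set)
    for doc in sampled_docs:
        for field, value in doc.items():
            seen[field].add(type(value).__name__)
    return dict(seen)

sample = [{"_id": 1, "name": "Ada"}, {"_id": 2, "name": "Lin", "vip": True}]
print(infer_schema(sample))
```

A rare field like `vip` above is only typed because the sample happened to include a document carrying it, which is why a larger Schema Sample Size stabilizes types.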
Performance and Source Load
The connector reads with a single MongoDB cursor per collection and filters on the selected cursor field for incremental runs.
- Use Database Filters and Collection Filters to narrow scope on large clusters.
- Off-peak scheduling is the simplest way to reduce contention on shared MongoDB deployments.
- For very large historical loads, select a cursor field as soon as possible so subsequent runs only fetch deltas.
MongoDB applies its own server-side limits on connection counts and operation rate. The MongoDB connector itself does not implement connector-level rate-limit retry; very large or throttled syncs may need a narrower scope (Database / Collection Filters) or off-peak scheduling. For scheduled production syncs we recommend a dedicated read-only user so connection caps and slow queries do not affect your application traffic. See MongoDB's operational guidance for cluster-side tuning.
Troubleshooting
Authentication failed
Problem: MongoDB authentication failed on Test & Save.
Solutions:
- Confirm the user exists in the Authentication Database (most Atlas deployments use `admin`).
- If the URI carries credentials, do not also fill in Username and Password -- the connector rejects credentials supplied in both places.
- URL-encode special characters (`:`, `/`, `@`, `?`) in passwords when embedding them in the URI, or move the password to the Password field.
- For Atlas, confirm the IP address that runs Supaflow is allowed by the Atlas network access list.
Connection refused or DNS errors
Problem: MongoDB connection failed mentioning DNS, timeouts, or refused connections.
Solutions:
- Verify the URI by connecting from `mongosh` against the same string.
- For SRV URIs (`mongodb+srv://...`), DNS resolution requires a working DNS path; corporate VPNs sometimes block it.
- For self-managed clusters behind a firewall, allow the IP address that runs Supaflow on port 27017 (or your custom port).
A database is missing from discovery
Problem: A database the user can read does not appear in the source.
Solutions:
- Confirm the user has `listDatabases` cluster-wide, or list the database explicitly in Database Filters (the connector then bypasses cluster-wide listing for that database).
- System databases (`admin`, `local`, `config`) are skipped by default.
A collection is missing from discovery
Problem: A collection in a discovered database does not appear.
Solutions:
- Confirm the user has `listCollections` on that database.
- Check that the collection has at least one document if you expect type-rich field discovery; empty collections still appear, but only with `_id` until they have content.
- If you used Collection Filters, confirm the entry is fully qualified (`database.collection`) and the spelling matches.
Object names look different from before
Problem: Previously the destination had tables named after bare collections (e.g., `orders`), and now they are named after `database.collection`.
Solutions:
- This is intentional. Database-qualified names prevent collisions when the same collection name exists in multiple databases.
- If you depend on the old destination layout, point downstream models at the new fully qualified table names. Supaflow's destination layer derives the destination identifier from the qualified name.
A nested structure I expected as a child is missing
Problem: A nested array in your documents did not produce a related child output.
Solutions:
- Check Document Packing Mode. The default is `PACKED`, which keeps nested arrays as JSON columns on the parent row instead of producing child outputs. Switch to `UNPACKED` if you want arrays of objects to materialise as child tables.
- In `UNPACKED` mode, confirm the field is consistently an array of objects in the sampled documents -- single embedded objects are typically flattened, not promoted to a child.
- Increase Schema Sample Size so sparse arrays are more likely to appear in the sample.
- Re-run schema discovery after adding the structure to documents.
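As a quick local check before re-running discovery, you can apply the same "consistently an array of objects" test to a handful of your own documents; `promotes_to_child` is a hypothetical predicate mirroring the rule described above:

```python
def promotes_to_child(samples, field):
    """True when the field is, wherever present, an array of objects
    (arrays of scalars and single embedded objects flatten instead)."""
    values = [doc[field] for doc in samples if field in doc]
    return bool(values) and all(
        isinstance(v, list) and all(isinstance(item, dict) for item in v)
        for v in values
    )

docs = [{"tags": ["a", "b"]}, {"items": [{"sku": "A"}]}, {"items": [{"sku": "B"}]}]
print(promotes_to_child(docs, "items"), promotes_to_child(docs, "tags"))
```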
Support
Need help? Contact us at support@supa-flow.io