The situation
The warehouse is consuming from five SaaS sources today.
- Salesforce. Thirty objects (Account, Contact, Opportunity, Lead, Case, and custom objects). Full-refresh daily for most; incremental on the big three (Account, Contact, Opportunity).
- Zendesk. Fifteen objects (tickets, users, organisations, groups, views). Incremental by `updated_at` every hour.
- Marketo. Ten objects (leads, activities, programs). Daily full-refresh of the small ones, hourly incremental of activities.
- Google Analytics 4. A daily extract of session-level data.
- Slack. Channels and messages for a few specific channels into a security pipeline.
The current state: five separate jobs, three languages between them (historical accidents), one on EC2 cron, two as Lambda on EventBridge Scheduler, two as Glue Python-shell jobs. Every time a SaaS vendor rotates an API version, someone has to fix a connector. Every time a new source lands, there’s a month of setup work. The engineers who keep this running don’t love it and the cost/benefit isn’t improving.
We want ingestion to be declarative: state the source, state the target, state the schedule, state the transformation, and let something else handle auth renewal, rate-limiting, pagination, incremental cursors, and error retries. If a managed ingestion service covers the sources we have, we stop maintaining five bespoke pipelines; if it doesn’t, we find out quickly and reach elsewhere.
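As a rough illustration of what “declarative” could look like, the sketch below describes pipelines as data rather than code. The `FlowSpec` class, its field names, and the bucket paths are hypothetical, not any tool’s schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowSpec:
    """Hypothetical declarative description of one ingestion pipeline."""
    source: str                 # e.g. "salesforce/Account"
    target: str                 # e.g. "s3://warehouse/salesforce/account/"
    schedule: str               # e.g. "hourly" or "daily"
    mode: str = "incremental"   # or "full-refresh"
    transforms: tuple = ()      # ordered names of mapping/masking steps


# A few of today's extracts restated as data (schedules are illustrative).
SPECS = [
    FlowSpec("salesforce/Account", "s3://warehouse/salesforce/account/", "daily"),
    FlowSpec("zendesk/tickets", "s3://warehouse/zendesk/tickets/", "hourly"),
    FlowSpec("marketo/programs", "s3://warehouse/marketo/programs/", "daily",
             mode="full-refresh"),
]

# Anything that needs to know "what runs hourly?" reads the specs, not the code.
hourly = [s.source for s in SPECS if s.schedule == "hourly"]
print(hourly)
```

Everything not in the spec (auth renewal, pagination, retries) is what we want the ingestion layer to own.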
What actually matters
Before reaching for a managed connector, it’s worth being clear about what “SaaS ingestion” actually involves.
A SaaS connector has to solve six annoying problems.
Authentication. OAuth flows (Salesforce, Google, HubSpot), API keys (Zendesk, some older connectors), mutual TLS (enterprise integrations). Every vendor has its own token-refresh semantics, failure modes, and per-tenant scoping. Rolling your own means owning that layer.
Pagination. REST APIs return results in pages; some by next_token, some by offset, some by cursor timestamp. Handling pagination end-to-end, including page-boundary errors and retry without duplication, is where a lot of bespoke-connector bugs live.
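A minimal sketch of token-based pagination with retry-without-duplication, assuming a hypothetical `fetch_page(token)` callable that returns a dict with `records` and an optional `next_token`. The fake API below exists only to exercise the loop.

```python
from typing import Callable, Iterator, Optional


def paginate(fetch_page: Callable[[Optional[str]], dict],
             max_attempts: int = 3) -> Iterator[dict]:
    """Yield every record once, following next_token until it is absent."""
    token: Optional[str] = None
    while True:
        for attempt in range(1, max_attempts + 1):
            try:
                page = fetch_page(token)   # a retry re-requests the SAME token
                break
            except ConnectionError:
                if attempt == max_attempts:
                    raise
        # Records are emitted only after the page arrived in full,
        # so a retried page cannot produce duplicates.
        yield from page["records"]
        token = page.get("next_token")
        if token is None:
            return


# Fake two-page API that fails once at the page boundary.
PAGES = {None: {"records": [{"id": 1}, {"id": 2}], "next_token": "p2"},
         "p2": {"records": [{"id": 3}]}}
failures = {"p2": 1}


def fake_fetch(token):
    if failures.get(token, 0) > 0:
        failures[token] -= 1
        raise ConnectionError("transient failure at page boundary")
    return PAGES[token]


ids = [r["id"] for r in paginate(fake_fetch)]
print(ids)
```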
Rate limiting. SaaS APIs throttle aggressively. A good connector backs off on 429s, respects Retry-After headers, and paces requests. A bad one hammers and gets blocked.
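The behaviour of a “good connector” can be sketched roughly as below. The `send` callable and its `(status, headers, body)` return shape are assumptions for illustration; only the seconds form of `Retry-After` is handled, not the HTTP-date form.

```python
import time
from typing import Callable, Mapping, Tuple

Response = Tuple[int, Mapping[str, str], object]


def call_with_backoff(send: Callable[[], Response],
                      max_attempts: int = 5,
                      base_delay: float = 1.0,
                      sleep: Callable[[float], None] = time.sleep) -> object:
    """Call send() until it stops returning 429, pacing between attempts."""
    for attempt in range(max_attempts):
        status, headers, body = send()
        if status != 429:
            return body
        retry_after = headers.get("Retry-After")
        if retry_after is not None and retry_after.isdigit():
            delay = float(retry_after)            # the server's hint wins
        else:
            delay = base_delay * (2 ** attempt)   # otherwise back off exponentially
        sleep(delay)
    raise RuntimeError(f"still throttled after {max_attempts} attempts")


# Scripted responses: throttled twice, then success.
responses = iter([(429, {"Retry-After": "7"}, None),
                  (429, {}, None),
                  (200, {}, {"ok": True})])
delays = []
result = call_with_backoff(lambda: next(responses), sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the pacing logic testable without actually waiting.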
Incremental extraction. “Since last successful run” requires tracking a cursor (usually a timestamp) between runs, handling ordering guarantees the SaaS API may or may not provide, and dealing with deleted records (which often aren’t reported at all).
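One common way to cope with weak ordering guarantees is to re-read a small overlap window behind the cursor and let the target upsert by id. The sketch below assumes that approach; the function names, the five-minute lookback, and the id-diff delete check are illustrative choices, not a prescribed method.

```python
from datetime import datetime, timedelta


def incremental_batch(source_rows, cursor, lookback=timedelta(minutes=5)):
    """Return rows changed since the cursor, plus the next cursor value."""
    # Re-read a small overlap so late or out-of-order updates are not missed.
    # Rows in the overlap may be delivered twice; the target must upsert by id.
    since = cursor - lookback
    batch = [r for r in source_rows if r["updated_at"] > since]
    newest = max((r["updated_at"] for r in batch), default=cursor)
    return batch, max(newest, cursor)   # the cursor never moves backwards


def detect_deletes(known_ids, current_ids):
    """Deletes rarely appear in incremental pulls; diff against a full id listing."""
    return set(known_ids) - set(current_ids)


rows = [
    {"id": "a", "updated_at": datetime(2027, 1, 1, 10, 0)},
    {"id": "b", "updated_at": datetime(2027, 1, 1, 10, 58)},   # inside the overlap
    {"id": "c", "updated_at": datetime(2027, 1, 1, 11, 30)},
]
batch, new_cursor = incremental_batch(rows, cursor=datetime(2027, 1, 1, 11, 0))
deleted = detect_deletes({"a", "b", "c"}, {"a", "c"})
print([r["id"] for r in batch], new_cursor, deleted)
```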
Schema handling. SaaS objects evolve. A new field appears in Salesforce; the extract has to either include it, ignore it cleanly, or fail in a controlled way. Type mappings (strings vs numbers vs datetimes) need to be stable across runs.
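The three options (include, ignore, fail) can be made an explicit policy. A minimal sketch, assuming a hypothetical `expected` mapping of field name to a casting function; the field names in the demo are made up.

```python
def reconcile(record, expected, on_new="ignore"):
    """Coerce known fields to stable types; treat unknown fields by policy."""
    unknown = set(record) - set(expected)
    if unknown and on_new == "fail":
        raise ValueError(f"unexpected fields: {sorted(unknown)}")
    out = {}
    for name, caster in expected.items():
        value = record.get(name)
        out[name] = None if value is None else caster(value)   # same type every run
    if on_new == "include":
        for name in sorted(unknown):
            out[name] = record[name]
    return out


EXPECTED = {"Id": str, "Amount": float}
incoming = {"Id": "006A", "Amount": "1200.50", "NewField__c": "x"}

ignored = reconcile(incoming, EXPECTED)                     # drops NewField__c
included = reconcile(incoming, EXPECTED, on_new="include")  # passes it through
print(ignored, included)
```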
Transformation in flight. Useful to map source field names to warehouse conventions, mask PII at source-crossing time, filter out test records, enrich with lookup data, compute derived columns.
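A small sketch of rename, mask, and filter applied to one record. It assumes masking means a keyed deterministic hash (HMAC-SHA256 here, as one possible choice); the hard-coded key is for demonstration only and would come from a secret store in practice.

```python
import hashlib
import hmac


def transform(record, rename, mask_fields, key, keep=lambda r: True):
    """Filter, mask, and rename one record; returns None if filtered out."""
    if not keep(record):
        return None
    out = {}
    for name, value in record.items():
        if name in mask_fields and value is not None:
            # Keyed hash: same input gives the same token, so joins still work.
            value = hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()
        out[rename.get(name, name)] = value
    return out


DEMO_KEY = b"demo-only-key"   # assumption: real keys are never hard-coded


def not_test(r):
    return r["StageName"] not in ("Test", "Internal")


row = {"AccountId": "001A", "Amount": 1200.5, "StageName": "Closed Won"}
clean = transform(row, {"Amount": "opportunity_amount_usd"}, {"AccountId"},
                  DEMO_KEY, keep=not_test)
dropped = transform({**row, "StageName": "Test"}, {}, set(), DEMO_KEY, keep=not_test)
print(clean, dropped)
```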
A managed ingestion tool earns its place by solving all six consistently across connectors, so the team doesn’t re-solve each one per source.
What we’ll filter on
Distilling into filters we can score each option against:
- Source coverage (does the tool support the SaaS applications we actually use?)
- Authentication and credential management (OAuth refresh, API key rotation, tenant-per-flow scoping, without user code?)
- Incremental extraction (cursor tracking, ordering, deletion handling?)
- Target surface (where can the extract land without an intermediate step?)
- Transformation in flight (field mapping, masking, filtering, derivation?)
The SaaS-ingestion landscape
1. Amazon AppFlow. A managed, no-code ingestion service for SaaS sources. 60+ source connectors (Salesforce, Zendesk, Marketo, Google Analytics, ServiceNow, Slack, SAP OData, Jira, many more) and a smaller list of targets (S3, Redshift, Snowflake, EventBridge, Lookout for Metrics, Honeycode). A flow is the unit of work: one source object, one target, optional filters, optional field-level transformations, a schedule (on-demand, event-triggered, or scheduled). AppFlow handles auth (including OAuth refresh), pagination, rate-limit backoff, and incremental cursors. Flows write in CSV, JSON, or Parquet, with Glue catalog registration optional.
2. Glue custom Python jobs. The incumbent. Full control; full operational burden. Right answer when the source isn’t supported by AppFlow or the extract logic has business-specific weirdness that doesn’t fit AppFlow’s filter model.
3. Third-party ingestion tools (Fivetran, Stitch, Airbyte). Commercial (or open-source with managed offerings) ingestion platforms with much wider connector libraries, often thousands of sources. Strong on the long tail of SaaS sources AppFlow doesn’t cover. Cost model is per-source-row or per-seat; integration with AWS is through S3/Redshift/Snowflake targets.
4. Custom connectors on Lambda + EventBridge Scheduler. If the source is a REST API without an AppFlow connector, a Lambda-on-schedule pattern is the usual fallback: Lambda calls the API, pages through, writes to S3, updates a cursor in DynamoDB. Works. Operational overhead per source.
5. Kinesis Data Streams ingestion + Firehose. For push-based SaaS integrations (webhooks), Kinesis with an authenticated HTTP frontend is a better fit than AppFlow, which is pull-based.
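The Lambda-on-schedule fallback in option 4 can be sketched as one function per invocation. This is an outline under assumptions: `fetch_page`, `write_batch`, and `cursor_store` are hypothetical injected dependencies, with a plain dict standing in for the DynamoDB cursor item and a list standing in for S3.

```python
def run_extract(fetch_page, write_batch, cursor_store, source="example-api"):
    """One scheduled run: page through changes, write, then advance the cursor."""
    cursor = cursor_store[source]
    records, token = [], None
    while True:
        page = fetch_page(cursor, token)
        records.extend(page["records"])
        token = page.get("next_token")
        if token is None:
            break
    if records:
        write_batch(source, records)   # raises on failure, so the cursor stays put
        cursor_store[source] = max(r["updated_at"] for r in records)
    return len(records)


# In-memory stand-ins for the API, S3, and DynamoDB.
ROWS = [{"id": 1, "updated_at": "2027-01-01T01:00:00Z"},
        {"id": 2, "updated_at": "2027-01-01T02:00:00Z"}]


def fake_fetch(cursor, token):
    new = [r for r in ROWS if r["updated_at"] > cursor]
    if token is None and len(new) > 1:
        return {"records": new[:1], "next_token": "p2"}
    return {"records": new[1:] if token else new}


store = {"example-api": "2027-01-01T00:00:00Z"}
sink = []
first = run_extract(fake_fetch, lambda s, recs: sink.extend(recs), store)
second = run_extract(fake_fetch, lambda s, recs: sink.extend(recs), store)
print(first, second, store)
```

The per-source overhead shows up in everything this sketch leaves out: auth, throttling, alerting, and deployment.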
Side by side
| Option | Source coverage | Auth handling | Incremental extraction | Target surface | In-flight transform |
|---|---|---|---|---|---|
| AppFlow | 60+ managed connectors | Managed, OAuth refresh included | ✓ (cursor per flow) | S3, Redshift, Snowflake, EventBridge, Lookout | Map, mask, filter, partition, validate |
| Glue custom Python | Anything | User-coded | User-coded | Anything | User-coded |
| Fivetran / Stitch / Airbyte | 100s to 1000s | Managed | ✓ | S3, Redshift, Snowflake, many | Vendor-specific |
| Lambda + Scheduler | Anything (REST) | User-coded | User-coded (DynamoDB cursor) | Anything | User-coded |
| Kinesis + Firehose | Webhook push | User-coded ingress | N/A (push model) | S3, Redshift, OpenSearch | Firehose transform Lambda |
Reading by source shape:
- Source is in AppFlow’s list (Salesforce, Zendesk, Marketo, GA4, ServiceNow, SAP, Slack, Jira, dozens more): AppFlow is the correct answer. Managed auth, managed incremental, managed retries.
- Source isn’t in AppFlow’s list and never will be (custom internal API, niche vendor): Lambda + Scheduler or a third-party tool. Don’t bend AppFlow.
- Push-based webhook source: Kinesis or API Gateway + Lambda, not AppFlow. AppFlow is pull.
- Long-tail connector with very specific needs: third-party vendor with the correct connector is often cheaper than bespoke code, if the row volume doesn’t make per-row pricing painful.
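The reading above amounts to a small decision rule, sketched here. The function name and flags are hypothetical, and `APPFLOW_SOURCES` is only the handful of sources named in this article, not the real connector list.

```python
# Partial, illustrative set: just the sources named in this article.
APPFLOW_SOURCES = {"salesforce", "zendesk", "marketo", "ga4",
                   "servicenow", "sap", "slack", "jira"}


def choose_ingestion(source, push_based=False, vendor_has_connector=False):
    """Pick an ingestion option from the shape of the source."""
    if push_based:
        return "Kinesis or API Gateway + Lambda"   # AppFlow is pull
    if source.lower() in APPFLOW_SOURCES:
        return "AppFlow"
    if vendor_has_connector:
        return "third-party tool"                  # if per-row pricing is tolerable
    return "Lambda + Scheduler"


print(choose_ingestion("Zendesk"))
print(choose_ingestion("internal-billing-api"))
```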
AppFlow in depth
Connectors and connections. A connector is AWS’s code for talking to a specific SaaS service (Salesforce connector, Zendesk connector, etc.). A connection is an authenticated instance of a connector tied to your tenant. You create a connection to Salesforce once (OAuth flow in the console or via API), and flows reference it. Connections handle token refresh; when the OAuth token expires, AppFlow refreshes it automatically behind the flow.
For connectors not in the built-in list, AppFlow supports custom connectors via the AppFlow Custom Connector SDK (Python or Java). You build a connector against its interface, deploy it as a Lambda, register it with AppFlow, and it behaves like a built-in from that point forward. Worth doing only for sources you’ll reuse many times; otherwise Lambda + Scheduler is cheaper.
Flows. A flow configures one source-to-target transfer. Fields:
- Source: connection + object (e.g. `Salesforce/Account`).
- Destination: target service + location (e.g. `s3://warehouse/salesforce/account/`).
- Flow trigger: Run on demand (manually or via API), Run on schedule (cron-style), or Run on event (Salesforce platform events, Zendesk events; real-time push from the source).
- Data mapping: field-by-field mapping from source to target schema. Rename, type-coerce, concat, truncate, mask. A “map all fields directly” option copies everything without mapping if that’s what you want.
- Filters: source-side WHERE-like clauses, like “only objects where `Status = 'Active'`” or “only records updated after `2027-01-01`”. Evaluated at the source if the SaaS API supports it (more efficient), at AppFlow if not.
- Validation: data quality checks on records (skip records where a required field is null, fail the flow if more than N records fail validation, etc.).
- Partition and aggregation: for S3 targets, partition by source field (common: date), aggregate into fewer larger files vs one file per record.
Incremental handling. For supported source objects, AppFlow tracks a cursor between runs (typically the source’s updated_at or SystemModstamp). Each run pulls records updated since the last successful run’s high-water-mark. The first run is a full extract; subsequent runs are incremental. AppFlow handles the “what if a run fails” case by not advancing the cursor until the target write succeeds.
Write behaviour to S3. Flows can write in CSV, JSON Lines, or Parquet. Output can be partitioned (by a source field, like CreatedDate) and either aggregated (multiple records per file, preferred for downstream performance) or record-per-file. Optional Glue catalog registration: AppFlow creates/updates a Glue table for the flow’s target, so Athena can query immediately.
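To make “partitioned and aggregated” concrete, the sketch below groups records by a date field into one batch per partition path. The Hive-style `year=/month=/day=` layout is an assumption for illustration and is not claimed to be AppFlow’s actual key naming.

```python
from collections import defaultdict
from datetime import date


def partition_path(prefix, d):
    """Build a year/month/day partition path under the given prefix."""
    return f"{prefix.rstrip('/')}/year={d:%Y}/month={d:%m}/day={d:%d}/"


def group_by_partition(records, field, prefix):
    """Aggregate records so each partition gets one batch (one file) per run."""
    groups = defaultdict(list)
    for r in records:
        groups[partition_path(prefix, r[field])].append(r)
    return dict(groups)


records = [
    {"Id": "a", "CreatedDate": date(2027, 1, 5)},
    {"Id": "b", "CreatedDate": date(2027, 1, 5)},
    {"Id": "c", "CreatedDate": date(2027, 1, 6)},
]
groups = group_by_partition(records, "CreatedDate", "salesforce/account/")
for path, batch in sorted(groups.items()):
    print(path, len(batch))
```

Record-per-file would produce three objects here; aggregation produces two, which is the shape downstream query engines prefer.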
Error handling. Failed records can be routed to an error S3 location; flow-level errors can notify via EventBridge and SNS. For transient API failures, AppFlow retries with exponential backoff; for permanent errors (auth revoked, schema incompatibility), the flow fails and the run is marked failed in the console.
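The transient-versus-permanent split, and the routing of failed records, can be sketched as follows. This models the behaviour described above rather than AppFlow’s internals; the exception classes and function names are hypothetical.

```python
import time


class TransientError(Exception):
    """E.g. a timeout or 5xx from the source API."""


class PermanentError(Exception):
    """E.g. revoked credentials or an incompatible schema."""


def run_flow(step, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Return 'succeeded' or 'failed'; only transient errors are retried."""
    for attempt in range(max_attempts):
        try:
            step()
            return "succeeded"
        except TransientError:
            if attempt < max_attempts - 1:
                sleep(base_delay * 2 ** attempt)   # exponential backoff
        except PermanentError:
            return "failed"                        # retrying cannot help
    return "failed"


def route(records, is_valid):
    """Split records into (good, bad); bad ones go to the error location."""
    good = [r for r in records if is_valid(r)]
    bad = [r for r in records if not is_valid(r)]
    return good, bad


# A step that fails transiently twice, then succeeds.
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] <= 2:
        raise TransientError("timeout")


delays = []
status = run_flow(flaky, sleep=delays.append)
good, bad = route([{"AccountId": "001A"}, {"AccountId": None}],
                  lambda r: r["AccountId"] is not None)
print(status, delays, len(good), len(bad))
```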
The flow catalog, visualised
A worked flow: Salesforce Opportunity to S3 Parquet
Flow configuration:
- Source: `Salesforce / Opportunity`, connection `prod-sf`.
- Destination: `S3`, bucket `acme-warehouse`, prefix `salesforce/opportunity/`, format Parquet + Snappy, partition by source field `CloseDate` (year/month/day), aggregated (one file per partition per run).
- Schedule: every hour at :15 past.
- Incremental cursor: `SystemModstamp`, tracked automatically. First run extracts everything; later runs extract only records with `SystemModstamp` greater than the last high-water-mark.
- Field mapping: 22 fields, source-to-target 1:1 with type coercion (Salesforce `decimal(16, 2)` to `double`, Salesforce `datetime` to `timestamp`), one rename (`Amount` → `opportunity_amount_usd`), one mask (`AccountId` → hashed through a deterministic function keyed by an SSM secret).
- Filter: `IsDeleted = false AND StageName NOT IN ('Test', 'Internal')`, evaluated at Salesforce (via SOQL) because that’s cheaper than filtering client-side.
- Validation: fail the flow if more than 1% of records have null `AccountId`.
- Glue catalog: register as `warehouse.salesforce_opportunity`, update partitions on each run.
The flow runs hourly. AppFlow refreshes OAuth tokens as needed; handles pagination across Salesforce’s bulk-extract API; retries on 429s; advances the cursor only on success. When the flow fails (auth revoked, schema drift, validation threshold breach), an EventBridge event fires; a rule routes to an SNS topic to page the data-platform on-call.
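For teams that prefer flows in code over the console, part of this configuration might be expressed as a request for boto3’s `create_flow`. The fragment below is a sketch: the key names follow my reading of the CreateFlow API and should be checked against the API reference, the flow name and schedule expression are assumptions, and the required `tasks` list (mappings, filter, mask, validation) is deliberately left out.

```python
# Request fragment shaped like the keyword arguments of boto3's appflow
# create_flow(). ASSUMPTION: key names and enum values follow my reading of
# the CreateFlow API; verify them before use. `tasks` is omitted on purpose.
request = {
    "flowName": "sf-opportunity-to-s3",                  # hypothetical name
    "triggerConfig": {
        "triggerType": "Scheduled",
        "triggerProperties": {
            "Scheduled": {
                "scheduleExpression": "rate(1hours)",    # assumed expression syntax
                "dataPullMode": "Incremental",
            }
        },
    },
    "sourceFlowConfig": {
        "connectorType": "Salesforce",
        "connectorProfileName": "prod-sf",
        "sourceConnectorProperties": {"Salesforce": {"object": "Opportunity"}},
        "incrementalPullConfig": {"datetimeTypeFieldName": "SystemModstamp"},
    },
    "destinationFlowConfigList": [{
        "connectorType": "S3",
        "destinationConnectorProperties": {
            "S3": {
                "bucketName": "acme-warehouse",
                "bucketPrefix": "salesforce/opportunity",
                "s3OutputFormatConfig": {
                    "fileType": "PARQUET",
                    "aggregationConfig": {"aggregationType": "SingleFile"},
                },
            }
        },
    }],
}

# With credentials and a real `tasks` list, the call would look roughly like:
#   boto3.client("appflow").create_flow(**request, tasks=[...])
print(request["sourceFlowConfig"]["incrementalPullConfig"])
```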
What’s worth remembering
- AppFlow is the default for supported SaaS sources. Managed auth, managed pagination, managed rate limits, managed incremental cursors. If the source is on the connector list, writing a bespoke Python job is undifferentiated toil.
- Flows are per-object, not per-source. A Salesforce tenant with 30 objects is 30 flows (or however many you actually need in the warehouse). Connections are created once per tenant; flows reference them.
- Incremental requires a cursor field and a shape AppFlow understands. Works for Salesforce `SystemModstamp`, Zendesk `updated_at`, and similar. For exotic sources, might need full-refresh or fall back to a custom job.
- Targets are narrower than sources. S3 (with Parquet/CSV/JSON, partitioning, Glue registration), Redshift, Snowflake, EventBridge, Lookout for Metrics. Not every warehouse target is native; bridge via S3 if needed.
- Event-triggered flows enable real-time from supported sources. Salesforce platform events and similar push to AppFlow, which delivers to EventBridge or S3 without scheduled polling. Major latency win for sources that support it.
- Custom connectors cover sources AppFlow doesn’t. Build once via the SDK, deploy as Lambda, reuse across flows. Worth it only if the source is used repeatedly; otherwise, Lambda + Scheduler is cheaper.
- Third-party tools (Fivetran, Stitch, Airbyte) fill the long-tail gap. When the connector count matters more than AWS-nativeness, pay for their library. Integrate via S3 or Redshift.
- AppFlow isn’t a fit for everything. Push webhooks need API Gateway + Lambda or Kinesis. Deep-business-logic extracts still need custom code. Don’t bend AppFlow into shapes it resists; use it for the shape it was designed for.
AppFlow is the managed middle layer between “another Python job” and “a full ingestion vendor”. For the ninety percent of SaaS ingestion that’s auth + pagination + incremental + write-to-S3, it removes the maintenance tax. The remaining ten percent still lives in custom code, and that’s fine. The point is to stop writing the ninety every time.