How to Govern Glue Catalogs Across Four Accounts

December 08, 2027 · 17 min read

Data Engineer · DEA-C01 · part of The Exam Room

The situation

We have four accounts. Finance runs Redshift with its own Glue catalog. Product analytics runs a lakehouse on S3 with a second Glue catalog. Marketing has a third catalog feeding QuickSight. Engineering has a fourth, messy one that’s mostly development scratch. Each account grew organically; each has its own naming conventions, its own permissions model, and its own idea of who owns what.

Four specific problems have surfaced.

  • Discoverability. An analyst in marketing wants “the table with customer retention by cohort”. Nobody knows which catalog, which table, or which account. The answer is one Slack thread, three DMs, and a calendar invite. The data is findable; the knowledge of its findability isn’t.
  • Ownership. When a column’s meaning changes, nobody knows who to notify or who should approve. The ownership spreadsheet was last updated eight months ago. Two of the three named owners have left.
  • Access approval. The finance analyst needed read access to a product table. The path was: ask in Slack, get pointed at an IAM engineer, raise a Jira ticket, wait four days for the policy to be attached. The analyst gave up and asked a colleague to export the data to a CSV, which she then loaded into her account. The governance got worse, not better.
  • Cross-account use without cross-account sprawl. Four catalogs means four sets of Lake Formation permissions, four sets of IAM roles, four sets of cross-account RAM shares to maintain. The glue between accounts is handwritten and fragile.

The theme is that the metadata system isn’t integrated. Each catalog is coherent in its own account; the layer above them is people and spreadsheets.

What actually matters

Before reaching for a service, it’s worth asking what “governance layer” actually means on top of an existing set of Glue catalogs.

Four capabilities, roughly.

A unified discovery surface. A single place an analyst can search for “customer retention”, see what exists across every producing account, read its description, understand who owns it, and decide whether to request access, without knowing which account it lives in or which catalog lists it.

Explicit ownership and stewardship. Each dataset has a registered owner (or team) and a steward responsible for metadata quality. That ownership is enforced at publication time (you can’t publish a dataset without owners) and visible at discovery time. When the owner leaves, re-assignment is a workflow, not a spreadsheet edit nobody does.

Self-service access requests with approval workflow. A consumer sees a dataset they want. They click “subscribe”, fill in a justification, and the request routes to the dataset’s owner for approval. Approval materialises the underlying Lake Formation grant (or RAM share, or IAM permission) without the consumer or the owner touching IAM. Rejection records a reason; approval records an audit trail.

A project abstraction that isolates consumers. Analysts don’t work directly in the producing account; they work in a “project” that has its own compute environment, its own pre-authorised access to subscribed datasets, and its own observability. A marketing analytics project can subscribe to customer data without marketing analysts getting access to the customer-producing account itself.

Those four capabilities, taken together, describe a data-mesh-style governance layer. The alternative (wiring cross-account shares by hand, maintaining ownership in a spreadsheet, running access requests through Jira) is exactly what we have, and it isn’t working.

The question is which option in the landscape covers the four capabilities for the team we have, and which would be overhead at the team’s size.

What we’ll filter on

Distilling into filters we can score each option against:

  1. Federated catalog discovery: can consumers search across multiple producing accounts from one place?
  2. Ownership and stewardship as data: are owners part of the metadata, not an external spreadsheet?
  3. Self-service subscription workflow: does a consumer request and an owner approve, without either touching IAM?
  4. Isolated project environments: do consumers get compute that’s pre-wired to subscribed data, separate from the producing account?
  5. Integration with existing Glue catalogs and Lake Formation: does the governance layer sit above what we have, or demand replacement?

The governance landscape

  1. Amazon DataZone. A governance layer designed for the data-mesh shape. A domain is the top-level container (typically one per organisation or per business unit). Inside the domain live projects (collaboration spaces with their own members, compute, and catalog view) and a business data catalog (the federated searchable surface). Data producers register their Glue tables with the domain, attach business metadata, and publish them to the catalog. Consumers in other projects subscribe; approvals flow to the producer; approved subscriptions materialise as Lake Formation grants and RAM shares under the hood. DataZone itself doesn’t store the data; it orchestrates metadata and access across existing Glue catalogs.

  2. AWS Glue Data Catalog (per-account). The default catalog. Strong as a technical metadata store (schema, partitions, statistics, location) but thin on governance. No native ownership field, no native request/approval workflow, no federated search across accounts. Cross-account access is possible through Lake Formation and RAM, configured per-share. The thing we have today, and the reason we have the problem.

  3. Lake Formation cross-account sharing. Named resource shares and tag-based shares via RAM. Solves the mechanics of granting cross-account access; doesn’t solve the discovery, ownership, or workflow problems. Complements DataZone rather than replacing it. DataZone materialises its subscription approvals through Lake Formation.

  4. Third-party catalog platforms (Collibra, Alation, Atlan, etc.). Mature governance suites with deep feature sets (lineage, data dictionary, glossary workflows, stewardship, business glossary mapping). Usually integrate with Glue Catalog and Lake Formation via connectors. Stronger on organisational rollout features; another vendor and another integration to manage.

  5. Purpose-built in-house solution. Build the workflow on Lambda + DynamoDB + SNS, query Glue catalogs via API, store ownership in a table, orchestrate approvals through Step Functions. Fine for small teams; substantial ongoing investment as the organisation grows. Usually the wrong answer when a managed service fits.

Side by side

Option | Federated discovery | Ownership as data | Self-service subscription | Isolated projects | Layers on Glue / LF
Amazon DataZone | ✓ (across domain) | ✓ (owner, steward) | ✓ (built-in workflow) | ✓ (project envs) | ✓ (orchestrates both)
Glue Catalog alone | Per-account only | | | | N/A
Lake Formation sharing | | | Manual | | ✓ (mechanism only)
Third-party catalog | ✓ (via connectors) | | Partial | | ✓ (connectors)
In-house build | As-built | As-built | As-built | As-built | As-built

Reading by use case:

  • Organisation with three-plus producing accounts and growing consumer base. DataZone. The pain we have (discovery, ownership, self-service access) is exactly what it was designed for.
  • Single account, under fifty datasets, stable small team. Plain Glue Catalog plus a one-page ownership markdown file is fine. DataZone would be overhead.
  • Already standardised on Collibra or Alation. Extend what you have. DataZone isn’t a differentiator big enough to replace a working governance platform.
  • Data engineering team has the headcount and taste for it. An in-house build can fit, but the operational cost compounds fast. Only pick this if the shape genuinely doesn’t fit a managed service.

The DataZone mental model

DataZone domain: acme-data

  Producer projects
    • finance-producer: Redshift + Glue (account 111…), owner: finance-data@
    • product-producer: S3 lakehouse + Glue (account 222…), owner: product-data@
    • marketing-producer: QuickSight-backed Glue (account 333…), owner: marketing-data@

  Business data catalog
    • Published asset: finance.accounts_receivable (owner, steward, glossary terms, schema)
    • Published asset: product.user_events (metadata form: pii=true, retention=365)
    • Published asset: product.cohorts_daily (glossary: "cohort", "retention")
    • Business glossary (terms): Customer · Retention · Revenue · Cohort

  Consumer projects
    • growth-analytics: Athena environment (account 444…), subscribed: user_events, cohorts_daily
    • finance-reporting: Redshift Spectrum env (account 555…), subscribed: accounts_receivable
    • ml-retention: SageMaker env (account 666…), subscribed: cohorts_daily

  Underlying AWS services (DataZone orchestrates, doesn't replace)
    • Glue Data Catalog: tables + partitions
    • Lake Formation: row/column grants
    • AWS RAM: cross-account shares
    • IAM: principals + roles
    • Query engines: Athena · Redshift · EMR · SageMaker

Domain at the top, producer projects publishing assets into the catalog, consumer projects subscribing, Lake Formation and Glue and RAM underneath materialising the grants. DataZone is the orchestration; the services underneath are the plumbing.

DataZone in depth

Domain. The top-level container. One per organisation or large business unit. Domains are regional resources; a single organisation can have one domain per Region if data residency demands it, or a single domain that federates globally. Domains own the business glossary, the metadata forms (reusable templates for dataset metadata), and the list of projects.

Projects. A collaboration space inside a domain. Each project has members (human users mapped to IAM Identity Center identities), environments (more on these in a moment), and a catalog view scoped to what the project has published or subscribed to. A project can be a producer, a consumer, or both. The project boundary is where access is enforced (“this project has access to these datasets”), which is what makes membership and permissions coherent without needing to touch IAM directly.

Environments. Compute configurations inside a project. DataZone ships blueprints for the common shapes: Data Lake (Athena + Lake Formation), Data Warehouse (Redshift), SageMaker (notebooks + training). Creating an environment provisions the backing resources in a linked AWS account (the “environment account”) and registers them with the project. A project can have multiple environments: for example, one Athena environment for ad-hoc analysis and one Redshift environment for reporting. Subscriptions approved for the project materialise as grants across all its environments automatically.

Assets and asset types. An asset is a catalog entry for a dataset: a pointer to a Glue table, a Redshift table, an S3 path, or a custom source. Asset types are the schemas (the expected metadata for each kind of asset); DataZone ships built-in types for Glue tables, Redshift tables, and S3 object collections, and you can define custom types. Publishing an asset attaches metadata forms, glossary terms, and owner/steward assignments, then makes it searchable in the catalog.
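The containment described so far (a domain holds projects, projects hold environments and publish or subscribe to assets) can be sketched as plain data. This is a toy model for reasoning about the shape, not DataZone’s API: every class, field, and label below is hypothetical, and only the relationships and the example names come from the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Asset:
    """A catalog entry pointing at an existing dataset; the data stays where it is."""
    name: str   # e.g. "product.cohorts_daily"
    kind: str   # illustrative label such as "glue-table", not an official type name


@dataclass
class Environment:
    """Compute attached to a project, provisioned from a blueprint."""
    name: str
    blueprint: str  # illustrative label such as "data-lake" or "data-warehouse"


@dataclass
class Project:
    name: str
    members: List[str] = field(default_factory=list)
    environments: List[Environment] = field(default_factory=list)
    published: List[Asset] = field(default_factory=list)
    subscribed: List[Asset] = field(default_factory=list)

    def roles(self) -> List[str]:
        """Producer and consumer are roles a project plays; it can play both."""
        roles = []
        if self.published:
            roles.append("producer")
        if self.subscribed:
            roles.append("consumer")
        return roles

    def catalog_view(self) -> List[str]:
        """A project sees only what it has published or subscribed to."""
        return sorted({a.name for a in self.published + self.subscribed})


@dataclass
class Domain:
    name: str
    projects: List[Project] = field(default_factory=list)

    def publisher_of(self, asset_name: str) -> Optional[str]:
        """Return the name of the project that published the asset, if any."""
        for project in self.projects:
            if any(a.name == asset_name for a in project.published):
                return project.name
        return None


cohorts = Asset("product.cohorts_daily", "glue-table")
producer = Project("product-producer", published=[cohorts])
consumer = Project(
    "growth-analytics",
    environments=[Environment("adhoc", "data-lake")],
    subscribed=[cohorts],
)
domain = Domain("acme-data", [producer, consumer])
print(producer.roles(), consumer.roles(), domain.publisher_of("product.cohorts_daily"))
```

The point of the model is that an asset appears once, in the producer’s published list, and consumers only ever hold a reference to it.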

Metadata forms. Reusable templates of fields that must be filled in before an asset can be published. A “PII Disclosure” form with fields contains_pii (boolean), pii_categories (multi-select), retention_days (number), deletion_contact (email) can be attached to any asset type; publication is blocked until the form is completed. Forms are the mechanism for enforcing organisational metadata standards.
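To make the “publication is blocked until the form is completed” rule concrete, here is a minimal sketch of that gate in plain Python. It is not how DataZone implements forms; the FormField class and both functions are hypothetical, and only the four field names come from the PII Disclosure example above.

```python
from dataclasses import dataclass
from typing import Any, Dict, List


@dataclass(frozen=True)
class FormField:
    name: str
    kind: type
    required: bool = True


# Fields taken from the "PII Disclosure" example in the text.
PII_DISCLOSURE = [
    FormField("contains_pii", bool),
    FormField("pii_categories", list),
    FormField("retention_days", int),
    FormField("deletion_contact", str),
]


def validate_form(fields: List[FormField], values: Dict[str, Any]) -> List[str]:
    """Return a list of problems; an empty list means the form is complete."""
    problems = []
    for f in fields:
        value = values.get(f.name)
        if value is None:
            if f.required:
                problems.append(f"{f.name}: missing")
            continue
        # bool is a subclass of int in Python, so reject it explicitly for int fields.
        wrong_type = not isinstance(value, f.kind) or (
            f.kind is int and isinstance(value, bool)
        )
        if wrong_type:
            problems.append(f"{f.name}: expected {f.kind.__name__}")
    return problems


def can_publish(fields: List[FormField], values: Dict[str, Any]) -> bool:
    """Publication is allowed only when every required field validates."""
    return not validate_form(fields, values)


complete = {
    "contains_pii": True,
    "pii_categories": ["email"],
    "retention_days": 365,
    "deletion_contact": "privacy@example.com",  # hypothetical address
}
incomplete = {k: v for k, v in complete.items() if k != "deletion_contact"}
print(can_publish(PII_DISCLOSURE, complete), validate_form(PII_DISCLOSURE, incomplete))
```

Returning the list of problems rather than a bare boolean matters in practice: the producer needs to know which field is blocking publication.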

Business glossary. A hierarchical vocabulary of business terms, each with a definition, owner, and optional relationships to other terms. Glossary terms can be attached to assets and columns; the catalog becomes searchable by term. “Show me everything tagged Customer” returns every asset where the Customer term is attached, across accounts, across catalogs.
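The “everything tagged Customer” query is essentially an inverted index from term to assets, regardless of which account holds the table. A small sketch of that idea follows; the function names and account labels are hypothetical, the glossary terms on finance.accounts_receivable are invented for illustration, and the real search is done by the DataZone catalog, not by code like this.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# (asset name, producing account label, glossary terms attached to the asset)
AssetRow = Tuple[str, str, List[str]]


def build_term_index(assets: Iterable[AssetRow]) -> Dict[str, Set[Tuple[str, str]]]:
    """Map each glossary term (case-insensitive) to the (account, asset) pairs carrying it."""
    index: Dict[str, Set[Tuple[str, str]]] = defaultdict(set)
    for name, account, terms in assets:
        for term in terms:
            index[term.lower()].add((account, name))
    return index


def assets_for_term(index: Dict[str, Set[Tuple[str, str]]], term: str) -> List[Tuple[str, str]]:
    """Return every asset tagged with the term, sorted for stable output."""
    return sorted(index.get(term.lower(), set()))


catalog = [
    ("product.cohorts_daily", "product-account", ["Cohort", "Retention"]),
    ("product.user_events", "product-account", ["Customer"]),                    # terms assumed
    ("finance.accounts_receivable", "finance-account", ["Customer", "Revenue"]),  # terms assumed
]
index = build_term_index(catalog)
print(assets_for_term(index, "customer"))
```

The lookup crosses account boundaries because the index is keyed by business term, not by catalog or account.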

Subscription workflow. A consumer project finds an asset in the catalog. A project member clicks “Subscribe”, writes a justification, and the request routes to the asset’s owner in the producing project. The owner reviews the request (seeing the consumer’s project, justification, and requested fields) and approves or rejects it. On approval, DataZone materialises the underlying Lake Formation grant or RAM share to the consumer project’s environments. The approval and the justification are recorded; revoking the subscription reverses the grant.
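The workflow is a small state machine: a request starts pending, the owner approves or rejects it, approval creates a grant, and revocation removes it. The sketch below models only that lifecycle under simplifying assumptions. The Status values, the SubscriptionRequest class, and the set standing in for Lake Formation grants are all hypothetical and are not DataZone’s own types.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Set, Tuple


class Status(Enum):
    PENDING = "pending"
    APPROVED = "approved"
    REJECTED = "rejected"
    REVOKED = "revoked"


@dataclass
class SubscriptionRequest:
    asset: str
    consumer_project: str
    justification: str
    status: Status = Status.PENDING
    # Audit trail of (actor, action, note) tuples.
    audit: List[Tuple[str, str, str]] = field(default_factory=list)


# Stand-in for the grants and shares that approval materialises: (project, asset) pairs.
Grants = Set[Tuple[str, str]]


def decide(req: SubscriptionRequest, grants: Grants, owner: str,
           approve: bool, note: str = "") -> None:
    """Owner approves or rejects a pending request; approval adds the grant."""
    if req.status is not Status.PENDING:
        raise ValueError("only a pending request can be decided")
    if not approve and not note:
        raise ValueError("a rejection must record a reason")
    if approve:
        grants.add((req.consumer_project, req.asset))
        req.status = Status.APPROVED
    else:
        req.status = Status.REJECTED
    req.audit.append((owner, req.status.value, note))


def revoke(req: SubscriptionRequest, grants: Grants, owner: str, note: str = "") -> None:
    """Revoking an approved subscription reverses the grant."""
    if req.status is not Status.APPROVED:
        raise ValueError("only an approved subscription can be revoked")
    grants.discard((req.consumer_project, req.asset))
    req.status = Status.REVOKED
    req.audit.append((owner, req.status.value, note))


grants: Grants = set()
request = SubscriptionRequest(
    "product.cohorts_daily", "finance-reporting",
    "Building Q4 retention-to-revenue dashboard for CFO review",
)
decide(request, grants, owner="product-data", approve=True, note="approved for reporting")
print(request.status, grants)
```

Neither party edits a policy at any point: the only inputs are a justification and a decision, and the grant is a side effect of the state change.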

Data quality hooks. Published assets can expose Glue Data Quality scores in the catalog, so consumers see freshness, completeness, and volume metrics before deciding whether to subscribe. The DQ score becomes part of the discovery surface.

A worked subscription: from request to query

The finance analyst wants product.cohorts_daily to build a report linking retention to billing.

  1. She opens the DataZone portal, searches “cohort retention”, finds cohorts_daily. The asset shows its description, owner (product-data@acme.com), last-updated time, DQ score (0.98), glossary terms (Cohort, Retention), and a preview of the schema.
  2. She clicks “Subscribe”, selects her project (finance-reporting), and writes a justification: “Building Q4 retention-to-revenue dashboard for CFO review”.
  3. The request routes to the product-producer project. The product-data owner sees the request, opens the consumer project’s metadata, sees finance-reporting is an approved consumer project under the domain, approves.
  4. DataZone, under the hood, creates a RAM share from the product-producer account to the finance-reporting account, grants Lake Formation permissions on the cohorts_daily table to the finance-reporting project’s IAM principals, and updates the finance-reporting project’s catalog view to include the table.
  5. Within a minute, the analyst opens her Athena environment (a DataZone-provisioned workgroup in the finance-reporting environment account), sees cohorts_daily in her catalog, runs SELECT * against it. The first query returns data. No IAM ticket, no cross-account role assumption, no spreadsheet lookup, no CSV export.

The time from request to query is minutes instead of days; the approval is a typed review instead of a Slack negotiation; the audit trail is DataZone’s event log instead of a Jira ticket that gets closed with “done”.
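The same request-and-approve path can be driven through the API instead of the portal. The sketch below wraps the boto3 DataZone operations that, to the best of my knowledge, correspond to steps 1 to 3: search_listings, create_subscription_request, and accept_subscription_request. Treat the parameter names as something to verify against the current boto3 reference. The helper function names and all identifiers passed to them are hypothetical, and the client is passed in, so nothing here calls AWS until you supply a real one.

```python
from typing import Any, Dict, List


def find_listings(datazone: Any, domain_id: str, text: str) -> List[Dict[str, Any]]:
    """Step 1: search the business data catalog for published assets."""
    resp = datazone.search_listings(
        domainIdentifier=domain_id,
        searchText=text,
        maxResults=10,
    )
    # Each result item wraps the listing under an "assetListing" key.
    return [item["assetListing"] for item in resp.get("items", []) if "assetListing" in item]


def request_subscription(datazone: Any, domain_id: str, listing_id: str,
                         project_id: str, reason: str) -> str:
    """Step 2: the consumer project asks for access, with a justification."""
    resp = datazone.create_subscription_request(
        domainIdentifier=domain_id,
        requestReason=reason,
        subscribedListings=[{"identifier": listing_id}],
        subscribedPrincipals=[{"project": {"identifier": project_id}}],
    )
    return resp["id"]


def approve_subscription(datazone: Any, domain_id: str, request_id: str,
                         comment: str) -> str:
    """Step 3: the owner in the producing project accepts the request."""
    resp = datazone.accept_subscription_request(
        domainIdentifier=domain_id,
        identifier=request_id,
        decisionComment=comment,
    )
    return resp["status"]


# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   dz = boto3.client("datazone")
#   listing = find_listings(dz, "<domain-id>", "cohort retention")[0]
#   req_id = request_subscription(dz, "<domain-id>", listing["listingId"],
#                                 "<finance-reporting-project-id>",
#                                 "Building Q4 retention-to-revenue dashboard for CFO review")
```

Steps 4 and 5 need no code from us: the grant and the catalog update are what DataZone does once the request is accepted.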

The Lake Formation layer underneath

DataZone doesn’t replace Lake Formation; it orchestrates it. Every approved subscription materialises as one or more Lake Formation grants. The principal is the DataZone project’s IAM role (not the individual user); the resource is the Glue table (or a Redshift view, or an S3 object); the permissions are SELECT and typically DESCRIBE. Column-level and row-level security attached to the table in Lake Formation are preserved. DataZone grants access to the table; Lake Formation enforces what columns and rows within it are visible.

That layering means two things. First, if you already have Lake Formation configured for fine-grained access, DataZone picks it up without reconfiguration. Second, if you don’t, DataZone is your opportunity to retrofit it without every consumer noticing: the project’s access is granted through Lake Formation regardless, so moving from unfenced to fenced tables is invisible to the consumer.
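For a sense of what a subscription turns into, this is roughly the shape of a Lake Formation grant with a project role as principal and a Glue table as resource. DataZone issues its own grants, so you would not normally write this by hand; the sketch only builds the arguments for the Lake Formation GrantPermissions call (grant_permissions in boto3). The helper name, role ARN, account ID, and database name are hypothetical placeholders.

```python
from typing import Any, Dict, Optional, Sequence


def table_grant(role_arn: str, catalog_id: str, database: str, table: str,
                columns: Optional[Sequence[str]] = None) -> Dict[str, Any]:
    """Build keyword arguments for Lake Formation's grant_permissions call."""
    base = {"CatalogId": catalog_id, "DatabaseName": database, "Name": table}
    if columns:
        # Column-scoped grant: only SELECT applies to a TableWithColumns resource.
        resource = {"TableWithColumns": dict(base, ColumnNames=list(columns))}
        permissions = ["SELECT"]
    else:
        # Whole-table grant: SELECT to read rows, DESCRIBE to see the table's metadata.
        resource = {"Table": base}
        permissions = ["SELECT", "DESCRIBE"]
    return {
        "Principal": {"DataLakePrincipalIdentifier": role_arn},
        "Resource": resource,
        "Permissions": permissions,
    }


grant = table_grant(
    role_arn="arn:aws:iam::111122223333:role/example-project-role",  # hypothetical role
    catalog_id="444455556666",   # hypothetical producing account ID
    database="product",          # hypothetical Glue database name
    table="cohorts_daily",
)
print(grant["Permissions"])

# Usage (assumes boto3 is installed and credentials are configured):
#   import boto3
#   boto3.client("lakeformation").grant_permissions(**grant)
```

Because the principal is a role owned by the project rather than an individual user, adding or removing an analyst is a project-membership change and leaves the grant untouched.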

What’s worth remembering

  1. DataZone is a governance layer above existing catalogs, not a replacement for them. Glue catalogs keep the technical metadata; Lake Formation keeps the permissions; DataZone orchestrates discovery, ownership, and subscription workflow across them.
  2. Domain → projects → environments → assets. The hierarchy. A domain contains projects, projects contain environments (compute), and projects publish or subscribe to assets (dataset pointers).
  3. Producer and consumer are project roles, not services. A project can be both. The catalog is the shared middle where assets are published by producers and subscribed to by consumers.
  4. Subscriptions materialise as Lake Formation grants. The consumer doesn’t touch IAM; the producer doesn’t touch IAM; DataZone calls Lake Formation and RAM under the hood when a subscription is approved.
  5. Metadata forms enforce standards at publish time. PII disclosure, retention, deletion contact, business owner: attach them as a form and block publication until the form is filled in.
  6. Business glossary is searchable. Attach glossary terms to assets and columns; the catalog becomes navigable by business concept rather than by table name.
  7. Data quality scores surface in the catalog. Glue Data Quality results attach to published assets so consumers can see freshness and completeness before subscribing.
  8. Alternatives exist and still make sense. Pure Glue Catalog for small, single-account setups. Third-party catalogs (Collibra, Alation, Atlan) if you’ve already standardised on one. In-house builds for very specific shapes that no managed service fits (rare, but a real answer).

DataZone is what you reach for when the catalog is no longer one place. The underlying infrastructure stays where it is; the layer above it starts to behave like a product that analysts and data engineers actually want to use. The hard part isn’t turning DataZone on; it’s deciding which projects map to which teams, which metadata forms to enforce, and who owns what. The tool removes the mechanical friction; the organisation still has to decide what governance means.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.