Configuration reference

Puddle is configured with a single YAML file passed via -c:

puddle -c /path/to/puddle.yaml

A config file is required — puddle won’t boot without one, and it must declare at least one warehouse. ${VAR} and $VAR references are expanded from the process environment before the YAML is parsed; any unset variable is a hard error at boot.

This document covers every top-level block. A minimal config that boots is just:

warehouses:
  default:
    location: file:///tmp/puddle-warehouse

Top-level blocks:

  • server — listen address.
  • logging — log level and format.
  • metastore — where catalog metadata is stored.
  • warehouses — one or more named warehouses.
  • external-warehouses — read-only register-table sources (no REST routing).
  • authn — bearer-token validators (off by default).
  • authz — policy backend (allow-authenticated, OPA).

server

server:
  addr: ":8089"   # default

The Iceberg REST API is mounted at /api/catalog/, so clients point their catalog uri at http://<host>/api/catalog.

logging

logging:
  level: info     # debug | info | warn | error
  format: text    # text | json

Both fields default to the values shown above. Logs go to stderr.

metastore

Where puddle keeps the catalog’s own state (which tables exist, what their current metadata locations are). Table data and the metadata files themselves live in the warehouse, not here.

metastore:
  type: memory    # memory | sqlite

Supported backends:

  • memory — non-persistent. State is lost on shutdown. Useful for tests, demos, and the RCK harness.

    metastore:
      type: memory
  • sqlite — file-backed. Suitable for single-node deployments.

    metastore:
      type: sqlite
      sqlite:
        path: ./tmp/puddle.db   # required; ":memory:" for ephemeral

    path is required. Use :memory: for an ephemeral in-process DB (tests).

warehouses

warehouses: is a map from warehouse name to its configuration. At least one entry is required. Clients address a specific warehouse by name (e.g. via the warehouse=<name> query parameter).

Names must match [a-zA-Z0-9_-]+. config, oauth, and tokens are reserved and rejected at boot.

warehouses:
  default:
    location: file:///tmp/puddle-warehouse
  prod:
    location: s3://my-warehouse
    s3: { ... }      # see below

location is a URL whose scheme picks the storage backend.

Supported schemes:

  • file:// — local filesystem. No sub-block. Suitable for dev, CI, and single-node deployments where data lives on local disk.

    warehouses:
      default:
        location: file:///tmp/puddle-warehouse
  • s3:// (also s3a://, s3n://) — S3 or any S3-compatible store: AWS S3, MinIO, Ceph RGW, Cloudflare R2, Tigris. Requires an s3: sub-block.

    warehouses:
      prod:
        location: s3://my-warehouse
        s3:
          region: us-east-1                       # required
          endpoint: https://s3.amazonaws.com      # omit for AWS, set for MinIO/Ceph
          path-style-access: false                # true for MinIO
    
          # Catalog's own S3 credentials. Leave empty to defer to the
          # AWS default credential chain (env vars, profile, IRSA,
          # instance metadata).
          access-key-id: ${AWS_ACCESS_KEY_ID}
          secret-access-key: ${AWS_SECRET_ACCESS_KEY}
          session-token: ""
    
          # Optional toggles, all default off / empty.
          use-arn-region: false
          checksum-enabled: false
          acl: ""                                 # e.g. bucket-owner-full-control
          write-storage-class: ""                 # STANDARD | STANDARD_IA | ...

    region, endpoint, and path-style-access are forwarded to clients automatically — PyIceberg / Spark / Trino pick them up without further configuration.

Vended credentials

When enabled, puddle mints short-lived, prefix-scoped session tokens via STS AssumeRole and hands them to clients on every table load. The catalog’s own credentials never leave the server. Reads get a read-only token; writes get a read-write token. Default policy templates scope every token to the table’s own prefix, so a token for db.events cannot read or write db.orders.

    s3:
      ...
      vended-credentials:
        enabled: true
        role-arn: arn:aws:iam::123456789012:role/IcebergTableAccess
        duration-seconds: 3600              # default; AWS allows [900, 43200]
        sts-endpoint: ""                    # defaults to s3.endpoint
        sts-region: ""                      # defaults to s3.region
        external-id: ""                     # cross-account
        read-policy-template: ""            # text/template; default scopes to table prefix
        write-policy-template: ""           # default extends read with PutObject + DeleteObject

Vending requires explicit access-key-id + secret-access-key on the warehouse (the catalog uses them to call STS). Works the same against AWS, MinIO, and Ceph (RGW); only role-arn and the optional sts-endpoint differ. See credential-vending.md for the per-target operator runbook (trust policies, role creation, template overrides, troubleshooting).
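The default templates are not reproduced here, but a prefix-scoped session policy of the kind described above might look like the following sketch. Bucket and prefix names are illustrative, and the helper names are hypothetical — the real templates live in puddle and can be overridden via read-policy-template / write-policy-template:

```python
def read_policy(bucket: str, table_prefix: str) -> dict:
    """Illustrative IAM session policy scoping reads to one table's prefix,
    in the spirit of the default read template described above."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{table_prefix}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
                "Condition": {"StringLike": {"s3:prefix": [f"{table_prefix}/*"]}},
            },
        ],
    }

def write_policy(bucket: str, table_prefix: str) -> dict:
    """Write extends read with PutObject + DeleteObject, as described above."""
    p = read_policy(bucket, table_prefix)
    p["Statement"][0]["Action"] += ["s3:PutObject", "s3:DeleteObject"]
    return p
```

A token minted from the read policy for db.events can fetch objects only under that table's prefix, which is what makes the db.orders example above fail.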

external-warehouses

external-warehouses: declares storage locations the catalog can read from but does not expose as REST routes. Entries are admin-enumerated read sources for registerTable: a client supplying a metadata-location whose URL falls under one of these locations is allowed to register the table even though the location is not a routed warehouse. The table’s bytes stay where they are; the catalog just stores the pointer. Subsequent loadTable and updateTable route I/O through the external’s own FileIO, with its own credentials, region, and (optional) credential vending — independent of any routed warehouse.

Register-only. Externals cannot be the target of a new table. A createTable whose location falls under an external is rejected with “cannot create a table at location …: it belongs to an external warehouse, which is register-only”. An operator who wants to author new tables into a specific storage configures it under warehouses: instead, which makes it REST-addressable.

warehouses:
  default:
    location: file:///tmp/puddle-warehouse

external-warehouses:
  legacy:
    location: s3://legacy-bucket/x
    s3:
      region: us-east-1
      access-key-id: ${LEGACY_AWS_ACCESS_KEY_ID}
      secret-access-key: ${LEGACY_AWS_SECRET_ACCESS_KEY}

Each entry has the same shape as a routed warehouse: a location URL and the matching per-scheme sub-block (s3: for s3/s3a/s3n, nothing for file://). Vended credentials are configured the same way under s3.vended-credentials: and apply to clients reading or writing tables registered under this external.

What externals are not

  • Not REST-routable. Clients cannot pass warehouse=legacy in a query parameter or prefix path segment. The API handler iterates warehouses: only.
  • Not exposed in /v1/config. The defaults endpoint advertises routed warehouses; externals do not appear.
  • Not valid createTable destinations. A createTable whose location resolves under an external is rejected at request time.

The catalog stores a pointer; the bytes stay in the external storage. loadTable and updateTable for tables registered there route I/O through the external’s own FileIO and credential vendor.

Routed or external?

Use warehouses: if you want clients to address this storage by name and create new tables in it. Use external-warehouses: if you only want to register tables that already exist there.

Validation

  • Names must be unique across the union of warehouses: and external-warehouses:. The same [a-zA-Z0-9_-]+ shape rule and the same reserved-name list (config, oauth, tokens) apply.
  • Locations must not overlap across the union — no entry’s URL may be a path-segment prefix of another’s. The register-table resolver picks a source by longest-prefix match; overlap would silently hide one entry behind another.
  • Per-entry rules carry over: location must have a scheme, the matching sub-block must be present, s3.region is required for s3-scheme entries, and vended-credentials is validated as in routed warehouses.
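The shape, reserved-name, overlap, and longest-prefix rules above can be sketched as follows — a minimal illustration under the stated rules, not puddle's actual code (all helper names are hypothetical):

```python
import re

RESERVED = {"config", "oauth", "tokens"}
NAME_RE = re.compile(r"^[a-zA-Z0-9_-]+$")

def validate_names(names):
    """Apply the name rules: [a-zA-Z0-9_-]+ shape, reserved list.
    Returns a list of error strings; empty means valid."""
    errors = []
    for n in names:
        if not NAME_RE.match(n):
            errors.append(f"invalid warehouse name {n!r}")
        if n in RESERVED:
            errors.append(f"reserved warehouse name {n!r}")
    return errors

def is_path_prefix(a: str, b: str) -> bool:
    """True if location a equals b or is a path-segment prefix of b."""
    return b == a or b.startswith(a.rstrip("/") + "/")

def find_overlaps(locations):
    """Flag any pair of locations where one is a path-segment prefix
    of the other — the overlap rule above."""
    pairs = []
    for i, a in enumerate(locations):
        for b in locations[i + 1:]:
            if is_path_prefix(a, b) or is_path_prefix(b, a):
                pairs.append((a, b))
    return pairs

def resolve_source(metadata_location: str, sources: dict):
    """Longest-prefix match: pick the entry whose location is the
    longest path-segment prefix of the metadata location."""
    best = None
    for name, loc in sources.items():
        if is_path_prefix(loc, metadata_location):
            if best is None or len(loc) > len(sources[best]):
                best = name
    return best
```

Note that the prefix test is per path segment: s3://a/x does not overlap s3://a/xy, which is why a plain startswith on the raw string would be wrong.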

authentication

authn: is opt-in. Omit it entirely and every request runs anonymously — fine for local dev. Configure it and every request must present a valid bearer token in the Authorization header or get a 401.

Both kinds of token can be configured at once: tokens that look like a JWT are sent to the JWT validator, everything else to the static-token list.

authn:
  static-tokens: [ ... ]    # see below
  jwt: { ... }              # see below

Per-request attributes set by either validator (e.g. groups) are visible to authorization policy.

static-tokens

A list of bearer tokens, each tied to a principal name and free-form attributes.

authn:
  static-tokens:
    - token: ${PUDDLE_ADMIN_TOKEN}
      principal: admin
      attrs:
        groups: [admins]
        team: platform
    - token: ${PUDDLE_CI_TOKEN}
      principal: ci-runner
      attrs:
        groups: [ci]

token: should always use ${VAR} env-var expansion so the literal secret stays out of source control. attrs: is forwarded verbatim to authorization policy — the keys are operator-chosen.

Configuration errors hard-fail at boot: an empty token or principal, duplicate tokens, and tokens whose value looks like a JWT and would therefore be misrouted to the JWT validator.
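The routing rule ("tokens that look like a JWT go to the JWT validator") presumably hinges on a shape test. A JWT in compact serialization is three dot-separated base64url segments, so a sketch of such a classifier might be — the exact heuristic puddle uses is an assumption:

```python
import re

# base64url alphabet: letters, digits, '-', '_' (unpadded segments).
_SEGMENT = re.compile(r"^[A-Za-z0-9_-]+$")

def looks_like_jwt(token: str) -> bool:
    """True if the token has the three-segment compact-JWT shape.
    Tokens failing this test would fall through to the static list."""
    parts = token.split(".")
    return len(parts) == 3 and all(_SEGMENT.match(p) for p in parts)
```

This is why a static token whose literal value happens to contain two dots would hard-fail validation at boot rather than silently never matching.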

jwt

OIDC-style validation against one or more issuers. Each request’s JWT is routed to its issuer by exact match on the iss claim.

authn:
  jwt:
    issuers:
      - issuer: https://login.example.com/
        audience: puddle                       # string or list of strings
        algorithms: [RS256]                    # restrict to specific algs
        # JWKS source — pick one (default: OIDC discovery)
        jwks-uri: https://login.example.com/.well-known/jwks.json
        # jwks: |                              # inline alternative
        #   { "keys": [ ... ] }
        claims:                                # claim → attrs.<key> mapping
          email: email
          groups: groups

If neither jwks-uri nor jwks is given, puddle does OIDC discovery (GET <issuer>/.well-known/openid-configuration) and follows the returned jwks_uri.

claims: is an allowlist: only the listed JWT claims are passed through to authorization policy. Anything not in the map is dropped.
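The allowlist semantics above amount to a small filter-and-rename over the decoded claim set; a minimal sketch (function name hypothetical):

```python
def filter_claims(claims: dict, allowlist: dict) -> dict:
    """Apply a claims: allowlist: only listed JWT claims survive,
    renamed to their attrs.<key> names. Everything else is dropped."""
    return {attr: claims[claim]
            for claim, attr in allowlist.items()
            if claim in claims}
```

With the example config above, a token carrying email, groups, and iat surfaces only email and groups to policy.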

Debugging: who am I?

GET /api/catalog/v1/whoami echoes back the identity of the calling token: subject, issuer, and the attribute map authorization will see. Useful for confirming static-token attrs: are wired the way you expect, or seeing which JWT claims actually surface under your claims: allowlist. With authn: unconfigured the response contains "anonymous": true.

$ curl -s -H "Authorization: Bearer $TOKEN" \
    http://localhost:8089/api/catalog/v1/whoami | jq
{
  "sub": "admin",
  "issuer": "static:admin",
  "attrs": { "groups": ["admins"], "team": "platform" }
}

authorization

authz: selects the policy that gates every request after authentication. Decisions are logged with the stable event="authz.decision" discriminator regardless of mode.

authz:
  mode: allow-authenticated   # allow-authenticated | opa

Supported modes:

  • allow-authenticated (default) — permit any authenticated request. With authn: unconfigured this collapses to allow-all so the dev path keeps working without auth.

    authz:
      mode: allow-authenticated
  • opa — call out to Open Policy Agent for every decision. OPA runs as a separate process; puddle calls its data API.

    authz:
      mode: opa
      opa:
        url: http://localhost:8181/v1/data/puddle/rbac/allow

    A runnable starter policy with admin / writer / reader roles, plus a working puddle.yaml, lives in examples/opa/. For container-based stacks, see examples/compose/ — the opa-local, trino-rustfs-opa, and trino-rustfs-opa-vending presets all exercise this code path against a real OPA sidecar.
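OPA's data API takes a JSON body of the form {"input": ...} and answers {"result": ...}. The exact input document puddle sends is not shown here, so the fields below (principal, action, attrs) are illustrative assumptions; only the envelope and the deny-on-undefined interpretation are standard OPA behavior:

```python
import json

def opa_request_body(principal: str, action: str, attrs: dict) -> bytes:
    """Build an illustrative data-API request body; field names inside
    'input' are assumptions, not puddle's documented schema."""
    return json.dumps({"input": {
        "principal": principal,
        "action": action,
        "attrs": attrs,
    }}).encode()

def opa_allowed(response_body: bytes) -> bool:
    """An absent 'result' means the rule was undefined — treat as deny."""
    return json.loads(response_body).get("result") is True
```

The starter policy in examples/opa/ defines what the rule at /v1/data/puddle/rbac/allow actually evaluates.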

Environment variable expansion

${VAR} and $VAR in any string field are expanded from the process environment before YAML parsing. An unset variable is a hard error — config loading fails and lists every missing name.

authn:
  static-tokens:
    - token: ${PUDDLE_ADMIN_TOKEN}
      principal: admin
warehouses:
  prod:
    location: s3://my-warehouse
    s3:
      region: us-east-1
      access-key-id: ${AWS_ACCESS_KEY_ID}
      secret-access-key: ${AWS_SECRET_ACCESS_KEY}

Use this for every secret. Don’t paste literal tokens into config files.
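The expansion rule can be sketched as follows — an illustration of the described behavior (expand ${VAR} and $VAR, collect every unset name, fail once with the full list), not puddle's actual implementation:

```python
import os
import re

_VAR = re.compile(r"\$\{(\w+)\}|\$(\w+)")

def expand_env(text: str, env=os.environ) -> str:
    """Expand ${VAR} / $VAR from env before YAML parsing.
    Raises one error naming every missing variable, per the rule above."""
    missing = []

    def repl(m):
        name = m.group(1) or m.group(2)
        if name not in env:
            missing.append(name)
            return ""
        return env[name]

    out = _VAR.sub(repl, text)
    if missing:
        raise ValueError(f"unset environment variables: {', '.join(missing)}")
    return out
```

Collecting all missing names before failing means one boot attempt surfaces every unset secret at once, instead of one per restart.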