Configuration reference
Puddle is configured with a single YAML file passed via -c:

    puddle -c /path/to/puddle.yaml

A config file is required: puddle won’t boot without one, and it must
declare at least one warehouse. ${VAR} and $VAR references are
expanded from the process environment before the YAML is parsed; any
unset variable is a hard error at boot.
This document covers every top-level block. A minimal config that boots is just:

    warehouses:
      default:
        location: file:///tmp/puddle-warehouse

Top-level blocks:
- server: listen address.
- logging: log level and format.
- metastore: where catalog metadata is stored.
- warehouses: one or more named warehouses.
- external-warehouses: read-only register-table sources (no REST routing).
- authn: bearer-token validators (off by default).
- authz: policy backend (allow-authenticated, OPA).
server

    server:
      addr: ":8089"   # default

The Iceberg REST API is mounted at /api/catalog/, so clients point
their catalog uri at http://<host>/api/catalog.
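
As a rough illustration of the client side, the sketch below builds the properties a Python client might pass to a REST catalog. The helper name `rest_catalog_properties` is hypothetical; the `type`, `uri`, `warehouse`, and `token` keys are the property names PyIceberg's REST catalog reads, and the host and warehouse values are taken from the examples in this document.

```python
def rest_catalog_properties(host, warehouse, token=None):
    """Build REST-catalog client properties for a puddle server.

    host is "hostname:port"; the /api/catalog mount point comes from the
    text above. This helper is illustrative and not part of puddle.
    """
    props = {
        "type": "rest",
        "uri": f"http://{host}/api/catalog",
        "warehouse": warehouse,
    }
    if token:
        # Only relevant once authn: is configured on the server.
        props["token"] = token
    return props


if __name__ == "__main__":
    print(rest_catalog_properties("localhost:8089", "default"))
```

With PyIceberg installed, the resulting dict can be passed as keyword arguments to `pyiceberg.catalog.load_catalog("puddle", **props)`.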
logging

    logging:
      level: info    # debug | info | warn | error
      format: text   # text | json

Both fields default to the values shown above. Logs go to stderr.
metastore
Where puddle keeps the catalog’s own state (which tables exist, what their current metadata locations are). Table data and the metadata files themselves live in the warehouse, not here.

    metastore:
      type: memory   # memory | sqlite

Supported backends:
- memory: non-persistent. State is lost on shutdown. Useful for tests, demos, and the RCK harness.

      metastore:
        type: memory

- sqlite: file-backed. Suitable for single-node deployments.

      metastore:
        type: sqlite
        sqlite:
          path: ./tmp/puddle.db   # required; ":memory:" for ephemeral

  path is required. Use :memory: for an ephemeral in-process DB (tests).
warehouses
warehouses: is a map from warehouse name to its configuration. At
least one entry is required. Clients address a specific warehouse by
name (e.g. via the warehouse=<name> query parameter).
Names must match [a-zA-Z0-9_-]+. config, oauth, and tokens
are reserved and rejected at boot.

    warehouses:
      default:
        location: file:///tmp/puddle-warehouse
      prod:
        location: s3://my-warehouse
        s3: { ... }   # see below

location is a URL whose scheme picks the storage backend.
Supported schemes:
- file:// (local filesystem). No sub-block. Suitable for dev, CI, and single-node deployments where data lives on local disk.

      warehouses:
        default:
          location: file:///tmp/puddle-warehouse

- s3://, also s3a:// and s3n:// (S3 or any S3-compatible store: AWS S3, MinIO, Ceph RGW, Cloudflare R2, Tigris). Requires an s3: sub-block.

      warehouses:
        prod:
          location: s3://my-warehouse
          s3:
            region: us-east-1                    # required
            endpoint: https://s3.amazonaws.com   # omit for AWS, set for MinIO/Ceph
            path-style-access: false             # true for MinIO

            # Catalog's own S3 credentials. Leave empty to defer to the
            # AWS default credential chain (env vars, profile, IRSA,
            # instance metadata).
            access-key-id: ${AWS_ACCESS_KEY_ID}
            secret-access-key: ${AWS_SECRET_ACCESS_KEY}
            session-token: ""

            # Optional toggles, all default off / empty.
            use-arn-region: false
            checksum-enabled: false
            acl: ""                    # e.g. bucket-owner-full-control
            write-storage-class: ""    # STANDARD | STANDARD_IA | ...

  region, endpoint, and path-style-access are forwarded to clients automatically, so PyIceberg / Spark / Trino pick them up without further configuration.
Vended credentials
When enabled, puddle mints short-lived, prefix-scoped session tokens
via STS AssumeRole and hands them to clients on every table load.
The catalog’s own credentials never leave the server. Reads get a
read-only token; writes get a read-write token. Default policy
templates scope every token to the table’s own prefix, so a token
for db.events cannot read or write db.orders.

    s3:
      ...
      vended-credentials:
        enabled: true
        role-arn: arn:aws:iam::123456789012:role/IcebergTableAccess
        duration-seconds: 3600       # default; AWS allows [900, 43200]
        sts-endpoint: ""             # defaults to s3.endpoint
        sts-region: ""               # defaults to s3.region
        external-id: ""              # cross-account
        read-policy-template: ""     # text/template; default scopes to table prefix
        write-policy-template: ""    # default extends read with PutObject + DeleteObject

Vending requires explicit access-key-id + secret-access-key on the
warehouse (the catalog uses them to call STS). Works the same against
AWS, MinIO, and Ceph (RGW); only role-arn and the optional
sts-endpoint differ. See
credential-vending.md for the per-target
operator runbook (trust policies, role creation, template overrides,
troubleshooting).
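
To make the prefix scoping concrete, here is a minimal sketch of what a table-scoped session policy could look like. It is not puddle's default template (that is not reproduced in this document). The function name `table_policy`, the exact statement layout, and the assumption that the table db.events lives at s3://my-warehouse/db/events are all illustrative.

```python
import json
from urllib.parse import urlparse


def table_policy(table_location, write=False):
    """Return an IAM session policy (as a dict) scoped to one table prefix.

    Illustrative only: puddle's real templates may differ in actions and shape.
    """
    parsed = urlparse(table_location)
    bucket = parsed.netloc
    prefix = parsed.path.strip("/")
    object_actions = ["s3:GetObject"]
    if write:
        # Per the config comments above, write extends read with these two.
        object_actions += ["s3:PutObject", "s3:DeleteObject"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": object_actions,
                "Resource": f"arn:aws:s3:::{bucket}/{prefix}/*",
            },
            {
                # Listing is restricted to keys under the table prefix.
                "Effect": "Allow",
                "Action": "s3:ListBucket",
                "Resource": f"arn:aws:s3:::{bucket}",
                "Condition": {"StringLike": {"s3:prefix": [f"{prefix}/*"]}},
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(table_policy("s3://my-warehouse/db/events"), indent=2))
```

A token carrying this policy for db/events has no statement that matches objects under db/orders, which is the isolation property described above.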
external-warehouses
external-warehouses: declares storage locations the catalog can
read from but does not expose as REST routes. Entries are
admin-enumerated read sources for registerTable: a client
supplying a metadata-location whose URL falls under one of these
locations is allowed to register the table even though the location
is not a routed warehouse. The table’s bytes stay where they are;
the catalog just stores the pointer. Subsequent loadTable and
updateTable route I/O through the external’s own FileIO, with
its own credentials, region, and (optional) credential vending —
independent of any routed warehouse.
Register-only. Externals cannot be the target of a new table.
A createTable whose location falls under an external is
rejected with “cannot create a table at location …: it belongs
to an external warehouse, which is register-only”. An operator
who wants to author new tables into a specific storage configures
it under warehouses: instead, which makes it
REST-addressable.

    warehouses:
      default:
        location: file:///tmp/puddle-warehouse
    external-warehouses:
      legacy:
        location: s3://legacy-bucket/x
        s3:
          region: us-east-1
          access-key-id: ${LEGACY_AWS_ACCESS_KEY_ID}
          secret-access-key: ${LEGACY_AWS_SECRET_ACCESS_KEY}

Each entry has the same shape as a routed warehouse: a location
URL and the matching per-scheme sub-block (s3: for s3/s3a/s3n,
nothing for file://). Vended credentials are configured the
same way under s3.vended-credentials: and apply to clients
reading or writing tables registered under this external.
What externals are not
- Not REST-routable. Clients cannot pass warehouse=legacy in a query parameter or prefix path segment. The API handler iterates warehouses: only.
- Not exposed in /v1/config. The defaults endpoint advertises routed warehouses; externals do not appear.
- Not valid createTable destinations. A createTable whose location resolves under an external is rejected at request time.
The catalog stores a pointer; the bytes stay in the external storage.
loadTable and updateTable for tables registered there route I/O
through the external’s own FileIO and credential vendor.
Routed or external?
Use warehouses: if you want clients to address this storage by
name and create new tables in it. Use external-warehouses: if you
only want to register tables that already exist there.
Validation
- Names must be unique across the union of warehouses: and external-warehouses:. The same [a-zA-Z0-9_-]+ shape rule and the same reserved-name list (config, oauth, tokens) apply.
- Locations must not overlap across the union: no entry’s URL may be a path-segment prefix of another’s. The register-table resolver picks a source by longest-prefix match; overlap would silently hide one entry behind another.
- Per-entry rules carry over: location must have a scheme, the matching sub-block must be present, s3.region is required for s3-scheme entries, and vended-credentials is validated as in routed warehouses.
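
The name and overlap rules above can be sketched roughly as follows. This is an illustration of the stated rules, not puddle's code: the function names are hypothetical, each map is simplified to name → location string, and s3://, s3a://, and s3n:// are compared as distinct schemes, which may not match what puddle does.

```python
import re
from urllib.parse import urlparse

NAME_RE = re.compile(r"[a-zA-Z0-9_-]+")
RESERVED = {"config", "oauth", "tokens"}


def segments(location):
    """Split a location URL into (scheme, authority, [path segments])."""
    u = urlparse(location)
    return u.scheme, u.netloc, [s for s in u.path.split("/") if s]


def is_segment_prefix(a, b):
    """True if location a is a path-segment prefix of (or equal to) location b."""
    scheme_a, host_a, path_a = segments(a)
    scheme_b, host_b, path_b = segments(b)
    return (scheme_a, host_a) == (scheme_b, host_b) and path_b[:len(path_a)] == path_a


def validate(warehouses, externals):
    """Return a list of error strings for two {name: location} maps."""
    errors = []
    for name in list(warehouses) + list(externals):
        if not NAME_RE.fullmatch(name):
            errors.append(f"{name!r}: name must match [a-zA-Z0-9_-]+")
        elif name in RESERVED:
            errors.append(f"{name!r}: reserved name")
    for name in sorted(set(warehouses) & set(externals)):
        errors.append(f"{name!r}: declared in both maps")
    entries = list(warehouses.items()) + list(externals.items())
    for i, (name_a, loc_a) in enumerate(entries):
        for name_b, loc_b in entries[i + 1:]:
            if is_segment_prefix(loc_a, loc_b) or is_segment_prefix(loc_b, loc_a):
                errors.append(f"{name_a!r} and {name_b!r}: locations overlap")
    return errors


def resolve(metadata_location, entries):
    """Return the name whose location is the longest segment prefix, or None."""
    matches = [n for n, loc in entries.items() if is_segment_prefix(loc, metadata_location)]
    return max(matches, key=lambda n: len(segments(entries[n])[2]), default=None)
```

Comparing whole path segments rather than raw strings is what keeps s3://legacy-bucket/x from matching s3://legacy-bucket/xy.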
authentication
authn: is opt-in. Omit it entirely and every request runs
anonymously — fine for local dev. Configure it and every request must
present a valid bearer token in the Authorization header or get a
401.
Both kinds of token can be configured at once: tokens that look like a JWT are sent to the JWT validator, everything else to the static-token list.

    authn:
      static-tokens: [ ... ]   # see below
      jwt: { ... }             # see below

Per-request attributes set by either validator (e.g. groups) are
visible to authorization policy.
static-tokens
A list of bearer tokens, each tied to a principal name and free-form attributes.

    authn:
      static-tokens:
        - token: ${PUDDLE_ADMIN_TOKEN}
          principal: admin
          attrs:
            groups: [admins]
            team: platform
        - token: ${PUDDLE_CI_TOKEN}
          principal: ci-runner
          attrs:
            groups: [ci]

token: should always use ${VAR} env-var expansion so the literal
secret stays out of source control. attrs: is forwarded verbatim to
authorization policy — the keys are operator-chosen.
Configuration errors hard-fail at boot: empty token or principal,
duplicate tokens, and tokens whose value would be misrouted to the
JWT validator.
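
The boot-time checks listed above might look something like the sketch below. How puddle decides that a token "looks like a JWT" is not specified in this document; the heuristic here (three dot-separated segments whose first segment decodes to a JSON object with an `alg` key) is an assumption, and both function names are hypothetical.

```python
import base64
import json


def looks_like_jwt(token):
    """Assumed heuristic: header.payload.signature with a JSON header."""
    parts = token.split(".")
    if len(parts) != 3:
        return False
    try:
        padded = parts[0] + "=" * (-len(parts[0]) % 4)
        header = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers bad base64, invalid UTF-8, and invalid JSON.
        return False
    return isinstance(header, dict) and "alg" in header


def validate_static_tokens(entries):
    """Return a list of error strings for a static-tokens list of dicts."""
    errors = []
    seen = set()
    for i, entry in enumerate(entries):
        token = entry.get("token", "")
        principal = entry.get("principal", "")
        if not token:
            errors.append(f"static-tokens[{i}]: empty token")
        if not principal:
            errors.append(f"static-tokens[{i}]: empty principal")
        if token and token in seen:
            errors.append(f"static-tokens[{i}]: duplicate token")
        if token and looks_like_jwt(token):
            errors.append(f"static-tokens[{i}]: token would be routed to the JWT validator")
        seen.add(token)
    return errors
```

Rejecting JWT-shaped static tokens at boot avoids a configuration where a valid static token can never authenticate because it is always sent to the other validator.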
jwt
OIDC-style validation against one or more issuers. Each request’s
JWT is routed to its issuer by exact match on the iss claim.

    authn:
      jwt:
        issuers:
          - issuer: https://login.example.com/
            audience: puddle              # string or list of strings
            algorithms: [RS256]           # restrict to specific algs
            # JWKS source: pick one (default: OIDC discovery)
            jwks-uri: https://login.example.com/.well-known/jwks.json
            # jwks: |                     # inline alternative
            #   { "keys": [ ... ] }
            claims:                       # claim → attrs.<key> mapping
              email: email
              groups: groups

If neither jwks-uri nor jwks is given, puddle does OIDC discovery
(GET <issuer>/.well-known/openid-configuration) and follows the
returned jwks_uri.
claims: is an allowlist: only the listed JWT claims are passed
through to authorization policy. Anything not in the map is dropped.
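
A small sketch of the two behaviours described in this section, exact-match issuer routing and the claims allowlist, assuming the issuer entries and the decoded JWT payload are plain dicts. The function names are hypothetical, and signature, audience, and algorithm checks are deliberately left out.

```python
def find_issuer(iss, issuers):
    """Return the issuer config whose 'issuer' equals iss exactly, or None."""
    for cfg in issuers:
        if cfg["issuer"] == iss:
            return cfg
    return None


def map_claims(jwt_claims, claims_map):
    """Keep only allowlisted claims, stored under their attrs key.

    claims_map is {claim name in the JWT: key under attrs}.
    """
    return {
        attr_key: jwt_claims[claim]
        for claim, attr_key in claims_map.items()
        if claim in jwt_claims
    }


if __name__ == "__main__":
    issuers = [{
        "issuer": "https://login.example.com/",
        "claims": {"email": "email", "groups": "groups"},
    }]
    payload = {
        "iss": "https://login.example.com/",
        "email": "a@example.com",
        "groups": ["admins"],
        "nonce": "dropped-because-not-allowlisted",
    }
    cfg = find_issuer(payload["iss"], issuers)
    print(map_claims(payload, cfg["claims"]))
```

Because the match is exact, an `iss` of https://login.example.com (no trailing slash) would not route to the issuer configured as https://login.example.com/.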
Debugging: who am I?
GET /api/catalog/v1/whoami echoes back the identity of the calling
token: subject, issuer, and the attribute map authorization will
see. Useful for confirming static-token attrs: are wired the way
you expect, or seeing which JWT claims actually surface under your
claims: allowlist. With authn: unconfigured the response
contains "anonymous": true.

    $ curl -s -H "Authorization: Bearer $TOKEN" \
        http://localhost:8089/api/catalog/v1/whoami | jq
    {
      "sub": "admin",
      "issuer": "static:admin",
      "attrs": { "groups": ["admins"], "team": "platform" }
    }

authorization
authz: selects the policy that gates every request after
authentication. Decisions are logged with the stable
event="authz.decision" discriminator regardless of mode.

    authz:
      mode: allow-authenticated   # allow-authenticated | opa

Supported modes:
- allow-authenticated (default): permit any authenticated request. With authn: unconfigured this collapses to allow-all so the dev path keeps working without auth.

      authz:
        mode: allow-authenticated

- opa: call out to Open Policy Agent for every decision. OPA runs as a separate process; puddle calls its data API.

      authz:
        mode: opa
        opa:
          url: http://localhost:8181/v1/data/puddle/rbac/allow

  A runnable starter policy with admin/writer/reader roles, plus a working puddle.yaml, lives in examples/opa/. For container-based stacks, see examples/compose/: the opa-local, trino-rustfs-opa, and trino-rustfs-opa-vending presets all exercise this code path against a real OPA sidecar.
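
For orientation, a call to OPA's data API generally has the shape sketched below: a POST whose JSON body wraps the query in an `input` key, answered by a JSON document with a `result` key. The fields inside `input` are hypothetical, since the document puddle actually sends is not described here, and the helper names are illustrative.

```python
import json
import urllib.request


def build_opa_request(url, input_doc):
    """Build a POST request for OPA's data API; input_doc is wrapped in 'input'."""
    body = json.dumps({"input": input_doc}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def parse_opa_decision(raw):
    """Allow only when OPA returns result == true.

    OPA omits 'result' when the rule is undefined; that is treated as deny.
    """
    return json.loads(raw).get("result") is True


if __name__ == "__main__":
    req = build_opa_request(
        "http://localhost:8181/v1/data/puddle/rbac/allow",
        # Hypothetical input shape, for illustration only.
        {"principal": "admin", "attrs": {"groups": ["admins"]}, "action": "loadTable"},
    )
    with urllib.request.urlopen(req) as resp:  # requires a running OPA
        print(parse_opa_decision(resp.read()))
```

Sending a request like this by hand can be a quick way to test a policy before pointing puddle at it.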
Environment variable expansion
${VAR} and $VAR in any string field are expanded from the process
environment before YAML parsing. An unset variable is a hard error —
config-load will fail listing every missing name.

    authn:
      static-tokens:
        - token: ${PUDDLE_ADMIN_TOKEN}
          principal: admin
    warehouses:
      prod:
        location: s3://my-warehouse
        s3:
          region: us-east-1
          access-key-id: ${AWS_ACCESS_KEY_ID}
          secret-access-key: ${AWS_SECRET_ACCESS_KEY}

Use this for every secret. Don’t paste literal tokens into config files.
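
The expansion behaviour described in this section can be approximated with the sketch below. It is an illustration, not puddle's implementation: the variable-name pattern, the `MissingVariables` exception, and the choice to treat a set-but-empty variable as present are assumptions.

```python
import os
import re

# Matches ${VAR} or $VAR; the name pattern is an assumption.
_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}|\$([A-Za-z_][A-Za-z0-9_]*)")


class MissingVariables(Exception):
    """Raised when one or more referenced variables are unset."""


def expand_env(text, env=None):
    """Expand variable references in raw config text before YAML parsing."""
    env = os.environ if env is None else env
    missing = []

    def replace(match):
        name = match.group(1) or match.group(2)
        if name not in env:
            if name not in missing:
                missing.append(name)
            return match.group(0)
        return env[name]

    expanded = _VAR.sub(replace, text)
    if missing:
        # Report every missing name at once, as described above.
        raise MissingVariables("unset environment variables: " + ", ".join(missing))
    return expanded


if __name__ == "__main__":
    raw = "token: ${PUDDLE_ADMIN_TOKEN}\nregion: $AWS_REGION\n"
    print(expand_env(raw, {"PUDDLE_ADMIN_TOKEN": "s3cret", "AWS_REGION": "us-east-1"}))
```

Because expansion runs on the raw text before parsing, a value containing YAML-significant characters can change how the file parses, so quoting such fields in the config is a reasonable precaution.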