# Configuration
> This bundle contains all pages in the Configuration section.
> Source: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration ===

# Advanced Configurations

> **📝 Note**
>
> An LLM-optimized bundle of this entire section is available at [`section.md`](https://www.union.ai/docs/v2/union/deployment/selfmanaged/section.md).
> This single file contains all pages in this section, optimized for AI coding agent context.

This section covers the configuration of Union features on your Union.ai cluster.

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/node-pools ===

# Configuring Service and Worker Node Pools

As a best practice, we recommend using separate node pools for the Union services and the Union worker pods. This
guards against resource contention between the Union services and the task workloads running in your cluster.

Start by creating two node pools in your cluster: one for the Union services and one for the Union worker pods.
Label the node pool for the Union services with `union.ai/node-role: services` and the worker pool with
`union.ai/node-role: worker`. You will also need to taint the nodes in both pools to ensure that only the
appropriate pods are scheduled on them.
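For example, if you are labeling nodes by hand, the labels described above can be applied with `kubectl` (substitute your own node names; most provisioning tools can also set labels at pool creation):

```bash
kubectl label nodes <node-name> union.ai/node-role=services
kubectl label nodes <node-name> union.ai/node-role=worker
```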

The nodes for Union services should be tainted with:

```bash
kubectl taint nodes <node-name> union.ai/node-role=services:NoSchedule
```
The nodes for execution workers should be tainted with:

```bash
kubectl taint nodes <node-name> union.ai/node-role=worker:NoSchedule
```

Vendor interfaces and provisioning tools may support tainting nodes automatically through configuration options.

Set the scheduling constraints for the Union services in your values file:

```yaml
scheduling:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: union.ai/node-role
            operator: In
            values:
            - services
  tolerations:
    - effect: NoSchedule
      key: union.ai/node-role
      operator: Equal
      value: services
```

To ensure that your worker pods are scheduled on the worker node pool, set the following for the Flyte Kubernetes plugin:

```yaml
config:
  k8s:
    plugins:
      k8s:
        default-affinity:
          nodeAffinity:
            requiredDuringSchedulingIgnoredDuringExecution:
              nodeSelectorTerms:
              - matchExpressions:
                - key: union.ai/node-role
                  operator: In
                  values:
                  - worker
        default-tolerations:
          - effect: NoSchedule
            key: union.ai/node-role
            operator: Equal
            value: worker
```
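With both defaults in place, every task pod that propeller launches should carry scheduling constraints along these lines (an illustrative fragment of the rendered pod spec, not something you configure directly):

```yaml
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: union.ai/node-role
            operator: In
            values:
            - worker
  tolerations:
  - effect: NoSchedule
    key: union.ai/node-role
    operator: Equal
    value: worker
```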

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/authentication ===

# Authentication

Union.ai uses [OpenID Connect (OIDC)](https://openid.net/specs/openid-connect-core-1_0.html) for user authentication and [OAuth 2.0](https://tools.ietf.org/html/rfc6749) for service-to-service authorization. You must configure an external Identity Provider (IdP) to enable authentication on your deployment.

## Overview

Authentication is enforced at two layers:

1. **Ingress layer** — The control plane nginx ingress validates every request to protected routes via an auth subrequest to the `/me` endpoint.
2. **Application layer** — `flyteadmin` manages browser sessions, validates tokens, and exposes OIDC discovery endpoints.

The following diagram shows how these layers interact for browser-based authentication:

```mermaid
sequenceDiagram
    participant B as Browser
    participant N as Nginx Ingress
    participant F as Flyteadmin
    participant IdP as Identity Provider
    B->>N: Request protected route
    N->>F: Auth subrequest (GET /me)
    F-->>N: 401 (no session)
    N-->>B: 302 → /login
    B->>F: GET /login (unprotected)
    F-->>B: 302 → IdP authorize endpoint
    B->>IdP: Authenticate (PKCE)
    IdP-->>B: 302 → /callback?code=...
    B->>F: GET /callback (exchange code)
    F->>IdP: Exchange code for tokens
    F-->>B: Set-Cookie + 302 → original URL
    B->>N: Retry with session cookie
    N->>F: Auth subrequest (GET /me)
    F-->>N: 200 OK
    N-->>B: Forward to backend service
```

## Prerequisites

- A Union.ai deployment with the control plane installed.
- An OIDC-compliant Identity Provider (IdP).
- Access to create OAuth applications in your IdP.
- A secret management solution for delivering client secrets to pods (e.g., External Secrets Operator with AWS Secrets Manager, HashiCorp Vault, or native Kubernetes secrets).

## Configuring your Identity Provider

You must create three OAuth applications in your IdP:

| Application | Type | Grant Types | Purpose |
|---|---|---|---|
| Web app (browser login) | Web | `authorization_code` | Console/web UI authentication |
| Native app (SDK/CLI) | Native (PKCE) | `authorization_code`, `device_code` | SDK and CLI authentication |
| Service app (internal) | Service | `client_credentials` | All service-to-service communication |

> [!NOTE]
> A single service app is shared by both control plane and dataplane services. If your security policy requires separate credentials per component, you can create additional service apps, but the configuration below assumes a single shared client.

### Authorization server setup

1. Create a custom authorization server in your IdP (or use the default).
2. Add a scope named `all`.
3. Add an access policy that allows all registered clients listed above.
4. Add a policy rule that permits `authorization_code`, `client_credentials`, and `device_code` grant types.
5. Note the **Issuer URI** (e.g., `https://your-idp.example.com/oauth2/<server-id>`).
6. Note the **Token endpoint** (e.g., `https://your-idp.example.com/oauth2/<server-id>/v1/token`).

### Application details

#### 1. Web application (browser login)

- **Type**: Web Application
- **Sign-on method**: OIDC
- **Grant types**: `authorization_code`
- **Sign-in redirect URI**: `https://<your-domain>/callback`
- **Sign-out redirect URI**: `https://<your-domain>/logout`
- Note the **Client ID** → used as `OIDC_CLIENT_ID`
- Note the **Client Secret** → stored in `flyte-admin-secrets` (see **Self-managed deployment > Advanced Configurations > Authentication > Secret delivery**)

#### 2. Native application (SDK/CLI)

- **Type**: Native Application
- **Sign-on method**: OIDC
- **Grant types**: `authorization_code`, `urn:ietf:params:oauth:grant-type:device_code`
- **Sign-in redirect URI**: `http://localhost:53593/callback`
- **Require PKCE**: Always
- **Consent**: Trusted (skip consent screen)
- Note the **Client ID** → used as `CLI_CLIENT_ID` (no secret needed for public clients)

#### 3. Service application (internal)

- **Type**: Service (machine-to-machine)
- **Grant types**: `client_credentials`
- Note the **Client ID** → used as `INTERNAL_CLIENT_ID` (control plane) and `AUTH_CLIENT_ID` (dataplane)
- Note the **Client Secret** → stored in multiple Kubernetes secrets (see **Self-managed deployment > Advanced Configurations > Authentication > Secret delivery**)

## Control plane Helm configuration

The control plane Helm chart requires auth configuration in several sections. All examples below use the global variables defined in `values.<cloud>.selfhosted-intracluster.yaml`.

### Global variables

Set these in your customer overrides file:

```yaml
global:
  OIDC_BASE_URL: "<issuer-uri>"             # e.g. "https://your-idp.example.com/oauth2/default"
  OIDC_CLIENT_ID: "<web-app-client-id>"     # Browser login
  CLI_CLIENT_ID: "<native-app-client-id>"   # SDK/CLI
  INTERNAL_CLIENT_ID: "<service-client-id>" # Service-to-service
  AUTH_TOKEN_URL: "<token-endpoint>"         # e.g. "https://your-idp.example.com/oauth2/default/v1/token"
```

### Flyteadmin OIDC configuration

Configure `flyteadmin` to act as the OIDC relying party. This enables the `/login`, `/callback`, `/me`, and `/logout` endpoints:

```yaml
flyte:
  configmap:
    adminServer:
      server:
        security:
          useAuth: true
      auth:
        grpcAuthorizationHeader: flyte-authorization
        httpAuthorizationHeader: flyte-authorization
        authorizedUris:
          - "http://flyteadmin:80"
          - "http://flyteadmin.<namespace>.svc.cluster.local:80"
        appAuth:
          authServerType: External
          externalAuthServer:
            baseUrl: "<issuer-uri>"
          thirdPartyConfig:
            flyteClient:
              clientId: "<native-app-client-id>"
              redirectUri: "http://localhost:53593/callback"
              scopes:
                - all
        userAuth:
          openId:
            baseUrl: "<issuer-uri>"
            clientId: "<web-app-client-id>"
            scopes:
              - profile
              - openid
              - offline_access
          cookieSetting:
            sameSitePolicy: LaxMode
            domain: ""
          idpQueryParameter: idp
```

Key settings:

- `useAuth: true` — registers the `/login`, `/callback`, `/me`, and `/logout` HTTP endpoints. **Required** for auth to function.
- `authServerType: External` — use your IdP as the authorization server (not flyteadmin's built-in server).
- `grpcAuthorizationHeader: flyte-authorization` — the header name used for bearer tokens. Both the SDK and internal services use this header.
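Flyteadmin resolves your IdP's endpoints through standard OIDC discovery. As a quick sanity check of the `baseUrl` you configure, the provider metadata document should be reachable at the well-known path derived like this (a minimal sketch; the example issuer is a placeholder):

```python
def discovery_url(issuer_base_url: str) -> str:
    # Per OIDC Discovery, provider metadata lives at a fixed
    # well-known path directly under the issuer URL.
    return issuer_base_url.rstrip("/") + "/.well-known/openid-configuration"

print(discovery_url("https://your-idp.example.com/oauth2/default"))
# https://your-idp.example.com/oauth2/default/.well-known/openid-configuration
```

If fetching that URL with `curl` returns the provider metadata (including `authorization_endpoint` and `token_endpoint`), the `baseUrl` is correct.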

### Flyteadmin and scheduler admin SDK client

Flyteadmin and the scheduler use the admin SDK to communicate with other control plane services. Configure client credentials so these calls are authenticated:

```yaml
flyte:
  configmap:
    adminServer:
      admin:
        clientId: "<service-client-id>"
        clientSecretLocation: "/etc/secrets/client_secret"
```

The secret is mounted from the `flyte-admin-secrets` Kubernetes secret (see **Self-managed deployment > Advanced Configurations > Authentication > Secret delivery**).

### Scheduler auth secret

The flyte-scheduler mounts a separate Kubernetes secret (`flyte-secret-auth`) at `/etc/secrets/`. Enable this mount:

```yaml
flyte:
  secrets:
    adminOauthClientCredentials:
      enabled: true
      clientSecret: "placeholder"
```

> [!NOTE]
> Setting `clientSecret: "placeholder"` causes the subchart to render the `flyte-secret-auth` Kubernetes Secret. Use External Secrets Operator with `creationPolicy: Merge` to overwrite the placeholder with the real credential, or create the secret directly before installing the chart.

### Service-to-service authentication

Control plane services communicate through nginx and need OAuth tokens. Configure the admin SDK client credentials and the union service auth:

```yaml
configMap:
  admin:
    clientId: "<service-client-id>"
    clientSecretLocation: "/etc/secrets/union/client_secret"
  union:
    auth:
      enable: true
      type: ClientSecret
      clientId: "<service-client-id>"
      clientSecretLocation: "/etc/secrets/union/client_secret"
      tokenUrl: "<token-endpoint>"
      authorizationMetadataKey: flyte-authorization
      scopes:
        - all
```

The secret is mounted from the control plane service secret (see **Self-managed deployment > Advanced Configurations > Authentication > Secret delivery**).

### Executions service

The executions service has its own admin client connection that also needs auth:

```yaml
services:
  executions:
    configMap:
      executions:
        app:
          adminClient:
            connection:
              authorizationHeader: flyte-authorization
              clientId: "<service-client-id>"
              clientSecretLocation: "/etc/secrets/union/client_secret"
              tokenUrl: "<token-endpoint>"
              scopes:
                - all
```

### Ingress auth annotations

The control plane ingress uses nginx auth subrequests to enforce authentication. These annotations are set on protected ingress routes:

```yaml
ingress:
  protectedIngressAnnotations:
    nginx.ingress.kubernetes.io/auth-url: "https://$host/me"
    nginx.ingress.kubernetes.io/auth-signin: "https://$host/login?redirect_url=$escaped_request_uri"
    nginx.ingress.kubernetes.io/auth-response-headers: "Set-Cookie"
    nginx.ingress.kubernetes.io/auth-cache-key: "$http_flyte_authorization$http_cookie"
  protectedIngressAnnotationsGrpc:
    nginx.ingress.kubernetes.io/auth-url: "https://$host/me"
    nginx.ingress.kubernetes.io/auth-response-headers: "Set-Cookie"
    nginx.ingress.kubernetes.io/auth-cache-key: "$http_authorization$http_flyte_authorization$http_cookie"
```

For every request to a protected route, nginx makes a subrequest to `/me`. If flyteadmin returns 200 (valid session or token), the request is forwarded. If 401, the user is redirected to `/login` for browser clients, or the 401 is returned directly for API clients.
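The gate nginx applies can be sketched as a small decision function (illustrative only; the real logic lives in nginx's auth-request module and the annotations above):

```python
def auth_gate(me_status: int, is_browser: bool) -> str:
    """Decide what nginx does with a request to a protected route,
    based on the status of the /me auth subrequest."""
    if me_status == 200:
        # Valid session or token: forward to the backend service
        return "forward to backend"
    if is_browser:
        # auth-signin annotation: send the user through the login flow
        return "302 redirect to /login"
    # API clients get the 401 back directly
    return "401 Unauthorized"

print(auth_gate(200, is_browser=True))   # forward to backend
print(auth_gate(401, is_browser=True))   # 302 redirect to /login
print(auth_gate(401, is_browser=False))  # 401 Unauthorized
```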

## Dataplane Helm configuration

When the control plane has OIDC enabled, the dataplane must also authenticate. All dataplane services use the same service app credentials (`AUTH_CLIENT_ID`), which is the same client as `INTERNAL_CLIENT_ID` on the control plane.

### Dataplane global variables

```yaml
global:
  AUTH_CLIENT_ID: "<service-client-id>"  # Same as INTERNAL_CLIENT_ID
```

### Cluster resource sync

```yaml
clusterresourcesync:
  config:
    union:
      auth:
        enable: true
        type: ClientSecret
        clientId: "<service-client-id>"
        clientSecretLocation: "/etc/union/secret/client_secret"
        authorizationMetadataKey: flyte-authorization
        tokenRefreshWindow: 5m
```

### Operator (union service auth)

```yaml
config:
  union:
    auth:
      enable: true
      type: ClientSecret
      clientId: "<service-client-id>"
      clientSecretLocation: "/etc/union/secret/client_secret"
      authorizationMetadataKey: flyte-authorization
      tokenRefreshWindow: 5m
```

### Propeller admin client

```yaml
config:
  admin:
    admin:
      clientId: "<service-client-id>"
      clientSecretLocation: "/etc/union/secret/client_secret"
```

### Executor (eager mode)

Injects the `EAGER_API_KEY` secret into task pods for authenticated eager-mode execution:

```yaml
executor:
  config:
    unionAuth:
      injectSecret: true
      secretName: EAGER_API_KEY
```

### Dataplane secrets

Enable the `union-secret-auth` Kubernetes secret mount for dataplane pods:

```yaml
secrets:
  admin:
    enable: true
    create: false
    clientId: "<service-client-id>"
    clientSecret: "placeholder"
```

> [!NOTE]
> `create: false` means the chart does not create the `union-secret-auth` Kubernetes Secret. You must provision it externally (see **Self-managed deployment > Advanced Configurations > Authentication > Secret delivery**). Setting `clientSecret: "placeholder"` with `create: true` is also supported if you want the chart to create the secret and then overwrite it via External Secrets Operator.

## Secret delivery

Client secrets must be delivered to pods as files mounted into the container filesystem. The table below lists the required Kubernetes secrets, their mount paths, and which components use them:

| Kubernetes Secret | Mount Path | Components | Namespace |
| --- | --- | --- | --- |
| `flyte-admin-secrets` | `/etc/secrets/` | flyteadmin | `union-cp` |
| `flyte-secret-auth` | `/etc/secrets/` | flyte-scheduler | `union-cp` |
| Control plane service secret | `/etc/secrets/union/` | executions, cluster, usage, and other CP services | `union-cp` |
| `union-secret-auth` | `/etc/union/secret/` | operator, propeller, CRS | `union` |

All secrets must contain a key named `client_secret` with the service app's OAuth client secret value.
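Kubernetes stores Secret values base64-encoded, so the `client_secret` key in each Secret's `data` map is simply the encoded OAuth client secret. A quick illustration (the secret value is a placeholder):

```python
import base64

client_secret = "example-oauth-client-secret"  # placeholder value

# What ends up under .data.client_secret in each Kubernetes Secret
encoded = base64.b64encode(client_secret.encode()).decode()
print(encoded)

# Decoding recovers the original value, which is how you can verify
# a delivered secret with `kubectl ... | base64 -d`
assert base64.b64decode(encoded).decode() == client_secret
```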

### Option A: External Secrets Operator (recommended)

If you use [External Secrets Operator (ESO)](https://external-secrets.io/) with a cloud secret store, create `ExternalSecret` resources that sync the client secret into each Kubernetes secret:

```yaml
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: flyte-admin-secrets-auth
  namespace: union-cp
spec:
  secretStoreRef:
    name: default
    kind: SecretStore
  refreshInterval: 1h
  target:
    name: flyte-admin-secrets
    creationPolicy: Merge
    deletionPolicy: Retain
  data:
    - secretKey: client_secret
      remoteRef:
        key: "<your-secret-store-key>"
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: flyte-secret-auth
  namespace: union-cp
spec:
  secretStoreRef:
    name: default
    kind: SecretStore
  refreshInterval: 1h
  target:
    name: flyte-secret-auth
    creationPolicy: Merge
    deletionPolicy: Retain
  data:
    - secretKey: client_secret
      remoteRef:
        key: "<your-secret-store-key>"
---
apiVersion: external-secrets.io/v1
kind: ExternalSecret
metadata:
  name: union-secret-auth
  namespace: union
spec:
  secretStoreRef:
    name: default
    kind: SecretStore
  refreshInterval: 1h
  target:
    name: union-secret-auth
    creationPolicy: Merge
    deletionPolicy: Retain
  data:
    - secretKey: client_secret
      remoteRef:
        key: "<your-secret-store-key>"
```

> [!NOTE]
> `creationPolicy: Merge` ensures the ExternalSecret adds the `client_secret` key alongside any existing keys in the target secret.

### Option B: Direct Kubernetes secrets

If you manage secrets directly:

```bash
# Control plane — flyteadmin
kubectl create secret generic flyte-admin-secrets \
  --from-literal=client_secret='<SERVICE_CLIENT_SECRET>' \
  -n union-cp

# Control plane — scheduler
kubectl create secret generic flyte-secret-auth \
  --from-literal=client_secret='<SERVICE_CLIENT_SECRET>' \
  -n union-cp

# Control plane — union services (add to existing secret)
kubectl create secret generic union-controlplane-secrets \
  --from-literal=pass.txt='<DB_PASSWORD>' \
  --from-literal=client_secret='<SERVICE_CLIENT_SECRET>' \
  -n union-cp --dry-run=client -o yaml | kubectl apply -f -

# Dataplane — operator, propeller, CRS
kubectl create secret generic union-secret-auth \
  --from-literal=client_secret='<SERVICE_CLIENT_SECRET>' \
  -n union
```

## SDK and CLI authentication

The SDK and CLI use PKCE (Proof Key for Code Exchange) for interactive authentication:

1. The SDK calls `AuthMetadataService/GetPublicClientConfig` (an unprotected endpoint) to discover the `flytectl` client ID and redirect URI.
2. The SDK opens a browser to the IdP's authorize endpoint with a PKCE challenge.
3. The user authenticates in the browser.
4. The IdP redirects to `localhost:53593/callback` with an authorization code.
5. The SDK exchanges the code for tokens and stores them locally.
6. Subsequent requests include the token in the `flyte-authorization` header.
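The PKCE challenge in step 2 is derived from a random verifier as specified by RFC 7636 (S256 method). A minimal sketch of what the SDK does under the hood:

```python
import base64
import hashlib
import secrets

def make_pkce_pair() -> tuple[str, str]:
    # A high-entropy, URL-safe code verifier (43-128 chars per RFC 7636)
    verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
    # S256: challenge = BASE64URL(SHA256(verifier)), without padding
    digest = hashlib.sha256(verifier.encode()).digest()
    challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()
    return verifier, challenge

verifier, challenge = make_pkce_pair()
print(len(verifier), len(challenge))  # 43 43
```

The verifier never leaves the client until the code exchange, so intercepting the authorization code alone is not enough to obtain tokens.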

No additional SDK configuration is required beyond the standard `uctl` or Union config:

```yaml
admin:
  endpoint: dns:///<your-domain>
  authType: Pkce
  insecure: false
```

For headless environments (CI/CD), use the **Self-managed deployment > Advanced Configurations > Authentication > SDK and CLI authentication > Client credentials for CI/CD** flow instead.

### Client credentials for CI/CD

For automated pipelines, create a service app in your IdP and configure:

```yaml
admin:
  endpoint: dns:///<your-domain>
  authType: ClientSecret
  clientId: "<your-ci-client-id>"
  clientSecretLocation: "/path/to/client_secret"
```

Or use environment variables:

```bash
export FLYTE_CREDENTIALS_CLIENT_ID="<your-ci-client-id>"
export FLYTE_CREDENTIALS_CLIENT_SECRET="<your-ci-client-secret>"
export FLYTE_CREDENTIALS_AUTH_MODE=basic
```
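Under the hood, the client-credentials exchange is a standard OAuth 2.0 token request (RFC 6749 §4.4): a POST to the token endpoint with HTTP Basic authentication. Roughly, with placeholder credentials and no actual network call:

```python
import base64

client_id = "my-ci-client-id"          # placeholder
client_secret = "my-ci-client-secret"  # placeholder

# HTTP Basic auth header: base64("<client_id>:<client_secret>")
basic = base64.b64encode(f"{client_id}:{client_secret}".encode()).decode()
headers = {
    "Authorization": f"Basic {basic}",
    "Content-Type": "application/x-www-form-urlencoded",
}
body = "grant_type=client_credentials&scope=all"
print(headers["Authorization"].startswith("Basic "))  # True
```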

## Troubleshooting

### Browser login redirects in a loop

Verify that `useAuth: true` is set in `flyte.configmap.adminServer.server.security`. Without this, the `/login`, `/callback`, and `/me` endpoints are not registered.

### SDK gets 401 Unauthenticated

1. Check that the `AuthMetadataService` routes are in the **unprotected** ingress (no auth-url annotation).
2. Verify the SDK can reach the token endpoint. The SDK discovers it via `AuthMetadataService/GetOAuth2Metadata`.
3. Check that `grpcAuthorizationHeader` matches the header name used by the SDK (`flyte-authorization`).

### Internal services get 401

1. Verify that `configMap.union.auth.enable: true` and the `client_secret` file exists at the configured `clientSecretLocation`.
2. Check `ExternalSecret` sync status: `kubectl get externalsecret -n <namespace>`.
3. Verify the secret contains the correct key: `kubectl get secret <secret-name> -n <namespace> -o jsonpath='{.data.client_secret}' | base64 -d`.

### Operator or propeller cannot authenticate

1. Verify `union-secret-auth` exists in the dataplane namespace and contains `client_secret`.
2. Check operator logs for auth errors: `kubectl logs -n union -l app.kubernetes.io/name=operator --tail=50 | grep -i auth`.
3. Verify the `AUTH_CLIENT_ID` matches the control plane's `INTERNAL_CLIENT_ID`.
4. Verify the service app is included in the authorization server's access policy.

### Scheduler fails to start

1. Verify `flyte-secret-auth` exists in the control plane namespace: `kubectl get secret flyte-secret-auth -n union-cp`.
2. Check that `flyte.secrets.adminOauthClientCredentials.enabled: true` is set.
3. Check scheduler logs: `kubectl logs -n union-cp deploy/flytescheduler --tail=50`.

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/code-viewer ===

# Code Viewer

The Union UI allows you to view the exact code that executed a specific task. Union securely transfers the [code bundle](https://www.union.ai/docs/v2/union/user-guide/run-scaling/life-of-a-run) directly to your browser without routing it through the control plane.

![Code Viewer](https://www.union.ai/docs/v2/union/_static/images/deployment/configuration/code-viewer/demo.png)

## Enable CORS policy on your fast registration bucket

To support this feature securely, your bucket must allow CORS access from Union. The configuration steps vary depending on your cloud provider.

### AWS S3 Console

1. Open the AWS Console.
2. Navigate to the S3 dashboard.
3. Select your fast registration bucket. By default, this is the same as the metadata bucket configured during initial deployment.
4. Click the **Permissions** tab and scroll to **Cross-origin resource sharing (CORS)**.
5. Click **Edit** and enter the following policy:
![S3 CORS Policy](https://www.union.ai/docs/v2/union/_static/images/deployment/configuration/code-viewer/s3.png)

```json
[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "HEAD"
        ],
        "AllowedOrigins": [
            "https://*.unionai.cloud"
        ],
        "ExposeHeaders": [
            "ETag"
        ],
        "MaxAgeSeconds": 3600
    }
]
```
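The `AllowedOrigins` entry uses a single `*` wildcard, so any tenant subdomain of `unionai.cloud` is permitted. Conceptually (a simplified check using `fnmatch`, not S3's actual matcher):

```python
from fnmatch import fnmatch

ALLOWED = "https://*.unionai.cloud"

def origin_allowed(origin: str) -> bool:
    # Simplified stand-in for S3's AllowedOrigins wildcard matching
    return fnmatch(origin, ALLOWED)

print(origin_allowed("https://acme.unionai.cloud"))  # True
print(origin_allowed("https://evil.example.com"))    # False
```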

For more details, see the [AWS S3 CORS documentation](https://docs.aws.amazon.com/AmazonS3/latest/userguide/cors.html).

### Google GCS

Google Cloud Storage requires CORS configuration via the command line.

1. Create a `cors.json` file with the following content:
    ```json
    [
        {
            "origin": ["https://*.unionai.cloud"],
            "method": ["HEAD", "GET"],
            "responseHeader": ["ETag"],
            "maxAgeSeconds": 3600
        }
    ]
    ```
2. Apply the CORS configuration to your bucket:
    ```bash
    gcloud storage buckets update gs://<fast_registration_bucket> --cors-file=cors.json
    ```
3. Verify the configuration was applied:
   ```bash
   gcloud storage buckets describe gs://<fast_registration_bucket> --format="default(cors_config)"

   cors_config:
   - maxAgeSeconds: 3600
     method:
     - GET
     - HEAD
     origin:
     - https://*.unionai.cloud
     responseHeader:
     - ETag
   ```
For more details, see the [Google Cloud Storage CORS documentation](https://docs.cloud.google.com/storage/docs/using-cors#command-line).

### Azure Storage

For Azure Storage CORS configuration, see the [Azure Storage CORS documentation](https://learn.microsoft.com/en-us/rest/api/storageservices/cross-origin-resource-sharing--cors--support-for-the-azure-storage-services).

## Troubleshooting

| Error Message | Cause | Fix |
|---------------|-------|-----|
| `Not available: No code available for this action.` | The task does not have a code bundle. This occurs when the code is baked into the Docker image or the task is not a code-based task. | This is expected behavior for tasks without code bundles. |
| `Not Found: The code bundle file could not be found. This may be due to your organization's data retention policy.` | The code bundle was deleted from the bucket, likely due to a retention policy. | Review your fast registration bucket's retention policy settings. |
| `Error: Code download is blocked by your storage bucket's configuration. Please contact your administrator to enable access.` | CORS is not configured on the bucket. | Configure CORS on your bucket using the instructions above. |

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/image-builder ===

# Image Builder

Union Image Builder supports the ability to build container images within the dataplane. This enables the use of the `remote` builder type for any defined [Container Image](https://www.union.ai/docs/v2/union/user-guide/task-configuration/container-images).

Configure the flyte CLI to use the remote image builder:
```bash
flyte create config --builder=remote --endpoint...
```

Write custom [container images](https://www.union.ai/docs/v2/union/user-guide/task-configuration/container-images):
```python
env = flyte.TaskEnvironment(
    name="hello_v2",
    image=flyte.Image.from_debian_base()
        .with_pip_packages("<package 1>", "<package 2>")
)
```

> By default, Image Builder is disabled and must be enabled by setting the builder type to `remote` in your flyte config.

## Requirements

* The image building process runs in the target run's project and domain. Any image push secrets needed to push images to the registry will need to be accessible from the project & domain where the build happens.

## Configuration

Image Builder is configured directly through Helm values.

```yaml
imageBuilder:

  # Enable Image Builder
  enabled: true

  # -- The config map build-image container task attempts to reference.
  # -- Should not change unless coordinated with Union technical support.
  targetConfigMapName: "build-image-config"

  # -- The URI of the buildkitd service. Used for externally managed buildkitd services.
  # -- Leaving empty and setting imageBuilder.buildkit.enabled to true will create a buildkitd service and configure the Uri appropriately.
  # -- E.g. "tcp://buildkitd.buildkit.svc.cluster.local:1234"
  buildkitUri: ""

  # -- The default repository to publish images to when "registry" is not specified in ImageSpec.
  # -- Note, the build-image task will fail unless "registry" is specified or a default repository is provided.
  defaultRepository: ""

  # -- How the build-image task and operator proxy will attempt to authenticate against the default repository.
  # -- Supported values are "noop", "google", "aws", "azure"
  # -- "noop" no authentication is attempted
  # -- "google" uses docker-credential-gcr to authenticate to the default registry
  # -- "aws" uses docker-credential-ecr-login to authenticate to the default registry
  # -- "azure" uses az acr login to authenticate to the default registry. Requires Azure Workload Identity to be enabled.
  authenticationType: "noop"

  buildkit:

    # -- Enable buildkit service within this release.
    enabled: true

    # Configuring Union managed buildkitd Kubernetes resources.
    ...
```

## Authentication

### AWS

By default, Union is intended to be configured to use [IAM roles for service accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) for authentication. Setting `authenticationType` to `aws` configures the Union image builder services to use the AWS default credential chain. Additionally, the Union image builder uses [`docker-credential-ecr-login`](https://github.com/awslabs/amazon-ecr-credential-helper) to authenticate to the ECR repository configured with `defaultRepository`.

`defaultRepository` should be the fully qualified ECR repository name, e.g. `<AWS_ACCOUNT_ID>.dkr.ecr.<AWS_REGION>.amazonaws.com/<REPOSITORY_NAME>`.
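For example, assembling the fully qualified value from its parts (all placeholders):

```python
aws_account_id = "123456789012"   # placeholder
aws_region = "us-east-1"          # placeholder
repository_name = "union-images"  # placeholder

default_repository = f"{aws_account_id}.dkr.ecr.{aws_region}.amazonaws.com/{repository_name}"
print(default_repository)
# 123456789012.dkr.ecr.us-east-1.amazonaws.com/union-images
```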

It is therefore necessary to configure the user role with the following permissions:

```json
{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": [
    "ecr:BatchCheckLayerAvailability",
    "ecr:BatchGetImage",
    "ecr:GetDownloadUrlForLayer"
  ],
  "Resource": "*"
  // Or
  // "Resource": "arn:aws:ecr:<AWS_REGION>:<AWS_ACCOUNT_ID>:repository/<REPOSITORY>"
}
```

Similarly, the `operator-proxy` requires the following permissions:

```json
{
  "Effect": "Allow",
  "Action": [
    "ecr:GetAuthorizationToken"
  ],
  "Resource": "*"
},
{
  "Effect": "Allow",
  "Action": [
    "ecr:DescribeImages"
  ],
  "Resource": "arn:aws:ecr:<AWS_REGION>:<AWS_ACCOUNT_ID>:repository/<REPOSITORY>"
}
```

#### AWS Cross Account access

Access to repositories that do not exist in the same AWS account as the data plane requires additional ECR resource-based permissions. An ECR policy like the following is required if the configured `defaultRepository` or `ImageSpec`'s `registry` exists in an AWS account different from the dataplane's.

```json
{
  "Statement": [
    {
      "Sid": "AllowPull",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<DATAPLANE_AWS_ACCOUNT>:role/<user-role>",
          "arn:aws:iam::<DATAPLANE_AWS_ACCOUNT>:role/<node-role>",
          // ... Additional roles that require image pulls
        ]
      },
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    },
    {
      "Sid": "AllowDescribeImages",
      "Action": [
        "ecr:DescribeImages"
      ],
      "Principal": {
        "AWS": [
          "arn:aws:iam::<DATAPLANE_AWS_ACCOUNT>:role/<operator-proxy-role>",
        ]
      },
      "Effect": "Allow"
    },
    {
      "Sid": "ManageRepositoryContents"
      // ...
    }
  ],
  "Version": "2012-10-17"
}
```

To support a private ImageSpec `base_image`, the following permissions are required:

```json
{
  "Statement": [
    {
      "Sid": "AllowPull",
      "Effect": "Allow",
      "Principal": {
        "AWS": [
          "arn:aws:iam::<DATAPLANE_AWS_ACCOUNT>:role/<user-role>",
          "arn:aws:iam::<DATAPLANE_AWS_ACCOUNT>:role/<node-role>",
          // ... Additional roles that require image pulls
        ]
      },
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:BatchGetImage",
        "ecr:GetDownloadUrlForLayer"
      ]
    }
  ]
}
```

### Google Cloud Platform

By default, GCP uses [Kubernetes Service Accounts bound to GCP IAM](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#kubernetes-sa-to-iam) for authentication. Setting `authenticationType` to `google` configures the Union image builder services to use the GCP default credential chain. Additionally, the Union image builder uses [`docker-credential-gcr`](https://github.com/GoogleCloudPlatform/docker-credential-gcr) to authenticate to the Google Artifact Registry repositories referenced by `defaultRepository`.

`defaultRepository` should be the full repository name, optionally combined with an image name prefix: `<GCP_LOCATION>-docker.pkg.dev/<GCP_PROJECT_ID>/<REPOSITORY_NAME>/<IMAGE_PREFIX>`.

It is necessary to grant the GCP user service account the project-level `iam.serviceAccounts.signBlob` permission.

#### GCP Cross Project access

Access to registries that do not exist in the same GCP project as the data plane requires additional GCP permissions.

* Configure the user "role" service account with the `Artifact Registry Writer`.
* Configure the GCP worker node and union-operator-proxy service accounts with the `Artifact Registry Reader` role.

### Azure

By default, Union is designed to use Azure [Workload Identity Federation](https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster) for authentication using [user-assigned managed identities](https://learn.microsoft.com/en-us/entra/identity/managed-identities-azure-resources/how-manage-user-assigned-managed-identities?pivots=identity-mi-methods-azp) in place of AWS IAM roles.

* Configure the user "role" user-assigned managed identity with the `AcrPush` role.
* Configure the Azure kubelet identity and the operator-proxy user-assigned managed identities with the `AcrPull` role.

### Private registries

Follow the guidance in this section to integrate Image Builder with private registries:

#### GitHub Container Registry

1. Follow the [GitHub guide](https://docs.github.com/en/packages/working-with-a-github-packages-registry/working-with-the-container-registry) to log in to the registry locally.
2. Create a Union secret:
```bash
flyte create secret --type image_pull --from-docker-config --registries ghcr.io SECRET_NAME
```

> This secret will be available to all projects and domains in your tenant. [Learn more about Union Secrets](./union-secrets).
> For alternative ways to create image pull secrets, see the [API reference](https://www.union.ai/docs/v2/union/api-reference/flyte-cli).

3. Reference this secret in the Image object:

```python
env = flyte.TaskEnvironment(
    name="hello_v2",
    # Allow the image builder to pull from and push to the private registry. The `registry` field isn't
    # required if it's configured as the default registry in the imagebuilder section of the Helm values file.
    image=flyte.Image.from_debian_base(registry="<my registry url>", name="private", registry_secret="<YOUR_SECRET_NAME>")
        .with_pip_packages("<package 1>", "<package 2>"),
    # Mount the same secret to allow tasks to pull that image
    secrets=["<YOUR_SECRET_NAME>"]
)
```

This enables Image Builder to push images and layers to a private GHCR repository, and allows pods for this task environment to pull the image at runtime.

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/multi-cluster ===

# Multiple Clusters

Union enables you to integrate multiple Kubernetes clusters into a single Union control plane using the `clusterPool` abstraction.

Currently, the clusterPool configuration is performed by Union in the control plane. You provide the mapping between clusterPool names and cluster names using the following structure:

```yaml
clusterPoolname:
  - clusterName
```
Here, `clusterName` must match the name you used when installing the Union operator Helm chart.

You can have as many cluster pools as needed:

**Example**

```yaml
default: # this is the clusterPool where executions will run unless another mapping is specified
  - my-dev-cluster
development-cp:
  - my-dev-cluster
staging-cp:
  - my-staging-cluster
production-cp:
  - production-cluster-1
  - production-cluster-2
dr-region:
  - dr-site-cluster
```

## Using cluster pools

Once the Union team configures the clusterPools in the control plane, you can proceed to configure mappings:

### project-domain-clusterPool mapping

1. Create a YAML file that includes the project, domain, and clusterPool:

**Example: cpa-dev.yaml**

```yaml
domain: development
project: flytesnacks
clusterPoolName: development-cp
```

2. Update the control plane with this mapping:

```bash
uctl update cluster-pool-attributes --attrFile cpa-dev.yaml
```
3. New executions in `flytesnacks-development` should now run in `my-dev-cluster`.

### project-domain-workflow-clusterPool mapping

1. Create a YAML file that includes the project, domain, and clusterPool:

**Example: cpa-prod.yaml**

```yaml
domain: production
project: flytesnacks
workflow: my_critical_wf
clusterPoolName: production-cp
```

2. Update the control plane with this mapping:

```bash
uctl update cluster-pool-attributes --attrFile cpa-prod.yaml
```
3. New executions of the `my_critical_wf` workflow in `flytesnacks-production` should now run in any of the clusters under `production-cp`.
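Putting the two mapping types together, resolution can be sketched as follows (a hypothetical illustration; it assumes the more specific workflow-level mapping wins over the project-domain mapping, with the `default` pool as the final fallback):

```python
# Hypothetical sketch of cluster pool resolution. Assumes the
# project-domain-workflow mapping takes precedence over the
# project-domain mapping, with "default" as the final fallback.
def resolve_cluster_pool(attrs, project, domain, workflow=None):
    """Return the clusterPool name an execution would be routed to."""
    if workflow and (project, domain, workflow) in attrs:
        return attrs[(project, domain, workflow)]
    if (project, domain) in attrs:
        return attrs[(project, domain)]
    return "default"

# Mappings registered via `uctl update cluster-pool-attributes`:
attrs = {
    ("flytesnacks", "development"): "development-cp",
    ("flytesnacks", "production", "my_critical_wf"): "production-cp",
}

print(resolve_cluster_pool(attrs, "flytesnacks", "production", "my_critical_wf"))  # production-cp
print(resolve_cluster_pool(attrs, "flytesnacks", "staging"))  # default
```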

## Data sharing between cluster pools

The sharing of metadata is controlled by the cluster pool to which a cluster belongs. If two clusters are in the same cluster pool, then they must share the same metadata bucket, defined in the Helm values as `storage.bucketName`.

If they are in different cluster pools, then they **must** have different metadata buckets. You could, for example, have a single metadata bucket for all your development clusters, and a separate one for all your production clusters, by grouping the clusters into cluster pools accordingly.

Alternatively, you could have a separate metadata bucket for each cluster by putting each cluster in its own cluster pool.
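Applied to the example pool layout above, a sketch of the per-cluster Helm values might look like this (bucket names are placeholders):

```yaml
# Values for production-cluster-1 and production-cluster-2: same pool, so the
# same metadata bucket (hypothetical name).
storage:
  bucketName: union-metadata-production

# Values for my-dev-cluster: different pool, so a different bucket.
# storage:
#   bucketName: union-metadata-development
```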

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/persistent-logs ===

# Persistent logs

Persistent logging is enabled by default. The data plane deploys [FluentBit](https://fluentbit.io/) as a DaemonSet that collects container logs from every node and writes them to the `persisted-logs/` path in the object store configured for your data plane.

FluentBit runs under the `fluentbit-system` Kubernetes service account. This service account must have write access to the storage bucket so FluentBit can push logs. The sections below describe how to grant that access on each cloud provider.

## AWS (IRSA)

On EKS, use [IAM Roles for Service Accounts (IRSA)](https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts.html) to grant the FluentBit service account permission to write to S3.

### 1. Create an IAM policy

Create an IAM policy that allows writing to your metadata S3 bucket:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::<BUCKET_NAME>",
        "arn:aws:s3:::<BUCKET_NAME>/persisted-logs/*"
      ]
    }
  ]
}
```

Replace `<BUCKET_NAME>` with the name of your data plane metadata bucket.

### 2. Create an IAM role with a trust policy

Create an IAM role that trusts the EKS OIDC provider and is scoped to the `fluentbit-system` service account:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/<OIDC_PROVIDER>"
      },
      "Action": "sts:AssumeRoleWithWebIdentity",
      "Condition": {
        "StringEquals": {
          "<OIDC_PROVIDER>:sub": "system:serviceaccount:<NAMESPACE>:fluentbit-system",
          "<OIDC_PROVIDER>:aud": "sts.amazonaws.com"
        }
      }
    }
  ]
}
```

Replace:

- `<ACCOUNT_ID>` with your AWS account ID
- `<OIDC_PROVIDER>` with your EKS cluster's OIDC provider (e.g. `oidc.eks.us-east-1.amazonaws.com/id/EXAMPLE`)
- `<NAMESPACE>` with the namespace where the data plane is installed (default: `union`)

You can retrieve the OIDC provider URL with:

```bash
aws eks describe-cluster --name <CLUSTER_NAME> --region <REGION> \
  --query "cluster.identity.oidc.issuer" --output text
```

Attach the IAM policy from step 1 to this role.

### 3. Configure the Helm values

Set the IRSA annotation on the FluentBit service account in your data plane Helm values:

```yaml
fluentbit:
  serviceAccount:
    name: fluentbit-system
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<ACCOUNT_ID>:role/<FLUENTBIT_ROLE_NAME>"
```

## Azure (Workload Identity Federation)

On AKS, use [Microsoft Entra Workload Identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-overview) to grant the FluentBit service account access to Azure Blob Storage.

### Azure prerequisites

- Your AKS cluster must be [enabled as an OIDC Issuer](https://learn.microsoft.com/en-us/azure/aks/use-oidc-issuer)
- The [Azure Workload Identity](https://learn.microsoft.com/en-us/azure/aks/workload-identity-deploy-cluster) mutating webhook must be installed on your cluster

### 1. Create or reuse a Managed Identity

Create a User Assigned Managed Identity (or reuse an existing one):

```bash
az identity create \
  --name fluentbit-identity \
  --resource-group <RESOURCE_GROUP> \
  --location <LOCATION>
```

Note the `clientId` from the output.

### 2. Add a federated credential

Create a federated credential that maps the `fluentbit-system` Kubernetes service account to the managed identity:

```bash
az identity federated-credential create \
  --name fluentbit-federated-credential \
  --identity-name fluentbit-identity \
  --resource-group <RESOURCE_GROUP> \
  --issuer <AKS_OIDC_ISSUER_URL> \
  --subject "system:serviceaccount:<NAMESPACE>:fluentbit-system" \
  --audiences "api://AzureADTokenExchange"
```

Replace:

- `<RESOURCE_GROUP>` with your Azure resource group
- `<AKS_OIDC_ISSUER_URL>` with the OIDC issuer URL of your AKS cluster
- `<NAMESPACE>` with the namespace where the data plane is installed (default: `union`)

You can retrieve the OIDC issuer URL with:

```bash
az aks show --name <CLUSTER_NAME> --resource-group <RESOURCE_GROUP> \
  --query "oidcIssuerProfile.issuerUrl" --output tsv
```

### 3. Assign a storage role

Assign the `Storage Blob Data Contributor` role to the managed identity at the storage account level:

```bash
az role assignment create \
  --assignee <CLIENT_ID> \
  --role "Storage Blob Data Contributor" \
  --scope "/subscriptions/<SUBSCRIPTION_ID>/resourceGroups/<RESOURCE_GROUP>/providers/Microsoft.Storage/storageAccounts/<STORAGE_ACCOUNT>"
```

### 4. Configure the Azure Helm values

Set the Workload Identity annotation on the FluentBit service account in your data plane Helm values:

```yaml
fluentbit:
  serviceAccount:
    name: fluentbit-system
    annotations:
      azure.workload.identity/client-id: "<CLIENT_ID>"
```

You must also ensure the FluentBit pods have the Workload Identity label. If you have already set `additionalPodLabels` for your data plane, confirm the following label is present:

```yaml
additionalPodLabels:
  azure.workload.identity/use: "true"
```

## GCP (Workload Identity)

On GKE, use [GKE Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity) to grant the FluentBit service account access to GCS.

### GCP prerequisites

- [Workload Identity](https://cloud.google.com/kubernetes-engine/docs/how-to/workload-identity#enable) must be enabled on your GKE cluster

### 1. Create or reuse a GCP service account

Create a GCP service account (or reuse an existing one):

```bash
gcloud iam service-accounts create fluentbit-gsa \
  --display-name "FluentBit logging service account" \
  --project <PROJECT_ID>
```

### 2. Grant storage permissions

Grant the service account write access to the metadata bucket:

```bash
gcloud storage buckets add-iam-policy-binding gs://<BUCKET_NAME> \
  --member "serviceAccount:fluentbit-gsa@<PROJECT_ID>.iam.gserviceaccount.com" \
  --role "roles/storage.objectAdmin"
```

### 3. Bind the Kubernetes service account to the GCP service account

Allow the `fluentbit-system` Kubernetes service account to impersonate the GCP service account:

```bash
gcloud iam service-accounts add-iam-policy-binding \
  fluentbit-gsa@<PROJECT_ID>.iam.gserviceaccount.com \
  --role "roles/iam.workloadIdentityUser" \
  --member "serviceAccount:<PROJECT_ID>.svc.id.goog[<NAMESPACE>/fluentbit-system]"
```

Replace:

- `<PROJECT_ID>` with your GCP project ID
- `<BUCKET_NAME>` with the name of your data plane metadata bucket
- `<NAMESPACE>` with the namespace where the data plane is installed (default: `union`)

### 4. Configure the GCP Helm values

Set the Workload Identity annotation on the FluentBit service account in your data plane Helm values:

```yaml
fluentbit:
  serviceAccount:
    name: fluentbit-system
    annotations:
      iam.gke.io/gcp-service-account: "fluentbit-gsa@<PROJECT_ID>.iam.gserviceaccount.com"
```

## Disabling persistent logs

To disable persistent logging entirely, set the following in your Helm values:

```yaml
fluentbit:
  enabled: false
```

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/monitoring ===

# Monitoring

The Union.ai data plane deploys a static [Prometheus](https://prometheus.io/) instance that collects metrics required for platform features like cost tracking, task-level resource monitoring, and execution observability. This Prometheus instance is pre-configured and requires no additional setup.

For operational monitoring of the cluster itself (node health, API server metrics, CoreDNS, etc.), the data plane chart includes an optional [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) instance that can be enabled separately.

## Architecture overview

The data plane supports two independent monitoring concerns:

| Concern | What it monitors | How it's deployed | Configurable |
|---------|-----------------|-------------------|--------------|
| **Union features** | Task execution metrics, cost tracking, GPU utilization, container resources | Static Prometheus with pre-built scrape config | Retention, resources, scheduling |
| **Cluster health** (optional) | Kubernetes components, node health, alerting, Grafana dashboards | `kube-prometheus-stack` via `monitoring.enabled` | Full kube-prometheus-stack values |

```
                    ┌─────────────────────────────────────┐
                    │         Data Plane Cluster          │
                    │                                     │
                    │  ┌──────────────────────┐           │
                    │  │  Static Prometheus   │           │
                    │  │  (Union features)    │           │
                    │  │  ┌────────────────┐  │           │
                    │  │  │ Scrape targets │  │           │
                    │  │  │ - kube-state   │  │           │
                    │  │  │ - cAdvisor     │  │           │
                    │  │  │ - propeller    │  │           │
                    │  │  │ - opencost     │  │           │
                    │  │  │ - dcgm (GPU)   │  │           │
                    │  │  │ - envoy        │  │           │
                    │  │  └────────────────┘  │           │
                    │  └──────────────────────┘           │
                    │                                     │
                    │  ┌──────────────────────┐           │
                    │  │  kube-prometheus     │           │
                    │  │  -stack (optional)   │           │
                    │  │  - Prometheus        │           │
                    │  │  - Alertmanager      │           │
                    │  │  - Grafana           │           │
                    │  │  - node-exporter     │           │
                    │  └──────────────────────┘           │
                    └─────────────────────────────────────┘
```

## Union features Prometheus

The static Prometheus instance is always deployed and pre-configured to scrape the metrics that Union.ai requires. No Prometheus Operator or CRDs are needed. This instance is a platform dependency and should not be replaced or reconfigured.

### Scrape targets

The following targets are scraped automatically:

| Job | Metrics collected | Used for |
|-----|-------------------|----------|
| `kube-state-metrics` | Pod/node resource requests, limits, status, capacity | Cost calculations, resource tracking |
| `kubernetes-cadvisor` | Container CPU and memory usage via kubelet | Task-level resource monitoring |
| `flytepropeller` | Execution round info, fast task duration | Execution observability |
| `opencost` | Node hourly cost rates (CPU, RAM, GPU) | Cost tracking |
| `gpu-metrics` | DCGM exporter metrics (when `dcgm-exporter.enabled`) | GPU utilization |
| `serving-envoy` | Envoy upstream request counts and latency (when `serving.enabled`) | Inference serving metrics |

### Configuration

The static Prometheus instance is configured under the `prometheus` key in your data plane values:

```yaml
prometheus:
  image:
    repository: prom/prometheus
    tag: v3.3.1
  # Data retention period
  retention: 3d
  # Route prefix for the web UI and API
  routePrefix: /prometheus/
  resources:
    limits:
      cpu: "3"
      memory: "3500Mi"
    requests:
      cpu: "1"
      memory: "1Gi"
  serviceAccount:
    create: true
    annotations: {}
  priorityClassName: system-cluster-critical
  nodeSelector: {}
  tolerations: []
  affinity: {}
```

> [!NOTE] Retention and storage
> The default 3-day retention is sufficient for Union.ai features. Increase `retention` if you query historical feature metrics directly.

### Internal service endpoint

Other data plane components reach Prometheus at:

```
http://union-operator-prometheus.<NAMESPACE>.svc:80/prometheus
```

OpenCost is pre-configured to use this endpoint. You do not need to change it unless you rename the Helm release.

## Enabling cluster health monitoring

To enable operational monitoring with Prometheus Operator, Alertmanager, Grafana, and node-exporter:

```yaml
monitoring:
  enabled: true
```

This deploys a full [kube-prometheus-stack](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) instance with sensible defaults:

- Prometheus with 7-day retention
- Grafana with admin credentials (override `monitoring.grafana.adminPassword` in production)
- Node exporter, kube-state-metrics, kubelet, CoreDNS, API server, etcd, and scheduler monitoring
- Default alerting and recording rules

### Prometheus Operator CRDs

The `kube-prometheus-stack` uses the Prometheus Operator, which discovers scrape targets and alerting rules through Kubernetes CRDs (ServiceMonitor, PodMonitor, PrometheusRule, etc.). If you prefer to use static scrape configs with your own Prometheus instead, see **Self-managed deployment > Advanced Configurations > Monitoring > Scraping Union services from your own Prometheus**.

To install the CRDs, use the `dataplane-crds` chart:

```yaml
# dataplane-crds values
crds:
  flyte: true
  prometheusOperator: true  # Install Prometheus Operator CRDs
```

Then install or upgrade the CRDs chart before the data plane chart:

```shell
helm upgrade --install union-dataplane-crds unionai/dataplane-crds \
  --namespace union \
  --set crds.prometheusOperator=true
```

> [!NOTE] CRD installation order
> CRDs must be installed before the data plane chart. The `dataplane-crds` chart should be deployed first, and the monitoring stack's own CRD installation is disabled (`monitoring.crds.enabled: false`) to avoid conflicts.

### Customizing the monitoring stack

The monitoring stack accepts all [kube-prometheus-stack values](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack#configuration) under the `monitoring` key. Common overrides:

```yaml
monitoring:
  enabled: true

  # Grafana
  grafana:
    enabled: true
    adminPassword: "my-secure-password"
    ingress:
      enabled: true
      ingressClassName: nginx
      hosts:
        - grafana.example.com

  # Prometheus retention and resources
  prometheus:
    prometheusSpec:
      retention: 30d
      resources:
        requests:
          memory: "2Gi"

  # Alertmanager
  alertmanager:
    enabled: true
    # Configure receivers, routes, etc.
```

The monitoring stack's Prometheus supports [remote write](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write) for forwarding metrics to external time-series databases (Amazon Managed Prometheus, Grafana Cloud, Thanos, etc.):

```yaml
monitoring:
  prometheus:
    prometheusSpec:
      remoteWrite:
        - url: "https://aps-workspaces.<REGION>.amazonaws.com/workspaces/<WORKSPACE_ID>/api/v1/remote_write"
          sigv4:
            region: <REGION>
```

For the full set of configurable values, see the [kube-prometheus-stack chart documentation](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack).

## Scraping Union services from your own Prometheus

If you already run Prometheus in your cluster, you can scrape Union.ai data plane services for operational visibility. All services expose metrics on standard ports.

> [!NOTE] Union features Prometheus
> The built-in static Prometheus handles all metrics required for Union.ai platform features. Scraping from your own Prometheus is for additional operational visibility only -- it does not replace the built-in instance.

### Static scrape configs

Add these jobs to your Prometheus configuration:

```yaml
scrape_configs:
  # Data plane service metrics (operator, propeller, etc.)
  - job_name: union-dataplane-services
    kubernetes_sd_configs:
      - role: endpoints
        namespaces:
          names: [union]
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_label_app_kubernetes_io_instance]
        regex: union-dataplane
        action: keep
      - source_labels: [__meta_kubernetes_endpoint_port_name]
        regex: debug
        action: keep
```

### ServiceMonitor (Prometheus Operator)

If you run the Prometheus Operator, create a ServiceMonitor instead:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: union-dataplane-services
  namespace: union
spec:
  selector:
    matchLabels:
      app.kubernetes.io/instance: union-dataplane
  namespaceSelector:
    matchNames:
      - union
  endpoints:
    - port: debug
      path: /metrics
      interval: 30s
```

This requires the Prometheus Operator CRDs. Install them via the `dataplane-crds` chart with `crds.prometheusOperator: true`.

## Further reading

- [Prometheus documentation](https://prometheus.io/docs/introduction/overview/) -- comprehensive guide to Prometheus configuration, querying, and operation
- [Prometheus remote write](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#remote_write) -- forwarding metrics to external storage
- [Prometheus `kubernetes_sd_config`](https://prometheus.io/docs/prometheus/latest/configuration/configuration/#kubernetes_sd_config) -- Kubernetes service discovery for scrape targets
- [kube-prometheus-stack chart](https://github.com/prometheus-community/helm-charts/tree/main/charts/kube-prometheus-stack) -- full monitoring stack with Grafana and alerting
- [OpenCost documentation](https://www.opencost.io/docs/) -- cost allocation and tracking

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/union-secrets ===

# Secrets

[Union Secrets](https://www.union.ai/docs/v2/union/user-guide/task-configuration/secrets) are enabled by default. Union Secrets are managed secrets created through the native Kubernetes secret manager.

The only configurable option is the namespace where the secret is stored. To override the default behavior, set `proxy.secretManager.namespace` in the Helm values file. If this is not specified, the `union` namespace is used by default.

Example:
```yaml
proxy:
  secretManager:
    # -- Set the namespace for union managed secrets created through the native Kubernetes secret manager. If the namespace is not set,
    # the release namespace will be used.
    namespace: "secret"
```

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/data-retention ===

Implications of object storage retention or lifecycle policies on the default bucket and metadata.

# Data retention policies

Union.ai relies on object storage for both **metadata** and **raw data** (your data that is passing through the workflow). Bucket-level retention and lifecycle policies (such as S3 lifecycle rules) that affect the metadata store can cause execution failures, broken history, and data loss.

## How Union.ai uses the default bucket

The platform uses a **default object store bucket** in the data plane for two distinct purposes:

1. **Metadata store** — References, execution state, and pointers to task outputs. The control plane and UI use this metadata to schedule workflows, resolve task dependencies, display execution history, and resolve output locations. This data is required for the correct operation of the platform.

2. **Raw data store** — Large task inputs and outputs or complex types (for example `FlyteFile`, dataframes, etc.). The metadata store holds only pointers to these blobs; the actual bytes live in the raw data store.

Because the **default bucket contains the metadata store**, it must be treated as **durable storage**. Retention or lifecycle policies that delete or overwrite objects in this bucket are **not supported** and can lead to data loss and system failure. There is **no supported way** to recover from metadata loss.

## Impact of metadata loss

| Area | Impact |
|------|--------|
| **UI and APIs** | Execution list or detail views may show errors or "resource not found." Output previews may fail to load. |
| **Execution engine** | In-flight or downstream tasks that depend on a node's output can fail. Retry state may be lost. |
| **Caching** | Pointers to cached outputs may be lost, resulting in cache misses; tasks may re-run or fail. |
| **Traces** | [Trace](https://www.union.ai/docs/v2/union/user-guide/task-programming/traces) checkpoint data (used by `@flyte.trace` for fine-grained recovery from system failures) may be lost, preventing resume-from-checkpoint. |
| **Data** | Raw blobs may still exist, but without metadata the system has no pointers to them. That data becomes **orphaned**. Downstream tasks that consume outputs by reference will fail at runtime. |
| **Operations** | Audit trails and the record of what ran, when, and with what outputs are lost. |

## Retention on a separate raw-data location

If you separate raw data from metadata, you can apply retention policies **only to the raw data location** while keeping metadata durable. This is the only supported approach for applying retention. You can do this either by configuring separate buckets using `configuration.storage.metadataContainer` and `configuration.storage.userDataContainer` in the [data plane chart](https://github.com/unionai/helm-charts/blob/master/charts/dataplane/values.yaml), or by using a metadata prefix within the same bucket (see **Self-managed deployment > Advanced Configurations > Data retention policies > Customizing the metadata path** below).

Be aware of the trade-offs:

- **Historical executions** that reference purged raw data will fail.
- **Cached task outputs** stored as raw data will be lost, causing cache misses and task re-execution.
- **Trace checkpoints** stored in the raw-data location will be purged, preventing resume-from-checkpoint for affected executions.

Data correctness is not silently violated, but the benefits of caching and trace-based recovery are lost for purged data.

## Customizing the metadata path

You can control where metadata is stored within the bucket via the **`config.core.propeller.metadata-prefix`** setting (e.g. `metadata/propeller` in the [data plane chart values](https://github.com/unionai/helm-charts/blob/master/charts/dataplane/values.yaml)). This lets you design lifecycle rules that **exclude** the metadata prefix (for example, in S3 lifecycle rules, apply expiration only to prefixes that do not include the metadata path) so that only non-metadata paths are subject to retention.
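For example, an S3 lifecycle configuration that applies expiration only to a non-metadata prefix might look like the following sketch (the `raw-data/` prefix is a placeholder for your actual raw data path; verify it against your deployment's bucket layout):

```json
{
  "Rules": [
    {
      "ID": "expire-raw-data-only",
      "Status": "Enabled",
      "Filter": { "Prefix": "raw-data/" },
      "Expiration": { "Days": 30 }
    }
  ]
}
```

Because the rule is scoped by `Filter.Prefix`, objects under the metadata prefix (e.g. `metadata/propeller`) are never expired.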

Confirm the exact prefix and bucket layout for your deployment from the chart configuration, and validate any retention rules in a non-production environment before applying them broadly.

=== PAGE: https://www.union.ai/docs/v2/union/deployment/selfmanaged/configuration/namespace-mapping ===

# Namespace mapping

By default, Union.ai maps each project-domain pair to a Kubernetes namespace using the pattern `{project}-{domain}`. For example, the project `flytesnacks` in domain `development` runs workloads in namespace `flytesnacks-development`.

You can customize this mapping by setting the `namespace_mapping.template` value in your Helm configuration.

## Template syntax

The template uses Go template syntax with two variables:

- `{{ project }}` — the project name
- `{{ domain }}` — the domain name (e.g., `development`, `staging`, `production`)

### Examples

| Template | Project | Domain | Resulting namespace |
|----------|---------|--------|---------------------|
| `{{ project }}-{{ domain }}` (default) | `flytesnacks` | `development` | `flytesnacks-development` |
| `{{ domain }}` | `flytesnacks` | `development` | `development` |
| `myorg-{{ project }}-{{ domain }}` | `flytesnacks` | `development` | `myorg-flytesnacks-development` |
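The substitution can be sketched in Python as a simplified stand-in for the Go template rendering the platform performs:

```python
# Simplified stand-in for Go template rendering: substitute the two
# variables that the namespace mapping template supports.
def resolve_namespace(template: str, project: str, domain: str) -> str:
    return (
        template
        .replace("{{ project }}", project)
        .replace("{{ domain }}", domain)
    )

print(resolve_namespace("{{ project }}-{{ domain }}", "flytesnacks", "development"))
# flytesnacks-development
print(resolve_namespace("myorg-{{ project }}-{{ domain }}", "flytesnacks", "development"))
# myorg-flytesnacks-development
```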

> [!WARNING]
> Changing namespace mapping after workflows have run will cause existing data in old namespaces to become inaccessible. Plan your namespace mapping before initial deployment.

## Data plane configuration

Set the `namespace_mapping` value at the top level of your dataplane Helm values. This single value cascades to all services that need it: clusterresourcesync, propeller, operator, and executor.

```yaml
namespace_mapping:
  template: "myorg-{{ '{{' }} project {{ '}}' }}-{{ '{{' }} domain {{ '}}' }}"
```

> [!NOTE]
> The values file escapes the Go template delimiters so that Helm's own template engine does not consume them: each literal `{{` is written as `{{ '{{' }}` and each literal `}}` as `{{ '}}' }}`, as shown above.

## How it works

Namespace mapping controls several components:

| Component | Role |
|-----------|------|
| **Clusterresourcesync** | Creates Kubernetes namespaces and per-namespace resources (service accounts, resource quotas) based on the mapping |
| **Propeller** | Resolves the target namespace when scheduling workflow pods |
| **Operator** | Resolves the target namespace for operator-managed resources |
| **Executor** | Resolves the target namespace for task execution |
| **Flyteadmin** (control plane) | Determines the target namespace when creating V1 executions |

All components must agree on the mapping. The dataplane chart's top-level `namespace_mapping` value is the canonical source that cascades to clusterresourcesync, propeller, operator, and executor automatically. You should **not** set per-service overrides.

