# Serve and deploy apps
> This bundle contains all pages in the Serve and deploy apps section.
> Source: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps/

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps ===

# Serve and deploy apps

Flyte provides two ways to run an app: **serve**, intended for development, and **deploy**, intended for production. This section covers both and explains how they differ.

## Serve vs Deploy

### `flyte serve`

Serving is designed for development and iteration:

- **Dynamic parameter modification**: You can override app parameters when serving
- **Quick iteration**: Faster feedback loop for development
- **Interactive**: Better suited for testing and experimentation

### `flyte deploy`

Deployment is designed for production use:

- **Immutable**: Apps are deployed with fixed configurations
- **Production-ready**: Optimized for stability and reproducibility

## Using the Python SDK

### Serve

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-app",
    image=flyte.app.Image.from_debian_base().with_pip_packages("streamlit==1.41.1"),
    args=["streamlit", "hello", "--server.port", "8080"],
    port=8080,
    resources=flyte.Resources(cpu="1", memory="1Gi"),
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(app_env)
    print(f"Served at: {app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_and_deploy_examples.py*

### Deploy

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-app",
    image=flyte.app.Image.from_debian_base().with_pip_packages("streamlit==1.41.1"),
    args=["streamlit", "hello", "--server.port", "8080"],
    port=8080,
    resources=flyte.Resources(cpu="1", memory="1Gi"),
)

if __name__ == "__main__":
    flyte.init_from_config()
    deployments = flyte.deploy(app_env)
    # Access deployed app URL from the deployment
    for deployed_env in deployments[0].envs.values():
        print(f"Deployed: {deployed_env.deployed_app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_and_deploy_examples.py*

## Using the CLI

### Serve

```bash
flyte serve path/to/app.py app_env
```

### Deploy

```bash
flyte deploy path/to/app.py app_env
```

## Next steps

- **How app serving works**: Understanding the serve process and configuration options
- **How app deployment works**: Understanding the deploy process and configuration options
- **Activating and deactivating apps**: Managing app lifecycle
- [**Model training and serving**](https://www.union.ai/docs/v2/union/user-guide/basic-project/page.md): Train a model with tasks and serve it via FastAPI
- **Prefetching models**: Download and shard HuggingFace models for vLLM and SGLang

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps/how-app-serving-works ===

# How app serving works

Serving is the recommended way to run apps during development. It gives you a faster feedback loop and lets you override app parameters at serve time.

## Overview

When you serve an app, the following happens:

1. **Code bundling**: Your app code is bundled and prepared
2. **Image building**: Container images are built (if needed)
3. **Deployment**: The app is deployed to your Flyte cluster
4. **Activation**: The app is automatically activated and ready to use
5. **URL generation**: A URL is generated for accessing the app

## Using the Python SDK

The simplest way to serve an app:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-dev-app",
    parameters=[flyte.app.Parameter(name="model_path", value="s3://bucket/models/model.pkl")],
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(app_env)
    print(f"App served at: {app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_examples.py*

## Overriding parameters

One key advantage of serving is the ability to override parameters dynamically:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-dev-app",
    parameters=[flyte.app.Parameter(name="model_path", value="s3://bucket/models/model.pkl")],
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    # Keys are app names; values map parameter names to override values.
    app = flyte.with_servecontext(
        input_values={
            "my-dev-app": {
                "model_path": "s3://bucket/models/test-model.pkl",
            }
        }
    ).serve(app_env)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_examples.py*

This is useful for:
- Testing different configurations
- Using different models or data sources
- A/B testing during development
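
To try several configurations in turn, it can help to generate the override mappings up front. The sketch below is plain Python and does not call Flyte; `build_override_sets` is a hypothetical helper, and the model paths are made-up values. It only assumes the `input_values` shape shown above (app name, then parameter name, then value).

```python
from typing import Any


def build_override_sets(
    app_name: str, parameter: str, candidates: list[Any]
) -> list[dict[str, dict[str, Any]]]:
    """Return one input_values-shaped mapping per candidate value."""
    return [{app_name: {parameter: value}} for value in candidates]


# Hypothetical candidate values for an A/B comparison during development.
overrides = build_override_sets(
    "my-dev-app",
    "model_path",
    ["s3://bucket/models/model-a.pkl", "s3://bucket/models/model-b.pkl"],
)

for input_values in overrides:
    # Each mapping could be passed as flyte.with_servecontext(input_values=...).
    print(input_values)
```

Each resulting mapping targets a single parameter of a single app, so the runs differ only in the value under test.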

## Advanced serving options

Use `with_servecontext()` for more control over the serving process:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import logging

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-dev-app",
    parameters=[flyte.app.Parameter(name="model_path", value="s3://bucket/models/model.pkl")],
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.with_servecontext(
        version="v1.0.0",
        project="my-project",
        domain="development",
        env_vars={"LOG_LEVEL": "DEBUG"},
        # App and parameter names must match the AppEnvironment above
        input_values={"my-dev-app": {"model_path": "s3://bucket/models/test-model.pkl"}},
        cluster_pool="dev-pool",
        log_level=logging.INFO,
        log_format="json",
        dry_run=False,
    ).serve(app_env)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_examples.py*

## Using the CLI

You can also serve apps from the command line:

```bash
flyte serve path/to/app.py app
```

Where `app` is the variable name of the `AppEnvironment` object.

## Return value

`flyte.serve()` returns an `App` object with:

- `url`: The app's URL
- `endpoint`: The app's endpoint URL
- `deployment_status`: Current status of the app
- `name`: App name

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-dev-app",
    parameters=[flyte.app.Parameter(name="model_path", value="s3://bucket/models/model.pkl")],
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    app = flyte.serve(app_env)
    print(f"URL: {app.url}")
    print(f"Endpoint: {app.endpoint}")
    print(f"Status: {app.deployment_status}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/serve_examples.py*

## Best practices

1. **Use for development**: App serving is ideal for development and testing.
2. **Override parameters**: Take advantage of parameter overrides for testing different configurations.
3. **Quick iteration**: Use `serve` for rapid development cycles.
4. **Switch to deploy**: Use [deploy](./how-app-deployment-works) for production deployments.

## Troubleshooting

**App not activating:**
- Check cluster connectivity
- Verify app configuration is correct
- Review container logs for errors

**Parameter overrides not working:**
- Verify parameter names match exactly
- Check that parameters are defined in the app environment
- Ensure you're using the `input_values` parameter correctly
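
A quick local check can catch name mismatches before you serve. The following is a minimal sketch in plain Python, not part of the Flyte SDK: `find_override_problems` is a hypothetical helper, and `declared` is a hand-written mapping of app names to the parameter names you defined on each `AppEnvironment`.

```python
def find_override_problems(
    declared: dict[str, set[str]],
    input_values: dict[str, dict[str, object]],
) -> list[str]:
    """List overrides that reference an unknown app or an undeclared parameter."""
    problems = []
    for app_name, overrides in input_values.items():
        if app_name not in declared:
            problems.append(f"unknown app {app_name!r}")
            continue
        for parameter in overrides:
            if parameter not in declared[app_name]:
                problems.append(f"app {app_name!r} has no parameter {parameter!r}")
    return problems


# Parameter names as declared on the app environment (maintained by hand here).
declared = {"my-dev-app": {"model_path"}}

# "model-path" uses a hyphen instead of an underscore, so it does not match.
problems = find_override_problems(
    declared,
    {"my-dev-app": {"model-path": "s3://bucket/models/test-model.pkl"}},
)
print(problems)
```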

**Slow serving:**
- Images may need to be built (first time is slower).
- Large code bundles can slow down deployment.
- Check network connectivity to the cluster.

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps/how-app-deployment-works ===

# How app deployment works

Deployment is the recommended way to run apps in production. It creates versioned, immutable app deployments.

## Overview

When you deploy an app, the following happens:

1. **Code bundling**: Your app code is bundled and prepared
2. **Image building**: Container images are built (if needed)
3. **Deployment**: The app is deployed to your Flyte cluster
4. **Activation**: The app is automatically activated and ready to use

## Using the Python SDK

Deploy an app:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    deployments = flyte.deploy(app_env)

    # Access deployed apps from deployments
    for deployment in deployments:
        for deployed_env in deployment.envs.values():
            print(f"Deployed: {deployed_env.env.name}")
            print(f"URL: {deployed_env.deployed_app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*

`flyte.deploy()` returns a list of `Deployment` objects. Each `Deployment` contains a dictionary of `DeployedEnvironment` objects (one for each environment deployed, including environment dependencies). For apps, the `DeployedEnvironment` is a `DeployedAppEnvironment` which has a `deployed_app` property of type `App`.
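
The nesting described above can be flattened into a simple name-to-URL mapping. The sketch below is illustrative only: `deployed_app_urls` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins that mimic just the attributes named in this section (`envs`, `env.name`, `deployed_app.url`) so the example runs without a cluster.

```python
from types import SimpleNamespace


def deployed_app_urls(deployments) -> dict[str, str]:
    """Map environment name to app URL, skipping environments without a deployed app."""
    urls = {}
    for deployment in deployments:
        for deployed_env in deployment.envs.values():
            app = getattr(deployed_env, "deployed_app", None)
            if app is not None:
                urls[deployed_env.env.name] = app.url
    return urls


# Stand-ins for the objects returned by flyte.deploy(); the URL is made up.
fake_deployments = [
    SimpleNamespace(
        envs={
            "backend": SimpleNamespace(
                env=SimpleNamespace(name="backend"),
                deployed_app=SimpleNamespace(url="https://backend.example.com"),
            ),
            # An environment that is not an app has no deployed_app attribute.
            "tasks": SimpleNamespace(env=SimpleNamespace(name="tasks")),
        }
    )
]

urls = deployed_app_urls(fake_deployments)
print(urls)
```

With real results, you would pass the return value of `flyte.deploy(app_env)` in place of `fake_deployments`.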

## Deployment plan

Flyte automatically creates a deployment plan that includes:

- The app you're deploying
- All [app environment dependencies](https://www.union.ai/docs/v2/union/user-guide/configure-apps/apps-depending-on-environments) (via `depends_on`)
- Proper deployment order

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app1_env = flyte.app.AppEnvironment(
    name="backend",
    # ...
)
app2_env = flyte.app.AppEnvironment(
    name="frontend",
    depends_on=[app1_env],
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Deploying app2_env will also deploy app1_env
    deployments = flyte.deploy(app2_env)

    # deployments contains both app1_env and app2_env
    assert len(deployments) == 2
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*
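
To see how a valid order can be derived from `depends_on` relationships, here is a minimal sketch using the standard library's `graphlib`. It illustrates the general technique (topological sorting) and is not Flyte's actual planner; `deployment_order` and the `database` app are hypothetical, and apps are represented by name only.

```python
from graphlib import TopologicalSorter


def deployment_order(depends_on: dict[str, list[str]]) -> list[str]:
    """Return app names ordered so that every dependency precedes its dependents."""
    # TopologicalSorter expects a mapping of node -> nodes that must come first.
    return list(TopologicalSorter(depends_on).static_order())


# "frontend" depends on "backend", which depends on a hypothetical "database" app.
order = deployment_order(
    {
        "frontend": ["backend"],
        "backend": ["database"],
        "database": [],
    }
)
print(order)
```

If the dependencies form a cycle, `static_order()` raises `graphlib.CycleError`, which is one reason circular `depends_on` chains cannot be given a deployment order.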

## Overriding App configuration at deployment time

If you need to override the app configuration at deployment time, you can use the `clone_with` method to create a new
app environment with the desired overrides.

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-app",
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    # clone_with returns a new environment; app_env itself is left unchanged
    deployments = flyte.deploy(
        app_env.clone_with(app_env.name, resources=flyte.Resources(cpu="2", memory="2Gi"))
    )
    for deployment in deployments:
        for deployed_env in deployment.envs.values():
            print(f"Deployed: {deployed_env.env.name}")
            print(f"URL: {deployed_env.deployed_app.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*

## Activation/deactivation

Unlike serving, deployment does not automatically activate apps. You need to activate them explicitly:

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app
from flyte.remote import App

app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()
    deployments = flyte.deploy(app_env)

    # Look up the deployed app by name
    app = App.get(name=app_env.name)

    # deactivate the app
    app.deactivate()

    # activate the app
    app.activate()
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*

See [Activating and deactivating apps](./activating-and-deactivating-apps) for more details.

## Using the CLI

Deploy from the command line:

```bash
flyte deploy path/to/app.py app
```

Where `app` is the variable name of the `AppEnvironment` object.

You can also specify the following options:

```bash
flyte deploy path/to/app.py app \
    --version v1.0.0 \
    --project my-project \
    --domain production \
    --dry-run
```

## Example: Full deployment configuration

```python
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ...
)

if __name__ == "__main__":
    flyte.init_from_config()

    deployments = flyte.deploy(
        app_env,
        dryrun=False,
        version="v1.0.0",
        interactive_mode=False,
        copy_style="loaded_modules",
    )

    # Access deployed apps from deployments
    for deployment in deployments:
        for deployed_env in deployment.envs.values():
            app = deployed_env.deployed_app
            print(f"Deployed: {deployed_env.env.name}")
            print(f"URL: {app.url}")

            # Activate the app
            app.activate()
            print(f"Activated: {app.name}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*

## Best practices

1. **Use for production**: Deploy is designed for production use.
2. **Version everything**: Always specify versions for reproducibility.
3. **Test first**: Test with serve before deploying to production.
4. **Manage dependencies**: Use `depends_on` to manage app dependencies.
5. **Activation strategy**: Have a strategy for activating/deactivating apps.
6. **Use dry-run**: Test deployments with `dryrun=True` (or `--dry-run` on the CLI) first.
7. **Separate environments**: Use different projects/domains for different environments.
8. **Parameter management**: Consider using environment-specific parameter values.

## Deployment status and return value

`flyte.deploy()` returns a list of `Deployment` objects. Each `Deployment` contains a dictionary of `DeployedEnvironment` objects:

```
import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ...
)

deployments = flyte.deploy(app_env)

for deployment in deployments:
    for deployed_env in deployment.envs.values():
        if hasattr(deployed_env, 'deployed_app'):
            # Access deployed environment
            env = deployed_env.env
            app = deployed_env.deployed_app

            # Access deployment info
            print(f"Name: {env.name}")
            print(f"URL: {app.url}")
            print(f"Status: {app.deployment_status}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/deploy_examples.py*

For apps, each `DeployedAppEnvironment` includes:

- `env`: The `AppEnvironment` that was deployed
- `deployed_app`: The `App` object with properties like `url`, `endpoint`, `name`, and `deployment_status`
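
The nested return value can be flattened into plain records for logging or assertions. This is a minimal sketch, not part of the Flyte SDK: `summarize_deployments` is a hypothetical helper that reads only the attributes named in this section (`envs`, `env.name`, `deployed_app.url`, `deployed_app.deployment_status`). The stand-in objects and the `"DEPLOYED"` status string are made up so that the snippet runs without a cluster.

```python
from types import SimpleNamespace


def summarize_deployments(deployments):
    """Flatten flyte.deploy() results into a list of plain dicts.

    Hypothetical helper: it only reads the attributes described in this
    section and skips deployed environments that carry no `deployed_app`.
    """
    rows = []
    for deployment in deployments:
        for deployed_env in deployment.envs.values():
            app = getattr(deployed_env, "deployed_app", None)
            if app is None:
                continue
            rows.append(
                {
                    "name": deployed_env.env.name,
                    "url": app.url,
                    "status": app.deployment_status,
                }
            )
    return rows


# Stand-in objects that mimic the shape of the return value (illustration only).
_fake_app = SimpleNamespace(
    url="https://example.invalid/my-app",
    deployment_status="DEPLOYED",  # made-up status value
)
_fake_deployments = [
    SimpleNamespace(
        envs={
            "my-app": SimpleNamespace(env=SimpleNamespace(name="my-app"), deployed_app=_fake_app),
            "task-env": SimpleNamespace(env=SimpleNamespace(name="task-env")),
        }
    )
]

summary = summarize_deployments(_fake_deployments)
```

In real use, pass the list returned by `flyte.deploy(app_env)` instead of the stand-ins.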

## Troubleshooting

**Deployment fails:**
- Check that all dependencies are available
- Verify image builds succeed
- Review deployment logs

**App not accessible:**
- Ensure the app is activated
- Check cluster connectivity
- Verify app configuration

**Version conflicts:**
- Use unique versions for each deployment
- Check existing app versions
- Clean up old versions if needed
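
One possible way to get a unique version per deployment is to derive a suffix from the content being deployed. The sketch below is an assumption-laden illustration: `content_version` is a hypothetical helper, and whether the backend accepts version strings of this shape should be checked against your setup.

```python
import hashlib


def content_version(base: str, *payloads: bytes) -> str:
    """Derive a version string of the form '<base>-<8 hex chars>' from contents."""
    digest = hashlib.sha256()
    for payload in payloads:
        digest.update(payload)
    return f"{base}-{digest.hexdigest()[:8]}"


# Different contents give different versions; identical contents repeat.
v_a = content_version("v1.0.0", b"print('hello')")
v_b = content_version("v1.0.0", b"print('hello, world')")
```

The result could then be passed as `version=` to `flyte.deploy()`, so that redeploying unchanged code reuses the same version while any change produces a new one.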

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps/activating-and-deactivating-apps ===

# Activating and deactivating apps

Apps deployed with `flyte.deploy()` need to be explicitly activated before they can serve traffic. Apps served with `flyte.serve()` are automatically activated.

## Activation

### Activate after deployment

After deploying an app, activate it:

```
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
# ]
# ///

"""Activation examples for the activating-and-deactivating-apps.md documentation."""

import flyte
import flyte.app
from flyte.remote import App

app_env = flyte.app.AppEnvironment(
    name="my-app",
    # ...
)

# {{docs-fragment activate-after-deployment}}
# Deploy the app
deployments = flyte.deploy(app_env)

# Activate the app
app = App.get(name=app_env.name)
app.activate()

print(f"Activated app: {app.name}")
print(f"URL: {app.url}")
# {{/docs-fragment activate-after-deployment}}

# {{docs-fragment activate-app}}
app = App.get(name="my-app")
app.activate()
# {{/docs-fragment activate-app}}

# {{docs-fragment check-activation-status}}
app = App.get(name="my-app")
print(f"Active: {app.is_active()}")
print(f"Revision: {app.revision}")
# {{/docs-fragment check-activation-status}}

# {{docs-fragment deactivation}}
app = App.get(name="my-app")
app.deactivate()

print(f"Deactivated app: {app.name}")
# {{/docs-fragment deactivation}}

# {{docs-fragment typical-deployment-workflow}}
# 1. Deploy new version
deployments = flyte.deploy(
    app_env,
    version="v2.0.0",
)

# 2. Get the deployed app
new_app = App.get(name="my-app")
# Test endpoints, etc.

# 3. Activate the new version
new_app.activate()

print(f"Deployed and activated version {new_app.revision}")
# {{/docs-fragment typical-deployment-workflow}}

# {{docs-fragment blue-green-deployment}}
# Deploy new version without deactivating old
new_deployments = flyte.deploy(
    app_env,
    version="v2.0.0",
)

new_app = App.get(name="my-app")

# Test new version
# ... testing ...

# Switch traffic to new version
new_app.activate()

print(f"Activated revision {new_app.revision}")
# {{/docs-fragment blue-green-deployment}}

# {{docs-fragment automatic-activation}}
# Automatically activated
app = flyte.serve(app_env)
print(f"Active: {app.is_active()}")  # True
# {{/docs-fragment automatic-activation}}

# {{docs-fragment complete-example}}
app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ... configuration ...
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Deploy
    deployments = flyte.deploy(
        app_env,
        version="v1.0.0",
        project="my-project",
        domain="production",
    )

    # Get the deployed app
    app = App.get(name="my-prod-app")

    # Activate
    app.activate()

    print(f"Deployed and activated: {app.name}")
    print(f"Revision: {app.revision}")
    print(f"URL: {app.url}")
    print(f"Active: {app.is_active()}")
# {{/docs-fragment complete-example}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

### Activate an app

When you get an app by name, you get the current app instance:

```
from flyte.remote import App

app = App.get(name="my-app")
app.activate()
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

### Check activation status

Check if an app is active:

```
from flyte.remote import App

app = App.get(name="my-app")
print(f"Active: {app.is_active()}")
print(f"Revision: {app.revision}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

## Deactivation

Deactivate an app when you no longer need it:

```
from flyte.remote import App

app = App.get(name="my-app")
app.deactivate()

print(f"Deactivated app: {app.name}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

## Lifecycle management

### Typical deployment workflow

```
import flyte
import flyte.app
from flyte.remote import App

app_env = flyte.app.AppEnvironment(
    name="my-app",
    # ...
)

# 1. Deploy new version
deployments = flyte.deploy(
    app_env,
    version="v2.0.0",
)

# 2. Get the deployed app
new_app = App.get(name="my-app")
# Test endpoints, etc.

# 3. Activate the new version
new_app.activate()

print(f"Deployed and activated version {new_app.revision}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

### Blue-green deployment

For zero-downtime deployments:

```
import flyte
import flyte.app
from flyte.remote import App

app_env = flyte.app.AppEnvironment(
    name="my-app",
    # ...
)

# Deploy new version without deactivating old
new_deployments = flyte.deploy(
    app_env,
    version="v2.0.0",
)

new_app = App.get(name="my-app")

# Test new version
# ... testing ...

# Switch traffic to new version
new_app.activate()

print(f"Activated revision {new_app.revision}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*
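
The "test, then switch" step can be made explicit with a small gate. This is a sketch under stated assumptions: `activate_if_healthy` and the `health_check` callable are hypothetical, and whether a newly deployed revision is reachable at `app.url` before activation is something to verify on your platform. The stand-in class exists only so the snippet runs without a cluster.

```python
def activate_if_healthy(app, health_check) -> bool:
    """Activate `app` only when `health_check(app.url)` returns True."""
    if not health_check(app.url):
        return False
    app.activate()
    return True


class _FakeApp:
    """Stand-in exposing the same `url` / `activate()` surface as the examples."""

    def __init__(self, url):
        self.url = url
        self.activated = False

    def activate(self):
        self.activated = True


healthy = _FakeApp("https://example.invalid/new")
unhealthy = _FakeApp("https://example.invalid/broken")
result_ok = activate_if_healthy(healthy, lambda url: url.endswith("/new"))
result_bad = activate_if_healthy(unhealthy, lambda url: url.endswith("/new"))
```

In real use, `app` would come from `App.get(name=...)` and `health_check` would issue a request against an endpoint of your choosing.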

## Using CLI

### Activate

```bash
flyte update app --activate my-app
```

### Deactivate

```bash
flyte update app --deactivate my-app
```

### Check status

```bash
flyte get app my-app
```

Use `--project` and `--domain` to target a specific [project-domain pair](https://www.union.ai/docs/v2/union/user-guide/projects-and-domains).
For all available options, see the [CLI reference](https://www.union.ai/docs/v2/union/api-reference/flyte-cli).
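
When the CLI is driven from a script, it can help to build the argument list in one place. The helper below is a hypothetical sketch: it uses only the commands and flags shown in this section, and it assumes that `--project` and `--domain` may be appended after the app name, which should be confirmed against the CLI reference.

```python
import shlex
from typing import List, Optional


def app_command(
    action: str,
    name: str,
    project: Optional[str] = None,
    domain: Optional[str] = None,
) -> List[str]:
    """Build the argument list for one of the three CLI calls in this section."""
    if action in ("activate", "deactivate"):
        cmd = ["flyte", "update", "app", f"--{action}", name]
    elif action == "status":
        cmd = ["flyte", "get", "app", name]
    else:
        raise ValueError(f"unsupported action: {action!r}")
    # Assumption: the targeting flags are accepted after the app name.
    if project:
        cmd += ["--project", project]
    if domain:
        cmd += ["--domain", domain]
    return cmd


cmd = app_command("activate", "my-app", project="my-project", domain="production")
printable = shlex.join(cmd)
# To execute: subprocess.run(cmd, check=True)
```

Passing a list to `subprocess.run` avoids shell quoting problems with unusual app names.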

## Best practices

1. **Activate after testing**: Test deployed apps before activating
2. **Version management**: Keep track of which version is active
3. **Blue-green deployments**: Use blue-green for zero-downtime
4. **Monitor**: Monitor apps after activation
5. **Cleanup**: Deactivate and remove old versions periodically

## Automatic activation with serve

Apps served with `flyte.serve()` are automatically activated:

```
import flyte
import flyte.app

app_env = flyte.app.AppEnvironment(
    name="my-app",
    # ...
)

# Automatically activated
app = flyte.serve(app_env)
print(f"Active: {app.is_active()}")  # True
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

This is convenient for development but less suitable for production where you want explicit control over activation.

## Example: Complete deployment and activation

```
import flyte
import flyte.app
from flyte.remote import App

app_env = flyte.app.AppEnvironment(
    name="my-prod-app",
    # ... configuration ...
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Deploy
    deployments = flyte.deploy(
        app_env,
        version="v1.0.0",
        project="my-project",
        domain="production",
    )

    # Get the deployed app
    app = App.get(name="my-prod-app")

    # Activate
    app.activate()

    print(f"Deployed and activated: {app.name}")
    print(f"Revision: {app.revision}")
    print(f"URL: {app.url}")
    print(f"Active: {app.is_active()}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/activation_examples.py*

## Troubleshooting

**App not accessible after activation:**
- Verify activation succeeded
- Check app logs for startup errors
- Verify cluster connectivity
- Check that the app is listening on the correct port
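
To check the last point, a plain TCP probe can tell whether anything is listening on the expected port. This sketch uses only the Python standard library; `is_port_open` is a hypothetical helper, and where you run it from (inside the app container, or against a forwarded port) depends on your setup.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Demonstration against a throwaway local listener (port chosen by the OS).
_server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
_server.bind(("127.0.0.1", 0))
_server.listen(1)
_port = _server.getsockname()[1]
open_while_listening = is_port_open("127.0.0.1", _port)
_server.close()
open_after_close = is_port_open("127.0.0.1", _port)
```

For an app configured with `port=8080`, the equivalent check would be `is_port_open("localhost", 8080)` from a location that can reach the container.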

**Activation fails:**
- Check that the app was deployed successfully
- Verify app configuration is correct
- Check cluster resources
- Review deployment logs

**Cannot deactivate:**
- Ensure you have proper permissions
- Check if there are dependencies preventing deactivation
- Verify the app name and version

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/serve-and-deploy-apps/prefetching-models ===

# Prefetching models

Prefetching allows you to download and prepare HuggingFace models (including sharding for multi-GPU inference) before
deploying [vLLM](https://www.union.ai/docs/v2/union/user-guide/build-apps/vllm-app) or [SGLang](https://www.union.ai/docs/v2/union/user-guide/build-apps/sglang-app) apps. This speeds up deployment and ensures models are ready when your app starts.

## Why prefetch?

Prefetching models provides several benefits:

- **Faster deployment**: Models are pre-downloaded, so apps start faster
- **Reproducibility**: Models are versioned and stored in Flyte's object store
- **Sharding support**: Pre-shard models for multi-GPU tensor parallelism
- **Cost efficiency**: Download once, use many times
- **Offline support**: Models are cached in your storage backend

## Basic prefetch

### Using Python SDK

```
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
#    "flyteplugins-vllm>=2.0.0b49",
# ]
# ///

"""Prefetch examples for the prefetching-models.md documentation."""

import flyte
import flyte.app  # needed for flyte.app.RunOutput and flyte.app.Scaling below
from flyte.prefetch import ShardConfig, VLLMShardArgs
from flyteplugins.vllm import VLLMAppEnvironment

# {{docs-fragment basic-prefetch}}
# Prefetch a HuggingFace model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")

# Wait for prefetch to complete
run.wait()

# Get the model path
model_path = run.outputs()[0].path
print(f"Model prefetched to: {model_path}")
# {{/docs-fragment basic-prefetch}}

# {{docs-fragment using-prefetched-models}}
# Prefetch the model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()

# Use the prefetched model
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_path=flyte.app.RunOutput(
        type="directory",
        run_name=run.name,
    ),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    stream_model=True,
)

app = flyte.serve(vllm_app)
# {{/docs-fragment using-prefetched-models}}

# {{docs-fragment custom-artifact-name}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    artifact_name="qwen-0.6b-model",  # Custom name for the stored model
)
# {{/docs-fragment custom-artifact-name}}

# {{docs-fragment hf-token}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-7b-hf",
    hf_token_key="HF_TOKEN",  # Name of Flyte secret containing HF token
)
# {{/docs-fragment hf-token}}

# {{docs-fragment with-resources}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    cpu="4",
    mem="16Gi",
    ephemeral_storage="100Gi",
)
# {{/docs-fragment with-resources}}

# {{docs-fragment vllm-sharding}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4"),
    shard_config=ShardConfig(
        engine="vllm",
        args=VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
    hf_token_key="HF_TOKEN",
)

run.wait()
# {{/docs-fragment vllm-sharding}}

# {{docs-fragment using-sharded-models}}
# Use in vLLM app
vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # Match the number of GPUs used for sharding
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Match sharding config
    ],
)

if __name__ == "__main__":
    flyte.init_from_config()

    # Prefetch with sharding
    run = flyte.prefetch.hf_model(
        repo="meta-llama/Llama-2-70b-hf",
        accelerator="L40s:4",
        shard_config=ShardConfig(
            engine="vllm",
            args=VLLMShardArgs(tensor_parallel_size=4),
        ),
    )
    run.wait()

    flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            # override the model path to use the prefetched model
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            # clear model_hf_path so the prefetched model is used instead
            model_hf_path=None,
            # stream the model from flyte object store directly to the GPU
            stream_model=True,
        )
    )
# {{/docs-fragment using-sharded-models}}

# {{docs-fragment complete-example}}
# define the app environment
vllm_app = VLLMAppEnvironment(
    name="qwen-serving-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=600,
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    # initialize the Flyte client before submitting the prefetch run
    flyte.init_from_config()

    # prefetch the model
    print("Prefetching model...")
    run = flyte.prefetch.hf_model(
        repo="Qwen/Qwen3-0.6B",
        artifact_name="qwen-0.6b",
        cpu="4",
        mem="16Gi",
        ephemeral_storage="50Gi",
    )

    # wait for completion
    print("Waiting for prefetch to complete...")
    run.wait()
    print(f"Model prefetched: {run.outputs()[0].path}")

    # deploy the app
    print("Deploying app...")
    app = flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            # clear model_hf_path so the prefetched model is used instead
            model_hf_path=None,
            stream_model=True,
        )
    )
    print(f"App deployed: {app.url}")
# {{/docs-fragment complete-example}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

### Using CLI

```bash
flyte prefetch hf-model Qwen/Qwen3-0.6B
```

To block until the prefetch completes, add `--wait`:

```bash
flyte prefetch hf-model Qwen/Qwen3-0.6B --wait
```

## Using prefetched models

Use the prefetched model in your vLLM or SGLang app:

```
import flyte
import flyte.app
import flyte.prefetch
from flyteplugins.vllm import VLLMAppEnvironment

# Prefetch the model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()

# Use the prefetched model
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_path=flyte.app.RunOutput(
        type="directory",
        run_name=run.name,
    ),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    stream_model=True,
)

app = flyte.serve(vllm_app)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

> [!TIP]
> You can also use prefetched models as parameters to your generic `AppEnvironment`s or `FastAPIAppEnvironment`s.

## Prefetch options

### Custom artifact name

```
import flyte
import flyte.prefetch

run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    artifact_name="qwen-0.6b-model",  # Custom name for the stored model
)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

### With HuggingFace token

If the model requires authentication:

```
import flyte
import flyte.prefetch

run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-7b-hf",
    hf_token_key="HF_TOKEN",  # Name of Flyte secret containing HF token
)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

`hf_token_key` defaults to `HF_TOKEN`, the name of the Flyte secret that holds your HuggingFace token.
If this secret doesn't exist, create it with the [`flyte create secret` CLI command](https://www.union.ai/docs/v2/union/user-guide/task-configuration/secrets).

### With resources

By default, the prefetch task uses minimal resources (2 CPUs, 8GB of memory, 50Gi of disk storage) and streams the
model weights from HuggingFace directly to your storage backend.

Some HuggingFace models do not support file streaming. In that case, the prefetch task falls back to downloading the
model weights to the task pod's disk first and then uploading them to your storage backend. Because the pod's disk must
then hold the full set of weights, you can override the default resources for the prefetch task:

```
import flyte
import flyte.prefetch

run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    cpu="4",
    mem="16Gi",
    ephemeral_storage="100Gi",
)
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*
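
When the prefetch task falls back to downloading to disk, the value you pick for ephemeral storage has to cover the full set of weights. As a rough sizing aid, the sketch below estimates the weight size from a parameter count. The helper name `estimate_weights_gib`, the 2-bytes-per-parameter default (fp16/bf16), and the 1.2 headroom factor are illustrative assumptions and not part of the Flyte API; the file sizes listed on the model's HuggingFace page remain the authoritative figure.

```python
import math


def estimate_weights_gib(num_params: float, bytes_per_param: int = 2, headroom: float = 1.2) -> int:
    """Return a rounded-up GiB estimate for storing a model's weights.

    Hypothetical helper for illustration only: it assumes the weights dominate
    the repository size and adds a fixed headroom factor for tokenizer and
    config files.
    """
    raw_bytes = num_params * bytes_per_param
    return math.ceil(raw_bytes * headroom / 2**30)


# Roughly 0.6e9 parameters stored in bf16 (assumed size for a 0.6B model)
small = estimate_weights_gib(0.6e9)
# Roughly 70e9 parameters stored in bf16 (assumed size for a 70B model)
large = estimate_weights_gib(70e9)

print(f'small model: ephemeral_storage="{small}Gi"')
print(f'large model: ephemeral_storage="{large}Gi"')
```

Under these assumptions a 70B-parameter model needs well over the default 50Gi, which is the situation where overriding the prefetch resources matters most.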

## Sharding models for multi-GPU

### vLLM sharding

Shard a model for tensor parallelism:

```
import flyte
from flyte.prefetch import ShardConfig, VLLMShardArgs

run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4"),
    shard_config=ShardConfig(
        engine="vllm",
        args=VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
    hf_token_key="HF_TOKEN",
)

run.wait()
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

Currently, the `flyte.prefetch.hf_model` function only supports sharding models
using the `vllm` engine. Once sharded, these models can be loaded with other
frameworks such as `transformers`, `torch`, or `sglang`.

### Using shard config via CLI

You can also define the sharding configuration in a YAML file and pass it to the
`flyte prefetch hf-model` CLI command:

```yaml
# shard_config.yaml
engine: vllm
args:
  tensor_parallel_size: 8
  dtype: auto
  trust_remote_code: true
```

Then run the CLI command:

```bash
flyte prefetch hf-model meta-llama/Llama-2-70b-hf \
    --shard-config shard_config.yaml \
    --accelerator L40s:8 \
    --hf-token-key HF_TOKEN
```

## Using prefetched sharded models

After prefetching and sharding, serve the model in your app:

```
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
#    "flyteplugins-vllm>=2.0.0b49",
# ]
# ///

"""Prefetch examples for the prefetching-models.md documentation."""

import flyte
from flyte.prefetch import ShardConfig, VLLMShardArgs
from flyteplugins.vllm import VLLMAppEnvironment

# {{docs-fragment basic-prefetch}}
# Prefetch a HuggingFace model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")

# Wait for prefetch to complete
run.wait()

# Get the model path
model_path = run.outputs()[0].path
print(f"Model prefetched to: {model_path}")
# {{/docs-fragment basic-prefetch}}

# {{docs-fragment using-prefetched-models}}
# Prefetch the model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()

# Use the prefetched model
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_path=flyte.app.RunOutput(
        type="directory",
        run_name=run.name,
    ),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    stream_model=True,
)

app = flyte.serve(vllm_app)
# {{/docs-fragment using-prefetched-models}}

# {{docs-fragment custom-artifact-name}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    artifact_name="qwen-0.6b-model",  # Custom name for the stored model
)
# {{/docs-fragment custom-artifact-name}}

# {{docs-fragment hf-token}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-7b-hf",
    hf_token_key="HF_TOKEN",  # Name of Flyte secret containing HF token
)
# {{/docs-fragment hf-token}}

# {{docs-fragment with-resources}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    cpu="4",
    mem="16Gi",
    ephemeral_storage="100Gi",
)
# {{/docs-fragment with-resources}}

# {{docs-fragment vllm-sharding}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4"),
    shard_config=ShardConfig(
        engine="vllm",
        args=VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
    hf_token_key="HF_TOKEN",
)

run.wait()
# {{/docs-fragment vllm-sharding}}

# {{docs-fragment using-sharded-models}}
# Use in vLLM app
vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # Match the number of GPUs used for sharding
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Match sharding config
    ],
)

if __name__ == "__main__":
    # Prefetch with sharding
    run = flyte.prefetch.hf_model(
        repo="meta-llama/Llama-2-70b-hf",
        accelerator="L40s:4",
        shard_config=ShardConfig(
            engine="vllm",
            args=VLLMShardArgs(tensor_parallel_size=4),
        ),
    )
    run.wait()

    flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            # override the model path to use the prefetched model
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            # set model_hf_path to None (same field name as in the constructor above)
            model_hf_path=None,
            # stream the model from flyte object store directly to the GPU
            stream_model=True,
        )
    )
# {{/docs-fragment using-sharded-models}}

# {{docs-fragment complete-example}}
# define the app environment
vllm_app = VLLMAppEnvironment(
    name="qwen-serving-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=600,
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    # initialize the Flyte client before making any remote calls
    flyte.init_from_config()

    # prefetch the model
    print("Prefetching model...")
    run = flyte.prefetch.hf_model(
        repo="Qwen/Qwen3-0.6B",
        artifact_name="qwen-0.6b",
        cpu="4",
        mem="16Gi",
        ephemeral_storage="50Gi",
    )

    # wait for completion
    print("Waiting for prefetch to complete...")
    run.wait()
    print(f"Model prefetched: {run.outputs()[0].path}")

    # deploy the app
    print("Deploying app...")
    app = flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            model_hf_path=None,
            stream_model=True,
        )
    )
    print(f"App deployed: {app.url}")
# {{/docs-fragment complete-example}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

## CLI options

Complete CLI usage:

```bash
flyte prefetch hf-model <repo> \
    --artifact-name <name> \
    --architecture <arch> \
    --task <task> \
    --modality text \
    --format safetensors \
    --model-type transformer \
    --short-description "Description" \
    --force 0 \
    --wait \
    --hf-token-key HF_TOKEN \
    --cpu 4 \
    --mem 16Gi \
    --ephemeral-storage 100Gi \
    --accelerator L40s:4 \
    --shard-config shard_config.yaml
```
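
When the prefetch is driven from a script, the command line can be assembled programmatically. The following is a minimal sketch that uses only flags listed above; `prefetch_command` is a hypothetical helper, not part of the Flyte SDK, and it assumes that every option maps to a `--kebab-case` flag and that `--wait` is a boolean switch.

```python
import shlex


def prefetch_command(repo: str, **options: object) -> list[str]:
    """Build an argument list for `flyte prefetch hf-model`.

    Keyword names are converted to flags (artifact_name -> --artifact-name).
    A value of True emits the bare flag; None or False omits the option.
    """
    cmd = ["flyte", "prefetch", "hf-model", repo]
    for key, value in options.items():
        flag = "--" + key.replace("_", "-")
        if value is True:
            cmd.append(flag)  # boolean switch such as --wait
        elif value is not None and value is not False:
            cmd.extend([flag, str(value)])
    return cmd


cmd = prefetch_command(
    "Qwen/Qwen3-0.6B",
    artifact_name="qwen-0.6b",
    wait=True,
    cpu=4,
    mem="16Gi",
)
# Print a shell-safe version of the command for logging or copy-paste
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where the `flyte` CLI is installed and configured.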

## Complete example

Here's a complete example of prefetching and using a model:

```
# /// script
# requires-python = ">=3.12"
# dependencies = [
#    "flyte>=2.0.0b52",
#    "flyteplugins-vllm>=2.0.0b49",
# ]
# ///

"""Prefetch examples for the prefetching-models.md documentation."""

import flyte
from flyte.prefetch import ShardConfig, VLLMShardArgs
from flyteplugins.vllm import VLLMAppEnvironment

# {{docs-fragment basic-prefetch}}
# Prefetch a HuggingFace model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")

# Wait for prefetch to complete
run.wait()

# Get the model path
model_path = run.outputs()[0].path
print(f"Model prefetched to: {model_path}")
# {{/docs-fragment basic-prefetch}}

# {{docs-fragment using-prefetched-models}}
# Prefetch the model
run = flyte.prefetch.hf_model(repo="Qwen/Qwen3-0.6B")
run.wait()

# Use the prefetched model
vllm_app = VLLMAppEnvironment(
    name="my-llm-app",
    model_path=flyte.app.RunOutput(
        type="directory",
        run_name=run.name,
    ),
    model_id="qwen3-0.6b",
    resources=flyte.Resources(cpu="4", memory="16Gi", gpu="L40s:1"),
    stream_model=True,
)

app = flyte.serve(vllm_app)
# {{/docs-fragment using-prefetched-models}}

# {{docs-fragment custom-artifact-name}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    artifact_name="qwen-0.6b-model",  # Custom name for the stored model
)
# {{/docs-fragment custom-artifact-name}}

# {{docs-fragment hf-token}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-7b-hf",
    hf_token_key="HF_TOKEN",  # Name of Flyte secret containing HF token
)
# {{/docs-fragment hf-token}}

# {{docs-fragment with-resources}}
run = flyte.prefetch.hf_model(
    repo="Qwen/Qwen3-0.6B",
    cpu="4",
    mem="16Gi",
    ephemeral_storage="100Gi",
)
# {{/docs-fragment with-resources}}

# {{docs-fragment vllm-sharding}}
run = flyte.prefetch.hf_model(
    repo="meta-llama/Llama-2-70b-hf",
    resources=flyte.Resources(cpu="8", memory="32Gi", gpu="L40s:4"),
    shard_config=ShardConfig(
        engine="vllm",
        args=VLLMShardArgs(
            tensor_parallel_size=4,
            dtype="auto",
            trust_remote_code=True,
        ),
    ),
    hf_token_key="HF_TOKEN",
)

run.wait()
# {{/docs-fragment vllm-sharding}}

# {{docs-fragment using-sharded-models}}
# Use in vLLM app
vllm_app = VLLMAppEnvironment(
    name="multi-gpu-llm-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="meta-llama/Llama-2-70b-hf",
    model_id="llama-2-70b",
    resources=flyte.Resources(
        cpu="8",
        memory="32Gi",
        gpu="L40s:4",  # Match the number of GPUs used for sharding
    ),
    extra_args=[
        "--tensor-parallel-size", "4",  # Match sharding config
    ],
)

if __name__ == "__main__":
    # Initialize the Flyte client before making any remote calls
    flyte.init_from_config()

    # Prefetch with sharding
    run = flyte.prefetch.hf_model(
        repo="meta-llama/Llama-2-70b-hf",
        accelerator="L40s:4",
        shard_config=ShardConfig(
            engine="vllm",
            args=VLLMShardArgs(tensor_parallel_size=4),
        ),
    )
    run.wait()

    flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            # override the model path to use the prefetched model
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            # set model_hf_path to None (same field name as in the constructor above)
            model_hf_path=None,
            # stream the model from flyte object store directly to the GPU
            stream_model=True,
        )
    )
# {{/docs-fragment using-sharded-models}}

# {{docs-fragment complete-example}}
# define the app environment
vllm_app = VLLMAppEnvironment(
    name="qwen-serving-app",
    # this will download the model from HuggingFace into the app container's filesystem
    model_hf_path="Qwen/Qwen3-0.6B",
    model_id="qwen3-0.6b",
    resources=flyte.Resources(
        cpu="4",
        memory="16Gi",
        gpu="L40s:1",
        disk="10Gi",
    ),
    scaling=flyte.app.Scaling(
        replicas=(0, 1),
        scaledown_after=600,
    ),
    requires_auth=False,
)

if __name__ == "__main__":
    # initialize the Flyte client before making any remote calls
    flyte.init_from_config()

    # prefetch the model
    print("Prefetching model...")
    run = flyte.prefetch.hf_model(
        repo="Qwen/Qwen3-0.6B",
        artifact_name="qwen-0.6b",
        cpu="4",
        mem="16Gi",
        ephemeral_storage="50Gi",
    )

    # wait for completion
    print("Waiting for prefetch to complete...")
    run.wait()
    print(f"Model prefetched: {run.outputs()[0].path}")

    # deploy the app
    print("Deploying app...")
    app = flyte.serve(
        vllm_app.clone_with(
            name=vllm_app.name,
            model_path=flyte.app.RunOutput(type="directory", run_name=run.name),
            model_hf_path=None,
            stream_model=True,
        )
    )
    print(f"App deployed: {app.url}")
# {{/docs-fragment complete-example}}
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/serve-and-deploy-apps/prefetch_examples.py*

## Best practices

1. **Prefetch before deployment**: Prefetch models before deploying apps so that startup does not wait on a download from HuggingFace
2. **Version models**: Use meaningful artifact names so that the model is easy to identify in object store paths
3. **Shard appropriately**: Shard models for the GPU configuration you'll use for inference
4. **Reuse prefetched models**: Once prefetched, a model is cached in your storage backend, so apps can load it from there for faster serving
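
One way to keep artifact names meaningful is to derive them from the repo name and a version label. The sketch below is only an illustration: `versioned_artifact_name` is a hypothetical helper, and the naming convention (lowercase, hyphen-separated, dots kept) is an assumption rather than a rule enforced by Flyte.

```python
import re


def versioned_artifact_name(repo: str, version: str) -> str:
    """Derive an artifact name such as 'qwen3-0.6b-v1' from a HuggingFace repo.

    Assumed convention: take the model part of '<org>/<model>', lowercase it,
    append the version label, and replace any run of characters other than
    letters, digits, and dots with a single hyphen.
    """
    model = repo.split("/")[-1].lower()
    raw = f"{model}-{version.lower()}"
    return re.sub(r"[^a-z0-9.]+", "-", raw).strip("-")


name = versioned_artifact_name("Qwen/Qwen3-0.6B", "v1")
print(name)
```

The returned string can then be passed as `artifact_name` to `flyte.prefetch.hf_model`, as in the `custom-artifact-name` fragment above.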

## Troubleshooting

**Prefetch fails:**
- Check HuggingFace token (if required)
- Verify model repo exists and is accessible
- Check resource availability
- Review prefetch task logs

**Sharding fails:**
- Ensure the accelerator requested for the prefetch matches the shard config
- Check that GPU memory is sufficient for the model
- Verify that `tensor_parallel_size` matches the GPU count
- Review prefetch task logs for sharding-related errors
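
The GPU-count check can be run locally before submitting a prefetch. This is a minimal sketch, assuming accelerator strings follow the `<device>:<count>` form used on this page (for example `L40s:4`); `gpu_count` and `check_sharding` are hypothetical helpers, not part of the Flyte SDK.

```python
def gpu_count(accelerator: str) -> int:
    """Parse the GPU count from a '<device>:<count>' string such as 'L40s:4'."""
    device, sep, count = accelerator.rpartition(":")
    if not sep or not device:
        # No ':<count>' suffix: this sketch assumes a single GPU.
        return 1
    return int(count)


def check_sharding(accelerator: str, tensor_parallel_size: int) -> None:
    """Raise ValueError if tensor_parallel_size differs from the GPU count."""
    gpus = gpu_count(accelerator)
    if gpus != tensor_parallel_size:
        raise ValueError(
            f"tensor_parallel_size={tensor_parallel_size} does not match "
            f"{gpus} GPU(s) requested by accelerator={accelerator!r}"
        )


# Passes silently: 4 GPUs requested, 4-way tensor parallelism
check_sharding("L40s:4", tensor_parallel_size=4)
```

Calling `check_sharding("L40s:1", tensor_parallel_size=4)` raises a `ValueError`, which surfaces the mismatch before any GPU time is spent.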

**Model not found in app:**
- Verify that `RunOutput` references the correct run name
- Check that the prefetch run completed successfully
- Ensure `model_path` is set correctly
- Review app startup logs

