# Sandboxing
> This bundle contains all pages in the Sandboxing section.
> Source: https://www.union.ai/docs/v2/union/user-guide/sandboxing/

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/sandboxing ===

# Sandboxing

A **sandbox** is an isolated, secure environment where code can run without affecting the host system.
Sandboxes restrict what the executing code can do — limiting filesystem access, blocking network calls, and preventing arbitrary system operations — so that even malicious or buggy code cannot cause harm.

The exact restrictions depend on the sandboxing approach: some sandboxes eliminate dangerous operations entirely, while others provide full capabilities within an isolated, disposable container.

## Why sandboxing matters for AI

LLM-generated code is inherently untrusted.
The model may produce code that is correct and useful, but it can also produce code that is dangerous — and it does so without intent or awareness.

| Risk | Example |
|------|---------|
| Data destruction | `DELETE FROM orders WHERE 1=1` — wipes an entire table |
| Credential exfiltration | Reads environment variables and sends API keys to an external endpoint |
| Infinite loops | `while True: pass` — consumes CPU indefinitely |
| Resource abuse | Spawns thousands of threads or allocates unbounded memory |
| Filesystem damage | `rm -rf /` or overwrites critical configuration files |
| Network abuse | Makes unauthorized API calls, sends spam, or joins a botnet |

Running LLM-generated code without a sandbox means trusting the model to never make these mistakes.
Sandboxing removes that trust requirement: depending on the approach, dangerous operations are either structurally impossible or confined to a disposable environment.

## Types of sandboxes

There are three broad approaches to sandboxing LLM-generated code, each with different tradeoffs:

| Type | How it works | Tradeoffs | Examples |
|------|-------------|-----------|----------|
| **One-shot execution** | Code runs to completion in a disposable container, then the container is discarded. Stdout, stderr, and outputs are captured. | Simple, no state reuse. Good for single-turn tasks. | Container tasks, serverless functions |
| **Interactive sessions** | A persistent VM or container where you send commands incrementally and observe results between steps. Sessions last for the lifetime of the VM. | Flexible and multi-turn, but heavier to provision and manage. | E2B, Daytona, fly.io |
| **Programmatic tool calling** | The LLM generates orchestration code that calls a predefined set of tools. The orchestration code runs in a sandbox while the tools run in full containers. | Durable, observable, and secure. Tools are known ahead of time. | Flyte workflow sandboxing |

## What Flyte offers

Flyte provides two complementary sandboxing approaches:

### Workflow sandbox (Monty)

A **sandboxed orchestrator** built on [Monty](https://github.com/pydantic/pydantic-monty), a Rust-based sandboxed Python interpreter.
The sandbox starts in microseconds, runs pure Python control flow, and dispatches heavy work to full container tasks through the Flyte controller.

This enables the **programmatic tool calling** pattern (also known as code mode): LLMs generate Python orchestration code that invokes registered tools, and Flyte executes it safely with full durability, observability, and type checking.

### Code sandbox (container)

A **stateless code sandbox** that runs arbitrary Python scripts or shell commands inside an ephemeral Docker container.
The container is built on demand from declared dependencies, executed once, and discarded.

This is the right choice when you need full Python capabilities — third-party packages, file I/O, shell commands, or any computation that goes beyond pure control flow.

### When to use which

| | Workflow sandbox | Code sandbox |
|---|---|---|
| **Runtime** | Monty (Rust-based Python interpreter) | Ephemeral Docker container |
| **Startup** | Microseconds | Seconds (image build + container spin-up) |
| **Capabilities** | Pure Python control flow only — no imports, no I/O, no network | Full Python environment — any package, any library, full I/O |
| **Use case** | LLM-generated orchestration logic that calls registered tools | Arbitrary computation — data processing, test execution, ETL, shell pipelines |
| **State** | Runs within a worker container process | Stateless — fresh container per invocation |
| **Security model** | Dangerous operations are structurally impossible | Isolated container |

- Use the **workflow sandbox** when you need to run untrusted control flow (loops, conditionals, routing) that dispatches work to known tasks. It starts in microseconds and provides the strongest isolation guarantees.
- Use the **code sandbox** when you need full Python capabilities — third-party packages, file I/O, shell commands, or any computation that goes beyond pure control flow.

### Learn more

- **Sandboxing > Workflow sandboxing in Flyte** — How the Monty-based sandboxed orchestrator works, with examples
- **Sandboxing > Programmatic tool calling for agents** — The concept behind programmatic tool calling and how to build agents that use it
- **Sandboxing > Code sandboxing** — Running arbitrary code and commands in ephemeral containers with `flyte.sandbox.create()`

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/sandboxing/workflow-sandboxing-flyte ===

# Workflow sandboxing in Flyte

Flyte provides a sandboxed orchestrator that lets you run pure Python control flow in a secure sandbox while dispatching heavy work to full container tasks.
This enables patterns where LLMs generate orchestration code dynamically, and Flyte executes it safely with full durability and observability.

## Why workflow sandboxing?

Three properties of Flyte make it a natural fit for sandboxed code execution:

1. **Infrastructure on demand**: Flyte spins up containers with specific permissions, secrets, and resources for each task.
2. **LLMs are great at Python**: Models trained on billions of lines of code can reliably generate Python orchestration logic.
3. **Microsecond startup**: The sandbox is powered by [Monty](https://github.com/pydantic/pydantic-monty) (Pydantic's Rust-based Python interpreter), which starts in microseconds without the overhead of VMs or containers.

The result: LLMs generate the orchestration code (control flow, conditionals, loops), and Flyte tasks handle the heavy lifting (data access, computation, external APIs) in full containers.

## How it works

Your generated code runs inside one or more **Monty sandboxes** — lightweight Python interpreters embedded within a **worker container**. Each sandbox can execute pure Python (variables, loops, conditionals, function calls) but has no access to the filesystem, network, imports, or OS. A **bridge layer** acts as a hypervisor between the worker container and the sandboxes, handling opaque IO and routing callable tasks.

When your code calls an external task, the bridge dispatches it — either directly in the outer Python process or as a durable remote call through the Flyte controller (via the Queue Service):

```mermaid
flowchart TB
    subgraph worker["Worker Container"]
        subgraph bridge["Bridge / Hypervisor"]
            IO["Opaque IO: File, Dir, DataFrame"]
            subgraph sandbox1["Monty Sandbox 1"]
                A1["Your code: loops, variables, conditionals"]
                B1["result = add(x, y)"]
            end
            subgraph sandbox2["Monty Sandbox 2"]
                A2["More sandboxed code"]
            end
        end
    end

    A1 --> B1
    B1 -- "callable task" --> bridge
    bridge -- "result" --> B1
    IO -. "routed to tasks" .-> bridge
    bridge -- "external call" --> QS["Queue Service"]
    QS -- "completion" --> bridge
```

Each sandbox sees external tasks as opaque function calls. When your code hits one, Monty **pauses**, and the bridge layer dispatches the task — either directly in the outer Python process or as a durable remote call through the Flyte controller (via the Queue Service). Once the call completes, Monty **resumes** with the result. Your code never knows the difference — it just looks like a regular function call that returns a value. Multiple Monty sandboxes can run within the same worker container, each isolated like a lightweight VM.

**Opaque IO types** like `File`, `Dir`, and `DataFrame` are managed by the bridge layer and pass through the sandbox without inspection. Your code can route them between tasks but cannot read or modify their contents.
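
The pause-and-resume handoff described above can be sketched conceptually with plain `asyncio`. This is only an analogy for the mechanism, not the actual bridge API: `bridge_dispatch` and `sandboxed_code` are made-up names.

```python
import asyncio

def add(x: int, y: int) -> int:
    # Stands in for a full container task.
    return x + y

async def bridge_dispatch(task, *args):
    # In Flyte this hop goes through the bridge and, for remote calls,
    # the controller's Queue Service; here we just invoke the function.
    return task(*args)

async def sandboxed_code(x: int, y: int) -> int:
    # From the sandbox's point of view, this is an ordinary call that
    # suspends until the external task completes, then resumes.
    result = await bridge_dispatch(add, x, y)
    return result * 2

print(asyncio.run(sandboxed_code(2, 3)))  # prints 10
```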

## Example: sandboxed orchestrator

Use `@env.sandbox.orchestrator` to define a sandboxed task that calls regular worker tasks.
The orchestrator contains only pure Python control flow — all heavy computation runs in worker containers.

```python
import flyte

env = flyte.TaskEnvironment(name="sandboxed-demo")

# Worker tasks — run in their own containers
@env.task
def add(x: int, y: int) -> int:
    return x + y

@env.task
def multiply(x: int, y: int) -> int:
    return x * y

@env.task
def fib(n: int) -> int:
    """Compute the nth Fibonacci number iteratively."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

# Sandboxed orchestrator — pure Python control flow
@env.sandbox.orchestrator
def pipeline(n: int) -> dict[str, int]:
    fib_result = fib(n)
    linear_result = add(multiply(n, 2), 5)
    total = add(fib_result, linear_result)

    return {
        "fib": fib_result,
        "linear": linear_result,
        "total": total,
    }
```

When `pipeline` runs, Monty executes the control flow in the sandbox. Each call to `fib`, `multiply`, and `add` pauses the sandbox, runs the worker task in a container, and resumes with the result.

Both `def` and `async def` orchestrators are supported — Monty natively handles `await` expressions.

## Example: dynamic code execution

For cases where the code itself is generated at runtime — from templates, user input, or LLM output — use `orchestrator_from_str()` and `orchestrate_local()`.

### Reusable task from a code string

`orchestrator_from_str()` creates a reusable task template from a Python code string.
The value of the **last expression** becomes the return value.

```python
import flyte
import flyte.sandbox

env = flyte.TaskEnvironment(name="code-string-demo")

@env.task
def add(x: int, y: int) -> int:
    return x + y

@env.task
def multiply(x: int, y: int) -> int:
    return x * y

# Create a reusable task from a code string
compute_pipeline = flyte.sandbox.orchestrator_from_str(
    """
    partial = add(x, y)
    multiply(partial, scale)
    """,
    inputs={"x": int, "y": int, "scale": int},
    output=int,
    tasks=[add, multiply],
    name="compute-pipeline",
)
# flyte.run(compute_pipeline, x=2, y=3, scale=4)  → 20
```

### One-shot local execution

`orchestrate_local()` executes a code string and returns the result directly — no task template, no controller.
Use it for quick one-off computations.

```python
result = await flyte.sandbox.orchestrate_local(
    "add(x, y) * 2",
    inputs={"x": 1, "y": 2},
    tasks=[add],
)
# result → 6
```

### Parameterized code generation

Because the code is a string, you can generate it programmatically:

```python
def make_reducer(operation: str) -> flyte.sandbox.CodeTaskTemplate:
    """Create a sandboxed task that reduces a list using the given operation."""
    if operation == "sum":
        body = """
            acc = 0
            for v in values:
                acc = acc + v
            acc
        """
    elif operation == "product":
        body = """
            acc = 1
            for v in values:
                acc = acc * v
            acc
        """
    else:
        raise ValueError(f"Unknown operation: {operation}")

    return flyte.sandbox.orchestrator_from_str(
        body,
        inputs={"values": list},
        output=int,
        name=f"reduce-{operation}",
    )

sum_task = make_reducer("sum")
product_task = make_reducer("product")
```

## Building agents with programmatic tool calling

The sandboxed orchestrator and `orchestrate_local()` are the foundation for building agents that use **programmatic tool calling** — systems where an LLM generates Python orchestration code, and the sandbox executes it with registered tools.

Because `orchestrate_local()` accepts a plain code string and a list of tool functions, you can wire it into an LLM generate-execute-retry loop: the model writes code, the sandbox runs it, and on failure the error feeds back to the model for correction.
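
The loop itself is only a few lines. In this sketch, `generate` and `execute` are hypothetical stand-ins for the LLM call and for `flyte.sandbox.orchestrate_local()`:

```python
def run_with_retries(generate, execute, prompt: str, max_retries: int = 2):
    """Generate code, execute it, and feed errors back for correction."""
    code = generate(prompt, error=None)
    for attempt in range(1 + max_retries):
        try:
            return execute(code)
        except Exception as exc:
            if attempt == max_retries:
                raise
            # Feed the failure back to the model for a corrected attempt.
            code = generate(prompt, error=str(exc))

# Stubbed usage: the first attempt fails, the retry succeeds.
def fake_generate(prompt, error):
    return "1 / 0" if error is None else "40 + 2"

print(run_with_retries(fake_generate, eval, "add the numbers"))  # prints 42
```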

See [Programmatic tool calling for agents](./code-mode) for the full concept, agent implementation patterns, and end-to-end examples.

## Syntax restrictions

Monty enforces strict syntax restrictions to guarantee sandbox safety.
These restrictions are a feature, not a limitation — they ensure that sandboxed code is deterministic and side-effect free.

### Allowed

| Feature | Notes |
|---------|-------|
| Variables and assignment | `x = 1` |
| Arithmetic and comparisons | `x + y`, `x > y` |
| String operations | Concatenation, formatting |
| `if`/`elif`/`else` | Conditional logic |
| `for` loops | Iteration over lists, ranges, dicts |
| `while` loops | Condition-based loops |
| Function definitions (`def`) | Local helper functions |
| `async def` and `await` | Async orchestrators |
| List/dict/tuple literals | `[1, 2, 3]`, `{"key": "value"}` |
| List comprehensions | `[x * 2 for x in items]` |
| `.append()` on lists | Building lists incrementally |
| Subscript reading | `x = d["key"]`, `x = l[0]` |
| External task calls | Calling registered `@env.task` workers |
| `raise` | Raising exceptions |

### Not allowed

| Feature | Workaround |
|---------|------------|
| `import` | All available functions are provided directly |
| Subscript assignment (`d[k] = v`, `l[i] = v`) | Build dicts as literals; use `.append()` for lists |
| Augmented assignment (`x += 1`) | Use `x = x + 1` |
| `class` definitions | Use dicts or tuples |
| `with` statements | Not needed — no resource management in sandbox |
| `try`/`except` | Errors propagate to the controller |
| Walrus operator (`:=`) | Use separate assignment |
| `yield`/`yield from` | Not supported |
| `global`/`nonlocal` | Not supported |
| Set literals/comprehensions | Use lists |
| `del` statements | Not supported |
| `assert` statements | Use `if` + `raise` |
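
As a concrete illustration, here is a small computation written using only allowed constructs (`summarize` is a made-up example; it is also valid ordinary Python, which is how the output comment below was produced):

```python
def summarize(values: list) -> dict:
    if values == []:
        raise ValueError("empty input")   # instead of `assert values`
    total = 0
    count = 0
    for v in values:
        total = total + v                 # instead of `total += v`
        count = count + 1
    # Build the dict as a literal instead of assigning keys one by one.
    return {"total": total, "count": count, "mean": total / count}

print(summarize([1, 2, 3]))  # {'total': 6, 'count': 3, 'mean': 2.0}
```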

### Type restrictions

- **Primitive types**: `int`, `float`, `str`, `bool`, `bytes`, `None`
- **Collection types**: `list`, `dict`, `tuple` (including generic forms like `list[int]`, `dict[str, float]`)
- **Opaque IO handles**: `File`, `Dir`, `DataFrame` — pass-through only, cannot be inspected in the sandbox
- **Union types**: `Optional[T]` and `Union` of allowed types
- **Not allowed**: Custom classes, dataclasses, Pydantic models, or any user-defined types

## Security model

The sandboxed orchestrator provides security through restriction, not trust:

- **No filesystem access**: Cannot read, write, or list files
- **No network access**: Cannot make HTTP requests, open sockets, or resolve DNS
- **No OS access**: Cannot spawn processes, read environment variables, or access system resources
- **No imports**: Cannot load any Python modules
- **Opaque IO**: `File`, `Dir`, and `DataFrame` values pass through the sandbox without inspection — the sandbox can route them between tasks but cannot read their contents
- **Type-checked boundaries**: Inputs and outputs are validated against declared types at the sandbox boundary
- **Deterministic execution**: The same inputs always produce the same outputs (excluding external task results)

The sandbox runs untrusted code safely because dangerous operations are not just discouraged — they are structurally impossible in the Monty runtime.

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/sandboxing/code-mode ===

# Programmatic tool calling for agents

**Programmatic tool calling** (also known as **code mode**) is a pattern where LLMs write executable code instead of making individual tool calls.
Rather than the model emitting a sequence of JSON tool-call objects and the system routing each one, the model generates a single block of code that calls multiple tools, transforms data, and applies logic — all executed in a sandbox.

The key insight: LLMs are trained on billions of lines of code, but only a small amount of synthetic tool-call data.
Code generation is a more natural and reliable output modality for models than structured tool-call schemas.

## Programmatic tool calling vs sequential tool calling

In sequential tool calling, every intermediate result passes through the model's context window.
The model calls one tool, reads the result, decides what to do next, calls another tool, and so on.
Each round-trip costs tokens and latency.

With programmatic tool calling, the model generates a complete program upfront.
The sandbox executes it, and only the final result returns to the model.

| Aspect | Sequential tool calling | Programmatic tool calling |
|--------|-------------|-----------|
| **Output format** | JSON tool-call objects, one at a time | A single block of executable code |
| **Data flow** | Every intermediate result passes through the model | Intermediate results stay in the sandbox |
| **Context overhead** | Grows with each tool call (all results in context) | Fixed — only tool signatures in context |
| **Multi-step logic** | Model re-invoked at every step | Sandbox executes loops, conditionals, transforms |
| **Scaling with tools** | Context grows linearly with number of tool definitions | Tools discovered progressively or loaded on demand |

## Why programmatic tool calling is powerful

### Token efficiency

Sequential tool calling loads all tool definitions into the context window upfront and passes every intermediate result through the model.
Programmatic tool calling reduces this dramatically:

- **98%+ context reduction** reported by Anthropic when using code execution with MCP servers — from 150,000 tokens down to 2,000 tokens for the same task.
- **99.9% reduction** reported by Cloudflare for large APIs — approximately 1,000 tokens with programmatic tool calling versus 1.17 million tokens when exposing each API endpoint as a separate tool.

### Performance

By eliminating round-trips through the model for intermediate steps, programmatic tool calling achieves significant speed improvements.
The sandbox evaluates conditionals, loops, and data transformations locally — no "time to first token" delay for each step.

### Natural programming patterns

Code naturally expresses patterns that are awkward or impossible in tool-call sequences:

- **Loops**: Process a list of items without the model deciding "call this tool again" for each one
- **Conditionals**: Branch on intermediate results without another model invocation
- **Data transformation**: Filter, map, and aggregate data before passing it to the next tool
- **Variable reuse**: Store intermediate results and reference them later

### Progressive tool discovery

Instead of loading hundreds of tool definitions into the context window, programmatic tool calling supports progressive discovery.
The model can search for relevant tools, load only what it needs, and compose them in code.
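
One hypothetical shape for this is to expose the search itself as a tool and let the generated code look things up on demand. The registry and `search_tools` helper below are illustrative, not part of any Flyte API:

```python
# Hypothetical registry of tool descriptions.
TOOL_DOCS = {
    "fetch_data": "Fetch tabular data by dataset name.",
    "create_chart": "Generate a Chart.js HTML snippet.",
    "calculate_statistics": "Descriptive statistics for a numeric column.",
}

def search_tools(query: str) -> list:
    """Return tool names whose description mentions the query."""
    q = query.lower()
    return [name for name, doc in TOOL_DOCS.items() if q in doc.lower()]

# The model's generated code can discover tools as it goes:
print(search_tools("chart"))  # ['create_chart']
```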

### Data privacy

Intermediate results stay in the sandbox execution environment.
They never re-enter the model's context window, which means sensitive data (PII, credentials, financial records) can be processed without the model seeing it.

## Example: sequential vs programmatic tool calling

Consider a task: "Analyze sales data, filter for Q4, calculate statistics, and create a chart."

### Sequential tool calling approach

The model makes serial tool calls, with each result passing through the context window:

```
Step 1: Model → tool_call: fetch_data("sales_2024")
        Result: [150KB of sales data] → back into model context

Step 2: Model → tool_call: filter_data(data, "month", ">=", "Oct")
        Result: [40KB of filtered data] → back into model context

Step 3: Model → tool_call: calculate_statistics(filtered, "revenue")
        Result: {"mean": 112000, ...} → back into model context

Step 4: Model → tool_call: create_chart("bar", "Q4 Revenue", ...)
        Result: "<canvas>...</canvas>" → back into model context
```

Four round-trips through the model.
The 150KB dataset enters the context window and stays there.

### Programmatic tool calling approach

The model generates a single code block:

```python
data = fetch_data("sales_2024")
q4_months = ["Oct", "Nov", "Dec"]
q4_data = [row for row in data if row["month"] in q4_months]
stats = calculate_statistics(q4_data, "revenue")

months = []
revenues = []
for row in q4_data:
    if row["month"] not in months:
        months.append(row["month"])
for month in months:
    total = 0
    for row in q4_data:
        if row["month"] == month:
            total = total + row["revenue"]
    revenues.append(total)

chart = create_chart("bar", "Q4 Revenue by Month", months, revenues)
{"charts": [chart], "summary": "Q4 stats: " + str(stats)}
```

One model invocation.
The raw dataset never enters the model's context window.
The sandbox handles the filtering, aggregation, and chart creation locally.

## Example: defining tools

Tools are plain Python functions with type annotations and docstrings.
The agent auto-generates its system prompt from these signatures, so adding a tool requires no other changes.

```python
async def fetch_data(dataset: str) -> list:
    """Fetch tabular data by dataset name.

    Available datasets:
    - "sales_2024": columns month, region, revenue, units
    - "employees": columns name, department, salary, years_exp, performance_rating
    - "website_traffic": columns date, page, visitors, bounce_rate, avg_duration
    - "inventory": columns product, category, stock, price, supplier
    """
    ...

async def create_chart(chart_type: str, title: str, labels: list, values: list) -> str:
    """Generate a self-contained Chart.js HTML snippet.

    Args:
        chart_type: One of "bar", "line", "pie", "doughnut".
        title: Chart title displayed above the canvas.
        labels: X-axis labels (or slice labels for pie/doughnut).
        values: Either a flat list of numbers, or a list of
                {"label": str, "data": list[number]} dicts for multi-series.
    """
    ...

async def calculate_statistics(data: list, column: str) -> dict:
    """Calculate descriptive statistics for a numeric column.

    Returns dict with keys: count, mean, median, min, max, std_dev.
    """
    ...

async def filter_data(data: list, column: str, operator: str, value: object) -> list:
    """Filter rows where column matches the condition.

    Operator: one of "==", "!=", ">", ">=", "<", "<=".
    """
    ...

ALL_TOOLS = {
    "fetch_data": fetch_data,
    "create_chart": create_chart,
    "calculate_statistics": calculate_statistics,
    "filter_data": filter_data,
}
```

The `ALL_TOOLS` dict is the single source of truth.
The agent introspects it to build the system prompt, and the sandbox uses it to resolve function calls.
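
The introspection step can be sketched as follows. The prompt wording here is illustrative; the exact format the agent uses is not specified on this page:

```python
import inspect

def build_system_prompt(tools: dict) -> str:
    """Render each tool's signature and docstring summary into a prompt."""
    lines = ["You can call these Python functions from your code:"]
    for name, fn in tools.items():
        sig = inspect.signature(fn)
        doc = inspect.getdoc(fn) or ""
        summary = doc.splitlines()[0] if doc else ""
        lines.append(f"- {name}{sig}: {summary}")
    return "\n".join(lines)

async def fetch_data(dataset: str) -> list:
    """Fetch tabular data by dataset name."""
    ...

print(build_system_prompt({"fetch_data": fetch_data}))
# You can call these Python functions from your code:
# - fetch_data(dataset: str) -> list: Fetch tabular data by dataset name.
```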

## Example: programmatic tool-calling agent

The `CodeModeAgent` implements the generate-execute-retry loop:

```python
from dataclasses import dataclass, field

import flyte.sandbox
from _tools import ALL_TOOLS

# `generate_code(model, system_prompt, messages)` wraps the LLM API call (not shown).

@dataclass
class AgentResult:
    code: str
    charts: list = field(default_factory=list)
    summary: str = ""
    error: str = ""

class CodeModeAgent:
    def __init__(self, tools, *, model="claude-sonnet-4-6", max_retries=2):
        self._tools = tools
        self._model = model
        self._max_retries = max_retries
        # System prompt auto-generated from tool signatures + docstrings
        self.system_prompt = self._build_system_prompt()

    async def run(self, message: str, history: list[dict]) -> AgentResult:
        messages = [*history, {"role": "user", "content": message}]

        # Step 1: LLM generates Python code
        code = await generate_code(self._model, self.system_prompt, messages)

        # Step 2: Execute in Monty sandbox with registered tools
        for attempt in range(1 + self._max_retries):
            try:
                result = await flyte.sandbox.orchestrate_local(
                    code,
                    inputs={"_unused": 0},
                    tasks=list(self._tools.values()),
                )
                return AgentResult(code=code, charts=result.get("charts", []),
                                   summary=result.get("summary", ""))
            except Exception as exc:
                if attempt < self._max_retries:
                    # Step 3: Feed error back to LLM for retry
                    code = await generate_code(
                        self._model, self.system_prompt,
                        [*messages,
                         {"role": "assistant", "content": f"```python\n{code}\n```"},
                         {"role": "user", "content": f"Error: {exc}\nFix the code."}],
                    )
                    continue
                return AgentResult(code=code, error=str(exc))
```

The pattern:

1. **Generate**: The LLM receives tool signatures and the user's request, and outputs Python code.
2. **Execute**: The code runs in the Monty sandbox. Tool calls pause the sandbox, dispatch to real implementations, and resume with results.
3. **Retry**: If execution fails, the error message is fed back to the LLM, which generates a corrected version. This repeats up to `max_retries` times.
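
One practical detail the example glosses over: models typically wrap generated code in a markdown fence, so the generate step has to strip it before execution. A hypothetical sketch of that helper (the fence string is built at runtime only so the literal does not clash with this page's own code fences):

```python
import re

FENCE = "`" * 3  # three backticks, built at runtime to avoid clashing with doc fences

def extract_code(reply: str) -> str:
    """Return the body of the first fenced code block, or the raw reply."""
    pattern = FENCE + r"(?:python)?\n(.*?)" + FENCE
    match = re.search(pattern, reply, re.DOTALL)
    return match.group(1).strip() if match else reply.strip()

reply = "Here you go:\n" + FENCE + "python\nx = add(1, 2)\nx\n" + FENCE
print(extract_code(reply))  # prints the two code lines without the fence
```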

## Example: chat app

Wrap the agent in a FastAPI endpoint to create a conversational analytics assistant:

```python
from _agent import CodeModeAgent
from _tools import ALL_TOOLS
from fastapi import FastAPI

import flyte
from flyte.app.extras import FastAPIAppEnvironment

# ChatRequest and ChatResponse are simple Pydantic request/response models (not shown).
app = FastAPI(title="Chat Data Analytics Agent")

env = FastAPIAppEnvironment(
    name="chat-analytics-agent",
    app=app,
    image=flyte.Image.from_debian_base().with_pip_packages(
        "fastapi", "uvicorn", "httpx", "pydantic-monty",
    ),
    secrets=flyte.Secret(key="anthropic-api-key", as_env_var="ANTHROPIC_API_KEY"),
)

agent = CodeModeAgent(tools=ALL_TOOLS, max_retries=2)

@app.post("/api/chat")
async def chat(req: ChatRequest) -> ChatResponse:
    result = await agent.run(req.message, req.history)
    return ChatResponse(
        code=result.code,
        charts=result.charts,
        summary=result.summary,
        error=result.error,
    )
```

Users send natural language requests (`"Show me monthly revenue trends for 2024"`), the agent generates analysis code, the sandbox executes it with the registered tools, and the response includes charts and a text summary.

## Example: durable agent

For production workloads, wrap the tools as `@env.task` so the sandbox dispatches them as durable Flyte tasks through the controller.
This gives you execution history, retries, caching, and full observability.

```python
from _agent import CodeModeAgent
import _tools
from _tools import ALL_TOOLS

import flyte
import flyte.report

env = flyte.TaskEnvironment(
    name="llm-code-mode",
    secrets=[flyte.Secret(key="anthropic-api-key", as_env_var="ANTHROPIC_API_KEY")],
    image=flyte.Image.from_debian_base().with_pip_packages(
        "httpx", "pydantic-monty", "unionai-reuse",
    ),
)

# Wrap each tool as a durable task
@env.task
async def fetch_data(dataset: str) -> list:
    return await _tools.fetch_data(dataset)

@env.task
async def create_chart(chart_type: str, title: str, labels: list, values: list) -> str:
    return await _tools.create_chart(chart_type, title, labels, values)

# ... wrap remaining tools similarly ...

# Agent uses plain functions for prompt generation,
# @env.task versions for durable sandbox execution
durable_tools = {t.func.__name__: t for t in [fetch_data, create_chart, ...]}
agent = CodeModeAgent(tools=ALL_TOOLS, execution_tools=durable_tools)

@env.task(report=True)
async def analyze(request: str) -> str:
    """Run the code-mode agent and render an HTML report."""
    result = await agent.run(request, [])
    report_html = build_report(request, result)  # build_report: HTML-rendering helper (not shown)
    await flyte.report.replace.aio(report_html)
    await flyte.report.flush.aio()
    return result.summary
```

The key difference from the chat app: each tool call goes through the Flyte controller as a durable task.
If `fetch_data` fails, Flyte retries it automatically.
Every tool invocation is recorded and visible in the execution timeline.

Run it with:

```bash
flyte run durable_agent.py analyze \
    --request "Show me monthly revenue trends for 2024, broken down by region"
```

## References

- [Code execution with MCP](https://www.anthropic.com/engineering/code-execution-with-mcp) — Anthropic engineering blog on the code execution pattern
- [Code Mode](https://blog.cloudflare.com/code-mode/) — Cloudflare's introduction to code mode for LLM tool calling
- [Code Mode MCP](https://blog.cloudflare.com/code-mode-mcp/) — Cloudflare's server-side code mode implementation
- [Code Mode Protocol](https://github.com/universal-tool-calling-protocol/code-mode) — Open specification for the code mode pattern

=== PAGE: https://www.union.ai/docs/v2/union/user-guide/sandboxing/code-sandboxing ===

# Code sandboxing

`flyte.sandbox.create()` runs arbitrary Python code or shell commands inside an ephemeral, stateless Docker container.
The container is built on demand from declared dependencies, executed once, and discarded.
Each invocation starts from a clean slate — no filesystem state, environment variables, or side effects carry over between runs.

## Execution modes

`flyte.sandbox.create()` supports three mutually exclusive execution modes.

### Auto-IO mode

The default mode. Write only the business logic — Flyte generates the I/O boilerplate automatically.

How it works:

1. Flyte generates an `argparse` preamble that parses declared inputs from CLI arguments.
2. Declared inputs become local variables in scope.
3. After your code runs, Flyte writes declared scalar outputs to `/var/outputs/` automatically.

```python{hl_lines=[2, 4, 6, 11]}
import flyte
import flyte.sandbox

sandbox = flyte.sandbox.create(
    name="double",
    code="result = x * 2",
    inputs={"x": int},
    outputs={"result": int},
)

result = await sandbox.run.aio(x=21)  # returns 42
```

No imports, no argument parsing, no file writing. The variable `x` is available directly, and the variable `result` is captured automatically because it matches a declared output name.
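
Conceptually, the generated wrapper behaves like this plain-Python sketch. It is illustrative only: `run_auto_io` is a made-up stand-in, and the preamble Flyte actually generates is an implementation detail.

```python
import argparse
import pathlib

def run_auto_io(argv: list, outputs_dir: str) -> int:
    # 1. Parse the declared input `x` from CLI-style arguments.
    parser = argparse.ArgumentParser()
    parser.add_argument("--x", type=int)
    args = parser.parse_args(argv)
    x = args.x

    # 2. Run the user's code with inputs in scope.
    result = x * 2

    # 3. Write the declared output for collection.
    pathlib.Path(outputs_dir, "result").write_text(str(result))
    return result
```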

A more involved example with third-party packages:

```python{hl_lines=["6-11", 14, 22, 26]}
import datetime

import flyte.sandbox
_stats_code = """\
import numpy as np
nums = np.array([float(v) for v in values.split(",")])
mean = float(np.mean(nums))
std  = float(np.std(nums))

window_end = dt + delta
"""

stats_sandbox = flyte.sandbox.create(
    name="numpy-stats",
    code=_stats_code,
    inputs={
        "values": str,
        "dt": datetime.datetime,
        "delta": datetime.timedelta,
    },
    outputs={"mean": float, "std": float, "window_end": datetime.datetime},
    packages=["numpy"],
)

mean, std, window_end = await stats_sandbox.run.aio(
    values="1,2,3,4,5",
    dt=datetime.datetime(2024, 1, 1),
    delta=datetime.timedelta(days=1),
)
```

When there are multiple outputs, `.run()` returns them as a tuple in declaration order.

### Verbatim mode

Set `auto_io=False` to run a complete Python script with full control over I/O.
Flyte runs the script exactly as written — no injected preamble, no automatic output collection.

Your script must:

- Read inputs from `/var/inputs/<name>` (files are bind-mounted at these paths)
- Write outputs to `/var/outputs/<name>`

```python{hl_lines=["5-10", 13, 18]}
import flyte.sandbox
from flyte.io import File

_etl_script = """\
import json, pathlib

payload = json.loads(pathlib.Path("/var/inputs/payload").read_text())
total = sum(payload["values"])

pathlib.Path("/var/outputs/total").write_text(str(total))
"""

etl_sandbox = flyte.sandbox.create(
    name="etl-script",
    code=_etl_script,
    inputs={"payload": File},
    outputs={"total": int},
    auto_io=False,
)

total = await etl_sandbox.run.aio(payload=payload_file)
```

Use verbatim mode when you need precise control over how inputs are read and outputs are written, or when your script has its own argument parsing.

### Command mode

Run any shell command, binary, or pipeline. Provide `command` instead of `code`.

```python{hl_lines=[5]}
from flyte.io import File

linecount_sandbox = flyte.sandbox.create(
    name="line-counter",
    command=[
        "/bin/bash",
        "-c",
        "grep -c . /var/inputs/data_file > /var/outputs/line_count || echo 0 > /var/outputs/line_count",
    ],
    inputs={"data_file": File},
    outputs={"line_count": str},
)

count = await linecount_sandbox.run.aio(data_file=data_file)
```

Command mode is useful for running test suites, compiled binaries, shell pipelines, or any non-Python workload.

Use `arguments` to pass positional arguments to the command.
File inputs are bind-mounted at `/var/inputs/<name>` and can be referenced in the arguments list:

```python{hl_lines=[4, 5]}
sandbox = flyte.sandbox.create(
    name="test-runner",
    command=["/bin/bash", "-c", pytest_cmd],
    arguments=["/var/inputs/solution.py", "/var/inputs/tests.py"],
    inputs={"solution.py": File, "tests.py": File},
    outputs={"exit_code": str},
)
```

## Executing a sandbox

Call `.run()` on the sandbox object to build the image and execute.

**Async execution**

```python
result = await sandbox.run.aio(x=21)
```

**Sync execution**

```python
result = sandbox.run(x=21)
```

Both forms build the container image (if not already built), start the container, execute the code or command, collect outputs, and discard the container.

`flyte.sandbox.create()` defines the sandbox configuration and can be called at module level or inside a task. The actual container execution happens when you call `.run()`, which must run inside a Flyte task (either locally or remotely on the cluster).

### Error handling

If the sandbox code fails (non-zero exit code, Python exception, or timeout), `.run()` raises an exception with the error details.
If `retries` is set, Flyte automatically retries the execution before surfacing the error.

If the image build fails due to an invalid package, an `InvalidPackageError` is raised with the package name and the underlying error message.
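A minimal sketch of wrapping an execution in error handling. The broad `except Exception` is an assumption — the exact exception type raised for runtime failures is not specified here, only `InvalidPackageError` for build failures:

```python
import flyte
import flyte.sandbox

flaky = flyte.sandbox.create(
    name="maybe-divide",
    code="result = 1 / x",  # raises ZeroDivisionError inside the sandbox when x == 0
    inputs={"x": int},
    outputs={"result": float},
    retries=2,    # Flyte retries twice before surfacing the error
    timeout=60,   # fail any single attempt that exceeds 60 seconds
)

async def safe_divide(x: int) -> float | None:
    try:
        return await flaky.run.aio(x=x)
    except Exception as exc:
        # A non-zero exit code, Python exception, or timeout surfaces here
        # after all retries are exhausted.
        print(f"sandbox failed: {exc}")
        return None
```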

## Supported types

Inputs and outputs must use one of the following types:

| Category         | Types                                     |
| ---------------- | ----------------------------------------- |
| **Primitive**    | `int`, `float`, `str`, `bool`             |
| **Date/time**    | `datetime.datetime`, `datetime.timedelta` |
| **File handles** | `flyte.io.File`                           |

### How types are handled

**In auto-IO mode:**

- **Primitive and date/time inputs** are injected as local variables with the correct Python type. Flyte generates an `argparse` preamble behind the scenes — your code just uses the variable names directly.
- **`File` inputs** are bind-mounted into the container. The input variable contains the file path as a string (e.g., `"/var/inputs/payload"`), so you can read it with `pathlib.Path(payload).read_text()`.
- **Primitive and date/time outputs** are written to `/var/outputs/<name>` automatically. Just assign the value to a variable matching the declared output name.
- **`File` outputs** are the exception — your code must write the file to `/var/outputs/<name>` manually.
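To make the preamble behavior concrete, here is an illustrative stand-in for the kind of `argparse` wiring auto-IO generates for `inputs={"x": int, "label": str}` — a hypothetical sketch, not Flyte's actual generated code:

```python
import argparse

# Hypothetical preamble of the kind auto-IO generates; the real generated
# code may differ. Each declared input becomes a typed flag.
parser = argparse.ArgumentParser()
parser.add_argument("--x", type=int)
parser.add_argument("--label", type=str)
args = parser.parse_args(["--x", "21", "--label", "demo"])

# Declared input names become ordinary local variables before user code runs:
x, label = args.x, args.label
result = x * 2  # the user's one-liner then executes with x already in scope
```

This is why the sandbox code can reference `x` directly with no imports or parsing of its own.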

**In verbatim mode:**

- All inputs (including primitives) are available at `/var/inputs/<name>`. Your script reads them directly from the filesystem.
- All outputs must be written to `/var/outputs/<name>` by your script.

**In command mode:**

- `File` inputs are bind-mounted at `/var/inputs/<name>`.
- All outputs must be written to `/var/outputs/<name>` by your command.

## Configuring the container image

### Python packages

Install PyPI packages with `packages`:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="data-analysis",
    code="...",
    inputs={"data": str},
    outputs={"result": str},
    packages=["numpy", "pandas>=2.0", "scikit-learn"],
)
```

### System packages

Install system-level (apt) packages with `system_packages`:

```python{hl_lines=[7]}
sandbox = flyte.sandbox.create(
    name="image-processor",
    code="...",
    inputs={"image": File},
    outputs={"result": File},
    packages=["Pillow"],
    system_packages=["libgl1-mesa-glx", "libglib2.0-0"],
)
```

> [!NOTE]
> `gcc`, `g++`, and `make` are included automatically in every sandbox image.

### Additional Dockerfile commands

For advanced image customization, use `additional_commands` to inject arbitrary `RUN` commands into the Dockerfile:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="custom-env",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    additional_commands=["curl -sSL https://example.com/setup.sh | bash"],
)
```

### Pre-built images

Skip the image build entirely by providing a pre-built image URI:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="prebuilt",
    code="result = x + 1",
    inputs={"x": int},
    outputs={"result": int},
    image="ghcr.io/my-org/my-sandbox-image:latest",
)
```

### Image configuration

Control the registry and Python version with `ImageConfig`:

```python{hl_lines=["8-12"]}
from flyte.sandbox import ImageConfig

sandbox = flyte.sandbox.create(
    name="custom-registry",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    image_config=ImageConfig(
        registry="ghcr.io/my-org",
        registry_secret="ghcr-credentials",
        python_version=(3, 12),
    ),
)
```

## Runtime configuration

### Resources

Set CPU and memory limits for the container:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="heavy-compute",
    code="...",
    inputs={"data": str},
    outputs={"result": str},
    resources=flyte.Resources(cpu=4, memory="8Gi"),
)
```

The default is 1 CPU and 1Gi memory.

### Retries

Automatically retry failed executions:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="flaky-task",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    retries=3,
)
```

### Timeout

Set a maximum execution time in seconds:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="bounded-task",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    timeout=300,  # 5 minutes
)
```

### Environment variables

Inject environment variables into the container:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="configured-task",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    env_vars={"LOG_LEVEL": "DEBUG", "FEATURE_FLAG": "true"},
)
```

### Secrets

Mount Flyte secrets into the container:

```python{hl_lines=[6]}
sandbox = flyte.sandbox.create(
    name="authenticated-task",
    code="...",
    inputs={"query": str},
    outputs={"result": str},
    secrets=[flyte.Secret(key="api-key", as_env_var="API_KEY")],
)
```

### Caching

Control output caching behavior:

```python{hl_lines=["6-8"]}
sandbox = flyte.sandbox.create(
    name="cached-task",
    code="...",
    inputs={"x": int},
    outputs={"y": int},
    cache="auto",      # default — Flyte decides based on inputs
    # cache="override"  # force re-execution and update cache
    # cache="disable"   # no caching
)
```

## Deploying a sandbox as a task

Use `.as_task()` to convert a sandbox into a deployable `ContainerTask`.
The returned task has the generated script pre-filled as a default input, so retriggers from the UI only require user-declared inputs.

This pattern is useful when you want to define a sandbox dynamically (for example, with LLM-generated code) and then deploy it as a standalone task that others can trigger from the UI.

```python{hl_lines=[4, 11, "33-38"]}
import flyte
import flyte.sandbox
from flyte.io import File
from flyte.sandbox import sandbox_environment

# sandbox_environment provides the base runtime image for code sandboxes.
# Include it in depends_on so Flyte builds the sandbox runtime before your task runs.
env = flyte.TaskEnvironment(
    name="sandbox-demo",
    image=flyte.Image.from_debian_base(name="sandbox-demo"),
    depends_on=[sandbox_environment],
)

@env.task
async def deploy_sandbox_task() -> str:
    # Initialize the Flyte client for in-cluster operations (image building, deployment)
    flyte.init_in_cluster()

    sandbox = flyte.sandbox.create(
        name="deployable-sandbox",
        # In auto-IO mode, File inputs become path strings — read with pathlib
        code="""\
import json, pathlib
data = json.loads(pathlib.Path(payload).read_text())
total = sum(data["values"])
""",
        inputs={"payload": File},
        outputs={"total": int},
        resources=flyte.Resources(cpu=1, memory="512Mi"),
    )

    # Build the image and get a ContainerTask with the script pre-filled
    task = await sandbox.as_task.aio()

    # Create a TaskEnvironment from the task and deploy it
    deploy_env = flyte.TaskEnvironment.from_task("deployable-sandbox", task)
    versions = flyte.deploy(deploy_env)

    return versions[0].summary_repr()
```

## End-to-end example

The following example defines sandboxes in all three modes, creates helper tasks, and runs everything in a single pipeline:

```python
import datetime
from pathlib import Path

import flyte
import flyte.sandbox
from flyte.io import File
from flyte.sandbox import sandbox_environment

# sandbox_environment provides the base runtime for code sandboxes.
# Include it in depends_on so the sandbox runtime is available when tasks execute.
env = flyte.TaskEnvironment(
    name="sandbox-demo",
    image=flyte.Image.from_debian_base(name="sandbox-demo"),
    depends_on=[sandbox_environment],
)

# Auto-IO mode: pure computation
sum_sandbox = flyte.sandbox.create(
    name="sum-to-n",
    code="total = sum(range(n + 1)) if conditional else 0",
    inputs={"n": int, "conditional": bool},
    outputs={"total": int},
)

# Auto-IO mode with packages
_stats_code = """\
import numpy as np
nums = np.array([float(v) for v in values.split(",")])
mean = float(np.mean(nums))
std  = float(np.std(nums))
window_end = dt + delta
"""

stats_sandbox = flyte.sandbox.create(
    name="numpy-stats",
    code=_stats_code,
    inputs={
        "values": str,
        "dt": datetime.datetime,
        "delta": datetime.timedelta,
    },
    outputs={"mean": float, "std": float, "window_end": datetime.datetime},
    packages=["numpy"],
)

# Verbatim mode: full script control
_etl_script = """\
import json, pathlib

payload = json.loads(pathlib.Path("/var/inputs/payload").read_text())
total = sum(payload["values"])

pathlib.Path("/var/outputs/total").write_text(str(total))
"""

etl_sandbox = flyte.sandbox.create(
    name="etl-script",
    code=_etl_script,
    inputs={"payload": File},
    outputs={"total": int},
    auto_io=False,
)

# Command mode: shell pipeline
linecount_sandbox = flyte.sandbox.create(
    name="line-counter",
    command=[
        "/bin/bash",
        "-c",
        "grep -c . /var/inputs/data_file > /var/outputs/line_count || echo 0 > /var/outputs/line_count",
    ],
    inputs={"data_file": File},
    outputs={"line_count": str},
)

@env.task
async def create_text_file() -> File:
    path = Path("/tmp/data.txt")
    path.write_text("line 1\n\nline 2\n")
    return await File.from_local(str(path))

@env.task
async def payload_generator() -> File:
    path = Path("/tmp/payload.json")
    path.write_text('{"values": [1, 2, 3, 4, 5]}')
    return await File.from_local(str(path))

@env.task
async def run_pipeline() -> dict:
    # Auto-IO: sum 1..10 = 55
    total = await sum_sandbox.run.aio(n=10, conditional=True)

    # Auto-IO with numpy
    mean, std, window_end = await stats_sandbox.run.aio(
        values="1,2,3,4,5",
        dt=datetime.datetime(2024, 1, 1),
        delta=datetime.timedelta(days=1),
    )

    # Verbatim ETL
    payload = await payload_generator()
    etl_total = await etl_sandbox.run.aio(payload=payload)

    # Command mode: line count
    data_file = await create_text_file()
    line_count = await linecount_sandbox.run.aio(data_file=data_file)

    return {
        "sum_1_to_10": total,
        "mean": round(mean, 4),
        "std": round(std, 4),
        "window_end": window_end.isoformat(),
        "etl_sum_1_to_10": etl_total,
        "line_count": line_count,
    }

if __name__ == "__main__":
    flyte.init_from_config()

    r = flyte.run(run_pipeline)
    print(f"run url: {r.url}")
```

*Source: https://github.com/unionai/unionai-examples/blob/main/v2/user-guide/sandboxing/code_sandbox.py*

## API reference

### `flyte.sandbox.create()`

| Parameter             | Type              | Description                                                |
| --------------------- | ----------------- | ---------------------------------------------------------- |
| `name`                | `str`             | Sandbox name. Derives task and image names.                |
| `code`                | `str`             | Python source to run. Mutually exclusive with `command`.   |
| `inputs`              | `dict[str, type]` | Input type declarations.                                   |
| `outputs`             | `dict[str, type]` | Output type declarations.                                  |
| `command`             | `list[str]`       | Shell command to run. Mutually exclusive with `code`.      |
| `arguments`           | `list[str]`       | Arguments forwarded to `command`.                          |
| `packages`            | `list[str]`       | Python packages to install via pip.                        |
| `system_packages`     | `list[str]`       | System packages to install via apt.                        |
| `additional_commands` | `list[str]`       | Extra Dockerfile `RUN` commands.                           |
| `resources`           | `flyte.Resources` | CPU and memory limits. Default: 1 CPU, 1Gi memory.         |
| `image_config`        | `ImageConfig`     | Registry and Python version settings.                      |
| `image_name`          | `str`             | Explicit image name (overrides auto-generated).            |
| `image`               | `str`             | Pre-built image URI (skips build).                         |
| `auto_io`             | `bool`            | Auto-generate I/O wiring. Default: `True`.                 |
| `retries`             | `int`             | Number of retries on failure. Default: `0`.                |
| `timeout`             | `int`             | Timeout in seconds.                                        |
| `env_vars`            | `dict[str, str]`  | Environment variables for the container.                   |
| `secrets`             | `list[Secret]`    | Flyte secrets to mount.                                    |
| `cache`               | `str`             | `"auto"`, `"override"`, or `"disable"`. Default: `"auto"`. |

### Sandbox methods

| Method                            | Description                                                       |
| --------------------------------- | ----------------------------------------------------------------- |
| `sandbox.run(**kwargs)`           | Build the image and execute synchronously. Returns typed outputs. |
| `await sandbox.run.aio(**kwargs)` | Async version of `run()`.                                         |
| `sandbox.as_task()`               | Build the image and return a deployable `ContainerTask`.          |
| `await sandbox.as_task.aio()`     | Async version of `as_task()`.                                     |

Both `run()` and `as_task()` accept an optional `image` parameter to provide a pre-built image URI, skipping the build step.
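For example, a sketch reusing the `sandbox` object defined earlier (the image URI is a placeholder):

```python
# Reuse an image pushed by a previous build instead of rebuilding:
task = await sandbox.as_task.aio(image="ghcr.io/my-org/my-sandbox-image:latest")
```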

