BYO Model Deployment: AI Agents That Never Phone Home

Local models aren't frontier yet, but the gap shrinks every quarter. BYO model deployment lets you point AI agents at your own infrastructure today and your own laptop when they catch up.

BYO Model Deployment: AI Agents That Never Phone Home

Google dropped Gemma 4 last week. A family of open models purpose-built for agentic workflows: multi-step planning, autonomous action, tool use. The 31B dense model ranks #3 on Arena AI at 1452 Elo, outperforming models twenty times its size on reasoning benchmarks. Apache 2.0 license.

Meanwhile, a developer got Qwen3.5-397B running on a 48GB MacBook Pro -- a model that normally requires a server rack. Any Apple Silicon Mac with 16GB of RAM can run useful local models today via Ollama or LM Studio.

Let me be honest: none of these are replacing Claude or GPT-5 for complex agent workflows right now. If you need an AI to orchestrate five APIs, reason about edge cases, and handle errors gracefully, you still want a frontier model. But the trajectory is undeniable. The gap between local and frontier shrinks every quarter. We're trending toward a world where everyone deploys agents from their laptop.

The infrastructure should be ready before the models are.

Why the current model matters

Most workflow automation tools process your data on their servers using their AI provider. Your CRM records, payment data, customer information, and internal documents flow through infrastructure you don't control.

That made sense when running AI required a data center. It makes less sense every month. Companies already run their own Claude on AWS Bedrock, Gemini on Google Vertex, and OpenAI on Azure -- governed by their IAM policies, logged by their CloudTrail, contained within their VPC. 73% of enterprises cite data privacy as their top AI risk. The answer isn't to avoid AI. It's to run it on your own terms.

What BYO model deployment means for General Input

I added BYO model deployment to General Input this weekend. Instead of using the built-in AI, you point the agent at your own model endpoint. The workflow engine handles scheduling, integration orchestration, credential management, and the audit trail. The AI inference happens wherever you point it.

Today the practical use is corporate cloud deployments. Claude on Bedrock. Gemini on Vertex. OpenAI on Azure. Any OpenAI-compatible endpoint. If your security team already spent months getting a model approved and deployed inside your VPC, you shouldn't have to re-litigate that conversation every time you want to automate a workflow.

The infrastructure is model-agnostic. It doesn't care whether the endpoint is in us-east-1 or running on your desk. When local models are good enough for production agent work -- and based on the last two years of progress, that's a when, not an if -- the same BYO setup works with Ollama on your Mac. And the inference cost drops to zero. No per-token charges from a cloud provider. Just your hardware and electricity.

The security model underneath

BYO model deployment sits on top of an architecture designed for this from day one.

Credentials never reach the AI. API keys and OAuth tokens are encrypted at rest, injected at runtime when the agent calls an external service, and stripped from the context before data reaches the model. The agent orchestrates your tools but never holds the keys.

Every data access is logged. When the agent reads from your CRM, pulls transactions from Stripe, or sends a Slack message, each access is recorded in an audit trail.

The model is yours. With BYO deployment, the AI inference itself runs on infrastructure you control. Your data enters your model and comes back as instructions. It never touches a third-party AI provider.

Where this is going

Gemma 4's 2B and 4B models are explicitly designed for on-device deployment. They handle multi-step planning and autonomous action at sizes that fit on a phone. The 31B model runs at chat speed on a Mac with 48GB of unified memory. A year ago, running an AI agent locally was a science project. Today it's getting close to practical.

BYO model deployment is the bridge. It works with enterprise cloud deployments today, and it'll work with whatever comes next. When it does, you get private data, zero inference cost, and no vendor lock-in -- all on hardware you already own.

Your data never leaves your walls, simplifying HIPAA, SOC 2, GDPR, and other privacy-sensitive deployments.

We're not there yet. But every quarter, we get closer.

Frequently asked questions

What does BYO model deployment let me do?
Point your General Input agents at any OpenAI-compatible endpoint. Claude on AWS Bedrock, Gemini on Vertex, OpenAI on Azure, or a local Ollama install on your own hardware.
Will my data ever leave my infrastructure?
With BYO model deployment, no. Inference runs on the endpoint you choose, and credentials never reach the model regardless of where it's hosted.
Is this useful today, or only for some future where local models are frontier?
Useful today. The most common use is corporate cloud deployments: running a model your security team has already approved inside your VPC. The same setup works tomorrow when local models are good enough.
How does this help with HIPAA, SOC 2, or GDPR?
Your data stays inside your infrastructure boundary, so you don't have to re-litigate every workflow with security and compliance. The model is yours, the audit trail is yours, the keys never cross a perimeter you don't own.

Run agents on the model your security team approved.

Point General Input at Bedrock, Vertex, Azure, or your own laptop.