Kubernetes and Crossplane Unlock Self-Hosted AI Inference for Enterprise Data Sovereignty

The growing reliance on Large Language Models (LLMs) and Generative AI, alongside stringent data privacy regulations, is driving enterprises towards self-hosted AI inference. While public LLMs like ChatGPT or Gemini offer convenience, they raise critical concerns about sensitive data, code, and company secrets flowing through uncontrolled third-party infrastructure. This is particularly problematic for sectors like healthcare, finance, and government, where compliance is paramount.

The geopolitical landscape of open-weight models further complicates this: Chinese entities such as Alibaba's Qwen family, DeepSeek, and Zhipu AI's GLM-5 have dominated the open-weight ecosystem since late 2024. These models present trust challenges, however. NIST research indicates baked-in censorship on sensitive topics in DeepSeek models, along with alarmingly weak safety guardrails: DeepSeek models reportedly comply with up to 100% of overtly malicious requests, far more than their US counterparts (5-12%). Self-hosting mitigates the data exposure risk by keeping data inside the organization's network, though it does not resolve inherent model biases.

Despite the typically higher cost of self-hosting compared to API calls, the need for low latency, freedom from rate limits, protection of fine-tuned intellectual property, and strict regulatory compliance often makes it the only viable option.

In related news, the open source AI coding agent Kilo offers a privacy-first approach: it supports over 500 models from various providers through a single gateway, allows local model execution, and by default does not use code for training.

To address these challenges, a robust self-hosted inference platform can be built on Kubernetes with GPU nodes. Kubernetes provides first-class support for GPU scheduling via device plugins, the NVIDIA GPU Operator, and Node Feature Discovery, enabling declarative GPU management, utilization-based autoscaling, multi-tenancy, and portability.

Building an Internal Developer Platform (IDP) on top of this foundation, powered by Crossplane, abstracts the underlying complexity. Crossplane lets platform teams define custom APIs that provision GPU-enabled Kubernetes clusters and deploy models on the vLLM runtime, presenting a simplified "inference as a service" experience. Development teams can then deploy a model with a single custom resource, without needing deep GPU expertise, while data never leaves the organization's network. This foundational architecture is designed for extensibility, with future advancements planned for features like disaggregated inference, multi-cluster patterns, and KV cache routing.
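To make the GPU-scheduling layer concrete, here is a minimal sketch of a vLLM deployment on a GPU node. The `nvidia.com/gpu` resource is advertised by the NVIDIA device plugin (installed by the GPU Operator); the namespace, model ID, and image tag are illustrative choices, not prescribed by the platform described above:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-inference        # illustrative name
  namespace: ai-inference     # assumed namespace
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm-inference
  template:
    metadata:
      labels:
        app: vllm-inference
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest          # vLLM's OpenAI-compatible server image
        args: ["--model", "Qwen/Qwen2.5-7B-Instruct"]  # example open-weight model
        ports:
        - containerPort: 8000                   # vLLM's default HTTP port
        resources:
          limits:
            nvidia.com/gpu: 1   # request one GPU from the NVIDIA device plugin
```

Because the GPU is requested declaratively, the scheduler places the pod only on nodes that actually expose `nvidia.com/gpu` capacity, which is what makes autoscaling and multi-tenancy workable on shared GPU pools.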
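The "single custom resource" developer experience might look like the following claim against a Crossplane composite resource definition (XRD). The API group, kind, and field names here are hypothetical, invented for illustration; in practice a platform team would define their own schema and back it with a Composition that expands the claim into a GPU node pool, a vLLM Deployment, and a Service:

```yaml
# Hypothetical platform API, not a published Crossplane package.
apiVersion: platform.example.org/v1alpha1
kind: InferenceService
metadata:
  name: team-a-llm
spec:
  model: Qwen/Qwen2.5-7B-Instruct   # model to serve via the vLLM runtime
  gpu:
    type: nvidia-l4                 # GPU class the Composition maps to a node pool
    count: 1
  replicas: 2                       # desired serving replicas
```

This is the abstraction payoff: application teams state intent (model, GPU class, scale) and the platform team's Composition handles device plugins, drivers, and cluster provisioning behind the API.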