MiniMax-M2 Proxy – Bridging 229B Models to Standard APIs
Overview
MiniMax-M2 Proxy solves a critical interoperability problem: MiniMax-M2, a powerful 229B parameter Mixture-of-Experts model, uses custom XML formatting for tool calls that most frameworks cannot parse.
This proxy translates between MiniMax's native format and standard OpenAI/Anthropic APIs, enabling developers to leverage cutting-edge open models without rewriting their entire stack.
Repository: github.com/0xSero/minimax-m2-proxy
The Problem
Open-weight large language models are advancing rapidly. MiniMax-M2 offers impressive reasoning capabilities at 229B parameters, but it ships with a fundamental compatibility issue:
Native Output:
<minimax:tool_call>
{"name": "search", "arguments": {"query": "latest news"}}
</minimax:tool_call>
What OpenAI SDKs Expect:
{
"tool_calls": [
{
"function": { "name": "search", "arguments": "{\"query\": \"latest news\"}" }
}
]
}
What Anthropic SDKs Expect:
{
"content": [
{
"type": "tool_use",
"name": "search",
"input": { "query": "latest news" }
}
]
}
Without a translation layer, developers face a choice:
- Rewrite all their tool-calling code for MiniMax's XML format
- Give up on using MiniMax's tool-calling capabilities entirely
- Use a different (potentially less capable) model
Solution
The proxy sits between your application and the MiniMax backend (TabbyAPI or vLLM), performing real-time translation:
Dual API Support
| Endpoint | Compatible With |
|---|---|
| /v1/chat/completions | OpenAI SDK, LangChain, most AI frameworks |
| /v1/messages | Anthropic SDK, Claude-compatible tooling |
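For example, an existing OpenAI SDK application can target the proxy by changing only the base URL. This is a minimal sketch; the host, port, API key, and model name are placeholders for your own deployment:
from openai import OpenAI

# Point the standard OpenAI SDK at the proxy instead of api.openai.com.
# Base URL, key, and model name are deployment-specific placeholders.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed-locally")

response = client.chat.completions.create(
    model="MiniMax-M2",
    messages=[{"role": "user", "content": "What is in the news today?"}],
    tools=[{
        "type": "function",
        "function": {
            "name": "search",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    }],
)
print(response.choices[0].message.tool_calls)
The same deployment also serves /v1/messages, so Anthropic SDK callers can point their base URL at the proxy in the same way.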
Intelligent Translation
XML-to-JSON Conversion
- Parses <minimax:tool_call> blocks from model output
- Converts them to proper JSON for the target API
- Handles multiple simultaneous tool calls correctly (see the sketch below)
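A minimal sketch of the conversion step, assuming a regex-based extractor (the pattern and function names here are illustrative, not the proxy's actual implementation):
import json
import re

# Matches each <minimax:tool_call>...</minimax:tool_call> block in raw model output.
TOOL_CALL_RE = re.compile(
    r"<minimax:tool_call>\s*(\{.*?\})\s*</minimax:tool_call>", re.DOTALL
)

def extract_tool_calls(model_output: str) -> list[dict]:
    # Return every tool call found in the output as a parsed Python dict.
    return [json.loads(block) for block in TOOL_CALL_RE.findall(model_output)]
Each extracted dict is then re-emitted in whichever wire format the caller expects (OpenAI tool_calls or Anthropic tool_use).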
Type Inference
- Automatically converts parameter types based on tool schemas
- Integers, floats, booleans, and nested JSON objects
- No manual type coercion required in your code (a coercion sketch follows below)
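A simplified sketch of schema-aware coercion, assuming the tool's JSON-Schema parameters are available at translation time (coerce_arguments is a hypothetical helper name):
import json

def coerce_arguments(arguments: dict, schema: dict) -> dict:
    # Convert string-valued arguments to the types declared in the tool schema.
    props = schema.get("properties", {})
    coerced = {}
    for key, value in arguments.items():
        expected = props.get(key, {}).get("type")
        if isinstance(value, str):
            if expected == "integer":
                value = int(value)
            elif expected == "number":
                value = float(value)
            elif expected == "boolean":
                value = value.strip().lower() in ("true", "1", "yes")
            elif expected in ("object", "array"):
                value = json.loads(value)
        coerced[key] = value
    return coerced
With this in place, a model emitting {"count": "5"} against a schema that declares count as an integer arrives at your tool as {"count": 5}.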
Reasoning Preservation
- Maintains <think> blocks verbatim in the output
- Enables visibility into the model's chain-of-thought
- Critical for debugging and understanding model behavior
Streaming Support
Full Server-Sent Events (SSE) support for both API formats. The proxy handles the complexity of parsing XML out of a token stream and translates it into proper streaming JSON chunks.
Technical Architecture
┌─────────────────────────────────────────────────────────┐
│ Your Application │
│ (OpenAI SDK / Anthropic SDK / Custom) │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ MiniMax-M2 Proxy │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Router │ │ XML Parser │ │ Type Coerce │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ OpenAI Fmt │ │Anthropic Fmt│ │ SSE Stream │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ TabbyAPI / vLLM Backend │
│ (MiniMax-M2 229B) │
└─────────────────────────────────────────────────────────┘
Tech Stack
- Framework: FastAPI with full async support (a minimal route sketch follows this list)
- Validation: Pydantic for type-safe request/response handling
- Deployment: Systemd service file for production
- Python: 3.11+ for modern async features
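To make the stack concrete, here is a stripped-down sketch of a proxy route in FastAPI; it is illustrative only, with the XML-to-JSON translation and error handling elided:
import os

import httpx
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
BACKEND_URL = os.environ.get("BACKEND_URL", "http://localhost:5000")

class ChatRequest(BaseModel):
    model: str
    messages: list[dict]
    tools: list[dict] | None = None
    stream: bool = False

@app.post("/v1/chat/completions")
async def chat_completions(req: ChatRequest):
    # Forward the request to the MiniMax backend (TabbyAPI / vLLM). In the real
    # proxy, <minimax:tool_call> blocks in the reply are translated to
    # OpenAI-style tool_calls before being returned; that step is elided here.
    async with httpx.AsyncClient(timeout=None) as client:
        resp = await client.post(
            f"{BACKEND_URL}/v1/chat/completions", json=req.model_dump()
        )
    return resp.json()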
Challenges & Solutions
XML in Streaming Contexts
Challenge: Streaming responses arrive token-by-token. An XML tag might be split across multiple chunks.
Solution: Stateful streaming parser that buffers partial XML, only emitting translated JSON when a complete tool call is parsed.
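A sketch of that buffering idea (hypothetical class; the production parser also has to interleave reasoning text and emit SSE deltas):
class ToolCallStreamBuffer:
    # Holds streamed text until a complete <minimax:tool_call> block has arrived.
    OPEN, CLOSE = "<minimax:tool_call>", "</minimax:tool_call>"

    def __init__(self):
        self.buffer = ""

    def feed(self, chunk: str):
        # Returns (plain_text_to_emit, completed_tool_call_payloads) for this chunk.
        self.buffer += chunk
        text, calls = "", []
        while True:
            start = self.buffer.find(self.OPEN)
            if start == -1:
                # No opening tag: emit everything except a tail that could be
                # the start of a tag split across chunks.
                safe = max(len(self.buffer) - len(self.OPEN), 0)
                text += self.buffer[:safe]
                self.buffer = self.buffer[safe:]
                return text, calls
            end = self.buffer.find(self.CLOSE, start)
            if end == -1:
                # Tag opened but not yet closed: keep buffering the partial call.
                text += self.buffer[:start]
                self.buffer = self.buffer[start:]
                return text, calls
            calls.append(self.buffer[start + len(self.OPEN):end].strip())
            self.buffer = self.buffer[:start] + self.buffer[end + len(self.CLOSE):]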
Type Ambiguity
Challenge: MiniMax outputs all values as strings. A tool expecting {"count": 5} receives {"count": "5"}.
Solution: Schema-aware type coercion. The proxy reads your tool definitions and converts parameters to their expected types automatically.
Multiple Tool Calls
Challenge: Some responses contain multiple tool calls in sequence. Both APIs have specific formats for expressing this.
Solution: Accumulator pattern that collects all tool calls before emitting the properly formatted response.
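Once a response has been fully accumulated, the collected calls are mapped onto the target schema. A sketch for the OpenAI shape (the id generation and helper name are illustrative):
import json
import uuid

def to_openai_tool_calls(parsed_calls: list[dict]) -> list[dict]:
    # Map accumulated MiniMax tool calls to OpenAI-style tool_calls entries.
    return [
        {
            # MiniMax does not emit call ids, so one is synthesized per call.
            "id": f"call_{uuid.uuid4().hex[:8]}",
            "type": "function",
            "function": {
                "name": call["name"],
                # The OpenAI format carries arguments as a JSON-encoded string.
                "arguments": json.dumps(call["arguments"]),
            },
        }
        for call in parsed_calls
    ]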
Results
Community Impact
- 33+ GitHub stars from developers integrating MiniMax into existing stacks
- 3 forks with contributions for additional backends
- Featured in discussions about running open models in production
Use Cases Enabled
- Drop-in Replacement: Swap MiniMax into existing OpenAI-based applications
- Hybrid Architectures: Route different queries to different models while maintaining a consistent API
- Research: Compare model behavior without rewriting evaluation harnesses
- Cost Optimization: Use powerful open models where they excel, proprietary where they don't
Deployment
The proxy is designed for production use:
# Development
uvicorn main:app --host 0.0.0.0 --port 8000
# Production (systemd service provided)
systemctl enable minimax-proxy
systemctl start minimax-proxy
Configuration is handled via environment variables (an illustrative service file follows below):
- BACKEND_URL: TabbyAPI or vLLM endpoint
- BACKEND_API_KEY: Authentication for the backend
- LOG_LEVEL: Debugging granularity
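An illustrative unit file; the paths and values are placeholders, not the file shipped in the repository:
# /etc/systemd/system/minimax-proxy.service (placeholder paths and values)
[Unit]
Description=MiniMax-M2 Proxy
After=network.target

[Service]
WorkingDirectory=/opt/minimax-m2-proxy
Environment=BACKEND_URL=http://localhost:5000
Environment=BACKEND_API_KEY=changeme
Environment=LOG_LEVEL=info
ExecStart=/opt/minimax-m2-proxy/.venv/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target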
Lessons Learned
Interoperability > Features – A proxy that makes existing tools work is often more valuable than a new tool with more features.
Streaming is non-negotiable – Users expect real-time responses. Any translation layer must preserve streaming semantics.
Type safety at boundaries – Pydantic validation at API boundaries catches issues before they become runtime errors.
Production-ready from day one – Including systemd configs, logging, and proper error handling enabled immediate production deployment.
Future Work
- Model router: Intelligent routing between multiple backends based on query characteristics
- Caching layer: Response caching for repeated identical queries
- Metrics & observability: Prometheus endpoints for production monitoring
- Additional models: Support for other models with non-standard tool call formats
This project demonstrates 0xSero's approach to infrastructure that enables rather than constrains. When the gap between a powerful model and your existing code is just a thin translation layer, that's the right abstraction.
Explore the code: github.com/0xSero/minimax-m2-proxy