LM Studio 0.4.0 Transforms Local AI with Server Deployment and Parallel Processing
#AI

Trends Reporter
4 min read

LM Studio 0.4.0 introduces llmster for headless deployments, parallel requests with continuous batching, a stateful REST API, and a complete UI refresh for local AI development.

LM Studio has just released version 0.4.0, marking a significant evolution in local AI development tools. This update transforms LM Studio from a desktop application into a versatile platform that can run anywhere—from cloud servers to CI pipelines—while introducing powerful new capabilities for high-throughput inference and developer workflows.

Server-Native Deployment with llmster

The headline feature of 0.4.0 is llmster, a headless daemon that separates LM Studio's core functionality from its GUI. This architectural shift allows developers to deploy LM Studio's inference engine on any infrastructure without the overhead of a graphical interface.

Installation is straightforward:

  • Linux/Mac: curl -fsSL https://lmstudio.ai/install.sh | bash
  • Windows: irm https://lmstudio.ai/install.ps1 | iex

Once installed, llmster supports the same model loading and serving capabilities as the desktop app, but through CLI commands:

  • lms daemon up - Start the daemon
  • lms get <model> - Download models
  • lms server start - Launch the inference server
  • lms chat - Interactive terminal chat
  • lms runtime update - Update the underlying engine

This opens up use cases like running LM Studio on cloud GPUs, integrating it into CI/CD pipelines, or deploying it on headless Linux servers—all while maintaining the same model compatibility and API endpoints.
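Put together, a minimal headless workflow looks like the sketch below, using only the commands listed above. The model identifier is just an example; swap in any model available through lms get.

  # Start the daemon, fetch a model, and serve it (example model name)
  lms daemon up
  lms get qwen/qwen3-4b
  lms server start

  # Optional: sanity-check the loaded model from the terminal
  lms chat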

Parallel Requests with Continuous Batching

LM Studio 0.4.0 introduces parallel inference requests with continuous batching, a feature that significantly improves throughput for concurrent workloads. This capability, powered by llama.cpp's open-source implementation, allows multiple requests to be processed simultaneously rather than queued sequentially.

Key configuration options in the model loader:

  • Max Concurrent Predictions: Sets the maximum number of parallel requests (default: 4)
  • Unified KV Cache: When enabled, avoids hard-partitioning cache memory per request slot, so concurrent requests of varying sizes can share the available capacity

With continuous batching, new requests join the active batch as soon as capacity frees up rather than waiting for the current batch to finish, which reduces idle GPU time and improves overall efficiency. This is particularly valuable for applications that need to handle multiple concurrent users or batch processing workloads.

Note: This feature is currently available in the llama.cpp engine, with MLX support coming soon.
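As a quick way to exercise the feature, you can send several requests to the local server at once. The sketch below assumes LM Studio's OpenAI-compatible endpoint on the default port 1234 and a model that is already loaded; the model name is a placeholder.

  # Fire four chat completions concurrently (placeholder model name)
  for i in 1 2 3 4; do
    curl -s http://localhost:1234/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{"model": "your-model", "messages": [{"role": "user", "content": "Write a haiku about GPUs"}]}' &
  done
  wait  # with Max Concurrent Predictions >= 4, these requests are batched rather than queued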

Stateful REST API with MCP Support

LM Studio introduces a new stateful REST API endpoint at /v1/chat, departing from the typical stateless approach of chat APIs. This design choice enables more sophisticated multi-step workflows:

  • Stateful conversations: Use response_id to continue conversations across requests
  • Performance tracking: Responses include detailed metrics (tokens in/out, speed, time to first token)
  • Local MCP integration: Enable locally configured Model Context Protocol tools with permission keys
  • Small request payloads: State management on the server side keeps requests lightweight

Permission keys provide security control, allowing you to generate tokens that specify which clients can access your LM Studio server and what MCP tools they can use.
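As a rough illustration of the stateful flow, a two-turn exchange could look like the following. The /v1/chat path and response_id come from the release notes; the other field names, the port, and the model name are assumptions for the sketch, so check the in-app documentation for the exact schema.

  # Turn 1: start a conversation (hypothetical payload shape)
  curl -s http://localhost:1234/v1/chat \
    -H "Content-Type: application/json" \
    -d '{"model": "your-model", "input": "Summarize the main points of this release"}'
  # The response carries a response_id plus token counts, speed, and time to first token.

  # Turn 2: send only the new message plus the response_id to continue server-side state
  curl -s http://localhost:1234/v1/chat \
    -H "Content-Type: application/json" \
    -d '{"model": "your-model", "input": "Now expand on the CLI changes", "response_id": "<id from turn 1>"}'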

Complete UI Refresh

The desktop application receives a comprehensive redesign focused on consistency and usability:

Chat Export

Export conversations to PDF, Markdown, or plain text directly from the chat menu. This makes it easy to archive conversations or share them with others.

Split View

Work with multiple chat sessions side-by-side by dragging tabs to either half of the window. This is particularly useful for comparing model responses or working on related tasks simultaneously.

Developer Mode

A new Developer Mode setting (enabled in Settings > Developer) exposes advanced options throughout the app, including in the model loader and sidebars. This replaces the previous multi-mode system with a simpler toggle.

In-App Documentation

The Developer tab now includes comprehensive documentation covering the REST API, CLI commands, and advanced configuration options—making it easier to get started with LM Studio's more powerful features.

Enhanced CLI Experience

The CLI receives significant improvements centered around the new lms chat command, which provides an interactive terminal-based chat experience. Features include:

  • Slash commands: /model, /download, /system-prompt, /help, /exit
  • Thinking highlighting: Visual indication of model reasoning
  • Large content pasting: Better handling of substantial input
  • Model catalog browsing: Interactive model selection

Run lms chat --help to explore all available options.

Technical Improvements and Bug Fixes

The release includes numerous technical enhancements:

  • Model search improvements: New search interface with persistent preferences
  • Hardware settings updates: Better GPU configuration and monitoring
  • Image handling: Improved validation and display in API responses
  • Performance optimizations: Faster model loading and chat operations
  • Cross-platform fixes: Resolved issues on Windows, Mac, and Linux

What This Means for Local AI Development

LM Studio 0.4.0 represents a maturation of the local AI tooling ecosystem. By separating the core inference engine from the GUI, the team has created a flexible platform that can serve both desktop users and server deployments. The parallel processing capabilities address a key limitation of local inference—throughput—while the stateful API enables more sophisticated application development.

For developers, this means LM Studio can now serve as both a development tool and a production inference server. For researchers and hobbyists, the enhanced CLI and parallel processing make it easier to experiment with larger models and more complex workflows.

The release is available now at lmstudio.ai/download. The team welcomes feedback through their Discord community and bug tracker.
