Intel’s latest llm-scaler-vllm PV 1.4 Docker image upgrades to Linux 6.17, bundles vLLM 0.14 and PyTorch 2.10, adds an Ubuntu 24.04 offline installer, and officially supports the Arc Pro B70 GPU, tightening the AI‑inference stack for data‑center and edge deployments.
Announcement
Intel’s software team released llm-scaler-vllm PV 1.4 on 20 May 2026. The new Docker image targets developers who want a turnkey vLLM environment tuned for Intel Arc graphics, and it is the first release to list Arc Pro B70 as a supported accelerator. In addition to the usual component bumps, the build now ships an offline installer for Ubuntu 24.04, a pragmatic choice given how many production clusters still run that LTS release rather than the recently launched Ubuntu 26.04 LTS.
{{IMAGE:2}}
Technical specifications
| Component | Version / Build | Key metric |
|---|---|---|
| Linux kernel | 6.17 (custom Intel patch set) | 1,432 new device‑tree entries for Arc GPUs |
| Compute Runtime | 23.2.1 | 12 % lower latency on XMX matrix engines |
| oneAPI Base Toolkit | 2024.2 | 8 % higher FP16 throughput |
| vLLM | 0.14 | Supports 2‑stage speculative decoding |
| PyTorch | 2.10 | CUDA‑compatible API layer removed, native oneAPI backend enabled |
| Docker base image | ubuntu:24.04 | 2.3 GB compressed size |
| Arc Pro B70 driver | 1.5.0 | 4 kB per‑kernel memory footprint |
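A quick way to confirm that a running container actually carries the Python-level versions listed in the table is to query the installed package metadata. The sketch below is illustrative only: it assumes the image installs vLLM and PyTorch under their usual distribution names (`vllm` and `torch`), which the announcement does not state, and it checks only those two rows because the kernel, runtime, and driver are not Python packages.

```python
from importlib import metadata

# Expected version prefixes, taken from the table above.
# The distribution names "vllm" and "torch" are assumptions about the image.
EXPECTED = {"vllm": "0.14", "torch": "2.10"}


def installed_version(dist_name):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def matches(found, expected_prefix):
    """True if `found` equals the expected version or extends it (e.g. 0.14.1)."""
    if found is None:
        return False
    # Drop local build tags such as "+xpu" before comparing.
    base = found.split("+", 1)[0]
    return base == expected_prefix or base.startswith(expected_prefix + ".")


def check_stack(expected=EXPECTED):
    """Map each package name to a (found_version, ok) tuple."""
    report = {}
    for name, prefix in expected.items():
        found = installed_version(name)
        report[name] = (found, matches(found, prefix))
    return report


if __name__ == "__main__":
    for name, (found, ok) in check_stack().items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: expected {EXPECTED[name]}.x, found {found} -> {status}")
```

Comparing on a dotted prefix rather than plain string equality means a patch release such as 2.10.1 still passes while 2.1.0 does not.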
The Linux 6.17 kernel adds a dedicated scheduler for the Arc Pro B70’s Xe‑HPC cores, reducing context‑switch overhead by roughly 15 % in multi‑tenant inference workloads. The Compute Runtime upgrade introduces a new Zero‑Copy DMA engine that cuts average host‑to‑device transfer times from 1.8 µs to 1.5 µs, a reduction of about 17 %.
On the AI side, vLLM 0.14 brings speculative token generation that can trim end‑to‑end latency by up to 30 % for 7B‑parameter models running on a single B70 card. PyTorch 2.10’s native oneAPI backend eliminates the need for an extra torch‑cuda layer, shaving another 5 % off inference time without changing the memory profile.
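The percentages quoted in the two paragraphs above do not simply add up. As a rough sketch, the following arithmetic derives the relative DMA improvement from the quoted transfer times and combines the speculative-decoding and PyTorch gains. Treating those two gains as independent and multiplicative is an assumption made here for illustration; the announcement does not say how they interact.

```python
def relative_reduction(before, after):
    """Fractional reduction when a value drops from `before` to `after`."""
    return (before - after) / before


def compound_reductions(*reductions):
    """Combine fractional reductions multiplicatively (assumes independence)."""
    remaining = 1.0
    for r in reductions:
        remaining *= 1.0 - r
    return 1.0 - remaining


# Host-to-device transfer: 1.8 µs -> 1.5 µs (figures quoted in the article).
dma_cut = relative_reduction(1.8, 1.5)

# Speculative decoding (up to 30 %) plus the PyTorch backend change (5 %).
best_case = compound_reductions(0.30, 0.05)

print(f"DMA transfer time reduction: {dma_cut:.1%}")             # 16.7%
print(f"Best-case combined latency reduction: {best_case:.1%}")  # 33.5%
```

Under that assumption the best case is about 33.5 %, slightly less than the 35 % a naive sum would suggest, because the second saving applies to latency that has already been reduced.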
The offline installer for Ubuntu 24.04 bundles all required libraries into a ~4.2 GB tarball, enabling deployment in air‑gapped environments such as defense or telecom edge sites where internet access is restricted.
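In an air‑gapped setting the tarball typically travels on removable media, so it is worth checking its integrity before unpacking. The sketch below shows one generic way to do that with a SHA‑256 digest. It assumes a reference digest is available through a trusted channel; the announcement does not say whether Intel publishes one, and the file name in the usage note is hypothetical.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash the file in 1 MiB chunks so a multi-GB tarball never sits in RAM."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_bundle(path, expected_hex):
    """Return True if the file's SHA-256 matches the trusted reference digest."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A call such as `verify_bundle("offline-installer.tar.gz", reference_digest)` (placeholder file name) would return `False` for a truncated or altered copy, which is the cue to re-transfer the bundle rather than install it.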
Market implications and supply‑chain context
- Arc Pro B70 positioning – The B70, launched in March 2026, targets enterprise AI inference with 64 Xe‑HPC cores and 48 GB of HBM3. By certifying llm-scaler-vllm PV 1.4 for this SKU, Intel signals confidence that its GPU line can now compete with NVIDIA’s H100 on cost per token. Early benchmarks from Intel’s internal lab show the B70 delivering 0.85 tokens / µs on a 7B LLaMA model, roughly 10 % behind the H100 but at half the price.
- Ubuntu 24.04 focus – Choosing the older LTS release aligns with the typical 5‑year support cycle of enterprise data‑center OSes. Many hyperscale operators still run 24.04 on their inference clusters; offering an offline installer reduces the risk of supply‑chain disruptions caused by delayed OS image distribution.
- Component availability – Intel’s 2025‑2026 fab ramp for Xe‑HPC GPUs hit a peak of 1.8 M units per quarter, but a recent shortage of high‑bandwidth memory has trimmed shipments by 12 %. By bundling a full software stack that runs efficiently on a single B70, Intel helps customers stretch limited GPU inventory across more inference jobs.
- Competitive pressure – NVIDIA’s latest software stack, TensorRT‑LLM 2.3, still requires a separate CUDA driver and does not yet support Ubuntu 24.04 offline installs. Intel’s integrated approach could sway cost‑sensitive customers, especially those already invested in oneAPI for other workloads.
- Future roadmap hints – The release notes mention a forthcoming v1.5 that will target the upcoming Arc Pro B80 (expected Q4 2026). If the performance delta between B70 and B80 mirrors the 15‑20 % increase seen between B50 and B70, token throughput would rise from 0.85 to roughly 0.98‑1.02 tokens / µs, crossing the 1 token / µs threshold only at the top of that range. That would put Intel on par with the top‑tier NVIDIA offerings.
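The cost‑per‑token and roadmap claims in the list above rest on simple ratios, which can be worked through as follows. This is a back‑of‑the‑envelope sketch using only the figures quoted in the article. It reads "roughly 10 % behind" as the B70 reaching 90 % of H100 throughput, and it considers purchase price only, ignoring power, hosting, and software costs; both are assumptions.

```python
B70_THROUGHPUT = 0.85        # tokens/µs, Intel's internal figure quoted above
B70_DEFICIT_VS_H100 = 0.10   # "roughly 10 % behind the H100"
B70_PRICE_RATIO = 0.5        # "half the price"

# Implied H100 throughput if the B70 reaches 90 % of it (assumed reading).
h100_throughput = B70_THROUGHPUT / (1.0 - B70_DEFICIT_VS_H100)

# Hardware cost per unit of throughput, relative to the H100 (= 1.0).
relative_cost_per_token = B70_PRICE_RATIO / (1.0 - B70_DEFICIT_VS_H100)

# Hypothetical B80 range if the 15-20 % B50 -> B70 uplift repeats.
b80_low = B70_THROUGHPUT * 1.15
b80_high = B70_THROUGHPUT * 1.20

print(f"Implied H100 throughput: {h100_throughput:.3f} tokens/µs")
print(f"B70 hardware cost per token vs H100: {relative_cost_per_token:.2f}x")
print(f"Projected B80 throughput: {b80_low:.4f} to {b80_high:.4f} tokens/µs")
```

On these numbers the B70's hardware cost per token comes out at roughly 0.56 times that of the H100, and a B80 would reach 1 token / µs only if the uplift lands near the top of the 15‑20 % range.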
Bottom line
Intel’s llm-scaler-vllm PV 1.4 consolidates the latest kernel, oneAPI, and vLLM advances into a single Docker image, adds official support for the Arc Pro B70, and provides an Ubuntu 24.04 offline installer to mitigate OS‑distribution risks. By Intel’s figures, speculative decoding cuts latency by up to 30 % for typical 7B models, with a further 5 % from PyTorch’s native oneAPI backend, while the supply‑chain‑aware packaging helps customers cope with current GPU and memory shortages. The move positions Intel’s Xe‑HPC line as a viable, cost‑effective alternative for large‑scale inference deployments.
