Microsoft Foundry Adds Hugging Face Models: Harrier and EGM-8B
#AI

Cloud Reporter
4 min read

Microsoft Foundry now hosts Microsoft Research's Harrier embedding model and NVIDIA's EGM-8B visual grounding model, both achieving state-of-the-art results through efficiency-focused training rather than scale.

Microsoft has expanded its Foundry platform with two powerful open-source models from Hugging Face: Microsoft Research's harrier-oss-v1-0.6b embedding model and NVIDIA's EGM-8B visual grounding model. Both models demonstrate how targeted training strategies can achieve results comparable to much larger models, marking a shift toward efficiency-first AI development.

Microsoft Research's Harrier: State-of-the-Art Embeddings at 0.6B Parameters

The harrier-oss-v1-0.6b model represents a breakthrough in embedding technology, achieving a 69.0 score on the Multilingual MTEB v2 (Massive Text Embedding Benchmark) leaderboard. This performance places it at the top of its size class, outperforming many larger models through contrastive learning and knowledge distillation techniques.

Key Technical Innovations

What makes Harrier particularly interesting is its departure from traditional embedding architectures. Unlike most embedding models that use encoder-only transformers, Harrier employs a decoder-only architecture with last-token pooling and L2 normalization. This design choice, combined with task-instruction queries, allows the same deployed model to be specialized for multiple tasks through prompt engineering alone.
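The pooling step described above can be sketched in NumPy. The array names and toy shapes here are illustrative (Harrier's actual hidden size is far larger), but the mechanics — select each sequence's last non-padding token, then L2-normalize — are as described:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Pool a decoder's output by taking each sequence's final non-padding
    token, then L2-normalize so dot products equal cosine similarities."""
    last_idx = attention_mask.sum(axis=1) - 1          # index of last real token
    pooled = hidden_states[np.arange(hidden_states.shape[0]), last_idx]
    return pooled / np.linalg.norm(pooled, axis=1, keepdims=True)

# Toy batch: 2 sequences, 4 positions, hidden size 3.
hidden = np.arange(24, dtype=np.float64).reshape(2, 4, 3)
mask = np.array([[1, 1, 1, 0],    # sequence 1: 3 real tokens, 1 pad
                 [1, 1, 1, 1]])   # sequence 2: no padding
emb = last_token_pool(hidden, mask)
```

Because the outputs are unit vectors, downstream retrieval can rank by plain dot product.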

The model supports 100+ languages and is trained on a large-scale mixture of multilingual data covering Arabic, Chinese, Japanese, Korean, and additional languages. Its broad task coverage spans six embedding scenarios: retrieval, clustering, semantic similarity, classification, bitext mining, and reranking.

Practical Applications

For enterprise deployments, Harrier offers compelling use cases:

  • Multilingual semantic search: Prepend task instructions to queries while encoding documents without instructions, ranking results by cosine similarity
  • Cross-lingual document clustering: Group semantically related content across language boundaries
  • Text classification with embeddings: Classify new text by nearest-neighbor similarity in embedding space
  • Bitext mining: Align parallel corpora in source and target languages
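The first pattern — instruction-prefixed queries ranked by cosine similarity against instruction-free documents — can be sketched as follows. The `embed` stub and the instruction template are placeholders: a real pipeline would call the deployed Harrier endpoint and use its documented instruction format.

```python
import hashlib

import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder encoder: a real pipeline would call the deployed Harrier
    endpoint. A hashed bag-of-words keeps the sketch runnable offline."""
    vec = np.zeros(512)
    for word in text.lower().split():
        vec[int(hashlib.md5(word.encode()).hexdigest(), 16) % 512] += 1
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# The task instruction is prepended to the query only; documents are encoded as-is.
instruction = "Instruct: Given a search query, retrieve relevant passages\nQuery: "
query = "reset my password"
docs = ["How to reset a forgotten password", "Quarterly sales report 2024"]

q_emb = embed(instruction + query)
d_embs = np.stack([embed(d) for d in docs])
scores = d_embs @ q_emb                     # unit vectors: dot product = cosine
ranked = [docs[i] for i in np.argsort(-scores)]
```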

A practical example for global enterprises involves building a multilingual internal knowledge base. Using Harrier deployed in Microsoft Foundry, organizations can encode internal documents across multiple languages and retrieve relevant content through cosine similarity, passing results to a language model for final answer generation with source citations.

NVIDIA's EGM-8B: Visual Grounding Excellence Through Reinforcement Learning

NVIDIA's EGM-8B takes a different approach to efficiency, achieving 91.4 average Intersection over Union (IoU) on the RefCOCO visual grounding benchmark—a 3.6 point improvement over its base model through targeted reinforcement learning fine-tuning.
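The IoU metric behind that figure is straightforward to compute for axis-aligned boxes (the coordinates below are illustrative):

```python
def iou(box_a, box_b):
    """Intersection over Union for axis-aligned [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

pred = [10, 10, 50, 50]   # predicted box
gt = [20, 20, 60, 60]     # ground-truth box
score = iou(pred, gt)     # 900 overlap / 2300 union ≈ 0.391
```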

Training Methodology Breakthrough

The model employs a two-stage training process that demonstrates how test-time compute can be optimized:

  1. Supervised Fine-Tuning (SFT): Initial training on detailed chain-of-thought reasoning traces generated by a proprietary VLM
  2. Reinforcement Learning with GRPO: Refinement using Group Relative Policy Optimization with a reward function combining IoU accuracy and task success
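The group-relative step in GRPO can be sketched numerically: each sampled response is scored against the statistics of its own group, removing the need for a learned value model. The reward weights below are illustrative, not NVIDIA's published values.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: standardize each response's reward
    against the mean and spread of its own sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Hypothetical group of 4 sampled responses to one grounding prompt.
ious = np.array([0.92, 0.55, 0.80, 0.10])
success = (ious > 0.5).astype(float)     # illustrative task-success signal
rewards = 0.7 * ious + 0.3 * success     # illustrative blend of IoU + success
adv = grpo_advantages(rewards)
```

Responses above the group mean get positive advantages and are reinforced; the rest are suppressed.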

The research reveals that 62.8% of small-model errors on visual grounding stem from complex multi-relational descriptions. By spending test-time compute on reasoning through these complex prompts, EGM-8B closes the performance gap without increasing model size.

Performance Advantages

EGM-8B achieves 737ms average latency, making it 5.9x faster than larger models at inference. Because each forward pass is so cheap, the model can generate several candidate responses and select the best while still undercutting the cost of a single expensive pass through a large model.
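That best-of-n pattern can be sketched with stub components. The generator, scorer, and response schema here are placeholders for calls to a deployed EGM-8B endpoint and whatever verifier or confidence signal a real system would use.

```python
import itertools

def best_of_n(prompt, generate, score, n=4):
    """Sample n candidate responses and keep the highest-scoring one."""
    return max((generate(prompt) for _ in range(n)), key=score)

# Stub generator with canned confidence values; a real system would call the
# deployed model endpoint and score candidates with a verifier or IoU proxy.
_canned = itertools.cycle([0.3, 0.9, 0.6, 0.1])
def fake_generate(prompt):
    return {"prompt": prompt, "conf": next(_canned)}

best = best_of_n("locate the red shipping label", fake_generate, lambda r: r["conf"])
```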

Real-World Use Cases

The model excels in scenarios requiring precise object localization:

  • Object localization: Submit images with natural language descriptions to receive bounding box coordinates
  • Document region extraction: Extract specific regions from scanned documents based on field descriptions
  • Visual quality control: Localize defects in product images for downstream classification
  • Retail shelf analysis: Identify product locations on shelves based on SKU descriptions
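In each of these cases the model's answer must be converted into usable coordinates. Assuming the model emits boxes as bracketed `[x1, y1, x2, y2]` text — a format assumption to verify against the actual EGM-8B output schema — parsing might look like:

```python
import re

def parse_box(model_output: str):
    """Extract the first [x1, y1, x2, y2] integer group from model text.
    The bracketed format is an assumption; check the actual output schema
    of your deployment before relying on it."""
    m = re.search(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]", model_output)
    return [int(g) for g in m.groups()] if m else None

box = parse_box("The shipping label is at [120, 48, 310, 150].")
```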

A practical deployment example involves building a visual inspection system for logistics warehouses. EGM-8B can analyze package scan images and identify regions of interest—such as labels, barcodes, or damaged areas—enabling automated routing to appropriate inspection stations.

Getting Started with Microsoft Foundry

Microsoft Foundry makes deploying these models straightforward. Users can browse the Hugging Face collection in the Foundry model catalog and deploy to managed endpoints in a few clicks. The platform also supports one-click deployments directly from the Hugging Face Hub, with secure, scalable inference preconfigured.
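A minimal sketch of preparing a request to such a managed endpoint follows. The URL shape, header name, and payload schema are assumptions modeled on common Azure-style inference APIs, not a documented Foundry contract — substitute the values shown on your deployment's details page.

```python
import json

# Hypothetical endpoint and key: replace with the values from your deployment.
ENDPOINT = "https://<your-resource>.services.ai.azure.com/models/embeddings"
API_KEY = "<your-api-key>"

headers = {"Content-Type": "application/json", "api-key": API_KEY}
payload = json.dumps({
    "model": "harrier-oss-v1-0.6b",
    "input": ["Instruct: Given a search query, retrieve relevant passages\nQuery: vacation policy"],
})
# A real call would POST `payload` to ENDPOINT (e.g. via urllib.request).
```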

For developers looking to experiment with intermediate stages, NVIDIA provides the EGM-8B-SFT checkpoint, allowing exploration of the supervised fine-tuning stage before reinforcement learning refinement.

The addition of these models to Microsoft Foundry represents a significant step toward making efficient, high-performance AI models accessible to enterprises. By focusing on training methodology rather than parameter count, both Microsoft Research and NVIDIA demonstrate that the gap between small and large models continues to narrow, offering organizations powerful tools without the computational overhead of massive models.
