Large Language Models excel at planning but frequently stumble when executing actions on dynamic websites. Traditional approaches—whether vision-based systems requiring expensive pixel processing or DOM-reliant scripts vulnerable to layout changes—struggle with modern web complexity. SentienceAPI, developed by solo founder Tony W, proposes a novel solution: semantic geometry-based visual grounding.

The Core Innovation

Instead of flooding LLMs with raw HTML or imprecise screenshots, SentienceAPI distills web pages into minimal action spaces containing only visible, interactable elements. Each element is annotated with:
- Precise geometric coordinates (bbox)
- Visual cues like cursor: pointer
- A crucial is_primary signal indicating visual hierarchy priority

// Example element representation
{
  "id": 42,
  "role": "button",
  "text": "Add to Cart",
  "bbox": { "x": 935, "y": 529, "w": 200, "h": 50 },
  "visual_cues": {
    "cursor": "pointer",
    "is_primary": true,
    "color_name": "yellow"
  }
}

This structured approach mirrors human visual scanning patterns. As Tony explains: "Humans don’t read every pixel—we scan for visual hierarchy. Encoding that directly lets the agent prioritize actions without processing raw pixels or noisy DOM."

Technical Execution

SentienceAPI operates through three complementary modes:
1. Map Mode: Generates wireframes of interactable elements
2. Visual Mode: Aligns geometry with screenshots for spatial grounding
3. Read Mode: Extracts clean text for LLM comprehension
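Geometric grounding is what makes the final action deterministic: once the LLM picks an element, the click target is simply the center of its bbox, with no selector lookup involved. A minimal sketch of that last step, using the bbox format from the example above (the surrounding click machinery is assumed, not shown):

```python
def bbox_center(bbox):
    """Return the (x, y) click target at the center of a bounding box."""
    return (bbox["x"] + bbox["w"] / 2, bbox["y"] + bbox["h"] / 2)

# The "Add to Cart" element from the earlier example:
bbox = {"x": 935, "y": 529, "w": 200, "h": 50}
print(bbox_center(bbox))  # (1035.0, 554.0)
```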

The system treats browsers as black boxes, enabling compatibility across dynamic content. A reference application called MotionDocs demonstrates the workflow:

from sentienceapi_sdk import SentienceApiClient
from motiondocs import generate_video

# Drive an agent through the task and record the run as a video.
video = generate_video(
    url="https://www.amazon.com/gp/bestsellers/",
    instructions="Open a product and add it to cart",
    sentience_client=SentienceApiClient(api_key="API_KEY"),  # your key here
)
video.save("demo.mp4")


Why This Matters

- Reduced Hallucinations: Smaller action spaces constrain LLM decision-making
- Deterministic Execution: Geometric precision ensures reliable element targeting
- Cost Efficiency: Avoids GPU-intensive full-page vision processing
- Robustness: Resilient to DOM changes through visual-semantic binding
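
The hallucination point follows from how the action space is presented: if the model can only answer with one of the enumerated element ids, it cannot invent CSS selectors or phantom buttons. A hypothetical prompt-construction sketch under that assumption (the prompt wording is mine, not SentienceAPI's):

```python
def build_action_prompt(task, elements):
    """Render a constrained prompt: the LLM must answer with a listed id."""
    lines = [f"Task: {task}", "Interactable elements:"]
    for el in elements:
        lines.append(f'  [{el["id"]}] {el["role"]}: "{el["text"]}"')
    lines.append("Reply with the id of the single element to click.")
    return "\n".join(lines)

prompt = build_action_prompt(
    "Add a product to the cart",
    [{"id": 42, "role": "button", "text": "Add to Cart"},
     {"id": 7, "role": "link", "text": "Help"}],
)
print(prompt)
```

Validating the reply against the listed ids then turns a free-form generation problem into a closed multiple-choice one.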

Early testing shows agents successfully navigating complex flows like Amazon cart operations—a task where traditional methods frequently fail. The approach could significantly advance autonomous agents, RPA systems, and QA automation.

SentienceAPI remains in development, with its creator actively seeking feedback from developers working on agent technologies. As web interfaces grow increasingly complex, such semantic grounding layers may become essential infrastructure for practical LLM deployment.

Source: Tony W via Hacker News (https://news.ycombinator.com/item?id=46333526)