Using AI as a Thinking Partner for Large‑Scale Engineering Systems

Julie Qiu, Uber Tech Lead for Google Cloud’s SDK, explains how large language models serve as an “archaeologist,” “experimenter,” “critic,” “author,” and “reviewer” to tame the complexity of a 400‑repo, multi‑language client‑library platform. The talk details concrete prompts, workflow integrations, and measured performance gains, showing how AI can supply the extra “RAM” engineers need to synthesize legacy context, pressure‑test designs, and accelerate high‑level architectural decisions.

Presented at QCon AI, May 28, 2026 – transcript of the talk by Julie Qiu
Technical announcement
Julie Qiu, Senior Staff Engineer at Google and Uber Tech Lead for the Google Cloud CLI/SDK, revealed how her team leveraged a large language model (LLM) – Gemini CLI – to manage the cognitive overload of 400+ repositories spanning nine programming languages. The core problem was not code‑generation speed but context‑capacity: keeping the state of dozens of service teams, language‑specific generators, and release pipelines in a single engineer’s working memory.
She framed the LLM’s role as a set of five distinct “thinking‑partner” personas:
- Archaeologist – extracts legacy knowledge from code and stale docs.
- Experimenter – runs low‑cost design simulations before committing engineering effort.
- Critic – systematically pokes holes in proposals.
- Author – produces production‑grade snippets on demand.
- Reviewer – enforces style, safety, and consistency at PR time.
The session walked through concrete prompts, benchmark numbers, and deployment considerations for each persona.
Specifications & benchmarks
| Persona | Prompt example | Typical latency (Gemini 1.5‑Pro) | Accuracy metric* | Engineering impact |
|---|---|---|---|---|
| Archaeologist | Explain the Python generator pipeline, list all input files, and summarize the output format. | 3.2 s | 92 % of manually written summary matches | Replaced 2 weeks of manual repo inspection |
| Experimenter | Generate a minimal CLI that reads a spec.yaml and produces a Go client library. Do not write tests yet. | 4.8 s | 87 % of generated code compiles after gofmt | Cut prototype time from 5 days to <1 hour |
| Critic | Identify over‑engineered parts in the current release workflow and suggest simplifications. | 2.9 s | 78 % of suggestions validated by senior engineers | Reduced release script size by 3 k LOC |
| Author | Write a Go function that reads a YAML config, validates required fields, and returns a struct. | 1.9 s | 95 % passes go vet & staticcheck | Eliminated repetitive boilerplate across 180 libraries |
| Reviewer | Run gofmt, goimports, and staticcheck on the diff; flag any missing nil checks. | 2.1 s per PR | 99 % of flagged issues are true positives | Cut manual review time by ~30 % |
*Accuracy measured against a curated ground‑truth set created by the SDK team.
Deployment stack
- LLM endpoint – Gemini 1.5‑Pro via Google Cloud Vertex AI, accessed through a private VPC to keep source‑code context internal.
- Prompt orchestration – A thin Go wrapper (gemini-cli) that injects repository metadata, issue links, and a persistent session file to maintain conversational state across commands (see the sketch after this list).
- Security – All prompts are sanitized; no code is sent to external endpoints. The wrapper runs in a sandboxed Cloud Run service with IAM‑restricted access to the source repositories.
- Caching – Frequently used artifact‑extraction prompts are cached in Memorystore (Redis) for 24 h, reducing average latency by ~40 %.
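The talk did not publish the wrapper’s source, but its described behavior – inject repository metadata and issue links, keep a session file so state survives across commands – maps onto a small amount of Go. The sketch below is an illustration under those assumptions only; the Session fields, the session file name, and the callModel placeholder are hypothetical, not the team’s actual code.

```go
// Minimal sketch of a prompt-orchestration wrapper: inject repository
// metadata and issue links, keep a session file so conversational state
// survives across commands. Session fields, the file name, and callModel
// are hypothetical placeholders, not the team's actual implementation.
package main

import (
	"encoding/json"
	"fmt"
	"os"
)

// Session persists the context the wrapper prepends to every prompt.
type Session struct {
	Repo    string   `json:"repo"`
	Issues  []string `json:"issues"`
	History []string `json:"history"`
}

func loadSession(path string) (*Session, error) {
	data, err := os.ReadFile(path)
	if os.IsNotExist(err) {
		return &Session{}, nil // first command in a fresh session
	}
	if err != nil {
		return nil, err
	}
	var s Session
	return &s, json.Unmarshal(data, &s)
}

func (s *Session) save(path string) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	return os.WriteFile(path, data, 0o600)
}

// buildPrompt injects repository metadata, issue links, and prior turns
// ahead of the user's prompt.
func buildPrompt(s *Session, userPrompt string) string {
	ctx := fmt.Sprintf("Repository: %s\nRelated issues: %v\n", s.Repo, s.Issues)
	for _, turn := range s.History {
		ctx += "Previous prompt: " + turn + "\n"
	}
	return ctx + "\n" + userPrompt
}

// callModel stands in for the actual call to the private Vertex AI endpoint.
func callModel(prompt string) string {
	_ = prompt
	return "(model response elided in this sketch)"
}

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, `usage: wrapper "<prompt>"`)
		os.Exit(1)
	}
	const sessionPath = ".gemini-session.json" // hypothetical location
	sess, err := loadSession(sessionPath)
	if err != nil {
		panic(err)
	}
	fmt.Println(callModel(buildPrompt(sess, os.Args[1])))
	sess.History = append(sess.History, os.Args[1])
	if err := sess.save(sessionPath); err != nil {
		panic(err)
	}
}
```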
Real‑world performance numbers
- Context window: 128 k tokens (Gemini 1.5‑Pro) allowed the model to ingest an entire service spec plus the last 3 months of issue comments in a single request.
- Cost: Approx. $0.001 per 1 k‑token request; the team logged ~15 M tokens in the first month, costing under $15 USD.
- Reliability: 99.96 % success rate; retries on network hiccups were automatically handled by the wrapper.
Real‑world implications
1. Faster knowledge acquisition (Archaeologist)
By asking the model to summarize a repository, Julie reduced the time to understand a new language generator from 2 weeks → 3 hours. The model surfaced hidden files, undocumented regexes, and cross‑repo dependencies that were never captured in the official docs. This rapid “knowledge‑dig” enabled the team to map the system into three logical layers:
- Service‑team owned – API surface and product‑specific metadata.
- Platform‑team owned – Language‑agnostic generation engine.
- Language‑team owned – Idiomatic wrappers and packaging details.
The clear separation guided the next architectural refactor.
2. Low‑cost design validation (Experimenter)
Before committing engineers to a multi‑language migration, the team used the LLM to generate prototype generators for Python, Go, and Rust in isolated worktrees. The model’s output highlighted missing edge‑cases (e.g., version‑field duplication) and forced the designers to clarify ambiguous requirements. This simulation stage saved ~4 person‑months of engineering effort.
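To make the Experimenter workflow concrete, here is a rough Go sketch of the kind of throwaway generator such a prompt might produce: read a spec.yaml and emit a stub client via text/template. The spec schema (service, methods) is invented for illustration and is not the real generators’ input format.

```go
// Rough sketch of a throwaway prototype generator: read spec.yaml and emit
// a stub Go client. The spec schema (service, methods) is illustrative only.
package main

import (
	"os"
	"text/template"

	"gopkg.in/yaml.v3"
)

type Spec struct {
	Service string   `yaml:"service"`
	Methods []string `yaml:"methods"`
}

const clientTmpl = `package {{.Service}}

// Client is a generated stub; real transport and auth are left out.
type Client struct{}
{{range .Methods}}
func (c *Client) {{.}}() error {
	// TODO: call the {{.}} RPC
	return nil
}
{{end}}`

func main() {
	data, err := os.ReadFile("spec.yaml")
	if err != nil {
		panic(err)
	}
	var spec Spec
	if err := yaml.Unmarshal(data, &spec); err != nil {
		panic(err)
	}
	out, err := os.Create("client.gen.go")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	tmpl := template.Must(template.New("client").Parse(clientTmpl))
	if err := tmpl.Execute(out, spec); err != nil {
		panic(err)
	}
}
```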
3. Systemic simplification (Critic)
The LLM identified that 3 600 lines of remove_regex configuration were redundant across 179 libraries. Removing them reduced the configuration footprint by 12 % and eliminated a class of bugs where regex updates diverged between languages.
4. Production‑grade code generation (Author)
When generating boilerplate (e.g., flag parsing, YAML unmarshalling), the model produced code that compiled on the first pass 95 % of the time. The remaining issues were mainly stylistic (missing gofmt), which were automatically fixed by a post‑generation formatter step.
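For reference, a hedged example of the kind of boilerplate described above: load a YAML config with gopkg.in/yaml.v3, validate required fields, and return a struct. The field names are illustrative, not the SDK team’s actual schema.

```go
// Sketch of Author-persona boilerplate: read a YAML config, check required
// fields, return a typed struct. Field names are illustrative only.
package config

import (
	"fmt"
	"os"

	"gopkg.in/yaml.v3"
)

type LibraryConfig struct {
	Name     string `yaml:"name"`
	Language string `yaml:"language"`
	Version  string `yaml:"version"`
}

// Load reads path, unmarshals it, and rejects configs missing required fields.
func Load(path string) (*LibraryConfig, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, fmt.Errorf("reading config: %w", err)
	}
	var cfg LibraryConfig
	if err := yaml.Unmarshal(data, &cfg); err != nil {
		return nil, fmt.Errorf("parsing config: %w", err)
	}
	if cfg.Name == "" || cfg.Language == "" {
		return nil, fmt.Errorf("config %s: name and language are required", path)
	}
	return &cfg, nil
}
```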
5. Automated review pipeline (Reviewer)
Integrating Gemini Code Assist into the CI pipeline allowed PRs to be automatically annotated with:
- Missing nil checks
- Unused imports
- Violations of the team’s Go style guide (sourced from the official Effective Go document)
Human reviewers then focused on architectural concerns rather than mechanical linting, cutting average PR turnaround from 48 h → 34 h.
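A minimal Go sketch of what such a CI gate could look like, assuming the job simply shells out to gofmt and staticcheck and fails the build on findings; the actual Gemini Code Assist integration and PR‑annotation plumbing are omitted.

```go
// Sketch of a CI helper that mirrors the Reviewer checks: run gofmt and
// staticcheck over the checkout and fail on findings. This is an
// illustration, not the team's actual pipeline code.
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

// run executes a linter and returns its combined output.
func run(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return strings.TrimSpace(string(out)), err
}

func main() {
	failed := false

	// gofmt -l prints the files that are not properly formatted.
	if out, _ := run("gofmt", "-l", "."); out != "" {
		fmt.Println("needs gofmt:\n" + out)
		failed = true
	}

	// staticcheck exits non-zero when it finds problems (missing nil checks
	// and unused imports surface here, among others).
	if out, err := run("staticcheck", "./..."); err != nil {
		fmt.Println("staticcheck findings:\n" + out)
		failed = true
	}

	if failed {
		os.Exit(1)
	}
}
```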
Deployment considerations & best practices
- Prompt versioning – Store prompts in a version‑controlled prompts/ directory. When a prompt changes, increment its semantic version and re‑run the affected experiments (see the loading sketch after this list).
- Context management – Use a session file to persist the LLM’s memory across commands. Reset the session when switching domains to avoid cross‑contamination.
- Safety nets – Always run generated code through static analysis (staticcheck, golangci-lint) before merging. The LLM can hallucinate file paths or API fields.
- Human‑in‑the‑loop – Treat the LLM as a first‑pass assistant. Critical design decisions still require peer review and roadmap alignment.
- Cost monitoring – Enable Vertex AI budget alerts; the SDK team kept daily spend under $0.50 by throttling non‑essential prompts.
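As referenced in the prompt‑versioning item above, here is a small sketch of loading a version‑pinned prompt. The prompts/<persona>/<semver>.txt layout is an assumption made for illustration; the talk only calls for version control and semantic versions.

```go
// Sketch of loading a version-pinned prompt from a prompts/ directory.
// The <persona>/<semver>.txt layout is an assumption for illustration.
package prompts

import (
	"fmt"
	"os"
	"path/filepath"
)

// Load returns the prompt text for a persona at an explicit version, so
// experiments can be re-run against the exact prompt that produced them.
func Load(dir, persona, version string) (string, error) {
	path := filepath.Join(dir, persona, version+".txt")
	data, err := os.ReadFile(path)
	if err != nil {
		return "", fmt.Errorf("loading prompt %s@%s: %w", persona, version, err)
	}
	return string(data), nil
}
```

A call such as Load("prompts", "critic", "v1.2.0") then pins an experiment to one reviewed prompt revision rather than whatever happens to be at the head of the branch.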
Conclusion
Julie Qiu’s experience demonstrates that large language models can become a practical “thinking partner” for massive, multi‑language infrastructure projects. By assigning the model clear roles—archaeologist, experimenter, critic, author, reviewer—engineers can offload repetitive, context‑heavy tasks to the model while reserving human judgment for strategic decisions. The result is a ten‑fold increase in mental bandwidth, measurable reductions in latency and cost, and a clearer architectural roadmap for the Google Cloud SDK ecosystem.
Further reading
- Official Gemini CLI docs: https://cloud.google.com/vertex-ai/docs/generative-ai/code
- Google Cloud SDK repository: https://github.com/googleapis/google-cloud-go
- Julie Qiu’s slide deck (PDF): https://www.infoq.com/presentations/ai-thinking-partner-large-scale-engineering-systems/slides

