LLM Coding Assistants Revisited: Swift Development Gains Amid Persistent Challenges
Twelve months ago, coding Large Language Models (LLMs) such as GitHub Copilot struggled with even basic Swift/SwiftUI application development, requiring manual rewrites of nearly every generated line. Now, a new stress test asks whether today's models can deliver on the promise of "vibe coding" for a non-trivial macOS application that combines SwiftUI, AVFoundation, and Swift Charts.
The Challenge: Raindrop Synthesis App
The test application requires:
- Real-time rain sound synthesis using AVAudioSourceNode (see the sketch after this list)
- Parametric controls for raindrop characteristics with randomized variations
- Dynamic waveform visualization via Swift Charts
- Background noise generators (pink/brown/white noise)
- UI with interactive sliders
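The AVAudioSourceNode requirement is the heart of the test: its render block has to fill raw audio buffers on the real-time thread. Below is a minimal sketch of that path, assuming plain white noise stands in for the full raindrop model; the NoiseSynth name and the 0.1 gain are illustrative, not taken from the test app.

```swift
import AVFoundation

// Minimal sketch: an AVAudioSourceNode feeding white noise into an AVAudioEngine.
// A real raindrop synthesizer would shape individual drop envelopes instead of raw noise.
final class NoiseSynth {
    private let engine = AVAudioEngine()

    // The render block runs on the real-time audio thread.
    private let sourceNode = AVAudioSourceNode { _, _, frameCount, audioBufferList in
        let buffers = UnsafeMutableAudioBufferListPointer(audioBufferList)
        for frame in 0..<Int(frameCount) {
            // Float.random is acceptable for a sketch; production code would use a
            // real-time-safe noise generator (no locks, no allocation) here.
            let sample = Float.random(in: -1...1) * 0.1
            for buffer in buffers {
                let channel = UnsafeMutableBufferPointer<Float>(buffer)
                channel[frame] = sample
            }
        }
        return noErr
    }

    func start() throws {
        let output = engine.outputNode
        let format = output.inputFormat(forBus: 0)
        engine.attach(sourceNode)
        engine.connect(sourceNode, to: output, format: format)
        try engine.start()
    }

    func stop() {
        engine.stop()
    }
}
```

The render block must avoid allocation and locking while it runs, which is exactly the real-time constraint that tripped up several of the models below.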
Frontier Model Showdown
GPT-4o: Regression Alert (2/10)
GPT-4o, the same model tested last year, generated 383 lines across five files, riddled with 21 errors. Critical audio rendering functions were left as placeholders, the noise generators were structurally flawed, and the Swift Charts integration was broken. "Hours of cleanup would be needed," the developer noted.
GPT-5.2: Functional But Flawed (7/10)
The model produced a coherent architecture overview, but the implementation suffered from audio buffer handling errors that required manual memory-management fixes. The UI lacked labels and exhibited severe performance issues caused by excessive Chart view updates. "A nuisance of a compile bug" masked otherwise operational audio synthesis.
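The Chart performance problem is a common SwiftUI pitfall: if every audio callback writes straight into observed state, the Chart re-renders hundreds of times per second. One way to avoid it, sketched here with a hypothetical WaveformModel and WaveformView (not the tester's code), is to batch samples and publish a snapshot to the UI at a modest fixed rate.

```swift
import Foundation
import SwiftUI
import Charts

// Hypothetical model: audio code appends samples as they are produced, but the
// Chart only sees a snapshot published about 20 times per second.
final class WaveformModel: ObservableObject {
    @Published var samples: [Float] = []

    private var pending: [Float] = []
    private let lock = NSLock()   // brevity only; a real render path would prefer a lock-free ring buffer
    private var timer: Timer?

    // Called from the audio side whenever new samples are available.
    func append(_ newSamples: [Float]) {
        lock.lock()
        pending.append(contentsOf: newSamples)
        lock.unlock()
    }

    // Called once from the UI; the timer fires on the main run loop, so the
    // @Published property is only ever mutated on the main thread.
    func startPublishing() {
        timer = Timer.scheduledTimer(withTimeInterval: 0.05, repeats: true) { [weak self] _ in
            guard let self else { return }
            self.lock.lock()
            let snapshot = Array(self.pending.suffix(512))
            self.pending.removeAll(keepingCapacity: true)
            self.lock.unlock()
            self.samples = snapshot
        }
    }
}

struct WaveformView: View {
    @ObservedObject var model: WaveformModel

    var body: some View {
        Chart {
            ForEach(model.samples.indices, id: \.self) { index in
                LineMark(x: .value("Sample", index),
                         y: .value("Amplitude", Double(model.samples[index])))
            }
        }
    }
}
```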
Gemini 3: Surprisingly Efficient (8/10)
Gemini's concise 227-line implementation contained only one syntax error (easily fixed) and delivered fully functional audio with properly labeled controls. Though the UI sizing was overly aggressive, it outperformed GPT-5.2 in both correctness and runtime efficiency.
Claude 4.5: Current Leader (9/10)
The most verbose solution (445 lines) included thoughtful additions such as a Start/Stop button, though stopping failed to silence the background noise generators. While the UI didn't fit standard displays and relied on outdated concurrency patterns, it was the most complete implementation, requiring only minor fixes.
Local LLMs: Not Ready for Primetime
Models like Qwen3-Coder-30B and GPT-OSS-20B struggled with fundamental Swift concepts:
- Stack corruption crashes due to timing calculation errors
- iOS-only APIs used in macOS contexts (illustrated in the sketch after this list)
- Misunderstanding of real-time audio rendering constraints
- Swift Charts integration failures
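A typical instance of the platform confusion, offered as an illustration rather than a quote from the test output: reaching for AVAudioSession, which exists on iOS and Mac Catalyst but not in a plain macOS target. The configureAudioOutput helper below is hypothetical.

```swift
import AVFoundation

// Illustration of the iOS-vs-macOS pitfall: code that unconditionally configures
// AVAudioSession will not compile for a plain macOS (AppKit) target.
func configureAudioOutput() throws {
    #if os(iOS)
    // iOS: choose a session category before starting the AVAudioEngine.
    let session = AVAudioSession.sharedInstance()
    try session.setCategory(.playback)
    try session.setActive(true)
    #else
    // macOS: there is no AVAudioSession; AVAudioEngine talks to the default
    // output device directly, so no session setup is required here.
    #endif
}
```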
Even the best local model (GPT-OSS-20B) scored just 6/10 after extensive fixes, while others were "not even usable as a starting point."
Critical Limitations Persist
Across all models, consistent issues emerged:
1. Outdated Swift Practices: Models defaulted to ObservableObject instead of the @Observable macro (contrasted in the sketch after this list), ignored modern concurrency (async/await), and used deprecated dispatch methods
2. API Blind Spots: Zero awareness of Swift 6 or recent iOS/macOS features
3. Non-Determinism: Changing minor prompt wording yielded drastically different (and often broken) results
4. Audio Engineering Gaps: Fundamental misunderstandings of AVAudioEngine's real-time constraints
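On the first point, the gap is concrete: the ObservableObject/@Published pattern still works, but since macOS 14 and iOS 17 the Observation framework's @Observable macro is the preferred approach, and none of the models produced it. Here is a minimal contrast, using an illustrative RainSettings model rather than anything from the test.

```swift
import SwiftUI
import Observation

// What the models kept generating: the Combine-era pattern (still valid, but dated).
final class LegacyRainSettings: ObservableObject {
    @Published var dropIntensity: Double = 0.5
}

// What none of them produced: the @Observable macro (macOS 14 / iOS 17 and later),
// which removes the @Published / objectWillChange boilerplate.
@Observable
final class RainSettings {
    var dropIntensity: Double = 0.5
}

struct RainControls: View {
    // @Observable models are owned with plain @State instead of @StateObject.
    @State private var settings = RainSettings()

    var body: some View {
        VStack(alignment: .leading) {
            Text("Drop intensity")
            Slider(value: $settings.dropIntensity, in: 0...1)
        }
    }
}
```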
The Verdict: Progress Without Parity
While frontier models now generate working applications from single prompts (unthinkable 12 months ago), they remain unreliable production partners. The inconsistency across runs, outdated coding patterns, and fundamental audio engineering errors mean developers still spend significant time debugging rather than creating. As the tester concluded: "For any prompt, you're rolling dice to see if you get a good job that might help you move forward or a complete mess."
Source: Cocoa with Love