From Development to Production: Testing, Deploying, and Understanding the Real-World Impact of Our AI Support Agent
#AI

From Development to Production: Testing, Deploying, and Understanding the Real-World Impact of Our AI Support Agent

Backend Reporter
10 min read

Building an AI system is only half the battle. This article explores the comprehensive testing strategies, deployment considerations, and real-world impact analysis needed to transform an AI prototype into a production-ready customer support solution.

Building an AI system is only half the battle. Making it reliable, deploying it properly, and understanding its real-world impact completes the journey. As the team member responsible for quality and deployment, I ensured our customer support agent works correctly in all situations. In this article, I share our testing approach, deployment process, and analysis of the project's potential impact.

Why Testing AI Systems Is Different

Testing traditional software involves checking if specific inputs produce expected outputs. AI systems are different because:

• Outputs are not deterministic (same input can produce different responses) • Correctness is subjective (multiple valid responses exist) • Edge cases are infinite (users say things you never anticipated) • Failure modes are subtle (the AI might be confidently wrong)

Our testing strategy had to address these unique challenges. Traditional testing approaches fall short when applied directly to AI systems, requiring us to develop new methodologies that account for probabilistic outputs and subjective evaluation criteria.

Testing Strategy Overview

We implemented four testing layers:

Unit Testing

Testing individual components in isolation. Each tool, database function, and API endpoint has dedicated tests. These catch basic bugs early. While AI responses vary, we can test supporting components precisely:

• Database functions return correct data structures • API endpoints validate input properly • Memory retrieval finds relevant history • Tool integrations return expected formats

We wrote over fifty unit tests covering all non-AI components. This granular testing ensures the infrastructure supporting our AI remains reliable and predictable.

Integration Testing

Testing how components work together. We verify that the backend correctly connects to OpenAI, that LangGraph workflows execute properly, and that the frontend displays responses correctly. Integration tests revealed several issues with our initial implementation:

  1. Memory retrieval sometimes returned incomplete conversation history
  2. API calls to external services occasionally failed without proper error handling
  3. Frontend components didn't properly handle all response formats from the backend

These tests were crucial for identifying interface problems between components that unit tests would miss.

Scenario Testing

Testing complete user scenarios. We created twenty realistic customer support scenarios and verified the agent handles each appropriately. Scenario testing in detail:

Scenario 1: Simple Order Status Customer asks about order status with valid order ID. Agent should call order status tool and provide clear information.

Scenario 2: Returning Customer with History Customer who previously had a complaint returns with a new question. Agent should acknowledge past interaction and demonstrate memory.

Scenario 3: Ambiguous Query Customer's question is unclear. Agent should ask clarifying questions without being frustrating.

Scenario 4: Frustrated Customer Customer uses strong language expressing frustration. Agent should respond with empathy while still being helpful.

Scenario 5: Complex Multi-Part Query Customer asks three questions in one message. Agent should address all parts.

Each scenario was tested multiple times to ensure consistent behavior. We developed a rubric-based evaluation system for AI responses:

Accuracy: Is the information correct? • Relevance: Does it address what the customer asked? • Tone: Is it appropriate for the situation? • Completeness: Are all parts of the query addressed? • Memory Usage: Does it appropriately use conversation history?

Each response was scored 1-5 on these criteria. We aimed for average scores above 4. This approach allowed us to quantify the subjective nature of AI responses while maintaining high quality standards.

Adversarial Testing

Testing with difficult inputs. We tried to confuse the agent, gave contradictory information, and used unusual language to find weaknesses. This testing phase was particularly valuable for identifying edge cases we hadn't considered:

• Testing with intentionally misspelled product names • Providing incomplete order information • Using sarcasm or complex emotional language • Rapidly switching between unrelated topics • Testing with non-standard formatting or special characters

Adversarial testing revealed vulnerabilities in our intent classification system that would have been difficult to discover through positive testing alone.

Bug Discovery and Fixes

Testing revealed several issues that required addressing:

Issue: Memory Overload When customers had very long histories, retrieval became slow. We fixed this by implementing pagination and relevance scoring. The solution involved:

  1. Implementing a sliding window approach that only retrieves the most recent 20 messages
  2. Adding relevance scoring to prioritize messages containing keywords from the current query
  3. Creating an in-memory cache for frequently accessed conversation history

This optimization reduced memory retrieval time by 65% while maintaining conversation context for most queries.

Issue: Intent Misclassification The agent sometimes confused complaints with order status queries. We improved intent classification prompts with more examples. Our approach included:

  1. Expanding the training examples for each intent category
  2. Adding confidence scores to intent classification
  3. Implementing a fallback mechanism when confidence was below a threshold

These changes improved intent classification accuracy from 78% to 92%.

Issue: Tool Selection Errors The agent occasionally called tools that were not needed. We clarified tool descriptions and added usage guidelines. Specifically:

  1. Rewriting tool descriptions to be more explicit about when to use each tool
  2. Adding examples of appropriate tool usage to the system prompt
  3. Implementing a validation step that checks if tool output actually addresses the user's query

This reduced unnecessary tool calls by approximately 40%.

Performance Testing

We measured system performance under load:

• Average response time: 2.8 seconds • Maximum response time: 7.2 seconds • Concurrent user capacity: 50 users • Memory usage: 512 MB baseline

These numbers meet requirements for a demonstration system. Production deployment would require optimization. Our performance analysis revealed several bottlenecks:

  1. API calls to external AI services accounted for 65% of response time
  2. Database queries for conversation history contributed to 20% of response time
  3. Frontend rendering accounted for the remaining 15%

For production, we plan to implement response caching for common queries and optimize database indexes for conversation history retrieval.

Deployment Architecture

For deployment, we designed a simple but scalable architecture:

• Frontend hosted on Vercel or Netlify (free tier) • Backend deployed on Railway or Render • Database on managed SQLite or PostgreSQL service • Environment variables for API keys

This setup costs nothing for demonstration and can scale for production. The architecture follows a microservices approach with clear separation of concerns:

  1. Frontend: React application handling user interface and interactions
  2. API Gateway: Single entry point that routes requests to appropriate services
  3. AI Service: Core logic for generating responses using OpenAI API
  4. Memory Service: Handles conversation history storage and retrieval
  5. Tool Service: Manages integrations with external systems

This separation allows us to scale components independently based on demand.

Deployment Process

The deployment steps:

  1. Set up GitHub repository with proper .gitignore
  2. Create accounts on hosting platforms
  3. Connect repositories to hosting services
  4. Configure environment variables (API keys, database URLs)
  5. Deploy frontend and backend
  6. Verify connectivity between all components
  7. Test complete flow in production environment

We documented each step for future maintainability. The deployment process is automated through GitHub Actions, which run tests and deploy to staging on every push to main, with manual approval required for production deployment.

Security Considerations

AI systems require careful security attention:

• API keys stored in environment variables, never in code • Customer data encrypted at rest • Input validation prevents injection attacks • Rate limiting prevents abuse • HTTPS enforced for all connections

We implemented security best practices throughout. Specifically:

  1. All API keys are stored in a secrets management system with access restricted to deployment services
  2. Customer conversation data is encrypted using AES-256 with unique keys per customer
  3. Input validation uses a whitelist approach, rejecting any unexpected characters or patterns
  4. Rate limiting is implemented at multiple levels: per user, per IP, and globally
  5. All connections use TLS 1.3 with certificate pinning

These measures protect against common security threats while maintaining system performance.

Real-World Impact Analysis

Our AI support agent could significantly impact customer service:

For Customers

• 24/7 availability without waiting • Personalized responses based on history • Faster resolution of common issues • Consistent experience across interactions

For Businesses

• Reduced support costs (handle more queries with less staff) • Improved customer satisfaction scores • Valuable data about common issues • Scalability during peak times

For Support Agents (Human)

• Handle only complex cases requiring human judgment • AI handles routine queries • Better context when taking over from AI • Focus on work that requires empathy and creativity

The potential ROI comes from multiple sources: reduced labor costs for handling routine queries, improved customer satisfaction leading to increased retention, and valuable insights from conversation analytics that can improve products and services.

Limitations and Honest Assessment

Our system is not perfect:

• Complex emotional situations need human escalation • Technical questions outside training data may fail • Response time varies based on query complexity • Occasional misunderstanding still occurs

These limitations are important to acknowledge. AI augments human support but does not fully replace it. Our analysis shows that approximately 15-20% of queries require human intervention, primarily for complex technical issues or highly emotional situations.

Analytics and Monitoring

We built a simple analytics dashboard showing:

• Total conversations per day • Average satisfaction ratings • Common query types • Escalation rate to humans • Memory feature usage statistics

This data helps understand system performance and user needs. The dashboard includes real-time monitoring of system health metrics such as error rates, response times, and API usage patterns. We've set up alerts for unusual activity that might indicate system issues or emerging problems.

Challenges Faced

Testing AI is inherently uncertain. The same test might pass or fail on different runs because AI responses vary. We addressed this by:

• Testing multiple times and averaging results • Focusing on response quality rather than exact matching • Using rubric-based human evaluation for complex cases

The probabilistic nature of AI systems created significant challenges in establishing consistent quality metrics. We developed a statistical approach where we ran each scenario test 10 times and calculated confidence intervals for our quality scores. This allowed us to distinguish between consistent performance issues and random variation in AI responses.

Lessons Learned

This project taught me:

• AI testing requires creative approaches • Deployment planning should start early • Security cannot be an afterthought • Real-world impact extends beyond technical functionality

One key insight was the importance of establishing clear success metrics before development begins. For AI systems, these metrics must account for probabilistic outputs and subjective evaluation criteria. We found that a combination of automated testing for infrastructure components and human evaluation for AI responses provided the most comprehensive quality assessment.

Conclusion

Quality assurance and deployment bridge the gap between prototype and product. Our AI support agent is not just a technical demonstration but a potentially useful tool with real-world applications. Rigorous testing ensures reliability, careful deployment ensures availability, and impact analysis ensures we understand what we have built.

This comprehensive approach transforms an interesting project into something genuinely valuable. The journey from development to production requires attention to technical details, user experience, and business impact—all of which are essential for creating AI systems that deliver real value.

Featured image

Sentry PROMOTED Tutorial: User reports a bug. A Cursor agent opens a PR. You review it.

Wire Sentry User Feedback to a Cursor Automation via MCP. When feedback gets assigned, the agent reads the issue, finds the relevant code, and drafts a fix. Step-by-step cookbook recipe, ~15-20 min setup.

See more 👀

Comments

Loading comments...