High-Level Overview
The chatbot platform follows a microservices-inspired architecture with clear separation of concerns:
Core Components
1. Web Server Auth Middleware
Ensures correct authentication. The layer distinguishes between different user levels and restricts access accordingly.
- Login & docs routes are protected by UI api-keys
- Logged-in users are authenticated via JWT tokens
- Chat endpoints require JWT tokens or customer- or persona-specific api-keys
- Chat JWT tokens can be requested from the /api/chat/get_token endpoint with an api-key
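The credential distinction above can be sketched as follows. This is a minimal illustration assuming a hypothetical `Credential` enum and header names; the actual middleware types may differ.

```rust
/// Hypothetical credential type distinguishing the two accepted schemes.
#[derive(Debug, PartialEq)]
enum Credential {
    Jwt(String),
    ApiKey(String),
}

/// Extract a credential from request headers: a `Bearer` Authorization
/// header is treated as a JWT, an `x-api-key`-style header value as an api-key.
fn extract_credential(authorization: Option<&str>, api_key: Option<&str>) -> Option<Credential> {
    if let Some(auth) = authorization {
        if let Some(token) = auth.strip_prefix("Bearer ") {
            return Some(Credential::Jwt(token.to_string()));
        }
    }
    api_key.map(|k| Credential::ApiKey(k.to_string()))
}
```

A request carrying neither header yields `None` and is rejected before reaching the chat endpoints.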
2. Web Server Layer
- Purpose: HTTP API gateway and request routing
- Responsibilities:
- REST API endpoints for chat & user/admin operations
- Configuration management endpoints
- Request/response serialization
- Chat streaming support
3. Chat Engine
- Purpose: Core conversation management and AI integration
- Responsibilities:
- Chat session lifecycle management
- AI provider abstraction and integration
- Token streaming implementation
- Context assembly and prompt construction
- Tool execution and function calling
4. Configuration System
- Purpose: Dynamic persona/RAG/MCP and system configuration
- Responsibilities:
- Chat persona definitions
- Tool configurations
- Context document associations
- Chat parameters (temperature, max tokens, etc.)
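A persona definition bundles these pieces together. The sketch below uses hypothetical field names and defaults; the actual schema lives in the database.

```rust
/// Illustrative persona configuration; field names are assumptions.
struct PersonaConfig {
    name: String,
    system_prompt: String,
    temperature: f32,
    max_tokens: u32,
}

impl Default for PersonaConfig {
    fn default() -> Self {
        PersonaConfig {
            name: String::new(),
            system_prompt: String::new(),
            temperature: 0.7, // common default; the real value is configuration-driven
            max_tokens: 1024,
        }
    }
}
```

A concrete persona then overrides only the fields it cares about, with everything else falling back to defaults.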
5. RAG System
- Purpose: Context retrieval and document management
- Responsibilities:
- Document ingestion and vectorization
- Semantic search and retrieval
- Context ranking and filtering
- Document metadata management
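The ranking step can be illustrated with cosine similarity plus top-k selection. In the real system Qdrant performs this search over stored embeddings; this sketch only shows the underlying idea.

```rust
/// Cosine similarity between two embedding vectors.
fn cosine_similarity(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Rank documents by similarity to the query embedding and keep the top k.
fn top_k<'a>(query: &[f32], docs: &'a [(&'a str, Vec<f32>)], k: usize) -> Vec<&'a str> {
    let mut scored: Vec<(&str, f32)> = docs
        .iter()
        .map(|(id, emb)| (*id, cosine_similarity(query, emb)))
        .collect();
    // Sort descending by score, then truncate to k hits.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
    scored.into_iter().take(k).map(|(id, _)| id).collect()
}
```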
6. Data Layer
- PostgreSQL: Configuration, chat history, user management, document storage
- Qdrant Vector Database: Document embeddings and similarity search
- File System: Static assets
Data Flow
Chat Request Flow
- Request Reception: Web server receives chat request with persona ID
- Authentication: Middleware extracts and verifies the api-key or JWT
- Chat Routing:
- Chat invocation uses an api-key or a JWT token that contains all information required to identify the persona.
- Database access is typically not required for request routing.
- Chat personas are cached or loaded on demand from PostgreSQL.
- Session context is cached.
- Request Moderation: If configured for the persona, the request is checked for inappropriate content.
- Context Assembly: Query RAG system for relevant context
- Prompt Construction: Build complete prompt with system, context, and user message
- AI Processing: Send to AI provider (streaming or batch)
- Tool Handling: MCP/function call requests from the AI are executed; follow-up LLM requests may be issued with the function results.
- Response Handling: Process and return/stream response to client
- History Storage: Save (brief) conversation history to session/database.
- Accounting: Token counts and related usage metrics are recorded.
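The prompt-construction step above (system prompt + retrieved context + user message) can be sketched as follows, assuming a simple message-list representation; the real chat engine builds provider-specific payloads.

```rust
/// Minimal chat message; role names follow the common system/user convention.
struct Message {
    role: &'static str,
    content: String,
}

/// Assemble the final prompt from the persona's system prompt,
/// the RAG context, and the user's message.
fn build_prompt(system: &str, context_docs: &[&str], user: &str) -> Vec<Message> {
    let mut sys = system.to_string();
    // Retrieved context is appended to the system message.
    if !context_docs.is_empty() {
        sys.push_str("\n\nContext:\n");
        sys.push_str(&context_docs.join("\n---\n"));
    }
    vec![
        Message { role: "system", content: sys },
        Message { role: "user", content: user.to_string() },
    ]
}
```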
Configuration Management Flow
- CRUD Operations: Create/read/update/delete persona configurations
- Validation: Ensure configuration integrity and required fields
- Hot Reload: Update active configurations without restart
- Versioning: Track configuration changes and rollback capability
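The validation step can be sketched as below, assuming hypothetical field names and limits; the actual rules are defined by the configuration system.

```rust
/// Illustrative persona configuration; field names are assumptions.
struct PersonaConfig {
    name: String,
    system_prompt: String,
    temperature: f32,
}

/// Reject configurations missing required fields or with out-of-range parameters.
fn validate(config: &PersonaConfig) -> Result<(), String> {
    if config.name.trim().is_empty() {
        return Err("persona name must not be empty".to_string());
    }
    if config.system_prompt.trim().is_empty() {
        return Err("system prompt must not be empty".to_string());
    }
    if !(0.0..=2.0).contains(&config.temperature) {
        return Err("temperature must be between 0.0 and 2.0".to_string());
    }
    Ok(())
}
```

Running validation on every create/update keeps invalid personas out of the hot-reload path.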
Key Design Principles
1. Async-First
- All I/O operations are asynchronous using tokio
- Non-blocking database operations
- Concurrent request handling
- Streaming responses where applicable
2. Configuration-Driven
- All chat behavior controlled by database configuration
- No hardcoded prompts or parameters
- Dynamic persona switching
- Runtime configuration updates
3. Provider Agnostic
- Abstract AI provider interface
- Support multiple AI backends
- Easy provider switching
- Consistent API regardless of backend
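A provider-agnostic interface can be sketched with a trait; `AiProvider` and the toy backends here are hypothetical (the real abstraction is async and streaming), but the principle is the same: callers depend only on the trait, so backends can be swapped freely.

```rust
/// Hypothetical provider interface; the real one is async and streams tokens.
trait AiProvider {
    fn name(&self) -> &str;
    fn complete(&self, prompt: &str) -> String;
}

/// Toy backend that echoes the prompt.
struct EchoProvider;
impl AiProvider for EchoProvider {
    fn name(&self) -> &str { "echo" }
    fn complete(&self, prompt: &str) -> String { format!("echo: {prompt}") }
}

/// Toy backend that upper-cases the prompt.
struct UpperProvider;
impl AiProvider for UpperProvider {
    fn name(&self) -> &str { "upper" }
    fn complete(&self, prompt: &str) -> String { prompt.to_uppercase() }
}

/// Caller code is identical regardless of which backend is plugged in.
fn ask(provider: &dyn AiProvider, prompt: &str) -> String {
    provider.complete(prompt)
}
```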
4. Scalable Architecture
- Stateless web server design
- Horizontal scaling capability
- Efficient resource utilization
- Caching strategies for performance
Security Considerations
- API authentication and authorization
- Input validation and sanitization
- SQL injection prevention
- Rate limiting and abuse prevention
- Secure configuration storage
- Audit logging for compliance
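As one concrete example of input validation, a message sanitizer might look like the sketch below; the limits and rules are assumptions, and real sanitization also relies on parameterized SQL queries and rate limiting.

```rust
/// Strip control characters and enforce basic size rules on a chat message.
fn sanitize_message(input: &str, max_len: usize) -> Result<String, String> {
    // Remove control characters that could corrupt logs or prompts,
    // while keeping ordinary whitespace.
    let cleaned: String = input
        .chars()
        .filter(|c| !c.is_control() || *c == '\n' || *c == '\t')
        .collect();
    if cleaned.trim().is_empty() {
        return Err("message is empty".to_string());
    }
    if cleaned.chars().count() > max_len {
        return Err("message too long".to_string());
    }
    Ok(cleaned)
}
```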
Performance Targets
- Latency: < 200ms for configuration retrieval
- Throughput: 1000+ concurrent chat sessions
- Streaming: < 50ms first token latency
- Database: < 10ms query response time
- Memory: Efficient resource usage with connection pooling