Status History
May 29, 04:18 PM Pending
May 31, 12:55 PM In Progress
Intake Form
Technical Specification
Frontend
React SPA (session monitor dashboard + persona config UI)
Backend
FastAPI (Python) REST + WebSocket API with asyncio concurrency
Database
PostgreSQL via Neon (session logs, persona configs, metrics) + Redis (session state, KV-cache coordination)
Hosting
Docker Compose on bare-metal / GPU server (NVIDIA A100/H100) with Nginx reverse proxy
Summary

Aziza is a production-grade, real-time, full-duplex voice AI system built on NVIDIA PersonaPlex (Moshi). It is designed to handle 8–10 concurrent telephone and WebSocket voice sessions with end-to-end latency under 120 ms. The system combines FP8-quantized inference with dynamic batching, per-session KV-cache management, Russian and Uzbek support via QLoRA fine-tuned adapters, and Asterisk PBX integration for PSTN/SIP call routing. Its distinguishing feature is the combination of GPU-level inference optimization, language-specific persona control, and a complete telephony bridge, all orchestrated through a Dockerized microservice architecture with real-time observability.

File Structure
docker-compose.yml Orchestrates all services: FastAPI, Moshi inference engine, Redis, PostgreSQL, Nginx, Asterisk
Dockerfile.api FastAPI application container with Python 3.11, CUDA dependencies, and model client libraries
Dockerfile.inference NVIDIA PersonaPlex/Moshi inference server with FP8 TensorRT-LLM optimization and GPU passthrough
Dockerfile.asterisk Asterisk PBX container with AGI/ARI bridge configuration for PSTN/SIP integration
nginx/nginx.conf Reverse proxy config with WebSocket upgrade support, load balancing, and SSL termination
server/main.py FastAPI app entrypoint: registers routers, initializes session manager, connects to Redis and PostgreSQL
server/config.py Pydantic settings: model paths, GPU config, concurrency limits, latency thresholds, language settings
server/routers/sessions.py WebSocket endpoint for full-duplex voice streaming; manages session lifecycle, audio chunking, and response piping
server/routers/personas.py REST CRUD endpoints for persona definitions (name, voice profile, language, LoRA adapter reference)
server/routers/admin.py Admin endpoints for system health, active session count, GPU utilization, and benchmark trigger
server/routers/asterisk.py ARI webhook receiver and AGI handler for bridging Asterisk calls into the voice AI session pipeline
server/services/session_manager.py Manages up to 10 concurrent sessions with slot allocation, timeout handling, and graceful teardown
server/services/inference_client.py gRPC/HTTP2 client to the Moshi inference server; handles streaming audio in/out and dynamic batching requests
server/services/audio_pipeline.py Audio preprocessing: resampling, VAD (voice activity detection), codec handling (OPUS/PCM), and chunk buffering
server/services/persona_service.py Loads persona config and maps to correct LoRA adapter, voice embedding, and language tokenizer at session init
server/services/metrics_service.py Collects per-session latency, TTFB, concurrency stats; writes to PostgreSQL and exposes Prometheus metrics
server/models/session.py SQLAlchemy ORM model for session records: id, persona_id, language, start_time, end_time, avg_latency_ms
server/models/persona.py SQLAlchemy ORM model for persona: id, name, language, voice_profile_path, lora_adapter_path, system_prompt
server/db/database.py Async SQLAlchemy engine setup, session factory, and Alembic migration integration
server/db/redis_client.py Redis async client wrapper for session slot tracking, KV-cache state keys, and pub/sub coordination
inference/server.py Moshi inference server entrypoint: loads FP8-quantized model, initializes TensorRT engine, starts gRPC streaming service
inference/batching.py Dynamic batching logic: groups incoming audio frames across sessions to maximize GPU throughput within latency budget
inference/kv_cache.py KV-cache management: per-session cache allocation, eviction policy, and cache-hit optimization for streaming inference
inference/fp8_optimizer.py FP8 quantization utilities using TensorRT-LLM: model export, calibration dataset runner, and precision validation
training/finetune_lora.py QLoRA fine-tuning script for RU/UZ language adaptation using HuggingFace PEFT; configures rank, alpha, target modules
training/dataset_builder.py Builds multilingual voice dataset: audio segmentation, transcript alignment, language tagging for RU and UZ corpora
training/tokenizer_extension.py Extends base tokenizer with RU/UZ vocabulary, Cyrillic/Latin script handling, and phoneme mappings
training/train_config.yaml Training hyperparameters: learning rate, batch size, LoRA rank, warmup steps, evaluation strategy, checkpoint paths
asterisk/extensions.conf Dialplan configuration routing inbound calls to AGI script that bridges audio to the FastAPI session endpoint
asterisk/agi/voice_bridge.py AGI script: captures RTP audio from Asterisk, streams to FastAPI WebSocket, and plays back AI response audio
asterisk/ari_config.conf ARI (Asterisk REST Interface) configuration for programmatic call control and media streaming
benchmarks/load_test.py Locust-based load test simulating 8-10 concurrent voice sessions; measures latency percentiles and session stability
benchmarks/latency_report.py Parses load test results and generates benchmark report: p50/p95/p99 latency, concurrency ceiling, GPU memory usage
dashboard/src/App.tsx React app root: sets up routing between session monitor, persona manager, and system metrics views
dashboard/src/pages/SessionMonitor.tsx Live dashboard showing active sessions, per-session latency gauge, language/persona label, and session controls
dashboard/src/pages/PersonaManager.tsx CRUD UI for persona configurations: name, language selector (EN/RU/UZ), voice profile upload, LoRA adapter assignment
dashboard/src/pages/SystemMetrics.tsx GPU utilization charts, concurrency history, latency trends, and benchmark report viewer using Recharts
dashboard/src/hooks/useSessionSocket.ts Custom React hook managing WebSocket connection for real-time session audio streaming and status updates
dashboard/src/api/client.ts Axios client with base URL config, auth headers, and typed request/response interfaces for all REST endpoints
migrations/ Alembic migration scripts for PostgreSQL schema versioning across sessions, personas, and metrics tables
docs/architecture.md System architecture overview: component diagram, data flow, GPU pipeline, Asterisk integration, and scaling notes
docs/deployment.md Step-by-step production deployment guide: GPU driver setup, Docker Compose startup, Asterisk SIP trunk config, SSL setup
docs/api_reference.md Full API reference for WebSocket session protocol, REST endpoints, audio format specs, and error codes
docs/training_guide.md Guide for running LoRA fine-tuning: dataset preparation, training execution, adapter export, and model evaluation
.env.example Template for all required environment variables: DB URLs, Redis URL, GPU device IDs, model paths, Asterisk credentials
Features (12)
Full-Duplex Real-Time Voice Streaming P1
WebSocket-based bidirectional audio pipeline enabling simultaneous speech input and AI voice output with sub-120ms end-to-end latency.
  • WebSocket endpoint accepts PCM or Opus audio chunks at a 16 kHz or 24 kHz sample rate
  • AI-generated audio response begins streaming back within 120ms of last input voice activity
  • Full-duplex operation confirmed: AI can be interrupted mid-response by new user speech
  • Voice Activity Detection correctly segments speech turns without manual push-to-talk
  • Audio pipeline handles packet loss and jitter without session crash
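The chunk handling described above can be sketched as a small buffer that turns arbitrarily sized incoming PCM chunks into fixed-duration frames. This is only an illustration: `ChunkBuffer`, `frame_size_bytes`, the 20 ms frame length, and the 16-bit mono PCM layout are assumptions, not details taken from the spec.

```python
from dataclasses import dataclass, field

BYTES_PER_SAMPLE = 2  # assumption: 16-bit signed mono PCM


def frame_size_bytes(sample_rate_hz: int, frame_ms: int = 20) -> int:
    """Number of bytes in one fixed-duration frame of 16-bit mono PCM."""
    return sample_rate_hz * frame_ms // 1000 * BYTES_PER_SAMPLE


@dataclass
class ChunkBuffer:
    """Accumulates arbitrarily sized PCM chunks and emits fixed-size frames."""

    sample_rate_hz: int = 16000
    frame_ms: int = 20  # hypothetical frame length, not specified in the source
    _pending: bytearray = field(default_factory=bytearray)

    @property
    def pending_bytes(self) -> int:
        """Bytes received so far that do not yet fill a whole frame."""
        return len(self._pending)

    def push(self, chunk: bytes) -> list[bytes]:
        """Append a chunk and return every complete frame now available."""
        self._pending.extend(chunk)
        size = frame_size_bytes(self.sample_rate_hz, self.frame_ms)
        frames = []
        while len(self._pending) >= size:
            frames.append(bytes(self._pending[:size]))
            del self._pending[:size]
        return frames
```

Keeping the leftover bytes in the buffer means a late or short network packet only delays the next frame rather than corrupting frame boundaries.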
Concurrent Session Management (8–10 Sessions) P1
Session slot manager that supports 8–10 simultaneous independent voice AI sessions with isolated state and graceful capacity enforcement.
  • System accepts and maintains up to 10 concurrent WebSocket voice sessions without degradation
  • An 11th connection attempt is rejected with HTTP 503 and either a queue position or a rejection message
  • Each session maintains fully isolated KV-cache, persona context, and audio buffer
  • Session cleanup on disconnect frees GPU memory and Redis slot within 2 seconds
  • Load test confirms p95 latency remains under 120ms at 10 concurrent sessions
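The capacity rule in the criteria above can be sketched with a small asyncio slot manager. The names `SlotManager` and `CapacityError` are hypothetical, and a real implementation would presumably also track slots in Redis as the file structure suggests. This version keeps everything in process memory.

```python
import asyncio


class CapacityError(Exception):
    """Raised when every session slot is already in use."""


class SlotManager:
    """In-memory slot tracker; create it inside the running event loop."""

    def __init__(self, max_sessions: int = 10) -> None:
        self.max_sessions = max_sessions
        self._active = set()
        self._lock = asyncio.Lock()

    @property
    def active_count(self) -> int:
        return len(self._active)

    async def acquire(self, session_id: str) -> None:
        """Reserve a slot or raise CapacityError when the limit is reached."""
        async with self._lock:
            if len(self._active) >= self.max_sessions:
                raise CapacityError("all session slots are in use")
            self._active.add(session_id)

    async def release(self, session_id: str) -> None:
        """Free the slot; releasing an unknown session is a no-op."""
        async with self._lock:
            self._active.discard(session_id)


async def demo() -> tuple:
    manager = SlotManager(max_sessions=10)
    for i in range(10):
        await manager.acquire(f"session-{i}")
    try:
        await manager.acquire("session-10")  # the 11th attempt
        rejected = False
    except CapacityError:
        rejected = True
    await manager.release("session-0")
    await manager.acquire("session-10")  # succeeds once a slot is free
    return rejected, manager.active_count
```

A WebSocket handler could catch `CapacityError` and translate it into the 503 response named in the criteria.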
FP8 Inference Optimization P1
TensorRT-LLM FP8 quantization of the Moshi model to maximize GPU throughput and minimize per-token latency for real-time voice generation.
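  • FP8-quantized model loads successfully on the target NVIDIA GPU (H100; the A100 has no native FP8 Tensor Core support, so an A100 target would need a different precision such as INT8 or FP16)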
  • Benchmark shows ≥40% throughput improvement over FP16 baseline
  • Voice output quality degradation is imperceptible in blind A/B listening test
  • FP8 model supports dynamic batching across concurrent sessions
  • Calibration and export scripts run end-to-end without manual intervention
Dynamic Batching Engine P1
Inference batching layer that groups audio frames from multiple concurrent sessions to maximize GPU utilization while respecting per-session latency budgets.
  • Batching window is configurable (default 20ms) and auto-tunes based on session count
  • No single session waits more than one batch window beyond its natural latency
  • GPU utilization increases by ≥30% compared to per-session sequential inference
  • Batch formation logic handles variable audio chunk sizes across sessions
  • Metrics endpoint reports current batch sizes and queue depths in real time
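One way to read the batching window above is: wait for the first frame, then keep collecting frames from other sessions until the window closes or the batch is full. The sketch below assumes frames arrive on an `asyncio.Queue` as `(session_id, audio_bytes)` tuples. `collect_batch` and its parameters are illustrative names, and auto-tuning of the window is not shown.

```python
import asyncio


async def collect_batch(queue, window_s: float = 0.020, max_batch: int = 10) -> list:
    """Wait for the first frame, then gather more until the window closes."""
    loop = asyncio.get_running_loop()
    batch = [await queue.get()]          # block until at least one frame exists
    deadline = loop.time() + window_s    # window starts at the first frame
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break                        # window elapsed with no further frames
    return batch


async def demo() -> list:
    queue = asyncio.Queue()
    for session_id in ("a", "b", "c"):
        queue.put_nowait((session_id, b"\x00" * 640))
    batch = await collect_batch(queue, window_s=0.020, max_batch=10)
    return [session_id for session_id, _ in batch]
```

Because the deadline is fixed when the first frame arrives, no frame waits longer than one window before the batch is dispatched, which matches the per-session bound stated in the criteria.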
KV-Cache Per-Session Management P2
Persistent KV-cache allocation per session enabling coherent multi-turn conversation context without recomputation overhead.
  • Each session is allocated a dedicated KV-cache slot at connection time
  • Cache persists conversation context for the full session duration (up to 30 minutes)
  • Cache eviction policy prevents OOM under maximum concurrency load
  • Cache hit rate ≥85% for mid-conversation turns reported in metrics
  • Cache state is fully cleared and GPU memory reclaimed on session end
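The slot lifecycle above (allocate at connect, reclaim at session end, evict under pressure) can be sketched as a fixed pool of slot indices. The spec does not name an eviction policy, so the least-recently-used rule here is an assumption, as are the names `KVCacheSlots`, `touch`, and `release`. Real GPU memory handling is out of scope for this sketch.

```python
from collections import OrderedDict


class KVCacheSlots:
    """Maps session ids to slot indices in a fixed-size pool."""

    def __init__(self, num_slots: int = 10) -> None:
        self._free = list(range(num_slots))
        self._by_session = OrderedDict()  # oldest-used session first

    def allocate(self, session_id: str) -> int:
        """Return the session's slot, evicting the LRU session if the pool is full."""
        if session_id in self._by_session:
            return self.touch(session_id)
        if not self._free:
            # Assumed policy: evict the least recently used session.
            _, evicted_slot = self._by_session.popitem(last=False)
            self._free.append(evicted_slot)
        slot = self._free.pop()
        self._by_session[session_id] = slot
        return slot

    def touch(self, session_id: str) -> int:
        """Mark the session as recently used and return its slot."""
        self._by_session.move_to_end(session_id)
        return self._by_session[session_id]

    def release(self, session_id: str) -> None:
        """Return the session's slot to the free pool on session end."""
        slot = self._by_session.pop(session_id, None)
        if slot is not None:
            self._free.append(slot)
```

If the pool size equals the session limit, eviction should never trigger in normal operation. It only acts as a safety net against leaked slots.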
Multilingual Support (Russian + Uzbek) P2
LoRA adapters fine-tuned with QLoRA that enable natural voice AI interaction in Russian and Uzbek alongside English.
  • Model correctly identifies and responds in Russian when spoken to in Russian
  • Model correctly identifies and responds in Uzbek when spoken to in Uzbek
  • Word Error Rate (WER) for RU ≤15% on held-out test set
  • Word Error Rate (WER) for UZ ≤20% on held-out test set
  • Language switching mid-session is handled gracefully without session restart
  • LoRA adapters load in under 500ms at session initialization
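The WER thresholds above can be checked with the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The sketch below is a minimal version that splits on whitespace. Text normalization such as casing and punctuation handling is not specified in the source and is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# Thresholds taken from the acceptance criteria above.
WER_LIMITS = {"ru": 0.15, "uz": 0.20}


def meets_wer_target(language: str, wer: float) -> bool:
    return wer <= WER_LIMITS[language]
```

For a held-out test set, the usual practice is to sum edit distances and reference lengths across all utterances before dividing, rather than averaging per-utterance rates.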
Persona Configuration System P2
CRUD system for defining AI personas with distinct names, voice profiles, system prompts, language preferences, and associated LoRA adapters.
  • Admin can create, update, and delete personas via REST API and dashboard UI
  • Each persona stores: name, language, voice embedding path, LoRA adapter reference, system prompt
  • Persona is applied at session initialization and governs all AI responses in that session
  • Voice profile produces perceptibly distinct voice characteristics between different personas
  • Default persona is used when no persona_id is specified at session start
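The default-persona rule above can be sketched as a lookup with a fallback. The field names follow the persona model listed in the file structure. The `resolve_persona` function, the `"default"` key, and the in-memory dictionary are assumptions standing in for the real database query.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Persona:
    id: str
    name: str
    language: str
    voice_profile_path: str
    lora_adapter_path: Optional[str]  # None when no adapter is needed
    system_prompt: str


def resolve_persona(personas: dict, persona_id: Optional[str],
                    default_id: str = "default") -> Persona:
    """Pick the requested persona, or the default when none is requested."""
    if persona_id is None:
        return personas[default_id]
    try:
        return personas[persona_id]
    except KeyError:
        raise LookupError(f"unknown persona_id: {persona_id}") from None


# Hypothetical sample data for illustration only.
PERSONAS = {
    "default": Persona("default", "Assistant", "en",
                       "voices/default.bin", None, "You are a helpful agent."),
    "ru-support": Persona("ru-support", "Support", "ru",
                          "voices/ru.bin", "adapters/ru", "Answer in Russian."),
}
```

An unknown `persona_id` raises instead of silently falling back, so a misconfigured client is noticed at session start. Whether the real system should fall back instead is a design decision the spec leaves open.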
Asterisk PBX Integration P2
Full integration with Asterisk via AGI and ARI to route inbound telephone calls into the real-time voice AI session pipeline.
  • Inbound SIP/PSTN call to configured DID triggers AGI script and creates a voice AI session
  • RTP audio from Asterisk is correctly transcoded and streamed to the FastAPI WebSocket endpoint
  • AI response audio is played back to caller with no audible artifacts or sync issues
  • Call hangup triggers clean session teardown within 3 seconds
  • System handles 8 simultaneous inbound calls without audio quality degradation
  • ARI-based call control supports transfer, hold, and DTMF passthrough
QLoRA Fine-Tuning Pipeline P2
End-to-end training pipeline for fine-tuning the Moshi base model on RU/UZ voice datasets using QLoRA with configurable hyperparameters.
  • Training script runs on single A100 80GB GPU without OOM errors
  • Dataset builder produces correctly formatted and language-tagged training samples
  • Tokenizer extension adds RU/UZ vocabulary without corrupting base model tokenization
  • Training run produces checkpoint every N steps with automatic best-model selection
  • Exported LoRA adapter integrates with inference server without format conversion
  • Training guide documentation enables reproduction of results by a new engineer
Observability & Metrics Dashboard P3
Real-time monitoring dashboard and Prometheus metrics endpoint tracking session health, latency, GPU utilization, and concurrency.
  • Dashboard displays all active sessions with per-session latency, language, and persona in real time
  • Prometheus /metrics endpoint exposes: active_sessions, p50/p95/p99 latency, gpu_utilization, batch_size
  • Historical latency and concurrency charts available for last 24 hours
  • Alert threshold configurable for latency breaches above 120ms
  • Benchmark report page renders load test results with pass/fail against acceptance criteria
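To illustrate what the /metrics endpoint above returns, the sketch below renders gauges in the Prometheus text exposition format by hand. A production service would more likely use an existing client library. `render_metrics` and the help strings are illustrative, and only the metric names come from the criteria.

```python
def render_metrics(gauges: dict) -> str:
    """Render {name: (help_text, value)} as Prometheus text exposition lines."""
    lines = []
    for name, (help_text, value) in gauges.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


def latency_breached(p95_latency_ms: float, threshold_ms: float = 120.0) -> bool:
    """Alert condition from the criteria: latency above the configured threshold."""
    return p95_latency_ms > threshold_ms


example = render_metrics({
    "active_sessions": ("Currently open voice sessions", 3),
    "gpu_utilization": ("GPU utilization as a fraction", 0.5),
})
```

Each metric gets a `# HELP` line, a `# TYPE` line, and a sample line, which is the minimum a Prometheus scraper needs to ingest a gauge.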
Load Testing & Benchmark Reporting P3
Automated load testing suite that validates the system meets concurrency and latency acceptance criteria and generates a structured benchmark report.
  • Load test simulates 10 concurrent voice sessions with realistic audio input streams
  • Test runs for minimum 10 minutes to validate session stability
  • Report captures: max stable concurrency, p50/p95/p99 TTFB, GPU memory peak, error rate
  • System passes if: ≥8 stable sessions AND p95 latency ≤120ms AND error rate ≤0.1%
  • Report is auto-generated as JSON and HTML after each test run
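The pass rule above combines three conditions, which can be written directly as a predicate. The sketch uses the nearest-rank percentile method. The spec does not say which percentile definition the report uses, so that choice and the function names are assumptions.

```python
import math


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def benchmark_passes(stable_sessions: int, latencies_ms: list,
                     errors: int, requests: int) -> bool:
    """Pass if >=8 stable sessions AND p95 <= 120 ms AND error rate <= 0.1%."""
    p95 = percentile(latencies_ms, 95)
    error_rate = errors / requests
    return stable_sessions >= 8 and p95 <= 120.0 and error_rate <= 0.001


# Synthetic latencies of 1..100 ms, used only to exercise the functions.
sample_latencies = [float(ms) for ms in range(1, 101)]
```

The same `percentile` helper can produce the p50 and p99 figures the report lists alongside p95.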
Docker Compose Production Deployment P3
Fully containerized deployment stack with GPU passthrough, service health checks, restart policies, and environment-based configuration.
  • docker-compose up --build starts all services including inference server with GPU access
  • All services have health checks and restart: unless-stopped policies
  • Environment variables fully control deployment without code changes
  • Nginx correctly terminates SSL and proxies WebSocket connections
  • Deployment guide enables a new engineer to stand up the full system in under 2 hours
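The requirement above that environment variables fully control deployment can be sketched as a settings loader that fails fast on missing values. The file structure mentions Pydantic settings. This standard-library version is a stand-in, and the variable names (`DATABASE_URL`, `REDIS_URL`, `MAX_SESSIONS`, `LATENCY_BUDGET_MS`) are hypothetical rather than taken from `.env.example`.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    database_url: str
    redis_url: str
    max_sessions: int = 10
    latency_budget_ms: int = 120


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from environment variables, failing fast on missing ones."""
    required = ("DATABASE_URL", "REDIS_URL")  # hypothetical variable names
    missing = [key for key in required if key not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return Settings(
        database_url=env["DATABASE_URL"],
        redis_url=env["REDIS_URL"],
        max_sessions=int(env.get("MAX_SESSIONS", "10")),
        latency_budget_ms=int(env.get("LATENCY_BUDGET_MS", "120")),
    )
```

Passing the environment as a parameter keeps the loader testable without touching the real process environment.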
Build Log
scoping Starting AI-powered tech spec generation
scoping Tech spec generated successfully
start Build orchestration started for project 7
attempt Build attempt 1/3
generate Attempt 1 failed: AI generation failed: Unexpected end of JSON input [failed]
retry Retrying (2/3)...
attempt Build attempt 2/3
generate Attempt 2 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. [failed]
retry Retrying (3/3)...
attempt Build attempt 3/3
generate Attempt 3 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. [failed]
complete Build failed after 3 attempts [failed]
status Project status updated to Build Failed
Deliverables
Deliverables become available once project reaches Review status.