Status History
May 29, 04:37 PM Pending
May 31, 12:55 PM In Progress
Intake Form
Technical Specification
Frontend
React SPA (session monitor dashboard + persona config UI)
Backend
FastAPI (Python) — WebSocket + REST API for voice session orchestration
Database
PostgreSQL via Neon (session logs, persona configs, benchmark results)
Hosting
Docker Compose on bare-metal / GPU server (NVIDIA H100 or another FP8-capable GPU required for FP8; the A100 has no native FP8 Tensor Core support)
Summary

Aziza is a real-time full-duplex voice AI system built on NVIDIA PersonaPlex (Moshi), designed to handle 8–10 simultaneous voice conversations with sub-120ms end-to-end latency. It combines FP8 model quantization, dynamic cross-session batching, and KV-cache tuning to maximize GPU utilization, and uses LoRA fine-tuning to add Russian and Uzbek language personas alongside English. The system also integrates Asterisk PBX for routing real telephone calls, a live operator monitoring dashboard, and a fully containerized GPU deployment stack, so it is scoped as a complete deployable platform rather than a research prototype.

File Structure
docker-compose.yml Orchestrates all services: FastAPI, Moshi inference engine, Asterisk, Redis, PostgreSQL, and dashboard
Dockerfile.api FastAPI application container with Python 3.11, CUDA dependencies, and voice processing libraries
Dockerfile.inference NVIDIA PersonaPlex/Moshi inference server container with FP8 support and TensorRT-LLM
Dockerfile.asterisk Asterisk PBX container with AGI/ARI bridge configured for real-time audio streaming
api/main.py FastAPI entrypoint — registers routers, initializes WebSocket manager, connects to DB and Redis
api/routers/sessions.py REST + WebSocket endpoints for creating, managing, and terminating voice AI sessions
api/routers/personas.py CRUD endpoints for persona definitions including voice profile, language, and behavioral parameters
api/routers/benchmarks.py Endpoints to trigger load tests, retrieve benchmark reports, and export latency/concurrency metrics
api/routers/health.py Health check and readiness probe endpoints for Docker/orchestration liveness checks
api/services/session_manager.py Core session orchestration logic — manages up to 10 concurrent full-duplex WebSocket voice sessions with slot allocation
api/services/inference_client.py gRPC/HTTP client to communicate with the Moshi inference server, handles audio chunk streaming and response buffering
api/services/asterisk_bridge.py ARI (Asterisk REST Interface) bridge — connects inbound SIP calls to active AI sessions via audio relay
api/services/latency_tracker.py Real-time latency measurement service — tracks end-to-end audio round-trip and logs P50/P95/P99 metrics
api/services/persona_loader.py Loads persona configs from DB, applies LoRA adapter weights, and configures voice synthesis parameters per session
api/models/session.py SQLAlchemy ORM model for voice sessions including state, persona_id, latency stats, and timestamps
api/models/persona.py SQLAlchemy ORM model for persona definitions — name, language (en/ru/uz), voice embedding, system prompt
api/models/benchmark.py SQLAlchemy ORM model for storing benchmark run results including concurrency level, avg latency, and pass/fail status
api/websocket/audio_handler.py WebSocket handler for bidirectional raw audio streaming — receives mic input, forwards to inference, streams back synthesized audio
api/websocket/connection_manager.py Manages WebSocket connection pool, enforces max 10 concurrent session limit, handles graceful disconnects
inference/server.py Moshi/PersonaPlex inference server — loads FP8-quantized model, exposes gRPC streaming endpoint for audio chunk processing
inference/optimization/fp8_loader.py Loads and applies FP8 quantization to Moshi model weights using TensorRT-LLM or transformer-engine
inference/optimization/dynamic_batcher.py Dynamic batching logic — accumulates audio chunks across concurrent sessions and batches inference calls to maximize GPU utilization
inference/optimization/kv_cache_config.py KV-cache configuration and tuning — sets cache size, eviction policy, and per-session memory allocation for 8-10 concurrent contexts
training/finetune_lora.py LoRA/QLoRA fine-tuning script for adding Russian and Uzbek language support to the base Moshi model
training/dataset_prep.py Dataset preparation pipeline — processes RU/UZ audio datasets, generates transcripts, and formats for Moshi training
training/tokenizer_extension.py Extends base tokenizer with Russian and Uzbek vocabulary, handles Cyrillic and Latin Uzbek script variants
training/configs/lora_ru.yaml LoRA hyperparameter config for Russian language fine-tuning — rank, alpha, target modules, learning rate schedule
training/configs/lora_uz.yaml LoRA hyperparameter config for Uzbek language fine-tuning — rank, alpha, target modules, learning rate schedule
benchmarks/load_test.py Load testing script using asyncio — simulates 8-10 concurrent voice sessions, measures latency distribution and session stability
benchmarks/report_generator.py Generates structured benchmark reports in JSON and PDF formats with latency charts and pass/fail assessment against targets
asterisk/extensions.conf Asterisk dialplan configuration routing inbound SIP calls to the ARI application for AI session handoff
asterisk/ari_app.py Asterisk ARI application — answers calls, establishes audio bridge to FastAPI session manager via WebSocket relay
asterisk/sip.conf SIP trunk and endpoint configuration for Asterisk PBX integration
dashboard/src/App.jsx Root React component — renders session monitor, persona manager, and benchmark dashboard views
dashboard/src/components/SessionMonitor.jsx Real-time display of active voice sessions — shows session ID, persona, language, current latency, and duration
dashboard/src/components/PersonaEditor.jsx Form UI for creating and editing AI personas — name, language selector (EN/RU/UZ), voice profile, system prompt
dashboard/src/components/BenchmarkPanel.jsx Benchmark control panel — trigger load tests, display live progress, render latency charts and final report
dashboard/src/components/LatencyGauge.jsx Real-time latency gauge component showing current P95 latency vs 120ms target threshold with color-coded status
dashboard/src/hooks/useSessionWebSocket.js Custom React hook for subscribing to session state updates via WebSocket for live dashboard refresh
dashboard/src/api/client.js Axios-based API client with base URL config and request interceptors for dashboard-to-backend communication
migrations/001_initial_schema.sql Initial PostgreSQL schema — creates sessions, personas, and benchmark_results tables with indexes
config/settings.py Pydantic settings model — loads all environment variables for DB URL, Redis, inference server address, Asterisk ARI credentials
config/logging.yaml Structured logging configuration with JSON output, log levels per module, and session_id correlation field
scripts/start_inference.sh Shell script to launch inference server with correct CUDA device flags, FP8 mode, and memory allocation parameters
scripts/run_benchmark.sh Convenience script to execute full benchmark suite and output report to /reports directory
docs/architecture.md System architecture documentation — component diagram, data flow, latency budget breakdown, and scaling notes
docs/deployment.md Step-by-step production deployment guide — GPU requirements, Docker Compose setup, Asterisk config, and environment variables
docs/training_guide.md Guide for running LoRA fine-tuning — dataset requirements, training commands, adapter merging, and evaluation procedure
tests/test_session_manager.py Unit and integration tests for session manager — concurrency limits, slot allocation, and graceful teardown
tests/test_latency.py Latency measurement tests — validates end-to-end audio round-trip stays under 120ms threshold under simulated load
tests/test_persona_api.py API tests for persona CRUD endpoints — create, update, delete, and language validation
.env.example Example environment file documenting all required variables — DB URL, Redis URL, inference host, Asterisk ARI URL, GPU device IDs
Features (10)
Full-Duplex Real-Time Voice Streaming P1
Bidirectional WebSocket audio pipeline enabling simultaneous speech input and AI voice output with sub-120ms end-to-end latency.
  • WebSocket endpoint accepts raw PCM audio chunks at 16kHz/16-bit from client
  • Synthesized AI audio response begins streaming back within 120ms of final input chunk
  • P95 latency measured across 100 consecutive exchanges remains below 120ms
  • Full-duplex operation confirmed — AI can be interrupted mid-utterance by new user speech
  • No audio artifacts or buffer underruns detected during 10-minute continuous session
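To make the audio format in the criteria above concrete, the following sketch computes the size of a raw PCM chunk at 16kHz/16-bit and splits a buffer into chunks. The 20 ms frame duration and the mono channel count are assumptions for illustration; the spec does not state either, and the helper names are hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # from the spec: 16kHz input
BYTES_PER_SAMPLE = 2     # from the spec: 16-bit PCM
CHANNELS = 1             # assumption: mono audio


def chunk_bytes(frame_ms: int) -> int:
    """Return the number of bytes in one PCM chunk lasting frame_ms milliseconds."""
    samples = SAMPLE_RATE_HZ * frame_ms // 1000
    return samples * BYTES_PER_SAMPLE * CHANNELS


def split_pcm(pcm: bytes, frame_ms: int = 20) -> list[bytes]:
    """Split a raw PCM buffer into fixed-duration chunks (the last may be shorter)."""
    size = chunk_bytes(frame_ms)
    return [pcm[i:i + size] for i in range(0, len(pcm), size)]
```

Under these assumptions a 20 ms chunk is 640 bytes, so one second of audio is sent as 50 WebSocket messages. A shorter frame lowers buffering delay at the cost of more messages per second.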
Concurrent Session Management (8–10 Sessions) P1
Session orchestration layer that maintains up to 10 simultaneous independent voice AI sessions with isolated context and persona state.
  • System accepts and maintains exactly 10 concurrent WebSocket voice sessions without degradation
  • 11th connection attempt returns HTTP 503 with clear capacity error message
  • Each session maintains fully isolated KV-cache context — no cross-session bleed
  • All 10 sessions simultaneously sustain <120ms latency under load test
  • Session cleanup releases GPU memory within 2 seconds of disconnect
  • Load test benchmark report confirms ≥8 stable sessions as deliverable threshold
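The capacity rule above (10 slots, the 11th attempt rejected, slots freed on disconnect) can be sketched as a small allocator. This is a minimal illustration, not the contents of api/services/session_manager.py; the class and exception names are hypothetical, and it ignores thread or task safety.

```python
class CapacityError(Exception):
    """Raised when every session slot is already in use."""


class SlotAllocator:
    """Hands out a fixed number of session slots and reclaims them on release."""

    def __init__(self, max_sessions: int = 10) -> None:
        # Stored in reverse so that pop() hands out slot 0 first on a fresh allocator.
        self._free = list(range(max_sessions - 1, -1, -1))
        self._by_session: dict[str, int] = {}

    def acquire(self, session_id: str) -> int:
        """Return the slot for session_id, allocating one if needed."""
        if session_id in self._by_session:
            return self._by_session[session_id]
        if not self._free:
            raise CapacityError("all session slots are in use")
        slot = self._free.pop()
        self._by_session[session_id] = slot
        return slot

    def release(self, session_id: str) -> None:
        """Free the slot held by session_id; unknown ids are ignored."""
        slot = self._by_session.pop(session_id, None)
        if slot is not None:
            self._free.append(slot)

    @property
    def active(self) -> int:
        return len(self._by_session)
```

In this sketch the caller would translate `CapacityError` into the HTTP 503 response the criteria describe, and call `release` from the disconnect path so the slot becomes available again.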
FP8 Model Optimization & Dynamic Batching P1
GPU inference optimization using FP8 quantization and dynamic cross-session batching to achieve required concurrency and latency targets on a single GPU.
  • Moshi model loads successfully in FP8 precision using TensorRT-LLM or transformer-engine
  • FP8 model VRAM footprint is ≤50% of BF16 baseline measured via nvidia-smi
  • Dynamic batcher aggregates audio chunks from multiple sessions into single GPU kernel calls
  • Benchmark shows ≥40% throughput improvement vs non-batched baseline
  • Model output quality (MOS score or human eval) does not degrade more than 5% vs BF16 baseline
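One common way to implement the cross-session batching described above is to flush a batch when it is full or when the oldest queued chunk has waited too long. The sketch below shows that policy only; it is not the contents of inference/optimization/dynamic_batcher.py. The batch size, the 5 ms wait budget, and all names are assumptions, and the clock is passed in explicitly so the behavior is deterministic.

```python
from dataclasses import dataclass, field
from typing import Optional

Batch = list[tuple[str, bytes]]  # (session_id, audio_chunk) pairs


@dataclass
class DynamicBatcher:
    max_batch: int = 10       # assumption: at most one chunk per session per step
    max_wait_ms: float = 5.0  # assumption: illustrative wait budget
    _pending: Batch = field(default_factory=list)
    _oldest_ms: Optional[float] = None

    def add(self, session_id: str, chunk: bytes, now_ms: float) -> Optional[Batch]:
        """Queue a chunk and return a batch if one is ready, else None."""
        if not self._pending:
            self._oldest_ms = now_ms  # start the wait timer on the first chunk
        self._pending.append((session_id, chunk))
        return self.poll(now_ms)

    def poll(self, now_ms: float) -> Optional[Batch]:
        """Return the pending batch if it is full or has waited long enough."""
        if not self._pending:
            return None
        full = len(self._pending) >= self.max_batch
        expired = now_ms - self._oldest_ms >= self.max_wait_ms
        if full or expired:
            batch, self._pending = self._pending, []
            self._oldest_ms = None
            return batch
        return None
```

The wait budget trades latency for throughput: every millisecond spent waiting for more sessions to contribute a chunk comes out of the 120ms end-to-end target.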
KV-Cache Tuning for Multi-Session Contexts P2
Per-session KV-cache configuration ensuring each of the 10 concurrent sessions retains sufficient conversational context without OOM errors.
  • KV-cache size per session is configurable via environment variable
  • 10 concurrent sessions each maintain minimum 2048-token context window
  • No CUDA OOM errors observed during 30-minute 10-session load test
  • Cache eviction policy (sliding window or LRU) is documented and configurable
  • Memory allocation report included in benchmark output
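Whether 10 sessions with a 2048-token context fit in VRAM can be estimated before any load test. The sketch below uses the standard size formula for a transformer KV-cache (keys plus values, per layer, per KV head). The function name is hypothetical, and the model dimensions in the usage note are placeholder values, not confirmed dimensions of the Moshi model.

```python
def kv_cache_bytes(
    tokens: int,
    layers: int,
    kv_heads: int,
    head_dim: int,
    bytes_per_elem: int,
    sessions: int = 1,
) -> int:
    """Estimate KV-cache size in bytes for a number of concurrent sessions."""
    # Factor 2 accounts for storing both keys and values at every layer.
    per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem
    return per_token * tokens * sessions


def to_gib(num_bytes: int) -> float:
    """Convert bytes to GiB."""
    return num_bytes / 2**30
```

With placeholder dimensions of 32 layers, 32 KV heads, head size 128, and a 2-byte cache element, `kv_cache_bytes(2048, 32, 32, 128, 2)` is 1 GiB per session, so 10 sessions would need about 10 GiB for cache alone. Storing the cache in a 1-byte format would halve that figure.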
Multilingual Persona Support (EN / RU / UZ) P2
LoRA/QLoRA fine-tuned model adapters enabling natural Russian and Uzbek language voice interaction alongside the base English capability.
  • Fine-tuned LoRA adapters for Russian and Uzbek are produced and loadable at runtime
  • Model correctly responds in Russian when session language is set to 'ru'
  • Model correctly responds in Uzbek (Latin script) when session language is set to 'uz'
  • Word Error Rate (WER) on RU/UZ held-out test set is ≤20%
  • Language switching between sessions does not require inference server restart
  • Training scripts are reproducible with documented dataset sources and preprocessing steps
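The WER criterion above can be checked with a word-level edit distance. This is a minimal sketch that assumes reference and hypothesis transcripts are already available as whitespace-separated text; it applies no normalization (case, punctuation, Cyrillic versus Latin Uzbek script), which a real evaluation would need to define. The function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For example, a five-word reference with one dropped word gives a WER of 0.2, which sits exactly on the ≤20% threshold.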
Persona Configuration & Voice Profile Management P2
API and UI for defining named AI personas with distinct voice embeddings, system prompts, and language assignments that persist across sessions.
  • REST API supports full CRUD for persona objects with name, language, system_prompt, and voice_embedding fields
  • Persona is applied to a session at creation time and cannot be changed mid-session
  • Two distinct personas produce audibly different voice characteristics in A/B test
  • Persona configs persist in PostgreSQL and survive service restart
  • Dashboard UI allows creating, editing, and deleting personas with form validation
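The persona fields and the language restriction above can be illustrated with a small validated record. This sketch uses a standard-library dataclass rather than the SQLAlchemy model in api/models/persona.py; the class name, the validation rules beyond the three language codes, and the tuple representation of the voice embedding are assumptions.

```python
from dataclasses import dataclass

SUPPORTED_LANGUAGES = {"en", "ru", "uz"}  # language codes named in the spec


@dataclass(frozen=True)
class Persona:
    """Immutable persona snapshot, as it might be attached to a session at creation."""

    name: str
    language: str
    system_prompt: str
    voice_embedding: tuple[float, ...]

    def __post_init__(self) -> None:
        if not self.name.strip():
            raise ValueError("name must not be empty")
        if self.language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language: {self.language!r}")
        if not self.voice_embedding:
            raise ValueError("voice_embedding must not be empty")
```

Freezing the dataclass mirrors the rule that a persona cannot be changed mid-session: any attempt to reassign a field on the attached snapshot raises an error, while edits through the CRUD API would only affect sessions created afterwards.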
Asterisk PBX Integration P2
SIP telephony integration via Asterisk ARI that routes inbound phone calls to available AI voice sessions transparently.
  • Inbound SIP call to configured DID triggers ARI application and claims an available AI session slot
  • Audio from phone call is relayed bidirectionally to/from the AI session with <150ms added latency
  • Call hangup triggers clean session termination and slot release
  • Busy signal or rejection message played when all 10 session slots are occupied
  • Asterisk integration tested with at least 3 simultaneous SIP calls in staging environment
Real-Time Monitoring Dashboard P3
React-based operator dashboard providing live visibility into active sessions, per-session latency, persona assignments, and system health.
  • Dashboard displays all active sessions with session ID, persona name, language, duration, and current latency
  • Latency gauge updates in real-time (≤1s refresh) and turns red when P95 exceeds 120ms
  • Session list updates within 2 seconds of a new session connecting or disconnecting
  • Dashboard is accessible at port 3000 and requires no authentication for MVP
  • Works correctly in Chrome and Firefox on desktop
Benchmark & Load Testing Suite P3
Automated load testing tooling that simulates maximum concurrent sessions and generates a structured benchmark report validating acceptance criteria.
  • Load test script simulates 8 and 10 concurrent voice sessions using pre-recorded audio fixtures
  • Report captures P50, P95, P99 latency, session success rate, and GPU memory utilization
  • Report is output as both JSON and human-readable PDF/HTML
  • Test run is executable via single shell command: ./scripts/run_benchmark.sh
  • Report clearly indicates PASS or FAIL against the <120ms P95 and ≥8 sessions targets
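The PASS/FAIL assessment above reduces to computing latency percentiles and comparing them with the two targets. The sketch below uses the nearest-rank percentile definition, which is one of several valid conventions; the spec does not say which one the report uses. The function names and report keys are hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def assess(
    latencies_ms: list[float],
    stable_sessions: int,
    p95_target_ms: float = 120.0,  # from the spec: P95 below 120ms
    min_sessions: int = 8,         # from the spec: at least 8 stable sessions
) -> dict:
    """Build a summary with P50/P95/P99 and a PASS or FAIL verdict."""
    report = {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "stable_sessions": stable_sessions,
    }
    passed = report["p95_ms"] < p95_target_ms and stable_sessions >= min_sessions
    report["result"] = "PASS" if passed else "FAIL"
    return report
```

A dictionary like this serializes directly to the JSON report; session success rate and GPU memory utilization would be added as further keys.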
Dockerized Production Deployment P3
Fully containerized multi-service deployment using Docker Compose with GPU passthrough, health checks, and documented runbook for production launch.
  • docker-compose up brings all services (API, inference, Asterisk, Redis, PostgreSQL, dashboard) to healthy state
  • NVIDIA GPU is correctly passed through to inference container via the NVIDIA Container Toolkit (the successor to the deprecated nvidia-docker wrapper)
  • All services have Docker health checks with appropriate intervals and failure thresholds
  • System recovers automatically if inference server crashes (restart: unless-stopped policy)
  • Deployment documentation covers prerequisites, env var setup, and first-run validation steps
  • Cold start from docker-compose up to first successful voice session completes within 3 minutes
Build Log
scoping Starting AI-powered tech spec generation
scoping Tech spec generated successfully
start Build orchestration started for project 8
attempt Build attempt 1/3
generate Attempt 1 failed: AI generation failed: Unterminated string in JSON at position 49759 failed
retry Retrying (2/3)...
attempt Build attempt 2/3
generate Attempt 2 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed
retry Retrying (3/3)...
attempt Build attempt 3/3
generate Attempt 3 failed: AI generation failed: 429 Daily token limit reached (100,000 tokens). Resets at midnight UTC. failed
complete Build failed after 3 attempts failed
status Project status updated to Build Failed
Deliverables
Deliverables become available once project reaches Review status.