docker-compose.yml
Orchestrates all services: FastAPI, Moshi inference engine, Asterisk, Redis, PostgreSQL, and dashboard
Dockerfile.api
FastAPI application container with Python 3.11, CUDA dependencies, and voice processing libraries
Dockerfile.inference
NVIDIA PersonaPlex/Moshi inference server container with FP8 support and TensorRT-LLM
Dockerfile.asterisk
Asterisk PBX container with AGI/ARI bridge configured for real-time audio streaming
api/main.py
FastAPI entrypoint — registers routers, initializes WebSocket manager, connects to DB and Redis
api/routers/sessions.py
REST + WebSocket endpoints for creating, managing, and terminating voice AI sessions
api/routers/personas.py
CRUD endpoints for persona definitions including voice profile, language, and behavioral parameters
api/routers/benchmarks.py
Endpoints to trigger load tests, retrieve benchmark reports, and export latency/concurrency metrics
api/routers/health.py
Liveness and readiness probe endpoints for Docker health checks and orchestrator probes
api/services/session_manager.py
Core session orchestration logic — manages up to 10 concurrent full-duplex WebSocket voice sessions with slot allocation
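The slot allocation described above can be sketched as follows. This is a minimal illustration, not the repository's code: the `SlotAllocator` class and its method names are hypothetical, and only the limit of 10 comes from the manifest.

```python
from __future__ import annotations

import heapq

MAX_SESSIONS = 10  # concurrency limit stated in the manifest


class SlotAllocator:
    """Hypothetical helper that hands out slot indices 0..capacity-1."""

    def __init__(self, capacity: int = MAX_SESSIONS) -> None:
        self._free = list(range(capacity))  # already a valid min-heap
        self._owners: dict[str, int] = {}

    @property
    def active(self) -> int:
        """Number of sessions currently holding a slot."""
        return len(self._owners)

    def acquire(self, session_id: str) -> int | None:
        """Return the lowest free slot, or None when every slot is taken."""
        if session_id in self._owners:
            return self._owners[session_id]  # idempotent for reconnects
        if not self._free:
            return None
        slot = heapq.heappop(self._free)
        self._owners[session_id] = slot
        return slot

    def release(self, session_id: str) -> None:
        """Free the slot held by session_id; unknown ids are ignored."""
        slot = self._owners.pop(session_id, None)
        if slot is not None:
            heapq.heappush(self._free, slot)
```

The methods are synchronous on purpose: inside a single asyncio event loop there is no await point between the capacity check and the assignment, so no lock is needed. A caller would typically reject the WebSocket when `acquire` returns `None`.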
api/services/inference_client.py
gRPC/HTTP client for the Moshi inference server; handles audio chunk streaming and response buffering
api/services/asterisk_bridge.py
ARI (Asterisk REST Interface) bridge — connects inbound SIP calls to active AI sessions via audio relay
api/services/latency_tracker.py
Real-time latency measurement service — tracks end-to-end audio round-trip and logs P50/P95/P99 metrics
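One way to compute the P50/P95/P99 figures is the nearest-rank method, sketched below. The `LatencyTracker` class and `percentile` function are hypothetical names, and the real service may use a different percentile definition or a streaming estimator.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples recorded")
    ordered = sorted(samples)
    # p * n / 100 is computed on exact products, which avoids errors such as 0.07 * 100
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


class LatencyTracker:
    """Hypothetical in-memory tracker for round-trip latencies in milliseconds."""

    def __init__(self):
        self._samples_ms = []

    def record(self, latency_ms):
        self._samples_ms.append(float(latency_ms))

    def summary(self):
        """Return the three percentiles named in the manifest."""
        return {f"p{p}": percentile(self._samples_ms, p) for p in (50, 95, 99)}
```

Sorting on every call is acceptable for a few thousand samples; a long-running service would likely keep a bounded window or a sketch structure instead.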
api/services/persona_loader.py
Loads persona configs from DB, applies LoRA adapter weights, and configures voice synthesis parameters per session
api/models/session.py
SQLAlchemy ORM model for voice sessions including state, persona_id, latency stats, and timestamps
api/models/persona.py
SQLAlchemy ORM model for persona definitions — name, language (en/ru/uz), voice embedding, system prompt
api/models/benchmark.py
SQLAlchemy ORM model for storing benchmark run results including concurrency level, avg latency, and pass/fail status
api/websocket/audio_handler.py
WebSocket handler for bidirectional raw audio streaming: receives mic input, forwards it to inference, and streams back synthesized audio
api/websocket/connection_manager.py
Manages the WebSocket connection pool, enforces the limit of 10 concurrent sessions, and handles graceful disconnects
inference/server.py
Moshi/PersonaPlex inference server — loads FP8-quantized model, exposes gRPC streaming endpoint for audio chunk processing
inference/optimization/fp8_loader.py
Loads Moshi model weights and applies FP8 quantization using TensorRT-LLM or NVIDIA Transformer Engine
inference/optimization/dynamic_batcher.py
Dynamic batching logic — accumulates audio chunks across concurrent sessions and batches inference calls to maximize GPU utilization
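A common shape for this kind of batcher is "take one item, then wait briefly for more". The sketch below assumes chunks arrive on an `asyncio.Queue`; the function name `collect_batch` and the `max_batch` and `max_wait_s` defaults are illustrative, not values from the repository.

```python
import asyncio


async def collect_batch(queue, max_batch=10, max_wait_s=0.005):
    """Block for the first chunk, then gather more until the batch is full
    or max_wait_s has elapsed since the first chunk arrived."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # deadline hit: run inference on what has been collected
    return batch
```

The wait window trades latency for GPU utilization: every millisecond spent waiting for a fuller batch is added to each session's round trip, so it has to fit inside the latency budget.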
inference/optimization/kv_cache_config.py
KV-cache configuration and tuning — sets cache size, eviction policy, and per-session memory allocation for 8-10 concurrent contexts
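Per-session memory allocation can be estimated with the usual KV-cache arithmetic for a standard attention stack: one key and one value vector per layer, per KV head, per token. The sketch below uses placeholder dimensions chosen only for illustration; they are not Moshi's actual layer count, head count, or context length.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bytes_per_elem):
    """Bytes of KV cache per token: keys plus values across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def kv_budget_bytes(sessions, context_tokens, **dims):
    """Total KV-cache bytes for a number of concurrent sessions."""
    return sessions * context_tokens * kv_bytes_per_token(**dims)


# Placeholder dimensions (assumptions, not taken from the model)
dims = dict(layers=32, kv_heads=32, head_dim=128, bytes_per_elem=2)
total = kv_budget_bytes(sessions=10, context_tokens=4096, **dims)
print(f"{total / 2**30:.1f} GiB for 10 sessions")
```

Substituting the real model dimensions and cache precision gives the figure the cache size setting has to accommodate alongside the model weights.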
training/finetune_lora.py
LoRA/QLoRA fine-tuning script for adding Russian and Uzbek language support to the base Moshi model
training/dataset_prep.py
Dataset preparation pipeline — processes RU/UZ audio datasets, generates transcripts, and formats for Moshi training
training/tokenizer_extension.py
Extends base tokenizer with Russian and Uzbek vocabulary, handles Cyrillic and Latin Uzbek script variants
training/configs/lora_ru.yaml
LoRA hyperparameter config for Russian language fine-tuning — rank, alpha, target modules, learning rate schedule
training/configs/lora_uz.yaml
LoRA hyperparameter config for Uzbek language fine-tuning — rank, alpha, target modules, learning rate schedule
benchmarks/load_test.py
Load testing script using asyncio — simulates 8-10 concurrent voice sessions, measures latency distribution and session stability
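The concurrency pattern such a script relies on can be sketched with `asyncio.gather`. Here `fake_round_trip` is a stand-in for a real audio round trip through the WebSocket endpoint, and all function names and defaults are hypothetical.

```python
import asyncio
import time


async def fake_round_trip():
    """Stand-in for sending an audio chunk and awaiting the synthesized reply."""
    await asyncio.sleep(0.001)


async def run_session(turns):
    """Run one simulated session and return its per-turn latencies in ms."""
    latencies_ms = []
    for _ in range(turns):
        start = time.perf_counter()
        await fake_round_trip()
        latencies_ms.append((time.perf_counter() - start) * 1000)
    return latencies_ms


async def run_load_test(concurrency=10, turns=5):
    """Run `concurrency` sessions at once; map session index to latencies."""
    results = await asyncio.gather(*(run_session(turns) for _ in range(concurrency)))
    return dict(enumerate(results))


if __name__ == "__main__":
    report = asyncio.run(run_load_test(concurrency=8, turns=3))
    print({sid: round(max(vals), 2) for sid, vals in report.items()})
```

Because `gather` preserves argument order, the session index in the returned dict matches the order in which sessions were started.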
benchmarks/report_generator.py
Generates structured benchmark reports in JSON and PDF formats with latency charts and pass/fail assessment against targets
asterisk/extensions.conf
Asterisk dialplan configuration routing inbound SIP calls to the ARI application for AI session handoff
asterisk/ari_app.py
Asterisk ARI application: answers calls and establishes an audio bridge to the FastAPI session manager via WebSocket relay
asterisk/sip.conf
SIP trunk and endpoint configuration for Asterisk PBX integration
dashboard/src/App.jsx
Root React component — renders session monitor, persona manager, and benchmark dashboard views
dashboard/src/components/SessionMonitor.jsx
Real-time display of active voice sessions — shows session ID, persona, language, current latency, and duration
dashboard/src/components/PersonaEditor.jsx
Form UI for creating and editing AI personas — name, language selector (EN/RU/UZ), voice profile, system prompt
dashboard/src/components/BenchmarkPanel.jsx
Benchmark control panel — trigger load tests, display live progress, render latency charts and final report
dashboard/src/components/LatencyGauge.jsx
Real-time latency gauge component showing current P95 latency against the 120 ms target threshold, with color-coded status
dashboard/src/hooks/useSessionWebSocket.js
Custom React hook for subscribing to session state updates via WebSocket for live dashboard refresh
dashboard/src/api/client.js
Axios-based API client with base URL config and request interceptors for dashboard-to-backend communication
migrations/001_initial_schema.sql
Initial PostgreSQL schema — creates sessions, personas, and benchmark_results tables with indexes
config/settings.py
Pydantic settings model: loads environment variables for the DB URL, Redis URL, inference server address, and Asterisk ARI credentials
config/logging.yaml
Structured logging configuration with JSON output, log levels per module, and session_id correlation field
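A YAML logging config of this kind usually points at a formatter class. The sketch below shows what such a formatter might look like using only the standard library; the `JsonFormatter` name and the exact set of output fields are assumptions.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter that emits one JSON object per log record."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
            # Populated via logger.info(..., extra={"session_id": ...})
            "session_id": getattr(record, "session_id", None),
        }
        return json.dumps(payload)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    log = logging.getLogger("api.session_manager")
    log.addHandler(handler)
    log.setLevel(logging.INFO)
    log.info("session started", extra={"session_id": "demo-1"})
```

Passing `session_id` through `extra` keeps call sites simple; records logged without it still serialize, with the field set to `null`.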
scripts/start_inference.sh
Shell script to launch inference server with correct CUDA device flags, FP8 mode, and memory allocation parameters
scripts/run_benchmark.sh
Convenience script that runs the full benchmark suite and writes the report to the /reports directory
docs/architecture.md
System architecture documentation — component diagram, data flow, latency budget breakdown, and scaling notes
docs/deployment.md
Step-by-step production deployment guide — GPU requirements, Docker Compose setup, Asterisk config, and environment variables
docs/training_guide.md
Guide for running LoRA fine-tuning — dataset requirements, training commands, adapter merging, and evaluation procedure
tests/test_session_manager.py
Unit and integration tests for session manager — concurrency limits, slot allocation, and graceful teardown
tests/test_latency.py
Latency measurement tests: validate that end-to-end audio round-trip latency stays under the 120 ms threshold under simulated load
tests/test_persona_api.py
API tests for persona CRUD endpoints — create, update, delete, and language validation
.env.example
Example environment file documenting all required variables — DB URL, Redis URL, inference host, Asterisk ARI URL, GPU device IDs