docker-compose.yml
Orchestrates all services: FastAPI, Moshi inference engine, Redis, PostgreSQL, Nginx, Asterisk
Dockerfile.api
FastAPI application container with Python 3.11, CUDA dependencies, and model client libraries
Dockerfile.inference
NVIDIA PersonaPlex/Moshi inference server with FP8 TensorRT-LLM optimization and GPU passthrough
Dockerfile.asterisk
Asterisk PBX container with AGI/ARI bridge configuration for PSTN/SIP integration
nginx/nginx.conf
Reverse proxy config with WebSocket upgrade support, load balancing, and SSL termination
server/main.py
FastAPI app entrypoint: registers routers, initializes session manager, connects to Redis and PostgreSQL
server/config.py
Pydantic settings: model paths, GPU config, concurrency limits, latency thresholds, language settings
server/routers/sessions.py
WebSocket endpoint for full-duplex voice streaming; manages session lifecycle, audio chunking, and response piping
server/routers/personas.py
REST CRUD endpoints for persona definitions (name, voice profile, language, LoRA adapter reference)
server/routers/admin.py
Admin endpoints for system health, active session count, GPU utilization, and benchmark trigger
server/routers/asterisk.py
ARI webhook receiver and AGI handler for bridging Asterisk calls into the voice AI session pipeline
server/services/session_manager.py
Manages up to 10 concurrent sessions with slot allocation, timeout handling, and graceful teardown
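
A minimal sketch of the slot allocation and timeout handling described above. The class name `SlotManager`, its methods, and the 30-second idle timeout are hypothetical and chosen for illustration only; the real module may be structured differently (for example, backed by Redis rather than an in-process dict).

```python
import time
from typing import Dict, List, Optional


class SlotManager:
    """Hypothetical in-memory slot allocator (illustrative, not the repo's API)."""

    def __init__(self, max_slots: int = 10, idle_timeout_s: float = 30.0) -> None:
        self.max_slots = max_slots            # the listing states a 10-session ceiling
        self.idle_timeout_s = idle_timeout_s  # assumed value, not from the listing
        self._last_seen: Dict[str, float] = {}  # session_id -> last activity time

    def acquire(self, session_id: str, now: Optional[float] = None) -> bool:
        """Claim a slot; returns False when all slots are in use."""
        now = time.monotonic() if now is None else now
        if session_id not in self._last_seen and len(self._last_seen) >= self.max_slots:
            return False
        self._last_seen[session_id] = now
        return True

    def release(self, session_id: str) -> None:
        """Free a slot on teardown; unknown ids are ignored."""
        self._last_seen.pop(session_id, None)

    def reap_idle(self, now: Optional[float] = None) -> List[str]:
        """Drop sessions idle longer than the timeout and return their ids."""
        now = time.monotonic() if now is None else now
        expired = [sid for sid, seen in self._last_seen.items()
                   if now - seen > self.idle_timeout_s]
        for sid in expired:
            del self._last_seen[sid]
        return expired
```

Passing `now` explicitly keeps the logic deterministic in tests; production code would normally rely on the `time.monotonic()` default.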
server/services/inference_client.py
gRPC (HTTP/2) client to the Moshi inference server; handles streaming audio in/out and dynamic batching requests
server/services/audio_pipeline.py
Audio preprocessing: resampling, voice activity detection (VAD), codec handling (Opus/PCM), and chunk buffering
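
The chunk-buffering step could look roughly like the sketch below, which re-slices arbitrarily sized incoming byte packets into fixed-size chunks. `ChunkBuffer` and its `chunk_bytes` parameter are hypothetical names; the actual chunk size depends on the sample rate and frame duration the inference server expects, which this listing does not specify.

```python
from typing import List


class ChunkBuffer:
    """Hypothetical fixed-size chunker for raw PCM bytes (illustrative only)."""

    def __init__(self, chunk_bytes: int) -> None:
        if chunk_bytes <= 0:
            raise ValueError("chunk_bytes must be positive")
        self.chunk_bytes = chunk_bytes
        self._buf = bytearray()  # holds leftover bytes between feeds

    def feed(self, data: bytes) -> List[bytes]:
        """Append incoming bytes and return every complete chunk now available."""
        self._buf.extend(data)
        chunks: List[bytes] = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[:self.chunk_bytes]))
            del self._buf[:self.chunk_bytes]
        return chunks
```

Leftover bytes stay in the buffer until the next `feed` call completes a chunk, so packet boundaries from the network never leak into the chunk boundaries seen downstream.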
server/services/persona_service.py
Loads persona config and maps to correct LoRA adapter, voice embedding, and language tokenizer at session init
server/services/metrics_service.py
Collects per-session latency, TTFB, concurrency stats; writes to PostgreSQL and exposes Prometheus metrics
server/models/session.py
SQLAlchemy ORM model for session records: id, persona_id, language, start_time, end_time, avg_latency_ms
server/models/persona.py
SQLAlchemy ORM model for persona: id, name, language, voice_profile_path, lora_adapter_path, system_prompt
server/db/database.py
Async SQLAlchemy engine setup, session factory, and Alembic migration integration
server/db/redis_client.py
Redis async client wrapper for session slot tracking, KV-cache state keys, and pub/sub coordination
inference/server.py
Moshi inference server entrypoint: loads FP8-quantized model, initializes TensorRT engine, starts gRPC streaming service
inference/batching.py
Dynamic batching logic: groups incoming audio frames across sessions to maximize GPU throughput within latency budget
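
One common way to batch within a latency budget is to flush when either the batch is full or the oldest pending frame has waited too long. The sketch below illustrates that policy under assumptions: `Frame`, `FrameBatcher`, `max_batch`, and `max_wait_s` are hypothetical names and defaults, not taken from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Frame:
    session_id: str
    payload: bytes
    arrived_at: float  # seconds, from a monotonic clock


class FrameBatcher:
    """Hypothetical size-or-deadline batcher (illustrative only)."""

    def __init__(self, max_batch: int = 8, max_wait_s: float = 0.02) -> None:
        self.max_batch = max_batch    # assumed cap on frames per GPU step
        self.max_wait_s = max_wait_s  # assumed share of the latency budget
        self._pending: List[Frame] = []

    def add(self, frame: Frame) -> None:
        self._pending.append(frame)

    def pop_ready(self, now: float) -> Optional[List[Frame]]:
        """Return a batch if it is full or the oldest frame is overdue, else None."""
        if not self._pending:
            return None
        full = len(self._pending) >= self.max_batch
        overdue = now - self._pending[0].arrived_at >= self.max_wait_s
        if not (full or overdue):
            return None
        batch = self._pending[:self.max_batch]
        self._pending = self._pending[self.max_batch:]
        return batch
```

A larger `max_wait_s` raises GPU utilization at the cost of added per-frame latency, so the value would need tuning against the configured latency thresholds.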
inference/kv_cache.py
KV-cache management: per-session cache allocation, eviction policy, and cache-hit optimization for streaming inference
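
The listing does not say which eviction policy is used. As one possibility, a least-recently-used (LRU) policy over per-session cache entries can be sketched as follows; `SessionCachePool` and its methods are hypothetical, and the stored `state` stands in for whatever cache handle the real server keeps.

```python
from collections import OrderedDict
from typing import Any, Optional


class SessionCachePool:
    """Hypothetical LRU pool of per-session cache state (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self._entries: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, session_id: str) -> Optional[Any]:
        """Return the session's state and mark it most recently used."""
        if session_id not in self._entries:
            return None
        self._entries.move_to_end(session_id)
        return self._entries[session_id]

    def put(self, session_id: str, state: Any) -> Optional[str]:
        """Store state; returns the evicted session id, if any."""
        evicted = None
        if session_id in self._entries:
            self._entries.move_to_end(session_id)
        elif len(self._entries) >= self.capacity:
            evicted, _ = self._entries.popitem(last=False)  # drop the oldest entry
        self._entries[session_id] = state
        return evicted
```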
inference/fp8_optimizer.py
FP8 quantization utilities using TensorRT-LLM: model export, calibration dataset runner, and precision validation
training/finetune_lora.py
QLoRA fine-tuning script for RU/UZ language adaptation using HuggingFace PEFT; configures rank, alpha, target modules
training/dataset_builder.py
Builds multilingual voice dataset: audio segmentation, transcript alignment, language tagging for RU and UZ corpora
training/tokenizer_extension.py
Extends base tokenizer with RU/UZ vocabulary, Cyrillic/Latin script handling, and phoneme mappings
training/train_config.yaml
Training hyperparameters: learning rate, batch size, LoRA rank, warmup steps, evaluation strategy, checkpoint paths
asterisk/extensions.conf
Dialplan configuration routing inbound calls to AGI script that bridges audio to the FastAPI session endpoint
asterisk/agi/voice_bridge.py
AGI script: captures RTP audio from Asterisk, streams to FastAPI WebSocket, and plays back AI response audio
asterisk/ari_config.conf
ARI (Asterisk REST Interface) configuration for programmatic call control and media streaming
benchmarks/load_test.py
Locust-based load test simulating 8-10 concurrent voice sessions; measures latency percentiles and session stability
benchmarks/latency_report.py
Parses load test results and generates benchmark report: p50/p95/p99 latency, concurrency ceiling, GPU memory usage
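
The percentile figures in such a report can be computed in several ways. The sketch below uses the nearest-rank method, which is an assumption; the actual script may interpolate instead. The function names `percentile` and `summarize` are hypothetical.

```python
import math
from typing import Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Build the p50/p95/p99 section of a latency report."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Nearest-rank always returns a latency that was actually observed, which makes tail values easy to trace back to a specific session in the raw results.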
dashboard/src/App.tsx
React app root: sets up routing between session monitor, persona manager, and system metrics views
dashboard/src/pages/SessionMonitor.tsx
Live dashboard showing active sessions, per-session latency gauge, language/persona label, and session controls
dashboard/src/pages/PersonaManager.tsx
CRUD UI for persona configurations: name, language selector (EN/RU/UZ), voice profile upload, LoRA adapter assignment
dashboard/src/pages/SystemMetrics.tsx
GPU utilization charts, concurrency history, latency trends, and benchmark report viewer using Recharts
dashboard/src/hooks/useSessionSocket.ts
Custom React hook managing WebSocket connection for real-time session audio streaming and status updates
dashboard/src/api/client.ts
Axios client with base URL config, auth headers, and typed request/response interfaces for all REST endpoints
migrations/
Alembic migration scripts for PostgreSQL schema versioning across sessions, personas, and metrics tables
docs/architecture.md
System architecture overview: component diagram, data flow, GPU pipeline, Asterisk integration, and scaling notes
docs/deployment.md
Step-by-step production deployment guide: GPU driver setup, Docker Compose startup, Asterisk SIP trunk config, SSL setup
docs/api_reference.md
Full API reference for WebSocket session protocol, REST endpoints, audio format specs, and error codes
docs/training_guide.md
Guide for running LoRA fine-tuning: dataset preparation, training execution, adapter export, and model evaluation
.env.example
Template for all required environment variables: DB URLs, Redis URL, GPU device IDs, model paths, Asterisk credentials