Provider Limitations
Limited Provider Support
Current: 18 providers
Target: 50+ providers
Impact: May not support all LLM providers needed by users
Supported Providers:
- OpenAI (GPT-4, GPT-3.5, DALL-E, Whisper, embeddings)
- Anthropic (Claude models)
- Google (Gemini)
- Groq (fast inference)
- Mistral AI
- Cohere
- Deepseek
- Fireworks
- HuggingFace
- Together AI
- Perplexity
- OpenRouter
- xAI (Grok)
- Voyage (embeddings)
- AWS Bedrock
- AssemblyAI (audio)
- ElevenLabs (TTS)
- Fal.ai (images)
Missing Providers
Compared to LiteLLM, the following providers are not yet supported:
- AI21 Labs
- Aleph Alpha
- Baseten
- Cloudflare Workers AI
- Databricks
- Gemini Pro Vision
- Hugging Face Inference Endpoints
- Ollama (local models)
- Replicate
- VertexAI
- Many others (60+ additional providers in LiteLLM)
Missing Features
The following features are not yet implemented or only partially available. Workarounds are provided where possible.
1. Batch Processing API
Status: Not implemented
Description: No support for the OpenAI batch API (/v1/batches)
Impact: Cannot process large batches of requests asynchronously
Workaround: Process requests individually in parallel client-side
Planned: Q2 2025
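Until the batch API lands, a large batch can be fanned out client-side with a bounded concurrency pool. A minimal TypeScript sketch; the gateway URL, model name, and payload shape below are illustrative, and `mapWithConcurrency` is a helper defined here, not part of any gateway SDK:

```typescript
// Run `fn` over `items` with at most `limit` requests in flight at once,
// preserving result order.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // safe: single-threaded event loop, no await in between
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}

// Example: send each prompt as its own chat completion request.
async function runBatch(prompts: string[]): Promise<string[]> {
  return mapWithConcurrency(prompts, 5, async (prompt) => {
    const res = await fetch("http://localhost:8080/v1/chat/completions", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        model: "gpt-4",
        messages: [{ role: "user", content: prompt }],
      }),
    });
    const body = await res.json();
    return body.choices[0].message.content;
  });
}
```

Tune the concurrency limit to stay under your gateway's rate limits.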
2. Vector Store Support
Status: Not implemented
Description: No built-in vector database integration for RAG applications
Missing:
- Vector storage endpoints
- Embedding management
- Similarity search
3. Assistants API
Status: Partial implementation
Description: OpenAI Assistants API (/v1/assistants) partially implemented
Missing:
- Code interpreter
- File search
- Function calling with assistants
- Thread management
4. Fine-tuning API
Status: Not implemented
Description: No support for model fine-tuning endpoints
Impact: Cannot manage fine-tuned models through gateway
Workaround: Fine-tune directly with provider APIs
Planned: Q4 2025 (lower priority)
5. Advanced Routing Strategies
Status: Partially implemented
Available:
- Priority-based routing
- Lowest-cost routing
- Lowest-latency routing
- Load-based routing (current request queue)
Missing:
- Geographic routing (nearest region)
- A/B testing routing
- Custom routing logic
6. Multi-Modal Support
Status: Partial support
Supported:
- Text (chat, completions)
- Images (vision models, image generation)
- Audio (transcription, TTS)
Missing:
- Video processing
- Multi-modal embeddings
- Image editing
Performance Limitations
Guardrail and API latency are the primary performance concerns. Disable unused guardrails and choose appropriate model sizes to minimize impact.
1. Guardrail Latency
Issue: SLM guardrails add 30-50ms latency
Impact: Noticeable delay for low-latency requirements (<100ms)
Breakdown:
- PII Detection: ~15ms
- Hate Speech Detection: ~20ms
- Prompt Injection Detection: ~15ms
- Total (parallel): ~30-50ms
Mitigation:
- Disable unused guardrails
- Use selective guardrails (only on sensitive endpoints)
- Adjust inspection direction (request-only vs both)
2. OrionFence API Latency
Issue: OrionFence ML API adds 100-500ms latency
Impact: Significant latency increase for PII detection
Breakdown (depends on model):
- spacy-sm: 50-100ms
- spacy-md: 100-200ms
- spacy-lg: 200-300ms
- transformers models: 300-500ms
Mitigation:
- Use local regex-based detection (faster but less accurate)
- Enable fallback mode
- Use smaller models for non-critical detection
3. Cache Hit Rate
Issue: Cache effectiveness depends on request similarity
Typical Hit Rates:
- Identical requests: 90-100%
- Similar requests (different parameters): 0%
- Streaming requests: 0% (not cacheable)
Mitigation:
- Use cache for repeated queries (FAQ, common questions)
- Implement request normalization
- Increase cache TTL for stable responses
4. Memory Usage
Issue: In-memory store requires all metadata to fit in RAM
Current Usage:
- Baseline: ~100MB
- Per 1000 API keys: ~5MB
- Per 10000 request logs: ~50MB
- Per provider model catalog: ~1MB
Recommendations:
- < 10,000 API keys: 2GB RAM sufficient
- < 100,000 API keys: 4GB RAM sufficient
- > 100,000 API keys: Consider external database
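The per-unit figures above imply a simple sizing formula. A small sketch that turns them into a rough RAM estimate (the coefficients come straight from the list; real usage will vary):

```typescript
// Rough RAM estimate (MB) from the per-unit figures above.
function estimateMemoryMB(
  apiKeys: number,
  requestLogs: number,
  providers: number,
): number {
  const baseline = 100;                     // ~100 MB baseline
  const keys = (apiKeys / 1000) * 5;        // ~5 MB per 1,000 API keys
  const logs = (requestLogs / 10000) * 50;  // ~50 MB per 10,000 request logs
  const catalogs = providers * 1;           // ~1 MB per provider model catalog
  return baseline + keys + logs + catalogs;
}

// Example: 50,000 keys, the 10,000-request in-memory log window, 18 providers:
// 100 + 250 + 50 + 18 = 418 MB, comfortably inside the 4 GB recommendation.
```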
Scalability Constraints
Single Redis instance and in-memory log storage are the primary scalability bottlenecks. Plan for Redis Cluster and external log storage for large deployments.
1. Single Redis Instance
Issue: Single Redis instance is a bottleneck and single point of failure
Impact:
- Limited throughput (~50,000 ops/sec)
- No automatic failover
- Data loss if Redis crashes (without persistence)
Mitigation:
- Enable Redis persistence (AOF + RDB)
- Use Redis Sentinel for automatic failover
- Use Redis Cluster for horizontal scaling
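The persistence mitigation maps to standard redis.conf directives (these are stock Redis settings, not gateway-specific configuration):

```
# redis.conf — combine AOF and RDB persistence
appendonly yes          # append-only file for durability
appendfsync everysec    # fsync the AOF once per second
save 900 1              # RDB snapshot if >= 1 change in 15 min
save 300 10             # ... or >= 10 changes in 5 min
save 60 10000           # ... or >= 10,000 changes in 1 min
```

AOF with `everysec` bounds data loss to roughly one second of writes at a modest throughput cost.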
2. Request Log Storage
Issue: Logs stored in-memory with limited retention
Current Limits:
- In-memory: Last 10,000 requests
- Redis: Configurable TTL (default 30 days)
3. Concurrent Request Limit
Issue: Node.js event loop limits concurrent processing
Typical Limits:
- CPU-bound: ~1,000 requests/sec per core
- I/O-bound: ~5,000 requests/sec per core
Mitigation:
- Horizontal scaling (add more instances)
- Use clustering (Node.js cluster module)
- Optimize middleware (reduce processing time)
4. WebSocket Limitations
Issue: No WebSocket support for real-time bidirectional communication
Impact: Cannot use WebSockets for streaming or real-time updates
Workaround: Use Server-Sent Events (SSE) for server-to-client streaming
Planned: WebSocket support for admin UI in Q3 2025
Known Bugs and Workarounds
Streaming with Anthropic Claude
Issue: Streaming responses may occasionally include duplicate chunks
Affected: Anthropic provider, streaming mode
Workaround: Client-side deduplication based on chunk IDs
Status: Fix planned for next release
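The deduplication workaround amounts to tracking chunk IDs you have already seen. A minimal sketch; the `{ id }` chunk shape is an assumption about the stream format, so adapt the key to whatever identifier your chunks actually carry:

```typescript
// Drop streaming chunks whose ID has already been seen.
function dedupeChunks<T extends { id: string }>(chunks: Iterable<T>): T[] {
  const seen = new Set<string>();
  const out: T[] = [];
  for (const chunk of chunks) {
    if (seen.has(chunk.id)) continue; // duplicate delivery — skip it
    seen.add(chunk.id);
    out.push(chunk);
  }
  return out;
}
```

For a live stream, keep the `seen` set alive for the duration of the response rather than buffering all chunks first.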
Rate Limit Reset Time
Issue: Rate limit reset time in headers may be slightly inaccurate
Affected: X-RateLimit-Reset-Requests header
Impact: Off by a few seconds (~5s)
Workaround: Add a 10s buffer when checking the reset time
Status: Low priority, cosmetic issue
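The 10-second buffer looks like this in practice. A sketch that assumes the header carries an epoch timestamp in seconds; if your deployment reports a duration instead, adjust the parsing accordingly:

```typescript
// Conservative reset check: treat the advertised reset time as passed
// only after a 10-second safety buffer has also elapsed.
const RESET_BUFFER_SECONDS = 10;

function rateLimitHasReset(
  resetHeader: string,      // X-RateLimit-Reset-Requests value
  nowEpochSeconds: number,  // current time, epoch seconds
): boolean {
  const resetAt = Number(resetHeader);
  return nowEpochSeconds >= resetAt + RESET_BUFFER_SECONDS;
}
```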
Large Request Payloads
Issue: Requests >1MB may timeout with default settings
Affected: Image generation, large context windows
Workaround: Increase request and upstream timeouts in the gateway configuration
Status: Working as designed, configurable
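One possible shape for the timeout override, assuming environment-variable configuration; the variable names below are illustrative, not confirmed gateway settings, so check the configuration reference before using them:

```
# Hypothetical settings — verify the actual names in your gateway's
# configuration reference before relying on these.
REQUEST_TIMEOUT_MS=120000    # client-facing request timeout (2 min)
UPSTREAM_TIMEOUT_MS=120000   # timeout for the upstream provider call
```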
Model Discovery Performance
Issue: Model discovery on startup can be slow (2-5 seconds per provider)
Impact: Delayed startup time
Workaround: Disable automatic discovery and manually configure models in provider settings
Status: Optimization planned for Q2 2025
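A sketch of the workaround as an environment variable, assuming the gateway exposes a discovery toggle (the variable name is illustrative, not a documented setting):

```
# Hypothetical setting — the actual name may differ in your deployment.
MODEL_DISCOVERY_ENABLED=false
```

With discovery off, each model must be listed explicitly in the provider settings.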
MCP Session Timeout
Issue: MCP sessions time out after 25 seconds of inactivity
Impact: Need to reinitialize frequently used MCP servers
Workaround: Increase the session TTL in the gateway configuration
Status: Working as designed, configurable
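A sketch of the TTL override, following the MCP_* naming used elsewhere in this document (e.g. MCP_MAX_REQUEST_BYTES); the exact variable name is an assumption, so verify it in the configuration reference:

```
# Hypothetical setting — verify the real name before using it.
MCP_SESSION_TTL_SECONDS=300   # keep idle sessions alive for 5 minutes
```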
Browser Compatibility
Admin UI Browser Support
Fully Supported:
- Chrome 90+
- Firefox 88+
- Safari 14+
- Edge 90+
Partially Supported:
- Safari 13 (missing some CSS features)
- Firefox 85-87 (missing some features)
Not Supported:
- Internet Explorer (all versions)
- Chrome < 90
- Safari < 13
Known Issues:
- Safari 13: Dark mode toggle animation broken
- Firefox < 88: WebSocket status indicator not working
- Mobile browsers: Limited support for playground features
MCP Limitations
1. MCP Protocol Version
Issue: Only supports MCP protocol version 2024-11-05
Impact: Incompatible with older MCP servers
Workaround: Update MCP servers to latest version
Planned: Backward compatibility in Q2 2025
2. MCP Server Types
Supported:
- Remote HTTP MCP servers
- Stdio-based MCP servers (via bridge)
Not Supported:
- WebSocket MCP servers
- gRPC MCP servers
3. MCP Tool Size Limits
Issue: MCP tool responses limited to 512KB
Configured: MCP_MAX_REQUEST_BYTES=524288
Impact: Cannot return large tool responses
Workaround: Increase limit or stream responses
Status: Working as designed, configurable
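For example, raising the documented limit from 512KB to 2MB (the value is bytes; 2 MiB shown):

```
MCP_MAX_REQUEST_BYTES=2097152   # 2 MiB, up from the 524288 (512 KiB) default
```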
4. MCP Session Management
Issue: Sessions not shared across gateway instances
Impact: Each gateway instance creates separate MCP sessions
Workaround: Use session affinity (sticky sessions) in load balancer
Planned: Shared session management in Q3 2025
Caching Limitations
1. Cache Key Strategy
Issue: Cache key is exact request hash (no semantic similarity)
Impact: Minor request variations bypass cache
Example:
- “What is 2+2?” (cache miss)
- “What is two plus two?” (cache miss, different wording)
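Request normalization cannot unify true paraphrases like the pair above, but it does stop cosmetic differences (whitespace, casing, JSON key order) from defeating the cache. A sketch; the `ChatRequest` shape is a simplified assumption about the request payload:

```typescript
// Normalize a chat request before hashing it into a cache key, so that
// cosmetic differences do not produce distinct keys.
interface ChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
}

function normalizedCacheKey(req: ChatRequest): string {
  const messages = req.messages.map((m) => ({
    role: m.role,
    content: m.content.trim().replace(/\s+/g, " ").toLowerCase(),
  }));
  // Serialize fields in a fixed order so JSON key order cannot vary,
  // and pin defaults so an omitted parameter matches its explicit value.
  return JSON.stringify({
    model: req.model,
    temperature: req.temperature ?? 1,
    messages,
  });
}
```

Hash the resulting string (e.g. with SHA-256) to get a compact key.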
2. Streaming Responses
Issue: Streaming responses are not cached
Impact: No cache benefit for streaming requests
Workaround: Use non-streaming requests when possible
Status: Cannot cache streaming due to SSE protocol
3. Cache Invalidation
Issue: No automatic cache invalidation on configuration changes
Impact: Stale cached responses after model changes
Workaround: Manually clear the cache after configuration changes
4. Cache Size Limits
Issue: In-memory cache size limited to prevent memory exhaustion
Default: 10,000 entries
Impact: Cache eviction under high load
Workaround: Use the Redis cache backend for larger capacity
Future Improvements
See below for planned improvements to address these limitations.
Short-Term (0-3 months)
- Expand provider support to 50+ providers
- Improve API documentation
- Optimize guardrail performance
- Add batch processing API
Medium-Term (3-6 months)
- Vector store integration
- Advanced routing strategies
- Redis Cluster support
- WebSocket support
Long-Term (6-12 months)
- Semantic caching
- Fine-tuning API
- Multi-modal expansion
- Auto-scaling optimizations
Reporting Issues
Found a bug or limitation not listed here?
- Check GitHub Issues: https://github.com/guardwayai/agsec/issues
- Search Documentation: May already be documented
- Create Issue: Include:
- Description of issue
- Steps to reproduce
- Expected vs actual behavior
- Environment details (OS, Docker version, etc.)
- Logs/screenshots
For workarounds and best practices, see:
