Roboticks Core Functionality Architecture Design
Executive Summary
This document outlines the comprehensive architecture for implementing core fleet management, session tracking, real-time logging, and cloud monitoring functionality in the Roboticks platform.
Table of Contents
- Runtime Logging Architecture
- Fleet Manager Device Onboarding
- Cloud Backend Monitoring
- Frontend Views Design
- AWS Services & Technology Stack
- Implementation Plan
1. Runtime Logging Architecture
Current State
- Session Manager currently copies logs from other modules only during teardown
- No runtime log streaming exists
- Logs are lost if modules crash before teardown
Proposed Architecture
1.1 Transport-Based Log Sync
1.2 Log Message Structure
1.3 Implementation Strategy
Phase 1: Logging System Enhancement
- Add a new RemoteSink in the roboticks-logging package
- Sink publishes logs to the transport topic roboticks/logs
- All modules automatically send logs via transport
- In Session Manager, subscribe to the roboticks/logs topic
- Maintain ring buffer (configurable size, default 10K messages)
- Associate logs with active session
- Batch upload to cloud every N seconds or M messages
- On teardown, Session Manager:
- Signals modules to shut down gracefully
- Continues collecting logs during shutdown (30s timeout)
- Flushes remaining logs to local storage
- Uploads final log batch to cloud
- Only copies module-local logs if upload fails (fallback)
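The collection strategy above (bounded buffer, batch upload, retry on failure) can be sketched as follows. This is a minimal illustration in Python rather than the device-side C++, and the names LogBatcher, upload, and batch_size are hypothetical; only the 10K default capacity comes from this document.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class LogBatcher:
    """Bounded log buffer that hands batches to an upload callback."""

    upload: Callable[[List[dict]], bool]  # hypothetical uploader, True on success
    capacity: int = 10_000                # ring buffer size (document default)
    batch_size: int = 500                 # "M messages" threshold, assumed value
    dropped: int = 0                      # records overwritten while buffer was full
    _buf: Deque[dict] = field(default_factory=deque, init=False, repr=False)

    def __len__(self) -> int:
        return len(self._buf)

    def append(self, record: dict) -> None:
        # Ring-buffer behaviour: when full, the oldest record is discarded.
        if len(self._buf) >= self.capacity:
            self._buf.popleft()
            self.dropped += 1
        self._buf.append(record)

    def flush(self) -> bool:
        """Upload up to batch_size records; keep them buffered if upload fails."""
        if not self._buf:
            return True
        batch = list(self._buf)[: self.batch_size]
        if not self.upload(batch):
            return False  # records stay in the buffer for the next attempt
        for _ in batch:
            self._buf.popleft()
        return True
```

A caller would invoke flush() on a timer (every N seconds) or whenever len(batcher) reaches batch_size. Records are removed only after a successful upload, so the buffer can ride out a temporary outage, at the cost of dropping the oldest records once capacity is reached.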
2. Fleet Manager Device Onboarding
2.1 Device Authentication Flow
2.2 Device Types & Environment Categories
2.3 Capsule Structure
3. Cloud Backend Monitoring
3.1 Database Schema Extensions
3.2 Real-Time Log Streaming Architecture
3.3 API Endpoints
4. Frontend Views Design
4.1 Fleet Management View
Template Base: user-list / user-grid (supports list/grid toggle)
Features:
- Environment Tabs: Production | Testing | Development
- View Toggle: List view (table) | Grid view (cards) - Grid is default
- Status Filters: Online | Offline | Maintenance | Error
- Device Type Filters: All | Drone | Robot | Camera | Sensor | Vehicle
Grid View (Default)
List View
Standard table with columns:
- Name | Type | Environment | Status | Last Seen | Battery/CPU | Current Session | Actions
Device Detail Page
- Device info, metrics charts, command history
- Active session (if any) with link to session view
- Recent logs (last 50 lines) with link to full log view
- Button: “View All Sessions”
- Button: “Send Command”
4.2 Sessions View
Template Base: packages-table (modified)
Features:
- Clickable Rows: Click any session to navigate
- Two Action Buttons per Row:
- Files 📁 → Navigate to File Manager view
- Logs 📄 → Navigate to Logs view with session filter pre-applied
- Filters: Device | Status | Date Range | Capsule
- Status Indicators: 🟢 Active | 🔵 Completed | 🔴 Failed | ⚫ Aborted
Session Files View (File Manager Template)
When the user clicks the 📁 button:
4.3 Logs View
Real-Time Log Streaming with Filtering
Features:
- Real-Time Updates: WebSocket connection, new logs appear automatically
- Lazy Loading: Shows last 100 lines, scroll up to load more (paginated)
- Multi-Level Filtering:
- Device (dropdown)
- Session (dropdown)
- Module (dropdown)
- Log Level (DEBUG/INFO/WARN/ERROR/FATAL checkboxes)
- Date/Time Range
- Text Search (searches message content)
- Export: Download filtered logs as CSV/JSON
- Auto-Scroll: Toggle to follow latest logs
- Color Coding: Levels have different colors (ERROR=red, WARN=yellow, INFO=blue, DEBUG=gray)
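The multi-level filtering listed above could be expressed as a single predicate applied to each log record. The sketch below is illustrative only: the record keys (device_id, session_id, module, level, timestamp, message) are assumed names, since the log message structure is not shown in this section.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, List, Optional, Set


@dataclass
class LogFilter:
    """Filter criteria for the Logs view; None means 'do not filter on this'."""

    device_id: Optional[str] = None
    session_id: Optional[str] = None
    module: Optional[str] = None
    levels: Optional[Set[str]] = None      # e.g. {"WARN", "ERROR"}
    start: Optional[datetime] = None       # inclusive lower bound
    end: Optional[datetime] = None         # inclusive upper bound
    text: Optional[str] = None             # case-insensitive substring match

    def matches(self, rec: dict) -> bool:
        for key in ("device_id", "session_id", "module"):
            wanted = getattr(self, key)
            if wanted is not None and rec.get(key) != wanted:
                return False
        if self.levels is not None and rec.get("level") not in self.levels:
            return False
        ts = rec.get("timestamp")
        if self.start is not None and (ts is None or ts < self.start):
            return False
        if self.end is not None and (ts is None or ts > self.end):
            return False
        if self.text and self.text.lower() not in rec.get("message", "").lower():
            return False
        return True


def apply_filter(records: Iterable[dict], flt: LogFilter) -> List[dict]:
    """Return the records that satisfy every active criterion."""
    return [r for r in records if flt.matches(r)]
```

The same predicate can run on the server for paginated queries and on incoming WebSocket messages, so that live and historical views stay consistent.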
5. AWS Services & Technology Stack
5.1 AWS Services Selection
Device Communication & Management
- AWS IoT Core
- MQTT/WSS for bi-directional communication
- Device authentication via X.509 certificates
- Device shadows for state management
- Rules engine for routing messages
Data Storage
- Amazon RDS (PostgreSQL)
- Primary database for devices, sessions, structured logs
- Full-text search on logs using PostgreSQL tsvector
- Connection pooling via PgBouncer
- Amazon S3
- Log archive storage (compress logs older than 7 days)
- Session artifacts (files, screenshots, data dumps)
- Capsule storage (deployment packages)
- Lifecycle policies (move to Glacier after 90 days)
- Amazon DynamoDB (Optional for hot logs)
- Ultra-fast log ingestion
- Time-series data (TTL after 7 days)
- Global secondary indexes for filtering
- DynamoDB Streams → Firehose → S3
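As an illustration of the S3 lifecycle policy mentioned above, the following sketch builds a lifecycle configuration in the shape accepted by the S3 API. The prefix "logs/" and the rule ID are hypothetical; the 90-day Glacier transition is the value stated in this document.

```python
from typing import Any, Dict


def glacier_lifecycle(prefix: str, days: int = 90) -> Dict[str, Any]:
    """Build an S3 lifecycle configuration that moves objects to Glacier.

    The returned dict follows the structure expected by boto3's
    put_bucket_lifecycle_configuration(LifecycleConfiguration=...).
    """
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}-after-{days}d",  # hypothetical ID scheme
                "Status": "Enabled",
                "Filter": {"Prefix": prefix},
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


config = glacier_lifecycle("logs/")  # "logs/" is an assumed key prefix

# Applying it would look roughly like this (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="<bucket-name>", LifecycleConfiguration=config)
```

Keeping the rule in code (or in infrastructure-as-code) makes the retention policy reviewable alongside the rest of the stack.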
Compute
- AWS Lambda
- Log ingestion processor
- Device command dispatcher
- Session state machine
- Scheduled tasks (cleanup, analytics)
- Amazon ECS Fargate (Existing)
- Backend API (FastAPI)
- Frontend serving (Nginx)
API & Networking
- AWS API Gateway
- REST API for device/frontend communication
- WebSocket API for real-time log streaming
- API keys and usage plans
- Throttling and quotas
- AWS VPC
- Private subnets for RDS, Redis
- NAT Gateway for outbound traffic
- VPC endpoints for S3, DynamoDB
Caching & Queuing
- Amazon ElastiCache (Redis) (Existing)
- Session caching
- Real-time metrics aggregation
- Pub/Sub for WebSocket notifications
- Amazon SQS
- Log processing queue
- Command execution queue
- Dead-letter queue for failed operations
Monitoring & Observability
- Amazon CloudWatch
- Logs (Lambda, ECS, API Gateway)
- Metrics (custom + service metrics)
- Alarms for device offline, error rates
- Dashboards
- AWS X-Ray
- Distributed tracing
- Performance bottleneck identification
5.2 Edge Device Libraries (C++)
HTTP Client
- libcurl (already available)
- HTTPS requests with TLS 1.3
- Certificate pinning
- Connection pooling
MQTT Client
- AWS IoT Device SDK for C++
- Native AWS IoT Core integration
- Auto-reconnection
- QoS levels 0 and 1 (AWS IoT Core does not support QoS 2)
- WebSocket support
JSON Parsing
- nlohmann/json or RapidJSON
- Fast JSON serialization
- Schema validation
Compression
- zlib or lz4
- Log compression before upload
- Reduce bandwidth usage
Certificate Management
- OpenSSL (already available)
- X.509 certificate handling
- TLS connections
- CSR generation
5.3 Backend Libraries (Python)
5.4 Frontend Libraries (React/TypeScript)
6. Implementation Plan
Phase 1: Foundation (Weeks 1-2)
Week 1: Database & Backend Setup
- Add environment column to fleet_devices table (migration)
- Create session_logs table with full-text search indexes
- Implement Fleet API endpoints (/fleet/register, /fleet/heartbeat)
- Implement Sessions API endpoints (/sessions, /sessions/{id}/files)
- Set up AWS IoT Core (thing types, policies, certificates)
Week 2: Device-Side Integration
- Implement RemoteSink in roboticks-logging (publishes to transport)
- Update SessionManager to subscribe to roboticks/logs topic
- Implement ring buffer for log collection
- Add log batching and upload to FleetClient
- Integrate AWS IoT Device SDK for C++
- Update capsule generation to include registration token
Phase 2: Real-Time Logging (Weeks 3-4)
Week 3: Log Ingestion Pipeline
- Create Lambda function for log ingestion
- Set up SQS queue for log processing
- Implement log storage to RDS (with batching)
- Implement log archive to S3 (daily rotation)
- Set up DynamoDB for hot log storage (optional)
Week 4: Real-Time Streaming
- Implement WebSocket API in API Gateway
- Create WebSocket handler in backend (FastAPI)
- Implement pub/sub using Redis for multi-instance support
- Build log query API with filtering
- Implement log export (CSV/JSON)
Phase 3: Frontend Development (Weeks 5-6)
Week 5: Fleet & Sessions Views
- Create Fleet Management view (grid/list toggle)
- Add environment tabs (Production/Testing/Development)
- Implement device detail page with metrics charts
- Create Sessions table view
- Implement Session Files view (file manager)
Week 6: Logs View
- Build real-time logs view with WebSocket connection
- Implement multi-level filtering (device, session, module, level)
- Add virtual scrolling for performance
- Implement lazy loading (pagination)
- Add export functionality
- Add auto-scroll toggle
Phase 4: Device Onboarding & Commands (Week 7)
- Build capsule generation UI
- Implement device registration flow
- Create command sending interface
- Implement command execution on device side
- Add command history view
Phase 5: Testing & Optimization (Week 8)
- Load testing (1000+ devices, high log volume)
- Performance optimization (indexing, caching)
- Security audit (authentication, authorization, encryption)
- Documentation (API docs, deployment guide)
- Monitoring setup (CloudWatch dashboards, alarms)
7. Key Technical Decisions & Rationale
7.1 Why AWS IoT Core over HTTP-only?
- Bi-directional: Devices can receive commands instantly (not just poll)
- Connection Management: Automatic reconnection, offline queuing
- Security: Built-in certificate management, fine-grained policies
- Scale: Handles millions of devices, message routing
7.2 Why Ring Buffer + Batch Upload?
- Network Efficiency: Reduce HTTP requests (1 req/5s vs 100 req/5s)
- Resilience: Buffer survives temporary network outages
- Performance: Minimal impact on module execution
7.3 Why PostgreSQL Full-Text Search over ElasticSearch?
- Simplicity: No additional service to manage
- Cost: Included with RDS
- Performance: tsvector + GIN indexes handle 100K+ logs/min
- Fallback: Can migrate to ElasticSearch later if needed
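A minimal sketch of how the tsvector and GIN approach might look is given below. The column names (id, session_id, level, message, created_at, message_tsv) are assumptions about the session_logs table, the generated column requires PostgreSQL 12 or later, and the %(name)s placeholders follow the psycopg parameter style.

```python
from typing import Any, Dict, Optional, Tuple

# Assumed schema extension: a stored tsvector column plus a GIN index on it.
SCHEMA_SQL = """
ALTER TABLE session_logs
    ADD COLUMN IF NOT EXISTS message_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', message)) STORED;
CREATE INDEX IF NOT EXISTS session_logs_message_tsv_idx
    ON session_logs USING GIN (message_tsv);
"""


def build_search_query(
    text: str, session_id: Optional[str] = None, limit: int = 100
) -> Tuple[str, Dict[str, Any]]:
    """Return a parameterized full-text search query and its parameters."""
    clauses = ["message_tsv @@ plainto_tsquery('english', %(q)s)"]
    params: Dict[str, Any] = {"q": text, "limit": limit}
    if session_id is not None:
        clauses.append("session_id = %(session_id)s")
        params["session_id"] = session_id
    sql = (
        "SELECT id, session_id, level, message, created_at "
        "FROM session_logs WHERE "
        + " AND ".join(clauses)
        + " ORDER BY created_at DESC LIMIT %(limit)s"
    )
    return sql, params
```

User input only ever travels through the parameter dict, never through string concatenation, which keeps the search endpoint consistent with the SQL injection precautions in the security section.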
7.4 Why WebSocket for Log Streaming?
- Low Latency: Sub-second log delivery to UI
- Efficiency: Single persistent connection vs HTTP polling
- User Experience: True real-time feel
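The fan-out behind WebSocket streaming can be illustrated with an in-process broadcaster. This is a simplified stand-in, not the proposed deployment: in this design the cross-instance hop is Redis pub/sub, whereas the sketch uses asyncio queues inside one process, and the class and method names are hypothetical.

```python
import asyncio
from typing import Dict, Set


class LogBroadcaster:
    """Delivers log records to every subscriber of a session (single process)."""

    def __init__(self, max_queue: int = 1000) -> None:
        self._subs: Dict[str, Set[asyncio.Queue]] = {}
        self._max_queue = max_queue

    def subscribe(self, session_id: str) -> asyncio.Queue:
        """Register a subscriber; a WebSocket handler would read from the queue."""
        queue: asyncio.Queue = asyncio.Queue(maxsize=self._max_queue)
        self._subs.setdefault(session_id, set()).add(queue)
        return queue

    def unsubscribe(self, session_id: str, queue: asyncio.Queue) -> None:
        subs = self._subs.get(session_id)
        if subs is None:
            return
        subs.discard(queue)
        if not subs:
            del self._subs[session_id]

    def publish(self, session_id: str, record: dict) -> int:
        """Push a record to all subscribers; return how many accepted it."""
        delivered = 0
        for queue in self._subs.get(session_id, ()):
            try:
                queue.put_nowait(record)
                delivered += 1
            except asyncio.QueueFull:
                # A slow client loses this record instead of blocking the others.
                pass
        return delivered
```

Each connected client gets its own bounded queue, so one stalled browser tab cannot hold up delivery to other viewers of the same session.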
8. Security Considerations
8.1 Device Security
- ✅ X.509 certificates for authentication (not passwords)
- ✅ Registration tokens expire after 24h
- ✅ TLS 1.3 for all communication
- ✅ Device-specific AWS IoT policies (least privilege)
- ✅ Secure storage of certificates on device (encrypted partition)
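The 24-hour registration token rule above could be enforced along the following lines. This is a sketch under stated assumptions: the record layout, the single-use behaviour, and storing only a SHA-256 hash of the token are illustrative choices, not requirements taken from this document.

```python
import hashlib
import secrets
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple

TOKEN_TTL = timedelta(hours=24)  # expiry stated in this document


def _digest(token: str) -> str:
    return hashlib.sha256(token.encode("utf-8")).hexdigest()


def issue_token(now: Optional[datetime] = None) -> Tuple[str, dict]:
    """Create a token for a capsule and the record the backend would store."""
    now = now or datetime.now(timezone.utc)
    token = secrets.token_urlsafe(32)
    record = {
        "token_hash": _digest(token),   # only the hash is stored (assumption)
        "expires_at": now + TOKEN_TTL,
        "used": False,                  # single-use flag (assumption)
    }
    return token, record


def verify_token(token: str, record: dict, now: Optional[datetime] = None) -> bool:
    """Accept the token once, and only before it expires."""
    now = now or datetime.now(timezone.utc)
    if not secrets.compare_digest(_digest(token), record["token_hash"]):
        return False
    if record["used"] or now >= record["expires_at"]:
        return False
    record["used"] = True
    return True
```

After a successful verification the device would proceed to certificate provisioning, so the token never needs to be accepted a second time.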
8.2 Backend Security
- ✅ JWT authentication for all API endpoints
- ✅ Role-based access control (users can only see their devices)
- ✅ API rate limiting (prevent abuse)
- ✅ Input validation on all endpoints
- ✅ SQL injection prevention (SQLAlchemy ORM)
8.3 Data Security
- ✅ Encryption at rest (S3, RDS, DynamoDB)
- ✅ Encryption in transit (TLS everywhere)
- ✅ Log data retention policy (90 days active, then archive)
- ✅ PII scrubbing from logs (if applicable)
9. Cost Estimation (Monthly, for 100 devices)
| Service | Usage | Cost |
|---|---|---|
| AWS IoT Core | 100 devices × 24h × 30d | ~$50 |
| RDS (db.t4g.micro) | Free tier | $0 |
| S3 | 50 GB logs/artifacts | ~$1.15 |
| Lambda | 1M invocations | ~$0.20 |
| API Gateway | 1M requests | ~$3.50 |
| SQS | 1M messages | $0.40 |
| CloudWatch Logs | 10 GB | ~$5 |
| Total | | ~$60/month |
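The total can be checked by summing the per-service figures from the table. The snippet below uses only the numbers listed above; the per-device figure is simple division by the 100 devices assumed in the section title.

```python
# Monthly cost estimates in USD, copied from the table above.
MONTHLY_COST_USD = {
    "AWS IoT Core": 50.00,
    "RDS (db.t4g.micro)": 0.00,
    "S3": 1.15,
    "Lambda": 0.20,
    "API Gateway": 3.50,
    "SQS": 0.40,
    "CloudWatch Logs": 5.00,
}
DEVICE_COUNT = 100


def monthly_total(costs: dict) -> float:
    """Sum the per-service estimates, rounded to cents."""
    return round(sum(costs.values()), 2)


total = monthly_total(MONTHLY_COST_USD)
per_device = round(total / DEVICE_COUNT, 4)
print(f"total: ${total:.2f}/month, per device: ${per_device:.4f}/month")
```

The sum comes to $60.25, which matches the rounded ~$60/month in the table.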
10. Success Metrics
- Reliability: 99.9% uptime for device connections
- Latency: < 2s from log generation to UI display
- Throughput: Support 1000 logs/sec per device
- Storage: Compress logs to 30% original size
- User Experience: < 500ms page load for logs view
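The storage target above can be measured directly on sample batches. The sketch below compresses a batch of JSON log records with zlib (one of the two compression options named in section 5.2) and reports the compressed-to-raw size ratio; the newline-delimited JSON layout and the helper names are assumptions for illustration.

```python
import json
import zlib
from typing import List


def _encode(records: List[dict]) -> bytes:
    # Newline-delimited JSON; json.dumps escapes newlines inside strings.
    return "\n".join(json.dumps(r, separators=(",", ":")) for r in records).encode("utf-8")


def compress_batch(records: List[dict], level: int = 6) -> bytes:
    """Serialize and compress a batch of log records."""
    return zlib.compress(_encode(records), level)


def decompress_batch(blob: bytes) -> List[dict]:
    """Inverse of compress_batch."""
    text = zlib.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]


def compression_ratio(records: List[dict], level: int = 6) -> float:
    """Compressed size divided by raw size (0.30 would meet the 30% target)."""
    raw = _encode(records)
    if not raw:
        return 1.0
    return len(zlib.compress(raw, level)) / len(raw)
```

The achievable ratio depends heavily on how repetitive the log content is, so the 30% target is best validated against real device logs rather than synthetic data.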
Appendix A: Example Configuration Files
Device Capsule Config
Backend Environment Variables
Conclusion
This architecture provides a robust, scalable foundation for:
- ✅ Real-time log collection and streaming
- ✅ Secure device onboarding and management
- ✅ Comprehensive fleet monitoring
- ✅ Production/test environment separation
- ✅ Remote command execution
- ✅ Session tracking and artifact management