Roboticks Core Functionality Architecture Design

Executive Summary

This document outlines the comprehensive architecture for implementing core fleet management, session tracking, real-time logging, and cloud monitoring functionality in the Roboticks platform.

Table of Contents

  1. Runtime Logging Architecture
  2. Fleet Manager Device Onboarding
  3. Cloud Backend Monitoring
  4. Frontend Views Design
  5. AWS Services & Technology Stack
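  6. Implementation Plan
  7. Key Technical Decisions & Rationale
  8. Security Considerations
  9. Cost Estimation
  10. Success Metrics
  Appendix A: Example Configuration Files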

1. Runtime Logging Architecture

Current State

  • Session Manager currently copies logs from other modules only during teardown
  • No runtime log streaming exists
  • Logs are lost if modules crash before teardown

Proposed Architecture

1.1 Transport-Based Log Sync

┌─────────────────┐
│   Module A      │──┐
│  (InputModule)  │  │
└─────────────────┘  │
                     │   ZeroMQ/Iceoryx Pub/Sub
┌─────────────────┐  │   Topic: "roboticks/logs"
│   Module B      │──┼───────────────────────────────►  ┌─────────────────┐
│ (ProcessModule) │  │                                  │ Session Manager │
└─────────────────┘  │                                  │   (Subscriber)  │
                     │                                  └────────┬────────┘
┌─────────────────┐  │                                           │
│   Module C      │──┘                                           │
│ (OutputModule)  │                                              │
└─────────────────┘                                              ▼
                                                        ┌─────────────────┐
                                                        │ Local Log Buffer│
                                                        │  (Ring Buffer)  │
                                                        └────────┬────────┘
                                                                 │
                                                                 ▼
                                                        ┌─────────────────┐
                                                        │  Fleet Client   │
                                                        │ (Cloud Uploader)│
                                                        └────────┬────────┘
                                                                 │
                                                                 ▼
                                                        ┌─────────────────┐
                                                        │  AWS IoT Core   │
                                                        │   (MQTT/WSS)    │
                                                        └─────────────────┘

1.2 Log Message Structure

struct LogMessage {
    std::string session_id;
    std::string module_name;
    std::string process_id;
    uint64_t timestamp_us;
    LogLevel level;        // DEBUG, INFO, WARN, ERROR, FATAL
    std::string logger_name;
    std::string message;
    std::string thread_id;
    std::map<std::string, std::string> context;  // Additional metadata
};
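
On the backend side, the same record has to be parsed from whatever wire format the device uploads. The following is a minimal sketch of a Python mirror of the struct with a JSON round trip. JSON as the wire format, sending the level by name, and the numeric enum values are all assumptions, since the struct above does not specify them.

```python
import json
from dataclasses import asdict, dataclass, field
from enum import IntEnum


class LogLevel(IntEnum):
    # Numeric values are an assumption; only the names appear in the struct.
    DEBUG = 0
    INFO = 1
    WARN = 2
    ERROR = 3
    FATAL = 4


@dataclass
class LogMessage:
    session_id: str
    module_name: str
    process_id: str
    timestamp_us: int
    level: LogLevel
    logger_name: str
    message: str
    thread_id: str
    context: dict = field(default_factory=dict)  # additional metadata

    def to_json(self) -> str:
        payload = asdict(self)
        payload["level"] = self.level.name  # send the level by name, e.g. "ERROR"
        return json.dumps(payload, sort_keys=True)

    @classmethod
    def from_json(cls, raw: str) -> "LogMessage":
        data = json.loads(raw)
        data["level"] = LogLevel[data["level"]]  # raises KeyError on unknown levels
        return cls(**data)
```

Sending the level by name keeps the payload readable and matches the string enum used for session_logs.level in Section 3.1.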

1.3 Implementation Strategy

Phase 1: Logging System Enhancement
  • Add a new RemoteSink in roboticks-logging package
  • Sink publishes logs to transport topic roboticks/logs
  • All modules automatically send logs via transport
Phase 2: Session Manager Log Collection
  • Subscribe to roboticks/logs topic
  • Maintain ring buffer (configurable size, default 10K messages)
  • Associate logs with active session
  • Batch upload to cloud every N seconds or M messages, whichever comes first (Appendix A uses 5 s and 100 messages)
Phase 3: Teardown Behavior
  • On teardown, Session Manager:
    1. Signals modules to shut down gracefully
    2. Continues collecting logs during shutdown (30s timeout)
    3. Flushes remaining logs to local storage
    4. Uploads final log batch to cloud
    5. Only copies module-local logs if upload fails (fallback)

2. Fleet Manager Device Onboarding

2.1 Device Authentication Flow

┌─────────────────┐                                ┌──────────────────┐
│  New Device     │                                │  Cloud Backend   │
│                 │                                │  (FastAPI)       │
└────────┬────────┘                                └────────┬─────────┘
         │                                                  │
         │ 1. Generate Device-Specific Capsule              │
         │    (via Web UI or CLI)                           │
         │◄─────────────────────────────────────────────────┤
         │    Capsule includes:                             │
         │    - Device Serial Number (DSN)                  │
         │    - Pre-shared Key (PSK) or                     │
         │    - JWT Registration Token (expires 24h)        │
         │    - Backend URL                                 │
         │    - Fleet Account ID                            │
         │                                                  │
         │ 2. Flash Capsule to Device                       │
         │                                                  │
         │ 3. Device Boots & Reads Capsule Config           │
         │                                                  │
         │ 4. Device Registration Request                   │
         │    POST /api/v1/fleet/register                   │
         │    {                                             │
         │      dsn: "DEV-ABC-12345",                       │
         │      registration_token: "jwt...",               │
         │      device_info: {...},                         │
         │      csr: "-----BEGIN CERTIFICATE REQUEST-----"  │
         │    }                                             │
         ├──────────────────────────────────────────────────►
         │                                                  │
         │    5. Backend validates token & creates          │
         │       device record in database                  │
         │                                                  │
         │ 6. Registration Response                         │
         │    {                                             │
         │      device_id: 123,                             │
         │      certificate: "-----BEGIN CERTIFICATE-----", │
         │      ca_certificate: "...",                      │
         │      mqtt_endpoint: "...",                       │
         │      iot_thing_name: "roboticks-dev-123"         │
         │    }                                             │
         │◄─────────────────────────────────────────────────┤
         │                                                  │
         │ 7. Store certificate & connect to AWS IoT Core   │
         │                                                  │
         │ 8. Send Heartbeat (every 30s)                    │
         │    POST /api/v1/fleet/heartbeat                  │
         │    {                                             │
         │      device_id: 123,                             │
         │      status: "online",                           │
         │      metrics: {...}                              │
         │    }                                             │
         ├──────────────────────────────────────────────────►
         │                                                  │
         │ 9. Poll for Commands (every 10s)                 │
         │    GET /api/v1/fleet/devices/123/commands        │
         ├──────────────────────────────────────────────────►
         │                                                  │
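
Steps 1 and 5 of the flow (issuing the registration token and validating it at registration) can be sketched with the standard library alone. This assumes an HS256-signed JWT, which is consistent with the sample token prefix in Section 2.3; the claim names (sub, account_id), the function names, and the secret handling are hypothetical. In the backend itself, the pyjwt package listed in Section 5.3 would normally replace this hand-rolled encoding.

```python
import base64
import hashlib
import hmac
import json


def _b64url(data: bytes) -> str:
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def _b64url_decode(text: str) -> bytes:
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))


def issue_registration_token(secret: bytes, dsn: str, account_id: str,
                             now: float, ttl_s: int = 24 * 3600) -> str:
    """Build a compact HS256 JWT bound to one DSN, expiring after ttl_s seconds."""
    header = {"alg": "HS256", "typ": "JWT"}
    claims = {"sub": dsn, "account_id": account_id, "exp": int(now) + ttl_s}
    parts = [_b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
             for part in (header, claims)]
    signing_input = ".".join(parts)
    signature = hmac.new(secret, signing_input.encode("ascii"), hashlib.sha256).digest()
    return signing_input + "." + _b64url(signature)


def verify_registration_token(secret: bytes, token: str, dsn: str, now: float):
    """Return the claims if the token is authentic, unexpired, and issued for dsn."""
    try:
        header_b64, claims_b64, sig_b64 = token.split(".")
        expected = hmac.new(secret, f"{header_b64}.{claims_b64}".encode("ascii"),
                            hashlib.sha256).digest()
        if not hmac.compare_digest(expected, _b64url_decode(sig_b64)):
            return None
        header = json.loads(_b64url_decode(header_b64))
        claims = json.loads(_b64url_decode(claims_b64))
    except (ValueError, TypeError):
        return None  # malformed token
    if header.get("alg") != "HS256":
        return None
    if claims.get("sub") != dsn or claims.get("exp", 0) <= now:
        return None
    return claims
```

Binding the token to the DSN means a leaked capsule cannot be used to register a different device, and the 24h expiry limits the window in which it can be used at all.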

2.2 Device Types & Environment Categories

-- In database schema
CREATE TYPE device_environment AS ENUM ('production', 'testing', 'development');
CREATE TYPE device_type AS ENUM ('drone', 'robot', 'camera', 'sensor', 'vehicle', 'other');

ALTER TABLE fleet_devices ADD COLUMN environment device_environment DEFAULT 'testing';

2.3 Capsule Structure

# capsule_manifest.yaml (embedded in deployment capsule)
device:
  dsn: "DEV-ABC-12345"
  type: "drone"
  environment: "testing"  # or "production"

fleet:
  account_id: "acc_123456"
  backend_url: "https://api.roboticks.io"
  registration_token: "eyJhbGciOiJIUzI1NiIs..."  # Expires in 24h

iot:
  mqtt_endpoint: "a1b2c3d4e5f6g7.iot.us-west-2.amazonaws.com"
  ca_certificate: |
    -----BEGIN CERTIFICATE-----
    ...
    -----END CERTIFICATE-----

composition:
  name: "DroneNav_v2.1"
  version: "2.1.0"
  modules: [...]

3. Cloud Backend Monitoring

3.1 Database Schema Extensions

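# New models to add to backend
# Assumes Base is the backend's existing declarative base and DeviceType is a
# Python enum mirroring the device_type values from Section 2.2.
from sqlalchemy import (BigInteger, Column, DateTime, Enum, Float, ForeignKey,
                        Integer, JSON, String, Text)
from sqlalchemy.dialects.postgresql import TSVECTOR
from sqlalchemy.orm import relationship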

class FleetDevice(Base):
    __tablename__ = "fleet_devices"

    id = Column(Integer, primary_key=True)
    name = Column(String, nullable=False)
    dsn = Column(String, unique=True, nullable=False)  # Device Serial Number
    device_type = Column(Enum(DeviceType), nullable=False)
    environment = Column(Enum('production', 'testing', 'development', name='device_environment'), default='testing')  # PostgreSQL enum types need a name

    # Status
    status = Column(Enum('online', 'offline', 'maintenance', 'error', name='device_status'), default='offline')
    last_seen_at = Column(DateTime(timezone=True), nullable=True)
    last_heartbeat_at = Column(DateTime(timezone=True), nullable=True)

    # IoT Integration
    iot_thing_name = Column(String, unique=True, nullable=True)
    iot_certificate_id = Column(String, nullable=True)

    # Current state
    current_session_id = Column(Integer, ForeignKey("sessions.id"), nullable=True)
    current_capsule_id = Column(Integer, ForeignKey("capsules.id"), nullable=True)

    # Metrics
    cpu_usage = Column(Float, nullable=True)
    memory_usage = Column(Float, nullable=True)
    disk_usage = Column(Float, nullable=True)
    battery_level = Column(Integer, nullable=True)
    temperature = Column(Float, nullable=True)

    # Relationships
    owner_id = Column(Integer, ForeignKey("users.id"), nullable=False)
    sessions = relationship("Session", back_populates="device", foreign_keys="Session.device_id")  # disambiguate from current_session_id
    logs = relationship("SessionLog", back_populates="device")


class Session(Base):
    __tablename__ = "sessions"

    id = Column(Integer, primary_key=True)
    session_id = Column(String, unique=True, index=True)  # UUID from device
    name = Column(String, nullable=False)

    device_id = Column(Integer, ForeignKey("fleet_devices.id"), nullable=False)
    capsule_id = Column(Integer, ForeignKey("capsules.id"), nullable=True)

    status = Column(Enum('active', 'completed', 'failed', 'aborted', name='session_status'), default='active')

    started_at = Column(DateTime(timezone=True), nullable=False)
    completed_at = Column(DateTime(timezone=True), nullable=True)
    duration_seconds = Column(Float, nullable=True)

    # Statistics (updated in real-time from logs)
    total_logs = Column(Integer, default=0)
    error_count = Column(Integer, default=0)
    warning_count = Column(Integer, default=0)

    # Storage paths
    log_file_url = Column(String, nullable=True)  # S3 URL
    artifacts_url = Column(String, nullable=True)  # S3 URL for session files

    # Relationships
    device = relationship("FleetDevice", back_populates="sessions", foreign_keys=[device_id])  # two FK paths link these tables
    logs = relationship("SessionLog", back_populates="session")


class SessionLog(Base):
    __tablename__ = "session_logs"

    id = Column(BigInteger, primary_key=True)
    session_id = Column(Integer, ForeignKey("sessions.id"), nullable=False, index=True)
    device_id = Column(Integer, ForeignKey("fleet_devices.id"), nullable=False, index=True)

    timestamp = Column(DateTime(timezone=True), nullable=False, index=True)
    level = Column(Enum('DEBUG', 'INFO', 'WARN', 'ERROR', 'FATAL', name='log_level'), nullable=False, index=True)

    module_name = Column(String, nullable=False, index=True)
    logger_name = Column(String, nullable=False)
    message = Column(Text, nullable=False)

    # Context (JSON blob for filtering)
    context = Column(JSON, default=dict)  # callable default: a fresh dict per row

    # Full-text search
    message_tsv = Column(TSVECTOR, nullable=True)  # sqlalchemy.dialects.postgresql.TSVECTOR

    # Relationships
    session = relationship("Session", back_populates="logs")
    device = relationship("FleetDevice", back_populates="logs")
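
Because status and last_heartbeat_at are stored separately, the status shown in the UI has to account for devices that stopped reporting without ever sending "offline". A minimal sketch of that derivation follows, assuming a device counts as offline after three missed 30-second heartbeats. The threshold and the function name are assumptions, not part of the design above.

```python
from datetime import datetime, timedelta, timezone

HEARTBEAT_INTERVAL = timedelta(seconds=30)  # from the onboarding flow (step 8)
MISSED_BEFORE_OFFLINE = 3                   # assumption: tolerate two lost heartbeats


def effective_status(reported_status, last_heartbeat_at, now):
    """Combine the stored status with heartbeat freshness."""
    if reported_status == "maintenance":
        return "maintenance"  # operator-set state wins over heartbeat freshness
    if last_heartbeat_at is None:
        return "offline"      # registered but never heard from
    if now - last_heartbeat_at > HEARTBEAT_INTERVAL * MISSED_BEFORE_OFFLINE:
        return "offline"
    return reported_status
```

The same check could run as a scheduled task that writes "offline" back to fleet_devices.status and triggers the device-offline alarm.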

3.2 Real-Time Log Streaming Architecture

Device Edge:
┌──────────────┐
│ Session Mgr  │
│              │
│ Log Buffer   │──► Every 5s or 100 logs
└──────┬───────┘
       │
       │ HTTPS POST (batch)
       ▼
┌──────────────────────────────────────┐
│      AWS API Gateway (REST)          │
│   POST /devices/{device_id}/logs     │
└──────────────┬───────────────────────┘
               │
               ▼
┌──────────────────────────────────────┐
│    AWS Lambda: LogIngestion          │
│    - Validates device authentication │
│    - Enriches log metadata           │
│    - Publishes to processing queue   │
└──────────────┬───────────────────────┘
               │
       ┌───────┴───────┐
       │               │
       ▼               ▼
┌─────────────┐  ┌────────────────┐
│  Amazon SQS │  │   DynamoDB     │
│  (Fan-out)  │  │ (Hot storage)  │
└──────┬──────┘  │ Recent logs    │
       │         │ (Last 7 days)  │
       │         └────────────────┘
       │
   ┌───┴────┐
   │        │
   ▼        ▼
┌────┐  ┌──────────────┐
│ S3 │  │   RDS        │
│Log │  │  (Postgres)  │
│Arc │  │  session_logs│
└────┘  └──────┬───────┘
                │
                │ WebSocket Push
                ▼
        ┌───────────────┐
        │  API Gateway  │
        │  (WebSocket)  │
        └───────┬───────┘
                │
                ▼
        ┌───────────────┐
        │  Frontend UI  │
        │  (Real-time)  │
        └───────────────┘
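
The LogIngestion step in the pipeline above (validate, enrich, publish) can be sketched as a pure function. The event shape follows the API Gateway Lambda proxy format (pathParameters, body); the required field list, the received_at field, and the injected publish callable are assumptions. Device authentication is deliberately left out of this sketch.

```python
import json

VALID_LEVELS = {"DEBUG", "INFO", "WARN", "ERROR", "FATAL"}
# Assumption: the minimum fields a log entry must carry to be stored.
REQUIRED_FIELDS = ("session_id", "module_name", "timestamp_us", "level", "message")


def _is_valid(entry):
    return (isinstance(entry, dict)
            and all(key in entry for key in REQUIRED_FIELDS)
            and isinstance(entry["level"], str)
            and entry["level"] in VALID_LEVELS)


def _response(status, payload):
    return {"statusCode": status, "body": json.dumps(payload)}


def ingest_logs(event, publish, received_at):
    """Validate a batch of logs, enrich each entry, and pass it to publish()."""
    device_id = (event.get("pathParameters") or {}).get("device_id")
    try:
        logs = json.loads(event.get("body") or "")
    except ValueError:
        return _response(400, {"error": "body is not valid JSON"})
    if not device_id or not isinstance(logs, list):
        return _response(400, {"error": "expected device_id and a JSON array of logs"})

    accepted = rejected = 0
    for entry in logs:
        if not _is_valid(entry):
            rejected += 1
            continue
        # Enrichment: attach server-side metadata the device cannot be trusted to set.
        publish({**entry, "device_id": device_id, "received_at": received_at})
        accepted += 1
    return _response(202, {"accepted": accepted, "rejected": rejected})
```

In a deployed Lambda, publish would be a function that writes to the processing queue. Keeping it injectable lets the validation logic be tested without AWS access.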

3.3 API Endpoints

# Fleet Management APIs
POST   /api/v1/fleet/register                    # Device registration
POST   /api/v1/fleet/heartbeat                   # Device heartbeat
GET    /api/v1/fleet/devices                     # List devices (with filters)
GET    /api/v1/fleet/devices/{device_id}         # Get device details
PUT    /api/v1/fleet/devices/{device_id}         # Update device
DELETE /api/v1/fleet/devices/{device_id}         # Decommission device
GET    /api/v1/fleet/devices/{device_id}/commands # Poll commands
POST   /api/v1/fleet/commands                    # Send command to device

# Session Management APIs
GET    /api/v1/sessions                          # List sessions (with filters)
GET    /api/v1/sessions/{session_id}             # Get session details
GET    /api/v1/sessions/{session_id}/files       # List session files (file manager view)
GET    /api/v1/sessions/{session_id}/files/download # Download session file
POST   /api/v1/sessions/{session_id}/artifacts   # Upload session artifacts (from device)

# Log Streaming APIs
POST   /api/v1/logs/ingest                       # Ingest logs from devices (batch)
GET    /api/v1/logs                              # Query logs (with filters)
GET    /api/v1/logs/stream                       # WebSocket for real-time logs
GET    /api/v1/logs/export                       # Export logs (CSV/JSON)

# Capsule Generation (for onboarding)
POST   /api/v1/capsules/generate                 # Generate device capsule with token

4. Frontend Views Design

4.1 Fleet Management View

Template Base: user-list / user-grid (supports list/grid toggle)

Features:

  • Environment Tabs: Production | Testing | Development
  • View Toggle: List view (table) | Grid view (cards) - Grid is default
  • Status Filters: Online | Offline | Maintenance | Error
  • Device Type Filters: All | Drone | Robot | Camera | Sensor | Vehicle

Grid View (Default)

┌──────────────────────────────────────────────────────────────┐
│  Fleet Devices                    [Production][Testing][Dev] │
│                                   [ Grid View ][ List View ]  │
├──────────────────────────────────────────────────────────────┤
│  Filters: [All Types ▼] [All Status ▼] [Search...]          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────┐  ┌─────────────────┐  ┌──────────────┐│
│  │ 🟢 Drone-001    │  │ 🟢 Robot-A23    │  │ 🔴 Sensor-5  ││
│  │                 │  │                 │  │              ││
│  │ Type: Drone     │  │ Type: Robot     │  │ Type: Sensor ││
│  │ Status: Online  │  │ Status: Online  │  │ Status: Off  ││
│  │ Last seen: 2m   │  │ Last seen: 30s  │  │ Last: 2h ago ││
│  │ Battery: 87%    │  │ CPU: 45%        │  │ Battery: 0%  ││
│  │                 │  │                 │  │              ││
│  │ Current Session │  │ Idle            │  │ Maintenance  ││
│  │ nav_mission_42  │  │                 │  │              ││
│  │                 │  │                 │  │              ││
│  │ [View][Edit][⋮] │  │ [View][Edit][⋮] │  │ [View][Edit]││
│  └─────────────────┘  └─────────────────┘  └──────────────┘│
│                                                              │
│  ┌─────────────────┐  ┌─────────────────┐                  │
│  │ 🟢 Camera-X1    │  │ 🟡 Drone-002    │    ...           │
│  │ ...             │  │ ...             │                  │
└──────────────────────────────────────────────────────────────┘

List View

Standard table with columns:
  • Name | Type | Environment | Status | Last Seen | Battery/CPU | Current Session | Actions

Device Detail Page

  • Device info, metrics charts, command history
  • Active session (if any) with link to session view
  • Recent logs (last 50 lines) with link to full log view
  • Button: “View All Sessions”
  • Button: “Send Command”

4.2 Sessions View

Template Base: packages-table (modified)

Features:

  • Clickable Rows: Click any session to navigate
  • Two Action Buttons per Row:
    1. Files 📁 → Navigate to File Manager view
    2. Logs 📄 → Navigate to Logs view with session filter pre-applied
  • Filters: Device | Status | Date Range | Capsule
  • Status Indicators: 🟢 Active | 🔵 Completed | 🔴 Failed | ⚫ Aborted

┌──────────────────────────────────────────────────────────────────────────┐
│  Sessions                                                                │
├──────────────────────────────────────────────────────────────────────────┤
│  Filters: [All Devices ▼] [All Status ▼] [Last 7 Days ▼]               │
├──────────────────────────────────────────────────────────────────────────┤
│  Session ID    │ Device    │ Started     │ Duration │ Status  │ Actions │
├────────────────┼───────────┼─────────────┼──────────┼─────────┼─────────┤
│ nav_mission_42 │ Drone-001 │ 2h ago      │ 1h 23m   │ 🟢 ACT  │ 📁 📄  │
│ test_run_001   │ Robot-A23 │ 5h ago      │ 45m      │ 🔵 COMP │ 📁 📄  │
│ sensor_cal_3   │ Sensor-5  │ 1d ago      │ 2m 15s   │ 🔴 FAIL │ 📁 📄  │
│ ...            │ ...       │ ...         │ ...      │ ...     │ ...    │
└──────────────────────────────────────────────────────────────────────────┘

Session Files View (File Manager Template)

When the user clicks the 📁 button:

┌──────────────────────────────────────────────────────────────┐
│  Session: nav_mission_42 > Files                             │
│  Device: Drone-001 | Status: Active | Started: 2h ago        │
├──────────────────────────────────────────────────────────────┤
│  📁 logs/                                                    │
│    📄 session.log                         2.3 MB   [Download]│
│    📄 module_input.log                    456 KB  [Download]│
│    📄 module_navigation.log               1.1 MB  [Download]│
│  📁 data/                                                    │
│    📄 waypoints.json                      12 KB   [Download]│
│    📄 telemetry.csv                       543 KB  [Download]│
│  📁 screenshots/                                             │
│    🖼 frame_0001.png                       234 KB  [View]   │
│    🖼 frame_0002.png                       241 KB  [View]   │
└──────────────────────────────────────────────────────────────┘

4.3 Logs View

Real-Time Log Streaming with Filtering

Features:

  • Real-Time Updates: WebSocket connection, new logs appear automatically
  • Lazy Loading: Shows last 100 lines, scroll up to load more (paginated)
  • Multi-Level Filtering:
    • Device (dropdown)
    • Session (dropdown)
    • Module (dropdown)
    • Log Level (DEBUG/INFO/WARN/ERROR/FATAL checkboxes)
    • Date/Time Range
    • Text Search (searches message content)
  • Export: Download filtered logs as CSV/JSON
  • Auto-Scroll: Toggle to follow latest logs
  • Color Coding: Levels have different colors (ERROR=red, WARN=yellow, INFO=blue, DEBUG=gray)

┌──────────────────────────────────────────────────────────────────────────┐
│  Logs - Real-Time Stream                            [Auto-scroll: ON ✓] │
├──────────────────────────────────────────────────────────────────────────┤
│  Filters:                                                                │
│  Device: [Drone-001 ▼]  Session: [nav_mission_42 ▼]  Module: [All ▼]   │
│  Levels: [✓ ERROR] [✓ WARN] [✓ INFO] [ ] DEBUG                         │
│  Search: [____________]  Date: [Last Hour ▼]  [Export CSV] [Export JSON]│
├──────────────────────────────────────────────────────────────────────────┤
│  Timestamp          │ Level │ Module      │ Message                     │
├─────────────────────┼───────┼─────────────┼─────────────────────────────┤
│ 2025-01-03 14:32:15 │ INFO  │ navigation  │ Waypoint reached: WP-5      │
│ 2025-01-03 14:32:14 │ WARN  │ input       │ GPS signal weak (3 sats)    │
│ 2025-01-03 14:32:12 │ ERROR │ processing  │ Obstacle detection timeout  │
│ 2025-01-03 14:32:10 │ INFO  │ session_mgr │ Session active for 1h 23m   │
│ ...                 │ ...   │ ...         │ ...                         │
│ ▼ Load More (scroll up for older logs)                                  │
└──────────────────────────────────────────────────────────────────────────┘

Clicking a log row expands to show full context:

┌──────────────────────────────────────────────────────────────────────────┐
│ 📍 2025-01-03 14:32:12.345678                                           │
│ Level: ERROR                                                             │
│ Module: processing                                                       │
│ Logger: ObstacleDetector                                                 │
│ Thread: worker-thread-3                                                  │
│ Message: Obstacle detection timeout                                      │
│ Context:                                                                 │
│   - camera_id: front_cam                                                 │
│   - timeout_ms: 500                                                      │
│   - retry_count: 3                                                       │
│   - pipeline_id: 42                                                      │
│ [Copy] [View Full Session] [View in Context]                            │
└──────────────────────────────────────────────────────────────────────────┘
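
The multi-level filtering in this view reduces to a single predicate applied to each log row, whether it runs server-side in the query API or against rows arriving over the WebSocket. The sketch below is illustrative: the LogFilter name and the row keys are hypothetical, the default levels mirror the checkboxes in the mockup, and the date/time range filter is omitted for brevity.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LogFilter:
    device: Optional[str] = None    # None means "All"
    session: Optional[str] = None
    module: Optional[str] = None
    levels: frozenset = frozenset({"ERROR", "WARN", "INFO"})
    search: str = ""                # case-insensitive substring of the message

    def matches(self, row: dict) -> bool:
        exact = (("device", self.device), ("session", self.session),
                 ("module", self.module))
        for key, wanted in exact:
            if wanted is not None and row.get(key) != wanted:
                return False
        if row.get("level") not in self.levels:
            return False
        return self.search.lower() in row.get("message", "").lower()
```

Applying the same predicate to both the initial query results and the live stream keeps the two consistent, so a row never appears in real time that would be hidden after a reload.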

5. AWS Services & Technology Stack

5.1 AWS Services Selection

Device Communication & Management

  • AWS IoT Core
    • MQTT/WSS for bi-directional communication
    • Device authentication via X.509 certificates
    • Device shadows for state management
    • Rules engine for routing messages

Data Storage

  • Amazon RDS (PostgreSQL)
    • Primary database for devices, sessions, structured logs
    • Full-text search on logs using PostgreSQL tsvector
    • Connection pooling via PgBouncer
  • Amazon S3
    • Log archive storage (compress logs older than 7 days)
    • Session artifacts (files, screenshots, data dumps)
    • Capsule storage (deployment packages)
    • Lifecycle policies (move to Glacier after 90 days)
  • Amazon DynamoDB (Optional for hot logs)
    • Ultra-fast log ingestion
    • Time-series data (TTL after 7 days)
    • Global secondary indexes for filtering
    • DynamoDB Streams → Firehose → S3

Compute

  • AWS Lambda
    • Log ingestion processor
    • Device command dispatcher
    • Session state machine
    • Scheduled tasks (cleanup, analytics)
  • Amazon ECS Fargate (Existing)
    • Backend API (FastAPI)
    • Frontend serving (Nginx)

API & Networking

  • AWS API Gateway
    • REST API for device/frontend communication
    • WebSocket API for real-time log streaming
    • API keys and usage plans
    • Throttling and quotas
  • AWS VPC
    • Private subnets for RDS, Redis
    • NAT Gateway for outbound traffic
    • VPC endpoints for S3, DynamoDB

Caching & Queuing

  • Amazon ElastiCache (Redis) (Existing)
    • Session caching
    • Real-time metrics aggregation
    • Pub/Sub for WebSocket notifications
  • Amazon SQS
    • Log processing queue
    • Command execution queue
    • Dead-letter queue for failed operations

Monitoring & Observability

  • Amazon CloudWatch
    • Logs (Lambda, ECS, API Gateway)
    • Metrics (custom + service metrics)
    • Alarms for device offline, error rates
    • Dashboards
  • AWS X-Ray
    • Distributed tracing
    • Performance bottleneck identification

5.2 Edge Device Libraries (C++)

HTTP Client

  • libcurl (already available)
    • HTTPS requests with TLS 1.3
    • Certificate pinning
    • Connection pooling

MQTT Client

  • AWS IoT Device SDK for C++
    • Native AWS IoT Core integration
    • Auto-reconnection
    • QoS levels 0 and 1 (AWS IoT Core does not support QoS 2)
    • WebSocket support

JSON Parsing

  • nlohmann/json or RapidJSON
    • Fast JSON serialization
    • Schema validation

Compression

  • zlib or lz4
    • Log compression before upload
    • Reduce bandwidth usage
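
To illustrate the compression step, the sketch below packs a batch of log entries as newline-delimited JSON and compresses it with zlib. It uses Python's zlib module for illustration only; the device would use the C library, and the newline-delimited JSON framing and function names are assumptions.

```python
import json
import zlib


def compress_batch(entries, level=6):
    """Serialize entries as newline-delimited JSON and compress with zlib."""
    raw = "\n".join(json.dumps(entry, sort_keys=True) for entry in entries)
    return zlib.compress(raw.encode("utf-8"), level)


def decompress_batch(blob):
    """Inverse of compress_batch."""
    text = zlib.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.split("\n") if line]
```

Log batches tend to compress well because field names and module names repeat in every entry, but the actual ratio depends on the message content and should be measured on real logs before committing to a target.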

Certificate Management

  • OpenSSL (already available)
    • X.509 certificate handling
    • TLS connections
    • CSR generation

5.3 Backend Libraries (Python)

# requirements.txt additions

# AWS Integration
boto3==1.34.0              # AWS SDK
aioboto3==12.3.0           # Async AWS SDK
awsiotsdk==1.19.0          # AWS IoT Device SDK v2 for Python (PyPI name of aws-iot-device-sdk-python-v2)

# WebSocket Support
websockets==12.0           # WebSocket server
python-socketio==5.10.0    # Socket.IO for real-time

# Background Tasks
celery==5.3.4              # Async task queue
redis==5.0.1               # Celery broker

# Monitoring
sentry-sdk==1.39.1         # Error tracking
prometheus-client==0.19.0  # Metrics

# Additional
pyjwt==2.8.0               # JWT token handling
cryptography==41.0.7       # Certificate operations

5.4 Frontend Libraries (React/TypeScript)

{
  "dependencies": {
    "socket.io-client": "^4.6.0",
    "react-virtualized": "^9.22.5",
    "date-fns": "^2.30.0",
    "recharts": "^2.15.0",
    "react-query": "^3.39.3",
    "zustand": "^4.4.7"
  }
}

Purpose of each package (package.json does not allow comments):
  • socket.io-client: WebSocket for real-time logs
  • react-virtualized: Virtual scrolling for logs
  • date-fns: Date formatting (already installed)
  • recharts: Metrics charts (already installed)
  • react-query: Data fetching/caching
  • zustand: State management (already installed)

6. Implementation Plan

Phase 1: Foundation (Weeks 1-2)

Week 1: Database & Backend Setup

  • Add environment column to fleet_devices table (migration)
  • Create session_logs table with full-text search indexes
  • Implement Fleet API endpoints (/fleet/register, /fleet/heartbeat)
  • Implement Sessions API endpoints (/sessions, /sessions/{id}/files)
  • Set up AWS IoT Core (thing types, policies, certificates)

Week 2: Device-Side Integration

  • Implement RemoteSink in roboticks-logging (publishes to transport)
  • Update SessionManager to subscribe to roboticks/logs topic
  • Implement ring buffer for log collection
  • Add log batching and upload to FleetClient
  • Integrate AWS IoT Device SDK for C++
  • Update capsule generation to include registration token

Phase 2: Real-Time Logging (Weeks 3-4)

Week 3: Log Ingestion Pipeline

  • Create Lambda function for log ingestion
  • Set up SQS queue for log processing
  • Implement log storage to RDS (with batching)
  • Implement log archive to S3 (daily rotation)
  • Set up DynamoDB for hot log storage (optional)

Week 4: Real-Time Streaming

  • Implement WebSocket API in API Gateway
  • Create WebSocket handler in backend (FastAPI)
  • Implement pub/sub using Redis for multi-instance support
  • Build log query API with filtering
  • Implement log export (CSV/JSON)

Phase 3: Frontend Development (Weeks 5-6)

Week 5: Fleet & Sessions Views

  • Create Fleet Management view (grid/list toggle)
  • Add environment tabs (Production/Testing/Development)
  • Implement device detail page with metrics charts
  • Create Sessions table view
  • Implement Session Files view (file manager)

Week 6: Logs View

  • Build real-time logs view with WebSocket connection
  • Implement multi-level filtering (device, session, module, level)
  • Add virtual scrolling for performance
  • Implement lazy loading (pagination)
  • Add export functionality
  • Add auto-scroll toggle

Phase 4: Device Onboarding & Commands (Week 7)

  • Build capsule generation UI
  • Implement device registration flow
  • Create command sending interface
  • Implement command execution on device side
  • Add command history view

Phase 5: Testing & Optimization (Week 8)

  • Load testing (1000+ devices, high log volume)
  • Performance optimization (indexing, caching)
  • Security audit (authentication, authorization, encryption)
  • Documentation (API docs, deployment guide)
  • Monitoring setup (CloudWatch dashboards, alarms)

7. Key Technical Decisions & Rationale

7.1 Why AWS IoT Core over HTTP-only?

  • Bi-directional: Devices can receive commands instantly (not just poll)
  • Connection Management: Automatic reconnection, offline queuing
  • Security: Built-in certificate management, fine-grained policies
  • Scale: Handles millions of devices, message routing

7.2 Why Ring Buffer + Batch Upload?

  • Network Efficiency: Reduce HTTP requests (1 req/5s vs 100 req/5s)
  • Resilience: Buffer survives temporary network outages
  • Performance: Minimal impact on module execution
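
The trade-off can be made concrete with a short calculation. The helper names below are hypothetical and the model is deliberately simple: a steady log rate, with a batch sent on whichever trigger (size or time) fires first.

```python
def outage_coverage_s(buffer_size: int, logs_per_s: float) -> float:
    """Seconds of logs the ring buffer holds before the oldest entry is overwritten."""
    return buffer_size / logs_per_s


def upload_requests_per_min(logs_per_s: float, batch_size: int,
                            batch_interval_s: float) -> float:
    """Upload rate when a batch is sent on the size or time trigger, whichever is first."""
    return 60.0 * max(logs_per_s / batch_size, 1.0 / batch_interval_s)
```

At 20 logs/s (the 100 logs per 5 s ratio quoted above), the default 10,000-entry buffer covers 500 s of network outage and uploads run at 12 requests/min instead of 1,200. At the 1000 logs/sec target in Section 10, the same buffer covers only 10 s, so buffer_size would need to grow with the expected log rate.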

7.3 Why PostgreSQL Full-Text Search over ElasticSearch?

  • Simplicity: No additional service to manage
  • Cost: Included with RDS
  • Performance: tsvector + GIN indexes are expected to handle 100K+ logs/min (to be confirmed by the Phase 5 load tests)
  • Fallback: Can migrate to ElasticSearch later if needed

7.4 Why WebSocket for Log Streaming?

  • Low Latency: Sub-second log delivery to UI
  • Efficiency: Single persistent connection vs HTTP polling
  • User Experience: True real-time feel

8. Security Considerations

8.1 Device Security

  • ✅ X.509 certificates for authentication (not passwords)
  • ✅ Registration tokens expire after 24h
  • ✅ TLS 1.3 for all communication
  • ✅ Device-specific AWS IoT policies (least privilege)
  • ✅ Secure storage of certificates on device (encrypted partition)

8.2 Backend Security

  • ✅ JWT authentication for all API endpoints
  • ✅ Role-based access control (users can only see their devices)
  • ✅ API rate limiting (prevent abuse)
  • ✅ Input validation on all endpoints
  • ✅ SQL injection prevention (SQLAlchemy ORM)

8.3 Data Security

  • ✅ Encryption at rest (S3, RDS, DynamoDB)
  • ✅ Encryption in transit (TLS everywhere)
  • ✅ Log data retention policy (90 days active, then archive)
  • ✅ PII scrubbing from logs (if applicable)

9. Cost Estimation (Monthly, for 100 devices)

Service              Usage                      Cost
AWS IoT Core         100 devices × 24h × 30d    ~$50
RDS (db.t4g.micro)   Free tier                  $0
S3                   50 GB logs/artifacts       ~$1.15
Lambda               1M invocations             ~$0.20
API Gateway          1M requests                ~$3.50
SQS                  1M messages                $0.40
CloudWatch Logs      10 GB                      ~$5
Total                                           ~$60/month

Scales linearly: ~$0.60/device/month
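
The total and the per-device figure can be reproduced from the line items. The sketch below simply encodes the estimates from the table; the function names are hypothetical, and the linear projection is the document's own simplifying assumption rather than a pricing model.

```python
# Monthly estimates in USD for 100 devices, copied from the table above.
MONTHLY_COST_USD = {
    "AWS IoT Core": 50.00,
    "RDS (db.t4g.micro)": 0.00,  # free tier
    "S3": 1.15,
    "Lambda": 0.20,
    "API Gateway": 3.50,
    "SQS": 0.40,
    "CloudWatch Logs": 5.00,
}
BASELINE_DEVICES = 100


def total_monthly_cost() -> float:
    return sum(MONTHLY_COST_USD.values())


def projected_monthly_cost(devices: int) -> float:
    """Linear projection from the 100-device baseline."""
    return total_monthly_cost() / BASELINE_DEVICES * devices
```

The line items sum to $60.25, or about $0.60 per device. The projection is least reliable for the RDS line, since the free tier is time-limited and a db.t4g.micro instance is unlikely to keep up with log volume at larger fleet sizes.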

10. Success Metrics

  • Reliability: 99.9% uptime for device connections
  • Latency: < 2s from log generation to UI display
  • Throughput: Support 1000 logs/sec per device
  • Storage: Compress logs to 30% original size
  • User Experience: < 500ms page load for logs view

Appendix A: Example Configuration Files

Device Capsule Config

# /opt/roboticks/config/fleet.yaml
fleet:
  enabled: true
  backend_url: "https://api.roboticks.io"
  dsn: "DEV-ABC-12345"
  environment: "testing"
  registration_token: "eyJhbGciOiJIUzI1NiIs..."

  heartbeat_interval_ms: 30000  # 30 seconds
  command_poll_interval_ms: 10000  # 10 seconds

  log_upload:
    enabled: true
    batch_size: 100
    batch_interval_ms: 5000
    buffer_size: 10000

  certificates:
    ca_cert_path: "/opt/roboticks/certs/ca.pem"
    client_cert_path: "/opt/roboticks/certs/device.pem"
    client_key_path: "/opt/roboticks/certs/device.key"

Backend Environment Variables

# .env additions
AWS_IOT_ENDPOINT=a1b2c3d4e5f6g7.iot.us-west-2.amazonaws.com
AWS_IOT_CA_CERT_PATH=/app/certs/AmazonRootCA1.pem

# Log storage
LOG_STORAGE_BACKEND=rds  # or dynamodb
LOG_ARCHIVE_S3_BUCKET=roboticks-log-archive
LOG_HOT_RETENTION_DAYS=7

# WebSocket
WEBSOCKET_ENABLED=true
WEBSOCKET_MAX_CONNECTIONS=1000

# Redis (for WebSocket pub/sub)
REDIS_URL=redis://roboticks-redis:6379/0

Conclusion

This architecture provides a robust, scalable foundation for:
  1. ✅ Real-time log collection and streaming
  2. ✅ Secure device onboarding and management
  3. ✅ Comprehensive fleet monitoring
  4. ✅ Production/test environment separation
  5. ✅ Remote command execution
  6. ✅ Session tracking and artifact management

The design leverages proven AWS services, maintains security best practices, and provides an excellent user experience through real-time updates and intuitive filtering.

Next Steps: Review this design, provide feedback, and proceed with Phase 1 implementation.