MosisService/docs/DEV_PORTAL_M08_TELEMETRY.md

Milestone 8: Telemetry System

Status: Decided
Goal: Collect app usage analytics and crash reports while respecting privacy.

Decision

SQLite with background aggregation for self-hosted Synology NAS:

Storage:      SQLite (separate telemetry.db to isolate write load)
Aggregation:  Go background goroutine (hourly/daily rollups)
Retention:    Raw events 7 days, aggregates long-term (see Data Retention)
Privacy:      Hashed device IDs, no PII, opt-out available

Rationale

  1. Simple - No separate time-series database needed
  2. SQLite scales - Can handle thousands of events/day easily
  3. Background jobs - Go goroutines for aggregation, cleanup
  4. Separate DB - Telemetry writes don't affect main portal.db
  5. Privacy-first - Minimal collection, hashed IDs

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                     mosis-portal container                       │
│  ┌────────────────────────────────────────────────────────────┐ │
│  │                      Go Binary                              │ │
│  │  ┌─────────────┐    ┌────────────────┐                     │ │
│  │  │ API Handler │───►│ Telemetry Svc  │                     │ │
│  │  │ POST /v1/   │    │ - Buffer events│                     │ │
│  │  │ telemetry/* │    │ - Batch insert │                     │ │
│  │  └─────────────┘    └───────┬────────┘                     │ │
│  │                             │                               │ │
│  │  ┌─────────────────────────▼────────────────────────────┐  │ │
│  │  │              Background Workers                       │  │ │
│  │  │  • Hourly aggregation (event counts, unique devices)  │  │ │
│  │  │  • Daily cleanup (delete raw events > 7 days)         │  │ │
│  │  │  • Crash grouping (fingerprint + dedup)               │  │ │
│  │  └───────────────────────────────────────────────────────┘  │ │
│  └──────────────────────────────┬─────────────────────────────┘ │
│                                  │                               │
│  /volume1/mosis/data/           │                               │
│  ├── portal.db     (main)       │                               │
│  └── telemetry.db  ◄────────────┘                               │
└─────────────────────────────────────────────────────────────────┘
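The Telemetry Svc box above buffers incoming events in memory and writes them to telemetry.db in batches. A minimal sketch of that buffering layer, assuming an injected `insertBatch` function that performs one multi-row INSERT (names here are illustrative, not the actual implementation):

```go
package main

import "sync"

// Event is one telemetry event as received from a device.
type Event struct {
	AppID     string
	DeviceID  string
	EventType string
	Timestamp string
}

// TelemetryService buffers events and hands them to insertBatch
// once the buffer reaches maxBuf (or on an explicit Flush).
type TelemetryService struct {
	mu          sync.Mutex
	buf         []Event
	maxBuf      int
	insertBatch func([]Event) error
}

func NewTelemetryService(maxBuf int, insert func([]Event) error) *TelemetryService {
	return &TelemetryService{maxBuf: maxBuf, insertBatch: insert}
}

// Track appends an event; a full buffer triggers a flush.
func (s *TelemetryService) Track(e Event) error {
	s.mu.Lock()
	s.buf = append(s.buf, e)
	full := len(s.buf) >= s.maxBuf
	s.mu.Unlock()
	if full {
		return s.Flush()
	}
	return nil
}

// Flush writes the buffered events in one batch and clears the buffer.
func (s *TelemetryService) Flush() error {
	s.mu.Lock()
	batch := s.buf
	s.buf = nil
	s.mu.Unlock()
	if len(batch) == 0 {
		return nil
	}
	return s.insertBatch(batch)
}
```

Batching keeps SQLite write amplification low: one transaction per batch instead of one per event.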

Overview

Telemetry gives developers insight into app usage, performance, and crashes. The design must balance usefulness with user privacy.


Privacy Principles

  1. Minimal collection - Only what's necessary
  2. No PII by default - Anonymized device IDs
  3. Transparency - Users know what's collected
  4. Opt-out available - Users can disable
  5. Data retention limits - Auto-delete old data
  6. GDPR compliance - Export/delete on request

Event Types

Automatic Events (Default)

| Event     | Description       | Data                          |
|-----------|-------------------|-------------------------------|
| app_start | App launched      | version, mosis_version        |
| app_stop  | App closed        | duration_seconds              |
| app_crash | Unhandled error   | crash_type, message           |
| lua_error | Lua runtime error | message, stack (no user data) |

Performance Events (Default)

| Event        | Description          | Data              |
|--------------|----------------------|-------------------|
| perf_frame   | Frame time (sampled) | avg_ms, p95_ms    |
| perf_memory  | Memory usage         | used_mb, limit_mb |
| perf_startup | Startup time         | duration_ms       |

Usage Events (Opt-in)

| Event        | Description       | Data         |
|--------------|-------------------|--------------|
| screen_view  | Screen navigation | screen_name  |
| button_click | UI interaction    | element_id   |
| feature_used | Feature usage     | feature_name |

Data Schema

Event Payload

{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "session_id": "uuid",
  "events": [
    {
      "type": "app_start",
      "timestamp": "2024-01-15T10:30:00Z",
      "data": {}
    },
    {
      "type": "screen_view",
      "timestamp": "2024-01-15T10:30:05Z",
      "data": {
        "screen_name": "home"
      }
    }
  ]
}

Crash Report Payload

{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "timestamp": "2024-01-15T10:35:00Z",
  "crash": {
    "type": "lua_error",
    "message": "attempt to index nil value 'user'",
    "stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk",
    "context": {
      "screen": "profile.rml",
      "memory_mb": 45,
      "uptime_seconds": 300
    }
  }
}

Device ID Hashing

-- On device
local raw_id = get_android_id() -- or similar
local hashed = sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."

-- Cannot reverse to original device ID
-- Different per app (can't track across apps)

Collection Architecture

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Device  │────►│  Batch   │────►│   API    │────►│ Storage  │
│          │     │  Queue   │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
                      │
                      │ Every 60s or
                      │ on app close
                      ▼
                 ┌──────────┐
                 │  Upload  │
                 └──────────┘

Client-Side Batching

-- TelemetryManager on device
local events = {}
local last_flush = os.time()

function track(event_type, data)
    if not telemetry_enabled then return end

    table.insert(events, {
        type = event_type,
        timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
        data = data or {}
    })

    -- Flush if the batch is large or enough time has elapsed
    if #events >= 50 or (os.time() - last_flush) > 60 then
        flush()
    end
end

function flush()
    if #events == 0 then return end

    -- Payload fields mirror the Event Payload schema above
    local payload = {
        app_id = APP_ID,
        app_version = APP_VERSION,
        mosis_version = MOSIS_VERSION,
        device_id = HASHED_DEVICE_ID,
        session_id = SESSION_ID,
        events = events
    }

    -- Async HTTP POST
    http.post(TELEMETRY_URL, json.encode(payload))

    events = {}
    last_flush = os.time()
end

Storage (SQLite)

Telemetry Database Schema

-- telemetry.db (separate from portal.db)

-- Raw events (7-day retention)
CREATE TABLE events (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    app_id TEXT NOT NULL,
    device_id TEXT NOT NULL,  -- SHA256 hashed
    session_id TEXT,
    event_type TEXT NOT NULL,
    event_data TEXT,          -- JSON string
    app_version TEXT,
    mosis_version TEXT,
    timestamp TEXT NOT NULL   -- ISO8601
);

CREATE INDEX idx_events_app_time ON events(app_id, timestamp);
CREATE INDEX idx_events_type ON events(event_type, timestamp);

-- Hourly aggregates (computed by background job)
CREATE TABLE hourly_stats (
    app_id TEXT NOT NULL,
    hour TEXT NOT NULL,       -- YYYY-MM-DDTHH
    event_type TEXT NOT NULL,
    count INTEGER NOT NULL,
    unique_devices INTEGER NOT NULL,
    PRIMARY KEY (app_id, hour, event_type)
);

-- Daily aggregates (computed from hourly)
CREATE TABLE daily_stats (
    app_id TEXT NOT NULL,
    date TEXT NOT NULL,       -- YYYY-MM-DD
    event_type TEXT NOT NULL,
    count INTEGER NOT NULL,
    unique_devices INTEGER NOT NULL,
    PRIMARY KEY (app_id, date, event_type)
);

-- Crash groups (deduplicated by fingerprint)
CREATE TABLE crash_groups (
    id TEXT PRIMARY KEY,
    app_id TEXT NOT NULL,
    fingerprint TEXT NOT NULL,
    crash_type TEXT NOT NULL,
    message TEXT,
    sample_stack_trace TEXT,
    first_seen TEXT NOT NULL,
    last_seen TEXT NOT NULL,
    occurrence_count INTEGER DEFAULT 1,
    affected_versions TEXT,   -- JSON array
    status TEXT DEFAULT 'open',
    UNIQUE(app_id, fingerprint)
);

CREATE INDEX idx_crashes_app ON crash_groups(app_id, status);

Go Background Workers

// Start background workers
func (s *TelemetryService) StartWorkers(ctx context.Context) {
    // Hourly aggregation
    go s.runPeriodic(ctx, time.Hour, s.aggregateHourly)

    // Daily aggregation (run at 2am)
    go s.runDaily(ctx, 2, s.aggregateDaily)

    // Cleanup old events (run at 3am)
    go s.runDaily(ctx, 3, s.cleanupOldEvents)
}

func (s *TelemetryService) aggregateHourly(ctx context.Context) error {
    // UTC so the bucket matches the Z-suffixed ISO8601 timestamps in events
    hour := time.Now().UTC().Add(-time.Hour).Format("2006-01-02T15")

    _, err := s.db.ExecContext(ctx, `
        INSERT OR REPLACE INTO hourly_stats (app_id, hour, event_type, count, unique_devices)
        SELECT
            app_id,
            strftime('%Y-%m-%dT%H', timestamp) as hour,
            event_type,
            COUNT(*) as count,
            COUNT(DISTINCT device_id) as unique_devices
        FROM events
        WHERE strftime('%Y-%m-%dT%H', timestamp) = ?
        GROUP BY app_id, hour, event_type
    `, hour)
    return err
}

func (s *TelemetryService) cleanupOldEvents(ctx context.Context) error {
    // UTC cutoff so the string comparison against stored timestamps is valid
    cutoff := time.Now().UTC().AddDate(0, 0, -7).Format(time.RFC3339)
    _, err := s.db.ExecContext(ctx,
        "DELETE FROM events WHERE timestamp < ?", cutoff)
    return err
}

Aggregation

Pre-computed Metrics

| Metric                  | Granularity | Retention |
|-------------------------|-------------|-----------|
| Daily active users      | Day         | 2 years   |
| Event counts            | Day         | 1 year    |
| Crash counts            | Day         | 1 year    |
| Session duration        | Day         | 90 days   |
| Performance percentiles | Day         | 90 days   |

Aggregation Queries

-- Daily active users (SQLite, against the raw events table)
SELECT
    date(timestamp) AS day,
    COUNT(DISTINCT device_id) AS dau
FROM events
WHERE app_id = ?
  AND event_type = 'app_start'
  AND timestamp > strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-30 days')
GROUP BY day
ORDER BY day;

-- Crash rate by version (SQLite booleans are 0/1, so SUM counts matches)
SELECT
    app_version,
    SUM(event_type = 'app_crash') AS crashes,
    SUM(event_type = 'app_start') AS starts,
    ROUND(
        100.0 * SUM(event_type = 'app_crash') /
        NULLIF(SUM(event_type = 'app_start'), 0),
        2
    ) AS crash_rate
FROM events
WHERE app_id = ?
  AND timestamp > strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days')
GROUP BY app_version;

Crash Grouping

Stack Trace Fingerprinting

// Requires imports: crypto/sha256, encoding/hex, fmt, regexp
func fingerprintCrash(crash CrashReport) string {
    // Normalize stack trace
    normalized := normalizeStackTrace(crash.StackTrace)

    // Hash key components
    key := fmt.Sprintf("%s:%s:%s",
        crash.CrashType,
        crash.Message,
        normalized,
    )

    sum := sha256.Sum256([]byte(key))
    return hex.EncodeToString(sum[:])[:16]
}

func normalizeStackTrace(stack string) string {
    // Remove line numbers (they change with code updates)
    // Remove memory addresses
    // Keep function names and file names
    re := regexp.MustCompile(`:\d+:`)
    return re.ReplaceAllString(stack, ":?:")
}
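A quick check of the idea: two crashes whose stacks differ only in line numbers should collapse to one fingerprint, while a different message should not. A self-contained version of the normalization and hashing above (function names here are standalone, not the service's):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

var lineNumRe = regexp.MustCompile(`:\d+:`)

// normalize strips line numbers so fingerprints survive code changes.
func normalize(stack string) string {
	return lineNumRe.ReplaceAllString(stack, ":?:")
}

// fingerprint hashes crash type, message, and the normalized stack,
// keeping the first 16 hex characters as the group key.
func fingerprint(crashType, message, stack string) string {
	key := fmt.Sprintf("%s:%s:%s", crashType, message, normalize(stack))
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}
```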

Crash Groups Table

Crash groups are stored in the crash_groups table defined in the SQLite schema above (see Storage); deduplication relies on its UNIQUE(app_id, fingerprint) constraint, with occurrence_count and last_seen updated on each match.

Developer Dashboard

Metrics View

┌─────────────────────────────────────────────────────────────┐
│  Analytics - My Calculator                                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Date Range: [Last 30 days ▼]                               │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Daily Users │  │   Crashes   │  │ Crash-free  │         │
│  │    1,234    │  │     23      │  │   98.1%     │         │
│  │   ▲ +12%    │  │   ▼ -45%    │  │   ▲ +2%     │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                             │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Daily Active Users                                │    │
│  │  [Line chart showing DAU over time]                │    │
│  └────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Version Distribution                              │    │
│  │  [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%]│   │
│  └────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Crashes View

┌─────────────────────────────────────────────────────────────┐
│  Crashes - My Calculator                                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Filter: [All versions ▼] [Open ▼]                          │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ ● attempt to index nil value 'user'                  │  │
│  │   lua_error • 156 occurrences • v1.2.0              │  │
│  │   First: Jan 10 • Last: Jan 15                       │  │
│  │                                               [View] │  │
│  ├──────────────────────────────────────────────────────┤  │
│  │ ● memory limit exceeded                              │  │
│  │   sandbox_error • 23 occurrences • v1.1.0, v1.2.0   │  │
│  │   First: Jan 5 • Last: Jan 14                        │  │
│  │                                               [View] │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

API Endpoints

# Ingestion (from devices)
POST /v1/telemetry/events:
  auth: device_token or api_key
  body: { app_id, device_id, events[] }
  response: { received: number }

POST /v1/telemetry/crash:
  auth: device_token or api_key
  body: { app_id, device_id, crash }
  response: { id: string }

# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
  auth: required
  query: { start_date, end_date }
  response: { dau, crashes, crash_free_rate, ... }

GET /v1/apps/:id/analytics/events:
  auth: required
  query: { start_date, end_date, event_type }
  response: { data: [{ date, count, unique_devices }] }

GET /v1/apps/:id/crashes:
  auth: required
  query: { version, status, page, limit }
  response: { crashes: CrashGroup[], total }

GET /v1/apps/:id/crashes/:fingerprint:
  auth: required
  response: { crash_group, recent_occurrences[] }

PATCH /v1/apps/:id/crashes/:fingerprint:
  auth: required
  body: { status: 'resolved' | 'ignored' }
  response: { crash_group }
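A minimal sketch of the events ingestion handler. Plain net/http is shown for self-containment (the deliverables call for Chi routing); auth and storage are stubbed, and `handleEvents`/`decodeBatch` are illustrative names:

```go
package main

import (
	"encoding/json"
	"net/http"
)

type ingestEvent struct {
	Type      string         `json:"type"`
	Timestamp string         `json:"timestamp"`
	Data      map[string]any `json:"data"`
}

type ingestBody struct {
	AppID    string        `json:"app_id"`
	DeviceID string        `json:"device_id"`
	Events   []ingestEvent `json:"events"`
}

// decodeBatch parses a raw request body into an ingestBody.
func decodeBatch(raw []byte) (ingestBody, error) {
	var b ingestBody
	err := json.Unmarshal(raw, &b)
	return b, err
}

// handleEvents decodes a batch, hands it to store (e.g. the buffering
// telemetry service), and reports how many events were received.
func handleEvents(store func(ingestBody) error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		var body ingestBody
		if err := json.NewDecoder(r.Body).Decode(&body); err != nil {
			http.Error(w, "bad request", http.StatusBadRequest)
			return
		}
		if err := store(body); err != nil {
			http.Error(w, "storage error", http.StatusInternalServerError)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		json.NewEncoder(w).Encode(map[string]int{"received": len(body.Events)})
	}
}
```

Returning `{ received: n }` matches the response shape in the endpoint spec above.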

Data Retention

| Data Type        | Retention | Reason         |
|------------------|-----------|----------------|
| Raw events       | 7 days    | Debugging      |
| Daily aggregates | 2 years   | Trends         |
| Crash reports    | 90 days   | Investigation  |
| Crash groups     | Forever   | Issue tracking |

Cleanup Job

-- Run daily (SQLite)
DELETE FROM events
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days');

-- Individual crash reports age out; crash_groups are kept forever
DELETE FROM crash_reports
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-90 days');

Privacy Controls

User Settings

Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]

GDPR Endpoints

# User requests their data
GET /v1/privacy/export:
  auth: user_token
  response: { download_url }  # JSON export of all data

# User requests deletion
DELETE /v1/privacy/data:
  auth: user_token
  response: { status: 'scheduled' }  # Delete within 30 days
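Deletion can be a straightforward sweep over the per-device telemetry tables keyed by the hashed device ID. A sketch under two assumptions: the `exec` parameter stands in for `(*sql.DB).Exec` so the sweep is testable, and `crash_reports` is the table implied by the retention policy above:

```go
package main

import "fmt"

// deviceDataTables lists telemetry tables holding per-device rows.
// Aggregates (hourly_stats, daily_stats) store only counts, no device
// IDs, so they are left untouched.
var deviceDataTables = []string{"events", "crash_reports"}

// deleteDeviceData removes all rows for one hashed device ID.
func deleteDeviceData(exec func(query string, args ...any) error, hashedDeviceID string) error {
	for _, table := range deviceDataTables {
		q := fmt.Sprintf("DELETE FROM %s WHERE device_id = ?", table)
		if err := exec(q, hashedDeviceID); err != nil {
			return err
		}
	}
	return nil
}
```

Because device IDs are salted per app, a deletion request must be resolved to one hashed ID per installed app before the sweep runs.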

Deliverables

  • Storage approach decided (SQLite with separate telemetry.db)
  • Event schema specification
  • Client-side batching (Lua TelemetryManager)
  • Ingestion API endpoints (Go + Chi)
  • SQLite schema and migrations
  • Background aggregation workers (Go goroutines)
  • Crash grouping logic
  • Developer analytics dashboard (htmx)
  • Privacy controls (opt-out in manifest)
  • Data retention cleanup job
  • GDPR export/delete endpoints

Open Questions

  1. Real-time crash alerts? → Consider email notifications for v1.1
  2. Sampling for high-volume apps? → Not needed for self-hosted scale
  3. Custom events API for developers? → Yes, via manifest opt-in
  4. Benchmarks/comparisons with similar apps? → Defer to post-MVP

References