# Milestone 8: Telemetry System **Status**: Decided **Goal**: Collect app usage analytics and crash reports while respecting privacy. ## Decision **SQLite with background aggregation** for self-hosted Synology NAS: ``` Storage: SQLite (separate telemetry.db to isolate write load) Aggregation: Go background goroutine (hourly/daily rollups) Retention: Raw events 7 days, aggregates indefinitely Privacy: Hashed device IDs, no PII, opt-out available ``` ### Rationale 1. **Simple** - No separate time-series database needed 2. **SQLite scales** - Can handle thousands of events/day easily 3. **Background jobs** - Go goroutines for aggregation, cleanup 4. **Separate DB** - Telemetry writes don't affect main portal.db 5. **Privacy-first** - Minimal collection, hashed IDs ### Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ mosis-portal container │ │ ┌────────────────────────────────────────────────────────────┐ │ │ │ Go Binary │ │ │ │ ┌─────────────┐ ┌────────────────┐ │ │ │ │ │ API Handler │───►│ Telemetry Svc │ │ │ │ │ │ POST /v1/ │ │ - Buffer events│ │ │ │ │ │ telemetry/* │ │ - Batch insert │ │ │ │ │ └─────────────┘ └───────┬────────┘ │ │ │ │ │ │ │ │ │ ┌─────────────────────────▼────────────────────────────┐ │ │ │ │ │ Background Workers │ │ │ │ │ │ • Hourly aggregation (event counts, unique devices) │ │ │ │ │ │ • Daily cleanup (delete raw events > 7 days) │ │ │ │ │ │ • Crash grouping (fingerprint + dedup) │ │ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ └──────────────────────────────┬─────────────────────────────┘ │ │ │ │ │ /volume1/mosis/data/ │ │ │ ├── portal.db (main) │ │ │ └── telemetry.db ◄────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## Overview Telemetry provides developers with insights into app usage, performance, and crashes. Must balance usefulness with user privacy. --- ## Privacy Principles 1. **Minimal collection** - Only what's necessary 2. 
**No PII by default** - Anonymized device IDs 3. **Transparency** - Users know what's collected 4. **Opt-out available** - Users can disable 5. **Data retention limits** - Auto-delete old data 6. **GDPR compliance** - Export/delete on request --- ## Event Types ### Automatic Events (Default) | Event | Description | Data | |-------|-------------|------| | `app_start` | App launched | version, mosis_version | | `app_stop` | App closed | duration_seconds | | `app_crash` | Unhandled error | crash_type, message | | `lua_error` | Lua runtime error | message, stack (no user data) | ### Performance Events (Default) | Event | Description | Data | |-------|-------------|------| | `perf_frame` | Frame time (sampled) | avg_ms, p95_ms | | `perf_memory` | Memory usage | used_mb, limit_mb | | `perf_startup` | Startup time | duration_ms | ### Usage Events (Opt-in) | Event | Description | Data | |-------|-------------|------| | `screen_view` | Screen navigation | screen_name | | `button_click` | UI interaction | element_id | | `feature_used` | Feature usage | feature_name | --- ## Data Schema ### Event Payload ```json { "app_id": "com.developer.myapp", "app_version": "1.2.0", "mosis_version": "1.0.0", "device_id": "sha256_hashed_id", "session_id": "uuid", "events": [ { "type": "app_start", "timestamp": "2024-01-15T10:30:00Z", "data": {} }, { "type": "screen_view", "timestamp": "2024-01-15T10:30:05Z", "data": { "screen_name": "home" } } ] } ``` ### Crash Report Payload ```json { "app_id": "com.developer.myapp", "app_version": "1.2.0", "mosis_version": "1.0.0", "device_id": "sha256_hashed_id", "timestamp": "2024-01-15T10:35:00Z", "crash": { "type": "lua_error", "message": "attempt to index nil value 'user'", "stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk", "context": { "screen": "profile.rml", "memory_mb": 45, "uptime_seconds": 300 } } } ``` ### Device ID Hashing ```lua -- On device local raw_id = get_android_id() -- or similar local hashed = 
sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."
-- Cannot be reversed to the original device ID
-- Different per app (can't track across apps)
```

---

## Collection Architecture

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Device  │────►│  Batch   │────►│   API    │────►│ Storage  │
│          │     │  Queue   │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
                      │
                      │ Every 60s or
                      │ on app close
                      ▼
                 ┌──────────┐
                 │  Upload  │
                 └──────────┘
```

### Client-Side Batching

```lua
-- TelemetryManager on device
local events = {}
local last_flush = os.time()

function track(event_type, data)
  if not telemetry_enabled then return end

  table.insert(events, {
    type = event_type,
    timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
    data = data or {}
  })

  -- Flush if the batch is large or enough time has elapsed
  if #events >= 50 or (os.time() - last_flush) > 60 then
    flush()
  end
end

function flush()
  if #events == 0 then return end

  -- Payload fields match the Event Payload schema above
  local payload = {
    app_id = APP_ID,
    app_version = APP_VERSION,
    mosis_version = MOSIS_VERSION,
    device_id = HASHED_DEVICE_ID,
    session_id = SESSION_ID,
    events = events
  }

  -- Async HTTP POST (a production version should re-queue the
  -- batch on upload failure instead of dropping it)
  http.post(TELEMETRY_URL, json.encode(payload))

  events = {}
  last_flush = os.time()
end
```

---

## Storage (SQLite)

### Telemetry Database Schema

```sql
-- telemetry.db (separate from portal.db)

-- Raw events (7-day retention)
CREATE TABLE events (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  app_id        TEXT NOT NULL,
  device_id     TEXT NOT NULL,  -- SHA256 hashed
  session_id    TEXT,
  event_type    TEXT NOT NULL,
  event_data    TEXT,           -- JSON string
  app_version   TEXT,
  mosis_version TEXT,
  timestamp     TEXT NOT NULL   -- ISO8601
);

CREATE INDEX idx_events_app_time ON events(app_id, timestamp);
CREATE INDEX idx_events_type ON events(event_type, timestamp);

-- Hourly aggregates (computed by background job)
CREATE TABLE hourly_stats (
  app_id         TEXT NOT NULL,
  hour           TEXT NOT NULL,  -- YYYY-MM-DDTHH
  event_type     TEXT NOT NULL,
  count          INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, hour, event_type)
);

-- Daily aggregates (computed from hourly)
CREATE TABLE daily_stats (
  app_id
TEXT NOT NULL,
  date           TEXT NOT NULL,  -- YYYY-MM-DD
  event_type     TEXT NOT NULL,
  count          INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, date, event_type)
);

-- Crash groups (deduplicated by fingerprint)
CREATE TABLE crash_groups (
  id                 TEXT PRIMARY KEY,
  app_id             TEXT NOT NULL,
  fingerprint        TEXT NOT NULL,
  crash_type         TEXT NOT NULL,
  message            TEXT,
  sample_stack_trace TEXT,
  first_seen         TEXT NOT NULL,
  last_seen          TEXT NOT NULL,
  occurrence_count   INTEGER DEFAULT 1,
  affected_versions  TEXT,                 -- JSON array
  status             TEXT DEFAULT 'open',  -- open, resolved, ignored
  UNIQUE(app_id, fingerprint)
);

CREATE INDEX idx_crashes_app ON crash_groups(app_id, status);
```

### Go Background Workers

```go
// Start background workers
func (s *TelemetryService) StartWorkers(ctx context.Context) {
	// Hourly aggregation
	go s.runPeriodic(ctx, time.Hour, s.aggregateHourly)

	// Daily aggregation (run at 2am)
	go s.runDaily(ctx, 2, s.aggregateDaily)

	// Cleanup old events (run at 3am)
	go s.runDaily(ctx, 3, s.cleanupOldEvents)
}

// runPeriodic invokes fn at a fixed interval until ctx is cancelled.
func (s *TelemetryService) runPeriodic(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := fn(ctx); err != nil {
				log.Printf("telemetry worker: %v", err)
			}
		}
	}
}

// runDaily invokes fn once per day at the given hour.
func (s *TelemetryService) runDaily(ctx context.Context, hour int, fn func(context.Context) error) {
	for {
		now := time.Now()
		next := time.Date(now.Year(), now.Month(), now.Day(), hour, 0, 0, 0, now.Location())
		if !next.After(now) {
			next = next.AddDate(0, 0, 1)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Until(next)):
			if err := fn(ctx); err != nil {
				log.Printf("telemetry worker: %v", err)
			}
		}
	}
}

func (s *TelemetryService) aggregateHourly(ctx context.Context) error {
	// Timestamps are stored in UTC, so compute the previous hour in UTC
	hour := time.Now().UTC().Add(-time.Hour).Format("2006-01-02T15")
	_, err := s.db.ExecContext(ctx, `
		INSERT OR REPLACE INTO hourly_stats (app_id, hour, event_type, count, unique_devices)
		SELECT
			app_id,
			strftime('%Y-%m-%dT%H', timestamp) AS hour,
			event_type,
			COUNT(*) AS count,
			COUNT(DISTINCT device_id) AS unique_devices
		FROM events
		WHERE strftime('%Y-%m-%dT%H', timestamp) = ?
		GROUP BY app_id, hour, event_type
	`, hour)
	return err
}

func (s *TelemetryService) cleanupOldEvents(ctx context.Context) error {
	// UTC so the "Z"-suffixed cutoff compares lexicographically against stored timestamps
	cutoff := time.Now().UTC().AddDate(0, 0, -7).Format(time.RFC3339)
	_, err := s.db.ExecContext(ctx, "DELETE FROM events WHERE timestamp < ?", cutoff)
	return err
}
```

---

## Aggregation

### Pre-computed Metrics

| Metric | Granularity | Retention |
|--------|-------------|-----------|
| Daily active users | Day | 2 years |
| Event counts | Day | 1 year |
| Crash counts | Day | 1 year |
| Session duration | Day | 90 days |
| Performance percentiles | Day | 90 days |

### Aggregation Queries

```sql
-- Daily active users (last 30 days, from the daily rollups;
-- raw events only live 7 days, so DAU must come from daily_stats)
SELECT
  date AS day,
  unique_devices AS dau
FROM daily_stats
WHERE app_id = ?
  AND event_type = 'app_start'
  AND date > date('now', '-30 days')
ORDER BY day;

-- Crash rate by version (raw events, last 7 days)
SELECT
  app_version,
  SUM(event_type = 'app_crash') AS crashes,
  SUM(event_type = 'app_start') AS starts,
  ROUND(
    100.0 * SUM(event_type = 'app_crash')
    / NULLIF(SUM(event_type = 'app_start'), 0),
    2
  ) AS crash_rate
FROM events
WHERE app_id = ?
  AND timestamp > strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days')
GROUP BY app_version;
```

---

## Crash Grouping

### Stack Trace Fingerprinting

```go
// fingerprintCrash produces a short, stable ID used to group
// similar crashes together.
func fingerprintCrash(crash CrashReport) string {
	// Normalize the stack trace so routine code changes don't
	// split one logical crash into many groups
	normalized := normalizeStackTrace(crash.StackTrace)

	// Hash the key components
	key := fmt.Sprintf("%s:%s:%s", crash.CrashType, crash.Message, normalized)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}

var lineNumberRe = regexp.MustCompile(`:\d+:`)

func normalizeStackTrace(stack string) string {
	// Remove line numbers (they change with code updates); memory
	// addresses, if present, should be stripped the same way.
	// Keep function names and file names.
	return lineNumberRe.ReplaceAllString(stack, ":?:")
}
```

---

## Developer Dashboard

### Metrics View

```
┌─────────────────────────────────────────────────────────────┐
│ Analytics - My Calculator                                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Date Range: [Last 30 days ▼]                               │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ Daily Users │  │   Crashes   │  │ Crash-free  │          │
│  │    1,234    │  │     23      │  │    98.1%    │          │
│  │   ▲ +12%    │  │   ▼ -45%    │  │   ▲ +2%     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │ Daily Active Users                                 │     │
│  │ [Line chart showing DAU over time]                 │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │ Version Distribution                               │     │
│  │ [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%] │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Crashes View

```
┌─────────────────────────────────────────────────────────────┐
│ Crashes - My Calculator                                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Filter: [All versions ▼] [Open ▼]                          │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ ● attempt to index nil value 'user'                  │   │
│  │   lua_error • 156 occurrences • v1.2.0               │   │
│  │   First: Jan 10 • Last: Jan 15               [View]  │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │ ● memory limit exceeded                              │   │
│  │   sandbox_error • 23 occurrences • v1.1.0, v1.2.0    │   │
│  │   First: Jan 5 • Last: Jan 14                [View]  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

---

## API Endpoints

```yaml
# Ingestion (from devices)
POST
/v1/telemetry/events:
  auth: device_token or api_key
  body: { app_id, device_id, events[] }
  response: { received: number }

POST /v1/telemetry/crash:
  auth: device_token or api_key
  body: { app_id, device_id, crash }
  response: { id: string }

# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
  auth: required
  query: { start_date, end_date }
  response: { dau, crashes, crash_free_rate, ... }

GET /v1/apps/:id/analytics/events:
  auth: required
  query: { start_date, end_date, event_type }
  response: { data: [{ date, count, unique_devices }] }

GET /v1/apps/:id/crashes:
  auth: required
  query: { version, status, page, limit }
  response: { crashes: CrashGroup[], total }

GET /v1/apps/:id/crashes/:fingerprint:
  auth: required
  response: { crash_group, recent_occurrences[] }

PATCH /v1/apps/:id/crashes/:fingerprint:
  auth: required
  body: { status: 'resolved' | 'ignored' }
  response: { crash_group }
```

---

## Data Retention

| Data Type | Retention | Reason |
|-----------|-----------|--------|
| Raw events | 7 days | Debugging |
| Daily aggregates | 2 years | Trends |
| Crash reports | 90 days | Investigation |
| Crash groups | Forever | Issue tracking |

### Cleanup Job

```sql
-- Run daily (SQLite)
DELETE FROM events
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days');

DELETE FROM crash_reports
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-90 days');
```

---

## Privacy Controls

### User Settings

```
Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]
```

### GDPR Endpoints

```yaml
# User requests their data
GET /v1/privacy/export:
  auth: user_token
  response: { download_url }  # JSON export of all data

# User requests deletion
DELETE /v1/privacy/data:
  auth: user_token
  response: { status: 'scheduled' }  # Delete within 30 days
```

---

## Deliverables

- [x] Storage approach decided (SQLite with separate telemetry.db)
- [ ] Event schema specification
- [ ] Client-side batching (Lua TelemetryManager)
- [ 
] Ingestion API endpoints (Go + Chi) - [ ] SQLite schema and migrations - [ ] Background aggregation workers (Go goroutines) - [ ] Crash grouping logic - [ ] Developer analytics dashboard (htmx) - [ ] Privacy controls (opt-out in manifest) - [ ] Data retention cleanup job - [ ] GDPR export/delete endpoints --- ## Open Questions 1. Real-time crash alerts? → Consider email notifications for v1.1 2. ~~Sampling for high-volume apps?~~ → Not needed for self-hosted scale 3. ~~Custom events API for developers?~~ → Yes, via manifest opt-in 4. ~~Benchmarks/comparisons with similar apps?~~ → Defer to post-MVP --- ## References - [GDPR Requirements](https://gdpr.eu/) - [TimescaleDB Best Practices](https://docs.timescale.com/timescaledb/latest/) - [Sentry Crash Grouping](https://docs.sentry.io/product/data-management-settings/event-grouping/)