# Milestone 8: Telemetry System **Status**: Decided **Goal**: Collect app usage analytics and crash reports while respecting privacy. ## Decision **SQLite with background aggregation** for self-hosted Synology NAS: ``` Storage: SQLite (separate telemetry.db to isolate write load) Aggregation: Go background goroutine (hourly/daily rollups) Retention: Raw events 7 days, aggregates indefinitely Privacy: Hashed device IDs, no PII, opt-out available ``` ### Rationale 1. **Simple** - No separate time-series database needed 2. **SQLite scales** - Can handle thousands of events/day easily 3. **Background jobs** - Go goroutines for aggregation, cleanup 4. **Separate DB** - Telemetry writes don't affect main portal.db 5. **Privacy-first** - Minimal collection, hashed IDs ### Architecture ``` ┌─────────────────────────────────────────────────────────────────┐ │ mosis-portal container │ │ ┌────────────────────────────────────────────────────────────┐ │ │ │ Go Binary │ │ │ │ ┌─────────────┐ ┌────────────────┐ │ │ │ │ │ API Handler │───►│ Telemetry Svc │ │ │ │ │ │ POST /v1/ │ │ - Buffer events│ │ │ │ │ │ telemetry/* │ │ - Batch insert │ │ │ │ │ └─────────────┘ └───────┬────────┘ │ │ │ │ │ │ │ │ │ ┌─────────────────────────▼────────────────────────────┐ │ │ │ │ │ Background Workers │ │ │ │ │ │ • Hourly aggregation (event counts, unique devices) │ │ │ │ │ │ • Daily cleanup (delete raw events > 7 days) │ │ │ │ │ │ • Crash grouping (fingerprint + dedup) │ │ │ │ │ └───────────────────────────────────────────────────────┘ │ │ │ └──────────────────────────────┬─────────────────────────────┘ │ │ │ │ │ /volume1/mosis/data/ │ │ │ ├── portal.db (main) │ │ │ └── telemetry.db ◄────────────┘ │ └─────────────────────────────────────────────────────────────────┘ ``` --- ## Overview Telemetry provides developers with insights into app usage, performance, and crashes. Must balance usefulness with user privacy. --- ## Privacy Principles 1. **Minimal collection** - Only what's necessary 2. 
**No PII by default** - Anonymized device IDs 3. **Transparency** - Users know what's collected 4. **Opt-out available** - Users can disable 5. **Data retention limits** - Auto-delete old data 6. **GDPR compliance** - Export/delete on request --- ## Event Types ### Automatic Events (Default) | Event | Description | Data | |-------|-------------|------| | `app_start` | App launched | version, mosis_version | | `app_stop` | App closed | duration_seconds | | `app_crash` | Unhandled error | crash_type, message | | `lua_error` | Lua runtime error | message, stack (no user data) | ### Performance Events (Default) | Event | Description | Data | |-------|-------------|------| | `perf_frame` | Frame time (sampled) | avg_ms, p95_ms | | `perf_memory` | Memory usage | used_mb, limit_mb | | `perf_startup` | Startup time | duration_ms | ### Usage Events (Opt-in) | Event | Description | Data | |-------|-------------|------| | `screen_view` | Screen navigation | screen_name | | `button_click` | UI interaction | element_id | | `feature_used` | Feature usage | feature_name | --- ## Data Schema ### Event Payload ```json { "app_id": "com.developer.myapp", "app_version": "1.2.0", "mosis_version": "1.0.0", "device_id": "sha256_hashed_id", "session_id": "uuid", "events": [ { "type": "app_start", "timestamp": "2024-01-15T10:30:00Z", "data": {} }, { "type": "screen_view", "timestamp": "2024-01-15T10:30:05Z", "data": { "screen_name": "home" } } ] } ``` ### Crash Report Payload ```json { "app_id": "com.developer.myapp", "app_version": "1.2.0", "mosis_version": "1.0.0", "device_id": "sha256_hashed_id", "timestamp": "2024-01-15T10:35:00Z", "crash": { "type": "lua_error", "message": "attempt to index nil value 'user'", "stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk", "context": { "screen": "profile.rml", "memory_mb": 45, "uptime_seconds": 300 } } } ``` ### Device ID Hashing ```lua -- On device local raw_id = get_android_id() -- or similar local hashed = 
sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."
-- Cannot be reversed to the original device ID
-- Different per app (can't track across apps)
```

---

## Collection Architecture

```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Device  │────►│  Batch   │────►│   API    │────►│ Storage  │
│          │     │  Queue   │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
                      │
                      │ Every 60s or
                      │ on app close
                      ▼
                 ┌──────────┐
                 │  Upload  │
                 └──────────┘
```

### Client-Side Batching

```lua
-- TelemetryManager on device
local events = {}
local last_flush = os.time()

function track(event_type, data)
  if not telemetry_enabled then return end

  table.insert(events, {
    type = event_type,
    timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
    data = data or {}
  })

  -- Flush if the batch is large or enough time has elapsed
  if #events >= 50 or (os.time() - last_flush) > 60 then
    flush()
  end
end

function flush()
  if #events == 0 then return end

  -- Payload fields match the Event Payload schema above
  local payload = {
    app_id = APP_ID,
    app_version = APP_VERSION,
    mosis_version = MOSIS_VERSION,
    device_id = HASHED_DEVICE_ID,
    session_id = SESSION_ID,
    events = events
  }

  -- Async HTTP POST (a production version should re-queue the
  -- batch on upload failure instead of dropping it)
  http.post(TELEMETRY_URL, json.encode(payload))

  events = {}
  last_flush = os.time()
end
```

---

## Storage (SQLite)

### Telemetry Database Schema

```sql
-- telemetry.db (separate from portal.db)

-- Raw events (7-day retention)
CREATE TABLE events (
  id            INTEGER PRIMARY KEY AUTOINCREMENT,
  app_id        TEXT NOT NULL,
  device_id     TEXT NOT NULL,  -- SHA256 hashed
  session_id    TEXT,
  event_type    TEXT NOT NULL,
  event_data    TEXT,           -- JSON string
  app_version   TEXT,
  mosis_version TEXT,
  timestamp     TEXT NOT NULL   -- ISO8601
);

CREATE INDEX idx_events_app_time ON events(app_id, timestamp);
CREATE INDEX idx_events_type ON events(event_type, timestamp);

-- Hourly aggregates (computed by background job)
CREATE TABLE hourly_stats (
  app_id         TEXT NOT NULL,
  hour           TEXT NOT NULL,  -- YYYY-MM-DDTHH
  event_type     TEXT NOT NULL,
  count          INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, hour, event_type)
);

-- Daily aggregates (computed from hourly)
CREATE TABLE daily_stats (
  app_id
TEXT NOT NULL,
  date           TEXT NOT NULL,  -- YYYY-MM-DD
  event_type     TEXT NOT NULL,
  count          INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, date, event_type)
);

-- Crash groups (deduplicated by fingerprint)
CREATE TABLE crash_groups (
  id                 TEXT PRIMARY KEY,
  app_id             TEXT NOT NULL,
  fingerprint        TEXT NOT NULL,
  crash_type         TEXT NOT NULL,
  message            TEXT,
  sample_stack_trace TEXT,
  first_seen         TEXT NOT NULL,
  last_seen          TEXT NOT NULL,
  occurrence_count   INTEGER DEFAULT 1,
  affected_versions  TEXT,                 -- JSON array
  status             TEXT DEFAULT 'open',  -- open, resolved, ignored
  UNIQUE(app_id, fingerprint)
);

CREATE INDEX idx_crashes_app ON crash_groups(app_id, status);
```

### Go Background Workers

```go
// Start background workers
func (s *TelemetryService) StartWorkers(ctx context.Context) {
	// Hourly aggregation
	go s.runPeriodic(ctx, time.Hour, s.aggregateHourly)

	// Daily aggregation (run at 2am)
	go s.runDaily(ctx, 2, s.aggregateDaily)

	// Cleanup old events (run at 3am)
	go s.runDaily(ctx, 3, s.cleanupOldEvents)
}

// runPeriodic invokes fn at a fixed interval until ctx is cancelled.
func (s *TelemetryService) runPeriodic(ctx context.Context, interval time.Duration, fn func(context.Context) error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			if err := fn(ctx); err != nil {
				log.Printf("telemetry worker: %v", err)
			}
		}
	}
}

// runDaily invokes fn once per day at the given hour.
func (s *TelemetryService) runDaily(ctx context.Context, hour int, fn func(context.Context) error) {
	for {
		now := time.Now()
		next := time.Date(now.Year(), now.Month(), now.Day(), hour, 0, 0, 0, now.Location())
		if !next.After(now) {
			next = next.AddDate(0, 0, 1)
		}
		select {
		case <-ctx.Done():
			return
		case <-time.After(time.Until(next)):
			if err := fn(ctx); err != nil {
				log.Printf("telemetry worker: %v", err)
			}
		}
	}
}

func (s *TelemetryService) aggregateHourly(ctx context.Context) error {
	// Timestamps are stored in UTC, so compute the previous hour in UTC
	hour := time.Now().UTC().Add(-time.Hour).Format("2006-01-02T15")
	_, err := s.db.ExecContext(ctx, `
		INSERT OR REPLACE INTO hourly_stats (app_id, hour, event_type, count, unique_devices)
		SELECT
			app_id,
			strftime('%Y-%m-%dT%H', timestamp) AS hour,
			event_type,
			COUNT(*) AS count,
			COUNT(DISTINCT device_id) AS unique_devices
		FROM events
		WHERE strftime('%Y-%m-%dT%H', timestamp) = ?
		GROUP BY app_id, hour, event_type
	`, hour)
	return err
}

func (s *TelemetryService) cleanupOldEvents(ctx context.Context) error {
	// UTC so the "Z"-suffixed cutoff compares lexicographically against stored timestamps
	cutoff := time.Now().UTC().AddDate(0, 0, -7).Format(time.RFC3339)
	_, err := s.db.ExecContext(ctx, "DELETE FROM events WHERE timestamp < ?", cutoff)
	return err
}
```

---

## Aggregation

### Pre-computed Metrics

| Metric | Granularity | Retention |
|--------|-------------|-----------|
| Daily active users | Day | 2 years |
| Event counts | Day | 1 year |
| Crash counts | Day | 1 year |
| Session duration | Day | 90 days |
| Performance percentiles | Day | 90 days |

### Aggregation Queries

```sql
-- Daily active users (last 30 days, from the daily rollups;
-- raw events only live 7 days, so DAU must come from daily_stats)
SELECT
  date AS day,
  unique_devices AS dau
FROM daily_stats
WHERE app_id = ?
  AND event_type = 'app_start'
  AND date > date('now', '-30 days')
ORDER BY day;

-- Crash rate by version (raw events, last 7 days)
SELECT
  app_version,
  SUM(event_type = 'app_crash') AS crashes,
  SUM(event_type = 'app_start') AS starts,
  ROUND(
    100.0 * SUM(event_type = 'app_crash')
    / NULLIF(SUM(event_type = 'app_start'), 0),
    2
  ) AS crash_rate
FROM events
WHERE app_id = ?
  AND timestamp > strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days')
GROUP BY app_version;
```

---

## Crash Grouping

### Stack Trace Fingerprinting

```go
// fingerprintCrash produces a short, stable ID used to group
// similar crashes together.
func fingerprintCrash(crash CrashReport) string {
	// Normalize the stack trace so routine code changes don't
	// split one logical crash into many groups
	normalized := normalizeStackTrace(crash.StackTrace)

	// Hash the key components
	key := fmt.Sprintf("%s:%s:%s", crash.CrashType, crash.Message, normalized)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}

var lineNumberRe = regexp.MustCompile(`:\d+:`)

func normalizeStackTrace(stack string) string {
	// Remove line numbers (they change with code updates); memory
	// addresses, if present, should be stripped the same way.
	// Keep function names and file names.
	return lineNumberRe.ReplaceAllString(stack, ":?:")
}
```

---

## Developer Dashboard

### Metrics View

```
┌─────────────────────────────────────────────────────────────┐
│ Analytics - My Calculator                                   │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Date Range: [Last 30 days ▼]                               │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐          │
│  │ Daily Users │  │   Crashes   │  │ Crash-free  │          │
│  │    1,234    │  │     23      │  │    98.1%    │          │
│  │   ▲ +12%    │  │   ▼ -45%    │  │   ▲ +2%     │          │
│  └─────────────┘  └─────────────┘  └─────────────┘          │
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │ Daily Active Users                                 │     │
│  │ [Line chart showing DAU over time]                 │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
│  ┌────────────────────────────────────────────────────┐     │
│  │ Version Distribution                               │     │
│  │ [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%] │     │
│  └────────────────────────────────────────────────────┘     │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

### Crashes View

```
┌─────────────────────────────────────────────────────────────┐
│ Crashes - My Calculator                                     │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Filter: [All versions ▼] [Open ▼]                          │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐   │
│  │ ● attempt to index nil value 'user'                  │   │
│  │   lua_error • 156 occurrences • v1.2.0               │   │
│  │   First: Jan 10 • Last: Jan 15               [View]  │   │
│  ├──────────────────────────────────────────────────────┤   │
│  │ ● memory limit exceeded                              │   │
│  │   sandbox_error • 23 occurrences • v1.1.0, v1.2.0    │   │
│  │   First: Jan 5 • Last: Jan 14                [View]  │   │
│  └──────────────────────────────────────────────────────┘   │
│                                                             │
└─────────────────────────────────────────────────────────────┘
```

---

## API Endpoints

```yaml
# Ingestion (from devices)
POST
/v1/telemetry/events:
  auth: device_token or api_key
  body: { app_id, device_id, events[] }
  response: { received: number }

POST /v1/telemetry/crash:
  auth: device_token or api_key
  body: { app_id, device_id, crash }
  response: { id: string }

# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
  auth: required
  query: { start_date, end_date }
  response: { dau, crashes, crash_free_rate, ... }

GET /v1/apps/:id/analytics/events:
  auth: required
  query: { start_date, end_date, event_type }
  response: { data: [{ date, count, unique_devices }] }

GET /v1/apps/:id/crashes:
  auth: required
  query: { version, status, page, limit }
  response: { crashes: CrashGroup[], total }

GET /v1/apps/:id/crashes/:fingerprint:
  auth: required
  response: { crash_group, recent_occurrences[] }

PATCH /v1/apps/:id/crashes/:fingerprint:
  auth: required
  body: { status: 'resolved' | 'ignored' }
  response: { crash_group }
```

---

## Data Retention

| Data Type | Retention | Reason |
|-----------|-----------|--------|
| Raw events | 7 days | Debugging |
| Daily aggregates | 2 years | Trends |
| Crash reports | 90 days | Investigation |
| Crash groups | Forever | Issue tracking |

### Cleanup Job

```sql
-- Run daily (SQLite)
DELETE FROM events
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days');

DELETE FROM crash_reports
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-90 days');
```

---

## Privacy Controls

### User Settings

```
Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]
```

### GDPR Endpoints

```yaml
# User requests their data
GET /v1/privacy/export:
  auth: user_token
  response: { download_url }  # JSON export of all data

# User requests deletion
DELETE /v1/privacy/data:
  auth: user_token
  response: { status: 'scheduled' }  # Delete within 30 days
```

---

## Deliverables

- [x] Storage approach decided (SQLite with separate telemetry.db)
- [ ] Event schema specification
- [ ] Client-side batching (Lua TelemetryManager)
- [ 
] Ingestion API endpoints (Go + Chi) - [ ] SQLite schema and migrations - [ ] Background aggregation workers (Go goroutines) - [ ] Crash grouping logic - [ ] Developer analytics dashboard (htmx) - [ ] Privacy controls (opt-out in manifest) - [ ] Data retention cleanup job - [ ] GDPR export/delete endpoints --- ## Open Questions 1. Real-time crash alerts? → Consider email notifications for v1.1 2. ~~Sampling for high-volume apps?~~ → Not needed for self-hosted scale 3. ~~Custom events API for developers?~~ → Yes, via manifest opt-in 4. ~~Benchmarks/comparisons with similar apps?~~ → Defer to post-MVP --- ## References - [GDPR Requirements](https://gdpr.eu/) - [TimescaleDB Best Practices](https://docs.timescale.com/timescaledb/latest/) - [Sentry Crash Grouping](https://docs.sentry.io/product/data-management-settings/event-grouping/)