MosisService/docs/DEV_PORTAL_M08_TELEMETRY.md

# Milestone 8: Telemetry System
**Status**: Decided
**Goal**: Collect app usage analytics and crash reports while respecting privacy.
## Decision
**SQLite with background aggregation** for self-hosted Synology NAS:
```
Storage: SQLite (separate telemetry.db to isolate write load)
Aggregation: Go background goroutine (hourly/daily rollups)
Retention: Raw events 7 days, aggregates indefinitely
Privacy: Hashed device IDs, no PII, opt-out available
```
### Rationale
1. **Simple** - No separate time-series database needed
2. **SQLite scales** - Can handle thousands of events/day easily
3. **Background jobs** - Go goroutines for aggregation, cleanup
4. **Separate DB** - Telemetry writes don't affect main portal.db
5. **Privacy-first** - Minimal collection, hashed IDs
### Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│ mosis-portal container │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Go Binary │ │
│ │ ┌─────────────┐ ┌────────────────┐ │ │
│ │ │ API Handler │───►│ Telemetry Svc │ │ │
│ │ │ POST /v1/ │ │ - Buffer events│ │ │
│ │ │ telemetry/* │ │ - Batch insert │ │ │
│ │ └─────────────┘ └───────┬────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────▼────────────────────────────┐ │ │
│ │ │ Background Workers │ │ │
│ │ │ • Hourly aggregation (event counts, unique devices) │ │ │
│ │ │ • Daily cleanup (delete raw events > 7 days) │ │ │
│ │ │ • Crash grouping (fingerprint + dedup) │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────┬─────────────────────────────┘ │
│ │ │
│ /volume1/mosis/data/ │ │
│ ├── portal.db (main) │ │
│ └── telemetry.db ◄────────────┘ │
└─────────────────────────────────────────────────────────────────┘
```
---
## Overview
Telemetry gives developers insight into app usage, performance, and crashes. The design must balance usefulness with user privacy.
---
## Privacy Principles
1. **Minimal collection** - Only what's necessary
2. **No PII by default** - Anonymized device IDs
3. **Transparency** - Users know what's collected
4. **Opt-out available** - Users can disable
5. **Data retention limits** - Auto-delete old data
6. **GDPR compliance** - Export/delete on request
---
## Event Types
### Automatic Events (Default)
| Event | Description | Data |
|-------|-------------|------|
| `app_start` | App launched | version, mosis_version |
| `app_stop` | App closed | duration_seconds |
| `app_crash` | Unhandled error | crash_type, message |
| `lua_error` | Lua runtime error | message, stack (no user data) |
### Performance Events (Default)
| Event | Description | Data |
|-------|-------------|------|
| `perf_frame` | Frame time (sampled) | avg_ms, p95_ms |
| `perf_memory` | Memory usage | used_mb, limit_mb |
| `perf_startup` | Startup time | duration_ms |
### Usage Events (Opt-in)
| Event | Description | Data |
|-------|-------------|------|
| `screen_view` | Screen navigation | screen_name |
| `button_click` | UI interaction | element_id |
| `feature_used` | Feature usage | feature_name |
---
## Data Schema
### Event Payload
```json
{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "session_id": "uuid",
  "events": [
    {
      "type": "app_start",
      "timestamp": "2024-01-15T10:30:00Z",
      "data": {}
    },
    {
      "type": "screen_view",
      "timestamp": "2024-01-15T10:30:05Z",
      "data": {
        "screen_name": "home"
      }
    }
  ]
}
```
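On the Go ingestion side, the payload above maps to a small set of structs. A minimal sketch (the type and function names here are illustrative, not a fixed API; `data` is kept opaque so it can be stored as a JSON string in `event_data`):

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// Event is one entry in the batch; Data stays raw JSON.
type Event struct {
	Type      string          `json:"type"`
	Timestamp time.Time       `json:"timestamp"`
	Data      json.RawMessage `json:"data"`
}

// EventBatch mirrors the JSON payload posted by devices.
type EventBatch struct {
	AppID        string  `json:"app_id"`
	AppVersion   string  `json:"app_version"`
	MosisVersion string  `json:"mosis_version"`
	DeviceID     string  `json:"device_id"` // already SHA-256 hashed on device
	SessionID    string  `json:"session_id"`
	Events       []Event `json:"events"`
}

// parseBatch decodes an uploaded batch body.
func parseBatch(raw []byte) (*EventBatch, error) {
	var b EventBatch
	if err := json.Unmarshal(raw, &b); err != nil {
		return nil, err
	}
	return &b, nil
}

func main() {
	raw := []byte(`{
		"app_id": "com.developer.myapp",
		"app_version": "1.2.0",
		"mosis_version": "1.0.0",
		"device_id": "sha256_hashed_id",
		"session_id": "uuid",
		"events": [{"type": "app_start", "timestamp": "2024-01-15T10:30:00Z", "data": {}}]
	}`)
	batch, err := parseBatch(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(batch.AppID, len(batch.Events), batch.Events[0].Type)
	// com.developer.myapp 1 app_start
}
```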
### Crash Report Payload
```json
{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "timestamp": "2024-01-15T10:35:00Z",
  "crash": {
    "type": "lua_error",
    "message": "attempt to index nil value 'user'",
    "stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk",
    "context": {
      "screen": "profile.rml",
      "memory_mb": 45,
      "uptime_seconds": 300
    }
  }
}
```
### Device ID Hashing
```lua
-- On device
local raw_id = get_android_id() -- or similar
local hashed = sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."
-- Cannot reverse to original device ID
-- Different per app (can't track across apps)
```
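The same construction, sketched in Go for illustration (the salt string follows the Lua example above; `hashDeviceID` is a hypothetical name, and a real deployment would keep the salt out of source):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// hashDeviceID reproduces the client-side scheme: a per-app salted
// SHA-256, so the same device yields different IDs in different apps.
func hashDeviceID(rawID, appID string) string {
	sum := sha256.Sum256([]byte(rawID + "mosis_salt_" + appID))
	return hex.EncodeToString(sum[:])
}

func main() {
	a := hashDeviceID("device-123", "com.dev.appA")
	b := hashDeviceID("device-123", "com.dev.appB")
	fmt.Println(a[:12], b[:12], a != b) // two distinct IDs for one device
}
```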
---
## Collection Architecture
```
┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Device  │────►│  Batch   │────►│   API    │────►│ Storage  │
│          │     │  Queue   │     │          │     │          │
└──────────┘     └────┬─────┘     └──────────┘     └──────────┘
                      │ Upload every 60s
                      │ or on app close
```
### Client-Side Batching
```lua
-- TelemetryManager on device
local events = {}
local last_flush = os.time()

function track(event_type, data)
    if not telemetry_enabled then return end
    table.insert(events, {
        type = event_type,
        timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
        data = data or {}
    })
    -- Flush if the batch is large or enough time has elapsed
    if #events >= 50 or (os.time() - last_flush) > 60 then
        flush()
    end
end

function flush()
    if #events == 0 then return end
    local payload = {
        app_id = APP_ID,
        app_version = APP_VERSION,
        mosis_version = MOSIS_VERSION,
        session_id = SESSION_ID,
        device_id = HASHED_DEVICE_ID,
        events = events
    }
    -- Async HTTP POST (fire and forget)
    http.post(TELEMETRY_URL, json.encode(payload))
    events = {}
    last_flush = os.time()
end
```
---
## Storage (SQLite)
### Telemetry Database Schema
```sql
-- telemetry.db (separate from portal.db)

-- Raw events (7-day retention)
CREATE TABLE events (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  app_id TEXT NOT NULL,
  device_id TEXT NOT NULL,          -- SHA-256 hashed
  session_id TEXT,
  event_type TEXT NOT NULL,
  event_data TEXT,                  -- JSON string
  app_version TEXT,
  mosis_version TEXT,
  timestamp TEXT NOT NULL           -- ISO 8601
);
CREATE INDEX idx_events_app_time ON events(app_id, timestamp);
CREATE INDEX idx_events_type ON events(event_type, timestamp);

-- Hourly aggregates (computed by background job)
CREATE TABLE hourly_stats (
  app_id TEXT NOT NULL,
  hour TEXT NOT NULL,               -- YYYY-MM-DDTHH
  event_type TEXT NOT NULL,
  count INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, hour, event_type)
);

-- Daily aggregates (computed from hourly)
CREATE TABLE daily_stats (
  app_id TEXT NOT NULL,
  date TEXT NOT NULL,               -- YYYY-MM-DD
  event_type TEXT NOT NULL,
  count INTEGER NOT NULL,
  unique_devices INTEGER NOT NULL,
  PRIMARY KEY (app_id, date, event_type)
);

-- Crash groups (deduplicated by fingerprint)
CREATE TABLE crash_groups (
  id TEXT PRIMARY KEY,
  app_id TEXT NOT NULL,
  fingerprint TEXT NOT NULL,
  crash_type TEXT NOT NULL,
  message TEXT,
  sample_stack_trace TEXT,
  first_seen TEXT NOT NULL,
  last_seen TEXT NOT NULL,
  occurrence_count INTEGER DEFAULT 1,
  affected_versions TEXT,           -- JSON array
  status TEXT DEFAULT 'open',       -- open, resolved, ignored
  UNIQUE(app_id, fingerprint)
);
CREATE INDEX idx_crashes_app ON crash_groups(app_id, status);
```
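Since each upload carries up to 50 events, inserting them one statement at a time wastes SQLite write throughput; a single multi-row `INSERT` per batch (inside one transaction) is cheaper. A sketch of the statement builder (the `Row` type and function name are assumptions, and the `database/sql` execution itself is elided):

```go
package main

import (
	"fmt"
	"strings"
)

// Row is one raw event destined for the events table.
type Row struct {
	AppID, DeviceID, EventType, EventData, Timestamp string
}

// buildBatchInsert turns a batch into a single multi-row INSERT so the
// whole upload lands in one statement; run it inside one transaction.
func buildBatchInsert(rows []Row) (string, []interface{}) {
	var b strings.Builder
	b.WriteString("INSERT INTO events (app_id, device_id, event_type, event_data, timestamp) VALUES ")
	args := make([]interface{}, 0, len(rows)*5)
	for i, r := range rows {
		if i > 0 {
			b.WriteString(", ")
		}
		b.WriteString("(?, ?, ?, ?, ?)")
		args = append(args, r.AppID, r.DeviceID, r.EventType, r.EventData, r.Timestamp)
	}
	return b.String(), args
}

func main() {
	q, args := buildBatchInsert([]Row{
		{"com.dev.app", "abc", "app_start", "{}", "2024-01-15T10:30:00Z"},
		{"com.dev.app", "abc", "screen_view", `{"screen_name":"home"}`, "2024-01-15T10:30:05Z"},
	})
	fmt.Println(q)
	fmt.Println(len(args)) // 10
}
```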
### Go Background Workers
```go
// Start background workers
func (s *TelemetryService) StartWorkers(ctx context.Context) {
	// Hourly aggregation
	go s.runPeriodic(ctx, time.Hour, s.aggregateHourly)
	// Daily aggregation (run at 2am)
	go s.runDaily(ctx, 2, s.aggregateDaily)
	// Cleanup old events (run at 3am)
	go s.runDaily(ctx, 3, s.cleanupOldEvents)
}

func (s *TelemetryService) aggregateHourly(ctx context.Context) error {
	// Roll up the previous (just-completed) hour
	hour := time.Now().Add(-time.Hour).Format("2006-01-02T15")
	_, err := s.db.ExecContext(ctx, `
		INSERT OR REPLACE INTO hourly_stats (app_id, hour, event_type, count, unique_devices)
		SELECT
			app_id,
			strftime('%Y-%m-%dT%H', timestamp) AS hour,
			event_type,
			COUNT(*) AS count,
			COUNT(DISTINCT device_id) AS unique_devices
		FROM events
		WHERE strftime('%Y-%m-%dT%H', timestamp) = ?
		GROUP BY app_id, hour, event_type
	`, hour)
	return err
}

func (s *TelemetryService) cleanupOldEvents(ctx context.Context) error {
	cutoff := time.Now().AddDate(0, 0, -7).Format(time.RFC3339)
	_, err := s.db.ExecContext(ctx,
		"DELETE FROM events WHERE timestamp < ?", cutoff)
	return err
}
```
---
## Aggregation
### Pre-computed Metrics
| Metric | Granularity | Retention |
|--------|-------------|-----------|
| Daily active users | Day | 2 years |
| Event counts | Day | 1 year |
| Crash counts | Day | 1 year |
| Session duration | Day | 90 days |
| Performance percentiles | Day | 90 days |
### Aggregation Queries
```sql
-- Daily active users, last 30 days
-- (read from pre-computed daily_stats; raw events only live 7 days)
SELECT
  date,
  unique_devices AS dau
FROM daily_stats
WHERE app_id = ?
  AND event_type = 'app_start'
  AND date > date('now', '-30 days')
ORDER BY date;

-- Crash rate by version, last 7 days (from raw events)
SELECT
  app_version,
  SUM(event_type = 'app_crash') AS crashes,
  SUM(event_type = 'app_start') AS starts,
  ROUND(
    100.0 * SUM(event_type = 'app_crash') /
    NULLIF(SUM(event_type = 'app_start'), 0),
    2
  ) AS crash_rate
FROM events
WHERE app_id = ?
  AND timestamp > datetime('now', '-7 days')
GROUP BY app_version;
```
---
## Crash Grouping
### Stack Trace Fingerprinting
```go
// Compiled once at package level, not on every call
var lineNumberRe = regexp.MustCompile(`:\d+:`)

func fingerprintCrash(crash CrashReport) string {
	// Normalize stack trace so the same bug groups across releases
	normalized := normalizeStackTrace(crash.StackTrace)
	// Hash key components
	key := fmt.Sprintf("%s:%s:%s",
		crash.CrashType,
		crash.Message,
		normalized,
	)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}

func normalizeStackTrace(stack string) string {
	// Strip line numbers (they change with code updates);
	// keep function names and file names
	return lineNumberRe.ReplaceAllString(stack, ":?:")
}
```
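A self-contained illustration of why the normalization matters: the same bug reported from two builds, with line numbers shifted by a refactor, still collapses to a single fingerprint. (Helper names here are local to the example.)

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"regexp"
)

var lineNumRe = regexp.MustCompile(`:\d+:`)

// normalize replaces line numbers so fingerprints survive code shifts.
func normalize(stack string) string {
	return lineNumRe.ReplaceAllString(stack, ":?:")
}

// fingerprint hashes crash type, message, and normalized stack.
func fingerprint(crashType, message, stack string) string {
	key := crashType + ":" + message + ":" + normalize(stack)
	sum := sha256.Sum256([]byte(key))
	return hex.EncodeToString(sum[:])[:16]
}

func main() {
	// Same bug; line numbers moved between releases
	a := fingerprint("lua_error", "attempt to index nil value 'user'",
		"main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk")
	b := fingerprint("lua_error", "attempt to index nil value 'user'",
		"main.lua:47: in function 'loadUser'\nmain.lua:20: in main chunk")
	fmt.Println(a == b) // true
}
```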
### Crash Groups Table
Crash groups are stored in the `crash_groups` table defined in the SQLite schema above: one row per `(app_id, fingerprint)`, with `status` one of `open`, `resolved`, or `ignored`, and `affected_versions` a JSON array of version strings. Ingestion upserts into this table, bumping `occurrence_count` and `last_seen` on conflict.
---
## Developer Dashboard
### Metrics View
```
┌─────────────────────────────────────────────────────────────┐
│ Analytics - My Calculator │
├─────────────────────────────────────────────────────────────┤
│ │
│ Date Range: [Last 30 days ▼] │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Daily Users │ │ Crashes │ │ Crash-free │ │
│ │ 1,234 │ │ 23 │ │ 98.1% │ │
│ │ ▲ +12% │ │ ▼ -45% │ │ ▲ +2% │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Daily Active Users │ │
│ │ [Line chart showing DAU over time] │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Version Distribution │ │
│ │ [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%]│ │
│ └────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
### Crashes View
```
┌─────────────────────────────────────────────────────────────┐
│ Crashes - My Calculator │
├─────────────────────────────────────────────────────────────┤
│ │
│ Filter: [All versions ▼] [Open ▼] │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ● attempt to index nil value 'user' │ │
│ │ lua_error • 156 occurrences • v1.2.0 │ │
│ │ First: Jan 10 • Last: Jan 15 │ │
│ │ [View] │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ ● memory limit exceeded │ │
│ │ sandbox_error • 23 occurrences • v1.1.0, v1.2.0 │ │
│ │ First: Jan 5 • Last: Jan 14 │ │
│ │ [View] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
```
---
## API Endpoints
```yaml
# Ingestion (from devices)
POST /v1/telemetry/events:
  auth: device_token or api_key
  body: { app_id, device_id, events[] }
  response: { received: number }

POST /v1/telemetry/crash:
  auth: device_token or api_key
  body: { app_id, device_id, crash }
  response: { id: string }

# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
  auth: required
  query: { start_date, end_date }
  response: { dau, crashes, crash_free_rate, ... }

GET /v1/apps/:id/analytics/events:
  auth: required
  query: { start_date, end_date, event_type }
  response: { data: [{ date, count, unique_devices }] }

GET /v1/apps/:id/crashes:
  auth: required
  query: { version, status, page, limit }
  response: { crashes: CrashGroup[], total }

GET /v1/apps/:id/crashes/:fingerprint:
  auth: required
  response: { crash_group, recent_occurrences[] }

PATCH /v1/apps/:id/crashes/:fingerprint:
  auth: required
  body: { status: 'resolved' | 'ignored' }
  response: { crash_group }
```
---
## Data Retention
| Data Type | Retention | Reason |
|-----------|-----------|--------|
| Raw events | 7 days | Debugging |
| Daily aggregates | 2 years | Trends |
| Crash reports | 90 days | Investigation |
| Crash groups | Forever | Issue tracking |
### Cleanup Job
```sql
-- Run daily (see the cleanupOldEvents worker above)
DELETE FROM events
WHERE timestamp < datetime('now', '-7 days');

DELETE FROM crash_reports
WHERE timestamp < datetime('now', '-90 days');
```
---
## Privacy Controls
### User Settings
```
Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]
```
### GDPR Endpoints
```yaml
# User requests their data
GET /v1/privacy/export:
auth: user_token
response: { download_url } # JSON export of all data
# User requests deletion
DELETE /v1/privacy/data:
auth: user_token
response: { status: 'scheduled' } # Delete within 30 days
```
---
## Deliverables
- [x] Storage approach decided (SQLite with separate telemetry.db)
- [ ] Event schema specification
- [ ] Client-side batching (Lua TelemetryManager)
- [ ] Ingestion API endpoints (Go + Chi)
- [ ] SQLite schema and migrations
- [ ] Background aggregation workers (Go goroutines)
- [ ] Crash grouping logic
- [ ] Developer analytics dashboard (htmx)
- [ ] Privacy controls (opt-out in manifest)
- [ ] Data retention cleanup job
- [ ] GDPR export/delete endpoints
---
## Open Questions
1. Real-time crash alerts? → Consider email notifications for v1.1
2. ~~Sampling for high-volume apps?~~ → Not needed for self-hosted scale
3. ~~Custom events API for developers?~~ → Yes, via manifest opt-in
4. ~~Benchmarks/comparisons with similar apps?~~ → Defer to post-MVP
---
## References
- [GDPR Requirements](https://gdpr.eu/)
- [Sentry Crash Grouping](https://docs.sentry.io/product/data-management-settings/event-grouping/)