Milestone 8: Telemetry System
Status: Decided
Goal: Collect app usage analytics and crash reports while respecting privacy.
Decision
SQLite with background aggregation for self-hosted Synology NAS:
Storage: SQLite (separate telemetry.db to isolate write load)
Aggregation: Go background goroutine (hourly/daily rollups)
Retention: Raw events 7 days, aggregates indefinitely
Privacy: Hashed device IDs, no PII, opt-out available
Rationale
- Simple - No separate time-series database needed
- SQLite scales - Can handle thousands of events/day easily
- Background jobs - Go goroutines for aggregation, cleanup
- Separate DB - Telemetry writes don't affect main portal.db
- Privacy-first - Minimal collection, hashed IDs
Architecture
┌─────────────────────────────────────────────────────────────────┐
│ mosis-portal container │
│ ┌────────────────────────────────────────────────────────────┐ │
│ │ Go Binary │ │
│ │ ┌─────────────┐ ┌────────────────┐ │ │
│ │ │ API Handler │───►│ Telemetry Svc │ │ │
│ │ │ POST /v1/ │ │ - Buffer events│ │ │
│ │ │ telemetry/* │ │ - Batch insert │ │ │
│ │ └─────────────┘ └───────┬────────┘ │ │
│ │ │ │ │
│ │ ┌─────────────────────────▼────────────────────────────┐ │ │
│ │ │ Background Workers │ │ │
│ │ │ • Hourly aggregation (event counts, unique devices) │ │ │
│ │ │ • Daily cleanup (delete raw events > 7 days) │ │ │
│ │ │ • Crash grouping (fingerprint + dedup) │ │ │
│ │ └───────────────────────────────────────────────────────┘ │ │
│ └──────────────────────────────┬─────────────────────────────┘ │
│ │ │
│ /volume1/mosis/data/ │ │
│ ├── portal.db (main) │ │
│ └── telemetry.db ◄────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Overview
Telemetry gives developers insight into app usage, performance, and crashes. It must balance usefulness with user privacy.
Privacy Principles
- Minimal collection - Only what's necessary
- No PII by default - Anonymized device IDs
- Transparency - Users know what's collected
- Opt-out available - Users can disable
- Data retention limits - Auto-delete old data
- GDPR compliance - Export/delete on request
Event Types
Automatic Events (Default)
| Event | Description | Data |
|---|---|---|
| app_start | App launched | version, mosis_version |
| app_stop | App closed | duration_seconds |
| app_crash | Unhandled error | crash_type, message |
| lua_error | Lua runtime error | message, stack (no user data) |
Performance Events (Default)
| Event | Description | Data |
|---|---|---|
| perf_frame | Frame time (sampled) | avg_ms, p95_ms |
| perf_memory | Memory usage | used_mb, limit_mb |
| perf_startup | Startup time | duration_ms |
Usage Events (Opt-in)
| Event | Description | Data |
|---|---|---|
| screen_view | Screen navigation | screen_name |
| button_click | UI interaction | element_id |
| feature_used | Feature usage | feature_name |
Data Schema
Event Payload
{
"app_id": "com.developer.myapp",
"app_version": "1.2.0",
"mosis_version": "1.0.0",
"device_id": "sha256_hashed_id",
"session_id": "uuid",
"events": [
{
"type": "app_start",
"timestamp": "2024-01-15T10:30:00Z",
"data": {}
},
{
"type": "screen_view",
"timestamp": "2024-01-15T10:30:05Z",
"data": {
"screen_name": "home"
}
}
]
}
Crash Report Payload
{
"app_id": "com.developer.myapp",
"app_version": "1.2.0",
"mosis_version": "1.0.0",
"device_id": "sha256_hashed_id",
"timestamp": "2024-01-15T10:35:00Z",
"crash": {
"type": "lua_error",
"message": "attempt to index nil value 'user'",
"stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk",
"context": {
"screen": "profile.rml",
"memory_mb": 45,
"uptime_seconds": 300
}
}
}
Device ID Hashing
-- On device
local raw_id = get_android_id() -- or similar
local hashed = sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."
-- Cannot reverse to original device ID
-- Different per app (can't track across apps)
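The same derivation, sketched in Go (useful server-side for tests or documentation; the salt prefix is taken from the Lua snippet above):

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// hashDeviceID mirrors the on-device hashing: SHA-256 over the raw
// platform ID plus an app-scoped salt. The result cannot be reversed
// to the original ID and differs per app, so users cannot be tracked
// across apps.
func hashDeviceID(rawID, appID string) string {
	sum := sha256.Sum256([]byte(rawID + "mosis_salt_" + appID))
	return hex.EncodeToString(sum[:])
}
```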
Collection Architecture
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Device │────►│ Batch │────►│ API │────►│ Storage │
│ │ │ Queue │ │ │ │ │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
│
│ Every 60s or
│ on app close
▼
┌──────────┐
│ Upload │
└──────────┘
Client-Side Batching
-- TelemetryManager on device
local events = {}
local last_flush = os.time()
function track(event_type, data)
if not telemetry_enabled then return end
table.insert(events, {
type = event_type,
timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
data = data or {}
})
-- Flush if batch is large or time elapsed
if #events >= 50 or (os.time() - last_flush) > 60 then
flush()
end
end
function flush()
if #events == 0 then return end
local payload = {
app_id = APP_ID,
app_version = APP_VERSION,
device_id = HASHED_DEVICE_ID,
events = events
}
-- Async HTTP POST
http.post(TELEMETRY_URL, json.encode(payload))
events = {}
last_flush = os.time()
end
Storage (SQLite)
Telemetry Database Schema
-- telemetry.db (separate from portal.db)
-- Raw events (7-day retention)
CREATE TABLE events (
id INTEGER PRIMARY KEY AUTOINCREMENT,
app_id TEXT NOT NULL,
device_id TEXT NOT NULL, -- SHA256 hashed
session_id TEXT,
event_type TEXT NOT NULL,
event_data TEXT, -- JSON string
app_version TEXT,
mosis_version TEXT,
timestamp TEXT NOT NULL -- ISO8601
);
CREATE INDEX idx_events_app_time ON events(app_id, timestamp);
CREATE INDEX idx_events_type ON events(event_type, timestamp);
-- Hourly aggregates (computed by background job)
CREATE TABLE hourly_stats (
app_id TEXT NOT NULL,
hour TEXT NOT NULL, -- YYYY-MM-DDTHH
event_type TEXT NOT NULL,
count INTEGER NOT NULL,
unique_devices INTEGER NOT NULL,
PRIMARY KEY (app_id, hour, event_type)
);
-- Daily aggregates (computed from hourly)
CREATE TABLE daily_stats (
app_id TEXT NOT NULL,
date TEXT NOT NULL, -- YYYY-MM-DD
event_type TEXT NOT NULL,
count INTEGER NOT NULL,
unique_devices INTEGER NOT NULL,
PRIMARY KEY (app_id, date, event_type)
);
-- Crash groups (deduplicated by fingerprint)
CREATE TABLE crash_groups (
id TEXT PRIMARY KEY,
app_id TEXT NOT NULL,
fingerprint TEXT NOT NULL,
crash_type TEXT NOT NULL,
message TEXT,
sample_stack_trace TEXT,
first_seen TEXT NOT NULL,
last_seen TEXT NOT NULL,
occurrence_count INTEGER DEFAULT 1,
affected_versions TEXT, -- JSON array
status TEXT DEFAULT 'open',
UNIQUE(app_id, fingerprint)
);
CREATE INDEX idx_crashes_app ON crash_groups(app_id, status);
Go Background Workers
// Start background workers
func (s *TelemetryService) StartWorkers(ctx context.Context) {
// Hourly aggregation
go s.runPeriodic(ctx, time.Hour, s.aggregateHourly)
// Daily aggregation (run at 2am)
go s.runDaily(ctx, 2, s.aggregateDaily)
// Cleanup old events (run at 3am)
go s.runDaily(ctx, 3, s.cleanupOldEvents)
}
func (s *TelemetryService) aggregateHourly(ctx context.Context) error {
hour := time.Now().UTC().Add(-time.Hour).Format("2006-01-02T15") // UTC, matching stored timestamps
_, err := s.db.ExecContext(ctx, `
INSERT OR REPLACE INTO hourly_stats (app_id, hour, event_type, count, unique_devices)
SELECT
app_id,
strftime('%Y-%m-%dT%H', timestamp) as hour,
event_type,
COUNT(*) as count,
COUNT(DISTINCT device_id) as unique_devices
FROM events
WHERE strftime('%Y-%m-%dT%H', timestamp) = ?
GROUP BY app_id, hour, event_type
`, hour)
return err
}
func (s *TelemetryService) cleanupOldEvents(ctx context.Context) error {
// Format in UTC with a trailing Z so the string comparison matches stored ISO8601 timestamps
cutoff := time.Now().UTC().AddDate(0, 0, -7).Format("2006-01-02T15:04:05Z")
_, err := s.db.ExecContext(ctx,
"DELETE FROM events WHERE timestamp < ?", cutoff)
return err
}
Aggregation
Pre-computed Metrics
| Metric | Granularity | Retention |
|---|---|---|
| Daily active users | Day | 2 years |
| Event counts | Day | 1 year |
| Crash counts | Day | 1 year |
| Session duration | Day | 90 days |
| Performance percentiles | Day | 90 days |
Aggregation Queries
-- Daily active users (from daily_stats; raw events are kept only 7 days)
SELECT
  date,
  unique_devices AS dau
FROM daily_stats
WHERE app_id = ?
  AND event_type = 'app_start'
  AND date > date('now', '-30 days')
ORDER BY date;
-- Crash rate by version (raw events, 7-day window)
SELECT
  app_version,
  SUM(event_type = 'app_crash') AS crashes,
  SUM(event_type = 'app_start') AS starts,
  ROUND(
    100.0 * SUM(event_type = 'app_crash') /
    NULLIF(SUM(event_type = 'app_start'), 0),
    2
  ) AS crash_rate
FROM events
WHERE app_id = ?
  AND timestamp > strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days')
GROUP BY app_version;
Crash Grouping
Stack Trace Fingerprinting
func fingerprintCrash(crash CrashReport) string {
	// Normalize stack trace
	normalized := normalizeStackTrace(crash.StackTrace)
	// Hash key components
	key := fmt.Sprintf("%s:%s:%s",
		crash.CrashType,
		crash.Message,
		normalized,
	)
	sum := sha256.Sum256([]byte(key)) // crypto/sha256
	return hex.EncodeToString(sum[:])[:16] // encoding/hex
}
func normalizeStackTrace(stack string) string {
// Remove line numbers (they change with code updates)
// Remove memory addresses
// Keep function names and file names
re := regexp.MustCompile(`:\d+:`)
return re.ReplaceAllString(stack, ":?:")
}
Crash Groups Table
Crash groups are stored in the crash_groups table defined in the telemetry database schema above: deduplicated on (app_id, fingerprint), with status one of open, resolved, or ignored.
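Deduplication writes to crash_groups via an upsert keyed on (app_id, fingerprint); since affected_versions is stored as a JSON array in TEXT, the merge can happen in Go before the write. A sketch (the helper names and the exact upsert shape are assumptions; ON CONFLICT requires SQLite 3.24+):

```go
package main

import "encoding/json"

// mergeVersions adds version to the JSON array stored in
// crash_groups.affected_versions, keeping entries unique.
func mergeVersions(existingJSON, version string) (string, error) {
	var versions []string
	if existingJSON != "" {
		if err := json.Unmarshal([]byte(existingJSON), &versions); err != nil {
			return "", err
		}
	}
	for _, v := range versions {
		if v == version { // already recorded
			b, err := json.Marshal(versions)
			return string(b), err
		}
	}
	versions = append(versions, version)
	b, err := json.Marshal(versions)
	return string(b), err
}

// upsertCrashGroup is the statement the dedup worker would run:
// insert a new group, or bump the existing one on fingerprint match.
const upsertCrashGroup = `
INSERT INTO crash_groups
  (id, app_id, fingerprint, crash_type, message, sample_stack_trace,
   first_seen, last_seen, occurrence_count, affected_versions)
VALUES (?, ?, ?, ?, ?, ?, ?, ?, 1, ?)
ON CONFLICT(app_id, fingerprint) DO UPDATE SET
  last_seen = excluded.last_seen,
  occurrence_count = occurrence_count + 1,
  affected_versions = excluded.affected_versions`
```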
Developer Dashboard
Metrics View
┌─────────────────────────────────────────────────────────────┐
│ Analytics - My Calculator │
├─────────────────────────────────────────────────────────────┤
│ │
│ Date Range: [Last 30 days ▼] │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
│ │ Daily Users │ │ Crashes │ │ Crash-free │ │
│ │ 1,234 │ │ 23 │ │ 98.1% │ │
│ │ ▲ +12% │ │ ▼ -45% │ │ ▲ +2% │ │
│ └─────────────┘ └─────────────┘ └─────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Daily Active Users │ │
│ │ [Line chart showing DAU over time] │ │
│ └────────────────────────────────────────────────────┘ │
│ │
│ ┌────────────────────────────────────────────────────┐ │
│ │ Version Distribution │ │
│ │ [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%]│ │
│ └────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
Crashes View
┌─────────────────────────────────────────────────────────────┐
│ Crashes - My Calculator │
├─────────────────────────────────────────────────────────────┤
│ │
│ Filter: [All versions ▼] [Open ▼] │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ ● attempt to index nil value 'user' │ │
│ │ lua_error • 156 occurrences • v1.2.0 │ │
│ │ First: Jan 10 • Last: Jan 15 │ │
│ │ [View] │ │
│ ├──────────────────────────────────────────────────────┤ │
│ │ ● memory limit exceeded │ │
│ │ sandbox_error • 23 occurrences • v1.1.0, v1.2.0 │ │
│ │ First: Jan 5 • Last: Jan 14 │ │
│ │ [View] │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────┘
API Endpoints
# Ingestion (from devices)
POST /v1/telemetry/events:
auth: device_token or api_key
body: { app_id, device_id, events[] }
response: { received: number }
POST /v1/telemetry/crash:
auth: device_token or api_key
body: { app_id, device_id, crash }
response: { id: string }
# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
auth: required
query: { start_date, end_date }
response: { dau, crashes, crash_free_rate, ... }
GET /v1/apps/:id/analytics/events:
auth: required
query: { start_date, end_date, event_type }
response: { data: [{ date, count, unique_devices }] }
GET /v1/apps/:id/crashes:
auth: required
query: { version, status, page, limit }
response: { crashes: CrashGroup[], total }
GET /v1/apps/:id/crashes/:fingerprint:
auth: required
response: { crash_group, recent_occurrences[] }
PATCH /v1/apps/:id/crashes/:fingerprint:
auth: required
body: { status: 'resolved' | 'ignored' }
response: { crash_group }
Data Retention
| Data Type | Retention | Reason |
|---|---|---|
| Raw events | 7 days | Debugging |
| Daily aggregates | 2 years | Trends |
| Crash reports | 90 days | Investigation |
| Crash groups | Forever | Issue tracking |
Cleanup Job
-- Run daily
DELETE FROM events
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-7 days');
DELETE FROM crash_reports
WHERE timestamp < strftime('%Y-%m-%dT%H:%M:%SZ', 'now', '-90 days');
Privacy Controls
User Settings
Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]
GDPR Endpoints
# User requests their data
GET /v1/privacy/export:
auth: user_token
response: { download_url } # JSON export of all data
# User requests deletion
DELETE /v1/privacy/data:
auth: user_token
response: { status: 'scheduled' } # Delete within 30 days
Deliverables
- Storage approach decided (SQLite with separate telemetry.db)
- Event schema specification
- Client-side batching (Lua TelemetryManager)
- Ingestion API endpoints (Go + Chi)
- SQLite schema and migrations
- Background aggregation workers (Go goroutines)
- Crash grouping logic
- Developer analytics dashboard (htmx)
- Privacy controls (opt-out in manifest)
- Data retention cleanup job
- GDPR export/delete endpoints
Open Questions
- Real-time crash alerts? → Consider email notifications for v1.1
- Sampling for high-volume apps? → Not needed for self-hosted scale
- Custom events API for developers? → Yes, via manifest opt-in
- Benchmarks/comparisons with similar apps? → Defer to post-MVP