504 lines
15 KiB
Markdown
504 lines
15 KiB
Markdown
# Milestone 8: Telemetry System
|
|
|
|
**Status**: Planning
|
|
**Goal**: Collect app usage analytics and crash reports while respecting privacy.
|
|
|
|
---
|
|
|
|
## Overview
|
|
|
|
Telemetry provides developers with insights into app usage, performance, and crashes. Must balance usefulness with user privacy.
|
|
|
|
---
|
|
|
|
## Privacy Principles
|
|
|
|
1. **Minimal collection** - Only what's necessary
|
|
2. **No PII by default** - Anonymized device IDs
|
|
3. **Transparency** - Users know what's collected
|
|
4. **Opt-out available** - Users can disable
|
|
5. **Data retention limits** - Auto-delete old data
|
|
6. **GDPR compliance** - Export/delete on request
|
|
|
|
---
|
|
|
|
## Event Types
|
|
|
|
### Automatic Events (Default)
|
|
|
|
| Event | Description | Data |
|
|
|-------|-------------|------|
|
|
| `app_start` | App launched | version, mosis_version |
|
|
| `app_stop` | App closed | duration_seconds |
|
|
| `app_crash` | Unhandled error | crash_type, message |
|
|
| `lua_error` | Lua runtime error | message, stack (no user data) |
|
|
|
|
### Performance Events (Default)
|
|
|
|
| Event | Description | Data |
|
|
|-------|-------------|------|
|
|
| `perf_frame` | Frame time (sampled) | avg_ms, p95_ms |
|
|
| `perf_memory` | Memory usage | used_mb, limit_mb |
|
|
| `perf_startup` | Startup time | duration_ms |
|
|
|
|
### Usage Events (Opt-in)
|
|
|
|
| Event | Description | Data |
|
|
|-------|-------------|------|
|
|
| `screen_view` | Screen navigation | screen_name |
|
|
| `button_click` | UI interaction | element_id |
|
|
| `feature_used` | Feature usage | feature_name |
|
|
|
|
---
|
|
|
|
## Data Schema
|
|
|
|
### Event Payload
|
|
|
|
```json
|
|
{
|
|
"app_id": "com.developer.myapp",
|
|
"app_version": "1.2.0",
|
|
"mosis_version": "1.0.0",
|
|
"device_id": "sha256_hashed_id",
|
|
"session_id": "uuid",
|
|
"events": [
|
|
{
|
|
"type": "app_start",
|
|
"timestamp": "2024-01-15T10:30:00Z",
|
|
"data": {}
|
|
},
|
|
{
|
|
"type": "screen_view",
|
|
"timestamp": "2024-01-15T10:30:05Z",
|
|
"data": {
|
|
"screen_name": "home"
|
|
}
|
|
}
|
|
]
|
|
}
|
|
```
|
|
|
|
### Crash Report Payload
|
|
|
|
```json
|
|
{
|
|
"app_id": "com.developer.myapp",
|
|
"app_version": "1.2.0",
|
|
"mosis_version": "1.0.0",
|
|
"device_id": "sha256_hashed_id",
|
|
"timestamp": "2024-01-15T10:35:00Z",
|
|
"crash": {
|
|
"type": "lua_error",
|
|
"message": "attempt to index nil value 'user'",
|
|
"stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk",
|
|
"context": {
|
|
"screen": "profile.rml",
|
|
"memory_mb": 45,
|
|
"uptime_seconds": 300
|
|
}
|
|
}
|
|
}
|
|
```
|
|
|
|
### Device ID Hashing
|
|
|
|
```lua
|
|
-- On device
|
|
local raw_id = get_android_id() -- or similar
|
|
local hashed = sha256(raw_id .. "mosis_salt_" .. app_id)
|
|
-- Result: "a3f2b1c4d5e6..."
|
|
|
|
-- Cannot reverse to original device ID
|
|
-- Different per app (can't track across apps)
|
|
```
|
|
|
|
---
|
|
|
|
## Collection Architecture
|
|
|
|
```
|
|
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
|
|
│ Device │────►│ Batch │────►│ API │────►│ Storage │
|
|
│ │ │ Queue │ │ │ │ │
|
|
└──────────┘ └──────────┘ └──────────┘ └──────────┘
|
|
│
|
|
│ Every 60s or
|
|
│ on app close
|
|
▼
|
|
┌──────────┐
|
|
│ Upload │
|
|
└──────────┘
|
|
```
|
|
|
|
### Client-Side Batching
|
|
|
|
```lua
|
|
-- TelemetryManager on device
|
|
local events = {}
|
|
local last_flush = os.time()
|
|
|
|
function track(event_type, data)
|
|
if not telemetry_enabled then return end
|
|
|
|
table.insert(events, {
|
|
type = event_type,
|
|
timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
|
|
data = data or {}
|
|
})
|
|
|
|
-- Flush if batch is large or time elapsed
|
|
if #events >= 50 or (os.time() - last_flush) > 60 then
|
|
flush()
|
|
end
|
|
end
|
|
|
|
function flush()
|
|
if #events == 0 then return end
|
|
|
|
local payload = {
|
|
app_id = APP_ID,
|
|
app_version = APP_VERSION,
|
|
device_id = HASHED_DEVICE_ID,
|
|
events = events
|
|
}
|
|
|
|
-- Async HTTP POST
|
|
http.post(TELEMETRY_URL, json.encode(payload))
|
|
|
|
events = {}
|
|
last_flush = os.time()
|
|
end
|
|
```
|
|
|
|
---
|
|
|
|
## Storage Options
|
|
|
|
### Option A: PostgreSQL + TimescaleDB
|
|
|
|
```sql
|
|
-- Hypertable for time-series data
|
|
CREATE TABLE telemetry_events (
|
|
time TIMESTAMPTZ NOT NULL,
|
|
app_id TEXT NOT NULL,
|
|
device_id TEXT NOT NULL,
|
|
session_id TEXT,
|
|
event_type TEXT NOT NULL,
|
|
event_data JSONB,
|
|
app_version TEXT,
|
|
mosis_version TEXT
|
|
);
|
|
|
|
SELECT create_hypertable('telemetry_events', 'time');
|
|
|
|
-- Continuous aggregate for daily stats
|
|
CREATE MATERIALIZED VIEW daily_stats
|
|
WITH (timescaledb.continuous) AS
|
|
SELECT
|
|
time_bucket('1 day', time) AS day,
|
|
app_id,
|
|
event_type,
|
|
COUNT(*) as count,
|
|
COUNT(DISTINCT device_id) as unique_devices
|
|
FROM telemetry_events
|
|
GROUP BY day, app_id, event_type;
|
|
```
|
|
|
|
### Option B: ClickHouse
|
|
|
|
```sql
|
|
CREATE TABLE telemetry_events (
|
|
timestamp DateTime,
|
|
app_id String,
|
|
device_id String,
|
|
session_id String,
|
|
event_type String,
|
|
event_data String, -- JSON
|
|
app_version String,
|
|
mosis_version String
|
|
) ENGINE = MergeTree()
|
|
PARTITION BY toYYYYMM(timestamp)
|
|
ORDER BY (app_id, timestamp);
|
|
```
|
|
|
|
### Option C: Custom + PostgreSQL
|
|
|
|
```
|
|
Raw events → Write to append-only log
|
|
Aggregator → Process hourly → Write to PostgreSQL
|
|
Cleanup → Delete raw after 24h
|
|
```
|
|
|
|
---
|
|
|
|
## Aggregation
|
|
|
|
### Pre-computed Metrics
|
|
|
|
| Metric | Granularity | Retention |
|
|
|--------|-------------|-----------|
|
|
| Daily active users | Day | 2 years |
|
|
| Event counts | Day | 1 year |
|
|
| Crash counts | Day | 1 year |
|
|
| Session duration | Day | 90 days |
|
|
| Performance percentiles | Day | 90 days |
|
|
|
|
### Aggregation Queries
|
|
|
|
```sql
|
|
-- Daily active users
|
|
SELECT
|
|
DATE_TRUNC('day', time) as day,
|
|
COUNT(DISTINCT device_id) as dau
|
|
FROM telemetry_events
|
|
WHERE app_id = $1
|
|
AND event_type = 'app_start'
|
|
AND time > NOW() - INTERVAL '30 days'
|
|
GROUP BY day
|
|
ORDER BY day;
|
|
|
|
-- Crash rate by version
|
|
SELECT
|
|
app_version,
|
|
COUNT(*) FILTER (WHERE event_type = 'app_crash') as crashes,
|
|
COUNT(*) FILTER (WHERE event_type = 'app_start') as starts,
|
|
ROUND(
|
|
100.0 * COUNT(*) FILTER (WHERE event_type = 'app_crash') /
|
|
NULLIF(COUNT(*) FILTER (WHERE event_type = 'app_start'), 0),
|
|
2
|
|
) as crash_rate
|
|
FROM telemetry_events
|
|
WHERE app_id = $1
|
|
AND time > NOW() - INTERVAL '7 days'
|
|
GROUP BY app_version;
|
|
```
|
|
|
|
---
|
|
|
|
## Crash Grouping
|
|
|
|
### Stack Trace Fingerprinting
|
|
|
|
```go
|
|
func fingerprintCrash(crash CrashReport) string {
|
|
// Normalize stack trace
|
|
normalized := normalizeStackTrace(crash.StackTrace)
|
|
|
|
// Hash key components
|
|
key := fmt.Sprintf("%s:%s:%s",
|
|
crash.CrashType,
|
|
crash.Message,
|
|
normalized,
|
|
)
|
|
|
|
return sha256(key)[:16]
|
|
}
|
|
|
|
func normalizeStackTrace(stack string) string {
|
|
// Remove line numbers (they change with code updates)
|
|
// Remove memory addresses
|
|
// Keep function names and file names
|
|
re := regexp.MustCompile(`:\d+:`)
|
|
return re.ReplaceAllString(stack, ":?:")
|
|
}
|
|
```
|
|
|
|
### Crash Groups Table
|
|
|
|
```sql
|
|
CREATE TABLE crash_groups (
|
|
id UUID PRIMARY KEY,
|
|
app_id TEXT NOT NULL,
|
|
fingerprint TEXT NOT NULL,
|
|
crash_type TEXT NOT NULL,
|
|
message TEXT,
|
|
sample_stack_trace TEXT,
|
|
first_seen TIMESTAMPTZ NOT NULL,
|
|
last_seen TIMESTAMPTZ NOT NULL,
|
|
occurrence_count INT DEFAULT 1,
|
|
affected_versions TEXT[],
|
|
status TEXT DEFAULT 'open', -- open, resolved, ignored
|
|
UNIQUE(app_id, fingerprint)
|
|
);
|
|
```
|
|
|
|
---
|
|
|
|
## Developer Dashboard
|
|
|
|
### Metrics View
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────┐
|
|
│ Analytics - My Calculator │
|
|
├─────────────────────────────────────────────────────────────┤
|
|
│ │
|
|
│ Date Range: [Last 30 days ▼] │
|
|
│ │
|
|
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │
|
|
│ │ Daily Users │ │ Crashes │ │ Crash-free │ │
|
|
│ │ 1,234 │ │ 23 │ │ 98.1% │ │
|
|
│ │ ▲ +12% │ │ ▼ -45% │ │ ▲ +2% │ │
|
|
│ └─────────────┘ └─────────────┘ └─────────────┘ │
|
|
│ │
|
|
│ ┌────────────────────────────────────────────────────┐ │
|
|
│ │ Daily Active Users │ │
|
|
│ │ [Line chart showing DAU over time] │ │
|
|
│ └────────────────────────────────────────────────────┘ │
|
|
│ │
|
|
│ ┌────────────────────────────────────────────────────┐ │
|
|
│ │ Version Distribution │ │
|
|
│ │ [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%]│ │
|
|
│ └────────────────────────────────────────────────────┘ │
|
|
│ │
|
|
└─────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
### Crashes View
|
|
|
|
```
|
|
┌─────────────────────────────────────────────────────────────┐
|
|
│ Crashes - My Calculator │
|
|
├─────────────────────────────────────────────────────────────┤
|
|
│ │
|
|
│ Filter: [All versions ▼] [Open ▼] │
|
|
│ │
|
|
│ ┌──────────────────────────────────────────────────────┐ │
|
|
│ │ ● attempt to index nil value 'user' │ │
|
|
│ │ lua_error • 156 occurrences • v1.2.0 │ │
|
|
│ │ First: Jan 10 • Last: Jan 15 │ │
|
|
│ │ [View] │ │
|
|
│ ├──────────────────────────────────────────────────────┤ │
|
|
│ │ ● memory limit exceeded │ │
|
|
│ │ sandbox_error • 23 occurrences • v1.1.0, v1.2.0 │ │
|
|
│ │ First: Jan 5 • Last: Jan 14 │ │
|
|
│ │ [View] │ │
|
|
│ └──────────────────────────────────────────────────────┘ │
|
|
│ │
|
|
└─────────────────────────────────────────────────────────────┘
|
|
```
|
|
|
|
---
|
|
|
|
## API Endpoints
|
|
|
|
```yaml
|
|
# Ingestion (from devices)
|
|
POST /v1/telemetry/events:
|
|
auth: device_token or api_key
|
|
body: { app_id, device_id, events[] }
|
|
response: { received: number }
|
|
|
|
POST /v1/telemetry/crash:
|
|
auth: device_token or api_key
|
|
body: { app_id, device_id, crash }
|
|
response: { id: string }
|
|
|
|
# Dashboard (for developers)
|
|
GET /v1/apps/:id/analytics/overview:
|
|
auth: required
|
|
query: { start_date, end_date }
|
|
response: { dau, crashes, crash_free_rate, ... }
|
|
|
|
GET /v1/apps/:id/analytics/events:
|
|
auth: required
|
|
query: { start_date, end_date, event_type }
|
|
response: { data: [{ date, count, unique_devices }] }
|
|
|
|
GET /v1/apps/:id/crashes:
|
|
auth: required
|
|
query: { version, status, page, limit }
|
|
response: { crashes: CrashGroup[], total }
|
|
|
|
GET /v1/apps/:id/crashes/:fingerprint:
|
|
auth: required
|
|
response: { crash_group, recent_occurrences[] }
|
|
|
|
PATCH /v1/apps/:id/crashes/:fingerprint:
|
|
auth: required
|
|
body: { status: 'resolved' | 'ignored' }
|
|
response: { crash_group }
|
|
```
|
|
|
|
---
|
|
|
|
## Data Retention
|
|
|
|
| Data Type | Retention | Reason |
|
|
|-----------|-----------|--------|
|
|
| Raw events | 7 days | Debugging |
|
|
| Daily aggregates | 2 years | Trends |
|
|
| Crash reports | 90 days | Investigation |
|
|
| Crash groups | Forever | Issue tracking |
|
|
|
|
### Cleanup Job
|
|
|
|
```sql
|
|
-- Run daily
|
|
DELETE FROM telemetry_events
|
|
WHERE time < NOW() - INTERVAL '7 days';
|
|
|
|
DELETE FROM crash_reports
|
|
WHERE timestamp < NOW() - INTERVAL '90 days';
|
|
```
|
|
|
|
---
|
|
|
|
## Privacy Controls
|
|
|
|
### User Settings
|
|
|
|
```
|
|
Settings > Privacy > Analytics
|
|
├── [✓] Send crash reports (helps developers fix bugs)
|
|
├── [ ] Send usage analytics (how you use apps)
|
|
└── [Request Data Deletion]
|
|
```
|
|
|
|
### GDPR Endpoints
|
|
|
|
```yaml
|
|
# User requests their data
|
|
GET /v1/privacy/export:
|
|
auth: user_token
|
|
response: { download_url } # JSON export of all data
|
|
|
|
# User requests deletion
|
|
DELETE /v1/privacy/data:
|
|
auth: user_token
|
|
response: { status: 'scheduled' } # Delete within 30 days
|
|
```
|
|
|
|
---
|
|
|
|
## Deliverables
|
|
|
|
- [ ] Event schema specification
|
|
- [ ] Client-side SDK for batching
|
|
- [ ] Ingestion API endpoints
|
|
- [ ] Storage setup (TimescaleDB or ClickHouse)
|
|
- [ ] Aggregation jobs
|
|
- [ ] Crash grouping logic
|
|
- [ ] Developer dashboard
|
|
- [ ] Privacy controls
|
|
- [ ] Data retention automation
|
|
- [ ] GDPR export/delete
|
|
|
|
---
|
|
|
|
## Open Questions
|
|
|
|
1. Real-time crash alerts (email/Slack)?
|
|
2. Sampling for high-volume apps?
|
|
3. Custom events API for developers?
|
|
4. Benchmarks/comparisons with similar apps?
|
|
|
|
---
|
|
|
|
## References
|
|
|
|
- [GDPR Requirements](https://gdpr.eu/)
|
|
- [TimescaleDB Best Practices](https://docs.timescale.com/timescaledb/latest/)
|
|
- [Sentry Crash Grouping](https://docs.sentry.io/product/data-management-settings/event-grouping/)
|