Files
MosisService/DEV_PORTAL_M08_TELEMETRY.md

15 KiB

Milestone 8: Telemetry System

Status: Planning Goal: Collect app usage analytics and crash reports while respecting privacy.


Overview

Telemetry provides developers with insights into app usage, performance, and crashes. Must balance usefulness with user privacy.


Privacy Principles

  1. Minimal collection - Only what's necessary
  2. No PII by default - Anonymized device IDs
  3. Transparency - Users know what's collected
  4. Opt-out available - Users can disable
  5. Data retention limits - Auto-delete old data
  6. GDPR compliance - Export/delete on request

Event Types

Automatic Events (Default)

Event Description Data
app_start App launched version, mosis_version
app_stop App closed duration_seconds
app_crash Unhandled error crash_type, message
lua_error Lua runtime error message, stack (no user data)

Performance Events (Default)

Event Description Data
perf_frame Frame time (sampled) avg_ms, p95_ms
perf_memory Memory usage used_mb, limit_mb
perf_startup Startup time duration_ms

Usage Events (Opt-in)

Event Description Data
screen_view Screen navigation screen_name
button_click UI interaction element_id
feature_used Feature usage feature_name

Data Schema

Event Payload

{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "session_id": "uuid",
  "events": [
    {
      "type": "app_start",
      "timestamp": "2024-01-15T10:30:00Z",
      "data": {}
    },
    {
      "type": "screen_view",
      "timestamp": "2024-01-15T10:30:05Z",
      "data": {
        "screen_name": "home"
      }
    }
  ]
}

Crash Report Payload

{
  "app_id": "com.developer.myapp",
  "app_version": "1.2.0",
  "mosis_version": "1.0.0",
  "device_id": "sha256_hashed_id",
  "timestamp": "2024-01-15T10:35:00Z",
  "crash": {
    "type": "lua_error",
    "message": "attempt to index nil value 'user'",
    "stack_trace": "main.lua:42: in function 'loadUser'\nmain.lua:15: in main chunk",
    "context": {
      "screen": "profile.rml",
      "memory_mb": 45,
      "uptime_seconds": 300
    }
  }
}

Device ID Hashing

-- On device
local raw_id = get_android_id() -- or similar
local hashed = sha256(raw_id .. "mosis_salt_" .. app_id)
-- Result: "a3f2b1c4d5e6..."

-- Cannot reverse to original device ID
-- Different per app (can't track across apps)

Collection Architecture

┌──────────┐     ┌──────────┐     ┌──────────┐     ┌──────────┐
│  Device  │────►│  Batch   │────►│   API    │────►│ Storage  │
│          │     │  Queue   │     │          │     │          │
└──────────┘     └──────────┘     └──────────┘     └──────────┘
                      │
                      │ Every 60s or
                      │ on app close
                      ▼
                 ┌──────────┐
                 │  Upload  │
                 └──────────┘

Client-Side Batching

-- TelemetryManager on device
local events = {}
local last_flush = os.time()

function track(event_type, data)
    if not telemetry_enabled then return end

    table.insert(events, {
        type = event_type,
        timestamp = os.date("!%Y-%m-%dT%H:%M:%SZ"),
        data = data or {}
    })

    -- Flush if batch is large or time elapsed
    if #events >= 50 or (os.time() - last_flush) > 60 then
        flush()
    end
end

function flush()
    if #events == 0 then return end

    local payload = {
        app_id = APP_ID,
        app_version = APP_VERSION,
        device_id = HASHED_DEVICE_ID,
        events = events
    }

    -- Async HTTP POST
    http.post(TELEMETRY_URL, json.encode(payload))

    events = {}
    last_flush = os.time()
end

Storage Options

Option A: PostgreSQL + TimescaleDB

-- Hypertable for time-series data
CREATE TABLE telemetry_events (
    time        TIMESTAMPTZ NOT NULL,
    app_id      TEXT NOT NULL,
    device_id   TEXT NOT NULL,
    session_id  TEXT,
    event_type  TEXT NOT NULL,
    event_data  JSONB,
    app_version TEXT,
    mosis_version TEXT
);

SELECT create_hypertable('telemetry_events', 'time');

-- Continuous aggregate for daily stats
CREATE MATERIALIZED VIEW daily_stats
WITH (timescaledb.continuous) AS
SELECT
    time_bucket('1 day', time) AS day,
    app_id,
    event_type,
    COUNT(*) as count,
    COUNT(DISTINCT device_id) as unique_devices
FROM telemetry_events
GROUP BY day, app_id, event_type;

Option B: ClickHouse

CREATE TABLE telemetry_events (
    timestamp DateTime,
    app_id String,
    device_id String,
    session_id String,
    event_type String,
    event_data String,  -- JSON
    app_version String,
    mosis_version String
) ENGINE = MergeTree()
PARTITION BY toYYYYMM(timestamp)
ORDER BY (app_id, timestamp);

Option C: Custom + PostgreSQL

Raw events → Write to append-only log
Aggregator → Process hourly → Write to PostgreSQL
Cleanup    → Delete raw after 24h

Aggregation

Pre-computed Metrics

Metric Granularity Retention
Daily active users Day 2 years
Event counts Day 1 year
Crash counts Day 1 year
Session duration Day 90 days
Performance percentiles Day 90 days

Aggregation Queries

-- Daily active users
SELECT
    DATE_TRUNC('day', time) as day,
    COUNT(DISTINCT device_id) as dau
FROM telemetry_events
WHERE app_id = $1
  AND event_type = 'app_start'
  AND time > NOW() - INTERVAL '30 days'
GROUP BY day
ORDER BY day;

-- Crash rate by version
SELECT
    app_version,
    COUNT(*) FILTER (WHERE event_type = 'app_crash') as crashes,
    COUNT(*) FILTER (WHERE event_type = 'app_start') as starts,
    ROUND(
        100.0 * COUNT(*) FILTER (WHERE event_type = 'app_crash') /
        NULLIF(COUNT(*) FILTER (WHERE event_type = 'app_start'), 0),
        2
    ) as crash_rate
FROM telemetry_events
WHERE app_id = $1
  AND time > NOW() - INTERVAL '7 days'
GROUP BY app_version;

Crash Grouping

Stack Trace Fingerprinting

func fingerprintCrash(crash CrashReport) string {
    // Normalize stack trace
    normalized := normalizeStackTrace(crash.StackTrace)

    // Hash key components
    key := fmt.Sprintf("%s:%s:%s",
        crash.CrashType,
        crash.Message,
        normalized,
    )

    return sha256(key)[:16]
}

func normalizeStackTrace(stack string) string {
    // Remove line numbers (they change with code updates)
    // Remove memory addresses
    // Keep function names and file names
    re := regexp.MustCompile(`:\d+:`)
    return re.ReplaceAllString(stack, ":?:")
}

Crash Groups Table

CREATE TABLE crash_groups (
    id UUID PRIMARY KEY,
    app_id TEXT NOT NULL,
    fingerprint TEXT NOT NULL,
    crash_type TEXT NOT NULL,
    message TEXT,
    sample_stack_trace TEXT,
    first_seen TIMESTAMPTZ NOT NULL,
    last_seen TIMESTAMPTZ NOT NULL,
    occurrence_count INT DEFAULT 1,
    affected_versions TEXT[],
    status TEXT DEFAULT 'open',  -- open, resolved, ignored
    UNIQUE(app_id, fingerprint)
);

Developer Dashboard

Metrics View

┌─────────────────────────────────────────────────────────────┐
│  Analytics - My Calculator                                  │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Date Range: [Last 30 days ▼]                               │
│                                                             │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐         │
│  │ Daily Users │  │   Crashes   │  │ Crash-free  │         │
│  │    1,234    │  │     23      │  │   98.1%     │         │
│  │   ▲ +12%    │  │   ▼ -45%    │  │   ▲ +2%     │         │
│  └─────────────┘  └─────────────┘  └─────────────┘         │
│                                                             │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Daily Active Users                                │    │
│  │  [Line chart showing DAU over time]                │    │
│  └────────────────────────────────────────────────────┘    │
│                                                             │
│  ┌────────────────────────────────────────────────────┐    │
│  │  Version Distribution                              │    │
│  │  [Pie chart: v1.2.0: 60%, v1.1.0: 30%, v1.0.0: 10%]│   │
│  └────────────────────────────────────────────────────┘    │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Crashes View

┌─────────────────────────────────────────────────────────────┐
│  Crashes - My Calculator                                    │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Filter: [All versions ▼] [Open ▼]                          │
│                                                             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │ ● attempt to index nil value 'user'                  │  │
│  │   lua_error • 156 occurrences • v1.2.0              │  │
│  │   First: Jan 10 • Last: Jan 15                       │  │
│  │                                               [View] │  │
│  ├──────────────────────────────────────────────────────┤  │
│  │ ● memory limit exceeded                              │  │
│  │   sandbox_error • 23 occurrences • v1.1.0, v1.2.0   │  │
│  │   First: Jan 5 • Last: Jan 14                        │  │
│  │                                               [View] │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘

API Endpoints

# Ingestion (from devices)
POST /v1/telemetry/events:
  auth: device_token or api_key
  body: { app_id, device_id, events[] }
  response: { received: number }

POST /v1/telemetry/crash:
  auth: device_token or api_key
  body: { app_id, device_id, crash }
  response: { id: string }

# Dashboard (for developers)
GET /v1/apps/:id/analytics/overview:
  auth: required
  query: { start_date, end_date }
  response: { dau, crashes, crash_free_rate, ... }

GET /v1/apps/:id/analytics/events:
  auth: required
  query: { start_date, end_date, event_type }
  response: { data: [{ date, count, unique_devices }] }

GET /v1/apps/:id/crashes:
  auth: required
  query: { version, status, page, limit }
  response: { crashes: CrashGroup[], total }

GET /v1/apps/:id/crashes/:fingerprint:
  auth: required
  response: { crash_group, recent_occurrences[] }

PATCH /v1/apps/:id/crashes/:fingerprint:
  auth: required
  body: { status: 'resolved' | 'ignored' }
  response: { crash_group }

Data Retention

Data Type Retention Reason
Raw events 7 days Debugging
Daily aggregates 2 years Trends
Crash reports 90 days Investigation
Crash groups Forever Issue tracking

Cleanup Job

-- Run daily
DELETE FROM telemetry_events
WHERE time < NOW() - INTERVAL '7 days';

DELETE FROM crash_reports
WHERE timestamp < NOW() - INTERVAL '90 days';

Privacy Controls

User Settings

Settings > Privacy > Analytics
├── [✓] Send crash reports (helps developers fix bugs)
├── [ ] Send usage analytics (how you use apps)
└── [Request Data Deletion]

GDPR Endpoints

# User requests their data
GET /v1/privacy/export:
  auth: user_token
  response: { download_url }  # JSON export of all data

# User requests deletion
DELETE /v1/privacy/data:
  auth: user_token
  response: { status: 'scheduled' }  # Delete within 30 days

Deliverables

  • Event schema specification
  • Client-side SDK for batching
  • Ingestion API endpoints
  • Storage setup (TimescaleDB or ClickHouse)
  • Aggregation jobs
  • Crash grouping logic
  • Developer dashboard
  • Privacy controls
  • Data retention automation
  • GDPR export/delete

Open Questions

  1. Real-time crash alerts (email/Slack)?
  2. Sampling for high-volume apps?
  3. Custom events API for developers?
  4. Benchmarks/comparisons with similar apps?

References