Milestone 6: URL & Content Ingestion

Overview

Ingest content from URLs, PDFs, and documents into the knowledge graph. Automatically chunk, summarize, and link to existing knowledge.

Motivation

  • Knowledge exists outside the codebase (docs, articles, specs)
  • Manual copy-paste is tedious and loses structure
  • Supermemory's multi-source ingestion is a key feature
  • Research and documentation should be first-class

Features

6.1 URL Ingestion

# Ingest a webpage
cortex ingest https://docs.example.com/api

# Ingest with custom title
cortex ingest https://... --title "API Documentation"

# Ingest and tag
cortex ingest https://... --tags docs,api,reference

6.2 PDF Ingestion

# Ingest a PDF
cortex ingest ./spec.pdf

# Ingest specific pages
cortex ingest ./spec.pdf --pages 1-10

# Ingest with chunking strategy
cortex ingest ./spec.pdf --chunk-size 1000

6.3 Markdown/Text Ingestion

# Ingest markdown file
cortex ingest ./notes.md

# Ingest from stdin
cat notes.txt | cortex ingest --stdin

# Ingest clipboard
cortex ingest --clipboard
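
The --stdin and --clipboard paths might read raw text like the sketch below; clipboardy is an assumed dependency, not yet listed in this spec:

// Sketch: raw-text readers for --stdin and --clipboard (clipboardy is an assumed dependency)
import clipboard from 'clipboardy';

async function readStdin(): Promise<string> {
  const chunks: Buffer[] = [];
  for await (const chunk of process.stdin) {
    chunks.push(chunk as Buffer);
  }
  return Buffer.concat(chunks).toString('utf8');
}

async function readClipboard(): Promise<string> {
  return clipboard.read();
}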

6.4 Smart Chunking

Large documents are split intelligently:

interface ChunkStrategy {
  maxTokens: number;       // Max tokens per chunk
  overlap: number;         // Overlap between chunks
  splitOn: 'paragraph' | 'sentence' | 'heading' | 'page';
  preserveStructure: boolean;
}
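
A minimal sketch of the 'paragraph' strategy, approximating tokens with whitespace-delimited words (a real tokenizer would replace that assumption) and implementing overlap by carrying the last paragraph forward:

// Sketch: paragraph-based chunking; word counts stand in for real token counts
function chunkContent(markdown: string, strategy: ChunkStrategy): string[] {
  const paragraphs = markdown.split(/\n{2,}/);
  const chunks: string[] = [];
  let current: string[] = [];
  let tokens = 0;

  for (const para of paragraphs) {
    const paraTokens = para.split(/\s+/).length;
    if (tokens + paraTokens > strategy.maxTokens && current.length > 0) {
      chunks.push(current.join('\n\n'));
      // Approximate overlap: repeat the last paragraph at the start of the next chunk
      current = strategy.overlap > 0 ? [current[current.length - 1]] : [];
      tokens = current.reduce((n, p) => n + p.split(/\s+/).length, 0);
    }
    current.push(para);
    tokens += paraTokens;
  }
  if (current.length > 0) chunks.push(current.join('\n\n'));
  return chunks;
}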

6.5 Entity Extraction

Extract and link entities:

interface ExtractedEntities {
  people: string[];
  organizations: string[];
  technologies: string[];
  concepts: string[];
}

// Auto-link to existing nodes with matching titles/tags
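
The linking half could be a title lookup over the existing graph; findNodesByTitle and addEdge are hypothetical store helpers here, and the extraction step itself (NER or an LLM call) is left abstract:

// Sketch: link a node to existing nodes whose titles match extracted entity names
declare function findNodesByTitle(title: string): Promise<{ id: string }[]>; // hypothetical
declare function addEdge(from: string, to: string, type: string): Promise<void>; // hypothetical

async function linkEntities(nodeId: string, entities: ExtractedEntities): Promise<void> {
  const names = [
    ...entities.people,
    ...entities.organizations,
    ...entities.technologies,
    ...entities.concepts,
  ];
  for (const name of names) {
    for (const match of await findNodesByTitle(name)) {
      await addEdge(nodeId, match.id, 'mentions');
    }
  }
}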

6.6 Source Tracking

Track where content came from:

// Source metadata recorded on every ingested node
interface SourceMetadata {
  source: {
    type: 'url' | 'pdf' | 'file' | 'clipboard';
    url?: string;
    filePath?: string;
    ingestedAt: number;  // Ingestion timestamp
    checksum: string;    // Content hash, used for deduplication
  };
}
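
A sketch of the deduplication checksum using Node's built-in crypto module; findNodeByChecksum is a hypothetical store lookup:

// Sketch: content hash for deduplication (node:crypto is built in)
import { createHash } from 'node:crypto';

export function contentChecksum(content: string): string {
  return createHash('sha256').update(content).digest('hex');
}

// Before creating nodes, something like findNodeByChecksum(checksum)
// (hypothetical) could detect already-ingested content and skip it.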

Implementation

Ingestion Pipeline

// src/core/ingest/index.ts
export async function ingest(source: string, options: IngestOptions): Promise<IngestResult> {
  // Detect source type
  const sourceType = detectSourceType(source);

  // Fetch/read content
  const rawContent = await fetchContent(source, sourceType);

  // Convert to markdown
  const markdown = await convertToMarkdown(rawContent, sourceType);

  // Chunk if needed
  const chunks = chunkContent(markdown, options.chunkStrategy);

  // Create nodes
  const nodes: Node[] = [];

  if (chunks.length === 1) {
    // Single node
    const node = await createIngestNode(chunks[0], source, options);
    nodes.push(node);
  } else {
    // Parent + children
    const parent = await createParentNode(source, chunks, options);
    nodes.push(parent);

    for (const chunk of chunks) {
      const child = await createChunkNode(chunk, parent.id, options);
      nodes.push(child);
      addEdge(parent.id, child.id, 'contains');
    }
  }

  // Extract and link entities
  for (const node of nodes) {
    await extractAndLinkEntities(node);
  }

  // Find and link related nodes
  for (const node of nodes) {
    await linkRelatedNodes(node);
  }

  return { nodes: nodes.length, source: sourceType };
}
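
detectSourceType is not pinned down by this spec; one plausible sketch dispatches on URL scheme and file extension:

// Sketch: source-type detection (the exact rules are an assumption)
import path from 'node:path';

type SourceType = 'url' | 'pdf' | 'markdown' | 'file' | 'clipboard';

function detectSourceType(source: string): SourceType {
  if (/^https?:\/\//i.test(source)) return 'url';
  const ext = path.extname(source).toLowerCase();
  if (ext === '.pdf') return 'pdf';
  if (ext === '.md' || ext === '.markdown') return 'markdown';
  return 'file';
}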

URL Fetcher

// src/core/ingest/fetchers/url.ts
import { JSDOM } from 'jsdom';
import { Readability } from '@mozilla/readability';

export async function fetchUrl(url: string): Promise<FetchedContent> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${response.status}`);
  }
  const html = await response.text();

  // Use Readability to extract the main article content
  const doc = new JSDOM(html);
  const reader = new Readability(doc.window.document);
  const article = reader.parse();

  return {
    title: article?.title || url,
    content: article?.textContent || '',
    html: article?.content || html,
  };
}

PDF Parser

// src/core/ingest/fetchers/pdf.ts
import fs from 'node:fs';
import pdfParse from 'pdf-parse';

export async function parsePdf(filePath: string, options?: PdfOptions): Promise<ParsedPdf> {
  // pdf-parse extracts the full text; pdfjs-dist is the alternative if per-page control is needed
  const dataBuffer = fs.readFileSync(filePath);
  const data = await pdfParse(dataBuffer);

  return {
    text: data.text,
    pages: data.numpages,
    metadata: data.info,
  };
}
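
The --pages flag implies a range parser; a small sketch for the "1-10" syntax shown earlier (comma-separated lists are an added assumption):

// Sketch: parse a --pages spec like "1-10" or "3,5,7-9" into page numbers
function parsePageRange(spec: string): number[] {
  const pages: number[] = [];
  for (const part of spec.split(',')) {
    const [start, end] = part.split('-').map((n) => parseInt(n, 10));
    if (Number.isNaN(start)) continue;
    if (end === undefined || Number.isNaN(end)) {
      pages.push(start);
    } else {
      for (let p = start; p <= end; p++) pages.push(p);
    }
  }
  return pages;
}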

Markdown Converter

// src/core/ingest/convert.ts
import TurndownService from 'turndown';

const turndown = new TurndownService();

export async function convertToMarkdown(content: FetchedContent, type: SourceType): Promise<string> {
  switch (type) {
    case 'url':
      return turndown.turndown(content.html);
    case 'pdf':
      return content.text; // Already text
    case 'markdown':
      return content.content;
    default:
      return content.content;
  }
}

CLI Commands

Command                          Description
cortex ingest <source>           Ingest a URL, file, or path
cortex ingest --clipboard        Ingest from the clipboard
cortex ingest --stdin            Ingest from stdin
cortex ingest --title <title>    Override the title
cortex ingest --tags <tags>      Add tags
cortex ingest --chunk-size <n>   Set the chunk size
cortex ingest --no-link          Skip auto-linking

MCP Tools

memory_ingest     // Ingest URL or content
memory_clip       // Quick clip from URL
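
A possible descriptor for memory_ingest, mirroring the CLI flags; the exact schema shape is an assumption, not a settled interface:

// Sketch: MCP tool descriptor for memory_ingest (schema shape is an assumption)
const memoryIngestTool = {
  name: 'memory_ingest',
  description: 'Ingest a URL, file, or raw content into the knowledge graph',
  inputSchema: {
    type: 'object',
    properties: {
      source: { type: 'string', description: 'URL, file path, or raw text' },
      title: { type: 'string', description: 'Optional title override' },
      tags: { type: 'array', items: { type: 'string' } },
    },
    required: ['source'],
  },
};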

Testing

  • URL ingestion extracts main content
  • PDF parsing handles multi-page docs
  • Chunking preserves context
  • Entities extracted and linked
  • Duplicate content detected
  • Source metadata preserved

Acceptance Criteria

  • URLs ingested with readable extraction
  • PDFs parsed into searchable text
  • Large docs chunked intelligently
  • Related nodes auto-linked
  • Source tracked for reference
  • Checksum-based deduplication prevents re-ingesting identical content

Estimated Effort

  • URL fetcher + Readability: 4 hours
  • PDF parser: 4 hours
  • Chunking strategy: 3 hours
  • Entity extraction: 4 hours
  • Auto-linking: 3 hours
  • CLI commands: 2 hours
  • Testing: 3 hours
  • Total: ~23 hours

Dependencies

  • @mozilla/readability for URL content extraction
  • pdf-parse or pdfjs-dist for PDFs
  • turndown for HTML→Markdown
