vectors: prune orphan embeddings after delete-deduped (cheaper than build-embeddings --full-rebuild) #313

@wesm

Context

Surfaced during the design re-read of PR #304. The post-run hint correction landed in commit f48e8f1, but the underlying gap (orphan embeddings accumulating in vectors.db after delete-deduped purges rows from messages) is real and was deferred.

What happens

The vector backend's design contract is documented at internal/vector/sqlitevec/backend.go:300-306:

Dedup Execute does not remove vector-store rows by design: if a message is embedded then later soft-deleted, the embedding stays in the vector store and query-time live filtering (dropDeletedFromSource, filteredMessageIDs) enforces the live-message contract.

This is correct for soft-delete (deleted_at), where the message row still exists and the join still works. After delete-deduped permanently removes message rows, the vector-store rows whose message_id no longer joins are orphaned:

  • They consume disk space in vectors.db.
  • They take up candidate slots in dropDeletedFromSource, whose deletedOverfetchFactor = 2 pad (backend.go:797) assumes the fraction of orphans stays roughly constant.
  • They never get pruned. Over months of dedup + purge cycles, the orphan count grows without bound relative to the live corpus.

The post-run hint in delete-deduped (now corrected by f48e8f1) tells the user to run build-embeddings --full-rebuild, which recreates the vector index from scratch. That is a heavy operation that pays the embedding-API cost again for the entire corpus: a workaround, not a maintenance command.

Why it matters

  • build-embeddings --full-rebuild is expensive: it re-embeds every message through the configured endpoint. Users running large archives will avoid it.
  • The over-fetch factor was tuned for a low orphan ratio. As orphans accumulate, ANN recall degrades because the live subset of the top-K shrinks.
  • Long-running daemonized deployments (serve) compound the problem.

Proposed approach

Add a lightweight vectors prune (or build-embeddings --prune-orphans) command that:

  1. Reads message IDs from the vector backend.
  2. Anti-joins against main.messages.id.
  3. Deletes vector-store rows whose message_id has no live message row.

This is much cheaper than a full rebuild: no embedding API calls, just a query shaped like DELETE FROM vec_chunks WHERE message_id NOT IN (SELECT id FROM messages).

Optionally hook it into delete-deduped as a post-step, gated by a flag, so the cleanup happens in-line for users who want it while users who batch their vector maintenance separately can skip it.
