PageIndex | Vectorless, Reasoning-Based RAG via Tree Search

PageIndex

Introduction

PageIndex is an open-source RAG (Retrieval-Augmented Generation) framework developed by VectifyAI. It departs from the usual chunk-and-embed approach backed by a vector database, offering a ‘vectorless’ document retrieval architecture inspired by AlphaGo’s tree-search logic. Instead of embedding fixed paragraph slices and ranking them by cosine similarity, PageIndex builds a hierarchical ‘Table-of-Contents’ JSON tree of a long document. An LLM reasoning loop then navigates that tree to look up the relevant sections and pages, with the aim of improving accuracy on long, complex professional documents.
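
To make the idea of a ‘Table-of-Contents’ tree concrete, here is a minimal Python sketch of what such an index could look like. The `Node` class, its field names, and the sample report are illustrative assumptions and are not PageIndex's actual schema; the sketch only shows that each node carries a title, a short summary, and a page range, and that the whole tree serializes to JSON.

```python
from dataclasses import dataclass, field
from typing import List
import json


@dataclass
class Node:
    """One section of a document (hypothetical schema, for illustration only)."""
    title: str
    summary: str
    start_page: int
    end_page: int
    children: List["Node"] = field(default_factory=list)


def to_dict(node: Node) -> dict:
    """Convert the tree into plain dicts so it can be dumped as JSON."""
    return {
        "title": node.title,
        "summary": node.summary,
        "pages": [node.start_page, node.end_page],
        "children": [to_dict(child) for child in node.children],
    }


def outline(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as an indented table of contents with page ranges."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# A made-up annual report, used only to exercise the structure.
report = Node("Annual Report", "Full-year results and disclosures.", 1, 120, [
    Node("Business Overview", "Segments and strategy.", 1, 30),
    Node("Financial Statements", "Audited statements and notes.", 31, 120, [
        Node("Income Statement", "Revenue, costs, net income.", 31, 40),
        Node("Notes", "Accounting policies and footnotes.", 41, 120),
    ]),
])

print("\n".join(outline(report)))
print(json.dumps(to_dict(report), indent=2))
```

Because every node keeps its page range, whatever a retriever selects from this tree can be cited back to specific pages.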

Use Cases

High-Precision Financial Auditing
Analyze complex SEC filings, annual reports, and earnings disclosures so that LLMs can extract specific financial metrics, each with a traceable page citation.
Legal Contract & Discovery Review
Navigate hundred-page, multi-clause legal contracts to evaluate liabilities, compliance criteria, and clause dependencies with less risk of overlooking footnotes or definitions.
Enterprise Knowledge Exploration
Query dense corporate standard operating procedures (SOPs), engineering manuals, or HR policies through a structured index, avoiding the noise of approximate vector matching.
Explainable AI & Audit Trails
Build regulated data applications where every retrieved fact must be explicitly justified by a logical reasoning trajectory and direct page-number alignment.
Agentic Knowledge Graph Backends
Equip autonomous multi-agent swarms with an analytical tool that lets them intelligently inspect document architectures rather than parsing unorganized text dumps.

Features & Benefits

Vectorless Tree-Index Architecture
Replaces the traditional vector database with a lightweight, hierarchical JSON tree index carrying section names, structural summaries, and page ranges.
AlphaGo-Inspired Tree Search
Implements a multi-step LLM reasoning loop that scans the Table-of-Contents, identifies candidate nodes, reads the underlying text, and decides whether further navigation is required.
Zero-Chunking Retrieval Integrity
Eliminates semantic fragmentation by preserving complete section and page contexts, preventing the typical information loss caused by arbitrary token splitting.
Model Context Protocol (MCP) Support
Features a native MCP server implementation (pageindex-mcp) that allows AI coding agents (such as Claude Code or Cursor) to interact with local file structures directly.
State-of-the-Art Benchmark Rankings
Reports 98.7% accuracy on FinanceBench, outperforming traditional top-k nearest-neighbor vector RAG setups on financial data.
Fully Typed SDKs & CLI
Provides clean, developer-centric TypeScript (@pageindex/sdk) and Python libraries to submit documents, fetch trees, and handle streaming chat completions.
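
The tree-search loop described under ‘AlphaGo-Inspired Tree Search’ can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not PageIndex's implementation: the `Node` class is hypothetical, `keyword_picker` is a keyword-overlap stand-in for the LLM call that would choose a candidate node, and the loop follows one candidate per level down to a leaf without backtracking.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Node:
    """One section of the index (hypothetical schema, for illustration only)."""
    title: str
    summary: str
    pages: Tuple[int, int]
    children: List["Node"] = field(default_factory=list)


def words(text: str) -> set:
    """Lowercase alphabetic tokens of a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def keyword_picker(question: str, candidates: List[Node]) -> int:
    """Stand-in for the LLM: pick the candidate sharing the most words with the question."""
    q = words(question)
    scores = [len(q & words(c.title + " " + c.summary)) for c in candidates]
    return scores.index(max(scores))


def tree_search(root: Node, question: str,
                pick: Callable[[str, List[Node]], int] = keyword_picker):
    """Walk from the root to a leaf, recording the path taken as a reasoning trail."""
    node, trail = root, [root.title]
    while node.children:  # one sequential "reasoning" step per level
        node = node.children[pick(question, node.children)]
        trail.append(node.title)
    return node, trail


# A made-up index, used only to exercise the loop.
index = Node("Annual Report", "Full-year results and disclosures.", (1, 120), [
    Node("Business Overview", "Segments, strategy, and markets.", (1, 30)),
    Node("Financial Statements", "Audited statements including net income and cash flow.", (31, 120), [
        Node("Income Statement", "Revenue, costs, and net income for the year.", (31, 40)),
        Node("Notes", "Accounting policies and footnotes.", (41, 120)),
    ]),
])

leaf, trail = tree_search(index, "What was net income for the year?")
print(" > ".join(trail), "| pages", leaf.pages)
```

In a real system the `pick` argument would wrap a model call that sees the question and the candidate summaries, and the returned trail and page range are what would back the page citations.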

Pros

Absolute Traceability & Explainability
Replaces opaque vector-similarity matching (‘vibe retrieval’) with transparent, step-by-step reasoning and explicit page references.
Massive Infrastructure Reduction
Removes the need to provision, maintain, and pay for a specialized vector database such as Pinecone, Milvus, or Weaviate.
Unmatched Structural Awareness
The LLM works from an index of the whole document's layout, which helps it locate ‘needle-in-a-haystack’ data points.

Cons

Sequential Reasoning Latency
Because the multi-step tree-search loop requires sequential LLM calls to navigate deep document paths, query latency can be higher than with a single vector lookup.
Reasoning Token Overhead
Having an LLM reason its way down the index consumes more prompt tokens during the retrieval phase than an embedding lookup does.
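
The two costs above can be estimated with simple arithmetic. The Python sketch below assumes one LLM call per tree level and a prompt that lists every candidate entry at that level; the function name and all numbers are illustrative assumptions, not measurements of PageIndex, and the tokens spent reading the final section text are left out.

```python
from typing import List, Tuple


def estimate_cost(entries_per_level: List[int],
                  tokens_per_entry: int,
                  seconds_per_call: float) -> Tuple[int, int, float]:
    """Back-of-envelope retrieval cost for one query (illustrative model only)."""
    calls = len(entries_per_level)                       # one sequential call per level
    prompt_tokens = sum(n * tokens_per_entry for n in entries_per_level)
    latency = calls * seconds_per_call                   # calls cannot run in parallel
    return calls, prompt_tokens, latency


# Hypothetical document: 8 top-level sections, then 6 and 4 subsections on the chosen path.
calls, tokens, latency = estimate_cost([8, 6, 4], tokens_per_entry=60, seconds_per_call=1.5)
print(f"{calls} calls, about {tokens} prompt tokens, about {latency} s")
```

Under this model, latency grows with the depth of the path and token use grows with the number of entries shown at each level.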