Two-Database Architecture

System Architecture

The recommended production architecture combines PostgreSQL pgvector for semantic vector search with Neo4j for knowledge graph traversal, connected through a hybrid retrieval pipeline using Reciprocal Rank Fusion.

GraphRAG Pipeline

1. Document Ingestion

Documents are chunked using semantic-aware splitting (512 tokens, 64 overlap) respecting paragraph and section boundaries.

2. Entity Extraction

LLM (GPT-4o) extracts entities (people, orgs, concepts, technologies) and relationships from each chunk with structured JSON output.

3. Knowledge Graph

Entities become Neo4j nodes, relationships become edges. Leiden algorithm detects hierarchical communities. LLM generates community summaries.

4. Vector Indexing

Entity descriptions, chunk text, and community summaries are embedded (text-embedding-3-small) and stored in PostgreSQL with HNSW indexes.

5. Hybrid Retrieval

Queries trigger parallel vector search (pgvector) + graph traversal (Neo4j) + full-text search, fused via Reciprocal Rank Fusion (k=50).

Database Schemas

PostgreSQL + pgvector

PostgreSQL Schema (pgvector)

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE documents (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  title TEXT NOT NULL,
  source_url TEXT,
  content TEXT NOT NULL,
  metadata JSONB DEFAULT '{}',
  created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE chunks (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  document_id UUID REFERENCES documents(id) ON DELETE CASCADE,
  chunk_index INT NOT NULL,
  content TEXT NOT NULL,
  embedding vector(1536),
  token_count INT,
  content_tsv tsvector GENERATED ALWAYS AS
    (to_tsvector('english', content)) STORED,
  UNIQUE(document_id, chunk_index)
);

CREATE TABLE entities (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  name TEXT NOT NULL,
  type TEXT NOT NULL,
  description TEXT,
  embedding vector(1536),
  properties JSONB DEFAULT '{}',
  UNIQUE(name, type)
);

CREATE TABLE relationships (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_entity_id UUID REFERENCES entities(id),
  target_entity_id UUID REFERENCES entities(id),
  type TEXT NOT NULL,
  description TEXT,
  weight FLOAT DEFAULT 1.0
);

CREATE TABLE communities (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  level INT NOT NULL,
  title TEXT,
  summary TEXT,
  summary_embedding vector(1536),
  entity_ids UUID[] NOT NULL
);

-- HNSW indexes for fast vector search
CREATE INDEX idx_chunks_embedding ON chunks
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

CREATE INDEX idx_entities_embedding ON entities
  USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);

-- Full-text search index
CREATE INDEX idx_chunks_tsv ON chunks USING gin(content_tsv);
Neo4j Knowledge Graph

Neo4j Knowledge Graph

// Entity nodes
CREATE CONSTRAINT entity_name IF NOT EXISTS
FOR (e:Entity) REQUIRE e.name IS UNIQUE;

// Vector index for semantic search
CREATE VECTOR INDEX entity_embeddings IF NOT EXISTS
FOR (e:Entity) ON (e.embedding)
OPTIONS {
  indexConfig: {
    \`vector.dimensions\`: 1536,
    \`vector.similarity_function\`: 'cosine'
  }
};

// Community detection with Leiden
CALL gds.leiden.write('entity-graph', {
  writeProperty: 'communityId',
  maxLevels: 10,
  gamma: 1.0,
  includeIntermediateCommunities: true
});

// k-hop neighborhood retrieval
MATCH (e:Entity {name: $entityName})-[r*1..2]-(neighbor)
RETURN e, r, neighbor;

// Shortest path between entities
MATCH path = shortestPath(
  (a:Entity {name: $entity1})-[*..5]-(b:Entity {name: $entity2})
)
RETURN path;

Hybrid Retrieval with RRF

Reciprocal Rank Fusion combines results from vector similarity search and full-text search. Each result's score is computed as 1/(rank + k) where k=50, then scores are summed across search methods.

Hybrid Search with RRF

-- Reciprocal Rank Fusion combining vector + full-text
WITH vector_search AS (
  SELECT id, title, content,
    ROW_NUMBER() OVER (
      ORDER BY embedding <=> $1
    ) AS rank
  FROM chunks
  ORDER BY embedding <=> $1
  LIMIT 20
),
text_search AS (
  SELECT id, title, content,
    ROW_NUMBER() OVER (
      ORDER BY ts_rank_cd(content_tsv, query) DESC
    ) AS rank
  FROM chunks, plainto_tsquery('english', $2) query
  WHERE content_tsv @@ query
  LIMIT 20
)
SELECT
  COALESCE(v.id, t.id) AS id,
  COALESCE(v.title, t.title) AS title,
  COALESCE(1.0/(v.rank + 50), 0) +
  COALESCE(1.0/(t.rank + 50), 0) AS rrf_score
FROM vector_search v
FULL OUTER JOIN text_search t ON v.id = t.id
ORDER BY rrf_score DESC
LIMIT 10;

Database Comparison

CapabilityPostgreSQL + pgvectorNeo4j
Semantic SimilarityExcellent (HNSW)Good (native vectors 5.18+)
Relationship TraversalNot possibleExcellent
Multi-hop ReasoningNot possibleExcellent
Full-text SearchExcellent (tsvector + GIN)Good
Community DetectionNot nativeExcellent (GDS Leiden)
ACID ComplianceFullFull
Hybrid Search (RRF)NativeVia plugins

Production Performance Benchmarks

After applying the five essential optimization techniques (semantic chunking, parallel processing, batch operations, lock resolution, LLM batching):

Ingestion Speed
56 hours4.1 hours
13x faster
Processing Rate
50-100/sec2,000-5,000/sec
40x faster
Deadlock Rate
15-30%0%
Eliminated
GraphRAG Accuracy
~28% (RAG)86.31%
3x better