Web Scraping & RAG Pipeline

Intelligent scraping pipeline with Crawl4AI and advanced RAG to query any website

Apr 2025 - Oct 2025 • 6 months

Tech Stack

Crawl4AI · RAG · HyDE · ChromaDB · Supabase · Python · LLM · Vector Database · OpenAI Embeddings

Description

End-to-end pipeline for scraping any website and querying it through advanced RAG techniques. The goal: transform unstructured web data into a knowledge base queryable via natural language.

Architecture

flowchart LR
    A[๐ŸŒ URL] --> B[Crawl4AI]
    B --> C[LLM Extraction]
    C --> D[Chunking]
    D --> E[Embeddings]
    E --> F[(Vector DB)]
    F --> G[RAG Agent]
    G --> H[💬 Response]

Intelligent scraping with Crawl4AI

Scraping is handled by Crawl4AI, an open-source crawler optimized for LLM pipelines.

Vectorization

Scraped content is transformed into vectors for semantic search:

| Component | Configuration |
|---|---|
| Chunking | Recursive character splitter (1000 chars, 200 overlap) |
| Embeddings | OpenAI text-embedding-3-small (1536 dimensions) |
| Vector Store (dev) | ChromaDB (local, fast) |
| Vector Store (prod) | Supabase pgvector (scalable, SQL queries) |

Chunking with overlap preserves context between segments:

flowchart LR
    Doc[📄 Document] --> C1[Chunk 1<br/>1000 chars]
    Doc --> C2[Chunk 2<br/>1000 chars]
    Doc --> C3[Chunk 3<br/>1000 chars]

    C1 -.->|200 chars overlap| C2
    C2 -.->|200 chars overlap| C3
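
The sliding window above can be sketched in a few lines of Python. This is a simplified character-based splitter for illustration; the recursive character splitter named in the table additionally prefers natural break points such as paragraphs and sentences:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into fixed-size chunks; each chunk starts `overlap`
    characters before the previous one ends, so context at the boundary
    appears in both chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance 800 chars per chunk by default
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2500-char document yields chunks starting at offsets 0, 800, 1600, 2400.
chunks = chunk_text("a" * 2500)
```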

Advanced RAG techniques

Beyond basic RAG (retrieve → generate), several techniques improve response quality:

HyDE (Hypothetical Document Embeddings)

Generate a hypothetical document answering the question, then search for documents similar to this ideal response:

flowchart TB
    Q[โ“ User query] --> LLM[LLM generates hypothetical document]
    LLM --> HDoc[๐Ÿ“„ Fictional ideal document]
    HDoc --> Emb[Hypothetical doc embedding]
    Emb --> Search[Similarity search]
    Search --> Results[📚 Relevant documents]
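
A minimal sketch of the HyDE flow. `generate_hypothetical_doc` and `embed` are toy stand-ins for the real LLM and OpenAI embedding calls, kept trivial so the control flow is runnable:

```python
import math

def generate_hypothetical_doc(query: str) -> str:
    # Placeholder: a real pipeline would ask an LLM to write an ideal answer.
    return f"A detailed answer to: {query}"

def embed(text: str) -> list[float]:
    # Toy bag-of-letters embedding; a real system uses an embedding model.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    return vec

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def hyde_search(query: str, corpus: list[str], top_k: int = 3) -> list[str]:
    # 1. Generate a fictional ideal document answering the query.
    hypothetical = generate_hypothetical_doc(query)
    # 2. Embed the hypothetical document, NOT the raw query.
    q_vec = embed(hypothetical)
    # 3. Rank corpus documents by similarity to that ideal answer.
    ranked = sorted(corpus, key=lambda d: cosine(embed(d), q_vec), reverse=True)
    return ranked[:top_k]
```

The key idea is step 2: a document-to-document similarity search tends to land closer to real answers than a short query embedded directly.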

Query Augmentation & Multiple Query Generation

Reciprocal Rank Fusion (RRF)

Combine results from multiple retrievers with intelligent weighting:

flowchart LR
    Q[Query] --> R1[Retriever 1]
    Q --> R2[Retriever 2]
    Q --> R3[Retriever 3]
    R1 --> |Rank A| RRF[🔀 RRF Fusion]
    R2 --> |Rank B| RRF
    R3 --> |Rank C| RRF
    RRF --> Final[Weighted combined score]
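
RRF itself is a small formula: each document earns 1 / (k + rank) from every ranked list it appears in, and the contributions are summed. A self-contained sketch (k = 60 is the constant from the original RRF paper):

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists into one. A document's fused score is
    the sum of 1 / (k + rank) over every list it appears in (rank is
    1-based), so consistent near-the-top placement beats a single hit."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["A", "B", "C"],   # retriever 1
    ["B", "A", "D"],   # retriever 2
    ["B", "C", "A"],   # retriever 3
])
# "B" wins: it is ranked at or near the top by all three retrievers.
```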

Re-ranking

Refine final ranking with a cross-encoder model to improve precision.
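
A sketch of the re-ranking step. The toy word-overlap scorer below stands in for a real cross-encoder, which reads the query and document jointly and outputs a relevance score:

```python
def rerank(query: str, candidates: list[str], top_k: int = 3) -> list[str]:
    """Re-order retrieved candidates by a pairwise (query, document) score
    and keep the best top_k."""
    def score_pair(q: str, doc: str) -> float:
        # Stand-in for a cross-encoder model: fraction of query words
        # that also appear in the document.
        q_words, d_words = set(q.lower().split()), set(doc.lower().split())
        return len(q_words & d_words) / max(len(q_words), 1)

    return sorted(candidates, key=lambda d: score_pair(query, d), reverse=True)[:top_k]
```

Because the scorer sees both texts together, re-ranking a small candidate set is far more precise than embedding similarity alone, at the cost of one model call per pair.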

Agentic RAG

The "Agentic RAG" approach goes beyond the classic linear pipeline. An LLM agent dynamically orchestrates the retrieval process:

flowchart TB
    Q[Query] --> Agent[🤖 RAG Agent]
    Agent --> Decide{Strategy?}
    Decide --> |Reformulate| Rewrite[Query Rewriting]
    Decide --> |Multi-query| Multi[Generate variants]
    Decide --> |Direct| Retrieve[Retrieval]

    Rewrite --> Retrieve
    Multi --> Retrieve
    Retrieve --> Eval{Sufficient results?}
    Eval --> |No| Agent
    Eval --> |Yes| Generate[Response generation]
    Generate --> Response[💬 Final response]

The agent can rewrite the query, generate multiple query variants, or retrieve directly, and it loops back to pick another strategy whenever the retrieved results are insufficient.
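
The decision loop in the diagram can be sketched as follows; `retrieve` and `rewrite` are injected stand-ins for the agent's real tools (vector search and an LLM query rewriter):

```python
def agentic_rag(query: str, retrieve, rewrite, max_iters: int = 3,
                min_hits: int = 2) -> list[str]:
    """Retrieve, evaluate, and reformulate until results look sufficient.

    retrieve(query) -> list of documents (stand-in for vector search)
    rewrite(query)  -> reformulated query (stand-in for an LLM rewriter)
    """
    current = query
    results: list[str] = []
    for _ in range(max_iters):
        results = retrieve(current)
        if len(results) >= min_hits:   # the "Sufficient results?" gate
            return results             # hand off to response generation
        current = rewrite(current)     # loop back through the agent
    return results                     # best effort after max_iters
```

In a real pipeline the sufficiency check would itself be an LLM judgment rather than a simple count.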

Tech stack

Challenges

Results
