Crawl4AI RAG MCP Server

by coleam00

About

Crawl4AI RAG MCP Server provides AI agents and coding assistants with comprehensive web crawling and retrieval-augmented generation (RAG) capabilities backed by Supabase vector storage. Key capabilities include:

- Web crawling and scraping to extract structured content from any website for RAG ingestion
- Supabase-backed vector store for semantic search and knowledge persistence
- Advanced RAG strategies including Contextual Embeddings, Hybrid Search (vector + keyword), Agentic RAG for code extraction, and cross-encoder Reranking
- Knowledge Graph generation for AI hallucination detection and repository code analysis
- Integration with OpenAI embeddings with planned support for Ollama and local embedding models

README

Web Crawling and RAG Capabilities for AI Agents and AI Coding Assistants

A powerful implementation of the Model Context Protocol (MCP) integrated with Crawl4AI and Supabase for providing AI agents and AI coding assistants with advanced web crawling and RAG capabilities.

With this MCP server, you can scrape anything and then use that knowledge anywhere for RAG.

The primary goal is to bring this MCP server into Archon as I evolve it to be more of a knowledge engine for AI coding assistants to build AI agents. This first version of the Crawl4AI/RAG MCP server will be improved upon greatly soon, especially making it more configurable so you can use different embedding models and run everything locally with Ollama.

Consider this GitHub repository a testbed, which is why I haven't been actively addressing issues and pull requests yet. I certainly will, though, as I bring this into Archon V2!

Overview

This MCP server provides tools that enable AI agents to crawl websites, store content in a vector database (Supabase), and perform RAG over the crawled content. It follows the best practices for building MCP servers based on the Mem0 MCP server template I provided on my channel previously.

The server includes several advanced RAG strategies that can be enabled to enhance retrieval quality:

Contextual Embeddings for enriched semantic understanding

Hybrid Search combining vector and keyword search

Agentic RAG for specialized code example extraction

Reranking for improved result relevance using cross-encoder models

Knowledge Graph for AI hallucination detection and repository code analysis

See the Configuration section below for details on how to enable and configure these strategies.
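
As a rough illustration of how such toggles could be read, here is a minimal sketch that maps each strategy to an environment variable. Only `USE_AGENTIC_RAG` and `USE_KNOWLEDGE_GRAPH` are named in this README; the other three flag names are assumptions that follow the same pattern, and the `enabled_strategies` helper is hypothetical, not the server's actual code.

```python
import os

# USE_AGENTIC_RAG and USE_KNOWLEDGE_GRAPH appear in this README.
# The other three flag names are assumed to follow the same pattern.
STRATEGY_FLAGS = {
    "contextual_embeddings": "USE_CONTEXTUAL_EMBEDDINGS",  # assumed name
    "hybrid_search": "USE_HYBRID_SEARCH",                  # assumed name
    "agentic_rag": "USE_AGENTIC_RAG",
    "reranking": "USE_RERANKING",                          # assumed name
    "knowledge_graph": "USE_KNOWLEDGE_GRAPH",
}


def enabled_strategies(env=None):
    """Return the strategies whose flag is set to "true" (case-insensitive)."""
    env = os.environ if env is None else env
    return [
        name
        for name, flag in STRATEGY_FLAGS.items()
        if env.get(flag, "false").strip().lower() == "true"
    ]


if __name__ == "__main__":
    # Pass a plain dict instead of os.environ to try it without exporting anything.
    print(enabled_strategies({"USE_AGENTIC_RAG": "true", "USE_RERANKING": "false"}))
```

Treating anything other than the literal string "true" as disabled keeps every strategy off by default, so a missing or misspelled flag never turns a costly feature on by accident.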

Vision

The Crawl4AI RAG MCP server is just the beginning. Here's where we're headed:

1. Integration with Archon: Building this system directly into Archon to create a comprehensive knowledge engine for AI coding assistants to build better AI agents.

2. Multiple Embedding Models: Expanding beyond OpenAI to support a variety of embedding models, including the ability to run everything locally with Ollama for complete control and privacy.

3. Advanced RAG Strategies: Implementing sophisticated retrieval techniques like contextual retrieval, late chunking, and others to move beyond basic "naive lookups" and significantly enhance the power and precision of the RAG system, especially as it integrates with Archon.

4. Enhanced Chunking Strategy: Implementing a Context 7-inspired chunking approach that focuses on examples and creates distinct, semantically meaningful sections for each chunk, improving retrieval precision.

5. Performance Optimization: Increasing crawling and indexing speed so that it becomes realistic to "quickly" index new documentation and then use it within the same prompt in an AI coding assistant.

Features

Smart URL Detection: Automatically detects and handles different URL types (regular webpages, sitemaps, text files)

Recursive Crawling: Follows internal links to discover content

Parallel Processing: Efficiently crawls multiple pages simultaneously

Content Chunking: Intelligently splits content by headers and size for better processing

Vector Search: Performs RAG over crawled content, optionally filtering by data source for precision

Source Retrieval: Retrieve sources available for filtering to guide the RAG process
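
To make the URL detection and chunking features more concrete, here is a minimal sketch of both ideas. It is an illustration under stated assumptions, not the server's implementation: the function names `detect_url_type` and `chunk_by_headers`, the filename-based heuristics, and the `max_chars` default are all hypothetical.

```python
from urllib.parse import urlparse


def detect_url_type(url):
    """Classify a URL as 'sitemap', 'text_file', or 'webpage' from its filename."""
    filename = urlparse(url).path.lower().rsplit("/", 1)[-1]
    if "sitemap" in filename:
        return "sitemap"
    if filename.endswith(".txt"):
        return "text_file"
    return "webpage"


def chunk_by_headers(markdown, max_chars=1000):
    """Split markdown at header lines, then cap each section at max_chars."""
    sections, current = [], []
    for line in markdown.splitlines():
        # A header line closes the section collected so far.
        # Simplification: a '#' line inside a code fence also counts as a header.
        if line.startswith("#") and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    chunks = []
    for section in sections:
        # Sections longer than max_chars are cut into fixed-size pieces.
        for start in range(0, len(section), max_chars):
            piece = section[start:start + max_chars].strip()
            if piece:
                chunks.append(piece)
    return chunks


if __name__ == "__main__":
    print(detect_url_type("https://example.com/llms-full.txt"))
    print(chunk_by_headers("# Install\npip install foo\n# Usage\nimport foo"))
```

Splitting on headers first keeps each chunk tied to one topic, and the size cap only applies when a single section is too long to embed as one piece.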

Tools

The server provides essential web crawling and search tools:

Core Tools (Always Available)

1. crawl_single_page: Quickly crawl a single web page and store its content in the vector database

2. smart_crawl_url: Intelligently crawl a full website based on the type of URL provided (sitemap, llms-full.txt, or a regular webpage that needs to be crawled recursively)

3. get_available_sources: Get a list of all available sources (domains) in the database

4. perform_rag_query: Search for relevant content using semantic search with optional source filtering
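
The idea behind semantic search with an optional source filter can be sketched in memory as follows. This is only an illustration: the actual server performs the search in Supabase, and the `rag_query` function, the row fields, the toy two-dimensional embeddings, and the example domains are all hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rag_query(query_vec, rows, source=None, match_count=5):
    """Rank stored chunks by similarity, optionally restricted to one source."""
    candidates = [r for r in rows if source is None or r["source"] == source]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(query_vec, r["embedding"]),
        reverse=True,
    )
    return ranked[:match_count]


# Toy rows standing in for crawled chunks; real embeddings have far more dimensions.
rows = [
    {"source": "docs.example.com", "content": "install guide", "embedding": [1.0, 0.0]},
    {"source": "docs.example.com", "content": "api reference", "embedding": [0.0, 1.0]},
    {"source": "blog.example.com", "content": "release notes", "embedding": [0.9, 0.1]},
]

if __name__ == "__main__":
    for row in rag_query([1.0, 0.0], rows, source="docs.example.com"):
        print(row["content"])
```

Filtering by source before ranking is what lets an agent first list the available sources and then restrict a query to the one it cares about.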

Conditional Tools

5. search_code_examples (requires USE_AGENTIC_RAG=true): Search specifically for code examples and their summaries from crawled documentation. This tool provides targeted code snippet retrieval for AI coding assistants.

Knowledge Graph Tools (requires `USE_KNOWLEDGE_GRAPH=true`, see below)

6. parse_github_repository: Parse a GitHub repository into a Neo4j knowledge graph, extracting classes, methods, functions, and their relationships for hallucination detection

7. check_ai_script_hallucinations: Analyze Python scripts for AI hallucinations by validating imports,
