# 🕸️🐍 GraphCrawler: Open-Source Graph-Based Web Crawler & Scraper

## 🎯 Build Powerful Web Graphs in Minutes
GraphCrawler is a feature-rich, production-ready Python library for web crawling that builds a graph representation of website structures. Perfect for SEO analysis, link auditing, content extraction, and AI pipelines.
**Python 3.14 Optimized:** with free-threading support, GraphCrawler delivers 3.2x faster end-to-end crawling!
## ⚡ Quick Start

Get started in just a few lines of code:
```python
import graph_crawler as gc

# Crawl and build a graph
graph = gc.crawl("https://example.com")

print(f"Found {len(graph.nodes)} pages")
print(f"Found {len(graph.edges)} links")
```
Async? No problem:
```python
import asyncio
import graph_crawler as gc

async def main():
    graph = await gc.async_crawl("https://example.com")
    return graph

graph = asyncio.run(main())
```
## 🆕 What's New in v4.0

### 🚀 Python 3.14 Free-Threading

GraphCrawler 4.0 is optimized for Python 3.14's free-threading mode:
Performance Results:
- ⚡ 2-4x faster HTML parsing
- 🚀 3.2x faster end-to-end crawling
- 📉 16% less memory usage
- ⏱️ 30% faster startup
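To confirm that free-threading is actually in effect on your interpreter before a large crawl, you can ask CPython directly. This is a minimal sketch using only the standard library plus the `gc.crawl` call from the Quick Start; it assumes nothing about GraphCrawler's own API:

```python
import sys
import graph_crawler as gc

# sys._is_gil_enabled() is available on CPython 3.13+; it returns False when
# the interpreter runs with the GIL disabled (free-threaded build). Older
# versions lack it, so fall back to True here.
gil_enabled = getattr(sys, "_is_gil_enabled", lambda: True)()
print(f"GIL enabled: {gil_enabled}")  # False means free-threading is active

graph = gc.crawl("https://example.com")
```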
### 🌱 Multiple Seed URLs

```python
graph = gc.crawl(
    seed_urls=[
        "https://example.com/products/",
        "https://example.com/blog/",
        "https://example.com/docs/",
    ],
    max_depth=3
)
```
### 🔄 Incremental Crawling

```python
# Start with an initial crawl
graph1 = gc.crawl("https://example.com", max_pages=50)

# Later, continue from where you left off
graph2 = gc.crawl(base_graph=graph1, max_pages=100)
```
## ✨ Key Features

### 🕸️ Graph-Based Architecture

Unlike traditional crawlers, GraphCrawler builds a complete graph structure of your website:
```python
# Graph operations
merged = graph1 + graph2  # Union
diff = graph2 - graph1    # Difference
common = graph1 & graph2  # Intersection

# Subgraph detection
if graph1 < graph2:
    print("graph1 is a subgraph of graph2")

# Find popular pages
popular = graph.get_popular_nodes(top_n=10, by='in_degree')
```
### 🔌 Plugin Architecture

Extend functionality with powerful plugins:
```python
from graph_crawler import crawl, BaseNodePlugin, NodePluginType

class SEOPlugin(BaseNodePlugin):
    @property
    def name(self):
        return "seo_analyzer"

    @property
    def plugin_type(self):
        return NodePluginType.ON_HTML_PARSED

    def execute(self, context):
        # Your custom logic here
        context.user_data['seo_score'] = calculate_seo(context.html_tree)
        return context

graph = crawl("https://example.com", plugins=[SEOPlugin()])
```
### 🎭 Multiple Drivers

| Driver | Description | Best For |
|---|---|---|
| `http` | Async HTTP (aiohttp) | Static sites (default) |
| `playwright` | Full browser rendering | JavaScript SPAs |
| `stealth` | Anti-detection mode | Protected sites |
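For a JavaScript-heavy SPA you would switch from the default `http` driver to `playwright`. The snippet below is only a sketch: it assumes the backend is selected with a `driver` keyword on `crawl()`, so check the API reference for the exact parameter name.

```python
import graph_crawler as gc

# Assumption: the backend is chosen via a `driver` keyword (see the API
# reference for the exact spelling). Requires: pip install graph-crawler[playwright]
graph = gc.crawl("https://spa.example.com", driver="playwright")
```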
### 💾 Flexible Storage

| Storage | Scale | Best For |
|---|---|---|
| `memory` | < 1K pages | Quick tests |
| `json` | 1K - 20K pages | Medium projects |
| `sqlite` | 20K+ pages | Large crawls |
| `postgresql` | 100K+ pages | Production |
| `mongodb` | 100K+ pages | Production |
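Switching backends for a larger crawl might look like this. It is a sketch only: the `storage` and `storage_path` keywords are assumptions introduced for illustration, so consult the storage documentation for the real parameter names.

```python
import graph_crawler as gc

# Assumption: `storage` selects the backend and `storage_path` points at the
# database file; the actual keyword names may differ (see the storage docs).
graph = gc.crawl("https://example.com", max_pages=50_000,
                 storage="sqlite", storage_path="crawl.db")
```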
### 📊 Rich Data Extraction
Built-in extractors for:
- 📝 Metadata (title, description, h1, canonical)
- 🔗 Links (internal, external, nofollow)
- 📞 Phones (UA, US, RU formats)
- 📧 Emails (RFC 5322 compliant)
- 💰 Prices (USD, EUR, UAH with ranges)
- 🏷️ Structured Data (JSON-LD, Open Graph, Microdata)
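After a crawl, each extractor's output is attached to the corresponding node. A minimal sketch, assuming the metadata extractor exposes `title` and `description` keys on `node.metadata` (the exact keys are an assumption; see the Extraction section):

```python
import graph_crawler as gc

graph = gc.crawl("https://example.com")

for node in graph:
    # Key names ('title', 'description') are assumptions; check the extraction docs.
    meta = node.metadata or {}
    print(node.url, meta.get('title'), meta.get('description'))
```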
## 📚 Documentation Structure
| Section | Description | Audience |
|---|---|---|
| Getting Started | Installation & Quick Start | Everyone |
| Core Concepts | Crawling modes, URL rules, caching | All developers |
| Advanced | Distributed crawling, proxies, auth | Senior devs |
| Extraction | Plugins & data extraction | All developers |
| API Reference | Complete API documentation | All developers |
| Architecture | System internals | Architects |
## 🛠️ Installation

### Basic Installation
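```bash
pip install graph-crawler
```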
### With Optional Features
```bash
# JavaScript rendering (Playwright)
pip install graph-crawler[playwright]

# ML/AI features (embeddings)
pip install graph-crawler[embeddings]

# Database backends
pip install graph-crawler[mongodb,postgresql]

# Everything
pip install graph-crawler[all]
```
[📖 Detailed Installation Guide →](getting-started/installation.md)
## 🎓 Learning Paths

### 🌱 Beginner Path (Week 1)

### 🚀 Intermediate Path (Weeks 2-3)

### 🏆 Advanced Path (Month 2+)
## 💡 Use Cases

### SEO Analysis
graph = gc.crawl("https://yoursite.com", max_depth=5)
# Find orphan pages (no incoming links)
for node in graph:
in_degree = len([e for e in graph.edges if e.target_node_id == node.node_id])
if in_degree == 0 and node.depth > 0:
print(f"Orphan: {node.url}")
# Export for visualization
graph.export_edges("site_structure.dot", format="dot")
### Content Extraction
```python
import graph_crawler as gc
from graph_crawler.extensions.plugins.node.extractors import (
    PhoneExtractorPlugin,
    EmailExtractorPlugin,
    PriceExtractorPlugin,
)

graph = gc.crawl(
    "https://shop.com",
    plugins=[
        PhoneExtractorPlugin(),
        EmailExtractorPlugin(),
        PriceExtractorPlugin(),
    ]
)

for node in graph:
    print(f"Page: {node.url}")
    print(f"  Phones: {node.user_data.get('phones', [])}")
    print(f"  Emails: {node.user_data.get('emails', [])}")
    print(f"  Prices: {node.user_data.get('prices', [])}")
```
### AI/RAG Pipeline

```python
import graph_crawler as gc
from graph_crawler.extensions.plugins.node.vectorization import RealTimeVectorizerPlugin

graph = gc.crawl(
    "https://docs.example.com",
    plugins=[RealTimeVectorizerPlugin()]
)

# Export embeddings for a vector database
for node in graph:
    embedding = node.user_data.get('embedding')
    if embedding:
        # Store with your own client (Pinecone, Chroma, etc.)
        vector_db.upsert(node.url, embedding, node.metadata)
```
## 📖 Quick Links
| Resource | Link |
| ------------- | ------------------------------------------------------------------------------ |
| Installation | [getting-started/installation.md](getting-started/installation.md) |
| Quick Start | [getting-started/quickstart.md](getting-started/quickstart.md) |
| API Reference | [api/API.md](api/API.md) |
| Plugin System | [extraction/plugins.md](extraction/plugins.md) |
| Architecture | [architecture/ARCHITECTURE_OVERVIEW.md](architecture/ARCHITECTURE_OVERVIEW.md) |
| Changelog | [changelog.md](changelog.md) |