A distributed web crawler system with intelligent Hive controller for scalable web scraping and data collection.
- 🐝 Distributed Architecture: Multi-worker crawler system with centralized Hive controller
- ⚡ High Performance: Built with Fastify for blazing-fast API responses
- 🔄 Queue Management: Redis-backed job queues with BullMQ for reliable task processing
- 🌐 Web Scraping: Cheerio-powered HTML parsing and data extraction
- 📊 Real-time Monitoring: Comprehensive logging and system health tracking
- 🔧 Modular Design: Clean separation of concerns with configurable components
- 🛡️ Error Handling: Robust error recovery and graceful shutdown procedures
- Node.js 20.0.0 or higher
- Redis server running locally or remotely
- Modern browser for web interface (optional)
# Clone the repository
git clone https://github.com/NeaByteLab/HiveMind-Crawler.git
cd HiveMind-Crawler
# Install dependencies
npm install
# Configure Redis connection (see Configuration section)Set these environment variables for configuration:
# Redis Configuration
REDIS_HOST=localhost
REDIS_PORT=6379
REDIS_PASSWORD=
REDIS_DB=0
# Crawler Configuration
CRAWLER_TIMEOUT=10000
CRAWLER_MAX_RETRIES=3
CRAWLER_USER_AGENT=HiveMind-Crawler/1.0
CRAWLER_MAX_CONCURRENT=2
CRAWLER_MAX_DEPTH=3
CRAWLER_HEARTBEAT_INTERVAL=10000
CRAWLER_DEAD_TIMEOUT=30000npm run devnpm start# Run tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage# Lint code
npm run lint
# Fix linting issues
npm run lint:fix
# Check linting status
npm run lint:checkThe central intelligence that coordinates all crawler workers:
- Task Distribution: Assigns crawling jobs to available workers
- Health Monitoring: Tracks worker status and performance
- Load Balancing: Distributes load across multiple workers
- Fault Tolerance: Handles worker failures and recovery
- Domain Assignment: Manages domain-specific crawler assignments
Individual worker processes that perform web scraping:
- URL Processing: Handles various URL formats and redirects
- Content Extraction: Parses HTML and extracts structured data
- Rate Limiting: Respects robots.txt and implements polite crawling
- Data Storage: Queues extracted data for processing
- Domain Filtering: Processes only assigned domains
RESTful API for system management and monitoring:
- Job Management: Submit, monitor, and control crawling jobs
- System Status: Real-time health and performance metrics
- Configuration: Runtime configuration updates
- Data Access: Retrieve crawling results and statistics
HiveMind-Crawler/
├── src/
│ ├── hive/
│ │ └── controller.js # 🐝 Hive controller logic
│ ├── crawler/
│ │ └── worker.js # 🕷️ Crawler worker implementation
│ ├── api/
│ │ └── server.js # 🌐 Fastify API server
│ ├── config/
│ │ ├── redis.js # 🔧 Redis configuration
│ │ ├── crawler.js # 🕷️ Crawler settings
│ │ └── queues.js # 📋 Queue configuration
│ ├── utils/
│ │ ├── logger.js # 📝 Logging utilities
│ │ └── url.js # 🔗 URL processing utilities
│ └── index.js # 🚀 Main application entry point
├── __tests__/ # 🧪 Test files
├── package.json # 📦 Project dependencies
├── eslint.config.js # 🔍 ESLint configuration
├── jest.config.js # 🧪 Jest configuration
├── jest.setup.js # 🧪 Jest setup
└── README.md # 📖 This file
- Fastify: High-performance web framework
- BullMQ: Redis-based job queue system
- Cheerio: Server-side jQuery for HTML parsing
- Axios: HTTP client for web requests
- Redis: In-memory data structure store
- UUID: Unique identifier generation
- ESLint: Code linting and formatting
- ESLint Plugin Unused Imports: Remove unused imports automatically
- Jest: Testing framework
POST /crawl- Add URLs to crawl queue- Body:
{ urls: string|string[], priority?: 'high'|'normal', domain?: string } - Response:
{ message: string }
- Body:
GET /health- System health check- Response:
{ status: 'healthy', timestamp: number }
- Response:
GET /metrics- Comprehensive system metrics- Response:
{ activeCrawlers, queueSize, totalProcessed, totalFailed, avgResponseTime, crawlers[] }
- Response:
GET /crawlers- List all registered crawlers- Response:
Array<{ id, status, capabilities, stats, assignedDomains, last_update }>
- Response:
GET /results?url=<url>- Get crawl result for specific URL- Query:
url- The URL to retrieve results for - Response:
Object|null- Crawl result or null if not found
- Query:
POST /assign-domain- Assign domain to specific crawler- Body:
{ crawlerId: string, domain: string } - Response:
{ message: string }
- Body:
DELETE /queue- Clear all crawl queues- Response:
{ message: 'Queue cleared' }
- Response:
# Single URL
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"urls": "https://example.com", "priority": "high"}'
# Multiple URLs
curl -X POST http://localhost:3000/crawl \
-H "Content-Type: application/json" \
-d '{"urls": ["https://example.com", "https://test.com"], "priority": "normal"}'curl http://localhost:3000/healthcurl http://localhost:3000/metricscurl "http://localhost:3000/results?url=https://example.com"curl -X POST http://localhost:3000/assign-domain \
-H "Content-Type: application/json" \
-d '{"crawlerId": "crawler-1", "domain": "example.com"}'Redis Connection Failed
# Ensure Redis is running
redis-server
# Check connection
redis-cli pingWorker Not Starting
# Check logs
npm run dev
# Verify Redis connection
# Check configuration filesHigh Memory Usage
- Reduce
CRAWLER_MAX_CONCURRENTin environment variables - Implement data streaming for large datasets
- Monitor Redis memory usage
The project includes comprehensive tests:
# Run all tests
npm test
# Run tests in watch mode
npm run test:watch
# Run tests with coverage
npm run test:coverage- Fork the repository
- Create a feature branch (
git checkout -b feature/amazing-feature) - Commit your changes (
git commit -m 'Add amazing feature') - Push to the branch (
git push origin feature/amazing-feature) - Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.