PuppetMaster 🤖

A powerful microservice for web automation, scraping, and data processing, integrating Puppeteer for browser control and Crawl4AI for advanced crawling and AI-powered extraction.

Ask questions about this repository on DeepWiki: https://DeepWiki.com

Features

  • Puppeteer Core:
    • 🌐 Headless browser automation with Puppeteer and Chromium
    • 🖱️ Standard browser interactions: navigate, click, type, scroll, select
    • 🖼️ Screenshot generation (full page or element)
    • 📄 PDF generation
    • ⚙️ Custom JavaScript evaluation
  • Crawl4AI Integration:
    • 🕷️ Advanced crawling strategies (schema-based, LLM-driven)
    • 🧩 Flexible data extraction (CSS, XPath, LLM)
    • 🧠 Dynamic schema generation using LLMs
    • ✅ Content verification
    • 🔗 Deep link crawling
    • ⏳ Element waiting and filtering
    • 📄 PDF text extraction
    • 📝 Webpage to Markdown conversion
    • 🌐 Webpage to PDF conversion (via Crawl4AI)
  • System:
    • 🔄 Bull queue system for robust job management (separate queues for Puppeteer & Crawl4AI)
    • 📊 MongoDB for job persistence, status tracking, and results storage
    • 💾 Local file storage for generated assets (screenshots, PDFs, Markdown files)
    • 📈 API endpoints for job management and queue monitoring

Key Technologies

  • Backend: Node.js, Express.js
  • Web Automation: Puppeteer
  • Crawling & AI: Python, FastAPI, Crawl4AI
  • Job Queue: BullMQ, Redis
  • Database: MongoDB (with Mongoose)
  • Language: JavaScript, Python

Installation

Prerequisites

  • Node.js (v18 or later recommended)
  • npm or yarn
  • Python (v3.8 or later recommended)
  • pip
  • MongoDB (local instance or Atlas)
  • Redis (local instance or cloud provider)

Setup

  1. Clone the repository:

    git clone <repository-url>
    cd PuppetMaster
  2. Install Node.js dependencies:

    npm install
    # or
    # yarn install
  3. Set up Python environment for Crawl4AI:

    # Create a virtual environment (recommended)
    python3 -m venv .venv
    source .venv/bin/activate  # On Windows use .venv\Scripts\activate
    
    # Install Python dependencies
    pip install -r requirements.txt
  4. Configure Environment Variables: Create a .env file in the project root and configure the following variables:

    # Node.js App Configuration
    PORT=3000
    NODE_ENV=development # or production
    MONGODB_URI=mongodb://localhost:27017/puppet-master # Replace with your MongoDB connection string
    REDIS_HOST=localhost
    REDIS_PORT=6379
    RATE_LIMIT_WINDOW_MS=60000
    RATE_LIMIT_MAX=100
    
    # Puppeteer Worker Configuration
    PUPPETEER_HEADLESS=true # Set to false to run browser in non-headless mode
    PUPPETEER_TIMEOUT=60000 # Default timeout for Puppeteer operations (ms)
    JOB_CONCURRENCY=2 # Max concurrent Puppeteer jobs
    
    # Crawl4AI Worker & Service Configuration
    CRAWL4AI_API_URL=http://localhost:8000 # URL of the Python Crawl4AI service
    CRAWL4AI_API_TIMEOUT=120000 # Timeout for requests to Crawl4AI service (ms)
    CRAWL4AI_PORT=8000 # Port for the Python Crawl4AI service
    JOB_ATTEMPTS=3 # Default Bull queue job attempts
    JOB_TIMEOUT=300000 # Default Bull queue job timeout (ms)
    # Add necessary API keys for LLM providers if using LLMExtractionStrategy
    # Example for OpenAI (only required if using OpenAI models):
    # OPENAI_API_KEY=your_openai_api_key
    
    # Example for Google Gemini (only required if using Gemini models):
    # GOOGLE_API_KEY=your_google_ai_api_key
  5. Start the Services and Workers:

    You can start everything concurrently using the provided npm scripts:

    # For development (with nodemon for Node.js app/worker)
    npm run dev:all
    
    # For production
    npm run start:all

    These scripts run the following components:

    • Node.js API Server (src/index.js) - Also processes jobs from the crawl4ai-jobs queue.
    • Puppeteer Worker (src/workers/puppeteer.worker.js) - Processes jobs from the puppeteer-jobs queue.
    • Crawl4AI Python Service (src/crawl4ai/main.py) - Handles Crawl4AI API requests from the Node.js worker.

    Alternatively, you can start components individually:

    # Start Node.js API (Terminal 1)
    # This process also handles processing for Crawl4AI jobs.
    npm start  # or npm run dev
    
    # Start Puppeteer Worker (Terminal 2)
    # Processes only Puppeteer-specific jobs.
    npm run start:worker # or npm run dev:worker
    
    # Start Crawl4AI Python Service (Terminal 3)
    npm run start:crawl4ai
    # or directly: ./start-crawl4ai.sh
    # or: source .venv/bin/activate && python src/crawl4ai/main.py

Architecture Overview

PuppetMaster uses a microservice architecture:

  • Node.js API Server (src/index.js):

    • Exposes REST API endpoints for job management and queue monitoring.
    • Uses Express.js, Mongoose (for MongoDB interaction), and Bull for queue management.
    • Handles incoming job requests, saving them to MongoDB.
    • Adds jobs to either the Puppeteer or Crawl4AI Bull queue based on action types (see the routing sketch after this list).
    • Processes jobs from the crawl4ai-jobs queue by interacting with the Crawl4AI Python Service.
  • Puppeteer Worker (src/workers/puppeteer.worker.js):

    • A separate Node.js process that listens to the puppeteer-jobs Bull queue.
    • Executes Puppeteer-specific browser automation tasks (navigate, click, screenshot, etc.).
    • Updates job status and results in MongoDB.
  • Crawl4AI Python Service (src/crawl4ai/):

    • A FastAPI application providing endpoints for advanced crawling and extraction tasks.
    • Uses the Crawl4AI library internally.
    • Communicates with the Node.js API/worker process via HTTP requests.
  • Bull Queues (Redis): Manages job processing, ensuring robustness and retries.

  • MongoDB: Persists job definitions, status, results, and generated asset metadata.

  • Local File Storage (/public): Stores generated files like screenshots, PDFs, and Markdown files.

  • Error Handling: Uses a centralized error handler (src/middleware/errorHandler.js) providing consistent JSON error responses (see ApiError class).

  • Validation: Incoming requests for specific endpoints (like job creation) are validated using Joi schemas (src/middleware/validation.js).

  • Job Model: Job details, including status, results, assets, and progress, are stored in MongoDB using the schema defined in src/models/Job.js.
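
The queue-routing decision mentioned in the Node.js API Server bullet can be pictured with a small sketch. This is illustrative only, not the actual code in src/index.js; the set of Crawl4AI action types is taken from the Action Types section below, and the function name is hypothetical.

// Illustrative sketch: choose a queue name from a job's action types.
// Note that 'wait' exists on both the Puppeteer and Crawl4AI sides; the real
// implementation may resolve that ambiguity differently.
const CRAWL4AI_ACTIONS = new Set([
  'crawl', 'extract', 'generateSchema', 'verify', 'crawlLinks',
  'filter', 'extractPDF', 'toMarkdown', 'toPDF'
]);

function pickQueueName(job) {
  const needsCrawl4ai = job.actions.some(action => CRAWL4AI_ACTIONS.has(action.type));
  return needsCrawl4ai ? 'crawl4ai-jobs' : 'puppeteer-jobs';
}

In the real service, the job would then be added to the corresponding Bull queue backed by Redis.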

API Documentation

The API allows you to create, manage, and monitor automation jobs.

Base URL: /api

Job Management (/jobs)

POST /jobs

Create a new job. The job will be routed to the appropriate queue (Puppeteer or Crawl4AI) based on its actions.

Request Body:

{
  "name": "Unique Job Name",
  "description": "Optional job description",
  "priority": 0, // Optional: Bull queue priority (-100 to 100)
  "actions": [
    {
      "type": "action_type_1", // See Action Types section below
      "params": { ... } // Parameters specific to the action type
    },
    {
      "type": "action_type_2",
      "params": { ... }
    }
    // ... more actions
  ],
  "metadata": { ... } // Optional: Any additional data to store with the job
}

Response (Success: 201 Created):

{
  "status": "success",
  "message": "Job created successfully",
  "data": {
    "jobId": "unique-job-id",
    "name": "Unique Job Name",
    "status": "pending"
  }
}
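
As a minimal sketch, a job like the one above can be submitted with Node 18's built-in fetch. The base URL and payload are placeholders; adjust them to your deployment.

// Submit a job to the API (assumes the server runs on localhost:3000).
async function submitJob() {
  const res = await fetch('http://localhost:3000/api/jobs', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      name: 'Example Screenshot Job',
      actions: [
        { type: 'navigate', params: { url: 'https://example.com' } },
        { type: 'screenshot', params: { fullPage: true } }
      ]
    })
  });
  const { data } = await res.json();
  console.log(data.jobId, data.status); // "pending" on success
}

submitJob().catch(console.error);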

GET /jobs

Get a list of jobs with filtering and pagination.

Query Parameters:

  • status (string, optional): Filter by job status (e.g., pending, processing, completed, failed, cancelled).
  • page (number, optional, default: 1): Page number for pagination.
  • limit (number, optional, default: 10): Number of jobs per page.
  • sort (string, optional, default: createdAt): Field to sort by.
  • order (string, optional, default: desc): Sort order (asc or desc).

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "jobs": [ ... ], // Array of job objects
    "pagination": {
      "total": 100,
      "page": 1,
      "limit": 10,
      "pages": 10
    }
  }
}
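
For instance, to list the five most recent failed jobs (run inside an async function; base URL assumed):

// Query the job list with filtering, pagination, and sorting.
const params = new URLSearchParams({ status: 'failed', limit: '5', sort: 'createdAt', order: 'desc' });
const res = await fetch(`http://localhost:3000/api/jobs?${params}`);
const { jobs, pagination } = (await res.json()).data;
console.log(`Showing ${jobs.length} of ${pagination.total} failed jobs`);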

GET /jobs/:id

Get details of a specific job by its jobId.

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "job": { ... } // Full job object
  }
}

GET /jobs/:id/assets

Get assets generated by a specific job (e.g., screenshot URLs, PDF URLs).

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "assets": [
      { "type": "screenshot", "url": "/public/screenshots/...", "createdAt": "..." },
      { "type": "pdf", "url": "/public/pdfs/...", "createdAt": "..." }
      // ... other assets like markdown URLs
    ]
  }
}

POST /jobs/:id/cancel

Cancel a pending or processing job.

Response (Success: 200 OK):

{
  "status": "success",
  "message": "Job cancelled successfully",
  "data": { "jobId": "unique-job-id" }
}

POST /jobs/:id/retry

Retry a job that has failed. Resets status to pending and adds it back to the queue.

Response (Success: 200 OK):

{
  "status": "success",
  "message": "Job retried successfully",
  "data": { "jobId": "unique-job-id" }
}

DELETE /jobs/:id

Delete a job from the database and remove it from the queue if pending.

Response (Success: 200 OK):

{
  "status": "success",
  "message": "Job deleted successfully"
}
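
A quick sketch of these lifecycle endpoints together (run inside an async function; the base URL and jobId are placeholders):

const base = 'http://localhost:3000/api/jobs';
const jobId = 'unique-job-id';

await fetch(`${base}/${jobId}/cancel`, { method: 'POST' }); // stop a pending/processing job
await fetch(`${base}/${jobId}/retry`, { method: 'POST' });  // re-queue a failed job
await fetch(`${base}/${jobId}`, { method: 'DELETE' });      // delete it entirely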

Queue Management (/queue)

GET /queue/metrics

Get statistics about both the Puppeteer and Crawl4AI job queues.

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "metrics": {
      "puppeteer": { "waiting": 0, "active": 1, "completed": 50, "failed": 2, "delayed": 0, "total": 53 },
      "crawl4ai": { "waiting": 2, "active": 0, "completed": 25, "failed": 1, "delayed": 0, "total": 28 },
      "total": { "waiting": 2, "active": 1, "completed": 75, "failed": 3, "delayed": 0, "total": 81 }
    }
  }
}
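
For example, a simple monitoring check could log the combined queue depth (run inside an async function; base URL assumed):

// Fetch queue metrics and report the combined totals.
const res = await fetch('http://localhost:3000/api/queue/metrics');
const { waiting, active, failed } = (await res.json()).data.metrics.total;
console.log(`waiting=${waiting} active=${active} failed=${failed}`);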

GET /queue/jobs

Get jobs currently in the queues based on their state.

Query Parameters:

  • types (string, optional, default: active,waiting,delayed,failed,completed): Comma-separated list of job states to retrieve.
  • limit (number, optional, default: 10): Maximum number of jobs to return across all specified types.

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "jobs": [
      {
        "id": "bull-job-id", // Bull queue job ID
        "name": "Job Name",
        "jobId": "unique-db-job-id", // Database job ID
        "timestamp": 1678886400000,
        // ... other Bull job details
        "state": "active" // or waiting, completed, etc.
      }
      // ... more jobs
    ]
  }
}

DELETE /queue/clear

(Admin/Protected Endpoint) Clears all jobs from all queues (waiting, active, delayed, failed, completed). Use with caution!

Response (Success: 200 OK):

{
  "status": "success",
  "message": "Queue cleared successfully"
}

GET /queue/status

Provides a simple status check for the Node.js API process (not individual workers).

Response (Success: 200 OK):

{
  "status": "success",
  "data": {
    "isRunning": true,
    "uptime": 12345.67,
    "memory": { ... }, // Node.js process memory usage
    "cpuUsage": { ... } // Node.js process CPU usage
  }
}

Action Types

Jobs consist of a sequence of actions. Each action has a type and params.

Puppeteer Actions (Handled by puppeteer.worker.js)

Each entry below gives the action type, what it does, and its params:

  • navigate: Go to a URL. Params: url (string, required).
  • scrape: Extract content from element(s). Params: selector (string, required), attribute (string, optional, default: textContent), multiple (boolean, optional).
  • click: Click an element. Params: selector (string, required).
  • type: Type text into an input. Params: selector (string, required), value (string, required), delay (number, optional, ms).
  • screenshot: Take a screenshot. Params: selector (string, optional), fullPage (boolean, optional, default: false).
  • pdf: Generate a PDF of the current page. Params: format (string, optional, e.g., A4), margin (object, optional, e.g., {top: '10mm', ...}), printBackground (boolean, optional).
  • wait: Wait for an element or timeout. Params: selector (string, optional), timeout (number, optional, ms, default: 30000).
  • evaluate: Run custom JavaScript on the page. Params: script (string, required); must be a self-contained function body or expression.
  • scroll: Scroll the page or an element. Params: selector (string, optional, scrolls the element into view), x (number, optional, scrolls the window), y (number, optional, scrolls the window).
  • select: Select an option in a dropdown. Params: selector (string, required), value (string, required).
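
To make the parameter shapes concrete, here is a hypothetical job combining several of these actions (the URL and selector values are placeholders):

{
  "name": "Capture Article",
  "actions": [
    { "type": "navigate", "params": { "url": "https://example.com/article" } },
    { "type": "wait", "params": { "selector": "h1", "timeout": 15000 } },
    { "type": "screenshot", "params": { "fullPage": true } },
    { "type": "pdf", "params": { "format": "A4", "printBackground": true } },
    { "type": "evaluate", "params": { "script": "document.title" } }
  ]
}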

Crawl4AI Actions (Handled by crawl4ai.worker.js via Python Service)

Note: These actions are forwarded to the Crawl4AI Python microservice. Jobs containing any of these actions will be processed by the crawl4ai-jobs queue and crawl4ai.worker.js.

Each entry below gives the action type, what it does, its params, and any notes:

  • crawl: Crawl and extract using a schema or strategy. Params: url (string, required), schema (object, optional), strategy (string, optional, e.g., JsonCssExtractionStrategy, LLMExtractionStrategy), baseSelector (string, optional). For LLM extraction: llm_provider (string, e.g., openai/gpt-4o-mini, gemini/gemini-1.5-pro-latest), llm_api_key_env_var (string, e.g., OPENAI_API_KEY, GOOGLE_API_KEY), llm_instruction (string), llm_extraction_type (string, schema or block), llm_extra_args (object, optional). Note: for LLMExtractionStrategy, ensure the corresponding API key (OPENAI_API_KEY or GOOGLE_API_KEY) is set in the .env file if the provider requires it.
  • extract: Extract specific content (text, html, attribute). Params: url (string, required), selector (string, required), type (string, optional, default: text), attribute (string, optional). Note: uses Playwright directly in the Python service for extraction.
  • generateSchema: Generate an extraction schema using an LLM. Params: url (string, required), prompt (string, required), model (string, optional, e.g., openai/gpt-4o-mini, gemini/gemini-1.5-pro-latest). Note: requires the appropriate API key in .env if the provider requires it.
  • verify: Verify element existence or content. Params: url (string, required), selector (string, required), expected (string, optional). Note: uses Playwright directly in the Python service.
  • crawlLinks: Follow links and extract data. Params: url (string, required), link_selector (string, required), schema (object, optional), max_depth (number, optional, default: 1).
  • wait (Crawl4AI): Wait for an element (delegated to the Crawl4AI service). Params: url (string, required), selector (string, required), timeout (number, optional, ms, default: 30000). Note: uses Playwright directly in the Python service.
  • filter: Filter elements based on a condition. Params: url (string, required), selector (string, required), condition (string, e.g., href.includes("pdf"), text.includes("Report")). Note: uses Playwright directly in the Python service.
  • extractPDF: Extract text content from a PDF URL. Params: url (string, required). Note: fetches and parses the PDF content.
  • toMarkdown: Convert webpage content to Markdown. Params: url (string, required), options (object, optional, see the Crawl4AI docs). Note: saves to /public/markdown and returns the URL/path.
  • toPDF: Convert a webpage to PDF (via Crawl4AI). Params: url (string, required). Note: saves to /public/pdfs and returns the URL/path.
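
As an illustration, a schema-based crawl job might look like the following. The URL and selectors are placeholders, and the schema layout mirrors Crawl4AI's JsonCssExtractionStrategy format; the exact shape expected by this service may differ, so check the Crawl4AI docs.

{
  "name": "Extract Product Listings",
  "actions": [
    {
      "type": "crawl",
      "params": {
        "url": "https://example.com/products",
        "strategy": "JsonCssExtractionStrategy",
        "schema": {
          "name": "products",
          "baseSelector": ".product-card",
          "fields": [
            { "name": "title", "selector": "h2", "type": "text" },
            { "name": "price", "selector": ".price", "type": "text" }
          ]
        }
      }
    },
    { "type": "toMarkdown", "params": { "url": "https://example.com/products" } }
  ]
}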

Job Action Execution Flow

PuppetMaster processes jobs containing multiple actions sequentially within a single worker process (either puppeteer.worker.js or crawl4ai.worker.js based on the action types).

  • Sequential Execution: Actions defined in the actions array of a job are executed one after another in the order they are listed.
  • State Management:
    • The Puppeteer worker maintains a single browser page instance across actions within a job (e.g., navigating first, then clicking, then scraping).
    • The Crawl4AI worker typically sends each action as a separate request to the Python service, which is stateless between requests for different actions within the same job.
  • Result Passing: Currently, the result of one action is not automatically passed as input to the params of the next action. The parameters for each action are fixed when the job is initially created.
    • Workaround: For complex workflows requiring intermediate results, you need to:
      1. Create a job for the first action(s).
      2. Wait for the job to complete and retrieve its result (e.g., a scraped URL) from the API (GET /jobs/:id).
      3. Create a new job for the subsequent action(s), using the retrieved result in its params.
    • Future Enhancement: A potential future enhancement could involve allowing template variables in action parameters (e.g., "url": "{{results.action_0.url}}"), which the worker would resolve before executing the action.
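
Purely as an illustration of that potential enhancement (it is not implemented today), a worker-side resolver could substitute such placeholders before running an action:

// Hypothetical sketch only: resolve "{{results.action_0.url}}"-style placeholders
// against the results accumulated so far. Not part of the current codebase.
function resolveParams(params, results) {
  const resolved = {};
  for (const [key, value] of Object.entries(params)) {
    if (typeof value === 'string') {
      resolved[key] = value.replace(/\{\{results\.([\w.]+)\}\}/g, (_, path) =>
        path.split('.').reduce((obj, part) => (obj == null ? obj : obj[part]), results) ?? ''
      );
    } else {
      resolved[key] = value;
    }
  }
  return resolved;
}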

Example: Simple Job (Single Worker)

{
  "name": "Login and Scrape Dashboard",
  "actions": [
    { "type": "navigate", "params": { "url": "https://example.com/login" } },
    { "type": "type", "params": { "selector": "#username", "value": "user" } },
    { "type": "type", "params": { "selector": "#password", "value": "pass" } },
    { "type": "click", "params": { "selector": "button[type='submit']" } },
    { "type": "wait", "params": { "selector": "#dashboard-title" } }, // Wait for dashboard
    { "type": "scrape", "params": { "selector": ".widget-data", "multiple": true } }
  ]
}

This entire job would be handled by the puppeteer.worker.js.

Example: Mixed Job (Requires Manual Chaining)

// --- JOB 1 ---
{
  "name": "Navigate and Get PDF Link",
  "actions": [
    { "type": "navigate", "params": { "url": "https://www.example.com/some-page-with-pdf-link" } },
    { "type": "scrape", "params": { "selector": "a.pdf-link", "attribute": "href" } }
    // Worker executes these, result saved to DB: { "action_0": { "url": "..." }, "action_1": "https://example.com/document.pdf" }
  ]
}

// --- After Job 1 completes, retrieve the result (e.g., "https://example.com/document.pdf") ---

// --- JOB 2 ---
{
  "name": "Extract PDF Text",
  "actions": [
    // Use the result from Job 1 here
    { "type": "extractPDF", "params": { "url": "https://example.com/document.pdf" } }
    // Worker sends this to Crawl4AI service
  ]
}
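
A minimal sketch of this manual chaining, assuming Node 18's built-in fetch and a simple polling loop; the base URL, polling interval, and result-field lookup are illustrative and should be adjusted to your deployment and job output.

// Illustrative helper: create Job 1, poll until it finishes, then chain Job 2.
const API = 'http://localhost:3000/api';

async function createJob(job) {
  const res = await fetch(`${API}/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(job)
  });
  return (await res.json()).data.jobId;
}

async function waitForJob(jobId, intervalMs = 5000) {
  // Poll GET /jobs/:id until the job reaches a terminal state.
  for (;;) {
    const res = await fetch(`${API}/jobs/${jobId}`);
    const { job } = (await res.json()).data;
    if (job.status === 'completed') return job;
    if (job.status === 'failed' || job.status === 'cancelled') {
      throw new Error(`Job ${jobId} ${job.status}`);
    }
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
}

async function chain() {
  const firstId = await createJob({
    name: 'Navigate and Get PDF Link',
    actions: [
      { type: 'navigate', params: { url: 'https://www.example.com/some-page-with-pdf-link' } },
      { type: 'scrape', params: { selector: 'a.pdf-link', attribute: 'href' } }
    ]
  });
  const first = await waitForJob(firstId);
  // The exact result shape depends on the worker; adjust this lookup to match your job's output.
  const pdfUrl = first.results?.action_1;
  await createJob({
    name: 'Extract PDF Text',
    actions: [{ type: 'extractPDF', params: { url: pdfUrl } }]
  });
}

chain().catch(console.error);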

Contributing

Contributions are welcome! Please refer to the contribution guidelines.

License

MIT

Author

Keith Mzaza
