A powerful microservice for web automation, scraping, and data processing, integrating Puppeteer for browser control and Crawl4AI for advanced crawling and AI-powered extraction.
- Puppeteer Core:
  - 🌐 Headless browser automation with Puppeteer and Chromium
  - 🖱️ Standard browser interactions: navigate, click, type, scroll, select
  - 🖼️ Screenshot generation (full page or element)
  - 📄 PDF generation
  - ⚙️ Custom JavaScript evaluation
- Crawl4AI Integration:
  - 🕷️ Advanced crawling strategies (schema-based, LLM-driven)
  - 🧩 Flexible data extraction (CSS, XPath, LLM)
  - 🧠 Dynamic schema generation using LLMs
  - ✅ Content verification
  - 🔗 Deep link crawling
  - ⏳ Element waiting and filtering
  - 📄 PDF text extraction
  - 📝 Webpage to Markdown conversion
  - 🌐 Webpage to PDF conversion (via Crawl4AI)
- System:
  - 🔄 Bull queue system for robust job management (separate queues for Puppeteer & Crawl4AI)
  - 📊 MongoDB for job persistence, status tracking, and results storage
  - 💾 Local file storage for generated assets (screenshots, PDFs, Markdown files)
  - 📈 API endpoints for job management and queue monitoring
- Backend: Node.js, Express.js
- Web Automation: Puppeteer
- Crawling & AI: Python, FastAPI, Crawl4AI
- Job Queue: BullMQ, Redis
- Database: MongoDB (with Mongoose)
- Languages: JavaScript, Python
- Node.js (v18 or later recommended)
- npm or yarn
- Python (v3.8 or later recommended)
- pip
- MongoDB (local instance or Atlas)
- Redis (local instance or cloud provider)
- Clone the repository:

  ```bash
  git clone <repository-url>
  cd PuppetMaster
  ```

- Install Node.js dependencies:

  ```bash
  npm install
  # or
  yarn install
  ```

- Set up the Python environment for Crawl4AI:

  ```bash
  # Create a virtual environment (recommended)
  python3 -m venv .venv
  source .venv/bin/activate  # On Windows use `.venv\Scripts\activate`

  # Install Python dependencies
  pip install -r requirements.txt
  ```
- Configure environment variables: create a `.env` file in the project root and set the following:

  ```bash
  # Node.js App Configuration
  PORT=3000
  NODE_ENV=development # or production
  MONGODB_URI=mongodb://localhost:27017/puppet-master # Replace with your MongoDB connection string
  REDIS_HOST=localhost
  REDIS_PORT=6379
  RATE_LIMIT_WINDOW_MS=60000
  RATE_LIMIT_MAX=100

  # Puppeteer Worker Configuration
  PUPPETEER_HEADLESS=true # Set to false to run the browser in non-headless mode
  PUPPETEER_TIMEOUT=60000 # Default timeout for Puppeteer operations (ms)
  JOB_CONCURRENCY=2 # Max concurrent Puppeteer jobs

  # Crawl4AI Worker & Service Configuration
  CRAWL4AI_API_URL=http://localhost:8000 # URL of the Python Crawl4AI service
  CRAWL4AI_API_TIMEOUT=120000 # Timeout for requests to the Crawl4AI service (ms)
  CRAWL4AI_PORT=8000 # Port for the Python Crawl4AI service
  JOB_ATTEMPTS=3 # Default Bull queue job attempts
  JOB_TIMEOUT=300000 # Default Bull queue job timeout (ms)

  # Add API keys for LLM providers if using LLMExtractionStrategy
  # Example for OpenAI (only required if using OpenAI models):
  # OPENAI_API_KEY=your_openai_api_key
  # Example for Google Gemini (only required if using Gemini models):
  # GOOGLE_API_KEY=your_google_ai_api_key
  ```
- Start the services and workers.
  You can start everything concurrently using the provided npm scripts:

  ```bash
  # For development (with nodemon for the Node.js app/worker)
  npm run dev:all

  # For production
  npm run start:all
  ```

  These scripts run the following components:

  - Node.js API Server (`src/index.js`) - also processes jobs from the `crawl4ai-jobs` queue.
  - Puppeteer Worker (`src/workers/puppeteer.worker.js`) - processes jobs from the `puppeteer-jobs` queue.
  - Crawl4AI Python Service (`src/crawl4ai/main.py`) - handles Crawl4AI API requests from the Node.js worker.

  Alternatively, you can start the components individually:

  ```bash
  # Start the Node.js API (Terminal 1)
  # This process also handles processing for Crawl4AI jobs.
  npm start # or npm run dev

  # Start the Puppeteer Worker (Terminal 2)
  # Processes only Puppeteer-specific jobs.
  npm run start:worker # or npm run dev:worker

  # Start the Crawl4AI Python Service (Terminal 3)
  npm run start:crawl4ai
  # or directly:
  ./start-crawl4ai.sh
  # or:
  source .venv/bin/activate && python src/crawl4ai/main.py
  ```
PuppetMaster uses a microservice architecture:
- Node.js API Server (`src/index.js`):
  - Exposes REST API endpoints for job management and queue monitoring.
  - Uses Express.js, Mongoose (for MongoDB interaction), and Bull for queue management.
  - Handles incoming job requests and saves them to MongoDB.
  - Adds jobs to either the Puppeteer or the Crawl4AI Bull queue based on their action types.
  - Processes jobs from the `crawl4ai-jobs` queue by interacting with the Crawl4AI Python Service.
- Puppeteer Worker (`src/workers/puppeteer.worker.js`):
  - A separate Node.js process that listens to the `puppeteer-jobs` Bull queue.
  - Executes Puppeteer-specific browser automation tasks (navigate, click, screenshot, etc.).
  - Updates job status and results in MongoDB.
- Crawl4AI Python Service (`src/crawl4ai/`):
  - A FastAPI application providing endpoints for advanced crawling and extraction tasks.
  - Uses the `Crawl4AI` library internally.
  - Communicates with the Node.js API/worker process via HTTP requests.
- Bull Queues (Redis): Manage job processing, ensuring robustness and retries.
- MongoDB: Persists job definitions, status, results, and generated asset metadata.
- Local File Storage (`/public`): Stores generated files such as screenshots, PDFs, and Markdown files.
- Error Handling: A centralized error handler (`src/middleware/errorHandler.js`) provides consistent JSON error responses (see the `ApiError` class).
- Validation: Incoming requests for specific endpoints (such as job creation) are validated using Joi schemas (`src/middleware/validation.js`).
- Job Model: Job details, including status, results, assets, and progress, are stored in MongoDB using the schema defined in `src/models/Job.js`.
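To make the queue routing concrete, here is a conceptual sketch of how the API might decide which queue a job belongs to, assuming routing keys off the action types listed later in this README. The names `CRAWL4AI_ACTIONS` and `pickQueue` are hypothetical, not the actual source:

```javascript
// Hypothetical sketch of action-type-based queue routing (not the actual implementation).
const CRAWL4AI_ACTIONS = new Set([
  'crawl', 'extract', 'generateSchema', 'verify', 'crawlLinks',
  'filter', 'extractPDF', 'toMarkdown', 'toPDF',
]);

// If any action in the job belongs to the Crawl4AI set, the whole job is
// queued on crawl4ai-jobs; otherwise it goes to puppeteer-jobs.
function pickQueue(job) {
  const needsCrawl4ai = job.actions.some((action) => CRAWL4AI_ACTIONS.has(action.type));
  return needsCrawl4ai ? 'crawl4ai-jobs' : 'puppeteer-jobs';
}
```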
The API allows you to create, manage, and monitor automation jobs.
Create a new job. The job will be routed to the appropriate queue (Puppeteer or Crawl4AI) based on its actions.
Request Body:
```json
{
  "name": "Unique Job Name",
  "description": "Optional job description",
  "priority": 0, // Optional: Bull queue priority (-100 to 100)
  "actions": [
    {
      "type": "action_type_1", // See the Action Types section below
      "params": { ... } // Parameters specific to the action type
    },
    {
      "type": "action_type_2",
      "params": { ... }
    }
    // ... more actions
  ],
  "metadata": { ... } // Optional: any additional data to store with the job
}
```
Response (Success: 201 Created):
```json
{
  "status": "success",
  "message": "Job created successfully",
  "data": {
    "jobId": "unique-job-id",
    "name": "Unique Job Name",
    "status": "pending"
  }
}
```
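As a quick client-side sketch, here is how a job could be submitted from Node.js (v18+, where `fetch` is built in). The base URL and the `/jobs` route prefix are assumptions based on the `GET /jobs/:id` reference later in this README; adjust them to your deployment:

```javascript
// Minimal sketch: submit a job to the API (base URL and route prefix are assumptions).
const response = await fetch('http://localhost:3000/jobs', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    name: 'Example Screenshot Job',
    actions: [
      { type: 'navigate', params: { url: 'https://example.com' } },
      { type: 'screenshot', params: { fullPage: true } },
    ],
  }),
});

const { data } = await response.json();
console.log('Created job:', data.jobId); // status starts as "pending"
```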
Get a list of jobs with filtering and pagination.
Query Parameters:
- `status` (string, optional): Filter by job status (e.g., `pending`, `processing`, `completed`, `failed`, `cancelled`).
- `page` (number, optional, default: 1): Page number for pagination.
- `limit` (number, optional, default: 10): Number of jobs per page.
- `sort` (string, optional, default: `createdAt`): Field to sort by.
- `order` (string, optional, default: `desc`): Sort order (`asc` or `desc`).
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "jobs": [ ... ], // Array of job objects
    "pagination": {
      "total": 100,
      "page": 1,
      "limit": 10,
      "pages": 10
    }
  }
}
```
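For example, fetching the first page of failed jobs, newest first (same assumed base URL and route prefix as the creation sketch above):

```javascript
// Sketch: list failed jobs sorted by creation time, descending.
const params = new URLSearchParams({
  status: 'failed',
  page: '1',
  limit: '10',
  sort: 'createdAt',
  order: 'desc',
});

const res = await fetch(`http://localhost:3000/jobs?${params}`);
const { data } = await res.json();
console.log(`Showing ${data.jobs.length} of ${data.pagination.total} jobs`);
```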
Get details of a specific job by its `jobId`.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "job": { ... } // Full job object
  }
}
```
Get assets generated by a specific job (e.g., screenshot URLs, PDF URLs).
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "assets": [
      { "type": "screenshot", "url": "/public/screenshots/...", "createdAt": "..." },
      { "type": "pdf", "url": "/public/pdfs/...", "createdAt": "..." }
      // ... other assets, such as Markdown URLs
    ]
  }
}
```
Cancel a pending or processing job.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "message": "Job cancelled successfully",
  "data": { "jobId": "unique-job-id" }
}
```
Retry a failed job. This resets its status to `pending` and adds it back to the queue.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "message": "Job retried successfully",
  "data": { "jobId": "unique-job-id" }
}
```
Delete a job from the database and remove it from the queue if pending.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "message": "Job deleted successfully"
}
```
Get statistics about both the Puppeteer and Crawl4AI job queues.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "metrics": {
      "puppeteer": { "waiting": 0, "active": 1, "completed": 50, "failed": 2, "delayed": 0, "total": 53 },
      "crawl4ai": { "waiting": 2, "active": 0, "completed": 25, "failed": 1, "delayed": 0, "total": 28 },
      "total": { "waiting": 2, "active": 1, "completed": 75, "failed": 3, "delayed": 0, "total": 81 }
    }
  }
}
```
Get jobs currently in the queues based on their state.
Query Parameters:
- `types` (string, optional, default: `active,waiting,delayed,failed,completed`): Comma-separated list of job states to retrieve.
- `limit` (number, optional, default: 10): Maximum number of jobs to return across all specified types.
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "jobs": [
      {
        "id": "bull-job-id", // Bull queue job ID
        "name": "Job Name",
        "jobId": "unique-db-job-id", // Database job ID
        "timestamp": 1678886400000,
        // ... other Bull job details
        "state": "active" // or waiting, completed, etc.
      }
      // ... more jobs
    ]
  }
}
```
(Admin/Protected Endpoint) Clears all jobs from all queues (waiting, active, delayed, failed, completed). Use with caution!
Response (Success: 200 OK):
```json
{
  "status": "success",
  "message": "Queue cleared successfully"
}
```
Provides a simple status check for the Node.js API process (not individual workers).
Response (Success: 200 OK):
```json
{
  "status": "success",
  "data": {
    "isRunning": true,
    "uptime": 12345.67,
    "memory": { ... }, // Node.js process memory usage
    "cpuUsage": { ... } // Node.js process CPU usage
  }
}
```
Jobs consist of a sequence of actions. Each action has a `type` and `params`.
| Action Type | Description | Parameters (`params`) |
|---|---|---|
| `navigate` | Go to a URL | `url` (string, required) |
| `scrape` | Extract content from element(s) | `selector` (string, required), `attribute` (string, optional, default: `textContent`), `multiple` (boolean, optional) |
| `click` | Click an element | `selector` (string, required) |
| `type` | Type text into an input | `selector` (string, required), `value` (string, required), `delay` (number, optional, ms) |
| `screenshot` | Take a screenshot | `selector` (string, optional), `fullPage` (boolean, optional, default: `false`) |
| `pdf` | Generate a PDF of the current page | `format` (string, optional, e.g., `A4`), `margin` (object, optional, e.g., `{top: '10mm', ...}`), `printBackground` (boolean, optional) |
| `wait` | Wait for an element or timeout | `selector` (string, optional), `timeout` (number, optional, ms, default: 30000) |
| `evaluate` | Run custom JavaScript on the page | `script` (string, required) - must be a self-contained function body or expression |
| `scroll` | Scroll the page or an element | `selector` (string, optional - scrolls the element into view), `x` (number, optional - scrolls the window), `y` (number, optional - scrolls the window) |
| `select` | Select an option in a dropdown | `selector` (string, required), `value` (string, required) |
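As an illustration of the `scroll` and `evaluate` actions, here is a minimal `actions` fragment. The URL and the `script` expression are placeholders; note that the script must be self-contained, since it runs in the page context:

```json
[
  { "type": "navigate", "params": { "url": "https://example.com" } },
  { "type": "scroll", "params": { "y": 1000 } },
  { "type": "evaluate", "params": { "script": "document.querySelectorAll('a').length" } }
]
```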
Note: The following actions are forwarded to the Crawl4AI Python microservice. Jobs containing any of these actions are processed through the `crawl4ai-jobs` queue and `crawl4ai.worker.js`.
| Action Type | Description | Parameters (`params`) | Notes |
|---|---|---|---|
| `crawl` | Crawl & extract using a schema/strategy | `url` (string, required), `schema` (object, optional), `strategy` (string, optional, e.g., `JsonCssExtractionStrategy`, `LLMExtractionStrategy`), `baseSelector` (string, optional). For LLM: `llm_provider` (string, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`), `llm_api_key_env_var` (string, e.g., `OPENAI_API_KEY`, `GOOGLE_API_KEY`), `llm_instruction` (string), `llm_extraction_type` (string, `schema` or `block`), `llm_extra_args` (object, optional) | For `LLMExtractionStrategy`, ensure the corresponding API key (`OPENAI_API_KEY` or `GOOGLE_API_KEY`) is set in the `.env` file if the provider requires it. |
| `extract` | Extract specific content (text, HTML, attribute) | `url` (string, required), `selector` (string, required), `type` (string, optional, default: `text`), `attribute` (string, optional) | Uses Playwright directly in the Python service for extraction. |
| `generateSchema` | Generate an extraction schema using an LLM | `url` (string, required), `prompt` (string, required), `model` (string, optional, e.g., `openai/gpt-4o-mini`, `gemini/gemini-1.5-pro-latest`) | Requires the appropriate API key in `.env` if the provider requires it. |
| `verify` | Verify element existence or content | `url` (string, required), `selector` (string, required), `expected` (string, optional) | Uses Playwright directly in the Python service. |
| `crawlLinks` | Follow links and extract data | `url` (string, required), `link_selector` (string, required), `schema` (object, optional), `max_depth` (number, optional, default: 1) | |
| `wait` (Crawl4AI) | Wait for an element (delegated to the Crawl4AI service) | `url` (string, required), `selector` (string, required), `timeout` (number, optional, ms, default: 30000) | Uses Playwright directly in the Python service. |
| `filter` | Filter elements based on a condition | `url` (string, required), `selector` (string, required), `condition` (string, e.g., `href.includes("pdf")`, `text.includes("Report")`) | Uses Playwright directly in the Python service. |
| `extractPDF` | Extract text content from a PDF URL | `url` (string, required) | Fetches and parses the PDF content. |
| `toMarkdown` | Convert webpage content to Markdown | `url` (string, required), `options` (object, optional, see the Crawl4AI docs) | Saves to `/public/markdown` and returns the URL/path. |
| `toPDF` | Convert a webpage to PDF (via Crawl4AI) | `url` (string, required) | Saves to `/public/pdfs` and returns the URL/path. |
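For illustration, a `crawl` action using `LLMExtractionStrategy` might look like the following. The URL, instruction, and schema are placeholders; `OPENAI_API_KEY` must be set in `.env` for this provider:

```json
{
  "type": "crawl",
  "params": {
    "url": "https://example.com/products",
    "strategy": "LLMExtractionStrategy",
    "llm_provider": "openai/gpt-4o-mini",
    "llm_api_key_env_var": "OPENAI_API_KEY",
    "llm_instruction": "Extract each product's name and price.",
    "llm_extraction_type": "schema",
    "schema": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "price": { "type": "string" }
      }
    }
  }
}
```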
PuppetMaster processes jobs containing multiple actions sequentially within a single worker process (either `puppeteer.worker.js` or `crawl4ai.worker.js`, depending on the action types).

- Sequential Execution: Actions defined in a job's `actions` array are executed one after another, in the order they are listed.
- State Management:
  - The Puppeteer worker maintains a single browser page instance across the actions within a job (e.g., navigating first, then clicking, then scraping).
  - The Crawl4AI worker typically sends each action as a separate request to the Python service, which is stateless between requests, even for actions within the same job.
- Result Passing: Currently, the result of one action is not automatically passed as input to the `params` of the next action. The parameters for each action are fixed when the job is created.
  - Workaround: For complex workflows requiring intermediate results, you need to (see the two-job example and the orchestration sketch below):
    1. Create a job for the first action(s).
    2. Wait for the job to complete and retrieve its result (e.g., a scraped URL) from the API (`GET /jobs/:id`).
    3. Create a new job for the subsequent action(s), using the retrieved result in its `params`.
  - Future Enhancement: A potential improvement is to allow template variables in action parameters (e.g., `"url": "{{results.action_0.url}}"`), which the worker would resolve before executing the action.
```json
{
  "name": "Login and Scrape Dashboard",
  "actions": [
    { "type": "navigate", "params": { "url": "https://example.com/login" } },
    { "type": "type", "params": { "selector": "#username", "value": "user" } },
    { "type": "type", "params": { "selector": "#password", "value": "pass" } },
    { "type": "click", "params": { "selector": "button[type='submit']" } },
    { "type": "wait", "params": { "selector": "#dashboard-title" } }, // Wait for the dashboard
    { "type": "scrape", "params": { "selector": ".widget-data", "multiple": true } }
  ]
}
```
This entire job would be handled by `puppeteer.worker.js`.
```json
// --- JOB 1 ---
{
  "name": "Navigate and Get PDF Link",
  "actions": [
    { "type": "navigate", "params": { "url": "https://www.example.com/some-page-with-pdf-link" } },
    { "type": "scrape", "params": { "selector": "a.pdf-link", "attribute": "href" } }
    // The worker executes these; the result is saved to the DB:
    // { "action_0": { "url": "..." }, "action_1": "https://example.com/document.pdf" }
  ]
}

// --- After Job 1 completes, retrieve the result (e.g., "https://example.com/document.pdf") ---

// --- JOB 2 ---
{
  "name": "Extract PDF Text",
  "actions": [
    // Use the result from Job 1 here
    { "type": "extractPDF", "params": { "url": "https://example.com/document.pdf" } }
    // The worker sends this to the Crawl4AI service
  ]
}
```
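A minimal orchestration sketch for this two-job flow follows, assuming the `/jobs` routes shown earlier and run inside an async context (e.g., an ES module with top-level await). The exact result layout (`job.results.action_1`) is an assumption based on the comment in Job 1 above; check your job documents for the actual shape:

```javascript
// Sketch: chain two jobs by polling the first one's result (routes and result shape assumed).
const BASE = 'http://localhost:3000';

async function createJob(payload) {
  const res = await fetch(`${BASE}/jobs`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(payload),
  });
  return (await res.json()).data.jobId;
}

async function waitForJob(jobId, intervalMs = 2000) {
  // Poll GET /jobs/:id until the job reaches a terminal state.
  for (;;) {
    const res = await fetch(`${BASE}/jobs/${jobId}`);
    const { data } = await res.json();
    if (data.job.status === 'completed') return data.job;
    if (data.job.status === 'failed') throw new Error(`Job ${jobId} failed`);
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

// Job 1: scrape the PDF link.
const job1Id = await createJob({
  name: 'Navigate and Get PDF Link',
  actions: [
    { type: 'navigate', params: { url: 'https://www.example.com/some-page-with-pdf-link' } },
    { type: 'scrape', params: { selector: 'a.pdf-link', attribute: 'href' } },
  ],
});

const job1 = await waitForJob(job1Id);
const pdfUrl = job1.results.action_1; // assumed result shape, per the comment in Job 1

// Job 2: feed the scraped URL into the next job's params.
await createJob({
  name: 'Extract PDF Text',
  actions: [{ type: 'extractPDF', params: { url: pdfUrl } }],
});
```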
Contributions are welcome! Please refer to the contribution guidelines.
MIT
Keith Mzaza