Technical test: AI Scraper

🎯 Objective

You need to develop a Python scraper that extracts information about companies from a list of 1000 URLs.

📌 Constraints & Expectations

⏳ Duration: 1 hour

📤 Input

A list of 1000 URLs pointing to company pages.

Example URLs:

https://exemple-entreprise.com
https://startup-cool.io

📥 Expected Output

For each successfully processed URL, your script should return a structured JSON object:

{
  "url": "https://example-company.com",
  "success": true,
  "data": {
    "company_name": "Example Company",
    "company_description": "A company specializing in AI solutions...",
    "business_type": "B2B",
    "pricing": "$2000 per user per month"
  }
}

Or, if processing fails:

{
  "url": "https://example-company.com",
  "success": false,
  "error": "description of the error"
}

📚 Libraries

Use whichever Python libraries you prefer; the steps below assume requests or httpx for HTTP and the OpenAI Python SDK for the LLM calls.

🔍 Steps to Follow

1️⃣ Step 1 - Scrape Web Pages

Use requests or httpx to fetch the HTML content of the pages. Do not use a third-party service to fetch the page content.

Handle as many errors as possible to maximize the number of successfully processed URLs.
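A minimal fetching sketch, assuming httpx with asyncio for concurrency; the timeout, concurrency limit, user agent, and the fetch_one / fetch_all helper names are illustrative choices, not part of the test:

import asyncio
import json

import httpx

# Illustrative settings; tune freely.
CONCURRENCY = 50
TIMEOUT = httpx.Timeout(10.0)
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; ai-scraper/0.1)"}

async def fetch_one(client: httpx.AsyncClient, url: str, sem: asyncio.Semaphore) -> dict:
    # Fetch a single URL and map any failure to the error structure from the spec.
    async with sem:
        try:
            resp = await client.get(url, timeout=TIMEOUT, follow_redirects=True)
            resp.raise_for_status()
            return {"url": url, "success": True, "html": resp.text}
        except Exception as exc:  # DNS failures, timeouts, HTTP errors, SSL issues...
            return {"url": url, "success": False, "error": f"{type(exc).__name__}: {exc}"}

async def fetch_all(urls: list[str]) -> list[dict]:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with httpx.AsyncClient(headers=HEADERS) as client:
        return await asyncio.gather(*(fetch_one(client, url, sem) for url in urls))

if __name__ == "__main__":
    results = asyncio.run(fetch_all(["https://exemple-entreprise.com", "https://startup-cool.io"]))
    print(json.dumps(results, indent=2)[:1000])

Catching broadly at the per-URL level keeps one bad page from stopping the run, which is what maximizes the number of successfully processed URLs.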

2️⃣ Step 2 - Extract Information Using an LLM

Once the page text is extracted, send it to an LLM to structure the data.

Model name: gpt-4.1-mini

API Key: ask for it

Note: this model has a rate limit of 3000 requests per minute (RPM), but that limit is shared with other resources.

📄 Documentation:  https://platform.openai.com/docs/guides/structured-outputs
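One possible shape for this call, assuming the openai Python SDK (a recent version that provides the structured-outputs parse helper) and pydantic; the CompanyInfo schema mirrors the expected output above, while the prompt wording, text truncation, and semaphore size are assumptions:

import asyncio

from openai import AsyncOpenAI
from pydantic import BaseModel

class CompanyInfo(BaseModel):
    company_name: str
    company_description: str
    business_type: str
    pricing: str

client = AsyncOpenAI()                 # reads OPENAI_API_KEY from the environment
llm_semaphore = asyncio.Semaphore(20)  # stay well under the shared 3000 RPM limit

async def extract_company(url: str, page_text: str) -> dict:
    # Ask the model to fill the CompanyInfo schema from the page text.
    async with llm_semaphore:
        try:
            completion = await client.beta.chat.completions.parse(
                model="gpt-4.1-mini",
                messages=[
                    {"role": "system",
                     "content": "Extract company information from the page text. "
                                "Only use facts that appear in the text."},
                    {"role": "user", "content": f"URL: {url}\n\n{page_text[:20000]}"},
                ],
                response_format=CompanyInfo,
            )
            parsed = completion.choices[0].message.parsed
            return {"url": url, "success": True, "data": parsed.model_dump()}
        except Exception as exc:
            return {"url": url, "success": False, "error": f"{type(exc).__name__}: {exc}"}

Capping in-flight requests with a semaphore keeps the script comfortably under the shared rate limit while still processing the 1000 URLs quickly.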

3️⃣ Step 3 - Evaluate Data Quality

If you have additional time, assess the quality of extracted information by implementing validation checks.
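A few example checks, assuming the fields from the expected output above; the length threshold and the allowed business_type values are illustrative assumptions:

ALLOWED_BUSINESS_TYPES = {"B2B", "B2C", "B2B2C", "Unknown"}  # assumed value set

def validate_record(record: dict) -> list[str]:
    # Return a list of quality issues for one successful extraction.
    data = record.get("data", {})
    issues = []
    if not data.get("company_name"):
        issues.append("missing company_name")
    if len(data.get("company_description", "")) < 20:
        issues.append("company_description too short to be informative")
    if data.get("business_type") not in ALLOWED_BUSINESS_TYPES:
        issues.append(f"unexpected business_type: {data.get('business_type')!r}")
    if not any(ch.isdigit() for ch in data.get("pricing", "")):
        issues.append("pricing contains no figure (may be missing or 'contact us')")
    return issues

def quality_report(records: list[dict]) -> dict:
    # Aggregate simple metrics so quality can be compared across runs.
    successes = [r for r in records if r.get("success")]
    flagged = {r["url"]: issues for r in successes if (issues := validate_record(r))}
    return {
        "processed": len(records),
        "successful": len(successes),
        "flagged": len(flagged),
        "issues": flagged,
    }

Aggregating the flagged issues per run gives a simple way to track how data quality evolves as the prompt or scraping logic changes.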

✅ Evaluation Criteria

Number of URLs processed: maximizing scraping despite potential errors
Code quality: clear and well-structured code
Data quality: ensuring data relevance and accuracy
AI: a well-designed prompt
Evaluation: ability to track the evolution of data quality

Good luck! 🚀
