You need to develop a Python scraper that extracts information about companies from a list of 1000 URLs.
⏳ Duration: 1 hour
A list of 1000 URLs containing company pages.
Example URLs:
https://exemple-entreprise.com
https://startup-cool.io
For each successfully processed URL, your script should return a structured JSON containing:
{
  "url": "https://example-company.com",
  "success": true,
  "data": {
    "company_name": "Example Company",
    "company_description": "A company specializing in AI solutions...",
    "business_type": "B2B",
    "pricing": "$2000 per user per month"
  }
}

Or:

{
  "url": "https://example-company.com",
  "success": false,
  "error": "description of the error"
}
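One way to keep both result shapes consistent across the codebase is a pair of small helpers (a sketch; the function names are illustrative):

```python
import json


def success_result(url: str, data: dict) -> dict:
    """Wrap extracted fields in the success envelope described above."""
    return {"url": url, "success": True, "data": data}


def error_result(url: str, error: str) -> dict:
    """Wrap a failure in the error envelope described above."""
    return {"url": url, "success": False, "error": error}


if __name__ == "__main__":
    # json.dumps guarantees the output is valid JSON, e.g. true not True.
    print(json.dumps(success_result(
        "https://example-company.com",
        {"company_name": "Example Company"},
    ), indent=2))
```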
Use requests or httpx to fetch the HTML content of the pages; do not use a third-party service to fetch it.
Handle as many errors as possible to maximize the number of successfully processed URLs.
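A minimal fetch wrapper might look like the sketch below (shown with requests; the timeout, retry count, User-Agent string, and thread-pool size are all assumptions to tune):

```python
import concurrent.futures

import requests

# Some sites reject the default requests User-Agent; this one is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; CompanyScraper/1.0)"}


def fetch_url(url: str, timeout: float = 10.0, retries: int = 2) -> dict:
    """Fetch one page, converting every failure into an error result."""
    last_error = "unknown error"
    for _ in range(retries + 1):
        try:
            resp = requests.get(url, headers=HEADERS, timeout=timeout)
            resp.raise_for_status()
            return {"url": url, "success": True, "html": resp.text}
        except requests.exceptions.HTTPError as exc:
            # 4xx/5xx: retrying a 404 is pointless, so stop early.
            last_error = f"HTTP {exc.response.status_code}"
            break
        except requests.exceptions.RequestException:
            # DNS failures, timeouts, SSL errors, malformed URLs, ...
            last_error = "request failed"
    return {"url": url, "success": False, "error": last_error}


def fetch_all(urls: list[str], workers: int = 20) -> list[dict]:
    """Fetch many URLs concurrently; one bad URL never crashes the run."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch_url, urls))
```

Catching `RequestException` (the base class of requests' errors) rather than bare `Exception` keeps genuine bugs visible while still absorbing every network-level failure.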
Once the text is extracted, send it to an LLM to structure the data.
Model name: gpt-4.1-mini
API Key: ask for it
Note: This model has a rate limit of 3,000 requests per minute (RPM), and that quota is shared with other workloads.
📄 Documentation: https://platform.openai.com/docs/guides/structured-outputs
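The structured-outputs call could be sketched as follows (the schema fields mirror the JSON format above; the prompt wording, the 20,000-character truncation, and the throttling delay are assumptions; the OpenAI client is imported lazily so the schema can be inspected without the SDK installed):

```python
import json
import time

# JSON schema matching the expected output; strict mode requires every
# property to appear in "required" and additionalProperties to be false.
COMPANY_SCHEMA = {
    "name": "company_info",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "company_name": {"type": "string"},
            "company_description": {"type": "string"},
            "business_type": {"type": "string",
                              "enum": ["B2B", "B2C", "B2B2C", "unknown"]},
            "pricing": {"type": "string"},
        },
        "required": ["company_name", "company_description",
                     "business_type", "pricing"],
        "additionalProperties": False,
    },
}

# 3,000 RPM is shared: stay well below it, e.g. ~20 requests/second.
MIN_DELAY_SECONDS = 0.05


def extract_company_data(page_text: str, api_key: str) -> dict:
    """Ask the model to structure raw page text into the schema above."""
    from openai import OpenAI  # lazy import: schema stays usable without the SDK

    client = OpenAI(api_key=api_key)
    response = client.chat.completions.create(
        model="gpt-4.1-mini",
        messages=[
            {"role": "system",
             "content": "Extract company information from the page text. "
                        "Use 'unknown' when a field is not stated."},
            {"role": "user", "content": page_text[:20000]},  # cap very long pages
        ],
        response_format={"type": "json_schema", "json_schema": COMPANY_SCHEMA},
    )
    time.sleep(MIN_DELAY_SECONDS)  # crude client-side throttle
    return json.loads(response.choices[0].message.content)
```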
If you have additional time, assess the quality of extracted information by implementing validation checks.
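Validation could start with simple field-level checks (a sketch; the placeholder list, length threshold, and rules are assumptions):

```python
import re

# Values that signal the model had nothing to extract.
PLACEHOLDERS = {"", "unknown", "n/a", "none"}


def validate_record(data: dict) -> list[str]:
    """Return a list of quality issues; an empty list means the record passes."""
    issues = []
    for field in ("company_name", "company_description",
                  "business_type", "pricing"):
        value = str(data.get(field, "")).strip()
        if value.lower() in PLACEHOLDERS:
            issues.append(f"{field} is missing or a placeholder")
    if len(str(data.get("company_description", ""))) < 20:
        issues.append("company_description is suspiciously short")
    if data.get("business_type") not in {"B2B", "B2C", "B2B2C", "unknown"}:
        issues.append("business_type outside the expected set")
    pricing = str(data.get("pricing", ""))
    if pricing.lower() not in PLACEHOLDERS and not re.search(r"\d", pricing):
        issues.append("pricing contains no number")
    return issues
```

Logging the issue counts per run gives a simple time series for tracking how extraction quality evolves as the prompt changes.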
| Criterion | Expectations |
|---|---|
| Number of URLs processed | Maximize successful scrapes despite potential errors |
| Code quality | Clear, well-structured code |
| Data quality | Relevant, accurate extracted data |
| AI | A well-designed prompt |
| Evaluation | Ability to track how data quality evolves |
Good luck! 🚀