This project demonstrates how to perform web scraping using both APIs and Selenium WebDriver, and is designed to run seamlessly on Google Colab or in a local environment.
- For data accessible via APIs, HTTP requests are used.
- For content rendered dynamically with JavaScript or requiring user interaction, a browser-automation WebDriver (e.g., Selenium) is used.
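For the API-based path, a plain HTTP request is usually enough. Here is a minimal sketch; the endpoint URL and parameters are placeholders, not part of this project:

```python
import requests

# Hypothetical endpoint and parameters -- replace with the API you are actually targeting.
API_URL = "https://api.example.com/v1/items"

response = requests.get(API_URL, params={"page": 1}, timeout=30)
response.raise_for_status()   # fail fast on HTTP errors
data = response.json()        # most APIs return JSON

print(type(data))
```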
Before running the project, ensure the following requirements are met:
- A Google Account (for accessing Google Colab)
- Required Python libraries: `pandas`, `beautifulsoup4`, `selenium`
- Optional: `requests`, `lxml`, etc.
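If any of these are missing, they can be installed with pip (Colab already ships with most of them; in a notebook cell, prefix the command with `!`):

```
pip install pandas beautifulsoup4 selenium requests lxml
```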
- Open one of the notebooks.
- Make sure the appropriate web driver is installed for your browser (e.g., Chrome, Edge); a minimal setup sketch follows this list.
- Run each cell in the notebook sequentially to initiate the scraping process.
- Data will be collected using both APIs and the WebDriver as needed.
- Scraped data can be saved to:
  - The Colab session (e.g., as `.csv` or `.json` files)
  - Your linked Google Drive (if mounted)
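The sketch below walks through the WebDriver step end to end in headless mode and saves the result as a CSV. It assumes Chrome/Chromium and a matching driver are available on the runtime (recent Selenium versions can often resolve the driver automatically); the URL, CSS selector, and output filename are placeholders.

```python
import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

# Headless mode is required in Colab, since there is no display for a GUI browser.
options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://example.com/listings")   # placeholder URL
    soup = BeautifulSoup(driver.page_source, "html.parser")

    # Placeholder parsing logic -- adjust the selectors to the target site's HTML.
    rows = [{"title": el.get_text(strip=True)} for el in soup.select("h2")]
finally:
    driver.quit()

df = pd.DataFrame(rows)

# Save inside the Colab session...
df.to_csv("scraped_data.csv", index=False)

# ...or, if Google Drive is mounted, persist it there instead:
# from google.colab import drive
# drive.mount("/content/drive")
# df.to_csv("/content/drive/MyDrive/scraped_data.csv", index=False)
```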
You can customize the notebooks to suit your specific scraping needs:
- Update API endpoints or request parameters.
- Modify WebDriver settings (e.g., headless mode, wait times); see the sketch after this list.
- Add authentication headers or tokens (if required).
- Adjust parsing logic based on the HTML structure of the target site.
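As an illustration of adjusting WebDriver settings, here is a small sketch combining a headless toggle with an explicit wait for dynamically rendered elements; the URL and CSS selector are placeholders:

```python
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless=new")      # toggle headless mode here

driver = webdriver.Chrome(options=options)
wait = WebDriverWait(driver, timeout=15)    # adjust the wait budget per site

driver.get("https://example.com/results")   # placeholder URL
# Wait for the dynamic content instead of sleeping a fixed amount of time.
cards = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".result-card"))
)
print(len(cards), "elements found")
driver.quit()
```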
- Limitations in Google Colab:
  - GUI-based browser interactions are limited (consider using headless mode).
  - Some websites may block scraping based on the user-agent or IP address.
- Authentication Handling:
  - If the target API or website requires login/authentication, include the proper headers or login steps in your notebook; a sketch follows this list.
  - For OAuth2 or cookie-based auth, you may need to simulate sessions or store tokens securely.
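A minimal sketch of adding authentication to the API path, assuming either a simple bearer token or a form login; the endpoint, field names, and environment variables are placeholders, and a full OAuth2 flow would need a dedicated library:

```python
import os
import requests

session = requests.Session()

# Option 1: token-based auth -- read the secret from the environment rather
# than hard-coding it in the notebook.
session.headers.update({"Authorization": f"Bearer {os.environ['API_TOKEN']}"})

# Option 2: cookie-based auth -- log in once and reuse the session's cookies.
login_payload = {"username": "user", "password": os.environ["SITE_PASSWORD"]}
session.post("https://example.com/login", data=login_payload, timeout=30)

# Subsequent requests carry the auth header and any cookies set at login.
resp = session.get("https://example.com/api/protected-data", timeout=30)
resp.raise_for_status()
```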
Feel free to open an issue or submit a pull request if you encounter any issues or have improvements you'd like to contribute.
Happy scraping! 🕷️📊