phucvn16409/web-scraping

Web Scraping with API and Web Driver in Google Colab

📘 Description

This project demonstrates how to perform web scraping using both APIs and Selenium Web Driver, designed to run seamlessly on Google Colab or a local environment.

  • For data accessible via APIs, HTTP requests are used.
  • For content rendered dynamically with JavaScript or requiring user interaction, a Web Driver is used (e.g., Selenium).
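
As a sketch of the two parsing paths (the JSON payload, HTML snippet, and field names below are hypothetical, not taken from this repository), the API route hands you structured data directly, while the Web Driver route hands you rendered HTML to parse:

```python
import json

from bs4 import BeautifulSoup  # provided by the beautifulsoup4 package


def parse_api_payload(payload: str) -> list[dict]:
    """Parse a JSON API response body into a list of records."""
    return json.loads(payload)["items"]


def parse_rendered_html(html: str) -> list[str]:
    """Extract titles from HTML, e.g. the string returned by
    driver.page_source after Selenium has rendered the page."""
    soup = BeautifulSoup(html, "html.parser")
    return [tag.get_text(strip=True) for tag in soup.select("h2.title")]


# Static samples standing in for a live API response / rendered page:
api_sample = '{"items": [{"name": "Widget", "price": 9.99}]}'
html_sample = '<div><h2 class="title">Widget A</h2><h2 class="title">Widget B</h2></div>'

print(parse_api_payload(api_sample))    # [{'name': 'Widget', 'price': 9.99}]
print(parse_rendered_html(html_sample))  # ['Widget A', 'Widget B']
```

In a real run the sample strings would come from `requests.get(...).text` or `driver.page_source`; the selectors and keys must be adjusted to the target site.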

✅ Prerequisites

Before running the project, ensure the following requirements are met:

  • A Google Account (for accessing Google Colab)
  • Required Python libraries:
    • pandas
    • beautifulsoup4
    • selenium
    • (Optional: requests, lxml, etc.)
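
A quick way to verify these dependencies in a fresh Colab session (a sketch; any package reported missing can be installed with `!pip install <name>`):

```python
import importlib.util


def missing(packages):
    """Return the subset of package names that cannot be imported."""
    return [p for p in packages if importlib.util.find_spec(p) is None]


# Note: beautifulsoup4 is imported under the name "bs4".
for pkg in missing(["pandas", "bs4", "selenium", "requests", "lxml"]):
    print(f"{pkg} is missing -- run: !pip install {pkg}")
```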

🚀 Running in Google Colab

  1. Open one of the notebooks from this repository in Google Colab.

  2. Make sure the appropriate web driver for your browser is installed (e.g., ChromeDriver for Chrome, msedgedriver for Edge).

  3. Run each cell in the notebook sequentially to initiate the scraping process.

  4. Data will be collected using both APIs and Web Driver as needed.

  5. Scraped data can be saved to:

    • The Colab session (e.g., as .csv or .json)
    • Your linked Google Drive (if mounted)
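
Saving scraped records from a Colab session might look like this (the records and file names are illustrative; the Drive path only works after `drive.mount` has been run):

```python
import pandas as pd

# Illustrative records, standing in for real scraped data:
records = [
    {"name": "Widget A", "price": 9.99},
    {"name": "Widget B", "price": 14.50},
]
df = pd.DataFrame(records)

# Save into the (ephemeral) Colab session storage:
df.to_csv("scraped.csv", index=False)
df.to_json("scraped.json", orient="records")

# With Google Drive mounted via:
#   from google.colab import drive; drive.mount("/content/drive")
# the same calls can target a persistent path, e.g.:
#   df.to_csv("/content/drive/MyDrive/scraped.csv", index=False)
```

Anything written outside the mounted Drive is lost when the Colab runtime is recycled, so persist important results to Drive or download them.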

⚙️ Configuration

You can customize the notebooks to suit your specific scraping needs:

  • Update API endpoints or request parameters.
  • Modify Web Driver settings (e.g., headless mode, wait time).
  • Add authentication headers or tokens (if required).
  • Adjust parsing logic based on the HTML structure of the target site.
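
For the API side, request configuration is often centralized on a `requests.Session`; the header values, token placeholder, and parameters below are illustrative, not values used by this project:

```python
import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (compatible; my-scraper/0.1)",  # placeholder UA string
    "Authorization": "Bearer <YOUR_TOKEN>",  # only if the target API requires it
})

# Reusable request parameters; adjust per endpoint:
params = {"page": 1, "per_page": 50}

# A real call would then look like (not executed here):
#   resp = session.get("https://api.example.com/items", params=params, timeout=10)
#   resp.raise_for_status()
#   data = resp.json()

# For the Web Driver side, headless mode is typically enabled via:
#   from selenium import webdriver
#   from selenium.webdriver.chrome.options import Options
#   opts = Options()
#   opts.add_argument("--headless=new")
#   driver = webdriver.Chrome(options=opts)
```

Keeping headers and parameters on the session means every request in the notebook picks them up automatically, so auth changes happen in one place.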

⚠️ Additional Notes

  • Limitations in Google Colab:

    • GUI-based browser interactions are limited (consider using headless mode).
    • Some websites may block scraping via user-agent or IP.
  • Authentication Handling:

    • If the target API or website requires login/authentication, include proper headers or login steps in your notebook.
    • For OAuth2 or cookies-based auth, you may need to simulate sessions or store tokens securely.
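
One way to persist a cookie-based session between runs is to serialize the cookie jar (a sketch; the file name and cookie are hypothetical, and real tokens should be stored securely, e.g. in mounted Drive or Colab secrets, not committed with the notebook):

```python
import json

import requests

COOKIE_FILE = "cookies.json"  # hypothetical storage location


def save_cookies(session: requests.Session, path: str = COOKIE_FILE) -> None:
    """Dump the session's cookies to a JSON file."""
    with open(path, "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session: requests.Session, path: str = COOKIE_FILE) -> None:
    """Restore previously saved cookies into a fresh session."""
    with open(path) as f:
        session.cookies.update(json.load(f))


# Example: carry a (dummy) session cookie across two Session objects.
s1 = requests.Session()
s1.cookies.set("sessionid", "dummy-value")  # in practice set by a login request
save_cookies(s1)

s2 = requests.Session()
load_cookies(s2)
print(s2.cookies.get("sessionid"))  # dummy-value
```

In practice the cookie would be set by a real login request (e.g. `session.post(login_url, data=credentials)`) rather than manually.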

📩 Questions or Contributions?

Feel free to open an issue or submit a pull request if you encounter any issues or have improvements you'd like to contribute.

Happy scraping! 🕷️📊
