Skip to content

Latest commit

 

History

History
49 lines (29 loc) · 2.24 KB

README.md

File metadata and controls

49 lines (29 loc) · 2.24 KB

Proxy Farm

proxyfarm is a node script that scrapes proxy lists from websites without caring for its underlying HTML structure. This allows proxy lists to be easily harvested from a large amount of sources, without implementing custom scraping logic for each source. It does this via using a PhantomJS driver along with the Javascript Selection API. This strips away all HTML tags and makes regex matching trivial. Proxy lists can be used with things like scrapy-proxies in order to bypass IP restrictions and improve web crawling speed.

demo

Getting Started

Simply clone the repository, run npm install, and node --harmony proxyfarm --in sources.txt --out proxies.txt

NPM module coming soon!

Arguments

Parameter Description
in A text file with line delimited urls to scrape proxies from. See defaults/sources.txt for an example.
out The path to save the scraped proxy list to, in the format <host>:<port>

Prerequisites

  • Node.js v6.x and later

Running the tests

Coming soon!

Contributing

There are many ways that you can contribute:

  • Improving documentation - Submit a pull request with the fixes.
  • Requesting a feature - Simply create a new issue with the said feature.
  • Suggesting a proxy list source - Create a new issue mentioning the new source.
  • Report a bug - Find a problem? Create an issue with your environment, screenshot of the error, and reproduction steps.
  • Fix a bug - All help appreciated!

Future Roadmap

  • Validating the scraped proxy list
  • Detecting anonymity, speed, and country of the proxy list
  • Automatic crawling of websites rather than manually specifying all proxy lists
  • Handling of ajax pages

License

This project is licensed under the MIT License - see the LICENSE.md file for details