- The crawlers are present in the `spiders` folder.
- The code to establish the pipeline between the Scrapy data and MongoDB is in the `settings.py` file (a minimal configuration sketch follows this list).
- All the Python libraries required for these tasks are listed in the `requirements.txt` file.
- The Mongo queries are in the `task2_queries.ipynb` file; the outputs of those cells can be seen there as well.
- The password for the Mongo database is present in the `.env` file.
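To illustrate how these pieces fit together, here is a minimal sketch of how `settings.py` might enable a MongoDB pipeline and read the password from `.env`. The project module name (`myscraper`), pipeline class name (`MongoDBPipeline`), environment variable name, and connection string are all assumptions for illustration, not taken from the repository:

```python
# settings.py -- a minimal sketch; module, class, and variable names are assumed.
import os

from dotenv import load_dotenv  # pip install python-dotenv

load_dotenv()  # reads MONGO_PASSWORD from the .env file into the environment

# Enable the item pipeline defined in pipelines.py.
# "myscraper" is a placeholder for the actual Scrapy project module.
ITEM_PIPELINES = {
    "myscraper.pipelines.MongoDBPipeline": 300,
}

# Connection details consumed by the pipeline (placeholder host and user).
MONGO_URI = f"mongodb+srv://user:{os.environ['MONGO_PASSWORD']}@cluster0.example.mongodb.net"
MONGO_DATABASE = "scraped_products"
```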
- Get the CSS selector that selects all the product elements on the product page.
- Get the CSS selectors for the respective fields (name, price, image URL, product URL, etc.) and yield them while iterating through all the products matched by the parent CSS selector (see the spider sketch after this list).
- Once everything from the page has been scraped, it is uploaded to a MongoDB collection using the `pymongo` Python library and the item pipelines provided by the `scrapy` library (code present in the `pipelines.py` file; a pipeline sketch follows the list).
file). - Next we need to cater to pagination and for this, we need the CSS selector of "next page". This selector returns
None
if that particular element is not present in the HTML. - Got the page number from next page's url and used this to limit scraping process to 25 pages.
- The same process was used to scrape both the topwear and the footwear pages.
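The first two steps above can be sketched as a Scrapy spider. All CSS selectors, the spider name, and the start URL below are hypothetical placeholders; the real spider uses selectors matching the target site's markup:

```python
import scrapy


class ProductsSpider(scrapy.Spider):
    """A minimal sketch of the per-page scraping logic."""

    name = "products"
    start_urls = ["https://example.com/topwear?page=1"]  # placeholder URL

    def parse(self, response):
        # Parent selector: one node per product card on the listing page.
        for product in response.css("div.product-item"):
            yield {
                # Field selectors relative to the parent node.
                "name": product.css(".product-name::text").get(),
                "price": product.css(".product-price::text").get(),
                "image_url": product.css("img::attr(src)").get(),
                "product_url": response.urljoin(product.css("a::attr(href)").get()),
            }
```

The upload step is handled by a Scrapy item pipeline. Below is a minimal sketch of what `pipelines.py` might contain, following the standard Scrapy pipeline pattern and assuming the `MONGO_URI` and `MONGO_DATABASE` settings from the earlier sketch; the collection naming is likewise an assumption:

```python
# pipelines.py -- a minimal sketch; names mirror the settings.py sketch above.
import pymongo


class MongoDBPipeline:
    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Pull connection details from settings.py.
        return cls(
            mongo_uri=crawler.settings.get("MONGO_URI"),
            mongo_db=crawler.settings.get("MONGO_DATABASE"),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Insert each scraped item into a collection named after the spider.
        self.db[spider.name].insert_one(dict(item))
        return item
```

Finally, the pagination logic can be sketched as shown below. The "next page" selector and the page-number parsing are hypothetical; the two key points from the description are that `.get()` returns `None` when the element is absent, and that the page number parsed from the URL caps the crawl at 25 pages:

```python
import scrapy


class ProductsSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/topwear?page=1"]  # placeholder URL
    MAX_PAGES = 25  # stop after 25 pages, as described above

    def parse(self, response):
        # ... yield product items as in the earlier spider sketch ...

        # Hypothetical "next page" selector; .get() returns None when
        # the element is absent from the HTML (i.e., on the last page).
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            # Assume the next-page URL ends with a "?page=N" parameter.
            page_number = int(next_page.split("page=")[-1])
            if page_number <= self.MAX_PAGES:
                yield response.follow(next_page, callback=self.parse)
```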