kushuu/cliff-ai-queries


The folder structure is quite self-explanatory:

  • The crawlers are in the spiders folder.
  • The code that wires the Scrapy data into MongoDB is configured in the settings.py file (a sketch of this pipeline setup appears after this list).
  • All the Python libraries required for the tasks are listed in the requirements.txt file.
  • The MongoDB queries are in the task2_queries.ipynb file; the outputs of those cells can be viewed in the notebook itself.

  • The password for the MongoDB database is in the .env file.
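
For reference, below is a minimal sketch of how a MongoDB item pipeline is typically wired together in Scrapy. The class name, database/collection names, connection string, and the MONGO_PASSWORD environment variable are assumptions for illustration; the actual code lives in pipelines.py and settings.py.

```python
# pipelines.py -- illustrative sketch only; names below are assumptions.
import os

import pymongo


class MongoPipeline:
    """Insert every scraped item into a MongoDB collection."""

    def open_spider(self, spider):
        # Connection string built from the password kept in the .env file.
        password = os.getenv("MONGO_PASSWORD")
        self.client = pymongo.MongoClient(
            f"mongodb+srv://user:{password}@cluster.example.mongodb.net"
        )
        self.collection = self.client["scrapy_db"]["products"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # Each yielded item from the spider is stored as one document.
        self.collection.insert_one(dict(item))
        return item


# In settings.py, the pipeline would be enabled with something like:
# ITEM_PIPELINES = {"<project>.pipelines.MongoPipeline": 300}
```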

My approach to scraping the data and storing it in a MongoDB collection is fairly simple:

  1. Get the CSS selector that matches all the product elements on the product page.
  2. Get the CSS selectors for the individual fields (name, price, image URL, product URL, etc.) and yield them while iterating through the products matched by the parent selector.
  3. Once everything on the page has been scraped, upload it to a MongoDB collection using the pymongo library and Scrapy's item pipelines (code in the pipelines.py file).
  4. Handle pagination using the CSS selector for the "next page" link; this selector returns None when that element is not present in the HTML.
  5. Extract the page number from the next page's URL and use it to cap the crawl at 25 pages.
  6. The same process is used to scrape both the topwear and the footwear pages (a spider sketch illustrating these steps follows this list).
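
The sketch below illustrates the overall spider structure described above. All URLs, CSS selectors, and field names are placeholders, not the actual ones used in the spiders folder.

```python
import scrapy


class TopwearSpider(scrapy.Spider):
    name = "topwear"
    start_urls = ["https://example.com/topwear?page=1"]  # placeholder URL
    max_pages = 25  # stop after 25 pages

    def parse(self, response):
        # Parent selector that matches every product card on the page.
        for product in response.css("div.product-card"):
            yield {
                "name": product.css("h3.name::text").get(),
                "price": product.css("span.price::text").get(),
                "image_url": product.css("img::attr(src)").get(),
                "product_url": response.urljoin(
                    product.css("a::attr(href)").get()
                ),
            }

        # Pagination: .get() returns None when there is no "next page" link.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            # Page number is read from the next page's URL to cap the crawl.
            page_number = int(next_page.split("page=")[-1])
            if page_number <= self.max_pages:
                yield response.follow(next_page, callback=self.parse)
```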

The logic behind the queries in the second task is explained in the Jupyter notebook in the form of comments. MongoDB's aggregate function is used for most of the tasks, since it is much faster and did the job as expected.
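
As an illustration only, here is the kind of aggregation pipeline used for such tasks; the collection name, field names, and grouping below are assumptions, and the real queries (with their outputs) are in task2_queries.ipynb.

```python
import os

import pymongo

client = pymongo.MongoClient(
    f"mongodb+srv://user:{os.getenv('MONGO_PASSWORD')}@cluster.example.mongodb.net"
)
products = client["scrapy_db"]["products"]

# Example: average price per category, computed server-side in one pass.
pipeline = [
    {"$group": {"_id": "$category", "avg_price": {"$avg": "$price"}}},
    {"$sort": {"avg_price": -1}},
]
for row in products.aggregate(pipeline):
    print(row)
```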
