There is no easy(!) way to do a head-to-head price comparison of specific food items between different shops on a specific day or period. Why is that?
Big local stores and supermarkets publish their promotional brochures/leaflets in paper and digital form:
- paper brochures/leaflets, distributed to citizens' physical mailboxes and also available inside stores near the entrance
- proprietary mobile apps and websites of a specific store/supermarket
- the 'aggregator' mobile app and website "Moja Gazetka", which collects leaflets from many stores/supermarkets
The latter option is the best so far, because many leaflets with prices are gathered in one place.
But the main inconvenience of those leaflets (either paper or digital) is that they are made as a carousel of HTML pages - like a paper journal or flip book. The pages contain photos and price tags with previous/new prices, the unit/quantity the price refers to, etc. Examples: link1, link2, link3
As a customer, I want a convenient "single source of truth": one web/mobile app that helps me buy goods cheaply whenever I go shopping.
The questions that must be answered and solved are:
- how to easily grab these prices and item names, so we can have a list of all items promoted on specific dates (usually 2-3 days or 1 week)?
- how to parse/collect data from many webpages?
- where can we find APIs for those stores/supermarkets? It is much easier to work with structured JSON/XML responses
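As a first sketch of the scraping question, the standard library alone can already pull item names and prices out of leaflet HTML. The markup and CSS class names below are invented examples; every shop's real leaflet markup will differ, which is exactly why scraping needs per-shop configuration:

```python
# Sketch: extracting item names and prices from leaflet HTML with only the
# standard library. SAMPLE and its class names are made-up placeholders.
from html.parser import HTMLParser

SAMPLE = '<div class="item">Milk 1L</div><div class="price">4.99</div>'

class LeafletParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self._css_class = None
        self.items = []
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # remember the class of the element we are currently inside
        self._css_class = dict(attrs).get("class")

    def handle_endtag(self, tag):
        self._css_class = None

    def handle_data(self, data):
        if self._css_class == "item":
            self.items.append(data)
        elif self._css_class == "price":
            self.prices.append(float(data))

parser = LeafletParser()
parser.feed(SAMPLE)
print(list(zip(parser.items, parser.prices)))  # → [('Milk 1L', 4.99)]
```

In practice a dedicated parser (e.g. BeautifulSoup or lxml) would replace this hand-rolled class, but the shape of the problem - map shop-specific selectors to (item, price) pairs - stays the same.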
Web application with the following high-level components and functionality:
- Backend
- web framework
- web scraping/parsing
- agnostic scraping, i.e. data can be fetched from any shop with only configuration adjustments
- asynchronous scraping and data processing for faster I/O operations
- data analysis
- ETL pipeline to aggregate data into database
- filtering, grouping, and aggregating the data to extract insights
- data visualization: analytic dashboard with sorting/filtering, visualization, data export in different formats
- user accounts: authorization (sign up/sign in), profile, UI settings for personalization etc.
- admin panel
- databases (SQL and noSQL)
- store and retrieve data
- design schemas
- write queries
- CRUD operations
- optimize performance
- background processing
- display up-to-date information without any delays
- testing
- unit tests for your code to ensure that it is working correctly
- load testing to ensure that the dashboard can handle high levels of traffic
- Frontend
- design / mockups
- lightweight UI framework (HTML/CSS/JS) or big JavaScript framework
- visualization dashboard
- users can enter the URL of a website they want to scrape, specify what data they want to extract, and view the results of the scraping and analysis
- Backend / frontend communication
- RESTful API
- OpenAPI docs
- DevOps
- code repository
- containerization
- orchestration
- continuous integration and delivery (CI/CD)
- cloud services
More details about the architecture and tech stack can be found in the documentation.
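The "agnostic scraping" and "asynchronous scraping" points above can be sketched together: each shop is just one config entry, and all fetches run concurrently. All shop names, URLs, and selectors below are made-up placeholders, and the network call is simulated:

```python
# Sketch of shop-agnostic async scraping: adding a shop means adding a config
# entry, not new code. Names/URLs/selectors are hypothetical placeholders.
import asyncio

SHOP_CONFIGS = {
    "shop_a": {"url": "https://example.com/shop-a/leaflet", "price_selector": ".price"},
    "shop_b": {"url": "https://example.com/shop-b/leaflet", "price_selector": ".promo-price"},
}

async def fetch_leaflet(shop: str, config: dict) -> dict:
    # A real implementation would fetch config["url"] (e.g. with aiohttp) and
    # parse prices via config["price_selector"]; here we only simulate the I/O.
    await asyncio.sleep(0)
    return {"shop": shop, "url": config["url"]}

async def scrape_all() -> list:
    tasks = [fetch_leaflet(shop, cfg) for shop, cfg in SHOP_CONFIGS.items()]
    return await asyncio.gather(*tasks)  # all fetches run concurrently

results = asyncio.run(scrape_all())
print([r["shop"] for r in results])  # → ['shop_a', 'shop_b']
```

Because the per-shop differences live entirely in `SHOP_CONFIGS`, the same event loop and ETL code can serve every store.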
git clone https://github.com/ivanprytula/price-navigator.git
# or
gh repo clone ivanprytula/price-navigator
cd price-navigator
- Start [status|stop] the PostgreSQL server:
sudo systemctl start postgresql
# or
sudo service postgresql start
- Create a new PostgreSQL database with ...
- PostgreSQL client psql (steps below)
- Shell CLI createdb
- pgAdmin
- Your preferred way
- Create/activate a virtualenv
python3.10 -m venv <virtual env path>
source <virtual env path>/bin/activate
pip install -r requirements.local.txt
- Install pre-commit hook:
pre-commit install
- Set the environment variables:
- Create/copy a .env file in the root of your project with all needed variables: mv env.example.local .env or cp env.example.local .env
- then export DJANGO_READ_DOT_ENV_FILE=True
- or use a local environment manager like direnv (NB: you also need an .envrc file)
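For reference, the DJANGO_READ_DOT_ENV_FILE flag is typically checked in the Django settings module before any .env loading happens. A minimal sketch (cookiecutter-django does this with the django-environ package; the exact code in this repo's settings may differ):

```python
import os

# Sketch: settings gate .env loading on the DJANGO_READ_DOT_ENV_FILE flag.
# Only when it is literally "True" does the settings module read the .env file.
READ_DOT_ENV_FILE = os.environ.get("DJANGO_READ_DOT_ENV_FILE", "False") == "True"
```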
- 'Dry run' without applying migrations - just spin up the classic ./manage.py runserver or ./manage.py runserver_plus (with watchdog and the Werkzeug debugger)
- Or skip the previous step and do a 'full run': ./manage.py migrate then ./manage.py runserver 0.0.0.0:8000
- Visit http://127.0.0.1:8000/
- Setting up your users:
- normal user account: just go to Sign Up and fill out the form. Once you submit it, you'll see a "Verify Your E-mail Address" page. Go to your console to see a simulated email verification message. Copy the link into your browser. Now the user's email should be verified and ready to go.
- superuser account:
python manage.py createsuperuser
- Sanity checks of code quality: run tests, type checks, the linter, import sorting, and the formatter
pytest -p no:warnings -v
mypy price_navigator/
flake8
isort .
black --config pyproject.toml .
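For context on the pytest step, a minimal (hypothetical) test file that pytest would collect might look like this; the function and file name are illustrative, not part of the repo:

```python
# tests/test_sanity.py - a minimal, hypothetical example of a unit test
def add_prices(a: float, b: float) -> float:
    """Sum two prices, rounding to whole cents."""
    return round(a + b, 2)

def test_add_prices():
    assert add_prices(1.10, 2.25) == 3.35
```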
- Run the following command from the project directory to build and explore HTML documentation:
make -C docs livehtml
# verbose option (2.1)
sudo -u postgres -i psql
CREATE DATABASE price_navigator;
CREATE USER price_dwh_user WITH PASSWORD 'my_password';
# it is also recommended to set the following
# https://docs.djangoproject.com/en/4.2/ref/databases/#optimizing-postgresql-s-configuration
ALTER ROLE price_dwh_user SET client_encoding TO 'utf8';
ALTER ROLE price_dwh_user SET default_transaction_isolation TO 'read committed';
ALTER ROLE price_dwh_user SET timezone TO 'UTC';
GRANT ALL PRIVILEGES ON DATABASE "price_navigator" to price_dwh_user;
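The SQL above maps onto a Django DATABASES setting along these lines. This is a sketch: the password is the placeholder from the SQL, and host/port assume a default local install - real settings should read these values from the environment:

```python
# Sketch of the Django DATABASES setting matching the SQL above.
DATABASES = {
    "default": {
        "ENGINE": "django.db.backends.postgresql",
        "NAME": "price_navigator",
        "USER": "price_dwh_user",
        "PASSWORD": "my_password",  # placeholder - never commit real passwords
        "HOST": "127.0.0.1",        # assumption: local default install
        "PORT": "5432",
    }
}
```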
# just in case: if you are in a hurry, here is a simplified one-liner
sudo -u postgres psql -c 'create database price_navigator;'
postgres=# \l # list all databases
# OPTION 1. Use dedicated files in .envs/.local/
# [!] This option is currently used (17-06-2023)
# docker-compose.yml + docker-compose.dev.yml
# NB: read if you have Docker Engine 23.0+ version
docker info
# https://docs.docker.com/engine/reference/commandline/build/#use-a-dockerignore-file
# OPTION 2. Load environment variables into shell
# This method is similar to CI, since in the CI pipeline we have env vars (plain text) and secrets (configured for the repository)
# For this use: docker-compose.yml + docker-compose.dev-with-environment-attribute.yml
# check current $SHELL environment vars
env
# double-check that load_env_vars.sh file is executable, 'x' must be in OWNER permissions
ls -l --human-readable
-rwxr-xr-x 1 OWNER GROUP OTHERS size Jun 13 03:50 load_env_vars.sh
# if not:
chmod u=rwx,go=r load_env_vars.sh
# if in rush:
chmod +x load_env_vars.sh
. ./load_env_vars.sh
# or
source ./load_env_vars.sh
# confirm that project env vars in $SHELL
env
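Besides eyeballing `env` output, a quick Python cross-check can confirm the expected variables landed in the shell environment. The variable names here are hypothetical; substitute the ones your .env actually defines:

```python
import os

# Hypothetical variable names - replace with the ones from your .env file.
required = ["POSTGRES_DB", "POSTGRES_USER", "DJANGO_SECRET_KEY"]
missing = [name for name in required if name not in os.environ]
print("missing:", missing)  # empty list means all expected vars are set
```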
# 2. Build and spin up containers
make build
make up
# open a shell inside the django & db containers
make sh-django
make sh-db
# stop and remove containers
make down
# Connect to db in VSCode with SQLTools extension
# 1.
docker inspect price_navigator_local_postgres # container name
# 2. find in output line:
"IPAddress": "192.168.80.3", # the address may differ
# 3. use this IPAddress and other data from .postgres/.env file to config connection
# 4. example
{
"previewLimit": 50,
"server": "192.168.80.3",
"port": 5432,
"driver": "PostgreSQL",
"name": "price_navigator_docker",
"database": "postgres",
"username": "postgres"
}
- Check the Makefile for shortened versions of verbose commands
- Explore and use Django management commands: ./manage.py --help and django-extensions commands