An intelligent web scraping application that combines Selenium automation with AI-powered content analysis using Ollama. This project provides a user-friendly interface built with Streamlit for scraping web content and analyzing it using AI models.
- 🌐 Web content scraping using Selenium
- 🤖 AI-powered content analysis with Ollama
- 📊 User-friendly Streamlit interface
- 🐳 Fully containerized with Docker Compose
- 🔄 Selenium Grid integration for reliable browser automation
- 📝 Detailed content parsing and analysis
- 🛡️ Robust error handling and logging
- Frontend: Streamlit
- Scraping: Selenium WebDriver, BeautifulSoup4
- AI Processing: Ollama
- Containerization: Docker, Docker Compose
- Browser Automation: Selenium Grid with Chrome
- Language: Python 3.11
- Docker
- Docker Compose
- Git (for cloning the repository)
- Stable internet connection
- **Clone the Repository**

  ```bash
  git clone <repository-url>
  cd AI-Web-Scraper-main
  ```
- **Environment Configuration (Optional)**

  Create a `.env` file if you want to customize default settings:

  ```bash
  SELENIUM_HOST=selenium-chrome
  SELENIUM_PORT=4444
  ```
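Inside the application, these settings would typically be read with defaults that match the `.env` example above. A minimal sketch (the exact variable handling in the project's code may differ):

```python
import os

# Defaults mirror the values shown in the .env example above.
SELENIUM_HOST = os.getenv("SELENIUM_HOST", "selenium-chrome")
SELENIUM_PORT = int(os.getenv("SELENIUM_PORT", "4444"))

# Base URL the scraper would use to reach the Selenium Grid.
SELENIUM_URL = f"http://{SELENIUM_HOST}:{SELENIUM_PORT}"
```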
- **Build and Start the Application**

  ```bash
  docker-compose up --build
  ```
- Main Application: http://localhost:8501
- Selenium Grid: http://localhost:4444
- VNC Viewer (for debugging): http://localhost:7900 (password: `secret`)
- Ollama API: http://localhost:11434
The application is split into three main containers:
- **Web Scraper Container**
  - Runs the main Streamlit application
  - Handles web scraping logic
  - Processes user requests
- **Selenium Chrome Container**
  - Provides browser automation capabilities
  - Runs in headless mode
  - Managed through Selenium Grid
- **Ollama Container**
  - Runs the AI model service
  - Handles content analysis
  - Provides AI-powered insights
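The three-container layout above would correspond to a `docker-compose.yml` roughly like the following sketch. Service names, images, and port mappings here are assumptions based on the endpoints listed earlier, not the project's actual file:

```yaml
services:
  web-scraper:
    build: .
    ports:
      - "8501:8501"        # Streamlit UI
    environment:
      - SELENIUM_HOST=selenium-chrome
      - SELENIUM_PORT=4444
    depends_on:
      - selenium-chrome
      - ollama

  selenium-chrome:
    image: selenium/standalone-chrome
    ports:
      - "4444:4444"        # Selenium Grid
      - "7900:7900"        # VNC viewer

  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"      # Ollama API
```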
- Access the Streamlit interface at http://localhost:8501
- Enter the URL you want to scrape
- Select the type of analysis you want to perform
- View the scraped content and AI analysis results
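Under the hood, the scrape-then-parse flow boils down to extracting readable text from fetched HTML. The project uses BeautifulSoup4 for this; the following stand-alone sketch using only the standard library is purely illustrative:

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> blocks."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text outside skipped blocks.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())

def extract_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```

For example, `extract_text("<p>Hello <b>world</b></p>")` returns `"Hello world"`.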
Common issues and solutions:
- **Connection Issues**
  - Ensure all ports (8501, 4444, 11434) are available
  - Check if all containers are running: `docker-compose ps`
- **Scraping Failures**
  - Verify the target URL is accessible
  - Check the Selenium Grid status at http://localhost:4444
  - Review logs: `docker-compose logs web-scraper`
- **AI Analysis Issues**
  - Ensure the Ollama container is running
  - Check Ollama logs: `docker-compose logs ollama`
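The port checks above can be scripted. A small sketch using only the standard library, with host and port values taken from the endpoints listed earlier:

```python
import socket

def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something is listening on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def check_services(host: str = "localhost") -> dict:
    """Probe the three service ports and report which respond."""
    services = {"web-scraper": 8501, "selenium-grid": 4444, "ollama": 11434}
    return {name: is_port_open(host, port) for name, port in services.items()}
```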
- Runs as a non-root user inside containers
- Disables unnecessary Chrome extensions
- Handles certificate errors
- Isolates services through containerization
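The non-root container user mentioned above is typically set up in the Dockerfile. A minimal sketch, where the user name, file paths, and entrypoint (`app.py`) are assumptions rather than the project's actual file:

```dockerfile
FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .

# Create and switch to an unprivileged user so the app
# does not run as root inside the container.
RUN useradd --create-home appuser
USER appuser

CMD ["streamlit", "run", "app.py", "--server.port", "8501"]
```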
- **Local Development**

  ```bash
  # Start services in development mode
  docker-compose up --build

  # View logs
  docker-compose logs -f

  # Restart specific service
  docker-compose restart web-scraper
  ```
- **Stopping the Application**

  ```bash
  docker-compose down
  ```
Key Python packages:
- selenium
- streamlit
- beautifulsoup4
- requests
- ollama (the official Python client, installed via `pip install ollama`)
- Requires stable internet connection
- Limited to headless Chrome browsing
- Some websites may block automated access
- AI analysis dependent on Ollama model availability
- Add more AI model options
- Implement retry mechanisms
- Enhanced error handling
- More detailed logging
- Additional scraping strategies
- User authentication
- Results caching
- Export functionality
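The retry mechanism on the roadmap could take the shape of a small decorator with exponential backoff. A sketch using only the standard library (not part of the current codebase):

```python
import functools
import time

def retry(attempts: int = 3, base_delay: float = 0.5, exceptions=(Exception,)):
    """Retry a function with exponential backoff between attempts."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(attempts):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == attempts - 1:
                        raise  # Out of attempts; propagate the error.
                    time.sleep(base_delay * 2 ** attempt)
        return wrapper
    return decorator

# Example: a function that fails twice before succeeding.
calls = {"n": 0}

@retry(attempts=3, base_delay=0.01)
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"
```

Here `flaky()` returns `"ok"` after two retried failures; in the scraper, the same decorator could wrap page-fetch calls.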
Contributions are welcome! Please feel free to submit pull requests.
This project is licensed under the MIT License - see the LICENSE file for details.