Run in a Docker container #98
Comments
It is a good point, @dmarti! We discussed the Dockerization and also think it is a good idea to explore. @sophieeng will take the lead on our end, with help from @katehausladen, @Mattm27, and @franciscawijaya. |
Thank you -- right now I think in order to get the crawl working I need to modify my Dockerfile to get the right versions of Firefox Nightly and geckodriver installed. What versions are you running and what source are you using for geckodriver? (I haven't used Selenium in a while and it seems like things have moved around, I just want to go to the right place) |
Currently we're just using whatever Firefox version is on the computer locally (.setBinary('/Applications/Firefox\ Nightly.app/Contents/MacOS/firefox')), and Selenium uses the geckodriver from the local Firefox Nightly. So, this means we're always using the most recent version of both. In terms of Docker, I think as long as you use something relatively recent, it should be fine. |
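Since the binary path above is macOS-specific, a container setup needs a Linux equivalent. A minimal sketch of choosing the path per platform follows. The macOS path is the one from the comment above; the Linux path and the `FIREFOX_BINARY` override variable are assumptions for illustration, not part of the crawler.

```javascript
// Hypothetical helper: pick the Firefox Nightly binary for a given platform.
const NIGHTLY_PATHS = {
  // Path taken from the comment above.
  darwin: '/Applications/Firefox Nightly.app/Contents/MacOS/firefox',
  // Assumed location; depends on where Nightly is unpacked in the image.
  linux: '/opt/firefox-nightly/firefox',
};

function nightlyBinaryFor(platform, env = {}) {
  // An explicit override (hypothetical FIREFOX_BINARY variable) wins.
  if (env.FIREFOX_BINARY) return env.FIREFOX_BINARY;
  const path = NIGHTLY_PATHS[platform];
  if (!path) throw new Error(`No known Firefox Nightly path for ${platform}`);
  return path;
}

// The result would be handed to Selenium in place of the hard-coded path,
// e.g. options.setBinary(nightlyBinaryFor(process.platform, process.env)).
console.log(nightlyBinaryFor('linux'));
```

Keeping the path in one place means the same crawler script could run on a local Mac and inside a Linux container without edits.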
Hey @dmarti! Just wanted to check in on your progress with running the crawler in Docker. Do you need any support? Let us know if we can help with anything. |
Hi @sophieeng, I got a little stuck figuring out the right source code and/or Linux packages for Selenium. It didn't look like geckodriver was packaged with the Firefox Nightly for Linux download that I was using. |
If I can get sources for known-good Firefox and Selenium downloads that work together, that would help (I don't have a Mac to test on). |
@franciscawijaya and @Mattm27 will work on this over the next couple of weeks. |
Vẫn la tới
On Fri, Jun 14, 2024 at 7:32 AM, Sebastian Zimmeck <***@***.***> wrote:
… @dmarti is using Linux and not macOS. So, @dmarti's question is which Selenium and Firefox versions work on Linux.
|
Hey @dmarti! We have resumed our efforts on the dockerization of our web crawler, and I’ve been reviewing the progress you made in the spring as a foundation for our work. However, I noticed that the myextension.xpi file has been removed from the codebase, and I'm a bit unclear about the reasoning behind this change. Could you please provide some clarification on why it was deleted? Thanks! |
I've been dealing with two main problems: the container closing immediately after starting (exit code 255) and conflicts between MySQL and MariaDB installations. The container issue seems related to how systemd is set up, while the MySQL vs. MariaDB problem is likely due to package conflicts. In terms of next steps, the plan is to create a new Dockerfile from scratch, focusing on getting the container to stay running first, and then adding Apache, Geckodriver, either MySQL or MariaDB, and so on, one step at a time to avoid further conflicts. Once these issues are resolved, the Dockerization should be much closer to completion, and the existing |
The Docker image itself builds successfully, meaning all the required dependencies and configurations are included properly. The issue arises when creating a container from that image: systemd fails to initialize, causing the container to exit immediately. This distinction is important because it suggests the problem is not with the build process, but with how the container runs or manages processes once started. |
As discussed in our meeting, @Mattm27 will start a fresh Docker implementation. |
I was successfully able to build the Docker image, and the container now runs continuously without stopping unexpectedly. This was achieved by using Successfully installed and verified the following components within the container:
These installations are all functioning correctly, and the container is stable during testing. I'm still having trouble installing MySQL. The container is currently unable to locate the mysql-server package, which I suspect is a repository issue. However, as discussed in the meeting earlier this week, I was able to install MariaDB as an alternative. Since MariaDB is a drop-in replacement for MySQL, we can explore using it if we cannot resolve the MySQL installation directly. |
Good progress, @Mattm27! |
I've made updates to the Dockerization process as outlined in the code above. The container is now being built correctly using the updated image, and I am in the process of testing individual crawler components within the container. The |
I managed to work around the issue where the After resolving the extension issue, I’ve run into a new problem with Firefox when attempting to run the crawl. The Firefox browser does not seem to be functioning properly in the container, preventing the crawl from executing as expected. I'm currently troubleshooting the setup for Firefox Nightly, which is required for the extension, to ensure it's correctly installed and configured for headless mode but suspect I may need some extra support on this part of the process. |
So the browser is starting in headless mode inside the container? Does debugging protocol work? Can you expose the debugging protocol from Firefox inside the container? https://firefox-source-docs.mozilla.org/devtools/backend/protocol.html |
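For reference, a minimal sketch of what exposing that protocol could involve. It assumes Firefox's `--start-debugger-server` flag and the `devtools.debugger.*` preferences listed below. The helper name and the default port 6000 are hypothetical, and the port would still need to be published from the container (for example with `docker run -p`).

```javascript
// Hypothetical helper: launch settings for Firefox's remote debugger server.
function debuggerServerSetup(port = 6000) {
  if (!Number.isInteger(port) || port < 1 || port > 65535) {
    throw new RangeError(`Invalid port: ${port}`);
  }
  return {
    // Command-line arguments for Firefox.
    args: ['--start-debugger-server', String(port)],
    // Preferences that allow remote connections without a prompt.
    prefs: {
      'devtools.debugger.remote-enabled': true,
      'devtools.chrome.enabled': true,
      'devtools.debugger.prompt-connection': false,
      // Accept connections from outside localhost (needed across the
      // container boundary); only sensible on a trusted network.
      'devtools.debugger.force-local': false,
    },
  };
}

// With selenium-webdriver these could be applied through
// options.addArguments(...args) and options.setPreference(name, value).
console.log(debuggerServerSetup().args.join(' '));
```

Whether the crawler's Selenium setup tolerates a second debugging client is untested here, so treat this as a starting point for experiments.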
Hey @dmarti! Yes, currently the browser is starting in headless mode inside the container, but we typically run Firefox in headful mode for the crawl. I haven’t yet explored exposing the debugging protocol from Firefox within the container, but it's something I can definitely look into. I apologize for the oversight in the past comment. Our main priority right now is getting Firefox Nightly installed and running correctly, since the extension required for the crawl only functions in Nightly. While I was able to install standard Firefox inside the container, it wasn’t functioning properly, which is concerning, but ultimately, we need to focus on ensuring Firefox Nightly is installed and runs in headful mode for the crawler. I'm currently updating the Docker setup to address this. Do you have any suggestions for how we can go about doing this? It doesn't seem as straightforward when compared to installing other applications. Thanks for the suggestion on the debugging protocol—I'll revisit that once we’ve got Nightly working as expected. |
@Mattm27 This is interesting -- in order to run non-headless, Firefox needs to be provided with a working GUI environment, which could mean connecting it out to the host. One option that people seem to be using is VNC inside the container -- if you commit your working Dockerfile I can try to add it, or you can see if one of these can be adapted to run Nightly... |
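To make the GUI requirement concrete, here is a small sketch of how a launcher could decide between headful and headless mode. It assumes an X11-style setup where a VNC or X server inside the container exports `DISPLAY`. The helper name is hypothetical, and `-headless` is Firefox's own command-line flag.

```javascript
// Hypothetical helper: decide Firefox launch arguments from the environment.
// Headful mode needs a display server (for example a VNC-backed X server),
// which is conventionally advertised through the DISPLAY variable.
function firefoxModeArgs(env) {
  const hasDisplay = typeof env.DISPLAY === 'string' && env.DISPLAY.length > 0;
  // With a display available, start normally (headful); otherwise fall back
  // to headless so the browser can still start.
  return hasDisplay ? [] : ['-headless'];
}

console.log(firefoxModeArgs({ DISPLAY: ':1' })); // headful: no extra arguments
console.log(firefoxModeArgs({}));                // no display: headless fallback
```

A silent fallback may not suit the crawl, since the thread notes it normally runs headful. Throwing an error when `DISPLAY` is missing would be a reasonable alternative.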
Thanks for the information @dmarti! - My updated Dockerfile is committed to branch |
@dmarti We were able to leverage the docker-headless-vnc-container for our needs, thank you for the suggestion! The branch |
@eakubilo opened PR #138. @eakubilo explained that running the crawler with Docker works well on Intel Macs but throws an inscrutable error on Apple Silicon Macs. Thus, in addition to @Mattm27 and @eakubilo, @natelevinson10 and @franciscawijaya will try it out on their computers and the lab computer. @eakubilo provided the following instructions on how to run the crawler on Docker:
|
Now that we have merged the new functional Docker infrastructure, @eakubilo and I will work on updating the readme with proper installation instructions before closing this issue! |
Since the Docker image starts the crawler in debug mode by default, @eakubilo and I are working on adding functionality to allow users to start the web crawler with or without the debugging table by running either |
@eakubilo and @Mattm27 will finish the Dockerization, including updating the readme and any other documentation, so that we can start the crawl (#16) next Monday. @franciscawijaya and @natelevinson10 will check whether they can follow the readme and install the Docker version on their own computers and the lab computer for next week's crawl. |
I am working on setting up phpMyAdmin to connect to the MariaDB database running inside our crawl container. This will allow me to access and manage the database directly on my local machine without needing to export .sql files. I have successfully linked the phpMyAdmin container to the MariaDB container using Docker’s While I’ve successfully set up the containers and linked them, I’m encountering login issues when attempting to access phpMyAdmin. I suspect the problem lies in the configuration file for phpMyAdmin, which may not be correctly set up to connect to the MariaDB database in the container. |
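One way to check the connection settings is to build the phpMyAdmin container's arguments explicitly. The sketch below assumes the official `phpmyadmin` image, which reads `PMA_HOST` and `PMA_PORT` and serves on port 80, and a shared user-defined Docker network. The network name `crawl-net`, the container name `crawl-db`, and the helper itself are hypothetical and are not taken from the setup described above.

```javascript
// Hypothetical helper: arguments for `docker run` that start phpMyAdmin
// pointed at a MariaDB container on the same user-defined network.
function phpMyAdminRunArgs({ network, dbHost, dbPort = 3306, hostPort = 8080 }) {
  return [
    'run', '-d',
    '--name', 'phpmyadmin',
    '--network', network,           // must be the network MariaDB is on
    '-e', `PMA_HOST=${dbHost}`,     // MariaDB container name on that network
    '-e', `PMA_PORT=${dbPort}`,
    '-p', `${hostPort}:80`,         // phpMyAdmin listens on port 80 inside
    'phpmyadmin',
  ];
}

const args = phpMyAdminRunArgs({ network: 'crawl-net', dbHost: 'crawl-db' });
console.log(['docker', ...args].join(' '));
```

With `PMA_HOST` set, phpMyAdmin logs in with MariaDB credentials, so the database user also has to be allowed to connect from a host other than `localhost`. That may be worth ruling out when login fails.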
We've containerized most of the crawler functionality (see PR #146). We're currently refactoring the crawler script to incorporate the well-known crawler, which will complete its containerization. @dmarti, will you be analyzing the crawler data yourself, or should we provide an interpretation (as a JSON or CSV file), including a list of sites not meeting the GPC signal specification? |
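As an illustration of what such an interpretation could contain, here is a minimal sketch that classifies the body of a site's `/.well-known/gpc.json` response. It assumes the GPC support resource format, a JSON object with a boolean `gpc` member. The function name and the shape of the returned object are hypothetical and are not the crawler's actual output format.

```javascript
// Hypothetical helper: interpret the body of /.well-known/gpc.json.
function interpretGpcWellKnown(body) {
  let data;
  try {
    data = JSON.parse(body);
  } catch {
    return { valid: false, gpc: null, reason: 'not valid JSON' };
  }
  // The resource is expected to be an object with a boolean "gpc" member.
  if (data === null || typeof data !== 'object' || typeof data.gpc !== 'boolean') {
    return { valid: false, gpc: null, reason: 'missing boolean "gpc" member' };
  }
  return { valid: true, gpc: data.gpc, reason: null };
}

console.log(interpretGpcWellKnown('{"gpc": true, "lastUpdate": "2024-01-01"}'));
console.log(interpretGpcWellKnown('<html>Not found</html>'));
```

Sites where `valid` is false or `gpc` is false could then be listed in the JSON or CSV report.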
For regular testing use it would be useful to run in a container with one command. I have the REST API and extension build steps largely working, and am working out how to do the actual crawl. Work in progress...
https://github.com/privacy-tech-lab/gpc-web-crawler/compare/main...dmarti:gpc-web-crawler:dockerize?expand=1
Not ready to discuss or merge yet, just wanted to see if there is interest. In the long run I'd like to be able to run the crawler as a service that doesn't need much attention, just sends reports if a site being watched has broken GPC.