Run in a Docker container #98

Open
dmarti opened this issue Mar 4, 2024 · 29 comments
Labels: crawl, infrastructure


@dmarti

dmarti commented Mar 4, 2024

For regular testing use it would be useful to run in a container with one command. I have the REST API and extension build steps largely working, and am working out how to do the actual crawl. Work in progress...

https://github.com/privacy-tech-lab/gpc-web-crawler/compare/main...dmarti:gpc-web-crawler:dockerize?expand=1

Not ready to discuss or merge yet, just wanted to see if there is interest. In the long run I'd like to be able to run the crawler as a service that doesn't need much attention, just sends reports if a site being watched has broken GPC.

SebastianZimmeck added the infrastructure and crawl labels Mar 4, 2024
@SebastianZimmeck
Member

It is a good point, @dmarti! We discussed the Dockerization and also think it is a good idea to explore. @sophieeng will take the lead on our end, with @katehausladen, @Mattm27, and @franciscawijaya helping out.

@dmarti
Author

dmarti commented Mar 4, 2024

Thank you -- right now I think in order to get the crawl working I need to modify my Dockerfile to get the right versions of Firefox Nightly and geckodriver installed.

What versions are you running, and what source are you using for geckodriver? (I haven't used Selenium in a while and it seems like things have moved around; I just want to go to the right place.)

@katehausladen
Collaborator

Currently we're just using whatever Firefox version is on the computer locally (.setBinary('/Applications/Firefox\ Nightly.app/Contents/MacOS/firefox')), and Selenium uses the geckodriver from the local Firefox Nightly. So, this means we're always using the most recent version of both. In terms of Docker, I think as long as you use something relatively recent, it should be fine.
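A minimal sketch of how the binary path could be chosen per operating system, so the same setup works outside macOS. The macOS path is the one quoted above; the Linux default (/opt/firefox/firefox) and the FIREFOX_BIN override variable are assumptions for illustration, not paths confirmed in this thread.

```shell
#!/bin/sh
# Print the Firefox Nightly binary path for an OS name as reported by `uname -s`.
# The Linux default and the FIREFOX_BIN override are ASSUMED, not taken from the repo.
firefox_nightly_path() {
  case "$1" in
    Darwin) echo "/Applications/Firefox Nightly.app/Contents/MacOS/firefox" ;;
    Linux)  echo "${FIREFOX_BIN:-/opt/firefox/firefox}" ;;
    *)      echo "unsupported OS: $1" >&2; return 1 ;;
  esac
}

# Show the path that would be used on the current machine.
firefox_nightly_path "$(uname -s)" || true
```

The printed path could then be handed to whatever the crawler passes to .setBinary(...).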

@sophieeng
Collaborator

Hey @dmarti! Just wanted to check in on your progress with running the crawler in Docker. Do you need any support? Let us know if we can help with anything.

@dmarti
Author

dmarti commented Apr 11, 2024

Hi @sophieeng, I got a little stuck figuring out the right source code and/or Linux packages for Selenium. It didn't look like geckodriver was packaged with the Firefox Nightly for Linux download that I was using.

@dmarti
Author

dmarti commented Apr 11, 2024

If I could get sources for known-good Firefox and Selenium downloads that work together, that would help (I don't have a Mac to test on).

@katehausladen
Collaborator

@franciscawijaya and @Mattm27 will work on this over the next couple of weeks.

sophieeng removed their assignment May 13, 2024
n-aggarwal self-assigned this Jun 13, 2024
@SebastianZimmeck
Member

@dmarti is using Linux, not macOS. So, @dmarti's question is which Selenium and Firefox versions work on Linux.

@Sokvy77

Sokvy77 commented Jun 14, 2024 via email

@Mattm27
Member

Mattm27 commented Sep 26, 2024

Hey @dmarti! We have resumed our efforts on the dockerization of our web crawler, and I’ve been reviewing the progress you made in the spring as a foundation for our work.

However, I noticed that the myextension.xpi file has been removed from the codebase, and I'm a bit unclear about the reasoning behind this change. Could you please provide some clarification on why it was deleted? Thanks!

@Mattm27
Member

Mattm27 commented Sep 30, 2024

I've been dealing with two main problems: the container closing immediately after starting (exit code 255) and conflicts between MySQL and MariaDB installations. The container issue seems related to how systemd is set up, while the MySQL vs. MariaDB problem is likely due to package conflicts.

In terms of next steps, the plan is to create a new Dockerfile from scratch, focusing first on getting the container to stay running, and then adding Apache, Geckodriver, and either MySQL or MariaDB one step at a time to avoid further conflicts. Once these issues are resolved, the Dockerization should be much closer to completion, and the existing .sh scripts should work correctly with a stable setup.

@Mattm27
Member

Mattm27 commented Sep 30, 2024

The Docker image itself is building successfully, meaning all the required dependencies and configurations are being included properly. However, the issue arises when creating a container from that image—systemd is failing to initialize, causing the container to immediately exit. This distinction is important because it suggests the problem is not with the build process, but rather with how the container is running or managing processes once started.
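A minimal sketch of how the exit of a stopped container could be inspected, using the standard docker inspect and docker logs subcommands. The container name crawler-test is a placeholder, and the script only prints the commands unless DOCKER=docker is set.

```shell
#!/bin/sh
# Dry-run by default: commands are printed, not executed.
# Set DOCKER=docker in the environment to run them for real.
DOCKER="${DOCKER:-echo docker}"

# Show the exit code and the last log lines of a (stopped) container.
# The container name passed in is a placeholder, not one used by the project.
diagnose_container() {
  name="$1"
  $DOCKER inspect --format '{{.State.ExitCode}}' "$name"  # numeric exit code
  $DOCKER logs --tail 50 "$name"                           # last 50 log lines
}

diagnose_container crawler-test
```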

@SebastianZimmeck
Member

As discussed in our meeting, @Mattm27 will start a fresh Docker implementation.

@Mattm27
Member

Mattm27 commented Oct 5, 2024

I was able to build the Docker image, and the container now keeps running instead of exiting right after start. For testing purposes, I keep it alive with CMD ["sleep", "infinity"].
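The same keep-alive idea can be tried without editing the Dockerfile by passing the command at run time. This is a sketch under assumptions: the image and container names are placeholders, and if the image defines an ENTRYPOINT the trailing words are passed to it as arguments instead of replacing the command. It prints the command unless DOCKER=docker is set.

```shell
#!/bin/sh
# Dry-run by default; set DOCKER=docker to actually start the container.
DOCKER="${DOCKER:-echo docker}"

# Start an image detached with `sleep infinity` as its command so it stays up
# for inspection. Image and container names are placeholders.
start_idle() {
  image="$1"
  name="$2"
  $DOCKER run -d --name "$name" "$image" sleep infinity
}

start_idle gpc-crawler-test crawler-test
```

Once the container is up, docker exec -it crawler-test sh opens a shell inside it, provided the image ships one.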

I installed and verified the following components within the container:

  • Apache
  • Node.js
  • Geckodriver
  • Selenium
  • Firefox

These installations are all functioning correctly, and the container is stable during testing.

I'm still having trouble installing MySQL. The container cannot locate the mysql-server package, which I suspect is a repository issue. However, as discussed in the meeting earlier this week, I was able to install MariaDB as an alternative. Since MariaDB is a drop-in replacement for MySQL, we can explore using it if we cannot resolve the MySQL installation directly.
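A sketch of the fallback described here: try the preferred package name first and settle for the next one the package index knows about. It assumes a Debian- or Ubuntu-based image where apt-cache is available; the helper names pick_db_package and apt_has are made up for illustration.

```shell
#!/bin/sh
# Print the first package name that the given checker reports as available.
# Usage: pick_db_package CHECKER PKG [PKG...]
pick_db_package() {
  checker="$1"
  shift
  for pkg in "$@"; do
    if "$checker" "$pkg"; then
      echo "$pkg"
      return 0
    fi
  done
  return 1
}

# ASSUMED checker for apt-based images: succeeds if apt knows the package.
apt_has() { apt-cache show "$1" >/dev/null 2>&1; }

pick_db_package apt_has mysql-server mariadb-server || echo "neither package found"
```

In a Dockerfile, the chosen name would then be handed to the install step; on a machine without apt-cache the script just reports that neither package was found.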

@SebastianZimmeck
Member

Good progress, @Mattm27!

Mattm27 added a commit that referenced this issue Oct 7, 2024
@Mattm27
Member

Mattm27 commented Oct 7, 2024

I've made updates to the Dockerization process as outlined in the code above. The container is now being built correctly using the updated image, and I am in the process of testing individual crawler components within the container. The rest-api.sh script is functioning as expected and building the database, but I am currently troubleshooting issues with the build-extension.sh script to ensure the extension is properly built and integrated.

@Mattm27
Member

Mattm27 commented Oct 14, 2024

I managed to work around the issue where the myextension.xpi file was being repacked every time the software was run in the Docker container. This repacking process was causing errors. Instead of repacking the extension, I used the prepacked myextension.xpi that is already present in the codebase. This allowed me to bypass the issues with corrupt extensions.
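The workaround can be sketched as a guard that skips the repack when a non-empty prepacked file already exists. The function name ensure_extension is hypothetical, and the fallback command is whatever build step the project uses; the demo call below only echoes a message.

```shell
#!/bin/sh
# Use the prepacked extension if it exists and is non-empty; otherwise run the
# fallback command given after the path.
# Usage: ensure_extension PATH_TO_XPI FALLBACK_COMMAND [ARGS...]
ensure_extension() {
  xpi="$1"
  shift
  if [ -s "$xpi" ]; then
    echo "using prepacked $xpi"
  else
    "$@"
  fi
}

# Demo: the fallback here only prints a message instead of building anything.
ensure_extension myextension.xpi echo "myextension.xpi missing, would run the build step"
```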

After resolving the extension issue, I’ve run into a new problem with Firefox when attempting to run the crawl. The Firefox browser does not seem to be functioning properly in the container, which prevents the crawl from executing as expected. I'm currently troubleshooting the setup for Firefox Nightly, which is required for the extension, to ensure it's correctly installed and configured for headless mode. I suspect I may need some extra support on this part of the process.

@SebastianZimmeck
Member

As we discussed, @eakubilo will also help with the dockerization.

@dmarti, we are currently having an issue getting Firefox Nightly to run in Docker. As @Mattm27 is saying:

The Firefox browser does not seem to be functioning properly in the container, ...

Do you have any thoughts on this?

@dmarti
Author

dmarti commented Oct 14, 2024

So the browser is starting in headless mode inside the container?

Does debugging protocol work? Can you expose the debugging protocol from Firefox inside the container?

https://firefox-source-docs.mozilla.org/devtools/backend/protocol.html

@Mattm27
Member

Mattm27 commented Oct 15, 2024

Hey @dmarti! Yes, currently the browser is starting in headless mode inside the container, but we typically run Firefox in headful mode for the crawl. I haven’t yet explored exposing the debugging protocol from Firefox within the container, but it's something I can definitely look into.

I apologize for the oversight in the past comment. Our main priority right now is getting Firefox Nightly installed and running correctly, since the extension required for the crawl only functions in Nightly. While I was able to install standard Firefox inside the container, it wasn’t functioning properly, which is concerning, but ultimately, we need to focus on ensuring Firefox Nightly is installed and runs in headful mode for the crawler. I'm currently updating the Docker setup to address this. Do you have any suggestions for how we can go about doing this? It doesn't seem as straightforward when compared to installing other applications.

Thanks for the suggestion on the debugging protocol—I'll revisit that once we’ve got Nightly working as expected.

@dmarti
Author

dmarti commented Oct 15, 2024

@Mattm27 This is interesting -- in order to run non-headless, Firefox needs to be provided with a working GUI environment, which could mean connecting it out to the host. One option that people seem to be using is VNC inside the container -- if you commit your working Dockerfile I can try to add it, or you can see if one of these can be adapted to run Nightly...

https://github.com/ConSol/docker-headless-vnc-container

@Mattm27
Member

Mattm27 commented Oct 16, 2024

Thanks for the information, @dmarti! My updated Dockerfile is committed to branch issue-98. I will also check out the link to see if it is possible to adapt that code to run Nightly!

@eakubilo
Member

eakubilo commented Oct 18, 2024

@dmarti We were able to adapt docker-headless-vnc-container to our needs, thank you for the suggestion! The branch issue-98 should have the functionality described: the command sh scripts/test.sh starts a Docker container that performs the privacy crawl. We're working on PR #138, which we hope will bring this functionality to main soon.

@SebastianZimmeck
Member

SebastianZimmeck commented Oct 22, 2024

@eakubilo opened PR #138. @eakubilo explained that running the crawler with Docker works well on Intel Macs but throws an inscrutable error on Apple Silicon Macs. Thus, in addition to @Mattm27 and @eakubilo, @natelevinson10 and @franciscawijaya will try it out on their computers and the lab computer.

@eakubilo provided the following instructions on how to run the crawler on Docker:

  • Download Docker from https://www.docker.com/products/docker-desktop/
    (make sure to download the correct version for your architecture)
  • Open Docker
  • In a terminal, navigate to the "gpc-web-crawler" repo
  • Check out the issue-98 branch with git switch issue-98
  • Run sh scripts/test.sh
  • Go to "http://localhost:6901/vnc.html"; the password is "vncpassword"
  • The logs from the crawler live in /logs
  • To stop the crawler, just stop the Docker container
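Since the VNC page is only reachable once the container has finished starting, a small polling helper can save some manual reloading. This is a sketch, not part of the repository: wait_until is a hypothetical helper, and the curl probe in the comment assumes curl is installed on the host.

```shell
#!/bin/sh
# Retry a probe command once per second until it succeeds or the attempts run out.
# Usage: wait_until TRIES COMMAND [ARGS...]
wait_until() {
  tries="$1"
  shift
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep 1
  done
  return 1
}

# Example (not run here): wait up to 30 seconds for the VNC page from the steps above.
# wait_until 30 curl -fsS http://localhost:6901/vnc.html && echo "VNC page is up"
```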

@Mattm27
Member

Mattm27 commented Oct 24, 2024

Now that we have merged the new functional Docker infrastructure, @eakubilo and I will work on updating the readme with proper installation instructions before closing this issue!

@Mattm27
Member

Mattm27 commented Oct 25, 2024

Since the Docker image starts the crawler in debug mode by default, @eakubilo and I are adding a way to start the web crawler with or without the debugging table by running either sh scripts/webcrawler.sh or sh scripts/webcrawler.sh debug. The goal is to pass a variable from the command into the container using a flag, giving more control over the crawler's behavior at startup.
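One way this could look, as a sketch rather than the actual scripts/webcrawler.sh: map the optional argument to a value and hand it to the container with docker run -e. The variable name DEBUG_MODE and the image name gpc-crawler are assumptions for illustration. The script only prints the command unless DOCKER=docker is set.

```shell
#!/bin/sh
# Dry-run by default; set DOCKER=docker to actually start the container.
DOCKER="${DOCKER:-echo docker}"

# Translate the optional first argument into "true" or "false".
debug_value() {
  case "${1:-}" in
    debug) echo true ;;
    "")    echo false ;;
    *)     echo "usage: webcrawler.sh [debug]" >&2; return 2 ;;
  esac
}

# Pass the value into the container as an environment variable.
# DEBUG_MODE and the image name gpc-crawler are hypothetical.
start_crawler() {
  mode="$(debug_value "${1:-}")" || return $?
  $DOCKER run -e "DEBUG_MODE=$mode" gpc-crawler
}

start_crawler debug
```

Inside the container, the entry script would read the variable and decide whether to show the debugging table.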

@SebastianZimmeck
Member

@eakubilo and @Mattm27 will finish the Dockerization, including updating the readme and any other documentation, so that we can start the crawl (#16) on Monday next week.

@franciscawijaya and @natelevinson10 will check whether they can follow the readme and install the Docker version on their own computers and the lab computer for next week's crawl.

@Mattm27
Member

Mattm27 commented Nov 18, 2024

I am working on setting up phpMyAdmin to connect to the MariaDB database running inside our crawl container. This will allow me to access and manage the database directly on my local machine without needing to export .sql files. I have successfully linked the phpMyAdmin container to the MariaDB container using Docker’s --link feature.

While I’ve successfully set up the containers and linked them, I’m encountering login issues when attempting to access phpMyAdmin. I suspect the problem lies in the configuration file for phpMyAdmin, which may not be correctly set up to connect to the MariaDB database in the container.
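For reference, a sketch of the linking step in the form shown in the official phpmyadmin image documentation, where the database container is linked under the alias db. The database container name crawl-db and the host port 8080 are placeholders. Note that MariaDB accounts are scoped by host, so an account that only exists for localhost will be refused when phpMyAdmin connects from another container. The script only prints the command unless DOCKER=docker is set.

```shell
#!/bin/sh
# Dry-run by default; set DOCKER=docker to actually start the container.
DOCKER="${DOCKER:-echo docker}"

# Start phpMyAdmin linked to a running database container under the alias "db".
# The database container name and host port 8080 are placeholders.
start_phpmyadmin() {
  db_container="$1"
  $DOCKER run -d --name phpmyadmin --link "$db_container:db" -p 8080:80 phpmyadmin
}

start_phpmyadmin crawl-db
```

With the container running, the login page would be served at http://localhost:8080 on the host.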

@eakubilo
Member

eakubilo commented Dec 2, 2024

We've containerized most of the crawler functionality (see PR #146). We're currently refactoring the crawler script to incorporate the well-known crawler, which will complete its containerization. @dmarti, will you be analyzing the crawler data, or should we provide an interpretation (as data in a JSON or CSV file), including a list of sites not meeting the GPC signal specification?
