
Address change in lobbyist structure, fix bug preventing download of all filings #36

Merged: 5 commits into main, Oct 31, 2024

Conversation


@hancush (Member) commented Oct 15, 2024

@@ -12,6 +12,9 @@ jobs:
build:
# The type of runner that the job will run on
runs-on: ubuntu-latest
strategy:
matrix:
hancush (Member Author):
Parallelizes building of these files.
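For context, a GitHub Actions `strategy.matrix` fans a single job out into one parallel job per matrix value. A minimal sketch of the shape this hunk introduces (the actual matrix values are truncated in the diff, so the `file` axis and its values here are hypothetical):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # hypothetical axis; the real values are truncated in the hunk above
        file: [lobbyists, employers, filings]
    steps:
      - run: echo "Building ${{ matrix.file }}"
```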

@@ -41,7 +41,7 @@ def scrape(self, id, version):
for record in response.json():
# Add the lobbyist ID and version to the filing record
record["MemberID"] = id
-            return record
+            yield record
hancush (Member Author):
The source of pain! 🙀
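The one-word fix matters because `return` inside the loop exits the function on the first iteration, so only the first filing record ever came back; `yield` makes the method a generator that produces every record. A minimal standalone illustration (with hypothetical function names):

```python
def broken(records):
    for record in records:
        return record  # exits the function after the first record


def fixed(records):
    for record in records:
        yield record  # generator: produces every record


print(broken(["a", "b", "c"]))       # a
print(list(fixed(["a", "b", "c"])))  # ['a', 'b', 'c']
```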

Comment on lines +6 to +14
def scrapelib_opts(f):
@click.option("--rpm", default=180, show_default=True)
@click.option("--retries", default=3, show_default=True)
@click.option("--verify/--no-verify", default=False, show_default=True)
@functools.wraps(f)
def wrapped_func(*args, **kwargs):
return f(*args, **kwargs)

return wrapped_func
hancush (Member Author):
This was the main reason for pulling this into a CLI: that way, we can run scrapes faster locally by specifying an --rpm larger than 3.
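To show how the decorator is meant to be consumed, here is a self-contained sketch that reuses the `scrapelib_opts` definition from the diff and attaches it to a hypothetical `scrape` command, exercised with click's test runner:

```python
import functools

import click
from click.testing import CliRunner


def scrapelib_opts(f):
    # Shared scraper options stacked onto any click command (from the diff above)
    @click.option("--rpm", default=180, show_default=True)
    @click.option("--retries", default=3, show_default=True)
    @click.option("--verify/--no-verify", default=False, show_default=True)
    @functools.wraps(f)
    def wrapped_func(*args, **kwargs):
        return f(*args, **kwargs)

    return wrapped_func


@click.command()
@scrapelib_opts
def scrape(rpm, retries, verify):  # hypothetical command, for illustration only
    click.echo(f"rpm={rpm} retries={retries} verify={verify}")


# Override the rate limit the same way you would on the command line
result = CliRunner().invoke(scrape, ["--rpm", "600"])
print(result.output, end="")  # rpm=600 retries=3 verify=False
```

Because the options are applied inside the decorator, every command picks up the same flags and defaults without repeating them.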

@@ -59,15 +60,20 @@ def _employers(self):
sys.exit()

def scrape(self):
seen_employers = deque(maxlen=25)
hancush (Member Author):
Added this so the output is deduplicated.

page_number = 1
page_size = 1000
result_count = 1000
seen_lobbyists = deque(maxlen=25)
hancush (Member Author), Oct 15, 2024:
Ditto: added so the output is deduplicated.
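A `deque(maxlen=N)` works as a dedup window here because the API returns duplicates back to back: a small bounded window suppresses adjacent repeats without keeping every ID in memory. A sketch with a window of 3 instead of 25 and fake IDs:

```python
from collections import deque

seen = deque(maxlen=3)  # remembers only the last 3 IDs seen
deduped = []
for item_id in [1, 1, 2, 2, 3, 3, 4]:
    if item_id in seen:
        continue  # duplicate within the window: skip it
    seen.append(item_id)
    deduped.append(item_id)

print(deduped)  # [1, 2, 3, 4]
```

Note the trade-off: a repeat that arrives more than N distinct IDs later would slip through, which is fine when duplicates are adjacent in the paginated output.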

Comment on lines +70 to +80
for lobbyist in lobbyists:
if lobbyist["ID"] in seen_lobbyists:
continue

lobbyist_details = s.get(
"https://login.cfis.sos.state.nm.us/api//LobbyistDetails/GetLobbyistDetails",
params={
"memberId": lobbyist["ID"],
"memberversionID": lobbyist["MemberVersionID"],
},
).json()
hancush (Member Author):
In addition, our previous approach of scraping lobbyist employers, then scraping each employer's lobbyists, no longer works. If you go to the lobbyist tab for an employer with multiple lobbyists and click around the list, you'll notice the lobbyist IDs are not correct and the links don't resolve correctly. So, I updated this script to scrape the lobbyist search and, for each distinct lobbyist, retrieve their details. Their employers (clients) are scraped separately.
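The new flow (paginate the lobbyist search, dedupe, fetch details per distinct lobbyist) can be sketched with stand-in fetch functions in place of the real CFIS endpoints; the dataset and helper names here are hypothetical, and only the pagination/dedup shape mirrors the diff:

```python
from collections import deque

# Fake dataset standing in for the lobbyist search results:
# 2,150 rows with each lobbyist appearing twice, back to back.
RECORDS = [{"ID": i // 2, "MemberVersionID": 1} for i in range(2150)]


def fetch_page(page_number, page_size):
    """Stand-in for one paginated search request."""
    start = (page_number - 1) * page_size
    return RECORDS[start : start + page_size]


def get_details(member_id, member_version_id):
    """Stand-in for the GetLobbyistDetails call."""
    return {"MemberID": member_id, "MemberVersionID": member_version_id}


seen_lobbyists = deque(maxlen=25)
details = []

page_number, page_size, result_count = 1, 1000, 1000
while result_count == page_size:  # a short page means we've reached the end
    page = fetch_page(page_number, page_size)
    result_count = len(page)
    for lobbyist in page:
        if lobbyist["ID"] in seen_lobbyists:
            continue  # adjacent duplicate: skip the detail request
        seen_lobbyists.append(lobbyist["ID"])
        details.append(get_details(lobbyist["ID"], lobbyist["MemberVersionID"]))
    page_number += 1

print(len(details))  # 1075 distinct lobbyists
```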

@hancush marked this pull request as ready for review October 28, 2024 15:36
@hancush requested a review from antidipyramid October 28, 2024 15:36
@hancush merged commit 8e22a1e into main Oct 31, 2024
2 checks passed