
Address change in lobbyist structure, fix bug preventing download of all filings #36

Merged: 5 commits into main, Oct 31, 2024

Conversation


@hancush (Member) commented Oct 15, 2024

@@ -12,6 +12,9 @@ jobs:
build:
# The type of runner that the job will run on
runs-on: ubuntu-latest
strategy:
matrix:
hancush (Member Author):
Parallelizes building of these files.
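For context, a GitHub Actions `strategy.matrix` fans a single job out into one parallel job per matrix value. A minimal sketch of the shape this hunk introduces (the actual matrix values are truncated in the diff, so the `file` axis and its values here are hypothetical):

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        # hypothetical axis; the real values are truncated in the hunk above
        file: [lobbyists, employers, filings]
    steps:
      - run: echo "Building ${{ matrix.file }}"
```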

@@ -41,7 +41,7 @@ def scrape(self, id, version):
for record in response.json():
# Add the lobbyist ID and version to the filing record
record["MemberID"] = id
-            return record
+            yield record
hancush (Member Author):
The source of pain! 🙀
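The one-word fix matters because `return` inside the loop exits the function on the first iteration, so only the first filing record ever came back; `yield` makes the method a generator that produces every record. A minimal standalone illustration (with hypothetical function names):

```python
def broken(records):
    for record in records:
        return record  # exits the function after the first record


def fixed(records):
    for record in records:
        yield record  # generator: produces every record


print(broken(["a", "b", "c"]))       # a
print(list(fixed(["a", "b", "c"])))  # ['a', 'b', 'c']
```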

Comment on lines +6 to +14
def scrapelib_opts(f):
@click.option("--rpm", default=180, show_default=True)
@click.option("--retries", default=3, show_default=True)
@click.option("--verify/--no-verify", default=False, show_default=True)
@functools.wraps(f)
def wrapped_func(*args, **kwargs):
return f(*args, **kwargs)

return wrapped_func
hancush (Member Author):
This was the main reason for pulling this into a CLI: that way, we can run scrapes faster locally by specifying an --rpm larger than 3.
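To show how the decorator is meant to be consumed, here is a self-contained sketch that reuses the `scrapelib_opts` definition from the diff and attaches it to a hypothetical `scrape` command, exercised with click's test runner:

```python
import functools

import click
from click.testing import CliRunner


def scrapelib_opts(f):
    # Shared scraper options stacked onto any click command (from the diff above)
    @click.option("--rpm", default=180, show_default=True)
    @click.option("--retries", default=3, show_default=True)
    @click.option("--verify/--no-verify", default=False, show_default=True)
    @functools.wraps(f)
    def wrapped_func(*args, **kwargs):
        return f(*args, **kwargs)

    return wrapped_func


@click.command()
@scrapelib_opts
def scrape(rpm, retries, verify):  # hypothetical command, for illustration only
    click.echo(f"rpm={rpm} retries={retries} verify={verify}")


# Override the rate limit the same way you would on the command line
result = CliRunner().invoke(scrape, ["--rpm", "600"])
print(result.output, end="")  # rpm=600 retries=3 verify=False
```

Because the options are applied inside the decorator, every command picks up the same flags and defaults without repeating them.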

@@ -59,15 +60,20 @@ def _employers(self):
sys.exit()

def scrape(self):
seen_employers = deque(maxlen=25)
hancush (Member Author):
Added this so the output is deduplicated.

page_number = 1
page_size = 1000
result_count = 1000
seen_lobbyists = deque(maxlen=25)
hancush (Member Author), Oct 15, 2024:
Ditto: added so the output is deduplicated.
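A `deque(maxlen=N)` works as a dedup window here because the API returns duplicates back to back: a small bounded window suppresses adjacent repeats without keeping every ID in memory. A sketch with a window of 3 instead of 25 and fake IDs:

```python
from collections import deque

seen = deque(maxlen=3)  # remembers only the last 3 IDs seen
deduped = []
for item_id in [1, 1, 2, 2, 3, 3, 4]:
    if item_id in seen:
        continue  # duplicate within the window: skip it
    seen.append(item_id)
    deduped.append(item_id)

print(deduped)  # [1, 2, 3, 4]
```

Note the trade-off: a repeat that arrives more than N distinct IDs later would slip through, which is fine when duplicates are adjacent in the paginated output.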

Comment on lines +70 to +80
for lobbyist in lobbyists:
if lobbyist["ID"] in seen_lobbyists:
continue

lobbyist_details = s.get(
"https://login.cfis.sos.state.nm.us/api//LobbyistDetails/GetLobbyistDetails",
params={
"memberId": lobbyist["ID"],
"memberversionID": lobbyist["MemberVersionID"],
},
).json()
hancush (Member Author):
In addition, our previous approach of scraping lobbyist employers, then scraping each employer's lobbyists, no longer works. If you go to the lobbyist tab for an employer with multiple lobbyists and click around the list, you'll notice the lobbyist IDs are not correct and the links don't resolve correctly. So, I updated this script to scrape the lobbyist search and, for each distinct lobbyist, retrieve their details. Their employers (clients) are scraped separately.
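The new flow (paginate the lobbyist search, dedupe, fetch details per distinct lobbyist) can be sketched with stand-in fetch functions in place of the real CFIS endpoints; the dataset and helper names here are hypothetical, and only the pagination/dedup shape mirrors the diff:

```python
from collections import deque

# Fake dataset standing in for the lobbyist search results:
# 2,150 rows with each lobbyist appearing twice, back to back.
RECORDS = [{"ID": i // 2, "MemberVersionID": 1} for i in range(2150)]


def fetch_page(page_number, page_size):
    """Stand-in for one paginated search request."""
    start = (page_number - 1) * page_size
    return RECORDS[start : start + page_size]


def get_details(member_id, member_version_id):
    """Stand-in for the GetLobbyistDetails call."""
    return {"MemberID": member_id, "MemberVersionID": member_version_id}


seen_lobbyists = deque(maxlen=25)
details = []

page_number, page_size, result_count = 1, 1000, 1000
while result_count == page_size:  # a short page means we've reached the end
    page = fetch_page(page_number, page_size)
    result_count = len(page)
    for lobbyist in page:
        if lobbyist["ID"] in seen_lobbyists:
            continue  # adjacent duplicate: skip the detail request
        seen_lobbyists.append(lobbyist["ID"])
        details.append(get_details(lobbyist["ID"], lobbyist["MemberVersionID"]))
    page_number += 1

print(len(details))  # 1075 distinct lobbyists
```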

@hancush marked this pull request as ready for review October 28, 2024 15:36
@hancush requested a review from antidipyramid October 28, 2024 15:36
@hancush merged commit 8e22a1e into main Oct 31, 2024
2 checks passed