Address change in lobbyist structure, fix bug preventing download of all filings #36
Conversation
@@ -12,6 +12,9 @@ jobs:
   build:
     # The type of runner that the job will run on
     runs-on: ubuntu-latest
+    strategy:
+      matrix:
Parallelizes building of these files.
@@ -41,7 +41,7 @@ def scrape(self, id, version):
         for record in response.json():
             # Add the lobbyist ID and version to the filing record
             record["MemberID"] = id
-            return record
+            yield record
The source of pain! 🙀
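A minimal sketch of the bug fixed above (illustrative names, not the repo's code): a `return` inside the loop exits after the first filing record, while `yield` turns the method into a generator that emits every record.

```python
def scrape_with_return(records):
    for record in records:
        return record  # bails out after the first record


def scrape_with_yield(records):
    for record in records:
        yield record  # emits every record lazily


data = [{"ID": 1}, {"ID": 2}, {"ID": 3}]
print(scrape_with_return(data))       # {'ID': 1} -- only the first
print(list(scrape_with_yield(data)))  # all three records
```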
def scrapelib_opts(f):
    @click.option("--rpm", default=180, show_default=True)
    @click.option("--retries", default=3, show_default=True)
    @click.option("--verify/--no-verify", default=False, show_default=True)
    @functools.wraps(f)
    def wrapped_func(*args, **kwargs):
        return f(*args, **kwargs)

    return wrapped_func
This was the main reason for pulling this into a CLI. That way, we can run scrapes faster locally by specifying an --rpm larger than 3.
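A hedged sketch of how the shared-options decorator attaches to a command (the `scrape` command below is hypothetical, not from the repo): each `click.option` wraps the function, and `functools.wraps` preserves its identity so click registers the options correctly.

```python
import functools

import click


def scrapelib_opts(f):
    @click.option("--rpm", default=180, show_default=True)
    @click.option("--retries", default=3, show_default=True)
    @click.option("--verify/--no-verify", default=False, show_default=True)
    @functools.wraps(f)
    def wrapped_func(*args, **kwargs):
        return f(*args, **kwargs)

    return wrapped_func


# Hypothetical command using the shared options; click passes each
# option through as a keyword argument.
@click.command()
@scrapelib_opts
def scrape(rpm, retries, verify):
    click.echo(f"rpm={rpm} retries={retries} verify={verify}")
```

Invoking `scrape --rpm 600` would then override the 180-requests-per-minute default for a faster local run.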
@@ -59,15 +60,20 @@ def _employers(self):
             sys.exit()

     def scrape(self):
+        seen_employers = deque(maxlen=25)
Added this so the output is deduplicated.
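A quick sketch of the bounded dedup window (illustrative data, not the real API's): `deque(maxlen=25)` holds only the 25 most recently seen IDs, so memory stays flat while duplicate rows from adjacent pages are skipped.

```python
from collections import deque

seen_employers = deque(maxlen=25)
employers = [{"ID": 1}, {"ID": 2}, {"ID": 1}, {"ID": 3}]  # one duplicate

deduped = []
for employer in employers:
    if employer["ID"] in seen_employers:
        continue  # recently emitted; skip the duplicate
    seen_employers.append(employer["ID"])
    deduped.append(employer)

print([e["ID"] for e in deduped])  # [1, 2, 3]
```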
         page_number = 1
         page_size = 1000
         result_count = 1000
+        seen_lobbyists = deque(maxlen=25)
Ditto deduplicated.
         for lobbyist in lobbyists:
+            if lobbyist["ID"] in seen_lobbyists:
+                continue
+
             lobbyist_details = s.get(
                 "https://login.cfis.sos.state.nm.us/api//LobbyistDetails/GetLobbyistDetails",
                 params={
                     "memberId": lobbyist["ID"],
                     "memberversionID": lobbyist["MemberVersionID"],
                 },
             ).json()
In addition, our previous approach of scraping lobbyist employers, then scraping each employer's lobbyists, no longer works. If you go to the lobbyist tab for an employer with multiple lobbyists and click around the list, you'll notice the lobbyist IDs are incorrect and the links don't resolve. So, I updated this script to scrape the lobbyist search and, for each distinct lobbyist, retrieve details. I scraped their employers (clients) separately.
Successful run here: https://github.com/datamade/nmid-scrapers/actions/runs/11354821874/job/31582862346
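The revised flow can be sketched offline with a stubbed API. Here `fetch_page` and `fetch_details` are hypothetical stand-ins for the paginated lobbyist search and the GetLobbyistDetails request; only the dedup-and-fetch shape mirrors the script.

```python
from collections import deque

# Fake search results: page 2 re-serves one lobbyist from page 1,
# mimicking the duplicate rows the dedup window guards against.
PAGES = [
    [{"ID": 1, "MemberVersionID": 10}, {"ID": 2, "MemberVersionID": 20}],
    [{"ID": 2, "MemberVersionID": 20}, {"ID": 3, "MemberVersionID": 30}],
]


def fetch_page(page_number):
    # Stand-in for the paginated lobbyist search request
    return PAGES[page_number - 1] if page_number <= len(PAGES) else []


def fetch_details(member_id, member_version_id):
    # Stand-in for the GetLobbyistDetails request
    return {"memberId": member_id, "memberversionID": member_version_id}


def scrape():
    seen_lobbyists = deque(maxlen=25)
    page_number = 1
    while True:
        lobbyists = fetch_page(page_number)
        if not lobbyists:
            break
        for lobbyist in lobbyists:
            if lobbyist["ID"] in seen_lobbyists:
                continue
            seen_lobbyists.append(lobbyist["ID"])
            yield fetch_details(lobbyist["ID"], lobbyist["MemberVersionID"])
        page_number += 1


print([d["memberId"] for d in scrape()])  # [1, 2, 3]
```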