Add search query support to the job posting spider. #115

Status: Open. 2 commits to merge into base branch main.

wRAR (Member) commented Dec 30, 2024

This worked for Indeed but didn't work for Glassdoor, and we should understand why.

codecov bot commented Dec 30, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.92%. Comparing base (1b72aa8) to head (0adc089).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #115      +/-   ##
==========================================
+ Coverage   95.89%   95.92%   +0.02%     
==========================================
  Files          26       26              
  Lines        2609     2627      +18     
==========================================
+ Hits         2502     2520      +18     
  Misses        107      107              
Files with missing lines | Coverage Δ
zyte_spider_templates/spiders/job_posting.py | 96.39% <100.00%> (+0.69%) ⬆️

wRAR marked this pull request as draft December 30, 2024 15:38

Gallaecio (Contributor) commented Dec 31, 2024

> didn't work for glassdoor and we should understand why.

I don’t see SearchAction metadata in the HTML, and Formasaurus seems to fail:

import asyncio
from base64 import b64decode

from form2request import form2request
from formasaurus import build_submission
from parsel import Selector
from zyte_api import AsyncZyteAPI


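async def main():
    client = AsyncZyteAPI()
    url = "https://www.glassdoor.com/Job/index.htm"
    # Download the raw HTML of the Glassdoor job search page through Zyte API.
    result = await client.get({"url": url, "httpResponseBody": True})
    # httpResponseBody is returned base64-encoded.
    html = b64decode(result["httpResponseBody"]).decode()
    selector = Selector(text=html, base_url=url)
    # Ask Formasaurus for the search form and fill in its search query field.
    form, data, submit_button = build_submission(selector, "search", {"search query": "foo"})
    # Turn the filled-in form into request data, as if clicking the submit button.
    request_data = form2request(form, data, click=submit_button)
    print(request_data)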


asyncio.run(main())

$ python test.py
Request(url='https://www.glassdoor.com/Job/index.htm', method='GET', headers=[], body=b'')
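
The request printed above is a GET with no query string, so the search term "foo" never made it into the request. A minimal sketch of how that check could be automated with the standard library follows; `query_contains` is a hypothetical helper name, and the second URL is an invented example of what a working search form might produce, not something taken from this PR.

```python
from urllib.parse import parse_qsl, urlsplit


def query_contains(url: str, term: str) -> bool:
    """Return True if any query-string value in `url` equals `term`."""
    query = urlsplit(url).query
    return any(value == term for _, value in parse_qsl(query, keep_blank_values=True))


# URL taken from the output above: it has no query string at all.
print(query_contains("https://www.glassdoor.com/Job/index.htm", "foo"))
# Hypothetical URL of the shape a working search form would produce.
print(query_contains("https://example.com/jobs?q=foo", "foo"))
```

This only covers GET submissions; for a POST form the term would have to be looked up in the request body instead.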

So I would say it “fails as expected”.
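
For reference, the SearchAction check mentioned above can be approximated by looking for schema.org `SearchAction` objects in a page's JSON-LD blocks. The sketch below uses only the standard library and is an assumption about one reasonable way to do it, not the spider's actual implementation; `JsonLdCollector`, `find_search_actions`, `search_url_templates` and `SAMPLE_HTML` are all hypothetical names, and it does not cover SearchAction metadata expressed as microdata or RDFa.

```python
import json
from html.parser import HTMLParser


class JsonLdCollector(HTMLParser):
    """Collect the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._chunks))
            self._in_jsonld = False


def find_search_actions(node):
    """Recursively yield every JSON-LD object whose @type is SearchAction."""
    if isinstance(node, dict):
        types = node.get("@type")
        if types == "SearchAction" or (isinstance(types, list) and "SearchAction" in types):
            yield node
        for value in node.values():
            yield from find_search_actions(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_search_actions(item)


def search_url_templates(html):
    """Return the URL templates of all SearchAction objects found in `html`."""
    collector = JsonLdCollector()
    collector.feed(html)
    collector.close()
    templates = []
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # Skip malformed JSON-LD instead of failing the whole page.
        for action in find_search_actions(data):
            target = action.get("target")
            if isinstance(target, dict):  # EntryPoint form: {"urlTemplate": "..."}
                target = target.get("urlTemplate")
            if isinstance(target, str):
                templates.append(target)
    return templates


# Invented sample page that does declare a SearchAction.
SAMPLE_HTML = """
<html><head><script type="application/ld+json">
{"@context": "https://schema.org", "@type": "WebSite", "url": "https://example.com/",
 "potentialAction": {"@type": "SearchAction",
                     "target": "https://example.com/search?q={search_term_string}",
                     "query-input": "required name=search_term_string"}}
</script></head><body></body></html>
"""

print(search_url_templates(SAMPLE_HTML))
print(search_url_templates("<html><body><form></form></body></html>"))
```

An empty result for a page, combined with a form submission that carries no query, would match the situation described for Glassdoor here.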

wRAR (Member, Author) commented Dec 31, 2024

That is unfortunate, and I wonder what other steps we can take to support it.

wRAR marked this pull request as ready for review January 6, 2025 12:41