-
Notifications
You must be signed in to change notification settings - Fork 769
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Substack with a custom domain #3244
base: master
Are you sure you want to change the base?
Support Substack with a custom domain #3244
Conversation
Remove the substack.com domain from the target regex to support Substacks with a custom domain In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a couple of inline comments and suggestions, otherwise looks and works (in Scaffold, manually + tests) ok for me (with my suggested change for the archive
regex).
There are two test failures due to one-line expected result differences in view of the usual translator test rot (blogTitle
changed on both test subjects) - probably worth fixing in this PR? Alternatively, I could update them after this one is merged, together with asyncifying (now that it drew my attention, why not do that).
@@ -2,14 +2,14 @@ | |||
"translatorID": "ac3b958f-0581-4117-bebc-44af3b876545", | |||
"label": "Substack", | |||
"creator": "Abe Jellinek", | |||
"target": "^https://[^.]+\\.substack\\.com/(p/|archive)", | |||
"target": "/p/|/archive", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is of course quite generic, so this translator will be considered for quite a few URLs I suspect (e.g., many sites plausibly have some "archive"). A couple of suggestions in view of that:
- The
priority
field should be bumped up to "250", I think. See translator priority docs for details. This is accounting for thata.footer-substack-cta
check, which I think can be counted as a "unique check in detectWeb". - I'd suggest specializing the
target
pattern as much as possible and at least adding a(?:$|\?)
(non-capturing because performance 🙂) for thearchive
part (see a proposal and a counterexample below)./p/
is trickier as I'm not sure what symbols Substack could use there, but maybe something along the same lines, i.e.,/p/<something-but-not-/>$
? Though I see in the test set that there may be a/p/<post name>/comments
, so maybe not exactly that - essentially, this is just a suggestion to think twice of any additional options.
As for the archive
, below is the proposed alteration and here's the counterexample link I could think of OTMH: IACR ePrint paper version record.
"target": "/p/|/archive", | |
"target": "/p/|/archive(?:$|\?)", |
@@ -49,7 +51,7 @@ function detectWeb(doc, url) { | |||
function getSearchResults(doc, checkOnly) { | |||
var items = {}; | |||
var found = false; | |||
var rows = doc.querySelectorAll('a.post-preview-title[href*="/p/"]'); | |||
var rows = doc.querySelectorAll('a[data-testid="post-preview-title"][href*="/p/"]'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good one - looks like this part is plain broken right now due to this change in the markup. In view of that, even if the pattern change doesn't get in (e.g., I can imagine maintainers proposing an alternative approach to keep the target
specialized due to performance reasons), this one is worthwhile as a standalone fix.
BTW, one more thing to note - I see somehow none of your PRs submitted that day got a workflow run (with the usual checks), so maybe something like rebasing on top of current |
I don't think this is the right approach. The Substack translator is very basic. On that test case, Embedded Metadata already gets the author name, abstract, and date. The item type is wrong - |
Remove the substack.com domain from the target regex to support Substacks with a custom domain
In detectWeb, manually check for the old target regex first, then for one of the Substack footer buttons in the page DOM. If neither matches, return false