[addition for LocalDocs: recent_2024janfeb news/events] ZIP file with +380 PDF news articles between 2024.01.01 and 2024.02.29 #2057
-
I'll try to do as cebtenzzre suggested (thanks) and post this on Discord, so that those subscribed to that thread are notified when updates are announced. In the meantime, this set will keep being updated: once more today, the 29th (in about an hour, I think), and 2 or 3 times tomorrow, March 1st.
-
The ZIP will stay up until ~19 GMT today, March 1st, so anyone who wants to download it for tests had better hurry. Afterwards I'll try to open a thread on Discord (as cebtenzzre recommended) for this, as we're already in March and things keep happening. I also intend to close this thread, if no moderator does so before then. Thank you.
-
I have deleted the ZIP; the intent is that a new one, with news from March 2024, will appear in the following days, at: Thank you for following this.
-
Two things: 1) Is there a way I can get 2023 and 2022 done? I mean, my version of Nous Hermes 2 thinks it's October 17, 2021! 2) I wonder if there's a way we can automate this in GPT4All?
-
Hello.
Please let me know if I should open a dedicated channel (with Nomic, perhaps) for such ZIPs with current events, so that I don't post the links here and clutter the Discussion list.
I have put together, for use as a LocalDocs collection, some 350 (three hundred plus) news articles and a few science papers between January 01, 2024 and February 29, 2024.
This set grows by the day (open the news item in the browser -> Print -> PDF printer, or Save if it is already a PDF, as the science papers are). I use it as a LocalDocs collection to ask the LLMs about recent events. For instance, if you are curious about the weather in Detroit, Mistral (at least) will respond using the PDF dated 2024.02.27 about the record-breaking 73 degrees in February; ask what land area is engulfed by flames in Texas, or who won the Michigan primaries on February 28, and you'll be told.
News sources: many, among which: Al Jazeera, Associated Press, Axios, Bloomberg, CNBC, CNN, Daily Mail (what), DNyuz, Fox, France24, Guardian (UK), Medium, NHK, NPR, PBS, Politico, reddit (yay), Semafor, The Sun (whatwhat), Wired, ZeroHedge.
Language of news: English (almost all items), other (Polish, Russian, Spanish, ~10 items, irrelevant)
The ZIP file:
as of this post:
example:
20240228-CoinTelegraph-OpenAI accuses New York Times of hacking AI models in copyright lawsuit.pdf
-- yyyyMMdd-nameofinfosource-titleofnewsarticle [(HHmmGMT)].pdf
-- comma , : when in numbers (4,000) deleted (4000); otherwise, replaced by " and" or " or"
-- semicolons ; replaced by " and" or " or"
-- colons : replaced by blank dash blank " - "
-- dash - replaced by blank dash blank " - " unless in complex words like "meta-analyses" where it remains as it is
-- apostrophes: when around a text ( 'this is cool' they said) : replaced by dash - (-this is cool- they said); when indicating a subject (California's) : left as it was (California's) ; when at the end of a word (Texas' rangers) : deleted (Texas rangers)
-- quotes (of any kind, including apostrophes used as quotes): replaced by dash -
-- uppercase first letter of words: when appropriate (1, when not the first word of title and 2, when not indicating Names) : changed to lowercase
-- ampersand & : changed to underscore _
-- question mark ? : deleted
-- percent sign % : replaced by " percent "
-- dot . : when in numbers (36.8) - replaced by underscore _ (36_8); when in abbreviations (U.S.) - deleted
-- Dollar sign: changed to "USD" after the sum of money (e.g. $5 million -> 5 million USD), where USD is appropriate/applicable (several countries use Dollars)
-- Euro sign: changed to "EUR", likewise
-- UK Pound sign changed to "Pounds", likewise
-- leading and trailing spaces/blanks: removed
-- the extension may be "pdf" or "PDF"
-- science papers, 1 author: yyyyMMdd-author-titleofsp.pdf
-- 2 authors: yyyyMMdd-author1--author2-titleofsp.pdf
-- more than 2 authors: yyyyMMdd-author1 et al-titleofsp.pdf
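For anyone who wants to script part of the naming rules above, here is a minimal Python sketch. It is not the tool used to build this collection (the files were renamed by hand); the names `sanitize_title`, `parse_news_filename` and `NewsFile` are hypothetical, and the sample headlines in the usage lines are invented. It applies only the purely mechanical rules (commas and dots in numbers, colons, ampersands, question marks, percent signs, double quotes, leading/trailing blanks). The rules that need human judgment (" and" vs " or", apostrophes, lowercasing, currency signs, abbreviations) are left out. The parser assumes a news-article name whose source contains no dash, and reads `[(HHmmGMT)]` as an optional " (HHmmGMT)" suffix; science-paper names with `author1--author2` are out of scope.

```python
import re
from datetime import date, datetime
from typing import NamedTuple, Optional


class NewsFile(NamedTuple):
    published: date
    source: str
    title: str
    time_gmt: Optional[str]  # "HHmm" if the optional suffix is present


def sanitize_title(title: str) -> str:
    """Apply only the mechanical subset of the renaming rules."""
    t = title
    t = re.sub(r"(?<=\d),(?=\d)", "", t)    # 4,000 -> 4000
    t = re.sub(r"(?<=\d)\.(?=\d)", "_", t)  # 36.8 -> 36_8
    t = re.sub(r"\s*:\s*", " - ", t)        # colon -> blank dash blank
    t = t.replace("&", "_")                 # ampersand -> underscore
    t = t.replace("?", "")                  # question mark deleted
    t = re.sub(r"\s*%\s*", " percent ", t)  # percent sign -> " percent "
    t = re.sub(r'["“”]', "-", t)            # double quotes -> dash
    t = re.sub(r"\s+", " ", t)              # collapse runs of blanks
    return t.strip()                        # no leading/trailing blanks


# Assumed layout: yyyyMMdd-source-title[ (HHmmGMT)].pdf or .PDF
_PATTERN = re.compile(
    r"^(?P<date>\d{8})-(?P<source>[^-]+)-(?P<title>.+?)"
    r"(?: \((?P<time>\d{4})GMT\))?\.(?:pdf|PDF)$"
)


def parse_news_filename(name: str) -> Optional[NewsFile]:
    """Return the parts of a news-article filename, or None if it does not fit."""
    m = _PATTERN.match(name)
    if m is None:
        return None
    try:
        published = datetime.strptime(m.group("date"), "%Y%m%d").date()
    except ValueError:  # eight digits that are not a real calendar date
        return None
    return NewsFile(published, m.group("source"), m.group("title"), m.group("time"))


if __name__ == "__main__":
    print(sanitize_title("Fed holds rates at 5.5%: what now?"))
    print(parse_news_filename(
        "20240228-CoinTelegraph-OpenAI accuses New York Times of hacking "
        "AI models in copyright lawsuit.pdf"
    ))
```

Parsing the date out of the name could be used, for example, to sort the PDFs or to pick a single month before adding a folder as a LocalDocs collection.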
Everyone interested can download it from my website (which is old and legit to boot), at the address:
[http://sinapsaro.ro/frees/_infos/_localdocs_for_gpt4all_news_2024jan01feb29.zip]
If it cannot be found (404), this most probably means that it is being uploaded at that moment; wait a few minutes until the upload finishes.
It will be deleted on 2024.03.01, at ~17 UTC/GMT.
So maybe someone who downloads it will want to share it, too, or use it to train an LLM...
Those interested should download it again from time to time, as it is updated quite frequently (every 6 hours or so).
If anyone is interested in other such packages, feel free to let me know: the "archive" goes back to early 2003 (although most of the files until ~2010 are .mht / .mhtml).
Thank you.