WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302

Status: Open. Wants to merge 45 commits into master.

Changes shown below are from 21 of the 45 commits.

Commits (45):
5c1c08f  Implement query_js.py using Selenium (Jun 5, 2020)
9564c6d  make headless (Jun 5, 2020)
7f7b753  misc fixes (Jun 5, 2020)
3cc39c3  fix retries (Jun 5, 2020)
f0a47c6  fix typo, make headless (Jun 5, 2020)
c5a61aa  sleep less (Jun 5, 2020)
2dfb19d  include @abhisheksaxena1998's fix (Jun 5, 2020)
9246ad5  allow interoperability between old and js (Jun 5, 2020)
56c88f8  peg selenium-wire version (Jun 5, 2020)
9b9a218  remove get_query_url (Jun 5, 2020)
38fb35e  remove unused functions (Jun 5, 2020)
c226cdd  misc fixes (Jun 6, 2020)
59b7f5f  implement limit for selenium (Jun 6, 2020)
3c256d8  remove requests requirement (Jun 7, 2020)
8786a57  fix misleading log line (Jun 7, 2020)
b584d92  enable fetching all history from user (Jun 7, 2020)
4a0277d  enable full lookback on profile by searching for nativeretweets (Jun 7, 2020)
ede0303  fix get user data (Jun 7, 2020)
9b47cce  fix query.py get_user_info (Jun 7, 2020)
8db4b85  fix user.py so it allows 0 follower 0 like accounts to work (Jun 7, 2020)
197435c  add test cases (failing rn) (Jul 23, 2020)
27ef71c  filter irrelevant dateranges and convert to tweet (Sep 20, 2020)
22b6278  fix missing return (Sep 20, 2020)
c804799  fix type error (Sep 20, 2020)
67fb182  make apis consistent, upgrade selenium, disable verify_ssl (Sep 21, 2020)
a81e4af  date range fix (Sep 22, 2020)
a73b339  add geckodriver (Sep 23, 2020)
c6081e3  upgrade geckodriver (Sep 23, 2020)
4fd12c8  Merge remote-tracking branch 'upstream/master' into selenium (Sep 23, 2020)
954c696  implement use_proxy (Sep 23, 2020)
d72dde5  fix logger (Sep 24, 2020)
faf75af  fix logging (Sep 24, 2020)
7e28f6b  Update Dockerfile (webcoderz, Sep 24, 2020)
2a7d9d5  Merge pull request #1 from webcoderz/patch-2 (Sep 24, 2020)
0a05ad7  merge user query (Sep 24, 2020)
d39a47a  Update Dockerfile (webcoderz, Sep 24, 2020)
34367fb  Update Dockerfile (webcoderz, Sep 24, 2020)
bb31bdf  Merge pull request #3 from webcoderz/patch-5 (Sep 24, 2020)
eea4954  updating dockerfile with firefox dependencies (webcoderz, Sep 25, 2020)
9f8bf13  updating dockerfile with firefox dependencies (webcoderz, Sep 25, 2020)
7ad6f54  Merge pull request #4 from webcoderz/selenium (Sep 25, 2020)
d6cd1d8  do multiple-passes, fix proxy, faster scrolling (Sep 26, 2020)
dc816cd  Merge branch 'selenium' of github.com:lapp0/twitterscraper into selenium (Sep 26, 2020)
a6a76f7  refactor browser & scraping, get best proxies, add test_simple_js (Sep 27, 2020)
8d665d0  remove unused imports, increase timeout (Sep 28, 2020)
1 change: 1 addition & 0 deletions requirements.txt
@@ -3,3 +3,4 @@ bs4
lxml
requests
billiard
selenium-wire==1.0.1
6 changes: 2 additions & 4 deletions twitterscraper/__init__.py
@@ -10,9 +10,7 @@
__license__ = 'MIT'


from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.tweet import Tweet
from twitterscraper.query import query_tweets, query_tweets_from_user, query_user_info
from twitterscraper.query_js import get_user_data, get_query_data
from twitterscraper.user import User
from twitterscraper.ts_logger import logger as ts_logger
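
For context, a minimal usage sketch of the two JS-based entry points exported above. Parameter names match the query_js.py diff further down; the query string, dates, and limits here are illustrative, and the result is a dict of id-keyed records rather than a list of Tweet objects (see the review discussion below).

import datetime as dt
from twitterscraper import get_query_data, get_user_data

# Scrape a search query through a real headless browser (Selenium + selenium-wire).
# Each 'globalObjects' section ('tweets', 'users', ...) maps an id to its raw fields.
data = get_query_data(
    queries=['#python'],
    begindate=dt.date(2020, 9, 1),
    enddate=dt.date(2020, 9, 7),
    poolsize=2,
    limit=100,
    lang='en',
)
print(len(data['tweets']), 'tweets collected')

# Same idea for a single profile, including native retweets:
profile_data = get_user_data('twitter', begindate=dt.date(2020, 9, 1), enddate=dt.date(2020, 9, 7))
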
40 changes: 29 additions & 11 deletions twitterscraper/main.py
@@ -8,9 +8,7 @@
import datetime as dt
from os.path import isfile
from pprint import pprint
from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper import query_js, query
from twitterscraper.ts_logger import logger


@@ -57,12 +55,14 @@ def main():
"This may take a while. You can increase the number of parallel"
"processes depending on the computational power you have.")
parser.add_argument("-c", "--csv", action='store_true',
help="Set this flag if you want to save the results to a CSV format.")
help="Set this flag if you want to save the results to a CSV format.")
parser.add_argument("-j", "--javascript", action='store_true',
help="Set this flag if you want to request using javascript via Selenium.")
parser.add_argument("-u", "--user", action='store_true',
help="Set this flag to if you want to scrape tweets from a specific user"
"The query should then consist of the profilename you want to scrape without @")
parser.add_argument("--profiles", action='store_true',
help="Set this flag to if you want to scrape profile info of all the users where you"
help="Set this flag to if you want to scrape profile info of all the users where you"
"have previously scraped from. After all of the tweets have been scraped it will start"
"a new process of scraping profile pages.")
parser.add_argument("--lang", type=str, default=None,
@@ -98,14 +98,31 @@ def main():
exit(-1)

if args.all:
args.begindate = dt.date(2006,3,1)
args.begindate = dt.date(2006, 3, 1)

if args.user:
tweets = query_tweets_from_user(user = args.query, limit = args.limit)
if args.javascript:
tweets = query_js.get_user_data(
from_user=args.query, limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang
)['tweets']
else:
tweets = query.query_tweets_from_user(user=args.query, limit=args.limit)

else:
tweets = query_tweets(query = args.query, limit = args.limit,
begindate = args.begindate, enddate = args.enddate,
poolsize = args.poolsize, lang = args.lang)
if args.javascript:
tweets = query_js.get_query_data(
queries=[args.query], limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang
)['tweets']
Owner:
Since query_js.get_query_data() no longer returns a list of Tweet objects (like query.query_tweets() does) but returns a dictionary with the tweet-id as the key and a dump of the tweet object as the value, saving it to CSV no longer works.

Collaborator Author:
Agreed, converting this list of dicts to Tweet objects is necessary before merging.

Collaborator Author:
Resolved in 27ef71c.

else:
tweets = query.query_tweets(
query=args.query, limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang
)

if args.dump:
pprint([tweet.__dict__ for tweet in tweets])
@@ -136,7 +153,8 @@ def main():
json.dump(tweets, output, cls=JSONEncoder)
if args.profiles and tweets:
list_users = list(set([tweet.username for tweet in tweets]))
list_users_info = [query_user_info(elem) for elem in list_users]
# Note: this has no query_js equivalent!
list_users_info = [query.query_user_info(elem) for elem in list_users]
filename = 'userprofiles_' + args.output
with open(filename, "w", encoding="utf-8") as output:
json.dump(list_users_info, output, cls=JSONEncoder)
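
Following up on the review thread above about CSV output: a minimal sketch of the kind of conversion being discussed, turning the JS scraper's {tweet_id: field_dict} mapping into attribute-style objects that downstream serialization code can iterate over. SimpleNamespace is a stand-in for the project's Tweet class (the actual fix landed in 27ef71c), and the raw field names come from Twitter's JSON, so they will not exactly match the old Tweet attributes.

from types import SimpleNamespace

def tweet_dicts_to_objects(js_data):
    # js_data is the dict returned by query_js.get_query_data();
    # js_data['tweets'] maps a tweet id to the raw tweet fields.
    return [SimpleNamespace(**{'tweet_id': tweet_id, **fields})
            for tweet_id, fields in js_data.get('tweets', {}).items()]
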
16 changes: 7 additions & 9 deletions twitterscraper/query.py
@@ -26,7 +26,7 @@
'Mozilla/5.0 (Windows NT 5.2; RW; rv:7.0a1) Gecko/20091211 SeaMonkey/9.23a1pre'
]

HEADER = {'User-Agent': random.choice(HEADERS_LIST)}
HEADER = {'User-Agent': random.choice(HEADERS_LIST), 'X-Requested-With': 'XMLHttpRequest'}
logger.info(HEADER)

INIT_URL = 'https://twitter.com/search?f=tweets&vertical=default&q={q}&l={lang}'
@@ -49,8 +49,8 @@ def get_proxies():
list_ip = [elem[0].text for elem in list_td]
list_ports = [elem[1].text for elem in list_td]
list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))]
return list_proxies
return list_proxies

def get_query_url(query, lang, pos, from_user = False):
if from_user:
if pos is None:
@@ -109,7 +109,7 @@ def query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60):
pos = json_resp['min_position']
has_more_items = json_resp['has_more_items']
if not has_more_items:
logger.info("Twitter returned : 'has_more_items' ")
logger.info("Twitter response: 'has_more_items' == False ")
return [], None
else:
pos = None
@@ -210,10 +210,10 @@ def query_tweets_once(*args, **kwargs):

def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang=''):
no_days = (enddate - begindate).days

if(no_days < 0):
sys.exit('Begin date must occur before end date.')

if poolsize > no_days:
# Since we are assigning each pool a range of dates to query,
# the number of pools should not exceed the number of dates.
@@ -319,8 +319,6 @@ def query_user_info(user):

:param user: the twitter user to web scrape its twitter page info
"""


try:
user_info = query_user_page(INIT_URL_USER.format(u=user))
if user_info:
@@ -333,4 +331,4 @@
logger.exception("An unknown error occurred! Returning user information gathered so far...")

logger.info("Got user information from username {}".format(user))
return user_info
return None
197 changes: 197 additions & 0 deletions twitterscraper/query_js.py
@@ -0,0 +1,197 @@
from collections import defaultdict
import requests
import datetime as dt
import time
import sys

from functools import lru_cache, partial
from billiard.pool import Pool
from bs4 import BeautifulSoup
from itertools import cycle

from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.keys import Keys

from twitterscraper.ts_logger import logger


INIT_URL = 'https://twitter.com/search?f=live&vertical=default&q={q}&l={lang}'
INIT_URL_USER = 'https://twitter.com/{u}'
PROXY_URL = 'https://free-proxy-list.net/'


@lru_cache(1)
def get_proxy_pool():
response = requests.get(PROXY_URL)
soup = BeautifulSoup(response.text, 'lxml')
table = soup.find('table', id='proxylisttable')
list_tr = table.find_all('tr')
list_td = [elem.find_all('td') for elem in list_tr]
list_td = list(filter(None, list_td))
list_ip = [elem[0].text for elem in list_td]
list_ports = [elem[1].text for elem in list_td]
list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))]
return cycle(list_proxies)


def get_driver(proxy=None, timeout=10):
profile = webdriver.FirefoxProfile()
if proxy:
profile.set_preference("network.proxy.http", proxy)

opt = Options()
opt.headless = True

driver = webdriver.Firefox(profile, options=opt)
driver.implicitly_wait(timeout)

return driver


def linspace(start, stop, n):
if n == 1:
yield stop
return
h = (stop - start) / (n - 1)
for i in range(n):
yield start + h * i


def query_single_page(url, retry=50, from_user=False, timeout=60, use_proxy=True, limit=None):
"""
Returns tweets from the given URL.
:param query: The query url
:param retry: Number of retries if something goes wrong.
:param use_proxy: Determines whether to fetch tweets with proxy
:param limit: Max number of tweets to get
:return: Twitter dict containing tweets users, locations, and other metadata
"""
limit = limit or float('inf')

logger.info('Scraping tweets from {}'.format(url))

proxy_pool = get_proxy_pool() if use_proxy else cycle([None])

proxy = next(proxy_pool)
logger.info('Using proxy {}'.format(proxy))
driver = get_driver(proxy)

try:
data = defaultdict(dict)
already_idxs = set()

# page down, recording the results, until there isn't anything new or limit has been breached
driver.get(url)
retries = 20
while retries > 0 and len(data['tweets']) < limit:

# relevant requests have complete responses, json in their path (but not guide.json), and a globalObjects key
relevant_request_idxs = set([
i for i, r in enumerate(driver.requests)
if 'json' in r.path and 'guide.json' not in r.path and
r.response is not None and isinstance(r.response.body, dict) and
'globalObjects' in r.response.body and i not in already_idxs
])
already_idxs |= relevant_request_idxs

if not relevant_request_idxs:
time.sleep(0.2)
retries -= 1
continue

# if no relevant requests, or latest relevant request isn't done loading, wait then check again
latest_tweets = driver.requests[max(relevant_request_idxs)].response.body['globalObjects']['tweets']
if len(relevant_request_idxs) == 0 or not latest_tweets:
time.sleep(0.2)
retries -= 1
continue

# scroll down
actions = ActionChains(driver)
for _ in range(100):
actions.send_keys(Keys.PAGE_DOWN)
actions.perform()

# record relevant responses
for idx in relevant_request_idxs:
driver.requests[idx]
for key, value in driver.requests[idx].response.body['globalObjects'].items():
data[key].update(value)

# reset retries
retries = 20

return data

except Exception as e:
logger.exception('Exception {} while requesting "{}"'.format(
e, url))
finally:
driver.quit()

if retry > 0:
logger.debug('Retrying... (Attempts left: {})'.format(retry))
return query_single_page(url, retry - 1)
logger.error('Giving up.')
return defaultdict(dict)


def get_query_data(queries, limit=None, begindate=None, enddate=None, poolsize=None, lang=''):
begindate = begindate or dt.date(2006, 3, 21)
enddate = enddate or dt.date.today()
poolsize = poolsize or 5

num_days = (enddate - begindate).days

if(num_days < 0):
sys.exit('Begin date must occur before end date.')

if poolsize > num_days:
# Since we are assigning each pool a range of dates to query,
# the number of pools should not exceed the number of dates.
poolsize = num_days
# query one day at a time so driver doesn't use too much memory
dateranges = list(reversed([begindate + dt.timedelta(days=elem) for elem in linspace(0, num_days, num_days)]))
Owner (@taspinar, Jul 23, 2020):
linspace(0, num_days, num_days) always creates num_days separate queries instead of poolsize+1 separate queries.
Please have a look at query.py to see how it was originally implemented:
dateranges = [begindate + dt.timedelta(days=elem) for elem in linspace(0, num_days, poolsize+1)]

Collaborator Author:
This behavior is different because

  1. the pool size is limited, so it doesn't matter how many items we have in our list;
  2. we want more granularity in case loading a history fails. If we have 1000 days, and twitterscraper fails to load a day 10% of the time, we'll run twitterscraper for a single day about 1,100 times on average. If our interval is one month instead of one day, we'll succeed at retrieving a month only 4% (90%^30) of the time, resulting in roughly 20,000 work units instead of 1,100;
  3. we save precious memory by only going one day at a time. Emulated browsers store increasingly more state as we scroll further.

Owner:
Sounds reasonable.
Do you think creating a separate query for each day also has the same benefits (indicated under 2) for regular old query.py?

Collaborator Author:
I don't think it's necessary for old query.py because it already shards and retries on a granular level, and there isn't the risk of needing to "restart" from the beginning of the day. Perhaps it would be nice for the two APIs to be consistent, however.

urls = []
for until, since in zip(dateranges[:-1], dateranges[1:]):
for query in queries:
query_str = '{} since:{} until:{}'.format(query, since, until)
urls.append(INIT_URL.format(q=query_str, lang=lang))
logger.info('query: {}'.format(query_str))

return retrieve_data_from_urls(urls, limit=limit, poolsize=poolsize)


def get_user_data(from_user, *args, **kwargs):
# include retweets
retweet_query = f'filter:nativeretweets from:{from_user}'
no_retweet_query = f'from:{from_user}'
return get_query_data([retweet_query, no_retweet_query], *args, **kwargs)


def retrieve_data_from_urls(urls, limit, poolsize):
# send query urls to multiprocessing pool, and aggregate
if limit and poolsize:
limit_per_pool = (limit // poolsize) + 1
else:
limit_per_pool = None

all_data = defaultdict(dict)
try:
pool = Pool(poolsize)
try:
for new_data in pool.imap_unordered(partial(query_single_page, limit=limit_per_pool), urls):
for key, value in new_data.items():
all_data[key].update(value)
logger.info('Got {} data ({} new).'.format(
len(all_data['tweets']), len(new_data['tweets'])))
except KeyboardInterrupt:
logger.debug('Program interrupted by user. Returning all tweets gathered so far.')
finally:
pool.close()
pool.join()

return all_data