WIP: Use Selenium to Enable Javascript / Real-Browser Scraping + Misc Fixes #302

Open

wants to merge 45 commits into master

Commits (45)
5c1c08f
Implement query_js.py using Selenium
Jun 5, 2020
9564c6d
make headless
Jun 5, 2020
7f7b753
misc fixes
Jun 5, 2020
3cc39c3
fix retries
Jun 5, 2020
f0a47c6
fix typo, make headless
Jun 5, 2020
c5a61aa
sleep less
Jun 5, 2020
2dfb19d
include @abhisheksaxena1998's fix
Jun 5, 2020
9246ad5
allow interoperability between old and js
Jun 5, 2020
56c88f8
peg selenium-wire version
Jun 5, 2020
9b9a218
remove get_query_url
Jun 5, 2020
38fb35e
remove unused functions
Jun 5, 2020
c226cdd
misc fixes
Jun 6, 2020
59b7f5f
implement limit for selenium
Jun 6, 2020
3c256d8
remove requests requirement
Jun 7, 2020
8786a57
fix misleading log line
Jun 7, 2020
b584d92
enable fetching all history from user
Jun 7, 2020
4a0277d
enable full lookback on profile by searching for nativeretweets
Jun 7, 2020
ede0303
fix get user data
Jun 7, 2020
9b47cce
fix query.py get_user_info
Jun 7, 2020
8db4b85
fix user.py so it allows 0 follower 0 like accounts to work
Jun 7, 2020
197435c
add test cases (failing rn)
Jul 23, 2020
27ef71c
filter irrelevant dateranges and convert to tweet
Sep 20, 2020
22b6278
fix missing return
Sep 20, 2020
c804799
fix type error
Sep 20, 2020
67fb182
make apis consistent, upgrade selenium, disable verify_ssl
Sep 21, 2020
a81e4af
date range fix
Sep 22, 2020
a73b339
add geckodriver
Sep 23, 2020
c6081e3
upgrade geckodriver
Sep 23, 2020
4fd12c8
Merge remote-tracking branch 'upstream/master' into selenium
Sep 23, 2020
954c696
implement use_proxy
Sep 23, 2020
d72dde5
fix logger
Sep 24, 2020
faf75af
fix logging
Sep 24, 2020
7e28f6b
Update Dockerfile
webcoderz Sep 24, 2020
2a7d9d5
Merge pull request #1 from webcoderz/patch-2
Sep 24, 2020
0a05ad7
merge user query
Sep 24, 2020
d39a47a
Update Dockerfile
webcoderz Sep 24, 2020
34367fb
Update Dockerfile
webcoderz Sep 24, 2020
bb31bdf
Merge pull request #3 from webcoderz/patch-5
Sep 24, 2020
eea4954
updating dockerfile with firefox dependencies
webcoderz Sep 25, 2020
9f8bf13
updating dockerfile with firefox dependencies
webcoderz Sep 25, 2020
7ad6f54
Merge pull request #4 from webcoderz/selenium
Sep 25, 2020
d6cd1d8
do multiple-passes, fix proxy, faster scrolling
Sep 26, 2020
dc816cd
Merge branch 'selenium' of github.com:lapp0/twitterscraper into selenium
Sep 26, 2020
a6a76f7
refactor browser & scraping, get best proxies, add test_simple_js
Sep 27, 2020
8d665d0
remove unused imports, increase timeout
Sep 28, 2020
34 changes: 34 additions & 0 deletions Dockerfile
@@ -1,4 +1,38 @@
FROM python:3.7-alpine

RUN apt-get update \
&& apt-get install -y --no-install-recommends wget libgtk-3-dev libdbus-glib-1-2 \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*


ARG FIREFOX_VERSION=latest
RUN FIREFOX_DOWNLOAD_URL=$(if [ $FIREFOX_VERSION = "latest" ] || [ $FIREFOX_VERSION = "nightly-latest" ] || [ $FIREFOX_VERSION = "devedition-latest" ] || [ $FIREFOX_VERSION = "esr-latest" ]; then echo "https://download.mozilla.org/?product=firefox-$FIREFOX_VERSION-ssl&os=linux64&lang=en-US"; else echo "https://download-installer.cdn.mozilla.net/pub/firefox/releases/$FIREFOX_VERSION/linux-x86_64/en-US/firefox-$FIREFOX_VERSION.tar.bz2"; fi) \
&& apt-get update -qqy \
&& apt-get -qqy --no-install-recommends install libavcodec-extra \
&& rm -rf /var/lib/apt/lists/* /var/cache/apt/* \
&& wget --no-verbose -O /tmp/firefox.tar.bz2 $FIREFOX_DOWNLOAD_URL \
&& tar -C /opt -xjf /tmp/firefox.tar.bz2 \
&& rm /tmp/firefox.tar.bz2 \
&& mv /opt/firefox /opt/firefox-$FIREFOX_VERSION \
&& ln -fs /opt/firefox-$FIREFOX_VERSION/firefox /usr/bin/firefox

#============
# GeckoDriver
#============
ARG GECKODRIVER_VERSION=latest
RUN GK_VERSION=$(if [ ${GECKODRIVER_VERSION:-latest} = "latest" ]; then echo "0.27.0"; else echo $GECKODRIVER_VERSION; fi) \
&& echo "Using GeckoDriver version: "$GK_VERSION \
&& wget --no-verbose -O /tmp/geckodriver.tar.gz https://github.com/mozilla/geckodriver/releases/download/v$GK_VERSION/geckodriver-v$GK_VERSION-linux64.tar.gz \
&& rm -rf /opt/geckodriver \
&& tar -C /opt -zxf /tmp/geckodriver.tar.gz \
&& rm /tmp/geckodriver.tar.gz \
&& mv /opt/geckodriver /opt/geckodriver-$GK_VERSION \
&& cp /opt/geckodriver-$GK_VERSION /bin \
&& chmod 755 /opt/geckodriver-$GK_VERSION \
&& ln -fs /opt/geckodriver-$GK_VERSION /usr/bin/geckodriver \
&& ln -fs /opt/geckodriver-$GK_VERSION /usr/bin/wires
# twitterscraper
RUN apk add --update --no-cache g++ gcc libxslt-dev
COPY . /app
WORKDIR /app
1 change: 1 addition & 0 deletions requirements.txt
@@ -3,3 +3,4 @@ bs4
lxml
requests
billiard
selenium-wire
6 changes: 2 additions & 4 deletions twitterscraper/__init__.py
@@ -10,8 +10,6 @@
__license__ = 'MIT'


from twitterscraper.query import query_tweets
from twitterscraper.query import query_tweets_from_user
from twitterscraper.query import query_user_info
from twitterscraper.tweet import Tweet
from twitterscraper.query import query_tweets, query_tweets_from_user, query_user_info
from twitterscraper.query_js import get_user_data, get_query_data
from twitterscraper.user import User
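The new JavaScript-backed entry points are now exported at package level alongside the legacy query functions. A minimal usage sketch follows; the keyword arguments mirror how main.py invokes query_js below, and the return shapes are assumptions drawn from that usage (main.py reads get_user_data's result via its 'tweets' key, while get_query_data is consumed directly as a list of tweets):

import datetime as dt
from twitterscraper import get_query_data, get_user_data

# Search-based scraping through the Selenium/JS code path (arguments assumed
# from the query_js calls in main.py in this PR).
tweets = get_query_data(
    query='#python', limit=100,
    begindate=dt.date(2020, 1, 1), enddate=dt.date.today(),
    poolsize=5, lang='en', use_proxy=False,
)

# Per-user scraping; main.py indexes the returned dict with ['tweets'].
user_tweets = get_user_data(
    from_user='twitter', limit=100,
    begindate=dt.date(2020, 1, 1), enddate=dt.date.today(),
    poolsize=5, lang='en', use_proxy=False,
)['tweets']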
112 changes: 112 additions & 0 deletions twitterscraper/browser.py
@@ -0,0 +1,112 @@
import requests
from functools import lru_cache
from itertools import cycle
from bs4 import BeautifulSoup
from threading import Thread
from random import shuffle

from seleniumwire import webdriver
from selenium.webdriver.firefox.options import Options

import logging
logger = logging.getLogger('twitterscraper')


PROXY_URL = 'https://free-proxy-list.net/'
NYT_LOGO_URL = 'https://pbs.twimg.com/profile_images/1098244578472280064/gjkVMelR_normal.png'


def get_proxy_delay(proxy, result, max_time=10):
    try:
        response = requests.post(
            NYT_LOGO_URL,
            proxies={'https': f'https://{proxy}/'},
            timeout=max_time
        )
    except Exception:
        result[proxy] = None
    else:
        result[proxy] = response.elapsed.total_seconds()


@lru_cache(1)
def get_best_proxies(proxies):
    logger.info('Pinging twitter to find best proxies')
    threads = []
    result = {}
    # Measure each proxy's latency against a Twitter-hosted image, one thread per proxy.
    for proxy in proxies:
        process = Thread(target=get_proxy_delay, args=[proxy, result])
        process.start()
        threads.append(process)
    for process in threads:
        process.join()

    # ensure at least one proxy took less than max_time (failed proxies map to None)
    assert len(set(result.values())) > 1

    result = {k: v for k, v in result.items() if v}
    best_proxies = [x[0] for x in sorted(result.items(), key=lambda x: x[1])]
    return best_proxies[:int(len(best_proxies)**0.5)]  # best sqrt(N) of N working proxies


@lru_cache(1)
def get_proxies():
    response = requests.get(PROXY_URL)
    soup = BeautifulSoup(response.text, 'lxml')
    table = soup.find('table', id='proxylisttable')
    list_tr = table.find_all('tr')
    list_td = [elem.find_all('td') for elem in list_tr]
    list_td = list(filter(None, list_td))
    list_ip = [elem[0].text for elem in list_td]
    list_ports = [elem[1].text for elem in list_td]
    list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))]
    return list_proxies


def get_proxy_pool():
    # TODO: cache this on disk so reruns aren't required
    best_proxies = get_best_proxies(
        tuple(get_proxies())
    )
    shuffle(best_proxies)
    return cycle(best_proxies)


@lru_cache(1)
def get_ublock():
    pass
    # download ublock here


def get_driver(proxy=None, timeout=30):
    profile = webdriver.FirefoxProfile()
    profile.set_preference("http.response.timeout", 5)

    seleniumwire_options = {'verify_ssl': False}
    if proxy:
        seleniumwire_options['suppress_connection_errors'] = False
        seleniumwire_options['proxy'] = {
            'https': f'https://{proxy}',
            'http': f'http://{proxy}',
        }

    opt = Options()
    opt.headless = True

    driver = webdriver.Firefox(
        firefox_profile=profile,
        options=opt,
        seleniumwire_options=seleniumwire_options
    )

    """
    TODO: install ublock here
    get_ublock()
    extensions.ublock0.adminSettings = best settings for twitter here
    browser.install_addon(extension_dir + extension, temporary=True)
    """

    driver.set_page_load_timeout(timeout)

    return driver
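browser.py only provides the plumbing; a rough sketch of how it is presumably driven by query_js, using just the helpers defined above. The target URL and the teardown flow here are illustrative assumptions, not taken from this diff:

from twitterscraper.browser import get_driver, get_proxy_pool

# Rotate through the fastest free proxies selected by get_best_proxies().
proxy_pool = get_proxy_pool()

driver = get_driver(proxy=next(proxy_pool), timeout=30)
try:
    # Illustrative URL only; query_js builds its own search URLs.
    driver.get('https://twitter.com/search?q=python&f=live')
    html = driver.page_source
finally:
    driver.quit()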
45 changes: 36 additions & 9 deletions twitterscraper/main.py
@@ -10,6 +10,9 @@
from os.path import isfile
from pprint import pprint

from twitterscraper import query_js, query


from twitterscraper.query import (query_tweets, query_tweets_from_user,
query_user_info)

@@ -65,12 +68,14 @@ def main():
"This may take a while. You can increase the number of parallel"
"processes depending on the computational power you have.")
parser.add_argument("-c", "--csv", action='store_true',
help="Set this flag if you want to save the results to a CSV format.")
help="Set this flag if you want to save the results to a CSV format.")
parser.add_argument("-j", "--javascript", action='store_true',
help="Set this flag if you want to request using javascript via Selenium.")
parser.add_argument("-u", "--user", action='store_true',
help="Set this flag to if you want to scrape tweets from a specific user"
"The query should then consist of the profilename you want to scrape without @")
parser.add_argument("--profiles", action='store_true',
help="Set this flag to if you want to scrape profile info of all the users where you"
help="Set this flag to if you want to scrape profile info of all the users where you"
"have previously scraped from. After all of the tweets have been scraped it will start"
"a new process of scraping profile pages.")
parser.add_argument("--lang", type=str, default=None,
@@ -113,14 +118,33 @@ def main():
exit(-1)

if args.all:
args.begindate = dt.date(2006,3,1)
args.begindate = dt.date(2006, 3, 1)

if args.user:
tweets = query_tweets_from_user(user = args.query, limit = args.limit, use_proxy = not args.disableproxy)
if args.javascript:
tweets = query_js.get_user_data(
from_user=args.query, limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang, use_proxy=not args.disableproxy
)['tweets']
else:
tweets = query.query_tweets_from_user(user=args.query, limit=args.limit, use_proxy=not args.disableproxy)

else:
tweets = query_tweets(query = args.query, limit = args.limit,
begindate = args.begindate, enddate = args.enddate,
poolsize = args.poolsize, lang = args.lang, use_proxy = not args.disableproxy)
if args.javascript:
tweets = query_js.get_query_data(
query=args.query, limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang,
use_proxy=not args.disableproxy
)
else:
tweets = query.query_tweets(
query=args.query, limit=args.limit,
begindate=args.begindate, enddate=args.enddate,
poolsize=args.poolsize, lang=args.lang,
use_proxy=not args.disableproxy
)

if args.dump:
pprint([tweet.__dict__ for tweet in tweets])
@@ -151,8 +175,11 @@ def main():
json.dump(tweets, output, cls=JSONEncoder)
if args.profiles and tweets:
list_users = list(set([tweet.username for tweet in tweets]))
list_users_info = [query_user_info(elem, not args.disableproxy) for elem in list_users]
filename = 'userprofiles_' + args.output

# Note: this has no query_js equivalent!
list_users_info = [query.query_user_info(elem, not args.disableproxy) for elem in list_users]

filename = 'userprofiles_' + args.output
with open(filename, "w", encoding="utf-8") as output:
json.dump(list_users_info, output, cls=JSONEncoder)
except KeyboardInterrupt:
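Because query_js has no profile-scraping counterpart yet (note the comment in the diff above), the --profiles path still goes through the legacy requests-based query_user_info even when tweets were collected via Selenium. A minimal sketch of that mixed flow, assuming tweets carry a username attribute as main.py's usage implies:

import datetime as dt
from twitterscraper import get_query_data, query_user_info

# Tweets via the new Selenium/JS path (arguments assumed from main.py).
tweets = get_query_data(
    query='python', limit=50,
    begindate=dt.date(2020, 6, 1), enddate=dt.date.today(),
    poolsize=2, lang='en', use_proxy=False,
)

# Profile lookups still use the old code path (no query_js equivalent).
usernames = {tweet.username for tweet in tweets}
profiles = [query_user_info(name, False) for name in usernames]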
14 changes: 6 additions & 8 deletions twitterscraper/query.py
@@ -52,8 +52,8 @@ def get_proxies():
list_ip = [elem[0].text for elem in list_td]
list_ports = [elem[1].text for elem in list_td]
list_proxies = [':'.join(elem) for elem in list(zip(list_ip, list_ports))]
return list_proxies
return list_proxies

def get_query_url(query, lang, pos, from_user = False):
if from_user:
if pos is None:
@@ -116,7 +116,7 @@ def query_single_page(query, lang, pos, retry=50, from_user=False, timeout=60, u
pos = json_resp['min_position']
has_more_items = json_resp['has_more_items']
if not has_more_items:
logger.info("Twitter returned : 'has_more_items' ")
logger.info("Twitter response: 'has_more_items' == False ")
return [], None
else:
pos = None
@@ -217,10 +217,10 @@

def query_tweets(query, limit=None, begindate=dt.date(2006, 3, 21), enddate=dt.date.today(), poolsize=20, lang='', use_proxy=True):
no_days = (enddate - begindate).days

if(no_days < 0):
sys.exit('Begin date must occur before end date.')

if poolsize > no_days:
# Since we are assigning each pool a range of dates to query,
# the number of pools should not exceed the number of dates.
@@ -329,8 +329,6 @@ def query_user_info(user, use_proxy=True):

:param user: the twitter user to web scrape its twitter page info
"""


try:
user_info = query_user_page(INIT_URL_USER.format(u=user), use_proxy=use_proxy)
if user_info:
@@ -343,4 +341,4 @@ def query_user_info(user, use_proxy=True):
logger.exception("An unknown error occurred! Returning user information gathered so far...")

logger.info("Got user information from username {}".format(user))
return user_info
return None