- Objective: To scrape and analyse popular GitHub users in Sydney and their repository data
- Quiz ID: Sydney:100
A Colab (Python) script was created to perform 3 major steps:
- fetch the data, by invoking multiple GitHub APIs such as `GET /search/users` and `GET /users/{userId}/repos`
- process the data, e.g. stripping whitespace and removing leading `@` symbols
- store the data in two separate CSV files for further analysis
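
The process and store steps can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the helper names `clean_text` and `write_rows`, the sample record, and the column list are all assumptions made for the example.

```python
import csv
import io


def clean_text(value):
    """Strip surrounding whitespace and any leading '@' symbols; None becomes ''."""
    if value is None:
        return ""
    return value.strip().lstrip("@").strip()


def write_rows(rows, fieldnames, stream):
    """Write a list of dicts as CSV, keeping only the listed columns."""
    writer = csv.DictWriter(stream, fieldnames=fieldnames)
    writer.writeheader()
    for row in rows:
        writer.writerow({name: row.get(name, "") for name in fieldnames})


# Hypothetical record shaped like a GitHub user object (values are made up).
raw_user = {"login": "octocat", "company": "  @github ", "followers": 120}
user = dict(raw_user, company=clean_text(raw_user["company"]))

# An in-memory buffer stands in for the real CSV file here.
buffer = io.StringIO()
write_rows([user], ["login", "company", "followers"], buffer)
print(buffer.getvalue())
```

To write to disk instead, the buffer could be replaced with `open("users.csv", "w", newline="", encoding="utf-8")`.
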
- COVID-19 impact?: In the last 16 years of GitHub's history, 2021 was a breakout year (refer to the graph below), with a huge spike in the creation of new repos and in followers. This may be due to the COVID-19 pandemic leading more developers to work remotely and engage in personal or open-source projects. It also coincides with the technical preview of GitHub Copilot (AI), which launched that year
- JavaScript dominance?: With 167,565 stars, JavaScript rules Sydney's GitHub languages. This points to strong web-development uptake, especially since users from the web companies Atlassian and Canva are top contributors
- Quality over quantity?: Many top users have a very low number of repositories despite a large follower count
- Weak correlation: the correlation between a user's number of followers and the stargazers on their repositories is only 0.067, indicating that having more followers does not strongly predict the popularity of a user's repositories
- Developers should focus on improving the quality of their repos rather than the quantity. This can be achieved via better code, clear guidelines on usage and collaboration, regular updates etc. It may lead to more followers and better community visibility
- Since there is a weak negative regression slope of -9.72 between the length of a developer's bio and their number of followers, it is suggested to keep bios shorter
- Developers should consider adopting the MIT license for their repos. It has the greatest adoption in the Sydney region, possibly because it is a very permissive license that allows users to do almost anything they want with your code, including using it in commercial products
- Mermaid, a JavaScript-based diagramming/charting tool, is the language with the highest average number of stars (491) per repository on GitHub Sydney. Given its popularity, developers must leverage it for better visualisation and community engagement
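
For readers who want to reproduce figures like the correlation and the regression slope quoted above, here is a dependency-free sketch. The original analysis used Pandas and Google Sheets; this swaps in plain Python so the arithmetic is visible. The function names and the sample numbers are made up for illustration and are not the Sydney data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ols_slope(xs, ys):
    """Slope of the least-squares line that predicts ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    return cov / var_x


# Illustrative values only: bio length (assumed here to be a word count)
# against follower counts for five hypothetical users.
bio_words = [5, 12, 20, 33, 41]
followers = [900, 450, 700, 300, 250]

print("correlation:", round(pearson(bio_words, followers), 3))
print("slope:", round(ols_slope(bio_words, followers), 2))
```

A negative slope means each additional unit of bio length is associated with fewer followers on average; it does not by itself establish that shortening a bio causes follower growth.
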
- Created a Colab notebook (Python) to scrape GitHub data via its APIs. GitHub's detailed API guide helped me understand the structure and parameters involved. Since GitHub enforces restrictive rate limits, I created a personal API token to get a higher request limit (5000/hr)
- To fetch the data, the Search API was used to search for users located in Sydney with over 100 followers (`location:Sydney followers:>100`). As the API response is paginated, I added the `per_page=100` and `page` parameters to ensure all qualifying users were fetched. To validate the results and the API fetch count, I cross-checked them by running the same search directly on the website. This data was written to the `users.csv` file
- Thereafter, for each user found, I invoked a secondary API to retrieve their detailed information and associated public repository data
- As per the assignment instructions, data was collected for up to 500 of the most recently pushed repositories per user. This was achieved by looping through the repo list and stopping once 500 repositories were reached. This data was written to the `repositories.csv` file
- On completion, 371 users and their 32,415 associated repositories had been scraped
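
The pagination and the 500-repository cap described above could look something like the sketch below. It is an assumption-laden outline rather than the notebook's code: the helper names (`github_get`, `collect_pages`, `search_sydney_users`, `recent_repos`) are hypothetical, it uses the standard library's `urllib` instead of whatever HTTP client the notebook used, and it omits error handling and rate-limit back-off.

```python
import json
import urllib.parse
import urllib.request

API_ROOT = "https://api.github.com"


def github_get(path, params, token=None):
    """Perform one GET against the GitHub REST API and return the parsed JSON."""
    url = f"{API_ROOT}{path}?{urllib.parse.urlencode(params)}"
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # a personal access token raises the request limit
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def collect_pages(get_page, per_page=100, max_items=None):
    """Call get_page(page, per_page) for page 1, 2, ... until a short page or the cap."""
    items = []
    page = 1
    while max_items is None or len(items) < max_items:
        batch = get_page(page, per_page)
        items.extend(batch)
        if len(batch) < per_page:  # a short (or empty) page means nothing is left
            break
        page += 1
    return items if max_items is None else items[:max_items]


def search_sydney_users(token=None):
    """Fetch every user matching the search query, page by page."""
    def get_page(page, per_page):
        params = {"q": "location:Sydney followers:>100",
                  "per_page": per_page, "page": page}
        return github_get("/search/users", params, token)["items"]
    return collect_pages(get_page)


def recent_repos(login, token=None, cap=500):
    """Fetch up to `cap` of a user's most recently pushed repositories."""
    def get_page(page, per_page):
        params = {"sort": "pushed", "per_page": per_page, "page": page}
        return github_get(f"/users/{login}/repos", params, token)
    return collect_pages(get_page, max_items=cap)
```

Keeping the paging loop in `collect_pages` separate from the HTTP call means the loop can be checked with a fake page function, without touching the network or spending rate limit.
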
- For data insights, I leveraged Pandas and Google Sheets, using these 2 base CSVs
- users.csv: Contains key information about each user, such as their GitHub login, name, company, location (city), email, and number of repositories, followers, and following
- repositories.csv: Contains the public repositories for each user, including details like repository name, creation date, programming language, and license
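
As an example of the kind of aggregation these files support, the sketch below totals and averages stars per language. The column names `language` and `stargazers_count` are assumptions about the CSV layout, the function name is hypothetical, and the sample rows are invented. Plain Python is used here for a dependency-free illustration, although the same result is a one-line `groupby` in Pandas.

```python
import csv
import io
from collections import defaultdict


def stars_by_language(rows):
    """Return {language: (total_stars, repo_count, average_stars)}."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        language = (row.get("language") or "").strip()
        if not language:  # skip repositories without a detected language
            continue
        totals[language] += int(row["stargazers_count"])
        counts[language] += 1
    return {
        lang: (totals[lang], counts[lang], totals[lang] / counts[lang])
        for lang in totals
    }


# Invented sample rows standing in for repositories.csv.
sample_csv = io.StringIO(
    "full_name,language,stargazers_count\n"
    "a/x,JavaScript,10\n"
    "a/y,JavaScript,30\n"
    "b/z,Python,5\n"
    "b/w,,7\n"
)
stats = stars_by_language(csv.DictReader(sample_csv))

# List languages from highest to lowest average stars.
for lang, (total, count, average) in sorted(
    stats.items(), key=lambda item: item[1][2], reverse=True
):
    print(lang, total, count, average)
```

With the real file, `sample_csv` would be replaced by `open("repositories.csv", newline="", encoding="utf-8")`.
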