Streamlit UI

How to build a Streamlit app to scrape GitHub profiles


In this tutorial, we will build a web app using Streamlit that scrapes a GitHub user’s info. The app displays the user’s basic info along with some of their recent repositories. We will be using Beautiful Soup for the web scraping.

If you want to learn how to scrape websites using Selenium, see my previous tutorial; this one focuses on Beautiful Soup. The tutorial is divided into two sections: the first covers web scraping with Beautiful Soup, and the second covers using the scraped data and Streamlit to build a web app.

Repo: https://github.com/rahulbanerjee26/githubScraper

Live: https://github-scrape.herokuapp.com/ 

Install the required Libraries

First, we need to install the libraries which we will be using. I highly recommend creating a virtual environment before installing the libraries.

python -m virtualenv venv   # set up your virtual environment
venv/Scripts/activate       # activate it (Windows; use 'source venv/bin/activate' on macOS/Linux)
pip install beautifulsoup4 requests pandas streamlit

Scraping GitHub data

Import all the required libraries, i.e., bs4, requests, and pandas, and define a function that accepts the username as a parameter.

from bs4 import BeautifulSoup
import requests
import pandas as pd

def getData(userName):
    pass

The URL ‘https://github.com/{user_name}?tab=repositories’ contains the user’s info and their recent public repositories. We will use the requests library to get the content of the page.

url = "https://github.com/{}?tab=repositories".format(userName)
page = requests.get(url)
print(page.content) #displays the html
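As an optional guard (not part of the original code), you can check the response status before parsing. GitHub returns a 404 for usernames that do not exist, so failing early here keeps the parsing code simple; the Streamlit app we build later catches exceptions anyway.

if page.status_code != 200:
    raise ValueError('Could not fetch the profile for {}'.format(userName))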

Next, we create an instance of BeautifulSoup, passing page.content as the parameter, and create an empty dictionary to store the user info.

soup = BeautifulSoup(page.content, 'html.parser')
info = {}

We will be scraping the following info:

  • Full name
  • Image
  • Number of followers
  • Number of users following
  • Location (if it exists)
  • Portfolio URL (if it exists)
  • Repo name, repo link, last update date, programming language, and description

Before proceeding, you might want to review CSS attribute selectors.
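If attribute selectors are new to you, here is a minimal, self-contained illustration of the [attr*=value] syntax we rely on below (the HTML is made up for the example):

from bs4 import BeautifulSoup

html = '<a href="/someUser?tab=followers">150 followers</a>'
demo = BeautifulSoup(html, 'html.parser')

# a[href*=followers] matches any <a> whose href contains the word 'followers'
print(demo.select_one('a[href*=followers]').get_text())  # 150 followers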

Full Name

The full name is inside an element with the class name ‘vcard-fullname’. We use Beautiful Soup’s .find() method and pass in the class name.

# full name
info['name'] = soup.find(class_='vcard-fullname').get_text()

The .get_text() method retrieves the relevant text inside the element.

Image

Similar to the full name, the image is inside an element with class name ‘avatar-user’.

# image
info['image_url'] = soup.find(class_='avatar-user')['src']

Since we are only interested in the image source, we store only the src value.

Followers/Following

To get the followers/following info, we use CSS attribute selectors. If you inspect the follower and following counts, you will notice that each sits inside a link whose href attribute contains the word ‘followers’ or ‘following’ respectively. We will use Beautiful Soup’s .select_one() method to get the required data.

# followers and following
info['followers'] = soup.select_one("a[href*=followers]").get_text().strip().split('\n')[0]
info['following'] = soup.select_one("a[href*=following]").get_text().strip().split('\n')[0]

Since we are only interested in the number, we need to manipulate the text to extract it.
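To see why the .strip() and .split('\n') are needed, consider what .get_text() returns for the followers link; the exact text below is illustrative, but it has this general shape:

raw = '\n  150\n  followers\n'      # roughly what .get_text() returns
print(raw.strip())                   # '150\n  followers'
print(raw.strip().split('\n')[0])    # '150' - just the number we want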

Location and Portfolio URL

Upon inspecting the location and URL, you will notice that they are stored inside <li> tags that have an ‘itemprop’ attribute. The ‘itemprop’ value for the location contains the word ‘home’, and for the URL it contains ‘url’. Make sure the case of the string matches. We will use the .select_one() method to get the relevant info.

# location
try:
    info['location'] = soup.select_one('li[itemprop*=home]').get_text().strip()
except:
    info['location'] = ''

# url
try:
    info['url'] = soup.select_one('li[itemprop*=url]').get_text().strip()
except:
    info['url'] = ''

Since not every user has a location and URL on their profile, we need to put our code inside a try/except block.
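As an optional refactor (not part of the original code), the repeated try/except blocks can be folded into a small helper; a minimal sketch with a hypothetical safe_text function:

def safe_text(parent, selector):
    # hypothetical helper: return the stripped text of the first match, or '' if missing
    element = parent.select_one(selector)
    return element.get_text().strip() if element else ''

# equivalent to the two try/except blocks above
info['location'] = safe_text(soup, 'li[itemprop*=home]')
info['url'] = safe_text(soup, 'li[itemprop*=url]')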

User’s Recent Public Repos

Each repository is contained in an element with the class name “source”. First, we get all of these elements with .find_all(). We also declare an empty list to store all the repo information.

# get repositories
repos = soup.find_all(class_='source')
repo_info = []

Below are the elements where you will find the necessary info:

  • Repo name is stored inside an <a> tag with an ‘itemprop’ attribute containing the word ‘codeRepository’
  • Repo link can be formed in the following format: ‘https://github.com/{user_name}/{repo_name}’
  • Repo update date is inside a <relative-time> tag
  • Repo programming language is inside a <span> tag with the ‘itemprop’ attribute having the value ‘programmingLanguage’
  • Repo description is inside a <p> tag with the ‘itemprop’ attribute having the value ‘description’

Below is a snippet of code to get the repo info:

for repo in repos:
    # repo name and link
    try:
        name = repo.select_one('a[itemprop*=codeRepository]').get_text().strip()
        link = 'https://github.com/{}/{}'.format(userName, name)
    except:
        name = ''
        link = ''
    # repo update time
    try:
        updated = repo.find('relative-time').get_text()
    except:
        updated = ''
    # programming language
    try:
        language = repo.select_one('span[itemprop*=programmingLanguage]').get_text()
    except:
        language = ''
    # description
    try:
        description = repo.select_one('p[itemprop*=description]').get_text().strip()
    except:
        description = ''

Now we store the information in a dictionary and append it to the empty list we created earlier; this still happens inside the loop. Once the loop finishes, we convert the list into a dataframe so that it can be used by Streamlit, and return it along with the user info.

    # still inside the for loop
    repo_info.append({'name': name,
                      'link': link,
                      'updated': updated,
                      'language': language,
                      'description': description})

# after the loop
repo_info = pd.DataFrame(repo_info)
return info, repo_info

If you have made it this far, Good Job! 👏 👏
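Before moving on, you can sanity-check the scraper from the command line; a quick sketch, using the author’s username from the repo link above as an example:

if __name__ == '__main__':
    info, repo_info = getData('rahulbanerjee26')  # any valid username works
    print(info)
    print(repo_info.head())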

Now we will move on to the Streamlit part.

Streamlit App

Run ‘streamlit hello’ from the command line to verify that you have installed Streamlit correctly. It should open a demo app in the browser.

Create a file called ‘app.py’; this will contain the necessary code for Streamlit.

First, we need to make the necessary imports, i.e., streamlit and our function to get the user’s data (the import below assumes the scraping code from the previous section is saved in a file called scrape.py).

import streamlit as st
from scrape import getData

We will use the .title() method to display a title and the .text_input() method to get the user name.

st.title("Github Scraper")
userName = st.text_input('Enter Github Username')

Whenever a change is made to any of the inputs, Streamlit re-runs the entire script. In our case, when the user enters a username, the app re-runs.
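Because of this re-run behaviour, every new input triggers a fresh scrape. One optional way to avoid repeated network calls for the same username is Streamlit’s caching decorator; a sketch, assuming the st.cache decorator (Streamlit’s caching API at the time of writing; newer versions use st.cache_data). You would then call getCachedData in place of getData below.

@st.cache  # cache results so the same username is not scraped twice
def getCachedData(userName):
    return getData(userName)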

Once we get the username, we call the getData function we created earlier. We put it inside a try/except block since we might get an invalid username as input. We use the .subheader() method to display each piece of user info, except the image, for which we use the .image() method. To display the dataframe containing the repo info, we use the .table() method.

if userName != '':
    try:
        info, repo_info = getData(userName)
        for key, value in info.items():
            if key != 'image_url':
                st.subheader('{} : {}'.format(key, value))
            else:
                st.image(value)
        st.subheader("Recent Repositories")
        st.table(repo_info)
    except:
        st.subheader("User doesn't exist")

You have successfully built a Streamlit web app to scrape a user’s GitHub info 😎 😎 If you have come this far, I highly suggest you deploy your app. You can check out my article on how to deploy Streamlit apps using Heroku.