Streamlit Word Cloud

How to use 2020 Google Keyword and Twitter Hashtag trends to build word clouds

We will be building a Streamlit web app to showcase word clouds of trending Google keywords and Twitter hashtags in 2020

The link to the live app and screenshots of some of the word clouds are at the end of the article

Introduction 

We will be getting our data from the following website

https://us.trend-calendar.com/trend/2020-01-01.html

The above website stores archives of the trending keywords and hashtags on each day. Beautiful Soup will be used to scrape this website to get the required data. We will be building the following features

  • A 2020 word cloud
  • The ability for the user to select a date and generate a word cloud for that date
  • The ability for the user to change the image mask

Prerequisites

  • Basic familiarity with web scraping using Beautiful Soup
  • Knowledge of Streamlit is not necessary to generate the word clouds, but a basic understanding of it is required to build the web app

Install the Required Packages📦

We will need to install the following libraries

  • Pandas
  • Streamlit
  • WordCloud
  • Matplotlib
  • BeautifulSoup
  • Requests
  • Pillow
  • NumPy
pip install pandas streamlit wordcloud matplotlib bs4 requests pillow numpy

Acquiring the Data 📈

The website mentioned above follows the following format

https://us.trend-calendar.com/trend/{date}.html

The {date} has to be replaced by the date we are interested in, in the YYYY-MM-DD format (for example, https://us.trend-calendar.com/trend/2020-03-11.html). For ease, we will scrape the data at intervals of 7 days, i.e. [2020-01-01, 2020-01-08, 2020-01-15, 2020-01-22, ...]

Generating the Dates

Pandas has a function date_range() which is like the range() function but for dates. The function takes the start date, end date, and frequency as parameters

import pandas as pd

def get_dates():
    # Generate dates from Jan 1 to Dec 27, 2020 at 7-day intervals
    dates = pd.date_range('2020-01-01', '2020-12-27', freq='7d')
    # Convert the timestamps to 'YYYY-MM-DD' strings for the URLs
    dates = [d.strftime('%Y-%m-%d') for d in dates]
    return dates
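
Calling the function gives us date strings that can be dropped straight into the URL template:

dates = get_dates()
print(dates[:3])  # ['2020-01-01', '2020-01-08', '2020-01-15']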

Defining a Function to Get Data for a Given Day

We will only store the top 10 keywords and hashtags.

Screenshot by Author

If you inspect the HTML of the website, you will notice the following

  • There are two ‘ol’ elements. Both their class names are ‘ranking’
  • The first element contains the Twitter Hashtags and the second element contains the Google Keywords
  • Inside the ‘ol’ element, the keywords/hashtags are stored inside ‘li’ elements.
import requests
from bs4 import BeautifulSoup

def get_keywords(date):
    result = {}
    url = f'https://us.trend-calendar.com/trend/{date}.html'
    r = requests.get(url)
    if r.status_code != 200:
        print(f'Failed to get data from {url}')
        return result
    soup = BeautifulSoup(r.text, "html.parser")

The requests library will be used to make a request to the website, and the returned page will be used to initialize a Beautiful Soup object

Before storing the data, we will do the following pre-processing

  • Remove the hashtag character from the beginning of the Twitter hashtags
  • Convert all the data to lowercase

Word clouds can use weights to vary the font size of certain words. In our case, we will assign a weight of 10 to the first-ranked keyword and decrement the weight for each keyword lower in the ranking. Therefore the 10th keyword will have a weight of 1 and the 5th keyword a weight of 6.

try:
    # The first 'ol' with class 'ranking' holds the Twitter hashtags
    twitter_trends = soup.find_all('ol', 'ranking')[0].find_all('li')[0:10]
    for idx, trend in enumerate(twitter_trends):
        # Strip the leading '#' and lowercase the hashtag
        trend = trend.text.lstrip("#").lower()
        result[trend] = result.get(trend, 0) + (10 - idx)
except Exception as e:
    print(e)
    print(f'Failed to get Twitter Hashtags from {url}')

The above code gets the Twitter hashtags. If a keyword already exists, we add the current weight to the previous weight. This is useful when a word appears in both the Google and Twitter lists: for example, a word ranked 3rd on Twitter (weight 8) and 1st on Google (weight 10) ends up with a combined weight of 18.

try:
    # The second 'ol' with class 'ranking' holds the Google keywords
    google_trends = soup.find_all('ol', 'ranking')[1].find_all('li')[0:10]
    for idx, trend in enumerate(google_trends):
        trend = trend.text.lower()
        result[trend] = result.get(trend, 0) + (10 - idx)
except Exception as e:
    print(e)
    print(f'Failed to get Google Keywords from {url}')
print(f"Scraped Data for {date} successfully")
return result

The above code gets the Google keywords. The function then returns the dictionary of words and their weights.

Collecting Data for All Dates and Storing It in a File

dates = get_dates()
keywords = {}
for date in dates:
    keywords[date] = get_keywords(date)

A request will be made to the website for each date, and the data stored in a dictionary keyed by date

Once our dictionary is ready, we will store the data in a JSON file

import json

with open('data/weekly.json', 'w') as file:
    json.dump(keywords, file)
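
The resulting weekly.json maps each date to its word-weight dictionary. A hypothetical excerpt (the words and weights here are made up) might look like this:

{
    "2020-01-01": {"nba": 10, "world war 3": 9},
    "2020-01-08": {"royal family": 10, "grammys": 8}
}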

We also need to combine the words from all the weeks into a second JSON file

combined_result = {}
for date, week_keyword in keywords.items():
    for keyword in week_keyword:
        # Sum the weights a word accumulates across all the weeks
        combined_result[keyword] = combined_result.get(keyword, 0) + week_keyword[keyword]

with open('data/combined.json', 'w') as file:
    json.dump(combined_result, file)

This JSON file does not store the dates; it only stores each word and its total weight. It will be used to produce the 2020 word cloud.


Creating Word Clouds ☁️

We will use the wordcloud library to create the word clouds. First, we will create a default word cloud (see below) without any image mask

Screenshot by Author

The following piece of code creates a word cloud object

wordcloud = WordCloud(width, height, repeat, max_words, max_font_size, background_color)

  • width – the width of the word cloud
  • height – the height of the word cloud
  • repeat – Boolean value. If set to True, words will be repeated to fill up blank spaces. If set to False, blank spaces will be visible
  • max_words – the maximum number of words inside the word cloud
  • max_font_size – the maximum font size of a word; the word with the maximum weight will have the max_font_size
  • background_color – by default, it is set to black, but we can change it
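
As a quick illustration, here is a minimal sketch that builds a default word cloud from a small, made-up weights dictionary and renders it with Matplotlib (generate_from_frequencies is discussed further below):

from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Hypothetical sample data: {word: weight}
data = {'nba': 18, 'coronavirus': 15, 'super bowl': 9, 'grammys': 5}

wordcloud = WordCloud(width=400, height=400, repeat=True,
                      max_words=200, max_font_size=25,
                      background_color='white').generate_from_frequencies(data)

plt.imshow(wordcloud)
plt.axis("off")
plt.show()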

To create fancier word clouds like below, we will need to create image masks.

Screenshot by Author

The PIL library will be used to open the image and NumPy will be used to create the mask array

from PIL import Image
import numpy as np

path = f'data/image_masks/{image}.jpg'
mask = np.array(Image.open(path))

The path variable should point to the base image used to create the mask; you can find some example images in my GitHub repo, linked at the end of the article. The word cloud draws words only in the non-white regions of the mask, so the base image should be a shape on a white background.

This newly created mask variable needs to be passed as a parameter while initializing the word cloud.

wordcloud = WordCloud(width, height, repeat, max_words, max_font_size, background_color, mask=mask)

The data for the word cloud can either be in the form of a large string or a dictionary with weights. In our case, it is the latter. The wordcloud object has a method generate_from_frequencies which takes the dictionary of weights as a parameter and creates the word cloud.

Since we will give the user the ability to choose the image mask, we will put the above code inside a function

def get_word_cloud(image, data, max_words, max_font_size):
    if image == 'default':
        # No mask: plain rectangular word cloud
        wordcloud = WordCloud(width=400, height=400, repeat=True,
                              max_words=max_words,
                              max_font_size=max_font_size,
                              background_color='white'
                              ).generate_from_frequencies(data)
    else:
        # Build the mask array from the selected base image
        path = f'data/image_masks/{image}.jpg'
        mask = np.array(Image.open(path))
        wordcloud = WordCloud(width=400, height=400, repeat=True,
                              max_words=max_words,
                              max_font_size=max_font_size,
                              background_color='white',
                              mask=mask).generate_from_frequencies(data)
    return wordcloud

The above function will return the wordcloud based on the given parameters.


Streamlit App 💡

Screenshot by Author

Before writing any code for the Streamlit app, we will need to load the data from our JSON files

def load_data():
    with open('data/weekly.json', 'r') as file:
        weekly_keywords = json.load(file)
    with open('data/combined.json') as file:
        combined_keyword = json.load(file)
    # The JSON keys are the dates we scraped
    dates = [date for date in weekly_keywords]
    return combined_keyword, weekly_keywords, dates

We will also return all the dates for which we collected the data.

import streamlit as st
import matplotlib.pyplot as plt

st.title("2020 Word Clouds based on Google Keyword and Twitter Hashtag trends")
image = st.sidebar.selectbox(label='Select Image Mask',
                             options=['default', 'twitter', 'hashtag', 'heart'])
combined_keyword, weekly_keywords, dates = load_data()

A sidebar with a dropdown will be created for the user to select the image mask they want to use.

For the 2020 word cloud, we will set the maximum number of words to 800 and the maximum font size to 15

st.header("Entire Year")
wordcloud = get_word_cloud(image,combined_keyword,800,15)
fig1 = plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
st.pyplot(fig1)

For the weekly cloud, we can increase the font size since we do not have many unique words. We will also create a dropdown for the user to select a date

st.header("Weekly")
date = st.selectbox(label='Select Date',options=dates)
keywords = weekly_keywords[date]
wordcloud = get_word_cloud(image , keywords,200,25)
fig2 = plt.figure()
plt.imshow(wordcloud)
plt.axis("off")
st.pyplot(fig2)

Conclusion 

This mini project could be improved further; in particular, the way we acquire the data could be improved. Below are a few suggestions

  • Currently, the dates we generated are all Wednesdays. As a result, hashtags like ‘WednesdayWisdom’ or ‘WednesdayMorning’ are present in our data. The intervals between the generated dates could be randomized, or a pre-processing step could remove these words from the data (a simple sketch follows this list)
  • Use a different data source. The website we scrape the data from is a 3rd-party website and might have incorrect data
  • Increase the options for image masks
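
As one hypothetical take on the first suggestion, a small filter could drop day-of-week hashtags before the word clouds are generated (the DAY_WORDS set here is made up and would need to be extended):

# Hypothetical pre-processing step: drop day-of-week hashtags
DAY_WORDS = {'wednesdaywisdom', 'wednesdaymorning', 'wednesdaymotivation'}

def remove_day_words(data):
    # Keep only the words that are not day-of-week hashtags
    return {word: weight for word, weight in data.items()
            if word not in DAY_WORDS}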

Please mention some other ways to improve the app in the comments 😃

Some of the Word Clouds are below

2020 Word Cloud 
23rd December’s Word Cloud
11th March’s Word Cloud
8th January’s Word Cloud

Resources

Github Repo

https://github.com/rahulbanerjee26/Word_Clouds

Live

https://share.streamlit.io/rahulbanerjee26/word_clouds/main/app.py
