By Frank La Vigne | May 2018 | Get the Code
One of the truisms of the modern data-driven world is that the velocity and volume of data keep increasing. We’re seeing more data generated each day than ever before in human history. And nowhere is this rapid growth more evident than in the world of social media, where users generate content at a scale previously unimaginable. Twitter users, for example, collectively send out approximately 6,000 tweets every second, according to tracking site Internet Live Stats (internetlivestats.com/twitter-statistics). At that rate, there are about 350,000 tweets sent per minute, 500 million tweets per day, and about 200 billion tweets per year. Keeping up with this data stream to evaluate content would be impossible even for the largest teams—you just couldn’t hire enough people to scan Twitter to evaluate the sentiment of its user base at any given moment.
Fortunately, analyzing every tweet is an extreme edge case. There are, however, valid business motives for tracking sentiment, whether for a specific topic, search term or hashtag. While this narrows the field significantly, the sheer volume of tweets that remains still makes manual sentiment evaluation impractical.
Thankfully, analyzing the overall sentiment of text is a process that can easily be automated through sentiment analysis. Sentiment analysis is the process of computationally classifying and categorizing opinions expressed in text to determine whether the attitude expressed within demonstrates a positive, negative or neutral tone. In short, the process can be automated and distilled to a mathematical score indicating tone and subjectivity.
In February (msdn.com/magazine/mt829269), I covered in detail Jupyter notebooks and the environments in which they can run. While any Python 3 environment can run the code in this article, for the sake of simplicity, I’ll use Azure Notebooks. Browse over to the Azure Notebooks service Web site at notebooks.azure.com and sign in with your Microsoft ID credentials. Create a new Library with the name Artificially Intelligent. Under the Library ID field enter “ArtificiallyIntelligent” and click Create. On the following page, click on New to create a new notebook. Enter a name in the Item Name textbox, choose Python 3.6 Notebook from the Item type dropdown list and click New (Figure 1).
Figure 1 Creating a New Notebook with a Python 3.6 Kernel
Click on the newly created notebook and wait for the service to connect to a kernel.

Sentiment Analysis in Python
Once the notebook is ready, enter the following code in the empty cell and run the code in the cell.
The results that appear will resemble the following:
Polarity rates how negative or positive the tone of the input text is, from -1 to +1, with -1 being the most negative and +1 being the most positive. Subjectivity rates how subjective the statement is, from 0 to 1, with 1 being highly subjective. With just three lines of code, I could analyze not just the sentiment of a fragment of text, but also its subjectivity. How did something like sentiment analysis, once considered complicated, become so seemingly simple?
Python enjoys a thriving ecosystem, particularly in regard to machine learning and natural language processing (NLP). The code snippet above relies on the TextBlob library (textblob.readthedocs.io/en/dev), an open source library for processing textual data that provides a simple API for common NLP tasks, including sentiment analysis and much more.
In the blank cell below the results, enter the following code and execute it:
The results state that the phrase “the sky is blue” has a polarity of 0.0 and a subjectivity of 0.1. This means that the text is neutral in tone and scores low in subjectivity. In the blank cell immediately underneath the results, enter the following code and execute the cell:
Note that the algorithm correctly identified that the contents of simple_text1 had a negative sentiment (-0.8) and that the statement is quite subjective (0.9). Additionally, the algorithm correctly inferred the positive sentiment of simple_text2 (0.625) and its highly subjective nature (1.0).
However, the algorithm does have significant difficulties parsing the more subtle nuances of human language. Sarcasm, for instance, is not only hard to detect, but may throw off the results. Imagine a scenario where a restaurant extracts reviews from an online review site like Yelp and automatically publishes reviews with a positive sentiment on their Web site and social media. Enter the following code into an empty cell and execute it:
Clearly, the sentiment of these two reviews is negative. Yet the algorithm seems to think otherwise, scoring both reviews as positive, 0.15 and 0.26, respectively. In this case, the restaurant would likely not want either review highlighted on any platform. NLP systems have yet to grasp sarcasm, although there is a lot of research currently being done in this area (thesarcasmdetector.com/about).
So far, I have only run small bits of text through the TextBlob analyzer. A more practical use of this technology is to feed it user-generated data, ideally in near-real time. Fortunately, Twitter, with its approximately 327 million active users (www.statista.com/statistics/282087/number-of-monthly-active-twitter-users), provides a constant stream of text to analyze.
To connect with Twitter’s API, I need to register an application with Twitter to generate the necessary credentials. In a browser, go to apps.twitter.com and, if needed, log in with your Twitter credentials. Click the Create New App button to bring up the Create an application form as shown in Figure 2. Enter a name, description and a Web site for the app. For the purposes of this article, the Web site address does not matter, so enter a valid URL. Click the checkbox to agree to the terms of the Twitter Developer Agreement and click the Create your Twitter application button.
Figure 2 The Twitter Create Application Form
On the following screen, look for Consumer Key (API Key) under the Application Settings section. Click on the “manage keys and access tokens” link. On the page that follows, click the Create my access token button, as shown in Figure 3, to create an access token. Make note of the following four values shown on this page: Consumer Key (API Key), Consumer Secret (API Secret), Access Token and Access Token Secret.
Figure 3 Twitter Application Keys and Access Tokens Screen
Tweepy is a Python library that simplifies the interaction between Python code and the Twitter API. More information about Tweepy can be found at docs.tweepy.org/en/v3.5.0. Return to the Jupyter notebook and enter the following code to install the Tweepy library. The exclamation mark instructs Jupyter to execute the command in the shell:
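The cell contains a single shell command (the exact pinned version, if any, isn't reproduced here):

```python
!pip install tweepy
```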
Once the code executes successfully, the response text in the cell will read: “Successfully installed tweepy-3.6.0,” although the specific version number may change. Next, enter the code in Figure 4 into the newly created empty cell and execute it.
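Figure 4's code isn't reproduced here; the following is a hedged reconstruction, assuming Tweepy 3.x's OAuthHandler and API.search interface, with placeholder credential strings and a placeholder search term (running it requires the four values from your own Twitter application):

```python
import tweepy
from textblob import TextBlob

# Placeholder credentials -- substitute the four values recorded
# from the Twitter application page.
consumer_key = "YOUR_CONSUMER_KEY"
consumer_secret = "YOUR_CONSUMER_SECRET"
access_token = "YOUR_ACCESS_TOKEN"
access_token_secret = "YOUR_ACCESS_TOKEN_SECRET"

# Authenticate and create an API client.
auth = tweepy.OAuthHandler(consumer_key, consumer_secret)
auth.set_access_token(access_token, access_token_secret)
api = tweepy.API(auth)

# Search for recent tweets (the term "Azure" is a placeholder) and
# print each tweet's text along with its TextBlob sentiment.
for tweet in api.search(q="Azure"):
    analysis = TextBlob(tweet.text)
    print(tweet.text)
    print(analysis.sentiment)
```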
The results that come back should look similar to the following:
Keep in mind that as the code executes a search on live Twitter data, your results will certainly vary. The formatting is a little confusing to read. Modify the for loop in the cell to the following and then re-execute the code.
Adding the pipe characters to the output should make it easier to read. Also note that the sentiment property’s two fields, polarity and subjectivity, can be displayed individually.
The previous code created a pipe-delimited list of tweet content and sentiment scores. A more useful structure for further analysis would be a DataFrame. A DataFrame is a two-dimensional, labeled data structure whose columns may contain different value types. Similar to a spreadsheet or SQL table, DataFrames provide a familiar and simple mechanism to work with datasets.
DataFrames are part of the Pandas library, so you will need to import Pandas along with NumPy. Insert a blank cell below the current cell, enter the following code and execute it:
The results now will display in an easier-to-read tabular format. However, that’s not all DataFrames can do. Insert a blank cell below the current cell, enter the following code, and execute:
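The follow-up cell produces descriptive statistics for the DataFrame; it is sketched here with a small self-contained DataFrame of sample sentiment scores so it runs on its own:

```python
import pandas as pd

# A small DataFrame of sentiment scores (sample values).
df = pd.DataFrame({"polarity": [0.5, -1.0], "subjectivity": [0.6, 1.0]})

# describe() reports count, mean, standard deviation, min/max and
# quartiles for each numeric column.
stats = df.describe()
print(stats)
```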
By loading the tweet sentiment analysis data into a DataFrame, it becomes easier to analyze the data at scale. However, these descriptive statistics just scratch the surface of the power that DataFrames provide. For a more complete exploration of Pandas DataFrames in Python, watch the webcast, “Data Analysis in Python with Pandas,” by Jonathan Wood at bit.ly/2urCxQX.

Wrapping Up
With the velocity and volume of data continuing to rise, businesses large and small must find ways to leverage machine learning to make sense of the data and turn it into actionable insight. Natural Language Processing, or NLP, encompasses algorithms that can analyze unstructured text and parse it into machine-readable structures, giving access to one of the key attributes of any body of text: sentiment. Not too long ago, this was out of reach of the average developer, but now the TextBlob Python library brings this technology to the Python ecosystem. While the algorithms can sometimes struggle with the subtleties and nuances of human language, they provide an excellent foundation for making sense of unstructured data.
As demonstrated in this article, the effort to analyze a given block of text for sentiment in terms of negativity or subjectivity is now trivial. Thanks to a vibrant Python ecosystem of third-party open source libraries, it’s also easy to source data from live social media sites, such as Twitter, and pull in users’ tweets in real time. Another Python library, Pandas, simplifies the process to perform advanced analytics on this data. With thoughtful analysis, businesses can monitor social media feeds and obtain awareness of what customers are saying and sharing about them.
Frank La Vigne leads the Data & Analytics practice at Wintellect and co-hosts the DataDriven podcast. He blogs regularly at FranksWorld.com and you can watch him on his YouTube channel, “Frank’s World TV” (FranksWorld.TV).
Thanks to the following technical expert for reviewing this article: Andy Leonard