Sentiment analysis (or opinion mining) aims to determine the attitude expressed in a piece of text. It involves natural language processing and sometimes machine learning. In this tutorial we will apply sentiment analysis to get a sense of consumers' attitudes towards a few car brands. We will learn how to:
listen to tweets using the DSS Twitter dataset
train a basic machine learning model to predict sentiment from tweets
score our tweets and create aggregated measures and reports to make all this information human readable
This part is about creating a new Twitter dataset and listening to the keywords you are interested in. Since, after setting up the Twitter connection, you have to wait for your dataset to be populated, we provide one in the project.
This part is about scoring the sentiment of each tweet. Averaging over a brand or a particular keyword then gives you a proxy for the average sentiment of the associated tweets. To generate this score we propose the following method:
- learn a model of the overall sentiment of a tweet, on a dataset containing the tweet text and a sentiment label: -1 for negative and +1 for positive. This corresponds to a classification problem.
- predict the probability of each of our brand tweets belonging to each of the two classes (sentiments)
- keep the expected value as the sentiment score, i.e. score = P(sentiment = 1) - P(sentiment = -1). In the end, the score is close to 1 if the tweet is very positive, close to -1 if it is very negative, and close to 0 if it is rather neutral.
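The scoring formula above can be sketched in a few lines of Python (the probability values below are made up for illustration):

```python
# Expected sentiment of a tweet, given the two class probabilities:
# score = P(sentiment = 1) - P(sentiment = -1).
def sentiment_score(p_positive, p_negative):
    return p_positive - p_negative

# A very positive tweet, a very negative one, and a neutral one.
print(sentiment_score(0.95, 0.05))  # close to 1
print(sentiment_score(0.10, 0.90))  # close to -1
print(sentiment_score(0.50, 0.50))  # close to 0
```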
OK, so let's get started! First, get the data here. This dataset was created using a small Kaggle dataset and the 1.6-million-line dataset from Stanford's Sentiment140. The final dataset is composed of two columns: the text and the sentiment (-1 or 1).
Now, let's create a model. Go to the model page and choose "prediction".
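To see what kind of model is being trained here, you can reproduce a minimal version outside DSS with scikit-learn: a bag-of-words representation fed into a logistic regression. The toy tweets below are made up; the real training runs on the full Sentiment140-based dataset.

```python
# Minimal sketch of the classification step: bag of words + logistic
# regression, trained on (text, sentiment) pairs with labels -1 / +1.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "i love this car, it is amazing",
    "great ride, very happy with it",
    "terrible engine, i hate it",
    "worst car ever, awful experience",
]
sentiments = [1, 1, -1, -1]  # +1 positive, -1 negative

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(texts, sentiments)

print(model.predict(["i love this amazing car"]))   # likely [1]
print(model.predict(["terrible, i hate it"]))       # likely [-1]
```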
Now that the model has been trained, it's time to score the incoming tweets. Create a scoring recipe and choose the model you just built. Set the partition dependencies to "All available" and run the newly created recipe. When the job is done you can explore the scored dataset. The probability estimates of belonging to classes 1 and -1 are appended to the previous list of columns.
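Conceptually, the scored dataset looks like the sketch below (the column names `proba_1` and `proba_-1` are hypothetical; check the actual names in your scored dataset), with the sentiment score derived from the two probability columns:

```python
# Hypothetical rows of the scored dataset: the scoring recipe has
# appended the two class-probability columns to each tweet.
scored_tweets = [
    {"text": "love my new car",        "proba_1": 0.82, "proba_-1": 0.18},
    {"text": "this car is a disaster", "proba_1": 0.07, "proba_-1": 0.93},
]

# Derive the sentiment score: P(sentiment = 1) - P(sentiment = -1).
for row in scored_tweets:
    row["score"] = row["proba_1"] - row["proba_-1"]

print([round(r["score"], 2) for r in scored_tweets])  # [0.64, -0.86]
```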
In this part we'll show you how to generate a few dashboards for the end users who will review your pinboard. We used a PostgreSQL connection to get faster results, but any other storage type would work.
We will first focus on a daily report of the total number of tweets and the overall sentiment, before breaking it down by brand / keyword.
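The daily aggregation behind such a report can be sketched as follows (field names and values are made up for illustration):

```python
# Group scored tweets by day, then count them and average the
# sentiment score per day.
from collections import defaultdict

tweets = [
    {"day": "2014-05-01", "score": 0.8},
    {"day": "2014-05-01", "score": -0.2},
    {"day": "2014-05-02", "score": 0.5},
]

daily = defaultdict(list)
for t in tweets:
    daily[t["day"]].append(t["score"])

report = {
    day: {"tweets": len(scores),
          "avg_sentiment": round(sum(scores) / len(scores), 2)}
    for day, scores in daily.items()
}
print(report["2014-05-01"])  # {'tweets': 2, 'avg_sentiment': 0.3}
```

In practice this aggregation would be a grouping recipe (or a SQL query on the PostgreSQL connection) rather than Python code, but the logic is the same.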
To refine your analysis, you could explore several ideas that lie outside the scope of this tutorial:
Try to get better model performance by tuning the model yourself (using export to an IPython notebook). Is logistic regression the best model? Would an SVD on the bag of words be useful? Should we try n-grams? Can we use smiley information (try replacing smileys with words using this dataset)?
Some brand names can be ambiguous. For example, among the tweets associated with "dodge" you may find some related to "dodge ball". These could add noise to your analysis.
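A crude way to reduce this kind of noise is to filter out tweets matching a small blacklist of known ambiguous phrases (the blacklist below is a made-up example, not an exhaustive solution):

```python
# Drop tweets that contain a known ambiguous phrase for the brand.
blacklist = ["dodge ball", "dodgeball"]

def is_ambiguous(text):
    text = text.lower()
    return any(term in text for term in blacklist)

tweets = ["I love my new Dodge!", "dodge ball tournament tonight"]
print([t for t in tweets if not is_ambiguous(t)])  # ['I love my new Dodge!']
```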
When scoring the predictions we dropped the partitioning system. But if you were to score tweets every day, you would probably want to keep it! Try making that work on a daily basis and create your first app using the scheduler.
Add as many dashboards and web apps as you want!