Saturday, June 13, 2020

Text Analysis

News headlines text analysis

Introduction

In this tutorial, I show an introductory text analysis of an ABC News headlines dataset. I will have a look at the most common words therein and run a sentiment analysis on those headlines by taking advantage of the following sentiment lexicons:

  • NRC

  • Bing

  • AFINN

The NRC lexicon from Saif Mohammad and Peter Turney categorizes words into the categories positive, negative, anger, anticipation, disgust, fear, joy, sadness, surprise and trust.

The Bing lexicon from Bing Liu and collaborators categorizes words into a positive or negative category.

The AFINN lexicon from Finn Årup Nielsen assigns each word a score between -5 and 5, with negative scores indicating negative sentiment and positive scores indicating positive sentiment.

For more information about these sentiment lexicons, see the references listed at the bottom.

Packages

I am going to take advantage of the following R packages.

suppressPackageStartupMessages(library(stringr))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(tidytext))
suppressPackageStartupMessages(library(tidyr))
suppressPackageStartupMessages(library(textdata))
suppressPackageStartupMessages(library(widyr))
suppressPackageStartupMessages(library(ggplot2))

Package versions are listed below.

packages <- c("stringr", "dplyr", "tidytext", "tidyr", "textdata", "widyr", "ggplot2")
version <- lapply(packages, packageVersion)
version_c <- do.call(c, version)
data.frame(packages=packages, version = as.character(version_c))

Everything runs on Windows 10 with the following R version.

R.version
##                _                           
## platform       x86_64-w64-mingw32          
## arch           x86_64                      
## os             mingw32                     
## system         x86_64, mingw32             
## status                                     
## major          3                           
## minor          5.3                         
## year           2019                        
## month          03                          
## day            11                          
## svn rev        76217                       
## language       R                           
## version.string R version 3.5.3 (2019-03-11)
## nickname       Great Truth

Note

Before running this code, make sure you have downloaded the sentiment lexicons by executing the following statements:

get_sentiments("nrc")
get_sentiments("bing")
get_sentiments("afinn")

and accepting the download prompts shown by the interactive menu.

Getting Data

I then download our news dataset, containing more than one million headlines, from:

“https://www.kaggle.com/therohk/million-headlines/downloads/million-headlines.zip/7”

Uncompressing it produces the abcnews-date-text.csv file. I load it into the news_data data frame and have a look at it.

news_data <- read.csv("abcnews-date-text.csv", header = TRUE, stringsAsFactors = FALSE)
dim(news_data)
## [1] 1103663       2
head(news_data)
tail(news_data)
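
As a side note, not needed for the rest of the analysis: the publish_date column appears to be an integer in YYYYMMDD format, so it can be parsed into proper Date values. A minimal sketch, assuming that format holds throughout:

# parse the integer publish_date (assumed YYYYMMDD) into Date values,
# without touching news_data itself
news_dates <- as.Date(as.character(news_data$publish_date), format = "%Y%m%d")
range(news_dates)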

Token Analysis

It is time to extract the tokens from our dataset. Selecting the column named headline_text and unnesting the word tokens determines the following.

news_df <- news_data %>% select(headline_text)
news_tokens <- news_df %>% unnest_tokens(word, headline_text)
head(news_tokens, 10)
tail(news_tokens, 10)

It is interesting to generate and inspect a table reporting how many times each token shows up within the headlines and its proportion with respect to the total.

news_tokens_count <- news_tokens %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))

These are the ten most frequent words.

head(news_tokens_count, 10)

And these are the least frequent ones:

tail(news_tokens_count, 10)

There is an issue with proceeding that way: the counts are dominated by words which play no relevant role in the sentiment analysis, the so-called stop words. The stop words collected in the tidytext stop_words dataset are shown below.

data(stop_words)
head(stop_words, 10)

To remove stop words as required, we take advantage of the anti_join operation.

news_tokens_no_sp <- news_tokens %>% anti_join(stop_words)
head(news_tokens_no_sp, 10)
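
A quick sanity check: comparing row counts before and after the anti_join shows how many stop-word occurrences were dropped.

# tokens removed by the stop-word filtering
nrow(news_tokens) - nrow(news_tokens_no_sp)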

Then I count the tokens again, after having removed the stop words.

news_tokens_count <- news_tokens_no_sp %>% count(word, sort = TRUE) %>% mutate(proportion = n / sum(n))
head(news_tokens_count, 10)
tail(news_tokens_count)

Then I keep only the tokens appearing more than 8,000 times.

news_token_over8000 <- news_tokens_count %>% filter(n > 8000) %>% mutate(word = reorder(word, n))
nrow(news_token_over8000)
## [1] 32
head(news_token_over8000, 10) 
tail(news_token_over8000, 10) 

It is interesting to show their proportions, expressed as per-thousand, by means of a horizontal bar plot.

news_token_over8000 %>%  
  ggplot(aes(word, proportion*1000, fill=ceiling(proportion*1000))) +
  geom_col() + xlab(NULL) + coord_flip() + theme(legend.position = "none")

News Sentiment Analysis

In this section, I focus on each single headline to evaluate its specific sentiment as determined by each lexicon. Hence the output shall state whether each specific headline carries a positive or negative sentiment.

head(news_df, 10)

For computational time reasons, I will analyse only the first 1,000 headlines. Their token lists are built as follows.

news_df_subset <- news_df[1:1000,,drop=FALSE]
tkn_l <- apply(news_df_subset, 1, function(x) { data.frame(headline_text=x, stringsAsFactors = FALSE) %>% unnest_tokens(word, headline_text)})
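
As an aside, the same per-headline tokenization could be obtained without apply(), by tagging each headline with a row id before unnesting; a minimal sketch, where the headline_id column name is my own choice. The rest of the analysis keeps the list-based tkn_l representation.

# alternative: one flat data frame with a headline id attached to each token
news_tokens_by_id <- news_df_subset %>%
  mutate(headline_id = row_number()) %>%
  unnest_tokens(word, headline_text)
head(news_tokens_by_id)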

Removing the stop words from each token list.

single_news_tokens <- lapply(tkn_l, function(x) {anti_join(x, stop_words)})
str(single_news_tokens, list.len = 5)
## List of 1000
##  $ 1   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "aba" "decides" "community" "broadcasting" ...
##  $ 2   :'data.frame':    5 obs. of  1 variable:
##   ..$ word: chr [1:5] "act" "fire" "witnesses" "aware" ...
##  $ 3   :'data.frame':    4 obs. of  1 variable:
##   ..$ word: chr [1:4] "calls" "infrastructure" "protection" "summit"
##  $ 4   :'data.frame':    7 obs. of  1 variable:
##   ..$ word: chr [1:7] "air" "nz" "staff" "aust" ...
##  $ 5   :'data.frame':    6 obs. of  1 variable:
##   ..$ word: chr [1:6] "air" "nz" "strike" "affect" ...
##   [list output truncated]

As we can see, each headline is associated with a list of tokens. The sentiment of a headline is computed based on the sum of the positive/negative scores of its tokens.

single_news_tokens[[1]]

Bing lexicon

In this section, the computation of the sentiment associated with the token lists is shown for the Bing lexicon. I first define a function named compute_sentiment() whose purpose is to output the positiveness score of a specific headline.

compute_sentiment <- function(d) {
  # no scored tokens: the headline sentiment is undefined
  if (nrow(d) == 0) {
    return(NA)
  }
  neg_score <- d %>% filter(sentiment == "negative") %>% nrow()
  pos_score <- d %>% filter(sentiment == "positive") %>% nrow()
  # positiveness: number of positive tokens minus number of negative ones
  pos_score - neg_score
}
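
As a quick check, compute_sentiment() can be exercised on a hand-built toy token list with one positive and two negative tokens, and on an empty one.

toy <- data.frame(sentiment = c("positive", "negative", "negative"))
compute_sentiment(toy)                     # expected: 1 - 2 = -1
compute_sentiment(toy[0, , drop = FALSE])  # expected: NA, no scored tokens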

The inner join of each headline's token list with the Bing lexicon is given as input to the compute_sentiment() function, determining the sentiment score of each specific headline.

sentiments_bing <- get_sentiments("bing")
str(sentiments_bing)
## Classes 'tbl_df', 'tbl' and 'data.frame':    6786 obs. of  2 variables:
##  $ word     : chr  "2-faces" "abnormal" "abolish" "abominable" ...
##  $ sentiment: chr  "negative" "negative" "negative" "negative" ...
single_news_sentiment_bing <- sapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_bing) %>% compute_sentiment()})

The result is a vector of integers whose value at the i-th position is the sentiment score associated with the i-th headline.

str(single_news_sentiment_bing)
##  Named int [1:1000] NA -1 1 -1 -1 2 0 NA NA NA ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is negative
  • NA’s show up

summary(single_news_sentiment_bing)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -3.000  -1.000  -1.000  -0.475   1.000   2.000     520

Collecting the results in a data frame as follows.

single_news_sentiment_bing_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_bing)
head(single_news_sentiment_bing_df, 10)

NRC lexicon

In this section, the computation of the sentiment associated with the token lists is shown for the NRC lexicon. Compared with the previous analysis based on the Bing lexicon, some more pre-processing is needed, as explained in what follows. First we get the NRC sentiment lexicon and see what sentiments are therein.

sentiments_nrc <- get_sentiments("nrc")
(unique_sentiments_nrc <- unique(sentiments_nrc$sentiment))
##  [1] "trust"        "fear"         "negative"     "sadness"      "anger"       
##  [6] "surprise"     "positive"     "disgust"      "joy"          "anticipation"

To obtain a positive/negative sentiment output, I define a mapping of the above-listed sentiments to a positive/negative string result as follows.

compute_pos_neg_sentiments_nrc <- function(the_sentiments_nrc) {
  # map each of the ten NRC sentiments to either "positive" or "negative";
  # the mapping vector follows the order returned by unique() above
  s <- unique(the_sentiments_nrc$sentiment)
  df_sentiments <- data.frame(sentiment = s, 
                              mapped_sentiment = c("positive", "negative", "negative", "negative",
                                                    "negative", "positive", "positive", "negative", 
                                                    "positive", "positive"))
  ss <- the_sentiments_nrc %>% inner_join(df_sentiments)
  the_sentiments_nrc$sentiment <- ss$mapped_sentiment
  the_sentiments_nrc
}

nrc_sentiments_pos_neg_scale <- compute_pos_neg_sentiments_nrc(sentiments_nrc)
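
A quick check that the remapping worked: after the mapping, only the two target labels should remain in the sentiment column.

# should report counts for "negative" and "positive" only
table(nrc_sentiments_pos_neg_scale$sentiment)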

The above function produces the remapped NRC lexicon, which is joined with each headline's token list and given as input to the compute_sentiment() function.

single_news_sentiment_nrc <- sapply(single_news_tokens, function(x) { x %>% inner_join(nrc_sentiments_pos_neg_scale) %>% compute_sentiment()})
str(single_news_sentiment_nrc)
##  Named int [1:1000] 1 -4 1 2 -2 2 4 NA 5 -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...

Here is the summary; please note that:

  • the median is equal to zero
  • NA’s show up

summary(single_news_sentiment_nrc)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
## -9.0000 -2.0000  0.0000 -0.3742  2.0000  9.0000     257

As done for the Bing lexicon, I collect the results in a data frame.

single_news_sentiment_nrc_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_nrc)
head(single_news_sentiment_nrc_df, 10)

AFINN lexicon

In this section, the computation of the sentiment associated with the token lists is shown for the AFINN lexicon.

sentiments_afinn <- get_sentiments("afinn")
colnames(sentiments_afinn) <- c("word", "sentiment")
str(sentiments_afinn)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 2477 obs. of  2 variables:
##  $ word     : chr  "abandon" "abandoned" "abandons" "abducted" ...
##  $ sentiment: num  -2 -2 -2 -2 -2 -2 -3 -3 -3 -3 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   word = col_character(),
##   ..   value = col_double()
##   .. )

As we can see, the AFINN lexicon provides a numeric score for each token. We just need to sum up the token scores of each headline to obtain its sentiment score.

single_news_sentiment_afinn_df <- lapply(single_news_tokens, function(x) { x %>% inner_join(sentiments_afinn)})
single_news_sentiment_afinn <- sapply(single_news_sentiment_afinn_df, function(x) { 
      ifelse(nrow(x) > 0, sum(x$sentiment), NA)
  })
str(single_news_sentiment_afinn)
##  Named num [1:1000] NA -2 NA -2 -1 6 3 NA NA -2 ...
##  - attr(*, "names")= chr [1:1000] "1" "2" "3" "4" ...
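
Alternatively, the same per-headline sums could be computed in a single pass by binding the token lists into one data frame first; a sketch, with the headline_id column name being my own choice. Note that headlines with no AFINN match would simply drop out of the result instead of yielding NA.

# flatten the list of token data frames, join the lexicon, sum per headline
single_news_afinn_alt <- bind_rows(single_news_tokens, .id = "headline_id") %>%
  inner_join(sentiments_afinn, by = "word") %>%
  group_by(headline_id) %>%
  summarise(score = sum(sentiment))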

Here is the summary; please note that:

  • the median is negative
  • NA’s show up

summary(single_news_sentiment_afinn)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##  -9.000  -3.000  -2.000  -1.148   1.000   7.000     508

As done for the other lexicons, I collect the results in a data frame.

single_news_sentiment_afinn_df <- data.frame(headline_text=news_df_subset$headline_text, score = single_news_sentiment_afinn)
head(single_news_sentiment_afinn_df, 10)

Comparing results

Having obtained three potential sentiment evaluations for each headline, we would like to compare their congruence. By congruence we mean the fact that all the available lexicons express the same positive or negative result, in other words the same score sign independently of its magnitude. If NA values are present, the congruence is computed as long as at least two non-NA values are available; otherwise it is equal to NA.

Furthermore, we compute the final headline sentiment based upon the sum of the lexicon sentiment scores.

compute_congruence <- function(x,y,z) {
  v <- c(sign(x), sign(y), sign(z))
  # if only one lexicon reports the score, we cannot check for congruence
  if (sum(is.na(v)) >= 2) {
    return (NA)
  }
  # drop NA values; zero scores are kept and count against congruence
  v <- na.omit(v)
  v_sum <- sum(v)
  abs(v_sum) == length(v)
}

compute_final_sentiment <- function(x,y,z) {
  if (is.na(x) && is.na(y) && is.na(z)) {
    return (NA)
  }

  s <- sum(x, y, z, na.rm=TRUE)
  # positive sentiments have score strictly greater than zero
  # negative sentiments have score strictly less than zero
  # neutral sentiments have score equal to zero 
  ifelse(s > 0, "positive", ifelse(s < 0, "negative", "neutral"))
}
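
A few spot checks of the two helpers on hand-picked score triples illustrate the intended behavior.

compute_congruence(1, 2, NA)        # TRUE: the two available signs agree
compute_congruence(1, -2, NA)       # FALSE: the available signs disagree
compute_congruence(1, NA, NA)       # NA: only one lexicon reported a score
compute_final_sentiment(NA, 1, -3)  # "negative": the scores sum to -2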

news_sentiments_results <- data.frame(headline_text = news_df_subset$headline_text, 
                                      bing_score = single_news_sentiment_bing, 
                                      nrc_score = single_news_sentiment_nrc, 
                                      afinn_score = single_news_sentiment_afinn,
                                      stringsAsFactors = FALSE)

news_sentiments_results <- news_sentiments_results %>% rowwise() %>% 
  mutate(final_sentiment = compute_final_sentiment(bing_score, nrc_score, afinn_score),
         congruence = compute_congruence(bing_score, nrc_score, afinn_score))
 
head(news_sentiments_results, 40)

It would be useful to replace the numeric scores with the same {negative, neutral, positive} scale.

replace_score_with_sentiment <- function(v_score) {
  v_score[v_score > 0] <- "positive"
  v_score[v_score < 0] <- "negative"
  v_score[v_score == 0] <- "neutral"
  v_score
} 

news_sentiments_results$bing_score <- replace_score_with_sentiment(news_sentiments_results$bing_score)
news_sentiments_results$nrc_score <- replace_score_with_sentiment(news_sentiments_results$nrc_score)
news_sentiments_results$afinn_score <- replace_score_with_sentiment(news_sentiments_results$afinn_score)

news_sentiments_results[,2:5] <- lapply(news_sentiments_results[,2:5], as.factor)

head(news_sentiments_results, 40)

Cross-tabulations of each lexicon's resulting sentiment against the final sentiment are shown below.

table(news_sentiments_results$bing_score, news_sentiments_results$final_sentiment, dnn = c("bing", "final"))
##           final
## bing       negative neutral positive
##   negative      278      15       14
##   neutral        16       6       11
##   positive        6       7      127
table(news_sentiments_results$nrc_score, news_sentiments_results$final_sentiment, dnn = c("nrc", "final"))
##           final
## nrc        negative neutral positive
##   negative      353      10        4
##   neutral        18      13        6
##   positive       25      16      298
table(news_sentiments_results$afinn_score, news_sentiments_results$final_sentiment, dnn = c("afinn", "final"))
##           final
## afinn      negative neutral positive
##   negative      326      10       12
##   neutral         3       1        6
##   positive        4       9      121

The cross-tabulation of congruence against the final sentiment is shown below.

table(news_sentiments_results$congruence, news_sentiments_results$final_sentiment, dnn = c("congruence", "final"))
##           final
## congruence negative neutral positive
##      FALSE       67      33       45
##      TRUE       292       0      132
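
To read the same table in relative terms, the counts can be turned into row proportions with prop.table(); for instance:

# share of each final sentiment within congruent and non-congruent headlines
round(prop.table(table(news_sentiments_results$congruence,
                       news_sentiments_results$final_sentiment,
                       dnn = c("congruence", "final")), margin = 1), 2)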

Conclusions

We analyzed news headlines to determine their sentiment by taking advantage of three sentiment lexicons. We showed some basics of the methodology for that purpose. We also had the chance to compare the results obtained across all three lexicons and to set forth a final sentiment evaluation. If you are interested in learning much more about text analysis, see ref. [4].

References

[1] NRC sentiment words baseline [http://saifmohammad.com/WebPages/NRC-Emotion-Lexicon.htm]

[2] BING sentiment words baseline [https://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html]

[3] AFINN sentiment words baseline [https://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010]

[4] Text mining with R [https://www.tidytextmining.com/]