# Natural language processing using TextBlob

*Natural language processing (NLP)* involves the analysis and extraction of meaningful information from natural language data, such as speech or text. This notebook demonstrates several *NLP* methods using the Python module *TextBlob* (https://textblob.readthedocs.io/en/dev/).

We will look at the following analyses:
tokenization, stemming, noun phrase extraction, sentiment analysis, word count analysis, and language translation powered by Google Translate.

To use TextBlob, you must import it (see below), and then create a *TextBlob* object. By convention the *TextBlob* object is stored in a variable named *blob*, as in the example below:

```python
blob = TextBlob('text to analyze')
```

In [None]:
# Triple quotes are used to denote a multi-line string
# Quote from Jude Christodal
stage = """All the world's a stage, and every word a note.
And every day is filled with songs you never knew you wrote."""

# printing stage will output across multiple lines
print(stage)

# viewing the string we can see the newline ('\n') characters
stage

In [None]:
# create a TextBlob object
from textblob import TextBlob
blob = TextBlob(stage)
blob

## Tokenization

*Tokenization* is the process of splitting text into meaningful pieces (sequences of characters), such as words or sentences. In general, these pieces are referred to as *tokens*. TextBlob will automatically parse text into words and sentences.

*TextBlob* objects contain many properties (or fields) that can be accessed using the dot ('.') operator. 

In particular for tokenization, for a *TextBlob* object named *blob*,

- *blob.words* returns a list of *words*, stored in a WordList object that behaves like an ordinary *list*
- *blob.sentences* returns a list of sentences, stored as a list of Sentence objects
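
To make the idea of tokens concrete, here is a minimal sketch that splits a short string into sentences and words using only regular expressions. It is not how TextBlob tokenizes (TextBlob relies on NLTK tokenizers, which may treat contractions and punctuation differently); the sample text and patterns are assumptions chosen for illustration.

```python
import re

text = "Jump around! Jump up and get down."

# Sentence tokens: split on whitespace that follows '.', '!' or '?'
sentences = re.split(r"(?<=[.!?])\s+", text.strip())

# Word tokens: runs of letters and apostrophes (punctuation is dropped)
words = re.findall(r"[A-Za-z']+", text)

print(sentences)
print(words)
```

The same text yields two different token lists depending on whether the unit of interest is a sentence or a word.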


**Note**: The first time you run this, you may be prompted to install a tokenizer. If so, run the following code:

```python
import nltk
nltk.download('punkt')
```

In [None]:
# get a list of words
blob.words

### Exercise
How many words are there? What is the first word?

In [None]:
# get a list of sentences
blob.sentences

**Note**: Each sentence has all the properties of a *TextBlob* object.

In [None]:
# get first sentence
sentence1 = blob.sentences[0]

# Since sentence1 is like a TextBlob, we can get its words using the following
sentence1.words

## Part of speech tagging

*TextBlob* automatically carries out part-of-speech tagging, as shown below. The *tags* property is a list of tuples in the form (word, part of speech). Common tags include *NN* for noun (singular or mass), *NNS* for noun (plural), and verb tags such as *VB* (base form), *VBZ* (third person singular present), and *VBN* (past participle). For a list of tags, see https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html

**Note**: The first time you run this code, you may be asked to install a tagger. If so, run the following code:

```python
import nltk
nltk.download('averaged_perceptron_tagger')
```

In [None]:
blob.tags
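
Because the tags are plain (word, tag) tuples, they can be filtered with a list comprehension. The sketch below picks out verbs from a hand-written list that only mimics the shape of *blob.tags*; the pairs are illustrative and are not actual TextBlob output.

```python
# Hand-written sample in the same (word, tag) shape as blob.tags;
# these pairs are illustrative, not actual TextBlob output.
sample_tags = [('every', 'DT'), ('day', 'NN'), ('is', 'VBZ'),
               ('filled', 'VBN'), ('with', 'IN'), ('songs', 'NNS')]

# Keep the words whose tag starts with 'VB' (VB, VBD, VBG, VBN, VBP, VBZ)
verbs = [word for word, tag in sample_tags if tag.startswith('VB')]
print(verbs)
```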

### Exercise
Print out all of the nouns. Nouns have a tag of 'NN', 'NNS', 'NNP', or 'NNPS'; no other part-of-speech tag contains 'NN'.

## Noun phrase extraction

*Noun phrase extraction* involves the identification of *noun phrases*, which are phrases that include nouns (possibly following one or more adjectives).

For a *TextBlob* object named *blob*, a list of noun phrases is returned by *blob.noun_phrases*.

**Note**: The first time you run this code, you may be asked to install a corpus used to identify noun phrases. Text corpora are described at https://www.nltk.org/book/ch02.html. If so, run the following code:

```python
import nltk
nltk.download('brown')
```

In [None]:
blob = TextBlob('Amy has a new red car')
blob.noun_phrases

## Sentiment analysis

*Sentiment analysis* measures the emotional content of text. We will use sentiment analysis to identify text as *positive* or *negative*, though other emotions can also be detected.

For a *TextBlob* object named *blob*, its sentiment can be found by using *blob.sentiment*, which will give you a Sentiment object (a named tuple) that contains the following:

- polarity: a score between -1 (negative sentiment) and +1 (positive sentiment)
- subjectivity: a score between 0 (objective) and +1 (subjective)
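
The following sketch shows how a named tuple with these two fields can be read. It builds a stand-in with *collections.namedtuple* rather than calling TextBlob, and the scores are made-up values used only for illustration.

```python
from collections import namedtuple

# Stand-in with the same field names as the Sentiment object described above;
# the scores below are made-up values for illustration only.
Sentiment = namedtuple('Sentiment', ['polarity', 'subjectivity'])
example = Sentiment(polarity=0.25, subjectivity=0.75)

# Fields can be read by name, by index, or unpacked like a regular tuple
print(example.polarity)
print(example[1])
polarity, subjectivity = example
```

Reading fields by name (for example *blob.sentiment.polarity*) is usually clearer than indexing.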



In [None]:
blob = TextBlob('I love this class')
blob.sentiment

In [None]:
blob = TextBlob('This class sucks!')
blob.sentiment

In [None]:
# the sentiment of a TextBlob that contains several sentences is an average over the whole text
blob = TextBlob('This class is great. This class is awesome. This class sucks. This class sucks!!! This class is okay.')
blob.sentiment

In [None]:
# but we can find the polarity of each sentence
for s in blob.sentences:
    print('"', s, '" has a polarity of ', s.sentiment.polarity, sep = '')


## Stemming 

*Stemming* is a normalization method that takes a word and converts it into a *base* form by removing word endings (suffixes) or prefixes.

For a *TextBlob* *Word* object (such as an element of *blob.words*), you can get the *stem* by calling *word.stem()*.

Lemmatization is a related technique that takes the word's part of speech into account and returns a dictionary form of the word.

Stemming and lemmatization are useful for counting words, since different forms of the same word (like 'runs' and 'run') should probably be counted as one word.

In [None]:
blob = TextBlob('run runs running ran')

# get word stems
for w in blob.words :
    print(w, ': ', w.stem(), sep = '')
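
To illustrate why stemming helps with counting, here is a deliberately simple suffix-stripping sketch. The function *toy_stem* is hypothetical and is not the algorithm TextBlob uses; real stemmers apply many more rules, and irregular forms such as 'ran' are left unchanged here.

```python
from collections import Counter

def toy_stem(word):
    """Strip one common suffix; a toy rule set, not a real stemmer."""
    for suffix in ('ing', 'ed', 'es', 's'):
        # Only strip when at least three characters would remain
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = ['jump', 'jumps', 'jumping', 'jumped', 'ran']

# Count words after reducing each one to its (toy) stem
stem_counts = Counter(toy_stem(w) for w in words)
print(stem_counts)
```

Without stemming, each of the four forms of 'jump' would be counted separately.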

## Word counts

*TextBlob* calculates word counts automatically. The word counts of a *TextBlob* named *blob* are available as *blob.word_counts*, which is a default dictionary (a dictionary where keys that do not exist have a default value, which in this case is 0). Words are converted to lowercase before counting, so 'Jump' and 'jump' are counted together.

In [None]:
# Song lyrics by House of Pain
jump = 'So get out your seat and jump around! Jump around! Jump around! Jump up, jump up and get down!'
blob = TextBlob(jump)
blob.word_counts

How many times does 'jump' appear?

In [None]:
blob.word_counts['jump']

Because word_counts is a default dictionary, looking up a word not in the dictionary will return 0, and also add the word to the dictionary! Note: we could remove a key from a dictionary *d* by using *d.pop(key)*.

In [None]:
blob.word_counts['cheese']
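
The same behavior can be reproduced with a plain *collections.defaultdict*, independent of TextBlob. This sketch uses a made-up word list and shows that indexing inserts a missing key, while *get()* and *in* do not.

```python
from collections import defaultdict

counts = defaultdict(int)
for word in 'jump around jump up'.split():
    counts[word] += 1

# Indexing a missing key returns the default (0) AND inserts the key
print(counts['cheese'])
print('cheese' in counts)

# get() and `in` check for a key without inserting it
print(counts.get('pizza', 0))
print('pizza' in counts)

# pop() removes a key that was added by accident
counts.pop('cheese')
print(dict(counts))
```

Using *get()* or *in* is a safe way to check for a word without changing the dictionary.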

We can iterate through the keys of a dictionary using a for loop (*for key in dictionary*). We can also iterate through the (key, value) pairs of a dictionary by using *dictionary.items()*.

In [None]:
for word, count in blob.word_counts.items() :
    print(word, ': ', count, sep = '')

## Stopwords

Stopwords are common words (like 'a' and 'the') that should be ignored when analyzing text.

We can get a list of stopwords from *nltk.corpus* using the function *stopwords.words()*. We convert this list to a *set*, which is an unordered collection of unique items (i.e., it is *not* a sequence). The advantage of a set is that membership tests take constant time on average, so it is faster to test whether an item is in a set than whether it is in a list.

**Note**: We will first need to install the set of stopwords by running the following code:

```python
import nltk
nltk.download('stopwords')
```

In [None]:
# create a set of stopwords
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
sw
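
The difference in lookup speed can be observed with a rough timing sketch. The 10,000 strings below are a hypothetical stand-in for a word list, and the measured times will vary from machine to machine.

```python
import timeit

# Hypothetical stand-in for a word list: 10,000 distinct strings
word_list = [f'word{i}' for i in range(10_000)]
word_set = set(word_list)

target = 'word9999'  # last item: the worst case for scanning the list

# Time 200 membership tests against each container
list_time = timeit.timeit(lambda: target in word_list, number=200)
set_time = timeit.timeit(lambda: target in word_set, number=200)

print(f'list: {list_time:.5f} s, set: {set_time:.5f} s')
```

The list is scanned item by item, whereas the set uses hashing to find the item directly.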

## Removing stop words

A list comprehension can be used to create a list of (word, frequency) tuples with stop words removed.

In [None]:
# create a list of (word, frequency) tuples but with stop words removed
words = [ (w, f) for w,f in blob.word_counts.items() if w not in sw]
words

## Sorting by frequency

The following code returns a new sorted list, where elements are compared using their natural ordering (for example, alphabetical order for strings).
```python
sorted(listName)
```

If the list contains tuples, the tuples are compared element by element from left to right, so sorting is based primarily on the first element of each tuple. But sometimes we want to sort based on another element. This is accomplished by specifying the *key* argument to the *sorted* function. An *itemgetter* can be used here to specify the *index* to use for sorting.

In [None]:
# sort the list
from operator import itemgetter

# sort the word list by frequency (stored in index 1 of each tuple) in reverse order (highest to lowest)
words = sorted(words, key = itemgetter(1), reverse=True)
words
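
As a standalone sketch of the difference between the default ordering and sorting with a *key*, consider a small hand-written list of (word, frequency) tuples. A *lambda* is shown as an equivalent alternative to *itemgetter*.

```python
from operator import itemgetter

# Hand-written sample of (word, frequency) tuples
pairs = [('jump', 5), ('around', 3), ('seat', 1), ('get', 2)]

# Default: tuples are compared by their first element (the word)
by_word = sorted(pairs)

# key=itemgetter(1): compare by the second element (the frequency)
by_count = sorted(pairs, key=itemgetter(1), reverse=True)

# A lambda that returns pair[1] gives the same ordering
by_count_lambda = sorted(pairs, key=lambda pair: pair[1], reverse=True)

print(by_word)
print(by_count)
```

Note that *sorted* returns a new list and leaves the original list unchanged.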

## Generating a bar graph of word frequencies

A *bar graph* visualizes word frequencies by using a bar for each word; the height of the bar corresponds to the frequency of the word. 

In order to create a bar graph, we first create a data frame (a table) of word frequencies. The *pandas* module is used to create the data frame.

In [None]:
import pandas as pd

# create a data frame from a list of tuples (each tuple becomes a row; its elements fill the columns)
df = pd.DataFrame(words, columns = ['word', 'frequency'])
df

We can then create a bar graph directly from the data frame, using the pandas *plot.bar* function. Note that a *None* is included at the end of the cell to prevent the cell from displaying the value returned by the *ax.set_title* statement.

In [None]:
# generate a bar graph, where 'x' and 'y' are the data frame columns to use
ax = df.plot.bar(x = 'word', y = 'frequency', legend = False, color = 'lightblue')

# add a y-axis label and a title
ax.set_ylabel('frequency')
ax.set_title('Word Counts')
None

### Exercise

Because we looked up 'cheese' previously, this was added to the dictionary. How can we update the words list of tuples to remove (word, frequency) pairs where the frequency was 0?

We can generate a horizontal bar graph using the *plot.barh* function.

In [None]:
ax = df.plot.barh(x = 'word', y = 'frequency', legend = False)
ax.set_xlabel('frequency')
ax.set_title('Word Counts')
None

## Word clouds

A *word cloud* is a visualization of words where the size of each word is proportional to its frequency. Word clouds can be generated directly from text or from a dictionary containing words and their corresponding frequencies.

By default, a *WordCloud* is created with the following argument values:

```python
WordCloud(background_color = 'black', stopwords = None, colormap = 'viridis', ...)
```

The *stopwords* argument can be a set of stop words (strings) to remove; the default value of *None* will use a built-in stopwords list.

For additional colormaps see https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html


In [None]:
from wordcloud import WordCloud
# generate a word cloud from text (stop words will be removed)
wordcloud = WordCloud().generate(jump)
wordcloud.to_image()

In [None]:
# generate a word cloud from a dictionary of frequencies
wordcloud = WordCloud(colormap = 'prism').generate_from_frequencies(blob.word_counts)
wordcloud.to_image()