# Lab 8: Natural Language Processing

Write your name in the Markdown cell below [2 points]

### Name:          

## Lab setup

The code cells below extract the plain text Wikipedia entry for *Eastern Connecticut State University*. The details of this code are beyond the scope of our class, but if you are curious I can explain more about what it does. To complete the assignment, you just need to understand that the text of the Wikipedia article is stored in the variable *text*, which you will first analyze using *TextBlob*. The assignment begins with the *TextBlob questions* section.

In [None]:
# download the wikipedia entry
import requests
from bs4 import BeautifulSoup

page = requests.get('https://en.wikipedia.org/w/api.php?action=query&format=json&titles=Eastern_Connecticut_State_University&prop=extracts&explaintext')

if page.status_code != 200 :
    print('Error: Page Not Found. Try again or ask Dr. Dancik for help')

In [None]:
# result is in json format, which we convert to a dictionary using json.loads
import json
j = json.loads(page.text)

In [None]:
# extract the text

# first find the pageId
pageId = list(j['query']['pages'].keys())[0]
pageId

# get the extract for this pageId, and store the text in the variable 'text'
text = j['query']['pages'][pageId]['extract']
print(text)
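
For the curious, the nested-dictionary lookup used above can be tried on a small stand-in response. This is only a sketch: the page ID `"12345"`, the title, and the extract text are made up, and a real response from Wikipedia contains more fields.

```python
import json

# Hypothetical, shortened response: the page ID "12345" and the text are made up
sample_response = '''
{
  "query": {
    "pages": {
      "12345": {
        "title": "Example University",
        "extract": "Example University is a made-up school."
      }
    }
  }
}
'''

# convert the JSON string into nested Python dictionaries
sample = json.loads(sample_response)

# 'pages' is a dictionary whose only key is the page ID
pages = sample['query']['pages']
sample_id = list(pages.keys())[0]

# the plain text is stored under the 'extract' key for that page ID
sample_text = pages[sample_id]['extract']

print(sample_id)
print(sample_text)
```

The page ID is looked up rather than typed in because it is not known until the response arrives.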

## *TextBlob* questions

The code below imports *TextBlob* and creates a *blob* from the Wikipedia text.

In [None]:
from textblob import TextBlob
blob = TextBlob(text)

### Question 1  <span style = 'font-size: 80%'>[10 points]</span>
(a) How many words are in the text?

(b) How many sentences are in the text?

(c) Display the 3rd sentence.

### Question 2 <span style = 'font-size: 80%'>[10 points]</span>
(a) Print out all the sentences that contain 'Eastern'. Print a blank line after each sentence.

(b) A programmer may want to repeat this kind of analysis for multiple words or phrases. This is where a *function* can be very useful.
Write a function that has the following format:

```python
def printSentences(blob, search) :
    # prints out all sentences in the 'blob' that contain the 'search' term.
```

Then use this function to print out all sentences that contain 'Willimantic'.


### Question 3 <span style = 'font-size: 80%'>[10 points]</span>
Recall that the word counts are stored in the *blob.word_counts* dictionary. 
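
As a reminder of how a dictionary of counts is read, here is a plain-Python sketch on a made-up sentence. It uses `collections.Counter` as a stand-in and is not how *TextBlob* builds its counts; the sentence and the variable names `sample`, `counts`, and `n` are assumptions for illustration only.

```python
from collections import Counter

# made-up sentence, not the Wikipedia text
sample = "The cat sat on the mat and the dog sat too"

# lowercase the text, split it into words, and count each word
counts = Counter(sample.lower().split())

# look up a count by using the word as the key
n = counts['sat']

# an f-string inserts the value of n into the message
print(f"The word 'sat' appears {n} times")

# a Counter gives 0 for a word that never appears
print(counts['bird'])
```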

(a) How many times does the word 'eastern' appear in the text? 

(b) How many times does the word 'student' appear? For this question, display the answer in the following format: 

```
The word 'student' appears 7 times
```

### Question 4 <span style = 'font-size: 80%'>[10 points]</span>

Output all of the noun phrases that contain the word 'university'.

### Question 5 <span style = 'font-size: 80%'>[10 points]</span>
The code below uses a list comprehension to create a list of (word, count) pairs, one for each word.

In [None]:
from operator import itemgetter
wc = [(word,count) for word,count in blob.word_counts.items()]
wc[:5]
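
The `itemgetter` function imported above is one common way to sort a list of tuples by a chosen position. The sketch below uses made-up (item, quantity) pairs rather than the word counts; the names `stock` and `by_quantity` are hypothetical.

```python
from operator import itemgetter

# made-up (item, quantity) pairs
stock = [('pens', 12), ('erasers', 3), ('notebooks', 40), ('rulers', 7)]

# itemgetter(1) picks element 1 (the quantity) of each tuple;
# reverse=True puts the largest quantity first
by_quantity = sorted(stock, key=itemgetter(1), reverse=True)

# slicing shows only the first 2 tuples
print(by_quantity[:2])
```

Note that `sorted` returns a new list and leaves the original list unchanged.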

(a) Sort this list from the most frequent word to the least frequent word, and display the first 5 words in sorted order. (You can display the first 5 tuples, as above.)

(b) Your results above should show that 'the' is the most common word. Let's now remove common words like 'the' and 'and'. The code below creates a set of stopwords that are stored in *sw*. Create a new list of word-frequency pairs but with the stopwords removed. Display the word counts, either as a list of tuples or as a data frame.

In [None]:
from nltk.corpus import stopwords
sw = set(stopwords.words('english'))
sw
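
Checking membership in a set is done with `in` or `not in`, and the check can be placed at the end of a list comprehension to keep only some items. The sketch below uses a hypothetical four-word set named `tiny_sw` and made-up pairs, not the NLTK stopwords or the Wikipedia counts.

```python
# hypothetical mini stopword set (the real set in sw is much larger)
tiny_sw = {'the', 'and', 'a', 'of'}

# made-up (word, count) pairs
pairs = [('the', 9), ('campus', 4), ('and', 6), ('library', 2)]

# keep a pair only if its word is not in the set
kept = [(word, count) for word, count in pairs if word not in tiny_sw]

print(kept)
```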

### Question 6 <span style = 'font-size: 80%'>[10 points]</span>

Generate a word cloud directly from the text, but convert all of the text to lowercase first.

### Question 7 <span style = 'font-size: 80%'>[10 points]</span>


<h3 style = 'color:red'> Skip this question -- textblob no longer supports translation </h3>

Use TextBlob's *translate* function to translate 

(a) the first sentence of the Wikipedia entry into Spanish 

(b) the second sentence of the Wikipedia entry into German

(Note: the language codes for Google Translate are available here: https://cloud.google.com/translate/docs/languages).

## *spacy* questions

Run the code below to load the *en_core_web_sm* language model and carry out natural language processing using spacy, storing the results in *doc*.

In [None]:
import spacy
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)

### Question 8 <span style = 'font-size: 80%'>[10 points]</span>

Recall that *doc* is a sequence of tokens (these include words and punctuation). 
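
As a refresher, slicing works the same way on any Python sequence. The sketch below uses an ordinary list of made-up tokens named `tokens`; slicing a *spacy* document follows the same index rules, but the result is a span of the document rather than a list.

```python
# made-up tokens stored in an ordinary list
tokens = ['The', 'quick', 'brown', 'fox', 'jumps', '.']

# [:3] takes everything before index 3 (the first 3 items)
first_three = tokens[:3]

# [-2:] starts 2 items from the end (the last 2 items)
last_two = tokens[-2:]

print(first_three)
print(last_two)
```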

(a) Use slicing to view the first 20 tokens.

(b) Use slicing to view the last 20 tokens.

### Question 9 <span style = 'font-size: 80%'>[10 points]</span>

Use displacy to view the text where the named entities are highlighted.

### Question 10 <span style = 'font-size: 80%'>[10 points]</span>

(a) Iterate through the named entities in *doc.ents* and output all of the dates. Note that if an entity *ent* is a date, then its label *ent.label_* will be equal to 'DATE'.

(b) In *spacy*, the sentences are stored in *doc.sents*. For a sentence *s*, we can get its named entities by using *s.ents*. The code below outlines *nested* for loops that iterate through each named entity of each sentence; its *if* condition and *print* statement are written in pseudocode. Copy this code, then replace the *if* condition and the *print* statement with working Python so that each sentence containing a date is printed. Note that *break* is used to break out of the inner loop (otherwise a sentence containing multiple dates would be printed multiple times).

```python
# for each sentence in the document
for s in doc.sents :
    # for each entity in the sentence
    for ent in s.ents :        
        # if the entity is a date
        if the entity is a date :
            # print out the sentence
            print out the sentence
            break
```
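
The effect of *break* in nested loops can be seen on a small, unrelated example. This sketch uses made-up shopping baskets rather than sentences and entities; the names `baskets` and `with_fruit` and the category labels are hypothetical.

```python
# made-up data: each basket is a list of (item, category) pairs
baskets = [
    [('apple', 'FRUIT'), ('bread', 'BAKERY'), ('pear', 'FRUIT')],
    [('milk', 'DAIRY')],
    [('bagel', 'BAKERY'), ('plum', 'FRUIT')],
]

with_fruit = []

# outer loop: for each basket
for basket in baskets:
    # inner loop: for each item in the basket
    for item, category in basket:
        if category == 'FRUIT':
            with_fruit.append(basket)
            # stop checking this basket so that it is recorded only once
            break

print(len(with_fruit))
```

Without the *break*, the first basket would be added twice because it contains two fruit items.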