# Natural language processing using *Spacy*

*Spacy* (https://spacy.io/usage/linguistic-features) is another Python module for *natural language processing (NLP)*.

*Spacy* performs many of the same analyses as *TextBlob*, but also includes *named entity recognition* (discussed below). While the choice between *TextBlob* and *Spacy* is a matter of preference for most analyses, our rule of thumb will be to use *TextBlob* for sentiment and word count analyses and *Spacy* for named entity recognition.

In order to use *Spacy*, a language model (https://spacy.io/models/en) must be loaded. Then natural language processing is carried out, as in the following example:

```python
import spacy

nlp = spacy.load('en_core_web_sm')  # load the English language model
doc = nlp('text to analyze')        # process the text
```

By convention, the loaded language model is stored in an object named *nlp*, and the processed text is stored in an object named *doc*, which is a sequence of *token* objects. Each token carries attributes such as its *lemma* and *part of speech*, as shown below.
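
To make the "sequence of tokens" idea concrete, here is a minimal sketch of indexing and slicing a *doc*. It assumes a blank English pipeline created with `spacy.blank("en")` (tokenizer only, so no model download is needed) and a short sentence of our own; neither is part of the example that follows in this notebook.

```python
import spacy

# Assumption: a blank English pipeline, which only tokenizes.
# It has no tagger or entity recognizer, unlike en_core_web_sm.
nlp = spacy.blank("en")
doc = nlp("Eastern is located in Willimantic, Connecticut.")

n_tokens = len(doc)      # a doc has a length, like a list
first = doc[0]           # indexing returns a single token
span = doc[4:7]          # slicing returns a span of tokens
words = [token.text for token in doc]  # iterating yields tokens

print(n_tokens, first.text, span.text)
print(words)
```

Note that punctuation marks such as the comma and the final period are tokens of their own.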

In [None]:
# import spacy and load language model
import spacy
nlp = spacy.load('en_core_web_sm')

# description of Eastern from Wikipedia
eastern = """
Eastern Connecticut State University is a public liberal arts university in 
Willimantic, Connecticut. Founded in 1889, it is the second-oldest campus in 
the Connecticut State University System and third-oldest public university in 
the state. Eastern is located on Windham Street in Willimantic, Connecticut, 
on 182 acres (0.74 km2) 30 minutes from Hartford, lying midway between New York 
City and Boston.
"""

# process the text
doc = nlp(eastern)

# doc is a sequence of token objects, but displays as text when printed or evaluated
doc

Each *token* in the *doc* object contains the following properties (for a full list see https://spacy.io/api/token#attributes):

- *token.text*: the word text
- *token.lemma_*: the lemma or base form of the word
- *token.tag_*: the detailed part of speech tag (e.g., NN is a singular noun and NNS is a plural noun) 
- *token.pos_*: the simple part of speech tag (e.g., singular and plural common nouns are both labeled NOUN)
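
If a tag is unfamiliar, `spacy.explain` returns a short description of it (or `None` if the label is not in its glossary). The sketch below looks up the three labels mentioned in the list above; it does not require a language model.

```python
import spacy

# Look up human-readable descriptions for the tags mentioned above.
labels = ["NN", "NNS", "NOUN"]
descriptions = {label: spacy.explain(label) for label in labels}

for label, description in descriptions.items():
    print(label, "->", description)
```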


The code below prints these properties for a single token, the one at index 1 of *doc*.

In [None]:
token = doc[1]
print('string =', token.text)
print('lemma =', token.lemma_)
print('tag =', token.tag_)
print('pos =', token.pos_)

Let's use a list comprehension to create a list of tuples, each containing the string, lemma, tag, and pos for one of the first 20 tokens.

In [None]:
tokens = [(token.text, token.lemma_, token.tag_, token.pos_) for token in doc[:20]]
tokens
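
A list of tuples in this shape is easy to summarize with plain Python. The sketch below counts the simple part of speech tags and pulls out the nouns. To keep it self-contained, it uses a few hand-written rows named `sample_tokens` in the same (text, lemma, tag, pos) layout; these rows are illustrative and are not actual output from the model.

```python
from collections import Counter

# Assumption: hand-written sample rows in (text, lemma, tag, pos) order,
# standing in for the `tokens` list built above.
sample_tokens = [
    ("Eastern", "Eastern", "NNP", "PROPN"),
    ("is", "be", "VBZ", "AUX"),
    ("a", "a", "DT", "DET"),
    ("public", "public", "JJ", "ADJ"),
    ("university", "university", "NN", "NOUN"),
    ("in", "in", "IN", "ADP"),
    ("Willimantic", "Willimantic", "NNP", "PROPN"),
]

# Count how often each simple part of speech tag occurs.
pos_counts = Counter(pos for _, _, _, pos in sample_tokens)

# Keep only the text of common and proper nouns.
nouns = [text for text, _, _, pos in sample_tokens if pos in ("NOUN", "PROPN")]

print(pos_counts.most_common(3))
print(nouns)
```

The same two lines work unchanged on the real `tokens` list if `sample_tokens` is replaced by `tokens`.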

### Create a pandas data frame to display the results

We can create a data frame (table) by specifying a list of tuples, where each element of the list corresponds to a row of the table, and each element of the tuple is a column. The *columns* argument specifies the name for each column.

In [None]:
import pandas as pd
df = pd.DataFrame(tokens, columns = ['token', 'lemma', 'tag', 'pos'])
df

## Named entity recognition

A named entity consists of a noun (or phrase) that corresponds to a predefined category such as a "person", a "date", or an "organization". Named entities can be accessed through 
```
doc.ents
```
which returns a tuple of *Span* objects, each covering one or more tokens. Each entity, referred to as *ent* below, has the following properties:

- *ent.text*: the text of the named entity
- *ent.label_*: the label for the named entity

A list of the named entity labels can be found here: https://spacy.io/api/annotation#named-entities
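
Entity labels can also be looked up directly with `spacy.explain`. The sketch below is a self-contained illustration under two assumptions of ours: it uses a blank pipeline from `spacy.blank("en")`, which has no entity recognizer, and it assigns the two entities by hand. With a loaded model such as `en_core_web_sm`, `doc.ents` is filled in automatically and the hand assignment is not needed.

```python
import spacy
from spacy.tokens import Span

# Assumption: blank pipeline (tokenizer only), so entities are set by hand.
nlp = spacy.blank("en")
doc = nlp("Eastern is located in Willimantic, Connecticut.")

# Hand-labeled for illustration: tokens 4 and 6 marked as GPE.
doc.ents = [Span(doc, 4, 5, label="GPE"), Span(doc, 6, 7, label="GPE")]

# Pair each entity's text and label with the label's description.
labelled = [(ent.text, ent.label_, spacy.explain(ent.label_)) for ent in doc.ents]

for text, label, description in labelled:
    print(text, label, description, sep=" | ")
```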

Let's create a list of named entities (text and labels) and display it in a data frame.

In [None]:
entities = [ (ent.text, ent.label_) for ent in doc.ents]
pd.DataFrame(entities, columns = ['Text','Label'])

## Visualizing named entities

The *displaCy* visualizer can be used to display the text with named entities highlighted, as in the code below.

In [None]:
from spacy import displacy
displacy.render(doc, style="ent")
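
Outside a notebook, the same visualization can be captured as an HTML string by passing `jupyter=False`; adding `page=True` wraps it in a complete HTML page. The sketch below is self-contained under the assumption of a blank pipeline with two hand-assigned entities, so it does not depend on a downloaded model.

```python
import spacy
from spacy import displacy
from spacy.tokens import Span

# Assumption: blank pipeline with entities assigned by hand for illustration.
nlp = spacy.blank("en")
doc = nlp("Eastern is located in Willimantic, Connecticut.")
doc.ents = [Span(doc, 4, 5, label="GPE"), Span(doc, 6, 7, label="GPE")]

# jupyter=False returns the markup as a string instead of displaying it;
# page=True produces a full HTML page rather than a fragment.
html = displacy.render(doc, style="ent", page=True, jupyter=False)

print(len(html), "characters of HTML")
```

The resulting string can be written to a file and opened in a browser.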

## Noun phrase extraction

In *Spacy*, noun phrases are called *noun chunks* and are available in *doc.noun_chunks*. Each *chunk* has the following properties:

- *chunk.text*: the text of the noun chunk
- *chunk.root*: the root word of the noun chunk (a *token* object)

In [None]:
# store the root word's text rather than the token object itself
chunks = [(chunk.text, chunk.root.text) for chunk in doc.noun_chunks]
df = pd.DataFrame(chunks, columns = ['Noun Chunk', 'Root'])
df