---
title: 'Chapter 1: Finding words, phrases, names and concepts'
description: "This chapter will introduce you to the basics of text processing with spaCy. You'll learn about the data structures, how to work with trained pipelines, and how to use them to predict linguistic features in your text."
next: /chapter2
type: chapter
id: 1
---
Let's get started and try out spaCy! In this exercise, you'll be able to try
out some of the 60+ available languages.

- Use `spacy.blank` to create a blank English (`"en"`) `nlp` object.
- Create a `doc` and print its text.
- Use `spacy.blank` to create a blank German (`"de"`) `nlp` object.
- Create a `doc` and print its text.
- Use `spacy.blank` to create a blank Spanish (`"es"`) `nlp` object.
- Create a `doc` and print its text.
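As a minimal sketch of these steps for the English case (the example sentence
here is an assumption, not the exercise's original text):

```python
import spacy

# Create a blank English nlp object
nlp = spacy.blank("en")

# Process a text to create a doc (the sentence is a made-up example)
doc = nlp("This is a sentence.")
print(doc.text)
```

The same two lines work for German (`spacy.blank("de")`) and Spanish
(`spacy.blank("es")`) with a text in the respective language.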
When you call `nlp` on a string, spaCy first tokenizes the text and creates a
document object. In this exercise, you'll learn more about the `Doc`, as well
as its views `Token` and `Span`.

- Use `spacy.blank` to create the English `nlp` object.
- Process the text and instantiate a `Doc` object in the variable `doc`.
- Select the first token of the `Doc` and print its `text`.

You can index into a `Doc` the same way you index into a list in Python. For
example, `doc[4]` will give you the token at index 4, which is the fifth token
in the text. Remember that in Python the first index is 0, not 1.
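A minimal sketch of these steps, with an assumed example text:

```python
import spacy

nlp = spacy.blank("en")

# Process a text (the sentence here is an assumed example)
doc = nlp("I like tree kangaroos and narwhals.")

# Index into the doc to select the first token and print its text
first_token = doc[0]
print(first_token.text)
```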
- Use `spacy.blank` to create the English `nlp` object.
- Process the text and instantiate a `Doc` object in the variable `doc`.
- Create a slice of the `Doc` for the tokens "tree kangaroos" and "tree kangaroos and narwhals".

Creating a slice of a `Doc` works just like creating a slice of a list in
Python using the `:` notation. Remember that the last token index is exclusive;
for example, `0:4` describes the tokens 0 up to token 4, but not including
token 4.
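One possible sketch, assuming the text "I like tree kangaroos and narwhals."
so that the target phrases start at token 2:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I like tree kangaroos and narwhals.")

# A slice for "tree kangaroos": tokens 2 and 3 (index 4 is exclusive)
tree_kangaroos = doc[2:4]
print(tree_kangaroos.text)

# A slice for "tree kangaroos and narwhals": tokens 2 up to, not including, 6
tree_kangaroos_and_narwhals = doc[2:6]
print(tree_kangaroos_and_narwhals.text)
```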
In this example, you'll use spaCy's `Doc` and `Token` objects, and lexical
attributes to find percentages in a text. You'll be looking for two subsequent
tokens: a number and a percent sign.

- Use the `like_num` token attribute to check whether a token in the `doc` resembles a number.
- Get the token following the current token in the document. The index of the next token in the `doc` is `token.i + 1`.
- Check whether the next token's `text` attribute is a percent sign "%".

To get the token at a certain index, you can index into the `doc`. For example,
`doc[5]` is the token at index 5.
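A sketch of the token loop, assuming a made-up example text; the bounds check
guards against a number being the last token in the document:

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("The deal was worth 4 billion dollars, or about 3% of revenue.")

for token in doc:
    # Check if the token resembles a number (and isn't the last token)
    if token.like_num and token.i + 1 < len(doc):
        # Get the next token in the document
        next_token = doc[token.i + 1]
        # Check if the next token's text equals "%"
        if next_token.text == "%":
            print("Percentage found:", token.text)
```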
What's not included in a pipeline package that you can load into spaCy?
All saved pipelines include a `config.cfg` that defines the language to
initialize, the pipeline components to load, as well as details on how the
pipeline was trained and which settings were used.
To predict linguistic annotations like part-of-speech tags, dependency labels or named entities, pipeline packages include binary weights.
Trained pipelines allow you to generalize based on a set of training examples. Once they're trained, they use binary weights to make predictions. That's why it's not necessary to ship them with their training data.
Pipeline packages include a `strings.json` that stores the entries in the
pipeline's vocabulary and the mapping to hashes. This allows spaCy to only
communicate in hashes and look up the corresponding string if needed.
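To illustrate the hash-to-string mapping (the word "coffee" is just an assumed
example; processing a text adds its strings to the vocabulary's string store):

```python
import spacy

nlp = spacy.blank("en")
doc = nlp("I love coffee")

# Look up the hash for a string, then the string for that hash
coffee_hash = nlp.vocab.strings["coffee"]
print(coffee_hash)
print(nlp.vocab.strings[coffee_hash])
```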
The pipelines we're using in this course are already pre-installed. For more details on spaCy's trained pipelines and how to install them on your machine, see the documentation.
- Use `spacy.load` to load the small English pipeline `"en_core_web_sm"`.
- Process the text and print the document text.

To load a pipeline, call `spacy.load` on its string name. Pipeline names differ
depending on the language and the data they were trained on, so make sure to
use the correct name.
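A minimal sketch, assuming the pipeline is installed and using a made-up
example text:

```python
import spacy

# Load the small English pipeline
nlp = spacy.load("en_core_web_sm")

# Process a text (the sentence is an assumed example)
doc = nlp("Apple is opening a new store in San Francisco.")

# Print the document text
print(doc.text)
```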
You'll now get to try one of spaCy's trained pipeline packages and see its
predictions in action. Feel free to try it out on your own text! To find out
what a tag or label means, you can call `spacy.explain` in the loop. For
example: `spacy.explain("PROPN")` or `spacy.explain("GPE")`.

- Process the text with the `nlp` object and create a `doc`.
- For each token, print the token text, the token's `.pos_` (part-of-speech tag) and the token's `.dep_` (dependency label).

To create a `doc`, call the `nlp` object on a string of text. Remember that you
need to use the token attribute names with an underscore to get the string
values.
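A sketch of the loop, with an assumed example sentence:

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("She ate the pizza")  # assumed example text

for token in doc:
    # Print the text, predicted part-of-speech tag and dependency label
    print(token.text, token.pos_, token.dep_)

# Look up what a tag or label means
print(spacy.explain("PROPN"))
```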
- Process the text and create a `doc` object.
- Iterate over the `doc.ents` and print the entity text and `label_` attribute.

To create a `doc`, call the `nlp` object on a string of text. Remember that you
need to use the token attribute names with an underscore to get the string
values.
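For example (the text is an assumption):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Apple is looking at buying U.K. startup for $1 billion")  # assumed example

# Iterate over the predicted entities
for ent in doc.ents:
    # Print the entity text and its label
    print(ent.text, ent.label_)
```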
Models are statistical and not always right. Whether their predictions are correct depends on the training data and the text you're processing. Let's take a look at an example.

- Process the text with the `nlp` object.
- Iterate over the entities and print the entity text and label.
- Looks like the model didn't predict "iPhone X". Create a span for those tokens manually.

- To create a `doc`, call the `nlp` object on the text. Named entities are available as the `doc.ents` property.
- The easiest way to create a `Span` object is to use the slice notation; for example, `doc[5:10]` for the tokens at position 5 up to position 10. Remember that the last token index is exclusive.
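A sketch, assuming a text in which "iPhone" and "X" are tokens 1 and 2 (the
sentence is made up for illustration):

```python
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked")  # assumed example text

# Print the entities the model predicted
for ent in doc.ents:
    print(ent.text, ent.label_)

# "iPhone X" is tokens 1 and 2 here, so slice up to (not including) index 3
iphone_x = doc[1:3]
print("Missing entity:", iphone_x.text)
```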
Let's try spaCy's rule-based `Matcher`. You'll use the example from the
previous exercise and write a pattern that can match the phrase "iPhone X" in
the text.

- Import the `Matcher` from `spacy.matcher`.
- Initialize it with the `nlp` object's shared `vocab`.
- Create a pattern that matches the `"TEXT"` values of two tokens: `"iPhone"` and `"X"`.
- Use the `matcher.add` method to add the pattern to the matcher.
- Call the matcher on the `doc` and store the result in the variable `matches`.
- Iterate over the matches and get the matched span from the `start` to the `end` index.

- The shared vocabulary is available as the `nlp.vocab` attribute.
- A pattern is a list of dictionaries keyed by the attribute names. For example, `[{"TEXT": "Hello"}]` will match one token whose exact text is "Hello".
- The `start` and `end` values of each match describe the start and end index of the matched span. To get the span, you can create a slice of the `doc` using the given start and end.
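Putting the steps together (the example text is assumed, as above):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked")  # assumed example text

# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)

# A pattern matching two tokens with the exact texts "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
matcher.add("IPHONE_X_PATTERN", [pattern])

# Call the matcher on the doc
matches = matcher(doc)

# Iterate over the matches and slice out the matched span
for match_id, start, end in matches:
    print("Match found:", doc[start:end].text)
```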
In this exercise, you'll practice writing more complex match patterns using different token attributes and operators.
- Write one pattern that only matches mentions of the full iOS versions: "iOS 7", "iOS 11" and "iOS 10".
- To match a token with an exact text, you can use the `TEXT` attribute. For example, `{"TEXT": "Apple"}` will match tokens with the exact text "Apple".
- To match a number token, you can use the `"IS_DIGIT"` attribute, which will only return `True` for tokens consisting of only digits.
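One pattern that fits this description; exact-text and digit matching are
lexical, so a blank pipeline suffices (the example text is an assumption):

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# "iOS" followed by a token consisting only of digits
pattern = [{"TEXT": "iOS"}, {"IS_DIGIT": True}]
matcher.add("IOS_VERSION_PATTERN", [pattern])

doc = nlp("iOS 11 gets released next week.")  # assumed example text
for match_id, start, end in matcher(doc):
    print("Match found:", doc[start:end].text)
```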
- Write one pattern that only matches forms of "download" (tokens with the lemma "download"), followed by a token with the part-of-speech tag `"PROPN"` (proper noun).
- To specify a lemma, you can use the `"LEMMA"` attribute in the token pattern. For example, `{"LEMMA": "be"}` will match tokens like "is", "was" or "being".
- To find proper nouns, you want to match all tokens whose `"POS"` value equals `"PROPN"`.
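A sketch of this pattern; lemmas and part-of-speech tags are predictions, so a
trained pipeline is needed, and the example text is made up:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# A form of "download" followed by a proper noun
pattern = [{"LEMMA": "download"}, {"POS": "PROPN"}]
matcher.add("DOWNLOAD_THINGS_PATTERN", [pattern])

doc = nlp("I downloaded Fortnite on my laptop.")  # assumed example text
for match_id, start, end in matcher(doc):
    print("Match found:", doc[start:end].text)
```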
- Write one pattern that matches adjectives (`"ADJ"`) followed by one or two `"NOUN"`s (one noun and one optional noun).
- To find adjectives, look for tokens whose `"POS"` value equals `"ADJ"`. For nouns, look for `"NOUN"`.
- Operators can be added via the `"OP"` key. For example, `"OP": "?"` to match zero or one time.
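One way to write this pattern, with an assumed example text:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

# An adjective, one noun, and one optional second noun ("?" = zero or one)
pattern = [{"POS": "ADJ"}, {"POS": "NOUN"}, {"POS": "NOUN", "OP": "?"}]
matcher.add("ADJ_NOUN_PATTERN", [pattern])

doc = nlp("Features include a beautiful design and smart search.")  # assumed example
for match_id, start, end in matcher(doc):
    print("Match found:", doc[start:end].text)
```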