Sunday, October 4, 2015

A very (very) short primer on spacy.io

This is a primer and short Python tutorial for the spacy.io package. The tutorial follows the one on the website, but I changed the commands to fit the environment I use (CentOS 6.5 with a manually installed Python 2.7, configured by a script I published in a former post).

The tutorial was written and tested against version 0.93.

Installation


In order to install spacy.io and download its data, all we need to do is run the following:

sudo /usr/local/bin/pip2.7 install spacy
sudo python2.7 -m spacy.en.download all

Note: the main disadvantage of the package is that its data takes ~1.5 GB of disk space.
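
Before moving on, it is worth a quick sanity check that the package and its data load correctly. This one-liner is my own addition, not part of the official tutorial, and assumes the data downloaded to its default location:

python2.7 -c "from spacy.en import English; English(); print 'loaded ok'"

If the model data is missing, the import or construction will raise an error instead of printing.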

Usage


Using the package is very straightforward. It essentially exposes a single operation: calling the loaded English object on a text parses it, providing tokenization, lemmatization, tagging, and dependency parsing.



# import the English pipeline and construct it
# (this loads the model data, so do it once and reuse the object)
from spacy.en import English
nlp = English()

Now, let's assume that we work on a simple corpus, constructed from a list of texts.


corpus = [
    u"I ate his liver with some fava beans and a nice chianti.",
    u"Did you think I'd be too stupid to know what a eugoogly is?",
    u"Gentlemen, you can't fight in here! This is the War Room!",
    u"If you are looking for ransom, I can tell you I don't have money. But what I do have are a very particular set of skills, skills I have acquired over a very long career. Skills that make me a nightmare for people like you."
]

Note: it seems mandatory that the texts be Unicode, so you must use the u prefix (e.g. u"text") for string literals.
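
If your texts arrive as plain byte strings (read from a file, for example), decode them before parsing. A minimal sketch, assuming the bytes are UTF-8 encoded:

# decode a UTF-8 byte string to unicode before handing it to spaCy
raw = "I ate his liver with some fava beans and a nice chianti."
text = raw.decode("utf-8")  # now a unicode object
doc = nlp(text)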

To parse the corpus, all we need to do is:


docs = [
    nlp(d) for d in corpus
]

and now we have a list of parsed docs. Each parsed doc can be iterated by sentences or by tokens; I'll show iteration by sentences here (token iteration is sketched at the end of the post):


for idx, doc in enumerate(docs):
    print "working on doc {0}".format(idx)
    for sent in doc.sents:
        # for each sentence, print each token's original form, lemma,
        # coarse POS, Penn Treebank POS tag, and dependency label
        for token in sent:
            print (token.orth_, token.lemma_, token.pos_, token.tag_, token.dep_)

The code will print, for each token in each sentence in each document, the following:
  1. The original form
  2. The lemma
  3. The coarse part-of-speech
  4. The Penn Treebank POS tag
  5. The dependency label

To exemplify, here is the result for the first document in the corpus, "I ate his liver with some fava beans and a nice chianti.":

(u'I', u'i', u'NOUN', u'PRP', u'nsubj')
(u'ate', u'eat', u'VERB', u'VBD', u'ROOT')
(u'his', u'his', u'ADJ', u'PRP$', u'poss')
(u'liver', u'liver', u'NOUN', u'NN', u'dobj')
(u'with', u'with', u'ADP', u'IN', u'prep')
(u'some', u'some', u'ADJ', u'DT', u'det')
(u'fava', u'fava', u'NOUN', u'NN', u'compound')
(u'beans', u'bean', u'NOUN', u'NNS', u'pobj')
(u'and', u'and', u'CONJ', u'CC', u'cc')
(u'a', u'a', u'ADJ', u'DT', u'det')
(u'nice', u'nice', u'ADJ', u'JJ', u'amod')
(u'chianti', u'chianti', u'NOUN', u'NN', u'conj')
(u'.', u'.', u'PUNCT', u'.', u'punct')

The second token is 'ate' and the tuple printed is (u'ate', u'eat', u'VERB', u'VBD', u'ROOT'). We can see that the lemma of the token is eat, and that it is a verb in the past tense (VBD). It is also the root of the sentence's dependency tree.
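
As mentioned above, a parsed doc can also be iterated token by token, without going through sentences: a doc behaves as a sequence of tokens and supports indexing. A minimal sketch over the first document (my own addition, using the same attributes as above):

# a parsed doc is a sequence of tokens: iterate or index it directly
doc = docs[0]
for token in doc:
    print (token.orth_, token.lemma_)

# indexing works too: the second token of the first document
print doc[1].orth_  # u'ate'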

