BackendGuy
[machine learning] Cool start to Natural Language Processing in Python

What is NLP?

Natural language processing (NLP) is about developing applications and services that are able to understand human languages. Some practical examples of NLP are speech recognition (for example, Google voice search), understanding what a piece of content is about, and sentiment analysis. This cool start to Natural Language Processing shows the second of these: working out what a web page is about.

Benefits of NLP

As you know, millions of gigabytes of data are generated every day by blogs, social websites, and web pages.

Many companies gather this data to understand users and their interests, and pass these reports on to other companies so that they can adjust their plans.

Suppose a person loves traveling and regularly searches for holiday destinations. Online hotel and flight booking apps use those searches to show that person relevant advertisements.

Search engines are not the only implementation of natural language processing (NLP); there are a lot of other awesome implementations out there.

NLP Implementations

These are some of the successful implementations of Natural Language Processing (NLP):

  • Search engines such as Google and Yahoo. For example, Google's search engine can infer that you are interested in technology and show you results related to that interest.
  • Social website feeds such as the Facebook news feed. The news feed algorithm uses natural language processing to understand your interests and is more likely to show you related ads and posts than other posts.
  • Speech engines such as Apple Siri.
  • Spam filters such as Google's spam filters. Beyond the usual spam filtering, these filters now look at the content of an email to decide whether it is spam.

How do I Start with NLP using Python?

The Natural Language Toolkit (NLTK) is one of the most popular libraries for natural language processing (NLP). It is written in Python and has a big community behind it.

NLTK is also easy to learn, which makes it a good first natural language processing (NLP) library.

In this NLP tutorial, we will use the Python NLTK library.

Before I start installing NLTK, I assume that you know some Python basics to get started.

Install nltk

Whether you are using Windows, Linux, or Mac, you can install NLTK using pip:

$ pip install nltk

You can use NLTK on Python 2.7, 3.4, and 3.5 at the time of writing this post.

Alternatively, you can install it from the source tarball.

To check whether NLTK has installed correctly, you can open a Python terminal and type the following:

import nltk

If the import runs without an error, you have successfully installed the NLTK library.
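
The same check can be done from a script without raising an error when a package is missing. This is a minimal sketch, not part of NLTK itself: `is_installed` is a hypothetical helper name, and the list of packages simply reflects the libraries used in this post.

```python
import importlib.util


def is_installed(package_name):
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package_name) is not None


# Packages used in this tutorial: NLTK, Beautiful Soup (bs4), and html5lib.
for name in ("nltk", "bs4", "html5lib"):
    status = "installed" if is_installed(name) else "missing"
    print(name + ": " + status)
```

Any package reported as missing can be installed with pip before continuing.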

Once you’ve installed NLTK, you should install the NLTK packages by running the following code:

import nltk
nltk.download()

This opens the NLTK downloader, where you can choose which packages to install.

You can install all of the packages, since they are small. Now let’s start the show.

Here we will learn how to identify what a web page is about using NLTK in Python.

First, we will grab a webpage and analyze the text to see what the page is about.

The urllib module will help us fetch the web page:

import urllib.request
response =  urllib.request.urlopen('https://en.wikipedia.org/wiki/SpaceX')
html = response.read()
print(html)
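
Note that `response.read()` returns bytes, so `print(html)` shows a bytes literal rather than readable text. The following is a hedged sketch of one way to decode the body using the charset declared in the response headers. The helper names `decode_body` and `fetch_text` are made up for illustration, and falling back to UTF-8 is an assumption.

```python
import urllib.request


def decode_body(raw_bytes, headers, default="utf-8"):
    """Decode a response body using the charset from its Content-Type header."""
    # get_content_charset() returns None when no charset is declared.
    charset = headers.get_content_charset() or default
    return raw_bytes.decode(charset, errors="replace")


def fetch_text(url):
    """Download a page and return its body as a string (hypothetical helper)."""
    with urllib.request.urlopen(url) as response:
        return decode_body(response.read(), response.headers)


# Example use (requires network access):
# page = fetch_text('https://en.wikipedia.org/wiki/SpaceX')
# print(page[:200])
```

Decoding is optional for the rest of this post, because Beautiful Soup accepts the raw bytes directly.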

It’s pretty clear from the link that the page is about SpaceX. Now let us see whether our code is able to correctly identify the page’s context.

We will use Beautiful Soup, a Python library for pulling data out of HTML and XML files, to clean the HTML tags out of our web page text.

# 'html5lib' is a separate parser package (pip install html5lib).
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html5lib')
# separator=' ' keeps text from adjacent tags from running together.
text = soup.get_text(separator=' ', strip=True)
print(text)

This prints the text of the page with the HTML tags removed.
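
To make the tag-stripping step concrete, here is a small sketch of the same idea using only the standard library's `html.parser`. It is a stand-in for illustration, not how Beautiful Soup works internally; the `TextExtractor` class, the `html_to_text` function, and the sample markup are all made up for this example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def html_to_text(markup):
    """Return the text of an HTML string with tags removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


sample = ("<html><head><style>p{color:red}</style></head><body>"
          "<h1>SpaceX</h1><p>Rockets &amp; spacecraft</p>"
          "<script>var x = 1;</script></body></html>")
print(html_to_text(sample))  # SpaceX Rockets & spacecraft
```

For real pages, Beautiful Soup remains the more robust choice because it copes with malformed markup.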

Now that we have clean text from the crawled web page, let’s convert the text into tokens.

# Split on whitespace; each word becomes one token.
tokens = text.split()
print(tokens)

The text is now converted into a list of tokens.
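
Splitting on whitespace leaves punctuation attached to words, so "rockets," and "rockets!" would be counted as different tokens. The sketch below shows one possible alternative using a regular expression. The pattern and the `simple_tokenize` name are assumptions chosen for illustration, not part of NLTK.

```python
import re

# Runs of letters or digits, optionally followed by an apostrophe suffix ("spacex's").
TOKEN_PATTERN = re.compile(r"[a-z0-9]+(?:'[a-z]+)?")


def simple_tokenize(text):
    """Lowercase the text and return word tokens without surrounding punctuation."""
    return TOKEN_PATTERN.findall(text.lower())


sentence = "SpaceX's rockets, the rockets!"
print(sentence.split())           # ["SpaceX's", 'rockets,', 'the', 'rockets!']
print(simple_tokenize(sentence))  # ["spacex's", 'rockets', 'the', 'rockets']
```

Lowercasing also means that "The" and "the" end up as the same token, which matters for the frequency count later on.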

Count Word Frequency

NLTK offers the FreqDist class, which does the counting for us. We will also remove stop words (a, at, the, for, etc.) from our tokens so that they do not distort the word frequency count. Finally, we will plot the most frequently occurring words in the web page to get a clear picture of what the page is about.

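import nltk
from nltk.corpus import stopwords

# Needs the stop-word corpus from the NLTK downloader, e.g. nltk.download('stopwords').
# Build the stop-word set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))
# Compare in lowercase so that capitalized stop words such as "The" are removed too.
clean_tokens = [token for token in tokens if token.lower() not in stop_words]

# Count how often each remaining token occurs.
freq = nltk.FreqDist(clean_tokens)
for key, val in freq.items():
    print(str(key) + ':' + str(val))
# Plot the 20 most frequent tokens (plotting requires matplotlib).
freq.plot(20, cumulative=False)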

[Figures: word frequency count output; graph of the 20 most frequent words]

Great! The most frequent words show that the code has correctly identified that the web page is about SpaceX.
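
FreqDist is a subclass of Python's `collections.Counter`, so the counting step can be tried on a small example without downloading anything. This is a minimal sketch under stated assumptions: the stop-word set is a short hand-written sample rather than NLTK's English list, and `top_words` is a hypothetical helper.

```python
from collections import Counter

# A tiny hand-written stop-word sample, not NLTK's full English list.
SAMPLE_STOP_WORDS = {"a", "at", "the", "for", "is", "of", "and"}


def top_words(tokens, stop_words, n=2):
    """Return the n most frequent lowercase tokens that are not stop words."""
    kept = [t.lower() for t in tokens if t.lower() not in stop_words]
    return Counter(kept).most_common(n)


sample_tokens = ("SpaceX is a company and SpaceX builds rockets "
                 "for the launch of rockets at SpaceX").split()
print(top_words(sample_tokens, SAMPLE_STOP_WORDS))  # [('spacex', 3), ('rockets', 2)]
```

The words that survive stop-word removal and rise to the top of the count are what reveal the topic of the text.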

This post is a featured post from LikeGeeks
