Simply Put: What Is Natural Language Processing (NLP)?
Introduction to NLP
● NLP stands for Natural Language Processing. It is a branch of artificial intelligence that deals with the interaction between computers and humans using natural language.
● NLP gives a computer the ability to understand, analyze, manipulate, and potentially generate human language.
● Only about 21% of the available data today is in an organized form. Millions of tweets, emails, and web searches are generated daily, resulting in a huge amount of data that grows by the minute, and most of it is unstructured text. Natural Language Processing plays an important role in structuring this data.
The following figure describes Natural Language Processing.
Application Of Natural Language Processing
● Sentiment Analysis:
○ Sentiment analysis is the interpretation and classification of emotions in text data as positive, negative, or neutral, using text analysis techniques.
○ Twitter sentiment analysis and Facebook sentiment analysis are examples of it.
○ Sentiment analysis helps us learn people’s viewpoint towards a product, a topic, and so on.
● Chatbots:
○ A chatbot is simply a computer program that simulates human conversation.
○ Chatbots interact through instant messaging, artificially replicating the patterns of human interaction.
○ In effect, a chatbot enables a form of human-machine interaction.
● Speech Recognition:
○ Speech recognition, the ability for a machine to convert spoken language to text, is often used for voice dialing, call routing and voice search.
○ Today many devices rely on it, such as Alexa, Siri, etc.
● Machine Translation:
○ Machine Translation is another use case of natural language processing.
○ The best example of it is Google Translate, which translates text from one language to another.
● Spell Checking and keyword searching:
○ Spell checking and keyword searching are among the most common applications of natural language processing.
● Information Extraction:
○ It is often necessary to get important information out of raw data. Information extraction is the process of extracting specific information from various text sources.
● Advertisement Matching:
○ It is a widely used application of NLP, used to recommend ads based on the user’s history, searches, interests, etc.
Components of Natural Language Processing:
● There are two components of natural language processing.
○ Natural Language Understanding(NLU):
■ Natural language Understanding or we can say NLU, as the name says “understanding of input given by the user.
■ It deals with machine reading comprehension à ability to read text, to do process on it, and understand its meaning. It is involved with mapping the given inputs (let’s take plain text as input) in useful representations and analyzing the different aspects of the language.
○ Natural Language Generation(NLG):
■ The main task of this component is to generate meaningful output in natural language from some internal representation, according to the given input.
■ It includes:
■ Text planning: retrieving the relevant content from the database. Here the database can include vocabulary, sentences, knowledge, sample data, and more.
■ Sentence planning: once we have our content from text planning, the next step is choosing the required words and forming meaningful sentences, setting the words in the right grammatical order.
■ Text realization: we now have everything needed to produce the actual text in human language.
Steps to perform Natural Language Processing
● Identifying and removing Stop Words
● Stemming and lemmatization
● Bag of Words Model
● Word Vectorization
● POS Tagging
● Named Entity Recognition
Python Module used for Natural Language Processing
● Natural Language Toolkit (NLTK)
● Tokenization is the first step in NLP.
● Tokenization essentially divides a phrase, a sentence, a paragraph, or a whole text document into smaller units, such as words or terms. Each of these smaller units is called a token.
● Importance Of Tokenization:
○ Every word in a sentence has its own meaning, and text can be interpreted by analyzing the words present in it.
○ With the help of tokenization we can understand the sentiment behind individual words, and through them the sentence as a whole.
● There are different ways to perform tokenization. Some of them are as follows:
● Tokenization using the split() method
● Tokenization using regular expressions
● Tokenization using NLTK
● Tokenization using Gensim
● We can also tokenize a paragraph into sentences.
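As a dependency-free sketch, the first two approaches (the split() method and regular expressions) can be shown with Python’s standard library; NLTK’s word_tokenize and Gensim’s tokenizer would produce similar token lists:

```python
import re

text = "Natural Language Processing makes text easy to analyze."

# 1. Tokenization with the split() method: splits on whitespace only,
#    so punctuation stays attached to the word ("analyze.").
tokens_split = text.split()

# 2. Tokenization with a regular expression: \w+ keeps runs of word
#    characters and drops the punctuation.
tokens_re = re.findall(r"\w+", text)

print(tokens_split)  # last token is "analyze."
print(tokens_re)     # last token is "analyze"
```

Note the difference in how the two methods treat the trailing period; this is one reason dedicated tokenizers like NLTK’s are preferred for real text.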
● Stop Words
● A stop word is a widely used word (such as “the”, “a”, “an”, “in”) that a search engine has been programmed to ignore, both when indexing entries for searching and when retrieving them as the result of a search query.
● Text may contain stop words like ‘the’, ‘is’, ‘are’. Stop words can be filtered out of the text before processing. There is no universal list of stop words in NLP research; however, the NLTK module contains one.
● For building NLP models and for text mining, stop words often do not add much value, so they are removed.
● There are several methods to remove stop words:
● Stop word removal using NLTK
● Stop word removal using Gensim
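A minimal sketch of stop-word removal. In practice you would use NLTK’s stop-word list (nltk.corpus.stopwords.words('english')); here a small hand-picked set stands in for it so the example has no dependencies:

```python
# Small illustrative stop-word set (a stand-in for NLTK's full list).
STOP_WORDS = {"the", "is", "are", "a", "an", "in"}

def remove_stop_words(text):
    # Lowercase, split on whitespace, and drop any stop words.
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOP_WORDS]

print(remove_stop_words("The children are playing in the hall"))
# → ['children', 'playing', 'hall']
```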
● Stemming and Lemmatization
● Stemming is the process of normalizing words into their base or root forms.
● For example, a stemmer reduces the words "likes", "liking", "liked", and "likely" to the word "like".
● There are two main errors in stemming.
○ Over-stemming occurs when two words with different stems are stemmed to the same root. Over-stemming can be regarded as a false positive. For example, the words "universal", "university", and "universe" are all stemmed to "univers", even though their meanings differ.
○ Under-stemming occurs when two words that should be stemmed to the same root are not. Under-stemming can be interpreted as a false negative. For example, "data" and "datum" are stemmed to "dat" and "datu" respectively, so they do not resolve into the same word.
● There are different types of stemmers. Some of them are shown below.
● Porter Stemmer
● It is one of the most common stemming processes. It is based on the premise that the suffixes are made of a mixture of smaller and simpler suffixes in the English language.
● Example: EED -> EE means “if the word has at least one vowel and consonant plus EED ending, change the ending to EE” as ‘agreed’ becomes ‘agree’.
● In the output the suffix is removed.
● This method is not very precise.
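The EED -> EE rule described above can be sketched directly. The real Porter rule requires the "measure" of the stem to be greater than zero; as a simplifying assumption, the check here is just "the stem contains a vowel":

```python
def step_eed(word):
    # Sketch of the Porter rule EED -> EE. The real rule requires at
    # least one vowel-consonant sequence before the suffix; this is
    # simplified here to "the stem contains a vowel".
    if word.endswith("eed"):
        stem = word[:-3]
        if any(c in "aeiou" for c in stem):
            return stem + "ee"
    return word

print(step_eed("agreed"))  # stem "agr" contains a vowel → "agree"
print(step_eed("feed"))    # stem "f" has no vowel → unchanged
```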
● Snowball Stemmer
● The Snowball stemmer makes some improvements over the Porter stemmer.
● Lemmatization is the process by which a word is converted to its base form. The distinction between stemming and lemmatization is that lemmatization recognizes the context and transforms the term into its meaningful base form, while stemming merely eliminates the last few letters, frequently leading to incorrect definitions and errors in spelling.
● Lemmatization is therefore usually preferred over stemming.
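The contrast can be sketched as follows. A real lemmatizer such as NLTK’s WordNetLemmatizer uses a dictionary and context; the tiny lookup table below is purely illustrative:

```python
def crude_stem(word):
    # Naive stemming: chop a common suffix, whether or not the result
    # is a real word.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word

# Toy lemma lookup standing in for a real lemmatizer (illustrative only).
LEMMAS = {"caring": "care", "feet": "foot", "better": "good"}

def crude_lemmatize(word):
    return LEMMAS.get(word, word)

print(crude_stem("caring"))       # "car" — not a meaningful base form
print(crude_lemmatize("caring"))  # "care" — a meaningful base form
```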
Bag Of Words Model:
● One of the main goals of text analysis is to convert text into a numerical form so that we can use machine learning on it.
● Machine learning algorithms need numerical data to work with so that they can analyze the data and extract meaningful information.This is where the Bag of Words model comes in.
● This model extracts vocabulary from all the words in the documents and builds the model using a document-term matrix.
● This allows us to represent every document as a bag of words.
● A document-term matrix is a table that gives us the counts of various words. We can set thresholds and choose the more meaningful words.
● Let’s consider an example:
○ Consider the following sentences:
○ Sentence 1: The children are playing in the hall
○ Sentence 2: The hall has a lot of space
○ Sentence 3: Lots of children like playing in an open space
○ If you consider all three sentences, we have the following 14 unique words (treating “Lots” as “lot”): the, children, are, playing, in, hall, has, a, lot, of, space, like, an, open.
● For each sentence, let’s construct a histogram using the word count in each sentence.
● Each feature vector will be 14-dimensional because we have 14 unique words:
● Sentence 1: [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
● Sentence 2: [1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0]
● Sentence 3: [0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
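These vectors can be reproduced in plain Python. The one assumption, implied by the 14-word vocabulary, is that the plural “Lots” is normalized to “lot” so it shares a slot with “lot”:

```python
from collections import Counter

sentences = [
    "The children are playing in the hall",
    "The hall has a lot of space",
    "Lots of children like playing in an open space",
]

def tokenize(sentence):
    # Lowercase every word and map the plural "lots" onto "lot".
    return [{"lots": "lot"}.get(w.lower(), w.lower()) for w in sentence.split()]

docs = [tokenize(s) for s in sentences]

# Build the vocabulary in order of first occurrence.
vocab = []
for doc in docs:
    for token in doc:
        if token not in vocab:
            vocab.append(token)

# Document-term matrix: one count vector per sentence.
vectors = [[Counter(doc)[word] for word in vocab] for doc in docs]
print(len(vocab))   # 14
print(vectors[0])   # [2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
```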
● CountVectorizer works on term frequency, i.e. counting the occurrences of tokens and building a sparse documents x tokens matrix.
● TF-IDF stands for Term Frequency-Inverse Document Frequency. It is one of the most popular techniques in natural language processing. It is used to score how important a word is to a document within a corpus.
● One common goal is to take a given document, whether a blog post, news article, random website, or any other text, and extract the few sentences that best summarize it.
● A Term Frequency(TF) is a count of how many times a word occurs in a given document.
● The formula for Term Frequency is: TF(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
● Inverse Document Frequency (IDF) measures how rare a word is across a corpus of documents: words that appear in many documents get a low score.
● The formula for Inverse Document Frequency is: IDF(t) = log(total number of documents / number of documents containing term t)
In short, the TF-IDF formula is: TF-IDF(t, d) = TF(t, d) × IDF(t)
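These formulas can be implemented directly on the three example sentences (tokenized by hand here for simplicity):

```python
import math

docs = [
    ["the", "children", "are", "playing", "in", "the", "hall"],
    ["the", "hall", "has", "a", "lot", "of", "space"],
    ["lots", "of", "children", "like", "playing", "in", "an", "open", "space"],
]

def tf(term, doc):
    # Term Frequency: occurrences of the term divided by document length.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Inverse Document Frequency: log of (total documents / documents
    # containing the term). Rarer terms get larger scores.
    containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / containing)

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

# "the" appears in two of the three documents, so its IDF is log(3/2).
print(idf("the", docs))
print(tf_idf("hall", docs[0], docs))
```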
● POS stands for part of Speech.
● Bag-of-words sometimes fails to capture the structure of a sentence and can miss its appropriate meaning.
● POS tagging and chunking overcome this weakness.
● The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging.
● The part of speech explains how a word is used in a sentence. There are eight main parts of speech: nouns, pronouns, adjectives, verbs, adverbs, prepositions, conjunctions, and interjections.
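In practice POS tagging is done with a trained tagger such as NLTK’s pos_tag, which uses context and statistics. As a dependency-free sketch, a toy lookup-based tagger shows the idea (the tag table below is an illustrative assumption, not a real tagset resource):

```python
# Toy word-to-tag lookup table (illustrative only).
TAGS = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "mat": "NOUN", "dog": "NOUN",
    "sat": "VERB", "ran": "VERB",
    "on": "PREP",
}

def pos_tag(tokens):
    # Unknown words default to NOUN, a common fallback heuristic.
    return [(t, TAGS.get(t, "NOUN")) for t in tokens]

print(pos_tag("the cat sat on the mat".split()))
```

A real tagger would disambiguate words like "book" (noun or verb) from context, which a lookup table cannot do.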
Named Entity Recognition
● Named Entity Recognition (NER) is the most important method for identifying named entities in text. These include:
○ Dates, places, works of art, and so on.
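Real NER systems use trained statistical models; as a rough sketch of the idea, two simple regular expressions can spot date-like and name-like spans (the example sentence is hypothetical):

```python
import re

text = "Barack Obama visited Paris on 4 July 2015."

# Dates shaped like "4 July 2015".
dates = re.findall(r"\d{1,2} [A-Z][a-z]+ \d{4}", text)

# Runs of capitalized words as candidate names/places. Note that this
# also catches "July", a false positive that shows why real NER needs
# more than surface patterns.
candidates = re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)

print(dates)       # ['4 July 2015']
print(candidates)  # includes 'Barack Obama' and 'Paris'
```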
Website : https://www.societyofai.in/