Natural Language Processing

Blog Author

Thirumala Reddy

Last Updated

December 19, 2022


Stemming:-

Stemming is the process of reducing inflected words to their word stem. Stemming is faster in execution than lemmatization, but the stem it produces may not be a meaningful word.

Ex:- 1)The words history and historical are reduced to truncated stems such as histori, which need not be dictionary words (the exact stem depends on the stemmer)

2)The words finally, final, and finalized are reduced to final

 

Applications of stemming:-

Stemming is used in sentiment classifiers, Gmail spam classifiers, etc.

Lemmatization:-

Lemmatization performs the same kind of reduction as stemming, but the output word we get from lemmatization (the lemma) is a meaningful word. Lemmatization takes more execution time than stemming.

Ex:- The words history and historical get converted into history, and the words finally, final, and finalized get converted into final.

Applications of Lemmatization:-

Lemmatization is used in chatbots, text summarization, language translation, etc.
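The difference between the two can be sketched with a toy example. This is only an illustration, not how a real stemmer or lemmatizer works: the suffix list `SUFFIXES`, the lookup table `LEMMAS`, and the helpers `toy_stem` and `toy_lemmatize` are all hypothetical, and the lookup table simply mirrors the examples given in this article.

```python
# Toy illustration only: real stemmers (e.g. Porter) and real lemmatizers
# (dictionary based) use far richer rules and vocabularies.

# Hypothetical suffix list, checked in this order.
SUFFIXES = ("ical", "ized", "ly", "y")

def toy_stem(word):
    """Strip the first matching suffix; the result need not be a real word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# Hypothetical lookup table standing in for a lemmatizer's dictionary.
# It mirrors the examples used in this article.
LEMMAS = {
    "history": "history",
    "historical": "history",
    "finally": "final",
    "final": "final",
    "finalized": "final",
}

def toy_lemmatize(word):
    """Look the word up; fall back to the word itself if it is unknown."""
    return LEMMAS.get(word.lower(), word.lower())

for w in ("history", "historical", "finally", "final", "finalized"):
    print(w, "-> stem:", toy_stem(w), "| lemma:", toy_lemmatize(w))
```

Here the toy stemmer turns history and historical into histor, which is not a word, while the lookup returns history. The lookup is also why a lemmatizer tends to be slower in practice: it needs a dictionary rather than a few string rules.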

-->The stopwords library is used to remove unwanted words. Examples of stop words are is, this, the, are  …

-->sent_tokenize is a function inside nltk that takes a paragraph and splits it into separate sentences. Internally it relies on the pre-trained Punkt sentence tokenizer to decide where each sentence ends.

-->We then take the resulting sentences and convert them into words by using word_tokenize.
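The flow of these three steps can be sketched in plain Python. This is a simplified stand-in and not the nltk implementation: the helpers `split_sentences`, `split_words`, and `remove_stop_words` are hypothetical, the regular expressions are deliberately naive, and `STOP_WORDS` is a tiny made-up list (nltk's English list is much longer).

```python
import re

# Small hypothetical stop-word list, for illustration only.
STOP_WORDS = {"he", "she", "is", "a", "and", "are", "this", "the"}

def split_sentences(paragraph):
    """Naive sentence splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", paragraph.strip())
    return [p for p in parts if p]

def split_words(sentence):
    """Naive word tokenizer: lowercase the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", sentence.lower())

def remove_stop_words(words):
    """Drop every word that appears in the stop-word list."""
    return [w for w in words if w not in STOP_WORDS]

paragraph = "He is a good boy. She is a good girl. Boy and girl are good."
sentences = split_sentences(paragraph)
cleaned = [remove_stop_words(split_words(s)) for s in sentences]
print(cleaned)
# [['good', 'boy'], ['good', 'girl'], ['boy', 'girl', 'good']]
```

With nltk installed and its data downloaded, `sent_tokenize`, `word_tokenize`, and `stopwords.words('english')` would take the place of these helpers.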

Click here to see the implementation of Stemming

Click here to see the implementation of Lemmatization

 


Word Embedding:-
Word Embedding is a technique that converts words into vectors.
-->Word Embedding is divided into two types:

1)Based on the count or frequency of words

a)Bag Of Words

b)TF-IDF

c)One Hot Embedding

2)Deep Learning Trained model

a)Word2Vec

i)Continuous Bag of Words (CBOW)

ii)Skip-gram

 

BAG OF WORDS:-

Let's understand the concept of a bag of words by taking an example. Let's assume that I have 3 statements

sent 1:- He is a good boy.

sent 2:- She is a good girl.

sent 3:- Boy and girl are good.

-->The first step is to lowercase the sentences and remove the stop words. The resulting sentences are

sent 1:- good boy

sent 2:- good girl

sent 3:- boy girl good

Let's construct a bag of words

WORDS   FREQUENCY
good    3
boy     2
girl    2

BOW--->
          good   boy   girl
sent 1    1      1     0
sent 2    1      0     1
sent 3    1      1     1


-->The order of good, boy, and girl is based on the frequency count. As good occurs more often than the remaining words, it is placed first.
-->In a binary bag of words, even if a word is present 2 or more times, its value after applying the bag of words is only 1.
  Ex:- Take the sentence "He is a very very good boy". After applying a binary bag of words the value for very is 1, not 2, i.e. a binary bag of words represents whether the word is present in a sentence, not how many times it occurs.
-->Semantic information is not captured by a bag of words.
-->The major disadvantage of a bag of words is visible above: for sent 1 we got 1 for both good & boy, i.e. both have equal representation, so we can't tell which word is more important for sentiment analysis.
   For example, in the sentence "He is intelligent", intelligent is the key word and should be given more importance. A bag of words does not do this.
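The bag of words for the three cleaned sentences can be reproduced with a short sketch. It uses only the standard library; the helper name `bow_vector` and its `binary` flag are assumptions made for illustration, not a library API.

```python
from collections import Counter

# The three example sentences after lowercasing and stop-word removal.
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]

# Vocabulary ordered by descending frequency (ties keep first-seen order).
freq = Counter(word for sent in sentences for word in sent)
vocab = [word for word, _ in freq.most_common()]

def bow_vector(words, vocab, binary=False):
    """Count each vocabulary word in `words`; with binary=True only mark presence."""
    counts = Counter(words)
    if binary:
        return [1 if counts[w] > 0 else 0 for w in vocab]
    return [counts[w] for w in vocab]

bow_matrix = [bow_vector(s, vocab) for s in sentences]
print(vocab)       # ['good', 'boy', 'girl']
print(bow_matrix)  # [[1, 1, 0], [1, 0, 1], [1, 1, 1]]

# Count versus binary bag of words for a sentence with a repeated word.
repeated = ["very", "very", "good", "boy"]
print(bow_vector(repeated, ["very", "good", "boy"]))               # [2, 1, 1]
print(bow_vector(repeated, ["very", "good", "boy"], binary=True))  # [1, 1, 1]
```

Every word that is present gets the same kind of value, which is why this representation cannot say which word matters more.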
N-Gram:-
-->In N-Grams, N is simply the number of consecutive words combined into one feature. If we give N as 2 it is called a Bi-Gram, in which we combine 2 words. In a Tri-Gram we combine 3 words.
-->This helps to capture some of the semantic information.
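A minimal sketch of how N-Grams are formed from a tokenized sentence. The helper name `ngrams` and the sample sentence are assumptions chosen for illustration.

```python
def ngrams(words, n):
    """Return every run of n consecutive words, joined into one string."""
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

words = ["he", "is", "not", "good"]
bigrams = ngrams(words, 2)
trigrams = ngrams(words, 3)
print(bigrams)   # ['he is', 'is not', 'not good']
print(trigrams)  # ['he is not', 'is not good']
```

The bi-gram "not good" keeps the negation attached to the word it modifies, which a plain bag of single words would lose.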
TF-IDF:-
TF:-
TF (Term Frequency) is the ratio of the number of times a word is repeated in a sentence to the total number of words in that sentence.
IDF:-
IDF (Inverse Document Frequency) is the log of the ratio of the total number of sentences to the number of sentences containing the word.
-->In TF-IDF we give more weightage to the words that occur in fewer sentences & less weightage to the words that are repeated across many sentences, i.e. we give less weightage to good & more weightage to both boy & girl.
-->TF captures how often a word is repeated within a sentence & IDF captures how rare the word is across all the sentences.
-->We multiply TF & IDF to convert the sentences into vectors.
-->Let's take the sentences that result from performing Lemmatization or Stemming and removing the stop words
sent 1-->good boy
sent 2-->good girl
sent 3-->boy girl good
WORDS   FREQUENCY
good    3
boy     2
girl    2

TF
        sent 1   sent 2   sent 3
good    1/2      1/2      1/3
boy     1/2      0        1/3
girl    0        1/2      1/3

IDF
Words   IDF
good    log(3/3)=0
boy     log(3/2)
girl    log(3/2)

TF-IDF = TF * IDF
          good   boy              girl
sent 1    0      (1/2)*log(3/2)   0
sent 2    0      0                (1/2)*log(3/2)
sent 3    0      (1/3)*log(3/2)   (1/3)*log(3/2)
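The TF-IDF values can be checked with a short sketch that follows the two definitions literally. The function names `tf` and `idf` are chosen for illustration, and the natural logarithm is an assumption, since the base of the log is not stated here.

```python
import math

# The three example sentences after stop-word removal.
sentences = [["good", "boy"], ["good", "girl"], ["boy", "girl", "good"]]
vocab = ["good", "boy", "girl"]

def tf(word, sentence):
    """Repetitions of the word in the sentence / number of words in the sentence."""
    return sentence.count(word) / len(sentence)

def idf(word, sentences):
    """log(number of sentences / number of sentences containing the word)."""
    containing = sum(1 for s in sentences if word in s)
    return math.log(len(sentences) / containing)

# One row per sentence, one column per vocabulary word.
tfidf = [[tf(w, s) * idf(w, sentences) for w in vocab] for s in sentences]

for i, row in enumerate(tfidf, start=1):
    print(f"sent {i}:", [round(value, 3) for value in row])
```

The column for good is 0 everywhere because it appears in all three sentences, while boy and girl get non-zero weights. Library implementations often use smoothed variants of these formulas, so their numbers can differ from this literal version.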

 

Refer to this to know how to apply TF-IDF

 

Word2Vec:-
In Word2Vec, instead of assigning a single value like 1,2,0.4... to each word, we represent each word with a vector of 32 or more dimensions.
-->Word2Vec represents words that have similar meanings with similar vectors.
-->In Word2Vec the relationship between different words is preserved.
-->Below is the example of Word2Vec & in the example we can see how the relationship between different words is maintained.
-->The steps we will follow in word embedding are as follows
1)We take the sentence for which word embedding is to be done.
2)We perform one hot representation.
        In Keras, we have a function called one_hot which converts each word into an integer index within a given vocabulary size.
3)Next we pad the result of the one hot representation by using pad_sequences, a utility function in Keras, so that every sentence has the same length.
-->If our sentence length is 5 but we have given a padding size of 8, the remaining 3 places are filled with zeros.
-->If we give padding='pre', the zeros are placed at the start. If we give padding='post', the zeros are placed at the end.
4)The padded result is passed into the Embedding Layer, which is configured with a vocabulary size (voc_size) and a number of dimensions (dim).
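The padding and embedding steps can be mimicked in plain Python to see what happens to the data. This is not Keras code: the helper `pad`, the word indices, the sizes, and the randomly filled `embedding_table` are all made up for illustration. Unlike a full padding utility, this sketch does not truncate sentences that are longer than the target length.

```python
import random

def pad(sequence, maxlen, padding="pre", value=0):
    """Fill a list of word indices with `value` until it has `maxlen` entries."""
    zeros = [value] * max(0, maxlen - len(sequence))
    return zeros + sequence if padding == "pre" else sequence + zeros

# Hypothetical word indices for a 5-word sentence.
indices = [12, 7, 31, 4, 18]

padded_pre = pad(indices, 8, padding="pre")
padded_post = pad(indices, 8, padding="post")
print(padded_pre)   # [0, 0, 0, 12, 7, 31, 4, 18]
print(padded_post)  # [12, 7, 31, 4, 18, 0, 0, 0]

# Stand-in for an embedding layer: a table with one dim-sized row per index.
# A real layer learns these values during training; here they are random.
voc_size, dim = 50, 4
rng = random.Random(0)
embedding_table = [[rng.uniform(-0.05, 0.05) for _ in range(dim)]
                   for _ in range(voc_size)]

# Looking up each padded index gives one vector per position.
embedded = [embedding_table[i] for i in padded_pre]
print(len(embedded), len(embedded[0]))  # 8 4
```

Each sentence therefore becomes a grid of sentence length by dim values, which is the shape the later layers of a model receive.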
AVERAGE WORD2VEC:-
The main problem with Word2Vec is that each & every word in the input gets converted into a vector of the given number of dimensions, i.e. if I have 4 words in the input, I get 4 separate vectors. So the total number of values grows with the length of the sentence. To overcome this issue we use Average Word2Vec. In Average Word2Vec, the vectors of all the words in the sentence are averaged dimension by dimension, giving a single vector of fixed size for the whole sentence.
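A minimal sketch of the averaging step. The 3-dimensional vectors below are made-up numbers, not output from a trained Word2Vec model, and `average_vector` is a hypothetical helper name.

```python
def average_vector(vectors):
    """Average a list of equal-length vectors dimension by dimension."""
    dim = len(vectors[0])
    return [sum(vec[d] for vec in vectors) / len(vectors) for d in range(dim)]

# Made-up word vectors for a 4-word input.
word_vectors = {
    "he":   [1.0, 2.0, 0.0],
    "is":   [3.0, 0.0, 2.0],
    "good": [0.0, 2.0, 4.0],
    "boy":  [0.0, 0.0, 2.0],
}

sentence = ["he", "is", "good", "boy"]
sentence_vector = average_vector([word_vectors[w] for w in sentence])
print(sentence_vector)  # [1.0, 1.0, 2.0]
```

Four vectors of 3 dimensions each are reduced to one vector of 3 dimensions, so the output size no longer depends on how many words the sentence has.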

 

