The Importance of Data Pre-processing in Sentiment Analysis

Vidhi Khaitan
3 min read · Apr 20, 2022

Sentiment analysis, or opinion mining, is the classification of text as carrying positive, negative, or neutral sentiment. The resulting label is called the sentiment polarity of the text. In recent years, sentiment analysis has gained a lot of popularity and is now used across industries of all sizes, big or small!

What is Data Preprocessing?
Even when data is collected selectively, raw datasets contain unnecessary components that can hurt the accuracy of a sentiment analysis algorithm. Data preprocessing removes these components, which makes it a crucial part of sentiment analysis. We used Python to write a simple program that processes the data for us.

Image credit: Real Python

Data pre-processing for an English corpus can be divided into two steps:

  1. Clean non-English text - All text in the corpus is converted to lower case, and every character other than A-Z and a-z is replaced with a space. The text is then tokenized with the NLTK tokenizer to split it into individual words, and each token is kept only if the Unicode code point of every character is below 128, i.e., the word consists of plain ASCII letters (a short Python sketch of this step follows the list).
  2. Clean English text - In this step, all stopwords are removed (see the Stopwords section below).
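
A minimal sketch of step 1 in Python, assuming NLTK is installed; the helper name clean_non_english is ours for illustration, not from the original program:

import re
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)  # tokenizer model, downloaded once

def clean_non_english(text):
    """Illustrative helper: lowercase, keep only letters, tokenize, keep pure-ASCII tokens."""
    text = text.lower()
    text = re.sub(r"[^a-z]", " ", text)   # replace everything except letters with a space
    tokens = word_tokenize(text)          # split into individual words with NLTK
    # keep a token only if every character's Unicode code point is below 128
    return [tok for tok in tokens if all(ord(ch) < 128 for ch in tok)]

print(clean_non_english("Héllo, wonderful Wörld!"))  # ['h', 'llo', 'wonderful', 'w', 'rld']

Note how accented characters are replaced with spaces, so only plain ASCII fragments survive.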

PRE-PROCESSING TECHNIQUES
Tokenization
— It is the process of splitting a string into smaller pieces, such as words, keywords, phrases, and symbols, called tokens. Individual words, phrases, or even entire sentences can serve as tokens. Some characters, such as punctuation marks, are separated out during tokenization and can then be discarded.
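
For example, NLTK's sentence and word tokenizers split a short piece of text (the sample sentence is invented for illustration) as follows:

import nltk
from nltk.tokenize import sent_tokenize, word_tokenize

nltk.download("punkt", quiet=True)

text = "The plot was gripping. However, the ending felt rushed!"  # made-up sample

print(sent_tokenize(text))
# ['The plot was gripping.', 'However, the ending felt rushed!']

print(word_tokenize(text))
# ['The', 'plot', 'was', 'gripping', '.', 'However', ',', 'the', 'ending', 'felt', 'rushed', '!']

# punctuation tokens can then be dropped by keeping only alphabetic tokens
print([tok for tok in word_tokenize(text) if tok.isalpha()])
# ['The', 'plot', 'was', 'gripping', 'However', 'the', 'ending', 'felt', 'rushed']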

Stopwords — These are common words that carry little information and are removed so that the algorithm can focus on the words that matter. We use the stopword list provided by the NLTK Python library.

Some examples are: under, more, herself, this, should’ve, same, m, with, or, be, does, it’s, yourselves, both, will, before, until, other, they, should, above, needn’t, while, very, themselves, when, didn’t, because, by, mustn’t, why, up, and them, among others.
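
Removing them with NLTK's English stopword list is a short sketch; the token list reuses the example from the Tokenization section:

import nltk
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)  # fetch the stopword lists once

stop_words = set(stopwords.words("english"))

tokens = ["the", "plot", "was", "gripping", "however", "the", "ending", "felt", "rushed"]
print([tok for tok in tokens if tok not in stop_words])
# ['plot', 'gripping', 'however', 'ending', 'felt', 'rushed']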

Stemming or Lemmatization —
Stemming crudely chops common prefixes and suffixes off the beginning and end of a word, while lemmatization performs a morphological analysis to reduce a word to its dictionary form. Lemmatization is therefore usually better in terms of accuracy.
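
A quick comparison using NLTK's PorterStemmer and WordNetLemmatizer, one possible choice of tools; the part-of-speech tags are supplied by hand here:

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer

nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("studies"))                   # 'studi'  -> crude suffix chopping
print(lemmatizer.lemmatize("studies", pos="v"))  # 'study'  -> valid dictionary form
print(stemmer.stem("better"))                    # 'better' -> unchanged
print(lemmatizer.lemmatize("better", pos="a"))   # 'good'   -> knows the adjective's base form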

We used an existing dataset from Kaggle (a crowd-based open platform that hosts datasets and helps data science enthusiasts build their problem-solving skills). The dataset comprises plot summaries for 16,559 English books extracted from Wikipedia, along with aligned metadata from Freebase, including each book's author, title, and genre.

We compared the polarities obtained for 10 different summaries in our dataset with and without data preprocessing and observed a large difference between the two, which highlights the importance of data preprocessing in sentiment analysis.
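
The comparison can be reproduced with any off-the-shelf polarity scorer; the sketch below uses TextBlob purely as an illustration (the summary string is invented, and TextBlob is not necessarily the scorer behind Table 1):

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from textblob import TextBlob  # assumed scorer; pip install textblob

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

STOP_WORDS = set(stopwords.words("english"))

def preprocess(text):
    """Illustrative helper: apply the cleaning steps described above and rejoin the tokens."""
    text = re.sub(r"[^A-Za-z]", " ", text.lower())
    tokens = [tok for tok in word_tokenize(text) if tok not in STOP_WORDS]
    return " ".join(tokens)

summary = "The heroes were not convincing, and the plot dragged on forever."  # made-up example
print(TextBlob(summary).sentiment.polarity)              # polarity of the raw summary
print(TextBlob(preprocess(summary)).sentiment.polarity)  # polarity after preprocessing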

Table 1. Comparison of Sentiment Polarity for Raw and Preprocessed Data.

CONCLUSION
A proper data preprocessing phase is required to calculate the sentiment polarities of a corpus accurately. As Table 1 demonstrates, preprocessing clearly affects the resulting sentiment polarity.

Connect with Us!

Feel free to get in touch or email us at vidhik2002@gmail.com and nimishjain100701@gmail.com!
