Building a System to Identify Sarcasm in Hindi Twitter Posts
Deep Learning (DL) and Natural Language Processing (NLP) for Detecting Sarcasm in Hindi Tweets
Introduction
"भाषा केवल संवाद का साधन नहीं, यह भावनाओं और संस्कृति का प्रतिबिंब है।"
(Language is not just a medium of communication; it’s a reflection of emotions and culture.)
"भाषा के में व्यंग्यता को समझना, मशीनों के लिए मैजिक है।"
(Decoding sarcasm in language is an art, but for machines, it is a marvel of science.)
Sarcasm is a unique blend of humor and subtlety that is often hard to detect, especially in regional languages like Hindi. Social media platforms serve as battlegrounds for emotions, humor, and sarcasm in the digital world, and decoding sarcastic intent there demands both linguistic knowledge and computational prowess. From picking up implicit tones to grasping cultural nuances, sarcasm detection is a complex yet rewarding task in Natural Language Processing (NLP).
This blog presents an end-to-end sarcasm detection system for Hindi tweets. You’ll gain insights into:
Exploratory Data Analysis (EDA)
Text preprocessing and feature engineering.
Training a deep learning model with BERT.
Deploying the model using a REST API.
By the end, you'll clearly understand how to tackle a real-world NLP task from the ground up.
Exploratory Data Analysis (EDA)
"Data is a precious thing and will last longer than the systems themselves." — Tim Berners-Lee
Why Does EDA Matter?
EDA provides a comprehensive understanding of the dataset, its distribution, and potential challenges.
Loading the Data
We begin by loading two datasets of sarcastic and non-sarcastic Hindi tweets.
import pandas as pd

# Load the two labeled datasets
sarcastic_df = pd.read_csv('./dataset/Sarcasm_Hindi_Tweets-SARCASTIC.csv')
non_sarcastic_df = pd.read_csv('./dataset/Sarcasm_Hindi_Tweets-NON-SARCASTIC.csv')
Sample Data:
Tweet | Label
यह तो बहुत सही है। | Sarcastic
वाकई? मुझे तो ऐसा नहीं लगा। | Sarcastic
आज का दिन सच में अच्छा था। | Non-Sarcastic
Observations:
Sarcastic Tweets often have implicit humor or exaggerated statements.
Non-Sarcastic Tweets are straightforward and lack hidden meanings.
Purpose: Load sarcastic and non-sarcastic tweet datasets.
Question: Why separate datasets for sarcastic and non-sarcastic tweets?
Answer: To maintain balance and ensure unbiased training data.
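A quick size check on the two datasets confirms whether the classes are in fact roughly balanced:
# Class-balance sanity check on the two datasets loaded above
print(f"Sarcastic tweets:     {len(sarcastic_df)}")
print(f"Non-sarcastic tweets: {len(non_sarcastic_df)}")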
Data Preparation
To preprocess the data, we use data_preparation.py, which combines the sarcastic and non-sarcastic Hindi tweet datasets, applies text cleaning, and saves a processed CSV file.
After loading both datasets, we add a new column called 'label': tweets in sarcastic_df are labeled 'sarcastic' and those in non_sarcastic_df are labeled 'non_sarcastic'. The two data frames are then concatenated into a single data frame, and a few unnecessary metadata columns are dropped.
sarcastic_df['label'] = 'sarcastic'
non_sarcastic_df['label'] = 'non_sarcastic'
df = pd.concat([sarcastic_df, non_sarcastic_df], axis=0)
# Drop user metadata columns that don't help with sarcasm detection
df = df.drop(['username', 'acctdesc', 'location', 'following', 'followers', 'totaltweets', 'usercreatedts', 'tweetcreatedts', 'retweetcount', 'hashtags'], axis=1)
df = df.reset_index(drop=True)
Word Counting:
Since we will be removing unnecessary words and tokens throughout the analysis, let us create a function called count_words() to count the number of words in each text.
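Here is a minimal sketch of such a helper (the word_count column name is illustrative):
def count_words(text: str) -> int:
    """Count whitespace-separated words in a piece of text."""
    return len(str(text).split())

# Track the word count so we can watch it shrink after each cleaning step
df['word_count'] = df['text'].apply(count_words)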
Take a moment to look at the type of data we have here. It's quite messy! It includes emojis, unnecessary newline characters, punctuation, stopwords, and other elements that don't add value to the analysis. In the following sections, we will go through these steps individually. Also, notice how the word count changes after each step.
Remove All Emojis from Hindi Text
Removing emojis is straightforward with a regular expression covering the emoji code-point ranges, like the one shown below:
# Compile the emoji pattern once
emoji_pattern = re.compile(
r"["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002500-\U00002BEF" # chinese characters
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001F926-\U0001F937"
u"\U00010000-\U0010FFFF"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200D"
u"\u23CF"
u"\u23E9"
u"\u231A"
u"\uFE0F" # dingbats
u"\u3030"
"]+", flags=re.UNICODE
)
def remove_emoji(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
"""
Remove emojis from the specified text column in a DataFrame.
Parameters:
df (pd.DataFrame): The input DataFrame containing text data.
text_column (str): The name of the column from which to remove emojis.
Returns:
pd.DataFrame: The DataFrame with emojis removed from the specified column (the input is modified in place).
"""
# Handle missing values
df[text_column] = df[text_column].fillna('').str.replace(emoji_pattern, '', regex=True)
return df
# sample = pd.DataFrame({'text': ['Hello 😊', 'Goodbye 👋', None]})
# sample_df = remove_emoji(sample, 'text')
df = remove_emoji(df, 'text')
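As a quick sanity check, the function can be applied to a made-up Hindi example:
# Illustrative sample row, not from the dataset
demo = pd.DataFrame({'text': ['यह तो बहुत सही है। 😂🔥']})
print(remove_emoji(demo, 'text')['text'][0])  # यह तो बहुत सही है।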
Observation:
- Looking at the index numbers 16171 and 16173, we can see that the emojis have been removed from the text.
Text Preprocessing
To prepare the text for analysis, we clean and preprocess it in data_preparation.py by:
Converting it to lowercase.
Removing URLs, mentions, hashtags, digits, and punctuation.
Cleaning Text
The function below removes URLs, mentions, hashtags, digits, and punctuation while converting the text to lowercase.
import re
import string

def preprocess_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'#\w+', '', text)     # Remove hashtags
    text = re.sub(r'\d+', '', text)      # Remove digits
    # Remove ASCII punctuation and the Devanagari danda
    text = text.translate(str.maketrans('', '', string.punctuation + '।'))
    text = re.sub(r'\s+', ' ', text)     # Collapse extra whitespace
    return text.strip()

df['text'] = df['text'].apply(preprocess_text)
Purpose: Normalize text by converting it to lowercase and removing URLs, mentions, hashtags, digits, and special characters.
Question: Why remove URLs and hashtags?
Answer: These elements add noise and don’t contribute to sarcasm detection.
Before: "यह @aarpitdubey बहुत मजेदार है! 😂https://www.linkedin.com/in/aarpitdubey"
After: "यह बहुत मजेदार है"
Combining and Labeling Data
For model training, the string labels are replaced with binary values and the combined data frame is saved as a processed CSV:
sarcastic_df['label'] = 1 # Sarcastic
non_sarcastic_df['label'] = 0 # Non-Sarcastic
df = pd.concat([sarcastic_df, non_sarcastic_df], axis=0)
df.to_csv('./dataset/processed_data.csv', index=False)
Purpose: Assign binary labels (1 for sarcastic, 0 for non-sarcastic) and combine datasets.
Observations:
Tweet | Label
कट्टर हिन्दू हूं हिंदुत्व की बातें करता हूं । हिन्दुस्तानी हूं हिंदुराष्ट्र की बातें करता हूं । आतंकवादी ,चमचे ,खान, स्टार किड्स,जौहर को फॉलो करने वाले दुर रहे | Sarcastic
मैंने हाल ही में एक नई किताब पढ़ी, जो बहुत प्रेरणादायक थी। | Non-Sarcastic
आज का दिन सच में अच्छा था। | Non-Sarcastic
Key insights:
Sarcastic Tweets: Contain exaggeration or implicit tones.
Non-Sarcastic Tweets: Straightforward statements without hidden meanings.
Feature Engineering with BERT Tokenizer
"A word is not just a string of letters; it’s an ocean of meaning waiting to be explored."
Why BERT?
BERT (Bidirectional Encoder Representations from Transformers) captures contextual word relationships, making it ideal for nuanced tasks like sarcasm detection. We use it to tokenize and represent our text.
Tokenization
BERT tokenizer splits sentences into subwords while preserving context.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
Example: "सच में?" →
[CLS] सच में ? [SEP]
Purpose: Tokenize and pad input text into fixed-length sequences for BERT.
Question: What does truncation=True do?
Answer: Ensures sequences longer than max_length are truncated to prevent dimension mismatch.
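A caveat: bert-base-uncased ships an English vocabulary, so much of the Devanagari input is mapped to [UNK] tokens. A multilingual checkpoint usually tokenizes Hindi far better and is worth trying as an alternative:
from transformers import BertTokenizer

# Suggested alternative, not the checkpoint used in this project
mbert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
print(mbert_tokenizer.tokenize("सच में?"))  # real subword pieces instead of [UNK]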
Training a Deep Learning Model
Model Architecture
We fine-tune a pre-trained BERT model for binary classification.
The model_training.py script loads a pre-trained BERT model (bert-base-uncased) and fine-tunes it for sarcasm detection.
Tokenizer: Converts text into token IDs.
Model: BERT for sequence classification with 2 output labels (sarcastic, non_sarcastic).
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
def train_model():
# Load the processed data
df = pd.read_csv('/content/drive/MyDrive/sarcasm_detection/src/dataset/processed_data.csv')
# Split the data
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize the input data
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True, max_length=128)
# Convert to PyTorch datasets
class SarcasmDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels)
train_dataset = SarcasmDataset(train_encodings, y_train.tolist())
test_dataset = SarcasmDataset(test_encodings, y_test.tolist())
# Load the BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Define training arguments
training_args = TrainingArguments(
output_dir='./models/',
num_train_epochs=5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs/',
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
)
# Create Trainer instance
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
)
# Train the model
trainer.train()
# Save the model and tokenizer
model.save_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
tokenizer.save_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
if __name__ == "__main__":
train_model()
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
Purpose: Load a pre-trained BERT model for binary classification.
Question: Why use num_labels=2?
Answer: Indicates binary classification (sarcastic vs. non-sarcastic).
Training Process
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir='./models/',
    num_train_epochs=5,  # matches the full training script above
per_device_train_batch_size=16
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset
)
trainer.train()
Purpose: Fine-tune BERT with training and evaluation datasets.
Question: How can accuracy be improved?
Answer: Use data augmentation, adjust hyperparameters, or increase training epochs.
Observations:
Training Accuracy: Near-perfect by epoch 5.
Training Loss: Converged to near zero.
Validation Loss: Very low, suggesting the model generalizes to held-out tweets.
Caveat: Near-zero training loss with near-perfect accuracy can also signal overfitting, so per-class validation metrics are worth inspecting, as in the sketch below.
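To look beyond the loss curve, per-class metrics can be computed on the held-out set; a minimal sketch using scikit-learn's classification_report (assuming trainer, test_dataset, and y_test from the training script are still in scope):
import numpy as np
from sklearn.metrics import classification_report

preds = trainer.predict(test_dataset)          # logits for the held-out set
y_pred = np.argmax(preds.predictions, axis=1)  # logits -> class indices
print(classification_report(y_test, y_pred, target_names=['non_sarcastic', 'sarcastic']))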
Saving the Model
model.save_pretrained('./models/sarcasm_model')
tokenizer.save_pretrained('./models/sarcasm_tokenizer')
Inference System
The inference.py script provides a real-time sarcasm prediction system.
Loading the Model
class SarcasmPredictor:
def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
        self.model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
Prediction Logic
Given an input text, the model predicts whether it’s sarcastic or non-sarcastic.
def predict_sarcasm(self, text):
inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
with torch.no_grad():
logits = self.model(**inputs).logits
return 'sarcastic' if torch.argmax(logits, dim=1).item() == 1 else 'non-sarcastic'
Testing the Inference System
Sample Hindi tweets were tested with the model:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
class SarcasmPredictor:
def __init__(self):
self.tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
self.model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
def predict_sarcasm(self, text):
# Preprocess the input text
inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
with torch.no_grad():
logits = self.model(**inputs).logits
predicted_class = torch.argmax(logits, dim=1).item()
return 'sarcastic' if predicted_class == 1 else 'non-sarcastic'
if __name__ == "__main__":
predictor = SarcasmPredictor()
sample_text_1 = "यह एक मजेदार ट्वीट है।"
result1 = predictor.predict_sarcasm(sample_text_1)
print(f"The prediction for the input text is: {result1}")
sample_text_2 = "मैंने हाल ही में एक नई किताब पढ़ी, जो बहुत प्रेरणादायक थी।"
result2 = predictor.predict_sarcasm(sample_text_2)
print(f"The prediction for the input text is: {result2}")
sample_text_3 = "मेरे दोस्त ने मुझे एक अच्छा उपहार दिया, मैं बहुत खुश हूँ। 😊"
result3 = predictor.predict_sarcasm(sample_text_3)
print(f"The prediction for the input text is: {result3}")
sample_text_4 = "कट्टर हिन्दू हूं हिंदुत्व की बातें करता हूं । हिन्दुस्तानी हूं हिंदुराष्ट्र की बातें करता हूं ।आतंकवादी ,चमचे ,खान , स्टार किड्स,जौहर को फॉलो करने वाले दुर रहे"
result4 = predictor.predict_sarcasm(sample_text_4)
print(f"The prediction for the input text is: {result4}")
sample_text_5 = "मुझे अपने परिवार के साथ समय बिताना बहुत पसंद है।"
result5 = predictor.predict_sarcasm(sample_text_5)
print(f"The prediction for the input text is: {result5}")
Purpose: Tokenize input text, pass it through the model, and predict the class (sarcastic/non-sarcastic).
Question: What does torch.argmax do?
Answer: Returns the index of the highest-probability class.
Observations:
The model correctly classifies the tone of Hindi tweets as sarcastic or non-sarcastic.
It detects sarcasm even in culturally nuanced tweets.
It performs well on both straightforward and subtle examples.
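Serving the Model
To deploy the model behind a REST API, as promised in the introduction, the predictor can be wrapped in a small web service. Here is a minimal sketch using Flask (one option among many frameworks; the endpoint name is illustrative), reusing the SarcasmPredictor class defined above:
from flask import Flask, request, jsonify

app = Flask(__name__)
predictor = SarcasmPredictor()  # loads the fine-tuned model and tokenizer once

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body like {"text": "..."}
    text = request.get_json(force=True).get('text', '')
    return jsonify({'text': text, 'prediction': predictor.predict_sarcasm(text)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
A POST to /predict with a JSON body such as {"text": "यह तो बहुत सही है।"} then returns the predicted label as JSON.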
Applications and Future Scope
Applications:
Chatbots: Improve conversational AI by detecting sarcasm.
Social Media Analysis: Enhance sentiment analysis with sarcasm detection.
Customer Support: Prioritize responses by identifying sarcastic complaints.
Future Scope:
Multilingual Sarcasm Detection: Extend to other regional languages.
Cross-Domain Adaptation: Apply to emails, reviews, and professional communications.
Explainability in NLP: Develop interpretable models.
Conclusion
"भाषा की सही समझ में जीवन की भूमिका है।"
(Language is the soul of civilization.)
Open-Source Impact: Enable researchers to extend the system to other languages or domains.
Hindi NLP Growth: Bridge gaps in regional language NLP systems.
This project demonstrates the potential of combining deep learning with cultural and linguistic insights. With applications ranging from chatbots to social media analysis, it paves the way for smarter, culturally-aware AI systems.
"नवाचार तभी सार्थक है जब वह उपयोगी हो।"
(Innovation is meaningful only when it is useful.)
"सरलता और जटिलता के बीच सही संतुलन ही वास्तविक प्रगति का संकेत है।"
(The true sign of progress lies in balancing simplicity and complexity.)
We successfully built an end-to-end sarcasm detection system using state-of-the-art NLP techniques. This project enhances our understanding of human emotions in digital text and demonstrates the power of AI in bridging linguistic barriers.
Are you ready to bring the nuances of Hindi into the universe of AI? Join the revolution today!
Feel free to experiment and expand this system to other regional languages or even cross-lingual sarcasm detection!