Building a System to Identify Sarcasm in Hindi Twitter Posts
Deep Learning (DL) and Natural Language Processing (NLP) for Detecting Sarcasm in Hindi Tweets
Introduction
"भाषा केवल संवाद का साधन नहीं, यह भावनाओं और संस्कृति का प्रतिबिंब है।"
(Language is not just a medium of communication; it’s a reflection of emotions and culture.)
"भाषा के में व्यंग्यता को समझना, मशीनों के लिए मैजिक है।"
(Decoding sarcasm in language is an art, but for machines, it is a marvel of science.)
Sarcasm is a unique blend of humor and subtlety that is often hard to detect, especially in regional languages like Hindi. Social media platforms serve as battlegrounds for emotions, humor, and sarcasm in the digital world, and decoding sarcastic intent there demands both linguistic knowledge and computational prowess. From picking up implicit tones to grasping cultural nuances, sarcasm detection is a complex yet rewarding task in Natural Language Processing (NLP).
This blog presents an end-to-end sarcasm detection system for Hindi tweets. You’ll gain insights into:
Exploratory Data Analysis (EDA)
Text preprocessing and feature engineering.
Training a deep learning model with BERT.
Deploying the model using a REST API.
By the end, you'll clearly understand how to tackle a real-world NLP task from the ground up.
Exploratory Data Analysis (EDA)
"Data is a precious thing and will last longer than the systems themselves." — Tim Berners-Lee
Why Does EDA Matter?
EDA provides a comprehensive understanding of the dataset, its distribution, and potential challenges.
Loading the Data
We begin by loading two datasets of sarcastic and non-sarcastic Hindi tweets.
import pandas as pd

# Load the two labeled datasets
sarcastic_df = pd.read_csv('./dataset/Sarcasm_Hindi_Tweets-SARCASTIC.csv')
non_sarcastic_df = pd.read_csv('./dataset/Sarcasm_Hindi_Tweets-NON-SARCASTIC.csv')
Sample Data:
Tweet | Label
यह तो बहुत सही है। | Sarcastic
वाकई? मुझे तो ऐसा नहीं लगा। | Sarcastic
आज का दिन सच में अच्छा था। | Non-Sarcastic
Observations:
Sarcastic Tweets often have implicit humor or exaggerated statements.
Non-Sarcastic Tweets are straightforward and lack hidden meanings.
Purpose: Load sarcastic and non-sarcastic tweet datasets.
Question: Why separate datasets for sarcastic and non-sarcastic tweets?
Answer: To maintain balance and ensure unbiased training data.
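A quick size check on the two datasets confirms whether the classes are in fact roughly balanced:
# Class-balance sanity check on the two datasets loaded above
print(f"Sarcastic tweets:     {len(sarcastic_df)}")
print(f"Non-sarcastic tweets: {len(non_sarcastic_df)}")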
Data Preparation
To preprocess the data, we use data_preparation.py, which combines the sarcastic and non-sarcastic Hindi tweet datasets, applies text cleaning, and saves a processed CSV file.
After loading both datasets, we add a new column called 'label': tweets in sarcastic_df are labeled 'sarcastic' and those in non_sarcastic_df are labeled 'non_sarcastic'. The two data frames are then concatenated into a single data frame, and a few unnecessary metadata columns are dropped.
sarcastic_df['label'] = 'sarcastic'
non_sarcastic_df['label'] = 'non_sarcastic'
df = pd.concat([sarcastic_df, non_sarcastic_df], axis=0)
# Drop user metadata columns that don't help with sarcasm detection
df = df.drop(['username', 'acctdesc', 'location', 'following', 'followers', 'totaltweets', 'usercreatedts', 'tweetcreatedts', 'retweetcount', 'hashtags'], axis=1)
df = df.reset_index(drop=True)
Word Counting:
Since we will be removing unnecessary words and tokens throughout the analysis, let us create a function called count_words() to count the number of words in each text.
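Here is a minimal sketch of such a helper (the word_count column name is illustrative):
def count_words(text: str) -> int:
    """Count whitespace-separated words in a piece of text."""
    return len(str(text).split())

# Track the word count so we can watch it shrink after each cleaning step
df['word_count'] = df['text'].apply(count_words)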
Take a moment to look at the type of data we have here. It's quite messy! It includes emojis, unnecessary newline characters, punctuation, stopwords, and other elements that don't add value to the analysis. In the following sections, we will go through these steps individually. Also, notice how the word count changes after each step.
Remove All Emojis from Hindi Text
Removing emojis is straightforward with a regular expression covering the emoji code-point ranges, like the one shown below:
# Compile the emoji pattern once
emoji_pattern = re.compile(
r"["
u"\U0001F600-\U0001F64F" # emoticons
u"\U0001F300-\U0001F5FF" # symbols & pictographs
u"\U0001F680-\U0001F6FF" # transport & map symbols
u"\U0001F1E0-\U0001F1FF" # flags (iOS)
u"\U00002500-\U00002BEF" # chinese characters
u"\U00002702-\U000027B0"
u"\U000024C2-\U0001F251"
u"\U0001F926-\U0001F937"
u"\U00010000-\U0010FFFF"
u"\u2640-\u2642"
u"\u2600-\u2B55"
u"\u200D"
u"\u23CF"
u"\u23E9"
u"\u231A"
u"\uFE0F" # dingbats
u"\u3030"
"]+", flags=re.UNICODE
)
def remove_emoji(df: pd.DataFrame, text_column: str) -> pd.DataFrame:
"""
Remove emojis from the specified text column in a DataFrame.
Parameters:
df (pd.DataFrame): The input DataFrame containing text data.
text_column (str): The name of the column from which to remove emojis.
Returns:
pd.DataFrame: The DataFrame with emojis removed from the specified column (the input is modified in place).
"""
# Handle missing values
df[text_column] = df[text_column].fillna('').str.replace(emoji_pattern, '', regex=True)
return df
# sample = pd.DataFrame({'text': ['Hello 😊', 'Goodbye 👋', None]})
# sample_df = remove_emoji(sample, 'text')
df = remove_emoji(df, 'text')
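As a quick sanity check, the function can be applied to a made-up Hindi example:
# Illustrative sample row, not from the dataset
demo = pd.DataFrame({'text': ['यह तो बहुत सही है। 😂🔥']})
print(remove_emoji(demo, 'text')['text'][0])  # यह तो बहुत सही है।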
Observation:
- Looking at the index numbers 16171 and 16173, we can see that the emojis have been removed from the text.
Text Preprocessing
To prepare the text for analysis, we clean and preprocess it in data_preparation.py by:
Converting it to lowercase.
Removing URLs, mentions, hashtags, digits, and punctuation.
Cleaning Text
The function below removes URLs, mentions, hashtags, digits, and punctuation while converting the text to lowercase.
import re
import string

def preprocess_text(text):
    text = str(text).lower()
    text = re.sub(r'http\S+', '', text)  # Remove URLs
    text = re.sub(r'@\w+', '', text)     # Remove mentions
    text = re.sub(r'#\w+', '', text)     # Remove hashtags
    text = re.sub(r'\d+', '', text)      # Remove digits
    # Remove ASCII punctuation and the Devanagari danda
    text = text.translate(str.maketrans('', '', string.punctuation + '।'))
    text = re.sub(r'\s+', ' ', text)     # Collapse extra whitespace
    return text.strip()

df['text'] = df['text'].apply(preprocess_text)
Purpose: Normalize text by converting it to lowercase and removing URLs, mentions, hashtags, digits, and special characters.
Question: Why remove URLs and hashtags?
Answer: These elements add noise and don’t contribute to sarcasm detection.
Before: "यह @aarpitdubey बहुत मजेदार है! 😂https://www.linkedin.com/in/aarpitdubey"
After: "यह बहुत मजेदार है"
Combining and Labeling Data
For model training, the string labels are replaced with binary values and the combined data frame is saved as a processed CSV:
sarcastic_df['label'] = 1 # Sarcastic
non_sarcastic_df['label'] = 0 # Non-Sarcastic
df = pd.concat([sarcastic_df, non_sarcastic_df], axis=0)
df.to_csv('./dataset/processed_data.csv', index=False)
Purpose: Assign binary labels (1 for sarcastic, 0 for non-sarcastic) and combine datasets.
Observations:
Tweet | Label
कट्टर हिन्दू हूं हिंदुत्व की बातें करता हूं । हिन्दुस्तानी हूं हिंदुराष्ट्र की बातें करता हूं । आतंकवादी ,चमचे ,खान, स्टार किड्स,जौहर को फॉलो करने वाले दुर रहे | Sarcastic
मैंने हाल ही में एक नई किताब पढ़ी, जो बहुत प्रेरणादायक थी। | Non-Sarcastic
आज का दिन सच में अच्छा था। | Non-Sarcastic
Key insights:
Sarcastic Tweets: Contain exaggeration or implicit tones.
Non-Sarcastic Tweets: Straightforward statements without hidden meanings.
Feature Engineering with BERT Tokenizer
"A word is not just a string of letters; it’s an ocean of meaning waiting to be explored."
Why BERT?
BERT (Bidirectional Encoder Representations from Transformers) captures contextual word relationships, making it ideal for nuanced tasks like sarcasm detection. We use it to tokenize and represent our text.
Tokenization
BERT tokenizer splits sentences into subwords while preserving context.
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
Example: "सच में?" →
[CLS] सच में ? [SEP]
Purpose: Tokenize and pad input text into fixed-length sequences for BERT.
Question: What does truncation=True do?
Answer: Ensures sequences longer than max_length are truncated to prevent dimension mismatch.
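A caveat: bert-base-uncased ships an English vocabulary, so much of the Devanagari input is mapped to [UNK] tokens. A multilingual checkpoint usually tokenizes Hindi far better and is worth trying as an alternative:
from transformers import BertTokenizer

# Suggested alternative, not the checkpoint used in this project
mbert_tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased')
print(mbert_tokenizer.tokenize("सच में?"))  # real subword pieces instead of [UNK]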
Training a Deep Learning Model
Model Architecture
We fine-tune a pre-trained BERT model for binary classification.
The model_training.py script loads a pre-trained BERT model (bert-base-uncased) and fine-tunes it for sarcasm detection.
Tokenizer: Converts text into token IDs.
Model: BERT for sequence classification with 2 output labels (sarcastic, non_sarcastic).
import pandas as pd
from sklearn.model_selection import train_test_split
from transformers import BertTokenizer, BertForSequenceClassification
from transformers import Trainer, TrainingArguments
import torch
def train_model():
# Load the processed data
df = pd.read_csv('/content/drive/MyDrive/sarcasm_detection/src/dataset/processed_data.csv')
# Split the data
X = df['text']
y = df['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Load the BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# Tokenize the input data
train_encodings = tokenizer(list(X_train), truncation=True, padding=True, max_length=128)
test_encodings = tokenizer(list(X_test), truncation=True, padding=True, max_length=128)
# Convert to PyTorch datasets
class SarcasmDataset(torch.utils.data.Dataset):
def __init__(self, encodings, labels):
self.encodings = encodings
self.labels = labels
def __getitem__(self, idx):
item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
item['labels'] = torch.tensor(self.labels[idx])
return item
def __len__(self):
return len(self.labels)
train_dataset = SarcasmDataset(train_encodings, y_train.tolist())
test_dataset = SarcasmDataset(test_encodings, y_test.tolist())
# Load the BERT model
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
# Define training arguments
training_args = TrainingArguments(
output_dir='./models/',
num_train_epochs=5,
per_device_train_batch_size=16,
per_device_eval_batch_size=16,
warmup_steps=500,
weight_decay=0.01,
logging_dir='./logs/',
logging_steps=10,
evaluation_strategy="epoch",
save_strategy="epoch",
)
# Create Trainer instance
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset,
)
# Train the model
trainer.train()
# Save the model and tokenizer
model.save_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
tokenizer.save_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
if __name__ == "__main__":
train_model()
from transformers import BertForSequenceClassification
model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=2)
Purpose: Load a pre-trained BERT model for binary classification.
Question: Why use num_labels=2?
Answer: Indicates binary classification (sarcastic vs. non-sarcastic).
Training Process
from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
output_dir='./models/',
    num_train_epochs=5,  # matches the full training script above
per_device_train_batch_size=16
)
trainer = Trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
eval_dataset=test_dataset
)
trainer.train()
Purpose: Fine-tune BERT with training and evaluation datasets.
Question: How can accuracy be improved?
Answer: Use data augmentation, adjust hyperparameters, or increase training epochs.
Observations:
Training Accuracy: Near-perfect by epoch 5.
Training Loss: Converged to near zero.
Validation Loss: Very low, suggesting the model generalizes to held-out tweets.
Caveat: Near-zero training loss with near-perfect accuracy can also signal overfitting, so per-class validation metrics are worth inspecting, as in the sketch below.
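To look beyond the loss curve, per-class metrics can be computed on the held-out set; a minimal sketch using scikit-learn's classification_report (assuming trainer, test_dataset, and y_test from the training script are still in scope):
import numpy as np
from sklearn.metrics import classification_report

preds = trainer.predict(test_dataset)          # logits for the held-out set
y_pred = np.argmax(preds.predictions, axis=1)  # logits -> class indices
print(classification_report(y_test, y_pred, target_names=['non_sarcastic', 'sarcastic']))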
Saving the Model
model.save_pretrained('./models/sarcasm_model')
tokenizer.save_pretrained('./models/sarcasm_tokenizer')
Inference System
The inference.py script provides a real-time sarcasm prediction system.
Loading the Model
class SarcasmPredictor:
def __init__(self):
        self.tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
        self.model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
Prediction Logic
Given an input text, the model predicts whether it’s sarcastic or non-sarcastic.
def predict_sarcasm(self, text):
inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
with torch.no_grad():
logits = self.model(**inputs).logits
return 'sarcastic' if torch.argmax(logits, dim=1).item() == 1 else 'non-sarcastic'
Testing the Inference System
Sample Hindi tweets were tested with the model:
import torch
from transformers import BertTokenizer, BertForSequenceClassification
class SarcasmPredictor:
def __init__(self):
self.tokenizer = BertTokenizer.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_tokenizer')
self.model = BertForSequenceClassification.from_pretrained('/content/drive/MyDrive/sarcasm_detection/src/models/sarcasm_model')
def predict_sarcasm(self, text):
# Preprocess the input text
inputs = self.tokenizer(text, return_tensors='pt', truncation=True, padding=True, max_length=128)
with torch.no_grad():
logits = self.model(**inputs).logits
predicted_class = torch.argmax(logits, dim=1).item()
return 'sarcastic' if predicted_class == 1 else 'non-sarcastic'
if __name__ == "__main__":
predictor = SarcasmPredictor()
sample_text_1 = "यह एक मजेदार ट्वीट है।"
result1 = predictor.predict_sarcasm(sample_text_1)
print(f"The prediction for the input text is: {result1}")
sample_text_2 = "मैंने हाल ही में एक नई किताब पढ़ी, जो बहुत प्रेरणादायक थी।"
result2 = predictor.predict_sarcasm(sample_text_2)
print(f"The prediction for the input text is: {result2}")
sample_text_3 = "मेरे दोस्त ने मुझे एक अच्छा उपहार दिया, मैं बहुत खुश हूँ। 😊"
result3 = predictor.predict_sarcasm(sample_text_3)
print(f"The prediction for the input text is: {result3}")
sample_text_4 = "कट्टर हिन्दू हूं हिंदुत्व की बातें करता हूं । हिन्दुस्तानी हूं हिंदुराष्ट्र की बातें करता हूं ।आतंकवादी ,चमचे ,खान , स्टार किड्स,जौहर को फॉलो करने वाले दुर रहे"
result4 = predictor.predict_sarcasm(sample_text_4)
print(f"The prediction for the input text is: {result4}")
sample_text_5 = "मुझे अपने परिवार के साथ समय बिताना बहुत पसंद है।"
result5 = predictor.predict_sarcasm(sample_text_5)
print(f"The prediction for the input text is: {result5}")
Purpose: Tokenize input text, pass it through the model, and predict the class (sarcastic/non-sarcastic).
Question: What does torch.argmax do?
Answer: Returns the index of the highest-probability class.
Observations:
The model correctly classifies the tone of Hindi tweets as sarcastic or non-sarcastic.
It detects sarcasm even in culturally nuanced tweets.
It performs well on both straightforward and subtle examples.
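Serving the Model
To deploy the model behind a REST API, as promised in the introduction, the predictor can be wrapped in a small web service. Here is a minimal sketch using Flask (one option among many frameworks; the endpoint name is illustrative), reusing the SarcasmPredictor class defined above:
from flask import Flask, request, jsonify

app = Flask(__name__)
predictor = SarcasmPredictor()  # loads the fine-tuned model and tokenizer once

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON body like {"text": "..."}
    text = request.get_json(force=True).get('text', '')
    return jsonify({'text': text, 'prediction': predictor.predict_sarcasm(text)})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
A POST to /predict with a JSON body such as {"text": "यह तो बहुत सही है।"} then returns the predicted label as JSON.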
Applications and Future Scope
Applications:
Chatbots: Improve conversational AI by detecting sarcasm.
Social Media Analysis: Enhance sentiment analysis with sarcasm detection.
Customer Support: Prioritize responses by identifying sarcastic complaints.
Future Scope:
Multilingual Sarcasm Detection: Extend to other regional languages.
Cross-Domain Adaptation: Apply to emails, reviews, and professional communications.
Explainability in NLP: Develop interpretable models.
Conclusion
"भाषा की सही समझ में जीवन की भूमिका है।"
(Language is the soul of civilization.)
Open-Source Impact: Enable researchers to extend the system to other languages or domains.
Hindi NLP Growth: Bridge gaps in regional language NLP systems.
This project demonstrates the potential of combining deep learning with cultural and linguistic insights. With applications ranging from chatbots to social media analysis, it paves the way for smarter, culturally-aware AI systems.
"नवाचार तभी सार्थक है जब वह उपयोगी हो।"
(Innovation is meaningful only when it is useful.)
"सरलता और जटिलता के बीच सही संतुलन ही वास्तविक प्रगति का संकेत है।"
(The true sign of progress lies in balancing simplicity and complexity.)
We successfully built an end-to-end sarcasm detection system using state-of-the-art NLP techniques. This project enhances our understanding of human emotions in digital text and demonstrates the power of AI in bridging linguistic barriers.
Are you ready to bring the nuances of Hindi into the universe of AI? Join the revolution today!
Feel free to experiment and expand this system to other regional languages or even cross-lingual sarcasm detection!