Research

My research spans natural language processing (NLP), covering both natural language generation (NLG) and natural language understanding (NLU). I aim to apply contemporary NLP methods across diverse languages and domains. The research areas I have worked on, or am currently working on, are listed below.

1. Robust NLP

Recent research indicates that many natural language processing (NLP) systems are sensitive and susceptible to minor input alterations and adversarial attacks, which makes it difficult for them to generalize across diverse datasets. This lack of robustness poses a significant obstacle to the practical deployment of NLP systems. My ongoing work investigates the robustness of low-resource language models with the aim of improving generalization.
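
As a toy illustration of a robustness check, the sketch below perturbs inputs with character swaps and measures how often a classifier's prediction survives. The `classify` function is a hypothetical stand-in for any real model; this is a minimal sketch, not a system from this work.

```python
import random

def perturb(text, rate=0.1, seed=0):
    """Introduce character-swap typos at the given rate."""
    rng = random.Random(seed)
    chars = list(text)
    for i in range(len(chars) - 1):
        if chars[i].isalpha() and chars[i + 1].isalpha() and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)

def robustness_score(classify, texts, n_variants=5):
    """Fraction of perturbed variants whose label matches the clean prediction."""
    consistent = total = 0
    for text in texts:
        clean_label = classify(text)
        for seed in range(n_variants):
            consistent += classify(perturb(text, seed=seed)) == clean_label
            total += 1
    return consistent / total

def classify(text):  # hypothetical keyword classifier standing in for a real model
    return "positive" if "good" in text.lower() else "negative"

print(robustness_score(classify, ["The food was really good", "Service was slow"]))
```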

2. Trustworthy NLP

Trustworthy NLP refers to natural language processing systems whose output is reliable, accurate, and credible. In the context of hallucination mitigation and factual error correction for text summarization, trustworthy NLP entails the ability to identify and rectify misleading or incorrect information in generated summaries. Hallucination mitigation involves preventing the generation of fictitious or misleading content, ensuring that the summary accurately reflects the input text. Factual error correction, in turn, focuses on detecting and fixing inaccuracies or false claims within the summary to improve its trustworthiness and informativeness. By incorporating robust mechanisms for both, trustworthy NLP systems can provide users with more reliable and accurate summaries while maintaining a high standard of credibility.
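
One common approach is sketched below: a public natural language inference (NLI) checkpoint is used as a factual-consistency checker, flagging any summary sentence the source does not entail as a possible hallucination. The model choice (roberta-large-mnli) is illustrative, assuming a standard transformers installation, not the specific system behind this research.

```python
from transformers import pipeline

# Public NLI checkpoint repurposed as a factual-consistency checker.
nli = pipeline("text-classification", model="roberta-large-mnli")

def flag_hallucinations(source, summary_sentences):
    """Label each summary sentence as ENTAILMENT / NEUTRAL / CONTRADICTION
    with respect to the source; non-entailed sentences are suspect."""
    report = []
    for sent in summary_sentences:
        pred = nli({"text": source, "text_pair": sent})[0]
        report.append((sent, pred["label"], round(pred["score"], 2)))
    return report

source = "The company reported a 3% rise in quarterly revenue."
print(flag_hallucinations(source, ["Revenue grew by 3%.", "The CEO resigned."]))
```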

3. Bias & Fairness

The majority of AI systems and algorithms are data-driven and must be trained on data. If the underlying training data contains biases, the algorithms trained on it will learn and reproduce those biases in their predictions. As a result, biases in the data propagate into the models that use it, producing biased outcomes. This can have significant implications, such as perpetuating discrimination or unfair treatment based on factors like race, gender, or socioeconomic status. Mitigating these issues requires techniques such as bias detection and fairness-aware machine learning. My ongoing work investigates gender bias in low-resource pre-trained language models with the aim of promoting fairness in AI outcomes.
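
One simple probing technique for gender bias in masked language models is sketched below: compare the probabilities the model assigns to gendered pronouns in occupation templates. The bert-base-uncased checkpoint is illustrative (the ongoing work targets low-resource models), and the snippet assumes a transformers installation.

```python
from transformers import pipeline

# Fill-mask probe comparing gendered completions of occupation templates.
fill = pipeline("fill-mask", model="bert-base-uncased")

def gender_gap(template):
    """P('he') - P('she') for a template containing the [MASK] token."""
    scores = {r["token_str"].strip(): r["score"]
              for r in fill(template, targets=["he", "she"])}
    return scores.get("he", 0.0) - scores.get("she", 0.0)

print(gender_gap("[MASK] works as a nurse."))      # typically negative (skews "she")
print(gender_gap("[MASK] works as an engineer."))  # typically positive (skews "he")
```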

4. Misinformation Detection

Misinformation detection in NLP refers to the use of computational techniques to identify false or misleading information within text data. This is achieved through algorithms that analyze patterns in language, context, and other linguistic cues that might indicate whether the information is trustworthy or not. The importance of misinformation detection lies in its role in combating the spread of fake news, fake reviews, propaganda, and other forms of deceptive content that can influence public opinion, undermine trust in legitimate news sources, and sway political, economic, or social outcomes. With the increasing volume of information disseminated online, it has become crucial to develop robust NLP tools to automatically flag and filter out misinformation to maintain the integrity of public discourse and protect individuals from being misled by false narratives.
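
A minimal lexical baseline for this task is sketched below, assuming scikit-learn is available: TF-IDF features with a logistic regression classifier on toy data. Real detectors are trained on benchmark corpora and typically use contextual models, but the pipeline shape is the same.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy labeled examples; a real system would train on a benchmark fake-news corpus.
texts = ["Scientists confirm the vaccine reduces severe illness in trials.",
         "Miracle cure erases all disease overnight, doctors stunned!",
         "Central bank raises interest rates by 25 basis points.",
         "Secret memo proves the election results were fabricated."]
labels = [0, 1, 0, 1]  # 0 = credible, 1 = misleading

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(texts, labels)
print(model.predict(["Shocking miracle remedy cures everything instantly!"]))
```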

5. Sentiment and Social Media Analysis

Sentiment analysis uses natural language processing (NLP) and machine learning techniques to extract and analyze sentiments, opinions, and emotions from textual data. Social media analysis encompasses a broader scope, covering not only sentiment but also the content, trends, and patterns within social media data. It uses similar NLP and machine learning methods to extract information about user behavior, trending topics, network interactions, and the spread of information. Furthermore, social network analysis tools can help identify influential users, communities, and the structure of interactions. Both sentiment analysis and social media analysis provide valuable insights for businesses, policymakers, and researchers to understand public opinion, monitor brand reputation, and study social dynamics.
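
As a quick illustration, the sketch below scores a few social-media-style posts with NLTK's VADER analyzer, a lexicon-based tool suited to short informal text; it assumes NLTK is installed and can download the VADER lexicon.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

posts = ["Absolutely loving the new update, great work!",
         "This outage has ruined my whole afternoon.",
         "Release notes are out; the patch lands on Friday."]

for post in posts:
    scores = sia.polarity_scores(post)  # neg/neu/pos plus compound in [-1, 1]
    print(f"{scores['compound']:+.2f}  {post}")
```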

6. Summarization

In scenarios where multiple pretrained language models (PLMs) are available for text summarization, choosing the "best" can be challenging due to the variability in their performance depending on the context and nature of the text. Instead, an ensemble approach can be taken, where a variety of summaries generated by different PLMs are ranked to identify the most informative and coherent summary. TextRank-based algorithms, which are inspired by the PageRank algorithm, can be particularly useful in this ensemble framework. By modeling the problem as a graph with summaries as nodes and similarities between them as edges, TextRank can iteratively score each summary based on its similarity to other highly scored summaries. This process naturally filters out redundant information and promotes summaries that capture the essence of the text from multiple perspectives. The highest-ranked summary according to TextRank can then be selected as the output, providing an effective way to harness the strengths of various PLMs while minimizing their individual biases and errors. This unsupervised approach does not require labeled training data, making it a versatile and practical solution for improving the quality of machine-generated text summarization.
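
The core of this approach fits in a few lines. The sketch below builds the summary-similarity graph with TF-IDF cosine similarity and runs a PageRank-style power iteration to pick the consensus summary. It is a minimal illustration of the method described above, assuming NumPy and scikit-learn, with toy candidates standing in for real PLM outputs.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def rank_summaries(candidates, damping=0.85, iters=50):
    """TextRank-style ranking: nodes are candidate summaries, edge weights
    are pairwise cosine similarities between their TF-IDF vectors."""
    sims = cosine_similarity(TfidfVectorizer().fit_transform(candidates))
    np.fill_diagonal(sims, 0.0)  # no self-loops
    weights = sims / np.maximum(sims.sum(axis=1, keepdims=True), 1e-9)
    scores = np.full(len(candidates), 1.0 / len(candidates))
    for _ in range(iters):  # PageRank-style power iteration
        scores = (1 - damping) / len(candidates) + damping * weights.T @ scores
    return candidates[int(np.argmax(scores))]

# Candidates would be outputs of different PLMs (e.g., BART, T5, Pegasus).
candidates = ["The court upheld the ruling on appeal.",
              "On appeal, the ruling was upheld by the court.",
              "The defendant celebrated his birthday."]
print(rank_summaries(candidates))  # picks a summary the others agree with
```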

7. Natural Language based Software Engineering

Natural language based software engineering is an emerging area that integrates NLP techniques into the software development process to enhance the understanding, creation, and maintenance of software. Leveraging NLP, developers and engineers can automate and improve tasks such as extracting requirements from documentation, generating code from natural language descriptions, analyzing user feedback and bug reports, and maintaining clear and up-to-date documentation. This approach aims to bridge the gap between human language and computer code, allowing for more efficient communication between stakeholders, reducing errors, and streamlining the overall software lifecycle. NLP in software engineering has the potential to significantly increase productivity and improve the accuracy and quality of software products, although accurately interpreting vast and nuanced human language within the context of software development remains a challenge.

8. Online Abuse and Harms

While digital technologies have revolutionized how we connect and communicate, they have also given a megaphone to harmful content such as hate speech, toxic comments, and abuse. The sheer volume of online information makes it impossible to address this issue manually, requiring scalable, automated solutions. However, identifying and moderating harmful online activity is no easy feat, presenting complex technical, social, legal, and ethical challenges.

9. Figurative Language Processing

Figurative language is prevalent in all aspects of human activity and discourse, from poetry and everyday conversation to scientific literature and social media. The study of figurative language in NLP, which covers computational modeling of metaphors, idioms, puns, irony, sarcasm, similes, and other forms, is a rapidly expanding field. Its widespread use is supported by various corpus studies, and its significance in human reasoning has been verified through psychological research. Therefore, figurative language is a crucial area of study for both computational and cognitive linguistics, and its automatic identification, understanding, and generation are essential for any NLP application that deals with semantics.
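
As a deliberately naive illustration of automatic identification, the sketch below spots candidate similes using surface patterns. Real figurative language detection requires context-sensitive models, since literal comparisons match the same patterns.

```python
import re

# Naive pattern-based simile spotter ("as X as Y", "like a Y"); a literal
# comparison such as "runs like his father" would match the same forms.
SIMILE_PATTERNS = [
    re.compile(r"\bas\s+\w+\s+as\s+(?:an|a|the)?\s*\w+", re.IGNORECASE),
    re.compile(r"\blike\s+(?:an|a|the)\s+\w+", re.IGNORECASE),
]

def find_similes(text):
    return [m.group(0) for p in SIMILE_PATTERNS for m in p.finditer(text)]

print(find_similes("Her answer was as sharp as a blade, and he froze like a statue."))
```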

10. Text Normalization

Text normalization in natural language processing (NLP) refers to the process of converting text into a more consistent and standard form. Back transliteration is a specific type of text normalization that involves converting transliterated text (text that has been converted from one script to another) back to its original script. Back transliteration is a complex task, as it requires a deep understanding of the phonetics and orthography of both the source and target languages. For instance, multiple characters or sounds from the original script might be represented by the same character in the Latin script, making it difficult to determine the correct original character during back transliteration. Additionally, the process may need to handle ambiguities and variations in the way people transliterate text.
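
The many-to-one ambiguity described above can be made concrete with a tiny dictionary-based sketch: a single Latin digraph maps to several candidate letters in the original script, so back transliteration must enumerate (and ultimately rank) candidates. The mapping below uses a handful of illustrative Bengali correspondences and is not a real transliteration table.

```python
# Illustrative Latin-to-Bengali mapping; real romanizations are far noisier.
MAPPING = {
    "sh": ["শ", "ষ", "স"],  # three Bengali letters collapse to "sh" in Latin
    "k": ["ক"],
    "a": ["া", "আ"],
    "t": ["ত", "ট"],
}

def back_transliterate(latin, prefix=""):
    """Enumerate candidate original-script strings via longest-match recursion."""
    if not latin:
        yield prefix
        return
    for length in (2, 1):  # try digraphs before single letters
        if length > len(latin):
            continue
        for char in MAPPING.get(latin[:length], []):
            yield from back_transliterate(latin[length:], prefix + char)

print(list(back_transliterate("shat")))  # many plausible originals for one input
```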

11. Explainable AI (XAI)

Explainable AI in medical image processing, especially when using pretrained models, refers to the ability of the AI system to not only accurately classify medical conditions from images but also to provide insights into the reasoning behind its decisions. For instance, in brain tumor classification, the AI would not only identify the presence of a tumor but also highlight the features in the brain scans that led to its conclusion. Similarly, for cervical cancer classification, the AI would analyze pap smear or HPV test images and explain which patterns or irregularities suggest cancerous changes. In the case of gastrointestinal disease classification, the AI would examine endoscopic images and point out the abnormalities, such as ulcers or polyps, that signify a particular disease. The 'explainable' part means that the model's decision-making process is transparent and understandable to human experts, allowing healthcare professionals to trust and effectively interpret the AI's analysis for better patient outcomes. This is particularly important in healthcare, where the reasoning behind a diagnosis can be as crucial as the diagnosis itself.
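
One widely used explanation technique for image classifiers is Grad-CAM, sketched below on an off-the-shelf torchvision ResNet: gradients of the predicted class weight the last convolutional feature maps, producing a heatmap of the regions that drove the decision. The random tensor stands in for a preprocessed scan, and a medical deployment would fine-tune the backbone on the target modality; this is an illustrative sketch, not the specific pipeline behind this research.

```python
import torch
from torchvision import models

# Minimal Grad-CAM on an ImageNet-pretrained ResNet-18.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()
activations, gradients = {}, {}

def fwd_hook(module, inputs, output):
    activations["feat"] = output.detach()

def bwd_hook(module, grad_in, grad_out):
    gradients["feat"] = grad_out[0].detach()

layer = model.layer4[-1]  # last convolutional block
layer.register_forward_hook(fwd_hook)
layer.register_full_backward_hook(bwd_hook)

image = torch.randn(1, 3, 224, 224)  # stand-in for a preprocessed scan
logits = model(image)
logits[0, logits.argmax()].backward()  # gradient of the predicted class

weights = gradients["feat"].mean(dim=(2, 3), keepdim=True)    # channel importance
cam = torch.relu((weights * activations["feat"]).sum(dim=1))  # class activation map
print(cam.shape)  # (1, 7, 7): upsample and overlay to highlight decisive regions
```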

12. Culinary Text Classification

This specialized area of text classification focuses on analyzing and categorizing text data related to food, recipes, cooking techniques, and cuisine types. It's a niche within the broader field of text classification that deals specifically with culinary content, using NLP techniques to understand and organize recipes based on their ingredients, cooking methods, regional origins, dietary restrictions, or any other relevant culinary genre distinctions.
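
One lightweight way to prototype such a classifier without labeled recipes is zero-shot classification over culinary labels, sketched below with an illustrative public checkpoint (assuming a transformers installation); the same pattern applies to dietary restrictions or cooking methods.

```python
from transformers import pipeline

# Zero-shot labeling of a recipe snippet along one culinary axis (cuisine).
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

recipe = ("Simmer chickpeas with coconut milk, turmeric, cumin, "
          "and garam masala; serve over basmati rice.")
labels = ["Indian", "Italian", "Mexican", "Japanese"]

result = classifier(recipe, candidate_labels=labels)
print(list(zip(result["labels"], [round(s, 2) for s in result["scores"]])))
```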

Funding

Project ID: ARP/2021/CSE/01/2
Project Title: Bengali Fake Reviews: A Benchmark Dataset and Detection System
Funded by: Committee for Advanced Studies and Research (CASR), AUST
Responsibility: Co-Principal Investigator (Co-PI)
Duration: May 2022 - May 2023