Publications
Most recent publication updates can be found on my [Google Scholar] profile.
[*] denotes equal contribution
2024
📌 Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation (Core A*)
Authors: G M Shahariar, Jia Chen, Jiachen Li, Yue Dong
Conference: Findings of EMNLP (EMNLP 2024)
[Abstract] [PDF] [Code & Dataset] [Citation bib]
@misc{shahariar2024adversarialattackspartsspeech,
title={Adversarial Attacks on Parts of Speech: An Empirical Study in Text-to-Image Generation},
author={G M Shahariar and Jia Chen and Jiachen Li and Yue Dong},
year={2024},
eprint={2409.15381},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2409.15381},
}
Recent studies show that text-to-image (T2I) models are vulnerable to adversarial attacks, especially with noun perturbations in text prompts. In this study, we investigate the impact of adversarial attacks on different POS tags within text prompts on the images generated by T2I models. We create a high-quality dataset for realistic POS tag token swapping and perform gradient-based attacks to find adversarial suffixes that mislead T2I models into generating images with altered tokens. Our empirical results show that the attack success rate (ASR) varies significantly among different POS tag categories, with nouns, proper nouns, and adjectives being the easiest to attack. We explore the mechanism behind the steering effect of adversarial suffixes, finding that the number of critical tokens and content fusion vary among POS tags, while features like suffix transferability are consistent across categories.
📌 Bengali Fake Reviews: A Benchmark Dataset and Detection System (Q1)
Authors: G. M. Shahariar*, Md. Tanvir Rouf Shawon*, Faisal Muhammad Shah, Mohammad Shafiul Alam, Md. Shahriar Mahbub
Journal: Neurocomputing (Neurocomputing)
[Abstract] [PDF] [Code & Dataset] [Citation bib]
@article{SHAHARIAR2024127732,
title = {Bengali fake reviews: A benchmark dataset and detection system},
journal = {Neurocomputing},
volume = {592},
pages = {127732},
year = {2024},
issn = {0925-2312},
doi = {10.1016/j.neucom.2024.127732},
url = {https://www.sciencedirect.com/science/article/pii/S0925231224005034},
author = {G M Shahariar and Md. Tanvir Rouf Shawon and Faisal Muhammad Shah and Mohammad Shafiul Alam and Md. Shahriar Mahbub},
keywords = {Bengali fake reviews detection, Ensemble learning, Transformers, Deep learning, Augmentation, Transliteration}
}
The proliferation of fake reviews on various online platforms has created a major concern for both consumers and businesses. Such reviews can deceive customers and cause damage to the reputation of products or services, making it crucial to identify them. Although the detection of fake reviews has been extensively studied in the English language, detecting fake reviews in non-English languages such as Bengali is still a relatively unexplored research area. The novelty of this study unfolds on three fronts: (i) a new publicly available dataset called Bengali Fake Review Detection (BFRD) dataset is introduced, (ii) a unique pipeline has been proposed that translates English words to their corresponding Bengali meaning and also back transliterates Romanized Bengali to Bengali, (iii) a weighted ensemble model that combines four pre-trained transformer models is proposed. The developed dataset consists of 7710 non-fake and 1339 fake food-related reviews collected from social media posts. Rigorous experiments have been conducted to compare multiple deep learning and pre-trained transformer language models and our proposed model to identify the best-performing model. According to the experimental results, the proposed ensemble model attained a weighted F1-score of 0.9843 on a dataset of 13,390 reviews, comprising 1339 actual fake reviews, 5,356 augmented fake reviews, and 6695 reviews randomly selected from the 7710 non-fake instances.
📌 A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy Bangla Texts (Core A)
Authors: Kazi Toufique Elahi, Tasnuva Binte Rahman, Shakil Shahriar, Samir Sarker, Md. Tanvir Rouf Shawon, G. M. Shahariar
Workshop: Proceedings of the Ninth Workshop on Noisy and User-generated Text collocated with EACL 2024 (W-NUT 2024)
[Abstract] [PDF] [Code & Dataset] [Presentation] [Citation bib]
@inproceedings{elahi-etal-2024-comparative,
title = "A Comparative Analysis of Noise Reduction Methods in Sentiment Analysis on Noisy {B}angla Texts",
author = "Elahi, Kazi and
Rahman, Tasnuva and
Shahriar, Shakil and
Sarker, Samir and
Shawon, Md. and
Shibli, G. M.",
booktitle = "Proceedings of the Ninth Workshop on Noisy and User-generated Text (W-NUT 2024)",
month = mar,
year = "2024",
address = "San {\.G}iljan, Malta",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.wnut-1.5",
pages = "44--57"
}
While Bangla is considered a language with limited resources, sentiment analysis has been a subject of extensive research in the literature. Nevertheless, there is a scarcity of exploration into sentiment analysis specifically in the realm of noisy Bangla texts. In this paper, we introduce a dataset (NC-SentNoB) that we annotated manually to identify ten different types of noise found in a pre-existing sentiment analysis dataset comprising around 15K noisy Bangla texts. At first, given an input noisy text, we identify the noise type, addressing this as a multi-label classification task. Then, we introduce baseline noise reduction methods to alleviate noise prior to conducting sentiment analysis. Finally, we assess the performance of fine-tuned sentiment analysis models with both noisy and noise-reduced texts to make comparisons. The experimental findings indicate that the noise reduction methods utilized are not satisfactory, highlighting the need for more suitable noise reduction methods in future research endeavors.
📌 Ben-Sarc: A Self-Annotated Corpus for Sarcasm Detection from Bengali Social Media Comments and Its Baseline Evaluation (Q1)
Authors: Sanzana Karim Lora, G. M. Shahariar, Tamanna Nazmin, Noor Nafeur Rahman, Rafsan Rahman, Miyad Bhuiyan, and Faisal Muhammad Shah
Journal: Natural Language Processing (Natural Language Processing)
[Abstract] [PDF] [Dataset] [Citation bib]
@article{Lora_Shahariar_Nazmin_Rahman_Rahman_Bhuiyan_Shah_2024,
title={Ben-Sarc: A self-annotated corpus for sarcasm detection from Bengali social media comments and its baseline evaluation},
DOI={10.1017/nlp.2024.11},
journal={Natural Language Processing},
author={Lora, Sanzana Karim and Shahariar, G. M. and Nazmin, Tamanna and Rahman, Noor Nafeur and Rahman, Rafsan and Bhuiyan, Miyad and Shah, Faisal Muhammad},
year={2024},
pages={1--26}
}
Sarcasm detection research in the Bengali language has so far been narrow due to the unavailability of resources. In this paper, we introduce a large-scale self-annotated Bengali corpus for sarcasm detection research named ‘Ben-Sarc’, containing 25,636 comments manually collected from different public Facebook pages and evaluated by external evaluators. Then we present a complete strategy to utilize different models of traditional machine learning, deep learning, and transfer learning to detect sarcasm from text using the Ben-Sarc corpus. Finally, we demonstrate a comparison between the performance of traditional machine learning, deep learning, and transfer learning models on our Ben-Sarc corpus. Transfer learning using Indic-Transformers Bengali Bidirectional Encoder Representations from Transformers as a pre-trained source model achieved the highest accuracy of 75.05%. The long short-term memory model obtained the second-highest accuracy with 72.48%, and Multinomial Naive Bayes the third-highest with 72.36%, the best results for deep learning and machine learning, respectively. The Ben-Sarc corpus is made publicly available in the hope of advancing the Bengali Natural Language Processing Community.
📌 Explainable Contrastive and Cost-Sensitive Learning for Cervical Cancer Classification
Authors: Ashfiqun Mustari, Rushmia Ahmed, Afsara Tasnim, Jakia Sultana Juthi, G M Shahariar
Conference: 26th International Conference on Computer and Information Technology (ICCIT 2023)
[Abstract] [PDF] [Code & Dataset] [Citation bib]
@INPROCEEDINGS{10441352,
author={Mustari, Ashfiqun and Ahmed, Rushmia and Tasnim, Afsara and Juthi, Jakia Sultana and Shahariar, G. M.},
booktitle={2023 26th International Conference on Computer and Information Technology (ICCIT)},
title={Explainable Contrastive and Cost-Sensitive Learning for Cervical Cancer Classification},
year={2023},
volume={},
number={},
pages={1-6},
keywords={Visualization;Costs;Sensitivity;System performance;Self-supervised learning;Cervical cancer;Testing;Cervical Cancer;Cost-Sensitive Learning;Contrastive Learning;SIPaKMeD;XAI;LIME;GradCAM},
doi={10.1109/ICCIT60459.2023.10441352}
}
This paper proposes an efficient system for classifying cervical cancer cells using pre-trained convolutional neural networks (CNNs). We first fine-tune five pre-trained CNNs and minimize the overall cost of misclassification by prioritizing accuracy for certain classes that have higher associated costs or importance. To further enhance the performance of the models, supervised contrastive learning is included to make the models more adept at capturing important features and patterns. Extensive experiments are conducted to evaluate the proposed system on the SIPaKMeD dataset. The experimental results demonstrate the effectiveness of the developed system, achieving an accuracy of 97.29%. To make our system more trustworthy, we have employed several explainable AI techniques to interpret how the models reached a specific decision.
2023
📌 Contrastive Learning for API Aspect Analysis (Core A*)
Authors: G. M. Shahariar, Tahmid Hasan, Anindya Iqbal and Gias Uddin
Conference: 38th IEEE/ACM International Conference on Automated Software Engineering (ASE 2023)
[Abstract] [PDF] [Code & Dataset] [Presentation] [Citation bib]
@article{shahariar2023contrastive,
title={Contrastive Learning for API Aspect Analysis},
author={Shahariar, GM and Hasan, Tahmid and Iqbal, Anindya and Uddin, Gias},
journal={arXiv preprint arXiv:2307.16878},
year={2023}
}
We present a novel approach - CLAA - for API aspect detection in API reviews that utilizes transformer models trained with a supervised contrastive loss objective function. We evaluate CLAA using performance and impact analysis. For performance analysis, we utilized a benchmark dataset on developer discussions collected from Stack Overflow and compared the results to those obtained using state-of-the-art transformer models. Our experiments show that contrastive learning can significantly improve the performance of transformer models in detecting aspects such as Performance, Security, Usability, and Documentation. For impact analysis, we performed an empirical study and a developer study. On 200 randomly selected and manually labeled online reviews, CLAA achieved 92% accuracy while the SOTA baseline achieved 81.5%. According to our developer study involving 10 participants, the use of 'Stack Overflow + CLAA' resulted in increased accuracy and confidence during API selection.
📌 Rank Your Summaries: Enhancing Bengali Text Summarization via Ranking-based Approach
Authors: G. M. Shahariar*, Tonmoy Talukder*, Rafin Alam Khan Sotez and Md. Tanvir Rouf Shawon
Conference: 2nd International Conference on Big Data, IoT and Machine Learning (BIM 2023)
[Abstract] [PDF] [Code & Dataset] [Presentation] [Citation bib]
@article{shahariar2023rank,
title={Rank Your Summaries: Enhancing Bengali Text Summarization via Ranking-based Approach},
author={Shahariar, GM and Talukder, Tonmoy and Sotez, Rafin Alam Khan and Shawon, Md Tanvir Rouf},
journal={arXiv preprint arXiv:2307.07392},
year={2023}
}
With the increasing need for text summarization techniques that are both efficient and accurate, it becomes crucial to explore avenues that enhance the quality and precision of pre-trained models specifically tailored for summarizing Bengali texts. When it comes to text summarization tasks, there are numerous pre-trained transformer models at one's disposal. Consequently, it becomes quite a challenge to discern the most informative and relevant summary for a given text among the various options generated by these pre-trained summarization models. This paper aims to identify the most accurate and informative summary for a given text by utilizing a simple but effective ranking-based approach that compares the output of four different pre-trained Bengali text summarization models. The process begins by carrying out preprocessing of the input text that involves eliminating unnecessary elements such as special characters and punctuation marks. Next, we utilize four pre-trained summarization models to generate summaries, followed by applying a text ranking algorithm to identify the most suitable summary. Ultimately, the summary with the highest ranking score is chosen as the final one. To evaluate the effectiveness of this approach, the generated summaries are compared against human-annotated summaries using standard NLG metrics such as BLEU, ROUGE, BERTScore, WIL, WER, and METEOR. Experimental results suggest that by leveraging the strengths of each pre-trained transformer model and combining them using a ranking-based approach, our methodology significantly improves the accuracy and effectiveness of the Bengali text summarization.
📌 Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning
Authors: Dibya Nath and G. M. Shahariar
Conference: 2nd International Conference on Big Data, IoT and Machine Learning (BIM 2023)
[Abstract] [PDF] [Code & Dataset] [Presentation] [Citation bib]
@article{nath2023gastrointestinal,
title={Gastrointestinal Disease Classification through Explainable and Cost-Sensitive Deep Neural Networks with Supervised Contrastive Learning},
author={Nath, Dibya and Shahariar, GM},
journal={arXiv preprint arXiv:2307.07603},
year={2023}
}
Gastrointestinal diseases pose significant healthcare challenges as they manifest in diverse ways and can lead to potential complications. Ensuring precise and timely classification of these diseases is pivotal in guiding treatment choices and enhancing patient outcomes. This paper introduces a novel approach to classifying gastrointestinal diseases by leveraging cost-sensitive pre-trained deep convolutional neural network (CNN) architectures with supervised contrastive learning. Our approach enables the network to learn representations that capture vital disease-related features, while also considering the relationships of similarity between samples. To tackle the challenges posed by imbalanced datasets and the cost-sensitive nature of misclassification errors in healthcare, we incorporate cost-sensitive learning. By assigning distinct costs to misclassifications based on the disease class, we prioritize accurate classification of critical conditions. Furthermore, we enhance the interpretability of our model by integrating gradient-based techniques from explainable artificial intelligence (AI). This inclusion provides valuable insights into the decision-making process of the network, aiding in understanding the features that contribute to disease classification. To assess the effectiveness of our proposed approach, we perform extensive experiments on a comprehensive gastrointestinal disease dataset, such as the Hyper-Kvasir dataset. Through thorough comparisons with existing works, we demonstrate the strong classification accuracy, robustness and interpretability of our model.
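The idea of assigning distinct costs to misclassifications by class can be illustrated with a class-weighted cross-entropy. This is only a minimal sketch of one common form of cost-sensitive learning, not the formulation used in the paper: the function name `cost_weighted_cross_entropy`, the example probabilities, and the cost values are all hypothetical.

```python
import math


def cost_weighted_cross_entropy(probs, labels, class_costs):
    """Average cross-entropy where each sample is weighted by the cost of its true class.

    probs       -- list of per-sample probability lists (each sums to 1)
    labels      -- list of true class indices
    class_costs -- list of per-class costs (hypothetical values, chosen by the user)
    """
    weighted_loss = 0.0
    weight_sum = 0.0
    for sample_probs, true_class in zip(probs, labels):
        cost = class_costs[true_class]
        # Errors on high-cost classes contribute more to the total loss.
        weighted_loss += -cost * math.log(sample_probs[true_class])
        weight_sum += cost
    return weighted_loss / weight_sum


# Sample 1 (class 0) is predicted well; sample 2 (class 1) is predicted poorly.
probs = [[0.9, 0.1], [0.8, 0.2]]
labels = [0, 1]

uniform = cost_weighted_cross_entropy(probs, labels, [1.0, 1.0])
costly = cost_weighted_cross_entropy(probs, labels, [1.0, 5.0])
print(uniform < costly)  # raising the cost of class 1 raises the loss
```

With a higher cost on class 1, the poorly predicted class-1 sample dominates the loss, which pushes training toward getting that class right.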
📌 Evaluating the Reliability of CNN Models on Classifying Traffic and Road Signs using LIME
Authors: Md. Atiqur Rahman, Ahmed Saad Tanim, Sanjid Islam, Fahim Pranto, G. M. Shahariar, and Md. Tanvir Rouf Shawon
Conference: 2nd International Conference on Big Data, IoT and Machine Learning (BIM 2023)
[Abstract] [PDF] [Presentation] [Citation bib]
@article{rahman2023evaluating,
title={Evaluating the Reliability of CNN Models on Classifying Traffic and Road Signs using LIME},
author={Rahman, Md Atiqur and Tanim, Ahmed Saad and Islam, Sanjid and Pranto, Fahim and Shahariar, GM and Shawon, Md Tanvir Rouf},
journal={arXiv preprint arXiv:2309.05747},
year={2023}
}
The objective of this investigation is to evaluate and contrast the effectiveness of four state-of-the-art pre-trained models, ResNet-34, VGG-19, DenseNet-121, and Inception V3, in classifying traffic and road signs with the utilization of the GTSRB public dataset. The study focuses on evaluating the accuracy of these models' predictions as well as their ability to employ appropriate features for image categorization. To gain insights into the strengths and limitations of the models' predictions, the study employs the local interpretable model-agnostic explanations (LIME) framework. The findings of this experiment indicate that LIME is a crucial tool for improving the interpretability and dependability of machine learning models for image identification, regardless of the models achieving an F1 score of 0.99 on classifying traffic and road signs. The conclusion of this study has important ramifications for how these models are used in practice, as it is crucial to ensure that model predictions are founded on the pertinent image features.
📌 Interpretable Multi Labeled Bengali Toxic Comments Classification using Deep Learning (Best Paper Award 🏆)
Authors: Tanveer Ahmed Belal, G. M. Shahariar, and Md. Hasanul Kabir
Conference: 3rd International Conference on Electrical, Computer and Communication Engineering (ECCE 2023)
[Abstract] [PDF] [Code & Dataset] [Presentation] [Citation bib]
@INPROCEEDINGS{10101588,
author={Belal, Tanveer Ahmed and Shahariar, G. M. and Kabir, Md. Hasanul},
booktitle={2023 International Conference on Electrical, Computer and Communication Engineering (ECCE)},
title={Interpretable Multi Labeled Bengali Toxic Comments Classification using Deep Learning},
year={2023},
volume={},
number={},
pages={1-6},
doi={10.1109/ECCE57851.2023.10101588}}
This paper presents a deep learning-based pipeline for categorizing Bengali toxic comments, in which at first a binary classification model is used to determine whether a comment is toxic or not, and then a multi-label classifier is employed to determine which toxicity type the comment belongs to. For this purpose, we have prepared a manually labeled dataset consisting of 16,073 instances among which 8,488 are Toxic and any toxic comment may correspond to one or more of the six toxic categories - vulgar, hate, religious, threat, troll, and insult simultaneously. Long Short Term Memory (LSTM) with BERT Embedding achieved 89.42% accuracy for the binary classification task while as a multi-label classifier, a combination of Convolutional Neural Network and Bi-directional Long Short Term Memory (CNN-BiLSTM) with attention mechanism achieved 78.92% accuracy and 0.86 as weighted F1-score. To explain the predictions and interpret the word feature importance during classification by the proposed models, we utilized Local Interpretable Model-Agnostic Explanations (LIME) framework.
📌 Bengali Fake Review Detection using Semi-supervised Generative Adversarial Networks
Authors: Md. Tanvir Rouf Shawon*, G. M. Shahariar*, Faisal Muhammad Shah, Mohammad Shafiul Alam, and Md. Shahriar Mahbub
Conference: 5th International Conference on Natural Language Processing (ICNLP 2023)
[Abstract] [PDF] [Presentation] [Citation bib]
@INPROCEEDINGS{10236810,
author={Shawon, Md. Tanvir Rouf and Shahariar, G. M. and Shah, Faisal Muhammad and Alam, Mohammad Shafiul and Mahbub, Md. Shahriar},
booktitle={2023 5th International Conference on Natural Language Processing (ICNLP)},
title={Bengali Fake Review Detection using Semi-supervised Generative Adversarial Networks},
year={2023},
volume={},
number={},
pages={12-16},
doi={10.1109/ICNLP58431.2023.00011}}
This paper investigates the potential of semi-supervised Generative Adversarial Networks (GANs) to fine-tune pretrained language models in order to classify Bengali fake reviews from real reviews with only a small amount of annotated data. With the rise of social media and e-commerce, the ability to detect fake or deceptive reviews is becoming increasingly important in order to protect consumers from being misled by false information. Any machine learning model will have trouble identifying a fake review, especially for a low resource language like Bengali. We have demonstrated that the proposed semi-supervised GAN-LM architecture (generative adversarial network on top of a pretrained language model) is a viable solution in classifying Bengali fake reviews as the experimental results suggest that even with only 1024 annotated samples, BanglaBERT with semi-supervised GAN (SSGAN) achieved an accuracy of 83.59% and an F1-score of 84.89% outperforming other pretrained language models - BanglaBERT generator, Bangla BERT Base and Bangla-Electra by almost 3%, 4% and 10% respectively in terms of accuracy. The experiments were conducted on a manually labeled food review dataset consisting of a total of 6014 real and fake reviews collected from various social media groups. Researchers who are experiencing difficulty not just with recognizing fake reviews but with other classification problems owing to a lack of labeled data may find a solution in our proposed methodology.
📌 Effectiveness of Transformer Models on IoT Security Detection in StackOverflow Discussions
Authors: Nibir Chandra Mandal, G. M. Shahariar, and Md. Tanvir Rouf Shawon
Conference: International Conference on Information and Communication Technology for Development (ICICTD 2022)
[Abstract] [PDF] [Dataset] [Presentation] [Citation bib]
@InProceedings{mandalSecurity,
author="Mandal, Nibir Chandra
and Shahariar, G. M.
and Shawon, Md. Tanvir Rouf",
title="Effectiveness of Transformer Models on IoT Security Detection in StackOverflow Discussions",
booktitle="Proceedings of International Conference on Information and Communication Technology for Development",
year="2023",
publisher="Springer Nature Singapore",
address="Singapore",
pages="125--137"
}
The Internet of Things (IoT) is an emerging concept that directly links to the billions of physical items, or “things” that are connected to the Internet and are all gathering and exchanging information between devices and systems. However, IoT devices were not built with security in mind, which might lead to security vulnerabilities in a multi-device system. Traditionally, we investigated IoT issues by polling IoT developers and specialists. This technique, however, is not scalable since surveying all IoT developers is not feasible. Another way to look into IoT issues is to look at IoT developer discussions on major online development forums like Stack Overflow (SO). However, finding discussions that are relevant to IoT issues is challenging since they are frequently not categorized with IoT-related terms. In this paper, we present the “IoT Security Dataset”, a domain-specific dataset of 7147 samples focused solely on IoT security discussions. As there are no automated tools to label these samples, we manually labeled them. We further employed multiple transformer models to automatically detect security discussions. Through rigorous investigations, we found that IoT security discussions are different and more complex than traditional security discussions. We demonstrated a considerable performance loss (up to 44%) of transformer models on cross-domain datasets when we transferred knowledge from a general-purpose dataset “Opiner”, supporting our claim. Thus, we built a domain-specific IoT security detector with an F1-Score of 0.69. We have made the dataset public in the hope that developers would learn more about the security discussion and vendors would enhance their concerns about product security.
📌 Assorted, Archetypal and Annotated Two Million (3A2M) Cooking Recipes Dataset based on Active Learning
Authors: Nazmus Sakib, G. M. Shahariar, Md. Mohsinul Kabir, Md. Kamrul Hasan, and Hasan Mahmud
Conference: International Conference on Machine Intelligence and Emerging Technologies (MIET 2022)
[Abstract] [PDF] [Dataset] [Presentation] [Citation bib]
@InProceedings{3A2M,
author="Sakib, Nazmus
and Shahariar, G. M.
and Kabir, Md. Mohsinul
and Hasan, Md. Kamrul
and Mahmud, Hasan",
title="Assorted, Archetypal and Annotated Two Million (3A2M) Cooking Recipes Dataset Based on Active Learning",
booktitle="Machine Intelligence and Emerging Technologies",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="188--203"
}
Cooking recipes allow individuals to exchange culinary ideas and provide food preparation instructions. Due to a lack of adequate labeled data, categorizing raw recipes found online to the appropriate food genres is a challenging task in this domain. Utilizing the knowledge of domain experts to categorize recipes could be a solution. In this study, we present a novel dataset of two million culinary recipes labeled in respective categories leveraging the knowledge of food experts and an active learning technique. To construct the dataset, we collect the recipes from the RecipeNLG dataset [1]. Then, we employ three human experts whose trustworthiness score is higher than 86.667% to categorize 300K recipes by their Named Entity Recognition (NER) and assign each to one of the nine categories: bakery, drinks, non-veg, vegetables, fast food, cereals, meals, sides and fusion. Finally, we categorize the remaining 1900K recipes using an Active Learning method with a blend of Query-by-Committee and Human In The Loop (HITL) approaches. There are more than two million recipes in our dataset, each of which is categorized and has a confidence score linked with it. For the 9 genres, the Fleiss Kappa score of this massive dataset is roughly 0.56026. We believe that the research community can use this dataset to perform various machine learning tasks such as recipe genre classification, recipe generation of a specific genre, new recipe creation, etc. The dataset can also be used to train and evaluate the performance of various NLP tasks such as named entity recognition, part-of-speech tagging, semantic role labeling, and so on.
📌 Can Transformer Models Effectively Detect Software Aspects in StackOverflow Discussion?
Authors: Nibir Chandra Mandal, Tashreef Muhammad, and G. M. Shahariar
Conference: International Conference on Machine Intelligence and Emerging Technologies (MIET 2022)
[Abstract] [PDF] [Dataset] [Presentation] [Citation bib]
@InProceedings{mandal2022can,
author="Mandal, Nibir Chandra
and Muhammad, Tashreef
and Shahariar, G. M.",
title="Can Transformer Models Effectively Detect Software Aspects in StackOverflow Discussion?",
booktitle="Machine Intelligence and Emerging Technologies",
year="2023",
publisher="Springer Nature Switzerland",
address="Cham",
pages="226--241"
}
Dozens of new tools and technologies are being incorporated to help developers, which is becoming a source of consternation as they struggle to choose one over the others. For example, there are at least ten frameworks available to developers for developing web applications, posing a conundrum in selecting the best one that meets their needs. As a result, developers are continuously searching for all of the benefits and drawbacks of each API, framework, tool, and so on. One of the typical approaches is to examine all of the features through official documentation and discussion. This approach is time-consuming and often makes it difficult to determine which aspects are the most important to a particular developer and whether a particular aspect is important to the community at large. In this paper, we have used a benchmark API aspects dataset (Opiner) collected from StackOverflow posts and observed how Transformer models (BERT, RoBERTa, DistilBERT, and XLNet) perform in detecting software aspects in textual developer discussion with respect to the baseline Support Vector Machine (SVM) model. Through extensive experimentation, we have found that transformer models improve the performance of baseline SVM for most of the aspects, i.e., 'Performance', 'Security', 'Usability', 'Documentation', 'Bug', 'Legal', 'OnlySentiment', and 'Others'. However, the models fail to apprehend some of the aspects (e.g., 'Community' and 'Portability') and their performance varies depending on the aspects. Also, larger architectures like XLNet are ineffective in interpreting software aspects compared to smaller architectures like DistilBERT.
2022
📌 Automatic back transliteration of Romanized Bengali (Banglish) to Bengali
Authors: G. M. Shahariar Shibli, Md. Tanvir Rouf Shawon, Anik Hassan Nibir, Md. Zabed Miandad, and Nibir Chandra Mandal
Journal: Iran Journal of Computer Science (Iran J Comput Sci)
[Abstract] [PDF] [Code & Dataset] [Citation bib]
@article{shibli2022automatic,
title={Automatic back transliteration of Romanized Bengali (Banglish) to Bengali},
author={Shibli, GM Shahariar and Shawon, Md Tanvir Rouf and Nibir, Anik Hassan and Miandad, Md Zabed and Mandal, Nibir Chandra},
journal={Iran Journal of Computer Science},
pages={1--12},
year={2022},
publisher={Springer}
}
Back transliteration of Romanized Bengali to Bengali is the process of converting text written in the Latin alphabet back into the Bengali script. This is often done in order to improve the readability of Bengali text for Bengali speakers using a simple rules-based system, or an interactive transliteration tool. There are many ways to back transliterate from Romanized Bengali to Bengali, but most of them are either grapheme or phoneme based. This paper introduces a unique pipeline that uses nine open source back transliteration tools to automatically back transliterate Romanized Bengali to Bengali. The pipeline consists of seven steps: (1) processing the Romanized Bengali input; (2) acquiring human transliteration for performance comparison; (3) employing transliteration tools; (4) generating candidate transliterations; (5) post-processing the candidate transliterations; (6) selecting best candidate transliteration, and (7) evaluating the quality of the transliterations through several performance metrics. Experimental results reveal that our approach produced the highest BLEU-1 score of 81.28, BLEU-2 score of 60.75, BLEU-3 score of 44.45, BLEU-4 score of 30.46, and the lowest average Word Error Rate and Word Information Lost of 29.21 and 43.68, respectively, on 1000 Romanized Bengali texts. In terms of recall, we achieved a Rouge-L score of 0.7190.
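For readers unfamiliar with the Word Error Rate metric reported above, the following is a minimal sketch of the standard definition: word-level edit distance divided by the number of reference words. It is not the paper's evaluation code. The function name `word_error_rate` and the example strings are hypothetical, and the sketch returns a fraction, whereas the abstract appears to report the value on a 0 to 100 scale.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            current[j] = min(
                previous[j] + 1,                      # deletion
                current[j - 1] + 1,                   # insertion
                previous[j - 1] + substitution_cost,  # match or substitution
            )
        previous = current
    return previous[-1] / len(ref_words)


# One substitution (b -> x) and one deletion (d) against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```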
📌 Urgent Text Detection in Bengali Language Based on Boosting Techniques
Authors: Rafsan Rahman, Tamanna Nazmin, Noor Nafeur Rahman, Miyad Bhuiyan, G. M. Shahariar, and Faisal Muhammad Shah
Conference: International Conference on Fourth Industrial Revolution and Beyond (ICFIRB 2022)
[Abstract] [PDF] [Citation bib]
@InProceedings{10.1007/978-981-19-2445-3_49,
author="Rahman, Rafsan
and Nazmin, Tamanna
and Rahman, Noor Nafeur
and Bhuiyan, Miyad
and Shahariar, G. M.
and Shah, Faisal Muhammad",
title="Urgent Text Detection in Bengali Language Based on Boosting Techniques",
booktitle="Proceedings of International Conference on Fourth Industrial Revolution and Beyond 2021",
year="2022",
publisher="Springer Nature Singapore",
address="Singapore",
pages="697--709",
isbn="978-981-19-2445-3"
}
This paper presents a learning approach, built on a unique dataset formulated by the authors, that detects urgent texts in Bengali-language posts on social media platforms. It is difficult to keep track of all the information we encounter on social media. Among numerous posts, it is easy to miss information that is urgent. In this advanced era of machine learning, detecting urgent texts among thousands of posts would be much easier if we could implement a model that filters them out. Therefore, we propose an approach that can identify any type of urgent text in public posts by leveraging a manually constructed dataset that is fully human annotated. In addition to traditional machine learning classifiers, we applied boosting algorithms in our proposed method. Experimentally, a significant increase in accuracy was observed when boosting weak learners. Support Vector Machine (SVM) achieved 80.9% accuracy, whereas gradient boosting outperformed the traditional approach with 82% accuracy in detecting urgent texts in the Bengali language.
2019
📌 Spam Review Detection Using Deep Learning
Authors: G. M. Shahariar, Swapnil Biswas, Faiza Omar, Faisal Muhammad Shah, and Samiha Binte Hassan
Conference: 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON 2019)
[Abstract] [PDF] [Presentation] [Citation bib]
@INPROCEEDINGS{8936148,
author={Shahariar, G. M. and Biswas, Swapnil and Omar, Faiza and Shah, Faisal Muhammad and Binte Hassan, Samiha},
booktitle={2019 IEEE 10th Annual Information Technology, Electronics and Mobile Communication Conference (IEMCON)},
title={Spam Review Detection Using Deep Learning},
year={2019},
volume={},
number={},
pages={0027-0033},
doi={10.1109/IEMCON.2019.8936148}
}
A robust and reliable system for detecting spam reviews is a pressing need in today's world in order to purchase products from online sites without being cheated. Many online sites offer options for posting reviews, which creates scope for fake paid reviews or untruthful reviews. These concocted reviews can mislead the general public and leave them perplexed about whether to believe a review or not. Prominent machine learning techniques have been introduced to solve the problem of spam review detection. The majority of current research has concentrated on supervised learning methods, which require labeled data - an inadequacy when it comes to online reviews. Our focus in this article is to detect any deceptive text reviews. In order to achieve that, we have worked with both labeled and unlabeled data and proposed deep learning methods for spam review detection, which include Multi-Layer Perceptron (MLP), Convolutional Neural Network (CNN) and a variant of Recurrent Neural Network (RNN), that is, Long Short-Term Memory (LSTM). We have also applied some traditional machine learning classifiers such as Naive Bayes (NB), K Nearest Neighbor (KNN) and Support Vector Machine (SVM) to detect spam reviews and, finally, we have shown the performance comparison for both traditional and deep learning classifiers.