Tahsin Mayeesha

Hi! I am currently working as a researcher at the NSU HCI DIAL (Design and Inclusion) Lab on a Google-funded project researching the education-related barriers and challenges faced by South Asian women in computing. Concurrently, I manage three Bengali NLP projects on generative models for question answering research as a Senior Research Assistant. I was previously a predoctoral fellow with the Fatima Fellowship, working with mentor Benjamin Muller on an NLP project investigating cultural biases, such as formality, in multilingual generative models.

During my undergrad at North South University (NSU) in Bangladesh, I got into research under the guidance of Dr Nova Ahmed and Prof. Rashedur M Rahman. I graduated with a major in Computer Science and Engineering in Fall 2020. My thesis project was on building deep learning models for question answering in Bengali, for which I trained multilingual BERT models on synthetic data. So far, my research experience has centered on NLP, AI Ethics/Policy and HCI.

Previously, I worked with the TensorFlow Hub team for Google Summer of Code 2019 with mentor Vojtech Bardiovský, with the Berkman Klein Center for Internet & Society for Google Summer of Code 2018 with mentor Hal Roberts, and at Cramstack in 2017.

I like to watch anime, read manga or books and take care of my cats during my free time.

Email  /  LinkedIn  /  Google Scholar  /  GitHub  /  Twitter

Publications/Preprints

See also my Google Scholar profile for the most recent publications as well as the most-cited papers.

In What Languages are Generative Language Models the Most Formal? Analyzing Formality Distribution across Languages
Authors : Asim Ersoy, Gerson Vizcarra, Tasmiah Tahsin Mayeesha & Benjamin Muller
Accepted to Findings of EMNLP 2023. Presented at the 3rd Multilingual Representation Learning Workshop, EMNLP 2023.

Preprint

Multilingual generative language models (LMs) are increasingly fluent in a large variety of languages. Trained on the concatenation of corpora in multiple languages, they enable powerful transfer from high-resource languages to low-resource ones. However, it is still unknown what cultural biases are induced in the predictions of these models. In this work, we focus on one language property highly influenced by culture: formality. We analyze the formality distributions of XGLM and BLOOM's predictions, two popular generative multilingual language models, in 5 languages. We classify 1,200 generations per language as formal, informal, or incohesive and measure the impact of the prompt formality on the predictions. Overall, we observe a diversity of behaviors across the models and languages. For instance, XGLM generates informal text in Arabic and Bengali when conditioned with informal prompts, much more than BLOOM. In addition, even though both models are highly biased toward the formal style when prompted neutrally, we find that the models generate a significant amount of informal predictions even when prompted with formal text. We release with this work 6,000 annotated samples, paving the way for future work on the formality of generative multilingual LMs.

Visual Question Generation in Bengali
Authors : Mahmud Hasan, Labiba Islam, Jannatul Ruma, Tasmiah Tahsin Mayeesha & Rashedur Rahman
In Proceedings of the Workshop on Multimodal, Multilingual Natural Language Generation and Multilingual WebNLG Challenge (MM-NLG 2023), pages 10–19, Prague, Czech Republic. Association for Computational Linguistics, 2023.

Paper

The task of Visual Question Generation (VQG) is to generate human-like questions relevant to the given image. As VQG is an emerging research field, existing works tend to focus only on resource-rich languages such as English due to the availability of datasets. In this paper, we propose the first Bengali Visual Question Generation task and develop a novel transformer-based encoder-decoder architecture that generates questions in Bengali when given an image. We propose multiple variants of models - (i) image-only: baseline model of generating questions from images without additional information, (ii) image-category and image-answer-category: guided VQG where we condition the model to generate questions based on the answer and the category of expected question. These models are trained and evaluated on the translated VQAv2.0 dataset. Our quantitative and qualitative results establish the first state-of-the-art models for the VQG task in Bengali and demonstrate that our models are capable of generating grammatically correct and relevant questions. Our quantitative results show that our image-cat model achieves a BLEU-1 score of 33.12 and a BLEU-3 score of 7.56, higher than the other two variants. We also perform a human evaluation to assess the quality of the generation tasks. Human evaluation suggests that the image-cat model is capable of generating goal-driven and attribute-specific questions and also stays relevant to the corresponding image.

Transformer Based Answer-Aware Bengali Question Generation.
Authors : Jannatul Ferdous Ruma, Tasmiah Tahsin Mayeesha & Rashedur M. Rahman.
International Journal of Cognitive Computing in Engineering, Volume 4, 2023, Pages 314-326, ISSN 2666-3074.

Model / Paper

Question generation (QG), the task of generating questions from text or other forms of data, a significant and challenging subject, has recently attracted more attention in natural language processing (NLP) due to its vast range of business, healthcare, and education applications through creating quizzes, Frequently Asked Questions (FAQs) and documentation. Most QG research has been conducted in languages with abundant resources, such as English. However, due to the dearth of training data in low-resource languages, such as Bengali, thorough research on Bengali question generation has yet to be conducted. In this article, we propose a system for producing varied and pertinent Bengali questions from context passages in natural language in an answer-aware input format using a series of fine-tuned text-to-text transformer (T5) based models. During our studies with various transformer-based encoder-decoder models and various decoding processes, along with delivering 98% grammatically accurate questions, our fine-tuned BanglaT5 model had the highest 35.77 F-score in RougeL and 38.57 BLEU-1 score with beam search. Our automated and human evaluation results show that our answer-aware QG models can create realistic, human-like questions relevant to the context passage and answer. We also release our code, generated questions, dataset, and models to enable broader question generation research for the Bengali-speaking community.

Making ethics at home in Global CS Education: Provoking stories from the Souths
Authors : Marisol Wong-Villacres, Cat Kutay, Shaimaa Lazem, Nova Ahmed, Cristina Abad, Cesar Collazos, Shady Elbassuoni, Farzana Islam, Deepa Singh, Tasmiah Tahsin Mayeesha, Martin Mabeifam Ujakpa, Tariq Zaman & Nicola J Bidwell
ACM Journal on Computing and Sustainable Societies, 2023, Best Journal Paper Award.

University courses and curricula on the ethics of computing are increasing, yet there are few studies about how CS programs should account for the diverse ways ethical dilemmas and approaches to ethics are situated in cultural, philosophical and governance systems, religions and languages. This paper seeks to prompt conversations about CS education that accounts for ethics in the Global Souths. We draw on the experiences and insights of 46 university educators and 9 practitioners, in Latin America, South Asia, Africa, the Middle East and Australian First Nations. Our modest study sought to inform revisions of the ACM’s international curricular guidelines for the Society, Ethics and Professionalism knowledge area in undergraduate CS programs. Participants’ responses in surveys and interviews illustrate difficulties in translating regional and local practices, explicit or implicit values and the changing impacts of technologies, into a singular vocabulary about ethics, such as formal ethical Codes of professional conduct. They illustrate opportunities for university teaching, and allied learning activities, to link more closely to students’ priorities, actions and experiences in the Global Souths and enrich students’ education in the Global North.

Deep learning based question answering system in Bengali
Authors : Tasmiah Tahsin Mayeesha, Abdullah Md Sarwar & Rashedur M Rahman
Journal of Information and Telecommunication, 5:2, 145-178, 2021

Paper / Dataset

Recent advances in the field of natural language processing have improved state-of-the-art performance on many tasks, including question answering, for languages like English. The Bengali language is ranked seventh and is spoken by about 300 million people all over the world. But due to a lack of data and active research on QA, similar progress has not been achieved for Bengali. Unlike English, there is no large-scale benchmark QA dataset collected for Bengali, no pretrained language model that can be modified for Bengali question answering, and no established human baseline score for QA. In this work, we use state-of-the-art transformer models to train a QA system on a synthetic reading comprehension dataset translated from SQuAD 2.0, one of the most popular benchmark datasets in English. We collect a smaller human-annotated QA dataset from Bengali Wikipedia with popular topics from Bangladeshi culture for evaluating our models. Finally, we compare our models with human children to set up a benchmark score using survey experiments.

Applying Text Mining to Protest Stories as Voice against Media Censorship.
Authors : Tasmiah Tahsin Mayeesha, Zareen Tasneem, Jasmin Jones & Nova Ahmed.
ACM Conference on Computer-Supported Co-operative Work and Social Computing, Solidarity Across Borders Workshop, 2018

Paper

Data-driven activism attempts to collect, analyze and visualize data to foster social change. However, during media censorship it is often impossible to collect such data. Here we demonstrate that data from personal stories can also help us gain insights about protests and activism, and can work as a voice for the activists.

Projects

Bengali Automatic Speech Recognition System
Speech to text model for Bengali language, Huggingface Robust Speech Event, 2022

Huggingface Speech Bench / Model

Fine-tuned a Wav2vec2-xls-r model on the OpenSLR Bangla speech dataset of 200k+ samples. It was recognized as one of the best-performing models for Bangla in the Huggingface Robust Speech Sprint.

Bengali GPT2 model
Part of Huggingface Flax-Jax Event, 2021

Model

OpenAI's large GPT-2 model was proposed in the paper Language Models are Unsupervised Multitask Learners. The original GPT-2 model was a causal (unidirectional) transformer pretrained using language modeling on a very large corpus of ~40 GB of text data. This model has the same configuration but has been pretrained on the Bengali corpus of the mC4 (multilingual C4) dataset. It also features another variant fine-tuned on Bengali song lyrics.

Credit Card Recommender System
Software Engineering Course, 2018 Code

Developed a similarity-based card recommender model using geolocation and card-specific features, with a dataset collected from Bangladeshi banks. Used scikit-learn for modelling and deployed it with a Django web app and a Google Dialogflow-based chatbot.

Dobhashi - English Bangla Machine Translation
Natural Language Processing Course, 2018 Report

Built English-Bangla machine translation models based on LSTM and transformer architectures, trained on the SUPARA Benchmark Bangla-English corpus. The best-performing model achieves a BLEU score of 46.

News Article Network Visualization on violence against women
Interactive Network

This project explores media coverage in news articles about harassment or violence against women, including rape and murder-related cases. It was done with the help of KolpoKoushol, an initiative by MIT alumni from Bangladesh to gather people from many fields to learn about interdisciplinary ideas. This project has been featured by Fast.ai. See: Deep Learning, Not just for Silicon Valley.


Blog Posts

Classifying Bangla Fake News with HuggingFace Transformers and Fastai

Google Summer of Code 19 with TensorFlow Hub

Building a Credit Card Recommender

Google Summer of Code 2018 : Network Visualization Of MediaCloud Topic Network

Multi class Fish Classification on Images using Transfer Learning and Keras

Recommending Animes Using Nearest Neighbors

Honors/Awards

Humayun Ahmed Research Fellowship, NSU HCI DIAL Lab, 2023

Weights & Biases Fastai x Huggingface study group blog competition, winning submission, 2020
Code

Secure and Private AI Scholarship Challenge, Udacity-Facebook, 2019

AWS Machine Learning Scholarship, Udacity-Amazon, 2018

Fast.ai International Fellowship, 2018. Featured in Forbes article - Artificial Intelligence Education Transforms The Developing World, Deep Learning, not just for Silicon Valley

Udacity Machine Learning Nanodegree, 2017. Capstone project on multi-class image classification on fishery images. Code.


Mentorship

Bengali NLP: Application in Literature and Natural Language Generation Project (2023): Mentoring two graduated research assistants on Bengali Visual Question Generation research.

My Freedom in Light (2023): Mentoring undergraduate research assistants in report writing and literature review.


Invited Talks & Tutorials

NLP Reading Group Dhaka - Language Models are Few-Shot Learners Paper Presentation, 2022

Goethe-Institut and HerStory Foundation - Presentation on AI Ethics Articles, 2022

W&B Study Group: fastai w/ Hugging Face Demo Day, 2022

Breaking into research for undergraduate students - Free Schooling Bangladesh, 2021

Udacity School Of Artificial Intelligence Open House, 2020


Template stolen from Jon Barron! Thanks for dropping by.
Last updated March 2023.