Research Projects on Large Language Models and Natural Language Processing

Data Augmentation for Text Classification with EASE

In image classification, data augmentation (DA) is straightforward: cropping, rotating, or blurring works because a cropped or blurred “cat” is still a “cat”. In other words, the augmented sample requires no additional labeling, since in most cases it retains the original label. In text classification (TC), existing methods rely on operations such as random insertion or deletion of words and punctuation, but these can easily change the semantics. Methods that assign the original label to such augmented samples therefore often just inject more noise. Moreover, acquiring new labels for augmented samples requires training on the original data first to obtain a good estimate of those labels. In this work, we present EASE, a simple but dependable DA technique for TC with four easy steps: Extract Units, Acquire Labels, Sift, and Employ. We extract meaningful units from the original samples as augmented samples and use powerful tools to acquire labels for them before they are sifted and merged back into the training data. Previous DA techniques, such as EDA and AEDA, excel with sequential models but struggle with transformer-based models that rely heavily on token order. EASE, in contrast, performs well with these models, demonstrating stability, speed, and minimal adverse effects. We tested our method on multiple challenging, augmentation-sensitive datasets, and the experimental results demonstrate the efficacy of DA with EASE.
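A minimal sketch of the four EASE steps, assuming sentence splitting as the unit extractor and an off-the-shelf zero-shot classifier as the labeling tool; the extractor, labeler, and confidence threshold below are illustrative assumptions, not the released implementation.

```python
# Illustrative sketch of the four EASE steps (Extract, Acquire, Sift, Employ).
# The unit extractor, labeler, and threshold are assumptions for demonstration.
from transformers import pipeline

labeler = pipeline("zero-shot-classification",
                   model="facebook/bart-large-mnli")

def ease_augment(texts, labels, label_names, confidence=0.9):
    augmented = []
    for text in texts:
        # 1) Extract Units: split each sample into smaller meaningful units
        #    (sentences here; clauses or phrases are other possible choices).
        units = [u.strip() for u in text.split(".") if u.strip()]
        for unit in units:
            # 2) Acquire Labels: label each unit with a powerful external tool
            #    instead of blindly reusing the original sample's label.
            result = labeler(unit, candidate_labels=label_names)
            top_label, top_score = result["labels"][0], result["scores"][0]
            # 3) Sift: keep only confidently labeled units to avoid noise.
            if top_score >= confidence:
                augmented.append((unit, top_label))
    # 4) Employ: merge the augmented units with the original training data.
    return list(zip(texts, labels)) + augmented
```

The sifting threshold trades coverage for label quality: a stricter cutoff yields fewer but cleaner augmented samples.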


LLMs for DataFrame Question Answering Without Data Exposure

We introduce DataFrame question answering (QA), a novel task that uses natural language processing (NLP) models to generate Pandas queries for information retrieval and data analysis on dataframes, with an emphasis on safe, non-revealing data handling. Our method, which leverages a large language model (LLM) that relies solely on dataframe column names, not only preserves data privacy but also significantly reduces the context window in the prompt, streamlining information processing and addressing major challenges in LLM-based data analysis. We propose DataFrame QA as a comprehensive framework comprising safe Pandas query generation and code execution. We evaluate various LLMs on the well-known WikiSQL dataset and on our newly developed UCI-DataFrameQA, which is tailored to complex data-analysis queries. Our findings indicate that GPT-4 performs well on both datasets, underscoring its ability to securely retrieve and aggregate dataframe values and to conduct sophisticated analyses. The approach is deployable zero-shot, without prior training or adjustment, making it highly adaptable and secure for diverse applications.
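A minimal sketch of the idea, assuming an OpenAI-style chat API: the prompt carries only column names, never cell values, and the generated query is executed locally in a restricted namespace. The prompt wording, model name, and sandboxing shown here are assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of DataFrame QA: the LLM sees only column names,
# never the data itself. Real deployments need stricter sandboxing than
# the simple restricted eval used here.
import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def dataframe_qa(df: pd.DataFrame, question: str):
    # Build the prompt from column names only: no cell values are exposed.
    prompt = (
        f"A pandas DataFrame `df` has columns: {list(df.columns)}.\n"
        f"Write a single pandas expression that answers: {question}\n"
        "Return only the expression, no explanation."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    query = response.choices[0].message.content.strip().strip("`")
    # Execute the generated query locally in a restricted namespace,
    # so data values are only ever touched on the user's side.
    return eval(query, {"__builtins__": {}}, {"df": df, "pd": pd})

# Example: the model never sees the salaries, only the column names.
df = pd.DataFrame({"name": ["a", "b"], "salary": [50000, 60000]})
print(dataframe_qa(df, "What is the average salary?"))
```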


Exploring the Limits of LLMs in Math

This project examines the Stochastic Parrot Hypothesis, which posits that large language models (LLMs) may generate responses that appear coherent but are in fact statistical mimicry, the way a parrot imitates speech without true comprehension. We investigate whether LLMs possess a genuine understanding of mathematics or simply echo patterns found in their training data. The significance of this study extends beyond academic curiosity: it seeks to elucidate the real capabilities of LLMs in abstract and logical reasoning, thereby shedding light on their potential to comprehend and interact with the physical world. Through this investigation, we aim to deepen our understanding of the operational limits and cognitive capacities of LLMs in mathematical contexts.
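One way such a probe could look in practice: test models on freshly generated arithmetic instances that are unlikely to appear verbatim in any training corpus, and watch how accuracy scales with problem size. This is a hypothetical sketch, not the project's methodology; `ask_llm` is a stub for any chat-model API call.

```python
# Hypothetical probe for the Stochastic Parrot Hypothesis: generate random
# multi-digit arithmetic problems (unlikely to be memorized verbatim) and
# compare model answers against ground truth.
import random

def ask_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API call here")

def probe_arithmetic(n_trials: int = 100, digits: int = 6) -> float:
    correct = 0
    for _ in range(n_trials):
        a = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        b = random.randint(10 ** (digits - 1), 10 ** digits - 1)
        answer = ask_llm(f"Compute {a} * {b}. Reply with the number only.")
        try:
            correct += int(answer.replace(",", "")) == a * b
        except ValueError:
            pass  # a non-numeric reply counts as wrong
    # A pattern-echoing model should degrade sharply as digit count grows;
    # genuine algorithmic competence would not.
    return correct / n_trials
```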
