
How to detect covert AI-generated text

Aug. 17, 2024.
5 min read

Covert adversarial attacks and how machine-generated text detectors fail

About the writer

Amara Angelica


Electronics engineer, inventor

Conceptual image: detectors can identify AI-generated text when it contains no edits or “disguises,” but cannot reliably detect it once the text has been manipulated. (credit: Chris Callison-Burch and Liam Dugan)

Machine-generated text has been fooling humans since the release of GPT-2 in 2019. Large language model (LLM) tools have gotten progressively better at crafting stories, news articles, student essays and more. So humans are often unable to recognize when they are reading text produced by an algorithm.

Misuse and harmful outcomes

These LLMs are being used to save time and even boost creativity in ideating and writing, but their power can lead to misuse and harmful outcomes that are already showing up where we consume information.

One way both academics and companies are trying to improve this detection is by employing machine learning models that can identify subtle patterns of word choice and grammatical construction. These models can recognize LLM-generated text in ways that human intuition cannot.
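For illustration, here is a minimal sketch of what such a pattern-based detector can look like, using a simple n-gram classifier in scikit-learn. The handful of example texts and the feature choices are placeholder assumptions for demonstration only; the commercial and academic detectors discussed in the paper are far more sophisticated.

```python
# Minimal sketch of a pattern-based detector (illustrative only; not one of
# the detectors evaluated in the RAID paper). It fits a logistic-regression
# classifier over word n-gram features, one simple way to capture "subtle
# patterns of word choice and grammatical construction."
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical labeled examples: 1 = machine-generated, 0 = human-written.
texts = [
    "In conclusion, it is important to note that the aforementioned factors converge.",
    "The committee convened to discuss the quarterly budget allocations in detail.",
    "honestly the movie was kinda mid but the popcorn slapped",
    "We got rained out, so the hike turned into a long lunch at the diner.",
]
labels = [1, 1, 0, 0]

detector = make_pipeline(
    TfidfVectorizer(analyzer="word", ngram_range=(1, 2)),  # word uni- and bigrams
    LogisticRegression(max_iter=1000),
)
detector.fit(texts, labels)

# Probability that a new document is machine-generated.
print(detector.predict_proba(["It is worth noting that several factors contribute."])[0][1])
```

In practice, real detectors are trained on millions of documents and often use neural language-model features rather than bag-of-words statistics, but the basic framing as a binary classifier is the same.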

Today, many commercial detectors are claiming to be highly successful at detecting machine-generated text, with up to 99% accuracy—but are these claims verified? Chris Callison-Burch, Professor in Computer and Information Science, and Liam Dugan, a doctoral student in Callison-Burch’s group, aimed to find out in their recent paper, published at the 62nd Annual Meeting of the Association for Computational Linguistics.

“As the technology to detect machine-generated text advances, so does the technology used to evade detectors,” said Callison-Burch in a statement. “It’s an arms race, and while the goal to develop robust detectors is one we should strive to achieve, there are many limitations and vulnerabilities in detectors that are available now.”   

Testing AI detector ability

To investigate those limitations and provide a path forward for developing robust detectors, the research team created Robust AI Detector (RAID), a data set of more than 10 million documents across recipes, news articles, blog posts and more, including AI-generated text and human-generated text.

RAID is the first standardized benchmark for testing detection ability in current and future detectors. The team also created a public leaderboard that ranks, in an unbiased way, the performance of every detector evaluated on RAID.
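The RAID paper evaluates detectors at a fixed false-positive rate on human-written text: a decision threshold is chosen so that only a small fraction of human documents are flagged, and the detector is then judged by how much machine-generated text it catches at that threshold. The sketch below illustrates that style of evaluation; the scores and labels are placeholder values, and this is not the official RAID evaluation code.

```python
# Sketch of a fixed-false-positive-rate evaluation in the spirit of the RAID
# leaderboard (not the official evaluation code). A detector outputs a score
# per document; we pick the threshold that keeps false positives on human
# text at ~5%, then measure how much machine text is caught at that threshold.
import numpy as np

def detection_rate_at_fpr(scores, is_machine, target_fpr=0.05):
    scores = np.asarray(scores, dtype=float)
    is_machine = np.asarray(is_machine, dtype=bool)

    human_scores = scores[~is_machine]
    # Threshold such that only ~target_fpr of human documents score above it.
    threshold = np.quantile(human_scores, 1.0 - target_fpr)

    machine_scores = scores[is_machine]
    return float((machine_scores > threshold).mean())

# Placeholder scores from a hypothetical detector (higher = "more AI-like").
scores     = [0.91, 0.85, 0.40, 0.30, 0.88, 0.20, 0.75, 0.10]
is_machine = [1,    1,    0,    0,    1,    0,    1,    0]
print(f"Detection rate at 5% FPR: {detection_rate_at_fpr(scores, is_machine):.2f}")
```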

“The concept of a leaderboard has been key to success in many aspects of machine learning like computer vision,” said Dugan in a statement. “The RAID benchmark is the first leaderboard for robust detection of AI-generated text. We hope that our leaderboard will encourage transparency and high-quality research in this quickly evolving field.”

Dugan has already seen the influence this recent paper is having on companies that develop detectors. “Originality.ai is a prominent company that develops detectors for AI-generated text,” he says. “They shared our work in a blog post, ranked their detector on our leaderboard and are using RAID to identify previously hidden vulnerabilities and improve their detection tool.”

How detectors fail

So do current detectors hold up to the task? RAID shows that few perform as well as they claim.

“Detectors trained on ChatGPT were mostly useless in detecting machine-generated text outputs from other LLMs such as Llama and vice versa,” says Callison-Burch. “Detectors trained on news stories don’t hold up when reviewing machine-generated recipes or creative writing. What we found is that there are a myriad of detectors that only work well when applied to very specific use cases and when reviewing text similar to the text they were trained on.” 

The study found that detectors can identify AI-generated text when it contains no edits or “disguises,” but once the text is manipulated they fail to detect it reliably. Faulty detectors are not only a problem because they don’t work well; they can be as dangerous as the AI tools used to produce the text in the first place.

“If universities or schools were relying on a narrowly trained detector to catch students’ use of ChatGPT to write assignments, they could be falsely accusing students of cheating when they are not,” says Callison-Burch. “They could also miss students who were cheating by using other LLMs to generate their homework.”   

Adversarial attacks and tricks

The team also looked into how adversarial attacks (such as replacing letters with look-alike symbols) can easily derail a detector and allow machine-generated text to fly under the radar.

“It turns out, there are a variety of edits a user can make to evade detection by the detectors we evaluated in this study,” says Dugan. “Something as simple as inserting extra spaces, swapping letters for symbols, or using alternative spelling or synonyms for a few words can cause a detector to be rendered useless.”
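To make those attack categories concrete, the sketch below implements toy versions of three such edits in Python: doubling some inter-word spaces, swapping letters for Cyrillic look-alike symbols, and switching to alternative spellings. These are simplified illustrations of the idea, not the exact transformations used in the RAID study.

```python
# Toy versions of the evasion edits described above (illustrative only; the
# adversarial attacks in RAID are more systematic). Each function makes a
# small surface-level change that can throw off a brittle detector.
import random
import re

HOMOGLYPHS = {"a": "а", "e": "е", "o": "о", "c": "с"}  # Latin -> Cyrillic look-alikes
ALT_SPELLINGS = {"color": "colour", "organize": "organise", "analyze": "analyse"}

def use_alt_spellings(text):
    """Swap a few words for alternative (British) spellings."""
    for us, uk in ALT_SPELLINGS.items():
        text = re.sub(rf"\b{us}\b", uk, text)
    return text

def swap_homoglyphs(text, rate=0.2, seed=0):
    """Replace some Latin letters with visually identical Cyrillic ones."""
    rng = random.Random(seed)
    return "".join(
        HOMOGLYPHS[ch] if ch in HOMOGLYPHS and rng.random() < rate else ch
        for ch in text
    )

def insert_extra_spaces(text, rate=0.1, seed=0):
    """Randomly double some of the spaces between words."""
    rng = random.Random(seed)
    words = text.split(" ")
    return " ".join(w + (" " if rng.random() < rate else "") for w in words).rstrip()

sample = "analyze the color palette and organize the results."
evaded = insert_extra_spaces(swap_homoglyphs(use_alt_spellings(sample)))
print(evaded)  # reads the same to a human, but differs at the character level
```

Because the perturbed text still looks normal to a human reader while differing at the token or character level, a detector that relies on exact surface patterns from its training data can miss it entirely.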

The study concludes that while current detectors are not yet robust enough to be of significant use in society, openly evaluating them on large, diverse, shared resources is critical to accelerating progress and building trust in detection. That transparency, the authors argue, will lead to detectors that hold up across a variety of use cases.

Reducing harms

“My work is focused on reducing the harms that LLMs can inadvertently cause, and, at the very least, making people aware of the harms so that they can be better informed when interacting with information,” Dugan continues. “In the realm of information distribution and consumption, it will become increasingly important to understand where and how text is generated, and this paper is just one way I am working towards bridging those gaps in both the scientific and public communities.”

This study was funded by the Intelligence Advanced Research Projects Activity (IARPA), part of the Office of the Director of National Intelligence, under the Human Interpretable Attribution of Text Using Underlying Structure (HIATUS) program.

Citation: Dugan, L., Hwang, A., Trhlik, F., Ludan, J. M., Zhu, A., Xu, H., & Ippolito, D. (2024). RAID: A Shared Benchmark for Robust Evaluation of Machine-Generated Text Detectors. arXiv. https://arxiv.org/abs/2405.07940 (open access)
