Defending Against Stegnography in Large Language Models

The advent of large language models (LLMs) has revolutionized natural language processing, enabling machines to generate human-like text and demonstrate impressive performance in various language tasks. However, recent research has raised concerns about the potential use of steganography by LLMs to hide intermediate reasoning steps within their text generations. This concern affects the trustworthiness and reliability of LLM-generated content, particularly in applications where transparency and accountability are paramount.

Text steganography, the art of concealing secret messages within a large body of seemingly innocuous text, has been practiced for centuries. However, a recent study has demonstrated the feasibility of LLMs to potentially exploit this art and use encoded reasoning to hide information in their responses without compromising its coherence and fluency. Encoded reasoning refers to the ability of models to hide their intermediate steps of reasoning in a generated text making it difficult for human readers to understand the rationale behind the model’s outputs. This encoded reasoning can be leveraged to improve performance on certain tasks, such as sycophancy or goal preservation, where the model aims to please the user or achieve a specific objective, respectively.

Several studies have delved into the use of steganography in LLMs. Different techniques such as watermarking and chain-of-thought (CoT) prompting have been explored for steganographic purposes. However, limited research has focused on evaluating and defending against LLM steganography. Building up on previous work, a research team from Redwood Research has conducted a comprehensive study to assess the risks of LLM steganography and evaluate potential defenses against it. The study sheds light on the prevalence, impact, and counter measures for encoded reasoning in LLMs.

Credit: Tesfu Assefa

The researchers devised a series of experiments involving fine-tuning a GPT-3.5-Turbo, a powerful LLM, on various tasks to investigate steganography attempts. They employed different steganography schemes to encode information into GPT-3.5-Turbo’s text output and evaluated the model’s ability to successfully encode and decode hidden messages. The experiments are conducted using two datasets, AlpacaEval and AQUA-RAT.

They propose a methodology to evaluate the effectiveness of steganography schemes and defenses. It involves four key steps:

  1. Steganography encoding: An LLM is prompted to encode a secret message into a language response.
  2. Paraphrasing as a defense: A paraphrasing model attempts to remove the hidden information while preserving the meaning of the response.
  3. Steganography encoding: Another LLM is tasked with extracting the secret message from the paraphrased response.
  4. Task performance evaluation: The quality of the paraphrased response and its information density is assessed based on task-specific metrics, such as accuracy or fluency.
Illustration of the method for evaluating defenses against steganography. (Credit: Fabien Roger and Ryan Greenblatt)

The experiments revealed that GPT-3.5-Turbo can indeed learn to encode information using simple steganography schemes. However, the researchers also demonstrated that employing context-aware paraphrasing and penalizing abnormal generations could significantly reduce the amount of hidden information in the responses without compromising task performance. They have shown that paraphrasing can prevent even the best encoding schemes from encoding more than 3 bits of information per KB of text, thus mitigating the risks associated with encoded reasoning. The results also showed that the capacity of steganography schemes to encode information is constrained by the paraphraser’s ability to remove the encoded information.

The findings showed that fine-tuned LLMs with a simple steganographic encoding scheme were able to generate answers that were more sycophantic than those of a model trained to output the final answer directly. They also suggest that the proposed defense mechanism, context-aware paraphrasing, offers a promising solution to mitigate the threat posed to the reliability and transparency of LLM-generated content.

In conclusion, the research provides actionable insights into the feasibility and risks of LLM steganography and proposes a practical methodology for evaluating defenses against it. It also calls for the adoption of defenses against steganography to create more transparent and reliable models. The findings underscore the need for further research to develop robust defenses against this emerging threat and ensure the responsible use of LLMs in various applications.

Let us know your thoughts! Sign up for a Mindplex account now, join our Telegram, or follow us on Twitter