Google Open-Sources Its Watermarking Tool for AI-Generated Text

Published October 23, 2024

Google has open-sourced its SynthID text watermarking tool, making it available to developers through the Google Responsible Generative AI Toolkit. The company announced the release on social media.

The release lets other generative AI developers use the technology to determine whether text outputs came from their own large language models (LLMs). Pushmeet Kohli, vice president of research at Google DeepMind, said the move is intended to help developers build AI more responsibly.

Watermarking has gained importance as AI-generated content is increasingly used in harmful ways, such as spreading misinformation and producing nonconsensual content. Several governments are looking into making AI watermarking a requirement. For instance, California is currently examining the need for mandatory watermarking, while China has already implemented such regulations.

SynthID, introduced in August 2023, helps identify AI-generated content by embedding an invisible watermark into images, audio, video, and text at generation time. The text version works by slightly modifying the probability of certain text outputs in a way that software can detect but human readers cannot perceive.

An LLM generates text one token at a time, where a token may be a character, a word, or part of a phrase. At each step, the model assigns a probability score to every candidate next token based on the preceding text. For example, given the phrase “My favorite tropical fruits are __,” the LLM might propose tokens such as “mango,” “lychee,” or “papaya,” each with its own probability score. SynthID adjusts these scores where it can do so without affecting the overall quality, accuracy, or creativity of the output.
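To make the idea of adjusting probability scores concrete, here is a minimal Python sketch of one generic way to do it. This is not SynthID's actual algorithm, and nothing here comes from Google's toolkit: the keyed hash, the `delta` strength parameter, the function names, and the example probabilities are all assumptions chosen for illustration. A secret key and the preceding text pseudorandomly mark some candidate tokens as favored, and the favored tokens get a small boost before the scores are renormalized.

```python
import hashlib
import math


def is_favored(key: str, context: str, token: str) -> bool:
    """Keyed pseudorandom coin flip (hypothetical helper).

    For a given key and context, roughly half of all tokens come out favored.
    """
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode("utf-8")).digest()
    return digest[0] % 2 == 0


def adjust_scores(probs: dict, key: str, context: str, delta: float = 1.0) -> dict:
    """Scale favored tokens' probabilities by e**delta, then renormalize to sum to 1."""
    weighted = {
        tok: p * (math.exp(delta) if is_favored(key, context, tok) else 1.0)
        for tok, p in probs.items()
    }
    total = sum(weighted.values())
    return {tok: w / total for tok, w in weighted.items()}


# Illustrative probabilities for the "tropical fruits" example; not real model output.
candidates = {"mango": 0.5, "lychee": 0.3, "papaya": 0.2}
adjusted = adjust_scores(candidates, key="demo-key", context="fruits are")

for tok in candidates:
    print(f"{tok}: {candidates[tok]:.3f} -> {adjusted[tok]:.3f}")
```

A small `delta` nudges the distribution without overriding the model's preferences, which is the trade-off the article describes: a stronger push is easier to detect but more likely to change what the model would otherwise have written.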

This adjustment is repeated throughout the text, so even a single sentence can contain numerous adjusted scores. Together, the pattern of adjusted scores and the model's resulting word choices forms the watermark.
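The following Python sketch illustrates why a pattern spread across many tokens can be detected statistically even though no single word gives it away. It is a toy under stated assumptions, not Google's detector: the keyed hash, the synthetic vocabulary, the four-candidate "model", and every function name are hypothetical. A generator prefers tokens that a secret key marks as favored, and a detector holding the same key counts how often favored tokens appear compared with the roughly 50% expected by chance.

```python
import hashlib
import math
import random


def is_favored(key: str, context: str, token: str) -> bool:
    """Keyed pseudorandom coin flip (hypothetical helper): about half of tokens are favored."""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode("utf-8")).digest()
    return digest[0] % 2 == 0


def generate(n_tokens: int, key: str, rng: random.Random, vocab: list, watermark: bool) -> list:
    """Toy stand-in for an LLM: draw a few candidates per step and pick one."""
    tokens = [rng.choice(vocab)]
    while len(tokens) < n_tokens:
        candidates = rng.sample(vocab, 4)  # stand-in for the model's top candidates
        choice = candidates[0]
        if watermark:
            # Prefer a favored candidate when one exists; otherwise keep the default.
            favored = [t for t in candidates if is_favored(key, tokens[-1], t)]
            if favored:
                choice = favored[0]
        tokens.append(choice)
    return tokens


def watermark_z_score(tokens: list, key: str) -> float:
    """z-score of the favored-token count against the 50% expected by chance."""
    n = len(tokens) - 1  # the first token has no preceding context
    if n <= 0:
        raise ValueError("need at least two tokens to score")
    hits = sum(is_favored(key, prev, tok) for prev, tok in zip(tokens, tokens[1:]))
    return (hits - 0.5 * n) / math.sqrt(0.25 * n)


vocab = [f"tok{i}" for i in range(50)]  # synthetic vocabulary for the demo
rng = random.Random(0)
z_marked = watermark_z_score(generate(200, "demo-key", rng, vocab, watermark=True), "demo-key")
z_plain = watermark_z_score(generate(200, "demo-key", rng, vocab, watermark=False), "demo-key")
print(f"watermarked z-score: {z_marked:.1f}, unwatermarked z-score: {z_plain:.1f}")
```

Because the score grows with the number of tokens examined, a scheme like this has less evidence to work with on short passages, and editing the text replaces some favored tokens with arbitrary ones, which lowers the score.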

Google says the system, which is already integrated into its Gemini chatbot, does not compromise the quality, accuracy, or speed of generated text, all of which have been common concerns with conventional watermarking technology. According to Google, SynthID works on text as short as three sentences and can survive cropping, mild paraphrasing, or modification. It is less reliable, however, on very short text, on content that has been thoroughly rewritten, and on responses to factual questions, where there is little room to vary word choice without changing the answer.

While Google acknowledges that “SynthID isn’t a silver bullet for identifying AI-generated content,” it describes the tool as a building block for developing more reliable AI identification technologies. This could ultimately help millions of users make informed decisions about how they interact with AI-generated content.
