While examining the licensing details of Google’s Gemma model, I noticed a potentially puzzling phenomenon: you can freely assign a license to the model’s outputs, yet depending on how those outputs are used, the original Terms of Use might suddenly propagate to the resulting work.

Outputs vs. Model Derivatives

The Gemma Terms of Use distinguish between two main categories:

  • Outputs: These are the texts or translations generated by Gemma. They are neither the Gemma model itself nor a “Model Derivative” of it. The Terms of Use therefore impose no requirement to inherit the Gemma license terms merely for distributing these outputs as-is.
  • Model Derivatives: If you use Gemma’s parameters or outputs to train a new model that replicates Gemma’s performance, via distillation or other methods, that new model becomes a “Model Derivative.” It is then subject to Gemma’s licensing obligations, including license inheritance and usage restrictions.

Hence, plain-text outputs are, in practice, almost entirely free of restrictions. Yet if you use such outputs as training data specifically to reproduce or incorporate Gemma’s capabilities, the resulting model may, in legal terms, qualify as a “Model Derivative.”

Can We License Gemma’s Outputs Freely?

Gemma’s Terms of Use do not impose explicit constraints on how to distribute Outputs or what license you can attach to them. This means you could, for instance, bundle Gemma’s translation results into a dataset and release it under Apache 2.0, CC-BY, or any other open license. At a glance, it might appear that “everyone can do whatever they like with it” if the dataset is under a permissive license.
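To make this concrete, here is a minimal Python sketch of how such a dataset might be assembled, with each record carrying an explicit note of its origin. It assumes the Hugging Face transformers library; the model id google/gemma-2-2b-it (a gated checkpoint on the Hugging Face Hub), the prompt format, and the record fields are all illustrative choices, not anything mandated by the Terms of Use.

```python
# Minimal sketch: generate translations with a Gemma checkpoint and store
# them with explicit provenance. Assumes the Hugging Face `transformers`
# library; model id, prompt format, and record schema are illustrative.
import json

from transformers import pipeline

# google/gemma-2-2b-it is a gated model; access must be granted on the Hub.
generator = pipeline("text-generation", model="google/gemma-2-2b-it")

sentences = [
    "The weather is nice today.",
    "License terms are not always simple.",
]

records = []
for src in sentences:
    prompt = f"Translate the following English sentence into Japanese:\n{src}\n"
    result = generator(prompt, max_new_tokens=128)[0]["generated_text"]
    records.append({
        "source": src,
        "translation": result[len(prompt):].strip(),  # drop the echoed prompt
        "generated_by": "google/gemma-2-2b-it",       # provenance travels with the data
    })

# JSON Lines: one record per line, with non-ASCII text kept readable.
with open("gemma_translations.jsonl", "w", encoding="utf-8") as f:
    for rec in records:
        f.write(json.dumps(rec, ensure_ascii=False) + "\n")
```

Whatever license you attach to the resulting file, the “generated_by” field preserves the fact that the text came from Gemma, which matters for the scenario discussed next.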

The Real Issue: Building a Gemma-Equivalent Model

The more subtle concern arises when someone takes that dataset, composed of Gemma outputs, and uses it to build a model with capabilities essentially equal to Gemma’s. As noted above, such a model is deemed a “Model Derivative” and must inherit and comply with the Gemma Terms of Use. Here is a possible scenario:

  1. You collect numerous Gemma translation outputs and publicly release the dataset under CC-BY.
  2. A third party obtains this dataset.
  3. The third party thinks, “This looks like a great translation corpus,” and uses it to train or distill their own translation model.
  4. As a result, they create a model that closely reproduces Gemma’s translation performance, and then they release that model.
  5. Because this model effectively replicates Gemma, it may be deemed a “Model Derivative,” making it subject to Gemma’s Terms of Use (i.e., the original license rules propagate).

In other words, even if the dataset itself carries a free and open license, using it to create a Gemma-equivalent model could trigger the Gemma Terms of Use. Moreover, there is a practical risk that someone might not realize the dataset originated from Gemma outputs and inadvertently create a Gemma-like model without ever being aware of the license obligations.

How Likely Is This Scenario?

Realistically speaking, it may be quite rare for anyone to create a model that exactly matches Gemma’s performance. Nonetheless, from a legal perspective, the possibility cannot be dismissed out of hand. For this reason, if you distribute a dataset heavily composed of Gemma outputs, it may be prudent to include a note stating that it contains Gemma outputs and that any effort to replicate Gemma’s performance could fall under its Terms of Use. While not necessarily a legal requirement, such disclosure can help avoid future misunderstandings.
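As one way to implement such a disclosure, the notice can be written into the dataset’s machine-readable metadata so that it travels with the data itself. The sketch below assumes a simple JSON side-file; the schema and field names are illustrative, not any established standard.

```python
# Minimal sketch of a machine-readable provenance notice for a dataset
# built from Gemma outputs. Schema and field names are illustrative.
import json

metadata = {
    "name": "example-gemma-translation-corpus",  # hypothetical dataset name
    "license": "CC-BY-4.0",                      # license of the dataset itself
    "provenance": {
        "contains_model_outputs": True,
        "generating_model": "Gemma",
        "notice": (
            "This dataset consists of outputs generated by Gemma. "
            "Training or distilling a model that reproduces Gemma's "
            "capabilities from this data may make the resulting model "
            "a 'Model Derivative' subject to the Gemma Terms of Use."
        ),
    },
}

with open("dataset_metadata.json", "w", encoding="utf-8") as f:
    json.dump(metadata, f, ensure_ascii=False, indent=2)
```

A consumer of the dataset can then check the provenance block programmatically before deciding how to use the data, rather than relying on a note buried in a README.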

Importance of Dataset Transparency

I first noticed this potential licensing complexity when reviewing the Japanese CLIP model licensing from LLM.jp. As large-scale AI datasets become more widespread and intermixed, it can be difficult to trace their origins or confirm whether a portion of the data is derived from a particular model’s outputs. Ensuring transparency and clarity about data sources is increasingly important to prevent inadvertent violations of Terms of Use or other licensing constraints.
