While examining the licensing details of Google’s Gemma model, I noticed a puzzling wrinkle: you can freely assign a license to the model’s outputs, yet depending on how those outputs are later used, the original Terms of Use may propagate to the resulting work.
Outputs vs. Model Derivatives
The Gemma Terms of Use distinguish between two main categories:
- Outputs: These are the texts or translations generated by Gemma. They are neither the Gemma model itself nor a “Model Derivative.” Accordingly, the Terms of Use impose no requirement to inherit Gemma’s license terms merely for distributing these outputs as they are.
- Model Derivatives: If you use Gemma’s parameters or outputs to train a new model that replicates Gemma’s performance (via distillation or other methods), that new model becomes a “Model Derivative.” It is then subject to Gemma’s licensing obligations, including license inheritance and usage restrictions.
Plain text outputs, then, are effectively free of restrictions. But if you use such outputs as training data specifically to reproduce or incorporate Gemma’s capabilities, the resulting model may, in legal terms, qualify as a “Model Derivative,” as sketched below.
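To make the distinction concrete, here is a minimal sketch assuming the Hugging Face `transformers` API; the model name, prompt, and training step are illustrative only, not a statement of how any particular derivative was built:

```python
# Minimal sketch of the Output vs. Model Derivative distinction.
# Assumes the Hugging Face `transformers` library; the model name and
# prompt are illustrative, and the training step is deliberately stubbed.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b-it")

# 1) An Output: generated text. Distributing this text as-is does not,
#    by itself, propagate the Gemma Terms of Use to the recipient.
prompt = "Translate to French: The weather is nice today."
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
translation = tokenizer.decode(output_ids[0], skip_special_tokens=True)

# 2) A potential Model Derivative: collecting such outputs as supervision
#    for a student model that reproduces Gemma's capabilities (distillation).
#    The student model would inherit Gemma's licensing obligations.
training_pairs = [(prompt, translation)]  # Gemma outputs as training data
# ... fine-tune a student model on `training_pairs` (omitted) ...
```

The licensing question turns not on the code but on the provenance of `training_pairs`: the same fine-tuning loop run on independently sourced data would raise no such issue.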
Can We License Gemma’s Outputs Freely?
Gemma’s Terms of Use place no explicit constraints on how Outputs may be distributed or what license you can attach to them. You could, for instance, bundle Gemma’s translation results into a dataset and release it under Apache 2.0, CC-BY, or any other open license. At first glance, a permissive license might seem to mean that anyone can do whatever they like with the data.
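For example, nothing in the Terms prevents publishing such a dataset with permissive metadata. A hypothetical Hugging Face dataset card header might look like this (field names follow the Hub’s conventions; the values are illustrative):

```yaml
# Hypothetical dataset card front matter for a corpus of Gemma outputs.
# Nothing here is forbidden by the Gemma Terms of Use; the permissive
# license applies to the generated text itself.
license: cc-by-4.0
language:
  - en
  - fr
task_categories:
  - translation
pretty_name: Gemma Translation Pairs (illustrative)
```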
The Real Issue: Building a Gemma-Equivalent Model
The more subtle concern arises when someone takes that dataset of Gemma outputs and uses it to build a model whose capabilities are essentially equal to Gemma’s. As noted above, such a model is deemed a “Model Derivative” and must inherit and comply with the Gemma Terms of Use. Here is a possible scenario:
- You collect numerous Gemma translation outputs and publicly release the dataset under CC-BY.
- A third party obtains this dataset.
- The third party thinks, “This looks like a great translation corpus,” and uses it to train or distill their own translation model.
- As a result, they create a model that closely reproduces Gemma’s translation performance, and then they release that model.
- Because this model effectively replicates Gemma, it may be deemed a “Model Derivative,” making it subject to Gemma’s Terms of Use (i.e., the original license rules propagate).
In other words, even if the dataset itself carries a free and open license, using it to create a Gemma-equivalent model could trigger the Gemma Terms of Use. There is also the practical risk that someone never realizes the dataset originated from Gemma outputs and creates a Gemma-like model without any awareness of the attached license obligations.
How Likely Is This Scenario?
Realistically, it may be quite rare for anyone to create a model that matches Gemma’s performance exactly. From a legal perspective, however, the possibility cannot be dismissed out of hand. For this reason, if you distribute a dataset heavily composed of Gemma outputs, it is prudent to note that it contains Gemma outputs and to indicate that any effort to replicate Gemma’s performance could fall under its Terms of Use. While not necessarily a legal requirement, such disclosure helps avoid future misunderstandings.
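As a concrete illustration, a dataset card for such a corpus might carry a notice along these lines (the wording is a suggestion of mine, not legal advice):

```markdown
## Provenance notice (sample wording)

This dataset consists of translation outputs generated with Google's
Gemma model and is released under CC-BY-4.0. Note that using this data
to train or distill a model that reproduces Gemma's capabilities may
make that model a "Model Derivative" subject to the Gemma Terms of Use.
```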
Importance of Dataset Transparency
I first noticed this potential licensing complexity when reviewing the Japanese CLIP model licensing from LLM.jp. As large-scale AI datasets become more widespread and intermixed, it can be difficult to trace their origins or confirm whether a portion of the data is derived from a particular model’s outputs. Ensuring transparency and clarity about data sources is increasingly important to prevent inadvertent violations of Terms of Use or other licensing constraints.
