Copyright issues in AI model training and generated outputs are widely discussed, but less attention is paid to how copyright licenses, the actual mechanism for granting permission, shape these workflows. Unlike traditional software, which is often described as a one-to-one relationship between source code and executable code, AI involves three interlocking layers: input data, the model, and the output. Across those layers, the licenses on training data and on models can matter legally, and can introduce additional complexity.
This article examines Creative Commons (CC), a family of general-purpose copyright licenses widely used for content such as text and images, and explains how each CC condition may affect each stage of AI use. In doing so, it aims to provide insights for considering how Open Source licenses and other licensing agreements affect the same stages.
In accordance with the guidance provided by Creative Commons, this analysis distinguishes between “conservative compliance” and “legally required cases” wherever possible. This distinction is necessary because, under current circumstances, significant gray areas remain. Furthermore, as privacy, ethics, and data protection fall outside the scope of CC licenses, these topics will be addressed on another occasion.
- Prerequisites for the Application of CC Conditions
- Do CC Conditions Apply from Data to Output? (BY, NC, SA, ND)
- Do CC Conditions Apply to the Model Itself from Data to Model?
- Do CC Conditions Apply from Model to Output?
- Conclusion
- References
Note: This article is an English translation of the original Japanese text, with some parts translated using an LLM. I believe it will also be useful for English speakers.
Prerequisites for the Application of CC Conditions
CC licenses are a means of granting permission based on copyright. Therefore, if the use does not require permission under copyright law, the various CC conditions, such as Attribution (BY), NonCommercial (NC), ShareAlike (SA), and NoDerivatives (ND), are not triggered. Based on this premise, it is first necessary to confirm when exceptions and limitations to copyright apply, rendering permission unnecessary, and which types of copyright-protected uses trigger CC conditions in AI training and other applications.
Step 1: Is Copyright Permission Required?
Since CC is a copyright license, CC conditions only become relevant for uses that require copyright permission. Conversely, if a use falls under an exception or limitation in national laws and does not require the right holder’s permission, CC conditions have no legal force. Regarding AI training, exceptions and limitations are being developed in various jurisdictions. Briefly summarized, the interpretations in major legal jurisdictions are as follows.
United States: The fundamental approach is that if a use constitutes “Fair Use” under copyright law (17 U.S.C. § 107), the use is deemed non-infringing and no permission is required. Whether the reproduction of massive amounts of data for AI training purposes constitutes fair use is determined on a case-by-case basis. Still, the precedent in Authors Guild v. Google, which held that reproduction for data analysis purposes such as scanning books for a search engine constitutes fair use, is often cited as the foundational perspective. At the same time, multiple civil lawsuits concerning the use of data for generative AI training are currently underway. Even as settlements and early rulings begin to emerge, the scope of fair use is still being shaped by the courts. Since conflicting views persist on certain points of contention, the boundaries of fair use require continued monitoring. In particular, it should be noted that the evaluation of the “effect of the use upon the potential market for or value of the copyrighted work,” one of the four factors of fair use, may vary depending on the type of litigation and the evidence presented.
European Union (EU): The 2019 EU Directive on Copyright in the Digital Single Market (DSM Directive) introduced two exceptions for reproductions made for Text and Data Mining (TDM). Both require “lawful access” to the content. The research exception (Article 3) covers TDM conducted by research organizations and cultural heritage institutions for “scientific research,” and right holders cannot opt out of it. The general-purpose exception (Article 4) is open to a broader range of users, but it does not apply to content for which the right holder has “expressly reserved” rights for TDM purposes in accordance with Article 4(3). Regarding the retention of copies or extractions made for TDM, Article 3(2) requires storage with an appropriate level of security and allows retention for the purposes of scientific research, while Article 4(2) allows retention for as long as necessary for the purposes of TDM. Operational designs such as retention periods, access controls, and deletion policies for training data may therefore become points of contention. For content made available to the public online, rights reservations are expected to be expressed through machine-readable means, and in practice, robots.txt or meta tags are often discussed as candidates. In any event, specific requirements and operational details depend on how each member state implements the directive. It is important to clarify that publishing content under a CC license does not, in itself, constitute a reservation of rights against TDM.
Japan: Under Article 30-4 of the Copyright Act and other provisions, a comprehensive exception has been established that allows for the reproduction of works for information analysis purposes without the permission of the right holder. This applies broadly as long as the use is “not for the purpose of personally enjoying the thoughts or emotions expressed” (non-expressive use). That said, the exception does not apply if the use “unreasonably prejudices the interests of the copyright owner.” Generally, the collection and processing of data for AI model training do not aim for the enjoyment of the work by humans, and thus fall under this exception. A scenario that might be considered to “unreasonably prejudice the interests” under the proviso of Article 30-4 would be the reproduction or ingestion of a database work, which is provided for a fee specifically for information analysis, into an AI training process without paying the license fee. Whether a use conflicts with the market for the work or obstructs potential future sales channels should be judged by comprehensively considering various circumstances, including the manner of use.
While reproduction for AI training is permitted by copyright exceptions in Japan, it falls within the scope of the TDM exception in the EU unless rights are reserved, and it may be permitted by fair use or other doctrines in the US. The scope of where permission is unnecessary differs by country. Depending on the jurisdiction in which a company’s AI development takes place, it is first necessary to examine whether the act of training itself constitutes infringement or whether permission is required at all in light of these exception provisions.
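As a rough illustration of the machine-readable reservation point in the EU paragraph above, the sketch below checks whether a robots.txt file disallows a given crawler. It uses Python's standard `urllib.robotparser`. The crawler name `ExampleTDMBot`, the sample rules, and the URL are hypothetical, and treating a robots.txt disallow as a signal of an Article 4(3) reservation is an assumption made for illustration, not a settled legal reading.

```python
from urllib.robotparser import RobotFileParser


def crawler_is_disallowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules disallow user_agent from fetching url."""
    parser = RobotFileParser()
    # parse() takes the file content as a list of lines; no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: one named crawler is blocked, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleTDMBot
Disallow: /

User-agent: *
Allow: /
"""

# Hypothetical page that a data collection pipeline wants to fetch.
PAGE_URL = "https://example.org/articles/1"

blocked = crawler_is_disallowed(ROBOTS_TXT, "ExampleTDMBot", PAGE_URL)
others_blocked = crawler_is_disallowed(ROBOTS_TXT, "OtherBot", PAGE_URL)
```

A pipeline built this way would record the result alongside each collected item, so that the question of whether a reservation existed at collection time can be revisited later.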
Step 2: Which CC Elements Are Triggered by Which Acts?
If an act requires copyright permission, the next step is to consider which CC conditions (elements) may be triggered at which stage of the AI development process. Based on the guidance from Creative Commons, Attribution (BY) is primarily a concern when sharing a licensed work (or material containing it) with the public. ShareAlike (SA) and NoDerivatives (ND) are primarily issues when sharing “Adapted Material” with the public. By contrast, NonCommercial (NC) is not limited to acts of sharing; whether a use is for commercial purposes can be an issue for any act that requires permission under copyright law.
In other words, for a CC-licensed work used as input data, BY, SA, and ND may become issues when (1) the original work itself is redistributed, (2) an adaptation based on the original work is created and shared with the public, or (3) the output is a substantial reproduction of the original work and is provided to the public. NC, by contrast, is not restricted to sharing: whether the condition is met depends on whether each act that requires copyright permission, such as reproduction, adaptation, or distribution, is carried out for commercial or noncommercial purposes.
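The two steps above can be sketched as a small triage function. This is a simplified reading of the guidance described in this section, not a legal test: the `Use` fields and the function name `conditions_to_review` are hypothetical, and the answer to Step 1 (`needs_permission`) has to come from a jurisdiction-specific analysis that the code does not perform.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Use:
    needs_permission: bool  # Step 1: no exception or limitation covers the act
    shared_publicly: bool   # the work, or material containing it, is shared
    is_adaptation: bool     # what is shared is adapted material
    commercial: bool        # the act is primarily for commercial advantage


def conditions_to_review(license_elements: set[str], use: Use) -> set[str]:
    """Return the CC elements that may be triggered by this use (Step 2)."""
    if not use.needs_permission:
        # CC conditions are not triggered when no copyright permission is needed.
        return set()
    flagged: set[str] = set()
    if "BY" in license_elements and use.shared_publicly:
        flagged.add("BY")
    if use.shared_publicly and use.is_adaptation:
        # SA and ND are primarily about sharing adapted material.
        flagged |= license_elements & {"SA", "ND"}
    if "NC" in license_elements and use.commercial:
        # NC is not limited to sharing; it covers any permission-requiring act.
        flagged.add("NC")
    return flagged


# Internal, commercial training on CC BY-NC-SA data where permission is needed.
internal_training = Use(needs_permission=True, shared_publicly=False,
                        is_adaptation=False, commercial=True)
result = conditions_to_review({"BY", "NC", "SA"}, internal_training)
```

In this example only NC is flagged, which reflects the point that NC can matter even when nothing is shared.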
Dataset Distribution: When Does Sharing Trigger CC Conditions?
Related to Step 2 above, there are also points of concern when constructing and distributing large-scale datasets for AI. When distributing a dataset as open data that contains many CC-licensed works, the treatment of each work within the dataset must comply with its CC license. In particular, when publishing a dataset containing works licensed under CC BY or CC BY-SA, it is necessary to provide appropriate credit for each work. For example, in a dataset containing images, it may be required to list the author’s name, license, and source URL for each image in the accompanying metadata or documentation. This is because the dataset provider, in sharing (redistributing) the copyrighted works, must comply with CC conditions such as the original attribution obligations.
At the same time, even when multiple works are aggregated into a dataset, the dataset is a mere collection of individual works and does not constitute an adaptation. CC licenses explicitly state that “including a work in a collection does not, by itself, constitute an adaptation.” Therefore, even if a work with a ShareAlike condition, such as CC BY-SA, is included in a dataset, the CC BY-SA license does not automatically apply to the entire dataset: the SA condition is triggered only when the original work is adapted and shared, not when a dataset (collective work) simply aggregates the original works without alteration. Preprocessing is a separate matter. If you crop, color-correct, denoise, caption, or translate individual works before distribution, that preprocessing may constitute an alteration (adaptation) of those works, and the ND or SA conditions may then come into play. In either case, since the original CC licenses (and copyrights) still apply to the individual works within the dataset, it is desirable to state this clearly to users, for example, by noting that “this dataset includes X items of content under the CC BY 4.0 license.”
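The per-item credit and the dataset-level notice described in this section could be generated from metadata along the following lines. This is a minimal sketch: the `ItemCredit` fields, the sample titles, authors, and URLs are all hypothetical, and the wording of the credit line is one possible format rather than a format prescribed by the licenses.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ItemCredit:
    title: str
    author: str
    license_name: str
    source_url: str


def attribution_line(item: ItemCredit) -> str:
    """Format title, author, license, and source for one dataset item."""
    return f'"{item.title}" by {item.author}, {item.license_name}, {item.source_url}'


def license_summary(items: list[ItemCredit]) -> list[str]:
    """Produce one notice per license, counting the items it covers."""
    counts = Counter(item.license_name for item in items)
    return [
        f"This dataset includes {count} item(s) under the {name} license."
        for name, count in sorted(counts.items())
    ]


# Hypothetical dataset entries.
ITEMS = [
    ItemCredit("Harbor at Dawn", "A. Example", "CC BY 4.0", "https://example.org/img/1"),
    ItemCredit("Street Market", "B. Example", "CC BY-SA 4.0", "https://example.org/img/2"),
    ItemCredit("Old Bridge", "A. Example", "CC BY 4.0", "https://example.org/img/3"),
]

credits = [attribution_line(item) for item in ITEMS]
summary = license_summary(ITEMS)
```

Keeping the credit fields as structured metadata, rather than free text, also makes it easier to filter items by license later in the pipeline.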
Do CC Conditions Apply from Data to Output? (BY, NC, SA, ND)
We now examine whether CC conditions apply to outputs generated by a model when CC-licensed text or images are used as training data or prompt inputs. This is a crucial point of direct impact for AI developers and users. The scenarios for each CC condition are as follows.
Attribution (BY): When CC BY licensed data is used for AI training, the key point is whether the expression of the original data appears in the model’s output. For example, in cases where an AI refers to a specific CC-licensed text via Retrieval-Augmented Generation (RAG) and generates an answer derived from it, it is desirable to provide links to the source and credit where possible. Furthermore, if an AI model or system memorizes the training data (not necessarily in a technical sense) and the output appears substantially identical or extremely similar to the original data, there is a possibility that this will be regarded as sharing the original work. Therefore, from a conservative standpoint, when providing or publishing such outputs to third parties, the title, author’s name, license, and URI of the original CC work should be displayed to satisfy the attribution conditions of the CC license. Legally, this may be necessary because sharing such an output could constitute the act of sharing the original copyrighted work. In practice, most model outputs are novel expressions that are dissimilar to the training data, in which case the BY condition of the original data does not immediately become an issue.
NonCommercial (NC): It is important to note that “NonCommercial” is not determined mechanically based on whether the user is a corporation, but is evaluated based on whether the use is “primarily intended for or directed toward commercial advantage or monetary compensation.” Based on this, using CC BY-NC licensed data for training involves the act of “reproduction” and may not be permitted for commercial purposes. Supporting this, the CC guidance states that if NC-restricted works are used, every stage, from copying the training data to providing the trained model, must be limited to noncommercial purposes. For example, using model outputs trained on NC works in a commercial service, such as a paid offering, could constitute a violation of NC conditions if the reproduction during training or the use of the model is judged to be for commercial purposes. That said, since reproduction during training may be permitted under fair use or rights limitations, legal judgments will not be uniform and may vary by jurisdiction. Conservatively, the safe course of action is to avoid using NC works for AI training in commercial projects or, if used, to ensure that the entire process, including training and generation, is not monetized.
ShareAlike (SA): When CC BY-SA licensed data is used for training, the issue is whether the model or the output constitutes an adaptation of the original work. Legally, cases where model weights or generated content are evaluated as “derivative works based on the original work” are considered to be quite limited. That said, the CC guidance suggests a conservative response: “if the training data includes SA works, the model and its outputs should be provided under the same CC license when they are made public.” In other words, even if the model output does not directly reuse the original work, the stance is that since the original data was provided under an SA condition, the developer of the generative AI should ideally share the output or the model itself under an open license in accordance with that spirit. Here, it is necessary to distinguish between “legally mandatory” and “compliance as an act of good faith.”
NoDerivatives (ND): The ND condition in CC generally prohibits “sharing adaptations with the public.” Therefore, ND becomes an immediate issue when an adaptation based on the original work is created and shared externally. Whether the process of AI training or feature extraction constitutes an “adaptation” under copyright law, and whether trained models or general generated outputs constitute adaptations of ND works, depends on the jurisdiction and the facts, and cannot be determined uniformly. Still, Creative Commons suggests that, as a conservative approach, the use of content with an ND condition for training data should be avoided. In practice, it is safer to ensure that training data or synthetic data containing ND materials is not provided externally (limiting it to internal use) or, if external provision is intended, to design the data pipeline to exclude ND materials.
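The conservative handling of NC and ND materials described above can be sketched as a filtering step in a data pipeline. The record layout, the function names, and the two project flags are hypothetical, and the license strings are assumed to be SPDX-style identifiers such as `CC-BY-NC-4.0`. The rules encode the conservative posture only, not what is legally required.

```python
CC_ELEMENTS = {"BY", "NC", "SA", "ND"}


def cc_elements(license_id: str) -> set[str]:
    """Extract CC elements from an SPDX-style identifier such as 'CC-BY-NC-4.0'."""
    return set(license_id.upper().split("-")) & CC_ELEMENTS


def split_for_training(
    records: list[dict[str, str]],
    commercial_project: bool,
    external_release: bool,
) -> tuple[list[dict[str, str]], list[tuple[str, str]]]:
    """Return (kept records, excluded (id, reason) pairs) under a conservative policy."""
    kept: list[dict[str, str]] = []
    excluded: list[tuple[str, str]] = []
    for record in records:
        elements = cc_elements(record["license"])
        if "ND" in elements and external_release:
            excluded.append((record["id"], "ND material, external release planned"))
        elif "NC" in elements and commercial_project:
            excluded.append((record["id"], "NC material, commercial project"))
        else:
            kept.append(record)
    return kept, excluded


# Hypothetical records for a commercial project that plans an external release.
RECORDS = [
    {"id": "a", "license": "CC-BY-4.0"},
    {"id": "b", "license": "CC-BY-NC-4.0"},
    {"id": "c", "license": "CC-BY-ND-4.0"},
    {"id": "d", "license": "CC-BY-SA-4.0"},
]

kept, excluded = split_for_training(RECORDS, commercial_project=True, external_release=True)
```

Logging the exclusion reasons, as the second return value does here, leaves a record of which policy removed which item.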
Do CC Conditions Apply to the Model Itself from Data to Model?
Next, we examine whether the CC conditions of the original data apply to the trained model itself when it is obtained by training on CC-licensed data. A model is a set of parameters produced by large-scale computation and, at first glance, bears no direct similarity to the works used for training. Still, if expressions derived from the training data are memorized within the model in some form, distributing or publishing the model itself could theoretically be evaluated as sharing a copy of the original work. The analysis for each CC condition is as follows.
Attribution (BY): If the original data for AI training was provided under a CC BY license, the issue is whether an attribution obligation arises at the stage of publishing or providing the trained model. For example, in cases where it is possible to evaluate that the model contains the original work, such as when an image generation model stores a specific CC work as-is within its binary, providing the model constitutes sharing the original work, and credit attribution may be required. In most cases, ordinary machine learning models internalize training data as statistical features and do not retain direct expressions, so there are few instances where the BY condition is legally required when providing the model itself. Nonetheless, as a conservative approach recommended by the CC community, it is considered desirable to clearly state the dataset name and a link to the source, for example, “This model was trained using CC-licensed materials such as the LAION dataset.” Indicating the origin of the training data in model cards or documentation is also beneficial for providing information to users.
NonCommercial (NC): If the training source includes data with an NC condition, it is necessary to consider whether the distribution of the trained model and its use by the recipient should be limited to noncommercial purposes. The CC guidance suggests that “being noncommercial at all stages, from data reproduction to model sharing” may be required for NC compliance. Therefore, not only must no compensation be obtained from the provision of the model itself, but since the provision of a service using the model also constitutes “use of the model (work)” in a broad sense, using a public model for commercial purposes could conflict with NC conditions. If one wishes to use a model commercially, it is desirable to either avoid using NC materials from the start or obtain separate commercial use permission from the right holder.
ShareAlike (SA): If the training data includes works with an SA condition, the key point is whether the trained model itself can be evaluated as an adaptation of those works. Legally, difficult questions remain, such as whether the model parameters themselves possess copyrightability and to what extent a relationship of reliance on the original data is recognized. While it is difficult to be definitive in the absence of sufficient judicial precedents, the CC view is that “if a model is based on ShareAlike content and is shared with the public, the model should be published under the same CC license as a conservative measure.” For example, if a model is trained on a corpus under CC BY-SA 4.0, it is recommended that the model (and its output) be provided under the CC BY-SA 4.0 license when distributed or published, thereby inheriting the same conditions. It should be noted, however, that this is strictly “guidance for developers who wish to comply with the license,” and the intent is to mitigate risk by following it in good faith even if it is not legally established whether the model is an adaptation.
NoDerivatives (ND): If the training data includes works with an ND condition, the issue of whether the model constitutes an adaptation of the original work may arise. ND becomes an immediate issue only when an adaptation based on the original work is created and shared with the public. Whether a model obtained through AI training or general generated outputs constitute adaptations of ND works depends on the jurisdiction and the facts. Nevertheless, Creative Commons indicates that, as a conservative approach, the use of ND materials for training should be avoided. In practice, ND works are often excluded from training datasets. If it is absolutely necessary to use ND materials, a cautious approach is required, such as limiting use to internal purposes and not publishing the model or the output externally, to avoid friction with right holders.
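The conservative measures discussed for each condition can be collected into a short checklist generated from the licenses found in the training data. This is a sketch under stated assumptions: the function name, the SPDX-style license identifiers, and the wording of each note are hypothetical paraphrases of the points above, and the notes describe the conservative posture rather than established legal obligations.

```python
ELEMENT_ORDER = ("BY", "NC", "SA", "ND")

# Paraphrased conservative measures for releasing a model, one per CC element.
NOTES = {
    "BY": "Name the dataset and link to its source in the model card.",
    "NC": "Keep model distribution and use noncommercial, or obtain commercial permission.",
    "SA": "Consider publishing the model under the same CC license.",
    "ND": "Avoid publishing the model or outputs, or exclude the material from training.",
}


def release_notes(training_license_ids: list[str]) -> dict[str, str]:
    """Map each CC element present in the training data to its conservative note."""
    present: set[str] = set()
    for license_id in training_license_ids:
        present |= set(license_id.upper().split("-")) & set(ELEMENT_ORDER)
    return {element: NOTES[element] for element in ELEMENT_ORDER if element in present}


# Hypothetical training corpus made of CC BY and CC BY-SA material.
notes = release_notes(["CC-BY-4.0", "CC-BY-SA-4.0"])
for element, note in notes.items():
    print(f"{element}: {note}")
```

The output is meant to be pasted into a model card or release checklist and reviewed by a person, not applied automatically.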
Do CC Conditions Apply from Model to Output?
Finally, we consider whether CC conditions apply to the outputs generated from an AI model when the model itself is provided under a CC license. In recent years, some publicly released AI models have been published under CC BY or CC BY-NC, and model users need to understand these license conditions correctly.
First, as a premise, a CC license is a set of permission conditions for the copyrighted work (in this case, the model itself) and does not automatically apply to the outputs the model generates. If the model’s output can be evaluated as an entirely new creation and does not include the expressions of third parties protected by copyright, CC conditions from the original model, such as BY attribution or SA inheritance, do not automatically apply to that output. For example, an image output from an image generation model provided under a CC BY-SA license does not automatically become CC BY-SA unless it directly contains the code or weights (the copyrighted work) of the original model. Likewise, no obligation arises to credit the model author merely because an output was generated with the model. This is because the output is neither a copy nor an adaptation of the copyrighted work that is the model.
That said, this does not mean that CC conditions are entirely irrelevant to the output. If the output contains expressions derived from the training data of the model, the CC conditions of that original data become an issue. As stated in the CC guidance, the conditions of a CC license have effect when sharing the work itself or its derivative works publicly. Using the model itself is merely the use of a tool, and at that stage, the CC conditions attached to the model (such as the BY attribution obligation) are not directly required of the end-user. But if the model output ends up reproducing the content of a specific CC work, the act of sharing that output will be regarded as sharing the original work, and the CC conditions of that work will apply. In short, it is insufficient to judge the treatment of an output by looking only at the model’s license; it is necessary to handle it based on how much the generated output relies on the copyrighted works of third parties.
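One way to act on this in practice is to screen outputs against known CC-licensed works before sharing them. The sketch below uses Python's standard `difflib.SequenceMatcher` to flag near-identical text. The function names, the sample texts, and the 0.9 threshold are hypothetical, and a character-level similarity ratio is only a rough screening signal: it is not a measure of substantial similarity in the legal sense.

```python
from difflib import SequenceMatcher


def closest_known_work(output: str, known_works: dict[str, str]) -> tuple[str, float]:
    """Return the id and similarity ratio of the known work most similar to output."""
    best_id, best_ratio = "", 0.0
    for work_id, text in known_works.items():
        ratio = SequenceMatcher(None, output, text).ratio()
        if ratio > best_ratio:
            best_id, best_ratio = work_id, ratio
    return best_id, best_ratio


def needs_attribution_review(
    output: str, known_works: dict[str, str], threshold: float = 0.9
) -> bool:
    """Flag outputs that closely match a known work so a person can review them."""
    return closest_known_work(output, known_works)[1] >= threshold


# Hypothetical index of CC-licensed texts keyed by an internal id.
KNOWN_WORKS = {
    "work-001": "The quick brown fox jumps over the lazy dog.",
    "work-002": "A journey of a thousand miles begins with a single step.",
}

verbatim = "The quick brown fox jumps over the lazy dog."
unrelated = "zzzz 0000"
match_id, match_ratio = closest_known_work(verbatim, KNOWN_WORKS)
```

When an output is flagged, the id of the matched work gives a starting point for looking up the title, author, license, and URI needed for a credit line.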
Furthermore, attention must be paid to the CC license of the model itself. If a model is granted a CC BY license, the attribution condition under BY becomes an issue when distributing or publishing the model itself, such as the weights. Naturally, credit to the model author is required when providing the model, including its integration into a service. If one distributes only the output of the model, including via an API, it is difficult to call this sharing the model itself, and obligations such as attribution do not usually arise immediately.
If the model is under CC BY-NC (NonCommercial), using or reproducing the model for commercial purposes is not permitted. Therefore, providing a commercial API service using that model would likely constitute a violation of the NC condition.
If a model is provided under CC BY-ND (NoDerivatives), and as long as copyrightability is recognized in that model, sharing an altered version of the model with the public (including different weights produced through fine-tuning or merging) could be interpreted as not permitted under the ND condition. Caution is therefore warranted when altering an ND model to improve accuracy or repurpose it and then sharing that altered version. By contrast, using the model without alteration to obtain output, or providing only the output via an API, does not immediately constitute an ND violation. As emphasized in the CC guidance, CC conditions are permissions for uses involving copyright and are not intended to restrict uses for which permission is not required due to exceptions or limitations in national laws. The question to evaluate is whether the form of provision substantially constitutes “sharing an altered version of the model.”
In summary, the CC license of a model sets the conditions for the use, alteration, and redistribution of the model itself, while the application of a license to generated outputs depends, in principle, on the content the output contains. Keeping in mind the core principle that “CC conditions have meaning only where copyright is concerned,” both model providers and users must judge at each stage whether the original work itself is included or whether a new creation or something else is being shared. In practice, it is important to pay attention not only to the license indicated in the model card but also to whether the output relies on existing content.
Conclusion
We have examined how CC licenses applied to training datasets and models affect the processes of training and output in AI models. Subtle jurisdictional differences arise under the copyright laws of various countries regarding the prerequisite question of whether copyright permission is required at all. Furthermore, the NonCommercial (NC) condition is triggered at a different point from the other CC conditions because it restricts the purpose of use rather than the act of sharing. Most importantly, there is a significant gap between “conservative compliance,” which considers slight possibilities and ethical or conventional perspectives, and “legally required cases.” CC licenses include conditions such as Attribution (BY), ShareAlike (SA), and NonCommercial (NC) that are frequently encountered in machine learning datasets and model releases, and that often differ from typical Open Source software licenses. This analysis will hopefully also serve as a reference when using training data or models under licenses other than Creative Commons.
References
Creative Commons, “Using CC-Licensed Works for AI Training” (web): https://creativecommons.org/using-cc-licensed-works-for-ai-training-2/
Creative Commons, “Using CC-Licensed Works for AI Training” (PDF): https://creativecommons.org/wp-content/uploads/2025/05/Using-CC-licensed-Works-for-AI-Training.pdf
Agency for Cultural Affairs (Japan), “Thought on AI and Copyright” (March 15, 2024): https://www.bunka.go.jp/seisaku/bunkashingikai/chosakuken/pdf/94037901_01.pdf
European Parliament and Council, Directive (EU) 2019/790 on copyright and related rights in the Digital Single Market: http://data.europa.eu/eli/dir/2019/790/oj
17 U.S.C. § 107 (fair use): https://uscode.house.gov/view.xhtml?req=granuleid:USC-prelim-title17-section107&num=0&edition=prelim
US Copyright Office, “Copyright and Artificial Intelligence, Part 3: Generative AI Training” (pre-publication version): https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf
