DeepSeek has had a major global impact. This appears to stem not only from the emergence of a new force in China that threatens the dominance of major U.S. AI vendors, but also from the fact that the AI model itself is being distributed under the MIT License, which is an Open Source license. Nevertheless, while DeepSeek has indeed released its model under the MIT License, it has not publicly disclosed any information at this time regarding the data-processing code or, crucially, the training data. As a result, there has been a flurry of opinions asserting that “DeepSeek cannot be called Open Source,” which in turn seems to have heightened interest—at least to some degree—in the question of whether complete disclosure of training data is mandatory for an AI project to qualify as Open Source.
In October of last year, the Open Source Initiative (OSI) published its Open Source AI Definition (OSAID), which states that while the availability of complete training data is recommended, it is not an absolute requirement to be classified as Open Source AI. Instead, the OSAID largely focuses on the availability of detailed information about the training data. Presently, this OSI stance is widely regarded as a reasonable approach. That said, there remain two contrasting camps with strong convictions: one that insists on full disclosure of all training data, and one that argues no data disclosure is required whatsoever. This document explains the background to this confrontation and then lays out the reasoning that leads to the OSI's position.
Please note that the interpretation below is my personal understanding—based on the discussions to date—and does not necessarily reflect the exact process through which these debates actually evolved.
Original Japanese Version: https://shujisado.com/2025/02/18/need_for_training_data_in_opensource_ai/

- 1. The Divide Over the Necessity of Training Data
- 2. Examining Section 2 (“Source Code”) of the OSD in the Context of AI
- 3. Examining the Necessity of Training Data from Philosophical, Legal, and Technical Perspectives
- 4. Future Outlook on Data Requirements
- 5. Final Remarks
1. The Divide Over the Necessity of Training Data
When considering what conditions must be met for an AI system to be called Open Source, nearly everyone would first point to the requirement that the model itself be distributed under an Open Source license. However, in the case of AI, that condition alone is usually insufficient to fully elucidate the AI’s behavior and to guarantee freedom to Study and Modify the system. Hence, some have argued that other elements must also be released under Open Source terms—specifically, all code used from training the AI model through to its runtime execution. Most individuals who have at least a basic familiarity with Open Source licensing would, up to this point, likely be in agreement.
Where the debate intensifies is whether the actual data used in training must also be fully disclosed. Here we encounter two major groups: those who demand full publication of the training data, and those who see no need for such disclosure. Moreover, there is a particularly wide gulf between the position that all data must come from the public domain or other Open Data sources and the view that no information about the dataset is necessary at all. Because strong supporters exist on both extremes—and are not few enough to ignore—these arguments have flared up in various places, often generating considerable friction.
When we summarize the viewpoints on training data, they can generally be divided into the following four positions:
- A. Those who insist on the necessity of complete training data
  - Aa. Argument for using only Open Data: Requires using only Open Data or public domain datasets for all training. The rationale is to ensure complete transparency for every aspect of model creation and to guarantee that the model can be rebuilt from scratch.
  - Ab. Argument requiring publicly accessible data: Insists that the entire set of training data must be made publicly downloadable by anyone, in order to replicate the model exactly.
- B. Those who do not require complete training data
  - Ba. Argument that no data is needed at all: Since additional training or fine-tuning can be used to Modify or create derivative versions, sharing the original training dataset in its entirety is not deemed essential.
  - Bb. Argument that only detailed information on data provenance is needed: Maintains that if others can replicate or otherwise acquire similar data, the only requirement is knowing how the data was obtained and processed; there is no need to fully release the original dataset.
If we plot these four positions along a single spectrum of “completeness of data,” we might line them up as Aa – Ab – Bb – Ba, from highest to lowest. Ultimately, the question of whether or not a project is deemed “Open Source” can be answered only in binary (yes or no), so a clear boundary must be drawn somewhere on this spectrum.

However, each of these four stances can be readily challenged, which complicates matters. If one demands truly “Open Data,” the pool of usable datasets shrinks dramatically, raising the possibility that certain large-scale AI endeavors—what we now consider major models—could become impossible to realize under Open Source conditions. Furthermore, in numerous specialized fields like healthcare or education, data may be restricted by law or ethics, preventing verbatim sharing of all raw data. Focusing solely on openly licensed datasets could then exclude many of the critical areas where AI is needed.
Conversely, if no training data is disclosed at all, strict reproducibility and in-depth auditing of the model become nearly impossible: biases might go undetected, and it becomes more difficult to replicate results for Study or improvement. The argument that “accurate reproduction and comprehensive debugging demand actual access to the dataset” is not easily dismissed.
It is clear, then, that we must draw a line somewhere among Aa – Ab – Bb – Ba that determines whether an AI system can be recognized as Open Source. Yet precisely where that line should go requires a return to basic principles and more thorough consideration.
2. Examining Section 2 (“Source Code”) of the OSD in the Context of AI
To scrutinize how far we must demand completeness of training data, we must begin with the foundational principles set forth in the Open Source Definition (OSD). The current version of the OSD is 1.9, released in 2004, and it has remained unchanged for over 20 years. It is fair to say it has become firmly established as the criterion for designating software as Open Source. When applying the OSD to AI systems, the pivotal issue lies in Section 2, “Source Code.”
OSD Section 2. Source Code:
The program must include source code, and must allow distribution in source code as well as compiled form. Where some form of a product is not distributed with source code, there must be a well-publicized means of obtaining the source code for no more than a reasonable reproduction cost, preferably downloading via the Internet without charge. The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed. Intermediate forms such as the output of a preprocessor or translator are not allowed.
Section 2 of the OSD stipulates that, to qualify as Open Source, the source code must be freely obtainable. In typical software, even if permission is granted to Modify the compiled binary, it is not technically feasible to do so freely unless developers have access to a conventional source code file. One cannot reasonably claim the freedom to Modify or create derivatives is guaranteed if there is no way to obtain the kind of source code that programmers normally handle. Essentially, an AI model is a binary file that stores numerical data called weights or parameters; taken literally, if the AI model alone is provided, there is no “source code” present.
However, we must recall a precedent for assessing Open Source compliance in a certain field: Open Source fonts. Consider fonts under the SIL OFL (SIL Open Font License), which are commonly recognized as Open Source. In practice, how are such fonts typically distributed? Most frequently, they are provided in OpenType or TrueType files, both of which are binary formats that embed glyph outlines and related metadata. While in some cases there may also be a “master data” file in text format—analogous to software source code—these fonts are widely recognized as Open Source even when only distributed in binary form.
The reason fonts are regarded as Open Source even when only distributed as a binary file is that, for fonts, we do not interpret “source code” literally. Instead, we treat the “preferred form of making modifications” as the de facto “source code.” Formats such as OpenType are fully specified, and tools like FontForge enable analysis and editing of the font. In other words, one can carry out fundamental Modifications or create derivatives, and in fact it is quite common to edit the binary font file itself directly. Hence, even a binary font file can fulfill the requirement of being the “preferred form of making modifications,” thereby meeting the OSD’s source code criteria. Although a binary font file alone may omit certain insights (e.g., the designer’s intent, which might reside solely in master data), in practice, it is still generally deemed sufficient to meet the requirements of “freedom to Modify and Study” for Open Source.
From the Open Source perspective on fonts, we may derive two important lessons:
- Under the OSD, “source code” does not strictly mean literal source code per se; rather, it denotes whatever is the “preferred form of making modifications” that allows developers to Study and Modify the Open Source artifact.
- If the “freedom to Modify and Study” is adequately assured for the development community, then even a partially incomplete state—lacking some information—may still be acceptable as Open Source.
From the first lesson, one might infer that it is not always necessary to furnish every piece of raw material used to create the licensed artifact; a processed or compiled format may suffice. At first glance, this might appear to strengthen the argument that the mere existence of a binary AI model would satisfy the OSD’s source code requirement. However, in the case of fonts, all glyph outlines are contained within the binary file, and editing tools exist that treat the file itself as, in practice, the “source.” By contrast, for an AI model, fully re-training it or drastically Modifying how it operates may require the training data and the data-preprocessing logic. Thus, an AI model by itself typically does not constitute the “preferred form of making modifications,” so the stance described earlier as “no data needed whatsoever” (Ba) is inadequate.
According to the second lesson, if the developer’s freedom to Modify and Study is sufficiently protected, then not all supplementary information or objects must be disclosed. For instance, in the font scenario, design intent or other details that might only be found in master data are not required to label the font as Open Source. Similarly, in other domains, software that relies on proprietary compilers or commercial cloud services can still be considered Open Source if its own code guarantees the freedom to Modify and Study. From this line of reasoning, the position labeled “Aa” above (requiring all training data to be Open Data) can be viewed as an overly broad demand. Indeed, while having complete access to every piece of data and all objects involved may be helpful for absolute reproducibility and thorough Study, it is not strictly necessary to have every single component in order to ensure reproducibility at a functional level.

Through this examination of OSD Section 2 on “Source Code,” it becomes apparent that, for AI to be recognized as Open Source, the relevant requirement for training data should be that it is in the “preferred form of making modifications.” We also see that the threshold for ensuring sufficient “freedom to Modify and Study” within the development community may be located somewhere not at the two extremes of the spectrum laid out in the previous section, but rather between Ab and Bb. Nevertheless, the question of whether complete training data is required or not remains a stark dividing line. Determining which side of that line qualifies as Open Source AI calls for further detailed analysis.
3. Examining the Necessity of Training Data from Philosophical, Legal, and Technical Perspectives
To determine where along the Ab–Bb segment we should draw the boundary that defines the “preferred form of making modifications” (and thus ensures “freedom to Study and Modify”), we must address the crucial question: Does one actually need to provide the training data? Open Source is fundamentally about freedom—rooted in both a philosophical framework and the legal concept of copyright—and is shaped by the realities of software engineering. Below, we examine each of these three dimensions in detail.
3.1. Philosophical Considerations
From a purely ideological or philosophical point of view, numerous angles come into play. One way to organize the debate over the necessity of training data is as follows:
The “Source Code–like” Nature of Training Data
As noted in the previous section, the OSD’s Section 2 underscores that “source code” must be the preferred form of making modifications, permitting everyone to Study, fork, and improve the software. By extension, if the performance or functionality of an AI model is determined decisively by its training data, one could argue—on philosophical grounds—that such data is akin to source code. If the data itself is fully Open, one can replicate the model precisely, scrutinize any bias, and conduct comprehensive Study leading to verifiable outcomes.
On the other hand, if the model’s functionality and performance are not decisively fixed by the training data—i.e., if it is feasible to carry out additional training or fine-tuning using comparable data—then the community can still effectively fork the system. Furthermore, for tasks such as bias analysis, as long as one can access detailed instructions on how and where the original data was collected, labeled, or filtered, it may be possible to reconstruct an equivalent dataset if necessary, thus preserving the freedom to create derivative Open Source works.
Transparency and Ethical Accountability
If all training data can be made Open, that would ensure total transparency. It would also enable the community to expose any unethical data collection practices, thus allowing community oversight. For instance, suppose copyrighted or privacy-sensitive materials were included in the training set without proper authorization; complete data disclosure would help uncover such infractions and prompt corrective measures or legal consequences.
However, while there are clear ethical advantages to requiring all data, one must ask whether full data disclosure is essential for AI to qualify as Open Source. After all, even malicious or questionable goals can be pursued under an Open Source license. Moreover, demanding fully transparent data for AI in fields such as medicine, finance, or education would effectively make it impossible to handle sensitive personal information. In contrast, not requiring complete training data avoids mandating the disclosure of such sensitive data and thereby enables these domains to remain within the scope of “Open Source AI” while still respecting privacy regulations.
Pursuit of Free Software Philosophy
When viewed as a licensing standard, “Open Source” is largely synonymous with Free Software. Hence, one can argue that the philosophy of Free Software is likewise shared by the Open Source community. From this philosophical standpoint, every component that influences the functionality or performance of an AI model should be “free.” Since training data unquestionably shapes a model’s capabilities, requiring that data would appear to align with the fundamental Free Software idea that “users should control the software in all respects.”
In this purely philosophical arena, those who argue against data completeness have few strong counterpoints. From the perspective of “maximal freedom,” it indeed seems correct to make the training data as Open as possible.
Taken together, analysis of training data’s “source code–like” role, its relationship to transparency and ethics, and Free Software–based ideals yields mixed conclusions. Nonetheless, if one adheres to the core philosophy of freedom underlying Open Source, it is easy to see why certain advocates insist that all training data must be publicly shared.
3.2. Legal and Normative Considerations
Next, we consider the interplay between various national legal frameworks, the norms within the Open Source world, and how they bear on both sides of the debate.
Consistency with the Open Source Definition (OSD)
As noted above, Section 2 of the OSD references the “preferred form of making modifications.” If the model’s ultimate behavior is indeed shaped by its training data, one could argue that said data meets the legal concept of “source code” in an AI context. In such a scenario, the AI model effectively becomes a derivative or secondary work of that data-as-source-code, meaning that, unless the dataset is itself licensed under an OSI-approved license, the AI model cannot be deemed Open Source. One might also contend that being unable to acquire the complete dataset inhibits the creation of derivative models.
In practice, however, machine learning models are essentially the accumulated output of stochastic computations, in the form of numerical parameters. Under most interpretations of intellectual property law worldwide, such outputs do not typically attract copyright, nor do license terms attached to the training data (as with a software license) necessarily propagate into the model. Unless a copyright-protected dataset is somehow literally reproduced in the output, the model rarely qualifies as a derivative work. In Japan, for instance, the right to conduct “information analysis” (Article 30-4 of the Copyright Act) and, in Singapore, provisions permitting “computational data analysis” reflect that many jurisdictions broadly allow unlicensed use of copyrighted materials for AI training. Meanwhile, under U.S. law, training is often regarded as a transformative (i.e., Fair Use) activity. Although such legal provisions facilitate training itself, redistributing the training data is often far more tightly restricted, which raises the question of whether requiring data openness is even practical.
As for forking the model, if a substantial portion—or even most—of the original dataset is missing, one might still be able to gather comparable data. Indeed, the earlier discussion of Open Source fonts suggests that one can sometimes fork an artifact without having every “raw ingredient.”
Copyright and Privacy Law
In many cases, large training datasets contain a multitude of copyrighted materials, and in some jurisdictions, these curated collections may enjoy database rights. If the dataset indeed qualifies as “source code” in the AI context, then the dataset presumably requires an Open Source or analogous license, paralleling the normal demands we make for software source code. Fully disclosing the training data could, in theory, eliminate various legal uncertainties. For example, if part of the data remains undisclosed, then any developer building a derivative model might unwittingly violate usage agreements or laws.
Yet, as noted earlier, the training process in an AI model is primarily the accumulation of stochastic computations. It is widely accepted in many jurisdictions that the training data’s intellectual property rights do not remain resident in the model. If there is no residual right from the data, one might question whether it truly is “source code” in the sense of the OSD. Moreover, a large proportion of publicly available data is subject to some form of usage restriction, so redistributing it as is could violate third-party rights. Additionally, across many jurisdictions, sharing specific confidential or personally identifiable information is outlawed by privacy regulations (e.g., the EU’s GDPR, HIPAA in the United States). If complete dataset distribution were a criterion for Open Source AI, developers who complied might find themselves in direct violation of such laws—an obviously untenable outcome.
Where complete access to the training data is indeed feasible and can be licensed under terms resembling Open Source, that may improve legal certainty. And in certain scientific fields, replicating exact results is paramount, lending weight to demands for data completeness. Nevertheless, in light of the prevailing legal view that data’s intellectual property rights do not carry over to the resulting model, it can be difficult to justify requiring the entire dataset as a prerequisite for an AI model to call itself Open Source. Further, to avoid subjecting AI developers to legal risk in domains with restricted or sensitive data, a more realistic approach is to mandate that they disclose the detailed provenance of the training data and thereby maintain the possibility of creating derivative works, rather than forcing them to release “raw” data. Thus, from a legal standpoint, not treating complete dataset release as mandatory seems more consistent with real-world law and norms.
3.3. Technical Considerations
Although I am not a specialist in AI engineering, one can glean from general technical explanations that there are valid arguments on both sides:
AI Development and Operational Analysis
Machine learning algorithms learn patterns from data—indeed, data can be viewed as a means of “programming the weights.” One could say the data forms part of the model’s logic. In fact, when a model exhibits certain flaws, developers may reevaluate how the training data was gathered or labeled in order to debug or correct biases or errors. If so, having complete access to the training data can be paramount for reliability.
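To make the “data as programming” point concrete, here is a toy sketch (entirely illustrative, not from the article): the same training code produces a different learned model depending solely on the data it consumes, which is what motivates the “data as source code” analogy.

```python
# Toy illustration: identical training code, different data, different "logic".

def train_linear(points):
    """Fit y = w * x by ordinary least squares (zero intercept)."""
    num = sum(x * y for x, y in points)
    den = sum(x * x for x, _ in points)
    return num / den  # the learned weight

# Two datasets encoding different relationships
data_double = [(1, 2), (2, 4), (3, 6)]   # samples of y = 2x
data_triple = [(1, 3), (2, 6), (3, 9)]   # samples of y = 3x

w1 = train_linear(data_double)  # -> 2.0
w2 = train_linear(data_triple)  # -> 3.0
```

Debugging a bias in `w1` would naturally lead a developer back to `data_double`, just as the paragraph above describes.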
On the other hand, real-world development workflows frequently rely more on fine-tuning than on complete retraining. Many AI projects do not need the entire original dataset, because the objective is typically to adapt or extend a pretrained foundation model using domain-specific data. If the weights and details of the training pipeline are publicly available, that effectively constitutes a fork, enabling others to incorporate improvements. One benefit of Open Sourcing AI is the sharing of hyperparameters and the pipeline itself, allowing a third party to substitute a different dataset (whether Open or proprietary) and still leverage the architecture and methodology of the original model.
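The fine-tuning workflow described above can be sketched as follows. All names and numbers are hypothetical: a single scalar weight stands in for published foundation-model weights, and the point is only that a fork proceeds from the released weights plus new data, without the original training set.

```python
# Hedged sketch: forking a model by fine-tuning released weights on new
# domain data, rather than retraining from the original dataset.

def finetune(w, data, lr=0.01, epochs=200):
    """Continue gradient descent on y = w * x from a pretrained weight."""
    for _ in range(epochs):
        for x, y in data:
            grad = 2 * (w * x - y) * x  # d/dw of the squared error
            w -= lr * grad
    return w

w_pretrained = 2.0                  # stand-in for published foundation weights
domain_data = [(1, 2.5), (2, 5.0)]  # new domain-specific data (y = 2.5x)

w_forked = finetune(w_pretrained, domain_data)
# w_forked converges toward 2.5 without touching the original training set
```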
Necessity of Perfect Reproduction
In standard software development, having the source code allows one to reproduce the build exactly. By analogy, some argue that replicating an AI model precisely requires full disclosure of all training data. If the data differs in any meaningful way, the final model might deviate in critical aspects.
However, given nondeterministic processes, random seeds, and local environment differences—especially at large scale—perfect reproducibility is often said to be effectively unattainable. If so, one might settle for training on similar data to achieve equivalent performance. Many AI/ML developers and researchers do not equate the freedom to Study and Modify with exact bit-for-bit reproduction. Thus, “practical reproducibility” can be maintained by providing detailed documentation about the data and thorough training code—without necessarily distributing the original dataset verbatim.
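A minimal illustration of this reproducibility point, using an invented stochastic training loop: fixing the seed makes a run bit-for-bit repeatable, while a different seed yields a model that is statistically similar but not identical, which is precisely the gap between exact and “practical” reproducibility.

```python
import random

# Hedged sketch: seeds control exact repeatability of a stochastic run.

def noisy_train(seed, steps=1000):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x = rng.uniform(-1, 1)
        y = 2 * x + rng.gauss(0, 0.1)    # noisy samples of y = 2x
        w -= 0.05 * 2 * (w * x - y) * x  # SGD on the squared error
    return w

w_a = noisy_train(seed=42)
w_b = noisy_train(seed=42)  # identical: same seed, same sample stream
w_c = noisy_train(seed=7)   # near w_a in performance, not bit-for-bit equal
```

At the scale of real training runs, hardware nondeterminism adds further divergence even under a fixed seed, which is why equivalent-performance retraining, rather than exact replication, is the practical target.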
Even though, by analogy with traditional software development, it makes sense that data can define part of the model’s “logic” and that complete data might be needed to replicate it precisely, there are caveats. First, the model is governed by the algorithm set forth in the code, and in many cases data is merely the input consumed by that code. Second, even if the data and code are fully available, reproducing the model with absolute fidelity may remain difficult. And third, many developers may not actually require such perfect reproducibility. For an AI project to be considered Open Source, what truly matters is whether the “freedom to Study and Modify” is available, and in many cases that freedom is preserved if sufficiently detailed information about the dataset can be accessed.
3.4. Conclusions from Philosophical, Legal, and Technical Analysis
When examining the concept of a “preferred form of making modifications” in AI through philosophical, legal, and technical lenses, one might conclude that ensuring absolute freedom for all related components is, from our community’s perspective, the ideal in principle. However, recalling that “Open Source” is essentially “a legal condition that grants freedom of use under copyright law (or a similar legal framework) to all third parties,” legal interpretation tends to predominate.
- From a legal viewpoint: In many jurisdictions, the interpretation is that any intellectual property rights in the dataset do not remain in the trained model. Accordingly, insisting on complete data integrity (i.e., requiring that all training data be fully shared) begins to look excessive; and once we also consider the handling of privacy-sensitive data, it seems unlikely that a requirement to disclose all data would be readily compatible with real-world society.
- From a technical viewpoint: The code’s algorithm is what directly governs the model’s behavior. Even if complete data were shared, fully recreating the model is inherently difficult, and in many practical scenarios there is little need to do so.
Taking these factors together suggests that treating the full dataset as an absolute prerequisite for the “preferred form of making modifications” is not particularly realistic. Rather, it would be sufficient to provide information—enabling others to assemble similar data if they wish—to fill any gap created by not sharing the entire original dataset. In that sense, consistency arises with real-world law and norms. A purely philosophical approach to openness might indeed demand the release of all training data, but the OSI’s stance (requiring training code, parameters, and data information) offers a feasible balance that encourages broad adoption of Open Source AI.
Therefore, returning to the Aa–Ab–Bb–Ba continuum presented in Section 1, it appears that for data requirements in an AI system to be considered Open Source, the dividing line is drawn closer to Bb, which does not insist on completely disclosing the entire dataset. This aligns closely with the data information requirements set forth by OSAID v1.0, created by the OSI.

4. Future Outlook on Data Requirements
The debate over whether one must require full disclosure of training data in order for an AI system to qualify as Open Source AI reflects an ongoing tension between the philosophical demand for complete reproducibility and transparency and a more pragmatic approach that seeks to balance intellectual property rights, privacy constraints, and real-world feasibility.
Those who advocate for complete disclosure of training data argue that it is effectively the “source code” of AI. Without every original input, they contend, neither fully replicating the model nor conducting a thorough audit of fairness and transparency can be achieved. On the other hand, the approach taken by the OSI in its OSAID recognizes that, while training data is indeed critical to a model’s development, disclosing the entire dataset may be legally impossible or practically meaningless in many scenarios. By requiring detailed, comprehensive “Data Information” about the dataset, OSAID preserves the possibility of forks (derivative development) and ensures that Open Source AI can be realized even in fields restricted by privacy or other legal constraints.
OSAID does not deny that sharing the entire training data is the most direct route to perfect reproducibility. Rather, it highlights the necessity of acceptable compromise so that Open Source AI can thrive in the real world. In other words, while adhering to the four key freedoms—Use, Study, Modify, and Share—some portion of the training data can remain unpublished when required. In this sense, the OSI’s current approach fosters a pragmatic balance that retains the spirit of Open Source for AI applications.
Still, it is unlikely that the question—whether calling something “Open Source AI” demands complete release of all training data—will be settled any time soon. The AI field evolves at a rapid pace, complicated by ongoing litigation, emerging data privacy regulations worldwide, and shifting industry norms, all of which can dramatically alter the landscape.
For example, regarding the Open Source status of fonts as discussed in Section 2, we currently treat TTF/OTF files themselves as the “preferred form of making modifications.” If a scenario arises in which key design data is stored outside TTF/OTF files, simply distributing TTF/OTF might no longer suffice to count as source code, and providing additional design data might become a requirement for a font to be considered Open Source. The AI field may evolve in much the same way: sharing more substantial portions of data could become increasingly necessary, beyond what OSAID presently requires. Conversely, if training data becomes fully standardized, or if analyzing the weights alone can yield a sufficiently accurate grasp of a model’s functionality, the importance of detailed data information might wane. In any case, each stage of evolution will require revisiting what “preferred form of making modifications” actually entails, and the OSI must continue defining standards that uphold the freedoms to Use, Study, Modify, and Share.
5. Final Remarks
I have been involved since the early stages of the discussions that led to the creation of OSAID, and in particular, I found the debate over whether complete disclosure of the training data is necessary for an AI project to qualify as Open Source AI to be a persistent challenge. This topic is extremely complex and can be perceived differently depending on one’s perspective; I also witnessed with dismay how certain individuals would launch unilateral attacks on those striving for a balanced resolution. Such hostility—especially from members of large IT corporations—was frankly disheartening.
Throughout its history, the term “Open Source” has consistently reflected a balance between the philosophy of freedom and the practical demands of law and technology. The OSI is currently working tirelessly to define the boundary for this balance in the AI arena. Given that many aspects of AI remain uncertain for our Open Source community, it is entirely possible that further disagreements among different stakeholders will surface. When that occurs, I hope people will refrain from imposing their views in an uncooperative manner and instead engage in multifaceted analysis of the issues, pursuing practical solutions that advance the common good.