While the open-source model has democratized software, applying it to AI raises legal and ethical issues. What is the end goal of the open-source AI movement?
The race for the future of AI has hit a bump in the road: the definition of “open-source.” The general public first heard there was a conflict over this term in early spring, when Elon Musk, a co-founder of OpenAI, sued the company for breaching its original non-profit mission (though he withdrew the claims months later).
Indeed, for quite some time, OpenAI preached the word of the open-source community. That claim was widely criticized, however, and a recent report showed that the underlying ChatGPT models are a closed system, with only an API remaining open to some extent. OpenAI isn’t the only tech company riding the “open-washing” train: Meta’s LLaMA and Google’s BERT have both been marketed as “open-source AI.”
Unfortunately, branding a system as “open-source” when it isn’t goes beyond marketing: in some instances, the “open-source AI” tag brings legal exemptions, so the risk of businesses abusing the term is real. To set things straight, the Open Source Initiative (OSI), the independent non-profit that helped coin the definition of open-source software, has announced a global workshop series to gather diverse input and push the definition of open-source AI toward a final agreement.
While technocrats and developers battle over the scope of the term, it is a good time to ask a slightly uncomfortable question: is the open-source movement really the best way to democratize AI and make the technology more transparent?
Open-source software vs open-source AI
Open-source software usually refers to a decentralized development process in which the code is made publicly available for collaboration and modification. OSI has developed a clear set of criteria for its open-source definition, from free redistribution and non-discrimination to unrestrictive licensing. However, there are a couple of sound reasons why these principles cannot simply be transplanted to the field of AI.
First, most AI systems are built on vast training datasets, and this data is subject to different legal regimes, from copyright and privacy protection to trade secrets and various confidentiality measures. Opening up the training data thus carries a risk of legal consequences. As Joëlle Pineau, Meta’s VP for AI research, has noted, current licensing schemes were not designed for software that leverages large amounts of data from a multitude of sources. Yet leaving the data closed makes an AI system open-access rather than open-source, since there is little anyone can do with the algorithmic architecture without a glimpse into the training data.
Second, the number of contributors involved in developing and deploying an AI system is much larger than in traditional software development, where a single firm may do all the work. Different contributors might be held liable for different parts and outputs of the AI system, and it would be difficult to determine how to distribute that liability among open-source contributors. Consider a hypothetical scenario: if an AI system built on an open-source model hallucinates outputs that prompt emotionally distressed people to harm themselves, who is responsible?
The risk of openness
OSI bases its efforts on the argument that, to modify an AI model, one needs access to the underlying architecture, the training code, the documentation, the model weights, the data-preprocessing logic, and, of course, the data itself. A truly open system should therefore allow complete freedom to use and modify it, meaning that anyone can participate in the technology’s development. In an ideal world, this argument would be perfectly legitimate. The world, however, is not ideal.
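To make OSI’s point concrete, here is a minimal sketch, assuming the Hugging Face transformers library; the repository name is hypothetical and purely illustrative. Published weights are enough to run a model, but not to audit or rebuild it:

```python
# A minimal sketch, assuming the Hugging Face `transformers` library.
# The repository id below is hypothetical, used purely for illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

# Open weights alone are enough to load the model and run inference...
tokenizer = AutoTokenizer.from_pretrained("some-org/open-weights-model")
model = AutoModelForCausalLM.from_pretrained("some-org/open-weights-model")

inputs = tokenizer("Open-source AI means", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

# ...but auditing or retraining the model requires the pieces that usually
# stay closed: the original training corpus, the data-preprocessing
# pipeline, and the full training code and configuration.
```

Without those closed pieces, “modification” is effectively limited to fine-tuning on top of an opaque foundation.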
Recently, OpenAI acknowledged that it is uncomfortable releasing powerful generative AI systems as open-source until all the risks, including misuse and acceleration, are carefully assessed. One might argue whether this is an honest consideration or a PR move, but the risks are real. Acceleration is a risk we don’t even know how to tackle, as the past two years of rapid AI development have shown, leaving the legal and political community confused over a host of regulatory questions and challenges.
Misuse, whether for criminal or other purposes, is even harder to contain. As RAND-funded research has shown, most future AI systems will likely be dual-use, meaning the military will take and adapt commercially developed technologies rather than develop military AI from scratch. The risk of open-source systems falling into the hands of undemocratic states and militant non-state actors therefore cannot be overstated.
There are also less tangible risks, such as increased bias and disinformation, that must be considered when releasing an AI system under an open-source license. If the system is free to modify and experiment with, including the ability to alter the training data and training code, there is little the original AI provider can do to ensure the system remains ethical, trustworthy, and responsible. This is probably why OSI has explicitly declared these issues “out of scope” when defining its mission. So while open source may level the playing field, allowing smaller actors to benefit from AI innovation and drive it further, it also carries an inherent risk of making AI outputs less fair and accurate.
The use and abuse of the open-source model
To summarize, it is still unclear how a broadly defined open-source model can be applied to AI, which is mostly data, without incurring serious risks. Opening AI systems would require novel legal frameworks, such as Responsible AI Licenses (RAIL), that allow developers to prevent their work from being used unethically or irresponsibly.
This is not to say that OSI’s mission to consolidate a single definition isn’t important for the future of AI innovation. Its importance, however, lies not so much in promoting innovation and democratization as in ensuring legal clarity and mitigating potential manipulation.
Take the newly adopted EU AI Act, the first-ever comprehensive regulation of AI development. The AI Act provides explicit exceptions for open-source general-purpose AI (GPAI) models, the class of models that powers most current consumer-oriented generative AI products such as ChatGPT, easing their transparency and documentation requirements. The exemptions do not apply, however, if the model poses a “systemic risk” or is monetized.
Under such circumstances, more (or less) permissive open-source licenses can become a way to dodge transparency and documentation requirements, a behavior that is all the more likely given AI firms’ ongoing struggle to acquire multifaceted training data without breaching copyright and data privacy laws. The industry must agree on a single definition of “open-source” and enforce it; otherwise, the bigger players will decide what “open-source” means with their own interests in mind.
Democratizing data, not systems
As much as a clear definition is needed for legal purposes, it remains doubtful whether a broadly defined open-source approach can deliver the anticipated technological advancements and level the playing field. AI systems are built mostly on data, and the difficulty of acquiring data at scale is, along with computing power, Big Tech’s strongest competitive advantage.
Making AI open-source won’t remove the structural barriers that small players face: a constant influx of data, serious computing power, and highly skilled developers and data scientists will still be needed to modify a system and train it further.
Preserving the open internet, and open web data accessible to everyone, may be a more important mission in the quest for AI democratization than pushing the open-source agenda. Due to conflicting or outdated legal regimes, internet data today is fragmented, and that fragmentation hinders innovation. It is therefore vital for governments and regulators to look for ways to rebalance fields such as copyright protection, making public data easier to acquire.