What defines the ‘open’ in ‘open AI’?

A growing number of researchers are challenging big tech and rethinking the meaning of openness in AI development

Monday 01 Aug 2022

Openness is a widespread concept in the technology world. But while we already have good definitions of open source, open research and open data, ‘open AI’ is a term that remains somewhat nebulous. For researchers, open AI may mean collaborative and reproducible science and systems. For technologists, it may centre on free use and distribution of AI models, or perhaps public participation in an algorithm’s development. As more projects and organisations begin to use this term, we’ll need a clearer definition so that everyone knows what to expect from AI systems badged as ‘open’.

When the San Francisco-based company OpenAI was created in 2015, its founders described their mission as being “to build value for everyone rather than shareholders” and to share their patents “with the world”. While open collaboration wasn’t new to the burgeoning AI field (OpenCV and scikit-learn opened the doors to machine learning for many), OpenAI was one of the first organisations to centre its structure and public branding around the ‘open’ terminology. In the years since, OpenAI has created revolutionary algorithms, including the DALL·E and GPT model families.

However, along the way, the company has shifted from its original structure. In 2019, it transformed from a non-profit to a “capped” for-profit and, in 2020, gated its text-generating GPT-3 large language model (LLM) behind a commercial API. Soon after, it granted Microsoft, the company’s biggest investor, an exclusive licence to GPT-3’s code. These developments led some to question how open OpenAI’s practices really were, reopening a conversation on what open means for AI development, beyond a value signal to funders or great marketing copy. OpenAI now seems to be taking DALL·E in a similarly commercial direction, restricting free use and applying a ‘freemium’ business model.

Meanwhile, at Google – 30 minutes’ drive down San Francisco Bay – Timnit Gebru and Margaret Mitchell (at the time co-leaders of Google’s Ethical AI team) noticed a mismatch between value statements and actions in their organisation’s work on LLMs. With co-authors from the University of Washington, they outlined their concerns in a 2021 paper on the social, technical and environmental harms created by state-of-the-art LLMs, and the dangerous trend of building ever-larger models.

This paper sparked fierce debate within the AI community on the importance of open and ethical practices within big tech organisations, which produce most of the world’s AI. In the aftermath, Gebru and Mitchell were fired. They have since begun building their own solutions to the problems they identified: Gebru by founding the Distributed AI Research Institute (DAIR), and Mitchell by joining Hugging Face as its Chief Ethics Scientist.

Beginning in May 2021, Mitchell also co-organised the year-long BigScience workshop, a collaboration of over 1,000 volunteer researchers from around the world, working to create a multilingual LLM grounded in an ethical charter. In June 2022, the BLOOM (short for BigScience Large Open-science Open-access Multilingual) LLM finished training, and it can now be downloaded for free from the Hugging Face platform. To me, what’s unique about BLOOM isn’t necessarily the model itself (though its training on 46 human languages and 13 programming languages on a 28-petaflop French supercomputer is impressive), but its pioneering approach to building LLMs in a collaborative and open way.

BLOOM opens the black box not just of the model itself, but also of how LLMs are created and who can be part of the process. With its publicly documented progress, and its open invitation to any interested participants and users, the BigScience team has distributed the power to shape, criticise and run an LLM to communities outside big tech. BigScience has also incorporated local needs through regional working groups that extend model localisation beyond the mere inclusion of a language to context-specific decision-making and evaluation. In this space, the lines between AI producers, funders, consumers, regulators and impacted communities become blurred. The result is a next-gen LLM created not just by tech workers, but through the collective effort of librarians, lawyers, engineers and public servants. Instead of “move fast, break things”, we see a new model of “move together, build the right thing”.

As the UK’s national institute for data science and AI, the Turing holds developing best practices for open and responsible AI innovation as a strategic priority. In this spirit, shortly before BLOOM’s launch, we hosted a discussion between members of BigScience’s Data Governance team (including Margaret Mitchell) and representatives of three Turing research programmes (public policy; AI; and tools, practices and systems), each with their own perspective on open AI development. Discussion points included how to ensure open tech initiatives connect and co-develop solutions with existing communities, how to design governance frameworks that account for malicious actors potentially misusing open data/models, and how to close the gap between high-level ethics principles and a practice of developing AI that can truly be described as ethical.

The Turing is composed of researchers from fields as diverse as philosophy, economics, history and neuroscience. This breadth of background informs how we conduct our AI research, collaborating across teams and disciplines by design. Through initiatives like The Turing Way and BigScience, researchers can co-create a definition and practice of open AI that meets the needs of more communities. I feel hopeful that, by involving people from different countries, backgrounds and organisations, we will reach a better shared understanding of open AI that resonates in Seoul and São Paulo, Sydney and Soweto, as much as it does in Silicon Valley.


Top image: Katerina Pavlyuchkova / Unsplash