Navigating AI in Music Tech: Sourcing Copyright-Cleared Datasets

By Sean Keenan, independent music technology advisory

In the busy world of music technology, the number one challenge for AI companies is sourcing copyright-cleared datasets to use as training data. Last week, Kate Knibbs at Wired wrote about how AI can be trained on copyright-cleared datasets.

In this post we aim to offer some guidance on the complexities of obtaining high-quality, legally sound training sets. Whether you are already leveraging AI-powered tools or just exploring their potential, the revolution around generative AI has pushed this topic to the forefront.

Training data

Training data is fundamental to the development of AI-powered tools and products. When AI models are fed high-quality, diverse training data, they can recognise patterns, make predictions, and innovate more effectively. Poor-quality or biased data can lead to unreliable outcomes, making the acquisition of robust datasets a top priority.

Although training data can consist of almost any type of electronic data, many of the largest AI companies insist it’s ‘impossible’ to train AI models without copyright-protected content.

The challenge

Creators and music industry stakeholders are incredibly vocal about the misuse of copyright-protected content in training AI. They argue that such practices deprive creators of fair compensation and infringe intellectual property rights. Many advocate for responsible AI development that respects creator consent and complies with intellectual property law (e.g. see UMG and Roland’s AI principles).

High-profile lawsuits, like those involving AI music startups Udio and Suno, illustrate the legal risks of using unlicensed content and the willingness of rights holders to enforce their IP rights through the courts.

The regulatory environment is also evolving. The EU AI Act underscores the need for transparency and compliance with copyright law: one of its requirements is that providers of general-purpose AI models publicly disclose detailed summaries of their training data. Although the UK does not have a specific AI law (for now), the extraterritorial reach of the EU AI Act means UK companies must remain vigilant about how and where they source datasets.
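Disclosure of this kind is far easier if provenance is recorded as data is collected. The following is a minimal sketch of an internal tally by source and licence. It is not the format the Act prescribes, and the source names, licence labels, and the `summarise` helper are all hypothetical.

```python
from collections import Counter

# Hypothetical provenance records: one (source, licence) pair per audio file.
RECORDS = [
    ("cc-library", "CC0-1.0"),
    ("cc-library", "CC0-1.0"),
    ("label-partnership", "bespoke-licence"),
    ("public-domain-archive", "public-domain"),
]


def summarise(records):
    """Count files per (source, licence) pair, sorted for stable output."""
    counts = Counter(records)
    return [
        {"source": source, "licence": licence, "files": n}
        for (source, licence), n in sorted(counts.items())
    ]


SUMMARY = summarise(RECORDS)
for row in SUMMARY:
    print(row)
```

A tally like this is only a starting point; what a published summary must contain is a question for legal advisers.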

These developments suggest that the landscape will continue to change, and companies that rely on scraping and fair use arguments may find themselves on the back foot as legal interpretations and industry standards evolve.

Implications for music tech

For UK music tech companies, navigating the current legal landscape and ethical considerations can be daunting. A balance has to be struck between developing innovative and competitive AI systems and respecting legal and ethical standards.

However, as highlighted in the Wired article, there are existing and emerging solutions for acquiring copyright-cleared datasets.

Data acquisition

The business of creating and providing curated, copyright-compliant datasets is poised for growth. As the demand for clean, high-quality data increases, we can expect more solutions to emerge that cater to companies prioritising legal compliance and ethical practices. This shift will likely lead to more robust and transparent data sourcing practices across the industry.

Sources of copyright-cleared datasets can include:

  • Creative Commons (CC) – owners of works can choose to apply CC licences to encourage collaboration and innovation. These licences grant permission to reuse copyrighted works under stated conditions, but they do not override existing copyright exceptions. Examples include Jamendo and Freesound.
  • Public domain resources – these are repositories of works and sound recordings that are in the public domain, meaning copyright has expired, and are free to use for any purpose. Note, however, that this material is often dated. Examples include the Open Music Archive and ChoralWiki.
  • Commercially licensed libraries – a number of companies are investing in commercially licensed libraries which offer tracks and sound samples with clear licensing agreements. Examples include Pond5 and Shutterstock (for music).
  • Institutional repositories and research data – these are research-orientated datasets consisting of music and sound samples along with annotated tags. Examples include MIREX and MagnaTagATune.
  • Custom datasets through partnerships (licensing) – this requires bespoke licensing arrangements with creators and rights holders such as labels and publishers. If you can secure them, they can offer legal security and ongoing collaboration within the industry.
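When tracks from several of these sources are combined, it helps to record the licence against every item and check it before training. Here is a minimal sketch, assuming a hypothetical manifest and a placeholder allow-list. Which licences actually permit AI training is a matter for legal review, so the `ALLOWED_LICENCES` set is an assumption for illustration only.

```python
from dataclasses import dataclass

# Placeholder allow-list (an assumption): licence labels that legal review
# has cleared for training use in this hypothetical project.
ALLOWED_LICENCES = {"CC0-1.0", "public-domain", "bespoke-licence"}


@dataclass(frozen=True)
class Track:
    title: str
    source: str
    licence: str


def split_by_clearance(tracks, allowed=ALLOWED_LICENCES):
    """Separate tracks with an allow-listed licence from those needing review."""
    cleared = [t for t in tracks if t.licence in allowed]
    needs_review = [t for t in tracks if t.licence not in allowed]
    return cleared, needs_review


# Hypothetical manifest entries.
MANIFEST = [
    Track("Ambient loop", "cc-library", "CC0-1.0"),
    Track("Folk recording", "public-domain-archive", "public-domain"),
    Track("Pop single", "web-scrape", "unknown"),
]

cleared, needs_review = split_by_clearance(MANIFEST)
print("Cleared:", [t.title for t in cleared])
print("Needs review:", [t.title for t in needs_review])
```

Treating anything with an unknown or unlisted licence as "needs review" rather than discarding it silently keeps a record of what was excluded and why.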

Conclusion

Obtaining datasets for AI in music tech requires diligence in sourcing and using data responsibly. However, leveraging the right copyright-cleared resources and following best practices can translate into confident innovation and respect for creator rights.
