CanLII v CasewayAI: Defendant Trots Out AI Industry’s Misinformation and Scare Tactics (But Don’t Panic, Canada)

Image: Pixabay

Last month I highlighted the first AI/Copyright case in Canada to reach the courts, CanLII v CasewayAI. CanLII (the Canadian Legal Information Institute), a non-profit established in 2001 by the Federation of Law Societies of Canada, sued Caseway AI, a self-described AI-driven legal research service, for copyright infringement and for violating CanLII’s Terms of Use through a massive downloading of 3.5 million files which Caseway allegedly used to populate its AI-based services. Now the principal of CasewayAI, Alistair Vigier, through an article (Don’t Scare AI Companies Away, Canada – They’re Building the Future) published in Techcouver, has responded publicly by trotting out many of the tired and specious arguments put forward by the AI industry to justify the unauthorized “taking” of copyrighted content to use in or to train generative AI models. Let’s have a closer look at these arguments.

Vigier opens by referencing another AI/Copyright case in Canada where a consortium of Canadian media companies is suing OpenAI for copyright infringement. He claims this is all based on a misunderstanding of how AI training works, stating that “AI systems like OpenAI rely on publicly available data to learn and improve. This does not equate to stealing content.” Whether data is “publicly available” or not is irrelevant when it comes to determining whether copyright infringement (aka stealing content) has occurred. Books in libraries are publicly available, as is a book that you purchase in a bookstore, or content on the internet that is not behind a paywall. (It is worth noting that the Canadian media companies also claim that OpenAI circumvented their paywalls to access their content when copying it.) But in none of these cases is copying permitted unless the copying falls within a fair dealing exception, which is very precise in its definition. Labelling copied material as “publicly available” is a red herring.

Vigier’s next argument is to equate the ingestion of content by various AI development models with a human being reading a book. We know that humans enhance their knowledge through reading and are thus able, presumably, to better reason based on the content they have absorbed. Vigier says, “This is how AI works. The AI ‘reads’ as much as it can, gets really ‘smart,’ and then explains what it knows when you ask it a question. Like a human learns from reading the news, so does an AI.”

Really? A human does not make a copy, not even a temporary copy, of the content although some elements of the content are no doubt retained in the human brain. But AI operates differently. It makes a copy of the content. This should be beyond dispute although the AI industry continues to muddy the waters by claiming that when content is “ingested” it is converted to numeric data and is thus not actually copied. This is a fallacious argument. Just because the form changes, this does not mean there is no reproduction. When you make a digital copy of a book, there is still reproduction even though the digital form is different from the original hard copy version. When a work is converted to data, the content is still represented in the dataset.
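To make the point about form versus content concrete, here is a minimal sketch in Python. It is a toy illustration only and not how any real model is trained: the word-level vocabulary and the `build_vocab`, `encode` and `decode` helpers are hypothetical names invented for this example, and actual LLM pipelines use far more elaborate tokenizers and further processing. It simply shows that text converted to numbers can still carry the full content of the original.

```python
# Toy illustration (an assumption for this post, not any vendor's pipeline):
# a passage of text is mapped to integer IDs and then recovered from them.

def build_vocab(text: str) -> dict[str, int]:
    """Assign an integer ID to each distinct space-separated word."""
    vocab: dict[str, int] = {}
    for word in text.split(" "):
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Convert the text into a list of numbers."""
    return [vocab[word] for word in text.split(" ")]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Convert the numbers back into text using the same vocabulary."""
    inverse = {number: word for word, number in vocab.items()}
    return " ".join(inverse[number] for number in ids)


original = "copyright means the sole right to produce or reproduce the work"
vocab = build_vocab(original)
ids = encode(original, vocab)
restored = decode(ids, vocab)

print(ids)                   # the passage as "numeric data"
print(restored == original)  # whether the original text is fully recoverable
```

Whether the numeric representations inside an actual trained model amount to reproduction in law is, of course, the question before the courts. The sketch only shows that a change of form, by itself, does not remove the content.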

Vigier dubiously states, with regard to OpenAI, “OpenAI’s models do not reproduce articles verbatim; they process vast datasets to identify patterns, enabling insights and efficiency.” Apart from the fact that the New York Times in its separate lawsuit in the US has been able to demonstrate that by typing in leads of articles, it can prompt OpenAI to reproduce verbatim the rest of the article (OpenAI claimed that the Times “tricked” the algorithm), copying is copying even if the result of the copying is somewhat different from the original. The Copyright Act is crystal clear on this point. Section 3(1) of the Act states that, “For the purposes of this Act, copyright, in relation to a work, means the sole right to produce or reproduce the work or any substantial part thereof in any material form whatever….” If copyright protected content is reproduced in its entirety without permission for a commercial purpose (e.g. for AI training), that is infringement, unless the use qualifies as a fair dealing under Canadian law or fair use in the US.

The issue of whether ingestion of content to train an AI application results in copying (reproduction) has been carefully studied and documented. One of the most thorough examples is a recent SSRN (Social Science Research Network) paper entitled “The Heart of the Matter: Copyright, AI Training, and LLMs”, with noted scholar Daniel Gervais (a Canadian, by the way) of Vanderbilt University as lead author. The article goes into a detailed discussion of how copying of content occurs during AI scraping to build a Large Language Model (LLM), including the stages of tokenization and embedding, leading to reward modelling and reinforcement learning. The section of the article explaining how copying occurs (pp. 1-6) is dense, technical text but the conclusion is clear: “LLMs make copies of the documents on which they are trained, and this copying takes various forms, and as a result, with appropriate prompting, applications that use the LLMs are able to reproduce original works.” A shorter (and earlier) version explaining how the LLM copying process works can be found in this article (“Heart of the Matter: Demystifying Copying in the Training of LLMs”), produced by the Copyright Clearance Center in the US. It is also worth noting that these explanations refer only to ingestion of text. AI models that train on images and music are even more likely to produce exact or close-to-exact reproductions of some of the works they have been built and trained on.

So much for the misinformation in Vigier’s article. Now to the scare tactics. He says that the recent Canadian media lawsuit against OpenAI sends a negative message to innovators that Canada may not be open to AI development.

If Canada wishes to remain relevant in this (AI) sector, it must balance protecting intellectual property and promoting technological progress.

The fact that there are currently more than 30 lawsuits in the US, including the seminal New York Times v OpenAI case, does not seem to have slowed down the AI companies in the US. In the UK, legislation has been introduced that would, according to British media reports, “ensure that operators of web crawlers (internet bots that copy content to train GAI, generative AI) and GAI firms themselves comply with existing UK copyright law. These amendments would provide creators with crucial transparency regarding how their content is copied and used, ensuring tech firms are held to account in cases of copyright infringement.” There is lots of AI innovation ongoing in Britain.

The Australian Senate Select Committee Report on Adopting AI has recommended, among other findings, that there be mandatory transparency requirements and compensation mechanisms for rightsholders. The EU is already way out in front on this issue. Its new AI Act stipulates that providers of AI generative models will be required to provide a detailed summary of content used for training in a way that allows rightsholders to exercise and enforce their rights under EU law. Even India now has its own version of the US and Canadian media cases against OpenAI. (OpenAI’s defence in part is based on the argument that no copying took place in India because no OpenAI servers are located there!)

If that is what the “competition” is doing, who does Vigier cite as being the jurisdictions most likely to attract innovators away from Canada? Why, it is those AI powerhouses of Switzerland, Dubai—and the Bahamas!

The argument that if legislators and the courts don’t give AI innovators a free pass on helping themselves to copyrighted content for AI training purposes, this will either slow down innovation or chase it elsewhere is a common fearmongering strategy of the AI industry. This is a race-to-the-bottom mentality whereby content industries are thrown under the AI bus. Vigier, having been the subject of his own lawsuit, argues that instead of resorting to litigation, the Canadian media companies should have sought a licensing solution. But the fact that no licensing agreement was reached with OpenAI is undoubtedly the reason for the lawsuit in the first place. That is certainly the reason behind the NYT v OpenAI lawsuit in the US; licensing negotiations broke down. If someone has taken your content without authorization, and then offers you pennies on the dollar in comparison to what that content is actually worth, then the stage for a lawsuit is set.

In explaining CasewayAI’s position in the litigation brought by CanLII, Vigier says that Caseway approached CanLII with an offer to collaborate but was rebuffed. As a result they developed other extensive web crawling technology that pulled the needed material from elsewhere. (Where exactly the material was downloaded from is the crux of the matter). Regardless, this makes it sound as if it was CanLII’s fault for refusing to share their content. Surely a rightsholder has the right to determine the terms on which their content is to be shared with others, if at all.

The fact that Caseway went to CanLII in the first place suggests that CanLII had developed the content that Caseway wanted. Caseway claims the material it accessed was on the public record, such as court documents and decisions. CanLII, on the other hand, claims that it had reviewed, indexed, analyzed, curated and otherwise enhanced the content in question, thus adding a wrapping of copyright protection to what otherwise would be public documents. Who is right, and whether the material was scraped from CanLII’s website without authorization, will be determined by the BC Supreme Court.

If the material taken by CasewayAI was not copyright protected, they are in the clear, at least with respect to copyright infringement. That is quite different, however, from arguing that no copying takes place during AI training or that if rightsholders use the courts to protect their rights, Canada will be a laggard when it comes to AI development. Robust AI development needs to go hand in hand with robust copyright protection for creators, with an appropriate sharing of the spoils of the new wealth generated from the creative work of authors, artists, musicians and other rightsholders. Yet Vigier says in his concluding paragraph:

Canada has a choice to make. Will we embrace AI as the transformative force it is, or will we let fear and litigation stifle innovation? The lawsuits against Caseway and OpenAI message tech companies: you’re not welcome here. If this continues, Canada won’t just lose its AI startups; it will lose the future of job creation.

What sheer self-interested nonsense! This is fearmongering of the worst kind, based on an inaccurate and misinformed understanding of how AI is developed and trained, that moreover impugns the legitimate right of a rightsholder to seek the protection of the law to protect their creativity and investment in content. Vigier might be correct when he says that licensing of content is a win/win for both parties. I agree with that. But licensing negotiations are about money and conditions of use and require willing parties on both sides. When licensing discussions break down, or when one party decides to do an end run on licensing because they have been rebuffed, then the way to gain clarity is through the courts, whose job it is to interpret what the legislation means.

Canada still needs to come to grips with the question of how copyrighted content will interface with AI development. As I noted earlier, both sides in the debate made their cases in the public consultation launched a year ago, but since then there has been no movement in Ottawa. The law could be strengthened to ensure adequate protection of rightsholder interests in an age of AI, which would in turn facilitate licensing solutions. In the meantime, misinformation and scare tactics need to be called out for what they are.

Adequate protection for rightsholders does not mean the end of AI innovation or investment in Canada. There is no need for panic. We can walk and chew gum at the same time.

© Hugh Stephens, 2024. All Rights Reserved.

Another AI Scraping Copyright Case in Canada: News Media Companies Sue OpenAI

Image: Shutterstock (AI assisted)

First, I heard it on the radio. The word “copyright” caught my attention because that’s a word seldom heard on the morning news. Then the news stories started to appear, first on Canadian Press, which was “largely” accurate, then on the CBC, Globe and Mail, even the New York Times. A consortium of Canadian media, including the Toronto Star, Postmedia, the Globe and Mail and the CBC/Radio-Canada, is suing OpenAI in Ontario Superior Court for copyright infringement and for violating their Terms of Use. The publishers are seeking CAD20,000 per infringement plus an injunction to prevent further infringement. The case largely parallels a similar one in the US brought by the New York Times against OpenAI and its largest investor Microsoft, which I wrote about earlier this year (When Giants Wrestle, the Earth Moves (NYT v OpenAI/Microsoft)).

Despite what the press articles state, this is not the first case in Canada where copyright infringement has been alleged as a result of data being scraped to use in AI applications, as I noted last week. However, it is the first case where news organizations have gone after an AI development company. It also has nothing to do with the Online News Act as stated in the Canadian Press report. In fact, it is the absence of legislation in Canada regarding copyright and AI that is partly responsible for this being fought out in the courts.

OpenAI in its statement quoted “fair use” and “related international copyright principles” to justify its behaviour. The fact that the US fair use doctrine does not apply in Canada, combined with the closed nature of fair dealing exceptions, and the lack of a Text and Data Mining exception in Canadian law, could prove troublesome for OpenAI. It also has the effrontery to state that it offers “opt out” options for news publishers. When you are taking someone’s proprietorial content without permission or payment, it is an insult to tell them they can always opt out. To steal, and then to tell your victim to request that you not steal again, is hardly the way ethical companies operate.

One question to be decided is whether the scraped content falls under copyright, as it is a well-established principle that the “news of the day” is not subject to copyright protection. (See Do News Publishers “Own” the News?) News media may not have a monopoly over reporting on what is happening in, say, Gaza but they certainly have the rights to their expression of what is happening through their coverage. OpenAI has also apparently said that its web crawlers are just “reading” publicly available material, as a human being would do. However, reading and copying are two different things, although proving reproduction may be difficult given the unwillingness of OpenAI to disclose its training methods, an issue that has come up in the New York Times case. “Publicly available” is irrelevant, since being publicly available on the internet, or in a library, or anywhere else, does not justify copyright infringement.

In their suit, the plaintiffs are also alleging circumvention of a TPM (technological protection measure, sometimes referred to as a digital lock, which puts content behind a paywall). This is a separate violation of the Copyright Act. In addition, they are alleging violation of their Terms of Use, which are linked to their websites. When a user accesses material on the publishers’ websites, they must agree to the Terms of Use which, among other things, state that the content to be accessed is for the “personal, non-commercial use of individual users only, and may not be reproduced or used other than as permitted under the Terms of Use”, unless consent is given.

The publishers state that OpenAI was well aware of the need to pay for their content and to obtain permission to use it. That is essentially the position also taken by the New York Times. OpenAI has reached licensing agreements with some publishers including the Associated Press, Axel Springer (Business Insider, Politico), the Financial Times, the publishers of People, Better Homes and Gardens and other titles, News Corp (Wall Street Journal and many others), The Atlantic, and others. But not the New York Times obviously (negotiations broke down, leading to the current lawsuit) and not with any of the Canadian media bringing suit. A licensing agreement acceptable to both parties will be the likely outcome of this case. As the US-based Copyright Alliance has pointed out, generative AI licensing isn’t just possible, it’s essential.

There is a vacuum when it comes to legislation in Canada, and elsewhere, regarding the intersection of copyright and AI development. Various models are being experimented with, from the “throw copyright under the bus” model in Singapore to a more nuanced model in Japan, to uncertainty elsewhere. Australia has just produced a Senate report in response to its public consultation on the issue. Among its recommendations, the Select Committee Report on Adopting Artificial Intelligence called for changes that would ensure copyright holders are compensated for use of their material, while tech firms would be forced to reveal what copyrighted works they used to train their AI models. Canada initiated a public consultation on the topic last year and the Australian Committee’s recommendations with respect to copyrighted content are essentially what the Canadian copyright community asked for. However, since receiving input in January of this year and publishing the submissions received in June, there has been no further information released by the Canadian government. A conclusion similar to the recommendations in Australia would be welcome.

Canadian creators and rightsholders are waiting for some action. Meanwhile the only alternative is to toss the issue to the courts to adjudicate.

© Hugh Stephens, 2024. All Rights Reserved.

AI-Scraping Copyright Litigation Comes to Canada (CANLII v Caseway AI)

Image: Shutterstock (with AI assist)

It was inevitable. AI scraping litigation has finally come to Canada, after all the lawsuits in the US (and some in the UK) pitting various copyright holders against AI development companies. Those suits allege that the AI platforms were infringing copyright by reproducing and ingesting copyrighted materials without authorization to train their algorithms to produce outputs based on the ingested content, outputs that in some cases compete directly with the original work. As reported by the CBC, CanLII (the Canadian Legal Information Institute), a non-profit established in 2001 by the Federation of Law Societies of Canada “to provide efficient and open online access to judicial decisions and legislative documents”, is suing Caseway AI, a self-described AI-driven legal research service, for copyright infringement and for violating CanLII’s Terms of Use through a massive downloading of 3.5 million files.

In its civil claim brought before the Supreme Court of British Columbia, CanLII alleges that the defendants, doing business as Caseway AI, violated its Terms of Use that prohibit bulk or systematic download of CanLII material and that in doing so, the defendants also engaged in copyright infringement by reproducing, publishing and creating a derivative work based on the copied works for the defendants’ own commercial purposes. There is no question that Caseway is providing legal material for commercial gain. Caseway’s services start at $49.99 a month, or $499.99 a year, and offer an AI-driven service that “leverages advanced AI to find relevant case law in less than a minute… Designed with a user-friendly chatbot interface powered by proprietary technology, Caseway (is) a robust tool tailored specifically for the legal profession.” Caseway’s Terms of Service have all sorts of disclaimers, however.

In his defence, Caseway’s Canadian principal (and defendant) Alistair Vigier is reported to have said that “court documents are public record, not owned by any organization, including CanLII. Numerous other websites also make these decisions available.” It is true that court documents and decisions are public documents not subject to copyright protection. However, CanLII claims that its database contains more than just the court’s decisions. It says in its claim that it spends significant time to “review, analyze, curate, aggregate, catalogue, annotate, index and otherwise enhance the data” prior to publication. It is this creative effort that turns public documents into a copyright protected document (or so the argument goes). To use another copyright analogy, you cannot copyright a recipe (a “list of ingredients”) but we all know that cookbooks containing recipes are always copyrighted. This is because of the display and illustrations of the recipes, the layout, commentary and other editorial touches. Julia Child’s sole amandine recipe is not just any old recipe for fried sole. Is CanLII’s compilation of “judicial ingredients” protectable? We will have to wait to find out.

CanLII’s case is reminiscent of a similar case in the US, Thomson Reuters v Ross Intelligence. Thomson Reuters operates a subscription-based legal research service called Westlaw. One of Westlaw’s employees allegedly copied Westlaw content to enable Ross Intelligence to build a machine learning platform that competed with Westlaw. Part of Ross’ defence was that the judicial decisions themselves are public domain documents, so there could be no infringement. Westlaw maintained that its case head notes, summaries that described the cases, were copyrightable material. Ross also brought forward a fair use defence arguing transformation, i.e. they had produced something new and different that did not compete directly with Westlaw’s product. Here is a good summary of the case. The court determined that Ross had copied the headnotes but the copyrightability of Westlaw’s numbering system and headnotes needed to go to a jury to determine. While Ross’ anti-trust case against Westlaw has been dismissed, the copyright case is still pending.

Another case that has been cited as a possible precedent is the famous 2004 CCH Canadian Ltd v Law Society of Upper Canada case in which the Supreme Court of Canada ruled that copies of CCH materials made by the Law Society library for its members did not infringe CCH’s copyright because the library was exercising the fair dealing research exception on behalf of the individuals requesting the copies. I personally don’t see the relevance of this case (but I am not a lawyer) since the Great Library’s users were copying only relevant parts of certain documents, for a specified fair dealing purpose. In the CanLII case, Caseway has apparently inhaled the full collection of documents and is doing so for a commercial purpose, with the resultant product (although not identical to the original) competing with it. Moreover, since there is no text and data mining exception in Canadian law, the “transformation” defences available to US-based AI companies (i.e. transforming the original materials to produce something different) are not applicable in Canada. This will be an interesting one for the lawyers.

What the case demonstrates is a crying need for some legislative guidance on the question of AI scraping of copyrighted materials in Canada. It may be that CanLII’s collection cannot be protected by copyright, which would provide Caseway a defence without settling the fundamental issue of whether it is a violation of the Copyright Act to do what Caseway did, assuming the material they used was protectable by copyright. A consultation exercise was launched by the government of Canada (through the Ministry of Innovation, Science and Economic Development, ISED) last October, closing in January with submissions posted in June. Since then, there has been silence on the part of the government. With Parliament at a standstill, and the current government hanging on to power by its fingernails, don’t expect clarity any time soon.

© Hugh Stephens, 2024. All Rights Reserved

If AI Tramples Copyright During its Training and Development, Should AI’s Output Benefit from Copyright Protection? Part Two: Jason Allen

Image: Théâtre D’Opéra Spatial, Jason B. Allen (not protected by copyright)

Last week I wrote about Stephen Thaler’s quixotic and determined approach to obtain copyright registration in the US for his AI generated artwork, “A Recent Entrance to Paradise”, created (he claims) exclusively by his AI “machine”, the so-called Creativity Machine. So far, despite repeated efforts, he has drilled a dry hole. An alternative approach to claiming copyright for an AI-generated work is by asserting that the AI used to produce it was simply a technological assist. The essence of the work was produced through human creativity, using AI only as a tool, and therefore the work should be eligible for copyright protection, or so the argument goes. Unlike the example of Kristina Kashtanova, discussed in last week’s blog, under this theory the entire work is protectable because it was human created, with AI playing only a facilitating or assistive role. This line of attack has most recently been pushed by the creator of the work “Théâtre D’Opéra Spatial”, Jason B. Allen.

Allen made headlines a couple of years back (September 2022) when he entered Théâtre into the Colorado State Fair’s annual art competition in the category of “digital art/digitally manipulated photography.” He labelled the piece as having been created by him, “via Midjourney”, the popular generative AI art algorithm that had recently been released. He won first prize, incurring the opprobrium of many artists who accused him of crashing a contest for human creators. Writing just a month later, in October of 2022, I posted my own AI produced artwork, based on the style of Monet (whose works are in the public domain) in this blog post, (AI and Computer-Generated Art: Its Impact on Artists and Copyright). My effort was substantially less artistic than Allen’s but was an original work of sorts, created with the help of AI. I used the program DALL-E2, which is similar to Midjourney. Both were freely available and a tool that any rank amateur “artist” (like me) could use.

Generative AI as a source of art burst into the public’s consciousness in 2022 because of the public release of these programs, but AI generated art had been around for a few years before that, although used exclusively by art specialists. The New York Times reported on a sale at Christie’s Auction House four years earlier, in October 2018 (not exactly eons ago, but generations in internet/AI time). A portrait with blurred and distorted lines produced by an AI algorithm sold in New York for $432,500 (with fees). Christie’s, in inimitable auctioneering style, billed it as “the first portrait generated by an algorithm to come up for auction”, according to the Times. Now AI generated art is a dime a dozen. In fact, often it is hard to tell what is AI generated and what is not.

If the question of copyright protection for AI generated content is a big issue, an even bigger one is currently being played out in the courts: content owners, ranging from the New York Times to Getty Images to music labels to authors, are suing various AI development companies for unlicensed use of their content to train AI programs. The Copyright Alliance has a good summary of the various lawsuits in play here. The ultimate outcome is undecided, but if the courts find that the wholesale unlicensed and unauthorized ingestion of copyrighted content to train AI algorithms is not fair use (in the US) or does not fall within specified text and data mining exceptions in other jurisdictions, then the table will be set for serious negotiation between rights holders and AI developers. Some of the parameters of this negotiation are already pretty obvious:

  • a transparent inventory of what copyrighted works were accessed for training;
  • the ability of rights holders to opt out or opt in;
  • various options for licensing content for training purposes.

If these conditions governing inputs were met, rights holders might be somewhat more sympathetic to arguments for copyrighting the output of generative AI programs. As it is, AI developers, and the users of AI programs, want to have their cake and eat it too. Jason B. Allen of “Théâtre D’Opéra Spatial” is Exhibit No. 1.

Allen is currently appealing in court the US Copyright Office’s ruling that his work cannot be registered under copyright. He claims that because of all the publicity about his work, and the USCO’s subsequent decision to deny copyright registration on the grounds that it was an AI generated rather than human creation, the work has lost value and his ability to charge industry-standard licensing fees has been impaired. Moreover, he claims that without copyright protection, he has no ability to stop others from using his work without authorization. (Like me, posting Théâtre on this blog post.) Apparently, people are selling copies of the work on Etsy. One has some sympathy for his position, as it is one faced by many artists whose copyrighted works are also being ripped off on the internet.

Allen argues he had substantial creative input into the production of the work, using no fewer than 624 prompts to create the work to his mental specifications. Of course, we have no idea of what those specifications were. What Midjourney produced may have been an accurate reflection of what Allen had in his mind and intentionally created, or it may have taken him on a journey where he eventually settled on the output offered. One thing is likely, if not certain. Were he to enter those exact same prompts into Midjourney today, the outcome would not be identical to the current “Théâtre D’Opéra Spatial”. This raises the question of who, exactly, is guiding the creative process, the artist making the prompts, or the algorithm responding to the prompts.

Because of the way AI works, there is a large degree of randomness in the results, requiring more and more precise prompts to narrow the range of possibilities and guide the algorithm to the desired destination. But it is almost impossible to recreate precisely the route to an outcome. This suggests to me that, in the end, it is the algorithm that is in control, not the human issuing the prompts. (Although not everyone agrees with this thesis). To my mind, this is what distinguishes AI-generated art from photography, where the photographer, while using a mechanical assist, is nevertheless in full control at all times and has the ability to adjust for extraneous inputs such as light, shadow etc. rather than being controlled by them.

Allen’s appeal of the USCO’s rejection of his copyright claim takes place against a backdrop where Midjourney, the AI program he used, is itself being sued by a group of artists for appropriating their work without permission in order to train Midjourney’s art generation algorithms. Does anyone see any irony in this? However, the fact that Midjourney takes the works of others without permission for training and development purposes is not really Allen’s fault. He and other users of the program could perhaps be considered victims almost as much as the artists whose works have been appropriated. Nevertheless, if Allen is ever successful in getting Théâtre registered, this will be not just to his benefit, but also to the benefit of Midjourney and all the other AI developers who are in a similar position. On the other hand, if the output of their programs cannot be copyright protected, it diminishes the value of the AI product. So, perversely, the more Allen pursues registration for Théâtre, the more he undercuts those who make a living from producing art.

I see one possible scenario that could help resolve the issue of AI outputs being unprotectable. If the key elements of respect for copyrighted work (transparent inventory, opt in/out, licensing) were to be adopted by AI developers, then perhaps rights-holders would be more amenable to accepting at least some degree of copyright protection for AI created or assisted outputs. But right now, the AI industry wants it both ways; total freedom to appropriate copyrighted works for training and development purposes while claiming the same copyright protection they have just trampled for AI generated outputs. Jason Allen and other digital artists who use AI to produce art or other works are caught in the middle.

At the moment there is no clear solution. The most likely outcome, after all the legal dust has settled, is probably going to be some ability to copyright works produced with AI, dependent on the extent and degree of human intervention in a given work (which could possibly be carefully tailored prompts), balanced by a commitment by the AI industry to recognize the property rights of those holding copyright over the content it is using to create the AI program in the first place. This will necessarily involve an appropriate sharing of the added value being produced by AI in the form of licensing fees. This will take a few more years, a few more lawsuits, a few court decisions, and a few government interventions in the form of legislation, but I can see no other way forward. In the interim, neither Stephen Thaler nor Jason Allen is likely to get what he wants.

© Hugh Stephens 2024. All Rights Reserved.

If AI Tramples Copyright During its Training and Development, Should AI’s Output Benefit from Copyright Protection? Part One: Stephen Thaler

“A Recent Entrance to Paradise”, Stephen Thaler (not protected by copyright)

One of the ongoing debates about works made with generative AI is whether they qualify for copyright protection. Should they? Let’s consider the essence of copyright. What is its raison d’être? According to the classical European definition, it is to respect the property rights of the author (droit d’auteur), sometimes described in the simple terms of the Eighth Commandment (“Thou shalt not steal”). According to the more utilitarian Anglo-Saxon rationale for copyright, it is to benefit society by rewarding and incentivizing authors, thus stimulating further production of works for the greater good. In either case, it gives one pause to wonder how works created by AI fit into either school of thought. Is there an inherent property right in content produced by an algorithm? How does copyright protection incentivize an algorithm to produce more “useful arts”?

In my view, the only way one can square this circle is by attributing human creation, or at least a degree of human creation, to AI generated works. But this opens Pandora’s box; if there is to be human attribution, to whom in the chain of creation does the credit fall? How much human creativity is required? It also raises the spectre of hypocrisy, as the AI industry hijacks the creative work of others, without recognition or recompense, yet has the gall to claim that AI outputs are unique and worthy of protection. I have written about these issues before (here, here, and here), but a couple of recent cases in the US have brought these fundamental issues bubbling back to the surface.

When it comes to trying to prove that AI and copyright protection go together, there are a couple of different approaches. One, most notably espoused by an AI technologist in the US named Stephen Thaler, is to claim that a given work was produced exclusively by AI but should nonetheless be protected. In Thaler’s case, the work for which he is seeking copyright registration was created by a particular AI “machine” (or algorithm), specifically the one he “invented”, the so-called “Creativity Machine”. Thaler claims his machine should hold the copyright but behind the machine, of course, stands Thaler. This is not dissimilar to existing British copyright law where, under Section 178 of the Copyright, Designs and Patents Act, 1988, works “generated by computer in circumstances such that there is no human author of the work” are nevertheless accorded copyright protection for fifty years from date of creation, with the copyright being held by “the person by whom the arrangements necessary for the creation of the work are undertaken”, even if there was no creative act undertaken by that person.

As I noted in an earlier blog post (The Humanity of Copyright), Thaler began his (so far) unsuccessful pursuit of US copyright registration for his professed 100% AI generated art work, “A Recent Entrance to Paradise”, back in 2018. Despite several reversals, both in the application process at the USCO, its Review Board and in the District of Columbia courts, Thaler persists in his quixotic journey. He has unsuccessfully argued various precedents for non-human copyright ownership, such as the “work for hire” doctrine and corporate copyright, although it is worth noting that humans stand behind both. He is now apparently pursuing the common law theory of “fruit of the tree” in his attempt to get the USCO to register his work. To my non-legal mind, this is the ultimate stretch.

What Thaler could do is to claim that at least part of the work is the result of his personal creative efforts. That was the USCO outcome for the graphic novel Zarya of the Dawn produced by writer Kristina Kashtanova (who identifies as “they”). The novel contained both generative artwork and human story and design elements. After initially registering Kashtanova’s work, the Copyright Office cancelled the registration after they (Kashtanova, that is) claimed it was AI produced. The Office subsequently reconsidered and granted copyright protection to the parts of the work Kashtanova had created, namely the text, and selection and arrangement of the work’s written and visual elements. That, however, is a step too far for Thaler who continues to push for recognition by the Copyright Office that a work produced exclusively by his “Creativity Machine” can be protected by copyright. That seems very unlikely to happen.

While Thaler doggedly pursues copyright registration for “A Recent Entrance to Paradise” (featured on this blog post; after all, it is not copyright protected), others who have created art using AI are following a different track. One of these is Jason B. Allen, whose award-winning digital art creation “Théâtre D’Opéra Spatial” has also been denied copyright registration. Allen’s approach is the opposite of that taken by Thaler. In contrast to Thaler’s insistence that the work is a creation of AI (his AI “machine”), Allen insists that he is the source of the creative inspiration behind the work, notwithstanding that it was created by an AI algorithm, Midjourney. Allen’s pursuit of copyright registration for “Théâtre” will be the subject of my blog post next week.

© Hugh Stephens, 2024. All Rights Reserved.

It Took Glue on Pizza to Spotlight Google’s AI Problem

Image: Shutterstock (with AI assist)

Google, the “indispensable” search engine relied on by millions for accurate and reliable search, has done it again, stepping smack into the pile of steaming excrement waiting for it in the middle of the road. Its most recent ill-starred foray into AI generated search has yielded some hilarious results, lighting up the blogosphere and making Google the butt of many jokes. Google flubbed the public launch of its first AI enabled service, Bard, back in early 2023, when the AI driven search function produced the wrong answer to a simple question about the James Webb Space Telescope, overnight wiping some $100 billion off Google’s valuation. Now Google’s new Gemini “AI Overview” service has scored another own goal with its “hallucinatory” responses to questions like how to ensure cheese will stay on pizza (add glue) or how many rocks a day a human should eat. (Only one, in case you were wondering.) It also informed users that Barack Obama was the first Muslim President of the United States.

When it comes to AI, “hallucinations” refer to incorrect or misleading results resulting from lack of training data, biased or selective training data, or incorrect assumptions made by the model. Hallucinations leading to trademark dilution were among the accusations levelled against OpenAI and Microsoft by the New York Times in its landmark copyright infringement case that is still working its way through the courts. In this case, the AI algorithm incorrectly attributed false information to the Times, thus undermining its journalistic credibility and diluting its trademark, or so the argument goes.

Apparently, the source of the pizza glue misinformation was an old tongue-in-cheek post on Reddit. I guess an AI algorithm has no sense of humour and can’t tell sarcasm from reality. It also gives credibility to conspiracy theories and blatantly false information, such as the Barack Obama example. Normally a search on any subject turns up a variety of sources on Google, some clearly more authoritative than others. Searchers can weigh a Wikipedia entry against a Reddit post against information from a government website or reputable academic institution. Even a plain old tendentious website put up by an advocacy organization can be probed and the bona fides of the source checked out. That is becoming more difficult, or at least less obvious, with the AI generated search summary provided by Google’s AI Overview.

If the search topic falls within AI Overview’s purview (and at the moment, not all do), viewers will see a summary of the information requested drawn from sources chosen by the algorithm. The algorithm decides how much information is drawn from any given site, and which sites are chosen. Users have the option of clicking through to access these and other sites that are displayed (below the annoying sponsored listings). However, many consumers, looking for a quick information fix, will not bother to do this and thus risk taking the AI summary as gospel. If you are being advised to mix glue into your pizza topping, you can probably figure out that something is haywire, but if the summary is only slightly wrong, or is on a subject that you are not familiar with, watch out. A good example was provided by the website Plagiarism Today. It asked Google five questions about copyright in the US. Its conclusion regarding the responses provided by AI Overview? Decidedly mixed. One A, one B, one C, one D and one resounding F.

The accuracy of the summary obviously depends on the sources of information chosen by the algorithm, and the emphasis it chooses to put on information from any given source. Unfortunately, in many instances it does not seem to prefer credible and authoritative sources, but instead goes for those that are popular. That is one of the basic problems of AI generally—quantity over quality, popularity over facts. (By the way, this account of how AI Overview works is based on reading about it from US sources since it is not yet available in Canada, which may be a good thing since I have read various US posts explaining that Overview is impossible to disable and very hard to turn off). Google intends to cram it down your throat whether you want it or not.

Of course, Google assures everyone that Overview is a “good thing” and the early kinks will be ironed out. Many websites are not happy with the new interface that will now exist between themselves and the consumer. They lose traffic when users simply read the AI summary and move on, not visiting the source website. Google used to boast that it had a symbiotic relationship with content providers because it facilitated, and even drove, eyeballs to the sites. No longer. It has appropriated–without permission–content from independent sites to feed AI Overview in the same way that the LLM (Large Language Model) AI developers have scooped up content, including copyrighted content, from rightsholders, without permission, licence or payment to enable their AI training. It is one thing to link to third party content, which requires a visit to the actual site to access the content; it is quite another to freely copy from it and mix it, sometimes inaccurately or inappropriately, with content from other websites that may not be reliable or acceptable sources of information.

Google clearly controls what goes into AI Overview and has said that it will apply more filters. If it can screen out sources of sarcasm and parody, it clearly has the capacity to install other filters that could differentiate trustworthy information from garbage. This might require Google to license the use of this curated information (Horrors! Google having to pay for the information of others that it so freely uses!). Licensing has already begun for content used for generative AI training. News Corp has just signed a licensing deal with OpenAI, as have the AP and the Financial Times. Licensing is at the heart of the dispute that OpenAI is facing with the New York Times.

Licensing presupposes knowledge of what inputs are being used, a requirement now enshrined in EU law which requires that AI developers maintain an inventory of works used for training purposes (transparency). This will allow rightsholders to opt out or negotiate a licensing solution unless the copying meets the text and data mining exception (i.e. for research by research and cultural organizations).

However, Silicon Valley has variously proclaimed that (a) it is impossible to track all the information ingested during AI training; (b) it would bankrupt the industry should they have to pay for content; (c) they need to use copyrighted content because there is not enough current public domain information available; (d) it is not feasible to filter out or identify specific works amongst the millions of datapoints that it ingests; (e) everything that it does is fair use anyway; (f) all of the above. Google’s embarrassment, and its apparent ability to fine-tune AI Overview, demonstrates that it is clearly feasible to filter out certain works and types of content. It is the will to do so generally that is lacking. Meanwhile, the number of lawsuits brought by rightsholders against AI developers continues to multiply.

By making itself the object of social media ridicule, and then admitting it can address the problem, Google has actually done us all a favour by highlighting the “garbage in; garbage out” problem. Not all copyrighted material is responsible or accurate but a good chunk of it is, such as professional journalism and academic journals. Access to that material is essential to provide credible results. And that material needs to be paid for, on terms set by the content owners.

The solution is not to stop the development of generative AI; for one thing, that won’t happen. It is to corral it, improve it and make it more trustworthy, if necessary with penalties if it is not. The penalties could be imposed by the market (i.e. Google search is not reliable so I will go elsewhere) or, in certain cases, by regulation. Licensing of accurate, credible information to drive search will inevitably distinguish the fake from the real and dubious from the trustworthy. This is what any credible search engine seeks. It is market gold.

Google, open your bulging wallet and start licensing content that will make us want to continue to try you first to get reliable information. Right now, through your clumsy rollout of AI supported search, you are rapidly losing that trust. It is also not acceptable to plagiarize someone else’s content, mix it with garbage from some other source, and serve it up on a platter to consumers on the pretext that this is the definitive answer. The result, as we have seen, is gluey pizza.

© Hugh Stephens, 2024. All Rights Reserved.

The Economics of Copyright: Incentives and Rewards (It’s Important to Get them Right)

Image: Shutterstock

Two years ago, in April 2022, the US Copyright Office (USCO) appointed its first Chief Economist, Dr. Brent Lutes. Many national Intellectual Property Offices have such a position, e.g., UK IPO, IP Australia, EUIPO, and WIPO. (Notably, Canada’s Intellectual Property Office, CIPO, does not.) All these positions have broad responsibility for assessing the economics of IP generally, covering patents, trademarks and industrial designs as well as copyright. In the US, the Patent and Trademark Office has its own Chief Economist. However, Lutes’ USCO position appears to be the only one related exclusively to assessing the economic impact of copyright. The Chief Economist’s office sits within the Office of Policy and International Affairs and is composed of a small team of economists, providing the Register of Copyrights, Shira Perlmutter, with policy-relevant research on economic issues related to copyright.

In an interview conducted last month, Lutes talked about the economic goals of copyright in terms of enhancing social welfare. He noted the goal of copyright is to contribute to the welfare of society by promoting access to creative works, now and in the future, through market based behavioural incentives. The goal of the Office of Chief Economist is to gather more information to inform policy making, such as the geographic distribution of copyright activity or the demographic characteristics of creators. As but one example, is racial or ethnic diversity related to creativity? The economic issues surrounding AI and copyright, both pro and con, are another field of research the USCO will be exploring.

In addition to finding the right economic levers to stimulate production of creative works, economic studies of copyright also demonstrate the enormous impact copyright-based industries have on national economic welfare. While the impact can depend on what economic multipliers are used and how direct versus indirect benefits are calculated, there is no question that copyright industries in most economies are very significant as job creators and multipliers. For example, IP Australia in its most recent annual report estimates that cultural and creative activity contributes about 6% of Australian GDP annually, with design, fashion, publishing, broadcasting, electronic and digital media and film being the primary industries involved. In the US the figures are even more impressive. According to the International Intellectual Property Alliance, in 2021 (the last year for which statistics are apparently available), core copyright industries in the US, defined as those industries “whose primary purpose is to create, produce, distribute or exhibit copyright materials”, added $1.8 trillion to US GDP, accounting for 7.76% of the economy. Total copyright industries, a definition that includes industries partially dependent on copyright, such as fabric, jewellery or toys and games, account for another trillion USD, even when only a portion of their total value is included in the copyright calculation.  

The UK Intellectual Property Office published its IP survey in 2022, comparing the role of patents, trademarks, registered industrial designs and copyright. While copyright industries were on the low side for exports (£4.7 billion as opposed to patents at £120.6 billion), copyright’s “non-financial value-added output” (IP data is not available for the financial industries, thus the description of “non-financial”) trounced that of patent industries by almost 2:1. As with the US IIPA study, the UK report accounted for the degree to which certain industries depend on copyright, categorizing them as core, interdependent, partial or non-dedicated support industries, adjusting the amount of copyright contribution accordingly. Book publishing, for example, is considered a 100% copyright industry and its value is calculated as such, whereas for an industry such as paper manufacturing, only 25% of the value was included in the calculation of copyright benefits. This methodology followed that of the World Intellectual Property Organization, aka WIPO, which also conducts economic studies and assists national authorities with their own. Economists are careful people, not prone to exaggeration, and consistent methodology is important to ensure accurate measurement and reporting.
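To make the weighting idea concrete, here is a minimal sketch of the arithmetic. Only the two weights (100% for book publishing, 25% for paper manufacturing) come from the report as described above; the value-added figures, the `Industry` class and the `copyright_contribution` function are hypothetical, invented purely for illustration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Industry:
    name: str
    value_added: float       # hypothetical figure, in billions
    copyright_factor: float  # share of value attributed to copyright (0 to 1)


def copyright_contribution(industries: Iterable[Industry]) -> float:
    """Sum each industry's value added, scaled by its copyright factor."""
    return sum(i.value_added * i.copyright_factor for i in industries)


industries = [
    Industry("book publishing", 10.0, 1.00),     # core: counted in full
    Industry("paper manufacturing", 8.0, 0.25),  # partial: 25% counted
]

total = copyright_contribution(industries)
print(total)  # 12.0
```

With these made-up inputs, paper manufacturing adds only 2.0 of its 8.0 to the copyright total, which is why consistent weights matter when comparing studies.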

WIPO worked with the Department of Canadian Heritage to produce a report in 2020 on “The Economic Impact of Canada’s Copyright-Based Industries”. As with other deep dives on the economic benefits of copyright, this study produced similarly notable statistics. For example, while many copyright opponents in Canada were deploring the extension of the copyright term of protection in Canada, arguing that the result would be an outflow of royalties to foreign rights-holders because Canada was a net importer of copyrighted materials, the Heritage report established that “Canada has exported more copyright-related services than it has imported, maintaining a trade balance surplus from 2009 ($2.5 billion) to 2019 ($5.6 billion)”. In actual fact, extending the copyright term in Canada brought with it the additional benefit of a reciprocal extended term in many foreign countries for Canadian works, clearly benefiting Canadian rights-holders. The Heritage study went on to document a range of other important outcomes such as employment (over 600,000), contribution to GDP ($95.6 billion) and percentage of GDP (4.9%). All figures are based on 2019 data. No update has been published since. It is just as well that Heritage Canada took the lead in preparing this report since the government department holding lead statutory responsibility for copyright in Canada, the mammoth Department of Industry, Science and Economic Development (ISED), unfortunately seems to treat copyright as but a tiny pimple on its elephantine rump.

While the studies cited above highlight the economic contribution that copyright industries make to national economies in terms of jobs and wealth generation, let us not forget the key point that Dr. Lutes underlined regarding the social welfare contribution of copyright through using market-based incentives to promote and encourage creativity and investment in creative outputs. It is hard, if not impossible, to put a dollar amount on the social welfare benefits of creative expression and cultural sovereignty, but they are immense if incalculable. Without copyright, not only would existing content-based industries be unable to thrive and expand, but the formula to encourage new, original content would be missing.

Notwithstanding the importance of a robust copyright framework for both economic and social welfare, creators and content-based copyright industries are facing major challenges today. Some are technological, like the emergence of generative AI; some behavioural, such as a wide tolerance, even acceptance, of piracy and free riding. The struggle against piracy is ongoing and protracted, a cat and mouse game. Free riding is what AI developers are doing on the backs of content creators through unauthorized training of AI models on copyrighted content, with resultant legal challenges. There is also the question of whether wholly AI generated works should be accorded copyright protection. As the Copyright Alliance has observed, the Copyright Clause in the US Constitution is premised on the promotion of the “progress of science and useful arts” by protecting for a limited period of time the writings and discoveries of authors and inventors. Given that premise, it should be self-evident that creator incentivization is not applicable to machines, which neither need nor comprehend economic incentives to create.

Free riding is also what the education sector has been doing in Canada under the specious umbrella of “education fair dealing”, introduced through copyright amendments in 2012 that broadened the scope of fair dealing. Since then, the “education industry” at the public, secondary and post-secondary level has been siphoning off economic value from writers and other creators to the tune to date of over CAD$200 million. Their legalized renunciation of collective reprographic licensing is ostensibly to benefit students but is in fact a transfer of wealth from creators to the bottom line of educational institutions. If a key objective of copyright is to incentivize creation of new content, such as materials used by educational institutions to teach students, then the current interpretation of education fair dealing in Canada upends a key rationale for granting copyright protection in the first place. (As a footnote, I should add that not all arguments in favour of copyright are based solely on economic incentives. There is also the question of natural justice and equity, providing authors with a degree of control over works they have created).

Since court challenges have unfortunately proven ineffective, the remedy for Canada’s education fair dealing fiasco is for the Government of Canada to amend the Copyright Act so that rightsholders are properly compensated when their works are used in Canada. Both the copyright collective in English Canada, Access Copyright, and its Québec counterpart, Copibec, have recently called for legal clarification of the nature and extent of educational fair dealing.

Thorough documentation of the contribution that copyright makes to economic and social welfare helps substantiate the case for adequate legal frameworks, including combatting piracy and ending copyright free riding. Sound economic data are essential to sound policy making. The initiative of the US Copyright Office to appoint a Chief Economist helps to meet these goals and is to be commended.  Should the Canadian Intellectual Property Office ever create such a position, its first task should be to evaluate the full economic and social costs of the current short-sighted interpretation of fair dealing in Canada’s education sector in terms of its negative long-term impact on creativity and cultural sovereignty in the country.

The Scottish writer Thomas Carlyle may have described economics as the “dismal science”, an oft-quoted remark, but rather than being dismal it is in fact just the opposite; it sheds light on the importance of copyright to maintaining a well-functioning, equitable and culturally rich modern society.

© Hugh Stephens, 2024. All Rights Reserved.

Artificial Intelligence and Copyright: The Canadian Cultural Community Speaks Out

Image: http://www.shutterstock.com

The extended period set by the Canadian Government (through Innovation, Science and Economic Development Canada, ISED) for response to its consultation paper on Artificial Intelligence (AI) and Copyright closed on January 15. We will start to see a flurry of submissions released by participants while ISED digests and assesses the input it has received. One of the first is the submission from the Coalition for the Diversity of Cultural Expression (CDCE), which represents over 360,000 creators and nearly 3,000 cultural businesses in both French and English-speaking parts of Canada. CDCE’s membership includes organizations representing authors, film producers, actors, musicians, publishers, songwriters, screenwriters, artists, directors, poets and music publishers; in short, just about every profession you can think of that depends on creativity, and protection for creative output. The CDCE submission highlights three key recommendations, summarized as follows:

  • No weakening of copyright protection for works currently protected (i.e. no exception for text and data mining to use copyrighted works without authorization to train AI systems)
  • Copyright must continue to protect only works created by humans (AI generated works should not qualify)
  • AI developers should be required to be transparent and disclose what works have been ingested as part of the training process (transparency and disclosure).

While none of these recommendations are surprising, and from my perspective are eminently reasonable, I am sure we will also see a number of submissions arguing that, “in the interests of innovation”, access to copyrighted works is not only essential but should be freely available without permission or payment. OpenAI, the motive force behind ChatGPT and the defendant in the most recent high-profile copyright infringement case involving AI (When Giants Wrestle, the Earth Moves (NYT v OpenAI/Microsoft)), has already staked out part of this position. In its brief to the UK House of Lords Select Committee looking into Large Language Models (LLMs), a key technology that drives AI development, the company says:

“Because copyright today covers virtually every sort of human expression–including blog posts, photographs, forum posts, scraps of software code, and government documents–it would be impossible to train today’s leading AI models without using copyrighted materials (emphasis added). Limiting training data to public domain books and drawings created more than a century ago might yield an interesting experiment, but would not provide AI systems that meet the needs of today’s citizens.”

OpenAI claims that it respects content creators and owners and looks forward to continuing to work with them, citing among other things, the licensing agreement for content it has signed with the Associated Press. But failure to reach a licensing deal with the New York Times is really the crux of the lawsuit that the media giant has brought against OpenAI and its key investor Microsoft. If reports are true that OpenAI’s licensing deals top out at $5 million annually, it is not surprising that licensing negotiations between the Times and OpenAI broke down over such lowball offerings.

As for the CDCE submission to ISED, it recommends that the government refrain from creating any new exceptions for text and data mining (TDM) since this would interfere with the ability of users and rightsholders to set the boundaries of the emerging market in licensing. No copyright exemption for AI is what the British government has just confirmed, after playing footsie with the concept for over a year. Apart from the examples of the licensing deals that OpenAI has with the Associated Press and German multimedia giant Axel Springer, the CDCE paper notes a range of other recent examples of content owners offering access to their product through licensing arrangements, including Getty Images, Universal Music Group and educational and scientific publishers like Elsevier. The paper also urges the government to avoid interfering in the market when it comes to setting appropriate compensation, leaving it to market players or, where the players can’t reach agreement, to the quasi-judicial Copyright Board.

In my view, licensing is going to be the solution that will eventually level the playing field, but to get there it will require that major content players lock out the AI web-crawlers while pursuing legal redress, as the NYT is doing. This will help to open the licensing path to smaller players and individual creators who don’t have the resources available to employ either technical or legal remedies. (The issue of what has already been ingested without authorization still needs to be settled.) As for the tech industry’s suggestion that creators can opt out of content ingestion if they wish, CDCE rightly points out that this is standing the world on its head, and would be contrary to longstanding copyright practice. Not only is it impractical in a world where what goes into an AI model is a black box (thus the imperative for transparency) but it is like saying a homeowner has to request not to be burgled, or else can expect to become a target.

On the question of whether AI generated works should be granted copyright protection, CDCE points out the double-standard of proposing an exception to copyright for TDM for inputs while claiming copyright protection for AI generated outputs. The need for human creativity is a line that has been firmly held by the US Copyright Office, pushing back on various attempts to register AI-generated (as opposed to AI-assisted) works. Canada has not been quite so clear cut in its position, owing to the way in which copyright is registered (almost by default, without examination) in Canada, as I pointed out in this blog post (A Tale of Two Copyrights). While AI generated works have received copyright protection in Canada (Canadian Copyright Registration for my 100 Percent AI-Generated Work), this is more by oversight than design, given the way the Canadian copyright registration system works.

Thirdly, we turn to transparency, a sine qua non if licensing solutions are to be implemented.  If authors don’t know whether their works are being used to train AI algorithms, or can’t easily prove it, licensing will fall flat. CDCE calls for publication of all content ingested into training models, disclosure of any content outputs that contain AI, and design of AI models to prevent generation of illegal or infringing content. This is similar to requirements already under consideration in the EU.

CDCE also makes the important point that it is not just copyright legislation that defends individual and collective rights against the incursions of AI and big AI platforms. While the Copyright Act offers some protection to creators, privacy legislation is important for all citizens. As the UK Information Commissioner has pointed out in a recent report, the legal basis for web-scraping is dependent on (a) not breaching any laws, such as intellectual property or contract laws and (b) conformity with UK privacy laws (the GDPR, or General Data Protection Regulation), where the privacy rights of the individual may override the interests of AI developers, even if data scraping meets other legitimate interest tests.

Finally, there is the question of the moral rights of creators that can be threatened by misapplication of AI, whether it is infringement of a performer’s personality or publicity right, distortion of their performance or creative output, misuse of their works for commercial or political reasons or any of the other reasons why copyright gives the creator the right to authorize use of their work.

Quite apart from the question of AI, there are of course other outstanding copyright questions that need to be resolved urgently, including the longstanding issue of the ill-conceived education “fair dealing” exception that has undermined if not permanently damaged the educational publishing industry in Canada. This exception needs to be narrowed to allow users continued unlicensed access to copyrighted materials under fair dealing guidelines for study, research and educational purposes but to limit institutional use to situations only where a work is not commercially available under a license from a rightsholder or collective society. While this issue requires looking back and fixing something that is already broken, policy making with respect to AI and copyright needs to anticipate the future and “do no harm”, while requiring AI developers to open up their black boxes and respect existing rights. This should be achieved by maintaining and protecting the rights of creators in ways that will facilitate market-based licensing solutions for use of copyrighted content by AI developers, while ensuring that creative output remains the domain of human beings, and not machines.

© Hugh Stephens, 2024.

When Giants Wrestle, the Earth Moves (NYT v OpenAI/Microsoft)

Image: www.shutterstock.com

There is no better way to start out the New Year, 2024, than with a commentary on Artificial Intelligence (AI) and copyright. It was the big emerging issue in 2023 and is going to be even bigger in 2024. The unlicensed and unauthorized reproduction of copyright-protected material to train AI “machines”, in the process often producing content that directly competes in the market with the original material, is the Achilles heel of AI development. To date, no one knows if it is legal to do so, in the US or elsewhere, as the issue is still before the courts. The cases brought to date by artists, writers and image content purveyors like Getty Images have not always been the strongest or best thought out. In one instance, the plaintiffs had not even registered the copyright on some of the works for which they were claiming infringement, a fatal flaw in the US where registration is a sine qua non in order to bring an infringement case. That may have been the most egregious example of a rookie error but in general the artists’ and writers’ cases have not gone too well so far, although the process continues. Some cases are on stronger grounds than others. Here is a good summary. The Getty Images case will be an interesting one to watch. And now the New York Times has weighed in with a billion-dollar suit against OpenAI and Microsoft. The big guys are now at the table and the sleeves are rolled up. The giants are wrestling.

What is at issue could be nothing less than the survival of the news media and the ability of individual creators to protect and monetize their work. It could also open a pathway to legitimacy for the burgeoning AI industry. The ultimate solution is surely not to put a halt to AI development, nor to put content creators out of business. It is to find a modus vivendi between the needs of AI developers to ingest content in order to train algorithms that will “create” (sort of) content–assembled from vast swathes of input–and the rights of content creators. While training sets are generally very large, some of the input can be very creator-specific and the output very creator-competitive. This is where the New York Times comes in.

The Times, like any enterprise, needs to be paid for the content it creates in order to stay in business and create yet more content. If its expensively acquired “product”, whether news, lifestyle, cooking, book reviews or any of the other content that Times’ readers crave and are willing to pay for, can be obtained for free through an AI algorithm (“What is the most popular brunch recipe in the NYT using eggs, bacon and spinach”, or “What does Thomas Friedman think of…..”), this creates a huge disincentive to go to the source and undermines journalism’s business model, already under severe stress and threat.

The Times is one of the few journals that has managed to thrive, relatively speaking, in the new digital age at a time when many of its competitors are dying on the vine. According to Press Gazette, the New York Times is the leading paywalled news publisher, with 9.4 million subscribers. (Wall Street Journal and Washington Post are numbers two and three respectively). You need to pay to read the Times, and why not? But paying for access does not give you the right to copy the content, especially for commercial purposes. (The Times offers various licensing agreements for reproduction of its content, with cost dependent on use). Technically, all it takes is one subscription from OpenAI and the content of the Times is laid bare to the reproduction machines, the “large language models”, or LLMs, used by the AI developers. The Times has now thrown down the gauntlet. Its legal complaint, 69 pages long, makes compelling reading. If there ever was a “smoking gun” putting the spotlight directly on the holus-bolus copying and ingestion of copyright protected proprietary content in order to produce an unfair directly-competing commercial product that harms the original source, this is it. It’s a far cry from earlier copyright infringement cases brought by some artists and writers.

While you can read the complaint yourself if you are interested (recommended reading), let me tease out a few of the highlights. After setting out the well-proven case for the excellence of its journalism, the Times’ complaint notes that while the defendants engaged in widespread copying from many sources, they gave Times’ content particular emphasis when building their LLMs, thus revealing a preference that recognized the value of that content. The result was a free ride on the journalism produced at great expense by the Times, using Times’ content to build “substitutive products” without permission or payment.

Not only does ChatGPT at times regurgitate the Times’ content verbatim, or closely summarize it while mimicking its style, at other times it wrongly attributes false information to the Times. This is referred to in AI circles as “hallucination”, something the complaint labels misinformation that undermines the credibility of the Times’ reporting and reputation. Hallucination is a particularly dangerous element of AI produced content. Rather than admitting it doesn’t know the answer, the AI algorithm simply makes it up, complete with false references and attributions, all of which make it very difficult for the average reader to separate fact from fiction. This misinformation is the basis of the Times’ complaint for trademark dilution that accompanies various other copyright related complaints of infringement. Concrete examples of such misinformation are provided in the complaint.

So too is ample evidence of users exploiting ChatGPT to pierce the Times’ paywall, by asking for the completion of stories that have been blocked for non-subscribers. There are concrete examples of carefully researched restaurant and product reviews that have been replicated virtually verbatim. Not only is the Times’ subscription model undermined, but the value it derives from reader-linked product referrals from its own platform bleeds to Bing when the product is accessed through Microsoft Search enabled by ChatGPT. Examples are given of full news articles based on extensive Times’ investigative reporting being reproduced by ChatGPT, with only the slightest variations. These are not composite news reports of what is happening in Gaza, for example, but a word-for-word lifting of a Times’ analysis of what Hamas knew about Israeli military intelligence. The Times’ complaint makes for chilling reading. AI’s hand has been caught firmly in the cookie jar.

What does the Times want out of all of this? The complaint does not specify a dollar amount, while noting the billions in increased valuation that has accrued to OpenAI and Microsoft as a result of ChatGPT. However, it asks for statutory and compensatory damages, “restitution, disgorgement, and any other relief that may be permitted by law or equity” as well as destruction of all LLM models incorporating New York Times’ content, plus, of course, costs. If the Times gets its way, this will be a huge setback for AI development as well as for OpenAI and Microsoft, but of course it may not come to that. The complaint notes that the Times had tried to reach a licensing deal with the defendants. OpenAI cried foul, expressing “disappointment”, and noting that they had been having “productive” and “constructive” discussions with the Times over licensing content. However, to me this is a bit like stealing the cookies, getting caught red-handed and offering to negotiate to pay for them, then crying foul when your offer is rebuffed. The Times has just massively upped the ante, making the potential licensing fees much more valuable.

The irony is that the use of NYT material by OpenAI or indeed other platforms like Google or Facebook potentially brings some advantage and drives some business to the Times, while obviously also providing commercial benefits to the AI program, search engines or social media platforms. The real question will be how that proprietary content is used, and how much is paid to use it. A similar issue is being played out in another context, most recently in Canada with Bill C-18 where news media content providers wanted the big platforms (Google and Meta/Facebook) that derive benefit from using or indexing that content to pay for accessing it. The result in Canada was both a standoff and a compromise. Facebook blocked Canadian news content rather than pay for it, while Google agreed to create a fund for access by the news media in return for being exempted from the Canadian legislation.

The NYT-OpenAI/Microsoft lawsuit is a different iteration of the same principle. Businesses that gain commercial advantage from using proprietary content of others should contribute to the creation of that content, either through licensing or some other means such as a media fund. The most logical outcome of the Times’ lawsuit is almost certainly going to be a licensing agreement. Given the seemingly unstoppable wave of AI development, meaningful licensing agreements would seem to be the best way to ensure fairness and balance of interests going forward.  

A Goliath like the New York Times is in a much better position to make this happen than a disparate group of writers and artists. Indeed, there are logistical challenges in being able to license the works of tens of thousands of content creators. In an earlier blog post, I postulated that perhaps copyright collectives might find a role for themselves in this area in future. In my view, ultimately the only logical solution to the conundrum of respecting rights-holders while facilitating the development of AI is to find common ground through fair and balanced licensing solutions. The wrestling giants of the NYT and Microsoft may help show the way.

© Hugh Stephens 2024. All Rights Reserved.

AI’s Copyright Challenges: Searching for an International Consensus

Image: Shutterstock

This has been a busy couple of weeks for national and international declarations on Artificial Intelligence (AI). First the G7 issued its International Code of Conduct for Advanced AI Systems on October 30.  The same day US President Biden signed the Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence, followed by the Bletchley Declaration at the conclusion of the “AI Summit” hosted by UK Prime Minister Rishi Sunak in Bletchley Park, Buckinghamshire, a couple of days later. Meanwhile, the EU’s AI Act is being touted by its sponsor as a potential model for AI legislation in other parts of the world (although its enactment is currently bogged down in the trilogue process between the Commission, EU Council and European Parliament). Notable was the fact that the US Executive Order, a wide-ranging framework document covering many aspects of the AI issue, effectively “scooped” the Brits by a day or so, allowing Vice President Kamala Harris to highlight steps the US had just announced when speaking to the press at Bletchley.

The declarations all addressed many of the concerns surrounding AI, ranging from safety and security, fraud and cybersecurity to privacy, equity and civil rights to protecting consumers, supporting workers and promoting innovation. A key issue only lightly touched on in these declarations, however, was that of AI’s intersection with copyright. This was a missed opportunity to come to grips with a major concern regarding how AI will be able to co-exist with copyright law. (The EU’s draft AI Act includes a transparency requirement to “document and make publicly available a sufficiently detailed summary of the use of training data protected under copyright law”, Article 28(b) 4(c).)

AI faces two significant challenges when it comes to copyright protection. First, with respect to the inputs that AI developers use to populate their models to produce generative AI, there is the unresolved question as to whether the free use of copyrighted content violates copyright law by making unauthorized reproductions. There are currently a number of lawsuits underway in the US examining this fundamental question. Many creator groups, such as the News Media Alliance in the United States, argue that “the pervasive copying of expressive works to train and fuel generative artificial intelligence systems is copyright infringement and not a fair use”.

Second, with respect to outputs, the work generated by AI has two challenges in terms of obtaining the benefits of copyright protection. If its inputs are infringing, that clearly casts doubt on the legality of the derivative outputs. In addition, there is the problem posed by the current position of the US Copyright Office (and most other copyright authorities) that to be copyright-protected a work must be an original human creation. After the infamous Monkey Selfie case, the USCO issued an interpretive bulletin reiterating the need for human authorship and, to date, it has hewed to this line when examining applications for copyright registration from authors claiming works produced by AI.

The G7 Declaration was broad, covering a wide range of issues related to AI. It included a reference to the copyright issue under Point 11, “Implement appropriate data input measures and protections for personal data and intellectual property”, specifically stating that, “Organizations are encouraged to implement appropriate safeguards, to respect rights related to privacy and intellectual property, including copyright-protected content.” This is hardly prescriptive language, but it is a beginning. I understand that the creative community had to fight hard to get this wording included, but it is at least recognition of the issue.

With respect to the US Administration’s Executive Order, the issue of copyright was also acknowledged, but in a somewhat backhanded way. Section 5.2 (Promoting Innovation), addresses copyright as part of clarifying issues “related to AI and inventorship of patentable subject matter”. Paragraph (c)(iii) declares that the Under Secretary of Commerce for Intellectual Property and Director of the US Patent and Trademark Office shall:

“within 270 days of the date of this order or 180 days after the United States Copyright Office of the Library of Congress publishes its forthcoming AI study that will address copyright issues raised by AI, whichever comes later, consult with the Director of the United States Copyright Office and issue recommendations to the President on potential executive actions relating to copyright and AI. The recommendations shall address any copyright and related issues discussed in the United States Copyright Office’s study, including the scope of protection for works produced using AI and the treatment of copyrighted works in AI training.”

This is not exactly a ringing endorsement of the need for respecting the copyright of those who, willingly or not, provide the raw material for the voracious AI machines that are busy scooping up creators’ content, but it is nonetheless an acknowledgment that there’s an issue that needs addressing.

The US Copyright Office (USCO) launched its study on Artificial Intelligence and Copyright on August 30 of this year “to help assess whether legislative or regulatory steps in this area are warranted”. By the end of October, the USCO had already received more than 10,000 submissions. The comments range from statements by AI developers as to why they shouldn’t be required to pay for copyrighted content used as inputs in developing their models (while of course claiming they should enjoy the benefits of copyright protection for their AI generated outputs), to submissions by creator organizations that argue, among other things, that the ingestion of copyrighted material by AI systems is not categorically fair use and that AI companies should license works they ingest. Licensing their content to AI companies as an additional revenue stream is precisely what major media companies are currently engaged in.

If the US, currently and for the foreseeable future the leading country in development of AI, is thrashing around trying to address this question, one can imagine the process taking place elsewhere. Will the need to set standards inevitably lead to some form of international consensus for the regulation of AI, including the role of copyrighted content? I think it will be essential. Countries that are too lax in protecting their creative sectors will see their copyright-protected cultural industries suffer negative economic consequences; countries that are overly protective of content are worried that investment in AI innovation will flow to countries with lower copyright standards, becoming a race to the bottom for creators.

The UK government has already felt the pinch of this dilemma. In a misguided attempt to gain a head start in the AI development race, about a year and a half ago the British government unveiled a proposal sponsored by the UK Intellectual Property Office (of all entities!) to create an unlimited text and data mining (TDM) exception to copyright, at the same time stripping rights-holders of their ability to license their content for TDM purposes, or to contract or opt out. In the words of the discussion paper accompanying the draft legislation, in order to reduce the time needed to obtain permission from rightsholders and to eliminate the need to pay license fees:

“The Government has decided to introduce a new copyright and database right exception which allows TDM for any purpose …Rights holders will no longer be able to charge for UK licences for TDM and will not be able to contract or opt-out of the exception.”

This outrageous attempted expropriation of intellectual property rights aroused a storm of protest from the UK’s vibrant cultural sector, a backlash that found resonance in Parliament. As a result, the British government backed off, and withdrew the proposed legislation. However, one wonders if the stake has truly been driven through the heart of this hi-tech gambit or whether, like Dracula, this misguided policy will rise again (UK Parliamentary Committee Shoots Down Copyright Exemption for AI Developers–But is it Really Dead?). Certainly, British publishers are not convinced the content grab is over. According to the Guardian, they have just issued a statement urging the UK government, “to help end the unfettered, opaque development of artificial intelligence tools that use copyright-protected works with impunity.”

Canada has just launched a public consultation on AI and Copyright (“Copyright in the Age of Generative Artificial Intelligence”), and others will be doing the same. In Australia, Google, responding to a review of copyright enforcement, urged the government to relax copyright laws to allow artificial intelligence to mine websites for information across the internet (even though this wasn’t the topic of the enquiry). Meanwhile, the Attorney-General’s Department has been conducting several roundtables to explore the issue, the most recent being at the end of August. In that roundtable, representatives of the Australian creative community called for greater transparency around how copyright material is being used by AI developers during the input training and output process.

And so, the search for the right formula goes on. It will not be easy to find the elusive international consensus, especially since at the moment (with the exception of China) this is an issue on the agenda only of the so-called Global North.

How the heavy-hitters will deal with the issue of AI, including its intellectual property dimensions, remains to be seen. There could be something as relatively powerless as OECD Guidelines that emerge or regulation could go a lot further, including the establishment of some kind of international agency with the “authority” to regulate in the area of AI, as suggested by Elon Musk and others. However, as we have seen with every international organization created to date, whether it be the UN, World Trade Organization, International Atomic Energy Agency, or any of the myriad other supra-national structures created in recent years, the authority granted them is only as good as the commitment of their signatory states. It makes sense to harmonize and set broad international standards for the way in which AI is created and used, but it will be a long road to get there.

The challenge of how copyright can intersect with AI–to the mutual benefit of both–has still to be worked out. The courts are playing a role, as are technology, evolving business models, and legislation. Society needs to find the sweet spot where both human creation and technological advancement in the form of AI can co-exist for the benefit of society at large. Despite recent pronouncements, the search continues.

© Hugh Stephens, 2023. All Rights Reserved.

This post has been updated to include reference to the ongoing roundtable process underway in Australia under the aegis of the Attorney-General’s Department to explore, inter alia, questions of AI and copyright.