AI’s Habit of Information Fabrication (“Hallucination”): Where’s the Human Factor?

An illustration of a cartoonish robot face on a computer screen with the text 'THE WORLD IS FLAT' above it.

Image: Shutterstock (with AI assist)

It is well known that when AI applications can’t respond to a query, instead of admitting they don’t know the answer, they often resort to “making stuff up”, a phenomenon commonly called “hallucination” but which should more accurately be called what it is: total fabrication. This was one of the legal issues raised by the New York Times in its lawsuit against OpenAI, with the Times complaining, among other things, that false information attributed to the paper by OpenAI’s bot undermined the credibility of Times journalism and diminished its value, leading to trademark dilution. According to a recent article in the Times, the incidence of hallucination is growing, not shrinking, as AI models develop. One would have thought that as the models ingest more material, including huge swathes of copyrighted and curated material such as content from reputable publications like the Times (without permission in most instances), their accuracy would improve. That does not seem to be the case. Given AI’s hit-and-miss record on accuracy, it should be evident that AI output cannot be trusted or, at the very least, can only be trusted if verified. Not only is AI built on the backs of human creativity (with a potentially disastrous impact on creators unless the proper balance is struck between AI training and development and the rights of creators to authorize and benefit from the use of their work), but human oversight and judgement are required to make it a useful and reliable tool. AI on auto-pilot can be downright dangerous.

The most recent outrageous example of AI going astray is the publication by the Chicago Sun-Times and Philadelphia Inquirer, both reputable papers (or at least they used to be), of a summer reading list in which only five of fifteen books listed were real. The authors were real but most of the book titles and plots were just made up. Pure bullshit produced by AI. The publishers did a lot of backing and filling, pointing to a freelancer who had produced the insert on behalf of King Features, a unit of Hearst. Believe it or not, it was actually licensed content! That freelancer, reported to be one Marco Buscaglia, a Chicago “writer”, admitted that he had used AI to create the piece and had not checked it. “It’s 100% on me”, he is reported to have said. No kidding. Pathetic. Readers used to have an expectation that when a paper or magazine published a feature recommending something, like a summer reading list, the recommendation represented the intellectual output of someone who had done some research, exercised some judgement, and had presumably even read or at least heard about the books on the list. How could anyone recommend non-existent works? The readers trusted the newspaper, the paper trusted the licensor, the licensor trusted the freelancer, the so-called author. Nobody checked. Where was the human element? The list wasn’t worth the paper it was printed on.

The same problem of irresponsible dependence on unverified information produced by AI is growing in the legal field. Prominent lawyer and blogger Barry Sookman has just published a cautionary tale about the consequences of using hallucinatory AI legal references. Counsel for an applicant in a divorce proceeding in Ontario cited several legal references using the CanLII database (for more information on CanLII, see “AI-Scraping Copyright Litigation Comes to Canada (CanLII v Caseway AI)”) that the presiding judge could not locate, because they did not exist. He suspected the factum had been prepared using Generative AI and threatened to cite the lawyer in question for contempt of court, noting that putting forward fake cases in court filings is an abuse of process and a waste of the court’s time. The lawyer in question has now confirmed that AI was used by her law clerk and that the citations went unchecked, and she has apologized, thus avoiding a contempt citation. Again, nobody checked (until the judge went looking for the references cited).

This is not even the first case in Canada where legal precedents fabricated by AI were presented to a court. Last year in a child custody case in the BC Supreme Court, the lawyer for the applicant was reprimanded by the presiding judge for presenting false cases as precedents. The fabricated information was discovered by the defence attorneys when they went to check the applicant’s lawyer’s arguments. As a result, the applicant’s lawyer was ordered to personally compensate the defence lawyers for the time they took to track down the truth. The perils of using AI to argue legal cases first came to prominence in the US in 2023 when a New York federal judge fined two lawyers $5000 each for submitting legal briefs written by ChatGPT, which included citations of non-existent court opinions and fake quotes.

Another area fraught with consequences for using unacknowledged AI-generated references is academia. The issue extends well beyond undergraduate student essays being researched and written by AI to include graduate students, PhD candidates and professors taking shortcuts. This university library website, in its guide to students on the use of AI-generated content, notes that LLMs (the Large Language Models used in AI) can hallucinate as much as 27% of the time and that factual errors are found in 46% of their output. The solution is pretty simple: when writing a research paper, don’t cite sources that you didn’t consult.

This brings up the problem of “you don’t know what you don’t know”. If your critical faculties are too weak to detect a fabricated response, you are in trouble. Of course, some hallucinations are easier to spot than others. Sometimes checking simply means verifying that a fact stated in an AI response is accurate or that a cited reference actually exists (though it should then be read to determine its relevance). In other cases, it may be more subtle, with the judgement and creativity of the human mind being brought into play to detect a hallucination. That requires experience, knowledge and context, all of which may be lacking in a junior clerk or student intern assigned the task of compiling information. This is all the more reason why it is important for those using AI to check sources and to exercise quality control. Part of the process is to ensure transparency. If AI is used as an assist, that should be disclosed.

At the end of the day, AI depends on human creativity and accurate information produced by humans. Without these inputs, it is nothing. This brings us to the fundamental issue of whether and how copyright protected content should be used in AI training to produce AI generated outputs.

The US Copyright Office has just released a highly anticipated study on the use of copyrighted content in generative AI training. Here is a good summary produced by Roanie Levy for the Copyright Clearance Center. The USCO report is clear in stating that the training process for AI implicates the right of reproduction. That is not in doubt. It then examines fair use arguments under the four factors used in the US. Notably, with respect to the purpose and character of the use of works for training, the USCO notes that the use of copyrighted content for AI training may not be transformative if the resulting model is used to generate expressive content or potentially reproduce copyrighted expression. It notes that the copying involved in AI training can threaten significant harm to the market for, or value of, copyrighted works, especially where a model can produce substantially similar outputs that directly substitute for works used in the training data. This report is not binding on the courts, but it is a considered and well-researched opinion by a key player.

It is interesting to note that the report was released quickly in a pre-publication version on May 9, just a day before the Register of Copyrights (the Head of the Office) Shira Perlmutter was dismissed by the Trump Administration and a day after the Librarian of Congress, Carla Hayden (to whom Perlmutter reports) was fired. Washington is rife with speculation on the causes for, and the legality of, the dismissals. We will no doubt hear more on this. With respect to fair use in general, the study concludes that “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets…goes beyond established fair use boundaries”. The anti-copyright Electronic Frontier Foundation (EFF), of course, disagrees. (Which probably further validates the USCO’s conclusions).

The USCO study is about infringement, not hallucination or fabrication, yet both stem from the indiscriminate application and use of AI where the human factor is largely ignored and devalued. Human creativity and judgement are needed to set guardrails on both. Transparency as to what content has been used to train an AI model, along with licensing reputable and reliable content for training purposes, are important factors in helping AI to get its outputs right. Not taking an AI output as gospel, but applying a degree of diligence, common sense, fact verification and experienced judgement, is another important factor in deploying AI as it should be used: as an aid to make human creativity and human-directed output more efficient, not as a substitute for thinking or original research. Generative AI must be the servant, not the master. Human creativity and judgement are needed to ensure it stays that way.

© Hugh Stephens, 2025. All Rights Reserved.

Looking Back at 2024: It’s All About AI and Copyright (And a Few Other Things)

Image: Shutterstock

A retrospective on the year now coming to a close is what one expects at this time of year, so I will try not to disappoint. However, when I look back at the copyright developments I wrote about in 2024, the dominant issues that jump out are AI, AI and AI. You can’t read or think about copyright without Artificial Intelligence, or to be more correct, Generative Artificial Intelligence (GAI), occupying most of the space, despite many other issues on the copyright agenda. The mantra of “AI, AI and AI”, as in “Location, Location and Location”, is apt because there are at least three important copyright dimensions related to AI: the training of AI models; copyright protection for outputs generated by AI; and infringement of copyright by works created with or by AI. Of the three, the use of copyrighted content for AI training is the most salient.

Last year in my year-ender, I also discussed AI and the numerous lawsuits that were emerging as rightsholders pushed back on having their content vacuumed up by AI developers to train their algorithms. Those lawsuits have only multiplied. At last count, there are more than 30 cases in the US, ranging from big media vs big AI (New York Times v OpenAI/Microsoft) to class action suits brought by artists and authors, as well as litigation in the UK, EU, and now in Canada (see here and here). That is just on the input side.

In terms of output, i.e. whether works produced by an AI can be copyrighted, there are a couple of interesting cases in the US where applications for copyright registration have been refused by the US Copyright Office (USCO) because of a lack of human creativity. A couple of months ago, I discussed two such high profile cases, one brought by Stephen Thaler, and the other by Jason Allen. To date the USCO is not budging, although it is undertaking an extensive study of the issue. Part 1 of its study, on digital replicas, was published in July of this year. The next section, on copyrightability, is expected to be published in January, with the issues of ingestion for training and licensing to follow in Q1 2025.

While the USCO has to date denied applications for copyright registration of AI-generated works, the Canadian copyright office (CIPO, the Canadian Intellectual Property Office) has been caught up in a problem of its own making. This is because Canadian copyright registration is granted automatically, so long as tombstone data and the prescribed fee are provided. The work for which registration is sought is not examined. As a result, copyright certificates have been issued for works created by AI, notwithstanding the general presumption that copyright protection is accorded only to human-created work (although this is not explicitly stated in the Act). In July a legal challenge was launched against copyright registrant Ankit Sahni, who successfully registered a work with CIPO claiming an AI as co-author. The case was brought by the Canadian Internet Policy and Public Interest Clinic (CIPPIC) at the University of Ottawa, as I wrote about here (Canadian Copyright Registration and AI-Created Works: It’s Time to Close the Loophole).

While the courts in the US, UK, Canada and elsewhere are grappling with various issues related to AI and copyright, governments are studying the issue.

In Australia, the Select Committee on Adopting Artificial Intelligence issued its final report in November. While the report was wide-ranging, three of its recommendations related to copyright:

engagement with the creative industry to address the unauthorized use of creators’ works by AI developers and tech companies,

transparency in training data, requiring AI developers to disclose the use of copyrighted works in training datasets and to ensure proper licensing and payment for these works, and

remuneration for AI outputs, with an appropriate mechanism to be determined through further consultation.

These are important principles, but how they will be implemented in practice remains to be determined.

In Canada, a consultation on AI and copyright was launched late in 2023, with submissions to be received by January 15, 2024. The Canadian cultural community put forth three key demands:

No weakening of copyright protection for works currently protected (i.e. no exception for text and data mining to use copyrighted works without authorization to train AI systems)

Copyright must continue to protect only works created by humans (AI generated works should not qualify)

AI developers should be required to be transparent and disclose what works have been ingested as part of the training process (transparency and disclosure).

Submissions to the consultation were published in mid-year but since then there has been no apparent action. Given the current political crisis facing the Trudeau government, none is expected in the near term although the issue will inevitably have to be addressed after the general election in 2025.

While the EU has already established some parameters dealing with the use of copyrighted materials for AI training, the new UK Labour government is taking another run at the issue after various proposals in Britain under the Tories to find a modus vivendi between the AI and content industries went nowhere. The current UK discussion paper on Copyright and Artificial Intelligence, which seems excessively tilted in favour of the AI industry, has aroused plenty of controversy. While it says some of the right things, such as proclaiming that one of the objectives of the consultation is to “support…right holders’ control of their content and ability to be remunerated for its use”, the thrust of the paper is to find ways to encourage the AI industry to undertake more research in the UK by establishing a more permissive regime with respect to the use of copyrighted content. It is based on three self-declared principles (notice how these things always seem to come in threes?):

Control: Right holders should have control over, and be able to license and seek remuneration for, the use of their content by AI models

Access: AI developers should be able to access and use large volumes of online content to train their models easily, lawfully and without infringing copyright, and

Transparency: The copyright framework should be clear and make sense to its users, with greater transparency about works used to train AI models, and their outputs.

These three objectives then lead to what is clearly the preferred solution:

“A data mining exception which allows right holders to reserve their rights, underpinned by supporting measures on transparency”

Fine in principle, but the devil is always in the details, and the details in this case revolve around transparency (how detailed, in what form, what about content already taken?) and, in particular, reservation of rights, aka “opting out”. This is easy to proclaim in principle but difficult to do in practice. British creators are up in arms, led by artists such as Paul McCartney, and supported by the creative industries in the US. The British composer Ed Newton-Rex has penned a brilliant satire explaining how AI development in the UK will work if the current proposal is enacted. The problem with an opt-out solution is essentially twofold: it doesn’t deal with content already absorbed by AI developers, and it would be cumbersome if not impossible for many rightsholders to use.

Other governments have addressed the issue in different ways. Singapore has taken a very loose approach toward copyright protection, putting its thumb firmly on the scale in favour of AI developers. It is currently considering additional proposals that would strip even more protection from rights-holders, who are pushing back strongly. Japan had been widely and incorrectly reported to have been on the same path, resulting in a welcome clarification this year from the Agency for Cultural Affairs regarding the limits of Japan’s text and data mining (TDM) exception.

While AI dominated the copyright agenda in 2024, there were other issues relating to copyright and copyright industries that I wrote about. The ongoing question of payment for news content by large digital platforms continued to play out in different ways. In Canada, the struggle between the government and US tech giants Google and META was finally “resolved” (after a fashion) at the end of last year. Google agreed to “voluntarily” pay $100 million annually into a fund for Canadian journalism in return for being exempted from the Online News Act (ONA) while META called the government’s bluff by blocking Canadian news providers from its platform thus, in theory, avoiding being subject to the ONA. However, META has a very subjective interpretation as to what is Canadian news content, allowing some news providers to post to it, while many users have found workarounds, as documented by McGill’s Media Ecosystem Observatory. While the CRTC investigated, the issue is still unresolved.

Meanwhile in Australia, it seems that META intends to go down the same road of blocking news, announcing it will not renew the content deals it initially signed with Australian media in response to Australia’s News Media Bargaining Code, the model upon which Canada’s legislation was based. Unlike in Canada, the Australian government is planning a robust response. (More on this in a future blog post). Finally, on the same topic, California (which was threatening to introduce its own version of legislation to require digital platforms to compensate news content providers) emerged with an outcome very similar to that reached in Canada, with Google offering up some funding (although proportionally less than in Canada) while META appears to have walked away.

Controlled Digital Lending (CDL) was another copyright issue finally settled in 2024 (in the US). The Internet Archive had lost a lawsuit brought against it by a consortium of publishers, who argued that the digital copying of their works constituted copyright infringement, notwithstanding the Archive’s theory that it was simply lending a digital version of a legally obtained physical work held by it (or by someone else associated with it). It then lost its appeal, and in December the deadline for further appeals expired, effectively ending this saga. Whether Canadian university libraries, some of which are avid devotees of CDL, will take note remains to be seen.

The issue of circumventing a TPM (“Technological Protection Measure”), commonly referred to as a “digital lock” and often represented by a password allowing access to content behind a paywall, was also front and centre this year in Canada. In the case of Blacklock’s Reporter v Attorney General for Canada, the Federal Court found that an employee of Parks Canada, who shared a single subscription to Blacklock’s with a number of other employees by providing them with the password, did not infringe Blacklock’s copyright, since the employee did not circumvent the TPM (within the meaning of the law) and the purpose of the sharing was “research”, which is a specified fair dealing purpose. Blacklock’s is a digital research service that sells access to its content and protects that content with a paywall, as is common for many online content providers, like magazines and newspapers.

Despite the hoo-ha of anti-copyright commentators asserting the Court had found that “digital lock rules do not trump fair dealing”, it was equally clear the Court had ruled that fair dealing does not trump digital locks (TPMs). The Court did not undermine the protection afforded to businesses to protect their content through the use of TPMs. Rather, it determined that sharing a licitly obtained password did not constitute circumvention as outlined in the Act, as I explained here (Fair Dealing, Passwords and Technological Protection Measures (TPMs) in Canada: Federal Court Confirms Fair Dealing Does Not Trump TPMs (Digital Lock Rules)). Although the Court did not legitimize circumvention of a TPM for fair dealing purposes, contrary to claims stating the opposite, its acceptance of password sharing is an outcome that legal experts have disagreed with (as do I, for what it is worth). The law is very clear that fair dealing cannot be used as a pretext or a defence against violation of the anti-circumvention provisions of the Copyright Act. The decision is now under appeal by Blacklock’s.

Finally, the last copyright point of note for 2024 is that this year marked the bicentenary of the introduction of the first copyright legislation in Canada, in the Assembly of Lower Canada, in 1824. It also marked the centenary of the entry into force of the first truly Canadian Copyright Act on January 1, 1924. These two hundred years of domestic copyright history are worth celebrating. The first legislation was introduced “for the Encouragement of Learning” so that more local school texts would be written and printed. Given the current standoff between the secondary and post-secondary educational establishment and Canadian authors and their copyright collective over license payments for the use of copyrighted works in teaching, one wonders whether we have really learned anything about the role copyright plays in our society. (Copyright and Education in Canada: Have We Learned Nothing in the Past Two Centuries? (From the “Encouragement of Learning” to the “Great Education Free Ride”)).

Leaving that question with you to ponder, gentle Reader, is probably a good way to end this look back over the past 12 months. Stay tuned for more commentary on copyright developments in 2025.

© Hugh Stephens, 2024. All Rights Reserved.

If AI Tramples Copyright During its Training and Development, Should AI’s Output Benefit from Copyright Protection? Part Two: Jason Allen

Image: Théâtre D’Opéra Spatial, Jason B. Allen (not protected by copyright)

Last week I wrote about Stephen Thaler’s quixotic and determined approach to obtain copyright registration in the US for his AI generated artwork, “A Recent Entrance to Paradise”, created (he claims) exclusively by his AI “machine”, the so-called Creativity Machine. So far, despite repeated efforts, he has drilled a dry hole. An alternative approach to claiming copyright for an AI-generated work is by asserting that the AI used to produce it was simply a technological assist. The essence of the work was produced through human creativity, using AI only as a tool, and therefore the work should be eligible for copyright protection, or so the argument goes. Unlike the example of Kristina Kashtanova, discussed in last week’s blog, under this theory the entire work is protectable because it was human created, with AI playing only a facilitating or assistive role. This line of attack has most recently been pushed by the creator of the work “Théâtre D’Opéra Spatial”, Jason B. Allen.

Allen made headlines a couple of years back (September 2022) when he entered Théâtre into the Colorado State Fair’s annual art competition in the category of “digital art/digitally manipulated photography.” He labelled the piece as having been created by him, “via Midjourney”, the popular generative AI art algorithm that had recently been released. He won first prize, incurring the opprobrium of many artists who accused him of crashing a contest for human creators. Writing just a month later, in October of 2022, I posted my own AI-produced artwork, based on the style of Monet (whose works are in the public domain), in this blog post (AI and Computer-Generated Art: Its Impact on Artists and Copyright). My effort was substantially less artistic than Allen’s but was an original work of sorts, created with the help of AI. I used the program DALL-E 2, which is similar to Midjourney. Both were freely available tools that any rank amateur “artist” (like me) could use.

Generative AI as a source of art burst into the public’s consciousness in 2022 because of the public release of these programs, but AI-generated art had been around for a few years before that, although used exclusively by art specialists. The New York Times reported on a sale at Christie’s Auction House four years earlier, in October 2018 (not exactly eons ago, but generations in internet/AI time). A portrait with blurred and distorted lines produced by an AI algorithm sold in New York for $432,500 (with fees). Christie’s, in inimitable auctioneering style, billed it as “the first portrait generated by an algorithm to come up for auction”, according to the Times. Now AI-generated art is a dime a dozen. In fact, it is often hard to tell what is AI-generated and what is not.

If the question of copyright protection for AI-generated content is a big issue, an even bigger one is currently being played out in the courts: content owners, ranging from the New York Times to Getty Images to music labels to authors, are suing various AI development companies for the unlicensed use of their content to train AI programs. The Copyright Alliance has a good summary of the various lawsuits in play here. The ultimate outcome is undecided, but if the courts find that the wholesale unlicensed and unauthorized ingestion of copyrighted content to train AI algorithms is not fair use (in the US), or does not fall within specified text and data mining exceptions in other jurisdictions, then the table will be set for serious negotiation between rights holders and AI developers. Some of the parameters of this negotiation are already pretty obvious:

  • a transparent inventory of what copyrighted works were accessed for training;
  • the ability of rights holders to be able to opt out or opt in;
  • various options for licensing content for training purposes.

If these conditions governing inputs were met, rights holders might be somewhat more sympathetic to arguments for copyrighting the output of generative AI programs. As it is, AI developers, and the users of AI programs, want to have their cake and eat it too. Jason B. Allen of “Théâtre D’Opéra Spatial” is Exhibit No. 1.

Allen is currently appealing in court the US Copyright Office’s ruling that his work cannot be registered under copyright. He claims that because of all the publicity about his work, and the USCO’s subsequent decision to deny copyright registration on the grounds that it was an AI-generated rather than a human creation, the work has lost value, impairing his ability to charge industry-standard licensing fees. Moreover, he claims that without copyright protection, he has no ability to stop others from using his work without authorization. (Like me, posting Théâtre on this blog post). Apparently, people are selling copies of the work on Etsy. One has some sympathy for his position, as it is one faced by many artists whose copyrighted works are also being ripped off on the internet.

Allen argues he had substantial creative input into the production of the work, using no fewer than 624 prompts to create it to his mental specifications. Of course, we have no idea what those specifications were. What Midjourney produced may have been an accurate reflection of what Allen had in his mind and intentionally created, or it may have taken him on a journey where he eventually settled on the output offered. One thing is likely, if not certain: were he to enter those exact same prompts into Midjourney today, the outcome would not be identical to the current “Théâtre D’Opéra Spatial”. This raises the question of who, exactly, is guiding the creative process: the artist making the prompts, or the algorithm responding to them.

Because of the way AI works, there is a large degree of randomness in the results, requiring more and more precise prompts to narrow the range of possibilities and guide the algorithm to the desired destination. But it is almost impossible to recreate precisely the route to an outcome. This suggests to me that, in the end, it is the algorithm that is in control, not the human issuing the prompts. (Although not everyone agrees with this thesis). To my mind, this is what distinguishes AI-generated art from photography, where the photographer, while using a mechanical assist, is nevertheless in full control at all times and has the ability to adjust for extraneous inputs such as light and shadow rather than being controlled by them.

Allen’s appeal of the USCO’s rejection of his copyright claim takes place against a backdrop where Midjourney, the AI program he used, is itself being sued by a group of artists for appropriating their work without permission in order to train Midjourney’s art generation algorithms. Does anyone see any irony in this? However, the fact that Midjourney takes the works of others without permission for training and development purposes is not really Allen’s fault. He and other users of the program could perhaps be considered victims almost as much as the artists whose works have been appropriated. Nevertheless, if Allen is ever successful in getting Théâtre registered, this will be not just to his benefit, but also to the benefit of Midjourney and all the other AI developers who are in a similar position. On the other hand, if the output of their programs cannot be copyright protected, it diminishes the value of the AI product. So, perversely, the more Allen pursues registration for Théâtre, the more he undercuts those who make a living from producing art.

I see one possible scenario that could help resolve the issue of AI outputs being unprotectable. If the key elements of respect for copyrighted work (transparent inventory, opt in/out, licensing) were to be adopted by AI developers, then perhaps rights-holders would be more amenable to accepting at least some degree of copyright protection for AI created or assisted outputs. But right now, the AI industry wants it both ways; total freedom to appropriate copyrighted works for training and development purposes while claiming the same copyright protection they have just trampled for AI generated outputs. Jason Allen and other digital artists who use AI to produce art or other works are caught in the middle.

At the moment there is no clear solution. The most likely outcome, after all the legal dust has settled, is probably going to be some ability to copyright works produced with AI, depending on the extent and degree of human intervention in a given work (which could possibly be carefully tailored prompts), balanced by a commitment by the AI industry to recognize the property rights of those holding copyright over the content it is using to create the AI program in the first place. This will necessarily involve an appropriate sharing of the added value being produced by AI in the form of licensing fees. This will take a few more years, a few more lawsuits, a few court decisions, and a few government interventions in the form of legislation, but I can see no other way forward. In the interim, neither Stephen Thaler nor Jason Allen is likely to get what they want.

© Hugh Stephens 2024. All Rights Reserved.