
Image: Shutterstock (with AI assist)
It is well known that when AI applications can’t respond to a query, instead of admitting they don’t know the answer, they often resort to “making stuff up”—a phenomenon commonly called “hallucination” but which should more accurately be called what it is: total fabrication. This was one of the legal issues raised by the New York Times in its lawsuit against OpenAI, with the Times complaining, among other things, that false information attributed to the paper by OpenAI’s bot undermined the credibility of Times journalism and diminished its value, leading to trademark dilution. According to a recent article in the Times, the incidence of hallucination is growing, not shrinking, as AI models develop. One would have thought that as the models ingest more material, including huge swathes of copyrighted and curated material such as content from reputable journals like the Times (without permission in most instances), their accuracy would improve. That doesn’t seem to be the case. Given AI’s hit-and-miss record on accuracy, it should be evident that AI output cannot be trusted or, at the very least, can only be trusted if verified. Not only is AI built on the backs of human creativity (with a potentially disastrous impact on creators unless the proper balance is struck between AI training and development, and the rights of creators to authorize and benefit from the use of their work), but human oversight and judgement are required to make it a useful and reliable tool. AI on auto-pilot can be downright dangerous.
The most recent outrageous example of AI going astray is the publication by the Chicago Sun-Times and Philadelphia Inquirer, both reputable papers (or at least they used to be), of a summer reading list in which only five of fifteen books listed were real. The authors were real but most of the book titles and plots were just made up. Pure bullshit produced by AI. The publishers did a lot of backing and filling, pointing to a freelancer who had produced the insert on behalf of King Features, a unit of Hearst. Believe it or not, it was actually licensed content! That freelancer, reported to be one Marco Buscaglia, a Chicago “writer”, admitted that he had used AI to create the piece and had not checked it. “It’s 100% on me”, he is reported to have said. No kidding. Pathetic. Readers used to have an expectation that when a paper or magazine published a feature recommending something, like a summer reading list, the recommendation represented the intellectual output of someone who had done some research, exercised some judgement, and had presumably even read or at least heard about the books on the list. How could anyone recommend non-existent works? The readers trusted the newspaper, the paper trusted the licensor, the licensor trusted the freelancer, the so-called author. Nobody checked. Where was the human element? The list wasn’t worth the paper it was printed on.
The same irresponsible dependence on unverified information produced by AI is a growing problem in the legal field. Prominent lawyer and blogger Barry Sookman has just published a cautionary tale about the consequences of relying on hallucinated AI legal references. Counsel for an applicant in a divorce proceeding in Ontario cited several legal references from the CanLII database (for more information on CanLII, see “AI-Scraping Copyright Litigation Comes to Canada (CanLII v Caseway AI)”) that the presiding judge could not locate—because they did not exist. He suspected the factum had been prepared using Generative AI and threatened to cite the lawyer in question for contempt of court, noting that putting forward fake cases in court filings is an abuse of process and a waste of the court’s time. The lawyer in question has now confirmed that her law clerk used AI, that the citations went unchecked, and has apologized, thus avoiding a contempt citation. Again, nobody checked (until the judge went looking for the references cited).
This is not even the first case in Canada where legal precedents fabricated by AI were presented to a court. Last year, in a child custody case in the BC Supreme Court, the lawyer for the applicant was reprimanded by the presiding judge for presenting false cases as precedents. The fabricated information was discovered by opposing counsel when they went to check the applicant’s lawyer’s arguments. As a result, the applicant’s lawyer was ordered to personally compensate the opposing lawyers for the time they took to track down the truth. The perils of using AI to argue legal cases first came to prominence in the US in 2023, when a New York federal judge fined two lawyers and their firm $5,000 for submitting a legal brief written by ChatGPT, which included citations of non-existent court opinions and fake quotes.
Another area fraught with consequences for using unacknowledged AI-generated references is academia. The issue extends well beyond undergraduate student essays being researched and written by AI to include graduate students, PhD candidates and professors taking shortcuts. This university library website, in its guide to students on the use of AI-generated content, notes that the large language models (LLMs) underlying AI can hallucinate as much as 27% of the time and that factual errors are found in 46% of their output. The solution is pretty simple: when writing a research paper, don’t cite sources that you didn’t consult.
This brings up the problem of “you don’t know what you don’t know”. If your critical faculties are too weak to detect a fabricated response, you are in trouble. Of course, some hallucinations are easier to spot than others. Sometimes the checking is simply a matter of verifying that a fact stated in an AI response is accurate or that a cited reference actually exists (and then it should be read to determine relevance). In other cases it may be more subtle, with the judgement and creativity of the human mind being brought into play to detect a hallucination. That requires experience, knowledge and context—all of which a junior clerk or student intern assigned the task of compiling information may lack. This is all the more reason why it is important for those using AI to check sources and to exercise quality control. Part of the process is to ensure transparency: if AI is used as an assist, that should be disclosed.
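To make the point concrete, here is a minimal sketch of the kind of automated first pass a clerk could run over an AI-drafted document before any human review: extract anything that looks like a citation and flag whatever cannot be confirmed to exist. Everything here is hypothetical (the citation pattern, the index, the draft text); a real check would query an authoritative database such as CanLII rather than a hard-coded list.

```python
import re

# Hypothetical trusted index; a real workflow would query an authoritative
# service such as CanLII rather than a hard-coded set of known citations.
TRUSTED_INDEX = {"2023 ONSC 1234", "2022 BCSC 567"}

# Simplified pattern for Canadian neutral citations (illustrative only).
CITATION = re.compile(r"\b\d{4} (?:ONSC|ONCA|BCSC|SCC|FC) \d+\b")

def unverified_citations(text: str) -> list[str]:
    """Return every citation in the text that cannot be found in the index."""
    return [c for c in CITATION.findall(text) if c not in TRUSTED_INDEX]

draft = "As held in 2023 ONSC 1234, and affirmed in 2024 ONCA 9999, ..."
for cite in unverified_citations(draft):
    print(f"UNVERIFIED: {cite} -- pull and read the full decision before filing")
```

Even a pass like this only confirms existence; as noted above, a citation that turns out to be real still has to be read for relevance.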
At the end of the day, AI depends on human creativity and accurate information produced by humans. Without these inputs, it is nothing. This brings us to the fundamental issue of whether and how copyright protected content should be used in AI training to produce AI generated outputs.
The US Copyright Office has just released a highly anticipated study on the use of copyrighted content in generative AI training. Here is a good summary produced by Roanie Levy for the Copyright Clearance Center. The USCO report is clear in stating that the training process for AI implicates the right of reproduction. That is not in doubt. It then examines fair use arguments under the four-factor test used in the US. Notably, with respect to the purpose and character of the use, the USCO notes that the use of copyrighted content for AI training may not be transformative if the resulting model is used to generate expressive content or potentially reproduce copyrighted expression. It notes that the copying involved in AI training can threaten significant harm to the market for, or value of, copyrighted works, especially where a model can produce substantially similar outputs that directly substitute for works used in the training data. The report is not binding on the courts, but it is a considered and well-researched opinion by a key player.
It is interesting to note that the report was released quickly in a pre-publication version on May 9, just a day before the Register of Copyrights (the head of the Office), Shira Perlmutter, was dismissed by the Trump Administration, and a day after the Librarian of Congress, Carla Hayden (to whom Perlmutter reported), was fired. Washington is rife with speculation on the causes for, and the legality of, the dismissals. We will no doubt hear more on this. With respect to fair use in general, the study concludes that “making commercial use of vast troves of copyrighted works to produce expressive content that competes with them in existing markets…goes beyond established fair use boundaries”. The anti-copyright Electronic Frontier Foundation (EFF), of course, disagrees (which probably further validates the USCO’s conclusions).
The USCO study is about infringement, not hallucination or fabrication, yet both stem from the indiscriminate application and use of AI in ways that largely ignore and devalue the human factor. Human creativity and judgement are needed to set guardrails on both. Transparency as to what content has been used to train an AI model, along with the licensing of reputable and reliable content for training purposes, are important factors in helping AI get its outputs right. Not taking an AI output as gospel, but instead applying a degree of diligence, common sense, fact verification and experienced judgement, is another. Deployed as it should be, AI is an aid that makes human creativity and human-directed output more efficient, not a substitute for thinking or original research. Generative AI must be the servant, not the master. Human creativity and judgement are needed to ensure it stays that way.
© Hugh Stephens, 2025. All Rights Reserved.

Fascinating article, but I do question whether the author is right to draw parallels between hallucination and infringement.
The author’s starting point is that AI hallucinations are limited to occasions when the AI models “don’t know the answer.” If AI hallucinations were, in fact, limited in that way, it might be easier to prevent them.
As I understand it, the process of AI’s language generation is quite separate from the process of looking up knowledge (see, for example, the second paragraph on page 22 of the US Copyright Office study cited by the author above – link below). The language process “guesses” what a human might write, based on the statistical evidence of what humans have written in the past. It’s a purely statistical process based on past patterns of words and completely divorced from a database of facts which those strings of words might represent.
If my description is accurate, a language model never “knows the answer”; it can only guess what word or words come next in response to the prompt it has been given.
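For readers who want to see that guessing concretely, here is a minimal sketch of a single generation step using the Hugging Face transformers library. The model (GPT-2) and the prompt are illustrative only; commercial chatbots are far larger and add many refinements, but the core loop is the same: score every possible next token, then sample one.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative only: GPT-2 is a small, openly available language model;
# large commercial models generate text by the same basic mechanism.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The plot of the novel can be summarised as"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# One generation step: the model scores every token in its vocabulary,
# and the next word is *sampled* from that distribution. Nothing here
# consults a database of facts; it is pure pattern completion.
with torch.no_grad():
    logits = model(input_ids).logits[0, -1]            # scores for next token
probs = torch.softmax(logits, dim=-1)                  # probability distribution
next_token = torch.multinomial(probs, num_samples=1)   # pick one, by chance
print(tokenizer.decode(next_token))
```

A fluent but fabricated continuation comes out of exactly the same mechanism as a true one, which is why a model can “confidently” describe a book that does not exist.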
The infringement debate is, I think, altogether different. For understandable reasons, lawyers often couch that debate in terms of whether the existing law covers what AI is now doing, rather than whether the principles that gave rise to the current law (ie the reasons WHY the law grants copyright protection) ought to allow what AI is doing or ban it. In an article of my own, I argued that there is no greater need to prohibit AI learning than there is to prohibit human learning (which, of course, we encourage!). What matters is whether or not the AI’s output is a copy. For those who are interested, see https://simoncarne.substack.com/p/rage-against-the-machine.
The US Copyright Office study mentioned in my comment can be found at: https://www.copyright.gov/ai/Copyright-and-Artificial-Intelligence-Part-3-Generative-AI-Training-Report-Pre-Publication-Version.pdf
Thank you for the thoughtful comments. I am not sure that I drew, or attempted to draw, parallels between infringement and hallucination. Rather, my point was that in both instances the human element is necessary to make AI work effectively. Without it, AI will not be a reliable aid to more efficient tasking and even creating. That includes both AI training and verifying AI outputs. Current trends suggest a diminishment of human engagement in both elements, a path that I think is perilous. With respect to training, taking account of the human factor means acknowledging the creators of works that are being used in training, whether or not the output is a copy. If it is a copy, then absent licensing it is an infringement. In my view, training to produce a potentially substitutable output, without permission, even if the output is a compilation of many works, is likely an infringement, but it depends on each work. To assess whether there has been substitution, AI trainers should be required to document the works they have copied.
Many thanks for the response. I hope it’s OK to continue the debate (at least a little). You and I certainly agree that producing a copy of someone else’s work without permission is a breach of their rights. And I happily acknowledge that I shouldn’t have used the word “parallel” at the start of my earlier comment.
The issue that separates us is whether or not AI should acknowledge the humans it has learned from. My argument is that the basis on which copyright law was founded has never required humans to acknowledge which other humans they have learned from, so there needs to be a clear policy justification before requiring AI to do so. The arguments I have seen tend to be based on a confusion over “copying” and “learning”. The two are different, as discussed in my first comment (and in my article at https://simoncarne.substack.com/p/rage-against-the-machine).
In your response to me above, you refer to “substitution”. I have re-visited the US Copyright Office report in an attempt to understand what that word is meant to connote in this context. I haven’t found a clear meaning (other than its usual meaning in language). So, for example, using AI to summarise someone else’s work might make it a substitute. But so too is a human-written summary. So, when you linked to Roanie Levy’s summary of the US Copyright Office’s report, you probably did so because (and I’m speculating here) you thought it might provide an adequate substitute for someone who didn’t want to read the full 100-page report. There are existing laws which dictate when summaries are legitimate and when they have become a breach of copyright. I strongly suspect the same is true for other forms of substitute. But without a clear understanding of what others mean by “substitute”, I can’t be definitive.
In summary (of my own words!), I suggest that the debate over AI and copyright should start by looking at the reasons why copyright protection was granted in the first place and see what those reasons lead to in a world of AI. If legislatures around the world are to be encouraged to create new rules for AI that don’t apply to humans, I suggest that there needs to be a clear policy objective which is better than the alternative of not adding to the existing legislation.
Simon, you argue that a machine copying a copyrighted work, and then producing something that is based on that work, is the same as a human reading that work and producing his/her own creation, having inculcated the essence of what was read and then producing something new. It isn’t the same. The fundamental difference is that an AI makes an unauthorized and unacknowledged copy of either all the work or a substantial part of it, whereas a human ingests the work, perhaps drawing inspiration from it, and incorporates the non-copyrightable ideas to produce something new. The only “copy” (and it will not be an exact copy, unlike with an AI) is a mental impression that is shaped by all the other mental impressions in the creator’s brain.
With regard to “substitution”, I use it in the plain English sense of “in place of”. If an AI is asked to produce artwork in the style of Greg Rutkowski and does so, having made unauthorized copies of Rutkowski works in order to be able to do so, and the user of the AI then uses that AI-produced image for a purpose (perhaps a commercial purpose) instead of licensing a Rutkowski image, to me that is substitution. This applies to graphic artists, copywriters, songwriters, authors…you name it.
I have written only one book, “In Defence of Copyright”. I read a lot of material as part of the writing process but I did not copy any of it, other than selected excerpts that I documented and which fell within fair dealing guidelines. An AI, however, could make a copy of every word of my book, every example, every anecdote, every reference, and then use that content, in combination with verbatim content from similar books, to produce a competing work. Neither I nor any of the other writers authorized the making of that copy, and it does not fall within fair dealing criteria and is likely not to fall within a fair use interpretation in the US. Thus it is an infringement. A human could also write a competing work but would not have the ability to incorporate every word of every input work as part of the writing process. A human writer might jot down ideas from several books, but if they used the exact text they would be limited by fair dealing guidelines (amount, purpose, etc) or else would be guilty of infringement. An AI exercises no such limitation in its copying and yet produces competing works that can substitute for original human created works.
You suggest that we should look at why copyright law was created in the first place. We know why. It was to encourage the creation of more works (“An Act for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned”). To remove the protection that copyright confers (the right to authorize and be compensated for the use of a work) goes against the encouragement of learning. AIs require the “raw meat” of human-created content (data) to be able to develop. Undermining incentives for the human creation of content by producing substitute works that are derived from unauthorized copies of human-created works goes against the rationale for copyright. It disincentivizes and discourages the creation of content and learning.
AI and AI training can adapt to copyright law. All it requires is for the AI developers to be transparent about what content they use to train their models and to share some of the wealth created through their models with the creators of works that made their success possible. That is called “licensing” (and an author still retains the right to decide whether or not to license their content).
Hugh, it is looking like the crucial difference between us is the one set out in the first paragraph of your most recent reply, viz that an AI machine makes “an unauthorized and unacknowledged copy of” the work it learns from whereas a human does not. You are, of course, right about that. But we have been here before with copyright and tech. Internet browsers create a copy of the web pages that the user has viewed (known as “caching”). In the UK, from where I write, this was challenged and our Supreme Court decided that was not a breach of copyright. Given that browser caching is an accepted practice in other jurisdictions also, I think there can be little doubt that if a country’s court had decided that the existing law was not compatible with caching, the legislature would have stepped in to amend that nation’s law.
The legal argument(s) that have earned caching a clean bill of health around the world don’t necessarily apply to machine learning. It will vary from one jurisdiction to another. In some countries, that has pushed the debate into the hands of the legislature. For example, the EU passed a law which created a copyright exemption for text and data mining. In the UK, the government recently consulted on a proposal that would allow AI machines to be trained on copyrighted works without seeking permission unless the copyright owner expressly opts out.
The need to re-visit the law in the light of AI is why I argued above for a return to first principles: why was copyright granted in the first place? It was, as you say, “for the Encouragement of Learning, by vesting the Copies of Printed Books in the Authors or purchasers of such Copies, during the Times therein mentioned” (quoting from the UK Copyright Act of 1709).
The question now is whether AI learning will operate against that aim. You assert that it will. You say AI learning will “[undermine] incentives for the human creation of content by producing substitute works that are derived from unauthorized copies of human-created works”. Your use of the word “unauthorized” in that sentence is valid only for so long as the courts and/or the legislatures don’t permit AI to hold copies inside their memories for learning purposes (as discussed above). The POLICY question is whether your sentence is valid without the word “unauthorized” in it. In other words, if the holding of copies inside the machine’s memory is determined – either by the courts or by the legislature – to be lawful, will humans be disincentivised from creating more works?
I don’t claim to know the answer to that question. I think we need evidence before we can answer it.
Simon, all copying of copyrighted works, unless licensed, is “unauthorized” by the copyright owner. However, the unauthorized use may or may not be an infringement depending on whether an exception applies (such as de minimis, fair dealing, or another specified exception like copying for archival purposes). As for the difference between making an ephemeral copy for caching purposes (which is legal) and making a copy, sometimes ephemeral and sometimes not, for AI training (which arguably is an infringement), the key difference is the PURPOSE for which the copy was made. You yourself refer to AI outputs as being infringing if they are a copy. That is true, but I would argue the copying process is an infringement not just if the AI reproduces all or significant portions of the original work (as the NYT alleges in its case against OpenAI) but also if the intent or purpose of the copying is to produce content that can substitute for and compete with the original (assuming the AI has benefited from training on the originals by making unauthorized copies, which is why transparency must be a requirement). If an AI can produce multiple works on any given subject (say, on growing roses) because it had ingested (copied) without authorization multiple works on the subject, then there is going to be a reduced market for the definitive work on the topic, whatever that is, plus many other works in the field written and researched by reputable authors. The purpose of the unauthorized reproduction is to create competing works, damaging the interests of copyright owners. In the case of caching, the purpose is to enable the effective functioning of the internet, benefiting all. There is a significant difference between the two. I agree with you when you say that “The legal argument(s) that have earned caching a clean bill of health around the world don’t necessarily apply to machine learning.” That is because there is no parallel between the two, in my view.
We are not going to agree, but I have found it very interesting exchanging views and exploring your perspective. I think the discussion has teased out the key points of difference, which I have found very instructive. Many thanks, Hugh, for responding to my comments.
Simon, as the saying goes, I could agree with you but then we’d both be wrong. No doubt you feel the same way. Thanks for your interest in the topic and the blog. Here is one reading I recommend (although it may not change your mind). https://cdn.vanderbilt.edu/vu-URL/wp-content/uploads/sites/356/2025/05/25192637/Charlesworth-FINAL.pdf
Hugh, I really had thought that we had reached the end of this exchange, but since you pointed me to a text which “may not change [my] mind”, I read it. And I agreed with it. Not because it changed my mind, but because it articulated over 50 pages the same central message that I had attempted to articulate.
In her Conclusion, the author says: “At this early stage, we do not yet know how generative AI will impact human authorship, or creative culture in general. Will there be less incentive for humans to create works, and for publishers to invest in and disseminate those works[?] … An unprecedented exception to copyright with such far-reaching consequences is not a question of fair use but rather a fundamental question of policy for Congress to decide.”
In one of my previous comments, I wrote: “The POLICY question is whether … if the holding of copies inside the machine’s memory is determined – either by the courts or by the legislature – to be lawful, will humans be disincentivised from creating more works? I don’t claim to know the answer to that question. I think we need evidence before we can answer it.”
I think we both recognize that the rules governing the input of copyrighted content into AI training will need to be codified, and that the long-term impact on human creativity remains to be determined, although I think we approach both issues from different perspectives. Thanks for a thoughtful exchange of views.