2023 was the year of generative AI, and it will continue to evolve in 2024. Generative AI plays an increasingly significant role in content creation, and for niche website owners, the need to protect original content has never been more critical. With AI tools like ChatGPT becoming commonplace, concerns about content theft and originality are on the rise among content creators and publishers.
AI vs. Search Engine Crawlers: Understanding the Difference
In a recent webinar, Cassidy Jensen, a VIP Publisher Success Manager at Ezoic, shared her wealth of knowledge on AI and content protection, educating website owners on what AI crawlers are, how media outlets are currently handling the AI boom, and what to do if your content is stolen.
Cassidy explains that while AI tools facilitate content creation and spark creativity, they often source information from existing online content, potentially sidelining the original creators.
What Are AI Crawlers?
AI crawlers, or bots, autonomously navigate the internet to gather data for various purposes, such as machine learning, data analysis, and knowledge base enhancement. They work similarly to Google’s search engine crawlers and can ‘scrape’ any page accessible via a web browser.
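To make this concrete, here is a minimal, hypothetical sketch of what “scraping any page accessible via a web browser” looks like in practice. Real AI crawlers are far more sophisticated; the bot name and URL below are placeholders, not details from any actual crawler.

```python
# Minimal, illustrative sketch of a scraper-style crawler: fetch a page the
# same way a browser would, then strip out the visible text.
# (Placeholder user agent and URL; real AI crawlers are far more complex.)
from urllib.request import Request, urlopen
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects the visible text nodes from an HTML page."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)

def scrape(url: str) -> str:
    # Any page a browser can load can be fetched the same way by a bot.
    req = Request(url, headers={"User-Agent": "example-bot/0.1"})
    html = urlopen(req).read().decode("utf-8", errors="ignore")
    parser = TextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)

if __name__ == "__main__":
    print(scrape("https://example.com")[:500])
```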
While Google’s crawlers offer a clear benefit by driving traffic to ad-supported sites, AI bot crawlers don’t appear to yield any benefit to the original creators.
OpenAI introduced its GPTBot crawler in August 2023, stating that the data it gathered would be used to improve future models. In a new age of publishers using AI to prompt the generation of new content, it’s important to recognize that Google can filter duplicate or copied content out of its SERPs, ultimately harming that site’s traffic.
This announcement has fueled many questions and has prompted publishers to start putting measures in place to protect their original content.
AI doesn’t typically credit the original sources, and there’s no guarantee that the information it retrieves is always accurate. This creates a risk for unique content creators, as AI-generated content can sometimes be seen as duplicate content by search engines, negatively impacting site traffic.
The Double-Edged Sword of Blocking AI Crawlers
The decision to block AI crawlers is not without its trade-offs. Cassidy discussed the benefits of protecting intellectual property and maintaining content control, but also the potential downsides, such as reduced exposure to search engine crawlers, which can be detrimental to content site owners who rely on traffic from Google Search.
While blocking AI bots can protect content, it also risks limiting visibility by search engine crawlers. Additionally, there’s currently no unified standard for ‘do not crawl’ directives for AI bots.
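For publishers who do decide to block AI bots, the usual mechanism today is robots.txt. As a hedged sketch: OpenAI documents that GPTBot honors robots.txt directives, while other AI crawlers publish their own user-agent tokens. The extra entry below is a commonly cited example; check each vendor’s documentation for current tokens, and keep in mind that not every bot honors these rules.

```
# robots.txt: sketch of blocking AI crawlers while leaving search
# engine crawlers untouched (no Disallow rules for them).
User-agent: GPTBot
Disallow: /

# Commonly cited example of another data-collection crawler; verify the
# current user-agent token with the vendor before relying on this.
User-agent: CCBot
Disallow: /
```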
Proactive Measures by Leading Media Outlets
Looking ahead, Cassidy shared that media giants like the New York Times, Reuters, Amazon, and CNN are already negotiating with AI firms to license their data. These proactive steps are setting a precedent for content protection in the digital age.
These companies are in talks with AI firms about licensing their data for AI use at a fee, but any actual regulatory action remains far on the horizon. In the meantime, some intellectual property holders are pursuing legal action against AI companies that use their data without permission.
Following Google’s recent privacy policy update disclosing the collection of public data for its own AI services and OpenAI’s introduction of its chatbot, The New York Times updated its terms of service in August to prohibit its content (text, photos, images, audio/video, and metadata) from being used in any other software, including machine learning and AI systems. The updated terms specify that web crawlers designed to collect content cannot be used without written permission and threaten fines and penalties for violations.
What to Do When Content Is Stolen
While search engines are improving at identifying duplicate content, there’s a risk that your website might be unfairly implicated. To tackle content theft, Cassidy recommends a five-step approach:
- Use a plagiarism checker to identify duplicated content and its source. Tools such as Grammarly or Copyscape can do this for you (a simple do-it-yourself sketch of the idea follows this list).
- Reach out to the website with duplicate content and request removal. They may be unaware that their content was copied, and offering them a chance to rectify the situation can prevent penalties.
- Report the copied content to Google as a legal request under the Digital Millennium Copyright Act (DMCA). U.S. copyright law mandates that Google take action against plagiarizers. (If the content appears in both Google Search and Blogger, you’ll need to report it on both platforms. Refer to Google’s policy at https://support.google.com/legal.)
- Once your request is submitted, complete the subsequent requests for information provided by Google, typically in the form of a short questionnaire.
- Progress to another form to report the URLs where your content is being copied.
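Dedicated plagiarism checkers like the tools mentioned in the first step do this at scale. As a rough, do-it-yourself illustration of the underlying idea, a script like the following (a sketch only, with placeholder file names and an arbitrary threshold) measures how much of one text overlaps with another.

```python
# Rough sketch of spotting copied text by measuring overlap between your
# original article and a suspected copy. Dedicated plagiarism checkers are
# far more thorough; the file names here are placeholders.
from difflib import SequenceMatcher
from pathlib import Path

def similarity(original: str, suspect: str) -> float:
    """Return a 0-1 ratio of how similar the two texts are."""
    return SequenceMatcher(None, original, suspect).ratio()

original = Path("my_article.txt").read_text(encoding="utf-8")
suspect = Path("suspected_copy.txt").read_text(encoding="utf-8")

score = similarity(original, suspect)
print(f"Similarity: {score:.0%}")
if score > 0.8:  # arbitrary threshold, for illustration only
    print("High overlap: worth a closer look and a removal request.")
```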
If you detect content theft, it’s crucial to notify Google promptly to prevent adverse impacts on your site.
Reporting stolen content is considered a commendable action, as search engines aim to avoid displaying copied content in their results pages (SERPs).
Conclusion
As AI continues to evolve and permeate various aspects of content creation, it’s crucial for creators to stay informed and vigilant. Employing plagiarism checkers, staying abreast of protective measures by leading media companies, and considering legal options if necessary are essential steps in safeguarding original website content in the age of AI.
If you are an Ezoic publisher, you can watch a recording of Cassidy’s webinar here. Weekly Walkthrough webinars happen every Wednesday; check out past recordings and register for future events here.