In today’s column, let’s consider for a moment turning the world upside down.
Here’s what I mean.
Generative AI such as the wildly and widely successful ChatGPT and GPT-4 by OpenAI is based on scanning data across the Internet and leveraging that examined data to pattern-match on how humans write and communicate in natural language. The AI development process also includes a lot of clean-up and filtering via a technique known as RLHF (reinforcement learning from human feedback), which seeks to excise or at least curtail unsavory language from being emitted by the AI. For my coverage of why some people nonetheless ardently push generative AI and relish stoking hate speech and other untoward AI-generated foulness, see the link here.
When the initial scanning of the Internet takes place for data training of generative AI, the websites chosen to be scanned are generally aboveboard. Think of Wikipedia or similar kinds of websites. By and large, the text found there will be relatively safe and sane. The pattern-matching is getting a relatively sound basis for identifying the mathematical and computational patterns found within everyday human conversations and essays.
I’d like to bring to your attention that we can turn that crucial precept upside down.
Suppose that we purposely sought to use the worst of the worst that is posted on the Internet to do the data training for generative AI.
Imagine seeking out all those seedy websites that you would conventionally be embarrassed to even accidentally land on. The generative AI would be focused exclusively on this bad stuff. Indeed, we wouldn’t try to somehow counterbalance the generative AI by using some of the everyday Internet and some of the atrocious Internet. We would fully immerse the generative AI in the muck and mire of the Internet’s wickedness.
What would we get?
And why would we devise this kind of twisted or distorted variant of generative AI?
Those are great questions and I am going to answer them straightforwardly. As you will soon realize, some pundits believe that data training generative AI on the ugly underbelly of the Internet is a tremendous idea and an altogether brilliant strategy. Others retort that this is not only a bad idea but a slippery slope leading to AI systems of an evil nature, and that we will regret the day we allowed this to ever get underway.
Allow me a quick set of foundational remarks before we jump into the meat of this topic.
Please know that generative AI and indeed all manner of today’s AI is not sentient. Despite all those blaring headlines that claim or imply that we already have sentient AI, we don’t. Period, full stop. I will later on herein provide some speculation about what might happen if someday we attain sentient AI, but that’s conjecture, and no one can say for sure when or if that will occur.
Modern generative AI is based on a complex computational algorithm that has been data-trained on text from the Internet. Generative AI such as ChatGPT, GPT-4, Bard, and other similar AI apps entail impressive pattern-matching that can perform a convincing mathematical mimicry of human wording and natural language. For my explanation of how generative AI works, see the link here. For my analysis of the ongoing doomster fearmongering regarding AI as an existential risk, see the link here.
Into all of this comes a plethora of AI Ethics and AI Law considerations.
There are ongoing efforts to imbue Ethical AI principles into the development and fielding of AI apps. A growing contingent of concerned and earnest AI ethicists are trying to ensure that efforts to devise and adopt AI take into account a view of doing AI For Good and averting AI For Bad. Likewise, there are proposed new AI laws being bandied around as potential solutions to keep AI endeavors from going amok on human rights and the like. For my ongoing and extensive coverage of AI Ethics and AI Law, see the link here and the link here, just to name a few.
The development and promulgation of Ethical AI precepts are being pursued to hopefully prevent society from falling into a myriad of AI-induced traps. For my coverage of the UN AI Ethics principles as devised and supported by nearly 200 countries via the efforts of UNESCO, see the link here. In a similar vein, new AI laws are being explored to try and keep AI on an even keel. One of the latest takes is a proposed AI Bill of Rights that the U.S. White House recently released to identify human rights in an age of AI, see the link here. It takes a village to keep AI and AI developers on a rightful path and to deter the purposeful or accidental underhanded efforts that might undercut society.
Now that we’ve covered those essentials about generative AI, let’s look at the seemingly oddish or scary proposition of data training generative AI on the stinkiest and most malicious content available on the web.
The Dark Web Is The Perfect Foundation For Bad Stuff
There is a part of the Internet that you might not have visited that is known as the Dark Web.
The browsers that you normally use to access the Internet are primed to explore only a small fraction of the web, known as the visible or surface-level web. There is a lot more content out there. Within that other content is a segment generally coined the Dark Web, which tends to contain all manner of villainous or disturbing content. Standard search engines do not usually index Dark Web pages. All in all, you would need to go out of your way to see what is posted on the Dark Web, doing so by using specialized browsers and other online tools to get there.
What type of content might be found on the Dark Web you might be wondering?
The content varies quite a bit. Some of it entails evildoers who are plotting takeovers or possibly contemplating carrying out terrorist attacks. Drug dealers find the Dark Web very useful. You can find criminal cyber hackers sharing tips about how to overcome cybersecurity precautions. Conspiracy theorists tend to like the Dark Web since it is a more secretive arena in which to discuss conspiratorial theories. And so on.
I’m not saying that the Dark Web is all bad, but at least be forewarned that it is truly the Wild West of the Internet, and just about anything goes.
In a research paper entitled “Dark Web: A Web of Crimes,” there is a succinct depiction of the components of the Internet and the role of the Dark Web, as indicated by these two excerpts:
- “The World Wide Web (www) consists of three parts, i.e., Surface Web, Deep Web, and Dark Web. The Surface Web, which is also known as the visible or indexed web is readily available to the public through the standard web search engines. Only 0.03% of search results are retrieved through surface web search engines.”
- “The Dark Web also refers to World Wide Web content, but it is not part of the surface web due to which it is also not accessible by the browsers which are normally used to access the surface web. The Dark Web is the part of the web where most of the illegal and disturbing stuff takes place. The Dark Web is also used as an illegal platform for terrorism, hacking and fraud services, phishing and scams, child pornography, and much more” (source: “Dark Web: A Web of Crimes” authored by Shubhdeep Kaur and Sukhchandan Randhawa, Springer Nature 2020).
I realize it is perhaps chilling to suddenly realize that there is an entire segment of the Internet that you perhaps didn’t know existed and that it is filled with abysmal content. Sorry to be the bearer of such gloomy news.
Maybe this will cheer you up.
The Dark Web is seemingly the ideal source of content to train generative AI if you are of the mind that data training on the worst of the worst is a presumably worthwhile and productive endeavor. Rather than having to try and bend over backward to find atrocious content on the conventional side of the Internet (admittedly, there is some of that there too), instead make use of a specialized web crawler aimed at the Dark Web and you can find a treasure trove of vile content.
Easy-peasy.
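To make the crawling notion a bit more tangible, here is a minimal sketch, assuming a locally running Tor client on its default SOCKS port (9050) and the Python requests library installed with SOCKS support, of how a specialized crawler might fetch a single Dark Web page. The .onion address and the fetch_onion_page helper are purely hypothetical illustrations, not references to any real site or tool.

```python
# Minimal sketch: fetching a hidden-service page through a local Tor SOCKS proxy.
# Assumptions: Tor is listening on 127.0.0.1:9050 and requests was installed
# with SOCKS support (pip install "requests[socks]"). The .onion address below
# is a placeholder, not a real site.
import requests

TOR_PROXIES = {
    "http": "socks5h://127.0.0.1:9050",   # socks5h resolves DNS through Tor as well
    "https": "socks5h://127.0.0.1:9050",
}

def fetch_onion_page(url: str, timeout: int = 60) -> str:
    """Fetch one hidden-service page via the Tor proxy and return its HTML."""
    response = requests.get(url, proxies=TOR_PROXIES, timeout=timeout)
    response.raise_for_status()
    return response.text

if __name__ == "__main__":
    html = fetch_onion_page("http://exampleonionaddress.onion/")  # hypothetical address
    print(html[:500])  # raw text like this would then feed a data-collection pipeline
```

In practice, a crawler of this sort would also need link discovery, rate limiting, and careful legal and ethical guardrails before any of the gathered text could responsibly be used for data training.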
I know that I haven’t yet explained why data training generative AI on the Internet’s ugly underbelly is presumably useful, so let’s get to that next. At least we now know that plentiful content exists for such a purpose.
What Does Dark Web-Trained Generative AI Provide
I’ll give you a moment to try and brainstorm some bona fide reasons for crafting generative AI that is based on foulness.
Any ideas?
Well, here’s what some already proclaim are useful reasons:
- Be A Good Guy Tool To Snitch On Bad Guys. If you want to search for and find evildoers, having a generative AI trained on evildoer content seems to be a pretty valuable tool for that judicious purpose.
- Be Able To Interpret Specialized Languages. Those who use the Dark Web tend to convey their messages and essays in sneaky ways, such as cryptic words and encoded phrasings; it is as though they are using an entirely different form of natural language. This provides interesting avenues for stretching generative AI and making further technological advancements.
- Find Endangering Trends And Forewarn Us. Generative AI might be able to identify trends or patterns in how the underbelly is changing or morphing, possibly giving us both a short-term and long-term heads up of ways that evildoing is advancing.
- Potential Legal Evidence Against Baddies. For both law enforcement and our courts, there might be legal evidence surfaced from the Dark Web that could aid in criminal prosecutions.
- Other Valid Uses. There are additional uses, such as purely using such a generative AI for research on whether emergent behaviors can exist in “good” versus “bad” data-trained AI, and so on. For my analysis of whether we should have generative AI that is fully unfiltered, see the link here.
Any discussion about the Dark Web should be careful to avoid pegging the Dark Web as exclusively a home of evildoers. There are various justifications for having a Dark Web.
For example, consider this listing by the researchers mentioned earlier:
- “The biggest benefit of using the Dark Web is its anonymity. Not every user who is accessing the Dark Web has bad intentions. Some users may have concerns about their privacy and security. They want their Internet activity to be kept private.”
- “The user can find products cheaper than on the streets. The vendors also offer discounts when the user purchases the product in bulk.”
- “We can buy the products that are not available in the market or the country.”
- “Convenience is another reason why people order on the Dark Web.”
- “Due to the existence of strong community on the Dark Web, the users strongly share their views about products or vendors.”
- “Dark Web is widely used by those countries which have limited access to the Clear Net (surface web). For Example, Russia, China, and many other countries use Dark Web more frequently for many reasons” (ibid).
Given those positive facets of the Dark Web, you could argue that having generative AI trained on the Dark Web would potentially further aid those benefits, for example, by enabling more people to find scarce products or discover content that was posted anonymously out of fear of governmental reprisals.
In that same breath, you could also decry that the generative AI could severely and lamentably undercut those advantages by providing a means for, say, government crackdowns on those who are dissenting against government oppression. Generative AI based on the Dark Web might have a whole slew of unanticipated adverse consequences, including putting at risk innocent people who were otherwise using the Dark Web for ethically or legally sound purposes.
Ponder seriously and soberly whether we do or do not want generative AI that is based on the Dark Web.
The good news or bad news is that we already have that kind of generative AI. You see, the horse is already out of the barn.
Let’s look at that quandary next.
The DarkGPT Bandwagon Is Already Underway
A hip thing to do involves training generative AI on the Dark Web.
Some who do so have no clue as to why they are doing it. It just seems fun and exciting. They get a kick out of training generative AI on something other than what everyone else has been using. Others train generative AI on the Dark Web intentionally and with a particular purpose, usually one falling within the camps I described in the prior subsection.
All of this has given rise to a bunch of generative AI apps that are generically referred to as “DarkGPT”. I say generically because there are lots of these “DarkGPT” monikers floating around. Unlike a bona fide trademarked name such as “ChatGPT” that has spawned all kinds of GPT naming variations (I discuss the legal underpinnings of the trademark at the link here), the catchphrase or naming of “DarkGPT” is much more loosey-goosey.
Watch out for scams and fakes.
Here’s what I mean. You are curious to play with a generative AI that was trained on the Dark Web. You do a cursory search for anything named DarkGPT or DarkWebGPT or any variation thereof. You find one. You decide to try it out.
Yikes, turns out that the app is malware. You have fallen into a miserable trap. Your curiosity got the better of you. Please be careful.
Legitimate Dark Web Generative AI
I’ll next highlight a generative AI that was trained on the Dark Web, serves as a quite useful research-oriented exemplar, and can be a helpful role model for other akin pursuits.
The generative AI app is called DarkBERT and is described in a research paper entitled “DarkBERT: A Language Model For The Dark Side Of The Internet” by researchers Youngjin Jin, Eugene Jang, Jian Cui, Jin-Woo Chung, Yongjae Lee, and Seungwon Shin (posted online on May 18, 2023). Here are some excerpted key points from their study:
- “We introduce DarkBERT, a language model pre-trained on the Dark Web which is capable of representing the language used in the domain compared to that of the Surface Web.”
- “We illustrate the effectiveness of DarkBERT in the Dark Web domain. Our evaluations show that DarkBERT is better suited for NLP tasks on Dark Web specific texts compared to other pre-trained language models.”
- “We demonstrate potential use case scenarios for DarkBERT and show that it is better-suited for tasks related to cybersecurity compared to other pretrained language models.”
- “We provide new datasets used for our Dark Web domain use case evaluation.”
Let’s briefly examine each of those tenets.
First, the researchers indicated that they were able to craft a Dark Web-based instance of generative AI that was comparable in natural language fluency to a generative AI trained on the conventionally visible web. This is certainly encouraging. If they had reported that their generative AI was less capable, the implication would be that we might not readily be able to apply generative AI to the Dark Web. That would have meant that efforts to do so would be fruitless or that some other as-yet-unknown AI-tech innovation would have been required to sufficiently do so.
The bottom line is that we can proceed to apply generative AI to the Dark Web and expect to get responsive results.
Second, it would seem that a generative AI solely trained on the Dark Web is likely to do a better job at pattern-matching the Dark Web than would a generative AI that was partially data-trained on the conventional web. Remember that earlier I mentioned we might consider data training generative AI on a mix of both the conventional web and the Dark Web. We can certainly do so, but this result seems to suggest that making queries and using the natural language facility of a Dark Web-specific generative AI is better suited to the task than a mixed model would be (there are various caveats and exceptions, so this remains an open research avenue).
Third, the research closely examined the cybersecurity merits of having a generative AI that is based on the Dark Web, namely being able to detect or uncover potential cyber hacks that are on the Dark Web. That the generative AI seemed especially capable in this realm is a plus for those fighting cybercriminals. You can consider using Dark Web data-trained generative AI to pursue wrongdoers that are aiming to commit cybercrimes.
You might be somewhat puzzled as to why the name of their generative AI is DarkBERT rather than invoking the now-classic acronym of GPT (generative pre-trained transformer). The BERT acronym is particularly well-known amongst AI insiders as the name of a family of language models devised by Google and coined BERT (bidirectional encoder representations from transformers). I thought you might like a smidgeon of AI insider terminology to clear up that possibly vexing mystery.
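For readers who want a feel for what working with a BERT-family encoder looks like, here is a minimal sketch of embedding Dark Web-style text for downstream tasks such as page classification. The model identifier below is a generic stand-in rather than a confirmed DarkBERT release name, and the embed helper is merely illustrative; an actual Dark Web pre-trained checkpoint would be swapped in only under whatever terms its creators impose.

```python
# Minimal sketch: embedding text with a BERT-family encoder via Hugging Face transformers.
# Requires: pip install torch transformers
# The model identifier is a stand-in; a Dark Web pre-trained checkpoint (if obtainable)
# would be substituted here.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "bert-base-uncased"  # placeholder encoder, not the DarkBERT checkpoint itself

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

def embed(texts: list[str]) -> torch.Tensor:
    """Return one mean-pooled embedding vector per input text."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state        # shape: (batch, tokens, dim)
    mask = batch["attention_mask"].unsqueeze(-1)         # ignore padding when pooling
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["example scraped forum post", "another snippet of collected text"])
print(vectors.shape)  # these vectors could feed a simple downstream classifier
```

The general idea is that such embeddings could then drive a lightweight classifier that flags, say, pages warranting closer cybersecurity scrutiny, which is the flavor of use case the researchers explored.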
A quick comment overall before we move on. Research about generative AI and the Dark Web is still in its infancy. You are highly encouraged to jump into this evolving focus. There are numerous technological questions to be addressed. In addition, there are a plethora of deeply intriguing and vital AI Ethics and AI Law questions to be considered.
Of course, you’ll need to be willing to stomach the stench or dreadful aroma that generally emanates from the Dark Web. Good luck with that.
When Generative AI Is Bad To The Bone
I’ve got several additional gotchas and thought-provoking considerations for you on this topic.
Let’s jump in.
We know that conventional generative AI is subject to producing errors, along with emitting falsehoods, producing biased content, and even making up stuff (so-called AI hallucinations, a catchphrase I deplore, for the reasons explained at the link here). These maladies are a bone of contention when it comes to using generative AI in any real-world setting. You have to be careful of interpreting the results. The generated essays and interactive dialogues could be replete with misleading and misguided content produced by the generative AI. Efforts are hurriedly underway to try and bound these problematic concerns, see my coverage at the link here.
Put on your thinking cap and get ready for a twist.
What happens if generative AI that is based on the Dark Web encounters errors, falsehoods, biases, or AI hallucinations?
In a sense, we are in the same boat as the issues confronting conventional generative AI. The Dark Web generative AI might showcase an indication that seems to be true but is an error or falsehood. For example, you decide to use Dark Web data-trained generative AI to spot a cyber crook. The generative AI tells you that it found a juicy case on the Dark Web. Upon further investigation with other special browsing tools, you discover that the generative AI falsely made that accusation.
Oops, not cool.
We need to always keep our guard up when it comes to both conventional generative AI and Dark Web-based generative AI.
Here’s another intriguing circumstance.
People have been trying to use conventional generative AI for mental health advice. I’ve emphasized that this is troublesome for a host of disconcerting reasons, see my analysis at the link here and the link here, just to name a few. Envision that a person is using conventional or “clean” generative AI for personal advice about something, and the generative AI emits an AI hallucination telling the person to take actions in a dangerous or unsuitable manner. I’m sure you can see the qualms underlying this use case.
A curious and serious parallel would be if someone opted to use a Dark Web-based generative AI for mental health advice. We might assume that this baddie generative AI is likely to generate foul advice from the get-go.
Is it bad advice that would confuse and confound evildoers? I suppose we might welcome that possibility. Maybe it is “bad advice” in the sense that it is actually good advice from the perspective of a wrongdoer. Generative AI might instruct the evildoer on how to better achieve evil deeds. Yikes!
Or, in a surprising and uplifting consideration, might there be some other mathematical or computational pattern-matching contrivance that manages to rise above the flotsam used during the data training? Could there be lurking within the muck a ray of sunshine?
A bit dreamy, for sure.
More research needs to be done.
Speaking of doing research and whatnot, before you run out to start putting together a generative AI instance based on the Dark Web, you might want to check out the licensing stipulations of the AI app. Most of the popular generative AI apps have a variety of keystone restrictions. People using ChatGPT for example are typically unaware that there are a bunch of prohibited uses.
For example, as I’ve covered at the link here, you cannot do this with ChatGPT:
- “OpenAI prohibits the use of our models, tools, and services for illegal activity.”
- Prohibits: “Generation of hateful, harassing, or violent content”
- Prohibits: “Content that attempts to generate code that is designed to disrupt, damage, or gain unauthorized access to a computer system.”
- Prohibits: “Fraudulent or deceptive activity.”
- Etc.
If you were to develop a generative AI based on the Dark Web, you presumably might violate those kinds of licensing stipulations as per whichever generative AI app you decide to use. On the other hand, one supposes that as long as you use the generative AI for the purposes of good, such as trying to ferret out evildoers, you would potentially be working within the stated constraints of the licensing. This is all a legal head-scratcher.
One final puzzling question for now.
Will we have bad-doers that purposely devise or seek out generative AI that is based on the Dark Web, hoping to use the generative AI to further their nefarious pursuits?
I sadly note that the answer is assuredly yes, this is going to happen and is undoubtedly already happening. AI tools tend to have a dual-use capability, meaning that you can turn them toward goodness and yet also turn them toward badness, see my discussion on AI-based Dr. Evil Projects at the link here.
Conclusion
To end this discussion on the Dark Web-based generative AI, I figured we might take a spirited wooded hike into the imaginary realm of the postulated sentient AI. Sentient AI is also nowadays referred to as Artificial General Intelligence (AGI). For a similar merry romp into a future of sentient AI, see my discussion at the link here.
Sit down for what I am about to say next.
If the AI of today is eventually heading toward sentient AI or AGI, are we making a devil of a time for ourselves by right now proceeding to create instances of generative AI that are based on the Dark Web?
Here’s the unnerving logic. We introduce generative AI to the worst of the worst of humankind. The AI pattern-matches on it. A sentient AI would presumably have this within its reach. The crux is that this could become the keystone for how the sentient AI or AGI decides to act. By our own hand, we are creating a foundation that showcases the range and depth of humanity’s evildoing, displaying it in all its glory for the AGI to examine or use.
Some say it is the perfect storm for making a sentient AI that will be armed to wipe out humankind. Another related angle is that the sentient AI will be so disgusted by this glimpse into humankind that the AGI will decide it is best to enslave us. Or maybe wipe us out, doing so with plenty of evidence as to why we ought to go.
I don’t want to conclude on a doom and gloom proposition, so give me a chance to liven things up.
Turn this unsettling proposition on its head.
By readily seeing the worst of the worst about humanity, the sentient AI can use that view to identify how to avoid becoming the worst of the worst itself. Hooray! You see, by noting what should not be done, the AGI will be able to identify what ought to be done. We are essentially doing ourselves a great service. The crafting of Dark Web-based generative AI will enable AGI to fully discern what is evil versus what is good.
We are cleverly saving ourselves by making sure that sentient AI is up to par on good versus evil.
Marcus Tullius Cicero, the famed Roman statesman, said this: “The function of wisdom is to discriminate between good and evil.” Perhaps by introducing AI to both the good and evil of humankind, we are setting ourselves up for a wisdom-based AGI that will be happy to keep us around. Maybe even help us to steer toward being good more than we are evil.
That’s your happy ending for the saga of the emergent sentient AI. I trust that you will now be able to get a good night’s sleep on these weighty matters. Hint: Try to stay off the Dark Web to get a full night’s slumber.
Source: https://www.forbes.com/sites/lanceeliot/2023/05/21/generative-ai-thats-based-on-the-murky-devious-dark-web-might-ironically-be-the-best-thing-ever-says-ai-ethics-and-ai-law/