In a recent study published in the journal Scientific Reports, researchers compared human and artificial intelligence (AI) chatbot creativity using the alternate uses task (AUT) to understand the current boundaries and potential of machine-generated creativity.
Study: Best humans still outperform artificial intelligence in a creative divergent thinking task. Image Credit: girafchik / Shutterstock
Background
Generative AI tools, such as Chat Generative Pre-Trained Transformer (ChatGPT) and MidJourney, have stirred debates regarding their impact on jobs, education, and legal protections for AI-generated content. Historically, creativity was seen as uniquely human and traditionally linked to originality and usefulness, but AI's emerging capabilities are now challenging and redefining this belief. However, there is a pressing need for further research to better understand the underlying mechanisms of human and AI creativity and their implications for society, employment, ethics, and the shifting definition of human identity in the AI era.
About the study
In the current study, AUT data from humans was sourced from a previous research project, in which native English speakers were recruited through the online platform Prolific. Of the 310 participants who began the study, 279 completed it. After their attentiveness was assessed through visual tasks, 256 were deemed diligent and included in the analysis, with a mean age of 30.4 years. Most participants were full-time employees or students and came primarily from the United States of America (USA), the United Kingdom (UK), Canada, and Ireland.
In 2023, three AI chatbots, ChatGPT3.5 (hereafter referred to as ChatGPT3), ChatGPT4, and Copy.Ai, were tested on specific dates, each examined 11 times using four different object prompts across separate sessions. This approach ensured a sufficiently large sample to discern differences, especially when compared against the extensive human data.
For the AUT procedure, participants were presented with four objects: rope, box, pencil, and candle, and were instructed to prioritize the originality of their answers rather than sheer volume. While humans were tested once per session, the AIs underwent 11 distinct sessions, with instructions slightly modified to suit their design. The primary concern was keeping the AI responses comparable to human answers, especially in length.
Before analysis, the responses underwent a spell-check, and any ambiguous short answers were discarded. The essence of divergent thinking was gauged using the semantic distance between an object and its AUT response, computed with the SemDis platform. Any potential bias in responses, particularly from AIs using jargon like "Do It Yourself" (DIY), was addressed for consistency.
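To illustrate the general idea behind semantic-distance scoring (SemDis combines several semantic spaces, and its exact pipeline is not reproduced here), the following minimal Python sketch computes one minus the cosine similarity between a prompt word and the averaged word vectors of a response, assuming a locally stored GloVe vector file; the file path and function names are illustrative only.

```python
import numpy as np

def load_glove(path):
    """Load GloVe word vectors from a plain-text file into a dict (assumed local file)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.array(parts[1:], dtype=float)
    return vectors

def semantic_distance(prompt, response, vectors):
    """Return 1 - cosine similarity between the prompt vector and the mean response vector."""
    words = [w for w in response.lower().split() if w in vectors]
    if prompt not in vectors or not words:
        return None  # cannot score responses with no known words
    resp_vec = np.mean([vectors[w] for w in words], axis=0)
    prompt_vec = vectors[prompt]
    cos = np.dot(prompt_vec, resp_vec) / (np.linalg.norm(prompt_vec) * np.linalg.norm(resp_vec))
    return 1.0 - cos

# Example (hypothetical file path): distance between the AUT object "rope" and one response
# vectors = load_glove("glove.6B.300d.txt")
# print(semantic_distance("rope", "weave a hammock for a treehouse", vectors))
```

A larger distance indicates that the proposed use is semantically farther from the object itself, which is taken as a proxy for a more divergent, original response.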
The originality of answers was evaluated by six human raters who, unaware of which responses were AI-generated, rated each answer's originality on a scale of 1 to 5. The rating methodology had clear guidelines to ensure objectivity, and the collective ratings demonstrated high inter-rater reliability.
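The paper's exact reliability statistic is not restated in this article, but one common way to quantify agreement among a fixed panel of raters is Cronbach's alpha, sketched below for a toy matrix of responses rated by six raters; the data are invented for illustration.

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array, rows = responses, columns = raters (1-5 originality scale)."""
    ratings = np.asarray(ratings, dtype=float)
    n_raters = ratings.shape[1]
    rater_vars = ratings.var(axis=0, ddof=1)      # variance of each rater's scores
    total_var = ratings.sum(axis=1).var(ddof=1)   # variance of the summed scores per response
    return (n_raters / (n_raters - 1)) * (1 - rater_vars.sum() / total_var)

# Toy example: 5 responses, each rated by 6 raters
toy = [[2, 3, 2, 3, 2, 3],
       [4, 4, 5, 4, 4, 5],
       [1, 2, 1, 1, 2, 1],
       [3, 3, 4, 3, 3, 4],
       [5, 4, 5, 5, 4, 5]]
print(round(cronbach_alpha(toy), 2))
```

Values close to 1 indicate that the raters' scores rise and fall together across responses, which is what "high inter-rater reliability" refers to.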
Lastly, the data underwent rigorous statistical analyses aimed at deriving meaningful conclusions. Various models were employed to evaluate the scores, taking into account fixed effects such as group and object as well as potential covariates.
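For readers unfamiliar with linear mixed-effect models, the sketch below shows the general form such an analysis might take in Python with statsmodels, with group and object as fixed effects, fluency as a covariate, and a random intercept per participant or session; the column names and input file are hypothetical, and the authors' exact model specification is not reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Assumed layout: one row per AUT response with columns
# score (semantic distance or mean human rating), group ("human"/"AI"),
# object ("rope"/"box"/"pencil"/"candle"), fluency, and participant_id.
df = pd.read_csv("aut_scores.csv")  # hypothetical file name

# Fixed effects for group, object, and the fluency covariate;
# random intercept for each participant or AI session.
model = smf.mixedlm("score ~ group + object + fluency",
                    data=df,
                    groups=df["participant_id"])
result = model.fit()
print(result.summary())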
Study results
The current study analyzed creative divergent thinking in humans and AI chatbots, focusing on their responses to different objects, and observed a moderate correlation between semantic distance and humans' subjective ratings. This suggested that while the two scoring methods measured similar qualities, they were not identical in nature. Therefore, it was deemed appropriate to evaluate the data using both semantic distance and subjective ratings.
Using linear mixed-effect models for a broad comparison between humans and AI, a consistent pattern emerged: AI chatbots not only generally outperformed humans but also had higher mean and maximum scores in terms of semantic distance. When fluency was included as a covariate, it was seen to decrease the mean scores but increase the maximum scores. This trend was also reflected in the human subjective ratings of creativity, where AI again scored higher in both mean and maximum scores. An interesting observation was that while some human participants responded with typical or even illogical uses of the objects, AI chatbots consistently provided atypical and logical uses, never scoring below a certain threshold.
The study delved deeper into comparing the responses of humans and individual AI chatbots to the specific objects, namely a rope, box, pencil, and candle. The analyses showed that ChatGPT3 and ChatGPT4, two of the AI models, outperformed humans in terms of mean semantic distance scores. However, when considering maximum scores, there was no statistically significant difference between human participants and the AI chatbots. It was also observed that responses to the rope were typically rated lower in terms of semantic distance than the other objects.
The human subjective ratings of creativity revealed that ChatGPT4 consistently received higher ratings than both humans and the other chatbots, showcasing its clear edge. However, this advantage was not observed when the chatbots were tasked with the object "pencil." An interesting pattern emerged with the candle, as responses related to it generally received lower ratings compared with the other objects. A standout observation was that two AI sessions, one from ChatGPT3 and the other from ChatGPT4, recorded maximum scores higher than any human for the object "box."
The data underlined the impressive performance of AI chatbots, particularly ChatGPT4, in creative divergent thinking tasks compared with humans. However, it is worth noting that AI did not uniformly outpace humans across all metrics or objects, underscoring the complexities of creativity and the areas where humans still hold an advantage.