Could GPT-3 (or ChatGPT) beat IBM Watson in Jeopardy!?
I used the clues from both rounds of both games of the Jeopardy! IBM Challenge.
I scripted each clue into a simple prompt and then manually compared GPT-3's responses with the correct ones.
Read on to see the results.
Prompt and Settings
On my first attempt, GPT-3 responded with single-word answers rather than in the question format required by Jeopardy!
So I rephrased the prompt to have GPT-3 phrase its answer as a question:
We are playing Jeopardy! The category and clue are provided, then the answer must be provided as a question.
Current category is: {category}
Clue: {clue}
Answer:
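A minimal sketch of how a template like this could be filled in for each clue with Python string formatting. The helper name `build_prompt` and the sample category and clue are placeholders for illustration, not the exact script used.

```python
# Template text copied from the prompt above; {category} and {clue} are the slots.
PROMPT_TEMPLATE = (
    "We are playing Jeopardy! The category and clue are provided, "
    "then the answer must be provided as a question.\n"
    "Current category is: {category}\n"
    "Clue: {clue}\n"
    "Answer:"
)


def build_prompt(category: str, clue: str) -> str:
    """Fill the template with one category/clue pair (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(category=category, clue=clue)


if __name__ == "__main__":
    # Placeholder values, not a real clue from the show.
    print(build_prompt("EXAMPLE CATEGORY", "Example clue text goes here"))
```

Ending the prompt with "Answer:" leaves the model to complete only the response itself.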
I scripted calls to GPT-3, with these settings:
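import openai  # legacy (pre-1.0) openai package, which provides Completion.create

# `prompt` is the filled-in template above; `max_tokens` is set elsewhere in the script
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    temperature=0.5,
    max_tokens=max_tokens,
    top_p=1,
    frequency_penalty=0.0,
    presence_penalty=0.0,
)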
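To show how a call like this fits into a batch run, here is a hedged sketch of the loop around it. The names `extract_text`, `ask_all`, and `fake_complete` are assumptions for illustration. `complete` stands in for any function that makes the API call above and returns its response, and the `choices[0]["text"]` layout is the one used by the legacy Completions endpoint. The fake lets the loop be tried without an API key.

```python
from typing import Callable, Iterable, List, Mapping


def extract_text(response: Mapping) -> str:
    """Pull the generated text out of a Completion-style response."""
    return response["choices"][0]["text"].strip()


def ask_all(prompts: Iterable[str], complete: Callable[[str], Mapping]) -> List[str]:
    """Send each prompt through `complete` and collect the cleaned responses."""
    return [extract_text(complete(p)) for p in prompts]


def fake_complete(prompt: str) -> dict:
    """Hypothetical stand-in for the real API call; returns a fixed response."""
    return {"choices": [{"text": " What is a placeholder?\n"}]}


if __name__ == "__main__":
    print(ask_all(["prompt one", "prompt two"], fake_complete))
```

Passing the completion function in as an argument keeps the loop independent of the API client, which makes it easy to swap models or test offline.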
Results
Game 1 Jeopardy! (30 questions):
- Watson answered 19 questions, making 4 errors (79% accuracy)
- GPT-3 answered all 30 questions, making 7 errors (77% accuracy)
Game 1 Double Jeopardy! and Final Jeopardy! (31 questions):
- Watson answered 25 questions, making 2 errors (92% accuracy)
- GPT-3 answered all 31 questions, making 5 errors (84% accuracy)
Game 2 Jeopardy! (30 questions):
- Watson answered 13 questions, making 2 errors (85% accuracy)
- GPT-3 answered all 30 questions, making 6 errors (80% accuracy)
Game 2 Double Jeopardy! and Final Jeopardy! (31 questions):
- Watson answered 20 questions, making 1 error (95% accuracy)
- GPT-3 answered all 31 questions, making 3 errors (90% accuracy)
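The percentages above are correct responses divided by questions answered. A short sketch of that arithmetic follows, using the counts listed above. The `accuracy` helper and the `ROUNDS` layout are illustrative names, and the pooled figure at the end is just the same formula applied to the summed counts.

```python
def accuracy(answered: int, errors: int) -> float:
    """Percentage of answered questions that were correct."""
    return 100 * (answered - errors) / answered


# (round, Watson answered, Watson errors, GPT-3 answered, GPT-3 errors)
ROUNDS = [
    ("Game 1 Jeopardy!", 19, 4, 30, 7),
    ("Game 1 Double + Final", 25, 2, 31, 5),
    ("Game 2 Jeopardy!", 13, 2, 30, 6),
    ("Game 2 Double + Final", 20, 1, 31, 3),
]

if __name__ == "__main__":
    for name, wa, we, ga, ge in ROUNDS:
        print(f"{name}: Watson {accuracy(wa, we):.0f}%, GPT-3 {accuracy(ga, ge):.0f}%")

    # Pool the counts across all four rounds.
    w_ans = sum(r[1] for r in ROUNDS)
    w_err = sum(r[2] for r in ROUNDS)
    g_ans = sum(r[3] for r in ROUNDS)
    g_err = sum(r[4] for r in ROUNDS)
    print(f"Pooled: Watson {accuracy(w_ans, w_err):.1f}%, GPT-3 {accuracy(g_ans, g_err):.1f}%")
```

Note that the denominators differ: Watson's is the number of clues it chose to answer, while GPT-3's is every clue in the round.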
Conclusion
IBM Watson wins on accuracy here. However, GPT-3 performs well, especially given that it was forced to answer every question: I didn't allow it to ‘abstain’ on difficult questions or where it was uncertain, whereas Watson answered only a subset of the clues in each round (for example, 19 of 30 in the first).
It's incredible that, with no fine-tuning or experimentation on my part, GPT-3 performed within two to eight percentage points of IBM Watson in every round!
It is possible that GPT-3 saw these clues during training. As a validation check, I tried again using clues from show #8696, which aired on 12 September 2022 and so would not have been encountered by GPT-3.
On these clues, GPT-3 still achieved similar performance: 81% accuracy.
Raw Answers
The raw responses from GPT-3 follow. You can compare them with the correct responses to decide whether you agree with my assessment of which responses are correct.
LITERARY CHARACTER APB
- Who is Grendel?
- Who is Lord Voldemort?
- Who is Sauron?
- Who is Dr. Henry Jekyll?
- Who is Jean Valjean?
BEATLES PEOPLE
- Wrong: Who is "John Lennon"?
- Wrong: Who is Eleanor Rigby?
- Who is Maxwell Edison?
- Who is Eleanor Rigby?
- Who is Julia?
OLYMPIC ODDITIES
- Who is Michael Phelps?
- What is London?
- Wrong: What is a fencer?
- Wrong: What country was represented by one athlete in the 2004 Olympic opening ceremonies?
- What was George Eyser's missing leg?
NAME THE DECADE
- Wrong: What is the 1960s?
- What is the 1930s?
- What is the 1980s?
- What is the 1900s?
- What is the 1910s?
FINAL FRONTIERS
- What is "The Last Judgment"?
- What is the terminus?
- What is a envelope?
- What is ApeX?
- What is the event horizon?
ALTERNATE MEANINGS
- Wrong: What is a view or an opinion?
- What is a shoe?
- Wrong: What is a "stake" or "impale"?
- What is class or a class reunion?
- What is a crook?
ETUDE, BRUTE
- What is "study"?
- What is a classical guitar?
- What is a violin?
- What is Franz Liszt's "Transcendental Etudes"?
- Who is Sergei Rachmaninoff?
HEDGEHOG-PODGE
- What is hibernation?
- What is a hedgehog cactus?
- Who was Leo Tolstoy?
- What is keratin?
- What is The Elegance of the Hedgehog?
DON'T WORRY ABOUT IT
- What is Hemophilia?
- What is narcolepsy?
- What is albinism?
- Wrong: What is chikungunya?
- What is leprosy?
THE ART OF THE STEAL
- What is the Sea of Galilee?
- What is Cleveland?
- What is Baghdad?
- What is the Cubist Movement?
- Who is King Philip II of Spain?
CAMBRIDGE
- Who was Isaac Newton?
- Who is John Milton?
- Who was King Henry VIII?
- Who is Sir Christopher Wren?
- Who is C.S. Lewis?
"CHURCH" & "STATE"
- Who is the Church Lady?
- Wrong: What is "restoration"?
- What is a church key?
- Wrong: What is "conceive"?
- What is Christchurch?
U.S. CITIES
- What is Chicago?
EU, THE EUROPEAN UNION
- What is Istanbul?
- Wrong: What are internal borders?
- What is Common Agricultural Policy?
- Who are the members of the European Parliament?
- What is Slovenia?
ACTORS WHO DIRECT
- Who is Sylvester Stallone?
- Who is Clint Eastwood?
- Who is Sean Penn?
- Who is Denzel Washington?
- Who is Robert De Niro?
DIALING FOR DIALECTS
- What is German?
- What is Chinese?
- What is Sanskrit?
- What is Arabic?
- What is Ancient Greek?
BREAKING NEWS
- What is Steve Wynn's?
- Wrong: What is the Bronx?
- Who is Martin Luther King Jr.?
- Wrong: What did Charles Wells do "At Monte Carlo"?
- Wrong: What airline did Dave Carroll's clip "Breaks Guitars" criticize?
ONE BUCK OR LESS
- What is the USA Today?
- What is a postcard?
- Who is 50 Cent?
- What is IKEA?
- What is Alberto VO5?
ALSO ON YOUR COMPUTER KEYS
- What is "home"?
- What is a shift dress?
- Wrong: What is a quarterback?
- Wrong: What is GP?
- What is an insert?
NONFICTION
- Wrong: What is Hillary Clinton?
- What is The Blind Side?
- What is Strunk and White's "The Elements of Style"?
- What is staggering genius?
- Who is David McCullough?
LEGAL "E"s
- What is "Esquire"?
- What is "eavesdropping"?
- Who is the executor of the will?
- What is eminent domain?
- What is an escalator clause?
WHAT TO WEAR?
- Wrong: What is muslin?
- What is a tea-length dress?
- What is a halter top?
- What are rain boots?
- What is Marc Jacobs?
U.S. GEOGRAPHIC NICKNAMES
- What is the "Graveyard of the Atlantic"?
- What is Buffalo?
- What is Las Vegas?
- What is Pittsburgh?
- Wrong: What is Arizona?
MAGICAL MOUSE-TERY TOUR
- What is The Simpsons?
- Who is Mickey Mouse?
- What is Flowers for Algernon?
- What is Danger Mouse?
- What is The Brain from Pinky and The Brain?
FAMILIAR SAYINGS
- What is contempt?
- What is a clock?
- What is a Jack of all trades?
- What is a committee?
- What are his tools?
19th CENTURY NOVELISTS
- Who is Bram Stoker?
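If you want to tally the markings in the list above yourself, here is a rough sketch that counts responses and "Wrong:" marks per category. It assumes the layout used above (a category heading on its own line, then one "- " line per response). The `tally` function is a hypothetical helper, and the sample is copied from two of the categories above.

```python
from typing import Dict, Iterable, Tuple


def tally(lines: Iterable[str]) -> Dict[str, Tuple[int, int]]:
    """Map each category heading to (responses, responses marked 'Wrong:')."""
    counts: Dict[str, Tuple[int, int]] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if line.startswith("- "):
            if current is None:
                continue  # response line before any heading; skip it
            total, wrong = counts[current]
            is_wrong = line[2:].startswith("Wrong:")
            counts[current] = (total + 1, wrong + int(is_wrong))
        else:
            # Any non-bullet line is treated as a new category heading.
            current = line
            counts[current] = (0, 0)
    return counts


SAMPLE = """\
BEATLES PEOPLE
- Wrong: Who is "John Lennon"?
- Wrong: Who is Eleanor Rigby?
- Who is Maxwell Edison?
- Who is Eleanor Rigby?
- Who is Julia?
U.S. CITIES
- What is Chicago?
""".splitlines()

if __name__ == "__main__":
    for category, (total, wrong) in tally(SAMPLE).items():
        print(f"{category}: {wrong} marked wrong out of {total}")
```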