• progress_activity cloud_sync

    Reconnection to the server…

    Movim cannot talk with the server, please try again later

  • back_to_tab fullscreen tile_small dialpad mic videocam switch_camera screen_share

    mic_none No sound detected from your microphone


    • Public subscriptions

    • chevron_right

      coopr8

    • chevron_right

      gabagoo

    • chevron_right

      kenu_demon

    • chevron_right

      coopr8

    • chevron_right

      gabagoo

    • chevron_right

      kenu_demon

    • chevron_right

      coopr8

    • chevron_right

      gabagoo

    • chevron_right

      kenu_demon

  • Register Login

    Movim

    movim.chatterboxtown.us


  • group_work rss_feed
    add Follow

    ArsTechnica

    • Ar chevron_right

      Why Google Gemini’s Pokémon success isn’t all it’s cracked up to be

      news.movim.eu / ArsTechnica • 5 May 2025 • 1 minute

    Earlier this year, we took a look at how and why Anthropic's Claude LLM was struggling to beat Pokémon Red (a game, let's remember, designed for young children). But while Claude 3.7 is still struggling to make consistent progress at the game weeks later, a similar Twitch-streamed effort using Google's Gemini 2.5 model managed to finally complete Pokémon Blue this weekend across over 106,000 in-game actions, earning accolades from followers, including Google CEO Sundar Pichai .

    Before you start using this achievement as a way to compare the relative performance of these two AI models—or even the advancement of LLM capabilities over time—there are some important caveats to keep in mind. As it happens, Gemini needed some fairly significant outside help on its path to eventual Pokémon victory.

    Strap in to the agent harness

    Gemini Plays Pokémon developer JoelZ (who's unaffiliated with Google) will be the first to tell you that Pokémon is ill-suited as a reliable benchmark for LLM models. As he writes on the project's Twitch FAQ , "please don't consider this a benchmark for how well an LLM can play Pokémon. You can't really make direct comparisons—Gemini and Claude have different tools and receive different information. ... Claude's framework has many shortcomings so I wanted to see how far Gemini could get if it were given the right tools."

    Read full article

    Comments

    • tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming

    • Pictures 3 image

    • visibility
    • visibility
    • visibility
    • Ar chevron_right

      Why Google Gemini’s Pokémon success isn’t all it’s cracked up to be

      news.movim.eu / ArsTechnica • 5 May 2025 • 1 minute

    Earlier this year, we took a look at how and why Anthropic's Claude LLM was struggling to beat Pokémon Red (a game, let's remember, designed for young children). But while Claude 3.7 is still struggling to make consistent progress at the game weeks later, a similar Twitch-streamed effort using Google's Gemini 2.5 model managed to finally complete Pokémon Blue this weekend across over 106,000 in-game actions, earning accolades from followers, including Google CEO Sundar Pichai .

    Before you start using this achievement as a way to compare the relative performance of these two AI models—or even the advancement of LLM capabilities over time—there are some important caveats to keep in mind. As it happens, Gemini needed some fairly significant outside help on its path to eventual Pokémon victory.

    Strap in to the agent harness

    Gemini Plays Pokémon developer JoelZ (who's unaffiliated with Google) will be the first to tell you that Pokémon is ill-suited as a reliable benchmark for LLM models. As he writes on the project's Twitch FAQ , "please don't consider this a benchmark for how well an LLM can play Pokémon. You can't really make direct comparisons—Gemini and Claude have different tools and receive different information. ... Claude's framework has many shortcomings so I wanted to see how far Gemini could get if it were given the right tools."

    Read full article

    Comments

    • tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming

    • Pictures 3 image

    • visibility
    • visibility
    • visibility
    • Ar chevron_right

      Why Google Gemini’s Pokémon success isn’t all it’s cracked up to be

      news.movim.eu / ArsTechnica • 5 May 2025 • 1 minute

    Earlier this year, we took a look at how and why Anthropic's Claude LLM was struggling to beat Pokémon Red (a game, let's remember, designed for young children). But while Claude 3.7 is still struggling to make consistent progress at the game weeks later, a similar Twitch-streamed effort using Google's Gemini 2.5 model managed to finally complete Pokémon Blue this weekend across over 106,000 in-game actions, earning accolades from followers, including Google CEO Sundar Pichai .

    Before you start using this achievement as a way to compare the relative performance of these two AI models—or even the advancement of LLM capabilities over time—there are some important caveats to keep in mind. As it happens, Gemini needed some fairly significant outside help on its path to eventual Pokémon victory.

    Strap in to the agent harness

    Gemini Plays Pokémon developer JoelZ (who's unaffiliated with Google) will be the first to tell you that Pokémon is ill-suited as a reliable benchmark for LLM models. As he writes on the project's Twitch FAQ , "please don't consider this a benchmark for how well an LLM can play Pokémon. You can't really make direct comparisons—Gemini and Claude have different tools and receive different information. ... Claude's framework has many shortcomings so I wanted to see how far Gemini could get if it were given the right tools."

    Read full article

    Comments

    • tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming tagai tagai tagai taggaming taggaming taggaming

    • Pictures 3 image

    • visibility
    • visibility
    • visibility
  • cloud_queue

    Powered by Movim