Screen Vision
Vision AI that watches your game in real time
Sidekick AI's screen vision reads your gameplay frame by frame. The companion already knows the boss, the phase, and the loot — so voice coaching matches the moment instead of waiting for you to describe it.
Add to Steam Wishlist
How It Works
Reads your screen frame by frame
The companion sees what you see at native frame cadence. Boss health bar, your position, the cursor over a chest, the icon of the spell that just popped — all of it.
Identifies game state, not just pixels
Vision AI turns raw frames into structured understanding: which game, which encounter, which phase, which mechanic is winding up. The companion talks about the situation, not the screen.
Acts on the moment, not the description
Because the companion already sees the scene, you skip the entire describe-then-respond loop. You react to Sidekick's voice; Sidekick reacts to your game.
Lightweight on-device gate, server-side analysis
A change detector runs on your machine to decide which frames are worth analyzing — most frames in a session don't change enough to matter. The frames that pass the gate are sent to Sidekick's vision model for the actual analysis.
Why screen vision is the headline feature
Every AI gaming companion claims to help in real time. The honest test is whether the companion can act on the moment without you describing it. A chatbot that needs you to type “I'm at half health and Malenia just started Waterfowl Dance” before it gives advice has already lost the moment. By the time you finish typing, the fight is over.
Screen vision collapses the loop. The companion sees the boss health bar, sees the Waterfowl windup animation, sees your stamina, and calls the dodge timing in voice before you can articulate what's happening. That's the entire pitch of real-time AI coaching, and screen vision is what makes it real instead of marketing.
What “reads your screen” means in practice
Vision AI doesn't just dump pixels into a language model. The pipeline turns each captured frame into structured signals: which game is running, what scene is on screen, what UI elements are visible, what the player avatar is doing, what enemies are present, what state the player and enemies are in. Those signals are what the coaching layer actually reasons about.
That structure is why Sidekick can make precise calls instead of vague observations. The companion can say “you're at 30% HP, back out and chug an Estus” because the vision layer extracted your HP and your flask count — not because the model guessed.
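To make the idea concrete, here is a minimal sketch of a coaching rule that reasons over structured signals rather than pixels. Everything in it is an illustrative assumption: the `FrameSignals` fields, the `coaching_call` function, and the 30% threshold are hypothetical names and values, not Sidekick's actual schema or logic.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FrameSignals:
    """Hypothetical structured output of a vision layer for one frame."""
    game: str
    scene: str
    player_hp_pct: float               # 0.0 to 1.0, as read from the HP bar
    flask_count: int                   # healing charges shown in the HUD
    boss_windup: Optional[str] = None  # name of a telegraphed attack, if any


def coaching_call(s: FrameSignals, low_hp: float = 0.3) -> Optional[str]:
    """Return a short voice line, or None when there is nothing worth saying."""
    # A telegraphed attack is time-critical, so it outranks healing advice.
    if s.boss_windup is not None:
        return f"{s.boss_windup} is winding up, get ready to dodge"
    if s.player_hp_pct <= low_hp and s.flask_count > 0:
        return f"you're at {round(s.player_hp_pct * 100)}% HP, back out and heal"
    if s.player_hp_pct <= low_hp:
        return "low HP and no flasks left, play safe"
    return None  # nothing urgent, stay quiet


# Example: low health with flasks remaining produces a healing call.
print(coaching_call(FrameSignals("Elden Ring", "boss fight", 0.3, 2)))
```

The point of the sketch is that the rule only ever touches named fields such as HP and flask count, which is what lets a call be precise rather than a guess.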
How Sidekick differs from Character.AI, ChatGPT, and Replika
Most AI assistants and chatbots are blind to your game. Character.AI, ChatGPT, Replika: none of them can see what you see. They can chat about a game you describe to them, but they can't coach during play because the loop is too slow.
The AI gaming companion category exists because screen vision changed what was possible. Sidekick AI is built around that change. The 3D avatar, the voice layer, and the HypeReel highlight workflow all sit on top of the vision layer being good enough that the companion already knows what's happening when it speaks.
How the vision pipeline actually works
Frame capture happens at the operating system level — the same Windows.Graphics.Capture and Core Graphics window-capture APIs OBS and other capture tools use. There's no DLL injection, no game-memory hook, no driver-level instrumentation. The companion never attaches to your game's process, so anti-cheat treats Sidekick like any other capture tool. The cost of this design is that Sidekick only sees what's rendered to your screen; the upside is that it works with any PC game without per-title integration.
Once a frame is captured, a lightweight change detector runs on your machine deciding whether the frame is worth sending — most consecutive frames in a session are visually similar enough that the model has nothing new to say. The frames that clear the gate are sent to Sidekick's vision-language model with a game-aware coaching prompt. The model returns a structured response (which game, which scene, what UI is visible, what entities are in the frame, what the player is doing). The coaching layer reasons about that structured response, not the raw pixels, when deciding what to say in your headset. That pipeline shape is why coaching can match the moment — the companion already knows the situation before it speaks.
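The on-device gate described above can be sketched in a few lines. This is a minimal illustration, not Sidekick's detector: the `ChangeGate` class, the mean-absolute-difference metric, and the threshold of 8 are all assumptions, and frames are modeled as flat sequences of grayscale values (0 to 255) to keep the example dependency-free.

```python
from typing import Optional, Sequence


class ChangeGate:
    """Pass a frame only when it differs enough from the last frame that passed."""

    def __init__(self, threshold: float = 8.0) -> None:
        # Mean absolute per-pixel difference on a 0-255 scale (assumed value).
        self.threshold = threshold
        self._last_sent: Optional[Sequence[int]] = None

    def should_send(self, frame: Sequence[int]) -> bool:
        if not frame:
            return False  # nothing to analyze
        # First frame, or a resolution change: always send and reset the baseline.
        if self._last_sent is None or len(frame) != len(self._last_sent):
            self._last_sent = frame
            return True
        diff = sum(abs(a - b) for a, b in zip(frame, self._last_sent)) / len(frame)
        if diff >= self.threshold:
            self._last_sent = frame
            return True
        return False  # too similar to the last analyzed frame


gate = ChangeGate()
frames = [[0] * 16, [1] * 16, [20] * 16, [20] * 16]
print([gate.should_send(f) for f in frames])  # [True, False, True, False]
```

Comparing against the last frame that passed, rather than the immediately preceding frame, means slow drift (a health bar draining a pixel at a time) eventually accumulates past the threshold instead of being ignored forever.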
What it sees during play
The structured-understanding part is abstract. The concrete version is what the companion can call out during a session across the genres Sidekick is tuned for. The vision model is general-purpose; it recognizes the elements below because large multimodal models have seen these games in their training data, not because Sidekick ships per-game extractors.
In a Souls boss fight: the boss's health bar and posture meter, your HP and stamina bars, the wind-up animation that signals an incoming attack, your flask count, your skill cooldowns, the boss's phase transitions. That's how Sidekick can say “you're at 30%, flask now” or “Waterfowl is winding up, run for the first flurry” instead of generic dodge advice.
In a metroidvania: the mini-map state (rooms visited, rooms unexplored), your movement abilities, the icon of the charm or relic you just picked up, the lock-and-key doors on the current floor. That's how Sidekick can nudge you toward the unexplored room two screens northwest without spoiling the whole map.
In an RPG turn: the turn order, ability cooldowns and resource counts, the dialog choices on screen, environmental hazards in the encounter, the condition icons on each combatant. That's how the companion can flag “the elemental surface ignites if you cast that fireball, your ally is standing in it” before you click.
In a survival horror: your ammo count, herb or healing-item stack, the save typewriter availability, enemy positions on the radar, the condition state of your weapons. That's how Sidekick can say “four pistol rounds and one green herb left, the next merchant is past the church” instead of generic scarcity advice.
Privacy and control — what the companion does and doesn't see
Sidekick reads the surface you point it at. On multi-monitor setups you select which window or display the companion sees; a second monitor running Discord, OBS, a wiki tab, or your email stays private. The vision layer never reaches into windows you haven't selected.
Pausing vision is one click. When paused, the companion continues to chat but stops capturing or analyzing the screen — useful for cutscenes you want to experience without spoiler-adjacent commentary, story moments where coaching would feel intrusive, or just times when you want company without analysis. Pause is a per-session control; the next launch starts in your default state.
HypeReel clips are a separate, opt-in workflow. Those clips exist because you triggered them — a highlight worth saving — and they're yours to keep, edit, share, or delete from your account. The clip pipeline uses the same vision layer for highlight detection, but the resulting video is under your control, not the vision layer's.
Frequently Asked Questions
How does Sidekick AI's screen vision actually work?
Which games does screen vision work with?
Does screen vision slow down my game?
Can Sidekick see HUD elements, menus, and inventory screens?
What about spoilers? Will Sidekick reveal late-game content?
Is screen vision different from streaming or screen recording?
Does screen vision work on multi-monitor setups?
Can I turn screen vision off temporarily?
Does Sidekick capture frames via DLL injection or anything anti-cheat would flag?
Can I run Sidekick alongside a streaming setup like OBS?
Ready to play smarter?
Sidekick AI uses vision AI to watch your screen and coach you in real time. Try the free demo on Steam.
Add to Steam Wishlist