Screen Vision

Vision AI that watches your game in real time

Sidekick AI's screen vision reads your gameplay frame by frame. The companion already knows the boss, the phase, and the loot — so voice coaching matches the moment instead of waiting for you to describe it.

Add to Steam Wishlist

How It Works

Reads your screen frame by frame

The companion sees what you see at native frame cadence. Boss health bar, your position, the cursor over a chest, the icon of the spell that just popped — all of it.

Identifies game state, not just pixels

Vision AI turns raw frames into structured understanding: which game, which encounter, which phase, which mechanic is winding up. The companion talks about the situation, not the screen.

Acts on the moment, not the description

Because the companion already sees the scene, you skip the entire describe-then-respond loop. You react to Sidekick's voice; Sidekick reacts to your game.

Lightweight on-device gate, server-side analysis

A change detector runs on your machine to decide which frames are worth analyzing — most frames in a session don't change enough to matter. The frames that pass the gate are sent to Sidekick's vision model for the actual analysis.

Why screen vision is the headline feature

Every AI gaming companion claims to help in real time. The honest test is whether the companion can act on the moment without you describing it. A chatbot that needs you to type “I'm at half health and Malenia just started Waterfowl Dance” before it gives advice has already lost the moment. By the time you finish typing, the fight is over.

Screen vision collapses the loop. The companion sees the boss health bar, sees the Waterfowl windup animation, sees your stamina, and calls the dodge timing in voice before you can articulate what's happening. That's the entire pitch of real-time AI coaching, and screen vision is what makes it real instead of marketing.

What “reads your screen” means in practice

Vision AI doesn't just dump pixels into a language model. The pipeline turns each captured frame into structured signals: which game is running, what scene is on screen, what UI elements are visible, what the player avatar is doing, what enemies are present, what state the player and enemies are in. Those signals are what the coaching layer actually reasons about.

That structure is why Sidekick can make precise calls instead of vague observations. The companion can say “you're at 30% HP, back out and chug an Estus” because the vision layer extracted your HP and your flask count — not because the model guessed.
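
As a rough illustration of how a coaching rule could read structured signals instead of pixels, here is a minimal Python sketch. The `PlayerState` fields, the `coaching_call` function, and the 30% threshold are hypothetical names and values chosen for this example; Sidekick's actual schema and rules are not described on this page.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PlayerState:
    # Hypothetical fields: assumed to be extracted by the vision layer.
    hp_fraction: float  # 0.0 to 1.0, read from the health bar
    flask_count: int    # healing charges shown in the HUD


def coaching_call(state: PlayerState, low_hp_threshold: float = 0.30) -> Optional[str]:
    """Return a short voice line, or None when there is nothing worth saying."""
    if state.hp_fraction > low_hp_threshold:
        return None  # healthy enough, stay quiet
    if state.flask_count > 0:
        pct = round(state.hp_fraction * 100)
        return f"You're at {pct}% HP, back out and chug an Estus."
    return "You're low and out of flasks, play it safe."


print(coaching_call(PlayerState(hp_fraction=0.30, flask_count=2)))
```

The point of the sketch is that the rule only fires on values that were actually extracted, so the call can be specific about HP and flask count rather than generic.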

How Sidekick differs from Character.AI, ChatGPT, and Replika

Most AI assistants and chatbots are blind to your game. Character.AI, ChatGPT, Replika: none of them can see what you see. They can chat about a game you describe to them, but they can't coach during play because the loop is too slow.

The AI gaming companion category exists because screen vision changed what was possible. Sidekick AI is built around that change. The 3D avatar, the voice layer, and the HypeReel highlight workflow all sit on top of the vision layer being good enough that the companion already knows what's happening when it speaks.

How the vision pipeline actually works

Frame capture happens at the operating system level — the same Windows.Graphics.Capture and Core Graphics window-capture APIs OBS and other capture tools use. There's no DLL injection, no game-memory hook, no driver-level instrumentation. The companion never attaches to your game's process, so anti-cheat treats Sidekick like any other capture tool. The cost of this design is that Sidekick only sees what's rendered to your screen; the upside is that it works with any PC game without per-title integration.

Once a frame is captured, a lightweight change detector runs on your machine deciding whether the frame is worth sending — most consecutive frames in a session are visually similar enough that the model has nothing new to say. The frames that clear the gate are sent to Sidekick's vision-language model with a game-aware coaching prompt. The model returns a structured response (which game, which scene, what UI is visible, what entities are in the frame, what the player is doing). The coaching layer reasons about that structured response, not the raw pixels, when deciding what to say in your headset. That pipeline shape is why coaching can match the moment — the companion already knows the situation before it speaks.
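
The on-device gate described above can be pictured with a small Python sketch. This is an assumption-heavy illustration, not Sidekick's detector: it treats a frame as a flat list of grayscale pixel values, uses mean absolute difference as the similarity measure, and the `ChangeGate` class and its threshold of 8.0 are invented for the example.

```python
from typing import Optional, Sequence

Frame = Sequence[int]  # assumed: flattened grayscale pixels, 0 to 255


def mean_abs_diff(a: Frame, b: Frame) -> float:
    """Average per-pixel brightness difference between two frames."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("frames must be non-empty and the same size")
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


class ChangeGate:
    """Passes a frame only if it differs enough from the last frame that passed."""

    def __init__(self, threshold: float = 8.0) -> None:
        self.threshold = threshold  # hypothetical tuning value
        self._last_sent: Optional[Frame] = None

    def should_send(self, frame: Frame) -> bool:
        if self._last_sent is None or mean_abs_diff(frame, self._last_sent) >= self.threshold:
            self._last_sent = frame
            return True
        return False


gate = ChangeGate()
frames = [[10] * 16, [12] * 16, [200] * 16, [200] * 16]
print([gate.should_send(f) for f in frames])
```

Comparing against the last frame that was sent, rather than the previous frame, is a deliberate choice in this sketch: slow drift across many near-identical frames still adds up and eventually clears the gate.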

What it sees during play

The structured-understanding part is abstract. The concrete version is what the companion can call out during a session across the genres Sidekick is tuned for. The vision model is general-purpose; it recognizes the elements below because large multimodal models have seen these games in their training data, not because Sidekick ships per-game extractors.

In a Souls boss fight: the boss's health bar and posture meter, your HP and stamina bars, the wind-up animation that signals an incoming attack, your flask count, your skill cooldowns, the boss's phase transitions. That's how Sidekick can say “you're at 30%, flask now” or “Waterfowl is winding up, run for the first flurry” instead of generic dodge advice.

In a metroidvania: the mini-map state (rooms visited, rooms unexplored), your movement abilities, the icon of the charm or relic you just picked up, the lock-and-key doors on the current floor. That's how Sidekick can nudge you toward the unexplored room two screens northwest without spoiling the whole map.

In an RPG turn: the turn order, ability cooldowns and resource counts, the dialog choices on screen, environmental hazards in the encounter, the condition icons on each combatant. That's how the companion can flag “the elemental surface ignites if you cast that fireball, your ally is standing in it” before you click.

In a survival horror: your ammo count, herb or healing-item stack, the save typewriter availability, enemy positions on the radar, the condition state of your weapons. That's how Sidekick can say “four pistol rounds and one green herb left, the next merchant is past the church” instead of generic scarcity advice.

Privacy and control — what the companion does and doesn't see

Sidekick reads the surface you point it at. On multi-monitor setups you select which window or display the companion sees; a second monitor running Discord, OBS, a wiki tab, or your email stays private. The vision layer never reaches into windows you haven't selected.

Pausing vision is one click. When paused, the companion continues to chat but stops capturing or analyzing the screen — useful for cutscenes you want to experience without spoiler-adjacent commentary, story moments where coaching would feel intrusive, or just times when you want company without analysis. Pause is a per-session control; the next launch starts in your default state.
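
The pause behavior above can be modeled as a tiny piece of session state. The Python sketch below is only an illustration under assumed names (`VisionSession`, `default_paused`, `on_frame`); it shows that a paused session skips capture entirely and that a new session starts from the saved default rather than from the last toggle.

```python
class VisionSession:
    """Hypothetical per-session vision state; names are illustrative only."""

    def __init__(self, default_paused: bool = False) -> None:
        self.paused = default_paused  # each launch starts in the default state
        self.frames_captured = 0

    def toggle_pause(self) -> None:
        self.paused = not self.paused

    def on_frame(self) -> bool:
        """Return True if the frame was captured, False if vision is paused."""
        if self.paused:
            return False  # nothing is captured or analyzed while paused
        self.frames_captured += 1
        return True


session = VisionSession()
session.on_frame()       # captured
session.toggle_pause()   # one click to pause
session.on_frame()       # skipped
print(session.frames_captured)

next_launch = VisionSession()  # pause state does not carry over
print(next_launch.paused)
```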

HypeReel clips are a separate, opt-in workflow. Those clips exist because you triggered them — a highlight worth saving — and they're yours to keep, edit, share, or delete from your account. The clip pipeline uses the same vision layer for highlight detection, but the resulting video is under your control, not the vision layer's.

Frequently Asked Questions

How does Sidekick AI's screen vision actually work?
Sidekick captures frames from your gameplay window on a regular cadence, filters out near-duplicates with an on-device change detector, and sends the rest to a vision-language model with a game-aware coaching prompt. The model identifies what's on screen (the game, the scene, the active mechanic, the player state) and passes that structured understanding to the coaching layer. The coaching layer decides what (if anything) to say. The result is voice tips that match what's actually happening, not generic advice.
Which games does screen vision work with?
Any PC game that runs in a standard window. There's no per-game integration required because the vision layer reads the rendered screen rather than the game's internal state. Single-player and co-op titles are where the experience is sharpest because the coaching content is tuned for them — Elden Ring, Baldur's Gate 3, Hollow Knight, Dark Souls 3, Resident Evil 4, Silent Hill 2, Lethal Company, Phasmophobia, Minecraft, and more.
Does screen vision slow down my game?
It's designed not to. Frame capture is lightweight and happens outside the game's render loop. The on-device change detector only decides which frames are worth sending, and the heavy vision analysis runs on Sidekick's servers rather than on your machine. The goal is that your frame rate and input latency stay where they are: coaching is the value, not the bottleneck.
Can Sidekick see HUD elements, menus, and inventory screens?
Yes. The vision layer reads the whole rendered frame, including UI elements like health bars, mini-maps, inventory grids, and dialog boxes. This is how Sidekick can say things like “you're at 30% health, back off” or “that spell scroll in the loot drop is worth picking up.”
What about spoilers? Will Sidekick reveal late-game content?
The coaching layer is tuned to talk about what's on screen right now, not to volunteer information about content you haven't reached. If a story beat is about to trigger, Sidekick won't pre-empt it. If you actively ask for help on a puzzle whose solution involves later content, the companion can warn you and let you decide.
Is screen vision different from streaming or screen recording?
Yes. Streaming tools capture and broadcast your screen to a public audience. Screen recording saves your screen to a local file. Sidekick's screen vision captures frames in real time so the coaching layer can act on the moment — the goal is voice tips in your headset, not a broadcast or a saved video. Sidekick is built to coexist with your existing streaming setup rather than replace it.
Does screen vision work on multi-monitor setups?
Yes. You select which window or display Sidekick reads. The companion only sees the surface you point it at, so a second monitor with Discord, OBS, or a wiki tab stays private.
Can I turn screen vision off temporarily?
Yes. There's a clear toggle to pause vision capture. When vision is paused, the companion still talks but stops referencing the screen — useful for cutscenes, story moments, or when you just want company without coaching.
Does Sidekick capture frames via DLL injection or anything anti-cheat would flag?
No. Sidekick reads the game window the same way OBS or any screen-capture tool does — using the operating system's standard window-capture APIs. There's no DLL injection, no game-memory hook, no driver-level instrumentation. Anti-cheat systems see Sidekick as a regular desktop application, because that's what it is. The companion never touches the game's process.
Can I run Sidekick alongside a streaming setup like OBS?
Yes. Sidekick captures from the game window; OBS captures from whatever scene you've set up. They don't compete for the same surface, and both can run simultaneously without one breaking the other. The avatar window and Sidekick's voice output are both routable into OBS as a window source and an audio source if you want the companion on stream.

Ready to play smarter?

Sidekick AI uses vision AI to watch your screen and coach you in real time. Try the free demo on Steam.

Add to Steam Wishlist