Case Study · Multimodal Input · Agentic UX / 2025

Context-Aware Gesture System for Smart Glasses

A gesture is a signal the system interprets through environment, user state, and learned behavior — before any action is taken.

Current XR systems treat gestures as commands, mapping a small set of gestures to fixed actions. This works in controlled demos but fails in real-world use, where intent, environment, and social context change constantly.

This project reframes gestures as signals. It defines a system that interprets each gesture in context, combining movement, environment, and user state to infer intent and generate an appropriate response.

Role
Lead inventor (70%)

Defined the problem framing and interaction architecture, developed scenarios, drove hardware UX strategy, and led the Disclosure of Invention (DOI) submission end to end.

Collaborators
1 co-inventor (PM, 30% contribution)

Platforms
Smart glasses (audio, display, AR), VR HMD

Outcome
Patent filed, 2025 · A1 graded

Highest commercial viability rating.

The interaction system for smart glasses must be a continuous inference engine.

Problem

Current XR gesture systems rely on fixed vocabularies: small sets of predefined commands that users must learn and adapt to. This breaks down in real-world use, where context, habit, and environment constantly change. When a gesture doesn’t match the predefined vocabulary, it fails, and the user adapts instead. The core problem is not gesture recognition; it is interpretation.

No personalization — users adapt to the system

No context awareness — environment and intent ignored

No adaptive response — same gesture always means the same thing

Approach

Instead of mapping gestures to fixed functions, the system interprets them through a continuous loop: sensing context, inferring intent, acting, and learning over time. Gestures are not inputs to execute — they are signals the system resolves into appropriate responses.

01

Sense

Capture multimodal signals: body movement, environment, and device context.

02

Infer

Interpret user intent based on context, history, and situation.

03

Act

Generate adaptive responses across the appropriate surface and modality.

04

Learn

Update the system’s model from each interaction.

The system builds understanding progressively — combining gesture, context, and user history into a single decision loop.
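The four stages above can be sketched as a small loop. This is a minimal illustration, not the filed method: the `Signal` fields, the `GestureLoop` class, and the idea of counting confirmed intents per (gesture, environment) pair are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Signal:
    """One sensed moment: a gesture plus the context it occurred in (hypothetical fields)."""
    gesture: str
    environment: str


@dataclass
class GestureLoop:
    # history[(gesture, environment)] counts the intents the user confirmed there.
    history: dict = field(default_factory=dict)

    def infer(self, signal: Signal) -> Optional[str]:
        """Return the intent most often confirmed for this gesture in this environment."""
        counts = self.history.get((signal.gesture, signal.environment))
        if not counts:
            return None  # nothing learned yet, so the caller should ask the user
        return counts.most_common(1)[0][0]

    def act(self, intent: Optional[str]) -> str:
        """Turn an inferred intent into a response (here just a description string)."""
        return f"perform:{intent}" if intent else "ask:user"

    def learn(self, signal: Signal, confirmed_intent: str) -> None:
        """Record the intent the user actually wanted for this signal."""
        key = (signal.gesture, signal.environment)
        self.history.setdefault(key, Counter())[confirmed_intent] += 1

    def step(self, signal: Signal) -> str:
        """One pass of sense -> infer -> act; learn() runs once feedback arrives."""
        return self.act(self.infer(signal))
```

In this sketch, a gesture with no history produces a question to the user; after `learn()` records a confirmation, the same gesture in the same environment resolves directly, while the same gesture in a new environment still prompts a question.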

Key Design Decisions

01

Input is multimodal

Gesture is interpreted alongside gaze, motion, proximity, and environment.

Active multimodal gesture input map — 01 Active Input: direct gesture and manipulation signals.

Passive multimodal input map — 02 Passive Input: voice, movement, physiological, and facial/body cues.

Environmental context input map — 03 Environmental Input: physical, digital, and social context signals.

02

Context defines meaning, not the gesture itself

The same gesture produces different outcomes depending on environment, user state, and intent. Take the "OK" hand sign as an example: its meaning shifts with the context in which it is made.

Context-aware gesture system diagram 2 — How gesture is associated with the user's environment, which leads to user intention prediction.
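One way to picture context-dependent meaning is a lookup keyed on both gesture and context rather than on gesture alone. The contexts and intents below are invented for illustration; the case study does not specify what the "OK" sign resolves to in any particular situation.

```python
# Hypothetical (gesture, context) -> intent table; every entry is illustrative only.
INTENT_BY_CONTEXT = {
    ("ok_sign", "incoming_call"): "accept_call",
    ("ok_sign", "payment_prompt"): "confirm_payment",
    # A social gesture aimed at another person should not trigger the device.
    ("ok_sign", "conversation"): "no_action",
}


def resolve_intent(gesture: str, context: str) -> str:
    """Resolve a gesture to an intent using its context; unknown pairs fall back to asking."""
    return INTENT_BY_CONTEXT.get((gesture, context), "ask_user")
```

A fixed-vocabulary system would key this table on `gesture` only, which is exactly the limitation described in the Problem section.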

03

Ecosystem output and adaptive feedback

Output can also shift across the ecosystem. The system routes multimodal feedback to the surface that best fits the situation. Each output device can deliver a different part of the response; the system either selects the target surface automatically or lets the user override it through behavior.

01

Context-aware distributed

Adaptive output routing across connected devices

System routes feedback to the best-fit surface for the situation.

02

Intention guided overrides

Manual device selection gesture near connected devices

User behavior can explicitly override automatic output routing.

03

Hybrid

Combined automatic and manual output selection flow

Automatic narrowing first, user confirmation second.
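The hybrid flow (automatic narrowing first, user confirmation second) could look roughly like the sketch below. The `Surface` type, its `fit` score, and the `shortlist_size` parameter are assumptions introduced for the example, not part of the described system.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Surface:
    name: str
    modalities: frozenset  # e.g. frozenset({"audio", "visual"})
    fit: float             # hypothetical 0..1 score for how well it suits the situation


def route_output(
    surfaces: Iterable[Surface],
    needed_modality: str,
    user_choice: Optional[str] = None,
    shortlist_size: int = 2,
) -> Optional[Surface]:
    """Pick an output surface: narrow automatically, then honor a user override."""
    # Automatic narrowing: keep surfaces that support the modality, best fit first.
    shortlist = sorted(
        (s for s in surfaces if needed_modality in s.modalities),
        key=lambda s: s.fit,
        reverse=True,
    )[:shortlist_size]
    if not shortlist:
        return None
    # Intention-guided override: an explicit choice within the shortlist wins.
    for surface in shortlist:
        if surface.name == user_choice:
            return surface
    # No override given (or it was not shortlisted): use the best-fit surface.
    return shortlist[0]
```

Leaving `user_choice` unset gives the purely automatic mode; passing the name of a nearby device models the manual override.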

04

Behavior adapts with confidence

As the system learns, interaction compresses, moving from reactive to anticipatory.

Applying the system to user scenarios

The system resolves intent by combining gesture with environmental and behavioral signals, rather than mapping gestures to fixed commands. The system can be applied to a wide range of scenarios beyond the core flows.

Key interface categories

The system operates as a staged loop: detect gesture, interpret context, predict intent, and respond. It extends beyond recognition — deciding when to ask, suggest, guide, or act based on confidence.

01

Gesture identifier & feedback

A visual cue confirms input immediately — whether the gesture is recognized or not — enabling lightweight, continuous onboarding.

02

Ergonomic context option UI

Relevant actions appear around the hand, adapting to posture and movement without disrupting focus.

03

Learning from unregistered gestures

Repeated unknown gestures trigger suggestions, allowing the system to learn incrementally from user behavior.

04

Guidance-based output

The system responds with contextual guidance or actions in real time, depending on confidence.
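The choice between asking, suggesting, guiding, and acting can be sketched as a confidence ladder. The threshold values here are placeholders chosen for the example; the case study does not state any numbers.

```python
def choose_response(
    confidence: float,
    suggest_at: float = 0.4,
    guide_at: float = 0.7,
    act_at: float = 0.9,
) -> str:
    """Map intent confidence (0..1) to a response style; thresholds are hypothetical."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if confidence >= act_at:
        return "act"      # intent is clear enough to execute directly
    if confidence >= guide_at:
        return "guide"    # show task-specific guidance
    if confidence >= suggest_at:
        return "suggest"  # offer a few likely options
    return "ask"          # too uncertain, so ask the user
```

As the learned model improves, more interactions would land in the upper bands, which is one way to read the shift from reactive to anticipatory behavior.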

01

Gesture identifier and feedback

Gesture input must be visible before it can be understood. A lightweight identifier appears in the user’s field of view when a gesture is detected — confirming input and defining the interaction region. This transforms gestures from implicit signals into something the system and user can both act on.

Examples of gesture identifiers and user-defined gesture behaviors that can be registered in the system

The identifier is not just feedback — it is the moment gesture becomes interaction.

Flowchart showing how detected gestures trigger identifier initiation, context screening, and intention prediction

02

Context option and overlays

Context options are adaptive, not static. The system chooses when and where to present UI based on gesture confidence, body state, and immediate task context.

Keep options hand-local and in-reach

Once a gesture is detected, the system surfaces contextual options around the hand. Interaction remains local, continuous, and easy to reach without breaking focus.

Places options in ergonomic reach zones around the active hand.
Uses depth-aware placement to avoid occluding the main task area.
Keeps action labels short for glanceable confirmation.

Context recognition options arranged ergonomically around a user's hand and held object

Ergonomic context option UI showing z-depth placement — z-depth placement keeps menus legible and out of the way.

Adapt placement to posture and motion

Placement adapts to body state. In still contexts, UI can anchor to the world. In motion, it follows the user to keep options reachable and readable.

World-locked mode stabilizes controls for precise interaction.
Body-locked mode follows the user while walking or in transit.
Gaze-based anchoring helps preserve continuity across head turns.

Ergonomic context option UI in world-locked mode — **World-locked** UI stays in stable world coordinates for consistency.

Ergonomic context option UI in body-locked mode — **Body-locked** UI follows the torso to remain reachable in motion.

Ergonomic context option UI in gaze-based mode — **Gaze-based** UI anchors to head direction and reflows on stop.
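A simple rule set for switching between the three anchoring modes might look like this. The input measurements and both thresholds are assumptions for illustration only.

```python
from enum import Enum


class Anchor(Enum):
    WORLD_LOCKED = "world_locked"
    BODY_LOCKED = "body_locked"
    GAZE_BASED = "gaze_based"


def choose_anchor(
    speed_m_s: float,
    head_turn_deg_s: float,
    walking_speed: float = 0.5,    # hypothetical threshold for "in motion"
    head_turn_rate: float = 60.0,  # hypothetical threshold for a deliberate head turn
) -> Anchor:
    """Pick a UI anchoring mode from the user's body state."""
    if speed_m_s >= walking_speed:
        return Anchor.BODY_LOCKED   # follow the torso while walking or in transit
    if head_turn_deg_s >= head_turn_rate:
        return Anchor.GAZE_BASED    # keep continuity across head turns
    return Anchor.WORLD_LOCKED      # still context: stabilize for precise interaction
```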

Surface support at the right moment

Contextual UI appears just in time, not before and not after. It surfaces when intent is partially understood so users can decide without interrupting flow.

Early phase: suggest lightweight options as intent confidence rises.
Decision phase: expose richer context when multiple outcomes are plausible.
Commit phase: collapse to the chosen action and remove unnecessary UI.

Meeting scenario showing contextual overlays appearing at different moments — Meeting example: supplemental prompts evolve as context shifts from 3:00 PM to 4:00 PM.
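The early, decision, and commit phases can be sketched as a function of the current intent scores. The score dictionary, the two thresholds, and the return shape are assumptions made for this example.

```python
def ui_phase(intent_scores: dict, commit_at: float = 0.85, plausible_at: float = 0.3):
    """Return (phase, intents_to_show) from a mapping of intent -> confidence."""
    if not intent_scores:
        return ("none", [])
    ranked = sorted(intent_scores.items(), key=lambda kv: kv[1], reverse=True)
    top_intent, top_score = ranked[0]
    if top_score >= commit_at:
        # Commit phase: collapse to the chosen action.
        return ("commit", [top_intent])
    plausible = [intent for intent, score in ranked if score >= plausible_at]
    if len(plausible) > 1:
        # Decision phase: several outcomes are plausible, so expose richer context.
        return ("decision", plausible)
    # Early phase: suggest a single lightweight option.
    return ("early", [top_intent])
```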

03

Learning unregistered gestures

Unrecognized gestures are not treated as errors — they are treated as signals. When a repeated unknown gesture appears, the system offers possible intentions, allowing the user to define its meaning. Over time, the system builds a personalized gesture vocabulary from the user’s own behavior.

Trash can scenario showing intention selection options after detecting a squeezed can gesture
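Building a personal vocabulary from repeated unknown gestures could be sketched as follows. The class name, the repeat threshold of three, and the string labels are hypothetical; the source only states that repetition triggers suggestions and that the user defines the meaning.

```python
from collections import Counter


class GestureVocabulary:
    """Tracks known gestures and promotes repeated unknown ones to suggestions."""

    def __init__(self, repeat_threshold: int = 3):
        self.known = {}                  # gesture_id -> user-defined intent
        self.unknown_counts = Counter()  # how often each unknown gesture was seen
        self.repeat_threshold = repeat_threshold

    def observe(self, gesture_id: str):
        """Return ("intent", name), ("suggest", gesture_id), or ("acknowledge", gesture_id)."""
        if gesture_id in self.known:
            return ("intent", self.known[gesture_id])
        self.unknown_counts[gesture_id] += 1
        if self.unknown_counts[gesture_id] >= self.repeat_threshold:
            return ("suggest", gesture_id)   # offer possible intentions to the user
        return ("acknowledge", gesture_id)   # confirm input was seen, nothing more

    def register(self, gesture_id: str, intent: str) -> None:
        """Store the meaning the user chose for a previously unknown gesture."""
        self.known[gesture_id] = intent
        self.unknown_counts.pop(gesture_id, None)
```

For instance, after a squeezed-can gesture has been seen three times, `observe()` returns a suggestion; once the user registers a meaning for it, later observations resolve straight to that intent.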

04

Guidance-based UI outcomes

Once intent is understood, the system responds with guidance. Instead of generic overlays, UI is tied directly to the task — helping the user navigate, locate, or complete actions in real time.

Hardware spectrum

The system was designed to scale across devices, from audio-only glasses to full AR headsets, adapting input and output to each device's hardware capabilities.

Tier 1

Displayless

Audio + minimal input — gestures interpreted through context and history.

Tier 2

Monocular

Limited display — gestures guide lightweight visual responses.

Tier 3

Binocular

Spatial awareness — gestures interact with environment and depth.

Tier 4

Full AR

Full multimodal system — gestures operate in 3D space with adaptive feedback.

Hardware spectrum slide showing four device tiers from displayless Jirju to full 3D AR glasses
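Adapting output to each tier can be sketched as a capability table with a fallback. The modality labels, the assumption that higher tiers also keep the lower tiers' outputs, and the audio fallback are illustrative choices, not details from the disclosure.

```python
# Hypothetical output capabilities per tier; labels are illustrative only.
TIER_OUTPUTS = {
    "displayless": ("audio",),
    "monocular": ("audio", "glance_card"),
    "binocular": ("audio", "glance_card", "depth_ui"),
    "full_ar": ("audio", "glance_card", "depth_ui", "spatial_overlay"),
}


def pick_output(tier: str, preferred: list) -> str:
    """Return the first preferred modality the tier supports, else fall back to audio."""
    supported = TIER_OUTPUTS[tier]
    for modality in preferred:
        if modality in supported:
            return modality
    return "audio"
```

The same ranked preference list then degrades gracefully: a response designed as a spatial overlay on full AR becomes a glanceable card on a monocular display and a spoken cue on displayless glasses.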

Impact and Outcome

01

Invention

The method was formalized as an A1-graded Disclosure of Invention — the highest commercial viability rating — establishing a foundation for adaptive, personalized interaction systems across XR platforms.

02

System contribution

The project reframes gesture interaction from command input to contextual inference, and defines a framework that scales from displayless devices to full AR systems.

03

Outlook

This system establishes an input architecture for future agentic experiences, operating as a continuous loop of perception, inference, action, feedback, and learning. As confidence grows, explicit steps compress, and interaction shifts from reactive to anticipatory.

For commissions, collaborations, and future-facing products.

simplyujie@gmail.com

YUJIE
