Relational, real-time, multimodal AI.

From voice to understanding

Technology · 7 min read · 20 February 2025

A pause mid-argument means something different than a pause mid-apology. How we're building multimodal models that account for context, tone, and timing.

Introduction

A pause mid-argument means something different than a pause mid-apology. Tone, timing, and the spaces between words carry as much information as the words themselves.¹ Our models are built to account for that.

¹ Mehrabian, A. (1972). Nonverbal Communication. Aldine-Atherton.

Technically, the challenge is multimodal alignment under conversational noise. The same utterance can indicate different states depending on pacing, turn-taking, and preceding context.

As a result, model quality depends as much on context assembly as on classifier sophistication. If the system sees fragments without temporal grounding, each output may look plausible in isolation while the interpretations contradict one another from turn to turn.
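To make "context assembly" concrete, here is a minimal sketch of attaching temporal grounding to each utterance before any classifier sees it. The `Utterance` and `GroundedUtterance` types, the `assemble_context` function, and the `window` parameter are hypothetical names chosen for illustration; they are not taken from the production system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Utterance:
    speaker: str
    start: float  # seconds from session start
    end: float
    text: str


@dataclass
class GroundedUtterance:
    utterance: Utterance
    # Silence between the previous utterance's end and this one's start.
    # None for the first utterance; negative when speakers overlap.
    gap_before: Optional[float]
    speaker_changed: bool
    context: List[Utterance]  # up to `window` preceding utterances


def assemble_context(utterances: List[Utterance], window: int = 3) -> List[GroundedUtterance]:
    """Order utterances by start time and attach timing and preceding turns."""
    ordered = sorted(utterances, key=lambda u: u.start)
    grounded = []
    for i, u in enumerate(ordered):
        prev = ordered[i - 1] if i > 0 else None
        gap = None if prev is None else round(u.start - prev.end, 3)
        grounded.append(
            GroundedUtterance(
                utterance=u,
                gap_before=gap,
                speaker_changed=prev is not None and prev.speaker != u.speaker,
                context=ordered[max(0, i - window):i],
            )
        )
    return grounded
```

With this shape, a downstream model receives the pause length and the preceding turns along with the words, so the same "Okay." after a 2.5 second silence can be treated differently from one that arrives immediately.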

Key Signal

We analyze voice, tone, timing, and words. Where both people consent, we can also incorporate facial expression and body language. All of these are treated as one integrated signal. Context changes everything, and our system is designed to learn the context of each couple.

For this reason, the modeling target is not a single label per utterance but a contextual estimate over time. Temporal modeling is essential when meanings shift within seconds.
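One simple way to turn per-utterance scores into a contextual estimate over time is a time-decayed running average, sketched below. This is an illustrative stand-in and not a description of the actual model; the `contextual_estimate` function and its `half_life` parameter are assumptions made for the example.

```python
from typing import Iterable, List, Tuple


def contextual_estimate(
    observations: Iterable[Tuple[float, float]],
    half_life: float = 10.0,
) -> List[float]:
    """Return a running, time-decayed average of scores.

    observations: (timestamp_seconds, score) pairs, sorted by timestamp.
    half_life: seconds after which an older observation counts half as much.
    """
    estimates = []
    weighted_sum = 0.0
    weight_total = 0.0
    last_t = None
    for t, score in observations:
        if last_t is not None:
            # Shrink the influence of everything seen so far by elapsed time.
            decay = 0.5 ** ((t - last_t) / half_life)
            weighted_sum *= decay
            weight_total *= decay
        weighted_sum += score
        weight_total += 1.0
        estimates.append(weighted_sum / weight_total)
        last_t = t
    return estimates
```

A short half-life lets the estimate follow shifts that happen within seconds, while a longer one favors stability; the right value would have to be chosen empirically.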

We also prioritize calibration over raw confidence. A model that can identify uncertain states and defer interpretation is generally more useful in production than one that is confidently wrong.
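The sketch below illustrates both halves of that idea: expected calibration error (ECE), a standard way to measure the gap between stated confidence and observed accuracy, and a deferral rule that withholds an interpretation below a confidence threshold. The function names, the bin count, and the 0.8 threshold are assumptions for illustration only.

```python
from typing import Optional, Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |mean confidence - accuracy| over equal-width bins."""
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def interpret_or_defer(label: str, confidence: float, threshold: float = 0.8) -> Optional[str]:
    """Return the label only when confidence clears the threshold; otherwise defer."""
    return label if confidence >= threshold else None
```

A deferral threshold is only meaningful if the confidences are reasonably calibrated, which is why the two pieces are shown together.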

How This Shapes The System

Every pair has its own way of communicating. How they escalate. How they repair. When one withdraws. When one reaches out. Over time, the system maps where the relationship stands.² The goal isn't to replace human judgment. It's to support reflection.

² Coan, J. A. & Gottman, J. M. (2007). The specific affect coding system (SPAFF). In J. A. Coan & J. J. B. Allen (Eds.), Handbook of Emotion Elicitation and Assessment. Oxford University Press.

In implementation, that means preserving conversational memory, calibrating confidence, and distinguishing between weak and strong evidence before surfacing insights to users.
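As a rough illustration of separating weak from strong evidence, the sketch below only lets a pattern surface once it has been observed with high confidence in several distinct sessions. The `Evidence` and `InsightGate` classes, the pattern names, and both thresholds are hypothetical and are not drawn from the real system.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class Evidence:
    pattern: str       # hypothetical pattern identifier
    confidence: float  # calibrated confidence in [0, 1]
    session_id: str


class InsightGate:
    """Remember strong observations across sessions and gate what is surfaced."""

    def __init__(self, min_confidence: float = 0.8, min_sessions: int = 3) -> None:
        self.min_confidence = min_confidence
        self.min_sessions = min_sessions
        self._sessions: Dict[str, Set[str]] = defaultdict(set)

    def observe(self, evidence: Evidence) -> None:
        # Weak evidence is ignored here rather than accumulated.
        if evidence.confidence >= self.min_confidence:
            self._sessions[evidence.pattern].add(evidence.session_id)

    def surfaceable(self) -> List[str]:
        """Patterns seen with strong evidence in enough distinct sessions."""
        return sorted(
            pattern
            for pattern, sessions in self._sessions.items()
            if len(sessions) >= self.min_sessions
        )
```

Counting distinct sessions rather than raw observations keeps one heated conversation from dominating what gets shown.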

Systems that cannot represent ambiguity tend to overfit short-term cues and degrade trust. We optimize for reliable interpretation over maximal intervention frequency.

Outlook

From voice to understanding: that's the path we're on. Grounded in decades of relationship science, built for the nuance that makes intimacy possible.

The technical roadmap favors iterative evaluation: improve sensing quality, validate against external judgments, and only then expand intervention scope.
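One common way to validate model output against external judgments is chance-corrected agreement with human coders, for example Cohen's kappa. The sketch below is a plain-Python version for illustration; the article does not state which agreement metric is used, so treat the choice of kappa, the `cohens_kappa` name, and the example labels as assumptions.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(model_labels: Sequence[str], human_labels: Sequence[str]) -> float:
    """Cohen's kappa between two label sequences of equal length."""
    n = len(model_labels)
    if n == 0 or n != len(human_labels):
        raise ValueError("inputs must be non-empty and of equal length")
    observed = sum(m == h for m, h in zip(model_labels, human_labels)) / n
    model_counts = Counter(model_labels)
    human_counts = Counter(human_labels)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(
        model_counts[label] * human_counts[label]
        for label in model_counts.keys() | human_counts.keys()
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)
```

Tracking a measure like this across releases gives a concrete gate for the "only then expand intervention scope" step.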
