AI Product & UX Design · 2026 · Honda Motor Co.
Turns a PowerPoint deck with Japanese speaker notes into a finished narrated video — with the slide animations still playing in time with the voice. Built so non-technical staff can drive a genuinely technical speech pipeline without ever feeling they left familiar ground.
Internal Honda Motor Co. desktop tool. Product screens are confidential, so this case study describes the design thinking, key decisions, and outcomes. Processing runs locally on the user's PC; the script text is the only data that leaves the machine.
2026 — shipped & in use
GenAI Engineer · Product & UI/UX
From deck to narrated video — one screen
Edit one sentence and only that one clip is redone — never the whole deck.
Overview
AI Narration Studio is a Windows desktop app built for Honda. A user drops in a PowerPoint deck, reviews and adjusts the AI-generated Japanese narration on a single screen, and exports a finished video with the slide animations still timed to the voice. Before it, the same job meant copying each slide's script into a separate voice tool, generating and downloading audio one slide at a time, dragging clips back onto slides, and lining up animation timing by hand. That loop ran for every slide, and again whenever a single word changed. The role sat in AI development, but the main contribution was product design and UI/UX, which is where most of the work happened.
The Problem
Every year around the shareholders' meeting and quarterly reports, one team had to turn an 80–100 slide deck into a narrated video. The old way was a loop: open a slide, copy its note, switch tools, paste, generate, listen, download, drag the clip onto the slide, align each animation reveal with the words — eighty to a hundred times. And a sting in the tail: two small wording changes late on a Friday meant running those slides through the whole loop again. The real problem wasn't 'automate the steps.' The people doing the work weren't engineers, but the technology underneath — speech markup, phonetic pitch-accent control, animation timing — is genuinely technical. The goal: maximum capability, minimum intimidation.
Research & Discovery
Workflow shadowing — watched the team narrate full 80–100 slide decks; found the bottleneck is the mechanical conversion, not writing the script
Speaker-notes audit — most decks already have well-written notes, so the design could trust existing input rather than ask for more
Speech pipeline mapping — studied SSML markup, Japanese pitch-accent control, and word-level timing to know exactly what power had to be made approachable
PowerPoint OOXML / COM investigation — mapped animation triggers and media placement so authored animations survive into the exported video
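To make the speech-markup point concrete, here is a minimal sketch of how SSML could be split into plain text and pause tokens that a UI might render as small chips, then written back unchanged. It assumes flat, un-namespaced SSML with only `<break time="...ms"/>` elements; the names `Text`, `Pause`, `ssml_to_tokens`, `tokens_to_ssml`, and `chip_label` are hypothetical and are not taken from the actual tool.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from xml.sax.saxutils import escape


@dataclass
class Text:
    value: str


@dataclass
class Pause:
    ms: int


def ssml_to_tokens(ssml):
    """Split flat SSML into Text and Pause tokens (nested tags are not handled)."""
    root = ET.fromstring(ssml)
    tokens = []
    if root.text:
        tokens.append(Text(root.text))
    for child in root:
        if child.tag == "break":
            time = child.attrib["time"]  # assumed to look like "200ms"
            tokens.append(Pause(int(time[:-2])))
        if child.tail:
            tokens.append(Text(child.tail))
    return tokens


def tokens_to_ssml(tokens):
    """Rebuild the SSML string so the markup stays the single source of truth."""
    parts = []
    for token in tokens:
        if isinstance(token, Pause):
            parts.append(f'<break time="{token.ms}ms"/>')
        else:
            parts.append(escape(token.value))
    return "<speak>" + "".join(parts) + "</speak>"


def chip_label(token):
    """Text shown on a chip: a pause becomes a short, readable label."""
    return f"⏸ {token.ms} ms" if isinstance(token, Pause) else token.value
```

Because the tokens round-trip to the same markup, a friendly view and a raw-markup view can both edit one underlying string without drifting apart.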
Key Insight
“Treat every technical requirement as a design question first. An API key became 'why should the user see this at all?' A markup language became 'how does this feel as something you touch?' A slightly-random voice model became 'how do we keep a promise the user can hear?'”
Design Process
Make the powerful thing friendly — SSML markup is kept as a hidden source of truth and shown as editable pills (a pause is a small ⏸ 200 ms chip). A one-click 'Pro Edit' toggle reveals raw markup for power users: calm default, escape hatch one click away.
Design for trust — the pronunciation editor offers two ways to set Japanese pitch accent that stay in sync: plain-label quick-pick chips (Flat / Drop on first / Drop in middle) and a tappable visual pitch graph. Both write to one source of truth.
Be honest about limits — strong sentence context can override a set accent, so the panel says so plainly and points to the audio preview as the real answer. Honest beats polished.
Change what 'approve' means — instead of approving settings then regenerating and hoping, a Generate-and-hear loop makes the take the user just heard the take that ships. 'What you heard is what you get' became a real guarantee.
The best setup step is no setup step — the encrypted credential is baked into the installer, so the first thing the user sees is the actual tool, not an API-key form. Getting there was a packaging decision, not a UI one.
Segment-level regeneration — a one-word script change redoes only that one clip, not the whole deck, turning the dreaded Friday-afternoon edit into a two-minute task.
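One common way to get segment-level regeneration is to cache each clip under a hash of its text and voice settings, so only segments whose hash is new are synthesized again. The sketch below illustrates that idea under stated assumptions: `segment_key`, `render`, and the injected `synthesize` callable are hypothetical names, and the case study does not say how the real tool tracks changes.

```python
import hashlib


def segment_key(text, voice):
    """Stable cache key: any change to the text or the voice yields a new key."""
    return hashlib.sha256(f"{voice}\n{text}".encode("utf-8")).hexdigest()


def render(segments, cache, voice, synthesize):
    """Synthesize only segments missing from the cache.

    segments   -- list of script sentences, in deck order
    cache      -- dict mapping segment_key -> audio bytes, kept between runs
    synthesize -- hypothetical callable (text, voice) -> audio bytes
    Returns the indexes of the segments that were actually regenerated.
    """
    redone = []
    for index, text in enumerate(segments):
        key = segment_key(text, voice)
        if key not in cache:
            cache[key] = synthesize(text, voice)
            redone.append(index)
    return redone
```

With this shape, editing one sentence in a long deck leaves every other clip untouched in the cache, which is what keeps a late wording change small.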
Critical Pivot
The first pronunciation editor let the user type a reading and an accent mark — but to hear the result they had to regenerate the whole segment, and because the voice is slightly random, the version they approved wasn't the version that shipped. The tool was quietly breaking its own promise. The fix wasn't a better form: it was redefining 'approve.' A fast Generate-and-hear loop plays the corrected word in its real sentence, and the exact audio the user approves becomes the final audio — no second roll of the dice.
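The "approve the take you heard" idea can be sketched as a small state holder: generating stores a preview, approving pins those exact bytes, and export returns the pinned bytes without calling the voice model again. This is an illustrative sketch only; the `Segment` class, its method names, and the `synthesize` callable are assumptions, not the tool's actual design.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Segment:
    text: str
    preview: Optional[bytes] = None   # the take most recently generated and played
    approved: Optional[bytes] = None  # the take pinned for export

    def generate(self, synthesize: Callable[[str], bytes]) -> bytes:
        """Produce a new take; a non-deterministic model may differ on every call."""
        self.preview = synthesize(self.text)
        return self.preview

    def approve(self) -> None:
        """Pin the bytes the user just heard, not the settings that produced them."""
        if self.preview is None:
            raise ValueError("nothing has been generated to approve")
        self.approved = self.preview

    def edit(self, text: str) -> None:
        """A script change invalidates both the preview and the approval."""
        self.text = text
        self.preview = None
        self.approved = None

    def export(self) -> bytes:
        """Return the approved audio as-is; no re-synthesis happens here."""
        if self.approved is None:
            raise ValueError("segment has not been approved")
        return self.approved
```

The design choice worth noting is that approval stores audio rather than parameters, so a slightly random voice model cannot change the result between approval and export.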
Results
Shipped end to end — installed, used, and verified on real decks, and in active use by the team it was built for
Replaced a slow, repetitive manual loop with a single review screen and a short list of clear actions; an afternoon of narration shrank to a fraction of the effort
Names and formal terms are pronounced consistently and can be corrected — and the audio a user approves is exactly the audio that ships
A non-technical user can operate genuinely technical features — speech markup, phonetic accent control, animation-timed video — without ever feeling they opened a developer tool
Reflection
“The result hides a lot of complexity behind a calm, plain surface, and earns trust because that complexity never leaks. Closing the gap between capable and approachable — and the care it takes to do it — is what this project is really about.”