Americans pronounce video as VIH-dee-oh (/ˈvɪdioʊ/). In "video", the "d" between vowels is a quick tap: the tongue briefly touches the ridge behind the upper teeth. This tap is the same sound as the Flap T, and it's why Americans sound more relaxed than the textbook. So instead of a fully held "d", you get a light, quick one. Stress falls on the first syllable, so keep everything else short and quick. You'll hear it in sentences like "We will review the video later this week" or "He enjoys video editing and creating content for his channel" (more examples below).
Record yourself saying "video" and play it back.
3 syllables, 5 sounds. Each sound is described by its mouth shape and how it's made.
Lift your bottom lip so its inner edge (where the wet part meets the dry part) touches the very bottom of your top front teeth. Add vocal cord vibration as you blow air through.

Drop your jaw slightly and keep your lips relaxed. Rest the tip of your tongue behind your bottom front teeth and arch the front of the tongue toward the roof of your mouth.

The textbook way isn't wrong — it's just not how anyone actually says it.
In "video", the "d" between vowels becomes a quick tap [ɾ], which sounds like a soft D: the tongue briefly taps the ridge behind the upper teeth. The same tap replaces /t/ between vowels, which is why it is often called the Flap T.
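For readers who like to see a rule spelled out, the tap can be sketched as a small Python function. This is a simplified illustration, not a full model of American English: the function name `apply_flap`, the `stressed_index` parameter, and the short `VOWELS` set are all assumptions made for this example, and the rule is reduced to "a /t/ or /d/ between two vowels, where the following vowel is unstressed".

```python
# Hypothetical, simplified vowel inventory for this sketch (not a complete set).
VOWELS = {"ɪ", "i", "oʊ", "ə", "æ", "ɑ", "ɛ", "ɚ", "eɪ", "aɪ", "u", "ʊ", "ʌ", "ɔ"}


def apply_flap(phonemes, stressed_index):
    """Replace /t/ or /d/ with the tap [ɾ] when it sits between two vowels
    and the vowel that follows it is not the stressed one."""
    result = list(phonemes)
    for i in range(1, len(phonemes) - 1):
        if (
            phonemes[i] in {"t", "d"}
            and phonemes[i - 1] in VOWELS
            and phonemes[i + 1] in VOWELS
            and i + 1 != stressed_index
        ):
            result[i] = "ɾ"
    return result


# "video": five sounds, stress on the first vowel (index 1).
video = ["v", "ɪ", "d", "i", "oʊ"]
print("".join(apply_flap(video, stressed_index=1)))  # vɪɾioʊ
```

The stress check is what keeps a word like "attack" from being tapped: there the vowel after the "t" carries the stress, so the "t" stays a full /t/.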
Stress falls on the first syllable. Stretch VIH and keep the other two syllables short and quick.