Lift your bottom lip to touch the very bottom of your top front teeth. Blow air through this contact point without voicing.

Americans pronounce "photo" as FOH-toh (/ˈfoʊɾoʊ/). The "t" between vowels sounds like a quick "d": the tongue briefly taps the ridge behind the upper teeth. This is called the Flap T, a hallmark of natural-sounding American speech. Stress falls on the first syllable, so keep the second one short and quick. You'll hear it in sentences like "This is a photo of my family."
Record yourself saying "photo" and play it back.
2 syllables, 4 sounds.
The textbook way isn't wrong; it's just not how Americans usually say it in everyday speech.
In "photo", the "t" between vowels sounds like a quick "d". Here /t/ (like /d/ in the same position) becomes a quick tap [ɾ], which sounds like a soft D: the tongue briefly taps the ridge behind the upper teeth.
Stress falls on the first syllable. Stretch FOH and keep the second syllable short and quick.