Software Architecture

Firmware based on a Retro-Go fork with a custom display driver and target configuration for the ESP32 Emu Turbo hardware.


Platform: Retro-Go

Retro-Go is a multi-system emulator for ESP32 devices. It provides a launcher UI, save states, ROM browser (SD card), and a unified input/display/audio framework.

Why Retro-Go

| Criteria | Retro-Go | esp-box-emu | Custom from scratch |
|---|---|---|---|
| Emulator count | 10+ systems | 6 systems | 1 at a time |
| ESP32-S3 support | Mature | ESP32-S3-BOX only | Manual porting |
| Launcher UI | Built-in | LVGL-based | Must build |
| Save states | Yes | Yes | Must implement |
| SD card ROM browser | Yes | Yes | Must implement |
| Community / forks | Large | Small | None |
| SNES core | Snes9x 2005 (slow) | WIP | Must port |

Supported Emulators

| Core | System | Resolution | FPS on ESP32-S3 |
|---|---|---|---|
| nofrendo | NES / Famicom | 256x240 | 60 fps |
| gnuboy | Game Boy | 160x144 | 60 fps |
| gnuboy | Game Boy Color | 160x144 | 60 fps |
| smsplus | Master System | 256x192 | 60 fps |
| smsplus | Game Gear | 160x144 | 60 fps |
| pce-go | PC Engine / TurboGrafx-16 | 256x240 | 60 fps |
| handy | Atari Lynx | 160x102 | 60 fps |
| gwenesis | Sega Genesis / Mega Drive | 320x224 | 50-60 fps |
| gw-emulator | Game & Watch | various | 60 fps |
| snes9x | SNES / Super Famicom | 256x224 | 20-45 fps |

All systems except SNES run at or near full speed on the ESP32-S3 N16R8 at 240 MHz (Genesis at 50-60 fps, the rest at 60 fps).


Implementation Plan

Phase 1 — Hardware Abstraction (bootstrap) ✅

Standalone ESP-IDF v5.x project in software/ that validates all hardware before integrating Retro-Go. See software/README.md for build instructions.

| Step | Task | Details | Status |
|---|---|---|---|
| 1.1 | ESP-IDF v5.x project setup | sdkconfig for N16R8 (240MHz, 16MB flash, 8MB PSRAM) | ✅ Done |
| 1.2 | ST7796S display driver (i80 8-bit parallel) | esp_lcd_panel_io_i80 + esp_lcd_st7796 component, 20MHz | ✅ Done |
| 1.3 | Display test pattern | Color bars, fill screen, status indicators | ✅ Done |
| 1.4 | SD card (SPI mode) | esp_vfs_fat_sdspi_mount, FAT32, ROM directory scanner | ✅ Done |
| 1.5 | 12-button input | GPIO polling @ 1ms, bitmask API, HW RC debounce | ✅ Done |
| 1.6 | I2S audio output | i2s_std 32kHz 16-bit mono, 440Hz test tone | ✅ Done |
| 1.7 | Power management | IP5306 I2C (0x75), battery %, charge status | ✅ Done |

Firmware project structure

software/
├── CMakeLists.txt           ESP-IDF project root
├── sdkconfig.defaults       ESP32-S3 N16R8 hardware config
├── partitions.csv           4MB app + 12MB storage
└── main/
    ├── idf_component.yml    esp_lcd_st7796 ^1.4.0
    ├── board_config.h       All GPIO pin definitions (source of truth)
    ├── main.c               Test harness → interactive button display
    ├── display.c/h          ST7796S 320×480 i80 parallel + LEDC backlight
    ├── input.c/h            12 buttons, active-low, bitmask polling
    ├── sdcard.c/h           SPI @ 40MHz, FAT32, ROM listing
    ├── audio.c/h            I2S mono → PAM8403 amplifier
    └── power.c/h            IP5306 I2C battery level + charge status
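
The active-low, bitmask-based polling attributed to input.c could look roughly like the sketch below. The button order, the `BTN_*` names, and both helper functions are assumptions made for illustration; the real definitions live in board_config.h and input.h.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define BTN_COUNT 12
static_assert(BTN_COUNT <= 16, "bitmask must fit in uint16_t");

/* Hypothetical bit positions, not taken from board_config.h. */
enum {
    BTN_UP, BTN_DOWN, BTN_LEFT, BTN_RIGHT,
    BTN_A, BTN_B, BTN_X, BTN_Y,
    BTN_START, BTN_SELECT, BTN_L, BTN_R
};

/* Convert raw pin levels (true = high) into a pressed-button bitmask.
 * The buttons are active-low, so a low level sets the bit. */
uint16_t buttons_from_levels(const bool levels[BTN_COUNT])
{
    uint16_t mask = 0;
    for (int i = 0; i < BTN_COUNT; i++) {
        if (!levels[i]) {
            mask |= (uint16_t)(1u << i);
        }
    }
    return mask;
}

/* Buttons that went from released to pressed between two polls. */
uint16_t buttons_newly_pressed(uint16_t prev, uint16_t now)
{
    return (uint16_t)(now & (uint16_t)~prev);
}
```

In firmware, the `levels` array would be filled from the GPIO driver on each 1 ms poll; keeping the conversion separate from the hardware read makes it easy to test on a host machine.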

Build & flash (Docker)

No local toolchain needed — the build runs inside the official espressif/idf:v5.4 Docker image.

# Build firmware
make firmware-build

# Flash + serial monitor (connect board, hold SELECT at power-on)
make firmware-flash

# Custom USB port
ESP_PORT=/dev/ttyACM0 make firmware-flash

Native ESP-IDF is also supported — see software/README.md for details.

Test sequence on boot

  1. Display shows color bars for 3 seconds (verifies 8-bit data bus)
  2. IP5306 battery % and charge status (serial log)
  3. All 12 button GPIOs initialized
  4. SD card mounted, ROM directories scanned
  5. 440 Hz test tone plays for 2 seconds
  6. Interactive mode: button presses shown on screen + serial

SD Card Setup

The console loads ROMs from a micro SD card formatted as FAT32. Each emulated system has its own folder under /roms/.

Directory structure

SD Card (FAT32)
└── roms/
    ├── nes/     .nes files
    ├── snes/    .smc / .sfc files
    ├── gb/      .gb files
    ├── gbc/     .gbc files
    ├── sms/     .sms files
    ├── gg/      .gg files
    ├── pce/     .pce files
    ├── gen/     .bin / .md files
    ├── lynx/    .lnx files
    └── gw/      .gw files

Preparation steps

  1. Format the micro SD card as FAT32 (most cards come pre-formatted)
  2. Create the roms/ directory in the root of the card
  3. Create sub-folders for each system you want to emulate
  4. Copy ROM files into the matching folder
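
As an illustration of the folder layout, the sketch below maps a file name to its folder under /roms/ using the extensions from the directory listing. The function name `rom_folder_for` and the lookup table are hypothetical helpers, not part of the firmware or of Retro-Go.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Extension → folder pairs copied from the directory listing. */
static const struct { const char *ext; const char *folder; } ROM_DIRS[] = {
    {"nes", "nes"}, {"smc", "snes"}, {"sfc", "snes"}, {"gb", "gb"},
    {"gbc", "gbc"}, {"sms", "sms"},  {"gg", "gg"},    {"pce", "pce"},
    {"bin", "gen"}, {"md", "gen"},   {"lnx", "lynx"}, {"gw", "gw"},
};

/* Case-insensitive string equality (ASCII only). */
static int ext_equal(const char *a, const char *b)
{
    while (*a && *b) {
        if (tolower((unsigned char)*a) != tolower((unsigned char)*b)) return 0;
        a++;
        b++;
    }
    return *a == *b;
}

/* Return the folder under /roms/ for a file name, or NULL if unknown. */
const char *rom_folder_for(const char *filename)
{
    assert(filename != NULL);
    const char *dot = strrchr(filename, '.');
    if (dot == NULL || dot[1] == '\0') return NULL;
    for (size_t i = 0; i < sizeof ROM_DIRS / sizeof ROM_DIRS[0]; i++) {
        if (ext_equal(dot + 1, ROM_DIRS[i].ext)) return ROM_DIRS[i].folder;
    }
    return NULL;
}
```

A copy script or a host-side check could use such a lookup to place each ROM in the matching sub-folder.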

Automated setup

A script is provided to format the SD card and copy test ROMs in one step:

# Format SD card as FAT32 + copy all homebrew test ROMs
sudo ./scripts/setup-sdcard.sh /dev/sdX

# Copy only (skip formatting)
sudo ./scripts/setup-sdcard.sh /dev/sdX --no-format

Included homebrew test ROMs

The project includes 8 freely distributable homebrew ROMs in test-roms/ for testing without commercial ROMs:

| System | ROM | Author | Size |
|---|---|---|---|
| NES | Owlia | Gradual Games | 512 KB |
| GB | Blargg's CPU Instructions | Blargg | 64 KB |
| GBC | ucity v1.3 | AntonioND | 128 KB |
| SMS | Silver Valley | Enrique Ruiz | 256 KB |
| GG | Swabby v1.11 | Anders S. Jensen | 128 KB |
| PCE | Reflectron | Aetherbyte | 256 KB |
| Genesis | Miniplanets | Sik | 256 KB |
| SNES | Super Boss Gaiden v1.2 | Dieter Von Laser | 512 KB |

| System | ROM | File | Size | Why |
|---|---|---|---|---|
| NES | Super Mario Bros | smb.nes | 40 KB | Universal test: scrolling, sprites, audio |
| SNES | Super Mario World | smw.smc | 512 KB | Good baseline: 2 BG layers, Mode 1 |
| SNES | FF6 | ff6.smc | 3 MB | Turn-based RPG: best SNES genre for ESP32 |
| GB | Tetris | tetris.gb | 32 KB | Minimal: verifies basic emulation |
| Genesis | Sonic | sonic.bin | 512 KB | Fast scrolling stress test |

Size limits

| Constraint | Value |
|---|---|
| Max ROM size (PSRAM) | 6 MB |
| SD card format | FAT32 (max 32 GB recommended) |
| Max filename length | 255 characters (long filename support enabled) |

SNES ROM sizes

Most SNES games are 1–4 MB. Games with special chips (SA-1, SuperFX) are larger and may not be compatible with snes9x on ESP32-S3.

Phase 2 — Retro-Go Integration

Fork and adapt Retro-Go for our hardware. Retro-Go is included as a git submodule at retro-go/ and built via a separate Docker Compose file.

| Step | Task | Details | Status |
|---|---|---|---|
| 2.1 | Add ducalex/retro-go as submodule | retro-go/ directory, upstream repo | ✅ Done |
| 2.2 | Create target targets/esp32-emu-turbo/ | config.h + env.py + sdkconfig | ✅ Done |
| 2.3 | Docker build pipeline | docker-compose.retro-go.yml + Makefile targets | ✅ Done |
| 2.4 | Custom display driver st7796s_i80.h | 8-bit i80 parallel via esp_lcd_panel_io_i80, async DMA, 5-buffer pool | ✅ Done |
| 2.5 | Frame scaling | Automatic via Retro-Go core (320x480 portrait, integer scale + letterbox) | ✅ Done |
| 2.6 | Input mapping | 12 GPIO direct buttons + MENU=SELECT (GPIO 0) | ✅ Done |
| 2.7 | Audio routing | I2S ext DAC (BCLK=15, WS=16, DATA=17) → PAM8403 | ✅ Done |
| 2.8 | First boot: NES test | nofrendo running Super Mario Bros at 60fps | ⏳ Needs hardware |

Build & flash (Docker)

Retro-Go uses a separate Docker Compose file (docker-compose.retro-go.yml) with the espressif/idf:v5.4 image.

# Build all Retro-Go apps (launcher + emulators)
make retro-go-build

# Build launcher only (quick test)
make retro-go-build-launcher

# Flash firmware + serial monitor
make retro-go-flash

# Serial monitor only
make retro-go-monitor

# Custom USB port
ESP_PORT=/dev/ttyACM0 make retro-go-flash

# Clean build cache
make retro-go-clean

Build output

All 5 Retro-Go applications compile successfully for the ESP32 Emu Turbo target (ESP-IDF v5.4, ESP32-S3):

| Binary | Contents | Size | Partition free |
|---|---|---|---|
| launcher.bin | Retro-Go launcher UI + ROM browser | 1037 KB | 67% |
| retro-core.bin | All emulators (NES, GB, GBC, SMS, GG, PCE, Lynx, SNES, G&W) | ~2.5 MB | ~17% |
| gwenesis.bin | Sega Genesis / Mega Drive (standalone) | ~1.5 MB | ~50% |
| prboom-go.bin | Doom port (PrBoom) | ~1.5 MB | ~50% |
| fmsx.bin | MSX emulator | 655 KB | 79% |

Note: The build prints "Device doesn't support fw format, try build-img!" at the end. This is expected: our target uses individual app flashing via make retro-go-flash, not a combined firmware image.

Target configuration

The target lives at retro-go/components/retro-go/targets/esp32-emu-turbo/ with:

  • config.h: GPIO mapping, display/audio/input config (mirrors board_config.h)
  • env.py: IDF_TARGET = "esp32s3", firmware format
  • sdkconfig: ESP-IDF config (240MHz, 16MB flash QIO, 8MB Octal PSRAM)

GPIO mapping verification

All 35 GPIO assignments listed below have been cross-verified between three sources with zero discrepancies:

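| Group | Pins | board_config.h | Retro-Go config.h | KiCad schematic |
|---|---|---|---|---|
| Display data D0–D7 | GPIO 4–11 | ✓ | ✓ | ✓ |
| Display control | GPIO 3, 12–14, 45, 46 | ✓ | ✓ | ✓ |
| SD card SPI | GPIO 36–39 | ✓ | ✓ | ✓ |
| I2S audio | GPIO 15–17 | ✓ | ✓ | ✓ |
| D-pad | GPIO 40, 41, 42, 1 | ✓ | ✓ | ✓ |
| Face buttons | GPIO 2, 48, 47, 21 | ✓ | ✓ | ✓ |
| System buttons | GPIO 18, 0 | ✓ | ✓ | ✓ |
| Shoulder buttons | GPIO 35, 43 | ✓ | ✓ | ✓ |
| I2C (IP5306) | GPIO 33, 34 | ✓ | ✓ | ✓ |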

Notes:

  • MENU and SELECT share GPIO 0 in Retro-Go (intentional — 12 physical buttons, 13 logical)
  • GPIO 19/20 are used for native USB data (D-/D+) — firmware flash + CDC debug console
  • GPIO 43 is BTN_R (was TX0 UART debug, replaced by USB native)
  • GPIO 26–32 are reserved for Octal PSRAM (cannot be used)

Display driver: st7796s_i80.h

Custom driver that replaces Retro-Go's SPI-based ili9341.h with an 8-bit 8080 parallel interface. Located at retro-go/components/retro-go/drivers/display/st7796s_i80.h.

| Feature | Value |
|---|---|
| Bus | 8-bit i80 parallel (esp_lcd_panel_io_i80) |
| Clock | 20 MHz write clock |
| Resolution | 320x480 portrait |
| Color format | RGB565 (16-bit) |
| DMA | Async with 5-buffer pool |
| Backlight | PWM via LEDC (GPIO 45) |
| Driver ID | RG_SCREEN_DRIVER 2 |

The driver uses esp_lcd_panel_io_tx_param for commands (CASET/RASET) and esp_lcd_panel_io_tx_color for async DMA pixel transfers. A completion callback recycles buffers to the pool, providing natural backpressure without explicit sync.

Phase 3 — All Emulators at Full Speed

Enable and test each emulator core.

| Step | Core | Test ROM | Target |
|---|---|---|---|
| 3.1 | nofrendo (NES) | Super Mario Bros | 60 fps |
| 3.2 | gnuboy (GB) | Tetris | 60 fps |
| 3.3 | gnuboy (GBC) | Pokemon Crystal | 60 fps |
| 3.4 | smsplus (SMS) | Sonic the Hedgehog | 60 fps |
| 3.5 | smsplus (GG) | Sonic Triple Trouble | 60 fps |
| 3.6 | pce-go (PCE) | Bonk's Adventure | 60 fps |
| 3.7 | handy (Lynx) | California Games | 60 fps |
| 3.8 | gwenesis (Genesis) | Sonic the Hedgehog | 50-60 fps |
| 3.9 | gw-emulator (G&W) | Ball | 60 fps |

Phase 4 — SNES Optimization (60 FPS target)

Progressive optimization of the snes9x core (Snes9x 2005 via Retro-Go) in 3 sub-phases over ~14 days. Target: 60 FPS stable on standard titles (Super Mario World, Zelda ALttP, Chrono Trigger, Final Fantasy VI, Mega Man X). Baseline: ~30 FPS. See SNES Deep Dive for full technical details.

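| Sub-phase | Step | Optimization | Days | Gain | Cumulative FPS |
|---|---|---|---|---|---|
| 4.1 ASM DSP | 4.1.1 | BRR Decode assembly (Xtensa LX7) | 1 | +5–7% | 30 → 32–33 |
| 4.1 ASM DSP | 4.1.2 | Gaussian Interpolation assembly | 0.5 | +3–4% | 33 → 34–35 |
| 4.1 ASM DSP | 4.1.3 | Voice Mixing assembly (fast-path) | 2 | +5–8% | 35 → 38–40 |
| 4.1 ASM DSP | 4.1.4 | Echo FIR Filter assembly (8-tap unrolled) | 0.5 | +2–3% | 40 → 41–42 |
| 4.2 Architecture | 4.2.1 | Dual-Core SPC700 (Core 1 dedicated audio) | 2–3 | +35–45% | 42 → 50–52 |
| 4.2 Architecture | 4.2.2 | Memory Layout (PSRAM → SRAM, ~100 KB) | 1 | +15–20% | 52 → 54–56 |
| 4.2 Architecture | 4.2.3 | Overclock to 260 MHz | 0.01 | +8% | 56 → 57–58 |
| 4.2 Architecture | 4.2.4 | Audio sample rate 32 → 16 kHz | 0.05 | +2–3% | 57–58 |
| 4.3 PPU & Display | 4.3.1 | PPU Fast-Path rendering (Mode 1) | 3–4 | +5–8% | 58 → 59–60 |
| 4.3 PPU & Display | 4.3.2 | Tile Cache in SRAM (dirty-flag) | 1 | +3–5% | 60 + headroom |
| 4.3 PPU & Display | 4.3.3 | DMA Display Push (double-buffer) | 1 | +2–3% | 60 + headroom |
| 4.3 PPU & Display | 4.3.4 | Adaptive Frameskip (safety net) | 0.5 | safety net | 60 stable |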

Phase 5 — v2 Hardware Audio Coprocessor (ESP32-S3-MINI-1)

Hardware evolution for v2: add an ESP32-S3-MINI-1 module (~$3.25) as a universal audio coprocessor. It shares the Xtensa LX7 architecture and ESP-IDF toolchain with the main chip, so the Phase 4.1 assembly code runs identically with zero porting. The main ESP32-S3 is fully freed from audio work, leaving both cores dedicated to CPU + PPU + game logic. This benefits all emulators, not just SNES. See Phase 5 Deep Dive for full technical details.

| Step | Task | Days | Impact |
|---|---|---|---|
| 5.1 | ESP32-S3-MINI-1 circuit design + PCB integration | 1 | Hardware schematic (module = no external crystal/flash) |
| 5.2 | SPI communication protocol (ESP32 ↔ ESP32-S3-MINI-1) | 1 | Same ESP-IDF SPI API on both sides |
| 5.3 | Coprocessor firmware: Passthrough mode (PCM relay) | 0.5 | Reuse existing audio.c I2S code |
| 5.4 | Coprocessor firmware: SPC700 native mode | 0.5 | Phase 4.1 Xtensa ASM runs identically |
| 5.5 | ESP32 firmware: audio hub integration | 1 | Same ESP-IDF build system, single idf.py |
| 5.6 | Testing + latency tuning | 1 | Same idf.py monitor, same log format |
| Total | | ~5 | 100% audio offload |

Display Driver

ST7796S 8-bit Parallel (i80 Bus)

Retro-Go ships with an SPI-only ILI9341 driver. Our hardware uses 8-bit 8080 parallel which requires a custom driver. The firmware uses the esp_lcd_st7796 component with the esp_lcd_panel_io_i80 bus API.

ESP32-S3                         ST7796S (4.0" 320x480)
─────────                        ──────────────────────────
GPIO 4-11 (D0-D7)  ────────►     DB0-DB7 (8-bit data bus)
GPIO 12 (CS)       ────────►     CS  (chip select)
GPIO 14 (DC)       ────────►     DC  (data/command)
GPIO 46 (WR)       ────────►     WR  (write strobe)
GPIO 3 (RD)        ────────►     RD  (read strobe)
GPIO 13 (RST)      ────────►     RST (reset)
GPIO 45 (BL)       ────────►     LED (backlight PWM via LEDC)

GPIO4–11 form a contiguous 8-bit bus, enabling efficient DMA transfers.

Bandwidth

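| Interface | Clock | Throughput | Full 320x480 RGB565 refresh at 60 fps (18.4 MB/s) |
|---|---|---|---|
| SPI (ILI9341) | 40 MHz | 5.0 MB/s | 369% (not achievable) |
| 8-bit i80 (ours) | 20 MHz | 20.0 MB/s | 92% utilization |

The 8080 parallel bus has 4x the bandwidth of SPI. A full-screen refresh at 60 fps needs 320 x 480 x 2 bytes x 60 = 18.4 MB/s, which fits within the i80 bus but is out of reach for 40 MHz SPI. Smaller update regions use proportionally less bandwidth, which leaves headroom for scaling and double-buffering.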
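
The bandwidth figures can be reproduced with a few lines of arithmetic. The helper names below are hypothetical and the model is deliberately simple: one bus-width transfer per clock, with no command or blanking overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Raw bus throughput in bytes per second: one bus-width transfer per clock. */
uint32_t bus_bytes_per_sec(uint32_t clock_hz, uint32_t bus_bits)
{
    return (uint32_t)(((uint64_t)clock_hz * bus_bits) / 8u);
}

/* Bytes per second needed to redraw a width x height region at fps. */
uint32_t frame_bytes_per_sec(uint32_t width, uint32_t height,
                             uint32_t bytes_per_pixel, uint32_t fps)
{
    return width * height * bytes_per_pixel * fps;
}

/* Bus utilization in percent, rounded to the nearest integer. */
uint32_t utilization_pct(uint32_t needed, uint32_t available)
{
    assert(available > 0);
    return (uint32_t)(((uint64_t)needed * 100u + available / 2u) / available);
}
```

For instance, `bus_bytes_per_sec(20000000, 8)` corresponds to the 20.0 MB/s figure for the i80 bus and `bus_bytes_per_sec(40000000, 1)` to the 5.0 MB/s figure for SPI.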

Frame Scaling

| Source system | Native res | → Display 320x480 | Method |
|---|---|---|---|
| NES | 256x240 | 256x240 centered | 1:1 letterbox |
| SNES | 256x224 | 256x448 (2x V) | Integer 2x vertical |
| Game Boy | 160x144 | 320x288 (2x both) | Integer 2x |
| Genesis | 320x224 | 320x448 (2x V) | Integer 2x vertical |
| Master System | 256x192 | 256x384 (2x V) | Integer 2x vertical |

The ST7796S at 320x480 is well-suited: most systems are ≤320px wide and can be doubled vertically for a crisp image with black bars.
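
A minimal sketch of the integer-scale-plus-letterbox placement follows. It takes the scale factors as inputs instead of choosing them, because the selection policy belongs to the Retro-Go core; `place_scaled` and `placement_t` are hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DISPLAY_W 320
#define DISPLAY_H 480

typedef struct {
    int out_w, out_h;   /* scaled image size in pixels */
    int off_x, off_y;   /* top-left corner that centers the image */
} placement_t;

/* Scale the source by integer factors (sx, sy) and center it on the display.
 * Returns false if the scaled image would not fit. */
bool place_scaled(int src_w, int src_h, int sx, int sy, placement_t *p)
{
    assert(sx >= 1 && sy >= 1 && p != NULL);
    int w = src_w * sx;
    int h = src_h * sy;
    if (w > DISPLAY_W || h > DISPLAY_H) {
        return false;
    }
    p->out_w = w;
    p->out_h = h;
    p->off_x = (DISPLAY_W - w) / 2;   /* black bars left and right */
    p->off_y = (DISPLAY_H - h) / 2;   /* black bars top and bottom */
    return true;
}
```

With the SNES row (256x224, 1x horizontal, 2x vertical), the image becomes 256x448 with 32-pixel side bars and 16-pixel bars at the top and bottom.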


SNES Deep Dive

Why SNES is Hard on ESP32-S3

The SNES has three CPU-intensive subsystems that must be emulated in real-time:

Frame time budget: 16.67 ms (for 60 fps)

┌────────────────────────────────────────────┐
│ 65C816 CPU emulation          ~4.5 ms  27% │
│ PPU rendering (2 BG layers)   ~5.0 ms  30% │
│ SPC700 audio DSP              ~8.0 ms  48% │ ← bottleneck
│ Display transfer              ~1.5 ms   9% │
├────────────────────────────────────────────┤
│ TOTAL                        ~19.0 ms 114% │ ← over budget
└────────────────────────────────────────────┘

At 114% of the frame budget on a single core, SNES emulation via Retro-Go (Snes9x 2005) currently reaches ~30 FPS against a target of 60 FPS. The three-phase optimization plan below combines assembly-level DSP work, architectural changes (dual-core, memory layout), and rendering optimizations (PPU fast-path, tile cache, DMA display) to reach a stable 60 FPS.

Note: Performance gains are not perfectly additive. Each optimization reduces the total frame time, so subsequent ones operate on a smaller base. The estimates account for this non-linearity.


Phase 4.1 — Assembly DSP (Xtensa LX7)

Goal: Rewrite the 4 heaviest S-DSP audio functions in native Xtensa assembly. These consume ~50% of the total SPC700 audio emulation time. ~120 lines of ASM, ~4 days.

4.1.1 — BRR Decode (DecodeBlockAsm)

| Aspect | Detail |
|---|---|
| C function | DecodeBlock() in soundux.cpp |
| What it does | Decodes BRR blocks (9 bytes → 16 PCM 16-bit samples). Native compressed format for all SNES audio samples. |
| Call frequency | ~2000–4000 times/frame (8 voices x sample rate x variable pitch) |
| C bottleneck | Loop with branches for clamping, stack spill for filter variables, unoptimized buffer access |
| ASM optimization | Zero-overhead LOOP, branchless MIN/MAX clamping, dedicated registers for filter state (a7/a8), load/compute interleaving |
| Expected gain | +5–7% on total frame time |
| Effort | ~30 lines ASM, 1 day |

Each BRR block has a header byte (shift amount + filter type 0–3) followed by 8 bytes of compressed data. Filters apply linear prediction using the 2 previous samples. The assembly eliminates branches in [-32768, +32767] clamping via native Xtensa MIN/MAX instructions, keeping old/older samples in registers a7/a8 without touching the stack.
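
For reference, a simplified C version of the decode step is sketched below. It is not the code from soundux.cpp: the function name and signature are assumptions, the filter coefficients are the commonly documented BRR values (15/16, 61/32 and 15/16, 115/64 and 13/16), and the end/loop flags in the header and the special handling of shift values above 12 are left out.

```c
#include <assert.h>
#include <stdint.h>

static inline int32_t clamp16(int32_t v)
{
    return v < -32768 ? -32768 : (v > 32767 ? 32767 : v);
}

/* Decode one 9-byte BRR block into 16 PCM samples.
 * p1/p2 hold the previous two output samples and carry over between blocks.
 * Assumes >> on negative values is an arithmetic shift (true for GCC/Clang). */
void brr_decode_block(const uint8_t block[9], int16_t out[16],
                      int32_t *p1, int32_t *p2)
{
    assert(block && out && p1 && p2);
    int shift  = block[0] >> 4;         /* header bits 7..4 */
    int filter = (block[0] >> 2) & 3;   /* header bits 3..2 */

    for (int i = 0; i < 16; i++) {
        uint8_t byte = block[1 + (i >> 1)];
        int32_t nib = (i & 1) ? (byte & 0x0F) : (byte >> 4); /* high nibble first */
        if (nib >= 8) nib -= 16;                              /* sign-extend 4 bits */

        int32_t s = (nib * (1 << shift)) >> 1;
        switch (filter) {                 /* linear prediction from p1 and p2 */
        case 1: s += (*p1 * 15) >> 4; break;
        case 2: s += ((*p1 * 61) >> 5) - ((*p2 * 15) >> 4); break;
        case 3: s += ((*p1 * 115) >> 6) - ((*p2 * 13) >> 4); break;
        default: break;                   /* filter 0: no prediction */
        }
        s = clamp16(s);                   /* the step the ASM does with MIN/MAX */
        out[i] = (int16_t)s;
        *p2 = *p1;
        *p1 = s;
    }
}
```

The `clamp16` call and the `p1`/`p2` updates are the parts the assembly version keeps branch-free and in registers.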

4.1.2 — Gaussian Interpolation (GaussianInterpAsm)

| Aspect | Detail |
|---|---|
| C function | Inline interpolation in MixStereo/MixMono loop |
| What it does | 4-point filter with 512-entry Gaussian lookup table. Interpolates between decoded samples for resampling at desired pitch. |
| Call frequency | 32000/sec x 8 voices = 256,000 calls/sec |
| C bottleneck | 4 loads from gauss table + 4 multiplications + accumulate. Compiler generates ~18 instructions with intermediate load/stores. |
| ASM optimization | Gauss table in IRAM (.section .iram1), 4x MULL+ADD pipeline-scheduled, result in 8 net instructions. All 4 samples and 4 coefficients live in registers. |
| Expected gain | +3–4% on total frame time |
| Effort | ~10 lines ASM, half day |

The key is placing the Gaussian table (1 KB) in IRAM with .section .iram1 — this eliminates PSRAM latency for every lookup. With coefficients pre-loaded in registers, the computation reduces to 4 MULL + 3 ADD + 1 SRAI. The C compiler typically cannot keep everything in registers because it has no aliasing guarantees on the pointers.
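
The 4 MULL + 3 ADD + 1 SRAI structure corresponds to the C sketch below. The table is passed in as a parameter because its contents are not reproduced here, and the indexing follows the scheme commonly seen in S-DSP emulators, so it should be checked against soundux.cpp. The version shown sums the four products and shifts once, which is a simplification.

```c
#include <assert.h>
#include <stdint.h>

/* 4-point interpolation with a 512-entry table.
 * s[0] is the oldest of the four samples; frac is the 8-bit fractional
 * sample position. Assumes arithmetic >> on negative values. */
int16_t gauss_interp(const int16_t gauss[512], const int16_t s[4], unsigned frac)
{
    assert(frac < 256);
    const int16_t *fwd = gauss + 255 - frac;
    const int16_t *rev = gauss + frac;

    int32_t acc = fwd[0]   * s[0]     /* 4 multiplies ... */
                + fwd[256] * s[1]
                + rev[256] * s[2]
                + rev[0]   * s[3];    /* ... 3 adds ... */
    acc >>= 11;                       /* ... 1 arithmetic shift */

    if (acc > 32767) acc = 32767;
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```

A quick sanity check is to fill the table with a constant 512: the four coefficients then sum to 2048, so four equal input samples come back unchanged.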

4.1.3 — Voice Mixing (MixVoiceAsm)

| Aspect | Detail |
|---|---|
| C function | MixStereo() / MixMono() in soundux.cpp |
| What it does | For each voice: applies ADSR/GAIN envelope, multiplies by L/R volume, accumulates into mix buffer. Handles pitch modulation, noise, and echo enable. |
| Call frequency | 1 per output sample x 8 voices = core loop of the entire DSP |
| C bottleneck | Most complex loop: per-voice branching (envelope state machine, pitch mod check, noise check, echo check), volume multiplications, stereo accumulate. Many variables, heavy register pressure. |
| ASM optimization | Fast-path for the common case (no pitch mod, no noise): eliminates branches, unrolls 8 voices, optimized stereo volume MAC. Fallback to C for special cases. |
| Expected gain | +5–8% on total frame time |
| Effort | ~60 lines ASM, 2 days (most complex) |

The strategy is a fast-path for the most frequent case (active voice, envelope in SUSTAIN state, no pitch modulation, no noise). This covers ~80% of real gameplay situations. For edge cases (ATTACK/DECAY/RELEASE, active pitch mod, noise generator), it falls back to the original C function. The fast-path uses Xtensa register windowing to keep all 8 volumes (L+R) and 8 envelopes in registers.
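
The fast-path test and the per-sample work it guards can be sketched in C as follows. The `voice_t` layout, the enum, and both function names are hypothetical. The 11-bit envelope and the 7-bit volume shift follow the usual S-DSP conventions, not the exact code in soundux.cpp.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-voice state; field names are illustrative only. */
typedef enum { ENV_ATTACK, ENV_DECAY, ENV_SUSTAIN, ENV_RELEASE } env_state_t;

typedef struct {
    bool        active;
    bool        pitch_mod;
    bool        noise;
    env_state_t env_state;
    int32_t     envelope;      /* 0..0x7FF (11-bit) */
    int8_t      vol_l, vol_r;  /* signed channel volumes */
} voice_t;

/* True when the voice matches the common case handled by the fast path. */
bool voice_is_fast_path(const voice_t *v)
{
    return v->active && v->env_state == ENV_SUSTAIN &&
           !v->pitch_mod && !v->noise;
}

/* Apply envelope and stereo volume to one sample and accumulate into the mix.
 * Assumes arithmetic >> on negative values. */
void mix_voice_sample(const voice_t *v, int16_t sample,
                      int32_t *mix_l, int32_t *mix_r)
{
    assert(v->envelope >= 0 && v->envelope <= 0x7FF);
    int32_t enveloped = (sample * v->envelope) >> 11;
    *mix_l += (enveloped * v->vol_l) >> 7;
    *mix_r += (enveloped * v->vol_r) >> 7;
}
```

A caller would check `voice_is_fast_path()` once per voice and fall back to the original C routine whenever it returns false.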

4.1.4 — Echo FIR Filter (EchoFIRAsm)

| Aspect | Detail |
|---|---|
| C function | Echo processing in main MixStereo loop |
| What it does | 8-tap FIR (Finite Impulse Response) on echo buffer. Each echo output sample = sum of 8 previous samples x 8 programmable coefficients. |
| Call frequency | 32000/sec (one per output sample, stereo) |
| C bottleneck | 8-iteration loop with signed multiplication and accumulate. Compiler doesn't fully unroll and doesn't optimally schedule the MULL. |
| ASM optimization | Full 8x unroll, MULL pipeline-scheduled with next sample load in parallel. Echo buffer pointer in register. Branchless clamping. |
| Expected gain | +2–3% on total frame time |
| Effort | ~20 lines ASM, half day |

With 8 taps fully unrolled, each MULL is scheduled while the next sample load is in flight, hiding memory latency. The 8 FIR coefficients (signed bytes) are loaded into two 32-bit registers (4 coefficients per register) and extracted with shift+mask, avoiding 8 separate loads.
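
The packed-coefficient idea translates to C roughly as shown below. The byte order inside the two 32-bit words (lowest byte first) and the function names are assumptions for illustration, and the coefficients are treated as signed 1.7 fixed-point values (128 = 1.0).

```c
#include <assert.h>
#include <stdint.h>

/* Extract signed 8-bit coefficient idx (0..3) from a packed 32-bit word
 * using shift + mask, then sign-extend. */
static inline int32_t coef_at(uint32_t packed, int idx)
{
    assert(idx >= 0 && idx < 4);
    int32_t v = (int32_t)((packed >> (8 * idx)) & 0xFFu);
    return v >= 128 ? v - 256 : v;
}

/* 8-tap FIR, fully unrolled. hist[0..7] are the last eight echo samples,
 * c_lo packs coefficients 0..3 and c_hi packs coefficients 4..7.
 * Assumes arithmetic >> on negative values. */
int16_t echo_fir8(const int16_t hist[8], uint32_t c_lo, uint32_t c_hi)
{
    int32_t acc = 0;
    acc += hist[0] * coef_at(c_lo, 0);
    acc += hist[1] * coef_at(c_lo, 1);
    acc += hist[2] * coef_at(c_lo, 2);
    acc += hist[3] * coef_at(c_lo, 3);
    acc += hist[4] * coef_at(c_hi, 0);
    acc += hist[5] * coef_at(c_hi, 1);
    acc += hist[6] * coef_at(c_hi, 2);
    acc += hist[7] * coef_at(c_hi, 3);
    acc >>= 7;                          /* back from 1.7 fixed point */

    if (acc > 32767) acc = 32767;       /* clamp to 16-bit output */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```

Two word loads replace eight byte loads, which is the saving the assembly version relies on.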

Phase 4.1 Summary

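| Function | ASM lines | Days | Gain % | FPS impact |
|---|---|---|---|---|
| DecodeBlockAsm | ~30 | 1 | +5–7% | 30 → 32–33 |
| GaussianInterpAsm | ~10 | 0.5 | +3–4% | 33 → 34–35 |
| MixVoiceAsm | ~60 | 2 | +5–8% | 35 → 38–40 |
| EchoFIRAsm | ~20 | 0.5 | +2–3% | 40 → 41–42 |
| TOTAL Phase 4.1 | ~120 | 4 | +15–22% | 30 → 38–42 FPS |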

Phase 4.2 — Architectural Optimization

Goal: Restructure the emulator to leverage the ESP32-S3 dual-core and optimize memory layout. This phase has the single biggest impact overall. ~4 days.

4.2.1 — Dual-Core SPC700 Separation

| Aspect | Detail |
|---|---|
| Intervention | Move the entire SPC700 + DSP emulation (now ASM-optimized from Phase 4.1) to a dedicated FreeRTOS task on Core 1. |
| Current state | CPU 65C816, PPU, and SPC700 all run on Core 0 sequentially. Core 1 is essentially unused (only Wi-Fi/BT stack). |
| Target architecture | Core 0: CPU 65C816 + PPU + game logic. Core 1: SPC700 CPU + DSP (Phase 4.1 assembly) + I2S output via DMA. Communication via 4 lock-free I/O ports (atomic read/write). |
| Implementation | FreeRTOS task pinned to Core 1 with high priority. DMA-capable ring buffer (`MALLOC_CAP_DMA |
| Risks | Temporal synchronization: some games depend on exact timing between CPU and SPC700. Solution: timestamp-based sync with ±64 sample tolerance (~2ms). Works for 95%+ of games. |
| Expected gain | +35–45% on total frame time |
| Effort | 2–3 days |

This is the single most impactful change in the entire plan. Freeing Core 0 from all audio emulation virtually doubles the available CPU budget for CPU+PPU.

Core 0 (main):                 Core 1 (audio):
  65C816 CPU emulation           SPC700 CPU emulation
  PPU rendering                  DSP (assembly from Phase 4.1)
  Display transfer               I2S DMA output feed
  Input polling

  ~10.5 ms/frame                 ~8.0 ms/frame → ~5 ms with ASM
  → bottleneck at 11ms           (runs fully in parallel)
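
One way to hand samples from one core to another without locks is a single-producer/single-consumer ring buffer built on C11 atomics, sketched below. The type, the function names, and the buffer size are assumptions for illustration; the actual firmware may instead use FreeRTOS primitives or a DMA-capable buffer.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 256u   /* hypothetical capacity in samples */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0u, "size must be a power of two");

typedef struct {
    int16_t          data[RING_SIZE];
    _Atomic uint32_t head;   /* advanced only by the producer */
    _Atomic uint32_t tail;   /* advanced only by the consumer */
} ring_t;

void ring_init(ring_t *r)
{
    atomic_init(&r->head, 0);
    atomic_init(&r->tail, 0);
}

/* Producer side: returns false when the buffer is full. */
bool ring_push(ring_t *r, int16_t v)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head - tail == RING_SIZE) return false;
    r->data[head & (RING_SIZE - 1u)] = v;
    /* Release: the sample is visible before the new head is. */
    atomic_store_explicit(&r->head, head + 1u, memory_order_release);
    return true;
}

/* Consumer side: returns false when the buffer is empty. */
bool ring_pop(ring_t *r, int16_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (head == tail) return false;
    *out = r->data[tail & (RING_SIZE - 1u)];
    atomic_store_explicit(&r->tail, tail + 1u, memory_order_release);
    return true;
}
```

Because each index has exactly one writer, neither side ever blocks the other. A full buffer simply makes `ring_push()` fail, which gives the producer a natural point to yield.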

4.2.2 — Memory Layout Optimization

| Aspect | Detail |
|---|---|
| Intervention | Relocate critical data structures from PSRAM to internal SRAM (512 KB). |
| Structures to move | SPC700 RAM (64 KB), PPU tile cache (~32 KB), palette RAM (512 B), OAM sprite table (544 B), CGRAM (512 B), DSP registers (128 B). Total: ~100 KB in SRAM. |
| Impact | Octal PSRAM has ~80–120ns random access latency vs ~10ns for internal SRAM. The DSP and PPU make thousands of random accesses per frame. 8–10x latency difference. |
| Implementation | Replace malloc() with `heap_caps_malloc(size, MALLOC_CAP_INTERNAL |
| Expected gain | +15–20% on total frame time |
| Effort | 1 day (few lines of code, but requires profiling) |

4.2.3 — Overclock to 260 MHz

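| Aspect | Detail |
|---|---|
| Intervention | Increase clock from 240 to 260 MHz via ESP-IDF menuconfig (unofficial but stable). |
| Implementation | In sdkconfig: CONFIG_ESP_DEFAULT_CPU_FREQ_MHZ=260. Or runtime: esp_pm_configure() with max_freq_mhz=260. Stock ESP-IDF menuconfig offers only 80, 160, and 240 MHz for the ESP32-S3, so 260 MHz likely needs a patched clock configuration and should be validated on hardware. |
| Risks | Minimal. The S3 is officially tested to 240 MHz, but 260 MHz is widely used in the community without stability issues. No significant power consumption increase. |
| Expected gain | +8% linear across everything |
| Effort | 10 minutes |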

4.2.4 — Audio Sample Rate Reduction

| Aspect | Detail |
|---|---|
| Intervention | Reduce DSP sample rate from 32 kHz to 16 kHz. Halves the number of samples to compute per second. |
| Audio impact | Slightly lower perceived quality on high frequencies (cymbals, hi-hat). For most SNES music the difference is minimal on a handheld speaker. |
| Expected gain | +5–8% on audio processing (~2–3% total after dual-core) |
| Effort | 30 minutes |

Phase 4.2 Summary

| Intervention | Days | Gain % | Cumulative FPS |
|---|---|---|---|
| Dual-Core SPC700 | 2–3 | +35–45% | 42 → 50–52 |
| Memory Layout SRAM | 1 | +15–20% | 52 → 54–56 |
| Overclock 260 MHz | 0.01 | +8% | 56 → 57–58 |
| Sample Rate 16 kHz | 0.05 | +2–3% | 57–58 |
| TOTAL Phase 4.2 | ~4 | cumulative | 42 → 56–58 FPS |

Phase 4.3 — The Last Mile: PPU & Display

Goal: Go from ~57 to 60 FPS stable by optimizing PPU rendering and the display pipeline. More complex optimizations but necessary for the final 5%. ~6 days.

4.3.1 — PPU Fast-Path Rendering

| Aspect | Detail |
|---|---|
| Intervention | Create optimized paths for common PPU cases: Mode 1 (used by 70%+ of games), no clipping windows, no mosaic, no complex color math. |
| Detail | The Snes9x 2005 PPU handles ALL cases (Mode 0–7, windows, mosaic, color math, hi-res, interlace, offset-per-tile) in a single generic code path with many branches. The fast-path eliminates checks for features not used in the current scanline. |
| Expected gain | +5–8% on total frame time |
| Effort | 3–4 days (requires deep PPU understanding) |

4.3.2 — Tile Cache in SRAM

| Aspect | Detail |
|---|---|
| Intervention | Cache decoded tiles in internal SRAM. The PPU decodes the same tiles hundreds of times per frame (repeated background tiles). With dirty-flag tracking, only re-decode when VRAM changes. |
| Expected gain | +3–5% |
| Effort | 1 day |

4.3.3 — DMA Display Push

| Aspect | Detail |
|---|---|
| Intervention | Use the ESP32-S3 DMA to transfer the framebuffer to the display (8-bit 8080 parallel) without engaging the CPU. Double-buffering: while DMA sends frame N, the CPU renders frame N+1. |
| Expected gain | +2–3% |
| Effort | 1 day |

4.3.4 — Adaptive Frameskip (Safety Net)

| Aspect | Detail |
|---|---|
| Intervention | If the frame budget (16.67 ms) is exceeded, skip rendering the next frame (but still execute game logic). With frameskip 1, rendering drops to 30 FPS while gameplay still runs at 60. |
| Strategy | Auto-adaptive: measure previous frame time. If over 16.67 ms, skip render. If under 15 ms, never skip. In the 15–16.67 ms zone, skip 1 frame in every 4. Perceived result: a steady 45–60 FPS. |
| Expected gain | Safety net: keeps gameplay at full speed even when rendering reaches only ~55 FPS |
| Effort | Half day |
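
The skip decision described in the strategy row fits in a single function. The sketch below is one possible reading of it; the function name, the microsecond units, and the use of a running frame counter for the "1 in every 4" rule are assumptions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define FRAME_BUDGET_US  16667u   /* 1/60 s */
#define NO_SKIP_BELOW_US 15000u
static_assert(NO_SKIP_BELOW_US < FRAME_BUDGET_US, "thresholds must be ordered");

/* Decide whether to skip rendering the coming frame, based on how long the
 * previous frame took. Game logic is expected to run either way. */
bool should_skip_render(uint32_t prev_frame_us, uint32_t frame_index)
{
    if (prev_frame_us > FRAME_BUDGET_US) {
        return true;                     /* over budget: skip */
    }
    if (prev_frame_us < NO_SKIP_BELOW_US) {
        return false;                    /* comfortable: never skip */
    }
    return (frame_index % 4u) == 0u;     /* borderline: skip 1 frame in 4 */
}
```

The emulator loop would call this once per frame with the measured duration of the previous frame and bypass only the PPU render and display push when it returns true.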

Complete FPS Progression

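| # | Intervention | FPS pre | FPS post | Delta FPS | Days cum. |
|---|---|---|---|---|---|
| F4.1 | BRR Decode ASM | 30 | 32–33 | +2–3 | 1 |
| F4.1 | Gaussian Interp ASM | 33 | 34–35 | +1–2 | 1.5 |
| F4.1 | Voice Mixing ASM | 35 | 38–40 | +3–5 | 3.5 |
| F4.1 | Echo FIR ASM | 40 | 41–42 | +1–2 | 4 |
| F4.2 | Dual-Core SPC700 | 42 | 50–52 | +8–10 | 7 |
| F4.2 | Memory Layout SRAM | 52 | 54–56 | +2–4 | 8 |
| F4.2 | Overclock 260 MHz | 56 | 57–58 | +1–2 | 8 |
| F4.2 | Sample Rate 16 kHz | 58 | 58 | +0–1 | 8 |
| F4.3 | PPU Fast-Path | 58 | 59–60 | +1–2 | 12 |
| F4.3 | Tile Cache SRAM | 60 | 60+ | headroom | 13 |
| F4.3 | DMA Display | 60 | 60+ | headroom | 14 |
| F4.3 | Adaptive Frameskip | - | 60 stable | safety net | 14.5 |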

Game Compatibility

| Game | Complexity | Expected FPS | Playable? |
|---|---|---|---|
| Super Mario World | Low | 60 | Yes |
| Zelda: A Link to the Past | Low | 58–60 | Yes |
| Chrono Trigger | Medium | 55–60 | Yes |
| Final Fantasy VI | Medium | 55–60 | Yes |
| Mega Man X | Medium | 55–58 | Yes |
| Super Metroid | Medium-High | 50–58 | Yes* |
| Donkey Kong Country | High | 45–55 | Partial |
| Street Fighter II Turbo | High | 45–55 | Partial |
| Star Fox (Super FX) | Extreme | 20–30 | No |
| Yoshi's Island (Super FX 2) | Extreme | 15–25 | No |

* With occasional adaptive frameskip in heavy scenes.

Games with special coprocessors

Games using special coprocessors (Super FX, Super FX 2, SA-1, DSP-1/2/3/4) would require an ESP32-P4 (400 MHz) or better to reach full speed. These chips add a significant computation overhead that cannot be optimized away on the ESP32-S3.

SNES on v2 (ESP32-P4)

The ESP32-P4 at 400MHz with 2.1x the CoreMark score would bring SNES to full-speed with full audio quality for virtually all standard games, and make Super FX titles partially playable.

Audio Profiles

The SPC700 audio DSP is the single biggest CPU bottleneck before Phase 4.1 assembly optimizations. Three selectable profiles trade audio quality for frame rate, toggled in-game via Menu button → Audio: Full / Fast / OFF.

Profile Comparison

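| Profile | Sample rate | Interpolation | Echo/Reverb | Channels | DSP time (pre-ASM) | DSP time (post-ASM) |
|---|---|---|---|---|---|---|
| Full | 32 kHz | Gaussian (4-tap) | Yes | Stereo | ~8.0 ms | ~5.0 ms |
| Fast | 16 kHz | Linear (2-tap) | No | Mono | ~2.5 ms | ~1.5 ms |
| OFF | - | - | - | - | 0 ms | 0 ms |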

After Phase 4.1 (ASM DSP) + Phase 4.2 (dual-core), audio runs on Core 1 in parallel. With all optimizations applied, Full audio profile at 60 FPS is the target — no quality compromise needed for standard games.

Recommended: Full audio after all optimizations

Unlike the pre-optimization estimates, the 3-phase plan targets 60 FPS with full 32kHz stereo audio for standard games (Super Mario World, Zelda, Chrono Trigger, FF6). Audio Fast/OFF remain available as fallback options for heavy scenes or complex games.


Phase 5 — v2 Hardware Audio Coprocessor

Goal: Add an ESP32-S3-MINI-1 module as a dedicated audio coprocessor on the v2 PCB. This completely offloads audio processing from the ESP32-S3 for all emulators — not just SNES. Both ESP32-S3 cores become 100% available for CPU + PPU + game logic. ~5 days (down from 14 with RP2040 — see Why ESP32-S3-MINI-1 instead of RP2040 for the rationale).

Why a Hardware Audio Coprocessor?

Even after Phase 4 optimizations, the ESP32-S3 spends one entire core on SNES audio (SPC700 + DSP). For simpler emulators (NES, GB, Genesis), audio still consumes I2S DMA time and interrupt cycles. A dedicated audio chip eliminates this entirely.

v1 Architecture (software only):
┌──────────────────────────────────────────────────────┐
│                      ESP32-S3                        │
│  Core 0: CPU + PPU + Display                         │
│  Core 1: SPC700 + DSP + I2S DMA  ← audio burden      │
│                    │                                 │
│                 I2S bus                              │
│                    │                                 │
│                 PAM8403 → Speaker                    │
└──────────────────────────────────────────────────────┘

v2 Architecture (ESP32-S3-MINI-1 audio hub):
┌──────────────────────┐   SPI 10MHz    ┌──────────────────────┐
│ ESP32-S3 (main)      │ ──────────────→│ ESP32-S3-MINI-1      │
│ Core 0: CPU + PPU    │   commands     │ Core 0: SPC700 DSP   │
│ Core 1: CPU + PPU    │  + PCM data    │ Core 1: I2S output   │
│ (both 100% free      │                │          │           │
│  for emulation)      │                │   I2S (hardware)     │
└──────────────────────┘                │          │           │
                                        │       PAM8403        │
                                        │          │           │
                                        │       Speaker        │
                                        └──────────────────────┘

Why ESP32-S3-MINI-1 instead of RP2040

The original Phase 5 design used an RP2040 (ARM Cortex-M0+, $0.70). After analysis, the ESP32-S3-MINI-1 is the better choice despite a higher module cost (~$3.25), because the development time savings and architectural simplification far outweigh the $2.28 BOM increase.

Development time comparison

| Step | Task | RP2040 | ESP32-S3-MINI-1 | Savings |
|---|---|---|---|---|
| 5.1 | Circuit design + PCB | 2 days | 1 day | Module integrates crystal + flash |
| 5.2 | SPI protocol | 2 days | 1 day | Same ESP-IDF SPI API on both sides |
| 5.3 | Passthrough firmware (PCM relay) | 1 day | 0.5 days | Copy audio.c I2S code, same API |
| 5.4 | SPC700 native firmware | 5 days | 0.5 days | Xtensa ASM from Phase 4.1 runs identically |
| 5.5 | ESP32 firmware integration | 2 days | 1 day | Single idf.py, one build system |
| 5.6 | Testing + latency tuning | 2 days | 1 day | Same idf.py monitor, same log format |
| Total | | 14 days | ~5 days | -9 days (64% reduction) |

The decisive factor is Step 5.4: with the RP2040, you must rewrite ~120 lines of Xtensa LX7 assembly (BRR decode, Gaussian interpolation, voice mixing, echo FIR from Phase 4.1) into ARM Thumb assembly — a complete cross-architecture port requiring testing, debugging, and re-optimization. With the ESP32-S3-MINI-1, you copy the .S files and compile. Done.

Technical advantages

| Aspect | RP2040 | ESP32-S3-MINI-1 | Winner |
|---|---|---|---|
| Architecture | ARM Cortex-M0+ | Xtensa LX7 (same as main chip) | MINI-1 |
| Clock speed | 133 MHz | 240 MHz (+80%) | MINI-1 |
| Internal SRAM | 264 KB | 512 KB (+94%) | MINI-1 |
| I2S | Via PIO (custom bitbang) | Native hardware I2S | MINI-1 |
| SPI slave | Hardware | Hardware (same ESP-IDF API) | Tie |
| Toolchain | Pico SDK (separate) | ESP-IDF (same as main) | MINI-1 |
| ASM compatibility | Zero (must rewrite all) | 100% (identical Xtensa LX7) | MINI-1 |
| External components | Crystal + flash + 4 caps | None (all integrated) | MINI-1 |
| WiFi/BT | No | Yes (future upgrade path) | MINI-1 |
| Unit cost | $0.70 + $0.29 external = $0.99 | $3.25 + $0.02 caps = $3.27 | RP2040 |
| Development time | 14 days | 5 days | MINI-1 |

Key architectural benefits

  1. One toolchain — No need to install Pico SDK, learn RP2040 PIO, or maintain two build systems in Docker. The entire project stays pure ESP-IDF.

  2. Unified debugging — Both chips flash and monitor via idf.py. Same serial log format, same profiling APIs, same Docker Compose target (just a different idf.py target for the coprocessor).

  3. Simpler BOM — Eliminates the 12 MHz crystal, W25Q16 flash chip, and 4 extra decoupling capacitors. The module has everything integrated.

  4. Faster CPU — 240 MHz vs 133 MHz with higher IPC (Xtensa LX7 is a more capable core than Cortex-M0+). The SPC700 emulation runs with massive headroom.

  5. More SRAM — 512 KB vs 264 KB. The SPC700 needs 64 KB RAM + DSP buffers + I2S ring buffer. On the MINI-1, there is 400+ KB free for audio mixing buffers and future features.

  6. Future upgrade path — The ESP32-S3-MINI-1 has WiFi and Bluetooth 5.0 LE built in. Future firmware could enable WiFi ROM downloads, Bluetooth wireless controllers, or OTA updates for the coprocessor — all with zero hardware changes.

ESP32-S3-MINI-1 Specifications

| Parameter | Value |
|---|---|
| Module | ESP32-S3-MINI-1-N8 (Espressif) |
| SoC | ESP32-S3 (Xtensa LX7 dual-core) |
| Cores | 2x Xtensa LX7 @ 240 MHz |
| Internal SRAM | 512 KB |
| Flash | 8 MB Quad SPI (integrated) |
| PSRAM | None (not needed for audio) |
| I2S | 2x hardware I2S (8/16/24/32-bit) |
| SPI | SPI2 + SPI3 (general-purpose, DMA-capable) |
| GPIOs | 39 (including 4 strapping) |
| Antenna | On-board PCB antenna |
| Operating voltage | 3.0–3.6V |
| Power | ~50 mA active (single core audio task) |
| Dimensions | 15.4 × 20.5 × 2.4 mm |
| LCSC Part # | C2913206 |
| Unit cost | ~$3.25 |

Why N8 (no PSRAM) instead of N4R2?

The audio coprocessor only needs internal SRAM. SPC700 RAM is 64 KB, DSP buffers ~32 KB, I2S ring buffer ~16 KB — total ~112 KB, well within the 512 KB internal SRAM. PSRAM would add latency to the audio path without benefit. The N8 variant also keeps all 39 GPIOs available (N4R2 loses GPIO26 to PSRAM).

Dual-Mode Firmware

The ESP32-S3-MINI-1 runs two firmware modes, selected by the main ESP32-S3 via an SPI command at emulator launch:

| Mode | Active for | MINI-1 Core 0 | MINI-1 Core 1 | Audio latency |
|---|---|---|---|---|
| Passthrough | NES, GB, GBC, SMS, GG, PCE, Genesis, Lynx | Receive PCM via SPI → ring buffer | Ring buffer → I2S DMA output | under 2 ms |
| SPC700 Native | SNES | Full SPC700 CPU + S-DSP emulation (Phase 4.1 ASM) | I2S DMA output from DSP buffer | under 5 ms |

Passthrough mode: The main ESP32-S3 computes audio samples as usual (e.g., NES APU, GB sound) and sends raw PCM over SPI. The MINI-1 relays them to I2S via DMA. This frees the main ESP32-S3 from I2S DMA interrupts and buffer management. The MINI-1 firmware reuses the same i2s_std driver code from Phase 1 (audio.c).

SPC700 Native mode: The main ESP32-S3 sends SPC700 I/O port writes (4 bytes) and timing sync packets over SPI. The MINI-1 runs the complete SPC700 CPU emulation + S-DSP natively. The Phase 4.1 Xtensa assembly functions (DecodeBlockAsm, GaussianInterpAsm, MixVoiceAsm, EchoFIRAsm) run identically — same opcodes, same register layout, same instruction timings. At 240 MHz with 512 KB SRAM, the MINI-1 has massive headroom for real-time SPC700 emulation.

SPI Communication Protocol

| Command | Direction | Payload | Rate |
| --- | --- | --- | --- |
| MODE_SET | Main → MINI-1 | 1 byte (0=Passthrough, 1=SPC700) | At emulator launch |
| PCM_DATA | Main → MINI-1 | 256–512 bytes PCM (16-bit stereo) | 32 kHz / buffer size |
| SPC_PORT_WRITE | Main → MINI-1 | 4 bytes (ports 0-3) | Per CPU write (~1000/frame) |
| SPC_SYNC | Main → MINI-1 | 4 bytes (timestamp) | Every 2 ms |
| SPC_UPLOAD | Main → MINI-1 | Variable (SPC700 program) | At game load |
| STATUS | MINI-1 → Main | 4 bytes (ports 0-3 readback) | On request |
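The table lists commands and payload sizes but not the wire format. One possible framing, shown purely as an assumption, is a 3-byte header (opcode plus 16-bit little-endian length) followed by the payload. The opcode numbers and the function name below are placeholders, not values from the project.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Opcode values are NOT from the source; they are placeholders. */
typedef enum {
    CMD_MODE_SET       = 0x01,
    CMD_PCM_DATA       = 0x02,
    CMD_SPC_PORT_WRITE = 0x03,
    CMD_SPC_SYNC       = 0x04,
    CMD_SPC_UPLOAD     = 0x05,
    CMD_STATUS         = 0x06
} coproc_cmd_t;

/* Assumed header: 1 opcode byte + 16-bit little-endian payload length. */
#define PKT_HEADER_BYTES 3u

/* Writes header + payload into out.
 * Returns the total packet size, or 0 if it does not fit in out_cap. */
static size_t pkt_encode(coproc_cmd_t cmd, const void *payload, uint16_t len,
                         uint8_t *out, size_t out_cap)
{
    assert(out != NULL && (payload != NULL || len == 0));
    size_t total = PKT_HEADER_BYTES + (size_t)len;
    if (total > out_cap)
        return 0;
    out[0] = (uint8_t)cmd;
    out[1] = (uint8_t)(len & 0xFFu);   /* length, low byte */
    out[2] = (uint8_t)(len >> 8);      /* length, high byte */
    if (len > 0)
        memcpy(out + PKT_HEADER_BYTES, payload, len);
    return total;
}
```

An explicit length field lets the slave size its DMA receive for variable payloads such as SPC_UPLOAD; a fixed-size frame would be simpler if only the small commands mattered.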

SPI bus: 10 MHz clock, Mode 0, 8-bit frames. The main ESP32-S3 uses the ESP-IDF spi_master driver and the MINI-1 uses spi_slave, both with DMA. The 10 MHz clock is higher than the 4 MHz originally planned for the RP2040, since both chips support ESP-IDF SPI DMA natively. At 10 MHz, a 512-byte PCM buffer transfers in ~0.4 ms.
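The transfer-time figure and the "32 kHz / buffer size" packet rate can be checked with a few lines of integer arithmetic. The helper names are illustrative, and the calculation counts payload bits only, ignoring chip-select setup, any packet header and driver overhead.

```c
#include <assert.h>
#include <stdint.h>

/* Microseconds needed to clock `bytes` over SPI at `clock_hz`
 * (rounded down; payload bits only). */
static uint32_t spi_transfer_us(uint32_t bytes, uint32_t clock_hz)
{
    assert(clock_hz > 0);
    return (uint32_t)(((uint64_t)bytes * 8u * 1000000u) / clock_hz);
}

/* PCM_DATA packets per second for a given payload size.
 * bytes_per_frame is 4 for 16-bit stereo. */
static uint32_t pcm_packets_per_second(uint32_t sample_rate_hz,
                                       uint32_t payload_bytes,
                                       uint32_t bytes_per_frame)
{
    assert(bytes_per_frame > 0 && payload_bytes >= bytes_per_frame);
    return sample_rate_hz / (payload_bytes / bytes_per_frame);
}

/* Share of bus time spent on PCM payload, in tenths of a percent. */
static uint32_t pcm_bus_load_permille(uint32_t sample_rate_hz,
                                      uint32_t payload_bytes,
                                      uint32_t bytes_per_frame,
                                      uint32_t clock_hz)
{
    uint64_t busy_us_per_s =
        (uint64_t)spi_transfer_us(payload_bytes, clock_hz) *
        pcm_packets_per_second(sample_rate_hz, payload_bytes, bytes_per_frame);
    return (uint32_t)(busy_us_per_s / 1000u);
}
```

For 512-byte buffers of 16-bit stereo at 32 kHz this gives 409 µs per transfer, 250 packets per second and roughly 10% bus utilization, which leaves most of the bus free for SPC_PORT_WRITE and SPC_SYNC traffic.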

BOM Impact (v1 → v2)

| Component | Qty | Unit cost | Total | Notes |
| --- | --- | --- | --- | --- |
| ESP32-S3-MINI-1-N8 | 1 | $3.25 | $3.25 | Audio coprocessor (flash + crystal integrated) |
| 100nF caps (decoupling) | 2 | $0.01 | $0.02 | Power filtering |
| 3.3V LDO (shared) | | | $0.00 | Uses existing AMS1117 |
| Total v2 addition | | | $3.27 | |

v2 total BOM delta: ~$3.27. The ESP32-S3-MINI-1 runs at 3.3V from the existing AMS1117 regulator, which is budgeted at 800 mA. The MINI-1 adds ~50 mA for single-core audio tasks, well within that budget.

Comparison with RP2040 BOM: The RP2040 approach cost $0.99 in parts (chip + flash + crystal + caps) but required 14 days of development. The MINI-1 costs $2.28 more per unit but saves 9 days. On a 5-unit JLCPCB order, that is $11.40 total — a trivial cost for 64% less development time. The simpler PCB layout (3 components vs 7) also reduces routing complexity.

GPIO / SPI Wiring

In v2, the main ESP32-S3's I2S pins (GPIO 15, 16, 17) are freed since audio output moves to the coprocessor. These pins are repurposed for the SPI link to the MINI-1:

ESP32-S3 Main (SPI Master)         ESP32-S3-MINI-1 (SPI Slave)
────────────────────────── ─────────────────────────────
GPIO 15 (SPI_CLK) ───────→ GPIO 12 (SPI2_CLK)
GPIO 16 (SPI_MOSI) ───────→ GPIO 11 (SPI2_MOSI)
GPIO 17 (SPI_MISO) ←─────── GPIO 13 (SPI2_MISO)
GPIO 20 (SPI_CS) ───────→ GPIO 10 (SPI2_CS)

ESP32-S3-MINI-1 (I2S hardware) Audio
───────────────────────────── ─────
GPIO 15 (I2S_BCLK) ───────→ PAM8403 BCLK
GPIO 16 (I2S_LRCLK) ───────→ PAM8403 LRCLK
GPIO 17 (I2S_DOUT) ───────→ PAM8403 DIN

Notes:

  • The main ESP32-S3's I2S pins (GPIO 15–17) become SPI pins in v2 — clean reuse, no wasted GPIOs.
  • GPIO 20 (USB_D+ in v1) serves as SPI chip select in v2. Native USB is no longer available in v2 (debug via SPI or UART instead).
  • The MINI-1 uses its own GPIO 15–17 for I2S output to the PAM8403 — the same pin numbers as v1, making the audio output path identical.
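The wiring above can be captured as a small pin table with a sanity check that no GPIO is assigned twice on the same chip. The GPIO numbers are the ones from the diagram; the struct, array and function names are illustrative only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    const char *signal;
    int         gpio;
} pin_t;

/* Main ESP32-S3: SPI master side of the link. */
static const pin_t MAIN_SPI_PINS[] = {
    {"SPI_CLK", 15}, {"SPI_MOSI", 16}, {"SPI_MISO", 17}, {"SPI_CS", 20},
};

/* ESP32-S3-MINI-1: SPI slave pins plus I2S output pins. */
static const pin_t MINI1_PINS[] = {
    {"SPI2_CLK", 12}, {"SPI2_MOSI", 11}, {"SPI2_MISO", 13}, {"SPI2_CS", 10},
    {"I2S_BCLK", 15}, {"I2S_LRCLK", 16}, {"I2S_DOUT", 17},
};

#define PIN_COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Returns true if no GPIO number appears more than once in the table. */
static bool pins_unique(const pin_t *pins, size_t n)
{
    assert(pins != NULL);
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (pins[i].gpio == pins[j].gpio)
                return false;
    return true;
}
```

GPIO 15–17 appear in both tables on purpose: they are SPI pins on the main chip and I2S pins on the MINI-1, so uniqueness is checked per chip.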

Performance: v1 vs v2

| Metric | v1 (software) | v2 (ESP32-S3-MINI-1) | Improvement |
| --- | --- | --- | --- |
| ESP32 cores for emulation | 1.0–1.5 (Core 1 shared with audio) | 2.0 (both cores 100%) | +33–100% |
| SNES audio CPU cost | ~5 ms/frame (ASM, Core 1) | 0 ms (offloaded) | -100% |
| NES/GB audio CPU cost | ~0.5 ms/frame + I2S IRQ | 0 ms (offloaded) | -100% |
| Audio latency | 2–5 ms (DMA buffer) | 2–5 ms (SPI + DMA) | Same |
| Audio quality | 16 kHz (compromise for FPS) | 32 kHz stereo (no compromise) | 2x sample rate |
| Coprocessor clock | | 240 MHz (80% faster than RP2040) | N/A |
| Power consumption | ~180 mA (both cores loaded) | ~230 mA (+50 mA MINI-1) | +28% |
| BOM cost | $33 | $36.27 | +$3.27 |
| Development time | | 5 days (vs 14 for RP2040) | -64% |

SNES FPS Impact (v2)

With the ESP32-S3-MINI-1 handling all audio, the main ESP32-S3 frame budget changes drastically:

v2 Frame time budget: 16.67 ms (for 60 fps)

┌────────────────────────────────────────────────┐
│ 65C816 CPU emulation ~4.5 ms 27% │
│ PPU rendering (2 BG layers) ~5.0 ms 30% │
│ SPC700 audio DSP 0.0 ms 0% │ ← offloaded to MINI-1
│ Display transfer ~1.5 ms 9% │
├────────────────────────────────────────────────┤
│ TOTAL ~11.0 ms 66% │ ← 34% headroom!
└────────────────────────────────────────────────┘

At only 66% of the frame budget before any Phase 4 software optimizations, v2 hardware has room to run standard SNES games at 60 FPS out of the box. Phase 4 optimizations (PPU fast-path, tile cache, overclock) then serve as extra headroom for complex games.
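The budget figures can be reproduced with a short calculation. The per-stage times are the estimates from the diagram above; the struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    const char *stage;
    double      ms;     /* estimated time per frame */
} frame_stage_t;

/* v2 estimates from the frame budget diagram. */
static const frame_stage_t V2_SNES_FRAME[] = {
    {"65C816 CPU emulation",        4.5},
    {"PPU rendering (2 BG layers)", 5.0},
    {"SPC700 audio DSP",            0.0},   /* offloaded to MINI-1 */
    {"Display transfer",            1.5},
};
#define V2_SNES_STAGE_COUNT (sizeof V2_SNES_FRAME / sizeof V2_SNES_FRAME[0])

/* Sum of all stage times in milliseconds. */
static double frame_total_ms(const frame_stage_t *stages, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stages[i].ms;
    return total;
}

/* Frame time as a rounded percentage of the budget at target_fps. */
static int frame_load_percent(double total_ms, double target_fps)
{
    assert(target_fps > 0.0);
    double budget_ms = 1000.0 / target_fps;   /* 16.67 ms at 60 fps */
    return (int)(100.0 * total_ms / budget_ms + 0.5);
}
```

The stages sum to 11.0 ms, which is 66% of the 16.67 ms budget at 60 fps and leaves about 5.7 ms per frame spare.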

v2 Game Compatibility (All Systems)

| System | Example games | v1 FPS | v2 FPS | Notes |
| --- | --- | --- | --- | --- |
| NES | Super Mario Bros, Zelda | 60 | 60 | Already full speed; v2 frees CPU headroom |
| Game Boy | Tetris, Pokemon | 60 | 60 | Already full speed |
| GBC | Pokemon Crystal | 60 | 60 | Already full speed |
| SMS | Sonic the Hedgehog | 60 | 60 | Already full speed |
| Game Gear | Sonic Triple Trouble | 60 | 60 | Already full speed |
| PCE | Bonk's Adventure | 60 | 60 | Already full speed |
| Lynx | California Games | 60 | 60 | Already full speed |
| Genesis | Sonic, Streets of Rage | 50–60 | 58–60 | +8–10 FPS from freed Core 1 |
| SNES (standard) | Mario World, Zelda ALttP | 30 | 55–60 | Audio offloaded, no Phase 4 needed |
| SNES (complex) | Chrono Trigger, FF6 | 25–30 | 50–58 | Phase 4 PPU optimizations for 60 |
| SNES (Super FX) | Star Fox, Yoshi's Island | 15–25 | 25–40 | Coprocessor still too heavy for 60 |

v2 makes Phase 4 optional for most SNES games

With the ESP32-S3-MINI-1 handling all audio natively (running the same Xtensa LX7 assembly from Phase 4.1), the biggest SNES bottleneck (48% of frame time) is eliminated at the hardware level. Standard SNES games reach 55–60 FPS without any assembly or architectural optimization on the main chip. Phase 4 becomes a bonus for pushing complex titles to a stable 60.

v2 Implementation Roadmap (5 days)

| Step | Task | Days | Details |
| --- | --- | --- | --- |
| 5.1 | Circuit design + PCB | 1 | Add ESP32-S3-MINI-1-N8 footprint to KiCad. Only 2 decoupling caps needed (no crystal, no flash). Route 4 SPI traces + 3 I2S traces to PAM8403. Simpler than RP2040 (3 components vs 7). |
| 5.2 | SPI communication protocol | 1 | Use spi_master on main ESP32 and spi_slave on MINI-1, both from ESP-IDF with the same DMA engine. Protocol: MODE_SET + PCM_DATA + SPC_PORT_WRITE. Can start from ESP-IDF SPI slave example. |
| 5.3 | Passthrough firmware | 0.5 | Copy audio.c from Phase 1 to the coprocessor project. Feed i2s_channel_write() from the SPI-received buffer instead of the local buffer. Same i2s_std driver, same config, same sample format. |
| 5.4 | SPC700 native firmware | 0.5 | Copy Phase 4.1 assembly files (.S) + SPC700 C emulation code to coprocessor project. Compile with idf.py set-target esp32s3 && idf.py build. The Xtensa assembly runs identically: same opcodes (MULL, MIN, MAX, LOOP), same register layout, same instruction timing. No porting needed. |
| 5.5 | Main ESP32 integration | 1 | Replace I2S audio output in Retro-Go with SPI transmit to coprocessor. Add MODE_SET command at emulator launch. The emulator code doesn't change; only the audio output path switches from local I2S to SPI. |
| 5.6 | Testing + latency tuning | 1 | Same idf.py monitor for both chips. Same serial log format. Same profiling APIs (esp_timer_get_time()). Can test both chips simultaneously with two USB cables. |

Reference Projects

| Project | What it does | Useful for |
| --- | --- | --- |
| ducalex/retro-go | Multi-system emulator for ESP32 | Base framework (our fork) |
| fcipaq/snes9x_esp32 | Optimized SNES on ESP32-P4/S3 | IRAM optimizations, ~45fps on S3 |
| ohdarling/retro-go | Retro-Go fork for ESP32-S3 | S3-specific patches |
| esp-box-emu | Emulators on ESP32-S3-BOX | LVGL UI reference |
| atanisoft/esp_lcd_ili9488 | ILI9488 ESP-IDF driver | Display driver reference |
| libretro/snes9x2010 | Lightweight snes9x fork | SNES core source |

Memory Map

┌─────────────────────────────────────────────────┐
│ Internal SRAM (512 KB) │
│ ├─ FreeRTOS stacks ~32 KB │
│ ├─ DMA buffers (display) ~40 KB │
│ ├─ I2S audio DMA ~8 KB │
│ ├─ Emulator hot buffers ~150 KB (SNES) │
│ ├─ Input / misc ~10 KB │
│ └─ Free ~272 KB │
├─────────────────────────────────────────────────┤
│ Octal PSRAM (8 MB) │
│ ├─ ROM image up to 6 MB │
│ ├─ Emulator state / VRAM ~512 KB │
│ ├─ Frame buffer (x2) ~300 KB │
│ ├─ Save states ~256 KB │
│ └─ Free ~1 MB │
├─────────────────────────────────────────────────┤
│ Flash (16 MB) │
│ ├─ Firmware ~2-4 MB │
│ ├─ NVS (settings) ~64 KB │
│ ├─ OTA partition ~4 MB (optional) │
│ └─ Free / SPIFFS ~8 MB │
└─────────────────────────────────────────────────┘

Build & Flash

# Clone fork
git clone https://github.com/pjcau/retro-go.git
cd retro-go

# Build for ESP32 Emu Turbo
python3 rg_tool.py --target=esp32-emu-turbo build

# Flash via USB-C (GPIO0/SELECT = download mode at boot)
python3 rg_tool.py --target=esp32-emu-turbo flash

# Copy ROMs to SD card
# /roms/nes/ — .nes files
# /roms/snes/ — .smc/.sfc files
# /roms/gb/ — .gb files
# /roms/gbc/ — .gbc files
# /roms/sms/ — .sms files
# /roms/gg/ — .gg files
# /roms/pce/ — .pce files
# /roms/gen/ — .bin/.md files
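The folder layout above amounts to a mapping from file extension to ROM directory. A sketch of that lookup follows; the table contents come from the list above, while the function name and the case-sensitive matching are assumptions (the real ROM browser in Retro-Go may work differently).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *ext;   /* file extension, including the dot */
    const char *dir;   /* SD card directory for that system */
} rom_route_t;

/* Extension-to-folder pairs from the SD card layout above. */
static const rom_route_t ROM_ROUTES[] = {
    {".nes", "/roms/nes"},
    {".smc", "/roms/snes"}, {".sfc", "/roms/snes"},
    {".gb",  "/roms/gb"},
    {".gbc", "/roms/gbc"},
    {".sms", "/roms/sms"},
    {".gg",  "/roms/gg"},
    {".pce", "/roms/pce"},
    {".bin", "/roms/gen"},  {".md",  "/roms/gen"},
};

/* Returns the ROM folder for a filename, or NULL if the extension is
 * unknown. Matching is case-sensitive in this sketch. */
static const char *rom_dir_for_file(const char *filename)
{
    assert(filename != NULL);
    const char *ext = strrchr(filename, '.');
    if (ext == NULL)
        return NULL;
    for (size_t i = 0; i < sizeof ROM_ROUTES / sizeof ROM_ROUTES[0]; i++)
        if (strcmp(ext, ROM_ROUTES[i].ext) == 0)
            return ROM_ROUTES[i].dir;
    return NULL;
}
```

A helper like this could sort a flat folder of ROMs onto the card before first boot.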