git clone https://github.com/grantsimmons/GameBuddy-Verilog.git
I began this emulation project as a successor to my semi-complete C++ Software GameBoy emulator, GameBuddy. The goal of the GameBuddy GameBoy emulator is to provide its user with an experience identical to that of an original GameBoy. This GameBuddy-Verilog emulator aims to implement the GameBoy's architecture in hardware, while verifying its functionality using the verification mode of my existing software emulator to generate test vectors. When the progress of this project surpasses the capabilities of my software emulator, I will likely turn to another, more robust emulator to achieve my verification goals. Architecturally, I will be aiming to stay true to the SM83's architecture through the careful study of publicly available documentation and with the help of the work done by leaders in the effort to reverse engineer the GameBoy's technical details, such as Joonas Javanainen, aka Gekkio.
This is also my first large project in digital design, and as such I wanted to teach myself everything from RTL and Microarchitecture to Layout to Verification. This design will be riddled with bad practice and newbie mistakes, but I want the goal of this project to be to create something from the ground up, and as such all of the tools I use for the design and the design's verification will also be developed from scratch. I will be referring to instruction descriptions and publicly available documentation on the hardware such as the information provided by PanDocs, but I am going to do my best to avoid using pre-designed IP and guides in an effort to confront every aspect of designing a complex microprocessor system. So far by doing this I have found the fun in attempting to discern the tiny details, like how to optimally group instructions for algorithmic decoding and how to reuse hardware in a clever manner to reduce LUT usage, but I have also found challenges in design trade-offs and timing conflicts that I would not have encountered by using a pre-built core. The end goal is not necessarily to create a cycle-exact emulator, but rather to challenge myself to constrain my design to the known characteristics of the GameBoy while finding ways to fill in the blanks of our current understanding of the system.
This project began as a personal project in late 2019 when my curiosity for microprocessor architecture blossomed. To provide motivation and accountability, I made the reconstruction of the SM83 CPU the final project for my Digital Systems Design course at Stevens Institute of Technology. Since I will be working alone on this project, my only goal is to have a functioning design which can run the GameBoy's boot ROM with 100% register and memory accuracy by the end of the semester. At this point, pixel processing and cycle accuracy will not be a consideration; however, a basic implementation of the Pixel Processing Unit (PPU) will be necessary to progress to the end of the boot ROM.
By the end of the Spring 2020 semester I was able to achieve 78% instruction accuracy on the SM83 CPU core. Most of the remaining failures stem from the design's lack of a stack pointer, which is required for the Jump, Call, Return, and Push/Pop instructions. This project is currently on hold because of finals, but will pick back up after they are over.
GameBoy System Block Diagram
The GameBoy's SM83 Microprocessor High-level block diagram, with color indicating the current progress of the project. Orange indicates sections which are currently under development.
Basic Technical Specifications of the SM83:
The Instruction Set Architecture of the SM83 core is very similar to the Zilog Z80 and Intel 8080 instruction sets in main functionality; however, these two cores are far from binary compatible.
Some of the main differences:
Memory Map indicating the destination of memory accesses
Note: Almost all of the timing information here is a direct result of Gekkio's publications in his Complete Technical Reference and my conversations with Gekkio. Some of the information is a result of my own timing analysis to supplement the information provided by Gekkio. For the most accurate GameBoy timings, consult Gekkio's publications. The information here is strictly my interpretation of his data and how it can be modified to accommodate an FPGA implementation.
Overview:
When talking about the GameBoy's timing cycles, I will be using the 1-based timing convention. I.e., instead of the first machine cycle of instruction execution being designated "M0," it will instead be referred to as "M1" in an effort to maintain consistency with Gekkio's documents. However, throughout this document I do refer to the fetch portion of the instruction as "M0."
The SM83 processor operates in ~1MHz Machine cycles, or "M-Cycles," from which multiple clocks of different phase are derived to drive the operations of the processor, resembling a pseudo ~4MHz "T-Cycle" clock.
Execute/Fetch Overlap:
The SM83 processor is minimally pipelined. In most cases, fetching the next instruction from memory occurs during the current instruction, so long as the current instruction does not access memory. In the following diagram, you can see that the "Memory Data Out" signal precedes the "Instruction Register" signal by half of an M-Cycle. This is where the pipelining occurs. During T3 (t_cycle = 2'b10), the next instruction's opcode is placed on Memory Data Out, and it is propagated to the Internal Data Bus on T4 (t_cycle = 2'b11). The CPU samples its instruction from the Internal Data Bus on T1 of the next M-Cycle (denoted by the yellow markers), where the decode unit begins execution and prepares to fetch the next instruction.
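The overlap described above can be sketched as a small behavioral model in Python. This is only an illustration of the three hand-off points (T3, T4, and the following T1), not the RTL itself. The function name `run_m_cycles`, the variable names, and the point at which the program counter increments are assumptions made for the sketch.

```python
def run_m_cycles(memory, n_cycles):
    """Model the fetch hand-off over n_cycles M-Cycles.

    t_cycle counts 0..3, matching T1..T4 (2'b00..2'b11).
    Returns the Instruction Register value seen at the end of each M-Cycle.
    """
    pc = 0                 # assumed to increment when the opcode is read
    mem_data_out = None    # "Memory Data Out"
    data_bus = None        # "Internal Data Bus"
    ir = None              # "Instruction Register"
    ir_trace = []

    for _ in range(n_cycles):
        for t_cycle in range(4):
            if t_cycle == 0 and data_bus is not None:
                # T1: the CPU samples the opcode fetched in the previous M-Cycle
                ir = data_bus
            elif t_cycle == 2:
                # T3: memory places the next opcode on Memory Data Out
                mem_data_out = memory[pc]
                pc += 1
            elif t_cycle == 3:
                # T4: the opcode propagates to the Internal Data Bus
                data_bus = mem_data_out
        ir_trace.append(ir)
    return ir_trace


# Three opcodes in memory: the IR lags the fetch by one M-Cycle.
print(run_m_cycles([0x00, 0x3C, 0x04], 3))
```

In this model the Instruction Register is empty during the first M-Cycle and then always holds the opcode that was fetched during the previous one, which is the one-stage overlap the paragraph describes.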
Decode:
The instruction decode logic in the GameBoy system is dynamic logic. This means that the logic must be pre-charged to quickly deliver the control signals during the next T-Cycle. The logic is pre-charged during the first half of T1 and evaluated starting at the falling edge of T1. From here, the respective control and data signals are generated.
"The instruction decoding is done by a two-stage PLA (programmable logic array). The first stage has 26 inputs and 107 outputs, so it's basically a 26x107 ROM (don't know whether it's NAND or NOR). 8 out of those 26 inputs are IR (instruction register) bits, and the second stage has no other inputs than the 107 outputs of the first stage." -Gekkio, EmuDev Discord Server
Since I don't have access to dynamic logic on the FPGA and timing is likely not an issue if I am running at 1-4MHz, my emulator takes a different approach. On T1, the instruction is sampled from the data bus and allowed to propagate through static decode logic. Rather than a ROM, this logic is an algorithmic decoder derived from my analysis of the instructions, which can be found in the aptly-named "data/scratch_scratch.txt" file in the GitHub repository. This is by no means more efficient than generating a decode ROM, but I found it an interesting reverse engineering problem. Instead of evaluating the decode phase on the falling edge of T1, the decoded signals are sampled and propagated beyond the decode stage at T2. Ideally this will still allow enough time for the data and address signals to settle on their respective buses for sampling on T3/T4 while also avoiding the need to trigger events on the negative edge of a clock in an FPGA. Negative-edge triggering is an issue because, instead of using a clock's negative edge, FPGAs typically generate a second clock which is 180 degrees out of phase. This can lead to difficult timing analysis and unexpected behavior.
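To give a flavor of what "algorithmic decoding" means, here is a minimal Python sketch that decodes two regular blocks of the opcode map by bit fields instead of a lookup ROM. It is not the grouping from data/scratch_scratch.txt. It relies only on the widely documented layout of opcodes 0x40-0xBF, where bits 7:6 select the block, bits 5:3 select the destination register or ALU operation, and bits 2:0 select the source register. The function name `decode` and the tuple format are assumptions for illustration.

```python
# Register field encoding shared by the LD and ALU blocks.
REGS = ["B", "C", "D", "E", "H", "L", "(HL)", "A"]
# ALU operation encoding for opcodes 0x80-0xBF (bits 5:3).
ALU_OPS = ["ADD", "ADC", "SUB", "SBC", "AND", "XOR", "OR", "CP"]


def decode(opcode):
    """Decode an 8-bit opcode into a (mnemonic, operands...) tuple."""
    group = (opcode >> 6) & 0b11   # bits 7:6 select the block
    mid = (opcode >> 3) & 0b111    # bits 5:3: destination or ALU op
    low = opcode & 0b111           # bits 2:0: source register

    if opcode == 0x76:
        # The slot that would be LD (HL),(HL) is HALT.
        return ("HALT",)
    if group == 0b01:
        return ("LD", REGS[mid], REGS[low])
    if group == 0b10:
        return (ALU_OPS[mid], "A", REGS[low])
    # Everything else is outside the scope of this sketch.
    return ("UNDECODED", opcode)


print(decode(0x78))  # LD A,B
print(decode(0xAF))  # XOR A
```

The appeal of this style is that 128 opcodes collapse into two field extractions and one special case, which maps naturally onto a handful of multiplexers rather than a wide ROM.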
Execute:
The SM83 core is interesting in that, in reality, the core is not fed by a single 4 MHz clock, but rather by several clocks of varying phases and duty cycles, according to Gekkio. The specifics of these clocks are still being reverse engineered since it is still unknown whether some of the registers are flip-flops or latch arrays. In practice, the CPU only "sees" 4 of the 8 possible clock transitions, some of which are falling edges. This means that every M-Cycle has a maximum of 4 steps when sequential logic is used. The core is not a sequential state machine that is aware that register write-backs must occur after ALU operations. Rather, it appears to have a dedicated "write-back" clock that happens to occur after the clock instructing the ALU to begin operation. Considering this, I implemented a T-Cycle timer which I rely on to generate these signals. After the decode stage, the control signals are available by the rising edge of T2. The ALU is then instructed to begin operation on the rising edge of T3. On the rising edge of T4, the ALU result is available and is written back to its respective register. Because these edges are fixed to the timer, ALU operations can only begin on T3, and register write-back can only occur on T4.
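The fixed T2/T3/T4 schedule can be illustrated with a short Python model of one M-Cycle. This is a sketch under stated assumptions rather than a description of the RTL: the `ALU` table, the `run_m_cycle` function, and the event labels are hypothetical, and flag updates are left out to keep the example short.

```python
# Hypothetical ALU table; flags are omitted to keep the sketch short.
ALU = {
    "ADD": lambda a, b: (a + b) & 0xFF,
    "SUB": lambda a, b: (a - b) & 0xFF,
    "XOR": lambda a, b: a ^ b,
}


def run_m_cycle(regs, decoded):
    """Step through T1..T4 once.

    decoded is an (op, dst, src) tuple, assumed to come from the T1 decode.
    Returns a list of (t_cycle, event) pairs showing when each step fired.
    """
    controls = None
    alu_result = None
    events = []
    for t in (1, 2, 3, 4):
        if t == 2:
            # Control signals become available by the rising edge of T2.
            controls = decoded
            events.append((t, "controls"))
        elif t == 3:
            # The ALU may only start on T3.
            op, dst, src = controls
            alu_result = ALU[op](regs[dst], regs[src])
            events.append((t, "alu"))
        elif t == 4:
            # Register write-back may only happen on T4.
            regs[controls[1]] = alu_result
            events.append((t, "writeback"))
    return events


regs = {"A": 0x0F, "B": 0x01}
events = run_m_cycle(regs, ("ADD", "A", "B"))
print(events, hex(regs["A"]))
```

Note that nothing in the model "knows" that write-back depends on the ALU. The ordering falls out purely from which T-Cycle each step is tied to, which mirrors the dedicated write-back clock described above.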
Memory Access:
More research is necessary on my end to determine the actual buffering mechanism that the GameBoy uses to buffer its memory accesses. For the sake of using synchronous RAM in an FPGA, I've come to the following conclusions based on the idea that all memory actions occur on T4's rising edge. However, it is entirely possible that RAM accesses on the GameBoy itself occur asynchronously. For example, the memory may simply put the data of whatever address is in the address buffer onto the external data bus immediately and rely on the external buffer for synchronous sampling to the internal data bus.
To be an accurate emulator, an emulator must replicate ALL of the hardware's architectural features, including its flaws. The GameBoy system has a few notorious hardware bugs as a result of various design oversights, both architectural and physical.
To verify the functional correctness of the core and the system, the RTL requires verification. I accomplished this by creating a "Verification" mode in the software GameBuddy emulator. This mode suppresses video output and simply generates strings of bits representing the instruction executed as well as the state of the processor at the end of the M-Cycle. The testbench then loads the instructions into the design's RAM and begins execution. The testbench uses the remainder of each test vector to compare the state of the RTL to the expected result from the software emulator. If there is a mismatch, the instruction, previous state, and current state of the system are reported and an error is logged in rpts/run.rpt. The next step for my verification flow is to automate this entire process.
To generate test programs for the software emulator, the script scripts/asm_to_bit.py generates a seeded, random sequence of instructions from the hardware emulator's currently supported instruction set for random stimulus generation. The script can also convert arbitrary instruction sequences into binary format to test specific cases. The test vectors are generated, and the output is copied to the stimulus file for the hardware emulator and run. Each test vector then either passes or fails. In the case of a failure or register mismatch, register values are dumped and an XOR difference is calculated between the actual value and the expected value to assist in debugging.
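As a rough illustration of the XOR difference, the following Python sketch compares an expected register state against an actual one and reports which bits disagree. The dictionary format and the `diff_registers` name are assumptions for the example and do not reflect the testbench's actual test vector layout.

```python
def diff_registers(expected, actual):
    """Return {register: expected XOR actual} for every mismatching register.

    A set bit in the result marks a bit position where the two values differ.
    """
    mismatches = {}
    for name, exp in expected.items():
        act = actual[name]
        if exp != act:
            mismatches[name] = exp ^ act
    return mismatches


# Golden model vs. hardware state after one instruction (made-up values).
expected = {"A": 0x10, "F": 0xB0}
actual = {"A": 0x10, "F": 0x90}

for reg, xor in diff_registers(expected, actual).items():
    print(f"{reg}: expected {expected[reg]:08b}, "
          f"got {actual[reg]:08b}, diff {xor:08b}")
```

Here the difference is 0x20, which is bit 5 of the flag register (the half-carry flag), so the mismatch points directly at the half-carry logic rather than at the arithmetic result.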
An example of test bench output, indicating a flag register mismatch between the hardware description and the golden model
Current challenges I am facing stem not from the technical complexity of the processor, but from the details of emulating parts of the processor in an FPGA. For instance, bidirectional buses are complicated to implement in an FPGA, as tri-state buffers are not easily accessible within the FPGA fabric itself. This will likely lead to an architectural overhaul in the future if I cannot find an effective way to create bidirectional data buses. Since the address bus is unidirectional, the issues can be solved by simply multiplexing the address lanes. However, doing this on the data bus would still require tri-state buffers when data is being received from memory. I am working on a solution which involves two unidirectional data buses with buffers near the memory interface, which would allow the FPGA's built-in tri-state elements to control only data flow between the buffers and memory. However, this approach also complicates the internal architecture of the processor itself.
My resolution to this problem, by the suggestion of Gekkio, is to use the FPGA's internal block RAMs to accomplish memory access. This removes the need for convergence of the data in and data out buses onto a bidirectional data bus, as the block RAMs have dedicated data in and data out ports.
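The reason dedicated ports sidestep the tri-state problem can be seen in a small behavioral model. The Python sketch below is an assumption-laden illustration, not a description of any particular vendor's primitive: the `BlockRam` class and its `clock` method are hypothetical, and the read-before-write behavior on a simultaneous write is just one of the modes real block RAMs typically offer.

```python
class BlockRam:
    """Behavioral model of a synchronous RAM with separate data ports."""

    def __init__(self, size):
        self.mem = [0] * size
        self.data_out = 0  # registered read port, updated on each clock

    def clock(self, addr, data_in=0, write_enable=False):
        """One rising clock edge.

        data_in and data_out are separate signals, so nothing ever has to
        drive and sample the same wires. Read-before-write is assumed here.
        """
        self.data_out = self.mem[addr]
        if write_enable:
            self.mem[addr] = data_in


ram = BlockRam(16)
ram.clock(5, data_in=0xAB, write_enable=True)  # write 0xAB to address 5
first_read = ram.data_out                      # still the old contents
ram.clock(5)                                   # plain read on the next edge
print(hex(first_read), hex(ram.data_out))
```

Because reads are registered, data requested on one clock edge is only visible after it, which is worth keeping in mind when lining memory accesses up with the T-Cycle schedule.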
I also recently faced the issue that my software emulator may have a higher density of bugs than my hardware emulator. While I can still use the software emulator to debug the hardware emulator, I believe I have demonstrated proof of concept and it may be time to move on to a tried-and-true emulator with thorough debugging capabilities. Unfortunately, of the open source C++ emulators, few are very accurate, and many of the available programs may not be good candidates for a golden model.
To resolve this issue and continue development on my software emulator, I am instead ensuring that the emulator matches the output of the closed-source BGB emulator, which is known to be one of the best and most robust emulators available. After I have verified a scenario on the software emulator, I replicate the test in hardware and ensure that its results match the test vectors of the software emulator. I will undoubtedly be changing this methodology when I find an easily-modifiable emulator to generate accurate test vectors from the beginning.
Gekkio's Complete Technical Reference
GBZ80 Instruction Descriptions
Personal Conversations with Gekkio and others, documented in GitHub Repository
Z80 MAS: