Home of TurboCNC, the best open-source CNC control around...
You are here:
Homepage -> Software -> MISC Sim Latest release of TurboCNC is v4.01 build 050312,
released March 13
Minimal Instruction Set Processor (F-4 MISC)
Articles HOME FAQ's
DAK Ordering Contact
Update: Bill Langdon, a Senior Research Fellow at Essex University in the UK, has recently published a paper in which he analyzes the F-4 MISC processor (PDF), along with a similar one of his invention which he calls the T7.
What is MISC?
Instruction set and specs
This article presents a design for a high performance minimal instruction set processor with code examples and a simulator written in VB6.
2. What is MISC?
MISC is an acronym for Minimal Instruction Set CPU, and it refers a particular philosophy of microprocessor design along a spectrum of internal complexity.
A MISC processor has a minimal number of instructions to carry out computing tasks. Where a typical CISC (complex instruction set CPU) like the Pentium has well over 200 instructions, and a modern RISC (reduced instruction set CPU) may have 30 to 70, a MISC will have far fewer. This team, for example, is using 20 instructions as a base for their MISC design. A good summary of RISC vs CISC philosophy is here.
The processor described in this article uses just 4 instructions. A small number of instructions means a simple, fast chip that is easy to produce and consumes less power than its complex cousins.
3. Instruction Set and specs
Although the specific implementation is not important, a 16 bit address and data word size is assumed here to make the decode stage as horizontal as possible and the fetch cycle perfectly symmetric. Each opcode is one bit of 16 possible in the instruction word. It would be possible to build this processor in any word size down to 4 bits, or up to 64 and beyond depending on the application. The only requirement is that the address size be equal to the data size and that the program and data space are shared (Von Neumann architecture).
The instruction set for the F-4 (fast 4), a proof-of-concept minimal instruction set CPU, is listed below. Each instruction has several addressing modes. Note that the mnemonics are borrowed from the venerable Motorola SY6502, used in the old 8-bit Nintendo, among other machines. There is only one "A" register or accumulator, and a program counter.
F-4 MISC 16 bit instruction set
instruction opcode operand operation clocks
ADDi imm 00 01 16 bit value imm+(A) --> A 3
ADDm addr 00 02 16 bit address (addr)+(A) --> A 4
ADDpc 00 04 null operand PC+(A) --> A 3
BVS addr 00 08 16 bit address (addr) --> PC if
LDAi imm 00 10 16 bit value imm --> A 3
LDAm addr 00 20 16 bit address (addr) --> A 3
LDApc 00 40 null operand PC --> A 3
STAm addr 00 80 16 bit address A --> (addr) 3
STApc PC 01 00 null operand A --> PC 3
The four instructions can be summarized as Load, Store, Add, and Branch if Overflow Set. Each memory operation is assumed to take one clock, and ALU operations take one clock also. ALU (Arithmetic and Logic Unit) is something of a misnomer here, since this chip can only add. For example, "ADD addr" takes four clocks, two to fetch the instruction and operand, one to read the memory, and one more to do the addition.
I estimate that about 1400 transistors would be needed to build this for a 16 bit implementation, which gives a 128k (216 words) space for programs and data. With so few transistors, a very conservative performance estimate would be on the order of 50 MIPS if the memory is onboard the chip.
4. Code examples
Four instructions isn't much. Here are some examples of how some instructions common to other processors might be implemented on the F-4.
A JMP (jump) instruction can be done by storing the destination to the program counter::
High level languages require a JSR or CALL (jump to subroutine) function at the machine level. A simple approach is to store the return address in a dedicated location inside the function space. Note that this example is for a non re-entrant function. In general this would be done with a stack operation.
STApc (executes a JMP)
... (execution resumes here)
SHL (arithmetic shift left) is a quick addition of a number to itself. The overflow flag is set if the bit shifted out is 1, clear if it's 0. This is the basic operation to build more complex functions with.
Here's how boolean NOT would be performed. This is a bit lengthy, so I'll use a 4-bit example:
4 bit NOT
(op) is the input
(res) is the output
skip LDAm (op)
skip1 LDAm (op)
skip2 LDAm (op)
That took 26 instructions in 4 bit code, and it would be 98 instructions for 16 bit code. It works by shifting each bit sequentially with pseudo-SHL and setting the appropriate bit in the output word accordingly. AND, EOR, XOR, XNOR, and SUB/SBC work similarly, but in a more complex way since two operands are involved with the concomitant branching layers.
It's a misuse of the MISC concept to try to directly duplicate each instruction for a more sophisticated processor however. The purpose in writing these code examples is to demonstrate that the F-4 processor is Turing complete, as it is essentially a canonical register machine. That is to say, any calculation or algorithm can be implemented given unlimited memory.
To model the F-4 in action, I've developed a simple VB program that reads the memory image to run and then simulates the execution.
The memory is capped at 64 kilowords in a 16 bit operating mode (word addresses 0 thru $FFFF hex). No I/O is modeled, and only one MISC is running in a single data space. This is for developing and testing code fragments. Here's a screenshot of the simulator (reduced for the web).
Download the simulator here. Source code and some sample code fragments are included.
It should be evident from the above that a single MISC processor is slower than the RISC or CISC equivalents in almost all but the most trivial cases. However, the extreme simplicity allows ultra-parallel implementations, similar to the approach taken by Cray Computing, which uses hundreds of parallel RISC processors in its machines.
Consider that a Pentium of recent vintage has on the order of 50 million transistors. If each F-4 requires 5,000 transistors (a conservative estimate to allow for the "glue logic" and caching necessary in such a highly parallel structure), then in principle 10,000 of them could fit on one die and execute simultaneously.
Such an array would be programmed in a manner not unlike user-configurable microcode processors like the Intel IXP-1200, which has 6 microengines with 2k of microstore each.
This type of configuration allows custom SIMD-type instructions to be invented and discarded on an ad-hoc basis as needed by a program. Here are some practical applications for the real world:
Large matrix solutions and spatial transformations
Neural nets (each F-4 owns some localized neural space)
Garbage collection in LISP-type languages
Text and virus pattern search
3D rendering and graphics/image rendering
Audio data processing
Interrupt and I/O service handling
Although it could easily be built on an FPGA chip as a demo, I plan to use the F-4 concept for experiments in genetic programming instead. Having a small number of instructions is advantageous and in a sense mimics the basic structure of the DNA molecule, which has four basic proteins.
© 2001-2008 DAK Engineering.
All rights reserved.
This page last updated on August 14, 2008 .