Jul 202011
 

Last time I wrote about how to organize a basic Verilog module that implements a combinational circuit in order to improve clarity for readers and synthesis tools alike. Implementing a sequential circuit correctly is a lot more challenging because it requires an internal combinational circuit. For my discussion I will largely focus on Altera’s tools, because they are what I use on a day-to-day basis, but I would expect the same principles to apply to code targeting Xilinx or another vendor.

The traditional approach to designing a sequential circuit begins by creating a state diagram. In a classroom setting, students would then go on to pick a state encoding, create transition tables, and then optimize the combinational logic for implementation. Luckily, as an engineer I don’t have to do things the academic way, and I can instead rely on the synthesis tools to optimize the combinational circuit for me. With a good set of tools, it is actually possible for the synthesis step to choose a better state encoding (to minimize logic usage, propagation delay, or register usage, as necessary) than one that I will choose arbitrarily, which means that I can focus primarily on making sure the transition and output logic is correct. Frequently, I don’t even create a state diagram on paper, instead setting up the template that I normally use and then writing in the state transitions directly while defining the state machine’s behavior.

With that as the goal, a specific approach to writing state machines is necessary to ensure that it’s clear to the synthesis tools and to other readers what is going on. There are two options: one is to use a graphical state machine builder, which I would obviously prefer not to do, as Verilog is usually faster to write and easier to maintain, because clicking is not an efficient way to transfer input to the computer (at least, not as efficient as the keyboard, in this case). The other option is to use language features to alert the synthesis tools that you are creating a state machine. In Quartus II, this can be done using the parameter data type to create state variable templates, or, alternatively, I prefer using the `define keyword to define states as constants. Both will be picked up by the synthesis tools and allow for state remapping if desired. Although Altera recommends using parameter, I prefer the use of `define because it does not expose the state variables to the hierarchy as modifiable in the way a module parameter does. Note however that the Altera tools compile all the files in a shared namespace, so all `define‘d symbols must have unique names. This is easy to do with a consistent naming scheme, personally I use STATE_<MOD>_<STATE NAME>, with MOD being up to 5 letters of abbreviation for the module being implemented.

One important behavior to note here is that Altera has different options for state machine synthesis, some of which increase the number of bits used for state encoding (such as one-hot encoding). Therefore, it may not always be a good thing for the synthesis tools to infer a state machine—if explicit structural storage is instantiated (such as lpm_ff) then the state machine will not be optimized by the compiler. The state machine can also be specifically set to use “user-defined” encoding by placing a synthesis directive on the same line as the state variable’s declaration. All of the options for state machine encoding are specified in this document describing Quartus II Synthesis options.

(* syn_encoding = "user" *) reg [1:0] state;

Structurally, it is important to separate storage logic from state calculation logic, for a couple of reasons. First, the prevailing belief is that blocking assignments (with the = operator) should be used for combinational logic only, while non-blocking assignments (with the <= operator) should be used for sequential logic only. Assignment types should never be mixed within a single behavioral block because unintended or unpredictable simulation and synthesis behavior can result from differences in interpretation. Therefore, use non-blocking assignment in an edge-sensitive behavioral block to implement the storage logic, and use blocking assignment in a combinational behavioral block to implement next state and output.

Second, it makes the module easier to maintain if the its organization reflects the hardware structures that it is implementing, and it is more likely for the synthesis tools to produce the most optimal implementation. In particular, when dealing with enable, synchronous or asynchronous preset or clear signals, and other features normally associated with the flip flop elements themselves (as opposed to a part of the combinational logic before the flip flop), on a modern FPGA these features can either be implemented as logic inside the LUTs or they can be implemented using the dedicated enable, preset, and clear lines of the storage element within the LE. In order to make it clear to the tools that an enable is being used, the conditional use of the enable should be in the storage block, not the combinational block. For example, consider the following two implementations of a simple vending machine that accepts two coin types (1 and 2, where coin 2 is worth two of coin 1) and counts up to a total of three of coin 1. Then it issues a vend signal and returns to the idle state, with no change calculations, for simplicity.

/** Implementation A */
assign vend = (state == `STATE_EX_S3);
	
always @* begin
	nextState = state;
	
	case(state)
		`STATE_EX_IDLE : begin
			if (coin1 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S1;
			end
			else if (coin2 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S2;
			end
		end
		`STATE_EX_S1 : begin
			if (coin1 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S3;
			end
			else if (coin2 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S3;
			end
		end
		`STATE_EX_S2 : begin
			if (coin1 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S3;
			end
			else if (coin2 == 1′b1) begin
				if (enable == 1′b1)
					nextState = `STATE_EX_S3;
			end
		end
		`STATE_EX_S3 :  begin
			if (enable == 1′b1)
				nextState = `STATE_EX_IDLE;
		end
	endcase
end

always @(posedge clk or negedge reset_L) begin
	if (reset_L == 1′b0) begin
		state <= `STATE_EX_IDLE;
	end
	else begin
		state <= nextState;
	end
end
/** Implementation B */
assign vend = (state == `STATE_EX_S3);
	
always @* begin
	nextState = state;
	
	case(state)
		`STATE_EX_IDLE : begin
			if (coin1 == 1'b1) begin
				nextState = `STATE_EX_S1;
			end
			else if (coin2 == 1'b1) begin
				nextState = `STATE_EX_S2;
			end
		end
		`STATE_EX_S1 : begin
			if (coin1 == 1'b1) begin
				nextState = `STATE_EX_S3;
			end
			else if (coin2 == 1'b1) begin
				nextState = `STATE_EX_S3;
			end
		end
		`STATE_EX_S2 : begin
			if (coin1 == 1'b1) begin
				nextState = `STATE_EX_S3;
			end
			else if (coin2 == 1'b1) begin
				nextState = `STATE_EX_S3;
			end
		end
		`STATE_EX_S3 :  begin
			nextState = `STATE_EX_IDLE;
		end
	endcase
end

always @(posedge clk or negedge reset_L) begin
	if (reset_L == 1'b0) begin
		state <= `STATE_EX_IDLE;
	end
	else if (enable == 1'b1) begin
		state <= nextState;
	end
end

When I synthesize both of these files, I would normally expect the same result as they are logically equivalent. However, because the default assignment of nextState is not gated by the enable signal while all of the state transitions are, the synthesis tools are unable to push the enable condition through up to the register, and implementation A requires two LEs more than implementation B. Unfortunately, due to the state machine inference there are further problems: the enable signal is inferred as a part of the state transitions (technically, it is, although not many think about it that way because it affects all of state transitions in the same way). If I simply replace the behaviorally generated flip flop with an explicit structural D-Flip Flop using the Altera primitives, the same relationship exists (A requires two more LEs than implementation B), but both implementations require two fewer LEs overall:

/** Implementation A */
lpm_ff #(.lpm_width(2)) storage(.clock(clk), .aclr(!reset_L), .data(nextState), .q(state));

/** Implementation B */
lpm_ff #(.lpm_width(2)) storage(.clock(clk), .aclr(!reset_L), .data(nextState), .q(state), .enable(enable));

A quick look at the RTL Viewer shows that the discrepancy between implementations A and B is due to the enable signal not being pushed into the flip flops and instead being implemented in logic.

RTL For Implementation A

RTL For Implementation A

vs.

RTL For Implementation B

RTL For Implementation B

We can see that the same six multiplexers (more or less) are used to implement the state transition logic, but for implementation A, the enable signal is being handled purely in logic before the state transition logic, whereas in implementation B it is correctly passed off to the enable signal on the D Flip Flop that is already available in the LE's register, with no additional logic cost. As a result, be sure to separate signals that are (or can be) associated with the storage elements themselves and handle them near the storage element, either inside the appropriate sequential behavioral block or with the appropriate input on the structural storage element, rather than in with the rest of the next state logic. This is both cleaner to read and understand and it will generally produce better synthesis results.

Third, and last for now, on the list of file organization techniques is the use of the case statement (or its relatives, casex and casez) inside the combinational block for next-state logic, unless building a counter or other type of state machine where the state transitions are systematic, in which case simple statements like nextState = state + 1'b1; are sufficient. If a counter is being used inside another state machine to determine timing, then I believe it is sufficient to increment the counter as a default option and reset or modify it as necessary in the actual states that have logic governing their progression. While examining the earlier vending machine example, when the resource usage was significantly higher than expected, I found that if the vend signal was assigned inside the combinational block and the state-handling logic was an if-else if tree, three unnecessary multiplexers were generated to drive the output, although the state machine logic was still calculated correctly. If, instead, the case structure were used or the continuous assignment that I finally settled for (in keeping with my guidelines for combinational logic—it was simple enough to be stated as continuous assignment, therefore it should be handled that way), then the multiplexers did not appear. I believe this is due to a difference in representation for if-else trees and case statements internal to the synthesis tools, and because the case structure is arguably more reasonable in one's source code because behavior must be defined for every state, more or less, the tools handle it much better.

A potential fourth rule that some people would advocate is physically separating the logic for calculating the next state and the logic for calculating the machine outputs into two behavioral blocks, but I make no recommendation here. For Moore-type machines, this can be reasonable because it provides visual clues as to what each block corresponds to on the typical mental image that most people have for a Moore-type machine (input -> combinational logic -> flip flops -> output logic -> outputs), but personally I find that this just leads to more coding (or copy/pasting the case structure) with no significant benefit. I also use primarily Mealy-type machines, where the separation argument doesn't make much sense, because the outputs will also depend on the input. As a result the two blocks would end up mirroring each other very closely in a structural sense, and, without the correct set of tool options, the synthesizer could generate two copies of the control/condition signals for the relevant multiplexers, wasting resources. I haven't had the opportunity (read: desire) to explore possible differences in synthesis for the two approaches, so I cannot provide definitive advice on which style is better.

Finally, with these three (four?) rules in mind, I present a general Verilog state machine template based on the guidelines described above. I also realized that it was more useful to provide this as a copy/paste-able fill-in-the-blanks template, rather than one with dummy logic already present. Plus, the earlier vending machine implementations also exemplify this, just without the description of what goes where.

// Verilog state machine template
module state_proto(clk, reset_L, ...);
	input clk;
	input reset_L;
	// ... other inputs/outputs

	// State declarations
	`define STATE_EX_S0		2'b00
	`define STATE_EX_S1		2'b01

	// State variables
	reg [1:0] state;
	reg [1:0] nextState;
	// ... more state variables for embedded state machines

	// ... Internal signals/wires

	// Continuous time assignment for simple outputs and internal signals

	// Next state and output logic
	always @* begin
		nextState = state;
		// ... default values for outputs and other state machines

		case (state)
			`STATE_EX_S0 : // ... transition logic/output logic
			`STATE_EX_S1 : // ... transition logic/output logic
			// ... more states
		endcase
	end

	// Storage, combine all state machine storage, with asynchronous clear
	always @(posedge clk or negedge reset_L) begin
		if (reset_L == 1'b0) begin
			state <= `STATE_EX_S0;
			// ... other state defaults
		end
		else begin
			state <= nextState;
			// ... flop in other storage unconditionally
		end
	end

	// Structural storage, instantiate things like shift registers, FIFOs, etc. Control signals
	// are generated above
endmodule
Jul 172011
 

Hardware implementation is fundamentally different from software implementation. A lot of beginners are confused by the fact that hardware description languages such as Verilog resemble C programs or Python programs. Modules have the same basic ingredients: variables, conditional statements, and assignment. Verilog even allows some type of control flow, using blocking assignment inside always blocks, which can frequently cause people to use familiar yet complicated techniques that come up in software design.

Don’t do it.

Last year I was a TA for an introductory digital design course, and a lot of students struggled once we were writing Verilog modules to implement complex communicating state machines or building combinational processes that operate sequentially, like a single-cycle microprocessor. Most of them understood the concepts of digital design, but were thrown off by how Verilog works and how it is converted into actual hardware, especially with regards to distinguishing combinational and sequential sections of circuits and writing appropriate code for each.

In software, functions and classes can take almost any form and provide a wide variety of behaviors. Entire books exist on different types of design patterns that should be used to simplify the task of implementing a set of classes to provide a specific interface or implementation tool. In (synchronous) digital hardware, however, there are two fundamental types of circuits: combinational circuits and sequential circuits. This simplifies the design process into identifying the appropriate type of circuit and then following the standard pattern for implementing it. The implementation, however, is usually clouded by the three distinct implementation styles possible with Verilog: structural, data flow, and behavioral. I will essentially ignore structural because it is equivalent to designing a circuit with a schematic, but in text form, so it avoids most of the complications of dealing with Verilog.

On an abstract level, combinational circuits describe those whose outputs are a continuous time function of their inputs. Basic combinational circuits should be implemented using data flow style code, with continuous assignment statements using the assign keyword. This means that a simple 2:1 mux primitive could be implemented like so:

module mux21(a, b, s, c);
	input a;
	input b;
	input s;
	output c;

	assign  c = (s == 1'b1) ? b : a;
endmodule

More complicated combinational circuits should be implemented using a combination of the two. Continuous assignment statements for simple logic, and behavioral statements for expressions that are most easily described with nested ifs or case statements. For instance, a 4-to-2 priority encoder is easily implemented using a casex statement:

module priority_encoder(in,  out);
	input [3:0] in;
	output [1:0] out;
	reg [1:0] out;

	always @* begin
		out = 2'b00;

		casex (in)
			4'b1xxx : out = 2'b11;
			4'b01xx : out = 2'b10;
			4'b001x : out = 2'b01;
			4'b0001 : out = 2'b00;
		endcase
	end
endmodule

A common problem with behavioral combinational circuits, however, is the accidental creation of transparent latches, where the output retains its previous value for one or more of the input cases. For the above priority encoder, if line 7, where out is assigned the default value of 0, were not present, then the synthesis tools would infer an automatic latch, because if in is equal to 0000, then out is not driven with a new value and the specification states that it should retain its previous value. This is not only the incorrect behavior but also wastes resources inside the device as a latch must be used, as well as additional LUTs to implement the latch enable, plus the actual logic function being implemented.

I propose instead the use of a simple template for all combinational circuits in order to create a hardware-specific mental approach. Declare all ports, internal wires, and variables at the top. Organize the file vertically as one would normally organize a schematic from left-to-right on a piece of paper: each block uses the results of the block(s) above it. Avoid having continuous time assignment use results that are calculated later by a behavioral block—create another continuous time assignment block AFTER the behavioral structure, to reinforce the signal path visually. Behavioral blocks themselves should be organized as a set of default values followed by a set of conditional statements to overwrite the values for specific input cases. Most importantly, try to organize the code so that it mirrors the circuit you are hoping to create: for decisions provide both true and false assignment information (unless one is the default value) to resemble a 2:1 mux and avoid using two disjoint if statements to assign a value because it creates a situation where the semantics of the blocking assignment operator become very significant, which makes it difficult for others (or yourself a few months later) to understand the original intentions of the code. Such if statements also obscure the parallelism inherent in the evaluation, because, while all of the if statements can be evaluated in parallel, some kind of additional glue logic must be invisibly added in order to arbitrate the final assignment to the enable signal. For instance, do not write code like this:

always @* begin
	enable = 1'b0;

	if (counter > 3'b100)
		enable = 1'b1;

	if (sload == 1'b1)
		enable = 1'b1;

	if (force_hold == 1'b1)
		enable = 1'b0;
end

This code obscures the true purpose of the force_hold signal by relying on the sequential evaluation of the always block to ensure that it overrides the counter and synchronous load signals. Understanding a complex circuit by reading the code once and forming an appropriate mental image can be tricky, so it is much more reasonable to nest if statements. Because this is hardware and not software, there are no branches and no additional penalties for having a large chain of if statements—if the two circuits are equivalent, then the same hardware should be generated in either case, but one makes the behavior more obvious. Nesting the control flow also makes the final physical tree-like mux implementation more apparent in the source code, allowing you to estimate the necessary resources more easily. Therefore, I would re-write the above block like this:

always @*
	enable = 1'b0;

	if (force_hold == 1'b0) begin
		if (counter > 3'b100 || sload == 1'b1)
			enable = 1'b1;
	end
end

This immediately makes it clear that the value of force_hold overrules the counter value and any synchronous load signals, so it would drive a 2:1 mux on the output, with the 0-value input taking its input from an earlier mux and the 1-value input taking the constant value of 0. The earlier mux would then be an optimized implementation of the conditional choice, with, for example, the counter condition as the control signal, the 1-value input as the constant 1, and the 0-value input as the synchronous load condition, implementing an OR function with a mux. The placement of the two conditions could also be swapped, but the synthesis tools will usually choose the correct assignment in order to optimize for the delay of each incoming signal.

From these examples, I can create a prototypical combinational Verilog circuit

module proto(i0, i1, i2, o0, o1, o2);
	input i0, i1;
	input [5:0] i2;
	output o0, o1, o2;
	
	// Redefine outputs that will be driven inside an always block
	reg o1, o2;

	// Internal signals here
	wire p1;

	// Continuous time assignments come first for simple expressions like bus parity or
	// logic expressions with only a few gates
	assign p1 = ^i2;
	assign o0 = i0 | i1;

	// Then afterwards behavioral blocks specify more complex relationships
	always @* begin
		// First, default values for everything assigned in this always block
		o1 = 1'b0;
		o2 = 1'b0;

		// Then, either per output or combined if they share enough of the decision tree
		// have the actual logic for computing outputs
		if (i2 > 6'b110000) begin
			if (i0 == 1'b1)
				o1 = p1;
			if (i1 == 1'b1)
				o2 = p1;
		end
		else begin
			o1 = 1'b1;
			o2 = 1'b1;
		end		
	end
endmodule

Writing simple combinational circuits that reflect the type of circuit being created, without being verbose, helps with synthesis. Tools are better at optimizing large circuits than people are (just like compilers are better at optimizing large programs than people are), but they must first be able to understand the circuits. Sticking to simple constructs and suggestive organization makes it easier for the tools to produce the simplest circuit without accidentally creating additional logic because of unintended semantics. Some tools (such as Altera’s Quartus or Xilinx’s ISE) allow you to view the schematic that was produced from the Verilog that you compiled; I highly recommend doing this for complex designs, especially if timing becomes an issue, to determine if the synthesis tools are doing what you want, or if you need to massage the source code into a cleaner format.

Only once a decent style has been developed for approaching combinational circuits is it even possible to begin creating successful, complex sequential circuits, because every sequential circuit must contain at least one combinational circuit for calculating the next state and the circuit’s output. Next time: building proper edge-triggered sequential circuits using Verilog.