One of the most useful things that didn't come up enough in college was a very basic concept, central to almost any digital communications system: the numerically controlled oscillator (NCO), the digital counterpart to an analog oscillator. They are used in software defined radio in order to implement modulators/demodulators and they have a number of other applications in signal processing, such as arbitrary waveform synthesis and precise control for phased array radar or sonar systems. Noise performance in digital systems can be carefully controlled by adjusting the data type's precision, whereas in analog systems, even if the circuit is designed to produce a minimum of intrinsic noise, external sources can contribute an uncontrolled amount of noise that is much more challenging to manage properly. As digital systems increase in speed, analog circuits will be reduced to minimal front ends for a set of high speed ADCs and DACs.

Luckily, NCOs are easy to understand intuitively (but surprisingly difficult to explain conceptually), which is probably why they weren't covered in-depth in school, although they are usually not immediately obvious to someone who hasn't already seen them. A basic NCO consists of a lookup table containing waveform data (usually a sinusoid) for exactly one period and a counter for indexing into the table. The rate of change of the counter determines the frequency of the output wave, in normalized units, because the output wave still exists in the discrete time domain. The counter is generally referred to as a 'phase accumulator,' or simply an accumulator, because it stores the current value of the sine's phase, and the amount that it changes every cycle I normally refer to as the 'phase.' In this sense, one of the simplest explanations of how an NCO works is that it tracks the argument to sin(2\pi \hat{f}n) in a counter and uses a look up table to calculate the corresponding value of sin(2\pi \hat{f}n). The challenge, however, lies in the implementation.

NCO Block Diagram

Floating point hardware is expensive in terms of area and power. Software emulation of floating point is expensive in terms of time. And, of course, floating point numbers cannot be used to index into an array without an intermediate conversion process, which can consume a large number of cycles without dedicated hardware. As a result, most NCOs are implemented using fixed point arithmetic for the phase accumulator, even if the table stores floating point values for high end DSPs. Fixed point introduces the notion of "normalization" because the number of bits dedicated to integer and fractional values is fixed, limiting the numbers that can be represented. Ideally, the full range of the fixed point type is mapped to the full range of values to be represented by a multiplicative constant. In the case of NCOs, this is usually done by using a new set of units (as opposed to radians or degrees) to represent phase angle, based on the lookup table size.

Normally, the period of sin(x) is 2\pi. However, for a lookup table, the period is the length of the table, because the input is an integer index, and the table's range spans a single cycle of sin(x). Because the index must wrap to zero after passing the end of the table, it is convenient to choose a table size that is a power of 2 so that wrap around can be implemented with a single bitwise operation or exploit the overflow properties of the underlying hardware for free, rather than an if-statement, which generally requires more cycles or hardware to evaluate. Thus, for a B-bit index, the table contains 2^B entries, and the possible frequencies that can be generated are integer multiples of \frac{1}{2^B} (the minimum change in the accumulator's value is naturally one table entry).

There is, of course, a clear problem with this implementation when precise frequency control is necessary, such as in all of the applications I mentioned at the start. If I wanted to build a digital AM radio tuner, then my sampling frequency would theoretically need to be at least 3.3 MHz to cover the entire medium wave band, where most commercial AM stations exist (although in practice it would need to be much higher in order to improve performance). If I use a table with B=8, then my frequency resolution is 0.00390625 * 3.3 MHz = 12.89 kHz, which is insufficient to form a software demodulator because the intra-station spacing is only 10 kHz. However, because the table size grows exponentially with B, it is undesirable or impossible to increase B past a certain point, depending on the amount of memory available for the lookup table. Depending on the size of the data stored in the table, there are also noise floor issues that affect the utility of increasing B, but I will discuss the effects of word size and quantization error on NCO performance another time.

A better solution is to produce non-integer multiples of the fundamental frequency by changing the index step size dynamically. For instance, by advancing the phase accumulator by an alternating pattern of 1 and then 2 samples, the effective frequency of the output sinusoid is halfway between the frequency for 1 sample and the frequency for 2 samples, plus some noise. This makes use of a second, fractional counter that periodically increments the primary index counter. The easiest way to implement this is to simply concatenate the B-bit index with an F-bit fractional index to form a fixed point word, so that an overflow from the fractional index will automatically increment the real table index. Then, when performing table lookup, the combined fixed point word is quantized to an integer value by removing the fractional bits. More advanced NCOs can use these fractional bits to improve performance by rounding or interpolating between samples in the table. Generally, because the value of F does not affect the size of the table, but it does increase the possible frequency resolution, I find the minimum value for F to give the desired resolution and then round up to the next multiple of 8 for B+F (8, 16, 24, …). It is possible to implement odd size arithmetic (such as 27 bits), but almost always the code will require at least the use of primitives for working with the smallest supported word size, which means that there is no performance penalty for increasing F so that B+F changes from 27 to 32.

Fixed Point decomposition for an Integer Type

By adding the F-bit fractional index, the frequency resolution improves to integer multiples of \frac{1}{B+F}, with no change in storage requirements, allowing. The only challenge, then, is converting between a floating point normalized frequency in Hz and the corresponding fixed point representation in table units. This is normally only done once during initialization, so the penalty of emulating floating point can be ignored, as the code that runs inside a tight processing loop will only use fast fixed point hardware. Because normalized frequency (in Hz) is already constrained to the interval [0, 1), the conversion from Hz (f) to table units (p) is a simple ratio:

\frac{f}{p} = \frac{1}{2^B}
2^{B}f=p

If any fractional index bits are used, then they must be included before p is cast from a floating point value to an integer by multiplying the value by the normalization constant, 2^F, which is efficiently calculated with a left shift by F bits. The resulting value is then stored as an integer type; normally I use unsigned integers because I only need to work with positive frequency. All subsequent fixed point operations are done using the standard integer primitives with the understanding that the "true" value of the integer being manipulated is actually the stored value divided by 2^F. This becomes important if two fixed point values are multiplied together, because the result will implicitly be multiplied by the normalization constant twice and need to be renormalized before it can be used with other fixed point value. In order to use the fixed point phase accumulator to index into the table, the integer portion must be extracted first, which is done by dividing by the 2^F. Luckily, this can be computed efficiently with a right shift by F bits.

In conclusion, because an NCO requires between 10 and 20 lines of C to implement, I've created a sample NCO that uses an 8.24 fixed point format with a lookup table that has 256 entries, to better illustrate the concept. The output is an integer waveform with values from 1 to 255, representing a sine wave with amplitude 127 that is offset by 128, which would normally be used with an 8-bit unipolar voltage output DAC to produce an analog signal biased around a half-scale virtual ground. This code was tested with Visual Studio 2005, but it should be portable to any microcontroller that has correct support for 32-bit types and 32-bit floating point numbers.

uint8_t sintable32[256];

struct nco32 {
    uint32_t accumulator;
    uint32_t phase;
    uint8_t value;
};

void sintable32_init(void);
void nco_init32(struct nco32 *n, float freq);
void nco_set_freq32(struct nco32 *n, float freq);
void nco_step32(struct nco32 *n);

/**
 * Initialize the sine table using slow library functions. The sine table is
 * scaled to the full range of -127,127 to minimize the effects of
 * quantization. It is also offset by 128 in order to only contain positive
 * values and allow use with unsigned data types.
 */
void sintable32_init(void) {
    int i;
    for (i = 0; i < 256; ++i) {
        sintable32[i] = (uint8_t) ((127.*(sin(2*PI*i/256.))+128.) + 0.5);
    }
}

/**
 * Initialize the oscillator data structure and set the target frequency
 * Frequency must be positive (although I don't check this).
 */
void nco_init32(struct nco32 *n, float freq) {
    n->accumulator = 0;
    n->value = sintable32[0];
    nco_set_freq32(n, freq);
}

/**
 * Set the phase step parameter of the given NCO struct based on the
 * desired value, given as a float. This changes its frequency in a phase
 * continuous manner, but this function should not be used inside a
 * critical loop for performance reasons. Instead, a chirp should be
 * implemented by precomputing the correct change to the phase rate
 * in fixed point and adding it after every sample.
 */
void nco_set_freq32(struct nco32 *n, float freq) {
    // 256 table entries, 24 bits of fractional index; 2^24 = 16777216
    n->phase = (uint32_t) (freq * 256. * 16777216. + 0.5);
}

/**
 * Compute the next output value from the table and save it so that it
 * can be referenced multiple times. Also, advance the accumulator by
 * the phase step amount.
 */
void nco_step32(struct nco32 *n) {
    uint8_t index;

    // Convert from 8.24 fixed point to 8 bits of integer index
    // via a truncation (cheaper to implement but noisier than rounding)
    index = (n->accumulator >> 24) & 0xff;
    n->value = sintable32[index];
    n->accumulator += n->phase;
}

/**
 * Example program, for a console, not a microcontroller, produces
 * 200 samples and writes them to output.txt in comma-separated-value
 * format. They can then be read into matlab to compare with ideal
 * performance using floats for phase and an actual sine function.
 * First parameter is the desired normalized frequency, in Hz.
 */
int main(int argc, char **argv) {
    struct nco32 osc;
    float freq;
    int i;
    FILE *nco_output;

    freq = (float) atof(argv[1]);

    sintable32_init();
    nco_init32(&osc, freq);
    nco_output = fopen("output.txt", "w");

    for (i = 0; i < 200; ++i) {
        nco_step32(&osc);
        fprintf(nco_output, "%d,", osc.value);
    }

    fclose(nco_output);
    return 0;
}

There are obvious improvements, such as a linear interpolation when computing the NCO's output, but I will save those for my discussion of NCOs, resolution, quantization, and noise performance, because they are not necessary for a basic oscillator (in particular, for 8 bit samples like this, the 8.24 format is overkill, and the output has an SNR of approximately 48 dB, depending on the frequency chosen, which is limited predominantly by the fact that only 8 bits are used on the output, fundamentally limiting it to no more than 55 dB, with some approximations).

After setting up a basic web server, I wanted to secure access to privileged resources using SSL. Normally, if I were dealing with customers or others who need to establish some kind of trust with me from the outside world, I would have to purchase a properly signed and verified SSL certificate from VeriSign or another Certificate Authority. However, because my goal is to secure internal resources that must be exposed publicly due to the nature of the internet (webmail and phpMyAdmin), it is cheaper (read: free) to create and sign my own certificates. End users do not need to trust me, and because I can personally verify the origin of my CA certificate, I do not need one signed by an external authority. I can then use this to sign additional certificates in order to create an area of my server that requires a user to present a certificate signed by my personal CA certificate in order to proceed, allowing very tight access control over sensitive areas.

Using the 64-bit Basic Linux AMI, OpenSSL is already installed. The configuration file, openssl.cnf is located at /etc/pki/tls/openssl.cnf. Luckily, it has a pretty reasonable set of defaults, and in particular the directory tree is already set up, with appropriate permissions (0700) on the directory used for storing the CA private key. Because openssl.cnf points to the CA certificate and the private key directly, I chose to rename my keys from the default value and entered modified the values for certificate and private_key to reflect this. With the basic OpenSSL configuration complete, it was time to create a self-signed CA certificate, set up the directory structure, and configure apache to use mod_ssl.

Signing the CA certificate

Creating a certificate that you can use to sign other requests is very straightforward:

[root@thelonepole /]# openssl req -newkey rsa:2048 -keyout tlpCA.key -x509 -days 365 -out tlpCA.crt

This creates a new X.509 format certificate along with its private key. It is automatically self-signed (this form of the command combines the key generation step, certificate request step, and self-signing step into a single command). My certificate requires that a password be entered when it is used to sign a certificate request, but for a small private server with exactly one trusted user the password can be removed by adding the -nodes flag, for convenience. You may also want to change the number of days for which the certificate is valid so that you don't have to recreate it every year (for a CA certificate, you might as well set this to 10 or more years).

Once the certificate and key were ready, I moved them to the locations earlier specified in openssl.cnf for certificate and private_key. Then, I had to create the file used to track the next available serial number. For some reason this isn't created by default, possibly because of some obscure vulnerability when the next available serial number is known, but for our purposes, we are in a relatively low security situation, so starting at 1 is pretty reasonable:

[root@thelonepole /]# touch /etc/pki/CA/serial
[root@thelonepole /]# echo 1 > /etc/pki/CA/serial

If this step is omitted, openssl will complain when it comes time to sign certificate requests because it does not know which serial number to fill in.

Signing Other Certificates

At this point, with the CA certificate in place, it is easy to create and sign certificate requests. It is necessary to create a different certificate for each subdomain that you wish to be able to use SSL with, because the certificate will include the domain that it is valid for in its path. If this doesn't match the ServerName directive, it will generate a warning when starting apache, and users (even those who have added your CA certificate to their trusted store, i.e. you) will also see a warning in their browsers when using SSL. Creating and signing a certificate is done with two commands:

[root@thelonepole /]# openssl req -newkey rsa:2048 -nodes -keyout webmail.key -out webmail.csr
[root@thelonepole /]# openssl ca -in webmail.csr -out webmail.crt

Note that the certificates that are created for use with apache should be done WITHOUT a password, using the -nodes flag. If you don't do this, you'll be prompted for a password each time you start apache, which is a huge problem if your server goes down and/or is automatically restarted and you aren't available to enter the password. The certificates generated by the above command will also be valid for the period of time specified in your openssl.cnf. If you want them to last longer, be sure to specify -days XXX.

Copy these certificates to a known location, such as /etc/pki/tls and note their location so that it can be used when configuring apache. You will also want to create a certificate (in exactly the same way) for yourself, which will be used for client authentication for SSL-enabled subdomains as well. However, for a client certificate, there is an additional step: it must be converted from X.509 format to the PKCS #12 format that is often the only one understood by browsers. The single output file will combine both the signed certificate and the private key, and you will have it on any computer or device that you expect to use to access your secure areas, so BE SURE to use a strong export password when prompted during this step. The conversion can be done with one command, which takes in your private key and signed certificate and exports it as a client certificate in PKCS format:

[root@thelonepole /]# openssl pkcs12 -export -clcerts -out greg.p12 -in greg.crt -inkey greg.key

Then, transfer the PKCS file to your computer and install it in your web browser, usually with sftp. Under chrome, I had the option to install it specifically as a key for client authentication, but I am not sure if all browsers support the categorization in this manner. You will also need to install a copy of your CA certificate into the trusted root certificate store, or the browser will not accept the client certificate. Luckily you can simply download the X.509 format certificate from your server and tell your browser about it without converting it (which wouldn't make sense anyway because it'd require the private key…).

Configuring mod_ssl

Finally, there are several directives to add to the VirtualHost entries in order to configure them to use SSL. First, if you intend to only allow SSL to be used for a specific subdomain, like I do with my webmail, create an entry for normal HTTP that redirects all traffic to HTTPS:

<VirtualHost *:80>
    ServerName webmail.thelonepole.com
    Redirect permanent / https://webmail.thelonepole.com
</VirtualHost>

Next, a VirtualHost entry must be created that has the desired SSL settings, listening on port 443. Obviously, it must include SSLEngine On in order to enable SSL, but there are actually several extra directives that should be included in order to have client authentication work. Be sure to set SSLCACertificateFile to point to your local CA certificate and enable client authentication with SSLVerifyClient. In order to require that clients use certificates directly signed by your CA certificate, set SSLVerifyDepth to 1. I've included a complete VirtualHost entry as an example, based on the configuration I use here.

<VirtualHost *:443>
    ServerAdmin greg@thelonepole.com
    DocumentRoot /var/www/webmail
    ServerName webmail.thelonepole.com
    ErrorLog logs/webmail-error_log
    CustomLog logs/webmail-access_log combined
    SSLEngine On
    SSLCACertificateFile /path/to/CA/tlpCA.crt
    SSLCertificateFile /path/to/apache/ssl/webmail.crt
    SSLCertificateKeyFile /path/to/apache/ssl/webmail.key
    SSLVerifyClient require
    SSLVerifyDepth 1
    SetEnvIf User-Agent ".*MSIE.*" nokeepalive ssl-unclean-shutdown
    CustomLog logs/webmail-ssl_request_log "%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
</VirtualHost>

At this point, simply restart apache, and, if DNS was already set up, you should be prompted for a key when accessing your subdomain. If you have not added your key to the browser's store (or are interested in verifying that it requires a client-supplied key and use an unconfigured browser), you should see a message notifying you that an error with code "ssl_error_handshake_failure_alert" happened during SSL negotiation.

This has been updated after some feedback that I received from a more experienced FPGA designer explaining that asynchronous reset is an inappropriate default approach because it can cause partial resets, metastability, and partially correct computations as it moves across the device. Therefore, a synchronous reset should be expected, with a dedicated reset synchronizer (not shown or considered here)

Last time I wrote about how to organize a basic Verilog module that implements a combinational circuit in order to improve clarity for readers and synthesis tools alike. Implementing a sequential circuit correctly is a lot more challenging because it requires an internal combinational circuit. For my discussion I will largely focus on Altera's tools, because they are what I use on a day-to-day basis, but I would expect the same principles to apply to code targeting Xilinx or another vendor.

The traditional approach to designing a sequential circuit begins by creating a state diagram. In a classroom setting, students would then go on to pick a state encoding, create transition tables, and then optimize the combinational logic for implementation. Luckily, as an engineer I don't have to do things the academic way, and I can instead rely on the synthesis tools to optimize the combinational circuit for me. With a good set of tools, it is actually possible for the synthesis step to choose a better state encoding (to minimize logic usage, propagation delay, or register usage, as necessary) than one that I will choose arbitrarily, which means that I can focus primarily on making sure the transition and output logic is correct. Frequently, I don't even create a state diagram on paper, instead setting up the template that I normally use and then writing in the state transitions directly while defining the state machine's behavior.

With that as the goal, a specific approach to writing state machines is necessary to ensure that it's clear to the synthesis tools and to other readers what is going on. There are two options: one is to use a graphical state machine builder, which I would obviously prefer not to do, as Verilog is usually faster to write and easier to maintain, because clicking is not an efficient way to transfer input to the computer (at least, not as efficient as the keyboard, in this case). The other option is to use language features to alert the synthesis tools that you are creating a state machine. In Quartus II, this can be done using the parameter data type to create state variable templates, or, alternatively, I prefer using the `define keyword to define states as constants. Both will be picked up by the synthesis tools and allow for state remapping if desired. Although Altera recommends using parameter, I prefer the use of `define because it does not expose the state variables to the hierarchy as modifiable in the way a module parameter does. Note however that the Altera tools compile all the files in a shared namespace, so all `define'd symbols must have unique names. This is easy to do with a consistent naming scheme, personally I use STATE_<MOD>_<STATE NAME>, with MOD being up to 5 letters of abbreviation for the module being implemented.

One important behavior to note here is that Altera has different options for state machine synthesis, some of which increase the number of bits used for state encoding (such as one-hot encoding). Therefore, it may not always be a good thing for the synthesis tools to infer a state machine—if explicit structural storage is instantiated (such as lpm_ff) then the state machine will not be optimized by the compiler. The state machine can also be specifically set to use "user-defined" encoding by placing a synthesis directive on the same line as the state variable's declaration. All of the options for state machine encoding are specified in this document describing Quartus II Synthesis options.

(* syn_encoding = "user" *) reg [1:0] state;

Structurally, it is important to separate storage logic from state calculation logic, for a couple of reasons. First, the prevailing belief is that blocking assignments (with the = operator) should be used for combinational logic only, while non-blocking assignments (with the <= operator) should be used for sequential logic only. Assignment types should never be mixed within a single behavioral block because unintended or unpredictable simulation and synthesis behavior can result from differences in interpretation. Therefore, use non-blocking assignment in an edge-sensitive behavioral block to implement the storage logic, and use blocking assignment in a combinational behavioral block to implement next state and output.

Second, it makes the module easier to maintain if the its organization reflects the hardware structures that it is implementing, and it is more likely for the synthesis tools to produce the most optimal implementation. In particular, when dealing with enable, synchronous or asynchronous preset or clear signals, and other features normally associated with the flip flop elements themselves (as opposed to a part of the combinational logic before the flip flop), on a modern FPGA these features can either be implemented as logic inside the LUTs or they can be implemented using the dedicated enable, preset, and clear lines of the storage element within the LE. In order to make it clear to the tools that an enable is being used, the conditional use of the enable should be in the storage block, not the combinational block. For example, consider the following two implementations of a simple vending machine that accepts two coin types (1 and 2, where coin 2 is worth two of coin 1) and counts up to a total of three of coin 1. Then it issues a vend signal and returns to the idle state, with no change calculations, for simplicity.

/** Implementation A */
assign vend = (state == `STATE_EX_S3);

always @* begin
    nextState = state;

    case(state)
        `STATE_EX_IDLE : begin
            if (coin1 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S1;
            end
            else if (coin2 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S2;
            end
        end
        `STATE_EX_S1 : begin
            if (coin1 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S3;
            end
            else if (coin2 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S3;
            end
        end
        `STATE_EX_S2 : begin
            if (coin1 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S3;
            end
            else if (coin2 == 1'b1) begin
                if (enable == 1'b1)
                    nextState = `STATE_EX_S3;
            end
        end
        `STATE_EX_S3 :  begin
            if (enable == 1'b1)
                nextState = `STATE_EX_IDLE;
        end
    endcase
end

always @(posedge clk) begin
    if (reset_L == 1'b0) begin
        state <= `STATE_EX_IDLE;
    end
    else begin
        state <= nextState;
    end
end
/** Implementation B */
assign vend = (state == `STATE_EX_S3);

always @* begin
    nextState = state;

    case(state)
        `STATE_EX_IDLE : begin
            if (coin1 == 1'b1) begin
                nextState = `STATE_EX_S1;
            end
            else if (coin2 == 1'b1) begin
                nextState = `STATE_EX_S2;
            end
        end
        `STATE_EX_S1 : begin
            if (coin1 == 1'b1) begin
                nextState = `STATE_EX_S3;
            end
            else if (coin2 == 1'b1) begin
                nextState = `STATE_EX_S3;
            end
        end
        `STATE_EX_S2 : begin
            if (coin1 == 1'b1) begin
                nextState = `STATE_EX_S3;
            end
            else if (coin2 == 1'b1) begin
                nextState = `STATE_EX_S3;
            end
        end
        `STATE_EX_S3 :  begin
            nextState = `STATE_EX_IDLE;
        end
    endcase
end

always @(posedge clk) begin
    if (reset_L == 1'b0) begin
        state <= `STATE_EX_IDLE;
    end
    else if (enable == 1'b1) begin
        state <= nextState;
    end
end

When I synthesize both of these files, I would normally expect the same result as they are logically equivalent. However, because the default assignment of nextState is not gated by the enable signal while all of the state transitions are, the synthesis tools are unable to push the enable condition through up to the register, and implementation A requires two LEs more than implementation B. Unfortunately, due to the state machine inference there are further problems: the enable signal is inferred as a part of the state transitions (technically, it is, although not many think about it that way because it affects all of state transitions in the same way). If I simply replace the behaviorally generated flip flop with an explicit structural D-Flip Flop using the Altera primitives, the same relationship exists (A requires two more LEs than implementation B), but both implementations require two fewer LEs overall:

/** Implementation A */
lpm_ff #(.lpm_width(2)) storage(.clock(clk), .sclr(!reset_L), .data(nextState), .q(state));

/** Implementation B */
lpm_ff #(.lpm_width(2)) storage(.clock(clk), .sclr(!reset_L), .data(nextState), .q(state), .enable(enable));

A quick look at the RTL Viewer shows that the discrepancy between implementations A and B is due to the enable signal not being pushed into the flip flops and instead being implemented in logic.

RTL For Implementation A

vs.

RTL For Implementation B

We can see that the same six multiplexers (more or less) are used to implement the state transition logic, but for implementation A, the enable signal is being handled purely in logic before the state transition logic, whereas in implementation B it is correctly passed off to the enable signal on the D Flip Flop that is already available in the LE's register, with no additional logic cost. As a result, be sure to separate signals that are (or can be) associated with the storage elements themselves and handle them near the storage element, either inside the appropriate sequential behavioral block or with the appropriate input on the structural storage element, rather than in with the rest of the next state logic. This is both cleaner to read and understand and it will generally produce better synthesis results.

Third, and last for now, on the list of file organization techniques is the use of the case statement (or its relatives, casex and casez) inside the combinational block for next-state logic, unless building a counter or other type of state machine where the state transitions are systematic, in which case simple statements like nextState = state + 1'b1; are sufficient. If a counter is being used inside another state machine to determine timing, then I believe it is sufficient to increment the counter as a default option and reset or modify it as necessary in the actual states that have logic governing their progression. While examining the earlier vending machine example, when the resource usage was significantly higher than expected, I found that if the vend signal was assigned inside the combinational block and the state-handling logic was an if-else if tree, three unnecessary multiplexers were generated to drive the output, although the state machine logic was still calculated correctly. If, instead, the case structure were used or the continuous assignment that I finally settled for (in keeping with my guidelines for combinational logic—it was simple enough to be stated as continuous assignment, therefore it should be handled that way), then the multiplexers did not appear. I believe this is due to a difference in representation for if-else trees and case statements internal to the synthesis tools, and because the case structure is arguably more reasonable in one's source code because behavior must be defined for every state, more or less, the tools handle it much better.

A potential fourth rule that some people would advocate is physically separating the logic for calculating the next state and the logic for calculating the machine outputs into two behavioral blocks, but I make no recommendation here. For Moore-type machines, this can be reasonable because it provides visual clues as to what each block corresponds to on the typical mental image that most people have for a Moore-type machine (input -> combinational logic -> flip flops -> output logic -> outputs), but personally I find that this just leads to more coding (or copy/pasting the case structure) with no significant benefit. I also use primarily Mealy-type machines, where the separation argument doesn't make much sense, because the outputs will also depend on the input. As a result the two blocks would end up mirroring each other very closely in a structural sense, and, without the correct set of tool options, the synthesizer could generate two copies of the control/condition signals for the relevant multiplexers, wasting resources. I haven't had the opportunity (read: desire) to explore possible differences in synthesis for the two approaches, so I cannot provide definitive advice on which style is better.

Finally, with these three (four?) rules in mind, I present a general Verilog state machine template based on the guidelines described above. I also realized that it was more useful to provide this as a copy/paste-able fill-in-the-blanks template, rather than one with dummy logic already present. Plus, the earlier vending machine implementations also exemplify this, just without the description of what goes where.

// Verilog state machine template
module state_proto(clk, reset_L, ...);
    input clk;
    input reset_L;
    // ... other inputs/outputs

    // State declarations
    `define STATE_EX_S0     2'b00
    `define STATE_EX_S1     2'b01

    // State variables
    reg [1:0] state;
    reg [1:0] nextState;
    // ... more state variables for embedded state machines

    // ... Internal signals/wires

    // Continuous time assignment for simple outputs and internal signals

    // Next state and output logic
    always @* begin
        nextState = state;
        // ... default values for outputs and other state machines

        case (state)
            `STATE_EX_S0 : // ... transition logic/output logic
            `STATE_EX_S1 : // ... transition logic/output logic
            // ... more states
        endcase
    end

    // Storage, combine all state machine storage, with synchronous clear
    always @(posedge clk) begin
        if (reset_L == 1'b0) begin
            state <= `STATE_EX_S0;
            // ... other state defaults
        end
        else begin
            state <= nextState;
            // ... flop in other storage unconditionally
        end
    end

    // Structural storage, instantiate things like shift registers, FIFOs, etc. Control signals
    // are generated above
endmodule

Hardware implementation is fundamentally different from software implementation. A lot of beginners are confused by the fact that hardware description languages such as Verilog resemble C programs or Python programs. Modules have the same basic ingredients: variables, conditional statements, and assignment. Verilog even allows some type of control flow, using blocking assignment inside always blocks, which can frequently cause people to use familiar yet complicated techniques that come up in software design.

Don't do it.

Last year I was a TA for an introductory digital design course, and a lot of students struggled once we were writing Verilog modules to implement complex communicating state machines or building combinational processes that operate sequentially, like a single-cycle microprocessor. Most of them understood the concepts of digital design, but were thrown off by how Verilog works and how it is converted into actual hardware, especially with regards to distinguishing combinational and sequential sections of circuits and writing appropriate code for each.

In software, functions and classes can take almost any form and provide a wide variety of behaviors. Entire books exist on different types of design patterns that should be used to simplify the task of implementing a set of classes to provide a specific interface or implementation tool. In (synchronous) digital hardware, however, there are two fundamental types of circuits: combinational circuits and sequential circuits. This simplifies the design process into identifying the appropriate type of circuit and then following the standard pattern for implementing it. The implementation, however, is usually clouded by the three distinct implementation styles possible with Verilog: structural, data flow, and behavioral. I will essentially ignore structural because it is equivalent to designing a circuit with a schematic, but in text form, so it avoids most of the complications of dealing with Verilog.

On an abstract level, combinational circuits describe those whose outputs are a continuous time function of their inputs. Basic combinational circuits should be implemented using data flow style code, with continuous assignment statements using the assign keyword. This means that a simple 2:1 mux primitive could be implemented like so:

module mux21(a, b, s, c);
    input a;
    input b;
    input s;
    output c;

    assign  c = (s == 1'b1) ? b : a;
endmodule

More complicated combinational circuits should be implemented using a combination of the two. Continuous assignment statements for simple logic, and behavioral statements for expressions that are most easily described with nested ifs or case statements. For instance, a 4-to-2 priority encoder is easily implemented using a casex statement:

module priority_encoder(in,  out);
    input [3:0] in;
    output [1:0] out;
    reg [1:0] out;

    always @* begin
        out = 2'b00;

        casex (in)
            4'b1xxx : out = 2'b11;
            4'b01xx : out = 2'b10;
            4'b001x : out = 2'b01;
            4'b0001 : out = 2'b00;
        endcase
    end
endmodule

A common problem with behavioral combinational circuits, however, is the accidental creation of transparent latches, where the output retains its previous value for one or more of the input cases. For the above priority encoder, if line 7, where out is assigned the default value of 0, were not present, then the synthesis tools would infer an automatic latch, because if in is equal to 0000, then out is not driven with a new value and the specification states that it should retain its previous value. This is not only the incorrect behavior but also wastes resources inside the device as a latch must be used, as well as additional LUTs to implement the latch enable, plus the actual logic function being implemented.

I propose instead the use of a simple template for all combinational circuits in order to create a hardware-specific mental approach. Declare all ports, internal wires, and variables at the top. Organize the file vertically as one would normally organize a schematic from left-to-right on a piece of paper: each block uses the results of the block(s) above it. Avoid having continuous time assignment use results that are calculated later by a behavioral block--create another continuous time assignment block AFTER the behavioral structure, to reinforce the signal path visually. Behavioral blocks themselves should be organized as a set of default values followed by a set of conditional statements to overwrite the values for specific input cases. Most importantly, try to organize the code so that it mirrors the circuit you are hoping to create: for decisions provide both true and false assignment information (unless one is the default value) to resemble a 2:1 mux and avoid using two disjoint if statements to assign a value because it creates a situation where the semantics of the blocking assignment operator become very significant, which makes it difficult for others (or yourself a few months later) to understand the original intentions of the code. Such if statements also obscure the parallelism inherent in the evaluation, because, while all of the if statements can be evaluated in parallel, some kind of additional glue logic must be invisibly added in order to arbitrate the final assignment to the enable signal. For instance, do not write code like this:

always @* begin
    enable = 1'b0;

    if (counter > 3'b100)
        enable = 1'b1;

    if (sload == 1'b1)
        enable = 1'b1;

    if (force_hold == 1'b1)
        enable = 1'b0;
end

This code obscures the true purpose of the force_hold signal by relying on the sequential evaluation of the always block to ensure that it overrides the counter and synchronous load signals. Understanding a complex circuit by reading the code once and forming an appropriate mental image can be tricky, so it is much more reasonable to nest if statements. Because this is hardware and not software, there are no branches and no additional penalties for having a large chain of if statements—if the two circuits are equivalent, then the same hardware should be generated in either case, but one makes the behavior more obvious. Nesting the control flow also makes the final physical tree-like mux implementation more apparent in the source code, allowing you to estimate the necessary resources more easily. Therefore, I would re-write the above block like this:

always @*
    enable = 1'b0;

    if (force_hold == 1'b0) begin
        if (counter &gt; 3'b100 || sload == 1'b1)
            enable = 1'b1;
    end
end

This immediately makes it clear that the value of force_hold overrules the counter value and any synchronous load signals, so it would drive a 2:1 mux on the output, with the 0-value input taking its input from an earlier mux and the 1-value input taking the constant value of 0. The earlier mux would then be an optimized implementation of the conditional choice, with, for example, the counter condition as the control signal, the 1-value input as the constant 1, and the 0-value input as the synchronous load condition, implementing an OR function with a mux. The placement of the two conditions could also be swapped, but the synthesis tools will usually choose the correct assignment in order to optimize for the delay of each incoming signal.

From these examples, I can create a prototypical combinational Verilog circuit

module proto(i0, i1, i2, o0, o1, o2);
    input i0, i1;
    input [5:0] i2;
    output o0, o1, o2;

    // Redefine outputs that will be driven inside an always block
    reg o1, o2;

    // Internal signals here
    wire p1;

    // Continuous time assignments come first for simple expressions like bus parity or
    // logic expressions with only a few gates
    assign p1 = ^i2;
    assign o0 = i0 | i1;

    // Then afterwards behavioral blocks specify more complex relationships
    always @* begin
        // First, default values for everything assigned in this always block
        o1 = 1'b0;
        o2 = 1'b0;

        // Then, either per output or combined if they share enough of the decision tree
        // have the actual logic for computing outputs
        if (i2 > 6'b110000) begin
            if (i0 == 1'b1)
                o1 = p1;
            if (i1 == 1'b1)
                o2 = p1;
        end
        else begin
            o1 = 1'b1;
            o2 = 1'b1;
        end     
    end
endmodule

Writing simple combinational circuits that reflect the type of circuit being created, without being verbose, helps with synthesis. Tools are better at optimizing large circuits than people are (just like compilers are better at optimizing large programs than people are), but they must first be able to understand the circuits. Sticking to simple constructs and suggestive organization makes it easier for the tools to produce the simplest circuit without accidentally creating additional logic because of unintended semantics. Some tools (such as Altera's Quartus or Xilinx's ISE) allow you to view the schematic that was produced from the Verilog that you compiled; I highly recommend doing this for complex designs, especially if timing becomes an issue, to determine if the synthesis tools are doing what you want, or if you need to massage the source code into a cleaner format.

Only once a decent style has been developed for approaching combinational circuits is it even possible to begin creating successful, complex sequential circuits, because every sequential circuit must contain at least one combinational circuit for calculating the next state and the circuit's output. Next time: building proper edge-triggered sequential circuits using Verilog.

After switching from tr-ircd to InspIRCd 2.0, a lot of users complained that there was no longer a usermode to hide their idle time in /whois. In order to remedy this, I chose a random unused usermode (+l) and implemented m_hideidle.so, which allows users to prevent idle time from being shown in /whois. Opers with priv users/auspex will still be able to see idle time, because it is important for opers with the correct privileges to be able to see information that normal users cannot.

If you plan on using this module, be sure to add information to your HELPOP scripts to tell users that mode +l is available and what it does.

/******************************************************************************
 *    InspIRCd 2.0 - m_hideidle.so
 *    Greg Malysa, 2011.
 *
 *    This program is free software: you can redistribute it and/or modify
 *    it under the terms of the GNU General Public License as published by
 *    the Free Software Foundation, either version 3 of the License, or
 *    (at your option) any later version.
 *
 *    This program is distributed in the hope that it will be useful,
 *    but WITHOUT ANY WARRANTY; without even the implied warranty of
 *    MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *    GNU General Public License for more details.
 *
 *    You should have received a copy of the GNU General Public License
 *    along with this program.  If not, see &lt;http://www.gnu.org/licenses/&gt;.
 *
 ******************************************************************************
 *
 *    Third party module to provide usermode +l, which allows users to hide
 *    their idle time in /whois from normal users. Opers with users/auspex will
 *    always see idle times.
 *
 *****************************************************************************/

#include "inspircd.h"

// ModeHandler class for applying usermodes
class HideIdle : public ModeHandler {
public:
    HideIdle(Module *Creator) : ModeHandler(Creator, "hideidle", 'l', PARAM_NONE, MODETYPE_USER) {
        oper = false;
    }

    ModeAction OnModeChange(User *source, User *dest, Channel *channel, std::string &parameter, bool adding) {
        if (adding) {
            if (!dest-&gt;IsModeSet('l')) {
                dest-&gt;SetMode('l', true);
                return MODEACTION_ALLOW;
            }
        }
        else {
            if (dest-&gt;IsModeSet('l')) {
                dest-&gt;SetMode('l', false);
                return MODEACTION_ALLOW;
            }
        }
        return MODEACTION_DENY;
    }
};

// Module class interfaces with IRCD
class ModuleHideIdle : public Module {
    HideIdle hi;
public:
    ModuleHideIdle() : hi(this) {
        if (!ServerInstance-&gt;Modes-&gt;AddMode(&hi))
            throw ModuleException("Could not add +l (hide idle) usermode!");
        Implementation eventlist[] = {I_OnWhoisLine};
        ServerInstance-&gt;Modules-&gt;Attach(eventlist, this, 1);
    }
    virtual ~ModuleHideIdle() {}

    ModResult OnWhoisLine(User *user, User *dest, int &numeric, std::string &text) {
        // Check for the numeric about idle time and hide if necessary
        if (numeric == 317) {
            if (dest-&gt;IsModeSet('l') && !user-&gt;HasPrivPermission("users/auspex")) {
                return MOD_RES_DENY;
            }
        }

        // Default behavior is to show idle time
        return MOD_RES_PASSTHRU;
    }

    Version GetVersion() {
        return Version("Allows users to toggle their idle time being shown in /whois with usermode +l", 0);
    }
};

MODULE_INIT(ModuleHideIdle)