Jeff:
I developed a hardware Booth multiplier some time ago. My reference for the algorithm was "Computer Organization", Hamacher et al, McGraw-Hill Book Company, New York, NY, 1978, ISBN: 0-07-025681-0. It does a very nice job of laying out the algorithm for 1 bit at a time algorithm. Like you, some additional research on the web let me extend the Booth algorithm for any number of bits. I specifically put three Booth multiplier modules together: 1x, 2x, and 4x. The following two excerpts may be of help with regard to the Booth "recoding" tables that will feed a look-up table:
(Booth 2 bits at a time Multiplier)
Code:
// Additional Comments:
//
// The basic operations follow those of the standard Booth multiplier except
// that the transitions are being tracked across 2 bits plus the guard bit.
// The result is that the operations required are 0, ±1, and ±2 times the
// multiplicand (M). That is:
//
// Prod[2:0] Operation
// 000 Prod <= (Prod + 0*M) >> 2;
// 001 Prod <= (Prod + 1*M) >> 2;
// 010 Prod <= (Prod + 1*M) >> 2;
// 011 Prod <= (Prod + 2*M) >> 2;
// 100 Prod <= (Prod - 2*M) >> 2;
// 101 Prod <= (Prod - 1*M) >> 2;
// 110 Prod <= (Prod - 1*M) >> 2;
// 111 Prod <= (Prod - 0*M) >> 2;
//
// The operations in this table can be seen as direct extensions of the four
// conditions used in the standard Booth algorithm. The first and last terms
// simply indicate to skip over runs of 1s and 0s. The terms 001 and 110 are
// indicative of the 01 (+1) and 10 (-1) operations of the standard Booth al-
// gorithm. The terms 010 (+2,-1) and 101 (-2,+1) reduce to the operations
// noted in the table, namely ±1*M, respectively. The terms 011 (+2,0) and
// 100 (-2, 0) are the two new operations required for this modified Booth
// algorithm.
//
// The algorithm could be extended to any number of bits as required by noting
// that as more bits are added to the left, the number of terms (operations)
// required increases. Addding another bit, i.e. a 3-bit Booth recoding, means
// that the operations required of the adder/partial product are 0, ±1, ±2,
// and ±4. The guard bits provided for the sign bit accomodates the ±2 opera-
// tion required for the 2-bit modified booth algorithm. Adding a third bit
// means that a third guard bit will be required for the sign bit to insure
// that there is no overflow in the partial product when ±4 is the operation.
//
(Booth 4 bits at a time multiplier)
Code:
// Additional Comments:
//
// The basic operations follow those of the standard Booth multiplier except
// that the transitions are being tracked across 4 bits plus the guard bit.
// The result is that the operations required are 0, ±1, ±2, ±3, ±4, ±5, ±6,
// ±7, and ±8 times the multiplicand (M). However, it is possible to reduce
// the number of partial products required to implement the multiplication to
// two. That is, ±3, ±5, ±6, and ±7 can be written in terms of combinations of
// ±1, ±2, ±4, and ±8. For example, 3M = (2M + 1M), 5M = (4M + M), 6M = (4M
// + 2M), and 7M = (8M - M). Thus, the following 32 entry table defines the
// operations required for generating the partial products through each pass
// of the algorithm over the multiplier:
//
// Prod[4:0] Operation
// 00000 Prod <= (Prod + 0*M + 0*M) >> 4;
// 00001 Prod <= (Prod + 0*M + 1*M) >> 4;
// 00010 Prod <= (Prod + 0*M + 1*M) >> 4;
// 00011 Prod <= (Prod + 2*M + 0*M) >> 4;
// 00100 Prod <= (Prod + 2*M + 0*M) >> 4;
// 00101 Prod <= (Prod + 2*M + 1*M) >> 4;
// 00110 Prod <= (Prod + 2*M + 1*M) >> 4;
// 00111 Prod <= (Prod + 4*M + 0*M) >> 4;
// 01000 Prod <= (Prod + 4*M + 0*M) >> 4;
// 01001 Prod <= (Prod + 4*M + 1*M) >> 4;
// 01010 Prod <= (Prod + 4*M + 1*M) >> 4;
// 01011 Prod <= (Prod + 4*M + 2*M) >> 4;
// 01100 Prod <= (Prod + 4*M + 2*M) >> 4;
// 01101 Prod <= (Prod + 8*M - 1*M) >> 4;
// 01110 Prod <= (Prod + 8*M - 1*M) >> 4;
// 01111 Prod <= (Prod + 8*M + 0*M) >> 4;
// 10000 Prod <= (Prod - 8*M - 0*M) >> 4;
// 10001 Prod <= (Prod - 8*M + 1*M) >> 4;
// 10010 Prod <= (Prod - 8*M + 1*M) >> 4;
// 10011 Prod <= (Prod - 4*M - 2*M) >> 4;
// 10100 Prod <= (Prod - 4*M - 2*M) >> 4;
// 10101 Prod <= (Prod - 4*M - 1*M) >> 4;
// 10110 Prod <= (Prod - 4*M - 1*M) >> 4;
// 10111 Prod <= (Prod - 4*M - 0*M) >> 4;
// 11000 Prod <= (Prod - 4*M - 0*M) >> 4;
// 11001 Prod <= (Prod - 2*M - 1*M) >> 4;
// 11010 Prod <= (Prod - 2*M - 1*M) >> 4;
// 11011 Prod <= (Prod - 2*M - 0*M) >> 4;
// 11100 Prod <= (Prod - 2*M - 0*M) >> 4;
// 11101 Prod <= (Prod - 0*M - 1*M) >> 4;
// 11110 Prod <= (Prod - 0*M - 1*M) >> 4;
// 11111 Prod <= (Prod - 0*M - 0*M) >> 4;
//
I've successfully implemented these multipliers in a signal processing application. I believe that the recoding tables provided above are accurate.