Bit-serial arithmetic

Bit-serial multiplier of Chen and Willoner

Bit-serial arithmetic is used in signal processing and in parallel processing where chip area and interconnection width are central issues.

Here, a bit-serial multiplier is presented that was found by Sips [Sips 82] and, independently, by Strader and Rhyne [SR 82] and by Gnanasekaran [Gna 83]. It is based on a multiplication scheme (Figure 1a) that was first used by Chen and Willoner [CW 79].

For the multiplication of two n-bit numbers a = a_n-1, ..., a₀ and b = b_n-1, ..., b₀, the bit products shown in Figure 1a for the case n = 4 have to be computed and added appropriately.

Bit products of a 4-bit multiplication Data flow of variables

Figure 1: Bit products of a 4-bit multiplication (a), data flow of variables (b)

In the multiplication algorithm, for all index pairs (i,j) with i ≤ j, two bit products a_i b_j and a_j b_i, or if i = j one bit product a_i b_i, and three intermediate result bits are added, producing three new intermediate result bits. Since five bits are added, resulting in a value of possibly 5, three bits are necessary to represent the result. The boolean function that performs this operation is called (5,3)-counter, since it counts how many of its 5 inputs are 1 and represents that number by 3 output bits.¹)

At each index position (i,j), the bit products a_i b_j and a_j b_i and the other three intermediate result bits that are added have binary values 2^i+j. The result bits produced are a sum bit s of binary value 2^i+j, a first order carry bit c of binary value 2^i+j+1, and a second order carry bit d of binary value 2^i+j+2.

The result bits are propagated to some other index positions where bit products of the corresponding binary values are computed. Figure 1b shows the propagation of the result bits and of the input bits. This leads to the recurrence equations below.

In order to embed the input bits into the two-dimensional index space, we set

`a`_i,j	=	`a`_i
`b`_i,j	=	`b`_j
`a`'`i`,`j`	=	`a`_j
`b`'`i`,`j`	=	`b`_i

for all i, j ∈ {0, ..., n-1}, i ≤ j.

Here, a_i,j and b_i,j stand for the upper input bits in each box of Figure 1 and a'i,j and b'i,j stand for the lower input bits. For example, the term a₀ b₃ corresponds to a_0,3 b_0,3, and a₃ b₀ corresponds to a'0,3 b'0,3.

Now we have the following recurrence equations. The functions S are symmetric boolean functions defined as follows:

S_k is 1 if exactly k of its arguments are 1, otherwise it is 0. S_k,m is an abbreviation of S_k ∨ S_m. Together, S_1,3,5, S_2,3 and S_4,5 form a (5,3)-counter.

`a`_i,j+1	=	`a`_i,j
`b`_i+1,j	=	`b`_i,j
`a`'`i`+1,`j`	=	`a`'`i`,`j`
`b`'`i`,`j`+1	=	`b`'`i`,`j`
`s`_i-1,j+1	=	`S`_1,3,5 (`a`_i,j `b`_i,j , `a`'`i`,`j` `b`'`i`,`j` , `s`_i,j , `c`_i,j , `d`_i,j)
`c`_i,j+1	=	`S`_2,3 (same arguments as above)
`d`_i+1,j+1	=	`S`_4,5 (same arguments as above)

The result of the multiplication are the bits r_j = s_-1,j+1 with j = 0, ..., 2n-1. This means that the index space shown in Figure 1a has to be extended such that the sum bits and carry bits are propagated to index positions with j ≥ n. The input bits a'i,j and b_i,j are 0 for j ≥ n.

In order to derive the multiplier circuit from the recurrence equations, a linear transformation T is chosen that transforms the above index space {(i,j)} into the index space {(t,p)} of time and a linear array of processing elements [MF 86].

The linear transformation

		`i`	`j`
`T` =	`t`	0	1
	`p`	1	0

essentially transposes the scheme of Figure 1. The result is the following scheme (Figure 2). It can be seen that e.g. a₀ b₁ and a₁ b₀ are computed at time 1 in processor 0.

Figure 2: Transformed index space

Additional n time steps are necessary to shift the higher significant result bits out of the processor array. Application of the transformation T to index (-1,j+1) of a result bit gives (j+1,-1), i.e. the result bits appear at processor -1 (i.e. at the output of processor 0) at time j+1 (j = 0, ..., 2n-1).

The data dependency matrix D of the recurrences in the index space {(i,j)} is

		`s`	`c`	`d`	`a`	`b`	`a'`	`b'`
`D` =	`i`	-1	0	1	0	1	1	0
	`j`	1	1	1	1	0	0	1

The data dependencies express the fact that, e.g., variable s computed at index (i,j) is next used at index (i-1,j+1), or variable b produced at index (i,j) is next used at index (i,j+1).

The transformed data dependencies express the data flow in time and processor space. We have

		`s`	`c`	`d`	`a`	`b`	`a'`	`b'`
`TD` =	`t`	1	1	1	1	0	0	1
	`p`	-1	0	1	0	1	1	0

For instance, s is passed to the previous processor in one time step, c is held in the same processor for one time step, d is passed to the next processor in one time step. The input values b and a' are passed to the next processor in 0 time steps – this means that they are broadcast to all processors.

The input variables a and b' are held in the same processor in each time step. Thus, this input has to be provided in a bit-parallel way. However, a timing analysis shows that a₀ and b'0 are first needed at time 0, a₁ and b'1 are first needed at time 1, and so on. Therefore, the input is provided bit-serially and latched by a control signal at the appropriate time steps.

Figure 3 shows the multiplier, here for an operand length of n = 3. Each of the n processor elements consists of two latches where the input bits a and b' are stored, two AND-gates that compute the bit products, a (5,3)-counter, and four delay cells.

At time k in processor k only one bit product a_k b_k is added (see Figure 2). In this situation, the other bit product is set to 0 by the control signal.

It is possible to perform an addition simultaneously using the d-input of processor 0 (input bits s₀ s₁ s₂ in Figure 3), so that the multiplier computes r = s + a·b.

Figure 3: Multiplier circuit for operand length n=3

The bit-serial multiplier of [Sips 82][SR 82][Gna 83] presented here uses n processors and requires time 2n. Chen and Willoner [CW 79] originally found the underlying multiplication scheme. However, in their implementation they used a different transformation T, namely

		`i`	`j`
`T` =	`t`	0	1
	`p`	1	1

resulting in a less optimal design with 2n processors.

[CW 79] I.N. Chen, R. Willoner: An O(n) Parallel Multiplier with Bit-Sequential Input and Output. IEEE Transactions on Computers C-28, 10, 721-727 (1979)

[Gna 83] R. Gnanasekaran: On a Bit-Serial Input and Bit-Serial Output Multiplier. IEEE Transactions on Computers C-32, 9, 878-880 (1983)

[MF 86] D.I. Moldovan, J.A.B. Fortes: Partitioning and Mapping Algorithms into Fixed Size Systolic Arrays. IEEE Transactions on Computers C-35, 1, 1-12 (1986)

[Sips 82] H.J. Sips: Comments on "An O(n) Parallel Multiplier with Bit-Sequential Input and Output". IEEE Transactions on Computers C-31, 4, 325-327 (1982)

[SR 82] N.R. Strader, V.T. Rhyne: A Canonical Bit-Sequential Multiplier. IEEE Transactions on Computers C-31, 8, 791-795 (1982)

¹) A standard full adder is a (3,2)-counter

Next: [up]

H.W. Lang mail@hwlang.de Impressum Datenschutz
Created: 01.04.2003 Updated: 08.02.2023

Bit-serial arithmetic

Bit-serial multiplier of Chen and Willoner

Multiplication algorithm

Index space transformation

Multiplier unit

Conclusions

References