Accurate Power Estimation Identity for DSP Blocks Targeted to FPGAs

Neerja Singh, Gaurav Verma*, Vijay Khare

Department of ECE, JJIT, Noida 201307, U.P., India

Corresponding Author Email: gaurav.iitkg@gmail.com

https://doi.org/10.18280/isi.270403

Received: 13 April 2022
Accepted: 21 June 2022

Keywords:
FIR, IP, DSP, power, FPGA, RTL

ABSTRACT

Nowadays, the main challenge facing system designers is to design power-efficient systems with reduced design turnaround time. This can be achieved in two ways: first, by using off-the-shelf components (Intellectual Property (IP) cores) along with user-defined IPs; second, by estimating the power at an early stage of the design cycle. Therefore, this paper presents the power estimation of cascaded and non-cascaded DSP blocks based on IP modeling. The DSP blocks are designed using a blend of embedded and user-defined IP cores. Curve-fitting and regression-based models for power evaluation have been created for each IP core. The power of the complete DSP block is estimated using the identity proposed by Elleouet et al., incorporating the power values of each IP core obtained from the regression-based models. The models have been validated for accuracy against the power values obtained from a commercial tool (Vivado Design Suite 2014.2). The analysis shows that the identity gives inaccurate results for cascaded DSP blocks. Therefore, in this work, a new identity is proposed that estimates the power of cascaded systems accurately and in agreement with the results of the commercial tool.

1. INTRODUCTION

The foremost consequence of transistor miniaturization is high power consumption. This has led to the additional requirement of cooling devices and has also reduced battery life. Currently, power is the critical constraint for electronic design engineers working to compressed design schedules. Reconfigurable circuits such as FPGAs have become a preferred technology because they can achieve high performance at low cost and with shorter development time [1]. These devices can implement complex circuits such as DSP blocks and embedded memories [2]. Today, they are attractive alternatives to their Application-Specific Integrated Circuit (ASIC) counterparts. However, owing to their increased complexity, power consumption has emerged as the constraining factor that keeps FPGA designs from entering low-power applications.

Several power estimation techniques already exist in the literature, but accurate power estimation is possible only with knowledge of the capacitances. The available commercial tools estimate power accurately, but with a long run time. Power assessment at a higher abstraction level is less accurate because low-level statistics are absent. To overcome this problem, system design at the Register Transfer Level (RTL) is an attractive choice because of its shorter simulation run time and technology independence. Although numerous models in the literature can determine the power of an individual block at the RTL, methodologies that can accurately approximate the power of a complete system using an IP modeling approach still need exploration.

Therefore, in this paper, DSP blocks have been designed and analyzed for power using IP cores. DSP blocks have been categorized as cascaded blocks and non-cascaded blocks. In cascaded blocks, input is applied at one IP core whose output acts as the input to the intermediate IP cores, and the final output is taken at the other end. In non-cascaded blocks, external input may be applied to the intermediate blocks, and output is taken at each stage. The most important advantage of system design using IP cores is that dedicated IP cores can be reused across many systems. This approach increases design efficiency [3]. Also, power assessment at an early design phase helps designers build power-efficient systems on a shorter design schedule.

The paper is organized as follows: a review of power modeling and estimation techniques for FPGAs is given in Section 2, and the flow of the proposed power estimation method is described in Section 3. Power in FPGAs is detailed in Section 4. Characterization of DSP blocks is discussed in Section 5. The regression models of the sub-modules used in designing each DSP block are elaborated in Section 6. Power modeling of the complete system is explained in Section 7. Finally, result analysis, execution time comparison, model compatibility at different frequencies and the conclusion are presented in Sections 8, 9, 10 and 11, respectively.

2. LITERATURE SURVEYS

Elleouet et al. [3] proposed an identity that estimates the power of a system designed using N IP cores. Architectural and algorithmic parameters were used to build the model, and the analysis is carried out at the system level. Jevtic et al. [4] proposed a model that estimates the power of multiplier blocks in FPGAs. They identified a gap in the work of David Elleouet and Nathalie Choy: interconnect and component powers were not separated, which may cause accuracy issues for complex designs. Lorandel et al. [5] presented a method that evaluates the power of wireless communication systems at a higher abstraction level. The methodology is specific to wireless communication systems, and it does not address how power is influenced once various IP blocks are interconnected. Deng et al. [6] presented curve-fitting and regression-based models that accurately estimate area, time, and power for IP core-based implementations on FPGAs, a design approach that greatly improves hardware development efficiency. Gebotys et al. [7] presented a linear regression-based model for predicting power. They derived the model variables from the DSP code and achieved an error of less than 4%. Verma et al. [8] applied a statistical power estimation technique to embedded systems. The analysis covered almost 30 circuits, with power estimated using XPower Analyzer, and showed that the statistical technique provides good accuracy with a faster estimation speed. Nasser et al. [9] presented an overview of power modeling and estimation techniques at different abstraction levels (from RTL to transistor level). They found that simulation-based estimation is generic and accurate but slow, whereas the probabilistic approach is fast but less accurate. Referring to various works, they also agreed that statistical estimation provides moderate accuracy with moderate estimation speed. Raghunathan et al. [10] proposed a statistical modeling technique at the RTL that estimates switching activity and power consumption. They accounted for glitches to achieve better accuracy and reported an error of about 7%. Makani et al. [11] worked with hardware resource utilization reports and carried out an analysis for estimating area and power without RTL implementation. Durrani and Riesgo [12] proposed an architectural-level modeling technique that estimates power based on knowledge of the inputs and outputs. Like Elleouet et al. [3], they claimed that fast power estimation of IP-based designs can be achieved by simply adding the power consumed by the individual IP cores. They achieved an error of 1-2% for individual macro blocks and 9-15% for the complete system, but they also did not address how power is influenced once the IP blocks are interconnected to form a complete system. Singh et al. [13] proposed Artificial Neural Network (ANN) and regression-based models for an embedded multiplier and found the models to be generic across all 7-series FPGA devices. Therefore, in this work, while designing a complete system using different IP cores, the focus is on interconnection power.

The literature survey shows that various power modeling and estimation techniques have been established at different abstraction levels, and that statistical modeling offers a good balance of accuracy and estimation speed. Various works have been reported on power estimation at the RTL, but they are limited to individual blocks. Very few works address an IP modeling approach for a complete system, so power estimation of systems designed using IP cores is still at an early stage. Power estimation at the RTL based on IP modeling is therefore an attractive option because of its technology independence and shorter simulation run time.

3. FLOW OF PROPOSED POWER ESTIMATION METHOD

The proposed power estimation flow is shown in Figure 1. In this work, DSP blocks are designed by interconnecting diverse IP cores, using embedded as well as user-defined IP cores as required. User-defined IP cores are added to the library using the Verilog Hardware Description Language (HDL). After design implementation, the value of total power is generated. Individual IP cores are modified and synthesized for various Input/Output (I/O) configurations. The data obtained after synthesis are used to create a regression-based model for each IP core.

![Figure 1. Power estimation process](Image)

System power is estimated both through the identity proposed by Elleouet et al. [3] and through the identity proposed in this work, using power values obtained from the regression models. Both estimates are validated against the power values reported by the commercial tool.

4. POWER IN FPGA

In FPGAs, power consumption has increased due to the large number of programmable switches and interconnects. The total power, Power$_T$, in an FPGA is the sum of the static power, Power$_S$, and the dynamic power, Power$_D$, as given by Eq. (1).

$$\text{Power}_T = \text{Power}_S + \text{Power}_D$$ (1)

Static power does not change instantaneously for a particular FPGA device; it arises from leakage mechanisms in MOS transistors, which are themselves a function of temperature. In this work, no significant rise in temperature was observed during the analysis, so the static power is assumed to be constant at 120 mW. Dynamic power, however, changes instantaneously and is given by Eq. (2).

\[ Power_{(D)} = \alpha \times f_{\text{clk}} \times C_L \times V_{DD}^2 \]  

(2)

where \( C_L \) is the total capacitance, \( V_{DD} \) is the supply voltage, \( \alpha \) is the switching activity and \( f_{\text{clk}} \) is the clock frequency as per the design requirement [14-18]. The Vivado tool estimates the value of \( \alpha \) at the various nodes of the circuit under consideration using a vectorless algorithm. Consequently, \( \alpha \) cannot be controlled directly when circuits are designed by interconnecting various IP blocks. Hence, in FPGAs, dynamic power can be expressed by Eq. (3).
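As a small illustration of Eq. (2), the following sketch evaluates the dynamic power for one operating point. Only the 125 MHz clock frequency is taken from this work; the switching activity, capacitance and supply voltage are hypothetical placeholder values chosen for the arithmetic, and the function name is likewise an assumption.

```python
def dynamic_power_watts(alpha, f_clk_hz, c_load_farads, v_dd_volts):
    """Eq. (2): Power_D = alpha * f_clk * C_L * V_DD^2 (result in watts)."""
    if alpha < 0 or f_clk_hz < 0 or c_load_farads < 0:
        raise ValueError("alpha, f_clk and C_L must be non-negative")
    return alpha * f_clk_hz * c_load_farads * v_dd_volts ** 2


# Hypothetical operating point: only the 125 MHz clock comes from the paper.
example_power_w = dynamic_power_watts(
    alpha=0.125,          # assumed switching activity
    f_clk_hz=125e6,       # clock frequency used in this work
    c_load_farads=1e-9,   # assumed total switched capacitance
    v_dd_volts=1.0,       # assumed supply voltage
)
print(f"Dynamic power: {example_power_w * 1e3:.3f} mW")
```

Because the relation is linear in \( \alpha \), \( f_{\text{clk}} \) and \( C_L \) but quadratic in \( V_{DD} \), halving the supply voltage in this sketch reduces the result by a factor of four.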

\[ Power_{D} = Power_{\text{Signal}} + Power_{\text{Logic}} + Power_{I/O} + Power_{\text{Clock}} + Power_{\text{Memory}} + Power_{\text{DSP}} \]  

(3)

where I/O power depends on the total number of input/output pins. Clock power is the average power dissipated by the clock network, including the power spent by buffers and routing resources. Signal power is the average power spent by the interconnects. Logic power is a function of the Configurable Logic Blocks (CLBs) and includes the power spent by Look-Up Tables (LUTs) and Flip-Flops (FFs). Memory power depends on the memory elements. DSP power is a function of the number of DSP blocks used in the particular design [5].
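The decomposition in Eq. (1) and Eq. (3) can be sketched as a simple sum over the named components. The component names, the function name and the example values below are illustrative assumptions; only the 120 mW static power is taken from this work.

```python
STATIC_POWER_MW = 120.0  # static power assumed constant in this work

# Component names follow Eq. (3); the dictionary keys are an assumed convention.
ALLOWED_COMPONENTS = {"signal", "logic", "io", "clock", "memory", "dsp"}


def total_power_mw(components_mw, static_mw=STATIC_POWER_MW):
    """Eq. (1) with Eq. (3): static power plus the sum of dynamic components."""
    unknown = set(components_mw) - ALLOWED_COMPONENTS
    if unknown:
        raise ValueError(f"unknown power components: {sorted(unknown)}")
    dynamic_mw = sum(components_mw.values())
    return static_mw + dynamic_mw


# Hypothetical component values in mW, used only to show the arithmetic.
example_total_mw = total_power_mw(
    {"io": 9.158, "logic": 0.0001, "signal": 1.0, "clock": 1.0}
)
print(f"Total power: {example_total_mw:.4f} mW")
```

Components that a design does not use (for example memory or DSP) are simply omitted, which is equivalent to setting them to zero.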

5. CHARACTERIZATION OF DSP BLOCKS

In this work, various DSP blocks have been used for analyzing the feasibility of the proposed identity. DSP blocks are categorized into cascaded and non-cascaded blocks as shown in Table 1. In cascaded blocks, input is applied at one IP core whose output acts as the input to the intermediate IP cores and final output is taken at another end. In non-cascaded blocks, external input may be applied to the intermediate blocks and output is taken at each stage [19].

Table 1. Categorization of DSP blocks

<table>
<thead>
<tr>
<th>S. No.</th>
<th>Cascaded Blocks</th>
<th>Non-cascaded Blocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>FIR Filter</td>
<td>Carry Ripple Adder</td>
</tr>
<tr>
<td>2</td>
<td>MAC Unit</td>
<td>Carry Skip Adder</td>
</tr>
<tr>
<td>3</td>
<td>ALU</td>
<td>SIPO</td>
</tr>
<tr>
<td>4</td>
<td>Barrel Shifter</td>
<td>PIPO</td>
</tr>
<tr>
<td>5</td>
<td>Carry Save Adder</td>
<td>PISO</td>
</tr>
<tr>
<td>6</td>
<td>SISO</td>
<td>---</td>
</tr>
</tbody>
</table>

The DSP blocks are designed by connecting embedded IPs and user-defined IPs. The architectural details of the various DSP blocks designed in this work are given in Table 2.

Table 2. Architectural details of DSP blocks

<table>
<thead>
<tr>
<th>S. No.</th>
<th>Cascaded Blocks</th>
<th>User-defined IP</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>FIR Filter</td>
<td>Three Delay Element</td>
</tr>
<tr>
<td>2</td>
<td>MAC Unit</td>
<td>None</td>
</tr>
<tr>
<td>3</td>
<td>ALU</td>
<td>8-bit AND, OR, XOR, NOT gates and 16-bit MUX</td>
</tr>
<tr>
<td>4</td>
<td>Barrel Shifter</td>
<td>Twenty-four 2:1 MUX</td>
</tr>
<tr>
<td>5</td>
<td>Carry Save Adder</td>
<td>Eight full adder IP</td>
</tr>
<tr>
<td>6</td>
<td>SISO</td>
<td>Four D flip-flop IP</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>S. No.</th>
<th>Non-cascaded Blocks</th>
<th>User-defined IP</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Carry Ripple Adder</td>
<td>Four full adder IP</td>
</tr>
<tr>
<td>2</td>
<td>Carry Skip Adder</td>
<td>Four full adder IP and a 2:1 MUX IP</td>
</tr>
<tr>
<td>3</td>
<td>SIPO</td>
<td>Four D flip-flop IP</td>
</tr>
<tr>
<td>4</td>
<td>PIPO</td>
<td>Four D flip-flop IP and four 2:1 MUX IP</td>
</tr>
<tr>
<td>5</td>
<td>PISO</td>
<td>Four D flip-flop IP</td>
</tr>
</tbody>
</table>

6. REGRESSION MODEL FOR IP CORES USED IN THE DESIGN OF DSP BLOCKS

Table 3. Parameters used and their connotation

<table>
<thead>
<tr>
<th>Used parameters</th>
<th>Connotation</th>
</tr>
</thead>
<tbody>
<tr>
<td>out_pin</td>
<td>Total output pins</td>
</tr>
<tr>
<td>lut</td>
<td>Total LUT (logic slice)</td>
</tr>
<tr>
<td>ff</td>
<td>Total Flip-Flops</td>
</tr>
<tr>
<td>DSP48</td>
<td>Total DSP blocks</td>
</tr>
</tbody>
</table>

Curve-fitting and regression-based models for the individual IP cores have been created from the resource utilization data obtained after synthesis [20-22]. In this work, curve-fitting and regression techniques are used to predict the relationship between the dependent and independent variables. Each model has been tested for accuracy against the commercial tool. The parameters used and their connotations are explained in Table 3.
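To make the regression step concrete, the sketch below fits a straight line by ordinary least squares to the tool-reported total power of the accumulator IP (the commercial-tool column of Table 9) as a function of output width. This is a generic textbook fit written for illustration; it is not necessarily the fitting procedure or tool used by the authors, and the function name is an assumption.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Accumulator output widths and tool-reported total power (mW) from Table 9.
widths = [8, 16, 24, 32, 48, 64]
tool_power_mw = [130, 140, 150, 161, 181, 202]

slope, intercept = fit_line(widths, tool_power_mw)
print(f"power_mW ~= {slope:.4f} * width + {intercept:.2f}")
```

In this fit the intercept lands close to the 120 mW static power assumed in Section 4, which is the behavior one would expect if the dynamic part grows roughly linearly with output width.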

6.1 Regression model for divider

The divider IP is instantiated using different configurations. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (4) to Eq. (7). The power obtained for different divider configurations is given in Table 4.

\[ Output_{power} = 1.185 \times out\_pin - 1.308 \]  

(4)
\[
\text{Clockpower} = -2.437 + 0.1583 \times x_{\text{lut}} - 0.0548 \times x_{\text{ff}} 
\]
(5)

\[
\text{Logicpower} = 1.475 - 0.0428 \times x_{\text{lut}} + 0.0232 \times x_{\text{ff}} 
\]
(6)

\[
\text{Signalpower} = 0.2327 + 0.003029 \times x_{\text{lut}} + 0.008581 \times x_{\text{ff}} 
\]
(7)
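Eq. (4) to Eq. (7) can be evaluated together as in the sketch below. The coefficients are copied from the equations; the function name and the resource counts in the example call are hypothetical placeholders, since the LUT and flip-flop counts of each divider configuration are not listed here.

```python
def divider_dynamic_power_mw(out_pin, lut, ff):
    """Evaluate Eq. (4)-(7) for the divider IP; all results are in mW."""
    return {
        "output": 1.185 * out_pin - 1.308,                    # Eq. (4)
        "clock": -2.437 + 0.1583 * lut - 0.0548 * ff,         # Eq. (5)
        "logic": 1.475 - 0.0428 * lut + 0.0232 * ff,          # Eq. (6)
        "signal": 0.2327 + 0.003029 * lut + 0.008581 * ff,    # Eq. (7)
    }


# Hypothetical resource counts, chosen only to exercise the formulas.
parts = divider_dynamic_power_mw(out_pin=16, lut=100, ff=50)
divider_dynamic_mw = sum(parts.values())
for name, value in parts.items():
    print(f"{name:>6}: {value:.4f} mW")
print(f" total: {divider_dynamic_mw:.4f} mW")
```

Keeping the four components separate, rather than returning only their sum, makes it easier to see which fitted term dominates for a given configuration.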

Table 4. Comparative analysis of embedded divider block

<table>
<thead>
<tr>
<th>Divider configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>145</td>
<td>145.5</td>
<td>0.36</td>
</tr>
<tr>
<td>10</td>
<td>165</td>
<td>168.25</td>
<td>1.97</td>
</tr>
<tr>
<td>12</td>
<td>168</td>
<td>171.85</td>
<td>2.28</td>
</tr>
<tr>
<td>14</td>
<td>173</td>
<td>175.94</td>
<td>1.70</td>
</tr>
<tr>
<td>16</td>
<td>180</td>
<td>180.44</td>
<td>0.25</td>
</tr>
<tr>
<td>20</td>
<td>206</td>
<td>209.58</td>
<td>1.74</td>
</tr>
<tr>
<td>24</td>
<td>215</td>
<td>221.18</td>
<td>2.87</td>
</tr>
<tr>
<td>32</td>
<td>265</td>
<td>269.13</td>
<td>1.56</td>
</tr>
</tbody>
</table>

The power values obtained from the regression model have been tested for accuracy against the commercial tool using Eq. (8).

\[
\text{Error} (\%) = \left( \frac{e_i - r_i}{r_i} \right) \times 100 
\]
(8)

where \( e_i \) is the power estimated by the regression-based model [14] and \( r_i \) is the power value obtained from the Vivado tool. The other IP cores have been validated using the same method. The analysis also shows that the input power, which depends on the number of input pins in the design, contributes less than 1% of the total power; it is therefore taken as zero in the models.
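Eq. (8) is straightforward to apply in code. The sketch below uses the 10-bit divider row of Table 4 as a worked example; the function name is an assumption. Eq. (8) is signed, while the tables appear to list the magnitude of the error, so the example takes the absolute value when printing.

```python
def percent_error(estimated_mw, reference_mw):
    """Eq. (8): signed percentage error of an estimate against a reference."""
    if reference_mw == 0:
        raise ValueError("reference power must be non-zero")
    return (estimated_mw - reference_mw) / reference_mw * 100.0


# Divider configuration 10 from Table 4: model 168.25 mW, tool 165 mW.
divider_error = percent_error(estimated_mw=168.25, reference_mw=165.0)
print(f"Divider (10-bit) error: {abs(divider_error):.2f} %")
```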

6.2 Regression model for 8:1 MUX

The MUX IP is instantiated using different configurations. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (9) to Eq. (12). The comparative analysis of the 8:1 MUX IP for different configurations is given in Table 5.

Table 5. Comparative analysis of MUX block

<table>
<thead>
<tr>
<th>8:1 MUX configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>125</td>
<td>130.73</td>
<td>4.58</td>
</tr>
<tr>
<td>2</td>
<td>130</td>
<td>133.33</td>
<td>2.56</td>
</tr>
<tr>
<td>4</td>
<td>140</td>
<td>139.96</td>
<td>0.02</td>
</tr>
<tr>
<td>8</td>
<td>161</td>
<td>160.99</td>
<td>0.006</td>
</tr>
<tr>
<td>16</td>
<td>202</td>
<td>200.99</td>
<td>0.50</td>
</tr>
</tbody>
</table>

\[
\text{Outputpower} = 79.76 \times \exp\left(-\left(\frac{x_{\text{pin}} - 18.23}{11.88}\right)^2\right)
\]
(9)

\[
\text{Clockpower} = 1.42e^{-15} \times (\text{lut})^{0.831} + 0.9989 
\]
(10)

\[
\text{Signalpower} = 20.41 \times \exp\left(-\left(\frac{\text{lut} - 36}{6.91}\right)^2\right)
\]
(11)

\[
\text{Logicpower} = 20.41 \times \exp\left(-\left(\frac{\text{lut} - 36}{6.91}\right)^2\right)
\]
(12)

6.3 Regression model for full adder

The full adder IP has been used in many designs. The IP is instantiated using different configurations. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (13) to Eq. (16). The comparative analysis for different configurations is given in Table 6.

Table 6. Comparative analysis of full adder block

<table>
<thead>
<tr>
<th>Full adder configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>122</td>
<td>122.23</td>
<td>0.19</td>
</tr>
<tr>
<td>2</td>
<td>123</td>
<td>122.75</td>
<td>0.20</td>
</tr>
<tr>
<td>4</td>
<td>124</td>
<td>123.82</td>
<td>0.15</td>
</tr>
<tr>
<td>8</td>
<td>126</td>
<td>125.93</td>
<td>0.05</td>
</tr>
<tr>
<td>12</td>
<td>128</td>
<td>128.05</td>
<td>0.04</td>
</tr>
<tr>
<td>16</td>
<td>130</td>
<td>130.16</td>
<td>0.13</td>
</tr>
<tr>
<td>24</td>
<td>135</td>
<td>134.40</td>
<td>0.44</td>
</tr>
<tr>
<td>32</td>
<td>140</td>
<td>138.64</td>
<td>0.97</td>
</tr>
</tbody>
</table>

\[
\text{Outputpower} = 0.5294 \times x_{\text{lut}} + 0.1688 
\]
(13)

\[
\text{Logicpower} = 0.0001 
\]
(14)

\[
\text{Signalpower} = 0.0001 
\]
(15)

\[
\text{Clockpower} = 1 
\]
(16)
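As a consistency check, the sketch below evaluates Eq. (13) to Eq. (16), adds the 120 mW static power from Section 4, and compares the result with the regression-model column of Table 6. Two points are assumptions made for this illustration and are not stated in the text: the independent variable of Eq. (13) is read as the number of output pins, and an n-bit full adder configuration is taken to have n + 1 output pins (n sum bits plus one carry). Under that reading the totals agree with Table 6 to within 0.02 mW.

```python
STATIC_POWER_MW = 120.0  # static power assumed constant in this work


def full_adder_total_power_mw(out_pins):
    """Eq. (13)-(16) plus static power; out_pins is an ASSUMED interpretation."""
    output = 0.5294 * out_pins + 0.1688   # Eq. (13)
    logic = 0.0001                        # Eq. (14)
    signal = 0.0001                       # Eq. (15)
    clock = 1.0                           # Eq. (16)
    return STATIC_POWER_MW + output + logic + signal + clock


# Regression-model column of Table 6, keyed by full adder configuration.
table6_model_mw = {1: 122.23, 2: 122.75, 4: 123.82, 8: 125.93,
                   12: 128.05, 16: 130.16, 24: 134.40, 32: 138.64}

differences = {}
for bits, reported in table6_model_mw.items():
    estimate = full_adder_total_power_mw(bits + 1)  # assumed n + 1 output pins
    differences[bits] = estimate - reported
    print(f"{bits:>2}-bit: {estimate:.4f} mW (Table 6: {reported} mW)")
```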

6.4 Regression model for multiplier

The multiplier IP is instantiated using different configurations. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (17) to Eq. (20). The comparative analysis for the multiplier IP is reported in [14].

\[
\text{Outputpower} = 1.171 \times x_{\text{lut}} - 2.18 
\]
(17)

\[
\text{DSPpower} = 3.372 - 5.57 \times \cos(DSP48 \times 0.3927) 
+ 2.671 \times \sin(DSP48 \times 0.3927) 
+ 0.400 \times \cos(2 \times DSP48 \times 0.3927) 
+ 0.065 \times \sin(2 \times DSP48 \times 0.3927) 
\]
(18)

\[
\text{Clockpower} = 0.6464 - 0.5 \times \cos(\pi \times 0.0462) 
+ 1.207 \times \sin(\pi \times 0.0462) 
+ 0.85 \times \cos(2 \times \pi \times 0.0462) 
- 0.146 \times \sin(2 \times \pi \times 0.0462) 
\]
(19)

\[
\text{Signalpower} = 2.446 \times e^{(0.0103 \times x_{\text{lut}})} 
- 1.646 \times e^{(-1.191 \times x_{\text{lut}})} 
\]
(20)

6.5 Regression model for 2:1 MUX

This IP has been customized for different input configurations. Curve-fitting and regression techniques have been applied to create the model from the synthesis report. The dynamic power equations are given by Eq. (21) to Eq. (24). The comparative analysis for different configurations is given in Table 7.

\[
\begin{aligned}
\text{Outputpower} &= 0.3069 \times out\_pin + 0.2721 \\
\text{Clockpower} &= 8.375e14 + 0.03299 \times ff - 8.375e14 \times lut \\
\text{Logicpower} &= 0.0001 \\
\text{Signalpower} &= 1
\end{aligned}
\]

(21)-(24)

Table 7. Comparative analysis of 2:1 MUX block

<table>
<thead>
<tr>
<th>MUX configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>121</td>
<td>121.95</td>
<td>0.78</td>
</tr>
<tr>
<td>4</td>
<td>122</td>
<td>122.62</td>
<td>0.51</td>
</tr>
<tr>
<td>8</td>
<td>123</td>
<td>123.97</td>
<td>0.79</td>
</tr>
<tr>
<td>16</td>
<td>126</td>
<td>126.68</td>
<td>0.54</td>
</tr>
<tr>
<td>32</td>
<td>132</td>
<td>132.09</td>
<td>0.07</td>
</tr>
<tr>
<td>48</td>
<td>137</td>
<td>137.63</td>
<td>0.46</td>
</tr>
<tr>
<td>64</td>
<td>144</td>
<td>143.04</td>
<td>0.67</td>
</tr>
</tbody>
</table>

6.6 Regression model for adder/subtractor

The adder/subtractor IP is instantiated using different configurations. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (25) to Eq. (28). The comparative analysis of the adder/subtractor IP for different configurations is reported in [14].

\[
\begin{aligned}
\text{Outputpower} &= 0.8744 \times (out\_pin)^4 - 0.2083 \\
\text{Clockpower} &= 1 - 0.0147 \times lut + 0.0147 \times ff \\
\text{Signalpower} &= 1.167 + 2.039e14 \times ff - 2.039e14 \times lut \\
\text{Logicpower} &= 0.0001
\end{aligned}
\]

(25)-(28)

6.7 Regression model for AND gate

The AND gate IP is instantiated using different configurations. The model has been created from the synthesis report. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (29) to Eq. (32). This model is also applicable to the OR, XOR and NOT gates used in the ALU design. The comparative analysis for different configurations is given in Table 8.

\[
\begin{aligned}
\text{Outputpower} &= 4.769 \times out\_pin - 0.2205 \\
\text{Logicpower} &= -213.9 \times \exp(2.685 \times lut) + 1.001
\end{aligned}
\]

6.8 Regression model for delay

The delay element is created using a D flip-flop. The delay IP has been modified for different input vector lengths. The dynamic power equations obtained are given by Eq. (35) to Eq. (38). The comparative analysis of the delay IP for different configurations is reported in [14].

\[
\begin{aligned}
\text{Outputpower} &= 8.007 - 5.005 \cos(lut \times 0.0938) - 3.231 \sin(lut \times 0.0938) \\
&\quad - 1.582 \cos(2 \times lut \times 0.0938) - 0.02673 \sin(2 \times lut \times 0.0938) \\
\text{Clockpower} &= 0.9166 - 0.177 \cos(lut \times 0.1848) - 0.1096 \sin(lut \times 0.1848) \\
&\quad - 0.3403 \cos(2 \times lut \times 0.1848) - 0.6834 \sin(2 \times lut \times 0.1848)
\end{aligned}
\]

Table 8. Comparative analysis of AND gate block

<table>
<thead>
<tr>
<th>AND gate configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>4</td>
<td>140</td>
<td>139.86</td>
<td>0.09</td>
</tr>
<tr>
<td>8</td>
<td>160</td>
<td>159.91</td>
<td>0.05</td>
</tr>
<tr>
<td>16</td>
<td>199</td>
<td>200.09</td>
<td>0.55</td>
</tr>
<tr>
<td>32</td>
<td>282</td>
<td>280.38</td>
<td>0.57</td>
</tr>
<tr>
<td>48</td>
<td>362</td>
<td>359.69</td>
<td>0.63</td>
</tr>
<tr>
<td>64</td>
<td>442</td>
<td>437.99</td>
<td>0.91</td>
</tr>
</tbody>
</table>

6.9 Regression model for accumulator

Table 9. Comparative analysis of embedded accumulator block

<table>
<thead>
<tr>
<th>Accumulator configurations</th>
<th>Estimated power values from commercial tool (mW)</th>
<th>Estimated power values from regression-based model (mW)</th>
<th>% Error</th>
</tr>
</thead>
<tbody>
<tr>
<td>8</td>
<td>130</td>
<td>131.1582</td>
<td>0.89</td>
</tr>
<tr>
<td>16</td>
<td>140</td>
<td>140.7182</td>
<td>0.51</td>
</tr>
<tr>
<td>24</td>
<td>150</td>
<td>150.2782</td>
<td>0.18</td>
</tr>
<tr>
<td>32</td>
<td>161</td>
<td>159.8382</td>
<td>0.72</td>
</tr>
<tr>
<td>48</td>
<td>181</td>
<td>178.9582</td>
<td>1.12</td>
</tr>
<tr>
<td>64</td>
<td>202</td>
<td>198.0782</td>
<td>1.9</td>
</tr>
</tbody>
</table>

The accumulator is customized for different output widths. The analytical model has been created from the post-synthesis report. The dynamic power equations obtained using curve-fitting and regression techniques are given by Eq. (37) to Eq. (39). The comparative analysis for different accumulator configurations is given in Table 9.

\[
\begin{aligned}
\text{Output power} &= 1.195 \times out\_pin - 0.402 \\
\text{Logic power} &= 0.0001 \\
\text{Signal power} &= \text{Clock power} = 1
\end{aligned}
\]
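The accumulator model can be checked against Table 9 with a few lines of code. The sketch below sums the four dynamic components given above, adds the 120 mW static power from Section 4, and compares the result with the regression-model column of Table 9. Reading the configuration value as the number of output pins is an assumption of this illustration, as is the function name; under that reading the totals agree with the table to within 0.001 mW.

```python
STATIC_POWER_MW = 120.0  # static power assumed constant in this work


def accumulator_total_power_mw(out_pin):
    """Accumulator dynamic components plus static power, in mW."""
    output = 1.195 * out_pin - 0.402
    logic = 0.0001
    signal = 1.0
    clock = 1.0
    return STATIC_POWER_MW + output + logic + signal + clock


# Regression-model column of Table 9, keyed by accumulator configuration.
table9_model_mw = {8: 131.1582, 16: 140.7182, 24: 150.2782,
                   32: 159.8382, 48: 178.9582, 64: 198.0782}

accumulator_diffs = {}
for width, reported in table9_model_mw.items():
    estimate = accumulator_total_power_mw(width)  # width taken as out_pin
    accumulator_diffs[width] = estimate - reported
    print(f"{width:>2}: {estimate:.4f} mW (Table 9: {reported} mW)")
```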

7. POWER ESTIMATION OF COMPLETE SYSTEM

The power of the various DSP blocks is estimated in three ways. First, the complete system is designed using the Vivado tool, and the power values obtained from the tool are used as the reference for validating the identities. Second, the power of all DSP blocks is estimated using the identity proposed by Elleouet et al. [3]. Third, the power is estimated using the identity proposed in this work. All three methods are discussed in detail below.

7.1 Power of DSP block by Vivado tool

Various DSP blocks have been designed by connecting different IP cores for power estimation and validation. The architectures of the cascaded and non-cascaded blocks are configured using the required embedded and user-defined IPs. The investigation has been carried out at a frequency of 125 MHz. The estimated power of each DSP block is given in Table 10.

Table 10. Power estimation of complete DSP systems using tool

<table>
<thead>
<tr>
<th>S. No.</th>
<th>Cascaded blocks</th>
<th>Power (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>MAC Unit</td>
<td>143</td>
</tr>
<tr>
<td>2</td>
<td>ALU</td>
<td>226</td>
</tr>
<tr>
<td>3</td>
<td>Barrel Shifter</td>
<td>126</td>
</tr>
<tr>
<td>4</td>
<td>Carry Save Adder</td>
<td>126</td>
</tr>
<tr>
<td>5</td>
<td>SISO</td>
<td>121</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>S. No.</th>
<th>Non-cascaded blocks</th>
<th>Power (mW)</th>
</tr>
</thead>
<tbody>
<tr>
<td>1</td>
<td>Carry Ripple Adder</td>
<td>124</td>
</tr>
<tr>
<td>2</td>
<td>Carry Skip Adder</td>
<td>125</td>
</tr>
<tr>
<td>3</td>
<td>SIPO</td>
<td>122</td>
</tr>
<tr>
<td>4</td>
<td>PIPO</td>
<td>122</td>
</tr>
<tr>
<td>5</td>
<td>PISO</td>
<td>127</td>
</tr>
</tbody>
</table>

7.2 Power estimation of DSP block by identity proposed by Elleouet et al. [3]

As per Elleouet et al. [3], the power of a system comprising N IPs is the sum of the dynamic power of the N IPs and the power of the FPGA configuration plan, as shown in Eq. (40).

\[
\text{Power}_{\text{System}} = \sum \text{Power}_{\text{dynamic of each IP}} + \text{Power}_{\text{FPGA Configuration Plan}}
\]

(40)

Power estimation of the FIR filter is discussed in detail here for reference [3, 14]. The same method has been adopted for the other DSP blocks. Since the FIR filter consists of multiplier, adder and delay IPs, its power equation can be written as Eq. (41).

\[
\text{Power}_{\text{FIR System}} = \sum \text{Power}_{\text{dynamic multipliers}} + \sum \text{Power}_{\text{dynamic adders}} + \sum \text{Power}_{\text{dynamic delays}} + \text{Power}_{\text{FPGA Configuration Plan}}
\]

(41)

For the FIR filter designed in this work, the dynamic power of a single IP of each type, estimated through the regression-based models, and the power of the FPGA configuration plan are given by Eq. (42) to Eq. (45).

\[
\text{Power}_{\text{dynamic multipliers}} = 17.791 \text{ mW}
\]

(42)

\[
\text{Power}_{\text{dynamic adders}} = 15.7821 \text{ mW}
\]

(43)

\[
\text{Power}_{\text{dynamic delay element}} = 1.0134 \text{ mW}
\]

(44)

\[
\text{Power}_{\text{FPGA Configuration Plan}} = 120 \text{ mW}
\]

(45)

The 4-tap FIR filter designed in this work has four multipliers, three adders and three delay elements, so the total power of the FIR system using Eq. (41) is given by Eq. (46).

\[
\text{Power}_{\text{FIR System}} = 17.791 \times 4 + 3 \times 15.7821 + 1.0134 \times 3 + 120 = 241.55 \text{ mW}
\]

(46)
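The additive identity of Eq. (40), applied to the FIR filter as in Eq. (46), can be sketched as follows. The per-IP dynamic power values and IP counts are those given in the text; the function and variable names are assumptions made for this illustration.

```python
def additive_identity_mw(dynamic_mw, counts, config_plan_mw):
    """Eq. (40): sum of per-IP dynamic power plus configuration-plan power."""
    ip_sum = sum(dynamic_mw[name] * count for name, count in counts.items())
    return ip_sum + config_plan_mw


# Per-IP dynamic power from Eq. (42)-(44) and the 4-tap FIR composition.
fir_dynamic_mw = {"multiplier": 17.791, "adder": 15.7821, "delay": 1.0134}
fir_counts = {"multiplier": 4, "adder": 3, "delay": 3}

fir_additive_mw = additive_identity_mw(fir_dynamic_mw, fir_counts, 120.0)
print(f"FIR power by additive identity: {fir_additive_mw:.2f} mW")
```

The unrounded result of this computation is 241.5505 mW, which the text reports as 241.55 mW.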

The total power of the FIR filter reported by the commercial tool is 143 mW, whereas the power estimated using the identity of Elleouet et al. [3] is 241.55 mW. The error calculated using Eq. (8) is 68.91%, which shows that the identity gives an inaccurate result for this block. The power values of the other DSP blocks were calculated similarly and checked for accuracy against the commercial tool. Based on these results, the power values obtained using the identity of Elleouet et al. [3] deviate substantially from those of the commercial tool.

7.3 Proposed power estimation identity

significant in contrast to the output power of output stage IP. Thus, a total system power obtained by simply adding the dynamic power of the individual IP cores to the power of the FPGA configuration plan [3] deviates substantially from the value reported by the commercial tool. In this work, therefore, the effect of interconnection on total power is taken into account, and a new identity, given by Eq. (47), is proposed for estimating the power of cascaded systems based on IP modeling.

\[
\text{Power}_{\text{System}} = \sum \text{Power}_{\text{(Dynamic of each IP)}} - \sum \text{Power}_{\text{(Interconnection)}} + \text{Power}_{\text{(FPGA Configuration Plan)}}
\]  

(47)

where \(\text{Power}_{\text{(Interconnection)}}\) is the output power of the intermediate-stage and input-stage IPs in a cascaded system. For non-cascaded systems, the term \(\sum \text{Power}_{\text{(Interconnection)}}\) is approximately zero, so the proposed identity reduces to the one proposed by Elleouet et al. [3]. Expanding the proposed identity for the FIR filter, the power equation can be written as Eq. (48).

\[
\text{Power}_{\text{(FIR System)}} = \sum \text{Power}_{\text{dynamic multipliers}} + \sum \text{Power}_{\text{dynamic adders}} + \sum \text{Power}_{\text{dynamic delays}} - \sum \text{Power}_{\text{output multipliers}} - \sum \text{Power}_{\text{output adders}} - \sum \text{Power}_{\text{output delays}} + \text{Power}_{\text{(FPGA Configuration Plan)}}
\]  

(48)

Here the subtracted sums run only over the input-stage and intermediate-stage IPs; the output power of the output-stage IP is retained.

The power of the FPGA configuration plan in this work is 120 mW. The dynamic power and output power of a single instance of each IP used in the FIR system, calculated using the curve-fitting and regression-based models, are given by Eqs. (49) to (54).

\[
\text{Power}_{\text{dynamic multiplier}} = 17.791\ \text{mW}
\]  

(49)

\[
\text{Power}_{\text{dynamic adder}} = 15.7821\ \text{mW}
\]  

(50)

\[
\text{Power}_{\text{dynamic delay}} = 1.0134\ \text{mW}
\]  

(51)

\[
\text{Power}_{\text{output multiplier}} = 16.55\ \text{mW}
\]  

(52)

\[
\text{Power}_{\text{output adder}} = 13.78\ \text{mW}
\]  

(53)

\[
\text{Power}_{\text{output delay}} = 1.098\ \text{mW}
\]  

(54)

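In the FIR filter, the output stage consists of one adder IP, the input stage consists of one multiplier IP and one delay IP, and the intermediate stage consists of three multiplier IPs, two delay IPs and two adder IPs. The design therefore contains four multipliers, three adders and three delays in total, and the interconnection term collects the output power of the four multipliers, three delays and two adders in the input and intermediate stages. Thus, the power of the FIR filter as per the proposed identity is given by Eq. (55).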

\[
\text{Power}_{\text{(FIR System)}} = 4 \times 17.791 + 3 \times 15.7821 + 3 \times 1.0134 - (3 \times 1.098 + 4 \times 16.55 + 2 \times 13.78) + 120 = 144.49\ \text{mW}
\]  

(55)

The total power obtained using the commercial tool is 143 mW, and the proposed identity gives 144.49 mW for the FIR filter. The error calculated using Eq. (8) is 1.04%, which indicates that the identity produces an accurate result with respect to the commercial tool. The power values of the other DSP blocks have likewise been calculated using the proposed identity and checked for accuracy against the commercial tool.
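
The FIR calculation can be reproduced with a short script. The following is a minimal sketch, not the authors' MATLAB implementation: the per-IP values are those of Eqs. (49) to (54), the stage composition is the one described above, and the function and variable names are illustrative. Eq. (8) is not restated in this section, so the error is assumed here to be the relative deviation from the tool value.

```python
# Per-IP values from Eqs. (49)-(54), in mW.
DYNAMIC_MW = {"multiplier": 17.791, "adder": 15.7821, "delay": 1.0134}
OUTPUT_MW = {"multiplier": 16.55, "adder": 13.78, "delay": 1.098}
CONFIG_PLAN_MW = 120.0  # FPGA configuration plan power
TOOL_TOTAL_MW = 143.0   # total FIR power reported by the commercial tool

# Number of IPs of each type per stage of the FIR filter.
FIR_STAGES = {
    "input": {"multiplier": 1, "delay": 1},
    "intermediate": {"multiplier": 3, "delay": 2, "adder": 2},
    "output": {"adder": 1},
}


def additive_power(stages):
    """Dynamic power of every IP plus configuration plan power (identity of [3])."""
    dynamic = sum(
        DYNAMIC_MW[ip] * count
        for counts in stages.values()
        for ip, count in counts.items()
    )
    return dynamic + CONFIG_PLAN_MW


def interconnection_power(stages):
    """Output power of the input-stage and intermediate-stage IPs."""
    return sum(
        OUTPUT_MW[ip] * count
        for stage, counts in stages.items()
        if stage != "output"
        for ip, count in counts.items()
    )


def proposed_power(stages):
    """Proposed identity, Eq. (47): additive estimate minus interconnection term."""
    return additive_power(stages) - interconnection_power(stages)


def percent_error(estimate, reference):
    # Assumed form of Eq. (8): relative deviation from the tool value.
    return abs(estimate - reference) / reference * 100.0


additive = additive_power(FIR_STAGES)
proposed = proposed_power(FIR_STAGES)
print(f"identity of [3]:   {additive:.2f} mW, "
      f"error {percent_error(additive, TOOL_TOTAL_MW):.2f} %")
print(f"proposed identity: {proposed:.2f} mW, "
      f"error {percent_error(proposed, TOOL_TOTAL_MW):.2f} %")
```

The script should agree with the values quoted in the text to within rounding of the last digit. Setting the interconnection term to zero, as for a non-cascaded block, makes both functions return the same value.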

8. RESULT & ANALYSIS

The analysis of the DSP blocks has been carried out at a frequency of 125 MHz. The comparison results for cascaded and non-cascaded DSP blocks with reference to the commercial tool are shown in Figure 3 and Figure 4, respectively. The results show that the identity proposed by Elleouet et al. [3] is reasonably accurate for non-cascaded blocks: the maximum error, obtained for a complex non-cascaded block such as the carry ripple adder, is 3.96%, as shown in Figure 6. However, the percentage error is very large for more complex cascaded blocks such as the FIR filter, ALU, MAC unit and barrel shifter. For the SISO cascaded block, the percentage error obtained using the identity of Elleouet et al. [3] is only 2.52%, because its architecture is fairly simple and consists only of D flip-flop IPs. Even so, the error is reduced to 0.08% using the proposed identity.
The error obtained for complex cascaded circuits with reference to the commercial tool indicates that the identity proposed by Elleouet et al. [3] gives inaccurate results, particularly for cascaded systems. When the power of cascaded systems is calculated using the proposed identity, however, the error against the commercial tool is very low: the maximum error, obtained for a fairly complex circuit, the ALU, is only 6.97%. The error graph for cascaded DSP blocks in Figure 5 indicates that the proposed identity based on IP modeling accurately estimates the power of cascaded DSP blocks. Since the proposed identity is the same as the identity of Elleouet et al. [3] for non-cascaded DSP blocks, the error values obtained for these blocks are the same with both identities.

9. COMPARISON OF THE PROPOSED METHODOLOGY WITH THE COMMERCIAL TOOL

Accurate power estimation early in the design cycle is a major need today. For complex systems, it may take 40-45 min to obtain the power values. Therefore, in the proposed work, the power models of the individual IP cores are created from post-synthesis data only, which saves design implementation time. Moreover, once the models have been created for the individual IP cores, they can be reused to approximate the power of any system constructed from these IP cores.

The proposed methodology estimates the total power of a complete system consisting of the required number of IPs from the power values given by the power models of the individual IPs. Hence, the power of the complete system can be approximated quickly and accurately without running the commercial tool, based only on knowledge of the individual IP cores used in the design. This approach improves design efficiency and helps designers develop power-efficient systems quickly.

To demonstrate this, Table 11 compares the execution time for complete systems using the commercial tool (Vivado) and using the proposed methodology. The design execution time is the time the commercial tool takes to generate the power of a design. The execution time of the proposed methodology has been measured using the MATLAB tic and toc functions. The models are implemented in MATLAB R2013a on a 64-bit Windows OS with an Intel Core i5 processor (~3.6 GHz). The time values may vary with different hardware, operating systems and programming languages.
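
For readers working outside MATLAB, the same kind of measurement can be sketched in Python. This is only an illustration under stated assumptions: `time.perf_counter` stands in for the tic/toc pair, the helper `time_call` is a hypothetical name, and the timed function is a stand-in that evaluates the arithmetic of Eq. (55) rather than the regression models themselves, so its timing is not comparable with the values in Table 11.

```python
import time


def time_call(fn, repeats=1000):
    """Call fn() `repeats` times; return its result and the best elapsed time in ms."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()            # plays the role of MATLAB's tic
        result = fn()
        elapsed = time.perf_counter() - start  # plays the role of MATLAB's toc
        best = min(best, elapsed)
    return result, best * 1e3


def fir_power_mw():
    # Stand-in for a model evaluation: the FIR arithmetic of Eq. (55), in mW.
    return (4 * 17.791 + 3 * 15.7821 + 3 * 1.0134
            - (3 * 1.098 + 4 * 16.55 + 2 * 13.78) + 120.0)


power, elapsed_ms = time_call(fir_power_mw)
print(f"{power:.2f} mW evaluated in {elapsed_ms:.6f} ms")
```

Taking the best of several repetitions reduces the influence of background activity on a very short measurement; a single tic/toc pair around one evaluation is the simpler alternative.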

From the time values reported in Table 11, the proposed methodology estimates the total power of a system in a few milliseconds, while the commercial tool takes more than 1 minute. This difference is for simple designs; for complex designs it may be much larger.

Table 11. Comparison of execution time for different systems

<table>
<thead>
<tr>
<th>IP based system</th>
<th>Design execution time using commercial tool</th>
<th>Elapsed time using MATLAB</th>
</tr>
</thead>
<tbody>
<tr>
<td>SIPO</td>
<td>01 min 26 s</td>
<td>1.5 ms</td>
</tr>
<tr>
<td>PIPO</td>
<td>02 min 22 s</td>
<td>1.6 ms</td>
</tr>
<tr>
<td>Carry Skip Adder</td>
<td>01 min 42 s</td>
<td>1.67 ms</td>
</tr>
<tr>
<td>Carry Ripple Adder</td>
<td>02 min 13 s</td>
<td>2.09 ms</td>
</tr>
<tr>
<td>PISO</td>
<td>01 min 51 s</td>
<td>1.79 ms</td>
</tr>
<tr>
<td>Carry Save Adder</td>
<td>01 min 36 s</td>
<td>1.47 ms</td>
</tr>
<tr>
<td>Barrel Shifter</td>
<td>01 min 57 s</td>
<td>3.3 ms</td>
</tr>
<tr>
<td>ALU</td>
<td>02 min 16 s</td>
<td>3.2 ms</td>
</tr>
<tr>
<td>FIR</td>
<td>01 min 37 s</td>
<td>1.2 ms</td>
</tr>
<tr>
<td>MAC</td>
<td>01 min 56 s</td>
<td>3.1 ms</td>
</tr>
<tr>
<td>SISO</td>
<td>01 min 43 s</td>
<td>1.4 ms</td>
</tr>
<tr>
<td><strong>Test Designs</strong></td>
<td></td>
<td></td>
</tr>
<tr>
<td>QPSK</td>
<td>06 min 43 s</td>
<td>3.8 ms</td>
</tr>
<tr>
<td>BPSK</td>
<td>04 min 53 s</td>
<td>3.1 ms</td>
</tr>
</tbody>
</table>
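
The gap between the two columns of Table 11 is easier to judge as a ratio. The sketch below copies the times from the table and divides the tool time by the model time for each system; the data layout and names are illustrative choices, not part of the original methodology.

```python
# (system, tool minutes, tool seconds, model time in ms), copied from Table 11.
TABLE_11 = [
    ("SIPO", 1, 26, 1.5),
    ("PIPO", 2, 22, 1.6),
    ("Carry Skip Adder", 1, 42, 1.67),
    ("Carry Ripple Adder", 2, 13, 2.09),
    ("PISO", 1, 51, 1.79),
    ("Carry Save Adder", 1, 36, 1.47),
    ("Barrel Shifter", 1, 57, 3.3),
    ("ALU", 2, 16, 3.2),
    ("FIR", 1, 37, 1.2),
    ("MAC", 1, 56, 3.1),
    ("SISO", 1, 43, 1.4),
    ("QPSK", 6, 43, 3.8),
    ("BPSK", 4, 53, 3.1),
]


def speedup(minutes, seconds, model_ms):
    """Ratio of the tool execution time to the model evaluation time."""
    tool_s = 60 * minutes + seconds
    return tool_s / (model_ms / 1000.0)


speedups = {name: speedup(m, s, ms) for name, m, s, ms in TABLE_11}
for name, ratio in speedups.items():
    print(f"{name:<20s}{ratio:>12,.0f}x")
```

For every system in the table the ratio falls between about 10^4 and 10^5, which is what the comparison of minutes against milliseconds suggests.
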

10. ANALYSIS OF THE PROPOSED MODELS AT DIFFERENT FREQUENCIES

The curve-fitting and regression-based model proposed in this work for individual IP cores generalizes to all frequencies, as depicted in Table 12. The resource utilization remains the same at all frequencies, and since the model for each IP core is based on resource utilization, it is expected to work accurately at all frequencies. The dynamic power varies in direct proportion to the frequency: if the dynamic power is p1 at frequency f1, then it is a*p1 at frequency a*f1. Thus, doubling the frequency doubles the dynamic power. From the results obtained for the multiplier IP core in the 8x8 configuration at different frequencies, it can be concluded that the dynamic power at each frequency can be obtained by simply multiplying the dynamic power at the base frequency by the scaling factor (i.e., the factor by which the frequency is scaled). The % error obtained at different frequencies also shows that the proposed model remains accurate at higher frequencies, with the error staying below 2% up to 500 MHz. Thus, with the proposed methodology, the total power can be approximated quickly and accurately at different frequencies.

<table>
<thead>
<tr>
<th>Frequency (MHz)</th>
<th>Multiplier configuration</th>
<th>Dynamic power from Vivado (mW)</th>
<th>Dynamic power from proposed model (mW)</th>
<th>Total power from Vivado (mW)</th>
<th>Total power from proposed model (mW)</th>
<th>Error (%)</th>
</tr>
</thead>
<tbody>
<tr>
<td>125</td>
<td>8X8</td>
<td>19</td>
<td>17.79</td>
<td>139</td>
<td>137.79</td>
<td>0.87</td>
</tr>
<tr>
<td>250</td>
<td>8X8</td>
<td>37</td>
<td>35.58</td>
<td>157</td>
<td>155.58</td>
<td>0.91</td>
</tr>
<tr>
<td>375</td>
<td>8X8</td>
<td>55</td>
<td>53.37</td>
<td>175</td>
<td>173.37</td>
<td>0.93</td>
</tr>
<tr>
<td>500</td>
<td>8X8</td>
<td>75</td>
<td>71.16</td>
<td>195</td>
<td>191.16</td>
<td>1.96</td>
</tr>
</tbody>
</table>
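
The frequency scaling described above can be checked against the table with a few lines of code. This is a minimal sketch under the stated proportionality assumption: the 125 MHz model value and the Vivado totals are taken from the table, the 120 mW configuration plan power is the value stated for this work, and the function names are illustrative. The error is computed as the relative deviation from the Vivado total, which is assumed to match the definition used for the %Error column.

```python
BASE_FREQ_MHZ = 125.0
BASE_DYNAMIC_MW = 17.79  # 8x8 multiplier, model value at 125 MHz
CONFIG_PLAN_MW = 120.0   # FPGA configuration plan power

# (frequency in MHz, total power from Vivado in mW), copied from the table.
VIVADO_TOTAL_MW = [(125, 139.0), (250, 157.0), (375, 175.0), (500, 195.0)]


def dynamic_power_mw(freq_mhz):
    """Scale the base dynamic power by the frequency ratio a = f / f1."""
    return BASE_DYNAMIC_MW * (freq_mhz / BASE_FREQ_MHZ)


def total_power_mw(freq_mhz):
    """Scaled dynamic power plus the frequency-independent configuration plan power."""
    return dynamic_power_mw(freq_mhz) + CONFIG_PLAN_MW


errors = {}
for freq, vivado in VIVADO_TOTAL_MW:
    model = total_power_mw(freq)
    errors[freq] = abs(vivado - model) / vivado * 100.0
    print(f"{freq:>4d} MHz: model {model:.2f} mW, "
          f"Vivado {vivado:.0f} mW, error {errors[freq]:.2f} %")
```

Only the dynamic term is scaled; the configuration plan power is held constant, which is consistent with the 120 mW difference between the total and dynamic columns in every row of the table.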

11. CONCLUSION

In this work, different DSP blocks have been analyzed for power and categorized as cascaded and non-cascaded blocks. From the results obtained for the various DSP blocks, it can be concluded that the power obtained using Eq. (41) is inaccurate, particularly for complex cascaded systems, although the model is fairly accurate for non-cascaded circuits. The maximum error obtained for cascaded circuits is 82.84%, which is very large. This indicates that the identity proposed by Elleouet et al. [3] needs reconsideration, particularly for cascaded systems. To address this shortcoming, a power estimation identity for a complete system designed using an IP modeling approach has been proposed in this work by considering cascaded DSP blocks at the RTL level. The results show that the proposed identity is more accurate for cascaded systems than the identity of Elleouet et al. [3]. The maximum error obtained using the proposed identity, for the ALU, is only 6.97%, which is very low in comparison with the error obtained using the identity of Elleouet et al. [3]. Based on these results, the proposed identity is generic for cascaded and non-cascaded DSP systems and is expected to be applicable to other systems as well.

ACKNOWLEDGMENT

This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors. The authors would like to thank the editor and anonymous reviewers for their comments, which helped improve the quality of this work.

REFERENCES


