Parallel implementation of resampling methods for particle filtering on graphics processing units

Matthew A. Nicely

Follow this and additional works at: https://louis.uah.edu/uah-dissertations

Recommended Citation
https://louis.uah.edu/uah-dissertations/181
PARALLEL IMPLEMENTATION OF RESAMPLING METHODS FOR PARTICLE FILTERING ON GRAPHICS PROCESSING UNITS

by

MATTHEW A. NICELY

A DISSERTATION

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in The Department of Electrical and Computer Engineering to The School of Graduate Studies of The University of Alabama in Huntsville

HUNTSVILLE, ALABAMA

2019
In presenting this dissertation in partial fulfillment of the requirements for a doctoral degree from The University of Alabama in Huntsville, I agree that the Library of this University shall make it freely available for inspection. I further agree that permission for extensive copying for scholarly purposes may be granted by my advisor or, in his/her absence, by the Chair of the Department or the Dean of the School of Graduate Studies. It is also understood that due recognition shall be given to me and to The University of Alabama in Huntsville in any scholarly use which may be made of any material in this dissertation.

Matthew A. Nicely

[Signature]

10/5/19

(date)
Dissertation Approval Form

Submitted by Matthew A. Nicey in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering and accepted on behalf of the Faculty of the School of Graduate Studies by the dissertation committee.

We, the undersigned members of the Graduate Faculty of The University of Alabama in Huntsville, certify that we have advised and/or supervised the candidate of the work described in this dissertation. We further certify that we have reviewed the dissertation manuscript and approve it in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering.

Dr. Buren E. Wells 11/5/2019  Committee Chair

Dr. Aleksandar Milenkovic 11/5/2019

Dr. Laurie L. Joiner 11/5/19

Dr. Swaguru Ravindran 11/8/19

Dr. Arie Nakashim 11/8/19

Dr. Ravi Gorur 11/12/19  Department Chair

Dr. Shankar Mahalingam 11/12/19  College Dean

Dr. David Berkowitz 11/12/19  Graduate Dean
Particle filter techniques are common methods used to estimate the evolving state of nonlinear, non-Gaussian time-variant systems by utilizing a periodic sequence of noisy measurements. The accuracy of particle filtering has often been shown to be superior to other state estimation techniques, such as the extended Kalman filter (EKF), for many applications. Unfortunately, the high computational cost and highly nondeterministic run-time behavior of particle filters often precludes their use in hard, real-time environments where filter response must meet the strict timing requirements of the application. Particle filter algorithms are composed of three main stages: prediction, update, and resampling.

General purpose graphics processing units (GPGPUs) have been successfully employed in previous research to accelerate the computation of both the prediction and update stages by exploiting their natural fine-grain parallelism. This research focuses on accelerating the resampling stage for GPGPU execution, which is much more difficult to parallelize due to its inherent sequentially. This dissertation introduces a novel GPGPU implementation of the systematic and stratified resampling
algorithms that exploit the monotonically increasing nature of the prefix-sum and the evolutionary nature of the particle weighting process to allow the re-indexing portion of the algorithms to occur in a two phase, multi-threaded manner. This resulting measured factor of performance improvement for the systematic and stratified algorithms, when executed on a GPGPU, was 16x and 33x, respectively, over the serial implementations.

This dissertation also describes the steps taken to optimize the naive parallel implementation to mitigate limitations such as memory dependencies stalls, low thread execution efficiency, and sub-optimal occupancy. These steps range from minimizing global memory accesses, utilizing shared memory, manually coding software prefetching, and grid-stride looping. In all, the optimizations provide a 2x - 4x improvement to execution times over the naïve implementation of the resampling methods. Mathematical analysis is provided to describe algorithm efficiencies regarded to time, work, and space. L'Hôpital's rule shows that while time complexity can be reduced from $O(N) \rightarrow O(1)$, for the best case scenario, it costs double the amount of work. Lastly, multiple pseudorandom number generators (PRNGs) are evaluated with different implementation techniques to investigate statistical quality and timing performance as it pertains to particle filters.

Abstract Approval: Committee Chair

Department Chair

Graduate Dean
I would like to express the deepest appreciation to my committee chair, Dr. Buren E. Wells, who has provided continuous guidance and support during my dissertation research over the past several years. I am sincerely grateful to Dr. Aleksander Milenkovic for advising me to continue researching parallel programming on embedded platforms. That guidance created the foundation that eventually led to this research. Also, I would like to thank Dr. Laurie Joiner for her assistance when I began graduate school and her continued support during my years at UAH. Lastly, a very special thanks to Ms. Jacqueline Siniard, who has been there for me any time I have had questions or issues in the course of the graduate process.

I will be eternally grateful to Dr. Thomas Kelly, who counseled me daily for five years during my PhD research. Whether listening to me complain over testing results or advising me through research roadblocks, I sincerely believe I would not have made it to where I am today without his help.

Finally, I would like to thank James Ruf, who helped me build my first computer in high school. It was that computer and the time he spent teaching me everything he knew that sent me down the path of becoming an engineer. Thanks to him I am working in my dream career.
# TABLE OF CONTENTS

List of Figures xii

List of Tables xiv

List of Acronyms xv

List of Symbols xviii

Chapter

1 Introduction 1

1.1 Problem Formulation 2

1.2 Contributions 3

1.2.1 Parallel Systematic/Stratified Resampling 3

1.2.2 Random Number Generation Requirements 4

1.2.3 Parallelization Strategies 4

1.3 Outline 5

2 Particle Filters 6

2.1 Bootstrap Particle Filter 8

2.2 Marginalized Particle Filter 10

2.3 Regularized Particle Filter 12

2.4 Resampling Techniques 15
2.4.1 Multinomial ......................................................... 17
2.4.2 Stratified ......................................................... 18
2.4.3 Systematic ......................................................... 19
2.4.4 Residual ......................................................... 19
2.4.5 Metropolis ......................................................... 21
2.4.6 Rejection ......................................................... 22
2.4.7 Coalesced Metropolis ........................................... 23

3 Graphics Processing Unit (GPU) .................................. 27
  3.1 General Purpose Graphics Processing Unit (GPGPU) ........ 28
  3.2 Programming Model ............................................. 29
    3.2.1 Memory Model ............................................... 30
      3.2.1.1 Global Memory ........................................ 31
      3.2.1.2 Local Memory ......................................... 32
      3.2.1.3 Shared Memory ........................................ 32
      3.2.1.4 Constant Memory ....................................... 33
      3.2.1.5 Texture Memory ........................................ 33
      3.2.1.6 Registers .............................................. 34
  3.3 Libraries and Tools ............................................ 34
    3.3.1 CUDA UnBound (CUB) ...................................... 35
    3.3.2 cuRAND ...................................................... 35
    3.3.3 cuBLAS ...................................................... 36
    3.3.4 CUDA-MEMCHECK ........................................... 36
3.3.5 Profilers .......................................................... 37

3.4 Architectures ....................................................... 38

3.4.1 Kepler ............................................................ 39

3.4.2 Maxwell ........................................................... 43

3.4.3 Pascal ............................................................. 47

3.4.4 Volta ............................................................... 49

3.4.5 Turing ............................................................. 52

3.4.6 Embedded Devices ............................................... 54

4 Parallel Resampling Technique ................................. 56

4.1 Algorithm Description ........................................... 56

4.2 Optimizations To Parallel Implementation .................... 60

4.2.1 Memory dependency stalls .................................... 60

4.2.1.1 Utilizing Shared Memory ................................. 63

4.2.1.2 Cooperative Groups ...................................... 65

4.2.1.3 Software Prefetching ..................................... 67

4.2.2 Thread Execution Efficiency ................................. 69

4.2.3 Sub-optimal Occupancy ....................................... 72

4.2.4 Grid-Stride Looping ........................................... 74

4.2.5 CUDA Streams .................................................. 77

4.2.6 Algorithmic Efficiency ....................................... 79

4.2.6.1 Time Complexity ......................................... 79

4.2.6.2 Work Efficiency .......................................... 82
4.2.6.3 Space Efficiency ........................................ 86

5 Pseudorandom Number Generators .................................. 87

5.1 Background ....................................................... 88

5.1.1 Single PRNG - Multiple Substreams ......................... 91

5.1.2 Single PRNG - Multiple Seeds ................................ 93

5.1.3 Statistical Experiments ....................................... 95

5.2 Random Number Generator Implications ......................... 96

6 Testing and Analysis ............................................... 104

6.1 Experimentation Setup ........................................... 104

6.2 Simulation Results ............................................... 106

6.3 Timing Across Multiple Variances ................................ 111

6.4 Timing of Worst Case Scenario .................................. 113

7 Conclusion .......................................................... 116

7.1 Summary .......................................................... 116

7.2 Future Work ....................................................... 118

7.2.1 Multi-core Implementation .................................... 118

7.2.2 Custom Prefix-Sum Kernel ................................... 119

7.2.3 Particle Filter Implementation ................................. 119

APPENDIX A: ............................................................. 122

A.1 Systematic/Stratified (Naive): Up Kernel ...................... 122

A.2 Systematic/Stratified (Naive): Down Kernel ................... 123
A.3 Systematic/Stratified (Shared Memory): Up Kernel 124
A.4 Systematic/Stratified (Shared Memory): Down Kernel 125
A.5 Systematic/Stratified (SM/Prefetch) Up Kernel 126
A.6 Systematic/Stratified (SM/Prefetch): Down Kernel 127
A.7 Systematic/Stratified (SM/Prefetch/2 Warps) Up Kernel 128
A.8 Systematic/Stratified (SM/Prefetch/2 Warps): Down Kernel 130
A.9 Metropolis (Naive) 132
A.10 Rejection 133
A.11 Metropolis (Coalesced 1) 134
A.12 Metropolis (Coalesced 2) 135

APPENDIX B: 136

B.1 Variance Effect on Naive Implementation: Systematic 136
B.2 Variance Effect on Improved Implementation: Systematic 140

REFERENCES 144
# LIST OF FIGURES

<table>
<thead>
<tr>
<th>FIGURE</th>
<th>PAGE</th>
</tr>
</thead>
<tbody>
<tr>
<td>2.1</td>
<td>17</td>
</tr>
<tr>
<td>2.2</td>
<td>21</td>
</tr>
<tr>
<td>3.1</td>
<td>31</td>
</tr>
<tr>
<td>3.2</td>
<td>38</td>
</tr>
<tr>
<td>3.3</td>
<td>41</td>
</tr>
<tr>
<td>3.4</td>
<td>42</td>
</tr>
<tr>
<td>3.5</td>
<td>44</td>
</tr>
<tr>
<td>3.6</td>
<td>46</td>
</tr>
<tr>
<td>3.7</td>
<td>48</td>
</tr>
<tr>
<td>3.8</td>
<td>50</td>
</tr>
<tr>
<td>3.9</td>
<td>51</td>
</tr>
<tr>
<td>3.10</td>
<td>53</td>
</tr>
<tr>
<td>3.11</td>
<td>55</td>
</tr>
<tr>
<td>4.1</td>
<td>58</td>
</tr>
<tr>
<td>4.2</td>
<td>62</td>
</tr>
<tr>
<td>4.3</td>
<td>63</td>
</tr>
<tr>
<td>4.4</td>
<td>64</td>
</tr>
<tr>
<td>4.5</td>
<td>67</td>
</tr>
</tbody>
</table>
## LIST OF TABLES

<table>
<thead>
<tr>
<th>TABLE</th>
<th>PAGE</th>
</tr>
</thead>
<tbody>
<tr>
<td>3.1 GPU Resources</td>
<td>30</td>
</tr>
<tr>
<td>3.2 Random Number Generators</td>
<td>36</td>
</tr>
<tr>
<td>4.1 Warp State Statistics</td>
<td>66</td>
</tr>
<tr>
<td>4.2 NVIDIA GTX 1080 Specifications</td>
<td>72</td>
</tr>
<tr>
<td>4.3 Grid-Stride Looping Sizes</td>
<td>76</td>
</tr>
<tr>
<td>5.1 <code>curand_init()</code> Parameters</td>
<td>92</td>
</tr>
<tr>
<td>5.2 PRNG RMSE</td>
<td>98</td>
</tr>
<tr>
<td>5.3 PRNG Statistics</td>
<td>100</td>
</tr>
<tr>
<td>6.1 Intel i7-5960X Specifications</td>
<td>105</td>
</tr>
<tr>
<td>6.2 Systematic/Stratified RMSE</td>
<td>107</td>
</tr>
<tr>
<td>6.3 Global Memory Loads</td>
<td>107</td>
</tr>
<tr>
<td>6.4 Resampling RMSE</td>
<td>110</td>
</tr>
<tr>
<td>6.5 Multiple Variances: Naive Statistics</td>
<td>113</td>
</tr>
<tr>
<td>6.6 Multiple Variances: Improved Statistics</td>
<td>113</td>
</tr>
<tr>
<td>6.7 Work Efficiency: Naive</td>
<td>115</td>
</tr>
<tr>
<td>6.8 Work Efficiency: Improved</td>
<td>115</td>
</tr>
</tbody>
</table>
# LIST OF ACRONYMS

<table>
<thead>
<tr>
<th>Acronym</th>
<th>Full Form</th>
</tr>
</thead>
<tbody>
<tr>
<td>API</td>
<td>Application Programming Interface</td>
</tr>
<tr>
<td>ARM</td>
<td>Advanced RISC Machine</td>
</tr>
<tr>
<td>BLAS</td>
<td>Basic Linear Algebra Subprograms</td>
</tr>
<tr>
<td>BPF</td>
<td>Bootstrap Particle Filter</td>
</tr>
<tr>
<td>BSP</td>
<td>Board Support Package</td>
</tr>
<tr>
<td>CPU</td>
<td>Central Processing Unit</td>
</tr>
<tr>
<td>CTA</td>
<td>Cooperative Thread Array</td>
</tr>
<tr>
<td>CUB</td>
<td>CUDA UnBound</td>
</tr>
<tr>
<td>CUDA</td>
<td>Compute Unified Device Architecture</td>
</tr>
<tr>
<td>DL</td>
<td>Deep Learning</td>
</tr>
<tr>
<td>DSP</td>
<td>Digital Signal Processor</td>
</tr>
<tr>
<td>EKF</td>
<td>Extended Kalman Filter</td>
</tr>
<tr>
<td>ESS</td>
<td>Effective Sample Size</td>
</tr>
<tr>
<td>FFN</td>
<td>FinFET NVIDIA</td>
</tr>
<tr>
<td>FMA</td>
<td>Fuse-Multiply-Add</td>
</tr>
<tr>
<td>FPGA</td>
<td>Field-Programmable Gate Array</td>
</tr>
<tr>
<td>GPGPU</td>
<td>General Purpose Graphics Processing Unit</td>
</tr>
<tr>
<td>Acronym</td>
<td>Full Form</td>
</tr>
<tr>
<td>---------</td>
<td>-----------</td>
</tr>
<tr>
<td>GPU</td>
<td>Graphics Processing Unit</td>
</tr>
<tr>
<td>HBM2</td>
<td>High Bandwidth Memory 2</td>
</tr>
<tr>
<td>IDE</td>
<td>Integrated Development Environment</td>
</tr>
<tr>
<td>ILP</td>
<td>Instruction Level Parallelism</td>
</tr>
<tr>
<td>IS</td>
<td>Importance Sampling</td>
</tr>
<tr>
<td>LFSR</td>
<td>Linear-Feedback Shift Register</td>
</tr>
<tr>
<td>MC</td>
<td>Monte Carlo</td>
</tr>
<tr>
<td>MCMC</td>
<td>Markov Chain Monte Carlo</td>
</tr>
<tr>
<td>MISE</td>
<td>Mean Integrate Square Error</td>
</tr>
<tr>
<td>MPF</td>
<td>Marginalized Particle Filter</td>
</tr>
<tr>
<td>PDF</td>
<td>Probability Density Function</td>
</tr>
<tr>
<td>PRAM</td>
<td>Parallel Random-Access Machine</td>
</tr>
<tr>
<td>PRNG</td>
<td>Pseudorandom Number Generator</td>
</tr>
<tr>
<td>QRNG</td>
<td>Quasirandom Number Generator</td>
</tr>
<tr>
<td>RNG</td>
<td>Random Number Generator</td>
</tr>
<tr>
<td>RAW</td>
<td>Read-After-Write</td>
</tr>
<tr>
<td>RBPF</td>
<td>Rao-Blackwellized Particle Filter</td>
</tr>
<tr>
<td>RMSE</td>
<td>Root Mean Squared Error</td>
</tr>
<tr>
<td>RPF</td>
<td>Regularized Particle Filter</td>
</tr>
<tr>
<td>Acronym</td>
<td>Full Form</td>
</tr>
<tr>
<td>---------</td>
<td>-----------</td>
</tr>
<tr>
<td>SDK</td>
<td>Software Development Kit</td>
</tr>
<tr>
<td>SFU</td>
<td>Special Function unit</td>
</tr>
<tr>
<td>SIMD</td>
<td>Single Instruction, Multiple Data</td>
</tr>
<tr>
<td>SIMT</td>
<td>Single Instruction, Multiple Threads</td>
</tr>
<tr>
<td>SIR</td>
<td>Sequential Importance resampling</td>
</tr>
<tr>
<td>SIS</td>
<td>Sequential Importance Sampling</td>
</tr>
<tr>
<td>SM</td>
<td>Streaming Multiprocessor</td>
</tr>
<tr>
<td>SOC</td>
<td>System-on-a-Chip</td>
</tr>
<tr>
<td>SSL</td>
<td>Secure Sockets Layer</td>
</tr>
<tr>
<td>TLP</td>
<td>Thread Level Parallelism</td>
</tr>
<tr>
<td>WAR</td>
<td>Write-After-Read</td>
</tr>
<tr>
<td>WAW</td>
<td>Write-After-Arite</td>
</tr>
</tbody>
</table>
## LIST OF SYMBOLS

<table>
<thead>
<tr>
<th>SYMBOL</th>
<th>DEFINITION</th>
</tr>
</thead>
<tbody>
<tr>
<td>$t$</td>
<td>Time</td>
</tr>
<tr>
<td>$x_{t+1}$</td>
<td>Progated State Variable</td>
</tr>
<tr>
<td>$y_t$</td>
<td>Update Measurement</td>
</tr>
<tr>
<td>$f_t$</td>
<td>Arbitrary Nonlinear System Model</td>
</tr>
<tr>
<td>$h_t$</td>
<td>Arbitrary Nonlinear Measurement Model</td>
</tr>
<tr>
<td>$w_t$</td>
<td>Process Noise</td>
</tr>
<tr>
<td>$e_t$</td>
<td>Measurement Noise</td>
</tr>
<tr>
<td>$\hat{x}_{t</td>
<td>t-1}$</td>
</tr>
<tr>
<td>$P_{t</td>
<td>t-1}$</td>
</tr>
<tr>
<td>$\hat{x}_{t</td>
<td>t}$</td>
</tr>
<tr>
<td>$P_{t</td>
<td>t}$</td>
</tr>
<tr>
<td>$F_t$</td>
<td>Linear System Model</td>
</tr>
<tr>
<td>$H_t$</td>
<td>Linear Measurement Model</td>
</tr>
<tr>
<td>$Q_t$</td>
<td>Process Noise Covariance</td>
</tr>
<tr>
<td>$R_t$</td>
<td>Measurement Noise Covariance</td>
</tr>
<tr>
<td>$K_t$</td>
<td>Kalman Gain</td>
</tr>
<tr>
<td>$S_t$</td>
<td>Innovation Covariance</td>
</tr>
</tbody>
</table>
To my family, who has been a source of inspiration throughout my life.
If you’re not prepared to be wrong, you’ll never come up with anything original.

— Ken Robinson
CHAPTER 1

INTRODUCTION

Research is what I’m doing
when I don’t know what I’m doing.

—Wernher von Braun

Over the past 15 years, graphics processing units (GPUs) have become fundamental tools in our everyday lives. While their initial purpose was fueled by high computing demands in the gaming industry, they have found an equally important role in the field of accelerated scientific computing and are commonly known as general purpose graphics processing units (GPGPUs). Now, with their applicability in the future of autonomous vehicles, GPUs are being thrust into the realm of embedded systems. In an embedded environment, a heterogeneous system-on-a-chip (SOC) with a central processing unit (CPU) and GPU can be used to balance the workload for optimal efficiency. This opens the door for engineers to find new and innovative ways to solve problems and optimize current solutions.

One such area of interest is location estimation, where an object’s location is being estimated from a set of noisy measurements. Because a perfect location sensor does not exist, Bayesian filter techniques can be used to give an estimate of a system’s state and the error between measurement updates. State estimation is
a process commonly used to determine the underlying behavior at any point. It can use telemetry streaming from on-board sensors to estimate continuous system parameters. This information can be used to improve control actions and identify failing components. Perhaps the most well-known algorithm for state estimation is the Kalman filter [1]. Given the right conditions, such as a linear, Gaussian system, it can produce an exact, analytical solution. In most real world applications, linear and Gaussian assumptions do not hold true. Therefore, alternative methods must be used. The most common alternative is the extended Kalman filter (EKF) which is a derivation of the Kalman filter. The EKF linearizes system and measurement equations through Taylor Series expansion [2, 3] and is popular for on-line system calculations due to its ease of use and computation efficiency. One of its limitations, however, is that it works well only with weak nonlinearities since it is based on local linear approximation. Thus, for highly nonlinear systems, the EKF is not an optimum estimator [4]. One solution might be to upgrade the system with on-board sensors that produce less error measurement data, but this is not always a viable answer.

1.1 Problem Formulation

Particle filters are a set of Monte Carlo algorithms used to solve filtering problems arising in signal processing and Bayesian statistical inference. They have an advantage in that they have the ability to represent arbitrary probability densities, even in nonlinear, non-Gaussian dynamic systems, and can produce probability densities with higher accuracy as the number of particles increase. Because each additional particle only requires an additional Monte Carlo evaluation, the algorithm can take
advantage of the abundant resources available on modern GPUs. Today’s GPUs have anywhere from a few hundred to several thousand cores per processor. Each core particle filter is based on three operations: 1) prediction, 2) updating, and 3) resampling. The prediction and updating steps are computationally intensive, but can be easily implemented in parallel if hardware is available. The resampling step, however, is not naturally suited for parallel processing and execution time is dependent on the number of particles rather than the number of state dimensions. Resampling is an essential step in particle filtering because it replaces the particles whose weights have little significance with new particles that are judiciously placed within the solution space. Traditionally, this is done using collective operations across particles and weights in a serial manner, due to data dependencies. Not only does this cause a bottleneck in the particle filtering process, but if the rest of the algorithm has been implemented on a GPU it will require costly data copies between the GPU and CPU memory banks.

1.2 Contributions

This dissertation makes contributions in the following areas:

1.2.1 Parallel Systematic/Stratified Resampling

A novel parallel approach to systematic [5] and stratified [6] resampling is presented that allows full implementation on GPU architecture while producing a resampling index that is an exact match to the traditional serial method. A first-cut implementation is provided as a baseline and to aid initial understanding. Then, an
improved implementation that reduces warp stalls caused by memory dependencies is provided in detail. The optimized approach contributes to speedups of 16x and 33x for systematic and stratified methods, respectively. Mathematical analysis is provided to evaluate algorithmic efficiencies such as time, work, and space.

1.2.2 Random Number Generation Requirements

This paper presents Monte Carlo techniques that utilize parallel processing on GPUs. To ensure statistical quality, along with optimal performance, various implementation techniques are examined against multiple random number generators (RNGs) that are provided in the NVIDIA cuRAND library. Years of research have gone into developing RNGs that are highly tuned for CPUs utilizing large instruction sets and large amounts of fast memory available to the core(s). These architecture characteristics are not available for modern GPUs, requiring modifications to be made for parallel implementations. This dissertation examines multiple programming implementations across multiple RNGs to determine the optimal RNG for the parallel resampling method.

1.2.3 Parallelization Strategies

Multiple Compute Unified Device Architecture (CUDA) coding techniques are compared for portability, maintainability, and scalability on GPU hardware. This dissertation investigates optimization techniques such as coalesced global memory access, utilizing shared memory storage, software prefetching, grid-stride looping, and cooperative thread arrays. These techniques are used to mitigate common GPU pro-
programming limitations. In particular, memory dependency stalls, low thread execution efficiency, and sub-optimal occupancy. Overall, speedups of 2x - 4x are achieved across a range of data sets, containing between $2^{10}$ and $2^{20}$ elements, respectively. Lastly, multiple streams are utilized to enhance GPU performance when individual kernel performance decreases due to the tail effect.

1.3 Outline

The remainder of this dissertation is organized as follows: Chapter 2 delineates a basic understanding of particle filtering along with pseudocode and algorithmic models of common particle filters and current resampling techniques. Chapter 3 provides background information about GPUs, starting with the programming model and the multiple types of memory within the memory model. It also highlights popular NVIDIA libraries and tools which were evaluated during this research. Chapter 4 provides details on the novel parallel approach to systematic and stratified resampling and optimization techniques to overcome common GPU limitations. This chapter also provides the mathematical analysis of multiple algorithmic efficiencies. Techniques and limitations of implementing RNGs on parallel architectures such as GPUs are described in Chapter 5. In Chapter 6, results on error and accuracy for a 4-state bootstrap particle filter (BPF) benchmark, by Schön [7], are provided in conjunction with execution performance across multiple data sets of variable variances. Finally, in Chapter 7, conclusions and suggestions for future research are provided.
CHAPTER 2

PARTICLE FILTERS

Particle filtering is a Monte Carlo method for performing Bayesian inference in state-space models as the system state evolves through time based on input measurement updates. Particle filtering techniques are used in a wide range of fields including navigation [7], image and signal processing [8, 9], robotics [10], economics [11], and self localization [12]. Particle filters utilize statistical inference, which uses mathematics to draw conclusions in the presence of uncertainty using quantitative data. In the case of particle filters, uncertainty lies within the current state and the quantitative data of the incoming measurements. Both the system and measurement data have contributing random noise. Particle filters apply a technique called recursive Bayesian filtering and are adept to solve Hidden Markov Chain and nonlinear filtering problems. The Bayesian approach is to construct the posterior probability density function (PDF) of the state based on all available information. For most nonlinear, non-Gaussian problems there are no general analytic expressions for the desired PDF. Therefore, particle filtering is more of an approximation of a complex model rather than an exact representation of a simplified model. Two assumptions are used to derive a recursive Bayes filter: 1) the states follow a first-order Markov process 2)
observations are independent of the given states. This dissertation uses the following notation: bold capital letters for matrices, and bold lowercase letters for vectors.

The general dynamic system consists of a system model along with a measurement model. Both are defined below:

\[
x_{t+1} = f_t(x_t, w_t) \quad (2.1)
\]

\[
y_t = h_t(x_t, e_t) \quad (2.2)
\]

Here \(x_{t+1}\) is the propagated state variable at time \(t\), \(y_t\) is the update measurement, \(w_t\) is the process noise, \(e_t\) is the measurement noise, and \(f, h\) are two arbitrary nonlinear functions. The noise densities are independent and are assumed to be known. In a Bayesian setting there is a two-step framework, prediction and update.

The prediction step, or *a priori*, \(p(x_t | y_{1:t-1})\) is computed from the filtering distribution \(p(x_{t-1} | y_{1:t-1})\) at time \(t - 1\).

\[
p(x_t | y_{1:t-1}) = \int \underbrace{p(x_t | x_{t-1})}_{\text{system model}} \underbrace{p(x_{t-1} | y_{1:t-1})}_{\text{previous posterior}} \, dx_{t-1} \quad (2.3)
\]

where \(p(x_{t-1} | y_{t-1})\) is assumed known due to recursion and \(p(x | x_{t-1})\) is given by Equation 2.1. During the update step, or *a posteriori*, the *a priori* is updated with
new measurement data $y_t$:

$$p(x_t|y_{1:t}) = \frac{\text{measurement model}}{p(y_t|x_t) p(x_t|y_{1:t-1})} \frac{\text{current prior}}{p(y_t|y_{1:t-1})}$$

(2.4)

This chapter is organized as follows. Section 2.1 discusses the earliest particle filter, BPF, and its limitations. Section 2.2 and Section 2.3 cover alternative schemes that focus on alleviating issues like *weight degeneracy* and *sample impoverishment*. Section 2.4 covers common resampling techniques, such as Multinomial, Systematic, Stratified, Residual, Metropolis, and Rejection.

### 2.1 Bootstrap Particle Filter

The BPF was introduced in 1993 by Gordon [4]. The key idea is to represent the PDF by random samples, or particles, with associated weights and to compute estimates based on these samples and weights. The basic building block of particle filters is importance sampling (IS). In IS, one approximates a target distribution $p(x)$ using samples drawn from a proposal distribution $q(x)$. To compensate between the two, a weight is added to each sample. IS yields

$$p(x_{0:t}|y_{1:t}) \approx \sum_{i=1}^{N} w_i^t \delta(x_{0:t} - x_{0:t}^i)$$

(2.5)

where $N$ is the number of samples and $\delta$ is the delta function centered at $x_{0:k-1}^i$. The process of running IS at each time step is called sequential importance sampling (SIS).
The purpose of SIS is to update the particles and weights such that they approximate the posterior distribution at the next time step. As \( N \to \infty \), the approximation (2.5) approaches the true posterior density. Weight updates are calculated as follows:

\[
\begin{align*}
\mathbf{x}_i^t &\approx q(x_t|x_{i-1}^t, y_t) \quad (2.6) \\
\mathbf{w}_i^t &\propto \frac{p(y_t|x_i^t)p(x_i^t|x_{i-1}^t)}{q(x_i^t|x_{i-1}^t, y_t)} \quad (2.7)
\end{align*}
\]

where \( q(\cdot) \) is the importance density and then normalized to sum 1.

SIS suffers from an unavoidable problem known as weight degeneracy, where after a few iterations most particles have negligible weights. One way to help degeneracy is to increase the number of particles, but that is impractical for most problems due to computational workload. Degeneracy can be measured by estimating the effective sample size (ESS), which is calculated as follows:

\[
\hat{N}_{\text{eff}} = \frac{1}{\sum_{i=1}^{N} (w_i^t)^2} \quad (2.8)
\]

Small \( \hat{N}_{\text{eff}} \) indicates severe degeneracy. The most common and effective way to mitigate high degeneracy is resampling. The idea is to eliminate particles with low importance weights and replicate particles with high importance weights. This, in combination with SIS, is often referred to as sequential importance resampling (SIR). A BPF is described in Algorithm 1.
Algorithm 1 Generic Particle Filter

1: for $i \leftarrow 1, N$ do
2:   Initialization $x_i^t \sim q(x_t|x_{i-1}^t, y_t)$
3:   Compute importance weights, $w_i^t$: per (2.7)
4: end for
5: Calculate total weight: $\sum_{i=1}^{N} w_i^t$
6: for $i \leftarrow 1, N$ do
7:   Normalize weights
8: end for
9: Compute state estimates
10: Calculate ESS ($\hat{N}_{eff}$): per (2.8)
11: if $\hat{N}_{eff} < N_T$ then
12:   Resample $\triangledown$ Method may vary
13: end if
14: for $i \leftarrow 1, N$ do
15:   Perform particle transition
16: end for

2.2 Marginalized Particle Filter

Computational complexity of a particle filter escalates quickly as the number of state dimensions increase. This is sometimes referred to as the *curse of dimensionality*. One countermeasure, using Bayes theorem, is to marginalize linear state variables and solve them using a Kalman filter [1]. Kalman filters are the optimal filters for these linear states and should provide better estimates. Nonlinear state variables are then solved using particle filters. In some cases, where state variables are *weakly nonlinear*, an EKF [13] may be used. An EKF linearizes a state variable about a working point. This hybrid method is called the marginalized particle filter (MPF) [7] or Rao-Blackwellized particle filter (RBPF) [14]. The MPF encompasses the BPF along with the update and prediction steps of a Kalman filter. For the
Kalman filter, equations (2.1) and (2.2) can be rewritten as follows:

\[ x_{t+1} = F_t x_t + w_t \]  
\[ y_t = H_t x_t + e_t \]

where \( F_t \) and \( H_t \) are known matrices defining linear functions. Covariances of \( w_t \) and \( e_t \) are, \( Q_t \) and \( R_t \), respectively. Kalman filter prediction equations

\[ \hat{x}_{t|t-1} = F_t \hat{x}_{t-1|t-1} \]  
\[ P_{t|t-1} = F_t P_{t|t-1} F_t^T + Q_t \]

are the \emph{a priori} state estimation and the \emph{a priori} estimation covariance, respectively. Kalman filter update equations are listed below, where \( S_t \) is the innovation covariance, \( K_t \) is the Kalman gain, \( \hat{x}_{t|t} \) is the \emph{a posteriori} state estimation, and \( P_{t|t} \) is the \emph{a posteriori} estimation covariance.

\[ S_t = H_t P_{t|t-1} H_t^T + R_t \]  
\[ K_t = P_{t|t-1} H_t^T S_t^{-1} \]  
\[ \hat{x}_{t|t} = \hat{x}_{t|t-1} + K_t (y_t - H_t \hat{x}_{t|t-1}) \]  
\[ P_{t|t} = P_{t|t-1} - K_t H_t P_{t|t-1} \]

An MPF is described in Algorithm 2. Once marginalized, a linear system will be formed for each particle and estimated by a Kalman filter. By solving linear states
in this fashion, some variance can be removed from the nonlinear states. For further accuracy improvements, the nonlinear estimates can be used as an additional measurement in the Kalman filter timing update. While the workload per particle is greater for the MPF, it has been shown by Schönh [7] that only a portion of the particles required for a BPF are needed; in turn, minimizing the computational time of the resampling step which is performed every timestep.

\begin{algorithm}
\caption{Marginalized Particle Filter}
\begin{algorithmic}[1]
\State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
    \State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
        \State \textbf{Initialization} $x_i^t \sim q(x_t|x_{i-1}^t, y_t)$
        \State Compute importance weights, $w_{k}^i$: per (2.7)
    \Endfor
    \State Calculate total weight: $\sum_{i=1}^{N} w_{i}^t$
    \State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
        \State Normalize weights
    \Endfor
    \State Compute nonlinear state estimates
    \State Resample $N$ particles and covariance matrices
    \State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
        \State Perform Kalman filter: measurement update
    \Endfor
    \State Compute linear state estimates
    \State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
        \State Perform particle transition \CommentOnly nonlinear states
    \Endfor
    \State \textbf{for} $i \leftarrow 1, N$ \textbf{do}
        \State Perform Kalman filter: timing update
    \Endfor
\Endfor
\end{algorithmic}
\end{algorithm}

\subsection{Regularized Particle Filter}

Systems with little to no process noise resampling can introduce sample impoverishment, which is the loss of diversity among particles. In the most severe case, all particles can occupy the same state-space position. A modified version of the
BPF is proposed in [15], called a regularized particle filter (RPF). It is identical to the BPF with the exception of the resampling step. It samples from a continuous approximation rather than a discrete distribution (2.5). The approximation is

\[ p(x_t|y_{1:t}) \approx \sum_{i=1}^{N} w^i_t K_h(x_t - x^i_t) \]  (2.13)

where

\[ K_h(x) = \frac{1}{h^{n_x}} K \left( \frac{x}{h} \right) \]  (2.14)

is the rescaled kernel density \( K(\cdot) \) and \( h \) is the kernel bandwidth, which must be positive. \( n_x \) denotes the dimension of \( x \) and \( w^i_t \) is the normalized particle weight.

Now samples are drawn from an empirical representation that closely matches the true posterior density. This is accomplished by choosing a kernel and bandwidth to minimize the mean integrate square error (MISE) between the two. In the special case of a classical equally weighted sample \( w^i_t = 1/N \) for \( i = 1, \ldots, N \), the density estimation theory [16,17] provides the optimal choice for the kernel as the Epanechnikov kernel [15]. It is defined as follows:

\[
    f(n) = \begin{cases} 
        \frac{n_x + 2}{2c_{n_x}} (1 - ||x||^2), & \text{if } ||x|| < 1 \\
        0, & \text{otherwise} 
    \end{cases} \]  (2.15)

13
where \( c_{n_x} \) is the volume of the unit sphere \( \mathbb{R}^{n_x} \). The choice of the optimal bandwidth \( h_{opt} \) follows as [15]

\[
h_{opt} = AN^{1/(n_x+4)}
\]

\[
A = \left[8c_{n_x}^{-1}(n_x + 4)(2\sqrt{\pi})^{n_x}\right]^{1/(n_x+4)}
\]

In a real-world environment this special case can be used to find a sub-optimal filter.

In practical scenarios, the regularized particle filter (RPF) performance is better than the BPF in cases where sample impoverishment is severe, such as when the process noise is small [18]. An RPF is described in Algorithm 3.

**Algorithm 3 Regularized Particle Filter**

1: for \( i \leftarrow 1, N \) do
2: \hspace{1em} Initialization \( x_i^t \sim q(x_t|x_{i-1}^t, y_t) \)
3: \hspace{1em} Compute importance weights, \( w_i^t \): per (2.7)
4: end for
5: Calculate total weight: \( \sum_{i=1}^{N} w_i^t \)
6: for \( i \leftarrow 1, N \) do
7: \hspace{1em} Normalize weights
8: end for
9: Compute state estimates
10: Calculate ESS (\( \hat{N}_{eff} \)): per (2.8)
11: if \( \hat{N}_{eff} < N_T \) then
12: \hspace{1em} Calculate empirical covariance matrix \( S_t \)
13: \hspace{1em} Compute \( D_t \) such that \( D_t D_t^T = S_t \)
14: \hspace{1em} Resample \( \triangleright \) Method may vary
15: \hspace{1em} for \( i \leftarrow 1, N \) do
16: \hspace{2em} Draw from Epanechnikov Kernel
17: \hspace{1em} end for
18: end if
2.4 Resampling Techniques

While early forms of particle filters offered an alternative for nonlinear, non-Gaussian state estimation, it was the introduction of the resampling step [4] that made them a viable option in real-world applications. Before resampling, particle weights would be updated in an iterative manner as the next observation became available. With no method to discard weights with low or no discernible effect, the variance between particle weights would increase with time [19]. After a series of iterations, most particles retain a negligible weight and resources would be wasted propagating useless particles. A naive approach is to add an exorbitant amount of particles because the computational workload makes it impractical for most applications.

Two methods are often used to alleviate the effect of degeneracy: choosing an optimal importance density function and resampling. An optimal importance function helps minimize the variance of particle weight. The optimal importance function [20] is shown as follows:

\[
q(x_t|x_{t-1}^i, y_{0:t}) = p(x_t|x_{t-1}^i, y_t) = \frac{p(y_t|x_t, x_{t-1}^i) p(x_t|x_{t-1}^i)}{p(y_t|x_{t-1}^i)}
\]

Unfortunately, one must be able to sample from \(p(x_t|x_{t-1}^i, y_t)\) and calculate \(p(y_t|x_{t-1}^i)\), which is often not possible [21]. Therefore, a suboptimal importance function must be chosen. A good choice is the transitional \(a priori\), \(p(x_t|x_{t-1}^i)\). Fine tuning the
importance density for a given problem will yield an appropriate trade-off between the number of particles and computational expense to process each particle [18].

The second approach to improve weight degeneracy is resampling. Since its conception, resampling has proven to be beneficial both practically and theoretically [10]. Resampling is performed by removing particles with small weights and replacing them with neighboring particles with high weights. Once the remaining particles have been redistributed, all weights are set to a constant value $1/N$. A visual aid is provided in Figure 2.1. Traditionally, resampling is executed when the ESS has dropped below a given threshold, $N_T$. The threshold is expressed as a proportion of the number of particles and is sometimes defaulted to 50% [22]. This is because most resampling methods are serial by nature and create a bottleneck. Historically, this has caused programmers to balance performance and accuracy.

It is important to note that while resampling reduces degeneracy, it introduces sample impoverishment when system process noise is small. The particles lose diversity by removing all negligible particle weights and repeatedly replicating high particle weights. They are no longer statistically independent. This approach is called total sampling [23], where the resulting particle set only contains new-born particles. One technique to solve this problem is the resample-move algorithm described by Gilks [24]. The resample-move algorithm has a move step after the resampling step based on Markov chain Monte Carlo (MCMC) sampling. The move step is performed on each particle to rejuvenate diversity. Other approaches to alleviate sample impoverishment can be found in [23,25,26]. The following sections cover some common unbiased resampling techniques in literature such as multinomial [4], resid-
2.4.1 Multinomial

Multinomial resampling was the first method to be paired with SIS to produce a BPF [4]. It requires generating $N$ ordered random numbers of uniform distribution. Those numbers are then used to select particles from the approximation filtering distribution used to represent $p(x_t|y_{1:t})$ [29]. Particles are chosen when they meet the
following condition

\[ Q_{t}^{m-1} < u_{i}^{n} \leq Q_{t}^{m} \]  \hspace{1cm} (2.20)

where \( Q_{t}^{m} \) is the cumulative sum of all particle weights up to particle \( m \). This approach is inefficient and provides unsatisfactory Monte Carlo (MC) variations between particle weights [6]. The computational complexity of multinomial resampling is of order \( O(NK) \), where the \( K \) factor arises from the search of the required \( k \) at line (6) in Algorithm 4.

Algorithm 4 Multinomial Resampling

Require: \( w \leftarrow \text{Particle Weights} \)
Ensure: \( idx \leftarrow \text{Resample Index} \)

1: \( N \leftarrow \text{count } (w) \)
2: \( c \leftarrow \text{Inclusive-Prefix-Sum } (w) \)
3: \( \text{for } i \leftarrow 1, N \text{ do} \)
4: \( u \sim U[0, 1] \)
5: \( k \leftarrow 1 \)
6: \( \text{while } c(k) < u \text{ do} \)
7: \( k \leftarrow k + 1 \)
8: \( \text{end while} \)
9: \( idx(i) \leftarrow k \)
10: \( \text{end for} \)

2.4.2 Stratified

Stratified resampling was first proposed by Kitagawa [5]. In the algorithm, it is assumed that division into strata, or layers, is performed. In each stratum, resampling can be executed simultaneously since random numbers are drawn independently. The stratified method has a complexity of \( O(N) \) due to the iteration of the for-loop at line (5) in Algorithm 5.
Algorithm 5 Stratified Resampling

Require: \( w \leftarrow \) Particle Weights
Ensure: \( idx \leftarrow \) Resample Index

1: \( N \leftarrow \) count \((w)\)
2: \( c \leftarrow \) Inclusive-Prefix-Sum \((w)\)
3: \( u \leftarrow (\lfloor n - 1 \rfloor + u_n)/N \quad \triangleright u_n \sim U[0, 1]; n = 1, \ldots, N \)
4: \( k \leftarrow 1 \)
5: for \( i \leftarrow 1, N \) do
6: while \( c(k) < u(i) \) do
7: \( k \leftarrow k + 1 \)
8: end while
9: \( idx(i) \leftarrow k \)
10: end for

2.4.3 Systematic

Systematic resampling [6] is very similar to stratified except that it tries to reduce particle discrepancy by choosing the strata and the number of samples more effectively. This is achieved by choosing only one uniform random number and adding it to the entire ordered set. The samples are no longer independent and are at the same position in the stratum, shown by the generation of a single random number at line (3) of Algorithm 6. Systematic also has a complexity of \( O(N) \). It is more efficient and often the preferred method due to its simplistic implementation and its ability to minimize MC variation.

2.4.4 Residual

Residual sampling [27] consists of two stages. First, particle weights, \( M \), that are greater than \( 1/N \) are deterministically replicated. The remaining particles, or the residuals, \( R = N - \sum_{m=1}^{M} N_t^{(m)} \), are selected using one of the previously mentioned resampling steps during stage two. The first stage represents a deterministic repli-
Algorithm 6 Systematic Resampling

Require: $w \leftarrow$ Particle Weights
Ensure: $idx \leftarrow$ Resample Index
1: $N \leftarrow \text{count}(w)$
2: $c \leftarrow \text{Inclusive-Prefix-Sum}(w)$
3: $u \leftarrow ((n - 1) + u_0)/N$ \hspace{1cm} $\triangleright u_0 \sim U[0, 1)$
4: $k \leftarrow 1$
5: for $i \leftarrow 1, N$ do
6: \hspace{1cm} while $c(k) < u(i)$ do
7: \hspace{2cm} $k \leftarrow k + 1$
8: \hspace{1cm} end while
9: \hspace{1cm} $idx(i) \leftarrow k$
10: end for

cation so that the variation of the number of times a particle is resampled is only
attributed to the second stage [29]. Therefore, the complexity is $O(M) + O(R)$, or $O(N)$. Algorithm 7 shows pseudocode for Residual resampling.

Algorithm 7 Residual Resampling

Require: $w \leftarrow$ Particle Weights
Ensure: $idx \leftarrow$ Resample Index
1: $N \leftarrow \text{count}(w)$
2: for $i \leftarrow 1, N$ do
3: \hspace{1cm} $N_s(i) \leftarrow \text{floor}(N \times w(i))$
4: end for
5: for $j \leftarrow 1, N$ do
6: \hspace{1cm} for $k \leftarrow 1, N_s(j)$ do
7: \hspace{2cm} $idx(i) \leftarrow j$
8: \hspace{2cm} $i \leftarrow i + 1$
9: \hspace{1cm} end for
10: end for
11: $R = N - \text{sum}(N_s)$ \hspace{1cm} $\triangleright$ Remaining particle for stage 1
12: for $i \leftarrow 1, N$ do
13: \hspace{1cm} $w_s \leftarrow (N \times w(i) - N_s(i))/R$
14: end for
15: $c \leftarrow \text{Inclusive-Prefix-Sum}(w_s)$
16: \textbf{Execute} \textsc{multinomial}/\textsc{stratified}/\textsc{systematic}
2.4.5 Metropolis

Metropolis resampling was introduced as a way to accelerate particle filters with the assistance of GPUs through parallelization [28]. Like multinomial and stratified, it requires the creation of a relatively large set of uniformly distributed random numbers. With Metropolis, weights are not summed cumulatively or normalized. Therefore, it removes dependency between weights and improves parallelization. Figure 2.2 provides a visual aid comparing techniques where arcs along the perimeter of the circles represent particles by weight. Arrows indicate selected particles and are positioned (a) uniformly random in multinomial resampling, (b & c) by evenly slicing the circle into strata and randomly selecting an offset (stratified resampling) or using the same offset (systematic resampling) into each stratum, or (d) initializing multiple Markov chains and simulating to convergence in Metropolis resampling [28].

![Figure 2.2: Visualization of resampling methods [28].](image)

The concept is to execute \( N \) threads in parallel over \( B \) iterations comparing randomly selected weights. If the ratio of a randomly selected particle and the previous particle is greater than a uniform random number, then it shall be selected.
for further comparisons. See line (8) in Algorithm 8. However, if the ratio of the randomly selected particle is smaller, it shall be discarded. After $B$ comparisons, the current particle is passed for replication. Careful consideration must be taken when choosing $B$; if $B$ is too small, the sample size will be more biased and possibly not converge. The selection of $B$ is a trade-off of performance and accuracy. Algorithm 8 shows pseudocode for Metropolis resampling.

**Algorithm 8 Metropolis Resampling**

**Require:** $w \leftarrow$ Particle Weights  
**Ensure:** $idx \leftarrow$ Resample Index

1: $N \leftarrow \text{count} \ (w) \\
2: B \leftarrow \text{Number of iterations} \\
3: \text{for} \ i \leftarrow 1, N \ \text{do} \\
4: \quad p \leftarrow i \\
5: \quad \text{for} \ j \leftarrow 1, B \ \text{do} \\
6: \quad \quad u \sim U\[0, 1\] \\
7: \quad \quad q \sim U\{1, \ldots, N\} \\
8: \quad \quad \text{if} \ u \leq w(q)/w(p) \ \text{then} \\
9: \quad \quad \quad p \leftarrow q \\
10: \quad \quad \text{end if} \\
11: \quad \text{end for} \\
12: \quad idx(i) \leftarrow p \\
13: \text{end for}

2.4.6 Rejection

In the same paper, Murray [28] states that if an upper bound of the particle weights is known, rejection sampling is possible. Similar to Metropolis, rejection sampling does not require collective operation, such as prefix-sum, and is not affected by particle sets larger than $2^{20}$. It also offers the following advantages:

1. it is unbiased
2. it permits a first deterministic proposal that $a^i = i$

A major difference between Metropolis and rejection resampling methods is thread execution. Thread execution in Metropolis is deterministic because the number of inner for-loop iterations is set to $B$. On the other hand, the inner loop of rejection is a while-loop, at line (6) of Algorithm 9. This causes the runtime of independent threads to vary, which is an example of a variable task-length problem [28]. It is important to ensure that the upper bound, $w_{\text{max}}$, is tight. Otherwise the method may perform poorly. While empirical calculations of $w_{\text{max}}$ can be performed, $w_{\text{max}} = \max\{w^1, \ldots, w^N\}$, it would defeat the purpose of the approach by introducing a collective operation. Due to the variable task-length, Metropolis may be the preferred choice if its bias is acceptable and an appropriate $B$ is selected.

---

**Algorithm 9** Rejection Resampling

**Require:** $w \leftarrow$ Particle Weights

**Ensure:** $idx \leftarrow$ Resample Index

1: $N \leftarrow \text{count} (w)$
2: $B \leftarrow \text{Number of iterations}$
3: \textbf{for} $i \leftarrow 1, N$ \textbf{do}
4: \hspace{1em} $p \leftarrow i$
5: \hspace{1em} $u \sim U[0, 1]$
6: \hspace{1em} \textbf{while} $u \leq w(p)/w_{\text{max}}$ \textbf{do}
7: \hspace{2em} $q \sim U\{1, \ldots, N\}$
8: \hspace{2em} $u \sim U[0, 1]$
9: \hspace{1em} \textbf{end while}
10: \hspace{1em} $idx(i) \leftarrow p$
11: \textbf{end for}

2.4.7 Coalesced Metropolis

Dülger [30] recently improved performance of Metropolis by implementing it on a GPU with coalesced memory accesses to global memory. While Metropolis
resampling is well suited to utilize the multi-core architecture of a GPU, it does not perform efficient global memory accesses. This is because of the way it randomly generates an index in line (7) of Algorithm 8, and reads that element from the particle weight array. This has a negative impact on performance because the GPU tries to coalesce global memory loads and stores issued by threads of a warp into as few transactions as possible to minimize DRAM bandwidth. A warp is a maximal subset of threads from a single cooperative thread array (CTA), such that the threads execute the same instructions at the same time. This will be explained in detail in Chapter 3. When concurrent threads simultaneously access memory addresses that are very far apart in physical memory, there is no chance for the hardware to combine the accesses. Uncoalesced accesses can cause up to a 57.0% performance loss [31].

Dülger provides two methods, both of which are faster than Metropolis but at the expense of quality. The two techniques are designated as Metropolis-C1 (abbreviated C1) and Metropolis-C2 (abbreviated C2). From this point forward, the original, or uncoalesced, Metropolis will be referred to simply as Metropolis. Because read/write operations to global memory are performed in segments, these modifications constrain threads within a warp to only read and write within these segments. A segment is defined as a fixed number of contiguous elements. All the threads in the warp select random weights within a segment.

C1 is presented in Algorithm 10, $SC$ is the number of segments and $DC$ is the number of elements in each segment. All threads in a warp will select the same $s$, where $s$ is the index of the selected segment drawn from a uniform distribution. This can be ensured by passing the warp index to the random number generator. Next, $p$
Algorithm 10 Metropolis-C1 Resampling

Require: $w \leftarrow$ Particle Weights
Ensure: $idx \leftarrow$ Resample Index

1: $N \leftarrow \text{count}(w)$
2: $B \leftarrow \text{Number of iterations}$
3: for $i \leftarrow 1, N$ do
4: \hspace{1em} $p \leftarrow i$
5: \hspace{1em} $s \sim U\{0, \ldots, SC\}$
6: \hspace{1em} for $j \leftarrow 1, B$ do
7: \hspace{2em} $u \sim U[0, 1]$
8: \hspace{2em} $q \sim U\{(s - 1) \ast DC + 1, \ldots, s \ast DC\}$
9: \hspace{2em} if $u \leq w(q)/w(p)$ then
10: \hspace{3em} $p \leftarrow q$
11: \hspace{2em} end if
12: \hspace{2em} end for
13: $idx(i) \leftarrow p$
14: end for

is a random index between the first and last elements of the segment. Although not explicitly stated in the paper, if $DC$ is larger than the size of a warp, currently 32, then it is possible that global memory accesses for that warp will not be coalesced. By definition, a coalesced read occurs when threads in a warp access all required memory locations using a single cache. This is explained in detail in the CUDA Programming Guide [32] under the Best Practices section.

C2 in Algorithm 11 has the same parameters as Algorithm 10. The only difference is when the selection of $s$ is performed. For C2, it is performed in the inner-loop during each iteration of $B$. C2 is slower than C1 because of the additional random numbers generated in the inner loop. However, C2 provides higher quality results because it encounters more variety in the selection of weights. To reiterate, while C1 and C2 are faster than Metropolis, they produce worse quality because they select weights from a limited portion of the particle weight array. Therefore, C1 and
Algorithm 11 Metropolis-C2 Resampling

Require: $w \leftarrow$ Particle Weights
Ensure: $idx \leftarrow$ Resample Index

1: $N \leftarrow \text{count}(w)$
2: $B \leftarrow$ Number of iterations
3: for $i \leftarrow 1, N$ do
4:     $p \leftarrow i$
5:     for $j \leftarrow 1, B$ do
6:         $u \sim U[0, 1]$  
7:         $s \sim U\{0, \ldots, SC\}$
8:         $q \sim U\{(s - 1) * DC + 1, \ldots, s * DC\}$
9:         if $u \leq w(q)/w(p)$ then
10:             $p \leftarrow q$
11:         end if
12:     end for
13:     $idx(i) \leftarrow p$
14: end for

C2 variations of Metropolis provide a spectrum of speed versus quality trade-off for users.
CHAPTER 3

GRAPHICS PROCESSING UNIT (GPU)

GPUs were not always the parallel processing powerhouses they are today. At their conception, GPUs were to provide current hardware with a more efficient workflow for graphics processing. The original GPUs were modeled after the concept of a graphics pipeline and used fixed purpose hardware. The graphics pipeline refers to the process and steps of rendering images to a computer display. Even though transferring more of the graphics pipeline to the GPU progressed significantly through the 1980s and 1990s, they still required much help from the CPU. It was not until 1999, when NVIDIA implemented the last step of the pipeline in hardware, that reliance on the CPU was eliminated. Thus, the first consumer GPU was created. NVIDIA was the first to coin the term GPU.

One pitfall of the graphics pipeline is that it only allowed one pixel output per clock cycle, meaning CPUs could still send more triangles, the basic texture facet used in graphics processing, to the GPU than it could handle. This leads the way to introducing multiple pipelines in parallel on a GPU. Another downside to early GPUs is that the pipeline was a fixed-function pipeline, meaning once graphics data entered the pipeline it could not be modified. This was fixed with the introduction of
a programmable pipeline. A programmable pipeline allows a programmer access to manipulate and operate on data while it is in the pipeline. Application programming interfaces (APIs), such as Brook and Sh, removed the need for programmers to reformulate computational problems into terms of graphics primitives. Newer languages, such as NVIDIA’s CUDA, allows the user to focus on high-performance computing concepts and less on earlier basic graphical concepts.

3.1 General Purpose Graphics Processing Unit (GPGPU)

With the introduction of programmable pipelines and streaming multiprocessors (SMs), GPUs began to be utilized as GPGPUs, performing calculations in applications conventionally performed by the CPU. An SM has multiple cores that can access and execute multiple threads or operations simultaneously. Although GPUs can only process independent fragments, it can do them in parallel. Where a CPU might have 4, 8, 16, or 32 cores, a GPU can have up to several thousand. These cores are accessed through kernels, which are functions that work on each element. GPUs are extremely efficient at the single instruction, multiple data (SIMD) paradigm, or data parallelism. GPU processing can perform mathematically intensive computations on very large data sets, while a CPU can run the operating system and perform traditional serial tasks. This is an example of heterogeneous processing, which refers to systems that use more than one kind of processor.
3.2 Programming Model

With the increasing popularity of GPUs outside of the graphics domain, the CUDA API was introduced by NVIDIA in 2006. CUDA offers a mature development environment via an extension to the C/C++ programming language. An alternative to CUDA is OpenCL, provided by the Khronos Group in 2008. It is an open and royalty-free standard that can be utilized on a wide selection of hardware, including multi-core CPUs, GPUs (AMD and NVIDIA), field-programmable gate arrays (FPGAs), and digital signal processors (DSPs). While the programming scheme is similar, it does not provide the same performance as CUDA on NVIDIA GPUs since CUDA is tied closer to the hardware. This can become important when trying to achieve optimal performance.

CUDA provides a heterogeneous environment where programs are divided between the host and the device. CUDA allows programmers to define special functions, called kernels, that are called by the host code running on the CPU. Kernels can then be executed in parallel on the GPU by a collection of threads. Kernels are launched in a grid made up of a group of blocks that contain numerous threads. Once blocks are distributed to SMs they are divided into warps. All warps contain 32 threads and are executed concurrently through a series of warp schedulers. While the number of threads per warp has stayed constant through the evolution of GPGPU development, the number of warp schedulers vary between architecture. Also, entire blocks are required to reside on a SM. Therefore, the number of threads in a block are limited by resources. Threads are organized in a block, and blocks are organized in a grid in a
one-, two- or three-dimensional fashion. Each thread and block is designated a unique ID that can be accessed in the kernel by a built-in variable `threadIdx.x (,y,z)` and `blockIdx.x (,y,z)`, respectively. With the release of CUDA 9.X, threads within a block are synchronized with Cooperative Groups. The compute capability of a device is represented by a version number. The version number identifies the features supported by the GPU hardware and is used by applications at runtime to determine which hardware features and/or instructions are available on the target GPU. Table 3.1 provides a sublist of available resources per thread block by compute capability.

**Table 3.1**: GPU resources available (per thread block), adapted from [32].

<table>
<thead>
<tr>
<th>Technical Specifications</th>
<th>Compute Capability</th>
</tr>
</thead>
<tbody>
<tr>
<td>Threads per block</td>
<td>1024</td>
</tr>
<tr>
<td>Max dim size of block</td>
<td>1024 x 1024 x 64</td>
</tr>
<tr>
<td>Max dim size of grid</td>
<td>2147483647 x 65535 x 65535</td>
</tr>
<tr>
<td>Warp size</td>
<td>32</td>
</tr>
<tr>
<td>Registers (K)</td>
<td>64</td>
</tr>
<tr>
<td>Texture Memory (1D)</td>
<td>2^{27}</td>
</tr>
<tr>
<td>Shared Memory (KB)</td>
<td>48</td>
</tr>
<tr>
<td>Constant Memory (KB)</td>
<td>64</td>
</tr>
</tbody>
</table>

### 3.2.1 Memory Model

CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 3.1. Pros and cons per memory space may vary and great care should be taken when choosing the appropriate memory space for a given
Figure 3.1: CUDA memory model showing the hierarchy of thread layout and memory spaces [32].

task. Some memory is available during the life of a thread block, such as registers and shared memory. Other spaces are available during the life of the program, such as global and constant memory. Specific details for each memory space will be provided in the following subsections.

3.2.1.1 Global Memory

Global memory is stored off-chip and is the slowest and largest available memory on the GPU. It is accessed through 32-, 64- or 128-byte memory transactions. When a warp executes an instruction that accesses global memory, it coalesces the memory accesses of the threads within the warp into one or more of these memory
transactions depending on the size of the word accessed by each thread and the distribution of the memory addresses across the threads. In general, higher throughput can be achieved with coalesced memory accesses.

3.2.1.2 Local Memory

The local memory space resides in device memory, so local memory accesses have the same high latency and low bandwidth as global memory accesses and are subject to the same requirements for memory coalescing as global memory. Structures and arrays that consume too many registers are stored in local memory. If local variables in a kernel exceed available resources, the excess will be stored in local memory. Local memory usage can have a negative impact on performance through increased memory traffic and instruction count.

3.2.1.3 Shared Memory

Shared memory is a fast on-chip memory available to all threads within a thread-block. It is much faster than the off-chip global memory but is generally much smaller, depending on the hardware generations, and does not persist through kernel calls. To achieve high bandwidth, shared memory is divided into equally-sized memory modules, called banks, which can be accessed simultaneously. However, if two addresses of a memory request fall in the same memory bank, there is a bank conflict, and the access has to be serialized. If too many bank conflicts occur the advantages of shared memory can be negated.
3.2.1.4 Constant Memory

The constant memory space resides in device memory and is cached in the constant cache. Maximum throughput is achieved when all threads within a warp access the same constant memory location during an instruction execution. Thread calls that access different memory locations will be serialized. Constant memory space is a limited resource with historically 64 KB per SM. Since constant memory is static, it can be accessed by any kernel in the context space without being passed as a parameter. Data can be stored using cudaMemcpyToSymbol. It is important to note that constant variables have a file scope linkage, meaning cudaMemcpyToSymbol can only be called within the same .cu file. The .cu file extension is given to developer files written for CUDA and are compiled with nvcc, a compiler program included in the NVIDIA CUDA toolkit.

3.2.1.5 Texture Memory

Texture memory space is read-only and cached. The texture cache is optimized for 2D spatial locality, so threads of the same warp that read texture addresses that are close together will achieve best performance. Textures can be initialized as a texture object or texture reference and can be an advantageous alternative to reading device memory from global or constant memory. The space provides features such as linear filtering, interpolation between texels, normalized texture coordinates, and addressing modes that are not provided by any other memory space.
3.2.1.6 Registers

Registers are the fastest and most limited memory space on a GPU. Generally, accessing a register consumes zero extra clock cycles per instruction. As shown in Table 3.1, there are 32K or 64K registers per block and 255 registers limited per thread in a block. Registers can suffer from bank conflicts. Therefore, the numbers of threads per block should be a multiple of 64 for best results. Programmers should be aware of register pressure and spillage, which occurs when there are not enough registers available for a given task, and data must be stored in local memory.

3.3 Libraries and Tools

Writing, tuning, and maintaining kernel code is perhaps the most challenging and time-consuming aspect of CUDA programming. NVIDIA provides a plethora of GPU-accelerated libraries and tools to speed up application development and performance. GPU-accelerated libraries for linear algebra, signal processing, image and video processing lay the foundation for compute-intensive applications in areas such as molecular dynamics, computational chemistry, medical imaging, and seismic exploration. NVIDIA libraries are optimized for maximum performance and are tuned to take advantage of the latest hardware features. Programmers are able to get performance improvements by compiling their program with a newer CUDA toolkit, which in turn reduces on-going development cost.
3.3.1 CUDA UnBound (CUB)

CUDA UnBound (CUB) is an NVIDIA library containing collective kernel-level primitives designed around reusable software components, such as warps and blocks for the single instruction, multiple threads (SIMT) paradigm. Collective software primitives are essential for constructing high-performance, maintainable CUDA kernel code. The library also provides global primitives, built from collectives, that provide performance portability. Historically, programmers have had to rewrite kernels to take advantage of new features that became available with the latest hardware. CUB reduces development cost and encapsulates complexity by insulating changing capabilities of underlying hardware from the programmer. It is continuously evolving to accommodate new architecture-specific features and instructions.

3.3.2 cuRAND

The cuRAND library provides facilities that focus on the simple and efficient generation of high-quality pseudorandom and quasirandom numbers. Quasirandom numbers generate successive numbers as far away as possible from existing numbers in the set. While they are too uniform to pass randomness tests, they avoid clustering and speed up convergence. Pseudorandom numbers are more appropriate for applications that require more randomness. cuRAND consist of host and device API calls to access device memory. The possible random number generators are listed in Table 3.2.
Table 3.2: Random number generators available in cuRAND.

<table>
<thead>
<tr>
<th>Generator Types</th>
<th>Pseudorandom</th>
<th>Quasirandom</th>
</tr>
</thead>
<tbody>
<tr>
<td>MRG32k3a</td>
<td>Scrambled Sobol (32-bit)</td>
<td></td>
</tr>
<tr>
<td>MT19937</td>
<td>Scrambled Sobol (64-bit)</td>
<td></td>
</tr>
<tr>
<td>MTGP32</td>
<td>Sobol (32-bit)</td>
<td></td>
</tr>
<tr>
<td>Philox_4x32_10</td>
<td>Sobol (64-bit)</td>
<td></td>
</tr>
<tr>
<td>XORWOW</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>

3.3.3 cuBLAS

The cuBLAS library is an implementation of basic linear algebra subprograms (BLAS) in CUDA. BLAS was originally written in Fortran and has been highly optimized for multiple architectures over its history. cuBLAS consists of two sets of API, the regular cuBLAS API and cuBLASXT API which utilizes multi-GPU capabilities. BLAS is structured in three levels; level 1) scalar and vector based operations, level 2) matrix-vector operations, and level 3) matrix-matrix operations. Levels 1 and 2 are memory limited, while level 3 is compute limited. It can handle multiple data types such as half-, single-, and double precision, complex, and double-complex. CUDA 9.X provides access to Tensor Cores that were introduced with the Volta architecture for matrix-matrix multiplication.

3.3.4 CUDA-MEMCHECK

An important consideration of any program is error-checking and CUDA-MEMCHECK is a functional correctness checking suite included in the CUDA toolkit.
It provides a series of tools that consists of different types of checks. The memcheck tool is a runtime error detection tool, capable of precisely detecting and attributing out of bounds and misaligned memory access errors in multiple memory spaces. To accurately detect memory leaks, a CUDA context must be destroyed on program termination. This can easily be done with `cudaDeviceReset()`. The racecheck tool can report shared memory data access hazards that can cause data races, also called a race condition. Race conditions are computational hazards that arises when the results of the program depend on the timing of uncontrollable events, such as the execution order of threads. It checks for three hazards: write-after-write (WAW), write-after-read (WAR) and read-after-write (RAW). The initcheck tool can report cases where the GPU performs uninitialized accesses to global memory. The synccheck tool can report cases where the application is attempting invalid usages of synchronization primitives.

When ran in combination with the debug compile file, -G, the line number at which an error occurs will be reported along with call hierarchy. More details about this error-checking suite are provided in the CUDA Toolkit Programming Guide [32].

### 3.3.5 Profilers

NVIDIA provides profiling tools in its CUDA toolkit to help understand and optimize CUDA applications. Nsight Systems is a graphical profiling tool that displays a timeline of an application’s CPU and GPU activity at the system level and provides low overhead performance analysis with insights developers need to optimize their software. It provides information on concurrent kernels, unified memory, and power, clock, and thermal profiling. It has the ability to provide further insight to
Figure 3.2: NVIDIA Nsight Systems.

kernel bottlenecks and makes suggestions for further optimizations. A screenshot of the visual profiler is provided in Figure 3.2. If programmers need the detailed performance metrics of an individual kernel and API debugging via a user interface and command line tool, they can use Nsight Compute.

3.4 Architectures

The NVIDIA GPU architecture is built around a scalable array of multi-threaded SMs and employs a unique architecture called SIMT. A GPU provides hundreds, and sometimes thousands, of cores per device. A GPU core provides a much simpler control block than a CPU core and does not include advanced branch prediction or speculative execution. This trade-off allows the GPU to provide maximum throughput. A multiprocessor creates, manages, schedules, and executes threads in
groups of 32 parallel threads, called warps. Warps are handled independently by a scheduler. Warps within a block can be launched randomly and threads with a warp can execute out of lockstep. This is why block synchronization is important when dealing with shared memory. Execution context (program counters, registers, etc.) for each warp is processed by a multiprocessor and maintained on chip during the entire lifetime of the warp. A warp scheduler selects a warp that has threads ready to execute its next instruction. Therefore, to achieve maximum throughput, launching blocks with numerous threads can provide enough work to hide long latency memory reads. While the overview presented above is common across all architectures, low-level details vary. Those variations, that affect coding performed in this research, will be described in the following sections.

### 3.4.1 Kepler

Kepler architecture was designed using 28 nm technology with a primary focus on reducing energy consumption. This was achieved by using the primary GPU clock rather than the 2x shader clock. Running a larger number of cores at a lower clock rate allowed the removal of power-hungry clocking logic at the expense of area optimizations. Two Kepler cores used 90% of the energy required by the previous generation. NVIDIA realized that the compiler could be used to determine up front when instructions would be ready to issue. As a result, NVIDIA replaced several complex and power-expensive blocks with a simple hardware block that retrieves predetermined latency information, and uses it to mask out warps from eligibility at the inter-warp scheduler stage. This scheduling scheme has been carried forward on all
successive architectures. Kepler SMs schedule threads in groups of 32 parallel threads called warps. Each SM features four warp schedulers and eight instruction dispatch units, allowing four warps to be issued and executed concurrently. Two independent instructions per warp can be dispatched in each cycle and are very useful when improving instruction level parallelism (ILP). An example is shown in Figure 3.3.

A single Kepler SM contains 192 FP32 CUDA cores, 64 FP64 units, 32 special function units (SFUs), and 32 load/store (LD/ST) units, as shown in Figure 3.4. Each of the CUDA cores comes with fully pipelined floating-point and integer arithmetic logic units. The cores are also compliant with IEEE 754-2008, providing fuse-multiply-add (FMA) functionality. An FMA operation is performed in one step, with a single rounding, yielding a more accurate result.
Kepler compute capability 3.5 introduced 255 registers per threads, an increase from 63 in previous generations. This provided substantial performance improvements by removing register pressure and reducing spillage into local memory, which is located in global or device memory. The shuffle instruction was another important feature added to this series. This feature provided a way for programmers to share information between threads without using shared memory resources. It also offers a performance increase since load-and-store operations are executed in a single cycle.
Figure 3.4: A single Kepler SM: 192 FP32 CUDA cores, 64 FP64 units, 32 Special function units (SFU), and 32 load/store (LD/ST) units.

Kepler introduced configurable 64 KB shared memory. Since shared memory and L1 cache were packaged together, programmers could configure 16/48, 32/32, and 48/16 splits on a per kernel basis. While this provided much more flexibility,
it implicitly synchronized kernels when a new configuration was required. Two improvements to texture memory were also added. First, bindless textures eliminated previous limits to the number of texture objects that existed on previous architectures. Second, an additional 48 KB read-only data cache was provided that allow programmers to load data directly at the launch of a kernel and lasting the duration. This is beneficial because it takes both the load and working set footprint off the shared/L1 cache path. It can be accessed by providing hints to the compiler with tags \texttt{const} \_\texttt{restrict}, meaning the memory is not aliased, or explicitly with the \texttt{ldg()} intrinsic [33]. This additional cache can be seen in Figure 3.5.

Lastly, dynamic parallelism and Hyper-Q added levels of parallelism that had not previously existed. With dynamic parallelism, a GPU was able to generate, synchronize, and control new work for itself, a role traditionally performed by the CPU. Hyper-Q increased the number of hardware managed work queues from 1 to 32. This provided more efficient concurrency when a GPU needed to handle work from multiple CPU cores simultaneously or from MPI processes. Kepler introduced many architectural changes from its predecessors that are still preserved today.

3.4.2 Maxwell

With the introduction of the Maxwell architecture, NVIDIA reduced the CUDA core count from 192 in the Kepler to 128 per SM. While still using 28 nm technology, NVIDIA was able to double the number of SMs used on some chips without doubling the size of the die. This new core configuration allows for better resource utilization on data transfers. The Maxwell SM retains the same number of instruction issue slots
per clock and reduces arithmetic latencies compared to the Kepler design. To balance efficiency and workload requirements of modern games, texture units were also reduced from sixteen to eight per SM. Higher clock rates enable the reduced number of texture units to maintain and/or exceed performance of the previous series. While the Maxwell SM contains four warp schedulers and eight instruction dispatch units

Figure 3.5: Kepler Memory Hierarchy.
like its predecessor, it is evenly partitioned into four, 32-core processing blocks. Each processing block has dedicated scheduling and instruction resources. This is one of the greatest changes NVIDIA has made to reduce power consumption. Shared resources, though extremely useful when you have the workloads to fill them, do have drawbacks. They waste space and power if not fed, and there is additional scheduling overhead from having to coordinate the actions of the warp schedulers [34]. These changes can be seen in Figure 3.6.
Figure 3.6: Single Maxwell SM.
This series was the first to provide dedicated shared memory on an SM and combines L1 and texture caches into a single unit. Shared memory per SM is either 64 KB or 96 KB depending on the chip. The increase in maximum number of active thread blocks per multiprocessor, from 16 to 32, was a significant improvement. This results in automatic occupancy improvements without any code modifications by the programmer. In addition, the throughput of many integer operations including multiply, logical operations, and shift were improved on Maxwell. Specialized integer instructions that can accelerate pointer arithmetic were added. Lastly, L2 cache was significantly increased to 2 MB, reducing the amount of traffic that needs to cross the memory bus. This reduced both the power spent on the memory bus and improved overall performance.

3.4.3 Pascal

The Pascal architecture was the first to use 16 nm FinFET technology. Even though smaller transistors were used, Pascal incorporated 64 (GP100) CUDA cores per SM, while Maxwell and Kepler used 128 and 192, respectively. The SM, shown in Figure 3.7, is partitioned into two processing blocks. It maintains the 32 single-precision CUDA cores, an instruction buffer, a warp scheduler, and two dispatch units per block used by Maxwell. With the reduction in the number of processing blocks, texture units were also reduced from eight to two per SM. The Pascal architecture is the worlds first GPU architecture to support high bandwidth memory 2 (HBM2), which offers three times the memory bandwidth of the Maxwell. Because HBM2 is stacked memory, it allowed NVIDIA to design a denser GPU compared to...
previous versions. It provides a much simpler data path organization than the Kepler and improves scheduling and floating-point utilization over the Maxwell. The HBM2 memory requires less die area and reduced power consumption over previous architectures. A higher ratio of shared memory, registers, and warps per SM allows for more efficiently executed code [35].

Figure 3.7: Single Pascal SM.

This architecture also includes 16-bit precision operations with twice the throughput as 32-bit operations. FP16 plays an important role in deep learning (DL) applications by reducing memory usage and increasing arithmetic performance and data transfers. An important thing to note is that as SMs became smaller and more were
added to a chip, NVIDIA was able to increase the amount of L2 cache to 4 MB. This allowed for fewer requests to device memory, reducing overall board power and improving performance.

3.4.4 Volta

One of newest architectures is the Volta series. It is fabricated with new 12 nm FinFET NVIDIA (FFN) technology. Volta returns to four processing blocks per SM like the Maxwell, but the processing block has some significant differences. First, instead of having 32 FP32 cores, it has 16 FP32 cores and 16 INT32 cores, as shown in Figure 3.8. The addition of INT32 cores allows for simultaneous execution of FP32 and INT32 operations at full throughput, while also increasing instruction issue throughput. FMA math operations are reduced from six cycles on Pascal to four. The instruction buffer in each block is replaced with a new L0 instruction cache to provide higher efficiency and is private to every scheduler. It also mitigates any penalty associated with large instruction size [36]. In addition, the number of dispatch units is reduced to one, therefore it no longer has superscalar execution. This means that Volta is a pure thread level parallelism (TLP) design; max utilization comes from maximizing the number of threads active at any given time. The Volta is said to have a streamlined instruction set for simpler decoding and reduced instruction latencies [37].

The Volta architecture combined the L1 cache and shared memory, similar to Kepler. The combined capacity is 128 KB per SM and allows for a configurable, shared memory capacity of 96 KB. All 128 KB can be used as cache by kernels that
do not use shared memory. A key reason to merge the L1 data cache with shared memory in Volta is to allow L1 cache operations to attain the benefits of shared memory performance. Shared memory provides high bandwidth and low latency, but the programmer needs to manage this memory explicitly. Volta narrows the gap
between applications that explicitly manage shared memory and those that access data in device memory directly. Prior NVIDIA GPUs performed load caching, while the Volta performs write caching. This further improves performance. Volta is the first architecture to implement independent thread scheduling. The individual CUDA cores within a 32-thread warp now have a limited degree of autonomy. Threads can be synchronized at a fine-grain level, which means that while the SIMT paradigm is still alive and well, it has greater overall efficiency. A visual aid is provided in Figure 3.9. Independent thread scheduling allows threads to diverge and reconverge at sub-warp granularity. In order to ensure proper warp synchronization, Volta should be paired with CUDA 9.X or greater and use Cooperative Groups.

![Figure 3.9: Independent thread scheduling comparison between Pre-Volta and Volta architectures.](image)

One of the most significant additions to the Volta SM is the two mixed-precision Tensor Cores. They are highly utilized by the DL matrix arithmetic. Tensor
Cores are a new type of core geared specifically for tensor math operations. The significance of these cores is that by performing a massive matrix multiplication operation in one unit, NVIDIA can achieve a much higher number of FLOPS for this one operation. Unfortunately, Tensor Cores are relatively rigid and are not ideal for anything except tensor operations. They are only applicable to certain classes of compute tasks.

3.4.5 Turing

Turing, the latest architecture, is constructed using 12 nm FFN technology and consists of an SM with 16 FP32 cores, 16 INT32 cores, two Tensor Cores, one warp scheduler, and one dispatch unit, shown in Figure 3.10. Among significant similarities with the Volta, the Turing differs in that NVIDIA has added an additional datapath to further increase parallelism between the FP32 cores and the INT32 cores. With a second datapath, instructions like integer adds for addressing and fetching data can coincide with floating point math. Early studies have shown an additional 36% throughput on modern applications [38]. L1 cache on the Turing is also very similar to the Volta. For Turing, there is a combined 96 KB of L1 and shared memory in an unified memory block. Compute workloads can partition the L1/shared memory space with up to 64 KB as L1 and the remaining 32 KB as shared memory, or vice versa. For Volta, the shared memory can be configured up to 96 KB. Anything above 48 KB must be done dynamically.
Figure 3.10: Single Turing SM.

Turing is the first GPU architecture to support GDDR6 memory, enabling as little as a 27% increase in memory bandwidth. Aside from these improvements and
some additional Tensor Core capabilities, there are not many noteworthy differences between the Volta and Turing compute units. Although it does not contribute to this research, the most significant advancement that Turing brings to the industry is the new ray-tracing cores (RT cores). Ray-tracing, in short, is a rendering process that emulates how light behaves in the real world and is primarily used in gaming graphics. These new cores enable the ability to make calculations in real-time.

3.4.6 Embedded Devices

While higher quality graphics for the gaming industry have driven advancements in GPU evolution, small fanless devices and autonomous cars have driven advancements in embedded systems, particularly the SOC. In 2013, NVIDIA introduced their first embedded GPU paired with an advanced RISC machines (ARM) CPU and memory. This has allowed programmers to provide high computing algorithms to devices previously using FPGAs. NVIDIA provides the Jetson embedded platforms, which include versions based off the Kepler (TK1), Maxwell (TX1), Pascal (TX2/TX2i), and Volta (Xavier) architectures. It packs this performance into a small, power-efficient form factor that is ideal for intelligent edge devices like robots, drones, smart cameras, and portable medical devices. Software can be installed using JetPack, which is a complete software development kit (SDK) that includes the board support package (BSP), libraries for deep learning, computer vision, GPU computing, and multimedia processing. Figure 3.11 shows a picture of a TX2 module (left) and a TX2 Developer Kit (right).
Figure 3.11: Side-by-side comparison of the Jetson TX2 module (left) and the Jetson TX2 Developer Kit (right).
CHAPTER 4

PARALLEL RESAMPLING TECHNIQUE

My hypothesis is that we can solve [the software crisis in parallel computing], but only if we work from the algorithm down to the hardware – not the traditional hardware first mentality.

—Tim Mattson, Intel

4.1 Algorithm Description

Systematic resampling is the algorithm most preferred due to its easy implementation and ability to minimize Monte Carlo variation [10, 18, 39]. Because of its similarities with stratified resampling the implementation presented below can easily be applied to both methods. Systematic and stratified resampling can be divided into three sections: 1) prefix-sum generation, 2) random number generation, 3) comparing prefix-sum and random numbers. Unfortunately, systematic and stratified, as well as some other traditional resampling algorithms, require collective operations across particles and weights due to data dependencies. This collective operation, also known as a prefix-sum, is the cumulative summation of particle weights. Major contributions have been made to parallelize prefix-sum algorithms on GPUs [40–43]. The parallel inclusive-prefix-sum has $O(\log N)$ complexity. Therefore, it effectively
disappears as $N$ continues to grow. Next, generating random numbers for these resampling algorithms can have a complexity of $O(N)$ on a CPU. On GPUs random number generation can be distributed among threads bringing its complexity to $O(1)$. This can provide a more pronounced improvement to stratified resampling which requires a different random number per particle. The third portion of systematic and stratified resampling, which requires iterating through the prefix-sum and comparing values to the uniform ordered random numbers, remains a serial process.

To efficiently utilize the GPU, work must be distributed to many threads. It is important to remember that data local to each thread is not visible by any other thread in a block. Taking a closer look, these resampling methods, they consist of a while-loop wrapped in a for-loop iterating over a zero-based consecutive strictly monotonic index through the prefix-sum, comparing elements to a random number [44]. This process can be split into two sub-processes; a middle-out approach where each process executes a while-loop on each element in the prefix-sum simultaneously. One sub-process increments through the prefix-sum until each and all threads have satisfied the comparator or until the end of the array is reached. The second sub-process then decrements through the prefix-sum until all threads satisfy the comparator or until the end of the array is reached. From henceforth, the kernel that increments will be referred to as $Up$ and the the kernel that decrements will be referred to as $Down$. Intermediate comparator results are stored in a bit mask. This implementation produces a variable task-length problem for each thread similar to that of rejection resampling, as described in Section 2.4.6. Therefore, it has a maximum complexity of $O(\ell)$, where $\ell$ is the number of strides from $c/thread$, where $c$ is the
prefix-sum array and thread is the thread index within a grid. A grid is a collection of all the threads executing the kernel. Algorithm 12 provides pseudocode of the parallel implementation. For brevity, stratified and systematic implementations have been combined, with the only difference being the random number generation. Notice the if-statement at line (8) during the Up kernel and line (20) in the Down kernel. It ensures that the while-loops do not access memory out of bounds.

The parallelized systematic and stratified are identical except for the creation of uniform random numbers. Both parallel methods should have similar timing because random number generation is performed by each thread.

![Figure 4.1](image)

<table>
<thead>
<tr>
<th>Weights</th>
<th>0.06 0.01 0.05 0.09 0.08 0.05 0.09 0.06 0.09 0.08 0.04 0.01 0.02 0.09 0.09 0.09</th>
</tr>
</thead>
<tbody>
<tr>
<td>Prefix-Sum</td>
<td>0.06 0.07 0.12 0.21 0.29 0.34 0.43 0.49 0.58 0.66 0.70 0.71 0.73 0.82 0.91 1.00</td>
</tr>
<tr>
<td>U[0,1]</td>
<td>0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2 0.2</td>
</tr>
<tr>
<td>Mask</td>
<td>0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0</td>
</tr>
<tr>
<td>Index</td>
<td>1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16</td>
</tr>
</tbody>
</table>

**1st While Loop**

| Mask | 0 1 1 0 0 0 0 0 0 1 1 0 0 |
| Index | 1 3 4 4 5 6 7 8 9 10 11 12 14 15 16 |

| Mask | 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 |
| Index | 1 3 4 4 5 6 7 8 9 10 11 12 14 15 16 |

**2nd While Loop**

| Mask | 1 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 |
| Index | 1 3 4 4 5 6 7 8 9 10 11 14 15 15 16 |

| Mask | 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 |
| Final | 1 3 4 4 5 6 7 8 9 10 11 14 15 15 16 |

**Figure 4.1:** A simplified example of the parallel resampling approach. The index for each particle (column) can be computed by individual threads in parallel.
Algorithm 12 Parallel Stratified (Systematic)

**Require:** \( w \leftarrow \text{Particle Weights} \)

**Ensure:** \( \text{idx} \leftarrow \text{Resample Index} \)

1: \( c \leftarrow \text{Inclusive-Prefix-Sum} \left( w \right) \)
2: for all \( t \leftarrow \text{thread} \) do
3: \( N \leftarrow \text{count} \left( w \right) \)
4: \( u_t \leftarrow (t + u_n(u_0))/N \) \( \triangleright u_n(u_0) \sim U[0, 1); \, n = 1, \ldots, N \)
5: \( m_t \leftarrow \text{true} \) \( \triangleright \) Bit mask for thread \( t \)
6: \( \ell_t \leftarrow 0 \)
7: while \( m_t \neq \text{false} \) do
8: \begin{align*}
9: & \text{if } t > (N - \ell_t) \text{ then} \\
10: & \quad m_t \leftarrow \text{false} \\
11: & \text{else} \\
12: & \quad m_t \leftarrow c(t + \ell_t) < u_t \\
13: & \text{end if} \\
14: & \text{if } m_t = \text{true} \text{ then} \\
15: & \quad \text{idx}_t \leftarrow \text{idx}_t + 1 \\
16: & \text{end if} \\
17: & \quad \ell_t \leftarrow \ell_t + 1 \\
18: & \end{align*} \( \triangleright \) All mask threads must be FALSE to exit
19: \( \ell_t \leftarrow 1 \)
20: while \( m_t \neq \text{true} \) do
21: \begin{align*}
22: & \text{if } t < \ell_t \text{ then} \\
23: & \quad m_t \leftarrow \text{true} \\
24: & \text{else} \\
25: & \quad m_t \leftarrow c(t - \ell_t) < u_t \\
26: & \text{end if} \\
27: & \text{if } m_t = \text{false} \text{ then} \\
28: & \quad \text{idx}_t \leftarrow \text{idx}_t - 1 \\
29: & \text{end if} \\
30: & \quad \ell_t \leftarrow \ell_t + 1 \\
31: & \end{align*} \( \triangleright \) All mask threads must be TRUE to exit
32: end for

As an example, a simplified output of Algorithm 12 is provided in Figure 4.1.

In the first while-loop, 4 out of 16 elements satisfy the comparator at line (11); therefore, only those elements are incremented. When all threads fail the while condition, the elements proceed to the next while-loop. In the second loop, only 3 elements fail the comparator at line (23) and must be decremented. Once all threads satisfy the
second while-loop, only the resampled index remains. Summing the total operations, what would have taken 23 operations using serial methods now can be completed in 4. Another advantage of this new method is that it allows the data to remain on the GPU without expensive copies to and from the CPU.

4.2 Optimizations To Parallel Implementation

While the naive parallel implementation presented in the previous section provides a speedup over the serial versions, both kernels can be modified in the following areas to better utilize GPU resources to achieve optimal performance:

1. Memory dependency stalls

2. Thread execution efficiency

3. Sub-optimal occupancy

4. Grid-stride looping

5. CUDA streams

In the following sections, limitations and the approach used to alleviate them will be described.

4.2.1 Memory dependency stalls

As mentioned in Section 3.2.1.1, global memory has the slowest access times of all memory spaces. To ensure optimal performance when loading data from global memory to on-chip resources, programmers must ensure all accesses are coalesced. A
coalesced access is when all threads in a warp access one cache line. The simplest way is for the $k$-th thread to access the $k$-th word in a cache line. For some architectures, accesses through L1 cache with 128-byte lines and L2 cache with 32-byte lines. Other architectures access both L1 and L2 caches with 32-byte lines. Warp sizes of 32 threads can help facilitate memory accesses that are aligned to cache lines. If sequential threads in a warp access memory that is sequential but not aligned with the cache lines then it is considered a misaligned memory access. Misaligned accesses require an additional cache line to be retrieved. While not a factor in this parallel implementation, it is important to mention the other extreme, strided accesses. Non-unit-stride global memory accesses occur when threads within a warp must load, or store, data that is greatly separated in global memory. If data is spread too far apart then L2 cache cannot be utilized. Global accesses then acts in a serial manner and can be detrimental to application performance. Examples of coalesced, misaligned and strided global memory access are provided in Figure 4.2.
Figure 4.2: (a) Coalesced, (b) Misaligned, and (c) Strided

The naive implementation when profiled with NVIDIA Nsight Compute, as shown in Figure 4.3, shows that the kernels are more memory (Memory %) bound than compute bound (SM %). A suggested best practice is for CUDA kernels to have high utilization with both compute and memory resources. This imbalanced utilization is because there are not enough calculations being performed in each timestep to hide memory latency. There is one comparator and one arithmetic operation per single global memory load. Every timestep, each thread in a warp must retrieve the next value from the prefix-sum in global memory. Because these strides through memory
are strictly consecutive monotonic, periodic global accesses meet the requirement for a coalesced load. As warps traverse through a cache line, global accesses are misaligned.

Figure 4.3: GPU Utilization: Naive (green).

For this parallel implementation, the stall with the greatest impact is the CPI Stall ‘Long Scoreboard.’ This occurs when a warp is waiting for a scoreboard dependency on a L1TEX (local, global, surface, texture) operation. To reduce the number of cycles waiting on L1TEX, one must make sure data accesses are coalesced and optimal for the target architecture, improve cache hit rates by increasing data locality, and move frequently used data to shared memory. Table 4.1 shows that the number of warp cycles per instruction is roughly 3x less when utilizing shared memory.

4.2.1.1 Utilizing Shared Memory

One way to mitigate misaligned memory accesses and reduce the total number of loads from global memory is to take advantage of shared memory. Covered previously, in Section 3.2.1.3, shared memory is on-chip and has higher bandwidth and lower latency than global memory. This is assuming there are no bank con-
conflicts between threads, which will be covered later in the section. Instead of threads traversing through global memory to retrieve the next iteration from the prefix-sum, chunks of data can be stored in a coalesced manner in shared memory. Threads can then traverse shared memory with little penalty to performance. As an example, a warp can load two consecutive coalesced accesses into shared memory, or a total of 64 elements. Then threads can step through shared memory 32 times, comparing the prefix-sum value to its random number. Now, the quantity of computation per global memory load has been increased by a factor of 16. This is because there are two coalesced loads, versus a combination of 32 coalesced and misaligned loads [44]. Figure 4.4 shows improved GPU utilization when using shared memory to reduce the number of global memory loads. The kernel using shared memory uses 55% more compute resources, while reducing memory utilization by 13%. This reduction in memory utilization is due to sub-optimal occupancy, which will be explained in detail later in this chapter.

Figure 4.4: GPU Utilization: Naive (green) and Shared (blue).

It is important to note that care should be taken when utilizing shared memory on the SM. Because it is on-chip, shared memory has a much higher bandwidth and
lower latency than local or global memory. To achieve this bandwidth, memory is divided into equally sized memory banks that can be accessed simultaneously. However, if multiple memory requests access the same bank, the request will be serialized. The one exception is when multiple threads within a warp access the same shared memory location. The result is then broadcast to those threads. Devices with compute capability 3.x and higher have two banking mode options, successive 32-bit or 64-bit words. Because the calculations in this paper deal with single precision, 32-bit mode is chosen. The implementation presented does not introduce any bank conflicts because each warp is allocated a shared memory space mutually exclusive from the other and individual threads, in each warp, never access the same memory bank in a given cycle.

4.2.1.2 Cooperative Groups

In the naive implementation, each thread works independently of the adjacent thread, in a warp, traversing the prefix-sum in global memory, and does so until its condition is satisfied. At that point, the thread becomes inactive and waits for all other threads to finish before the block is ejected and new work is scheduled. With a shared memory implementation, threads in a warp must work together. Each warp traverses the prefix-sum in global memory reading two 32 strictly consecutive monotonic element chunks of data at a time. The chunks are loaded into shared memory. Then each thread steps through shared memory until it reaches the end of the prefix-sum, satisfies its condition statement, or traverses its 32 element section. If at least one thread has not reached the satisfied condition, two more chunks of
data must be loaded into shared memory. For all 64 values to be loaded into shared memory, all threads in a warp must remain active.

**Table 4.1: Warp State Statistics: Instructions/Issues Per Cycle**

<table>
<thead>
<tr>
<th>Warp State Statistics</th>
<th>Naive</th>
<th>Shared</th>
</tr>
</thead>
<tbody>
<tr>
<td>Warp Cycles</td>
<td></td>
<td></td>
</tr>
<tr>
<td>Issued Instructions</td>
<td>27.05</td>
<td>9.14</td>
</tr>
<tr>
<td>Issue Active</td>
<td>29.56</td>
<td>11.41</td>
</tr>
<tr>
<td>Executed Instruction</td>
<td>27.09</td>
<td>9.12</td>
</tr>
</tbody>
</table>

For all threads in a warp to know if at least one thread requires more data from the prefix-sum, *intra-wrap* communication is required. This can be accomplished using *cooperative groups*, introduced in CUDA 9.0 [32]. Cooperative groups allow programmers the ability to define and synchronize groups of threads smaller than thread blocks and enable greater performance through design flexibility and *collective* group-wide primitives. Of the warp-level functions provided, *any()* and *all()* can be used to check the bit mask to determine if a thread in a warp has satisfied its condition. The *any()* function returns true if any thread in a warp satisfies the condition, and *all()* returns true if all threads satisfy the condition. Figure 4.5 shows how cooperative groups extend the CUDA programming model to allow for flexible, dynamic grouping of threads. It shows an example of a 64 thread block being partitioned into two 32 thread warp groups, and then being further partitioned into 4 thread groups.
4.2.1.3 Software Prefetching

Now that the Up and Down kernels are efficiently using shared memory, profiling the code shows that the bottleneck lies with loading data from global memory to shared memory. The data path from global to shared memory is handled by the \textit{LDG} and \textit{STS} instructions. For compute capabilities 5.X and later, \textit{LDG} loads data from global memory through L2 cache, and L1 cache depending on the architecture, to a register. Instruction \textit{STS} then stores a value from a register into a shared memory bank. Therefore, loading data straight from global memory to shared memory requires two steps. These two steps combined with synchronization between loads and
reads can cause unnecessary delays. Synchronization is required when using shared memory to ensure that threads are not reading from and/or writing to shared memory while other threads are accessing the same memory banks. Without synchronization, race conditions can occur on data in shared memory. A race condition occurs when two or more threads try to modify data in the same shared or global memory location at the same time. Pseudocode is provided in Technique 1, which shows the basic steps required when utilizing shared memory.

**Technique 1 No Prefetch**

1: Initial kernel
2: for all Threads do
3:   while True do
4:     Global memory → shared memory
5:     Synchronize threads
6:     Perform math calculations on shared memory
7:   end while
8: end for

This workflow can be modified to avoid mathematical delays. Instead of loading data from global memory directly into shared memory each timestep, it can be broken up into multiple steps. First, load the initial set of data into shared memory during the first timestep. Next, preload the next set of data into registers and then perform math calculations on the previous set. Once calculations are finished, load the data from the registers into shared memory. Then synchronization can occur. This process will be performed each timestep and is known as *software prefetching*. It is a common technique to adapt a program to take advantage of hardware designs. Pseudocode for this modified workflow is provided in Technique 2.


Technique 2 Prefetching

1: Initial kernel
2: for all Threads do
3: Global memory → shared memory
4: while True do
5: Global memory → registers (Preload next set)
6: Perform math calculations on shared memory
7: Registers → shared memory
8: Synchronize threads
9: end while
10: end for

Figure 4.6 shows that when using software prefetching, compute and memory utilizations increase by roughly 3% over using shared memory alone.

Figure 4.6: GPU Utilization: Naive (green), Shared (purple), and Prefetch (blue).

4.2.2 Thread Execution Efficiency

While utilizing shared memory reduces the overall quantity of global memory loads, it can introduce a new issue caused by a false dependency on shared memory. The issue is further exacerbated because of low thread execution efficiency related to the stochastic nature of the implementation. As mentioned previously, threads in a block are bundled into fixed-size warps. Maximum thread execution efficiency is achieved when 100% of a warp’s threads are active. Less than 100% means threads are
inactive due to sub-optimal launch, early return, or predicated off due to control flow divergence; hence, a variable-task length problem. Due to the stochastic nature of the parallel implementation, its efficiency is lessened by early return of thread through control flow divergence. As threads satisfy the comparator in a given while-loop, they enter an inactive state. This is critical to GPU performance because a warp cannot exit until all threads have finished the last instruction, and a block can not exit a SM until all warps have finished.

In the worst case, where there are 1024 threads (32 warps with 32 threads) in a block, the maximum allowed, if only one thread is active then that means the remaining 31 warps are active and consuming resources, even though their threads may be inactive [44]. This decreases the number of eligible warps per scheduler. Eligible warps are a subset of active warps that are ready to issue their next instruction. Every cycle with no eligible warp results in no instruction being issued and the issue slot remains unused. To increase the number of eligible warps, one should either increase the number of active warps or reduce the time the active warps are stalled.

The stochastic nature allows for the possibility of one thread executing while the rest of the block remains idle, limiting available resources. Due to each thread having a variable-task length caused by the while-loop, control flow divergence and early return are simply inherent to the implementation.

Additionally, variable-task length threads that require synchronization on shared memory can produce a false dependency. That is where there is no data dependence between each warp in a block, but they have to wait for each other to complete to transition to the next instruction because they are sharing shared memory. This
means that threads in a block cannot execute new instructions until all threads reach the barrier. The solution to false dependency is to reduce the number of threads in a block. In Figure 4.7, the basic execution modes when multiple warp access the same shared memory array are presented in diagram (a). Diagram (b) shows that an imbalanced workload between warps causes false dependency on shared memory. Diagram (c) shows how when imbalanced workloads are divided into smaller blocks, warps with smaller workloads are not required to wait.

**Figure 4.7:** False Dependency on Shared Memory where (a) basic execution mode, (b) false dependency, and (c) less warps per block to remove false dependency.

Fixing the false dependency on share memory also mitigates low thread execution efficiency by reducing the possibility of a long running thread, preventing a large block from being retired from an SM. Making the block size equal to the number of
threads in a warp eliminates the scenario of one active thread in a warp in a block of multiple warps. Now if a block has only one active thread it only affects that individual warp and not many. This increases the ratio of eligible warps to active warps; therein boosting performance.

4.2.3 Sub-optimal Occupancy

Another important consideration when programming on a GPU is occupancy. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. It helps measure a kernel's ability to utilize resources of an SM. The number of blocks that can execute concurrently on an SM is limited by multiple factors, such as the size of blocks, the quantity of warps and registers, and the amount of shared memory. A subset of the resources available on an NVIDIA GTX 1080, the GPU used in this research, is given in Table 4.2.

<table>
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>Pascal</td>
</tr>
<tr>
<td>Clock Rate</td>
<td>1860 MHz</td>
</tr>
<tr>
<td>CUDA Cores</td>
<td>2560</td>
</tr>
<tr>
<td>Global Memory</td>
<td>8GB</td>
</tr>
<tr>
<td>Shared Memory</td>
<td>48KB</td>
</tr>
<tr>
<td>Register Count</td>
<td>65,536</td>
</tr>
<tr>
<td>SM Count</td>
<td>20</td>
</tr>
<tr>
<td>Max Threads Per SM</td>
<td>2048</td>
</tr>
<tr>
<td>Max Blocks Per SM</td>
<td>32</td>
</tr>
<tr>
<td>Max Warps Per SM</td>
<td>64</td>
</tr>
</tbody>
</table>

Table 4.2: NVIDIA GTX 1080 Hardware Specifications: CC 6.1.
If a block requires too much of any one resource, it limits the number of active blocks on that SM. As previously stated, in order to minimize the effect of low thread execution efficiency, one must increase the number of active warps or reduce the time the active warps are stalled. This is accomplished by setting the block size equal to that of a warp, or 32 threads as stated in Section 4.2.2. On an SM with compute capability 6.1, like the GTX 1080, the maximum number of active blocks is 32 and the maximum number of active warps is 64. That means there must be at least two active warps per active block to reach 100% occupancy. By limiting the number of threads per block to increase thread execution efficiency, the theoretical occupancy of the Up and Down kernels are limited to 50%, meaning 50% of the GPU is underutilized at any given time during kernel execution.

In order to increase occupancy, without removing any previously mentioned optimizations, warp-level functionality in cooperative groups can be used to increase the number of warps per block to two. This can be achieved by doubling the amount of shared memory so each warp works on independent data. Using warp synchronization instead of block synchronization eliminates false dependency on shared memory, while increasing block size. Using two warps per block does slightly increase the possibility of low thread execution efficiency, but there will be zero negative impact because the method is utilizing warps that would otherwise be inactive. Figure 4.8 shows that when the number of warps per block is doubled to two, compute and memory utilizations increase by roughly 7% over shared memory and prefetching.
4.2.4 Grid-Stride Looping

When programming on GPUs, different implementation techniques can indirectly affect performance. There are two popular techniques to implement kernels using CUDA, *monolithic kernels* and *grid-stride loops*. A monolithic kernel utilizes a single large grid containing one thread per element and processes the entire array in one pass. Grid-stride loops deploy a small grid of threads and loops over the data one grid at a time. They let you decouple the size of your CUDA grid from the data size it is processing. In Figure 4.9, diagram (a) displays a monolithic kernel where the quantity of threads in a grid matches the data set. Diagram (b) shows the grid-stride looping technique covering the same data set with half the number of threads in a grid.

**Figure 4.8**: GPU Utilization: Naive (green), Shared (purple), Prefetch (orange), vs Two Warps per Block (Blue).
Grid-stride loops also provide additional tuning capability by allowing configurable block-grid sizes per kernel per device. The goal is to achieve optimal persistence of a thread block on an SM. It is often the preferred technique because it reduces the overhead of launching, maintaining, and destroying additional blocks. It is also regarded as being more flexible, scalable, and portable. Additionally, it has the ability to handle the case when there is a mismatch between grid size and the size of the data set. Examples of both monolithic and grid-stride looping kernels, using SAXPY computation, are shown in Technique 3 and Technique 4, respectively.

**Figure 4.9:** (a) Monolithic kernel and (b) Grid-stride looping.

Technique 3 Monolithic

**Require:** $n \leftarrow$ size of array

1: tid = blockIdx.x * blockDim.x + threadIdx.x
2: if $tid < n$ then
3:   $y[i] = a \times x[i] + y[i]$
4: end if

Choosing the optimal grid size for grid-stride looping is kernel dependent but it is suggested to be a multiple of the number of SMs. This methodology helps
**Technique 4 Grid-Stride Loop**

**Require:** $n \leftarrow$ size of array

1: $tid = blockIdx.x \times blockDim.x + threadIdx.x$
2: $gridSize = blockDim.x \times gridDim.x$
3: **for** $i = tid; i < n; i += gridSize$ **do**
4: \[ y[i] = a \times x[i] + y[i] \]
5: **end for**

reduce any tail effect caused by inactive SMs as a kernel finishes a task. This parallel implementation also requires the quantity of blocks to be a multiple of 32 in order to utilize all 32 active blocks of an SM. Therefore, the minimum number of blocks should be 32 times the number of SMs, or 640 on a GTX 1080. Using 64 threads in a block gives a grid size of 40,960. Particle sets with a quantity less than this grid size will underutilize the GPU. Table 4.3 provides a list of grid sizes used in this research.

**Table 4.3:** Grid-Stride Looping Sizes

<table>
<thead>
<tr>
<th>Number of Particles</th>
<th>Blocks per Grid</th>
<th>Threads per Grid</th>
</tr>
</thead>
<tbody>
<tr>
<td>1024</td>
<td>16</td>
<td>1024</td>
</tr>
<tr>
<td>2048</td>
<td>32</td>
<td>2048</td>
</tr>
<tr>
<td>4096</td>
<td>64</td>
<td>4096</td>
</tr>
<tr>
<td>8192</td>
<td>128</td>
<td>8192</td>
</tr>
<tr>
<td>16384</td>
<td>256</td>
<td>16384</td>
</tr>
<tr>
<td>32768</td>
<td>512</td>
<td>32768</td>
</tr>
<tr>
<td>65536</td>
<td>640</td>
<td>40960</td>
</tr>
<tr>
<td>131072</td>
<td>640</td>
<td>40960</td>
</tr>
<tr>
<td>262144</td>
<td>1280</td>
<td>81920</td>
</tr>
<tr>
<td>524288</td>
<td>1280</td>
<td>81920</td>
</tr>
<tr>
<td>1048576</td>
<td>2560</td>
<td>163840</td>
</tr>
</tbody>
</table>

76
That multiplier’s performance can vary depending on compute and memory utilization of the kernels themselves. Using grid-stride looping also allows data reuse. As an example, because the systematic resampling requires the same randomly generated number for each particle, the generation can be performed outside the grid-stride loop and used by all subsequent loops.

4.2.5 CUDA Streams

The parallel implementation presented in this paper consists of two kernels that are mutually exclusive. Fortunately, CUDA provides an additional level of concurrency in the form of streams. A stream is a sequence of operations that executes in issue-order on the GPU. Different streams may execute their commands, or kernels, concurrently or out of order with respect to each other. It is important to note that any kernel not explicitly designated to a specific stream is launched in the default stream, which is synchronous, or blocking. Running the two kernels in independent asynchronous, or non-blocking, streams will not increase individual kernel occupancy, but now the GPU has an opportunity to run blocks from both kernels concurrently as resources become available [44]. A visual aid showing algorithmic concurrency is provided in Figure 4.10.

There are factors that effect streams’ ability to run concurrently. First, SMs must wait until enough resources are available to launch a kernel’s warps. Figure 4.11 shows that even though the Up and Down kernel are in independent streams they don’t execute with perfect concurrency. There is an initial time period where the Up kernel is utilizing all available resources on the GPU. While overlapping kernels may
slightly extend individual kernel execution time, it reduces overall execution of both kernels by allowing the hardware to utilize all available resources and removes any tail effect from the kernel launched first. This tail effect occurs as a kernel retires blocks from the SM and the SM becomes underutilized.

Henceforth, the combination of shared memory, software prefetching, and utilizing two warps per block will be referred to as the improved implementation. Timing comparisons of the naive and improved implementations of Algorithm 12 will be presented in Section 6.1. Also, the source code is provided for the naive implementation
in Section A.1 and Section A.2 and the improved implementation in Section A.7 and Section A.8.

4.2.6 Algorithmic Efficiency

In computer science, algorithmic efficiency is the property of an algorithm which relates to the number of computational resources used by said algorithm. The algorithm is considered efficient if the resource consumption of concern falls at or below a particular threshold. In this section, the serial and parallel implementations will be compared across three metrics; time complexity (the amount it takes to finish the problem), work efficiency (how much work is performed by all threads in the algorithm), and space efficiency (the amount of memory allocation required). For simplicity, these comparisons will be analyzed on a parallel random-access machine (PRAM). Doing so neglects many real-life practical issues, such as memory access time, synchronization, and communication. It also removes hardware constraints like the number of processors. Removing such complexities allows for a more general understanding of the best and worst case scenarios from a mathematical perspective. Also, this section will be focused on resampling and will negate the prefix-sum and random number generation because those trends are well understood.

4.2.6.1 Time Complexity

Before proceeding, a baseline must be established. The baseline will be the serial algorithm, as stated earlier in Section 2.4.2 and Section 2.4.3, which has a time complexity of $T_s = \mathcal{O}(N)$, or linear time. The serial algorithm is highly determin-
istic but the parallel algorithm’s performance is data dependent. Unlike the serial algorithm, the parallel can be broken down into best and worst case scenarios. Best case is when particle variance is at its lowest and worst case is when variance is at its highest. The time complexities for the naive parallel implementation are shown below in (4.1) and (4.2)

\[ T_{nb} = \mathcal{O}(1). \] (4.1)

\[ T_{nw} = \mathcal{O}(N). \] (4.2)

where \( T_{nb} \) is the naive best case time complexity and \( T_{nw} \) is the naive worst case.

In the best case scenario, (4.1), all threads, in both kernels, perform one work cycle before returning to an inactive state, as shown in Figure 4.12. The worst case scenario will occur when all the elements in the prefix-sum have a value of zero except element \( N - 1 \) which has a value of 1. This is the maximum variance capable in the problem.

In the Down kernel, all threads will perform one cycle and return to an inactive state. While in the Up kernel, all threads traverse the entire array to the right, requiring the first thread to do so \( N \) times, as shown in Figure 4.13. To ensure that the first thread in the Down kernel does not read out of bounds memory, it is set to inactive by default.

Therefore, the algorithmic speedup, \( S = \frac{T_a}{T_p} \), can range from \( S = \mathcal{O}(N) \rightarrow \mathcal{O}(1) \).

Next, the improved implementation has a slightly different time complexity due to the iterations through shared memory. In the shared memory implementations, a warp loads two 32 element chunks into a shared memory space. All threads must be active during this load. Therefore, threads cannot be retired immediately after
satisfying the comparator. They must finish all iterations in the for loop before all bit masks are checked. If all threads have satisfied the comparator at the end of the for-loop the warp can be retired. This means a minimum of 32 iterations will be executed instead of only one for the best case, as shown in Figure 4.14. The exception to this is the last warp in the Up kernel and the first warp in the Down kernel. In these warps, threads load data from global memory directly into registers. This ensures threads in those warps don’t traverse out of bounds.

Figure 4.12: Best case scenario for naive implementation

Figure 4.13: Worst case scenario for naive implementation

Figure 4.14: Best case scenario for improved implementation
The worst case time complexity in the improved Up kernel will match that of
the naive implementation because additional work cannot be performed due to the
boundary conditions, as shown in Figure 4.15. The Down kernel will be the same for
both cases.

\[
\begin{array}{cccccccc}
N & N-1 & N-2 & N-3 & \ldots & 3 & 2 & 1 \\
\end{array}
\quad
\begin{array}{cccccccc}
0 & 1 & 1 & 1 & \ldots & 32 & 32 & 32 & 32 \\
\end{array}
\]

\text{Up: Operations per thread} \quad \text{Down: Operations per thread}

\textbf{Figure 4.15:} Worst case scenario for improved implementation

Requiring all threads to remain active and evaluate the for-loop even after
the comparator has been satisfied is less detrimental to performance than applying
branching to check the bit mask at loop iteration. The time complexities for the
improved parallel implementation are shown below in (4.3) and (4.4).

\begin{align*}
T_{ib} &= \mathcal{O}(32). \\
T_{iw} &= \mathcal{O}(N).
\end{align*}

(4.3) \quad (4.4)

Therefore, the improved implementation’s algorithmic speedup, \( S = \frac{T_i}{T_p} \), can
range from \( S = \mathcal{O}(N) \to \mathcal{O}(32) \).

4.2.6.2 Work Efficiency

An algorithm’s work efficiency considers the total amount of work that is
performed across all threads to arrive at the correct answer. Again the baseline will
be the serial implementation, which is \( W_s = \mathcal{O}(N) \), because a single CPU thread will perform all the work. The work efficiency is \( WE = \frac{W_s}{W_p} \), where \( W_p \) is the work performed by the parallel implementation. As with time complexity, work performed by the naive parallel kernels will have a best and worst case scenario, represented by \( W_{nb} \) (4.5) and \( W_{nw} \) (4.6), respectively.

\[
W_{nb} = \frac{N}{U_p} + \frac{N - 1}{D_{own}} \quad (4.5)
\]

\[
W_{nw} = \frac{(N - 1)(N)}{2} + \frac{N + N - 1}{U_p} + \frac{N - 1}{D_{own}} \quad (4.6)
\]

Using L'Hôpital’s rule, the limits of the work efficiency can be evaluated as \( N \to \infty \).

\[
WE_{nb} = \frac{W_s}{W_{nb}}
\]

\[
\lim_{N \to \infty} WE_{nb} = \frac{W_s}{W_{nb}} = \frac{N}{N + N - 1} = \frac{N}{2N - 1} = \frac{1}{2 - \frac{1}{N}} = \lim_{N \to \infty}(1) \lim_{N \to \infty}(2 - \frac{1}{N}) = \frac{1}{2}
\]
While the algorithm has the potential to significantly improve time complexity, it comes at the cost of work efficiency. Because there are two while-loops in the parallel implementation, best case takes double the amount of work as \( N \to \infty \).

For the worst case, the work per thread in the parallel implementation is effectively serialized. Therefore, it makes sense that as \( N \to \infty \), there will be zero work efficiency.

For the improved implementation, the additional work per thread, due to shared memory utilization, must be taken into account. Work complexity will be computed in terms of warps where \( WS \) equals the warp size of 32, and \( F \) equals the number of iterations of the for loop, or 32. Because of software prefetching,
WS is multiplied by two. The best and worst case are presented in (4.7) and (4.8), respectively. Notice that in the (4.7) the WS is multiplied by a factor of 2 to account for software prefetching. In (4.8), we can reuse the time complexity equation to calculate the worst case for the Up kernel because the amount of work will be the same.

\[
W_{ib} = \left( (N - (WS \times 2)) \times F \right)_{Up} + (WS \times 2) + \left( (N - (WS \times 2)) \times F \right)_{Down} + ((WS \times 2) - 1)_{Down}
\]

(4.7)

\[
W_{iw} = \left( \frac{(N - 1)(N)}{2} \right)_{Up} + N + \left( (N - WS) \times F \right)_{Down} + (WS - 1)_{Down}
\]

(4.8)

Using L’Hôpital’s rule, the limits of the work efficiency can be evaluated as \( N \to \infty \).

\[
WE_{ib} = \frac{W_s}{W_{nb}}
\]

\[
\lim_{N \to \infty} \frac{W_s}{W_{ib}} = \frac{N}{(N - 64) \times 32 + 64 + (N - 64) \times 32 + 63}
\]

\[
= \frac{N}{64N - 3969}
\]

\[
= \frac{1}{64 - \frac{3969}{N}}
\]

\[
= \lim_{N \to \infty} (1) = \lim_{N \to \infty} \left( 64 - \frac{3969}{N} \right)
\]

\[
= \frac{1}{64}
\]
\[ WE_{iw} = \frac{W_s}{W_{nw}} \]

\[ \lim_{N \to \infty} = \frac{W_s}{W_{iw}} \]

\[ = \frac{(N-1)(N)}{2} + N + (N - 64) \times 32 + 63 \]

\[ = \frac{2N}{N^2 + 65N + 3970} \]

\[ 2 \times \lim_{N \to \infty} = \frac{N}{N^2 + 65N + 3970} \]

\[ = \frac{\frac{1}{N}}{1 + \frac{65}{N} + \frac{3970}{N^2}} \]

\[ = \lim_{N \to \infty}(\frac{1}{N}) \]

\[ = \lim_{N \to \infty}(1 + \frac{65}{N} + \frac{3970}{N^2}) \]

\[ = 2 \times \frac{0}{1} \]

\[ = 0 \]

The best case requires 32 times more work than the naive implementation, while the worst case has the same efficiency.

### 4.2.6.3 Space Efficiency

For space efficiency, the parallel implementation requires memory allocation to store resampling indexes for both the Up and Down kernels in global memory. This means the parallel implementation requires twice the memory footprint. For the worst case that would be $2^{20}$ elements * 4 bytes, for single precision, requiring 4MB of additional storage, which is not a significant impact on today’s devices.
CHAPTER 5

PSEUDORANDOM NUMBER GENERATORS

Anyone who considers arithmetical methods of producing random digits is, of course, in a state of sin.

—John von Neumann

To ensure that Monte Carlo simulations provide accurate approximate numerical solutions, the RNGs that provide stochastic input to each trial must be indistinguishable from a true random number source. In reality, RNGs in simulations do not produce random numbers, but use deterministic algorithms to produce numbers that mimic randomness. This is a favored characteristic of RNGs because it allows repeatability. Repeatability in Monte Carlo simulations allow engineers to reproduce Monte Carlo variations that cause unwanted effects for further analysis. The main method used to produce samples is to first generate uniform random numbers, then apply a transform to convert the uniform numbers into samples of the desired distribution. RNGs can be classified into three groups:

1. True Random Number Generators

   - Truly stochastic and are primary using in cryptography and secure sockets layer (SSL) protocols. Random numbers are usually generated on hardware
devices through a physical process based on random noise signals, such as thermal noise, the photoelectric effect, beam splitter, or some other quantum phenomena.

2. Quasirandom Number Generators

- Used to evenly fill an n-dimensional space with points, without clumping points; also known as low-discrepancy. Although they can be used in Monte Carlo simulations they were not considered for this research.

3. Pseudorandom Number Generators

- Designed to model a true RNG but are actually computed using a deterministic algorithm. This characteristic of determinism makes them appealing in Monte Carlo simulation because of reproducibility.

This work focuses on pseudorandom number generators (PRNGs). For greater insight into algorithmic selection criteria and RNG constructions refer to [45]. This chapter will briefly examine implementation requirements of RNGs for parallel processing, describe the PRNGs offered by cuRAND and their implementation details, and demonstrate their performance in this research.

5.1 Background

Much work over the past few decades has gone into the development of random number generators on CPUs. These optimized PRNG implementations take advantage of large instruction sets and large amounts of fast memory available to
the core(s). Below are a few general requirements that PRNGs should satisfy for simulation usage.

1. Execution speed

   • Generators should be able to create random numbers in a timely manner.

2. Long period

   • Because deterministic generators eventually repeat, longer spans in between repetitions are highly desirable.

3. Repeatability

   • Random numbers should be reproducible between simulations runs and various systems.

4. Good statistical quality

   • As previously stated, the goal of PRNGs is to be indistinguishable from true random number generators. Therefore, it is important that signs of correlation or patterns are not shown.

Due to limited on-chip resources, as described in Chapter 3, these CPU implementations do not map well to GPU architecture in their naive form. In parallel environments utilizing multiple CPU cores and software such as OpenMP [46], a central monitor can be extended to create multiple streams that are considered statistically independent. However, creating a large number of streams for GPUs that can have thousands of cores can cause excessive overhead and create a bottleneck.
A naive method of implementing RNGs is to generate a large quantity of numbers on the host or device and store them in device memory to be utilized by threads. This method suffers from memory bandwidth. First, if the numbers were to be generated on the host they would need to be transferred to device memory. Second, once numbers are in device memory, in order to be utilized by a kernel, data must be transferred to on-chip resources, whether that be registers or shared memory.

The preferred method is to use multiple PRNGs, or copies of the same PRNGs, with substreams starting at different states. If these substreams are able to imitate individual independent streams, reproducibility can be achieved by distributing substreams among Monte Carlo runs. They can be created using multiple PRNGs or by taking a single sequence and splitting it into smaller sequences that are used by threads. It is advised that the starting points of the substreams be sufficiently spread out to ensure no overlap occurs. For a PRNG to be effective in Monte Carlo simulations on a GPU, additional requirements include [47]:

1. Stream division
   - The generator must be able to divide a stream into substreams that can be utilized on parallel nodes with no overlap.

2. Substream quality
   - Substreams must retain good statistical quality and appear to be an independent stream of random numbers.
The NVIDIA cuRAND library offers multiple ways to achieve the requirements listed. The next section will describe their implementation techniques.

5.1.1 Single PRNG - Multiple Substreams

The most common approach to create multiple streams is from a single PRNG split into equally spaced streams. According to the cuRAND Programming Guide, that approach is considered to have higher statistically quality [32]. Each experiment will be assigned a unique seed and each thread will be designated a unique sequence number. It is also advised that sequence numbers be assigned in a monotonically increasing way. A common practice is to assign the thread index to the sequence number. A code snippet is provided below.

```c
__global__ void generateRandomNumbers ( int n, ull seed, float * devRanNum ) {
    int t = threadIdx.x, b = blockIdx.x, g = gridDim.x;
    for ( unsigned int tid = b * blockDim.x + t; tid < n; tid += blockDim.x * g ) {
        curandStateXORWOW_t localState { };
        curand_init ( seed, tid, 0, & localState );
        devRanNum [ tid ] = curand_uniform ( & localState );
    }
}
```

In the sample code, the variable `localState` is created to hold a local on-chip copy of an XORWOW state. (XORWOW is a type of PRNG that will be described in further detail in the section.) Next, the state is initialized using `curand init()` with its parameters listed in Table 5.1.

The same seed is passed to all threads from the host, where a seed can easily be created using the `clock()` command. The initialization function also accepts the sequence number, in this case `tid`, which is the thread index created by grid-stride
Table 5.1: `curand_init()` Parameters

<table>
<thead>
<tr>
<th>Type</th>
<th>Parameter</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>unsigned long long int</td>
<td>seed</td>
<td>Arbitrary bits to use as a seed</td>
</tr>
<tr>
<td>unsigned long long int</td>
<td>sequence</td>
<td>Subsequence to start at</td>
</tr>
<tr>
<td>unsigned long long int</td>
<td>offset</td>
<td>Absolute offset into subsequence</td>
</tr>
<tr>
<td>curandStateXORWOR_t</td>
<td>state</td>
<td>Pointer to state to initialize</td>
</tr>
</tbody>
</table>

looping. Finally, a uniform random number is generated with `curand_uniform()` and stored back in global memory. It should be noted though that using `clock()` to generate the seed will not allow repeatability between runs.

With this implementation in cuRAND, each thread is provided with an independent substream equally spaced apart. If the given seeds are the same across all threads, the starting state for each thread is calculated as such:

\[
state = 2^{67} \times \text{sequence} + \text{offset} \tag{5.1}
\]

With XORWOW as an example, the main sequence has a period of \(2^{190}\). Therefore, monotonically increasing the sequence id with a zero-based thread index provides a total of \(2^{123}\) independent streams. Then each thread will have a range of \(2^{67}\) states before transitioning into the next substream.
5.1.2 Single PRNG - Multiple Seeds

If jumping through the sequence is too taxing on performance, an alternative method is to initialize each thread with a different seed and set the sequence number to a constant zero for all threads. One must ensure that the PRNG has a relatively large period compared to the number of values generated. Applying different seeds to each thread initializes the thread to some random position within the period. A code snippet is provided below.

```c
__global__ void generateRandomNumbers ( int n, ull seed , float * devRanNum ) {
    int t = threadIdx.x, b = blockIdx.x, g = gridDim.x;
    for ( unsigned int tid = b * blockDim.x + t; tid < n; tid += blockDim.x * g ) {
        curandStateXORWOW_t localState { };  
        curand_init ( seed + tid , 0, 0, & localState );
        devRanNum [ tid ] = curand_uniform ( & localState );
    }
}
```

Notice in this snippet, the thread index is added to the seed to allow each thread to have a different seed, while the sequence number is set to zero. Unfortunately, this method does not guarantee uncorrelated random numbers. The limitations of the method come from the fact that interactions between the seeding algorithm and the mathematical properties of the PRNG can be erratic. This means that it is possible for multiple threads to begin at the same location. Although possible, this phenomena is highly improbable. Given a period $p$, with $s$ number of streams, each with a length $\ell$, the probability $P$ that there will be zero collisions can be given as
\[ P = \left(1 - \frac{s \ast \ell}{p}\right)^{s-1} \]  

(5.2)

This gives

\[ \ln P = (s - 1) \ast \ln(1 - x) \]  

(5.3)

\[ = \frac{s \ast \ell}{p} \]  

(5.4)

\[ = \frac{s^2 \ast \ell}{p} \]  

(5.5)

When \(x\) is small,

\[ P \approx -(s - 1)x \]  

(5.6)

Therefore,

\[ 1 - P \approx -\ln P \]  

(5.7)

\[ \approx (s - 1)x \]  

(5.8)

\[ \approx \frac{s^2 \ast \ell}{p} \]  

(5.9)

In the case of XORWOW, the main sequence has a period of \(2^{190}\). The highest number of threads utilized for this research is \(2^{20}\) because of stability issues with
prefix-sum in combination with single precision. This instability will be described in greater detail in Chapter 6. Letting $s = \ell = 2^{20}$ the probability is

$$1 - P \approx \frac{(2^{20})^2 \times 2^{20}}{2^{190}}$$  \hspace{1cm} (5.10)

$$\approx 7.3468e^{-40} \quad (5.11)$$

If initialization is performed each time step, where $\ell = 1$, probability can be reduced even further giving

$$1 - P \approx \frac{(2^{20})^2 \times 1}{2^{190}}$$  \hspace{1cm} (5.12)

$$\approx 7.0065e^{-46} \quad (5.13)$$

This approach will also prevent the possibility of recursive collisions.

### 5.1.3 Statistical Experiments

To ensure that numbers generated by RNGs meet statistics requirements, they can be analyzed by a series of battery tests. For many years, the test suite of choice was DIEHARD [48], but as technology advanced and higher precision became readily available this suite was replaced with TestU01 [49]. TestU01 is used to compare statistical properties of pseudorandom sequences to those expected from a true random process. For clarity, it is important to state that even if a PRNG passes all tests,
that does not mean the generator is flawless on a given system running a particular simulation [50]. For cryptographic application developers, the preferred reference suite is NIST battery test [51]. A subset of the results from these battery tests is provided in Section 4 of the cuRAND Programming Guide [32].

5.2 Random Number Generator Implications

As mentioned in Section 3.3.2, NVIDIA’s cuRAND library facilitates the simple and efficient generation of high-quality pseudorandom and quasirandom numbers. While a pseudorandom sequence is generated by a deterministic algorithm, it satisfies most of the statistical properties of a truly random sequence and is well suited for this research. Although cuRAND comes with five PRNGs, MT19937 will not be used. In this section, the remaining four PRNGs will be tested to determine how they affect the timing of the resampling step and the quality of the root mean squared error (RMSE).

MT19937 is 32-bit version for Mersenne Twister PRNG based on a period length of the Mersenne prime $2^{19937}-1$. It was proposed for generating uniform pseudorandom numbers and was developed in 1997 by Makoto Matsumoto and Takuji Nishimura [52]. In cuRAND it is only available through the host API. Although some host API calls are highly optimized, they required expensive stores to global memory when the call finishes its task and again by kernels loading the data from global memory into on-chip registers. This process causes significant negative impact to performance and is therefore not used.
The MTGP32 generator provided in cuRAND is an adaptation of the Mersenne Twister optimized for GPUs developed by Saito and Matsumoto [53]. It requires input parameters that allow many threads to compute the recursion in parallel. cuRAND provides a 200 parameter set that has been pre-generated for the 32-bit generator with period $2^{11214}$. It allows thread-safe generation and state update for up to 256 concurrent threads (within a single block) for each of the 200 sequences [32]. Because neither the naive nor improved parallel implementations exceed 256 threads per block this will not be a limitation. However, by using these pre-generated states, the maximum number of blocks in a grid can be 200. MTGP32’s performance is significantly impacted by the host API calls `curandMakeMTGP32Constants` and `curandMakeMTGP32KernelState`. These host functions help set up parameters for different sequences in global memory and set up the initial state. The function `curandMakeMTGP32Constants` takes the pre-generated parameter set and reorganizes the format before copying it to global memory. The next function, `curandMakeMTGP32KernelState`, initials the required number of states and copies them to global memory. Neither function has alternative stream functionality, therefore forcing the calls to run serially in the default stream. In order to ensure that the resampling kernels do not try to use the PRNG data before the memory is finished copying, `cudaDeviceSynchronize` must be called to block all kernel calls until copying is finished.

The last three PRNGs, XORWOW, MRG32k3a, and Philox_4x32_10, are relatively simple to implement because they do not require any host API calls and can be called directly from a kernel. XORWOW is a member of the xor-shift fam-
ily, a subset of linear-feedback shift registers (LFSR), and was discovered by George Marsaglia [54]. It is one of the fastest non-cryptographically-secure random number generators and works by generating the next number in series by taking the exclusive or of the number, hence the name. MRG32k3a is a 32-bit combined multiple recursive generator with two components of order 3. It has good multidimensional uniformity, or a long period, and performs well in statistical tests up to at least 45 dimensions [55]. Philox\textsubscript{4x32\_10} is a non-cryptographic bijection that uses multiplication instructions that compute the high and low halves of the product of word-sized operands and, as the name suggests, utilizes four 32-bit unsigned int values [56].

Table 5.2: RMSE comparison between PRNGs at $2^{16}$ particles.

<table>
<thead>
<tr>
<th>PRNGs</th>
<th>RMSE for States</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>$x_1$</td>
</tr>
<tr>
<td>XORWOW</td>
<td>0.2058</td>
</tr>
<tr>
<td>MRG32k3a</td>
<td>0.2062</td>
</tr>
<tr>
<td>Philox\textsubscript{4x32_10}</td>
<td>0.2059</td>
</tr>
<tr>
<td>MTGP32</td>
<td>0.2061</td>
</tr>
</tbody>
</table>

Table 5.2 shows the quality results for 100 MCs over 2500 samples for the improved parallel systematic resampling utilizing the multiple seed, single sequence id technique. It is clear that the results for each PRNG are nearly identical. Although their qualities are similar, the timing and implementations for the PRNGs vary greatly. Table 5.3 will be used to better understand these differences. It is split into two subtables, one consisting of statistics when calculating the starting state of each time step and the other when the starting state is only initialized once during ap-
application startup. For calculating repeatedly, the entire resampling process, including state initialization and random number generation, is encompassed in the **Timing** column. When initializing at application startup, the amount of time required for initialization is in the column titled **Setup** and the time it takes to execute the resampling process, with only random number generation, is in the **Timing** column. Also provided is the number of registers required for the kernel incrementing up the while-loop and the kernel decrementing down the while loop, in columns **Up** and **Down**, respectively.

**XORWOW** and **Philox_4x32_10** provide the fastest execution times for both systematic and stratified resampling algorithms and have similar implementation techniques. Both PRNGs are lightweight enough that state initialization and random number generation can be performed inside the kernels each time step. Implementing MRG32k3a in this fashion causes the worst execution time of all the PRNGs because the state initialization and random number generation of MRG32k3a are much slower than the previous two techniques. This is due to the amount of double precision registers required during state initialization. They consume all available registers in a block and then require additional local memory as described in Section 3.2.1.2, severely impacting performance. This issue is addressed in the cuRAND Programming Guide [32] under Section 3.5, Device API - Performance Notes. NVIDIA suggests saving and restoring the random generator state instead of recalculating the starting state repeatedly. By separating the initialization and random number generator, kernels using MRG32k3a were able to run at a speedup of 8x, although it remains the slowest overall. It is important to note that applying this best practice to XORWOW and
Philox.4x32.10 implementations degrades performance because loading the random generator state from global memory is slower than state initialization.

**Table 5.3:** Timing and registers statistics across PRNGs at $2^{16}$ particles.

<table>
<thead>
<tr>
<th>PRNG</th>
<th>Systematic</th>
<th></th>
<th>Stratified</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Setup (us)</td>
<td>Timing (us)</td>
<td>Registers</td>
<td>Setup (us)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Up Down</td>
<td></td>
</tr>
<tr>
<td>XORWOW</td>
<td>29</td>
<td>21</td>
<td>23</td>
<td>31</td>
</tr>
<tr>
<td>Philox.4x32.10</td>
<td>30</td>
<td>23</td>
<td>23</td>
<td>34</td>
</tr>
<tr>
<td>MRG32k3a</td>
<td>997</td>
<td>80</td>
<td>80</td>
<td>997</td>
</tr>
<tr>
<td>MTGP32</td>
<td>82</td>
<td>20</td>
<td>22</td>
<td>459</td>
</tr>
</tbody>
</table>

<table>
<thead>
<tr>
<th>PRNG</th>
<th>Systematic</th>
<th></th>
<th>Stratified</th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Setup (us)</td>
<td>Timing (us)</td>
<td>Registers</td>
<td>Setup (us)</td>
</tr>
<tr>
<td></td>
<td></td>
<td></td>
<td>Up Down</td>
<td></td>
</tr>
<tr>
<td>XORWOW</td>
<td>25</td>
<td>87</td>
<td>23 24</td>
<td>25</td>
</tr>
<tr>
<td>Philox.4x32.10</td>
<td>54</td>
<td>100</td>
<td>37 36</td>
<td>54</td>
</tr>
<tr>
<td>MRG32k3a</td>
<td>600</td>
<td>128</td>
<td>30 28</td>
<td>600</td>
</tr>
<tr>
<td>MTGP32</td>
<td>50</td>
<td>32</td>
<td>20 22</td>
<td>250</td>
</tr>
</tbody>
</table>

MTGP32 has by far the most complicated implementation in order to be utilized efficiently in the parallel systematic and stratified resampling techniques. This is partly due to the required host API calls and partly due to the grid size limitation discussed earlier in the section. While the host API calls required for initialization are twice as fast as the MRG32k3a initialization, they do not come with alternative stream capability. This means they must be executed in the default stream. Because
the random generator states are inputs to both the parallel implementation kernels, which are not in the default stream, the only way to ensure the states are not used before they are created is to run `cudaDeviceSynchronize()` after initialization. Unfortunately, this API call blocks all future API calls and kernel launches until the device has completed all preceding requested tasks and requires more overhead than other synchronization calls. A better approach, similar to MRG32k3a’s, is to perform initialization once at application startup. Next, as mentioned earlier, using the pre-generated parameters provided by NVIDIA, MTGP32 can only utilize 200 blocks in a grid. This inhibits the improved parallel implementation from launching a large number of blocks to hide its thread execution inefficiency as described in Section 4.2.2.

In the kernel, device API call `curand_mtgp32_specific` is being used to generate uniform random numbers. Its constraints are listed below:

1. At most 256 threads may call this function for a given state.

2. Within a block, for a given state, if n threads are calling the function, the indices must run from 0...n-1. The indices do not have to match the thread numbers, and may be distributed among the threads as required by the calling program. At a given point in the code, all of the indices from 0...n-1, or none of them, must be used.

3. A given state may not be used by more than one block.

4. A given block may generate randoms using multiple states.
Item three from the list above further limits the parallel implementation because both the **Up** and **Down** kernels are running in different streams concurrently and it is possible that a given block from each kernel could, in theory, access the same state. To ensure this does not occur, each kernel must run with only 100 blocks and in one kernel the input parameter to `curand_mtgp32_specific` must be offset by 100. Finally, notice in Table 5.3 that when using MTGP32, the timing for systematic and stratified resampling differ. Due to second item in the above list, systematic resampling cannot be implemented in the conventional manner. For the systematic implementation, prior to launching the resampling kernels, a kernel must be launched with one thread and one block to produce a single uniform random number. This number is then stored in global memory and loaded by the following kernels. In turn, systematic is faster than stratified when utilizing MTGP32. This speedup is more pronounced when the starting state is calculated repeatedly because of the additional number of states being initialized and copied.

Section 3.5 of the cuRAND Programming Guide also suggests that one way to speed the random generator is to use different seeds for each thread and a constant sequence number of 0. This was used for the XOR WOW, Philox_4x32_10, and MRG32k3a implementations in this research. It is known that this method provides less guarantees about the mathematical properties of the generated sequences and a possible side effect is that some threads may have highly correlated outputs. It is noted in the cuRAND Programming Guide that this is rare and this research did not see any adverse effects with this implementation technique [32]. Because the quanti-
ties of particles in this research can be considered low in respect to the sequence sizes of the PRNG, quality of the results is not affected by using this enhancement.
CHAPTER 6

TESTING AND ANALYSIS

*Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it.*

—Brian Kernighan, Princeton

6.1 Experimentation Setup

To examine the performance of the parallel approach introduced in this paper, it is compared to alternatives using a general nonlinear, non-Gaussian four state-space model from research provided by Schön [7] and can be found under Software and Data sets - Rao-Blackwelled particle filter at [http://user.it.uu.se/~thosc112/research/](http://user.it.uu.se/~thosc112/research/). For simplicity, a BPF was used to run 100 Monte Carlos of 2500 samples with $2^{16}$ and $2^{20}$ particles, with resampling performed every time step. Particle set sizes do not exceed $2^{20}$ to eliminate possible numerical instabilities produced during the prefix-sum, as pointed out by Murray [28] while using single-precision on GPUs. Double-precision on consumer-level cards and embedded systems is significantly slower.
These tests were performed using a desktop computer running Kubuntu 18.04 with an NVIDIA GTX 1080 and an Intel i7-5960X. Specifications are provided in Table 4.2 and Table 6.1, respectively. Serial versions of the systematic and stratified are run on the CPU with optimization level -O3. Code was written and tested in C++ GNU 7.3.0 and CUDA 10.0. The --use_fast_math flag was set at compile time to improve the use of any special functions by forcing the use of intrinsics. Intrinsic functions are faster as they map to fewer native instructions. A XORWOW generator was used from cuRAND [32] to generate all random numbers on the device. To ensure both while-loops of the parallelized systematic and stratified methods had the same random numbers, a seed was generated on the CPU and sent to the GPU as a parameter to each kernel [44].

Table 6.1: Intel i7-5960X Hardware Specifications.

<table>
<thead>
<tr>
<th>Property</th>
<th>Value</th>
</tr>
</thead>
<tbody>
<tr>
<td>Architecture</td>
<td>Haswell</td>
</tr>
<tr>
<td>Clock Rate</td>
<td>3.0 GHz</td>
</tr>
<tr>
<td>Overclock Rate</td>
<td>4.4 GHz</td>
</tr>
<tr>
<td>Cores</td>
<td>8</td>
</tr>
<tr>
<td>Threads</td>
<td>16</td>
</tr>
<tr>
<td>Cache</td>
<td>20 MB</td>
</tr>
<tr>
<td>Lithography</td>
<td>22nm</td>
</tr>
</tbody>
</table>

The prefix-sum at line (1) in Algorithm 12 can be implemented in parallel using the CUB library [57]. It is an NVIDIA library containing collective kernel-level primitives designed around reusable software components such as warps and blocks for the SIMT paradigm. This library not only provides the benefit of simplified
coding, but is optimized by NVIDIA to use the latest hardware features and provides sustainability when porting this method to different NVIDIA GPU architectures.

To assess and compare the quality of the serial and parallel resampling algorithms, the RMSE [58] is used in the following form

$$ RMSE_{fo} = \sqrt{\frac{\sum_{i=1}^{M} \sum_{n=1}^{S} (f_n^i - o_n^i)^2}{M \times S}} $$ \hspace{1cm} (6.1)

where $M$ is the number of Monte Carlo runs, $S$ is the number of samples in each run, $f$ is the expected results, or truth data, and $o$ is the observed filter estimates.

Due to the stochastic nature of the resampling process, execution time is determined by averaging the execution time of all the Monte Carlos. Also, the first executed Monte Carlo was considered a warm-up run and the timing of that run was discarded. This warm-up run ensures the GPU is in the correct performance state, in this case the highest performance.

### 6.2 Simulation Results

Table 6.2 shows timing and RMSE for all systematic and stratified implementations. The naive GPU (N-GPU) implementations of Algorithm 12 is 8x and 16.42x faster than the serial CPU versions of systematic and stratified, respectively. The improved GPU (I-GPU) implementations are 16.89x and 33.43x faster. Stratified execution times are similar to the systematic on the GPU because the random number generation can be performed on each thread in parallel. All three produced nearly
identical RMSE. This is expected because the GPU implementations produce the exact resampling index as serial methods. Numerical differences can be attributed to rounding errors in CPU and GPU arithmetic. Timing results include the prefix-sum computation, random number generation, and resampling index search.

**Table 6.2:** RMSE comparison of Systematic/Stratified implementation at $2^{16}$ particles.

<table>
<thead>
<tr>
<th>Type</th>
<th>Speed ($\mu$s)</th>
<th>RMSE for States</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>$x_1$</td>
</tr>
<tr>
<td>Systematic (CPU)</td>
<td>456</td>
<td>0.2061</td>
</tr>
<tr>
<td>Stratified (CPU)</td>
<td>936</td>
<td>0.2061</td>
</tr>
<tr>
<td>Systematic (N-GPU)</td>
<td>57</td>
<td>0.2060</td>
</tr>
<tr>
<td>Stratified (N-GPU)</td>
<td>57</td>
<td>0.2060</td>
</tr>
<tr>
<td>Systematic (I-GPU)</td>
<td>27</td>
<td>0.2060</td>
</tr>
<tr>
<td>Stratified (I-GPU)</td>
<td>28</td>
<td>0.2060</td>
</tr>
</tbody>
</table>

The additional speedup of the improved implementations over the naive is directly correlated to the decreased number of global memory loads. Load quantities for $2^{20}$ particles for over 1000 samples are presented in Table 6.3 and can be found using the NVIDIA Compute mentioned in Section 3.3.5. The improved implementation reduces the number of global memory loads over 10x by utilizing shared memory.

**Table 6.3:** 32-bit global memory loads of naive and improved.

<table>
<thead>
<tr>
<th>$2^{20}$ Particles</th>
<th>Naive</th>
<th>Improved</th>
</tr>
</thead>
<tbody>
<tr>
<td>while_loop_increment</td>
<td>154,365,478</td>
<td>14,996,374</td>
</tr>
<tr>
<td>while_loop_decrement</td>
<td>152,593,046</td>
<td>13,611,845</td>
</tr>
</tbody>
</table>
As mentioned earlier, the historic bottleneck of particle filtering lies within the resampling step. This is evident in Figure 6.1, which shows that when running the serial systematic resampling on the CPU it consumes roughly 93% of workload performed in a time step. While the naive parallel implementation is able to shift this workload to 70%, it is the improved parallel implementation, with a workload of 32%, that transfers the majority of the computation performed from resampling to the prediction and update steps of particle filtering.

![Workload percentages of systematic resampling implementations.](image)

**Figure 6.1:** Workload percentages of systematic resampling implementations.

Table 6.4 shows speed and RMSE comparison between Metropolis, C1, C2; Rejection; and the parallelized systematic and stratified methods for a particle set of
First, when comparing to Table 6.2, it can be seen that the performance differences between the naive and improved parallel implementation are more pronounced as the number of particles increase. Table 6.4 shows that the improved parallelized implementations of the systematic and stratified methods provide at least a 13.13x speed over Metropolis, and as $B$ increases, that speed up becomes more pronounced. As expected, systematic and stratified techniques provide the best quality, with Metropolis in second. Although C1 provides the fastest execution time of all Metropolis implementations, it has the worst quality because it focuses on local selections, and the expected number of repetitions of the particles becomes different than that in Metropolis [30]. C2 provides a balance between speed and quality; speed because it is coalesced and quality because it is more similar to Metropolis. The parallelized systematic and stratified methods are slightly faster than C2 with $B \geq 32$, but nearly double the execution time with $B=16$. With C2 producing quality only slightly worse for this particular test problem, it seems to be the best trade-off between speed and quality. It is evident that while rejection resampling is faster than Metropolis and C2, at $B \geq 64$, it has the worst quality out of all the methods for this data set.

Variance of the particle weight array will play a significant roll in the performance of these resampling methods. In Metropolis resampling, for a given $B$, as the variance increases, the execution time will remain constant while the quality will worsen. Conversely, execution time of the systematic and stratified parallel implementations will most certainly increase as threads are required to traverse further through the particle weight array to satisfy the while-loop condition.
Table 6.4: RMSE comparison of resampling techniques at $2^{20}$ particles.

<table>
<thead>
<tr>
<th>Type</th>
<th>Cutoff ($B$)</th>
<th>Speed ($\mu$s)</th>
<th>RMSE for States</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td></td>
<td>$x_1$</td>
</tr>
<tr>
<td>Systematic (N)</td>
<td>1511</td>
<td>0.2060</td>
<td>0.1769</td>
</tr>
<tr>
<td>Stratified (N)</td>
<td>1516</td>
<td>0.2060</td>
<td>0.1769</td>
</tr>
<tr>
<td>Systematic (I)</td>
<td>375</td>
<td>0.2059</td>
<td>0.1769</td>
</tr>
<tr>
<td>Stratified (I)</td>
<td>390</td>
<td>0.2059</td>
<td>0.1769</td>
</tr>
<tr>
<td>Metropolis</td>
<td>16</td>
<td>5122</td>
<td>0.2062</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>10158</td>
<td>0.2061</td>
</tr>
<tr>
<td></td>
<td>64</td>
<td>20226</td>
<td>0.2059</td>
</tr>
<tr>
<td>Rejection</td>
<td>631</td>
<td>0.3352</td>
<td>0.2613</td>
</tr>
<tr>
<td>Metropolis (C1)</td>
<td>16</td>
<td>135</td>
<td>0.2192</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>259</td>
<td>0.2207</td>
</tr>
<tr>
<td></td>
<td>64</td>
<td>518</td>
<td>0.2216</td>
</tr>
<tr>
<td>Metropolis (C2)</td>
<td>16</td>
<td>245</td>
<td>0.2062</td>
</tr>
<tr>
<td></td>
<td>32</td>
<td>452</td>
<td>0.2061</td>
</tr>
<tr>
<td></td>
<td>64</td>
<td>881</td>
<td>0.2061</td>
</tr>
</tbody>
</table>

Figure 6.2 compares execution times of all methods over a sweep of particle set sizes from $2^{10}$ to $2^{20}$. For small particle sets, the CPU will perform better than all GPU methods. Execution times on a GPU will plateau for small particle sets because overhead is a larger contributor than computational workload. It can be seen that C1 and C2 are faster alternatives to the improved systematic and stratified parallel versions when $B$ is small, and might be an appealing choice if the impact to quality is not severe.
Figure 6.2: Execution times of the resampling algorithms in single-precision floating point arithmetic versus increasing number of particles \(2^N\): (a) serial systematic and stratified on CPU with -O3, (b) naive parallel Systematic and Stratified on GPU, (c) improved parallel systematic and stratified on GPU, (d) Metropolis on GPU, (e) Metropolis: C1 and C2 on GPU, (f) Rejection resampling on GPU.

6.3 Timing Across Multiple Variances

The previous section calculated timing performance based off a given benchmark problem. This section will display timing results of the parallel implementation across a sweep of variances in the particle weights. One of the characteristics of the parallel implementation is that as variance in particle weights increases so does execution time. Recall from Section 4.2.6 that the worst case scenario is when variance is at its maximum. Although possible, the purpose of resampling is to prevent that from happening and maintain the lowest variance possible. To get a better understanding
of how the parallel implementations should perform in more practical scenarios they are tested against the framework provided by Murray in [28].

Weights are simulated for \( N \) number of particles using the following equation:

\[
    w^i = \frac{1}{\sqrt{2\pi}} \exp\left( -\frac{1}{2} (x^i - y^i)^2 \right) \quad (6.2)
\]

where \( y \) is the observation, \( x^i \) is a set of weights generated with \( \mathcal{N}(0,1) \) and \( i = 1, \ldots, N \). As \( y \) increases, \( 0, \frac{1}{2}, 1, \frac{3}{2}, \ldots, 4; \) so does the relative variance. Table 6.5 and Table 6.6 show the timing and quantity of \( \ell \) results for \( 2^{16} \) number of particles for naive and improved systematic resampling, respectively. The results are an average across 10,000 Monte Carlos. Notice that there is a direct relationship between the maximum number of \( \ell \) to the maximum time taken to perform resampling. Also, these tables align with our work efficiency equations; as the variance increases, the work efficiency for both decreases, more so for the improved. More tables can be found in Section B.1 and Section B.2 for particle numbers ranging from \( 2^{10} \) to \( 2^{20} \) by a power of two.
### Table 6.5: Naive: Timing/Quantity(ℓ) at 2^{16} Particles - Multiple Variances

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
<th>Total (ℓ)</th>
<th>Total (ℓ)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>29</td>
<td>63</td>
<td>37</td>
<td>1</td>
<td>205</td>
</tr>
<tr>
<td>0.5</td>
<td>29</td>
<td>67</td>
<td>39</td>
<td>0</td>
<td>246</td>
</tr>
<tr>
<td>1.0</td>
<td>31</td>
<td>89</td>
<td>46</td>
<td>0</td>
<td>313</td>
</tr>
<tr>
<td>1.5</td>
<td>35</td>
<td>103</td>
<td>53</td>
<td>0</td>
<td>484</td>
</tr>
<tr>
<td>2.0</td>
<td>37</td>
<td>122</td>
<td>63</td>
<td>2</td>
<td>568</td>
</tr>
<tr>
<td>2.5</td>
<td>44</td>
<td>150</td>
<td>77</td>
<td>0</td>
<td>802</td>
</tr>
<tr>
<td>3.0</td>
<td>52</td>
<td>186</td>
<td>96</td>
<td>2</td>
<td>1,004</td>
</tr>
<tr>
<td>3.5</td>
<td>62</td>
<td>238</td>
<td>123</td>
<td>3</td>
<td>1,341</td>
</tr>
<tr>
<td>4.0</td>
<td>78</td>
<td>314</td>
<td>163</td>
<td>1</td>
<td>1,846</td>
</tr>
</tbody>
</table>

### Table 6.6: Improved: Timing/Quantity(ℓ) at 2^{16} Particles - Multiple Variances

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
<th>Total (ℓ)</th>
<th>Total (ℓ)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>28</td>
<td>46</td>
<td>31</td>
<td>0</td>
<td>226</td>
</tr>
<tr>
<td>0.5</td>
<td>29</td>
<td>45</td>
<td>31</td>
<td>0</td>
<td>259</td>
</tr>
<tr>
<td>1.0</td>
<td>29</td>
<td>52</td>
<td>32</td>
<td>0</td>
<td>318</td>
</tr>
<tr>
<td>1.5</td>
<td>29</td>
<td>53</td>
<td>33</td>
<td>0</td>
<td>457</td>
</tr>
<tr>
<td>2.0</td>
<td>29</td>
<td>56</td>
<td>34</td>
<td>0</td>
<td>580</td>
</tr>
<tr>
<td>2.5</td>
<td>30</td>
<td>58</td>
<td>36</td>
<td>1</td>
<td>808</td>
</tr>
<tr>
<td>3.0</td>
<td>31</td>
<td>68</td>
<td>39</td>
<td>1</td>
<td>1,130</td>
</tr>
<tr>
<td>3.5</td>
<td>32</td>
<td>78</td>
<td>42</td>
<td>1</td>
<td>1,612</td>
</tr>
<tr>
<td>4.0</td>
<td>34</td>
<td>91</td>
<td>48</td>
<td>4</td>
<td>2,146</td>
</tr>
</tbody>
</table>

### 6.4 Timing of Worst Case Scenario

Algorithm efficiency for work is described in Section 4.2.6. While that provides efficiency as $N \to \infty$ on a PRAM, this section will dive into the work efficiency of the Up and Down kernels on a machine from an implementation point-of-view. In
order to calculate a more accurate worst case scenario, hardware must be taken into
account. First, this work efficiency takes into account the number of GPU cores that
are used during computation. Second, it considers one cycle as one trip through
the entire while-loop and will calculate the total number of cycles across all threads,
where there is one thread per elements. Work efficiency does not depend on whether
the implementation used monolithic or grid-stride looping kernels. Equation 6.3 and
Equation 6.4 represent work efficiency for the naive and improved implementation,
respectively.

\[
WE_{npw} = \frac{N}{\left[\left(\frac{(N-1) \times N}{2}\right) + N\right] + (N - 1)} / \text{cores} 
\] (6.3)

\[
WE_{ipw} = \frac{N}{\left[\left(\frac{(N-1) \times N}{2}\right) + N + (N - (WS \times 2)) \times F + (WS \times 2) - 1\right]} / \text{cores} 
\] (6.4)

Table 6.7 and Table 6.8 provides the tabular form of Equation 6.3 and Equa-
tion 6.4 as the number of particles grow from \(2^{10}\) to \(2^{20}\) by a power of two. It shows
the execution time on a GTX 1080, with 2560 CUDA cores, along with the maximum
number of cycles, \(\ell\), for a given thread and the total number of cycles for all threads
for each kernel. It is clear that the Down kernel provides little impact to the work ef-
ciciency because it executes the minimum number of cycles allowed. While the Down
kernel in the improved implementation required more cycles over the naive because
of the hard-coded for-loop, overall the improved is between 5-21x faster due to more
efficient accesses to global memory.
Table 6.7: Naive: Work Efficiency of Worst Case Scenario: Systematic

<table>
<thead>
<tr>
<th>Particle #</th>
<th>Time µs</th>
<th>Kernel A</th>
<th>Kernel B</th>
<th>Work Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Max</td>
<td>Total</td>
<td>Max</td>
</tr>
<tr>
<td>1,024</td>
<td>166</td>
<td>1,024</td>
<td>524,800</td>
<td>1</td>
</tr>
<tr>
<td>2,048</td>
<td>300</td>
<td>2,048</td>
<td>2,098,176</td>
<td>1</td>
</tr>
<tr>
<td>4,096</td>
<td>581</td>
<td>4,096</td>
<td>8,390,656</td>
<td>1</td>
</tr>
<tr>
<td>8,192</td>
<td>1,147</td>
<td>8,192</td>
<td>33,558,528</td>
<td>1</td>
</tr>
<tr>
<td>16,384</td>
<td>2,302</td>
<td>16,384</td>
<td>134,225,920</td>
<td>1</td>
</tr>
<tr>
<td>32,768</td>
<td>4,700</td>
<td>32,768</td>
<td>536,887,296</td>
<td>1</td>
</tr>
<tr>
<td>65,536</td>
<td>11,673</td>
<td>65,536</td>
<td>2,147,516,416</td>
<td>1</td>
</tr>
<tr>
<td>131,072</td>
<td>38,572</td>
<td>131,072</td>
<td>8,590,000,128</td>
<td>1</td>
</tr>
<tr>
<td>262,144</td>
<td>149,927</td>
<td>262,144</td>
<td>34,359,869,440</td>
<td>1</td>
</tr>
<tr>
<td>524,288</td>
<td>602,399</td>
<td>524,288</td>
<td>137,439,215,616</td>
<td>1</td>
</tr>
<tr>
<td>1,048,576</td>
<td>2,533,634</td>
<td>1,048,576</td>
<td>549,756,338,176</td>
<td>1</td>
</tr>
</tbody>
</table>

Table 6.8: Improved: Work Efficiency of Worst Case Scenario: Systematic

<table>
<thead>
<tr>
<th>Particle #</th>
<th>Time µs</th>
<th>Kernel A</th>
<th>Kernel B</th>
<th>Work Efficiency</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td></td>
<td>Max</td>
<td>Total</td>
<td>Max</td>
</tr>
<tr>
<td>1,024</td>
<td>33</td>
<td>1,024</td>
<td>524,800</td>
<td>32</td>
</tr>
<tr>
<td>2,048</td>
<td>38</td>
<td>2,048</td>
<td>2,098,176</td>
<td>32</td>
</tr>
<tr>
<td>4,096</td>
<td>49</td>
<td>4,096</td>
<td>8,390,656</td>
<td>32</td>
</tr>
<tr>
<td>8,192</td>
<td>72</td>
<td>8,192</td>
<td>33,558,528</td>
<td>32</td>
</tr>
<tr>
<td>16,384</td>
<td>121</td>
<td>16,384</td>
<td>134,225,920</td>
<td>32</td>
</tr>
<tr>
<td>32,768</td>
<td>229</td>
<td>32,768</td>
<td>536,887,296</td>
<td>32</td>
</tr>
<tr>
<td>65,536</td>
<td>641</td>
<td>65,536</td>
<td>2,147,516,416</td>
<td>32</td>
</tr>
<tr>
<td>131,072</td>
<td>1,903</td>
<td>131,072</td>
<td>8,590,000,128</td>
<td>32</td>
</tr>
<tr>
<td>262,144</td>
<td>7,132</td>
<td>262,144</td>
<td>34,359,869,440</td>
<td>32</td>
</tr>
<tr>
<td>524,288</td>
<td>29,801</td>
<td>524,288</td>
<td>137,439,215,616</td>
<td>32</td>
</tr>
<tr>
<td>1,048,576</td>
<td>131,764</td>
<td>1,048,576</td>
<td>549,756,338,176</td>
<td>32</td>
</tr>
</tbody>
</table>
CHAPTER 7

CONCLUSION

7.1 Summary

Particle filters are known to provide much better accuracy in nonlinear, non-Gaussian problems, such as state estimation. Unfortunately, they are not nearly as common in practice as techniques like the Kalman filter and the EKF due to their high computational nature. This pitfall is caused by two underlying characteristics. First, as the number of states increase, the number of particles should increase exponentially. Second, and the primary focus of this research, is that traditional resampling techniques are considered serial by nature causing a performance bottleneck. GPUs provide a compute rich environment for mathematical problems that are embarrassingly parallel. These problems usually fit within the programming paradigm of SIMD and can take advantage of the high quantities and relatively simple cores provided by GPUs. Two out of the three steps in particle filtering have been implemented on GPU hardware in recent history. By finding a way to implement the last step, resampling, particle filtering becomes a viable alternative to nonlinear, non-Gaussian problems.
This dissertation shows that while some traditional resampling methods seem inherently serial, calculation can be distributed to parallel environments, such as GPUs, given the right conditions. In the case of systematic and stratified resampling, all inputs are known when the resampling index is determined and has indices that are zero-based, consecutive, strictly monotonically increasing. Therefore, the while condition can be implemented to all threads independently. While this parallel programming technique is shown to have poor work efficiency, it can provide improvements in speedups over traditional serial CPU implementation and produce an identical resampling index.

This research provides two ways to implement a parallel approach to systematic and stratified resampling. While the naive method is roughly 8x and 16x faster than the serial implementation, respectively, it has limitations that keep it from achieving optimal performance. An improved method is proposed that efficiently utilizes memory transitions to minimize global memory accesses. Improvements to thread execution efficiency, memory dependency stalls, and sub-optimal occupancy are then provided to decrease execution times of the systematic and stratified by 16.89x and 33.43x, respectively, over the serial method. With this parallel approach, all steps of the particle filter can be implemented in a parallel fashion. This shifts the historic workload particle filtering from resampling to the prediction and update steps.

For a thorough comparison, the performance was compared to a popular GPU method known as Metropolis resampling. Three versions of the Metropolis are presented; the original version and two coalesced versions that performed more efficient
memory reads to global memory. The parallel systematic and stratified methods are significantly faster than Metropolis on large particle sets. However, they are slightly slower than the coalesced versions of Metropolis when $B$ is small. Therefore, the specific method that is chosen will be a trade-off between performance and accuracy.

Given that particle filtering is a Monte Carlo technique it requires the generation of random numbers. How these numbers are generated can significantly affect accuracy and speed. A list of requirements was given to ensure that random number generators implemented in the GPU environment provide good statistical quality and are repeatable between runs and on different systems. NVIDIA’s cuRAND library provides a handful of PRNGs that can be implemented to guarantee statistical quality or forfeit that guarantee for improved speed. It was shown that the GPU parallel implementations of systematic and stratified resampling are not sensitive to random numbers generated with less guarantee that statistical correlation are non-existent.

7.2 Future Work

7.2.1 Multi-core Implementation

There is no reason to assume parallel techniques presented in this dissertation can not be extended to multi-core CPU environments. In that scenario, multiple CPU cores can be used to retrieve data from memory in chunks and put into large caches provided by the CPU architectures. Utilizing these large caches may eliminate the need for an improved implementation and will not be limited by warp size. This
would approach will provide performance improvements for applications that don’t have access to GPU hardware.

7.2.2 Custom Prefix-Sum Kernel

It can be shown with profiling tools that there are gaps between the prefix-sum kernels, launched under the CUB library, and the novel parallel implementation presented in this work. This is due to kernel launch overhead, and on a GTX1080, when processing $2^{20}$, there is not enough work to hide this latency. It might be possible to write a custom prefix-sum kernel that can also combine random number generation and the resampling index search. In doing so, a Kahan summation algorithm [59], also known as compensated summation, could be used to reduce the numerical error in the prefix-sum on single precision numbers.

7.2.3 Particle Filter Implementation

The parallel implementation provided in this paper can be integrated into particle filters that currently utilize the GPU for the updating and prediction steps. The combination of all three steps on the GPU can provide online calculations while using unbiased resampling. These three steps in the particle filter workflow are repeatedly executed through the run-time of most applications. CUDA streams require that the work is resubmitted with every iteration, which consumes both time and CPU resources. CUDA Graphs is a new model for submitting work using CUDA, which was released with CUDA 10.0 [32]. A graph consists of a series of operations, such as memory copies and kernel launches, connected by dependencies and defined sepa-
rately from its execution. Graphs enable a define-once-run-repeatedly execution flow. By utilizing Graphs, the overhead of launching multiple kernels can be significantly reduced.
template<typename T>
__global__ void __launch_bounds__(kTRI) computeResampleIndexSysUp(int const n,
unsigned long long const seed, int *__restrict__ devResampleIndexUp, T const *__restrict__ devCumulativeSum) {

#if RESAMPLE == 0 // Systematic
curandStateXORWOW_t localState { }; 
curand_init(seed, 0, 0, &localState);
T const distro = curand_uniform(&localState);
#endif

for (auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
    auto const tidf { static_cast<T>(tid) };
    auto mask { true };
    auto idx { 0 };
    #if RESAMPLE == 1 // Stratified
curandStateXORWOW_t localState { }; 
curand_init(seed + tid, 0, 0, &localState);
T const distro = curand_uniform(&localState);
#endif

    // Distribution will be the same for each Monte Carlo
    T const draw = (distro + tidf) / n;
    while (mask && (tid < n - 1)) {
        // The last thread will always be false because devCumulativeSum(end) = 1.0f
        mask = devCumulativeSum[tid + l] < draw;
        if (mask)
            l++;
    }
    devResampleIndexUp[tid] = idx;
}

A.2 Systematic/Stratified (Naive): Down Kernel

```c
template<typename T>
void computeResampleIndexSysDown(int const n,
                                  unsigned long long const seed,
                                  int * __restrict__ devResampleIndexDown,
                                  T const * __restrict__ devCumulativeSum) {

#if RESAMPLE == 0 // Systematic
    curandStateXORWOW_t localState{};
    curand_init(seed, 0, 0, &localState);
    T const distro = curand_uniform(&localState);
#endif

for (auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
    auto const tidf { static_cast<T>(tid) };
    auto mask { false };
    auto l { 1 };
    auto idx { 0 };

#if RESAMPLE == 1 // Stratified
    curandStateXORWOW_t localState{};
    curand_init(seed + tid, 0, 0, &localState);
    T const distro = curand_uniform(&localState);
#endif

    // Distribution will be the same for each Monte Carlo
    T const draw = (distro + tidf) / n;

    while (!mask && (tid >= l)) {
        mask = devCumulativeSum[tid - l] < draw;
        if (!mask)
            idx--;
        l++;
    }
    devResampleIndexDown[tid] = idx;
}
```
A.3 Systematic/Stratified (Shared Memory): Up Kernel

```cpp
#include <curand.h>
#include <cub/cub.h>

template<typename T>
__global__ void __launch_bounds__(kTRI) computeResampleIndexSysUpShared(int const n,
                            unsigned long long const seed, int * __restrict__ devResampleIndexUp, T const * __restrict__ devCumulativeSum) {
    auto const tile32 = cg::tiled_partition<kWarpSize>(cg::this_thread_block());
    auto const t = threadIdx.x;
    __shared__ T sData[kTRI * 2]; // Will hold 64 elements

    #if RESAMPLE == 0 // Systematic
        curandStateXORWOW_t localState { }; curand_init(seed, 0, 0, &localState);
        T const distro = curand_uniform(&localState);
    #endif

    for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
        auto const tidf { static_cast<T>(tid) }; auto mask { true }; auto l { 0 }; auto idx { 0 };

        #if RESAMPLE == 1 // Stratified
            curandStateXORWOW_t localState { }; curand_init(seed + tid, 0, 0, &localState);
            T const distro = curand_uniform(&localState);
        #endif

        // Distribution will be the same for each Monte Carlo
        T const draw = (distro + tidf) / n;

        while (tile32.any(mask)) {
            if (tid < n - kTRI - 1) {
                sData[t] = devCumulativeSum[tid + l];
                sData[t + kTRI] = devCumulativeSum[tid + kTRI + l];
                tile32.sync();
                for (auto i = 0; i < kTRI; i++) {
                    mask = sData[t + i] < draw;
                    if (mask) idx++;
                }
                l += kTRI;
            } else {
                while (mask && tid < (n - l)) {
                    // Last thread will always be false because devCumulativeSum(end) = 1.0f
                    mask = devCumulativeSum[tid + l] < draw;
                    if (mask) idx++;
                    l++;
                }
                tile32.sync();
            }
            devResampleIndexUp[tid] = idx;
        }
    }
```

124
A.4 Systematic/Stratified (Shared Memory): Down Kernel

```cpp
template<typename T>
__global__ void __launch_bounds__(kTRI) computeResampleIndexSysDownShared(int const n, unsigned long long const seed, int * __restrict__ devResampleIndexDown, T const * __restrict__ devCumulativeSum ) {
  auto const tile32 = cg::tiled_partition<kWarpSize>(cg::this_thread_block());
  auto const t = threadIdx.x;
  __shared__ T sData[kTRI * 2];

  #if RESAMPLE == 0 // Systematic
    curandStateXORWOW_t localState { }; curand_init(seed, 0, 0, &localState );
    T const distro = curand_uniform(&localState);
  # endif

  for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x ) {
    auto const tidf { static_cast<T>(tid ) };
    auto mask ( false );
    auto l { 0 };
    auto idx { 0 };

    #if RESAMPLE == 1 // Stratified
      curandStateXORWOW_t localState { }; curand_init(seed + tid, 0, 0, &localState );
      T const distro = curand_uniform(&localState);
    # endif

    // Distribution will be the same for each Monte Carlo
    T const draw = ( distro + tidf ) / n;

    while ( !tile32.all( mask ) ) {
      if ( tid >= kTRI + l ) {
        sData[t] = devCumulativeSum[tid - kTRI - l];
        sData[t + kTRI ] = devCumulativeSum[tid - l];
        tile32.sync();
      }
      for ( auto i = 1; i < kTRI + 1; i++ ) {
        mask = sData[t + kTRI - i] < draw;
        if ( !mask )
          idx--; 
        else {
          while ( !mask ) {
            if ( tid > l ) mask = devCumulativeSum[tid - ( l + 1 )] < draw;
            else mask = true;
            if ( !mask )
              idx--; 
            l++; 
          }
        } 
        tile32.sync();
      }
      devResampleIndexDown[tid ] = idx;
    }
  }
}```
A.5 Systematic/Stratified (SM/Prefetch) Up Kernel

```cpp
template<typename T>
__global__ void __launch_bounds__(kTRI) computeResampleIndexSysUpShared(int const n,
                         unsigned long long const seed,
                         int * __restrict__ devResampleIndexUp,
                         T const * __restrict__ devCumulativeSum) {

  auto const tile32 = cg::tiled_partition<kWarpSize[]>(cg::this_thread_block());

  auto const t = threadIdx.x;

  __shared__ T sData[kTRI * 2]; // Will hold 64 elements

  #if RESAMPLE == 0 // Systematic
    curandStateXORWOW_t localState{};
    curand_init(seed, 0, 0, &localState);
    T const distro = curand_uniform(&localState);
  #endif

  for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
    auto const tidf { static_cast<T>(tid) };
    auto mask { true };
    auto l { 0 };
    auto idx { 0 };

    #if RESAMPLE == 1 // Stratified
      curandStateXORWOW_t localState{};
      curand_init(seed + tid, 0, 0, &localState);
      T const distro = curand_uniform(&localState);
    #endif

    // Distribution will be the same for each Monte Carlo
    T const draw = (distro + tidf) / n;

    while (tile32.any(mask)) {
      if (tid < n - kTRI - l) {
        sData[t] = devCumulativeSum[tid + l];
        sData[t + kTRI] = devCumulativeSum[tid + kTRI + l];
        tile32.sync();

        for (auto i = 0; i < kTRI; i++) {
          mask = sData[t + i] < draw;
          if (mask)
            idx++;
          l += kTRI;
        }
      } else {
        while (mask && tid < (n - l)) {
          // Last thread will always be false because devCumulativeSum(end) = 1.0f
          mask = devCumulativeSum[tid + l] < draw;
          if (mask)
            idx++;
          l++;
        }
      }

      tile32.sync();

    devResampleIndexUp[tid] = idx;
  }
}
```
template<typename T>
__global__ void __launch_bounds__(kTRI) computeResampleIndexSysDownShared(int const n, unsigned long long const seed, int * __restrict__ devResampleIndexDown, T const * __restrict__ devCumulativeSum) {
  auto const tile32 = cg::tiled_partition<kWarpSize>(cg::this_thread_block());
  auto const t = threadIdx.x;
  __shared__ T sData[kTRI * 2];
  #if RESAMPLE == 0 // Systematic
  curandStateXORWOW_t localState{};
  curand_init(seed, 0, 0, &localState);
  T const distro = curand_uniform(&localState);
  # endif
  for (auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
    auto const tidf = static_cast<T>(tid);
    auto mask = false;
    auto l = 0;
    auto idx = 0;
    #if RESAMPLE == 1 // Stratified
    curandStateXORWOW_t localState{};
    curand_init(seed + tid, 0, 0, &localState);
    T const distro = curand_uniform(&localState);
    # endif
    // Distribution will be the same for each Monte Carlo
    T const draw = (distro + tidf) / n;
    while ( !tile32.all(mask) ) {
      if ( tid > kTRI + l ) {
        sData[t] = devCumulativeSum[tid - kTRI - l];
        sData[t + kTRI] = devCumulativeSum[tid - l];
        tile32.sync();
        for (auto i = 1; i < kTRI + 1; i++) {
          mask = sData[t + kTRI - i] < draw;
          if ( !mask )
            idx--;
        }
        l += kTRI;
      } else {
        while ( !mask ) {
          if ( tid > l ) {
            mask = devCumulativeSum[tid - (l + 1)] < draw;
            else
              mask = true;
          if ( !mask )
            idx--;
          l++;
        }
        tile32.sync();
      }
    }
  devResampleIndexDown[tid] = idx;
  }
}

A.6 Systematic/Stratified (SM/Prefetch): Down Kernel
A.7 Systematic/Stratified (SM/Prefetch/2 Warps) Up Kernel

```cpp
template<typename T>
__global__ void __launch_bounds__(64) computeResampleIndexSysUpSharedPrefetch64(int const n, unsigned long long const seed, int * __restrict__ devResampleIndexUp, T * __restrict__ devCumulativeSum) {
  auto const tile32 = cg::tiled_partition<kWarpSize>(cg::this_thread_block());
  uint const t = tile32.thread_rank();

  __shared__ T sWarp0[kTRI * 2];
  __shared__ T sWarp1[kTRI * 2];

#if RESAMPLE == 0
  curandStateXORWOW_t localState { }; 
  curand_init(seed, 0, 0, &localState);
  T const distro = curand_uniform(&localState);
#else
#endif

  for (int tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x) {
    if (threadIdx.x < kWarpSize) {
      // If warp 0
      bool mask { true };
      if (tid < n - kTRI - 1) {
        sWarp0[t] = devCumulativeSum[tid + l];
        sWarp0[t + kTRI] = devCumulativeSum[tid + kTRI + l];
      }
      // Distribution will be the same for each Monte Carlo
      T const draw = (distro + tidf) / n;

      tile32.sync();
      while (tile32.any(mask)) {
        if (tid < n - (kTRI + 32) - 1) {
          a = devCumulativeSum[tid + 32 + l];
          b = devCumulativeSum[tid + kTRI + 32 + 1];
        }
      #pragma unroll kTRI
      for (int i = 0; i < kTRI; i++) {
        mask = sWarp0[t + i] < draw;
        if (mask)
          idx++;
      }
      l += kTRI;
      sWarp0[t] = a;
      sWarp0[t + kTRI] = b;
      tile32.sync();
    } else {
      while (mask && tid < (n - 1)) {
```
mask = devCumulativeSum[tid + l] < draw;  // The last thread will always be false because devCumulativeSum(end) = 1.0f
if ( mask )
    idx++;
l++;
}
}
tile32.sync();
}
devResampleIndexUp[tid] = idx;
}
} else {  // If warp 1
    bool mask { true };
    if ( tid < n - kTRI - l ) {
        sWarp1[t] = devCumulativeSum[tid + l];
        sWarp1[t + kTRI] = devCumulativeSum[tid + kTRI + l];
    }

    // Distribution will be the same for each Monte Carlo
    T const draw = ( distro + tidf ) / n;
    tile32.sync();
    while ( tile32.any( mask ) ) {
        if ( tid < n - ( kTRI + 32 ) - l ) {
            a = devCumulativeSum[tid + 32 + l];
            b = devCumulativeSum[tid + kTRI + 32 + l];

            #pragma unroll kTRI
            for ( int i = 0; i < kTRI; i++ ) {
                mask = sWarp1[t + i] < draw;
                if ( mask )
                    idx++;
            }
            l += kTRI;
            sWarp1[t] = a;
            sWarp1[t + kTRI] = b;
            tile32.sync();
        } else {
            while ( mask && tid < ( n - l ) ) {
                mask = devCumulativeSum[tid + l] < draw;  // The last thread will always be false because devCumulativeSum(end) = 1.0f
                if ( mask )
                    idx++;
                l++;
            }
        }
    }
}
tile32.sync();
}
devResampleIndexUp[tid] = idx;
}
A.8 Systematic/Stratified (SM/Prefetch/2 Warps): Down Kernel

```c
template<typename T>
__global__ void __launch_bounds__(64) computeResampleIndexSysDownSharedPrefetch64(int const n, unsigned long long const seed, int * __restrict__ devResampleIndexDown,
T * __restrict__ devCumulativeSum) {

auto const tile32 = cg::tiled_partition < kWarpSize > ( cg::this_thread_block() );
uint const t = tile32.thread_rank();

__shared__ T sWarp0[kTRI * 2];
__shared__ T sWarp1[kTRI * 2];

#if RESAMPLE == 0
curandStateXORWOW_t localState { }; curand_init ( seed , 0, 0, & localState );
T const distro = curand_uniform ( & localState );
#endif

for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x ) {

#if RESAMPLE == 1
curandStateXORWOW_t localState { }; curand_init( seed + tid , 0, 0, &localState );
T const distro = curand_uniform( &localState );
#endif

T const tidf { static_cast<T>( tid )};
int l { 0 }; int idx { 0 }; T a {}; T b {};

if ( threadIdx.x < kWarpSize ) { // If warp 0
bool mask { false }; if ( tid >= kTRI + l ) {
    sWarp0[t] = devCumulativeSum[tid - kTRI - l];
    sWarp0[t + kTRI] = devCumulativeSum[tid - l];
}

// Distribution will be the same for each Monte Carlo
T const draw = ( distro + tidf ) / n;
tile32.sync();

while ( !tile32.all( mask ) ) {
   if ( tid >= kTRI + 32 + l ) {
       a = devCumulativeSum[tid - ( kTRI + 32 ) - 1];
       b = devCumulativeSum[tid - 32 - 1];
   #pragma unroll
   for ( int i = 1; i < kTRI + 1; i++ ) {
       mask = sWarp0[t + kTRI - i] < draw;
   if ( !mask )
       idx--; } l += kTRI;
   sWarp0[t] = a;
   sWarp0[t + kTRI] = b;
tile32.sync();
   }
}
else { // If warp 1

}

}
```

130
while ( !mask ) {
  if ( tid > l )
  else
    mask = true;
  if ( !mask )
    idx--;
  l++;
}

    tile32.sync();
}

devResampleIndexDown[tid] = idx;
}
else {

  bool mask { false };

  if ( tid >= kTRI + l ) {
    sWarp1[t] = devCumulativeSum[tid - kTRI - l];
    sWarp1[t + kTRI] = devCumulativeSum[tid - l];
  }

  // Distribution will be the same for each Monte Carlo
  T const draw = ( distro + tidf ) / n;
  tile32.sync();

  while ( !tile32.all( mask ) ) {
    if ( tid >= kTRI + 32 + l ) {
      a = devCumulativeSum[tid - ( kTRI + 32 ) - l];
      b = devCumulativeSum[tid - 32 - l];

      #pragma unroll
      for ( int i = 1; i < kTRI + 1; i++ ) {
        mask = sWarp1[t + kTRI - i] < draw;
        if ( !mask )
          idx--;
      }
      l += kTRI;
      sWarp1[t] = a;
      sWarp1[t + kTRI] = b;
      tile32.sync();
    } else {

      while ( !mask ) {
        if ( tid > l )
        else
          mask = true;
        if ( !mask )
          idx--;
        l++;
      }

      tile32.sync();
    }
  }

devResampleIndexDown[tid] = idx;
}
}
A.9 Metropolis (Naive)

```c
#include "types.h"

__global__ void __launch_bounds__ ( kTMR ) computeResampleIndexMetropolis ( int const n,
    unsigned long long const seed , int * __restrict__ devResampleIndexDown , T const * __restrict__ devParticleWeight ) {

    for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x
         * gridDim.x ) {
        uint idx { tid };
        uint key { }; T den { devParticleWeight[idx] };
        T num { }; T randNum { };

        curandStateXORWOW_t localState { }; curand_init( seed + tid , 0 , 0 , &localState );

        for ( auto i = 0; i < CUTOFF ; i++ ) {
            randNum = curand_uniform( &localState );
            key = static_cast<uint>( curand_uniform( &localState ) * ( n - 1 ) );
            num = devParticleWeight[key];

            if ( randNum <= ( num / den ) ) {
                den = num;
                idx = key;
            }
        }
        devResampleIndexDown[tid] = idx;
    }
}
```
A.10 Rejection

template<typename T>
__global__ void __launch_bounds__(kTMR) computeResampleIndexMetropolisRejection(int const n, unsigned long long const seed, int * __restrict__ devResampleIndexDown, T const * __restrict__ devParticleWeight) {

    for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x ) {
        uint idx { tid };
        uint key {};
        T den { devParticleWeight[idx] };
        T randNum {};
        curandStateXORWOW_t localState {};
        curand_init( seed + tid, 0, 0, &localState );
        randNum = curand_uniform( &localState );

        while ( randNum > ( den / 1.0f ) ) {
            // Subtract warp size so the last warp load out-of-bounds
            key = static_cast<uint>( curand_uniform( &localState ) * ( n - 1 ) );
            den = devParticleWeight[key];
            randNum = curand_uniform( &localState );
        }
        devResampleIndexDown[tid] = key;
    }
}
template<typename T>
__global__ void __launch_bounds__(kTMR) computeResampleIndexMetropolisC1(int const n, unsigned long long int seed, int * __restrict__ devResampleIndexDown, T const * __restrict__ devParticleWeight ) {

    for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x ) {

        uint idx { tid };
        uint key {}
        uint warp {};
        T den { devParticleWeight[tid] };
        T num {};
        T randNum {};

        curandStateXORWOW_t localState {}, randState {};
        // same random number for 0-based warp entire grid
        curand_init( static_cast<unsigned long long int>( tid >> 5 ) + seed , 0, 0, & localState );
        curand_init( static_cast<unsigned long long int>( tid ) + seed , 0, 0, & randState );

        // Calculate s(warp) using warp index. All threads in warp have same value
        int SS = kWarpSize; // Size of segment
        int SC = n / SS; // The number of segments
        int DC = SS;
        // Random number [0 -> number of warps]
        warp = static_cast<uint>( curand_uniform( &localState ) * (SC - 1) );

        for ( auto i = 0; i < CUTOFF; i++ ) {

            randNum = curand_uniform( &randState );
            key = static_cast<uint>( curand_uniform( &randState ) * (DC - 1) );
            key = warp * DC + key;
            num = devParticleWeight[key];

            if ( randNum <= ( num / den ) ) {
                den = num;
                idx = key;
            }
        }
        devResampleIndexDown[tid] = idx;
    }
}
template< typename T >
__global__ void __launch_bounds__( kTMR ) computeResampleIndexMetropolisC2( int const n , unsigned long long int seed , int * __restrict__ devResampleIndexDown , T const * __restrict__ devParticleWeight ) {
for ( auto tid = blockIdx.x * blockDim.x + threadIdx.x; tid < n; tid += blockDim.x * gridDim.x ) {
  uint idx { tid };
  uint key {};
  uint warp {};
  T den {devParticleWeight[tid]};
  T num {};
  T randNum {};

  curandStateXORWOW_t localState {}, randState {};
  // same random number for 0-based warp entire grid
  curand_init( static_cast<unsigned long long int>(tid >> 5) + seed , 0, 0, & localState );
  curand_init( static_cast<unsigned long long int>(tid) + seed , 0, 0, &randState );

  // Calculate s(warp) using warp index. All threads in warp have same value
  int SS = kWarpSize; // Size of segment
  int SC = n / SS; // The number of segments
  int DC = SS;

  for ( auto i = 0; i < CUTOFF; i++ ) {
    // Random number [0 -> number of warps]
    warp = static_cast<uint>( curand_uniform( &localState ) * (SC - 1) );

    randNum = curand_uniform( &randState );
    key = static_cast<int>( curand_uniform( &randState ) * (DC - 1) );
    key = warp * DC + key;
    num = devParticleWeight[key];
    if ( randNum <= ( num / den ) ) {
      den = num;
      idx = key;
    }
    devResampleIndexDown[tid] = idx;
  }
}
## APPENDIX B

### B.1 Variance Effect on Naive Implementation: Systematic

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>Down Kernel</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (f)</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (f)</td>
<td></td>
</tr>
<tr>
<td>0.0</td>
<td>15</td>
<td>34</td>
<td>18</td>
<td>0</td>
<td>23</td>
<td>1</td>
<td>1</td>
<td>14.043</td>
<td>0</td>
<td>26</td>
<td>1</td>
<td>1</td>
<td>14.070</td>
<td></td>
</tr>
<tr>
<td>0.5</td>
<td>17</td>
<td>38</td>
<td>19</td>
<td>0</td>
<td>33</td>
<td>2</td>
<td>2</td>
<td>16.227</td>
<td>0</td>
<td>35</td>
<td>2</td>
<td>2</td>
<td>16.911</td>
<td></td>
</tr>
<tr>
<td>1.0</td>
<td>17</td>
<td>36</td>
<td>19</td>
<td>0</td>
<td>41</td>
<td>3</td>
<td>2</td>
<td>22.571</td>
<td>0</td>
<td>41</td>
<td>3</td>
<td>2</td>
<td>20.586</td>
<td></td>
</tr>
<tr>
<td>1.5</td>
<td>18</td>
<td>38</td>
<td>21</td>
<td>0</td>
<td>63</td>
<td>4</td>
<td>3</td>
<td>30.782</td>
<td>0</td>
<td>56</td>
<td>4</td>
<td>4</td>
<td>26.522</td>
<td></td>
</tr>
<tr>
<td>2.0</td>
<td>18</td>
<td>41</td>
<td>22</td>
<td>0</td>
<td>75</td>
<td>5</td>
<td>5</td>
<td>39.909</td>
<td>0</td>
<td>76</td>
<td>5</td>
<td>5</td>
<td>40.570</td>
<td></td>
</tr>
<tr>
<td>2.5</td>
<td>19</td>
<td>40</td>
<td>23</td>
<td>0</td>
<td>107</td>
<td>7</td>
<td>7</td>
<td>49.693</td>
<td>0</td>
<td>106</td>
<td>7</td>
<td>7</td>
<td>61.801</td>
<td></td>
</tr>
<tr>
<td>3.0</td>
<td>20</td>
<td>47</td>
<td>25</td>
<td>0</td>
<td>144</td>
<td>10</td>
<td>9</td>
<td>71.655</td>
<td>0</td>
<td>134</td>
<td>10</td>
<td>9</td>
<td>69.107</td>
<td></td>
</tr>
<tr>
<td>3.5</td>
<td>21</td>
<td>52</td>
<td>28</td>
<td>0</td>
<td>207</td>
<td>14</td>
<td>13</td>
<td>109.871</td>
<td>0</td>
<td>196</td>
<td>13</td>
<td>13</td>
<td>99.342</td>
<td></td>
</tr>
<tr>
<td>4.0</td>
<td>23</td>
<td>59</td>
<td>32</td>
<td>0</td>
<td>251</td>
<td>19</td>
<td>18</td>
<td>130.568</td>
<td>0</td>
<td>278</td>
<td>19</td>
<td>19</td>
<td>148.264</td>
<td></td>
</tr>
</tbody>
</table>

### Naive Method: Particles (2048)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th></th>
<th>Down Kernel</th>
<th></th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (f)</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (f)</td>
<td></td>
</tr>
<tr>
<td>0.0</td>
<td>17</td>
<td>38</td>
<td>20</td>
<td>0</td>
<td>39</td>
<td>2</td>
<td>2</td>
<td>49.184</td>
<td>0</td>
<td>39</td>
<td>2</td>
<td>2</td>
<td>35.032</td>
<td></td>
</tr>
<tr>
<td>0.5</td>
<td>18</td>
<td>35</td>
<td>20</td>
<td>0</td>
<td>43</td>
<td>3</td>
<td>3</td>
<td>49.394</td>
<td>0</td>
<td>47</td>
<td>3</td>
<td>3</td>
<td>47.514</td>
<td></td>
</tr>
<tr>
<td>1.0</td>
<td>18</td>
<td>34</td>
<td>21</td>
<td>0</td>
<td>64</td>
<td>4</td>
<td>4</td>
<td>59.094</td>
<td>0</td>
<td>69</td>
<td>4</td>
<td>4</td>
<td>66.854</td>
<td></td>
</tr>
<tr>
<td>1.5</td>
<td>18</td>
<td>44</td>
<td>22</td>
<td>0</td>
<td>79</td>
<td>5</td>
<td>5</td>
<td>78.058</td>
<td>0</td>
<td>85</td>
<td>5</td>
<td>5</td>
<td>96.870</td>
<td></td>
</tr>
<tr>
<td>2.0</td>
<td>19</td>
<td>42</td>
<td>24</td>
<td>0</td>
<td>108</td>
<td>7</td>
<td>7</td>
<td>104.665</td>
<td>0</td>
<td>116</td>
<td>8</td>
<td>7</td>
<td>132.122</td>
<td></td>
</tr>
<tr>
<td>2.5</td>
<td>20</td>
<td>48</td>
<td>26</td>
<td>0</td>
<td>151</td>
<td>10</td>
<td>10</td>
<td>138.159</td>
<td>0</td>
<td>156</td>
<td>10</td>
<td>10</td>
<td>142.988</td>
<td></td>
</tr>
<tr>
<td>3.0</td>
<td>21</td>
<td>54</td>
<td>29</td>
<td>0</td>
<td>197</td>
<td>14</td>
<td>14</td>
<td>202.271</td>
<td>0</td>
<td>185</td>
<td>14</td>
<td>14</td>
<td>204.311</td>
<td></td>
</tr>
<tr>
<td>3.5</td>
<td>22</td>
<td>61</td>
<td>32</td>
<td>0</td>
<td>270</td>
<td>19</td>
<td>19</td>
<td>271.382</td>
<td>0</td>
<td>278</td>
<td>20</td>
<td>19</td>
<td>307.554</td>
<td></td>
</tr>
<tr>
<td>4.0</td>
<td>25</td>
<td>73</td>
<td>38</td>
<td>0</td>
<td>392</td>
<td>28</td>
<td>27</td>
<td>380.445</td>
<td>0</td>
<td>381</td>
<td>27</td>
<td>26</td>
<td>411.612</td>
<td></td>
</tr>
<tr>
<td>y</td>
<td>Time (µs)</td>
<td>Up Kernel</td>
<td>Down Kernel</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td>---</td>
<td>-----------</td>
<td>-----------</td>
<td>-------------</td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
<td></td>
</tr>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (ℓ)</td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Std</td>
<td>Total (ℓ)</td>
<td></td>
</tr>
<tr>
<td>0.0</td>
<td>18</td>
<td>36</td>
<td>21</td>
<td>0</td>
<td>55</td>
<td>3</td>
<td>3</td>
<td>107,074</td>
<td>0</td>
<td>53</td>
<td>3</td>
<td>3</td>
<td>132,312</td>
<td></td>
</tr>
<tr>
<td>0.5</td>
<td>19</td>
<td>37</td>
<td>21</td>
<td>0</td>
<td>60</td>
<td>4</td>
<td>4</td>
<td>119,335</td>
<td>0</td>
<td>63</td>
<td>4</td>
<td>4</td>
<td>120,096</td>
<td></td>
</tr>
<tr>
<td>1.0</td>
<td>19</td>
<td>43</td>
<td>23</td>
<td>0</td>
<td>78</td>
<td>6</td>
<td>5</td>
<td>162,737</td>
<td>0</td>
<td>85</td>
<td>6</td>
<td>5</td>
<td>180,031</td>
<td></td>
</tr>
<tr>
<td>1.5</td>
<td>20</td>
<td>43</td>
<td>24</td>
<td>0</td>
<td>113</td>
<td>8</td>
<td>7</td>
<td>207,280</td>
<td>0</td>
<td>116</td>
<td>8</td>
<td>8</td>
<td>228,076</td>
<td></td>
</tr>
<tr>
<td>2.0</td>
<td>20</td>
<td>44</td>
<td>26</td>
<td>0</td>
<td>182</td>
<td>11</td>
<td>10</td>
<td>344,192</td>
<td>0</td>
<td>164</td>
<td>11</td>
<td>10</td>
<td>335,918</td>
<td></td>
</tr>
<tr>
<td>2.5</td>
<td>22</td>
<td>52</td>
<td>29</td>
<td>0</td>
<td>192</td>
<td>15</td>
<td>14</td>
<td>415,344</td>
<td>0</td>
<td>205</td>
<td>15</td>
<td>14</td>
<td>444,249</td>
<td></td>
</tr>
<tr>
<td>3.0</td>
<td>23</td>
<td>62</td>
<td>33</td>
<td>0</td>
<td>306</td>
<td>20</td>
<td>19</td>
<td>571,957</td>
<td>0</td>
<td>271</td>
<td>20</td>
<td>19</td>
<td>673,142</td>
<td></td>
</tr>
<tr>
<td>3.5</td>
<td>26</td>
<td>74</td>
<td>39</td>
<td>0</td>
<td>371</td>
<td>28</td>
<td>27</td>
<td>779,765</td>
<td>1</td>
<td>384</td>
<td>27</td>
<td>27</td>
<td>763,462</td>
<td></td>
</tr>
<tr>
<td>4.0</td>
<td>29</td>
<td>87</td>
<td>47</td>
<td>0</td>
<td>488</td>
<td>39</td>
<td>38</td>
<td>1,024,128</td>
<td>0</td>
<td>514</td>
<td>39</td>
<td>38</td>
<td>1,094,475</td>
<td></td>
</tr>
</tbody>
</table>

---

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>19</td>
<td>38</td>
<td>23</td>
</tr>
<tr>
<td>0.5</td>
<td>19</td>
<td>45</td>
<td>23</td>
</tr>
<tr>
<td>1.0</td>
<td>20</td>
<td>43</td>
<td>24</td>
</tr>
<tr>
<td>1.5</td>
<td>21</td>
<td>49</td>
<td>27</td>
</tr>
<tr>
<td>2.0</td>
<td>23</td>
<td>53</td>
<td>30</td>
</tr>
<tr>
<td>2.5</td>
<td>24</td>
<td>62</td>
<td>34</td>
</tr>
<tr>
<td>3.0</td>
<td>26</td>
<td>75</td>
<td>40</td>
</tr>
<tr>
<td>3.5</td>
<td>30</td>
<td>92</td>
<td>48</td>
</tr>
<tr>
<td>4.0</td>
<td>34</td>
<td>110</td>
<td>60</td>
</tr>
</tbody>
</table>

---

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>20</td>
<td>41</td>
<td>25</td>
</tr>
<tr>
<td>0.5</td>
<td>21</td>
<td>46</td>
<td>25</td>
</tr>
<tr>
<td>1.0</td>
<td>22</td>
<td>55</td>
<td>28</td>
</tr>
<tr>
<td>1.5</td>
<td>23</td>
<td>59</td>
<td>31</td>
</tr>
<tr>
<td>2.0</td>
<td>25</td>
<td>67</td>
<td>36</td>
</tr>
<tr>
<td>2.5</td>
<td>27</td>
<td>78</td>
<td>42</td>
</tr>
<tr>
<td>3.0</td>
<td>30</td>
<td>96</td>
<td>50</td>
</tr>
<tr>
<td>3.5</td>
<td>34</td>
<td>120</td>
<td>61</td>
</tr>
<tr>
<td>4.0</td>
<td>41</td>
<td>152</td>
<td>78</td>
</tr>
<tr>
<td>y</td>
<td>Time (µs)</td>
<td>Up Kernel</td>
<td>Down Kernel</td>
</tr>
<tr>
<td>----</td>
<td>-----------</td>
<td>-----------</td>
<td>-------------</td>
</tr>
<tr>
<td>0.0</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>0.5</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1.0</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>1.5</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2.0</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>2.5</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3.0</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>3.5</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
<tr>
<td>4.0</td>
<td>0.125</td>
<td></td>
<td></td>
</tr>
</tbody>
</table>
### Naive Method: Particles (262144)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>68</td>
<td>234</td>
<td>123</td>
</tr>
<tr>
<td>0.5</td>
<td>75</td>
<td>254</td>
<td>134</td>
</tr>
<tr>
<td>1.0</td>
<td>83</td>
<td>314</td>
<td>165</td>
</tr>
<tr>
<td>1.5</td>
<td>103</td>
<td>396</td>
<td>208</td>
</tr>
<tr>
<td>2.0</td>
<td>119</td>
<td>506</td>
<td>266</td>
</tr>
<tr>
<td>2.5</td>
<td>148</td>
<td>652</td>
<td>342</td>
</tr>
<tr>
<td>3.0</td>
<td>191</td>
<td>852</td>
<td>448</td>
</tr>
<tr>
<td>3.5</td>
<td>243</td>
<td>1,133</td>
<td>594</td>
</tr>
<tr>
<td>4.0</td>
<td>314</td>
<td>1,579</td>
<td>828</td>
</tr>
</tbody>
</table>

### Naive Method: Particles (524288)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>151</td>
<td>564</td>
<td>297</td>
</tr>
<tr>
<td>0.5</td>
<td>169</td>
<td>624</td>
<td>328</td>
</tr>
<tr>
<td>1.0</td>
<td>191</td>
<td>774</td>
<td>407</td>
</tr>
<tr>
<td>1.5</td>
<td>244</td>
<td>1,003</td>
<td>527</td>
</tr>
<tr>
<td>2.0</td>
<td>298</td>
<td>1,278</td>
<td>671</td>
</tr>
<tr>
<td>2.5</td>
<td>352</td>
<td>1,685</td>
<td>882</td>
</tr>
<tr>
<td>3.0</td>
<td>464</td>
<td>2,285</td>
<td>1,192</td>
</tr>
<tr>
<td>3.5</td>
<td>624</td>
<td>3,079</td>
<td>1,606</td>
</tr>
<tr>
<td>4.0</td>
<td>862</td>
<td>4,351</td>
<td>2,275</td>
</tr>
</tbody>
</table>

### Naive Method: Particles (1048576)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>333</td>
<td>1,382</td>
<td>728</td>
</tr>
<tr>
<td>0.5</td>
<td>390</td>
<td>1,544</td>
<td>815</td>
</tr>
<tr>
<td>1.0</td>
<td>444</td>
<td>1,959</td>
<td>1,031</td>
</tr>
<tr>
<td>1.5</td>
<td>573</td>
<td>2,596</td>
<td>1,367</td>
</tr>
<tr>
<td>2.0</td>
<td>701</td>
<td>3,453</td>
<td>1,810</td>
</tr>
<tr>
<td>2.5</td>
<td>845</td>
<td>4,612</td>
<td>2,410</td>
</tr>
<tr>
<td>3.0</td>
<td>1,248</td>
<td>6,282</td>
<td>3,266</td>
</tr>
<tr>
<td>3.5</td>
<td>1,429</td>
<td>8,525</td>
<td>4,442</td>
</tr>
<tr>
<td>4.0</td>
<td>2,145</td>
<td>12,166</td>
<td>6,329</td>
</tr>
</tbody>
</table>

---

139
### B.2 Variance Effect on Improved Implementation: Systematic

#### Improved Method: Particles (1024)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>15</td>
<td>35</td>
<td>18</td>
</tr>
<tr>
<td>0.5</td>
<td>16</td>
<td>37</td>
<td>19</td>
</tr>
<tr>
<td>1.0</td>
<td>16</td>
<td>32</td>
<td>19</td>
</tr>
<tr>
<td>1.5</td>
<td>17</td>
<td>39</td>
<td>19</td>
</tr>
<tr>
<td>2.0</td>
<td>17</td>
<td>39</td>
<td>20</td>
</tr>
<tr>
<td>2.5</td>
<td>18</td>
<td>38</td>
<td>20</td>
</tr>
<tr>
<td>3.0</td>
<td>18</td>
<td>40</td>
<td>21</td>
</tr>
<tr>
<td>3.5</td>
<td>18</td>
<td>38</td>
<td>21</td>
</tr>
<tr>
<td>4.0</td>
<td>18</td>
<td>42</td>
<td>22</td>
</tr>
</tbody>
</table>

#### Improved Method: Particles (2048)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>19</td>
<td>33</td>
<td>20</td>
</tr>
<tr>
<td>0.5</td>
<td>18</td>
<td>39</td>
<td>20</td>
</tr>
<tr>
<td>1.0</td>
<td>19</td>
<td>35</td>
<td>20</td>
</tr>
<tr>
<td>1.5</td>
<td>19</td>
<td>39</td>
<td>20</td>
</tr>
<tr>
<td>2.0</td>
<td>19</td>
<td>39</td>
<td>20</td>
</tr>
<tr>
<td>2.5</td>
<td>19</td>
<td>41</td>
<td>21</td>
</tr>
<tr>
<td>3.0</td>
<td>19</td>
<td>37</td>
<td>21</td>
</tr>
<tr>
<td>3.5</td>
<td>19</td>
<td>39</td>
<td>21</td>
</tr>
<tr>
<td>4.0</td>
<td>19</td>
<td>42</td>
<td>22</td>
</tr>
</tbody>
</table>

#### Improved Method: Particles (4096)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>18</td>
<td>38</td>
<td>20</td>
</tr>
<tr>
<td>0.5</td>
<td>19</td>
<td>37</td>
<td>20</td>
</tr>
<tr>
<td>1.0</td>
<td>19</td>
<td>36</td>
<td>20</td>
</tr>
<tr>
<td>1.5</td>
<td>19</td>
<td>42</td>
<td>20</td>
</tr>
<tr>
<td>2.0</td>
<td>19</td>
<td>42</td>
<td>20</td>
</tr>
<tr>
<td>2.5</td>
<td>20</td>
<td>41</td>
<td>21</td>
</tr>
<tr>
<td>3.0</td>
<td>19</td>
<td>42</td>
<td>21</td>
</tr>
<tr>
<td>3.5</td>
<td>19</td>
<td>39</td>
<td>22</td>
</tr>
<tr>
<td>4.0</td>
<td>20</td>
<td>42</td>
<td>23</td>
</tr>
</tbody>
</table>
## Improved Method: Particles (8192)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>18</td>
<td>37</td>
<td>21</td>
</tr>
<tr>
<td>0.5</td>
<td>19</td>
<td>41</td>
<td>21</td>
</tr>
<tr>
<td>1.0</td>
<td>19</td>
<td>36</td>
<td>21</td>
</tr>
<tr>
<td>1.5</td>
<td>19</td>
<td>39</td>
<td>21</td>
</tr>
<tr>
<td>2.0</td>
<td>19</td>
<td>38</td>
<td>21</td>
</tr>
<tr>
<td>2.5</td>
<td>19</td>
<td>41</td>
<td>21</td>
</tr>
<tr>
<td>3.0</td>
<td>19</td>
<td>40</td>
<td>22</td>
</tr>
<tr>
<td>3.5</td>
<td>20</td>
<td>40</td>
<td>22</td>
</tr>
<tr>
<td>4.0</td>
<td>19</td>
<td>41</td>
<td>23</td>
</tr>
</tbody>
</table>

## Improved Method: Particles (16384)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>19</td>
<td>39</td>
<td>21</td>
</tr>
<tr>
<td>0.5</td>
<td>20</td>
<td>39</td>
<td>21</td>
</tr>
<tr>
<td>1.0</td>
<td>20</td>
<td>41</td>
<td>21</td>
</tr>
<tr>
<td>1.5</td>
<td>20</td>
<td>38</td>
<td>22</td>
</tr>
<tr>
<td>2.0</td>
<td>20</td>
<td>43</td>
<td>22</td>
</tr>
<tr>
<td>2.5</td>
<td>20</td>
<td>39</td>
<td>22</td>
</tr>
<tr>
<td>3.0</td>
<td>20</td>
<td>42</td>
<td>23</td>
</tr>
<tr>
<td>3.5</td>
<td>21</td>
<td>46</td>
<td>23</td>
</tr>
<tr>
<td>4.0</td>
<td>21</td>
<td>46</td>
<td>23</td>
</tr>
</tbody>
</table>

## Improved Method: Particles (32768)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
</tr>
<tr>
<td>0.0</td>
<td>25</td>
<td>40</td>
<td>28</td>
</tr>
<tr>
<td>0.5</td>
<td>25</td>
<td>47</td>
<td>28</td>
</tr>
<tr>
<td>1.0</td>
<td>26</td>
<td>45</td>
<td>28</td>
</tr>
<tr>
<td>1.5</td>
<td>25</td>
<td>44</td>
<td>28</td>
</tr>
<tr>
<td>2.0</td>
<td>25</td>
<td>48</td>
<td>29</td>
</tr>
<tr>
<td>2.5</td>
<td>27</td>
<td>48</td>
<td>29</td>
</tr>
<tr>
<td>3.0</td>
<td>26</td>
<td>59</td>
<td>31</td>
</tr>
<tr>
<td>3.5</td>
<td>27</td>
<td>52</td>
<td>32</td>
</tr>
<tr>
<td>4.0</td>
<td>28</td>
<td>66</td>
<td>34</td>
</tr>
</tbody>
</table>
## Improved Method: Particles (65536)

<table>
<thead>
<tr>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>28</td>
<td>46</td>
</tr>
<tr>
<td>0.5</td>
<td>29</td>
<td>52</td>
</tr>
<tr>
<td>1.0</td>
<td>29</td>
<td>52</td>
</tr>
<tr>
<td>1.5</td>
<td>29</td>
<td>53</td>
</tr>
<tr>
<td>2.0</td>
<td>29</td>
<td>56</td>
</tr>
<tr>
<td>2.5</td>
<td>30</td>
<td>58</td>
</tr>
<tr>
<td>3.0</td>
<td>31</td>
<td>68</td>
</tr>
<tr>
<td>3.5</td>
<td>32</td>
<td>78</td>
</tr>
<tr>
<td>4.0</td>
<td>34</td>
<td>91</td>
</tr>
</tbody>
</table>

## Improved Method: Particles (131072)

<table>
<thead>
<tr>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>34</td>
<td>56</td>
</tr>
<tr>
<td>0.5</td>
<td>35</td>
<td>61</td>
</tr>
<tr>
<td>1.0</td>
<td>36</td>
<td>69</td>
</tr>
<tr>
<td>1.5</td>
<td>36</td>
<td>74</td>
</tr>
<tr>
<td>2.0</td>
<td>38</td>
<td>84</td>
</tr>
<tr>
<td>2.5</td>
<td>39</td>
<td>102</td>
</tr>
<tr>
<td>3.0</td>
<td>41</td>
<td>117</td>
</tr>
<tr>
<td>3.5</td>
<td>45</td>
<td>138</td>
</tr>
<tr>
<td>4.0</td>
<td>51</td>
<td>168</td>
</tr>
</tbody>
</table>

## Improved Method: Particles (262144)

<table>
<thead>
<tr>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>54</td>
<td>100</td>
</tr>
<tr>
<td>0.5</td>
<td>55</td>
<td>113</td>
</tr>
<tr>
<td>1.0</td>
<td>56</td>
<td>137</td>
</tr>
<tr>
<td>1.5</td>
<td>59</td>
<td>151</td>
</tr>
<tr>
<td>2.0</td>
<td>60</td>
<td>170</td>
</tr>
<tr>
<td>2.5</td>
<td>65</td>
<td>196</td>
</tr>
<tr>
<td>3.0</td>
<td>70</td>
<td>232</td>
</tr>
<tr>
<td>3.5</td>
<td>78</td>
<td>286</td>
</tr>
<tr>
<td>4.0</td>
<td>92</td>
<td>368</td>
</tr>
</tbody>
</table>
### Improved Method: Particles (524288)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
<th>Total (ℓ)</th>
<th>Total (ℓ)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>100</td>
<td>245</td>
<td>127</td>
<td>3</td>
<td>708</td>
</tr>
<tr>
<td>0.5</td>
<td>102</td>
<td>250</td>
<td>133</td>
<td>1</td>
<td>717</td>
</tr>
<tr>
<td>1.0</td>
<td>106</td>
<td>282</td>
<td>147</td>
<td>2</td>
<td>848</td>
</tr>
<tr>
<td>1.5</td>
<td>111</td>
<td>324</td>
<td>170</td>
<td>1</td>
<td>1,151</td>
</tr>
<tr>
<td>2.0</td>
<td>118</td>
<td>378</td>
<td>199</td>
<td>9</td>
<td>1,518</td>
</tr>
<tr>
<td>2.5</td>
<td>132</td>
<td>450</td>
<td>237</td>
<td>8</td>
<td>2,073</td>
</tr>
<tr>
<td>3.0</td>
<td>146</td>
<td>558</td>
<td>293</td>
<td>11</td>
<td>2,834</td>
</tr>
<tr>
<td>3.5</td>
<td>162</td>
<td>700</td>
<td>368</td>
<td>14</td>
<td>3,933</td>
</tr>
<tr>
<td>4.0</td>
<td>185</td>
<td>930</td>
<td>486</td>
<td>40</td>
<td>4,837</td>
</tr>
</tbody>
</table>

### Improved Method: Particles (1048576)

<table>
<thead>
<tr>
<th>y</th>
<th>Time (µs)</th>
<th>Up Kernel</th>
<th>Down Kernel</th>
<th>Total (ℓ)</th>
<th>Total (ℓ)</th>
</tr>
</thead>
<tbody>
<tr>
<td></td>
<td>Min</td>
<td>Max</td>
<td>Mean</td>
<td>Min</td>
<td>Max</td>
</tr>
<tr>
<td>0.0</td>
<td>191</td>
<td>514</td>
<td>269</td>
<td>2</td>
<td>830</td>
</tr>
<tr>
<td>0.5</td>
<td>194</td>
<td>541</td>
<td>285</td>
<td>3</td>
<td>1,087</td>
</tr>
<tr>
<td>1.0</td>
<td>208</td>
<td>620</td>
<td>327</td>
<td>7</td>
<td>1,199</td>
</tr>
<tr>
<td>1.5</td>
<td>224</td>
<td>733</td>
<td>388</td>
<td>7</td>
<td>1,633</td>
</tr>
<tr>
<td>2.0</td>
<td>230</td>
<td>888</td>
<td>469</td>
<td>6</td>
<td>2,155</td>
</tr>
<tr>
<td>2.5</td>
<td>279</td>
<td>1,099</td>
<td>578</td>
<td>5</td>
<td>2,806</td>
</tr>
<tr>
<td>3.0</td>
<td>341</td>
<td>1,406</td>
<td>736</td>
<td>7</td>
<td>3,631</td>
</tr>
<tr>
<td>3.5</td>
<td>373</td>
<td>1,870</td>
<td>979</td>
<td>10</td>
<td>5,104</td>
</tr>
<tr>
<td>4.0</td>
<td>475</td>
<td>2,532</td>
<td>1,326</td>
<td>13</td>
<td>7,211</td>
</tr>
</tbody>
</table>

143
REFERENCES


[34] NVIDIA GeForce GTX 980, NVIDIA Corporation, 2014. Available at https://international.download.nvidia.com/geforce-com/international/pdfs/GeForce_GTX_980_Whitepaper_FINAL.PDF.


