Because the Hessian matrix is constant in the inverse Gauss-Newton algorithm, it is computed once rather than at every iteration, which makes the calculation more efficient than the original Gauss-Newton algorithm without compromising accuracy. In step 1, the template image gradients ∇T in the x, y and z directions are calculated using Sobel operators. In step 2, both the source image I and the template image T are transferred to cached GPU texture memory. In step 3, the template gradient ∇T is multiplied by the Jacobian matrix ∂w/∂p, and the resulting matrices (∂w/∂pₖ)∇T are transferred to aligned GPU global memory. In step 4, a 6×6 Hessian matrix H is calculated from the (∂w/∂pₖ)∇T matrices on the CPU. Preprocessing steps 1–4 are implemented on the CPU because preprocessing takes only a small percentage of the overall registration time compared with the iterative optimization process. In steps 5 and 6, the difference between the potentially motion-corrupted image I and the template image T is multiplied by the (∂w/∂pₖ)∇T matrices on the GPU. In step 7, the products from step 6 are summed on the GPU with a parallel summation algorithm. In step 8, the summed 6×1 vector is returned from the GPU to the CPU, and the update parameter Δp is calculated by solving a linear system using LU decomposition. In the final step 9, the new warp function is determined and sent to the GPU for a new iteration that starts from step 5.
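
The arithmetic of steps 4 and 6 to 8 can be illustrated with a small CPU-only sketch. It is not the implementation described above: the function names (`hessian`, `pairwise_sum`, `lu_solve`, `update_step`) are hypothetical, the steepest-descent rows stand in for the per-voxel (∂w/∂pₖ)∇T values, the data are synthetic, and the pairwise reduction only emulates serially the access pattern a GPU parallel summation might use. The error values are built from a known Δp so that the solve can be checked against it.

```python
import random

def hessian(sd_rows):
    """Step 4 (sketch): H = sum over voxels of sd^T sd, a 6x6 matrix."""
    n = len(sd_rows[0])
    H = [[0.0] * n for _ in range(n)]
    for sd in sd_rows:
        for i in range(n):
            for j in range(n):
                H[i][j] += sd[i] * sd[j]
    return H

def pairwise_sum(vectors):
    """Step 7 (sketch): sum a non-empty list of equal-length vectors
    by repeated halving, as in a tree reduction."""
    buf = [list(v) for v in vectors]
    n = len(buf)
    while n > 1:
        half = (n + 1) // 2
        # On a GPU each index i would map to one thread; here it runs serially.
        for i in range(n - half):
            buf[i] = [a + b for a, b in zip(buf[i], buf[i + half])]
        n = half
    return buf[0]

def lu_solve(A, b):
    """Step 8 (sketch): solve A x = b by in-place LU with partial pivoting."""
    n = len(A)
    lu = [list(row) for row in A]
    x = list(b)
    for k in range(n):
        # Pick the largest pivot in column k and swap it into row k.
        p = max(range(k, n), key=lambda r: abs(lu[r][k]))
        if lu[p][k] == 0.0:
            raise ValueError("singular matrix")
        lu[k], lu[p] = lu[p], lu[k]
        x[k], x[p] = x[p], x[k]
        for r in range(k + 1, n):
            lu[r][k] /= lu[k][k]              # multiplier, stored as L
            for c in range(k + 1, n):
                lu[r][c] -= lu[r][k] * lu[k][c]
            x[r] -= lu[r][k] * x[k]           # forward substitution on the fly
    for r in range(n - 1, -1, -1):            # back substitution with U
        s = x[r] - sum(lu[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / lu[r][r]
    return x

def update_step(sd_rows, errors, H):
    """Steps 6 to 8 (sketch): weight each row by its error (I - T),
    reduce to a 6x1 vector, then solve H * dp = b."""
    products = [[s * e for s in sd] for sd, e in zip(sd_rows, errors)]
    b = pairwise_sum(products)
    return lu_solve(H, b)

# Synthetic data: 200 "voxels", 6 rigid-body parameters (assumed values).
random.seed(0)
sd_rows = [[random.uniform(-1, 1) for _ in range(6)] for _ in range(200)]
dp_true = [0.1, -0.2, 0.3, 0.01, -0.02, 0.03]
errors = [sum(s * d for s, d in zip(sd, dp_true)) for sd in sd_rows]

H = hessian(sd_rows)                     # computed once, reused every iteration
dp = update_step(sd_rows, errors, H)     # recovers dp_true up to rounding
```

Because `H` depends only on the template, it sits outside the iteration loop in this sketch, while `update_step` would be called once per iteration with fresh error values.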