Reconstruction of High-Precision Semantic Map

We present a real-time Truncated Signed Distance Field (TSDF)-based three-dimensional (3D) semantic reconstruction framework for LiDAR point clouds, which achieves incremental surface reconstruction and highly accurate semantic segmentation. High-precision real-time 3D semantic reconstruction from LiDAR data is important but challenging: Light Detection and Ranging (LiDAR) sensors deliver accurate but massive data for 3D reconstruction. We therefore propose a line-of-sight algorithm to update the implicit surface incrementally. Meanwhile, to exploit semantic information more effectively, we propose an online attention-based spatial and temporal feature fusion method, which is tightly integrated into the reconstruction system. We parallelize the reconstruction and semantic fusion processes, which yields real-time performance. We demonstrate our approach on the CARLA dataset, the Apollo dataset, and our own dataset. Compared with state-of-the-art mapping methods, our method has a great advantage in terms of both quality and speed, meeting the needs of robotic mapping and navigation.


Introduction
When entering an unfamiliar environment, it is very important to perceive 3D structure and semantic information in real time. Reconstructing a precise and continuous surface in real time allows robots to respond accurately and quickly. At the same time, fusing semantic information into the 3D map is also a necessary part of perception. In this work, we are interested in online feature fusion and 3D reconstruction based on a temporal sequence of 3D scans and images. The 3D scans can come from depth cameras, LiDAR, and other sensors, which are becoming increasingly affordable and are widely used in many applications.
Among three-dimensional (3D) sensing devices, LiDAR (Light Detection and Ranging) offers high measurement accuracy and high angular, range, and velocity resolution. Researchers have proposed many methods for precise surface reconstruction from LiDAR point clouds. Verma et al. [1], Zhou et al. [2], and Poullis et al. [3] created 3D scenes from LiDAR data. In these methods, noise was removed by classification, individual building patches and ground points were separated by segmentation, and mesh models were generated from the building patches. These approaches did not generalize well to objects with arbitrary shapes, since they relied on predefined patterns. Hornung et al. [4] proposed OctoMap, which uses an octree data structure for memory efficiency. However, OctoMap lacks detail near the surface and does not offer arbitrary resolution. Beyond accuracy, these LiDAR-based 3D reconstruction methods [1][2][3][4][5][6][7] are mostly offline, because a LiDAR sensor collects tens of thousands of points per frame and accumulates a huge number of points within just a few minutes. However, online reconstruction from sparse LiDAR data is more beneficial for real-time applications, and incremental methods such as that of Lovi et al. [8] have begun to address it. Our main contributions are as follows:
• We propose a line-of-sight algorithm to update the implicit surface incrementally and reduce noise, taking LiDAR characteristics into account.
• We present a novel self-attention-based spatial-temporal feature fusion method for highly accurate semantic segmentation.
• We present a real-time SDF-based 3D semantic reconstruction framework for LiDAR point clouds. Compared with other methods, our framework achieves smoother 3D surface estimation, higher semantic segmentation accuracy, and real-time performance.

Materials and Methods
The pipeline of our system is shown in Figure 1. The inputs are LiDAR point clouds and RGB images. Through the registration algorithm [39], a pose is estimated and used for mapping and reconstruction. Given the pose, voxels are created only if they are near the surface and within the volume range, and our line-of-sight algorithm updates them incrementally. At the same time, the 2D semantic segmentation network [15] processes the current frame and outputs semantic segmentation results. Finally, the semantic fusion module fuses the semantic results over time into the reconstructed 3D map and outputs the 3D semantic map.

3D Surface Reconstruction
Current methods [37,40] for integrating depth sensor data into a TSDF are based on depth images. LiDAR is different: it collects unorganized point cloud data, so we cannot directly use the projection method to update the implicit surface. We propose a voxel-based line-of-sight update algorithm, shown in Figure 2, to solve this problem. The line of sight is obtained from the current sensor pose and the LiDAR points P in world coordinates. We find the voxels that each line of sight crosses, update the associated voxel values, fuse them over time, and obtain the continuous implicit surface of an object.
The input to our algorithm is an unorganized 3D point cloud set $P^L_k = \{p^L_{k,1}, p^L_{k,2}, \cdots, p^L_{k,n}\}$ in the LiDAR coordinate system $\{L\}$, $k \in \mathbb{Z}^+$, where $k$ is the LiDAR frame number and $p^L_{k,i} \in \mathbb{R}^3$ is the $i$th point in the $k$th frame. The pose of the $k$th LiDAR frame in the world coordinate system is $H_k$.
The key idea of our implicit surface update method is to generate the line of sight $\overrightarrow{o_k p_{k,i}}$, where $o_k$ is the sensor origin and $p_{k,i}$ is the current LiDAR point in world coordinates, and then to find the relevant voxels $X_{p_{k,i}}$ along the line of sight. We transform the point cloud into the world coordinate system using the $k$th pose, which consists of a $3 \times 3$ rotation matrix $R_k$ and a $3 \times 1$ translation vector $T_k$.
$o_k$ is equal to $T_k$, and the line of sight is a three-dimensional vector, as shown in Equation (2).
To avoid missing voxels when searching the surroundings of each line of sight, we use the line of sight to sweep the relevant voxels in space and update their TSDF values and weights. We find the maximal axis, take it as the standard, and normalize the other two directions to obtain the normalized direction vector $\overrightarrow{op}_{k,i}$, as shown in Equation (3).
We then use the normalized direction vector $\vec{v}$ to find related points $P_{op_{k,i}}$ in front of and behind the original point $p_{k,i}$, as shown in Equation (4).
where $m$ is a parameter, $m = k_{step} \cdot L_{voxel}$. In our system, $k_{step}$ is set to integers, $k_{step} \in \{-5, -4, \ldots, 5\}$, and $L_{voxel}$ is the voxel size. Each value of $k_{step}$ determines one related point, so all the related points of the original point can be obtained. LiDAR often collects points from a long distance, which contain more noise than close points, so we use a dynamic weight to account for noisy data far from the sensor. As shown in Equation (5), $a$ is a parameter related to the size of the reconstructed scene.
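The related-point sampling described above can be sketched as follows. This is a minimal illustration, not the paper's implementation; the function and argument names are ours, and we assume the direction is normalized so that its largest-magnitude axis becomes 1, which guarantees the sweep steps at most one voxel per axis.

```python
import numpy as np

def related_points(o, p, k_steps=range(-5, 6), voxel_size=0.1):
    """Sample candidate update points along the line of sight o -> p.

    Sketch of the related-point generation: the direction is normalized
    by its maximal axis, then points are placed at offsets
    m = k_step * L_voxel around the original point p.
    """
    o = np.asarray(o, dtype=float)
    p = np.asarray(p, dtype=float)
    d = p - o
    # Normalize by the maximal axis so stepping never skips a voxel row.
    v = d / np.abs(d).max()
    return [p + k * voxel_size * v for k in k_steps]
```

With eleven integer steps, one related point per step is produced, the middle one coinciding with the original LiDAR point.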
where $D_{k-1}(x_{k,i})$ and $W_{k-1}(x_{k,i})$ are the TSDF values and weights of all voxels in frame $k-1$, as shown in Equation (8).
Each voxel stores its signed distance function value and weight. For parallelization, we use a spatial hashing-based data structure to manage the space, which allows us to insert and update voxels in parallel. A memory pool is also created to reduce memory fragmentation and improve efficiency. We only insert voxels near the zero isosurface and waste no memory on empty space.
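A spatially hashed TSDF volume with weighted fusion can be sketched as below. This is our own minimal illustration: the class and method names, the exact distance-decaying weight formula, and the dict-based hash are assumptions standing in for the paper's parallel C++ structure.

```python
import numpy as np

class HashedTSDF:
    """Minimal sketch of a spatially hashed TSDF volume.

    Voxels are keyed by integer grid coordinates in a dict, so insertion
    and lookup are O(1) and no memory is spent on empty space. Each voxel
    fuses observations with the usual weighted running average; the
    per-observation weight decays with distance to model far-range LiDAR
    noise (the role the paper assigns to the parameter `a`).
    """

    def __init__(self, voxel_size=0.1, a=5.0):
        self.voxel_size = voxel_size
        self.a = a
        self.voxels = {}  # (i, j, k) -> (tsdf, weight)

    def key(self, point):
        return tuple(np.floor(np.asarray(point) / self.voxel_size).astype(int))

    def integrate(self, point, sdf, sensor_origin):
        dist = np.linalg.norm(np.asarray(point) - np.asarray(sensor_origin))
        w = self.a / (self.a + dist)        # assumed distance-decaying weight
        k = self.key(point)
        d_old, w_old = self.voxels.get(k, (0.0, 0.0))
        w_new = w_old + w
        d_new = (d_old * w_old + sdf * w) / w_new
        self.voxels[k] = (d_new, w_new)
```

Two observations of the same voxel from the same range average symmetrically, which is the behavior the incremental fusion relies on.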

Semantic Fusion
The inputs of our system are LiDAR point clouds and RGB images. Each LiDAR point can be indexed to a voxel, and we only update these related voxels. Firstly, the incremental reconstruction provides each voxel's normal, which, together with the sensor position, is fed to our Observation Adaptive Network (OAN) to provide observation-effectiveness information. Secondly, the 2D semantic segmentation network extracts an image feature, and the LiDAR point cloud is projected to obtain the corresponding image feature of each voxel. This image feature is also an input of the OAN and is used together with the voxel's normal and the sensor position to update the current voxel's state. Finally, we use the Attention Based Spatial Fusion Network (ABSFN) to fuse voxel states within a limited spatial neighborhood to obtain the current voxel's semantic label. Together, these modules fuse the semantic results over time into the reconstructed 3D map and output the 3D semantic map.

Network's Input
Image Feature. Our feature fusion network requires each voxel's corresponding image feature $R^k_i$, where $i$ is the voxel index and $k$ the frame number. To achieve this, we first train a 2D real-time segmentation network [15] to extract dense image features. The LiDAR points are then projected onto the camera plane to obtain the feature of each voxel from the corresponding pixel. During projection, some voxels may fall outside the camera plane; these voxels are not reconstructed.
Normal. Our system performs incremental 3D reconstruction based on the SDF (Signed Distance Field) algorithm. By computing the gradient of the signed distance field, the normal of the $i$th voxel in the $k$th frame is obtained, denoted $N^k_i$.
Position of the sensor. To represent the validity of the observation from a geometric perspective, we also need the current sensor position, which equals $o_k$ in Section 2.1. We denote it $L^k_i$ in this section, where $i$ is the voxel index and $k$ the frame number.

Observation Adaptive Network
Multi-view observation is considered beneficial for robust scene understanding. However, observations do not change uniformly during data collection; in particular, when the sensor stops, a large amount of redundant data is collected from the same pose. We therefore design the OAN to evaluate observation effectiveness. The structure of OAN is shown in Figure 3. We assume there are two main factors related to observation effectiveness: the location of the observation $L^k_i$ in the local coordinate system centered on the current voxel, and the normal of the current voxel $N^k_i$. Their combination represents the validity of the observation from a geometric perspective: when the viewing direction is close to 90 degrees from the normal, the observation is unreliable; when it is close to 0 degrees, the observation is reliable. Consequently, we use a GRU, which takes the voxel's normal and the sensor's local position as input, to obtain the observation state $E^k_i$.
where $E^k_i$ implicitly encodes the observation effectiveness. Subsequently, the image feature $R^k_i$ obtained from the 2D images is concatenated with $E^k_i$ as the input to the next GRU layer.
$F^k_i$ is the hidden state of the GRU and is saved in each voxel. When the specific category of a voxel needs to be computed, ABSFN gathers the states of the relevant voxels to obtain the final result.
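The two-stage GRU wiring described above can be sketched in PyTorch. The layer sizes, the exact concatenation order, and the class name are our assumptions; the paper specifies only that a GRU turns the normal and local sensor position into an observation state $E$, which is then concatenated with the image feature $R$ and fed to a further GRU layer whose hidden state $F$ is stored per voxel.

```python
import torch
import torch.nn as nn

class OAN(nn.Module):
    """Sketch of the Observation Adaptive Network (sizes are assumptions)."""

    def __init__(self, feat_dim=32, hidden_dim=64):
        super().__init__()
        # First GRU cell: [normal (3); local sensor position (3)] -> E
        self.obs_gru = nn.GRUCell(6, hidden_dim)
        # Second GRU cell: [E; image feature R] -> F, stored per voxel
        self.fuse_gru = nn.GRUCell(hidden_dim + feat_dim, hidden_dim)

    def forward(self, normal, position, image_feat, e_prev, f_prev):
        e = self.obs_gru(torch.cat([normal, position], dim=-1), e_prev)
        f = self.fuse_gru(torch.cat([e, image_feat], dim=-1), f_prev)
        return e, f
```

Keeping the recurrence per voxel means each voxel carries its own hidden state across frames, which is what makes the temporal fusion incremental.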

Attention Based Spatial Fusion Network
The output of the OAN does not exploit spatial neighborhood information, which can further improve the accuracy of semantic segmentation. A direct way to exploit it is to perform 3D convolution on the sparse 3D data. However, this is inefficient, especially when voxels maintain high-dimensional features. In addition, the 3D structure changes over time, which makes it hard for a fixed-grid network to learn. We therefore use the self-attention mechanism to explicitly measure the correlation between the current voxel and its adjacent voxels; the normalized sum unit makes the network more robust to changes of the 3D structure. Accordingly, we propose ABSFN.
Looking Up Neighborhood. Our 3D semantic map uses a spatial hashing-based data structure for efficient space management. When looking up neighborhood voxels, the search range is limited to a cube centered at the current voxel with a side length of $s$ voxels. For each grid location in the cube, we check whether a neighborhood voxel exists; if so, it is added to the neighborhood list $H^k_i = \{h^k_{i,1}, h^k_{i,2}, \ldots, h^k_{i,j}, \ldots, h^k_{i,N}\}$, where $j$ indexes the neighborhood voxels and $N$ is their number.
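The neighborhood lookup can be sketched as a scan over the cube in the spatial hash. The dict-based hash and all names are our illustrative assumptions; only grid locations that actually hold a voxel enter the neighborhood list.

```python
def lookup_neighborhood(voxels, center, s):
    """Collect existing neighbors of `center` inside an s-voxel cube.

    `voxels` is the spatial hash, a dict keyed by integer grid
    coordinates; `center` is the current voxel's grid coordinate.
    """
    cx, cy, cz = center
    r = s // 2
    neighborhood = []
    for dx in range(-r, r + 1):
        for dy in range(-r, r + 1):
            for dz in range(-r, r + 1):
                key = (cx + dx, cy + dy, cz + dz)
                # Only actually allocated voxels join the list H.
                if key != center and key in voxels:
                    neighborhood.append(key)
    return neighborhood
```

Because the hash stores only voxels near the zero isosurface, the returned list is typically far smaller than the full cube.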
Spatial Self-attention. We assume two main elements determine the relevance between the current voxel and its neighborhood: the hidden state $F^k_i$ stored in each voxel, and the offset of each neighborhood voxel to the current voxel, denoted $O^k_{i,j}$, i.e., the offset between the $i$th voxel and the $j$th voxel in the $k$th frame, together with the hidden state $F^k_j$ stored in the $j$th voxel. Considering the feature-space inconsistency between $O^k_{i,j}$ and $F^k_i$, we design a lightweight encoder that performs a space transformation for a better weight measurement. The structure of ABSFN is shown in Figure 4. The OAN output $F^k_i$ is embedded with the offset $O^k_{i,j}$ and split into two branches: one contains only the target voxel's information and generates $Q^k_i$; the other, which contains all neighborhood voxels, generates a weighted feature dictionary, denoted $K^k_{i,j}$ and $V^k_{i,j}$, according to the voxels' own states. For the relevance calculation, we choose the dot product to reduce computational complexity. The equations are shown below.
$Z^k_i$ is the fusion result of the $V^k_{i,j}$. We use ResBlocks to learn the functions $f$ and $g$. The coefficient $\alpha^k_{i,j}$ is computed from the feature vectors of the target voxel and the $j$th neighborhood voxel, as in the following equation.
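The core of the spatial self-attention, dot-product relevance followed by a normalized weighted sum, can be sketched as below. The shapes and names are our assumptions, and the learned encoders that produce Q, K, and V from $F^k_i$ and $O^k_{i,j}$ are omitted; only the normalized fusion itself is shown.

```python
import numpy as np

def spatial_attention(q, keys, values):
    """Dot-product attention over a voxel's neighborhood (sketch).

    q:      (d,)   query from the target voxel
    keys:   (N, d) keys from the N neighborhood voxels
    values: (N, d) values from the N neighborhood voxels
    The softmax-normalized weights alpha make the fused feature Z
    insensitive to the varying number of neighbors, mirroring the
    normalized sum unit described above.
    """
    scores = keys @ q                       # relevance by dot product
    alpha = np.exp(scores - scores.max())   # numerically stable softmax
    alpha /= alpha.sum()
    return alpha @ values                   # fused feature Z
```

When one neighbor's key aligns with the query much more strongly than the others, the fused feature approaches that neighbor's value, which is the intended selective behavior.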

Dataset
We verify the effectiveness of our method for both 3D semantic segmentation and reconstruction on three datasets. CARLA Dataset. We generate a virtual dataset with the CARLA [41] simulation platform, which is built for realistic rendering and physical simulation and allows flexible configuration of the agent's sensor suite. In our experiment, we set cameras with 1024 × 1024 resolution to obtain RGB and semantic segmentation images. A virtual 64-line 10 Hz LiDAR sensor is also added to obtain point clouds. With the sensors attached to an auto-piloted car in the virtual city, we simulate and collect sensor data with corresponding poses at 10 fps. In total, 900 frames of data are generated, and every 10th frame of the sequence is selected to yield a 90-frame test set.
Apollo Dataset. We also evaluate our approach on the Apollo Dataset [42], which is collected from a real environment. The environment is complex and contains more noise, making it a challenging dataset for real-world large-scale scene mapping. We use the Record005 sequence of the Road02 portion.
VLP-16 Dataset. The VLP-16 dataset is collected by ourselves with a system that includes a VLP-16 LiDAR, two cameras, and a laptop, operating at 10 Hz.

Implementation Details
Our experiments were performed on a computer with an Intel Core i7 and an NVIDIA GeForce GTX 1080Ti, and we implemented our algorithm in C++ and PyTorch.
The inputs of our method are LiDAR point clouds and RGB images. The point cloud is passed to the 3D reconstruction module to obtain the 3D map. In the surface reconstruction process, we set the value of $a$ to 5 empirically, and we set the voxel size to 0.1 m to balance speed and quality. RGB images are processed by ICNET [15] to obtain 2D image features. For the neighborhood search in ABSFN, we set the value of $s$ to 100 and the value of $N$ to 25. We trained our semantic fusion network with the Adam optimizer, using an initial learning rate of 0.001 and a decay of 0.00001 every epoch, and optimized the cross-entropy loss.
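The training setup above can be sketched in PyTorch. The 64-dimensional input, the 13 output classes, and our subtractive reading of "a decay of 0.00001 every epoch" are assumptions; the text specifies only Adam, the initial rate, the per-epoch decay, and the cross-entropy objective.

```python
import torch

# Stand-in for the semantic fusion network (dimensions are assumptions).
model = torch.nn.Linear(64, 13)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
criterion = torch.nn.CrossEntropyLoss()

def set_epoch_lr(epoch):
    """Apply the assumed subtractive per-epoch learning-rate decay."""
    lr = max(0.001 - 1e-5 * epoch, 0.0)
    for group in optimizer.param_groups:
        group["lr"] = lr
```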
The final prediction results of our framework are rendered from 3D semantic map. Figure 5 shows the result by rendering the output of the 3D semantic reconstruction framework.

Qualitative Results
We show some qualitative results of our reconstruction approach; our method achieves good results on both the virtual dataset and the real datasets. As shown in Figure 6, our approach not only obtains good results on large objects, such as roads and buildings, but also accurately recovers small objects such as street lights and boxes. These results show that the line-of-sight algorithm is very effective on LiDAR point clouds.

Performance Comparison
Considering that there is no ground truth in the Apollo and VLP-16 datasets, we only reconstruct a scene in the CARLA simulator, as shown in Figure 7, to evaluate the errors introduced by the line-of-sight algorithm. The 3D reconstruction error appears mainly at the edges of small objects, such as railings and street lights, because the incremental reconstruction process magnifies the reconstruction error caused by pose estimation error. Nevertheless, the overall error is 0.019 m on average. Following the approach of Sengupta et al. [43], we render depth images from our reconstructed 3D model and compute error heat maps by comparing them with the ground truth. We consider a pixel accurate if its error is below a fixed threshold, which yields a per-image accuracy. With depth thresholds of 0.2 m and 0.1 m, our approach achieves 94.87% and 86.91% accuracy, respectively, on average.
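The threshold-based accuracy metric above can be sketched as follows. The function name and the masking of invalid (zero) ground-truth pixels are our assumptions.

```python
import numpy as np

def depth_accuracy(rendered, ground_truth, threshold):
    """Fraction of valid pixels whose rendered depth error is below threshold.

    Depth images rendered from the reconstructed model are compared
    pixel-wise against ground truth; pixels with no ground-truth depth
    (value 0) are excluded.
    """
    valid = ground_truth > 0
    error = np.abs(rendered - ground_truth)
    return np.mean(error[valid] < threshold)
```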
We compare our approach with OctoMap [4]. OctoMap uses a discrete cut-off probability, and its precision is limited by the minimum voxel size; one advantage of our method is that it reaches subvoxel precision. Using the same accuracy evaluation, OctoMap achieves 86.85% and 73.28% accuracy on average for depth thresholds of 0.2 m and 0.1 m, respectively. Our approach outperforms OctoMap in all cases, as shown in Figure 8.

Semantic Mapping Results
We quantitatively evaluate the accuracy of our semantic reconstruction on our simulated CARLA dataset and the Apollo dataset. The prediction results of our fusion system are rendered from the 3D semantic map. We use pixel-wise intersection over union (IoU) as our metric. For a fair comparison, only pixels included in the rendered map are counted.
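The per-class IoU evaluation can be sketched as below. The names and the use of a sentinel label for pixels not covered by the rendered map are our assumptions; mIoU is then the mean over the classes that appear.

```python
import numpy as np

def per_class_iou(pred, gt, num_classes, ignore=-1):
    """Pixel-wise intersection over union per class (sketch).

    Pixels labeled `ignore` in gt (not covered by the rendered map)
    are excluded before counting, matching the evaluation protocol
    described above.
    """
    mask = gt != ignore
    pred, gt = pred[mask], gt[mask]
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    return ious
```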
We conduct two experiments to demonstrate the effectiveness of our method. The first, on our simulated CARLA dataset, shows the effectiveness of OAN and ABSFN. The results are summarized in Table 1, and visual comparisons are shown in Figure 9. We first observe that our OAN, utilizing point cloud data, achieves a slight IoU improvement on the pole and fence classes over [15], because it evaluates the validity of observations. Our full approach, adaptive observation with attention-based spatial fusion, performs better on all classes except fences and improves mIoU by 22.01% over [15], owing to the advantage of geometric information. As Figure 9 shows, many blurred boundaries are eliminated by the attention-based spatial fusion, and the outlines of thin objects improve considerably. These results demonstrate the effectiveness of our semantic fusion system and its ability to reconstruct a highly precise semantic map. To further support these conclusions, we conduct corresponding experiments on the Apollo dataset and outline the results in Table 2. Since our framework aims to process each frame in real time, we choose comparison methods that can extract per-frame semantic information in real time. Moreover, fusion is the focus of our framework, and many high-performance 2D or 3D semantic segmentation networks could serve as its input, so we compare only with similar fusion methods and basic 2D or 3D semantic segmentation methods: ICNET [15], PointNet [23], and [32]. ICNET [15] obtains semantic segmentation results through a CNN-based 2D network, whose results we project into 3D space without extracting spatial or temporal information.
PointNet [23] considers 3D spatial information but not temporal information. Li et al. [32] consider temporal information but neither spatial information nor 2D image features. The quantitative results show that our method yields a clear improvement in accuracy. Table 3 presents the results on the Apollo dataset, a more challenging real dataset. It shows that semantic estimation based on probability fusion [32] improves accuracy over segmenting the point cloud directly [23] or projecting without fusion [15]. Moreover, compared with the probability-based method, our method gains 13.3 percentage points of mIoU, because it takes texture and context information into account during fusion. Visual comparisons are shown in Figures 10 and 11. Figure 10 visualizes a sequence; from the top are the ground truth, the results of our method, and the results of ICNET [15]. As the red boxes in Figure 10 show, there is much high-frequency noise in the ICNET [15] results, which can also be regarded as discontinuity in 3D space; in our results, this noise is eliminated. Figure 11 shows details of the 3D semantic segmentation on the Apollo dataset; from the left are the ground truth, our results, and the results of [32]. In the red boxes of Figure 11, the classification accuracy is clearly improved by our method, showing that it integrates spatial geometric information and temporal information well.
Figure 10. Visualization of the Apollo dataset. From the top: ground truth, ours, ICNET [15]. Our method is significantly better than ICNET [15] in spatial consistency.
Figure 11. Visualization of the Apollo dataset in detail. From the left: ground truth, ours, [32]. The accuracy of our method in the red boxes is significantly higher than that of [32].

Real-Time Performance
Real-time performance means that all software modules must complete before the next frame's data arrives. In our system, the LiDAR operates at 10 Hz, so we must process each frame within 100 ms. The VLP-16 LiDAR collects about 20k points per frame; accordingly, we fix the number of points per frame to 20k in Table 4. Based on the experiment in Table 4, we set the voxel size to 0.1 m to balance speed and quality. As shown in Table 5, our system is robust when the number of points changes. With the voxel size and point count set to 0.1 m and 20k, the total time per frame is about 65 ms on average. Although there are many works on reconstruction or semantic segmentation alone, there are few on both; therefore, for a fair comparison, we only compare our method with [32], which performs 3D reconstruction and semantic segmentation simultaneously. With the same voxel size and point count, [32] takes about 266 ms per frame on average, so our system is up to four times faster. In our system, the time spent on semantic segmentation, reconstruction, and rendering is about 35 ms, 20 ms, and 10 ms on average, respectively, as shown in Figure 12. Overall, our system consumes less than 0.1 s per frame and achieves real-time performance.

Conclusions
In this paper, we propose a real-time framework that generates a 3D semantic map of outdoor environments by combining LiDAR point clouds and RGB images. A line-of-sight algorithm reconstructs smooth surfaces from LiDAR data with large noise. For semantic mapping, an online attention-based spatial and temporal feature fusion network is proposed to incrementally update the semantic information. Our method runs in real time to meet robotics needs and achieves high accuracy in both 3D surface estimation and semantic segmentation. Compared with state-of-the-art mapping methods, it has a great advantage in terms of quality and speed, meeting the needs of robotic mapping and navigation.