Searching and Tracking an Unknown Number of Targets: A Learning-Based Method Enhanced with Maps Merging

Unmanned aerial vehicles (UAVs) have been widely used in search and rescue (SAR) missions due to their high flexibility. A key problem in SAR missions is searching for and tracking moving targets in an area of interest. In this paper, we focus on the problem of Cooperative Multi-UAV Observation of Multiple Moving Targets (CMUOMMT). In contrast to the existing literature, we not only optimize the average observation rate of the discovered targets but also emphasize fairness in observing the discovered targets and continuous exploration for undiscovered targets, under the assumption that the total number of targets is unknown. To achieve this objective, a deep reinforcement learning (DRL)-based method is proposed under the Partially Observable Markov Decision Process (POMDP) framework, where each UAV maintains four observation history maps, and maps from different UAVs within communication range can be merged to enhance the UAVs' awareness of the environment. A deep convolutional neural network (CNN) is used to process the merged maps and generate control commands for the UAVs. The simulation results show that, compared with other methods, our policy enables UAVs to balance between giving the discovered targets fair observation and exploring the search region.


Introduction
In the past decade, unmanned aerial vehicles (UAVs) have been widely used in military and civilian applications due to their low cost and high flexibility. Especially in search and rescue (SAR) missions, multiple UAVs working together can reduce mission execution time and provide timely relief to targets [1][2][3]. In a SAR mission, UAVs need to search for targets in an unknown region and continuously track them to monitor their status. However, the number of targets is generally unknown and the available UAVs are limited, which requires multiple UAVs to work together to keep track of the discovered targets while finding more unknown targets [4]. The problem of using robot teams to cooperatively observe multiple moving targets was first formalized by Parker and Emmons [5], who termed it Cooperative Multi-Robot Observation of Multiple Moving Targets (CMOMMT) and showed that it is NP-hard.
Since the CMOMMT problem was raised, a great deal of work has addressed it. A classical approach is the local force vector proposed by Parker and Emmons [5], in which a robot is subject to the attractive forces of nearby targets and the repulsive forces of nearby robots, and the direction of the robot's motion is determined by their combined force. However, this method causes overlapping observations of the same target. Parker [6] therefore proposed an improved method called A-CMOMMT to mitigate this issue, in which the robots are controlled by weighted local force vectors for tracking targets. Additionally, in [7], the authors proposed B-CMOMMT, in which a help behavior is added to reduce the risk of losing a target. In [8], the authors proposed an algorithm called P-CMOMMT, which accounts for the uniformity of target observation through the information entropy of the targets' observations. Methods based on local force vectors lack prediction of the targets' behavior and do not make full use of the targets' historical position information, resulting in low efficiency in searching for and tracking the targets.
A large number of optimization-based methods have been proposed to solve CMOMMT-like problems [9][10][11][12]. In [13], the authors used a group of vision-based UAVs to search for multiple ground targets. The objective is to optimize the collective coverage area and the detection performance based on a distributed probability map updating model, in which the probability of target existence is calculated from the measurement information and information sharing among neighboring agents. In [14], the authors proposed a multi-objective optimization approach based on a genetic algorithm (GA) to minimize the mission completion time for a team of UAVs finding a target in a bounded area. In [15], the UAVs' task sequence for a reconnaissance task assignment problem is considered; the problem is formulated as a multi-objective, multi-constraint, nonlinear optimization problem solved with a modified Multi-Objective Symbiotic Organisms Search (MOSOS) algorithm. In [16], searching for and tracking an unknown ground moving target by multiple UAVs in an urban environment was modeled as a multi-objective optimization problem with preemptive priority constraints. The authors proposed a fuzzy multi-objective path planning method to solve this problem, with target behavior predicted by an extended Kalman filter (EKF) and probability estimation. In [17], the authors proposed a real-time path-planning solution enabling multiple UAVs to cooperatively search a given area. The problem is modeled as a Model Predictive Control (MPC) problem solved with a Particle Swarm Optimization (PSO) algorithm. In [18], the authors emphasize the fairness of observations among different targets compared with the initial CMOMMT problem. They proposed an integer linear programming model for this problem, in which the motion of the targets is estimated in a Bayesian framework.
The above-mentioned approaches fail to balance target searching and target tracking, making it difficult for UAVs to keep searching for undiscovered targets when there are fewer UAVs than targets. To solve this problem, Li et al. [19] proposed a profit-driven adaptive moving target search algorithm, which considers the impact of moving targets and collaborating UAVs in a unified framework through a concept called the observation profit of cells. However, this approach assumes that the total number of targets is known, which is impractical in some complex environments. In [20], Dames proposed a method enabling multiple robots to search for and track an unknown number of targets. The robots use the Probability Hypothesis Density (PHD) filter to estimate the number and positions of the targets, and a Voronoi-based control strategy to search for and track them. This method assumes that each robot has a unique ID for creating a globally consistent estimate, which limits the scalability of the robot team.
Recently, the development of deep reinforcement learning (DRL) [21] has provided an alternative way to deal with the CMOMMT problem. DRL learns control policies by interacting with the environment, and it has reached or exceeded human-level performance in some game tasks [22,23]. Several studies have used DRL to solve the target search and tracking problem. In [24], the authors proposed a framework for searching for multiple static targets with a group of UAVs. The framework consists of a global planner based on a modern online Partially Observable Markov Decision Process (POMDP) solver and a local continuous-environment exploration controller based on a DRL method. In [25], the authors proposed a target following method based on deep Q-networks, considering visibility obstruction from obstacles and uncertain target motion. In [26], the authors proposed a DRL-based method enabling a robot to explore unknown cluttered urban environments, in which a deep network with a convolutional neural network (CNN) [27] was trained with the asynchronous advantage actor-critic (A3C) approach to generate appropriate frontier locations. In [28], the authors constructed a framework for automatically exploring unknown environments. The exploration process is decomposed into decision, planning, and mapping modules, in which the decision module is implemented by a deep Q-network that learns an exploration policy from the partial map.
In this paper, we focus on the problem of Cooperative Multi-UAV Observation of Multiple Moving Targets (CMUOMMT), where a UAV team needs to search for and track an unknown number of targets in a search region. Our objective is to enable UAVs to give the discovered targets fair observation while maximizing the exploration rate of the environment to discover more targets. To achieve this objective, the problem is formulated as a POMDP and solved with a DRL method. During the mission, each UAV maintains four observation history maps, which reduce the partial observability of the environment. Furthermore, merging maps among UAVs further improves awareness of the environment. To extract environmental features, a deep network with a CNN is used to process each UAV's observation maps. A modern DRL method is used to train the shared policy with a centralized training, decentralized execution paradigm. The main contributions of this work are as follows:
• The average observation rate of the targets, the standard deviation of the observation rates of the targets, and the exploration rate of the search region are simultaneously optimized to enable multiple UAVs to cooperatively achieve fair observation of discovered targets and continuous search for undiscovered targets.
• Each UAV maintains four observation maps recording observation histories, and a map merging method among UAVs is proposed, which reduces the partial observability of the environment and improves awareness of the environment.
• A DRL-based multi-UAV control policy is proposed, which allows UAVs to learn to balance tracking targets and exploring the environment by interacting with the environment.
The remainder of this paper is organized as follows. In Section 2, the problem is formulated and the optimization objectives are introduced. In Section 3, the details of our method are proposed, including the maps merging method and the key ingredients of the DRL method. In Section 4, simulation experiments are conducted and the results are discussed. Finally, we conclude this paper in Section 5.

Problem Formulation
In this paper, we consider the problem of CMUOMMT described in [6,19], which is shown in Figure 1 and defined as follows:
• A bounded two-dimensional rectangular search region S discretized into C_L × C_W equally sized cells, where C_L and C_W represent the number of cells in the length and width directions of the search region, respectively.
• The time step is discretized and denoted by t within a mission time duration T.
• A set of N moving targets V in S. For target ν_j (ν_j ∈ V, j = 1, 2, ..., N), the cell in which it lies at time step t is denoted by c_t(ν_j) ∈ S. The mission is to observe these targets using multiple UAVs. To simplify this mission, we assume that the maximal speed of the targets is smaller than that of the UAVs.
• A team of M homogeneous UAVs U deployed in S to observe the targets. For UAV u_i (u_i ∈ U, i = 1, 2, ..., M), the cell in which it lies at time step t is denoted by c_t(u_i) ∈ S. Each UAV can observe the targets through its onboard sensor. The sensing range of each UAV is denoted by d_s. We assume that the UAVs fly at a fixed altitude and that the field of view (FOV) of each UAV has the same, constant size. The term FOV_t(u_i) denotes the FOV of UAV u_i at time step t. In addition, each UAV is equipped with a communication device to share information and coordinate with other UAVs. The communication range is denoted by d_c, which is assumed to be larger than the sensing range d_s. UAVs can only share information with other UAVs within communication range. We further assume that all UAVs share a known global coordinate system.
The target ν_j is monitored when it is within the FOV of at least one UAV, which can be defined as

O_t(ν_j) = 1, if there exists u_i ∈ U such that c_t(ν_j) ∈ FOV_t(u_i); O_t(ν_j) = 0, otherwise,

where O_t(ν_j) indicates the observation state of target ν_j.
During the mission, the observation rate of target ν_j can be defined as

η(ν_j) = (1/T) Σ_{t=1}^{T} O_t(ν_j),

where η(ν_j) represents the observation rate of target ν_j, i.e., the proportion of the mission time during which it is under the observation of at least one UAV. The first objective for the UAV team is to maximize the average observation rate of the N targets, characterized by the metric η̄:

η̄ = (1/N) Σ_{j=1}^{N} η(ν_j).

Maximizing η̄ alone is unfair, especially when the number of UAVs is less than the number of targets, which may result in some targets never being observed during the mission. To solve this problem, the second objective for the UAV team is to minimize the standard deviation σ_η of the observation rates of the N targets:

σ_η = sqrt( (1/N) Σ_{j=1}^{N} (η(ν_j) − η̄)^2 ).

A low value of σ_η means that all targets are observed relatively uniformly during the mission. In addition, since the UAV team does not know the total number of targets, it needs to continuously explore the search region to discover new targets. Thus, the third objective for the UAV team is to maximize the exploration rate β of the search region, defined in terms of t_stamp(c_kl), the latest time at which cell c_kl was observed, where k and l represent the indexes of cell c_kl in the length and width directions of the search region, respectively. The maximum value of β is 1, which would mean that all cells in the search region are observed by the UAV team at time step T. However, this is unrealistic, since the region the UAV team can observe at any time is smaller than the total search region to be observed, so in practice β < 1. The ultimate objective is a combination of η̄, σ_η, and β, which differs from [6,19], whose objectives only consider the average observation rate η̄. In this study, the UAV team needs to balance between giving the known targets fair observation and exploring the search region through an efficient method.
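The three metrics above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function names are ours, and the exploration rate here is simplified to the fraction of cells observed at least once, whereas the paper weights cells by their latest observed time t_stamp.

```python
# O[t][j] = 1 when target j is inside at least one UAV's FOV at time step t.
def observation_metrics(O):
    T = len(O)       # mission duration in time steps
    N = len(O[0])    # number of targets
    # eta[j]: proportion of the mission during which target j was observed
    eta = [sum(O[t][j] for t in range(T)) / T for j in range(N)]
    eta_bar = sum(eta) / N                                      # average observation rate
    sigma = (sum((e - eta_bar) ** 2 for e in eta) / N) ** 0.5   # fairness (std. dev.)
    return eta_bar, sigma

def exploration_rate(observed_cells, total_cells):
    # Simplified beta: fraction of the C_L * C_W cells observed at least once.
    return len(observed_cells) / total_cells
```

For example, two targets each observed in 3 of 4 time steps yield η̄ = 0.75 and σ_η = 0, i.e., a perfectly fair observation.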

Overview
We formulate the CMUOMMT problem as a POMDP and solve it with a DRL method. In this method, all UAVs share a global control policy π to decide actions. The action is selected according to the observation from the environment, i.e.,

o_t = O(s_t), a_t ∼ π(a_t | o_t),

where s_t is the global state of the environment, o_t is the local observation of the environment state, O(s_t) is the observation function determined by the UAVs' sensing range and communication range, and a_t is the selected action. The observation o_t includes four observation maps of the environment, which are given in Section 3.2.
In a reinforcement learning (RL) framework, an RL agent learns an optimal policy a_t ∼ π*(a_t | o_t) by interacting with the environment. The goal of the RL agent is to maximize the long-term accumulated reward

G_t = Σ_{k=0}^{∞} γ^k r_{t+k+1},

where r_{t+k+1} is the reward the RL agent receives at time step t + k + 1 and γ (0 < γ < 1) is the discount factor that makes G_t a bounded value.
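For a finite episode, the discounted return G_t is conveniently computed backwards over the reward sequence. The sketch below is generic RL bookkeeping, not code from the paper:

```python
def discounted_returns(rewards, gamma=0.99):
    # G_t = r_{t+1} + gamma * G_{t+1}, accumulated from the end of the episode.
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return returns[::-1]   # returns[t] is G_t
```

With rewards [1, 1, 1] and γ = 0.5, this yields G_0 = 1.75, G_1 = 1.5, G_2 = 1.0.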
In the proposed DRL method, we use a deep neural network π θ (a t |o t ) parameterized by θ to approximate the UAVs' control policy. The objective is to use a DRL method to find the optimal parameters θ * , which can make the UAV team balance between giving the known targets a fair observation and exploring the search region. The system architecture is shown in Figure 2.
As shown in Figure 2, each UAV first obtains observations from the environment through its onboard sensor to update its local observation maps. Then, each UAV receives the local maps of the other UAVs through its communication device, and the local maps are merged to provide the deep network π_θ with an observation o_t. Finally, the deep neural network π_θ outputs the action a_t to control the UAV, and the UAV receives the reward r_{t+1} at the next time step. The maps merging method is introduced in the next subsection, and the ingredients of deep reinforcement learning are introduced in Section 3.3.

Maps Merging
During the mission, each UAV maintains four observation maps:
(1) The observation map of the UAV's position in the search region, denoted by a C_L × C_W matrix.
(2) The observation history map of the cells, which records the latest observed time for each cell. This map is denoted by a C_L × C_W matrix MC_t(u_i) and is obtained in two steps. At each time step t, MC_t(u_i) is first updated by the observation of UAV u_i on the subset of cells within FOV_t(u_i), that is,

mc_kl_t(u_i) = t, if c_kl ∈ FOV_t(u_i).

In addition, the observation history maps from other UAVs within communication range also update the local map. The values of the corresponding cells in the observation history map are updated with the latest observation time as follows:

mc_kl_t(u_i) = max(mc_kl_t(u_i), mc_kl_t(u_j)), if d_t(u_i, u_j) ≤ d_c,

where d_t(u_i, u_j) represents the distance between UAV u_i and UAV u_j.
(3) The position history map of the other UAVs, which records the historical positions of the other UAVs. This map is denoted by a C_L × C_W matrix MU_t(u_i). At each time step, the map MU_t(u_i) is updated by the observation history maps from other UAVs within communication range, where t_U is a time constant representing the decay period of the value of mu_kl_t(u_i).
(4) The position history map of the targets, which records the historical positions of the targets. This map is denoted by a C_L × C_W matrix MT_t(u_i). The map MT_t(u_i) is likewise updated by the observation history maps from other UAVs within communication range.
The map MC_t(u_0) is normalized as follows: mc_kl_t(u_0) = mc_kl_t(u_0)/max(MC_t(u_0)), where max(MC_t(u_0)) represents the maximum value of the elements in matrix MC_t(u_0). In the example, the map MU_t(u_0) records one UAV's historical positions, and the map MT_t(u_0) records three targets' historical positions.
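The merge step for the observation history map can be sketched as a cell-wise maximum of the timestamp maps of two UAVs within communication range. This is an illustrative sketch under the "latest observation time wins" rule described above; function and variable names are ours:

```python
def merge_cell_maps(mc_i, mc_j, dist, d_c):
    # mc_i, mc_j: C_L x C_W timestamp maps of UAVs i and j (lists of lists).
    # Only merge when UAV j is within communication range d_c of UAV i.
    if dist > d_c:
        return mc_i
    # Keep the latest observed time per cell.
    return [[max(a, b) for a, b in zip(row_i, row_j)]
            for row_i, row_j in zip(mc_i, mc_j)]
```

The same max-merge pattern would apply to the UAV and target position history maps, with decay applied afterwards.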

Deep Reinforcement Learning
In this section, we introduce the key elements of the proposed DRL method, consisting of observation space, action space, network architecture, reward function, and training algorithm.

Observation Space
At time step t, the observation of UAV u_i consists of four parts:
• The observation o_1_t(u_i) represents the positional relationship of UAV u_i relative to the boundary of the search region S.
• The observation o_2_t(u_i) represents the observation state of the cells around UAV u_i, taken from the map MC_t(u_i).
• The observation o_3_t(u_i) is the part of the map MU_t(u_i) centered on the UAV's current cell c_t(u_i), with length C_input and width C_input. Like o_2_t(u_i), o_3_t(u_i) is a C_input × C_input matrix representing the historical position information of other UAVs around UAV u_i: each element takes the value mu_mn_t(u_i) of the corresponding cell c_mn if c_mn ∈ S, and 0 otherwise.
• The observation o_4_t(u_i) is the part of the map MT_t(u_i) centered on the UAV's current cell c_t(u_i), with length C_input and width C_input. Similarly, o_4_t(u_i) is a C_input × C_input matrix representing the historical position information of targets around UAV u_i: each element takes the value mt_mn_t(u_i) of the corresponding cell c_mn if c_mn ∈ S, and 0 otherwise.
One example of the observations for UAV u_0 is shown in Figure 4, which is consistent with the scenario shown in Figure 3. The value of C_input is set to 21 cells. From the observations o_1_t(u_0) and o_2_t(u_0), we can see that UAV u_0 is very close to the right boundary of the search region.
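The cropping of a C_input × C_input window centered on the UAV's cell, with cells outside S read as 0, can be sketched as follows (illustrative names; not the paper's code):

```python
def crop_window(grid, center, size):
    # Returns a size x size window of `grid` centered at `center`.
    # Cells falling outside the search region are filled with 0, matching
    # the "0, else" branch of the observation definitions.
    half = size // 2
    ci, cj = center
    H, W = len(grid), len(grid[0])
    return [[grid[i][j] if 0 <= i < H and 0 <= j < W else 0
             for j in range(cj - half, cj + half + 1)]
            for i in range(ci - half, ci + half + 1)]
```

For example, cropping a 3 × 3 window at the top-left corner of a map pads the out-of-region rows and columns with zeros.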

Action Space
The UAV's action space is a set of target cells around the UAV, that is, each UAV can move into one of its eight neighbor cells or stay at its current cell. Thus, the action space has a total of nine command actions. The actual command action is selected according to the selection probabilities calculated by the deep neural network.
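The nine-action space maps naturally to the eight neighbor offsets plus "stay". The sketch below is illustrative; in particular, clamping the UAV inside the region is our simplification, whereas the paper instead discourages leaving S through the penalty r^4 in the reward function:

```python
# Nine command actions: eight neighbor cells plus staying put (offset (0, 0)).
MOVES = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]

def apply_action(cell, action, C_L, C_W):
    di, dj = MOVES[action]
    # Clamp to the search region (illustrative; the paper penalizes leaving S).
    return (min(max(cell[0] + di, 0), C_L - 1),
            min(max(cell[1] + dj, 0), C_W - 1))
```

At execution time, the action index would be sampled from the nine selection probabilities output by the policy network.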

Network Architecture
In this study, a deep neural network is used to process the observation o_t, and its outputs are the selection probabilities of the actions, denoted by P(a_t | o_t). The network architecture is shown in Figure 5. As shown in Figure 5, we use four hidden layers to process the observation o_t. The first hidden layer is a CNN layer with 4 two-dimensional filters of kernel size (2, 2) and stride 1, with a ReLU activation function [29]. The second and third hidden layers are fully connected layers with 200 rectifier units each. The last layer contains nine units with a Softmax activation function, limiting the outputs to (0, 1); these outputs are the selection probabilities of each action.
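A PyTorch sketch of this architecture follows. It assumes the four C_input × C_input observation maps are stacked as input channels (the paper does not state the stacking explicitly), with C_input = 21 as in the experiments; class and variable names are ours:

```python
import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    # Sketch of the Figure 5 architecture: Conv(4 filters, 2x2, stride 1) + ReLU,
    # two fully connected layers of 200 rectifier units, and a 9-way Softmax head.
    def __init__(self, c_input=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 4, kernel_size=2, stride=1), nn.ReLU(),
            nn.Flatten(),
            # A 2x2 kernel with stride 1 shrinks each spatial dim by 1.
            nn.Linear(4 * (c_input - 1) ** 2, 200), nn.ReLU(),
            nn.Linear(200, 200), nn.ReLU(),
            nn.Linear(200, 9), nn.Softmax(dim=-1),
        )

    def forward(self, obs):          # obs: (batch, 4, c_input, c_input)
        return self.net(obs)         # (batch, 9) action selection probabilities
```

Each output row sums to 1 and can be sampled directly to pick one of the nine command actions.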

Reward Function
The design of the reward function is closely tied to our objective: to enable the UAV team to balance between giving the known targets fair observation and exploring the search region. Thus, the reward function is designed as

r_t(u_i) = r^1_t(u_i) + r^2_t(u_i) + r^3_t(u_i) + r^4_t(u_i),

where r_t(u_i) is the reward received by UAV u_i at time step t, the sum of four different rewards. The reward r^1_t(u_i) encourages UAV u_i to track targets that have been discovered and consists of the following three terms:

r^1_t(u_i) = λ_1 l_r^1_t(u_i) + λ_2 g_r^1_t(u_i) + λ_3 h_r^1_t(u_i),

where l_r^1_t(u_i) represents the local reward for tracking the discovered targets, g_r^1_t(u_i) represents the global reward for tracking the discovered targets, h_r^1_t(u_i) represents the reward for recording the historical positions of the targets, and λ_1, λ_2, and λ_3 are positive coefficients. The rewards l_r^1_t(u_i), g_r^1_t(u_i), and h_r^1_t(u_i) are defined in terms of d_t(u_i, ν_j), the distance between UAV u_i and target ν_j at time step t; η̄_t, the average observation rate of the targets at time step t; and sum(MT_t(u_i)), the sum of the values of the elements in matrix MT_t(u_i), where min(x, y) returns the minimum of x and y. The reward r^2_t(u_i) encourages UAV u_i to explore the search region and consists of the following two terms:

r^2_t(u_i) = λ_4 l_r^2_t(u_i) + λ_5 g_r^2_t(u_i),

where l_r^2_t(u_i) is the local reward for exploring the search region, g_r^2_t(u_i) is the global reward for exploring the search region, and λ_4 and λ_5 are positive coefficients. The rewards l_r^2_t(u_i) and g_r^2_t(u_i) are defined in terms of β_t(u_i), the local exploration rate of the search region known by UAV u_i at time step t, and β_t, the global exploration rate of the search region at time step t.
β_t(u_i) and β_t are calculated analogously to the exploration rate β defined in Section 2, over the cells known to UAV u_i and to the whole team, respectively. The reward r^3_t(u_i) penalizes UAV u_i for approaching other UAVs too closely, and the reward r^4_t(u_i) penalizes UAV u_i for leaving the search region. The reward function designed above provides UAVs with dense rewards during training, which reduces the difficulty of learning. In addition, we set λ_1 = 0.6, λ_2 = 0.2, λ_3 = 0.2, λ_4 = 0.7, and λ_5 = 0.3 in the training process.
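The composition of the reward described above can be sketched as follows, assuming each composite reward is a weighted sum of its terms with the coefficients stated in the text. The inner terms (distance-based tracking rewards, exploration rates, penalties) are placeholders supplied by the caller, since their exact formulas are not reproduced here:

```python
# Coefficients from the training setup in the text.
L1, L2, L3 = 0.6, 0.2, 0.2   # tracking-reward weights (lambda_1..3)
L4, L5 = 0.7, 0.3            # exploration-reward weights (lambda_4..5)

def total_reward(l_r1, g_r1, h_r1, l_r2, g_r2, r3, r4):
    r1 = L1 * l_r1 + L2 * g_r1 + L3 * h_r1   # tracking discovered targets
    r2 = L4 * l_r2 + L5 * g_r2               # exploring the search region
    # r3 (too close to other UAVs) and r4 (leaving S) are penalties, i.e. <= 0.
    return r1 + r2 + r3 + r4
```

With all positive terms at 1 and no penalties, the reward is 1.0 + 1.0 = 2.0, showing how the coefficients within r^1 and r^2 each sum to one.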

Training Algorithm
In this study, we used a policy-based DRL algorithm, proximal policy optimization (PPO) [30], to train the deep neural network. PPO optimizes control policies with guaranteed monotonic improvement and high sampling efficiency, and it has been widely used in robot control [31,32].
The algorithm flow is shown in Algorithm 1. In the training process, a centralized training, decentralized execution paradigm is used. Specifically, at each time step, each UAV independently obtains its observation and selects an action through the shared policy, and the policy is trained with the experience collected by all UAVs. The collected experience is used to construct the loss function L^CLIP(θ) for the policy network π_θ and the loss function L^V(φ) for the value network V_φ. The value network has the same structure as the policy network, except that its last layer has a single linear unit. In each episode, the policy network π_θ is optimized E_π times and the value network V_φ is optimized E_V times on the same minibatch data sampled from the collected experience with the Adam optimizer [33].
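The per-sample form of the standard PPO clipped objective L^CLIP can be sketched as follows. This is the textbook formulation from [30], not the paper's exact implementation:

```python
def ppo_clip_loss(ratio, advantage, eps=0.2):
    # ratio = pi_theta(a|o) / pi_theta_old(a|o) for one (o, a) sample.
    # L^CLIP = -min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A),
    # negated so that minimizing the loss maximizes the clipped objective.
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return -min(ratio * advantage, clipped * advantage)
```

Averaging this loss over a minibatch and taking Adam steps E_π times per episode matches the update scheme described above; the value network would be fit to empirical returns with a squared-error loss L^V.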

Results
In this section, simulation experiments are performed to evaluate the effectiveness of our proposed policy. We first describe the simulation setup and introduce the training process. Then, we compare our policy with other methods in various scenarios to validate its performance. Finally, we discuss the results.

Simulation Setup and Training Results
We conduct simulation experiments in a Python environment. The deep neural networks are implemented with PyTorch [34]. In the training process, we consider a search region of size 50 × 50 cells, i.e., C_L = C_W = 50 cells. The numbers of UAVs and targets in the search region are set to M = 5 and N = 10, respectively. The sensing range of each UAV is set to d_s = 5 cells and the communication range of each UAV is set to d_c = 10 cells. In addition, the maximum UAV speed is set to 1 cell per time step, and the maximum target speed is set to 0.5 cells per time step. The total mission duration is 200 time steps, i.e., T = 200. We set t_U = 5 and t_T = 8 for the decay periods of the position history maps of the UAVs and of the targets, respectively. The parameters in Algorithm 1 are listed in Table 1. In addition, the observation input size is set to C_input = 21 cells.

The training process took 3000 episodes. At the beginning of each training episode, the positions of the UAVs and the targets are randomly reset. The speed of each target is randomly generated in [0, 0.5] cells per time step and remains unchanged during the episode. We recorded the average and variance of the episodes' cumulative rewards every 100 episodes. The cumulative reward of a training episode is the average of the cumulative rewards received by all UAVs. The training results are shown in Figure 6. As training progresses, each UAV receives progressively larger rewards, indicating that the control policy gradually converges, allowing each UAV to track discovered targets and explore the unknown environment. It is worth noting that in the early stages of training, the UAVs receive negative rewards due to leaving the search region.

Comparison with Other Methods
In this subsection, we compare our policy with other methods: A-CMOMMT [6], P-CMOMMT [8], PAMTS [19], and a Random policy. A-CMOMMT is a traditional approach to the CMOMMT problem that uses weighted local force vectors to control the UAVs. P-CMOMMT additionally considers the uniformity of target observation compared with A-CMOMMT. PAMTS is a recent distributed algorithm that considers tracking the targets and exploring the environment in a unified framework. The Random policy serves as a baseline for the CMOMMT problem.
In each set of comparative simulation experiments, we ran 50 random test experiments for each method and calculated the average of the following three metrics:
• the average observation rate η̄ of the targets,
• the standard deviation σ_η of the observation rates of the N targets, and
• the exploration rate β of the search region.
We first compared our policy against the other methods with different numbers of UAVs while the number of targets was fixed at 10. As shown in Figure 7a, the average observation rate of the targets continued to increase with the number of UAVs across all methods. Our policy performed best when the number of UAVs was 2, 10, or 15, and was second best when the number of UAVs was 5 or 20. In addition, Figure 7b shows that our policy had the minimum standard deviation of the observation rates compared with A-CMOMMT and PAMTS in most cases, which shows that our policy can give the targets relatively fair observation. It is worth noting that the standard deviation of the observation rates gradually increased with the number of UAVs under P-CMOMMT and the Random policy. This is because the number of targets being observed grows as the number of UAVs increases, so the standard deviation of the observation rates grows with it. Figure 7c shows the exploration rate of the search region for various numbers of UAVs. Our policy had a high exploration rate in most cases relative to the other methods, except for the Random policy. Overall, our policy can give targets high and fair observation while maintaining a high exploration rate of the search region.
The impact of the total mission time on the three metrics was also studied. Figure 8a shows that the observation rates with A-CMOMMT, PAMTS, and our policy continued to improve as the total mission time increased. This is because the increased mission time allows the UAVs to search the environment thoroughly enough to find the targets. In addition, the observation rates of A-CMOMMT, PAMTS, and our policy gradually converged as the total mission time increased, which means all three methods can find the targets in the search region given enough mission time. P-CMOMMT had a low observation rate because it tries to observe the targets uniformly, which can also be seen in Figure 8b, where P-CMOMMT had a relatively low standard deviation of the observation rates. As shown in Figure 8b, our policy had a medium standard deviation of the observation rates, and as shown in Figure 8c, a medium exploration rate compared with the other methods. The results show that our policy can increase the observation rate of the targets as the mission time increases, while reducing the standard deviation of the observation rates and increasing the exploration rate of the search region.
In addition, the impact of the size of the search region on the three metrics was studied. Figure 9a,c shows that the observation rate of the targets and the exploration rate of the search region decreased as the size of the search region increased. Targets are naturally more scattered in a larger search region, which makes it difficult for UAVs to find targets and explore the entire search region. As shown in Figure 9b, the increase in the standard deviation of the observation rates from C_L = C_W = 25 to C_L = C_W = 50 was due to the number of discovered targets decreasing as the size of the search region increased. However, the decrease in the standard deviation of the observation rates from C_L = C_W = 50 to C_L = C_W = 125 was due to the difficulty for UAVs of finding the targets in a large search region.
Finally, we studied the impact of the communication range on the three metrics. As shown in Figure 10, for our policy, the observation rate and the exploration rate continued to improve, and the standard deviation of the observation rates continued to decrease, as the communication range increased, until the communication range exceeded 10 cells, beyond which all three metrics remained essentially unchanged. The impact of the communication range under PAMTS was consistent with our policy, except when there was no communication among UAVs, i.e., d_c = 0 cells. The results show that information from nearby UAVs brings significant improvements, while information from remote UAVs has little impact on this mission. In addition, because A-CMOMMT and P-CMOMMT only consider the impact of UAVs within sensing range, variation in the communication range has no effect on their three metrics. Consistent with the above results, our policy has a high observation rate just below that of PAMTS, along with a lower standard deviation of the observation rates and a higher exploration rate of the search region compared with A-CMOMMT and PAMTS.

Discussion
The above comparison results show that our policy can find a balance between giving the known targets fair observation and exploring the search region. Though our policy has a lower observation rate than PAMTS in most cases, it gives the targets fair observation with a low standard deviation of the observation rates and maintains a high exploration rate of the search region, which enables UAVs to find more targets when the total number of targets is unknown. It is worth noting that PAMTS assumes the total number of targets is known, whereas we make no such assumption.

Conclusions
In this paper, a DRL-based approach is proposed to solve the CMUOMMT problem. Unlike traditional CMOMMT approaches, we considered the average observation rate of the targets, the standard deviation of the observation rates, and the exploration rate of the search region simultaneously, under the assumption that the total number of targets is unknown. To achieve this objective, we used four observation maps to record the historical positions of targets and other UAVs, the exploration status of the search region, and the UAV's position relative to the search region. In addition, each UAV's maps were merged from the maps of the different UAVs within its communication range. The merged maps were then cropped and processed with a deep neural network to obtain the selection probabilities of the actions. The reward function was carefully designed to provide UAVs with dense rewards during training. The results of extensive comparative simulation experiments show that our policy can give the targets fair observation while maintaining a high exploration rate of the search region. Future work will study the CMUOMMT problem in search regions with obstacles and targets with evasive movements, a more challenging problem that requires smarter collaboration between UAVs.

Data Availability Statement:
No new data were created or analyzed in this study. Data sharing is not applicable to this article.

Conflicts of Interest:
The authors declare no conflicts of interest.

Abbreviations
The following abbreviations are used in this manuscript: