Handling Complex Missing Data Using Random Forest Approach for an Air Quality Monitoring Dataset: A Case Study of Kuwait Environmental Data (2012 to 2018)

In environmental research, missing data often pose a challenge for statistical modeling. This paper applies advanced techniques for handling missing values in an air quality data set using a multiple imputation (MI) approach. Three missing data mechanisms (MCAR, MAR, and MNAR) were simulated on the data set at five missingness levels: 5%, 10%, 20%, 30%, and 40%. The principal imputation method examined is missForest, an iterative imputation method based on random forests. Air quality data were gathered from five monitoring stations in Kuwait and aggregated to a daily basis. All pollutant data were log-transformed, in order to normalize their distributions and minimize skewness. We found high levels of missing values for NO2 (18.4%), CO (18.5%), PM10 (57.4%), SO2 (19.0%), and O3 (18.2%). Climatological data (i.e., air temperature, relative humidity, wind direction, and wind speed) were used as control variables for better estimation. The results show that the MAR mechanism yielded the lowest RMSE and MAE. We conclude that MI using the missForest approach estimates missing values with a high level of accuracy: missForest had the lowest imputation error (RMSE and MAE) among the imputation methods compared and, thus, can be considered appropriate for analyzing air quality data.


Introduction
Air quality monitoring is conducted with the aim of protecting public health. Numerous air contaminants have been found to have harmful effects on human health. Air quality in cities varies with the concentrations of particulate matter smaller than 10 micrometers (PM 10 ), nitrogen dioxide (NO 2 ), ozone (O 3 ), carbon monoxide (CO), and sulfur dioxide (SO 2 ), emitted from sources including vehicle exhaust, manufacturing operations, and chemical facilities, among others.
A major challenge in air quality data management is determining how to deal with missing data values. Missing information in data sets occurs for multiple reasons, such as impaired equipment, insufficient sampling frequency, hardware problems, and human error [1]. Incomplete data sets affect the applicability of specific analyses, such as receptor modeling, which generally requires a complete data matrix [2]. The occurrence of missing data, no matter how infrequent, can bias findings on the relationships between air contaminants and health outcomes [3]. Incomplete data matrices may provide outcomes that vary significantly, compared to the results from complete data sets [4].
To gain a more complete data set, researchers must decide whether to discard or impute (i.e., substitute for) missing data. Ignoring missing values is typically not warranted, as valuable information is lost, which may compromise inferential power [5]. Therefore, the most appropriate option is to impute the missing data. Yet, the systematic differences between real and substituted data can also lead to unwanted bias. Therefore, it is vital to determine an optimal approach for estimating missing values. Several problems have been linked with missing data [6]. These challenges include statistical power reduction, bias as a result of inconsistent data, difficulties in managing the data during statistical analyses, and low efficiency. The criteria implemented for measures to deal with missing data in time-series analysis rely on the missing data replacement mechanism and missing data pattern [7]. Such challenges are especially problematic when the missing data exceed 60 percent, where existing methods have significant difficulty in addressing such situations [8].
This study focuses on a case of missing data in air quality monitoring. The Kuwait Environmental Public Authority (KEPA) is mandated with the responsibility for measuring air quality. The data set collected from its five fixed monitoring stations contains missing data, likely for multiple reasons: first, a large number of routine maintenance changes at the monitoring sites; second, simple human error; and third, tagging problems that necessitated the exclusion of some data.
The main purpose of this paper was to find the best imputation method to estimate the missing values for the measured pollutants (SO 2 , NO 2 , CO, O 3 , and PM 10 ) in the KEPA data sets. The imputation methods used in this paper are: multivariate imputation by chained equations using random forest (RF), k-nearest neighbor (kNN), Bayesian principal component analysis (BPCA), multiple imputation using expectation maximization with bootstrapping (EM with Bootstrapping), predictive mean matching (PMM), and the proposed iterative imputation method (missForest) based on a random forest. Two tests, root mean square error (RMSE) and mean absolute error (MAE), are used to compare the performances of the imputation methods. For the error indicators (RMSE or MAE), the larger the value, the greater the error. The end product is an outline of the best approaches for managing missing data in a data set that is critical for public health in Kuwait.
It is important to describe the mechanisms that may lead to missing data in statistical analyses. The first is missing completely at random (MCAR), in which the probability of a value being missing is unrelated to both the observed and the unobserved data; for example, the observer may simply fail to record a value. The second is missing at random (MAR), in which the probability of a value being missing depends only on the observed data; MAR is assumed when the missing data can be partially recovered from information on other variables in the same data set. The third is missing not at random (MNAR), in which the missingness depends on the unobserved values themselves. Among the three types, MAR and MNAR are the most common in practice [9]. When the missing data tend towards MAR, multiple imputation techniques are more suitable than alternatives such as listwise deletion [10].

Missing Completely at Random (MCAR)
Under MCAR, the probability that a value is missing is the same for all cases; that is, the cause of the missingness is unrelated to the data collected. Consider, for instance, a random sample of a population, in which each individual has an equal chance of being selected: the values of the members not selected are missing completely at random from the analysis. Formally, suppose that Y is an n × p matrix containing p variables measured on n cases. Let the observed values be denoted Y_obs and the missing values Y_mis. The matrix R indicates the locations of the missing values in Y; its entries are r_ij = 1 when y_ij is observed and r_ij = 0 when y_ij is missing. In general, the distribution of R may depend on Y = (Y_obs, Y_mis). The data are said to be MCAR if

Pr(R | Y_obs, Y_mis, ψ) = Pr(R | ψ),

where ψ denotes the parameters of the missing data model. This means that the probability of a value being missing depends only on the parameters ψ, not on the data themselves.

Missing at Random (MAR)
Under MAR, the probability that a value is missing may depend on the observed data, but not on the missing values themselves. MAR is, therefore, a broader class than MCAR; for instance, when the chance of a reading being lost depends on an observed covariate, the resulting missing data are MAR. Statistical software for multiple imputation usually assumes that the data are MAR [11]. Formally, the probability of missingness depends only on the data under observation:

Pr(R | Y_obs, Y_mis, ψ) = Pr(R | Y_obs, ψ).

The KEPA data are best classified as MAR.

Missing Not at Random (MNAR)
Under MNAR, the probability that a value is missing depends on the missing values themselves, for reasons typically unknown to the researcher; for instance, survey respondents may withhold information precisely because of the value they would have reported. Due to this dependence, MNAR is regarded as the most complex case in statistical analysis: the missingness must be modeled through Y_mis itself,

Pr(R | Y_obs, Y_mis, ψ) ≠ Pr(R | Y_obs, ψ),

that is, the distribution of R cannot be reduced to a function of the observed data alone. The data set extracted from KEPA has extensive missing values, which could have been due to routine maintenance, changes in the siting of monitors, human error, or tagging problems.
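The three mechanisms above can be made concrete by simulating them on a toy pollutant series. The sketch below is in Python rather than the R packages used in the paper, and the variable names (`temp`, `no2`) and the specific deletion rules are illustrative assumptions; the MNAR rule mirrors the paper's later setup of removing the largest values of the variable itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
temp = rng.normal(30, 5, n)             # observed covariate (temperature)
no2 = 0.5 * temp + rng.normal(0, 2, n)  # pollutant correlated with it

def apply_mcar(x, rate, rng):
    """MCAR: every entry has the same probability of being missing."""
    y = x.copy()
    y[rng.random(x.size) < rate] = np.nan
    return y

def apply_mar(x, covariate, rate):
    """MAR: missingness depends only on an observed covariate
    (here, readings on the hottest days are lost)."""
    y = x.copy()
    y[covariate > np.quantile(covariate, 1 - rate)] = np.nan
    return y

def apply_mnar(x, rate):
    """MNAR: the largest values of x itself are removed."""
    y = x.copy()
    y[x > np.quantile(x, 1 - rate)] = np.nan
    return y
```

Note that an analyst seeing only the MAR-deleted series together with `temp` can model the missingness, whereas under MNAR the deletion rule involves values that were never observed.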

Ignoring the Missing Data Mechanism
One of the major issues that arise when performing imputations is whether the missing data come from the same distribution as the observed data (Y obs ). As mentioned above, the observed data are made up of Y obs and R with the joint density function f (Y obs , R|θ, ψ), which depends on the model estimated parameters θ for Y.
We can estimate θ without knowing ψ by defining the probability density function of the joint distribution of Y_obs and Y_mis as f(Y | θ) ≡ f(Y_obs, Y_mis | θ). The marginal probability density of Y_obs is then obtained by integrating out the missing data:

f(Y_obs | θ) = ∫ f(Y_obs, Y_mis | θ) dY_mis,

and the likelihood function of θ given Y_obs, ignoring the missing data mechanism, is defined as

L(θ | Y_obs) ∝ f(Y_obs | θ).

Maximum likelihood (ML) estimates of θ can be obtained by maximizing this likelihood over θ.
To build a more general model, we include R and specify the joint density of Y and R as

f(Y, R | θ, ψ) = f(Y | θ) Pr(R | Y, ψ).

We can find the distribution of the observed data by integrating Y_mis out of the joint density, using θ and ψ:

f(Y_obs, R | θ, ψ) = ∫ f(Y_obs, Y_mis | θ) Pr(R | Y_obs, Y_mis, ψ) dY_mis.    (7)

Under MAR, Pr(R | Y_obs, Y_mis, ψ) = Pr(R | Y_obs, ψ) does not depend on Y_mis, so we can rewrite Equation (7) as

f(Y_obs, R | θ, ψ) = Pr(R | Y_obs, ψ) f(Y_obs | θ).

The missing data mechanism is ignorable for likelihood inference if:
1. MAR: the missing data pattern is missing at random; and
2. Distinctness: the joint parameter space of (θ, ψ) is equal to the product of the parameter spaces of θ and ψ [12].

Multiple Imputation (MI)
Studies have shown that MI may become biased when the missing rate for a variable exceeds 50% [13][14][15]. Researchers have debated the role of listwise deletion when handling such missing data; most studies have concluded that, although the listwise deletion technique is not commonly recommended, it is applicable in some instances [16,17]. According to Marshall et al. [15], multiple imputation is favorable for computing missing data and especially applicable when the missing data rate is above 10% [18]. Consider, for instance, a regression model including several variables, each with a low rate of missing data: the rate of missing data in the full regression model may be considerably higher than in the corresponding simple bivariate regressions. Therefore, it is critical for analysts to evaluate the total missing rate, as well as the partial one.
One limitation of applying a single imputation approach is that formulas of standard variance applied to filled-in data tend to underestimate the variance of the estimates; therefore, multiple imputation methods have been proposed [11]. The first step in such a method is specifying the single encompassing multivariate approach for all data sets. There are four types of multivariate models of data completion to consider [12]: (i) standard models, which impute under multivariate normal distributions; (ii) log-linear models, that have been used traditionally by social scientists in describing the associations among cross-classified data variables; (iii) general location models, which combine the log-linear approach for the variables that are definite with the multivariate model of standard regression for the continuous variables; and (iv) a two-level model of linear regression, which is mostly applied to multi-level data. The imputation model should be able to match the subsequent analysis and should be able to preserve the interactions of variables, which relates to the central point of the investigation discussed later in this paper.
A multiple imputation method balances ease of application and the quality of the obtained results. The multiple imputed values incorporate random error appropriate to the imputation process, making it possible to obtain approximately unbiased estimates of all parameters; no deterministic imputation method can achieve this. The technique also allows for departures from normality assumptions, while providing adequate results with low sample sizes or when significant amounts of data are missing.
Some requirements are necessary, in order to attain the desired results of multiple imputation [19]. First, the data should be missing at random (MAR), meaning that the missingness depends on observed variables and not on the missing values themselves. Second, the method of generating the imputed values should suit the analysis that follows; this maintains the associations between variables, which is a focus of the analysis shown later in this paper. Third, the model for imputation should coincide and agree with that of the investigation. Rubin has given a thorough description of these conditions. A remaining question, however, relates to adopting the most suitable practices for performing the imputations [20]. It is essential to be aware of possible prediction problems, in order to reduce or minimize systematic error.
There have been many applications of multiple imputation to health, environmental [21,22], and industrial [23,24] databases, as well as to survey data [25,26] and in data mining, which extracts patterns from large data sets through a combination of artificial intelligence and statistical methods and can be used for database management [23].

Multiple Imputation Using Random Forest Method
Let us assume that X = (X_1, X_2, . . . , X_p) is an n × p data matrix. We propose the use of the random forest (RF) technique for imputing missing observations. The random forest algorithm has a built-in routine to handle missing values, which weighs the frequency of observed values by their proximities after training the forest on an initially mean-imputed data set [27]; however, this approach requires a complete response variable for training the forest. Instead, we estimate all missing values directly, using a random forest trained on the observed parts of the data set. For a variable X_s containing missing values, the data can be partitioned into four parts: the observed values of X_s, denoted y_obs^(s); the missing values of X_s, denoted y_mis^(s); the remaining variables on the rows where X_s is observed, denoted x_obs^(s); and the remaining variables on the rows where X_s is missing, denoted x_mis^(s). According to [28], the process starts with an initial guess for the missing values in X, using mean imputation or any other imputation method, depending on the data. The variables X_s, s = 1, . . . , p, are then sorted in ascending order of their number of missing values. For each variable X_s, a random forest is fitted with response y_obs^(s) and predictors x_obs^(s), and the missing values are imputed by predicting y_mis^(s) from x_mis^(s). This procedure is repeated until a stopping criterion is reached. Algorithm 1 shows pseudocode for the missForest method.
The stopping criterion (γ) is met as soon as the difference between the newly imputed data matrix and the previous one increases for the first time, with respect to both variable types. The difference for the set of continuous variables N is defined as

Δ_N = Σ_{j∈N} (X_new^imp − X_old^imp)² / Σ_{j∈N} (X_new^imp)²,

and that for the set of categorical variables F as

Δ_F = Σ_{j∈F} Σ_{i=1}^{n} I(X_new^imp ≠ X_old^imp) / #NA,

where #NA is the number of missing values in the categorical variables.

Algorithm 1 (missForest):
1. Require: the n × p matrix X, the stopping criterion γ, and an initial guess for the missing values;
2. k ← vector of column indices of X, sorted in increasing order of their amount of missing values;
3. While γ is not met, do:
4.   X_old^imp ← the previously imputed matrix;
5.   For each s in k: fit a random forest with response y_obs^(s) and predictors x_obs^(s); predict y_mis^(s) using x_mis^(s); update the imputed matrix X_new^imp;
6. Return the final imputed matrix X^imp.

After imputing the missing values, performance is assessed for the continuous variables using the normalized root mean squared error [29], defined by

NRMSE = sqrt( mean( (X_true − X_imp)² ) / var(X_true) ),

where X_true and X_imp are the complete data matrix and the imputed data matrix, respectively, and the mean and variance are the empirical mean and variance computed over the missing values only. In this study, all predictors are classified as continuous observations. When an RF is fitted to the observed part of a variable, the out-of-bag (OOB) error estimate for that variable is recorded; when the stopping criterion (γ) is met, these estimates are averaged over the variables of each type, in order to obtain an approximation of the true imputation error. We assess the quality of this estimate by comparing the absolute difference between the OOB imputation error estimate and the true imputation error over all simulation runs.
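The iterative scheme above can be sketched compactly. The following is a simplified Python implementation for a continuous-only matrix, using scikit-learn's `RandomForestRegressor` in place of the R missForest package; the function name `miss_forest` and the hyperparameters are illustrative assumptions, not the paper's actual code.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def miss_forest(X, max_iter=10, seed=0):
    """Simplified missForest loop for a continuous-only matrix X (NaN = missing):
    mean-impute, then refit a random forest per column, stopping when the
    difference criterion first increases."""
    X = np.asarray(X, dtype=float)
    mask = np.isnan(X)
    X_imp = X.copy()
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):                  # initial guess: column means
        X_imp[mask[:, j], j] = col_means[j]
    order = np.argsort(mask.sum(axis=0))         # fewest missing values first
    prev_delta = np.inf
    X_old = X_imp.copy()
    for _ in range(max_iter):
        for j in order:
            if not mask[:, j].any():
                continue
            obs = ~mask[:, j]
            other = np.delete(np.arange(X.shape[1]), j)
            rf = RandomForestRegressor(n_estimators=100, random_state=seed)
            rf.fit(X_imp[obs][:, other], X_imp[obs, j])   # y_obs ~ x_obs
            X_imp[mask[:, j], j] = rf.predict(X_imp[mask[:, j]][:, other])
        # stopping criterion: relative change between successive imputed matrices
        delta = ((X_imp - X_old) ** 2).sum() / (X_imp ** 2).sum()
        if delta > prev_delta:                   # first increase -> stop
            return X_old                         # the previous matrix was better
        prev_delta, X_old = delta, X_imp.copy()
    return X_imp
```

Observed entries are identical in successive matrices, so the difference criterion is driven entirely by the imputed cells, as in Δ_N above.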

Process of Multiple Imputations (MI) Using Rubin's Rules
For our data sets, we followed Rubin's rules [11] for handling missing data. The process of multiple imputation (MI) was conducted separately for each monitoring station (see Figure 1). The first step in multiple imputation is to create imputed values ("imputes", m_i), with 10 iterations for each m_i, to be substituted for the missing data. In order to create imputed values, we need to specify a model (say, a linear regression) that allows us to create imputes based on other variables in the data set (predictor variables). As this must be done multiple times, in order to produce multiply-imputed data sets, we identify a set of regression lines which are similar to each other. Figure 1 shows the process used for the KEPA data sets to estimate missing values using imputation methods. There were five data sets (1-5), relating to the FAH, JAH, MAN, RUM, and ASA stations, respectively. Each data set should contain 2192 daily observations for each variable; however, due to missing values, all contained fewer than 2192.

[Figure 1. Flow of the multiple imputation process: the air pollution data set containing missing values is imputed into five sets, one per station, each with m = 20 and 10 iterations.]

The power of MI lies in the fact that multiple imputations can be performed for each variable in the data set. While any single imputation is imprecise, the combination of the imputations takes the uncertainty of each imputation into consideration. According to [17,18], under MAR or MCAR, the pooled estimated parameters are less biased and the associated standard errors are appropriately corrected.
The implementation of an MI technique requires three steps: First, several values are imputed for each missing observation, creating m ≥ 2 completed data sets. Second, each of the m completed data sets is analyzed using standard complete-data methods. Finally, the m analyses are pooled, in order to generate overall estimates and standard errors; this is done by combining the results over the m repeated analyses. Pooling over several imputations allows multiple imputation to achieve higher accuracy [30]. Figure 1 shows how we treated the KEPA data sets with multiple imputation, where m = 20.
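The pooling step follows Rubin's rules: the pooled point estimate is the mean of the m estimates, and the total variance combines the within- and between-imputation variances. A minimal Python sketch (the function name `pool_rubin` is an illustrative choice):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m completed-data analyses with Rubin's rules.
    estimates: the point estimate from each of the m imputed data sets.
    variances: its squared standard error from each data set."""
    q = np.asarray(estimates, dtype=float)
    u = np.asarray(variances, dtype=float)
    m = len(q)
    q_bar = q.mean()                    # pooled point estimate
    u_bar = u.mean()                    # within-imputation variance
    b = q.var(ddof=1)                   # between-imputation variance
    t = u_bar + (1 + 1 / m) * b         # total variance
    return q_bar, np.sqrt(t)            # pooled estimate and standard error
```

The (1 + 1/m) factor inflates the between-imputation component to account for using a finite number of imputations, which is why the pooled standard error exceeds the naive single-imputation one.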

Data Sets
We utilized a real-time air quality monitoring data set collected at five locations in Kuwait from the Kuwait Environmental Public Authority (KEPA), in order to evaluate and assess the performance of various imputation methods for estimating missing values. The data set contained air quality, time, and meteorological data.

1. Air quality data: the air pollutant variables were NO 2 , CO, PM 10 , SO 2 , and O 3 ;
2. Meteorological data: the meteorological parameters included temperature, humidity, wind direction, and wind speed.
All these variables were collected on an hourly basis over each 24 h period, and features extracted from the collected data set were used to evaluate the models for predicting the missing concentrations of NO 2 , CO, PM 10 , SO 2 , and O 3 . Concentrations of all the pollutants are reported in µg/m 3 .
We compiled pollutant data from the Kuwait Environmental Public Authority (KEPA). The data were gathered from five environmental monitoring stations from 1 January 2013 to 31 December 2017. We used the following pollutants: particulate matter smaller than 10 micrometers (PM 10 ), nitrogen dioxide (NO 2 ), ozone (O 3 ), carbon monoxide (CO), and sulfur dioxide (SO 2 ). We used averaging times of 24 h (daily observations) for SO 2 , NO 2 , and PM 10 at each station and 8 h for CO and O 3 . Averages were considered reliable when at least 75% of the constituent values were present [31]. We used the Air Quality Index (AQI), as generated by [32].
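The 75% reliability rule for aggregating hourly readings into a daily value can be sketched as follows; this is a Python illustration of the rule as described, not the authors' processing code.

```python
import numpy as np

def daily_mean(hourly, min_coverage=0.75):
    """Average 24 hourly readings into one daily value, returning NaN
    when fewer than 75% of the hours are available (the reliability
    threshold used in the paper). `hourly` is a length-24 array in
    which missing hours are NaN."""
    hourly = np.asarray(hourly, dtype=float)
    valid = ~np.isnan(hourly)
    if valid.mean() < min_coverage:
        return np.nan          # too few hours: the daily average is unreliable
    return hourly[valid].mean()
```

With 24 hourly slots, the rule requires at least 18 valid hours; a day with 7 or more missing hours is itself recorded as missing, which is one way hourly gaps propagate into the daily series.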
The AQI was developed for Kuwait based on the United States Environmental Protection Agency (USEPA) recommendations. The AQI is defined with consideration of the characteristics of the air in relation to the environmental needs of humans [32]; it is an index for reporting day-to-day air quality, providing details about the cleanliness of ambient air [33]. The following equation was used to convert a pollutant concentration to an AQI value:

I_p = (I_high − I_low) / (C_high − C_low) × (C_p − C_low) + I_low,

where I_p is the AQI for the given pollutant, C_p is the pollutant concentration, C_low is the concentration breakpoint that is ≤ C_p, C_high is the concentration breakpoint that is ≥ C_p, I_low is the index breakpoint corresponding to C_low, and I_high is the index breakpoint corresponding to C_high [34] (see Table 1). Using the data obtained from KEPA, we conducted an in-depth comparative analysis of the different imputation methods. Missing data were introduced into each data set, assuming a general missing data pattern and three missing data mechanisms: MCAR, MAR, and MNAR. Under the MCAR assumption, missing values were assigned randomly in each data set. Under the MAR assumption, the probability of a value being missing depended on the class attribute. Under the MNAR assumption, the largest or smallest values of X_s were removed. The objective of the study was to compare six different imputation methods under the MCAR, MAR, and MNAR mechanisms, simulating missing data rates of 5%, 10%, 20%, 30%, and 40%.
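The AQI conversion is a piecewise-linear interpolation over a breakpoint table. The sketch below implements the equation above in Python; the breakpoint rows shown are the USEPA 24-h PM10 breakpoints and stand in for the paper's Table 1, which is not reproduced here.

```python
def concentration_to_aqi(c_p, breakpoints):
    """Convert a pollutant concentration to an AQI value by linear
    interpolation within its breakpoint interval:
    I_p = (I_high - I_low) / (C_high - C_low) * (C_p - C_low) + I_low."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= c_p <= c_high:
            return (i_high - i_low) / (c_high - c_low) * (c_p - c_low) + i_low
    raise ValueError("concentration outside the breakpoint table")

# USEPA 24-h PM10 breakpoints (ug/m3), illustrative subset:
PM10_BREAKPOINTS = [
    (0, 54, 0, 50),       # Good
    (55, 154, 51, 100),   # Moderate
    (155, 254, 101, 150), # Unhealthy for sensitive groups
]
```

For example, a PM10 concentration of 54 µg/m³ maps exactly to the top of the "Good" band (AQI 50), and concentrations within a band scale linearly between its index endpoints.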

Evaluation Criteria
To determine the best imputation method, three model performance measures were considered [35]: the root mean square error (RMSE), mean absolute error (MAE), and correlation coefficient (R), calculated as follows:

RMSE = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)² ),

MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|,

R = Σ_{i=1}^{n} (y_i − ȳ)(ŷ_i − ŷ̄) / sqrt( Σ_{i=1}^{n} (y_i − ȳ)² Σ_{i=1}^{n} (ŷ_i − ŷ̄)² ),

where y_i and ŷ_i are the ith observations of the reconstructed and comparison data sets, respectively, and ȳ and ŷ̄ are their means. The error is measured by the difference between the estimated and observed values; the smaller the RMSE or MAE, the better the estimation method.
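The three criteria are straightforward to compute; a minimal Python sketch (NumPy implementations of the formulas above):

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error between observed y and estimated y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.sqrt(np.mean((y - y_hat) ** 2))

def mae(y, y_hat):
    """Mean absolute error between observed y and estimated y_hat."""
    y, y_hat = np.asarray(y, float), np.asarray(y_hat, float)
    return np.mean(np.abs(y - y_hat))

def corr(y, y_hat):
    """Pearson correlation coefficient R between y and y_hat."""
    return np.corrcoef(y, y_hat)[0, 1]
```

Note that RMSE penalizes large individual errors more heavily than MAE, so a method that occasionally misses badly can have a competitive MAE but a noticeably worse RMSE.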

R Packages Used for Imputation Process
Five well-known R imputation packages were applied. The first was VIM (https://cran.r-project.org/web/packages/VIM/VIM.pdf), which provides kNN imputation and robust model-based imputation for numerical, semi-continuous, categorical, and ordered variables [36]. The second was MICE (https://cran.r-project.org/web/packages/mice/mice.pdf), which stands for Multivariate Imputation via Chained Equations [37]. MICE is specialized for missing values of the MAR or MNAR types [38] and can handle different variable types with different imputation methods, such as predictive mean matching for numeric variables, logistic regression for binary variables, Bayesian polytomous regression for factor variables, and a proportional odds model for ordered variables [38,39]. The third was missForest (https://cran.r-project.org/web/packages/missForest/missForest.pdf), which performs non-parametric imputation [28] by fitting random forests of regression or classification trees to the observed values and predicting the missing ones [40]; missForest has good computational efficiency and works well with high-dimensional data [28]. The fourth was Amelia (https://cran.r-project.org/web/packages/Amelia/Amelia.pdf), which imputes by expectation maximization with a bootstrapping algorithm. Amelia has been recommended for larger numbers of variables and high-dimensional data, and provides improved imputation models by allowing Bayesian priors on individual cell values [41]. The final package was missCompare (https://cran.r-project.org/web/packages/missCompare/missCompare.pdf), which provides several diagnostic measures for comparing imputation methods, including RMSE, MAE, and other performance criteria.

Statistical Results
Based on the real-time ambient air quality and meteorological data from the KEPA monitoring stations, we summarized the measurements using means and standard deviations. The distribution analysis was conducted using skewness and kurtosis together with quartile information (the 25th and 75th percentiles, median, and IQR), and the correlation between predictors was assessed by the Pearson correlation coefficient. The rate of missing values is presented for each monitoring station as the percentage of the total number of missing values among the predictors. All pollutant distributions were positively skewed, and we corrected the skewness by applying log transformations [31]. Figure A4 in Appendix A shows the distributions after we applied logarithmic transformations to PM 10 , SO 2 , O 3 , CO, and NO 2 . Table 3 shows the Pearson correlation analysis of the air pollutants and meteorological parameters. The strongest positive correlation was found between NO 2 and SO 2 , as expected due to their common emission sources (e.g., road traffic). NO 2 had a weak association with PM 10 , whereas O 3 had a strong negative association with NO 2 . All meteorological parameters (temperature, humidity, wind speed, and wind direction) showed a negative association with NO 2 .
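The effect of the log transformation on skewness can be demonstrated on synthetic right-skewed data; the sketch below is a Python illustration using a lognormal sample as a stand-in for a pollutant series, not the actual KEPA data.

```python
import numpy as np

def skewness(x):
    """Sample skewness (third standardized moment)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

rng = np.random.default_rng(0)
# Right-skewed synthetic "pollutant" series; its log is exactly normal,
# so the transformation should drive the skewness toward zero.
pm10_like = rng.lognormal(mean=4.0, sigma=0.8, size=5000)
```

Computing `skewness(pm10_like)` gives a strongly positive value, while `skewness(np.log(pm10_like))` is close to zero, which is the behavior the paper relies on when normalizing the pollutant distributions before imputation.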
We produced time series plots of each pollutant at each monitoring station to better understand the patterns of the missing data (see Figures 2-4). We concluded that the missing data pattern can be classified as missing at random (MAR) or missing not at random (MNAR), especially for the large gaps (see Appendix A, Figures A1-A3). Figure A3 in Appendix A shows the missing observation ratio for each pollutant; PM10 has the highest missing observation rate among the pollutants (Figure A3, left panel). The right panel of Figure A3 shows the missing value pattern for each pollutant, in which the vertically connected blocks indicate non-randomness of the missing data during monitoring. Table 4 compares the missing rates of each monitored pollutant between the monitoring stations. There were significant differences among the stations, with all p-values less than 0.05 except for that of PM 10 . PM 10 was excluded from all imputation calculations, due to a missing rate exceeding 50% [42,43].

Missing Data Patterns
As shown in Table 5 and Appendix Figure A5, the best imputation method for estimating the simulated missing data was the missForest method [44,45], which had the smallest MAE and RMSE values for all parameters and all simulated missing data rates. This finding is consistent with the study of [1], in which the MTB method was the best imputation method for filling missing data, obtaining the smallest error for all percentages of missing data, in agreement with [28,44,46-49]. The second-best imputation method was the k-nearest neighbor (kNN) method, which performed better than the multiple imputation (MI) method for almost all parameters and proportions of missing data; this finding is consistent with the study reported by [42]. The worst-performing methods were multiple imputation using additive regression, bootstrapping, and predictive mean matching (PMM), also consistent with [42].
From Table 4, we can conclude that the missing rates differ among the selected air monitoring stations for each pollutant, except for PM 10 , which shows similar missing rates across the stations. In addition, Appendix A Figure A3 shows how the missing values are distributed for each pollutant.
The results of the imputation approach were diagnosed using convergence plots of the mean and standard deviation of the multiply-imputed data sets produced by missForest (see Appendix A, Figures A6 and A7). For convergence, the different streams should show no definite trends; we did not observe any obvious trends in these data.
In addition, Figure A8 shows kernel density estimates for the marginal distributions of the observed data (blue line) and the m = 20 densities per variable calculated from the imputed data (red lines), indicating stability after 10 iterations.
We then inserted the imputed values into the original data sets to assess whether the imputed data are consistent with the existing data. Figures 5 and 6 show how the imputed data sets fit the actual observations at each station. The figures show that large gaps of missing data were filled following the same pattern as the historical values for all pollutants and meteorological parameters, which supports the use of missForest to estimate missing air pollutant values.

Discussion
In Kuwait, the Environmental Public Authority (KEPA) is responsible for monitoring the air quality status. The data of air quality obtained from the five stations used in this study usually contain missing data, which can cause bias due to systematic errors between the observed and unobserved values [31]. Therefore, it is vital to determine the optimal approach for estimating the missing values, in order to guarantee that the analyzed data are of high quality. Incomplete data matrices may provide outcomes that vary significantly, compared to the results expected from a data set that is complete [4]. The primary purpose of any data analysis is to make valid and reasonable inferences on a particular population under study. A researcher is expected to respond to the missing data problem in a way that aligns with the population of interest.
There have been many contributions to this field, such as in environmental [1,7,50,51], statistical [52,53], and medical studies [54,55]. In the environmental field, imputation is the statistical procedure of assigning inferential values to recover all missing data using prior knowledge from other predictors.
The existence of efficient imputation algorithms has led to the extensive usage of elaborate imputation methods across the world. As more people become knowledgeable about imputation algorithms, inquisitiveness regarding the methodology increases, leading to the invention of more sophisticated imputation methods. However, the main challenge concerning imputed values is whether to consider them as actual measurements or to be handled with caution. In the field of research, it is preferable to handle assigned figures with great discretion. This is because the use of imputed figures as actual data may lead to a misguided impression, which may potentially falsify the final results. Therefore, the imputed values should be given low priority.
It is, therefore, vital for a researcher to impute missing data and to assess how robust the associated estimation is. Environmental data that rely on technological processing and simulation pose a particular challenge. Missing data imputation is one approach; however, many imputation methods are limited to one type of variable, whether continuous or categorical. If the data types are mixed, such methods must handle each type separately and, consequently, ignore potential associations between variables of different types. For the situation here, before conducting any statistical modeling or time-series analysis, it is better to treat the missing values and estimate them using information from other predictors. This helps to avoid bias and to enhance model performance for better estimation.
The main contribution of this paper was to find the most appropriate method to fill in missing observations in an air pollution data set from Kuwait. Single and multiple imputation methods were adopted and their performances were compared using the RMSE and MAE metrics. To evaluate the estimation of missing data for SO2, NO2, PM10, CO, and O3 in the KEPA database, we artificially introduced missing values at rates ranging from 5% to 40%. We showed that missForest could successfully handle the missing values, particularly in data sets including different types of environmental variables.
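The evaluation procedure described above (artificially deleting values, imputing them, and scoring the imputations with RMSE and MAE) can be sketched as follows. This is an illustrative sketch on synthetic data, not the study's code: a simple linear-regression learner stands in for missForest's random forests, and all variable names, parameters, and the 10% missingness rate are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic correlated "pollutant" data (hypothetical, for illustration only).
n = 500
x1 = rng.normal(3.0, 0.5, n)             # e.g. log(NO2)
x2 = 0.8 * x1 + rng.normal(0.0, 0.2, n)  # e.g. log(CO), correlated with x1
x3 = -0.5 * x1 + rng.normal(5.0, 0.3, n) # e.g. log(O3)
X_true = np.column_stack([x1, x2, x3])

# Artificially introduce MCAR missingness at a chosen rate.
rate = 0.10
mask = rng.random(X_true.shape) < rate
X_miss = X_true.copy()
X_miss[mask] = np.nan

def iterative_impute(X, n_iter=10):
    """missForest-style iterative imputation; a linear-regression
    learner stands in for the random forest for brevity."""
    X = X.copy()
    miss = np.isnan(X)
    # Step 1: initial fill with column means.
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]
    # Step 2: cycle over columns, re-predicting the missing entries
    # of each column from the other (currently imputed) columns.
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            others = np.delete(X, j, axis=1)
            A = np.column_stack([np.ones(len(X)), others])
            obs = ~miss[:, j]
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X

X_imp = iterative_impute(X_miss)

# Score only on the artificially removed entries, where truth is known.
err = X_imp[mask] - X_true[mask]
rmse = np.sqrt(np.mean(err ** 2))
mae = np.mean(np.abs(err))
print(f"RMSE={rmse:.3f}  MAE={mae:.3f}")
```

The same scoring loop is simply repeated for each missingness rate (5%, 10%, 20%, 30%, 40%) and each candidate imputer to produce a comparison table.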
However, this imputation method also has limitations. It requires proficiency in R programming, making it more demanding than the kNN or PMM methods. There may also be a relationship between the pollutant values and the missingness itself; the results are, therefore, not applicable in cases where data are missing for non-random reasons. It is also evident that some of the observed air pollutant records contained erroneous information; if this factor is ignored during the analysis, the results obtained may be misleading.
Our findings revealed that missForest was the only imputation method with a consistently and comparatively low imputation error (MAE of 0.82 and RMSE of 1.04). missForest also exhibited the smallest prediction deviation in the imputed pollutant values. Furthermore, missForest is the most readily accessible option, as it is distributed as a free R package.
While compiling this study, we assumed a missing at random (MAR) mechanism. This premise is essential for building the observation model used to impute missing data. It is possible that the missing data mechanism was instead not missing at random (NMAR), in which case the missingness is directly related to the unobserved values themselves. Determining the actual missing data mechanism can be challenging; distinguishing between NMAR and MAR would require a thorough investigation of the data capture process. Other assumptions include Gaussian-distributed data, which may have been inappropriate for some variables. Using an appropriate distribution for each variable can reduce this error and might increase the reliability of the MICE imputation results, in which an imputation model is specified for each variable.
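The distinction between these mechanisms can be illustrated with a small simulation. This is a hypothetical sketch (synthetic temperature and O3-like series with invented parameters), not the study's data: under MCAR the observed subsample stays representative, under MAR the selection depends only on an observed covariate (so the bias can, in principle, be corrected by conditioning on it), and under NMAR the selection depends on the unobserved values themselves.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# Hypothetical covariate and pollutant (all parameters invented).
temp = rng.normal(30.0, 8.0, n)             # observed air temperature
o3 = 0.05 * temp + rng.normal(2.0, 0.3, n)  # O3-like variable tied to temp

# MCAR: each value is missing with the same fixed probability.
mcar = rng.random(n) < 0.20

# MAR: probability of missing O3 rises with the *observed* temperature.
p_mar = 1.0 / (1.0 + np.exp(-(temp - 35.0) / 5.0))
mar = rng.random(n) < p_mar

# NMAR: probability of missing O3 rises with the *unobserved* O3 itself.
p_nmar = 1.0 / (1.0 + np.exp(-(o3 - o3.mean()) / o3.std()))
nmar = rng.random(n) < p_nmar

print(f"full mean      : {o3.mean():.3f}")
print(f"observed (MCAR): {o3[~mcar].mean():.3f}")  # close to the full mean
print(f"observed (MAR) : {o3[~mar].mean():.3f}")   # shifted; recoverable via temp
print(f"observed (NMAR): {o3[~nmar].mean():.3f}")  # shifted; cause is unobserved
```

The practical consequence is the one noted above: MAR can be diagnosed and handled using the observed covariates, whereas NMAR can only be ruled in or out by investigating the data capture process itself.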

Conclusions
Missing data are lost in their entirety and forever, but a proper imputation scheme can remedy the situation as much as possible. In this work, we assessed which method performs best in each situation. For this study, missForest gave the most accurate estimates of the missing values in the multi-dimensional data set (the data sets that came from five fixed monitoring stations). The missForest method enables imputation on virtually any kind of data; in particular, it can handle multivariate data comprising continuous and categorical variables at the same time. The method requires neither parameter tuning nor assumptions about the distribution of the data. Finally, missForest had the lowest imputation error for both continuous and categorical variables at every missingness rate (5%, 10%, 20%, 30%, and 40%), and it had the smallest prediction error difference when models used imputed values. The log transformation clearly corrects the distribution shape for all pollutants. This step, normalizing the skewed data such that they approximately conform to normality, is very important for obtaining more accurate results in the imputation calculations [56].

Figure A8. Density plots with multiple imputations for SO2, NO2, PM10, CO, and O3 data. The blue line represents the observed data and the red lines are the density plots of the 20 imputed data sets.
As can be seen in all density plots, the red lines almost match the blue line (the observed data), indicating close agreement between the observed and imputed values.
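The effect of the log transformation on skewness, as applied to the pollutant data above, can be illustrated with a short sketch. Synthetic lognormal data stand in for a right-skewed pollutant such as PM10; all parameters are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical right-skewed pollutant concentrations (lognormal-like).
pm10 = rng.lognormal(mean=3.5, sigma=0.6, size=2000)

def skewness(x):
    """Sample skewness (Fisher-Pearson moment coefficient)."""
    x = np.asarray(x)
    z = (x - x.mean()) / x.std()
    return np.mean(z ** 3)

raw_skew = skewness(pm10)          # strongly positive for raw concentrations
log_skew = skewness(np.log(pm10))  # near zero after the log transformation
print(f"skewness raw={raw_skew:.2f}, log-transformed={log_skew:.2f}")
```

The transformed values are approximately normal, which is the property exploited before running the imputation calculations.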