Oil Family Typing Using a Hybrid Model of Self-Organizing Maps and Artificial Neural Networks

Identifying the number of oil families in a petroleum basin provides practical information for petroleum geochemistry studies from exploration through development. Oil family grouping helps track migration pathways, identify the number of active source rocks, and examine reservoir continuity. To date, almost all oil family typing studies have relied on common statistical methods such as principal component analysis (PCA) and hierarchical clustering analysis (HCA), and no publication has reported the use of artificial neural networks (ANNs) for examining oil families in petroleum basins. This paper is the first report of oil family typing using ANNs. To this end, a self-organizing map (SOM) neural network combined with three clustering validity indexes was applied to oil samples from oilfields in the Iranian part of the Persian Gulf. The SOM network was first set up with 10 default clusters. Three clustering validity coefficients, namely, Calinski–Harabasz (CH), Silhouette (SH), and Davies–Bouldin (DB), were then evaluated to find the optimum number of clusters. Among the 10 default clusters, the maximum CH (62) and SH (0.58) and the minimum DB (0.8) were all obtained for four clusters. Thus, all three validity coefficients indicated four clusters as the optimum number of clusters, or oil families. The geochemical parameters indicate that the source rocks of the four oil families were deposited in a marine carbonate environment under dysoxic–anoxic conditions, although the families differ in some geochemical characteristics. The number of oil families identified in the present report is consistent with that previously reported by other researchers for the same study area. However, the technique used in the present paper, which has not previously been applied to oil family typing, offers a more straightforward route to clustering than the commonly used PCA and HCA.


INTRODUCTION
Identifying the relationship between oil samples and grouping them, known as oil family classification, as a part of petroleum system studies, plays a paramount role in various aspects of the oil industry, including exploration, development, and so forth. The primary outcomes of oil family typing are detecting migration pathways and evaluating the continuity between different oil reservoirs. 1 In the following introduction, the method used and recent works are generally explained. The second part of the paper is devoted to the data preparation and methodology. Then, the obtained results are discussed in the third section. Ultimately, the final part of the study provides a summary of the findings.
Geochemists have long used the statistical techniques of principal component analysis (PCA) and hierarchical clustering analysis (HCA) to group oil families in petroleum basins. 2,3 Meanwhile, artificial intelligence (AI) and machine learning (ML) systems are developing steadily and offer a growing range of applications to scientists, 4−8 and petroleum geochemists are no exception. AI and ML techniques have been widely used in petroleum-related studies in recent years. Hemmati-Sarapardeh et al. 9 modeled natural gas compressibility using a kind of artificial neural network (ANN). Amooie et al. 10 took advantage of ML methods for geological carbon storage studies. Bolandi et al. 11 evaluated source rock characteristics using ML methods. Bolandi et al. 12 studied the organic facies of source rocks by combining ML and ANNs. Tabatabaei et al. 13 utilized an ML algorithm for the estimation of total organic carbon (TOC) from well log data. Kadkhodaie-Ilkhchi et al. 14 integrated individual ML models with a committee machine intelligent system to approximate TOC from petrophysical well logs. Ghiasi-Freez et al. 15 used committee machines to predict permeability from petrographic image analysis. Amiri-Ramsheh et al. 16 modeled wax disappearance temperature using different AI and ML methods. In addition, researchers have recently used AI and ML for organic geochemistry purposes. For example, Safaei-Farouji and Kadkhodaie 17 used AI and ML methods to estimate kerogen type from petrophysical well logs. Collectively, even though AI and ML methods have been used in various petroleum-related fields, oil family typing using an ANN has not been reported. ANNs have various applications, one of which is clustering. 18−20 Therefore, oil family grouping, as a kind of clustering problem, can be solved via ANNs.
The self-organizing map (SOM), an ANN proposed by ref 21, maps multidimensional data to a two-dimensional space. This space is created through a competitive, unsupervised learning process. The SOM neural network preserves the topological properties of the input space by using a neighborhood function, so the resulting map illustrates the relationships between input patterns. 22,23 The primary use of the SOM is clustering and other types of unsupervised classification. 22,23 So far, oil family grouping has relied on a limited set of common statistical methods, such as PCA and HCA, and ANNs have not been used. Rabbani et al. 2 geochemically analyzed 33 oil samples from several oil fields in the Iranian sector of the Persian Gulf. They defined four main oil families using the statistical methods of PCA and HCA. Mashhadi and Rabbani 24 also geochemically investigated 20 oil samples from oil fields in the Iranian part of the Persian Gulf. They identified two distinct genetic oil families using PCA. In another study, Hosseiny et al., 3 based on 14 oil samples from the eastern Iranian sector of the Persian Gulf and using HCA, identified two different oil families.
The petroleum geochemistry of the examined area has been studied previously; 2,3,24 accordingly, in the present paper, we focus on using an SOM neural network as a novel approach to determine oil families in the region. This allows us to relate our outcomes to previously published works in the study area while using a larger data set and introducing a new method for oil family typing.

MATERIALS AND METHODS
Collectively, 60 oil samples were collected from the literature. 2,3,24 These samples belong to different oilfields in the Iranian part of the Persian Gulf. This Gulf and its coastal regions are home to about two-thirds of the world's proven oil reserves (715 billion barrels). 25 The examined oilfields include Dorood, Kharg, Aboozar, Foroozan, Salman, Resalat, Reshadat, Balal, Bahregansar, Soroush, Nowrouz, Sirri A, Sirri C, Sirri D, and Sirri E. The location map of the studied oil fields is given in Figure 1. Also, the detailed geochemical and biomarker analysis of the studied crude oil samples can be found in Hosseiny et al., 3 Mashhadi and Rabbani, 24 and Rabbani et al. 2 Table 1 summarizes the 16 geochemical and biomarker parameters used as inputs for the SOM network.
2.1. Principal Component Analysis. The first stage in this study was using PCA to reduce the dimensionality of the data. Since 16 different geochemical and biomarker parameters were used as inputs, the number of dimensions had to be reduced so that the data and the clustering results could be plotted. 26,27 Accordingly, the data were reduced from 16 dimensions to 3 principal components using PCA.
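The reduction from 16 parameters to 3 components can be illustrated with a minimal sketch. It is not the implementation used in the study: the function name pca_project, the power-iteration approach, the choice to center (but not standardize) the inputs, and the synthetic 60 × 16 table are all assumptions made here for illustration, and only the Python standard library is used.

```python
import random

def pca_project(rows, n_components=3, n_iter=500, seed=0):
    """Return (scores, components) for the top principal components of `rows`."""
    n, d = len(rows), len(rows[0])
    # Centre every feature (column) on zero mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix, d x d.
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    rng = random.Random(seed)
    components = []
    for _ in range(n_components):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for _ in range(n_iter):
            # Power-iteration step: multiply by the covariance matrix.
            w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
            # Project out components already found so the new one is orthogonal.
            for u in components:
                dot = sum(wi * ui for wi, ui in zip(w, u))
                w = [wi - dot * ui for wi, ui in zip(w, u)]
            norm = sum(wi * wi for wi in w) ** 0.5
            v = [wi / norm for wi in w]
        components.append(v)
    # Coordinates of every sample along each component.
    scores = [[sum(c[j] * u[j] for j in range(d)) for u in components]
              for c in centered]
    return scores, components

# Synthetic stand-in for the 60 x 16 input table (NOT the paper's data).
rng = random.Random(1)
scales = [10.0, 3.0, 1.0] + [0.1] * 13
samples = [[rng.gauss(0.0, s) for s in scales] for _ in range(60)]
scores, components = pca_project(samples)
print(len(scores), len(scores[0]))  # 60 3
```

Each row of scores gives the three coordinates of one sample, which is the form needed for a 3-D cluster plot. Whether the original inputs were standardized before PCA is not stated in the text, so that step is left out here.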
2.2. Creating the SOM Network. ANNs mimic the learning process in the human brain. The key components of a neural network are the neurons, which receive the inputs and generate the outputs using nonlinear operations. The SOM ANN can take complex, high-dimensional data and extract a visible set of clusters. 21 SOM network training consists of two repeated phases. The first phase selects the mapping unit (network neuron) that best matches the input data. The second phase updates the map so that it better represents the input data. 28 The unit that best matches the input, called the best matching unit (BMU), is the one at the minimum distance (usually the Euclidean distance) from it. Then, in the update phase, the BMU and its neighboring units (within a given radius) move closer to the input data. This neighborhood radius decreases with each selection and update cycle, eventually leading to a final (two-dimensional) mapping. 29 The SOM network is composed of an input layer of nodes and an output layer of neurons, in which the grouping of the inputs is formed. 30 The output layer is called the competitive layer because the competition between neurons during training takes place in this layer. The competitive layer is a two-dimensional plane of m neurons that accommodates an input of n neurons. Each input layer neuron is connected to the competitive layer neurons with different weight values, and a series of minor connections is also made between the competitive layer neurons. 31 The number of neurons may vary from a few tens to a few thousands. Each neuron is assigned a d-dimensional weight vector m, where d is the dimension of the input vectors. Neurons are connected to their neighboring neurons by a neighborhood relationship that determines the topology or structure of the map. 
Common topologies are square, hexagonal, triangular, or irregular grids. 32 As depicted in Figure 2, the SOM neural network consists of a set of M = m × m processing neurons. When these M neurons are organized on a grid in a plane, the network is two-dimensional because it projects multidimensional input vectors onto a two-dimensional surface. For a given network, the input vector $x$ has a fixed dimension n, and its n components (i.e., $x_1, x_2, \ldots, x_n$) are connected to each neuron in the array. A synaptic weight $w_{ij}$ is assigned to the connection from the ith component of the input vector to the jth neuron. Thus, an n-dimensional vector $w_j$ of synaptic weights is associated with each neuron j. 33 In brief, the process of the SOM network is as follows, with each step written in its standard SOM form: 33
(1) Calculate the distance between the pattern ($x$) and all neurons: 33
$$d_j = \lVert x - w_j \rVert \tag{1}$$
(2) Select the nearest neuron as the winning neuron $c$: 33
$$c = \arg\min_j \lVert x - w_j \rVert \tag{2}$$
(3) Update each neuron according to the neighborhood function $h_{c,j}$: 33
$$w_j(t+1) = w_j(t) + a\, h_{c,j}(t)\, [x - w_j(t)] \tag{3}$$
The coefficient $a$ scales down the weight adjustment. 33 This process is repeated until a stopping criterion is reached, often a fixed number of iterations. To stabilize the map, the learning rate and neighborhood radius are reduced in each iteration, so the weight updates tend to zero. The distance between vectors is measured as the Euclidean distance. 33
2.3. Clustering Validity Indexes. Clustering validity indexes are commonly used together with a clustering algorithm. Depending on the selected index, either its minimum or its maximum value indicates the optimum number of clusters (k). 34 Generally, validity indexes can be grouped into internal and external indexes. Internal indexes use information from the data themselves, while external indexes rely on external information such as labels. Internal measures can improve the clustering algorithms. By contrast, external measures can be used merely for validation. Internal indexes are generally employed to determine the k value. 35−38 In this paper, for the SOM neural network, three efficient internal coefficients, namely, Davies−Bouldin (DB), Calinski−Harabasz (CH), and Silhouette (SH), were used to determine the optimum number of clusters for the oil samples. Initially, 10 classes were selected for the SOM network. The model was developed based on these clusters; then, the optimum number of classes, taken as the number of oil families, was identified using the coefficients.
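The SOM training loop summarized in Section 2.2 (find the BMU, then pull it and its grid neighbors toward the input while the learning rate and radius shrink) can be sketched as follows. This is only an illustration under stated assumptions: the 3 × 3 grid, the Gaussian neighborhood, the linear decay schedules, the function names train_som and best_matching_unit, and the two synthetic blobs are hypothetical choices, not the settings of the network reported in Table 2.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid position of the neuron whose weight vector is closest to x."""
    return min(weights, key=lambda pos: math.dist(weights[pos], x))

def train_som(data, rows=3, cols=3, n_iter=2000, lr0=0.5, sigma0=1.5, seed=0):
    """Train a small rectangular SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(data[0])
    # One weight vector per grid neuron, initialised from random samples.
    weights = {(r, c): list(rng.choice(data))
               for r in range(rows) for c in range(cols)}
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                 # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3    # neighbourhood radius shrinks too
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)    # winning neuron (Euclidean distance)
        for (r, c), w in weights.items():
            grid_d2 = (r - bmu[0]) ** 2 + (c - bmu[1]) ** 2
            h = math.exp(-grid_d2 / (2.0 * sigma ** 2))  # Gaussian neighbourhood
            for j in range(dim):
                w[j] += lr * h * (x[j] - w[j])  # move the neuron toward the input
    return weights

# Two synthetic, well-separated groups of points (NOT the paper's data).
rng = random.Random(42)
blob_a = [[rng.gauss(0, 0.5), rng.gauss(0, 0.5)] for _ in range(30)]
blob_b = [[rng.gauss(10, 0.5), rng.gauss(10, 0.5)] for _ in range(30)]
som = train_som(blob_a + blob_b)
bmu_a = best_matching_unit(som, [0.0, 0.0])
bmu_b = best_matching_unit(som, [10.0, 10.0])
print(bmu_a, bmu_b)  # the two groups should map to different neurons
```

After training, assigning every sample to its BMU gives a cluster label per sample, which is the input the validity indexes need.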
2.3.1. DB Index. This index measures the average similarity between each cluster and the cluster most similar to it. The minimum value of the DB index indicates the optimum number of clusters or oil families. 39 The index is defined as 39
$$\mathrm{DB} = \frac{1}{k} \sum_{i=1}^{k} \max_{j \neq i} D_{i,j} \tag{4}$$
in which k is the number of clusters and $D_{i,j}$ is the within-to-between cluster distance ratio for the ith and jth clusters, defined as 39
$$D_{i,j} = \frac{d_i + d_j}{d_{i,j}} \tag{5}$$
where $d_i$ represents the mean distance between each point in the ith cluster and the cluster's centroid and $d_{i,j}$ denotes the Euclidean distance between the centroids of the ith and jth clusters. The optimum clustering solution possesses the lowest DB index value. 39
2.3.2. CH Index. The CH index 40 measures the quality of a clustering solution from the average between-cluster and within-cluster sums of squares. It is computed as 34
$$\mathrm{CH} = \frac{\mathrm{SSB}/(k-1)}{\mathrm{SSW}/(n-k)} \tag{6}$$
in which SSB is the between-cluster sum of squares, SSW is the within-cluster sum of squares (each averaged over its degrees of freedom, k − 1 and n − k), k represents the number of clusters, and n denotes the number of observations. SSB is calculated as 34
$$\mathrm{SSB} = \sum_{i=1}^{k} n_i \lVert m_i - \mu \rVert^2 \tag{7}$$
where $n_i$ is the number of samples in cluster i, $m_i$ is the centroid of cluster i, $\mu$ is the mean of all data points, and $\lVert m_i - \mu \rVert$ is the Euclidean distance between the centroid of the cluster and the mean of all data points. SSW is computed as 34
$$\mathrm{SSW} = \sum_{i=1}^{k} \sum_{x \in p_i} \lVert x - m_i \rVert^2 \tag{8}$$
in which k indicates the number of clusters, x is a sample, $p_i$ is the ith cluster, $m_i$ is the centroid of cluster $p_i$, and $\lVert x - m_i \rVert$ is the Euclidean distance between the sample and the centroid of its cluster. 34 A higher CH value indicates a better clustering outcome and hence the optimum number of clusters. High SSB and low SSW values therefore correspond to well-separated clusters. 34
2.3.3. SH Index. The SH index 41 measures how close every data point is to the other data points within its cluster and how well the clusters are separated from each other. Simply put, it is based on the distances from each point to the points within its own cluster and to the points in other clusters. The highest SH value indicates the optimum number of clusters (k). 42
$$sp(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}} \tag{9}$$
in which $sp(i)$ is the silhouette width of the ith point (i = 1, 2, ..., n), $a(i)$ is the mean distance between the ith point and the other points in its own cluster, and $b(i)$ is the smallest of the mean distances between the ith point and the points of each other cluster. Hence, the SH value lies between −1 and 1. For every clustering, the average of all $sp(i)$ is used. 34
The detailed features of the SOM network used for clustering in the present study are given in Table 2.
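To make the three validity indexes of Section 2.3 concrete, the following sketch computes DB, CH, and SH for a given set of cluster labels using their standard definitions. The function names, the toy three-group data set, and the comparison of a three-cluster labeling against a deliberately merged two-cluster labeling are assumptions for illustration only; none of the numbers correspond to the oil samples of this study.

```python
import math
import random

def centroid(points):
    n = len(points)
    return [sum(p[j] for p in points) / n for j in range(len(points[0]))]

def group(data, labels):
    """Split data into a list of clusters according to labels."""
    clusters = {}
    for x, lab in zip(data, labels):
        clusters.setdefault(lab, []).append(x)
    return list(clusters.values())

def davies_bouldin(data, labels):
    clusters = group(data, labels)
    cents = [centroid(c) for c in clusters]
    # Mean distance of each cluster's points to its own centroid.
    spread = [sum(math.dist(x, m) for x in c) / len(c)
              for c, m in zip(clusters, cents)]
    k = len(clusters)
    worst = [max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                 for j in range(k) if j != i) for i in range(k)]
    return sum(worst) / k                      # lower is better

def calinski_harabasz(data, labels):
    clusters = group(data, labels)
    cents = [centroid(c) for c in clusters]
    n, k = len(data), len(clusters)
    mu = centroid(data)
    ssb = sum(len(c) * math.dist(m, mu) ** 2 for c, m in zip(clusters, cents))
    ssw = sum(math.dist(x, m) ** 2 for c, m in zip(clusters, cents) for x in c)
    return (ssb / (k - 1)) / (ssw / (n - k))   # higher is better

def silhouette(data, labels):
    clusters = group(data, labels)
    widths = []
    for ci, c in enumerate(clusters):
        for x in c:
            if len(c) == 1:
                widths.append(0.0)             # common convention for singletons
                continue
            a = sum(math.dist(x, y) for y in c) / (len(c) - 1)
            b = min(sum(math.dist(x, y) for y in o) / len(o)
                    for oi, o in enumerate(clusters) if oi != ci)
            widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)           # higher is better, in [-1, 1]

# Toy data: three tight groups (NOT the paper's data).
rng = random.Random(0)
centers = [(0, 0), (10, 0), (0, 10)]
data = [[cx + rng.gauss(0, 0.5), cy + rng.gauss(0, 0.5)]
        for cx, cy in centers for _ in range(20)]
labels_k3 = [i for i in range(3) for _ in range(20)]   # matches the true groups
labels_k2 = [min(lab, 1) for lab in labels_k3]         # merges groups 1 and 2

for name, labels in (("k=2", labels_k2), ("k=3", labels_k3)):
    print(name, davies_bouldin(data, labels),
          calinski_harabasz(data, labels), silhouette(data, labels))
```

On this toy set, the three-cluster labeling gives the lower DB and the higher CH and SH, which is the same agreement among indexes that is used to select the number of oil families.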

RESULTS AND DISCUSSION
Because the actual number of clusters, or oil families, is unknown in advance, 10 clusters were defined as the default number for the SOM network, and the samples were distributed among them. The principal objective of this study, however, is to find the optimum number of clusters, and hence of oil families, among these defined clusters. Therefore, validity indexes were employed.
Regarding the clustering validity coefficients, the maximum values of CH (62) and SH (0.58) were obtained for four clusters (Figure 3a,b). Additionally, the minimum DB coefficient (0.8) was obtained for four clusters (Figure 3c). Thus, all three clustering validity indexes indicated four clusters as the optimum number. Figure 4 shows the four clusters identified by the SOM neural network in a 3-D plot. Therefore, it can be concluded that four oil families exist in the Iranian part of the Persian Gulf. In other words, at least four different source rocks have generated the reservoir oils.
Based on the SOM results, cluster I consists of crude oil samples from the Foroozan, Aboozar, Balal, Resalat, Reshadat, Salman, Bahregansar, and Soroush oilfields. Cluster II is composed of crude oils from the Foroozan, Kharg, Dorood, Balal, Salman, and Nowrouz oilfields. Cluster III contains oil samples from the Resalat, Reshadat, Hendijan, Nousrat, Sirri A, Sirri C, Sirri D, and Sirri E oilfields. Finally, crude oil samples from Kharg, Dorood, Aboozar, Reshadat, Bahregansar, Nousrat, Sirri A, Sirri C, Sirri D, and Sirri E were grouped into cluster IV. According to the geochemical parameters summarized in Table 1 and the clustering outcomes, C26/C25 and C19/C23 tricyclic terpane ratios lower than 1 for all families suggest a marine depositional environment for the corresponding source rocks. However, the lowest C26/C25 ratio values, found for oil family IV (mean = 0.42), suggest a greater water depth during deposition of its source rock. Additionally, family IV shows the minimum C19/C23 tricyclic terpane ratio (mean = 0.11), which further supports a greater water depth during deposition of the source rock that generated the crude oils of family IV. 43−45 Furthermore, the mean C29/C30 hopane ratios for crude oils in all oil families are greater than 0.8, indicating carbonate lithology for their source rocks. 46,47 However, the higher mean values of this ratio for oil family II may indicate a higher carbonate content in its source rock. This deduction is further supported by the C22/C21 and C24/C23 tricyclic terpane ratios: higher C22/C21 and lower C24/C23 values for crude oils of family II than for those of the other families could suggest a higher carbonate content in the source rock that generated the crude oils of this family. 1 Additionally, crude oils of all four families show C35/C34 homohopane ratios near unity. Accordingly, a dysoxic−anoxic depositional environment can be inferred for the source rocks of all four oil families. Oil family III shows a relatively higher mean C35/C34 homohopane ratio (1.06) than the other families, and hence, more anoxic conditions can be inferred for its source rock. 48 Oil family III also shows a higher gammacerane/C31 hopane ratio (mean = 0.25); correspondingly, higher water salinity and a stratified water column can be suggested during deposition of its source rock. 1,49,50
It is worth mentioning that crude oils from some oilfields have been grouped into more than one oil family. For instance, some crude oils from the Balal oilfield have been assigned to oil family I, and others have been clustered into oil family II (Table 1). Similarly, several crude oil samples from the Resalat oilfield have been recognized as oil family I, whereas others have been assigned to oil family IV (Table 1). This may reflect relatively similar geochemical characteristics of the source rocks that generated the crude oils of the different oil families. As discussed above, the source rocks of the four identified oil families were generally deposited in similar depositional environments, even though there are some differences.
Overall, the SOM ANN employed in the present paper grouped the crude oil samples into four clusters, indicating four oil families in the studied area. The Hanifa-Tuwaiq, Garau, Diyab member of the Surmeh Formation, Kazhdumi, Sarvak, Khatiya, and Ahmadi member of the Sarvak Formation are regarded as the possible source rocks in the region. 2 The identified number of oil families is consistent with that suggested by Rabbani et al. 2 However, only 33 samples were analyzed in that study, whereas 60 crude oil samples were analyzed in the present paper to reach more reliable results.

CONCLUSIONS
The lack of new methods in previous studies was the main reason we sought a new approach to identifying oil families in petroleum basins, and an SOM neural network was selected for this purpose. In creating the SOM network, 10 clusters were initially defined. Then, three clustering validity coefficients were used to identify the optimum number of clusters based on the geochemical and biomarker characteristics of the oil samples used as network inputs. Among the 10 defined clusters, the maximum CH and SH coefficients and the minimum DB coefficient were all obtained for four clusters. Accordingly, all three validity indexes indicated four clusters as the optimum number, and hence four oil families. Based on the geochemical data, it can be inferred that the source rocks of the four oil families were deposited in a marine carbonate setting under dysoxic−anoxic conditions, although the oil families showed some geochemical differences. Finally, while statistical methods such as PCA or HCA can be employed for oil family typing, these approaches have become overused, and petroleum geochemistry studies, specifically oil family grouping, demand novel approaches. Accordingly, this paper introduced the SOM ANN as a quick and easy-to-use method that could be of considerable benefit to petroleum geochemists for classification purposes.