Neural Collaborative Filtering with Ontologies for Integrated Recommendation Systems

Machine learning (ML), and especially deep learning (DL) with neural networks, has demonstrated remarkable success across a wide range of AI problems, from computer vision to game playing and from natural language processing to speech and image recognition. In many ways, the ML approach to solving a class of problems differs fundamentally from that of classical engineering or of ontologies. While the latter rely on detailed domain knowledge and almost exhaustive search by means of static inference rules, ML collects large datasets and processes this massive information through a generic learning algorithm that builds up tentative solutions. Combining the capabilities of ontology-based recommendation and ML-based techniques in a hybrid system is thus a natural and promising way to enhance semantic knowledge with statistical models. This merger could alleviate the burden of creating large, narrowly focused ontologies for complicated domains by using probabilistic or generative models to improve the predictions without attempting to provide semantic support for them. In this paper, we present a novel hybrid recommendation system that blends, in a single architecture, classical knowledge-driven recommendations arising from a tailored ontology with recommendations generated by a data-driven approach, specifically classifiers and neural collaborative filtering. We show that bringing together these knowledge-driven and data-driven worlds provides measurable improvement, enabling the transfer of semantic information to ML and, in the opposite direction, of statistical knowledge to the ontology. Moreover, the proposed system enables the extraction of reasoned recommendation results after updating the standard ontology with the new products and user behaviors, thus capturing the dynamic behavior of the environment of interest.


Introduction
The amount of information found on web pages and social networks has increased dramatically in recent years as the Internet has grown. As a result, even though users have access to more information, it is becoming increasingly challenging to meet their demands when providing information relevant to their interests. The rise of the Internet has also accelerated the spread of e-services across a variety of online platforms, whose primary benefit is offering products and services, anytime and anywhere, to consumers who have not yet purchased them. With so much data and so many services available, it is challenging not only for users to quickly identify products they are interested in, but also for e-commerce and similar systems to recommend the right products from the data. Recommendation systems (RSs) [1] are decision-support information systems created to assist users in locating items that fit their interests among a vast variety of choices [2,3].
There are three main sets of techniques for building personalized RSs: ontology-based [4,5], filtering by matrix factorization (MF) [6][7][8][9][10], and machine learning (ML) [11,12]. The common premise of all of them is trying to predict new items that match the users' preferences, revealed through their past purchases or by means of explicit ratings. However, these approaches differ fundamentally: ontology-based systems use a formalized ontology (a conceptual graph of entities and their mutual relationships) suited to the specific domain for rule-based reasoning; MF attempts to discover the low-dimensional latent factors (hidden state variables) in the user-item preference matrix; ML estimates a statistical predictive model from the collected data. While these methods have shown effectiveness and accuracy in constructing RSs, all are based on a fixed ground truth (the dataset) captured at a given time instant. It is therefore difficult to recommend diversified, personalized products, since the recommendations are based on the past observed purchase history, and it is not possible to take advantage of the change in users' preferences over time, their drift [13], because the computational load of rebuilding the model is often too large. For instance, ontologies have to follow a slow and complex synthesis process that requires the aid of external experts [1]; ML needs large datasets and long training times to learn a precise statistical model; and MF runs in time O(n^3) at least, where n is the number of users. MF methods also suffer from the sparsity problem [14] (inaccurate inferences when new users or products are added), and ML methods exploit statistical correlations independently of the logical relationships among users and items.
Though a hybrid RS (e.g., [15,16]) that combines two or more of the usual approaches seems intuitively more robust and able to tackle these problems, the drift of users' preferences over time cannot yet be handled by hybrid RSs as long as the dataset used to build the system is not allowed to evolve. For this, a key property is that changes in the dataset can be introduced automatically into an existing model and that the computational cost of updating the model is low. This paper presents a hybrid recommendation system (adapted for online retail markets) in which a dynamic ontology and a neural network classifier [17][18][19] work jointly to generate accurate recommendations for future purchases. The system differs from other works that attempt to incorporate semantics into the neural network (e.g., [20,21]) in that we allow the semantic representation to evolve with time and in that we use both item and user information to reveal the latent factors with the neural network. The main contributions of our work are:
• We propose a new ontology-based RS where the ontology is dynamically updated and evolved to capture the semantic relationships between users and products. In contrast to other knowledge-based systems, the evolution of the ontology is performed automatically, without the participation of experts;
• The proposed system enables the extraction of the reasoning recommendation results after updating the standard ontology with the new products and user behaviors;
• The proposed RS can be integrated seamlessly with other collaborative filtering and content-based filtering RSs;
• The proposed methodology is able to provide better recommendations, aligned with the current preferences of users.
The rest of the paper is organized as follows. Section 2 briefly introduces the tools and techniques used in our work, i.e., recommendation systems, neural collaborative filtering (NCF), and generalized matrix factorization (GMF). In Section 3, we review the related work. Section 4 gives an overview of the hybrid architecture of our proposed recommender and describes the main processing steps. The detailed architecture of the neural collaborative filtering component is presented next in Section 5, and Section 6 contains the experimental evaluation results of the system, including the description of the dataset, its preprocessing, the application of standard classification techniques, and their combination with the proposed neural network and the information provided by the dynamic ontology. Section 7 presents a separate evaluation of the personalized recommendations, and the paper makes some concluding remarks in Section 8.

Recommendation Systems
The two classical forms of creating RSs are collaborative filtering (CF) and content-based filtering (CBF). CF recommends products and items by comparing the current user's ratings (implicit or explicit) for things such as movies or products to those of similar users, the nearest neighbors determined via some appropriate distance measure, i.e., users who liked or purchased the same items in the past. This comparison is used to provide recommendations for objects that have not yet been rated or seen by the active user. User-based and item-based approaches are the two most common types of this technique [14,[22][23][24][25]. CBF promotes an item to users based on their interests and on the product descriptions of their past purchases. Because of this, the main disadvantage of CBF is that the system's recommended products are likely to be extremely similar to those that the active user has already purchased.
Generally, collaborative filtering involves matching the current user's ratings for objects such as movies or products with those of similar users (nearest neighbors) to produce recommendations for objects not rated or seen by the active user. There are two basic variants of this approach: user-based and item-based collaborative filtering. Traditionally, within the former category, the primary technique used to accomplish this task is the standard memory-based K-nearest-neighbor (KNN) classification approach. This approach compares a target user's profile with the historical profiles of other users to find the top K users who have similar tastes or interests [26]. Other common ways of solving the CF problem include MF (with many variants [22][23][24]27]) and ML [11,12,28].
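As an illustration, the user-based KNN approach just described can be sketched in a few lines of numpy (toy rating matrix, cosine similarity, similarity-weighted prediction; all data and names here are invented for illustration):

```python
import numpy as np

def recommend_user_knn(ratings, target, k=2, n_rec=2):
    """User-based KNN collaborative filtering on a dense rating matrix.

    ratings: (n_users, n_items) array, 0 = unrated.
    Returns indices of unseen items to recommend to user `target`.
    """
    # Cosine similarity between the target user and all other users.
    norms = np.linalg.norm(ratings, axis=1) + 1e-12
    sims = ratings @ ratings[target] / (norms * norms[target])
    sims[target] = -np.inf                      # exclude the user itself
    neighbors = np.argsort(sims)[::-1][:k]      # top-K most similar users

    # Predicted scores: similarity-weighted average of neighbor ratings.
    w = sims[neighbors]
    scores = w @ ratings[neighbors] / (w.sum() + 1e-12)
    scores[ratings[target] > 0] = -np.inf       # recommend only unseen items
    return np.argsort(scores)[::-1][:n_rec]

# Toy example: user 0 resembles users 1 and 2, who rated items 2 and 3.
ratings = np.array([[5, 4, 0, 0],
                    [5, 5, 1, 0],
                    [4, 5, 0, 1],
                    [0, 1, 5, 5]], dtype=float)
recs = recommend_user_knn(ratings, target=0, k=2, n_rec=2)
```

With this toy matrix, the two nearest neighbors of user 0 are users 1 and 2, so the unseen items 2 and 3 are ranked by their weighted neighbor ratings.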
Content-based filtering RSs [9,12,29] promote an item to users based on their interests and the description of the item. Content-based recommendation systems can be used to propose web pages, news articles, restaurants, television shows, and objects for sale, among other things. Despite differences in the details of the domain systems, content-based recommendation systems share a common method: (i) describing the items that may be recommended; (ii) creating a user profile that describes the types of items the user likes; and (iii) comparing items to the user profile to determine what to recommend [26].
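Steps (i)-(iii) can be sketched with a minimal bag-of-words matcher (the vocabulary construction, mean-vector profile, and cosine ranking are deliberate simplifications, and all item texts below are hypothetical):

```python
import numpy as np

def content_based_recommend(item_texts, liked, n_rec=1):
    """Content-based filtering sketch:
    (i) describe items as bag-of-words count vectors,
    (ii) build a user profile from the liked items,
    (iii) rank the remaining items by cosine similarity to the profile."""
    # (i) item descriptions -> count vectors over a shared vocabulary
    vocab = sorted({w for t in item_texts for w in t.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    X = np.zeros((len(item_texts), len(vocab)))
    for r, text in enumerate(item_texts):
        for w in text.lower().split():
            X[r, index[w]] += 1.0

    profile = X[liked].mean(axis=0)              # (ii) user profile
    norms = np.linalg.norm(X, axis=1) * (np.linalg.norm(profile) + 1e-12)
    scores = X @ profile / (norms + 1e-12)       # (iii) cosine similarity
    scores[liked] = -np.inf                      # never re-recommend liked items
    return np.argsort(scores)[::-1][:n_rec].tolist()

items = ["red wine glass", "white wine bottle",
         "garden tool set", "wine rack wood"]
top = content_based_recommend(items, liked=[0, 1], n_rec=1)
```

Because the profile is dominated by the word "wine", the wine rack (item 3) is ranked above the unrelated garden set, which also illustrates the over-specialization drawback noted above.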

Neural Collaborative Filtering
As highlighted above, the key step in CF is to find a good approximation to the user-item interaction function directly from the observed data. Latent matrix factorization postulates to this end a simple linear projection operator onto a lower-dimensional space embedded in the feature space; yet, linear interactions might not be powerful enough to capture or replicate a complicated interaction function. A natural way out of this limitation consists of replacing the linear latent factors with a more general representation for nonlinear user-item models. It is widely known that neural networks can faithfully learn any functional input-output relationship between observables and hidden variables [30], provided several mild technical conditions hold (basically, a bounded domain, an activation function under some general assumptions, and a growing number of nodes). The approximation is arbitrarily good if the width and depth of the neural network are unconstrained as well [31,32]. In other words, for a broad class of functions, deep neural networks are universal approximators, a very powerful result that can be established by first proving that neural networks can approximate well smooth and non-smooth simple functions (e.g., sawtooth functions and polynomials); these building blocks are then used to approximate arbitrary functions by virtue of well-known results in functional analysis, such as the Weierstrass approximation theorem or Lagrange's interpolation. Further details are discussed at length in [32].
In view of this theoretical support, neural collaborative filtering, sparked by the recent influential paper [28], quite naturally advocates the use of a neural network in place of matrix factorization for modeling the interaction function f(·), under the intuition that the accuracy shown by neural network technology in an impressive array of machine learning tasks can also be exploited in CF. In NCF, the reduction in dimensionality is thus attained through a sequence of layers in a neural network [19], and the similarity among the latent factors is no longer restricted to be measured as a linear projection. More precisely, in NCF, the learned function can be written as the output of a neural network Φ_NCF : R^{n_0} → R^{n_L} of L layers,

Φ_NCF = A_L ∘ σ ∘ A_{L−1} ∘ σ ∘ ⋯ ∘ σ ∘ A_1,

where the A_i(x) := A_i x + b_i are affine mappings giving the outputs of the ith layer for inputs x ∈ R^{n_{i−1}} coming out of the previous layer, ∘ denotes the composition of mappings, and σ(·) is a nonlinear activation function acting componentwise. A_i ∈ R^{n_i × n_{i−1}} and b_i ∈ R^{n_i} are the parameters between layer i − 1 and layer i. There is a plethora of activation functions proposed in the literature, but perhaps the most popular is still the rectified linear unit (ReLU), σ(x) := max(0, x). Note that A_i is the connection or weight matrix connecting two consecutive layers of the neural network and that b_i is the bias; these are known as the edge weights and node weights of the network, respectively. Note also that ReLU is nothing other than an ideal rectifier (a diode, in the language of electrical engineers). Training of a neural network is performed by optimizing a loss function through iterated backpropagation, i.e., the optimal edge and node weights (A_i, b_i) are computed backwards, from the output layer to the input layer. This is possible since the local gradient of the loss function with respect to the weights at layer i can be computed, so a general gradient descent sequence of steps can be followed.
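The layered composition of affine maps and componentwise activations described above can be sketched in a few lines of numpy (toy dimensions and random, untrained weights, purely to illustrate the forward pass):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)        # ReLU activation, applied componentwise

def ncf_forward(x, params):
    """Forward pass of a Phi_NCF-style network: a composition of affine maps
    A_i(x) = A x + b with a ReLU nonlinearity between hidden layers; the
    last layer is kept linear here (an output activation could follow)."""
    *hidden, last = params
    for A, b in hidden:
        x = relu(A @ x + b)
    A, b = last
    return A @ x + b

# Toy network R^4 -> R^3 -> R^1 with random (untrained) weights.
rng = np.random.default_rng(0)
params = [(rng.standard_normal((3, 4)), np.zeros(3)),
          (rng.standard_normal((1, 3)), np.zeros(1))]
y = ncf_forward(np.ones(4), params)
```

In a trained network, the `params` would of course be produced by backpropagation rather than drawn at random; the sketch only shows how the composition is evaluated.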
However, since the number of parameters in (A_i, b_i) can easily be too large, stochastic gradient descent (SGD) [33] replaces the gradient with an empirical expectation calculated on the basis of a few randomly sampled training points (a batch). The reduction in complexity with SGD is thus very substantial.
In practical applications, other numerical problems might arise: these are due to the unboundedness of ReLU and its non-differentiability at zero, but both can be easily solved. For inferring probabilities, as is the case in CF, the typical squared loss function is better replaced by the log-loss, namely, the cross-entropy between the predicted output and the desired values.

Generalized Matrix Factorization
Since neural networks are universal approximators, NCF trivially generalizes matrix factorization as follows:
• One-hot encoding of users and items: the input to the NCF is a pair of unit vectors e_i ∈ R^{|I|} and e_j ∈ R^{|U|} (where I is the set of items, U is the set of users, and |·| denotes the number of elements of a set), which encode the identity of item i ∈ I and user j ∈ U, respectively. e_i (resp., e_j) has a single one at coordinate i (resp., j), and its remaining elements are zero. These vectors are column-stacked, e_i ⊕ e_j := [e_i^T e_j^T]^T, where T denotes the transpose of a vector, and are input to the network;
• The weight matrix is A_1 = vec(G^T H)^T, acting on the Kronecker product e_i ⊗ e_j of the two one-hot blocks, where ⊗ is the Kronecker product between two matrices, G = [g_1 · · · g_{|U|}] and H = [h_1 · · · h_{|I|}] collect the user and item latent factors, and the bias is b_1 = 0. So A_1(e_i ⊗ e_j) = g_j^T h_i for all i, j;
• The (output) activation function is the identity mapping, i.e., σ = I. Since matrix factorization is linear, a nonlinear activation function is unnecessary;
• The loss function is the mean-squared error (MSE).
These choices make the neural network Φ_GMF discover the latent matrices G and H with the backpropagation algorithm and then act linearly on its inputs by simply multiplying the inner latent factors.
If, instead of the identity activation function, one substitutes a nonlinear activation response ρ(·), while keeping the one-hot encoding and the internal Kronecker product, and uses a proper ℓ_p-norm as the loss function, the result is an architecture that operates as a generalized matrix factorization with a nonlinear response, encompassing many different types of decomposition.
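The claim that a single linear layer with Kronecker-combined one-hot inputs reproduces the matrix-factorization score g_j^T h_i can be checked numerically; the sketch below uses one possible vec/Kronecker ordering convention (row-major), chosen here only for illustration:

```python
import numpy as np

def gmf_score(G, H, j, i):
    """GMF as a one-layer linear network with identity output activation.

    G: (k, n_users) latent user factors, columns g_j.
    H: (k, n_items) latent item factors, columns h_i.
    With one-hot inputs e_j, e_i combined via the Kronecker product, a single
    weight row equal to the (row-major) vectorization of G^T H recovers the
    matrix-factorization score g_j^T h_i."""
    e_i = np.eye(H.shape[1])[i]          # one-hot item encoding
    e_j = np.eye(G.shape[1])[j]          # one-hot user encoding
    A1 = (G.T @ H).reshape(-1)           # network weights: row-major vec(G^T H)
    return A1 @ np.kron(e_j, e_i)        # identity activation, zero bias
```

The Kronecker product of the two one-hot vectors has a single one at position j·|I| + i, so the linear layer simply selects the (j, i) entry of G^T H, i.e., the inner product of the latent factors.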

Neural Matrix Factorization
Neural matrix factorization (NMF) is defined as the combination of GMF and NCF in a single recommendation system. Specifically, NMF builds a system that maps the input embeddings e_i, e_j to the outputs according to

Φ_NMF(e_i, e_j) = 1/2 [Φ_NCF(e_i, e_j) + Φ_GMF(e_i, e_j)].   (3)

In [28], it was suggested that the input embeddings be split between the two parallel computation paths, that is, e_i = [e_i^{(1)T} e_i^{(2)T}]^T, with e_i^{(1)} feeding Φ_GMF and e_i^{(2)} feeding Φ_NCF, and similarly for e_j; yet, there is no loss of generality in using (3) always. Generalizations of (3) to other combination rules are straightforward: for instance, one could use Φ = max{Φ_NCF, Φ_GMF} if the outputs are probabilities, or a weighted average Φ = π_0 Φ_NCF + π_1 Φ_GMF for some prior (π_0, π_1). In the end, as with many other algorithms used in ML, (3) can be regarded as a form of ensemble averaging for just two different classifiers (recommenders, in our case). NMF is used in this paper for the fusion of ontology-based recommendation systems and neural network recommenders.
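The combination rules just mentioned (the equal-weight average of (3), the maximum of probabilities, and the weighted average with a prior) can be sketched as a small fusion function; the rule names are our own labels, not part of the original formulation:

```python
import numpy as np

def nmf_combine(p_ncf, p_gmf, rule="mean", priors=(0.5, 0.5)):
    """Ensemble fusion of the two recommender outputs:
    - "mean": equal-weight average, as in Eq. (3);
    - "max":  maximum, applicable when the outputs are probabilities;
    - otherwise: weighted average with prior (pi_0, pi_1)."""
    if rule == "mean":
        return 0.5 * (p_ncf + p_gmf)
    if rule == "max":
        return np.maximum(p_ncf, p_gmf)
    pi0, pi1 = priors
    return pi0 * p_ncf + pi1 * p_gmf
```

The same function applies elementwise to whole score vectors, since the combination acts independently on each user-item prediction.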

Related Work
Ontology-based recommenders are one sort of knowledge-based RS that has received attention for improving recommendations to the active user. An ontology, an organized set of concepts and categories in a subject matter or domain that formalizes their characteristics and the relationships among them, can be used to combine diverse data and provide a first direction for recommendation preferences. Ontology models are used in an ontology-based RS for user profiling, personalized search, and Internet browsing [4,5] and support the expansion of RSs into a more diverse environment, allowing knowledge-based methods to be combined with traditional machine learning algorithms. Commercial RSs often include some simple product ontologies that can be exploited later via heuristics or through a huge community of users actively rating content suited for collaborative recommendations. Both data and ontological background information can be represented in defined formats on the Semantic Web, where standard languages are used to represent metadata. Nonetheless, constructing an ontology-based RS is an expensive procedure that involves extensive knowledge expertise and the handling of enormous datasets.
As for their use in RSs, ontologies have the advantage of an explicit modeling of semantic information from which logical inferences can be drawn using some standard rule-based inference system, such as Fact++ [34]. The system of concepts and qualitative relationships among them can be easily tracked and represented with graphical software tools such as the Protégé editor (https://protege.stanford.edu, accessed on 30 November 2021) and several existing plugins for visualization and functional extension, one of these being the cellfie plugin used in this paper for the semi-automatic update of an ontology. Thus, ontologies enable a rich and detailed semantic modeling of information as a loosely structured system of intertwined concepts; yet, from a computational point of view, there exist some disadvantages in using them as the fundamental form of knowledge representation, which limit their usefulness in RSs. First, ontologies are highly specific to a domain, usually very narrowly focused. As such, building an ontology typically requires the participation of human experts for conceptualizing the main entities. In other words, building an ontology is far from being an automatic process; on the contrary, there is some risk of introducing bias due to the experts' view of the field. Secondly, the ontology constitutes only an information base to which a set of complete inference rules is to be applied to deduce the implied consequences. Generally, these derived facts grow exponentially with the size of the ontology and quickly become unmanageable. Thirdly, building the ontology itself is costly in time and, sometimes, in access to experts. As a consequence, updating an ontology with new information frequently means rebuilding it from scratch, which is not realistic for a big RS such as those on most websites or e-commerce sites.
In this paper, we used dynamic ontologies, i.e., ontologies that evolve with time, in a more general architecture for RSs, in combination with NCF. Our main purpose was to characterize the extent to which semantic modeling with ontologies and contemporary CF (namely, NCF) can complement each other by simultaneously extracting statistical information and semantic knowledge from the same dataset in a semi-automatic way that avoids most of the complexity involved in re-creating the ontology or re-training the NCF. In the rest of this section, we briefly review the work on hybrid RSs and (neural) CF most related to our approach.
The work in [28] pioneered the research on merging neural networks and CF for novel RSs. The main idea, as mentioned previously, is to replace the linear matrix factorization (LMF) commonly used in CF with a more general and potentially more effective function approximator: a (deep) neural network. The experiments reported in [28] showed a clear advantage of this approach over the classical LMF that supports CF. A word of caution was introduced recently by [35], who repeated the experiments and found that, with a proper hyperparameter setting, LMF has similar or superior performance to NCF. In other words, the expressive power of neural networks appears not to be essential for modeling purposes. Therefore, the question is still open about the benefits and performance of using neural networks for inferring recommendations.
Despite these intriguing findings, many other works have explored the use of deep networks for CF after [28]. For example, Reference [20] used two parallel neural networks, joined in the last layers, for learning item characteristics and user behavior together from text reviews. The first network works toward understanding users' behavior from their reviews, while the second one understands the characteristics of the item from the reviews written about it. The last layer joins the two networks together, allowing the latent factors learned for users and items to collaborate with each other in a similar way as with factorization techniques. The Yelp, Amazon, and Beer datasets were used to test the algorithm, and the results showed that DeepCNN outperformed all baseline RSs by an average of 8%. A comparison between neural collaborative filtering and matrix factorization was conducted in [35].
Reference [36] proposed an approach called neural semantic personalized ranking (NSPR) that combines the effectiveness of deep neural networks and pairwise learning. NSPR combines the semantic representation of the items with latent factors learned from implicit feedback, specifically to address the item cold-start recommendation task. The system introduces two alternatives, based on the logistic and probit functions. The proposed approach was evaluated on two datasets (Netflix and CiteULike) against MF and topic-regression-based CF, and the experiments showed that NSPR significantly outperformed the state-of-the-art baselines. The idea proposed by [20] for their context-aware recommendation model was to use a convolutional matrix factorization (ConvMF) that incorporates convolutional neural networks into probabilistic matrix factorization. This had a clear effect on the sparsity problem. Again, the proposed model, after integrating the CNN into MF under a probabilistic perspective, was able to improve the accuracy of the rating prediction in addition to capturing the contextual information of documents.
An example of a hybrid system composed of ontologies and CF is [33], applied to MOOCs. Their RS combines item- and user-based CF with an ontology to recommend personalized MOOCs to online learners within MOOC platforms. Here, the ontology was used to provide a semantic description of the learner and the MOOC, which was fused into the recommendation method to help enhance the personalization of the recommendations for the learner. The cold-start problem of the RS can be alleviated by the proposed hybrid technique, which uses the ontological knowledge before any initial data are available.
The calculation of the similarity between ontologies has also been addressed via machine learning techniques, as in [37,38]. The approach in this case is to perform a direct embedding of the ontology graph to simplify the detection of the similarity between graphs and, next, to use the embedding as the input to a statistical learning algorithm. The main problem with those embeddings is the large size of the base ontology graphs, which for our application domain prevents the use of this sort of mapping.

Overview of the Proposed Recommendation System Architecture Based on ML, NCF, and Ontology Evolution
This section introduces the proposed recommendation system architecture based on machine learning, neural collaborative filtering, and ontology evolution, as well as the proposed neural collaborative filtering with ontologies (NCFO) framework based on GMF and NCF. The proposed system architecture is depicted in Figure 1 [5] and comprises four phases. Phase 1, ϕ1 (top left of the figure), is the machine learning process for the online retail dataset; Phase 2, ϕ2 (top middle), is the pre-evolution ontology of the online retail dataset; Phase 3, ϕ3 (center of the figure), is the ontology after the evolution of the online retail dataset; finally, Phase 4, ϕ4 (top right), is the neural collaborative filtering. The arrows represent the flow of information processing among the different computation steps:
• Phase ϕ1, the ML process, starts by loading the online retail dataset covering three years of transactions and consulting a domain expert for the feature selection within the dataset. The feature selection is further complemented in Phase 2 with ML techniques, thus avoiding subjective criteria. Along with that, the dataset is preprocessed and cleaned by removing noisy data and missing values. The dataset is then used for training, and the classification algorithm is built for the online retail domain. After that, the model is evaluated by calculating the accuracy, and the ML-based product suggestions are presented to the user after applying the hybrid recommendation techniques based on CF and CBF;
• Phase ϕ2 includes the building of the online retail ontology before the evolution. The features selected in the machine learning process that give high accuracy are used as new inputs for enriching the old online retail ontology, which is built in a semi-automatic way with the standard cellfie plugin from the old dataset. This dataset records the users' past purchases and behavior.
The Fast Classification of Terminologies (Fact++) [34] reasoning plugin is applied to the old online retail ontology (before the evolution), which recommends products for users depending on their similar characteristics, preferences, and past transactions by applying CF and CBF implicitly;
• Phase ϕ3 entails the evolution of the old online retail ontology using the 2008 and 2009 versions of the database. This evolution process takes place by checking both the old online retail ontology and the 2008 and 2009 database, then adding the new individuals to the old online retail ontology. As a result, the evolved online retail ontology is produced. The Fact++ reasoning plugin is applied to the evolved online retail ontology as in Phase 2, so new product suggestions are shown to users according to the new purchases and behaviors. The two sets of recommendations (before and after the evolution) are then compared to highlight the changes in the recommendations. Experimental results and examples are shown in Section 6. Afterwards, the evolved online retail ontology is extracted in order to apply the ML algorithms to it and obtain product suggestions for the user using hybrid recommendation techniques;
• Phase ϕ4 applies NCF to the dataset extracted from the database, and recommendations are generated for the user both before and after adding the user feature (UF) layer. The last step is the evaluation. In order to evaluate the evolved ontology, two methods are used: the first is the calculation of precision and recall by a domain expert; the second is the implementation of the quality features dimension by calculating the cohesion and conceptualization. Subsequently, the reasoning results of the old and the evolved online retail ontology are re-evaluated by the domain expert by calculating the precision and recall. Figure 2 presents the implementation steps that the proposed NCFO framework follows.
The proposed NCFO framework comprises five steps. The first section of the framework includes the two datasets used in the experiment: the first is extracted from the Contoso database, and the second from the evolved online retail ontology. The framework then proceeds as follows. First, preprocessing is performed on both datasets. Second, the GMF method is used within the proposed NCFO framework by applying the dot product between the MF user-id and MF product-id embedding vectors. Third, the MLP takes the MLP user-id and MLP product-id embedding vectors, together with the new user feature (UF) layer, as its inputs. Fourth, the three paths are concatenated with each other to form the proposed NCFO framework. Lastly, the fifth step is the evaluation, which is performed on both datasets before and after adding the user feature layer.
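As an illustration of the preprocessing step, a minimal pandas sketch is given below; the column names (`CustomerID`, `StockCode`, `Quantity`) are hypothetical placeholders, since the actual Contoso and online retail schemas may differ:

```python
import pandas as pd

def preprocess_retail(df):
    """Minimal preprocessing sketch for a retail transactions table:
    drop rows with missing user or product ids, remove noisy rows with
    non-positive quantities, and integer-encode users and products so they
    can serve as embedding indices for the neural model."""
    df = df.dropna(subset=["CustomerID", "StockCode"])   # missing values
    df = df[df["Quantity"] > 0].copy()                   # noisy rows
    df["user"] = df["CustomerID"].astype("category").cat.codes
    df["item"] = df["StockCode"].astype("category").cat.codes
    return df

raw = pd.DataFrame({"CustomerID": [1.0, None, 2.0, 1.0],
                    "StockCode":  ["A", "B", "C", "A"],
                    "Quantity":   [2, 3, -1, 5]})
clean = preprocess_retail(raw)
```

The categorical encoding produces contiguous integer ids, which is the usual prerequisite for feeding one-hot or embedding layers in the subsequent GMF/MLP steps.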

Neural Collaborative Filtering Framework with Ontologies
The proposed NCFO framework for our retail market recommender is composed of the ensemble union of GMF and NCF, that is, generalized matrix factorization and neural collaborative filtering, with a neural network as the function approximator. This is the main computation in Phase ϕ4. The GMF block in the proposed architecture receives as input the item and user embeddings (their ids, in simple one-hot encoding); in a parallel branch of computation, the NCF block takes as inputs both the user and item embeddings and the subset of features for the user id (user features (UFs)). The complete system architecture is depicted in Figure 3, which shows three internal blocks: the NCF part Φ_NCF(·) (bottom part, in orange), the GMF part Φ_GMF(·) (middle section, in blue), and the novel deep neural network for integrating the ontology part (top section, in green), encoded as in (4). Each block shows its component layers, the input/output sizes, and the type of layer.
As seen in the figure, the UF is converted at the input layer to a sparse representation with one-hot encoding before undergoing the remaining steps. Next, all the embedded input vectors (user, item, and UF) are processed by several layers of a neural network. Note that one branch of this union network implements the generalized matrix factorization approach, while the complementary branch works on the UF by applying a conventional sequence of neural network layers with a decreasing number of nodes in each layer. This second path of computation naturally has more layers than the GMF counterpart, as expected, since the UF carries more information than the latent factor modeling upon which the GMF works. The interaction between the two paths of prediction/classification happens mainly at the end, where the inner representations found by each component are merged (by concatenation) and passed to the last hidden layer.
For the user and item embeddings, the dense neural layers have a decreasing number of units, in powers of two {128, 64, 32, 16, 8, 4}, and use ReLU as the activation function. The user features follow a deeper neural network with dense layers having {256, 1024, 128, 64, 32, 16, 8, 4} units, also with ReLU. In both cases (except the expansion layer for the UF), the dropout factor was set to 0.5 for the connections between two consecutive layers. The activation function for the last (output) layer is the sigmoid, and the loss function chosen was the binary cross-entropy, optimized via the Adam algorithm with a learning rate of 0.001. For training, the batch sizes were {32, 64, 128}, and the number of epochs was set to {100, 200, 400}. Training was performed on only a subset of the users, as the dataset was large, and tests were run with the Adam, Adagrad, and RMSprop optimizers. Figure 3 shows the three parallel processing paths: neural collaborative filtering (orange), generalized matrix factorization (blue), and the ontology embedding deep network (green).
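A shape-level numpy sketch of the three-path forward pass with the layer widths listed above is given below; the weights are random stand-ins for trained parameters, so it illustrates only the architecture's data flow, not the trained model (dropout, which acts only during training, is omitted):

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dense_stack(x, sizes, rng):
    """ReLU dense layers with the given output widths; weights are drawn at
    random here only to illustrate the shapes (training would fit them)."""
    for n in sizes:
        W = rng.standard_normal((n, x.shape[0])) / np.sqrt(x.shape[0])
        x = relu(W @ x)
    return x

def ncfo_forward(u_emb, i_emb, uf, rng):
    """Three-path NCFO sketch: GMF path (elementwise product of the user and
    item embeddings), MLP path on the concatenated embeddings with widths
    {128,...,4}, and the deeper user-feature path with widths
    {256, 1024, 128,...,4}; the paths are concatenated and mapped through a
    sigmoid to a purchase-probability score."""
    gmf = u_emb * i_emb                                   # GMF latent product
    mlp = dense_stack(np.concatenate([u_emb, i_emb]),
                      [128, 64, 32, 16, 8, 4], rng)
    ufp = dense_stack(uf, [256, 1024, 128, 64, 32, 16, 8, 4], rng)
    z = np.concatenate([gmf, mlp, ufp])                   # fusion by concatenation
    w = rng.standard_normal(z.shape[0]) / np.sqrt(z.shape[0])
    return sigmoid(w @ z)                                 # sigmoid output layer

rng = np.random.default_rng(1)
p = ncfo_forward(rng.standard_normal(8), rng.standard_normal(8),
                 rng.standard_normal(10), rng)
```

In the actual system, the final scalar weights and all stack parameters would be learned jointly with binary cross-entropy and Adam, as described above.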

Experimental Results
We now report on the implementation of the proposed algorithm and the experimental results obtained with the dataset. First, we explain the preprocessing step that embeds the classification techniques into the hybrid recommendation system, so that they can later be used in the CF and CBF recommendations. Next, we present the implementation of the proposed novel neural collaborative filtering framework with ontology integration (NCFO). This proposal extends the recently developed neural collaborative filtering framework [28], which already mixes CF and neural networks, with the information modeled by the ontology. We describe the modifications and advantages over the basic NCF approach and give an experimental performance evaluation. All the results were obtained by running the experiments on a computer with an Intel(R) Core i7 3.2 GHz processor with six cores and 16 GB RAM. The ML algorithms were implemented in Python 3.7 on the TensorFlow 2.0.0 and Keras 2.3.1 libraries. The tests for the ontology-based evaluation were performed on the same computer with the general-purpose software Protégé and its plugins, as described above. Figure 4, which is itself a part of Figure 1, explains in detail the system implemented in this work for composing the hybrid RS. The system was composed of three parts. The input to the first part was the database for the years 2007, 2008, and 2009, on which the ML process was applied; this process included feature selection and data cleansing, the selection of the classification algorithms, model evaluation, and the visualization of the results. The outputs were the selected features used in building the old online retail ontology, which consisted of 60 classes, over 100,000 declaration axioms, and more than 1.5M logical axioms from 113,953 individuals. The second part comprised the evolved online retail ontology, obtained after adding new individuals for the years 2008 and 2009 to the old ontology.
Finally, the third part produced the evolved datasets, extracted from the evolved online retail ontology, which served as the input for the ML and NCF blocks.

Description of the Dataset
The datasets used in this experiment were two versions of the same primitive dataset. The first version was the original data (Contoso [39]), which included users' features and the properties of products, useful and typical for personalized user recommendations, such as unique customer tags, gender, economic and social status, geographic (i.e., cultural) information, etc. The second dataset, in turn, was obtained from the first after evolving an ontology built up from the raw data. Algorithm 1 summarizes the steps carried out to establish the baseline classification results used during the integration, including testing and validation over both D_original and D_evolved.

Feature Selection
Let us first describe the essential information available in our database. The online retail dataset consisted of 36 features and 2,832,193 rows, for a total file size of over 800 MB. The number of unique customers was 18,484, while the number of unique products was 1860. Most of the features included in the dataset are self-explanatory (Table 1), and some of them are not significant for the recommendation outcomes. Among the numeric values, only Weight contained missing values (703,803 entries ≈ 24.85%), which were subsequently removed. For the experimental part, we determined that 10 features enclosed most of the necessary information, using two methods: the advice of an external consultant and a PCA of the raw data. Figure 5 shows the eigenvalues and the percentage of variance explained by each of the numeric features. We see directly that the features were almost orthogonal along the two main principal components and that there were substantial differences in variance and correlation among the features. The cumulative variance explained by the 10 most significant eigenvalues was 95.83%. Based on this exploratory computation, combined with the suggestions of an external expert in the area of retail markets, we decided to keep 10 out of the 36 features of the dataset for the subsequent stages of the analysis. Feature extraction is part of Phase ϕ1.
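The PCA screening step can be sketched as follows with scikit-learn; the random matrix is only a stand-in for the 36 numeric columns of the retail data, so the resulting component count is illustrative, not the paper's value of 10.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the numeric columns of the retail dataset.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 36))

pca = PCA().fit(X)
# Cumulative fraction of variance explained by the leading components.
cum_var = np.cumsum(pca.explained_variance_ratio_)
# Number of components needed to reach, e.g., 95% explained variance.
n_keep = int(np.searchsorted(cum_var, 0.95) + 1)
```

On the real data, the analogous computation reports the 95.83% figure quoted above for the 10 leading eigenvalues.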

Unsupervised Classification with Ontology Integration
In order to quantify the benefits of integrating machine learning with other recommendation approaches, namely, ontology-based (OB), which uses formal logic for reasoning, and CF or CBF, which are purely computational, we first needed to determine to what extent classical ML techniques can group and recognize as similar the user and item behaviors contained in the database. Since the dataset contained no labels for the classes or profiles that the customers belonged to, we were dealing with a typical unsupervised learning task. As a matter of fact, those classes or profiles were totally undefined in our setting, and the main goal of the ML task was then to implicitly define the features that could help structure the data into disjoint subsets. Unsupervised learning is often characterized by the presence of latent or hidden variables that cannot be directly observed and arise only through noisy transformations of the raw data. In the following sections, we give the technical details for the processing steps outlined in Algorithm 2, which is the core of Phase ϕ4. In summary, Algorithm 2 comprises: the user-item embedding for generalized matrix factorization; the user-item embedding for neural matrix factorization; the one-hot encoding of the user features, FeatureEmbedding ← e_{1(i = feature-id)}; the training of the ensemble classification of GMF, NMF, and NCF; and the evaluation over D_original and D_evolved.
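The one-hot feature-encoding step of Algorithm 2, FeatureEmbedding ← e_{1(i = feature-id)}, amounts to mapping each categorical feature value to a standard basis vector; a minimal sketch:

```python
import numpy as np

def one_hot(feature_id: int, n_features: int) -> np.ndarray:
    """Return the standard basis vector e_i for the given feature id."""
    v = np.zeros(n_features)
    v[feature_id] = 1.0
    return v

# Example: feature id 3 out of 10 possible values.
vec = one_hot(3, 10)
```

These sparse vectors are what the embedding layers of the network subsequently compress into dense representations.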

Baseline Hybrid Classification
We repeated the same baseline experiments over the evolved ontology. To that end, the procedure consisted of evolving the original ontology, the one developed for the oldest version of the dataset. The evolved ontology gave rise to new predictions corresponding to the newly added items, and these new predictions were inserted back into the dataset as the ground truth. Then, the classifiers were applied again over the modified (evolved) dataset. The results appear in Table 2. As shown in Table 2, there was an enhancement in the results of the classification algorithms after the evolution of the dataset. For KNN, the accuracy increased substantially from 87% to 94% on the dataset created from the evolved retail ontology, while for decision trees (DTs), the accuracy rose from 73% to 87%, which is even more remarkable. In a similar fashion, the precision increased as well in both cases, KNN and DT, between the prior and posterior versions of the ontologies (datasets). The main conclusion to draw from these results is clear: the semantic relationships and recommendations found by the ontology, either in its static version or in its evolved offspring, when introduced back into the dataset, enriched the patterns and could be used to better train the standard classification methods used in ML. We recall here that the main purpose of this numerical analysis was not to devise a good multi-class classifier for the retail data, but only to test whether the semantic information created by means of the ontology can be recognized and exploited by classical ML algorithms. The fact that some improvement in performance can be measured confirms that the recommendations discovered via semantic rules contained fresh information not present in the original (non-evolved) dataset. Consequently, the combination of the ontology-based output and the ML-based classifiers' input was beneficial for inference and prediction, as intuitively expected. This step is part of Phase ϕ3.
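The baseline evaluation loop (train KNN and DT classifiers, then measure accuracy and precision) can be sketched as follows; the synthetic dataset is a stand-in for the retail data, so the scores will not reproduce the figures of Table 2.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score

# Synthetic stand-in for the (evolved) retail dataset with 10 kept features.
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for name, clf in {"KNN": KNeighborsClassifier(),
                  "DT": DecisionTreeClassifier(random_state=0)}.items():
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    results[name] = (accuracy_score(y_te, y_pred),
                     precision_score(y_te, y_pred, average="macro"))
```

Running the same loop on the original and evolved datasets yields the before/after comparison discussed above.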

Neural Collaborative Filtering
In Phase ϕ 4 , the NCF and the ontology information were blended to generate improved recommendations.

Hyperparameter Setting
Based on our initial test cases, we decided to fix the train-test split ratio at 70%-30%, to strike a proper balance between overfitting and generalization. This fraction was therefore held constant over all the numerical experiments. Regarding the choice and tuning of the optimizer, we conducted tests with three optimizers, {Adam, Adagrad, RMSprop}, three batch sizes for evaluating the gradients, {32, 64, 128}, and different input sizes for training and testing, {100, 200, 400} different users. Numerical tests with more users are extremely intensive in computing time and were not attempted. Nevertheless, as we report below, these combinations suffice to make effective predictions and recommendations, so we conjectured that little improvement is to be gained by using larger input sizes. Naturally, this will depend on the diversity of the dataset. Table 3 presents a summary of the results obtained for determining the setting of the optimizer parameters, as well as for assessing the performance achievable with the proposed NCFO architecture. These data correspond to the non-evolved dataset (i.e., the one before the evolution), since the behavior of the evolved dataset was exactly the same. Based on the results listed in Table 3, we selected the Adam optimizer (learning rate 0.001, with batch size 64 or 128 and 200 epochs for training). Longer training is more prone to cause overfitting of the produced model, and, as Table 3 reveals, there were no further consistent improvements from extending the training beyond that number of epochs: convergence was attained well before that limit. Figure 6 shows the typical training and test loss curves, illustrating the convergence of the system around Epoch 100, consistently for every one of the performance metrics.
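The hyperparameter search described above amounts to a Cartesian product over optimizers, batch sizes, and training-set sizes; each of the 27 configurations would then be trained and evaluated (the training routine itself is omitted here):

```python
from itertools import product

# The grid explored in this section: 3 optimizers x 3 batch sizes x 3 user counts.
optimizers = ["adam", "adagrad", "rmsprop"]
batch_sizes = [32, 64, 128]
n_users_options = [100, 200, 400]

# Each tuple is one configuration to train and evaluate (27 in total).
grid = list(product(optimizers, batch_sizes, n_users_options))
```

The configuration finally selected above, Adam with batch size 64 or 128 and 200 training epochs, is one element of this grid.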
The running time of the training step for the NCFO hybrid system depended crucially on the number of internal parameters in the neural networks, the number of epochs, and the number of training samples. The computing time for training the system was only weakly dependent on the optimizer chosen. Though a complete analytical characterization of the computational complexity of our system was not possible (the internal optimization procedure via backpropagation is stochastic), we list in Table 4 some representative results from our tests. We concluded from these figures that the RMSprop optimizer required more than twice the training time, that the training time decreased almost linearly with the batch size, and that it increased linearly with the number of users in the training dataset. In view of these tests, we decided to take batch sizes of 64 and 128 for the evaluation and 200 epochs for training the algorithm, since these attained a good balance between the model accuracy and loss and the total running time. Next, we characterize the performance of the NCFO framework when making recommendations. To this end, we first present the performance results of the hybrid neural architecture when the input was the original non-evolved database and after the ontology evolution. Thus, using the hyperparameter configuration selected according to the criteria in the two preceding subsections (recall: the Adam optimizer with a learning rate of 0.001, batch sizes of 64 and 128, and 200 epochs for training), we obtained the results listed in Table 5. Table 5 contains the measured validation accuracy and validation loss for different test cases, as a function of the training dataset size, and compares the performance when the input dataset came from the non-evolved ontology (i.e., the original old dataset) and when the ontology was evolved.
Furthermore, we compare in the same table the results when the NCFO architecture disregarded the user features (No UF) and when those user features were included (UF). The former case represents the NCFO architecture without the information provided by the ontology, evolved or not. In other words, this is the system equivalent to the generalized matrix factorization (GMF) implemented through the neural collaborative filtering approach, and it should eventually be compared to a standard CF recommender. The latter case corresponds to the NCFO architecture with the complementary information given by the (evolved or non-evolved) ontology, i.e., the full architecture previously depicted in Figure 3.
As we can see in Table 5, there was a measurable improvement along both axes, i.e., when the user features output by the ontology reasoner were taken into account, and also when the novel information preprocessed by the ontology was used to train the system. Therefore, these experimental results confirmed our original intuition that the implicit information contained in a domain-specific ontology can be exploited in a generic ML architecture. Moreover, we can also verify that training the system with increasingly large numbers of users and items was also useful for improving the results, since it increased the variety of patterns to which the NCFO neural network was exposed. Indeed, if the input dataset had enough variance in the raw data, using more training cases did not automatically lead to overfitting of the model. This latter assertion, however, needs to be carefully verified, since it is very sensitive to the (lack of) similarity in the available dataset; nevertheless, this study was left outside the scope of the paper. The receiver operating characteristic (ROC) for the NCFO classifier is plotted in Figure 7. It was calculated using the one-vs.-rest policy (our classifiers were multiclass), and it confirmed the gains attained by incorporating the evolution and the UFs into the hybrid RS. The second evaluation methodology was the calculation of the hit ratio of the proposed hybrid recommender. Recall that the recommender outputs a list of items suggested for a given user, the top k items that best suit her/his preferences according to the algorithm, where k is a configuration parameter (k = 1, 3, 5, 10, for instance). Namely, the proposed NCFO classification system ranks the items in decreasing order of relevance according to the output of the ensemble neural classifier and selects the k highest-ranked products for recommendation, where k can be set as a parameter by the user.
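The one-vs.-rest ROC evaluation used for the multiclass classifier can be sketched with scikit-learn as follows; the labels and scores here are random placeholders, not the NCFO outputs, so the resulting AUC is near chance level.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Placeholder multiclass labels and class-probability scores (3 classes).
rng = np.random.default_rng(0)
y_true = rng.integers(0, 3, size=200)
y_score = rng.random((200, 3))
y_score /= y_score.sum(axis=1, keepdims=True)  # rows must sum to 1

# One-vs-rest area under the ROC curve, as in Figure 7.
auc = roc_auc_score(y_true, y_score, multi_class="ovr")
```

With the actual classifier probabilities in place of the random scores, this computation yields the per-class ROC comparison discussed above.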
For the calculation of the hit ratio, we followed the leave-t-out approach [17] common in most of the literature. We selected randomly and uniformly a subset L of users from the test set. For each user u ∈ L, we picked her/his t most relevant items i_1, . . . , i_t from the utility matrix and removed them, as if there had not been any interactions (u, i_j) in the test dataset. Next, we predicted the top k ≥ t items for user u with the NCFO approach and counted a hit (count one) if item i_j was one of the items in the recommended list R for some j = 1, . . . , t and a miss (count zero) otherwise. The size of the list L depended in our case on the size of the test dataset, which was in turn 30% of the total dataset, where this ranged from 100 to 400 users. In our case, we set t = k for all values k = 1, 3, 5, i.e., for the top-1, top-3, and top-5 recommendation lists.
Note that the definition of the top-k hit ratio is stringent for k = 1, as we required strictly that the recommended item be exactly the one suppressed from the test dataset. In contrast, the requirement for the hit ratio becomes looser as k increases, since we required only that at least one of the items in R coincide with the best-ranked items of user u, i.e., R ∩ {i_1, . . . , i_t} ≠ ∅. Moreover, note that the special case t = 0 (leave-zero-out) corresponds to the case where the recommendation list R is disjoint from the best-ranked list, so the user obtains as recommended products a list of items novel or different from the ones she/he has already purchased and knows.
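The hit-ratio count just described reduces to a set intersection per test user; a minimal sketch:

```python
def hit(recommended, held_out):
    """Count one hit if any held-out item appears in the top-k list R."""
    return int(bool(set(recommended) & set(held_out)))

def hit_ratio(rec_lists, held_out_lists):
    """Average hit count over the sampled users in L."""
    hits = [hit(r, h) for r, h in zip(rec_lists, held_out_lists)]
    return sum(hits) / len(hits)

# Toy example: the first user's held-out item 2 is recommended, the second's is not.
hr = hit_ratio([[5, 2, 9], [1, 4, 7]], [[2], [8]])  # -> 0.5
```

Applying this over the sampled users for each k gives the per-k hit ratios reported in Table 6.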
The results of the experiment for the hit ratio are presented in Table 6. We can see that the hit ratio for the top-one recommendation was rather low, since the system had to correctly identify the unique top-ranked product for the user. There was, however, a slight improvement when the user features and the ontology evolution were individually incorporated into the system. The performance improved substantially when we increased the value of k and enlarged the list of recommended products. In this way, for k = 3, we obtained an average hit ratio of around 65%, and again, we saw an improvement when the ontology evolution was considered, as well as when the user features were taken into account. Similarly, the performance further improved, up to a hit ratio of around 77%, when the list was expanded to the top-five recommendations for the random users. The key aspect to note is that the inclusion of the user features and the ontology evolution information led consistently to higher hit ratios, in the range of 3-4%. While this margin is moderate, it is meaningful, since we did not put special effort into designing an optimal neural network architecture. Accordingly, we conjectured that, with more fine-tuning and optimized layers and dimensions, the achievable hit ratio can still be increased.
The FaCT++ reasoner plugin was applied on the online retail ontology. Each customer individual in the ontology had data property assertions such as age, gender, number of children, marital status, education, and the order details for that individual. The reasoner detects the similarities between the customer individuals and recommends products semantically according to these similarities. The results are collected in Table 7 (ontology reasoning recommendations before and after the ontology evolution).
To simplify the presentation of the recommendation results, we randomly selected two pairs of users, (u_1, u_2) and (v_1, v_2), who were very similar to each other according to the cosine similarity measure (of their latent factors). Specifically, their similarities were 0.997 and 0.982.
After the evolution, new products (individuals) were added to the ontology, and, in all the test cases, the customers bought some of these newly added products. Therefore, after the evolution, the reasoner recommended from the new products that were added, according to the change in the customers' behaviors.

Recommendation Results Based on ML and the NCFO
This section presents the personalized recommendations generated for customers according to their similarities, using several techniques: the hybrid ML recommendations computed from the initial dataset extracted from the database and from the second dataset extracted from the evolved online retail ontology. After that, the recommendations produced by the neural collaborative filtering deep learning approach are presented as well, both before and after adding the user feature layer, and for the two settings before and after the evolution of the ontology.
The results included in this section constitute examples of recommendations for individual users in every case. We recall that the average quality of the recommendations was evaluated through the intrinsic performance of the NCFO (validation accuracy and loss) and, additionally, through the hit ratio, which measures the overlap between the recommended products and those highly ranked by a sample of users. The hit ratio is meaningful here because we did not have the possibility of collecting the users' opinions on the generated recommendations, so direct feedback was not available in our case.
Specifically, the lists of recommended products appearing in the tables below correspond to the case of the top-five recommendations obtained for the leave-zero-out policy. In other words, we did not remove from the test dataset any of the top-five highly ranked products (HRPs) already purchased by that user. As a result, the recommended products were disjoint with the HRPs. We emphasize that, in some test cases, the HRPs can be a multiset, i.e., the same product can appear more than once if there are not enough explicit preferences declared by that user according to her/his purchase history.
To simplify the recommendation results, we randomly selected two pairs of users, (u_1, u_2) and (v_1, v_2), who were very similar to each other according to the cosine similarity measure (of their latent factors). Specifically, their similarities were 0.997 and 0.982. Consequently, the purpose of the following examples is to give an empirical sample of the consistency, or coherence, of the recommendations. Formally, one possibility for measuring the coherence of a pair of users is the Jaccard index between the respective lists A, B of their recommended products, J(A, B) = |A ∩ B| / |A ∪ B|, i.e., the fraction of overlap between the two lists (J(A, A) = 1, J(A, B) = J(B, A)). A compound measure of coherence for a given set of test users U could then be defined by normalizing the aggregate pairwise coherence,

C(U) = Σ_{u,v ∈ U, u ≠ v} d(u, v) J(R_u, R_v) / Σ_{u,v ∈ U, u ≠ v} d(u, v),    (7)

where d(u, v) is the cosine similarity between u and v and R_u (resp., R_v) denotes the set of items recommended to user u (resp., v). However, the coherence formula (7), despite its simplicity, does not lend itself to a clear intuitive interpretation. The reason is that it aggregates the individual coherences according to the cosine similarities, so it depends functionally, in a non-trivial way, on the probability density function of d(u, v). Conversely, two very different distributions can have close values of the coherence (7). Since comparing two distributions, directly for d(u, v) or for the transformed coherence, can be performed in several forms depending on the statistical application, we decided not to work with the aggregated measure and simply show several test cases to give a rough idea of the practical results.
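For concreteness, one possible reading of the pairwise Jaccard overlap and the aggregated coherence of formula (7) is sketched below; `sim` is a hypothetical stand-in for the cosine similarity d(u, v), which would come from the latent factors in practice.

```python
def jaccard(a, b):
    """Fraction of overlap between two recommendation lists."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if (a | b) else 1.0

def coherence(users, rec, sim):
    """Cosine-similarity-weighted average of pairwise Jaccard overlaps."""
    pairs = [(u, v) for i, u in enumerate(users) for v in users[i + 1:]]
    num = sum(sim(u, v) * jaccard(rec[u], rec[v]) for u, v in pairs)
    den = sum(sim(u, v) for u, v in pairs)
    return num / den if den else 0.0

# Two users with identical recommendation lists are perfectly coherent.
c = coherence(["u1", "u2"],
              {"u1": [1, 2, 3], "u2": [1, 2, 3]},
              lambda u, v: 0.9)  # -> 1.0
```

As noted above, the interpretation of this aggregate depends on the distribution of d(u, v), which is why the paper reports individual test cases instead.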
In Tables 8-10, we present the outcomes of the recommendations for the first hybrid system (ontology + classification with KNN and DT), together with the highly ranked products (HRPs) and recommendation lists for the neural NCFO hybrid system. In every case, the results with and without the inclusion of the user features are included, and, for the NCFO, the results before and after the evolution of the ontology are given as well. This allows a direct comparison of the similarities between the recommendation lists. We can easily check that, for the two pairs of users who were strongly similar to each other (in cosine distance), the recommendation lists overlapped significantly, as expected. As explained above, this is a simple form of verification of the coherence of the recommendations.
Table 10. Top-four recommendations (NCFO) and highly rated products for two pairs of aligned users. Evolved dataset and ontology.
The expert also mentioned 15 recommendations that did not exist in the online retail ontology before the evolution. The total number of possible recommendations was therefore 32, and the recall was: Recall = (number of correct recommendations) / (total number of possible recommendations) = 17/32 = 53.12%.
The reasoning recommendation results of the online retail ontology after the evolution were presented to the domain expert, who evaluated the recommendations generated for the users included in the experiment. The expert identified 20 correct recommendations out of a total of 23, so the precision was Precision = 20/23 = 86.95%. Finally, the expert's evaluation reported 10 recommendations that did not exist in the online retail ontology before the evolution, so the total number of possible recommendations equaled 30, and the recall in this case was Recall = 20/30 ≈ 67%. We therefore saw that the evolved ontology, even though it grew by only a small fraction of its original size, substantially improved over the original performance values.
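The expert-based metrics above reduce to plain ratios; using the after-evolution figures from the text:

```python
def precision(correct, total_recommended):
    """Fraction of expert-validated recommendations among those produced."""
    return correct / total_recommended

def recall(correct, total_possible):
    """Fraction of all possible correct recommendations that were produced."""
    return correct / total_possible

# After the evolution: 20 correct out of 23 produced, 30 possible in total.
p = precision(20, 23)
r = recall(20, 30)
```

These evaluate to the 86.95% precision and roughly 67% recall quoted above.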

Conclusions and Remarks
The results reported in this work showed evidence allowing us to draw two main conclusions:

•	The information extracted by a logical reasoner based on a suitable ontology and, in parallel, from a neural collaborative filter can be combined so that the accuracy of the recommendations is improved. We showed results in this respect for the classification accuracy and also for the hit ratio, which is more meaningful for the recommendation of products;
•	Another dimension that can effectively be exploited to improve the quality of the predictions is the evolution of the ontology. Thus, a feedback loop in which novel data are inserted back into the ontology provides a two-fold benefit: it allows the system to evolve in time, capturing the time-varying behavior of user preferences, if present; and it naturally combines fresh information with past information without having to externally weigh the impact of each factor.