Predicting and determining antecedent factors of tourist village development using naive bayes and tree algorithm

This study aims to predict the progress status of tourism villages in the Kedung Ombo area, Java, Indonesia, and to identify the antecedent factors of the progress of tourism villages in Indonesia. The study uses a data mining approach. Data on tourist villages were drawn from material available through Google Search and from field observation. Prediction uses the Naïve Bayes and Decision Tree machine learning algorithms in Orange 3.3.0 software. The number of tourist villages analyzed was 126. The results showed that all tourist villages in the Kedung Ombo area were at the developing level of the four tourist village classifications of the Ministry of Tourism and Creative Economy. The antecedent factors for the progress of tourism villages are the completeness of ICT facilities, multi-stakeholder partnerships, strong government support, community involvement, and various attractions. Another finding is that the Decision Tree algorithm provides better predictions than the Naïve Bayes method. The results of this study can be used to design policies for developing tourist villages throughout Indonesia.


Introduction
Rural tourism is becoming a significant trend, especially in developing countries, as a manifestation of the concept of community-based tourism (CBT), which is deemed able to counteract the negative impacts of mass tourism related to social equality, environmental degradation, and preservation of community culture (Khalid et al., 2019; Muganda et al., 2013). Rural tourism could be a vehicle of sustainable development that could generate employment and income, prevent rural exodus, encourage socio-economic networks, save and enhance cultural and natural heritage, and improve the quality of life for residents (Rodrigues et al., 2021; Powell et al., 2018). Gohori & van der Merwe (2020) propose a reciprocal relationship between tourism, poverty alleviation, and community development. In the context of sustainability, rural tourism is synonymous with sustainable tourism development in nature, scale, character, and development process (Sharpley & Roberts, 2004).
Rural development through tourism has become an essential concept for both developed and developing countries as it represents a process of mobilizing innovation and aligning change, focusing on increasing opportunities for the population, economic growth, protection of natural resources, and social equality. Rural tourism is considered capable of supporting development in rural areas that are structurally weak (Neumeier & Pollermann, 2014). In rural areas, especially in developing countries, tourism is enthusiastically accepted as a panacea for revitalizing the rural economy, thus prompting many government agencies, particularly those related to tourism, to invest in promoting more sustainable forms of community-based rural tourism (Kamarudin et al., 2020). The tourist village program is a priority rural development program (Ariyani et al., 2022).
In Indonesia, rural tourism is manifested in the form of tourism village development (TVD), which since 2021 has been set by the Coordinating Ministry for Economic Affairs. TVD is directed toward increasing economic growth for people's welfare, eradicating poverty, overcoming unemployment, preserving nature, the environment, and resources, and promoting culture. The development of tourist villages is expected to be one form of accelerated, integrated village development that encourages the village's social, cultural, and economic transformation. The success of the tourist village could become leverage for the village and regional economy, which will ultimately encourage national economic growth. In Indonesia, tourist villages are categorized as pilot, developing, developed, and independent villages (Ariyani & Fauzi, 2023).
Along with these provisions, various rural areas have developed tourist villages. No fewer than 1,870 tourist villages are spread throughout Indonesia. One area designated for tourist villages is the Kedung Ombo reservoir in Central Java, which has eight tourist destinations built on the tourist village concept. Against the background of the limited benefits of the reservoir for communities in the upstream area, several community groups take advantage of the reservoir panorama as a tourist attraction. The tourist villages were expected to help solve the area's socio-economic problems and infrastructure limitations.
However, to date, these efforts have not shown significant progress. Instead of creating alternative jobs for the local community and reducing poverty, the tourist villages have not been able to bring in enough visitors even though they have been developed over the years. As an integral part of the national tourism development agenda, this community initiative in the Kedung Ombo area needs support.
One of the reasons for the condition of tourist villages in the Kedung Ombo area is the development approach, which is based more on conventional methods that focus on in situ characteristics of the villages. Although this approach has advantages related to the ability to identify local needs, it has disadvantages due to the lack of understanding of hidden factors that may determine the development of the tourist village. For example, several tourist villages have the same characteristics, yet they may produce different outcomes related to the performance of the tourist village.
This shortcoming can be bridged with a modern approach based on data mining through machine learning algorithms such as Naïve Bayes and decision trees. Machine learning is an analytical method that uses past information, usually in the form of available electronic data, to make accurate predictions. In principle, this approach is carried out by harvesting big data. Machine learning then produces forecasts of whether the tourist villages develop successfully or not. In addition, machine learning algorithms such as decision trees can be used to determine which factors (the antecedent factors) produce a pattern of success for tourist villages.
Machine learning has been widely used in various fields of study, including economic, social, environmental, technological, and political research (Ariyani et al., 2022). For example, a Naïve Bayes (NB) algorithm has been trained to classify the gage length of wheat straw based on target mechanical properties (Naik & Kiran, 2018). In the field of tourism, machine learning has been used to forecast demand for tourism (Ahmed et al., 2007; Li, 2022; Yu & Chen, 2022), to design marketing strategies for rural tourism (Xie & He, 2022), and to recommend smart tourism strategies (Ho, 2022). The use of machine learning for tourism cases in Indonesia is limited to several aspects, such as predictions of international tourist arrivals during the Covid-19 period (Andariesta & Wasesa, 2022) or estimation of international tourists (Purnaningrum & Athoillah, 2021). There remains considerable room for tourism analysis with machine learning in the current case, such as predicting whether a tourist village will develop and what antecedent factors determine the development of the tourist area. This study aims to predict the progress of tourist villages in the Kedung Ombo area and determine the antecedent factors of successful development.
Rural tourism has various definitions with a very broad scope. Researchers from different countries develop their definitions based on the unique experiences or contexts they encounter (Nair et al., 2015). According to Tang (2022), rural tourism has no single definition. This statement implies that determinants of the success of tourist villages are difficult to identify (Ayazlar & Ayazlar, 2015).
In their research, Rodrigues et al. (2021) found that community involvement is a critical factor in developing tourist villages. Community participation is vital in creating sustainable community-based tourism (Amin & Ibrahim, 2015). In line with that, Bajrami et al. (2020) stated that one of the critical factors for the success of sustainable tourism in rural areas is the support from community members. Community participation can empower communities and significantly contribute to rural tourism development, which helps address poverty, depopulation, hydro-geological instability, and degradation of cultural heritage and landscapes (Basile et al., 2021). Local people play a crucial role in developing sustainable tourism in rural areas (Yu et al., 2018). They must be involved in decisions that will influence themselves, their families, and their communities (Powell et al., 2018).
Sustainable tourism development requires involvement from all relevant stakeholders as well as strong political leadership to ensure broad participation and build consensus (Kantsperger et al., 2019). The government is a leading actor with a paramount role in tourism development (Firdaus et al., 2021; McLennan et al., 2014). A more participatory rural development involving horizontal and vertical coordination places the government as the driver of the partnership process between stakeholders in helping to develop and oversee the strategic direction of rural tourism development (Koopmans et al., 2018). This interventionist approach, in which the government takes a more active role in tourism development, has become widespread worldwide, even in countries with different ideologies (Liu et al., 2020).
In its development, CBT requires public partnerships both in local and global contexts (Purbasari & Manaf, 2018). Regardless of the type of CBT services, these ventures should remain wholly owned, managed, and controlled by community members (or groups of independent micro and small enterprises under the same CBT management organization). Meanwhile, external partners should provide facilitative and other support services instead of being a partner in the CBT venture itself (Mtapuri & Giampiccoli, 2013).
Research conducted by Kristianto et al. (2019) in the tourist village of Pahawang, Lampung, Indonesia, found that the antecedents of successful rural tourism development include attractions, amenities, accessibility, image, human resources, and tourism prices. The determination of ticket prices in rural tourism is a factor directly related to the interests of each stakeholder (Wu et al., 2017). Another factor is infrastructure, including homestays (Bhalla et al., 2016). Information and communication technology (ICT) supports the development of rural tourism as a medium for promotion, booking, and payment transactions (Hidayatullah et al., 2018; Waghmode & Jamsandekar, 2013). ICT-based management can be formulated into a tourist village development strategy (Pantiyasa et al., 2019). From evaluations in several countries in Asia, León-Gómez et al. (2021) identified the following criteria for the success of CBT: involving the masses of people; benefits being distributed equitably to all communities; good tourism management; strong partnerships both inside and outside; unique attractions; environmental preservation; the uniqueness of the location; the facilitation of existing embryo activities; involvement of the broader community as tourism actors; and partnerships. In a different context, Yang et al. (2016) stated that tourism resources, tourist traffic, and social and economic factors drive island tourism. Government policies, tourism companies, and the tourist market are external drivers.
The diversity of findings on the determinants of the success of the development of tourist villages can be used as material for analysis in developing tourist villages in Indonesia. The in situ characteristics of tourist villages in Indonesia are closely related to the community's geographical and socio-economic conditions and require guidance if the villages are to be developed successfully in the future.
This research has two objectives. The first is to predict tourist village development in the Kedung Ombo area. The second is to analyze the antecedent factors that determine the success of tourist village development.

Methodology
This study uses a modern approach, namely data mining, to achieve the research objectives. Data mining is a data acquisition and information collection methodology that can guide decision-making efficiently by extracting and analyzing accumulated datasets (big data) to obtain helpful knowledge (Adekitan et al., 2019). To achieve the first objective, the Naïve Bayes and Decision Tree algorithms are used, and for the second objective, the Decision Tree algorithm is used. All analyses were performed using the Orange 3.3.0 software.
To analyze the tourist village development profile, nine (9) attributes are used: (1) the level of community participation, (2) the variety of attractions, (3) the level of government support, (4) the intensity of partnership, (5) the completeness of infrastructure, (6) the distance of the object location to the main road, (7) the ticket price, (8) management, and (9) the application of information and communication technology (Table 1). To analyze the progress status of tourist villages, the villages are grouped into developed and progressive tourism villages based on the modified criteria of the Ministry of Tourism and Creative Economy: (1) the number of tourist visits both from within and outside the region, (2) the ability of the community to manage tourism businesses, and (3) the creation of employment from tourism.
Following the machine learning approach, the data in this study consist of two categories, namely training data and testing data. The training data represent features from 126 tourism villages and are used to build the prediction model. They were obtained from various news items and reviews about tourist villages in Indonesia found through Google Search. The testing data are the profiles of eight tourist villages in the Kedung Ombo area (Table 2), to which the model built from the training data is applied. The data for the Kedung Ombo area were obtained using the observation method.
The Naïve Bayes algorithm is a probabilistic classification algorithm regularly used to handle big data (Panawong et al., 2014). Naïve Bayes is widely used because it is simple, measurable, and efficient in classifying (Ramoni & Sebastiani, 2001; Naik & Kiran, 2018). Naïve Bayes uses probability and statistics and is based on Bayes' theorem. Probability is the chance that an event will occur randomly. Bayes' theorem was introduced by Thomas Bayes (1701-1761) and expresses conditional probability, i.e., how the probability that an event will occur is affected by a previous event. In this research, the Naïve Bayes method is used as the first approach to prediction. The equation of Bayes' theorem is as follows:
$$P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \qquad (1)$$
P(A|B) is the posterior probability, which indicates how often event A occurs under condition B. P(B|A) is the likelihood, which shows how often B occurs given that A occurs. P(A) and P(B) are the probabilities of events A and B, respectively; P(A) is the prior probability.
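As an illustration of how Bayes' theorem turns attribute frequencies into a class prediction, the following is a minimal sketch of a categorical Naïve Bayes classifier in plain Python. It is not the Orange implementation used in this study: the two attributes, their values, the five toy villages, and the add-one (Laplace) smoothing are all assumptions made for the example.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)   # (attribute index, class) -> value counts
    distinct = defaultdict(set)           # attribute index -> values seen
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
            distinct[i].add(value)
    return class_counts, value_counts, distinct

def posterior(row, model):
    """Return P(class | row) for every class using Bayes' theorem."""
    class_counts, value_counts, distinct = model
    total = sum(class_counts.values())
    scores = {}
    for label, n_label in class_counts.items():
        score = n_label / total  # prior P(A)
        for i, value in enumerate(row):
            # likelihood P(B_i | A), with add-one smoothing (an assumption)
            score *= (value_counts[(i, label)][value] + 1) / (n_label + len(distinct[i]))
        scores[label] = score
    evidence = sum(scores.values())  # P(B), the normalizing constant
    return {label: s / evidence for label, s in scores.items()}

# Hypothetical toy data: (ICT level, partnership level) -> village status
rows = [("high", "high"), ("high", "low"), ("low", "low"),
        ("low", "high"), ("high", "high")]
labels = ["developed", "developed", "developing", "developing", "developed"]

model = train_naive_bayes(rows, labels)
probs = posterior(("high", "low"), model)
predicted_status = max(probs, key=probs.get)
print(predicted_status, probs)
```

The "naïve" part is the product over attributes, which treats each attribute as independent given the class.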
The decision tree is a classification method that applies a tree structure or decision hierarchy. According to Anggarwal (2015), it is a classification method whose model uses a set of hierarchical decisions on the feature variables, arranged in a tree structure. Decision trees are easy to understand and often accurate (Witten et al., 2017). The commonly used decision tree algorithms are ID3, C4.5, and CART. Iterative Dichotomiser 3 (ID3) is an algorithm with a basic iterative structure in which the data are partitioned on one feature at each step. This method produces a classification in the form of a decision tree that starts from the root of the tree of possible decisions. Quinlan (1992) developed an improvement on this method and called it the C4.5 algorithm. Breiman (2001) also developed another decision tree algorithm called the classification and regression tree (CART), which divides a data set into two subsets at each split. The CART method calculation process has several stages (Anggarwal, 2015).
Let $S$ be a set of data points and let $p$ be the fraction of points in $S$ that belong to the dominant class. The error rate is then $1 - p$. For an r-way split of $S$ into the sets $S_1, \ldots, S_r$, the error rate of the split can be quantified as the weighted average of the error rates of the individual sets $S_i$, where the weight of $S_i$ is $|S_i|$. The split with the lowest error rate is selected from the alternatives. The Gini index $G(S)$ of a set $S$ is computed from the class distribution $p_1, \ldots, p_k$ of the training data points in $S$:
$$G(S) = 1 - \sum_{j=1}^{k} p_j^2$$
The overall Gini index for the r-way split of $S$ into $S_1, \ldots, S_r$ can be quantified as the weighted average of the Gini index values $G(S_i)$ of each $S_i$, where the weight of $S_i$ is $|S_i|$:
$$\text{Gini-Split}(S \Rightarrow S_1, \ldots, S_r) = \sum_{i=1}^{r} \frac{|S_i|}{|S|}\, G(S_i)$$
The split with the lowest Gini index is selected from the alternatives. The CART algorithm uses the Gini index as the split criterion.
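The split selection described above can be sketched in a few lines of Python. The sketch assumes categorical attributes and splits each attribute into one subset per value (an r-way split); the attribute values and the four toy villages are hypothetical, and the code is an illustration rather than the procedure Orange runs internally.

```python
from collections import Counter

def gini(labels):
    """Gini index of one set: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def gini_split(subsets):
    """Weighted average of the Gini index over the subsets of a split."""
    total = sum(len(s) for s in subsets)
    return sum(len(s) / total * gini(s) for s in subsets)

def best_attribute(rows, labels):
    """Return (index of the attribute with the lowest split Gini, all scores)."""
    scores = []
    for i in range(len(rows[0])):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[i], []).append(label)
        scores.append(gini_split(list(groups.values())))
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical toy data: (ICT level, distance) -> village status
rows = [("high", "near"), ("high", "far"), ("low", "near"), ("low", "far")]
labels = ["developed", "developed", "developing", "developing"]

best, scores = best_attribute(rows, labels)
print(best, scores)
```

In this toy data the first attribute separates the classes perfectly, so its split has the lowest Gini index and would be chosen for the root node.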
The data mining workflow using the Orange machine learning software in this research is illustrated in Figure 1.

Performance Evaluation
Performance Test
A performance test is used to select the best validation and learning model through cross-validation. This process tests which algorithm gives the best classification probability and is suitable for use as a prediction. These results can be seen in the prediction scores and test scores. The performance test was carried out using the cross-validation method as a sampling method because this method is effective in avoiding unintentional effects, primarily those due to data limitations. This method was also suggested by Witten et al. (2017).
The learning technique separates the data into two categories: training data to form the model and testing data to test the model's performance. A good model yields classification results that are accurate, that is, rarely incorrect. The data are divided into k parts out of n data points, a procedure known as k-fold cross-validation. The folds are stratified so that each iteration is representative and every data element is used. The average of the results obtained across the iterations is used as the validation value.
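The fold construction can be sketched as follows. This is a minimal, deterministic illustration of stratified k-fold splitting in plain Python; the value of k, the class counts, and the function names are assumptions for the example, and real tools typically shuffle the data before assigning folds.

```python
from collections import defaultdict

def stratified_k_fold(labels, k):
    """Yield (train_indices, test_indices) for k folds that keep the class mix."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    # Deal the indices of each class round-robin across the folds.
    for indices in by_class.values():
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    for i in range(k):
        test = sorted(folds[i])
        train = sorted(idx for j, fold in enumerate(folds) if j != i for idx in fold)
        yield train, test

def cross_validate(labels, k, evaluate):
    """Average a caller-supplied evaluate(train, test) score over the k folds."""
    results = [evaluate(train, test) for train, test in stratified_k_fold(labels, k)]
    return sum(results) / len(results)

# Hypothetical class labels for ten villages
labels = ["developed"] * 6 + ["developing"] * 4
splits = list(stratified_k_fold(labels, k=2))

def majority_accuracy(train, test):
    """Toy evaluator: accuracy of always predicting the majority training class."""
    train_labels = [labels[i] for i in train]
    majority = max(set(train_labels), key=train_labels.count)
    return sum(labels[i] == majority for i in test) / len(test)

mean_accuracy = cross_validate(labels, 2, majority_accuracy)
print(splits, mean_accuracy)
```

Every data point appears in exactly one test fold, and the reported score is the mean over the folds, which matches the validation value described above.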
The measurement of performance values is based on the confusion matrix, which compares the predictions generated by the machine learning algorithm with the actual condition of the data (Table 3). True Positive (TP) is a positive and correct prediction; True Negative (TN) is a negative and correct prediction; False Positive (FP) is a positive and incorrect prediction, while False Negative (FN) is a negative and incorrect prediction. Based on the Confusion Matrix in the Naïve Bayes method, prediction performance is measured using the following values:

Table 2. Confusion Matrix
(1) Area Under Curve (AUC). The AUC describes how accurately the model classifies and is obtained by calculating the area under the Receiver Operating Characteristic (ROC) curve. An excellent model has an AUC value close to 1. The accuracy of the predicted values was confirmed using the criteria developed by Gorunescu (2011) in Table 3.
(2) Classification Accuracy (CA). CA is the number of correct predictions (true positives and true negatives) divided by the total number of predictions. As with AUC, the closer the CA value is to 1, the more accurate the model prediction.
$$\mathrm{CA} = \frac{TP + TN}{TP + FP + FN + TN}$$
(3) Precision. Precision is the ratio of true positive predictions to all positive predictions.
$$\mathrm{Precision} = \frac{TP}{TP + FP}$$
(4) Recall. Recall is the ratio of true positive predictions to all actual positive cases.
$$\mathrm{Recall} = \frac{TP}{TP + FN}$$
(5) F1. The F1 score combines Recall and Precision (described above) into one performance metric. It is the harmonic mean of Precision and Recall and therefore takes into account both false positives and false negatives.
$$F_1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
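To make the five measures concrete, the sketch below computes them in plain Python for a two-class problem. The eight toy villages, their predicted probabilities, and the 0.5 decision threshold are assumptions for illustration. The AUC is computed with the pairwise ranking formulation (the share of positive-negative pairs in which the positive case receives the higher score), which equals the area under the ROC curve. Software packages may instead report values averaged over classes, so the numbers need not match a given tool's output.

```python
def confusion_counts(actual, predicted, positive):
    """Count TP, FP, FN, TN with one class treated as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def scores(tp, fp, fn, tn):
    """Classification accuracy, precision, recall, and F1 from the four counts."""
    ca = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"CA": ca, "Precision": precision, "Recall": recall, "F1": f1}

def auc(actual, positive_scores, positive):
    """Share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for a, s in zip(actual, positive_scores) if a == positive]
    neg = [s for a, s in zip(actual, positive_scores) if a != positive]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: true status and predicted P(developed) per village
actual = ["developed"] * 4 + ["developing"] * 4
prob_developed = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]
predicted = ["developed" if p >= 0.5 else "developing" for p in prob_developed]

counts = confusion_counts(actual, predicted, "developed")
result = scores(*counts)
area = auc(actual, prob_developed, "developed")
print(counts, result, area)
```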

Results and Discussion
Results
From the machine learning process on the training data and testing data using the Naïve Bayes and Tree methods, predictions regarding the status of villages in the Kedung Ombo area are obtained, as shown in Table 4. The prediction results show that the eight villages in the Kedung Ombo area, which are the basis of this study, are predicted to be developing tourism villages.

Figure 2. ROC Analysis
The results of this prediction were then tested for accuracy using the Confusion Matrix (Figure 3; Table 6). The Confusion Matrix test results show that the predictive values for all the methods used are classified as excellent and good. The Naïve Bayes algorithm provides more consistent validation than the other method in three of the five parameters. Thus, this method is considered the most accurate in providing views on development policies in tourist villages in the Kedung Ombo area.
The overall distribution of tourist villages based on cross-correlation between several tested attributes is shown in Figure 4.

Discussions
The attributes that most determine the progress of a tourist village can be identified from the decision tree model. The results of the decision tree classification provide a good analysis of the predictions generated. The gain ratio value from the decision tree determines which variable is used for each split. The decision tree begins with the formation of the root (located at the top). The data are then divided on suitable attributes into branches that end in leaves.
Decision rules are formed from the Tree method and are derived by tracing from the root to the leaves. Based on the dataset processing using the Tree method, the classification process obtained an AUC value of 96%, which means that predictions through the decision tree method are excellent. The decision tree in Figure 5 provides information about the attributes and conditions that determine the progress of a tourist village.

Figure 5. Decision Tree Prediction Results
Based on the decision tree (Figure 5), the factors determining the progress of a tourist village are the type of collaborative management, a location distance from 0 km to 30 km or 10 km to 30 km, various attractions, a maximum ticket price of 30,000 rupiahs, ongoing partnerships with multiple partners (government, academia, and the private sector), and good ICT facilities and applications.
Furthermore, the decision tree prediction results show the order of the attributes that most influence the progress of the tourist village. The Orange operation shows the ranking results for the antecedent factors of the progress of tourism villages (Table 7). In Table 7, ICT is the attribute that has the most significant influence on the progress of tourist villages across the various measurement bases (Info. gain, Gain Ratio, Gini, and χ²). At the next level is the attribute of partnership, followed by government support, community participation, and infrastructure. Next are the tourist attractions. The management and distance attributes have little influence, and the ticket price attribute has almost no effect.
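Two of the ranking measures named above, information gain and gain ratio, can be sketched in plain Python as follows. The attribute names, their values, and the four toy villages are hypothetical, and the code illustrates the standard definitions rather than reproducing the Orange Rank output in Table 7.

```python
from collections import Counter
from math import log2

def entropy(items):
    """Shannon entropy (in bits) of a list of categorical values."""
    n = len(items)
    return -sum((c / n) * log2(c / n) for c in Counter(items).values())

def info_gain_and_ratio(values, labels):
    """Information gain and gain ratio of one attribute against the class."""
    n = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    # Expected class entropy after splitting on the attribute
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    gain = entropy(labels) - remainder
    split_info = entropy(values)  # entropy of the attribute's own values
    ratio = gain / split_info if split_info > 0 else 0.0
    return gain, ratio

def rank_attributes(columns, labels):
    """Return attribute names ordered by decreasing information gain."""
    scored = {name: info_gain_and_ratio(vals, labels) for name, vals in columns.items()}
    ranking = sorted(scored, key=lambda name: scored[name][0], reverse=True)
    return ranking, scored

# Hypothetical toy data for four villages
labels = ["developed", "developed", "developing", "developing"]
columns = {
    "ticket_price": ["cheap", "costly", "cheap", "costly"],
    "ict": ["high", "high", "low", "low"],
}

ranking, scored = rank_attributes(columns, labels)
print(ranking, scored)
```

An attribute whose values separate the classes, like the toy `ict` column, receives the highest gain, while one that is unrelated to the class, like the toy `ticket_price` column, scores zero.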
These findings can be explained by the presence of ICT (especially the internet), which has changed tourists' behavior, especially in seeking information. Therefore, ICT should be utilized to create ideas and information about rural tourism in the minds of tourists and to make it easier to find information on tourism services, accessibility of the locality, ticket booking, payment methods, and other tourist facilities. In the case of rural tourism, ICT can facilitate the establishment of relationships between tourists and managers, even after the visit is over. Managers can build a visitor database to develop these relationships and so contribute to building visitor loyalty. Friendly service during and after travel activities, supported by ICT facilities, can determine this loyalty.
The next most influential determinant of tourist village progress is multi-stakeholder partnerships. The factors that follow are strong government support, community involvement, completeness of infrastructure facilities, and various attractions. However, the characteristics of each village are also factors that must be developed in an integrated manner to accelerate the progress of tourist villages.
The results of this study support research conducted by Hidayatullah et al. (2018) and Waghmode & Jamsandekar (2013), which found that ICT plays a role in supporting the development of rural tourism and tourist villages as a promotional medium and as a booking and payment facility. Regarding partnerships, the results of this study are in accordance with research conducted by Purbasari & Manaf (2018) and Mtapuri & Giampiccoli (2013), which state that rural tourism development requires public partnerships in both local and global contexts. Regarding community participation, the results of this study follow those of Rodrigues et al. (2021), Bajrami et al. (2020), Basile et al. (2021), Yu et al. (2018), Powell et al. (2018), and Amin & Ibrahim (2015), who found that community involvement is a critical factor in the development of tourist villages. Regarding the completeness of tourism infrastructure, the results of this study are in accordance with research conducted by Bhalla et al. (2016).

Conclusions
Tourism management, especially rural-based tourism, faces complex and challenging issues. The dynamic interaction between various components affects the tourism sector's performance. Hence, a paradigm shift is needed in determining the right rural tourism policy. The use of machine learning will be critical to help develop science-based and data-driven policies in the future. Besides being used to predict whether or not a tourist village will develop, machine learning will also reduce bias in determining tourism development policies, which tend to be more subjective.
In this study, the Decision Tree algorithm is the best approach to predict the progress status and determine the antecedent factors of the progress of tourist villages. Likewise, the variables analyzed, namely community participation, diversity of attractions, government support, partnerships, infrastructure, ticket prices, distance to locations, management, and ICT, are the right factors to predict village status.
Among the antecedents of these variables, this study finds ICT is the main factor predicted to determine the progress of tourist villages.However, government support, completeness of infrastructure, community participation, multi-stakeholder partnership, and various tourist attractions are other essential factors that must be developed in integrated ways for tourist village development.

Figure 1. Image Data Mining to Determine the Status of Tourism Villages
Confusion matrix cells: True Positive (TP) = A; False Positive (FP) = B; False Negative (FN) = C; True Negative (TN) = D

Figure 3. Confusion Matrix on Naïve Bayes and Tree Method
Based on the formula given previously, the results of the model performance test based on the Confusion Matrix are shown in Table 6.

Figure 4. Scatter Plot between Target and Feature Variables

(Source: Data processed with Orange Software 3.3.0)

Table 1. Attributes of Tourism Village Development and Measurement Scale

Table 2. Profile of Tourist Village in the Kedung Ombo Area

Table 3. Prediction Value Classification

Table 6. Prediction Test Results through Confusion Matrix from Naïve Bayes and Tree Method

Table 7. Attribute Rank Determining Tourism Village Progress