Optimization of data integration using schema matching of linguistic-based and constraint-based in the university database

: University requires the integration of data from one system with other systems as needed. This is because there are still many processes to input the same data but with different information systems. The application of data integration generally has several obstacles, one of which is due to the diversity of databases used by each information system. Schema matching is one method that can be used to overcome data integration problems caused by database diversity. The schema matching method used in this research is linguistic and constraint. The results of the matching scheme are used as material for optimizing data integration at the database level. The optimization process shows a change in the number of tables and attributes in the database that is a decrease in the number of tables by 13 tables and 492 attributes. The changes were caused by some tables and attributes were omitted and normalized. This research shows that after optimization, data integration becomes better because the data was connected and used by other systems has increased by 46.67% from the previous amount. This causes the same data entry on different systems can be reduced and also data inconsistencies caused by duplication of data on different systems can be minimized.


Introduction
The use of information technology currently plays an important role in the success of an organization. Digital information storage can assist organizations in accessing information and processing it quickly [1]. One form of the use of information technology is the use of information systems for various organizational activities. Technology plays a role in demanding organizations have information systems that can support management in decision making [2]. Information systems that are supported by integrated data can produce useful information for their users [1]. Data integration is a process carried out to combine data from various sources to support information management [3]. Data integration is one of the things needed to support the planning and development of an organization that will provide added value in competition with competitors. The implementation of integration will provide benefits if the application follows the goals, vision, mission of the organization, and also the strategies that have been determined to align the information system strategy and business strategy [4]. This data integration is needed because with the development of an organization the information needed also develops so that there is some data and information needed from different divisions or units [5]. Data or information needs that are different require the existence of data integration between each division following the needs.
Higher education is one of the institutions that use information systems to achieve their goals. However, most of these information systems are not integrated, it can be seen from the discovery of the same data entry process in different information systems [6]. Such as the reregistration process, students enter the same data as the data that was previously entered into the registration system, students must register various information systems to get Wi-Fi access, get an email account, access the library, etc. This can cause data duplication and data redundancy. Data duplication can cause data inconsistency [1] and also take up space in the database where it can be used to store other data. Data duplication is caused by the process of entering the same data in different information systems where the process can allow data inconsistencies to occur. This means that the data is the same but it is not consistent so it might be considered different. The impact of the inconsistency of the data is that it can cause discrepancies in the information generated that can be detrimental to those who need the information.
Data integration can solve these problems and reduce the occurrence of data redundancy because data integration can make a relationship where data in an information system can be used on other systems to facilitate the process of sharing data used to support data and information management [7]. In the application of data integration in general there are several obstacles, one of which is the diversity of database schemes used [8]. The diversity of this scheme occurs because the amount of information stored in the database continues to grow, this causes the need for the information to be stored in several different databases, and database integration is a very important aspect in maintaining consistency between these databases on an ongoing basis [9].
In the process of data integration, it is necessary to match the schema of several relationships between the data with each other [10]. Schema matching is one method that can be used to overcome data integration problems caused by database diversity. Schema matching is an important thing and requires sufficient time in the data integration process [11]. Schema matching is used to detect relationships and similarities between elements of two or more schemes [12]. Schema matching has an important role in applications that require interoperability between diverse data sources [13]. Research related to schema matching was conducted by Rachman and Saptawati. The research shows that database integration problems can be solved by using a hybrid matching schema that involves linguistic-based and constraint-based. From several experiments that have been carried out by considering conflicting elements in the identification of the schema, schema matching uses a hybrid approach that can be used for rewriting queries on multiple databases [14]. Another research showed that the linguistic method is one of the schema matching methods that can be used to find similarities between schemes and can be used as material for optimizing database development [1]. Research conducted by Sutanta and colleagues combines two-hybrid schema matching methods to overcome database integration problems caused by database diversity [15].
In this research, the schema matching method used is linguistic and constraint. The constraint is part of an effort to maintain the integrity, consistency, and value of data so that it complies with certain valid limits that have been determined [16]. This constraint method performs matching schema based on attributes, data type, data width, nullity, uniqueness, and domain. Whereas the linguistic-based method matching scheme is one of the matching schema methods related to the similarity or similarity of the element names in a scheme both syntactically and semantically [17]. The linguistic method is a scheme matching method that is carried out by looking at the similarity of the naming of elements in a database [18]. The linguistic method used is N-Gram. N-Gram is an algorithm used to calculate a syntactic similarity. N-Gram has two stages in measuring the similarity between two strings, namely breaking the string into N characters and then calculating the similarity. Whatever the equivalent of N-Gram is as follows s: a symbol that represents a reference database element (Source) t: a symbol that represents the database element to be matched (Destination) The similarity tested using this linguistic method is the similarity based on the name of the table between the pair of tables from the source database with the destination.

Methodology
The method used in this research is the linguistic-based schema matching method and the constraint-based method in which the schema matching with the linguistic method is carried out based on the similarity or similarity of table names between table pairs from source to destination. The flow of research in this research can be seen in Figure 1.

Planning
The first stage is the planning stage where at this stage there is a literature study process used to determine the background, formulation of the problem, related research, and solutions to the problem. In addition to the literature study, there is also a data collection process. The data used in this research are databases at tertiary institutions such as administration, academic databases, KPKKNTA (practical work, real work lectures, final projects), ABDIMAS (research and public services), and others. The database used can be seen in Table 1.  Table Attribute  1  Administration  26  204  2  Academic  69  453  3  Employee  29  184  4  Payrol  16  96  5  KPKKNTA  3  16  6  Abdimas  9  63  7 Library 52 427 Based on the database scheme obtained in Table 1, an overview of data integration can be seen in Which A is data on citizenship, education, lecturers, majors, and types of education. B is religious data, PMB pathways, and districts. C is employee data. D is the faculty data, the field of study, and the study program. And E is the data of majors, students, and lecturers. Figure 2 shows that the administration database requires calling and data A, while academic calls data B from administration and C from Employees, etc. Figure 2 shows that the data integration that is currently running is not optimal yet, this can be seen by the existence of a system that should be integrated but not yet integrated, such as employee information systems, payroll, academic with libraries, and so on. This shows that there is a need to optimize data integration to be better so that it can reduce data redundancy and data inconsistencies and can save data storage space.

Analysis
The second stage is analysis. The database that has been collected at the planning stage is analyzed using the linguistic and constraint methods. The linguistic method used in this research is bigram. Before entering the bigram process it is necessary to carry out the conversion, generalization, and tagging process. The conversion is done by removing the "_" sign and also the prefix table name as can be seen in Table 2. The conversion process is carried out the results of the linguistic process carried out more leverage. This is caused by the addition of database names at the beginning of the table which can affect the results of linguistics. For example "adm_program_studi" with "akd_program_studi" which if no conversion is done then the same value is 0.812 and if converted the result is 1. After the conversion process is carried out, the next process is a generalization. This process is carried out on table names and attributes that have acronyms and abbreviations. Generalization is based on the true meaning of the abbreviations and acronyms of the word. An example of the generalization process can be seen in Table 3. After the generalization process, the next process is tagging. This process is carried out for the placement of words that have the same food as employees and employees, employees and employees, family and family, and others. So if there is a difference in the language used but has the same meaning then the word is considered the same. The next process is the bigram process. A bigram is a linguistic method that does the solving of two characters. An example of the bigram process can be seen in Table 4. Besides bigram, there is also a trigram which is a part of n-gram. An example of a trigram process can be seen in Table 5. In Table 4 the intersect value is 0, it shows that there is no bigram of the two tables that have similarities so that the similarity value is 0. The similarity value is between 0 to 1. If the similarity value is more than 0.8 then it can be categorized as relevant. After the bigram process. The next process is to analyze in terms of constraints with the constraint method which can be seen in Table 6. The similarity value shown in figure 3 is 0.8. This shows that the couple is relevant, but it should not be because the religious code with the type of education code is different. So it can be concluded that the pair of religious tables with the type of education is true negative (TN). The set that states the results of the matching schema is divided into four as seen in the matrix confusion table shown in Table 7 [19].  Table 7 shows that TP stated the outputs of the identified model are relevant, FP stated that the outputs that should have been matched were not, FN stated the outputs that were not supposed to match but it was TN and the results stated that the outputs were not suitable. In this research, to get results that match the set, the addition of logic algorithms is performed. This is because there are two schema matching methods used so that logic is used to help determine whether it is included in the relevant category or not. So we get the categorization as in Table 8. Schema matching results in the form of TP, FN, and FP will be used as material for testing the matching schema model with parameters P (precision), R (recall), and F Measure. The equation to find out the value of P, R, and F measure can be seen in equations (2), (3), and (4).

Optimalize
The results of the schema matching process are then used as material for optimization. The optimization process is done by eliminating one table from each pair of TP values and also eliminating attributes and tables that are considered irrelevant and then normalized. Normalization is carried out until the third form because the third form is already considered a normal form [20].

Testing
At this stage, the database integration design testing process is optimized. Testing is done by providing several sample queries to display data from different databases.

Results and Discussions
The results of the schema matching can be seen in Table 9.  Table 9 shows that there are several pairs of tables from each database with TP and FP values while others are blank or TN values. After getting the results of the matching schema obtained, the next process is optimization. The database schema optimization process results in changes in the number of tables and attributes in the database schema. Database schema data after optimization can be seen in Table 10.  Table 10 shows the decrease in the number of tables and overall attributes where the number of tables before the optimization is 204 and changes to 191 after optimization. While the number of attributes that were previously 1443 became 951 after optimization. There is a big enough change so that it can be said that data integration was not yet optimal because there are still many pairs of tables with TP values and some attributes that are not considered to be relevant so that optimization is needed. Design of data integration results can be optimized.  Figure 3. Optimization database integration design Figure 3 shows the design of data integration after optimization where A is the data of study programs, levels of education, ethnicity, types of education, fields of education, lecturers, education programs. B is district and camaba data. C is SK and employee data. D is the data of students, lecturers, and science. E is student data. F is a data type of work, religion, and diploma. G is a data module. H is the login and employee data. And I is employee data. Fig 3 shows the difference with Fig 2 (initial integration design). A fairly noticeable difference is that after optimization, all databases have been integrated with other databases wherein the initial integration design 2 databases are not integrated with any system even though it should be able to retrieve data from other systems so there is no need to insert again. The amount of data used on other systems increased from the initial 15 to 22. This shows that previously the data was stored in each database. With the increase in the amount of integrated data by 46.67% from before, data storage in the database becomes more efficient. There was a reduction in the number of databases that previously had 7 databases to 6 databases. This is because the KPKKNTA database after optimization can be removed. After all the data contained in the KPKKNTA database is available in the academic database. The process of entering the same data in different information systems can be reduced so that data inconsistencies can be avoided.
Retrieval of data from other databases can be done by adding attributes to the table from the database system that performs data retrieval. The attribute becomes the foreign key of the table from which the data will be retrieved. For example, in the administrative information system, there is an administrative database in which there is data on prospective students. To display study program data owned by prospective new students, the system needs to retrieve study program data from the academic database because the study program table is in the academic database. Therefore, it is necessary to add the code_program_studi attribute to the table of prospective new students in the administration database and become a foreign key that refers to the primary key of the study program table contained in the academic database.
Tests are carried out to test whether the new database integration design allows for data sharing so that the tests carried out are to give commands to display data from other databases. The command was given to display data after optimization has undergone a few changes. For example, the previous command is "Select a.attribute_name_1, b.attribute_name_2, and so on from table_name1 a, table_name2 b where condition". After optimization, if the desired data is in a different database table, the command given is "Select a.attribute name 1, b.attribute 2, and so on form name_database1.name_table1 a, name_database2.name_tabel2 b where condition ".
The difference lies in putting the database name before the table name. The example after the command after optimization is to display the study program data (academic) through the study program code in the Camaba (prospective new student) table (administration) with the following SQL command "SELECT a.NO_PENDAFTARAN as Registration_number ,a.nama_camaba as camaba_name ,b.nama_jalur_pmb as PMB_path_name, c.nama_program_studi as name_ of_ study_ program from 01administrasi.adm_camaba a, 01administrasi.adm_jalur_pmb b, 02akademik. akd_ program_studi c where a.kode_jalur_pmb= b.kode_ jalur_ pmb and a.kode_program_studi= c.kode_program_studi". The results of the order can be seen in Figure   4.

Conclusion
Based on the results of tests that have been carried out from this research it can be concluded that: a. The results showed that there were 16 pairs with TP value and 8 pairs with FP value while none had FN value. These results indicate that in data integration (before optimization) there is still duplication of data that can cause data inconsistencies. b. Database optimization which is done changes the number of tables and attributes of the tables amounted to 204 to 191 and the number of attributes that were previously 1443 to 951. This shows that several tables can be eliminated because they are connected with tables in other databases and also some attributes are can be eliminated. c. After optimization, data integration is better due to an increase in the amount of data connected to other systems by 46.67% from before. So that the same data entry on different systems can be reduced and also data inconsistency caused by duplication of data on different systems can be minimized.