A Comparison Study on the Construction of China's Credit Scoring System Model in the Era of Internet Finance

At present, China's Internet finance is flourishing, with a variety of business models and operating mechanisms. Through Internet technology, financial institutions can speed up business processing and give users a better service experience. However, problems such as credit risk and user fraud remain, and there is an urgent need to improve risk control through credit scoring models. This article therefore takes the borrower data of a Chinese financial institution from January 2017 to June 2017 as the original data and uses the Spearman rank correlation test to screen out, from the many variables in the sample data, those with reliable explanatory power. Based on the selected variables, R 3.4.3 and SPSS 23.0 were used to construct a random forest model, a discriminant analysis model, and a logistic regression model. In general, different models perform differently under different sample characteristics, but discriminant analysis proved the more widely applicable. This paper compares the judgment accuracy of these three types of models and tries to establish a more effective financial credit scoring method, in order to address the problem of constructing China's credit scoring system model against the current background of Internet finance.


Research Background
An objective, comprehensive, and accurate individual credit rating model is an essential component of the personal credit rating system (Hand & Henley, 1997). Through Internet technology, existing personal credit scoring systems speed up business processing and give users a better service experience (Yu et al., 2009). However, problems such as credit risk and customer fraud remain, so there is an urgent need to improve risk control through credit scoring models. Credit investigation institutions use the rich information they collect to make a comprehensive credit evaluation of individuals (Dhillon & Torkzadeh, 2006). Based on abundant personal credit history and credit behavior data, the credit behavior patterns obtained through data mining can more accurately predict the future credit performance of individuals, improve operating efficiency, reduce the cost of granting credit, and accurately estimate the risk of consumer credit, which makes them an essential tool for the internal scoring of financial institutions (Hsieh & Hung, 2010). The establishment of an accurate credit scoring system is therefore of considerable significance to enterprises. An individual credit rating model uses statistical analysis and data mining techniques to analyze primary personal information data and transform it into a specific, easily recognizable credit risk value (West, 2000; Huang et al., 2007).
In the past, China mainly relied on the experience of credit officers to judge the credit status of customers. There has been severe information asymmetry between credit institutions and customers (Stiglitz, 1993), which leaves credit institutions unable to accurately measure the credit status and risk of borrowers. This may lead to credit errors and directly threaten the interests of credit institutions and the healthy development of the credit market (Hoff & Stiglitz, 1990). Other countries have mature experience in credit scoring and have combined traditional statistics with machine learning to evaluate customer credit quantitatively (Thomas, 2000). However, because China currently has no unified data source or credit evaluation system, foreign experience is not directly applicable. It is therefore necessary to form a personal credit reporting system in line with Chinese characteristics and to find a suitable credit scoring method (Allen et al., 2007).
Based on the above, and drawing on the underlying theory, practice, and relevant research experience, this paper applies appropriate data mining and statistical methods to the historical business data of a loan institution, which serve as the original data. By constructing an evaluation system for Chinese personal consumer credit, we aim to provide some reference for financial institutions and government. Durand (1941) applied discriminant analysis to the credit scoring of commercial banks. Discriminant analysis starts from an existing classification: when a new sample is encountered, specific evaluation criteria are used as the basis for judging which group the new sample belongs to (Eisenbeis, 1977; Wind, 1978; Day et al., 1978). On this basis, new samples can be classified into known groups. Commonly used methods of discriminant analysis are distance discrimination, Bayesian discrimination, and Fisher discrimination (Lachenbruch & Goldstein, 1979; Ripley, 1994). Discriminant analysis has also been used to develop credit models (Desai et al., 1996; Dorronsoro et al., 1997). FICO scores constructed with discriminant analysis as the core are widely used in the field of credit scoring. Chen & Chen (2010) used the latest semi-supervised nonparametric discriminant analysis (SNDA), sparse tensor discriminant analysis (STDA), semi-supervised discriminant analysis (SDA), sparse discriminant analysis (Sparse DA), Fisher discriminant analysis (FDA), and multivariate discriminant analysis (MDA) to construct credit scoring models, and the results showed that SNDA, STDA, and SDA performed better than the other discriminant analyses. Wiginton (1980) used discriminant analysis and logistic regression to construct credit scoring models on data from 1967 to 1968. The results indicated that logistic regression was superior to discriminant analysis.
Shi & He (2015) introduced the idea of an asymmetric function into credit rating, took the distribution function of the biased logistic distribution as the inverse link function, and conducted a comparative empirical analysis using the personal credit data of a financial institution. The results indicate that the biased logistic regression model performed better than the ordinary logistic regression model, and also better than the decision tree, neural network, and support vector machine on a data set with a 10% default rate. Sohn et al. (2016) established a fuzzy logistic regression model using the data of 4,446 loan applicants and their loan default results and compared it with traditional logistic regression. They found that fuzzy logistic regression could improve prediction performance. Compared with discriminant analysis, logistic regression is easy to calculate and places looser requirements on the data distribution. So far, logistic regression is the most commonly used credit scoring model.

Literature Review
Since individual credit scoring models each have their own advantages, scholars have begun to study combination models, which are divided into heterogeneous integration models and homogeneous integration models. By definition, a random forest is a homogeneous integration of decision trees. Su (2018) proposed a personal credit scoring model based on a random forest combination. In an empirical analysis using the data of a commercial bank in Germany, and in comparison with KNN, a radial basis function neural network, a decision tree, a gradient boosting decision tree, and a support vector machine, the random forest model not only had high accuracy but could also handle noisy data and generalized well. Using the German credit data, Li (2017) established a logistic credit scoring model and a random forest credit scoring model, and the results showed that the accuracy of the random forest was superior to that of logistic regression. As long as the coefficients of the combined model are set well, the combined model may be superior to the single model in accuracy or other aspects. The two-stage scoring model proposed by Shi (2005), a logistic regression model based on a neural network, was validated with the credit card customer data of a commercial bank. The accuracy of the new model was found to be higher than that of logistic regression, and its robustness greater than that of the neural network model, indicating that the new model combines the advantages of the single models and avoids their disadvantages. Yang (2018) used the results of a linear discriminant analysis model as one of the input variables of a BP neural network. The empirical analysis showed that the combined model has better prediction accuracy than the single model and overcomes the single model's robustness problem.
Xia et al. (2018) proposed a heterogeneous integration model based on the bagging and stacking algorithms. Empirical analysis shows that this heterogeneous integration model performs better than the logistic regression, support vector machine, decision tree, and random forest models.

Practical Application of Personal Credit Score Model
At present, the FICO score is the most commonly used score in the US credit information market. It is issued by Fair Isaac Company. There are three forms of FICO score, which are applied respectively by the three major US credit bureaus (Berger & Udell, 2002). The credit scores derived from the FICO model range between 300 and 850 points. The higher the score, the smaller the credit risk of the customer. Nevertheless, the score itself does not tell whether a customer is good or bad, and lenders often use the score as a reference for their loan decisions (Allen et al., 2004). Each lender has its own lending strategy and standards, and each product has its own risk level, which determines the acceptable credit score level.
Generally speaking, if the borrower's credit score is 680 points or above, the lender can consider the borrower's credit outstanding and approve the loan without hesitation. If the borrower's credit score is below 620, the lender either asks the borrower to add collateral or looks for grounds to reject the loan. If the score is between 620 and 680 points, the lender will investigate and verify further and use other credit analysis tools to handle the case.
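The three score bands described above can be written as a simple decision rule. The following is a minimal sketch only: the function name and the returned labels are illustrative assumptions, and actual lenders apply their own strategies and thresholds, as noted earlier.

```python
def triage_by_fico(score: int) -> str:
    """Map a FICO-style score to the lender action described in the text.

    The function name and the three labels are hypothetical; only the
    680 and 620 cut-offs and the 300-850 range come from the text.
    """
    if not 300 <= score <= 850:
        raise ValueError("FICO-style scores range from 300 to 850")
    if score >= 680:
        # Credit considered outstanding: loan can be approved directly.
        return "approve"
    if score < 620:
        # Lender asks for extra collateral or rejects the loan.
        return "collateral_or_reject"
    # 620 <= score < 680: further investigation with other tools.
    return "further_review"


if __name__ == "__main__":
    for s in (700, 650, 600):
        print(s, triage_by_fico(s))
```

The boundary handling (680 counted as approval, 620 counted as further review) follows the wording "680 points or above" and "below 620".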
Sesame Credit belongs to Ant Financial, a subsidiary of China's Alibaba Group. It is an independent third-party credit reporting institution (see Table 2) that objectively presents individuals' credit status through techniques such as cloud computing and machine learning. Sesame Credit differs from traditional credit reporting agencies (Yip & McKern, 2016). It is backed by Alibaba Cloud's vast database and has the unique advantages of Internet technology and data.
On this basis, Sesame Credit evaluates the credit rating of users through its credit model algorithm (Lin et al., 2015).
Nevertheless, Sesame Credit also suffers from limits on the applicability of its credit model. The problem of credit information sharing affects not only the comprehensiveness of the data dimensions but also the accuracy of the model's measurements (Ennew & Binks, 1999; Wu, 2008). As a result, the actual credit status of an individual is not reflected very accurately in the Sesame Credit score. Establishing the applicability of the credit model also requires time for gradual data collection and validation; the primary data source for Sesame Credit is industry data, and the dimensions of data collection are incomplete (Nwana, 1996). While Sesame Credit already collects a tremendous amount of information, Alibaba's social networking presence is relatively weak, so it has less command of data on social behavior; it also lacks credit data from financial institutions. At present, Sesame Credit has not been connected to the Central Bank's credit system and has not been able to obtain credit data from major banks and financial institutions, so bank credit information is missing from the calculation of Sesame Credit scores (Kostka, 2019; Creemers, 2018). Without personal credit data from banks, it is difficult for Sesame Credit to obtain accurate figures for users' personal income, or essential assessment data such as debt information and related assets. In this paper, data mining and statistical methods are used: the R 3.4.3 extension packages and SPSS 23.0 were used to construct a random forest model, apply discriminant analysis, and then establish a logistic model, and the performance of each model in actual operation is compared and analyzed.

Statistical Approach
This chapter selects the status of overdue repayment as the explained variable. The explanatory variables are agent, local nationality, working province, education level, marital status, salary, presence or absence of funds, and gender (as known from the ID card information in the data archive), together with six indicators that can be looked up by working province: provincial gross product, per capita disposable income, per capita consumption expenditure, regional fixed asset investment, regional fixed asset investment index, and unemployment rate (Anonymous, 2017).
The variables are defined and explained below (variable; name; value range; coding rule):
Per capita consumption expenditure: the higher the per capita consumption expenditure, the higher the rank
X13; Regional fixed asset investment; 1-28; the higher the regional fixed asset investment, the higher the grade
X14; Regional fixed asset investment index; 1-28; the higher the regional fixed asset investment index, the higher the grade
X15; Unemployment rate; 1-16; the higher the unemployment rate, the higher the grade

Data Preprocessing
In summary, the borrowers of a financial institution in China from January 2017 to June 2017, with or without deferred repayment, are taken as the total sample for data processing. During processing, we found that the samples without agents had a very large number of missing values, because the working province and salary level were not recorded for users without agents. The data were therefore divided into two sample sets according to whether an agent was included, and the two sets were analysed separately. Further analysis showed that the sample data are micro-level and lack macro-level support, so conclusions drawn from them alone may not be accurate or complete. Therefore, the working province (for the sample with agents) and the province of origin (for the sample without agents, known from the first two digits of the ID card number) were converted into six indicators related to economic development: the province's gross product, per capita disposable income, per capita consumption expenditure, regional fixed asset investment, regional fixed asset investment index, and unemployment rate (all taken from the China Statistical Yearbook 2017).

Selection of samples
All remaining samples were selected, in both the set with agents and the set without agents.

Handling of blank and missing values
Since the data had been split into two data sets for analysis, the samples with missing values were filtered out of each data set separately. Approximately 95% of the sample size remained in each data set, and the data integrity of the remaining samples was good.
Therefore, in line with the above analysis, the small number of samples with missing values were removed directly to obtain the final samples with and without agents.
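The filtering step described above amounts to listwise deletion: any record with a blank or missing value in a required field is dropped. A minimal sketch follows, assuming each borrower record is a dictionary; the field names and the helper name are hypothetical and do not come from the original data set.

```python
def drop_incomplete(records, required_fields):
    """Listwise deletion: keep only records with every required field present.

    Returns the kept records and the fraction of records retained.
    A value counts as missing if it is None or a blank string.
    """
    def is_missing(value):
        return value is None or (isinstance(value, str) and value.strip() == "")

    kept = [
        record for record in records
        if not any(is_missing(record.get(field)) for field in required_fields)
    ]
    retained = len(kept) / len(records) if records else 0.0
    return kept, retained


if __name__ == "__main__":
    # Hypothetical toy records; field names are illustrative only.
    sample = [
        {"loan_grade": 3, "education": 2, "deferred": 0},
        {"loan_grade": 1, "education": None, "deferred": 1},
        {"loan_grade": 2, "education": 4, "deferred": 0},
        {"loan_grade": 5, "education": " ", "deferred": 0},
    ]
    clean, share = drop_incomplete(sample, ["loan_grade", "education", "deferred"])
    print(len(clean), share)
```

Reporting the retained fraction alongside the cleaned data makes it easy to confirm that only a small share of the sample is lost, which is the condition under which direct removal was judged acceptable here.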

Descriptive Statistical Analysis
The collected data samples were first subjected to descriptive statistical analysis using SPSS 23.0. From the descriptive statistics, it can be seen that kurtosis (the degree of steepness or flatness of the distribution) varies significantly among the variables, as does skewness.

Basic Analysis of Variables
First, in the sample containing agents, the number of deferred repayments is 698, accounting for about 12%, and the number of on-time repayments is 5,077, or about 88%, as shown in Figure 1.
Figure 1. Percentage of samples containing agents with or without deferred repayment
Based on the above, it can initially be seen that users with agents have a higher probability of deferred repayment than users without agents, and that overall nearly 90% of borrowers have not deferred repayment. The sample without agents was then analysed. In this sample, the number of deferred repayments is 895, accounting for about 4.5%, and the number of on-time repayments is 19,239, accounting for about 95.5%, as shown in the accompanying figure.

Correlation of Explanatory Variables with Explained Variables
In this chapter, 14 variables are selected to study the factors influencing the borrower's deferred repayment: loan grade (x1), agent (x2), local nationality (x3), education level (x5), marital status (x6), salary (x7), fund availability (x8), gender (x9), provincial gross product (x10), per capita disposable income (x11), per capita consumption expenditure (x12), regional fixed investment (x13), regional fixed investment index (x14), and unemployment rate (x15). For the sample containing agents, the working province is not selected, because the working province itself has no substantive meaning; instead, six macro data variables corresponding to the province were added. SPSS 23.0 was first used to run a Pearson correlation test, which gave a preliminary view of the relationship between the explanatory variables and the explained variable. Explanatory variables were then screened by the size and significance of their correlation coefficients with the explained variable.
According to the correlation test, all variables passed the significance test except X8 and X9. Since only 12 variables remain, no further screening by the size of the correlation was carried out, and X1, X2, X3, X5, X6, X7, X10, X11, X12, X13, X14, and X15 were finally selected as explanatory variables.
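The screening step can be illustrated with a short sketch. It computes the Pearson correlation coefficient between each candidate variable and the explained variable and keeps a variable when the usual t statistic, t = r * sqrt((n - 2) / (1 - r^2)), exceeds a critical value. This is not the SPSS procedure itself: the function names are illustrative, and the default threshold of 1.96 is a large-sample approximation to the two-sided 5% level, assumed here because the samples contain thousands of records.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("need two sequences of equal length >= 3")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)


def t_statistic(r, n):
    """t statistic for H0: correlation = 0, with n - 2 degrees of freedom."""
    if abs(r) >= 1.0:
        return math.copysign(math.inf, r)
    return r * math.sqrt((n - 2) / (1.0 - r * r))


def screen_variables(features, target, critical=1.96):
    """Return the names of variables whose |t| exceeds the critical value.

    `features` maps a variable name to its list of values. The default
    critical value is a large-sample approximation (an assumption).
    """
    kept = []
    for name, values in features.items():
        r = pearson_r(values, target)
        if abs(t_statistic(r, len(target))) > critical:
            kept.append(name)
    return kept
```

For small samples the critical value should be taken from the t distribution with n - 2 degrees of freedom rather than fixed at 1.96.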

Multicollinearity Analysis
To ensure the accuracy of the model results, it is necessary to test whether there is multicollinearity among the variables. The results are shown in the following table: the variance inflation factor (VIF) of each of the 12 variables lies between 0 and 10, so it can be judged that there is no severe multicollinearity among them. 80% of all samples were randomly selected as training data, comprising 564 deferred repayment records and 4,059 on-time repayment records, with each variable used as a training feature. The modeling process was implemented with the random forest package in R 3.4.3.
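For reference, the VIF of each variable can be obtained as the corresponding diagonal element of the inverse of the correlation matrix of the explanatory variables. The sketch below illustrates that relationship in plain Python; it is not the SPSS computation used in the paper, and the function names are illustrative. The threshold of 10 is the rule of thumb applied in the text.

```python
import math


def _pearson(x, y):
    """Pearson correlation of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_matrix(columns):
    """Correlation matrix of a list of variables (each a list of values)."""
    m = len(columns)
    return [[1.0 if i == j else _pearson(columns[i], columns[j])
             for j in range(m)] for i in range(m)]


def invert(matrix):
    """Invert a square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular: perfect collinearity")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def vif(columns):
    """VIF of each variable: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(columns))
    return [inverse[j][j] for j in range(len(columns))]
```

With two explanatory variables this reduces to 1 / (1 - r^2), so uncorrelated variables have a VIF of 1 and the value grows without bound as the correlation approaches 1.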
Selection of variables: Selecting appropriate variables not only improves accuracy but also reduces the computational complexity of the model, thereby improving its running efficiency. First, the variables are initially selected based on their correlation: x8 (with or without funds) and x9 (gender) are excluded on grounds of significance, and x10 (provincial GDP), x11 (per capita disposable income), x12 (per capita consumption expenditure), x13 (regional fixed asset investment), x14 (regional fixed asset investment index), and x15 (provincial unemployment rate) are introduced to replace x4 (working province). With these 12 variables, on-time repayment (0) is wrongly judged as deferred repayment (1) with an error rate of 2.7%, deferred repayment (1) is wrongly judged as on-time repayment (0) with an error rate of 40.6%, and the overall error rate is 7.3%. With so many variables, those with weaker correlation may impair the training of the model and reduce prediction accuracy, so irrelevant variables should be eliminated step by step to bring the model to its best predictive performance. Variables are deleted in order of increasing feature importance in the random forest model. For example, Figure 3 shows the importance of each variable at the first elimination, at which the variable with the lowest importance, x5 (education level), is removed. The same step is repeated using the updated importance of each variable at each elimination. The variables then removed in turn are x6 (marital status), x7 (salary), x15 (unemployment rate), x3 (whether local), x13 (regional fixed asset investment), x11 (per capita disposable income), x12 (per capita consumption expenditure), x14 (regional fixed asset investment index), x2 (agent), and x1 (loan grade); the corresponding accuracy is shown in Figure 4.

Figure 4. Number of variables and accuracy
Figure 4 shows that the error rate is low when 4, 5, or 6 variables are used. Predicting the training samples with 4, 5, and 6 variables and comparing the correct rates shows that the correct rate is equal and highest when four or five variables are selected, indicating that the probability of making the above two types of errors has decreased and that accuracy has improved significantly. Considering both the original accuracy and the training accuracy, the number of variables finally selected in this chapter is 5, which is expected to be the optimal choice for this model.
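As a consistency check on the figures reported for the 12-variable model, the overall error rate should equal the class-size-weighted average of the two class-wise error rates. The sketch below assumes that those rates were measured on the 4,059 on-time and 564 deferred training records (for instance as out-of-bag estimates); that assumption and the function name are not taken from the original analysis.

```python
def overall_error_rate(n_on_time, n_deferred, err_on_time, err_deferred):
    """Combine class-wise error rates into an overall error rate.

    Each class-wise rate is the share of that class that was misjudged,
    so the overall rate is the total number of misjudged records divided
    by the total number of records.
    """
    misjudged = n_on_time * err_on_time + n_deferred * err_deferred
    return misjudged / (n_on_time + n_deferred)


# Class sizes of the training data and the reported class-wise error rates.
# Treating the rates as measured on these records is an assumption.
rate = overall_error_rate(
    n_on_time=4059, n_deferred=564, err_on_time=0.027, err_deferred=0.406
)

if __name__ == "__main__":
    print(round(rate, 3))
```

Under that assumption the weighted average comes to about 7.3%, in line with the reported overall error rate. It also shows why the overall figure stays low despite a 40.6% error rate on deferred repayment: deferred records make up only about 12% of the sample.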
Selection of the number of trees: The choice of the number of trees directly affects the accuracy of the random forest training results. If too few trees are chosen, the predictions will be unsatisfactory; if too many are chosen, accuracy no longer improves significantly while the running speed of the model suffers directly. In this paper, 200 trees are used to explore the influence of the number of trees on judgment accuracy. The results are shown in Figure 5.
Figure 5. Verification of accuracy
In Figure 5, the abscissa represents the number of trees and the ordinate represents the judgment error rate of the model; green represents the error rate in judging deferred repayment, red the error rate in judging on-time repayment, and black the total error rate. It is evident from the figure that by about 50 trees the error rates for on-time repayment and deferred repayment have reached their lowest point. On this basis, the accuracy of the model is approximately 94%.
Random forest model prediction results: According to the selection of variables and the setting of model parameters, the final variables selected in this paper are x12 (per capita consumption expenditure), x14 (regional fixed asset investment index), x2 (agent), x1 (loan grade), and x10 (provincial GDP). The number of trees is set to 200. The overall accuracy of the model is 94.1%. The judgment of borrowers who repay on time is the more accurate, at up to 97.7%, while the judgment of borrowers who defer repayment is less satisfactory, with an accuracy of approximately 67.9%. This may be related to the selection of sample size.

Discriminant Analysis
According to the results of the previous Pearson correlation coefficient analysis, the ten variables most strongly correlated with the explained variable (whether repayment is deferred or not) are selected as the observed variables. These are X1, X2, X6, X7, X10, X11, X12, X13, X14, and X15, and the full sample containing agents is used as training data. The final result obtained by discriminant analysis is as follows: Overall accuracy: 78.7%

Logistic Regression
Introduction to the entropy weight method: The entropy weight method is an objective weighting method that assigns weights according to the amount of information contained in each index. The smaller the entropy of an index, the greater its variability, the greater the role it plays, and the higher its weight in the comprehensive evaluation. The computation is simple and straightforward, and the index data are used effectively, excluding the influence of subjective factors (Bikker & Haaf, 2002). Next, according to the correlation matrix between the credit index and each factor, the entropy weight of a factor is taken as positive if its correlation coefficient is positive and as negative if its correlation coefficient is negative.
After gradually removing the non-significant variables, the regression results are obtained (see Table 14). The explanatory variables in the table have a robust explanatory effect on the explained variable, so they can be retained in the model. It can also be judged from the previous multicollinearity test that these variables do not exhibit multicollinearity and that the tolerance between the variables is relatively high, so the accuracy of the parameter estimates of the regression model is not significantly affected. The smaller the -2 log-likelihood and the higher the Cox & Snell R square and Nagelkerke R square, the better the fit; the estimation results in Table 15 are taken to indicate that this model fits well.
The overall significance test of the regression equation is shown in Table 16.
Step; Chi-square; df; Sig.
10; 89.960; 8; .000
For the logistic analysis, the chi-square of the Hosmer-Lemeshow goodness-of-fit test was 89.960, with a P-value below the 0.05 significance level, from which it is concluded that the relationship between the explanatory variables and logit(P) is significant and hence that the model is reasonable. It should be noted, however, that a Hosmer-Lemeshow P-value below 0.05 is conventionally read as evidence of a departure from good fit, so this conclusion should be treated with caution.
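The entropy weight method introduced at the start of this section can be sketched as follows. The sketch uses a common textbook formulation: each index column is converted to proportions, its entropy is computed with the normalising constant 1 / ln(n), and the weights are the normalised values of one minus the entropy. It assumes non-negative index values and returns unsigned weights; the sign convention based on the correlation coefficient described above is not modelled, and the function name is illustrative.

```python
import math


def entropy_weights(rows):
    """Entropy weights for a data matrix (rows = samples, columns = indices).

    Assumes non-negative values with a positive column sum. A column with
    lower entropy (more variability across samples) gets a higher weight.
    """
    n = len(rows)
    if n < 2:
        raise ValueError("at least two samples are required")
    m = len(rows[0])
    k = 1.0 / math.log(n)  # scales entropy into the range [0, 1]
    divergences = []
    for j in range(m):
        column = [row[j] for row in rows]
        if any(value < 0 for value in column):
            raise ValueError("index values must be non-negative")
        total = sum(column)
        if total == 0:
            raise ValueError("an index column sums to zero")
        # Entropy of the column; terms with zero proportion contribute 0.
        entropy = -k * sum(
            (value / total) * math.log(value / total)
            for value in column if value > 0
        )
        divergences.append(1.0 - entropy)
    denominator = sum(divergences)
    if denominator <= 0:
        raise ValueError("all indices are constant; weights are undefined")
    return [d / denominator for d in divergences]


if __name__ == "__main__":
    # Toy matrix: the first index is constant, the second one varies.
    print(entropy_weights([[1, 1], [1, 2], [1, 3]]))
```

A constant index carries no information for distinguishing samples, so its entropy is at the maximum and its weight is effectively zero, which matches the statement that lower entropy means a greater role in the evaluation.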

Establish a Personal Risk Assessment Model
The logistic regression model is expressed in terms of logit(P) and the retained explanatory variables. According to the SPSS 23.0 results of this logistic regression, shown in Table 17, the average prediction accuracy is 89.4%, with an accuracy of 23.4% for deferred repayments and 98.5% for on-time repayments. The model is highly accurate in predicting on-time repayment, while its accuracy in identifying deferred repayment is very low. Further testing is therefore needed to improve the accuracy for both deferred and on-time repayment.
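For reference, the general form of a binary logistic regression model is given below. Here P is assumed to denote the probability of deferred repayment (coded 1), x_1, ..., x_k are the retained explanatory variables, and the coefficients are the parameters estimated by SPSS; the fitted values for this sample are not reproduced here.

```latex
\[
\operatorname{logit}(P) = \ln\frac{P}{1-P} = \beta_0 + \sum_{i=1}^{k} \beta_i x_i ,
\qquad
P = \frac{1}{1 + \exp\!\left(-\beta_0 - \sum_{i=1}^{k} \beta_i x_i\right)} .
\]
```

A borrower is classified as a deferred repayment case when the estimated P exceeds the chosen cut-off, which is commonly 0.5.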

Summary and Prediction
Impact of variable screening on the model: In this chapter, the variables are first screened by the Pearson correlation coefficient test, and then the judgment accuracy of each model is analyzed.
Comparison of models: The three models described above were evaluated, and a summary of the predictive accuracy of each is given in Table 18. For sample data sets without overdue repayment records, this chapter tends to use the random forest to classify the samples; for sample data sets with overdue repayment records, discriminant analysis is more appropriate, because it increases the probability of correctly identifying overdue repayments.

Correlation of Explanatory Variables with Explained Variables
In this chapter, 12 variables are selected to study the factors influencing the borrower's deferred repayment: loan grade (x1), whether local (x3), education level (x5), marital status (x6), fund availability (x8), gender (x9), provincial gross product (x10), per capita disposable income (x11), per capita consumption expenditure (x12), regional fixed investment (x13), regional fixed investment index (x14), and unemployment rate (x15). The samples not including agents generally lack working province and salary, so six macro data variables corresponding to the province of origin were added. SPSS 23.0 was first used to run a Pearson correlation test to explore the correlation between the explanatory variables and the explained variable. The explanatory variables were then screened by the magnitude and significance of their correlation coefficients with the explained variable. According to the correlation test results, all variables passed the significance test except X6. Since only 11 variables remain, no further screening by the magnitude of the correlation was carried out, and X1, X3, X5, X8, X9, X10, X11, X12, X13, X14, and X15 were finally selected as explanatory variables.

Multicollinearity Analysis
In order to ensure the accuracy of the prediction results of the constructed model, SPSS 23.0 was first used to test for multicollinearity among the variables. The results are shown in Table 21. The variance inflation factor (VIF) of each of the 11 variables is between 0 and 10, from which it is judged that there is no severe multicollinearity among them. This chapter randomly selects 80% of the samples that do not contain agents as training data, of which 708 are deferred repayment records and 15,449 are on-time repayment records, with each variable used as a training feature. The modeling process is implemented with the random forest package in R.
Selection of variables: The variables were first screened (in this subsection the variables of the sample without agents are numbered consecutively from x1 to x12). Based on the correlation and significance of each variable, x4 (marital status) was eliminated first. Then, on grounds of practical significance, x7 (provincial GDP), x8 (per capita disposable income), x9 (per capita consumption expenditure), x10 (regional fixed asset investment), x11 (regional fixed asset investment index), and x12 (provincial unemployment rate) were introduced to replace the working province and native place variables. With these 11 variables, on-time repayment (0) is judged by the model as deferred repayment (1) with an error rate of 0%, while deferred repayment (1) is judged as on-time repayment (0) with an error rate of 100%; the overall error rate is 4.4%. With so many variables, the less relevant ones may impair the training of the model and reduce prediction accuracy. We therefore consider eliminating irrelevant variables step by step so that the model achieves its best predictive performance. Variables are deleted in order of increasing importance in the random forest.
The screening rule is to eliminate, at each step, the variable with the lowest importance. As shown in the figure, X2 (whether local or not, in the consecutive numbering used for this sample) has the lowest importance and is eliminated first.
In Figure 7, the abscissa represents the number of trees and the ordinate represents the judgment error rate of the model; green represents the error rate for model-predicted deferred repayment, red the error rate for model-predicted on-time repayment, and black the overall error rate. According to the figure, the error rate is the same regardless of the number of trees.
From Table 22, it can be seen that the overall prediction accuracy of the model is 95.6%. The judgment of borrowers who repay on time is the more accurate, with a prediction accuracy of 100%, while the prediction accuracy for borrowers who defer repayment is very low, at 0%. This result may also be related to the selection of sample size.

Discriminant Analysis
Currently, according to the previous analysis, the test results of the Person correlation coefficient test and analysis are selected according to the explained variables (whether to postpone repayment). Significant correlative relationship 1 one variable served as its observed indicator. These variables are, respectively, x1, x 3, x 5, x8, x9, x10, x11, x12, x1 3, x1 4, x1 5, the data from the samples containing agents without missing data were used as the training set data. Run through discriminate analysis was performed by SPSS 23.0 software, after which the results of the discriminate analysis were output. Next, according to the correlation coefficient matrix between the credit index and each factor, if this correlation coefficient is positive, then this factor entropy weight is also positive if this correlation coefficient is negative, the factor entropy weight is also negative.  After removing the variables with low significance step by step, it can be easily observed that these explanatory variables have a strong explanatory effect on the explained variables, so they should be kept in the model. It can also be learned from the multicollinearity test performed previously that the absence of multicollinearity in these several variables and the high tolerance between variables do not significantly affect the precision of the results of parameter estimation by the regression model. According to the estimation results, in Table 28, -2The smaller the log-likelihood, the higher the value of the Cox Snell R square and Nagelkerke R square, and the higher the fit of the model, so that the model can be considered to have a better fit. The overall situation of the Hosmer-Lemeshow goodness-of-fit test of the regression equation is shown in the table, and it can be observed that the chi-square is 47.721 and the probability P-value significance level is less than 0.05. 
Hence, the correlation between the explanatory variables and logit(P) is significant, which justifies the model. The results of this logistic regression test were obtained by running SPSS 23.0. In this chapter, the overall forecast accuracy is 95.6%: 0.00% for customers with deferred repayment and 100.0% for customers with timely repayment. The model is highly accurate in predicting on-time repayment, while its accuracy in identifying deferred repayment is very low. Consequently, the model needs further improvement so that it judges both deferred and on-time repayment accurately.
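The link between the -2 log-likelihood and the two pseudo R-square statistics discussed for the logistic regression can be illustrated with their standard definitions. The sketch below is written in Python rather than SPSS, and the log-likelihood values and sample size are hypothetical; they are not the values behind Table 28.

```python
import math


def pseudo_r_squared(ll_null, ll_model, n):
    """Cox-Snell and Nagelkerke pseudo R-squared from two log-likelihoods.

    ll_null  : log-likelihood of the intercept-only model
    ll_model : log-likelihood of the fitted model
    n        : number of observations
    """
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Cox-Snell cannot reach 1; Nagelkerke rescales it by its maximum value.
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell, cox_snell / max_cox_snell


# Hypothetical inputs for illustration only.
n = 1000
ll_null = -180.0
weak_fit = pseudo_r_squared(ll_null, ll_model=-170.0, n=n)
better_fit = pseudo_r_squared(ll_null, ll_model=-150.0, n=n)

print("-2LL (weak, better):", -2 * -170.0, -2 * -150.0)
print("Cox-Snell, Nagelkerke (weak):  ", weak_fit)
print("Cox-Snell, Nagelkerke (better):", better_fit)
```

With the null model held fixed, a smaller -2 log-likelihood for the fitted model yields larger values of both statistics, which matches the reading of the fit measures given above.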

Summary and Prediction
Impact of screening of variables on the model: In this chapter, the overall forecast accuracy is 95.6%: 0.00% for customers with deferred repayment and 100.0% for customers with timely repayment. The model is highly effective at predicting on-time repayment, while its accuracy in identifying deferred repayment is very low. Hence, the model needs further improvement so that it judges both deferred and on-time repayment accurately.
Comparison of Models: The three models described above were used to predict the results, and the prediction accuracy of each model is given in Table 30. This chapter considers the sample without agent borrowing. Although the random forest model and the logistic regression model both reached an overall accuracy of 95.6%, their accuracy for the overdue population was 0%, so they do not work well in real-world applications.

Conclusion
In the sample with agents, the overall correct prediction rate was 94.1% for random forest, 78.7% for discriminant analysis, and 89.4% for logistic regression. The random forest's prediction accuracy was balanced between overdue and non-overdue repayment, so the random forest model was more accurate and reliable for the general population. Nevertheless, discriminant analysis predicted overdue repayment more accurately than random forest. Discriminant analysis is suitable for detecting the population with incomplete records.
In the non-agent sample, both random forest and logistic regression predicted 95.6% correctly, while discriminant analysis reached only 75.5%. Nevertheless, discriminant analysis is more appropriate as a financial credit scoring model because its probability of predicting correctly is more than 75% for both overdue and non-overdue borrowers. Although the overall prediction accuracy of the random forest model and the logistic regression model is 95.6%, their prediction accuracy for overdue repayment is 0, which makes them unsuitable for practical application.
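The preference for discriminant analysis in the non-agent sample can be made explicit by ranking the models on the mean of their per-class accuracies (often called balanced accuracy) rather than on overall accuracy. The Python sketch below is one possible way to express that criterion, not the procedure used in this paper. The values for random forest and logistic regression are those reported above; for discriminant analysis the text states only that both class accuracies exceed 75%, so 0.75 is used as a placeholder lower bound.

```python
# Per-class accuracy in the non-agent sample: (on-time, overdue).
per_class = {
    "random_forest": (1.00, 0.00),          # reported above
    "logistic_regression": (1.00, 0.00),    # reported above
    "discriminant_analysis": (0.75, 0.75),  # placeholder: text says "more than 75%"
}


def balanced_accuracy(class_accuracies):
    """Unweighted mean of the per-class accuracies."""
    return sum(class_accuracies) / len(class_accuracies)


scores = {name: balanced_accuracy(acc) for name, acc in per_class.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: balanced accuracy = {score:.2f}")
print("preferred model:", best)
```

Because this measure weights both classes equally, a model that never identifies overdue borrowers cannot score above 0.5, however large the on-time majority is.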

Policy Recommendation
First, it is time to build a personal credit information system in line with China's national conditions. Compared with other countries, the construction of the personal credit information system in China started relatively late (Han et al., 2013; Cheng & Suyang, 2014). At present, a complete and reasonable personal credit information system has not been formed, and personal credit information is severely lacking. Especially with the rapid development of China's consumer credit market in recent years, a complete personal credit information system is urgently needed to guide the healthy development of the market (Huang et al., 2016). At present, most personal credit scores produced by traditional credit agencies in China still rest on subjective judgment and are highly arbitrary (Hu & Ge, 2018). Although the personal credit scoring methods in some foreign countries are relatively mature and have been quantified by a large number of artificial intelligence and statistical methods, the performance and stability of each method remain highly controversial, and China does not yet have the conditions needed to apply these methods (Sachs et al., 2007). Thus, China needs to build a personal credit system with Chinese characteristics.
Second, a suitable personal credit evaluation index system should be constructed. Two problems should be considered when establishing it. On the one hand, the evaluation system should make full use of all the data. On the other hand, it should evaluate individual credit from multiple perspectives.
Finally, a suitable personal credit scoring model should be established. In this paper, we tried other credit scoring models before settling on the final evaluation models, but their discrimination ability and robustness were not as good as those of the random forest, discriminant analysis, and logistic regression models selected in this paper.

Research Prospects
Although this paper examines several personal consumer credit evaluation models, namely the random forest model, discriminant analysis, and the logistic regression model, through empirical analysis and comparison, and demonstrates the optimizing role of combining models, some shortcomings remain in practical application. First, as many variables as possible should be introduced. The variables used in this paper cover few types because of data limitations. If the data can reflect customer credit behavior, the effect of the model will improve significantly. In addition, because the sample size in this research is limited, further tests are needed to determine the accuracy for deferred and on-time repayment.
Finally, efforts should be made to produce multilevel classifications. In this paper, the sample is divided only by whether repayment is overdue and into samples with and without agents, but reality is far from that simple.