Highlights

• Fine-tuned BERT successfully produces risk scores from loan descriptions.

• Integrating the BERT score with traditional variables boosts granting model performance.

• The approach applies without manual annotation, reducing subjectivity and complexity.

• Generating risk scores is instant with standard equipment.

• Research is needed to enhance transparency and trust in these LLM-based approaches.
Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending

Mario Sanz-Guerrero a,∗, Javier Arroyo b

a Facultad de Informática, Universidad Complutense de Madrid, Calle del Prof. José García Santesmases, 9, Madrid, 28040, Comunidad de Madrid, Spain
b Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, Calle del Prof. José García Santesmases, 9, Madrid, 28040, Comunidad de Madrid, Spain
Abstract
Peer-to-peer (P2P) lending connects borrowers with lenders through online
platforms but faces significant information asymmetry, as lenders often lack
sufficient data to assess the creditworthiness of borrowers. This paper ad-
dresses this challenge by leveraging BERT (Bidirectional Encoder Represen-
tations from Transformers), a Large Language Model (LLM) known for its
ability to understand contextual nuances in text, to analyze borrowers’ loan
descriptions.
We apply transfer learning to make BERT distinguish between default
and non-default loans using the loan descriptions of the Lending Club dataset.
The resulting BERT-generated risk score demonstrates solid competence in
discriminating between default and non-default loans. Furthermore, integrat-
ing the BERT-generated risk score with traditional variables enhances classi-
fier performance, which demonstrates the complementary nature of advanced
language model outputs in refining credit risk assessment methodologies.
However, the opacity of LLMs and potential biases highlight the need for
transparent regulatory frameworks. This study opens new research avenues
in P2P lending and AI, emphasizing the importance of trust and transparency
in adopting advanced credit risk models.
Keywords: Credit Risk, P2P Lending, BERT, Transfer Learning, Explainable AI, XGBoost

∗ Corresponding author.
Email addresses: [email protected] (Mario Sanz-Guerrero), [email protected] (Javier Arroyo)
1. Introduction
Peer-to-peer (P2P) lending is a growing phenomenon that allows indi-
viduals to engage in direct lending and borrowing transactions, bypassing
traditional financial institutions. The process is facilitated through online
platforms, where prospective borrowers submit loan applications and poten-
tial lenders make informed decisions about where to invest their funds.
An inherent challenge in P2P lending is the presence of information asym-
metry, wherein borrowers possess more and often superior information com-
pared to lenders. To address this issue, platforms employ strategies to com-
plement the conventional data provided in loan applications [1]. For instance,
borrowers are frequently encouraged to provide a voluntary textual descrip-
tion describing the purpose of the loan and their particular situation. De-
spite the absence of formal verification, such voluntary disclosures have been
observed to stimulate increased bidding activity among lenders. However,
lenders may lack the expertise to assess the creditworthiness of borrowers
effectively and may be influenced by different factors [2].
Traditional credit scoring models do not harness the rich information em-
bedded in the narratives submitted by loan applicants. Several approaches
have tried to incorporate such information into credit risk models. The ap-
proaches include the extraction of linguistic metrics [3, 4], the use of topic
modeling to characterize the underlying themes [5, 6], or a combination of
both [7].
However, these methods have limitations in fully capturing the contex-
tual nuances and semantic intricacies of loan descriptions. This study aims
to address this gap by leveraging the capabilities of Large Language Mod-
els (LLMs), specifically BERT (Bidirectional Encoder Representations from
Transformers) [8], which has shown exceptional performance in understand-
ing and processing complex textual data.
BERT’s bidirectional training and ability to capture context at a granular
level make it particularly suitable for tasks requiring deep semantic under-
standing. It has proven successful in fine-tuning for classification tasks [9],
in particular domains such as biomedicine [10] and in specialized tasks like
spam classification [11]. In this study, we apply transfer learning to train
BERT to distinguish between default and non-default loans based on their
textual descriptions.
Our work demonstrates BERT’s capability to effectively leverage descrip-
tive data to generate a risk indicator that accurately classifies defaulted loans.
Moreover, we show that integrating the BERT-generated risk score into tradi-
tional credit granting models can significantly enhance their predictive per-
formance. This highlights BERT’s potential to improve the accuracy and
reliability of credit risk models in P2P lending.
Furthermore, our findings highlight the need for greater transparency in
adopting advanced models to build trust among users and regulatory entities.
This study paves the way for future research, emphasizing the transformative
potential of LLMs like BERT in the field of credit risk assessment.
This paper is organized as follows: Section 2 provides a comprehensive
review of related work in credit risk assessment and natural language pro-
cessing. Section 3 presents an overview of LLMs and the BERT model.
Section 4 describes the dataset used, detailing the data preprocessing steps
and conducting an in-depth exploratory data analysis. Section 5 outlines
the methodology, model architecture, and training procedures employed in
integrating BERT into the credit risk assessment framework. Section 6 ana-
lyzes the risk score generated by the BERT description processing. Section
7 discusses the results of our experiments, highlighting the improvements
achieved by incorporating BERT-based textual analysis. Finally, Section 8
concludes the paper by summarizing key findings, discussing implications,
and suggesting avenues for future research in the intersection of NLP and
credit risk assessment.
2. Related work
In their comprehensive analysis of risk-return modeling within the P2P
lending market [12], the authors identify a discernible trend towards includ-
ing new sources and types of information to improve risk and profit manage-
ment models in the P2P market. The sources are very diverse and include
transactional data [13], the topology of the lending-borrowing network [14],
data from social networks [15], or, more recently, facial features [16]. Among
them, the authors identify as a predominant trend the inclusion of textual
data taken from statements describing the purpose of the loan.
In a pioneering work exploring the impact of textual factors on peer-to-peer
lending [17], the authors analyze P2P loans including manually annotated
narrative aspects, such as trustworthiness, economic hardship, hard work,
success, morality, and religiosity. These aspects were combined with de-
mographic variables and loan characteristics. Their results highlight that
narratives regarding trustworthiness strongly influence decision-makers, par-
ticularly credit lenders, in their loan approval process. Additionally, some of
these narratives play a substantial role in subsequent loan performance.
However, most subsequent studies typically use text mining or artificial
intelligence methods to extract linguistic features or loan description top-
ics. Regarding the use of linguistic features, the authors in [3] use machine
learning and text mining techniques to quantify and extract linguistic fea-
tures (e.g., readability, positivity, objectivity, and deception cues), and then
build both explanatory econometric models and predictive models using such
features. They find that such features can indeed reflect borrowers’ creditworthiness and predict loan default. They also use a panel of investors and confirm that investors value texts written by borrowers, but can also be deceived by some of the deception cues well established in the literature.
Similarly, in [4], the authors include linguistic factors and the presence of
social and emotional keywords and evaluate their impact on two European
platforms. They found that text-derived variables influence the probability
of funding, but not the probability of default. In [18], linguistic statistical
features and abstract text features (including deception, subjectivity, sen-
timent, readability, personality, and mindset) are used to characterize text
descriptions. They compare the performance of different classifiers based on
the textual features and conclude that their performance is close to that of
the classifiers using traditional financial features, but that adding textual
features can improve the performance of the whole credit risk evaluation
system.
As for topic modeling, the Latent Dirichlet Allocation model (LDA) has
been widely used. In [19], the authors use LDA to extract six topics with clearly interpretable meanings from the loan descriptions: assets, income and expenses,
work, family, business, and agriculture. They also consider the number of
characters in the descriptive text. They conclude that soft (qualitative) in-
formation can improve the performance of loan default prediction compared
to existing methods based only on hard (quantitative) information and that
soft features have a significant ability to discriminate loan defaults. Similarly,
in [20], an LDA topic model is used to classify the loan titles into six pur-
poses. Their findings reveal that the stated purpose significantly influences a
borrower’s chances of securing financing. Notably, ambiguous titles—where
borrowers fail to clearly articulate the loan’s purpose—substantially diminish
the likelihood of loan approval. In [5], Xia et al. used a keyword clustering
algorithm for automatic topic extraction. Their method combines keyword
extraction based on term frequency-inverse document frequency (TF-IDF)
with word embeddings generated by the Word2Vec neural network model
[21]. Analysis of three real-world datasets demonstrated that incorporating
these topic variables significantly enhanced predictive accuracy compared to
relying solely on traditional information.
More recently, Siering [7] investigated the effect of both aspects: topics
and linguistic features. To extract topics, the author utilized a financial text
analysis procedure [22] to build a domain dictionary. The identified topics
describe the purpose of the loan, as well as the borrowers’ request for help,
reliability, or appreciation. These topics were operationalized as binary in-
dicator variables. The author also employed text mining to create variables that measure aspects such as polarity, active orientation, readability, average sentence length, and word count. These variables are then used as input in a logistic regression. The results indicate that both linguistic and content-based factors contribute to predicting loan default probability, with the content-based factors demonstrating greater significance. Analysis of variable contributions identifies which factors increase or decrease default likelihood. Notably, expres-
sions of reliability correlate positively with loan repayment.
Yet a significant gap remains in the application of state-of-the-art natural language processing techniques, including deep learning methods. In
their work [6], Zhang et al. studied the transformer encoder’s ability to cap-
ture textual features from loan descriptions. These characteristics, alongside
the traditional hard features derived from loan applications, were inputted
into a neural network to predict the likelihood of loan default. The study
highlights the effectiveness of transformers, showing better results when in-
cluding textual loan descriptions compared to models that do not consider
them. In this work, we explore the use of encoder-based LLMs, such as BERT, which have successfully improved classification in other fields.
3. An overview of Large Language Models and BERT
Large Language Models (LLMs) are built upon the Transformer archi-
tecture [23], leveraging attention mechanisms to enhance language compre-
hension. LLMs can be broadly categorized into three primary families, each
distinguished by its architecture:
• Encoder-only models, widely employed for language comprehension tasks such as text classification, named entity recognition, and extractive question answering. The most famous example is BERT [8], which will be explained in more detail below.

• Decoder-only models, designed for generative tasks and exemplified by the well-known GPT models [24, 25]. They are employed in various tasks, including question answering [26], text summarization [27], and programming code generation [28].

• Encoder-decoder models, suited for tasks demanding both language understanding and generation, such as language translation or text summarization. The most influential models are BART [29] and T5 [30].
The selection of the appropriate architecture hinges on the specific re-
quirements of the intended task. Whether it be the nuanced comprehension
of language, creative text generation, or the synthesis of both, the versatility
of LLMs offers a tailored solution for diverse applications.
We will focus on BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language model introduced by
Google researchers in 2018 [8]. BERT’s architecture consists of a stack of
encoders from the Transformer model. The bidirectional nature of BERT is
key, as it considers both the left and right context of each word, enhancing
its ability to understand context-dependent meanings and to be effective in
language understanding tasks. Numerous studies have consistently shown
that BERT is one of the most effective language models for many of these tasks
[31, 32]. Notably, BERT has 340 million parameters, while the widely rec-
ognized GPT-3 model has 175 billion, making BERT 514 times smaller than
GPT-3 [25]. Given this significant size difference, BERT can be operated on
standard home equipment for model inference, which greatly simplifies its
use in practical scenarios. In contrast, GPT-3, built on a Transformer decoder stack, demands much more powerful equipment and is oriented toward language generation tasks.
BERT stands as a milestone whose success has spurred the development of
a diverse family of models that build upon its architecture. Some versions aim
to achieve similar performance while having a smaller number of parameters,
such as DistilBERT (a distilled version of BERT) [33] or ALBERT (A Lite BERT) [34]. Others are adaptations to other languages, such as CamemBERT
[35] to French or BETO to Spanish [36]. Other proposals aimed to improve
upon BERT by modifying some design decisions when pretraining BERT
and also training the model longer, as in the case of RoBERTa (Robustly
optimized BERT approach) [37], which resulted in improved contextualized
representations and enhanced language understanding.
To further elucidate the role of BERT in specialized applications, it is crucial to understand its capacity for transfer learning and fine-tuning. Transfer learning involves applying a pre-trained model like BERT, which has learned general language patterns from a large corpus, to a specific task or dataset. This technique allows us to take advantage of the rich linguistic representations without the extensive computation of training from scratch. Fine-
tuning involves adjusting the pre-trained model’s parameters to capture the
nuances of the target task or application field by further training with new
instances from the new context. For example, BioBERT is a BERT model
fine-tuned for biomedical text mining tasks like named entity recognition and
question answering [10]. Other adaptations have targeted text classification
and sentiment analysis in specific datasets [9, 38].
Our research aims to develop a credit risk model by leveraging a BERT-
based model. Specifically, we will fine-tune BERT to produce a predictive
score indicative of the likelihood of loan default using the textual descriptions
of loans provided by borrowers in their application forms. We will assess the
scoring effectiveness in a granting model using the well-established dataset
from the Lending Club P2P lending platform.
4. Description of the dataset
We have used a public dataset of the P2P lending company Lending Club¹, which is widely used in credit risk publications and is the most widely used when dealing with the P2P market [39, 12]. However, instead of using
the original dataset, which includes 2,260,699 loans granted by the company
between 2007 and 2018, we will use a version modified for proposing granting
models [40], used in [41, 42]. Since granting models predict which loans will be fully repaid, their estimation requires loans whose final status is known, that is, loans that were either fully repaid or defaulted. Thus, the dataset excludes
loans in transitory states (in a grace period, late, etc.) and loans with no
¹ https://www.kaggle.com/wordsforthewise/lending-club
Table 1: Variable description.

Variable          Description

Quantitative variables
revenue           Borrower's self-declared annual income during registration.
dti_n             Indebtedness ratio for obligations excluding mortgage. Monthly information.
loan_amnt         Amount of credit requested by the borrower.
fico_n            Credit bureau score, defined between 300 and 850, reported by Fair Isaac Corporation as a summary risk measure based on historical credit information reported at the time of application.

Categorical variables
emp_length        Employment length of the borrower, categorized into 12 categories, including the no-information category.
purpose           Credit purpose category for the loan request.
home_ownership    Homeownership status provided by the borrower.
addr_state        Borrower's U.S. state of residence.

Textual variable
desc              Description of the credit request provided by the borrower.
information on income and indebtedness which is essential to compute the
input variables, resulting in 1,347,681 instances.
Additionally, the original dataset contains variables detailing the loan’s
lifecycle and other post-application aspects (e.g., the interest rate). In con-
trast, our version only includes variables available at the time of application,
which are those utilized by granting models.
Loan descriptions were inconsistently available, appearing only for certain
loans between April 2008 and March 2014. To accurately assess the impact of
textual descriptions on default prediction, our analysis focuses solely on the
119,101 loans that include the desc variable. Kolmogorov-Smirnov and chi-square tests were applied to the quantitative and categorical variables, respectively, to assess potential bias from this filtering. The lack of significant differences indicates that
the filtered dataset is representative of the original dataset.
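Such a representativeness check could be run with SciPy along the following lines; this is a minimal sketch assuming the full and filtered data live in hypothetical pandas DataFrames named full and filtered:

```python
from scipy import stats

# Two-sample KS test for each quantitative variable: does its distribution
# differ between the full dataset and the desc-filtered subset?
for col in ["revenue", "dti_n", "loan_amnt", "fico_n"]:
    d, p = stats.ks_2samp(full[col], filtered[col])
    print(f"{col}: D={d:.4f}, p={p:.4f}")

# Chi-square goodness of fit for each categorical variable: are the filtered
# category frequencies consistent with those expected from the full dataset?
for col in ["emp_length", "purpose", "home_ownership", "addr_state"]:
    exp = full[col].value_counts(normalize=True) * len(filtered)
    obs = filtered[col].value_counts().reindex(exp.index, fill_value=0)
    chi2, p = stats.chisquare(obs, exp)
    print(f"{col}: chi2={chi2:.2f}, p={p:.4f}")
```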
In the dataset, the target variable suffers from the usual class imbalance problem (only 15.27% of loans default), which will be considered in the design of the
experiments. Table 1 shows the input variables of our granting model, which
are explained below.
As for the quantitative variables, the Fair Isaac Corporation credit bu-
reau (FICO) information in the original dataset is given by a minimum and
maximum range of limits to which the borrower’s FICO belongs at loan orig-
ination. However, we average these two values to obtain a single indicator of the creditworthiness of potential borrowers, resulting in our fico_n variable. As for the debt variable, dti_n is estimated from the original dataset variables as the ratio of the co-borrowers' total debt obligations to their combined monthly income.
Regarding the categorical variables, we merged the categories ‘other’, ‘none’, and ‘any’ into a unified category labeled ‘other’ for the home_ownership variable. This decision was made due to a lack of clear differentiation among these options, coupled with their similar default percentages and their relatively low frequencies of occurrence. The emp_length variable was treated as categorical rather than numerical since it includes categories for ‘no information’ and for ‘more than ten years’.
For the textual variable, we carried out exhaustive text cleaning. First, we removed all descriptions that merely contained the placeholder text provided by Lending Club on its web form (“Tell your story. What is your loan for?”). Moreover, we removed the prefix “Borrower added on DD/MM/YYYY >” from the descriptions, as we did not want any temporal background on them. Finally, as these descriptions came from a web form, we replaced all HTML entities with their corresponding characters (e.g., ‘&amp;’ was substituted by ‘&’, ‘&lt;’ was substituted by ‘<’, etc.).
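A minimal sketch of this cleaning, assuming the loans live in a hypothetical pandas DataFrame df with the raw descriptions in a desc column:

```python
import html
import re

PLACEHOLDER = "Tell your story. What is your loan for?"
PREFIX = re.compile(r"Borrower added on \d{2}/\d{2}/\d{4} >")

def clean_description(text: str) -> str:
    """Clean one raw Lending Club loan description."""
    text = PREFIX.sub("", text)   # drop the temporal prefix
    text = html.unescape(text)    # '&amp;' -> '&', '&lt;' -> '<', etc.
    return text.strip()

# Remove descriptions that only contain the web form's placeholder text,
# then clean the remaining ones.
df = df[df["desc"].str.strip() != PLACEHOLDER]
df["desc"] = df["desc"].map(clean_description)
```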
Table 2 presents the quantitative variables and the results of the Kolmogorov-
Smirnov test, which was used to compare the empirical cumulative distribu-
tion functions of default and non-default loans. According to these results, defaulted loans are characterized by lower revenue, a higher debt-to-income ratio (dti_n), a higher requested amount (loan_amnt), and lower FICO scores (fico_n), with all differences significant at the 0.01 level.
Similarly, Table 3 displays the distribution of categories within each cate-
gorical variable and the corresponding default rates. The addr_state variable is excluded due to its 50 categories, one for each U.S. state. The table also in-
dicates whether there is a significant dependence between the target variable
and the categorical variables at the 0.01 significance level. The results show
Table 2: Exploratory data analysis. Quantitative variables.

Variable     Statistic    Non-Default    Default       Total

revenue      Mean         $73,570.69     $66,218.66    $72,447.84
             Median       $64,000.00     $58,000.00    $62,000.00
             SD           $54,944.60     $40,731.98    $53,086.79
             KS D-Test    0.09*

dti_n        Mean         16.15          17.67         16.38
             Median       15.88          17.70         16.16
             SD           7.50           7.53          7.53
             KS D-Test    0.09*

loan_amnt    Mean         $13,799.25     $15,111.44    $13,999.66
             Median       $12,000.00     $14,000.00    $12,000.00
             SD           $7,931.55      $8,363.19     $8,012.86
             KS D-Test    0.08*

fico_n       Mean         705.22         694.20        703.54
             Median       697.00         687.00        697.00
             SD           33.32          27.48         32.74
             KS D-Test    0.15*

* p-value less than 0.01.
significant dependence for all variables, including the addr_state variable not reported in the table (test value of 211.12).
In the home_ownership variable, the ‘OTHER’ category shows the highest risk (20.98%) but a small frequency (0.12%), while the ‘MORTGAGE’ category is the most frequent (51.05%) and the least risky (14.14%). In the emp_length variable, the category that denotes no information (‘NI’) has the highest risk (19.78%), but also the lowest frequency (4.13%). In general, employment length can be categorized into two groups with comparable default rates: those with employment lengths of five years or less and those with more than five years. Interestingly, the risk is slightly higher in the group with more than five years of employment. The categories are not
Table 3: Exploratory data analysis. Categorical variables.

Variable          Category              Count     Rel. Freq.    Default Rate    Chi Test

home_ownership    MORTGAGE              60,796    51.05%        14.14%          131.08*
                  OTHER                 143       0.12%         20.98%
                  OWN                   9,582     8.05%         15.69%
                  RENT                  48,580    40.79%        16.60%

emp_length        < 1 year              9,548     8.02%         14.83%          104.96*
                  1 year                7,803     6.55%         14.39%
                  2 years               10,960    9.20%         14.68%
                  3 years               9,370     7.87%         14.18%
                  4 years               7,561     6.35%         14.56%
                  5 years               9,019     7.57%         14.76%
                  6 years               7,271     6.10%         16.04%
                  7 years               6,638     5.57%         15.59%
                  8 years               5,374     4.51%         15.39%
                  9 years               4,356     3.66%         15.79%
                  10+ years             36,287    30.47%        15.41%
                  NI                    4,914     4.13%         19.78%

purpose           car                   1,884     1.58%         9.61%           568.47*
                  credit card           25,051    21.03%        12.81%
                  debt consolidation    68,372    57.41%        16.14%
                  educational           265       0.22%         16.98%
                  home improvement      7,170     6.02%         12.93%
                  house                 805       0.68%         15.78%
                  major purchase        3,062     2.57%         10.65%
                  medical               970       0.81%         17.32%
                  moving                768       0.64%         14.84%
                  other                 6,361     5.34%         17.69%
                  renewable energy      127       0.11%         19.69%
                  small business        2,518     2.11%         26.41%
                  vacation              561       0.47%         16.22%
                  wedding               1,187     1.00%         12.47%

* p-value less than 0.01.
Table 4: Exploratory data analysis. Textual variable.

Variable        Statistic    Non-Default    Default    Total

Word count      Mean         36.72          35.49      36.54
                Median       24.0           22.0       24.0
                SD           46.62          48.78      46.96
                KS D-Test    0.03*

Readability     Mean         66.70          66.31      66.64
                Median       73.88          74.19      74.02
                SD           32.87          35.55      33.29
                KS D-Test    0.02*

Polarity        Mean         0.0964         0.0909     0.0956
                Median       0.0367         0.0        0.0320
                SD           0.1685         0.1699     0.1687
                KS D-Test    0.04*

Subjectivity    Mean         0.3193         0.3029     0.3168
                Median       0.3635         0.3333     0.3589
                SD           0.2542         0.2595     0.2551
                KS D-Test    0.04*

* p-value less than 0.01.
perfectly ordered, which supports the use of one-hot encoding to treat this
variable as categorical. Finally, the most frequent purpose is ‘debt consoli-
dation’, constituting 57% of the loans, which has a default rate of 16.14%.
Notably, the riskiest purpose is ‘small business’, with a 26.41% default rate.
Conversely, ‘car’ loans demonstrate the lowest risk, with a mere 9.61% de-
fault rate. This striking divergence underscores the significant variability in risk across the various loan purposes.
Regarding the textual description of the loan (desc variable), Table 4
shows some metrics to characterize it. There is a one-word difference in the
average word count between the descriptions of defaulted and non-defaulted
obligations. The readability was calculated using the Flesch Reading Ease Score², which indicates the approximate educational level required for comfortable comprehension of a given text (higher scores denote greater ease of
reading). The texts in both categories have scores around 66, signifying that
they can be readily comprehended by students aged 13 to 15. Additionally,
we analyzed the average polarity and average subjectivity³. The polarity,
ranging from -1 to 1 to denote negative or positive sentiment, was observed
to be approximately 0.1 in both cases, suggesting a subtle positive sentiment.
On the other hand, subjectivity, measuring the presence of judgments and
opinions on a scale from 0 to 1, exhibited values close to 0.31 in both cate-
gories. This indicates that while the texts in both cases maintain a generally
objective tone, there is a discernible inclusion of some judgments or opinions.
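The four metrics in Table 4 can be computed with the libraries cited in the footnotes; the following is a minimal sketch (word count approximated here by whitespace splitting, which may differ slightly from the exact procedure used in the paper):

```python
import textstat
from textblob import TextBlob

def linguistic_features(text: str) -> dict:
    """Compute the linguistic metrics reported in Table 4 for one description."""
    sentiment = TextBlob(text).sentiment
    return {
        "word_count": len(text.split()),
        "readability": textstat.flesch_reading_ease(text),
        "polarity": sentiment.polarity,          # -1 (negative) to 1 (positive)
        "subjectivity": sentiment.subjectivity,  # 0 (objective) to 1 (subjective)
    }

print(linguistic_features("I need funds to consolidate my credit card debt."))
```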
Although the distinctions in these metrics between the default and non-
default categories are subtle, their significance is confirmed by the Kolmogorov-
Smirnov test. Consequently, it is pertinent to incorporate linguistic aspects
into credit risk modeling. Our approach for extracting information from the descriptions relies on LLMs, which are capable of capturing not just linguistic nuances but also content details. We elaborate on our
methodology below.
5. Description of the methodology
This study assesses the enhancement of default prediction by integrating
textual descriptions of loans preprocessed through an LLM. Specifically, we
employ transfer learning and fine-tuning to enable the LLM to generate a
score reflecting the probability of loan default, leveraging information derived
from the loan’s textual description.
As our baseline, we employ a machine learning classifier that utilizes
the complete set of variables available during the loan application process,
encompassing both quantitative and categorical variables. Subsequently, we
enhance the classifier by incorporating the LLM score—representing default
risk inferred from processing the textual description with the LLM—as an
additional input variable and proceed to evaluate the resulting differences.
² Calculated with Textstat (Python library). Source: https://github.com/textstat/textstat.
³ Calculated with TextBlob (Python library). Source: https://github.com/sloria/textblob.

The data and code of our experiments are publicly available⁴.
The key components of the experiment are elaborated upon in this sec-
tion.
5.1. Tuning the classifier
The classification algorithm used was XGBoost [43], which is trained using stratified k-fold cross-validation with k = 5, where the instances are shuffled to avoid potential ordering biases in the dataset. A
genetic algorithm [44] was employed to fine-tune the hyperparameters. The
fitness value of each individual was the average balanced accuracy (BACC) in
the 5 validation sets of the cross-validation. We use BACC as it accounts for
the imbalanced nature of the dataset. In preliminary experiments, we also
considered the area under the receiver operating characteristic (AUROC),
which measures the model’s ability to discriminate between positive and
negative examples regardless of the classification threshold chosen. However,
we observed that the resulting XGBoost classifiers produced poor BACC
values (similar to those from a naïve classifier that predicts the majority
class) when using the standard 0.5 threshold to make the prediction. We
also observed that XGBoost classifiers with extremely similar AUROC values
produced very different results in terms of BACC. Thus, we decided to use
the BACC measure as it resulted in classifiers with a more stable behavior.
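The fitness evaluation of one individual could look as follows; this is a sketch with hypothetical names (fitness, X, y), not the authors' exact code:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def fitness(params: dict, X, y, seed: int = 0) -> float:
    """Average BACC over shuffled, stratified 5-fold CV for one individual."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = XGBClassifier(**params)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])  # standard 0.5 threshold
        scores.append(balanced_accuracy_score(y[val_idx], preds))
    return float(np.mean(scores))
```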
In genetic optimization, each individual is characterized by its genes,
that is, the considered hyperparameters of the XGBoost. Table 5 shows the
hyperparameters together with their respective ranges.
The evolutionary strategy chosen for the optimization is the Mu plus
lambda approach, denoted as µ + λ, where µ represents the number of
individuals to select for the next generation, and λ indicates the number
of children to produce at each generation. Unlike traditional approaches
where children often replace parents, the µ + λ strategy involves adding both
children and parents to produce the next generation. In the context of this
research, we set µ to 150 and λ to 150. This configuration, considering the
addition of both parents and offspring, results in generations comprising a
total of 300 individuals.
Initially, pairs of parents are chosen through tournament selection with
a tournament size of 2. Subsequently, the children are generated employing
⁴ The link to the code will be included in the final version.
Table 5: Hyperparameters of XGBoost considered in the genetic optimization and their respective ranges.

Parameter              Min. value    Max. value
eta (learning rate)    0.001         0.5
max_depth              2             12
gamma                  0             10
min_child_weight       0             10
alpha                  0.5           10
lambda                 0.5           10
subsample              0.7           1
colsample_bytree       0.3           1
scale_pos_weight       0.1           10
n_estimators           2             500
a two-point crossover technique on the parents’ chromosomes with an 80%
probability and applying a random resetting mutation with a 20% probabil-
ity. The random resetting mutation implies that each gene of every child has
a 20% chance of acquiring a new random value within its defined range. To
create an offspring of 150 children, this selection, crossover, and mutation
process is repeated 75 times. Finally, we combine the 150 children and the
150 parents, resulting in generations of 300 individuals, and select the top
150 (µ) according to their fitness value to pass to the next generation.
The evolutionary process consists of 20 iterations, thereby generating
a total of 3,000 individuals—each representing a distinct hyperparameter
configuration. From this pool of configurations, the one exhibiting the highest
fitness is ultimately chosen as the optimal outcome.
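The µ + λ loop can be sketched as follows. This is a minimal, hand-rolled illustration under the settings stated above (µ = λ = 150, tournament size 2, two-point crossover with 80% probability, per-gene random resetting with 20% probability, 20 generations), not the authors' actual implementation; parameter names follow the scikit-learn API of XGBoost, and fitness is the cross-validation function sketched earlier:

```python
import random

# Gene ranges following Table 5; the last tuple element marks integer genes.
RANGES = {
    "learning_rate":    (0.001, 0.5, False),
    "max_depth":        (2, 12, True),
    "gamma":            (0, 10, False),
    "min_child_weight": (0, 10, False),
    "reg_alpha":        (0.5, 10, False),
    "reg_lambda":       (0.5, 10, False),
    "subsample":        (0.7, 1, False),
    "colsample_bytree": (0.3, 1, False),
    "scale_pos_weight": (0.1, 10, False),
    "n_estimators":     (2, 500, True),
}

def random_gene(lo, hi, is_int):
    return random.randint(lo, hi) if is_int else random.uniform(lo, hi)

def random_individual():
    return {k: random_gene(*r) for k, r in RANGES.items()}

def tournament(pop, fits, size=2):
    """Return the fitter of `size` randomly chosen individuals."""
    best = max(random.sample(range(len(pop)), size), key=lambda i: fits[i])
    return pop[best]

def two_point_crossover(p1, p2):
    keys = list(RANGES)
    a, b = sorted(random.sample(range(len(keys) + 1), 2))
    c1, c2 = dict(p1), dict(p2)
    for k in keys[a:b]:              # swap the middle segment of genes
        c1[k], c2[k] = p2[k], p1[k]
    return c1, c2

def mutate(ind, p_gene=0.2):
    """Random resetting: each gene gets a new random value with 20% chance."""
    for k, r in RANGES.items():
        if random.random() < p_gene:
            ind[k] = random_gene(*r)
    return ind

def evolve(fitness, mu=150, lam=150, generations=20, cxpb=0.8):
    pop = [random_individual() for _ in range(mu)]
    fits = [fitness(ind) for ind in pop]
    for _ in range(generations):
        children = []
        while len(children) < lam:
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            if random.random() < cxpb:
                c1, c2 = two_point_crossover(p1, p2)
            else:
                c1, c2 = dict(p1), dict(p2)
            children += [mutate(c1), mutate(c2)]
        children = children[:lam]
        pop += children                   # mu + lambda: parents survive too
        fits += [fitness(ind) for ind in children]
        ranked = sorted(range(len(pop)), key=lambda i: fits[i], reverse=True)[:mu]
        pop, fits = [pop[i] for i in ranked], [fits[i] for i in ranked]
    return pop[0]                         # best hyperparameter configuration
```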
5.2. Generating a default score with BERT
In this section, we delineate the methodology employed to generate a
default score based on the textual description of the loan. We initiate the
process by applying transfer learning utilizing an LLM, specifically BERT in
our case. The fine-tuned BERT model produces an outcome within the range
of 0 to 1, offering a nuanced indicator rather than a binary classification. This
subtle indicator is subsequently integrated into a classifier along with other
input variables to predict the likelihood of loan default. The subsequent
steps in this process are detailed next.
15
5.2.1. Transfer learning to produce the default score
The BERT model we utilize⁵ is configured with L=12 hidden layers (i.e.,
Transformer encoder blocks), each with a size of H=768, and it employs A=12
attention heads. These attention heads enable the model’s self-attention
mechanism to process inputs in 12 distinct patterns simultaneously. The
output from BERT is an embedding of size 768. To incorporate this model, TensorFlow Hub was selected for its efficient integration with additional neural network layers.
As outlined in Section 3, during the transfer learning phase, we aim to
exploit BERT’s advanced language understanding while minimizing the need
to learn from scratch. To achieve this, we freeze the weights of all but the
last hidden layer of BERT. This approach preserves the model’s pre-trained
capabilities and avoids degrading them while retraining on our specific dataset, a phenomenon known as catastrophic forgetting⁶.
Subsequently, in the fine-tuning stage, only the last BERT layer and the
newly added layers are adjusted to better serve our specific task of generating
a default score. These layers are optimized to enhance the model’s adaptabil-
ity to our particular requirements, allowing slight parameter adjustments for
improved task-specific performance, while maintaining the general language
understanding gained from BERT’s initial pre-training. This strategy effec-
tively balances specialized learning with the retention of valuable pre-trained
knowledge.
We also explored various configurations for the extra layers to determine
the one that produces the best results. Specifically, we explored the 126
configurations that result from combining the following options:
• Using a first extra dense layer of 128, 256, or 512 neurons.

• Using or not a second extra dense layer of 128 neurons.

• Adding a dropout layer before or after all the extra layers.

• Considering a dropout percentage of 0%, 10%, 20%, or 30% for all the dropout layers.

• Using a learning rate of 0.001, 0.0001, or 0.00001.

⁵ Source: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
⁶ A degradation issue during the retraining of pre-trained models, leading to loss of the original capabilities [45].
All the internal hidden layers are dense layers using the ReLU activation function⁷. Additionally, to obtain the probability of belonging to the default class, the last layer of the neural network was configured with a single neuron and a sigmoid activation function⁸.
The neural network is trained to predict the loan outcome (default or non-
default) based on the textual description as input. As a loss function, we use
the weighted binary cross-entropy, which measures the difference between the
predicted probabilities and the true binary labels of the input data, taking into account class weights⁹. The training was set up with a batch size of 64 and run for 25 epochs, with early stopping after 3 epochs without improvement.

⁷ The Rectified Linear Unit (ReLU) activation function outputs the maximum of zero and the input value, “activating” the neuron if the input is positive.
⁸ The sigmoid activation function introduces non-linearity and maps the input values to a range between 0 and 1, facilitating binary classification tasks.
⁹ Considering class weights is crucial given the imbalanced nature of our dataset.
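A sketch of a network of this kind is shown below. The paper loads bert_en_uncased_L-12_H-768_A-12 from TensorFlow Hub; for illustration, this sketch uses the equivalent Hugging Face checkpoint, where freezing individual encoder blocks is straightforward, and picks one of the 126 configurations (512 and 128 neurons, 20% dropout before the extra layers, learning rate 0.0001, as in fold 0 of Table 6):

```python
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.bert.embeddings.trainable = False
for block in bert.bert.encoder.layer[:-1]:   # freeze all but the last block
    block.trainable = False

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output  # 768-dim

x = tf.keras.layers.Dropout(0.2)(pooled)                  # dropout before extra layers
x = tf.keras.layers.Dense(512, activation="relu")(x)      # first extra dense layer
x = tf.keras.layers.Dense(128, activation="relu")(x)      # second extra dense layer
score = tf.keras.layers.Dense(1, activation="sigmoid")(x) # BERT score in [0, 1]

model = tf.keras.Model([input_ids, attention_mask], score)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")  # class weights are passed to fit()

early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(tokenized_inputs, labels, batch_size=64, epochs=25,
#           class_weight={0: w0, 1: w1}, callbacks=[early_stop])
```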
5.2.2. Avoiding data leakage in cross-validation
As previously mentioned, our BERT model generates a default proba-
bility, which is then integrated into the classifier as an additional quanti-
tative variable. It is important to note that when computing the BERT
default probability, we must replicate the exact folds used in the k-fold cross-
validation of the boosting algorithms. In each iteration, boosting algorithms
are trained using the BERT score variable. It is essential to prevent BERT from training with validation data from a specific fold, as doing so would distort the model’s true performance: the fold’s validation data would contain BERT predictions on instances seen during training, which do not accurately represent the model’s real-world default prediction ability.
To avoid this data-leakage problem, we use the data as shown in Figure 1. The diagram consists of five steps (sketched in code after the list):
1. Description extraction: Textual descriptions are extracted from the
original dataset.
2. Folds generation: The exact same folds as in the boosting algorithms
are generated for the textual descriptions. This is done by dividing the
data into a train set (green color) and a test set (purple color). Each
fold gets a different test set.
3. Optimization of the neural network architecture: The train set obtained
in the previous step is divided into a 70% train subset (light green
color) and a 30% test subset (dark green color). The neural network is
trained with the previously mentioned configurations on the training
subset (light green color), and is tested by predicting the test subset
(dark green color).
4. Default prediction: The optimal configuration obtained in step 3 (lowest weighted binary cross-entropy on the test subset) is trained with
the train and test subsets (light and dark green colors) and is used to
predict the default probabilities of the test set obtained in step 2 (purple
color). These predictions of unseen data will be the BERT score values
of the current fold.
5. BERT score integration: Once the BERT score values of all the folds
are generated in step 4, they are incorporated into the original dataset
as a quantitative variable.
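A minimal sketch of this out-of-fold scoring procedure follows; build_and_train is a hypothetical helper implementing step 3 (it splits its input 70/30, searches the 126 configurations, and returns the refitted best model):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def out_of_fold_bert_scores(descriptions: np.ndarray, labels: np.ndarray,
                            build_and_train, seed: int = 0) -> np.ndarray:
    """Steps 2-5: produce BERT scores only on data unseen during training."""
    # The same folds as in the boosting cross-validation (step 2).
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = np.zeros(len(descriptions))
    for train_idx, test_idx in folds.split(descriptions, labels):
        model = build_and_train(descriptions[train_idx], labels[train_idx])
        # Each loan is scored by a model that never saw it in training (step 4).
        scores[test_idx] = model.predict(descriptions[test_idx]).ravel()
    return scores  # merged back into the dataset as the BERT score (step 5)
```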
5.2.3. Result of the transfer learning process
As explained before, we explored 126 neural network configurations in each fold of the 5-fold cross-validation. The resulting optimal configuration for
each validation fold and its loss score are shown in Table 6. Interestingly, the
optimal combination of parameters varies across the different folds, and the
only parameter that remains constant is the use of the second dense layer.
Despite this parameter heterogeneity, the test loss values are quite stable,
except for the case of fold 1, which has a slightly higher loss.
Table 6: BERT optimal configuration and score by fold.

Fold      Neurons in 1st dense layer    Use 2nd dense layer (128 neurons)*    Dropout    Learning rate    Loss**
Fold 0    512                           True                                  20%        0.0001           0.1774
Fold 1    512                           True                                  0%         0.00001          0.1793
Fold 2    128                           True                                  20%        0.0001           0.1776
Fold 3    256                           True                                  0%         0.001            0.1775
Fold 4    512                           True                                  0%         0.001            0.1773

* Values ‘True’ or ‘False’ indicate whether the optimal configuration includes that layer.
** Weighted binary cross-entropy of the test set predictions.
Figure 1: BERT training diagram.
5.2.4. Inclusion of the BERT score in the classification model
The experiment architecture is shown in Figure 2. First, the loan de-
scriptions are processed by the fine-tuned BERT model. By using a sigmoid
activation function in its final layer, the resulting output can be interpreted
as the probability of belonging to the positive class. This probability, denoted
as BERT score, represents a default score solely based on the content of the
loan description. Finally, this BERT score is included as an input variable
in the XGBoost classifier along with quantitative and categorical variables
to generate a final default prediction.
Figure 2: Diagram of the experiment architecture.
The fine-tuning of the BERT model was performed on a system equipped
with an Intel Core i9-12900KS processor, an NVIDIA GeForce RTX 4090
graphics card (24GB), and 128 GB of RAM. In contrast, the inference pro-
cess with the fine-tuned model, which involves generating the BERT score
for a given loan description, can be conducted on a standard PC. This ca-
pability enhances the practicality of deploying our approach in real-world
applications.
6. Analysis of the BERT score
6.1. Assessment of the BERT score as a credit risk score
In this section, we evaluate the effectiveness of the BERT score in utilizing
loan descriptions to predict default risk. Table 7 presents loan descriptions
with the highest and lowest BERT scores, which indicate loans assessed by
Table 7: Loan descriptions with highest (top) and lowest (bottom) BERT score.
BERT score Real value Description
0.8562 0 (Non-Default) getting a divorce need new apt. with new furniture be-
cause she getting everthing.
0.8149 1 (Default) need help my bills. to help pay my medications and some
bills.
0.8131 0 (Non-Default) i can pay of some bills for my self because i been helping
other people out. i could save more for my family and
their need.i have a good job that i am bless with. i am
from a large family seem like every one thinking i suppose
to help them when i need help my self.i alway belive that
the lord will.
0.8051 0 (Non-Default) consolidating our debt makes our life easier live in our
means with one solid low monthly payment insted of mul-
tiple payment that add up more then what ill be paying
with this loan and have a little left to leave in my sav-
ings for a rainy day to be honest and thank you for your
consideration good dy.
0.8045 0 (Non-Default) To consolidate debt. to pay off dept
0.0735 0 (Non-Default) Debt consolidation with a lower APR.
0.1146 0 (Non-Default) In need of funds to pay off some bills as well as minor
improvements to house and yard. I have an extremely
secure career, and maintaining my credit worthiness is
important to me.
0.1262 0 (Non-Default) Hard working individual with a stable job will use loan
proceeds to consolidate outstanding credit cards balances.
1) Net monthly income - $4,432. 2) All expenses (al-
located):. Rent - $1,124. Utilities- 84. Groceries
293. Auto (including fuel) 201 . Cell Phone 52. Ca-
ble/Internet 64. Personal care items 82. Entertain-
ment/dining 93. Sales tax 65. 3) Previously answered.
4) No.
0.1360 0 (Non-Default) This loan is to pay off credit ca.
0.1871 1 (Default) I need money for moving expenses and for a buffer for
the first month while I transition into working in my
new location. I have successfully paid off two previous
Lending Club loans in the past couple of years.
BERT as having the highest and lowest default risks, respectively. It appears
that descriptions with higher BERT scores are often less informative, containing errors such as typos (e.g., “pay of”, “everthing”, “alway”, “belive”) or grammatical issues. Interestingly, only one of these loans ultimately de-
faulted. Conversely, loan descriptions with lower BERT scores tend to be
more detailed and insightful, often including information about the loan’s
purpose and the borrower’s creditworthiness, and sometimes even providing
numeric data related to the borrower’s financial status.
Figure 3: Default and non-default proportions by BERT score range.
Figure 3 illustrates the distribution of default and non-default loans (Y
axis) across the BERT score range (X axis) in bins of 0.01. In the bar
chart, the blue segment represents the percentage of non-default loans, while
the orange segment denotes the percentage of default loans¹⁰. The figure
reveals a general trend where higher BERT scores are associated with a
higher proportion of defaulted loans. It is important to note that observations
outside the BERT score range of [0.3, 0.7] are sparse, which makes the bars
in these regions less reliable. Overall, the trend suggests that higher BERT
scores are indicative of a greater likelihood of default, highlighting the BERT
score’s usefulness as a risk assessment tool.
¹⁰ Absence of a bar indicates that no loan descriptions fall within that BERT score range.

Table 8: Classification performance of BERT binarization at 0.5.

                        Default                          Non-default
Model    BACC      Precision    Recall    F1         Precision    Recall    F1
BERT     0.5444    0.1714       0.6896    0.2746     0.8771       0.3993    0.5487

Table 8 shows the classification performance obtained by applying a 0.5 threshold to binarize the BERT score, which yields a balanced accuracy of 54.4%. The BERT score is not very precise in predicting the default class (17.1%) but retrieves 69% of its instances.
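For reference, the binarization and the metrics in Table 8 can be reproduced along these lines with scikit-learn, assuming bert_scores and y hold the out-of-fold scores and true labels:

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

pred = (bert_scores >= 0.5).astype(int)  # binarize at the 0.5 threshold
print(balanced_accuracy_score(y, pred))
print(classification_report(y, pred, target_names=["Non-default", "Default"]))
```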
These results demonstrate that the BERT model can be effectively fine-
tuned to predict the final state of the loan using only the information provided
by the borrower in the description field of the application form. In Section
7, we will contextualize these findings by comparing them with the results
obtained from XGBoost and various sets of variables.
6.2. Assessment of the relationship between the BERT score and other vari-
ables
In this section, we aim to explore potential relationships between the
BERT score and other variables. Table 9 presents the correlation coeffi-
cients between the BERT score and the quantitative variables within the
dataset. The findings reveal weak yet statistically significant relationships
with all variables except for the loan amount. Notably, the most pronounced
correlations exist with the FICO score and revenue variables. Both exhibit
inverse relationships, indicating that individuals with higher FICO scores
and revenues tend to have lower BERT scores, and vice versa. The associa-
tion between the BERT score and the debt variable (dti_n) is direct, albeit
slightly weaker than the other two correlations.
Table 9: Correlation coefficients of quantitative variables with the BERT score.

Variable     Pearson     Spearman
revenue      -0.0734*    -0.1007*
dti_n        0.0663*     0.0627*
loan_amnt    -0.0008     -0.0012
fico_n       -0.1293*    -0.1293*

* p-value less than 0.01.
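A sketch of how these coefficients could be computed with SciPy, assuming a DataFrame df that holds the variables and the out-of-fold score in a bert_score column:

```python
from scipy import stats

for col in ["revenue", "dti_n", "loan_amnt", "fico_n"]:
    r_p, p_p = stats.pearsonr(df[col], df["bert_score"])
    r_s, p_s = stats.spearmanr(df[col], df["bert_score"])
    print(f"{col}: Pearson {r_p:.4f} (p={p_p:.3g}), "
          f"Spearman {r_s:.4f} (p={p_s:.3g})")
```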
Table 10: Kruskal-Wallis test of categorical variables with the BERT score.

Variable          H-statistic
emp_length        1581.67*
purpose           1357.94*
home_ownership    243.20*
addr_state        360.44*

* p-value less than 0.01.
To evaluate the relationship with the categorical variables, Table 10 presents
the results of the Kruskal-Wallis test. The Kruskal-Wallis test is a non-
parametric statistical test that analyzes whether there are statistical differ-
ences in the BERT scores across categories within each categorical variable.
The results indicate significant differences in BERT scores among all cate-
gorical variables, suggesting a certain level of association between the BERT
score and these categorical factors. However, it remains challenging to quan-
tify the strength of this relationship or identify the specific categories with
the most robust associations.
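The test takes one group of BERT scores per category; a minimal sketch under the same assumed df:

```python
from scipy import stats

for col in ["emp_length", "purpose", "home_ownership", "addr_state"]:
    # One array of BERT scores per category of the variable.
    groups = [g["bert_score"].values for _, g in df.groupby(col)]
    h, p = stats.kruskal(*groups)
    print(f"{col}: H={h:.2f}, p={p:.4f}")
```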
Table 11: Correlation coefficients of linguistic features with the BERT score.
Variable Pearson Spearman
Word count -0.1961* -0.2650*
Polarity -0.1036* -0.1473*
Subjectivity -0.1753* -0.1866*
Readability 0.1182* 0.1518*
* p-value less than 0.01.
Finally, Table 11 examines the potential relationship between the BERT
score and various linguistic features automatically extracted from the text.
We find statistically significant correlations between the BERT score and all
linguistic features analyzed. Notably, there is a strong inverse correlation
with both word count and subjectivity, indicating that shorter and more
objective texts tend to have higher BERT scores. This finding, together with
the correlations presented in Table 9, suggests that BERT is more closely
related to linguistic features than to the numerical variables associated with
loan applications.
7. Results of the LLM-enhanced granting model
7.1. Analysis of the classification performance
First, we evaluate the impact of incorporating the BERT score in the
granting model. In our baseline experiment, we optimize XGBoost with a
genetic algorithm using the quantitative and categorical variables typically
used in granting models, while in the competing approach, we optimize it
but include the BERT score as input variable. Table 12 shows the results of
both approaches.
A closer examination of the balanced metrics reveals a marginal enhance-
ment in both BACC (0.6154 vs. 0.6187) and AUC (0.6575 vs. 0.6644), the
latter significant according to the DeLong test [46] at the 0.01 level. It is cru-
cial to note that the Lending Club dataset exclusively consists of approved
loans. This fact poses a substantial challenge to significantly enhance the
outcomes in a loan granting model such as ours since the loans included in
the dataset were initially considered favorable by the platform. Furthermore,
in experiments not reported in the paper we used CatBoost [47] instead of
XGBoost and obtained a similar BACC improvement.
Table 12: Classifier performance with and without the BERT score.
Metric Quant. + Categ. var. Quant. + Categ. + BERT score
BACC 0.6154 0.6187
AUC 0.6575 0.6644
F1 0.3266 0.3308
Precision 0.2168 0.2249
Recall 0.6614 0.6360
Accuracy 0.5835 0.6066
The performance metrics in Table 12 indicate that incorporating the
BERT score results in improved precision but diminished recall. However, at-
tributing this change in precision-recall behavior solely to the BERT score re-
quires careful consideration. This caution arises from our observations within
our dataset, where classifiers with different hyperparameters and similar
near-optimal balanced accuracy values have demonstrated varying precision-
recall behaviors, suggesting that this may also be the case here.
Table 13 shows an additional experiment in which XGBoost classifiers
are trained and optimized using only one kind of input variable. While the
Table 13: Performance of the XGBoost considering only a subset of variables.
Metric Quant. Categ. Text. BERT score
BACC 0.6062 0.5486 0.5258 0.5490
AUC 0.6457 0.5656 0.5309 0.5714
F1 0.3192 0.2759 0.2534 0.2601
Precision 0.2138 0.1746 0.1665 0.1877
Recall 0.6300 0.6563 0.5302 0.5153
Accuracy 0.5896 0.4738 0.5227 0.5724
classifier using the quantitative variables is clearly the best, the classifier
that uses just the BERT score obtains slightly better results than the one
using the four categorical variables (the AUC difference is significant at the
0.01 level according to the DeLong test). This result is noteworthy given
the well-known effectiveness and the meaningful nature of the qualitative
variables.
Table 13 also shows the result of an XGBoost classifier using the textual
features presented in Table 11, namely: polarity, subjectivity, word count,
and readability score. This classifier is outperformed by the XGBoost that
uses the BERT score in terms of balanced accuracy and AUC (significant at
the 0.01 level according to the DeLong test). This finding underscores the
superior ability of the fine-tuned LLM to leverage textual descriptions and
extract relevant information for the classification task.
7.2. Feature importance and explainability
To further analyze the impact of the BERT score, in this section, we ana-
lyze whether the models’ use of the input variables changes. Figure 4 shows
the feature importance given by XGBoost both with and without incorpo-
rating the BERT score. Notably, the BERT score emerges as the third most
influential variable, accounting for 6% importance. Moreover, the order and
magnitude of importance for other variables exhibit shifts. Specifically, the
borrower’s annual revenue diminishes in importance following the inclusion of
the BERT score, while the significance of the ‘credit card’ purpose increases.
This observation underscores the significant influence of the BERT score on
the classifier.
To gain a deeper understanding of the role of variables, we employ SHAP
values [48]. Figure 5 provides a visualization of the SHAP values for the 10
(a) Before BERT score incorporation. (b) After BERT score incorporation.
Figure 4: Feature importances of the XGBoost classifiers.
(a) Before BERT score incorporation. (b) After BERT score incorporation.
Figure 5: SHAP values of the XGBoost classifiers.
variables with the most significant impact on the model’s output, comparing
models with and without including the BERT score.
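Plots of the kind shown in Figures 5 and 6 can be produced with the shap package; the following is a sketch assuming a fitted classifier xgb_model, its input matrix X as a DataFrame, and a feature column named "BERT score":

```python
import shap

explainer = shap.TreeExplainer(xgb_model)          # TreeExplainer suits XGBoost
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, max_display=10)  # violin plots as in Figure 5
shap.dependence_plot("BERT score", shap_values, X,
                     interaction_index=None)       # as in Figure 6
```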
It is noteworthy that Figure 5a resembles a comparable figure presented
in [41], wherein SHAP values were utilized in a lending model on Lending
Club¹¹. Examining Figure 5b, we find that the BERT score ranks as the
fourth variable in terms of impact. A closer inspection of the violin plots
reveals that including BERT does not substantially alter how variables in-
fluence the target class.
¹¹ Our dataset is more constrained, only considering loans with textual descriptions, as explained in Section 4.

Figure 6: Dependence plot between SHAP values and BERT score.

Figure 6 reveals a direct and linear relationship between the BERT score and the default risk. Notably, this relationship is asymmetric around BERT
score values of 0.5; scores below 0.4 correspond to SHAP values ranging be-
tween -0.4 and -0.6, leading the model to predict non-default. Conversely,
this impact range in the positive case is only reached by BERT scores exceed-
ing 0.7, signifying that only exceptionally high BERT scores serve as strong
indicators of default.
7.3. Impact across the purpose categories
Now we delve into the influence of incorporating the BERT score on clas-
sification outcomes across various categories of the purpose variable. Lever-
aging the language comprehension capabilities of LLMs, it is conceivable that
the BERT model can provide more accurate characterizations of loan pur-
poses than the purpose variable alone. For instance, the ‘other’ category, be-
ing inherently ambiguous, could benefit from a more nuanced understanding
of the loan description, potentially leading to improved prediction outcomes.
Figure 7 illustrates the changes in balanced accuracy upon incorporating
the BERT score for each loan purpose. Notably, the ‘other’ category shows
a modest improvement of 2.11%. However, several other categories exhibit
more substantial enhancements, including ‘educational’ (9.22%), ‘moving’
(5.84%), ‘medical’ (3.84%), and ‘small business’ (3.40%). Given the black-
box nature of our model, it is difficult to ascertain the reasons why these
categories have improved more than the others. However, we posit that these
categories share a commonality—the more detailed specification of purpose
or a deeper understanding of the borrower’s situation contributes to a more
Figure 7: BACC changes by purpose after including the BERT score.
precise delineation of the default risk. For instance, in education loans, fac-
tors like the field of study and the educational institution may play a crucial
role in determining employability and salaries, thereby enabling a more accu-
rate assessment of the default risk. The same happens in the case of moving
to a new place, having to pay medical expenses, or obtaining funds for a
business.
The following examples illustrate instances of loans that were initially
misclassified as defaults in the absence of the BERT score but were subse-
quently correctly predicted as non-defaults after its inclusion:
Purpose: Educational; BERT score: 0.3765
“I’m 25 years old and living in New Orleans. I’m asking for a rel-
atively small amount of money to help me take care of my post-Bac
tuition for teacher certification and to help me pay off a credit card.
I currently work as a private school teacher making very little money
with no benefits (about $29,000 a year). I have to pay about $1000
in the coming year for my tuition, and I have to get health insurance
ASAP, but it’s hard to do so with no financial help from anyone else.
My parents can’t help me because my mother is permanently disabled
and my father took a huge pay cut this year.”
Purpose: Moving; BERT score: 0.3843
“Although I can afford payments,due to some recent expenses, I am
short on cash flow for an unexpected move. I am, however, looking for
a more reasonable alternative to banking rates. I have borrowed from
Lending Club before and always paid fully and on time with automatic
payments.”
Purpose: Small business; BERT score: 0.3292
“The purpose of this loan is to fund advertising costs for a growing
internet business venture. I am a successful sales professional earning
an average of over $250K per year over the last 5 years. My credit
scores are strong and I have a documented history of paying all my
debts (personal or business related) on time.”
Purpose: Medical; BERT score: 0.3921
“This loan will be used to pay off a Care Credit credit card currently
at 21.9% I used the card to pay for a prosthetic limb that my health
insurer would not cover.”
In each of these examples, the BERT score falls below 0.4. As shown in Figure 6, scores below this value give the classifier a stronger signal to categorize the loan as non-default. All the examples offer detailed accounts of the borrower’s situation and the purpose of the loan, and this clarity could explain their low BERT scores. In addition, all the texts are accurate, readable, and written in a confident, objective tone. While previous evidence in the literature offered mixed results on linguistic factors as default predictors [3, 18, 4, 7], we have shown that our BERT score is significantly related to them. Although these explanations are plausible, given the black-box nature of our BERT-based approach, the factors driving the BERT score of each text cannot be precisely determined. This shortcoming must be addressed to comply with regulatory requirements and to foster trust among end-users.
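To illustrate how such a score is obtained at prediction time, the sketch below scores a single description with a fine-tuned BERT classifier through the Hugging Face transformers library. The checkpoint path is hypothetical, and we assume class 1 encodes default.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "./bert-risk-finetuned"  # hypothetical path to a fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def risk_score(description: str) -> float:
    """Probability of default assigned to a loan description (class 1 = default)."""
    inputs = tokenizer(description, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Scores below roughly 0.4 push the granting model towards non-default (Figure 6).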
8. Conclusions
In this paper, we have introduced a novel approach that highlights the
value of state-of-the-art natural language processing techniques in enhancing
credit risk models. Our approach fine-tunes BERT using loan descriptions
to generate a risk score that effectively discriminates between defaulted and
non-defaulted loans. Furthermore, we show that integrating the BERT-based
risk score with traditional variables significantly enhances the performance
of conventional granting models. These results underscore the potential of
LLMs to augment lending models, emphasizing their invaluable role in har-
nessing the richness of textual data to improve credit risk assessment.
Our approach can be easily applied without manual annotation, which is
a complex task that also introduces annotator subjectivity. As our analy-
sis shows, the score incorporates both linguistic factors and content aspects
that relate to the creditworthiness of the borrower’s situation and purpose.
Additionally, while fine-tuning the LLM is a time-consuming process that
requires GPU processing, generating predictions, such as the risk score for a
loan description, can be done almost instantly on standard consumer hardware.
Our work represents just the first steps in this endeavor and underscores
the necessity for further exploration and integration of cutting-edge method-
ologies to enhance the effectiveness of risk-return assessments in the dynamic
landscape of P2P lending and even in traditional financial institutions. Several directions deserve exploration to refine the predictive capability of
the model and the understanding of loan applicants’ conditions.
An important limitation of our work is the lack of transparency regard-
ing the factors contributing to a given risk score and its potential biases.
These issues hinder its application in real-world contexts, where a compre-
hensive understanding of the score generation process and assurance that it
does not introduce biases related to gender, ethnicity, financial inclusion, or
other factors are essential. Thus, efforts to enhance the transparency and
explainability of the risk score are crucial not only for regulatory compliance
but also for fostering trust among borrowers and lenders [49].
In this respect, LLM-based topic modeling tools such as BERTopic [50]
could be used to gain an understanding of the risk scores. For example, topic
modeling could identify whether a loan description relates to “risky topics”, which could in turn be incorporated as input variables in the granting model.
Similarly, the embeddings from BERT or other LLMs could be used to gen-
erate interpretable topics from large, complex vocabularies, as other works
have done [51]. Applying such topic modeling approaches to loan descrip-
tions could facilitate the identification of situations with varying risk levels.
These topics could then serve as additional variables in classification mod-
els, improving predictive performance while maintaining the transparency
required in financial assessments.
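As a sketch of this direction, BERTopic’s standard API could be applied directly to the descriptions; the min_topic_size value and the idea of joining topic assignments back to the loan table are our own illustrative choices.

from bertopic import BERTopic

def fit_loan_topics(descriptions: list[str]):
    """Fit BERTopic on loan descriptions; the assigned topic of each loan
    could later enter the granting model as a categorical feature."""
    topic_model = BERTopic(min_topic_size=50)  # illustrative granularity setting
    topics, _ = topic_model.fit_transform(descriptions)
    print(topic_model.get_topic_info().head())  # most frequent topics and keywords
    return topic_model, topics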
Another potential research avenue involves delving into more advanced
LLMs. For instance, encoder-only models like RoBERTa [37] could improve
the model’s capacity to capture more intricate linguistic patterns. Further-
more, the rise of generative AI has spurred interest in exploring decoder
models for classification purposes. Although these models were not origi-
nally designed for such tasks, recent advances have yielded promising adaptations. For example, CARP (Clue And Reasoning Prompting) [52]
utilizes in-context learning, generating textual responses conditioned on a
provided prompt with a few annotated examples (few-shot learning). Re-
markably, this approach improved results in classification tasks without the
need for transfer learning and fine-tuning and could be applied to risk scoring.
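A schematic version of such a prompt is sketched below. The template wording and the placeholder clues are our own illustration of the clue-and-reasoning idea, not the prompts used in [52]; in CARP proper, the demonstrations also contain generated clues and reasoning chains.

def build_carp_style_prompt(examples: list[tuple[str, str]],
                            new_description: str) -> str:
    """Assemble a few-shot, clue-and-reasoning prompt for a decoder LLM."""
    parts = [
        "Classify each loan description as DEFAULT or NON-DEFAULT.",
        "First list textual clues, then reason step by step, then answer.",
        "",
    ]
    for text, label in examples:  # a few annotated demonstrations
        parts.append(f"Description: {text}")
        parts.append("Clues: ...")      # in CARP these are model-generated
        parts.append("Reasoning: ...")
        parts.append(f"Answer: {label}")
        parts.append("")
    parts.append(f"Description: {new_description}")
    parts.append("Clues:")  # the model continues from here
    return "\n".join(parts)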
Future work should also address the economic implications of our findings
by integrating cost- or profit-sensitive approaches [53, 42]. These methods incorporate the financial impact of classification outcomes into model evaluation.
Investigating whether improvements in model performance by including tex-
tual descriptions translate into better financial outcomes would enhance the
real-world applicability of our findings.
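As a simple illustration of a cost-sensitive evaluation, the sketch below selects the score threshold that minimizes total misclassification cost; the unit costs are placeholder assumptions, not values estimated from the Lending Club data.

import numpy as np

def best_threshold(y_true: np.ndarray, scores: np.ndarray,
                   cost_fn: float = 5.0, cost_fp: float = 1.0) -> float:
    """Score threshold with the lowest total misclassification cost.
    cost_fn: cost of funding a loan that defaults (false negative).
    cost_fp: cost of rejecting a loan that would have been repaid (false positive)."""
    def total_cost(t: float) -> float:
        reject = scores >= t                  # predicted default -> reject
        fn = np.sum((y_true == 1) & ~reject)  # funded loans that default
        fp = np.sum((y_true == 0) & reject)   # good loans turned away
        return cost_fn * fn + cost_fp * fp
    return float(min(np.linspace(0.0, 1.0, 101), key=total_cost))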
References
[1] M. Cummins, T. Lynn, C. Mac an Bhaird, P. Rosati, Addressing
Information Asymmetries in Online Peer-to-Peer Lending, Springer
International Publishing, Cham, 2019, pp. 15–31. doi:10.1007/
978-3-030-02330-0_2.
[2] J. Michels, Do unverifiable disclosures matter? Evidence from peer-to-
peer lending, The Accounting Review 87 (4) (2012) 1385–1413. doi:
10.2308/accr-50159.
[3] Q. Gao, M. Lin, Lemon or cherry? The value of texts in debt crowd-
funding, Tech. Rep. 18, Center for Analytical Finance. University of
California, Santa Cruz (2015).
URL https://cafin.ucsc.edu/research/work_papers/CAFIN_WP18.pdf
[4] G. Dorfleitner, C. Priberny, S. Schuster, J. Stoiber, M. Weber, I. de
Castro, J. Kammler, Description-text related soft information in peer-to-
peer lending – Evidence from two leading European platforms, Journal of Banking & Finance 64 (2016) 169–187. doi:10.1016/j.jbankfin.2015.11.009.
[5] Y. Xia, L. He, Y. Li, N. Liu, Y. Ding, Predicting loan default in peer-to-
peer lending using narrative data, Journal of Forecasting 39 (2) (2020)
260–280. doi:10.1002/for.2625.
[6] W. Zhang, C. Wang, Y. Zhang, J. Wang, Credit risk evaluation model
with textual features from loan descriptions for p2p lending, Electronic
Commerce Research and Applications 42 (2020) 100989. doi:10.1016/j.elerap.2020.100989.
[7] M. Siering, Peer-to-peer (p2p) lending risk management: Assessing
credit risk on social lending platforms using textual factors, ACM
Transactions on Management Information Systems 14 (3) (jun 2023).
doi:10.1145/3589003.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding (2018).
doi:10.48550/ARXIV.1810.04805.
[9] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text
classification?, in: M. Sun, X. Huang, H. Ji, Z. Liu, Y. Liu (Eds.),
Chinese Computational Linguistics, Springer International Publishing,
Cham, 2019, pp. 194–206. doi:10.1007/978-3-030-32381-3_16.
[10] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT:
a pre-trained biomedical language representation model for biomedical
text mining, Bioinformatics 36 (4) (2019) 1234–1240. doi:10.1093/
bioinformatics/btz682.
[11] V. S. Tida, S. H. Hsu, Universal spam detection using transfer learning
of BERT model, in: Proceedings of the 55th Hawaii International Con-
ference on System Sciences, 2022, pp. 7669–7677.
URL http://hdl.handle.net/10125/80263
[12] M.-J. Ariza-Garzón, M.-D.-M. Camacho-Miñano, M.-J. Segovia-Vargas,
J. Arroyo, Risk-return modelling in the p2p lending market: Trends,
gaps, recommendations and future directions, Electronic Commerce Re-
search and Applications 49 (2021) 101079. doi:10.1016/j.elerap.2021.101079.
[13] J. Xu, D. Chen, M. Chau, L. Li, H. Zheng, Peer-to-peer loan fraud
detection: Constructing features from transaction data, MIS Quarterly
45 (3) (2022) 1777–1792. doi:10.25300/misq/2022/16103.
[14] Y. Li, A. Hao, X. Zhang, X. Xiong, Network topology and systemic risk
in peer-to-peer lending market, Physica A: Statistical Mechanics and its
Applications 508 (2018) 118–130. doi:10.1016/j.physa.2018.05.083.
[15] J. Xu, D. Chen, M. Chau, Identifying features for detecting fraudu-
lent loan requests on p2p platforms, in: 2016 IEEE Conference on
Intelligence and Security Informatics (ISI), 2016, pp. 79–84. doi:
10.1109/ISI.2016.7745447.
[16] Z. Qi, D. Chen, J. J. Xu, Do facial images matter? Understand-
ing the role of private information disclosure in crowdfunding mar-
kets, Electronic Commerce Research and Applications 54 (C) (jul 2022).
doi:10.1016/j.elerap.2022.101173.
[17] M. Herzenstein, S. Sonenshein, U. M. Dholakia, Tell me a good story
and I may lend you my money: The role of narratives in peer-to-peer
lending decisions, SSRN Electronic Journal (2011). doi:10.2139/ssrn.
1840668.
[18] S. Wang, Y. Qi, B. Fu, H. Liu, Credit risk evaluation based on text
analysis, International Journal of Cognitive Informatics and Natural In-
telligence 10 (2016) 1–11. doi:10.4018/IJCINI.2016010101.
[19] C. Jiang, Z. Wang, R. Wang, Y. Ding, Loan default prediction by com-
bining soft information extracted from descriptive text in online peer-to-
peer lending, Annals of Operations Research 266 (1–2) (2017) 511–529.
doi:10.1007/s10479-017-2668-z.
[20] J. Yao, J. Chen, J. Wei, Y. Chen, S. Yang, The relationship between
soft information in loan titles and online peer-to-peer lending: evidence
from renrendai platform, Electronic Commerce Research 19 (1) (2018)
111–129. doi:10.1007/s10660-018-9293-z.
[21] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
representations in vector space (2013). doi:10.48550/ARXIV.1301.
3781.
[22] T. Loughran, B. McDonald, When is a liability not a liability? Textual
analysis, dictionaries, and 10-Ks, The Journal of Finance 66 (1) (2011)
35–65. doi:10.1111/j.1540-6261.2010.01625.x.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, I. Polosukhin, Attention Is All You Need (2017). doi:10.
48550/ARXIV.1706.03762.
[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever,
Language models are unsupervised multitask learners, Tech. rep.,
OpenAI (2019).
URL https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler,
J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
D. Amodei, Language models are few-shot learners (2020). doi:10.
48550/ARXIV.2005.14165.
[26] D. Pride, M. Cancellieri, P. Knoth, CORE-GPT: Combining open access
research and large language models for credible, trustworthy question
answering, arXiv preprint arXiv:2307.04683 (2023). doi:10.48550/arXiv.2307.04683.
[27] A. Bhaskar, A. R. Fabbri, G. Durrett, Prompted opinion summarization
with GPT-3.5, arXiv preprint arXiv:2211.15914 (2022). doi:10.48550/arXiv.2211.15914.
[28] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka-
plan, H. Edwards, Y. Burda, G. Brockman, A. Ray, et al., Evaluating
large language models trained on code, arXiv preprint arXiv:2107.03374
(2021). doi:10.48550/arXiv.2107.03374.
[29] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy,
V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019). doi:10.48550/arXiv.1910.13461.
[30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning
with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (1) (jan 2020).
URL http://jmlr.org/papers/v21/20-074.html
[31] J. Kriebel, L. Stitz, Credit default prediction from user-generated text
in peer-to-peer lending using deep learning, European Journal of Oper-
ational Research 302 (1) (2022) 309–323. doi:10.1016/j.ejor.2021.
12.024.
[32] M. Stevenson, C. Mues, C. Bravo, The value of text for small business
default prediction: A Deep Learning approach, European Journal of
Operational Research 295 (2) (2021) 758–771. doi:10.1016/j.ejor.
2021.03.008.
[33] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter (2019). doi:
10.48550/ARXIV.1910.01108.
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, AL-
BERT: A Lite BERT for Self-supervised Learning of Language Repre-
sentations (2019). doi:10.48550/ARXIV.1909.11942.
L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, É. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language
model, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Pro-
ceedings of the 58th Annual Meeting of the Association for Computa-
tional Linguistics, Association for Computational Linguistics, Online,
2020, pp. 7203–7219. doi:10.18653/v1/2020.acl-main.645.
[36] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: Practical ML for Developing Countries Workshop at ICLR 2020, 2020. doi:10.48550/arXiv.2308.02976.
[37] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT
Pretraining Approach (2019). doi:10.48550/ARXIV.1907.11692.
[38] Z. Gao, A. Feng, X. Song, X. Wu, Target-dependent sentiment clas-
sification with BERT, IEEE Access 7 (2019) 154290–154299. doi:
10.1109/ACCESS.2019.2946594.
[39] S. A. Basha, M. M. Elgammal, B. M. Abuzayed, Online peer-to-peer
lending: A review of the literature, Electronic Commerce Research and
Applications 48 (2021) 101069. doi:10.1016/j.elerap.2021.101069.
[40] M. J. Ariza-Garzón, M. Sanz-Guerrero, J. Arroyo Gallardo, Lending Club, Lending Club loan dataset for granting models (May 2024). doi:10.5281/zenodo.11295916.
[41] M. J. Ariza-Garzón, J. Arroyo, A. Caparrini, M.-J. Segovia-Vargas, Ex-
plainability of a machine learning granting scoring model in peer-to-peer
lending, IEEE Access 8 (2020) 64873–64890. doi:10.1109/ACCESS.
2020.2984412.
[42] M.-J. Ariza-Garzón, J. Arroyo, M.-J. Segovia-Vargas, A. Caparrini,
Profit-sensitive machine learning classification with explanations in
credit risk: The case of small businesses in peer-to-peer lending, Elec-
tronic Commerce Research and Applications 67 (2024) 101428. doi:
10.1016/j.elerap.2024.101428.
[43] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system,
in: Proceedings of the 22nd ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, KDD ’16, Associa-
tion for Computing Machinery, New York, NY, USA, 2016, p. 785–794.
doi:10.1145/2939672.2939785.
[44] J. H. Holland, Adaptation in Natural and Artificial Systems: An Intro-
ductory Analysis with Applications to Biology, Control, and Artificial
Intelligence, The MIT Press, Cambridge, Massachusetts, USA, 1992.
doi:10.7551/mitpress/1090.001.0001.
[45] M. Biesialska, K. Biesialska, M. R. Costa-juss`a, Continual lifelong learn-
ing in natural language processing: A survey, in: D. Scott, N. Bel,
C. Zong (Eds.), Proceedings of the 28th International Conference on
Computational Linguistics, International Committee on Computational
Linguistics, Barcelona, Spain (Online), 2020, pp. 6523–6541. doi:
10.18653/v1/2020.coling-main.574.
[46] X. Sun, W. Xu, Fast implementation of DeLong’s algorithm for compar-
ing the areas under correlated receiver operating characteristic curves,
IEEE Signal Processing Letters 21 (11) (2014) 1389–1393. doi:10.
1109/LSP.2014.2337313.
[47] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin,
CatBoost: unbiased boosting with categorical features, in: Proceedings
of the 32nd International Conference on Neural Information Processing
Systems, NIPS’18, Curran Associates Inc., Red Hook, NY, USA, 2018,
p. 6639–6649.
[48] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model pre-
dictions, in: Proceedings of the 31st International Conference on Neural
Information Processing Systems, NIPS’17, Curran Associates Inc., Red
Hook, NY, USA, 2017, p. 4768–4777.
[49] ROFIEG, Thirty recommendations on regulation, innovation and finance. Final report to the European Commission by the Expert Group on Regulatory Obstacles to Financial Innovation, Tech. rep., European Commission (2019).
URL https://ec.europa.eu/info/files/191113-report-expert-group-regulatory-obstacles-financial-innovation_en
[50] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2022). doi:10.48550/ARXIV.2203.05794.
[51] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, Topic Modeling in Embedding
Spaces, Transactions of the Association for Computational Linguistics
8 (2020) 439–453. doi:10.1162/tacl_a_00325.
[52] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, G. Wang, Text classifi-
cation via large language models (2023). doi:10.48550/ARXIV.2305.
08377.
[53] Y. Xia, C. Liu, N. Liu, Cost-sensitive boosted tree for loan evaluation in
peer-to-peer lending, Electronic Commerce Research and Applications
24 (2017) 30–49. doi:10.1016/j.elerap.2017.06.004.