Highlights

• Fine-tuned BERT successfully produces risk scores from loan descriptions.

• Integrating the BERT score with traditional variables boosts granting model performance.

• The approach applies without manual annotation, reducing subjectivity and complexity.

• Generating risk scores is instant with standard equipment.

• Research is needed to enhance transparency and trust in these LLM-based approaches.
Credit Risk Meets Large Language Models: Building a Risk Indicator from Loan Descriptions in P2P Lending

Mario Sanz-Guerrero a,∗, Javier Arroyo b

a Facultad de Informática, Universidad Complutense de Madrid, Calle del Prof. José García Santesmases, 9, Madrid, 28040, Comunidad de Madrid, Spain
b Instituto de Tecnología del Conocimiento, Universidad Complutense de Madrid, Calle del Prof. José García Santesmases, 9, Madrid, 28040, Comunidad de Madrid, Spain
Abstract
Peer-to-peer (P2P) lending connects borrowers with lenders through online
platforms but faces significant information asymmetry, as lenders often lack
sufficient data to assess the creditworthiness of borrowers. This paper ad-
dresses this challenge by leveraging BERT (Bidirectional Encoder Represen-
tations from Transformers), a Large Language Model (LLM) known for its
ability to understand contextual nuances in text, to analyze borrowers’ loan
descriptions.
We apply transfer learning to make BERT distinguish between default
and non-default loans using the loan descriptions of the Lending Club dataset.
The resulting BERT-generated risk score demonstrates solid competence in
discriminating between default and non-default loans. Furthermore, integrat-
ing the BERT-generated risk score with traditional variables enhances classi-
fier performance, which demonstrates the complementary nature of advanced
language model outputs in refining credit risk assessment methodologies.
However, the opacity of LLMs and potential biases highlight the need for
transparent regulatory frameworks. This study opens new research avenues
in P2P lending and AI, emphasizing the importance of trust and transparency
in adopting advanced credit risk models.
Keywords: Credit Risk, P2P Lending, BERT, Transfer Learning, Explainable AI, XGBoost

∗ Corresponding author.
Email addresses: [email protected] (Mario Sanz-Guerrero), [email protected] (Javier Arroyo)
1. Introduction
Peer-to-peer (P2P) lending is a growing phenomenon that allows indi-
viduals to engage in direct lending and borrowing transactions, bypassing
traditional financial institutions. The process is facilitated through online
platforms, where prospective borrowers submit loan applications and poten-
tial lenders make informed decisions about where to invest their funds.
An inherent challenge in P2P lending is the presence of information asym-
metry, wherein borrowers possess more and often superior information com-
pared to lenders. To address this issue, platforms employ strategies to com-
plement the conventional data provided in loan applications [1]. For instance,
borrowers are frequently encouraged to provide a voluntary textual descrip-
tion describing the purpose of the loan and their particular situation. De-
spite the absence of formal verification, such voluntary disclosures have been
observed to stimulate increased bidding activity among lenders. However,
lenders may lack the expertise to assess the creditworthiness of borrowers
effectively and may be influenced by different factors [2].
Traditional credit scoring models do not harness the rich information em-
bedded in the narratives submitted by loan applicants. Several approaches
have tried to incorporate such information into credit risk models. The ap-
proaches include the extraction of linguistic metrics [3, 4], the use of topic
modeling to characterize the underlying themes [5, 6], or a combination of
both [7].
However, these methods have limitations in fully capturing the contex-
tual nuances and semantic intricacies of loan descriptions. This study aims
to address this gap by leveraging the capabilities of Large Language Mod-
els (LLMs), specifically BERT (Bidirectional Encoder Representations from
Transformers) [8], which has shown exceptional performance in understand-
ing and processing complex textual data.
BERT’s bidirectional training and ability to capture context at a granular
level make it particularly suitable for tasks requiring deep semantic under-
standing. It has proven successful in fine-tuning for classification tasks [9],
in particular domains such as biomedicine [10] and in specialized tasks like
spam classification [11]. In this study, we apply transfer learning to train
BERT to distinguish between default and non-default loans based on their
textual descriptions.
Our work demonstrates BERT’s capability to effectively leverage descrip-
tive data to generate a risk indicator that accurately classifies defaulted loans.
Moreover, we show that integrating the BERT-generated risk score into tradi-
tional credit granting models can significantly enhance their predictive per-
formance. This highlights BERT’s potential to improve the accuracy and
reliability of credit risk models in P2P lending.
Furthermore, our findings highlight the need for greater transparency in
adopting advanced models to build trust among users and regulatory entities.
This study paves the way for future research, emphasizing the transformative
potential of LLMs like BERT in the field of credit risk assessment.
This paper is organized as follows: Section 2 provides a comprehensive
review of related work in credit risk assessment and natural language pro-
cessing. Section 3 presents an overview of LLMs and the BERT model.
Section 4 describes the dataset used, detailing the data preprocessing steps
and conducting an in-depth exploratory data analysis. Section 5 outlines
the methodology, model architecture, and training procedures employed in
integrating BERT into the credit risk assessment framework. Section 6 ana-
lyzes the risk score generated by the BERT description processing. Section
7 discusses the results of our experiments, highlighting the improvements
achieved by incorporating BERT-based textual analysis. Finally, Section 8
concludes the paper by summarizing key findings, discussing implications,
and suggesting avenues for future research in the intersection of NLP and
credit risk assessment.
2. Related work
In their comprehensive analysis of risk-return modeling within the P2P
lending market [12], the authors identify a discernible trend towards includ-
ing new sources and types of information to improve risk and profit manage-
ment models in the P2P market. The sources are very diverse and include
transactional data [13], the topology of the lending-borrowing network [14],
data from social networks [15], or, more recently, facial features [16]. Among
them, the authors identify as a predominant trend the inclusion of textual
data taken from statements describing the purpose of the loan.
In a pioneering work exploring the impact of textual factors on peer-to-peer
lending [17], the authors analyze P2P loans including manually annotated
narrative aspects, such as trustworthiness, economic hardship, hard work,
success, morality, and religiosity. These aspects were combined with de-
mographic variables and loan characteristics. Their results highlight that
narratives regarding trustworthiness strongly influence decision-makers, par-
ticularly credit lenders, in their loan approval process. Additionally, some of
these narratives play a substantial role in subsequent loan performance.
However, most subsequent studies typically use text mining or artificial
intelligence methods to extract linguistic features or loan description top-
ics. Regarding the use of linguistic features, the authors in [3] use machine
learning and text mining techniques to quantify and extract linguistic fea-
tures (e.g., readability, positivity, objectivity, and deception cues), and then
build both explanatory econometric models and predictive models using such
features. They find that such features can indeed reflect borrowers’ creditworthiness and predict loan default. They also use a panel of investors and confirm that investors value texts written by borrowers, but can also be deceived by some of the deception cues well established in the literature.
Similarly, in [4], the authors include linguistic factors and the presence of
social and emotional keywords and evaluate their impact on two European
platforms. They found that text-derived variables influence the probability
of funding, but not the probability of default. In [18], linguistic statistical
features and abstract text features (including deception, subjectivity, sen-
timent, readability, personality, and mindset) are used to characterize text
descriptions. They compare the performance of different classifiers based on
the textual features and conclude that their performance is close to that of
the classifiers using traditional financial features, but that adding textual
features can improve the performance of the whole credit risk evaluation
system.
As for topic modeling, the Latent Dirichlet Allocation model (LDA) has
been widely used. In [19], the authors use LDA to extract six topics with clearly interpretable meanings from the loan descriptions: assets, income and expenses,
work, family, business, and agriculture. They also consider the number of
characters in the descriptive text. They conclude that soft (qualitative) in-
formation can improve the performance of loan default prediction compared
to existing methods based only on hard (quantitative) information and that
soft features have a significant ability to discriminate loan defaults. Similarly,
in [20], an LDA topic model is used to classify the loan titles into six pur-
poses. Their findings reveal that the stated purpose significantly influences a
borrower’s chances of securing financing. Notably, ambiguous titles—where
borrowers fail to clearly articulate the loan’s purpose—substantially diminish
the likelihood of loan approval. In [5], Xia et al. used a keyword clustering
algorithm for automatic topic extraction. Their method combines keyword
extraction based on term frequency-inverse document frequency (TF-IDF)
with word embeddings generated by the Word2Vec neural network model
[21]. Analysis of three real-world datasets demonstrated that incorporating
these topic variables significantly enhanced predictive accuracy compared to
relying solely on traditional information.
More recently, Siering [7] investigated the effect of both aspects: topics
and linguistic features. To extract topics, the author utilized a financial text
analysis procedure [22] to build a domain dictionary. The identified topics
describe the purpose of the loan, as well as the borrowers’ request for help,
reliability, or appreciation. These topics were operationalized as binary in-
dicator variables. The author also employed text mining to create variables that measure aspects such as polarity, active orientation, readability, average sentence length, and word count. These variables are then used as input in a logistic regression. The results indicate that both linguistic and content-based factors contribute to predicting loan default probability, with the content-based factors demonstrating greater significance. Analysis of variable contributions identifies which factors increase or decrease default likelihood. Notably, expres-
sions of reliability correlate positively with loan repayment.
Yet a significant gap remains in the application of state-of-the-art natural language processing techniques, including deep learning methods. In
their work [6], Zhang et al. studied the transformer encoder’s ability to cap-
ture textual features from loan descriptions. These characteristics, alongside
the traditional hard features derived from loan applications, were inputted
into a neural network to predict the likelihood of loan default. The study
highlights the effectiveness of transformers, showing better results when in-
cluding textual loan descriptions compared to models that do not consider
them. In this work, we explore the use of encoder-based LLMs, such as BERT, which have successfully improved classification in other fields.
3. An overview of Large Language Models and BERT
Large Language Models (LLMs) are built upon the Transformer archi-
tecture [23], leveraging attention mechanisms to enhance language compre-
hension. LLMs can be broadly categorized into three primary families, each
distinguished by its architecture:
• Encoder-only models, widely employed for language comprehension tasks such as text classification, named entity recognition, and extractive question answering. The most famous example is BERT [8], which will be explained in more detail below.

• Decoder-only models, designed for generative tasks and exemplified by the well-known GPT models [24, 25]. They are employed in various tasks, including question answering [26], text summarization [27], and programming code generation [28].

• Encoder-decoder models, suited for tasks demanding both language understanding and generation, such as language translation or text summarization. The most influential models are BART [29] and T5 [30].
The selection of the appropriate architecture hinges on the specific re-
quirements of the intended task. Whether it be the nuanced comprehension
of language, creative text generation, or the synthesis of both, the versatility
of LLMs offers a tailored solution for diverse applications.
We will focus on BERT (Bidirectional Encoder Representations from Transformers), a Transformer-based language model introduced by
Google researchers in 2018 [8]. BERT’s architecture consists of a stack of
encoders from the Transformer model. The bidirectional nature of BERT is
key, as it considers both the left and right context of each word, enhancing
its ability to understand context-dependent meanings and to be effective in
language understanding tasks. Numerous studies have consistently shown
that BERT is one of the most effective language models for many of these tasks
[31, 32]. Notably, BERT has 340 million parameters, while the widely rec-
ognized GPT-3 model has 175 billion, making BERT 514 times smaller than
GPT-3 [25]. Given this significant size difference, BERT can be operated on
standard home equipment for model inference, which greatly simplifies its
use in practical scenarios. In contrast, GPT-3, built on a Transformer decoder stack, demands much more powerful equipment and is oriented toward language generation tasks.
BERT stands as a milestone whose success has spurred the development of
a diverse family of models that build upon its architecture. Some versions aim
to achieve similar performance while having a smaller number of parameters,
such as DistilBERT (a distilled version of BERT) [33] or ALBERT (A Lite BERT) [34]. Others are adaptations to other languages, such as CamemBERT
[35] to French or BETO to Spanish [36]. Other proposals aimed to improve
upon BERT by modifying some design decisions when pretraining BERT
and also training the model longer, as in the case of RoBERTa (Robustly
optimized BERT approach) [37], which resulted in improved contextualized
representations and enhanced language understanding.
To further elucidate the role of BERT in specialized applications, it is crucial to understand its capacity for transfer learning and fine-tuning. Transfer learning involves applying a pre-trained model like BERT, which has learned general language patterns from a large corpus, to a specific task or dataset. This technique allows us to take advantage of the rich linguistic representations without the extensive computation of training from scratch. Fine-
tuning involves adjusting the pre-trained model’s parameters to capture the
nuances of the target task or application field by further training with new
instances from the new context. For example, BioBERT is a BERT model
fine-tuned for biomedical text mining tasks like named entity recognition and
question answering [10]. Other adaptations have targeted text classification
and sentiment analysis in specific datasets [9, 38].
Our research aims to develop a credit risk model by leveraging a BERT-
based model. Specifically, we will fine-tune BERT to produce a predictive
score indicative of the likelihood of loan default using the textual descriptions
of loans provided by borrowers in their application forms. We will assess the
scoring effectiveness in a granting model using the well-established dataset
from the Lending Club P2P lending platform.
4. Description of the dataset
We have used a public dataset of the P2P lending company Lending Club¹, which is widely used in credit risk publications and is the most widely used when dealing with the P2P market [39, 12]. However, instead of using
the original dataset, which includes 2,260,699 loans granted by the company
between 2007 and 2018, we will use a version modified for proposing granting
models [40], used in [41, 42]. Since granting models predict which loans will be fully repaid, their estimation requires loans whose final status is known, that is, loans that were either fully repaid or defaulted. Thus, the dataset excludes
loans in transitory states (in a grace period, late, etc.) and loans with no
¹ https://www.kaggle.com/wordsforthewise/lending-club
Table 1: Variable description.

Variable          Description

Quantitative variables
revenue           Borrower's self-declared annual income during registration.
dti_n             Indebtedness ratio for obligations excluding mortgage. Monthly information.
loan_amnt         Amount of credit requested by the borrower.
fico_n            Credit bureau score, defined between 300 and 850, reported by Fair Isaac Corporation as a summary risk measure based on historical credit information reported at the time of application.

Categorical variables
emp_length        Employment length of the borrower, categorized into 12 categories, including the no-information category.
purpose           Credit purpose category for the loan request.
home_ownership    Homeownership status provided by the borrower.
addr_state        Borrower's U.S. state of residence.

Textual variable
desc              Description of the credit request provided by the borrower.
information on income and indebtedness which is essential to compute the
input variables, resulting in 1,347,681 instances.
Additionally, the original dataset contains variables detailing the loan’s
lifecycle and other post-application aspects (e.g., the interest rate). In con-
trast, our version only includes variables available at the time of application,
which are those utilized by granting models.
Loan descriptions were inconsistently available, appearing only for certain
loans between April 2008 and March 2014. To accurately assess the impact of
textual descriptions on default prediction, our analysis focuses solely on the
119,101 loans that include the desc variable. Kolmogorov-Smirnov and chi-square tests were applied to the quantitative and categorical variables, respectively, to assess potential bias from this filtering. The lack of significant differences indicates that
the filtered dataset is representative of the original dataset.
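Such a representativeness check could be run with SciPy along the following lines; this is a minimal sketch assuming the full and filtered data live in hypothetical pandas DataFrames named full and filtered:

```python
from scipy import stats

# Two-sample KS test for each quantitative variable: does its distribution
# differ between the full dataset and the desc-filtered subset?
for col in ["revenue", "dti_n", "loan_amnt", "fico_n"]:
    d, p = stats.ks_2samp(full[col], filtered[col])
    print(f"{col}: D={d:.4f}, p={p:.4f}")

# Chi-square goodness of fit for each categorical variable: are the filtered
# category frequencies consistent with those expected from the full dataset?
for col in ["emp_length", "purpose", "home_ownership", "addr_state"]:
    exp = full[col].value_counts(normalize=True) * len(filtered)
    obs = filtered[col].value_counts().reindex(exp.index, fill_value=0)
    chi2, p = stats.chisquare(obs, exp)
    print(f"{col}: chi2={chi2:.2f}, p={p:.4f}")
```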
In the dataset, the target variable suffers from the usual class imbalance problem (only 15.27% of loans default), which will be considered in the design of the
experiments. Table 1 shows the input variables of our granting model, which
are explained below.
As for the quantitative variables, the Fair Isaac Corporation credit bu-
reau (FICO) information in the original dataset is given by a minimum and
maximum range of limits to which the borrower’s FICO belongs at loan orig-
ination. However, we average these two values to obtain a single indicator of the creditworthiness of potential borrowers, resulting in our fico_n variable. As for the debt variable, dti_n is estimated from the original dataset variables as the ratio of the co-borrowers' total debt obligations to their combined monthly income.
Regarding the categorical variables, we merged the categories ‘other’, ‘none’, and ‘any’ into a unified category labeled ‘other’ for the home_ownership variable. This decision was made due to a lack of clear differentiation among these options, coupled with their similar default percentages and their relatively low frequencies of occurrence. The emp_length variable was treated as categorical rather than numerical since it includes categories for ‘no information’ and for ‘more than ten years’.
For the textual variable, we carried out exhaustive text cleaning. First, we removed all descriptions that merely contained the placeholder text provided by Lending Club on its web form (“Tell your story. What is your loan for?”). Moreover, we removed the prefix “Borrower added on DD/MM/YYYY >” from the descriptions, as we did not want any temporal background on them. Finally, as these descriptions came from a web form, we replaced all HTML entities with their corresponding characters (e.g., ‘&amp;’ was substituted by ‘&’, ‘&lt;’ was substituted by ‘<’, etc.).
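A minimal sketch of this cleaning, assuming the loans live in a hypothetical pandas DataFrame df with the raw descriptions in a desc column:

```python
import html
import re

PLACEHOLDER = "Tell your story. What is your loan for?"
PREFIX = re.compile(r"Borrower added on \d{2}/\d{2}/\d{4} >")

def clean_description(text: str) -> str:
    """Clean one raw Lending Club loan description."""
    text = PREFIX.sub("", text)   # drop the temporal prefix
    text = html.unescape(text)    # '&amp;' -> '&', '&lt;' -> '<', etc.
    return text.strip()

# Remove descriptions that only contain the web form's placeholder text,
# then clean the remaining ones.
df = df[df["desc"].str.strip() != PLACEHOLDER]
df["desc"] = df["desc"].map(clean_description)
```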
Table 2 presents the quantitative variables and the results of the Kolmogorov-
Smirnov test, which was used to compare the empirical cumulative distribu-
tion functions of default and non-default loans. According to these results, defaulted loans are characterized by lower revenue, a higher debt-to-income ratio (dti_n), a higher requested amount (loan_amnt), and lower FICO scores (fico_n), with all differences significant at the 0.01 level.
Similarly, Table 3 displays the distribution of categories within each cate-
gorical variable and the corresponding default rates. The addr_state variable is excluded due to its 50 categories, one for each U.S. state. The table also in-
dicates whether there is a significant dependence between the target variable
and the categorical variables at the 0.01 significance level. The results show
Table 2: Exploratory data analysis. Quantitative variables.

Variable     Statistic    Non-Default    Default       Total

revenue      Mean         $73,570.69     $66,218.66    $72,447.84
             Median       $64,000.00     $58,000.00    $62,000.00
             SD           $54,944.60     $40,731.98    $53,086.79
             KS D-Test    0.09*

dti_n        Mean         16.15          17.67         16.38
             Median       15.88          17.70         16.16
             SD           7.50           7.53          7.53
             KS D-Test    0.09*

loan_amnt    Mean         $13,799.25     $15,111.44    $13,999.66
             Median       $12,000.00     $14,000.00    $12,000.00
             SD           $7,931.55      $8,363.19     $8,012.86
             KS D-Test    0.08*

fico_n       Mean         705.22         694.20        703.54
             Median       697.00         687.00        697.00
             SD           33.32          27.48         32.74
             KS D-Test    0.15*

* p-value less than 0.01.
significant dependence for all variables, including the addr_state variable not reported in the table (test value of 211.12).
In the home_ownership variable, the ‘OTHER’ category shows the highest risk (20.98%) but a small frequency (0.12%), while the ‘MORTGAGE’ category is the most frequent (51.05%) and the least risky (14.14%). In the emp_length variable, the category that denotes no information (‘NI’) has the highest risk (19.78%), but also the lowest frequency (4.13%). In general, employment length can be categorized into two groups with comparable default rates: those with employment lengths of five years or less and those with more than five years. Interestingly, the risk is slightly higher in the group with more than five years of employment. The categories are not
Table 3: Exploratory data analysis. Categorical variables.

Variable          Category              Count     Rel. Freq.    Default Rate    Chi Test

home_ownership    MORTGAGE              60,796    51.05%        14.14%          131.08*
                  OTHER                 143       0.12%         20.98%
                  OWN                   9,582     8.05%         15.69%
                  RENT                  48,580    40.79%        16.60%

emp_length        < 1 year              9,548     8.02%         14.83%          104.96*
                  1 year                7,803     6.55%         14.39%
                  2 years               10,960    9.20%         14.68%
                  3 years               9,370     7.87%         14.18%
                  4 years               7,561     6.35%         14.56%
                  5 years               9,019     7.57%         14.76%
                  6 years               7,271     6.10%         16.04%
                  7 years               6,638     5.57%         15.59%
                  8 years               5,374     4.51%         15.39%
                  9 years               4,356     3.66%         15.79%
                  10+ years             36,287    30.47%        15.41%
                  NI                    4,914     4.13%         19.78%

purpose           car                   1,884     1.58%         9.61%           568.47*
                  credit card           25,051    21.03%        12.81%
                  debt consolidation    68,372    57.41%        16.14%
                  educational           265       0.22%         16.98%
                  home improvement      7,170     6.02%         12.93%
                  house                 805       0.68%         15.78%
                  major purchase        3,062     2.57%         10.65%
                  medical               970       0.81%         17.32%
                  moving                768       0.64%         14.84%
                  other                 6,361     5.34%         17.69%
                  renewable energy      127       0.11%         19.69%
                  small business        2,518     2.11%         26.41%
                  vacation              561       0.47%         16.22%
                  wedding               1,187     1.00%         12.47%

* p-value less than 0.01.
Table 4: Exploratory data analysis. Textual variable.

Variable        Statistic    Non-Default    Default    Total

Word count      Mean         36.72          35.49      36.54
                Median       24.0           22.0       24.0
                SD           46.62          48.78      46.96
                KS D-Test    0.03*

Readability     Mean         66.70          66.31      66.64
                Median       73.88          74.19      74.02
                SD           32.87          35.55      33.29
                KS D-Test    0.02*

Polarity        Mean         0.0964         0.0909     0.0956
                Median       0.0367         0.0        0.0320
                SD           0.1685         0.1699     0.1687
                KS D-Test    0.04*

Subjectivity    Mean         0.3193         0.3029     0.3168
                Median       0.3635         0.3333     0.3589
                SD           0.2542         0.2595     0.2551
                KS D-Test    0.04*

* p-value less than 0.01.
perfectly ordered, which supports the use of one-hot encoding to treat this
variable as categorical. Finally, the most frequent purpose is ‘debt consoli-
dation’, constituting 57% of the loans, which has a default rate of 16.14%.
Notably, the riskiest purpose is ‘small business’, with a 26.41% default rate.
Conversely, ‘car’ loans demonstrate the lowest risk, with a mere 9.61% de-
fault rate. This striking divergence underscores the significant variability in risk across the various loan purposes.
Regarding the textual description of the loan (desc variable), Table 4
shows some metrics to characterize it. There is a one-word difference in the
average word count between the descriptions of defaulted and non-defaulted
obligations. The readability was calculated using the Flesch Reading Ease Score², which indicates the approximate educational level required for comfortable comprehension of a given text (higher scores denote greater ease of
reading). The texts in both categories have scores around 66, signifying that
they can be readily comprehended by students aged 13 to 15. Additionally,
we analyzed the average polarity and average subjectivity³. The polarity,
ranging from -1 to 1 to denote negative or positive sentiment, was observed
to be approximately 0.1 in both cases, suggesting a subtle positive sentiment.
On the other hand, subjectivity, measuring the presence of judgments and
opinions on a scale from 0 to 1, exhibited values close to 0.31 in both cate-
gories. This indicates that while the texts in both cases maintain a generally
objective tone, there is a discernible inclusion of some judgments or opinions.
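The four metrics in Table 4 can be computed with the libraries cited in the footnotes; the following is a minimal sketch (word count approximated here by whitespace splitting, which may differ slightly from the exact procedure used in the paper):

```python
import textstat
from textblob import TextBlob

def linguistic_features(text: str) -> dict:
    """Compute the linguistic metrics reported in Table 4 for one description."""
    sentiment = TextBlob(text).sentiment
    return {
        "word_count": len(text.split()),
        "readability": textstat.flesch_reading_ease(text),
        "polarity": sentiment.polarity,          # -1 (negative) to 1 (positive)
        "subjectivity": sentiment.subjectivity,  # 0 (objective) to 1 (subjective)
    }

print(linguistic_features("I need funds to consolidate my credit card debt."))
```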
Although the distinctions in these metrics between the default and non-
default categories are subtle, their significance is confirmed by the Kolmogorov-
Smirnov test. Consequently, it is pertinent to incorporate linguistic aspects
into credit risk modeling. Our approach for extracting information from the descriptions relies on LLMs, which are capable of capturing not just linguistic nuances but also content details. We elaborate on our
methodology below.
5. Description of the methodology
This study assesses the enhancement of default prediction by integrating
textual descriptions of loans preprocessed through an LLM. Specifically, we
employ transfer learning and fine-tuning to enable the LLM to generate a
score reflecting the probability of loan default, leveraging information derived
from the loan’s textual description.
As our baseline, we employ a machine learning classifier that utilizes
the complete set of variables available during the loan application process,
encompassing both quantitative and categorical variables. Subsequently, we
enhance the classifier by incorporating the LLM score—representing default
risk inferred from processing the textual description with the LLM—as an
additional input variable and proceed to evaluate the resulting differences.
² Calculated with Textstat (Python library). Source: https://github.com/textstat/textstat.
³ Calculated with TextBlob (Python library). Source: https://github.com/sloria/textblob.

The data and code of our experiments are publicly available⁴.
The key components of the experiment are elaborated upon in this sec-
tion.
5.1. Tuning the classifier
The classification algorithm used was XGBoost [43], which is trained using stratified k-fold cross-validation with k = 5, where the instances are shuffled to avoid potential ordering biases in the dataset. A
genetic algorithm [44] was employed to fine-tune the hyperparameters. The
fitness value of each individual was the average balanced accuracy (BACC) in
the 5 validation sets of the cross-validation. We use BACC as it accounts for
the imbalanced nature of the dataset. In preliminary experiments, we also
considered the area under the receiver operating characteristic (AUROC),
which measures the model’s ability to discriminate between positive and
negative examples regardless of the classification threshold chosen. However,
we observed that the resulting XGBoost classifiers produced poor BACC
values (similar to those from a naïve classifier that predicts the majority
class) when using the standard 0.5 threshold to make the prediction. We
also observed that XGBoost classifiers with extremely similar AUROC values
produced very different results in terms of BACC. Thus, we decided to use
the BACC measure as it resulted in classifiers with a more stable behavior.
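The fitness evaluation of one individual could look as follows; this is a sketch with hypothetical names (fitness, X, y), not the authors' exact code:

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import StratifiedKFold
from xgboost import XGBClassifier

def fitness(params: dict, X, y, seed: int = 0) -> float:
    """Average BACC over shuffled, stratified 5-fold CV for one individual."""
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = []
    for train_idx, val_idx in folds.split(X, y):
        model = XGBClassifier(**params)
        model.fit(X[train_idx], y[train_idx])
        preds = model.predict(X[val_idx])  # standard 0.5 threshold
        scores.append(balanced_accuracy_score(y[val_idx], preds))
    return float(np.mean(scores))
```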
In genetic optimization, each individual is characterized by its genes,
that is, the considered hyperparameters of the XGBoost. Table 5 shows the
hyperparameters together with their respective ranges.
The evolutionary strategy chosen for the optimization is the Mu plus
lambda approach, denoted as µ + λ, where µ represents the number of
individuals to select for the next generation, and λ indicates the number
of children to produce at each generation. Unlike traditional approaches
where children often replace parents, the µ + λ strategy involves adding both
children and parents to produce the next generation. In the context of this
research, we set µ to 150 and λ to 150. This configuration, considering the
addition of both parents and offspring, results in generations comprising a
total of 300 individuals.
Initially, pairs of parents are chosen through tournament selection with
a tournament size of 2. Subsequently, the children are generated employing
⁴ The link to the code will be included in the final version.
Table 5: Hyperparameters of XGBoost considered in the genetic optimization and their respective ranges.

Parameter              Min. value    Max. value
eta (learning rate)    0.001         0.5
max_depth              2             12
gamma                  0             10
min_child_weight       0             10
alpha                  0.5           10
lambda                 0.5           10
subsample              0.7           1
colsample_bytree       0.3           1
scale_pos_weight       0.1           10
n_estimators           2             500
a two-point crossover technique on the parents’ chromosomes with an 80%
probability and applying a random resetting mutation with a 20% probabil-
ity. The random resetting mutation implies that each gene of every child has
a 20% chance of acquiring a new random value within its defined range. To
create an offspring of 150 children, this selection, crossover, and mutation
process is repeated 75 times. Finally, we combine the 150 children and the
150 parents, resulting in generations of 300 individuals, and select the top
150 (µ) according to their fitness value to pass to the next generation.
The evolutionary process consists of 20 iterations, thereby generating
a total of 3,000 individuals—each representing a distinct hyperparameter
configuration. From this pool of configurations, the one exhibiting the highest
fitness is ultimately chosen as the optimal outcome.
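The µ + λ loop can be sketched as follows. This is a minimal, hand-rolled illustration under the settings stated above (µ = λ = 150, tournament size 2, two-point crossover with 80% probability, per-gene random resetting with 20% probability, 20 generations), not the authors' actual implementation; parameter names follow the scikit-learn API of XGBoost, and fitness is the cross-validation function sketched earlier:

```python
import random

# Gene ranges following Table 5; the last tuple element marks integer genes.
RANGES = {
    "learning_rate":    (0.001, 0.5, False),
    "max_depth":        (2, 12, True),
    "gamma":            (0, 10, False),
    "min_child_weight": (0, 10, False),
    "reg_alpha":        (0.5, 10, False),
    "reg_lambda":       (0.5, 10, False),
    "subsample":        (0.7, 1, False),
    "colsample_bytree": (0.3, 1, False),
    "scale_pos_weight": (0.1, 10, False),
    "n_estimators":     (2, 500, True),
}

def random_gene(lo, hi, is_int):
    return random.randint(lo, hi) if is_int else random.uniform(lo, hi)

def random_individual():
    return {k: random_gene(*r) for k, r in RANGES.items()}

def tournament(pop, fits, size=2):
    """Return the fitter of `size` randomly chosen individuals."""
    best = max(random.sample(range(len(pop)), size), key=lambda i: fits[i])
    return pop[best]

def two_point_crossover(p1, p2):
    keys = list(RANGES)
    a, b = sorted(random.sample(range(len(keys) + 1), 2))
    c1, c2 = dict(p1), dict(p2)
    for k in keys[a:b]:              # swap the middle segment of genes
        c1[k], c2[k] = p2[k], p1[k]
    return c1, c2

def mutate(ind, p_gene=0.2):
    """Random resetting: each gene gets a new random value with 20% chance."""
    for k, r in RANGES.items():
        if random.random() < p_gene:
            ind[k] = random_gene(*r)
    return ind

def evolve(fitness, mu=150, lam=150, generations=20, cxpb=0.8):
    pop = [random_individual() for _ in range(mu)]
    fits = [fitness(ind) for ind in pop]
    for _ in range(generations):
        children = []
        while len(children) < lam:
            p1, p2 = tournament(pop, fits), tournament(pop, fits)
            if random.random() < cxpb:
                c1, c2 = two_point_crossover(p1, p2)
            else:
                c1, c2 = dict(p1), dict(p2)
            children += [mutate(c1), mutate(c2)]
        children = children[:lam]
        pop += children                   # mu + lambda: parents survive too
        fits += [fitness(ind) for ind in children]
        ranked = sorted(range(len(pop)), key=lambda i: fits[i], reverse=True)[:mu]
        pop, fits = [pop[i] for i in ranked], [fits[i] for i in ranked]
    return pop[0]                         # best hyperparameter configuration
```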
5.2. Generating a default score with BERT
In this section, we delineate the methodology employed to generate a
default score based on the textual description of the loan. We initiate the
process by applying transfer learning utilizing an LLM, specifically BERT in
our case. The fine-tuned BERT model produces an outcome within the range
of 0 to 1, offering a nuanced indicator rather than a binary classification. This
subtle indicator is subsequently integrated into a classifier along with other
input variables to predict the likelihood of loan default. The subsequent
steps in this process are detailed next.
15
5.2.1. Transfer learning to produce the default score
The BERT model we utilize⁵ is configured with L=12 hidden layers (i.e.,
Transformer encoder blocks), each with a size of H=768, and it employs A=12
attention heads. These attention heads enable the model’s self-attention
mechanism to process inputs in 12 distinct patterns simultaneously. The
output from BERT is an embedding of size 768. To incorporate this model, TensorFlow Hub was selected for its efficient integration with additional neural network layers.
As outlined in Section 3, during the transfer learning phase, we aim to
exploit BERT’s advanced language understanding while minimizing the need
to learn from scratch. To achieve this, we freeze the weights of all but the
last hidden layer of BERT. This approach preserves the model’s pre-trained
capabilities and avoids degrading them while retraining on our specific dataset, a phenomenon known as catastrophic forgetting⁶.
Subsequently, in the fine-tuning stage, only the last BERT layer and the
newly added layers are adjusted to better serve our specific task of generating
a default score. These layers are optimized to enhance the model’s adaptabil-
ity to our particular requirements, allowing slight parameter adjustments for
improved task-specific performance, while maintaining the general language
understanding gained from BERT’s initial pre-training. This strategy effec-
tively balances specialized learning with the retention of valuable pre-trained
knowledge.
We also explored various configurations for the extra layers to determine
the one that produces the best results. Specifically, we explored the 126
configurations that result from combining the following options:
• Using a first extra dense layer of 128, 256, or 512 neurons.

• Using or not a second extra dense layer of 128 neurons.

• Adding a dropout layer before or after all the extra layers.

• Considering a dropout percentage of 0%, 10%, 20%, or 30% for all the dropout layers.

• Using a learning rate of 0.001, 0.0001, or 0.00001.

⁵ Source: https://tfhub.dev/tensorflow/bert_en_uncased_L-12_H-768_A-12/4
⁶ A degradation issue during the retraining of pre-trained models, leading to loss of the original capabilities [45].
All the internal hidden layers are dense layers using the ReLU activation function⁷. Additionally, to obtain the probability of belonging to the default class, the last layer of the neural network was configured with a single neuron and a sigmoid activation function⁸.
The neural network is trained to predict the loan outcome (default or non-
default) based on the textual description as input. As a loss function, we use
the weighted binary cross-entropy, which measures the difference between the
predicted probabilities and the true binary labels of the input data, taking into account class weights⁹. The training was set up with a batch size of 64 and run for 25 epochs, with early stopping after 3 epochs without improvement.

⁷ The Rectified Linear Unit (ReLU) activation function outputs the maximum of zero and the input value, “activating” the neuron if the input is positive.
⁸ The sigmoid activation function introduces non-linearity and maps the input values to a range between 0 and 1, facilitating binary classification tasks.
⁹ Considering class weights is crucial given the imbalanced nature of our dataset.
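A sketch of a network of this kind is shown below. The paper loads bert_en_uncased_L-12_H-768_A-12 from TensorFlow Hub; for illustration, this sketch uses the equivalent Hugging Face checkpoint, where freezing individual encoder blocks is straightforward, and picks one of the 126 configurations (512 and 128 neurons, 20% dropout before the extra layers, learning rate 0.0001, as in fold 0 of Table 6):

```python
import tensorflow as tf
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
bert.bert.embeddings.trainable = False
for block in bert.bert.encoder.layer[:-1]:   # freeze all but the last block
    block.trainable = False

input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name="input_ids")
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name="attention_mask")
pooled = bert(input_ids, attention_mask=attention_mask).pooler_output  # 768-dim

x = tf.keras.layers.Dropout(0.2)(pooled)                  # dropout before extra layers
x = tf.keras.layers.Dense(512, activation="relu")(x)      # first extra dense layer
x = tf.keras.layers.Dense(128, activation="relu")(x)      # second extra dense layer
score = tf.keras.layers.Dense(1, activation="sigmoid")(x) # BERT score in [0, 1]

model = tf.keras.Model([input_ids, attention_mask], score)
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4),
              loss="binary_crossentropy")  # class weights are passed to fit()

early_stop = tf.keras.callbacks.EarlyStopping(patience=3, restore_best_weights=True)
# model.fit(tokenized_inputs, labels, batch_size=64, epochs=25,
#           class_weight={0: w0, 1: w1}, callbacks=[early_stop])
```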
5.2.2. Avoiding data leakage in cross-validation
As previously mentioned, our BERT model generates a default proba-
bility, which is then integrated into the classifier as an additional quanti-
tative variable. It is important to note that when computing the BERT
default probability, we must replicate the exact folds used in the k-fold cross-
validation of the boosting algorithms. In each iteration, boosting algorithms
are trained using the BERT score variable. It is essential to prevent BERT from training with validation data from a specific fold, as doing so would distort the model’s true performance: the fold’s validation data would contain BERT predictions on instances seen during training, which do not accurately represent the model’s real-world default prediction ability.
To avoid this data-leakage problem, we use the data as shown in Figure 1. The diagram consists of five steps (sketched in code after the list):
1. Description extraction: Textual descriptions are extracted from the
original dataset.
2. Folds generation: The exact same folds as in the boosting algorithms
are generated for the textual descriptions. This is done by dividing the
data into a train set (green color) and a test set (purple color). Each
fold gets a different test set.
3. Optimization of the neural network architecture: The train set obtained
in the previous step is divided into a 70% train subset (light green
color) and a 30% test subset (dark green color). The neural network is
trained with the previously mentioned configurations on the training
subset (light green color), and is tested by predicting the test subset
(dark green color).
4. Default prediction: The optimal configuration obtained in step 3 (lowest weighted binary cross-entropy on the test subset) is trained with
the train and test subsets (light and dark green colors) and is used to
predict the default probabilities of the test set obtained in step 2 (purple
color). These predictions of unseen data will be the BERT score values
of the current fold.
5. BERT score integration: Once the BERT score values of all the folds
are generated in step 4, they are incorporated into the original dataset
as a quantitative variable.
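A minimal sketch of this out-of-fold scoring procedure follows; build_and_train is a hypothetical helper implementing step 3 (it splits its input 70/30, searches the 126 configurations, and returns the refitted best model):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

def out_of_fold_bert_scores(descriptions: np.ndarray, labels: np.ndarray,
                            build_and_train, seed: int = 0) -> np.ndarray:
    """Steps 2-5: produce BERT scores only on data unseen during training."""
    # The same folds as in the boosting cross-validation (step 2).
    folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = np.zeros(len(descriptions))
    for train_idx, test_idx in folds.split(descriptions, labels):
        model = build_and_train(descriptions[train_idx], labels[train_idx])
        # Each loan is scored by a model that never saw it in training (step 4).
        scores[test_idx] = model.predict(descriptions[test_idx]).ravel()
    return scores  # merged back into the dataset as the BERT score (step 5)
```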
5.2.3. Result of the transfer learning process
As explained before, we explored 126 neural network configurations in each fold of the 5-fold cross-validation. The resulting optimal configuration for
each validation fold and its loss score are shown in Table 6. Interestingly, the
optimal combination of parameters varies across the different folds, and the
only parameter that remains constant is the use of the second dense layer.
Despite this parameter heterogeneity, the test loss values are quite stable,
except for the case of fold 1, which has a slightly higher loss.
Table 6: BERT optimal configuration and score by fold.

Fold      Neurons in 1st dense layer    Use 2nd dense layer (128 neurons)*    Dropout    Learning rate    Loss**
Fold 0    512                           True                                  20%        0.0001           0.1774
Fold 1    512                           True                                  0%         0.00001          0.1793
Fold 2    128                           True                                  20%        0.0001           0.1776
Fold 3    256                           True                                  0%         0.001            0.1775
Fold 4    512                           True                                  0%         0.001            0.1773

* Values ‘True’ or ‘False’ indicate whether the optimal configuration includes that layer.
** Weighted binary cross-entropy of the test set predictions.
Figure 1: BERT training diagram.
5.2.4. Inclusion of the BERT score in the classification model
The experiment architecture is shown in Figure 2. First, the loan de-
scriptions are processed by the fine-tuned BERT model. By using a sigmoid
activation function in its final layer, the resulting output can be interpreted
as the probability of belonging to the positive class. This probability, denoted
as BERT score, represents a default score solely based on the content of the
loan description. Finally, this BERT score is included as an input variable
in the XGBoost classifier along with quantitative and categorical variables
to generate a final default prediction.
Figure 2: Diagram of the experiment architecture.
The fine-tuning of the BERT model was performed on a system equipped
with an Intel Core i9-12900KS processor, an NVIDIA GeForce RTX 4090
graphics card (24GB), and 128 GB of RAM. In contrast, the inference pro-
cess with the fine-tuned model, which involves generating the BERT score
for a given loan description, can be conducted on a standard PC. This ca-
pability enhances the practicality of deploying our approach in real-world
applications.
6. Analysis of the BERT score
6.1. Assessment of the BERT score as a credit risk score
In this section, we evaluate the effectiveness of the BERT score in utilizing
loan descriptions to predict default risk. Table 7 presents loan descriptions
with the highest and lowest BERT scores, which indicate loans assessed by
Table 7: Loan descriptions with highest (top) and lowest (bottom) BERT score.
BERT score Real value Description
0.8562 0 (Non-Default) getting a divorce need new apt. with new furniture be-
cause she getting everthing.
0.8149 1 (Default) need help my bills. to help pay my medications and some
bills.
0.8131 0 (Non-Default) i can pay of some bills for my self because i been helping
other people out. i could save more for my family and
their need.i have a good job that i am bless with. i am
from a large family seem like every one thinking i suppose
to help them when i need help my self.i alway belive that
the lord will.
0.8051 0 (Non-Default) consolidating our debt makes our life easier live in our
means with one solid low monthly payment insted of mul-
tiple payment that add up more then what ill be paying
with this loan and have a little left to leave in my sav-
ings for a rainy day to be honest and thank you for your
consideration good dy.
0.8045 0 (Non-Default) To consolidate debt. to pay off dept
0.0735 0 (Non-Default) Debt consolidation with a lower APR.
0.1146 0 (Non-Default) In need of funds to pay off some bills as well as minor
improvements to house and yard. I have an extremely
secure career, and maintaining my credit worthiness is
important to me.
0.1262 0 (Non-Default) Hard working individual with a stable job will use loan
proceeds to consolidate outstanding credit cards balances.
1) Net monthly income - $4,432. 2) All expenses (al-
located):. Rent - $1,124. Utilities- 84. Groceries
293. Auto (including fuel) 201 . Cell Phone 52. Ca-
ble/Internet 64. Personal care items 82. Entertain-
ment/dining 93. Sales tax 65. 3) Previously answered.
4) No.
0.1360 0 (Non-Default) This loan is to pay off credit ca.
0.1871 1 (Default) I need money for moving expenses and for a buffer for
the first month while I transition into working in my
new location. I have successfully paid off two previous
Lending Club loans in the past couple of years.
BERT as having the highest and lowest default risks, respectively. It appears
that descriptions with higher BERT scores are often less informative, containing errors such as typos (e.g., “pay of”, “everthing”, “alway”, “belive”) or grammatical issues. Interestingly, only one of these loans ultimately de-
faulted. Conversely, loan descriptions with lower BERT scores tend to be
more detailed and insightful, often including information about the loan’s
purpose and the borrower’s creditworthiness, and sometimes even providing
numeric data related to the borrower’s financial status.
Figure 3: Default and non-default proportions by BERT score range.
Figure 3 illustrates the distribution of default and non-default loans (Y
axis) across the BERT score range (X axis) in bins of 0.01. In the bar
chart, the blue segment represents the percentage of non-default loans, while
the orange segment denotes the percentage of default loans¹⁰. The figure
reveals a general trend where higher BERT scores are associated with a
higher proportion of defaulted loans. It is important to note that observations
outside the BERT score range of [0.3, 0.7] are sparse, which makes the bars
in these regions less reliable. Overall, the trend suggests that higher BERT
scores are indicative of a greater likelihood of default, highlighting the BERT
score’s usefulness as a risk assessment tool.
¹⁰ Absence of a bar indicates that no loan descriptions fall within that BERT score range.

Table 8: Classification performance of BERT binarization at 0.5.

                        Default                          Non-default
Model    BACC      Precision    Recall    F1         Precision    Recall    F1
BERT     0.5444    0.1714       0.6896    0.2746     0.8771       0.3993    0.5487

Table 8 shows the classification performance obtained by applying a 0.5 threshold to binarize the BERT score, which yields a balanced accuracy of 54.4%. The BERT score is not very precise in predicting the default class (17.1%) but retrieves 69% of its instances.
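For reference, the binarization and the metrics in Table 8 can be reproduced along these lines with scikit-learn, assuming bert_scores and y hold the out-of-fold scores and true labels:

```python
from sklearn.metrics import balanced_accuracy_score, classification_report

pred = (bert_scores >= 0.5).astype(int)  # binarize at the 0.5 threshold
print(balanced_accuracy_score(y, pred))
print(classification_report(y, pred, target_names=["Non-default", "Default"]))
```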
These results demonstrate that the BERT model can be effectively fine-
tuned to predict the final state of the loan using only the information provided
by the borrower in the description field of the application form. In Section
7, we will contextualize these findings by comparing them with the results
obtained from XGBoost and various sets of variables.
6.2. Assessment of the relationship between the BERT score and other vari-
ables
In this section, we aim to explore potential relationships between the
BERT score and other variables. Table 9 presents the correlation coeffi-
cients between the BERT score and the quantitative variables within the
dataset. The findings reveal weak yet statistically significant relationships
with all variables except for the loan amount. Notably, the most pronounced
correlations exist with the FICO score and revenue variables. Both exhibit
inverse relationships, indicating that individuals with higher FICO scores
and revenues tend to have lower BERT scores, and vice versa. The associa-
tion between the BERT score and the debt variable (dti_n) is direct, albeit
slightly weaker than the other two correlations.
Table 9: Correlation coefficients of quantitative variables with the BERT score.

Variable     Pearson     Spearman
revenue      -0.0734*    -0.1007*
dti_n        0.0663*     0.0627*
loan_amnt    -0.0008     -0.0012
fico_n       -0.1293*    -0.1293*

* p-value less than 0.01.
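A sketch of how these coefficients could be computed with SciPy, assuming a DataFrame df that holds the variables and the out-of-fold score in a bert_score column:

```python
from scipy import stats

for col in ["revenue", "dti_n", "loan_amnt", "fico_n"]:
    r_p, p_p = stats.pearsonr(df[col], df["bert_score"])
    r_s, p_s = stats.spearmanr(df[col], df["bert_score"])
    print(f"{col}: Pearson {r_p:.4f} (p={p_p:.3g}), "
          f"Spearman {r_s:.4f} (p={p_s:.3g})")
```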
Table 10: Kruskal-Wallis test of categorical variables with the BERT score.

Variable          H-statistic
emp_length        1581.67*
purpose           1357.94*
home_ownership    243.20*
addr_state        360.44*

* p-value less than 0.01.
To evaluate the relationship with the categorical variables, Table 10 presents
the results of the Kruskal-Wallis test. The Kruskal-Wallis test is a non-
parametric statistical test that analyzes whether there are statistical differ-
ences in the BERT scores across categories within each categorical variable.
The results indicate significant differences in BERT scores among all cate-
gorical variables, suggesting a certain level of association between the BERT
score and these categorical factors. However, it remains challenging to quan-
tify the strength of this relationship or identify the specific categories with
the most robust associations.
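The test takes one group of BERT scores per category; a minimal sketch under the same assumed df:

```python
from scipy import stats

for col in ["emp_length", "purpose", "home_ownership", "addr_state"]:
    # One array of BERT scores per category of the variable.
    groups = [g["bert_score"].values for _, g in df.groupby(col)]
    h, p = stats.kruskal(*groups)
    print(f"{col}: H={h:.2f}, p={p:.4f}")
```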
Table 11: Correlation coefficients of linguistic features with the BERT score.
Variable Pearson Spearman
Word count -0.1961* -0.2650*
Polarity -0.1036* -0.1473*
Subjectivity -0.1753* -0.1866*
Readability 0.1182* 0.1518*
* p-value less than 0.01.
Finally, Table 11 examines the potential relationship between the BERT
score and various linguistic features automatically extracted from the text.
We find statistically significant correlations between the BERT score and all
linguistic features analyzed. Notably, there is a strong inverse correlation
with both word count and subjectivity, indicating that shorter and more
objective texts tend to have higher BERT scores. This finding, together with
the correlations presented in Table 9, suggests that BERT is more closely
related to linguistic features than to the numerical variables associated with
loan applications.
7. Results of the LLM-enhanced granting model
7.1. Analysis of the classification performance
First, we evaluate the impact of incorporating the BERT score in the
granting model. In our baseline experiment, we optimize XGBoost with a
genetic algorithm using the quantitative and categorical variables typically
used in granting models, while in the competing approach, we optimize it
but include the BERT score as input variable. Table 12 shows the results of
both approaches.
A closer examination of the balanced metrics reveals a marginal enhance-
ment in both BACC (0.6154 vs. 0.6187) and AUC (0.6575 vs. 0.6644), the
latter significant according to the DeLong test [46] at the 0.01 level. It is cru-
cial to note that the Lending Club dataset exclusively consists of approved
loans. This fact poses a substantial challenge to significantly enhance the
outcomes in a loan granting model such as ours since the loans included in
the dataset were initially considered favorable by the platform. Furthermore,
in experiments not reported in the paper we used CatBoost [47] instead of
XGBoost and obtained a similar BACC improvement.
Table 12: Classifier performance with and without the BERT score.
Metric Quant. + Categ. var. Quant. + Categ. + BERT score
BACC 0.6154 0.6187
AUC 0.6575 0.6644
F1 0.3266 0.3308
Precision 0.2168 0.2249
Recall 0.6614 0.6360
Accuracy 0.5835 0.6066
The performance metrics in Table 12 indicate that incorporating the
BERT score results in improved precision but diminished recall. However, at-
tributing this change in precision-recall behavior solely to the BERT score re-
quires careful consideration. This caution arises from our observations within
our dataset, where classifiers with different hyperparameters and similar
near-optimal balanced accuracy values have demonstrated varying precision-
recall behaviors, suggesting that this may also be the case here.
Table 13 shows an additional experiment in which XGBoost classifiers
are trained and optimized using only one kind of input variable. While the
Table 13: Performance of the XGBoost considering only a subset of variables.
Metric Quant. Categ. Text. BERT score
BACC 0.6062 0.5486 0.5258 0.5490
AUC 0.6457 0.5656 0.5309 0.5714
F1 0.3192 0.2759 0.2534 0.2601
Precision 0.2138 0.1746 0.1665 0.1877
Recall 0.6300 0.6563 0.5302 0.5153
Accuracy 0.5896 0.4738 0.5227 0.5724
classifier using the quantitative variables is clearly the best, the classifier
that uses just the BERT score obtains slightly better results than the one
using the four categorical variables (the AUC difference is significant at the
0.01 level according to the DeLong test). This result is noteworthy given
the well-known effectiveness and the meaningful nature of the qualitative
variables.
Table 13 also shows the result of an XGBoost classifier using the textual
features presented in Table 11, namely: polarity, subjectivity, word count,
and readability score. This classifier is outperformed by the XGBoost that
uses the BERT score in terms of balanced accuracy and AUC (significant at
the 0.01 level according to the DeLong test). This finding underscores the
superior ability of the fine-tuned LLM to leverage textual descriptions and
extract relevant information for the classification task.
7.2. Feature importance and explainability
To further analyze the impact of the BERT score, in this section, we ana-
lyze whether the models’ use of the input variables changes. Figure 4 shows
the feature importance given by XGBoost both with and without incorpo-
rating the BERT score. Notably, the BERT score emerges as the third most
influential variable, accounting for 6% importance. Moreover, the order and
magnitude of importance for other variables exhibit shifts. Specifically, the
borrower’s annual revenue diminishes in importance following the inclusion of
the BERT score, while the significance of the ‘credit card’ purpose increases.
This observation underscores the significant influence of the BERT score on
the classifier.
To gain a deeper understanding of the role of variables, we employ SHAP
values [48]. Figure 5 provides a visualization of the SHAP values for the 10
(a) Before BERT score incorporation. (b) After BERT score incorporation.
Figure 4: Feature importances of the XGBoost classifiers.
(a) Before BERT score incorporation. (b) After BERT score incorporation.
Figure 5: SHAP values of the XGBoost classifiers.
variables with the most significant impact on the model’s output, comparing
models with and without including the BERT score.
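Plots of the kind shown in Figures 5 and 6 can be produced with the shap package; the following is a sketch assuming a fitted classifier xgb_model, its input matrix X as a DataFrame, and a feature column named "BERT score":

```python
import shap

explainer = shap.TreeExplainer(xgb_model)          # TreeExplainer suits XGBoost
shap_values = explainer.shap_values(X)

shap.summary_plot(shap_values, X, max_display=10)  # violin plots as in Figure 5
shap.dependence_plot("BERT score", shap_values, X,
                     interaction_index=None)       # as in Figure 6
```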
It is noteworthy that Figure 5a resembles a comparable figure presented
in [41], wherein SHAP values were utilized in a lending model on Lending
Club¹¹. Examining Figure 5b, we find that the BERT score ranks as the
fourth variable in terms of impact. A closer inspection of the violin plots
reveals that including BERT does not substantially alter how variables in-
fluence the target class.
¹¹ Our dataset is more constrained, only considering loans with textual descriptions, as explained in Section 4.

Figure 6: Dependence plot between SHAP values and BERT score.

Figure 6 reveals a direct and linear relationship between the BERT score and the default risk. Notably, this relationship is asymmetric around BERT
score values of 0.5; scores below 0.4 correspond to SHAP values ranging be-
tween -0.4 and -0.6, leading the model to predict non-default. Conversely,
this impact range in the positive case is only reached by BERT scores exceed-
ing 0.7, signifying that only exceptionally high BERT scores serve as strong
indicators of default.
7.3. Impact across the purpose categories
Now we delve into the influence of incorporating the BERT score on clas-
sification outcomes across various categories of the purpose variable. Lever-
aging the language comprehension capabilities of LLMs, it is conceivable that
the BERT model can provide more accurate characterizations of loan pur-
poses than the purpose variable alone. For instance, the ‘other’ category, be-
ing inherently ambiguous, could benefit from a more nuanced understanding
of the loan description, potentially leading to improved prediction outcomes.
Figure 7 illustrates the changes in balanced accuracy upon incorporating
the BERT score for each loan purpose. Notably, the ‘other’ category shows
a modest improvement of 2.11%. However, several other categories exhibit
more substantial enhancements, including ‘educational’ (9.22%), ‘moving’
(5.84%), ‘medical’ (3.84%), and ‘small business’ (3.40%). Given the black-
box nature of our model, it is difficult to ascertain the reasons why these
categories have improved more than the others. However, we posit that these
categories share a commonality—the more detailed specification of purpose
or a deeper understanding of the borrower’s situation contributes to a more
Figure 7: BACC changes by purpose after including the BERT score.
precise delineation of the default risk. For instance, in education loans, fac-
tors like the field of study and the educational institution may play a crucial
role in determining employability and salaries, thereby enabling a more accu-
rate assessment of the default risk. The same happens in the case of moving
to a new place, having to pay medical expenses, or obtaining funds for a
business.
The following examples illustrate instances of loans that were initially
misclassified as defaults in the absence of the BERT score but were subse-
quently correctly predicted as non-defaults after its inclusion:
Purpose: Educational; BERT score: 0.3765
“I’m 25 years old and living in New Orleans. I’m asking for a rel-
atively small amount of money to help me take care of my post-Bac
tuition for teacher certification and to help me pay off a credit card.
I currently work as a private school teacher making very little money
with no benefits (about $29,000 a year). I have to pay about $1000
in the coming year for my tuition, and I have to get health insurance
ASAP, but it’s hard to do so with no financial help from anyone else.
My parents can’t help me because my mother is permanently disabled
and my father took a huge pay cut this year.”
Purpose: Moving; BERT score: 0.3843
“Although I can afford payments,due to some recent expenses, I am
short on cash flow for an unexpected move. I am, however, looking for
a more reasonable alternative to banking rates. I have borrowed from
Lending Club before and always paid fully and on time with automatic
payments.”
Purpose: Small business; BERT score: 0.3292
“The purpose of this loan is to fund advertising costs for a growing
internet business venture. I am a successful sales professional earning
an average of over $250K per year over the last 5 years. My credit
scores are strong and I have a documented history of paying all my
debts (personal or business related) on time.”
Purpose: Medical; BERT score: 0.3921
“This loan will be used to pay off a Care Credit credit card currently
at 21.9% I used the card to pay for a prosthetic limb that my health
insurer would not cover.”
In each of these examples, the BERT score falls below 0.4. As shown in Figure 6, scores below this value give the classifier a stronger signal to categorize the loan as non-default. All the examples offer detailed accounts of the borrower’s situation and the purpose of the loan, and this clarity could explain their low BERT scores. In addition, all the texts are accurate, readable, and written in a confident, objective tone. While previous evidence in the literature offered mixed results on linguistic factors as default predictors [3, 18, 4, 7], we have shown that our BERT score is significantly related to them. Although these explanations are plausible, given the black-box nature of our BERT-based approach, the factors driving the BERT score of each text cannot be precisely determined. This shortcoming must be addressed to comply with regulatory requirements and to foster trust among end-users.
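To illustrate how such a score is obtained at prediction time, the sketch below scores a single description with a fine-tuned BERT classifier through the Hugging Face transformers library. The checkpoint path is hypothetical, and we assume class 1 encodes default.

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "./bert-risk-finetuned"  # hypothetical path to a fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT)
model.eval()

def risk_score(description: str) -> float:
    """Probability of default assigned to a loan description (class 1 = default)."""
    inputs = tokenizer(description, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return torch.softmax(logits, dim=-1)[0, 1].item()

# Scores below roughly 0.4 push the granting model towards non-default (Figure 6).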
8. Conclusions
In this paper, we have introduced a novel approach that highlights the
value of state-of-the-art natural language processing techniques in enhancing
credit risk models. Our approach fine-tunes BERT using loan descriptions
to generate a risk score that effectively discriminates between defaulted and
non-defaulted loans. Furthermore, we show that integrating the BERT-based
risk score with traditional variables significantly enhances the performance
of conventional granting models. These results underscore the potential of
LLMs to augment lending models, emphasizing their invaluable role in har-
nessing the richness of textual data to improve credit risk assessment.
Our approach can be easily applied without manual annotation, which is
a complex task that also introduces annotator subjectivity. As our analy-
sis shows, the score incorporates both linguistic factors and content aspects
that relate to the creditworthiness of the borrower’s situation and purpose.
Additionally, while fine-tuning the LLM is a time-consuming process that
requires GPU processing, generating predictions, such as the risk score for a
loan description, can be done almost instantly on standard consumer hardware.
Our work represents just the first steps in this endeavor and underscores
the necessity for further exploration and integration of cutting-edge method-
ologies to enhance the effectiveness of risk-return assessments in the dynamic
landscape of P2P lending and even in traditional financial institutions. Several directions deserve exploration to refine the predictive capability of
the model and the understanding of loan applicants’ conditions.
An important limitation of our work is the lack of transparency regard-
ing the factors contributing to a given risk score and its potential biases.
These issues hinder its application in real-world contexts, where a compre-
hensive understanding of the score generation process and assurance that it
does not introduce biases related to gender, ethnicity, financial inclusion, or
other factors are essential. Thus, efforts to enhance the transparency and
explainability of the risk score are crucial not only for regulatory compliance
but also for fostering trust among borrowers and lenders [49].
In this respect, LLM-based topic modeling tools such as BERTopic [50]
could be used to gain an understanding of the risk scores. For example, topic
modeling could identify whether a loan description relates to “risky topics”, which could in turn be incorporated as input variables in the granting model.
Similarly, the embeddings from BERT or other LLMs could be used to gen-
erate interpretable topics from large, complex vocabularies, as other works
have done [51]. Applying such topic modeling approaches to loan descrip-
tions could facilitate the identification of situations with varying risk levels.
These topics could then serve as additional variables in classification mod-
els, improving predictive performance while maintaining the transparency
required in financial assessments.
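As a sketch of this direction, BERTopic’s standard API could be applied directly to the descriptions; the min_topic_size value and the idea of joining topic assignments back to the loan table are our own illustrative choices.

from bertopic import BERTopic

def fit_loan_topics(descriptions: list[str]):
    """Fit BERTopic on loan descriptions; the assigned topic of each loan
    could later enter the granting model as a categorical feature."""
    topic_model = BERTopic(min_topic_size=50)  # illustrative granularity setting
    topics, _ = topic_model.fit_transform(descriptions)
    print(topic_model.get_topic_info().head())  # most frequent topics and keywords
    return topic_model, topics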
Another potential research avenue involves delving into more advanced
LLMs. For instance, encoder-only models like RoBERTa [37] could improve
the model’s capacity to capture more intricate linguistic patterns. Further-
more, the rise of generative AI has spurred interest in exploring decoder
models for classification purposes. Although these models were not origi-
nally designed for such tasks, recent advances have yielded promising adaptations. For example, CARP (Clue And Reasoning Prompting) [52]
utilizes in-context learning, generating textual responses conditioned on a
provided prompt with a few annotated examples (few-shot learning). Re-
markably, this approach improved results in classification tasks without the
need for transfer learning and fine-tuning and could be applied to risk scoring.
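A schematic version of such a prompt is sketched below. The template wording and the placeholder clues are our own illustration of the clue-and-reasoning idea, not the prompts used in [52]; in CARP proper, the demonstrations also contain generated clues and reasoning chains.

def build_carp_style_prompt(examples: list[tuple[str, str]],
                            new_description: str) -> str:
    """Assemble a few-shot, clue-and-reasoning prompt for a decoder LLM."""
    parts = [
        "Classify each loan description as DEFAULT or NON-DEFAULT.",
        "First list textual clues, then reason step by step, then answer.",
        "",
    ]
    for text, label in examples:  # a few annotated demonstrations
        parts.append(f"Description: {text}")
        parts.append("Clues: ...")      # in CARP these are model-generated
        parts.append("Reasoning: ...")
        parts.append(f"Answer: {label}")
        parts.append("")
    parts.append(f"Description: {new_description}")
    parts.append("Clues:")  # the model continues from here
    return "\n".join(parts)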
Future work should also address the economic implications of our findings
by integrating cost- or profit-sensitive approaches [53, 42]. These methods incorporate the financial impact of classification outcomes into model evaluation.
Investigating whether improvements in model performance by including tex-
tual descriptions translate into better financial outcomes would enhance the
real-world applicability of our findings.
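As a simple illustration of a cost-sensitive evaluation, the sketch below selects the score threshold that minimizes total misclassification cost; the unit costs are placeholder assumptions, not values estimated from the Lending Club data.

import numpy as np

def best_threshold(y_true: np.ndarray, scores: np.ndarray,
                   cost_fn: float = 5.0, cost_fp: float = 1.0) -> float:
    """Score threshold with the lowest total misclassification cost.
    cost_fn: cost of funding a loan that defaults (false negative).
    cost_fp: cost of rejecting a loan that would have been repaid (false positive)."""
    def total_cost(t: float) -> float:
        reject = scores >= t                  # predicted default -> reject
        fn = np.sum((y_true == 1) & ~reject)  # funded loans that default
        fp = np.sum((y_true == 0) & reject)   # good loans turned away
        return cost_fn * fn + cost_fp * fp
    return float(min(np.linspace(0.0, 1.0, 101), key=total_cost))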
References
[1] M. Cummins, T. Lynn, C. Mac an Bhaird, P. Rosati, Addressing
Information Asymmetries in Online Peer-to-Peer Lending, Springer
International Publishing, Cham, 2019, pp. 15–31. doi:10.1007/
978-3-030-02330-0_2.
[2] J. Michels, Do unverifiable disclosures matter? Evidence from peer-to-
peer lending, The Accounting Review 87 (4) (2012) 1385–1413. doi:
10.2308/accr-50159.
[3] Q. Gao, M. Lin, Lemon or cherry? The value of texts in debt crowd-
funding, Tech. Rep. 18, Center for Analytical Finance. University of
California, Santa Cruz (2015).
URL https://cafin.ucsc.edu/research/work_papers/CAFIN_WP18.pdf
[4] G. Dorfleitner, C. Priberny, S. Schuster, J. Stoiber, M. Weber, I. de
Castro, J. Kammler, Description-text related soft information in peer-to-
peer lending – Evidence from two leading European platforms, Journal of Banking & Finance 64 (2016) 169–187. doi:10.1016/j.jbankfin.2015.11.009.
[5] Y. Xia, L. He, Y. Li, N. Liu, Y. Ding, Predicting loan default in peer-to-
peer lending using narrative data, Journal of Forecasting 39 (2) (2020)
260–280. doi:10.1002/for.2625.
[6] W. Zhang, C. Wang, Y. Zhang, J. Wang, Credit risk evaluation model
with textual features from loan descriptions for p2p lending, Electronic
Commerce Research and Applications 42 (2020) 100989. doi:10.1016/j.elerap.2020.100989.
[7] M. Siering, Peer-to-peer (p2p) lending risk management: Assessing
credit risk on social lending platforms using textual factors, ACM
Transactions on Management Information Systems 14 (3) (jun 2023).
doi:10.1145/3589003.
[8] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of
Deep Bidirectional Transformers for Language Understanding (2018).
doi:10.48550/ARXIV.1810.04805.
[9] C. Sun, X. Qiu, Y. Xu, X. Huang, How to fine-tune BERT for text
classification?, in: M. Sun, X. Huang, H. Ji, Z. Liu, Y. Liu (Eds.),
Chinese Computational Linguistics, Springer International Publishing,
Cham, 2019, pp. 194–206. doi:10.1007/978-3-030-32381-3_16.
[10] J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. H. So, J. Kang, BioBERT:
a pre-trained biomedical language representation model for biomedical
text mining, Bioinformatics 36 (4) (2019) 1234–1240. doi:10.1093/
bioinformatics/btz682.
[11] V. S. Tida, S. H. Hsu, Universal spam detection using transfer learning
of BERT model, in: Proceedings of the 55th Hawaii International Con-
ference on System Sciences, 2022, pp. 7669–7677.
URL http://hdl.handle.net/10125/80263
[12] M.-J. Ariza-Garzón, M.-D.-M. Camacho-Miñano, M.-J. Segovia-Vargas,
J. Arroyo, Risk-return modelling in the p2p lending market: Trends,
gaps, recommendations and future directions, Electronic Commerce Re-
search and Applications 49 (2021) 101079. doi:10.1016/j.elerap.2021.101079.
[13] J. Xu, D. Chen, M. Chau, L. Li, H. Zheng, Peer-to-peer loan fraud
detection: Constructing features from transaction data, MIS Quarterly
45 (3) (2022) 1777–1792. doi:10.25300/misq/2022/16103.
[14] Y. Li, A. Hao, X. Zhang, X. Xiong, Network topology and systemic risk
in peer-to-peer lending market, Physica A: Statistical Mechanics and its
Applications 508 (2018) 118–130. doi:10.1016/j.physa.2018.05.083.
[15] J. Xu, D. Chen, M. Chau, Identifying features for detecting fraudu-
lent loan requests on p2p platforms, in: 2016 IEEE Conference on
Intelligence and Security Informatics (ISI), 2016, pp. 79–84. doi:
10.1109/ISI.2016.7745447.
[16] Z. Qi, D. Chen, J. J. Xu, Do facial images matter? Understand-
ing the role of private information disclosure in crowdfunding mar-
kets, Electronic Commerce Research and Applications 54 (C) (jul 2022).
doi:10.1016/j.elerap.2022.101173.
[17] M. Herzenstein, S. Sonenshein, U. M. Dholakia, Tell me a good story
and I may lend you my money: The role of narratives in peer-to-peer
lending decisions, SSRN Electronic Journal (2011). doi:10.2139/ssrn.
1840668.
[18] S. Wang, Y. Qi, B. Fu, H. Liu, Credit risk evaluation based on text
analysis, International Journal of Cognitive Informatics and Natural In-
telligence 10 (2016) 1–11. doi:10.4018/IJCINI.2016010101.
[19] C. Jiang, Z. Wang, R. Wang, Y. Ding, Loan default prediction by com-
bining soft information extracted from descriptive text in online peer-to-
peer lending, Annals of Operations Research 266 (1–2) (2017) 511–529.
doi:10.1007/s10479-017-2668-z.
[20] J. Yao, J. Chen, J. Wei, Y. Chen, S. Yang, The relationship between
soft information in loan titles and online peer-to-peer lending: evidence
from renrendai platform, Electronic Commerce Research 19 (1) (2018)
111–129. doi:10.1007/s10660-018-9293-z.
[21] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient estimation of word
representations in vector space (2013). doi:10.48550/ARXIV.1301.
3781.
[22] T. Loughran, B. McDonald, When is a liability not a liability? Textual
analysis, dictionaries, and 10-Ks, The Journal of Finance 66 (1) (2011)
35–65. doi:10.1111/j.1540-6261.2010.01625.x.
[23] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez,
L. Kaiser, I. Polosukhin, Attention Is All You Need (2017). doi:10.
48550/ARXIV.1706.03762.
[24] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever,
Language models are unsupervised multitask learners, Tech. rep.,
OpenAI (2019).
URL https://insightcivic.s3.us-east-1.amazonaws.com/language-models.pdf
[25] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal,
A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-
Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler,
J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray,
B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever,
D. Amodei, Language models are few-shot learners (2020). doi:10.
48550/ARXIV.2005.14165.
[26] D. Pride, M. Cancellieri, P. Knoth, CORE-GPT: Combining open access
research and large language models for credible, trustworthy question
answering, arXiv preprint arXiv:2307.04683 (2023). doi:10.48550/arXiv.2307.04683.
[27] A. Bhaskar, A. R. Fabbri, G. Durrett, Prompted opinion summarization
with GPT-3.5, arXiv preprint arXiv:2211.15914 (2022). doi:10.48550/arXiv.2211.15914.
[28] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Ka-
plan, H. Edwards, Y. Burda, G. Brockman, A. Ray, et al., Evaluating
large language models trained on code, arXiv preprint arXiv:2107.03374
(2021). doi:10.48550/arXiv.2107.03374.
[29] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy,
V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, arXiv preprint arXiv:1910.13461 (2019). doi:10.48550/arXiv.1910.13461.
[30] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena,
Y. Zhou, W. Li, P. J. Liu, Exploring the limits of transfer learning
with a unified text-to-text transformer, Journal of Machine Learning
Research 21 (1) (jan 2020).
URL http://jmlr.org/papers/v21/20-074.html
[31] J. Kriebel, L. Stitz, Credit default prediction from user-generated text
in peer-to-peer lending using deep learning, European Journal of Oper-
ational Research 302 (1) (2022) 309–323. doi:10.1016/j.ejor.2021.
12.024.
[32] M. Stevenson, C. Mues, C. Bravo, The value of text for small business
default prediction: A Deep Learning approach, European Journal of
Operational Research 295 (2) (2021) 758–771. doi:10.1016/j.ejor.
2021.03.008.
[33] V. Sanh, L. Debut, J. Chaumond, T. Wolf, DistilBERT, a distilled
version of BERT: smaller, faster, cheaper and lighter (2019). doi:
10.48550/ARXIV.1910.01108.
[34] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, R. Soricut, AL-
BERT: A Lite BERT for Self-supervised Learning of Language Repre-
sentations (2019). doi:10.48550/ARXIV.1909.11942.
L. Martin, B. Muller, P. J. Ortiz Suárez, Y. Dupont, L. Romary, É. de la Clergerie, D. Seddah, B. Sagot, CamemBERT: a tasty French language
model, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Pro-
ceedings of the 58th Annual Meeting of the Association for Computa-
tional Linguistics, Association for Computational Linguistics, Online,
2020, pp. 7203–7219. doi:10.18653/v1/2020.acl-main.645.
[36] J. Cañete, G. Chaperon, R. Fuentes, J.-H. Ho, H. Kang, J. Pérez, Spanish pre-trained BERT model and evaluation data, in: Practical ML for Developing Countries Workshop at ICLR 2020, 2020. doi:10.48550/arXiv.2308.02976.
[37] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis,
L. Zettlemoyer, V. Stoyanov, RoBERTa: A Robustly Optimized BERT
Pretraining Approach (2019). doi:10.48550/ARXIV.1907.11692.
[38] Z. Gao, A. Feng, X. Song, X. Wu, Target-dependent sentiment clas-
sification with BERT, IEEE Access 7 (2019) 154290–154299. doi:
10.1109/ACCESS.2019.2946594.
[39] S. A. Basha, M. M. Elgammal, B. M. Abuzayed, Online peer-to-peer
lending: A review of the literature, Electronic Commerce Research and
Applications 48 (2021) 101069. doi:10.1016/j.elerap.2021.101069.
[40] M. J. Ariza-Garzón, M. Sanz-Guerrero, J. Arroyo Gallardo, Lending Club, Lending Club loan dataset for granting models (May 2024). doi:10.5281/zenodo.11295916.
[41] M. J. Ariza-Garzón, J. Arroyo, A. Caparrini, M.-J. Segovia-Vargas, Ex-
plainability of a machine learning granting scoring model in peer-to-peer
lending, IEEE Access 8 (2020) 64873–64890. doi:10.1109/ACCESS.
2020.2984412.
[42] M.-J. Ariza-Garzón, J. Arroyo, M.-J. Segovia-Vargas, A. Caparrini,
Profit-sensitive machine learning classification with explanations in
credit risk: The case of small businesses in peer-to-peer lending, Elec-
tronic Commerce Research and Applications 67 (2024) 101428. doi:
10.1016/j.elerap.2024.101428.
[43] T. Chen, C. Guestrin, XGBoost: A scalable tree boosting system,
in: Proceedings of the 22nd ACM SIGKDD International Confer-
ence on Knowledge Discovery and Data Mining, KDD ’16, Associa-
tion for Computing Machinery, New York, NY, USA, 2016, p. 785–794.
doi:10.1145/2939672.2939785.
[44] J. H. Holland, Adaptation in Natural and Artificial Systems: An Intro-
ductory Analysis with Applications to Biology, Control, and Artificial
Intelligence, The MIT Press, Cambridge, Massachusetts, USA, 1992.
doi:10.7551/mitpress/1090.001.0001.
[45] M. Biesialska, K. Biesialska, M. R. Costa-juss`a, Continual lifelong learn-
ing in natural language processing: A survey, in: D. Scott, N. Bel,
C. Zong (Eds.), Proceedings of the 28th International Conference on
Computational Linguistics, International Committee on Computational
Linguistics, Barcelona, Spain (Online), 2020, pp. 6523–6541. doi:
10.18653/v1/2020.coling-main.574.
[46] X. Sun, W. Xu, Fast implementation of DeLong’s algorithm for compar-
ing the areas under correlated receiver operating characteristic curves,
IEEE Signal Processing Letters 21 (11) (2014) 1389–1393. doi:10.
1109/LSP.2014.2337313.
[47] L. Prokhorenkova, G. Gusev, A. Vorobev, A. V. Dorogush, A. Gulin,
CatBoost: unbiased boosting with categorical features, in: Proceedings
of the 32nd International Conference on Neural Information Processing
Systems, NIPS’18, Curran Associates Inc., Red Hook, NY, USA, 2018,
p. 6639–6649.
[48] S. M. Lundberg, S.-I. Lee, A unified approach to interpreting model pre-
dictions, in: Proceedings of the 31st International Conference on Neural
Information Processing Systems, NIPS’17, Curran Associates Inc., Red
Hook, NY, USA, 2017, p. 4768–4777.
[49] ROFIEG, Thirty recommendations on regulation, innovation and finance. Final report to the European Commission by the Expert Group on Regulatory Obstacles to Financial Innovation, Tech. rep., European Commission (2019).
URL https://ec.europa.eu/info/files/191113-report-expert-group-regulatory-obstacles-financial-innovation_en
[50] M. Grootendorst, BERTopic: Neural topic modeling with a class-based TF-IDF procedure (2022). doi:10.48550/ARXIV.2203.05794.
[51] A. B. Dieng, F. J. R. Ruiz, D. M. Blei, Topic Modeling in Embedding
Spaces, Transactions of the Association for Computational Linguistics
8 (2020) 439–453. doi:10.1162/tacl_a_00325.
[52] X. Sun, X. Li, J. Li, F. Wu, S. Guo, T. Zhang, G. Wang, Text classifi-
cation via large language models (2023). doi:10.48550/ARXIV.2305.
08377.
[53] Y. Xia, C. Liu, N. Liu, Cost-sensitive boosted tree for loan evaluation in
peer-to-peer lending, Electronic Commerce Research and Applications
24 (2017) 30–49. doi:10.1016/j.elerap.2017.06.004.