First Advisor

John Lipor

Term of Graduation

Fall 2024

Date of Publication

12-4-2024

Document Type

Thesis

Degree Name

Master of Science (M.S.) in Electrical and Computer Engineering

Department

Electrical and Computer Engineering

Language

English

Subjects

Bayesian Methods, Geothermal Energy, Machine Learning, Mixture Proportion Estimation, PU Learning, Semi-supervised Learning

Physical Description

1 online resource (xii, 88 pages)

Abstract

As we face the current climate crisis, the discovery of geothermal energy resources has the potential to greatly reduce our dependence on fossil fuels worldwide. However, the development of any new energy infrastructure is expensive and depends on the willingness of energy agencies and developers to make initial investments based on calculated risk measures. One such measure, called geothermal favorability, is the likelihood that a site has conditions favorable for geothermal systems containing recoverable energy potential. Predicting it from existing geophysical datasets is a nontrivial task. The prediction of geothermal favorability can be framed as a binary classification problem in which the data consist entirely of samples that are either positive (the location contains a conventional hydrothermal system) or unlabeled (the location contains no known geothermal system but should not be considered strictly negative). Data comprising only positive and unlabeled examples are called positive-unlabeled data, or PU data. PU learning is the branch of semi-supervised learning concerned with learning effectively from PU data. In addition to containing no negative examples, the dataset this thesis uses is heavily imbalanced, with many unlabeled examples and very few positive examples.
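
The short Python sketch below illustrates the PU setup just described: a feature matrix with one row per map cell and a label vector in which only known hydrothermal sites are marked positive and everything else is left unlabeled. The grid size, feature count, and number of positives are made up for illustration and do not come from the USGS dataset.

import numpy as np

# All sizes below are hypothetical; the thesis uses U.S. Geological Survey data from 2008.
rng = np.random.default_rng(0)
n_cells, n_features = 10_000, 6                  # map cells x geophysical features
X = rng.normal(size=(n_cells, n_features))       # stand-in feature matrix

known_positive_idx = rng.choice(n_cells, size=50, replace=False)  # very few known systems

# s = 1 for labeled positives, s = 0 for unlabeled cells. Unlabeled is NOT the
# same as negative: an unlabeled cell may still contain an undiscovered system.
s = np.zeros(n_cells, dtype=int)
s[known_positive_idx] = 1
print(f"labeled positives: {s.sum()}, unlabeled: {(s == 0).sum()}")  # heavy imbalance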

Previously, resource assessments have required experts to be directly involved in the assessment process, which can be time-consuming and expensive and can introduce human bias. Recent work in geothermal favorability prediction has used undersampling to mitigate class imbalance when training linear and nonlinear machine learning models, but more sophisticated techniques exist in the larger body of literature that specifically target the deficiencies of PU data. Furthermore, the metrics most common in the classification literature, e.g., F1 score and area under the receiver operating characteristic curve (ROC-AUC), do not adequately reflect the performance of models trained on PU data with such severe class imbalance. This thesis introduces the Kullback-Leibler divergence (DKL) as a means of model evaluation for PU data and explores two PU learning techniques, Selected-at-Random Expectation-Maximization (SAR-EM) and Difference-of-Estimated-Densities-based PU Learning (DEDPUL), on U.S. Geological Survey data from 2008. It then compares these two state-of-the-art PU learning techniques against previous naïve methods based on logistic regression and XGBoost. We demonstrate that, when used as a scoring function to tune hyperparameters for linear and nonlinear machine learning models, the DKL has an intrinsic ability to separate the unlabeled distribution from the predicted-positive distribution. However, it is a highly variable scoring function: the strategy with the highest DKL score has a standard deviation 38% larger than its mean.
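
As a rough illustration of the DKL-based evaluation described above, the Python sketch below computes a histogram-based KL divergence between a classifier's scores on unlabeled cells and its scores on cells it predicts positive; a larger value indicates better-separated distributions. The binning, threshold, and exact choice of distributions are assumptions made for illustration and may differ from the construction used in the thesis.

import numpy as np
from scipy.stats import entropy  # entropy(p, q) computes KL(p || q) after normalizing p and q

def dkl_score(scores, s, threshold=0.5, bins=20, eps=1e-9):
    """KL divergence between score histograms of predicted-positive vs unlabeled cells."""
    unlabeled = scores[s == 0]
    pred_pos = scores[scores >= threshold]
    if pred_pos.size == 0 or unlabeled.size == 0:
        return 0.0
    grid = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(pred_pos, bins=grid)
    q, _ = np.histogram(unlabeled, bins=grid)
    return entropy(p + eps, q + eps)  # larger values mean better-separated distributions

# Hypothetical usage: scores = model.predict_proba(X)[:, 1], then pick the
# hyperparameter setting that maximizes dkl_score(scores, s).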

In the PU learning context, a nontraditional classifier (NTC) is a classifier trained on PU data to separate positive from unlabeled examples as if it were performing traditional binary classification. NTCs are often an integral part of PU learning algorithms, where they serve to reduce data dimensionality, provide a lower bound for class prior estimation, and act as a baseline for assessing PU learning algorithm performance. We show that when logistic regression is used to train an NTC, SAR-EM achieves a more accurate estimate of the class prior than the other methods, with an MAE 185% lower than the closest naïve approach. SAR-EM also produces a slight improvement of 18% in F1 score compared to all other methods. The favorability maps resulting from SAR-EM resemble those given by naïve logistic regression methods in that the transitions between favorability regions are much more gradual and the low-favorability regions much smaller than with the other methods explored. When XGBoost is used to train an NTC, DEDPUL amplifies the model's bias toward heavily penalizing likely-negative samples and regions, with steep boundaries between favorability regions. When comparing ridge plots, naïve XGBoost is better at separating the unlabeled distribution from the predicted-positive results, even compared with the more advanced PU methods. Nevertheless, simpler models such as logistic regression and XGBoost, which rely on undersampling and the appropriate tuning of class weights and treat all unlabeled examples as negative, can provide performance on par with the current state of the art in PU learning.
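
The Python sketch below shows one way an NTC might be trained and used for class-prior estimation: a logistic regression fit to separate labeled positives from unlabeled cells, followed by the classic Elkan-Noto estimate of the labeling frequency and class prior under the selected-completely-at-random (SCAR) assumption. The synthetic data and the Elkan-Noto step are illustrative assumptions, not necessarily the exact procedure used in this thesis (SAR-EM and DEDPUL refine this basic idea).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic PU data (all numbers hypothetical): alpha is the true class prior,
# c is the probability that a true positive is labeled (the SCAR assumption).
rng = np.random.default_rng(0)
n, alpha, c = 20_000, 0.05, 0.5
y = rng.random(n) < alpha                        # hidden ground truth
X = rng.normal(size=(n, 6)) + 2.0 * y[:, None]   # features carry some signal
s = (y & (rng.random(n) < c)).astype(int)        # only some positives are labeled

X_tr, X_val, s_tr, s_val = train_test_split(X, s, stratify=s, random_state=0)

# NTC: an ordinary classifier trained to separate labeled positives from unlabeled
# cells. A plain logistic regression is used so predict_proba approximates P(s=1|x);
# the undersampling or class weighting discussed in the thesis would require
# recalibrating the scores before the prior estimate below.
ntc = LogisticRegression(max_iter=1000)
ntc.fit(X_tr, s_tr)
val_scores = ntc.predict_proba(X_val)[:, 1]      # estimates P(s = 1 | x)

# Elkan-Noto style estimates: c_hat is the mean NTC score on held-out labeled
# positives; the labeled fraction P(s = 1) is itself a lower bound on the prior
# (since c <= 1), and P(s = 1) / c_hat estimates the prior.
c_hat = val_scores[s_val == 1].mean()
labeled_frac = s_val.mean()
print(f"c_hat={c_hat:.3f}  lower bound={labeled_frac:.4f}  prior_hat={labeled_frac / c_hat:.4f}")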

Rights

In Copyright. URI: http://rightsstatements.org/vocab/InC/1.0/ This Item is protected by copyright and/or related rights. You are free to use this Item in any way that is permitted by the copyright and related rights legislation that applies to your use. For other uses you need to obtain permission from the rights-holder(s).

Persistent Identifier

https://archives.pdx.edu/ds/psu/43006
