IDC 6940 Final Project

University of West Florida

Delanyce Rose & Richard Henry

Summer 2025

Proposed Workflow

[Figure: proposed workflow diagram]

Introduction

First, we’ll start with some sort of introduction.

Definition

This is where we say what symbolic regression is.

Importance

Here we talk about why we’d want to use symbolic regression.

Competing Technologies

Here, we talk about the alternatives:

Neural Networks

Neural networks can provide answers, but cannot explain why those answers are correct. Yes, researchers are working on reasoning LLMs, but we’re not there yet.

General Linear Regression

This also provides answers, but the engineer or scientist has to decide on the form of the equation before using GLM to fit the constants.
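As a minimal sketch of this workflow, assume the engineer has already chosen the form y = a/x + b, which is the shape of the standard API gravity relation (API = 141.5/SG - 131.5). Least squares then only has to recover the two constants. The variable names, the specific-gravity range, and the use of NumPy are illustrative assumptions, not the project's actual setup.

```python
import numpy as np

# Synthetic data from the standard API gravity relation (assumed for illustration).
sg = np.linspace(0.70, 1.05, 25)      # specific gravity, hypothetical range
api = 141.5 / sg - 131.5              # outcome variable

# The analyst fixes the form api = a * (1/sg) + b up front;
# least squares only fits the constants a and b.
X = np.column_stack([1.0 / sg, np.ones_like(sg)])
coef, *_ = np.linalg.lstsq(X, api, rcond=None)
a, b = coef
print(f"a = {a:.3f}, b = {b:.3f}")
```

If the chosen form were wrong (say, a straight line in SG), the fit would still return constants, but they would describe the wrong equation. That is the limitation this section points at.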

Sparse Regression

Here we throw a bunch of candidate functional forms into a regression model and discard the forms whose fitted constants are small. What’s left tells us what kind of equation to use. This is not technically sparse regression, but the workflow is very similar.
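The fit-then-discard workflow can be sketched as follows. This is only an illustration: the candidate library, the target equation (y = 2x^2 - 3/x), the threshold of 0.1, and the helper name `sparse_fit` are all hypothetical, and the loop used here is sequential thresholded least squares, which may differ from the method the project ends up using.

```python
import numpy as np

# Hypothetical noise-free target: y = 2*x^2 - 3/x
x = np.linspace(0.5, 3.0, 50)
y = 2.0 * x**2 - 3.0 / x

# Library of candidate functional forms (an assumption for this sketch).
names = ["1", "x", "x^2", "1/x", "log(x)"]
library = np.column_stack([np.ones_like(x), x, x**2, 1.0 / x, np.log(x)])

def sparse_fit(library, y, threshold=0.1, n_iter=10):
    """Fit all terms, zero out small coefficients, refit on the survivors."""
    coef, *_ = np.linalg.lstsq(library, y, rcond=None)
    for _ in range(n_iter):
        small = np.abs(coef) < threshold
        coef[small] = 0.0
        keep = ~small
        if not keep.any():
            break
        refit, *_ = np.linalg.lstsq(library[:, keep], y, rcond=None)
        coef[keep] = refit
    return coef

coef = sparse_fit(library, y)
surviving = {n: float(c) for n, c in zip(names, coef) if c != 0.0}
print(surviving)
```

With noisy data or strongly correlated candidate terms, the choice of threshold matters much more than it does in this clean example.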

Evolutionary Algorithms

We figure out how to represent the parts of an equation as individual genes in a chromosome. Then we mimic the evolutionary process from biology until we settle on one “fittest” chromosome that represents the “best” equation.

The issue with this approach is that we must place our constants in the chromosome up front. The advantage of this approach is that we can search a large number of equation forms in a reasonably short time.
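A toy sketch of the idea, using only the Python standard library, is given here. Everything in it is an assumption made for illustration: the three-gene chromosome layout, the list of forms, the fixed pool of constants, the population settings, and the API gravity target (API = 141.5/SG - 131.5). Note how the constants have to be supplied in advance, which is the limitation described above.

```python
import math
import random

# Candidate building blocks (hypothetical choices for this sketch).
FORMS = [("x", lambda x: x), ("1/x", lambda x: 1.0 / x),
         ("x^2", lambda x: x * x), ("log(x)", math.log)]
CONSTANTS = [-131.5, -1.0, 1.0, 10.0, 141.5]   # fixed pool, supplied up front

# Synthetic target: the standard API gravity relation.
XS = [0.70 + 0.05 * i for i in range(8)]
YS = [141.5 / x - 131.5 for x in XS]

def predict(chrom, x):
    form, a, b = chrom          # genes: form index, two constant indices
    return CONSTANTS[a] * FORMS[form][1](x) + CONSTANTS[b]

def error(chrom):
    # Sum of squared errors; lower means fitter.
    return sum((predict(chrom, x) - y) ** 2 for x, y in zip(XS, YS))

def random_chrom(rng):
    return [rng.randrange(len(FORMS)),
            rng.randrange(len(CONSTANTS)),
            rng.randrange(len(CONSTANTS))]

def evolve(rng, pop_size=20, generations=30, mutation_rate=0.2):
    pop = [random_chrom(rng) for _ in range(pop_size)]
    history = [min(error(c) for c in pop)]
    for _ in range(generations):
        pop.sort(key=error)
        children = [pop[0][:]]                       # elitism: keep the fittest
        while len(children) < pop_size:
            p1 = min(rng.sample(pop, 3), key=error)  # tournament selection
            p2 = min(rng.sample(pop, 3), key=error)
            cut = rng.randrange(1, 3)                # single-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < mutation_rate:         # mutation: re-draw one gene
                g = rng.randrange(3)
                child[g] = random_chrom(rng)[g]
            children.append(child)
        pop = children
        history.append(min(error(c) for c in pop))
    return min(pop, key=error), history

best, history = evolve(random.Random(0))
form, a, b = best
print(f"best: {CONSTANTS[a]} * {FORMS[form][0]} + {CONSTANTS[b]}, "
      f"SSE = {error(best):.4g}")
```

Because the fittest chromosome is copied unchanged into each new generation, the best error never gets worse from one generation to the next. The search can only ever land on constants that are already in the pool.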

Datasets

Next, we move on to the datasets that we will use to test Symbolic Regression and its competitors.

API Gravity

This is a small, synthetic dataset with one predictor variable, two constants and one outcome variable. As we know the formula, we can judge the accuracy of the algorithm.
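A sketch of how such a dataset might be generated and used to score a candidate equation is shown here. It assumes the standard definition API = 141.5/SG - 131.5 (specific gravity as the predictor, 141.5 and 131.5 as the two constants). The function names, the sample size, the SG range, and the deliberately wrong linear candidate are hypothetical.

```python
import numpy as np

def make_api_gravity_dataset(n=20, sg_min=0.70, sg_max=1.05):
    """Noise-free synthetic data from the standard API gravity relation."""
    sg = np.linspace(sg_min, sg_max, n)
    api = 141.5 / sg - 131.5
    return sg, api

def rmse(y_true, y_pred):
    """Root-mean-square error between known and predicted outcomes."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

sg, api = make_api_gravity_dataset()

# Two candidate equations an algorithm might return: the true form
# and a made-up linear one that has the wrong shape.
candidates = {
    "141.5/SG - 131.5": lambda s: 141.5 / s - 131.5,
    "150 - 140*SG": lambda s: 150.0 - 140.0 * s,
}
scores = {name: rmse(api, f(sg)) for name, f in candidates.items()}
for name, score in scores.items():
    print(f"{name}: RMSE = {score:.4f}")
```

Because the generating formula is known, a recovered equation can be judged both on its error and on whether its form and constants match the truth.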

Ideal Gas Volumes

This is a larger, synthetic dataset with two predictor variables, three constants and one outcome variable. Like the API gravity dataset, the relationships are non-linear.
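One possible way to build such a dataset is sketched here. The choice of predictors and constants is a guess, since it is not spelled out above: temperature in degrees Celsius and pressure in kPa are taken as the two predictors, and the three constants are taken to be the amount of gas, the gas constant, and the Celsius-to-Kelvin offset, so that V = nR(T + 273.15)/P. The grid ranges and names are hypothetical.

```python
import numpy as np

R = 8.314462618          # molar gas constant, kPa*L/(mol*K)
N_MOLES = 1.0            # amount of gas in mol (assumed constant)
KELVIN_OFFSET = 273.15   # converts degrees Celsius to kelvin

def ideal_gas_volume(temp_c, pressure_kpa):
    """Volume in litres from the ideal gas law, V = nR(T + 273.15)/P."""
    return N_MOLES * R * (temp_c + KELVIN_OFFSET) / pressure_kpa

# Hypothetical grid of operating conditions.
temps = np.linspace(0.0, 100.0, 11)       # degrees Celsius
pressures = np.linspace(50.0, 500.0, 10)  # kPa
T, P = np.meshgrid(temps, pressures)

# Columns: temperature, pressure, volume.
dataset = np.column_stack([T.ravel(), P.ravel(), ideal_gas_volume(T, P).ravel()])
print(dataset.shape)  # (110, 3)
```

The relationship is non-linear in both predictors: volume is inversely proportional to pressure and depends on temperature only through the shifted term T + 273.15.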

Molecular Mass I

This is a small, real dataset with two predictor variables and an unknown number of constants, as engineers have built many equations over the years to predict the single outcome variable.

This is the heart of the paper, and we can compare the predicted equation forms with those that have been generated by hand over the past 60 years.

Molecular Mass II

This is a larger, synthetic dataset built using existing correlations for Molecular Mass. We are expecting challenges with the first molecular mass dataset, so this one will hopefully help us identify the issues.

Health Care


Observations

Here we discuss our observations from running the various datasets through the various libraries.

Conclusions

Summary of Observations

References

We need to figure out how the Doc does his bibliography!!!

Appendices

If the report is too long, we shove some of it here.