IDC 6940 Final Project
University of West Florida
Delanyce Rose & Richard Henry
Summer 2025
Proposed Workflow
Introduction
First, we’ll start with some sort of introduction.
Definition
This is where we say what symbolic regression is.
Importance
Here we talk about why we’d want to use symbolic regression.
Competing Technologies
Here, we talk about the alternatives:
Neural Networks
These can provide answers, but cannot explain why those answers are correct. Yes, researchers are working on reasoning LLMs, but we’re not there yet.
General Linear Regression
This also provides answers, but the engineer or scientist has to decide on the form of the equation before using a general linear model (GLM) to fit the constants.
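As a minimal sketch of this workflow, suppose the analyst has already decided on the form y = a/x + b. Least squares then only has to fit the constants a and b. The data, the chosen form, and all variable names here are illustrative assumptions, not part of the project datasets.

```python
import numpy as np

# Hypothetical data generated from y = 2/x + 3 (illustrative only).
x = np.linspace(0.5, 2.0, 20)
y = 2.0 / x + 3.0

# The analyst fixes the form y = a*(1/x) + b up front;
# the regression only fits the constants a and b.
design = np.column_stack([1.0 / x, np.ones_like(x)])
coeffs, *_ = np.linalg.lstsq(design, y, rcond=None)
a, b = coeffs

print(f"a = {a:.3f}, b = {b:.3f}")
```

If the analyst had picked the wrong form (say, y = a*x + b), the fit would still return constants, but they would describe the wrong equation. That is the limitation described above.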
Sparse Regression
Here we throw a bunch of functional forms into a regression model, and throw away the forms with small constants. What’s left tells us what kind of equation to use. This is not technically sparse regression, but the workflow is very similar.
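The workflow above can be sketched as a thresholded least-squares loop: fit every candidate form, zero out the small constants, and refit the survivors. This is one simple way to do it, not necessarily the method the project will use. The data, the library of forms, the threshold value, and the function name thresholded_fit are all assumptions made for illustration.

```python
import numpy as np

# Hypothetical data from y = 3*x**2 + 0.5 (illustrative only).
x = np.linspace(0.5, 3.0, 40)
y = 3.0 * x**2 + 0.5

# Library of candidate functional forms, one column per form.
names = ["1", "x", "x^2", "1/x", "log(x)"]
library = np.column_stack([np.ones_like(x), x, x**2, 1.0 / x, np.log(x)])

def thresholded_fit(library, y, threshold=0.1, n_iter=10):
    """Fit all forms, zero out small coefficients, then refit the survivors."""
    coeffs, *_ = np.linalg.lstsq(library, y, rcond=None)
    for _ in range(n_iter):
        small = np.abs(coeffs) < threshold
        coeffs[small] = 0.0
        keep = ~small
        if not keep.any():
            break
        refit, *_ = np.linalg.lstsq(library[:, keep], y, rcond=None)
        coeffs[keep] = refit
    return coeffs

coeffs = thresholded_fit(library, y)
kept = [name for name, c in zip(names, coeffs) if c != 0.0]
print("Surviving forms:", kept)
```

The surviving forms suggest the shape of the equation. The threshold is a judgment call: set it too high and real terms are discarded, too low and spurious terms survive.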
Evolutionary Algorithms
We figure out how to represent the parts of an equation as individual genes in a chromosome. Then we mimic the evolutionary process from biology until we settle on one “fittest” chromosome that represents the “best” equation.
The issue with this approach is that we must place our constants in the chromosome up front. The advantage of this approach is that we can search a large number of equation forms in a reasonably short time.
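A toy sketch of the idea, assuming a three-gene chromosome that picks a functional form and two constants for y = a*f(x) + b. The gene pools, target data, selection scheme (keep the fitter half, plus elitism), and all function names are hypothetical choices for illustration. Real libraries use richer encodings such as expression trees.

```python
import math
import random

# Gene pools: the chromosome can only use forms and constants listed here.
FORMS = [("x", lambda x: x), ("x^2", lambda x: x * x),
         ("1/x", lambda x: 1.0 / x), ("log(x)", math.log)]
CONSTANTS = [-2.0, -1.0, 0.5, 1.0, 2.0, 3.0]

# Hypothetical target data from y = 2*x^2 + 1 (illustrative only).
XS = [0.5 + 0.25 * i for i in range(12)]
YS = [2.0 * x * x + 1.0 for x in XS]

def random_chromosome(rng):
    """Three genes: index of the form, index of a, index of b in y = a*f(x) + b."""
    return [rng.randrange(len(FORMS)), rng.randrange(len(CONSTANTS)),
            rng.randrange(len(CONSTANTS))]

def error(chrom):
    """Sum of squared errors against the data; lower means fitter."""
    f = FORMS[chrom[0]][1]
    a, b = CONSTANTS[chrom[1]], CONSTANTS[chrom[2]]
    return sum((a * f(x) + b - y) ** 2 for x, y in zip(XS, YS))

def mutate(chrom, rng, rate=0.2):
    """Replace each gene with a random one with probability `rate`."""
    fresh = random_chromosome(rng)
    return [new if rng.random() < rate else old for old, new in zip(chrom, fresh)]

def crossover(p1, p2, rng):
    """Single-point crossover between two parents."""
    cut = rng.randrange(1, len(p1))
    return p1[:cut] + p2[cut:]

def evolve(pop_size=30, generations=40, seed=0):
    rng = random.Random(seed)
    population = [random_chromosome(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=error)
        history.append(error(population[0]))
        parents = population[: pop_size // 2]  # keep the fitter half
        children = [population[0]]             # elitism: best survives unchanged
        while len(children) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = children
    best = min(population, key=error)
    return best, history + [error(best)]

def describe(chrom):
    return f"y = {CONSTANTS[chrom[1]]} * {FORMS[chrom[0]][0]} + {CONSTANTS[chrom[2]]}"

best, history = evolve()
print(describe(best), "| error:", history[-1])
```

Note that the search can only ever return constants that appear in CONSTANTS. If the true constant were 2.7, the best this sketch could do is 3.0, which illustrates the limitation of fixing constants up front.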
Datasets
Next, we move on to the datasets that we will use to test Symbolic Regression and its competitors.
API Gravity
This is a small, synthetic dataset with one predictor variable, two constants and one outcome variable. As we know the formula, we can judge the accuracy of the algorithm.
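A sketch of how such a dataset could be generated and used as a yardstick, assuming the standard definition API = 141.5 / SG - 131.5, where SG is the specific gravity. The range of specific gravities, the sample count, and the helper name max_abs_error are assumptions for illustration. The project's actual dataset may differ.

```python
import numpy as np

def api_gravity(specific_gravity):
    """Standard API gravity definition: 141.5 / SG - 131.5."""
    return 141.5 / specific_gravity - 131.5

# Hypothetical range of specific gravities (one predictor variable).
sg = np.linspace(0.70, 1.05, 15)
api = api_gravity(sg)  # one outcome variable

def max_abs_error(candidate, x, y_true):
    """Largest absolute gap between a candidate equation and the known outcome."""
    return float(np.max(np.abs(candidate(x) - y_true)))

# A candidate with slightly wrong constants, standing in for an algorithm's output.
candidate = lambda s: 141.0 / s - 131.0

print("Max error of candidate:", max_abs_error(candidate, sg, api))
```

Because the generating formula is known, any equation an algorithm proposes can be scored both on its numerical error and on whether it recovers the 1/SG form and the two constants.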
Ideal Gas Volumes
This is a larger, synthetic dataset with two predictor variables, three constants and one outcome variable. Like the API gravity dataset, the relationships are non-linear.
Molecular Mass I
This is a small, real dataset with two predictor variables and an unknown number of constants, as engineers have built many equations over the years to predict the single outcome variable.
This is the heart of the paper, and we can compare the predicted equation forms with those that have been generated by hand over the past 60 years.
Molecular Mass II
This is a larger, synthetic dataset built using existing correlations for Molecular Mass. We are expecting challenges with the first molecular mass dataset, so this one will hopefully help us identify the issues.
Health Care
Observations
Here we discuss our observations from running the various datasets through the various libraries.
Conclusions
Summary of Observations
References
We need to figure out how the Doc does his bibliography!!!
Appendices
If the report is too long, we shove some of it here.