# Libraries
import numpy as np
import pandas as pd
#import sympy as sym
import matplotlib.pyplot as plt
from sklearn import linear_model
from sklearn.neural_network import MLPRegressor
IDC 6940
Weekly Report 1
Richard Henry & Delanyce Rose
25-June-2025
Symbolic Regression
Initial Research
This is a machine-learning technique for finding equations that fit data. The following Wikipedia page piqued the students’ interest:
https://en.wikipedia.org/wiki/Symbolic_regression
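To make the idea concrete before any of the work below, here is a minimal sketch of what a symbolic-regression fit can look like in Python. The gplearn library, its settings, and the toy target are assumptions for illustration only; they are not part of this report's workflow:

# Sketch only: gplearn and these settings are assumed for illustration;
# the code in this report does not use symbolic regression yet.
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Toy data following the API-gravity relation used later in this report
X = np.arange(0.55, 1.075, 0.025).reshape(-1, 1)
y = 141.5 / X[:, 0] - 131.5

# Evolve candidate expressions built from basic arithmetic operators
sr = SymbolicRegressor(function_set=('add', 'sub', 'mul', 'div'),
                       population_size=1000,
                       generations=20,
                       random_state=0)
sr.fit(X, y)

# Best evolved expression, in gplearn's prefix notation
print(sr._program)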
First Toy Dataset
Before we dive into the details, let us build a toy dataset to explain the motivation behind symbolic regression.
API Gravity
Convert specific gravity to API Gravity
\(\gamma_{API}=\frac{141.5}{\gamma_{o}}-131.5\)
| Symbol | Meaning | Units |
|---|---|---|
| \(\gamma_{API}\) | API Gravity | [degAPI] |
| \(\gamma_o\) | specific gravity | [1/wtr] |
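For example, fresh water has \(\gamma_o = 1\), which gives \(\gamma_{API} = \frac{141.5}{1} - 131.5 = 10\) degAPI; this value appears in the dataset generated below.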
x=np.arange(start=0.55,stop=1.075,step=0.025)
y=141.5/x-131.5
dfa=pd.DataFrame(data={'Specific Gravity':x,'API Gravity':y})
dfa
| | Specific Gravity | API Gravity |
|---|---|---|
| 0 | 0.550 | 125.772727 |
| 1 | 0.575 | 114.586957 |
| 2 | 0.600 | 104.333333 |
| 3 | 0.625 | 94.900000 |
| 4 | 0.650 | 86.192308 |
| 5 | 0.675 | 78.129630 |
| 6 | 0.700 | 70.642857 |
| 7 | 0.725 | 63.672414 |
| 8 | 0.750 | 57.166667 |
| 9 | 0.775 | 51.080645 |
| 10 | 0.800 | 45.375000 |
| 11 | 0.825 | 40.015152 |
| 12 | 0.850 | 34.970588 |
| 13 | 0.875 | 30.214286 |
| 14 | 0.900 | 25.722222 |
| 15 | 0.925 | 21.472973 |
| 16 | 0.950 | 17.447368 |
| 17 | 0.975 | 13.628205 |
| 18 | 1.000 | 10.000000 |
| 19 | 1.025 | 6.548780 |
| 20 | 1.050 | 3.261905 |
We now have a dataset consisting of 21 records of two variables. If we are just interested in predicting the value of one variable from the other, two common tools we can employ are:
- a neural network
- linear regression
Let’s take a look at the data:
plt.scatter(x,y)
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("First Toy Dataset")
plt.grid()
plt.show()
Multiple Regression
When we plot the data, we notice that the relationship is non-linear. One way to handle this is to assume a functional form such as:
\(y=a_{3}x^{3}+a_{2}x^{2}+a_{1}x+a_{0}\)
and use multiple regression to estimate the values of the constants \(a_0...a_3\):
x2=x**2
x3=x**3
dfb=pd.DataFrame(data={'x1':x,'x2':x2,'x3':x3})
mod01=linear_model.LinearRegression()
mod01.fit(dfb,y)
y_fit=mod01.predict(dfb)
plt.scatter(x,y,label="Raw")
plt.plot(x,y_fit,label="Fit")
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("First Toy Dataset Poly Fit")
plt.grid()
plt.legend()
plt.show()
print(f"lin (a1): {mod01.coef_[0]:.2f}")
print(f"sqr (a2): {mod01.coef_[1]:.2f}")
print(f"cub (a3): {mod01.coef_[2]:.2f}")
print(f"int (a0): {mod01.intercept_:.2f}")
lin (a1): -1457.03
sqr (a2): 1241.69
cub (a3): -390.67
int (a0): 616.15
# Quick Test
test=mod01.intercept_+mod01.coef_[0]*0.6+mod01.coef_[1]*0.6*0.6+mod01.coef_[2]*0.6*0.6*0.6
print(test)
104.55253068378275
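The same check can be done through the fitted model's predict method; the sketch below reuses the dfb column names from above rather than typing out the polynomial by hand:

# Sketch: the same quick test via predict(), using a one-row frame
# with the feature names mod01 was fit on
sg = 0.6
point = pd.DataFrame(data={'x1': [sg], 'x2': [sg**2], 'x3': [sg**3]})
print(mod01.predict(point))   # should match the manual calculation above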
This is a very good fit, so we can confidently use the model to predict API gravity for specific gravities in the range \(0.55-1.05\). We should be less comfortable, however, using it for values outside this range:
x_big=np.arange(start=1.1,stop=1.55,step=0.025)
y_big=141.5/x_big-131.5
x2_big=x_big**2
x3_big=x_big**3
dfc=pd.DataFrame(data={'x1':x_big,'x2':x2_big,'x3':x3_big})
y_bigfit=mod01.predict(dfc)
plt.scatter(x_big,y_big,label="Raw")
plt.plot(x_big,y_bigfit,label="Fit")
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("Extending Toy Dataset - Poly")
plt.grid()
plt.legend()
plt.show()
Neural Network
Let us repeat the exercise with a baby ANN:
mod02=MLPRegressor(hidden_layer_sizes=(7,21,7),
                   random_state=17,
                   verbose=False,
                   learning_rate_init=0.01,
                   tol=1e-3,
                   max_iter=1500)
mod02.fit(dfb,y)
y_fitann=mod02.predict(dfb)
plt.scatter(x,y,label="Raw")
plt.plot(x,y_fitann,label="Fit")
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("First Toy Dataset - ANN Fit")
plt.grid()
plt.legend()
plt.show()
y_bigfitann=mod02.predict(dfc)
plt.scatter(x_big,y_big,label="Raw")
plt.plot(x_big,y_bigfitann,label="Fit")
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("Extending Toy Dataset - ANN")
plt.grid()
plt.legend()
plt.show()
Although we can argue that we’ve overtrained the ANN, the point here is that it predicts garbage outside of its training range.
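To put a number on "garbage", we can score both models inside and outside the training range. This is only a sketch reusing the variables defined above; the exact scores will depend on the run:

# Sketch: compare in-range vs. out-of-range fit quality (R^2) for both models
from sklearn.metrics import r2_score

print("poly in-range  R2:", r2_score(y, mod01.predict(dfb)))
print("poly out-range R2:", r2_score(y_big, mod01.predict(dfc)))
print("ANN  in-range  R2:", r2_score(y, mod02.predict(dfb)))
print("ANN  out-range R2:", r2_score(y_big, mod02.predict(dfc)))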
Sparse Regression
One way to recover the underlying equation is to throw a bunch of possible terms at the regression algorithm:
x1=np.reciprocal(x)
xo=np.log(x)
xe=np.exp(x)
dfd=pd.DataFrame(data={'xo':xo,'xe':xe,'x1':x1,'xp':x,'x2':x2,'x3':x3})
mod03=linear_model.LinearRegression()
mod03.fit(dfd,y)
y_fitspa=mod03.predict(dfd)
plt.scatter(x,y,label="Raw")
plt.plot(x,y_fitspa,label="Fit")
plt.xlabel("Specific Gravity")
plt.ylabel("API Gravity")
plt.title("First Toy Dataset Kitchen-Sink Fit")
plt.grid()
plt.legend()
plt.show()
As expected, we have a very good fit, as we have included both the original inverse term \((\frac{1}{x})\) and the terms to make up a cubic equation \((x,x^{2},x^{3})\):
print("log: ",mod03.coef_[0])
print("exp: ",mod03.coef_[1])
print("inv: ",mod03.coef_[2])
print("lin: ",mod03.coef_[3])
print("sqr: ",mod03.coef_[4])
print("cub: ",mod03.coef_[5])
print("int :",mod03.intercept_)
log: -3.2805389571047666e-11
exp: 1.7561063714310876e-10
inv: 141.49999999998684
lin: -2.085140948793196e-10
sqr: -7.542411140093463e-12
cub: -7.80140396727802e-11
int : -131.5000000001701
However, when we look at the coefficients, the one on the inverse term is many orders of magnitude larger than all the others, which are essentially zero. This is the basic idea behind sparse regression: keep only the terms whose coefficients matter. Notice that the coefficient of the inverse term and the intercept are very close to the original values, so our regressed equation is:
\(y=0.0\cdot \log(x)+0.0\cdot e^x+\frac{141.5}{x}+0.0\cdot x +0.0\cdot x^2 +0.0\cdot x^3 -131.5\)
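A more systematic version of this idea penalizes coefficient size so that negligible terms are driven to exactly zero, rather than being spotted by eye. The sketch below uses scikit-learn's Lasso on the same kitchen-sink features; the penalty strength alpha is an arbitrary assumption, not a tuned value:

# Sketch: L1-penalized (sparse) fit on the kitchen-sink features; small
# coefficients are shrunk to exactly zero, ideally leaving the inverse term
sparse_mod = linear_model.Lasso(alpha=0.1, max_iter=100000)
sparse_mod.fit(dfd, y)
for name, coef in zip(dfd.columns, sparse_mod.coef_):
    print(f"{name}: {coef:.4f}")
print("int:", sparse_mod.intercept_)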
At this point, the million-dollar question is how the kitchen-sink regression (mod03) would have played out had we neglected to include the inverse term. The polynomial regression earlier provides an answer: the difference in fit quality between mod01 and mod03 is negligible, even though the coefficients of the polynomial terms are dramatically different.
Therefore this version of sparse regression has its limitations.
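The sketch below makes that limitation explicit: refit the kitchen-sink regression without the inverse column and compare fit quality (the names dfe and mod04 are introduced here just for this check):

# Sketch: kitchen-sink fit with the 1/x feature deliberately left out
from sklearn.metrics import r2_score

dfe = dfd.drop(columns=['x1'])
mod04 = linear_model.LinearRegression()
mod04.fit(dfe, y)

print("with    1/x, R2:", r2_score(y, mod03.predict(dfd)))
print("without 1/x, R2:", r2_score(y, mod04.predict(dfe)))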
Tentative Conclusions
- Uncovering the fundamental relations underlying data can sometimes be critical for making predictions
- The “kitchen-sink” approach can sometimes work surprisingly well at uncovering fundamental relationships
Next Steps
We will move from Python to R to:
1. take a first look at a library that uses an evolutionary algorithm for symbolic regression.