# C Linear Regression

A high-performance implementation of linear regression with support for multiple input features, written in C. This project demonstrates gradient descent optimization for both one-dimensional and multidimensional regression problems.

## Features
- ✅ Multidimensional Support: Handle 1 to 10 input features
- ✅ Gradient Descent: Efficient optimization with configurable learning rate
- ✅ Memory Efficient: Optimized 2D array storage for large datasets
- ✅ CSV Data Loading: Easy data import from CSV files
- ✅ Real-time Training: Monitor convergence with epoch-by-epoch output
- ✅ Robust Error Handling: Comprehensive validation and error checking
## Project Structure

```
c-linear-regression/
├── main.c               # Main program entry point
├── regression_core.c    # Core regression algorithms
├── regression_io.c      # CSV parsing and I/O operations
├── regression_utils.h   # Header with data structures and function declarations
├── singledim_data.csv   # 1D sample dataset (100 samples)
├── multidim_data.csv    # 2D sample dataset (100 samples, 2 features)
└── README.md
```
## Building

```bash
gcc -o linear_regression main.c regression_core.c regression_io.c -Wall -Wextra
```

## Usage

```bash
./linear_regression <num_epochs> <num_features> <num_samples> <data_file_path>
```

1D Linear Regression:

```bash
./linear_regression 100 1 100 singledim_data.csv
```

2D Linear Regression:

```bash
./linear_regression 100 2 100 multidim_data.csv
```

## Data Format

The program expects CSV files with the following format:
```
feature1,feature2,...,featureN,output
```
1D Data (`singledim_data.csv`):

```
1,2.14
2,3.88
3,6.11
4,7.95
5,10.02
...
```
2D Data (`multidim_data.csv`):

```
1,2,3.14
2,3,5.88
3,4,8.11
4,5,10.95
5,6,13.02
...
```
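Conceptually, each row is split on commas into `num_features` feature values followed by a trailing label. A minimal sketch of such a parser is shown below; the function name `parse_csv_row` and its signature are illustrative, not the actual API in regression_io.c:

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative sketch: split one CSV line such as "1,2,3.14" into
 * num_features feature values followed by the output label.
 * The project's real parsing logic lives in regression_io.c. */
static int parse_csv_row(char *line, size_t num_features,
                         float *features, float *label) {
    char *token = strtok(line, ",");
    for (size_t i = 0; i < num_features; i++) {
        if (token == NULL)
            return -1;               /* too few columns */
        features[i] = strtof(token, NULL);
        token = strtok(NULL, ",");
    }
    if (token == NULL)
        return -1;                   /* missing output column */
    *label = strtof(token, NULL);
    return 0;
}
```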
## Algorithm

For multidimensional linear regression, the model predicts:
y_hat = w₁x₁ + w₂x₂ + ... + wₙxₙ + b
Where:
- `y_hat` is the predicted output
- `w₁, w₂, ..., wₙ` are the feature weights
- `x₁, x₂, ..., xₙ` are the input features
- `b` is the bias term
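In code, the prediction is a single dot product plus the bias. The sketch below is illustrative and mirrors the array layout from the Data Structures section, not necessarily the exact function in regression_core.c:

```c
#include <stddef.h>

/* Sketch of y_hat = w1*x1 + ... + wn*xn + b. */
static float predict(const float *weights, float bias,
                     const float *features, size_t num_features) {
    float y_hat = bias;
    for (size_t i = 0; i < num_features; i++)
        y_hat += weights[i] * features[i];
    return y_hat;
}
```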
The algorithm uses gradient descent to minimize the mean squared error (MSE):
Forward Pass:
- Compute predictions for all samples
- Calculate MSE:
MSE = (1/n) * Σ(y_hat - y)²
Backward Pass:
- Update weights:
wᵢ = wᵢ - α * (2/n) * Σ(y_hat - y) * xᵢ
- Update bias:
b = b - α * (2/n) * Σ(y_hat - y)
Where α is the learning rate (default: 0.0001).
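A minimal sketch of one training epoch implementing the forward and backward passes above, assuming the `samples[feature][sample_index]` layout described below; the function and variable names are illustrative, not the actual code in regression_core.c:

```c
#include <stddef.h>

#define LEARNING_RATE 0.0001f

/* Illustrative single epoch of gradient descent: accumulate the error
 * and per-feature gradients over all samples, then apply the update
 * rules wᵢ -= α*(2/n)*Σ(err*xᵢ) and b -= α*(2/n)*Σ(err). */
static float train_epoch(float **samples, const float *labels,
                         size_t n, size_t num_features,
                         float *weights, float *bias) {
    float mse = 0.0f;
    float grad_b = 0.0f;
    float grad_w[10] = {0};          /* MAX_FEATURES is 10 */

    for (size_t s = 0; s < n; s++) {
        /* forward pass: prediction for sample s */
        float y_hat = *bias;
        for (size_t f = 0; f < num_features; f++)
            y_hat += weights[f] * samples[f][s];

        float err = y_hat - labels[s];
        mse += err * err;
        grad_b += err;
        for (size_t f = 0; f < num_features; f++)
            grad_w[f] += err * samples[f][s];
    }

    /* backward pass: apply the gradient updates */
    for (size_t f = 0; f < num_features; f++)
        weights[f] -= LEARNING_RATE * (2.0f / n) * grad_w[f];
    *bias -= LEARNING_RATE * (2.0f / n) * grad_b;

    return mse / n;
}
```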
## Configuration

| Parameter | Default | Description |
|---|---|---|
| `LEARNING_RATE` | 0.0001 | Gradient descent step size |
| `MAX_EPOCHS` | 2500 | Maximum training iterations |
| `MIN_EPOCHS` | 1 | Minimum training iterations |
| `MAX_SAMPLES` | 1000 | Maximum number of data samples |
| `MAX_FEATURES` | 10 | Maximum number of input features |
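These parameters live in regression_utils.h. A sketch of what the corresponding definitions might look like, based on the table above (the exact formatting and literal suffixes are assumptions):

```c
/* Sketch of the configuration block in regression_utils.h. */
#define LEARNING_RATE 0.0001f
#define MAX_EPOCHS    2500
#define MIN_EPOCHS    1
#define MAX_SAMPLES   1000
#define MAX_FEATURES  10
```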
## Data Structures

```c
typedef struct {
    float *weights;       // Array of feature weights
    float bias;           // Bias term
} RegressionParameters;

typedef struct {
    float **samples;      // 2D array: samples[feature][sample_index]
    float *labels;        // Output values
    size_t length;        // Number of samples
    size_t num_features;  // Number of input features
} DataLoader;

typedef struct {
    float mse;            // Mean squared error
    float sum_err;        // Sum of errors
    float *weighted_err;  // Weighted errors for each feature
} TrainingDiagnostics;
```
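For illustration, a hedged sketch of how the `samples[feature][sample_index]` matrix could be allocated, with rollback on failure; `alloc_samples` is a hypothetical helper, not necessarily the project's actual allocation code:

```c
#include <stdlib.h>

/* Hypothetical sketch: allocate one row per feature, each holding
 * num_samples floats, matching the samples[feature][sample] layout. */
static float **alloc_samples(size_t num_features, size_t num_samples) {
    float **samples = malloc(num_features * sizeof *samples);
    if (samples == NULL)
        return NULL;
    for (size_t f = 0; f < num_features; f++) {
        samples[f] = malloc(num_samples * sizeof *samples[f]);
        if (samples[f] == NULL) {
            /* roll back the partial allocation on failure */
            for (size_t j = 0; j < f; j++)
                free(samples[j]);
            free(samples);
            return NULL;
        }
    }
    return samples;
}
```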
## Example Output

```
Initializing 100 epochs
Using 2 features
Using 100 samples
multidim_data.csv
epoch: 1 - mse: 1.604226e+04
weights: w0=1.472700e+00 w1=1.495039e+00 - bias: 2.233916e-02
epoch: 2 - mse: 2.254612e+03
weights: w0=9.218054e-01 w1=9.362061e-01 - bias: 1.440068e-02
epoch: 3 - mse: 3.268029e+02
weights: w0=1.127588e+00 w1=1.145372e+00 - bias: 1.778380e-02
...
epoch: 20 - mse: 1.341520e+01
weights: w0=1.069073e+00 w1=1.090845e+00 - bias: 2.177159e-02
```
## Performance

- Memory Usage: O(features × samples) for data storage
- Time Complexity: O(epochs × features × samples) per training run
- Convergence: Typically converges within 50-200 epochs for well-conditioned data
## Error Handling

The program includes comprehensive error checking (see the sketch after this list) for:
- Invalid command line arguments
- File I/O errors
- Memory allocation failures
- CSV format validation
- Parameter bounds checking
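For illustration, a sketch of how the command-line argument and bounds checks might look; `validate_args` is a hypothetical helper, and main.c's actual checks may be organized differently:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch: validate argc/argv against the bounds
 * documented in the configuration table above. */
static int validate_args(int argc, char **argv,
                         long *epochs, long *features, long *samples) {
    if (argc != 5) {
        fprintf(stderr,
                "usage: %s <num_epochs> <num_features> <num_samples> <data_file_path>\n",
                argv[0]);
        return -1;
    }
    *epochs   = strtol(argv[1], NULL, 10);
    *features = strtol(argv[2], NULL, 10);
    *samples  = strtol(argv[3], NULL, 10);

    if (*epochs < 1 || *epochs > 2500)     /* MIN_EPOCHS .. MAX_EPOCHS */
        return -1;
    if (*features < 1 || *features > 10)   /* 1 .. MAX_FEATURES */
        return -1;
    if (*samples < 1 || *samples > 1000)   /* 1 .. MAX_SAMPLES */
        return -1;
    return 0;
}
```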
## Debugging

Build with debug symbols:

```bash
gcc -g -o linear_regression main.c regression_core.c regression_io.c -Wall -Wextra
```

Check for memory leaks with Valgrind:

```bash
valgrind --leak-check=full ./linear_regression 100 2 100 multidim_data.csv
```

## License

This project is open source. See the LICENSE file for details.

## Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
## Future Enhancements

- Support for different loss functions (MAE, Huber loss)
- Regularization (L1, L2)
- Feature scaling/normalization
- Cross-validation support
- Model persistence (save/load trained models)
- GPU acceleration with CUDA
- Python bindings
## Troubleshooting

Program exits without output:
- Check that the number of samples matches your CSV file
- Verify CSV format is correct
- Ensure file path is accessible
Weights explode to infinity:
- Reduce the learning rate in `regression_utils.h`
- Check for data normalization issues
- Verify input data quality
Poor convergence:
- Increase number of epochs
- Adjust learning rate
- Check data preprocessing
## Support

If you encounter issues:
- Check the error messages
- Verify your data format
- Test with the provided sample datasets
- Review the configuration parameters