
Predicting Calorie Expenditure
This project focuses on developing a data-driven model to estimate calorie expenditure during physical activity.

Objective: To develop a robust machine learning model that accurately predicts the number of calories burned during physical activity based on user biometrics and exercise data. The model is intended for integration into a fitness tracking application to provide users with personalized insights.
1 Project Scoping & Data Understanding
The first order of business is to define the problem and deeply understand the data we're working with.
1.1 Problem Definition
- Problem Type: This is a supervised regression task; we are predicting a continuous numerical value (`calories`).
- Business Goal: Enhance the user experience in a fitness app by providing a more accurate, personalized calorie-burn estimate than generic formulas.
1.2 Initial Data Assessment
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.linear_model import LinearRegression, Ridge, Lasso
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score
```

You can find the dataset on Hugging Face — @mnemoraorg.
```python
# Load the two source files (CALORIES and EXERCISES hold the file paths).
df_calories = pd.read_csv(CALORIES)
df_exercises = pd.read_csv(EXERCISES)

# Rename columns, turning them into lowercase.
df_calories.columns = df_calories.columns.str.lower()
df_exercises.columns = df_exercises.columns.str.lower()

# Merge the two dataframes based on the user_id column.
df = pd.merge(df_exercises, df_calories, on='user_id')
```

- Identifier: `user_id`. Unlikely to be a predictive feature, but essential for any user-level analysis.
- Features (Predictors):
  - `gender` (Categorical): Male/Female.
  - `age` (Numerical): User's age in years.
  - `height` (Numerical): Likely in centimeters. Needs confirmation.
  - `weight` (Numerical): Likely in kilograms. Needs confirmation.
  - `duration` (Numerical): Exercise duration, likely in minutes.
  - `heart_rate` (Numerical): Average heart rate during the exercise (BPM).
  - `body_temp` (Numerical): Body temperature, likely in Celsius.
- Target Variable:
  - `calories` (Numerical): The value we want to predict.
1.3 Key Questions & Hypothesis
- Hypothesis: `duration`, `heart_rate`, and `weight` will be the most significant predictors of `calories` burned.
- Data Quality Concern: The `body_temp` values (e.g., 40.8°C) seem high. While possible during intense exercise, this requires investigation. Are these accurate measurements, or is there a data entry/sensor error? We must validate the unit and typical range; a quick check is sketched below.
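As a first validation step, the unit and range can be checked directly. A minimal sketch, assuming the merged `df` from section 1.2:

```python
# Sanity-check body_temp: exercise values should sit in a plausible
# range (roughly 37-41 °C); anything outside 36-42 °C is suspect.
print(df['body_temp'].describe())

n_suspect = ((df['body_temp'] < 36) | (df['body_temp'] > 42)).sum()
print(f"Readings outside 36-42 °C: {n_suspect}")
```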
2 Exploratory Data Analysis (EDA)
This is the most critical phase for uncovering insights and informing our modeling strategy. We'll use statistical analysis and visualizations to explore the data from every angle.
2.1 Univariate Analysis
2.1.1A Distribution of Numerical Features
```python
numerical_features = ['age', 'height', 'weight', 'duration',
                      'heart_rate', 'body_temp', 'calories']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Distribution of Numerical Features', y=0.995)
axes = axes.flatten()

for idx, feature in enumerate(numerical_features):
    sns.histplot(df[feature].dropna(), bins=30, color='steelblue',
                 edgecolor='black', alpha=0.7, kde=True, ax=axes[idx])
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('Frequency (count)')
    axes[idx].grid(alpha=0.3, linestyle='--')
    axes[idx].spines['top'].set_visible(False)
    axes[idx].spines['right'].set_visible(False)

# Remove the unused subplot axes.
for idx in range(len(numerical_features), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()
```
Distribution of Numerical Features
2.1.1B Report
**Summary**

This report details the findings from the univariate analysis of the seven numerical features in our dataset. The goal of this analysis was to understand the underlying distribution of each variable to identify its characteristics, spot potential data quality issues, and inform our subsequent feature engineering and modeling strategies.

Overall, the predictor variables (e.g., `height`, `weight`, `heart_rate`) exhibit well-behaved, near-normal distributions, making them excellent candidates for modeling. However, a critical finding is the significant right-skew of our target variable, `calories`. This will require a transformation to ensure robust model performance. The distribution of `body_temp` also confirms our initial hypothesis that these values are consistently high, suggesting that the data primarily represents periods of strenuous activity.

**Detailed Feature Analysis**

- `age` and `duration`: Both display a near-uniform distribution. This indicates that our dataset contains a balanced representation of individuals across different age groups (from 20 to 80) and a wide variety of workout durations (from 0 to 30 minutes).
- `height`, `weight`, and `heart_rate`: These three features all follow a clear normal (Gaussian) distribution. They are symmetric with a distinct central tendency, which is characteristic of biometric data in a large population.
- `body_temp`: The distribution is left-skewed, with the vast majority of data points concentrated between 39°C and 41°C.
- `calories` (the target variable): This is the most critical finding. The `calories` variable is highly right-skewed. The majority of workouts result in a low-to-moderate calorie burn (0-100), with a long tail of less frequent, high-calorie-burn activities.
This analysis has provided crucial insights that directly inform our data preprocessing and modeling strategy. The data appears to be of high quality and is well-suited for the project's objective.
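To quantify that skew, a short sketch (not part of the original pipeline) compares the raw target with a `log1p`-transformed version; the transform is previewed rather than applied, since the tree-based models used later handle skewed targets well:

```python
# Skewness near 0 indicates symmetry; a large positive value confirms
# the right tail seen in the histogram.
print("Skew of calories:       ", round(df['calories'].skew(), 3))
print("Skew of log1p(calories):", round(np.log1p(df['calories']).skew(), 3))
```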
2.1.2A Box Plots for Outlier Detection
```python
numerical_features = ['age', 'height', 'weight', 'duration',
                      'heart_rate', 'body_temp', 'calories']

fig, axes = plt.subplots(3, 3, figsize=(15, 12))
fig.suptitle('Box Plots for Outlier Detection', y=0.995)
axes = axes.flatten()

for idx, feature in enumerate(numerical_features):
    sns.boxplot(y=df[feature].dropna(), color='steelblue', ax=axes[idx])
    axes[idx].set_ylabel(feature)
    axes[idx].grid(alpha=0.3, linestyle='--', axis='y')
    axes[idx].spines['top'].set_visible(False)
    axes[idx].spines['right'].set_visible(False)

# Remove the unused subplot axes.
for idx in range(len(numerical_features), len(axes)):
    fig.delaxes(axes[idx])

plt.tight_layout()
plt.show()
```
Box Plots for Outlier Detection
2.1.2B Report
**Summary**

This report outlines the findings from our outlier analysis, which used box plots to identify data points that deviate significantly from the rest of the distribution for each numerical feature.

The key finding is that the dataset is remarkably clean. While some statistical outliers are present, the majority appear to represent plausible, real-world variations rather than data entry or measurement errors. The features `age` and `duration` show no outliers at all. Minor, explainable outliers were found in `height`, `weight`, `heart_rate`, and our target, `calories`.

The only area requiring further investigation is `body_temp`, which contains several low-end outliers that are inconsistent with typical exercise physiology. Our strategy will be to use models that are naturally robust to outliers, avoiding aggressive data removal that could discard valuable information. A quick way to quantify these observations is sketched after the list below.

**Detailed Feature Analysis**

- `age` and `duration`: Both features show no statistical outliers. The whiskers of the box plots extend to the minimum and maximum values in the dataset, indicating a clean and well-contained distribution.
- `height`, `weight`, and `heart_rate`: These biometric features show a small number of outliers at their extremes. For example, `height` has a few values that are exceptionally tall or short, and `weight` has a few on the higher end.
- `body_temp`: This plot is the most noteworthy. It displays a cluster of significant outliers on the low end (around 37°C to 38°C).
- `calories` (the target variable): A few outliers are present on the high end, with calorie burns exceeding 250.
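To put numbers on those observations, here is a short sketch using the same 1.5×IQR rule that defines the box-plot whiskers:

```python
# Count IQR-rule outliers per feature.
for feature in numerical_features:
    q1, q3 = df[feature].quantile([0.25, 0.75])
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    n_out = ((df[feature] < lo) | (df[feature] > hi)).sum()
    print(f"{feature}: {n_out} outliers outside [{lo:.1f}, {hi:.1f}]")
```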
2.1.3A Gender Distribution
```python
fig, ax = plt.subplots(figsize=(8, 6))

gender_counts = df['gender'].value_counts()
sns.barplot(x=gender_counts.index, y=gender_counts.values,
            hue=gender_counts.index, palette=['steelblue', 'lightcoral'],
            legend=False, ax=ax)

ax.set_title('Gender Distribution')
ax.set_xlabel('Gender')
ax.set_ylabel('Frequency (count)')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.grid(alpha=0.3, linestyle='--', axis='y')

# Annotate each bar with its count.
for i, v in enumerate(gender_counts.values):
    ax.text(i, v + 50, str(v), ha='center', va='bottom')

plt.tight_layout()
plt.show()
```
Gender Distribution
2.1.3B Report
**Summary**
This report details the findings from our analysis of the gender categorical feature. The primary objective was to determine the composition of the dataset with respect to gender to identify any potential biases that could affect model training.
The key finding is that the dataset is exceptionally well-balanced, with a near-equal representation of male and female participants. This balance is highly advantageous, as it significantly reduces the risk of the model developing a gender-based bias and ensures that its predictive performance will be reliable for all users.
**Detailed Distribution Analysis**
The dataset contains records for 7,553 females and 7,447 males. This corresponds to a split of approximately 50.3% female and 49.7% male.
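The exact figures behind the chart can be reproduced directly (a one-line check on the same `df`):

```python
# Counts and proportions of each gender.
print(df['gender'].value_counts())
print(df['gender'].value_counts(normalize=True).round(3))
```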
2.2 Bivariate Analysis
2.2.1A Calorie Expenditure by Gender
```python
fig, axes = plt.subplots(1, 2, figsize=(16, 6))
fig.suptitle('Calorie Expenditure by Gender', y=1.02)

sns.boxplot(data=df, x='gender', y='calories', hue='gender',
            palette='coolwarm', legend=False, ax=axes[0])
axes[0].set_xlabel('Gender', fontsize=12)
axes[0].set_ylabel('Calories Burned', fontsize=12)
axes[0].grid(axis='y', alpha=0.3)
axes[0].spines['top'].set_visible(False)
axes[0].spines['right'].set_visible(False)

sns.stripplot(data=df, x='gender', y='calories', hue='gender',
              palette='coolwarm', alpha=0.4, size=2, dodge=False,
              legend=False, ax=axes[1])
axes[1].set_xlabel('Gender', fontsize=12)
axes[1].set_ylabel('Calories Burned', fontsize=12)
axes[1].grid(axis='y', alpha=0.3)
axes[1].spines['top'].set_visible(False)
axes[1].spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
```
Calorie Expenditure by Gender
2.2.1B Report
**Summary**

This report details our investigation into the relationship between `gender` and our target variable, `calories`. By comparing the distributions of calories burned for male and female participants, we aimed to determine whether gender is a significant factor in calorie expenditure within this dataset.

The analysis reveals a discernible difference between the two groups. On average, males in this dataset exhibit a slightly higher median calorie burn than females. While the interquartile ranges (the middle 50% of users) are very similar, the overall distribution for males is shifted slightly higher. This suggests that `gender` is a valuable predictive feature that will help improve the accuracy and personalization of our model.

**Detailed Distribution Analysis**

- Box Plot Analysis: The median calorie expenditure for males (the central line in the blue box) is visibly higher than the median for females (the central line in the red box). Both genders display a similar spread in their interquartile range (IQR), indicating that the variability in calorie burn for the central 50% of users is comparable. Both groups also have high-end outliers, representing intense workouts.
- Strip Plot Analysis: This plot, which shows every individual data point, confirms the findings from the box plot. The density cloud for males is slightly shifted upwards compared to the cloud for females.
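As a numerical companion to the plots, per-gender summary statistics make the median shift explicit (a small sketch over the same `df`):

```python
# Calories burned, summarized by gender.
print(df.groupby('gender')['calories']
        .agg(['median', 'mean', 'std', 'max'])
        .round(2))
```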
2.2.2A Feature Relationships
```python
numeric_cols = ['age', 'height', 'weight', 'duration',
                'heart_rate', 'body_temp', 'calories']
correlation_matrix = df[numeric_cols].corr()

fig, ax = plt.subplots(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, fmt='.2f', cmap='coolwarm',
            center=0, square=True, linewidths=1,
            cbar_kws={"shrink": 0.8}, ax=ax)
ax.set_title('Feature Relationships')

plt.tight_layout()
plt.show()
```
Feature Relationships
2.2.2B Report
**Summary**

This report concludes our Exploratory Data Analysis (EDA) by examining the correlation matrix heatmap, which quantifies the linear relationships between all numerical features. The heatmap confirms our primary hypothesis: `duration` (0.96), `heart_rate` (0.90), and `body_temp` (0.82) are the most powerful linear predictors of `calories`. This gives us high confidence in their predictive value going forward.
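The headline numbers can be pulled straight from the matrix computed above, for example:

```python
# Rank predictors by their linear correlation with the target.
corr_with_target = (correlation_matrix['calories']
                    .drop('calories')
                    .sort_values(ascending=False))
print(corr_with_target.round(2))
```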
2.2.3A Feature vs Calories Relationships
```python
numerical_features = ['age', 'height', 'weight',
                      'duration', 'heart_rate', 'body_temp']

fig, axes = plt.subplots(2, 3, figsize=(15, 10))
fig.suptitle('Feature vs Calories Relationships', y=0.995)
axes = axes.flatten()

for idx, feature in enumerate(numerical_features):
    sns.regplot(x=df[feature], y=df['calories'],
                scatter_kws={'alpha': 0.5, 'color': 'steelblue'},
                line_kws={'color': 'black', 'linewidth': 1,
                          'linestyle': '--', 'alpha': 0.5},
                ax=axes[idx])
    axes[idx].set_title(f'{feature} vs calories')
    axes[idx].set_xlabel(feature)
    axes[idx].set_ylabel('calories')
    axes[idx].grid(alpha=0.3, linestyle='--')
    axes[idx].spines['top'].set_visible(False)
    axes[idx].spines['right'].set_visible(False)

plt.tight_layout()
plt.show()
```
Feature vs Calories Relationships
2.2.3B Report
**Summary**

This report examines the relationship between each numerical predictor and the `calories` target. These scatter plots provide a deeper, qualitative understanding that complements the quantitative findings from our correlation matrix.

The analysis visually confirms that `duration` and `heart_rate` have strong, positive, and distinctly linear relationships with calorie expenditure. We also uncovered a key non-linear pattern: the effect of `body_temp` on `calories` accelerates as temperature increases. Finally, the plots validate that `age`, `height`, and `weight` have no meaningful linear relationship with the target on their own.

**Detailed Relationship Analysis**

- `duration` and `heart_rate` vs. `calories`: Both plots show a clear and tight positive linear trend. The data points cluster closely around the regression line, especially for `duration`.
- `body_temp` vs. `calories`: This plot reveals a fascinating non-linear relationship. The trend is initially flat and then curves upwards, indicating that calorie burn accelerates sharply as body temperature rises, particularly past the 40°C mark.
- `age`, `height`, and `weight` vs. `calories`: These plots appear as amorphous clouds of data with a nearly horizontal regression line. There is no discernible linear pattern or trend.
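One simple way to probe that non-linearity, sketched here as a supplement to the plots, is to bin `body_temp` and compare mean calorie burn per bin:

```python
# Mean calories per body_temp bin; increasingly large jumps between
# adjacent bins are consistent with the accelerating trend in the plot.
temp_bins = pd.cut(df['body_temp'], bins=5)
print(df.groupby(temp_bins, observed=True)['calories'].mean().round(1))
```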
3 Data Preprocessing & Feature Engineering
3.1 Feature Engineering
Look for missing values, then derive BMI, age groups, and interaction features.

```python
# Check for missing values.
missing_values = df.isnull().sum()

# BMI: weight (kg) divided by height (m) squared.
df['height_m'] = df['height'] / 100
df['bmi'] = df['weight'] / (df['height_m'] ** 2)
df[['height', 'height_m', 'weight', 'bmi', 'calories']].head(10)

# Age groups.
bins = [0, 29, 39, 49, 59, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']
df['age_group'] = pd.cut(df['age'], bins=bins, labels=labels, right=True)

# Interaction features.
df['duration_heart_rate'] = df['duration'] * df['heart_rate']
df['bmi_heart_rate'] = df['bmi'] * df['heart_rate']
df['duration_body_temp'] = df['duration'] * df['body_temp']
df['weight_duration'] = df['weight'] * df['duration']
df['bmi_duration'] = df['bmi'] * df['duration']
df['heart_rate_body_temp'] = df['heart_rate'] * df['body_temp']
```

3.2 Preprocessing
- One-Hot Encoding for Categorical Variables
- Feature Scaling for Numerical Variables
- Train-Test Split
```python
# One-hot encode categorical variables.
df = pd.get_dummies(df, columns=['gender', 'age_group'], drop_first=True)

X = df.drop(['user_id', 'calories'], axis=1)
y = df['calories']

# Split before scaling so the scaler never sees the test data.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)

# Scale numerical columns: fit on train only, transform both splits.
numerical_cols = X.select_dtypes(include=np.number).columns.tolist()
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
```

4 Model Development and Training
We will experiment with several regression algorithms, from linear baselines (Linear, Ridge, Lasso) to tree-based ensembles (Random Forest, Gradient Boosting), to identify the best-performing model for predicting calorie expenditure. Each model is evaluated with the following metrics (worked out explicitly in the sketch after this list):
- Mean Squared Error (MSE): Measures the average of the squared differences between the predicted and actual values. Lower values are better.
- Mean Absolute Error (MAE): Measures the average of the absolute differences between the predicted and actual values. Lower values are better.
- R-squared (R2): Represents the proportion of the variance in the target variable that is predictable from the features. A value closer to 1 indicates a better fit.
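For concreteness, here is a minimal sketch of the three metrics written out by hand; they mirror the sklearn functions imported in section 1.2:

```python
def mse(y_true, y_pred):
    # Average squared deviation; heavily penalizes large errors.
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    # Average absolute deviation; in the same units as calories.
    return np.mean(np.abs(y_true - y_pred))

def r2(y_true, y_pred):
    # 1 minus the ratio of residual variance to total variance.
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1 - ss_res / ss_tot
```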
4.1 Linear Regression
```python
lr_model = LinearRegression()
lr_model.fit(X_train, y_train)

y_pred_lr = lr_model.predict(X_test)
mse_lr = mean_squared_error(y_test, y_pred_lr)
mae_lr = mean_absolute_error(y_test, y_pred_lr)
r2_lr = r2_score(y_test, y_pred_lr)
```

- Mean Squared Error (MSE): 49.6290
- Mean Absolute Error (MAE): 5.1934
- R-squared (R2): 0.9877 or 98.77%
4.2 Ridge Regression
```python
rd_model = Ridge()
rd_model.fit(X_train, y_train)

y_pred_rd = rd_model.predict(X_test)
mse_rd = mean_squared_error(y_test, y_pred_rd)
mae_rd = mean_absolute_error(y_test, y_pred_rd)
r2_rd = r2_score(y_test, y_pred_rd)
```

- Mean Squared Error (MSE): 49.8228
- Mean Absolute Error (MAE): 5.1966
- R-squared (R2): 0.9877 or 98.77%
4.3 Lasso Regression
```python
ls_model = Lasso()
ls_model.fit(X_train, y_train)

y_pred_ls = ls_model.predict(X_test)
mse_ls = mean_squared_error(y_test, y_pred_ls)
mae_ls = mean_absolute_error(y_test, y_pred_ls)
r2_ls = r2_score(y_test, y_pred_ls)
```

- Mean Squared Error (MSE): 103.9032
- Mean Absolute Error (MAE): 7.1880
- R-squared (R2): 0.9743 or 97.43%
4.4 Random Forest Regressor
```python
rfr_model = RandomForestRegressor(random_state=42)
rfr_model.fit(X_train, y_train)

y_pred_rfr = rfr_model.predict(X_test)
mse_rfr = mean_squared_error(y_test, y_pred_rfr)
mae_rfr = mean_absolute_error(y_test, y_pred_rfr)
r2_rfr = r2_score(y_test, y_pred_rfr)
```

- Mean Squared Error (MSE): 7.8513
- Mean Absolute Error (MAE): 1.8234
- R-squared (R2): 0.9981 or 99.81%
4.5 Gradient Boosting Regressor
```python
gbr_model = GradientBoostingRegressor(random_state=42)
gbr_model.fit(X_train, y_train)

y_pred_gbr = gbr_model.predict(X_test)
mse_gbr = mean_squared_error(y_test, y_pred_gbr)
mae_gbr = mean_absolute_error(y_test, y_pred_gbr)
r2_gbr = r2_score(y_test, y_pred_gbr)
```

- Mean Squared Error (MSE): 11.2890
- Mean Absolute Error (MAE): 2.4149
- R-squared (R2): 0.9972 or 99.72%
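Before comparing the models, the metrics computed above can be gathered into one table (a convenience sketch reusing the variables from sections 4.1 through 4.5):

```python
# Assemble the per-model test-set metrics, best MSE first.
results = pd.DataFrame({
    'Model': ['Linear Regression', 'Ridge Regression', 'Lasso Regression',
              'Random Forest Regressor', 'Gradient Boosting Regressor'],
    'MSE': [mse_lr, mse_rd, mse_ls, mse_rfr, mse_gbr],
    'MAE': [mae_lr, mae_rd, mae_ls, mae_rfr, mae_gbr],
    'R-squared': [r2_lr, r2_rd, r2_ls, r2_rfr, r2_gbr],
}).sort_values('MSE')
print(results.to_string(index=False))
```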
4.6 Regression Model Evaluation Report
Objective: To assess the performance of various regression models in predicting calorie expenditure based on the provided dataset and recommend the most suitable model for inference in a fitness tracking application.
Evaluation Metrics (on the Test Set):
| Model | MSE | MAE | R-squared |
|---|---|---|---|
| Linear Regression | 49.6290 | 5.1934 | 0.9877 |
| Ridge Regression | 49.8228 | 5.1966 | 0.9877 |
| Lasso Regression | 103.9032 | 7.1880 | 0.9743 |
| Random Forest Regressor | 7.8513 | 1.8234 | 0.9981 |
| Gradient Boosting Regressor | 11.2890 | 2.4149 | 0.9972 |
Analysis and Interpretation:
- Overall Strong Performance: Across the board, most models demonstrate remarkably high R-squared values (all above 0.97), indicating that a significant portion of the variance in calorie expenditure is explained by our features. This confirms the strong predictive power of the chosen features and the viability of a regression approach for this problem.
- Linear Models (Linear, Ridge, Lasso): The standard Linear Regression and Ridge Regression models performed quite similarly, with very high R-squared values around 0.9877. Their MAE and MSE are also comparable. Lasso Regression, while still performing well, had a slightly higher MSE and MAE, and a marginally lower R-squared. This suggests that the L1 regularization in Lasso might be penalizing some features more aggressively than necessary for this dataset, although the performance difference is not drastic.
- Tree-Based Ensemble Models (Random Forest, Gradient Boosting): This is where we see a significant leap in performance. Both the Random Forest Regressor and Gradient Boosting Regressor substantially outperform the linear models across all metrics.
- Random Forest Regressor: Achieved the lowest MSE (7.8513) and MAE (1.8234), and the highest R-squared (0.9981). This indicates that, on average, its predictions are the closest to the actual calorie values, and it explains the highest proportion of the target variable's variance.
- Gradient Boosting Regressor: Also performed exceptionally well, with very low MSE and MAE, and a high R-squared (0.9972). Its performance is very close to that of the Random Forest, but slightly less accurate based on these metrics.
Recommendation for Inference:
Based on these evaluation results, the Random Forest Regressor is the clear front-runner and the model I would strongly recommend for use in inference.
- Its exceptionally low Mean Absolute Error (MAE of 1.8234) means that, on average, our predictions for calorie burn will be off by less than 2 calories. This level of accuracy is excellent for a fitness tracking application, providing highly reliable estimates to users.
- The R-squared of 0.9981 signifies that the model captures almost all of the underlying patterns in the data, leaving very little unexplained variance.
While Gradient Boosting is a close second, the Random Forest's slightly better performance on the test set makes it the preferred choice at this stage.
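As a follow-up robustness check (not run in the original analysis), k-fold cross-validation on the training set would confirm that this gap is not an artifact of one particular split; a sketch:

```python
from sklearn.model_selection import cross_val_score

# Note: X_train was already scaled on the full training set, which leaks
# slightly across folds; acceptable for a rough comparison like this.
for name, model in [('Random Forest', RandomForestRegressor(random_state=42)),
                    ('Gradient Boosting', GradientBoostingRegressor(random_state=42))]:
    scores = cross_val_score(model, X_train, y_train, cv=5,
                             scoring='neg_mean_absolute_error')
    print(f"{name}: MAE = {-scores.mean():.3f} ± {scores.std():.3f}")
```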
5 Feature Importance
```python
feature_importances = rfr_model.feature_importances_
feature_names = X_train.columns
importance_series = pd.Series(feature_importances, index=feature_names)
sorted_importance_series = importance_series.sort_values(ascending=False)

fig, ax = plt.subplots(figsize=(10, 6))
sns.barplot(x=sorted_importance_series.values, y=sorted_importance_series.index,
            palette="viridis", hue=sorted_importance_series.index,
            legend=False, ax=ax)
plt.title('Feature Importance from Random Forest Regressor')
plt.xlabel('Importance')
plt.ylabel('Feature')

ax.grid(alpha=0.3, linestyle='--')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.tight_layout()
plt.show()
```
Feature Importance from Random Forest Regressor
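One caveat: impurity-based importances can be inflated for features that are strongly correlated with each other, which applies to our interaction terms. Permutation importance on the held-out test set is a useful cross-check (an optional sketch, not part of the original analysis):

```python
from sklearn.inspection import permutation_importance

# Shuffle each feature and measure the drop in test-set score.
perm = permutation_importance(rfr_model, X_test, y_test,
                              n_repeats=5, random_state=42)
perm_series = pd.Series(perm.importances_mean, index=X_test.columns)
print(perm_series.sort_values(ascending=False).head(10).round(4))
```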
6 Inference
Create some sample data for inference.
```python
sample_data = pd.DataFrame({
    'age': [30, 45, 25],
    'height': [170.0, 185.0, 160.0],
    'weight': [70.0, 90.0, 55.0],
    'duration': [15.0, 30.0, 10.0],
    'heart_rate': [90.0, 120.0, 85.0],
    'body_temp': [39.5, 41.0, 38.8],
    'gender': ['female', 'male', 'female']
})
```

Apply the same feature engineering steps as the training data.
```python
sample_data['height_m'] = sample_data['height'] / 100
sample_data['bmi'] = sample_data['weight'] / (sample_data['height_m'] ** 2)

bins = [0, 29, 39, 49, 59, 100]
labels = ['18-29', '30-39', '40-49', '50-59', '60+']
sample_data['age_group'] = pd.cut(sample_data['age'], bins=bins,
                                  labels=labels, right=True)

sample_data['duration_heart_rate'] = sample_data['duration'] * sample_data['heart_rate']
sample_data['bmi_heart_rate'] = sample_data['bmi'] * sample_data['heart_rate']
sample_data['duration_body_temp'] = sample_data['duration'] * sample_data['body_temp']
sample_data['weight_duration'] = sample_data['weight'] * sample_data['duration']
sample_data['bmi_duration'] = sample_data['bmi'] * sample_data['duration']
sample_data['heart_rate_body_temp'] = sample_data['heart_rate'] * sample_data['body_temp']
```

- One-hot encode categorical features, making sure columns match the training data.
- Align columns with the training data; this is crucial for consistent feature order.
- Add missing columns (if any) with 0 and ensure the order is the same as `X_train`.
```python
sample_data_encoded = pd.get_dummies(sample_data, columns=['gender', 'age_group'],
                                     drop_first=True)

# Add any training columns missing from the sample, then match column order.
missing_cols = set(X_train.columns) - set(sample_data_encoded.columns)
for c in missing_cols:
    sample_data_encoded[c] = 0
sample_data_encoded = sample_data_encoded[X_train.columns]
```

Scale numerical features using the same scaler fitted on the training data.
```python
numerical_cols_sample = sample_data_encoded.select_dtypes(include=np.number).columns.tolist()
sample_data_scaled = scaler.transform(sample_data_encoded[numerical_cols_sample])
sample_data_scaled = pd.DataFrame(sample_data_scaled,
                                  columns=numerical_cols_sample,
                                  index=sample_data_encoded.index)
```

Recombine the scaled numerical features with the one-hot encoded categorical features (which are already in `X_train.columns` order).
```python
X_inference = pd.DataFrame(index=sample_data.index)

# Add scaled numerical columns back.
for col in numerical_cols_sample:
    X_inference[col] = sample_data_scaled[col]

# Add one-hot encoded columns back (already aligned with X_train).
for col in X_train.columns:
    if col not in numerical_cols_sample:
        X_inference[col] = sample_data_encoded[col]

# Ensure the column order still matches X_train.
X_inference = X_inference[X_train.columns]
```

Make predictions.
```python
predicted_calories = rfr_model.predict(X_inference)
```

Combine the sample data with the predicted calories and highlight the `predicted_calories` column.

```python
predictions_df = sample_data.copy()
predictions_df['predicted_calories'] = predicted_calories

def highlight_predicted_calories(s):
    # Style helper: paint the prediction column yellow.
    if s.name == 'predicted_calories':
        return ['background-color: yellow'] * len(s)
    return [''] * len(s)

# Display the DataFrame with the predicted calories.
numerical_cols = predictions_df.select_dtypes(include=np.number).columns
display(predictions_df.style
        .apply(highlight_predicted_calories, axis=0)
        .format({col: '{:.2f}' for col in numerical_cols}))
```

| age | height | weight | duration | heart_rate | body_temp | gender | height_m | bmi | age_group | duration_heart_rate | bmi_heart_rate | duration_body_temp | weight_duration | bmi_duration | heart_rate_body_temp | predicted_calories |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 30.00 | 170.00 | 70.00 | 15.00 | 90.00 | 39.50 | female | 1.70 | 24.22 | 30-39 | 1350.00 | 2179.93 | 592.50 | 1050.00 | 363.32 | 3555.00 | 63.92 |
| 45.00 | 185.00 | 90.00 | 30.00 | 120.00 | 41.00 | male | 1.85 | 26.30 | 40-49 | 3600.00 | 3155.59 | 1230.00 | 2700.00 | 788.90 | 4920.00 | 258.36 |
| 25.00 | 160.00 | 55.00 | 10.00 | 85.00 | 38.80 | female | 1.60 | 21.48 | 18-29 | 850.00 | 1826.17 | 388.00 | 550.00 | 214.84 | 3298.00 | 37.50 |
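For production use, the manual align-encode-scale steps above are fragile. A more maintainable alternative is to bundle preprocessing and model into a single sklearn `Pipeline`; the sketch below covers only the raw input columns (the engineered BMI, age-group, and interaction features would need a custom transformer, omitted here), and `raw_df` stands for the merged, un-preprocessed dataframe:

```python
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

numeric_feats = ['age', 'height', 'weight', 'duration', 'heart_rate', 'body_temp']
categorical_feats = ['gender']

# Scaling and encoding are fitted once and replayed identically at inference.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_feats),
    ('cat', OneHotEncoder(drop='first'), categorical_feats),
])

pipe = Pipeline([
    ('prep', preprocess),
    ('model', RandomForestRegressor(random_state=42)),
])

# Usage (raw_df is the merged dataframe before any manual preprocessing):
# pipe.fit(raw_df[numeric_feats + categorical_feats], raw_df['calories'])
# pipe.predict(sample_data[numeric_feats + categorical_feats])
```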
7 Project Summary
This project aimed to develop a data-driven model to accurately estimate calorie expenditure during physical activity, suitable for integration into a fitness tracking application.
Objective: To predict the number of calories burned based on user biometrics and exercise data using a machine learning model. This was framed as a supervised regression task.
Methodology:
- Data Understanding & EDA:
  - Loaded and merged two datasets containing exercise and calorie information.
  - Assessed data quality, identified no missing values, and examined the distributions of features.
  - Discovered that `duration`, `heart_rate`, and `body_temp` showed strong correlations with `calories`.
  - Noted a slight difference in calorie expenditure between genders and a highly right-skewed distribution for the target variable, `calories`.
- Data Preprocessing & Feature Engineering:
  - Created new features such as BMI (`bmi`), age groups (`age_group`), and interaction terms (e.g., `duration_heart_rate`, `bmi_duration`) to potentially improve model performance.
  - Applied one-hot encoding to categorical features (`gender`, `age_group`).
  - Split the data into training and testing sets.
  - Scaled numerical features using `StandardScaler` to prepare them for modeling.
- Model Development and Training:
  - Trained several regression models: Linear Regression, Ridge Regression, Lasso Regression, Random Forest Regressor, and Gradient Boosting Regressor.
- Model Evaluation:
  - Evaluated each model's performance on the test set using Mean Squared Error (MSE), Mean Absolute Error (MAE), and R-squared (R2).
  - The ensemble models (Random Forest and Gradient Boosting) significantly outperformed the linear models.
  - The Random Forest Regressor achieved the best results with the lowest MSE (7.8513), lowest MAE (1.8234), and highest R-squared (0.9981).
- Feature Importance:
  - Analyzed the feature importances from the Random Forest model; the interaction term `duration_heart_rate` was the most influential feature in predicting calorie burn, followed by `age` and `gender`.
- Inference:
  - Demonstrated how to use the trained Random Forest Regressor to make predictions on new, unseen sample data, ensuring consistent preprocessing steps.
Key Findings:
- The dataset is well-suited for a regression task.
- Features related to exercise intensity and duration (`duration`, `heart_rate`, `body_temp`) and their interactions are strong predictors of calorie expenditure.
- The Random Forest Regressor provides highly accurate predictions, with an average absolute error of less than 2 calories.
Conclusion:
Based on the comprehensive evaluation, the Random Forest Regressor is the recommended model for predicting calorie expenditure in the fitness tracking application. Its high R-squared value and low MAE indicate excellent predictive performance and reliability.