
CookingTime-vs-Calories-Analysis

Analysis of cooking time and calories per recipe based on the Food.com dataset for University of Michigan's Practical Data Science Class: EECS 398

Quick Bites or Slow Feasts? Analyzing Calories vs. Cooking Time

Author: Chinyere Amasiatu

Report for the Analyzing Calories vs. Cooking Time project

Introduction

Cooking has been a necessity for a long time. For home cooks and chefs alike, recipes are remembered more with our taste buds than with our brains. So let’s begin to explore the intricacies of recipes using the Recipe and Ratings dataset! It contains many recipes, which will help us uncover relationships between recipe characteristics that we may never have thought of before.

Central Question

Do recipes that take longer to make have higher calorie counts?

This question is about understanding the relationship between cooking time and calories. Cooking time is a vital attribute of a recipe because it hints at how complex the recipe is. By investigating the relationship between cooking time and calories, we can uncover information that helps users balance their time and calorie intake.
In addition to cooking time, this dataset provides nutritional information, along with other measures of a recipe's complexity such as the number of steps and the number of ingredients. This can also help us answer a bigger question: what key information about a recipe is associated with a higher calorie count? So whether users are home cooks or professional chefs, the findings of this analysis will offer data-driven guidance for choosing a recipe, rather than relying on taste buds alone.

Introduction of Columns

RAW_recipes.csv has 83782 rows. The columns relevant to our analysis are:

| Column | Description |
| --- | --- |
| `name` | Title of the recipe |
| `minutes` | Total time required to prepare and cook the recipe |
| `contributor_id` | Unique identifier for the user who contributed the recipe |
| `submitted` | Date when the recipe was submitted |
| `tags` | List of tags associated with the recipe, such as ['desserts', 'chocolate', 'easy'] |
| `nutrition` | A string containing a list of nutritional values: ['calories', 'total fat', 'sugar', 'sodium', 'protein', 'saturated fat', 'carbs'] |
| `n_steps` | Number of steps in the recipe instructions |
| `steps` | List of instructions for preparing the recipe |
| `n_ingredients` | Total number of distinct ingredients used |

Data Cleaning and Exploratory Data Analysis

Cleaning the data was a necessary step to make sure it was consistent and reliable enough to analyze. Our process included:
1. Parsing the Nutrition Column
2. Identifying and Removing Outliers
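
The two steps above can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the exact code behind this report: the helper names (`parse_nutrition`, `iqr_bounds`, `remove_outliers`) are hypothetical, and the 1.5 × IQR rule is an assumed outlier criterion, since the report does not state which rule was used.

```python
import ast
import statistics

# Order of the values inside each `nutrition` string, as listed in the column table.
NUTRITION_FIELDS = ["calories", "total fat", "sugar", "sodium",
                    "protein", "saturated fat", "carbs"]

def parse_nutrition(raw):
    """Turn a string such as "[138.4, 10.0, ...]" into a dict of floats."""
    values = ast.literal_eval(raw)  # safely evaluates the list literal
    return dict(zip(NUTRITION_FIELDS, (float(v) for v in values)))

def iqr_bounds(values, k=1.5):
    """Return (low, high) fences; values outside them are treated as outliers.

    ASSUMPTION: the 1.5 * IQR rule is a common default, not necessarily
    the rule used for this analysis.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    return q1 - k * spread, q3 + k * spread

def remove_outliers(rows, column, k=1.5):
    """Keep only the rows (dicts) whose `column` value lies inside the fences."""
    low, high = iqr_bounds([row[column] for row in rows], k)
    return [row for row in rows if low <= row[column] <= high]
```

For example, `parse_nutrition("[138.4, 10.0, 50.0, 3.0, 3.0, 19.0, 6.0]")["calories"]` gives the calorie value `138.4` (the values after the first are made up for illustration), and `remove_outliers(rows, "minutes")` would drop a recipe whose listed time is far beyond the rest.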

Cleaned Data Sample

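| Index | Minutes | Calories | N_Steps | N_Ingredients |
| --- | --- | --- | --- | --- |
| 0 | 40 | 138.4 | 10 | 9 |
| 1 | 45 | 595.1 | 12 | 11 |
| 2 | 40 | 194.8 | 6 | 9 |
| ... | ... | ... | ... | ... |
| 83779 | 40 | 59.2 | 7 | 8 |
| 83780 | 29 | 188.0 | 9 | 10 |
| 83781 | 20 | 174.9 | 5 | 7 |

1.1 Univariate Analysis (Calories)
1.2 Univariate Analysis (Cooking Time)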

2.1 Bivariate Analysis (Calories vs Cooking Time)

Interesting Aggregates

| Cooking Time (Minutes) | Average Calories |
| --- | --- |
| 0–10 | 269.74 |
| 11–20 | 322.87 |
| 21–30 | 358.55 |
| 31–45 | 381.42 |
| 46–60 | 423.89 |
| 61–120 | 456.93 |
| 120+ | 473.11 |

For this table, we grouped cooking time (in minutes) into bins that widen as cooking time increases and calculated the average calories per bin. The averages rise steadily from 269.74 in the 0–10 bin to 473.11 in the 120+ bin, which supports the idea that longer recipes tend to be higher in calories.
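
The grouping behind this table can be sketched as below. The bin edges are taken from the table, but treating each upper edge as inclusive is an assumption, and the function names are hypothetical. The six input rows are the ones shown in the cleaned data sample, so the result will not match the full-dataset averages.

```python
from collections import defaultdict

# Inclusive upper edge and label for each bin, copied from the table above.
# ASSUMPTION: edges are inclusive; anything over 120 minutes goes to "120+".
BINS = [(10, "0–10"), (20, "11–20"), (30, "21–30"),
        (45, "31–45"), (60, "46–60"), (120, "61–120")]

def time_bin(minutes):
    """Return the label of the cooking-time bin that `minutes` falls into."""
    for upper, label in BINS:
        if minutes <= upper:
            return label
    return "120+"

def average_calories_by_bin(rows):
    """rows: iterable of (minutes, calories) pairs -> {bin label: mean calories}."""
    grouped = defaultdict(list)
    for minutes, calories in rows:
        grouped[time_bin(minutes)].append(calories)
    return {label: sum(vals) / len(vals) for label, vals in grouped.items()}

# (minutes, calories) for the six rows in the cleaned data sample.
sample = [(40, 138.4), (45, 595.1), (40, 194.8),
          (40, 59.2), (29, 188.0), (20, 174.9)]
averages = average_calories_by_bin(sample)
```

Uneven bins like these keep the short-recipe range fine-grained, where most recipes are likely to fall, while still giving the long tail of slow recipes its own groups.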

Imputation

Most of the features we planned to analyze were numerical, so to avoid skewing our analysis we intended to impute any missing values with the mean of their column. However, none of the numerical columns we used had missing values, so we skipped this step.
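
For reference, the mean imputation we had planned could look like the sketch below. It is illustrative only: `is_missing` and `mean_impute` are hypothetical helpers, and missing values are assumed to appear as `None` or NaN.

```python
import math
import statistics

def is_missing(value):
    """ASSUMPTION: a missing entry is represented as None or a float NaN."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def mean_impute(values):
    """Replace missing entries with the mean of the observed entries."""
    observed = [v for v in values if not is_missing(v)]
    if len(observed) == len(values):
        return list(values)  # nothing to fill, which was the case for our columns
    fill = statistics.fmean(observed)  # raises StatisticsError if every entry is missing
    return [fill if is_missing(v) else v for v in values]
```

A column with no gaps comes back unchanged, so running this check on complete data is harmless.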

Framing a Prediction Problem

Can we predict the number of calories in a recipe using features like cooking time, number of ingredients, and steps involved?

Baseline Model

Final Model

Feature Engineering

A New Model

Hyperparameter Tuning

Performance Comparison

| Model | MSE | RMSE | R² |
| --- | --- | --- | --- |
| Baseline Model (Linear Regression) | 88,642.90 | ≈297.7 | 6.73% |
| Final Model (Ridge + Polynomial Features + Feature Engineering) | 26,573.70 | ≈163.0 | 72.09% |

Our final model significantly improves on the baseline: MSE falls from 88,642.90 to 26,573.70.

This is a >60% reduction in MSE (roughly 70%), which means our predictions are, on average, much closer to the actual calorie values. I would say this model is not that bad!
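
The arithmetic behind that comparison can be checked with a few lines. The two MSE values come straight from the table; the variable names are only for illustration.

```python
import math

mse_baseline = 88_642.90  # Baseline Model (Linear Regression)
mse_final = 26_573.70     # Final Model (Ridge + Polynomial Features + Feature Engineering)

# RMSE is the square root of MSE, so it is in the same units as calories.
rmse_baseline = math.sqrt(mse_baseline)
rmse_final = math.sqrt(mse_final)

# Relative reductions achieved by the final model.
mse_reduction = 1 - mse_final / mse_baseline
rmse_reduction = 1 - rmse_final / rmse_baseline

print(f"RMSE: {rmse_baseline:.1f} -> {rmse_final:.1f}")
print(f"MSE reduction: {mse_reduction:.1%}")
print(f"RMSE reduction: {rmse_reduction:.1%}")
```

Because RMSE is in calories, it is the easier number to interpret: the typical prediction error drops from about 298 to about 163 calories.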