Recipe Site Traffic prediction
Written report of validating data, exploratory analysis, model creation and exploring business need.
11/7/2023 · 3 min read
Part 1
Importing and cleaning the data.
The category column was adjusted to adhere to the 10 intended categories.
The servings column was also converted to a numeric type.
There are 52 recipes with no nutritional information; it was decided to remove them. Another possibility would be to use imputers, but that could potentially skew the results.
The high traffic column was transformed to a boolean.
Outliers in the calories, carbohydrate, sugar and protein columns were removed. Only the upper-tail outliers were removed, since the data is right-skewed.
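The cleaning steps above can be sketched in pandas as follows. The column names and the tiny inline frame are illustrative assumptions; the real dataset has more columns and rows.

```python
import pandas as pd
import numpy as np

# Illustrative stand-in for the real recipe dataset (assumed column names)
recipes = pd.DataFrame({
    "category": ["Chicken Breast", "Vegetable", "Beverages"],
    "servings": ["4", "6", "4 as a snack"],
    "calories": [250.0, np.nan, 120.0],
    "high_traffic": ["High", None, "High"],
})

# Collapse sub-categories so only the intended categories remain
recipes["category"] = recipes["category"].replace({"Chicken Breast": "Chicken"})

# Strip non-numeric text from servings and convert to a number
recipes["servings"] = recipes["servings"].str.extract(r"(\d+)", expand=False).astype(int)

# Drop rows with missing nutritional information rather than imputing
recipes = recipes.dropna(subset=["calories"])

# Convert high_traffic to a boolean: "High" -> True, missing -> False
recipes["high_traffic"] = recipes["high_traffic"] == "High"

# Trim only the upper tail (data is right-skewed), here via the IQR rule
q1, q3 = recipes["calories"].quantile([0.25, 0.75])
upper = q3 + 1.5 * (q3 - q1)
recipes = recipes[recipes["calories"] <= upper]
```

In the real pipeline the same IQR trim would be repeated for carbohydrate, sugar and protein.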
Part 2
Looking at the categories and what share of recipes have high traffic, we can see that the pork, vegetable and potato categories have the highest proportion of high-traffic recipes.
At the other end, the beverage category has close to zero high-traffic recipes.
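The per-category proportions can be computed with a simple groupby; the frame below is a toy stand-in, assuming the cleaned columns from Part 1.

```python
import pandas as pd

# Toy data standing in for the cleaned recipe frame (assumed column names)
recipes = pd.DataFrame({
    "category": ["Pork", "Pork", "Beverages", "Vegetable"],
    "high_traffic": [True, True, False, True],
})

# Share of high-traffic recipes per category, highest first
proportions = (
    recipes.groupby("category")["high_traffic"]
    .mean()
    .sort_values(ascending=False)
)
print(proportions)
```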
Looking at the histogram of calories per recipe, it is clear that most high-traffic recipes fall in the lower calorie ranges, the largest share being recipes under 400 calories.
Looking at high-traffic recipes by category and servings, we can see that serving sizes are distributed in a similar way across categories. In most cases 4 servings is the largest group, with the counts for the other three serving sizes being similar within each category.
Part 3
This is a binary classification problem, so models suited to classification will be used.
First, the dataset was split into features X and target y. Dummy variables were created to convert the categorical values, and the servings column was encoded using ordinal encoding.
Then train and test splits were created.
A standard scaler was applied and PCA was run to determine the number of components to use, which came to 12.
Part 4
First, a logistic regression model was used. Grid search cross-validation was used to determine the optimum parameters for the model. We get an accuracy of 0.745, an F1 score of 0.780 and a ROC AUC score of 0.797.
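The tuning step could look like the sketch below, run here on synthetic data; the parameter grid and the F1 scoring choice are assumptions, not the grid used in the report.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

# Synthetic data with 12 features, mirroring the 12 PCA components
X, y = make_classification(n_samples=500, n_features=12, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale -> PCA -> logistic regression, tuned with grid search CV
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=12)),
    ("clf", LogisticRegression(max_iter=1000)),
])
grid = GridSearchCV(pipe, {"clf__C": [0.01, 0.1, 1, 10]}, cv=5, scoring="f1")
grid.fit(X_train, y_train)

pred = grid.predict(X_test)
proba = grid.predict_proba(X_test)[:, 1]
print(accuracy_score(y_test, pred),
      f1_score(y_test, pred),
      roc_auc_score(y_test, proba))
```

The KNN model in the next step follows the same pattern, swapping the final estimator and its parameter grid.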
Second, a K-Nearest Neighbors classifier was used. Following the same logic, grid search CV was used to determine the optimum parameters. We get scores lower than Logistic Regression: accuracy 0.715, F1 score 0.760, ROC AUC score 0.761.
Third, we also fit an XGBoost model, which is an ensemble model and could potentially yield higher accuracy. However, we get a lower result yet again: accuracy 0.667, F1 score 0.714, ROC AUC 0.748.
Finally, cross-validation scores were computed for 4 additional models, just to make sure we are not missing a model that is significantly more accurate.
The 4 additional models are LinearSVC, Decision Tree Classifier, Random Forest Classifier and GaussianNB.
We couldn't get a higher result with any of the other models, meaning our Logistic Regression model performs the best.
Mean cross-validation scores per model:
LinearSVC: 0.7721, Decision Tree Classifier: 0.6732
Random Forest Classifier: 0.7141, GaussianNB: 0.7708
Logistic Regression: 0.7721, KNN: 0.7154
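The model sweep can be sketched with `cross_val_score`, run here on synthetic data, so the printed scores will differ from the report's numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the preprocessed recipe features
X, y = make_classification(n_samples=500, n_features=12, random_state=42)

# The six models compared in the report
models = {
    "Linear": LinearSVC(max_iter=5000),
    "DTC": DecisionTreeClassifier(random_state=42),
    "RFC": RandomForestClassifier(random_state=42),
    "GNB": GaussianNB(),
    "LogReg": LogisticRegression(max_iter=1000),
    "KNN": KNeighborsClassifier(),
}

# 5-fold cross-validation, scaling inside the pipeline to avoid leakage
for name, model in models.items():
    pipe = make_pipeline(StandardScaler(), model)
    scores = cross_val_score(pipe, X, y, cv=5)
    print(f"{name}: {scores.mean():.4f}")
```

Scaling inside the pipeline keeps the scaler fit on each training fold only, which matches how the tuned models were evaluated earlier.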