Build an ML Pipeline for Short-term Rental Prices in NYC

Project Description

You work for a property management company that rents rooms and properties for short periods on various platforms. You need to estimate the typical price for a given property based on the prices of similar properties. The company receives new data in bulk every week, so the model must be retrained at the same cadence, which calls for a reusable end-to-end pipeline.

Source code: vnk8071/reproducible_model_workflow

tree projects/reproducible_model_workflow -I 'wandb|__pycache__'

projects/reproducible_model_workflow
├── MLproject
├── README.md
├── components
├── conda.yml
├── config.yaml
├── cookiecutter-mlflow-template
├── environment.yml
├── images
├── main.py
└── src

25 directories, 56 files

#	Feature	Stack
0	Language	Python
1	Clean code principles	Autopep8, Pylint
2	Testing	Pytest
3	Logging	Logging
4	Configuration	Hydra
5	Visualize dataframe	Pandas Profiling
6	Pipeline & Monitoring	MLflow
7	Experiment tracking	Weights & Biases

Install

In order to run these components you need to have conda (Miniconda or Anaconda) and MLflow installed.

conda env create -f environment.yml
conda activate nyc_airbnb_dev

Login to Wandb

wandb login

Cookiecutter

Using this template, you can quickly generate new steps to be used with MLflow.

cookiecutter cookiecutter-mlflow-template -o src

step_name [step_name]: basic_cleaning
script_name [run.py]: run.py
job_type [my_step]: basic_cleaning
short_description [My step]: This step cleans the data
long_description [An example of a step using MLflow and Weights & Biases]: Performs basic cleaning on the data and saves the results in Weights & Biases
parameters [parameter1,parameter2]: parameter1,parameter2,parameter3

Hydra

The parameters controlling the pipeline are defined in the config.yaml file at the root of the project, and Hydra manages this configuration file. Open the file and get familiar with its content. Remember: this file is only read by the main.py script (i.e., the pipeline), and its content is available inside the go function in main.py as the config dictionary. For example, the name of the project is stored in the project_name key under the main section of the configuration file, and it can be accessed from the go function as config["main"]["project_name"].
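
A minimal sketch of how the config dictionary might be used inside go. A plain nested dict stands in for the object Hydra passes in, so the snippet runs without Hydra installed. Only main.project_name is taken from the text; the steps key, its "all" default, the step list, and the active_steps helper are assumptions modelled on the -P steps=... commands in this README, not the actual contents of config.yaml or main.py.

```python
# Stand-in for the config that Hydra would load from config.yaml.
# Only the main.project_name key comes from the text; all values are hypothetical.
config = {
    "main": {
        "project_name": "nyc_airbnb",
        "steps": "all",
    },
}

# Hypothetical list of steps, named after the -P steps=... values in this README.
ALL_STEPS = [
    "download",
    "basic_cleaning",
    "check_data",
    "data_split",
    "train_random_forest",
]


def active_steps(config):
    """Return every step for "all", otherwise the comma-separated subset."""
    requested = config["main"]["steps"]
    if requested == "all":
        return list(ALL_STEPS)
    return [step.strip() for step in requested.split(",")]


project_name = config["main"]["project_name"]
print(project_name, active_steps(config))
```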

Pandas Profiling

The primary goal of ydata-profiling (formerly Pandas Profiling) is to provide a one-line Exploratory Data Analysis (EDA) experience that is consistent and fast. Like the handy pandas df.describe() function, ydata-profiling delivers an extended analysis of a DataFrame and lets the analysis be exported in formats such as HTML and JSON.

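pip install ydata-profiling

# In Python (df is a pandas DataFrame loaded beforehand):
from ydata_profiling import ProfileReport

profile = ProfileReport(df, title="Profiling Report")
profile.to_widgets()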

Release new version

git tag -a 1.0.1 -m "Release 1.0.1"
git push origin 1.0.1

Step-by-step

0. Full pipeline

mlflow run .

1. Download data

mlflow run . -P steps=download

2. EDA

mlflow run src/eda

More details are in the EDA Jupyter notebook.

3. Basic cleaning

mlflow run . -P steps=basic_cleaning
...
2023-08-20 22:17:49,537 Dropping duplicates
2023-08-20 22:17:49,566 Dropping outliers
2023-08-20 22:17:49,566 Number of rows before dropping outliers: 20000
2023-08-20 22:17:49,570 Number of rows after dropping outliers: 19001
2023-08-20 22:17:49,570 Converting last_review to datetime
2023-08-20 22:17:49,577 Saving cleaned dataframe to csv
2023-08-20 22:17:49,743 Logging artifact
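
The log above suggests three transformations: removing duplicate rows, removing price outliers, and parsing last_review as a date. The sketch below illustrates them with the standard library only. The real step most likely uses pandas, and the function name, the sample rows, the date format, and the 10 to 350 price bounds (borrowed from the training log in step 6) are assumptions.

```python
from datetime import datetime


def basic_clean(rows, min_price, max_price):
    """Drop exact duplicates and price outliers, then parse last_review."""
    seen = set()
    cleaned = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:  # exact duplicate of an earlier row
            continue
        seen.add(key)
        if not (min_price <= row["price"] <= max_price):  # price outlier
            continue
        parsed = dict(row)
        # Empty last_review values become None instead of raising.
        parsed["last_review"] = (
            datetime.strptime(row["last_review"], "%Y-%m-%d")
            if row["last_review"]
            else None
        )
        cleaned.append(parsed)
    return cleaned


# Hypothetical sample rows, for illustration only.
rows = [
    {"id": 1, "price": 120, "last_review": "2019-05-21"},
    {"id": 1, "price": 120, "last_review": "2019-05-21"},  # duplicate
    {"id": 2, "price": 5000, "last_review": "2018-11-02"},  # outlier
    {"id": 3, "price": 80, "last_review": ""},
]
cleaned = basic_clean(rows, min_price=10, max_price=350)
print(len(rows), "->", len(cleaned))
```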

4. Check data

mlflow run . -P steps=check_data
...
test_data.py::test_column_names PASSED                                        [ 16%]
test_data.py::test_neighborhood_names PASSED                                  [ 33%]
test_data.py::test_proper_boundaries PASSED                                   [ 50%]
test_data.py::test_similar_neigh_distrib PASSED                               [ 66%]
test_data.py::test_price_range PASSED                                         [ 83%]
test_data.py::test_row_count PASSED                                           [100%]
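
Two of the checks listed above could look roughly like this. It is a sketch, not the contents of test_data.py: the helper names, the sample data, and the thresholds are assumptions, and the real tests presumably receive the data through pytest fixtures.

```python
def row_count_ok(rows, min_rows, max_rows):
    """The dataset should be neither suspiciously small nor unexpectedly large."""
    return min_rows <= len(rows) <= max_rows


def price_range_ok(rows, min_price, max_price):
    """Every price should lie within the range kept by the cleaning step."""
    return all(min_price <= row["price"] <= max_price for row in rows)


# Hypothetical cleaned sample; pytest would normally supply this via a fixture.
sample = [{"price": 45}, {"price": 120}, {"price": 349}]

assert row_count_ok(sample, min_rows=1, max_rows=1_000_000)
assert price_range_ok(sample, min_price=10, max_price=350)
print("all checks passed")
```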

5. Split data

mlflow run . -P steps=data_split
...
2023-08-21 19:13:21,935 Fetching artifact clean_sample.csv:latest
2023-08-21 19:13:25,410 Splitting trainval and test
2023-08-21 19:13:25,461 Uploading trainval_data.csv dataset
2023-08-21 19:13:31,353 Uploading test_data.csv dataset
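
The split can be sketched as a seeded shuffle followed by a cut. The real step probably relies on scikit-learn's train_test_split; the function name, the 20% test share, and the seed used here are assumptions.

```python
import random


def split_trainval_test(rows, test_size=0.2, seed=42):
    """Shuffle a copy of rows reproducibly and cut off the last share as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_size)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for the cleaned records
trainval, test = split_trainval_test(rows)
print(len(trainval), len(test))
```

Fixing the seed keeps the split identical across runs, which matters when the same pipeline is re-executed on a schedule and results need to be comparable.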

6. Train and evaluate model

mlflow run . -P steps=train_random_forest
...
2023-08-21 19:55:39,523 Minimum price: 10, Maximum price: 350
2023-08-21 19:55:39,549 Preparing sklearn pipeline
2023-08-21 19:55:39,550 Fitting
2023-08-21 19:55:41,063 Computing and scoring r2 and MAE
2023-08-21 19:55:41,240 Score: 0.5519470714568394
2023-08-21 19:55:41,241 MAE: 34.12780800870754
2023-08-21 19:55:41,241 Exporting model
2023-08-21 19:55:41,242 Uploading model
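
The two numbers in the log are the coefficient of determination (the "Score", R²) and the mean absolute error (MAE). A dependency-free sketch of both definitions follows; the pipeline itself presumably obtains them from scikit-learn, and the toy prices are made up.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot: the share of variance explained by the model."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values, for illustration only.
y_true = [100, 150, 200]
y_pred = [110, 150, 180]
print(mean_absolute_error(y_true, y_pred), r2_score(y_true, y_pred))
```

Read this way, an MAE of about 34 means the predictions are off by roughly 34 price units on average, and a score of about 0.55 means the model explains a little over half of the variance in price.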

Optimize hyper-parameters

mlflow run . \
    -P steps=train_random_forest \
    -P hydra_options="modeling.random_forest.max_depth=10,50,100 modeling.random_forest.n_estimators=100,200,500 -m"

[Image: hyper_parameters]
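
With -m (multirun), Hydra's default sweeper launches one run per combination of the comma-separated values, so the command above trains 3 × 3 = 9 models. The snippet below only enumerates that grid to make the count explicit; it does not call Hydra or MLflow, and the variable names are illustrative.

```python
from itertools import product

# Values copied from the hydra_options string in the command above.
max_depths = [10, 50, 100]
n_estimators_values = [100, 200, 500]

# One dictionary per run: the Cartesian product of both value lists.
grid = [
    {"max_depth": depth, "n_estimators": n}
    for depth, n in product(max_depths, n_estimators_values)
]

for params in grid:
    print(params)
print(len(grid), "runs")
```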

7. Test model

mlflow run . -P steps=test_regression_model
...
2023-08-21 20:52:23,425 Downloading artifacts
2023-08-21 20:52:29,760 Loading model and performing inference on test set
2023-08-21 20:52:32,254 Scoring
2023-08-21 20:52:32,341 Score: 0.5640242000942114
2023-08-21 20:52:32,341 MAE: 33.850181065122136

8. Test model with new dataset

mlflow run https://github.com/vnk8071/ml-production.git -v 1.0.1 -P hydra_options="etl.sample='sample2.csv'"
...
2023-08-21 22:06:44,481 Dropping duplicates
2023-08-21 22:06:44,607 Dropping outliers
2023-08-21 22:06:44,607 Number of rows before dropping outliers: 48895
2023-08-21 22:06:44,630 Number of rows after dropping outliers: 46427
2023-08-21 22:06:44,631 Converting last_review to datetime
2023-08-21 22:06:44,654 Saving cleaned dataframe to csv
2023-08-21 22:06:45,052 Logging artifact
...
test_data.py::test_column_names PASSED                            [ 16%]
test_data.py::test_neighborhood_names PASSED                      [ 33%]
test_data.py::test_proper_boundaries PASSED                       [ 50%]
test_data.py::test_similar_neigh_distrib PASSED                   [ 66%]
test_data.py::test_price_range PASSED                             [ 83%]
test_data.py::test_row_count PASSED                               [100%]
...
2023-08-21 22:09:01,322 Downloading artifacts
2023-08-21 22:09:04,782 Loading model and performing inference on test set
2023-08-21 22:09:05,168 Scoring
2023-08-21 22:09:05,298 Score: 0.6195968265496492
2023-08-21 22:09:05,298 MAE: 31.64257699859779

Public Wandb project

Link: https://wandb.ai/nguyenkhoi8071/nyc_airbnb/overview?workspace=user-nguyenkhoi8071

Select best model

[Image: wandb-select-best]

Code Quality

Style Guide - Format your refactored code according to the PEP 8 style guide. autopep8 can help with this from the command line:

autopep8 --in-place --aggressive --aggressive .

Style Checking and Error Spotting - Use Pylint to analyze the code for programming errors and for opportunities for further refactoring. The command below lists the Pylint messages; the -rn and -sn options suppress the full report and the score line, so drop -sn to see the score.

pylint -rn -sn .

Docstring - All functions and files should have docstrings. A function docstring should correctly identify the inputs, outputs, and purpose of the function. A file docstring should identify the purpose of the file, the author, and the date the file was created.
