Predicting the Duration of NY City Taxi Rides: A Project in Predictive Analytics with Feature Engineering


Background


Taxis provide an important means of alternative transportation for New York City commuters, and data on these rides can give us a better understanding of peak traffic times, transportation costs for taxi users, travel hotspots, and commute times, among other insights. Taxi companies (as well as ride-sharing companies like Uber and Lyft) face the challenge of efficiently assigning transport to passengers. Predicting the duration of a ride in progress can help a taxi company anticipate when a cab will be free for its next trip; it can also be used to determine how much to charge a customer at the outset of the trip. The large dataset examined below contains information on taxi trips in New York City. After cleaning and preprocessing the data and performing exploratory data analysis, I will apply different modeling techniques to predict trip duration. This regression project uses feature engineering through Deep Feature Synthesis along with dimensionality reduction via Principal Component Analysis (PCA) to see how the addition and reduction of features affect the models.


About the Data


The data used for this project is provided by the New York City Taxi & Limousine Commission. According to the source, the data was "collected and provided to the NYC Taxi and Limousine Commission (TLC) by technology providers authorized under the Taxicab & Livery Passenger Enhancement Programs (TPEP/LPEP). The trip data was not created by the TLC, and TLC makes no representations as to the accuracy of these data." The data used here is a subset of the TLC data found on Kaggle, covering over 7 million yellow-taxi trips recorded during February 2019. The dataset is accessible on Kaggle at the following link: https://www.kaggle.com/datasets/microize/newyork-yellow-taxi-trip-data-2020-2019

The full dataset is available from the TLC website here: https://www1.nyc.gov/site/tlc/about/tlc-trip-record-data.page


Project Objective


The objective of this project is to predict the duration of a New York City taxi trip from information available at pickup time, and to evaluate how feature engineering with Deep Feature Synthesis and dimensionality reduction with PCA affect the performance of several regression models.

Dataset


The original trips table has the following fields

Importing libraries
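
The libraries used throughout this notebook can be pulled in up front; the set below is a reasonable guess at what the project needs rather than a verbatim copy of the original cell:

```python
# Core data handling and visualization
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Automated feature engineering
import featuretools as ft

# Modeling and evaluation
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.decomposition import PCA
from sklearn.metrics import r2_score, mean_squared_error
```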

Exploratory Data Analysis

There are some null values in the PickUp and DropOff Zones that will need to be explored.

Relevant columns for descriptive statistics include passenger_count, duration, trip_distance, and amounts:
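
A quick sketch of how those statistics might be produced (total_amount is an assumed name for the amount column, following the TLC schema):

```python
# Descriptive statistics for the numeric columns of interest
cols = ["passenger_count", "duration", "trip_distance", "total_amount"]
print(trips[cols].describe())
```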

Cleaning Our Data:

There are 16,856 such entries. The durations appear to be correctly calculated from the pickup and dropoff datetime values; however, in many cases they do not make sense given the trip distances. For example, the trip with ID 229 took 1435.62 minutes (nearly 24 hours) to cover 6.6 miles (from the Flatiron District to Bushwick South). Even for New York City traffic, that time frame is hard to believe: it works out to 217.5 minutes/mile.

Based on the above, let's remove the cases where the duration exceeds 60 minutes AND the duration/distance ratio exceeds 20 minutes/mile. A threshold of 20 minutes/mile is well above the median of 6.08 and the mean of 9.19, but far below the extreme outliers, which appear to be incorrect data. Let's see what cases this removes:
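
A hedged sketch of that filter, assuming duration is stored in minutes and trip_distance in miles:

```python
# Ratio of minutes per mile; guard against zero-distance records
trips["min_per_mile"] = trips["duration"] / trips["trip_distance"].replace(0, np.nan)

# Flag implausible rides: longer than an hour AND slower than 20 minutes/mile
bad_rows = (trips["duration"] > 60) & (trips["min_per_mile"] > 20)
print(f"Removing {bad_rows.sum()} suspicious trips")

trips = trips.loc[~bad_rows].drop(columns="min_per_mile")
```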

Updates to our descriptive statistics

Univariate Analysis

The total amounts charged, trip distances, and trip durations all appear right-skewed.

Now let's look at the histogram of pickup dates, which shows fairly consistent taxi usage with what appears to be a minor weekly drop:

Countplot for passenger_count

The distribution of the passenger count shows us that the vast majority of passenger counts are 1.

Countplot for Pickup and Dropoff Zones

We see that pickups and dropoffs are concentrated in a few neighborhoods such as the Upper East Side. Public transportation authorities could consider whether expanding public transit between select high-traffic districts, particularly those close to one another, where such a solution would be more economical to introduce, might reduce the number of single-passenger taxi rides and ease traffic congestion.

Bivariate analysis

Plotting a scatter plot of trip distance to trip duration
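
For example (the sampling step is an addition to keep the plot responsive, not part of the original):

```python
# Scatter plot of distance vs. duration on a random sample of trips
sample = trips.sample(50_000, random_state=42)
plt.figure(figsize=(8, 6))
plt.scatter(sample["trip_distance"], sample["duration"], s=2, alpha=0.3)
plt.xlabel("Trip distance (miles)")
plt.ylabel("Trip duration (minutes)")
plt.title("Trip distance vs. trip duration")
plt.show()
```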

Preparing the Data for Modeling

Deep Feature Synthesis (DFS) is an automated method of performing feature engineering on relational data. The "deep" in Deep Feature Synthesis refers to its ability to stack features on top of one another across layers of related data. We will define the relational nature of our taxi dataset through its entities and relationships. The three entities in this data are:

This data has the following relationships

We need to transform any of our categorical data into numeric data for processing. In our original trips dataframe, rendering the categories in our PUZone and DOZone columns numerically would result in numbers ranging from 1 to ~260 (the number of zones). However, we don't want the ML models to incorrectly assume a larger number represents a greater magnitude or significance, e.g., by taking a zone rendered as the number 23 to be greater than a zone rendered as the number 4.

Therefore we will use get_dummies to separate out all of the zone categories as their own columns. A trip being linked to a specific pick up or drop off zone will be noted as a 0 or 1 instead. We will apply get_dummies on our pickup_zones and dropoff_zones dataframes rather than our trips dataframe (the dataframes will all be combined when we create our feature matrix).

Let's now use get_dummies on payment_type, also a categorical column, so that one payment type is not assumed to be greater than another.
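
A sketch of the encoding step; the Zone column name and the prefixes are assumptions about how the zone lookup tables are laid out:

```python
# One-hot encode the zone names so no ordinal relationship is implied
pickup_zones = pd.get_dummies(pickup_zones, columns=["Zone"], prefix="PU")
dropoff_zones = pd.get_dummies(dropoff_zones, columns=["Zone"], prefix="DO")

# Same treatment for the categorical payment type on the trips table
trips = pd.get_dummies(trips, columns=["payment_type"], prefix="payment")
```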

Defining entities and relationships for Deep Feature Synthesis
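
Using the pre-1.0 featuretools API that the documentation linked below describes, the entity set could be assembled roughly as follows; the index and key column names (id, PULocationID, DOLocationID) are assumptions about this dataset:

```python
es = ft.EntitySet(id="nyc_taxi")

# Child entity: one row per trip, with the pickup time as its time index
es = es.entity_from_dataframe(entity_id="trips",
                              dataframe=trips,
                              index="id",
                              time_index="tpep_pickup_datetime")

# Parent entities: one row per taxi zone
es = es.entity_from_dataframe(entity_id="pickup_zones",
                              dataframe=pickup_zones,
                              index="PULocationID")
es = es.entity_from_dataframe(entity_id="dropoff_zones",
                              dataframe=dropoff_zones,
                              index="DOLocationID")

# Each zone is the parent of many trips
es = es.add_relationship(ft.Relationship(es["pickup_zones"]["PULocationID"],
                                         es["trips"]["PULocationID"]))
es = es.add_relationship(ft.Relationship(es["dropoff_zones"]["DOLocationID"],
                                         es["trips"]["DOLocationID"]))
```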

We also indicate a cutoff time, which applies to every instance of our target_entity, in this case each trip recorded in trips. The cutoff time is the last point in time from which data can be used by DFS to calculate features. This constraint reflects a practical consideration: we want to be able to predict the duration before the trip begins, so the cutoff time should be set to the pickup time. This means the model will only consider data from before a trip was made in order to predict that trip, rather than applying what it learns from that trip's own data, which would not be possible in a real-world application.

For the purposes of the case study, we also choose to only select trips after February 15th, 2019.

Refer to this source for more information: https://docs.featuretools.com/en/v0.16.0/automated_feature_engineering/handling_time.html

From documentation, the cutoff_time: Can either be a DataFrame or a single value. If a DataFrame is passed the instance ids for which to calculate features must be in a column with the same name as the target dataframe index or a column named instance_id. The cutoff time values in the DataFrame must be in a column with the same name as the target dataframe time index or a column named time. If the DataFrame has more than two columns, any additional columns will be added to the resulting feature matrix. If a single value is passed, this value will be used for all instances. See: https://featuretools.alteryx.com/en/stable/generated/featuretools.calculate_feature_matrix.html
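
A sketch of how such a cutoff-time table might be built, keeping only trips after February 15th, 2019 and using each trip's pickup time as its cutoff (the id column name is the same assumption as above):

```python
# One row per trip: its id and the latest time whose data may be used for it
cutoff_times = trips[["id", "tpep_pickup_datetime"]].rename(
    columns={"tpep_pickup_datetime": "time"})

# Restrict to the second half of the month
cutoff_times = cutoff_times[cutoff_times["time"] > "2019-02-15"]
```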

Create baseline features using Deep Feature Synthesis

Instead of manually creating features, such as extracting the month of the pickup datetime, we can let Deep Feature Synthesis come up with them automatically. It does this by traversing the entities and relationships we defined and applying feature primitives, stacking them where relationships allow. The resulting features fall into a few major categories: identity features (columns already present in the data), transform features (computed from other columns within the same entity), and aggregation features (summaries of the child rows linked to a parent entity).

Creating transform features using transform primitives

As noted earlier, features fall into a few major categories, including identity features (which are already found in the data), transform features, and aggregate features. In featuretools, we can create transform features by specifying transform primitives. Here is a transform primitive called weekend that applies to any datetime column in the data: it assesses whether the entry falls on a weekend and returns a boolean.

In our data, we have two datetime columns that this primitive will apply to: tpep_pickup_datetime and tpep_dropoff_datetime. However, we will set our feature settings to ignore tpep_dropoff_datetime for the obvious reason that an exact dropoff date/time will not be available in a real-life setting and shouldn't be used in a prediction.

Note that we will ignore several other variables from the trips dataframe, including those related to the amount charged (as that information will not be available for predicting a trip's duration beforehand) and information that is redundant (the location data is already recorded in the pickup_zones and dropoff_zones dataframes).
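
Putting this together, the call to generate the baseline feature definitions might look roughly like the following (the primitive name string and the exact ignore list are assumptions; "weekend" was the primitive name in older featuretools releases, "is_weekend" in newer ones):

```python
# Generate feature definitions only; the matrix itself is computed later
features = ft.dfs(
    entityset=es,
    target_entity="trips",
    trans_primitives=["weekend"],
    agg_primitives=[],
    ignore_variables={"trips": ["tpep_dropoff_datetime",  # unknown before the trip ends
                                "total_amount",           # amounts unknown at pickup
                                "PUZone", "DOZone"]},     # redundant with the zone entities
    features_only=True)

print(len(features), "features defined")
```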

Here are the features created:

Of the 532 features, only one is a new transform feature ('IS_WEEKEND'); the rest are features that were already present in the data.

Compute features and define feature matrix
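
A sketch of that computation under the same assumptions, passing the cutoff times built earlier so each trip's features are restricted to data available by its pickup time:

```python
feature_matrix = ft.calculate_feature_matrix(features,
                                             entityset=es,
                                             cutoff_time=cutoff_times)
print(feature_matrix.shape)
```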

Note that this is less than half of our previous number of entries, which was 2,010,151. The size of the feature matrix went down because we limited our cutoff times to the second half of the month (dates in February after February 15th, 2019).

Building Our Models

We now separate the data into a portion for training (80% in this case) and a portion for testing (20%). We will train models using Linear Regression, Decision Tree, and Random Forest.

Splitting the data into X and y

Scaling our data

Because some of our independent variables are on different scales, our models may incorrectly weight some variables over others by virtue of this difference. We will therefore scale the relevant non-categorical values with the StandardScaler class, which transforms each feature to have a standard deviation of 1 and a mean of 0. We will apply this to our passenger_count and trip_distance features.
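
A minimal sketch of that step, assuming X is the feature matrix with the duration target removed:

```python
scaler = StandardScaler()

# Standardize only the continuous columns; the one-hot columns remain 0/1
num_cols = ["passenger_count", "trip_distance"]
X[num_cols] = scaler.fit_transform(X[num_cols])
```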

Splitting Our Data into Train and Test (80/20 split)

Defining functions to check the performance of the model.
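
A sketch of such a helper; it reports R-squared alongside RMSE, which is an assumed choice of error metric since the discussion below focuses on R-squared:

```python
def model_performance(model, X_train, y_train, X_test, y_test):
    """Print R-squared and RMSE for the train and test splits."""
    for name, X_part, y_part in [("Train", X_train, y_train),
                                 ("Test", X_test, y_test)]:
        pred = model.predict(X_part)
        print(f"{name}: R-squared = {r2_score(y_part, pred):.3f}, "
              f"RMSE = {np.sqrt(mean_squared_error(y_part, pred)):.3f}")
```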

Building Our First Linear Regression Model

Check the performance of the model

The linear regression model clearly does not fit, as we end up with an extreme negative R-squared on the test data. The presence of many unnecessary variables can negatively affect a multiple regression equation, and in our case we have 531 independent variables(!). We will later reduce the number of features to see if this fixes the linear regression model, both with Principal Component Analysis and with a simple linear regression that uses only trip_distance as its single independent variable. In the meantime, we will first try alternative models to see if they fit our large feature set better.

Building a Decision Tree

Check the performance of the model

The model is providing more reasonable results than our linear regression model, but it is strongly overfitting, as there is a 0.35 difference between the R-squared values on the train and test data. Let's see if we can improve it further by building a pruned decision tree.

Building a Pruned Decision Tree

Check the performance of the model

The pruned decision tree offers a much better model for predicting trip duration. The R-squared values are almost the same between our train and test data, which means the model is a good fit, and our R-squared values of 0.72 and 0.73 for train and test, respectively, are good. We also see a significant improvement on the test data compared to our base decision tree (the previous model).

Building a Random Forest

Check the performance of the model

With the random forest regressor (set to 80 trees and a max depth of 7), our R-squared score has slightly improved over the pruned decision tree. This is the best-fitting model of all the models examined so far, and it has the best performance.
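
For reference, a rough sketch of a random forest configured as described (80 trees, max depth 7); random_state and n_jobs are additions for reproducibility and speed, not from the original:

```python
rf = RandomForestRegressor(n_estimators=80, max_depth=7,
                           random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
model_performance(rf, X_train, y_train, X_test, y_test)
```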

Adding More Transform Primitives and Redoing Our Models

We now have 537 features. The six date/time-related transform primitives (minute, hour, day, month, weekday, IsWeekend) have been applied to our tpep_pickup_datetime column.
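
Under the same assumptions as before, the feature definitions for this round might be generated by simply extending the list of transform primitives (the name strings follow the older featuretools spelling and may differ by version):

```python
features = ft.dfs(
    entityset=es,
    target_entity="trips",
    trans_primitives=["minute", "hour", "day", "month", "weekday", "weekend"],
    agg_primitives=[],
    ignore_variables={"trips": ["tpep_dropoff_datetime", "total_amount",
                                "PUZone", "DOZone"]},
    features_only=True)
```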

Computing features and defining our feature matrix

Building New Models with the Additional Transform Features

Building a Linear Regression Model (w/ Additional Transform Primitives)

Check the performance of the model

We again find the linear regression model to be a bad fit for our data.

Building a Decision Tree Model (w/ Additional Transform Primitives)

Check the performance of the model

Our decision tree with these additional transform primitives has improved over the prior decision tree, and we see less overfitting. Whereas the previous model, with only the Is_Weekend transform primitive, had a 0.35 difference between the train and test R-squared values, we now see a difference of about 0.28. The R-squared on the test data has also improved. Let's see if we can improve it further by building a pruned decision tree:

Building a Pruned Decision Tree Model (w/ Additional Transform Primitives)

Check the performance of the model

As with our prior pruned decision tree, the R-squared values are almost the same between the train and test data, indicating a good fit. We also see a good improvement in results with the addition of more transform primitives; these additional features appear to help the model better predict duration values.

Building a Random Forest Model (w/ Additional Transform Primitives)

Check the performance of the model

This version of the random forest, applied to the several date/time transform primitives, is the best fitting of all our models thus far. Our R-squared for both the train and test datasets is 0.77, compared to 0.73 in our previous setup with only the Is_Weekend transform primitive. This is currently a strong model.

Adding additional transform primitives has helped all of our models in this case, except for linear regression, which does not appear suited to the large number of features we currently have.

Creating Models with Transform AND Aggregate Primitives

I will now add aggregate primitives, which apply functions including count, sum, mean, median, standard deviation, max, and min to the individual trips linked to parent items (our parent entities being the pickup and dropoff locations). Our primitive options will be set to ignore duration and payment type as variables for the aggregate primitives to apply to. Duration is ignored for the obvious reason that it is our dependent variable, and features based on it won't be available for making predictions in a real-life setting. Payment type is ignored because it is categorical information. Therefore, our aggregate primitives will only apply to trip_distance and passenger_count.
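
One way this restriction could be expressed is through the primitive_options argument of ft.dfs, which limits which variables the aggregation primitives may use. The configuration below is an assumption: the exact column names, and whether payment_type still exists as a single column after one-hot encoding, may differ from the original notebook.

```python
agg_primitives = ["count", "sum", "mean", "median", "std", "max", "min"]

# Keep the aggregations away from the target and the categorical payment field,
# leaving trip_distance and passenger_count as their only inputs
options = {prim: {"ignore_variables": {"trips": ["duration", "payment_type"]}}
           for prim in agg_primitives}

features = ft.dfs(
    entityset=es,
    target_entity="trips",
    trans_primitives=["minute", "hour", "day", "month", "weekday", "weekend"],
    agg_primitives=agg_primitives,
    primitive_options=options,
    ignore_variables={"trips": ["tpep_dropoff_datetime", "total_amount",
                                "PUZone", "DOZone"]},
    features_only=True)
```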

We now have 563 features. Note that the last features in the list are our aggregate primitives, applying functions such as mean, median, min, max, and standard deviation across all the child trips associated with each pick-up or drop-off parent.

Building a Linear Regression Model with Transform AND Aggregate Primitives

Check the performance of the model

Again, our linear regression model is not suitable with so many features.

Building a Decision Tree Model with Transform AND Aggregate Primitives

Check the performance of the model

The results are minimally different compared to the version with only transform primitives, indicating that the aggregate primitives were not especially useful. The R-squared on the test data was 0.724235 for our previous decision tree, whereas it is now 0.734763. There is strong overfitting here, as there was in the prior model.

Building a Pruned Decision Tree Model with Transform AND Aggregate Primitives

Check the performance of the model

Here again, we see only a slight improvement over our prior pruned decision tree. Adding aggregate primitives may have added some value to our prediction accuracy, but at the cost of additional computation time.

Building a Random Forest Model with Transform AND Aggregate Primitives

Check the performance of the model

The random forest model is again the best fitting of all our models. The performance of this version is approximately the same as the previous random forest regressor, with some minor improvement, though the run time increased given the added features.

Selecting Our Best Model and Most Important Features

Let's predict our values with our best-performing model thus far, the random forest regressor featuring transform and aggregate primitives:

Comparing the values above, we see the predicted and actual results are not far off from each other. A future project might perform grid-search-based hyperparameter tuning to obtain better results from this model. Let's see which specific features were most significant to the results obtained by this model:

Determining the importance of features for determining duration (based on our selected model)
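
A sketch of how those importances can be read off the fitted random forest (rf here stands for the model trained above with transform and aggregate primitives; the name is an assumption):

```python
# Pair each column with its importance and show the top contributors
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))
```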

We see that a single feature, trip_distance, is the most important feature by a significant margin: it has an importance of 0.92 for the prediction, whereas the next most important feature, the hour of pickup (one of our transform primitives), has only 0.055, and no other feature surpasses 1%. If one feature carries so much significance, the question arises: is there any benefit to keeping so many features? What if we reduce the large number of features we currently have? I will now reduce the number of features using Principal Component Analysis and see how it affects the linear regression model (which was unusable up until now) and the random forest model; afterwards, I will apply those two models using only the single feature trip_distance.

Running Our Models with a Reduced Number of Features (Principal Component Analysis)

Principal Component Analysis (PCA) is a dimensionality-reduction method that transforms a large set of variables into a smaller set of components. Let's apply PCA to reduce the number of features and see how it affects two models: our linear regression model and our random forest regressor. For simplicity, I will apply PCA to the X values that contained only transform features.
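
A minimal sketch of that reduction, assuming X_train and X_test hold the transform-feature matrix described above:

```python
# Project the feature matrix onto 100 principal components
pca = PCA(n_components=100)
X_train_pca = pca.fit_transform(X_train)   # fit on the training split only
X_test_pca = pca.transform(X_test)         # reuse that fit for the test split
print("Variance explained by 100 components:", pca.explained_variance_ratio_.sum())
```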

Building linear regressor using only 100 features (reduced by PCA):

Check the performance of the model

In this case, where our features have been drastically reduced, the linear regression model now has a usable R-squared value. While the results are fair, they are still not as good as our previous random forest or decision-tree-based regressors. There is also some overfitting, with a disparity between the train and test R-squared values. Let's now see how the random forest model performs with reduced features:

Building random forest model using only 100 features (reduced by PCA):

Check the performance of the model

Surprisingly, even though the number of features has been reduced, the run time for this model was extremely long with the PCA-transformed data. This was the longest-running model of the project, and the results were not strong: there is strong overfitting, and this iteration of the random forest has the worst R-squared scores of the random forests considered thus far. Reducing the number of features may have helped our linear regression model, but it seems to have hindered our random forest regressor.

Running Our Models with Trip Distance as Our Only Feature

Let's run the models one final time, this time using ONLY trip_distance as our feature, and see how having a single independent variable affects our linear regression and random forest models:
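
A minimal sketch of restricting the feature set to that single column before refitting the two models:

```python
# Keep only the trip_distance column for both splits
X_train_dist = X_train[["trip_distance"]]
X_test_dist = X_test[["trip_distance"]]

# Refit the two models on the single predictor
lr_dist = LinearRegression().fit(X_train_dist, y_train)
rf_dist = RandomForestRegressor(n_estimators=80, max_depth=7,
                                random_state=42).fit(X_train_dist, y_train)
```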

Linear Regression Model:

Checking model performance:

Compared with the linear regressor applied to our PCA-transformed data, this simple linear regression model, which only takes trip_distance into account, fares much better. The model fits better and the predictions are more accurate, with a better R-squared score.

Random Forest Model:

Checking model performance:

Our random forest model here is still stronger than our linear regressor, but it is the weakest random forest model we have run. While trip_distance is the strongest indicator of duration, we clearly see how adding transform primitives helped our random forest reach R-squared scores above 0.77 for the test and train data. DFS is therefore a useful tool for obtaining a strong predictive model.

Conclusion

This project engaged with a subset of a large dataset of taxi trips in New York City. After data preprocessing and exploratory data analysis, I performed feature engineering through DFS to see how the addition of features affected a few selected models. With over 500 features (including transform and aggregate features), we found that adding features improved the performance of our decision-tree-based models, but that our linear regression model was not well suited to so many independent variables. Reducing features through PCA made the linear regression model a viable predictor, though still not a very strong one. Our strongest model, the random forest regressor, did best when we included several transform primitives applied to the pickup date/time values, demonstrating the utility of DFS. A future project could tune the hyperparameters of the random forest model with a grid search to obtain even more accurate predictions.