Data Projects
The following are a few selected data science and data analytics projects I’ve worked on. The projects explore geospatial and traditional datasets, featuring deep learning, regression, classification, clustering, sentiment analysis, and recommendation system algorithms built with TensorFlow (Keras), Google’s Earth Engine library (Python API), scikit-learn, Surprise, NLTK, and other libraries. A couple of projects below utilize R and SQL as well.
Data Science
Hyperspectral Image Classification with SVM, Random Forest, and ANN
The following project analyzes a hyperspectral image (HSI) dataset containing 102 spectral bands, applying PCA for dimensionality reduction and exploring correlations between dimensions and ground truth classes. The project achieves robust classification of the HSI scene using SVM, Random Forest, and a simple ANN, evaluating precision, recall, and accuracy with confusion matrices.
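The PCA-then-classify pipeline described above can be sketched with scikit-learn. The data below is a synthetic stand-in for the HSI cube (the real project uses 102-band pixel spectra and their ground truth labels); only the SVM branch is shown.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Stand-in for the HSI scene: 600 "pixels" x 102 spectral bands,
# three ground-truth classes with distinct spectral signatures.
signatures = rng.normal(size=(3, 102))
y = np.repeat([0, 1, 2], 200)
X = signatures[y] + rng.normal(scale=0.5, size=(600, 102))

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Scale the bands, project the 102 dimensions down to 10 principal
# components, then fit an RBF-kernel SVM on the reduced features.
clf = make_pipeline(StandardScaler(), PCA(n_components=10), SVC(kernel="rbf"))
clf.fit(X_tr, y_tr)
pred = clf.predict(X_te)
print(confusion_matrix(y_te, pred))
print(classification_report(y_te, pred, digits=3))
```

The Random Forest and ANN models drop into the same pipeline in place of the SVM, which makes the precision/recall comparison across the three classifiers straightforward.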
Data Science
Predicting Taxi Cab Trip Duration (A Regression Problem Utilizing Deep Feature Synthesis and Principal Component Analysis)
After preprocessing and performing exploratory data analysis on a large dataset of New York City taxi trips, I will apply different modeling techniques to predict trip duration. This regression project engages in feature engineering through Deep Feature Synthesis and dimensionality reduction with Principal Component Analysis (PCA) to see how adding and removing features can positively or negatively affect linear regression and decision-tree-based predictive models. The project also pays special attention to the time-based nature of the problem: variables that reveal too much information to the models must be selectively excluded, because they appear in the historical data but would not be available in a real-time prediction scenario.
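The leakage concern can be made concrete with a small sketch. The column names and figures below are illustrative stand-ins for the taxi data, and the datetime features are derived by hand here (in the project, Deep Feature Synthesis automates this step via featuretools):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "pickup_datetime": pd.Timestamp("2016-01-01")
    + pd.to_timedelta(rng.integers(0, 7 * 24 * 60, n), unit="m"),
    "trip_distance": rng.gamma(2.0, 1.5, n),
})
df["trip_duration"] = 300 * df["trip_distance"] + rng.normal(0, 60, n)
# dropoff_datetime encodes the answer: it exists in the historical data
# but is unknown at prediction time, so it must stay out of the features.
df["dropoff_datetime"] = df["pickup_datetime"] + pd.to_timedelta(
    df["trip_duration"], unit="s"
)

# Only prediction-time-safe features go into the model.
features = pd.DataFrame({
    "hour": df["pickup_datetime"].dt.hour,
    "weekday": df["pickup_datetime"].dt.weekday,
    "trip_distance": df["trip_distance"],
})
model = make_pipeline(StandardScaler(), PCA(n_components=2), LinearRegression())
model.fit(features, df["trip_duration"])
preds = model.predict(features)
```

The same drop-the-leaky-columns discipline applies after DFS, which will happily synthesize features from any datetime column it is given.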
Data Science
Land Cover Land Use (LCLU) Classification with Google Earth Engine (Random Forest and SVM)
The following project will walk through some basic Land Cover Land Use (LCLU) classification methods in Google Earth Engine, using Bhopal, India as an example. The city features prominent lakes and has some neighboring croplands.
I will use two supervised classification methods: Random Forest and Support Vector Machines (SVM). For the Random Forest classification, I will use LCLU data available through ESRI to train the model. The SVM model will use some crude demarcations of buildings, water, crops, and shrubs as training inputs for demonstration purposes.
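The Earth Engine workflow itself runs server-side, but the core comparison can be sketched locally with scikit-learn: train Random Forest and SVM classifiers on labeled pixel spectra (as one might sample at training polygons) and compare accuracy. Everything below is a synthetic stand-in, not Earth Engine data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Fake 6-band reflectance samples, one distinct mean spectrum per class
# (built-up, water, crops, shrubs), 150 labeled pixels each.
means = rng.uniform(0.0, 0.5, size=(4, 6))
X = np.vstack([m + rng.normal(0, 0.03, size=(150, 6)) for m in means])
y = np.repeat(np.arange(4), 150)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

scores = {}
for name, clf in [
    ("Random Forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("SVM", SVC(kernel="rbf")),
]:
    clf.fit(X_tr, y_tr)
    scores[name] = clf.score(X_te, y_te)
    print(name, round(scores[name], 3))
```

In Earth Engine the same step is performed with `ee.Classifier` objects trained on sampled regions, but the train/score logic is analogous.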
Data Science
Sentiment Analysis on Australian News Headlines: Comparing VADER and RoBERTa Approaches
The following project performs sentiment analysis and classification on Australian news headlines from the first half of 2020, using the rule-based VADER model from NLTK and the context-aware pre-trained RoBERTa NLP model. Comparing the two approaches, the project highlights RoBERTa’s superior accuracy in sentiment classification.
Data Science
Item Recommendation Systems (A Project Featuring Hyperparameter Tuning with Automated and Randomized Grid Search Cross Validation)
Streaming and video platforms like YouTube, Netflix, and Disney+ have recommendation systems that suggest relevant movies to their users based on historical interactions. For this case study, I will build and tune the hyperparameters of several different recommendation systems based on user ratings of films. These include user-based and item-based collaborative filtering systems utilizing the k-nearest neighbors (KNN) algorithm, and a matrix-factorization-based collaborative filtering system using singular value decomposition (SVD). The systems will be optimized through hyperparameter tuning with automated and randomized grid search cross-validation, and special attention will be placed on recall and precision as metrics for evaluating the quality of the top item recommendations made to a user.
Data Science
Predicting Employee Attrition (A Classification Problem Where Recall Matters)
This project utilizes a fictional employee dataset to help determine an employee’s likelihood of attrition. I explore how various factors may contribute to attrition, such as an employee’s distance from home, years of employment, and overtime work. I will create two classification models to predict employee attrition: logistic regression and a support vector machine (SVM). I will then use precision-recall curves to determine, for each algorithm, the optimal probability threshold that responsibly increases recall and reduces false negatives.
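The threshold-selection idea can be sketched as follows, using synthetic stand-in data in place of the employee dataset (only the logistic regression branch is shown, and the 0.60 precision floor is an illustrative choice):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve, recall_score
from sklearn.model_selection import train_test_split

# Imbalanced classes, as in attrition data: ~20% positives ("left").
X, y = make_classification(n_samples=1000, weights=[0.8], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_te, proba)

# Lower the default 0.5 cutoff to the smallest threshold that still
# keeps precision above a chosen floor: recall rises and false
# negatives fall, while precision stays acceptable.
floor = 0.60
keep = precision[:-1] >= floor  # precision[i] pairs with thresholds[i]
threshold = thresholds[keep].min()
pred = (proba >= threshold).astype(int)
print(f"threshold={threshold:.3f}, recall={recall_score(y_te, pred):.3f}")
```

Catching more true leavers at the cost of a few extra false alarms is usually the right trade for an HR use case, which is why the threshold is tuned on recall rather than accuracy.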
Data Analysis
Using Amazon Search Data to Explore Customer Behavior (A Project in R)
One of the reports provided to Amazon sellers is the Amazon Search Terms report, which ranks the most popular search terms on Amazon for a selected time frame. In addition to the search frequency rank, the report provides data on the top three clicked products for each search term, along with those products’ share of all the clicks and sales made when a customer searches that term. While this is useful for a seller tracking the position of their own products on the Amazon platform, it can also be used to observe broader trends in customer interests and their interaction with the Amazon product catalog. The full report, when downloaded, includes records on the top 1,000,000 search terms for a selected time period. The short project below, done in R, explores this data for a week-long period and provides visualizations of some of the trends that can be ascertained from this large dataset.
Data Analysis
Deriving Customer Lifetime Value from Amazon Seller Data (A Project in R)
This is a simple guide in R demonstrating how to calculate the average customer lifetime value (CLV) for an example Amazon business using Amazon-provided business reports. CLV is a useful metric that lets a business estimate the average amount a customer spends with the business over time. Some businesses have products or business models with strong customer repurchase rates (think, for example, of a supplement company or an apparel company with strong brand loyalty), whereas others have a less frequent relationship with their customers. Knowing the lifetime value of a customer can help a business decide whether the cost of acquiring a new customer justifies its current marketing spend. It can also inform financial projections or serve as a metric for customer loyalty.
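One common back-of-envelope formulation of average CLV multiplies average order value, purchase frequency, and customer lifespan. A minimal sketch (in Python here; the guide itself works in R, and the input figures below are illustrative, not taken from any report):

```python
def customer_lifetime_value(avg_order_value: float,
                            orders_per_year: float,
                            repeat_years: float) -> float:
    """Average revenue a customer generates over their relationship
    with the business: order value x purchase frequency x lifespan."""
    return avg_order_value * orders_per_year * repeat_years

# Hypothetical figures of the kind one might derive from Amazon
# business reports for each factor.
clv = customer_lifetime_value(avg_order_value=32.50,
                              orders_per_year=2.4,
                              repeat_years=3.0)
print(f"average CLV: ${clv:.2f}")
```

Each factor maps to something observable in seller reports: order value from sales totals, frequency from repeat-order counts, and lifespan from how long customers keep reordering.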
Data Analysis
1 Million US Deaths from Covid Cases, Visualized (A Project in SQL + Tableau)
COVID-related deaths in the US reached 1 million on May 16, 2022. This short project uses SQL to analyze international COVID-19 data from February 2020 through May 16, 2022. Data was obtained from the following source: https://ourworldindata.org/covid-deaths. The SQL code can be found on the following GitHub page (link).
Tableau visualizations based on SQL queries from the project can be found [here].