
Yelp Review Recommendations

Northwestern University EECS 349 Machine Learning

Professor: Douglas Downey

Desheng Liu (dli848) : deshengliu2015@u.northwestern.edu

Shiping Zhao (szd041) : shipingzhao2015@u.northwestern.edu

Yan Liu (ylk070) : yanliu2015@u.northwestern.edu


Project Overview

 

Motivation

The reviews shown first in the Yelp app are mainly of two kinds: reviews with many votes and reviews that are fresh. Reviews with many votes can be useful, but they are often out of date, so fresh reviews also need to appear in the review feed. However, most fresh reviews are of low quality. It would be better to rank the more useful fresh reviews above the less useful ones, so that users can read the useful ones first and save time.

 

Collection of Datasets

The dataset we used consists of the JSON files obtained from Yelp's official website. We mainly used two of them, reviews.json and users.json.

  • Users.json is aggregated by user, with 11 attributes (e.g. number of reviews, elite status, yelping since, number of friends) and 552,339 examples.

  • Reviews.json is aggregated by review, with useful attributes such as post date, review text, and user_id, and 2,225,213 examples.

More details about the original dataset can be found at the end of this page.

 

Preprocessing: Choice of Label and Attributes

Label:

The common ranking of reviews is based on the "useful count" attribute in reviews.json. Because this attribute depends largely on when a review was posted (the longer a review has existed, the more useful votes it tends to accumulate), using it directly as the label would make the result strongly biased by time and unable to represent freshness. We therefore looked into the relationship between useful counts and time, derived a reasonable normalization, and defined a quality-of-review score to use as the label.

Quality of reviews:  

To find the pattern, we first plotted useful votes against post year (see the figure below); the trend is well fitted by a linear model after 2008. We therefore used Linear Regression to learn this trend and used it as our normalization factor.
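A minimal sketch of this normalization, assuming a pandas DataFrame `reviews` with hypothetical columns `post_year` and `useful`; the variable names and the divide-by-expected-count step are our illustration, not the project's exact code:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# `reviews` is assumed to be a pandas DataFrame with one row per review and
# (hypothetical) columns `post_year` and `useful` (useful-vote count).
recent = reviews[reviews["post_year"] >= 2008]

# Learn the average useful-vote count as a linear function of post year.
yearly_mean = recent.groupby("post_year")["useful"].mean().reset_index()
trend = LinearRegression().fit(yearly_mean[["post_year"]], yearly_mean["useful"])

# Normalize: divide each review's votes by the expected count for its year,
# giving a time-adjusted quality-of-review score to use as the label.
expected = trend.predict(recent[["post_year"]])
quality = recent["useful"] / np.maximum(expected, 1e-6)
```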

Attributes

  • Basic Features 

In terms of features, we first used only 10 basic features extracted from the metadata of reviews and their authors: the author's total reviews, number of friends, elite status, compliments, number of fans, average stars, and counts of funny/useful/cool votes, plus the length of the review text.

  • Text features

We then tried text features (i.e. bag of words) extracted from the contents of reviews. To get better results from the bag of words, we used each word's frequency as its feature value and normalized it by how frequently the word appears across all reviews, since words that are common across all reviews are more likely to be irrelevant to a review's quality. This normalization (TfidfTransformer) is built into scikit-learn, the ML library we used for the project. In addition, to speed up learning and improve results, we reduced the dimensionality of the text features with PCA, keeping 100 text features.

We combined the two kinds of features, metadata and text, and obtained a significant improvement compared to using either of them alone.
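A hedged sketch of how this combination can be built with scikit-learn. The variable names (`texts`, `meta`) are placeholders, and TruncatedSVD stands in for the PCA step because it works directly on sparse TF-IDF matrices; the project's exact code may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.decomposition import TruncatedSVD

# texts: list of review strings; meta: array of shape (n_reviews, 10) holding
# the basic metadata features described above. Both are placeholders.
counts = CountVectorizer().fit_transform(texts)

# Down-weight words that are frequent across all reviews, since they are
# less likely to be related to a review's quality.
tfidf = TfidfTransformer().fit_transform(counts)

# Reduce the sparse text representation to 100 dimensions.
text_features = TruncatedSVD(n_components=100, random_state=0).fit_transform(tfidf)

# Concatenate the 10 metadata features with the 100 text features.
X = np.hstack([meta, text_features])
```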

 

Learning Library and Models

Library: scikit-learn

Models: As proposed, we originally decided to use regression to predict the exact quality of each review. After encountering low R² scores for some of the regression models in the status report, we decided to also try classification.

The regression models we used are Linear Regression, Ridge Regression, Passive Aggressive Regression, AdaBoost, and Random Forest.

For classification, we analyzed the distribution of review qualities and found that almost all reviews have low quality, with only a few exceptions. We therefore sorted the quality scores and distributed them uniformly into 10 classes (a 5-star system in half-star steps).

The classification models we used are Decision Trees, Naive Bayes, and K Nearest Neighbors.
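A hedged sketch of how these models can be set up with scikit-learn, assuming the feature matrix `X` and quality scores `quality` from the sketches above; the rank-based binning into 10 classes is our reading of the description, not the project's exact code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression, Ridge, PassiveAggressiveRegressor
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Regression: predict the quality score directly.
regressors = [
    LinearRegression(),
    Ridge(),
    PassiveAggressiveRegressor(),
    AdaBoostRegressor(),
    RandomForestRegressor(),
]

# Classification: sort the quality scores and split them into 10 equally
# populated bins (a 5-star scale in half-star steps).
ranks = pd.Series(quality).rank(method="first")
y_class = pd.qcut(ranks, q=10, labels=False)

classifiers = [DecisionTreeClassifier(), GaussianNB(), KNeighborsClassifier()]

for model in regressors:
    model.fit(X, quality)
for model in classifiers:
    model.fit(X, y_class)
```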

 

Result Discussion

Measurement

To measure success, we used scikit-learn's Dummy Regressor/Classifier as the baseline. It is similar to the ZeroR classifier in Weka: it simply predicts a constant value (e.g. the mean or median quality, or the most frequent class). We then compared the performance of the different regression and classification learners against this baseline.
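For example, a minimal sketch of the baseline comparison, reusing the `X`, `quality`, and `y_class` placeholders from the sketches above:

```python
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.model_selection import cross_val_score

# Baselines that ignore the features entirely.
dummy_reg = DummyRegressor(strategy="mean")            # always predicts the mean quality
dummy_clf = DummyClassifier(strategy="most_frequent")  # always predicts the majority class

# Score the baselines the same way the real learners are scored:
# R^2 for regression, accuracy for classification.
baseline_r2 = cross_val_score(dummy_reg, X, quality, scoring="r2", cv=5).mean()
baseline_acc = cross_val_score(dummy_clf, X, y_class, scoring="accuracy", cv=5).mean()
```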

Performance & Evaluation

For regression, we evaluated performance with the R² score, which can be negative (predictions can be arbitrarily bad) and is better the closer it is to 1. The R² score of the baseline learner is -9.22 × 10^-7. The best learner we found is Random Forest, with an R² score of 0.48. Comparing the following figures, we found that combining the two kinds of features greatly improves the quality of the prediction.

The R² score is determined not only by how well the regression fits the dataset but also by how noisy the data are. A possible explanation for the rather low R² scores of all the regressors is that the dataset is still noisy, which suggests the selection of attributes can be refined; we discuss this further under Future Work.

On the other hand, the improvement of Random Forest over the dummy regressor shows that this model is promising and might become practically useful with further refinement.

For classification, we evaluated performance with accuracy. The dummy classifier gives an accuracy of 0.17, a little higher than the 0.10 expected for 10 uniform classes. Among the three classification learners, Gaussian Naive Bayes is the best, with an accuracy of only 0.50.

One possible explanation for the roughly fifty-fifty accuracy, besides the selection of attributes, is that there are too many classes: a four-star review might be classified as a four-and-a-half-star review, which makes only a slight difference in practice. If the number of classes were reduced to 5, the accuracy would likely increase.

Future Work

We have already made use of the whole dataset, so there is limited room for improvement from more data or additional basic features. The directions we can focus on next are:

  • Better extraction from review text: we only considered word frequencies this time; other aspects such as sentence sentiment, conjunction words, and word positions could also be used.

  • Better choice of learning models: we only used the built-in models in scikit-learn. We believe that customized learning models (e.g. ones that give proper weight to each attribute or capture logical relationships between attributes) would lead to better performance of the prediction system.

  • Combination of models: it has been suggested that we could train a model on only the sparse text features (perhaps without dimensionality reduction), and then feed its predictions as a dense feature, together with the other dense (metadata) features, into a second model. This stacking approach is said to usually achieve better results than rigidly concatenating the metadata and bag-of-words features.

Meet The Team

Desheng Liu


Desheng Liu contributed to the preprocessing of data, the extraction of basic features and text features (bag-of-words) and learning with regression.

Shiping Zhao


Shiping Zhao contributed to the integration of user and review data, analysis of the pattern of data, preprocessing and learning with classification.

Yan Liu


Yan Liu contributed to the collection of data, investigation of Yelp, practical analysis, the choice of features, and the development of the project website.
