Kaggle is a great place to hone your data science skills on real-world data. The Yelp dataset is made up of user reviews of businesses, each with a star rating and a comment. The challenge for this particular Kaggle competition is to predict the star rating from the comment text. For this we can use the Natural Language Toolkit (NLTK), which has some great libraries for dealing with human language data. We will also use the scikit-learn machine learning library for Python, with its models, transformers and feature extraction tools. As is often the case, the first step is to
do any pre-processing on the data and bring it into a pandas DataFrame. In this case the original dataset came as a JSON file with over 200,000 records. I have taken just the first 100,000 records to reduce the time required to process the data and avoid any memory issues; dealing with large files could be a separate entry on its own. The NLTK stopwords can be downloaded once and then referenced locally. Stopwords are high-frequency words such as 'the' and 'is' that provide little benefit when distinguishing different bodies of text.
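Since the read below uses lines=True, the source file is one JSON record per line, so a minimal way to produce the smaller file is to stream the first 100,000 lines (the full-file name here is an assumption, not from the competition data):

import itertools

# Copy only the first 100,000 lines of the large JSON Lines file.
# The source file name is illustrative; the output name matches the read below.
with open('yelp_training_set_review.json') as src, \
        open('yelp_training_set_review_100.json', 'w') as dst:
    for line in itertools.islice(src, 100000):
        dst.write(line)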
I'm interested in two fields: the 'text' review and the 'stars' rating. We can split the data into training and test sets with sklearn's train_test_split utility, and create a function to remove the stopwords that can be loaded into our pipeline. The scikit-learn Pipeline is used to chain multiple estimators together, with the last being a classifier. Switch out a few transformers and estimators to see what difference it makes, then run a confusion matrix and classification report to compare.
import pandas as pd
import string
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, confusion_matrix
# Download the stopword list once; afterwards it is read from the local cache.
# import nltk
# nltk.download('stopwords')
yelp = pd.read_json('yelp_training_set_review_100.json', lines=True)
# print(yelp.head())
# print(yelp.describe())
# print(yelp.info())
# Keep only reviews with a valid 1-5 star rating.
yelp_class = yelp[yelp['stars'].isin([1, 2, 3, 4, 5])]
X = yelp_class['text']
Y = yelp_class['stars']
# Cache the stopword set once so it is not re-read for every token.
stop_words = set(stopwords.words('english'))

def text_process(text):
    # Strip punctuation, then drop stopwords; return a list of tokens.
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]
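# Quick sanity check of the tokenizer (illustrative input):
# text_process("This place is great, the staff were friendly!")
# -> ['place', 'great', 'staff', 'friendly']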
# Hold out 30% of the reviews for evaluation; fix random_state for repeatability.
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=101)
# The same pipeline with a Naive Bayes classifier:
# pipe = Pipeline([
#     ('bow', CountVectorizer(analyzer=text_process)),
#     ('tfidf', TfidfTransformer()),
#     ('classifier', MultinomialNB())
# ])
# Change the pipeline classifier to another model, e.g. a random forest:
pipe = Pipeline([
    ('bow', CountVectorizer(analyzer=text_process)),
    ('tfidf', TfidfTransformer()),
    ('classifier', RandomForestClassifier())
])
pipe.fit(X_train, Y_train)
predictions = pipe.predict(X_test)
print(confusion_matrix(Y_test, predictions))
print(classification_report(Y_test, predictions))
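To make the side-by-side comparison suggested above more explicit, a small sketch like the following refits a fresh pipeline per candidate model and prints the metrics for each (the classifier list is illustrative, not the author's exact experiment):

# Fit one pipeline per candidate classifier and report both metrics.
for name, clf in [('MultinomialNB', MultinomialNB()),
                  ('RandomForest', RandomForestClassifier())]:
    candidate = Pipeline([
        ('bow', CountVectorizer(analyzer=text_process)),
        ('tfidf', TfidfTransformer()),
        ('classifier', clf)
    ])
    candidate.fit(X_train, Y_train)
    preds = candidate.predict(X_test)
    print(name)
    print(confusion_matrix(Y_test, preds))
    print(classification_report(Y_test, preds))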