Classifiers for the NLP sarcasm headlines dataset.
So, as part of my PhD I chose to delve into NLP. I found it interesting, and with the new developments that Prof. Hinton is giving us, I like to think we all (as NLP students) can be considered the tip of the spear.
But now, back to my post. This is not about deep learning just yet; that will come in a while. This is a simple showcase of what I have learned from all the books, courses (Udemy), and general tinkering I have done.
This code gets a nice 91% on the Kaggle sarcasm headlines dataset.
First come the Python imports to begin the work.
# If I combine both datasets and run it like that, one score goes down from 0.91 to 0.81
# and another goes up from 0.60 to 0.84, depending on the hyper-parameters of the neural network and the SVC
#1: Perform imports and load the dataset into a pandas DataFrame
import numpy as np
import pandas as pd
import re
import json
Next we read the data, as follows.
#df = pd.read_json('./Sarcasm_Headlines_Dataset.json', orient='records')
# df.head()
# cat $FILE | tr -d '\n' removes the \n from the command line
with open('C:\\Users\\pedalo\\Documents\\DataForFirstThesisCompilation\\Sarcasm_Headlines_Dataset.json', 'r') as f:
    jsonDict = f.readlines()
print(jsonDict[:5])
print(len(jsonDict))
Then, remove the pesky \n from the end.
# remove the trailing "\n" from each line
data = list(map(lambda x: x.rstrip(), jsonDict))
print(data[:5])
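As a small aside, read().splitlines() strips the trailing newlines in one go, so the rstrip() pass would not be needed; a minimal sketch (path shortened here for readability):
# alternative: splitlines() already drops the trailing '\n' from each line
with open('Sarcasm_Headlines_Dataset.json', 'r') as f:
    data = f.read().splitlines()
print(data[:5])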
And convert the individual JSON objects into an array of JSON objects ([JSON], in Swift parlance).
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
print(len(data_json_str))
print(data_json_str[:10])
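For what it's worth, pandas can also parse newline-delimited JSON directly, which would replace the whole readlines/join dance; a minimal sketch, assuming a pandas version recent enough to have the lines=True flag (alt_df is just my name for it):
# alternative: let pandas handle the line-delimited JSON itself
alt_df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
print(alt_df.head())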
This is the end of the basic cleaning of the data. Now we can finally build the data frame and run a check.
##################END First part################################
# Needed to do all of this to get a df
data_df = pd.read_json(data_json_str)  # build the DataFrame from the assembled JSON string
#%%### Task #2: Check for missing values:
data_df.isnull().sum()
Next, the NaNs, whitespace strings, and the like need to go.
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list
for i, link, headline, sarcastic in data_df.itertuples():  # iterate over the DataFrame
    if type(headline) == str:  # avoid NaN values
        if headline.isspace():  # test 'headline' for whitespace
            blanks.append(i)  # add matching index numbers to the list
print(list(data_df.itertuples())[:5])
print(len(blanks))
print(data_df.shape)
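The same whitespace check can be written without the explicit loop, using pandas' vectorized string methods; a sketch over the same data_df:
# vectorized whitespace check: True where a headline is only whitespace
mask = data_df['headline'].astype(str).str.isspace()
print(data_df.index[mask].tolist())  # should print an empty list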
#%%
### Task #3: Remove NaN values:
data_df.dropna(inplace=True)
print(data_df.shape)
headlines = data_df['headline']
labels = data_df['is_sarcastic']
print(headlines[0], labels[0])
for line in headlines:
    print(line)
We can also load just the two columns we care about into a separate data frame (this is the df the commented-out .append() calls below refer to):
df = pd.DataFrame(json.loads(data_json_str), columns=['headline', 'is_sarcastic'])
Split the data into a training and test set. For this we import the needed packages.
### Task #5: Split the data into train & test sets; a validation set can wait until the end of the PhD, to twiddle a little more
from sklearn.model_selection import train_test_split
#print(len(X))
X = data_df['headline']#.append(df['headline'])
#X.append(data_df['headline'])
y = data_df['is_sarcastic']#.append(df['is_sarcastic'])
print(X[:10], len(X))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
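One caveat: with only 1% of the data held out, the test set's class balance can swing quite a bit between seeds. Passing stratify=y keeps the sarcastic/non-sarcastic ratio identical in both splits; a sketch:
# same split, but with the label ratio preserved in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42, stratify=y)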
Now comes the slow part: building the classifiers. It might take a couple of minutes.
### Task #6: Build a pipeline to vectorize the data, then train and fit a model
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_mlp_clf = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', MLPClassifier(hidden_layer_sizes=(300, 200), random_state=42,
                                               warm_start=True, solver='lbfgs'))])
# 0.84 with defaults, 0.87 with (300,200), 0.85 with (400,300,100), 0.87 with (400,200,50),
# 0.85 with (300,100,50,25), 0.84 with hidden_layer_sizes=(400,200,100,50), random_state=42, warm_start=True,
# 0.89 with hidden_layer_sizes=(400,200,50), random_state=42, warm_start=True, solver='lbfgs'
# 0.89 with hidden_layer_sizes=(300,200), random_state=42, warm_start=True, solver='lbfgs'
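Instead of hand-tuning the layer sizes one run at a time as above, scikit-learn's GridSearchCV can sweep them with cross-validation; a minimal sketch (the grid values are just examples, and this is slow):
from sklearn.model_selection import GridSearchCV

# example sweep over two layer-size configurations with 3-fold CV
param_grid = {'clf__hidden_layer_sizes': [(300, 200), (400, 200, 50)]}
grid = GridSearchCV(text_mlp_clf, param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)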
text_rf_clf = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', RandomForestClassifier())])
text_clf.fit(X_train, y_train)
text_mlp_clf.fit(X_train, y_train)
text_rf_clf.fit(X_train, y_train)
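Since each of these scores comes from a single random split, a quick cross-validation of the SVC pipeline gives a steadier estimate; a minimal sketch:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the SVC pipeline on the full data
cv_scores = cross_val_score(text_clf, X, y, cv=5)
print(cv_scores, cv_scores.mean())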
Finally, after all this, call each classifier on the test set to see how good it turned out.
### Task #7: Run predictions and analyze the results
# Form a prediction set
pred = text_clf.predict(X_test)
pred_mlp = text_mlp_clf.predict(X_test)
pred_rf = text_rf_clf.predict(X_test)
And print the results.
# Report the confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))
print(confusion_matrix(y_test, pred_mlp))
print(confusion_matrix(y_test, pred_rf))
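The raw matrices are easier to read with labels attached; a sketch wrapping the SVC one in a DataFrame (the row and column names are mine):
# rows are the true classes, columns the predicted ones
cm = pd.DataFrame(confusion_matrix(y_test, pred),
                  index=['true_not_sarcastic', 'true_sarcastic'],
                  columns=['pred_not_sarcastic', 'pred_sarcastic'])
print(cm)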
#%%
# Print a classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
print(classification_report(y_test, pred_mlp))
print(classification_report(y_test, pred_rf))
## without combining the datasets this gets 0.91, 0.84, and 0.80
And we are done. The best for now is the LinearSVC with a nice 91%; with the settings shown here, altering anything gets me less. The print() statements scattered around are a sanity check, something every programmer needs to do, the so-called "poor man's debugger". But if it works, who am I to complain? That's it for now, I just wanted to present my findings. :D
The final average over 11 rounds with different seeds is 0.8545, but with a seed of 42 the F1 score is a constant 0.91.
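For reference, a sketch of how that multi-seed average can be reproduced for the SVC pipeline (the seeds 0 through 10 are an assumption, not the exact ones I used):
from sklearn.metrics import f1_score

# re-split, re-fit, and score the SVC pipeline once per seed
f1s = []
for seed in range(11):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.01, random_state=seed)
    text_clf.fit(X_tr, y_tr)
    f1s.append(f1_score(y_te, text_clf.predict(X_te)))
print(sum(f1s) / len(f1s))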