Classifiers for the NLP sarcasm headlines dataset.
So, as part of my PhD I chose to delve into NLP. I found it interesting, and with the new developments that Prof. Hinton is giving us, I like to think we all (as NLP students) can be considered the tip of the spear.
But now, back to my post. This is not about deep learning just yet; that will come in a while. This is a simple showcase of what I have learned from all the books, courses (Udemy), and general tinkering I have done.
This code gets a nice 91% on the Kaggle sarcasm headlines dataset.
First come the Python imports to begin the work.
# If I combine both datasets and run it like that, one score goes down from 0.91 to 0.81
# and another goes up from 0.60 to 0.84, depending on the hyper-parameters of the neural network and the SVC
#1: Perform imports and load the dataset into a pandas DataFrame
import numpy as np
import pandas as pd
import re
import json
Next we read the data, as follows.
#df = pd.read_json('./Sarcasm_Headlines_Dataset.json', orient='records')
# df.head()
# cat $FILE | tr -d '\n' removes the \n from the command line
with open('C:\\Users\\pedalo\\Documents\\DataForFirstThesisCompilation\\Sarcasm_Headlines_Dataset.json', 'r') as f:
    jsonDict = f.readlines()
print(jsonDict[:5])
print(len(jsonDict))
Then, remove the pesky \n from the end.
# remove the trailing "\n" from each line
data = list(map(lambda x: x.rstrip(), jsonDict))
print(data[:5])
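As a small aside, read().splitlines() strips the trailing newlines in one go, so the rstrip() pass would not be needed; a minimal sketch (path shortened here for readability):
# alternative: splitlines() already drops the trailing '\n' from each line
with open('Sarcasm_Headlines_Dataset.json', 'r') as f:
    data = f.read().splitlines()
print(data[:5])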
And convert the individual JSON objects into an array of JSON objects ([JSON], in Swift parlance).
# each element of 'data' is an individual JSON object.
# i want to convert it into an *array* of JSON objects
# which, in and of itself, is one large JSON object
# basically... add square brackets to the beginning
# and end, and have all the individual JSON objects
# separated by a comma
data_json_str = "[" + ",".join(data) + "]"
print(len(data_json_str))
print(data_json_str[:10])
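For what it's worth, pandas can also parse newline-delimited JSON directly, which would replace the whole readlines/join dance; a minimal sketch, assuming a pandas version recent enough to have the lines=True flag (alt_df is just my name for it):
# alternative: let pandas handle the line-delimited JSON itself
alt_df = pd.read_json('Sarcasm_Headlines_Dataset.json', lines=True)
print(alt_df.head())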
This is the end of the basic cleaning of the data. Now we can finally build the data frame and run a check.
##################END First part################################
# Needed to do all of this to get a df
data_df = pd.read_json(data_json_str)  # build the DataFrame from the assembled JSON string
#%%### Task #2: Check for missing values:
data_df.isnull().sum()
Next, the NaNs, whitespace strings, and the like need to go.
# Check for whitespace strings (it's OK if there aren't any!):
blanks = []  # start with an empty list
for i, link, headline, sarcastic in data_df.itertuples():  # iterate over the DataFrame
    if type(headline) == str:  # avoid NaN values
        if headline.isspace():  # test 'headline' for whitespace
            blanks.append(i)  # add matching index numbers to the list
print(list(data_df.itertuples())[:5])
print(len(blanks))
print(data_df.shape)
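The same whitespace check can be written without the explicit loop, using pandas' vectorized string methods; a sketch over the same data_df:
# vectorized whitespace check: True where a headline is only whitespace
mask = data_df['headline'].astype(str).str.isspace()
print(data_df.index[mask].tolist())  # should print an empty list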
#%%
### Task #3: Remove NaN values:
data_df.dropna(inplace=True)
print(data_df.shape)
headlines = data_df['headline']
labels = data_df['is_sarcastic']
print(headlines[0], labels[0])
for line in headlines:
    print(line)
We can also load just the two columns we care about into a separate data frame (this is the df the commented-out .append() calls below refer to):
df = pd.DataFrame(json.loads(data_json_str), columns=['headline', 'is_sarcastic'])
Split the data into a training and test set. For this we import the needed packages.
### Task #5: Split the data into train & test sets; a validation set can wait until the end of the PhD, to twiddle a little more
from sklearn.model_selection import train_test_split
#print(len(X))
X = data_df['headline']#.append(df['headline'])
#X.append(data_df['headline'])
y = data_df['is_sarcastic']#.append(df['is_sarcastic'])
print(X[:10], len(X))
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42)
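One caveat: with only 1% of the data held out, the test set's class balance can swing quite a bit between seeds. Passing stratify=y keeps the sarcastic/non-sarcastic ratio identical in both splits; a sketch:
# same split, but with the label ratio preserved in train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.01, random_state=42, stratify=y)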
Now comes the slow part: building the classifiers. It might take a couple of minutes.
### Task #6: Build a pipeline to vectorize the data, then train and fit a model
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.neural_network import MLPClassifier
from sklearn.ensemble import RandomForestClassifier
text_clf = Pipeline([('tfidf', TfidfVectorizer()), ('clf', LinearSVC())])
text_mlp_clf = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', MLPClassifier(hidden_layer_sizes=(300, 200), random_state=42,
                                               warm_start=True, solver='lbfgs'))])
# 0.84 with defaults, 0.87 with (300,200), 0.85 with (400,300,100), 0.87 with (400,200,50),
# 0.85 with (300,100,50,25), 0.84 with hidden_layer_sizes=(400,200,100,50), random_state=42, warm_start=True,
# 0.89 with hidden_layer_sizes=(400,200,50), random_state=42, warm_start=True, solver='lbfgs'
# 0.89 with hidden_layer_sizes=(300,200), random_state=42, warm_start=True, solver='lbfgs'
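Instead of hand-tuning the layer sizes one run at a time as above, scikit-learn's GridSearchCV can sweep them with cross-validation; a minimal sketch (the grid values are just examples, and this is slow):
from sklearn.model_selection import GridSearchCV

# example sweep over two layer-size configurations with 3-fold CV
param_grid = {'clf__hidden_layer_sizes': [(300, 200), (400, 200, 50)]}
grid = GridSearchCV(text_mlp_clf, param_grid, cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)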
text_rf_clf = Pipeline([('tfidf', TfidfVectorizer()),
                        ('clf', RandomForestClassifier())])
text_clf.fit(X_train, y_train)
text_mlp_clf.fit(X_train, y_train)
text_rf_clf.fit(X_train, y_train)
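Since each of these scores comes from a single random split, a quick cross-validation of the SVC pipeline gives a steadier estimate; a minimal sketch:
from sklearn.model_selection import cross_val_score

# 5-fold cross-validated accuracy of the SVC pipeline on the full data
cv_scores = cross_val_score(text_clf, X, y, cv=5)
print(cv_scores, cv_scores.mean())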
Finally, after all this, call each classifier on the test set to see how good it turned out.
### Task #7: Run predictions and analyze the results
# Form a prediction set
pred = text_clf.predict(X_test)
pred_mlp = text_mlp_clf.predict(X_test)
pred_rf = text_rf_clf.predict(X_test)
And print the results.
# Report the confusion matrix
from sklearn.metrics import confusion_matrix
print(confusion_matrix(y_test, pred))
print(confusion_matrix(y_test, pred_mlp))
print(confusion_matrix(y_test, pred_rf))
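The raw matrices are easier to read with labels attached; a sketch wrapping the SVC one in a DataFrame (the row and column names are mine):
# rows are the true classes, columns the predicted ones
cm = pd.DataFrame(confusion_matrix(y_test, pred),
                  index=['true_not_sarcastic', 'true_sarcastic'],
                  columns=['pred_not_sarcastic', 'pred_sarcastic'])
print(cm)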
#%%
# Print a classification report
from sklearn.metrics import classification_report
print(classification_report(y_test, pred))
print(classification_report(y_test, pred_mlp))
print(classification_report(y_test, pred_rf))
## without combining the datasets this gets 0.91, 0.84, and 0.80
And we are done. The best for now is the LinearSVC with a nice 91%; with the settings shown here, altering anything gets me less. The print() statements scattered around are a sanity check, something every programmer needs to do, the so-called "poor man's debugger". But if it works, who am I to complain? That's it for now, I just wanted to present my findings. :D
The final average over 11 rounds with different seeds is 0.8545, but with a seed of 42 the F1 score is a constant 0.91.
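For reference, a sketch of how that multi-seed average can be reproduced for the SVC pipeline (the seeds 0 through 10 are an assumption, not the exact ones I used):
from sklearn.metrics import f1_score

# re-split, re-fit, and score the SVC pipeline once per seed
f1s = []
for seed in range(11):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.01, random_state=seed)
    text_clf.fit(X_tr, y_tr)
    f1s.append(f1_score(y_te, text_clf.predict(X_te)))
print(sum(f1s) / len(f1s))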