This is the second tutorial in the Tweet Sentiment Analysis with Logistic Regression series. Check out the first part if you haven't yet. In this tutorial, we will use another method to create features from sentences: frequency counts. This method creates only 3 features per tweet, rather than the 26,233 we got with one-hot encoding!
Frequency count: the number of times a word appears in the corpus of a particular class. Watch this video for a clear explanation. After that, watch this video for feature extraction using word frequencies.
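To make the idea concrete, here is a tiny, made-up two-tweet example (the sentences and variable names are just for illustration, not from the real dataset) that counts how often each word appears in the positive class and in the negative class:
from collections import Counter
# hypothetical mini-corpus: one positive tweet and one negative tweet
pos_words = "i am happy because i am learning".split()
neg_words = "i am sad because i am not learning".split()
# per-class word frequencies
pos_counts = Counter(pos_words)
neg_counts = Counter(neg_words)
print(pos_counts["happy"], pos_counts["i"])  # 1 2
print(neg_counts["sad"], neg_counts["i"])    # 1 2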
The data loading process is the same as in the previous tutorial.
# uncomment below line to install dependencies
# !pip install numpy pandas scikit-learn nltk
import re, nltk
import numpy as np
import pandas as pd
from nltk.corpus import twitter_samples
from collections import Counter
# run the line below to download the dataset (needed only once)
nltk.download('twitter_samples')
# select the set of positive and negative tweets
positive_tweets = twitter_samples.strings('positive_tweets.json')
negative_tweets = twitter_samples.strings('negative_tweets.json')
print('Number of positive tweets: ', len(positive_tweets))
print('Number of negative tweets: ', len(negative_tweets))
print('\nThe type of positive_tweets is: ', type(positive_tweets))
print('The type of a tweet entry is: ', type(negative_tweets[0]))
# Let's look at an example tweet
print("Positive example ->", positive_tweets[0])
print()
print("Negative example ->", negative_tweets[0])
I am using a pandas DataFrame for easy data management.
posdf = pd.DataFrame(positive_tweets, columns=["tweet"])
posdf["target"] = 1
negdf = pd.DataFrame(negative_tweets, columns=["tweet"])
negdf["target"] = 0
# Combine both dataframes (ignore_index avoids duplicated row indices)
df = pd.concat([posdf, negdf], ignore_index=True)
df.shape
df.sample(6)
Do some cleaning, such as converting all text to lowercase and removing hashtags, extra spaces, etc. For the sake of simplicity, I am performing only simple cleaning here.
def preprocessing(tweet):
    # lowercase and strip surrounding whitespace
    tweet = tweet.lower().strip()
    # remove hashtag symbols
    tweet = re.sub(r'#', '', tweet)
    return tweet
df["tweet"] = df.tweet.apply(preprocessing)
df.sample(6)
Shuffle the dataframe and split the data into train and validation sets.
from sklearn.model_selection import train_test_split
traindf, valdf = train_test_split(df, shuffle=True)
print("Shape of train and val set:", traindf.shape, valdf.shape)
# Verify the classes of both splits
print("Samples distribution in train set:", dict(traindf.target.value_counts()))
print("Samples distribution in val set:", dict(valdf.target.value_counts()))
Now, let's build the word frequencies using the training set:
def build_freqs_dict(df):
    freqs = {}
    for i in range(len(df)):
        row = df.iloc[i]
        y = row.target
        for word in row.tweet.split(" "):
            pair = (word, y)
            freqs[pair] = freqs.get(pair, 0) + 1
            # The line above is equivalent to the if/else below, but more compact
            # if pair in freqs: freqs[pair] += 1
            # else: freqs[pair] = 1
    return freqs
%%time
freqs = build_freqs_dict(traindf)
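The result is a dictionary keyed by (word, class) pairs, where the class is 1 for positive and 0 for negative. Optionally, you can take a quick peek at it (the exact numbers depend on your random split):
print("Number of (word, class) pairs:", len(freqs))
print("Count of ('happy', 1):", freqs.get(("happy", 1), 0))
print("Count of ('happy', 0):", freqs.get(("happy", 0), 0))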
Now, let's make features from the frequencies. There will be three features for every tweet: a bias term (always 1), the sum of the positive-class frequencies of the tweet's words, and the sum of the negative-class frequencies of the tweet's words.
def make_features(tweet, freqs):
    # Initialize a zeros array of size 3
    feats = np.zeros(3, dtype=int)
    # set the bias term to 1
    feats[0] = 1
    for word in tweet.split(" "):
        # Add the word's positive-class frequency
        if (word, 1) in freqs: feats[1] += freqs[(word, 1)]
        # Add the word's negative-class frequency
        if (word, 0) in freqs: feats[2] += freqs[(word, 0)]
    assert feats.shape == (3,)
    return feats
# Test make_features function
sample = traindf.tweet.iloc[10]
sample_feats = make_features(sample, freqs)
print("Sample tweet:", sample)
print("Sample features:", sample_feats)
Now, make the features, then train and test the model.
from functools import partial
from sklearn.linear_model import LogisticRegression
X_train = traindf.tweet.apply(partial(make_features, freqs=freqs))
X_train = np.stack(X_train.values)
y_train = traindf.target.values
X_val = valdf.tweet.apply(partial(make_features, freqs=freqs))
X_val = np.stack(X_val.values)
y_val = valdf.target.values
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
%%time
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))
print("Validation accuracy:", clf.score(X_val, y_val))
These are acceptable results. However, we can improve our model by stemming and removing stop words and punctuation.
Let's do it...
We already lowercased the tweets and removed hashtags. Now, we will also remove old-style retweet text ("RT") and hyperlinks, tokenize with NLTK's TweetTokenizer instead of the plain .split method, remove stop words and punctuation, and stem the remaining words.
from nltk.corpus import stopwords # module for stop words that come with NLTK
from nltk.stem import PorterStemmer # module for stemming
from nltk.tokenize import TweetTokenizer # module for tokenizing strings
from string import punctuation # common punctuations
# run the line below to download the stopwords corpus (needed only once)
nltk.download('stopwords')
stopwords_english = stopwords.words('english')
stemmer = PorterStemmer()

def preprocessing2(tweet, tokenizer):
    # remove old style retweet text "RT"
    tweet = re.sub(r'^RT[\s]+', '', tweet)
    # remove hyperlinks
    tweet = re.sub(r'https?:\/\/.*[\r\n]*', '', tweet)
    # tokenize
    tweet = tokenizer.tokenize(tweet)
    # remove stop words and punctuation, then stem each remaining word
    clean_tweet = [
        stemmer.stem(word) for word in tweet
        if (word not in stopwords_english) and (word not in punctuation)
    ]
    # Make a string from the tokens
    clean_tweet = " ".join(clean_tweet)
    return clean_tweet
# instantiate tokenizer class
tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True, reduce_len=True)
%%time
df["tweet"] = df.tweet.apply(partial(preprocessing2, tokenizer=tokenizer))
df.sample(6)
Split data, build freqs, and train
# Split data
traindf, valdf = train_test_split(df, shuffle=True, test_size=.20)
print("Shape of train and val set:", traindf.shape, valdf.shape)
# Verify the classes of both splits
print("Samples distribution in train set:", dict(traindf.target.value_counts()))
print("Samples distribution in val set:", dict(valdf.target.value_counts()))
%%time
# Make freqs
freqs = build_freqs_dict(traindf)
%%time
# Make features
X_train = traindf.tweet.apply(partial(make_features, freqs=freqs))
X_train = np.stack(X_train.values)
y_train = traindf.target.values
X_val = valdf.tweet.apply(partial(make_features, freqs=freqs))
X_val = np.stack(X_val.values)
y_val = valdf.target.values
print(X_train.shape, y_train.shape, X_val.shape, y_val.shape)
%%time
# Train and test
clf = LogisticRegression()
clf.fit(X_train, y_train)
print("Train accuracy:", clf.score(X_train, y_train))
print("Validation accuracy:", clf.score(X_val, y_val))
Wow! We gained a significant improvement with better preprocessing.
Now, let's test our model on our own text...
def test(sent):
    sent = preprocessing2(sent, tokenizer)
    test_feats = make_features(sent, freqs)
    y_pred = clf.predict(test_feats.reshape(1, -1))
    if y_pred[0] == 1: return "positive"
    elif y_pred[0] == 0: return "negative"
    else: return None
test("I am happy about the results")
test("This worked out fine.")
print(test("I lost my phone.")) # This should be negative
# Why is this positive?
print(test("lost")) # returns positive
# This is likely because "lost" appears in enough positive tweets in the training data
# that its positive frequency outweighs its negative frequency.
# This is a drawback of using raw word frequencies.
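# We can check this by looking the word up in the frequency dictionary
# (a quick diagnostic; the exact counts depend on your train/val split):
print("positive count for 'lost':", freqs.get(("lost", 1), 0))
print("negative count for 'lost':", freqs.get(("lost", 0), 0))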
test("Julia broke up with John") # What? that's not good
test("I forgot my lunch at home")
test("I'm sick")
test("this is a sick beat")
test("Get away from me")
So the frequency count method is significantly faster and more accurate (at least on the validation set) than one-hot encoding. However, further analysis is needed, and we might need more complex features for more difficult datasets.