I recently came across an interesting dataset on Kaggle and thought it would be fun to explore.

I found the Medium Articles dataset, a Kaggle dataset that includes the author, claps, reading time, link, title, and text for 337 articles related to machine learning, artificial intelligence, and data science.

In the following text exploration, I will use only the titles of the articles to explore the authors' popular topics.

1. Import libraries

import re
import pandas as pd
# text processing
import nltk
from nltk.tokenize import WordPunctTokenizer
nltk.download('stopwords')
from nltk.corpus import stopwords
# needed for the nltk.pos_tag function used later
nltk.download('averaged_perceptron_tagger')
nltk.download('wordnet')
from nltk.stem import WordNetLemmatizer
# visualization
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
from wordcloud import WordCloud

2. Import data

df = pd.read_csv("articles.csv")
print(df.shape)
print(df.columns)
df.head()
Data frame. Image by the author.
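Before tokenizing, it is worth a quick check that the title column has no missing values, since the string concatenation in the next step would fail on NaN. This is just an optional sanity check, assuming the column is named title as above.

# optional sanity check: count missing titles and drop any rows without one
print(df['title'].isna().sum())
df = df.dropna(subset=['title'])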

3. Text processing

The process is similar to my previous article on text processing in Python. Therefore, I attach the scripts here but omit the explanations for the repeated parts to avoid redundancy.

a. Tokenization

Divide the titles into a list of tokens.

# change the DataFrame column into a list
title = df['title'].values

# tokenize
title_text = ""
title_all = []
for t in title:
    title_text += (t + " ")
    title_all.append(t)

word_punct_token = WordPunctTokenizer().tokenize(title_text)

There are 4,099 tokens in the titles.
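The count can be verified directly from the token list produced above, along with a peek at the first few tokens:

# quick check: number of tokens and a small sample
print(len(word_punct_token))
print(word_punct_token[:10])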

b. Normalization

Remove unwanted tokens.

clean_token = []
for token in word_punct_token:
    new_token = re.sub(r'[^a-zA-Z]+', '', token)  # remove any values that are not alphabetical
    if new_token != "" and len(new_token) >= 2:   # remove empty values and single-character values
        vowels = len([v for v in new_token if v in "aeiou"])
        if vowels != 0:  # remove tokens that contain only consonants
            new_token = new_token.lower()  # change to lower case
            clean_token.append(new_token)

# Get the list of stop words
stop_words = stopwords.words('english')
stop_words.extend(["could", "though", "would", "also", "us"])

# Remove the stop words from the list of tokens
tokens = [x for x in clean_token if x not in stop_words]

There are 2,214 tokens left after we remove non-alphabetical values, single-character tokens, tokens that contain only consonants, and stop words that carry little information. We removed almost half of the tokens.

The dataset used in this example is small, so removing these tokens does not significantly improve the speed of modelling, but it is crucial when we analyze very large datasets.
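To see how much the normalization step shrinks the token list, we can compare the lengths before and after cleaning, using the variables defined above:

# compare token counts before and after cleaning and stop-word removal
print(f"raw tokens:       {len(word_punct_token)}")
print(f"cleaned tokens:   {len(clean_token)}")
print(f"after stop words: {len(tokens)}")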

c. POS Tag and Lemmatization

Tag each word with its part of speech (POS) and return the word to its base form accordingly.

# POS tag every token and save into a DataFrame
data_tagset = nltk.pos_tag(tokens)
df_tagset = pd.DataFrame(data_tagset, columns=['Word', 'Tag'])

# keep only nouns, adjectives and verbs
tagset_allowed = ['NN','NNS','NNP','NNPS','JJ','JJR','JJS','VB','VBD','VBG','VBN','VBP','VBZ']
new_tagset = df_tagset.loc[df_tagset['Tag'].isin(tagset_allowed)]
text = [str(x) for x in new_tagset['Word']]
tag = [x for x in new_tagset['Tag'] if x != '']

There are more than 30 POS tags, but the tags that carry significant insight mostly fall into the categories of nouns, adjectives, and verbs. So we can filter the other tags out of our tagset.
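If you want to confirm which tags dominate before deciding what to keep, a quick frequency count of the tagged tokens helps; this small sketch simply inspects the df_tagset built above.

# inspect how often each POS tag appears before filtering
print(df_tagset['Tag'].value_counts().head(15))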

# Create lemmatizer object
lemmatizer = WordNetLemmatizer()

# Lemmatize each word and save the output
lemmatize_text = []
for word in text:
    output = [word,
              lemmatizer.lemmatize(word, pos='n'),
              lemmatizer.lemmatize(word, pos='a'),
              lemmatizer.lemmatize(word, pos='v')]
    lemmatize_text.append(output)

# create a DataFrame using the original words and their lemmas
df = pd.DataFrame(lemmatize_text, columns=['Word', 'Lemmatized Noun', 'Lemmatized Adjective', 'Lemmatized Verb'])
df['Tag'] = tag
DataFrame after lemmatization. Image by the author.

The script above creates three columns that store the lemmatized nouns, lemmatized adjectives, and lemmatized verbs. When a word's tag is a noun, its base form is shown in the Lemmatized Noun column; when it is an adjective, in the Lemmatized Adjective column; and when it is a verb, in the Lemmatized Verb column.
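To illustrate why the pos argument matters, here is a tiny standalone example; the sample words are illustrative only and not taken from the dataset.

# the same surface form can lemmatize differently depending on the POS supplied
demo_lemmatizer = WordNetLemmatizer()
print(demo_lemmatizer.lemmatize('learning', pos='n'))  # 'learning' (noun form stays as-is)
print(demo_lemmatizer.lemmatize('learning', pos='v'))  # 'learn'    (verb form is reduced)
print(demo_lemmatizer.lemmatize('models', pos='n'))    # 'model'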

At this point, each part of speech is further subdivided into subcategories. According to [1], nouns are divided into

  1. Singular or mass noun (NN),
  2. Plural noun (NNS),
  3. Singular proper noun (NNP), and
  4. Plural proper noun (NNPS).

Adjectives and verbs are also further subdivided. This creates a little more work when we want to select tokens by group later. Therefore, each subcategory tag is replaced by its main category.

# replace subcategory tags with a single character to simplify grouping
df = df.replace(['NN','NNS','NNP','NNPS'],'n')
df = df.replace(['JJ','JJR','JJS'],'a')
df = df.replace(['VBG','VBP','VB','VBD','VBN','VBZ'],'v')

Then a new column, "Lemmatized Word", containing the base form of each word is created with the following script.

'''
take the lemmatized noun when the tag is a noun, the lemmatized adjective
when the tag is an adjective, and the lemmatized verb when the tag is a verb
'''
df_lemmatized = df.copy()
df_lemmatized['Tempt Lemmatized Word'] = df_lemmatized['Lemmatized Noun'] + ' | ' + df_lemmatized['Lemmatized Adjective'] + ' | ' + df_lemmatized['Lemmatized Verb']
lemma_word = df_lemmatized['Tempt Lemmatized Word']
tag = df_lemmatized['Tag']
i = 0
new_word = []
while i < len(tag):
    words = lemma_word[i].split('|')
    if tag[i] == 'n':
        word = words[0]
    elif tag[i] == 'a':
        word = words[1]
    elif tag[i] == 'v':
        word = words[2]
    new_word.append(word)
    i += 1

df_lemmatized['Lemmatized Word'] = new_word
df_lemmatized.head()

Lemmatized Word created in a DataFrame. Image by the author.
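As a side note, the same selection can be written more compactly with a row-wise lookup, which avoids the manual index bookkeeping; this is just an equivalent sketch assuming the same column names as above.

# equivalent, more compact selection of the lemma that matches each word's tag
pos_column = {'n': 'Lemmatized Noun', 'a': 'Lemmatized Adjective', 'v': 'Lemmatized Verb'}
df_lemmatized['Lemmatized Word'] = df_lemmatized.apply(lambda row: row[pos_column[row['Tag']]], axis=1)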

The final step in text processing is to convert the Lemmatized Word column into a list for the next step.

lemma_word = [str(x) for x in df_lemmatized['Lemmatized Word']]

Now we are ready to create a word cloud to explore the text!
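As a preview of that next step, here is a minimal sketch of how a word cloud can be generated from the lemma_word list with default WordCloud settings; the styling in the follow-up may differ.

# minimal word cloud from the lemmatized words (default settings)
wordcloud = WordCloud(width=800, height=400, background_color='white').generate(' '.join(lemma_word))
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()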
