Sentiment analysis with this approach

Problem: The input data consists of tweets that users posted about six airlines in the United States. The goal is to classify each tweet as positive, neutral, or negative. This is a standard supervised learning task in which labeled tweets serve as the training data. Using the training set, we build a model that predicts these classes.

Solution: We approach this problem with machine learning, following the steps mentioned above. We begin by importing the libraries needed for the analysis. Exploratory data analysis is the next step, where we examine how the columns of the data correlate and get to know the input. Next, a new step is added to preprocessing, where we process the text. Because machine learning algorithms work on numerical data, we convert the text input into numeric input. Finally, once preprocessing is complete, we train the ML models on the data and evaluate their accuracy.

# Data Source: https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv
import_data_url = "https://raw.githubusercontent.com/kolaveridi/kaggle-Twitter-US-Airline-Sentiment-/master/Tweets.csv"

The first steps are data import and exploratory data analysis (EDA). In this story, we don’t go through the code for importing the data and performing the initial cleanup and EDA, because the focus here is on integrating NLP into this classic workflow. If you want to go through the whole source code, I’ve linked the whole project at the bottom of this story.

Now that we’ve completed the preliminary preprocessing, it’s time to clean up the data in ways specific to natural language. This is the part where we add code that applies only to text.

The NLP layer of the ML cycle

In the next two sections, we begin to deal with natural language. The approach is simple: we clean the text and then convert it into vectors, so that the model can make better predictions.

# Cleaning of data: Since these tweets might contain punctuation marks and other non-relevant characters, we will process those and remove them from the model
# Let us also divide the feature and label sets for this data
import re

feature_set = sentiment_tweets.iloc[:, 10].values
label_set = sentiment_tweets.iloc[:, 1].values
cleaned_feature_set = list()
for input_phrase in range(0, len(feature_set)):
    # 1. Replace all special characters (*, etc.) and stand-alone single letters (a, etc.) with spaces
    clean_feature = re.sub(r'\W', ' ', str(feature_set[input_phrase]))
    clean_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', clean_feature)
    clean_feature = re.sub(r'^[a-zA-Z]\s+', ' ', clean_feature)
    # 2. Convert the entire phrase to lower case
    clean_feature = clean_feature.lower()
    cleaned_feature_set.append(clean_feature)
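
To see what these substitutions do to a single tweet, here is a minimal sketch that wraps the same three regular expressions in a helper. The function name `clean_tweet` and the sample tweet are made up for illustration and are not part of the original project.

```python
import re

def clean_tweet(raw_text):
    """Apply the cleaning steps from the loop above to one tweet."""
    text = re.sub(r'\W', ' ', str(raw_text))      # every non-word character becomes a space
    text = re.sub(r'\s+[a-zA-Z]\s+', ' ', text)   # drop stand-alone single letters
    text = re.sub(r'^[a-zA-Z]\s+', ' ', text)     # drop a single letter at the very start
    return text.lower()

# Hypothetical sample tweet, not taken from the data set
sample = "@VirginAmerica I didn't expect this, but what a GREAT flight!"
print(clean_tweet(sample).split())
```

The mention symbol and the punctuation disappear, and so do the stand-alone letters. Note that the contraction "didn't" is reduced to "didn", so this simple cleanup loses the negation.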

Other libraries used to process text

We use two additional tools in the section below to improve text processing.

NLTK stop words: A stop word is a commonly used word (such as “the”, “a”, “an”, “in”) that most search engines are programmed to ignore. Search engines skip these words both when indexing entries and when retrieving results for a search query. For our purposes, these words carry little information and only consume processing time and resources. NLTK (Natural Language Toolkit) is a Python library that ships stop word lists for 16 different languages. Below is the predefined list of English stop words in NLTK. You can edit this list by adding or removing words in the English stop word file in the stopwords directory of your local NLTK data.

{‘ourselves’, ‘hers’, ‘between’, ‘yourself’, ‘but’, ‘again’, ‘there’, ‘about’, ‘once’, ‘during’, ‘out’, ‘very’, ‘having’, ‘with’, ‘they’, ‘own’, ‘an’, ‘be’, ‘some’, ‘for’, ‘do’, ‘its’, ‘yours’, ‘such’, ‘into’, ‘of’, ‘most’, ‘itself’, ‘other’, ‘off’, ‘is’, ‘s’, ‘am’, ‘or’, ‘who’, ‘as’, ‘from’, ‘him’, ‘each’, ‘the’, ‘themselves’, ‘until’, ‘below’, ‘are’, ‘we’, ‘these’, ‘your’, ‘his’, ‘through’, ‘don’, ‘nor’, ‘me’, ‘were’, ‘her’, ‘more’, ‘himself’, ‘this’, ‘down’, ‘should’, ‘our’, ‘their’, ‘while’, ‘above’, ‘both’, ‘up’, ‘to’, ‘ours’, ‘had’, ‘she’, ‘all’, ‘no’, ‘when’, ‘at’, ‘any’, ‘before’, ‘them’, ‘same’, ‘and’, ‘been’, ‘have’, ‘in’, ‘will’, ‘on’, ‘does’, ‘yourselves’, ‘then’, ‘that’, ‘because’, ‘what’, ‘over’, ‘why’, ‘so’, ‘can’, ‘did’, ‘not’, ‘now’, ‘under’, ‘he’, ‘you’, ‘herself’, ‘has’, ‘just’, ‘where’, ‘too’, ‘only’, ‘myself’, ‘which’, ‘those’, ‘i’, ‘after’, ‘few’, ‘whom’, ‘t’, ‘being’, ‘if’, ‘theirs’, ‘my’, ‘against’, ‘a’, ‘by’, ‘doing’, ‘it’, ‘how’, ‘further’, ‘was’, ‘here’, ‘than’} 
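
As a rough illustration of what stop word removal does, the sketch below filters a sentence against a small subset of the list above. The subset is hard-coded so that the example runs without downloading the NLTK corpus; the helper name `remove_stop_words` and the sample sentence are assumptions for illustration.

```python
# A small subset of the English stop words listed above (hard-coded on purpose)
STOP_WORDS = {"the", "a", "an", "in", "is", "was", "to", "and", "my", "but", "on"}

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Return the lower-cased tokens of `text` that are not stop words."""
    return [token for token in text.lower().split() if token not in stop_words]

tokens = remove_stop_words("The flight was late and my bag is still in Denver")
print(tokens)
```

Only the words that carry the content of the complaint remain. In the workflow below, the same filtering happens inside TfidfVectorizer through its stop_words argument.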

TF-IDF (term frequency – inverse document frequency) TfidfVectorizer: In the bag-of-words model, every word has the same weight and importance. In practice, the topic of a text can be inferred from the words that are repeated within it. The logic behind TF-IDF is that words that appear many times in one document but rarely across all documents together play a larger role in classification. As the name suggests, TF-IDF combines two measures: term frequency and inverse document frequency. In simple terms, the weight of a word increases when its term frequency (the share of the document's words that are this word) is higher and when its inverse document frequency is higher, that is, when the word occurs in fewer documents. A typical application of TF-IDF is in search engine optimization algorithms.

  • TF = (Number of occurrences of the word in a document) / (Total number of words in that document)
  • IDF = Log ((Total number of documents) / (Number of documents containing the word))
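
The two formulas above can be worked through by hand on a tiny corpus. The sketch below assumes a natural logarithm and whitespace tokenization, and the three sample documents are made up. scikit-learn's TfidfVectorizer applies a smoothed IDF and normalizes each row by default, so its numbers will differ from this textbook version.

```python
import math

def term_frequency(word, document):
    """Occurrences of `word` divided by the number of words in the document."""
    tokens = document.split()
    return tokens.count(word) / len(tokens)

def inverse_document_frequency(word, documents):
    """Log of (total documents / documents containing `word`)."""
    containing = sum(1 for doc in documents if word in doc.split())
    if containing == 0:
        return 0.0  # word never occurs; avoid dividing by zero
    return math.log(len(documents) / containing)

def tf_idf(word, document, documents):
    return term_frequency(word, document) * inverse_document_frequency(word, documents)

# Made-up corpus for illustration
corpus = ["flight delayed again", "great flight great crew", "lost my bag"]
print(tf_idf("great", corpus[1], corpus))   # frequent here, rare elsewhere
print(tf_idf("flight", corpus[1], corpus))  # appears in two of three documents
```

In the second document, "great" gets a higher weight than "flight": it occurs twice there and in no other document, while "flight" is shared with the first document.
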
# Changing the text to a numerical form: All machine learning and statistical models use mathematics and numbers to compute data. Since the input here is textual, we will use the TF-IDF scheme to process words.
# Import the necessary packages
from nltk.corpus import stopwords  # requires a one-time nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
input_vector = TfidfVectorizer(max_features=3000, min_df=6, max_df=0.8, stop_words=stopwords.words('english'))
cleaned_feature_set = input_vector.fit_transform(cleaned_feature_set).toarray()
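
The min_df, max_df, and max_features arguments control which words make it into the vocabulary. The following pure-Python sketch mimics that filtering on a toy corpus so the effect is easy to see. It is an illustration of the idea, not scikit-learn's implementation, and the helper name `select_vocabulary` is hypothetical.

```python
from collections import Counter

def select_vocabulary(documents, min_df=1, max_df=1.0, max_features=None):
    """Keep words whose document frequency lies between min_df and max_df * n_docs."""
    n_docs = len(documents)
    doc_freq = Counter()   # number of documents containing each word
    term_freq = Counter()  # total occurrences of each word in the corpus
    for doc in documents:
        tokens = doc.split()
        term_freq.update(tokens)
        doc_freq.update(set(tokens))
    kept = [w for w, df in doc_freq.items() if min_df <= df <= max_df * n_docs]
    # Rank by corpus-wide frequency (ties alphabetical) and cap at max_features
    kept.sort(key=lambda w: (-term_freq[w], w))
    return kept[:max_features] if max_features else kept

# Made-up corpus for illustration
docs = ["flight late", "flight cancelled", "flight late again", "flight crew great"]
print(select_vocabulary(docs, min_df=2, max_df=0.8))
```

Here "flight" is dropped because it appears in every document (more than 80% of them), and the words that occur in only one document are dropped by min_df=2. With the article's settings, a word must occur in at least 6 tweets and in at most 80% of them to be kept.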

Train, test, split

The NLP section of the workflow is now complete. From here, we transition back to traditional machine learning. The train/test split method is used to validate the performance of the model. The training portion is used to fit the model, and the testing portion is used to evaluate how well the trained model fits. The function randomly distributes the data into training and testing sets.

# Let us now use the Train, Test, Split function to divide this data into training and testing sets. We will use the training set to train the model and find the best suitable model for this prediction and then run that model on the test data to finalize the prediction score
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(cleaned_feature_set, label_set, test_size=0.33, random_state=42)
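
Conceptually, the split shuffles the rows with a fixed seed and holds out a fraction of them. The sketch below shows that idea in plain Python. It is a simplified stand-in rather than scikit-learn's implementation, and the helper name `simple_train_test_split` and the toy data are assumptions.

```python
import random

def simple_train_test_split(features, labels, test_size=0.33, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test_size fraction."""
    indices = list(range(len(features)))
    random.Random(seed).shuffle(indices)      # same seed, same shuffle
    n_test = round(len(indices) * test_size)  # number of held-out rows
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Index features and labels with the same positions so pairs stay aligned
    x_train = [features[i] for i in train_idx]
    x_test = [features[i] for i in test_idx]
    y_train = [labels[i] for i in train_idx]
    y_test = [labels[i] for i in test_idx]
    return x_train, x_test, y_train, y_test

# Toy data standing in for the tweets and their labels
tweets = [f"tweet {i}" for i in range(100)]
sentiments = [("negative", "neutral", "positive")[i % 3] for i in range(100)]
X_train, X_test, y_train, y_test = simple_train_test_split(tweets, sentiments)
print(len(X_train), len(X_test))
```

With 100 rows and test_size=0.33, 33 rows are held out for testing and 67 remain for training. Fixing the seed plays the same role as random_state=42 above: the split is reproducible from run to run.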

Modeling and training

In my experience, it is always a good idea to test the performance of more than one algorithm. The choice of algorithms depends on the type of input, but you should always try more than one. In the following, I test the data with four separate models.

# Import the classifiers used below
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
# Note: each score below is computed on the training set (X_train, y_train)
# Random Forest Classification
rf_classifier = RandomForestClassifier(n_estimators=200, random_state=42)
rf_classifier.fit(X_train, y_train)
rf_classifier_score = rf_classifier.score(X_train, y_train)
# Support Vector Machine Linear Classification
svc_classifier = SVC(kernel='linear')
svc_classifier.fit(X_train, y_train)
svc_classifier_score = svc_classifier.score(X_train, y_train)
# Logistic Regression (multi_class may emit a deprecation warning on recent scikit-learn releases)
lr_classifier = LogisticRegression(random_state=0, solver='lbfgs', multi_class='ovr').fit(X_train, y_train)
lr_classifier_score = lr_classifier.score(X_train, y_train)
# K-Nearest Neighbors Classification
knn_classifier = KNeighborsClassifier(n_neighbors=5)
knn_classifier.fit(X_train, y_train)
knn_classifier_score = knn_classifier.score(X_train, y_train)
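
Because the scores above are computed on the training set, they show how well each model fits data it has already seen. One way to compare the models more fairly is to score each of them on held-out data as well. The sketch below does this on synthetic three-class data generated with make_classification, which stands in for the TF-IDF matrix so that the example runs on its own. The data, the variable names, and the smaller forest size are assumptions, not the article's setup.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for the vectorized tweets (assumption for illustration)
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.33, random_state=42)

candidates = {
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "linear_svc": SVC(kernel="linear"),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "knn": KNeighborsClassifier(n_neighbors=5),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    # Store (training accuracy, held-out accuracy) for each model
    scores[name] = (model.score(X_tr, y_tr), model.score(X_val, y_val))

for name, (train_acc, val_acc) in scores.items():
    print(f"{name:20s} train={train_acc:.3f} held-out={val_acc:.3f}")
```

A large gap between the two numbers for a model is a sign of overfitting, and the model with the best training score is not necessarily the one with the best held-out score.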

Results on training data

Based on the training scores, we see that the Random Forest classifier gives us the best results. While 99.2% accuracy on the training data may indicate overfitting, we can still see how the model performs on the test data set.

Figure: Results for each classifier (image by author)

Test results

Run the same Random Forest classifier on the test data.

# Final prediction using the best-case algorithm from the above table
from sklearn.metrics import accuracy_score
final_pred = rf_classifier.predict(X_test)
# Accuracy score of the final prediction
print(accuracy_score(y_test, final_pred))
>>> 0.7667632450331126

We find that the prediction score on the test data is not as good as on the training data. Since Random Forest was our best-performing algorithm, we can improve this result by building a model with better rules. The preprocessing step, where we add such rules, carries a lot of weight in the analysis. Better rules can often increase accuracy, because the NLP model knows nothing about the grammar, parts of speech, and other conventions we use in everyday language. We supply these rules to the model while building it.
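
A single accuracy number can hide which sentiment classes the model gets wrong, and that is useful to know before deciding which rules to add. The sketch below computes overall accuracy together with a per-class hit rate in plain Python. The labels are made up for illustration and are not results from this project.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def per_class_recall(y_true, y_pred):
    """For each true class, the share of its rows that were predicted correctly."""
    totals = Counter(y_true)
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {label: hits[label] / totals[label] for label in totals}

# Made-up labels for illustration only
y_true = ["negative", "negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "negative", "negative", "positive", "neutral"]

print(accuracy(y_true, y_pred))
print(per_class_recall(y_true, y_pred))
```

In this toy case, every negative tweet is classified correctly while the neutral one is missed entirely, which the overall accuracy alone would not reveal.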

For a detailed implementation of this project, you can visit the repository linked below:

And that’s a wrap!! | Photo: Spencer Bergen on Unsplash
