Artificial Intelligence Machine Learning - Natural Language Processing

AbdiHakim Hussein


Nakuru, Nakuru County


Natural language processing is a branch of machine learning that deals with processing linguistic text and extracting meaningful data from it. Natural language processing is significant to real-world situations because it helps us draw meaningful information from the language of our texts.

Project status: Under Development

Artificial Intelligence

Code Samples [1]

Overview / Usage

Natural language processing is an area of machine learning that deals with processing natural human language: it analyses the context of human words and presents meaningful information extracted from the language dataset. It is an interesting and practical tool, especially when analysing thousands or millions of words in a given context and presenting the output in a comprehensible manner.

To present a text dataset to the machine in a way it can understand, the data is split into a training set and a test set, and several algorithms have to be designed so that the machine can process the chunks of words quickly and grasp the meaning of a sentence or word in an organised format. This enables us to communicate with the machine and have it comprehend us, as long as every piece of the dataset is organised in logical steps. This is what we comfortably call A.I., and it is very applicable when designing bots, robots and language analysers on computers. In the methodology section, we will talk about several methods used in natural language processing and how we can train the machine to understand human language, including the use of classification algorithms.
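Before moving on, here is a minimal sketch of the train/test split mentioned above, using scikit-learn's train_test_split; the variable names X (the word features) and y (the sentence labels) are assumptions for illustration only.

from sklearn.model_selection import train_test_split

# X holds the word features built from the text, y holds the labels (assumed names)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)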

Methodology / Approach

In a given dataset there exist many irregular patterns of words presented in different ways. This makes it difficult for the machine to understand the text, so we have to design a pipeline to rectify the problem. First, we need to clean the text: remove the punctuation and symbols from each sentence, leaving only the word text behind. A closely related step, stop word removal, drops very common words such as "the", "is" or "and" that carry little meaning on their own. Next, we convert the cleaned text to lower case so that the same word written in different cases is treated identically, which makes the model context even easier to analyse. Then splitting the words out of a sentence is applied; this is referred to as tokenization. Tokenization can be of two types: word tokenization and sentence tokenization. In word tokenization, suppose we have the following sentence: "Hi there, welcome to Artificial Intelligence." This sentence will be broken down into chunks of words, forming "Hi", "there", "welcome", "to", "Artificial", "Intelligence". In the sentence scenario, we might have more than one sentence in a context, for example "Please come home. Thank you".
The above text will be tokenized into two sentences:
S1 = "Please come home."
S2 = "Thank you."
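Below is a minimal sketch of the cleaning, lower-casing and tokenization steps described above, assuming the NLTK library and its 'punkt' tokenizer data are installed; the sample text is taken from the examples in this section.

import re
import nltk
from nltk.tokenize import word_tokenize, sent_tokenize

nltk.download('punkt')  # tokenizer models, downloaded once

text = "Hi there, welcome to Artificial Intelligence. Please come home. Thank you."

# Cleaning: keep only letters and spaces, then convert to lower case
cleaned = re.sub(r'[^a-zA-Z\s]', ' ', text).lower()

# Word tokenization on the cleaned text, sentence tokenization on the original
# (sentence tokenization needs the punctuation that cleaning removes)
words = word_tokenize(cleaned)
sentences = sent_tokenize(text)

print(words)      # ['hi', 'there', 'welcome', 'to', 'artificial', 'intelligence', ...]
print(sentences)  # ['Hi there, welcome to Artificial Intelligence.', 'Please come home.', 'Thank you.']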
After the tokenization step, the next thing we can do is apply stemming and lemmatization. Stemming is the process of reducing a word to its root: for example, the words "loved" and "loving" are both reduced to the stem "love". Lemmatization is a related technique that uses a dictionary to map a word back to its base dictionary form (its lemma), so an inflected form such as "loving" becomes "love", and an irregular form such as "better" is correctly mapped to "good", which a simple stemmer cannot do. These algorithms greatly reduce the complexity of word processing. After that, we join the processed words back together with spaces so that the text can be evaluated by a final and powerful algorithm called Bag of Words.

The Bag of Words algorithm takes every distinct word in the dataset and puts it into a table of rows and columns. The columns are filled with all the distinct words and are compared against each row, where a row represents a sentence. If a particular word appears in that sentence, the cell is marked '1', otherwise '0'. As you can see, a binary format is used here, so 0s and 1s are filled throughout the table by comparing each word column against the row's sentence. The rows carry the sentence labels and act as the dependent variable, while the word columns act as the independent variables. In this way we can now train the machine to determine whether a sentence is positive or negative; a short sketch of these steps is shown below.
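A minimal sketch of the stemming, lemmatization and Bag of Words steps just described, assuming NLTK and scikit-learn are installed; the three-sentence corpus is a made-up example.

import nltk
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer

nltk.download('wordnet')  # dictionary data used by the lemmatizer

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem('loved'), stemmer.stem('loving'))   # both reduce to 'love'
print(lemmatizer.lemmatize('loving', 'v'))             # 'love' (verb lemma)

# Bag of Words: rows are sentences, columns are distinct words,
# and each cell marks how often the word appears in the sentence.
corpus = ['please come home', 'thank you', 'welcome home']
vectorizer = CountVectorizer()
bag = vectorizer.fit_transform(corpus).toarray()
print(vectorizer.get_feature_names_out())
print(bag)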

With the Bag of Words matrix as our features, we shall approach another section of machine learning called classification, under the following concepts.

  1. Naïve Bayes classification
  2. Random Forest classification
  3. Decision Tree classification

Naïve Bayes is a classification technique that uses probability to determine whether an object belongs to a certain class, based on how the features of that object correlate with the features of that class. Using probability to find this correlation, the following formula gives the probability of the object belonging to the class: posterior = (prior x likelihood) / evidence.
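As a quick numeric illustration of the formula, with made-up probabilities:

# Hypothetical probabilities, used only to illustrate the formula above
prior = 0.6        # P(class)
likelihood = 0.2   # P(features | class)
evidence = 0.3     # P(features)
posterior = prior * likelihood / evidence
print(posterior)   # 0.4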
Here is a sample of Python code that illustrates the Naïve Bayes technique:

from sklearn.naive_bayes import GaussianNB

# Fit a Gaussian Naive Bayes classifier on the training features and labels
classifier = GaussianNB()
classifier.fit(X_train, y_train)
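As a follow-up, here is a minimal sketch of evaluating the trained classifier on the test set; X_test and y_test are assumed to come from the same train/test split mentioned in the Overview, and the metric functions are from scikit-learn.

from sklearn.metrics import confusion_matrix, accuracy_score

# Predict labels for the held-out test features and measure performance
y_pred = classifier.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(accuracy_score(y_test, y_pred))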

Technologies Used

Python Programming Language - Anaconda

Libraries are:

Numpy
Pandas
Matplotlib
re
nltk
scikit-learn

Classes are:

stopwords (from nltk.corpus)
CountVectorizer (from sklearn.feature_extraction.text)

Repository

https://github.com/Hisaack/A.I-Natural-Language-Processing
