3 hours ago
i felt like talking about natural language processing (nlp). let's just say...i am in the mood again. swipe. for the culture
nlp, broadly, is the study of enabling computers to understand human languages.
python has some powerful tools that enable you to do nlp.
let's say we had a dataset of recent sports headlines.
nlp- first steps
we want to eventually train a machine learning algorithm to take in a headline and tell us how many upvotes it would receive.
however, machine learning algorithms only understand numbers, not words. how do we translate our headlines into something an algorithm can understand?
the first step is to create something called a bag of words matrix
a bag of words matrix gives us a numerical representation of which words are in which headlines.
a matrix is a rectangular array of numbers or other mathematical objects for which operations such as addition and multiplication are defined.
in order to construct a bag of words matrix, we first find the unique words across the whole set of headlines. then, we setup a matrix where each row is a headline, and each column is one of the unique words. then, we fill in each cell with the number of times that word occured in that headline.
this will result in a matrix where a lot of the cells have a value of zero (sparsity), unless the vocab is mostly shared between the headlines. gucci mane doesn’t appear that many times.
next time, i should mention dealing with the whole sparsity thanggg. lack of words like quavo. #gowizards