NLP (natural language processing) is a branch of machine learning and AI that deals with human language, and more specifically with bridging the gap between human communication and computer understanding. Its practical applications range from extracting topics from documents, to analyzing the sentiment of customer reviews on social media, to surfacing the needs and struggles of people calling customer support, and even to building near-human conversational agents that take load off call centers.
How NLP Works
- Cleaning and preprocessing the data. Before it can be processed by an algorithm, the textual data must be cleaned and annotated (labeled). Cleaning usually involves text normalization (converting to lowercase, removing punctuation, etc.), removing very common words that carry little meaning on their own (called “stop words”, such as a, the, for), reducing words to their root forms (stemming or lemmatization), and splitting the text into smaller units called “tokens”.
- Vectorization. After preprocessing, the text is transformed into numerical vectors (for example, counts of how often each word occurs), since machine learning models can only handle numerical input.
- Testing. Once a baseline (the “rough draft” NLP model) has been built on the training subset, its prediction accuracy is measured on a held-out test subset to see whether the model generalizes. We don’t want a model that only gives accurate predictions for one specific dataset!
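The three steps above can be sketched end to end in a few lines of plain Python. This is only an illustration under stated assumptions, not a production pipeline: the stop-word list, the toy reviews and their labels, and the helper names (preprocess, build_vocabulary, vectorize, train, predict) are all made up for this example, and the "model" is a crude word-count profile per label standing in for a real classifier. Stemming and lemmatization are left out to keep the sketch short.

```python
import re
from collections import Counter

# Assumption: a deliberately tiny stop-word list; real projects use a fuller one.
STOP_WORDS = {"a", "an", "the", "for", "and", "is", "it", "this", "was", "of", "to"}

def preprocess(text):
    """Lowercase, replace punctuation with spaces, tokenize, drop stop words."""
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def build_vocabulary(token_lists):
    """Map each distinct training token to a column index (sorted order)."""
    vocab = sorted({tok for tokens in token_lists for tok in tokens})
    return {tok: i for i, tok in enumerate(vocab)}

def vectorize(tokens, vocab):
    """Bag-of-words vector: one count per vocabulary token."""
    vec = [0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:  # tokens never seen in training are ignored
            vec[vocab[tok]] = count
    return vec

def train(vectors, labels):
    """Baseline 'model': sum the training vectors of each label."""
    profiles = {}
    for vec, label in zip(vectors, labels):
        profile = profiles.setdefault(label, [0] * len(vec))
        for i, value in enumerate(vec):
            profile[i] += value
    return profiles

def predict(vec, profiles):
    """Return the label whose profile has the largest dot product with vec."""
    def score(label):
        return sum(a * b for a, b in zip(vec, profiles[label]))
    return max(sorted(profiles), key=score)

# Toy labeled data (invented for illustration), already split into subsets.
train_set = [
    ("The product is great, I love it!", "pos"),
    ("Great support and a great price.", "pos"),
    ("Terrible service, I hate waiting.", "neg"),
    ("This was a terrible, broken product.", "neg"),
]
test_set = [
    ("I love the great price", "pos"),
    ("Broken and terrible", "neg"),
]

# 1. Clean and tokenize; 2. vectorize using a vocabulary built from training data only.
train_tokens = [preprocess(text) for text, _ in train_set]
vocab = build_vocabulary(train_tokens)
train_vectors = [vectorize(tokens, vocab) for tokens in train_tokens]

# 3. Fit the baseline on the training subset, then score it on the test subset.
profiles = train(train_vectors, [label for _, label in train_set])
predictions = [predict(vectorize(preprocess(text), vocab), profiles) for text, _ in test_set]
correct = sum(pred == label for pred, (_, label) in zip(predictions, test_set))
accuracy = correct / len(test_set)
print(f"Test accuracy: {accuracy:.2f}")
```

Note that the vocabulary is built from the training subset alone, so the test subset stays unseen until evaluation. With a dataset this small the accuracy figure says nothing about real-world performance; the point is only the shape of the workflow.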