Poem generators, tweet generators, news generators, chatbots, machines cracking exams… phew… Natural Language Processing (NLP) has come a long way. Or are we in an echo chamber of media hype? What is the reality?
Reality is where the developers are: what are they coding, what issues arise, what conversations happen? Top developer-focused sites like Stack Overflow give a great picture of this. They are also a great leading indicator of what is coming (the solutions of tomorrow are being developed today). Let’s declutter to get at the NLP reality.
Here is the summary, based on the tags and titles of the queries developers raised. 11,000 queries were analyzed across 1,867 tags as of June 2019.
Attribution: Tag data is from Stack Overflow
Note: 2019 data is a year-to-date number, hence the dip in the chart. At the six-month mark 2019 already stands at ~70% of the 2018 total, so on a pro-rata basis it is on track to significantly exceed 2018.
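To make the pro-rata reasoning in the note concrete, here is a minimal sketch. It assumes the ~70% figure compares 2019 year-to-date volume with the full 2018 total and that volume accrues evenly through the year; the function name `annualized_ratio` is purely illustrative.

```python
def annualized_ratio(ytd_ratio: float, months_elapsed: int) -> float:
    """Scale a year-to-date ratio linearly to a full 12-month year."""
    if not 1 <= months_elapsed <= 12:
        raise ValueError("months_elapsed must be between 1 and 12")
    return ytd_ratio * 12 / months_elapsed


# Figures from the note: ~70% of the 2018 volume after 6 months.
projected = annualized_ratio(0.70, 6)
print(f"Projected 2019 volume: {projected:.0%} of 2018")
```

Under that even-pace assumption, 2019 would close at roughly 1.4 times the 2018 volume.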
Top 15 tags
NLP is on the upsurge, especially after 2016
Machine learning is used more than deep learning… deep learning is rising through the ranks
Earlier years saw Java, PHP, etc. showing up… now it’s mainly Python
NLP is still dominated by the NLTK (Natural Language Toolkit) library, which has been going strong for a decade
Embedding/vector techniques have picked up in the last few years… word2vec, the pioneer, still dominates
Text analysis is the main use case, with text classification picking up
Note: See inset “A Network Of Vectorized Tags” to understand how these technical terms are obtained.
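The inset itself is not reproduced here. As a much simpler stand-in (not the vectorized-tag method the inset refers to), a top-tags list could be approximated by counting which tags appear alongside `nlp`. The sample rows and the helper `top_cooccurring_tags` below are hypothetical, not real Stack Overflow data.

```python
from collections import Counter

# Hypothetical sample of questions as (year, tags); not real Stack Overflow data.
questions = [
    (2017, ["nlp", "python", "nltk"]),
    (2018, ["nlp", "python", "word2vec", "gensim"]),
    (2018, ["nlp", "python", "text-classification"]),
    (2018, ["nlp", "java", "stanford-nlp"]),
]


def top_cooccurring_tags(rows, anchor="nlp", n=15):
    """Count tags that appear alongside `anchor`, most frequent first."""
    counts = Counter()
    for _year, tags in rows:
        if anchor in tags:
            counts.update(t for t in tags if t != anchor)
    return counts.most_common(n)


top_tags = top_cooccurring_tags(questions)
print(top_tags[:3])
```

Grouping the same counts by year would give the kind of trend lines discussed above.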
A quick dive into these areas… vectors, libraries and use cases.
Vectors: The chart below shows a subset of insights focused on vector generation techniques. Word2Vec dominates, but newer architectures like ELMo, BERT, etc. are evolving rapidly and eating into its share.
Libraries: The simplest free libraries, NLTK and scikit-learn, dominate. Stanford pioneered NLP in the 70s with rule-based text processing and continues to command loyalty.
Given Word2Vec’s popularity, it is not surprising to see Gensim up there. spaCy is the package that has made big inroads into NLP. Interestingly, the NLP APIs driven by big tech are yet to trigger a lot of conversation.
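For readers new to the idea, the point of these techniques is that each word becomes a list of numbers, and related words end up with similar lists. The sketch below compares vectors with cosine similarity; the three-dimensional vectors are made up for illustration and are not the output of Word2Vec or any other model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Made-up toy vectors; real embeddings typically have hundreds of dimensions.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "banana": [0.1, 0.05, 0.9],
}

sim_royal = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
sim_fruit = cosine_similarity(toy_vectors["king"], toy_vectors["banana"])
print(f"king~queen: {sim_royal:.2f}, king~banana: {sim_fruit:.2f}")
```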
Use cases: Use cases are predominantly in text analysis/mining. In recent years text classification and sentiment analysis have increased their share. Since sentiment analysis is one form of text classification, it seems clear that text classification is the use case that developers are conversing about.
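To see why sentiment analysis counts as text classification, consider the sketch below: a generic classifier that happens to be given the labels "positive" and "negative". It is a minimal multinomial Naive Bayes with add-one smoothing, written from scratch for illustration; the training sentences are invented and the function names are assumptions, not any library's API.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(examples):
    """Collect per-label word counts from (text, label) pairs."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for text, label in examples:
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        label_counts[label] += 1
        vocab.update(tokens)
    return word_counts, label_counts, vocab


def classify(text, model):
    """Return the label with the highest log-probability for `text`."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        # Log prior plus log likelihood of each known token (add-one smoothing).
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue  # ignore words never seen in training
            likelihood = (word_counts[label][token] + 1) / (total_words + len(vocab))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Invented training sentences, for illustration only.
training = [
    ("great library and clear docs", "positive"),
    ("love this great tool", "positive"),
    ("terrible error and broken install", "negative"),
    ("hate this broken tool", "negative"),
]
model = train_naive_bayes(training)
label_good = classify("great docs", model)
label_bad = classify("broken install", model)
print(label_good, label_bad)
```

Swapping the labels for topics such as "sports" and "politics" would turn the same code into a topic classifier, which is the sense in which sentiment analysis is a special case.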
A quick peek under the hood of each use case indicates that text analysis and information extraction predominantly leverage regex, TF-IDF, NLTK, etc. Text classification/sentiment analysis is where deep learning shows up.
Understanding documents using topic modeling techniques continues with good old LDA and Gensim approaches. Leveraging embeddings and deep learning to make machines understand language isn’t showing up as yet.
In summary, NLP is getting major traction, albeit with a higher mix of existing techniques. Exploration of vectors and deep learning concepts is picking up and gaining share.
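As a rough illustration of the regex and TF-IDF combination mentioned here, the sketch below tokenizes a few question titles with a regular expression and weights each term by tf * log(N / df). This is one common textbook variant; libraries such as scikit-learn apply smoothed formulas, so their numbers will differ. The titles are invented examples.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and pull out alphanumeric tokens with a regex."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tf_idf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Invented question titles, for illustration only.
titles = [
    "How to tokenize text with NLTK",
    "NLTK stopwords for text cleaning",
    "Train word2vec with gensim",
]
scores = tf_idf(titles)
print(sorted(scores[2].items(), key=lambda kv: -kv[1]))
```

Terms that appear in only one title, such as "word2vec", get a higher weight than terms shared across titles, such as "with".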
NLP reality… decluttered via “AI+Code”.