Technology Extraction from People’s Professional Summaries
I’ve joined Slintel as a full-time Data Scientist!
I started my full-time position as a Data Scientist at Slintel earlier this month and I couldn’t be more excited about it! I had been interning at Slintel for 3 months before that, and I learned a lot during my internship. Here’s a quick overview of what I’ve been working on over the past few months.
Overview
With the recent advances in technology, and specifically in the data science domain, we are able to solve tons of amazing problems. Let me introduce one that we at Slintel are trying to solve.
Here’s the gist of it: I was tasked with building an automated process that helps identify that Person X from organisation Z is the right prospect to pitch for technology/product Y. Essentially, identifying high-intent prospects from any organisation for any sales/marketing team across the globe!
So, we are working on technology extraction from people’s professional summary data. Sounds like an NER problem? A variety of approaches come flooding to mind, right? But let’s start from step #1: the research that went into it.
Research
I think the best way to start working on any problem statement is to do your homework: your own research and exploration of different approaches. Guess what :) On exploring, I found that a variety of approaches exist for this task, which is often termed NER (Named Entity Recognition). Libraries like spaCy and pre-trained models like BERT, flair, etc. already exist to help you with this. Let’s get things started then.
Week 1:
My Onboarding :)
Got onboarded with the team, super excited to start my work. Let’s see what we have here to start…
I had the people’s professional summary data, but I soon found out there were some important requirements to sort out first, as I was going to do everything from scratch, which makes it 3000 times more exciting xD.
Week 2:
Pre-requirements
In order to build something and validate the performance of our algorithms, we need some ground-truth data to compare against, right??
Important Requirements:
1. Tagged Dataset: We need a ground-truth list of the technologies/companies matched from each summary, to validate against our list of predicted values.
2. Accuracy Metric: We opted for the Confusion Matrix. Let’s not dig deep into what it is; I am sure you can find that out on your own. An important point to consider here is that the number of FPs (False Positives) is going to be very important for the business use case. Why? Because as a sales representative, we do not want to pitch the wrong product to a person who might not be interested in buying it, right? So, we want FPs as low as possible.
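To make that concrete, here’s a minimal sketch of how we might score a single summary against the tagged dataset (the sets and names here are hypothetical, not our real data):

```python
# A minimal sketch: score one summary's predicted extractions against
# the ground-truth tagged set. False positives are the costly ones here.
def score(true_techs: set, pred_techs: set):
    tp = len(true_techs & pred_techs)   # correctly extracted
    fp = len(pred_techs - true_techs)   # wrongly extracted (the costly ones)
    fn = len(true_techs - pred_techs)   # missed extractions
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

print(score({"python", "aws"}, {"python", "excel"}))  # (0.5, 0.5)
```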
Week 3–4:
Got a Plan ??
Ohkk, wait! Let’s not jump to code too fast; let’s check if we have a plan first??
Yes, based on the research and some domain knowledge, I formulated a plan for how we would proceed with the project.
Classification of Approaches by Dataset Type:
I am pretty sure you already got what I had in mind, but let me explain it once. We can classify the approaches by the data they need: one set works with untagged/raw data, and the other needs contextually tagged data.
- We will explore with untagged data first; the methods involve pattern-matching/rule-based approaches, the use of pre-trained models like spaCy, BERT, etc. for NER, and, lastly, combining rule-based and pre-trained models in an ensemble.
- With a tagged dataset, we will have a great chance to train models like spaCy and different neural networks for NER on our company dataset.
So, we finally started with the untagged data: basic text cleaning, pattern-matching algorithms, and refining the data for good extractions, followed by some regex patterns. On analysing the results, we understood that pattern matching does its work, but wait, we are not able to catch the context of the text for precise extractions.
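For a flavour of the rule-based pass, here’s a minimal sketch assuming a hypothetical technology dictionary (our real list and cleaning steps are much bigger):

```python
import re

# Hypothetical technology dictionary; the real one is far larger.
TECHNOLOGIES = ["Python", "AWS", "Salesforce", "Tableau"]

# One alternation pattern with word boundaries, case-insensitive.
pattern = re.compile(
    r"\b(" + "|".join(map(re.escape, TECHNOLOGIES)) + r")\b",
    re.IGNORECASE,
)

summary = "5+ years building dashboards in Tableau and pipelines on AWS."
print({m.group(1) for m in pattern.finditer(summary)})
# {'Tableau', 'AWS'}
```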
Week 5–7:
NER through spaCy:
Stuck with the limitations of pattern matching, right? Don’t worry, we have spaCy coming to our rescue. So, we tried spaCy with the “ORG” tag and ran it on our entries. It didn’t perform the way we expected it to, but good things take time, right :)
Exploring tag types | spaCy NER
spaCy provides several variations itself, and we thought of trying them out: it has POS-tag-based extraction, as well as other entity types like “PERSON”, “PRODUCT”, “LOCATION”, etc. The name concerned can be extracted under the PERSON tag too, and the same goes for the PRODUCT tag, etc.
We use the spaCy package and load the “en_core_web_sm” pre-trained model. spaCy provides other models too: just replace “en_core_web_sm” with “en_core_web_md” to use the medium-sized model, or with “en_core_web_lg” for the large one.
Let’s see how we can load and run spaCy NER through this snippet:
- Importing and loading
- Calling the nlp object on a random text
- Extracting the NER entries
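The original snippet was shared as an image; here is a minimal reconstruction of those three steps (the example sentence is illustrative):

```python
import spacy

# Importing and loading the small pre-trained English model.
nlp = spacy.load("en_core_web_sm")

# Calling the nlp object on a random text.
doc = nlp("Shubham joined Slintel as a Data Scientist in Bangalore.")

# Extracting the NER entries.
for ent in doc.ents:
    print(ent.text, ent.label_)
# Expected output on this example (as described below):
# Shubham PERSON
# Slintel ORG
# Bangalore GPE
```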
See, we have Shubham extracted as a person (PERSON), Slintel as an organisation (ORG) and Bangalore as a geo-political entity (GPE).
So, coming back to the new approach using spaCy NER with different tags: yes, the score increases, but we need to remember that FPs matter, and here is where it failed. And why not? We are basically enlarging our matching set, with many of the new matches being FPs, so eventually we end up with very high false positives.
One important thing to note here: we found that spaCy doesn’t work well on lowercase data, and proper casing is required. NER accuracy falls drastically on lowercased text. This was an important find from the analysis, so some casing method is needed. Finally, after fixing the casing, we got better results, but still not great. Let’s keep moving then!!
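A quick sketch of the effect (naive title-casing stands in here for whatever truecasing method you prefer):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

raw = "shubham joined slintel in bangalore."  # all-lowercase summary text
cased = raw.title()                           # naive truecasing stand-in

print([(e.text, e.label_) for e in nlp(raw).ents])    # typically few or no entities
print([(e.text, e.label_) for e in nlp(cased).ents])  # entities come back
```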
Week 8–10:
spaCy’s entity ruler
8 weeks in, we wanted better results, and upon digging deeper we found something known as the entity ruler in the spaCy documentation. It is essentially a pattern dictionary that can be added to a spaCy pipeline to tag matching words as entities.
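In spaCy 3.x it looks roughly like this (the patterns here are illustrative):

```python
import spacy

nlp = spacy.load("en_core_web_sm")

# Add the entity ruler before the statistical NER so its patterns take priority.
ruler = nlp.add_pipe("entity_ruler", before="ner")
ruler.add_patterns([
    {"label": "ORG", "pattern": "Slintel"},
    {"label": "PRODUCT", "pattern": [{"LOWER": "tableau"}]},
])

doc = nlp("Built tableau dashboards during my time at Slintel.")
print([(e.text, e.label_) for e in doc.ents])
# e.g. [('tableau', 'PRODUCT'), ('Slintel', 'ORG')]
```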
But, again, it was not able to capture the contextual matches and gave high false positives.
Moving to the Flair model
We finally moved on to other available pre-trained models, like flair, which comes in different versions trained on different standard NER datasets.
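Running one of flair’s pre-trained taggers looks roughly like this:

```python
from flair.data import Sentence
from flair.models import SequenceTagger

# Load a standard pre-trained English NER tagger; other variants
# (e.g. "ner-large") are trained on different standard datasets.
tagger = SequenceTagger.load("ner")

sentence = Sentence("Shubham joined Slintel in Bangalore.")
tagger.predict(sentence)

for entity in sentence.get_spans("ner"):
    print(entity)
```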
Yes, it gave somewhat good results, but it had high false positives, again a blocker. Don’t worry, I’ll set you sailing further from this :). We are finally at a point where we have tried pre-trained models and rule-based methods. Now, what if I somehow refine the patterns and combine these two approaches as an ensemble method? Sounds pretty interesting, right?
Week 11–12:
Ensemble Approach
For the major part, we worked on trying combinations of rule-based/pattern-matching and pre-trained models, and to our surprise we got better results than before when we tested on larger data. The spaCy-based combination gave us the lowest false positives and high true positives in the run. Yet, there were entries that were not captured contextually; these were not extracted by spaCy at all, so we count them as false negatives, and the remaining challenge is with these false negative entries.
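A minimal sketch of the idea; the union below is just one simple way to combine the two passes, and the real refinement of the patterns is more involved:

```python
import re
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical refined dictionary for the rule-based pass.
KNOWN = ["Salesforce", "Tableau", "AWS"]
rx = re.compile(r"\b(" + "|".join(map(re.escape, KNOWN)) + r")\b")

def ensemble_extract(text: str) -> set:
    rule_hits = {m.group(1) for m in rx.finditer(text)}
    ner_hits = {e.text for e in nlp(text).ents
                if e.label_ in {"ORG", "PRODUCT"}}
    # Rules keep precision high on known names; spaCy adds the
    # contextual extractions the patterns miss.
    return rule_hits | ner_hits

print(ensemble_extract("Leads Salesforce rollouts at Acme Corp."))
```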
Presentation Time!
Finally, I got the chance to present my work to the team. It went great and everyone appreciated my efforts and work. Aaaaand, I also got offered a full-time role as a Data Scientist at Slintel!
I strongly believe that any solution can be improved further with better ideas, human interaction, debate and conflict, and I would like to request that you please share your thoughts on possible approaches that you think will prove useful for similar problem statements.
So, let me wrap up this short journey of a blog, but I am excited to keep working on this problem statement. Many of you might be thinking: wait, there are still many more approaches possible. Yes, you are absolutely correct, but we will discuss my further journey with the next experiments and approaches in my coming read. Don’t worry, we will dive deep into the concepts as well :)
Hope you all learned something new from this first try!!
Signing off,
Shubham Sunwalka