In my last blog post I talked about an ongoing supervised learning project at one of our clients’ offices (involving me, plus a team of data scientists at the client), the aim being to predict a set of human decisions from structured XML documents. Something I didn’t mention in the previous post is that our XML documents also contain a great deal of mostly unstructured text: large, free-form notes which may hold valuable predictive information about the datapoints being modeled. I’ll talk a little here about the efforts we’ve made to make modeling predictions on the basis of this unstructured information.
A natural first approach to handling unstructured text is to generate features using TF-IDF scores (TF-IDF is a commonly used way of normalizing word or term frequencies across a document and corpus, preserving the importance of rare words). After applying a TF-IDF threshold, someone in the modeling group pulled out a list of top-performing words by Gini impurity (a measure of how well a word separates the target classes). We used presence or absence (as well as number of occurrences) of these words as predictive variables in the main logistic regression we used for predictions. While the TF-IDF model performed reasonably well on its own (~75% AUC [link] for a modeling situation with 50/50 positive and negative cases), it added next to nothing to the model when combined with the structured data, presumably because the “meanings” of these lookup terms were already captured by correlated structured fields. We also had to weigh the little these lookup terms offered us in the modeling against the possibility that some of them would be off-putting to our ultimate business users: while the words did correlate statistically with the target variable, these correlations would not be apparent or make intuitive sense to individuals who knew the dataset well. For these reasons, we’ve shied away from the TF-IDF approach.
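For the curious, here is a minimal sketch of what such a pipeline can look like in scikit-learn. The sample data, the threshold value, and the use of a shallow decision tree to obtain a Gini-based term ranking are my illustrative choices, not a record of our actual code.

```python
# A minimal sketch (not our production pipeline): TF-IDF screening, a
# Gini-based term ranking via a shallow decision tree, and lookup-term
# features for a logistic regression. Data and threshold are illustrative.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

notes = [
    "customer disputed the charge and requested escalation",
    "charge approved after review, no dispute raised",
    "routine review, nothing unusual noted",
]
target = np.array([1, 1, 0])  # stand-in for the human decision we predict

# 1. TF-IDF scores, with a threshold to drop low-scoring terms.
vec = TfidfVectorizer(stop_words="english")
X = vec.fit_transform(notes)
keep = np.asarray(X.max(axis=0).todense()).ravel() > 0.3  # placeholder cutoff
terms = vec.get_feature_names_out()[keep]

# 2. Rank the surviving terms by Gini importance, here read off a shallow
#    decision tree (one convenient way to get such a ranking).
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X[:, keep], target)
top_terms = terms[np.argsort(tree.feature_importances_)[::-1][:25]]

# 3. Presence/absence plus occurrence counts of the top terms become
#    predictors in the main logistic regression.
counts = np.array([[n.count(t) for t in top_terms] for n in notes])
features = np.hstack([(counts > 0).astype(int), counts])
model = LogisticRegression(max_iter=1000).fit(features, target)
```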
The second major approach we tried used the well-known Word2vec model (and related word-embedding models such as GloVe). We experimented with pre-trained GloVe vectors, but we also built a Word2vec model locally on our unstructured text, after eliminating stop words (as many as we could find) and applying various other pre-processing optimizations. The locally trained Word2vec vectors provided a very substantial boost in accuracy when added into the model (with most of the improvement in precision / recall among particular populations of interest). The pre-trained vector approaches (GloVe, etc.) did not work well for our dataset, perhaps because the text we were working on differed so substantially in style and substance from the text on which those models had been trained.
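For concreteness, here is a rough sketch of the local-training-plus-averaging recipe using gensim. The tokenization, stop-word handling, and hyperparameters shown are placeholders rather than the settings we actually tuned.

```python
# A sketch of the local Word2vec approach, using gensim (assumed available).
# Tokenization, stop-word handling, and hyperparameters are placeholders.
import numpy as np
from gensim.models import Word2Vec
from gensim.parsing.preprocessing import remove_stopwords

notes = [
    "customer disputed the charge and requested escalation",
    "charge approved after review, no dispute raised",
    "routine review, nothing unusual noted",
]

# Pre-processing: drop stop words, then a simple whitespace tokenization.
tokenized = [remove_stopwords(n.lower()).split() for n in notes]

# Train locally on the notes themselves; tiny corpora need forgiving settings.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5,
               min_count=1, epochs=20, workers=4, seed=0)

def doc_vector(tokens, model):
    """Average the vectors of in-vocabulary tokens; zeros if none are found."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(model.wv.vector_size)

# One dense vector per note, ready to append to the structured-data features.
doc_features = np.vstack([doc_vector(t, w2v) for t in tokenized])
```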
While we continue to investigate neural net models more exotic than Word2vec, we are beginning to think we have reached the limits of our capacity to parse the text based on these kinds of bulk intra-textual relationships (our small dataset may be a limiting factor here). We’re therefore now trying out a (to me) exciting and perhaps old-fashioned approach to NLP: building dedicated parsers which can read through the unstructured text and pull out the (very complicated, very natural-language) information we need, using a mix of English grammar knowledge and endless regular expressions. I’ve never thought so hard about how an English sentence is structured. If this works out (as it looks like it may), it will be an interesting use case in which a kind of brute-force, intuitive approach to NLP works much better than more sophisticated modern methods.
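To give a flavor of the parser approach (without sharing anything from the client’s actual notes), here is a toy extractor in the same spirit. The “approval date” field and the phrasings it matches are invented for illustration; the real parsers target domain-specific language.

```python
# A toy version of the dedicated-parser idea. The "approval date" field and
# the phrasings matched here are hypothetical, invented for illustration.
import re

# Several English phrasings that all express the same underlying fact.
APPROVAL_DATE = re.compile(
    r"(?:approved|signed\s+off)\s+(?:on|as\s+of)\s+"
    r"(?P<date>\d{1,2}/\d{1,2}/\d{2,4})",
    re.IGNORECASE,
)

def extract_approval_date(note):
    """Return the first approval date mentioned in a note, or None."""
    match = APPROVAL_DATE.search(note)
    return match.group("date") if match else None

print(extract_approval_date("Request was signed off as of 3/14/2019 by JD."))
# -> 3/14/2019
```

The real patterns are far hairier than this, of course, but each one encodes a specific piece of English-sentence knowledge in the same way.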