
XML data ingestion in Composable, and modeling a model

Composable is not only a technology firm but also an outcome-driven partner for our clients. Our approach revolves around leveraging our deep technical expertise and adopting a robust systems engineering mindset to assist our clients in tackling complex technical projects and developing efficient operational systems.

Software engineers and data scientists at Composable are dedicated to providing exceptional service to our clients. Recently, our data scientists and I had the opportunity to be fully embedded with our client, working in their office alongside their analytics team to analyze a complex dataset. While I won’t disclose specific details about the client or their industry, I’d like to share some insights about this unique data science endeavor and how it set itself apart from other projects we’ve undertaken.

As is usual in machine learning projects, we started out with a great deal of data coming from the “business side” of the company, and we needed to produce a model that could help explain that data. The first step was simply getting the data into a format where we could easily access and transform it. In this case we started with a pretty clean dataset – about 12,000 XML files, each containing most of the readily available data about a given sample (some additional information about each sample could possibly be brought in later). Importantly, each XML had a relatively clean target variable attached.

Unusually, the target variable in this case, while it is the thing we are trying to model, is not something in the world that someone measured (pressure at depth X, the number of fish in the sea, the star rating of a movie on IMDB, etc.) – here, the target is the result of a previous “model” run on data similar to that contained within the XMLs. We are modeling a “model.” I keep putting the previous model in scare quotes because it is not a model which can easily be replicated in an automated, machine learning environment, but rather a complicated series of rules and intuitions which individual people use to look at data similar to that in the XMLs, and from which those individual people generate a 1 or a 0, a yes or a no.

To take it back a step, this is a very clear case of supervised learning – we know exactly what we’re trying to predict, the target variable; we just don’t know how we might predict it. What is odd (and I can’t go into more detail because of an NDA) is that the thing we are predicting is itself a prediction, subject to its own biases, variance, etc. On the other hand, having a target of any kind is a big step in the right direction, and that is far from the case in all projects. My academic background is in the biological sciences, and I have many times been in the position of looking at a dataset, not having a clue what I’m supposed to be looking at, and knowing that there is no person on Earth who has more than a vague idea what I might want to look for there. Knowing what to score for is half the battle.

The XMLs themselves are essentially handmade summaries of enormous PDF documents (these can be 100-1000 pages long), and these documents are what people were previously looking at to decide the target variable. Part of the summarizing process involves the introduction of new information, buried in the XML hierarchy itself; because the people making these XML summaries have expertise in the subject area, the XML structure itself turns out to be extremely important in our modeling process.
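To illustrate what “the XML structure itself” can contribute, here is a minimal sketch of treating element paths and nesting depth as features in their own right. This is hypothetical, not the client schema or our actual code; the tag names and the particular features are invented for illustration.

```python
# Illustrative only: turn XML structure itself into features, e.g. which
# element paths appear in a document and how deeply nested they are.
# Tag names here are invented, not the client's schema.
import xml.etree.ElementTree as ET
from collections import Counter

def structural_features(xml_path):
    """Return a bag of element paths and simple depth statistics for one XML."""
    root = ET.parse(xml_path).getroot()
    path_counts = Counter()
    max_depth = 0

    def walk(elem, path, depth):
        nonlocal max_depth
        current = f"{path}/{elem.tag}"
        path_counts[current] += 1
        max_depth = max(max_depth, depth)
        for child in elem:
            walk(child, current, depth + 1)

    walk(root, "", 0)
    return {
        "path_counts": dict(path_counts),   # e.g. {"/sample/section/finding": 3, ...}
        "n_elements": sum(path_counts.values()),
        "max_depth": max_depth,
    }
```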

To bring the XMLs into a usable (read: SQL) form, I used Composable’s DataFlows system for reading the XMLs in bulk and Composable’s DataPortals capability to generate the backend.

Ingestion DataFlow

Sample Ingestion Form Element

The generated backend was a SQL Server database structure that exactly represented the XML structure, with one table per XML field. A variety of primary and foreign keys were automatically generated by Forms and could be used for keying back and forth between these tables / fields; other lookup keys were subsequently added to the tables to permit direct indexing into each table on a per-sample basis. Forms automatically generated a basic web interface for accessing and manipulating these data, but Composable’s Ryan O’Shea subsequently built a more sophisticated interface on top of the generated backend using various tools from our low-code web development systems (QueryViews, DataFlows, Interfaces), providing overall statistics and various additional interactive elements (slicing of the data, the capacity to add comments, etc.).
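Purely to illustrate the shape of that generated backend (DataPortals built it for us automatically; the schema, tag names, and SQLite usage below are illustrative assumptions, not the real SQL Server structure), flattening an XML into one table per field with parent keys looks roughly like this:

```python
# Illustrative only: a hand-rolled approximation of the kind of backend that
# was generated automatically: one table per XML field, with keys linking
# child records back to their parent sample. The schema is hypothetical.
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect("samples.db")
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS sample ("
            "sample_id INTEGER PRIMARY KEY, source_file TEXT)")
cur.execute("CREATE TABLE IF NOT EXISTS finding ("
            "finding_id INTEGER PRIMARY KEY, "
            "sample_id INTEGER REFERENCES sample(sample_id), "
            "text TEXT)")

def ingest(xml_path):
    """Insert one XML file: a sample row plus one row per <finding> element."""
    root = ET.parse(xml_path).getroot()
    cur.execute("INSERT INTO sample (source_file) VALUES (?)", (xml_path,))
    sample_id = cur.lastrowid
    for finding in root.iter("finding"):          # hypothetical tag name
        cur.execute("INSERT INTO finding (sample_id, text) VALUES (?, ?)",
                    (sample_id, finding.text or ""))
    conn.commit()
```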

After the data ingestion, others and I spent much of two months poring over these XMLs, generating additional tables with features that might be used to reproduce the human predictions. We’ve recently finished building a model which does reproduce those predictions with something like 85% accuracy overall, with especially good accuracy on the unambiguous cases scored highest or lowest by our model (we expect that the 30-40% most ambiguous cases will not be decidable by our algorithm).
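To make the last point concrete, here is a minimal sketch, not our actual evaluation code, of how held-out predictions can be bucketed by model score so that accuracy in the confident extremes can be compared against the ambiguous middle. The number of bands and the 0.5 decision threshold are assumptions for illustration.

```python
# Illustrative only: check whether accuracy is concentrated in the unambiguous
# cases by bucketing held-out predictions by model score.
import numpy as np

def accuracy_by_score_band(scores, labels, n_bands=5):
    """Split predictions into equal-sized bands by predicted probability and
    report accuracy within each band (middle bands = most ambiguous cases)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    for i, idx in enumerate(np.array_split(order, n_bands)):
        preds = (scores[idx] >= 0.5).astype(int)
        acc = (preds == labels[idx]).mean()
        print(f"band {i} (mean score {scores[idx].mean():.2f}): accuracy {acc:.2f}")
```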

For reasons of transparency for the business users who will have to review the model afterwards, we use a simple logistic regression model — we don’t, in any event, find that gradient boosting methods, neural nets, etc. afford any significant gain in accuracy. Logistic regression models can, of course, be trained on features that were generated in an arbitrarily complex fashion.
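As a rough illustration of that choice (not our actual pipeline; the function, feature names, and split parameters are assumptions), a plain scikit-learn logistic regression keeps the learned weights inspectable for those business users, no matter how elaborate the upstream feature generation was:

```python
# Illustrative only: a plain logistic regression whose coefficients stay
# readable. X is a feature matrix from the feature-generation step,
# y is the 0/1 target, feature_names labels the columns of X.
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def fit_and_explain(X, y, feature_names):
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    print(f"held-out accuracy: {model.score(X_test, y_test):.2f}")
    # Coefficients, sorted by magnitude, give a transparent feature ranking.
    for name, coef in sorted(zip(feature_names, model.coef_[0]),
                             key=lambda t: -abs(t[1])):
        print(f"{name:30s} {coef:+.3f}")
    return model
```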

We found something surprising: by far the most informative features use simple first-order correlations, or related approaches, to guess the target variable from simple textual-analysis schemes. More sophisticated features, engineered from the actual rules that people use to decide these cases, pale in comparison with the power of these statistical techniques. While this is perhaps reminiscent of the experience of engineers working on, e.g., language translation, it still caught most of us here by surprise. The difficulty in our project is not in comprehending the rules that humans use in making these decisions, as it perhaps is in a translation task, but in the low-level feature extraction itself.
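To give a flavor of what “simple first-order correlations” means here (a hypothetical sketch, not our feature code; the vectorizer settings and cutoff are assumptions), one can correlate individual token counts with the 0/1 target and keep the most informative tokens as features:

```python
# Illustrative only: correlate each token's count with the target and keep
# the most strongly correlated tokens as simple textual features.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_correlated_tokens(texts, y, k=20):
    """Return the k tokens whose counts correlate most strongly with the target."""
    vec = CountVectorizer(min_df=5)
    X = vec.fit_transform(texts).toarray().astype(float)
    y = np.asarray(y, dtype=float)
    # Pearson correlation of each token-count column with the 0/1 target.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum()) + 1e-12
    corr = (Xc * yc[:, None]).sum(axis=0) / denom
    vocab = np.array(vec.get_feature_names_out())
    order = np.argsort(-np.abs(corr))[:k]
    return list(zip(vocab[order], corr[order]))
```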

When we run out of ideas for modeling – which is more or less the current situation – we can, nevertheless, always go back to the set of rules which informed the original “model” to try to find additional features. We expect that attempting to codify these will be a multi-month process yielding substantial (5%? 10%?) improvements.

That’s all for this post, but this work has gotten me thinking about how one moves from raw data to an explanatory model – what are the steps that one goes through? Can these steps be generalized? I hope to go into more detail on this subject in later posts.