XML data ingestion in Composable, and modeling a model

Peter Insley

Programmers and data scientists at Composable do a variety of service work for our clients. My primary role at Composable is as a data scientist, and on a recent project I was embedded in a team of other data scientists, spending most of the last few months at a client’s offices. I’ll talk a bit about this work here, and how it differed from some other data science projects, without going into details on who the client is or what kind of work they do.

As is usual in machine learning projects, we started out with a great deal of data coming from the “business side” of the company, and we needed to produce a model that could help explain that data. The first step was simply getting the data into a format where we could easily access and transform it. In this case we started with a pretty clean dataset: about 12,000 XML files, each containing most of the readily available data about a given sample (some additional information about each sample could possibly be brought in later). Importantly, each XML had a relatively clean target variable attached.

Unusually, the target variable in this case, while it is the thing we are trying to model, is not a thing in the world which someone measured (pressure at depth X, the number of fishes in the sea, the number of stars a movie has on IMDb, etc.). Here, the target is the result of a previous “model” run on data similar to that contained within the XMLs. We are modeling a “model.” I keep putting the previous model in scare-quotes because it is not a model which can easily be replicated in an automated, machine learning environment, but rather a complicated series of rules and intuitions which individual people use to look at data similar to that in the XMLs, and from which those individual people generate a 1 or a 0, a yes or a no.

To take it back a step, this is a very clear case of supervised learning: we know exactly the thing we’re trying to predict, the target variable; we just don’t know how we might predict it. What is odd (and I can’t go into more detail because of an NDA) is that the thing we are predicting is itself a prediction, subject to its own biases, variance, etc. On the other hand, having a target of any kind is a big step in the right direction, and not every project has one. My academic background is in the biological sciences, and I have many times been in the position of looking at a dataset, not having a clue what I’m supposed to be looking at, and knowing that there is no person on Earth who has more than a vague idea what I might want to look for there. Knowing what to score for is half the battle.

The XMLs themselves are essentially handmade summaries of enormous PDF documents (these can be 100-1000 pages long), and these documents are what people were previously looking at to decide the target variable. Part of the summarizing process involves the introduction of new information, buried in the XML hierarchy itself; because the people making these XML summaries have expertise in the subject area, the XML structure itself turns out to be extremely important in our modeling process.
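One simple way to expose hierarchy information to a model is to count how often each root-to-element path occurs in a document. The sketch below is only an illustration of that idea, not a description of the features we actually built; the sample document, its tag names, and the `path_counts` helper are all hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical document; the real schema is not described in this post.
SAMPLE_XML = (
    "<sample>"
    "<section><item>a</item><item>b</item></section>"
    "<target>1</target>"
    "</sample>"
)

def path_counts(xml_text):
    """Count how often each root-to-element path occurs in one document."""
    counts = Counter()

    def walk(elem, prefix):
        # Build a path such as "/sample/section/item" for this element.
        path = f"{prefix}/{elem.tag}"
        counts[path] += 1
        for child in elem:
            walk(child, path)

    walk(ET.fromstring(xml_text), "")
    return counts

features = path_counts(SAMPLE_XML)
print(features["/sample/section/item"])
```

Each distinct path can then become one numeric column per sample, so that where a piece of information sits in the tree is visible to the model alongside the text itself.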

To bring the XMLs into a usable (read: SQL) form, I used Composable’s DataFlows system for reading the XMLs in bulk and Composable’s Forms system to generate the backend.

[Figure: Ingestion DataFlow]

[Figure: Sample Ingestion Form Element]

The generated backend was a SQL Server database structure that exactly represented the XML structure, with one table per XML field. A variety of primary and foreign keys were automatically generated by Forms and could be used for keying back and forth between these tables; other lookup keys were subsequently added to the tables to permit direct indexing into each table on a per-sample basis. Forms also automatically generated a basic web interface for accessing and manipulating these data, but Composable’s Ryan O’Shea subsequently built a more sophisticated interface on top of the generated backend using various tools from our low-code web development systems (QueryViews, DataFlows, Interfaces), providing overall statistics and various additional interactive elements (slicing of the data, capacity for adding comments, etc.).
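For readers without access to Composable, the general shape of a "one table per XML field" layout can be sketched with the Python standard library. This is not how Forms or DataFlows work internally: it uses SQLite instead of SQL Server, and the sample document, column names, and `ingest` function are assumptions made purely for illustration.

```python
import itertools
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical document; the real schema is not described in this post.
SAMPLE_XML = """<sample id="S1">
  <section>
    <item>first finding</item>
    <item>second finding</item>
  </section>
  <target>1</target>
</sample>"""

def ingest(conn, xml_text, ids):
    """Store each element as a row in a table named after its tag."""
    root = ET.fromstring(xml_text)
    sample_id = root.get("id")  # lookup key copied onto every row

    def visit(elem, parent_id):
        # Tag names become table names, so accept only plain identifiers.
        if not elem.tag.isidentifier():
            raise ValueError(f"unsupported tag name: {elem.tag!r}")
        node_id = next(ids)
        conn.execute(
            f'CREATE TABLE IF NOT EXISTS "{elem.tag}" ('
            "id INTEGER PRIMARY KEY, parent_id INTEGER, "
            "sample_id TEXT, text TEXT)"
        )
        conn.execute(
            f'INSERT INTO "{elem.tag}" VALUES (?, ?, ?, ?)',
            (node_id, parent_id, sample_id, (elem.text or "").strip()),
        )
        for child in elem:
            visit(child, node_id)  # parent_id links child rows to this row

    visit(root, None)
    conn.commit()

conn = sqlite3.connect(":memory:")
ingest(conn, SAMPLE_XML, itertools.count(1))
rows = conn.execute(
    'SELECT text FROM "item" WHERE sample_id = ? ORDER BY id', ("S1",)
).fetchall()
print(rows)
```

The `parent_id` column plays the role of the generated foreign keys, and `sample_id` stands in for the per-sample lookup keys mentioned above, so any table can be filtered to one sample without walking the hierarchy.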

After the data ingestion, others and I spent much of two months poring over these XMLs, generating additional tables with features that might be used to reproduce the human predictions. We recently finished building a model which reproduces those predictions with something like 85% accuracy overall, with especially good accuracy on the unambiguous cases that our model scores highest or lowest (we expect that the most ambiguous 30-40% of cases will not be decidable by our algorithm).
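The idea of deciding only the clear cases can be sketched as a pair of score thresholds, with everything in between left undecided. The thresholds, scores, and labels below are invented for illustration and are not taken from the project.

```python
def triage(scores, low=0.3, high=0.7):
    """Return 1, 0, or None (undecided) for each model score."""
    decisions = []
    for s in scores:
        if s >= high:
            decisions.append(1)
        elif s <= low:
            decisions.append(0)
        else:
            decisions.append(None)  # too ambiguous to decide automatically
    return decisions

def coverage_and_accuracy(decisions, labels):
    """Fraction of cases decided, and accuracy on the decided cases only."""
    decided = [(d, y) for d, y in zip(decisions, labels) if d is not None]
    coverage = len(decided) / len(labels)
    if not decided:
        return coverage, float("nan")
    accuracy = sum(d == y for d, y in decided) / len(decided)
    return coverage, accuracy

# Made-up scores and human decisions.
scores = [0.95, 0.80, 0.55, 0.45, 0.20, 0.05]
labels = [1, 1, 0, 1, 0, 1]
print(coverage_and_accuracy(triage(scores), labels))
```

Widening the gap between `low` and `high` trades coverage for accuracy: fewer cases are decided, but the ones that are decided sit further from the ambiguous middle.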

To keep the model transparent for the business users who will have to look at it afterwards, we use a simple logistic regression model; in any event, we don’t find that gradient boosting methods, neural nets, etc. afford any significant accuracy boost. Logistic regression models can, of course, be trained on features that were generated in an arbitrarily complex fashion.
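To make the transparency argument concrete, here is a minimal logistic regression fitted by plain gradient descent on a toy dataset. It is a sketch only: the data, learning rate, and epoch count are made up, and a real project would more likely use an established library than hand-written code like this.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(w, b, x):
    """Probability that the label is 1 for feature vector x."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Batch gradient descent on the mean log-loss."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Toy data: feature 0 tracks the label, feature 1 is unrelated to it.
X = [[1, 0], [1, 1], [0, 0], [0, 1]]
y = [1, 1, 0, 0]
w, b = fit_logistic(X, y)
print(w, b)
```

Each learned weight is the change in log-odds for a one-unit change in its feature, which is the property that lets a business user read the model feature by feature.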

We found something surprising: by far the most informative features use simple first-order correlations or related approaches to guess the target variable based on simple textual analysis schemes. More sophisticated techniques that were employed to generate specific variables, based on the actual rules that people use to determine these cases, pale in comparison with the power of the statistical techniques. While this is perhaps reminiscent of the experience of engineers working on, e.g., language translation, it still caught most of us here by surprise. The difficulty in our project is not in comprehending the rules that humans use in making these decisions, as it perhaps is in a translation task, but in the low-level feature extraction itself.
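One example of a first-order correlation feature is the phi coefficient between "this token appears in the document" and the 0/1 target. The sketch below shows that calculation under the assumption of whitespace tokenization; the documents and labels are invented, and this is not necessarily the scheme we used.

```python
import math
from collections import Counter

def token_target_correlation(docs, labels):
    """Phi coefficient between token presence and a binary label."""
    n = len(docs)
    n_pos = sum(labels)
    token_sets = [set(doc.lower().split()) for doc in docs]
    # Number of documents containing each token, overall and among positives.
    doc_freq = Counter(t for s in token_sets for t in s)
    pos_freq = Counter(
        t for s, y in zip(token_sets, labels) if y == 1 for t in s
    )
    scores = {}
    for token, df in doc_freq.items():
        denom = math.sqrt(df * (n - df) * n_pos * (n - n_pos))
        if denom == 0:
            continue  # correlation is undefined for constant columns
        scores[token] = (n * pos_freq[token] - df * n_pos) / denom
    return scores

# Made-up text snippets and human decisions.
docs = [
    "approved clean record",
    "approved minor issue",
    "rejected major issue",
    "rejected missing record",
]
labels = [1, 1, 0, 0]
ranked = sorted(
    token_target_correlation(docs, labels).items(),
    key=lambda kv: abs(kv[1]),
    reverse=True,
)
print(ranked[:3])
```

Tokens with a large absolute score can be kept as individual features or summed into a single per-document score, which then feeds the downstream model like any other column.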

When we run out of ideas for modeling – which is more or less the current situation – we can, nevertheless, always go back to the set of rules which informed the original “model” to try to find additional features. We expect that attempting to codify these will be a multi-month process yielding substantial (5%? 10%?) improvements.

That’s all for this post, but this work has gotten me thinking about how one moves from raw data to an explanatory model: what are the steps that one goes through? Can these steps be generalized? I hope to go into more detail on this subject in later posts.