When thinking about managing and analyzing large and complex datasets, it can be helpful to remember that the volume of data is often not the biggest hurdle for effective analysis, but rather the variety of data sources and formats you need to handle. The Composable platform can help in overcoming this challenge since it excels at allowing you to bring in data from any source to analyze and operate on. The extensibility of Composable makes it simple to interact with new, non-traditional datasets using the tools and techniques you are already familiar with.
Today we will focus on the challenge of data variety, particularly as it relates to non-traditional (aka non-tabular) data sets. We will look at two popular non-relation database tools, MongoDB and Neo4J, and how you can use them with Composable to tackle common data ops scenarios. We will also look at how you can use similar techniques using only Microsoft SQL Server, knowing that introducing new tools into an enterprise is often a challenge in itself!
MongoDB is an example of a Document Database, meaning that rather than storing data records as rows in a table, it stores its records as individual documents. For Mongo, these documents take a form similar to JSON documents, containing a nested hierarchy of named key/value pairs.
A person record with associated addresses in a relational database might look like:
|123 Composable Ave
However, in a document database, this would likely be represented as a hierarchal document instead:
organization: "Composable Analytics",
street_1: "123 Composable Ave",
One advantage of storing data this way is that you don’t need to know exactly what data structure is important to you, or what attributes and relationships are relevant. MongoDB allows you to store these data records in the structure in which they arrive, and then reason about the desired structure for analysis later. It provides this while still giving you tools for querying and extracting the data.
A dataset that often fits this description nicely is Twitter data. Twitter provides an extensive, hierarchical data model for tweets. Storing this information in a tabular database would typically require you to create the tables and foreign key relations to model the tweet structure up front.
MongoDB stores data in Collections rather than tables, and will create them on demand when trying to save data into it for the first time. With this in mind, we can begin to use Composable to read from Twitter and write them into a MongoDB collection called tweets.
In this Composable DataFlow we have 3 primary operations taking place:
- We begin with a search on the Twitter data of interest. This will return the resulting tweets both as a pre-formatted table, or as the raw JSON response with all of the tweet information. This raw format is what we want to consume and store.
- We will annotate this tweet data with both the search term used to find it, and the time in which it was ingested.
- We can then store this annotated document directly into our mongo collection using the MongoDB Store Module in Composable.
Once these tweets are saved within MongoDB, we can begin querying and projecting the results into more useful forms and structures for analysis. For example, we could decide at this point all we’re really interested in is the tweet text and the user’s screen name. In which case we can query for those fields using projections in Mongo.
- We can use the MongoDB Query Module to query and return tweets from our database.
- We will use a Projection Document in our query to specify which fields in the tweet data structure we’re interested in. More information on MongoDB projections can be found here: https://docs.mongodb.com/manual/tutorial/project-fields-from-query-results/.
- When we view the results of this query, we can now see we have just the fields we are interested in analyzing.
Other than running the MongoDB server, there was no other pre-configuration or schema development necessary to begin storing and querying the Twitter data. In addition to projections there are also methods for filtering (https://docs.mongodb.com/manual/tutorial/query-documents/) and aggregating (https://docs.mongodb.com/manual/aggregation/) the results as needed. This makes document databases very effective for rapid data analysis tasks on data sets of any size!
Using SQL Server
Starting with Microsoft SQL Server 2016, you can also take advantage of a similar capability without requiring another database server technology. To demonstrate this, we will use a new data type in SQL Server 2016 called JSON Data (https://msdn.microsoft.com/en-us/library/dn921897.aspx). Using this data type, we can create a table in SQL with a column of type JSON Data and ingest the raw tweet data in a very similar fashion.
We first need to create the table in our SQL database to store this raw data:
CREATE TABLE Tweets (
Id int identity constraint PK_JSON_ID primary key,
Tweet nvarchar(max) CONSTRAINT [Properly formatted JSON] CHECK (ISJSON(Tweet)>0)
Then we can modify our tweet ingest data flow to insert the JSON into SQL instead of MongoDB. We can use the standard SQL Query Module (1) in Composable to insert this data, with a simple parameterized INSERT statement (2 & 3).
Extracting particular values from the tweet is just as simple, using built in JSON data functions in SQL Server 2016. Here we use the same Sql Query Module (1) with a select statement using this functions to extract the tweet text and screen_name (2) and we can view these results as a table in Composable (3) which can then be analyzed further.
Graph Data Operations
Characterizing and understand the nature of relationships in a dataset is another common task in data mining and analysis. With tabular data, however, this can often result in joining many tables together, and can come with a large amount of data management overhead for tracking foreign key relationships, creating and updating indexes, etc… A class of databases for graphs was specifically created to makes this easier and faster. In a graph database, rather than tables with columns, the fundamental building blocks are Nodes and Edges of a graph along with their attributes . Instead of filtering on columns as the primary query mechanism, graph databases take advantage of the natural shape of the data and use graph pattern matching to return nodes and edges of interest.
We can begin to use this with our Twitter data to analyze how tweets are produced and shared among users. To start, we will first create 3 DataFlows specifically for writing the nodes for User and Tweet objects, as well as edges for when a user publishes a tweet, when a user retweets another user, and when a tweet is a retweet of another tweet. We insert data into Neo4j using Cypher queries. More information on the Cypher syntax can be found here: https://neo4j.com/docs/cypher-refcard/current/.
With these DataFlows available to us, we will now use them in our tweet ingest process to first record the basic User -[Tweeted]-> Tweet relationship. We will use our new data flows to save the User and the Tweet nodes (1), then once both of these are saved (2), we will create the edge between them for “Tweeted” (3).
We can immediately see the usefulness of graphs in beginning to visualize how the data is related, seeing which users (in red) are publishing which tweets (in purple).
While this is useful to see which users are more prolific tweeters, we also want to dig deeper into the relationships among the users. For this, we also want to record information in our graph about retweets. We do this by also record the users and tweet nodes (2) and relationship edges (3) when a retweet message is ingested (1).
This begins to give us a much richer graph visualization, where we can start to see centers of influence appear in the Twitter social network.
Beyond taking advantage of visualizations, we can also now use this graph to make queries on the nework. A first example might just be to see who is retweeting others the most.
Using a basic cypher query, we are able to produce this result in Composable to conduct further analysis or reporting on.
RETURN DISTINCT tweeter.screen_name, count(retweeter) as retweet_count
While this would be fairly straightforward in most database systems, graph queries begin to show their true effectiveness when the relationships you are looking for are more complex, or involve long changes of relationships. For example, if you want to query for which users have the most broad reach on Twitter, the graph matching becomes much more effective and clear than filtering across join relationships.
This Cypher query allows to search for Retweet chains in the graph, to return the longest chain for each user, as an approximation of the depth of their influence in the community.
WHERE NOT ( ()-[:Retweeted]->(retweeter) OR (tweeter)-[:Retweeted]->() )
RETURN DISTINCT tweeter.screen_name, LENGTH(path) AS influence
ORDER BY LENGTH(path) DESC
We looked at two data management techniques that provide an alternative to relational tables, and how they can be used to add new capabilities to your data ops toolbox. We examined how you can apply these in Composable with external tools such as MongoDB and Neo4J, or how you can take advantage of some of these using only Microsoft SQL Server. With any tools you want apply to your data, Composable makes it easy to integrate them together to design flexible and robust data operations.
As you take these ideas and apply them, let us know what you discover!