As organizations scale their usage of Databricks, they often encounter challenges related to governance and management of the various components within the environment. In this blog post, we will explore the importance of implementing a monitoring tool to maintain control over databases, tables, and notebooks created within Databricks and show how this can be done with the Composable DataOps Platform.
Challenges in Databricks Environment Governance
- Lack of Governance: The default deployment of Databricks does not provide robust governance mechanisms for managing databases, tables, and files. As a result, organizations face difficulties in maintaining an organized and controlled environment.
- Increasing Number of Databases and Tables: As usage of Databricks grows within an organization, so does the number of databases and tables. Without proper monitoring, it becomes challenging to track and manage this ever-growing list effectively.
- Proliferation of Notebooks, Libraries, and Repositories: Databricks enables users to create and collaborate on notebooks, import libraries, and version control repositories. While this promotes flexibility and productivity, it also introduces complexity in tracking and managing these artifacts across the organization.
Impacts on Cost and Risk
Due to the challenges outlined above, organizations often see a negative impact on both business expense and business risk as they scale up. Specifically, the following impacts occur:
- Cost Impact: Inefficient management of databases, tables, and files can result in increased storage and compute costs. Without monitoring, it becomes difficult to identify unused or redundant resources, leading to unnecessary expenditures.
- Risk Impact: The lack of proper monitoring and governance introduces risks related to data retention, data leakage, and security. Unauthorized access, unsecured data, and potential data breaches become more likely when there is no centralized control over the environment.
The Importance of Programmatic Tracking and Monitoring
Implementing a monitoring tool for Databricks can provide several benefits to organizations:
- Efficient Resource Utilization: With a monitoring tool, organizations can gain visibility into the usage of databases, tables, and files. This information allows for better resource planning and optimization, ensuring that storage and compute resources are utilized effectively.
- Enhanced Governance and Control: By implementing programmatic tracking and monitoring, organizations can establish clear policies and enforce governance rules. This enables administrators to have centralized control over the environment, ensuring compliance with data privacy regulations and security best practices.
- Improved Security and Risk Mitigation: A monitoring tool helps identify potential security vulnerabilities, suspicious activities, and unauthorized access attempts. By proactively monitoring the environment, organizations can implement necessary security measures and mitigate risks before they escalate.
Implementing a Databricks Monitoring Tool with Composable
To address the need for effective monitoring and governance in Databricks environments, organizations can leverage the capabilities of the Composable DataOps Platform. Composable offers robust features that enable seamless integration and comprehensive monitoring of Databricks resources.
Utilizing Composable DataPortals, an organization can create a custom relational data model to store the data that is relevant to their organization. Here, we create a DataPortal with the following core entities:
- DatabricksQuery: Captures queries executed by users, with fields such as: query_id, status, query_text, query_start_time_ms, execution_end_time_ms, user_id, user_name, endpoint_id, rows_produced, query_end_time_ms, lookup_key, executed_as_user_id, executed_as_user_name.
- DatabricksUser: Captures the user information (e.g., username).
- UserResource: Captures the user resources (notebooks, libraries, etc.), with fields such as: object_type, path, object_id.
- Database: Captures the database information.
- TableObject: Captures the table information.
- TableDetail: Captures the table details, with fields such as: captureDateTime, RowCount, isTemporary, Format, Description, createdAt, lastModified, numFiles, sizeInBytes, Column.
- TableColumn: Captures the column information (e.g., name).
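To make the shape of these entities concrete, here is a minimal sketch of two of them as Python dataclasses. This is only an illustration: Composable generates the actual back-end from the DataPortal Model File, and the field types, the choice of required versus optional fields, and the `from_record` helper are assumptions made for this example.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class UserResource:
    """A notebook, library, or other workspace object owned by a user."""
    object_type: str
    path: str
    object_id: int


@dataclass
class DatabricksQuery:
    """One executed query; only a subset of the listed fields is shown."""
    query_id: str
    status: str
    query_text: str
    user_name: str
    query_start_time_ms: Optional[int] = None
    query_end_time_ms: Optional[int] = None
    rows_produced: Optional[int] = None


def from_record(cls, record: dict):
    """Build an entity from a raw dictionary, ignoring keys the entity lacks.

    Hypothetical helper: it keeps only the keys that match dataclass fields,
    so extra attributes in the source record do not cause errors.
    """
    known = {f.name for f in fields(cls)}
    return cls(**{key: value for key, value in record.items() if key in known})
```

For example, `from_record(UserResource, {"object_type": "NOTEBOOK", "path": "/Users/a/etl", "object_id": 7, "language": "PYTHON"})` keeps the three modeled fields and drops `language`.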
The example DataPortal Model File template (xlsx) can be downloaded here.
With this DataPortal Model File, Composable builds out the full SQL back-end. You can create a DataRepository to view, explore and query the generated data model.
Utilizing Composable DataFlows, you can execute scheduled programmatic interrogation of Databricks, via REST APIs, to capture and extract key metadata and usage statistics of the Databricks instance. In this way, you can collect historical data on the databases, tables, queries and notebooks being generated and used within the Databricks environment. By leveraging DataFlows, you can automate the data collection processes, ensuring up-to-date visibility into usage and resource allocation.
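Outside of Composable, the same kind of REST interrogation can be sketched in a few lines of Python. The sketch below is not the DataFlow itself. It assumes the Databricks query history endpoint (`/api/2.0/sql/history/queries`) and workspace listing endpoint (`/api/2.0/workspace/list`) with the response fields shown, which you should verify against the API version of your workspace. The `Transport` abstraction and the function names are hypothetical and exist only so the paging logic can be exercised without a live workspace.

```python
import json
import urllib.parse
import urllib.request
from typing import Callable, Iterator

# A transport takes (path, query parameters) and returns the decoded JSON body.
Transport = Callable[[str, dict], dict]


def make_transport(host: str, token: str) -> Transport:
    """Build a transport that sends authenticated GET requests to a workspace."""
    def get(path: str, params: dict) -> dict:
        url = f"{host.rstrip('/')}{path}"
        if params:
            url += "?" + urllib.parse.urlencode(params)
        request = urllib.request.Request(
            url, headers={"Authorization": f"Bearer {token}"}
        )
        with urllib.request.urlopen(request) as response:
            return json.load(response)
    return get


def iter_query_history(get: Transport, page_size: int = 100) -> Iterator[dict]:
    """Yield query history records, following page tokens until exhausted."""
    params = {"max_results": page_size}
    while True:
        page = get("/api/2.0/sql/history/queries", params)
        yield from page.get("res", [])
        if not page.get("has_next_page"):
            break
        params = {"max_results": page_size, "page_token": page["next_page_token"]}


def iter_workspace_objects(get: Transport, path: str = "/") -> Iterator[dict]:
    """Recursively yield notebooks, libraries, and other non-directory objects."""
    listing = get("/api/2.0/workspace/list", {"path": path})
    for obj in listing.get("objects", []):
        if obj.get("object_type") == "DIRECTORY":
            yield from iter_workspace_objects(get, obj["path"])
        else:
            yield obj
```

A scheduled job could call `make_transport` with the workspace URL and a personal access token, then write the yielded records into the DatabricksQuery and UserResource entities.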
In this example, the DataFlow lists all tables available within a database and iterates through them to capture key table details, such as the number of rows and the table's creation and modification datetimes.
Custom Modules can also be created to interrogate the Databricks environment.
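The list-and-iterate pattern described here can be approximated in plain Python as follows. This is a sketch under stated assumptions, not the DataFlow's implementation: it assumes the tables are Delta tables, that `SHOW TABLES` returns `tableName` and `isTemporary` columns, and that `DESCRIBE DETAIL` returns the columns named in `DETAIL_FIELDS`. The `run_sql` callable and the `collect_table_details` function are hypothetical names introduced for illustration.

```python
from datetime import datetime, timezone
from typing import Callable, Dict, List

# run_sql executes one SQL statement and returns its rows as dictionaries.
RunSql = Callable[[str], List[Dict]]

# Columns copied from DESCRIBE DETAIL (assumed names; check your runtime).
DETAIL_FIELDS = (
    "format", "description", "createdAt",
    "lastModified", "numFiles", "sizeInBytes",
)


def collect_table_details(run_sql: RunSql, database: str) -> List[Dict]:
    """List the tables in a database and capture key details for each one."""
    captured_at = datetime.now(timezone.utc).isoformat()
    results = []
    for table in run_sql(f"SHOW TABLES IN `{database}`"):
        record = {
            "database": database,
            "tableName": table["tableName"],
            "isTemporary": table["isTemporary"],
            "captureDateTime": captured_at,
        }
        # Temporary views have no stored files, so only describe real tables.
        if not table["isTemporary"]:
            qualified = f"`{database}`.`{table['tableName']}`"
            count_rows = run_sql(f"SELECT COUNT(*) AS n FROM {qualified}")
            record["RowCount"] = count_rows[0]["n"]
            detail = run_sql(f"DESCRIBE DETAIL {qualified}")[0]
            record.update({name: detail.get(name) for name in DETAIL_FIELDS})
        results.append(record)
    return results
```

Inside a Databricks notebook, `run_sql` could be supplied as `lambda q: [row.asDict() for row in spark.sql(q).collect()]`. Note that `COUNT(*)` scans each table, so on large tables it may be preferable to run this step less frequently than the rest of the collection.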
Composable also provides several ways to consume the data. With QueryViews, you can create customized views and reports based on the collected Databricks monitoring data. This feature provides a user-friendly interface for exploring and analyzing the metrics.
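As an illustration of the kind of report such a view might expose, the sketch below finds tables that have not been modified for a given number of days, largest first, which is one way to surface unused storage. It uses an in-memory SQLite database as a stand-in for the SQL back-end. The `TableDetail` layout (including the `table_name` column), the sample rows, and the `stale_tables` function are all made up for this example, and it assumes one captured row per table.

```python
import sqlite3

# In-memory stand-in for the monitoring back-end, with made-up sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableDetail (
    table_name      TEXT,
    captureDateTime TEXT,
    RowCount        INTEGER,
    lastModified    TEXT,
    sizeInBytes     INTEGER
);
INSERT INTO TableDetail VALUES
    ('sales.orders',        '2024-03-01', 1000, '2024-02-28', 500000),
    ('sales.orders_backup', '2024-03-01', 1000, '2023-01-15', 500000),
    ('tmp.scratch',         '2024-03-01',    0, '2022-11-02',   2048);
""")


def stale_tables(connection, as_of: str, min_idle_days: int):
    """Return (table_name, sizeInBytes, idle_days) for tables left untouched
    for at least min_idle_days as of the given date, largest first."""
    query = """
        SELECT table_name,
               sizeInBytes,
               CAST(julianday(?) - julianday(lastModified) AS INTEGER) AS idle_days
        FROM TableDetail
        WHERE julianday(?) - julianday(lastModified) >= ?
        ORDER BY sizeInBytes DESC, table_name
    """
    return connection.execute(query, (as_of, as_of, min_idle_days)).fetchall()


for name, size, idle in stale_tables(conn, "2024-03-01", 180):
    print(f"{name}: {size} bytes, idle for {idle} days")
```

With the sample rows, the recently modified `sales.orders` table is excluded, while the backup and scratch tables are reported as candidates for review.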
Additionally, you can create dynamic Web Applications using Composable WebApps.
By harnessing the capabilities of the Composable DataOps Platform, organizations can efficiently implement a comprehensive monitoring tool for their Databricks environment. The platform’s ability to programmatically interact with Databricks, along with the flexibility of interactive QueryViews and compatibility with SQL-backed relational data models, ensures a streamlined monitoring experience.
With programmatic tracking and monitoring in place, organizations can confidently scale their usage of Databricks while maintaining governance and control over their data environment.
As organizations continue to leverage Databricks for their data analytics and processing needs, implementing a monitoring tool becomes imperative. By doing so, organizations can gain better control over the ever-growing number of databases, tables, notebooks, and associated resources. This, in turn, leads to efficient resource utilization, reduced costs, and improved security and risk management.
Composable is a powerful solution for implementing a monitoring tool in Databricks environments. Leveraging Composable DataFlows, interactive QueryViews, and a SQL-backed relational data model, organizations can gain valuable insights into the usage, performance, and resource allocation within their Databricks environment. By adopting this comprehensive monitoring approach, organizations can optimize their Databricks usage, improve governance, and mitigate risks effectively.