Depending on the application, your data may be spread across 100 or more tables, each with dozens of columns. In many cases, you are primarily interested in only 5 to 20 of those tables – but precisely how do you identify them and map the "join" relationships between them? How can you understand the system when no database documentation is provided?
We frequently encounter these challenges during data migrations, where you must understand the data structure of both the old "giving" application and the new "receiving" application. Beyond migrations, though, making sense of an undocumented database is a recurring task for any database analyst. Let's walk through some typical methods for handling it.
The process always starts with interviewing you to understand what you hope to achieve from the data. From there, we outline a plan that usually comprises three steps:
- Discovery: Looking under the "hood" of the database, exploring artifacts to understand how things fit together.
- Documentation: Summarizing our understanding, highlighting the areas of primary interest to you.
- Data Enabling: Opening a window on the data so that you can make direct sense of it.
Ok -- let's take a deeper look at each of these steps.
Discovery

Our initial purpose is to separate the signal from the noise. Most of the tables and fields are not of concern to us; the real question is, which objects in the database are significant? To answer this, we typically evaluate:
- Total record counts per table. Easily generated with SQL scripts, these help you quickly identify where the bulk of your content rests. A related question: which tables show heavy transaction activity over, say, a seven-day period?
- Within those tables, what makes a record unique? Known as the table's primary key, this helps you understand what real-world entity the table represents: customer, invoice, product, and so on.
- What relationships exist between tables? These can be gleaned from the foreign key definitions, which map columns in one table to columns in another.
- Views and stored procedures. Views can solidify your understanding of how tables relate to each other, while stored procedures help clarify how business logic is applied to the data design.
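The first two discovery questions – row counts and join relationships – are easy to script. The sketch below uses Python's built-in sqlite3 module on a throwaway schema (table and column names are invented for illustration) so it runs anywhere; comments note the rough SQL Server catalog-view equivalents.

```python
import sqlite3

# Throwaway demo schema standing in for a real, undocumented database.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE invoice (
        invoice_id  INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES customer(customer_id)
    );
    INSERT INTO customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO invoice  VALUES (10, 1), (11, 1), (12, 2);
""")

# 1. Row counts per table -- where does the bulk of the content rest?
#    (SQL Server equivalent: SUM(p.rows) FROM sys.tables t
#     JOIN sys.partitions p ON t.object_id = p.object_id
#     WHERE p.index_id IN (0, 1) GROUP BY t.name)
tables = [r[0] for r in con.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]
counts = {t: con.execute(f"SELECT COUNT(*) FROM {t}").fetchone()[0]
          for t in tables}
print(counts)  # {'customer': 2, 'invoice': 3}

# 2. Foreign keys -- how do the tables join?
#    (SQL Server equivalent: sys.foreign_keys / sys.foreign_key_columns)
for t in tables:
    for fk in con.execute(f"PRAGMA foreign_key_list({t})"):
        # fk row: (id, seq, referenced_table, from_col, to_col, ...)
        print(f"{t}.{fk[3]} -> {fk[2]}.{fk[4]}")
```

Run against a real database, the second query yields the join map you need for everything that follows; where foreign keys were never declared, column naming conventions become your fallback.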
Documentation

It's inaccurate to think of documentation as a "second step." The documentation process is ongoing throughout the work: from interviewing, through discovery, to enabling, you annotate what you learn as you go. We typically organize our notes in Markdown using the Sublime Text editor, or within our Atlassian Wiki.
Enabling End Users
The final step in this process is putting the data into end users' hands. The earlier steps have helped us identify the critical data objects that need to be shared; our work at this stage is opening some "windows" into the data for staff to use. To do this right, the windows should be designed with these qualities in mind:
- Preservation of data: make sure that users can't corrupt the integrity of data. Think "read-only."
- Security: Be deliberate about who is able to see the data.
- Performance Neutral: Ensure that retrieving this data does not slow down the original application.
- Easily Interpreted: Make the data more easily understood by humans. Achieve this via column and row reduction, column transforms, etc.
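A database view can satisfy several of these qualities at once: it is read-only by nature, and it can drop sensitive columns and decode internal codes so the data reads naturally. A minimal sketch using sqlite3 (table and column names invented; on SQL Server you would additionally use GRANT SELECT to control who sees the view):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Invented source table with internal codes users should not see.
    CREATE TABLE cust (cust_id INTEGER PRIMARY KEY, nm TEXT,
                       status_cd INTEGER, internal_notes TEXT);
    INSERT INTO cust VALUES (1, 'Acme', 1, 'credit hold last year');

    -- Column reduction + transforms: friendly names, decoded status,
    -- and the sensitive column omitted entirely.
    CREATE VIEW v_customer AS
    SELECT cust_id AS customer_id,
           nm      AS customer_name,
           CASE status_cd WHEN 1 THEN 'Active' ELSE 'Inactive' END AS status
    FROM cust;
""")

print(con.execute("SELECT * FROM v_customer").fetchall())
# [(1, 'Acme', 'Active')]

# "Preservation of data": writes through the view fail.
try:
    con.execute("INSERT INTO v_customer VALUES (2, 'Globex', 'Active')")
except sqlite3.OperationalError as e:
    print(e)  # cannot modify v_customer because it is a view
```

The same CREATE VIEW pattern carries over to SQL Server almost verbatim; the read-only and security guarantees there come from permissions rather than from the view itself.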
There are typically two alternate approaches for actually delivering the data:
- SQL Server Views: Typically the best approach, provided workday system performance is not affected.
- Flat-file or CSV extracts: These can be generated periodically (e.g., nightly) using SQL Server Integration Services.
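Where a full SSIS package is overkill, a scheduled extract can be as simple as a script that writes a query's results to CSV. A hedged sketch with Python's standard csv module (schema invented; in production this would run nightly and write to a shared file path rather than an in-memory buffer):

```python
import csv
import io
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE invoice (invoice_id INTEGER PRIMARY KEY, amount REAL);
    INSERT INTO invoice VALUES (10, 125.00), (11, 80.50);
""")

# Extract the query results to CSV, header row first.
cur = con.execute("SELECT invoice_id, amount FROM invoice ORDER BY invoice_id")
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow([d[0] for d in cur.description])  # header from column names
writer.writerows(cur)                             # one CSV row per result row
print(buf.getvalue())
```

Because the header is derived from cursor metadata, the same few lines work for any query or view, which keeps the nightly job trivial to extend.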
At this point, you've got improved raw data. The final step is putting it into a more easily digested form for staff. This involves end-user applications such as:
- Power BI interface with some level of visualization, dashboard or raw-data view.
- Excel or Access window into data.
- SQL Server Reporting Services reports that are generated on demand or delivered on a recurring interval.
We have just reviewed the three steps in the data discovery process. Let me close by noting a couple of software applications that are useful for accelerating various parts of it.
- Sparx Enterprise Architect: attaches to a database and reverse engineers its design. It can be used to create detailed documentation and to model prospective changes to the database, and it's particularly useful for larger projects.
- Redgate Tools: SQL Dependency Tracker, SQL Data Compare, and SQL Doc can all be useful in gaining a rapid understanding of a complex database.
Are there data sources you would like a clearer, more targeted view of? This article has described the documentation and training approach we take with our customers in Georgia.