Data vendor tear sheets and why they matter

In our last post, we explored how the meaning and importance of data dictionaries have changed over time. Twenty years ago, a data dictionary was a list of field names and types spit out by a database. Now, a data dictionary serves as a sort of spec sheet for a dataset: it must be both accurate and attractive, putting the dataset in its best light so potential buyers understand what the data means and why it’s valuable.

Let’s move one level up to the so-called vendor tear sheet. A tear sheet is a one-page summary document describing a data vendor and its dataset offering. It too must serve several functions: it describes the vendor at a high level (legal company name, contact information) and the data product as a whole, including what type of data it is, whether it’s a time series, how often it’s updated, how it’s delivered, what the legal and contractual requirements of buying it are, and whether it contains any PII. At Syndetic we’ve read hundreds of vendor tear sheets, spoken with dozens of data buyers, and come to the conclusion that, just as with data dictionaries, it’s time for tear sheets to be reformed.

Recently a group of researchers in the machine learning community published a paper called Datasheets for Datasets, proposing a standardized process for documenting datasets. They cite inspiration from the electronics industry, in which electronic parts “no matter how simple or complex” are accompanied by a datasheet outlining their characteristics, recommended use cases, test results, and so on. The authors propose an analogous standard in machine learning, where every dataset is accompanied by a datasheet outlining its motivation, composition, collection process, and recommended uses.

We’d like to expand this recommendation beyond the machine learning community and argue that any dataset exchanged between two companies should be accompanied by a standard datasheet, or tear sheet. Both sides of the market benefit. For data providers, completing the tear sheet encourages reflection on how their data is sourced, delivered, maintained, and used. A tear sheet often also acts as an SLA between buyer and seller, with the data provider promising to notify the customer if, for example, the data is updated less frequently than stated in the tear sheet. (Today, the burden of monitoring conformance to a tear sheet still falls predominantly on the buyer.) For data buyers, a standard tear sheet makes it easier to compare data vendors and datasets with one another. It encourages accountability and transparency within the industry. It also gives the buyer something to refer to if, after purchase, the data doesn’t conform to the specs outlined in the tear sheet. In this way, it acts as another contract between data buyer and seller: not legally binding like the data licensing agreement signed by the parties, but equally important to the success of the ongoing relationship.

Managing tear sheets is becoming increasingly complex for both data providers and data buyers, which makes the adoption of a standard all the more urgent. Account managers at data providers must manage tear sheets across all sales channels: direct to customer and indirectly through the data aggregators that have proliferated in the past 12 months. Just keeping track of which tear sheets have been sent to whom can be a nightmare. Similarly, data buyers must manage the tear sheets they receive from vendors and those they view in aggregators. Important points about a dataset are often lost in SharePoint attachments or in the personal notes of an employee. Standards, and a centralized tear sheet management system, make it much more likely that these attributes don’t get lost.

Some industries are starting to adopt their own tear sheet standards, such as the one developed by the Alternative Data Council within the financial services industry. We are a member of the council, and we built the recommended FISD standard into Syndetic so that our customers, data providers who often sell into financial services, can create and manage their tear sheets from one central system. You can create one for free here.

We expect that other industries will follow the lead of financial services and move to adopt their own best practices and standards for data tear sheets. We will be closely following the standards as they evolve. If you know of an industry working to adopt a standard data tear sheet, drop us a line at tearsheets@getsyndetic.com.

Why does everyone hate their data dictionary?

As a cofounder of Syndetic, I’ve talked to a lot of people about their data dictionaries. At this point, probably dozens of people, ranging from data governance managers at large enterprises to founders of early-stage tech companies. And every single one of them hates their data dictionary. When I say hates, I mean they say something like “Ugh. I won’t even show it to you. It’s an embarrassment.”

Why does everyone hate their data dictionary? On its face, a data dictionary sounds like a relatively simple thing: a sort of meta-spreadsheet, a document describing the meaning of a dataset. Typically this includes field names and types (e.g. string, text, varchar) and maybe some annotations that describe the lineage of the data (where it came from) and its business definition. But as with many workflows that are captured in spreadsheets, things can go awry very quickly.

  1. They are difficult to maintain.

The first person to create a data dictionary for their company usually has great intentions. They may be the first data scientist hired there, or the first data governance professional. They are diligent and organized, and dedicated to the mission of ferreting out every last bit of information about their data. They meticulously craft a spreadsheet (or Google Sheet) that contains the best information available to them at the time. They double- and triple-check it for accuracy. But then, of course, things go off the rails.

Maintaining a data dictionary is not a full-time job. And so the person who created it cannot be expected to be thinking about this document at all times. They go back to doing their day job, and bit by bit, changes start to happen. Engineers change schemas without thinking to alert the person who created or maintains the data dictionary. Data salespeople call their prospects and walk them through the fields of the dataset, but realize that the annotations don’t quite make sense for the prospect’s use case. So they make a copy of the spreadsheet, change the annotation, and send it out. Product teams buy a system to capture data that used to be captured manually. And the dictionary very quickly gets out of date, often before anyone even realizes.

It is dangerous to have any company asset that is so dependent on one person in the organization, in this case the creator of the dictionary. If that person leaves the organization, all history of the document often goes with them, and a new person in that role may be tempted to just scrap it and start over. But then the problem repeats.

  2. They are necessary, but not sufficient, to fully explain the data.

Oftentimes, datasets are shared with business analysts or other non-technical people inside organizations who are tasked with assessing whether the data they are being provided is useful. For these people, receiving a data dictionary containing a bunch of field names and types is the standard. But it doesn’t really help them judge whether the data is of high quality, whether it is better than what could be purchased from a different provider, or whether its quality has improved or declined over time. So the data dictionary is often filed away, and they turn to other means of assessing the data’s quality: Can you send me a sample of 100 rows? What are the coverage rates for each field? What are the most common values that I’m likely to see?
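
Those three questions are easy to answer mechanically once an extract is in hand. Here is a minimal sketch, assuming a CSV extract and the pandas library; the file name is hypothetical.

    import pandas as pd

    df = pd.read_csv("property_transactions.csv")  # hypothetical extract

    # A small sample the buyer can eyeball.
    df.head(100).to_csv("sample_100_rows.csv", index=False)

    # Coverage rate per field: the share of rows that are non-null.
    coverage = df.notna().mean().round(3)
    print(coverage)

    # The most common values the buyer is likely to see in each field.
    for col in df.columns:
        print(col, df[col].value_counts().head(5).to_dict())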

If it all looks good and they start receiving the data, they will often move on to the next data provider and the next assessment. Rarely do teams have the resources to monitor their incoming data files on an ongoing basis for anomalies, like a sudden increase in null values in a particular field. Even more rarely do teams conduct regular data assessments, asking for new sample sets or statistics. They simply move on.
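
For illustration, an ongoing check of that sort does not have to be elaborate. Here is a hedged sketch that compares null rates across two successive deliveries; the file names and the alert threshold are hypothetical, and a real monitoring pipeline would do much more.

    import pandas as pd

    previous = pd.read_csv("extract_previous.csv")  # last delivery (hypothetical)
    current = pd.read_csv("extract_current.csv")    # this delivery (hypothetical)

    # Null rate per field in each delivery.
    prev_nulls = previous.isna().mean()
    curr_nulls = current.isna().mean()

    # Flag any field whose null rate jumped by more than 5 percentage points.
    jump = (curr_nulls - prev_nulls).sort_values(ascending=False)
    anomalies = jump[jump > 0.05]
    if not anomalies.empty:
        print("Fields with a sudden increase in null values:")
        print(anomalies)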

Data as a product is very different from an application or a service because its value depends on many variables besides whether the data itself is good. For example, the usability of the data is extremely important. You can have the most complete dataset in the world on, say, university rankings, but if the data is not usable, it is worthless. By usable, I really mean that it can be easily joined with other datasets. And that’s because people in the market to buy data on university rankings aren’t just curious whether Stanford is ranked #1 again this year. They want to answer questions that require the data to be joined with data on, say, student populations, geography, or fundraising. Rarely is a dataset valuable in isolation. Data providers should understand this, and work as hard as possible to make their data easy to combine with other datasets.
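
To make “easily joined” concrete, here is a toy sketch with hypothetical column names and illustrative values: the rankings become far more useful once they carry a standard identifier that other datasets also use, rather than free-text school names a buyer would have to fuzzy-match.

    import pandas as pd

    # Illustrative values only; column names are hypothetical.
    rankings = pd.DataFrame({
        "school_id": ["U0001", "U0002"],   # a shared, standard identifier
        "school_name": ["Stanford University", "Harvard University"],
        "rank": [1, 2],
    })
    enrollment = pd.DataFrame({
        "school_id": ["U0001", "U0002"],
        "undergrad_enrollment": [7800, 7200],
    })

    # With a shared identifier, combining the two datasets is one line.
    combined = rankings.merge(enrollment, on="school_id")
    print(combined)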

Another way data as a product is special is that it is (usually) a collection of facts. However the data was collected, if a dataset contains information about property transactions, there is an objective truth to the amounts of those transactions. A prospective buyer of that dataset is primarily concerned with whether the data is actually accurate. If it’s not, it’s not only worthless but also potentially very damaging to that company’s business, as decisions (such as pricing) will be made in reliance on that data. It is in every data provider’s interest to invest as many resources as possible in the accuracy of its data, with rigorous testing and monitoring.

  3. They’re ugly.

The standard in data dictionaries is the good old Excel spreadsheet, closely followed by a Word document that has been saved as a PDF. It’s curious to me that for all of the time and money companies spend on product marketing, they do not do a good job of marketing their actual product, which is the data itself. Software companies often pride themselves on design and on making their applications as user-friendly and intuitive as possible. But when they receive an inquiry about their data, they send over a spreadsheet. Surely there is a better way.

  4. They cause confusion within the organization.

As with any workflow trapped in a spreadsheet, users often don’t know if they can trust it. Before sending the dictionary out to an important prospect, a salesperson may look it over and ask a few people in engineering or product if it’s still accurate. They are unlikely to know. If a current customer has a problem with the data, like a broken file, and they call up the support team at the data provider, the support team is going to check the actual file that was sent to the customer, not the data dictionary. And so you have a reference document that is not really reliable, which sows confusion among the many teams that need to work closely together to support their product. Confusion means time wasted that could be spent on more valuable things.

Hate your data dictionary? Drop us a line at inquiries@getsyndetic.com.

Getting started

After you create your Syndetic account, we want to make it as easy as possible for you to get started creating shareable data dictionaries. The first time you sign in, you are brought to this screen:

Welcome screen

There are two workflows: one for users who already maintain a data dictionary and would like to import it into Syndetic, and one for users who are starting from scratch and want to get started sharing information about their datasets.

At Syndetic we think of everything in terms of datasets. A dataset is the slice of data that you want to explain to another person. These may be different bundles of data that you sell, or they may be different categories of data that you work with for internal purposes. For example, let’s say that your company sells data on financial institutions. You may have one dataset related to asset managers, another related to prime brokers, and a third related to stock exchanges. Each of these datasets needs to be individually packaged into a product, which means it needs to be explained and marketed. Collectively, all of the explanations of all of the datasets together comprise your data dictionary.

If you already have a data dictionary, simply click the Upload your data dictionary button and send it to us. We’ll take your dictionary in whatever format it lives in now (Excel spreadsheet, Google Sheet, Word document, PDF), break it down into its component datasets, and load them into Syndetic for you. We turn around dictionaries in 1-2 business days. Once your dictionary is loaded, we’ll send you an email explaining how to manage and start sharing your datasets.

We do the work so you don’t have to!

If you don’t have a data dictionary, hate the one you have (most people do!), or want to start fresh, click Create a dataset.

So fresh and so clean

Now you’ll be prompted to give a name and description to your dataset. Remember, this is to identify a slice of data that you want to share with another person. Use a name and description that you think will be most helpful in explaining the dataset to someone who is not intimately familiar with your database. You can always change it later.

I love describing data over coffee.

Once you’ve described your dataset, you’ll be brought to the screen to upload your data extracts. A data extract is the data itself; it is required in order to get the automatically generated statistics (like coverage rates, top values, and character ranges) and automatic samples. You can use Syndetic without loading a data extract, but it is not nearly as valuable. As we like to say around here, statistics are worth 1,000 spreadsheets! We want to make it as easy as possible for the recipient of your dictionary to get a sense of the shape of the data you are sharing. These simple statistics (along with a small sample set) are the best way to convey the meaning and value of your data.

So click on Upload data extract and you will be brought here.

The self-serve version of Syndetic accepts CSV-formatted files with headers in the first row, as well as files hosted on the internet.

If you have an Excel spreadsheet, you can click File > Save As and select CSV from the drop-down menu to re-save your file in CSV format. Note that in order to save an Excel file as a CSV, your spreadsheet will need to contain only one tab.

Excel-lent
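
If you’d rather script the conversion than click through Excel, here is a minimal sketch using pandas; the file name is hypothetical, and this is just one way to do it.

    import pandas as pd

    # Read the first (and only) tab, then write it out as a CSV with
    # headers in the first row. Reading .xlsx files requires the
    # openpyxl package.
    df = pd.read_excel("asset_managers.xlsx", sheet_name=0)
    df.to_csv("asset_managers.csv", index=False)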

The Enterprise version of Syndetic allows you to hook your database up directly so that you can pull extracts as often as you’d like and get automatically refreshed statistics and samples; contact us for more info. We can connect to any database that speaks ODBC or JDBC, the standard protocols for database connections. Even with the Standard version of Syndetic, the stats and samples will automatically refresh every time you upload a new data extract.
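
For illustration, pulling a fresh extract over ODBC can look something like the sketch below. It uses the pyodbc library as one example (not something Syndetic requires), and the DSN, credentials, and table name are hypothetical.

    import csv
    import pyodbc

    # Connect through an ODBC data source configured on your machine.
    conn = pyodbc.connect("DSN=warehouse;UID=readonly;PWD=secret")
    cursor = conn.cursor()
    cursor.execute("SELECT * FROM asset_managers")

    # Write the result set to a CSV extract, headers in the first row.
    with open("asset_managers_extract.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow([col[0] for col in cursor.description])
        writer.writerows(cursor.fetchall())

    conn.close()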

Once you’ve loaded your dataset, you can get started on the fun part: annotating your fields. We think this part is also the most important, because it’s where you convey the meaning behind the data. Where did this dataset come from? What does it mean to your customer? How do you want to market this dataset? Maybe that’s customer-specific, because different customers may use the same dataset for entirely different purposes. Knowing your customer’s intended use case and describing the dataset for that use case is key. To start annotating, click on Manage fields.

When you click on a field, you are shown the common values and statistics on that field, to help with your annotating. When you click Edit for that field, you are taken to this page.

Describe your field just how you’d like

Because these annotations will appear in your final, published dictionary, think about how you want to describe this field to your customer.

  • Display name is useful if your data contains hard-to-read or especially long field names that you can’t change, e.g. sales_id_quart_20170412
  • Description is meant for describing the field from a business perspective to your customer. Keep it relatively short so that it fits neatly in your dictionary summary table on the published page.
  • Lineage is meant for describing where the data comes from – is it system generated from your app? Was it entered manually by a field sales rep? Does it come from a database you integrated into your tech stack years ago from a company you merged with? Data lineage is important for understanding the history behind this data and why it is what it is.
  • Notes are meant for any extra information about this field that may be helpful to your customer in understanding this field in particular. It’s almost like an introduction to the field, and will show up in bright blue in the published page.

For every field you have the option to publish or hide the statistics and values for that field. You also have the option to hide that field from the sample sets that are auto-generated by Syndetic. This may be useful for fields that contain particularly sensitive information. If you choose to hide a field from the sample sets by unchecking that checkbox, on the published page it will look like this:

From the main datasets page, you can click on Preview at any time to view your dictionary live and see how it will appear to other people you share the link with. This should help with your annotations along the way.

Any changes you make to your dictionary using the management layer are immediately reflected on the published page. It’s as fast as just hitting refresh.

Your organization

We want annotating to be collaborative, which is one of the many benefits of using Syndetic over a spreadsheet. You may want to involve data engineers, customer-facing relationship managers, or salespeople in your annotation process. To add users to your organization, click on Add a User.

Teamwork makes the dream work

When you add a user, you set their email and password, and an email will be sent to their address notifying them that they’ve been added to your organization. You can set permissions for each user that determine their capabilities:

  • Allow this user to manage other users means this user can create other users for your organization
  • Allow this user to create and edit objects means this user can add datasets, annotate fields, and edit content.

By default anyone who signs up from the Syndetic home page creates their own organization, so if you want to sign up your whole team, only one person should create an account from the home page, and they should invite the other users from their team. For enterprise packages, contact us.

From your organization’s homepage, you can also Add a Logo, which will be included on all of your published dictionaries, and manage your billing. Make sure to add a logo so you get a snappy published dictionary page like this!

A logo makes your page look nice

Introducing Syndetic

At Syndetic we make software for data companies. Syndetic literally means connective; it comes from the Greek word syndetikos, which means to bind together. We chose this name because data is connective. While it has become a cliche that data is the new oil, we actually see data as the connective tissue that binds companies to each other. They exchange it in the course of transacting, they ingest it to power their businesses, and they sell it (or give it away) to add value to the greater ecosystem. We’re starting today with tools for companies that sell data, companies that are often misunderstood. Data-as-a-service is relatively new, but more and more companies are offering a data product alongside their core business because they have, in the course of building that business, built a valuable dataset as well.

Companies that sell data need to do two things in addition to building their dataset:

  1. Convey the meaning of their data
  2. Convey its value

Achieving these two things is surprisingly difficult. Why? For one thing, there is a lack of tools in the market specifically designed for DaaS, which means there are a lot of hacked-together solutions out there. For another, data is by its nature fluid; building software for a thing that is constantly changing is difficult. Lastly, there is increasing competition in the data marketplace, as more data companies are founded in every vertical and more incumbents launch data products.

So when a salesperson at a data company calls up a prospect and says “I have some really valuable data I’d like to sell you,” the first thing the prospect is likely to ask is “Okay, what kind of data is it? And how do I know that it’s valuable to me?” And then the salesperson will say, “Let me send you our data dictionary, which explains our data schema, and a sample of the data so you can see what you think.”

Today, data dictionaries are almost always spreadsheets. Some companies keep a spreadsheet in a folder in a shared drive, some use a Google sheet, and some use a Word document that consists of a list of field names, types, and their business meaning. Who manages that spreadsheet often depends on the size of the company and the structure of their data organization – at a tech company or a startup it might be a data scientist, but at a large enterprise it might be a data steward or person responsible for a data governance program. Some companies have lots of process around the spreadsheet – there is an owner who is in charge of the whole spreadsheet, or maybe certain tabs or fields. If someone else who uses the spreadsheet needs to make an update to it, they will send a request to the owner, or submit a ticket through a project management tool. There is some approval process to make the change. Only certain people are allowed to send the spreadsheet out to prospects or current customers. When a field name or definition changes, chaos reigns. Who is in charge of updating the spreadsheet? Who is responsible for letting the customer know so their data pipeline doesn’t break?

This is why we are introducing Syndetic: a web platform for conveying the meaning and value of your datasets. We are purpose-built for DaaS companies, so you don’t need to hack another tool to work for data. We know that it is fluid. We know that what engineering does affects the business side, and vice versa. We know that depending on the use case, different fields may mean different things to different people. 

Go to www.getsyndetic.com to get started – upload your current data dictionary, or create one from scratch.

Allison and Steve, Cofounders