Metadata are data that describe data. They are everywhere; we just don’t always notice them. As an example, think about a book you might buy online. The book itself contains text, a kind of data. Information you might find about the book in an online store includes:
The metadata above provide more information about the item you’re looking at. When metadata for many items are brought together and standardized, they become powerful tools for locating and discovering things - like a library catalog or an internet search engine.
Metadata are important because they explain a data set to others. Data sets exist within a certain context, and this context must be communicated well so that others can reuse the data set.
For example, the City of Boston publishes open data on 311 service requests. If a researcher wanted to use this data set but didn’t know that it was about Boston, what a 311 request is, or the year the data was created, it would be very difficult for them to understand or reuse it. Even with this information, without a data dictionary it would be hard to understand what some variables are, what blank values mean, or what values are possible.
Metadata provide necessary information for others (sometimes your future self) to understand the data set and properly reuse it. It often takes time to create metadata, but the effort is worthwhile.
To help others find your data and reuse it appropriately, provide enough detail to make your data citable. You will need the following items:
For more information on how to cite data, click the Citing Data tab on the left side of this guide.
Sharing documentation about your data set is the best way to help others reuse it. Developing the documentation will also help you articulate some of the subtle details within your data.
A common method for documenting your data is writing a data dictionary. A data dictionary explains variable names, potential values, and format. Data dictionaries don’t have to be complicated to be useful - a spreadsheet or text file will do the trick.
An example data dictionary entry from Boston’s 311 service request data looks like this:
Variable Name | Label | Type | Value Codes | Missing Code |
---|---|---|---|---|
OPEN_DT | Case open date | Date (mm/dd/yyyy hh:mm:ss AM/PM) | NA | (BLANK) |
The table above quickly conveys a lot of useful information. According to the table, the variable OPEN_DT holds the date a case was opened, stored as a date-time in the format mm/dd/yyyy hh:mm:ss AM/PM, and a blank cell means the value is missing. Without this, we’d have to contact the creator of the data set to ask “What does OPEN_DT stand for?”, which is time-consuming for everyone involved.
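A data dictionary can also be used directly when checking a data set. The sketch below is one possible way to do that in Python, not a prescribed method: it stores the OPEN_DT entry from the table above as a plain dictionary and classifies raw values as valid, missing, or invalid. The `strptime_format` pattern is our own translation of the documented format, and the names `DATA_DICTIONARY`, `check_value`, and the sample values are hypothetical, made up for illustration rather than taken from the real 311 data.

```python
from datetime import datetime

# One data dictionary entry, copied from the table above.
DATA_DICTIONARY = {
    "OPEN_DT": {
        "label": "Case open date",
        "type": "Date (mm/dd/yyyy hh:mm:ss AM/PM)",
        # Assumption: our own strptime translation of the documented format.
        "strptime_format": "%m/%d/%Y %I:%M:%S %p",
        # (BLANK) in the table: an empty string marks a missing value.
        "missing_code": "",
    }
}


def check_value(variable, raw, dictionary=DATA_DICTIONARY):
    """Return 'missing', 'ok', or 'invalid' for one raw value."""
    entry = dictionary[variable]
    if raw == entry["missing_code"]:
        return "missing"
    try:
        datetime.strptime(raw, entry["strptime_format"])
    except ValueError:
        return "invalid"
    return "ok"


# Made-up sample values for illustration; not real 311 records.
sample_values = ["01/15/2020 09:30:00 AM", "", "2020-01-15 09:30"]
results = [check_value("OPEN_DT", value) for value in sample_values]
print(results)
```

Here the first sample value matches the documented format, the second is blank and so counts as missing, and the third uses a different date layout and is flagged as invalid. Without the data dictionary, there would be no way to tell whether the blank or the third value was an error.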
Other information to include with your data set might be:
Finally, always provide a short narrative about your data set that briefly explains its who, what, when, where, and why. If the data set was the foundation of any published works, be sure to mention that and provide a link, if possible.
Without the proper documentation, your data is unlikely to be reusable.
Here are a few things to try to avoid when providing metadata about your data set: