What is metadata? A Christmas themed exploration

This content originally appeared on my Scientific American blog, Information Culture, in 2012. While the page remains, images are no longer visible. At the request of several readers, I’m reposting the content here. 

When I talk to most scientists and mention the word “metadata” they look at me as if I’ve grown a second head. Despite the fact that these folks regularly use and create metadata (not to be confused with megadata or “big data” which is a whole other subject), many have not heard of the term.

Broadly speaking, metadata is simply a structured description of something else. The most popular example of metadata comes from the library catalog. Each book has a title, author, call number, publisher, ISBN, etc. listed in the online catalog. These elements make up the book’s metadata, and there are rules to make sure that things are standardized.

Without metadata, discovery and reuse of digital information would be much harder. This is why discussion about metadata has increased greatly since the second half of the twentieth century.

The best way to understand metadata is to look at a few examples of metadata at work.

Here is part of a digital data table:

[Image: screenshot of Santa's Google spreadsheet]

If you stumbled across this list on the web you might be able to guess what it was, but you couldn’t be sure. It would also be difficult to find this list again if you were looking for it. The list creator might find this pretty useful, but if he or she shared it with others, we would want some added information to help the new user understand what he or she was looking at: this is metadata.

Metadata for this data file:

  • Who created the data: Santa Claus, North Pole. An email address would be nice. This way we have some contact information in case we need clarification.
  • Title: “My List” isn’t a title that is conducive to finding the file again. While it might be tempting to just call this “Santa’s list,” that won’t help other folks who see this file. The title should be descriptive of what the data file contains, and “Santa’s list” could be many things: Santa’s list of reindeer? Santa’s list of toys that need to be made? A more descriptive title might be “Santa’s list of naughty and nice children.”
  • Date created: We don’t want to confuse this year’s list (2012) with last year’s list (2011). This could lead to all sorts of unfortunate events where nice kids get coal, naughty kids get presents, or infants (who weren’t around in 2011) get nothing at all.
  • Who created the data file: Perhaps Santa created the data, but then used an elf to input the data into a computer file. Many computer programs automatically record this information, although you may not realize this.
  • How the list was created: Behavioral scans? Parental surveys? Elf on the Shelf reports? All of the above? In order to reuse this data in future research projects, we need to know how it was collected, including collection instruments and methodologies.
  • Definitions of terms used: What is “naughty”? What is “nice”? How did Santa place a child into one category or another?
  • File type: What kind of file is it? The data here are pretty simple, but Santa has lots of different file formats to choose from: Excel, .csv, XML, etc. Knowing the file type helps end users determine if they can use the data.
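
As a rough illustration, the elements above could be written down in a small machine-readable record stored next to the data file. The sketch below is in Python. The field names, the REQUIRED_FIELDS list, and the missing_fields helper are hypothetical choices for this example, not a formal metadata standard, and anything the list above leaves open stays empty.

```python
import json

# Hypothetical field names for illustration; not a formal metadata standard.
REQUIRED_FIELDS = [
    "title", "data_creator", "contact_email", "date_created",
    "file_creator", "collection_method", "definitions", "file_type",
]

santa_list_metadata = {
    "title": "Santa's list of naughty and nice children",
    "data_creator": "Santa Claus, North Pole",
    "contact_email": None,      # not given in the example above
    "date_created": "2012",
    "file_creator": None,       # Santa, or an elf? Unknown here.
    "collection_method": None,  # scans, surveys, reports? Unknown here.
    "definitions": {"naughty": None, "nice": None},  # still to be written
    "file_type": "csv",         # assumption: one of the formats mentioned
}


def missing_fields(record):
    """Return the required fields that are absent or still empty."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if isinstance(value, dict):
            # A nested field counts as missing if any entry is empty.
            if not value or any(v is None for v in value.values()):
                missing.append(field)
        elif not value:
            missing.append(field)
    return missing


print(json.dumps(santa_list_metadata, indent=2))
print("Still to fill in:", missing_fields(santa_list_metadata))
```

A check like this makes the gaps visible: the title and date are in place, but a new user still could not tell how the list was compiled or what “naughty” means.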

Naturally, a different kind of item might have a completely different set of metadata.

This is my mom’s favorite Christmas picture of me:

[Image: me sitting on Santa's lap]

My mom remembers the details of where, when and how this picture was taken, but if she isn’t around to tell the story, metadata can help:

Metadata for this photo:

  • Date the photo was taken: December 1981. The digital version was created on 12/13/2012.
  • Who took the photo: A mall employee. This can have implications for who owns the rights to use and distribute the image. The photographer? The folks who paid to have the photo taken?
  • Camera used to take the photo: I have no idea what camera was used for this picture. Luckily, modern digital cameras often automatically record this information as a part of the .jpg file. Digital cameras can also record all the detailed camera settings (for those who understand these things).
  • Location where the photo was taken: Arnot Mall, Horseheads, NY. Some digital cameras can automatically capture this information too, using built-in GPS.
  • Picture format: .jpg
  • Picture size: The original print is 3.5 x 5.5 inches (I think). The scanned image is 852 x 1116 pixels.
  • Description of the photo: Currently, the primary way of searching for an image is for a computer to search the text associated with it. Good file names and good descriptions can be key to finding the image again. For this photo: Bonnie J M Swoger, age 3, sitting on Santa’s lap. Her grandpa brought her to the mall to visit Santa. While not enthusiastic about it, she loved her grandpa and obliged him by sitting on Santa’s lap.
  • Copyright information: I don’t think the mall Santa folks were thinking about copyright in 1981 because there wasn’t an easy way to copy the photo. These days, it is important to state explicitly what rights other folks have to use the picture. Creative Commons licenses are great for being explicit about what users can do with your content.
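
Since good file names help with finding an image again, here is a minimal Python sketch of building a descriptive file name from a few of the elements listed above. The PhotoMetadata class, the slug helper, and the naming pattern are all hypothetical choices made for this example; the values are taken from the list.

```python
import re
from dataclasses import dataclass


@dataclass
class PhotoMetadata:
    """A few descriptive elements for one photo (hypothetical structure)."""
    date_taken: str   # year and month, e.g. "1981-12"
    subject: str
    location: str
    file_format: str  # extension without the dot
    width_px: int
    height_px: int


def slug(text):
    """Lowercase the text and replace runs of other characters with hyphens."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")


def descriptive_filename(meta):
    """Combine date, subject, and location into a searchable file name."""
    return f"{meta.date_taken}_{slug(meta.subject)}_{slug(meta.location)}.{meta.file_format}"


photo = PhotoMetadata(
    date_taken="1981-12",
    subject="Bonnie J M Swoger on Santa's lap",
    location="Arnot Mall, Horseheads, NY",
    file_format="jpg",
    width_px=852,
    height_px=1116,
)

print(descriptive_filename(photo))
```

A name built this way carries the who, when, and where even if the image gets separated from its fuller description.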

Depending on the type of data, there may be many more metadata elements. Geospatial data, chemical data, astronomical data, etc. each have specific descriptive elements that are used. Many organizations have developed standards describing what kinds of metadata should be included and how the metadata should be formatted. This helps data creators add metadata that can be read by computers and reused by other interested folks.

Once you have well-established metadata formats, you can start analyzing the metadata. Common metrics used to evaluate scholarly publications (impact factor, altmetrics, etc.) all rely on high-quality metadata.

I think we can agree that Santa would use sound data management practices, including the creation and use of proper metadata, to keep track of his gift-giving and logistical data. He would want the rest of us to use good metadata so we can always locate that 30-year-old picture of him, too.

Be like Santa and make sure your data is findable and reusable: use good metadata!

For a more robust (yet clear and understandable) definition of metadata, see NISO’s Understanding Metadata (PDF).


Originally published by Scientific American in 2012.

Tracking down articles from science news stories

Readers of this blog may be interested in a piece I wrote for the Scientific American Guest Blog, How to: Track down journal articles cited in news stories (when they don’t link directly).

Many blog posts will link directly to a version of the original article, but many news sources have a policy of not linking to the original source. Even when a blog links directly to the original article, you may not be able to read it without paying. But there are steps you can take to find the original article, and to find a version of it you can read.


In which container is the journal article I need?

The other day, I got an email from a faculty member.  A scholarly society he is a member of had just announced that its journals would now be available in JSTOR.  He went straight to JSTOR to look them up, only to see that he didn’t have access.  He promptly sent me an email saying, essentially, “What’s up with this?  Shouldn’t we have access?”  (His actual email was more eloquent.)

[Image: In which container is the journal article I need?] CC image courtesy of Flickr user s_volenszki

In fact, we don’t have access, and it would cost us an additional $1000 to have access to those journals via JSTOR.

Non-librarian types (students, faculty, everyone else) don’t always have a clear understanding of how they get access to information.

In the case of JSTOR above, most folks don’t understand the difference between the platform and the content (and quite frankly, they don’t really need to).  In this case, JSTOR is simply a platform for delivering journal articles.  You have to buy the content, and that tends to come in specific chunks.  In my library, we subscribe to several of the packages that JSTOR offers, and we have current access to some of the journals that are available via that platform.

But just because we have access to some content on JSTOR doesn’t mean we have access to everything.  The same can be said of other platforms like ScienceDirect from Elsevier, or Project Muse (for any humanities folks out there).

In much the same way, it can be difficult for folks to understand that libraries don’t always have access to journal articles directly from the publisher’s website.  We have access to a lot of journals via third-party aggregators, like ProQuest or the EBSCO packages.

For example, a student or researcher wants an article from the current issue of the Journal of Parasitology, and goes directly to the journal homepage.  When they get there, they encounter a paywall, asking for $20 for access to that article.  The student or researcher might think that they either have to fork over the money or move on to a different article.

While the student searches for a new research topic, a PDF of this exact article is sitting in our “MEDLINE with Full Text” database ready for them to download.  We’ve already paid for the content, just not through the journal website. Our current access to this journal is via a different platform.

In library instruction sessions we try to teach students to go through the library homepage to check on journal access, but it isn’t always the most intuitive thing to do.  And some students can go their entire undergraduate careers without seeing a librarian in their classes.  We also teach them to use special links we put into databases that will guide them through the library system (we call ours “Get It”; the generic term is OpenURL), but the databases don’t always help us out here.  Some databases provide direct-to-publisher links which, as we’ve seen, don’t always lead to the content.
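
For the curious, an OpenURL is essentially a description of the article packed into a web address and handed to the library’s link resolver, which then works out where the library actually has access.  The Python sketch below shows the general shape.  The resolver address and the article title are placeholders invented for this example, and build_openurl is a hypothetical helper; the parameter names follow the OpenURL 1.0 (Z39.88-2004) key/encoded-value format for journal articles.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; every library has its own.
RESOLVER_BASE = "https://resolver.example.edu/openurl"


def build_openurl(journal_title, article_title=None, year=None):
    """Build an OpenURL 1.0 (key/encoded-value) link for a journal article."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.genre": "article",
        "rft.jtitle": journal_title,
    }
    # Optional details are only included when known.
    if article_title:
        params["rft.atitle"] = article_title
    if year:
        params["rft.date"] = str(year)
    return RESOLVER_BASE + "?" + urlencode(params)


# The article title is a placeholder, not a real citation.
link = build_openurl("Journal of Parasitology", article_title="Placeholder article title")
print(link)
```

The point of the design is that the link describes the article rather than a location, so the same link can lead one library’s users to an aggregator database and another’s to the publisher site.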

Is this confusing?  Yes.  Could it be simpler?  Yes, but it would require a complete rethinking of the whole scholarly communication system.  Open access, anyone?

Author Order

One of the parlor tricks I occasionally do in an information literacy class is to guess the name of a researcher’s PhD adviser, and sometimes their postdoc adviser, simply by looking at a list of their publications.  This is most impressive when the researcher in question is the faculty member I’m working with and can confirm or deny my guess.

Students are usually impressed, but it isn’t difficult: you just need to know a little something about the meaning behind the order of author names.

Scientific publications are rarely authored by just one person. More often, they have 3-6 authors, and sometimes many more, depending on the field. Publications in high energy physics and genetics can sometimes have hundreds of authors: the record (as far as I can tell) is an article related to the installation of the particle accelerator at CERN that lists the group as a lead author and almost 3,000 co-authors.

My colleagues in the humanities sometimes have trouble understanding how so many people could be the author of a paper – they equate authorship with actual typing and writing of words.  But in the sciences, the words aren’t the primary result – it’s the data, discoveries and conclusions that are important.  As a result, scientific publications encourage contributors to list as authors anyone who made a significant contribution to the work.

The definition of “significant contribution” can vary by field, however, and it isn’t unheard of to see authors who only made a nominal contribution.  In some places it was customary to add the department chair or lab PI as an author, even if he or she knew nothing about the work (see this 2006 article in Nature).  Some journals are attempting to get a better handle on this by asking contributors for a list of credits showing who did what (see this example).  And the medical community has outlined specific criteria for inclusion as an author.

Because of the quantity of authors, some thought has to go into how they will be ordered on the publication. The first author is typically the person who contributed the most to the publication, including carrying out the research and writing up the report.  After that, it can get a bit tricky.

In order to combat the trickiness, various disciplines have evolved strategies to keep the peace.  In some disciplines, additional authors are listed alphabetically.  In others, authorship goes in order of who made the biggest contribution.  Sometimes, the person who contributed the most (after the lead author) will go in second place, sometimes in last place.

In cases where author order is determined by the relative amount of an individual’s contribution, disagreements and even arguments can sometimes result.  You can actually download software that aims to help establish the correct author order.

I sometimes discuss author order in upper-level classes.  If a researcher understands how this works, their ability to search for additional relevant publications by author goes up.
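
The parlor trick can be roughed out in a few lines of Python.  This sketch rests on an assumption that holds in many lab sciences but, as noted above, not in every discipline: the senior author (often the adviser) is listed last on a student’s early first-author papers.  The guess_adviser function and every name in the sample list are made up for illustration.

```python
from collections import Counter


def guess_adviser(publications, researcher, earliest=3):
    """Guess an adviser from the researcher's earliest first-author papers.

    Assumes the senior author is listed last, which is a convention in
    some fields only.
    """
    early_papers = sorted(
        (
            p for p in publications
            if p["authors"][0] == researcher and len(p["authors"]) > 1
        ),
        key=lambda p: p["year"],
    )[:earliest]
    last_authors = Counter(p["authors"][-1] for p in early_papers)
    return last_authors.most_common(1)[0][0] if last_authors else None


# Entirely hypothetical publication list.
publications = [
    {"year": 2001, "authors": ["R. Researcher", "C. Coauthor", "A. Adviser"]},
    {"year": 2002, "authors": ["R. Researcher", "A. Adviser"]},
    {"year": 2003, "authors": ["C. Coauthor", "R. Researcher", "A. Adviser"]},
    {"year": 2006, "authors": ["R. Researcher", "P. Mentor"]},
    {"year": 2010, "authors": ["S. Student", "R. Researcher"]},
]

print(guess_adviser(publications, "R. Researcher"))
```

In a field that lists authors alphabetically, the same code would produce confident nonsense, which is exactly why it helps to know a discipline’s conventions before reading meaning into author order.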