What is metadata? A Christmas-themed exploration

This content originally appeared on my Scientific American blog, Information Culture, in 2012. While the page remains, images are no longer visible. At the request of several readers, I’m reposting the content here. 

When I talk to most scientists and mention the word “metadata,” they look at me as if I’ve grown a second head. Despite the fact that these folks regularly use and create metadata (not to be confused with megadata or “big data,” which is a whole other subject), many have never heard the term.

Broadly speaking, metadata is simply a structured description of something else. The most popular example of metadata comes from the library catalog. Each book has a title, author, call number, publisher, ISBN etc. listed in the online catalog. These elements comprise the book’s metadata, and there are rules to make sure that things are standardized.

Without metadata, discovery and reuse of digital information would be much harder. This is why discussion about metadata has increased greatly since the second half of the twentieth century.

The best way to understand metadata is to look at a few examples of metadata at work.

Here is part of a digital data table:

[Image: screenshot of Santa’s Google spreadsheet]

If you stumbled across this list on the web you might be able to guess what it was, but you couldn’t be sure. It would also be difficult to find this list again if you were looking for it. The list creator might find it pretty useful, but anyone else it was shared with would need some added information to understand what they were looking at: this is metadata.

Metadata for this data file:

  • Who created the data: Santa Claus, North Pole. An email address would be nice. This way we have some contact information in case we need clarification.
  • Title: “My List” isn’t a title that is conducive to finding the file again. While it might be tempting to just call this “Santa’s list,” that won’t help other folks who see this file. The title should be descriptive of what the data file contains, and “Santa’s List” could be many things: Santa’s list of Reindeer? Santa’s list of toys that need to be made? A more descriptive title might be “Santa’s list of naughty and nice children.”
  • Date created: We don’t want to confuse this year’s list (2012) with last year’s list (2011). This could lead to all sorts of unfortunate events where nice kids get coal, naughty kids get presents, or infants (who weren’t around in 2011) get nothing at all.
  • Who created the data file: Perhaps Santa created the data, but then used an elf to input the data into a computer file. Many computer programs automatically record this information, although you may not realize this.
  • How the list was created: Behavioral scans? Parental surveys? Elf on the Shelf reports? All of the above? In order to reuse this data in future research projects, we need to know how it was collected, including collection instruments and methodologies.
  • Definitions of terms used: What is “naughty” and what is “nice”? How did Santa place a child into one category or another?
  • File type: What kind of file is it? The data here are pretty simple, but Santa has lots of different file formats to choose from: Excel, CSV, XML, etc. Knowing the file type helps end users determine whether they can use the data.
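Pulled together, these elements form a record that both people and computers can read. Here is a minimal sketch in Python with invented values based on the list above (the field names are my own illustration, not any formal metadata standard):

```python
import json

# A hypothetical metadata record for Santa's data file.
# Field names and values are illustrative, not a formal standard.
metadata = {
    "creator": "Santa Claus, North Pole",
    "contact": "santa@northpole.example",  # placeholder email address
    "title": "Santa's list of naughty and nice children",
    "date_created": "2012-12-01",
    "data_entry_by": "an elf",  # who typed the data into the file
    "collection_methods": [
        "behavioral scans",
        "parental surveys",
        "Elf on the Shelf reports",
    ],
    "definitions": {
        "naughty": "how Santa defines naughty goes here",
        "nice": "how Santa defines nice goes here",
    },
    "file_format": "text/csv",
}

# Serialize the record so other programs can read it.
print(json.dumps(metadata, indent=2))
```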

Naturally, a different kind of item might have a completely different set of metadata.

This is my mom’s favorite Christmas picture of me:

[Image: me sitting on Santa’s lap]

My mom remembers the details of where, when and how this picture was taken, but if she isn’t around to tell the story, metadata can help:

Metadata for this photo:

  • Date the photo was taken: December 1981. The digital version was created on 12/13/2012.
  • Who took the photo: A mall employee. This can have implications for who owns the rights to use and distribute the image. The photographer? The folks who paid to have the photo taken?
  • Camera used to take the photo: I have no idea what camera was used for this picture. Luckily, modern digital cameras often automatically record this information as part of the .jpg file (in its embedded EXIF metadata). Digital cameras can also record all the detailed camera settings (for those who understand these things).
  • Location where the photo was taken: Arnot Mall, Horseheads, NY. Some digital cameras can automatically capture this information too, using built-in GPS.
  • Picture format: .jpg
  • Picture size: The original photo is 3.5 x 5.5 inches (I think). The original scanned image is 852 x 1116 pixels.
  • Description of the photo: Currently, the primary way of searching for an image is for a computer to search for the associated text. Good file names and good descriptions can be key to finding the image again. Bonnie J M Swoger, age 3, sitting on Santa’s lap. Her grandpa brought her to the mall to visit Santa. While not enthusiastic about it, she loved her grandpa and obliged him by sitting on Santa’s lap.
  • Copyright information: I don’t think the mall Santa folks were thinking about copyright in 1981 because there wasn’t an easy way to copy the photo. These days, it is important to state explicitly what rights other folks have to use the picture. Creative Commons licenses are great for being explicit about what users can do with your content.
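Much of the photo metadata above is exactly what modern digital cameras embed automatically as EXIF tags inside the .jpg file. Here is a rough sketch of how you might read those tags in Python with the Pillow library; the file name is hypothetical, and an old scanned photo like this one may carry few or no tags:

```python
from PIL import Image
from PIL.ExifTags import TAGS

# Hypothetical file name for the scanned photo.
img = Image.open("santa_photo_1981.jpg")
print(f"Format: {img.format}, size: {img.size[0]} x {img.size[1]} pixels")

# EXIF stores tags by numeric ID; TAGS maps IDs to readable names.
for tag_id, value in img.getexif().items():
    name = TAGS.get(tag_id, tag_id)
    print(f"{name}: {value}")  # e.g. Model, DateTime, GPSInfo
```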

Depending on the type of data, there may be many more metadata elements. Geospatial data, chemical data, astronomical data, etc. each have specific descriptive elements that are used. Many organizations have developed standards describing what kinds of metadata should be included and how the metadata should be formatted. This helps data creators add metadata that can be read by computers and reused by other interested folks.
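Dublin Core is one widely used general-purpose standard of this kind. As a sketch, the photo’s metadata might look something like this when expressed with Dublin Core elements, built here with Python’s standard library (the element choices and values are my own illustration):

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

# A minimal Dublin Core record for the Santa photo.
record = ET.Element("record")
for element, value in [
    ("title", "Bonnie visits Santa, Arnot Mall, 1981"),
    ("creator", "Unknown mall photographer"),
    ("date", "1981-12"),
    ("format", "image/jpeg"),
    ("coverage", "Horseheads, NY"),
]:
    ET.SubElement(record, f"{{{DC}}}{element}").text = value

print(ET.tostring(record, encoding="unicode"))
```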

Once you have well-established metadata formats, you can start analyzing the metadata itself. Common metrics used to evaluate scholarly publications (impact factor, altmetrics, etc.) all rely on high-quality metadata.
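The classic two-year impact factor, for example, is just arithmetic over journal metadata: citations received this year to articles from the previous two years, divided by the number of citable items published in those two years. A quick sketch with invented numbers:

```python
def impact_factor(citations: int, citable_items: int) -> float:
    """Two-year impact factor: citations received in year Y to items
    published in years Y-1 and Y-2, divided by the citable items
    published in Y-1 and Y-2."""
    return citations / citable_items

# Invented numbers for a hypothetical journal.
print(impact_factor(citations=1200, citable_items=400))  # 3.0
```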

I think we can agree that Santa would use sound data management practices, including the creation and use of proper metadata, to keep track of his gift-giving and logistical data. He would want the rest of us to use good metadata so we can always locate that 30-year-old picture of him, too.

Be like Santa and make sure your data is findable and reusable: use good metadata!


For a more robust (yet clear and understandable) definition of metadata, see NISO’s Understanding Metadata (PDF).

 


Take the money and run

With library funds decreasing, for-profit publishers want your science research money.

To those in the know, this isn’t big news. Publishers have been hinting about it for some time.

[Image: keys and money] For-profit (and some non-profit) publishers want more money, while keeping their content locked down and controlled. CC image courtesy of Flickr user Images_of_Money.

But a recent analysis report (PDF) has been making the rounds of the twitter-verse today, and it makes a few things more explicit. Reed Elsevier: Is Elsevier Heading for a Political Train-Wreck? is an investment analysis from Claudio Aspesi and others at Bernstein Research:

Most important, at a time when budgets of academic libraries look likely to be constrained for years to come in many countries, Elsevier’s growth will increasingly depend on its ability to secure funding earmarked for general science research, instead of library funding.

This fascinating report lays out the challenges facing Elsevier in an age of boycotts, open access, and increasing researcher awareness of the costs of scholarly publishing.

The report also focuses on Elsevier’s culture of control over their content. From limits on open access to licensing and text-mining restrictions, Elsevier wants to control who does what with their content. They want to take an ever-increasing share of library budgets, then the research grant money, and they want to control what you do with the information you license from them. The investment analysts state:

We continue to be baffled by Elsevier’s perception that controlling everything (for example by severely restricting text- and data mining applications) is essential to protect its economics.

It seems like every year I talk to faculty in my departments and say, “We have to make some tough choices due to flat budgets and increasing journal costs. What would you like to cancel?” Every time librarians do this, faculty become a bit more aware of the economics of scholarly publishing.

Elsevier isn’t the only publisher facing these challenges, and they might not even be the baddest apple in the group, but they are very big (my library pays more for their content than anyone else’s), and their actions are drawing the attention of the scholarly community.

It will be interesting to see how this develops. Stay tuned!

Publishers, Hyperbole, and the “Don’t subscribe” pricing model

Commercial publishing is no stranger to hyperbole. “Essential research for your institution.” “Best information resource available.” “Exclusive time-limited offer.”

But I recently came across an interesting case of publisher hype. Multi Science Publishing, publisher of many mid-range scientific journals, recently sent an email to an email discussion group, touting its new pricing model, “Pay only for Usage.” Their tag line is “Don’t subscribe.”

The email claimed that announcements of the new model had caused “quite a stir,” and the author of the email, a W Hughs, the “director” of Multi-Science Publishing, suggested that he hadn’t seen “anything like” the scale of libraries’ response to the new plan.

That struck me as quite interesting, because a search for the new plan turns up very little. The most prominent resources are a link to the email discussion group archives and a single blog post, both dated March 27. I can’t seem to find information about the package on the publisher’s website. The only stir on twitter I can find is a handful of tweets from @billhughes6 directed towards libraries.

In short, I’m not convinced that this announcement has caused a stir at all.

And so you may ask, is the new model deserving of the hype?

The pricing model they’ve declared revolutionary is simple: libraries sign up for access and pay $5.00 for each article their users download. For folks in the Sciences, the $5.00-per-article price point is less than many publishers charge, and it’s even less than the interlibrary loan fees that libraries might have to pay.

It isn’t a bad deal. The big draw seems to be the unmediated bit – users get direct access to the article without having to request that the library buy it for them (some newer request methods have reduced the wait, but you still have to ask).

So, how does this compare to other journal packages out there? That’s a difficult question to answer, because libraries don’t pay per article. In some cases, we can easily pay $1, $10 or $40 an article, depending on the journal, publisher and package.
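Some back-of-the-envelope arithmetic shows how the comparison works. All of the numbers here are invented for illustration:

```python
PAY_PER_USE = 5.00           # Multi Science's per-download price
subscription_cost = 2000.00  # hypothetical annual subscription
downloads_per_year = 250     # hypothetical usage

# Effective per-article cost of the subscription.
print(f"${subscription_cost / downloads_per_year:.2f} per article")  # $8.00

# The subscription only wins once usage passes this point.
print(f"Break-even: {subscription_cost / PAY_PER_USE:.0f} downloads/year")  # 400
```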

Libraries may also be reluctant to sign up for a plan that will make it difficult for them to budget their expenses.

I like to see some experimentation in journal pricing models, because the status quo isn’t really helping anyone (except company shareholders).

I’d be less cynical about the plan if the publisher had simply promoted the plan, without informing us of how much “stir” and “excitement” it had already caused – earn the buzz for your new service, don’t just say it exists.

Open access challenges for small scholarly societies

I am very much in favor of open access.  I believe it is the natural extension of the scientific enterprise.  Scientists no longer record their results in code, or disseminate them via cryptic anagrams.  Instead, the work of scientists is shared with others so that they may, in turn, make new discoveries.

[Creative Commons image courtesy of PLoS, available via Wikimedia Commons]

Yes, this is idealistic, but I’m okay with that.

As a result, I have no hesitations in pushing the big for-profit publishers towards greater rights for authors and more open access options, and I applaud the effort behind the Cost of Knowledge  boycott of Elsevier.  And just being a not-for-profit scholarly society will not make me sympathetic to high subscription costs, aggressive price increases and restrictive copyright practices (like the American Chemical Society).

But for smaller scholarly societies, I can see how the open access movement has caused a lot of soul searching and a wide variety of opinions and options.

For small scholarly societies, subscriptions to their scholarly publications can make up a large portion of their operating budgets.  Moving to an open access model may mean the loss of some of this revenue, and society members may question whether an author-pays model of open access publication will be able to offset the cost of publication.  On the other hand, many societies may see a greater fulfillment of their mission by expanding open access options.

Happily, I am seeing more small scholarly societies embrace various aspects of “openness” in their publications.  The Ecological Society of America demonstrates some interesting examples of branching out and offering more open access options.

First, they recently started a new open access journal, Ecosphere. The new journal conforms to what we tend to expect from “gold” open access publishers: online only, author fees for accepted manuscripts, and authors retain copyright of their articles. The recent ESA annual report suggests that they have been pleasantly surprised by the success of the new journal.

Second, although ESA requires transfer of copyright to the society for their other publications, they do grant the right to post a copy of the article on the author’s personal homepage or institution’s website. This is “green” open access, an option that more researchers need to take advantage of.

Finally, their journal Ecology provides an interesting example of a hybrid publication. Ecology is available via subscription and publishes a wide variety of article formats, including brief reports that are “expected to disclose new and exciting work in a concise format.” Several reports are published in each issue, and all are open access. As a result, a certain portion of each issue is freely available.

As scholarly societies and other publishing entities come to terms with new expectations for scholarly publishing, I expect that more societies will experiment with a variety of open access options. A three-year-old report from SAGE suggests that this will happen. I’m looking forward to seeing what folks come up with.

What determines the value of academic publications?

And can librarians, scholars and publishers agree about it?

[Image courtesy of Flickr user 401K, 401kcalculator.org]

By value, I don’t just mean the impact factor and other metrics, or even the general prestige of a journal as measured by gut feeling. I mean the value of a publication (a single article, an entire journal, every journal a publisher publishes) in fiscal terms. Dollars and cents. Moolah. Benjamins.

Perhaps we can say that a more highly ranked journal (impact factor, eigenfactor, etc.) might be worth more in dollars. Some publishers would certainly like this to be true. Higher quality equals higher cost. After all, you want the $100 bottle of wine to taste better than the $5 bottle of wine. But this doesn’t seem to be how it works in reality, at least in some disciplines. Bergstrom and Bergstrom (2006) did a study of ecology journals, and examined cost versus impact factor.  In general, they found that there was no correlation between the two.  [Yes, yes, I know. Impact factor is just one poor way of measuring value.]

Perhaps value has something to do with reliability? One way to look at reliability is to examine retractions. Fang and Casadevall (2011) have shown that typically, high impact journals have more per-article retractions because they are at the cutting edge of research (also see this Retraction Watch post). Because they want to get things out fast, mistakes are sometimes made. Nature, Cell and Science all have a relatively high Retraction Index (see Fang and Casadevall, 2011). But does this make these journals any less valuable overall?
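The retraction index itself is simple arithmetic: as Fang and Casadevall define it, the number of retractions in a time interval, multiplied by 1,000 and divided by the number of articles published in that interval. A quick sketch with made-up numbers:

```python
def retraction_index(retractions: int, articles_published: int) -> float:
    """Retractions per 1,000 published articles over some interval,
    after Fang and Casadevall (2011)."""
    return retractions * 1000 / articles_published

# Made-up numbers for a hypothetical high-impact journal.
print(retraction_index(retractions=8, articles_published=8000))  # 1.0
```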

As the number of open access options increases, publishers (especially for-profit publishers and academic societies that act like for-profit publishers) make the argument that their editing, copy editing and page preparation services add significant value to their publications. How much should these services add to the total value of the publication? [Hint: not as much as the publisher would like.]

Then again, perhaps the value of a publication has less to do with the content and more to do with the audience. For example, the New England Journal of Medicine is probably more useful and valuable to a medical school than it is to my small liberal arts college, and more valuable to me than to a similarly sized school without a biology program.

When publishers assign a value to their publications, they typically take into account the size of the school or the types of degrees they award. Larger schools often pay more money for access to the same resources. Unfortunately, this doesn’t always take into account the details of who those folks are. Two schools with 5000 students will pay the same amount for a resource in chemistry, say, even though one school has 100 chemistry majors each year and the other has 10.
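The mismatch is easy to quantify. With invented numbers, the same subscription price works out very differently per interested user:

```python
# Same price for both schools; enrollment-based pricing ignores
# how many students actually use the resource. Numbers invented.
price = 10000.00
for school, chem_majors in [("School A", 100), ("School B", 10)]:
    print(f"{school}: ${price / chem_majors:,.2f} per chemistry major")
```

School A pays $100 per chemistry major; School B pays $1,000 for the very same package.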

As journal costs keep rising, institutions must continuously evaluate value – does this journal provide enough value to my institution to justify the costs?

No matter how good the $100 bottle of wine is, I’ll need to keep drinking the $5 stuff. Or maybe the $10 stuff for Christmas.

References

Bergstrom, C. T., & Bergstrom, T. (2006). The economics of ecology journals. Frontiers in Ecology and the Environment, 4(9), 488–495. Retrieved from http://www.esajournals.org/doi/abs/10.1890/1540-9295(2006)4[488:TEOEJ]2.0.CO;2

Fang, F. C., & Casadevall, A. (2011). Retracted science and the retraction index. Infection and Immunity, 79(10), 3855–3859. doi:10.1128/IAI.05661-11