Data, Data, Data – ScienceOnline2010
The other major theme to emerge from the sessions I attended at ScienceOnline2010 was data. All kinds of data.
Twenty years ago, getting your hands on a data set meant knowing someone who knew someone who might be able to send you a disc.
These days, more and more data sets are being shared on the open web. Sometimes they are easy to find and use, and sometimes not so much. Sometimes the data require a bit of skill with Excel, and sometimes the data require multiple servers and extensive programming skills.
But it’s out there.
I attended a very interesting session led by John Hogenesch about cloud computing. Some of this was way over my head – I’m not as familiar with bioinformatics as I’d like to be one day, and I have only minimal knowledge of how geneticists are actually using this information. Nonetheless, it was informative to learn about the various trends in cloud computing. Some of them I am already very familiar with – like wikis, Gmail and Google Docs. I learned more about some services that I only know a bit about. For example, PLoS is using Google Knol to write and publish their “Currents Influenza” online. Since the authoring, editing and publishing are all done online, the journal can quickly get items published and available. I also learned about some services that allow for remote storage and querying of information, and how these services can be less expensive (and easier to run) than hosting your own servers.
Jacqueline Floyd and Chris Rowan presented a session on “Earth Science, Web 2.0+, and Geospatial Applications”. Since my background is in geology, I was particularly intrigued by some of the resources discussed here. The discussion at the end of the talk centered on some of the difficulties of finding spatial information (some of which I have discussed before). For example, the USGS provides a wide range of spatial data – geophysical data, hydrological data, geologic data. Some of these data sets are easier to find (and use) than others. Recent earthquake data, for instance, are available in an easy-to-use Google Earth format, but data older than one month require more complicated searching (including detailed latitude and longitude coordinates), and the search output requires manipulation to create a visualization. It could be easier.
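To make the "latitude and longitude" search step a little more concrete, here is a minimal sketch in Python of filtering earthquake records by a bounding box and a minimum magnitude. Everything in it is an assumption for illustration: the `Quake` record, the `filter_quakes` function and the three sample rows are made up, and none of it reflects the actual USGS search form or its output format.

```python
from dataclasses import dataclass


@dataclass
class Quake:
    """Hypothetical earthquake record; field names are illustrative only."""
    place: str
    latitude: float
    longitude: float
    magnitude: float


# Made-up sample rows for illustration. These are NOT real USGS data.
SAMPLE = [
    Quake("sample A", 36.1, -120.5, 4.2),
    Quake("sample B", 61.0, -150.0, 5.1),
    Quake("sample C", 34.0, -118.2, 3.0),
]


def filter_quakes(quakes, min_lat, max_lat, min_lon, max_lon, min_magnitude=0.0):
    """Keep quakes inside a latitude/longitude box and at or above a magnitude.

    Simplification: the box is assumed not to cross the 180-degree meridian.
    """
    return [
        q for q in quakes
        if min_lat <= q.latitude <= max_lat
        and min_lon <= q.longitude <= max_lon
        and q.magnitude >= min_magnitude
    ]


if __name__ == "__main__":
    # A rough box around California, chosen only as an example.
    for q in filter_quakes(SAMPLE, 32.0, 42.0, -125.0, -114.0, min_magnitude=3.5):
        print(q.place, q.latitude, q.longitude, q.magnitude)
```

A filtered list like this would still need to be converted into something a mapping tool can read, which is the "manipulation" step mentioned above.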
One of the last sessions I attended at the conference was a presentation by PLoS managing editor Peter Binfield about article-level metrics. Peter discussed some of the things that the PLoS journals are doing to measure the impact of individual articles rather than of the entire journal. The new metrics were announced in a blog post last summer, and you can see the metrics at work on any article in any of the PLoS journals. They are using open data and APIs from lots of sources: social bookmarking (like CiteULike and Connotea), citation information (from Google Scholar and Scopus), page views, PDF downloads and lots more. I think that this is an exciting new way to shed more light on what is going on with individual articles, but there are some challenges ahead. How will tenure committees analyze this stuff? (Will they bother?) What does it mean if your article was only downloaded 300 times but your colleague (in a larger discipline like genetics) had an article downloaded 3000 times? And all of the data they are collecting can lead to lots of analysis. Librarians have traditionally used citation analysis as a way of understanding the literature of a community, and hopefully these new metrics will give them more tools to use.
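One possible way to think about the 300 versus 3000 question is to compare each article with a typical article in its own field. The Python sketch below does that with a simple ratio. It is only an illustration, not anything PLoS described: the function name `relative_downloads` is hypothetical, and the field averages (150 and 2000) are invented numbers chosen to show how the comparison could flip.

```python
def relative_downloads(downloads, field_mean):
    """Return an article's downloads as a multiple of its field's average."""
    if field_mean <= 0:
        raise ValueError("field_mean must be positive")
    return downloads / field_mean


if __name__ == "__main__":
    # Download counts come from the example above; the field averages are
    # hypothetical values used purely for illustration.
    small_field = relative_downloads(300, field_mean=150)
    large_field = relative_downloads(3000, field_mean=2000)
    print(f"small-field article: {small_field:.1f}x its field average")
    print(f"large-field article: {large_field:.1f}x its field average")
```

Under those invented averages, the article with 300 downloads sits further above its field's norm than the one with 3000, which is exactly the kind of context a raw count hides.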