Visualisation Proposal: Accessing Data & Available Visualisation Frameworks

To create the visualisation I proposed, I need to access a lot of external services such as Delicious, Twitter, … In this post I’ll take a first look at the possibilities for accessing data from a few of these services. I’ll end with a conclusion explaining, based on the data availability, which visualisation I’d like to start prototyping.

Accessing Data

The first question I asked myself was what information I need to start searching on these services. The starting point, of course, is the metadata of the paper stored by the open repository. Which of that information can be used to search on other services? There is currently no such thing as a unique identifier for a paper that is shared by all services, but there is still enough information to start exploring. When publishing a paper on Lirias, a user is asked to fill in at least one of the following identifiers:

  • URI – Uniform Resource Identifier – e.g. http://www.Lirias.org/help/submit.html
  • ISSN – International Standard Serial Number – e.g. 1234-5678
  • ISBN – International Standard Book Number – e.g. 0-1234-5678-9
  • DOI – Digital Object Identifier – e.g. 10.1000/182
  • Other – A unique identifier assigned to the item using a system different from the specified ones.
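
Since any of these identifiers may show up in the metadata, a first step could be to work out which kind a given string is. The sketch below is only an illustration: the function name `classify_identifier` and the regular expressions are my own assumptions, and the patterns check the shape of the string only, not check digits or registration.

```python
import re

# Hypothetical, deliberately loose patterns: they test the shape of the
# identifier only and do not validate check digits.
PATTERNS = [
    ("DOI", re.compile(r"^10\.\d{4,9}/\S+$")),
    ("URI", re.compile(r"^https?://\S+$", re.IGNORECASE)),
    ("ISSN", re.compile(r"^\d{4}-\d{3}[\dXx]$")),
    ("ISBN", re.compile(r"^(?:\d[- ]?){9}[\dXx]$|^(?:\d[- ]?){12}\d$")),
]


def classify_identifier(value):
    """Return 'DOI', 'URI', 'ISSN', 'ISBN' or 'Other' for a metadata string."""
    value = value.strip()
    for name, pattern in PATTERNS:
        if pattern.match(value):
            return name
    return "Other"


for example in ["10.1000/182", "1234-5678", "0-1234-5678-9",
                "http://www.Lirias.org/help/submit.html"]:
    print(example, "->", classify_identifier(example))
```

Anything that does not match one of the patterns falls into "Other", which mirrors the last option in the list above.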

Next to these identifiers, the DSpace software also allows the use of Handles. Handles are a way of assigning globally unique identifiers to objects within the DSpace system. These identifiers are persistent and make it possible for users to bookmark files from Lirias. A Handle can be formatted as a URL and is thus the ideal way for a user to create a bookmark that never turns into a broken link. When searching external services for a Lirias document, these Handles will often be found as the reference.
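
As a minimal sketch of what "formatted as a URL" could mean in practice, the function below prefixes a Handle with the public Handle proxy at hdl.handle.net. The Handle value in the example is made up, the function name is hypothetical, and Lirias may well expose its Handles through a resolver of its own.

```python
def handle_to_url(handle, proxy="http://hdl.handle.net/"):
    """Turn a Handle such as '123456789/12345' into a resolvable URL."""
    handle = handle.strip()
    # Accept the common 'hdl:' prefix as well as a bare Handle.
    if handle.lower().startswith("hdl:"):
        handle = handle[4:]
    prefix, separator, suffix = handle.partition("/")
    if not (prefix and separator and suffix):
        raise ValueError("expected a Handle of the form prefix/suffix")
    return proxy + prefix + "/" + suffix


# The Handle below is invented for illustration only.
print(handle_to_url("hdl:123456789/12345"))
# prints http://hdl.handle.net/123456789/12345
```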

These different identifiers of course offer more ways to search for the presence of a paper on a certain service. But they also make the process of searching a bit more complicated, e.g. different identifiers for the same paper can be present on the same service or spread over several services. In the next paragraphs I’ll look at which types of identifiers each service allows searching on through its API.

Delicious

The Delicious API is limited to performing actions on the profile of a user and doesn’t allow any global search, so it is only useful for personal bookmarks. The Delicious website itself does offer the possibility to search for a URL.

Bibsonomy

The API of Bibsonomy offers no explicit way to search for ISBN, DOI, … but it does offer a full text search. The search covers all available metadata for a post (e.g. title, authors, ISBN, DOI, …) as well as associated tags. This technique might give undesirable results, as the API documentation notes:

Special characters in search terms: Please note that all special (i.e. non-alphanumeric) characters occurring in search terms apart from “_” and “‘” are treated as search term separators – if you search e.g. for an ISBN like /posts?search=978-0-387-71000-6 , then also an entry with ISBN 387-0-978-71000-6 will be matched, because the number blocks are treated as distinct search terms, which is not what you want.
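
To make this pitfall concrete, here is a small local simulation. It is not Bibsonomy's actual implementation, just a sketch that assumes the search splits on every character other than letters, digits, "_" and the apostrophe and then matches the resulting terms in any order. The helper names are hypothetical. A possible workaround is to filter the returned posts on the client side by comparing normalised ISBNs.

```python
import re


def tokens(text):
    """Split text the way the quoted note describes (assumed behaviour)."""
    return [t for t in re.split(r"[^0-9A-Za-z_']+", text) if t]


def naive_match(query, field):
    """True if every search term occurs in the field, in any order."""
    return set(tokens(query)) <= set(tokens(field))


def normalise_isbn(value):
    """Drop hyphens and spaces so ISBNs can be compared as whole strings."""
    return re.sub(r"[^0-9Xx]", "", value).upper()


def exact_isbn_match(query, field):
    """Client-side check to weed out the false positives."""
    return normalise_isbn(query) == normalise_isbn(field)


query = "978-0-387-71000-6"
other = "387-0-978-71000-6"
print(naive_match(query, other))       # prints True: same blocks, other order
print(exact_isbn_match(query, other))  # prints False: different ISBN
```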

Connotea

The Connotea API allows searching by tag and by URI/URL. For example, http://www.connotea.org/data/tags/uri/http://www.google.com/ returns a list of tags for the Google site.
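
Building such a lookup URL for an arbitrary paper could look like the sketch below. It simply follows the pattern of the example above, where the URI is appended unencoded. Whether Connotea expects the URI to be percent-encoded is something I have not verified, and the function name and the paper URL are made up.

```python
CONNOTEA_TAGS_BY_URI = "http://www.connotea.org/data/tags/uri/"


def connotea_tags_url(uri):
    """Build the tag lookup URL for a bookmarked URI.

    Assumption: the URI is appended as-is, as in the example above.
    """
    return CONNOTEA_TAGS_BY_URI + uri.strip()


# Hypothetical paper URL, for illustration only.
print(connotea_tags_url("http://example.org/paper/1"))
```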

Other bookmarking sites

Another bookmarking tool I’ve been looking at is CiteULike. This site does not seem to have a fully developed API: I only found an API for developing plugins, which at first sight doesn’t seem to have been updated since 2006. The bookmarking service Zotero also offers an API, but it is limited to development within Firefox and is meant for extensions that access Zotero.

Twitter

When searching for activity on Twitter, the biggest problem is knowing what to look for. The first search I thought of was the URL of a paper. Twitter has a well-documented API that allows searching for a specific URL. But the 140-character limit on Twitter forces people to use URL shortening services like TinyURL, so a search for a particular URL will not return all related posts. A solution to this problem is Backtweets, which offers a Twitter search that includes these posts and also offers an API.
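
If a service like Backtweets is not an option, another approach could be to expand the shortened links myself and compare the result with the paper URL. The sketch below assumes the tweets are already available as dictionaries with a "urls" list, which is a made-up structure and not the format of the Twitter API. The expansion simply follows HTTP redirects with the standard library.

```python
from urllib.request import Request, urlopen


def expand_url(short_url, timeout=10):
    """Follow redirects and return the URL the short link ends up at."""
    request = Request(short_url, method="HEAD")
    with urlopen(request, timeout=timeout) as response:
        return response.geturl()


def tweets_linking_to(tweets, target_url, expand=expand_url):
    """Return the tweets whose links resolve to target_url.

    `tweets` is a list of dicts with a 'urls' key (hypothetical format).
    `expand` can be swapped for a cache or a stub during testing.
    """
    target = target_url.rstrip("/")
    matches = []
    for tweet in tweets:
        for url in tweet["urls"]:
            try:
                final = expand(url)
            except OSError:
                # Dead or unreachable short link: skip it.
                continue
            if final.rstrip("/") == target:
                matches.append(tweet)
                break
    return matches
```

Passing the `expand` function in as a parameter keeps the matching logic testable without network access, and leaves room for caching, since many tweets will share the same short link.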

Another way to find interesting tweets is by searching for a hash tag used during a conference where a paper was presented. But how can I find out which hash tags were used at which conferences? This is a question that still needs to be answered. There might also be other tweets related to a paper that scientists are interested in and that I don’t know of yet; I hope to discover these during the evaluation of the visualisation.

Blog posts

The hardest information to retrieve is blog posts related to a paper. There are a lot of blog services like WordPress, Blogger, … The APIs offered by these services are either limited to simple retrieval, without a way to search for mentions of a specific URL, or they only offer an API for plugin development. Another issue is that scientists also blog on social networks like Nature Network, so blogs are widespread and looking at just one or a few blog services might not give a realistic view of the blogging activity.

A possible solution to this problem might be Google Alerts, which allows you to receive an alert whenever a search term is mentioned on a blog. The issue is that these alerts are only accessible via RSS feed or e-mail and there is no API. So for every URL I want to search for, a Google Alert needs to be created and verified, which is time-consuming. How to retrieve blog posts that mention an article needs some more research.

Available Visualisation Frameworks

As I want to start developing and evaluating a prototype within the next weeks, I need to choose which visualisation to start with. In this blog post I’ve taken a look at what information I need to get from these external services. It seems that at this moment I’m only able to find information related to a URL with the API offered by Connotea. It is also not clear what information I’m going to show on the timeline: which tweets are useful? How do I get blog posts mentioning a URL? These are some of the questions still unanswered.

Another aspect I need to take into account is which visualisation frameworks are already available for developing these visualisations. The timeline visualisation could be developed using the SIMILE Timeline widget, which only needs some data as input. For the tag and library visualisation I might be able to use the relation browser from Moritz Stefaner. Another possibility for the tag browsing would be the cluster map used in the visualisation of social bookmarks. With a lot of bookmarks the relation browser wouldn’t give a good overview, but a cluster map would.
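
To give an idea of what "some data as input" could look like for the timeline, here is a rough sketch that turns a list of activities into a JSON document for the Timeline widget. The field names ("dateTimeFormat", "events", "start", "title", "description", "link") follow my recollection of Timeline's JSON event source and should be checked against its documentation. The activity dictionaries themselves are invented for illustration.

```python
import json


def to_timeline_json(activities):
    """Convert activity dicts into a Timeline-style JSON event source.

    Each activity needs 'date' (ISO 8601) and 'title'; 'description'
    and 'url' are optional. This input format is hypothetical.
    """
    events = []
    for activity in activities:
        events.append({
            "start": activity["date"],
            "title": activity["title"],
            "description": activity.get("description", ""),
            "link": activity.get("url", ""),
        })
    return json.dumps({"dateTimeFormat": "iso8601", "events": events}, indent=2)


# Invented sample data: one bookmark and one tweet about the same paper.
sample = [
    {"date": "2009-11-02T10:15:00Z", "title": "Bookmarked on Connotea",
     "url": "http://example.org/paper/1"},
    {"date": "2009-11-03T08:00:00Z", "title": "Mentioned on Twitter",
     "description": "Tweet containing a shortened link to the paper"},
]
print(to_timeline_json(sample))
```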

Given the lack of information, I think starting with the tag visualisation is the best way to go, because the timeline activities are still uncertain and for the library visualisation I need user data from Mendeley. It seems that there are enough frameworks to start from, although I haven’t looked at how difficult it would be to start developing with them.

If you have any suggestions on how to easily get tag and paper information, or an answer to any of my questions (where to get related blog posts, which Twitter messages are useful, …), please let me know. Comments are always welcome!

One Response to “Visualisation Proposal: Accessing Data & Available Visualisation Frameworks”

  1. Bram Says:

    Yahoo Search and site explorer help you in retrieving which websites are linking to a selected website or resource.

    Here’s some information about using this:
    http://www.facebook.com/#/notes/mire/do-you-know-whos-linking-to-your-repository-/193898073767

    Having a resource reference to a certain URL is indeed one way of finding a unique identifier for a paper/resource.

    However, realistically, it will often be the case that someone is blogging, tweeting or even citing a work or a publication without using a unique identifier. In this case, you might want to experiment with which queries give the best results when trying to find something that references the same resource.

    For example: if you’re googling for two of the author last names, two keywords from the title, and the year of publishing, would that be enough to retrieve quality results that are indeed referencing the same thing?

    What about Google Scholar? Because you’re looking in a smaller collection of resources, it should in theory be easier to find quality scientific results, with less noise. However … it will not give you links to blogs where people talk about the resource.

    I do have the feeling that you’re covering a lot in your research at the moment. Maybe consult with Joris whether the retrieval of the data should be your main point of interest, or the visualisation (when you would build in the assumption that you have a framework for getting the data, but might just work with example or bogus data for the meanwhile).
