The thoughts of a web 2.0 research fellow on all things in the technological sphere that capture his interest.

Friday, 22 January 2010

Semantic Webometrics - A few thoughts

The other day an academic colleague asked what I was working on at the moment. My answer included 'semantic webometrics', and unsurprisingly he wanted some more detail. However, 'working on' would be a bit of an exaggeration; 'have a few ideas but nothing on paper yet' would have been more appropriate. As such I thought I'd write down some of my rough thoughts on semantic webometrics.

Webometrics
For those who may have stumbled upon this blog from a non-webometric background, Webometrics as defined by Björneborn (2004), and as used by most of the webometrics community, means the:
...study of the quantitative aspects of the construction and use of information resources, structures and technologies on the Web drawing on bibliometric and informetric approaches.
Many of these quantitative studies have focused on hyperlinks. For example, investigating whether there is a correlation between a university's inlinks (a.k.a. backlinks) and a university's research ranking, or whether the interconnectedness of organisations in a region (as seen through interlinking web sites) can give an indication of a region's level of innovation [outrageous self-citation].

One of the problems with many of these link-analyses is that they include a lot of noise. For example, when counting a university's inlinks you will be counting both those from an academic highlighting a university's quality research, and those from the disgruntled student highlighting his most hated tutor. Traditionally we have tried to understand the extent of this noise through large scale content analysis - the extremely tedious manual classification of web links and web pages.

The semantic web
A semantic web is one where information on the web is structured so that it is meaningful to computers. Well known examples of the semantic web include the FOAF ontology, which allows people to express their relationships with one another (e.g., the FOAF of Tim Berners-Lee), and the use of microformats for certain types of structured content including contact details (as included at www.davidstuart.co.uk) and reviews (which are now indexed by Google as Rich Snippets). This extra information can be used to reduce the amount of noise and enable meaningful webometric studies.

Semantic webometrics
So when I say semantic webometrics I mean - webometric studies that make use of the additional information included in an increasingly semantic web.

For example, a semantic webometric study of the connection between an institution's inlinks and research ranking would take into consideration who had placed the links and the attributes that they had associated with them. A semantic webometric study of the relationships between organisations would look at the explicit relationships contained in FOAF files as well as the implicit information on web pages.

Conclusions
Unfortunately there is relatively little semantic information embedded in the majority of web pages/sites, and where it is widespread, e.g., with the nofollow link attribute, webometricians have yet to develop the tools to make use of it.
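As a rough illustration of what such a tool might start from, here is a minimal Python sketch that separates a page's links into those marked nofollow and those that are not. It only uses the standard library's html.parser; the class name, function name, and sample page fragment are my own inventions for illustration, and a real study would also need crawling, URL normalisation, and a decision about what nofollow actually signifies.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect hrefs from <a> tags, split by whether rel contains 'nofollow'."""

    def __init__(self):
        super().__init__()
        self.followed = []
        self.nofollow = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        href = attr_map.get("href")
        if not href:
            return
        # rel is a space-separated list of tokens, e.g. rel="external nofollow"
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "nofollow" in rel_tokens:
            self.nofollow.append(href)
        else:
            self.followed.append(href)


def split_links(html):
    """Return (followed_links, nofollow_links) found in an HTML string."""
    collector = LinkCollector()
    collector.feed(html)
    return collector.followed, collector.nofollow


# Hypothetical page fragment, for illustration only.
sample = (
    '<p><a href="http://example.ac.uk/research">quality research</a> '
    '<a rel="nofollow" href="http://example.ac.uk/tutor">my tutor</a></p>'
)
followed, nofollow = split_links(sample)
print(len(followed), "followed;", len(nofollow), "nofollow")
```

Counting the two lists separately is the simplest possible use of the attribute; whether nofollow links should be discarded or studied in their own right is exactly the sort of question that still needs investigating.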

As such we need to take an information-centred approach to semantic webometric research rather than a problem-centred approach. Whilst still small, the amount of semantic data being embedded in the web is increasing all the time; webometricians need to investigate what is available and how they can use it.

posted by David at | 1 Comments

Saturday, 30 May 2009

From Webometrician to Web Analyst?

On 22nd July 2009 my job as web 2.0 research fellow at the University of Wolverhampton finishes. As the only other webometrics research post currently available is in South Korea, and I'm not really a 9-5 office type person, I will [probably] be going into business for myself: Commercialising webometrics. Unfortunately, as there are only a handful of people who know what webometrics is and what a webometrician would do, the hunt is on for a new job title.

The most obvious job title is 'web analyst', although the slightly wordier 'web analytics consultant' would probably give a better indication of the services I can offer. Neither, however, sounds particularly cutting edge, exciting, or (like webometrician) rhymes with magician! Even after I have decided on a job title I will have to select names for the services I offer. Is 'web impact analysis' catchy enough? Naming children seems like a piece of cake in comparison.

One thing I am sure about: I will not be a search engine optimizer offering search engine optimization! Any other suggestions welcomed.

posted by David at | 0 Comments

Monday, 27 April 2009

A Wolverhampton Network Diagram: It's a local affair

A couple of posts ago I was complaining about how annoying my job was as I tried to draw conclusions from the jumbled mess of environmental technology websites. Today's post points out that it isn't always such a jumbled mess.

I have just done a far smaller (and less scientific) data collection for a presentation I am doing in Wolverhampton tomorrow:

It is a link diagram of a few web sites in Wolverhampton and the surrounding area to illustrate the sort of work my research group does.

What is noticeable from a webometric perspective is how many of the web sites included in the study are actually connected: you can link anywhere in the world, but the web is primarily a local affair.

posted by David at | 0 Comments

Wednesday, 22 April 2009

I Hate My Job: The Web is Just a Jumbled Mess!

At the moment I am investigating the linking between 1337 environmental technology web sites. Of the 1337 sites, 751 nodes create one large network:

You spend days sorting a list of URLs, collecting data, finding errors, starting again...and at the end you just have a big ball of string.

A webometrician's job is to draw conclusions from such a jumbled mess: I hate my job.
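For anyone wondering where a figure like 'one large network' comes from, it is the largest connected component of the link graph. The sketch below is a minimal Python version, assuming the collected data has been reduced to a list of site names and a list of (source, target) link pairs; the function name and the toy site names are mine, and link direction is ignored, which is one choice among several.

```python
from collections import defaultdict, deque


def largest_component(sites, links):
    """Return the largest set of sites joined by links, ignoring link direction."""
    neighbours = defaultdict(set)
    for source, target in links:
        neighbours[source].add(target)
        neighbours[target].add(source)

    seen = set()
    best = set()
    for site in sites:
        if site in seen:
            continue
        # Breadth-first search outwards from a site not yet visited.
        component = {site}
        queue = deque([site])
        while queue:
            current = queue.popleft()
            for other in neighbours[current]:
                if other not in component:
                    component.add(other)
                    queue.append(other)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy data, for illustration only.
toy_sites = ["a.example", "b.example", "c.example", "d.example", "e.example"]
toy_links = [
    ("a.example", "b.example"),
    ("b.example", "c.example"),
    ("d.example", "e.example"),
]
print(len(largest_component(toy_sites, toy_links)), "of", len(toy_sites), "sites")
```

Knowing the size of the ball of string is the easy part; the code says nothing about why the sites are linked.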

posted by David at | 7 Comments

Friday, 20 February 2009

A Philosophy of Linking: Does The Pirate Bay need a webometrician?

As members of The Pirate Bay stand trial, Bill Thompson points out the need for a philosophy of linking:
The Pirate Bay case hinges on what counts as infringement, and whether simply linking to a site is enough to make someone liable, treating a hypertext link to a third-party URL as an endorsement, as something that makes a connection between two web pages or information sources that has real legal significance and weight.

Yet it is nothing of the sort. Ever since Tim Berners-Lee defined the Hypertext Markup Language and its Uniform Resource Locators one fundamental thing has applied - a link is just a link....

Perhaps we need a 'philosophy of linkage' to explore what the use of a link can signify, before the lawyers decide it for us and limit the creative potential of the web through their lack of imagination and understanding.

The theory of linking often comes up as a topic of conversation in webometrics, in much the same way as a theory of citation is discussed in bibliometrics. Unfortunately it often takes a back seat to those webometric areas with more obvious real-world applications, e.g., the creation of web indicators.

Only a couple of months ago a colleague and I started working on a 'Theory of Linking', but other work got in the way and the paper remains unfinished. Who knows, maybe if we had written the paper we could have been the first webometricians to be expert witnesses!

posted by David at | 1 Comments

Sunday, 15 February 2009

Twitter, Politics, and Looking for Meaningful Metrics

As Twitter seems to be the latest shiny web site that has everyone interested, and with a general election on its way (well, June 2010 at the latest), I decided to see how the political parties have taken to Twitter.

The simplest comparison is between the raw numbers of the parties:
Obviously these numbers don't look good for the Labour Party: not listening and not many followers. They don't even have a single account, but rather two different streams with the same information.

Whilst such comparisons will be made with increasing regularity as the election approaches, for example:
..., we quickly realise we need to take into consideration a far wider variety of Twitter accounts and take into consideration other metrics.

@DowningStreet, the official Twitter channel for the office of the Prime Minister, provides a totally different perspective on the Labour Party's fortunes.
If @DowningStreet's Twitter friends were an indication of support, Gordon could expect a landslide victory at the next general election. Unfortunately things are not that simple. As one comment to @DowningStreet shows, people follow for many different reasons:
any chance next week i can have a pic taken outside No.10? im visiting for a few days? i know its cheeky but i had to ask!
Obviously @DowningStreet is not the only other UK political Twitterer; many individuals, groups and departments have accounts, all contributing to the complex picture of the UK political landscape.

Twitter potentially offers a lot of useful information about both the attitude of the parties to the electorate, and the electorate to the parties. Unfortunately, as with all webometric studies, for meaningful answers to be arrived at there need to be distinct methodical steps rather than just a grabbing of raw data:
1) Select appropriate Twitter accounts to answer the research question.
2) Investigate Twitter interactions:
Not only 'do they follow and have followers', but are they ReTweeting comments and responding to questions directed at them?
3) Investigate the nature of the interactions:
Unfortunately the simplest way of finding out the nature of many of the connections is to analyse the comments, a very long and tedious process.
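Step 2 at least can be partly automated. The following is a minimal Python sketch, assuming you already have the text of an account's tweets as plain strings and relying only on the informal 'RT @user' and leading '@user' conventions; the function names and the sample tweets are hypothetical, and nothing here touches the Twitter API.

```python
from collections import Counter


def classify_tweet(text):
    """Label a tweet as a retweet, a reply, or a plain broadcast."""
    stripped = text.strip()
    if stripped.upper().startswith("RT @"):
        return "retweet"
    if stripped.startswith("@"):
        return "reply"
    return "broadcast"


def interaction_profile(tweets):
    """Return the share of tweets falling into each category."""
    counts = Counter(classify_tweet(t) for t in tweets)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {kind: counts[kind] / total for kind in ("retweet", "reply", "broadcast")}


# Made-up tweets, for illustration only.
sample_tweets = [
    "RT @someone: an interesting comment",
    "@avoter thanks for the question, details to follow",
    "New policy announcement today",
    "Watch the speech live this afternoon",
]
print(interaction_profile(sample_tweets))
```

An account that only ever broadcasts looks very different from one that replies, whatever its follower count, but the profile still tells you nothing about the tone of those replies, which is where the tedious step 3 comes in.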

As with so many things on the web, it would be interesting to investigate, if only one had the time.

posted by David at | 0 Comments

Thursday, 5 February 2009

An Unimpressive EThOS from the British Library

One of the hundreds of posts in my feed-reader this morning was about the British Library electronic theses service (via SCIT blog). As my own thesis should be included I decided to indulge in a bit of vanity searching. Result: EThOS has a long way to go.

I would expect my thesis to turn up for the term 'webometrics', in fact it is about the only term for which someone might actually want to read it. Unfortunately the only webometric thesis belongs to Xuemei Li:

My thesis does however turn up for the wholly inappropriate 'bibliometrics':

Seemingly the reason for my appearance under 'bibliometrics' and not 'webometrics' is that 'bibliometrics' appears in my abstract whereas 'webometrics' does not. Whilst this may seem reasonable at first, theoretically the University of Wolverhampton are taking part in the project and their record includes a number of keywords carefully selected by me, including 'webometrics'. The British Library also fails to provide a link to my thesis, despite it being scattered over the web like confetti: "Not yet available for download".

Young academics brought up on Google Scholar, with full text searching and links to the numerous copies on the web, are unlikely to see the value in EThOS and its traditional OPAC style. Whilst I'd like to see an electronic thesis online service that separates the wheat from the chaff, with full text searching and links to the documents, and believe that librarians could aid in retrieval with classification of such documents, this is not what EThOS is currently offering. It's still in Beta, and likely to improve, but it has a frighteningly long way to go and you do wonder whether they should have buddied up with one of the big search engines to produce a more user friendly version.

posted by David at | 2 Comments

Saturday, 3 January 2009

Webometric Word Clouds: an unscientific comparison

Whilst contemplating creating word clouds from search engine results (what else do people think about on a Saturday afternoon?) I started to wonder what my thesis would look like as a word cloud. More specifically, would it end up looking like the autobiography of Mike Thelwall? A quick copy and paste of 163 pages of text into Wordle later:

Maybe articles and theses should have a word cloud before the abstract to help users decide at a glance whether it is even worth reading the abstract.
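Underneath the pretty fonts a word cloud is little more than a word frequency count. The Python sketch below shows that counting step under some loud assumptions: the tiny stop word list, the three-letter minimum, and the function name are all mine, and this is not a description of how Wordle itself works.

```python
import re
from collections import Counter

# A deliberately tiny stop word list, for illustration only.
STOP_WORDS = {"the", "of", "and", "a", "to", "in", "is", "that", "for", "on"}


def word_weights(text, top_n=10):
    """Return the top_n (word, count) pairs, ignoring stop words and short words."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS and len(w) > 2)
    return counts.most_common(top_n)


# Made-up text standing in for 163 pages of thesis.
sample_text = "Web links, web pages and more links: links between web sites are links."
print(word_weights(sample_text, top_n=2))
```

The counts would then be mapped to font sizes; a comparison of two theses could start by comparing their top terms.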

How does my word cloud compare with other recent webometric theses?

posted by David at | 0 Comments

Friday, 21 November 2008

Google SearchWiki: Cleaning up the Webometric results

For some reason Google always saves its big releases for those days when I am busy. Could it be that they are fearful of my criticism? Or merely coincidence? Whatever the reason I couldn't help but push other things to one side and comment on Google's new SearchWiki. Basically, when you are logged into your Google account at google.com (not currently google.co.uk) you can change the results you find on your home page: promoting results, hiding results, commenting on results. Whilst it only affects your results page, you can see how other people have ranked/commented on items, and it seems highly likely that Google will eventually incorporate the findings in its general search results.

SearchWiki is by no means a new idea; sites such as Aftervote (now Scour) have done it all before. The difference this time is the number of people Google can put to work on the idea. At the time of writing this blog a search for 'Google' had already had 908 people make notes; it would probably have taken Aftervote weeks if not months to get that many comments on a single search term. So what is the collective wisdom regarding the best search result for the term 'google' entered into Google.com...that'll be Google.com. Personally I would have thought that people are more likely to be searching for one of Google's other services or information about Google rather than the page they are already on, but no one ever accused the public of being overly bright.

As someone who likes to do his bit for collective wisdom, I have made steps to clean up one of my most regular searches, 'webometrics':

Just the three adjustments: promotion of the most important site, questioning the validity of a colleague's page, and the removal of a character who has no right to call himself a webometrician. But I am sure everyone would agree that such amendments improve the page astronomically.

Whilst I am sure that sheer weight of numbers will prevent the spamming of the top searches, it will be interesting to monitor the spam on the fringes. Will people be looking at the notes others have made? I will. SearchWiki seems as though it will give great insight into what people think of different sites; I just hope Google adds it to their API.

UPDATE: Whilst I initially said it was only available on Google.com, it's seemingly not as simple as that. When I log into Google.com with my webometrics account I get SearchWiki, when I log in with my gmail account I don't get SearchWiki! It seems as though they are taking steps to restrict access geographically.

posted by David at | 2 Comments

Tuesday, 28 October 2008

Does Bibliometrics need a Blogger?

Whilst searching on Google Blog Search for 'webometrics' I noticed that the usual webometric blogs are listed as 'Related Blogs':

As I had just been blogging on the subject of bibliometrics, I decided to see which blogs were related to that topic. Surprisingly there aren't any:

[Although two blogs are 'related' to Scientometrics].

If blogs are a useful way for sharing the latest news and information in a particular discipline, as well as the promotion of a discipline, then surely bibliometrics would benefit from the odd bibliometrician blogging occasionally [...for the sake of inter-disciplinary relations I will eschew the joke about bibliometricians being odd]. Admittedly the webometric blogs are not the best example of academic blogging, but it is a burgeoning online community of sorts.

posted by David at | 0 Comments

Friday, 3 October 2008

It's Porn Friday!!!

It's not that today has been designated the official porn day of the year, merely that Friday is the day when adult web sites get most of their traffic. That's just one of the facts scattered throughout Bill Tancer's Click: What Millions of People are Doing Online and Why It Matters, albeit the most memorable:

Whilst very much a popular book, rather than an academic book, it's a worthwhile read from a webometric perspective. If nothing else you can curse the limited amount of data we have access to in comparison to our commercial counterparts: Whereas we have to count links, they get to follow click-streams; following the mood and reactions of people around the world.

Whilst there is obviously big money to be made with the Hitwise data, as well as with the data of their competitors, maybe they would find the data even easier to sell if it had been shown to stand up to the rigour of the academic community and the peer review process. My door is always open :-)

posted by David at | 0 Comments

Thursday, 2 October 2008

Google 2001 v. Google 2008

In honour of their 10th birthday Google brought back their oldest available index a couple of days ago: Google 2001. This provides a great opportunity for looking at how the web has changed, especially the growth of certain terms in comparison to others.

As a webometrician, the obvious choice is to see how 'webometrics' has grown. However with changes in the index size the results are only meaningful in comparison to another result. In this case I have decided on 'Mike Thelwall', the hyper-productive author of over 100 papers in the field, who, luckily, also has an unusual name.


Whilst there were a similar number of documents at the start, and both have grown at an extremely fast rate, webometrics has grown at the faster rate. Scientific proof that there is more to webometrics than Mike Thelwall!
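The comparison boils down to dividing one growth factor by another, so that changes in the overall index size cancel out. Here is a minimal Python sketch of that arithmetic; the function names are mine and the hit counts are placeholders chosen purely to show the calculation, not the figures from the two Google indexes.

```python
def growth_factor(hits_then, hits_now):
    """How many times larger the hit count is now than it was then."""
    if hits_then <= 0:
        raise ValueError("need a positive starting hit count")
    return hits_now / hits_then


def relative_growth(term_hits, baseline_hits):
    """Growth of a term relative to a baseline term over the same two indexes.

    Each argument is a (hits_then, hits_now) pair. A result above 1 means the
    term has grown faster than the baseline.
    """
    return growth_factor(*term_hits) / growth_factor(*baseline_hits)


# Placeholder counts, NOT the real 2001 and 2008 results.
placeholder_term = (100, 5000)       # e.g. 'webometrics'
placeholder_baseline = (100, 2500)   # e.g. 'Mike Thelwall'
print(relative_growth(placeholder_term, placeholder_baseline))
```

With only two indexes available the result is a single ratio; more indexes would give more points to compare.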

It would be nice if Google opened up some other indexes so that more points to the graph could be added.

posted by David at | 0 Comments

Wednesday, 24 September 2008

Google Insights for Search: Term order is all important!

Unfortunately most poor academics don't have access to the same data as Bill Tancer, instead we generally have to make do with the crumbs from Google and the other search engines. This morning however, I was reminded about how careful we need to be when using the tools the search engines offer us.

Today I was using Google Insights for Search to compare the terms cybermetrics and webometrics. Whilst I am part of the Statistical Cybermetrics Research Group, as a group we tend to discuss 'webometrics'. Google Insights for Search clearly shows that whilst there was once a time when cybermetrics ruled supreme, webometrics is now far more popular.

More importantly, however, I also noticed that Iran wasn't highlighted on the map for the term 'webometrics', despite Iran having a (relatively) strong webometrics community.

Basically, because Iran does not appear in the results for 'cybermetrics' (which was my first search term), it is not calculated for 'webometrics'. If I had added the term 'webometrics' first, then the term 'cybermetrics', the map would have looked very different:

The solution would seem to be to include a universal search term first, but those that immediately spring to mind are not necessarily the sort that you would want appearing on a corporate slide-show.

posted by David at | 0 Comments

Friday, 5 September 2008

Webometrician v. Webometrician: Who will conquer the world first?

One of the joys of Google Analytics is watching the map slowly filling up as you get traffic from different parts of the world. However, whilst North America and Western Europe quickly fill up, other parts of the world have been more reluctant to visit my Webometric Thoughts. Almost a year after I started using Google Analytics there has still been no traffic from many countries in Africa.

Oh, what a tangled web we weave... is wondering how to start filling his map, hoping to attract visitors from Ukraine, Belarus, Georgia, Armenia, and Moldova. Whilst I am also waiting for some traffic from Belarus and Georgia, at least I can sleep comfortably in the knowledge of 28 visits from the Ukraine, 2 from Armenia, and 1 from Moldova.

Whilst the gauntlet has been thrown down by Kim at Oh, what a tangled web we weave..., I would expect the Belarusian, Georgian, and Armenian traffic to arrive by the end of the week (especially as I have sensibly included the demonyms as well as country names). And whilst Kim has decided to include the terms Google and Facebook in his post to increase the likelihood of traffic, I'm going with the Google Insights for Search suggestions of Minsk, Tbilisi, and Yerevan.

Update: Ooops...just realised I was chasing Armenian traffic after already having had Armenian traffic. So it should really say "I would expect the Belarusian, Georgian, and EXTRA Armenian traffic to arrive by the end of the week"

posted by David at | 0 Comments

Thursday, 21 August 2008

Iterasi: Create your own archive!

The UK's web archive is pretty rubbish, therefore Iterasi (highlighted by TechCrunch) is a great addition to the web.

Rather than merely bookmarking a URL, you can archive the actual page, and can continue archiving the page on a regular basis if you so wish. The only downsides to the site are that it only allows you to archive on a daily basis (for the front pages of news sites you may want to archive more regularly), and it only archives when your computer, with its list of scheduled saves, is turned on.

The potential for webometric studies is obvious; it would seem as though even the most technologically incompetent of us can now simply collect longitudinal data. For example, Google searches may be collected on a daily basis to see how the results or the number of hits changes...and once you have archived a page, it's very simple to then embed the page:

It also has potential for bloggers; when they discuss a page or story bloggers can now be sure that their readers will have access to the page that they saw rather than an updated version. How content providers will react to the archiving of their content is yet to be seen.

posted by David at | 0 Comments

Thursday, 14 August 2008

Happy Blog-iversary!!


Today is the one year anniversary of my Webometric Thoughts blog! Unfortunately, despite having a Google anniversary logo commissioned especially for the event (way back in January), Google have decided to give preference to another Olympic logo today instead.

Over the last year I have managed to blog fairly regularly (this is my 286th post), and this has been reflected in a steady increase in traffic. Since I started using Google Analytics in October I have had 15,484 absolute unique visitors:

Most importantly, the number of unique visitors can be seen to be increasing month on month. This increase can also be seen in my Alexa ranking:

When checking my Alexa ranking back in September my ranking was 8,926,204, whilst in January it was 3,816,072. Whilst Alexa changed its ranking algorithm in April, today's results show an improvement on the 1,607,649 I got then. Even Technorati shows an improvement, as I am now in the top half a million blogs!

So, what are the aims of Webometric Thoughts over the next year:
-Break into the top 100,000 web sites (according to Alexa)
-Break into the top 100,000 blogs (according to Technorati)
-Make the blog self-financing (since starting to use Google Ads in March I have earned $14.05...I need to earn approximately $50 a year).
-And, obviously, write higher quality posts.

posted by David at | 0 Comments

Wednesday, 6 August 2008

Google Insights for Search: What next?

In addition to Google Trends, Google are now offering Google Insights for Search (http://google.com/insights/search/#) (via TechCrunch). Not only can you filter the terms by category, for example helping to distinguish between Apple (Computers & Electronics) and apple (Food & Drink), but it will also give a nice visual representation of the geographic data.

We can now quickly see that Iran is the country most interested in webometrics:

The maps also offer a whole new type of vanity searching. The "David Stuart" brand has yet to make major inroads in Africa, Asia or South America. I was grateful, however, to find that my own vanity searches had not overly affected the results (at a city level London is the hub rather than Wolverhampton).

Some bloke called Barack Obama, on the other hand, seems to have made inroads all over, with the exception of the Middle East.

The obvious question, based on the directory structure of the Insights for Search URL (http://google.com/insights/search/#), is what other insight services are Google going to offer? Insights for Maps? Insights for Shopping? Insights for News?

posted by David at | 0 Comments

Webometricians are NOT Web Celebrities!!

When it comes to being a web celebrity, it is not surprising to find that webometricians are near the bottom of the pile; a fact I blame on our spending too much time counting other people's links rather than creating content worth linking to. Anyway, Wired have created a nifty little application (highlighted by Media Futurist) that can help you determine your 'web celebrity' score by using data from Google's Social Graph.

At the moment it only bases your score on MySpace, Twitter, and your blog/web site, so your score depends a lot on how much you use these sites; my thousands of Facebook friends and hundreds of Delicious bookmark followers mean nothing. Nonetheless, true to Webometric Thoughts fashion, a comparison of the three main webometrics blogs/bloggers (only using their Twitter and blog addresses):

Holmberg's Oh what a tangled web we weave... :
2 (twitter) + 4 (blog) = 6
Thelwall's Webometrics Blog :
10 (twitter) + 15 (blog) = 25
My Webometric Thoughts:
6 (twitter) + 7 (blog) = 13

To give these numbers a bit of perspective, Barack Obama's current ranking is 9,069 (4,509 without MySpace). Thelwall may have won this battle, but we are all losing the war. It would be interesting to see, however, how the Celebrity Meter compares with a qualitative evaluation of web celebrity, such as Forbes' list of the top 25 web celebrities.

Whilst 'web celebrity' is just a bit of fun, it does show the potential of the Google Social Graph data, and as far as I am aware no webometrician has used it to any practical purpose yet.

posted by David at | 2 Comments

Friday, 25 July 2008

A Webometric Thesis

The finishing of a PhD is more of a whimper than a bang. It has been seven months since I handed in my thesis, and despite having had only the most minor of revisions (total time approximately 4hrs), I have only just received the certificate for my masterpiece:

Whilst there are often complaints about the inability of government to work as effectively as 'the marketplace', we should all be grateful that academia is not in charge of the country; nothing would happen for years on end.

As many weeks have also passed since I sent my thesis to the University's electronic repository, and it still hasn't appeared online, I have decided to put it online myself.

Title:
Web Manifestations of Knowledge-based Innovation Systems
Abstract:
Innovation is widely recognised as essential to the modern economy. The term knowledge-based innovation system has been used to refer to innovation systems which recognise the importance of an economy’s knowledge base and the efficient interactions between important actors from the different sectors of society. Such interactions are thought to enable greater innovation by the system as a whole. Whilst it may not be possible to fully understand all the complex relationships involved within knowledge-based innovation systems, within the field of informetrics bibliometric methodologies have emerged that allows us to analyse some of the relationships that contribute to the innovation process. However, due to the limitations in traditional bibliometric sources it is important to investigate new potential sources of information. The web is one such source. This thesis documents an investigation into the potential of the web to provide information about knowledge-based innovation systems in the United Kingdom.

Within this thesis the link analysis methodologies that have previously been successfully applied to investigations of the academic community (Thelwall, 2004a) are applied to organisations from different sections of society to determine whether link analysis of the web can provide a new source of information about knowledge-based innovation systems in the UK. This study makes the case that data may be collected ethically to provide information about the interconnections between web sites of various different sizes and from within different sectors of society, that there are significant differences in the linking practices of web sites within different sectors, and that reciprocal links provide a better indication of collaboration than uni-directional web links. Most importantly the study shows that the web provides new information about the relationships between organisations, rather than just a repetition of the same information from an alternative source. Whilst the study has shown that there is a lot of potential for the web as a source of information on knowledge-based innovation systems, the same richness that makes it such a potentially useful source makes applications of large scale studies very labour intensive.

Obviously the above abstract will have all but the greatest dullard champing at the bit, and I have therefore made it available in both PDF and Word Document formats.

posted by David at | 2 Comments

Tuesday, 10 June 2008

Is the web linguistically on the left or right?

I am currently in the middle of reading David Crystal's (2006) 'Language and the Internet', an interesting book that, when it started mentioning style guides, got me wondering about whether style guides could be used to determine whether the UK web space was politically on the left, or on the right. The leading broadsheets from both sides of the political debate have publicly available style guides (i.e., The Telegraph and The Guardian), and the differences could be used for the basis of such a linguistic-webometric investigation.

My personal favourite style guide section is The Telegraph's Banned Words. Whilst the banning of terms such as 'Europhobe' has obvious political motivations, you have to wonder whether it was really necessary to explicitly ban referring to 'perverted Scout leaders' (whilst Google Trends does not show the phrase to be endemic, that may be because of the Telegraph's quick action). It is interesting to note, however, that despite the Telegraph's authoritarian values, they seem to be very lax with their own language: the supposedly banned 'mass exodus' was used only a few days ago. Surely there will be letters to the editor!

Unfortunately these days search engines try to be helpful, and ignore many of the differences. For example, 'Yahoo' and 'Yahoo!' are both treated as the same, when any fool would know that the exclamation mark reflects the searching for more conservative opinions on the search engine. It would be nice to be able to turn a search engine's 'helpful' features off occasionally.

posted by David at | 0 Comments