Linked Data: the future of knowledge organization on the Web
By Fran Alexander
ISKO UK events consistently manage to cram in about twice as much content as seems possible given the time. With enough material for at least a two-day conference, no fewer than nine speakers and two poster presenters made for a packed day that provided a pleasing mix of fine technical detail, practical advice, and some context-setting explanations of the evolution of Linked Data.
Keynote address - Government Linked Data: A Tipping Point for the Semantic Web
Professor Nigel Shadbolt gave the keynote address, pointing out that local government data is as useful and interesting as national data. He offered a rundown of the history of the Semantic Web, starting with the classic “layer cake” picture that had been prevalent some years ago, explaining that a lot of the research into Artificial Intelligence (AI) – natural language processing, entity extraction, intelligent reasoning over distributed databases – was very interesting but not particularly pragmatic. Much discussion was devoted to detailed technological issues for a highly specialised community. In the meantime, Linked Data emerged as a simpler, easier approach based on a few founding principles – resources should have a unique, dereferenceable identifier, be expressed in open standards formats, and be interlinked.
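To make those principles concrete, here is a minimal sketch of my own (not from the talk) using Python’s rdflib library; all of the example.org and example.net URIs, and the road itself, are invented.

```python
# Minimal sketch of the three Linked Data principles in Python's rdflib.
# All example.org/example.net URIs are invented placeholders.
from rdflib import Graph, URIRef, Literal, Namespace
from rdflib.namespace import RDF, RDFS, OWL

EX = Namespace("http://example.org/id/")
g = Graph()
g.bind("ex", EX)

# 1. Give the resource a unique, dereferenceable HTTP identifier.
road = EX["road/A3"]

# 2. Describe it using an open standard format (RDF).
g.add((road, RDF.type, EX.Road))
g.add((road, RDFS.label, Literal("A3", lang="en")))

# 3. Interlink it with other people's data about the same thing.
g.add((road, OWL.sameAs, URIRef("http://data.example.net/roads/a3")))

print(g.serialize(format="turtle"))
```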
Linked Data is now becoming established and is separating out into distinct areas, with certain core nodes in various sectors being heavily linked. Although many questions remain about the differences between the “web of documents” and the “web of things”, the release of UK government data in Linked Data format should be seen as a gift for the web community. When trying to get Linked Data principles adopted in organisations, it is important to explain the value of the decentralised model of the web.
Releasing government Linked Data also shifts responsibility for the use and interpretation of data away from the government to individual users. This can circumvent a lot of bureaucracy. For example, the Department of Transport held a lot of statistics about bicycle accidents, but it was only when this data was released that someone turned it into a map and started providing “safe route” information and various related apps aimed at cyclists. The Treasury was reluctant to release its COINS database, because it felt it was confusingly structured and hard to interpret, but once released people built navigation interfaces for it that are now being used by the Treasury itself. The release of data depended on the adoption of an open licence. The principle is if you publish, the apps will come!
Public data is objective, factual, non-personal – accident rates, student degree numbers, etc. – and can be used to measure public service delivery. This sort of data is a straightforward proposition to release, but private data raises more difficult questions about privacy and trust. How much sharing of personal data is to the citizen’s benefit? To the government’s benefit? Should individuals be responsible for their own data, such as medical records? What role should the government play?
One of a number of Linked Data principles is that public bodies should maintain and publish inventories of their data holdings. It is important that we take this data seriously, because we are not just assigning URIs to roads, streets, buildings, and so on; we are building the digital infrastructure of the nation.
SKOS and Linked Data
Antoine Isaac talked about SKOS and Linked Data from the perspective of the “web of culture data”. He explained that for many large cultural repositories, converting their classification schemes into ontologies is not practical because of the huge volumes of data involved. However, much rich semantic information can be extracted without the need for a fully formalised ontology. SKOS enables classification data to be published in a simple way that permits sharing of thesauri and grouping of items by concept. It can also be a useful way of expressing annotations to documentation.
SKOS is extensible, so more complex expressions can be included, but it has some basic constraints: for example, only one term per language can be the preferred term, and broader and narrower are assumed to be inverse relationships (which makes it easier to complete a graph). Although this is a limitation in some ways, it also means that classifications can be expressed with minimal semantic commitment. SKOS is not intended to draw inferences beyond what is present in the core data.
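As a rough illustration of those constraints (my own sketch, not Antoine’s), here is how a couple of SKOS statements might look using Python’s rdflib, which bundles the SKOS vocabulary; the concepts themselves are invented.

```python
# Rough sketch of the SKOS constraints described above, using the SKOS
# vocabulary bundled with rdflib. The concepts and URIs are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, SKOS

VOC = Namespace("http://example.org/vocab/")
g = Graph()
g.bind("skos", SKOS)

painting, art = VOC.painting, VOC.art

g.add((painting, RDF.type, SKOS.Concept))
g.add((painting, SKOS.prefLabel, Literal("Painting", lang="en")))  # only one preferred label per language
g.add((painting, SKOS.altLabel, Literal("Paintings", lang="en")))  # alternative labels go here
g.add((painting, SKOS.broader, art))

# broader and narrower are treated as inverses, so the graph can be completed:
for narrower, _, broader in list(g.triples((None, SKOS.broader, None))):
    g.add((broader, SKOS.narrower, narrower))
```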
SKOS is a web-oriented, straightforward way of sharing content and descriptions, and it permits mapping across repositories (e.g. the MACS project).
The most interesting applications are the ones that cross-contextualise, so more work needs to be done on combining automatic and manual mapping methods.
The Linked Data Journey
Richard Wallis of Talis gave an overview of his Linked Data journey, which began some 40 years ago when cataloguers and librarians managed rich data sets almost entirely manually. He has seen many innovators in the semantic field disappear, but some have persisted. Semantic technology has a reputation for being really wonderful until you add the second user, so it is important to make sure everything you do is scalable in the real world.
The limitations of the web at present are that documents are linked with unqualified links. It is very hard for machines to make any sense of the links without undertaking the vast amounts of work that Google has done, and even then Google connections are only speculative. There are other issues – for example, there are no negative links on the web, so a pressure group can’t link to the websites of organisations they are objecting to, because linking will only serve to boost traffic and enhance the reputation of the very organisations they are opposing.
Linked Data standards represent a very pragmatic approach to the Semantic Web, so we do not have to get caught up in science-fiction-like predictions of Artificial Intelligence leading us all into the “hive mind”.
The main difference between ordinary hypertext and Linked Data links is that Linked Data links are qualified. A surprising number of organisations are now entering the Linked Data world – for example Tesco, Walmart, and Best Buy. Linked Data connections can be hidden from the user, so many people don’t realise they are accessing Linked Data applications online.
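A small sketch of that difference (my own; the URIs and the “hasSupplier” relation are invented for illustration):

```python
# Sketch of an unqualified hyperlink versus a qualified Linked Data link.
# The URIs and the hasSupplier relation are invented for illustration.
from rdflib import Graph, URIRef, Namespace

EX = Namespace("http://example.org/vocab/")
g = Graph()

retailer = URIRef("http://example.org/id/retailer/acme")
supplier = URIRef("http://example.org/id/company/widgets-ltd")

# A web page can only say "these two pages are connected somehow":
#   <a href="http://example.org/id/company/widgets-ltd">Widgets Ltd</a>
# A Linked Data link states *how* they are connected:
g.add((retailer, EX.hasSupplier, supplier))
```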
There is also a growing web of government data that is being used for all sorts of purposes; one example was illustrating the UK’s “innovation hotspots” to encourage people to invest. The BBC has also undertaken a number of Linked Data projects, the best known being Wildlife Finder. The New York Times published a lot of data and was then criticised by the community, but responded by making amendments, demonstrating that it makes sense to engage with the wider web community and respond to feedback in order to improve data quality.
When your data is opened up, categorised, and made shareable, all sorts of serendipitous connections can be made and exciting new uses discovered. Everyone is experimenting to a certain extent, so it is worth looking out for “fellow travellers” and finding out what others have done.
The Knowledge Hub
Steve Dale approached Linked Data from a very human-focused, user-centric perspective. He posed the questions that can get lost amidst the technical details, such as what exactly is the problem we are trying to solve? Where do I find the information I need to do my job? Which networks or societies should I join?
The web is fragmenting interaction, so that conversations become more granular but also disaggregated. I took this to mean that the Web encourages us to communicate online with lots of specialists about niche areas, but to exchange only a few words or sentences with them, rather than building up a long dialogue with a few people over time. This means that it is hard to forge real connections over social websites. Linked Data can help to aggregate this knowledge so that it builds into a core repository for communities of practice, rather than being widely dispersed.
Human intelligence is needed to interpret much data, but a great start would be just to get councils and other organisations to recognise the value of the data they hold. Some even hold data they don’t really know about.
If you use Linked Data to start to build a knowledge hub, you can start to release hidden data and encourage widespread collaboration and communication as others contribute what they think is useful to the hub. You can then also “push” the best or most relevant content to users, tailored and personalised to their selections. Federated search and real-time indexing can keep such a hub vibrant and responsive. Open authentication and open IDs can help smooth pathways for users to encourage them to use the site and services as part of their ordinary working lives, with the minimum of friction.
Afternoon keynote – Linked Data in E-commerce
Professor Martin Hepp talked about the GoodRelations ontology, which he has been developing to serve online businesses. In 1920 there were only some 5,000 types of goods being traded – the number was so constrained that it was possible to publish – presumably profitably – a dictionary of goods listing them all. Now it seems that every product is available in a huge array of varieties – there is even a type of muesli for horses!
This increased specificity makes search a far more complex problem to solve. The effort needed to get exactly what you want, and to make sure it will do what you need, has increased hugely. You cannot just buy a nail; you have to buy a highly specialised electronic accessory.
The advent of the Internet was a huge boon in reducing this massive search effort. Individuals perform hundreds of Google searches every day. However, much of the business world runs on highly structured data, which becomes unstructured when consumed by Google. Preserving the structure would convey as much useful information as preserving the links – possibly more.
In order to make the most of the structure, it needs to be expressed in a standardised format, and attention needs to be paid to getting the schema right. A schema that can’t be reused means data that can’t be reused, and in the rapidly changing world of e-commerce, data needs to be as up to date as possible. The GoodRelations ontology is aimed at providing a standard structure for expressing key e-commerce information.
Although Tim Berners-Lee urged everyone just to get their data out on to the web, putting some effort into rendering it in a reusable structure and form can make a huge difference to reuse rates and save much time in rationalising and standardising later. A balance needs to be struck between the level of detail, the time taken to populate data fields, and how easily they can be processed. For example, separating house numbers from street names can make processing easier, but can make forms slower for customers to fill in.
Following good principles in ontology construction will also help your data be picked up and reused. You may need to mix structured and unstructured data, putting the unstructured data in the best place you can find in an ontology designed with more structure in mind. It can work well to complement an ontology with a mechanism that provides vocabulary for structure where you have it, but allows you to attach unstructured data to a higher-level node when it is difficult to categorise it finely.
There are a number of known pitfalls in definitions – for example it is important not to confuse a product with an offer (otherwise your product will be on special offer all the time!) and a store is not a business entity – Tesco the retailer is not the same as any one individual Tesco store.
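A hedged sketch of those distinctions, using GoodRelations class and property names as I recall them; the company, store, and product data are invented.

```python
# Hedged sketch of keeping a product, an offer, a store, and a business
# entity distinct, using GoodRelations class/property names as I recall
# them (http://purl.org/goodrelations/v1#). The shop data is invented.
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

GR = Namespace("http://purl.org/goodrelations/v1#")
EX = Namespace("http://example.org/shop/")
g = Graph()
g.bind("gr", GR)

company = EX.ExampleRetailPLC      # the legal/business entity...
store = EX.HighStreetBranch        # ...is not the same as one physical store
product = EX.HorseMuesli5kg        # the product itself...
offer = EX.HorseMuesliOffer        # ...is not the same as an offer to sell it

g.add((company, RDF.type, GR.BusinessEntity))
g.add((product, RDF.type, GR.SomeItems))
g.add((offer, RDF.type, GR.Offering))
g.add((offer, GR.includes, product))          # the offer includes the product
g.add((company, GR.offers, offer))            # the business entity makes the offer
g.add((store, RDF.type, GR.Location))
g.add((offer, GR.availableAtOrFrom, store))   # where the offer is available
```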
Many people in business have been put off the Semantic Web by the artificial intelligence researchers who make it all sound like something from science fiction, but if you can show direct short-term financial gains for businesses – such as improvements in search engine results, clickthrough rates, and unified marketing – you are more likely to get buy-in.
Linked Data: the Long and Winding Road
Andy Powell described the history of the Dublin Core Metadata Initiative. He proposed that if Linked Data is the future, then RDF must be the future of the web.
Dublin Core was originally 12, then 15, metadata elements – which would now be called properties – that can be used to describe web resources. It took a librarian-centric, document-focused approach to resource discovery online. However, the meta tag element was widely ignored by search engines, and it rapidly became apparent that the whole web could not be categorised. As a method for transferring records and tracking provenance, it set the stage and had some benefits. It deliberately used broad semantics and flat-world modelling (“fuzzy buckets”), but also avoided thinking too much about issues – such as how to express the relationship between an image and a representation of an image, or how an artist’s name could also be an attribute of a person – that later became more pressing. Many people found it very difficult to grasp the difference between a thing in the world and a string (of characters) held as a representation of that thing, and there was comparatively little abstraction of the model from any underlying syntax. However, some of the thinking could be transferred to an RDF world, potentially with the benefit of avoiding the same mistakes.
Current problems include promoting an open world view, promoting the view that everyone and no one can be an expert, and the “strings and things” issue, which now relates to the difference between a resource and a web page representing that resource. One of the biggest challenges remains the need to get agreement on standardisation, and for any model to gain a certain critical mass to give it traction within a community.
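A small sketch of the “strings and things” point, using the DCTERMS and FOAF vocabularies bundled with rdflib; the book and person URIs are invented.

```python
# Small sketch of the "strings and things" distinction using the DCTERMS
# and FOAF vocabularies bundled with rdflib; the book and person URIs are
# invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, FOAF

EX = Namespace("http://example.org/id/")
g = Graph()

book = EX["book/42"]

# A string: fine for display, but opaque to a machine.
g.add((book, DCTERMS.creator, Literal("Woolf, Virginia")))

# A thing: a resource the machine can follow, link to, and describe further.
author = EX["person/virginia-woolf"]
g.add((book, DCTERMS.creator, author))
g.add((author, FOAF.name, Literal("Virginia Woolf")))
```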
Linking to Geographic Data
John Goodwin from the Ordnance Survey, who has been involved in the Semantic Web for 10 years, explained some of the unique challenges of geographical data. He noted the problem of subtly different definitions embedded deep in the data: for example, different government organisations had used slightly different definitions of houses in multiple occupancy, so their data could not be usefully compared. Place names and boundaries change over time, and people often call places by unofficial names; names that no longer exist as official boundaries persist in common parlance. This can cause particular problems for the emergency services, as they need to make sure they go to the place the caller meant by the name. An example given was of children using the name of one park to mean a different one.
Geographic data is however a very useful route in to many other applications and can provide interesting and informative visualisations. By using geographic hierarchies, you can draw inferences to aggregate data up to broader levels – from county to region level, for example.
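As a sketch of that kind of roll-up (my own example; the “within” property, the places, and the figures are all invented), a SPARQL aggregation over a small rdflib graph might look like this:

```python
# Sketch of rolling county-level figures up to region level via a simple
# geographic hierarchy. The "within" property, places, and accident counts
# are all invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import XSD

EX = Namespace("http://example.org/geo/")
g = Graph()
g.bind("ex", EX)

for county, region, accidents in [
        (EX.Hampshire, EX.SouthEast, 120),
        (EX.Surrey, EX.SouthEast, 95),
        (EX.Cumbria, EX.NorthWest, 40)]:
    g.add((county, EX.within, region))
    g.add((county, EX.accidentCount, Literal(accidents, datatype=XSD.integer)))

# Aggregate county-level counts up to region level.
results = g.query("""
    PREFIX ex: <http://example.org/geo/>
    SELECT ?region (SUM(?n) AS ?total)
    WHERE { ?county ex:within ?region ; ex:accidentCount ?n . }
    GROUP BY ?region
""")
for region, total in results:
    print(region, total)
```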
Many people think that RDF is difficult and relational databases are easy, but John felt that for him, certainly, it was the other way around.
Much work still needs to be done on standardising spatial predicates in RDF; Oracle and the OGC are working on this, as spatial descriptors are not yet as well standardised as temporal ones.
PoolParty: SKOS Thesaurus Management utilizing Linked Data
Andreas Blumauer described the PoolParty SKOS editing tool. He felt that SKOS could be the way to introduce Web 2.0 mechanisms directly into the web of data. SKOS enables virtually any user to join in with their own knowledge organisation systems, and PoolParty seeks to support knowledge organisation + network effects + collaboration + ontology evolution.
He stressed the significance of using Linked Data principles within the firewall of an organisation, with benefits such as improved collaboration and sharing that can still be useful without having to release data onto the public Web.
The Semantic Web can improve every aspect of information retrieval, and if people move from free tagging with single words – which is fraught with problems such as ambiguity – to concept tagging with URIs, resource descriptions will become far more valuable and useful.
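A small sketch of that shift (my own; the document, tag property, and concept URIs are invented):

```python
# Sketch of moving from a free-text tag to a concept tag; the document,
# tag property, and concept URIs are invented.
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import DCTERMS, SKOS

VOC = Namespace("http://example.org/vocab/")
EX = Namespace("http://example.org/doc/")
g = Graph()

doc = EX["report-2010-03"]

# Free tagging: an ambiguous string (a river bank? a financial bank?).
g.add((doc, EX.tag, Literal("bank")))

# Concept tagging: an unambiguous URI that carries its own labels and links.
bank = VOC.FinancialInstitution
g.add((bank, SKOS.prefLabel, Literal("Bank (financial institution)", lang="en")))
g.add((doc, DCTERMS.subject, bank))
```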
Porting terminologies to the Semantic Web
Bernard Vatant of Mondeca talked about porting terminologies to the Semantic Web. Much of his work is done within intranets for companies that need to make use of large external vocabularies. He explained the model underlying the new management system for EUROVOC, his latest project. This vocabulary presents itself as a thesaurus, but with extensions of expressivity at the terminological level. Bernard emphasised the importance of a semiotic approach to terminology in the Semantic Web framework, which is especially relevant in the multilingual context (evident, for example, in the lexvo.org initiative). He proposed a semiotic view of terminology – every sign is a thing (signs are terms; resources are business objects) – and reminded us of the semiotic triangle of terms, concepts, and objects (Saussure: sign – signifiant – signifié). He pointed out that shallow ontologies can be very effective when more complexity isn’t needed.
Panel and networking
A full and interesting day ended with a lively panel discussion, ranging across many topics and producing gems like “data is the new raw material” and “in a data-rich world, the scarce commodity is attention”, and, as always, an excellent drinks and networking reception to finish.
(Recordings and presentation files from the entire conference are promised in the coming weeks.)
3 comments:
Thanks for a useful, detailed report. Although I didn't come to London, I can get a sense of this relevant theme -- a good choice by ISKOUK indeed.
yes, thanks. The conference looks great... I'd like to make it out sometime.
I always thought that semantic web layer cake diagram was a bit too technical to be that useful to wider audiences. It's good to see the more plain language 'linked data' being emphasized.
With progress on open government/open data, it seems we are really on the edge of some interesting innovation in information organization and use.
Thanks Fran!
I note that "artificial intelligence" was repeatedly linked with "science fiction" in the presentations.
That is a new (but significant) link to me. One of the problems with Linked Data must be to find and account for new types of links.
As for "strings" vs "things" where do these kinds of entities fit? An ontological approach must use concepts. Still, they cannot be fixed in stone or software.
Where Linked Data seems to work best is in a GIS context where one of the dimensions -- location -- is fixed in measurable space and time.
My worry is that Linked Data perpetuates docu-centric thinking, where it is all about finding the right bit of (dereferenced) information, at the cost of coming to terms with the ever-changing currents and flows of the Web and the inferences which can be drawn from them.