[rdfweb-dev] Identifying things in FOAF

Thu Jul 10 11:34:31 UTC 2003

Just written this on FOAF identification strategies. Published as-is on 
the weblog pretty much direct from my brain without much tweaking. If
any of it is particularly unclear or badly worded let me know and I'll
fix it up.

http://rdfweb.org/mt/foaflog/archives/000039.html

I'll spam you a copy here as well, as I'd be interested in feedback.

--danbri

         ............................................

RDFWeb and the FOAF (Friend of a Friend) project 

July 10, 2003

Identifying things in FOAF

There is growing interest in FOAF and its relationship to various
approaches to "identity management" on the Internet. The FOAF approach
to all this is distinctly pluralistic, to the extent that you might not
even notice that there is a FOAF way of dealing with identity. There
aren't, for example, 'FOAF identifiers' as such, although there is
certainly a FOAF approach to identifying things. So this is a first cut
at writing up some of the as-yet-unarticulated design assumptions behind
FOAF. A more user-friendly version would have examples; those will have
to come later.

So here's the basic story. FOAF is built on top of W3C's Resource
Description Framework (RDF), which itself uses XML and Unicode as file
format standards. All FOAF documents are RDF documents, and any RDF
application vocabularies (such as Dublin Core, RSS 1.0 core +
extensions, MusicBrainz, Wordnet etc.) can be used within FOAF
documents. FOAF shares with RDF a concern to use standard Web
identifiers (URIs) wherever possible. The URI specification (RFC 2396)
provides a common syntax for naming things on the Web, providing an
umbrella concept which covers both 'URLs' and 'URNs'.

To the extent that everything we want to talk about has a well known
URI, this solves all our problems. Lots and lots of things that we want
to talk about do have URIs. There are URIs for Web pages, for mailboxes,
for Java classes, for telephones, for ISBN-registered publications, and
so on. This is great - when you want to talk about one of these things
in a FOAF file, you just mention its URI. Simple, decentralised,
standard.

However, our story doesn't end here. FOAF needs to play in a world where
we don't all have total knowledge of every relevant fact. Sometimes a
thing might 'have' a URI (in some pedantic sense) yet 99% of parties on
the Web might not know what that URI is. Or, closer to my main theme, we
might want to talk in our FOAF files about things that it has proved
peculiarly difficult to get agreement about identifying. People, for
example.

Just try setting up a planet-wide system for identifying people and
you'll see my point. There is significant resistance to the idea of
creating a single set of identifiers used to 'tag' everyone. To put it
mildly. So... where does this leave FOAF? FOAF documents are scattered
around the Web, and each document makes a unique contribution to a
bigger picture which can only be seen when those documents are merged
together. In FOAF, we need to identify people, without there being
agreement on person-identifiers. Tricky!

So here is the good news. RDF was designed for generic, cross-domain
data merging. Imagine taking two arbitrary SQL databases and merging
them, so that your new database could answer questions which required
knowledge of things which were previously described partially in one
dataset, and partially in another. That sort of operation is hard to do,
because SQL wasn't designed in a way that makes this easy. Neither was
XML. But RDF was, and FOAF is built as an RDF application. In RDF, there
are off the shelf software tools which can take RDF documents, 'parse'
them into a set of simple 3-part statements (triples) which make claims
about the world, and store those statements alongside others in a merged
RDF database. To the extent that both datasets use the exact same
identifiers when mentioning things they describe, you get a rather handy
data-merge effect.
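
As a rough illustration of that data-merge effect (a toy sketch, not the API of any particular RDF toolkit): if each parsed document is modelled as a Python set of (subject, property, value) tuples, then merging is just set union. The page URI and the two tiny documents below are made up for the example; the Dublin Core property names are the only real identifiers.

```python
# Toy model: a parsed RDF document is a set of (subject, property, value)
# triples. Real RDF parsers and stores do far more than this.
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical page URI, used identically by both documents.
PAGE = "http://example.org/report.html"

doc_a = {(PAGE, DC + "title", "Annual report")}   # one file knows the title
doc_b = {(PAGE, DC + "date", "2003-07-10")}       # another knows the date

# Merging is set union; triples about the same URI end up side by side.
merged = doc_a | doc_b


def describe(store, subject):
    """Collect every (property, value) pair the store holds for one subject."""
    return {(p, o) for s, p, o in store if s == subject}


print(sorted(describe(merged, PAGE)))
```

Neither document alone can answer "what is the title and date of this page?", but the merged set can, and only because both used exactly the same URI.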

So here is the (not very) bad news. If two different RDF files (eg. FOAF
documents) are talking about the same thing but don't use exactly the
same URI when mentioning that thing, how are our poor stupid computers
supposed to be able to understand? In the real world, we want to write
RDF documents (eg. for FOAF) about things that we've not yet agreed on
common identifiers for. This is one of the core problems we've had to
address in FOAF.

Basically, off the shelf RDF tools can still do a lot to help us, but we
have to help them. FOAF, as an application that focusses on the
distributed, decentralised, almost out of control use of RDF 'in the
wild', ran into this problem after we had about half a dozen FOAF files.
There are now hundreds, soon thousands, of FOAF documents. Most of them
talk about people, quite successfully, despite the absence of a global
person-id registry. This sounds like a recipe for chaos, yet somehow
many of our FOAF aggregation tools are quite happy with this situation.
They can often figure out when two files are about the self-same thing,
without much help from the authors of those documents. We do this using
what might be called "reference by description". Instead of saying,
"this page was created by urn:global-person-registry:person-n22314151",
we say "this page was created by the person whose (some-property...) is
(some-value...)", taking care to use an unambiguous property such as
foaf:homepage or foaf:mbox_sha1sum.

Here's how it works. Recall that FOAF is built on top of RDF, and so
every FOAF document boils down to nothing more than a set of 3-part
statements which relate two things together via terms such as
'workplaceHomepage', 'homepage', 'mbox'.

I am related to those things that are my homepages; FOAF's name for that
relationship is 'foaf:homepage'.

I am related to those things that are my personal mailboxes by a
relationship FOAF calls 'foaf:mbox'.

I am related to the strings that you get from feeding my mailbox
identifiers to the SHA1 mathematical function by a relationship FOAF
calls 'foaf:mbox_sha1sum'.

I am related to a Myers-Briggs personality classification; FOAF calls
that relationship 'foaf:myersBriggs'.

I am related to my workplace homepage (http://www.w3.org/) by a
relationship called -- you guessed it -- 'foaf:workplaceHomepage'.

I am related to my name, 'Dan Brickley' by the 'foaf:name' relationship.

I am related to my AIM chat identifier by a relationship FOAF calls
'foaf:aimChatID'.

And so on. Other RDF vocabularies can define additional relationships
(see the FoafVocab entry in our wiki for pointers). They all relate
things to other things in named ways. A FOAF document, like any RDF
document, is simply a collection of these simple claims about how things
in the world relate.
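
To make that concrete, here is a minimal sketch of such a collection of claims, again using plain Python tuples rather than a real RDF library. The label "_:me" is just a document-local stand-in for the person being described, and the mailbox address is a placeholder rather than a real one. The sketch assumes the mailbox identifier fed to SHA1 is the full mailto: URI.

```python
import hashlib

FOAF = "http://xmlns.com/foaf/0.1/"


def mbox_sha1sum(mbox_uri: str) -> str:
    """Hex SHA1 digest of a mailbox identifier (assumed: the mailto: URI)."""
    return hashlib.sha1(mbox_uri.encode("utf-8")).hexdigest()


me = "_:me"                          # local label only, not a global person ID
mbox = "mailto:someone@example.org"  # placeholder mailbox for illustration

# Each tuple is one 3-part statement: (thing, relationship, other thing).
triples = {
    (me, FOAF + "name", "Dan Brickley"),
    (me, FOAF + "workplaceHomepage", "http://www.w3.org/"),
    (me, FOAF + "homepage", "http://rdfweb.org/people/danbri/"),
    (me, FOAF + "aimChatID", "danbri_2002"),
    (me, FOAF + "mbox_sha1sum", mbox_sha1sum(mbox)),
}

for statement in sorted(triples):
    print(statement)
```

Publishing the mbox_sha1sum value instead of the mailbox itself lets others match on it without the address being printed in the file.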

But look again. There is a hidden pattern here. Some of these
relationships are special.

foaf:homepage, foaf:mbox, foaf:mbox_sha1sum, foaf:aimChatID fall in one
category.

foaf:workplaceHomepage, foaf:myersBriggs, foaf:name fall in another.

Here's the difference. The former kinds of relationship (or 'property'
in RDF-talk) have a special characteristic. They have been defined such
that there is at most one thing in the world that has any particular
value for that property.

There is... at most one thing in the world with any given foaf:homepage.
Or foaf:mbox, or foaf:mbox_sha1sum, or foaf:aimChatID. By contrast,
there may well be multiple things in the world with the same
foaf:workplaceHomepage, or foaf:myersBriggs, or even (it's a big world)
foaf:name. Apparently there's another Dan Brickley out there. And lots
of my colleagues share my workplace homepage. And there are a lot of
people whom Myers-Briggs surveys classify as 'INTP'. But there is nobody
else at all who has the same foaf:homepage as me, or the same foaf:mbox.
Or foaf:aimChatID.

This is one of the design principles underlying FOAF (and for that
matter the entire Semantic Web effort): a pragmatic, pluralistic
approach to resource description and identification. Rather than
building big, centralised registries of people (or companies, or
physical things) we look for cheaper, more lightweight shared strategies
for identification. In FOAF, we do this by making sure there are
multiple ways we can identify things.

So one FOAF file might mention 'here is a photo; it depicts the person
whose mailbox is danbri at rdfweb.org'. Another FOAF file might say 'here
is a weblog entry written by the person whose homepage is
http://rdfweb.org/people/danbri/'. A third FOAF file might say 'here is a
chat transcript by the person whose foaf:aimChatID is danbri_2002'. To
the extent that there is publicly readable RDF in the Web that makes
all these claims, and that there is, perhaps scattered around, enough
information to deduce that these all describe the same person, RDF/FOAF
tools can 'smush' it all together. They could 'realise' that the photo
and the weblog and the chat log were all associated with the self-same
thing, ie me.
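
One way such 'smushing' could be sketched, under some loud assumptions: documents are modelled as sets of tuples, each file's local label for the person ("_:d1p" and so on) is hypothetical, the photo, weblog and chat-log URIs and the mailbox are placeholders, and the set of identifying properties is hard-coded. This is not how any particular FOAF aggregator is implemented; it only shows the idea of treating two labels as the same thing whenever they share a value for an identifying property.

```python
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

# Assumed list of uniquely identifying properties (hard-coded for the sketch).
IFPS = frozenset(FOAF + n for n in ("homepage", "mbox", "mbox_sha1sum", "aimChatID"))


def smush(triples, ifps=IFPS):
    """Rewrite triples so labels sharing an identifying value become one label."""
    parent = {}

    def find(node):
        # Union-find lookup with path halving.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    owner = {}  # (property, value) -> first subject seen with that pair
    for s, p, o in sorted(triples):
        find(s)
        if p in ifps:
            first = owner.setdefault((p, o), s)
            a, b = find(first), find(s)
            if a != b:
                parent[max(a, b)] = min(a, b)  # keep the smallest label

    canon = {node: find(node) for node in list(parent)}
    return {(canon.get(s, s), p, canon.get(o, o)) for s, p, o in triples}


MBOX = "mailto:someone@example.org"  # placeholder mailbox
HOME = "http://rdfweb.org/people/danbri/"
doc1 = {("http://example.org/photo.jpg", FOAF + "depicts", "_:d1p"),
        ("_:d1p", FOAF + "mbox", MBOX)}
doc2 = {("http://example.org/weblog/1", DC + "creator", "_:d2p"),
        ("_:d2p", FOAF + "homepage", HOME)}
doc3 = {("http://example.org/chat.log", DC + "creator", "_:d3p"),
        ("_:d3p", FOAF + "aimChatID", "danbri_2002")}
# A fourth file supplies the claims that tie the three descriptions together.
doc4 = {("_:d4p", FOAF + "mbox", MBOX),
        ("_:d4p", FOAF + "homepage", HOME),
        ("_:d4p", FOAF + "aimChatID", "danbri_2002")}

merged = smush(doc1 | doc2 | doc3 | doc4)
person = "_:d1p"
print(sorted(s for s, p, o in merged if o == person))
```

After smushing, the photo, the weblog entry and the chat log all point at a single label, even though no two of the original files used the same identifier for the person.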

To do that, we need certain pieces of information. We need to know
which, of all the kinds of relationship there are, are the uniquely
identifying ones. In RDF terminology we call these unambiguous (or more
technically, inverse-functional) properties. When RDF software reads the
FOAF spec it can determine this from markup embedded in the document
itself. So machines can find out quite easily which properties are ones
which uniquely identify people. They can do this for the FOAF spec, and
for any other RDF vocabulary that is used alongside FOAF.
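
A sketch of how a tool might pick those properties out, assuming the vocabulary's own description has already been parsed into tuples and assuming the marker used is OWL's InverseFunctionalProperty class. The schema excerpt below is hand-written for illustration and is not copied from the FOAF spec.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
OWL = "http://www.w3.org/2002/07/owl#"
FOAF = "http://xmlns.com/foaf/0.1/"

# Hand-written excerpt standing in for a parsed vocabulary description.
schema = {
    (FOAF + "homepage", RDF + "type", OWL + "InverseFunctionalProperty"),
    (FOAF + "mbox", RDF + "type", OWL + "InverseFunctionalProperty"),
    (FOAF + "name", RDF + "type", RDF + "Property"),
    (FOAF + "workplaceHomepage", RDF + "type", RDF + "Property"),
}


def inverse_functional_properties(schema_triples):
    """Return the properties the schema declares as uniquely identifying."""
    return {
        s for s, p, o in schema_triples
        if p == RDF + "type" and o == OWL + "InverseFunctionalProperty"
    }


print(sorted(inverse_functional_properties(schema)))
```

Because the check only looks at the declarations, the same function works for any other vocabulary whose description is merged into the schema set.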

The other bit of information needed is that somewhere in the Web, it
would need to be claimed that there is a person who has a mailbox of ...
and a homepage of ... and an aimChatID of ...

If that information is available, then FOAF tools are all set to do the
data merge, even though there is no planet-wide unified identification
system for people. We don't use anything else except off the shelf
standards: URIs plus W3C RDF and OWL technology.

If you find the data merging potential creepy, you are not alone. This
kind of technology is not going away, but there are steps you can take.
A full discussion of the privacy aspect isn't possible here, but the
basic idea is (i) be aware -- scattered information can easily be merged
(ii) keep things as secret as they need to be. Don't tell the world (in
your FOAF file or elsewhere) all the chat IDs and homepages and
mailboxes that you use, then act surprised when people and machines piece
together your scattered contributions to the Web. Reading up on PGP
might be a good idea.

We don't need to wait for a global identity management system before
privacy and data merging become issues. FOAF is intended to explore
these issues, and to provide some advance warning for the way certain
aspects of semantic web technology may affect our lives. Just as the
world has had to adapt to the notion of 'being Googled' and having
things that once seemed obscure now all too easily found, the rise of
semantic web technology needs to be accompanied by an understanding of
the risks and opportunities that 'being identified' presents.

Finally... a couple of points of further reading on the technical rather
than social side of this problem. A couple of years ago I wrote a brief
note on aggregation strategies which describes the 'smushing' problem. A
more recent writeup by Matt Biddulph describing his Java implementation
is worth a read too, as are many of the documents from the TAP project,
which share FOAF's concern for reference-by-description. Guha and Rob's
overview paper sets out the issues very clearly.

Posted by danbri at July 10, 2003 12:05 PM