  1. title: Linked Data - Design Issues
  2. url: https://www.w3.org/DesignIssues/LinkedData.html
  3. hash_url: 8df6b7af9ac944275a13f0d0e97ad7d7
  4. <a href="http://www.cafepress.com/w3c_shop"><img alt="Get a 5* mug" border="none" src="https://www.w3.org/DesignIssues/diagrams/lod/597992118v2_350x350_Back.jpg" align="right"/></a>
  5. <h1>Linked Data</h1>
  6. <p>The Semantic Web isn't just about putting data on the web. It
  7. is about making links, so that a person or machine can explore
  8. the web of data.  With linked data, when you have some of
  9. it, you can find other, related, data.</p>
  10. <p>Like the web of hypertext, the web of data is constructed with
  11. documents on the web. However,  unlike the web of hypertext,
  12.  where links are relationship anchors in
  13. hypertext documents written in <small>HTML</small>, for data the
  14. links are between arbitrary things described by
  15. <small>RDF</small>.  The <small>URI</small>s identify any
  16. kind of object or  concept.   But for
  17. <small>HTML</small> or <small>RDF</small>, the same expectations
  18. apply to make the web grow:</p>
  19. <ol>
  20. <li>
  21. <p>Use <small>URI</small>s as names for things</p>
  22. </li>
  23. <li>
  24. <p>Use <small>HTTP</small> <small>URI</small>s so that people
  25. can look up those names.</p>
  26. </li>
  27. <li>
  28. <p>When someone looks up a <small>URI</small>, provide useful
  29. information, using the standards (RDF*, SPARQL)</p>
  30. </li>
  31. <li>
  32. <p>Include links to other <small>URI</small>s, so that they
  33. can discover more things.</p>
  34. </li>
  35. </ol>
  36. <p>Simple.  In fact, though, a surprising amount of data
  37. isn't linked in 2006, because of problems with one or more of the
  38. steps.  This article discusses solutions to these problems,
  39. details of implementation, and factors affecting choices about
  40. how you publish your data.</p>
  41. <h2>The four rules</h2>
  42. <p>I'll refer to the steps above as rules, but they are
  43. expectations of behavior.  Breaking them does not destroy
  44. anything, but misses an opportunity to make  data
  45. interconnected.  This in turn limits the unexpected ways in
  46. which it can later be reused.  It is the unexpected re-use
  47. of information which is the value added by the web.</p>
  48. <p>The first rule, to identify things with
  49. <small>URI</small>s,  is pretty much understood by most
  50. people doing semantic web technology.  If it doesn't use the
  51. universal <small>URI</small> set of symbols, we don't call it
  52. Semantic Web.<br/>
  53. <br/>
  54. The second rule, to use <small>HTTP</small>
  55. <small>URI</small>s,  is also widely understood.  The
  56. only deviation has been, since the web started,  a constant
  57. tendency for people to invent new <small>URI</small> schemes (and
  58. sub-schemes within the <span>urn:</span> scheme)  such as
  59. <small>LSID</small>s and handles and <small>XRI</small>s and
  60. <small>DOI</small>s and so on, for various reasons.
  61.  Typically, these involve not wanting to commit to the
  62. established Domain Name System (<small>DNS</small>) for
  63. delegation of authority but to construct something under separate
  64. control.   Sometimes it has to do with not understanding
  65. that <small>HTTP</small> <small>URI</small>s are names (not
  66. addresses) and that <small>HTTP</small> name lookup is a complex,
  67. powerful and evolving set of standards. This issue is discussed at
  68. length elsewhere, and time does not allow us to delve into it
  69. here. [@@ref TAG finding, etc.]</p>
  70. <p>The third rule, that one should serve information on the web
  71. against a <small>URI</small>, is, in 2006, well followed for most
  72. ontologies, but, for some reason, not for some major datasets.
  73.  One can,  in general,  look up the properties and
  74. classes one finds in data, and get information from the
  75. <small>RDF</small>, <small>RDFS</small>, and <small>OWL</small>
  76. ontologies including the relationships between the terms in the
  77. ontology.</p>
  78. <p>The basic format here is RDF/XML, with its popular
  79. alternative serialization N3 (or Turtle). Large datasets provide
  80. a SPARQL query service, but the basic linked data should be
  81. provided as well.</p>
  82. <p>Many research and evaluation projects in the few years of
  83. Semantic Web technologies so far have produced ontologies and significant
  84. data stores, but the data, if available at all, is buried in a
  85. zip archive somewhere, rather than being accessible on the web as
  86. linked data.  The Biopax project and the CSAktive data on
  87. computer science research people and projects were two examples.
  88. [The CSAktive data is now (2007) available as linked data]</p>
  89. <p>There is also a large and increasing number of
  90. <small>URI</small>s of non-ontology data which can be looked up.
  91.  <a href="http://ontoworld.org/wiki/Semantic_wiki">Semantic
  92. wikis</a> are one example. The "Friend of a friend"
  93. (<small>FOAF</small>) and <span>Description of a Project</span>
  94. (<small>DOAP</small>) ontologies are used to build social
  95. networks across the web.    Typical <a href="http://en.wikipedia.org/wiki/List_of_social_networking_websites">
  96. social network portals</a> do not provide links to other sites,
  97. nor expose their data in a standard form.</p>
  98. <p>LiveJournal and Opera Community are two portal web sites which
  99. do in fact publish their data in <small>RDF</small> on the web.
  100.   (Plaxo has a trial scheme, and I'm not sure
  101. whether they support <span>knows</span> links). This means that I can
  102. write in my <small>FOAF</small> file that I know Håkon Lie by
  103. using his <small>URI</small> in the Opera Community data, and a
  104. person or machine browsing that data can then follow that link
  105. and find all his friends. <i>[Update:]</i> Also, the Opera
  106. Community site allows you to register the RDF URI for yourself on
  107. another site. This means that public data about you from
  108. different sites can be linked together into one web, and a person
  109. or machine starting with your Opera identity can find the others.
  110. </p>
  111. <p>The fourth rule, to make links elsewhere,  is necessary
  112. to connect the data we have into a web, a serious, unbounded web
  113. in which one can find all kinds of things,  just as on the
  114. hypertext web we have managed to build.</p>
  115. <p>In hypertext web sites it is considered generally rather bad
  116. etiquette not to link to related external material.  The
  117. value of your own information is very much a function of what it
  118. links to, as well as the inherent value of the information within
  119. the web page.  So it is also in the Semantic Web.</p>
  120. <p>So let's look at the ways of linking data, starting with the
  121. simplest way of making a link.</p>
  122. <h3>Basic web look-up</h3>
  123. <p>The simplest way to make linked data is to use, in one file, a
  124. <small>URI</small> which points into another.</p>
  125. <p>When you write an <small>RDF</small> file,   say
  126. &lt;http://example.org/smith&gt;, then you can use local
  127. identifiers within the file, say  #albert, #brian and
  128. #carol.  In N3 you might say</p>
  129. <pre>
  130. &lt;#albert&gt; fam:child &lt;#brian&gt;, &lt;#carol&gt;.
  131. </pre>
  132. <p>or in <small>RDF/XML</small></p>
  133. <pre>
  134. &lt;rdf:Description rdf:about="#albert"&gt;<br/> &lt;fam:child rdf:resource="#brian"/&gt;<br/> &lt;fam:child rdf:resource="#carol"/&gt;<br/>&lt;/rdf:Description&gt;
  135. </pre>
  136. <p>The <small>WWW</small> architecture now gives a global
  137. identifier  "http://example.org/smith#albert" to Albert.
  138.  This is a valuable thing to do, as anyone on the planet can
  139. now use that global identifier to refer to Albert and give more
  140. information. </p>
  141. <p>For example, in the
  142. document &lt;http://example.org/jones&gt; someone might
  143. write:</p>
  144. <pre>
  145. &lt;#denise&gt; fam:child &lt;#edwin&gt;, &lt;smith#carol&gt;.
  146. </pre>
  147. <p>or in <small>RDF/XML</small></p>
  148. <pre>
  149. &lt;rdf:Description rdf:about="#denise"&gt;<br/> &lt;fam:child rdf:resource="#edwin"/&gt;<br/> &lt;fam:child rdf:resource="http://example.org/smith#carol"/&gt;<br/>&lt;/rdf:Description&gt;
  150. </pre>
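<p>How the local identifiers in the two files become global ones can be sketched with Python's standard <span>urljoin</span>. This is only an illustration of relative URI resolution: the helper name <span>global_identifier</span> is made up for the example, and the document URIs are the ones used above.</p>

```python
from urllib.parse import urljoin


def global_identifier(document_uri: str, reference: str) -> str:
    """Resolve a reference written inside a document against that document's URI.

    Hypothetical helper for illustration; it simply wraps urljoin.
    """
    return urljoin(document_uri, reference)


smith = "http://example.org/smith"
jones = "http://example.org/jones"

# <#albert> written inside the smith document
albert = global_identifier(smith, "#albert")

# <#carol> written inside smith, and <smith#carol> written inside jones,
# resolve to the same global identifier.
carol_in_smith = global_identifier(smith, "#carol")
carol_in_jones = global_identifier(jones, "smith#carol")

print(albert)
print(carol_in_smith == carol_in_jones)
```

<p>Because both references resolve to one identifier, the statement in the jones document is a link into the smith document.</p>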
  151. <p><br/>
  152. Clearly it is reasonable for anyone who comes across the
  153. identifier "http://example.org/smith#carol" to:</p>
  154. <ol>
  155. <li>Form the <small>URI</small> of the document by truncating
  156. before the hash</li>
  157. <li>Access the document to obtain information about #carol</li>
  158. </ol>
  159. <p>We call this dereferencing the <small>URI</small>.  This
  160. is basic semantic web. </p>
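<p>The two dereferencing steps can be sketched in Python using only the standard library. The function names are hypothetical, and the choice of <span>application/rdf+xml</span> in the Accept header is an assumption about what the server offers, not something the steps above require.</p>

```python
from urllib.parse import urldefrag
from urllib.request import Request, urlopen


def document_uri(identifier: str) -> str:
    """Step 1: form the URI of the document by truncating before the hash."""
    url, _fragment = urldefrag(identifier)
    return url


def local_name(identifier: str) -> str:
    """The part after the hash, which names the thing within the document."""
    return urldefrag(identifier).fragment


def fetch_description(identifier: str) -> bytes:
    """Step 2: access the document to obtain information about the thing.

    Assumption: the server can return RDF/XML. Not called below, because
    example.org does not actually serve this data.
    """
    request = Request(document_uri(identifier),
                      headers={"Accept": "application/rdf+xml"})
    with urlopen(request) as response:
        return response.read()


carol = "http://example.org/smith#carol"
print(document_uri(carol))
print(local_name(carol))
```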
  161. <p>There are several variations.</p>
  162. <h3>Variation: URIs without Hashes and HTTP 303</h3>
  163. <p>There are some circumstances in which dividing identifiers
  164. into documents doesn't work very well.   There may logically
  165. be one global symbol per document, and there is a
  166. reluctance to include a # in the <small>URI</small> such
  167. as </p>
  168. <p>
  169. http://wordnet.example.net/antidisesablishmentarianism#word</p>Historically,
  170. the early Dublin Core and <small>FOAF</small> vocabularies did
  171. not have # in their URIs.   In any event when
  172. <small>HTTP</small> <small>URI</small>s without hashes are used
  173. for abstract concepts, and there is a document that carries
  174. information about them, then:<br/>
  175. <ol>
  176. <li>An <small>HTTP</small> <small>GET</small>  request on
  177. the <small>URI</small> of the concept returns <span>303 See Other</span> and gives in the
  178. Location: header, the <small>URI</small> of the
  179. document.  </li>
  180. <li>The document is retrieved as normal</li>
  181. </ol>
  182. <p>This method has the advantage that <small>URI</small>s can be
  183. made up of all forms.  It has the disadvantage that an
  184. <small>HTTP</small> request must be made for every
  185. single one.  In the case of Dublin Core, for example,
  186. dc:title and dc:creator etc are in fact served by the same
  187. ontology document, but  one does not know until they have
  188. each been fetched and returned HTTP redirections.</p>
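<p>The client-side logic for this variation might look like the following Python sketch. It works on an already received status code and Location header rather than making network calls, and the vocabulary URIs under <span>vocab.example.org</span> are invented stand-ins, not the real Dublin Core addresses.</p>

```python
from typing import Optional
from urllib.parse import urljoin


def describing_document(requested_uri: str, status: int,
                        location: Optional[str] = None) -> str:
    """Work out which document carries information about requested_uri.

    Hypothetical helper: status and location are taken from the response
    to an HTTP GET on requested_uri.
    """
    if status == 303:
        if not location:
            raise ValueError("303 response without a Location header")
        # The Location value may be relative to the requested URI.
        return urljoin(requested_uri, location)
    if 200 <= status < 300:
        # The URI was served directly, so it names the document itself.
        return requested_uri
    raise ValueError(f"cannot dereference {requested_uri}: HTTP {status}")


# Invented responses standing in for real HTTP exchanges.
responses = {
    "http://vocab.example.org/dc/title":
        (303, "http://vocab.example.org/dc/terms.rdf"),
    "http://vocab.example.org/dc/creator":
        (303, "terms.rdf"),
}

documents = {uri: describing_document(uri, status, location)
             for uri, (status, location) in responses.items()}
print(documents)
```

<p>Both terms turn out to live in one document, but the client only learns that after making a request for each of them, which is the disadvantage noted above.</p>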
  189. <h3>Variation: FOAF and rdfs:seeAlso</h3>
  190. <p>The <a href="http://foaf-project.org/">Friend-Of-A-Friend</a> convention
  191. uses a form of data link, but  not using either of the two
  192. forms mentioned above.  To refer to another person in a
  193. <small>FOAF</small> file, the convention was to give two
  194. properties, one pointing to the document they are described in,
  195. and the other for identifying them within that document.</p>
  196. <pre>
  197. &lt;#i&gt; foaf:knows [<br/> foaf:mbox &lt;mailto:joe@example.com&gt;;<br/> rdfs:seeAlso &lt;http://example.com/foaf/joe&gt; ].
  198. </pre>
  199. <p>Read, "I know that which has email  joe@example.com and
  200. about which more information is in
  201. &lt;http://example.com/foaf/joe&gt;".</p>
  202. <p>For privacy, people often don't put their email
  203. addresses on the web directly, but instead put a one-way hash
  204. (<small>SHA-1</small>) of their email address and give that. This
  205. clever trick allows people who know their email address already
  206. to work out that it is the same person, without giving the email
  207. away to others.</p>
  208. <pre>
  209. &lt;#i&gt; foaf:knows [<br/> foaf:mbox_sha1sum "2738167846123764823647"; # @@ dummy<br/> rdfs:seeAlso &lt;http://example.com/foaf/joe&gt; ].
  210. </pre>
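<p>A minimal Python sketch of the hashing trick follows. It assumes, following the FOAF vocabulary's description of <span>foaf:mbox_sha1sum</span>, that the hash is taken over the whole mailbox URI including the <span>mailto:</span> prefix; the function names are hypothetical.</p>

```python
import hashlib


def mbox_sha1sum(email: str) -> str:
    """SHA-1 of the mailto: URI for an email address, as lowercase hex.

    Assumption: the hash covers the full URI, e.g. "mailto:joe@example.com".
    """
    return hashlib.sha1(f"mailto:{email}".encode("utf-8")).hexdigest()


def is_same_mailbox(known_email: str, published_sum: str) -> bool:
    """Someone who already knows the address can confirm the match."""
    return mbox_sha1sum(known_email) == published_sum.lower()


published = mbox_sha1sum("joe@example.com")  # what Joe's friends publish
print(is_same_mailbox("joe@example.com", published))
print(is_same_mailbox("someone.else@example.com", published))
```

<p>The published value confirms a guess but cannot practically be turned back into the address, which is why it can be shared more freely than the mailbox itself.</p>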
  211. <p>This linking system was very successful, forming a
  212.  growing social network, and dominating, in 2006, the linked
  213. data available on the web.</p>
  214. <p>However, the system has the snag that it does not give
  215. <small>URI</small>s to people, and so basic links to them cannot
  216. be made.</p>
  217. <p>I  recommend (e.g. in weblog posts on <a href="http://dig.csail.mit.edu/breadcrumbs/node/62">Links on the
  218. Semantic Web</a>, <a href="http://dig.csail.mit.edu/breadcrumbs/node/71">Give yourself a
  219. URI</a>, and <a href="http://dig.csail.mit.edu/breadcrumbs/node/72">Backward and
  220. Forward links in RDF just as important</a>) that those making a
  221. <small>FOAF</small> file give themselves a <small>URI</small> as
  222. well as using the <small>FOAF</small> convention.   
  223.  Similarly, when you refer to a <small>FOAF</small>
  224.  file which gives  a <small>URI</small> to a person,
  225. use it in your reference to that person, so that clients which
  226. just use <small>URI</small>s and don't know about the
  227. <small>FOAF</small> convention can follow the link.</p>
  228. So now we have looked at ways of making a link,
  229. let's look at the  choices of when to make a link.<br/>
  230. <p>One important pattern is a set of data which you can explore
  231. as you go link by link by fetching data.   Whenever one
  232. looks up the URI for a node in the RDF graph, the server returns
  233. information about the arcs out of that node, and the arcs in.
  234.  In other words, it returns any RDF statements in which the
  235. term appears as either subject or object.</p>
  236. <p>Formally,  call a graph G <span>browsable</span> if, for the URI of
  237. any node in G, looking up that URI returns
  238. information which describes the node, where describing a node
  239. means:</p>
  240. <ol>
  241. <li>Returning all statements where the node is a subject or
  242. object; and</li>
  243. <li>Describing all blank nodes attached to the node by one
  244. arc.</li>
  245. </ol><br/>
  246. <p class="detail">(The subgraph returned has been referred to as
  247. a "Minimum Spanning Graph" (MSG) [@@ref] or an RDF molecule
  248. [@@ref], depending on whether nodes are considered identified if
  249. they can be expressed as a path of functional, or reverse inverse
  250. functional, properties. A concise bounded description, which only
  251. follows links from subject to object,  does not work.)</p>
  252. <p>In practice, when data is stored in two documents, this means
  253. that any <small>RDF</small> statements which relate things in the
  254. two files must be repeated in each.  So, for example, in my
  255. <small>FOAF</small> page I mention that I am a member of the
  256. <small>DIG</small> group, and that information is repeated on the
  257. <small>DIG</small> group data. Thus, someone starting from the
  258. concept of the group can also find out that I am a member.
  259.  In fact, someone who starts off with my <small>URI</small>
  260. can find all the people who are in the same group.</p>
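<p>The notion of describing a node can be sketched in Python over a graph held as plain (subject, predicate, object) tuples. This is one reading of the definition, not a reference implementation: terms are simple strings, blank nodes are assumed to be written with a <span>_:</span> prefix, blank nodes are followed recursively, and the <span>ex:</span> names are invented.</p>

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def is_blank(term: str) -> bool:
    """Assumption: blank nodes are written as "_:label"."""
    return term.startswith("_:")


def describe(graph: Iterable[Triple], node: str) -> Set[Triple]:
    """All statements with node as subject or object, plus descriptions
    of any blank nodes reached along the way."""
    triples = set(graph)
    described: Set[Triple] = set()
    seen = {node}
    pending = [node]
    while pending:
        current = pending.pop()
        for triple in triples:
            subject, _predicate, obj = triple
            if current not in (subject, obj):
                continue
            described.add(triple)
            # Queue blank nodes attached by one arc so they get described too.
            for term in (subject, obj):
                if is_blank(term) and term not in seen:
                    seen.add(term)
                    pending.append(term)
    return described


graph = [
    ("ex:tim", "foaf:member", "ex:dig"),
    ("ex:dan", "foaf:member", "ex:dig"),
    ("ex:tim", "foaf:knows", "_:b1"),
    ("_:b1", "foaf:mbox", "mailto:joe@example.com"),
    ("ex:dan", "foaf:name", "Dan"),
]

group_view = describe(graph, "ex:dig")
person_view = describe(graph, "ex:tim")
print(sorted(group_view))
print(sorted(person_view))
```

<p>Starting from the group yields both membership statements, and starting from a person yields the membership arc plus the description of the blank node that person knows, so a client can walk from a member to the group and on to the other members.</p>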
  261. <h3>Limitations on browsable data</h3>
  262. <p>So statements which relate things in the two documents must be
  263. repeated in each. This clearly is against the first rule of data
  264. storage: don't store the same data in two different places: you
  265. will have problems keeping it consistent.  This is indeed an
  266. issue with browsable data.   A set of completely
  267. browsable data with links in both directions has to be completely
  268. consistent, and that takes coordination, especially if different
  269. authors or different programs are involved.</p>
  270. <p>We can have completely browsable data, however, where it is
  271. automatically generated.  The <a href="http://dig.csail.mit.edu/2006/dbview/dbview.py">dbview</a>
  272.  server, for example,  provides browsable virtual
  273.  documents containing the data from any arbitrary relational
  274. database.</p>
  275. <p>When we have data from multiple sources, then we have
  276. compromises.  These are often settled by common sense,
  277. asking the question,</p>
  278. <blockquote>
  279. <p>"If someone has the URI of that thing, what relationships to
  280. what other objects is it useful to know about?"</p>
  281. </blockquote>
  282. <p>Sometimes, social questions  determine the answer.
  283.  I have links in my <small>FOAF</small> file that I know
  284. various people.  They don't generally repeat that
  285. information in their <small>FOAF</small> files. Someone may say
  286. that they know me, which is an assertion which, in the
  287. <small>FOAF</small> convention, is theirs to assert, and the
  288. reader's to trust or not.  </p>
  289. <p>Other times, the number of arcs makes it impractical.   A
  290. <small>GPS</small> track gives thousands of times at which my
  291. latitude and longitude are known. Every person loading my
  292. <small>FOAF</small> file can expect to get my business card
  293. information, but not all those trackpoints. It is reasonable to
  294. have a pointer from the track (or even each point) to the person
  295. whose position is represented, but not the other way. </p>
  296. <p>One pattern is to have links of a certain property in a
  297. separate document.   A person's homepage doesn't list all
  298. their publications, but instead puts a link to a separate
  299. document listing them.  There is an understanding
  300. that <span>foaf:made</span>
  301. gives a work of some sort, but <span>foaf:pubs</span> points to a document
  302. giving a list of works.  Thus, someone searching for
  303. a <span>foaf:made</span>
  304. link would do well to follow a <span>foaf:pubs</span> link.  It might
  305. be useful to formalize the notion with a statement like</p>
  306. <pre>
  307. foaf:made link:listDocumentProperty foaf:pubs.
  308. </pre>
  309. <p>in one of the ontologies.</p>
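<p>How a client might use such a statement can be sketched in Python. Everything here is hypothetical: <span>link:listDocumentProperty</span> is only the suggestion made above, the mapping is passed in as a plain dictionary, and the data is invented.</p>

```python
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]


def where_to_look(graph: Sequence[Triple], subject: str, wanted: str,
                  list_document_property: Dict[str, str]
                  ) -> Tuple[List[str], List[str]]:
    """Return (values found directly, documents worth fetching next).

    list_document_property maps a property such as foaf:made to the
    property that points at a document listing such values.
    """
    direct = [o for s, p, o in graph if s == subject and p == wanted]
    list_property = list_document_property.get(wanted)
    documents = [o for s, p, o in graph
                 if s == subject and p == list_property]
    return direct, documents


# Invented homepage data: no foaf:made arcs, just a pointer to a list.
homepage = [
    ("ex:tim", "foaf:name", "Tim"),
    ("ex:tim", "foaf:pubs", "http://example.org/tim/publications"),
]
mapping = {"foaf:made": "foaf:pubs"}

direct, documents = where_to_look(homepage, "ex:tim", "foaf:made", mapping)
print(direct, documents)
```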
  310. <h3>Query services</h3>
  311. <p>Sometimes the sheer volume of data makes serving it as lots of
  312. files possible, but cumbersome for efficient remote queries over
  313. the dataset.  In this case, it seems reasonable to provide a
  314. <small>SPARQL</small> query service.  To make the data be
  315. effectively linked, someone who only has the
  316.  <small>URI</small> of something must be able to find their
  317. way to the <small>SPARQL</small> endpoint. </p>
  318. <p>Here again the <small>HTTP</small> 303 response can be used,
  319. to refer the enquirer to a document with metadata about which
  320. query service endpoints can provide what information about which
  321.  classes of <small>URI</small>s.</p>Vocabularies for doing
  322. this have not yet been standardized.<br/>
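<p>Once an endpoint has been found, asking it about a single URI is straightforward. The Python sketch below builds a SPARQL protocol GET request URL carrying a DESCRIBE query; the endpoint address is invented, and how the client discovers it is exactly the unstandardized part mentioned above.</p>

```python
from urllib.parse import urlencode


def describe_query_url(endpoint: str, uri: str) -> str:
    """Build a GET URL asking a SPARQL endpoint to DESCRIBE one resource.

    Sketch only: the URI is assumed to be well formed and is not validated.
    """
    query = f"DESCRIBE <{uri}>"
    return f"{endpoint}?{urlencode({'query': query})}"


# Hypothetical endpoint for illustration.
url = describe_query_url("http://example.org/sparql",
                         "http://example.org/smith#carol")
print(url)
```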
  323. (Added 2010). This year, in order to encourage
  324. people -- especially government data owners -- along the road to
  325. good linked data, I have developed this star rating system.
  326. <p>Linked Data is defined above. Linked <em>Open</em> Data (LOD)
  327. is Linked Data which is released under an open licence, which
  328. does not impede its reuse for free. Creative Commons CC-BY is an
  329. example open licence, as is the UK's <a href="http://www.nationalarchives.gov.uk/doc/open-government-licence/">
  330. Open Government Licence</a>. Linked Data does not of course in
  331. general have to be open -- there is a lot of important use of
  332. linked data internally, and for personal and group-wide data. You
  333. can have 5-star Linked Data without it being open. However, if it
  334. claims to be Linked Open Data then it does have to be open, to
  335. get any star at all.</p>Under the star scheme, you get one (big!)
  336. star if the information has been made public at all, even if it
  337. is a photo of a scan of a fax of a table -- if it has an open
  338. licence. Then you get more stars as you make it progressively more
  339. powerful, easier for people to use.
  340. <table>
  341. <tr>
  342. <td class="stars">★</td>
  343. <td>Available on the web (whatever format) <i>but with an
  344. open licence, to be Open Data</i></td>
  345. </tr>
  346. <tr>
  347. <td class="stars">★★</td>
  348. <td>Available as machine-readable structured data (e.g. Excel
  349. instead of an image scan of a table)</td>
  350. </tr>
  351. <tr>
  352. <td class="stars">★★★</td>
  353. <td>As (2), plus non-proprietary format (e.g. CSV instead of
  354. Excel)</td>
  355. </tr>
  356. <tr>
  357. <td class="stars">★★★★</td>
  358. <td>All the above, plus: use open standards from W3C (RDF and
  359. SPARQL) to identify things, so that people can point at your
  360. stuff</td>
  361. </tr>
  362. <tr>
  363. <td class="stars">★★★★★</td>
  364. <td>All the above, plus: Link your data to other people’s
  365. data to provide context</td>
  366. </tr>
  367. </table>
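<p>Because each star builds on the ones before it, the scheme can be written as a small cumulative check. The Python sketch below is an illustration only; the <span>Dataset</span> fields are invented names for the five rows of the table.</p>

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    """Hypothetical checklist, one flag per row of the star table."""
    on_web_with_open_licence: bool
    machine_readable: bool
    non_proprietary_format: bool
    uses_w3c_standards: bool      # RDF and SPARQL, URIs to identify things
    links_to_other_data: bool


def open_data_stars(dataset: Dataset) -> int:
    """Count stars cumulatively: a missed level stops the count."""
    criteria = (
        dataset.on_web_with_open_licence,
        dataset.machine_readable,
        dataset.non_proprietary_format,
        dataset.uses_w3c_standards,
        dataset.links_to_other_data,
    )
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars


scanned_fax = Dataset(True, False, False, False, False)
csv_file = Dataset(True, True, True, False, False)
closed_rdf = Dataset(False, True, True, True, True)

print(open_data_stars(scanned_fax), open_data_stars(csv_file),
      open_data_stars(closed_rdf))
```

<p>The last case reflects the point that data claiming to be Linked Open Data gets no star at all without an open licence, however well it is linked.</p>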
  368. <p>How well does your data do? You can buy <a href="http://www.cafepress.co.uk/w3c_shop.480759174">5 star data
  369. mugs</a>, T-shirts and bumper stickers from the W3C shop at
  370. cafepress: use them to get your colleagues and fellow
  371. conference-goers thinking 5 star linked data. (Profits also help
  372. W3C :-).</p>
  373. <p>Now in 2010, people have been pressing me, for government data,
  374. to add a new requirement, and that is there should be metadata
  375. about the data itself, and that that metadata should be available
  376. from a major catalog. Any open dataset (or even datasets which
  377. are not but should be open) can be registered at ckan.net.
  378. Government datasets from the UK and US should be registered at
  379. data.gov.uk or data.gov respectively. Other countries I expect
  380. to develop their own registries. Yes, there should be metadata
  381. about your dataset. That may be the subject of a new note in this
  382. series.</p>
  383. <br/>
  384. <p>Linked data is essential to actually connect the semantic web.
  385.  It is quite easy to do with a little thought, and becomes
  386. second nature.   Various common sense considerations
  387. determine when to make a link and when not to.</p>
  388. <p>The <a href="http://dig.csail.mit.edu/2005/ajar/ajaw/tab">Tabulator</a>
  389. client (running in a suitable browser)  allows you to browse
  390. linked data using the above conventions, and can be used to check
  391. that your linked data works.</p>
  392. <p>References</p>
  393. <p>[Ding2005] Li Ding, et al.,  <a href="http://ebiquity.umbc.edu/paper/html/id/240/"><span>Tracking RDF Graph Provenance using RDF
  394. Molecules</span></a>, UMBC Tech Report TR-CS-05-06</p>
  395. <hr/>
  396. <h2>Followup</h2>
  397. <p>2006-02 Rob Crowell adapts Dan Connolly's DBView (2004) which
  398. maps SQL data into linked RDF, adding backlinks.</p>
  399. <p>2006-09-05 Chris Bizer et al adapt <a href="http://sites.wiwiss.fu-berlin.de/suhl/bizer/d2r-server/">D2R
  400. Server</a> to provide a linked data view of a database.</p>
  401. <p>2006-10-10 Chris Bizer et al produce the <a href="http://sites.wiwiss.fu-berlin.de/suhl/bizer/ng4j/semwebclient/">Semantic
  402. Web Client Library</a>, "Technically, the library represents the
  403. Semantic Web as a single Jena RDF graph or Jena Model." The code
  404. fetches web documents as needed to answer queries.</p>
  405. <p>2007-01-15 Yves Raimond has produced a <a href="http://moustaki.org/swic/">Semantic Web client for SWI
  406. prolog</a> with similar functionality.</p>
  407. <p>I gave a talk at the 2009 O'Reilly eGovernment 2.0 conference
  408. in Washington DC, talking about "Just a Bag of Chips" @@ref, and
  409. talking about the 5 star scheme. Following that, InkDroid
  410. blogged a summary (and CSS) of my 5 star scheme, adapted here.</p>