Linked Data Blog Aggregator

September 01, 2010

AI3:::Adaptive Information (Mike Bergman)

A New Methodology for Building Lightweight, Domain Ontologies

Bringing Ontology Development and Maintenance to the Mainstream

Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies play a role in the organization of data similar to that played by relational data schemas. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data [1].

There are many ways to categorize ontologies. One dimension is between upper level and mid- and lower- (or domain-) level. Another is between reference and subject (domain) ontologies. Upper-level ontologies [2] tend to be encompassing, abstract and inclusive ways to split or organize all “things”. Reference ontologies tend to be cross-cutting, such as ones that describe people and their interests (e.g., FOAF), reference subject concepts (e.g., UMBEL), bibliographies and citations (e.g., BIBO), projects (e.g., DOAP), simple knowledge structures (e.g., SKOS), social networks and activities (e.g., SIOC), and so forth.

The focus here is on domain ontologies, which are descriptions of particular subject or domain areas. Domain ontologies are the “world views” by which organizations, communities or enterprises describe the concepts in their domain, the relationships between those concepts, and the instances or individuals that are the actual things that populate that structure. Thus, domain ontologies are the basic bread-and-butter descriptive structures for real-world applications of ontologies.

According to Corcho et al. [3] “a domain ontology can be extracted from special purpose encyclopedias, dictionaries, nomenclatures, taxonomies, handbooks, scientific special languages (say, chemical formulas), specialized KBs, and from experts.” Another way of stating this is to say that a domain ontology — properly constructed — should also be a faithful representation of the language and relationships for those who interact with that domain. The form of the interaction can range from work to play to intellectual understanding or knowledge.

“… ontology engineering research should strive for a unified, lightweight and component-based methodological framework, principally targeted at domain experts ….”

Simperl et al. [4]

Another focus here is on lightweight ontologies. These are typically defined as more hierarchical or classificatory in nature. Like their better-known cousins, taxonomies, but with greater connectedness, lightweight ontologies are often designed to represent subsumption or other relationships between concepts. They have relatively few, and relatively simple, predicates (relationships). As relationships are added and the complexities of the world get further captured, ontologies migrate from the lightweight to the “heavyweight” end of the spectrum.

The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. For reasons stated below, we prefer not to use the term ontology engineering, since it tends to convey that a priesthood or specialized expertise is needed to define or use ontologies. As indicated, we see ontologies as being (largely) developed and maintained by the users or practitioners within a given domain. The tools and methodologies to be employed need to be geared to these same democratic (small “d”) objectives.

A Review of Prior Methodologies

Over the last twenty years many methods have been put forward for developing ontologies. These methodological activities have diminished somewhat in recent years. The research, separately discussed in Ontology Development Methodologies [1], seems to indicate this state of methodology development in the field:

  • Very few uniquely different methods exist, and those that do are relatively older in nature
  • The methods tend to either cluster into incremental, iterative ones or those more oriented to comprehensive approaches
  • There is a general logical sharing of steps across most methodologies from assessment to deployment and testing and refinement
  • Actual specifics and flowcharts are quite limited; with the exception of the UML-based systems, most appear not to meet enterprise standards
  • The supporting toolsets are not discussed much, and most examples, where given at all, are based solely on a single or governing tool. Tool integration and interoperability are almost non-existent in the narratives, and
  • Development methodologies do not appear to be an active area of recent research.

While there is by no means unanimity in this community, some general points of consensus can be seen from these prior reviews, especially those that concentrate on practical or enterprise ontologies. In terms of design objectives, this consensus suggests that ontologies should be [4]:

  • Collaborative
  • Lightweight
  • Domain-oriented (subject matter and expertise)
  • Integrated, and
  • Incremental.

These objectives are laudable, and we adhere to them, but current ontology development methods do not meet them. Furthermore, as discussed in our next installment, there is also an inadequate slate of tools ready to support these objectives.

A Call for a New Methodology

If you ask most knowledgeable enterprise IT executives what they understand ontologies to mean and how they are to be built, you would likely hear that ontologies are expensive, complicated and difficult to build. Reactions such as these (and not trying to set up strawmen) are a reflection of both the lack of methods to achieve the consensual objectives above and the lack of tools to do so.

The use of ontology design patterns is one helpful approach [5]. Such patterns help indicate best design practice for particular use cases and relationship patterns. However, while such patterns should be part of a general methodology, they do not themselves constitute a methodology.

Also, as Structured Dynamics has argued for some time, the future of the semantic enterprise resides in ontology-driven apps [6]. Yet, for that vision to be realized, clearly both methods and tools to build ontologies must improve. In part this series is a reflection of our commitment to plug these gaps.

What we see at present for ontology development is a highly technical, overly engineered environment. Methodologies are only sparsely or generally documented. They are not lightweight nor collaborative nor really incremental. While many tools exist, they do not interoperate and are pitched mostly at the professional ontologist, not the domain user. In order to achieve the vision of ontology-driven apps the methods to develop the fulcrum of that vision — namely, the ontologies themselves — need much additional attention. An adaptive methodology for ontology development is well past due.

Design Criteria for an Adaptive Methodology

We can thus combine the results of prior surveys and recommendations with our own unique approach to adaptive ontologies in order to derive design criteria. We believe this adaptive approach should be:

  • Lightweight and domain-oriented
  • Contextual
  • Coherent
  • Incremental
  • Built on the re-use of structure
  • Built on a separation of the ABox and TBox (separate work), and
  • Simpler, with interoperable tool designs.

We discuss each of these design criteria below.

While we agree with the advisability of collaboration as a design condition — and therefore also believe that tools to support this methodology must also accommodate group involvement — collaboration per se is not a design requirement. It is an implementation best practice.

Effective ontology development is as much as anything a matter of mindset. This mindset is grounded in leveraging what already exists, “paying as one benefits” through an incremental approach, and starting simple and adding complexity as understanding and experience are gained. Inherently this approach requires domain users to be the driving force in ongoing development, with appropriate tools to support that emphasis. Ontologists and ontology engineering are important backstops, but not in the lead design or development roles. The net result of this mindset is to develop pragmatic ontologies that are understood, and used, by actual domain practitioners.

Lightweight and Domain-oriented

By definition the methodology should be lightweight and oriented to particular domains. Ontologies built for the pragmatic purposes of setting context and aiding interoperability tend to be lightweight with only a few predicates, such as isAbout, narrowerThan or broaderThan. But, if done properly, these lighter weight ontologies can be surprisingly powerful in discovering connections and relationships. Moreover, they are a logical and doable intermediate step on the path to more demanding semantic analysis.
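To make the point about a few predicates concrete, here is a minimal sketch in plain Python. The beverage concepts and document identifiers are hypothetical, and the reading of narrowerThan as "subject is narrower than object" is an assumption made for illustration; only the predicate names come from the text above.

```python
from collections import defaultdict

# Hypothetical triples for an illustrative domain: (subject, predicate, object).
TRIPLES = [
    ("Espresso", "narrowerThan", "Coffee"),
    ("Coffee", "narrowerThan", "HotBeverage"),
    ("GreenTea", "narrowerThan", "Tea"),
    ("Tea", "narrowerThan", "HotBeverage"),
    ("doc-17", "isAbout", "Espresso"),
    ("doc-42", "isAbout", "GreenTea"),
]


def broader_concepts(concept, triples):
    """Return every concept reachable by following narrowerThan upward."""
    parents = defaultdict(set)
    for s, p, o in triples:
        if p == "narrowerThan":
            parents[s].add(o)
    seen, stack = set(), [concept]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def documents_about(concept, triples):
    """Find documents tagged with the concept or with any narrower concept."""
    return sorted(
        s for s, p, o in triples
        if p == "isAbout"
        and (o == concept or concept in broader_concepts(o, triples))
    )


print(documents_about("HotBeverage", TRIPLES))
```

Neither document is tagged with HotBeverage directly, yet both are found through the hierarchy. That is the kind of connection a two-predicate structure can already surface.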

Contextual

Context simply means there is a reference structure for guiding the assignment of what content ‘is about’ [7]. An ontology with proper context has a balanced and complete scope of the domain at hand. It generally uses fairly simple predicates; Structured Dynamics tends to use the UMBEL vocabulary for its predicates and class definitions, and to link to existing UMBEL concepts to help ensure interoperability [8]. A good gauge for whether the context is adequate is whether there are sufficient concept definitions to disambiguate common concepts in the domain.

Coherent

The essence of coherence is that it is a state of consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. With relation to a content graph, this means that the right connections (edges or predicates) have been drawn between the object nodes (or content) in the graph [9].

Relating content coherently itself demands a coherent framework. At the upper reference layer this begins with UMBEL, which itself is an extraction from the vetted and coherent Cyc common sense knowledge base. However, as domain specifics get added, these details, too, must be testable against a unified framework. Logic and coherence testing are thus an essential part of the ontology development methodology.

Incremental

Much value can be realized by starting small, being simple, and emphasizing the pragmatic. It is OK to make those connections that are doable and defensible today, while delaying until later the full scope of semantic complexities associated with complete data alignment.

An open world approach [10] provides the logical basis for incremental growth and adoption of ontologies. This is also in keeping with the continuous and incremental deployment model that Structured Dynamics has adopted from MIKE2.0 [11]. When this model is applied to the process of ontology development, the basic implementation increments appear as follows:

Continuous Ontology Implementation
Figure 1. A Phased, Incremental Approach to Ontology Development

The first two phases are devoted to scoping and prototyping. Then, the remaining phases of creating a working ontology, testing it, maintaining it, and then revising and extending it are repeated over multiple increments. In this manner the deployment proceeds incrementally and only as learning occurs. Importantly, too, this approach means that complexity, sophistication and scope grow only in step with demonstrable benefits.

Re-use of Structure

Fundamental to the whole concept of coherence is the fact that domain experts and practitioners have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we finally have a broadly useful data and logic model in RDF, the fact remains that massive time and effort have already been expended to codify some of these understandings in various ways and at various levels of completeness and scope.

These are prior investments in structure that would be silly to ignore. Yet, today, most methodologies do ignore these resources. This neglect of prior investments in information relationships is perplexing. Though unquestioned adoption of legacy structure is inappropriate to modern interoperable systems, that fact is no excuse for re-inventing prior effort and discoveries, many of which are the result of laborious consensus building or negotiations.

The most productive methodologies for modern ontology building are therefore those that re-use and reconcile prior investments in structural knowledge, not ignore them. These existing assets take the form of already proven external ontologies and internal and industry structures and vocabularies.

Separation of the ABox and TBox

Nearly a year ago we undertook a major series on description logics [12], a key underpinning of the conceptual and logical foundation for Structured Dynamics’ ontology development. While we cannot always adhere to strict and conforming description logics designs, our four-part series helped provide guidance for the separation of concerns and work that can also lead to more effective ontology designs [13].

Conscious separation of the so-called ABox (assertions or instance records) and TBox (conceptual structure) in ontology design provides some compelling benefits:

  • Easier ingest and incorporation of external instance data, including conversion from multiple formats and serializations
  • Faster and more efficient inferencing and analysis and use of the conceptual structure (TBox)
  • Easier federation and incorporation of distributed data stores (instance records), and
  • Better segregation of specialized work to the ABox, TBox and specialty work modules, as this figure shows [14]:
TBox- and ABox-level work
Figure 2. Separation of the TBox and ABox [14]

Maintaining identity relations and disambiguation as separate components also has the advantage of enabling different methodologies or algorithms to be determined or swapped out as better methods become available. A low-fidelity service, for example, could be applied for quick or free uses, with more rigorous methods reserved for paid or batch mode analysis. Similarly, maintaining full-text search as a separate component means that work can be done by optimized search engines with built-in faceting.
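The ABox and TBox separation described above can be sketched in a few lines of plain Python. All class names and records here are hypothetical, and a real system would hold the TBox in OWL rather than a dictionary. The point is only that instance records can be ingested and stored independently of the conceptual structure that is used to interpret them.

```python
# TBox: conceptual structure, here a map of class -> direct superclass.
# All class names are hypothetical.
TBOX = {
    "Winery": "Business",
    "Business": "Organization",
    "Museum": "Organization",
}

# ABox: instance records kept separately, e.g. ingested from a CSV export.
ABOX = [
    {"id": "org-1", "label": "Hilltop Cellars", "type": "Winery"},
    {"id": "org-2", "label": "City Art Museum", "type": "Museum"},
]


def superclasses(cls, tbox):
    """Walk the TBox upward from cls and return its ancestors in order.

    Assumes the hierarchy is acyclic (see coherence testing).
    """
    chain = []
    while cls in tbox:
        cls = tbox[cls]
        chain.append(cls)
    return chain


def instances_of(cls, abox, tbox):
    """Select ABox records whose asserted type is cls or a subclass of it."""
    return [
        rec["id"] for rec in abox
        if rec["type"] == cls or cls in superclasses(rec["type"], tbox)
    ]


print(instances_of("Organization", ABOX, TBOX))
```

Because the records only name a type, the ABox list could be replaced by another data store, or the TBox revised, without touching the other side.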

Simple, Interoperable Tools Support

An essential design criterion is to have a methodology and workflow that explicitly accounts for simple and interoperable tools. By “simple” we mean targeted, task-specific tools and functionality that is also geared to domain users and practitioners.

Of all design areas, this one is perhaps the weakest in terms of current offerings. The next installment in this series [1] will address this topic directly.

The New Methodology

Armed with these criteria, we are now ready to present the new methodology. In summary terms, we can describe the steps in the methodology as:

  1. Scope, analyze, then leverage existing assets
  2. Prototype structure
  3. Pivot on the working ontology
  4. Test
  5. Use and maintain
  6. Extend working ontology and repeat.

Two Parallel Tracks

After the scoping and analysis phase, the effort is split into two tracks:

  • Instances, and their descriptive characteristics, and
  • Conceptual relationships, or ontologies.

This split conforms to the separation of ABox and TBox noted above [15]. There are conceptual and workflow parallels between the entities and data track and the ontologies track. However, the specific methodologies differ, and we only focus on the conceptual ontology side in the discussion below, shown as the upper part (blue) of Figure 3:

Ontology and Instance Build Methodology
Figure 3. Flowchart of Ontology Development Methodology [16]

Two key aspects of the initial effort are to properly scope the size and purpose of the starting prototype and to inventory the existing assets (structure and data; internal and external) available to the project.

Re-Use Structure

Most current ontology methodologies do not emphasize re-use of existing structure. Yet these resources are rich in content and meaning, and often represent years to decades of effort and expenditure in creation, assembly and consensus. Just a short list of these potential sources demonstrates the treasure trove of structure and vocabularies available for re-use: Web portals; databases; legacy schema; metadata; taxonomies; controlled vocabularies; ontologies; master data catalogs; industry standards; exchange formats, etc.

Metadata and available structure may have value no matter where or how it exists, and a fundamental aspect of the build methodology is to bring such candidate structure into a common tools environment for inspection and testing. Besides assembling and reviewing existing sources, those selected for re-use must be migrated and converted to proper ontological form (OWL in the case of those developed by Structured Dynamics). Some of these techniques have been demonstrated for prior patterns and schema [17]; in other instances various converters, RDFizers or scripts may need to be employed to effect the migration.

Many tools and options exist at this stage, even though as a formal step this conversion is often neglected.
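As one illustration of what such a conversion script might look like, the sketch below turns a legacy taxonomy, given as slash-separated category paths, into SKOS-style triples. The input format, the `BASE` namespace and the `slug` helper are assumptions for the example. Only the SKOS namespace and its `prefLabel` and `broader` terms are real. Keying concepts on the label alone is a simplification that would merge identically named categories from different branches.

```python
SKOS = "http://www.w3.org/2004/02/skos/core#"
BASE = "http://example.org/taxonomy/"  # hypothetical namespace


def slug(label):
    """Derive a URI-friendly local name from a category label."""
    return label.strip().lower().replace(" ", "-")


def taxonomy_to_triples(paths, sep="/"):
    """Convert category paths into (subject, predicate, object) triples."""
    triples = set()
    for path in paths:
        labels = [part.strip() for part in path.split(sep) if part.strip()]
        for depth, label in enumerate(labels):
            uri = BASE + slug(label)
            triples.add((uri, SKOS + "prefLabel", label))
            if depth > 0:
                # Each category is narrower than the one before it in the path.
                parent = BASE + slug(labels[depth - 1])
                triples.add((uri, SKOS + "broader", parent))
    return sorted(triples)


legacy = ["Products/Wine/Red Wine", "Products/Wine/White Wine"]
for triple in taxonomy_to_triples(legacy):
    print(triple)
```

The output of a script like this is a candidate structure for inspection and testing, not a finished ontology.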

Prototype Structure

The prototype structure is the first operating instance of the ontology. The creation of this initial structure follows quite closely the approach recommended in Ontology Development 101 [18], with some modifications to reflect current terminology:

  1. Determine the domain and scope of the ontology
  2. Consider reusing existing ontologies
  3. Enumerate important terms in the ontology
  4. Define the classes and the class hierarchy
  5. Define the properties of classes
  6. Create instances

The prototype structure is important since it communicates to the project sponsors the scope and basic operation of the starting structure. This stage often represents a decision point for proceeding; it may also trigger the next budgeting phase.

Link Reference Ontologies

An essential aspect of a build methodology is to re-use “standard” ontologies as much as possible. Core ontologies are Dublin Core, DC Terms, Event, FOAF, GeoNames, SKOS, Timeline, and UMBEL. These core ontologies have been chosen because of universality, quality, community support and other factors [19]. Though less universal, there are also a number of secondary ontologies, namely BIBO, DOAP and SIOC, that may fit within the current scope.

These are then supplemented with quality domain-specific ontologies, if such exist. Only then are new namespaces assigned for any newly generated ontology(ies).
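A small sketch of this "re-use first, mint last" ordering, using prefixed names. The SKOS, FOAF and DC Terms namespace URIs are the published ones. The `ex` prefix, its URI and the example terms are hypothetical.

```python
PREFIXES = {
    "skos": "http://www.w3.org/2004/02/skos/core#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dcterms": "http://purl.org/dc/terms/",
    "ex": "http://example.org/ontology#",  # hypothetical new namespace
}


def expand(curie):
    """Expand a prefixed name such as 'foaf:Person' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    return PREFIXES[prefix] + local


def new_terms(curies, own_prefix="ex"):
    """List the terms minted in the new namespace rather than re-used."""
    return [c for c in curies if c.partition(":")[0] == own_prefix]


vocabulary = ["skos:Concept", "foaf:Person", "dcterms:title", "ex:Vineyard"]
print(expand("foaf:Person"))
print(new_terms(vocabulary))
```

Reviewing the output of something like `new_terms` is one simple way to check that a new namespace is used only where no reference ontology already supplies the term.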

Working Ontology

The working ontology is the first production-grade (deployable) version of the ontology. It conforms to all of the ontology building best practices and needs to be complete enough that it can be loaded and managed in a fully conforming ontology editor or IDE [20].

Through the OWL API, this working structure can also serve as the source for specialty tools and user maintenance functions, short of requiring a full-blown OWL editor. These aspects are among the most poorly represented in the current tools inventory; we return to this topic in the next installment.

The working ontology is the complete, canonical form of the domain ontology(ies) [21]. These are the central structures that are the focus for ongoing maintenance and extension efforts over the ensuing phases. As such, the ontologies need to be managed by a version control system with comprehensive ontology and vocabulary management support and tools.

Testing and Mapping

As new ontologies are generated, they should be tested for coherence using various reasoning, inference and natural language processing tools. Gap testing is also used to discover key holes or missing links within the resulting ontology graph structure. Coherence testing may result in discovering missing or incorrect axioms. Gap testing helps identify internal graph nodes needed to establish the integrity or connectivity of the concept graph.
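Full coherence testing calls for a reasoner, but two very simple structural checks can be sketched without one. The code below is an illustration under stated assumptions, not the testing procedure used by the author: it treats a cycle in a lightweight subsumption hierarchy as a likely modeling error, and it treats any concept with no hierarchical link at all as a candidate gap. The input shape, a map of child to parents, is hypothetical.

```python
def find_cycle(subclass_of):
    """Return one cycle in a {child: [parents]} map as a list, or None."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on current path, finished
    state = {}

    def visit(node, path):
        state[node] = GREY
        path.append(node)
        for parent in subclass_of.get(node, []):
            if state.get(parent, WHITE) == GREY:
                # Parent is already on the current path: a cycle.
                return path[path.index(parent):] + [parent]
            if state.get(parent, WHITE) == WHITE:
                cycle = visit(parent, path)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in list(subclass_of):
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node, [])
            if cycle:
                return cycle
    return None


def unconnected(concepts, subclass_of):
    """Return concepts that appear in no subsumption link (possible gaps)."""
    linked = set(subclass_of)
    for parents in subclass_of.values():
        linked.update(parents)
    return sorted(set(concepts) - linked)


hierarchy = {"RedWine": ["Wine"], "Wine": ["Beverage"]}
print(find_cycle(hierarchy))
print(unconnected(["RedWine", "Wine", "Beverage", "Cork"], hierarchy))
```

Checks like these are cheap enough to run on every revision, leaving heavier reasoning for later increments.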

Though used for different purposes, mapping and alignment tools may also work to identify logical and other inconsistencies in definitions or labels within the graph structure. Mapping and alignment is also important in its own right in order to establish the links that help promote ontology and information interoperability.

External knowledge bases can also play essential roles in testing and mapping. Two prominent knowledge base examples are Cyc and Wikipedia, but many additional ones exist for any specific domain.

Use and Maintenance

Of course, the whole purpose of the development methodology is to create practical, working ontologies. Such uses include search, discovery, information federation, data interoperability, analysis and reasoning. The general purposes to which ontologies may be put are described in the Executive Intro to Ontologies [22].

However, it is also in day-to-day use of the ontology that many enhancements and improvements may be discovered. Examples include improved definitions of concepts; expansions of synonyms, aliases and jargon for concepts; better, more intuitive preferred labels; better means to disambiguate between competing meanings; missing connections or excessive connections; and splitting or consolidating of the underlying structure.

Today, such maintenance enhancements are most often not pursued because existing tools do not support such actions. IDEs and tools geared to ontology engineering are not well suited to letting users and practitioners note or effect such changes. Yet ongoing ontology use and adaptation clearly suggest that users should be encouraged to do so. They are the ones on the front lines of identifying and potentially recording such improvements.

Extend

Ontology development is a process, not a static destination or event. This observation makes intuitive sense since we understand ontologies to be a means to capture our understanding of our domains, which is itself constantly changing due to new observations and insights. This factor alone suggests that ontology development methodologies must therefore give explicit attention to extension.

But there is another reason for this attention. Incremental, adaptive ontologies are also explicitly designed to expand their scope and coverage, bite by bite, as benefits prove themselves and justify that expansion. A start-small-and-expand strategy is of course lower risk and more affordable. But, for it to be effective, it also must be designed explicitly for extension and expansion. Ontology growth thus occurs both from learning and discovery and from expanding scope.

Versioning, version control and documentation (see below) thus assume more central importance than a more static view would suggest. The use of feedback and the continuous improvement design based on MIKE2.0 are therefore also central tenets of our ontology development methodology.
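One modest way to support versioning across increments is to record what changed between two releases of the ontology. The sketch below assumes each version is available as a collection of (subject, predicate, object) triples; the example statements are hypothetical, and real version control for ontologies would need to handle much more, such as renamed terms.

```python
def diff_versions(old, new):
    """Report the triples added and removed between two ontology versions."""
    old, new = set(old), set(new)
    return {"added": sorted(new - old), "removed": sorted(old - new)}


v1 = [("Wine", "subClassOf", "Beverage")]
v2 = [
    ("Wine", "subClassOf", "Beverage"),
    ("RedWine", "subClassOf", "Wine"),
]
print(diff_versions(v1, v2))
```

A change report of this kind can double as raw material for the documentation of each increment.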

Documentation

This perspective of the ontology as a way to capture the structure and relationships of a domain — which is also constantly changing and growing — carries over to the need to document the institutional memory and use of it. Both better tools — such as vocabulary management and versioning — and better work processes need to be instituted to properly capture and record use and applications of ontologies.

Some of these aspects are now handled with utilities such as OWLdoc or the TechWiki that Structured Dynamics has innovated to capture ontology knowledge bases on an ongoing basis. But these are still rudimentary steps that need to be enforced with management commitment and oversight.

One need merely begin to probe the ontology development literature to observe how sparse the pickings are. Very little information on methodologies, best practices, use cases, recipes, how-to manuals, conversion and use steps and other documentation really exists at present. It is unfortunately the case that documentation lags even the inadequate state of tools development in the ontology space.

Content Processing

Once formalized, these constructs — the structured ontologies or the named entity dictionaries as shown in Figure 3 — are then used for processing input content. That processing can range from conversion to direct information extraction. Once extracted, the structure may be injected (via RDFa or other means) back into raw Web pages. The concepts and entities that occur within these structures help inform various tagging systems [23]. The information can also be converted and exported in various forms for direct use or for incorporation in third-party systems.
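As a rough illustration of dictionary-driven processing, the sketch below tags occurrences of known entity labels in input text. The dictionary entries and URIs are hypothetical, and simple case-insensitive phrase matching stands in for the far more capable extraction and disambiguation a production tagger would need.

```python
import re

# Hypothetical named entity dictionary: label -> entity URI.
ENTITY_DICTIONARY = {
    "napa valley": "http://example.org/entity/napa-valley",
    "pinot noir": "http://example.org/entity/pinot-noir",
}


def tag_entities(text, dictionary):
    """Return (start, end, uri) for each dictionary label found in text."""
    tags = []
    for label, uri in dictionary.items():
        # Word boundaries keep a label from matching inside a longer word.
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            tags.append((match.start(), match.end(), uri))
    return sorted(tags)


for start, end, uri in tag_entities("Pinot Noir from Napa Valley", ENTITY_DICTIONARY):
    print(start, end, uri)
```

The character offsets are what a later step would need in order to inject markup back into the page at the right positions.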

Visualization systems and specialized widgets (see next) can be driven by the structure and result sets obtained from querying the ontology structure and retrieving its related instance data. While these purposes are somewhat beyond the direct needs of the ontology development methodology, the ontology structures themselves must be designed to support these functions.

Semantic Component Ontology

In our methodology we also provide for administrative ontologies whose purpose is to relate structural understandings of the underlying data and data types with applicable end-use and visualization tools (“widgets”). Thus the structural knowledge of the domain gets combined with an understanding of data types and what kinds of visualization or presentation widgets might be invoked. The phrase ontology-driven apps results from this design.
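The general idea of pairing data types with candidate widgets can be sketched as a simple lookup. This is purely illustrative: the data type names, widget names and the fallback to a table are all hypothetical, and nothing here reflects how such an administrative ontology is actually structured.

```python
# Hypothetical mapping from a data type to widgets able to present it.
WIDGETS_BY_DATATYPE = {
    "date": ["timeline", "table"],
    "lat_long": ["map"],
    "number": ["bar_chart", "table"],
}


def candidate_widgets(datatypes):
    """Collect, in first-seen order, widgets able to show the given types."""
    widgets = []
    for datatype in datatypes:
        # Unknown types fall back to a plain table (an assumption).
        for widget in WIDGETS_BY_DATATYPE.get(datatype, ["table"]):
            if widget not in widgets:
                widgets.append(widget)
    return widgets


print(candidate_widgets(["lat_long", "date"]))
```

An application driven this way chooses its displays from the data it retrieves rather than from hard-coded screens.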

Amongst other utility ontologies, Structured Dynamics names its major tool-driver ontology the SCO (Semantic Component Ontology). The SCO works in intimate tandem with the domain ontologies, but is constructed and designed with quite different purposes. A description of the build methodology for the SCO (or its other complementary utility ontologies) is beyond the scope of this current document.

Tooling and Best Practices

As noted throughout the above commentary, this methodology is also intimately related to tools and best practices. The next chapter in this series is devoted to tools and will be archived on the TechWiki as the lightweight domain ontology methodology. Best practices will be handled in a similar way for the chapter after that one and in its ontology best practices document on the TechWiki.

Time for a Leap Forward in Methodology

Earlier reviews and the information in this document suggest a real need for ontology building methodologies that are integrated, easier to use, interoperable with a richer tool set, and geared to practitioners rather than priests. The good news is that there are architectures and building blocks to achieve this vision. The bad news is that the first steps on this path are only now beginning.

The next two installments in this series add further detail on why it is time to make a leap forward in methodology, and on how we can do so. Those critical remaining pieces are in tools and best practices.


[1] This posting is part of a current series on ontology development and tools. The series began with an update of my prior Ontology Tools listing, which now contains 185 tools. It continued with a survey of ontology development methodologies. The next part in this series will address a new architecture for tooling development. The last installment in the series is planned to cover ontology best practices. This same posting is permanently archived and updated on the OpenStructs TechWiki as Lightweight, Domain Ontologies Development Methodology.
[2] Examples of upper-level ontologies include the Suggested Upper Merged Ontology (SUMO), the Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE), PROTON, Cyc and BFO (Basic Formal Ontology). Most of the content in their upper levels is more akin to broad, abstract relations or concepts (similar to the primary classes, for example, in a Roget’s Thesaurus; that is, real ontos stuff) than to “generic common knowledge.” Almost all of them have both a hierarchical and networked structure, though their actual subject structure relating to concrete things is generally pretty weak. For a more detailed treatment of ontology classifications, see M. K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.
[3] O. Corcho, M. Fernandez and A. Gomez-Perez, 2003. “Methodologies, Tools and Languages for Building Ontologies: Where is the Meeting Point?,” in Data & Knowledge Engineering 46, 2003. See http://www.dia.fi.upm.es/~ocorcho/documents/DKE2003_CorchoEtAl.pdf.
[4] Elena Paslaru Bontas Simperl and Christoph Tempich, 2006. “Ontology Engineering: A Reality Check,” in Proceedings of the 5th International Conference on Ontologies, Databases, and Applications of Semantics ODBASE 2006, 2006. See http://ontocom.ag-nbi.de/docs/odbase2006.pdf.
[5] OntologyDesignPatterns.org is a semantic Web portal dedicated to ontology design patterns (ODPs). The portal was started under the NeOn project, which still partly supports its development.
[6] See M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, November 23, 2009.
[7] See M.K. Bergman, 2008. “The Semantics of Context,” AI3:::Adaptive Information blog, May 6, 2008.
[8] UMBEL (Upper Mapping and Binding Exchange Layer) is an ontology of about 20,000 subject concepts that acts as a reference structure for inter-relating disparate datasets. It is also a general vocabulary of classes and predicates designed for the creation of domain-specific ontologies.
[9] See M.K. Bergman, 2008. “When is Content Coherent?,” AI3:::Adaptive Information blog, July 25, 2008.
[10] See M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Information blog, December 21, 2009.
[11] MIKE2.0 (Method for Integrated Knowledge Environments) is an open source information development methodology championed by BearingPoint and Deloitte. Structured Dynamics has adopted the approach and has helped formulate MIKE2.0’s semantic enterprise offering. For a general intro to the approach, see further M.K. Bergman, 2010. “MIKE2.0: Open Source Information Development in the Enterprise,” AI3:::Adaptive Information blog, February 23, 2010.
[12] This is our working definition for description logics:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[13] See the four-part description logics series from M. K. Bergman, 2009. “Making Linked Data Reasonable using Description Logics, Part 1,” AI3:::Adaptive Information blog, Feb. 11, 2009; “Making Linked Data Reasonable using Description Logics, Part 2,” AI3:::Adaptive Information blog, Feb. 15, 2009; “Making Linked Data Reasonable using Description Logics, Part 3,” AI3:::Adaptive Information blog, Feb. 18, 2009; and “Making Linked Data Reasonable using Description Logics, Part 4,” AI3:::Adaptive Information blog, Feb. 23, 2009.
[14] See Part 2 in [13].
[15] The TBox portion, or classes (concepts), is the basis of the ontologies. The ontologies establish the structure used for governing the conceptual relationships for that domain and in reference to external (Web) ontologies. The ABox portion, or instances (named entities), represents the specific, individual things that are the members of those classes. Named entities are the notable objects, persons, places, events, organizations and things of the world. Each named entity is related to one or more classes (concepts) to which it is a member. Named entities do not set the structure of the domain, but populate that structure. The ABox and TBox play different roles in the use and organization of the information and structure.
[16] The original version, now slightly modified, was first published in M. K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies,” AI3:::Adaptive Information blog, Nov. 23, 2009.
[17] For some examples, see: SKOS: Mark van Assem, Véronique Malaisé, Alistair Miles and Guus Schreiber, 2006. “A Method to Convert Thesauri to SKOS,” in The Semantic Web: Research and Applications (2006), pp. 95-109. See http://www.cs.vu.nl/~mark/papers/Assem06b.pdf for paper, also http://thesauri.cs.vu.nl/eswc06/ and http://thesauri.cs.vu.nl/; taxonomies: Fausto Giunchiglia, Maurizio Marchese and Ilya Zaihrayeu, 2006. “Encoding Classifications into Lightweight Ontologies,” in Proceedings of the 3rd European Semantic Web Conference (ESWC 2006), Budva. See http://www.science.unitn.it/~marchese/pdf/encoding%20classifications%20into%20lightweight%20ontologies_JoDS8.pdf; metadata: Mikael Nilsson, 2007. See http://mikaelnilsson.blogspot.com/2007/11/semanticizing-metadata-specifications.html; relational schema: see the W3C workgroup on RDB2RDF; and, of course, there are many others.
[18] Natalya F. Noy and Deborah L. McGuinness, 2001. “Ontology Development 101: A Guide to Creating Your First Ontology,” Stanford University Knowledge Systems Laboratory Technical Report KSL-01-05, March 2001. See http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html.
[19] The criteria considered in nominating an existing ontology to “core” status are that it should be general; highly used; universal; backed by broad committee or community support; well done and documented; and easily understood.
[20] Example and comprehensive ontology editing toolkits or IDEs (integrated development environments) include NeOn toolkit, Protégé, and TopBraid Composer. A complement to these larger toolkits is the OWL API, which when used can also provide a canonical management framework for specific ontology tools and tasks. This topic is covered more in the next installment regarding the tools landscape.
[21] Good ontology design, especially for larger projects, does require a degree of modularity. In such an architecture, multiple ontologies often work together to isolate different work tasks so as to aid better ontology management. Ontology architecture and modularization is a separate topic in its own right.
[22] Originally published as M.K. Bergman, 2010. “An Executive Intro to Ontologies,” AI3:::Adaptive Information blog, August 9, 2010. This popular document has now been permanently archived on the OpenStructs TechWiki as Intro to Ontologies.
[23] Another reason for the clear distinction between ABox and TBox is their use to aid one another in disambiguation. Structured Dynamics’ scones approach (subject concepts or named entities) is designed expressly for this purpose. It is also possible to integrate these approaches with third-party tools (e.g., Calais, Expert System (Cogito), etc.) to improve unstructured content characterization. Via this approach we now can assess concept matches in addition to entity matches. This means we can triangulate between the two assessments to aid disambiguation. Because of logical segmentation, we have increased the informational power of our concept graph.
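
As a rough illustration of the TBox/ABox split described in notes [15] and [23], the following Python sketch keeps the class structure (TBox) and the instance assertions (ABox) in separate containers, then infers class membership by walking the subclass hierarchy. The class names, instance names and helper functions are all hypothetical, and the code is only a toy stand-in for a description logics reasoner, not the way any particular tool works.

```python
from dataclasses import dataclass, field


@dataclass
class TBox:
    """Terminological knowledge: concepts and their subclass relationships."""
    subclass_of: dict = field(default_factory=dict)  # concept -> set of parent concepts

    def add_subclass(self, child, parent):
        self.subclass_of.setdefault(child, set()).add(parent)
        self.subclass_of.setdefault(parent, set())

    def ancestors(self, concept):
        """Return every concept reachable by following subclass links upward."""
        seen = set()
        stack = [concept]
        while stack:
            for parent in self.subclass_of.get(stack.pop(), ()):
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


@dataclass
class ABox:
    """Assertional knowledge: instances, their class memberships and attributes."""
    types: dict = field(default_factory=dict)       # instance -> set of asserted concepts
    attributes: dict = field(default_factory=dict)  # instance -> {attribute: value}

    def assert_type(self, instance, concept):
        self.types.setdefault(instance, set()).add(concept)

    def assert_attribute(self, instance, name, value):
        self.attributes.setdefault(instance, {})[name] = value


def inferred_types(tbox, abox, instance):
    """Combine asserted ABox memberships with the TBox hierarchy."""
    result = set()
    for concept in abox.types.get(instance, ()):
        result.add(concept)
        result |= tbox.ancestors(concept)
    return result


# Hypothetical example data: the structure lives in the TBox...
tbox = TBox()
tbox.add_subclass("City", "PopulatedPlace")
tbox.add_subclass("PopulatedPlace", "Place")

# ...while the named entities that populate it live in the ABox.
abox = ABox()
abox.assert_type("Toronto", "City")
abox.assert_attribute("Toronto", "country", "Canada")

print(sorted(inferred_types(tbox, abox, "Toronto")))
```

Keeping the two containers separate means the instance data can be swapped or extended without touching the conceptual structure, which is the point the notes above make about the different roles of the ABox and TBox.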

by Mike Bergman at September 01, 2010 05:10 AM

August 30, 2010

AI3:::Adaptive Information (Mike Bergman)

A Brief Survey of Ontology Development Methodologies

The Recent Pace of Ontology Development Appears to Have Waned

The development of ontologies goes by the names of ontology engineering or ontology building, and can also be investigated under the rubric of ontology learning. This document summarizes key papers and links on this topic [18].

For the last twenty years there have been many methods put forward for how to develop ontologies. These methodological activities have actually diminished somewhat in recent years.

The main thrust of the papers listed herein is on domain ontologies, which model particular domains or topic areas. (As opposed to reference, upper or theoretical ontologies, which are more general or encompassing.) Also, little commentary is offered on any of the individual methodologies; please see the referenced papers for more details.

General Surveys

One of the first comprehensive surveys was done by Jones et al. in 1998 [1]. This study began to elucidate common stages and noted there are typically separate stages to produce first an informal description of the ontology and then its formal embodiment in an ontology language. The existence of these two descriptions is an important characteristic of many ontologies, with the informal description often carrying through to the formal description.

The next major survey was done by Corcho et al. in 2003 [2]. This built on the earlier Jones survey and added more recent methods. The survey also characterized the methods by tools and tool readiness.

More recently the work of Simperl and her colleagues has focused on empirical results of ontology costing and related topics. This series has been the richest source of methodology insight in recent years [3, 4, 5, 6]. More on this work is described below.

Though not a survey of methods, one of the more approachable descriptions of ontology building is Noy and McGuinness’ well-known Ontology Development 101 [7]. Also very helpful are Alan Rector’s various lecture slides on ontology building [8].

However, one general observation is that the pace of new methodology development seems to have waned in the past five years or so. This does not appear to be the result of an accepted methodology having emerged.

Some Specific Methodologies

Some of the leading methodologies, presented in rough order from the oldest to newest, are as follows:

  • Cyc – this oldest of knowledge bases and ontologies has been mapped to many separate ontologies. See the separate document on the Cyc mapping methodology for an overview of this approach [9]
  • TOVE (Toronto Virtual Enterprise) – a first-order logic approach to representing activities, states, time, resources, and cost in an enterprise integration architecture [10]
  • IDEF5 (Integrated Definition for Ontology Description Capture Method) – part of a broader set of methodologies developed by Knowledge Based Systems, Inc. [11]
  • ONIONS (ONtologic Integration Of Naive Sources) – a set of methods especially geared to integrating multiple information sources [12], with a particular emphasis on domain ontologies
  • COINS (COntext INterchange System) – a long-running series of efforts from MIT’s Sloan School of Management [13]
  • METHONTOLOGY – one of the better known ontology building methodologies; however, not many known uses [14]
  • OTK (On-To-Knowledge) – a methodology that came from the major EU effort at the beginning of the last decade; it is a common-sense approach reflected in many ways in other methodologies [15]
  • UPON (Unified Process for ONtologies) – a UML-based approach that is based on use cases, and is incremental and iterative [16].

Please note that many individual projects also describe their specific methodologies; these are purposefully not included. In addition, Ensan and Du look at some specific ontology frameworks (e.g., PROMPT, OntoLearn, etc.) from a domain-specific perspective [17].

Some Flowcharts

Here is the general methodology as presented in the various Simperl et al. papers [cf. Fig. 1 in 3]:

Ontology Engineering from Simperl et al.

The Corcho et al. survey also presented a general view of the tools plus framework necessary for a complete ontology engineering environment [Fig. 4 from 2]:

Ontology Tools and Framework from Corcho et al.

There are more examples that show ontology development workflows. Here is one again from the Simperl et al. efforts [Fig. 2 in 5]:

Ontology Learning Flowchart from Simperl et al.

However, what is most striking about the review of the literature is the paucity of methodology figures and the generality of those that do exist. From this basis, it is unclear what the degree of use is for real, actionable methods.

Best Practices Observations

The Simperl and Tempich paper [3], besides being a rich source of references, also provides some recommended best practices based on their comparative survey. These are:

General Recommendations

  • Enforce dissemination, e.g., publish more best practices
  • Define selection criteria for methodologies
  • Define a unified methodology following a method engineering approach
  • Support decision for the appropriate formality level given a specific use case

Process Recommendations

  • Define selection criteria for different knowledge acquisition (KA) techniques
  • Introduce process description for the application of different KA techniques
  • Improve documentation of existing ontologies
  • Improve ontology location facilities
  • Build robust translators between formalisms
  • Build modular ontologies
  • Define metrics for ontology evaluation
  • Offer user oriented process descriptions for ontology evaluation

Organizational Recommendations

  • Provide ontology engineering activity descriptions using domain-specific terminology
  • Improve consensus making process support

Technological Recommendations

  • Provide tools to extract ontologies from structured data sources
  • Build lightweight ontology engineering environments
  • Improve the quality of tools for domain analysis, ontology evaluation, documentation
  • Include methodological support in ontology editors
  • Build tools supporting collaborative ontology engineering.

Summary of Observations

This review has not set out to characterize specific methodologies, nor their strengths and weaknesses. Yet the research seems to indicate this state of methodology development in the field:

  • Very few discrete methods exist, and those that do are relatively older in nature
  • The methods tend to cluster into either incremental, iterative approaches or more comprehensive ones
  • Most methodologies share a common logical sequence of steps, from assessment through deployment, testing and refinement
  • Actual specifics and flowcharts are quite limited; with the exception of the UML-based systems, most appear not to meet enterprise standards
  • The supporting toolsets are not discussed much, and most of the examples are based solely on a governing tool. Tool integration and interoperability is almost non-existent in terms of the narratives
  • This does not appear to be a very active area of current research.

[1] D.M. Jones, T.J.M. Bench-Capon and P.R.S. Visser, 1998. “Methodologies for Ontology Development,” in Proceedings of the IT and KNOWS Conference of the 15th IFIP World Computer Congress, 1998. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.52.2437&rep=rep1&type=pdf.
[2] O. Corcho, M. Fernandez and A. Gomez-Perez, 2003. “Methodologies, Tools and Languages for Building Ontologies: Where is the Meeting Point?,” in Data & Knowledge Engineering 46, 2003. See http://www.dia.fi.upm.es/~ocorcho/documents/DKE2003_CorchoEtAl.pdf.
[3] Elena Paslaru Bontas Simperl and Christoph Tempich, 2006. “Ontology Engineering: A Reality Check,” in Proceedings of the 5th International Conference on Ontologies, Databases, and Applications of Semantics ODBASE2006, 2006. See http://citeseerx.ist.psu.edu/icons/pdf.gif;jsessionid=DE3414C0282C76F0EA787A06039941D2.
[4] Elena Paslaru Bontas Simperl, Christoph Tempich, and York Sure, 2006. “ONTOCOM: A Cost Estimation Model for Ontology Engineering,” presented at ISWC 2006; see http://ontocom.ag-nbi.de/docs/iswc2006.pdf.
[5] Elena Simperl, Christoph Tempich and Denny Vrandečić, 2008. “A Methodology for Ontology Learning,” in Frontiers in Artificial Intelligence and Applications 167 from the Proceedings of the 2008 Conference on Ontology Learning and Population: Bridging the Gap between Text and Knowledge, pp. 225-249, 2008. See http://wtlab.um.ac.ir/parameters/wtlab/filemanager/resources/Ontology%20Learning/ONTOLOGY%20LEARNING%20AND%20POPULATION%20BRIDGING%20THE%20GAP%20BETWEEN%20TEXT%20AND%20KNOWLEDGE.pdf#page=241.
[6] Elena Simperl, Malgorzata Mochol and Tobias Burger, 2010. “Achieving Maturity: the State of Practice in Ontology Engineering in 2009,” in International Journal of Computer Science and Applications, 7(1), pp. 45 – 65, 2010. See http://www.tmrfindia.org/ijcsa/v7i13.pdf.
[7] Natalya F. Noy and Deborah L. McGuinness, 2001. “Ontology Development 101: A Guide to Creating Your First Ontology,” Stanford University Knowledge Systems Laboratory Technical Report KSL-01-05, March 2001. See http://protege.stanford.edu/publications/ontology_development/ontology101-noy-mcguinness.html.
[9] Stephen L. Reed and Douglas B. Lenat, 2002. “Mapping Ontologies into Cyc,” paper presented at AAAI 2002 Conference Workshop on Ontologies For The Semantic Web, Edmonton, Canada, July 2002. See http://www.cyc.com/doc/white_papers/mapping-ontologies-into-cyc_v31.pdf. Also, as presented by Doug Foxvog, “Ontology Mapping with Cyc,” at WSMO, June 14, 2004; see www.wsmo.org/wsml/papers/presentations/Ontology%20Mapping%20at%20Cycorp.ppt. Also, see Matthew E. Taylor, Cynthia Matuszek, Bryan Klimt, and Michael Witbrock, 2007. “Autonomous Classification of Knowledge into an Ontology,” in The 20th International FLAIRS Conference (FLAIRS), Key West, Florida, May 2007. See http://www.cyc.com/doc/white_papers/FLAIRS07-AutoClassificationIntoAnOntology.pdf.
[10] M. Gruninger and M.S. Fox, 1994. “The Design and Evaluation of Ontologies for Enterprise Engineering”, Workshop on Implemented Ontologies, European Conference on Artificial Intelligence 1994, Amsterdam, NL. See http://stl.mie.utoronto.ca/publications/gruninger-onto-ecai94.pdf.
[11] KBSI, 1994. “The IDEF5 Ontology Description Capture Method Overview”, Knowledge Based Systems, Inc. (KBSI) Report, Texas. The report describes the stages of: 1) organizing and scoping; 2) data collection; 3) data analysis; 4) initial ontology development; and 5) ontology refinement and validation. See http://en.wikipedia.org/wiki/IDEF5.
[12] A. Gangemi, G. Steve and F. Giacomelli, 1996. “ONIONS: An Ontological Methodology for Taxonomic Knowledge Integration”, ECAI-96 Workshop on Ontological Engineering, Budapest, August 13th. See http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.22.3972&rep=rep1&type=pdf.
[13] The COINS approach was developed by Madnick et al. over the past two decades or so at the MIT Sloan School of Management. See further http://web.mit.edu/smadnick/www/wp/CISL-Sloan%20WP%20spreadsheet.htm for a listing of papers from this program; some are use cases, and some are architecture-related. For the most detailed treatment, see Aykut Firat, 2003. Information Integration Using Contextual Knowledge and Ontology Merging, Ph.D. Thesis for the Sloan School of Management, MIT, 151 pp. See http://www.mit.edu/~bgrosof/paps/phd-thesis-aykut-firat.pdf.
[14] M. Fernandez, A. Gomez-Perez and N. Juristo, 1997. “METHONTOLOGY: From Ontological Art Towards Ontological Engineering”, AAAI-97 Spring Symposium on Ontological Engineering, Stanford University, March 24-26th, 1997.
[15] York Sure, Christoph Tempich and Denny Vrandečić, 2006. “Ontology Engineering Methodologies,” in Semantic Web Technologies: Trends and Research in Ontology-based Systems, pp. 171-187, Wiley. The general phases of the approach are: 1) feasibility study; 2) kickoff; 3) refinement; 4) evaluation; and 5) application and evolution.
[16] A. De Nicola, M. Missikoff, R. Navigli, 2009. “A Software Engineering Approach to Ontology Building”. Information Systems, 34(2), Elsevier, 2009, pp. 258-275.
[17] Faezeh Ensan and Weichang Du, 2007. Towards Domain-Centric Ontology Development and Maintenance Frameworks; see http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.93.8915&rep=rep1&type=pdf.
[18] This document is permanently archived on the OpenStructs TechWiki. This document is part of a current series on ontology development and tools to be completed over the coming weeks.

by Mike Bergman at August 30, 2010 05:53 AM

August 23, 2010

AI3:::Adaptive Information (Mike Bergman)

Listing of 185 Ontology Building Tools

AI3's Ontologies category

Earlier Listing is Expanded by More than 30%

At the beginning of this year Structured Dynamics assembled a listing of ontology building tools at the request of a client. That listing was presented as The Sweet Compendium of Ontology Building Tools. Now, again because of some client and internal work, we have researched the space again and updated the listing [1].

All new tools are marked with <New> (new only means newly discovered; some already existed but had not been found for the prior listing). There are now a total of 185 tools in the listing, 31 of which are new in this update, and 45 of which have been added at various times since the first release. <Newest> reflects updates (most from the developers themselves) since the original publication of this post.

Comprehensive Ontology Tools

  • Altova SemanticWorks is a visual RDF and OWL editor that auto-generates RDF/XML or nTriples based on visual ontology design. No open source version available
  • Amine is a rather comprehensive, open source platform for the development of intelligent and multi-agent systems written in Java. As one of its components, it has an ontology GUI with text- and tree-based editing modes, with some graph visualization
  • The Apelon DTS (Distributed Terminology System) is an integrated set of open source components that provides comprehensive terminology services in distributed application environments. DTS supports national and international data standards, which are a necessary foundation for comparable and interoperable health information, as well as local vocabularies. Typical applications for DTS include clinical data entry, administrative review, problem-list and code-set management, guideline creation, decision support and information retrieval. Though not strictly an ontology management system, Apelon DTS has plug-ins that provide visualization of concept graphs and related functionality that make it close to a complete solution
  • DOME is a programmable XML editor which is being used in a knowledge extraction role to transform Web pages into RDF, and available as Eclipse plug-ins. DOME stands for DERI Ontology Management Environment
  • FlexViz is a Flex-based, Protégé-like client-side ontology creation, management and viewing tool; very impressive. The code is distributed from Sourceforge; there is a nice online demo available; there is a nice explanatory paper on the system, and the developer, Chris Callendar, has a useful blog with Flex development tips
  • <Newest> ITM supports the management of complex knowledge structures (metadata repositories, terminologies, thesauri, taxonomies, ontologies, and knowledge bases) throughout their lifecycle, from authoring to delivery. ITM can also manage alignments between multiple knowledge structures, such as thesauri or ontologies, via the integration of INRIA’s Alignment API. Commercial; from Mondeca
  • Knoodl facilitates community-oriented development of OWL based ontologies and RDF knowledge bases. It also serves as a semantic technology platform, offering a Java service-based interface or a SPARQL-based interface so that communities can build their own semantic applications using their ontologies and knowledgebases. It is hosted in the Amazon EC2 cloud and is available for free; private versions may also be obtained. See especially the screencast for a quick introduction
  • The NeOn toolkit is a state-of-the-art, open source multi-platform ontology engineering environment, which provides comprehensive support for the ontology engineering life-cycle. The v2.3.0 toolkit is based on the Eclipse platform, a leading development environment, and provides an extensive set of plug-ins covering a variety of ontology engineering activities. You can add these plug-ins or get a current listing from the built-in updating mechanism
  • ontopia is a relatively complete suite of tools for building, maintaining, and deploying Topic Maps-based applications; open source, and written in Java. Could not find online demos, but there are screenshots and there is visualization of topic relationships
  • Protégé is a free, open source visual ontology editor and knowledge-base framework. The Protégé platform supports two main ways of modeling ontologies via the Protégé-Frames and Protégé-OWL editors. Protégé ontologies can be exported into a variety of formats including RDF(S), OWL, and XML Schema. There are a large number of third-party plugins that extend the platform’s functionality
    • Protégé Plugin Library – frequently consult this page to review new additions to the Protégé editor; presently there are dozens of specific plugins, most related to the semantic Web and most open source
    • Collaborative Protégé is a plug-in extension of the existing Protégé system that supports collaborative ontology editing. In addition to the common ontology editing operations, it enables annotation of both ontology components and ontology changes. It supports the searching and filtering of user annotations, also known as notes, based on different criteria. There is also an online demo
    • <New>Web Protégé is an online version of Protégé attempting to capture all of the native functionality; still under development
  • <New>Sigma is an open source knowledge engineering environment that includes ontology mapping, theorem proving, language generation in multiple languages, browsing, OWL read/write, and analysis. It includes the Suggested Upper Merged Ontology (SUMO), a comprehensive formal ontology. It’s under active development and use
  • TopBraid Composer is an enterprise-class modeling environment for developing Semantic Web ontologies and building semantic applications. Fully compliant with W3C standards, Composer offers comprehensive support for developing, managing and testing configurations of knowledge models and their instance knowledge bases. It is based on the Eclipse IDE. There is a free version (after registration) for small ontologies
  • <New>TwoUse Toolkit is an implementation of current OMG and W3C standards for developing ontology-based software models and model-based OWL2 ontologies, largely based around UML. There are a variety of tools, including graphics editors, with more to come
  • <New>Wandora is a topic maps engine written in Java with support for both in-memory topic maps and persisting topic maps in MySQL and SQL Server. It also contains an editor and a publishing system, and has support for automatic classification. It can read OBO, RDF(S), and many other formats, and can export topic maps to various graph formats. There is also a web-based topic maps browser, and graphical visualization.

Not Apparently in Active Use

  • Adaptiva is a user-centred ontology building environment, based on using multiple strategies to construct an ontology, minimising user input by using adaptive information extraction
  • Exteca is an ontology-based technology written in Java for high-quality knowledge management and document categorisation, including entity extraction. Though code is still available, no updates have been provided since 2006. It can be used in conjunction with search engines
  • IODT is IBM’s toolkit for ontology-driven development. The toolkit includes EMF Ontology Definition Metamodel (EODM), EODM workbench, and an OWL Ontology Repository (named Minerva)
  • KAON is an open-source ontology management infrastructure targeted for business applications. It includes a comprehensive tool suite allowing easy ontology creation and management and provides a framework for building ontology-based applications. An important focus of KAON is scalable and efficient reasoning with ontologies
  • Ontolingua provides a distributed collaborative environment to browse, create, edit, modify, and use ontologies. The server supports over 150 active users, some of whom have provided us with descriptions of their projects. Provided as an online service; software availability not known.

Vocabulary Prompting Tools

  • AlchemyAPI from Orchestr8 provides an API based application that uses statistical and natural language processing methods. Applicable to webpages, text files and any input text in several languages
  • BooWa is a set expander for any language (formerly known as SEALS); developed by RC Wang of Carnegie Mellon
  • Google Keywords allows you to enter a few descriptive words or phrases or a site URL to generate keyword ideas
  • Google Sets for automatically creating sets of items from a few examples
  • Open Calais is a free, limited API Web service to automatically attach semantic metadata to content, based on either entities (people, places, organizations, etc.), facts (person ‘x’ works for company ‘y’), or events (person ‘z’ was appointed chairman of company ‘y’ on date ‘x’). The metadata results are stored centrally and returned to you as industry-standard RDF constructs accompanied by a Globally Unique Identifier (GUID)
  • Query-by-document from BlogScope has a nice phrase extraction service, with a choice of ranking methods. Can also be used in a Firefox plug-in (not tested with 3.5+)
  • SemanticHacker (from Textwise) is an API that does a number of different things, including categorization, search, etc. By using ‘concept tags’, the API can be leveraged to generate metadata or tags for content
  • TagFinder is a Web service that automatically extracts tags from a piece of text. The tags are chosen based on both statistical and linguistic analysis of the original text
  • Tagthe.net has a demo and an API for automatic tagging of web documents and texts. Tags can be single words only. The tool also recognizes named entities such as people names and locations
  • TermExtractor extracts the terminology that is consensually referred to in a specific application domain. The software takes as input a corpus of domain documents, parses the documents, and extracts a list of “syntactically plausible” terms (e.g. compounds, adjective-nouns, etc.)
  • TermFinder uses Poisson statistics, Maximum Likelihood Estimation and Inverse Document Frequency to compare the frequency of words in a given document against a generic corpus of 100 million words per language; available for English, French and Italian
  • TerMine is an online and batch term extractor that emphasizes part of speech (POS) and n-gram (phrase extraction). TerMine is the terminological management system with the C-Value term extraction and AcroMine acronym recognition integrated
  • Topia term extractor is a part-of-speech and frequency based term extraction tool implemented in Python. Here is a term extraction demo based on this tool
  • Topicalizer is a service which automatically analyses a document specified by a URL or a plain text regarding its word, phrase and text structure. It provides a variety of useful information on a given text including the following: Word, sentence and paragraph count, collocations, syllable structure, lexical density, keywords, readability and a short abstract on what the given text is about
  • TrMExtractor does glossary extraction on pure text files for either English or Hungarian
  • Wikify! is a system to automatically “wikify” a text by adding Wikipedia-like tags throughout the document. The system extracts keywords and then disambiguates and matches them to their corresponding Wikipedia definition
  • Yahoo! Placemaker is a freely available geoparsing Web service. It helps developers make their applications location-aware by identifying places in unstructured and atomic content – feeds, web pages, news, status updates – and returning geographic metadata for geographic indexing and markup
  • Yahoo! Term Extraction Service is an API to Yahoo’s term extraction service, as well as many other APIs and services in a variety of languages and for a variety of tasks; good general resource. The service has been reported to be shut down numerous times, but apparently is kept alive due to popular demand.
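
Several of the tools above (TermFinder and the Topia term extractor, for example) rank candidate terms by comparing word frequencies in a document against a background corpus. A minimal sketch of that general idea follows, using a plain TF-IDF score rather than the algorithm of any specific tool. The tokenizer, the function names and the tiny sample corpus are assumptions made purely for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_terms(document, background_docs, top_n=5):
    """Score each word in `document` by term frequency times inverse document frequency."""
    doc_counts = Counter(tokenize(document))
    total = sum(doc_counts.values())
    n_docs = len(background_docs) + 1  # background corpus plus the document itself
    background_sets = [set(tokenize(d)) for d in background_docs]

    scores = {}
    for word, count in doc_counts.items():
        # Number of documents containing the word (the document itself counts once)
        doc_freq = 1 + sum(1 for s in background_sets if word in s)
        idf = math.log(n_docs / doc_freq)
        scores[word] = (count / total) * idf
    # Highest score first; ties are broken alphabetically for stable output
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical background corpus and document
background = [
    "the weather is nice and the sky is clear",
    "the market is open and the prices are stable",
]
doc = "the ontology describes the concepts and the ontology relates the concepts"

for term, score in rank_terms(doc, background, top_n=3):
    print(f"{term}: {score:.3f}")
```

Words that appear in every background document, such as "the", score zero, so domain-specific words rise to the top. Real term extractors add part-of-speech filtering and multi-word phrase detection on top of a frequency comparison like this one.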

Initial Ontology Development

  • COE (CmapTools Ontology Editor) is a specialized version of the CmapTools from IHMC. COE, like its CmapTools parent, is based on the idea of concept maps. A concept map is a graph diagram that shows the relationships among concepts. Concepts are connected with labeled arrows, with the relations manifesting in a downward-branching hierarchical structure. COE is an integrated suite of software tools for constructing, sharing and viewing OWL encoded ontologies based on these constructs
  • Conzilla2 is a second generation concept browser and knowledge management tool with many purposes. It can be used as a visual designer and manager of RDF classes and ontologies, since its native storage is in RDF. It also has an online collaboration server [apparently last updated in 2008]
  • http://diagramic.com/ has an online Flex network graph demo, which also has a neat facility for quick entry and visualization of relationships; mostly small scale; pretty cool. No code appears to be available anywhere
  • <New>DL-Learner is a tool for learning OWL class expressions from examples and background knowledge. It extends Inductive Logic Programming (ILP) to Description Logics and the Semantic Web. DL-Learner now has a flexible, component-based design, which makes it easy to extend with new learning algorithms, learning problems, reasoners, and supported background knowledge sources. SPARQL endpoints are a newly supported type of knowledge source, from which DL-Learner can extract knowledge fragments; this enables learning classes even on large knowledge sources like DBpedia. DL-Learner also includes an OWL API reasoner interface and a Web service interface
  • DogmaModeler is a free and open source ontology modeling tool based on ORM. The philosophy of DogmaModeler is to enable non-IT experts to model ontologies with little or no involvement of an ontology engineer; the project is quite old, but the software is still available and it may provide some insight into naive ontology development
  • Erca is a framework that eases the use of Formal and Relational Concept Analysis, a neat clustering technique. Though not strictly an ontology tool, Erca could be implemented in a workflow that allows easy import of formal contexts from CSV files, followed by algorithms that compute the concept lattice of the formal contexts, which can then be exported as dot graphs (or in JPG, PNG, EPS and SVG formats). Erca is provided as an Eclipse plug-in
  • GraphMind is a mindmap editor for Drupal. It has the basic mindmap features and some Drupal specific enhancements. There is a quick screencast about what GraphMind looks like and what it does. The Flex source is also available from Github
  • <New>H-Maps is a commercial suite of tools for building topic maps applications, consisting of a topic maps engine and server, a mapping framework for converting from legacy data, and a navigator for visualizing data. It is typically used in bioinformatics (drug discovery and research, toxicological studies, etc.), engineering (support and expert systems), and for integration of heterogeneous data. It supports the XTM 1.0 and TMAPI 1.0 specifications
  • irON can use spreadsheets, via its notation and specification. Spreadsheets can be used for initial authoring, especially if the irON guidelines are followed. See further this case study of Sweet Tools in a spreadsheet using irON (commON)
  • <New>JXML2OWL API is a library for mapping XML schemas to OWL ontologies on the Java platform. It creates an XSLT which transforms instances of the XML schema into instances of the OWL ontology. JXML2OWL Mapper is a GUI application using the JXML2OWL API
  • MindRaider is a Semantic Web outliner. It aims to connect the tradition of outline editors with emerging technologies. MindRaider’s mission is to organize not only the content of your hard drive but also your cognitive base and social relationships in a way that enables quick navigation, concise representation and inferencing
  • <New>Neologism is a simple web-based RDF Schema vocabulary editor and publishing system. Use it to create RDF classes and properties, which are needed to publish data on the Semantic Web. Its main goal is to dramatically reduce the time required to create, publish and modify vocabularies for the Semantic Web. It is written in PHP and built on the Drupal platform. Neologism is currently in alpha
  • <New>OCS – Ontology Creation System is software to develop ontologies in a cooperative way with a graphical interface
  • RDF123 is an application and web service for converting data in simple spreadsheets to an RDF graph. Users control how the spreadsheet’s data is converted to RDF by constructing a graphical RDF123 template that specifies how each row in the spreadsheet is converted as well as metadata for the spreadsheet and its RDF translation
  • <New>ROC (Rapid Ontology Construction) is a tool that allows domain experts to quickly build a basic vocabulary for their domain, re-using existing terminology whenever possible. How this works is that the ROC tool asks the domain expert for a set of keywords that are ‘core’ terms of the domain, and then queries remote sources for concepts matching those terms. These are then presented to the user, who can select terms from the list, find relations to other terms, and expand the set of terms and relations, iteratively. The resulting vocabulary (or ‘proto-ontology’, basically a SKOS-like thesaurus) can be used as is, or can be used as input for a knowledge engineer to base a more comprehensive domain ontology on. Interface “triples-oriented,” not graphical.
  • Topincs is a Topic Map authoring software that allows groups to share their knowledge over the web. It makes use of a variety of modern technologies. The most important are Topic Maps, REST and Ajax. It consists of three components: the Wiki, the Editor, and the Server. The server requires AMP; the Editor and Wiki are based on browser plug-ins.
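
Several of the tools above (irON, RDF123, Anzo for Excel) start from spreadsheet rows and end with RDF. The general idea can be sketched in a few lines of Python. This is only a minimal illustration and not how any of those tools is implemented: the http://example.org/ namespace, the function name rows_to_ntriples and the column-to-predicate rule are all assumptions made for the example.

```python
import csv
import io
from urllib.parse import quote

BASE = "http://example.org/"  # hypothetical namespace, not taken from RDF123 or irON


def rows_to_ntriples(csv_text, id_column, base=BASE):
    """Turn each spreadsheet row into N-Triples lines.

    One subject is minted per row (from the id column) and one
    predicate per remaining column; empty cells are skipped.
    """
    lines = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        subject = f"<{base}resource/{quote(row[id_column])}>"
        for column, value in row.items():
            if column == id_column or not value:
                continue  # skip the key column and empty cells
            predicate = f"<{base}property/{quote(column)}>"
            # escape backslashes and double quotes inside the literal
            literal = value.replace("\\", "\\\\").replace('"', '\\"')
            lines.append(f'{subject} {predicate} "{literal}" .')
    return lines


# A tiny sheet built from two entries in the list above
sheet = (
    "name,kind,platform\n"
    "Topincs,Topic Map authoring,\n"
    "Neologism,RDF Schema vocabulary editor,Drupal\n"
)

for line in rows_to_ntriples(sheet, "name"):
    print(line)
```

A real converter would let the user control the mapping (RDF123 does this through a template) rather than hard-wiring one predicate per column.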

Ontology Editing

  • First, see all of the Comprehensive Tools and Ontology Development listings above
  • Anzo for Excel includes an (RDFS and OWL-based) ontology editor that can be used directly within Excel. In addition to that, Anzo for Excel includes the capability to automatically generate an ontology from existing spreadsheet data, which is very useful for quick bootstrapping of an ontology
  • <New>ATop is a topic map browser and editor written in Java and supports the XTM 1.0 specification; project has not been updated since 2008
  • Hozo is an ontology visualization and development tool that brings version control constructs to group ontology development; limited to a prototype, with no online demo
  • Lexaurus Editor is for off-line creation and editing of vocabularies, taxonomies and thesauri. It supports import and export in Zthes and SKOS XML formats, and allows hierarchical / poly-hierarchical structures to be loaded for editing, or even multiple vocabularies to be loaded simultaneously, so that terms from one taxonomy can be re-used in another, using drag and drop. Not available in open source
  • Model Futures OWL Editor combines simple OWL tools, featuring UML (XMI), ErWin, thesaurus and imports. The editor is tree-based and has a “navigator” tool for traversing property and class-instance relationships. It can import XMI (the interchange format for UML) and Thesaurus Descriptor (BT-NT XML), and EXPRESS XML files. It can export to MS Word.
  • <New>OBO-Edit is an open source ontology editor written in Java. OBO-Edit is optimized for the OBO biological ontology file format. It features an easy to use editing interface, a simple but fast reasoner, and powerful search capabilities
  • <New>Onotoa is an Eclipse-based ontology editor for topic maps. It has a graphical UML-like interface, an export function for the current TMCL-draft and a XTM export
  • OntoTrack is a browsing and editing ontology authoring tool for OWL Lite. It combines a sophisticated graphical layout with mouse enabled editing features optimized for efficient navigation and manipulation of large ontologies
  • OWLViz is an attractive visual editor for OWL and is available as a Protégé plug-in
  • PoolParty is a triple store-based thesaurus management environment which uses SKOS and text extraction for tag recommendations. See further this manual, which describes more fully the system’s functionality. Also, there is a PoolParty Web service that enables a Zthes thesaurus in XML format to be uploaded and converted to SKOS (via skos:Concepts)
  • SKOSEd is a plugin for Protege 4 that allows you to create and edit thesauri (or similar artefacts) represented in the Simple Knowledge Organisation System (SKOS).
  • TemaTres is a Web application to manage controlled vocabularies, taxonomies and thesaurus. The vocabularies may be exported in Zthes, Skos, TopicMap, etc.
  • ThManager is a tool for creating and visualizing SKOS RDF vocabularies. ThManager facilitates the management of thesauri and other types of controlled vocabularies, such as taxonomies or classification schemes
  • Vitro is a general-purpose web-based ontology and instance editor with customizable public browsing. Vitro is a Java web application that runs in a Tomcat servlet container. With Vitro, you can: 1) create or load ontologies in OWL format; 2) edit instances and relationships; 3) build a public web site to display your data; and 4) search your data with Lucene. Still in somewhat early phases, with no online demos and with minimal interfaces.
  • <New>Vocab Editor is an RDF/OWL/SKOS vocabulary-diagram editor. It has both client- (Javascript) and server-side (Python) implementations. It is open source with a demo. There is a blog (Spanish) and online sample vocabulary app editor.

Not Apparently in Active Use

  • The Omnigator is a form-based manipulation tool centered on Topic Maps, though it enables the loading and navigation of any conforming topic map in XTM, HyTM, LTM or RDF formats. There is a free evaluation version.
  • OntoGen is a semi-automatic and data-driven ontology editor focusing on editing of topic ontologies (a set of topics connected with different types of relations). The system combines text-mining techniques with an efficient user interface. It requires .Net.
  • OntoLight is a set of software modules for: transforming raw ontology data for several ontologies from their specific formats into a unifying light-weight ontology format, grounding the ontology and storing it into grounded ontology format, populating grounded ontologies with new instance data, and creating mappings between grounded ontologies; includes Cyc. Download no longer available. See http://analytics.ijs.si/~blazf/papers/Context_SiKDD07.pdf and http://www.neon-project.org/web-content/index.php?option=com_weblinks&task=view&catid=17&id=52 or http://www.neon-project.org/web-content/index.php?option=com_weblinks&catid=21&Itemid=73
  • OWL-S-editor is an editor for the development of services in OWL-S, with graphical, WSDL and import/export support
  • ReTAX+ is an aide to help a taxonomist create a consistent taxonomy and in particular provides suggestions as to where a new entity could be placed in the taxonomy whilst retaining the integrity of the revised taxonomy (c.f., problems in ontology modelling)
  • SWOOP is a lightweight ontology editor. (Swoop is no longer under active development at mindswap. Continuing development can be found on SWOOP’s Google Code homepage at http://code.google.com/p/swoop/)
  • WebOnto supports the browsing, creation and editing of ontologies through coarse grained and fine grained visualizations and direct manipulation.

Ontology Mapping

  • <New>The Alignment API is an API and implementation for expressing and sharing ontology alignments. The correspondences between entities (e.g., classes, objects, properties) in ontologies is called an alignment. The API provides a format for expressing alignments in a uniform way. The goal of this format is to be able to share on the web the available alignments. The format is expressed in RDF, so it is freely extensible. The Alignment API itself is a Java description of tools for accessing the common format. It defines four main interfaces (Alignment, Cell, Relation and Evaluator).
  • COMA++ is a schema and ontology matching tool with a comprehensive infrastructure. Its graphical interface supports a variety of interaction
  • ConcepTool is a system to model, analyse, verify, validate, share, combine, and reuse domain knowledge bases and ontologies, reasoning about their implication
  • <New>MapOnto is a research project aiming at discovering semantic mappings between different data models, e.g, database schemas, conceptual schemas, and ontologies. So far, it has developed tools for discovering semantic mappings between database schemas and ontologies as well as between different database schemas. The Protege plug-in is still available, but appears to be for older versions
  • MatchIT automates and facilitates schema matching and semantic mapping between different Web vocabularies. MatchIT runs as a stand-alone or plug-in Eclipse application and can be integrated with popular third party applications. MatchIT uses Adaptive Lexicon™ as an ontology-driven dictionary and thesaurus of English language terminology to quantify and rank the semantic similarity of concepts. It apparently is not available in open source
  • myOntology is used to produce the theoretical foundations, and deployable technology for the Wiki-based, collaborative and community-driven development and maintenance of ontologies, instance data and mappings
  • OLA/OLA2 (OWL-Lite Alignment) matches ontologies written in OWL. It relies on a similarity combining all the knowledge used in entity descriptions. It also deals with one-to-many relationships and circularity in entity descriptions through a fixpoint algorithm
  • Potluck is a Web-based user interface that lets casual users—those without programming skills and data modeling expertise—mash up data themselves. Potluck is novel in its use of drag and drop for merging fields, its integration and extension of the faceted browsing paradigm for focusing on subsets of data to align, and its application of simultaneous editing for cleaning up data syntactically. Potluck also lets the user construct rich visualizations of data in-place as the user aligns and cleans up the data.
  • PRIOR+ is a generic and automatic ontology mapping tool, based on propagation theory, information retrieval technique and artificial intelligence model. The approach utilizes both linguistic and structural information of ontologies, and measures the profile similarity and structure similarity of different elements of ontologies in a vector space model (VSM).
  • <New>S-Match takes any two tree like structures (such as database schemas, classifications, lightweight ontologies) and returns a set of correspondences between those tree nodes which semantically correspond to one another.
  • Vine is a tool that allows users to perform fast mappings of terms across ontologies. It performs smart searches, can search using regular expressions, requires a minimum number of clicks to perform mappings, can be plugged into arbitrary mapping framework, is non-intrusive with mappings stored in an external file, has export to text files, and adds metadata to any mapping. See also http://sourceforge.net/projects/vine/.
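
Most of the mapping tools above share one basic step: compare entities from two vocabularies and keep the pairs that are similar enough. As a rough sketch of that step only, the Python below matches class labels by string similarity using the standard library's difflib. It is not the algorithm of any tool listed here; the function names, the 0.8 threshold and the sample labels are assumptions chosen for illustration, and real matchers add structural and semantic evidence on top.

```python
from difflib import SequenceMatcher


def normalize(label):
    """Lower-case a label and treat underscores and hyphens as spaces."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def similarity(a, b):
    """String similarity in [0, 1] between two normalized labels."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_labels(source, target, threshold=0.8):
    """For each source label, keep its best-scoring target label
    if the score reaches the threshold."""
    correspondences = []
    for s in source:
        best = max(target, key=lambda t: similarity(s, t))
        score = similarity(s, best)
        if score >= threshold:
            correspondences.append((s, best, round(score, 2)))
    return correspondences


# Hypothetical labels from two small vocabularies
source_labels = ["Person", "Organisation", "Postal_Address"]
target_labels = ["person", "Organization", "Address", "Vehicle"]

for s, t, score in match_labels(source_labels, target_labels):
    print(f"{s} = {t} ({score})")
```

With these labels, "Postal_Address" is left unmatched because its best candidate falls below the threshold, which is exactly the kind of case where a thesaurus or structural matcher would be needed.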

Not Apparently in Active Use

  • ASMOV (Automated Semantic Mapping of Ontologies with Validation) is an automatic ontology matching tool which has been designed in order to facilitate the integration of heterogeneous systems, using their data source ontologies
  • Chimaera is a software system that supports users in creating and maintaining distributed ontologies on the web. Two major functions it supports are merging multiple ontologies together and diagnosing individual or multiple ontologies
  • CMS (CROSI Mapping System) is a structure matching system that capitalizes on the rich semantics of the OWL constructs found in source ontologies and on its modular architecture that allows the system to consult external linguistic resources
  • ConRef is a service discovery system which uses ontology mapping techniques to support different user vocabularies
  • DRAGO reasons across multiple distributed ontologies interrelated by pairwise semantic mappings, with a vision of peer-to-peer mapping of many distributed ontologies on the Web. It is implemented as an extension to an open source Pellet OWL Reasoner
  • Falcon-AO (Finding, aligning and learning ontologies) is an automatic ontology matching tool that includes the three elementary matchers of String, V-Doc and GMO. In addition, it integrates a partitioner PBM to cope with large-scale ontologies
  • FOAM is the Framework for ontology alignment and mapping. It is based on heuristics (similarity) of the individual entities (concepts, relations, and instances)
  • hMAFRA (Harmonize Mapping Framework) is a set of tools supporting semantic mapping definition and data reconciliation between ontologies. The targeted formats are XSD, RDFS and KAON
  • IF-Map is an Information Flow based ontology mapping method. It is based on the theoretical grounds of logic of distributed systems and provides an automated streamlined process for generating mappings between ontologies of the same domain
  • LILY is a system matching heterogeneous ontologies. LILY extracts a semantic subgraph for each entity, then it uses both linguistic and structural information in semantic subgraphs to generate initial alignments. The system is presently in a demo version only
  • MAFRA Toolkit – the Ontology MApping FRAmework Toolkit allows users to create semantic relations between two (source and target) ontologies, and apply such relations in translating source ontology instances into target ontology instances
  • OntoEngine is a step toward allowing agents to communicate even though they use different formal languages (i.e., different ontologies). It translates data from a “source” ontology to a “target”
  • OWLS-MX is a hybrid semantic Web service matchmaker. OWLS-MX 1.0 utilizes both description logic reasoning, and token based IR similarity measures. It applies different filters to retrieve OWL-S services that are most relevant to a given query
  • RiMOM (Risk Minimization based Ontology Mapping) integrates different alignment strategies: edit-distance based strategy, vector-similarity based strategy, path-similarity based strategy, background-knowledge based strategy, and three similarity-propagation based strategies
  • semMF is a flexible framework for calculating semantic similarity between objects that are represented as arbitrary RDF graphs. The framework allows taxonomic and non-taxonomic concept matching techniques to be applied to selected object properties
  • Snoggle is a graphical, SWRL-based ontology mapper. Snoggle attempts to solve the ontology mapping problem by providing a graphical user interface (similar to that of Microsoft Visio) to guide the process of ontology vocabulary alignment. In Snoggle, user-defined mappings can be serialized into rules, which are expressed using SWRL
  • Terminator is a tool for creating term to ontology resource mappings (documentation in Finnish).

Ontology Visualization/Analysis

Though all are not relevant, see my post from a couple of years back on large-scale RDF graph software.

  • Social network graphing tools (many covered elsewhere)
  • Cytoscape is a bioinformatics software platform for visualizing molecular interaction networks and integrating these interactions with gene expression profiles and other state data; I have also written specifically about Cytoscape’s use in UMBEL
    • RDFScape is a project that brings Semantic Web “features” to the popular Systems Biology software Cytoscape
    • NetworkAnalyzer performs analysis of biological networks and calculates network topology parameters including the diameter of a network, the average number of neighbors, and the number of connected pairs of nodes. It also computes the distributions of more complex network parameters such as node degrees, average clustering coefficients, topological coefficients, and shortest path lengths. It displays the results in diagrams, which can be saved as images or text files; used by SD
  • Graphl is a tool for collaborative editing and visualisation of graphs, representing relationships between resources or concepts of the real world. Graphl may be thought of as a visual wiki, a place where everybody can contribute to a shared repository of knowledge
  • <New>Graphviz is open source graph visualization software. It has several main graph layout programs. It also has web and interactive graphical interfaces, and auxiliary tools, libraries, and language bindings.
  • <New>GrOWL is an ontology visualizer and editor. The layout of the GrOWL graph can be defined automatically or loaded from a separate style sheet. GrOWL implements configurable filters that can transform the display by simplifying it, hiding concepts and relationships that have no descriptions associated, or performing more complex translations. Concepts can be stored in ontologies with extensive annotations to provide documentation. GrOWL shows these annotations as tooltips, and supports complex HTML and links within them. The GrOWL browser can be used inside a web browser or as a stand-alone application. When used inside a browser, it supports Javascript interaction so that it can be used as a concept chooser with implementation-defined operations.
  • igraph is a free software package for creating and manipulating undirected and directed graphs
  • Network Workbench is very complex and comprehensive; a Swiss Army knife
  • NetworkX – Python; very clean
  • <New>OntoGraf, a Protege 4 plug-in, gives support for interactively navigating the relationships of your OWL ontologies. Various layouts are supported for automatically organizing the structure of your ontology. Different relationships are supported: subclass, individual, domain/range object properties, and equivalence. Relationships and node types can be filtered.
  • <New>OWL2Prefuse is a Java package which creates Prefuse graphs and trees from OWL files (and Jena OntModels). It takes care of converting the OWL data structure to the Prefuse data structure. This makes it easy for developers to use the Prefuse graphs and trees in their Semantic Web applications.
  • <New>RDF Gravity is a tool for visualising RDF/OWL Graphs/ ontologies. RDF Gravity is implemented by using the JUNG Graph API and Jena semantic web toolkit. Its main features are:
    • Graph Visualization
    • Global and Local Filters (enabling specific views on a graph)
    • Full text Search
    • Generating views from RDQL Queries
    • Visualising multiple RDF files
  • <Newest> SKOS Reader is a SKOS browser and an HTML renderer of SKOS thesauri and terminologies that can display a SKOS file hierarchically, alphabetically, or permuted. Commercial; from Mondeca
  • Stanford Network Analysis Package (SNAP) is a general purpose network analysis and graph mining library. It is written in C++ and easily scales to massive networks with hundreds of millions of nodes
  • Social Networks Visualizer (SocNetV) is a flexible and user-friendly tool for the analysis and visualization of Social Networks. It lets you construct networks (mathematical graphs) with a few clicks on a virtual canvas or load networks of various formats (GraphViz, GraphML, Adjacency, Pajek, UCINET, etc) and modify them to suit your needs. SocNetV also offers a built-in web crawler, allowing you to automatically create networks from all links found in a given initial URL
  • Tulip may be incredibly strong
  • Springgraph component for Flex
  • VizierFX is a Flex library for drawing network graphs. The graphs are laid out using GraphViz on the server side, then passed to VizierFX to perform the rendering. The library also provides the ability to run ActionScript code in response to events on the graph, such as mousing over a node or clicking on it.
  • <New>VUE (Visual Understanding Environment) is an open source project focused on creating flexible tools for managing and integrating digital resources in support of teaching, learning and research. VUE provides a flexible visual environment for structuring, presenting, and sharing digital information.
  • <New>yEd is a diagram editor that can be used to quickly and effectively generate high-quality drawings of diagrams. It can support OWL imports.
  • <New>ZGRViewer is a graph visualizer implemented in Java and based upon the Zoomable Visual Transformation Machine. It is specifically aimed at displaying graphs expressed using the DOT language from AT&T GraphViz and processed by programs dot, neato or others such as twopi. ZGRViewer is designed to handle large graphs, and offers a zoomable user interface (ZUI), which enables smooth zooming and easy navigation in the visualized structure.
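
The topology parameters mentioned for NetworkAnalyzer above (diameter, average number of neighbors, number of connected pairs of nodes) are easy to compute for a small undirected graph. The following is a minimal sketch in plain Python, not NetworkAnalyzer's code; the function names and the sample graph are assumptions, and the diameter is taken here to be the longest shortest path among nodes that can reach one another.

```python
from collections import deque


def bfs_distances(graph, start):
    """Shortest-path lengths (in edges) from start to every reachable node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def topology_summary(graph):
    """graph maps each node to the set of its neighbors (undirected)."""
    average_neighbors = sum(len(nbrs) for nbrs in graph.values()) / len(graph)
    diameter = 0
    ordered_pairs = 0
    for node in graph:
        dist = bfs_distances(graph, node)
        diameter = max(diameter, max(dist.values()))
        ordered_pairs += len(dist) - 1  # nodes reachable from this one
    return {
        "diameter": diameter,
        "average_neighbors": average_neighbors,
        "connected_pairs": ordered_pairs // 2,  # count each pair once
    }


# Hypothetical graph: a chain A-B-C-D plus a separate pair E-F
sample = {
    "A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"},
    "E": {"F"}, "F": {"E"},
}

print(topology_summary(sample))
```

For graphs of any real size, a library such as NetworkX or igraph (both listed above) is the better choice; the sketch only shows what the numbers mean.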

Miscellaneous Ontology Tools

  • Apolda (Automated Processing of Ontologies with Lexical Denotations for Annotation) is a plugin (processing resource) for GATE (http://gate.ac.uk/). The Apolda processing resource (PR) annotates a document like a gazetteer, but takes the terms from an (OWL) ontology rather than from a list
  • <Newest>CA Manager supports customized workflows for semantic annotation of content. Commercial; from Mondeca
  • <New>Gloze is a XML to RDF, RDF to XML, and XSD to OWL mapping tool based on Jena; see also http://jena.hpl.hp.com/juc2006/proceedings/battle/paper.pdf . See also http://jena.sourceforge.net/contrib/contributions.html
  • <New>Hoolet is an implementation of an OWL-DL reasoner that uses a first order prover. The ontology is translated to a collection of axioms (in an obvious way based on the OWL semantics) and this collection of axioms is then given to a first order prover for consistency checking.
  • LexiLink is a tool for building, curating and managing multiple lexicons and ontologies in one enterprise-wide Web-based application. The core of the technology is based on RDF and OWL
  • mopy is the Music Ontology Python library, designed to provide easy to use python bindings for ontology terms for the creation and manipulation of music ontology data. mopy can handle information from several ontologies, including the Music Ontology, full FOAF vocab, and the timeline and chord ontologies
  • OBDA (Ontology Based Data Access) is a plugin for Protégé aimed to be a full-fledged OBDA ontology and component editor. It provides data source and mapping editors, as well as querying facilities that, in sum, allow you to design and test every aspect of an OBDA system. It supports relational data sources (RDBMS) and GLAV-like mappings. In its current beta form, it requires Protege 3.3.1, a reasoner implementing the OBDA extensions to DIG 1.1 (e.g., the DIG server for QuOnto) and Jena 2.5.5
  • <New>oBrowse is a web based ontology browser developed in Java. oBrowse parses the OWL files of an ontology and displays the ontology in a tree view. Protege-API and JSF are used in development
  • OntoComP is a Protégé 4 plugin for completing OWL ontologies. It enables the user to check whether an OWL ontology contains “all relevant information” about the application domain, and extend the ontology appropriately if this is not the case
  • Ontology Browser is a browser created as part of the CO-ODE (http://www.co-ode.org/) project; rather simple interface and use
  • Ontology Metrics is a web-based tool that displays statistics about a given ontology, including the expressivity of the language it is written in
  • <New>OntoLT aims at a more direct connection between ontology engineering and linguistic analysis. OntoLT is a Protégé plug-in, with which concepts (Protégé classes) and relations (Protégé slots) can be extracted automatically from linguistically annotated text collections. It provides mapping rules, defined by use of a precondition language that allow for a mapping between linguistic entities in text and class/slot candidates in Protégé. Only available for older Protégé versions
  • OntoSpec is a SWI-Prolog module, aiming at automatically generating XHTML specification from RDF-Schema or OWL ontologies
  • OWL API is a Java interface and implementation for the W3C Web Ontology Language (OWL), used to represent Semantic Web ontologies. The API is focused towards OWL Lite and OWL DL and offers an interface to inference engines and validation functionality
  • OWL Module Extractor is a Web service that extracts a module for a given set of terms from an ontology. It is based on an implementation of locality-based modules that is part of the OWL API.
  • OWL Syntax Converter is an online tool for converting ontologies between different formats, including several OWL syntaxes, RDF/XML, KRSS
  • OWL Verbalizer is an on-line tool that verbalizes OWL ontologies in (controlled) English
  • OwlSight is an OWL ontology browser that runs in any modern web browser; it’s developed with Google Web Toolkit and uses Gwt-Ext, as well as OWL-API. OwlSight is the client component and uses Pellet as its OWL reasoner
  • Pellint is an open source lint tool for Pellet which flags and (optionally) repairs modeling constructs that are known to cause performance problems. Pellint recognizes several patterns at both the axiom and ontology level.
  • PROMPT is a tab plug-in for Protégé for managing multiple ontologies by comparing versions of the same ontology, moving frames between included and including project, merging two ontologies into one, or extracting a part of an ontology
  • <New>ReDeFer is a compendium of RDF-aware utilities organised in a set of packages: RDF2HTML+RDFa: render a piece of RDF/XML as HTML+RDFa; XSD2OWL: transform an XML Schema into an OWL Ontology; CS2OWL: transform a MPEG-7 Classification Scheme into an OWL Ontology; XML2RDF: transform a piece of XML into RDF; and RDF2SVG: render a piece of RDF/XML as a SVG showing the corresponding graph
  • SegmentationApp is a Java application that segments a given ontology according to the approach described in “Web Ontology Segmentation: Analysis, Classification and Use” (http://www.co-ode.org/resources/papers/seidenberg-www2006.pdf)
  • SETH is a software effort to deeply integrate Python with Web Ontology Language (OWL-DL dialect). The idea is to import ontologies directly into the programming context so that its classes are usable alongside standard Python classes
  • SKOS2GenTax is an online tool that converts hierarchical classifications available in the W3C SKOS (Simple Knowledge Organization Systems) format into RDF-S or OWL ontologies
  • SpecGen (v5) is an ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license
  • Text2Onto is a framework for ontology learning from textual resources that extends and re-engineers an earlier framework developed by the same group (TextToOnto). Text2Onto offers three main features: it represents the learned knowledge at a metalevel by instantiating the modelling primitives of a Probabilistic Ontology Model (POM), thus remaining independent from a specific target language while allowing the translation of the instantiated primitives
  • Thea is a Prolog library for generating and manipulating OWL (Web Ontology Language) content. Thea OWL parser uses SWI-Prolog’s Semantic Web library for parsing RDF/XML serialisations of OWL documents into RDF triples and then it builds a representation of the OWL ontology
  • TONES Ontology Repository is primarily designed to be a central location for ontologies that might be of use to tools developers for testing purposes; it is part of the TONES project
  • Visual Ontology Manager (VOM) is a family of tools that enables UML-based visual construction of component-based ontologies for use in collaborative applications and interoperability solutions.
  • Web Ontology Manager is a lightweight, Web-based tool using J2EE for managing ontologies expressed in Web Ontology Language (OWL). It enables developers to browse or search the ontologies registered with the system by class or property names. In addition, they can submit a new ontology file
  • RDF evoc (external vocabulary importer) is a module for Drupal that caches any external RDF vocabulary and provides properties to be mapped to CCK fields, node title and body. This module requires the RDF and the SPARQL modules.
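
A number of the utilities above convert one representation into another; SKOS2GenTax, for example, turns SKOS hierarchies into RDF-S or OWL. The simplest form of such a conversion can be sketched as follows, assuming the triples are already in memory as Python tuples. This is an illustrative sketch, not SKOS2GenTax's method: the ex: identifiers and the function name are hypothetical, and reading skos:broader as rdfs:subClassOf is a modeling decision that does not hold for every thesaurus.

```python
SKOS_BROADER = "http://www.w3.org/2004/02/skos/core#broader"
RDFS_SUBCLASSOF = "http://www.w3.org/2000/01/rdf-schema#subClassOf"
RDFS_CLASS = "http://www.w3.org/2000/01/rdf-schema#Class"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"


def skos_to_rdfs(triples):
    """Rewrite skos:broader links as rdfs:subClassOf links.

    Every concept that takes part in a broader link is also declared
    an rdfs:Class; all other triples are ignored in this sketch.
    """
    out = []
    declared = set()
    for subject, predicate, obj in triples:
        if predicate != SKOS_BROADER:
            continue
        for concept in (subject, obj):
            if concept not in declared:
                declared.add(concept)
                out.append((concept, RDF_TYPE, RDFS_CLASS))
        out.append((subject, RDFS_SUBCLASSOF, obj))
    return out


# Hypothetical input: two narrower concepts under one broader concept
thesaurus = [
    ("ex:Editor", SKOS_BROADER, "ex:Tool"),
    ("ex:Editor", "http://www.w3.org/2004/02/skos/core#prefLabel", "Editor"),
    ("ex:Visualizer", SKOS_BROADER, "ex:Tool"),
]

for triple in skos_to_rdfs(thesaurus):
    print(triple)
```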

Not Apparently in Active Use

  • ActiveOntology is a library, written in Ruby, for easy manipulation of RDF and RDF-Schema models, thru a dynamic DSL based on Ruby idiom
  • Almo is an ontology-based workflow engine in Java supporting the ARTEMIS project; part of the OntoWare initiative
  • ClassAKT is a text classification web service for classifying documents according to the ACM Computing Classification System
  • Elmo provides a simple API to access ontology oriented data inside a Sesame RDF repository. The domain model is simplified into independent concerns that are composed together for multi-dimensional, inter-operating, or integrated applications
  • ExtrAKT is a tool for extracting ontologies from Prolog knowledge bases.
  • F-Life is a tool for analysing and maintaining life-cycle patterns in ontology development.
  • Foxtrot is a recommender system which represents user profiles in ontological terms, allowing inference, bootstrapping and profile visualization.
  • HyperDAML creates an HTML representation of OWL content to enable hyperlinking to specific objects, properties, etc.
  • LinKFactory is an ontology management tool that provides an effective and user-friendly way to create, maintain and extend extensive multilingual terminology systems and ontologies (English, Spanish, French, etc.). It is designed to build, manage and maintain large, complex, language independent ontologies.
  • LSW – the Lisp semantic Web toolkit enables OWL ontologies to be visualized. It was written by Alan Ruttenberg
  • OntoClassify is a system for scalable classification of text into large topic ontologies currently including DMoz and Inspec. The system is available as Web service. The software runs under Windows platform.
  • Ontodella is a Prolog HTTP server for category projection and semantic linking
  • OntoWeaver is an ontology-based approach to Web sites, which provides high level support for web site design and development
  • OWLLib is a PHP library for accessing OWL files. OWL is a w3.org standard for storing semantic information
  • pOWL is a Semantic Web development platform for ontologies in PHP. pOWL consists of a number of components, including RAP
  • ROWL is the Rule Extension of OWL; it is from the Mobile Commerce Lab in the School of Computer Science at Carnegie Mellon University
  • Semantic Net Generator is a utility for generating Topic Maps automatically from different data sources by using rules definitions specified with Jelly XML syntax. This Java library provides Jelly tags to access and modify data sources (also RDF) to create a semantic network
  • SMORE is OWL markup for HTML pages. SMORE integrates the SWOOP ontology browser, providing a clear and consistent way to find and view Classes and Properties, complete with search functionality
  • SOBOLEO is a system for Web-based collaboration to create SKOS taxonomies and ontologies and to annotate various Web resources using them
  • SOFA is a Java API for modeling ontologies and Knowledge Bases in ontology and Semantic Web applications. It provides a simple, abstract and language neutral ontology object model, inferencing mechanism and representation of the model with OWL, DAML+OIL and RDFS languages; from java.dev
  • WebScripter is a tool that enables ordinary users to easily and quickly assemble reports extracting and fusing information from multiple, heterogeneous DAMLized Web sources.

by Mike Bergman at August 23, 2010 05:28 AM

August 16, 2010

AI3:::Adaptive Information (Mike Bergman)

I Have Yet to Metadata I Didn’t Like

Ecumenical

Contrasted with Some Observations on Linked Data

At the SemTech conference earlier this summer there was a kind of vuvuzela-like buzzing in the background. And, like the World Cup games on television, in play at the same time as the conference, I found the droning to be just as irritating.

That droning was a combination of the sense of righteousness in the superiority of linked data matched with a reprise of the “chicken-and-egg” argument that plagued the early years of semantic Web advocacy [1]. I think both of these premises are misplaced. So, while I have been a fan and explicator of linked data for some time, I do not worship at its altar [2]. And, for those that do, this post argues for a greater sense of ecumenism.

My main points are not against linked data. I think it a very useful technique and good (if not best) practice in many circumstances. But my main points get at whether linked data is an objective in itself. By making it such, I argue our eye misses the ball. And, in so doing, we miss making the connection with meaningful, interoperable information, which should be our true objective. We need to look elsewhere than linked data for root causes.

Observation #1: What Problem Are We Solving?

When I began this blog more than five years ago — and when I left my career in population genetics nearly three decades before that — I did so because of my belief in the value of information to confer adaptive advantage. My perspective then, and my perspective now, was that adaptive information through genetics and evolution was being uniquely supplanted within the human species. This change has occurred because humanity is able to record and carry forward all information gained in its experiences.

Adaptive innovations from writing to bulk printing to now electronic form uniquely position the human species to both record its past and anticipate its future. We no longer are limited to evolution and genetic information encoded in surviving offspring to determine what information is retained and moves forward. Now, all information can be retained. Further, we can combine and connect that information in ways that break to smithereens the biological limits of other species.

Yet, despite the electronic volumes and the potentials, chaos and isolated content silos have characterized humanity’s first half century of experience with digital information. I have spoken before about how we have been steadily climbing the data federation pyramid, with Internet technologies and the Web being prime factors for doing so. Now, with a compelling data model in RDF and standards for how we can relate any type of information meaningfully, we also have the means for making sense of it. And connecting it. And learning and adapting from it.

And, so, there is the answer to the rhetorical question: The problem we are solving is to meaningfully connect information. For, without those meaningful connections and recombinations, none of that information confers adaptive advantage.

Observation #2: The Problem is Not A Lack of Consumable Data

One of the “chicken-and-egg” premises in the linked data community is there needs to be more linked data exposed before some threshold to trigger the network effect occurs. This attitude, I suspect, is one of the reasons why hosannas are always forthcoming each time some outfit announces they have posted another chunk of triples to the Web.

Fred Giasson and I earlier tackled that issue with When Linked Data Rules Fail regarding some information published for data.gov and the New York Times. Our observations on the lack of standards for linked data quality proved to be quite controversial. Rehashing that piece is not my objective here.

What is my objective is to hammer home that we do not need linked data in order to have data available to consume. Far from it. Though linked data volumes have been growing, I actually suspect that its growth has been slower than data availability in toto. On the Web alone we have searchable deep Web databases, JSON, XML, microformats, RSS feeds, Google snippets, yada, yada, all in a veritable deluge of formats, contents and contexts. We are having a hard time inventing the next 1000-fold description beyond zettabyte and yottabyte to even describe this deluge [3].

There is absolutely no voice or observer anywhere that is saying, “We need linked data in order to have data to consume.” Quite the opposite. The reality is we are drowning in the stuff.

Furthermore, when one dissects what most of all of this data is about, it is about ways to describe things. Or, put another way, most all data is not schema nor descriptions of conceptual relationships, but making records available, with attributes and their values used to describe those records. Where is a business located? What political party does a politician belong to? How tall are you? What is the population of Hungary?

These are simple constructs with simple key-value pair ways to describe and convey them. This very simplicity is one reason why naïve data structs or simple data models like JSON or XML have proven so popular [4]. It is one of the reasons why the so-called NoSQL databases have also been growing in popularity. What we have are lots of atomic facts, located everywhere, and representable with very simple key-value structures.

While having such information available in linked data form makes it easier for agents to consume it, that extra publishing burden is by no means necessary. There are plenty of ways to consume that data — without loss of information — in non-linked data form. In fact, that is how the overwhelming percentage of such data is expressed today. This non-linked data is also often easy to understand.

What is important is that the data be available electronically with a description of what the records contain. But that hurdle is met in many, many different ways and from many, many sources without any reference whatsoever to linked data. I submit that any form of desirable data available on the Web can be readily consumed without recourse to linked data principles.

Observation #3: An Interoperable Data Model Does Not Require a Single Transmittal Format

The real advantage of RDF is the simplicity of its data model, which can be extended and augmented to express vocabularies and relationships of any nature. As I have stated before, that makes RDF like a universal solvent for any extant data structure, form or schema.

What I find perplexing, however, is how this strength somehow gets translated into a parallel belief that such a flexible data model is also the best means for transmitting data. As noted, most transmitted data can be represented through simple key-value pairs. Sure, at some point one needs to model the structural assumptions of the data model from the supplying publisher, but that complexity need not burden the actual transmitted form. So long as schema can be captured and modeled at the receiving end, data record transmittal can be made quite a bit simpler.

Under this mindset RDF provides the internal (canonical) data model. Prior to that, format and other converters can be used to consume the source data in its native form. A generalized representation for how this can work is shown in this diagram using Structured Dynamics' structWSF Web services framework middleware as the mediating layer:

Of course, if the source data is already in linked data form with understood concepts, relationships and semantics, much of this conversion overhead can be bypassed. If available, that is a good thing.

But it is not a required or necessary thing. Insistence on publishing data in certain forms suffers from the same narrowness as cultural or religious zealotry. Why certain publishers or authors prefer different data formats has a diversity of answers. Reasons can range from what is tried and familiar to available toolsets or even what is trendy, as one might argue linked data is in some circles today. There are literally scores of off-the-shelf “RDFizers” for converting native and simple data structs into RDF form. New converters are readily written.
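To make the converter idea concrete, here is a minimal sketch of what a very simple “RDFizer” might look like for a flat key-value record. It is not structWSF code or any existing converter; the `example.org` namespaces, the function names, and the sample record (including its placeholder population figure) are all assumptions made for illustration. It uses only the Python standard library and treats every value as a plain literal.

```python
import json

# Hypothetical namespaces, chosen only for this illustration.
RECORD_BASE = "http://example.org/record/"
ATTR_BASE = "http://example.org/attr/"


def rdfize(record_json, id_key="id"):
    """Convert one flat JSON record into (subject, predicate, object) triples."""
    record = json.loads(record_json)
    subject = RECORD_BASE + str(record[id_key])
    triples = []
    for key, value in record.items():
        if key == id_key:
            continue  # the identifier becomes the subject URI, not a property
        triples.append((subject, ATTR_BASE + key, value))
    return triples


def to_ntriples(triples):
    """Serialize triples as N-Triples-like lines, every object as a plain literal."""
    lines = []
    for s, p, o in triples:
        literal = json.dumps(str(o), ensure_ascii=False)  # quotes and escapes the value
        lines.append(f"<{s}> <{p}> {literal} .")
    return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder record; the population value is made up for the example.
    source = '{"id": "hungary", "name": "Hungary", "population": 10000000}'
    print(to_ntriples(rdfize(source)))
```

The publisher keeps sending simple key-value records; the mapping to subjects and predicates lives entirely at the receiving end, which is the division of labor argued for above.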

Adaptive systems, by definition, do not require wholesale changes to existing practices and do not require effort where none is warranted. By posing the challenge as a “chicken-and-egg” one where publishers themselves must undertake a change in their existing practices to conform, or else they fail the “linked data threshold”, advocates are ensuring failure. There is plenty of useful structured data to consume already.

Accessible structured data, properly characterized (see below), should be our root interest; not whether that data has been published as linked data per se.

Observation #4: A Technique Can Not Carry the Burden of Usefulness or Interoperability

Linked data is nothing more than some techniques for publishing Web-accessible data using the RDF data model. Some have tried to use the concept of linked data as a replacement for the idea of the semantic Web, and some have recently tried to re-define linked data as not requiring RDF [5]. Yet the real issue with all of these attempts — correct or not, and a fact of linked data since first formulated by Tim Berners-Lee — is that a technique alone can not carry the burden of usefulness or interoperability.

Despite billions of triples now available, we in fact see little actual use or consumption of linked data, except in the life science domain. Indeed, a new workshop by the research community called COLD (Consuming Linked Data) has been set up for the upcoming ISWC conference to look into the very reasons why this lack of usage may be occurring [6].

It will be interesting to monitor what comes out of that workshop, but I have my own views as to what might be going on here. A number of factors, applicable frankly to any data, must be layered on top of linked data techniques in order for it to be useful:

  • Context and coherence (see below)
  • Curation and quality control (where provenance is used as the proxy), and
  • Up-to-date and timely.

These requirements apply to any data ranging from Census CSV files to Google search results. But because relationships can also be more readily asserted with linked data, these requirements are even greater for it.

It is not surprising that the life sciences have seen more uptake of linked data. That community has keen experience with curation, and the quality and linkages asserted there are much superior to other areas of linked data [7].

In other linked data areas, it is really in limited pockets such as FactForge from Ontotext or curated forms of Wikipedia by the likes of Freebase that we see the most use and uptake. There is no substitute for consistency and quality control.

It is really in this area of “publish it and they will come” that we see one of the threads of parochialism in the linked data community. You can publish it and they still will not come. And, like any data, they will not come because the quality is poor or the linkages are wrong.

As a technique for making data available, linked data is thus nothing more than a foot soldier in the campaign to make information meaningful. Elevating it above its pay grade sets the wrong target and causes us to lose focus for what is really important.

Observation #5: 50% of Linked Data is Missing (that is, the Linking part)

There is another strange phenomenon in the linked data movement: the almost total disregard for the linking part. Sure, data is getting published as triples with dereferenceable URIs, but where are the links?

At most, what we are seeing is owl:sameAs assertions and a few others [8]. Not only does this miss the whole point of linked data, but one can question whether equivalence assertions are correct in many instances [9].

For a couple of years now I have been arguing that the central gap in linked data has been the absence of context and coherence. By context I mean the use of reference structures to help place and frame what content is about. By coherence I mean that those contextual references make internal and logical sense, that they represent a consistent world view. Both require a richer use of links to concepts and subjects describing the semantics of the content.

It is precisely through these kinds of links that data from disparate sources and with different frames of reference can be meaningfully related to other data. This is the essence of the semantic Web and the purported purpose of linked data. And it is exactly these areas in which linked data is presently found most lacking.
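As a rough sketch of the point, the snippet below relates records from two sources that use different local terms by linking each record to a shared reference concept. The sources, the `concept:` identifiers and the function names are hypothetical; a real system would draw its concepts from a shared reference vocabulary rather than a hand-written dictionary.

```python
from collections import defaultdict

# Two hypothetical sources that name the same kinds of things differently.
SOURCE_A = [{"id": "a1", "kind": "auto"}, {"id": "a2", "kind": "lorry"}]
SOURCE_B = [{"id": "b1", "kind": "car"}]

# Hypothetical context mappings from each source's local terms to shared concepts.
CONTEXT_A = {"auto": "concept:Automobile", "lorry": "concept:Truck"}
CONTEXT_B = {"car": "concept:Automobile"}


def add_context(records, context):
    """Attach an 'about' link to each record whose local kind maps to a concept."""
    linked = []
    for record in records:
        concept = context.get(record["kind"])
        if concept is not None:
            linked.append({**record, "about": concept})
    return linked


def group_by_concept(*linked_sets):
    """Group record ids from any number of sources by the concept they are about."""
    groups = defaultdict(list)
    for linked in linked_sets:
        for record in linked:
            groups[record["about"]].append(record["id"])
    return dict(groups)


if __name__ == "__main__":
    print(group_by_concept(add_context(SOURCE_A, CONTEXT_A),
                           add_context(SOURCE_B, CONTEXT_B)))
```

Without the "about" links, nothing says that record a1 and record b1 describe the same kind of thing; with them, the two sources can be related without either publisher changing its local terms.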

Of course, these questions are not the sole challenge of linked data. They are the essential challenge in any attempt to connect or interoperate structured data within information systems. So, while linked data is ostensibly designed from the get-go to fulfill these aims, any data that can find meaning outside of its native silo must also be placed into context in a coherent manner. The unique disappointment for much linked data is its failure to provide these contexts despite its design.

Observation #6: Pluralism is a Reality; Embrace It

Yet, having said all of this, Structured Dynamics is still committed to linked data. We present our information as such, and provide great tools for producing and consuming it. We have made it one of the seven foundations to our technology stack and methodology.

But we live in a pluralistic data world. There are reasons and roles for the multitude of popular structured data formats that presently exist. This inherent diversity is a fact in any real-world data context. Thus, we have not met a form of structured data that we didn’t like, especially if it is accompanied with metadata that puts the data into coherent context. It is a major reason why we developed the irON (instance record and object notation) non-RDF vocabulary to provide a bridge from such forms to RDF. irON clearly shows that entities can be usefully described and consumed in either RDF or non-RDF serialized forms.

Attitudes that dismiss non-linked data forms or arrogantly insist that publishers adhere to linked data practices are anything but pluralistic. They are parochial and short-sighted and are contributing, in part, to keeping the semantic Web from going mainstream.

Adoption requires simplicity. The simplest way to encourage the greater interoperability of data is to leverage existing assets in their native form, with encouragement for minor enhancements to add descriptive metadata for what the content is about. Embracing such an ecumenical attitude makes all publishers potentially valuable contributors to a better information future. It will also nearly instantaneously widen the tools base available for the common objective of interoperability.

Parochialism and Root Cause Analysis

Linked data is a good thing, but not an ultimate thing. By making linked data an objective in itself we unduly raise publishing thresholds; we set our sights below the real problem to be solved; and we risk diluting the understanding of RDF from its natural role as a flexible and adaptive data model. Paradoxically, too much parochial insistence on linked data may undercut its adoption and the realization of the overall semantic objective.

Root cause analysis for what it takes to achieve meaningful, interoperable information suggests that describing source content in terms of what it is about is the pivotal factor. Moreover, those contexts should be shared to aid interoperability. Whichever organizations do an excellent job of providing context and coherent linkages will be the go-to ones for data consumers. As we have seen to date, merely publishing linked data triples does not meet this test.

I have heard some state that first you celebrate linked data and its growing quantity, and then hope that the quality improves. This sentiment holds if indeed the community moves on to the questions of quality and relevance. The time for that transition is now. And, oh, by the way, as long as we are broadening our horizons, let’s also celebrate properly characterized structured data no matter what its form. Pluralism is part of the tao to the meaning of information.


[1] See, for example, J.A. Hendler, 2008. “Web 3.0: Chicken Farms on the Semantic Web,” Computer, January 2008, pp. 106-108. See http://www.comp.leeds.ac.uk/webscience/talks/hendler_web_3.pdf. While I can buy Hendler’s arguments about commercial tool vendors holding off major investments until the market is sizable, I think we can also see via listings like Sweet Tools that a lack of tools is not in itself limiting.
[2] An earlier treatment of this subject from a different perspective is M.K. Bergman, 2010. “The Bipolar Disorder of Linked Data,” AI3:::Adaptive Information blog, April 28, 2010.
[3] So far only prefixes for units up to 10^24 (“yotta”) have names; for 10^27, a student campaign on Facebook is proposing “hellabyte” (Northern California slang for “a whole lot of”) to get adopted by science bodies. See http://scitech.blogs.cnn.com/2010/03/04/hella-proposal-facebook/.
[4] One of the more popular posts on this blog has been M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Information blog, January 22, 2009.
[5] See, for example, the recent history on the linked data entry on Wikipedia or the assertions by Kingsley Idehen regarding entity attribute values (EAV) (see, for example, this blog post.)
[6] See further the 1st International Workshop on Consuming Linked Data (COLD 2010), at the 9th International Semantic Web Conference (ISWC 2010), November 8, 2010, Shanghai, China.
[7] For example, in the early years of GenBank, some claimed that annotations of gene sequences due to things like BLAST analyses may have had as high as 30% to 70% error rates due to propagation of initially mislabeled sequences. In part, the whole field of bioinformatics was formed to deal with issues of data quality and curation (in addition to analytics).
[8] See, for example: Harry Halpin, 2009. “A Query-Driven Characterization of Linked Data,” paper presented at the Linked Data on the Web (LDOW) 2009 Workshop, April 20, 2009, Madrid, Spain, see http://events.linkeddata.org/ldow2009/papers/ldow2009_paper16.pdf; Prateek Jain, Pascal Hitzler, Peter Z. Yeh, Kunal Verma and Amit P. Sheth, 2010. “Linked Data is Merely More Data,” in Dan Brickley, Vinay K. Chaudhri, Harry Halpin, and Deborah McGuinness, Linked Data Meets Artificial Intelligence, Technical Report SS-10-07, AAAI Press, Menlo Park, California, 2010, pp. 82-86, see http://knoesis.wright.edu/library/publications/linkedai2010_submission_13.pdf; among others.
[9] Harry Halpin and Patrick J. Hayes, 2010. “When owl:sameAs isn’t the Same: An Analysis of Identity Links on the Semantic Web,” presented at LDOW 2010, April 27th, 2010, Raleigh, North Carolina. See http://events.linkeddata.org/ldow2010/papers/ldow2010_paper09.pdf.

by Mike Bergman at August 16, 2010 05:58 AM

August 10, 2010

DBpedia Blog

DBpedia 3.5.1 available on Amazon EC2

As the Amazon Web Services are getting used a lot for cloud computing, we have started to provide current snapshots of the DBpedia dataset for this environment.

We provide the DBpedia dataset for Amazon Web Services in two ways:

1. Source files for being mounted:  http://developer.amazonwebservices.com/connect/entry.jspa?externalID=2319

2. Virtuoso SPARQL store for being instantiated: http://www.openlinksw.com/dataspace/dav/wiki/Main/VirtAWSDBpedia351C

by ChrisBizer at August 10, 2010 06:49 AM

August 09, 2010

AI3:::Adaptive Information (Mike Bergman)

An Executive Intro to Ontologies

Ontologies are the structural frameworks for organizing information on the semantic Web and within semantic enterprises. They provide unique benefits in discovery, flexible access, and information integration due to their inherent connectedness; that is, their ability to represent conceptual relationships. Ontologies can be layered on top of existing information assets, which means they are an enhancement and not a displacement for prior investments. And ontologies may be developed and matured incrementally, which means their adoption may be cost-effective as benefits become evident [1].

What Is an Ontology?

Ontology may be one of the more daunting terms for those exposed for the first time to semantic technologies. Not only is the word long and without common antecedents, but it is also a term that has widely divergent use and understanding within the community. It can be argued that this not-so-little word is one of the barriers to mainstream understanding of the semantic Web.

The root of the term is the Greek ontos, or being or the nature of things. Literally — and in classical philosophy — ontology was used in relation to the study of the nature of being or the world, the nature of existence. Tom Gruber, among others, made the term popular in relation to computer science and artificial intelligence about 15 years ago when he defined ontology as a “formal specification of a conceptualization.”

Much like taxonomies or relational database schema, ontologies work to organize information. No matter what the domain or scope, an ontology is a description of a world view. That view might be limited and minuscule, or it might be global and expansive. However, unlike those alternative hierarchical views of concepts such as taxonomies, ontologies often have a linked or networked “graph” structure. Multiple things can be related to other things, all in a potentially multi-way series of relationships.

[Figure: Example Taxonomy Structure and Example Ontology Structure. A distinguishing characteristic of ontologies compared to conventional hierarchical structures is their degree of connectedness, their ability to model coherent, linked relationships.]

Ontologies supply the structure for relating information to other information in the semantic Web or the linked data realm. Ontologies thus provide a similar role for the organization of data that is provided by relational data schema. Because of this structural role, ontologies are pivotal to the coherence and interoperability of interconnected data.

When one uses the idea of “world view” as synonymous with an ontology, it is not meant to be cosmic, but simply a way to convey how a given domain or problem area can be described. One group might choose to describe and organize, say, automobiles, by color; another might choose body styles such as pick-ups or sedans; or still another might use brands such as Honda and Ford. None of these views is inherently “right” (indeed multiples might be combined in a given ontology), but each represents a particular way, a “world view”, of looking at the domain.

Though there is much latitude in how a given domain might be described, there are both good ontology practices and bad ones. We offer some views as to what constitutes good ontology design and practice in the concluding section.

What Are Its Benefits?

A good ontology offers a composite suite of benefits not available to taxonomies, relational database schema, or other standard ways to structure information. Among these benefits are:

  • Coherent navigation by enabling the movement from concept to concept in the ontology structure
  • Flexible entry points because any specific perspective in the ontology can be traced and related to all of its associated concepts; there is no set structure or manner for interacting with the ontology
  • Connections that highlight related information and aid and prompt discovery without requiring prior knowledge of the domain or its terminology
  • Ability to represent any form of information, including unstructured (say, documents or text), semi-structured (say, XML or Web pages) and structured (say, conventional databases) data
  • Inferencing, whereby specifying one concept (say, mammals) tells us that we are also referring to a related concept (say, that mammals are a kind of animal)
  • Concept matching, which means that even though we may describe things somewhat differently, we can still match to the same idea (such as glad or happy both referring to the concept of a pleasant state of mind)
  • Thus, this means that we can also integrate external content by proper matching and mapping of these concepts
  • A framework for disambiguation by nature of the matching and analysis of concepts and instances in the ontology graph, and
  • Reasoning, which is the ability to use the coherence and structure itself to inform questions of relatedness or to answer questions.
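The inferencing benefit in the list above can be sketched in a few lines. This is a toy illustration only: the concept names and the `SUBCLASS_OF` table are hypothetical, and a real ontology would express these assertions in a language such as OWL or RDF Schema and leave the inference to a reasoner.

```python
# Hypothetical "is a kind of" assertions; each concept lists its direct parents.
SUBCLASS_OF = {
    "dog": {"mammal"},
    "mammal": {"animal"},
    "animal": set(),
}


def ancestors(concept, subclass_of):
    """Return every concept reachable by following kind-of links upward."""
    seen = set()
    stack = list(subclass_of.get(concept, ()))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(subclass_of.get(parent, ()))
    return seen


def is_a(concept, candidate, subclass_of):
    """True if concept is the candidate itself or a (transitive) kind of it."""
    return concept == candidate or candidate in ancestors(concept, subclass_of)


if __name__ == "__main__":
    print(sorted(ancestors("dog", SUBCLASS_OF)))
    print(is_a("mammal", "animal", SUBCLASS_OF))
```

Nothing in the table states directly that a dog is an animal; that fact falls out of following the asserted links, which is the sense of inferencing meant here.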

How Are Ontologies Used?

The relationship structure underlying an ontology provides an excellent vehicle for discovery and linkages. “Swimming through” this relationship graph is the basis of the Concept Explorer (also known as the Relation Browser) and similar widgets.

The most prevalent use of ontologies at present is in semantic search. Semantic search has benefits over conventional search in terms of being able to make inferences and matches not available to standard keyword retrieval.

The relationship structure also is a powerful and more general and more nuanced way to organize information. Concepts can relate to other concepts through a richness of vocabulary. Such predicates might capture subsumption, precedence, part-of relationships (mereology), preferences, or importances along virtually any metric. This richness of expression and relationships can also be built incrementally over time, allowing ontologies to grow and develop in sophistication and use as desired.

The pinnacle application for ontologies, therefore, is as coherent reference structures whose purpose is to help map and integrate other structures and information. Given the huge heterogeneity of information both within and without organizations, the use of ontologies as integration frameworks will likely emerge as their most valuable use.

What Makes for a Good Ontology?

Good ontology practice has aspects both in terms of scope and in terms of construction.

Scope Considerations

Here are some scoping and design questions that we believe should be answered in the positive in order for an ontology to meet good practice standards:

  • Does the ontology provide balanced coverage of the subject domain? This question gets at the issue of properly scoping and bounding the subject coverage of the ontology. It also means that the breadth and depth of the coverage is roughly equivalent across its scope
  • Does the ontology embed its domain coverage into a proper context? A major strength of ontologies is their potential ability to interoperate with other ontologies. Re-using existing and well-accepted vocabularies and including concepts in the subject ontology that aid such connections is good practice. The ontology should also have sufficient reference structure for guiding the assignment of what content “is about”
  • Are the relationships in the ontology coherent? The essence of coherence is that it is a state of logical, consistent connections, a logical framework for integrating diverse elements in an intelligent way. So while context supplies a reference structure, coherence means that the structure makes sense. Is the hip bone connected to the thigh bone, or is the skeleton incorrect?
  • Has the ontology been well constructed according to good practice? See next.

If these questions can be answered affirmatively, then we would deem the ontology ready for production-grade use.

Fundamental to the whole concept of coherence is the fact that experts and practitioners within domains have been looking at the questions of relationships, structure, language and meaning for decades. Though perhaps today we now finally have a broad useful data and logic model in RDF, the fact remains that massive time and effort has already been expended to codify some of these understandings in various ways and at various levels of completeness and scope. Good practice also means, therefore, that maximum leverage is made to springboard ontologies from existing structural and vocabulary assets.

And, because good ontologies also embrace the open world approach, working toward these desired end states can also be incremental. Thus, in the face of common budget or deadline constraints, it is possible initially to scope domains as smaller or to provide less coverage in depth or to use a small set of predicates, all the while still achieving productive use of the ontology. Then, over time, the scope can be expanded incrementally.

Construction Considerations

To achieve their purposes, ontologies must be both human-readable and machine-processable. Also, because they represent conceptual structures, they must be built with a certain composition.

Good ontologies therefore are constructed such that they have:

  • Concept definitions – the matching and alignment of things is done on the basis of concepts (not simply labels) which means each concept must be defined
  • A preferred label that is used for human readable purposes and in user interfaces
  • A “semset” – which means a series of alternate labels and terms to describe the concept. These alternatives include true synonyms, but may also be more expansive and include jargon, slang, acronyms or alternative terms that usage suggests refers to the same concept
  • Clearly defined relationships (also known as properties, attributes, or predicates) for relating two things to one another
  • All of which is written in a machine-processable language such as OWL or RDF Schema (among others).
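As a minimal sketch of the list above, the structure below holds a definition, a preferred label, a “semset” and relationships for each concept, and matches a term against all of a concept's labels. The class, field names and example URIs are assumptions for illustration only; in practice these would be expressed in OWL or RDF Schema rather than in Python objects.

```python
from dataclasses import dataclass, field


@dataclass
class Concept:
    uri: str
    definition: str
    pref_label: str                                # label shown in user interfaces
    semset: set = field(default_factory=set)       # alternate labels, jargon, acronyms
    relations: list = field(default_factory=list)  # (predicate, target URI) pairs

    def labels(self):
        """All lower-cased labels that usage suggests refer to this concept."""
        return {self.pref_label.lower()} | {s.lower() for s in self.semset}


def match(term, concepts):
    """Return the concepts whose preferred label or semset contains the term."""
    needle = term.strip().lower()
    return [c for c in concepts if needle in c.labels()]


if __name__ == "__main__":
    # Hypothetical concept, following the glad/happy example given earlier.
    happiness = Concept(
        uri="http://example.org/concept/Happiness",
        definition="A pleasant state of mind.",
        pref_label="happiness",
        semset={"happy", "glad"},
        relations=[("broader", "http://example.org/concept/Emotion")],
    )
    print([c.pref_label for c in match("Glad", [happiness])])
```

Matching on the concept rather than on a single label is what lets “glad” and “happy” resolve to the same idea.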

In the case of ontology-driven applications using adaptive ontologies, there are also additional instructions contained in the system (often via administrative ontologies) that tell the system which types of widgets need to be invoked for different data types and attributes. This is different than the standard conceptual schema, but is nonetheless essential to how such applications are designed.


[1] This posting was at the request of a couple of Structured Dynamics' customers that desired a way to describe ontologies to non-technical management. For a more in-depth treatment, see M.K. Bergman, 2007. “An Intrepid Guide to Ontologies,” AI3:::Adaptive Information blog, May 16, 2007.

by Mike Bergman at August 09, 2010 05:53 AM

August 03, 2010

Frederick Giasson

Citizen DAN demo: The first live OSF instance

Structured Dynamics just released the Citizen DAN demo. This is the sum of nearly two years of efforts in developing different pieces of technologies such as structWSF, conStruct, irON and Semantic Components. Citizen DAN is the first OSF (Open Semantic Framework) instance.

This demo shows how we took a subset of the US Census data related to the Iowa Metropolitan area, how we created a small ontology to describe its instance records, and how those records are managed, displayed, browsed and searched using the complete tools stack we created for other purposes. All pieces have been integrated together around this Citizen DAN demo that Mike gave at SemTech 2010. We are now releasing a publicly accessible instance of this demo.

I am really proud of what we have accomplished so far with the very limited resources we have been working with for the past two years. Even though we got nothing from our Knight News Challenge application, we were convinced that Citizen DAN was an important project to build and release for local communities. This is an important open source project geared to help local governments and communities create value out of the data they own and publish it in meaningful ways on the Web. It is why we used our small resources to create Citizen DAN. We managed to bootstrap ourselves even more, and we managed to get some early clients interested in investing resources in this project.

It is not just about Citizen DAN

Citizen DAN is one kind of OSF instance. However, OSF can have multiple incarnations. The framework is geared so that any kind of data can be indexed, managed and published by this same framework. We can think of use cases in the financial, consumer and business sectors, just to name a few.

Next steps

In the near future, we will release new and updated tools and services; we will add value to the framework. We will create new online services, in other sectors, that also leverage OSF.

What about documentation?

More and more documentation will be written on the TechWiki. We are committed to one thing going forward: documenting as we go, to make sure that our clients don’t need us to maintain their instances.

Is there a supporting community?

We will also work hard to develop the community around all pieces of OSF. We already have some active members in the community. Some of them will start committing new code and tools, and writing new documentation on the TechWiki. We expect to see significant growth in the community over the next year.

Everything committed by any member of the community benefits all the other members. So far, all our clients have committed the results of their work to the project, because they know that this small investment will be worth much more as the community grows: they get freebies from our other clients and from other members who commit resources to the development of any OSF piece.

The places to start with the community are the OpenStructs Community web site and the OSF Mailing List.

Conclusion

This is just the beginning.

I would encourage you to read Mike’s blog post about this new release for more background information on OSF.

by Fred at August 03, 2010 06:36 PM

July 13, 2010

DBTune Blog

First BBC microsite powered by a triple-store

Jem Rayfield wrote a very interesting post on the technologies used by the World Cup BBC web site, which also got covered by Read Write Web.

All this is very exciting: the World Cup website proved that triple store technologies can be used to drive a production website with significant traffic. I am expecting lots more parts of the BBC web infrastructure to evolve in the same way :-)

There are two issues we are still currently trying to solve though:

  • We need to be able to cluster our triples in several dimensions. For example, we may want to have a graph for a particular programme, and a much larger graph for a particular dataset (e.g. programme data, wildlife finder data, world cup data). The smaller graph is used to make our updates relatively cheap (we replace the whole graph whenever we receive an update). The bigger graph is used to give some degree of isolation between the different sources of data. For that, we need graphs within graphs. It can be done with N3-type graph literals, but is impossible to achieve in a standard quad-store setup, where one single triple can't be part of several graphs.
  • With regard to programme data, the main bottleneck we're facing is the number of updates per second we need to be able to process, a rate most available triple stores struggle to keep up with. The 4store instance on DBTune does keep up, but this has a negative impact on query performance, as the write operations block the reads. We were quite surprised to see that the available triple store benchmarks do not take write throughput into account!
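The two levels of grouping in the first point can be approximated in a toy model, sketched here in Python. This is an illustration only, not the BBC setup: the `QuadStore` class, the graph names and the `datasets` mapping are all hypothetical. Each triple lives in exactly one small graph, an update replaces that graph wholesale, and the larger "dataset" grouping has to be tracked outside the store.

```python
from collections import defaultdict


class QuadStore:
    """Toy quad store: each triple is filed under exactly one graph name."""

    def __init__(self):
        self.graphs = defaultdict(set)  # graph name -> set of (s, p, o)

    def replace_graph(self, graph, triples):
        # Cheap update: drop the old graph wholesale and load the new one.
        self.graphs[graph] = set(triples)

    def triples_in(self, graph_names):
        # Emulate a "bigger graph" as the union of several smaller ones.
        result = set()
        for name in graph_names:
            result |= self.graphs[name]
        return result


store = QuadStore()
store.replace_graph("programme:b001", {("b001", "title", "Old title")})
store.replace_graph("programme:b002", {("b002", "title", "Another show")})

# An update for b001 replaces its whole graph in one step.
store.replace_graph("programme:b001", {("b001", "title", "New title")})

# Dataset membership is bookkeeping kept outside the store (hypothetical).
datasets = {"programmes": ["programme:b001", "programme:b002"]}
programme_data = store.triples_in(datasets["programmes"])
```

The external `datasets` mapping is exactly the part that graphs within graphs would make unnecessary.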

by Yves at July 13, 2010 02:46 PM

Project squin

A Database Perspective on Consuming Linked Data on the Web

I recently wrote an article with the title “A Database Perspective on Consuming Linked Data on the Web” for the German database journal Datenbank-Spektrum. The article discusses how to consume Linked Data from the Web by applying different approaches to execute SPARQL queries over data from multiple providers. These approaches include the link traversal based approach as implemented in SQUIN. The following table lists all approaches discussed in the article, including their main distinguishing properties:

[Table image: LDQueryApproachesOverview]

Here is a draft of the complete article.
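To give a feel for the link traversal idea, here is a minimal Python sketch. It is not SQUIN's implementation: the `WEB` dictionary stands in for dereferencing URIs over HTTP, and the URIs, property names and the single hard-coded query ("names of the people a person knows") are assumptions made for illustration. The point it shows is that data is discovered during query execution by following URIs found in intermediate results.

```python
KNOWS = "http://xmlns.com/foaf/0.1/knows"
NAME = "http://xmlns.com/foaf/0.1/name"

# Hypothetical documents, keyed by the URI that would be dereferenced.
WEB = {
    "http://example.org/alice": [
        ("http://example.org/alice", KNOWS, "http://example.org/bob"),
    ],
    "http://example.org/bob": [
        ("http://example.org/bob", NAME, "Bob"),
    ],
}


def dereference(uri):
    """Stand-in for an HTTP lookup that returns the triples found at a URI."""
    return WEB.get(uri, [])


def names_of_friends(person):
    # Start with the data obtained by looking up the URI in the query.
    data = set(dereference(person))
    # Evaluate the first triple pattern over what we have so far.
    friends = [o for s, p, o in data if s == person and p == KNOWS]
    # Follow the URIs in the intermediate results to discover more data.
    for friend in friends:
        data |= set(dereference(friend))
    # Evaluate the second pattern over the enlarged data set.
    return sorted(o for s, p, o in data if s in friends and p == NAME)


result = names_of_friends("http://example.org/alice")
```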

Thanks to my co-author Andreas Langegger who mainly contributed to the section on traditional query federation and to the classification of the approaches.

Olaf

by Olaf Hartig at July 13, 2010 06:42 AM

July 12, 2010

Blog Data Space (Kingsley Idehen)

Solving Real Problems by Leveraging Linked Data: Unambiguous & Verifiable Identity for HTTP Networks

Problem: Unambiguous Verifiable Network Identity.

How Does Linked Data Address This Problem? It provides critical infrastructure for the WebID Protocol that enables an innovative tweak of SSL/TLS.

What about OpenID? The WebID Protocol embraces and extends OpenID (in an open and positive way) via the WebID + OpenID Hybrid variant of the protocol -- basic effect is that OpenID calls are re-routed to the WebID aspect which simply removes Username and Password Authentication from the authentication challenge interaction pattern.

WebID Components

  1. X.509 Certificate and Private Key Generator
  2. Structured Profile Document (e.g. a FOAF based Profile) published to an HTTP Network (e.g. World Wide Web) and accessible at an Address (URL)
  3. An Agent Identifier aka. WebID (an HTTP Name Reference re. URI variant) that's the Subject of a Structured Profile Document (actually a Descriptor Resource)
  4. Mechanism for persisting Public Key data from X.509 Certificate to Structured Profile Document and associating it with Subject WebID (e.g. SPARUL or other HTTP based methods)
  5. Mechanism for de-referencing Public Key data associated with a WebID (from its Structured Profile Document) for comparison against Public Key data following successful standard SSL/TLS protocol handshake (e.g. via SPARQL Query).
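The comparison in the last component can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the SSL/TLS handshake has already succeeded, the claimed WebID and the public key have been read from the client certificate, and the `PROFILES` dictionary stands in for de-referencing the Structured Profile Document (for example via a SPARQL query). All names and values below are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PublicKey:
    modulus: int
    exponent: int


# Hypothetical stand-in for public key data published in profile documents.
PROFILES = {
    "http://example.org/people/alice#me": [PublicKey(0xC0FFEE, 65537)],
}


def keys_for_webid(webid):
    """Return the public keys associated with a WebID in its profile."""
    return PROFILES.get(webid, [])


def verify_webid(claimed_webid, certificate_key):
    """True if the key used in the handshake is listed for the WebID."""
    return certificate_key in keys_for_webid(claimed_webid)


ok = verify_webid("http://example.org/people/alice#me",
                  PublicKey(0xC0FFEE, 65537))
wrong_key = verify_webid("http://example.org/people/alice#me",
                         PublicKey(0xBAD, 65537))
```

No username or password appears anywhere in the check, which is the effect described above for the OpenID hybrid variant.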

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at July 12, 2010 03:25 AM

July 05, 2010

Frederick Giasson

Semantic Components

For a few months now at Structured Dynamics we have been developing what we call “Semantic Components”. A semantic component is an ontology-driven component, or widget, based on Flex. Such a component takes record descriptions, ontologies and target attributes/types as inputs and outputs some (possibly interactive) visualizations of the records. Depending on the logic described in the input schema and the input record descriptions, the semantic component may behave differently to optimize its own layout and behavior for users.

The purpose of these semantic components is to have a framework of adaptive user interfaces that can be plugged directly into structWSF Web service endpoint instances. The goal is to plug some data, schema and target attributes into these components, and then to let them change their behaviors and appearances depending on the input data and schema.

The picture is simple. We tell the components: here is a set of records serialized in structXML, here is a set of schema serialized in irXML, and here are the target attributes and types I want the components to display. Then, different components get selected and behave differently depending on how the schema have been defined, and how the records have been described.

Ultimately, development time is saved because developers don’t have to hard-code the appearance and behavior of the user interface for whatever data and schema it happens to receive at a given point in time: the logic is built into the components.

Overall Workflow

These various semantic components get embedded in a layout canvas. By interacting with the various components, new queries are generated (most often as SPARQL queries) to the various structWSF Web services endpoints. The result of these requests is to generate a structured results set, which includes various types and attributes.

An internal ontology that embodies the desired behavior and display options (SCO, the Semantic Component Ontology) is matched with these types and attributes to generate the formal instructions to the semantic components. These instructions are presented via the sControl component, which determines which widgets (individual components) need to be invoked and displayed on the layout canvas.

Semantic Components Framework

New interactions with the resulting displays and components cause the iteration path to be generated anew, again starting a new cycle of queries and results sets.

As these pathways and associated display components get created, they can be named and made persistent for later re-use or within dashboard invocations.

A Shift in Design Perspective

There is a bit of a user interface design shift here. User interfaces have always been developed to present information (data) to users, and to let them interact with it. When someone develops such an interface, he has to make thousands of decisions to enable the user interface to cope with different data description situations. Our semantic component framework tries to lift some of this burden from the designer's shoulders by making these decisions itself. Such decisions include:

  • The text control X displays the value of an attribute Y. If the attribute Y doesn’t exist in the description of a record A, then we have to remove it from the user interface.
    • Note: if the text control X gets removed from the interface, there is a good chance that we may have to change other controls as well so that the user interface remains usable to the users.
  • If the text control X gets removed, then there is no reason why its associated icon image should remain in the user interface, so let’s provide accommodations to remove it as well.
  • Some attributes describing the records have values that are comparable with related attributes, so let’s compare these values in a linear chart
  • Some records may be useful baselines for comparison with other records, so let’s allow that to be externally specified, too.
  • All these decisions are true for record A, but not for record B since we have a value to display for the text control X, so let’s behave differently by displaying the text control X and its associated icon image.
  • Etc.

All of these kinds of decisions are now made by the semantic components within our new framework depending on how the input records are described and what ontologies (schema) drive the system.
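The decisions listed above can be pictured with a small Python sketch. It is not the Semantic Components code (which is based on Flex); the function names, the `layout` structure and the sample records are hypothetical, and the sketch covers only two of the decisions: dropping a control and its icon when the attribute is missing, and offering a linear chart when several comparable attributes are present.

```python
def controls_for(record, layout):
    """Pick the controls to display for one record.

    layout maps a control name to an (attribute, icon) pair.
    """
    visible = []
    for control, (attribute, icon) in layout.items():
        if attribute in record:
            # The attribute is described, so show the control and its icon.
            visible.append(
                {"control": control, "value": record[attribute], "icon": icon}
            )
        # Otherwise both the control and its icon are left out.
    return visible


def chart_for(record, comparable_attributes):
    """Offer a linear chart when at least two comparable values exist."""
    values = {a: record[a] for a in comparable_attributes if a in record}
    if len(values) >= 2:
        return {"type": "linear chart", "values": values}
    return None


layout = {"X": ("population", "people.png")}
record_a = {"name": "Record A"}                      # no value for control X
record_b = {"name": "Record B", "population": 1200, "households": 500}

controls_a = controls_for(record_a, layout)
controls_b = controls_for(record_b, layout)
chart_b = chart_for(record_b, ["population", "households"])
```

The same generic functions yield a different interface for record A and record B, with nothing hard-coded per record.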

Thus, the designer can now put more time and effort on the questions of general layout and behavior, themes and styles for her applications, without caring much about how to display information for specific records descriptions.

Perhaps most significant is that the behavior and presentation of information can now be described within these records and schema, an activity that users and knowledge workers can do directly, thus bypassing the need for IT and development. A new balance gets established: developers focus on creating generic tools (widgets or components); consumers of data (users and knowledge workers) determine how they want to display and compare their information.

Unbelievably Fast Implementation

While this shift or change may appear on its face to require some big new framework, the fact is we have been able to accomplish this with simple approaches leading to simple outcomes. Structured Dynamics has been able to put in place a complete Web portal of integrated data that publishes all its data in several serialization languages, with many utilities by which users can interact with the data, slice and dice it, visualize it, and filter and manipulate it … and all of this within two weeks of effort for one developer!

One good example of this is the Citizen Dan demo, composed of Census data and stories related to the Iowa City Metropolitan Area that Mike presented at SemTech 2010 (and some screenshots).

Oh, and did I mention? This system handles text, images, tags, maps, dashboards, numeric data and any kind of structure you can throw at it. And all with the same set of generic components (to which we and others are adding).

More Information

Here is some more information about the semantic component framework and its related pieces:

This is an alpha version of the library. We would also welcome any contributor to the project! We hope you like what you see and that you will be able to leverage it the way we did so that you, and your team, can save as much time as we did!

by Fred at July 05, 2010 09:32 PM

June 25, 2010

Wikier.org Blog (Sergio Fernandez)

SDoW2010

Following the success of SDoW2008 and SDoW2009, this year we (Alex, John, Uldis and I) are back with SDoW2010:

SDoW2010

The 3rd international workshop on Social Data on the Web (SDoW2010), co-located with the 9th International Semantic Web Conference (ISWC2010), aims to bring together researchers, developers and practitioners involved in semantically-enhancing social media websites, as well as academics researching more formal aspects of these interactions between the Semantic Web and Social Web.

Submissions are welcomed till August 27th. See you in Shanghai!

by Sergio Fernández at June 25, 2010 09:42 AM

June 16, 2010

Displacement Activities (Tom Heath)

Why Carry the Cost of Linked Data?

In his ongoing series of niggles about Linked Data, Rob McKinnon claims that “mandating RDF [for publication of government data] may be premature and costly“. The claim is made in reference to Francis Maude’s parliamentary answer to a question from Tom Watson. Personally I see nothing in the statement from Francis Maude that implies the mandating of RDF or Linked Data, only that “Where possible we will use recognised open standards including Linked Data standards”. Note the “where possible”. However, that’s not the point of this post.

There’s nothing premature about publishing government data as Linked Data – it’s happening on a large scale in the UK, US and elsewhere. Where I do agree with Rob (perhaps for the first time ;) ) is that it comes at a cost. However, this isn’t the interesting question, as the same applies to any investment in a nation’s infrastructure. The interesting questions are who bears that cost, and who benefits?

Let’s make a direct comparison between publishing a data set in raw CSV format (probably exported from a database or spreadsheet) and making the extra effort to publish it in RDF according to the Linked Data principles.

Assuming that your spreadsheet doesn’t contain formulas or merged cells that would make the data irregularly shaped, or that you can create a nice database view that denormalises your relational database tables into one, then the cost of publishing data in CSV basically amounts to running the appropriate export of the data and hosting the static file somewhere on the Web. Dead cheap, right?

Oh wait, you’ll need to write some documentation explaining what each of the columns in the CSV file means, and what types of data people should expect to find in each of these. You’ll also need to create and maintain some kind of directory so people can discover your data in the crazy haystack that is the Web. Not quite so cheap after all.

So what are the comparable processes and costs in the RDF and Linked Data scenario? One option is to use a tool like D2R Server to expose data from your relational database to the Web as RDF, but let’s stick with the CSV example to demonstrate the lo-fi approach.

This is not the place to reproduce an entire guide to publishing Linked Data, but in a nutshell, you’ll need to decide on the format of the URIs you’ll assign to the things described in your data set, select one or more RDF schemata with which to describe your data (analogous to defining what the columns in your CSV file mean and how their contents relate to each other), and then write some code to convert the data in your CSV file to RDF, according to your URI format and the chosen schemata. Last of all, for it to be proper Linked Data, you’ll need to find a related Linked Data set on the Web and create some RDF that links (some of) the things in your data set to things in the other. Just as with conventional Web sites, if people find your data useful or interesting they’ll create some RDF that links the things in their data to the things in yours, gradually creating an unbounded Web of data.
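As a rough illustration of the lo-fi route, the conversion step might look like the following Python sketch. Everything specific in it is an assumption: the `BASE` URI format, the `VOCAB` namespace (a real data set would reuse existing RDF schemata where possible), the column names, and the choice of `owl:sameAs` for the outgoing links. It also assumes a regularly shaped CSV file whose key column contains values that are safe to use in a URI.

```python
import csv
import io

BASE = "http://example.org/id/place/"     # hypothetical URI format
VOCAB = "http://example.org/vocab#"       # hypothetical schema namespace
SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def escape(value):
    """Escape a string for use inside an N-Triples literal."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def csv_to_ntriples(text, key_column, links=None):
    """Convert CSV text to a list of N-Triples lines.

    links maps a key value to a URI in some other Linked Data set.
    """
    links = links or {}
    lines = []
    for row in csv.DictReader(io.StringIO(text)):
        subject = f"<{BASE}{row[key_column]}>"
        for column, value in row.items():
            if column == key_column or not value:
                continue  # the key becomes the URI; empty cells are skipped
            lines.append(f'{subject} <{VOCAB}{column}> "{escape(value)}" .')
        if row[key_column] in links:
            # The outgoing link is what makes this Linked Data.
            lines.append(f"{subject} <{SAME_AS}> <{links[row[key_column]]}> .")
    return lines


triples = csv_to_ntriples(
    "id,name\n1,Springfield\n",
    key_column="id",
    links={"1": "http://example.com/other-dataset/springfield"},
)
```

The three decisions named above map directly onto `BASE`, `VOCAB` and the `links` argument; the rest is mechanical.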

Clearly these extra steps come at a cost compared to publishing raw CSV files. So why bear these costs?

There are two main reasons: discoverability and reusability.

Anyone (deliberately) publishing data on the Web presumably does so because they want other people to be able to find and reuse that data. The beauty of Linked Data is that discoverability is baked in to the combination of RDF and the Linked Data principles. Incoming links to an RDF data set put that data set “into the Web” and outgoing links increase the interconnectivity further.

Yes, you can create an HTML link to a CSV file, but you can’t link to specific things described in the data or say how they relate to each other. Linked Data enables this. Yes, you can publish some documentation alongside a CSV file explaining what each of the columns means, but that description can’t be interlinked with the data itself, making it self-describing. Linked Data does this. Yes, you can include URIs in the data itself, but CSV provides no mechanism for indicating that the content of a particular cell is a link to be followed. Linked Data does this. Yes, you can create directories or catalogues that describe the data sets available from a particular publisher, but this doesn’t scale to the Web. Remember what the arrival of Google did to the Yahoo! directory? What we need is a mechanism that supports arbitrary discovery of data sets by bots roaming the Web and building searchable indices of the data they find. Linked Data enables this.

Assuming that a particular data set has been discovered, what is the cost of any one party using that data in a new application? Perhaps this application only needs one data set, in which case all the developer must do is read the documentation to understand the structure of the data and get on with writing code. A much more likely scenario is that the application requires integration of two or more data sets. If each of these data sets is just a CSV file then every application developer must incur the cost of integrating them, i.e. linking together the elements common to both data sets, and must do this for every new data set they want to use in their application. In this scenario the integration cost of using these data sets is proportional to their use. There are no economies of scale. It always costs the same amount, to every consumer.

Not so with Linked Data, which enables the data publisher to identify links between their data and third party data sets, and make these links available to every consumer of that data set by publishing them as RDF along with the data itself. Yes, there is a one-off cost to the publisher in creating the links that are most likely to be useful to data consumers, but that’s a one-off. It doesn’t increase every time a developer uses the data set, and each developer doesn’t have to pay that cost for each data set they use.

If data publishers are seriously interested in promoting the use of their data then this is a cost worth bearing. Why constantly reinvent the wheel by creating new sets of links for every application that uses a certain combination of data sets? Certainly as a UK taxpayer, I would rather the UK Government made this one-off investment in publishing and linking RDF data, thereby lowering the cost for everyone that wanted to use them. This is the way to build a vibrant economy around open data.

by Tom Heath at June 16, 2010 12:33 PM

May 19, 2010

DBTune Blog

DBpedia and BBC Programmes

We just put live a new exciting feature on BBC Programmes: programme aggregations powered by DBpedia. For example, you can look at:

Of course, the RDF representations are linked up to DBpedia. Try loading adolescence in the Tabulator, for example - you will get an immediate mashup of BBC data, DBpedia data, and Freebase data. Or if you're not afraid of getting overloaded with data, try the California one.

One of the most interesting things about using web identifiers as tags for our programmes (apart from being able to automatically generate those aggregation pages, of course), is that we can use ancillary information about those tags to create new sorts of aggregations, and new visualisations of our data. We could for example plot all our Radio 3 programmes on a map, depending on the geolocation of the people associated to these programmes. Or we could create an aggregation of BBC programmes featuring artists living in the cities with the highest rainfall (why not?). And, of course, this will be a fantastic new source of data for the MusicBore! The possibilities are basically endless, and we are very excited about it!

by Yves at May 19, 2010 11:18 AM

April 28, 2010

DBpedia Blog

DBpedia 3.5.1 released

Hi all,

we are happy to announce the release of DBpedia 3.5.1.

This is primarily a bugfix release, which is based on Wikipedia dumps dating from March 2010. Thanks to the great community feedback about the previous DBpedia release, we were able to resolve the reported issues as well as to improve template to ontology mappings.

The new release provides the following improvements and changes compared to the DBpedia 3.5 release:

  1. Some abstracts contained unwanted WikiText markup. The detection of infoboxes and tables has been improved, so that even most pages with syntax errors have clean abstracts now.
  2. In 3.5 there was an issue detecting interlanguage links, which led to some non-English statements having the wrong subject. This has been fixed.
  3. Image references to dummy images (e.g. http://en.wikipedia.org/wiki/Image:Replace_this_image.svg) have been removed.
  4. DBpedia 3.5.1 uses a stricter IRI validation now. Care has been taken to discard only those URIs from Wikipedia that are clearly invalid.
  5. Recognition of disambiguation pages has been improved, increasing the size from 247,000 to 769,000 triples.
  6. More geographic coordinates are extracted now, increasing their number from 1,200,000 to 1,500,000 in the English version.
  7. For this release, all Freebase links have been regenerated from the most recent Freebase dump.

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads351. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.

Lots of thanks to:

  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  • Neofonie GmbH (http://www.neofonie.de), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com).
  • OpenLink Software (http://www.openlinksw.com). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers,

Robert Isele and Anja Jentzsch

by AnjaJentzsch at April 28, 2010 05:18 PM

April 17, 2010

Wikier.org Blog (Sergio Fernandez)

djubby

In order to simplify the deployment architecture of STEAMY (hope Nacho can publish something more soon, because it’s a quite interesting FLOSS Linked Data project), I started the development of djubby in some spare moments of this week. Djubby is nothing more than a Python implementation of Pubby, i.e. a Linked Data frontend for SPARQL endpoints for the Django Web framework.

djubby's architecture
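The core idea of such a frontend can be sketched as follows. This is not djubby's code: the function name, the endpoint address and the URI mapping are hypothetical, and the sketch only shows one step, namely turning a request for a local path into a SPARQL DESCRIBE request against the underlying endpoint.

```python
from urllib.parse import urlencode


def describe_url(endpoint, dataset_base, local_path):
    """Build the endpoint URL that describes the resource behind a local path."""
    resource = dataset_base + local_path          # map local path to dataset URI
    query = f"DESCRIBE <{resource}>"
    return endpoint + "?" + urlencode({"query": query})


url = describe_url(
    "http://example.org/sparql",                  # hypothetical endpoint
    "http://example.org/resource/",               # hypothetical dataset base
    "Berlin",
)
```

A real frontend would additionally perform content negotiation and render the returned description as HTML or RDF.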

For further information, you can read the getting started guide to learn how to use it, or try the demo application with DBpedia. Although release 0.1.4 is quite nice, I know that there is still a long way for djubby to become a stable software artifact, so all feedback is welcome!

by Sergio Fernández at April 17, 2010 12:18 PM

April 16, 2010

Blog Data Space (Kingsley Idehen)

Data 3.0 (a Manifesto for Platform Agnostic Structured Data) Update 5

After a long period of trying to demystify and unravel the wonders of standards compliant structured data access, combined with protocols (e.g., HTTP) that separate:

  1. Identity,
  2. Access,
  3. Storage,
  4. Representation, and
  5. Presentation.

I ended up with what I can best describe as the Data 3.0 Manifesto. A manifesto for standards-compliant access to structured data object (or entity) descriptors.

Some Related Work

Alex James (Program Manager Entity Frameworks at Microsoft), put together something quite similar to this via his Base4 blog (around the Web 2.0 bootstrap time), sadly -- quoting Alex -- that post has gone where discontinued blogs and their host platforms go (deep deep irony here).

It's also important to note that this manifesto is a variant of TimBL's Linked Data Design Issues meme re. Linked Data, but totally decoupled from RDF (data representation formats aspect) and SPARQL which -- in my world view -- remain implementation details.

Data 3.0 manifesto

  • An "Entity" is the "Referent" of an "Identifier."
  • An "Identifier" SHOULD provide a global, unambiguous, and unchanging (though it MAY be opaque!) "Name" for its "Referent".
  • A "Referent" MAY have many "Identifiers" (Names), but each "Identifier" MUST have only one "Referent".
  • Structured Entity Descriptions SHOULD be based on the Entity-Attribute-Value (EAV) Data Model, and SHOULD therefore take the form of one or more 3-tuples (triples), each comprised of:
    • an "Identifier" that names an "Entity" (i.e., Entity Name),
    • an "Identifier" that names an "Attribute" (i.e., Attribute Name), and
    • an "Attribute Value", which may be an "Identifier" or a "Literal".
  • Structured Descriptions SHOULD be CARRIED by "Descriptor Documents" (i.e., purpose specific documents where Entity Identifiers, Attribute Identifiers, and Attribute Values are clearly discernible by the document's intended consumers, e.g., humans or machines).
  • Structured Descriptor Documents can contain (carry) several Structured Entity Descriptions
  • Structured Descriptor Documents SHOULD be network accessible via network addresses (e.g., HTTP URLs when dealing with HTTP-based Networks).
  • An Identifier SHOULD resolve (de-reference) to a Structured Representation of the Referent's Structured Description.
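The data model in the manifesto is small enough to write down directly. The Python sketch below is one possible reading, not a reference implementation: the class names and the `Registry` helper are invented for illustration. It models a 3-tuple whose value is either an Identifier or a Literal, and enforces the rule that a Referent may have many Identifiers while each Identifier names only one Referent.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Identifier:
    name: str


@dataclass(frozen=True)
class Literal:
    value: str


@dataclass(frozen=True)
class Triple:
    entity: Identifier                     # Entity Name
    attribute: Identifier                  # Attribute Name
    value: Union[Identifier, Literal]      # Attribute Value


class Registry:
    """Tracks which Referent each Identifier names."""

    def __init__(self):
        self._referents = {}

    def bind(self, identifier, referent):
        """Bind an Identifier; return False if it already names another Referent."""
        existing = self._referents.setdefault(identifier, referent)
        return existing == referent


registry = Registry()
first = registry.bind(Identifier("http://example.org/alice#me"), "Alice")
alias = registry.bind(Identifier("http://example.net/people/a#this"), "Alice")
clash = registry.bind(Identifier("http://example.org/alice#me"), "Bob")

description = [
    Triple(Identifier("http://example.org/alice#me"),
           Identifier("http://example.org/vocab#name"),
           Literal("Alice")),
]
```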

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at April 16, 2010 09:09 PM

April 14, 2010

Orri Erling

Transactional High Availability in Virtuoso Cluster Edition

Introduction

This post discusses the technical specifics of how we accomplish smooth transactional operation in a database server cluster under different failure conditions. (A higher-level short version was posted last week.) The reader is expected to be familiar with the basics of distributed transactions.

Someone on a cloud computing discussion list called two-phase commit (2PC) the "anti-availability protocol." There is indeed a certain anti-SQL and anti-2PC sentiment out there, with key-value stores and "eventual consistency" being talked about a lot. Indeed, if we are talking about wide-area replication over high-latency connections, then 2PC with synchronously-sharp transaction boundaries over all copies is not really workable.

For multi-site operations, a level of eventual consistency is indeed quite unavoidable. Exactly what the requirements are depends on the application, so I will focus here on operations inside one site.

The key-value store culture seems to focus on workloads where a record is relatively self-contained. The record can be quite long, with repeating fields, different selections of fields in consecutive records, and so forth. Such a record would typically be split over many tables of a relational schema. In the RDF world, such a record would be split even wider, with the information needed to reconstitute the full record almost invariably split over many servers. This comes from the mapping between the text of URIs and their internal IDs being partitioned in one way, and the many indices on the RDF quads each in yet another way.

So it comes to pass that in the data models we are most interested in, the application-level entity (e.g., a user account in a social network) is not a contiguous unit with a single global identifier. The social network user account, that the key-value store would consider a unit of replication mastering and eventual consistency, will be in RDF or SQL a set of maybe hundreds of tuples, each with more than one index, nearly invariably spanning multiple nodes of the database cluster.

So, before we can talk about wide-area replication and eventual consistency with application-level semantics, we need a database that can run on a fair-sized cluster and have cast-iron consistency within its bounds. If such a cluster is to be large and is to operate continuously, it must have some form of redundancy to cover for hardware failures, software upgrades, reboots, etc., without interruption of service.

This is the point of the design space we are tackling here.

Non Fault-Tolerant Operation

There are two basic modes of operation we cover: bulk load, and online transactions.

In the case of bulk load, we start with a consistent image of the database; load data; and finish by making another consistent image. If there is a failure during load, we lose the whole load, and restart from the initial consistent image. This is quite simple and is not properly transactional. It is quicker for filling a warehouse but is not to be used for anything else. In the remainder, we will only talk about online transactions.

When all cluster nodes are online, operation is relatively simple. Each entry of each index belongs to a partition that is determined by the values of one or more partitioning columns of said index. There are no tables separate from indices; the relational row is on the index leaf of its primary key. Secondary indices reference the row by including the primary key. Blobs are in the same partition as the row which contains the blob. Each partition is then stored on a "cluster node." In non fault-tolerant operations, each such cluster node is a single process with exclusive access to its own permanent storage, consisting of database files and logs; i.e., each node is a single server instance. It does not matter if the storage is local or on a SAN, the cluster node is still the only one accessing it.
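A minimal Python sketch of the placement rule follows. The post only says that the partition is determined by the values of the partitioning columns; the use of a CRC32 hash, the function names and the sample index layouts are assumptions made for illustration and say nothing about how Virtuoso actually computes partitions.

```python
import zlib


def partition_of(key_values, n_partitions):
    """Map the partitioning-column values of an index entry to a partition."""
    data = "\x00".join(str(v) for v in key_values).encode("utf-8")
    return zlib.crc32(data) % n_partitions


def placements(row, indices, n_partitions):
    """Where each index entry of one logical row ends up.

    indices maps an index name to the list of its partitioning columns.
    """
    return {
        name: partition_of([row[c] for c in columns], n_partitions)
        for name, columns in indices.items()
    }


# A quad with two hypothetical indices partitioned on different columns.
quad = {"g": "graph1", "s": "subject1", "p": "predicate1", "o": "object1"}
where = placements(quad, {"by_subject": ["s"], "by_object": ["o"]}, 4)
```

Because each index is partitioned on its own columns, the entries for one logical record can land on different cluster nodes, which is why a single update typically touches several of them.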

When things are not fault tolerant, transactions work as follows:

When there are updates, two-phase commit is used to guarantee a consistent result. Each transaction is coordinated by one cluster node, which issues the updates in parallel to all cluster nodes concerned. Sending two update messages instead of one does not significantly impact latency. The coordinator of each transaction is the primary authority for the transaction's outcome. If the coordinator of the transaction dies between the phases of the commit, the transaction branches stay in the prepared state until the coordinator is recovered and can be asked again about the outcome of the transaction. Likewise, if a non-coordinating cluster node with a transaction branch dies between the phases, it will do a roll-forward and ask the coordinator for the outcome of the transaction.
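The protocol flow described above can be simulated in a few lines of Python. This is a single-process sketch, not Virtuoso's implementation: the `Branch` class, the state names and the `crash_after_prepare` switch are hypothetical, and real messages, logging and recovery are left out.

```python
class Branch:
    """One transaction branch on one cluster node."""

    def __init__(self, node, can_prepare=True):
        self.node = node
        self.can_prepare = can_prepare
        self.state = "active"

    def prepare(self):
        # Phase 1: vote yes and wait, or vote no and roll back locally.
        self.state = "prepared" if self.can_prepare else "rolled back"
        return self.can_prepare

    def finish(self, commit):
        # Phase 2: only a prepared branch still awaits the outcome.
        if self.state == "prepared":
            self.state = "committed" if commit else "rolled back"


def two_phase_commit(branches, crash_after_prepare=False):
    """Coordinate one transaction; return the outcome, or None if unknown."""
    votes = [branch.prepare() for branch in branches]    # phase 1
    if crash_after_prepare:
        # Coordinator died between the phases: branches stay prepared.
        return None
    decision = all(votes)
    for branch in branches:                              # phase 2
        branch.finish(decision)
    return decision


ok = [Branch("node1"), Branch("node2")]
committed = two_phase_commit(ok)

refused = [Branch("node1"), Branch("node2", can_prepare=False)]
aborted = two_phase_commit(refused)

stuck = [Branch("node1"), Branch("node2")]
unknown = two_phase_commit(stuck, crash_after_prepare=True)
```

The third case is the one that matters for availability: until the coordinator comes back, the prepared branches cannot decide on their own.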

If cluster nodes occasionally crash and then recover relatively quickly, without ever losing transaction logs or database files, this is resilient enough. Everything is symmetrical; there are no cluster nodes with special functions, except for one master node that has the added task of resolving distributed deadlocks.

I suppose our anti-SQL person called 2PC "anti-availability" because in the above situation we have the following problems: if any one cluster node is offline, it is quite likely that no transaction can be committed. This is so unless the data is partitioned on a key with application semantics, and all data touched by a transaction usually stays within a single partition. Then operations could proceed on most of the data while one cluster node was recovering. But, especially with RDF, this is never the case, since keys are partitioned in ways that have nothing to do with application semantics. Further, if one uses XA or Microsoft DTC with the monitor on a single box, this box can become a bottleneck and/or a single point of failure. (Among other considerations, this is why Virtuoso does not rely on any such monitor.) Further, if a cluster node dies never to be heard of again, leaving prepared but uncommitted transaction branches, the rest of the system has no way of telling what to do with them, again unless relying on a monitor that is itself liable to fail.

If transactions have a real world counterpart, it is possible, at least in theory, to check the outcome against the real world state: One can ask a customer if an order was actually placed or a shipment delivered. But when a transaction has to do with internal identifiers of things, for example whether mailto://plaidskirt@hotdate.com has internal ID 0xacebabe, such a check against external reality is not possible.

Fault-Tolerant Operation

In a fault tolerant setting, we introduce the following extra elements: Cluster nodes are comprised of "quorums" of mutually-mirroring server instances. Each such quorum holds a partition of the data. Such a quorum typically consists of two server instances, but may have three for extra safety. If all server instances in the quorum are offline, then the cluster node is offline, and the cluster is not fully operational. If at least one server instance in a quorum is online, then the cluster node is online, and the cluster is operational and can process new transactions.
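The availability rule for quorums can be written down directly. The sketch below assumes a plain dictionary describing which server instances are up; the quorum and instance names are made up for illustration.

```python
def node_online(quorum):
    """A cluster node is online if at least one of its server instances is."""
    return any(quorum.values())

def cluster_operational(quorums):
    """The cluster is operational only if every quorum has a live member."""
    return all(node_online(q) for q in quorums.values())

# Hypothetical layout: two quorums of two mutually-mirroring instances each.
# True means the instance is online.
quorums = {
    "q1": {"host1:1111": True, "host2:1111": False},
    "q2": {"host3:1111": True, "host4:1111": True},
}
```

With this layout the cluster keeps running after losing `host2:1111`, but losing `host1:1111` as well would take the whole of `q1` offline.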

We designate one cluster node (i.e., one quorum of 2 or 3 server instances) to act as a master node, and we set an order of precedence among its member instances. In addition to arbitrating distributed deadlocks, the master instance on duty will handle reports of server instance failures, and answer questions about any transactions left hanging in prepared state by a dead transaction coordinator. If the master on duty fails, the next master in line will either notice this itself in the line of normal business or get a complaint from another server instance about not being able to contact the previous master.

There is no global heartbeat messaging per se, but since connections between server instances are reused long-term, a dropped connection will be noticed and the master on duty will be notified. If all masters are unavailable, that entire quorum (i.e., the master node) is offline and thus (as with any entire node going offline) most operations will fail anyway, unless by chance they do not hit any data managed by that failed quorum.

When it receives a notice of unavailability, the master instance on duty tries to contact the unavailable server instance and if it fails, it will notify all remaining instances that that server instance is removed from the cluster. The effect is that the remaining server instances will stop attempting to access the failed instance. Updates to the partitions managed by the failed server instance are no longer sent to it, which results in updates to this data succeeding, as they are made against the other server instances in that quorum. Updates to the data of the failed server instance will fail in the window of time between the actual failure and the removal, which is typically well under a second. The removal of a failed server instance is delegated to a central authority in order not to have everybody get in each other's way when trying to effect the removal.

If the failed server instance left prepared uncommitted transactions behind, the server instances having such branches will in due order contact the transaction coordinator to ask what should be done. This is a normal procedure for dealing with possibly dropped commit or rollback messages. When they discover that the coordinator has been removed, the master on duty will be contacted instead. Each prepare message of a transaction lists all the server instances participating in the transaction; thus the master can check whether each has received the prepare. If all have the prepare and none has an abort, the transaction is committed. The dead coordinator may not know this or may indeed not have the transaction logged, since it sends the prepares before logging its own prepare. The recovery will handle this though. We note that of the remaining branches, there is at least one copy of the branch with the failed server instance, or else we would have a whole quorum failed. In cases where there are branches participating in an unresolved transaction where all the quorum members have failed, the system cannot decide the outcome, and will periodically retry until at least one member of the failed quorum becomes available.
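The decision the master has to make can be sketched as follows. The function name and data shapes are hypothetical. The commit rule (all have the prepare, none has an abort) and the undecided case (a whole quorum unreachable) come from the description above; rolling back when a prepare is missing is an assumption of this sketch, since the text does not spell that case out.

```python
def resolve_outcome(participants, branch_state, live_instances):
    """Decide the fate of a transaction whose coordinator has been removed.

    participants   -- quorum name -> list of instance names from the prepare message
    branch_state   -- instance name -> 'prepared' or 'aborted' (absent: no record)
    live_instances -- set of instance names the master can still reach
    """
    live_by_quorum = {
        quorum: [m for m in members if m in live_instances]
        for quorum, members in participants.items()
    }
    # If some participating quorum has no live member, nothing can be decided yet.
    if any(not live for live in live_by_quorum.values()):
        return "undecided"
    for live in live_by_quorum.values():
        states = {branch_state.get(m) for m in live}
        if "aborted" in states:
            return "rollback"
        if "prepared" not in states:
            return "rollback"  # assumption: a prepare that never arrived means abort
    # Every quorum has the prepare and nobody has an abort.
    return "commit"
```

An "undecided" result corresponds to the periodic retry mentioned above: the question is asked again once a member of the failed quorum is back.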

The most complex part of the protocol is the recovery of a failed server instance. The recovery starts with a normal roll forward from the local transaction log. After this, the server instance will contact the master on duty to ask for its state. Typically, the master will reply that the recovering server instance had been removed and is out of date. When this is established, the recovering server instance will contact a live member of its quorum and ask for sync. The failed server instance has an approximate timestamp of its last received transaction. It knows this from the roll forward, where time markers are interspersed now and then between transaction records. The live partner then sends its transaction log(s) covering the time from a few seconds before the last transaction of the failed partner up to the present. A few transactions may get rolled forward twice but this does no harm, since these records have absolute values and no deltas and the second insert of a key is simply ignored. When the sender of the log reaches its last committed log entry, it asks the recovering server instance to confirm successful replay of the log so far. Having the confirmation, the sender will abort all unprepared transactions affecting it and will not accept any new ones until the sync is completed. If new transactions were committed between sending the last of the log and killing the uncommitted new transactions, these too are shipped to the recovering server instance in their committed or prepared state. When these are also confirmed replayed, the recovering server instance is in exact sync up to the transaction. The sender then notifies the rest of the cluster that the sync is complete and that the recovered server instance will be included in any updates of its slice of the data. The time between freeze and re-enable of transactions is the time to replay what came in between the first sync and finishing the freeze. Typically nothing came in, so the time is in milliseconds. If an application got its transaction killed in this maneuver, it will be seen as a deadlock.
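Why replaying a few transactions twice is harmless can be shown with a small model. The record format below (operation, key, absolute value) and the in-memory dictionary standing in for the database are assumptions of this sketch; the real log format is not described here.

```python
def replay(db, log):
    """Apply log records to db (a dict); records hold absolute values, not deltas."""
    for op, key, value in log:
        if op == "insert":
            db.setdefault(key, value)  # a second insert of the same key is ignored
        elif op == "update":
            db[key] = value            # absolute value: re-applying gives the same result
        elif op == "delete":
            db.pop(key, None)          # deleting a key that is already gone is a no-op
    return db

log = [
    ("insert", "k1", 10),
    ("update", "k1", 25),
    ("insert", "k2", 7),
    ("delete", "k2", None),
]
once = replay({}, log)
twice = replay(replay({}, log), log)   # the same log rolled forward a second time
```

Had the log stored deltas such as "add 15 to k1", the second pass would have produced a different value, which is why the overlap of a few seconds is only safe with absolute records.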

If the recovering server instance received transactions in prepared state, it will ask about their outcome as a part of the periodic sweep through pending transactions. One of these transactions could have been one originally prepared by itself, where the prepares had gone out before it had time to log the transaction. Thus, this eventuality too is covered and has a consistent outcome. Failures can interrupt the recovery process. The recovering server instance will have logged as far as it got, and will pick up from this point onward. Real time clocks on the host nodes of the cluster will have to be in approximate sync, within a margin of a minute or so. This is not a problem in a closely connected network.

For simultaneous failure of an entire quorum of server instances (i.e., a set of mutually-mirroring partners; a cluster node), the rule is that the last one to fail must be the first to come back up. In order to have uninterrupted service across arbitrary double failures, one must store things in triplicate; statistically, however, most double failures will not hit cluster nodes of the same group.

The protocol for recovery of failed server instances of the master quorum (i.e., the master cluster node) is identical, except that a recovering master will have to ask the other master(s) which one is more up to date. If the recovering master has a log entry of having excluded all other masters in its quorum from the cluster, it can come back online without asking anybody. If there is no such entry, it must ask the other master(s). If all had failed at the exact same instant, none has an entry of the other(s) being excluded and all will know that they are in the same state since any update to one would also have been sent to the other(s).

Failure of Storage Media

When a server instance fails, its permanent storage may or may not survive. Especially with mirrored disks, storage most often survives a failure. However, the survival of the database does not depend on any single server instance retaining any permanent storage over failure. If storage is left in place, as in the case of an OS reboot or replacing a faulty memory chip, rejoining the cluster is done based on the existing copy of the database on the server instance. If there is no existing copy, a copy can be taken from any surviving member of the same quorum. This consists of the following steps: First, a log checkpoint is forced on the surviving instance. Normally log checkpoints are done at regular intervals, independently on each server instance. The log checkpoint writes a consistent state of the database to permanent storage. The disk pages forming this consistent image will not be written to until the next log checkpoint. Therefore copying the database file is safe and consistent as long as a log checkpoint does not take place between the start and end of copy. Thus checkpoints are disabled right after the initial checkpoint. The copy can take a relatively long time; consider 20s per gigabyte on a 1GbE network a good day. At the end of copy, checkpoints are re-enabled on the surviving cluster node. The recovering database starts without a log, sees the timestamp of the checkpoint in the database, and asks for transactions from just before this time up to present. The recovery then proceeds as outlined above.
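To put the copy time in perspective, here is the arithmetic for the figure of 20 seconds per gigabyte quoted above. The 300 GB database size is a made-up example, and `copy_minutes` is a hypothetical helper.

```python
SECONDS_PER_GB = 20  # the "good day" figure quoted above for a 1GbE network

def copy_minutes(database_gb, seconds_per_gb=SECONDS_PER_GB):
    """Estimated time, in minutes, to copy the database files to a replacement instance."""
    return database_gb * seconds_per_gb / 60

estimate = copy_minutes(300)  # a hypothetical 300 GB database: 100 minutes
```

Checkpoints stay disabled on the surviving instance for the whole of that period, so the estimate is also the length of time during which that instance runs without checkpointing.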

Network Failures

The CAP theorem states that a distributed system cannot guarantee Consistency, Availability, and Partition-tolerance all at once; at most two of the three can hold together. "Partition" here means the split of a network.

It is trivially true that if the network splits so that on both sides there is a copy of each partition of the data, both sides will think themselves the live copy left online after the other died, and each will thus continue to accumulate updates. Such an event is not very probable within one site where all machines are redundantly connected to two independent switches. Most servers have dual 1GbE on the motherboard, and both ports should be used for cluster interconnect for best performance, with each attached to an independent switch. Both switches would have to fail in such a way as to split their respective network for a single-site network split to happen. Of course, the likelihood of a network split in multi-site situations is higher.

One way of guarding against network splits is to require that at least one partition of the data have all copies online. Additionally, the master on duty can request each cluster node or server instance it expects to be online to connect to every other node or instance, and to report which they could reach. If the reports differ, there is a network problem. This procedure can be performed using both interfaces or only the first or second interface of each server to determine if one of the switches selectively blocks some paths. These simple sanity checks protect against arbitrary network errors. Using TCP for inter-cluster-node communication in principle protects against random message loss, but the Virtuoso cluster protocols do not rely on this. Instead, there are protocols for retry of any transaction messages and for using keep-alive messages on any long-running functions sent across the cluster. Failure to get a keep-alive message within a certain period will abort a query even if the network connections look OK.
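The reachability check requested by the master can be sketched as a comparison of reports. The report format (a mapping from each instance to the set of instances it could reach) and the function name are assumptions made for illustration.

```python
def find_network_problems(reports):
    """Compare reachability reports collected by the master on duty.

    reports: instance name -> set of instances it could reach.
    Returns a sorted list of (source, target) pairs that failed to connect;
    an empty list means every instance reached every other expected instance.
    """
    expected = set(reports)
    problems = []
    for source, reached in reports.items():
        for target in expected - {source} - set(reached):
            problems.append((source, target))
    return sorted(problems)

healthy = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
split = {"a": {"b"}, "b": {"a"}, "c": set()}
```

Running the same comparison once per network interface, as suggested above, would show whether only one of the two switches is dropping paths.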

Backups, and Recovery from Loss of Entire Site

For a constantly-operating distributed system, it is hard to define what exactly constitutes a consistent snapshot. The checkpointed state on each cluster node is consistent as far as this cluster node is concerned (i.e., it contains no uncommitted data), but the checkpointed states on all the cluster nodes are not from exactly the same moment in time. The complete state of a cluster is the checkpoint state of each cluster node plus the current transaction log of each. If the logs were shipped in real time to off-site storage, a consistent image could be reconstructed from them. Since such shipping cannot be synchronous due to latency considerations, some transactions could be received only in part in the event of a failure of the off-site link. Such partial transactions can however be detected at reconstruction time because each record contains the list of all participants of the transaction. If some piece is found missing, the whole can be discarded. In this way integrity is guaranteed but it is possible that a few milliseconds worth of transactions get lost. In these cases, the online client will almost certainly fail to get the final success message and will recheck the status after recovery.
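Detecting partially received transactions at reconstruction time amounts to a completeness check. In the sketch below each shipped log record is reduced to a tuple of transaction id, originating node and the participant list carried in the record; these shapes and the function name are hypothetical.

```python
from collections import defaultdict

def complete_transactions(records):
    """Keep only transactions for which a record from every participant arrived.

    records: iterable of (txn_id, node, participants) tuples from the shipped logs.
    """
    seen = defaultdict(set)   # txn_id -> nodes whose record was received
    expected = {}             # txn_id -> nodes that should have sent a record
    for txn_id, node, participants in records:
        seen[txn_id].add(node)
        expected[txn_id] = set(participants)
    # A transaction is usable only if no participant's piece is missing.
    return sorted(t for t in expected if expected[t] <= seen[t])

records = [
    ("t1", "n1", ("n1", "n2")),
    ("t1", "n2", ("n1", "n2")),
    ("t2", "n1", ("n1", "n3")),   # the piece from n3 never arrived
]
```

Here `t2` is discarded as a whole, which is the few milliseconds' worth of lost work mentioned above.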

For business continuity purposes, a live feed of transactions can be constantly streamed off-site, for example to a cloud infrastructure provider. One low-cost virtual machine on the cloud will typically be enough for receiving the feed. In the event of long-term loss of the whole site, replacement servers can be procured on the cloud; thus, capital is not tied up in an aging inventory of spare servers. The cloud-based substitute can be maintained for the time it takes to rebuild an owned infrastructure, which is still at present more economical than a cloud-only solution.

Switching a cluster from an owned site to the cloud could be accomplished in a few hours. The prerequisite of this is that there are reasonably recent snapshots of the database files, so that replay of logs does not take too long. The bulk of the time taken by such a switch would be in transferring the database snapshots from S3 or similar to the newly provisioned machines, formatting the newly provisioned virtual disks, etc.

Rehearsing such a maneuver beforehand is quite necessary for predictable execution. We do not presently have a productized set of tools for such a switch, but can advise any interested parties on implementing and testing such a disaster recovery scheme.

Conclusions

In conclusion, we have shown how we can have strong transactional guarantees in a database cluster without single points of failure or performance penalties when compared with a non fault-tolerant cluster. Operator intervention is not required for anything short of hardware failure. Recovery procedures are simple, at most consisting of installing software and copying database files from a surviving cluster node. Unless permanent storage is lost in the failure, not even this is required. Real-time off-site log shipment can easily be added to these procedures to protect against site-wide failures.

Future work may be directed toward concurrent operation of geographically-distributed data centers with eventual consistency. Such a setting would allow for migration between sites in the event of whole-site failures, and for reconciliation between inconsistent histories of different halves of a temporarily split network. Such schemes are likely to require application-level logic for reconciliation and cannot consist of an out-of-the-box DBMS alone. All techniques discussed here are application-agnostic and will work equally well for Graph Model (e.g., RDF) and Relational Model (e.g., SQL) workloads.

Glossary

  • Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
  • Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
  • Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
  • Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
  • Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).

by Orri Erling (oerling@openlinksw.com) at April 14, 2010 10:21 PM

April 12, 2010

DBpedia Blog

DBpedia 3.5 released

Hi all,

we are happy to announce the release of DBpedia 3.5. The new release is based on Wikipedia dumps dating from March 2010. Compared to the 3.4 release, we were able to increase the quality of the DBpedia knowledge base by employing a new data extraction framework which applies various data cleansing heuristics as well as by extending the infobox-to-ontology mappings that guide the data extraction process.

The new DBpedia knowledge base describes more than 3.4 million things, out of which 1.47 million are classified in a consistent ontology, including 312,000 persons, 413,000 places, 94,000 music albums, 49,000 films, 15,000 video games, 140,000 organizations, 146,000 species and 4,600 diseases. The DBpedia data set features labels and abstracts for these 3.2 million things in up to 92 different languages; 1,460,000 links to images and 5,543,000 links to external web pages; 4,887,000 external links into other RDF datasets, 565,000 Wikipedia categories, and 75,000 YAGO categories. The DBpedia knowledge base altogether consists of over 1 billion pieces of information (RDF triples) out of which 257 million were extracted from the English edition of Wikipedia and 766 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.4 release:

  1. The DBpedia extraction framework has been completely rewritten in Scala. The new framework dramatically reduces the extraction time of a single Wikipedia article from over 200 to about 13 milliseconds. All features of the previous PHP framework have been ported. In addition, the new framework can extract data from Wikipedia tables based on table-to-ontology mappings and is able to extract multiple infoboxes out of a single Wikipedia article. The data from each infobox is represented as a separate RDF resource. All resources that are extracted from a single page can be connected using custom RDF properties which are also defined in the mappings. A lot of work also went into the value parsers and the DBpedia 3.5 dataset should therefore be much cleaner than its predecessors. In addition, units of measurement are normalized to their respective SI unit, which makes querying DBpedia easier.
  2. The mapping language that is used to map Wikipedia infoboxes to the DBpedia Ontology has been redesigned. The documentation of the new mapping language is found at http://dbpedia.svn.sourceforge.net/viewvc/dbpedia/trunk/extraction/core/doc/mapping%20language/
  3. In order to enable the DBpedia user community to extend and refine the infobox to ontology mappings, the mappings can be edited on the newly created wiki hosted on http://mappings.dbpedia.org.  At the moment, 303 template mappings are defined, which cover (including redirects) 1055 templates. On the wiki, the DBpedia Ontology can be edited by the community as well. At the moment, the ontology consists of 259 classes and about 1,200 properties. 
  4. The ontology properties extracted from infoboxes are now split into two data sets (for details see: http://wiki.dbpedia.org/Datasets):
       1. The Ontology Infobox Properties dataset contains the properties as they are defined in the ontology (e.g. length). The range of a property is either an xsd schema type or a dimension of measurement, in which case the value is normalized to the respective SI unit.
       2. The Ontology Infobox Properties (Specific) dataset contains properties which have been specialized for a specific class using a specific unit. For example, the property height is specialized on the class Person using the unit centimeters instead of meters.
  5. The framework now resolves template redirects, making it possible to cover all redirects to an infobox on Wikipedia with a single mapping. 
  6. Three new extractors have been implemented:
       1. PageIdExtractor, which extracts the Wikipedia page ID of each page.
       2. RevisionExtractor, which extracts the latest revision of a page.
       3. PNDExtractor, which extracts PND (Personennamendatei) identifiers.
  7. The data set now provides labels, abstracts, page links and infobox data in 92 different languages, which have been extracted from recent Wikipedia dumps as of March 2010.
  8. In addition to the N-Triples datasets, N-Quads datasets are provided which include a provenance URI for each statement. The provenance URI denotes the origin of the extracted triple in Wikipedia (for details see: http://wiki.dbpedia.org/Datasets).

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads35. As usual, the data set is also available as Linked Data and via the DBpedia SPARQL endpoint.
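To make the unit handling from items 1 and 4 of the list concrete, here is a small Python sketch of normalizing a length to its SI unit and of the class-specific variant for a person's height. It is not the Scala extraction framework; the table `TO_METRES` and both function names are hypothetical, and only the conversion factors themselves are standard.

```python
# Conversion factors to the SI base unit of length (metre).
TO_METRES = {"m": 1.0, "cm": 0.01, "km": 1000.0, "ft": 0.3048, "in": 0.0254}

def normalize_length(value, unit):
    """Convert a length to metres, as the general dataset would store it."""
    return value * TO_METRES[unit]

def specialize_height_cm(value_in_metres):
    """Express a height in centimetres, as in the class-specific dataset for Person."""
    return value_in_metres * 100
```

With every length stored in metres, a query can compare values that came from infoboxes using feet, kilometres or centimetres without any per-article unit handling.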

Lots of thanks to:

  • Robert Isele, Anja Jentzsch, Christopher Sahnwaldt, and Paul Kreis (all Freie Universität Berlin) for reimplementing the DBpedia extraction framework in Scala, for extending the infobox-to-ontology mappings and for extracting the new DBpedia 3.5 knowledge base. 
  • Jens Lehmann and Sören Auer (both Universität Leipzig) for providing the knowledge base via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the knowledge base into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint.

The whole DBpedia team is very thankful to three companies which enabled us to do all this by supporting and sponsoring the DBpedia project:

  1. Neofonie GmbH (http://www.neofonie.de/index.jsp), a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications.
  2. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
  3. OpenLink Software (http://www.openlinksw.com/). OpenLink Software develops the Virtuoso Universal Server, an innovative enterprise grade server that cost-effectively delivers an unrivaled platform for Data Access, Integration and Management.

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new DBpedia knowledge base!

Cheers

Chris Bizer

by ChrisBizer at April 12, 2010 09:28 AM

April 09, 2010

Frederick Giasson

Global structWSF Statistics Report

Today we released a simple structWSF nodes statistics report. It aggregates different statistics from all known (and accessible) structWSF nodes on the Web. It is still at an early stage, but the statistics aggregated so far are quite interesting.

This global statistics report has two aims:

  1. Monitoring the evolution of the usage of structWSF, and
  2. Monitoring the overall performance of structWSF web services in different setups for different usages

The report is accessible here at all times and is updated hourly.

Overall Statistics

The main statistics of the report are:

  • The number of structWSF nodes participating in the report
  • The total number of HTTP queries processed by the structWSF nodes
  • The total number of datasets created on the nodes
  • The total number of records indexed, and
  • The total number of triples indexed

These statistics give a general overview of the size of the “global structWSF network of nodes”.

Web Service Statistics

Each Web service endpoint has its own statistics, which are:

  • The number of queries processed by the web service
  • The average time it took to process a query (excluding the network latency between the requester and the web service endpoint server)
  • All the requested MIME types, and the number of times each MIME type has been requested, and
  • All the HTTP response codes returned by the endpoint

These Web service specific statistics are helpful to have a general understanding of each web service endpoint.

The average time per query is helpful to know what kind of performance a developer should expect when using this web service endpoint.

The list of requested MIME types gives an overall picture of how the web service endpoint is used: are users mostly requesting XML data, JSON data, RDF+XML data, etc. Such usage statistics are helpful to prioritize future development tasks.

The list of all HTTP response codes is helpful for noticing possible issues with a web service endpoint. If error codes are returned often, this could point to a bug in the web service endpoint, an issue with its usage that could lead to a fix in the documentation, etc.
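As an illustration of how per-node figures for one endpoint might be consolidated, here is a short Python sketch. It is not the statisticsBroker.php script; the dictionary keys and the function name are assumptions, and weighting the average time by query count is a design choice of this sketch rather than something stated in the post.

```python
from collections import Counter

def aggregate_endpoint(node_stats):
    """Combine per-node statistics for one web service endpoint.

    node_stats: list of dicts with keys 'queries', 'avg_time',
    'mime_types' and 'http_codes' (the last two map a value to a count).
    """
    total = sum(s["queries"] for s in node_stats)
    # Weight each node's average by its query count so busy nodes count for more.
    weighted = sum(s["avg_time"] * s["queries"] for s in node_stats)
    avg = weighted / total if total else 0.0
    mimes, codes = Counter(), Counter()
    for s in node_stats:
        mimes.update(s["mime_types"])   # adds the counts per MIME type
        codes.update(s["http_codes"])   # adds the counts per response code
    return {
        "queries": total,
        "avg_time": avg,
        "mime_types": dict(mimes),
        "http_codes": dict(codes),
    }
```

A plain mean of the per-node averages would let a node that served ten queries weigh as much as one that served ten thousand, which is why the weighted form is used here.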

Participating in the Global structWSF Statistics Report

If you are operating a structWSF instance and want to participate in the Global structWSF Statistics Report, you first have to download the new statisticsBroker.php script and install it on your structWSF node.

The statistics broker script is what calculates the statistics of a structWSF node, and what is used to aggregate statistics from all nodes, to generate the consolidated report.

The first thing to do is to edit the file and change the value of the $enableStatisticsBroadcast variable from FALSE to TRUE on line 46. This will enable the script.

Normally you should install the script in the root folder of your structWSF node, but you can install it anywhere on your server, where it will be accessible on the Web.

The final step is to register your node with the reporting system. It is just a matter of registering the URL where the statisticsBroker.php script is accessible. Your node should be added to the global report within 24 hours, once I have validated it.

Other Usage of the Statistics Broker

It is nice to participate in such a global statistics report, but much more can be done with such a statistics broker.

A structWSF developer or a structWSF node maintainer could use it to get statistics for the local node. As described above, such statistics can be used to pinpoint possible performance issues, bottlenecks and possible bugs in web service endpoints. It could also be used to plan future extensions of the network in order to scale some highly used web service endpoints.

Additionally, the statistics broker could be used in a broader server maintenance architecture. It could be used in conjunction with another script to be part of a Ganglia monitoring system, for example. Ganglia could then monitor performance, the rate of requests per hour, and rises in the number of different HTTP response codes returned by some web services. Additionally, each of these statistics could be bound to alert notifications that would warn the structWSF system maintainers and developers of possible issues with the network.

Next Step

The next step with the statistics broker will be to create a structWSF web service out of it. That way, structWSF node maintainers will easily be able to define access and usage permissions for such statistics.

by Fred at April 09, 2010 03:53 PM

April 07, 2010

Orri Erling

Fault Tolerance in Virtuoso Cluster Edition (Short Version)

We have for some time had the option of storing data in a cluster in multiple copies, in the Commercial Edition of Virtuoso. (This feature is not in and is not planned to be added to the Open Source Edition.)

Based on some feedback from the field, we decided to make this feature more user friendly. The gist of the matter is that failure and recovery processes have been automated so that neither application developer nor operating personnel needs any knowledge of how things actually work.

So I will here make a few high level statements about what we offer for fault tolerance. I will follow up with technical specifics in another post.

Three types of individuals need to know about fault tolerance:

  • Executives: What does it cost? Will it really eliminate downtime?
  • System Administrators: Is it hard to configure? What do I do when I get an alert?
  • Application Developers/Programmers: Will I need to write extra code? Can old applications get fault tolerance with no changes?

I will explain the matter to each of these three groups:

Executives

The value gained is elimination of downtime. The cost is in purchasing twice (or thrice) the hardware and software licenses. In reality, the cost is less since you get the whole money's worth of read throughput and half the money's worth of write throughput. Since most applications are about reading, this is a good deal. You do not end up paying for unused capacity.

Server instances are grouped in "quorums" of two or, for extra safety, three; as long as one member of each quorum is available, the system keeps running and nobody sees a difference, except maybe for slower response. This does not protect against widespread power outage or the building burning down; the scope is limited to hardware and software failures at one site.

The most basic site-wide disaster recovery plan consists of constantly streaming updates off-site. Using an off-site backup plus update stream, one can reconstitute the failed data center on a cloud provider in a few hours. Details will vary; please contact us for specifics.

Running multiple sites in parallel is also possible but specifics will depend on the application. Again, please contact us if you have a specific case in mind.

System Administrators

To configure, divide your server instances into quorums of 2 or 3, according to which will be mirrors of each other, with each quorum member on a different host from the others in its quorum. These things are declared in a configuration file. Table definitions do not have to be altered for fault tolerance. It is enough for tables and indices to specify partitioning. Use two switches, and two NICs per machine, and connect one of each server's network cables to each switch, to cover switch failures.
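The two layout rules above (quorums of 2 or 3, each member on a different host) are easy to check mechanically before starting a cluster. The sketch below works on a plain Python dictionary; it does not read or reproduce the actual Virtuoso configuration file, whose format is not shown here, and the function name is hypothetical.

```python
def validate_quorums(quorums):
    """Check a planned quorum layout.

    quorums: quorum name -> list of (host, port) server instances.
    Returns a list of human-readable problems; an empty list means the
    layout satisfies both rules.
    """
    problems = []
    for name, members in quorums.items():
        if len(members) not in (2, 3):
            problems.append(f"{name}: expected 2 or 3 instances, found {len(members)}")
        hosts = [host for host, _port in members]
        if len(set(hosts)) != len(hosts):
            problems.append(f"{name}: two members share a host")
    return problems
```

Two mirrors on the same host would both disappear with that host, which defeats the purpose of the quorum; that is the case the second rule catches.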

When things break, as long as there is at least one server instance up from each quorum, things will continue to work. Reboots and the like are handled without operator intervention; if there is a broken host, then remove it and put a spare in its place. If the disks are OK, put the old disks in the replacement host and start. If the disks are gone, then copy the database files from the live copy. Finally start the replacement database, and the system will do the rest. The system is online in read-write mode during all this time, including during copying.

Having mirrored disks in individual hosts is optional since data will anyhow be in two copies. Mirrored disks will shorten the vulnerability window of running a partition on a single server instance since this will for the most part eliminate the need to copy many (hundreds of) GB of database files when recovering a failed instance.

Application Developers/Programmers

An application can connect to any server instance in the cluster and have access to the same data, with full ACID properties.

There are two types of errors that can occur in any database application: The database server instance may be offline or otherwise unreachable; and a transaction may be aborted due to a deadlock.

For the missing server instance, the application should try to reconnect. An ODBC/JDBC connect string can specify a list of alternate server instances; thus as long as the application is written to try to reconnect as best practices dictate, there is no new code needed.

For the deadlock, the application is supposed to retry the transaction. Sometimes when a server instance drops out or rejoins a running cluster, some transactions will have to be retried. To the application, these conditions look like a deadlock. If the application handles deadlocks (SQL State 40001) as best practices dictate, there is no change needed.
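For readers who have not written such a retry loop, here is a minimal sketch in Python. The exception class `TransactionAborted` is a stand-in for whatever error a real ODBC or JDBC driver raises; only the SQL State 40001 comes from the text, and the attempt limit and delay are arbitrary assumptions.

```python
import time

DEADLOCK_SQLSTATE = "40001"

class TransactionAborted(Exception):
    """Hypothetical stand-in for a driver error that carries a SQLSTATE code."""
    def __init__(self, sqlstate):
        super().__init__(sqlstate)
        self.sqlstate = sqlstate

def run_with_retry(transaction, max_attempts=5, base_delay=0.0):
    """Run transaction(); retry it when it is aborted with SQLSTATE 40001.

    Any other error, or a deadlock on the last attempt, is re-raised to the caller.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except TransactionAborted as err:
            if err.sqlstate != DEADLOCK_SQLSTATE or attempt == max_attempts:
                raise
            # Back off a little longer after each failed attempt.
            time.sleep(base_delay * attempt)

# Demonstration: a transaction that is aborted twice, then commits.
calls = {"n": 0}

def flaky_transfer():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TransactionAborted(DEADLOCK_SQLSTATE)
    return "committed"

result = run_with_retry(flaky_transfer)
print(result, "after", calls["n"], "attempts")
```

The important property is that the whole transaction is re-run from its start, not just the statement that failed. An application structured this way needs nothing extra to ride out a server instance dropping out of or rejoining the cluster.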

Conclusion

In summary...

  • Limited extra cost for fault tolerance; no equipment sitting idle.
  • Easy operation: Replace servers when they fail; the cluster does the rest.
  • No changes needed to most applications.
  • No proprietary SQL APIs or special fault tolerance logic needed in applications.
  • Fully transactional programming model.

All the above applies to both the Graph Model (RDF) and Relational (SQL) sides of Virtuoso. These features will be in the commercial release of Virtuoso to be publicly available in the next 2-3 weeks. Please contact OpenLink Software Sales for details of availability or for getting advance evaluation copies.

Glossary

  • Virtuoso Cluster (VC) -- a collection of Virtuoso Cluster Nodes on one or more machines, working in parallel as part of a Virtuoso Cluster.
  • Virtuoso Cluster Node (VCN) -- a Virtuoso Server Instance (Non Fault-Tolerant Operations), or a Quorum of Server Instances (Fault Tolerant Operations), which is a member of a collection of Virtuoso Cluster Nodes working in parallel as part of a Virtuoso Cluster.
  • Virtuoso Host Cluster (VHC) -- a collection of machines, each hosting one or more Virtuoso Server Instances, making up a Virtuoso Cluster.
  • Virtuoso Host Cluster Node (VHCN) -- a machine hosting one or more Virtuoso Server Instances that are members of a Virtuoso Cluster.
  • Virtuoso Server Instance (VSI) -- a single Virtuoso process with exclusive access to its own permanent storage, consisting of database files and logs. May comprise an entire Virtuoso Cluster Node (Non Fault-Tolerant Operations), or be one member of a quorum which comprises a Virtuoso Cluster Node (Fault Tolerant Operations).

by Orri Erling (oerling@openlinksw.com) at April 07, 2010 04:40 PM

April 05, 2010

Orri Erling

"The Acquired, The Innate, and the Semantic" or "Teaching Sem Tech"

I was recently asked to write a section for a policy document touching the intersection of database and semantics, as a follow up to the meeting in Sofia I blogged about earlier. I will write about technology, but this same document also touches the matter of education and computer science curricula. Since the matter came up, I will share a few thoughts on the latter topic.

I have over the years trained a few truly excellent engineers and managed a heterogeneous lot of people. These days, since what we are doing is in fact quite difficult and the world is not totally without competition, I find that I must stick to core competence, which is hardcore tech and leave management to those who have time for it.

When younger, I thought that I could, through sheer personal charisma, transfer either technical skills, sound judgment, or drive and ambition to people I was working with. Well, to the extent I believed this, my own judgment was not sound. Transferring anything at all is difficult and chancy. I must here think of a fantasy novel where a wizard said that, "working such magic that makes things do what they already want to do is easy." There is a grain of truth in that.

In order to build or manage organizations, we must work, as the wizard put it, with nature, not against it. There are also counter-examples; for example, my wife's grandmother had decided to transform a regular willow into a weeping one by tying down the branches. Such "magic," needless to say, takes constant maintenance; else the spell breaks.

To operate efficiently, either in business or education, we need to steer away from such endeavors. This is a valuable lesson, but now consider teaching this to somebody. Those who would most benefit from this wisdom are the least receptive to it. So again, we are reminded to stay away from the fantasy of being able to transfer some understanding we think we have and to have it take root. It will if it will, and if it does not, it will take constant follow-up, like the would-be weeping willow.

Now, in more specific terms, what can we realistically expect to teach about computer science?

Complexity of algorithms would be the first thing. Understanding the relative throughputs and latencies of the memory hierarchy (i.e., cache, memory, local network, disk, wide area network) is the second. Understanding the difference of synchronous and asynchronous and the cost of synchronization (i.e., anything from waiting for a mutex to waiting for a network message) is the third.

Understanding how a database works would be immensely helpful for almost any application development task but this is probably asking too much.

Then there is the question of engineering. Where do we put interfaces and what should these interfaces expose? Well, they certainly should expose multiple instances of whatever it is they expose, since passing through an interface takes time.

I tried once to tell the SPARQL committee that parameterized queries and array parameters are a self-evident truism on the database side. This is an example of an interface that exposes multiple instances of what it exposes. But the committee decided not to standardize these. There is something in the "semanticist" mind that is irrationally antagonistic to what is self-evident for databasers. This is further an example of ignoring precept 2 above, the point about the throughputs and latencies in the memory hierarchy. Nature is a better and more patient teacher than I; the point will become clear of itself in due time, no worry.
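The arithmetic behind this is worth one small illustration. The sketch below is a toy cost model in Python; the round-trip and per-item figures are assumed for the sake of the example and are not measurements of any system.

```python
def total_time(n_items, round_trip, per_item, batch_size):
    """Seconds to process n_items when each call carries up to batch_size items.

    Every call pays one interface crossing (round_trip); every item pays per_item.
    """
    calls = -(-n_items // batch_size)  # ceiling division
    return calls * round_trip + n_items * per_item

# Assumed figures: 1 ms per crossing (say, a local network hop), 10 us of work per item.
one_by_one = total_time(10_000, round_trip=1e-3, per_item=1e-5, batch_size=1)
batched = total_time(10_000, round_trip=1e-3, per_item=1e-5, batch_size=1_000)

print(f"one item per call:   {one_by_one:.2f} s")
print(f"1000 items per call: {batched:.2f} s")
```

With these assumed figures, the one-at-a-time interface spends about 10 seconds crossing the interface to do a tenth of a second of work, while the array-parameter version spends about a hundredth of a second crossing it. The ratio, not the specific numbers, is the point.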

Interfaces seem to be overvalued in education. This is tricky because we should not teach that interfaces are bad either. Nature has islands of tightly intertwined processes, separated by fairly narrow interfaces. People are taught to think in block diagrams, so they probably project this also where it does not apply, thereby missing some connections and porosity of interfaces.

LarKC (EU FP7 Large Knowledge Collider project) is an exercise in interfaces. The lessons so far are that coupling needs to be tight, and that the roles of the components are not always as neatly separable as the block diagram suggests.

Recognizing the points where interfaces are naturally narrow is very difficult. Teaching this in a curriculum is likely impossible. This is not to say that the matter should not be mentioned and examples of over-"paradigmatism" given. The geek mind likes to latch on to a paradigm (e.g., object orientation), and then they try to put it everywhere. It is safe to say that taking block diagrams too naively or too seriously makes for poor performance and needless code. In some cases, block diagrams can serve as tactical disinformation; i.e., you give lip service to the values of structure, information hiding, and reuse, which one is not allowed to challenge, ever, and at the same time you do not disclose the competitive edge, which is pretty much always a breach of these same principles.

I was once at a data integration workshop in the US where some very qualified people talked about the process of science. They had this delightfully American metaphor for it:

The edge is created in the "Wild West" — there are no standards or hard-and-fast rules, and paradigmatism for paradigmatism's sake is a laughing matter with the cowboys in the fringe where new ground is broken. Then there is the OK Corral, where the cowboys shoot it out to see who prevails. Then there is Dodge City, where the lawman already reigns, and compliance, standards, and paradigms are not to be trifled with, lest one get the tar-and-feather treatment and be "driven out o'Dodge."

So, if reality is like this, what attitude should the curriculum have towards it? Do we make innovators or followers? Well, as said before, they are not made. Or if they are made, they are at least not made in the university but much before that. I never made any of either, in spite of trying, but did meet many of both kinds. The education system needs to recognize individual differences, even though this is against the trend of turning out a standardized product. Enforced mediocrity makes mediocrity. The world has an amazing tolerance for mediocrity, it is true. But the edge is not created with this, if edge is what we are after.

But let us move to specifics of semantic technology. What are the core precepts, the equivalent of the complexity/memory/synchronization triangle of general purpose CS basics? Let us not forget that, especially in semantic technology, when we have complex operations, lots of data, and almost always multiple distributed data sources, forgetting the laws of physics carries an especially high penalty.

  • Know when to ontologize, when to folksonomize. The history of standards has examples of "stacks of Babel," sky-high and all-encompassing, which just result in non-communication and non-adoption. Lighter weight, community driven, tag folksonomy, VoCamp-style approaches can be better. But this is a judgment call, entirely contextual, having to do with the maturity of the domain of discourse, etc.

  • Answer only questions that are actually asked. This precept is two-pronged. The literal interpretation is not to do inferential closure for its own sake, materializing all implied facts of the knowledge base.

    The broader interpretation is to take real-world problems. Expanding RDFS semantics with map-reduce and proving how many iterations this will take is a thing one can do but real-world problems will be more complex and less neat.

  • Deal with ambiguity. Data on which semantic technologies will be applied will be dirty, with errors from machine processing of natural language to erroneous human annotations. The knowledge bases will not be contradiction free. Michael Witbrock of CYC said many good things about this in Sofia; he would have something to say about a curriculum, no doubt.

Here we see that semantic technology is a younger discipline than computer science. We can outline some desirable skills and directions to follow but the idea of core precepts is not as well formed.

So we can approach the question from the angle of needed skills more than of precepts of science. What should the certified semantician be able to do?

  • Data integration. Given heterogeneous relational schemas talking about the same entities, the semantician should find existing ontologies for the domain, possibly extend these, and then map the relational data to them. After the mapping is conceptually done, the semantician must know what combination of ETL and on-the-fly mapping fits the situation. This does mean that the semantician indeed must understand databases, which I above classified as an almost unreachable ideal. But there is no getting around this. Data is increasingly what makes the world go round. From this it follows that everybody must increasingly publish, consume, and refine, i.e., integrate. The anti-database attitude of the semantic web community simply has to go.

  • Design and implement workflows for content extraction, e.g., NLP or information extraction from images. This also means familiarity with NLP, desirably to the point of being able to tune the extraction rule sets of various NLP frameworks.

  • Design SOA workflows. The semantician should be able to extract and represent the semantics of business transactions and the data involved therein.

  • Lightweight knowledge engineering. The experience of building expert systems from the early days of AI is not the best possible, but with semantics attached to data, some sort of rules seems all but inevitable. The rule systems will merge into the DBMS in time. Some ability to work with these, short of making expert systems, will be desirable.

  • Understand information quality in the sense of trust, provenance, errors in the information, etc. If the world is run based on data analytics, then one must know what the data in the warehouse means, what accidental and deliberate errors it contains, etc.

Of course, most of these tasks take place at some sort of organizational crossroads or interface. This means that the semantician must have some project management skills; must be capable of effectively communicating with different publics and simply getting the job done, always in the face of organizational inertia and often in the face of active resistance from people who view the semantician as some kind of intruder on their turf.

Now, this is a tall order. The semantician will have to be reasonably versatile technically, reasonably clever, and a self-starter on top. The self-starter aspect is the hardest.

The semanticists I have met are more of the scholar than the IT consultant profile. I say semanticist for the semantic web research people and semantician for the practitioner we are trying to define.

We could start by taking people who already do data integration projects and educating them in some semantic technology. We are here talking about a different breed than the one that by nature gravitates to description logics and AI. Projecting semanticist interests or attributes on this public is a source of bias and error.

If we talk about a university curriculum, the part that cannot be taught is the leadership and self-starter aspect, or whatever makes a good IT consultant. Thus the semantic technology studies must be profiled so as to attract people with this profile. As quoted before, the dream job for each era is a scarce skill that makes value from something that is plentiful in the environment. At this moment and for a few moments to come, this is the data geek, or maybe even semantician profile, if we take data geek past statistics and traditional business intelligence skills.

The semantic tech community, especially the academic branch of it, needs to reinvent itself in order to rise to this occasion. The flavor of the dream job curriculum will be away from the theoretical computer science towards the hands-on of database, large systems performance, and the practicalities of getting data intensive projects delivered.

by Orri Erling (oerling@openlinksw.com) at April 05, 2010 03:21 PM

April 02, 2010

Orri Erling

Upcoming RDF Loader in Unclustered Virtuoso loads Uniprot at 279 Ktriples/s!

We recently heard that Oracle 11G loaded RDF faster than we did. Now, we never thought the speed of loading a database was as important as the speed of query results, but since this is the sole area where they have reportedly been tested as faster, we decided it was time loading was addressed. Indeed, without Oracle to challenge us on query performance, we would not be half as good as we are. So, spurred on by the Oracular influence, we did something about our RDF loading.

Performance, I have said before, is a matter of locality and parallelism. So we applied both to the otherwise quite boring exercise of loading RDF. The recipe is this: Take a large set of triples; resolve the IRIs and literals into their IDs; then insert each index of the triple table on its own thread. All the lookups and inserts are first sorted in key order to get the locality. Running the indices in parallel gets the parallelism. Then run the parser on its own thread, fetching chunks of consecutive triples and queueing them for a pool of loader threads. Then run several parsers concurrently on different files so as to make sure there is work enough at all times. Do not make many more process threads than available CPU threads, since they would just get in each other's way.
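The shape of this recipe can be shown in a few lines. What follows is a toy sketch in Python, not the Virtuoso loader: the two index orders, the in-memory dictionary, and the function names are all illustrative assumptions, and Python threads here show the structure rather than any real speed-up.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor

# Column orders for two illustrative indices over (s, p, o) triples.
# These are hypothetical and are not Virtuoso's actual index layout.
INDEX_ORDERS = {"spo": (0, 1, 2), "pos": (1, 2, 0)}

def resolve_ids(triples, dictionary):
    """Replace each IRI or literal with an integer ID, assigning new IDs as needed."""
    return [
        tuple(dictionary.setdefault(term, len(dictionary)) for term in triple)
        for triple in triples
    ]

def load_batch(triples, dictionary, indices):
    """Resolve one chunk of triples, then insert into each index on its own thread."""
    id_triples = resolve_ids(triples, dictionary)

    def insert(name):
        order = INDEX_ORDERS[name]
        # Sort the batch in this index's key order first: that is the locality.
        keys = sorted(tuple(t[i] for i in order) for t in id_triples)
        # Merge the sorted batch into the sorted index in one sequential pass.
        indices[name] = list(heapq.merge(indices[name], keys))

    # One thread per index: that is the parallelism. Each thread owns one index.
    with ThreadPoolExecutor(max_workers=len(INDEX_ORDERS)) as pool:
        list(pool.map(insert, INDEX_ORDERS))

dictionary = {}
indices = {name: [] for name in INDEX_ORDERS}
load_batch([("b", "knows", "a"), ("a", "knows", "b")], dictionary, indices)

print(dictionary)
print(indices)
```

In a fuller version, a parser thread per input file would produce the chunks and queue them for a pool of such loaders, with the total thread count held near the number of CPU threads.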

The whole process is non-transactional, starting from a checkpoint and ending with a checkpoint.

The test system was a dual-Xeon 5520 with 72G RAM. The Virtuoso was a single server; no cluster capability was used.

We loaded English DBpedia, 179M triples, in 15 minutes, for a rate of 198 Kt/s. Uniprot with 1.33 G triples loaded in 79 minutes, for 279 Kt/s.
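For anyone wishing to check the rates, the arithmetic is as follows. This is a small Python sketch using only the rounded triple counts and durations quoted above; the function name is made up for the example.

```python
def rate_ktriples_per_s(triples, minutes):
    """Average load rate in thousands of triples per second."""
    return triples / (minutes * 60) / 1000

dbpedia = rate_ktriples_per_s(179e6, 15)   # 179M triples in 15 minutes
uniprot = rate_ktriples_per_s(1.33e9, 79)  # 1.33G triples in 79 minutes

print(f"DBpedia: {dbpedia:.1f} Kt/s")
print(f"Uniprot: {uniprot:.1f} Kt/s")
```

The DBpedia figure comes out just under 199 Kt/s and the Uniprot figure a little over 280 Kt/s, which agrees with the quoted 198 and 279 to within what one would expect from rounded triple counts and whole-minute durations.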

The source files were the DBpedia 3.4 English files and the Bio2RDF copy of Uniprot, both in Turtle syntax. The uniref, uniparc and uniprot files from the Bio2RDF set were sliced into smaller chunks so as to have more files to load in parallel; the taxonomy file was loaded as is; and no other Bio2RDF files were loaded. Both experiments ran with 8 load streams, 1 per core. The CPU utilization was mostly between 1400% and 1500%, i.e., 14-15 of 16 CPU threads busy. Top load speed for a measurement window of 2 minutes was 383 Kt/s.

The index scheme for RDF quads was the default Virtuoso 6 configuration of 5 indices — GS, SP, OP, PSOG, and POGS. (We call this "3+2" indexing, because there are 3 partial and 2 full indices, delivering massive performance benefits over most other index schemes.) IRIs and literals reside in their own tables, each indexed from string to ID and vice versa. A full-text index on literals was not used.

Compared to previous performance, we have more than tripled our best single server multi-stream load speed, and multiplied our single stream load speed by a factor of 8. Some further gains may be reached by adjusting thread counts and matching vector sizes to CPU cache.

This will be available in a forthcoming release; this is not for download yet. Now that you know this, you may guess what we are doing with queries. More on this another time.

by Orri Erling (oerling@openlinksw.com) at April 02, 2010 02:15 PM

March 29, 2010

DBpedia Blog

OPEN POSITION: Move to Berlin, work on DBpedia (1 year full-time contract)

Hi all, 

the DBpedia Team at Freie Universität Berlin is looking for a developer/researcher who wants to contribute to the further development of the DBpedia information extraction framework, investigate approaches to annotate free-text with DBpedia URIs and participate in the various Linked Data efforts currently advanced by our team. 

Candidates should have

  • good programming skills in Java; Scala and PHP are also helpful.
  • a university degree preferably in computer science or information systems.  Previous knowledge of Semantic Web Technologies (RDF, SPARQL, Linked Data) and experience with information extraction and/or named entity recognition techniques are a plus. 

Contract start date: 15 May 2010
Duration: 1 year
Salary: around 40.000 Euro/year (German BAT IIa) 

You will be part of an innovative and cordial team and enjoy flexible work hours. After the year, chances are high that you will be able to choose between longer-term positions at Freie Universität Berlin and at Neofonie. Please contact chris@bizer.de via email by 15 April 2010 for additional details, and include information about your skills and experience in your mail.

The whole DBpedia team is very thankful to neofonie GmbH for contributing to the development of the DBpedia project by financing this position. neofonie is a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications. 

Cheers, 

Chris  

 
Prof. Dr. Christian Bizer
Web-based Systems Group
Freie Universität Berlin
+49 30 838 55509
http://www.bizer.de
chris@bizer.de

by ChrisBizer at March 29, 2010 11:54 AM

March 26, 2010

Displacement Activities (Tom Heath)

The demise of community.linkeddata.org

The issue of what happened to the community.linkeddata.org site came up in this thread on the public-lod mailing list. In the name of the public record I’m posting some of the messages I have related to this issue. I’ll try and get any gaps filled in in due course (let me know if there are specific gaps of interest to you and I’ll try to fill them in); in the meantime I’m keen to get the key bits online.

Some background is here:
http://lists.w3.org/Archives/Public/public-lod/2008Apr/0096.html



from    Michael Hausenblas <michael.hausenblas@d...>
to    Ted Thibodeau Jr <tthibodeau@o...>
cc    Kingsley Idehen <kidehen@o...>,Tom Heath <tom.heath@t...>
date    9 February 2009 18:27
subject    Re: "powered by" logos on linkeddata.org MediaWiki

MacTed,

I'll likely not invest time anymore in the Wiki [the MediaWiki instance at community.linkeddata.org - TH]. The plan is to transfer everything to Drupal. We had a lot of hassle with the Wiki configuration and community contribution was rather low. After the spam attack we decided to close it. It only contains few valuable things (glossary and iM maybe) ..

Do you have an account at linkeddata.org Drupal, yet? Otherwise, Tom, would you please be so kind?

Again, sorry for the delay ... it's LDOW-paper-write-up time :)

Cheers,
Michael


--
Dr. Michael Hausenblas
DERI - Digital Enterprise Research Institute
National University of Ireland, Lower Dangan,
Galway, Ireland, Europe
Tel. +353 91 495730



> From: Ted Thibodeau Jr <tthibodeau@o...>
> Date: Fri, 6 Feb 2009 16:22:31 -0500
> To: Michael Hausenblas <michael.hausenblas@d...>
> Cc: Kingsley Idehen <kidehen@o...>
> Subject: "powered by" logos on linkeddata.org MediaWiki
>
> Hi, Michael --
>
> re: <http://community.linkeddata.org/MediaWiki/index.php?Main_Page>
>
> It appears that the "Powered by Virtuoso" logo that was once alongside
> the
> "Powered by Mediawiki" logo (lower right of every page) has disappeared
> from the main page boilerplate.  Can that get re-added, please?
>
> Please use this logo --
>
>
> <http://boards.openlinksw.com/support/styles/prosilver/theme/images/virt_power
> _no_border.png
>>
>
> -- and make it href link to --
>
>     <http://virtuoso.openlinksw.com/>
>
> Please let me know if there's any difficulty with this.
>
> Thanks,
>
> Ted
>
>
> --
> A: Yes.                      http://www.guckes.net/faq/attribution.html
> | Q: Are you sure?
> | | A: Because it reverses the logical flow of conversation.
> | | | Q: Why is top posting frowned upon?
>
> Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
> Evangelism & Support         //        mailto:tthibodeau@o...
> OpenLink Software, Inc.      //              http://www.openlinksw.com/
>                                   http://www.openlinksw.com/weblogs/uda/
> OpenLink Blogs              http://www.openlinksw.com/weblogs/virtuoso/
>                                 http://www.openlinksw.com/blog/~kidehen/
>      Universal Data Access and Virtual Database Technology Providers




from Tom Heath
to Michael Hausenblas
date 9 March 2009 13:49
subject Re: http://linkeddata.org/domains?
mailed-by talisplatform.com

Hey Michael,

Re 2. great! I've created this node
and put it near the top of the primary navigation. You should be able
to write to that at will :)

Re 1. yes, good idea. I agree we should do this, just need to think
through the IA a little. Can you give me a day or so to chew this
over?

Cheers :)

Tom.




2009/3/7 Michael Hausenblas :
> Tom,
>
> As you may have gathered we're about to close down the 'old' community Wiki
> [1] and move over to [2]. There is not much active (and valuable) content at
> [1] and we had a lot of troubles with spammer (oh how I hate these ...).
>
> So, basically two things would be great:
>
> 1. I'd like to propose to add a sort of 'domain' or 'community' sub-space,
> such as http://linkeddata.org/domains where I can put our interlinking
> multimedia stuff [3] (and then change the redirect ;)
>
> 2. The second thing would be to find a place at [2] for the glossary [4] -
> seems quite helpful for people.
>
> Any thoughts?
>
>
> Cheers,
> Michael
>
> [1] http://community.linkeddata.org/MediaWiki/index.php?Main_Page
> [2] http://linkeddata.org/
> [3] http://www.interlinkingmultimedia.info/
> [4] http://community.linkeddata.org/MediaWiki/?Glossary
>
> --
> Dr. Michael Hausenblas
> DERI - Digital Enterprise Research Institute
> National University of Ireland, Lower Dangan,
> Galway, Ireland, Europe
> Tel. +353 91 495730
> http://sw-app.org/about.html
> http://webofdata.wordpress.com/



It’s quite hard to follow the indenting in the mail exchange below, so I’ve marked my contributions in bold.


from Tom Heath
to Kingsley Idehen
cc Michael Hausenblas
date 18 June 2009 17:07
subject Re: community.linkeddata.org
mailed-by talisplatform.com

Kingsley,

Also, what news of the previous instance?

Cheers,

Tom.

2009/6/18 Tom Heath :
> Hi Kingsley,
>
> Would the service you envisage at the subdomains you propose provide
> only a URI minting plus FOAF+SSL/OpenID service, or would other stuff
> also be available at that domain? If so, what?
>
> Tom.
>
>

> 2009/6/18 Kingsley Idehen :
>> Tom Heath wrote:
>>>
>>> Hi Kingsley,
>>>
>>> 2009/6/16 Kingsley Idehen :
>>>
>>>>
>>>> Tom Heath wrote:
>>>>
>>>>>
>>>>> Hi Kingsley,
>>>>>
>>>>> 2009/6/16 Kingsley Idehen :
>>>>>
>>>>>
>>>>>>
>>>>>> Tom Heath wrote:
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Hi Kingsley,
>>>>>>>
>>>>>>> According to our earlier discussions, this subdomain is deprecated in
>>>>>>> favour of the main site at linkeddata.org. If you'd a like a different
>>>>>>> subdomain for specific service just let me know.
>>>>>>>
>>>>>>> Cheers,
>>>>>>>
>>>>>>>
>>>>>>>

>>>>>>
>>>>>> What are the options?
>>>>>>
>>>>>>
>>>>>
>>>>> Guess that depends on the service you have in mind :) My goal is to
>>>>> avoid fragmentation of the presence at linkeddata.org and subdomains,
>>>>> so favour only creating new subdomains that do something highly
>>>>> specific and do not duplicate functionality or content available
>>>>> elsewhere.
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Tom.
>>>>>
>>>>>
>>>>>

>>>>
>>>> I am not quite understanding you.
>>>>
>>>> What would you see as the scheme for an instance of ODS that gives LOD
>>>> members URIs (of the FOAF+SSL variety)?
>>>>
>>>> Personally, I have no particular interest in pushing this with you per
>>>> se.
>>>> If you somehow deem this unimportant, no problem, I move on etc..
>>>>
>>>
>>> So the proposal is for another equivalent ODS instance, but one that
>>> adds FOAF+SSL support?
>>>
>>> If so, then this does sound important, as FOAF+SSL seems to have lots
>>> to offer. The problem I'm trying to address is as follows: the
>>> feedback I got from people about the previous offering at
>>> community.linkeddata.org was that it was confusing. People didn't
>>> understand what was going on or being offered, and the end result
>>> seemed to be further fragmentation of Linked Data coverage -
>>> particularly problematic for newbies. Therefore a very trimmed down
>>> service offering just personal URIs with FOAF+SSL support would seem
>>> to be of benefit, but I'm not sure of the value of replicating the
>>> previous offering with enhancements.
>>>
>>> Thoughts?
>>>
>>> Incidentally, the previous instance seems to have died. Can it be
>>> reinstated while we finish porting the content across?
>>>
>>> Cheers,
>>>
>>> Tom.
>>>
>>>

>>
>> Tom,
>>
>> Goal is to have a place for people to easily obtain personal URIs. In a way,
>> official LOD community Web IDs.
>> FOAF+SSL is the most important feature here and LOD should be a launch pad.
>>
>> Possible options:
>> yourid.linkeddata.org
>> webid.linkeddata.org
>> me.linkeddata.org
>>
>>
>> This is how it works:
>>
>> 1. New Users open accounts
>> 2. Edit profile
>> 3. Click a button that makes an X.509 certificate, exports to browser, and
>> writes to FOAF space
>> 4. Member visits any FOAF+SSL or OpenID space on the Web and never has to
>> present uid/pwd
>>
>> For existing members, they simply perform steps 3-4.
>>
>>
>> --
>>
>>
>> Regards,
>>
>> Kingsley Idehen Weblog: http://www.openlinksw.com/blog/~kidehen
>> President & CEO OpenLink Software Web: http://www.openlinksw.com
>>
>
> --
> Dr Tom Heath
> Researcher
> Platform Division
> Talis Information Ltd
> T: 0870 400 5000
> W: http://www.talis.com/
>
--
Dr Tom Heath
Researcher
Platform Division
Talis Information Ltd
T: 0870 400 5000
W: http://www.talis.com/



by Tom Heath at March 26, 2010 07:12 PM

March 15, 2010

Orri Erling

SemData@Sofia Roundtable write-up

Last week there was an invitation-only roundtable about semantic data management in Sofia, Bulgaria.

Lots of smart people together. The meeting was hosted by Ontotext and chaired by Dieter Fensel. On the database side we had Ontotext, SYSTAP (Bigdata), CWI (MonetDB), Karlsruhe Institute of Technology (YARS2/SWSE). LarKC was well represented, being our hosts, with STI, Ontotext, CYC, and VU Amsterdam. Notable absences were Oracle, Garlik, Franz, and Talis.

Now of semantic data management... What is the difference between a relational database and a semantic repository, a triple/quad store, a whatever-you-call-them?

Last fall I had a meeting at CWI with Martin Kersten, Peter Boncz and Lefteris Sidirourgos from CWI, and Frank van Harmelen and Spiros Kotoulas of VU Amsterdam, to start a dialogue between semanticists and databasers. Here we were with many more people trying to discover what the case might be. What are the differences?

Michael Stonebraker and Martin Kersten have basically said that what is sauce for the goose is sauce for the gander, and that there is no real difference between relational DB and RDF storage, except maybe for a little tuning in some data structures or parameters. Semantic repository implementors on the other hand say that when they tried putting triples inside an RDB it worked so poorly that they did everything from scratch. (It is a geekly penchant to do things from scratch, but then this is not always unjustified.)

OpenLink Software and Virtuoso are in agreement with both sides, contradictory as this might sound. We took our RDBMS and added data types and structures and cost model alterations to an existing platform. Oracle did the same. MonetDB considers doing this and time will tell the extent of their RDF-oriented alterations. Right now the estimate is that this will be small and not in the kernel.

I would say with confidence that without source code access to the RDB, RDF will not be particularly convenient or efficient to accommodate. With source access, we found that what serves RDB also serves RDF. For example, execution engine and data compression considerations are the same, with minimal tweaks for RDF's run time typing needs.

So now we are founding a platform for continuing this discussion. There will be workshops and calls for papers and the beginnings of a research community.

After the initial meeting at CWI, I tried to figure what the difference was between the databaser and semanticist minds. Really, the things are close but there is still a disconnect. Database is about big sets and semantics is about individuals, maybe. The databaser discovers that the operation on each member of the set is not always the same, and the semanticist discovers that the operation on each member of the set is often the same.

So the semanticist says that big joins take time. The databaser tells the semanticist not to repeat what's been obvious for 40 years and for which there is anything from partitioned hashes to merges to various vectored execution models. Not to mention columns.

Spiros of VU Amsterdam/LarKC says that map-reduce materializes inferential closure really fast. Lefteris of CWI says that while he is not a semantic person, he does not understand what the point of all this materializing is, nobody is asking the question, right? So why answer? I say that computing inferential closure is a semanticist tradition; this is just what they do. Atanas Kiryakov of Ontotext says that this is not just a tradition whose start and justification is in the forgotten mists of history, but actually a clear and present need; just look at all the joining you would need.

Michael Witbrock of CYC says that it is not about forward or backward inference on toy rule sets, but that both will be needed and on massively bigger rule sets at that. Further, there can be machine learning to direct the inference, doing the meta-reasoning merged with the reasoning itself.

I say that there is nothing wrong with materialization if it is guided by need, in the vein of memo-ization or cracking or recycling as is done in MonetDB. Do the work when it is needed, and do not do it again.
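
A rough way to picture need-guided materialization is a memoized closure: nothing is derived until a query asks for it, and each derived result is then kept for reuse. The sketch below is a toy Python illustration under that assumption; the class hierarchy, `SUBCLASS_OF`, `CALLS` and `superclasses` are made up for the example and say nothing about how Virtuoso or MonetDB actually implement this.

```python
from functools import lru_cache

# Hypothetical toy data: direct subclass edges (child -> parents).
SUBCLASS_OF = {
    "Poodle": ("Dog",),
    "Dog": ("Mammal",),
    "Mammal": ("Animal",),
    "Cat": ("Mammal",),
}

# Counts how often a closure is actually computed (cache misses only).
CALLS = {"count": 0}


@lru_cache(maxsize=None)
def superclasses(cls: str) -> frozenset:
    """Return every transitive superclass of cls, computed on first demand."""
    CALLS["count"] += 1
    result = set()
    for parent in SUBCLASS_OF.get(cls, ()):
        result.add(parent)
        # Recursive calls are memoized too, so shared ancestors are derived once.
        result |= superclasses(parent)
    return frozenset(result)
```

Asking for the superclasses of `Poodle` derives the closure for `Dog`, `Mammal` and `Animal` along the way; a later question about `Cat` only computes the one new entry and reuses the rest.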

Brian Thompson of Systap/Bigdata asks whether it is not a contradiction in terms to both want pluggability and merging inference into the data, like LarKC would be doing. I say that this is difficult but not impossible and that when you run joins in a cluster database, as you decide based on the data where the next join step will be, so it will be with inference. Right there, between join steps, integrated with whatever data partitioning logic you have, for partitioning you will have, data being bigger and bigger. And if you have reuse of intermediates and demand driven indexing à la MonetDB, this too integrates and applies to inference results.

So then, LarKC and CYC, can you picture a pluggable inference interface at this level of granularity? So far, I have received some more detail as to the needs of inference and database integration, essentially validating our previous intuitions and plans.

Aside from inference, we have the more immediate issue of creating an industry out of the semantic data management offerings of today.

What do we need for this? We need close-to-parity with relational — doing your warehouse in RDF with the attendant agility thereof can't cost 10x more to deploy than the equivalent relational solution.

We also want to tell the key-value, anti-SQL people, who throw away transactions and queries, that there is a better way. And for this, we need to improve our gig just a little bit. Then you have the union of some level of ACID, at least consistent read, availability, complex query, large scale.

And to do this, we need a benchmark. It needs a differentiation of online queries and browsing and analytics, graph algorithms and such. We are getting there. We will soon propose a social web benchmark for RDF which has both online and analytical aspects, a data generator, a test driver, and so on, with a TPC-style set of rules. If there is agreement on this, we will all get a few times faster. At this point, RDF will be a lot more competitive with mainstream and we will cross another qualitative threshold.

by Orri Erling (oerling@openlinksw.com) at March 15, 2010 02:46 PM

March 12, 2010

DBpedia Blog

Invitation to contribute to DBpedia by improving the infobox mappings + New Scala-based Extraction Framework

Hi all,

in order to extract high quality data from Wikipedia, the DBpedia extraction framework relies on infobox to ontology mappings which define how Wikipedia infobox templates are mapped to classes of the DBpedia ontology.

Up to now, these mappings were defined only by the DBpedia team and as Wikipedia is huge and contains lots of different infobox templates, we were only able to define mappings for a small subset of all Wikipedia infoboxes and also only managed to map a subset of the properties of these infoboxes.

In order to enable the DBpedia user community to contribute to improving the coverage and the quality of the mappings, we have set up a public wiki at http://mappings.dbpedia.org/index.php/Main_Page which contains:

1. all mappings that are currently used by the DBpedia extraction framework,
2. the definition of the DBpedia ontology, and
3. documentation for the DBpedia mapping language as well as step-by-step guides on how to extend and refine mappings and the ontology.

So if you are using DBpedia data and were always annoyed that DBpedia did not properly cover the infobox template that is most important to you, you are warmly invited to extend the mappings and the ontology in the wiki. Your edits will be used for the next DBpedia release, expected to be published in the first week of April.

The process of contributing to the ontology and the mappings is as follows:

1. You familiarize yourself with the DBpedia mapping language by reading the documentation in the wiki.
2. In order to prevent random spam, the wiki is read-only and new editors need to be confirmed by a member of the DBpedia team (currently Anja Jentzsch does the clearing). Therefore, please create an account in the wiki for yourself. After this, Anja will give you editing rights and you can edit the mappings as well as the ontology.
3. To contribute to the next DBpedia release, you can edit until Sunday, March 21. After this, we will check the mappings and the ontology definition in the wiki for consistency and then use both for the next DBpedia release.

So, we are starting a kind of social experiment on whether the DBpedia user community is willing to contribute to the improvement of DBpedia and on how the DBpedia ontology develops through community contributions :-)

Please excuse that it is currently still rather cumbersome to edit the mappings and the ontology. We are working on a visual editor for the mappings as well as a validation service, which will check edits to the mappings and test the new mappings against example pages from Wikipedia. We hope that we will be able to deploy these tools in the next two months, but still wanted to release the wiki as early as possible in order to already allow community contributions to the DBpedia 3.5 release.

If you have questions about the wiki and the mapping language, please ask them on the DBpedia mailing list where Anja and Robert will answer them.

What else is happening around DBpedia?

In order to speed up the data extraction process and to lay a solid foundation for the DBpedia Live extraction, we have ported the DBpedia extraction framework from PHP to Scala/Java. The new framework extracts exactly the same types of data from Wikipedia as the old framework, but now processes a single page in 13 milliseconds instead of 200 milliseconds. In addition, the new framework can extract data from tables within articles and can handle multiple infobox templates per article. The new framework is available under the GPL license in the DBpedia SVN and is documented at http://wiki.dbpedia.org/Documentation.

The whole DBpedia team is very thankful to two companies which enabled us to do all this by sponsoring the DBpedia project:

1. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
2.  Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/index.jsp).

Thank you a lot for your support!

I personally would also like to thank:

1.  Anja Jentzsch, Robert Isele, and Christopher Sahnwaldt for all their great work on implementing the new extraction framework and for setting up the mapping wiki.
2.  Andreas Lange and Sidney Bofah for correcting and extending the mappings in the Wiki.

Cheers,

Chris Bizer

by ChrisBizer at March 12, 2010 12:07 PM

March 10, 2010

Blog Data Space (Kingsley Idehen)

URIBurner: Painless Generation & Exploitation of Linked Data (Update 1 - Demo Links Added)

What is URIBurner?

A service from OpenLink Software, available at: http://uriburner.com, that enables anyone to generate structured descriptions -on the fly- for resources that are already published to HTTP based networks. These descriptions exist as hypermedia resource representations where links are used to identify:

  • the entity (data object or datum) being described,
  • each of its attributes, and
  • each of its attribute values (optionally).

The hypermedia resource representation outlined above is what is commonly known as an Entity-Attribute-Value (EAV) Graph. The use of generic HTTP scheme based Identifiers is what distinguishes this type of hypermedia resource from others.

Why is it Important?

The virtues (dual pronged serendipitous discovery) of publishing HTTP based Linked Data across public (World Wide Web) or private (Intranets and/or Extranets) networks are rapidly becoming clearer to everyone. That said, the nuance laced nature of Linked Data publishing presents significant challenges to most. Thus, for Linked Data to really blossom, the process of publishing needs to be simplified, i.e., "just click and go" (for human interaction) or REST-ful orchestration of HTTP CRUD (Create, Read, Update, Delete) operations between Client Applications and Linked Data Servers.

How Do I Use It?

In a similar vein to the role played by FeedBurner with regard to Atom and RSS feed generation during the early stages of the Blogosphere, URIBurner enables anyone to publish Linked Data bearing hypermedia resources on an HTTP network. Thus, its usage covers two profiles: Content Publisher and Content Consumer.

Content Publisher

The steps that follow cover all you need to do:

  • place a <link/> tag within your HTTP based hypermedia resource (e.g. within the <head> section for HTML)
  • use a URL via the @href attribute value to identify the location of the structured description of your resource, in this case it takes the form: http://linkeddata.uriburner.com/about/id/{scheme-or-protocol}/{your-hostname-or-authority}/{your-local-resource}
  • for human visibility you may consider associating a button (as you do with Atom and RSS) with the URL above.

That's it! The discoverability (SDQ) of your content has just multiplied significantly, its structured description is now part of the Linked Data Cloud with a reference back to your site (which is now a bona fide HTTP based Linked Data Space).

Examples

HTML+RDFa based representation of a structured resource description:

<link rel="describedby" title="Resource Description (HTML)" type="text/html" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

JSON based representation of a structured resource description:

<link rel="describedby" title="Resource Description (JSON)" type="application/json" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

N3 based representation of a structured resource description:

<link rel="describedby" title="Resource Description (N3)" type="text/n3" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

RDF/XML based representation of a structured resource description:

<link rel="describedby" title="Resource Description (RDF/XML)" type="application/rdf+xml" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>
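
For sites that generate their pages programmatically, the four tags above could be produced from the page's own URL. The following is a minimal Python sketch, assuming the URL pattern given in the publisher steps; `description_url` and `link_tags` are illustrative names, not part of any URIBurner API, and query strings and fragments are simply ignored here.

```python
from html import escape
from urllib.parse import urlsplit

# Prefix taken from the URL pattern described in the publisher steps.
BASE = "http://linkeddata.uriburner.com/about/id"

# (label, media type) pairs matching the four examples above.
FORMATS = [
    ("HTML", "text/html"),
    ("JSON", "application/json"),
    ("N3", "text/n3"),
    ("RDF/XML", "application/rdf+xml"),
]


def description_url(resource_url: str) -> str:
    """Map a resource URL to .../about/id/{scheme}/{authority}/{local-resource}."""
    parts = urlsplit(resource_url)
    local = parts.path.lstrip("/")  # simplification: query and fragment are dropped
    return f"{BASE}/{parts.scheme}/{parts.netloc}/{local}"


def link_tags(resource_url: str) -> list:
    """Return one <link rel="describedby"> tag per supported representation."""
    href = escape(description_url(resource_url), quote=True)
    return [
        f'<link rel="describedby" title="Resource Description ({label})" '
        f'type="{mime}" href="{href}"/>'
        for label, mime in FORMATS
    ]
```

Calling `link_tags("http://example.org/xyz.html")` yields the same four tags shown in the examples, ready to be placed in the page head.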

Content Consumer

As an end-user, obtaining a structured description of any resource published to an HTTP network boils down to the following steps:

  1. go to: http://uriburner.com
  2. drag the Page Metadata Bookmarklet link to your Browser's toolbar
  3. whenever you encounter a resource of interest (e.g. an HTML page) simply click on the Bookmarklet
  4. you will be presented with an HTML representation of a structured resource description (i.e., identifier of the entity being described, its attributes, and its attribute values will be clearly presented).

Examples

If you are a developer, you can simply perform an HTTP operation request (from your development environment of choice) using any of the URL patterns presented below:

HTML:

  • curl -I -H "Accept: text/html" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}

JSON:

  • curl -I -H "Accept: application/json" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/json/{scheme}/{authority}/{local-path}

Notation 3 (N3):

  • curl -I -H "Accept: text/n3" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/n3/{scheme}/{authority}/{local-path}
  • curl -I -H "Accept: text/turtle" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/ttl/{scheme}/{authority}/{local-path}

RDF/XML:

  • curl -I -H "Accept: application/rdf+xml" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/xml/{scheme}/{authority}/{local-path}
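
The same requests can be prepared from a script. Here is a small Python sketch that only builds the requests corresponding to the curl patterns above, using the standard library; `head_request`, `data_url` and the `DATA_SEGMENT` table are names invented for this illustration, and nothing is sent over the network unless you pass the request to `urllib.request.urlopen` yourself.

```python
from urllib.parse import urlsplit
from urllib.request import Request

HOST = "http://linkeddata.uriburner.com"

# Media type -> path segment, as used by the /about/data/ URLs listed above.
DATA_SEGMENT = {
    "application/json": "json",
    "text/n3": "n3",
    "text/turtle": "ttl",
    "application/rdf+xml": "xml",
}


def _suffix(resource_url: str) -> str:
    """Turn http://host/path into the {scheme}/{authority}/{local-path} suffix."""
    p = urlsplit(resource_url)
    return f"{p.scheme}/{p.netloc}/{p.path.lstrip('/')}"


def head_request(resource_url: str, accept: str) -> Request:
    """Rough equivalent of: curl -I -H "Accept: <accept>" .../about/id/..."""
    return Request(
        f"{HOST}/about/id/{_suffix(resource_url)}",
        headers={"Accept": accept},
        method="HEAD",
    )


def data_url(resource_url: str, accept: str) -> str:
    """Direct data URL for a media type, following the /about/data/ pattern."""
    return f"{HOST}/about/data/{DATA_SEGMENT[accept]}/{_suffix(resource_url)}"
```

The first function relies on content negotiation via the Accept header, while the second names the format in the URL itself, which is handy for clients that cannot set headers.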

Conclusion

URIBurner is a "deceptively simple" solution for cost-effective exploitation of HTTP based Linked Data meshes. It doesn't require any programming or customization en route to immediately realizing its virtues.

If you like what URIBurner offers, but prefer to leverage its capabilities within your domain -- such that resource description URLs reside in your domain, all you have to do is perform the following steps:

  1. download a copy of Virtuoso (for local desktop, workgroup, or data center installation) or
  2. instantiate Virtuoso via the Amazon EC2 Cloud
  3. enable the Sponger Middleware component via the RDF Mapper VAD package (which includes cartridges for over 30 different resource types)

When you install your own URIBurner instances, you also have the ability to perform customizations that increase resource description fidelity in line with your specific needs. All you need to do is develop a custom extractor cartridge and/or meta cartridge.

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 10, 2010 05:52 PM

March 06, 2010

Blog Data Space (Kingsley Idehen)

Meshups Demonstrating How SPARQL-GEO Enhances Linked Data Exploitation (Update 2)

Deceptively simple demonstrations of how Virtuoso's SPARQL-GEO extensions to SPARQL lay critical foundation for Geo Spatial solutions that seek to leverage the burgeoning Web of Linked Data.

Setup Information

SPARQL Endpoint: Linked Open Data Cache (8.5 Billion+ Quad Store which includes data from Geonames and the Linked GeoData Project Data Sets).

Live Linked Data Meshup Links:

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 06, 2010 10:43 PM

March 04, 2010

Blog Data Space (Kingsley Idehen)

Revisiting HTTP based Linked Data (Update 1 - Demo Video Links Added)

Motivation for this post arose from a series of Twitter exchanges between Tony Hirst and me, in relation to his blog post titled: So What Is It About Linked Data that Makes it Linked Data™?

At the end of the marathon session, it was clear to me that a blog post was required for future reference, at the very least :-)

What is Linked Data?

"Data Access by Reference" mechanism for Data Objects (or Entities) on HTTP networks. It enables you to Identify a Data Object and Access its structured Data Representation via a single Generic HTTP scheme based Identifier (HTTP URI). Data Object representation formats may vary; but in all cases, they are hypermedia oriented, fully structured, and negotiable within the context of a client-server message exchange.

Why is it Important?

Information makes the world tick!

Information doesn't exist without data to contextualize.

Information is inaccessible without a projection (presentation) medium.

All information (without exception, when produced by humans) is subjective. Thus, to truly maximize the innate heterogeneity of collective human intelligence, loose coupling of our information and associated data sources is imperative.

How is Linked Data Delivered?

Linked Data is exposed to HTTP networks (e.g. World Wide Web) via hypermedia resources bearing structured representations of data object descriptions. Remember, you have a single Identifier abstraction (generic HTTP URI) that embodies: Data Object Name and Data Representation Location (aka URL).

How are Linked Data Object Representations Structured?

A structured representation of data exists when an Entity (Datum), its Attributes, and its Attribute Values are clearly discernible. In the case of a Linked Data Object, structured descriptions take the form of a hypermedia based Entity-Attribute-Value (EAV) graph pictorial -- where each Entity, its Attributes, and its Attribute Values (optionally) are identified using Generic HTTP URIs.

Examples of structured data representation formats (content types) associated with Linked Data Objects include:

  • text/html
  • text/turtle
  • text/n3
  • application/json
  • application/rdf+xml
  • Others

How Do I Create Linked Data oriented Hypermedia Resources?

You mark up resources by expressing distinct entity-attribute-value statements (basically these are 3-tuple records) using a variety of notations:

  • (X)HTML+RDFa,
  • JSON,
  • Turtle,
  • N3,
  • TriX,
  • TriG,
  • RDF/XML, and
  • Others (for instance you can use Atom data format extensions to model EAV graph as per OData initiative from Microsoft).

You can achieve this task using any of the following approaches:

  • Notepad
  • WYSIWYG Editor
  • Transformation of Database Records via Middleware
  • Transformation of XML based Web Services output via Middleware
  • Transformation of other Hypermedia Resources via Middleware
  • Transformation of non Hypermedia Resources via Middleware
  • Use a platform that delivers all of the above.
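
To make the entity-attribute-value idea concrete, here is a minimal Python sketch that holds statements as 3-tuples and writes them out in an N-Triples style. The `example.org` identifiers are placeholders, `term` and `to_ntriples` are illustrative helpers, and the literal handling is deliberately simplistic (no language tags, datatypes, or newline escaping).

```python
def term(value) -> str:
    """Render a value: HTTP URIs in angle brackets, anything else as a plain literal."""
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return f"<{value}>"
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def to_ntriples(statements) -> str:
    """Serialize (entity, attribute, value) tuples, one statement per line."""
    return "\n".join(f"<{e}> <{a}> {term(v)} ." for e, a, v in statements)


# Placeholder data: one entity with a literal value and a link to another entity.
STATEMENTS = [
    ("http://example.org/id/alice", "http://xmlns.com/foaf/0.1/name", "Alice"),
    ("http://example.org/id/alice", "http://xmlns.com/foaf/0.1/knows",
     "http://example.org/id/bob"),
]
```

`print(to_ntriples(STATEMENTS))` emits one line per statement; the second statement shows an attribute value that is itself an HTTP URI, which is what turns a plain record into a link.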

Practical Examples of What Linked Data Objects Enable

  • Describe Who You Are, What You Offer, and What You Need via your structured profile, then leave your HTTP network to perform the REST (serendipitous discovery of relevant things)
  • Identify (via map overlay) all items of interest based on a 2km+ radius of my current location (this could include vendor offerings or services sought by existing or future customers)
  • Share the latest and greatest family photos with family members *only* without forcing them to signup for Yet Another Web 2.0 service or Social Network
  • No repetitive signup and username and password based login sequences per Web 2.0 or Mobile Application combo
  • Going beyond imprecise Keyword Search to the new frontier of Precision Find. Example: find Data Objects associated with the keyword "Tiger", while enabling the seeker to disambiguate across the "Who", "What", "Where", "When" dimensions (with negation capability)
  • Determine how two Data Objects are Connected - person to person, person to subject matter etc. (LinkedIn outside the walled garden)
  • Use any resource address (e.g. a blog or bookmark URL) as the conduit into a Data Object mesh that exposes all associated Entities and their social network relationships
  • Apply patterns (social dimensions) above to traditional enterprise data sources in combination (optionally) with external data without compromising security etc.

How Do OpenLink Software Products Enable Linked Data Exploitation?

Our data access middleware heritage (which spans 16+ years) has enabled us to assemble a rich portfolio of coherently integrated products that enable cost-effective evaluation and utilization of Linked Data, without writing a single line of code, or exposing you to the hidden, but extensive admin and configuration costs. Post installation, the benefits of Linked Data simply materialize (along the lines described above).

Our main Linked Data oriented products include:

  • OpenLink Data Explorer -- visualizes Linked Data or Linked Data transformed "on the fly" from hypermedia and non hypermedia data sources
  • URIBurner -- a "deceptively simple" solution that enables the generation of Linked Data "on the fly" from a broad collection of data sources and resource types
  • OpenLink Data Spaces -- a platform for enterprises and individuals that enhances distributed collaboration via Linked Data driven virtualization of data across its native and/or 3rd party content manager for: Blogs, Wikis, Shared Bookmarks, Discussion Forums, Social Networks etc
  • OpenLink Virtuoso -- a secure and high-performance native hybrid data server (Relational, RDF-Graph, Document models) that includes in-built Linked Data transformation middleware (aka. Sponger).

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 04, 2010 03:16 PM

February 18, 2010

Frederick Giasson

structWSF Web Services Tutorial

One thing that was hard to do with structWSF was explaining what structWSF is, and how users can interact with it. For most people, structWSF was abstracted behind conStruct and they didn’t know that each piece of conStruct functionality was bound to one or more queries to one or more structWSF instances.

This is the reason why we took the time to write a complete structWSF interaction tutorial. This tutorial explains what the general structWSF architecture is, and it describes a series of general interaction use cases. We hope that this tutorial will help developers and system implementers understand the capabilities of structWSF and how they can use it.

You can read the complete structWSF Web Services Tutorial here.

Additionally, we released a new version of structWSF, conStruct and the irJSON Parser, which are products of this tutorial.

by Fred at February 18, 2010 09:45 PM

February 01, 2010

Displacement Activities (Tom Heath)

Wash down the Apple tablet with a gulp of Kool Aid

I’m not in the least bit excited about the iPad, and it seems I’m not alone. The mood seems to have changed since before the launch, with countless tech journalists previously falling over themselves to declare tablets the next big thing. (Thankfully Rory Cellan-Jones from the BBC was more measured, focusing on personal projectors as a more exciting development). The mood since is considerably more downbeat, and I think more realistic.

I may be missing some crucial usage context that reveals the killer characteristics of the iPad, but I’ve tried really hard and still nothing. There are many obvious practical issues with the device:

  • it’s too big for a pocket, but not sufficiently more useful than an iPhone or an HTC Hero.
  • it’s about the same size as a compact laptop, but with less scope for comfortable rapid input.
  • it’s probably too big to cradle comfortably in my hand for prolonged periods, and sitting with one ankle on the other knee is not always practical.

The only scenarios I can conjure up where I could imagine using the device are:

  • showing people my holiday photos.
  • reviewing design proofs without needing to print them out.

Neither of these, or even both, are very compelling at all. TVs are getting good for viewing photos, by including e.g. an SD card slot, and rumours of the death of paper are greatly exaggerated.

Perhaps the most annoying thing about the scenarios used to promote the device is the one about the San Francisco to Tokyo flight, watching video all the way without running out of battery. Any airline with planes worth boarding has personal video screens. I don’t want to bring my own. I’d rather use the space to carry a decent pair of noise-canceling headphones, which I’m sure increase my enjoyment of onboard media far more than a little bit of extra screen real estate. The development I want to see is not a new device that I have to prop on the flimsy airline table, hold tight when we hit some turbulence, and stow away when my food arrives, but the capability to connect my own device to the in-built screen via USB or Bluetooth. Even a bare USB port with power but no connectivity would be a start, allowing me to run low-powered devices (that I already own) during long flights.

OK, so the flight reference is just a touchstone for how long the device can run without mains power, but I think it demonstrates a lack of grounding of the device in realistic scenarios.

Any new device has to have two key characteristics these days for me to get excited: interoperability and convergence. The iPad seems to have very little of either. You could argue that it offers some convergence between smartphones and e-readers, but that’s about as exciting as convergence between a smartphone and a wall clock.

I’m left wondering what the iPad is competing against. I’m guessing it’s paper, whether that’s in the form of a book, brochure, newspaper, restaurant menu or whatever. Unfortunately for Apple, paper is pretty well suited to each of these, especially when you introduce bath water, the risk of theft, or just ketchup, into the equation. Perhaps this is the end of electronic picture frames as a dedicated device? Probably about time. Maybe the iPad will make an excellent Spotify console for the living room. Who knows? Whatever happens I can’t see this becoming a mass-market product worthy of even a fraction of the hype.

Where I wish that Apple had expended their creative talent was in addressing the power issue. Not in making sure I could watch 10 hours of back to back video, but in enabling me to spend that energy in whatever way I choose, powering whichever device I choose. It drives me crazy that I carry several batteries around, and short of running my phone off my laptop via USB there is no interoperability between these power sources. If Apple could produce a universal power supply that was sleek, sexy, efficient and interoperable, then I would be interested. Sadly this doesn’t seem to be the way.

by Tom Heath at February 01, 2010 03:30 PM

January 29, 2010

DBTune Blog

BBC Semantic Web use-case

After a very long time writing it, we finally have a BBC Semantic Web use-case on the W3C website! It describes work we did around BBC Programmes, BBC Music, BBC Wildlife Finder and Search+. I hope it all makes a bit of sense :-) For a more detailed writeup about these issues, Patrick's Linked Data on the BBC are very good.

by Yves at January 29, 2010 03:37 PM

January 27, 2010

Frederick Giasson

Behind Oz’s Curtain

Benjamin Nowack, creator of ARC and Trice, wrote an interesting blog post about the place of Microformats and RDFa in the HTML 5 specification. I am not deep into the specification itself, and so may lack some historical context. However, the most interesting point in this article is not related to Microformats, RDFx or the new HTML 5 specification.

The point is that apparently, some people believe that it is RDF or nothing. This is not new, but is that true?

People (and particularly enterprises) want the benefits of structured data, not necessarily RDF. In fact, many people don’t know about RDF, or don’t understand RDF, or just don’t care about RDF. But, is it because you don’t know, understand or care about RDF that you cannot benefit from it? No, certainly not. And I think that is what Benjamin is talking about when he mentions things such as: “[...] to get RDF to the broader developer community”, “[...] here could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers”, “[...] they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata”. RDF can be incarnated in multiple bodies, but it is still RDF. I think that is what Benjamin was suggesting, and it is the path we took at Structured Dynamics.

We chose to use RDF behind Oz’s curtain. This means that at the core of any of our methodologies, systems and specifications, we use RDF. Why? Because it is the most flexible description framework available that helps us handle any other source of data. However, does that mean that we should push RDF in everybody’s face? Certainly not.

Our work with different enterprises from all kinds of domains has taught us that we have to look beyond RDF while still using it (as paradoxical as that may appear). For example, we developed structWSF and conStruct such that people can upload (and manage) their data in different formats while being able to export it in any of the other formats. At the core, these systems use RDF to manipulate all these different kinds of formats, but from the outside, users simply use the format they care about, already use, or have available in their workflow. These users benefit from RDF without knowing it, understanding it or caring about it. We don’t think RDF is for everyone, but everyone can benefit from RDF.

Another example of RDF behind Oz’s curtain is the irON description framework and its three serialization profiles: irJSON, irXML and commON that we developed. As stated in the Purpose section of this document, the goal was quite clear:

irON (instance record and Object Notation) is a abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON). The notation specification includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. Profiles and examples are also provided for each of the irXML, irJSON and commON serializations.

irON is premised on these considerations and observations:

  • RDF (Resource Description Framework) is a powerful canonical data model for data interoperability
  • However, most existing data is not written in RDF and many authors and publishers prefer other formats for various reasons
  • Many formats that are easier to author and read than RDF are variants of the attribute-value pair construct [2], which can readily be expressed as RDF, and
  • A common abstract notation for converting to RDF would also enable non-RDF formats to become somewhat interchangeable, thus allowing the strengths of each to be combined.

The irON notation and vocabulary is designed to allow the conceptual structure (“schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naive information exchange notation expressive enough to describe most any data entity.
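
The observation that attribute-value records can readily be expressed as RDF can be illustrated with a few lines of Python. This is only a sketch of the general idea, not the irON specification: the record layout, the `id` key, and the `base` and `vocab` prefixes are all assumptions made for the example.

```python
def record_to_triples(record: dict, base: str, vocab: str) -> list:
    """Expand one attribute-value record into (subject, predicate, object) triples.

    Assumptions for this sketch: the record carries its identifier under "id",
    every other key is an attribute name, and list values mean repeated attributes.
    """
    subject = base + record["id"]
    triples = []
    for attr, value in record.items():
        if attr == "id":
            continue
        values = value if isinstance(value, list) else [value]
        for v in values:
            triples.append((subject, vocab + attr, v))
    return triples


# A hypothetical record, as it might arrive from a JSON or CSV source.
RECORD = {"id": "doc1", "title": "A Report", "author": ["Ann", "Bob"]}
```

`record_to_triples(RECORD, "http://example.org/id/", "http://example.org/vocab#")` gives three triples: one for the title and one per author, which is the kind of staging step that lets a non-RDF format feed an RDF-based system.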

I think this is what Benjamin was talking about in his article, and the kind of mindset he was suggesting the RDF community adopt. At least this is the mindset we adopted at Structured Dynamics, and apparently it is the one Benjamin adopted for his own business. I am sure there are many other people and organizations out there that are adopting the same point of view regarding RDF and its role in the current data ecosystem.

by Fred at January 27, 2010 08:50 PM

January 14, 2010

DBTune Blog

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store hosting the dataset, so the data has been stale since the end of February. Update 2: All should be back to normal and in sync. Please comment on this post if you spot any issue, or general slowness.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling it means that the end-points did not have all the data, and that the data can get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content management system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them on our database. We have a small piece of Ruby/ActiveRecord software (that we call the Tapp) which handles this process.

I made a small experiment, converting our ActiveRecord objects to RDF and hooking an HTTP POST or an HTTP DELETE request to a 4store instance for each change we receive. This means that this 4store instance is kept in sync with upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. All of these resources are held within their own named graph, which means we have a very large number of graphs (about 5 million). It makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.
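The update pattern described above (one named graph per resource, replaced or removed whenever that resource changes) can be sketched roughly as follows. This is only an illustration in Python, not the Tapp's actual Ruby code: the store URL, the `/data/<graph>` path layout, the `Change` record and the choice of naming each graph after its resource URI are all assumptions made for the example, and no request is actually sent.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import quote
from urllib.request import Request

# Hypothetical location of the triple store's HTTP interface (assumption).
STORE_URL = "http://localhost:8000/data/"


@dataclass
class Change:
    """One upstream change. Hypothetical record type, not from the post."""
    resource_uri: str        # URI of the changed resource; also used as graph name
    rdf_xml: Optional[str]   # new RDF description, or None if the resource was deleted


def build_request(change: Change, store_url: str = STORE_URL) -> Request:
    """Map a change to the HTTP request that keeps the store in sync."""
    # Assumed layout: the percent-encoded graph name is appended to the store URL.
    graph_url = store_url + quote(change.resource_uri, safe="")
    if change.rdf_xml is None:
        # The resource disappeared upstream: drop its whole graph.
        return Request(graph_url, method="DELETE")
    # The resource was created or updated: push its full description.
    return Request(
        graph_url,
        data=change.rdf_xml.encode("utf-8"),
        headers={"Content-Type": "application/rdf+xml"},
        method="POST",
    )
```

Because every resource lives in its own graph, the sync code never has to work out which individual triples changed; it only needs to know which resource changed.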

This is still highly experimental though, and I already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes are missing, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I still haven't managed to identify the cause. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • Find all Eastenders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
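  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
  • All programmes that featured tracks by both of two given artists (identified by their BBC Music URIs):
PREFIX po: <http://purl.org/ontology/po/>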
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 . ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 . ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

by Yves at January 14, 2010 12:30 PM

January 11, 2010

Do What I Mean (Richard Cyganiak)

prefix.cc, MkII

prefix.cc is a website I made last February to ease a very common task in the life of RDF developers and SPARQL users: looking up namespace URIs. A short summary of what the site can do for you is available here.

The site was developed during a few weekends, and I haven’t touched the code since I first deployed it. Today I’m publishing the first serious update to the site. This post describes what’s new.

Reverse lookup. One of the most requested features is reverse lookup. You can now enter a URI of an RDF term into the query box on the start page, and the site will respond with the best prefix for contracting that URI into a QName. This functionality is also available as an API.
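To illustrate what a reverse lookup involves, here is a minimal Python sketch that contracts a URI into a QName by picking the longest matching namespace. How prefix.cc actually decides which prefix is "best" is not described here (votes presumably play a role), so the longest-match rule and the small `KNOWN_PREFIXES` table are assumptions for the example.

```python
from typing import Dict, Optional

# A tiny sample of prefix-to-namespace mappings (illustrative only).
KNOWN_PREFIXES: Dict[str, str] = {
    "foaf": "http://xmlns.com/foaf/0.1/",
    "rdfs": "http://www.w3.org/2000/01/rdf-schema#",
    "po": "http://purl.org/ontology/po/",
}


def contract(uri: str, prefixes: Dict[str, str] = KNOWN_PREFIXES) -> Optional[str]:
    """Return 'prefix:local' for the longest matching namespace, or None."""
    best_prefix = None
    best_namespace = ""
    for prefix, namespace in prefixes.items():
        # The namespace must be a proper prefix of the URI, leaving a local name.
        if uri.startswith(namespace) and len(uri) > len(namespace):
            if len(namespace) > len(best_namespace):
                best_prefix, best_namespace = prefix, namespace
    if best_prefix is None:
        return None
    return f"{best_prefix}:{uri[len(best_namespace):]}"
```

For example, `contract("http://xmlns.com/foaf/0.1/name")` yields `foaf:name`, while a URI in an unknown namespace yields `None`.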

Negative votes. The site has received a moderate amount of spam, mostly from pranksters who think it would be funny to propose their own homepage as a better expansion for the foaf prefix. I’ve mostly cleaned this up manually, but I think it would be better to equip the user community with tools to handle this.

The site has always had a voting mechanism, which I intended as a tiebreaker in cases where people have submitted different URIs for the same prefix, for example in the case of the dc prefix. Starting today, you can submit both positive and negative votes. If a URI receives a certain number of negative votes, it will no longer be shown.

New export formats. One of my favourite features is the ability to directly get output in various machine-readable syntaxes by composing an appropriate URI, such as http://prefix.cc/foaf.file.n3, which produces a declaration of the FOAF prefix in N3 format. I find this handy for copy-pasting into a text editor, but also for automating things.

A few formats have been added: vann produces an RDF/XML version of the namespace mapping in the VANN vocabulary (example). xmlns produces raw XML prefix declarations (example). go redirects to the namespace URI, so you can type http://prefix.cc/foaf.go into your browser bar as a shortcut for opening the FOAF specification. I’ve also added a table of all supported formats.
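For automation, the URI composition can be wrapped in two small helpers. This Python sketch only reproduces the two URL shapes quoted in the text (`foaf.file.n3` and `foaf.go`); that other formats follow the same `.file.` pattern is an assumption, and the function names are made up for the example.

```python
BASE_URL = "http://prefix.cc/"


def export_url(prefix: str, fmt: str) -> str:
    """URL of a prefix declaration in a given syntax, e.g. fmt='n3'."""
    return f"{BASE_URL}{prefix}.file.{fmt}"


def go_url(prefix: str) -> str:
    """URL that redirects to the namespace URI registered for the prefix."""
    return f"{BASE_URL}{prefix}.go"
```

Here `export_url("foaf", "n3")` gives the N3 example above, and `go_url("foaf")` gives the browser-bar shortcut.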

A side effect of the introduction of VANN support is that there is now a single VANN representation of all mappings known to the site.

Tweaks and fixes. Regular users will note a number of further small changes and bugfixes throughout the site. One notable fix is to the way namespace lookups are calculated for the list of popular prefixes. Ironically, most of the lookups actually came from web crawlers that followed the links in the list itself, making the list self-perpetuating. Also, the list featured the non-existent robots prefix, because many crawlers are looking for http://prefix.cc/robots.txt. These issues should now be fixed.

Internal changes. The site is developed in PHP, and started out as a quick weekend hack, so the initial code was a horrible mess that was hardly maintainable. I spent quite some time cleaning this up and refactoring the code into a much nicer structure that should be able to grow along with some of the additional features I’ve planned for the future. The codebase now totals some 1600 lines of PHP, CSS and JavaScript.

Hidden goodies: RDFa markup and feed of latest additions. Finally, I want to highlight some features that have existed all along, but are easily missed: First, many pages contain RDFa markup, so if you want to re-use any prefix.cc data in your own site or application, you most likely can. Second, there is an RSS feed of the latest additions to the prefix database, and it is a neat way of learning about new vocabularies and ontologies that show up around the Web.

Bugs, comments, suggestions? Any feedback is appreciated. I did a lot of refactoring without a test harness, so it’s quite likely that a few new bugs have crept in. If you notice anything, please let me know. Also, if there is anything that you would like to see in prefix.cc Mk III, please share!

by Richard Cyganiak at January 11, 2010 05:25 PM

November 21, 2009

Do What I Mean (Richard Cyganiak)

What’s in a name? And the Linked Data Police

So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s in public now, we can just as well see if we can get a useful discussion out of this.

First, I realize that Erik probably responded more to the tone of my email than to the content. It was an angry rant, and my tone misses the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else—before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.

The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.

This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?

Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.

Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.

But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-oriented Architecture? Is it just because those are already way past the peak of the hype cycle?

Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.

Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success is far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail at achieving the goal, but everyone designs (and names) assuming that it can be eventually achieved.

Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.

Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.

It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later change was a clarification, not a change of intention.

I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.

Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data into the mainstream more quickly, which is a goal that I share. The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s as meaningless as other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”

Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.

These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).

But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”; to the best of my knowledge it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the “words”. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”

If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim has done since 2006.

by Richard Cyganiak at November 21, 2009 07:34 PM

November 19, 2009

Displacement Activities (Tom Heath)

Putting a Conference into the Semantic Web

Chris Gutteridge asked this question about semantically enabling conference Web sites, which is a subject close to my heart. It’s hard to give a meaningful response in 140 characters, so I decided to get some headline thoughts down for posterity. If you want a fuller account of some first-hand experiences, then the following papers are a good place to start:

Top Five Tips for Semantic Web-enabling a Conference

1. Exploit Existing Workflows

Conferences are incredibly data-rich, but much of this richness is bound up in systems for e.g. paper submission, delegate registration, and scheduling, that aren’t native to the Semantic Web. Recognise this in advance and plan for how you intend to get the data from these systems out into the Web. The good news is that scripts now exist to handle dumps from submission systems such as EasyChair, but you may need to ensure that the conference instance of these systems is configured correctly for your needs. For example, getting dumps from these systems often comes at a price, and if you’re using one instance per track rather than the multi-track options, you may be in for a shock when you ask for the dumps. Speak to the Programme Chairs about this as soon as possible.

In my experience, delegate registration opens months in advance of a conference and often uses a proprietary, one-off system. As early as possible make contact with the person who will be developing and/or running this system, and agree how the registration system can be extended to collect data about the delegates and their affiliations, for example. Obviously there needs to be an opt-in process before this data is published on the public Web.

Collecting these types of data from existing workflows is monumentally easier than asking people to submit it later through some dedicated means. With this in mind, have modest expectations (in terms of degree of participation) for any system you hope to deploy for people to use before, during and after the conference, whether this is a personalised schedule planner, paper annotation system or rating system for local restaurants. People always have massive demands on their time, and especially at a conference, so any system that isn’t already part of a workflow they are engaged with is likely to get limited uptake.

2. Publish Data Early then Incrementally Improve

Perhaps your goal in publishing RDF data about your conference is simply to do the right thing by eating your own dog food and providing an archival record of the event in machine-readable form. This is fine, but ideally you want people to use the published data before and during the event, not just afterwards. In an ideal world, people will use the data you publish as a foundation for demos of their applications and services at the conference, as a means to enhance the event and also to promote their own work. To maximise the chances of this happening you need to make it clear in advance that you will be publishing this data, and give an indication of what the scope of this will be. The RDF available from previous events in the ESWC and ISWC series can give an impression of the shape of the data you will publish (assuming you follow the same modelling patterns), but get samples out early and basic structures in place so people have the chance to prepare. Better to incrementally enhance something than save it all up for a big bang just one week before the conference.

3. Attend to the details

Many of the recent ESWC and ISWC events have done a great job of publishing conference data, and have certainly streamlined the process considerably. However, along the way we’ve lost (or failed to attend to) some of the small but significant facts that relate to a conference, such as the location, venue, sponsors and keynote speakers. This stuff matters, and is the kind of data that probably doesn’t get recorded elsewhere. Obviously publishing data about the conference papers is important, but from an archival point of view this information is at least recorded by the publishers of the proceedings. The more tacit, historical knowledge about a conference series may be of great interest in the future, but is at risk of slipping away.

4. Piggy-back on Existing Infrastructure

As I discovered while coordinating the Semantic Web Technologies for ESWC2006, deploying event-specific services is simply making a rod for your own back. Who is going to ensure these stay alive after the event is over and everyone moves onto the next thing? The answer is probably no-one. The domain-registration will lapse, the server will get hacked or develop a fault, the person who once knew why that site mattered will take a job elsewhere, and the data will disappear in the process. Therefore it’s critical that every event uses infrastructure that is already embedded in everyday usage and also/therefore has a future. The best example of this is data.semanticweb.org, the de facto home for Linked Data from Web-related events. This service has support from SWSA, and enough buy-in from the community, to minimise the risk that it will ever go away. By all means host the data on the conference Web site if you must, but don’t dream of not mirroring it at data.semanticweb.org, with owl:sameAs links to equivalent URIs in that namespace for all entities in your data set.

5. Put Your Data in the Web

Remember that while putting your data on the Web for others to use is a great start, it’s going to be of greatest use to people if it’s also *in* the Web. This is a frequently overlooked distinction, but it really matters. No one in their right mind would dream of having a Web site with no incoming or outgoing links, and the same applies to data. Wherever possible the entities in your data set need to be linked to related entities in other data sets. This could be as simple as linking the conference venue to the town in which it is located, where the URI for the town comes from Geonames. Linking in this way ensures that consumers of the data can discover related information, and avoids you having to publish redundant information that already exists somewhere else on the Web. The really great news is that data.semanticweb.org already provides URIs for many people who have published in the Semantic Web field, and (aside from some complexities with special characters in names) linking to these really can be achieved in one line of code. When it’s this easy there really are no excuses.
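As a rough illustration of how little code such a link needs, the following Python sketch emits one owl:sameAs statement in N-Triples. The person URI pattern under data.semanticweb.org (lower-cased name with hyphens) and the local conference URI are assumptions made for the example; check the actual URIs before publishing, and note that names with special characters need more care than this.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"
# Assumed URI pattern for people at data.semanticweb.org (verify before use).
PERSON_BASE = "http://data.semanticweb.org/person/"


def same_as_triple(local_uri: str, full_name: str) -> str:
    """Link a locally minted person URI to its assumed counterpart."""
    # Naive slug: lower-case words joined by hyphens; no special-character handling.
    slug = "-".join(full_name.lower().split())
    return f"<{local_uri}> <{OWL_SAME_AS}> <{PERSON_BASE}{slug}> ."
```

Calling `same_as_triple("http://example.org/conf/people/1", "Tom Heath")` produces a single N-Triples line that can be appended to the conference dump.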

Conclusions

Reading the above points back before I hit publish, I realise they focus on Semantic Web-enabling the conference as a whole, rather than specifically the conference Web site, which was the focus of Chris’s original question. I think we know a decent amount about publishing Linked Data on the Web, so hopefully these tips usefully address the more process-oriented than technical aspects.

by Tom Heath at November 19, 2009 12:42 PM

November 12, 2009

Project squin

Experiences developing a simple Linked Data based application

One of the characteristics that make the Linked Data community so special is the pragmatic “let’s do it” attitude. The Linked Data-a-thon co-located with this year’s International Semantic Web Conference (ISWC) revealed this attitude once more. The main idea of the Linked Data-a-thon was to encourage conference attendees to develop a “quick and dirty” Linked Data based application during the first days of the conference; the aim of this challenge was to showcase that it is possible to develop simple but innovative Linked Data applications with very little effort. The outcome of this event was eight amazing applications which exceeded all expectations. Nonetheless, participating in this challenge also revealed some unexpected difficulties and reminded us of some of the issues that still exist. In the following we describe our experiences.

The Idea
After being urged by Juan, the main organizer of the Linked Data-a-thon, we seriously thought about participating. Coming up with an idea for a cool application that could also be developed in only a few hours turned out to be a minor challenge. However, we remembered one of the demo queries of our SQUIN service. This query asks for traditional Chinese medicine as an alternative to the western drug Varenicline. Answering this query requires data from at least three different linked datasets provided by the Linking Open Drug Data (LODD) project. Hence, this query demonstrates the added value of interlinking data from multiple sources on the Web. Furthermore, the query gives a glimpse of how ordinary people may benefit from openly available data. For this reason we agreed to build a simple application around this query and let users vary the drug for which the alternatives are to be found. Since answering only one type of question is quite boring, we decided to add some functionality that enables users to drill a bit deeper into the available data. Based on our knowledge of the LODD datasets we came up with the idea of adding value to the search results by allowing users to inspect possible side effects of the alternative medicines. This valuable functionality would require the evaluation of data from at least two additional datasets and, by using SQUIN, it would be realizable with only one additional type of SPARQL query.

The Queries
As outlined, our application is based on two types of SPARQL queries which we execute with SQUIN. Each of these types corresponds to a template in which a placeholder has to be substituted by a URI in order to create an actual SPARQL query that can then be executed over the Web of Linked Data using a SQUIN service. The query template for the alternative medicine queries was easy to create; we only had to generalize the aforementioned SQUIN demo query by replacing the Varenicline URI with the placeholder. Creating the side effects query was a bit more difficult. It required three things: 1/ domain-specific knowledge about the application, in order to conceive how a medicine can be related to its side effects; 2/ good knowledge of the collection of datasets, i.e., knowing which dataset contains what information and through which properties one dataset is connected to another. Such knowledge is essential to define a valid SPARQL query; a Linked Data browser such as the one powered by Pubby was very helpful here; 3/ a mechanism to debug the query, and sufficient logs about the query execution, to understand what has gone wrong, e.g., the predicates being in the wrong order or the data URIs being partially updated.

The General Operation
To instantiate our query templates the placeholder in each of them has to be substituted by a URI. For the alternative medicine query this URI would identify the western drug specified by the user; for the side effects query the URI would identify the Chinese medicine. Getting the medicine URIs wouldn’t be a problem because they are part of the results determined for the alternative medicine queries. Obtaining the western drug URIs was a bit more difficult because we wanted to provide our users with a convenient interface in which they only have to specify the drugs by their names. For this reason we added another type of query to look up drug URIs based on drug names provided by the users. These queries use the following template which contains a filter clause with a regular expression:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
SELECT ?uri ?label WHERE {
?uri a drugbank:drugs .
?uri rdfs:label ?label .
FILTER( regex(STR(?label),"[DRUGNAME]","i") )
}
ORDER BY ?uri

The placeholder [DRUGNAME] in this template is replaced by the drug name provided by the user. Note that queries of this type will not yield any results when executed with SQUIN; they do not contain a seed URI linking to an RDF graph with triples that match any of the patterns in the query. Therefore, we send these look-up queries directly to the SPARQL endpoint of the drugbank dataset that describes western drugs and their names.
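The placeholder substitution could be sketched as follows. This is a minimal Python illustration rather than the code of the application itself: since the drug name comes from the user and ends up inside both a regular expression and a quoted string, the sketch escapes it for both contexts first. The helper names are made up, and the set of escaped characters is an assumption about which characters are special in the regex dialect the endpoint uses.

```python
# The look-up template from the text, with its [DRUGNAME] placeholder.
TEMPLATE = """PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
SELECT ?uri ?label WHERE {
?uri a drugbank:drugs .
?uri rdfs:label ?label .
FILTER( regex(STR(?label),"[DRUGNAME]","i") )
}
ORDER BY ?uri"""

# Characters treated as regex metacharacters (assumed set).
REGEX_SPECIAL = set("\\|.?*+(){}[]^$-")


def escape_regex(text: str) -> str:
    """Backslash-escape regex metacharacters so the name matches literally."""
    return "".join("\\" + ch if ch in REGEX_SPECIAL else ch for ch in text)


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for a quoted SPARQL string."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def build_lookup_query(drug_name: str) -> str:
    """Instantiate the template with a user-supplied drug name."""
    safe_name = escape_literal(escape_regex(drug_name))
    return TEMPLATE.replace("[DRUGNAME]", safe_name)
```

The order of the two escaping steps matters: the regex escaping introduces backslashes, which then have to be doubled so that they survive inside the string literal. Line breaks in the input are not handled by this sketch.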

From the result of the look-up query we initiate the alternative medicine query. If the look-up query yields multiple results we allow the user to choose one of them. The alternative medicine query is then executed over the Web of Linked Data using a SQUIN service. The results of this query execution are visualized, including an option to show side effects of each medicine. When the user selects one of these options our app initializes the template of the side effects query with the corresponding medicine URI, sends this query to the SQUIN service, and visualizes the results after they are returned by SQUIN.

Performance Discussion
Executing our queries over the Web using SQUIN takes far too long for an interactive user interface (1.30 min in the average case). However, we used SQUIN to demonstrate that it is able to answer these queries. Another advantage of this solution is that SQUIN provides the most up-to-date results and immediately includes data from new sources once they are linked from the LODD datasets. To improve the user experience we enabled the query result cache in SQUIN so that results for queries that had been issued before are available immediately. Unfortunately, this result cache sacrifices up-to-date results. To avoid this problem we aim to extend SQUIN with a suitable caching strategy that includes a reasonable invalidation policy.
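One simple form such an invalidation policy could take is time-based expiry: a cached result is reused only while it is younger than some maximum age. The Python sketch below illustrates that idea only; it is not how SQUIN's cache works, and the class name, the `execute` callback and the injectable clock are assumptions made for the example.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLQueryCache:
    """Cache query results and re-execute a query once its entry is too old."""

    def __init__(self, execute: Callable[[str], Any], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._execute = execute      # function that really runs the query
        self._ttl = ttl_seconds      # maximum age of a cached result
        self._clock = clock          # injectable for testing
        self._entries: Dict[str, Tuple[float, Any]] = {}

    def query(self, sparql: str) -> Any:
        now = self._clock()
        entry = self._entries.get(sparql)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # still fresh: answer from the cache
        result = self._execute(sparql)   # missing or stale: run it again
        self._entries[sparql] = (now, result)
        return result
```

The trade-off is explicit in the single `ttl_seconds` parameter: a larger value gives faster answers more often, a smaller one gives fresher data.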

The query execution strategy implemented in SQUIN is not solely to blame for the long execution times. The majority of the LODD datasets accessed during the query executions (4 out of 5) are published on the same machine. Hence, each query execution puts such a heavy load on this machine that it becomes a bottleneck for the performance of our application.

Furthermore, it is not only the queries executed with SQUIN that make the user interface feel slow. For our look-up queries we also often had the problem that the response time of the drugbank SPARQL endpoint was quite long (a few seconds) for a query service over a single dataset. We assume this problem is caused by the technology with which drugbank is published: the drugbank SPARQL endpoint operates on a virtual RDF graph that is created on-the-fly from a relational database; the mappings are not able to express the filter clause in our look-up query and, thus, the regular expression cannot be pushed-down to the database query engine; instead, the mapper has to materialize the RDF data for all drugs and the SPARQL engine has to filter afterwards.

Development Effort
Apart from the performance problems that slowed down testing, developing our application was very easy once the query templates were created. Because we integrated SQUIN, we did not have to develop any program logic to query the Web. All the time we spent coding went into the implementation of the workflow and of the AJAX user interface, i.e., everything that is also necessary for an ordinary Web application. In the end it took us about 7 hours to finish the application and we are convinced we could have been much faster had we been more experienced in JavaScript development.

Conclusion
You may try the final outcome of our work here. Because we integrated SQUIN into our application, issuing queries over up-to-date data on the Web required no implementation work. While the development effort is comparable to that of a usual Web application, we can easily benefit from the availability of open data from multiple sources on the Web.

Unfortunately, the user interface does not feel very responsive due to different issues raised in this text. The performance problems that are caused by SQUIN are on our todo list. In particular, in the coming months we want to improve overall query execution times by the implementation of smart caching strategies. During the development we also discovered some performance problems caused by the infrastructure that runs today’s Web of Linked Data. These issues of responsiveness and reliability of Linked Data servers have to be addressed by the community.

Jun and Olaf

by Olaf Hartig at November 12, 2009 11:01 AM

October 31, 2009

Project squin

SQUIN at ISWC 2009

For the last two weeks (Oct. 17 to Oct. 31) I was in the US. During my trip I presented SWClLib, SQUIN, and the general idea of link traversal based query execution as implemented in SWClLib on several occasions.

The first week I visited my friend Juan Sequeda in Austin, Texas. Austin is a great city in which I noticed a healthy dominance of students. Juan took me to dozens of places and I got to know many great people. In particular, I loved the Monday night on which we went from pub to pub on Sixth Street; all of them had a band playing live music. Our evening at the Ghostbusters quote-along in the Drafthouse movie theater is also unforgettable. If you have the chance to spend a few days in Austin I advise you to take this opportunity.
During the week Juan organized a meeting of the Austin Semantic Web meet-up; here I gave the first talk during my trip. I introduced the idea of Linked Data, presented our link traversal based query execution approach, and outlined how users benefit from Linked Data by showcasing our Researchers Map application. The talk was well received by the audience who mainly consisted of several experienced AI guys and entrepreneurs.
36505894 On the following day I had the chance to present my research at the University of Texas (UT). I decided to present link traversal based query execution and to discuss the iterator-based implementation as I would do in my ISWC talk the next week. The computer science departments at UT have a strong focus on theory. For this reason I expected very difficult questions and, thus, was quite nervous. In the end the majority of the people who attended my talk were grad and undergrad students, mainly from Juan’s database and bio-informatics focused research group, and the talk turned more into a lecture. I’d the feeling the students were very eager to learn about RDF, SPARQL, and my ideas to query the Web of Linked Data; I hope my talk inspired some of them.

From Austin I flew to Washington DC to attend the 8th International Semantic Web Conference during the second week of my trip. The conference venue was far outside of the city, “in the middle of nowhere” as several people put it. I still don’t understand the reason for organizing a conference in such a remote place. However, the conference itself was great; familiar faces, interesting conversations, and many new contacts.

On the first day we (Juan, Patrick Sinclair, Jamie Taylor, and I) gave a tutorial on “how to consume Linked Data”. Even though the audience was fairly diverse (from real beginners to more experienced practitioners), we satisfied their expectations quite well (at least that’s what Ivan Herman told us ;-). With a discussion on “Querying Linked Data with SPARQL” (slides) I presented one of the more technical parts. Of course, I also introduced the idea of link traversal based query execution and advertised SWClLib and SQUIN here.

A novel idea for this conference was the Linked Data-a-thon, which aimed to show that it is possible to develop simple but innovative Linked Data applications with very little effort. This event resulted in the development of eight amazing applications, two of which made use of SQUIN: a search engine for traditional Chinese medicine and a recent posts section in FOAF letter. The latter was developed by Matthias Quasthoff, who also shared his experiences of developing his Linked Data-a-thon submission. One of Matthias’ conclusions was that it turned “out again that SQUIN and LOD work nicely together if links from FOAF profiles to distributed SIOC user accounts exist.” The other SQUIN-based application was the submission of Jun Zhao and me. Our app allows you to find traditional Chinese medicines as alternatives to western drugs and to inspect the possible side effects of the alternative medicines found. For the implementation we used SQUIN to execute queries over multiple interlinked datasets provided by the Linking Open Drug Data (LODD) project. I will write a separate post in which we describe our experiences, as Matthias did. The other six Linked Data-a-thon submissions are also great examples of what can be done thanks to Linked Data. Hence, this event was a huge success! It was even sponsored with prizes. Thanks to Juan for the wonderful organization of this extraordinary event and to Jamie who did a great job presenting the submissions to the ISWC audience.

The presentation of my paper was scheduled for the fourth day of the conference, and the room was packed! People were standing in the back and in the doorway. As probably everyone in the audience realized, I was very nervous. Nonetheless, apart from a few stumbles I managed to present the idea of link traversal based query execution to the audience. After the talk and during the remaining days many people came to talk about the approach; all of them loved the general idea, and many expressed their interest in trying SQUIN for their applications.

All in all, these two weeks were very exciting and they helped to spread the word about SQUIN.

Olaf
(written at Washington Dulles Airport)

by Olaf Hartig at October 31, 2009 11:00 PM

October 30, 2009

Do What I Mean (Richard Cyganiak)

Linked data at the New York Times: Exciting, but buggy

Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

The site’s URI scheme follows one of the Cool URIs recipes: The identifiers above are resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense, it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

This is not data about Michelle Obama the person, it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting a rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama> .

<http://data.nytimes.com/N13941567618952269073.rdf>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.

Distinguishing URIs and literals. Here are some selected snippets from the RDF/XML output:

    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
    <cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The values of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
    <cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s basically like making links “clickable” in HTML by putting them into an <a href="…"> tag: RDF clients will not recognize URIs if they are encoded as literals, and will not know that they can treat them as links that can be followed.
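To make the difference concrete, here is a minimal sketch in Python (standard library only) of how a client might tell the two encodings apart. The ex: namespace, the property names and the classify_values helper are made up for illustration, and the parsing is deliberately simplified: it only looks at flat property elements and ignores the rest of the RDF/XML syntax.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def classify_values(rdfxml: str) -> dict:
    """Map each property's local name to ('uri', value) or ('literal', value).

    Simplified on purpose: assumes rdf:Description elements whose children
    are flat property elements (no nested nodes, no rdf:parseType).
    """
    root = ET.fromstring(rdfxml)
    result = {}
    for description in root:
        for prop in description:
            local_name = prop.tag.split("}")[-1]
            resource = prop.get(f"{{{RDF_NS}}}resource")
            if resource is not None:
                # rdf:resource marks the value as a URI, i.e. a followable link
                result[local_name] = ("uri", resource)
            else:
                # plain element text is just a string literal
                result[local_name] = ("literal", (prop.text or "").strip())
    return result


# Hypothetical sample: the same address, once as a literal and once as a link.
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="http://example.org/ns#">
  <rdf:Description rdf:about="http://example.org/thing">
    <ex:asLiteral>http://example.org/page.html</ex:asLiteral>
    <ex:asLink rdf:resource="http://example.org/page.html"/>
  </rdf:Description>
</rdf:RDF>"""

values = classify_values(SAMPLE)
print(values["asLiteral"])  # ('literal', 'http://example.org/page.html')
print(values["asLink"])     # ('uri', 'http://example.org/page.html')
```

A link-following client would only ever dereference the second value; the first one looks like a URI to a human but is just text to the client.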

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”, they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives—sending HTML or sending RDF—are exclusive.
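For illustration, here is a minimal sketch in Python of a negotiation routine that always picks exactly one variant. It is not the Times' code; the function names, the simplified Accept parsing and the tie-breaking policy (RDF wins when a client rates both types equally) are my assumptions.

```python
def parse_accept(header: str) -> dict:
    """Parse an Accept header into {media_type: q}.

    Simplified: only the q parameter is honoured, everything else is ignored.
    """
    prefs = {}
    for part in header.split(","):
        fields = [field.strip() for field in part.split(";")]
        media_type = fields[0].lower()
        if not media_type:
            continue
        q = 1.0
        for param in fields[1:]:
            if param.startswith("q="):
                try:
                    q = float(param[2:])
                except ValueError:
                    q = 0.0
        prefs[media_type] = q
    return prefs


# Server-side order: RDF first, so that ties between the two go to RDF.
VARIANTS = [("application/rdf+xml", ".rdf"), ("text/html", ".html")]


def choose_suffix(accept_header: str) -> str:
    """Return exactly one suffix for the 303 redirect, never a combination."""
    prefs = parse_accept(accept_header)
    best_suffix, best_q = ".html", 0.0  # fall back to HTML if nothing matches
    for media_type, suffix in VARIANTS:
        q = prefs.get(media_type, prefs.get("*/*", 0.0))
        if q > best_q:  # strict comparison keeps the earlier variant on ties
            best_suffix, best_q = suffix, q
    return best_suffix


for header in ("text/html",
               "application/rdf+xml",
               "text/html,application/rdf+xml"):
    print(header, "->", choose_suffix(header))
```

Because the function returns a single suffix, a combined .rdf.html location cannot occur. A typical browser header that rates text/html above everything else still ends up at the HTML page.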

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.

by Richard Cyganiak at October 30, 2009 09:19 PM

October 27, 2009

DBTune Blog

Music recommendation and Linked Data

Yesterday at ISMIR we presented a tutorial about Linked Data for music-related information. More information on the tutorial is available on the tutorial website, and the slides are also available.

In particular, we had two sets of slides dealing with the relationship between music recommendation and linked data. As this is something we're investigating within the NoTube project, I thought I would write up a bit more about it.

Let's focus on artist to artist recommendation for now. If we look at last.fm for recommendations for New Order, here is what we get.

Artists similar to New Order, from last.fm

Similarly, using the Echonest API for similar artists, we get back an ordered list of artists similar to New Order, including Orchestral Manoeuvres in the Dark, Depeche Mode, etc.

Now, let's play word associations for a few bands and musical genres. My colleague Michael Smethurst took the Sex Pistols, Acid House and Public Enemy, and drew the following associations:

Sex Pistols associated words

Acid House word associations

Public Enemy word associations

We can see that among the different terms in these diagrams, some refer to people, to TV programmes, to fashion styles, to drugs, to music hardware, to places, to laws, to political groups, to record labels, etc. Just a couple of these terms are actually other bands or tracks. If you were to describe these artists just in musical terms, you'd probably be missing the point. And all these things are also linked to each other: you could play word associations for any of them and see what the connections are between Public Enemy and the Sex Pistols. So how does that relate to recommendations? When recommending an artist from another artist, the context is key. You need to provide an explanation of why they actually relate to each other, whether it's through common members, drugs, belonging to the same independent record label, being acoustically similar (if so, how exactly), etc. The main hypothesis here is that users are much more likely to accept a recommendation that is explicitly backed by some contextual information.

On the BBC website, we cover quite a few domains, and we try to create as many links as possible between these domains, by following the Linked Data principles. From our BBC Music site, we can explore much more information, from other BBC content (programmes, news etc.) to other Linked Data sources, e.g. DBpedia, Freebase and Musicbrainz. This provides us with a wealth of structured information that we would ultimately want to use for driving and backing up our recommendations.

The MusicBore I've described earlier on this blog uses much the same approach. Playlists are generated by following paths in Linked Data. Each artist is introduced by a sentence generated from the path leading from the seed artist to the target artist. The prototype described in that paper from the SDOW workshop last year also illustrates that approach.

So we developed a small prototype of this kind of idea, rqommend (and when I say small, it is very small :) ). Basically, we define "relatedness rules" in the form of SPARQL queries, like "Two artists born in Detroit in the 60s are related". We could go for very general rules, e.g. "Any path between two artists makes them related", but it would be very hard to generate an accurate textual explanation for it, and it might give some, hem, not very interesting connections. Then, we just go through these rules on an aggregation of Linked Data, and generate recommendations from them. Here is a greasemonkey script injecting such recommendations into BBC Music (see for example the Fugazi page). It injects Linked Data based recommendations, along with the associated explanation, within BBC artist pages. For example, for New Order:

BBC Music recs for New Order
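To give a rough idea of what "relatedness rules" amount to, here is a minimal sketch in Python. It is not the rqommend code: the real rules are SPARQL queries run over aggregated Linked Data, whereas this sketch uses a plain function over a handful of made-up facts. All artist identifiers and facts below are hypothetical placeholders.

```python
from itertools import combinations

# Hypothetical facts: (artist, birthplace, birth year). Not real data.
ARTISTS = [
    ("artist:A", "Detroit", 1962),
    ("artist:B", "Detroit", 1967),
    ("artist:C", "Detroit", 1975),
    ("artist:D", "Chicago", 1964),
]


def born_in_detroit_in_the_60s(a, b):
    """Relatedness rule: return an explanation if the rule applies, else None."""
    (name_a, place_a, year_a), (name_b, place_b, year_b) = a, b
    same_place = place_a == place_b == "Detroit"
    same_decade = all(1960 <= year <= 1969 for year in (year_a, year_b))
    if same_place and same_decade:
        return f"{name_a} and {name_b} were both born in Detroit in the 60s"
    return None


RULES = [born_in_detroit_in_the_60s]


def recommend(artists, rules):
    """Apply every rule to every pair of artists; keep pair plus explanation."""
    recommendations = []
    for a, b in combinations(artists, 2):
        for rule in rules:
            explanation = rule(a, b)
            if explanation:
                recommendations.append((a[0], b[0], explanation))
    return recommendations


for source, target, why in recommend(ARTISTS, RULES):
    print(f"{source} -> {target}: {why}")
```

The point of keeping rules this specific is that each one carries its own explanation, so every recommendation comes with the "why" attached.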

To conclude, I think there is a really strong influence of traditional information retrieval systems on the music information retrieval community. But what makes Google, for example, particularly successful is that it exploits links, not the documents themselves. We definitely need to go towards the same sort of model: exploiting the links surrounding music, and all the cross-domain information that makes it so rich, to create better music recommendation systems which combine what is recommended with why it is recommended.

by Yves at October 27, 2009 02:43 AM

August 18, 2009

Do What I Mean (Richard Cyganiak)

Multiple Java versions on OS X, and their paths

Java version management on OS X is a wee bit complicated. Here’s what I understand.

All different versions of Java are installed into the directory:
/System/Library/Frameworks/JavaVM.framework/Versions/

For example, the JDK home directory for 1.4.2 would be Versions/1.4.2/Home/.

There are three ways to access specific versions.

  1. The hardcoded default. There is the hardcoded default version that comes with the current version of the OS. For OS X Leopard, this is Java 1.5. This version can always be accessed through the path /Library/Java/Home.
  2. The Java Preferences application. Recent versions of Java on OS X install an application “Java Preferences” into /Applications/Utilities. It allows you to select the preferred version by dragging it to the top of the list. There is one list for applets, and one list for applications and the command line. The terminal commands (e.g., java) use the version at the top of the second list. The path of this version, in case you need it, can be obtained by running the command /usr/libexec/java_home.
  3. Setting JAVA_HOME. By doing so one can override the choice that is made in the Java Preferences application. If the JAVA_HOME is set, the command line applications will use that version. However, /usr/libexec/java_home will still return the path of the version that was selected in Java Preferences.

“Magic” versions. In addition to the different Java versions, the Versions directory also contains several “magic” directories:

Versions/CurrentJDK/ is the hardcoded default version of the OS, so on Leopard it will always be an alias pointing to /Versions/1.5. Note that this is not affected by whatever version is selected in Java Preferences or via JAVA_HOME. Messing manually with the symlink to make it point to a different version is probably not a good idea. /Library/Java/Home points here.

Versions/Current/ is an alias that points to Versions/A/. This, in turn, is not a proper Java version like the other directories in /Versions/. It contains internal parts of the OS X Java machinery, e.g., in Versions/Current/Commands there are “fake” binaries such as java, javac and javadoc that internally use /usr/libexec/java_home (and JAVA_HOME, if set) to find the “real” binary. When you call Java commands from the command line, you actually invoke these “fake” binaries (via symlinks in /usr/bin). These are system internals, and poking around in there too much is probably not a good idea.

Finding the current version. So, what is the correct way to determine the location of the current Java version on OS X, say from a shell script?

  1. If JAVA_HOME is set, use that.
  2. Otherwise, invoke /usr/libexec/java_home to find the path.
  3. If that fails, fall back to /Library/Java/Home.
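The three steps above can be sketched as follows, here in Python rather than as a shell script. The function name and its parameters are my own; the env and tool arguments exist only so that the logic can be tried without touching the real environment.

```python
import os
import subprocess

FALLBACK_JAVA_HOME = "/Library/Java/Home"
JAVA_HOME_TOOL = "/usr/libexec/java_home"


def find_java_home(env=None, tool=JAVA_HOME_TOOL) -> str:
    """Resolve the current Java home following the three steps above."""
    env = os.environ if env is None else env

    # 1. If JAVA_HOME is set, use that.
    java_home = env.get("JAVA_HOME")
    if java_home:
        return java_home

    # 2. Otherwise ask the java_home tool for the path.
    try:
        completed = subprocess.run(
            [tool], capture_output=True, text=True, check=True
        )
        path = completed.stdout.strip()
        if path:
            return path
    except (OSError, subprocess.CalledProcessError):
        pass  # tool missing or failed: fall through to step 3

    # 3. Fall back to the hardcoded default.
    return FALLBACK_JAVA_HOME


if __name__ == "__main__":
    print(find_java_home())
```

On a machine without /usr/libexec/java_home and without JAVA_HOME set, this simply returns the fallback path.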

Should you set JAVA_HOME? If you don’t need it, don’t set it at all. If you need it, setting it via

export JAVA_HOME=`/usr/libexec/java_home`

is probably not a bad idea, because this will reflect future changes to the selected version from the Java Preferences application.

by Richard Cyganiak at August 18, 2009 04:34 PM

July 29, 2009

Displacement Activities (Tom Heath)

Search Engine Optimisation for People with a Conscience

I’ve spent a fair amount of time recently cleaning up spammy reviews on Revyu, the Linked Data/Semantic Web reviewing and rating site. The main perpetrators of these spammy reviews seem to be self-appointed Search Engine Optimisation (SEO) “experts” (who even advertise themselves as such on LinkedIn). Their main strategy appears to be polluting the Web with links to fairly worthless sites, in the hope of gaining some share of search engine traffic.

Getting a piece of the action I have no objection to per se. This was exactly my aim with chiip.co.uk, my (currently somewhat on ice) shop window to Amazon – visitors could find products via search engines and, if desired, buy them through a trusted supplier, earning me enough commission on the side to pay my hosting bill for a month or two. The difference here is that I just tweaked the site layout to show off the content to search engines in its best light. I never polluted anyone else’s space to gain exposure. People who do this are getting me down.

Revyu has become somewhat popular as a target, presumably due to its decent ranking in the search engines. The site didn’t gain this position through spamming other sites with backlinks, but by having some simple principles baked into the site design from the start. They’re the same basic principles I’ve used on any site I’ve created, and have generally served me well. A few years ago I wrote down the principles that guide me, and I share this first draft here as a service to people who want to optimise the exposure of their site and still be able to sleep at night.

Before you read the tips though bear this in mind: there is something of an art to this, but it isn’t rocket science, and it certainly isn’t black magic. If you can create a Web site then you can optimise pretty well for search engines without paying a single self-appointed “expert” a single penny. This is bread and butter stuff. These approaches should be part of the core skill set of any Web developer rather than an afterthought addressed through some external process. The tips below are not guaranteed to work and may become defunct at any time (some may be defunct already – does anyone ever use frames these days?). However, follow these and you’ll be 80% of the way there.

Search Engine Optimisation Tips

  1. there’s only so much you can do, and this may change at any time
  2. don’t try and trick the search engines, just be honest
  3. use web standards and clean code
  4. use css for styling and layout
  5. put important text first in the page; let this influence your design, it’s probably what users want too, especially if they’re on non-standard browsers
  6. choose page titles carefully
  7. use meta tags, but only if they’re accurate
  8. use robot meta tags, and robots.txt
  9. use structural markup, especially headings
  10. give anchors sensible text (“click here” does not qualify as sensible)
  11. use link titles and alt text
  12. give files and folders meaningful names
  13. provide default pages in directories so people can hack your URLs
  14. forge meaningful (human) links with other sites, and make technical links accordingly
  15. encourage inward links to your site
    • make urls readable and linkable to
    • don’t break links (at least give redirects)
  16. don’t use javascript for links/popup windows that you want to be indexed
  17. avoid links embedded in flash movies
  18. never use frames
  19. never use cookies to power navigation
  20. give example searches or browse trees to open databases to search engines
  21. maximise the content richness of pages
  22. avoid leaf node pages (always create links back to the rest of the site)
  23. limit the use of PDFs
  24. take common typos into account, or spelling variations (optimisation vs optimization is a good example)
  25. update the site regularly
  26. don’t use hidden text or comments to try and convey spam words
  27. don’t embed text in images
  28. avoid writing out text using javascript
  29. don’t use browser detection to alter content or restrict access
  30. provide meaningful error pages
  31. be realistic about what you can achieve optimisation-wise
  32. establish a traffic baseline
  33. use monitoring tools to track your progress

At some point I hope to provide evidence backing up each of these claims. In the meantime you’ll just have to trust me, but it won’t cost you anything.

by Tom Heath at July 29, 2009 11:27 AM

June 18, 2009

Project squin

SQUIN presented at SemTech 2009

I attended the 2009 Semantic Technology Conference in San Jose, California where Linked Data was a huge topic. One of the current issues is “What to do with Linked Data”, and whenever I heard that comment, SQUIN would pop into my mind! Therefore, I took this time at SemTech to share with everybody the idea of querying the Web of Linked Data as if it were a database.

A Linked Data gathering took place on Wednesday afternoon and I gave a 5 min pitch of SQUIN. I showed the audience the Researchers Map demo (that won the Scripting Challenge at ESWC2009) and explained how SQUIN worked. Furthermore, I talked about the plans that I have with Turn2Live.com in which they will start to consume music Linked Data through SQUIN (and also start publishing Linked Data too!). After my 5 min pitch, somebody (his first name was Drew) had already downloaded SQUIN, started running it on his laptop and was querying the Web of Linked Data… all in less than 5 min!

On Thursday morning, Tony Shaw, the organizer of the Semantic Technology Conference, told me that he had heard a lot of people talking about SQUIN the night before and that learning about SQUIN was the highlight of this conference. I was impressed! He immediately invited me to give a quick session on SQUIN. I gave a quick talk at 11am to a small crowd of people who had not seen SQUIN before and had only briefly heard about it the night before.

In summary, I believe that we need to offer tools that allow people to query, use and consume Linked Data in a very simple, easy and quick manner. That is the goal of SQUIN. I acknowledge that a lot of work needs to be done, but this is a start!

Juan

by Olaf Hartig at June 18, 2009 07:36 PM

Christian Becker

Public notice: I do not make balances

What looked like cleverly personalized spam was in fact an honest request for a manual for an antique scale that goes by my name:

June 18,09

To whom it may concern:
I have recently obtained a Christian Becker balance. Design #: 165996. Ser. #: A 8628. Style: AB5. cap. 200G
I need to know what all the knobs are and how to adjust, etc.
I need the instruction manual for the balance. Is such available or can you make a copy of one for me. I’ll be glad to pay for copying.
If no manual available, do you know of someone in the St. Louis, Mo. area who repairs them that might help me?
Thank you for your help.

Sure thing, I sent him the manual (“prepared as a service to laboratories worldwide”). Gotta love antique scientific instruments ;)

by chris at June 18, 2009 06:30 PM

June 15, 2009

Wikier.org Blog (Sergio Fernandez)

SDoW2009

After a short holiday in Sardinia, I’m back to announce the 2nd International Workshop on Social Data on the Web (SDoW2009), which will be held in Washington in October, co-located with the 8th International Semantic Web Conference (ISWC2009). We’ve just sent out the CfP; contributions are welcome until August 10th (extended from July 24th).

I’m very proud to co-chair this workshop for the second consecutive year with John, Uldis and Alex. Last year in Karlsruhe we enjoyed an amazing day; let’s see what we can get this year.

by Sergio Fernández at June 15, 2009 09:41 AM

June 01, 2009

Christian Becker

Marbles released on SourceForge

I’m pleased to announce the release of Marbles on SourceForge.

Marbles is a server-side application that formats Semantic Web content for XHTML clients using Fresnel lenses and formats. Colored dots are used to correlate the origin of displayed data with a list of data sources, hence the name.
By performing all formatting, data retrieval and storage activities on the server side rather than on a potentially thinly equipped client, the view generation can touch on large amounts of data and requests can be answered relatively quickly. Marbles provides display and database capabilities for DBpedia Mobile.

Data is retrieved from multiple sources and integrated into a single graph that is persisted across user sessions. When provided with the URI of a resource to display, Marbles tries to dereference it. In parallel, it queries Sindice and Falcons for data sources that contain information about the given resource, and Revyu for reviews. In a manner similar to the Semantic Web Client Library, Marbles follows specific predicates found in retrieved data, such as owl:sameAs and rdfs:seeAlso, in order to gain more information about a resource and to obtain human-friendly resource labels.

Thanks to Eli Lilly and Company for supporting the open-sourcing of Marbles in part by a research grant.

by chris at June 01, 2009 05:47 PM

May 29, 2009

Christian Becker

¡DBpedia @ Wikimania 2009!

This just in: I will be at Wikimedia’s Wikimania conference in Buenos Aires to talk about DBpedia and an ongoing mapping collaboration. Now on to learning a few bits of Spanish…

by chris at May 29, 2009 08:59 AM

May 24, 2009

Project squin

New BGP query handler for the Semantic Web Client Library reduces query times to a third

I have added a BGP query handler to my new in-memory storage solution for the Semantic Web Client Library (SWClLib). With this addition the execution of queries over a completely filled cache can be reduced to about 31.7% of the time required by the old Jena/NG4J based solution.

Details

A BGP query is basically a set of triple patterns; i.e., a set of RDF triples that may have query variables in the subject, predicate, and object positions. A solution to a BGP query is a solution for all triple patterns in the query. Hence, each solution corresponds to a set of matching triples (one for each triple pattern) in the queried RDF dataset. In the context of NG4J and the SWClLib the matching triples may be part of different named graphs from the local graph set.
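
To make the definition concrete, here is a minimal Python sketch of BGP evaluation over a flat list of triples. It is not the Jena or SWClLib implementation: variables are assumed to be strings starting with "?", named graphs are ignored, and the function names and sample data are made up for illustration.

```python
def is_var(term):
    """Query variables are written as strings starting with '?'."""
    return isinstance(term, str) and term.startswith("?")


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`; None if impossible."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if is_var(p_term):
            # Bind the variable, or check it against an earlier binding.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def evaluate_bgp(patterns, triples):
    """Evaluate the triple patterns one after another, carrying bindings along."""
    solutions = [{}]
    for pattern in patterns:
        next_solutions = []
        for binding in solutions:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_solutions.append(extended)
        solutions = next_solutions
    return solutions


triples = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:alice", "foaf:knows", "ex:carol"),
    ("ex:bob", "foaf:name", "Bob"),
]
query = [
    ("ex:alice", "foaf:knows", "?x"),
    ("?x", "foaf:name", "?n"),
]
solutions = evaluate_bgp(query, triples)
```

Only the binding for ex:bob survives, because no triple in the sample data gives ex:carol a name and the second pattern therefore has no match for her.
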
NG4J does not implement the evaluation of BGP queries itself but simply relies on the BGP query handler in Jena. The Jena BGP query handler executes the query by evaluating the triple patterns iteratively. The iterators issue triple pattern queries (also called find(SPO) queries) to the underlying graph store. In the case of NG4J the underlying graph store is a set of named graphs (i.e. an implementation of the NamedGraphSet interface), which is either the old NG4J implementation or my new storage solution.
As described earlier, my solution is based on identifiers for RDF terms (e.g. URIs and literals) instead of the RDF terms themselves. For this reason, the RDF-term-based triple pattern queries issued by the Jena iterators must be translated to identifier-based triple patterns. This translation happens by looking up the RDF terms in a dictionary. Furthermore, the solutions determined for the triple patterns must be translated back. Since these translations take time, I was wondering whether I could improve query execution by developing a custom BGP query handler for my identifier-based storage solution. This new query handler uses iterators which work with identifier-based triple patterns. This approach removes the need for translations between RDF nodes and their identifiers and, thus, should reduce query execution times.
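
The idea of working on identifiers throughout can be illustrated with a small Python sketch. The `TermDictionary` class and the `find` function below are hypothetical and far simpler than the real store; they only show that the query constants are translated once up front and the final results are decoded once at the end, with no per-pattern translation in between.

```python
class TermDictionary:
    """Maps RDF terms to integer identifiers and back."""

    def __init__(self):
        self._ids = {}
        self._terms = []

    def encode(self, term):
        """Return the identifier for `term`, assigning a new one if needed."""
        if term not in self._ids:
            self._ids[term] = len(self._terms)
            self._terms.append(term)
        return self._ids[term]

    def lookup(self, term):
        """Return the identifier for a known term, or None."""
        return self._ids.get(term)

    def decode(self, ident):
        return self._terms[ident]


def find(id_triples, pattern):
    """find(SPO) over identifier triples; None in the pattern is a wildcard."""
    return [
        t for t in id_triples
        if all(p is None or p == v for p, v in zip(pattern, t))
    ]


dictionary = TermDictionary()
raw = [
    ("ex:alice", "foaf:knows", "ex:bob"),
    ("ex:bob", "foaf:name", "Bob"),
]
store = [tuple(dictionary.encode(term) for term in triple) for triple in raw]

# Translate the query constants once, before evaluation starts.
alice = dictionary.lookup("ex:alice")
knows = dictionary.lookup("foaf:knows")
name = dictionary.lookup("foaf:name")

# Both triple patterns are evaluated purely on identifiers.
names = []
for _, _, friend in find(store, (alice, knows, None)):
    for _, _, name_id in find(store, (friend, name, None)):
        names.append(dictionary.decode(name_id))  # decode only the final result
```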

Evaluation

As usual, I used the Berlin SPARQL Benchmark (BSBM), together with my Linked Data-like data generator, to evaluate the new approach. Note that the data generator creates a dataset with a fairly large number of comparatively small RDF graphs, which is typical for the local cache of the SWClLib. The following table lists the average (geometric mean) times to execute the BSBM query mix (10 runs and 3 additional warm-up runs) over datasets created with scaling factors (pc) of 10 to 80. The table is accompanied by a chart that visualizes the measurements.

Scaling   # of named   Overall #    Avg. query mix exec. time   Avg. query mix exec. time   Comparison
factor    graphs       of triples   (Jena/NG4J-based store)     (new storage solution)      (new / old)
10        613          4,971        2.96s                       1.01s                       34.2%
20        928          8,485        4.96s                       1.56s                       31.9%
30        1,245        11,999       7.43s                       2.37s                       31.9%
40        1,845        16,918       17.76s                      5.36s                       30.2%
50        2,599        22,616       47.93s                      12.92s                      27.0%
60        2,914        26,108       61.44s                      23.93s                      39.0%
70        3,230        29,601       80.11s                      23.98s                      29.9%
80        3,544        33,110       88.33s                      26.08s                      29.5%

[Chart: measures]
As the measurements show, the new store with its new BGP query handler reduces the execution time of queries over a local graph set to about 31.7% of the time the old store needs. Honestly, I wasn’t expecting such a huge improvement. Great news for the SWClLib and for SQUIN. Cheers!
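
The post does not say how the 31.7% figure was derived. One reading that is consistent with the table is the arithmetic mean of the Comparison column, which the following Python sketch checks (the variable names are mine). It also recomputes the per-row ratios from the published times; these can differ slightly from the table because the published times are themselves rounded.

```python
from statistics import mean

# Values copied from the table above (seconds and percent).
old_times = [2.96, 4.96, 7.43, 17.76, 47.93, 61.44, 80.11, 88.33]
new_times = [1.01, 1.56, 2.37, 5.36, 12.92, 23.93, 23.98, 26.08]
reported = [34.2, 31.9, 31.9, 30.2, 27.0, 39.0, 29.9, 29.5]

# Arithmetic mean of the reported Comparison column.
average_reported = round(mean(reported), 1)

# Ratios recomputed from the rounded execution times.
recomputed = [round(100 * new / old, 1) for new, old in zip(new_times, old_times)]

print(average_reported)
print(recomputed)
```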

Olaf

by Olaf Hartig at May 24, 2009 08:14 PM

May 14, 2009

Christian Becker

DBpedia Mobile and Marbles featured in Semantic Web for Dummies

The Semantic Web is so mainstream now: Oracle’s Jeff Pollock has actually published a Semantic Web for Dummies book – and it has a whopping 2 pages on DBpedia and DBpedia Mobile! Kudos for what looks like a much-needed overview.

by chris at May 14, 2009 10:18 PM

April 06, 2009

Joshua Tauberer

SPARQL OLE DB Provider

Andy Gueritz announced on the mailing list for my SemWeb RDF library for .NET that he has created an OLE DB provider for a SPARQL endpoint that is usable in Microsoft Excel. He wrote,

In a moment of insanity (but a great learning experience), I gave myself the challenge of writing an OLE DB provider for SPARQL. It is built on top of the SemWeb library which has saved a substantial amount of effort and also brings some powerful functionality to the table very quickly (Thanks, Joshua!)

The provider as constructed implements a read-only OLE DB provider that supports all four SPARQL query types and interfaces to SemWeb through a COM-Callable Wrapper. It is not extensively tested yet but seems to work with most of the queries I have now put through it, and of course being built on SemWeb it is able to read both local and remote SPARQL sources.

Moral of the story: populate Excel tables with SPARQL queries.

More here.

by Joshua Tauberer at April 06, 2009 11:32 PM

March 25, 2009

Wikier.org Blog (Sergio Fernandez)

Mailing Lists and Social Semantic Web

Today I received my printed copy of the book Social Web Evolution: Integrating Semantic Applications and Web 2.0 Technologies, which includes a chapter titled Mailing Lists and Social Semantic Web that gathers all the work done over the last few years around SWAML, SIOC, mailing lists and the Social Semantic Web.

This is the first book that I’ve written, so I’m very proud of it. Thank you to the co-authors of the chapter (Diego, Lian, Labra and Patricia), the editors of the book, and all the people who helped us over these years.

If you want to take a look at the book, you can find it on Amazon or just use the preview provided by Google Books.

by Sergio Fernández at March 25, 2009 02:31 PM

March 14, 2009

Wikier.org Blog (Sergio Fernandez)

Put your data on the Web

Now, I want you to put your data on the Web.

Tim Berners-Lee, speaking about Linked Data in his talk on the 20th birthday of the World Wide Web.

by Sergio Fernández at March 14, 2009 10:23 AM

March 02, 2009

Joshua Tauberer

Civic Hacking, the Semantic Web, and Visualization

Yesterday I held a session called Semantic Web II: Civic Hacking, the Semantic Web, and Visualization at Transparency Camp. In addition to posting my slides, here’s basically what I said during the talk (or, now on reflection, what I should have said):

Who I Am: I run the site GovTrack.us which collects information on the status of bills in the U.S. Congress. I don’t make use of the semantic web to run the site, but as an experiment I generate a large semantic web database out of the data I collect, and some additional related data that I find interesting.

Data Isolation: What the semantic web addresses is data isolation. For instance, the website MAPLight.org, which looks for correlations between campaign contributions to Members of Congress and how they voted on legislation, is essentially something that is too expensive to do for its own sake. Campaign data from the Federal Election Commission isn’t tied to roll call vote data from the House and Senate. It’s only because separate projects have, for independent reasons, massaged the existing data and made it more easily mashable that MAPLight is possible (that’s my site GovTrack and the site opensecrets.org). The semantic web wants to make this process cheaper by addressing mashability at the core. This is important for civic (i.e. political/government) data: machines help us sort, search, and transform information so we can learn something, which is good for civic education, journalism (government oversight), and research (health and economy). And it’s important for the data to be mashable by the public because uses of the data go beyond the resources, mission, and mandate of government agencies.

Beyond Metadata: We can think of the semantic web as going beyond metadata if we think of metadata as tabular, isolated data sets. The semantic web helps us encode non-tabular, non-hierarchical data. It lets us make a web of knowledge about the real world, connecting entities like bills in Congress with Members of Congress, the districts they represent, etc. We establish relations like sponsorship, represents, voted.

Why I care: Machine processing of knowledge combined with machine processing of language is going to radically and fundamentally transform the way we learn, communicate, and live. But this is far off still. (This explains why I study linguistics…)

Then there are some slides on URIs and RDF.

My Cloud: When the data gets too big, it’s hard to remember the exact relations between the entities represented in the data set, so I start to think of my semantic web data as several clouds. One cloud is the data I generate from GovTrack, which is 13 million triples about legislation and politicians. Another cloud is data I generate about campaign contributions: 18 million triples. A third data set is census data: 1 billion triples. I’ve related the clouds together so we can take interesting slices through them and ask questions: how did politicians vote on bills, what are the census statistics of the districts represented by congressmen, are votes correlated with campaign contributions aggregated by zipcode, are campaign contributions by zipcode correlated with census statistics for the zipcode (ZCTA), etc. Once the semantic web framework is in place, the marginal cost of asking a new question is much lower. We don’t need to go through the work that MAPLight did each time we want a new correlation.

Linked Open Data (LOD): I showed my part of the greater LOD cloud/community.

Implementation: A website ties itself to the LOD or semantic web world by including <link/> elements to RDF URIs for the primary topic of a page. This URI can be plugged into a web browser to retrieve RDF about that resource: it’s self-describing. I showed excerpts from a URI for a bill in congress that I created. It has basic metadata, but goes beyond metadata. The pages are auto-generated from a SPARQL DESCRIBE query as I explained in my Census case study on my site rdfabout.com.

SPARQL: The query language (the SQL, so to speak) for the semantic web. It is similar to SQL in metaphors and keywords like SELECT, FROM, and WHERE. It differs in every other way. Interestingly, there is a cultural difference: SPARQL servers (“endpoints”) are often made publicly accessible directly, whereas SQL servers are usually private. This might be because SPARQL is read-only.

Example 1: Did a state’s median income predict the votes of Senators on H.R. 1424, the October 2008 stimulus bill? I showed the partial RDF graph related to this question and how the graph relates to the SPARQL query. First I showed a simplified example SPARQL query, then the real one. The real one is complicated not because RDF or SPARQL are complicated, but because the data model *I* chose to represent the information is complicated. That is, my data set is very detailed and precise, and it takes a precise query to access it properly. I showed how this data might be plugged into Many Eyes to visualize it.

My visualization dream: Visualization tools like Swivel (ehm: I had real problems getting it to work), Many Eyes, Ggobi, and mapping tools should go from SPARQL query to visualization in one step.

Example 2: Show me the campaign contributions to Rep. Steve Israel (NY-2) by zipcode on a map. I showed the actual SPARQL query I issue on my SPARQL server and a map that I want to generate. In fact, I made a prototype of a form where I can submit any arbitrary SPARQL query and it creates an interactive map showing the information.

Other notes: My SPARQL server uses my own .NET/C# RDF library. That creates a “triple store”, the equivalent of an RDBMS for the semantic web. Under the hood, though, it stores the triples in a MySQL database with a table whose columns are “subject, predicate, object”, i.e. a table of triples. See also: D2R server for getting existing data online.
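
To give a feel for what a “table of triples” looks like, here is a small Python sketch. It uses an in-memory SQLite database as a stand-in for MySQL, and the table layout, the `ex:` identifiers and the query are hypothetical illustrations rather than the schema my library actually uses.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("ex:bill1", "ex:sponsor", "ex:memberA"),
        ("ex:memberA", "ex:represents", "ex:district1"),
        ("ex:bill2", "ex:sponsor", "ex:memberB"),
    ],
)

# Two triple patterns become a self-join on the triples table:
#   ?bill ex:sponsor ?member .  ?member ex:represents ?district .
rows = conn.execute(
    """
    SELECT t1.subject, t2.object
    FROM triples AS t1
    JOIN triples AS t2 ON t2.subject = t1.object
    WHERE t1.predicate = 'ex:sponsor'
      AND t2.predicate = 'ex:represents'
    """
).fetchall()

print(rows)
```

Each additional triple pattern in a query adds another self-join on the same table, which is one reason dedicated triple stores usually add indexes or other optimizations on top of this basic layout.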

by Joshua Tauberer at March 02, 2009 04:14 PM

February 23, 2009

Joshua Tauberer