Linked Data Blog Aggregator

April 30, 2012

Displacement Activities (Tom Heath)

Bebo White Reviews the Linked Data Book for Journal of Web Engineering

I recently had an email giving advance notice that a review of the Linked Data Book (aka “Linked Data: Evolving the Web into a Global Data Space”) would appear in Volume 11(2) of the Journal of Web Engineering, published by Rinton Press (ISSN: 1540-9589). As some people won’t have easy access to the journal, the review is republished here, with permission. It’s by Bebo White of Stanford University and beyond. Thank you, Bebo, for the thoughtful review, and thanks to Rinton Press for allowing it to be republished here.

Web Engineering has been described as encompassing those “technologies, methodologies, tools, and techniques used to develop and maintain Web-based applications leading to better systems, [thus to] enabling and improving the dissemination and use of content and services through the Web.” (Source: International Conference on Web Engineering)

An especially interesting aspect of this description is “dissemination and use of content.” Semantic Web technologies and particularly the Linked Data paradigm have evolved as powerful enablers for the transition of the current document-oriented Web into a Web of interlinked data/content and, ultimately, into the Semantic Web.

To facilitate this transition many aspects of distributed data and information management need to be adapted, advanced and integrated. Of particular importance are approaches for (1) extracting semantics from unstructured, semi-structured and existing structured sources, (2) management of large volumes of RDF data, (3) techniques for efficient automatic and semi-automatic data linking, (4) algorithms, tools, and inference techniques for repairing and enriching Linked Data with conceptual knowledge, (5) the collaborative authoring and creation of data on the Web, (6) the establishment of trust by preserving provenance and tracing lineage, and (7) user-friendly means for browsing, exploration and search of large, federated Linked Data spaces. Particularly promising might be the synergistic combination of approaches and techniques touching upon several of these aspects at once.

For Web Engineering practitioners interested in being a part of this Web transition, Linked Data – Evolving the Web into a Global Data Space by Heath and Bizer will provide a valuable resource. The authors have done an excellent job of addressing the subject in a logical sequence of well-written chapters reflecting technical fundamentals, coverage of existing applications and tools, and the challenges for future development and research. The seven important approaches mentioned earlier are described in a consistent way and illustrated by means of a hypothetical scenario that evolves over the course of the book. The size of this book (122 pages) is deceiving in that it does not reflect the quality and density of its content. The authors have succeeded in presenting a complex topic both succinctly and clearly. It is not a “quick read,” but rather a volume to be used for references, definitions, and meaningful and instructive code examples.

This book is available in digital format (PDF). It is the first in a planned series of books/lectures. The quality of this book should make the reader/practitioner look forward to the upcoming series volumes that promise to further explain the exciting future of this topic.

by Tom Heath at April 30, 2012 12:10 PM

AI3:::Adaptive Information (Mike Bergman)

Pragmatic Approaches to the Semantic Web

Linked Data is Sometimes a Useful Technique, but is an Inadequate Focus

While in Australia on other business, I had the great fortune to be invited by Adam Bell of the Australian War Memorial to be the featured speaker at the Canberra Semantic Web Meetup on April 23. The talk was held within the impressive BAE Systems Theatre of the Memorial and was very well attended. My talk was preceded by an excellent introduction to the semantic Web by David Ratcliffe and Armin Haller of CSIRO.  They have kindly provided their useful slides online.

Many of the attendees came from the perspective of libraries, archives or museums. They naturally had an interest in the linked data activities in this area, a growing initiative that is now known under the acronym of LOD-LAM. Though I have been an advocate of linked data going back to 2006, one of my main theses was that linked data was an inadequate focus to achieve interoperability. The key emphases of my talk were that the pragmatic contributions of semantic technologies reside more in mindsets, information models and architectures than in ‘linked data’ as currently practiced.

Disappointments and Successes

The semantic Web and its most recent branding of linked data have antecedents going back to 1945 via Vannevar Bush’s memex and Ted Nelson’s hypertext of the early 1960s. The most powerful portrayal of the potential of the semantic Web comes in Douglas Adams’ 1990 Hyperland special for the BBC, a full decade before Tim Berners-Lee and colleagues first coined the term ‘semantic web’ [1]. The Hyperland vision of obsequious intelligent agents doing our every bidding has, of course, not been fully realized. The lack of visible uptake of this full vision has caused some proponents to back away from the idea of the semantic Web. Linked data, in fact, was a term coined by Berners-Lee himself, arguably in part to re-brand the idea and to focus on a more immediate, achievable vision. In its first formulation linked data emphasized the RDF (Resource Description Framework) data model, though others, notably Kingsley Idehen, have attempted to put forward a revisionist definition of linked data that includes any form of structured data involving entity attribute values (EAV).

No matter how expressed, the idea behind all of these various terms has in essence been to make meaningful connections, to provide the frameworks for interoperability. Interoperability means getting disparate sources of data to relate to each other, as a means of moving from data to information. Interoperability requires that source and receiver share a vocabulary about what things mean, as well as shared understandings about the associations or degree of relationship between the items being linked.

The current concept of linked data attempts to place these burdens mostly on the way data is published. While apparently “simpler” than earlier versions of the semantic Web (since linked data de-emphasizes shared vocabularies and nuanced associations), linked data places onerous burdens on how publishers express their data. Though many in the advocacy community point to the “billions” of RDF triples expressed as a success, actual consumers of linked data are rare. I know of no meaningful application or example where the consumption of linked data is an essential component.

However, there are a few areas of success in linked data. DBpedia, Freebase (now owned by Google), and GeoNames have been notable in providing identifiers (URIs) for common concepts, things, entities and places. There has also been success in the biomedical community with linked data.

Meanwhile, other aspects of the semantic Web have also shown success, but been quite hidden. Apple’s spoken Siri service is driven by an ontological back-end; schema.org is beginning to provide shared ways for tagging key entities and concepts, as promoted by the leading search engines of Google, Bing, Yahoo! and Yandex; Bing itself has been improved as a search service by the incorporation of the semantic search technologies of its earlier Powerset acquisition; and Google is further showing how NLP (natural language processing) techniques can be used to extract meaningful structure for characterizing entities in search results and in search completion and machine language translation. These services are here today and widely used. All operate in the background.

What Lessons Can We Derive?

These failures and successes help provide some pragmatic lessons going forward.

While I disagree with Kingsley’s revisionist approach to re-defining linked data, I very much agree with his underlying premise:  effective data exchange does not require RDF. Most instance records are already expressed as simple entity-value pairs, and any data transfer serialization — from key-value pairs to JSON to CSV spreadsheets — can be readily transformed to RDF.
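To make the "readily transformed to RDF" point concrete, here is a minimal sketch that turns flat entity-value records (one parsed from JSON, one from CSV) into subject-predicate-object triples and prints them as N-Triples-like lines. The `http://example.org/` namespace, the `record_to_triples` and `to_ntriples` helpers, and the sample records are all hypothetical. The sketch also assumes that field names are simple identifiers usable in a URI, and it borrows `json.dumps` as a convenient approximation of literal escaping rather than a full N-Triples serializer.

```python
import csv
import io
import json

# Hypothetical namespace, used for illustration only.
BASE = "http://example.org/"


def record_to_triples(subject_id, record):
    """Turn one flat entity-value record into (subject, predicate, object) triples."""
    subject = f"<{BASE}id/{subject_id}>"
    triples = []
    for key, value in record.items():
        # Assumes each key is a simple identifier that is safe inside a URI.
        predicate = f"<{BASE}vocab/{key}>"
        # json.dumps yields a double-quoted, escaped string; every value
        # is treated as a plain string literal in this sketch.
        literal = json.dumps(str(value), ensure_ascii=False)
        triples.append((subject, predicate, literal))
    return triples


def to_ntriples(triples):
    """Render triples one per line in an N-Triples-like form."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


# The same kind of instance record arriving in two different serializations.
json_record = json.loads('{"name": "Canberra", "country": "Australia"}')
csv_rows = list(csv.DictReader(io.StringIO("id,name,country\n1,Salzburg,Austria\n")))

print(to_ntriples(record_to_triples("canberra", json_record)))
for row in csv_rows:
    row_id = row.pop("id")  # use the id column as the subject identifier
    print(to_ntriples(record_to_triples(row_id, row)))
```

Both inputs end up in the same triple form regardless of how they were transferred, which is the sense in which the serialization matters less than the canonical model behind it.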

This understanding is important because the fundamental contribution of RDF is not as a data exchange format, but as a foundational data model. The simple triple model of RDF can easily express the information assertions in any form of content, from completely unstructured text (after information extraction or metadata characterization) to the most structured data sources. Triples can themselves be built up into complete languages (such as OWL) that also capture the expressiveness necessary to represent any extant data or information schema [2].

The ability of RDF to capture any form of data or any existing schema makes it a “universal solvent” for information. This means that the real role of RDF is as a canonical data model at the core of the entire information architecture. Linked data, with its emphasis on data publishing and exchange, gets this focus exactly wrong. Linked data emphasizes RDF at the wrong end of the telescope.

The idea of common schema and representations is at the core of the semantic Web successes that do exist. In fact, when we look at Siri, emerging search, or some of the other successes noted above, we see that their semantic technology components are quite hidden. Successful semantics tend to work in the background, not in the foreground in terms of how data is either published or consumed. Semantic technologies are fundamentally about knowledge representation, not data transfer.

Where linked data is being consumed, it is within communities such as the life sciences where much work has gone into deriving shared vocabularies and semantics for linking and mapping data. These bases for community sharing express themselves as ontologies, which are really just formalized understandings of these shared languages in the applicable domain (life sciences, in this case). In these cases, curation and community processes for deriving shared languages are much more important to emphasize than how data gets exposed and published.

Linked data as presently advocated has the wrong focus. The techniques of publishing data and de-referencing URIs are given prominence over data quality, meaningful linkages (witness the appalling misuse of owl:sameAs [3]), and shared vocabularies. These are the reasons we see little meaningful consumption of linked data. It is also the reason that the much touted FYN (“follow your nose”) plays no meaningful information role today other than a somewhat amusing diversion.

Shifting the Focus

In our own applications Structured Dynamics promotes seven pillars to pragmatic semantic technologies [4]. Linked data is one of those pillars, because where the other foundations are in place, including shared understandings, linked data is the most efficient data transfer format. But, as noted, linked data alone is insufficient.

Linked data is thus the wrong starting focus for new communities and users wishing to gain the advantages of interoperability. The benefits of interoperability must first obtain from a core (or canonical) data model — RDF — that is able to capture any extant data or schema. As these external representations get boiled down to a canonical form, there must be shared understandings and vocabularies to capture the meaning in this information. This puts community involvement and processes at the forefront of the semantic enterprise. Only after the community has derived these shared understandings should linked data be considered as the most efficient way to interchange data amongst the community members.

Identifying and solving the “wrong” problems is a recipe for disappointment. The challenges of the semantic Web are not in branding or messaging. The challenges of the semantic enterprise and Web reside more in mindsets, approaches and architecture. Linked data is merely a technique that contributes little — perhaps worse by providing the wrong focus — to solving the fundamental issue of information interoperability.

Once this focus shifts, a number of new insights emerge. Structure is good in any form; arguments over serializations or data formats are silly and divert focus. The role of semantic technologies is likely to be a more hidden one, to reside in the background as current successes are now showing us. Building communities with trusted provenance and shared vocabularies (ontologies) is the essential starting point. Embracing and learning about NLP will be important for including the 80% of content currently in unstructured text and for disambiguating reference conflicts. Ultimate users, subject matter experts and librarians are much more important contributors to this process than developers or computer scientists. We largely now have the necessary specifications and technologies in place; it is time for content and semantic reconciliation to guide the process.

It is great that the abiding interest in interoperability is leading to the creation of more and more communities, such as LOD-LAM, forming around the idea of linked data. What is important moving forward is to use these interests as springboards, and not boxes, for exploring the breadth of available semantic technologies.

For More on the Talk

Below is a link to my slides used in Canberra:

Also, as mentioned, the intro slides are online, a video recording of the presentations is also available, and some other blog postings occasioned by the talks are also online.


[1] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[2] See further, M.K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[3] See, among many, M.K. Bergman, 2010. “Practical P-P-P-Problems with Linked Data,” AI3:::Adaptive Information blog, October 4, 2010. See http://www.mkbergman.com/917/practical-p-p-p-problems-with-linked-data/.
[4] M.K. Bergman, 2010. “Seven Pillars of the Open Semantic Enterprise,” AI3:::Adaptive Information blog, January 12, 2010. See http://www.mkbergman.com/859/seven-pillars-of-the-open-semantic-enterprise/.

by Mike Bergman at April 30, 2012 08:46 AM

April 21, 2012

Wikier.org Blog (Sergio Fernandez)

Salzburg

Yesterday was definitively my last day working at CTIC. In fact, we worked until the last minute. It has been an honor to share this time with people who are so technically skilled and who are even better as people; some of them are friends for life. But after more than six years there, I think it is now time for a change.

Thus next week I’ll move to Salzburg, where in May I’ll join Salzburg Research to work with Dr. Sebastian Schaffert at the Knowledge and Media Technologies department. I’ll join a very innovative team where I can apply my research on Linked Data to a more platform-oriented approach, initially on projects such as the Linked Media Framework or Apache Stanbol.

Finally, this is an important change in my life. I leave in Asturias almost everything that I know, and for sure I’ll miss my family and my friends here a lot. But I feel I need this change, which I’m sure will be an enriching experience in many respects.

See you then in Salzburg!

by Sergio Fernández at April 21, 2012 08:07 AM

April 04, 2012

AI3:::Adaptive Information (Mike Bergman)

The Trouble with Memes

Tractricious Sculpture at Fermilab; picture by Mike Kappel

Adaptive Information is a Hammer, but Genes are Not a Nail

Since Richard Dawkins first put forward the idea of the “meme” in his book The Selfish Gene some 35 years ago [1], the premise has stuck in my craw. I, like Dawkins, was trained as an evolutionary biologist. I understand the idea of the gene and its essential role as a vehicle for organic evolution. And, all of us clearly understand that “ideas” themselves have a certain competitive and adaptive nature. Some go viral; some run like wildfire and take prominence; and some go nowhere or fall on deaf ears. Culture and human communications and ideas play complementary (perhaps even dominant) roles in comparison to the biological information contained within DNA (genes).

I think there are two bases for why the “meme” idea sticks in my craw. The first harkens back to Dawkins. In formulating the concept of the “meme”, Dawkins falls into the trap of many professionals, what the French call déformation professionnelle. This is the idea of professionals framing problems from the confines of their own points of view. This is also known as the Law of the Instrument, or (Abraham) Maslow’s hammer, or what all of us know colloquially as “if all you have is a hammer, everything looks like a nail” [2]. Human or cultural information is not genetics.

The second, and more fundamental, basis for why this idea sticks in my craw is its mis-characterization of what is adaptive information, the title and theme of this blog. Sure, adaptive information can be found in the types of information structures at the basis of organic life and organic evolution. But, adaptive information is much, much more. Adaptive information is any structure that provides arrangements of energy and matter that maximize entropy production. In inanimate terms, such structures include chemical chirality and proteins. It includes the bases for organic life, inheritance and organic evolution. For some life forms, it might include communications such as pheromones or bird or whale songs or the primitive use of tools or communicated behaviors such as nest building. For humans with their unique abilities to manipulate and communicate symbols, adaptive information embraces such structures as languages, books and technology artifacts. These structures don’t look or act like genes and are not replicators in any fashion of the term. To hammer them as “memes” significantly distorts their fundamental nature as information structures and glosses over what factors might, or might not, make them adaptive.

I have been thinking of these concepts much over the past few decades. Recently, though, there has been a spate of the “meme” term, particularly on the semantic Web mailing lists to which I subscribe. This spewing has caused me to outline some basic ideas about what I find so problematic in the use of the “meme” concept.

A Brief Disquisition on Memes

As defined by Dawkins and expanded upon by others, a “meme” is an idea, behavior or style that spreads from person to person within a culture. It is proposed as being able to be transmitted through writing, speech, gestures or rituals. Dawkins specifically cited melodies, catch-phrases, fashion and the technology of building arches as examples of memes. A meme is postulated as a cultural analogue to a gene in that it is assumed to be able to self-replicate, mutate or respond to selective pressures. Thus, as proposed, memes may evolve by natural selection in a manner analogous to that of biological evolution.

However, unlike a gene, a structure corresponding to a “meme” has never been discovered or observed. There is no evidence for it as a unit of replication, or indeed as any kind of coherent unit at all. In its sloppy use, it is hard to see how “meme” differs in its scope from concepts, ideas or any form of cultural information or transmission, yet it is imbued with properties analogous to animate evolution for which there is not a shred of empirical evidence.

One might say, so what, the idea of a “meme” is merely a metaphor, what is the harm? Well, the harm comes about when it is taken seriously as a means of explaining human behavior and cultural changes, a field of study called memetics. It becomes a pseudo-scientific term that sets a boundary condition for understanding the nature of information and what makes it adaptive or not [3]. Mechanisms and structures appropriate to animate life are not universal information structures, they are simply the structures that have evolved in the organic realm. In the human realm of signs and symbols and digital information and media, information is the universal, not the genetic structure of organic evolution.

The noted evolutionary geneticist, R.C. Lewontin, one of my key influences as a student, has also been harshly critical of the idea of memetics [4]:

“The selectionist paradigm requires the reduction of society and culture to inheritance systems that consist of randomly varying, individual units, some of which are selected, and some not; and with society and culture thus reduced to inheritance systems, history can be reduced to ‘evolution.’ . . . we conclude that while historical phenomena can always be modeled selectionistically, selectionist explanations do not work, nor do they contribute anything new except a misleading vocabulary that anesthetizes history.”

Consistent with my recent writings about Charles S. Peirce [5], many logicians and semiotic theorists are also critical of the idea of “memes”, but on different grounds. The criticism here is that “memes” distort Peirce’s ideas about signs and the reification of signs and symbols via a triadic nature. Notable in this camp is Terrence Deacon [6].

Information is a First Principle

It is not surprising that the concept of “memes” arose in the first place. It is understandable to seek universal principles consistent with natural laws and observations. The mechanism of natural evolution works on the information embodied in DNA, so why not look to genes as some form of universal model?

The problem here, I think, was to confuse mechanisms with first principles. Genes are a mechanism — a “structure” if you will — that along with other forms of natural selection such as the entire organism and even kin selection [7], have evolved as means of adaptation in the animate world. But the fundamental thing to be looked for here is the idea of information, not the mechanism of genes and how they replicate. The idea of information holds the key for drilling down to universal principles that may find commonality between information for humans in a cultural sense and information conveyed through natural evolution for life forms. It is the search for this commonality that has driven my professional interests for decades, spanning from population genetics and evolution to computers, information theory and semantics [8].

But before we can tackle these connections head on, it is important to address a couple of important misconceptions (as I see them).

Segue #1: Information is (Not!) Entropy

In looking to information as a first principle, Claude Shannon’s seminal work in 1948 on information theory must be taken as the essential point of departure [9]. The motivation of Shannon’s paper and work by others preceding him was to understand information losses in communication systems or networks. Much of the impetus for this came about because of issues in wartime communications and early ciphers and cryptography. (As a result, the Shannon paper is also intimately related to data patterns and data compression, not further discussed here.)

In a strict sense, Shannon’s paper was really talking about the amount of information that could be theoretically and predictably communicated between a sender and a receiver. No context or semantics were implied in this communication, only the amount of information (for which Shannon introduced the term “bits” [10]) and what might be subject to losses (or uncertainty in the accurate communication of the message). In this regard, what Shannon called “information” is what we would best term “data” in today’s parlance.

The form of the uncertainty (unpredictability) calculation that Shannon derived:

$$H(X) = - \sum_{i=1}^{n} p(x_i) \log_b p(x_i)$$

very much resembled the mathematical form of Boltzmann’s original definition of entropy (as elaborated upon by Gibbs, denoted as S, for Gibbs entropy):

$$S = - k_B \sum p_i \ln p_i$$

and thus Shannon also labelled his measure of unpredictability, H, as entropy [10].
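As a small illustration of Shannon's H, the sketch below evaluates the sum directly for a few discrete distributions. The function name `shannon_entropy` and the example distributions are my own choices for illustration, not anything from Shannon's paper. Zero-probability outcomes are skipped, following the usual convention that they contribute nothing to the sum.

```python
import math


def shannon_entropy(probabilities, base=2):
    """Return H = -sum(p_i * log_b(p_i)) for a discrete distribution."""
    if any(p < 0 for p in probabilities):
        raise ValueError("probabilities must be non-negative")
    if not math.isclose(sum(probabilities), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    # Outcomes with p == 0 are skipped (they contribute nothing to H).
    total = sum(p * math.log(p, base) for p in probabilities if p > 0)
    return -total if total != 0 else 0.0


# A fair coin is maximally unpredictable for two outcomes: one bit.
print(shannon_entropy([0.5, 0.5]))
# A heavily biased coin is more predictable, so H is lower.
print(shannon_entropy([0.9, 0.1]))
# Four equally likely symbols need two bits.
print(shannon_entropy([0.25, 0.25, 0.25, 0.25]))
```

With `base=2` the result is in bits. Passing `base=math.e` gives the same natural-log sum that appears in the Gibbs form, which differs from it only by the constant k_B.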

After Shannon, and nearly a century after Boltzmann, work by individuals such as Jaynes in the field of statistical mechanics came to show that thermodynamic entropy can indeed be seen as an application of Shannon’s information theory, so there are close parallels [11]. This parallel of mathematical form and terminology has led many to assert that information is entropy.

I believe this assertion is a misconception on two grounds.

First, as noted, what is actually being measured here is data (or bits), not information embodying any semantic meaning or context. Thus, the formula and terminology are not accurate for discussing “information” in a conventional sense.

Second, the Shannon methods are based on the communication (transmittal) between a sender and a receiver. Thus the Shannon entropy measure is actually a measure of the uncertainty for either one of these states. The actual information that gets transmitted and predictably received was formulated by Shannon as R (which he called rate), and he expressed basically as:

R = Hbefore – Hafter

R, then, becomes a proxy for the amount of information accurately communicated. R can never equal Hbefore, because all communication systems have losses and Hafter is therefore never zero. Hbefore and Hafter are both state functions for the message, so this also makes R a function of state. So while there is Shannon entropy (unpredictability) for any given sending or receiving state, the actual amount of information (that is, data) that is transmitted is a change in state as measured by a change in uncertainty between sender (Hbefore) and receiver (Hafter). In the words of Thomas Schneider, who provides a very clear discussion of this distinction [12]:

Information is always a measure of the decrease of uncertainty at a receiver.
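The distinction between uncertainty and information can be sketched numerically. In the example below the receiver starts out equally unsure among four possible symbols, and after a noisy message arrives it is fairly (but not completely) sure which one was sent. The two distributions and the function names are assumptions made up for this illustration. R is simply the drop in uncertainty, Hbefore minus Hafter.

```python
import math


def uncertainty_bits(probabilities):
    """Shannon uncertainty H, in bits, of a discrete distribution."""
    return sum(-p * math.log2(p) for p in probabilities if p > 0)


def information_received(h_before, h_after):
    """R = Hbefore - Hafter: the decrease in uncertainty at the receiver."""
    return h_before - h_after


# Hypothetical receiver states, for illustration only.
before = [0.25, 0.25, 0.25, 0.25]  # no idea which of four symbols was sent
after = [0.85, 0.05, 0.05, 0.05]   # noisy channel: some doubt remains

h_before = uncertainty_bits(before)
h_after = uncertainty_bits(after)
r = information_received(h_before, h_after)

print(f"Hbefore = {h_before:.3f} bits")
print(f"Hafter  = {h_after:.3f} bits")
print(f"R       = {r:.3f} bits")
```

Here Hbefore is two bits, and the residual doubt left by the noise means that R comes out at a little over one bit rather than the full two.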

These points do not directly bear on the basis of information as discussed below, but help remove misunderstandings that might undercut those points. Further, these clarifications make consistent theoretical foundations of information (data) with natural evolution while being logically consistent with the 2nd law of thermodynamics (see next).

Segue #2: Entropy is (Not!) Disorder

The 2nd law of thermodynamics expresses the tendency that, over time, differences in temperature, pressure, or chemical potential equilibrate in an isolated physical system. Entropy is a measure of this equilibration: for a given physical system, the highest entropy state is one at equilibrium. Fluxes or gradients arise when there are differences in state potentials in these systems. (In physical systems, these are known as sources and sinks; in information theory, they are sender and receiver.) Fluxes go from low to high entropy, and are non-reversible (the “arrow of time”) without the addition of external energy. Heat, for example, is a by-product of fluxes in thermal energy. Because these fluxes are directional in isolation, a perpetual motion machine is shown as impossible.

In a closed system (namely, the entire cosmos), one can see this gradient as spanning from order to disorder, with the equilibrium state being the random distribution of all things. This perspective, and much schooling regarding these concepts, tends to present the idea of entropy as a “disordered” state. Life is seen as the “ordered” state in this mindset. Hewing to this perspective, some prominent philosophers, scientists and others have sometimes tried to present the “force” representing life and “order” as an opposite one to entropy. One common term for this opposite “force” is “negentropy” [13].

But, in the real conditions common to our lives, our environment is distinctly open, not closed. We experience massive influxes of energy via sunlight, and have learned as well how to harness stored energy from eons past in further sources of fossil and nuclear energy. Our open world is indeed a high energy one, and one that increases that high-energy state as our knowledge leads us to exploit still further resources of higher and higher quality. As Buckminster Fuller once famously noted, electricity consumption (one of the highest quality energy resources found to date) has become a telling metric about the well-being and wealth of human societies [14].

The high-energy environments fostering life on earth and more recently human evolution establish a local (in a cosmic sense) gradient that promotes fluxes to more ordered states, not less ordered ones. These fluxes remain faithful to basic physical laws and are non-deterministic [15]. Indeed, such local gradients can themselves be seen as consistent with the conditions initially leading to life, favoring the random event in the early primordial soup that led to chemical structures such as chirality, auto-catalytic reactions, enzymes, and then proteins, which became the eventual building blocks for animate life [16].

These events did not have preordained outcomes (that is, they were non-deterministic), but were the result of time and variation in the face of external energy inputs to favor the marginal combinatorial improvement. The favoring of the new marginal improvement also arises consistent with entropy principles, by giving a competitive edge to those structures that produce faster movements across the existing energy gradient. According to Annila and Annila [16]:

“According to the thermodynamics of open systems, every entity, simple or sophisticated, is considered as a catalyst to increase entropy, i.e., to diminish free energy. Catalysis calls for structures. Therefore, the spontaneous rise of structural diversity is inevitably biased toward functional complexity to attain and maintain high-entropy states.”

Via this analysis we see that life is not at odds with entropy, but is consistent with it. Further, we see that incremental improvements in structure that are consistent with the maximum entropy production principle will be favored [17]. Of course, absent the external inputs of energy, these gradients would reverse. Under those conditions, the 2nd law would promote a breakdown to a less ordered system, what most of us have been taught in schools.

With these understandings we can now see that the dichotomy of life representing order and entropy representing disorder is false. Further, we can see a guiding set of principles that is consistent across the broad span of evolution from primordial chemicals and enzymes to basic life and on to human knowledge and artifacts. This insight provides the fundamental “unit” we need to be looking toward, and not the gene nor the “meme”.

Information is Structure

Of course, the fundamental “unit” we are talking about here is information, and not limited as is Shannon’s concept to data. The quality that changes data to information is structure, and structure of a particular sort. Like all structure, there is order or patterns, often of a hierarchical or fractal or graph nature. But the real aspect of the structure that is important is the marginal ability of that structure to lead to improvements in entropy production. That is, processes are most adaptive (and therefore selected) that maximize entropy production. Any structure that emerges that is able to reduce the energy gradient faster will be favored.

However, remember, these are probabilistic, statistical processes. Uncertainties in state may favor one structure at one time versus another at a different time. The types of chemical compounds favored in the primordial soup were likely greatly influenced by thermal and light cycles and drying and wet conditions. In biological ecosystems, there are huge differences in seed or offspring production or in overall species diversity and ecological complexity based on the stability (say, tropics) or instability (say, disturbance) of local environments. As noted, these processes are inherently non-deterministic.

As we climb up the chain from the primordial ooze to life and then to humans and our many information mechanisms and technology artifacts (which are themselves embodiments of information), we see increasing complexity and structure. But we do not see uniformity of mechanisms or vehicles.

The general mechanisms of information transfer in living organisms occur (generally) via DNA in genes, mediated by sex in higher organisms, subject to random mutations, and then kept or lost entirely as their host organisms survive to procreate or not. Those are harsh conditions: the information survives or not (on a population basis) with high concentrations of information in DNA and with a priority placed on remixing for new combinations via sex. Information exchange (generally) only occurs at each generational event.

Human cultural information, however, is of an entirely different nature. Information can be made persistent, can be recorded and shared across individuals or generations, extended with new innovations like written language or digital computers, or combined in ways that defy the limits of sex. Occasionally, of course, losses also occur, such as living languages disappearing when certain cultures or populations die out, or horrendous catastrophes like the Spanish burning (nearly all of) the Mayans’ existing books [18]. The environment will also be uncertain.

So, while we can define DNA in genes or the ideas of a “meme” all as information, in fact we now see how very unlike the dynamics and structures of these two forms really are. We can be awestruck with the elegance and sublimity of organic evolution. We can also be inspired by song or poem or moved to action through ideals such as truth and justice. But organic evolution does not transpire like reading a book or hearing a sermon, just like human ideas and innovations don’t act like genes. The “meme” is a totally false analogy. The only constant is information.

Some Tentative Implications

The closer we come to finding true universals, the better we will be able to create maximum entropy producing structures. This, in turn, has some pretty profound implications. The insight that keys these implications begins with an understanding of the fundamental nature — and importance — of information. According to Karnani et al [19]:

“. . . the common contemporary consent, the second law of thermodynamics, is perceived to drive disorder. Therefore, it may appear, at first sight, inconceivable that this universal law could possibly account for the existence and orderly characteristics of information, as well as for its meaningful content. However, the second law, or equivalently the principle of increasing entropy, merely states that difference among energy densities tends to vanish. When the surrounding energy density is high, the system will evolve toward a stationary state by increasing its energy content, e.g, by devising orderly machinery for energy transduction to acquire energy. . . . Syntax of information, when described by thermodynamics, is associated with the entropy of the physical representation, and significance of information is associated with the entropy increase in the receiver system when it executes the encoded information.”

All would agree that the evolution of life over the past few billion years is truly wondrous. But, what is equally wondrous is that the human species has come to learn and master symbols. That mastery, in turn, has broken the bounds of organic evolution and has put into our hands the very means and structure of information itself. Via this entirely new — and incredibly accelerated — path to information structures, we are only now beginning to see some of its implications:

  • Unlike all other organisms, we dominate our environment and have experienced increasing wealth and freedom. Wealth increases and their universal applicability continue to increase at an exponential rate [20]
  • We no longer depend on the random variant to maximize our entropy producing structures. We can now do so purposefully and with symbologies and bit streams of our own devising
  • Potentially all information variants can be recorded and shared across all human individuals and generations, a complete decoupling from organic boundaries
  • Key ideas and abstractions, such as truth, justice and equality, can operate on a species-wide basis and become adopted without massive die-offs of individuals
  • We are actively moving ourselves into higher-level energy states, further increasing the potential for wealth and new structures
  • We are actively impacting our local environment, potentially creating the conditions for our species’ demise
  • We are increasingly engaging all individuals of the human species in these endeavors through literacy, education and access to global information sources. This provides a still further multiplier effect on humanity’s ability to devise and manipulate information structures into more adaptive and highly-ordered states.

The idea of a “meme” actually cheapens our understanding of these potentials.

Ideas matter and terminology matters. These are the symbols by which we define and communicate potentials. If we choose the wrong analogies or symbols — as “meme” is in this case — we are picking the option with the lower entropy potential. Whether I assert it to be so or not, the “meme” concept is an information structure doomed for extinction.


[1] Richard Dawkins, 1976. The Selfish Gene, Oxford University Press, New York City, ISBN 0-19-286092-5.
[2] This phrase was perhaps first made famous by Mark Twain or Bernard Baruch, but in any case is clearly understood now by all.
[3] According to Wikipedia, Benitez-Bribiesca calls memetics “a dangerous idea that poses a threat to the serious study of consciousness and cultural evolution”. He points to the lack of a coding structure analogous to the DNA of genes, and to instability of any mutation mechanisms for “memes” sufficient for standard evolution processes. See Luis Benitez Bribiesca, 2001. “Memetics: A Dangerous Idea”, Interciencia: Revista de Ciencia y Technologia de América (Venezuela: Asociación Interciencia) 26 (1): 29–31, January 2001. See http://redalyc.uaemex.mx/redalyc/pdf/339/33905206.pdf.
[4] Joseph Fracchia and R.C. Lewontin, 2005. “The Price of Metaphor”, History and Theory (Wesleyan University) 44 (44): 14–29, February 2005.
[5] See further M. K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” posting on AI3:::Adaptive Information blog, January 24, 2012. See http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[6] Terrence Deacon, 1999. “The Trouble with Memes (and what to do about it)”. The Semiotic Review of Books 10(3). See http://projects.chass.utoronto.ca/semiotics/srb/10-3edit.html.
[7] Kin selection refers to changes in gene frequency across generations that are driven at least in part by interactions between related individuals. Some mathematical models show how evolution may favor the reproductive success of an organism’s relatives, even at a cost to an individual organism. Under this mode, selection can occur at the level of populations and not the individual or the gene. Kin selection is often posed as the mechanism for the evolution of altruism or social insects. Among others, kin selection and inclusive fitness was popularized by W. D. Hamilton and Robert Trivers.
[8] You may want to see my statement of purpose under the Blogasbörd topic, first written seven years ago when I started this blog.
[9] Claude E. Shannon, 1948. “A Mathematical Theory of Communication”, Bell System Technical Journal, 27: 379–423, 623-656, July, October, 1948. See http://cm.bell-labs.com/cm/ms/what/shannonday/shannon1948.pdf.
[10] As Shannon acknowledges in his paper, the “bit” term was actually suggested by J. W. Tukey. Shannon can be more accurately said to have popularized the term via his paper.
[12] See Thomas D. Schneider, 2012. “Information Is Not Entropy, Information Is Not Uncertainty!,” Web page retrieved April 4, 2012; see http://www.lecb.ncifcrf.gov/~toms/information.is.not.uncertainty.html.
[13] The “negative entropy” (also called negentropy or syntropy) of a living system is the entropy that it exports to keep its own entropy low, and according to proponents lies at the intersection of entropy and life. The concept and phrase “negative entropy” were introduced by Erwin Schrödinger in his 1944 popular-science book What is Life?. See Erwin Schrödinger, 1944. What is Life – the Physical Aspect of the Living Cell, Cambridge University Press, 1944. A copy may be downloaded at http://old.biovip.com/UpLoadFiles/Aaron/Files/2005051204.pdf.
[14] R. Buckminster Fuller, 1981. Critical Path, St. Martin’s Press, New York City, 471 pp. See especially p. 103 ff.
[15] The seminal paper first presenting this argument is Vivek Sharma and Arto Annila, 2007. “Natural Process – Natural Selection”, Biophysical Chemistry 127: 123-128. See http://www.helsinki.fi/~aannila/arto/natprocess.pdf. This basic theme has been much expanded upon by Annila and his various co-authors. See, for example, [16] and [19], among many others.
[16] Arto Annila and Erkki Annila, 2008. “Why Did Life Emerge?,” International Journal of Astrobiology 7(3 and 4): 293-300. See http://www.helsinki.fi/~aannila/arto/whylife.pdf.
[17] According to Wikipedia, the principle (or “law”) of maximum entropy production is an aspect of non-equilibrium thermodynamics, a branch of thermodynamics that deals with systems that are not in thermodynamic equilibrium. Most systems found in nature are not in thermodynamic equilibrium and are subject to fluxes of matter and energy to and from other systems and to chemical reactions. One fundamental difference between equilibrium thermodynamics and non-equilibrium thermodynamics lies in the behavior of inhomogeneous systems, which require for their study knowledge of rates of reaction which are not considered in equilibrium thermodynamics of homogeneous systems. Another fundamental difference is the difficulty in defining entropy in macroscopic terms for systems not in thermodynamic equilibrium.
The principle of maximum entropy production states that, in comparing two or more alternate paths for crossing an energy gradient, the one that creates the maximum entropy change will be favored. The maximum entropy (sometimes abbreviated MaxEnt or MaxEp) concept is related to this notion. It is also known as the maximum entropy production principle, or MEPP.
[18] The actual number of Mayan books burned by the Spanish conquistadors is unknown, but is somewhere between tens and thousands; see here. Only three or four codexes are known to survive today. Also, Wikipedia contains a listing of notable book burnings throughout history.
[19] Mahesh Karnani, Kimmo Pääkkönen and Arto Annila, 2009. “The Physical Character of Information,” Proceedings of the Royal Society A, April 27, 2009. See http://www.helsinki.fi/~aannila/arto/natinfo.pdf.
[20] I discuss and chart the exponential growth of human wealth based on Angus Maddison data in M. K. Bergman, 2006. “The Biggest Disruption in History: Massively Accelerated Growth Since the Industrial Revolution,” post in AI3:::Adaptive Information blog, July 27, 2006. See http://www.mkbergman.com/250/the-biggest-disruption-in-history-massively-accelerated-growth-since-the-industrial-revolution/.

by Mike Bergman at April 04, 2012 05:50 PM

April 01, 2012

Project squin

Report from EDBT/ICDT 2012 (with a focus on RDF and SPARQL)

This week I attended the EDBT/ICDT 2012 Joint Conference which took place at the main building of our university in the city center of Berlin. I was very pleased to see that people are slowly beginning to recognize (distributed) RDF data and SPARQL as interesting database research topics, beyond the attitude of assuming that 40 years of work on relational data management can trivially be adapted for all Semantic Web data management problems. Four papers were particularly interesting in the context of answering SPARQL queries over multiple (and potentially distributed) RDF datasets. In what follows, I briefly summarize these works.

The most interesting paper was Fabian Prasser et al.’s “Efficient Distributed Query Processing for Autonomous RDF Databases”. The authors adapt the idea of data summaries for Linked Data (as introduced by Andreas Harth et al.) to a federated setting. Hence, while Andreas and his colleagues focused on queries over RDF data accessible via a Linked Data interface (i.e. URI lookups), Fabian’s work may be used for distributed query processing over a federation of SPARQL endpoints (similar to FedX, SPLENDID, and ANAPSID). In a nutshell, the idea is that a query mediator system holds a global index that summarizes the data from all federation members. For any incoming query the mediator uses this index to i) identify members that might contribute results to any particular part of the query, ii) to decide on which of these partial results are guaranteed to be irrelevant for the overall query result (and, thus, can be ignored for query planning), and iii) to select a suitable join order for the distributed (sub)query execution. Fabian’s paper introduces some clever techniques for pruning irrelevant subqueries during query planning and for reducing the number of irrelevant join candidates during query execution. As a result, less data must be transferred for query execution and, thus, overall query execution times decrease. A main downside of Fabian’s approach is the need to access all data of all federation members for constructing the global data summary; this requirement is hardly satisfiable for truly autonomous data sources on the Web. Furthermore, the approach is unsuitable if the data changes. At least the latter issue is mentioned as future work in the paper. Furthermore, it would be interesting to see how the approach compares to the approaches implemented in other SPARQL federation systems (such as those that I mentioned before).
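The three mediator steps described above can be illustrated with a small sketch. It is not the data structure from the paper: it assumes a much cruder summary that only records which federation member uses which predicate, and all names (`build_summary`, `select_sources`, `order_patterns`, the member names and prefixed terms) are hypothetical.

```python
from collections import defaultdict

# Triple patterns are (subject, predicate, object) tuples.
# Terms starting with "?" are variables. This format is an assumption.

def build_summary(members):
    """Map each predicate to the set of federation members whose data uses it."""
    summary = defaultdict(set)
    for name, triples in members.items():
        for _s, p, _o in triples:
            summary[p].add(name)
    return summary

def select_sources(summary, patterns):
    """Step i: for each pattern, list the members that might contribute results."""
    all_members = set().union(*summary.values())
    plan = {}
    for pattern in patterns:
        _s, p, _o = pattern
        if p.startswith("?"):
            # A variable predicate cannot be pruned with this simple summary.
            plan[pattern] = set(all_members)
        else:
            plan[pattern] = set(summary.get(p, set()))
    return plan

def order_patterns(plan):
    """Steps ii and iii: prune hopeless queries, then order patterns for joining."""
    if any(not sources for sources in plan.values()):
        # No member can answer some pattern, so the conjunctive query is empty.
        return []
    # Evaluate patterns with the fewest candidate sources first.
    return sorted(plan, key=lambda pat: (len(plan[pat]), pat))
```

With two members where only one holds `foaf:name` triples, the pattern using `foaf:name` would be assigned to that member alone and scheduled first. The sketch also makes the stated downside visible: `build_summary` needs to read all data of all members.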

The other relevant EDBT paper was Petros Tsialiamanis et al.’s “Heuristic based Query Optimisation for SPARQL”. This paper focuses on generating efficient query execution plans for SPARQL queries over a single RDF dataset. However, instead of proposing another cost-based strategy the authors introduce a heuristics-based approach that does not need any information (such as statistics) about the queried data. As a result, the approach is well-suited for the use case of querying a frequently updated collection of Linked Data copied from the Web (for which generating statistics and keeping them up-to-date may easily become infeasible).
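To give a feel for statistics-free planning, here is a minimal sketch of one generic heuristic: evaluate triple patterns with more bound terms first, on the assumption that they tend to be more selective. This is an illustration only and not necessarily one of the heuristics proposed in the paper; the function names and the tuple representation of patterns are hypothetical.

```python
def is_variable(term):
    """Variables are written with a leading '?', as in SPARQL."""
    return term.startswith("?")

def bound_count(pattern):
    """Number of constant (non-variable) terms in a triple pattern."""
    return sum(not is_variable(term) for term in pattern)

def order_by_heuristic(patterns):
    """Order patterns so that those with more bound terms come first.

    No statistics about the data are consulted. The sort is stable,
    so ties keep their original order.
    """
    return sorted(patterns, key=lambda pattern: -bound_count(pattern))
```

Because the ordering depends only on the query text, it stays valid when the underlying data is updated.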

At the EDBT demo session I discovered the SPARQL-RW Framework (see “SPARQL-RW: Transparent Query Access over Mapped RDF Data Sources” by Konstantinos Makris et al.). This system was able to transparently answer SPARQL queries over two RDF datasets, each of which used a different vocabulary. This functionality was achieved by rewriting the queries using a set of (schema level) mapping rules. The system only supported a global-as-view rewriting strategy.
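As a rough illustration of query rewriting with schema-level mappings, the sketch below substitutes vocabulary terms in triple patterns one for one. Real mapping rules are richer than a lookup table, and nothing here reflects the actual SPARQL-RW implementation; the function names and the example terms are hypothetical.

```python
def rewrite_pattern(pattern, mapping):
    """Replace each term that has a mapping; leave variables and unmapped terms alone."""
    return tuple(mapping.get(term, term) for term in pattern)

def rewrite_query(patterns, mapping):
    """Rewrite every triple pattern of a query into the target vocabulary."""
    return [rewrite_pattern(pattern, mapping) for pattern in patterns]

# Hypothetical mapping from a source vocabulary to a target vocabulary.
MAPPING = {
    "src:Person": "foaf:Person",
    "src:fullName": "foaf:name",
}
```

A query written against the `src:` vocabulary can then be sent unchanged by the user and rewritten by the system before it reaches the dataset that uses the `foaf:` terms.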

Finally, the last day of the conference was reserved for workshops, which included the 2nd Int. Workshop on Linked Web Data Management (LWDM 2012) and the Workshop on Data Analytics in the Cloud (DanaC). In the latter I attended Zoi Kaoudi’s talk on “RDF Data Management in the Amazon Cloud”. Zoi and her colleagues studied several approaches for indexing multiple RDF datasets in Amazon’s SimpleDB in order to use such an index for deciding which of the datasets may be relevant for a given SPARQL query.

In addition to these four RDF and SPARQL related papers there was a lot of other interesting work presented at the conference. I recommend checking out the conference proceedings; there are also videos of the keynotes which you may want to watch.

Olaf

by Olaf Hartig at April 01, 2012 07:18 AM

March 27, 2012

AI3:::Adaptive Information (Mike Bergman)

Tortured Terminology and Problematic Prescriptions

Casting My Vote on Revising httpRange-14

The httpRange-14 issue and its predecessor “identity crisis” debate have been active for more than a decade on the Web [1]. It has been around so long that most acknowledge “fatigue” and it has acquired that rarified status as a permathread. Many want to throw up their hands when they hear of it again and some feel — because of its duration and lack of resolution — that there never will be closure on the question. Yet everyone continues to argue and then everyone wonders why actual consumption of linked data remains so problematic.

Jonathan Rees is to be thanked for refusing to let this sleeping dog lie. This issue is not going to go away so long as its basis and existing prescriptions are, in essence, incoherent. As a member of the W3C’s TAG (Technical Architecture Group), Rees has worked diligently to re-surface and re-frame the discussion. While I don’t agree with some of the specifics and especially with the constrained approach proposed for resolving this question [2], the sleeping dog has indeed been poked and is awake. For that we can thank Jonathan. Maybe now we can get it right and move on.

I don’t agree with how this issue has been re-framed and I don’t agree that responses to it must be constrained to the prescriptive approach specified in the TAG’s call for comments. Yet, that being said, as someone who has been vocal for years about the poor semantics of the semantic Web community, I feel I have an obligation to comment on this official call.

Thus, I am casting my vote behind David Booth’s alternative proposal [3], with one major caveat. I first explain the caveat and then my reasons for supporting Booth’s proposal. I have chosen not to submit a separate alternative in order to not add further to the noise, as Bernard Vatant (and, I’m sure, many, many others) has chosen [4].

Bury the Notion of ‘Information Resource’ Once and for All

I first commented on the absurdity of the ‘information resource’ terminology about five years ago [5]. Going back to Claude Shannon [6] we have come to understand information as entropy (or, more precisely, as differences in energy state). One need not get that theoretical to see that this terminology is confusing. “Information resource” is a term that defies understanding (meaning) or precision. It is also a distinction that leads to a natural counter-distinction, the “non-information resource”, which is also an imprecise absurdity.

What the confusing term is meant to encompass is web-accessible content (“documents”), as opposed to descriptions of (or statements about) things. This distinction then triggers a different understanding of a URI (locator v identifier alone) and different treatments of how to process and interpret that URI. But the term is so vague and easily misinterpreted that all of the guidance behind the machinery to be followed gets muddied, too. Even in the current chapter of the debate, key interlocutors confuse and disagree as to whether a book is an “information resource” or not. If we can’t basically separate the black balls from the white balls, how are we to know what to do with them?

If there must be a distinction, it should be based on the idea of the actual content of a thing — or perhaps more precisely web-accessible content or web-retrievable content — as opposed to the description of a thing. If there is a need to name this class of content things (a position that David Booth prefers, pers. comm.), then let’s use one of these more relevant terms and drop “information resource” (and its associated IR and NIR acronyms) entirely.

The motivation behind the “information resource” terminology also appears to be a desire that somehow a URI alone can convey the name of what a thing is or what it means. I recently tried to blow this notion to smithereens by using Peirce’s discussion of signs [1]. We should understand that naming and meaning may only be provided by the owner of a URI through additional explication, and then through what is understood by the recipient; the string of the URI itself conveys very little (or no) meaning in any semantic sense.

We should ban the notion of “information resource” forever. If the first exposure a potential new publisher or consumer of linked data encounters is “information resource”, we have immediately lost the game. Unresolvable abstractions lead to incomprehension and confusion.

The approach taken by the TAG in requesting new comments on httpRange-14 only compounds this problem. First, the guidance is to not allow any questioning of the “information resource” terminology within the prescribed comment framework [7]. Then, in the suggested framework for response, still further terminology such as “probe URIs”, “URI documentation carrier” or “nominal URI documentation carrier for a URI” is introduced. Aaaaarggghh! This only furthers the labored and artificial terminology common to this particular standards effort.

While Booth’s proposal does not call for an outright rejection of the “information resource” terminology (my one major qualification in supporting it), I like it because it purposefully sidesteps the question of the need to define “information resource” (see his Section 2.7). Booth’s proposal is also explicit in its rejection of implied meaning in URIs and through embrace of the idea of a protocol. Remember, all that is being put forward in any of these proposals is a mechanism for distinguishing between retrievable content obtainable at a given URL and a description of something found at a URI. By ratcheting down the implied intent, Booth’s proposal is more consistent with the purpose of the guidance and is not guilty of overreach.

Keep It Simple

One of the real strengths of Booth’s proposal is its rejection of the prescriptive method proposed by the TAG for suggesting an alternative to httpRange-14 [7]. The parsimonious objective should be to be simple, be clear, and be somewhat relaxed in terms of mechanisms and prescriptions. I believe use patterns — negotiated via adoption between publishers and consumers — will tell us over time what the “right” solutions may be.

Amongst the proposals put forward so far, David Booth’s is the most “neutral” with respect to imposed meanings or mechanisms, and is the simplest. Though I quibble in some respects, I offer qualified support for his alternative because it:

  • Sidesteps the “information resource” definition (though weaker than I would want; see above)
  • Addresses only the specific HTTP and HTTPS cases
  • Avoids the constrained response format suggested by the TAG
  • Explicitly rejects assigning innate meanings to URIs
  • Poses the solution as a protocol (an understanding between publisher and consumer) rather than defining or establishing a meaning via naming
  • Provides multiple “cow paths” by which resource definitions can be conveyed, which gives publishers and consumers choice and offers the best chance for more well-trodden paths to emerge
  • Does not call for an outright repeal of the httpRange-14 rule, but retains it as one of multiple options for URI owners to describe resources
  • Permits the use of an HTTP 200 response with RDF content as a means of conveying a URI definition
  • Retains the use of the hash URI as an option
  • Provides alternatives to those who cannot easily (or at all) use the 303 See Other redirect mechanism, and
  • Simplifies the language and the presentation.

I would wholeheartedly support this approach were two things to be added: 1) the complete abandonment of all “information resource” terminology; and 2) an official demotion of the httpRange-14 rule (replacing it with a slash 303 option on equal footing to other options), including a disavowal of the “information resource” terminology. I suspect if the TAG adopts this option, that subsequent scrutiny and input might address these issues and improve its clarity even further.
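For readers new to the debate, the mechanics behind the hash, 200 and 303 options can be sketched in a few lines. This is only an illustration of the baseline conventions under discussion, not Booth's protocol; in particular it does not model a 200 response carrying RDF as a URI definition. The function name and the result labels are hypothetical.

```python
from urllib.parse import urldefrag

def interpret_response(uri, status, location=None):
    """Classify what one dereferencing attempt suggests about `uri`.

    Returns a (label, target) pair. The labels are illustrative only.
    """
    base, fragment = urldefrag(uri)
    if fragment:
        # Hash URI: only the part before '#' is ever sent to the server,
        # so any description lives in the document at `base`.
        return ("hash", base)
    if 200 <= status < 300:
        # Under the httpRange-14 rule, a 2xx response means the URI
        # identifies the retrievable content itself.
        return ("content", uri)
    if status == 303 and location:
        # 303 See Other: the URI may identify anything; a description
        # is offered at the redirect target.
        return ("described-by", location)
    return ("unknown", None)
```

The sketch takes the status code and `Location` header as arguments rather than making a request, so the rule can be read and tested without network access.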

There are other alternatives submitted, prominently the one by Jeni Tennison with many co-signatories [8]. This one, too, embraces multiple options and cow paths. However, it has the disadvantage of embedding itself into the same flawed terminology and structure as offered by httpRange-14.


[1] For my recent discussion about the history of these issues, see M.K. Bergman, 2012. “Give Me a Sign: What Do Things Mean on the Semantic Web?,” in AI3:::Adaptive Information blog, January 24, 2012; see http://www.mkbergman.com/994/give-me-a-sign-what-do-things-mean-on-the-semantic-web/.
[2] In all fairness, this call was the result of ISSUE-57, which had its own constraints. Not knowing all of the background that led to the httpRange-14 Pandora’s Box being opened again, the benefit of the doubt would be that the form and approach prescribed by the TAG dictated the current approach. In any event, now that the Box is open, all pertinent issues should be addressed and the form of the final resolution should also not be constrained from what makes best sense and is most pragmatic.
[3] David Booth’s alternative proposal is for the “URI Definition and Discovery Protocol” (uddp). The actual submission according to form is found here.
[4] See Bernard Vatant, 2012. “Beyond httpRange-14 Addiction,” the wheel and the hub blog, March 27, 2012. See http://blog.hubjects.com/2012/03/beyond-httprange-14-addiction.html.
[5] M.K. Bergman, 2007. “More Structure, More Terminology and (hopefully) More Clarity,” in AI3:::Adaptive Information blog, July 27, 2007; see http://www.mkbergman.com/391/more-structure-more-terminology-and-hopefully-more-clarity/. Subsequent to that piece, I have written further on semantic Web semantics in “The Semantic Web and Industry Standards” (January 26, 2008), “The Shaky Semantics of the Semantic Web” (March 12, 2008), “Semantic Web Semantics: Arcane, but Important” (April 8, 2008), “The Semantics of Context” (May 6, 2008), “When Linked Data Rules Fail” (November 16, 2009), “The Semantic ‘Gap’” (October 24, 2010) and [1].
[6] Claude E. Shannon, 1948. “A Mathematical Theory of Communication,” Bell System Technical Journal, Vol. 27, pp. 379–423, 623–656, 1948.
[7] In the “Call for proposals to amend the ‘httpRange-14 resolution’” (February 29, 2012), Jonathan Rees (presumably on behalf of the TAG) stated this as one of the rules of engagement: “9. Kindly avoid arguing in the change proposals over the terminology that is used in the baseline document. Please use the terminology that it uses. If necessary discuss terminology questions on the list as document issues independent of the 303 question.” The specific template form for alternative proposals was also prescribed. In response to interactions on this question on the mailing list, Jonathan stated:

If it were up to me I’d purge “information resource” from the document, since I don’t want to argue about what it means, and strengthen the (a) clause to be about content or instantiation or something. But the document had to reflect the status quo, not things as I would have liked them to be.
I have not submitted this as a change proposal because it doesn’t address ISSUE-57, but it is impossible to address ISSUE-57 with a 200-related change unless this issue is addressed, as you say, head on. This is what I’ve written in my TAG F2F preparation materials.
[8] Jeni Tennison, 2012. “httpRange-14 Change Proposal,” submitted March 25, 2012. See the mailing list notice and actual proposal.

by Mike Bergman at March 27, 2012 11:45 PM

March 22, 2012

DBpedia Blog

DBpedia Spotlight has been selected for Google Summer of Code. Please apply now!

The Google Summer of Code (GSoC) is a global program that offers student developers (BSc, MSc, PhD) stipends to write code for open source software projects. It has had thousands of participants since the first edition in 2005, connecting prospective students with mentors from open source communities such as Debian, KDE, Gnome, Apache Software Foundation, Mozilla, etc.

For the students, it is a great chance to get real-world software development experience. For the open source communities, it is a chance to expand their development community. For everybody else, more source code is created and released for the benefit of all!

We are thrilled to announce that our open source project DBpedia Spotlight has been selected for the Google Summer of Code 2012.

We are now seeking students interested in working with us to enhance operational aspects of DBpedia Spotlight, as well as to engage in research activities in collaboration with our team. If you are an energetic developer, passionate about open source and interested in areas related to DBpedia Spotlight, please get in touch with us!

We have shared a number of project ideas to get you started.

To apply, visit: http://www.google-melange.com/gsoc/org/google/gsoc2012/dbpediaspotlight

If you would like to see DBpedia Spotlight in action, helping you to explore available projects within GSoC 2012, please visit our demonstration page at: http://spotlight.dbpedia.org/gsoc/


by ChrisBizer at March 22, 2012 12:18 PM

March 15, 2012

AI3:::Adaptive Information (Mike Bergman)

TechWiki Gets 400th Document

Phenomenal Growth in Less than Two Years

Today, for the first time, we passed 400 articles published on the open semantic framework (OSF) TechWiki. The TechWiki content is a baseline “starter kit” of documentation related to these OSF projects and their contexts:

  • conStruct – connecting modules to enable structWSF and sComponents to be hosted/embedded in Drupal
  • structWSF – platform-independent suite of more than 20 RESTful Web services, organized for managing structured data datasets
  • Semantic Components – JavaScript or Flex semantic components (widgets) for visualizing and manipulating structured data
  • irON – instance record Object Notation for conveying XML, JSON or spreadsheets (CSV) in RDF-ready form, and
  • Various parsers and standard data exchange formats and schema to facilitate information flow amongst these options.

The TechWiki covers all aspects of this open source OSF software stack. Besides the specific components developed and maintained by Structured Dynamics as listed above, the OSF stack combines many leading third-party software packages — such as Drupal for content management, Virtuoso for (RDF) triple storage, Solr for full-text indexing, GATE for natural language processing, the OWL API for ontology management, and others.

The TechWiki is the one-stop resource for how to install, configure, use and maintain these components. The best entry point to the OSF content on the TechWiki is represented by this entry page covering overall workflows in use of the system:

Since our first release of the TechWiki in July 2010, we have been publishing and releasing content steadily. We post a new article about every 1.5 calendar days, or about one per working day. This content is well-organized into (at present) 72 categories and is supported by nearly 500 figures and diagrams. Users are free to download and use this content at will, solely by providing attribution. The content has proven to be a goldmine for local use and modification by our clients, and for training and curriculum development.

The TechWiki represents a part of our commitment that we are successful when our customers no longer need us. It is one of our most popular Web sites, with fantastic and growing user stats. We invite you to visit and see what it means to provide open source semantic technologies as a total open solution.

by Mike Bergman at March 15, 2012 12:05 AM

February 27, 2012

Frederick Giasson's Weblog

New Mapping Semantic Component In JavaScript

I am pleased to announce the release of the new sWebMap Semantic Component in JavaScript. This new mapping component is a standalone JavaScript application that can be integrated into any new or existing web site and that interacts with an Open Semantic Framework (OSF) instance to search, browse, filter and display geographically-located information on an interactive map.

Features

The sWebMap is a rich mapping tool that can easily be integrated on any webpage, and that can be extensively customized. The sWebMap supports these features:

  • Full text search for searching and displaying results on a map
  • Extensive filtering capabilities
    • Filtering by dataset source
    • Filtering by type
    • Filtering by attribute/value
    • Filtering of records that belong to a specific geographic region
  • Display of records on the map using:
    • Different markers depending on the type of record to display (determined by the ontologies)
    • Polygon shapes for records that refer to a geographic region
    • Polyline shapes for records that refer to a geographically-located path
  • Templating of records in a resultset depending on their type
  • Templating of records’ preview, displayed in an overlay window, depending on their type
  • Persist records on the map across searches and filtering operations
  • Supports map sessions
    • Save map sessions
    • Load saved map sessions
    • Delete saved map sessions
    • Share saved map sessions
  • Supports a multiple-maps mode
    • Three focus maps are available under the main map
    • Each map focuses on a particular region of the main map
    • Users can switch between focus maps to see different records in different regions

 

Normal Mode

Here is what the default sWebMap looks like in normal mode, using a few datasets related to the city of Iowa. You can also interact with this sWebMap instance directly on the Citizen DAN demo website here.


Multiple Windows Mode

Here is what the default sWebMap looks like in multiple windows mode, using a few datasets related to the city of Iowa. You can also interact with this sWebMap instance directly on the Citizen DAN demo website here.

 


 

Under the Hood: The Open Semantic Framework

Each sWebMap component communicates with an OSF (Open Semantic Framework) instance. More specifically, a sWebMap component will send Search/Filtering queries to a geo-enabled structWSF Search web service endpoint.

Depending on the options you specified when you created the sWebMap control, each time you move the map (option), zoom (option) or change the filtering criteria, the control sends a query to the Search endpoint. The sWebMap control then requests a JSON-formatted resultset and displays the results to the user.
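The request cycle described above can be pictured with a small sketch. This is not the sWebMap or structWSF code: the class names, the option switches, the endpoint URL and every query parameter name (q, bbox, format, filter) are assumptions made up for illustration only.

```python
from dataclasses import dataclass, field
from urllib.parse import urlencode


@dataclass
class Viewport:
    north: float
    south: float
    east: float
    west: float


@dataclass
class MapOptions:
    # Hypothetical switches mirroring the "(option)" notes in the text.
    search_on_move: bool = True
    search_on_zoom: bool = True


@dataclass
class MapState:
    viewport: Viewport
    query: str = ""
    filters: dict = field(default_factory=dict)


def should_search(event: str, options: MapOptions) -> bool:
    """Decide whether a user action should trigger a new search request."""
    if event == "move":
        return options.search_on_move
    if event == "zoom":
        return options.search_on_zoom
    # Changing the filtering criteria always triggers a query.
    return event == "filter"


def build_search_url(endpoint: str, state: MapState) -> str:
    """Encode the current map state as a query string (parameter names are illustrative)."""
    params = {
        "q": state.query,
        "bbox": ",".join(
            str(v) for v in (state.viewport.south, state.viewport.west,
                             state.viewport.north, state.viewport.east)
        ),
        "format": "json",
    }
    # One "filter" parameter per attribute/value pair, sorted for stable output.
    pairs = list(params.items()) + [
        ("filter", f"{name}:{value}") for name, value in sorted(state.filters.items())
    ]
    return f"{endpoint}?{urlencode(pairs)}"
```

With move-triggered searches switched off, should_search("move", ...) returns False while zoom and filter changes still produce a request, which matches the idea that the triggering events are configurable.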

This means that to implement the sWebMap component on your website, you will need to have:

Download

You can immediately download the entire source code from this GitHub repository:

Installation

Installing the sWebMap component is really easy. In fact, you only have to load a few JavaScript and CSS files, define a <div></div> container for the map, and create a sWebMap component object, which is a single line of code.

Additionally, you can initialize the sWebMap component with one of the multiple options available.

Refer to the Usage section of the sWebMap component to learn exactly how to install and set up a sWebMap component instance.

Resources

Here are some additional resources related to the sWebMap component:

 

by Frederick Giasson at February 27, 2012 06:16 PM

AI3:::Adaptive Information (Mike Bergman)

OSF Gains Powerful, New Mapping Component

Open Semantic Framework

Ontology-driven Application Meshes Structured Data with Public APIs

Locational information — points of interest/POIs, paths/routes/polylines, or polygons/regions — is common to many physical things in our real world. Because of its pervasiveness, it is important to have flexible and powerful display widgets that can respond to geo-locational data. We have been working for some time to extend our family of semantic components [1] within the open semantic framework (OSF) [2] to encompass just such capabilities. Structured Dynamics is thus pleased to announce that we have now added the sWebMap component, which marries the entire suite of Google Map API capabilities to the structured data management arising from the structWSF Web services framework [3] at the core of OSF.

The sWebMap component is fully in keeping with our design premise of ontology-driven applications, or ODapps [4]. The sWebMap component can itself be embedded in flexible layouts — using Drupal in our examples below — and can be very flexibly themed and configured. We believe sWebMap will rapidly move to the head of the class as the newest member of Structured Dynamics’ open source semantic components.

The absolutely cool thing about sWebMap is it just works. All one needs to do is relate it to a geo-enabled Search structWSF endpoint, and then all of the structured data with geo-locational attributes and its facets and structure becomes automagically available to the mapping widget. From there you can flexibly map, display, configure, filter, select and keep those selections persistent and share with others. As new structured data is added to your system, that data too becomes automatically available.

Key Further Links

Though screenshots of this component in operation are provided below, here are some further links to learn more:

sWebMap Overview

There is considerable functionality in the sWebMap widget, not all immediately obvious when you first view it.

NOTE: a wide variety of configuration options — icons and colors — matched with the specific data and base tiling maps appropriate to a given installation may produce maps of significantly different aspect from the screenshots presented below. Click on any screenshot to get a full-size view.

Here is an example for sWebMap when it first comes up, using an example for the “Beaumont neighborhood”:

It is possible to set pre-selected items for any map display. That was done in this case, which shows the pre-selected items and region highlighted on the map and in the records listing (lower left below map).

The basic layout of the map has its main search options at the top, followed by the map itself and then two panels underneath:

The left-hand panel underneath the map presents the results listing. The right-hand panel presents the various filter options by which these results are generated. The filter options consist of:

  • Sources – the datasets available to the instance
  • Kinds – the kinds or types of data (owl:Classes or rdf:types) contained within those datasets, and
  • Attributes – the specific attributes and their values for those kinds or sources.

As selections are made in sources or kinds, the subsequent choices narrow.

The layout below shows the key controls available on the sWebMap:

You can go directly to an affiliated page by clicking the upper right icon. This area often shows a help button or other guide. The search box below that enables you to search for any available data in the system. If there is information that can be mapped AND which occurs within the viewport of the current map size, those results will appear as one of three geographic feature types on the map:

  • Markers, which can be configured with differing icons for specific types or kinds of data
  • Polylines, such as highways or bus routes, or
  • Polygons, which enclose specific regions on the map through a series of drawn points in a closed area.
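The viewport rule and the three feature types above can be illustrated with a short sketch. It is a simplification under stated assumptions, not the component's actual logic: the Feature and Bounds types are hypothetical, a feature counts as visible when at least one of its vertices lies inside the viewport, and viewports that cross the 180° meridian are ignored.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]  # (latitude, longitude)


@dataclass
class Feature:
    label: str
    kind: str            # "marker", "polyline" or "polygon"
    points: List[Point]  # one point for a marker, several otherwise


@dataclass
class Bounds:
    south: float
    west: float
    north: float
    east: float

    def contains(self, point: Point) -> bool:
        lat, lng = point
        return self.south <= lat <= self.north and self.west <= lng <= self.east


def visible_features(features: List[Feature], viewport: Bounds) -> List[Feature]:
    """Keep features with at least one vertex inside the viewport.

    This is a simplification: a shape that crosses the viewport without
    having a vertex inside it would be missed.
    """
    return [f for f in features if any(viewport.contains(p) for p in f.points)]
```

A production map would more likely test for true geometric intersection, but the vertex test is enough to show why the listing changes as the map is moved.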

At the map’s right is the standard map control that allows you to scroll the map area or zoom. Like regular Google maps, you can zoom (+ or – keys, or middle wheel on mouse) or navigate (arrow direction keys, or left mouse down and move) the map.

Current records are shown below the map. Specific records may be selected with their checkboxes; this keeps them persistent on the map and in the record listing no matter what the active filter conditions may be. (You may also see a little drawing icon [Update record], which presents an attribute report — similar to a Wikipedia ‘infobox‘ — for the current record.) You can see in this case that the selected record also corresponds to a region (polygon) shape on the map.

sWebMap Views, Layers and Layouts

In the map area itself, it is possible to also get different map views by selecting one of the upper right choices. In this case, we can see a satellite view (or “layer”):

Or, we can choose to see a terrain layer:

Or there may optionally be other layers or views available in this same section.

Another option that appears on the map is the ability to get a street view of the map. That is done by grabbing the person icon at the map left and dragging it to where you are interested within the map viewport. That also causes the street portion to be highlighted, with street view photos displayed (if they exist for that location):

By clicking the person icon again, you then shift into walking view:

Via the mouse, you can now navigate up and down these streets and change perspective to get a visual feel for the area.

Multi-map View

Another option you may invoke is the multi-map view of the sWebMap. In this case, the map viewing area expands to include three sub-maps under the main map area. Each sub-map is color-coded and shown as a rectangle on the main map. (This particular example is displaying assessment parcels for the sample instance.) These rectangles can be moved on the main map, in which case their sub-map displays also move:

You must re-size using the sub-map (which then causes the rectangle size to change on the main map). You may also pan the sub-maps (which then causes the rectangle to move on the main map). The results list at the lower left is determined by which of the three sub-maps is selected (as indicated by the heavier bottom border). 

Searching and Filter Selections

There are two ways to get filter selection details for your current map: Show All Records or Search.

NOTE: for all data and attributes as described below, only what is visible on the current map view is shown under counts or records. Counts and records change as you move the map around.

In the first case, we pick the Show All Records option at the bottom of the map view, which then brings up the detailed filter selections in the lower-right panel:

Here are some tips for using the left-hand records listing:

  • If there are more than 10 records, pagination appears at the bottom of the listing
  • Each record is denoted by an icon for the kind of thing it is (bus stops v schools v golf courses, for example)
  • If we mouse over a given record in the listing, its marker icon on the map bounces to show where it resides
  • To the right of each record listing, the checkbox indicates whether you want the record to be maintained persistently. If you check it, the icon on the map changes color, the record is promoted to the top of the list where it becomes sticky and is given an alphabetic sequence. Unchecking this box undoes all of these changes
  • To the right of each record listing is also the view record [View raw attributes for the record] icon; clicking it shows the raw attribute data for that record.
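The listing behavior in these tips (sticky records promoted to the top with an alphabetic sequence, and pagination beyond 10 records) can be sketched as follows. The function names are hypothetical, and whether sticky records count toward the page size in the real component is an assumption here.

```python
from string import ascii_uppercase
from typing import List, Set, Tuple

PAGE_SIZE = 10  # the text says pagination appears beyond 10 records


def order_listing(records: List[str], pinned: Set[str]) -> List[Tuple[str, str]]:
    """Return (tag, record) pairs with pinned records first, tagged A, B, C...

    Unpinned records keep their original order and get an empty tag.
    """
    sticky = [r for r in records if r in pinned]
    rest = [r for r in records if r not in pinned]
    tagged = [(ascii_uppercase[i % 26], r) for i, r in enumerate(sticky)]
    return tagged + [("", r) for r in rest]


def page(listing: List[Tuple[str, str]], number: int) -> List[Tuple[str, str]]:
    """Return one page of the listing; page numbers start at 1."""
    start = (number - 1) * PAGE_SIZE
    return listing[start:start + PAGE_SIZE]
```

Unchecking a record simply removes it from the pinned set, so the next call to order_listing returns it to its original position without a tag.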

The records that actually appear on this listing are based on the records scope or Search (see below) conditions, as altered by the filter settings on the right-hand listing under the sWebMap. For example, if we now remove the neighborhood record from the persistent set and choose Show included records, we get items across the entire map viewport:

Search works in a similar fashion: it invokes the filter display, and the same left- and right-hand listings appear under the sWebMap, but now only for those records that meet the search conditions. (The allowable search syntax is that for Lucene.) Here is the result of a search, in this case for “school”:

As shown above, the right-hand panel is split into three sections: Sources (or datasets), Kinds (that is, similar types of things, such as bus stops v schools v golf courses), and Attributes (that is, characteristics for these various types of things). All selection possibilities are supported by auto-select.

Sources and Kinds are selected via checkbox. (The default state when none are checked is to show all.) As more of these items are selected, the records listing in the left-hand panel gets smaller. The counts of available items [as shown by the (XX) number at the end of each item] also change as filters are added or removed via the checkboxes.
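The way counts shrink as checkboxes are ticked is ordinary faceted filtering. Here is a minimal sketch, assuming records are plain dictionaries with hypothetical "source" and "kind" keys; it is an illustration of the idea rather than the component's code.

```python
from collections import Counter
from typing import Dict, List, Set

Record = Dict[str, str]


def apply_filters(records: List[Record], sources: Set[str], kinds: Set[str]) -> List[Record]:
    """Keep records matching the checked sources and kinds.

    An empty selection means "show all", matching the default state in the text.
    """
    return [
        r for r in records
        if (not sources or r["source"] in sources)
        and (not kinds or r["kind"] in kinds)
    ]


def facet_counts(records: List[Record], facet: str) -> Counter:
    """Count how many of the remaining records fall under each facet value."""
    return Counter(r[facet] for r in records)
```

Recomputing facet_counts on the filtered list after each checkbox change gives the (XX) numbers for the current selection.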

Applying filters to Attributes works a little differently. Attributes filters are selected by selecting the magnifier plus [Filter by attribute] icon, which then brings up a filter selection at the top of the listing underneath the Attributes header.

The specific values and their counts (for the current selection population) are then shown; you may pick one or more items. Once done, you may pick another attribute to add to the filter list, and continue the filtering process.

Saving and Sharing Your Filters

sWebMaps have a useful way to save and share their active filter selections. At any point as you work with a sWebMap, you can save all of its current settings and configurations — viewport area, filter selections, and persistent records — via some simple steps.

You initiate this functionality by choosing the save button at the upper right of the map panel:

When that option is invoked, it brings up a dialog where you are able to name the current session, and provide whatever explanatory notes you think might be helpful.

NOTE: the naming and access to these saved sessions is local to your own use only, unless you choose to share the session with others; see below.

Once you have a saved session, you will then see a new control at the upper right of your map panel. This control is how you load any of your previously saved sessions:

Further, once you load a session, still further options are presented that enable you to either delete or share that session:

If you choose to share a session, a shortened URI is generated automatically for you:

If you then provide that URI link to another user, that user can then click on that link and see the map in the exact same state — viewport area, filter selections, and persistent records — as you initially saved. If the recipient then saves this session, it will now also be available persistently for his or her local use and changes.
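One way to picture saving and sharing is as serializing the map state and handing out a short token that stands in for the shortened URI. The sketch below rests on assumptions throughout: the in-memory store, the function names and the hash-based token are illustrative and say nothing about how sWebMap actually generates its links.

```python
import hashlib
import json
from typing import Dict

# In-memory stand-in for whatever store a real deployment would use.
_sessions: Dict[str, str] = {}


def save_session(name: str, notes: str, viewport: dict, filters: dict, pinned: list) -> str:
    """Serialize the map state and return a short token that identifies it."""
    payload = json.dumps(
        {"name": name, "notes": notes, "viewport": viewport,
         "filters": filters, "pinned": pinned},
        sort_keys=True,
    )
    token = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:8]
    _sessions[token] = payload
    return token


def load_session(token: str) -> dict:
    """Restore the exact state that was saved under the token."""
    return json.loads(_sessions[token])
```

Because the token is derived from the saved state, a recipient who loads it sees the same viewport, filters and persistent records; saving a modified copy yields a new token to share back.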

NOTE: two users may interactively work together by sharing, saving and then modifying maps that they share again with their collaborator.

[1] A semantic component is a JavaScript or Flex component or widget that takes record descriptions and irXML schema as input, and then outputs interactive visualizations of those records. Depending on the logic described in the input schema and the input record descriptions, the semantic component may behave differently or provide presentation options to users. Each semantic component delivers a very focused set of functionality or visualization. Multiple components may be combined on the same canvas for more complicated displays and controls. At present, there are 12 individual semantic widgets in the available open source suite; see further the sComponent category on the TechWiki. By convention, all of the individual widgets in the semantic component suite are named with an ‘s’ prefix; hence, sWebMap.
[2] The open semantic framework, or OSF, is a combination of a layered architecture and an open-source, modular software stack. The stack combines many leading third-party software packages — such as Drupal for content management, Virtuoso for (RDF) triple storage, Solr for full-text indexing, GATE for tagging and natural language processing, the OWL2 API for ontology management and support, and others. These third-party tools are extended with open source developments from Structured Dynamics including structWSF (a RESTful Web services layer of about a dozen modules for interacting with the underlying data and data engines), conStruct (a series of Drupal modules that tie Drupal to the structWSF Web services layer), semantic components (data display and manipulation widgets, mostly based either in Flash or JavaScript, for working with the semantic data), various parsers and standard data exchange formats and schema to facilitate information flow amongst these options, and an ontologies layer that consists of both domain ontologies that capture the coherent concepts and relationships of the current problem space and administrative ontologies that govern how the other software layers interact with this structure.
[3] structWSF is a platform-independent Web services framework for accessing and exposing structured RDF (Resource Description Framework) data. Its central organizing perspective is that of the dataset. These datasets contain instance records, with the structural relationships amongst the data and their attributes and concepts defined via ontologies (schema with accompanying vocabularies). The structWSF middleware framework is generally RESTful in design and is based on HTTP and Web protocols and open standards. The current structWSF framework has a baseline set of more than 20 Web services in CRUD, browse, search, tagging, ontology management, and export and import.
[4] For the most comprehensive discussion of ODapps, see M. K. Bergman, 2011. “Ontology-Driven Apps Using Generic Applications,” posted on the AI3:::Adaptive Information blog, March 7, 2011. You may also search on that blog for ‘ODapps‘ to see related content.

by Mike Bergman at February 27, 2012 04:26 PM

February 14, 2012

AI3:::Adaptive Information (Mike Bergman)

The Conditional Costs of Free

Bandersnatch image from Final Fantasy VII, Japanese version

Shun the Frumious Bandersnatch?

The Web and open source have opened up a whole new world of opportunities and services. We can search the global information storehouse, connect with our friends and make new ones, form new communities, map where stuff is, and organize and display aspects of our lives and interests as never before. These advantages compound into still newer benefits via emergent properties such as social discovery or bookmarking, adding richness to our lives that heretofore had not existed.

And all of these benefits have come for free.

Of course, as our use and sophistication of the Web and open source have grown we have come to understand that the free provision of these services is rarely (ever?) unconditional. For search, our compact is to accept ads in return for results. For social networks, our compact is to give up some privacy and control of our own identities. For open source, our compact is the acceptance of (generally) little or no support and often poor documentation.

We have come to understand this quid pro quo nature of free. Where the providers of these services tend to run into problems is when they change the terms of the compact. Google, for example, might change how its search results are determined or presented or how it displays its ads. Facebook might change its privacy or data capture policies. Or, OpenOffice or MySQL might be acquired by a new provider, Oracle, that changes existing distribution, support or community involvement procedures.

Sometimes changes may fit within the acceptable parameters of the compact. But, if such changes fundamentally alter the understood compact with the user community, users may howl or vote with their feet. Depending on the circumstances, the service provider may relent, users may come to accept the new changes, or users may indeed drop the service.

The Hidden Costs of Dependence

But there is another aspect of the use of free services, the implications of which have been largely unremarked. What happens if a service we have come to depend upon is no longer available?

Abandonment or changes in service may arise from bankruptcy or a firm being acquired by another. My favorite search service of a decade ago, AltaVista, and Delicious are two prominent examples here. Existing services may be dropped by a provider or APIs removed or deprecated. For Google alone, examples include Wave and Gears, Google Labs, and many, many APIs. (The howls around Google Translate actually caused it to be restored.) And existing services may be altered, such as moving from free to fee or having capabilities significantly modified. Ning and Babbel are two examples here. There are literally thousands of examples of Web-based free services that have gone through such changes. Most have not seen widespread use, but have affected their users nonetheless.

There is nothing unique about free services in these regards. Ford was able to cease production of its Edsel and change the form factor of the Thunderbird despite some loyal fans. Sugar Pops morphed into a variety of breakfast cereal brands. Sony Betamax was beaten out by VHS, which then lost out to CDs and now DVDs. My beloved Saabs are heading for the dustbin, or Chinese ownership.

In all of these cases, as consumers we have no guarantees about the permanence of the service or the infrastructure surrounding it. The provider is solely able to make these determinations. It is no different when the service or offering is free. It is the reality of the marketplace that causes such changes.

But, somehow, with free Web services, it is easy to overlook these realities. I offer a couple of personal case studies.

Case Study #1: Site Search

I have earlier described the five different versions of site search that I have gone through for this blog. The thing is, my current option, Relevanssi, is also a free plug-in. What is notable about this example, though, is the multiple attempts and (unanticipated) significant effort to discover, evaluate and then implement alternatives. Unfortunately, I rather suspect my current option may itself — because of the nature of free on the Web — need to be replaced at some time down the road.

Case Study #2: FeedBurner

Part of what caused me to abandon Google Custom Search as one of the above search options was the requirement I serve ads on my blog to use it. So, when I decided to eliminate ads entirely in 2010 I not only gave up this search option, but I also lost some of the better tracking and analytics options also provided for free by Google. Fortunately, I had also adopted FeedBurner early in the life of this blog. It was also becoming increasingly clear that feed subscribers — in addition to direct site visitors — were becoming an essential metric for gauging traffic.

I thus had a replacement means for measuring traffic trends. Google (strange how it keeps showing up!) had purchased FeedBurner in 2007, and had made some nice site and feature improvements, including turning some paid services into free. The service was performing quite well, despite FeedBurner’s infamous knack for losing certain feed counts periodically. However, this performance broke last Summer when my site statistics indicated a massive drop in subscribers.

The figure below, courtesy of Feed Compare, shows the daily subscriber statistics for my AI3 blog for the past two years. The spikiness of the curve affirms the infamous statistics gaps of the service. The first part of the curve also shows nice, steady growth of readers, growing to more than 4000 by last Summer. Then, on August 16, there was a massive drop of 85% in my subscriber counts. I monitored this for a couple of days, thinking it was another temporary infamous event, then realized something more serious was afoot:

Drop in Reported Feedburner Subscribers

It was at this point I became active on the Google group for FeedBurner. Many others had noted the same service drop. (The major surmise is that FeedBurner now is having difficulty including Feedfetcher feeds, which is interesting because it is the feed of Google’s own Reader service, and the largest feed aggregation source on the Web.)

Over the ensuing months until last week I posted periodic notices to the official group seeking clarification as to the source of these errors and a fix to the service. In that period, no Google representative ever answered me, nor any of the numerous requests by others. I don’t believe there has been a single entry on any matter by Google staff for nearly the past year.

I made requests and inquiries no fewer than eight times over these months. True, Google had announced it was deprecating the FeedBurner API in May 2011, but, in that announcement, there was no indication that bug fixes or support to their own official group would cease. While it is completely within Google’s purview to do as it pleases, this behavior hardly lends itself to warm feelings by those using the service.

Finally, last week I dropped the FeedBurner stats and installed a replacement WordPress plugin service [1]. It was clear no fixes were forthcoming and I needed to regain an understanding of my actual subscriber base. The counts you now see on this site use this new service; they show the continuation of this site’s historical growth trend.

Is Google Becoming More Frumious?

It is not surprising that in the prior discussions Google figures prominently. It is the largest provider of APIs and free services on the Web. But, even with its continuing services, I am seeing trends that disturb me in terms of what I thought the “compact” was with the company.

I’m not liking recent changes to Google’s bread and butter, search. While they are doing much to incorporate more structure in their results, which I applaud, they are also making ranking, formatting and presentation changes I do not. I am now spending at least as much of my search time on DuckDuckGo, and have been mightily impressed with its cleanliness, quality and lack of ads in results.

I also do not like how all of my current service uses of Google are now being funneled into Google Plus. I am seeing an arrogance that Google knows what is best and wants to direct me to workflows and uses, reminiscent of the arrogance Microsoft came to assume at the height of its market share. How does that variant of Lord Acton’s dictum go? “Market share tends to corrupt, and absolute market share corrupts absolutely.”

We are seeing Google’s shift to monetize extremely popular APIs such as Maps and Translate. My company, Structured Dynamics, has utilized these services heavily for client work in the past. We now must find alternatives or cost the payment for these services into the ongoing economics of our customer installations. Of course, charging for these services is Google’s right, but it does change the equation and causes us to evaluate alternatives.

I fear that Google may be turning into a frumious Bandersnatch. I’m not sure we will shun it, but we certainly are changing our views of the basis by which we engage or not with the company and its services. Once we shift from a basis of free, our expectations as to permanence and support change as well.

Big Boys Don’t Cry

This is not a diatribe against Google nor a woe is us. Us big kids have come to know that there is no such thing as a free lunch. But that message is getting reaffirmed now more strongly in the Web context.

There can be benefits from seeking, installing or adapting to new alternatives with different service profiles when dependent services are abandoned or deprecated. Learning always takes place. Accepting one’s own responsibility for desired services also leads to control and tailoring for specific needs. Early use of free services also educates about what is desired or not, which can lead to better implementation choices if and when direct responsibility is assumed.

But, in some areas, we are seeing services or uses of the Web that we should adopt only with care or even shun. Business opportunities that depend on third-party services or APIs are very risky. Strong reliance on single-provider service ecosystems adds fragility to dependence. Our own systems should be designed not to depend too strongly on specific API providers and their unique features or parameters.

Free is not forever, and it is conditional. Substitutability is a good design practice to embrace.
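Substitutability can be made concrete with a small sketch: the application depends only on a narrow interface, and providers behind it can be swapped or chained. Everything here is hypothetical, including the Geocoder interface, both provider classes and the made-up coordinates; it illustrates the design practice rather than any particular vendor's API.

```python
from typing import List, Optional, Protocol, Tuple


class Geocoder(Protocol):
    """The only surface the application is allowed to depend on."""

    def locate(self, address: str) -> Optional[Tuple[float, float]]: ...


class HostedGeocoder:
    """Stand-in for a third-party API that may be withdrawn or start charging."""

    def __init__(self, available: bool = True) -> None:
        self.available = available

    def locate(self, address: str) -> Optional[Tuple[float, float]]:
        if not self.available:
            raise ConnectionError("service withdrawn")
        return (41.66, -91.53)  # made-up coordinates for illustration


class LocalGeocoder:
    """Stand-in for a self-hosted replacement with a small lookup table."""

    def __init__(self, table: dict) -> None:
        self.table = table

    def locate(self, address: str) -> Optional[Tuple[float, float]]:
        return self.table.get(address)


def locate_with_fallback(address: str, providers: List[Geocoder]) -> Optional[Tuple[float, float]]:
    """Try each provider in order, moving on when one fails or has no answer."""
    for provider in providers:
        try:
            result = provider.locate(address)
        except ConnectionError:
            continue
        if result is not None:
            return result
    return None
```

If the hosted service disappears, the calling code does not change; only the list of providers does.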


[1] I may detail at a later time how this replacement service was set up.

by Mike Bergman at February 14, 2012 01:02 AM

January 24, 2012

AI3:::Adaptive Information (Mike Bergman)

Give Me a Sign: What Do Things Mean on the Semantic Web?

The Triadic of Signs

Coca-Cola, Toucans and Charles Sanders Peirce

The crowning achievement of the semantic Web is the simple use of URIs to identify data. Further, if the URI identifier can resolve to a representation of that data, it now becomes an integral part of the HTTP access protocol of the Web while providing a unique identifier for the data. These innovations provide the basis for distributed data at global scale, all accessible via Web devices such as browsers and smartphones that are now a ubiquitous part of our daily lives.

Yet, despite these profound and simple innovations, the semantic Web’s designers and early practitioners and advocates have been mired in a muddled, metaphysical argument of at least a decade over what these URIs mean, what they reference, and what their actual true identity is. These muddles about naming and identity, it might be argued, are due to computer scientists and programmers trying to grapple with issues more properly the domain of philosophers and linguists. But that would be unfair. For philosophers and linguists themselves have for centuries also grappled with these same conundrums [1].

As I argue in this piece, part of the muddle results from attempting to do too much with URIs while another part results from not doing enough. I am also not trying to directly enter the fray of current standards deliberations. (Despite a decade of controversy, I optimistically believe that the messy process of argument and consensus building will work itself out [2].) What I am trying to do in this piece, however, is to look to one of America’s pre-eminent philosophers and logicians, Charles Sanders Peirce (pronounced “purse”), to inform how these controversies of naming, identity and meaning may be dissected and resolved.

‘Identity Crisis’, httpRange-14, and Issue 57

The Web began as a way to hyperlink between documents, generally Web pages expressed in the HTML markup language. These initial links were called URLs (uniform resource locators), and each pointed to various kinds of electronic resources (documents) that could be accessed and retrieved on the Web. These resources could be documents written in HTML or other encodings (PDFs, other electronic formats), images, streaming media like audio or videos, and the like [3].

All was well and good until the idea of the semantic Web, which postulated that information about the real world — concepts, people and things — could also be referenced and made available for reasoning and discussion on the Web. With this idea, the scope of the Web was massively expanded from electronic resources that could be downloaded and accessed via the Web to now include virtually any topic of human discourse. The rub, of course, was that ideas such as abstract concepts or people or things could not be “dereferenced” nor downloaded from the Web.

One of the first things that needed to change was to define a broader concept of a URI “identifier” above the more limited concept of a URL “locator”, since many of these new things that could be referenced on the Web went beyond electronic resources that could be accessed and viewed [3]. But, since what the referent of the URI now actually might be became uncertain — was it a concept or a Web page that could be viewed or something else? — a number of commentators began to note this uncertainty as the “identity crisis” of the Web [4]. The topic took on much fervor and metaphysical argument, such that by 2003, Sandro Hawke, a staffer of the standards-setting W3C (World Wide Web Consortium), was able to say, “This is an old issue, and people are tired of it” [5].

Yet, for many of the reasons described more fully below, the issue refused to go away. The Technical Architecture Group (TAG) of the W3C took up the issue, under a rubric that came to be known as httpRange-14 [6]. The issue was first raised in March 2002 by Tim Berners-Lee and accepted for TAG deliberations in February 2003, with a resolution then offered in June 2005 [7]. (Refer to the original resolution and other information [6] to understand the nuances of this resolution, since particular commentary on that approach is not the focus of this article.) Suffice it to say here, however, that this resolution posited an entirely new distinction of Web content into “information resources” and “non-information resources”, and also recommended the use of the HTTP 303 redirect code for cases where agents requesting a URI should be directed to concepts versus viewable documents.
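To make the mechanics of that guidance concrete, here is a minimal sketch of a server-side decision between a direct 200 response and a 303 redirect. The URI layout ("/id/..." for things, "/doc/..." for documents about them) and the lookup tables are assumptions chosen for illustration; the resolution itself does not prescribe any such layout.

```python
from typing import Dict, Optional, Tuple

# Hypothetical URI space: "/id/..." names things, "/doc/..." names documents about them.
DOCUMENTS: Dict[str, str] = {
    "/doc/toucan": "<html>A page describing toucans.</html>",
}
DESCRIBED_BY: Dict[str, str] = {
    "/id/toucan": "/doc/toucan",
}


def respond(path: str) -> Tuple[int, Optional[str], Optional[str]]:
    """Return (status, Location header, body) for a GET on the given path."""
    if path in DOCUMENTS:
        # A document can be served directly: 200 signals an information resource.
        return 200, None, DOCUMENTS[path]
    if path in DESCRIBED_BY:
        # The thing itself cannot be sent over the wire, so point to a description.
        return 303, DESCRIBED_BY[path], None
    return 404, None, None
```

A client asking for "/id/toucan" therefore needs a second request to reach anything viewable, which is the extra round trip that critics of the approach point to.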

This “resolution” has been anything but. Not only can no one clearly distinguish these de novo classes of “information resources” [19], but the whole approach felt arbitrary and kludgy.

Meanwhile, the confusions caused by the “identity crisis” and httpRange-14 continued to perpetuate themselves. In 2006, a major workshop on “Identity, Reference and the Web” (IRW 2006) was held in conjunction with the Web’s major WWW2006 conference in Edinburgh, Scotland, on May 23, 2006 [8]. The various presentations and its summary (by Harry Halpin) are very useful to understand these issues. What was starting to jell at this time was the understanding that the basis of identity and meaning on the Web posed new questions, and ones that philosophers, logicians and linguists needed to be consulted to help inform.

The fiat of the TAG’s 2005 resolution has failed to take hold. Over the ensuing years, various eruptions have occurred on mailing lists and within the TAG itself (now expressed as Issue 57) to revisit these questions and bring the steps moving forward into some coherent new understanding. Though linked data has been premised on best-practice implementation of these resolutions [9], and has been a qualified success, many (myself included) would claim that the extra steps and inefficiencies required from the TAG’s httpRange-14 guidance have been hindrances, not facilitators, of the uptake of linked data (or the semantic Web).

Today, despite the efforts of some to claim the issue closed, it is not. Issue 57 and the periodic bursts from notable semantic Web advocates such as Ian Davis [10], Pat Hayes and Harry Halpin [11], Ed Summers [12], Xiaoshu Wang [13], David Booth [14] and TAG members themselves, such as Larry Masinter [15] and Jonathan Rees [16], point to continued irresolution and discontent within the advocate community. Issue 57 currently remains open. Meanwhile, I think, all of us interested in such matters can express concern that linked data, the semantic Web and interoperable structured data have seen less uptake than any of us had hoped or wanted over the past decade. As I have stated elsewhere, unclear semantics and muddled guidelines help to undercut potential use.

As each of the eruptions over these identity issues has occurred, the competing camps have often been characterized as “talking past one another”; that is, not communicating in such a way as to help resolve to consensus. While it is hardly my position to do so, I try to encapsulate below the various positions and prejudices as I see them in this decades-long debate. I also try to share my own learning that may help inform some common ground. Forgive me if I overly simplify these vexing issues by returning to what I see as some first principles . . . .

What’s in a Name?

Original Coca-Cola bottle

One legacy of the initial document Web is the perception that Web addresses have meaning. We have all heard of the multi-million dollar purchasing of domains [17] and the adjudication that may occur when domains are hijacked from their known brands or trademark owners. This legacy has tended to imbue URIs with a perceived value. It is not by accident, I believe, that many within the semantic Web and linked data communities still refer to “minting” URIs. Some believe that ownership and control over URIs may be equivalent to grabbing up valuable real estate. It is also the case that many believe the “name” given to a URI acts to name the referent to which it refers.

This perception is partially true, partially false, but moreover incomplete in all cases. We can illustrate these points with the global icon, “Coca-Cola”.

As for the naming aspects, let’s dissect what we mean when we use the label “Coca-Cola” (in a URI or otherwise). Perhaps the first thing that comes to mind is “Coca-Cola,” the beverage (which has a description on Wikipedia, among other references). Because of its ubiquity, we may also recognize the image of the Coca-Cola bottle to the left as a symbol for this same beverage. (Though, in the hilarious movie, The Gods Must Be Crazy, Kalahari Bushmen, who had no prior experience of Coca-Cola, took the bottle to be magical with evil powers [18].) Yet even as reference to the beverage, the naming aspects are a bit cloudy since we could also use the fully qualified synonyms of “Coke”, “Coca-cola” (small C), “Classic Coke” and the hundreds of language variants worldwide.

On the other hand, the label “Coca-Cola” could just as easily conjure The Coca-Cola Company itself. Indeed, the company web site is the location pointed to by the URI of http://www.thecoca-colacompany.com/. But, even that URI, which points to the home Web page of the company, does not do justice to conveying an understanding or description of the company. For that, additional URIs may need to be invoked, such as the description at Wikipedia, the company’s own company description page, plus perhaps the company’s similar heritage page.

Of course, even these links and references only begin to scratch the surface of what the company Coca-Cola actually is: headquarters, manufacturing facilities, 140,000 employees, shareholders, management, legal entities, patents and Coke recipe, and the like. Whether in human languages or URIs, in any attempt to signify something via symbols or words (themselves another form of symbol), we risk ambiguity and incompleteness.

URI shorteners also undercut the idea that a URI necessarily “names” something. Using the service bitly, we can shorten the link to the Wikipedia description of the Coke beverage to http://bit.ly/xnbA6 and we can shorten the link to The Coca-Cola Company Web site to http://bit.ly/9ojUpL. I think we can fairly say that neither of these shortened links “names” its referent. The most we can say about a URI is that it points to something. With the vagaries of meaning in human languages, we might also say that URIs refer to something, denote something or identify (but not in the sense of completely define) something.

From this discussion, we can assert with respect to the use of URIs as “names” that:

  1. In all cases, URIs are pointers to a particular referent
  2. In some cases, URIs do act to “name” some things
  3. Yet, even when used as “names,” there can be ambiguity as to what exactly the referent is that is denoted by the name
  4. Resolving what such “names” mean is a matter of context and reference to further information or links, and
  5. Because URIs may act as “names”, it is appropriate to consider social conventions and contracts (e.g., trademarks, brands, legal status) in adjudicating who can own the URI.

In summary, I think we can say that URIs may act as names, but not in all or most cases, and when used as such are often ambiguous. Absolutely associating URIs as names is way too heavy a burden, and incorrect in most cases.

What is a Resource?

The “name” discussion above masks that in some cases we are talking about a readable Web document or image (such as the Wikipedia description of the Coke beverage or its image) versus the “actual” thing in the real world (the Coke beverage itself or even the company). This distinction is what led to the so-called “identity crisis”, for which Ian Davis has used a toucan as his illustrative thing [10].

Keel-billed Toucan

As I note in the conclusion, I like Davis’ approach to the identity conundrum insofar as Web architecture and linked data guidance are concerned. But here my purpose is more subtle: I want to tease apart still further the apparent distinction between an electronic description of something on the Web and the “actual” something. Like Davis, let’s use the toucan.

In our strawman case, we too use a description of the toucan (on Wikipedia) to represent our “information resource” (the accessible, downloadable electronic document). We contrast to that a URI that we mean to convey the actual physical bird (a “non-information resource” in the jumbled jargon of httpRange-14), which we will designate via the URI of http://example.com/toucan.

Despite the tortured (and newly conjured) distinction between “information resource” and “non-information resource”, the first blush reaction is that, sure, there is a difference between an electronic representation that can be accessed and viewed on the Web and its true, “actual” thing. Of course people can not actually be rendered and downloaded on the Web, but their bios and descriptions and portrait images may. While in the abstract such distinctions appear true and obvious, in the specifics that get presented to experts, there is surprising disagreement as to what is actually an “information resource” v. a “non-information resource” [19]. Moreover, as we inspect the real toucan further, even that distinction is quite ambiguous.

When we inspect what might be a definitive description of “toucan” on Wikipedia, we see that the term more broadly represents the family of Ramphastidae, which contains five genera and forty different species. The picture we are showing to the right is but of one of those forty species, that of the keel-billed toucan (Ramphastos sulfuratus). Viewing the images of the full list of toucan species shows just how divergent these various “physical birds” are from one another. Across all species, average sizes vary by more than a factor of three with great variation in bill sizes, coloration and range. Further, if I assert that the picture to the right is actually that of my pet keel-billed toucan, Pretty Bird, then we can also understand that this representation is for a specific individual bird, and not the physical keel-billed toucan species as a whole.

The point of this diversion is not a lecture on toucans, but an affirmation that distinctions between “resources” occur at multiple levels and dimensions. Just as there are no self-evident criteria as to what constitutes an “information resource”, there is also no self-evident and fully defining set of criteria as to what is the physical “toucan” bird. The meaning of what we call a “toucan” bird is not embodied in its label or even its name, but in the context and accompanying referential information that place the given referent into a context that can be communicated and understood. A URI points to (“refers to”) something that causes us to conjure up an understanding of that thing, be it a general description of a toucan, a picture of a toucan, an understanding of a species of toucan, or a specific toucan bird. Our understanding or interpretation results from the context and surrounding information accompanying the reference.

In other words, a “resource” may be anything, which is just the way the W3C has defined it. There is not a single dimension which, magically, like “information” and “non-information,” can cleanly and definitely place a referent into some state of absolute understanding. To assert that such magic distinctions exist is a flaw of Cartesian logic, which can only be reconciled by looking to more defensible bases in logic [20].

Peirce and the Logic of Signs

The logic behind these distinctions and nuances leads us to Charles Sanders Peirce (1839 – 1914). Peirce (pronounced “purse”) was an American logician, philosopher and polymath of the first rank. Along with Frege, he is acknowledged as the father of predicate calculus and the notation system that formed the basis of first-order logic. His symbology and approach arguably provide the logical basis for description logics and other aspects underlying the semantic Web building blocks of the RDF data model and, eventually, the OWL language. Peirce is the acknowledged founder of pragmatism, the philosophy of linking practice and theory in a process akin to the scientific method. He was also the first formulator of existential graphs, an essential basis to the whole field now known as model theory. Though often overlooked in the 20th century, Peirce has lately been enjoying a renaissance with his voluminous writings still being deciphered and published.

The core of Peirce’s world view is based in semiotics, the study and logic of signs. In his seminal writing on this, “What is in a Sign?” [21], he wrote that “every intellectual operation involves a triad of symbols” and “all reasoning is an interpretation of signs of some kind”. Peirce had a predilection for expressing his ideas in “threes” throughout his writings.

Semiotics is often split into three branches: 1) syntactics – relations among signs in formal structures; 2) semantics – relations between signs and the things to which they refer; and 3) pragmatics – relations between signs and the effects they have on the people or agents who use them.

Peirce’s logic of signs in fact is a taxonomy of sign relations, in which signs get reified and expanded via still further signs, ultimately leading to communication, understanding and an approximation of “canonical” truth. Peirce saw the scientific method as itself an example of this process.

A given sign is a representation amongst the triad of the sign itself (which Peirce called a representamen, the actual signifying item that stands in a well-defined kind of relation to the two other things), its object and its interpretant. The object is the actual thing itself. The interpretant is how the agent or the perceiver of the sign understands and interprets the sign. Depending on the context and use, a sign (or representamen) may be either an icon (a likeness), an indicator or index (a pointer or physical linkage to the object) or a symbol (understood convention that represents the object, such as a word or other meaningful signifier).

An interpretant in its barest form is a sign’s meaning, implication, or ramification. For a sign to be effective, it must represent an object in such a way that it is understood and used again. This makes the assignment and use of signs a community process of understanding and acceptance [20], as well as a truth-verifying exercise of testing and confirming accepted associations.

John Sowa has done much to help make some of Peirce’s obscure language and terminology more accessible to lay readers [22]. He has expressed Peirce’s basic triad of sign relations as follows, based around the Yojo animist cat figure used by the character Queequeg in Herman Melville’s Moby-Dick:

The Triangle of Meaning

In this figure, object and symbol are the same as the Peirce triad; concept is the interpretant in this case. The use of the word ‘Yojo’ conjures the concept of cat.
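The triad can be written down as a small data structure. The following is a minimal sketch in Python; the class and field names (Sign, SignKind, representamen, obj, interpretant) are hypothetical choices for illustration, and the Yojo values simply restate the figure described above.

```python
from dataclasses import dataclass
from enum import Enum


class SignKind(Enum):
    ICON = "icon"      # a likeness of the object
    INDEX = "index"    # a pointer or physical linkage to the object
    SYMBOL = "symbol"  # an understood convention, such as a word


@dataclass(frozen=True)
class Sign:
    representamen: str  # the signifying item itself
    obj: str            # the actual thing the sign stands for
    interpretant: str   # how the perceiver understands the sign
    kind: SignKind


# The word 'Yojo' stands for the cat and conjures the concept of a cat.
yojo = Sign(
    representamen="Yojo",
    obj="the cat Yojo",
    interpretant="concept of cat",
    kind=SignKind.SYMBOL,
)
```

The same structure could hold a URI as its representamen, in which case the kind would most naturally be an index.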

This basic triad representation has been used in many contexts, with various replacements or terms at the nodes. Its basic form is known as the Meaning Triangle, as was popularized by Ogden and Richards in 1923 [23].

The key aspect of signs for Peirce, though, is the ongoing process of interpretation and reference to further signs, a process he called semiosis. A sign of an object leads to interpretants, which, as signs, then lead to further interpretants. In the Sowa example below, we show how meaning triangles can be linked to one another, in this case by abstracting that the triangles themselves are concepts of representation; we can abstract the ideas of both concept and symbol:

Representing an Object by a Concept

We can apply this same cascade of interpretation to the idea of the sign (or representamen), which in this case shows that a name can be related to a word symbol, which in itself is a combination of characters in a string called ‘Yojo’:

Representing Signs of Signs of Signs
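As a rough illustration of this cascade, the sketch below (Python) links each sign to the thing it stands for, so that following the links walks from the character string back to the object. The Node class and the chain function are hypothetical, and the four-step chain is a simplification of Sowa’s figure rather than a reproduction of it.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    label: str
    stands_for: Optional["Node"] = None  # None marks the final object


def chain(node: Optional[Node]) -> List[str]:
    """Follow stands_for links and collect the labels along the way."""
    labels = []
    while node is not None:
        labels.append(node.label)
        node = node.stands_for
    return labels


cat = Node("the cat Yojo")
name = Node("name", stands_for=cat)
word = Node("word", stands_for=name)
string = Node("character string 'Yojo'", stands_for=word)

print(chain(string))
```

Each added link is another interpretation layered on the previous one, which is the sense in which semiosis is an ongoing process rather than a single lookup.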

According to Sowa [22]:

“What is revolutionary about Peirce’s logic is the explicit recognition of multiple universes of discourse, contexts for enclosing statements about them, and metalanguage for talking about the contexts, how they relate to one another, and how they relate to the world and all its events, states, and inhabitants.
“The advantage of Peircean semiotics is that it firmly situates language and logic within the broader study of signs of all types. The highly disciplined patterns of mathematics and logic, important as they may be for science, lie on a continuum with the looser patterns of everyday speech and with the perceptual and motor patterns, which are organized on geometrical principles that are very different from the syntactic patterns of language or logic.”

Catherine Legg [20] notes that the semiotic process is really one of community involvement and consensus. Each understanding of a sign and each subsequent interpretation helps come to a consensus of what a sign means. It is a way of building a shared understanding that aids communication and effective interpretation. In Peirce’s own writings, the process of interpretation can lead to validation and an eventual “canonical” or normative interpretation. The scientific method itself is an extreme form of the semiotic process, leading ultimately to what might be called accepted “truths”.

Peircean Semiotics of URIs

So, how do Peircean semiotics help inform us about the role and use of URIs? Does this logic help provide guidance on the “identity crisis”?

The Peircean taxonomy of signs has three levels with three possible sign roles at each level, leading to a possible 27 combinations of sign representations. However, because not all sign roles are applicable at all levels, Peirce actually postulated only ten distinct sign representations.
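The arithmetic from 27 to ten can be checked by enumeration. The sketch below (Python) assumes the standard account of Peirce’s classification, which this article does not spell out: the three levels are the sign in itself, its relation to its object, and its relation to its interpretant, and a combination is admissible only when the role chosen at each level is no higher than the role chosen at the level before it. The role names in the lists are Peirce’s usual terms; the function names are hypothetical.

```python
from itertools import product

# Roles at each level, ordered from first to third.
SIGN_ITSELF = ["qualisign", "sinsign", "legisign"]
RELATION_TO_OBJECT = ["icon", "index", "symbol"]
RELATION_TO_INTERPRETANT = ["rheme", "dicent", "argument"]


def all_combinations():
    """Every way of picking one role per level: 3 * 3 * 3 triples of indexes."""
    return list(product(range(3), repeat=3))


def admissible(combo):
    """Assumed rule: a later level may not outrank the level before it."""
    first, second, third = combo
    return first >= second >= third


def describe(combo):
    first, second, third = combo
    return (SIGN_ITSELF[first], RELATION_TO_OBJECT[second],
            RELATION_TO_INTERPRETANT[third])


valid = [combo for combo in all_combinations() if admissible(combo)]

print(len(all_combinations()), len(valid))
for combo in valid:
    print(describe(combo))
```

Under this rule a qualisign can only be an icon, while a legisign may be an icon, an index or a symbol, which is why the count collapses so sharply.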

Common to all roles, the URI “sign” is best seen as an index: the URI is a pointer to a representation of some form, be it electronic or otherwise. This representation bears a relation to the actual thing that this referent represents, as is true for all triadic sign relationships. However, in some contexts, again in keeping with additional signs interpreting signs in other roles, the URI “sign” may also play the role of a symbolic “name” or even as a signal that the resource can be downloaded or accessed in electronic form. In other words, by virtue of the conventions that we choose to assign to our signs, we can supply additional information that augments our understanding of what the URI is, what it means, and how it is accessed.

Of course, in these regards, a URI is no different than any other sign in the Peircean world view: it must reside in a triadic relationship to its actual object and an interpretation of that object, with further understanding only coming about by the addition of further signs and interpretations.

In shortened form, this means that a URI, acting alone, can at most play the role of a pointer between an object and its referent. A URI alone, without further signs (information), can not inform us well about names or even what type of resource may be at hand. For these interpretations to be reliable, more information must be layered on, either by accepted convention of the current signs or the addition of still further signs and their interpretations. Since the attempts to deal with the nature of a URI resource by fiat as stipulated by httpRange-14 meet neither the standard of consensus nor that of empirical validity, the attempt can not by definition become “canonical”. This does not mean that httpRange-14 and its recommended practices can not help in providing more information and aiding interpretation for what the nature of a resource may be. But it does mean that httpRange-14 acting alone is insufficient to resolve ambiguity.

Moreover, what we see in the general nature of Peirce’s logic of signs is the usefulness of adding more “triads” of representation as the process to increase understanding through further interpretation. Kind of sounds like adding on more RDF triples, does it not?

Global is Neither Indiscriminate Nor Unambiguous

Names, references, identity and meaning are not absolutes. They are not philosophically, and they are not in human language. To expect machine communications to hold to different standards and laws than human communications is naive. To effect machine communications our challenge is not to devise new rules, but to observe and apply the best rules and practices that human communications instruct.

There has been an unstated hope at the heart of the semantic Web enterprise that simply expressing statements in the right way (syntax) and in the right form (RDF) is sufficient to facilitate machine communications. But this hope, too, is naive and silly. Just as we do not accept all human utterances as truth, neither will we accept all machine transmissions as reliable. Some of the information will be posted in error; some will be wrong or ill-fitting to our world view; some will be malicious or intended to deceive. Spam and occasionally lousy search results on the Web tell us that Web documents are subject to these sources of unsuitability. Why would the same not be true of data?

Thus, global data access via the semantic Web is not — and can never be — indiscriminate nor unambiguous. We need to understand and come to trust sources and provenance; we need interpretation and context to decide appropriateness and validity; and we need testing and validation to ensure messages as received are indeed correct. Humans need to do these things in their normal courses of interaction and communication; our machine systems will need to do the same.

These confirmations and decisions as to whether the information we receive is actionable or not will come about via still more information. Some of this information may come about via shared conventions. But most will come about because we choose to provide more context and interpretation for the core messages we hope to communicate.

A Go-Forward Approach

Nearly five years ago Hayes and Halpin put forth a proposal to add ex:refersTo and ex:describedBy to the standard RDF vocabulary as a way for authors to provide context and explanation for what constituted a specific RDF resource [11]. In various ways, many of the other individuals cited in this article have come to similar conclusions. The simple redirect suggestions of both Ian Davis [10] and Ed Summers [12] appear particularly helpful.
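A minimal sketch of what such statements might look like, using plain Python tuples as triples. Only the predicate names ex:refersTo and ex:describedBy and the URI http://example.com/toucan come from the text; the namespace behind the ex: prefix, the Wikipedia address used for the description, the direction in which each predicate is applied, and the objects helper are all assumptions made for illustration.

```python
EX = "http://example.com/ns#"  # hypothetical namespace for the ex: prefix

TOUCAN = "http://example.com/toucan"                # the bird itself
TOUCAN_DOC = "http://en.wikipedia.org/wiki/Toucan"  # a description of it

# Each triple is (subject, predicate, object).
triples = [
    (TOUCAN, EX + "describedBy", TOUCAN_DOC),
    (TOUCAN_DOC, EX + "refersTo", TOUCAN),
]


def objects(subject, predicate, graph):
    """Return every object asserted for this subject and predicate."""
    return [o for s, p, o in graph if s == subject and p == predicate]


print(objects(TOUCAN, EX + "describedBy", triples))
```

The point is only that the author, rather than the HTTP response code, states which URI stands for the thing and which for its description.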

Over time, we will likely need further representations about resources regarding such things as source, provenance, context and other interpretations that would help remove ambiguities as to how the information provided by that resource should be consumed or used. These additional interpretations can mechanically be provided via referenced ontologies or embedded RDFa (or similar). These additional interpretations can also be aided by judicious, limited additions of new predicates to basic language specifications for RDF (such as the Hayes and Halpin suggestions).

In the end, of course, any frameworks that achieve consensus and become widely adopted will be simple to use, easy to understand, and straightforward to deploy. The beauty of best practices in predicates and annotations is that failures to provide are easy to test. Parties that wish to have their data consumed have incentive to provide sufficient information so as to enable interpretation.
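The claim that omissions are easy to test can be sketched in a few lines of Python. The list of required predicates is hypothetical (ex:source in particular is invented for this example), as is the missing_predicates helper; a real consumer would choose its own list.

```python
# Hypothetical annotations a consumer might insist on before using a resource.
REQUIRED_PREDICATES = ("ex:describedBy", "ex:source")


def missing_predicates(subject, graph, required=REQUIRED_PREDICATES):
    """List the required predicates that the graph never asserts for subject."""
    present = {p for s, p, _ in graph if s == subject}
    return [p for p in required if p not in present]


graph = [
    ("http://example.com/toucan", "ex:describedBy",
     "http://en.wikipedia.org/wiki/Toucan"),
]

print(missing_predicates("http://example.com/toucan", graph))
```

An empty result means the publisher supplied everything asked for; anything else is a concrete list of what the consumer would need before trusting the data.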

There is absolutely no reason that these additions can not co-exist with the current httpRange-14 approach. By adding a few other options and making clear the optional use of httpRange-14, we would be very Peirce-like in our go-forward approach: we would be pragmatic while adding more means to improve our interpretations of what a Web resource is and is meant to be.


[1] Throughout intellectual history, a number of prominent philosophers and logicians have attempted to describe naming, identity and reference of objects and entities. Here are a few that you may likely encounter in various discussions of these topics in reference to the semantic Web; many are noted philosophers of language:

  • Aristotle (384 BC – 322 BC) – founder of formal logic; formulator and proponent of categorization; believed in the innate “universals” of various things in the natural world
  • Rudolf Carnap (1891 – 1970) – proposed a logical syntax that provided a system of concepts, a language, to enable logical analysis via exact formulas; a basis for natural language processing; rejected the idea and use of metaphysics
  • René Descartes (1596 – 1650) – posited a boundary between mind and the world; the meaning of a sign is the intention of its producer, and is private and incorrigible
  • Friedrich Ludwig Gottlob Frege (1848 – 1925) – one of the formulators of first-order logic, though syntax not adopted; advocated shared senses, which can be objective and sharable
  • Kurt Gödel (1906 – 1978) – his two incompleteness theorems are some of the most important logic contributions of all time; they establish inherent limitations of all but the most trivial axiomatic systems capable of doing arithmetic, as well as for computer programs
  • David Hume (1711 – 1776) – embraced natural empiricism, but kept the Descartes concept of an “idea”
  • Immanuel Kant (1724 – 1804) – one of the major philosophers in history, argued that experience is purely subjective without first being processed by pure reason; a major influence on Peirce
  • Saul Kripke (1940 – ) – proposed the causal theory of reference and what proper names mean via a “baptism” by the namer
  • Gottfried Wilhelm Leibniz (1646 – 1716) – the classic definition of identity is Leibniz’s Law, which states that if two objects have all of their properties in common, they are identical and so only one object
  • Richard Montague (1930 – 1971) – wrote much on logic and set theory; student of Tarski; pioneered a logical approach to natural language semantics; associated with model theory, model-theoretic semantics
  • Charles Sanders Peirce (1839 – 1914) – see main text
  • Willard Van Orman Quine (1908 – 2000) – noted analytical philosopher, advocated the “radical indeterminacy of translation” (can never really know)
  • Bertrand Russell (1872 – 1970) – proposed the direct theory of reference and what it means to “ground in references”; adopted many Peirce arguments without attribution
  • Ferdinand de Saussure (1857 – 1913) – also proposed an alternative view to Peirce of semiotics, one grounded in sociology and linguistics
  • John Rogers Searle (1932 – ) – argues that consciousness is a real physical process in the brain and is subjective; has argued against strong AI (artificial intelligence)
  • Alfred Tarski (1901 – 1983) – analytic philosopher focused on definitions of models and truth; great admirer of Peirce; associated with model theory, model-theoretic semantics
  • Ludwig Josef Johann Wittgenstein (1889 – 1951) – he disavowed his earlier work, arguing that philosophy needed to be grounded in ordinary language, recognizing that the meaning of words is dependent on context, usage, and grammar.
Also, Umberto Eco has been a noted proponent and popularizer of semiotics.
[2] As any practitioner ultimately notes, standards development is a messy, lengthy and trying process. Not all individuals can handle the messiness and polemics involved. Personally, I prefer to try to write cogent articles on specific issues of interest, and then leave it to others to slug it out in the back rooms of standards making. Where the process works well, standards get created that are accepted and adopted. Where the process does not work well, the standards are not embraced as exhibited by real-world use.
[3] Tim Berners-Lee, 2007. What Do HTTP URIs Identify?
This article does not discuss the other sub-category of URIs, URNs (for names). URNs may refer to any standard naming scheme (such as ISBNs for books) and have no direct bearing on any network access protocol, as do URLs and URIs when they are referenceable. Further, URNs are little used in practice.
[4] Kendall Clark was one of the first to question “resource” and other identity ambiguities, noting the tautology between URI and resource as “anything that has identity.” See Kendall Clark, 2002. “Identity Crisis,” in XML.com, Sept 11 2002; see http://www.xml.com/pub/a/2002/09/11/deviant.html. From the topic map community, one notable contribution was from Steve Pepper and Sylvia Schwab, 2003. “Curing the Web’s Identity Crisis,” found at http://www.ontopia.net/topicmaps/materials/identitycrisis.html.
[5] Sandro Hawke, 2003. Disambiguating RDF Identifiers. W3C, January 2003. See http://www.w3.org/2002/12/rdf-identifiers/.
[6] The issue was framed as what is the proper “range” for HTTP referrals and was also the 14th major TAG issue recorded, hence the name. See further the httpRange-14 Webography.
[7] See W3C, “httpRange-14: What is the range of the HTTP dereference function?”; see http://www.w3.org/2001/tag/issues.html#httpRange-14.
[9] Leo Sauermann and Richard Cyganiak, eds., 2008. Cool URIs for the Semantic Web, W3C Interest Group Note, December 3, 2008. See http://www.w3.org/TR/cooluris/.
[10] Ian Davis, 2010. Is 303 Really Necessary? Blog post, November 2010, accessed 20 January 2012. (See http://blog.iandavis.com/2010/11/04/is-303-really-necessary/.) A considerable thread resulted from this post; see http://markmail.org/thread/mkoc5kxll6bbjbxk.
[11] See first Harry Halpin, 2006. “Identity, Reference and Meaning on the Web,” presented at WWW 2006, May 23, 2006. See http://www.ibiblio.org/hhalpin/irw2006/hhalpin.pdf. This was then followed up with greater elaboration by Patrick J. Hayes and Harry Halpin, 2007. “In Defense of Ambiguity,” http://www.ibiblio.org/hhalpin/homepage/publications/indefenseofambiguity.html.
[12] Ed Summers, 2010. Linking Things and Common Sense, blog post of July 7, 2010. See http://inkdroid.org/journal/2010/07/07/linking-things-and-common-sense/.
[13] Xiaoshu Wang, 2007. URI Identity and Web Architecture Revisited, Word document posted on posterous.com, November 2007. (Former Web documents have been removed.)
[14] David Booth, 2006. “URIs and the Myth of Resource Identity,” see http://dbooth.org/2006/identity/.
[15] See Larry Masinter, 2012. “The ‘tdb’ and ‘duri’ URI Schemes, Based on Dated URIs,” 10th version, IETF Network Working Group Internet-Draft, January 12, 2012. See http://tools.ietf.org/html/draft-masinter-dated-uri-10.
[16] Jonathan Rees has been the scribe and author for many of the background documents related to Issue 57. A recent mailing list entry provides pointers to four relevant documents in this entire discussion. See Jonathan A Rees, 2012. Guide to ISSUE-57 (httpRange-14) document suite, January 21, 2012.
[17] At least twenty domain names, led by insure.com, have sold for more than $2 million each; see this Wikipedia listing.
[18] In the wonderful movie, The Gods Must Be Crazy, Bushmen in the Kalahari Desert one day find an unbroken glass Coke bottle that had been thrown out of an airplane. Initially, this strange artifact seems to be another boon from the gods, and the Bushmen find many uses for it. But unlike anything that they have had before, there is only one bottle to go around. This creates jealousy, envy, anger, hatred, even violence. The protagonist, Xi, decides that the bottle is an evil thing and must be thrown off of the edge of the world. The hilarity of the movie comes from that premise and Xi’s encounters with the modern world as he pursues his quest with the magic bottle.
[19] Wang [13] rhetorically asked which of the following things would be categorized as an “information resource”:
  1. A book
  2. A clock
  3. The clock on the wall of my bedroom
  4. A gene
  5. The sequence of a gene
  6. A software
  7. A service
  8. A namespace
  9. An ontology
  10. A language
  11. A number
  12. A concept, such as Dublin Core’s creator.

See the 2007 thread on this issue, mostly by Sean Palmer and Noah Mendelsohn, the latter acknowledging that various experts may only agree on 85% of the items.

[20] See further Catherine Legg, 2010. “Pragmatism on the Semantic Web,” in Bergman, M., Paavola, S., Pietarinen, A.-V., & Rydenfelt, H. eds., Ideas in Action: Proceedings of the Applying Peirce Conference, pp. 173–188. Nordic Studies in Pragmatism 1. Helsinki: Nordic Pragmatism Network. See http://www.nordprag.org/nsp/1/Legg.pdf.
[21] Charles Sanders Peirce, 1894. “What is in a Sign?”, see http://www.iupui.edu/~peirce/ep/ep2/ep2book/ch02/ep2ch2.htm.
[22] The figures in particular are from John F. Sowa, 2000. “Ontology, Metadata, and Semiotics,” presented at ICCS 2000 in Darmstadt, Germany, on August 14, 2000; published in B. Ganter & G. W. Mineau, eds., Conceptual Structures: Logical, Linguistic, and Computational Issues, Lecture Notes in AI #1867, Springer-Verlag, Berlin, 2000, pp. 55-81. May be found at http://www.jfsowa.com/ontology/ontometa.htm. Also see John F. Sowa, 2006. “Peirce’s Contributions to the 21st Century,” presented at International Conference on Conceptual Structures, Aalborg, Denmark, July 17, 2006. See http://www.jfsowa.com/pubs/csp21st.pdf.
[23] C.K. Ogden and I. A. Richards, 1923. The Meaning of Meaning, Harcourt, Brace, and World, New York, 8th edition 1946.

by Mike Bergman at January 24, 2012 03:52 PM

January 05, 2012

Frederick Giasson's Weblog

December 28, 2011

Frederick Giasson's Weblog

Open Semantic Framework Running on Micro Instances

After releasing the new Open Semantic Framework Installer, we started to test it on machines with all kinds of different specifications: different CPU limits, different amounts of memory, etc. One of the setups that caught our attention was Amazon’s EC2 Micro Instance.

The Micro Instance is a virtual server type that was introduced by Amazon a little more than a year ago. As described by Amazon, Micro Instances are:

Instances of this family provide a small amount of consistent CPU resources and allow you to burst CPU capacity when additional cycles are available. They are well suited for lower throughput applications and web sites that consume significant compute cycles periodically.

We were intrigued by this particular type of instance because we wanted to know how the complete Open Semantic Framework stack could operate on such a small server instance.

Micro Instance Specifications

The Micro Instance’s specifications are as follows:

  • 613 MB memory
  • Up to 2 EC2 Compute Units (for short periodic bursts)
  • 32-bit or 64-bit platform
  • I/O Performance: Low

Note that an EC2 Compute Unit provides the equivalent CPU capacity of a 1.0-1.2 GHz 2007 Opteron or 2007 Xeon processor.

Installing The Stack

Installing the stack on the Amazon Micro Instance, using the OSF Installer, is not the fastest experience in the world. In fact, installing the complete stack takes up to 10 hours (5 minutes of your time, but compiling and installing everything takes about 10 hours of CPU time).

The problem is that installing OSF is a CPU-intensive task, and the Micro Instance is not designed for sustained CPU load. It can handle small CPU bursts, but it can’t sustain the creation and compilation of the entire stack. As a result, Amazon throttles the CPU consumption of the instance, which significantly slows down the installation process.

However, as you will see below, once OSF is installed on the Micro instance, the complete stack responds perfectly to all queries sent to it.

Creating an AMI

The only time you have to spend 10 hours to install the OSF stack on an Amazon Micro Instance is the first time. After that, you would only have to create an Amazon AMI from that vanilla OSF instance for future use. If you proceed that way, you will lower the installation time from 10 hours to a few minutes.

Reading and Searching Data

The testing we did for reading and searching data from structWSF shows that performance is as good as what you would get from a small instance with a normal workload. The Crud: Read and the Search structWSF endpoints are fully responsive and operational.

Creating, Updating and Deleting Data

Our testing shows that creating, updating and deleting entire datasets takes more time than with a small instance, even if the instance is dedicated to that task alone, without any other queries being processed at the same time. This decrease in performance is due to the CPU throttling done by Amazon for this kind of more CPU-intensive task. However, since creating, updating or deleting individual records only produces short “CPU peaks”, such isolated create/update/delete queries don’t greatly affect the overall performance of the instance.

What Is This Type of Instance Good For?

We found that such small instances were perfect for data collection activities performed by a single person, or a small group of collaborators. We also found that they could be used by low-traffic websites such as personal web portals, personal blogs, etc. The complete OSF stack is fully responsive and our analysis shows that the resources (CPU and memory) are stable and responsive with a normal workload.

Conclusion

Such a small server instance can easily be used to create a personal data collection endpoint, or a personal, or small, data presentation portal such as Mike’s semantic web Sweet Tools. It is well suited for data portals that require reading and searching of data with occasional data changes (addition, removal and modification of instance records).

by Frederick Giasson at December 28, 2011 08:45 PM

December 21, 2011

Frederick Giasson's Weblog

Volkswagen UK’s Search Engine Powered by structWSF

It is now official: Volkswagen UK‘s search engine is powered by structWSF. Their new contextual search engine was released last Friday. I covered the underlying architecture in one of my recent blog posts: Volkswagen's RDF Data Management Workflow.

John Streit, head of technology at Tribal DDB, described the two key advantages of using the structWSF (part of the Open Semantic Framework (OSF)) for their website in an interview with Wired UK:

The first is that it gives you a single place to access data. Streit explains: “Applications often need to retrieve data from multiple sources which adds complexity and development time. By using this technology we can get everything we need from a single place which drastically lowers development time and running costs.” Furthermore the exposure of data improves search and means that it can be repurposed in new and imaginative ways.

by Frederick Giasson at December 21, 2011 06:23 PM

December 14, 2011

Frederick Giasson's Weblog

The Open Semantic Framework Installer

We are excited to introduce the first Open Semantic Framework installation script. This new installer application will install and configure the entire Open Semantic Framework stack for you. It will take about 10 minutes of your time, and will process in the background for a few hours while everything necessary to build the OSF stack is downloaded and compiled.

The only thing you have to do to run the OSF Installer is to issue the few commands outlined below, and then to answer a few questions in the process (which, since most of them use the standard default values, is pretty easy).

The OSF Installer is a major addition to the Open Semantic Framework since it now enables a greater number of people (mere mortals) to install and use the stack, and it enables much faster deployment of the system.

The full installation manual, where each of the steps performed by the installer is explained in detail, is available as a reference here.

Requirements

The current version of the Open Semantic Framework Installer is fully operational on:

  1. Ubuntu 10.04 (Lucid)
  2. 32-bit operating system
  3. Internet access from the server
  4. 5 GB of disk space on the partition where you are installing OSF

Eventually this installer will be upgraded for 64-bit operating systems, and for other Linux distributions. Also, the current installer should work on newer versions of Ubuntu, but it has only been tested to date on the latest LTS version.

Installing the Open Semantic Framework

The only manual steps you need to perform to install the Open Semantic Framework are to:

  1. Create a folder on your server in which to install OSF
  2. Download the osf-install.zip installation package
  3. Make the osf-install.sh installation script executable
  4. Run the osf-install.sh installation script
  5. Answer the questions asked by the installer

Here are the commands you have to run:

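# Move to the folder where OSF will be installed
cd /mnt/
# Download and extract the installation package
sudo wget https://github.com/downloads/structureddynamics/Open-Semantic-Framework-Installer/osf-installer-v1.0a4.zip
sudo unzip osf-installer-v1.0a4.zip
# Enter the extracted installer folder
cd `ls -d structureddynamics*/`
# Make the installation script executable, then run it
sudo chmod 755 osf-install.sh
./osf-install.sh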

conStruct and structWSF Upgrades

In the process, both conStruct and structWSF have been enhanced to enable automatic upgrading in the future. Starting with structWSF version 1.0a92 and conStruct version 6.x-1.0-beta9, future upgrades should be handled by the automatic upgrading procedures.

However, to enable this, existing users will have to upgrade their current versions manually to establish the new automatic upgrades baseline.

Next Steps

Once you have installed the OSF stack, you can next query the structWSF Web service endpoints, and import datasets using conStruct. Here are a few things you can do to start exploring the Open Semantic Framework:

  1. Start exploring structWSF
  2. Start exploring conStruct
  3. Start exploring Ontologies usage in OSF
  4. Start importing and manipulating datasets
  5. Start exploring the Open Semantic Framework architecture
  6. Start playing with the structWSF web service endpoints

Everything is now installed on your server, so all you have to do is play with the stack. If you break something, just ping us on the mailing list or re-install it without worrying about each installation step!

Help

You may experience some issues with this new OSF Installer. If that is the case, I suggest you reach out to the Open Semantic Web Mailing List so that we can fix it in the Git repository.

Just write an email that includes the specifications of the server on which you are trying to install OSF. Then tell us where the issue happens in the installation process. Also add any logs that could be helpful in debugging the issue.

Conclusion

This is the first version of the OSF installer, but it is already a real balm for installing OSF. As noted, this installer will eventually be upgraded to support 64-bit servers and other Linux distributions. Also, any help improving this installer from Bash wizards would naturally be greatly welcomed.

by Frederick Giasson at December 14, 2011 12:10 AM

December 12, 2011

AI3:::Adaptive Information (Mike Bergman)

The State of Tooling for Semantic Technologies

Number of Semantic Web Tools Passes 1000 for First Time; Many Other Changes

We have been maintaining Sweet Tools, AI3‘s listing of semantic Web and -related tools, for a bit over five years now. Though we had switched to a structWSF-based framework that allows us to update it on a more regular, incremental schedule [1], like all databases, the listing needs to be reviewed and cleaned up on a periodic basis. We have just completed the most recent cleaning and update. We are also now committing to do so on an annual basis.

Thus, this is the inaugural ‘State of Tooling for Semantic Technologies‘ report, and, boy, is it a humdinger. There have been more changes — and more important changes — in this past year than in all four previous years combined. I think it fair to say that semantic technology tooling is now reaching a mature state, the trends of which likely point to future changes as well.

In this past year more tools have been added, more tools have been dropped (or abandoned), and more tools have taken on a professional, sophisticated nature. Further, for the first time, the number of semantic technology and -related tools has passed 1000. This is remarkable, given that more tools have been abandoned or retired than ever before.

Click here to browse the Sweet Tools listing. There is also a simple listing of URL links and categories only.

We first present our key findings and then overall statistics. We conclude with a discussion of observed trends and implications for the near term.

Key Findings

Some of the key findings from the 2011 State of Tooling for Semantic Technologies are:

  • As of the date of this article, there are 1010 tools in the Sweet Tools listing, the first time it has passed 1000 total tools
  • A total of 158 new tools have been added to the listing in the last six months, an increase of 17%
  • 75 tools have been abandoned or retired, the most removed at any period over the past five years
  • A further 6%, or 55 tools, have been updated since the last listing
  • Though open source has always been an important component of the listing, it now constitutes more than 80% of all listings; with dual licenses, open source availability is about 83%. Online systems contribute another 9%
  • Key application areas for growth have been in SPARQL, ontology-related areas and linked data
  • Java continues to dominate as the most important language.

Many of these points are elaborated below.
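The percentages in the key findings can be cross-checked with a little arithmetic. The sketch below assumes (this is not stated in the findings) that the size of the prior listing equals the current total minus the additions plus the removals, and that both percentages are taken against that prior total. The variable names are illustrative only.

```python
# Figures quoted in the key findings above.
current_total = 1010   # tools in the listing as of this article
added = 158            # new tools added in the last six months
removed = 75           # tools abandoned or retired
updated = 55           # tools updated since the last listing

# Assumption: prior listing size = current total - additions + removals.
previous_total = current_total - added + removed


def share(count: int, base: int) -> float:
    """Return count as a percentage of base."""
    return 100.0 * count / base


growth_pct = share(added, previous_total)     # share of new tools
updated_pct = share(updated, previous_total)  # share of updated tools
net_change = added - removed                  # net growth of the listing

print(f"previous total: {previous_total}")
print(f"added: {growth_pct:.1f}%  updated: {updated_pct:.1f}%  net change: {net_change:+d}")
```

Under that assumption the prior listing held 927 tools, and the quoted 17% and 6% figures are consistent with it.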

The Statistical Picture

The updated Sweet Tools listing now includes nearly 50 different tools categories. The most prevalent categories, each with over 6% of the total, are information extraction, general RDF tools, ontology tools, browser tools (RDF, OWL), and parsers or converters. The relative share by category is shown in this diagram:

Since the last listing, the fastest growing categories have been SPARQL, linked data, knowledge bases and all things related to ontologies. The relative changes by tools category are shown in this figure:

Though it is true that some of this growth is the result of discovery, based on our own tool needs and investigations, we have also been monitoring this space for some time and serendipity is not a compelling explanation alone. Rather, I think that we are seeing both an increase in practical tools (such as for querying), plus the trends of linked data growth matched with greater sophistication in areas such as ontologies and the OWL language.

The languages these tools are written in have also been pretty constant over the past couple of years, with Java remaining dominant. Java has represented half of all tools in this space, which continues with the most recent tools as well (see below). More than a dozen programming or scripting languages have at least some share of the semantic tooling space:

Sweet Tools Languages

With only 160 new tools it is hard to draw firm trends, but it does appear that some languages (Haskell, XSLT) have fallen out of favor, while popularity has grown for Flash/Flex (from a small base), Python and Prolog (with the growth of logic tools):

PHP will likely continue to see some emphasis because of relations to many content management systems (WordPress, Drupal, etc.), though both Python and Ruby seem to be taking some market share in that area.

New Tools

The newest tools added to the listing show somewhat similar trends. Again, Java is the dominant language, but with much increased use of JavaScript, Python and Prolog:

Sweet Tools Languages

The higher incidence of Prolog is likely due to the parallel increase in reasoners and inference engines associated with ontology (OWL) tools.

The increase in comprehensive tool suites and use of Eclipse as a development environment would appear to secure Java’s dominance for some time to come.

Trends and Observations

These dry statistics tend to mask the feel one gets when looking at most of the individual tools across the board. Older academic and government-funded project tools are finally getting cleaned out and abandoned. Those tools that remain have tended to get some version upgrades and improved Web sites to accompany them.

The general feel one gets with regard to semantic technology tooling at the close of 2011 has these noticeable trends:

  • A three-tiered environment – the tools seem to segregate into: 1) a bottom tier of tools (largely) developed by individuals or small groups, now most often found on Google Code or Github; 2) a middle-tier of (largely) government-funded projects, sometimes with multiple developers, often older, but with no apparent driving force for ongoing improvements or commercialization; and 3) a top-tier of more professional and (often) commercially-oriented tools. The latter category is the most noticeable with respect to growth and impact
  • Professionalism – the tools in the apparent top tiers feel more professional and have better (and more attractive) packaging. This professionalism is especially true for the frameworks and composite applications. But it also applies to many of the EU-funded projects, and Europe has always been a huge source of new tool developments
  • More complete toolsets – similarly, the upper levels of tools are oriented to pragmatic problems and problem-solving, which often means they embody multiple functions and more complete tooling environments. This category actually appears to be the most visible one exhibiting growth
  • Changing nature of academic releases – yet, even the academic releases seem to be increasing in professionalism and completeness. Though in the lowest tier it is still possible to see cursory or experimental tool releases, newer academic releases (often) seem to be more strategically oriented and parts of broader programmatic emphases. Programs like AKSW from the University of Leipzig or the Freie Universität Berlin or Finland’s Semantic Computing Research Group (SeCo), among many others, tend to be exemplars of this trend
  • Rise of commercial interests and enterprise adoption – the growing maturity of semantic technologies is also drawing commercial interest, and the incubation of new start-ups by academic and research institutions acts to reinforce the above trends. Promising projects and tools are now much more likely to be spun off as potential ventures, with accompanying better packaging, documentation and business models
  • Multiple languages and applications – with this growing complexity and sophistication has also come more complicated apps, combining multiple languages and functions. In fact, for some time the Sweet Tools listing has been justifiably criticized by some as overly “simplifying” the space by classifying tools under (largely) single applications or single languages. By the 2012 survey, it will likely be necessary to better classify the tools using multiple assignments
  • Google Code over SourceForge for open source (and an increase in Github, as well) – virtually all projects on SourceForge now feel abandoned or less active. The largest source of open source projects in the semantic technology space is now clearly Google Code. Though of a smaller footprint today, we are also seeing many of the newer open source projects gravitate to Github. Open source hosting environments are clearly in flux.

I have said this before, and been wrong about it before, but it is hard to see the tooling growth curve continue at its current slope into the future. I think we will see many individual tools spring up on the open source hosting sites like Google Code and Github, perhaps at relatively the same steady release rate. But I think old projects will increasingly be abandoned, and older projects will not tend to remain available for as long a time. While a relatively small number of established open source standards, like Solr and Jena, will be the exception, I think we will see shorter shelf lives for most open source tools moving forward. This will lead to a younger tools base than was the case five or more years ago.

I also think we will continue to see the dominance of open source. Proprietary software has increasingly been challenged in the enterprise space. And, especially in semantic technologies, we tend to see many open source tools that are as capable as proprietary ones, and generally more dynamic as well. The emphasis on open data in this environment also tends to favor open source.

Yet, despite the professionalism, sophistication and complexity trends, I do not yet see massive consolidation in the semantic technology space. While we are seeing a rapid maturation of tooling, I don’t think we have yet seen a similar maturation in revenue and business models. While notable semantic technology start-ups like Powerset and Siri have been acquired and are clear successes, these wins still remain much in the minority.


[1] Please use the comments section of this post for suggesting new or overlooked tools. We will incrementally add them to the Sweet Tools listing. Also, please see the About tab of the Sweet Tools results listing for prior releases and statistics.

by Michael K. Bergman at December 12, 2011 02:29 PM

December 05, 2011

Frederick Giasson's Weblog

Role and Use of Ontologies in the Open Semantic Framework

Ontologies are to the Open Semantic Framework what humans were to the Mechanical Turk. The hidden human in the Mechanical Turk was orchestrating each and every chess move. However, to the observers, the automated chess machine looked like just that: a new kind of intelligent machine. The year was 1770.

Ontologies play exactly the same role for the Open Semantic Framework (OSF): they orchestrate each and every move for all the pieces within OSF. They are what instructs structWSF, the Semantic Components, conStruct, and all other derivative pieces of user interfaces how to behave.

In this (lengthy) blog post, I will present the main ontologies that have an impact on different parts of OSF. We will see how different ontology classes and properties, and how the description of the records indexed in the system, can impact the behaviors of OSF.

In addition to this post, Mike has also published a blog post today that overviews the overall OSF ontology modularization and architecture.

Constituent Ontologies

Let’s take a look at the core ontologies used by the Open Semantic Framework. All these ontologies have been developed in relation to OSF. These, and other external ontologies, have the same role in OSF as the human does in the Mechanical Turk: they instruct the system how to behave.

Here is the list of the core ontologies:

  1. The SCO Ontology (Semantic Component Ontology)
  2. The WSF Ontology (Web Service Framework Ontology)
  3. The AGGR Ontology (Aggregation Ontology)
  4. The irON Ontology (Instance Record and Object Notation Ontology)
  5. One or more domain ontologies, to capture the concepts and relationships for the purposes of a given OSF installation, and
  6. Possibly UMBEL or other upper-level concept ontologies, used for linkages to external systems.

(Note: the internal wiki links to each of these ontologies also provide links to the actual ontology specifications on Github.)

A useful discussion of these ontologies and their interactions in an OSF instance is provided by the ontology modularization document. This current document focuses primarily on the specific properties and roles associated with them in an OSF installation.

Depending on the specific OSF installation, of course, multiple external ontologies may also be employed. Some of the common external ones used in an OSF installation are described by the external ontologies document. These external ontologies are important (indeed essential, in order to ensure linkage to the external world) but have little to do with internal OSF control structures. That is why the rest of this discussion focuses on internal ontologies only.

Summary Ontology Roles

Ontologies play pivotal roles across all parts of the framework. In a broad sense, the internal OSF ontologies are used for annotations, guiding interactions or relating concepts and information to other information. In specific terms, OSF ontologies may play one or more of these dozen or so roles:

  1. Define record descriptions
  2. Inform interface displays
  3. Integrate different data sources
  4. Define component selections
  5. Define component behaviors
  6. Guide template selection
  7. Provide reasoning and inference
  8. Guide content filtering (with and without inference)
  9. Tag concepts in text documents
  10. Help organize and navigate Web portals
  11. Manage datasets and ontologies, and
  12. Set access permissions and registrations.

In the remainder of this post, for each of these roles, we will see how ontologies affect numerous different parts of the OSF framework. These sections are presented in the order above.

Define Record Descriptions

A central role of ontologies in the Open Semantic Framework is their use to describe any kind of record that gets indexed and managed by the system. Since the framework indexes everything into the RDF data model, ontologies are needed as a schema to describe these RDF resources.

The irON ontology is specifically designed for record descriptions and notations. It interacts with all of the domain ontologies and, if used, UMBEL or other upper-level ontologies.

Inform Interface Displays

Ontologies have an impact on most of the user interfaces that display record information. The property that has the most impact is iron:prefLabel, which is used to display the label within the user interface that refers to a record or record attributes (properties). This label can be used within text, in a list control, in a tree control, or in any other kind of control that displays references to records.

Note: there are also other properties that are considered as fallbacks to iron:prefLabel if a record has no triples using the iron:prefLabel property. These include rdfs:label, dcterms:title, foaf:name, etc.
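The fallback behavior described above can be sketched in a few lines. This is a minimal illustration, not OSF's actual implementation: the record is modeled as a plain dictionary of property names to value lists, the relative order of the fallback properties is an assumption, and falling back to the record URI when no label exists is also an assumption.

```python
from typing import Dict, List

# iron:prefLabel comes first; the order of the remaining fallbacks
# is an assumption made for this sketch.
LABEL_PROPERTIES = ["iron:prefLabel", "rdfs:label", "dcterms:title", "foaf:name"]


def display_label(record: Dict[str, List[str]], uri: str) -> str:
    """Return the first available label for a record, else its URI."""
    for prop in LABEL_PROPERTIES:
        values = record.get(prop)
        if values:
            return values[0]
    # Hypothetical last resort: show the record's URI.
    return uri


# A record with no iron:prefLabel falls back to the next property found.
record = {"foaf:name": ["Frederick Giasson"], "dcterms:title": ["Fred's profile"]}
print(display_label(record, "http://example.org/fred"))
```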

General User Interface Labels and Descriptions

There are a few properties that have an impact on most of the components of the OSF stack, most of which come from the irON ontology. Here is the list of these irON properties that impact other parts of the system, mainly related to different user interfaces:

irON properties and their impact on the different user interfaces:

  • iron:prefLabel: Preferred label to refer to an instance record or specific attribute (property). This impacts most of the user interfaces. As soon as a record is described using this property, the user interface uses it to refer to that record (as a link, in a list, as a word, etc.)
  • iron:altLabel: Alternative label to refer to an instance record or specific attribute (property). This impacts most of the user interfaces. As soon as a record is described using this property, and the user interface needs more than one label to refer to that record, it is displayed in the user interface (as a link, in a list, as a word, etc.)
  • iron:hiddenLabel: Hidden labels are labels that shouldn’t be displayed in any user interface, but that may be used by different systems for indexing purposes. This impacts different indexing systems such as Scones. When a record is described using this property, and a system needs more words (synonyms) to describe that record without displaying them in any user interface, these hidden labels will be used.
  • iron:description: Description of an instance record. This impacts most of the user interfaces. As soon as a record is described using this property, and the user interface needs a description to refer to that record, it will be displayed in the user interface (as a link, in a list, as a word, etc.)
  • iron:prefURL: Preferred URL for an instance record. This impacts most of the user interfaces. As soon as a record is described using this property, and the user interface needs a web page URL to refer to that record, it will be displayed in the user interface (as a link)

User Interface ‘Short’ Labels

There are a few properties that impact most of the components of the OSF stack. Here is the list of SCO properties that impact other parts of the system, mainly related to different user interfaces:

SCO properties and their impact on the different user interfaces:

  • sco:shortLabel: The short label is used to display a short version of the label of an attribute/type where it has to be displayed in a constrained region of a component. This impacts multiple different kinds of user interfaces (including the semantic components): if the user interface knows that the space available to display the label is limited, it will utilize the sco:shortLabel value before any other label values that may be defined for that record.

Hierarchical Displays

The way ontologies define a class or a property structure also has an impact on different kinds of hierarchical displays. An example of this is the “Filter by Kinds” section of the structSearch and structBrowse modules. The possible filters that may be applied to a search query will be displayed to the user according to the hierarchy as defined in the ontologies.
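As a rough illustration of how a class hierarchy can drive such a display, the sketch below turns (subclass, superclass) pairs into an indented tree. The class names under the `ex:` prefix are hypothetical, and the sorting of siblings is an assumption; this is not how structSearch or structBrowse are implemented.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_tree(sub_class_of: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each parent class to its sorted list of direct subclasses."""
    children = defaultdict(list)
    for child, parent in sub_class_of:
        children[parent].append(child)
    return {parent: sorted(kids) for parent, kids in children.items()}


def render(tree: Dict[str, List[str]], root: str, depth: int = 0) -> List[str]:
    """Return the hierarchy under root as indented lines, depth-first."""
    lines = ["  " * depth + root]
    for child in tree.get(root, []):
        lines.extend(render(tree, child, depth + 1))
    return lines


# Hypothetical rdfs:subClassOf statements from a domain ontology.
pairs = [
    ("ex:City", "ex:Place"),
    ("ex:Park", "ex:Place"),
    ("ex:Capital", "ex:City"),
]
tree = build_tree(pairs)
lines = render(tree, "ex:Place")
print("\n".join(lines))
```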

Integrate Heterogeneous Data Sources

The principal reason why the Open Semantic Framework uses RDF and ontologies to describe all the data it indexes and manages is to facilitate data integration from multiple and heterogeneous data sources. The premise of using RDF and ontologies is:

The RDF framework, along with using ontologies as schema, is the most flexible means currently available to describe any kind of data. The RDF-ontology combination can be used to represent any data coming from any other source, data management system, format, or unstructured to structured basis for describing information. (See further the Advantages and Myths of RDF.)

This foundation leads to the extreme flexibility of the Open Semantic Framework. The rationale behind this flexibility, and its benefits, has been described in many locations within this wiki. You may also want to see this article on One of the Semantic Web's Core Added Value.

Ontologies have a dramatic — and positive — impact on the data integration and presentation tasks within an OSF instance.

Define Component Selections

A key aspect of the SCO Ontology is its use as the means to define what semantic components (or widgets) display what types of information within data records.

These assignments are done via the sControl component. The properties for this component define what components may display what type (class) of data records. Here is the list of SCO properties that impact the sControl’s behaviors:

SCO properties and their impact on the sControl component:

  • sco:displayControl: Annotate a class or a property to reference it to a display control. This indicates which semantic components can normally be used to display some information about a record of a certain type, or a record that is described using some property. This property impacts the behavior of the sControl component in the sense that for a given record’s description, and a given ontology, the sControl component will select different semantic controls for displays. The actual information displayed and with what widget(s) depends on the type of the record and the properties that are used to describe it.
  • sco:comparableWith: It is possible to specify a “comparableWith” relation between two predicates. These comparable attributes have the same allowedValue(s), and the semantics of the predicates that are deemed comparable are the same. Since the kinds of values, and their semantics, are the same, they are then considered comparable. This property is normally applied when it is desirable, for example, to plot values of different attributes describing similar records on some visualization component (for example, a linear chart). This property impacts the behavior of the sControl component in the sense that for a given record’s description, and a given ontology, the sControl component will display information about multiple input records depending on whether the values of some of the properties used to describe them are comparable.
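The selection logic described for sco:displayControl can be pictured as a lookup from a record's types and properties to candidate components. The sketch below is only an illustration of that idea: the annotation table, the `ex:` names, and the component names other than sMap are hypothetical, and the real sControl component may work quite differently.

```python
from typing import Dict, List

# Hypothetical sco:displayControl annotations taken from an ontology:
# class or property name -> semantic components able to display it.
DISPLAY_CONTROLS: Dict[str, List[str]] = {
    "ex:Neighborhood": ["sMap"],
    "ex:population": ["sBarChart", "sLinearChart"],
    "iron:description": ["sText"],
}


def select_components(record_types: List[str], record_properties: List[str]) -> List[str]:
    """Collect, in order and without duplicates, the components annotated
    on the record's types and on the properties used to describe it."""
    selected: List[str] = []
    for key in list(record_types) + list(record_properties):
        for component in DISPLAY_CONTROLS.get(key, []):
            if component not in selected:
                selected.append(component)
    return selected


# A record typed ex:Neighborhood and described with ex:population.
print(select_components(["ex:Neighborhood"], ["ex:population", "ex:unknown"]))
```

Keeping the annotations in the ontology rather than in the component code is what lets the same component adapt to different record types, which is the point the table above makes.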

Define Component Behaviors

In the Open Semantic Framework, one of the most important roles of ontologies is to define the interaction between different pieces of the system. Because of the extent of these interactions, this section is the longest and most detailed amongst all of the dozen or so ontology roles.

The SCO ontology can have multiple effects on multiple parts of an OSF instance. This section describes those interactions.

sMap Component

The sMap component has different behaviors depending on how its input record is described. Here is the list of SCO properties that have an impact on the sMap’s behaviors:

SCO Property Impact on the sMap component
sco:gisMap Reference a map binary file created from a ShapeFile map file and ClearMapsBuilder. The referenced map file is a serialized ActionScript object. The sco:gisMap defines the first layout that is related to a given resource. Normally, this resource is part of the map related by the gisMap predicate. Read more about maps in the sMap documentation page. There is only one gisMap relationship per resource; other relationships should be made with the sco:relatedGisMap predicate. This property impacts the behavior of the sMap component in the sense that it is the property, within the record’s description, that tells the component what map to render to the user.
sco:relatedGisMap Reference a map binary file created from a ShapeFile map file and ClearMapsBuilder. The referenced map file is a serialized ActionScript object. The sco:relatedGisMap defines a related map layout for a given resource. The resource is related to that map layer in some way, but it is not necessarily part of the layer. Read more about maps in the sMap documentation page. This property impacts the behavior of the sMap component in the sense that it is the property, within the record’s description, that tells the component what map to render to the user.

sWebMap Component

The sWebMap component has different behaviors depending on how its input record is described. Here is the list of SCO and WGS84 properties that impact the behavior of an sWebMap:

SCO/WGS84 Property Impact on the sWebMap component
sco:polygonCoordinates Defines the coordinates of a polygon shape that represents a geographic area determined by a record. Coordinates are defined as coordinates in KML. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records, polygon shapes are displayed on a World map for each of the records described with the property.
sco:polylineCoordinates Defines the coordinates of a polyline shape that represents a record on a map. Coordinates are defined as coordinates in KML. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records, polylines are displayed on a World map for each of the records described with the property.
sco:mapMarkerImageUrl URL of an icon image to use as a marker on a web map. Normally, this property is used to annotate a Class description. All of the records belonging to that class are marked on a map using this icon image. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records and a given ontology description, all records of a given class are displayed with the marker icon found at the URL specified for sco:mapMarkerImageUrl.
wgs84:lat Latitude coordinate of a record on a World map. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records, each record with a wgs84:lat property is displayed on the sWebMap at that latitude coordinate.
wgs84:long Longitude coordinate of a record on a World map. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records, each record with a wgs84:long property is displayed on the sWebMap at that longitude coordinate.
wgs84:alt Altitude of a record on a World map. This property impacts the behavior of the sWebMap component in the sense that, for a given resultset of records, each record with a wgs84:alt property is displayed on the sWebMap with that altitude indicator.
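To make the table above more concrete, the following Python sketch turns record descriptions into simple drawable features. It is only an illustration, not sWebMap code: the record layout, the function names and the example URIs are assumptions, and it assumes the coordinate strings follow the usual KML form of whitespace-separated "longitude,latitude[,altitude]" tuples.

```python
def parse_kml_coordinates(text):
    """Parse 'lon,lat[,alt] lon,lat[,alt] ...' into (lon, lat) float pairs."""
    points = []
    for chunk in text.split():
        parts = chunk.split(",")
        points.append((float(parts[0]), float(parts[1])))
    return points


def map_features(records, marker_icons):
    """Build a flat list of drawable features from record descriptions.

    marker_icons maps a class to its sco:mapMarkerImageUrl value (assumed layout).
    """
    features = []
    for rec in records:
        props = rec["properties"]
        # Shapes come from the polygon and polyline coordinate properties.
        for prop, shape in (("sco:polygonCoordinates", "polygon"),
                            ("sco:polylineCoordinates", "polyline")):
            if prop in props:
                features.append({"record": rec["id"], "shape": shape,
                                 "points": parse_kml_coordinates(props[prop])})
        # A marker needs both a latitude and a longitude.
        if "wgs84:lat" in props and "wgs84:long" in props:
            features.append({"record": rec["id"], "shape": "marker",
                             "lat": props["wgs84:lat"],
                             "long": props["wgs84:long"],
                             # The icon is looked up on the record's class.
                             "icon": marker_icons.get(rec["type"])})
    return features


records = [
    {"id": "park-1", "type": "ex:Park",
     "properties": {"sco:polygonCoordinates":
                    "-91.54,41.66,0 -91.53,41.66,0 -91.53,41.67,0"}},
    {"id": "lib-1", "type": "ex:Library",
     "properties": {"wgs84:lat": 41.659, "wgs84:long": -91.532}},
]
icons = {"ex:Library": "http://example.org/icons/library.png"}
features = map_features(records, icons)
for feature in features:
    print(feature)
```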

sStory Component

The sStory component has different behaviors depending on how its input record is described. Here is the list of SCO properties that impact an sStory component:

SCO Property Impact on the sStory component
sco:storyUrl URL reference to a webpage representation of the story that got indexed into Scones. This property impacts the behavior of the sStory component in the sense that, for a given record’s description, the sStory component refers users to the original webpage URL that was processed by Scones and is displayed in the sStory component.
sco:storyTextUri URI reference to the text of a story. This property impacts the behavior of the sStory component in the sense that, for a given record’s description, the sStory component uses the text document referenced by this property in its text display.
sco:storyAnnotatedTextUri URI reference to the annotated text of a story. Annotations are serialized in XML using the GATE format. This property impacts the behavior of the sStory component in the sense that, for a given record’s description, the sStory component uses the GATE annotated text document referenced by this property to display the tagged concepts in the tags section of the sStory viewer, and also uses it to highlight the tagged terms within the text viewer.

sBarChart and sLinearChart Components

The sBarChart and the sLinearChart components exhibit different behaviors depending on how the input records that are enabled for these component types are described. Here is the list of the SCO properties that impact this behavior:

SCO Property Impact on the sBarChart and the sLinearChart components
sco:comparableWith It is possible to specify a “comparableWith” relation between two predicates. These comparable attributes have the same allowedValue(s), and the semantics of the predicates that are deemed comparable are the same. Since the kinds of values, and their semantics, are the same, they are then considered comparable. This property is normally applied when it is desirable, for example, to plot values of different attributes describing similar records on some visualization component (for example, a linear chart). This property impacts the behavior of the sLinearChart component in the sense that, for a given record’s description and a given ontology, the sLinearChart component will display the values of the comparable attributes on a linear chart.
sco:unitType URI reference to a unit type ontology. The sco:unitType property is used to determine the type of unit referenced by a property. For example, if a data property has xsd:float as range, then sco:unitType determines what kind of thing is referred to by this number. The semantic components make all of the properties that share the same sco:unitType comparable (so, possibly displayable on the same semantic component, such as the sBarChart and the sLinearChart). This property impacts the behavior of the sLinearChart and the sBarChart components in the sense that, for a given record’s description and a given ontology, the sLinearChart or the sBarChart can be selected and used to display the values with the same unit type on one of these charts.
sco:orderingValue The value of the sco:orderingValue predicate is used to order the predicates within a set of comparable predicates. This set of comparable predicates is normally created from the set composed of all comparableWith predicates. This is normally used to plot, and order, values of different attributes describing similar records on some visualization component (for example, a linear chart). This property impacts the behavior of the sLinearChart and the sBarChart components in the sense that, for a given record’s description and a given ontology, the sLinearChart or the sBarChart will order the values of comparable properties on the charts according to the ordering value defined for each property.
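The interplay between sco:unitType and sco:orderingValue can be sketched in a few lines of Python. This is a hedged illustration only: the ex: properties, the annotation table and the function name chart_series are invented for the example, and the real chart components may well proceed differently. The sketch groups a record's values by unit type (so that only comparable values share a chart) and orders each group by ordering value.

```python
from collections import defaultdict

# Hypothetical property annotations, as they might be read from an ontology.
PROPERTY_ANNOTATIONS = {
    "ex:population2001": {"unitType": "ex:PersonCount", "orderingValue": 1},
    "ex:population2011": {"unitType": "ex:PersonCount", "orderingValue": 3},
    "ex:population2006": {"unitType": "ex:PersonCount", "orderingValue": 2},
    "ex:areaKm2": {"unitType": "ex:SquareKilometre", "orderingValue": 1},
}


def chart_series(record):
    """Group a record's values by unit type, ordered by ordering value."""
    groups = defaultdict(list)
    for prop, value in record.items():
        annotation = PROPERTY_ANNOTATIONS.get(prop)
        if annotation is None:
            continue  # not annotated, so not chartable in this sketch
        groups[annotation["unitType"]].append(
            (annotation["orderingValue"], prop, value))
    return {unit: [(prop, value) for _, prop, value in sorted(items)]
            for unit, items in groups.items()}


record = {
    "ex:name": "Northside",
    "ex:population2011": 5200,
    "ex:population2001": 4800,
    "ex:areaKm2": 12.5,
    "ex:population2006": 5000,
}
for unit, series in chart_series(record).items():
    print(unit, series)
```

Each resulting series could then be handed to one chart; the area value stays out of the population chart because its unit type differs.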

sRelationBrowser

The sRelationBrowser component exhibits different behaviors depending on how its input record is described. Here is the list of SCO properties that impact the sRelationBrowser component:

SCO Property Impact on the sRelationBrowser component
sco:relationBrowserNodeType Reference to a relation browser node type used to skin a node according to its type. This should be a reference to a type URI defined in a relation browser node skins configuration file. If a record is defined with this property, the relation browser tries to find a node of that type to apply to it as a skin. This property impacts the behavior of the sRelationBrowser component in the sense that, for a given record’s description and a given ontology, the sRelationBrowser component uses the skin specified by the sco:relationBrowserNodeType attribute to display the record.

sDashboard

The sDashboard component exhibits different behaviors depending on how its input record is described. Here is the list of SCO properties that impact the sDashboard component:

SCO Property Impact on the sDashboard component
sco:dashboardSessionFileUri URI reference to the Dashboard session accessible on the Web. This property impacts the behavior of the sDashboard component in the sense that, for a given record’s description and a given ontology, the sDashboard component loads the Dashboard session referenced by this property.

Guide Visualization Template Selection

One of the core features of the conStruct set of Drupal modules is the ability to use different display templates depending on the types of records available. The selection of these templates is based on the types of those records and the type hierarchies described by the OSF ontologies. This section describes how these ontologies guide template selections.

As a refresher on templates and their use, see the Building conStruct Templates document. It describes how the templating engine works and how to create various templates.

Template Selection

Template selection is the action of binding an instance record to a display template based on its type. Three things are required to make this happen:

  1. Instance records have to be typed
  2. An ontological structure of type relationships (via subClasses) has to exist in one or more OSF ontology(ies), and
  3. A template has to exist for the type of the instance record.

(Note: a specific template by type is not strictly required, since lacking a specific template for the target type, the system will invoke the nearest template up the parental chain in the governing ontology structure, eventually getting to the most generic template available, that for “thing”.)

Impact of Ontologies on Template Selection

conStruct’s templating engine selects record display templates based on the class hierarchy loaded on an OSF instance. It also uses inference on types to select the proper template for a given record.

Let’s say that we try to display information about a foaf:Person instance record. What the system attempts to do is to find a template that displays information about this kind of instance record. First, the foaf:Person type (class) has to be defined in the ontological structure of the OSF instance; if it is not, then no specific template will be selected and the system will default to using the owl_thing.html template (see below). If the type (class) is found, the system will next check to see if a template exists for that specific type. If one exists, it will use the matching template. If one does not, it will next select the parent class of the type and try the match again. If it again fails, it will continue its test up the parental chain. If all tests fail, it will use the default owl_thing.html template. Whichever template is selected then becomes the basis for formatting and presenting the visual record display.

We can use a simple class hierarchy, matched to a simple set of available template files, to illustrate how ontologies impact the conStruct templating system.

Loaded class hierarchy:

owl:Thing
   |
   |
    --> foaf:Agent
            |
            |
            |--> foaf:Person
            |
            |
             --> foaf:Organization

Available template files:

  owl_thing.tpl
  foaf_agent.tpl
  foaf_organization.tpl

Now, let’s say that our OSF portal is about to display information about a foaf:Person record. As we can see, there is no foaf_person.tpl template available for a foaf:Person. However, because of the ontology structure, the system next attempts to select a template from a parent class of foaf:Person.

What the system would do is to check if there is a template available for a record of type foaf:Person. Since there is none, it would try to find one for a parent type, so in this case the foaf:Agent class. In our example, there is now a match. The templating engine thus uses the foaf_agent.tpl template to display information about the foaf:Person record.

Were the foaf_agent.tpl not to exist, then the templating engine would fall back to the owl_thing.tpl template, which is considered to be the “generic record display template”, or the template of last resort.
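The fallback just described can be sketched in Python. This is a minimal illustration rather than conStruct's actual templating code: it assumes each class has a single parent, and the rule that maps a class name to a template file name (foaf:Person to foaf_person.tpl) is inferred from the example file names above. The function and variable names are hypothetical.

```python
# Class hierarchy from the example above, stored as child -> parent.
# Assumption: one parent per class, which keeps the walk a simple chain.
PARENT = {
    "foaf:Agent": "owl:Thing",
    "foaf:Person": "foaf:Agent",
    "foaf:Organization": "foaf:Agent",
}
AVAILABLE_TEMPLATES = {"owl_thing.tpl", "foaf_agent.tpl", "foaf_organization.tpl"}
DEFAULT_TEMPLATE = "owl_thing.tpl"


def template_name(class_uri):
    """Map 'foaf:Person' to 'foaf_person.tpl' (naming rule assumed from the example)."""
    return class_uri.replace(":", "_").lower() + ".tpl"


def select_template(record_type, parent=PARENT, templates=AVAILABLE_TEMPLATES):
    """Walk up the class chain until a template exists; else use the default."""
    # A type unknown to the loaded ontologies gets the generic template.
    if record_type != "owl:Thing" and record_type not in parent:
        return DEFAULT_TEMPLATE
    current = record_type
    while current is not None:
        candidate = template_name(current)
        if candidate in templates:
            return candidate
        current = parent.get(current)  # None once we go past owl:Thing
    return DEFAULT_TEMPLATE


print(select_template("foaf:Person"))        # falls back to the foaf:Agent template
print(select_template("foaf:Organization"))  # has its own template
```

Adding a foaf_person.tpl file to the set of available templates would change the first result without touching the selection logic, which is the point of the design.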

This design means that if:

  1. the ontological structure changes over time, or
  2. new templates get added to the system

then there may be an impact on how the record gets displayed.

The major advantage of this design is that more and more specific formatting templates may be added to an OSF installation over time, both improving the tailored look of results displays and accommodating more structure and relationships as they evolve.

Provide Reasoning and Inference

A standard use of ontologies is for reasoning and inference, and those used by OSF are no exception.

By extension, however, we can also use these same capabilities to check on data consistency and coherence. This is an important feature of the system since the system can detect if there are logical inconsistencies or logical incoherencies that have been developed by the system administrator during ontology growth and development. Having coherent and consistent ontologies means that we have the proper foundations to create consistent and coherent datasets of instance records.

See further the discussion on reasoning using Protégé.

Guide Content Filtering

Filtering data is the action of getting a subset of records from a complete dataset based on some selection criteria. In OSF, the predominant share of filtering is done using the structWSF Search Web service endpoint. A minority of filtering is done using the SPARQL endpoint. It is also possible to filter via the AGGR aggregation ontology.

Possible filtering criteria for the Search endpoint are:

  1. Filtering by type(s)
  2. Filtering by attribute(s)
  3. Filtering by attribute(s)/value(s)
  4. Filtering by geo-localization (within a given geographical area)

These filtering activities are performed by different tools of the stack, such as:

  • structSearch
  • structBrowse
  • sWebMap

These tools are impacted by the definition of the loaded ontologies. The filtering of the values by types, attributes and attributes/values requires an ontology class or an ontology property as filtering criteria.

Filtering with Inference

Also, any Search query can be performed with inference enabled. Just as in the template selection section above, inference can have a big impact on the number and nature of returned results. Let’s consider this example class structure:

Loaded class hierarchy:

owl:Thing
   |
   |
    --> bibo:Document
            |
            |
            |
             --> bibo:Image
                     |
                     |
                     |--> muni:HeritageImage
                     |        |
                     |        |
                     |         --> muni:ParkHeritageImage
                     |
                     |
                     |--> muni:NeighborhoodImage
                     |
                     |
                     |
                      --> muni:ParkImage

Indexed records:

  <1> a bibo:Image .
  <2> a muni:HeritageImage .
  <3> a muni:HeritageImage .
  <4> a muni:ParkHeritageImage .
  <5> a muni:ParkImage .

This class structure shows a hierarchy of images where the leaf classes are topical image classes (that is, classes whose individuals are images representing one of the topics: Heritage, Neighborhood and Park). Now let’s see how this class structure impacts Search queries, and returned results, by different tools (structSearch, structBrowse, sWebMap and others).

Here is a series of Search queries sent to a structWSF instance that has this class hierarchy loaded, using the sample specification noted above. The results potentially returned by the Search endpoint, with and without inferencing turned on, are as follows:

Use Case #1 (type filter: muni:HeritageImage, inference: off) returns:
<2> a muni:HeritageImage .
<3> a muni:HeritageImage .

Use Case #2 (type filter: muni:HeritageImage, inference: on) returns:
<2> a muni:HeritageImage .
<3> a muni:HeritageImage .
<4> a muni:ParkHeritageImage .

Use Case #3 (type filter: bibo:Image, inference: off) returns:
<1> a bibo:Image .

Use Case #4 (type filter: bibo:Image, inference: on) returns:
<1> a bibo:Image .
<2> a muni:HeritageImage .
<3> a muni:HeritageImage .
<4> a muni:ParkHeritageImage .
<5> a muni:ParkImage .

In Use Case #1, the user requests all of the muni:HeritageImages without inferencing. This means that the Search endpoint will return only the records that have been typed directly as muni:HeritageImage. In this case, records <2> and <3> are returned.

Use Case #2 is a variant of Use Case #1, only now with inferencing enabled. In this use case, the Search endpoint will return all the muni:HeritageImage records and all the records that are typed with one of its subtypes (in this case, muni:ParkHeritageImage). For this query, records <2>, <3> and <4> are returned. This case shows where ontologies can have a dramatic impact on the system. If we modified the class hierarchy and made muni:ParkHeritageImage a direct sub-class of bibo:Image, then Use Case #2 would return the same results as Use Case #1.

With Use Case #3, the endpoint returns only record <1>: inferencing is disabled, and <1> is the only record typed directly as bibo:Image.

Use Case #4 is a variant of Use Case #3 where inferencing is enabled. The endpoint returns all the image records because all of them are bibo:Image by inference on type.
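The difference between the two modes can be reproduced with a small Python sketch over the sample hierarchy and records. This is an in-memory illustration, not the structWSF Search endpoint: the function names and data layout are assumptions, and record <5> is typed here as muni:ParkImage, the leaf class shown in the hierarchy.

```python
# Sample class hierarchy, stored as subclass -> direct superclass.
SUBCLASS_OF = {
    "bibo:Document": "owl:Thing",
    "bibo:Image": "bibo:Document",
    "muni:HeritageImage": "bibo:Image",
    "muni:ParkHeritageImage": "muni:HeritageImage",
    "muni:NeighborhoodImage": "bibo:Image",
    "muni:ParkImage": "bibo:Image",
}

# Sample indexed records: record id -> asserted type.
RECORDS = {
    "1": "bibo:Image",
    "2": "muni:HeritageImage",
    "3": "muni:HeritageImage",
    "4": "muni:ParkHeritageImage",
    "5": "muni:ParkImage",
}


def is_subclass_of(cls, ancestor):
    """True if cls equals ancestor or sits anywhere below it in the hierarchy."""
    while cls is not None:
        if cls == ancestor:
            return True
        cls = SUBCLASS_OF.get(cls)
    return False


def filter_by_type(type_filter, inference):
    """Return matching record ids, with or without inference on types."""
    if inference:
        return [rid for rid, rtype in RECORDS.items()
                if is_subclass_of(rtype, type_filter)]
    # Without inference, only the asserted type counts.
    return [rid for rid, rtype in RECORDS.items() if rtype == type_filter]


print(filter_by_type("muni:HeritageImage", inference=False))  # Use Case #1
print(filter_by_type("muni:HeritageImage", inference=True))   # Use Case #2
print(filter_by_type("bibo:Image", inference=True))           # Use Case #4
```

Moving muni:ParkHeritageImage directly under bibo:Image in SUBCLASS_OF is enough to make the second query return the same records as the first, with no change to the records themselves.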

Filtering via the AGGR Ontology

The AGGR Ontology also has an impact on anything that displays facets of filtered searches. Amongst others, it impacts the structSearch and structBrowse conStruct modules. It also impacts different user interfaces that use the Search Web service endpoint to perform auto-completion tasks.

Tag Concepts in Text Documents

In the Open Semantic Framework, the Scones Web service endpoint is what is used to analyze unstructured text documents, then turning them into semi-structured text documents by automatically tagging concepts. The concept tagging takes place using ontology-based information extraction, or OBIE. Named entity dictionaries are the basis for entity tagging.

These concepts used for the tagging come from selected ontologies loaded on the system. The way these ontologies have been created, and the way the classes and named individuals have been defined, has a dramatic impact on the quality of the documents tagged by Scones.

Scones uses two things from ontologies:

  • its classes
  • its named individuals

Depending on settings, one or both of these sources may be used for Scones tagging.

There are a few properties intimately related to the Scones Web service endpoint:

Properties Impact on Scones
iron:prefLabel Preferred label to refer to an instance record. This property impacts the behavior of the Scones tagger in the sense that the value of the iron:prefLabel property is used to detect and tag as a reference the corresponding class or named individual.
iron:altLabel Alternative label to refer to an instance record. This property impacts the behavior of the Scones tagger in the sense that the value(s) of the iron:altLabel property are used to detect and tag as a reference the corresponding class(es) or named individual(s).
iron:hiddenLabel Hidden labels are labels that are not displayed in any user interface, but may be used by different systems for indexing purposes (such as for recognizing misspellings). This property impacts the behavior of the Scones tagger in the sense that the value(s) of the iron:hiddenLabel property are used to detect and tag as a reference the corresponding class(es) or named individual(s). Although hidden labels are not displayed in user interfaces, they specify variations in the way some of the other labels may be written, and they are explicitly used by the Scones tagger.
sco:namedEntity Specifies if a resource can be considered a named entity. Literal value: “true” or “false”. This property impacts the behavior of the Scones tagger in the sense that all of the records with the sco:namedEntity property set to true are automatically added by the Scones endpoint to its Named Entities Dictionaries. This means that all the records that are specified to be named entities will be used by Scones to tag any input text documents.
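How these four properties could feed a tagger can be sketched as follows. This is a deliberately simplified Python illustration, not how Scones or GATE actually work: it uses plain whole-word, case-insensitive matching, and the record layout, example URIs and function names are assumptions. It builds a dictionary from the preferred, alternative and hidden labels of records flagged as named entities, then looks those labels up in a text.

```python
import re


def build_dictionary(records):
    """Map each lower-cased label to the URI of a named-entity record."""
    dictionary = {}
    for rec in records:
        # Only records flagged as named entities enter the dictionary.
        if rec.get("sco:namedEntity") != "true":
            continue
        labels = [rec["iron:prefLabel"]]
        labels += rec.get("iron:altLabel", [])
        labels += rec.get("iron:hiddenLabel", [])
        for label in labels:
            dictionary[label.lower()] = rec["uri"]
    return dictionary


def tag_text(text, dictionary):
    """Return (matched text, uri) pairs for whole-word matches, in text order."""
    tags = []
    for label, uri in dictionary.items():
        pattern = r"\b" + re.escape(label) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            tags.append((match.start(), match.group(0), uri))
    return [(found, uri) for _, found, uri in sorted(tags)]


records = [
    {"uri": "ex:iowa-city", "sco:namedEntity": "true",
     "iron:prefLabel": "Iowa City",
     "iron:altLabel": ["IC"],
     "iron:hiddenLabel": ["Iowa Citty"]},  # a misspelling kept as hidden label
    {"uri": "ex:coralville", "sco:namedEntity": "false",
     "iron:prefLabel": "Coralville"},
]
dictionary = build_dictionary(records)
text = "The council met in Iowa City; one memo spelled it Iowa Citty."
print(tag_text(text, dictionary))
```

The misspelled mention is still resolved to the same record through its hidden label, while the record not flagged as a named entity never enters the dictionary.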

Help Navigate and Organize Web Portals

In OSF, ontologies also have an impact on the general organization of a Web portal and how it is navigated.

Portal Navigation

In an OSF portal, the domain ontologies use the sRelationBrowser for general navigation. The relation browser is a tool that lets users dynamically navigate a graph (that is, nodes with arcs that link these nodes). The most widespread usage of the relation browser is to let users navigate the links between ontology concepts. These concepts are the anchor points for the other content available on the Web portal. By navigating the concepts (classes) structure, users are able to explore an OSF portal’s entire content.

Each node in the sRelationBrowser semantic component is linked to whatever other kinds of related records exist in the system. Depending on the types of those records, other semantic components can then be invoked to display this tightly related content for each node.

Ontologies thus impact navigation and discovery on an OSF portal in two ways:

  1. They impact the navigation of the structure by defining which concepts are linked to other concepts and with what property
  2. They define what related records may get displayed to the user based on their classes and properties.

Layouts Organization

OSF Web portals are mainly organized by Layouts. A layout is a specific page presentation format with specific design, components and ordering and sizing of those components. This page presentation is highly influenced by the kind of things indexed in the system. Generally, layouts present records of a certain type (or family of types), along with specialized functions that users are able to use to perform different actions on that set of records.

Here are a few examples of such layouts:

These layouts aggregate all of the records of a certain type (like images of all kinds), display them using different kinds of tools (like an Images Gallery), and filter them depending on different filtering criteria (like mapping the position where each image was captured within a specific neighborhood area).

The ontologies impact the general organization of the Web portal because of the kind of things that are indexed in the system interacting with the available layouts.

Manage Datasets and Ontologies

Basic settings for managing datasets and ontologies are provided by the WSF Ontology. It presently does so via two mechanisms.

Datasets Syncing Framework

The Datasets Syncing Framework behaves differently depending on the value of the wsf:crudAction property for each input record.

WSF Property Impact on the DSF
wsf:crudAction States the CRUD action that should be used to index a given record into structWSF. This property is used by the Datasets Syncing Framework to determine if the record fed to it should be created, deleted or updated. The value of this property can be one of: (1) create, (2) update, (3) delete. This property impacts the behavior of the DSF in the sense that it is the record’s description, using this property, that tells the framework how to behave (create, delete or update) toward the input record. If nothing is specified, the record will simply be ignored.
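The dispatch on wsf:crudAction can be pictured with a short Python sketch. It is only an illustration against an in-memory dictionary, not the Datasets Syncing Framework itself: the record layout and the function name sync_records are assumptions, and in this toy store create and update both amount to writing the record.

```python
def sync_records(records, store):
    """Apply each record's wsf:crudAction to an in-memory store.

    Returns the ids of the records that were ignored because no
    recognized action was specified.
    """
    ignored = []
    for rec in records:
        action = rec.get("wsf:crudAction")
        if action in ("create", "update"):
            # In this toy store, both actions write the record's data.
            store[rec["id"]] = rec["data"]
        elif action == "delete":
            store.pop(rec["id"], None)
        else:
            # No action specified: the record is simply ignored.
            ignored.append(rec["id"])
    return ignored


store = {"b": {"title": "old"}, "c": {"title": "to remove"}}
records = [
    {"id": "a", "wsf:crudAction": "create", "data": {"title": "new"}},
    {"id": "b", "wsf:crudAction": "update", "data": {"title": "revised"}},
    {"id": "c", "wsf:crudAction": "delete", "data": {}},
    {"id": "d", "data": {"title": "no action"}},
]
ignored = sync_records(records, store)
print(store)
print(ignored)
```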

structOntology

The structOntology conStruct module exhibits different behavior depending on the value of the wsf:ontologyModified property for each input ontology description.

WSF Property Impact on structOntology
wsf:ontologyModified States if an ontology has been modified since the last time it was saved on the file system of the OSF server instance. This property impacts the behavior of the structOntology module in the sense that if, for an input ontology, the description of that ontology states that this property is “true”, then the module will notify the user via its loaded ontologies list that the ontology has been modified and has not yet been saved.

Set Access Permissions and Registrations

The WSF Ontology also has a principal purpose to describe the internal state of a structWSF instance such as the internal access control records, the datasets descriptions, the registered web service endpoints, etc. As a result, this ontology can have multiple effects on other parts of an OSF instance.

The WSF Ontology is used to describe three main areas of a structWSF installation:

  1. datasets registry
  2. access definition registry
  3. registered web services endpoints registry

These registries are hosted in some specialized datasets in the triple store (Virtuoso for most OSF installations). The information indexed in these different registries is defined using the WSF ontology.

All structWSF Web services are affected by these registries.

by Frederick Giasson at December 05, 2011 06:02 PM

AI3:::Adaptive Information (Mike Bergman)

An Ontologies Architecture for Ontology-driven Apps

Open Semantic Framework Ontology Modularization and Roles within an OSF Instance

For some time now, Structured Dynamics (SD) has been touting the unique advantages of ODapps, or ontology-driven applications [1]. ODapps are modular, generic software applications designed to operate in accordance with the specifications contained in one or more ontologies. The relationships and structure of the information driving these applications are based on the standard functions and roles of ontologies (namely as domain ontologies), as supplemented by UI and instruction sets and validations and rules. When these supplements are added to standard ontology functions, we collectively term them adaptive ontologies [2].

To further the discussion around ODapps, today we are publishing two new documents, using the semantic technology foundation of the open semantic framework. OSF is a comprehensive, open source stack of SD and external tools that provides a turnkey environment for enterprises to adopt semantic technologies and approaches. OSF has been designed from the ground up to be an ontology-driven application framework.

The first new document, posted on Fred Giasson’s blog, provides a detailed discussion of the dozen or so roles ontologies can play within an OSF installation. Fred’s document is geared more to specific properties and configurations useful to deploy this framework; that is, the “drivers” in an ODapp setting. The second new document — this one — is more of a broad overview of the modularization and architecture of the constituent ontologies that make up an OSF installation. Both documents have also been posted to SD’s open content TechWiki [3], which now has about 360 technical articles on understanding and implementing an OSF installation, importantly including its ontologies.

OSF Constituent Ontologies

As presently configured, an OSF installation may typically utilize most or all of the following internal ontologies:

  • The SCO Ontology (Semantic Component Ontology)
  • The WSF Ontology (Web Service Framework Ontology)
  • The AGGR Ontology (Aggregation Ontology)
  • The irON Ontology (Instance Record and Object Notation Ontology)
  • One or more domain ontologies, to capture the concepts and relationships for the purposes of a given OSF installation, and
  • Possibly UMBEL (optional) or other upper-level concept ontologies, used for linkages to external systems.

(Note: the internal wiki links to each of these ontologies also provide links to the actual ontology specifications on Github.)

Depending on the specific OSF installation, of course, multiple external ontologies may also be employed. Some of the common external ones used in an OSF installation are described by the external ontologies document on the TechWiki. These external ontologies are important — indeed essential in order to ensure linkage to the external world — but have little to do with internal OSF control structures. That is why the rest of this discussion is focused on internal ontologies only.

The OSF Ontologies Architecture

The actual relationships between these ontologies are shown in the following diagram. Note that the ontologies tend to cluster into two main areas:

  1. Content (or domain) ontologies, which tend to embody more of the traditional ontology functions such as information interoperability, inferencing, reasoning and conceptual and knowledge capture of the applicable domain; and
  2. Administrative ontologies, which govern internal application use and user interface interactions.

This ontology architecture supports the broader open semantic framework:


The WSF ontology plays a special role in that it sets the overall permission and access rights to the other components and ontologies. The UMBEL ontology (or other upper-level ontologies that might be chosen) is also optional. Such vocabularies are included when interoperability with external applications or knowledge bases is desired.

Summary of OSF Roles

We can further disaggregate these ontology splits with respect to the specific dozen or so ontology roles discussed in Fred’s complementary piece on ontology roles in OSF. These dozen roles are shown by the rows with interactions marked for the various ontologies:

Ontology roles (rows) by ontology (columns): SCO, AGGR, WSF, irON, Domain, UMBEL
Define record descriptions          
Inform interface displays      
Integrate different data sources      
Define component selections    
Define component behaviors        
Guide template selection      
Provide reasoning and inference      
Guide content filtering (with and without inference)        
Tag concepts in text documents      
Help organize and navigate Web portals        
Manage datasets and ontologies          
Set access permissions and registrations          

One of the unique aspects of adaptive ontologies is their added role in informing user interfaces and supporting specific semantic tools. Note, for example, the role of the content ontologies in informing interface displays, as well as their use in tagging concepts (via information extraction). These additional roles are the reason that these ontologies are shown as straddling both content and administrative functions in the first figure.

See Fred’s piece to learn more about these dozen roles.

Interactions Are More Complex than Arrows

Naturally, a simple drawn arrow between ontologies (first figure) or a checkmark on a matrix (table above) can hide important details of how these interactions between ontologies and components actually work. In an earlier article, we discussed how the whole workflow takes place between users and user interface selections affecting the types of data returned by those selections, and then the semantic components (widgets) used to display them. This example interaction is shown by the following animation:


The blue nodes show the ontology interactions. These, in turn, instruct how the various components (yellow) and code (green) need to operate. These interactions are the essence of an ontology-driven app. The software is expressly designed to respond to specifications in the ontology(ies) used, and the ontologies themselves embrace some additional properties specific to driving those apps.

Possible Future Directions

ODapps are a relatively new paradigm, and we continue to learn more about their uses and potential. We have wanted to write the first versions of these two new documents for some time, but held off as we learned about and exploited the latent potential in this design. As it stands, we see further potential in this approach, and will therefore likely be adding new ontologies and capabilities to the general system for some time.

Some of the areas that look promising to us include:

  • A generalized statistical ontology, especially as it can inform data displays in the semantic components
  • Even more capable widgets in business intelligence (BI) uses, with a concomitant expansion of the vocabulary (predicates and classes) in some of the underlying ontologies
  • More aggregation and summation functions supported by the AGGR ontology, and
  • Still further improved permissions and access layers in the WSF ontology.

These potentials arise from the native power of the design basis for ontology-driven apps. Conceptually, the design is simplicity itself. Operationally, the system is extremely flexible and robust. Strategically, it means that development and specification efforts can now move from coding and programmers to ontologies and the subject matter users who define and depend on them. With these advantages, who can argue with that?


[1] For the most comprehensive discussion of ODapps, see M.K. Bergman, 2011. “Ontology-Driven Apps Using Generic Applications”, AI3:::Adaptive Information blog, March 7, 2011. You may also search on that blog for ‘ODapps’ to see related content.
[2] See M.K. Bergman, 2009. “Ontologies as the ‘Engine’ for Data-Driven Applications”, AI3:::Adaptive Information blog, June 10, 2009, for the first presentation of these topics, but the specific term adaptive ontology was not yet used. That term was first introduced in “Confronting Misconceptions with Adaptive Ontologies” (August 17, 2009). The dedicated treatment of these topics and their interplay was provided in M.K. Bergman, 2009. “Ontology-driven Applications Using Adaptive Ontologies”, AI3:::Adaptive Information blog, November 23, 2009. The relation of these topics to enterprise software was first presented in M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise”, AI3:::Adaptive Information blog, September 28, 2009.
[3] Slight revisions of these documents have been posted to the TechWiki as Role and Use of Ontologies in OSF and OSF Ontologies Modularization and Architecture, respectively.

by Mike Bergman at December 05, 2011 06:01 PM

November 21, 2011

Frederick Giasson's Weblog

Moving Projects from Google Code to GitHub

Last week we slowly migrated Structured Dynamics’ Google Code projects to GitHub. We have been thinking about moving to GitHub for some time now, but we only wanted to move projects to it if no prior history and commits were dropped in the process. One motivation for the possible change has been the seeming lack of support by Google for certain long-standing services: we are seeing disturbing trends across a number of existing services. We also needed a migration process that would work with all of our various projects, without losing a trunk, branch, tag or commits (and their related comments).

It was not until recently that I found a workable process. Other people have successfully migrated Google Code SVN projects to GitHub, but I had yet to find a consolidated guide to do it. It is for this last reason that I write this blog post: to help people, if they desire, to move projects from Google Code to GitHub.

Moving from Google Code to GitHub

The protocol outlined below may appear complex, but it looks more intimidating than it really is. Moving a project takes about two to five minutes once your GitHub account and your migration computer are properly configured.

You need four things to move a Google Code SVN project to GitHub:

  1. A Google Code project to move
  2. A GitHub user account
  3. SSH keys, and
  4. A migration computer that is configured to migrate the project from Google Code to GitHub. (In this tutorial, we will use an Ubuntu server, but any other Linux/Windows/Mac computer, properly configured, should do the job.)

Create GitHub Account

If you don’t already own a GitHub account, the first step is to create one here.

Create & Configure SSH Keys

Once your account has been created, you have to create and set up the SSH keys that you will use to commit the code into the Git repository on GitHub:

  1. Go to the SSH Keys Registration page of your account
  2. If you already have a key, then add it to this page, otherwise read this manual to learn how to generate one
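If no key exists yet, one can be generated with ssh-keygen, which ships with OpenSSH. The following is only a sketch: the helper name make_github_key, the key path and the email address are placeholders chosen for illustration, and the manual linked above remains the authoritative reference.

```shell
# Hypothetical helper: create an RSA key pair for use with GitHub.
# Arguments: email (stored as the key comment), key file path, passphrase.
make_github_key() {
    email="$1"
    keyfile="$2"
    passphrase="$3"
    # -t/-b select the key type and size, -C sets the comment,
    # -f the output file, -N the passphrase, -q keeps the output quiet.
    ssh-keygen -q -t rsa -b 4096 -C "$email" -f "$keyfile" -N "$passphrase"
}

# Example (placeholder values, not run here):
#   make_github_key you@example.com "$HOME/.ssh/id_rsa_github" 'a long passphrase'
#   cat "$HOME/.ssh/id_rsa_github.pub"   # paste this into the SSH Keys page
```

Once the public key has been registered, running ssh -T git@github.com is a common way to check that GitHub accepts it.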

Configure Migration Server

The next step is to configure the computer that will be used to migrate the project. For this tutorial, I use an Ubuntu server to do the migration, but any Windows, Linux or Mac computer should do the job if properly configured.

The first step is to install Git and Ruby on that computer:

 sudo apt-get install git-core git-svn ruby rubygems

To perform the migration of a Google Code SVN project to GitHub, we are using a Ruby application called svn2git that is now developed by Kevin Menard. The next step is to install svn2git on that computer:

 sudo gem install svn2git --source http://gemcutter.org

Migrate Project

Before migrating your project, you have to link the Google Code committers to GitHub accounts. This is done by populating a simple text file that will be given as input to svn2git.

Create the authors.txt file in a temporary folder:

 sudo vim /tmp/authors.txt

Then, for each author, you have to add the mapping between their Google Code and GitHub accounts. If a Google Code committer does not exist on GitHub, then you should map it to your own GitHub account.

 (no author) = Frederick Giasson <fred@f...com>
 fred@f...com = Frederick Giasson <fred@f...com>

The format of this authors.txt file is:

 Google-Account-Username = Name-Of-Author-On-GitHub <Email-Of-Author-On-GitHub>

Take note of the first Google Code committer (no author) mapping. This link is required for every authors.txt file. This placeholder is used to map the initial commit performed by the Google Code system. (When Google Code initializes a new project, it uses that username for creating the first commit of any project.)
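One way to collect the committer names for this file is to read them from the SVN log. The sketch below assumes the svn command-line client is installed; svn_authors_skeleton is a hypothetical helper name, and FULL NAME and EMAIL are placeholders to be filled in by hand with the matching GitHub details.

```shell
# Hypothetical helper: read `svn log --quiet` output on stdin and print
# one skeleton authors.txt line per distinct committer.
svn_authors_skeleton() {
    # Revision lines look like: r12 | username | date
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' | LC_ALL=C sort -u |
    while IFS= read -r author; do
        printf '%s = FULL NAME <EMAIL>\n' "$author"
    done
}

# Example (not run here):
#   svn log --quiet http://myproject.googlecode.com/svn | svn_authors_skeleton > /tmp/authors.txt
```

The (no author) entry shows up in the output like any other committer, so it is not forgotten.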

When you are done, save the file.

Now that setup is complete, you are ready to migrate your project. First, let’s create the folder that will be used to check out the SVN project on the server, and then to push it to GitHub.

cd /tmp/
mkdir myproject
cd myproject

In this tutorial, we have a normal migration scenario. However, your migration scenario may differ from this one. That is why I suggest you check the different scenarios covered in the svn2git documentation and change the following command accordingly. Let’s migrate the Google Code SVN project into the local Git repository:

 /var/lib/gems/1.8/bin/svn2git http://myproject.googlecode.com/svn --authors /tmp/authors.txt --verbose

Make sure that no errors have been reported during the process. If any were, refer to the Possible Errors and Fixes section below to troubleshoot your issue.
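One quick sanity check after the conversion is to list the distinct authors recorded in the new Git history. Every entry should be a name and email taken from authors.txt rather than a raw Google Code username. This is a minimal sketch; list_git_authors is a hypothetical helper name, and it is meant to be run from inside the migrated repository.

```shell
# Hypothetical helper: print each distinct "Name <email>" author found
# in any branch or tag of the current Git repository.
list_git_authors() {
    git log --all --format='%an <%ae>' | LC_ALL=C sort -u
}

# Example (not run here):
#   cd /tmp/myproject && list_git_authors
```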

The next step is to create a new GitHub repository to receive the migrated SVN project. Go to this GitHub page to create your new repository. Then configure Git to add a remote link from the local Git repository you created on your migration computer to this remote GitHub repository:

 git remote add origin git@github.com:your-github-username/myproject.git

Finally, let’s push the local Git repository master, branches and tags to GitHub. The first thing to push to GitHub is the SVN trunk. This is done by running this command:

 git push -u origin master

Then, if your project has multiple branches and tags, you can push them, one by one, using the same command, replacing master with the name of that branch or tag. If you don’t know the exact names of these branches or tags, you can easily list all of them using this Git command:

 git show-ref

Once you have progressed through all branches and tags, you are done. If you take a look at your GitHub project’s page, you should see that the trunk, branches, tags and commits are now properly imported into that project.
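As an alternative to pushing refs one at a time, Git can push every local branch and every tag in two commands. The sketch below assumes svn2git has left the SVN branches and tags as local Git branches and tags, which is worth confirming with git show-ref first; push_all_refs is a hypothetical helper name.

```shell
# Hypothetical helper: push all local branches, then all tags, to a remote.
# Argument: remote name (defaults to origin).
push_all_refs() {
    remote="${1:-origin}"
    # --all pushes everything under refs/heads; --tags everything under refs/tags.
    git push "$remote" --all && git push "$remote" --tags
}

# Example (not run here):
#   cd /tmp/myproject && push_all_refs origin
```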

Possible Errors And Fixes

Fatal Error: Not a valid object name

There are a few things that can go wrong while trying to migrate your project(s).

One of the errors I experienced is a "fatal" error message "Not a valid object name". To fix this, we have to change a line of code in svn2git. Open the migration.rb file and look around line 227 for the method fix_branches(). Remove the first line of that method, and replace the second one with:

 svn_branches = @remote.find_all { |b| !@tags.include?(b) && b.strip =~ %r{^svn\/} }

Error: author not existing

While running the svn2git application, the process may finish prematurely. If you check the output, you may see that it can’t find the match for an author. What you will have to do is to add that author to your authors file and re-run svn2git. Otherwise you won’t be able to fully migrate the project.
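To catch this before running svn2git, the committers in the SVN log can be compared against the authors file. This is a sketch under the assumption that the svn command-line client is available; missing_authors is a hypothetical helper name.

```shell
# Hypothetical helper: read `svn log --quiet` output on stdin and print
# every committer that has no mapping in the given authors file.
missing_authors() {
    authors_file="$1"
    awk -F' [|] ' '/^r[0-9]+ [|] / { print $2 }' | LC_ALL=C sort -u |
    while IFS= read -r author; do
        # A committer is mapped when a line starts with "<committer> = ".
        awk -F' = ' -v a="$author" '$1 == a { found = 1 } END { exit !found }' \
            "$authors_file" || printf '%s\n' "$author"
    done
}

# Example (not run here):
#   svn log --quiet http://myproject.googlecode.com/svn | missing_authors /tmp/authors.txt
```

An empty result means every committer in the log already has an entry.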

I’m not quite sure why these minor glitches occurred during my initial migration, but with the simple fixes above you should be good to go.

by Frederick Giasson at November 21, 2011 09:29 PM

November 15, 2011

AI3:::Adaptive Information (Mike Bergman)

UMBEL Services, Part 4: structOntology

Improved Ontology Navigation and Management in Read-only and Editable Forms

This continues our series on the new UMBEL portal. UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. This part four discusses structOntology, the online ontology viewing and management tool that is an integral part of the open semantic framework (OSF), the framework that hosts the UMBEL portal.

Ontologies are the central governing structure or “brains” of a semantic installation. As provided by the OSF framework, ontologies are also the basis for instructing user interface labels and how the interface behaves. The Web is about global access, immediacy, flexibility and adaptability. Why can’t our use of ontologies be the same?

Unlike similar tools of the past, structOntology exists on the same installation as the ontology that drives it. It is a backoffice ontology editing and management tool that is part of the conStruct tool suite, accessible via the OSF admin panel. There is no need to go off to a separate application, make changes, re-import, and then test. structOntology allows all of that to occur locally with the instance in which it resides. Also, some important functional differences, especially in finding and selecting items and in search, set structOntology apart from existing, conventional tools.

Yet, that being said, structOntology is also not the complete Swiss Army knife for ontology management. It is designed for local and immediate use. Its spectrum of functionality is not as complete as other ontology frameworks (for example, supporting reasoners, consistency testers or plug-ins). So, for immediate and locally relevant use, structOntology appears to be the appropriate tool. For more detailed ontology work or testing, other frameworks are perhaps more useful. And, in recognition of these roles, structOntology also has robust import and export capabilities that enable these dual local-detailed use scenarios. For these distinctions, see further the structOntology v Protégé? document.

structOntology comes in two versions. First, there is the read-only version, which can be made publicly available, that is a great aid to ontology navigation and discovery. This is the version viewable on the UMBEL portal. Second, there is an editable version, which is only available to administrators via a back office function within an OSF instance. Some screen shots of this version, plus pointers to more documentation about it, are provided below.

OWL API as a First-class Citizen

What enables OSF to treat ontologies as first-class citizens (viewable and editable from within the applications in which they operate) is the incorporation of the OWL API as one of the major engines underlying the structWSF Web services framework, the key foundation of an OSF installation. As noted in Part 2 of this series, the OWL API is one of the four major engines supporting structWSF:

The OWL API is the same engine used by Protégé 4, which is why both structOntology and Protégé are fully interoperable.

Besides interoperability, the use of the OWL API also means that other OWL API-based tools, such as reasoners or mappers, may be linked into the system. This design is in keeping with our normative view of an ontology tooling landscape, which Structured Dynamics keeps pursuing in a steady, incremental manner [2]. Further, because of its sibling engines, the OWL API and OSF are also able to leverage the other engines supporting structWSF, such as Solr for advanced search or efficient indexing in the RDF triplestore. (The advantages go both ways, too: for example, the OWL API can feed appropriate ontology specifications to the GATE text processing area for uses such as ontology-based information extraction [OBIE].) All of this makes for a most powerful and capable foundation to an OSF instance.

The Read-Only Version (UMBEL)

Since UMBEL is a reference ontology and the UMBEL portal is an access point to those references and specifications, we really don’t want casual users making modifications to the ontology [3]. For this reason, only a read-only version of structOntology is provided on the portal.

Access to the structOntology function occurs via the Ontology link on the UMBEL portal. Upon access, you are presented with the main structOntology interface:

The organization of the structOntology application presents all currently available and active ontologies listed in the left panel; UMBEL, of course, is the one selected here. Since this is a read-only version, only the View button shows up in the right-hand panel. (For the options available in the editable version, see below.)

View Option

Upon invoking the View option, the hierarchical tree for the selected ontology appears on the left; structural information and definitions appear on the right.

You may expand the tree and explore the structure deeper by either clicking on the tree nodes in the left-hand panel or the item links in the right-hand panel. If there are further levels in the tree, you will get the JavaScript ‘working’ icon and then see the tree expanded with the new node information shown to the right.

Also note that your interaction with the structOntology application is recounted via the “breadcrumbs” listing at the upper left of the application. The green arrow icon allows you to expand or collapse various sections in the display.

Tooltips

The tree labels are themselves based on the preferred labels assigned to things. However, if you want to see the actual ontology URI reference, you can do so via the tooltip when mousing over the item:

Ontology view tooltips

The tooltip shows the full URI path (unique identifier) of the selected item.

Classes Tab

This example has been based on the Classes tab, which shows the reference concepts in the UMBEL context. In read-only mode, the basic information presented is the tree structure, the item description and prefLabel, and super and sub class information in the right-hand panel. (More options are available in the editable version; see below.)

Properties Tab

Properties (that is, the relations or predicates between items or nodes) are presented in a similar manner to that for Classes. The Properties tab has the same basic layout and operations as the Classes tab, including similar right-hand panels:

The Editable Version

The editable version of structOntology shares all of the functionality of the read-only version. Besides adding editing capabilities, the editable version also has other functionality related to general ontology creation and management. There is separate documentation for the editable version; the examples below are from a different instance than UMBEL.

The editable version is accessed via the backoffice admin function within an OSF instance. When invoked, it also has more management options presented in the right-hand panel:

We’ll highlight some of the differences from the read-only version below.

Create New Option

The first notable addition is the ability to create ontologies (as well as to delete, or Remove, them):

The URL (such as http://purl.org/ontology/myont#) becomes the base URI for the new ontology. The new ontology is created with a basic structure, from which you only need fill in your new concepts or classes and relationships:

Basic stubbing is provided for the new ontology to help bootstrap its development (not shown). Once created, this new ontology also appears among the available local ontologies when first invoking the structOntology application.

View Option

Most screens are quite similar to the read-only version with the obvious change of replacing labels with edit boxes. It is via these edit fields that the ontology becomes editable. This change is quite evident for the View screen:


Searching

Searching can take place on the currently active ontology or all loaded (available) ontologies. Note that this selection is made via the radio button under the search box.

Depending on settings, searching can take place on only the preferred label, or also on alternative labels and descriptions (in fact, on all annotations).

When entering search terms, the system automatically attempts to complete the matching search phrase. A minimum of three entered characters guides this auto-completion functionality:

When search is initiated, the potential results list also auto-completes for what you have already typed into the search box. Upon selection of one of these items (or completion of the full search phrase), the structOntology system issues a search query to the remote server, which then acts to auto-populate the ontology tree on the left-hand panel. In this case, we have selected ‘community facilities’:

The desired search results then automatically expand the ontology tree. This is really helpful for longer ontologies (the example one shown has about 3000 concepts and about 6000 axioms) and means quicker initial tree loading. Once completed, the (multiple) occurrences of the search item are shown in highlight throughout the tree.

Note this search is not necessarily restricted to the actual node label. Alternative labels and descriptions may also be used to find the search results. This greatly expands the findability of the search function. Here is a great example of matching the OWL API engine to Solr underneath a structWSF instance.

Tab Structure

The editable version of structOntology offers more detail in the right-hand panel when Viewing an item. These sections include:

  • Annotations
  • Structural relationships
  • Instances
  • Linkage to characteristics, and
  • Advanced settings.

Each section is editable. All have auto-complete. Each section may also be expanded or collapsed.

General Operations

Each panel has an expand and collapse arrow shown at the upper right of its panel. This causes the panel’s individual entries to be either exposed or hidden. At the right of each entry, new entries can be invoked with the green plus symbol; existing entries can be deleted with the red minus symbol. (See Structural Relationships below.)

In working with each panel, note that each entry also has the search and auto-complete features noted earlier. Drag-and-drop into these panels is also contextual: whether a drop is allowed depends on the nature of the item selected in the left-hand panel (tree).

Annotations

Annotations provide the descriptions about the thing at hand and its associated metadata. (These are separately defined under the Properties tab, or as part of the imported ontology specification.) The available annotations are displayed in this panel when expanded:

Entries are simply provided by entering values into the text fields and then Saving.

Structural Relationships

The structural relationships are the means to set parent and child relations between concepts, as well as to instruct disjoint or equivalent class relations. The Structural Relationships panel is the key one for setting the interconnections within the graph structure at the heart of the governing ontology.

Most of the key structural relationships in OWL are provided by this panel. (However, note there are some additional and rarely used structural specifications in OWL. These must be set via a third-party external application. Such interactions are made possible via the flexible import and export options in structOntology.)

Instances (Individuals)

Another right-hand panel provides the facility to assign individuals to the classes (or concepts) established under the prior two panels. In this case, we are looking at some specific ‘community facilities’ to assign to that concept:

As with the prior panels, a new instance may be added or discarded ones deleted. Individual instances and their characteristics may also be updated or changed.

Linkage to Characteristics

Another aspect to OSF ontologies is the ability to relate concepts to various metadata characteristics or attributes that might describe that concept’s instances. This relationship is done via the dedicated hasCharacteristic property, which is assigned via this right-hand panel:

This option has the specific behavior of allowing one or more properties (characteristics) to be asserted for a given class (concept).

Advanced Options

Display, widget and other options are set under the Advanced Options panel. One item to note is the set of widgets that may be assigned for displaying a given information item:

The relationship of widgets (or semantic components) to information items is a deserving topic in its own right. For more information about this topic, see the semantic components category.

Contextual Drag-and-Drop

In edit mode, it is possible to drag items from the left-hand tree panel into the specifications at the right. This is contextual. In this first example, we see an attempt to drop a “class” result (or concept) into the annotation panel, which violates the structure of the system and is therefore not allowed (as shown by the visual red X cues):

However, if we drag and drop from the tree in an allowable structural definition, we get the visual green check as a cue the move is legal:

This functionality and feedback means that only allowable assignments can be dropped into a new structural definition.

Export Option

Another piece of functionality in the editable version is the export option. When invoked, Export brings up the save dialog with the ability to assign an ontology file name:

Upon saving, it stores the currently active ontology in RDF/XML format:

Export is not active in UMBEL due to the large size of the ontology. If you want to obtain it directly, you may do so from the UMBEL ontology CVS.

Import Option

An Import option is available in the editable version. structOntology import supports all OWL API serializations, specifically RDF/XML, N3, Manchester Syntax and Turtle. When import is invoked, a file open dialog is presented that enables you to find the ontology on your local hard drive:

The Import feature has no file extension limitations; take care to pick and assign the proper types for importation.

Via the Import and Export buttons, it is possible to work locally with structOntology while exporting to more capable third-party tools. Then, once use of those tools is complete, Import provides the ability to re-import the updated ontology back into the local collection.

File Options

Finally, because structOntology is a server-based system accessed via Web services, there are some slightly different concepts to keep in mind when using the editable version: you might be working with the local version or with the one on the main server. The file options are:

  • Save: saves all modifications to the file on the server. All modifications will then be used if you do a Reload
  • Unload: removes the currently active ontology from the local instance, but does NOT remove it from the server. It merely acts to remove that ontology for local use in the current session
  • Remove: a full delete of the ontology, both locally and on the server
  • Update: recreates the serialization files created from these ontologies, such as the .SRZ files used by structWSF and conStruct, the ironXML schema used by the semantic components, etc. The Update option is the most common one when updating an ontology locally, for which you want the persistent version on the remote server to be kept in sync
  • Reload: reloads the server version. If prior local work had not been updated, then a reload acts as a way to restore the remote instance to the local one without change.

These are all available via buttons under the main right-hand panel in structOntology and are more fully described in the edit version documentation.

Additional Information

Additional information on structOntology may be found in an online video:


This is the fourth of a multi-part series on the newly updated UMBEL services. Other articles in this series are:


[1] See further the general Wikipedia description of UMBEL or its specification on the official UMBEL Web site.
[2] See especially the second figure and the accompanying discussion in this document.
[3] The appropriate pathway for suggested changes to the UMBEL ontology itself is via its official mailing list.

by Mike Bergman at November 15, 2011 07:33 PM

November 10, 2011

AI3:::Adaptive Information (Mike Bergman)

UMBEL Services, Part 3: Concept Browser

The OSF Browser is Now More Configurable

This continues our series on the new UMBEL portal. UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. This part three deals with the portal’s navigational tool, the concept or relation browser [2]. It is a favorite component of the open semantic framework (OSF).

Discovery and navigation in a graph structure — as is the basis of ontologies and the UMBEL structure — can be difficult. It is made even more difficult when the number of concepts in the object space is large. With 28,000 concepts, UMBEL is one of the largest public ontologies extant. The relation browser is designed specifically to address these difficulties.

The concept browser in UMBEL is invoked via the main menu option or by clicking on the browser icon shown in conjunction with a given concept. Here is an example for the concept ‘tractor’:

Note in this case that the More details … link brings you to a detailed concept view, as was covered in the previous part in this series.

With its extreme configurability and flexibility (see further below), the relation browser can be an essential foundation to an open semantic framework installation. But, the best part about the relation browser is that it is fun to use. Clicking bubbles and dynamically moving through the graph structure is like snorkeling through a massive school of silvery fish.

Origins of the Relation Browser

We have been featuring the relation browser since April 2008 when the first UMBEL sandbox was released:

The relation browser is the innovation of Moritz Stefaner, one of the best data and information visualization gurus around. He continues to innovate in large-scale information spaces, and is a frequent speaker at information visualization conferences. Moritz’s Web site and separate blog are each worth perusing for neat graphics and ideas.

Configurability

Since our first efforts with the browser, we have worked to extend its applicability and configurability. The relation browser can be downloaded separately from our semantic components code distribution site.

The relation browser is configured via an XML specification file. Separate specifications are available for the nodes (classes or concept) and connecting edges (predicates or properties). Here are the current configuration options:

NODE PARAMETERS
label: the label assigned to a given node; by default, the end of the URI of the type is used as the label
displayNodeLabel: a Boolean value indicating whether to display or hide the label for a specific node
tooltips: the tooltip to be displayed when mousing over a specific node
textFont: defines the font of the text label on the node; for example, “Verdana”
textColor: defines the color of the text label on the node; value in RGB hex format
textSize: defines the size of the text to display in the node
image: a URL to an image to display at the position of the node
shape: the shape of the node to display; available values are “circle”, “block”, “polygon”, “square”, “cross”, “X”, “triangleUp”, “triangleDown”, “triangleLeft”, “triangleRight”, “diamond”
lineWeight: defines the size of the border line of the node’s shape
lineColor: defines the color of the border line of the node’s shape; value in RGB hex format
fillColor: defines the color to use within the shape for the node; value in RGB hex format
radius: defines the radius of the node; the radius is an invisible boundary where the edges get attached
backgroundScaleFactor: scale factor for the node’s shape background; a scale factor of 1.25 means that it is 125% of normal size
textScaleFactor: scale factor of the node’s text label
textOffsetX: X offset at which to start displaying the text within the node’s shape
textOffsetY: Y offset at which to start displaying the text within the node’s shape
textMultilines: multi-lines means that each word of a label is displayed on its own line
textMaxWidth: maximum width of the text; if longer, it is truncated with an ellipsis (“…”) appended
textMaxHeight: maximum height of the text; if higher, it is truncated with an ellipsis (“…”) appended
selectedNodeColorOverlay: defines a color to overlay on the center (selected) node of the graph; it is defined by a series of 4 different offsets [alpha, red, green, blue] ranging from -255 to 255 in relation to the base node’s values; can be used, for example, to make the central node of the graph brighter
overNodeColorOverlay: defines a color to overlay on a moused-over node of the graph; it is defined by a series of 4 different offsets [alpha, red, green, blue] ranging from -255 to 255 in relation to the base node’s values; can be used, for example, to make a moused-over node of the graph brighter

EDGE PARAMETERS
displayLabel: the label to display over the center of the edge
tooltipLabel: the tooltip to be displayed when mousing over a specific edge
directedArrowHead: defines the type of the arrow for the edge; available values are “none”, “triangle”, “lines”
textFont: defines the font of the text label on the edge
textColor: defines the color of the text label on the edge; value in RGB hex format
textSize: defines the size of the text to display on the edge
image: a URL to an image to display over the edge, midway between the two connected nodes
lineWeight: defines the size of the line for the edge connector
lineColor: defines the color of the line for the edge connector; value in RGB hex format

 

It is also possible to specify a breadcrumb in association with the browser.

Besides these configurations, the API for the relation browser also provides for methods to:

  • Link Nodes to Objects
  • Link Nodes to Displays

Via these mechanisms, the relation browser can become a central focal point for any OSF installation. See further the specifications for additional ideas and tips.

Some Other Examples

Here are some other examples of relation browsers you can see across the Web:


This is the third of a multi-part series on the newly updated UMBEL services. Other articles in this series are:


[1] See further the general Wikipedia description of UMBEL or its specification on the official UMBEL Web site.
[2] Various clients and users have named this widget a number of things, including spider, concept explorer, relation browser and concept browser.

by Mike Bergman at November 10, 2011 12:10 AM

November 07, 2011

AI3:::Adaptive Information (Mike Bergman)

UMBEL Services, Part 2: Full-text, Faceted Search

OSF Integration with Solr Provides Superior Search

This continues our series on the new UMBEL portal. UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. This part focuses on the search function within the UMBEL portal based on the native engines used by the open semantic framework (OSF).

Search uses the integration of RDF and inferencing with full-text, faceted search using Solr. This has been Structured Dynamics’ standard search function for some time, as Fred Giasson initially described in April 2009. It is a very effective way for finding new and related concepts within the UMBEL structure.

Solr, as the Web service-enabled option for its parent Lucene, has recently become a fairly common adjunct to semantic technologies, for the very reasons outlined here. In 2008, however, when we first embraced the option, it was not common at all. To my knowledge, within the semantic technology community, only the SWSE (semantic Web search engine) project was using Lucene when we began to adopt it.

The reasons for embracing Solr (or Lucene) are these:

  • Full-text search with a flexible search syntax
  • Ability to add facets (which is very powerful when combined with the structure of RDF)
  • High performance
  • Extensions for locational and time searches and many additional options, and
  • Open source.

Prior to the adoption of add-ons like Solr, RDF-only search suffered from a number of limitations, especially in the lack of a searchable correspondence of labels in relation to the object URIs used in the RDF model (see some of the limitations of standard RDF search).

Because of its advantages, Solr became the first additional main engine underneath our structWSF Web services framework, complementing the central RDF triple store (Virtuoso in most cases). We have subsequently added other main engines, as well, with a current total of four, which other parts in this UMBEL series will later discuss:

Being a main engine underneath structWSF means that datasets are fully indexed and cross-correlated with the capabilities of the other engines at time of ingest. Ingest most commonly occurs through the standard import tool, but it might also be part of the system’s large dataset import scripts or synchronization routines.

The Search Function and Syntax

The standard UMBEL search box is found at the upper left of most site pages. When searching, you may choose these operators or syntax to add to your keywords, for example:

  • park OR city — provides the most results
  • park AND city — both terms must be present; fewer results
  • park city (no quotes) — both terms must be present, and within 5 words of one another; still fewer results, or
  • “park city” — exact phrase in quotes, with the fewest results.

(At present, Boolean operators apply to full-content search, and not filtered search.)

Upon searching, using the default of searching title, alternative labels (synonyms) and description (“TAD”), the standard search results page is displayed:

This page provides the further filtering options of searching by only title, or all content (including the linkings for each concept to its super classes and sub classes, which may produce a quite inclusive results set). These filter options are helpful in being able to sift through the 28,000 concepts within UMBEL.

The results listing provides the UMBEL concept names, their description, their alternative labels and a link to view them in the relation browser (to be discussed in more detail in the next part of this series). A simple pagination utility enables the results to be paged.

structWSF Basis and Options

This UMBEL search uses the structWSF Search Web service. It is what ties into the Solr engine to perform the full-text searches on the structured data indexed on a structWSF instance. A search query can be as simple as a single keyword, or as involved as a series of complex filters.

Not all of these query syntax or filtering options are active on the UMBEL instance given the simple concept structure of the UMBEL ontology. Turning these options on or off is a relatively straightforward matter of altering some configuration files and ensuring the right parameters are included in the queries issued by the application to the structWSF search endpoint.

Developers communicate with the Search Web service using the HTTP POST method. You may request one of the following mime types: (1) text/xml, (2) application/rdf+xml, (3) application/rdf+n3 or (4) application/json. The content returned by the Web service is serialized using the mime type requested and the data returned depends on the parameters selected.
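
To make the request pattern above concrete, here is a minimal Python sketch that builds (but does not send) an HTTP POST request asking for one of the four mime types. The endpoint URL and the `query` parameter name are placeholders assumed for illustration; consult the structWSF Search documentation for the actual parameters.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the Search endpoint of a real instance.
SEARCH_ENDPOINT = "http://example.org/ws/search/"

# The four mime types listed in the text above.
SUPPORTED_MIME_TYPES = {
    "text/xml",
    "application/rdf+xml",
    "application/rdf+n3",
    "application/json",
}


def build_search_request(query, mime="application/json"):
    """Build a POST request whose Accept header names the wanted mime type."""
    if mime not in SUPPORTED_MIME_TYPES:
        raise ValueError(f"unsupported mime type: {mime}")
    # 'query' is an assumed parameter name, used here only for illustration.
    body = urlencode({"query": query}).encode("utf-8")
    return Request(SEARCH_ENDPOINT, data=body,
                   headers={"Accept": mime}, method="POST")


request = build_search_request('"park city"')
```

Passing the request to `urllib.request.urlopen` would send it; the sketch stops short of that because the endpoint above is not real.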

A. Optional Available Operators

Optionally, the structWSF Search function may be configured to support these operators and conventions. All operators, by the way, must be entered as ALL CAPS:

  • AND, which is the default operator if more than one key word is entered
  • OR, which needs to be specifically entered
  • NOT
  • Phrases, which are denoted by double quotes as this “search phrase”; single quotes are not accepted
  • Wildcard searches on single characters (?) and multiple characters (*), which can be placed anywhere except the beginning of the query term
  • Field searches, whereby the field name is used in the query followed by a colon
  • Nesting, which allows complicated Boolean expressions to be formed (so long as parentheses are balanced), and many more exotic options.

See further the Lucene search engine syntax specification.
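
For readers who build such queries in code, the following Python sketch composes a query string that follows the conventions listed above: operators in ALL CAPS, phrases in double quotes, no leading wildcard, and balanced parentheses for nesting. The helper names are my own and are not part of structWSF or Lucene.

```python
def phrase(text):
    """Wrap a phrase in double quotes; single quotes are not accepted."""
    if '"' in text:
        raise ValueError("a phrase may not itself contain double quotes")
    return f'"{text}"'


def wildcard(term):
    """Allow ? and * anywhere except at the beginning of the term."""
    if term[:1] in ("?", "*"):
        raise ValueError("a wildcard may not begin a query term")
    return term


def negate(term):
    """Prefix a term with the NOT operator."""
    return f"NOT {term}"


def combine(operator, *terms):
    """Join terms with AND or OR and wrap them in balanced parentheses."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be AND or OR, in all caps")
    return "(" + f" {operator} ".join(terms) + ")"


query = combine(
    "AND",
    combine("OR", phrase("park city"), wildcard("cit?")),
    negate("zoo"),
)
```

Because `combine` always adds its own parentheses, nested expressions stay balanced however deeply they are composed.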

B. Optional Available Filters

Each search query can be applied to all, or a subset of, datasets accessible by the requester. Each Search query can be filtered by these different filtering criteria:

  1. Type of the record(s) being requested
  2. Source dataset(s) for the record(s)
  3. Presence of an attribute describing the record(s)
  4. A specific value, for a specific attribute describing the record(s)
  5. A distance from a lat/long coordinate (for geo-enabled structWSF instance)
  6. A range of lat/long coordinates (for geo-enabled structWSF instance)
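
To show how the six criteria above fit together, here is a small client-side Python sketch that applies them to records held in memory. It is only an illustration of the filtering logic: the record layout, field names and sample records are hypothetical, and structWSF itself applies these filters on the server through Solr.

```python
import math
from dataclasses import dataclass, field


@dataclass
class RecordFilter:
    types: set = field(default_factory=set)               # 1. record types
    datasets: set = field(default_factory=set)            # 2. source datasets
    has_attributes: set = field(default_factory=set)      # 3. attributes present
    attribute_values: dict = field(default_factory=dict)  # 4. attribute -> value
    near: tuple = None    # 5. (lat, long, max_km)
    within: tuple = None  # 6. (min_lat, min_long, max_lat, max_long)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/long points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlat, dlon = p2 - p1, math.radians(lon2 - lon1)
    a = math.sin(dlat / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def matches(record, f):
    """Return True when the record satisfies every criterion that is set."""
    attrs = record.get("attributes", {})
    if f.types and record.get("type") not in f.types:
        return False
    if f.datasets and record.get("dataset") not in f.datasets:
        return False
    if not f.has_attributes.issubset(attrs):
        return False
    if any(attrs.get(k) != v for k, v in f.attribute_values.items()):
        return False
    if f.near or f.within:
        lat, lon = record.get("lat"), record.get("long")
        if lat is None or lon is None:
            return False
        if f.near and haversine_km(lat, lon, f.near[0], f.near[1]) > f.near[2]:
            return False
        if f.within and not (f.within[0] <= lat <= f.within[2]
                             and f.within[1] <= lon <= f.within[3]):
            return False
    return True


# Made-up sample records, for illustration only.
records = [
    {"id": "park-city", "type": "City", "dataset": "places",
     "attributes": {"state": "Utah"}, "lat": 40.6461, "long": -111.4980},
    {"id": "iowa-city", "type": "City", "dataset": "places",
     "attributes": {"state": "Iowa"}, "lat": 41.6611, "long": -91.5302},
    {"id": "city-park", "type": "Park", "dataset": "places",
     "attributes": {}, "lat": 40.7, "long": -111.8},
]

# Cities that carry a 'state' attribute within 60 km of a chosen point.
nearby = RecordFilter(types={"City"}, has_attributes={"state"},
                      near=(40.7608, -111.8910, 60.0))
hits = [r["id"] for r in records if matches(r, nearby)]
```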

These filtering options allow subset searches to occur, as the example above for title and TAD in UMBEL shows. However, these filters can also be combined into more complete and structured selection options as well. For example, this same search utility applied to Structured Dynamics’ Citizen Dan local government sandbox shows how these additional filters may be applied:

  • Clicking on a given “kind” name causes the results display to be restricted to results only for that kind of class.
  • If so selected, the Filter by Dataset tab is also restricted to the datasets that contain results with that class.
  • Once selected, the filter remains in place. To remove it, click on the Remove filter icon to restore the “kinds” back to the original listing for this search.

See the example. Such filtering capabilities present all of the “kinds” (actually, classes that have similar members) that are contained within the structure of the individual results that comprise the search results. The number of records (results) returned for each class may also be shown in parentheses.
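
The per-class counts shown in parentheses can be pictured with a short Python sketch. It tallies the “kinds” found in a list of results and then restricts the list to one kind, mimicking a click on a kind name. The result layout and the `kind` key are assumptions made for illustration; in structWSF the counts come from Solr facets on the server.

```python
from collections import Counter


def facet_counts(results, key="kind"):
    """Count how many results fall under each value of the given key."""
    return Counter(r[key] for r in results if key in r)


def restrict(results, key, value):
    """Keep only the results whose key has the selected value."""
    return [r for r in results if r.get(key) == value]


# Made-up sample results, for illustration only.
results = [
    {"label": "Park City", "kind": "City"},
    {"label": "City Park", "kind": "Park"},
    {"label": "Iowa City", "kind": "City"},
]

counts = facet_counts(results)
labels = [f"{kind} ({n})" for kind, n in counts.most_common()]
cities = restrict(results, "kind", "City")
```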

Single Result (Concept) View

Clicking on an individual instance result in the UMBEL search results view (see above) provides the single result View for that specific UMBEL concept:

This view now provides a detailed description of the UMBEL concept and its structure and relationships. I briefly describe each item denoted by a checkmark.

The concept title and link to the relation browser are provided, followed by the actual concept URI identifier. Then the listing shows the alternative labels (synonyms, jargon and acronyms) provided for that concept followed by its (often detailed) description.

The structured information for that concept appears below that material. First shown is the UMBEL SuperType [2] to which the concept belongs, and then its external (non-UMBEL ontology) and internal (UMBEL) super classes and subclasses. There is also the facility to retrieve named individuals (instances) for that concept (see next).

Named Individual Listing

Choosing the ‘Get Entities from Sources’ button may provide example instances for that concept, as is shown below for the ‘Artist’ concept:

Retrieving Named Individuals

This linkage is relatively new for UMBEL (see the version 1.00 release write-up) and is still being expanded. At present, these linkages are limited to only a subset of UMBEL concepts and only linkages to Wikipedia. This aspect of the system is under active development, with more sources and more linked concepts to be released in the future.


This is the second of a multi-part series on the newly updated UMBEL services. Other articles in this series are:


[1] See further the general Wikipedia description of UMBEL or its specification on the official UMBEL Web site.
[2] SuperTypes are a collection of (mostly) similar Reference Concepts. Most of the SuperType classes have been designed to be (mostly) disjoint from the other SuperType classes. SuperTypes thus provide a higher level of clustering and organization of Reference Concepts for use in user interfaces and for reasoning purposes. Every Reference Concept in UMBEL is assigned to a SuperType; a few are assigned to more than one where disjointedness conditions are not absolute. Each of the 32 UMBEL SuperTypes has a matching predicate for relating to external topics. See further the discussion of SuperTypes in the UMBEL specification.

by Mike Bergman at November 07, 2011 02:28 PM

November 03, 2011

AI3:::Adaptive Information (Mike Bergman)

And, Now, We Pause for a Brief Commercial Message . . .

Mixing Business and Pleasure

I never talk politics here, and rarely speak of sports or family or personal matters. But I’m making an exception today.

Since we lived in Montana a couple of decades back, skiing and the mountains have been a central theme in my family. Both of my kids learned to ski at Lost Trail before they even turned three. Today, both are impressive skiers. (I’m a different story, but that is immaterial. ;) )

We have skied many places across the Western US, all enjoyable and all remarkable. But, our favorite amongst them has been Winter Park, CO (more specifically, Mary Jane — no Jane, no pain). We have been going there for nearly a decade. The slopes and the beauty are, of course, arguments in themselves. But also what makes Winter Park special is that it offers the best deal on earth for lift tickets (with an annual pass) and has a local clientele that is laid back and into substance and not flash.

As our kids have grown and taken on lives of their own, we have come to treasure those chances when all of us can be together. Skiing, and summer activities too, are great ways to make that happen.

So, it is with immeasurable pleasure that we closed the sale today on a new second home in Winter Park. It is absolutely perfect for all things outdoors. And, since we still have regular lives and work, we will be offering our new place for rental for those many weeks we can not enjoy it ourselves. If mountains and beauty and nature are in your calling, let us know. We have a fantastic place to rent to you in one of the most spectacular places on earth.

by Mike Bergman at November 03, 2011 09:08 AM

October 24, 2011

AI3:::Adaptive Information (Mike Bergman)

UMBEL Services, Part 1: Overview

New Portal Update Leverages the Open Semantic Framework

UMBEL, the Upper Mapping and Binding Exchange Layer, is an upper ontology of about 28,000 reference concepts and a vocabulary designed for domain ontologies and ontology mapping [1]. When we first released UMBEL in mid-2008 it was accompanied by a number of Web services and a SPARQL endpoint, and general APIs. In fact, these were the first Web services developed for release by Structured Dynamics. They were the prototypes for what later became the structWSF Web services framework, which incorporated many lessons learned and better practices.

By the time that the structWSF framework had evolved with many additions to comprise the Open Semantic Framework (OSF), those original UMBEL Web services had become quite dated. Thus, upon the last major update to UMBEL to version 1.0 back in February of this year, we removed these dated services.

Like what I earlier mentioned about the cobbler’s children being the last to get new shoes, it has taken us a bit to upgrade the UMBEL services. However, I am pleased to announce we have now completed the transition of UMBEL’s earlier services to use the OSF framework, and specifically the structWSF platform-independent services. As a result, there are both upgraded existing services and some exciting new ones. We will now be using UMBEL as one of our showcases for these expanding OSF features. We will be elaborating upon these features throughout this series, some parts of which will appear on Fred Giasson’s blog.

In this first part, we provide a broad overview of the new UMBEL OSF implementation. We also begin to foretell some of the parts to come that will describe some of these features in more detail.

The Overall Portal

The new UMBEL portal is a fairly classic example of an OSF installation. The content management system hosting the system is Drupal, supplemented with a standard set of third-party modules and our own conStruct semantic technology modules. The theme is a stripped-down modification of the popular Pixture Reloaded theme:

Like other vocabulary sites, the UMBEL portal contains specifications and links to community resources and downloads. It also has some specialty links not shown on typical standards sites.

Much Better Vocabulary Access and Management

The site now most prominently features our structOntology editing and maintenance tool. Built on the OWL API, the same as Protégé 4, structOntology provides the advantage of enabling edits and management of ontologies directly within the applications in which they are used. This is far superior to needing to fire up an external ontology manager and then to re-import the changed ontology. structOntology also has an arguably simpler interface and operation than other ontology management alternatives:

For the UMBEL site, the standard view of using structOntology is read-only. In a subsequent part we will also discuss structOntology’s full editing and maintenance mode.

Improved Discovery and Navigation

Like all standard OSF installations, there are two superior means for discovery and navigation of the information space:  search and the relation browser.

Search uses the integration of RDF and inferencing with full-text, faceted search using Solr. This has been Structured Dynamics’ standard search function for some time, as Fred initially described in April 2009. It is a very effective way for finding new and related concepts within the UMBEL structure.

The relation browser is what is used for casual navigation and discovery. Any concept found via search or other means within the system can have the browser invoked by clicking on its browser icon. When done, the standard relation browser appears:

The relation browser is highly configurable, as shown by some of our exemplar installations. Note in this case that the More details … link brings you to a detailed concept view, such as this example:

These various tools provide great means for discovery and navigation within the 28,000 concepts in the UMBEL reference space.

Newly Released Web Services and SPARQL Endpoints

We are also now providing updated endpoints for Ontology: Read, Search, Crud: Read, SPARQL and Scones. These will be described with access and query examples in a later part.

Some Cool New Sandboxes

We will also be discussing our OBIE (ontology-based information extraction) and entity tagger, scones, and export and ontology edit and management functions in subsequent posts.

Looking Ahead to Remaining Parts

We anticipate eight or nine more parts in this series explaining most of these options in greater detail. We hope to post a couple per week or so over the coming month. We will conclude with a discussion of next pending UMBEL releases.

This is the first of a multi-part series on the newly updated UMBEL services.

[1] See further the general Wikipedia description of UMBEL or its specification on the official UMBEL Web site.

by Mike Bergman at October 24, 2011 04:35 PM

October 22, 2011

HyperDanja (Danny Ayers)

links for 2011-10-22

by danja at October 22, 2011 09:11 PM

October 18, 2011

AI3:::Adaptive Information (Mike Bergman)

Fred’s Hair is on Fire

Today’s Post is a Testimony to the Value of Vacations

My partner, Fred Giasson, today posted the second part of his series on open source. Since returning from a well-earned vacation a few weeks back — after more than three years without a break — Fred has been writing and developing up a storm. As someone said to me last week, “Fred’s on fire!” I could not agree more.

I think Fred’s post speaks for itself as to why and how Structured Dynamics has made a conscious choice to embrace open source. The major reason he puts forth — to bootstrap the company without the need for external investment — is unusual in itself. But one thing he is silent about is why this is a compelling reason. I’ll comment on that.

Fred and I have both worked for others dependent on their capital for our ventures (a few more times in my case). Capital is great for expansion and operations, but it can be deadly when visions requiring patience are in play. Structured Dynamics is only now a bit more than halfway through its five-year plan. While semantic technologies are exciting with a world of upside potential, they have also been incubated in academic labs with (as yet) a general lack of practical deployment. The promise is there, but often the delivery and maturation have been lacking. We are committed to playing a visible role in correcting that.

The approach Fred outlines was not perhaps easily available to new startups a decade ago. But now, with open source and the Internet, costs of entry and ongoing development have dropped markedly. Yet, surprisingly, the idea of financing a startup via revenues is still not talked about sufficiently — let alone often used as an actual basis for building a company.

I’ve been fortunate to be able to partner with a young, world-class technologist whose maturity exceeds that of individuals many years his senior. He understands that, in order to achieve important visions, the stewardship of those ideas cannot be left to venture capitalists committed solely or mostly to gaming terms or near-term returns. We’re placing our bets on the paying customer and our own judgment.

So, it is great to see Fred continue his phenomenal development productivity since he returned from Hawaii. The benefit of his vacation is that we are also now getting his insights on his blog again.

by Mike Bergman at October 18, 2011 04:19 AM

Frederick Giasson's Weblog

Open Source Projects As A Pool Of Resources

In a previous blog post, I wrote about how Open Source may be unnatural, and even counter intuitive, to many people. However, that really begs some questions evident with my current company's strategy.

Why have Mike Bergman and I chosen to develop no fewer than three major open source projects (structWSF, conStruct and the Semantic Components), encompassing more than 100,000 lines of new code and leveraging between 30 and 50 other open source software packages and libraries? Why have we open sourced all our software? Why has open source formed the core business strategy of Structured Dynamics in the last three years? How have we been able to profitably sustain the company, even in the midst of the global economic crisis that began in 2008?

I will try to answer these questions in this blog post, perhaps even providing some guidance for newer startups that may follow behind us.

Why Open Sourcing?

Why did Structured Dynamics choose to open source all of its software? There are multiple reasons why people and businesses choose to go open source. For some, it is because they think that it is where the marketplace is moving. For others, it is because they think that a community will emerge around their effort and contribute free resources that improve the software. Some think that their software will promptly be reviewed by professional programmers. Others may think that their system will become more secure. And so on.

For Structured Dynamics, the reason why we chose to go open source is somewhat different:

We perceived that by open sourcing our complete software stack we could bootstrap the company without any external investment.

Making a Living out of Open Source Projects

There are multiple ways to make a living from an open source project:

  • Doing consultancy work related to the project
  • Implementing the software(s) into clients’ computer environment(s)
  • Selling training classes
  • Selling support contracts
  • Selling maintenance contracts
  • Selling hosted instances of the software (the SaaS model for one)
  • Selling development time to improve some part(s) of the software
  • Creating conferences around their open source projects
  • Selling proprietary extensions
  • I am probably missing a few, so please add them in a comment section below, and I will make sure to add them to this list.

Depending on the software you are developing, and depending on the business plan of your company, you may be doing one — or multiple — of these things to generate some money from your open source projects.

At Structured Dynamics we are doing some of them: we do get consultancy contracts related to the Open Semantic Framework and we do implement OSF in our clients’ computer environments.

But, more importantly, we are also doing development contracts related to the framework. In fact, each project we are working on is quite different. Our major projects involve companies that reside in totally different domains, have different needs and need to accommodate different kinds of users. However, most of the projects share the same core needs, and all of them advance the core technology in ways meaningful to our vision. We choose our customers (and, of course, vice versa) based on a true sense of partnership wherein both parties have their objectives furthered.

Let’s see how we use these relationships to drive the development of the Open Semantic Framework.

Open Source Project as a Pool of Resources

In the last three years, Structured Dynamics has attracted multiple companies and organizations that share our vision, and which are willing to invest in the Open Semantic Framework open source project. (See Mike's recent post on business development for a bit more on that aspect of things.) Each of these clients did want to use the OSF framework for their own needs. However, each of them did want to do something special that was not currently implemented in the framework.

What we created in these three years is a pool of resources that we used to develop the framework such that it accommodates the needs of each of our clients. Each of our clients then becomes a participant in the shared pool of innovation. Our clients have been willing to invest in the open source framework because they need their own features and because they know that they will benefit from what other participants in the pool will themselves invest down the road.

In that scenario, we are the managers of a pool of resources. We have the vision of where we want the framework to go, we know the roadmap of the project and we know the needs of each participant (our clients). We try to optimize the resources we get from each of our clients by developing the framework such that it can accommodate as broad a spectrum of participants as possible. Then, we seek to find new participants whose needs will help us continue to develop the next steps of the roadmap. In this manner, we Jacob's Ladder our existing work to increase the capabilities for later clients, but earlier clients still benefit because they can upgrade to the later improvements. This is a self-sustaining model to continue to move the development of the framework forward.

By finding new clients, we give a return on investment to the other pool participants. Most of the new features that we develop for these new clients will benefit the other participants in the pool and will create new possibilities for them without any additional investment. All of our first clients have implemented what other participants later invested into the pool, thus crystallizing and augmenting their return on investment by using these new features.

Open Source is Not Just About Software

Open Source is not just about pieces of code, and this is quite important to understand. What we have open sourced with the Open Semantic Framework is much more than a collection of source code. We open sourced the entire framework:

  1. The source code
  2. The documentation
  3. The processes
  4. The methodologies

We term this comprehensive approach our total open solution.

This distinction with other open source projects is an essential differentiator with our approach. We choose to open source all of the pieces related to the framework. What drove this decision is a simple sentence that shows our philosophy behind it:

"We're Successful When We're Not Needed"

If the APIs, processes and methodologies are not properly documented, it means that we would certainly be needed by our clients, which would mean that we failed to open source our solution. But since we are working to open source our code, our processes and our methodologies, we are on the way to successfully open source the Open Semantic Framework since we won’t be needed by our clients.

This business approach is not as crazy as it sounds. We are free to work on new and important innovations, and are not basing our company culture on dependency and a constant drain by our customers. I know, it does not sound like Larry Ellison, but sounds good to us and our clients. It is certainly not a maximum revenue objective built on the backs of individual clients.

Our life is more fun and our clients trust us with new stuff. Further, each step of the way, we are able to leverage our own framework for unbelievable productivity in what we deliver for the money. But that is a topic for another day.

We think Structured Dynamics' business approach is a contemporary winning strategy. Our customers get good and advanced capabilities at low cost and risk, while we get to work on innovative extensions that are raising the semantic baseline for the marketplace. Who knows if we will always continue this path, but for now it is leading to sustained development and market growth for open semantic frameworks, including our own OSF.

 

 

by Frederick Giasson at October 18, 2011 02:17 AM

October 11, 2011

Frederick Giasson's Weblog

Volkswagen’s RDF Data Management Workflow

Tribal DDB UK’s team just published a new case study with the W3C: Case Study: Contextual Search for Volkswagen and the Automotive Industry. They discuss the benefits of some of the semantic web technologies, techniques and concepts that they use to help them manage their data. They describe their approach and outline their design. The case study covers the technical aspects of their new Semantic Web Platform that I wrote about a few weeks ago.

In this blog post, I want to further explain their data management workflow, and how their data gets exposed to different kinds of users.

Two Classes of Users

Let’s take a look at their data ingest/management/publishing workflow:

As you can see, all their data get collected, transformed and imported into structWSF. As I explained in my previous blog post, they are using structWSF to manage all their RDF data and access all the functionalities from the different web service endpoints.

However, how the data gets exposed to the users is not that clear. In fact, it depends on the class of user. A user can be many different things: a person, a software application, an organization, etc. However, there are two general classes of users:

  1. Public users, and
  2. Private users

Public users are users that have no direct relation with Volkswagen and that have no access to their internal network. Private users are generally internal departments or some internal software applications that have direct access to the structWSF instance.

Private Users

Private users generally have access to all structWSF web service endpoints. This means that all structWSF functionalities are accessible to them by querying the endpoints.

Two different kinds of private users are specified in the use case’s schema:

  1. Volkswagen Site Search
  2. Other / External Applications

The Volkswagen site search is a software application that uses the structWSF Search endpoint to search, filter and expose their data to their users (the people who perform searches on the Volkswagen UK website).

The other/external applications are software applications that have access to the structWSF instance. These are generally internal applications that run in the same network. One of these is an internal application that exports all the RDF data from the structWSF SPARQL endpoint and imports it into Kasabi.

These are two examples of software applications that Volkswagen created around the structWSF web services to re-purpose, re-contextualize and re-publish their RDF data.
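
To give a feel for what such an export application might do, here is a minimal Python sketch that builds paged SPARQL CONSTRUCT requests. This is not Volkswagen's actual code. It assumes that the endpoint accepts standard SPARQL Protocol requests with a `query` parameter, and the endpoint URL is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint; replace with the SPARQL endpoint of a real instance.
SPARQL_ENDPOINT = "http://example.org/ws/sparql/"


def export_query(limit, offset):
    """CONSTRUCT one page of triples; ORDER BY keeps the paging stable."""
    return (
        "CONSTRUCT { ?s ?p ?o } "
        "WHERE { ?s ?p ?o } "
        f"ORDER BY ?s ?p ?o LIMIT {limit} OFFSET {offset}"
    )


def build_export_request(limit=1000, offset=0):
    """Build (but do not send) a POST request for one page of RDF/XML."""
    body = urlencode({"query": export_query(limit, offset)}).encode("utf-8")
    return Request(SPARQL_ENDPOINT, data=body,
                   headers={"Accept": "application/rdf+xml"}, method="POST")


# Requests for the first three pages of 500 solutions each.
pages = [build_export_request(limit=500, offset=i * 500) for i in range(3)]
```

An exporter would send each request in turn, stop at the first empty page, and hand the returned RDF to the import side.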

Public Users

There are currently two kinds of public users of this new Volkswagen Semantic Platform:

  • People, and
  • Software applications

Two interfaces have been made publicly available for each of these kinds of users:

  • A website search engine page for people, and
  • A SPARQL endpoint for software applications

When a person reaches the website’s search page, the search query gets sent to the structWSF Search web service endpoint. The result is then returned to the engine, templated and displayed to that person.

A SPARQL endpoint is accessible to the software applications. This endpoint is hosted by the Kasabi information marketplace. Volkswagen chose to export everything from their structWSF instance into Kasabi to outsource the maintenance of their public SPARQL endpoint.

Unlock the Power

As we saw in this blog post and in the W3C use case, all Volkswagen UK data is internally managed by structWSF; however, they are not locked into that system. They can easily communicate with external services to add new functionalities to their stack or to make business decisions such as outsourcing the management of some publicly accessible data access endpoints.

This is an important characteristic of their design:

By choosing semantic web technologies (such as structWSF), techniques and concepts (such as their Vehicles OWL Ontology and RDF), they are not locking themselves into a specific framework. They can easily communicate with external systems and applications. This means that they can quickly adapt their system to their constantly changing needs.

Conclusion

I wrote this blog post to further explain Volkswagen’s data management workflow. I wanted to make sure that people understood the role that structWSF plays in this use case, and the ecosystem it operates in.

by Frederick Giasson at October 11, 2011 10:53 PM

AI3:::Adaptive Information (Mike Bergman)

The Cobbler’s Shoes

The Need to Enforce Periodic Checkups on Web Properties

Face it, we all get busy and begin to overlook our own needs while we work for others on our jobs. The parable of the cobbler’s children going without shoes says it all.  It means that the shoemaker spends so much time looking after his customers’ needs that he neglects the needs of his own children.

We see the same phenomenon in relation to our own personal assets, home repairs and cleaning, and a myriad of chores and background requirements. One way we can overcome this neglect is by scheduling annual or periodic checkups or activities. Spring cleaning is one such effort, as is annual asset portfolio re-balancing or doctor’s appointments or 10,000 mile vehicle servicing.

One of the cobbler’s chores for Structured Dynamics is the periodic care and feeding of our various Web sites. This has actually proven to be a non-trivial exercise, as our properties have grown to exceed 1400 static Web pages across some 30 diverse Web addresses and properties. As our client and code base expands, this exercise is increasingly demanding.

Taking advantage of a small break in the action, we have just completed another one of these reviews and revisions. Interestingly, as I was going through the various sites, I saw that date stamps for prior revisions tended to all occur in the September and October time frame. Last September, for example, SD went through a major redesign and new logo. Apparently, without consciously realizing it, we have been doing our own Web attic cleaning in the Fall.

Thus, as a way to formalize this process for us internally, I thought I’d briefly outline the Web site changes that we have cobbled together for this year. I suspect we’ll be doing another spiffing come Fall 2012.

Rationalizing the Properties

It is kind of frightening to realize that we have allowed our Web properties to grow to about 30 individual sites. This accretion happens gradually: a new initiative or capability arises that seems to warrant its own Web site. Yet each site carries with it a need to develop and maintain, as well as to explain its role and use in the Structured Dynamics information space.

Exclusive of internal development sites or ones dedicated to specific customers, here is the roster of existing SD properties that we have needed to rationalize:

Note that all properties with strike outs have now either been retired or consolidated with other properties. We have reduced the property count by 10, or by a third. Additional consolidations will be forthcoming.

Providing a Consistent Entry to the Various Properties

With the growth of our various Web properties and the diversity of the initiatives behind them, Fred and I have grown increasingly frustrated that our site visitors lacked a consistent way to access and understand these projects. Across all properties, Structured Dynamics has about 6,000 daily visitors or RSS tracking feeds.

Providing a consistent context of what these properties mean and their relation to one another is further compounded by the sheer size of our properties. Excluding dynamically generated pages (such as from search, demonstration of our semantic components, or use of the relation browser), we have on the order of 1400 static Web pages across all properties and blogs. Users may enter our information space via any of these entry points.

The answer to how to provide a consistent context on any Web page throughout our properties resides in the nifty JavaScript popup Fred recently described for his own blog. What we realized is that we could adapt this widget to provide a single overview of SD’s resources, and then add that widget to all of our properties such that it appears as a small tab at the bottom (sometimes side) of all property pages.

Then, when the SD Resources tab is clicked, the following popup appears:

So, whenever you are on one of our properties, look for the tab (generally) at the lower right corner of every Web page. That will take you to the common entry point across Structured Dynamics’ Web properties.

Updating the Properties

In this process we also went through some of our existing sites and made content, narrative and navigation changes consistent with this rationalization and consistent entry point. These updates were not nearly as extensive as the full re-designs from one year ago.

New Shoe Designs

With a constant stream of new initiatives and new understandings, it will remain a challenge for us to describe our various products and services. An even greater challenge will be to provide coherent descriptions of how all of these initiatives fit together consistent with our overall vision. One attempt at that is our new Overview page. Meanwhile, of course, we will occasionally be offering new Web goodies and sites as developments warrant. These will need to get integrated into this picture as well.

We think we have taken an itty-bitty step toward improving this process with the SD Resources tab widget. Nonetheless, I’m sure that we will continue to craft new shoes to try to find ones that are yet more comfortable and attractive. Thing is, we may have to wait another year before we get around to it again.

by Mike Bergman at October 11, 2011 09:26 AM

October 06, 2011

Frederick Giasson's Weblog

A Man Dedicated to His Vision (and Who Changed the World)

This man literally changed the world we live in. He had a vision, he failed, but he came back to change everybody’s daily habits. He pushed others to the limit and changed entire industries. Even if I don’t always agree with his company’s decisions, I will always respect his vision, his work and his dedication. Rest in peace, Mr. Jobs.

 

Here is my collection of Steve’s best quotes, which I aggregated over time. I hope it helps you understand who the man was.

 

"Your time is limited, so don't waste it living someone else's life. Don't be trapped by dogma – which is living with the results of other people's thinking. Don't let the noise of others' opinions drown out your own inner voice. And most important, have the courage to follow your heart and intuition. They somehow already know what you truly want to become. Everything else is secondary. " - Steve Jobs

"When I was 17, I read a quote that went something like: "If you live each day as if it was your last, someday you'll most certainly be right." It made an impression on me, and since then, for the past 33 years, I have looked in the mirror every morning and asked myself: "If today were the last day of my life, would I want to do what I am about to do today?" And whenever the answer has been "No" for too many days in a row, I know I need to change something." - Steve Jobs

"Your work is going to fill a large part of your life, and the only way to be truly satisfied is to do what you believe is great work. And the only way to do great work is to love what you do. If you haven't found it yet, keep looking. Don't settle. As with all matters of the heart, you'll know when you find it. And, like any great relationship, it just gets better and better as the years roll on. So keep looking until you find it. Don't settle." - Steve Jobs

"Again, you can't connect the dots looking forward; you can only connect them looking backwards. So you have to trust that the dots will somehow connect in your future. You have to trust in something – your gut, destiny, life, karma, whatever. This approach has never let me down, and it has made all the difference in my life."  - Steve Jobs

"To design something really well you have to get it. You have to really grok what it's all about. It takes a passionate commitment to thoroughly understand something – chew it up, not just quickly swallow it. Most people don't take the time to do that. Creativity is just connecting things. When you ask a creative person how they did something, they may feel a little guilty because they didn't really do it, they just saw something. It seemed obvious to them after a while. That's because they were able to connect experiences they've had and synthesize new things. And the reason they were able to do that was that they've had more experiences or have thought more about their experiences than other people have. Unfortunately, that's too rare a commodity. A lot of people in our industry haven't had very diverse experiences. They don't have enough dots to connect, and they end up with very linear solutions, without a broad perspective on the problem. The broader one's understanding of the human experience, the better designs we will have." - Steve Jobs

 

 

by Frederick Giasson at October 06, 2011 12:37 AM

October 05, 2011

Frederick Giasson's Weblog

Unnatural Open Source

I have never been an open source software advocate. In fact, like most people, I always wondered how companies could find a business advantage in developing open source software, and how they could make money from it in order to grow. It is nice to have open source software, but it is hard to imagine how you could justify putting thousands of hours into open source software projects other than out of passion.

In this post I will explain what I think is the main factor that puts people, businesses and organizations on guard when the time comes to think about open source software. In fact, I think it has much more to do with our nature, with how we naturally are as human beings, and much less to do with any real business-related factors.

In a follow-up blog post, I will explain how Structured Dynamics embraced open source software, how we developed the company around the concept, and how we are managing the development of our project such that it benefits all our clients along with the company. But first, let’s try to figure out why many people are suspicious of open source software.

The Fear

"I must not fear. Fear is the mind-killer. Fear is the little-death that brings total obliteration. I will face my fear. I will permit it to pass over me and through me. And when it has gone past I will turn the inner eye to see its path. Where the fear has gone there will be nothing. Only I will remain."

- Dune, Frank Herbert

Have you ever heard someone telling you:

I found an incredible business idea! I am pretty sure that I am the first one to think about that. I will get some good money down the road!

Then, you naturally asked for more information about this great idea! And then the answer you got was something like:

Hooo! But I can’t tell you, this is really secret right now, at least until everything is ready to go.

Does this sound familiar? It does to me. I hear it often. But why do people react that way? It is simply out of fear: the fear that someone will “steal” their ideas, start a company based on them, build projects or services that implement them, and get rich while they are flipping burgers at McDonald’s.

To me, this is the main reason why people, organizations and businesses are suspicious of open source software: fear; the fear of losing something they don’t even have.

But the question is: is that rational? From my experience, and my understanding of how things work, I can certainly say that it is not. This way of thinking is not rational because it doesn’t take a few things into account:

  • The ability of others to do something with your ideas
  • The ability of others to have the vision you have for your ideas
  • The willingness of others to spend all their time and energy to make these ideas work
  • The fact that people tend to do what they want to do, and not what others want

The same behavior seems to happen with open source projects. When I explain to people what we are doing, one of the first reactions is: why is your work open and free? Don’t you fear that someone will steal your project and ideas? How can you make money if it is free? Won’t people just run with it for themselves?

The simple answer to all these questions is: no. No, we don’t fear that anybody will steal our projects and ideas just by cloning them from source control. We don’t because of the four reasons listed above. We don’t because we trust our vision and our ability to implement it in our various open source projects. And yes, we can sustain the company pretty well with these projects, which is what I will cover in my next blog post.

Conclusion

Non-open source software is just like someone who has a business idea “for the next big thing” and doesn’t want to share it with anybody else because he thinks that someone will take that idea and run with it himself. In fact, it is quite the opposite. I have learned from experience that there is only one person (or organization) that can make such a great idea a relative success: the person (or organization) that lives for that idea. An idea is just an idea, and has nothing great in it, until it gets implemented, until the idea lives by itself, propelled by its most dedicated advocates: its creators and their boundless enthusiasm. Any idea would fail without this, and would be worth nothing; it would just be an idea.

by Frederick Giasson at October 05, 2011 01:14 PM

October 02, 2011

Frederick Giasson's Weblog

WordPress’s Follow Button for Non-WordPress.com Users

About two weeks ago, the WordPress.com team released a wonderful new tool called the Follow Button to all their WordPress.com users. This button floats in the bottom-right corner of a blog and lets readers subscribe, by email, to the blog’s publications. Each time a new blog post is published, they receive an update in their inbox.

The idea is far from new, and may even look old-school. However, their implementation is simple, really well done and really clever. Also, the wording they used in the tool is perfect (for example, using the word Follow instead of Subscribe).

The only problem is that this wonderful new tool is only available for WordPress.com users! As you may know, this blog is using WordPress, but it is a self-hosted instance. After doing some research, I couldn’t find any plugins or methods to install it on my blog. Also, the email service under this user interface is built into WordPress.com. As a last resort, I checked their Jetpack plugin to see if the Follow Button had been added to it, but apparently it hasn’t (it is probably too recent).

So, I was in a dilemma: I wanted this feature for my blog, I didn’t want to migrate everything to WordPress.com, and I didn’t have the time to write a plugin that does exactly this. So I took a few hours to hack my own Follow Button using what already exists out there. In fact, I was quite surprised to see how easy it turned out to be.

It was as easy as installing the really good Subscribe2 plugin and recreating the UI of the original Follow Button using some HTML, CSS and jQuery code. After some re-wiring, I ended up with my own self-hosted Follow Button.

This is what I want to share with you here, in this Hors Série blog post. I am pretty sure that many self-hosted WordPress bloggers will want it, so I took an additional hour to write and publish this blog post.

I made two additional “improvements” to the concept:

  1. I changed the icon to put some color in there, not only to make it less dull but also to bring a little more attention to it.
  2. I also added a link to my RSS feed. To me, “Follow” is not just about emails; it is about other syndication mediums too. However, I kept email as the first option to keep the spirit of the tool.

Finally, I didn’t want to hack any piece of code in WordPress or in any WordPress plugin. The only thing that we will modify is the theme, by adding some code to it. The current implementation could be improved by upgrading Subscribe2, for example, but I didn’t want people to have to do this to enable the Follow Button on their blog.

Step #1: Install Subscribe2

First things first: you have to install the WordPress plugin that will enable your users to subscribe to your blog via email, and to manage their subscriptions. We are using the really good Subscribe2 WordPress plugin, which adds these features to your WordPress instance.

To install this plugin using WordPress’ automatic plugin installation system, follow the instructions below. Read the plugin’s installation instructions if you want to do this the manual way:

  1. Log in to your WordPress blog and visit Plugins->Add New.
  2. Search for Subscribe2, click “Install Now” and then Activate the Plugin
  3. Click the “Settings” admin menu link, and select “Subscribe2”
  4. Configure the options to taste, including the email template and any categories which should be excluded from notification
  5. Click the “Tools” admin menu link, and select “Subscribers”
  6. Manually subscribe people as you see fit
  7. Create a WordPress Page to display the subscription form. When creating the page, you may click the “S2” button on the QuickBar to automatically insert the subscribe2 token. Or, if you prefer, you may manually insert the subscribe2 shortcode or token: [subscribe2] or the HTML invisible <!--subscribe2-->. Ensure the token is on a line by itself and that it has a blank line above and below. This token will automatically be replaced by dynamic subscription information and will display all forms and messages as necessary
  8. In the WordPress “Settings” area for Subscribe2, go to the “Appearance” section and select the name of the WordPress page created in step 7

On this blog, I called the page created at step #7: Follow. Once you are done installing the plugin, you can test it by visiting your Follow page, entering your own email address (preferably one that is not attached to any user account on your blog) and checking your inbox for a subscription notification. If you haven’t received one, you may want to take a look at this FAQ to debug any possible issue with your outgoing email service.

Step #2: Customize your Follow Page

This next step is optional. Since the form generated by the Subscribe2 plugin is really minimalist, you may want to customize it a little bit: change its design and add some explanation to the page to help your readers understand what is going on. Take a look at my own Follow page to see what I did to customize that page.

Step #3: Add the Follow Button code in your theme

The third step is really what will morph the Subscribe2 plugin into the Follow Button. All we are doing here is adding the code that displays the Follow Button to your theme.

The first thing you have to do is locate where the footer of the pages is generated in the theme. Open the theme folder of your blog: /../wordpress/wp-content/themes/mytheme/. Then you will have to open a few files to check where the closing </body> HTML tag is generated. The file where that code is generated really depends on how the theme was designed. You can search all the PHP files in that folder for the string “</body>”. This should give you the answer right away. Once you have located that place, you are good to continue with the following instructions.
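If you would rather script that search than open the files one by one, a small Node.js sketch along these lines could do it. The function name and the script name in the usage comment are made up for this example, and it assumes Node.js is available on the machine where the theme files live.

```javascript
var fs = require('fs');
var path = require('path');

// Returns the PHP files under `dir` (searched recursively) that contain `needle`.
function findPhpFilesContaining(dir, needle) {
  var matches = [];
  fs.readdirSync(dir, { withFileTypes: true }).forEach(function (entry) {
    var fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      matches = matches.concat(findPhpFilesContaining(fullPath, needle));
    } else if (entry.isFile() && path.extname(entry.name).toLowerCase() === '.php') {
      if (fs.readFileSync(fullPath, 'utf8').indexOf(needle) !== -1) {
        matches.push(fullPath);
      }
    }
  });
  return matches;
}

// Usage (hypothetical script name):
//   node find-body.js /path/to/wordpress/wp-content/themes/mytheme
if (process.argv[2]) {
  console.log(findPhpFilesContaining(process.argv[2], '</body>').join('\n'));
}
```

In most themes the match will be a single file, which is the one to edit in the rest of this step.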

Important note: It is possible that your theme doesn’t use jQuery by default. If that is the case, then you have to edit the header.php file of your theme (or whatever file generates the header of your blog), and add the following line in the <head>...</head> section of the page:

<script src="https://ajax.googleapis.com/ajax/libs/jquery/1.6.4/jquery.min.js" type="text/javascript"></script>
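A quick way to confirm that jQuery is present before any widget code runs is a small guard like the sketch below. The function name is hypothetical and the guard is not part of the Follow Button code itself; it only illustrates the check.

```javascript
// Runs `callback` with jQuery only when jQuery is present on `root`
// (the `window` object in a browser). Returns whether the callback ran.
function runWithJQuery(root, callback) {
  var jq = root ? root.jQuery : undefined;
  if (typeof jq !== 'function') {
    return false; // jQuery is missing: skip the widget instead of throwing.
  }
  callback(jq);
  return true;
}

// In a theme, `root` would be `window`; stand-in objects are used here so the
// sketch also runs outside a browser.
var fakeWindow = { jQuery: function () {} };
console.log(runWithJQuery(fakeWindow, function () {})); // true
console.log(runWithJQuery({}, function () {}));         // false
```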

If you don’t have jQuery loaded, a JavaScript error will be raised, and the panel will “freeze” in the webpage. Once you have made sure that jQuery is loaded, proceed with this code:

<style type="text/css" media="screen">
  #bit, #bit * {}
  #bit {
      bottom: -300px;
      font: 13px "Helvetica Neue",sans-serif;
      position: fixed;
      right: 10px;
      z-index: 999999;
      width: 230px;
  }
 
  .loggedout-follow-typekit {
      margin-right: 4.5em;
  }
 
  #bit a.bsub {
      background-color: #464646;
      background-image: -moz-linear-gradient(center bottom , #3F3F3F, #464646 5px);
      border: 0 none;
      box-shadow: 0 -1px 5px rgba(0, 0, 0, 0.2);
      color: #CCCCCC;
      display: block;
      float: right;
      font: 13px/28px "Helvetica Neue",sans-serif;
      letter-spacing: normal;
      outline-style: none;
      outline-width: 0;
      overflow: hidden;
      padding: 0 10px 0 8px;
      text-decoration: none !important;
      text-shadow: 0 -1px 0 #444444;
  }
 
  #bit a.bsub {
      border-radius: 2px 2px 0 0;
  }
 
  #bit a.bsub span {
      background-attachment: scroll;
      background-clip: border-box;
      background-color: transparent;
      background-image: url("[[PATH-TO-THE-FAMFAM-ICON]]asterisk_orange.png");
      background-origin: padding-box;
      background-position: 2px 3px;
      background-repeat: no-repeat;
      background-size: 20% auto;
      padding-left: 18px;
  }
 
  #bit a:hover span, #bit a.bsub.open span {
      /*background-position: 0 -117px;*/
      color: #FFFFFF !important;
  }
 
  #bit a.bsub.open {
      background: none repeat scroll 0 0 #333333;
  }
 
  #bitsubscribe {
      background: none repeat scroll 0 0 #464646;
      border-radius: 2px 0 0 0;
      color: #FFFFFF;
      margin-top: 27px;
      padding: 15px;
      width: 200px;
      float: right;
      margin-top: 0;
  }
 
  div#bitsubscribe.open {
      box-shadow: 0 0 8px rgba(0, 0, 0, 0.5);
  }
 
  #bitsubscribe div {
      overflow: hidden;
  }
 
  #bit h3, #bit #bitsubscribe h3 {
      color: #FFFFFF;
      font-family: "Helvetica Neue",Helvetica,Arial,sans-serif;
      font-size: 20px;
      font-weight: 300;
      margin: 0 0 0.5em !important;
      text-align: left;
      text-shadow: 0 1px 0 #333333;
  }
 
  #bit #bitsubscribe p {
      color: #FFFFFF;
      font: 300 15px/1.3em "Helvetica Neue",Helvetica,Arial,sans-serif;
      margin: 0 0 1em;
      text-shadow: 0 1px 0 #333333;
  }
 
  #bitsubscribe p a {
      margin: 20px 0 0;
  }
 
  #bit #bitsubscribe p.bit-follow-count {
      font-size: 13px;
  }
 
  #bitsubscribe input[type="submit"] {
      -moz-transition: all 0.25s ease-in-out 0s;
      background: -moz-linear-gradient(center top , #333333 0%, #111111 100%) repeat scroll 0 0 transparent;
      border: 1px solid #282828;
      border-radius: 11px 11px 11px 11px;
      box-shadow: 0 1px 0 #444444 inset;
      color: #CCCCCC;
      padding: 2px 20px;
      text-decoration: none;
      text-shadow: 0 1px 0 #000000;
  }
 
  #bitsubscribe input[type="submit"]:hover {
      background: -moz-linear-gradient(center top , #333333 0%, #222222 100%) repeat scroll 0 0 transparent;
      box-shadow: 0 1px 0 #4F4F4F inset;
      color: #FFFFFF;
      text-decoration: none;
  }
 
  #bitsubscribe input[type="submit"]:active {
      background: -moz-linear-gradient(center top , #111111 0%, #222222 100%) repeat scroll 0 0 transparent;
      box-shadow: 0 -1px 0 #333333 inset;
      color: #AAAAAA;
      text-decoration: none;
  }
 
  #bitsubscribe input[type="text"] {
      border-radius: 3px 3px 3px 3px;
      font: 300 15px "Helvetica Neue",Helvetica,Arial,sans-serif;
  }
 
  #bitsubscribe input[type="text"]:focus {
      border: 1px solid #000000;
  }
 
  #bitsubscribe.open {
      display: block;
  }
 
  #bsub-subscribe-button {
      margin: 0 auto;
      text-align: center;
  }
 
  #bitsubscribe #bsub-credit {
      border-top: 1px solid #3C3C3C;
      font: 11px "Helvetica Neue",sans-serif;
      margin: 0 0 -15px;
      padding: 7px 0;
      text-align: center;
  }
 
  #bitsubscribe #bsub-credit a {
      background: none repeat scroll 0 0 transparent;
      color: #AAAAAA;
      text-decoration: none;
      text-shadow: 0 1px 0 #262626;
  }
 
  #bitsubscribe #bsub-credit a:hover {
      background: none repeat scroll 0 0 transparent;
      color: #FFFFFF;
  }
</style>    

<script type="text/javascript" charset="utf-8">
  jQuery.extend(jQuery.easing, {
      easeOutCubic: function (x, t, b, c, d) {
          return c * ((t = t / d - 1) * t * t + 1) + b;
      }
  });
  jQuery(document).ready(function () {
      var isopen = false,
          bitHeight = jQuery('#bitsubscribe').height(),
          // Offset that hides the panel below the viewport, leaving only the tab visible.
          closedBottom = -(bitHeight + 30) + 'px';
      // Shortly after page load, slide the widget to its closed position.
      setTimeout(function () {
          jQuery('#bit').animate({
              bottom: closedBottom
          }, 200);
      }, 300);
      // Toggle the panel when the Follow tab is clicked.
      jQuery('#bit a.bsub').click(function () {
          if (!isopen) {
              isopen = true;
              jQuery('#bit a.bsub').addClass('open');
              jQuery('#bit #bitsubscribe').addClass('open');
              jQuery('#bit').stop();
              jQuery('#bit').animate({
                  bottom: '0px'
              }, {
                  duration: 400,
                  easing: "easeOutCubic"
              });
          } else {
              isopen = false;
              jQuery('#bit').stop();
              jQuery('#bit').animate({
                  bottom: closedBottom
              }, 200, function () {
                  jQuery('#bit a.bsub').removeClass('open');
                  jQuery('#bit #bitsubscribe').removeClass('open');
              });
          }
      });
  });
</script>

<div id="bit" class="">
  <a class="bsub" href="javascript:void(0)"><span id='bsub-text'>Follow</span></a>
 
  <div id="bitsubscribe">
    <h3><label for="s2email">Follow this Blog</label></h3>

    <form action="[[PATH-TO-YOUR-FOLLOW-WORDPRESS-PAGE]]" method="post" accept-charset="utf-8" id="loggedout-follow">
      <p>Get every new post on this blog delivered to your Inbox.</p>
      <p class="bit-follow-count">Join <?php echo $wpdb->get_var("SELECT COUNT(id) FROM wp_subscribe2 WHERE active='1'"); ?> other followers:</p>
      <p>
        <input type="text" name="email" id="s2email" style="width: 95%; padding: 1px 2px" value="Enter email address" onfocus='this.value=(this.value=="Enter email address") ? "" : this.value;' onblur='this.value=(this.value=="") ? "Enter email address" : this.value;' />
      </p>
       
      <input type="hidden" name="ip" value="<?php echo $_SERVER['REMOTE_ADDR']; ?>">
     
      <p id='bsub-subscribe-button'>
        <input type="submit" name="subscribe"  value="Sign me up!" />
      </p>
    </form>
   
    <p style="padding-top: 10px;">Or subscribe to the RSS feed by clicking on the counter:</p>  
   
    <p>
      [[ADD-YOUR-RSS-FEED-LINK-HERE]]
    </p>
  </div>
</div>

You only have to copy and paste that code above the </body> tag. Then, make the following three modifications to properly wire it into your blog:

  • In the CSS rule for #bit a.bsub span, replace [[PATH-TO-THE-FAMFAM-ICON]] with the path of the asterisk_orange.png icon on your blog
  • In the action attribute of the form, replace [[PATH-TO-YOUR-FOLLOW-WORDPRESS-PAGE]] with the URL of your Follow page (the one you created when you installed Subscribe2)
  • In the last paragraph of the panel, replace [[ADD-YOUR-RSS-FEED-LINK-HERE]] with the link to your RSS feed

You can get the free asterisk_orange.png icon image from the FamFamFam website. You only have to download that image and put it in the folder you defined for [[PATH-TO-THE-FAMFAM-ICON]]. However, you can use whichever image you prefer, if another one better fits the design of your blog.
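If you maintain several blogs, the three placeholder replacements can also be scripted. Here is a minimal sketch, assuming the widget code is held in a string; the helper name and the example paths are invented for illustration and should be swapped for your own values.

```javascript
// Hypothetical helper: replaces every [[NAME]] placeholder found in `template`
// with the matching entry of `values`; unknown placeholders are left as-is.
function fillPlaceholders(template, values) {
  return template.replace(/\[\[([A-Z-]+)\]\]/g, function (match, name) {
    return Object.prototype.hasOwnProperty.call(values, name) ? values[name] : match;
  });
}

// Example values only; use the paths and URLs of your own blog.
var values = {
  'PATH-TO-THE-FAMFAM-ICON': '/wp-content/themes/mytheme/images/',
  'PATH-TO-YOUR-FOLLOW-WORDPRESS-PAGE': '/follow/'
};

console.log(fillPlaceholders('url("[[PATH-TO-THE-FAMFAM-ICON]]asterisk_orange.png")', values));
console.log(fillPlaceholders('<form action="[[PATH-TO-YOUR-FOLLOW-WORDPRESS-PAGE]]" method="post">', values));
```

Leaving unknown placeholders untouched makes it easy to spot one that was forgotten, since it will still show up in the generated page.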

Step #4: Disable it For Mobile Devices

Some mobile devices may have issues displaying this floating window. Sometimes, the window may float in the middle of the device’s screen without folding back to the bottom of the page. For this reason, you may want to disable (remove) this option if the user is using a mobile device to read your blog. You can easily disable it when the web server detects that a mobile device is requesting the webpage, by adding these two blocks of code.

First, copy and paste this first block of code above the code of the Follow button (before line #1):

<?php
 // Fall back to an empty string when the browser sends no User-Agent header.
 $useragent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
 // Output the Follow Button only when neither mobile pattern matches: the negation has to cover both tests.
 if(!(preg_match('/android.+mobile|avantgo|bada\/|blackberry|blazer|compal|elaine|fennec|hiptop|iemobile|ip(hone|od)|iris|kindle|lge |maemo|midp|mmp|opera m(ob|in)i|palm( os)?|phone|p(ixi|re)\/|plucker|pocket|psp|symbian|treo|up\.(browser|link)|vodafone|wap|windows (ce|phone)|xda|xiino/i',$useragent)||preg_match('/1207|6310|6590|3gso|4thp|50[1-6]i|770s|802s|a wa|abac|ac(er|oo|s\-)|ai(ko|rn)|al(av|ca|co)|amoi|an(ex|ny|yw)|aptu|ar(ch|go)|as(te|us)|attw|au(di|\-m|r |s )|avan|be(ck|ll|nq)|bi(lb|rd)|bl(ac|az)|br(e|v)w|bumb|bw\-(n|u)|c55\/|capi|ccwa|cdm\-|cell|chtm|cldc|cmd\-|co(mp|nd)|craw|da(it|ll|ng)|dbte|dc\-s|devi|dica|dmob|do(c|p)o|ds(12|\-d)|el(49|ai)|em(l2|ul)|er(ic|k0)|esl8|ez([4-7]0|os|wa|ze)|fetc|fly(\-|_)|g1 u|g560|gene|gf\-5|g\-mo|go(\.w|od)|gr(ad|un)|haie|hcit|hd\-(m|p|t)|hei\-|hi(pt|ta)|hp( i|ip)|hs\-c|ht(c(\-| |_|a|g|p|s|t)|tp)|hu(aw|tc)|i\-(20|go|ma)|i230|iac( |\-|\/)|ibro|idea|ig01|ikom|im1k|inno|ipaq|iris|ja(t|v)a|jbro|jemu|jigs|kddi|keji|kgt( |\/)|klon|kpt |kwc\-|kyo(c|k)|le(no|xi)|lg( g|\/(k|l|u)|50|54|e\-|e\/|\-[a-w])|libw|lynx|m1\-w|m3ga|m50\/|ma(te|ui|xo)|mc(01|21|ca)|m\-cr|me(di|rc|ri)|mi(o8|oa|ts)|mmef|mo(01|02|bi|de|do|t(\-| |o|v)|zz)|mt(50|p1|v )|mwbp|mywa|n10[0-2]|n20[2-3]|n30(0|2)|n50(0|2|5)|n7(0(0|1)|10)|ne((c|m)\-|on|tf|wf|wg|wt)|nok(6|i)|nzph|o2im|op(ti|wv)|oran|owg1|p800|pan(a|d|t)|pdxg|pg(13|\-([1-8]|c))|phil|pire|pl(ay|uc)|pn\-2|po(ck|rt|se)|prox|psio|pt\-g|qa\-a|qc(07|12|21|32|60|\-[2-7]|i\-)|qtek|r380|r600|raks|rim9|ro(ve|zo)|s55\/|sa(ge|ma|mm|ms|ny|va)|sc(01|h\-|oo|p\-)|sdk\/|se(c(\-|0|1)|47|mc|nd|ri)|sgh\-|shar|sie(\-|m)|sk\-0|sl(45|id)|sm(al|ar|b3|it|t5)|so(ft|ny)|sp(01|h\-|v\-|v )|sy(01|mb)|t2(18|50)|t6(00|10|18)|ta(gt|lk)|tcl\-|tdg\-|tel(i|m)|tim\-|t\-mo|to(pl|sh)|ts(70|m\-|m3|m5)|tx\-9|up(\.b|g1|si)|utst|v400|v750|veri|vi(rg|te)|vk(40|5[0-3]|\-v)|vm40|voda|vulc|vx(52|53|60|61|70|80|81|83|85|98)|w3c(\-| )|webc|whit|wi(g |nc|nw)|wmlb|wonu|x700|xda(\-|2|g)|yas\-|your|zeto|zte\-/i',substr($useragent,0,4))))
 {
?>

Then copy and paste this second block of code below the code of the Follow Button (after its last line):

<?php
  }
?>

This code comes from the Detect Mobile Browser project and is the best mobile device detection code I have seen so far. It leaves the Follow Button out of the page if the device requesting the webpage is a mobile device. Otherwise, the Follow Button is added to the HTML page.
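The decision itself is simple: the button is shown only when neither of the two user-agent tests matches. The JavaScript sketch below mirrors that logic with heavily shortened patterns, so it is an illustration of the rule rather than a replacement for the Detect Mobile Browser code; the function names are hypothetical.

```javascript
// Heavily shortened stand-ins for the two patterns of the Detect Mobile
// Browser code; they are NOT a complete list of mobile devices.
var GENERAL_PATTERN = /android.+mobile|blackberry|iemobile|ip(hone|od)|opera m(ob|in)i|windows (ce|phone)/i;
var PREFIX_PATTERN = /nok(6|i)|sie(-|m)|so(ft|ny)|voda/i;

function isMobile(userAgent) {
  var ua = userAgent || '';
  // The second pattern is only tested against the first four characters.
  return GENERAL_PATTERN.test(ua) || PREFIX_PATTERN.test(ua.slice(0, 4));
}

// The Follow Button is included only when neither pattern matched.
function shouldShowFollowButton(userAgent) {
  return !isMobile(userAgent);
}

console.log(shouldShowFollowButton('Mozilla/5.0 (Windows NT 6.1; rv:7.0.1) Gecko/20100101 Firefox/7.0.1')); // true
console.log(shouldShowFollowButton('Mozilla/5.0 (iPhone; CPU iPhone OS 5_0 like Mac OS X) AppleWebKit/534.46')); // false
console.log(shouldShowFollowButton('Nokia6230/2.0 (04.44) Profile/MIDP-2.0 Configuration/CLDC-1.1')); // false
```

The Nokia example is only caught by the second pattern, which is why the negation has to apply to both tests together.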

Step #5: Test it!

If you are reading this step #5, it means that you have finished creating your own self-hosted Follow Button!

Congratulations!

The last thing that remains to be done is to test it. Once you have saved your file with the code above, just refresh any page of your blog. You should see the Follow Button appear in the bottom-right corner of your blog. If you click on it, you should see the form that lets your readers subscribe. If you enter one of your email addresses and click the subscribe button, you should get redirected to the Follow page. Finally, you should receive a confirmation email that asks you to confirm your subscription by clicking on a link.

If all these steps work properly, it means that you are done and ready to provide this new functionality to the readers of your blog!

Conclusion

Even though this blog post is a few pages long, I hope you found the Follow Button easy to install and set up. If you have any questions regarding this hack, don’t hesitate to ask them below, in the comments section of this post. I will be happy to answer all of them.

Happy Hacking!

 

Translations

This blog post has been translated into Spanish by Federico Bozo. Other translations will be added to this section.

by Frederick Giasson at October 02, 2011 10:19 PM

September 30, 2011

HyperDanja (Danny Ayers)

links for 2011-09-30

by danja at September 30, 2011 11:15 PM

September 27, 2011

Frederick Giasson's Weblog

One of the Semantic Web’s Core Added Values

Suppose I ask the question: "What added value does the Semantic Web bring to the table?" In other words, what benefits would companies and organizations get from using the Semantic Web? I am pretty sure that I would get answers such as:
  • You will instantly be able to traverse graphs of relationships
  • You will be able to infer facts (so create/persist new knowledge) from other existing facts
  • You will be able to check to make sure that your knowledge base is consistent and satisfiable
  • You will be able to modify your ontologies/vocabularies/schemas without impacting the description of your instance records or the usability of any software that uses them (unlike relational databases)
  • And so on…

All these answers would be accurate. However, what if these answers were only a part of the real added value that the Semantic Web brings to the table?

Note: when I refer to the “Semantic Web” in this blog post (and across all my writings), I refer to a set of technologies, techniques and concepts referred to as the Semantic Web. So it is not a single thing, but a complete set of things that creates new ways of working with, and manipulating, information.

After about 7 years of research and development with Semantic Web technologies, including about 3 years developing the Open Semantic Framework, I have found that the biggest added value of Semantic Web technologies is only partially related to these answers. In fact, the biggest added value for me, as a developer, can be summed up in one word:

PRODUCTIVITY

It is as simple as that. The biggest added value I gained from using and applying Semantic Web technologies, techniques and concepts is a significant increase in development and data integration productivity.

This productivity gain has to do with one of the Semantic Web’s core attributes:

FLEXIBILITY

This is what I was suggesting in my latest blog post about Volkswagen’s use of the Open Semantic Framework: how Volkswagen uses the Open Semantic Framework to get flexibility that will lead to a gain in productivity to integrate, publish, and re-contextualize their data assets. The few gains that I listed above are part of the reason why the Semantic Web gives you flexibility that leads to an increase in productivity.

This same point has been re-affirmed today by Lee Feigenbaum in his latest blog post, Saving Months, Not Milliseconds: Do More Faster with the Semantic Web:

Why is this? Ultimately, it’s because of the inherent flexibility of the Semantic Web data model (RDF). This flexibility has been described in many different ways. RDF relies on an adaptive, resilient schema (from Mike Bergman); it enables cooperation without coordination (from David Wood via Kendall Clark); it can be incrementally evolved; changes to one part of a system don’t require re-designs to the rest of the system. These are all dimensions of the same core flexibility of Semantic Web technologies, and it is this flexibility that lets you do things fast with the Semantic Web.

Warning: Productivity is not synonymous with simplicity

However, I would caution people who think that these productivity gains are possible because semantic web technologies are simpler to use, manage and implement than other existing technologies.

It is certainly not the case, and I don't think it ever will be. Semantic Web technologies, techniques and concepts are not easy to understand, and they have a steep learning curve. This is partly because these techniques, technologies and concepts are relatively new in the field of computer science, and because they are not yet fully understood, defined, implemented and used.

by Frederick Giasson at September 27, 2011 05:34 PM

September 26, 2011

AI3:::Adaptive Information (Mike Bergman)

Thirty OWL API Tools

Documenting the Emerging Ecosystem Around OWL 2

We have been touting the importance of OWL 2 as the language of choice for federating and reasoning over RDF and ontologies. An absolutely essential enabler of the OWL 2 language is version 3 of the OWL API (actually, version 3.2.4 at the time of this writing), a Java-based framework for accessing and managing the language. Protégé 4, the most popular open source ontology editor and integrated development environment (IDE), for example, is built around the OWL API.

As we laid out a bit more than a year ago, now codified on our TechWiki as the Normative Landscape of Ontology Tools (especially the second figure), we see the OWL API as the essential pivot point for all forms of ontology tools moving forward.

We have attempted to assemble a definitive and comprehensive list of all known tools presently based around version 3 of the OWL API. (We have surely missed some and welcome comments to this post that identify missing ones; we promise to add them and keep tracking them.) Herein is a listing of the 30 or so known OWL API-based tools:

  • Protégé 4 is a free, open source ontology editor and knowledge-base framework based on OWL 2 and centered on the OWL API
  • CEL, FaCT++, HermiT, Pellet, and Racer Pro reasoners provide OWL API wrappers and are also available as reasoner plugins to Protégé 4
  • There is also a FaCT++ port to Java that also implements the OWLReasoner interface and is available as a plugin for Protégé 4.1; it is at version 0.9, with user feedback welcomed
  • structOntology is an open source ontology editor and manager supporting Structured Dynamics’ conStruct implementation of the Open Semantic Framework (OSF) in Drupal; more information is provided here
  • TrOWL is a Tractable reasoning infrastructure for OWL 2. TrOWL supports both standard TBox and ABox reasoning, as well as conjunctive query answering
  • SKOSEd is a SKOS editor for Protégé; it was just recently made compatible with Protégé 4.1
  • Populus is a semantic spreadsheet framework using RightField and OPPL for creating OWL ontologies
  • Bubastis is a tool for detecting asserted logical differences between two ontologies, such as between versions. A stand alone version of the tool is also available for download from the EFO tools page. Bubastis is powered by the OWL API
  • Tab2OWL (and its download) is a Java tool for importing classes into an already existing OWL file. The script uses the OWL API to read in a tab-delimited file of class details and create OWL classes from these rows, adding them to an existing ontology
  • S-Match is a semantic matching framework, which provides several semantic matching algorithms and facilities for developing new ones. Currently S-Match contains implementations of the original S-Match semantic matching algorithm, as well as a minimal semantic matching algorithm and a structure-preserving semantic matching algorithm
  • The Alignment API is an API and implementation for expressing and sharing ontology alignments. It uses an RDF format for expressing alignments in a uniform way. Its four main interfaces (Alignment, Cell, Relation and Evaluator) provide these services: storing, finding, and sharing alignments; piping alignment algorithms (improving an existing alignment); manipulating (thresholding and hardening); generating processing output; and comparing alignments
  • The OWLlink API is a Java interface and implementation of the OWLlink protocol on top of the Java-based OWL API. The OWLlink API enables OWL API-based applications to access remote reasoners (so-called OWLlink servers), and it turns any OWL API aware reasoner into an OWLlink server
  • OPPL2 (ontology pre-processing language) is an abstract formalism that allows for manipulating ontologies written in OWL. It is 100% based on the Manchester OWL Syntax, and is both a query language based on OWL (logical) axioms and variables and a scripting language that allows the addition/removal of OWL (logical) axioms. It is available as a Protégé 4.1 plug-in
  • OPPL Patterns is available as a Protégé 4.1 plug-in
  • Posh (Prolog OWL Shell) is a command line utility that wraps the Thea OWL library to allow for advanced querying and processing of ontologies, combining the power of Prolog and OWL reasoning
  • POPL (Prolog Ontology Processing Language) allows you to write expressive ontology rewrite rules in a high-level declarative fashion using a syntax similar to Manchester syntax
  • OWLTools (aka OWL2LS – OWL2 Life Sciences) is a convenience Java API on top of the OWL API. Code is available here
  • LexOWL is a plug-in for Protégé 4. In order to add more powerful functionality (e.g., inferencing, editing) to the existing infrastructure and align LexGrid more closely with various Semantic Web technologies, the LexOWL plugin for Protégé 4 provides a way for representing the ontologies modeled within the LexGrid environment in OWL. A source for downloading this tool has not been found
  • Apero is a Protégé plug-in for ontology debugging based on the use of anti-patterns; see http://www.emcl-study.eu/fileadmin/master_theses/thesis_tahwil.pdf
  • DReW is a prototype DL reasoner over LDL+ ontologies and a prototype reasoner for dl-programs over LDL+ ontologies under well-founded semantics. It is not well developed or documented; it can be downloaded here
  • The LingInfo, LexOnto, LexInfo and LMF ontologies are available from the project website, as well as a corresponding Java API with an implementation for the commonly used OWL API
  • Thea2 is a Prolog library that provides complete support for querying and processing OWL 2 ontologies directly from within Prolog programs. Thea2 also offers additional capabilities including a bridge to the Java OWL API and translation of ontologies to Description Logic programs
  • GLOW is a visualization for OWL ontologies, based on Hierarchical Edge Bundles. Hierarchical Edge Bundles is a new, visually attractive technique for displaying adjacency relations in hierarchical data, such as concept structures formed by “subclass-of” and “type-of” relations. The displayed adjacency relations can be selected from an ontology using a set of common configurations, allowing for intuitive discovery of information. It is a visualization library based on the OWL API, as well as a plug-in for Protégé
  • ROWLKit is a simple GUI to reason and query over ontologies written in the OWL 2 QL profile of OWL
  • OBDA Plugin (Ontology-based data access) is an add-on for the Protégé ontology editor aimed at transforming Protégé into a fully fledged OBDA model editor. It provides data source and mapping editors, as well as querying facilities that, in conjunction with an OBDA-enabled reasoner, allow you to design and test every aspect of an OBDA system
  • OntoCAT provides high level abstraction for interacting with ontology resources including local ontology files in standard OWL and OBO formats (via OWL API)
  • SemaRule Navigator is an Eclipse-based toolkit of multiple semWeb tools, built around the OWL API, organized into a pipeline-like system (appears quite complicated)
  • OWLDB (alias Mnemosyne) is a storage system based on object-relational mappings utilising the OWL-API for the W3C Web Ontology Language OWL
  • Finally, for a periodically updated list of “official” extensions, see https://owlapi.svn.sourceforge.net/svnroot/owlapi/v3/branches/owlextensions/.

Addendum

Ignazio Palmisano also graciously suggested these additional sources:

which also further leads to this additional listing:

It is not clear if all of these offer OWL 2 support, let alone work with the current OWL API.

by Mike Bergman at September 26, 2011 08:52 AM

September 21, 2011

Frederick Giasson's Weblog

Volkswagen’s Use of structWSF in their Semantic Web Platform

TribalDDB London, Volkswagen UK’s partner, mentioned earlier this week that Volkswagen are using some parts of the Open Semantic Framework to develop the next generation of their online platform.

This story has been published by Jennifer Zaino in her article: Volkswagen: Das Auto Company is Das Semantic Web Company!

I can now talk about this project that uses some pieces of the framework that we have been developing for more than 3 years now.

The Objective

Volkswagen’s main objective behind the development of the next version of their Web platform started with improving their online search engine, but as William Greenly mentioned, it quickly became a strategic decision:

"So the objectives were about site search and improving it, but in the long-run it was always the idea to contextualize content, to facet content, to promote it in different contexts."

The objective is to create a platform that gives them the flexibility to leverage all the data assets they own. This flexibility will help them not only to improve their search engine, but also to contextualize their data in different parts of their websites and their partners’ websites, and to promote and publish that same information on different communication channels or devices.

The Flexibility


What is a flexible platform in that context? A flexible platform is one that can integrate any kind of information source. In the context of Volkswagen, such information sources can be a series of relational dataset schemas spread around the world, Excel spreadsheets, CSV files, old plain text technical documents about past car models, semi-structured documents such as webpages, etc.

A flexible platform is also one that minimally impacts (if at all) the data consumers when the data structure changes in the system. This is really important since the world we live in constantly changes, and we have to reflect these changes in the data we own and maintain. Because data structure changes will happen all the time, we want to minimize their impact.

Having the flexibility to constantly adapt your data, while minimally impacting the data consumers of the system, enables you to make quick decisions to adapt your strategy in a highly competitive world. This flexibility gives you a clear business advantage.

A flexible platform is also one that lets you publish your data the way you want, in the format that is needed. Such a platform has to provide an interface that gives you access to all of its functionality without having to care about what happens under the hood.

A flexible system is one that can communicate your information over any kind of communication channel, and to any device that has access to the Web.

Under the Hood

That next-generation platform that Volkswagen is currently developing is partly based on a few of the main pieces of the Open Semantic Framework. These pieces help them reach their goal by giving their platform the flexibility it needs.

The first step they went through was to create their Volkswagen Vehicles Ontology, which is used to describe all the entities they want to index into their platform. The Web Ontology Language (OWL), along with the Resource Description Framework (RDF), is what gives them complete flexibility in how they integrate all the pieces of information they want, in a canonical format.

Then they chose to use structWSF (the structured data web services framework). This piece gives them a series of web interfaces (web service endpoints) to create, update, manage and query their data. This web service layer enables them to do anything they want with their data, from anywhere on the Web. This is possible because all the functionality of the framework is exposed as web service endpoints. structWSF also lets them communicate their data in multiple formats. This makes it a flexible system for feeding their information into different contexts, communication channels and devices.

At Volkswagen, structWSF is used to populate, and keep in sync, their Solr and Triple Store instances. It gives them the time to care about the more important aspects of their platform, and to care about how the data should be synced between the various specialized data management systems.

By using structWSF to manage their data, they are able to reach some objectives to make their platform as flexible as possible:

  • To be able to minimize the impact of data changes on the data consumers
    • Because structWSF uses OWL & RDF to describe all the data it indexes
  • To be able to manipulate their data from anywhere
    • Because all the functionality of structWSF is exposed as web service endpoints
  • To be able to communicate the information in different contexts, communication channels and devices
    • Because structWSF is, at its core, designed to transform all the data it indexes into any other kind of format

The Next Step

One of their longer-term goals is to analyze their unstructured and semi-structured textual documents to extract some structure out of them, and to index them into their semantic platform. To do this, they are looking at using Scones, which is the structWSF semantic tagger web service endpoint. Scones will use subject reference structures such as UMBEL to semantically tag the textual documents. Once a document has been processed by Scones and indexed in structWSF, it can be re-published in different contexts based on the reference concepts that have been tagged to it. This gives them the flexibility to leverage non-structured sources of data and to re-purpose them by publishing them in different contexts and in different systems.

This second system will enable them to leverage the investment they made in the past by writing all these textual documents, and to re-purpose and re-contextualize them in all kinds of different contexts.

Conclusion

I think that TribalDDB and Volkswagen made the right decision for their future. Taking the business decision to develop and maintain a completely new kind of information system is not easy. I am not just saying that they made a good choice by using our pieces of the stack; the decision goes far beyond this. Such a Semantic Platform challenges everything in an organization: the people who make the decisions, the people who create and manage the data, the people who develop the system, the people who maintain that system, the consumers of the system, the customers, the partners, etc. This is a big decision, whatever technology stack you plan to use. I congratulate them on the decision they made.

I strongly believe that this was the right decision to make considering the future opportunities they are creating for themselves.

 

 

by Frederick Giasson at September 21, 2011 08:59 PM

September 19, 2011

Frederick Giasson's Weblog

Benchmark of PHP’s main String Search Functions

I am currently upgrading the structWSF ontology-related web service endpoints, along with the structOntology conStruct module, to improve their performance so that we can load ontologies that have thousands of classes and properties (at least up to 30 000 of them).

While testing these upgrades with the UMBEL ontology, I noticed that much of the time was spent in a small number of stripos() calls located in the loadXML() function of the ProcessorXML.php internal structXML parser. They were used to extract the prefixes in the header of the structXML files, and then to resolve them in the XML file. I was using stripos() instead of strpos() to make the parsing of these structXML files case-insensitive, even though XML itself is case-sensitive. However, due to their processing cost, I changed this behavior by using the strpos() function instead. Here are the main reasons for this change:

  • XML is itself case-sensitive, so don’t try to be too clever
  • These structXML files that are exchanged are mostly internal to structXML
  • Their parsing performance is critical

The Tests

This is a non-scientific post about some experiments I ran with the various PHP 5.3 string search functions. These tests have been performed on a small Amazon EC2 instance using DBG and PHPeD.

<?php
 
$text = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce malesuada aliquet pharetra. Nunc tincidunt tempus eleifend. Cras aliquet risus eget tortor elementum at molestie erat auctor. Sed sapien nulla, auctor a aliquam in, ornare eget enim. Ut ac luctus nunc. Etiam et tortor felis, sed fringilla orci. Fusce laoreet ligula turpis, quis sodales enim. Pellentesque at sapien ut dolor malesuada placerat eu ac quam. Pellentesque purus elit, sodales in fringilla eu, egestas vitae ipsum. Nam condimentum, nisi ac tincidunt luctus, odio erat porta turpis, eget varius felis leo sit amet lorem. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Maecenas quis pulvinar dui. Integer quis eros nibh. Donec in lectus vitae ligula euismod vulputate ut euismod enim. Ut vehicula, sapien at faucibus ornare, nulla lorem luctus purus, sed imperdiet augue purus quis enim.";
$explodedText = explode(" ", $text);
 
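// Run the four search functions 10 000 times, each time on a randomly
// selected word of the text.
for($i = 0; $i < 10000; $i++)
{
  // array_rand() returns a random key, so use it to look up the word itself
  $word = $explodedText[array_rand($explodedText)];
   
  strpos($text, $word);
  stripos($text, $word);
  strstr($text, $word);
  stristr($text, $word);
}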

?>

The first test uses a text of 138 words. That text gets exploded into an array where each value is a word of that text. Then, at each iteration, we randomly select a word and search for it within the text using each of the 4 search functions.

Note that in the result images below, the lines in the left-most column are those of the PHP code above.

That first test starts with 10 000 iterations. Here are the results of the first run:


The second test uses the same 138 words, but the test is performed 100 000 times:

As we can see, strpos() and strstr() are clearly faster than their case-insensitive counterparts.

Now, let’s see what impact the size of the searched text has. We will now perform the two tests with 10 000 and 100 000 iterations, but with a text that has 497 words.

<?php
 
$longText = "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Fusce malesuada aliquet pharetra. Nunc tincidunt tempus eleifend. Cras aliquet risus eget tortor elementum at molestie erat auctor. Sed sapien nulla, auctor a aliquam in, ornare eget enim. Ut ac luctus nunc. Etiam et tortor felis, sed fringilla orci. Fusce laoreet ligula turpis, quis sodales enim. Pellentesque at sapien ut dolor malesuada placerat eu ac quam. Pellentesque purus elit, sodales in fringilla eu, egestas vitae ipsum. Nam condimentum, nisi ac tincidunt luctus, odio erat porta turpis, eget varius felis leo sit amet lorem. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Maecenas quis pulvinar dui. Integer quis eros nibh. Donec in lectus vitae ligula euismod vulputate ut euismod enim. Ut vehicula, sapien at faucibus ornare, nulla lorem luctus purus, sed imperdiet augue purus quis enim. Nunc eu consectetur quam. Duis nulla sem, tincidunt vel placerat at, ultricies eu est. Vestibulum sed nulla nunc, et tristique orci. Aliquam nulla sapien, lobortis in sagittis vitae, tincidunt ut felis. Lorem ipsum dolor sit amet, consectetur adipiscing elit. Ut condimentum, orci venenatis mollis faucibus, purus enim euismod massa, a imperdiet sapien arcu in sapien. Nulla convallis sodales pretium. Nulla facilisi. Maecenas molestie est tortor. Fusce congue, leo eu tristique sodales, odio leo facilisis lectus, in euismod odio tellus ut sapien. Fusce odio orci, facilisis eu convallis et, consectetur nec mauris. Nullam nulla lacus, volutpat sit amet pulvinar quis, pulvinar eget dolor. Curabitur sit amet odio sem, at dapibus tellus. Donec nec dictum eros. Morbi convallis libero ultrices magna varius suscipit. Duis bibendum volutpat felis non fermentum. Phasellus nunc mi, ornare et vulputate sed, pellentesque sed enim. Mauris suscipit, nisl quis tempor mollis, tortor nunc varius odio, eu dictum odio mi quis sapien. 
Morbi placerat, erat quis mattis iaculis, urna nisi faucibus nisi, eu mattis elit mauris eu quam. Mauris euismod tincidunt ante quis interdum. Phasellus elementum libero in arcu tempus tincidunt. Praesent in nunc eget nibh porta imperdiet eget eget mauris. Morbi pellentesque dapibus lacus, rutrum sollicitudin nisi fermentum vel. Cras tempor mattis urna, sit amet semper eros varius ut. Fusce erat elit, tempus non commodo et, egestas sit amet odio. Suspendisse libero neque, porttitor vel volutpat eget, placerat in mi. Proin pharetra leo in ligula porttitor vestibulum. Curabitur vel mauris nec lorem sollicitudin porttitor. Sed suscipit, mauris ac sollicitudin tempus, orci velit aliquet leo, vitae ornare mi nulla a tellus. Morbi turpis justo, vestibulum ac auctor sed, vulputate nec nisl. Quisque ut ultricies orci. Sed vel dolor at felis egestas venenatis in ut elit. Nam quis neque sem. Morbi turpis magna, porttitor vulputate dignissim commodo, auctor eu nibh. Ut at nisl tortor. Quisque cursus interdum mi ut molestie. Vivamus nec ipsum ipsum. Class aptent taciti sociosqu ad litora torquent per conubia nostra, per inceptos himenaeos. Sed quis ipsum erat, quis dignissim nunc. Sed eu diam dapibus tortor fermentum dignissim. Phasellus ac turpis nisl, dictum consequat elit. Suspendisse at turpis quis eros pharetra imperdiet. Mauris ut nisl augue. ";
$explodedLongText = explode(" ", $longText);
 
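// 10 000 iterations for the third test; use 100000 for the fourth test
for($i = 0; $i < 10000; $i++)
{
  // array_rand() returns a random key, so use it to look up the word itself
  $word = $explodedLongText[array_rand($explodedLongText)];
   
  // Search the long text rather than the shorter text of the first two tests
  strpos($longText, $word);
  stripos($longText, $word);
  strstr($longText, $word);
  stristr($longText, $word);
}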

?>

That third test starts with 10 000 iterations. Here are the results of the third run:

The fourth test uses the same 497 words, but the test is performed 100 000 times:

As we can see, even if we add more words, the same performance pattern holds.
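For readers without a profiler, here is a minimal sketch of how the same comparison could be timed with microtime(true). The helper name timeSearch() and the short sample text are illustrative assumptions, not part of the original tests, and absolute timings will vary with the machine and the PHP version.

```php
<?php
// Hypothetical helper: time $iterations calls of one search function
// against $haystack, picking a random word from $words on each pass.
function timeSearch($function, $haystack, array $words, $iterations)
{
    $start = microtime(true);

    for ($i = 0; $i < $iterations; $i++) {
        // array_rand() returns a key, so look the word up in the array.
        $word = $words[array_rand($words)];
        $function($haystack, $word);
    }

    // Elapsed wall-clock time in seconds.
    return microtime(true) - $start;
}

// Illustrative sample text; replace it with the text you want to test.
$sample = "Lorem ipsum dolor sit amet, consectetur adipiscing elit.";
$words  = explode(" ", $sample);

foreach (array('strpos', 'stripos', 'strstr', 'stristr') as $function) {
    $elapsed = timeSearch($function, $sample, $words, 10000);
    printf("%-8s %.4f s\n", $function, $elapsed);
}
```

Because the random word selection is part of the timed loop, the numbers are only meaningful relative to each other, not as absolute costs of each function.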

Conclusion

After many runs (I only showed a few here), I think I can affirm that strpos() and strstr() are much faster than their case-insensitive counterparts. strpos() seems a little faster than strstr(), but this seems to depend on the context and on which random words are being searched for. In any case, according to PHP’s documentation, we should always use strpos() instead of strstr() because it supposedly uses less memory.

There may also be some unknown memory considerations that affect the code I used to test these functions. In any case, I can affirm that in a real context, where queries are sent to the Ontology: Read web service endpoint that hosts the UMBEL ontology, strpos() is much faster than stripos().
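To make the difference between these functions concrete, here is a small sketch of what each one returns. The sample string is an assumption chosen for illustration; it is not taken from an actual structXML file.

```php
<?php
// Illustrative haystack (an assumption, not real structXML content).
$haystack = 'xmlns:owl="http://www.w3.org/2002/07/owl#"';

// strpos() returns the integer offset of the first match, or false.
$offset = strpos($haystack, 'owl');

// strstr() returns the rest of the string starting at the match, or false,
// so it has to build a new string on every successful search.
$tail = strstr($haystack, 'owl');

// The case-sensitive functions miss a differently cased needle...
$missed = strpos($haystack, 'OWL');

// ...while the case-insensitive variants still find it.
$found = stripos($haystack, 'OWL');

// A match at offset 0 is falsy, so always compare against false with !==.
$atStart = strpos($haystack, 'xmlns');

var_dump($offset, $tail, $missed, $found, $atStart !== false);
```

Switching from stripos() to strpos() is therefore only safe when the needle and the haystack are known to use the same case, which is the situation described above for XML prefixes.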

by Frederick Giasson at September 19, 2011 05:57 PM

September 18, 2011

Frederick Giasson's Weblog

What is an Ontology?

An ontology is the definition of a vocabulary, and the rules for combining its terms, used to describe things that need to be communicated.

This is yet another tentative definition of what an ontology is, as applied to the semantic web. Before explaining that definition, I would like to continue by stating what I think is the main purpose of an ontology:

The main purpose of an ontology is to communicate coherent and consistent information.

Different Kinds of Ontologies

Over the years, I tended to use the word “vocabulary,” along with the word “ontology,” in different blog posts and technical documents. However, the usage of each word may not always have been clear. Is a vocabulary an ontology? Is an ontology a vocabulary? Are these concepts synonymous? There is an important distinction to make: an ontology can be a vocabulary, but an ontology is much more than a simple vocabulary.

Ontologies can describe all kinds of well-known knowledge representation structures, some simple, and others much more complex. Here is a small list of some of them:

  • lexicons
  • taxonomies, or
  • higher order knowledge description frameworks

In its most basic usage, an ontology will define a vocabulary. It will simply define the terms (words) that belong to that vocabulary without saying anything regarding the usage of these words.

Then, an ontology could evolve into a taxonomy by defining hierarchical relationships between the terms that compose the vocabulary.

Finally, it can evolve further to become a higher order knowledge description framework that defines more complex usage rules such as usage restrictions, all kinds of relationships between described entities, etc. New knowledge can also be inferred. This is why I say that an ontology is not strictly a simple vocabulary, but a powerful knowledge description framework.

Knowledge Base

As we saw above, the main purpose of an ontology is to make it possible to create a coherent and consistent knowledge base of information that can be communicated. So an ontology is a kind of language that lets you create knowledge bases that are consistent and coherent, and where new knowledge can be inferred. That is done by following the usage rules defined in the ontology.

However, there is another important aspect to take into account: an ontology will describe knowledge that is coherent and consistent, but according to that ontology's own world view. This means that two ontologies describing the same domain of knowledge could each describe information consistently and coherently, but according to their own view of the world.

Let's take an example. Let's say that two book stores developed their own ontologies to describe the books they sell. Both companies sell books, so there is a good chance that they will use the same vocabulary to describe them. However, the usage rules between these terms may differ between the two book stores. One of the book stores could say that a proceeding is a specialized kind of book. The other book store could say that a proceeding is not a specialized kind of book, but a document, just like a book. So both would describe a proceeding as a document, but they would have different interpretation rules about what a book really is. As you can see, both book stores use the same vocabulary to define their library of books, but they interpret its meaning differently. If the two stores have to exchange information about books in the future, they won't have many difficulties because they are probably sharing the same vocabulary, but their interpretation of that information may differ. These potential differences in interpretation may affect where a book is classified in the store, how customers can search for a specific book using different filtering criteria, etc.
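The book store example can be sketched in a few lines of PHP. This is a deliberately simplified illustration, assuming each "ontology" is reduced to a map from a term to its direct parent; the names $storeA, $storeB and isKindOf() are hypothetical, and a real ontology language such as OWL expresses far more than this.

```php
<?php
// Hypothetical data: each store maps a term to its direct parent term.
// Both stores share the same vocabulary (document, book, proceeding).
$storeA = array(
    'document'   => null,
    'book'       => 'document',
    'proceeding' => 'book',      // store A: a proceeding is a kind of book
);

$storeB = array(
    'document'   => null,
    'book'       => 'document',
    'proceeding' => 'document',  // store B: a proceeding is a document, not a book
);

// Hypothetical helper: walk up the parent links to decide whether
// $term is a (direct or inferred) kind of $ancestor.
function isKindOf(array $taxonomy, $term, $ancestor)
{
    // isset() is false for a null parent, which ends the walk at the root.
    while (isset($taxonomy[$term])) {
        $term = $taxonomy[$term];
        if ($term === $ancestor) {
            return true;
        }
    }

    return false;
}

var_dump(isKindOf($storeA, 'proceeding', 'book'));     // bool(true)
var_dump(isKindOf($storeB, 'proceeding', 'book'));     // bool(false)
var_dump(isKindOf($storeA, 'proceeding', 'document')); // bool(true), inferred through book
var_dump(isKindOf($storeB, 'proceeding', 'document')); // bool(true)
```

Both stores agree that a proceeding is a document, yet a search for "all books" would return proceedings in one store and not in the other, which is the difference in interpretation described above.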

This is no different from what happens in our daily lives: is there a day in your life when you don't hear people arguing about different points of view? Exactly the same thing happens here. We potentially all live and see the exact same events, images, sounds, etc., but we may all have a different interpretation of these things.

Ontologies in the Open Semantic Framework?

Ontologies are so flexible that we chose to make ontologies the "brain" of the Open Semantic Framework.

We wanted to use the most flexible knowledge description framework, one that would enable us to integrate any possible information source that has been described using any existing kind of knowledge representation structure, from the simple to the really complex: lexicons, taxonomies, relational schemas, etc. By using ontologies as its central piece, OSF is a flexible data integration framework that can consolidate information from various heterogeneous sources.

If we remember the definition we started with, ontologies are not just about describing terms and their relationships in a coherent and consistent way. The ultimate purpose is to communicate that information. This is what the structWSF part of the Open Semantic Framework does: it lets any kind of system that has access to the Internet send, receive and manipulate information in multiple formats through a series of web service endpoints.

More Reading

Finally, I suggest you read Mike's Intrepid Guide to Ontologies to get a better understanding of where ontologies come from, how they work, what other formats exist, what the different approaches to ontologies are, and what tools currently exist to work with ontologies.

by Frederick Giasson at September 18, 2011 03:55 PM

September 12, 2011

AI3:::Adaptive Information (Mike Bergman)

Making the Argument for Semantic Technologies

Five Unique Advantages for the Enterprise

There have been some notable attempts of late to make elevator pitches [1] for semantic technologies, as well as Lee Feigenbaum’s recent series on Are We Asking the Wrong Question? about semantic technologies [2]. Some have attempted to downplay semantic Web connotations entirely and to replace the pitch with Linked Data (capitalized). These are part of a history of various ways to try to make a business case around semantic approaches [3].

What all of these attempts have in common is a view — an angst, if you will — that somehow semantic approaches have not fulfilled their promise. Marketing has failed semantic approaches. Killer apps have not appeared. The public has not embraced the semantic Web consonant with its destiny. Academics and researchers can not make the semantic argument like entrepreneurs can.

Such hand wringing, I believe, is misplaced on two grounds. First, if one looks to end user apps that solely distinguish themselves by the sizzle they offer, semantic technologies are clearly not essential. There are very effective mash-up and data-intensive sites such as many of the investment sites (Fidelity, TDAmeritrade, Morningstar, among many), real estate sites (Trulia, Zillow, among many), community data sites (American FactFinder, CensusScope, City-Data.com, among many), shopping sites (Amazon, Kayak, among many), data visualization sites (Tableau, Factual, among many), etc., that work well, are intuitive and integrate much disparate information. For the most part, these sites rely on conventional relational database backends and have little semantic grounding. Effective data-intensive sites do not require semantics per se [4].

Second, despite common perceptions, semantics are in fact becoming pervasive components of many common and conventional Web sites. We see natural language processing (NLP) and extraction technologies becoming common for most search services. Google and Bing sprinkle semantic results and characterizations across their standard search results. Recommendation engines and targeted ad technologies now routinely use semantic approaches. Ontologies are creeping into the commercial spaces once occupied by taxonomies and controlled vocabularies. Semantics-based suggestion systems are now the common technology used. A surprising number of smartphone apps have semantics at their core.

So, I agree with Lee Feigenbaum that we are asking the wrong question. But I would also add that we are not even looking in the right places when we try to understand the role and place of semantic technologies.

The unwise attempt to supplant the idea of semantic technologies with linked data is only furthering this confusion. Linked data is merely a means for publishing and exposing structured data. While linked data can lead to easier automatic consumption of data, it is not necessary to effective semantic approaches and is actually a burden on data publishers [5]. While that burden may be willingly taken by publishers because of its consumption advantages, linked data is by no means an essential precursor to semantic approaches. None of the unique advantages for semantic technologies noted below rely on or need to be preceded by linked data. In semantic speak, linked data is not the same as semantic technologies.

The essential thing to know about semantic technologies is that they are a conceptual and logical foundation to how information is modeled and interrelated. In these senses, semantic technologies are infrastructural and groundings, not applications per se. There is a mindset and worldview associated with the use of semantic technologies that is far more essential to understand than linked data techniques and is certainly more fundamental than elevator pitches or “killer apps.”

Five Unique Advantages

Thus, the argument for semantic technologies needs to be grounded in their foundations. It is within the five unique advantages of semantic technologies described below that the benefits to enterprises ultimately reside.

#1: Modern, Back-end Data Federation

The RDF data model — and its ability to represent the simplest of data up through complicated domain schema and vocabularies via the OWL ontology language — means that any existing schema or structure can be represented. Because of this expressiveness and flexibility, any extant data source or schema can be represented via RDF and its extensions. This breadth means that a common representation for any existing schema may be expressed. That expressiveness, in turn, means that any and all data representations can be described in a canonical way.

A shared, canonical representation of all existing schema and data types means that all of that information can now be federated and interrelated. The canonical means of federating information via the RDF data model is the foundational benefit of semantic technologies. Further, the practice of giving URIs as unique identifiers to all of the constituent items in this approach makes it perfectly suitable to today’s reality of distributed data accessible via the Web [6].
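As a rough illustration of this federation argument, the sketch below converts two differently shaped record sets into (subject, predicate, object) triples and merges them. Everything in it is assumed for the example: the `http://example.org/` namespace, the two tiny "sources" and the helper names. It uses plain Python tuples rather than a real RDF library.

```python
# Hypothetical sketch: two sources with different shapes are expressed in
# one canonical form (triples) and joined through a shared URI.
EX = "http://example.org/"  # made-up namespace for illustration

crm_rows = [{"id": 7, "name": "Acme Corp"}]            # source 1
billing_rows = [{"customer": 7, "balance": 120.5}]     # source 2

def customer_uri(customer_id):
    """Mint the same URI for the same customer, whichever source mentions it."""
    return f"{EX}customer/{customer_id}"

def crm_to_triples(rows):
    for row in rows:
        yield (customer_uri(row["id"]), EX + "name", row["name"])

def billing_to_triples(rows):
    for row in rows:
        yield (customer_uri(row["customer"]), EX + "balance", row["balance"])

# Federation is just a set union once everything is in the canonical form.
graph = set(crm_to_triples(crm_rows)) | set(billing_to_triples(billing_rows))

def describe(subject, triples):
    """Collect every property known about one subject, from any source."""
    return {p: o for s, p, o in triples if s == subject}

print(describe(customer_uri(7), graph))
```

The point of the design is that neither source had to know about the other's schema. The shared URI is the only thing that relates them.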

#2: Universal Solvent for Structure

I have stated many times that I have not met a form of structured data I did not like [7]. Any extant data structure or format can be represented as RDF. RDF can readily express information contained within structured (conventional databases), semi-structured (Web page or XML data streams), or unstructured (documents and images) information sources. Indeed, the use of ontologies and entity instance records in RDF is a powerful basis for driving the extraction systems now common for tagging unstructured sources.

(One of the disservices perpetuated by an insistence on linked data is to undercut this representational flexibility of RDF. Since most linked data is merely communicating value-attribute pairs for instance data, virtually any common data format can be used as the transmittal form.)

The ease of representing any existing data format or structure and the ability to extract meaningful structure from unstructured sources makes RDF a “universal solvent” for any and all information. Thus, with only minor conversion or extraction penalties, all information in its extant form can be staged and related together via RDF.

#3: Adaptive, Resilient Schema

A singular difference between semantic technologies (as we practice them) and conventional relational data systems is the use of an open world approach [8]. The relational model is a paradigm where the information must be complete and it must be described by a schema defined in advance. The relational model assumes that the only objects and relationships that exist in the domain are those that are explicitly represented in the database. This makes the closed world of relational systems a very poor choice when attempting to combine information from multiple sources, to deal with uncertainty or incompleteness in the world, or to try to integrate internal, proprietary information with external data.

Semantic technologies, on the other hand, allow domains to be captured and modeled in an incremental manner. As new knowledge is gained or new integrations occur, the underlying schema can be added to and modified without affecting the information that already exists in the system. This adaptability is generally the biggest source of economic benefits to the enterprise from semantic technologies. It is also a benefit that enables experimentation and lowers risk.
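The contrast between the two assumptions, and the incremental growth of the schema, can be sketched in a few lines of Python. This is a deliberately simplified picture, not how any particular triple store or reasoner works, and the `ex:` names are invented.

```python
# Hypothetical sketch of closed-world versus open-world answers.
facts = {("ex:Alice", "ex:worksFor", "ex:Acme")}

def closed_world(fact, known):
    """Closed world: anything not recorded is taken to be false."""
    return fact in known

def open_world(fact, known):
    """Open world: anything not recorded is unknown (None), not false."""
    return True if fact in known else None

question = ("ex:Alice", "ex:worksFor", "ex:Initech")
print(closed_world(question, facts))  # False
print(open_world(question, facts))    # None

# Incremental modeling: a new property and a schema statement about it
# are added later, and nothing already stored has to be altered.
snapshot = frozenset(facts)
facts.add(("ex:Alice", "ex:speaks", "ex:French"))
facts.add(("ex:speaks", "rdfs:domain", "ex:Person"))
unchanged = snapshot <= facts
print(unchanged)  # True
```

In a relational setting, the same change would typically mean altering a table definition before the new data could be loaded.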

#4: Unmatched Productivity

Having all information in a canonical form means that generic tools and applications can be designed to work against that form. That, in turn, leads to user productivity and developer productivity. New datasets, structure and relationships can be added at any time to the system, but how the tools that manipulate that information behave remains unchanged.

User productivity arises from only needing to learn and master a limited number of toolsets. The relationships in the constituent datasets are modeled at the schema (that is, ontology) level. Since manipulation of the information at the user interface level consists of generic paradigms regarding the selection, view or modification of the simple constructs of datasets, types and instances, adding or changing out new data does not change the interface behavior whatsoever. The same bases for manipulating information can be applied no matter the datasets, the types of things within them, or the relationships between things. The behavior of semantic technology applications is very much akin to having generic mashups.

Developer productivity results from leveraging generic interfaces and APIs and not bespoke ones that change every time a new dataset is added to the system. In this regard, ontology-driven applications [9] arising from a properly designed semantic technology framework also work on the simple constructs of datasets, types and instances. The resulting generalization enables the developer to focus on creating logical “packages” of functionality (mapping, viewing, editing, filtering, etc.) designed to operate at the construct level, and not the level of the atomic data.
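A small sketch may help show what "operating at the construct level" could look like. The two functions below know only about datasets, types and instances. The dataset layout and all names are assumptions made for this example, not the API of any actual framework.

```python
# Hypothetical sketch: generic tools written against the constructs
# (dataset, type, instance), never against a specific dataset.

def instances_of(type_name, dataset):
    """List the ids of all instances of a given type in any dataset."""
    return sorted(r["id"] for r in dataset["instances"] if r["type"] == type_name)

def render(record):
    """Display any instance, whatever attributes it happens to carry."""
    lines = [f'{record["id"]} ({record["type"]})']
    for key in sorted(record["attributes"]):
        lines.append(f'  {key}: {record["attributes"][key]}')
    return "\n".join(lines)

books = {"name": "books", "instances": [
    {"id": "b1", "type": "Book", "attributes": {"title": "Linked Data"}},
]}
# A dataset added later, with different types and attributes.
sensors = {"name": "sensors", "instances": [
    {"id": "s1", "type": "Sensor", "attributes": {"unit": "C", "reading": 21.5}},
]}

for dataset in (books, sensors):
    for record in dataset["instances"]:
        print(render(record))
```

Adding the `sensors` dataset required no change to `instances_of` or `render`, which is the productivity claim in miniature.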

#5: Natural, Connected Knowledge Systems

All of these factors combine to enable more and disparate information to be assembled and related to one another. That, in turn, supports the idea of capturing entire knowledge domains, which can then be expanded and shifted in direction and emphasis at will. These combinations begin to finally achieve knowledge capture and representation in its desired form.

Any kind of information, any relationship between information, and any perspective on that information can be captured and modeled. When done, the information remains amenable to inspection and manipulation through a set of generic tools. Rather simple and direct converters can move that canonical information to other external forms for use by existing external tools. Similarly, external information in its various forms can be readily converted to the internal canonical form.

These capabilities are the direct opposite of today’s information silos. From their very foundations, semantic technologies are perfectly suited to capture the natural connections and nature of relevant knowledge systems.

A Summary of Advantages Greater than the Parts

There are no other IT approaches available to the enterprise that can come close to matching these unique advantages. The ideal of total information integration, both public and private, with the potential for incremental changes to how that information is captured, manipulated and combined, is exciting. And, it is achievable today.

With semantic technologies, more can be done with less and done faster. It can be done with less risk. And, it can be implemented on a pay-as-you-benefit basis [10] responsive to the current economic climate.

But awareness of this reality is not yet widespread. This lack of awareness is the result of a couple of factors. One factor is that semantic technologies are relatively new and embody a different mindset. Enterprises are only beginning to get acquainted with these potentials. Semantic technologies require both new concepts to be learned, and old prejudices and practices to be questioned.

A second factor is the semantic community itself. The early idea of autonomic agents and the heavy AI emphasis of the initial semantic Web advocacy now feels dated and premature at best. Then, the community hardly improved matters with its shift in emphasis to linked data, which is merely a technique and which completely overlooks the advantages noted above.

However, none of this likely matters. The five unique advantages for enterprises from semantic technologies are real and demonstrable today. While my crystal ball is cloudy as to how fast these realities will become understood and widely embraced, I have no question they will be. The foundational benefits of semantic technologies are compelling.

I think I’ll take this to the bank while others ride the elevator.


[1] This series was called for by Eric Franzon of SemanticWeb.com. Contributions to date have been provided by Sandro Hawke, David Wood, and Mark Montgomery.
[2] See Lee Feigenbaum, 2011. “Why Semantic Web Technologies: Are We Asking the Wrong Question?,” TechnicaLee Speaking blog, August 22, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/why_semantic_web_technologies.html, and its follow up on “The Magic Crank,” August 29, 2011; see http://www.thefigtrees.net/lee/blog/2011/08/the_magic_crank.html. For a further perspective on this issue from Lee’s firm, Cambridge Semantics, see Sean Martin, 2010. “Taking the Tech Out of SemTech,” presentation at the 2010 Semantic Technology Conference, June 23, 2010. See http://www.slideshare.net/LeeFeigenbaum/taking-the-tech-out-of-semtech.
[3] See, for example, Jeff Pollock, 2008. “A Semantic Web Business Case,” Oracle Corporation; see http://www.w3.org/2001/sw/sweo/public/BusinessCase/BusinessCase.pdf.
[4] Indeed, many semantics-based sites are disappointingly ugly with data and triples and URIs shoved in the user’s face rather than sizzle.
[5] Linked data and its linking predicates are also all too often misused or misapplied, leading to poor quality of integrations. See, for example, M.K. Bergman and F. Giasson, 2009. “When Linked Data Rules Fail,” AI3:::Adaptive Innovation blog, November 16, 2009. See http://www.mkbergman.com/846/when-linked-data-rules-fail/.
[6] Greater elaboration on all of these advantages is provided in M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Innovation blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/.
[7] See M.K. Bergman, 2009. “‘Structs’: Naïve Data Formats and the ABox,” AI3:::Adaptive Innovation blog, January 22, 2009. See http://www.mkbergman.com/471/structs-naive-data-formats-and-the-abox/.
[8] A considerable expansion on this theme is provided in M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room,” AI3:::Adaptive Innovation blog, December 21, 2009. See http://www.mkbergman.com/852/the-open-world-assumption-elephant-in-the-room/.
[9] For a full expansion on this topic, see M.K. Bergman, 2011. “Ontology-driven Apps Using Generic Applications,” AI3:::Adaptive Innovation blog, March 7, 2011. See http://www.mkbergman.com/948/ontology-driven-apps-using-generic-applications/.
[10] See M.K. Bergman, 2010. “‘Pay as You Benefit’: A New Enterprise IT Strategy,” AI3:::Adaptive Innovation blog, July 12, 2010. See http://www.mkbergman.com/896/pay-as-you-benefit-a-new-enterprise-it-strategy/.

by Mike Bergman at September 12, 2011 09:11 AM

September 11, 2011

DBpedia Blog

DBpedia 3.7 released, including 15 localized Editions

Hi all,

we are happy to announce the release of DBpedia 3.7. The new release is based on Wikipedia dumps dating from late July 2011.

The new DBpedia data set describes more than 3.64 million things, of which 1.83 million are classified in a consistent ontology, including 416,000 persons, 526,000 places, 106,000 music albums, 60,000 films, 17,500 video games, 169,000 organizations, 183,000 species and 5,400 diseases.

The DBpedia data set features labels and abstracts for 3.64 million things in up to 97 different languages; 2,724,000 links to images and 6,300,000 links to external web pages; 6,200,000 external links into other RDF datasets, and 740,000 Wikipedia categories. The dataset consists of 1 billion pieces of information (RDF triples) out of which 385 million were extracted from the English edition of Wikipedia and roughly 665 million were extracted from other language editions and links to external datasets.

Localized Editions

Up till now, we extracted data from non-English Wikipedia pages only if an equivalent English page existed, as we wanted to have a single URI to identify a resource across all 97 languages. However, many pages in the non-English Wikipedia editions do not have an equivalent English page (especially small towns in different countries, e.g. the Austrian village Endach, or legal and administrative terms that are just relevant for a single country). Relying on English URIs only had the negative effect that DBpedia did not contain data for these entities, and many DBpedia users have complained about this shortcoming.

As part of the DBpedia 3.7 release, we now provide 15 localized DBpedia editions for download that contain data from all Wikipedia pages in a specific language. These localized editions cover the following languages: ca, de, el, es, fr, ga, hr, hu, it, nl, pl, pt, ru, sl, tr. The URIs identifying entities in these i18n data sets are constructed directly from the non-English title and a language-specific URI namespace (e.g. http://ru.dbpedia.org/resource/Berlin), so there are now 16 different URIs in DBpedia that refer to Berlin. We also extract the inter-language links from the different Wikipedia editions. Thus, whenever an inter-language link between a non-English Wikipedia page and its English equivalent exists, the resulting owl:sameAs link can be used to relate the localized DBpedia URI to the equivalent in the main (English) DBpedia edition. The localized DBpedia editions are provided for download on the DBpedia download page (http://wiki.dbpedia.org/Downloads37). Note that we do not provide public SPARQL endpoints for the localized editions, nor do the localized URIs dereference. This might change in the future, as more local DBpedia chapters are set up in different countries as part of the DBpedia internationalization effort (http://dbpedia.org/Internationalization).
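The URI scheme described above can be sketched as follows. This is not the extraction framework's code. The helper name is made up, replacing spaces with underscores follows the usual Wikipedia title convention, and the title is kept as "Berlin" in every edition purely for illustration (real localized titles will often differ).

```python
# Hypothetical sketch of the localized URI pattern and the owl:sameAs links.
LOCALIZED_LANGUAGES = ["ca", "de", "el", "es", "fr", "ga", "hr", "hu",
                       "it", "nl", "pl", "pt", "ru", "sl", "tr"]

def resource_uri(title, lang=None):
    """Build a resource URI; lang=None means the main (English) edition."""
    host = "dbpedia.org" if lang is None else f"{lang}.dbpedia.org"
    return f"http://{host}/resource/{title.replace(' ', '_')}"

english = resource_uri("Berlin")
localized = [resource_uri("Berlin", lang) for lang in LOCALIZED_LANGUAGES]

# 15 localized URIs plus the English one give the 16 URIs mentioned above.
all_uris = {english, *localized}
print(len(all_uris))

# One owl:sameAs triple per inter-language link back to the English page.
same_as = [(uri, "owl:sameAs", english) for uri in localized]
print(same_as[12])
```

A consumer that wants a single identifier per entity could follow these owl:sameAs links to the English URI where one exists, and keep the localized URI otherwise.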

Other Changes

Besides the new localized editions, the DBpedia 3.7 release provides the following improvements and changes compared to the last release:

1. Framework

  • Redirects are resolved in a post-processing step for increased inter-connectivity of 13% (applied for English data sets)
  • Extractor configuration using the dependency injection principle
  • Simple threaded loading of mappings in server
  • Improved international language parsing support thanks to the members of the Internationalization Committee: http://dbpedia.org/Internationalization

2. Bugfixes

  • Encode homepage URLs to conform with N-Triples spec
  • Correct reference parsing
  • Recognize MediaWiki parser functions
  • Raw infobox extraction produces more object properties again
  • skos:related for category links starting with “:” and having an anchor text
  • Restrict objects to Main namespace in MappingExtractor
  • Double rounding (e.g. a person’s height should not be 1800.00000001 cm)
  • Start position in abstract extractor
  • Server can handle template names containing a slash
  • Encoding issues in YAGO dumps
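The double-rounding item in the list above concerns values such as a height of 1800.00000001 cm. The release notes do not say how the fix was implemented. The sketch below only shows, under that caveat, one common way such values arise (unit conversion in binary floating point) and two generic ways to avoid them. The function names and the choice of six decimal places are assumptions.

```python
# Hypothetical sketch: avoiding floating-point noise in unit conversion.
from decimal import Decimal

def metres_to_cm_float(metres):
    """Naive conversion; binary floating point may add a tiny error."""
    return metres * 100

def metres_to_cm_rounded(metres, places=6):
    """Round the result to a fixed number of decimal places."""
    return round(metres * 100, places)

def metres_to_cm_decimal(text):
    """Parse the original text as a decimal and convert exactly."""
    return Decimal(text) * 100

print(metres_to_cm_float(1.1))      # may show a trailing floating-point error
print(metres_to_cm_rounded(1.1))
print(metres_to_cm_decimal("1.1"))
```

Working from the original text with `Decimal` avoids the problem at the source, while rounding merely hides it. Which is appropriate depends on how the values are parsed upstream.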

3. Ontology

  • 320 ontology classes
  • 750 object properties
  • 893 datatype properties
  • owl:equivalentClass and owl:equivalentProperty mappings to http://schema.org

Note that the ontology is now a directed acyclic graph. Classes can have multiple superclasses, which was important for the mappings to schema.org. A taxonomy can still be constructed by ignoring all superclasses but the one that is specified first in the list and is considered the most important.
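The rule for recovering a taxonomy from the directed acyclic graph can be sketched like this. The class names and their superclass lists are invented for the example and are not claimed to match the actual DBpedia ontology.

```python
# Hypothetical sketch: keep only the first-listed superclass of each class
# to turn a multi-parent class graph into a single-parent taxonomy.
superclasses = {
    "Thing": [],
    "Agent": ["Thing"],
    "Place": ["Thing"],
    "Organisation": ["Agent"],
    "School": ["Organisation", "Place"],  # two parents, first one is primary
}

def to_taxonomy(graph):
    """Map each class to its first superclass, or None for a root."""
    return {cls: (parents[0] if parents else None) for cls, parents in graph.items()}

def path_to_root(cls, taxonomy):
    """Follow the single remaining parent link up to the root."""
    path = [cls]
    while taxonomy[path[-1]] is not None:
        path.append(taxonomy[path[-1]])
    return path

taxonomy = to_taxonomy(superclasses)
print(path_to_root("School", taxonomy))
```

Because the source graph is acyclic, following first parents always terminates at a root.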

4. Mappings

  • Dynamic statistics for infobox mappings showing the overall and individual coverage of the mappings in each language: http://mappings.dbpedia.org/index.php/Mapping_Statistics
  • Improved DBpedia Ontology as well as improved Infobox mappings using http://mappings.dbpedia.org/. These improvements are largely due to collective work by the community before and during the DBpedia Mapping Creation Sprint. For English, there are 17.5 million RDF statements based on mappings (13.8 million in version 3.6) (see also http://dbpedia.org/Downloads37#ontologyinfoboxproperties).
  • ConstantProperty mappings to capture information from the template title (e.g. Infobox_Australian_Road {{TemplateMapping | mapToClass = Road | mappings = {{ConstantMapping | ontologyProperty = country | value = Australia }}}})
  • Language specification for string properties in PropertyMappings (e.g. Infobox_japan_station: {{PropertyMapping | templateProperty = name | ontologyProperty = foaf:name | language = ja}} )
  • Multiplication factor in PropertyMappings (e.g. Infobox_GB_station: {{PropertyMapping | templateProperty = usage0910 | ontologyProperty = passengersPerYear | factor = 1000000}}, because it’s always specified in millions)
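The three mapping features in the list above (constant values, language tags and multiplication factors) can be pictured with a short sketch. This is not the DBpedia extraction framework, and the function names, the tuple layout and the sample infobox values are all assumptions.

```python
# Hypothetical sketch of what the three mapping options do to infobox data.

def constant_mapping(ontology_property, value):
    """Always emit the same value, whatever the infobox contains."""
    return (ontology_property, value)

def property_mapping(infobox, template_property, ontology_property,
                     factor=None, language=None):
    """Map one infobox field, optionally scaling it or tagging its language."""
    raw = infobox.get(template_property)
    if raw is None:
        return None
    if factor is not None:
        return (ontology_property, float(raw) * factor)
    if language is not None:
        return (ontology_property, (raw, language))
    return (ontology_property, raw)

station = {"name": "Example Station", "usage0910": "1.5"}  # made-up values

print(constant_mapping("country", "Australia"))
print(property_mapping(station, "usage0910", "passengersPerYear", factor=1000000))
print(property_mapping(station, "name", "foaf:name", language="ja"))
```

With the factor applied, an infobox value given in millions becomes an absolute passenger count.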

5. RDF Links to External Data Sources

  • New RDF links pointing at resources in the following Linked Data sources: Umbel, EUnis, LinkedMDB, Geospecies
  • Updated RDF links pointing at resources in the following Linked Data sources: Freebase, WordNet, Opencyc, New York Times, Drugbank, Diseasome, Flickrwrapper, Sider, Factbook, DBLP, Eurostat, Dailymed, Revyu

Accessing the new DBpedia Release

You can download the new DBpedia dataset from http://dbpedia.org/Downloads37.

As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint (http://dbpedia.org/sparql).
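For readers who want to try the endpoint, here is a minimal sketch that builds a request URL for it. The `query` parameter comes from the SPARQL Protocol. The sample query, the helper name and the choice of rdfs:label are assumptions for illustration, and no request is actually sent.

```python
# Hypothetical sketch: build a GET URL for the DBpedia SPARQL endpoint.
from urllib.parse import urlencode

ENDPOINT = "http://dbpedia.org/sparql"

def build_query_url(sparql, endpoint=ENDPOINT):
    """Encode a SPARQL query into the `query` parameter of a GET request."""
    return endpoint + "?" + urlencode({"query": sparql})

# Illustrative query: a few labels of the Berlin resource.
query = """
SELECT ?label WHERE {
  <http://dbpedia.org/resource/Berlin>
      <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 5
"""

url = build_query_url(query)
print(url)
```

The resulting URL can be opened in a browser or fetched with `urllib.request.urlopen`. The result format returned depends on the endpoint's defaults and on the Accept header sent.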

Credits

Lots of thanks to

  • All editors that contributed to the DBpedia ontology mappings via the Mappings Wiki.
  • Max Jakob (Freie Universität Berlin, Germany) for improving the DBpedia extraction framework and for extracting the new datasets.
  • Dimitris Kontokostas (Aristotle University of Thessaloniki, Greece) for providing language generalizations to the extraction framework.
  • Paul Kreis (Freie Universität Berlin, Germany) for administering the ontology and for delivering the mapping statistics and schema.org mappings.
  • Uli Zellbeck (Freie Universität Berlin, Germany) for providing the links to external datasets using the Silk framework.
  • The whole Internationalization Committee for expanding some DBpedia extractors to a number of languages:
    http://dbpedia.org/Internationalization.
  • Kingsley Idehen and Mitko Iliev (both OpenLink Software) for loading the dataset into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.

The work on the new release was financially supported by:

  • The European Commission through the project LOD2 - Creating Knowledge out of Linked Data (http://lod2.eu/, improvements to the extraction framework).
  • The European Commission through the project LATC - LOD Around the Clock (http://latc-project.eu/, creation of external RDF links).
  • Vulcan Inc. as part of its Project Halo (http://www.projecthalo.com/).

More information about DBpedia is found at http://dbpedia.org/About

Have fun with the new data set!

Cheers,

Chris Bizer

by ChrisBizer at September 11, 2011 09:14 AM

August 30, 2011

HyperDanja (Danny Ayers)

August 19, 2011

DBTune Blog

4Store stuff

Update: The repository below is not maintained anymore, as official packages have been pushed into Debian. They are not yet available for Ubuntu 11.04 though. In order to install 4store on Natty, you'd have to install the following packages from the Oneiric repository, in this order:

  • libyajl1
  • libgmp10
  • libraptor2
  • librasqal3
  • lib4store0
  • 4store

And you should have a running 4store (1.1.3).

Old post, for reference: I've been playing a lot with Garlik's 4store recently, and I have been building a few things around it. I just finished building packages for Ubuntu Jaunty, which you can get by adding the following lines in your /etc/apt/sources.list:

deb http://moustaki.org/apt jaunty main
deb-src http://moustaki.org/apt jaunty main

And then, an apt-get update && apt-get install 4store should do the trick. The packages are available for i386 and amd64. It is also one of my first packages, so feedback is welcomed (I may have gotten it completely wrong). After being installed, you can create a database and start a SPARQL server.

I've also been writing two client libraries for 4store, both available on Github:

  • 4store-php, a PHP library to interact with 4store over HTTP (so not exactly similar to Alexandre's PHP library, which interacts with 4store through the command-line tools);
  • 4store-ruby, a Ruby library to interact with 4store over HTTP or HTTPS.

by Yves at August 19, 2011 09:29 AM

August 15, 2011

Displacement Activities (Tom Heath)

Back Online after the Spam-fest

Just a quick post now this blog is back online after being badly compromised by spammers. I took everything down and let the links 404 for a while in the hope that it would encourage search engines to clear out their indexes, and the search engine referrals seem to be getting cleaner now, which is a relief. May this be the last of it.

by Tom Heath at August 15, 2011 10:01 AM

AI3:::Adaptive Information (Mike Bergman)

Of Flagpoles and Fishes

World's Tallest Flagpole; see ref [9]

The New Paradigm of ‘Substantive Marketing’ for Innovative IT

This decade has clearly marked a sea change in the move of enterprise software from proprietary to open source, as I have recently discussed [1]. It is instructive that only a mere six years ago I was in heated fights with my then Board about open source; today, that seems so quaint and dated.

Also during this period many have noted how open source has changed the capital required to begin a new software startup [2]. Open source provides both the tooling and the components for cobbling together specialty apps and extensions. Six and seven and even eight figure startup costs common just a decade ago have now dropped to four or five figures. When we see the explosion of hundreds of thousands of smartphone apps we are seeing the glowing residue of these additional sea changes. Dropping startup costs by one to three orders of magnitude is truly democratizing innovation.

But something else has been going on that is changing the face of enterprise software (besides consolidation, another factor I also recently commented on). And that factor is “marketing”. Much less commentary is made about this change, but it, too, is greatly lowering costs and fundamentally changing market penetration strategies. That topic — and my personal experience with it — is the focus of this article.

The Obsolete Recent Past

Besides the few remaining big providers of enterprise software (like IBM, Oracle, HP and SAP), most vendors have totally remade their sales practices of just a few years ago. Large sales forces with big commissions and one- to two-year sales cycles can no longer be justified when software license fees and the percentage maintenance annuities that flow from them are dropping rapidly. Today’s mantras are doing more with less and doing it faster, hardly consistent with the traditional enterprise software model. Sure, big enterprises, especially big government and big business, have large sunk costs in legacy systems that will continue to be milked by existing vendors. But the flow is constricting, with longer-term trends clear to see. The old enterprise software model is obsolete.

Even if it were not dying, it is hard to square huge investments in sales and marketing when product development has become inexpensive and agile. The proliferation of three-letter marketing acronyms for branding “new” product areas and standard formulas for product hype of just a few years ago also feels old and dated. Cozy relationships with conventional trade press pundits and market analysts seem to be diminishing in importance, possibly because the authoritativeness of their influence is also diminishing. It is harder to justify market firm subscription costs when priority budget items are being cut and new information outlets have emerged.

In response to this, many developers have forsaken the enterprise market for the consumer one. Indeed enterprises themselves are looking more and more to the consumer sector and commodity apps for innovation and answers. But, still, problems unique to enterprises remain and how to effectively reach them in this brave new world is today’s marketing problem for enterprise software vendors.

Most entities today, when opining about these challenges, tend to emphasize the need for “laser focus” and “rifle-shot” targeting of prospects. The advice takes the form of: 1) emphasize well-defined verticals; 2) know your market well; and 3) target and go after your likely prospects. Prospect data mining and targeted ad analysis are the proffered elixirs.

But, there is little evidence such refined methods for prospect identification and targeting are really working. Like politicians doing focus groups and opinion polling to capture the desired “message” of their potential electorates, these are all still “push” models of marketing. Yet we are swamped with pushed messages and marketing everywhere we turn. The model is failing.

Besides message overload, there are two issues with laser targeting. First, despite all that we try to know about ready buyers (for enterprise software), we really don’t know if any particular individual is truly needful, in a position to buy, has the authority to buy, or is the right advocate to make the internal sell. Second, though the idea of “laser” carries with it the image of focus and not flailing, it is in fact expensive to identify the targets and send a focused message their way. Because of these issues, decay rates for laser prospects throughout conventional sales pipelines continue to rise.

A New Marketing Paradigm

There has always been the phenomenon of the “fish jumping into the boat“; that is, the unanticipated inbound inquiry from a previously unknown prospect leading to a surprisingly swift sale. But we have seen this phenomenon increase markedly in recent years. Structured Dynamics‘ current customer base — including recurring customers — comes almost exclusively from this source. As we have noted this trend in comparison with more targeted outreach, we have spent much time trying to understand why it is occurring and how we can leverage what Peter Drucker called the “unexpected success” [3].

What we are seeing, I believe, is a shift from sales to marketing, and within marketing from direct or outbound marketing to a new paradigm of marketing. Others have likened this to inbound marketing [4] or content marketing [5] or permission marketing [6]. What we are seeing at Structured Dynamics bears many resemblances to parts of what is claimed for these other approaches, but not all. And, it is also true that what we are seeing may pertain mostly to innovative IT for emerging enterprise markets, and not a generalized paradigm suitable to other products or markets.

For lack of a better term, what we are seeing we can term “substantive marketing”. By this we mean offering valuable content and solutions-oriented systems for free and without restriction. This shares aspects with content marketing. Then, in keeping with the trend for buyers doing their own research and analysis to fulfill their own needs, similar to the premises of inbound or permission marketing, potential consumers can make their own judgments as to relevance and value of our offerings.

Sometimes, of course, some prospects find our approaches and solutions lacking. Sometimes, they may grab what we have offered for free and use them on their own without compensation to us. But where the match is right — and we need to be honest with both ourselves and the customer when it is not — we can better spend the customer’s limited time and resources to tailor our generic solutions to their specific needs. In doing so, we offer higher value (tailored services) while learning better about another spectrum of consumer need that can virtuously enhance our substantive offerings for the next prospect.

So, let’s decompose these components further to see what they can tell us about this new practice of substantive marketing and how to use it as an engine for moving forward.

Substantive Marketing

The Virtuous Cycle Begins with Substantive Solutions

The premise of substantive marketing is to offer square-deal value to the marketplace in the form of solutions-based content. Substantive marketing resembles content marketing, which offers “the creation or sharing of content for the purpose of engaging current and potential consumer bases” [5], but goes even further. The whole basis and premise of the approach is to provide substantive content in one or more of these areas, preferably all:

  • Knowledge — this substantive area includes papers, commentary, survey results or listings of tools and references useful to the target market
  • Analysis — this content area includes unique analysis of market trends, data, technologies or reviews that pertain to the target market
  • Code — this area relates to the provision of open source code and tools, preferably under licenses that allow users to use the software without restriction (two examples are the Apache 2 license and the MIT license)
  • Documentation — a critical substantive area is the documentation of how to install, use, modify or customize these tools, with a bias toward APIs and tutorial information
  • Methodologies, workflows and best practices — it is important to also discuss how to properly operate and utilize these tools and information. Taking care to document lessons learned and best practices also helps the user community avoid common mistakes and to speed adoption and utility, and
  • Demos — this area involves setting up (and sharing code and procedures for same) demos that show how the code and its methods actually work. Demos also become first use cases to aid the new user in learning and setting up the code bases.

Further, this substantive content is offered without strings, restrictions or customer fill-in forms. The content is not a come on or a teaser. We are not trying to gather leads or prospect names, because we have no intent to dun them with emails or follow-ups.

This substantive content is as complete as can be to enable new users to adopt the information and tools in their current state without further assistance. (In some cases, the information also educates the marketplace in order to prepare future customers for adoption.) Most importantly, this substantive content is offered for free, either under an open source license (for code) or under Creative Commons (for documentation and other content). In return, it is fair to request — and we do — attribution when this material is used.

We have previously termed this complete panoply of substantive content a total open solution [7]. Some might find the provision of such robust information crazy: How can we give away the store of our proprietary knowledge and systems? But we find this kind of thinking old school. In an open source world where so much information is now available online, with a bit of effort customers can find this information anyway. Rather, our mindset is that customers do not want to pay again for what has already been done, but are willing to pay for what can be done with that knowledge for their own specific problems. Offering the complete storehouse of our knowledge in fact signals our interest in only charging the customer for new answers, new value or new formulations. The customers we like to work with feel they are getting an honest, square deal.

Flagpole Venues Help Increase Awareness

Consider your substantive content to be your flag, a unique banner for conveying and packaging your specific brand. It is thus important to find appropriate flagpoles — in the virtual territories that your customers visit — for raising this content high for them to see. Since the role of these flagpoles is to create awareness in potential prospects — who you do not likely know individually or even by group in advance — it makes sense to raise your offerings up on many flagpoles and on the highest flagpoles. Visibility is the object of the approach.

This approach is distinctly not leafletting or cramming links or emails into as many spaces as possible. The idea of substantive marketing is to fly valuable content high enough that desirous potential customers can discover and then inspect the information on their own, and only if they so choose. In this regard, substantive marketing resembles permission marketing [6].

Being visible helps ensure that the needful, questing prospect that you would never have been able to target on your own is able to see and be aware of your offerings. And, since they are seeking information and answers, your collateral needs to be of a similar nature. Solutions and substance are what they are seeking; what you have run up the flagpole should respond to that.

The mindset here is to respect your prospective customers and to allow them to receive and inspect your offerings, but only if they so choose. If flown in the right venues with the right visibility, customers will see your flags and inspect them if they meet their requirements.

Some of the venues at which you can raise your flags include:

  • Blogs — this venue is especially helpful, since you have complete control over content, message, voice and packaging
  • Social networks — the value of social networks is now accepted, and should be a core component of any visibility strategy. However, it is also important to make sure that your contributions are driven by substance and value and do not become part of the cacophonous background noise
  • Vertical media — there are always existing outlets well-read and -respected by your customer prospects. Establishing relationships and value with these third-party outlets can extend your reach
  • Web sites — this venue includes your standard Web sites, of course. But, you should also consider setting up specific project-related sites or sites dedicated to documentation (cf. our TechWiki site of 300+ technical articles) or to methodologies (the excellent MIKE2.0 site is one great example) or to other ways by which particular content (such as tools with the Sweet Tools site) can raise another flag
  • User forums — user discussion groups and forums also become their own attractants for like-interested prospects, and
  • Conferences and tradeshows — while potentially valuable, presence at conferences and tradeshows must be carefully evaluated. Since participation and opportunity costs are high, the venues should be clearly relevant to your market space with likely decision makers in attendance.

The observant reader will have already concluded that each of these venues develops slowly, and therefore raising visibility is generally a slow-and-steady game that requires patience. Start-up vendors backed by venture firms or those looking for quick visibility and cashout will not find this approach suitable. On the other hand, customer prospects looking for answers and self-sustaining solutions are not much interested in flash-in-the-pan vendors, either.

A Model Responsive to the Changing Nature of Customer Prospects

The real drivers for this changing paradigm come from customer prospects. Sophisticated buyers of enterprise IT and instrumental change agents within organizations share most if not all of these characteristics:

  • They are inundated with marketing messages and jaded about hype and “pushed” messages
  • They are generally knowledgeable about their needs and problem spaces and about approximate technologies. They are eager and desirous of learning independently and know that their recommendations affect their personal reputations and standing within their enterprises
  • With the many volatile external and internal changes, including staff reductions and fluid assignments, leadership for new technology adoption can come from many different and unknown corners of the organization; it is extremely difficult to identify and target prospects
  • The economic and competitive environment places a premium on affordability and low-risk evaluations of new technologies
  • Lock-ins of any kind — be it to specific vendors or technologies — are understood as inherently risky. This understanding is raising the importance of open and standards-based approaches
  • Being the subject of a pushy sales effort is distasteful and a negative to an eventual sale. Education and learning, however, is respected
  • Because of all that is at stake, honesty with no bullshit is highly appreciated. If you as a vendor do not offer an appropriate solution or have fulfillment weaknesses, tell the prospect so. Further, tell them who can supply the solution. One never knows when and where the next problem may arise, and providing trustworthy advice can lead to later engagements.

More often than not we find our customers to have already installed and used our existing substantive materials for some time before they approach us about further work. They appreciate the tutorial information and have taught themselves much in advance. By the time we engage, both parties are able to cost-effectively focus on what is truly missing and needed and to deliver those answers in a quick way. Re-engagements tend to occur when a next set of gaps or challenges arise.

Though it may sound trite or even unbelievable to those who have not yet experienced such a relationship, the square deal value offered by substantive marketing can really lead to true partnerships and trust between vendor and customer. We experience it daily with our customers, and vice versa. We also think this is the adaptive approach that our new environment demands.

The Free Path to Open Source and Solutions

Once prospects learn of our substantive offerings, many may decide independently that what we have is not suitable. Others may simply download and use the information on their own, in which case we often never know of it, let alone receive revenue. We are completely fine with this, as shown for three different cases.

First, some of these prospects need no more than what we already have. This increases our user base, increases our visibility and often results in contributions to our forums and documentation.

Then, some of these prospects come to learn they need or want more than what our current offerings provide, leading to two possible forks. In one fork, the second case, they may have sufficient skills internally or with other suppliers to extend the system on their own. Some of this flows back to an improved code base or improved installation or documentation bases.

In the other fork, the third case, they may decide to engage us in tailoring a solution for them. That case is the only one of the three that leads to a direct revenue path.

In all three cases we win, and the customer wins. Maybe enterprise software vendors of decades past rue this reality of lower margins and shared benefits; we agree that the absolute profit potential of substantive marketing is much less. But we gladly accept the more enjoyable work and steady revenue relationships resulting from these changes. We are not engaged in some Pollyannaish altruism here, but in a steely-eyed honest brokering that best serves our own self-interest (and fairly that of the customer, as well).

A Square Deal Baseline for Tailored Services

Great IT product does not come from idle musings or dreamed up functionality. It comes solely and directly from solving customer problems. Only via customers can software be refined and made more broadly usable.

A slipstream of those who have previously become aware of and tested our offerings will choose to engage our services. This generally takes the form of an inbound call, where the prospect not only qualifies itself, but also establishes the terms and conditions for the sale. They have chosen to select us; they are fish that have jumped into the boat.

To again quote Peter Drucker, “. . . the aim of marketing is to make selling superfluous. The aim of marketing is to know and understand the customer so well that the product or service fits him and sells itself. Ideally, marketing should result in a customer who is ready to buy. All that should be needed then is to make the product or service available . . .” [8]. This is precisely what I meant earlier about the shift in emphasis from sales to marketing.

Even at this point there may be mismatches in needs and our skills and availabilities. If such is the case, we do not hesitate to say so, and attempt to point the prospect in another direction (from which we also gain invaluable market knowledge). If there is indeed a match, we then proceed to try to find common ground on schedule and budget.

Paradoxically, this square deal and honesty about the readiness and weaknesses of our offerings often leads to forgiveness from our customers. For example, for some time we have lacked automated installation scripts that would make it easier for prospects to install our open semantic framework. But, because of compensating value in other areas, such gaps can be overlooked and tackled later on (indeed, as a current customer is now funding). By not pretending to be everything to everyone, we can offer what we do have without embarrassment and get on with the job of solving problems.

For larger potential engagements, we typically suggest a fixed price initial effort to develop an implementation plan. The interviews and research to support this typical 4- to 6-week effort (generally in the $5 K to $10 K range, depending) then result in a detailed fulfillment proposal, with firm tasks, budget and schedule, specific to that customer’s requirements. Just as we respect our prospects’ time and budget, we expect the same and do not conduct these detailed plans without compensation. With respect to fulfillment contracts, we cap the contract amount and limit milestone payments to pre-set percentages or time expended, whichever is lower.

This approach ensures we understand the customer’s needs and have budgeted and tasked accordingly. Capped contracts also put the onus on us the contractor to understand our own effort and tasking structures and realities, which leads to better future estimating. For the customer, this approach caps risk and potential exposure, and ensures milestones are being met no matter the time expenditures by us, the contractor. This approach extends our square-deal basis to also embrace risks and payments.
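As a rough illustration of the “whichever is lower” rule for milestone payments, here is a minimal Python sketch. The contract cap, hourly rate, milestone names and shares are hypothetical figures chosen only to make the arithmetic visible; they are not actual contract terms.

```python
from dataclasses import dataclass


@dataclass
class Milestone:
    name: str
    preset_share: float    # pre-set fraction of the contract cap, e.g. 0.25
    hours_expended: float  # contractor time actually spent on the milestone


def milestone_payment(m, contract_cap, hourly_rate):
    """Pay the lower of the pre-set share of the cap or the time expended."""
    preset_amount = m.preset_share * contract_cap
    time_amount = m.hours_expended * hourly_rate
    return min(preset_amount, time_amount)


def total_payments(milestones, contract_cap, hourly_rate):
    """Sum the milestone payments, never exceeding the overall contract cap."""
    total = sum(milestone_payment(m, contract_cap, hourly_rate) for m in milestones)
    return min(total, contract_cap)


# Hypothetical engagement: $40,000 cap at $100 per hour.
design = Milestone("design", preset_share=0.25, hours_expended=120)
build = Milestone("build", preset_share=0.50, hours_expended=150)
```

In this made-up case the design milestone ran over its share (120 hours is more than the 25% allotment covers), so the customer pays only the pre-set share; the build milestone came in under, so the customer pays only for the time expended.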

New (and Open Source) Developments Fuel the Substance Pipeline

Thus, when customers engage us, they spend almost solely on new functionality specifically tailored to their needs. In doing so, we suggest they agree to release the new developments they fund as open source. We argue — and customers predominantly agree — that they are already benefitting from lower overall costs because other customers have funded sharable, open source before them. We point out that the new customers that follow them will also be independently creating new functionality, from which they will also later benefit.

(This argument does not apply to specific customer data or ontologies, which are naturally proprietary to the customer. Also, if the customer wants to retain intellectual ownership of extensions, we charge higher development fees.)

Once these new developments are completed, they are fed back into a new baseline of valuable content and code. From this new baseline the cycle of substantive marketing can be augmented anew and perpetuated.

Three Guidelines to Leverage Substantive Marketing

All of these points can really be boiled down to three guidelines for how to make substantive marketing effective:

  • First, whatever your domain or market, provide useful and substantive content. The content you offer is indeed your marketing collateral. Prospective customers can gauge from it directly whether it meets their needs, appears sound and workable, and has value. If you have little of substance to offer, this paradigm is not for you
  • Second, plant many flagpoles and raise your flags high in territories your market prospects are likely to visit. This is a process that requires thoughtfulness and patience. Thoughtfulness, because that is how you determine where to plant your flags. If you yourself are a consumer of what you offer, it is easier to find those venues. And patience, because it takes time to stack valuable content upon valuable content in order to raise visibility
  • And, third, be honest and respectful. Help your prospect work within available budget to achieve the most possible at lowest risk. And help them find others, if need be, who might be better able than you to truly solve their problems.

What we are finding — as we continue to refine our understanding of this new paradigm — is that through substantive marketing the fish are finding us and they sometimes jump into the boat. We like our enterprise customers to pre-qualify themselves and already be “sold” once they knock on the door. One never knows when that phone might ring or the email might come in. But when it does, it often results in a collaborative customer as a partner who is a joy to work with to solve exciting new problems.


[1] M.K. Bergman, 2011. “Declining IT Innovation in the Enterprise,” in AI3:::Adaptive Information blog, January 17, 2011. See http://www.mkbergman.com/940/declining-it-innovation-in-the-enterprise/.
[2] Paul Graham has been the most prominent observer of this scene; see P. Graham, 2008. “Why There Aren’t Any More Googles,” April 2008 (see http://www.paulgraham.com/googles.html) and subsequent articles.
[3] See esp. Peter F. Drucker, 1985. Innovation and Entrepreneurship: Practice and Principles, Harper & Row, New York, NY, 277 pp.
[4] Inbound marketing is a marketing strategy that focuses on getting found by customers. According to David Meerman Scott, inbound marketers “earn their way in” (via publishing helpful information on a blog etc.) in contrast to outbound marketing where they used to have to “buy, beg, or bug their way in” (via paid advertisements, issuing press releases in the hope they get picked up by the trade press, or paying commissioned sales people, respectively). Brian Halligan, cofounder and CEO of HubSpot, claims he first coined the term of inbound marketing.
[5] Content marketing is an umbrella term encompassing all marketing formats that involve the creation or sharing of content for the purpose of engaging current and potential consumer bases. In contrast to traditional marketing methods that aim to increase sales or awareness through interruption techniques, content marketing subscribes to the notion that delivering high-quality, relevant and valuable information to prospects and customers drives profitable consumer action. See also Holger Schulze, 2011. B2B Content Marketing Trends slideshow, see http://www.slideshare.net/hschulze/b2b-content-marketing-report.
[6] Seth Godin coined the term permission marketing wherein marketers obtain permission before advancing to the next step in the purchasing process. It is mostly used by online marketers, notably email marketers and search marketers, as well as certain direct marketers who send a catalog in response to a request. Godin contrasts this approach to traditional “interruption marketing” where messages are sent without prior permission.
[7] See the three-part series, M.K. Bergman, 2010. “Listening to the Enterprise: Total Open Solutions,” “Part 1,” “Part 2” and “Part 3,” AI3:::Adaptive Information blog, May 12 – 31, 2010.
[8] Peter F. Drucker, 1974. Management: Tasks, Responsibilities, Practices, Harper & Row, New York, NY, 864 pp. ISBN 0-06-011092-9.
[9] The intro photo is of the world’s tallest flagpole (at 165 m), in Dushanbe, Tajikistan. The photo is courtesy of CentralAsiaOnline.com.

by Mike Bergman at August 15, 2011 08:25 AM

August 08, 2011

AI3:::Adaptive Information (Mike Bergman)

A New Best Friend: Gephi for Large-scale Networks

Visualization + Analysis Pushes Aside Cytoscape

Though I never intended it, some posts of mine from a few years back dealing with 26 tools for large-scale graph visualization have been some of the most popular on this site. Indeed, my recommendation for Cytoscape for viewing large-scale graphs ranks within the top 5 posts all time on this site.

When that analysis was done in January 2008 my company was in the midst of needing to process the large UMBEL vocabulary, which now consists of 28,000 concepts. Like anything else, need drives research and demand, and after reviewing many graphing programs, we chose Cytoscape, then provided some ongoing guidelines in its use for semantic Web purposes. We have continued to use it productively in the intervening years.

As with any tool, one reviews and picks the best at the time of need. Most recently, however, with growing customer usage of large ontologies and the development of our own structOntology editing and managing framework, we have begun to butt up against the limitations of large-scale graph and network analysis. With this post, we announce our new favorite tool for semantic Web network and graph analysis — Gephi — and explain its use and showcase a current example.

The Cytoscape Baseline and Limitations

Three and one-half years ago when I first wrote about Cytoscape, it was at version 2.5. Today, it is at version 2.8, and many aspects have seen improvement (including its Web site). However, in other respects, development has slowed. For example, version 3.x was first discussed more than three years ago; it is still not available today.

Though the system is open source, Cytoscape has also largely been developed with external grant funds. As with other similarly funded projects, once grant funds slow, development slows as well. While there has clearly been an active community behind Cytoscape, it is beginning to feel tired and a bit long in the tooth. From a semantic Web standpoint, some of the limitations of the current Cytoscape include:

  • Difficult conversion of existing ontologies — Cytoscape requires creating a CSV input; there was an earlier RDFscape plug-in that held great promise to bridge the software into the RDF and semantic Web sphere, but it has not remained active
  • Network analysis — one of the early and valuable generalized network analysis plug-ins was NetworkAnalyzer; however, that component has not seen active development in three years, and dynamic new generalized modules suitable for social network analysis (SNA) and small-world networks have not been apparent
  • Slow performance and too-frequent crashes — Cytoscape has always had a quirky interface and frequent crashes; later versions are a bit more stable, but usability remains a challenge
  • Largely supported by the biomedical community — from the beginning, Cytoscape was a project of the biomedical community. Most plug-ins still pertain to that space. Because of support for OBO (Open Biomedical and Biological Ontologies) formats and a lack of uptake by the broader semantic Web community, RDF- and OWL-based development has been keenly lacking
  • Aside from PDFs, poor ability to output large graphs in a viewable manner
  • Limited layout support — and poor performance for many of those included with the standard package.
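To make the CSV conversion step in the first bullet concrete, here is a minimal Python sketch using only the standard library. It assumes the ontology has already been parsed into subject-predicate-object triples by some RDF library; the sample triples, the column names and the choice of which predicates become edges are all hypothetical, and the actual column mapping would be chosen at import time.

```python
import csv
import io

# Hypothetical sample triples; in practice these would come from an RDF parser.
TRIPLES = [
    ("ex:Mammal", "rdfs:subClassOf", "ex:Animal"),
    ("ex:Dog", "rdfs:subClassOf", "ex:Mammal"),
    ("ex:Dog", "rdfs:label", '"dog"'),
    ("ex:Cat", "rdfs:subClassOf", "ex:Mammal"),
]


def triples_to_edge_csv(triples, keep_predicates=("rdfs:subClassOf",)):
    """Return CSV text with one source,interaction,target row per kept triple."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["source", "interaction", "target"])
    for subject, predicate, obj in triples:
        # Keep only the predicates that should become edges; labels and
        # other annotations are dropped from the edge list.
        if predicate in keep_predicates:
            writer.writerow([subject, predicate, obj])
    return buffer.getvalue()


edge_csv = triples_to_edge_csv(TRIPLES)
```

The resulting text can be written to a file and loaded as a delimited edge table. The manual flattening is exactly the extra step that a direct RDF import would avoid.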

Undoubtedly, were we doing semantic technologies in the biomedical space, we might well develop our own plug-ins and contribute to the Cytoscape project to help overcome some of these limitations. But, because I am a tools geek (see my Sweet Tools listing with nearly 1000 semantic Web and -related tools), I decided to check out the current state of large-scale visualization tools and see if any had made progress on some of our outstanding objectives.

Choosing Gephi and Using It

There are three classes of graph tools in the semantic technology space:

  1. Ontology navigation and discovery, of which the Relation Browser and RelFinder are notable examples
  2. Ontology structure visualization (and sometimes editing), such as the GraphViz (OWLViz) or OntoGraf tools used in Protégé (or the nice FlexViz, again used by the OBO community), and
  3. Large-scale graph visualization in order to gain a complete picture and macro relationships in the ontology.

One could argue that the first two categories have received the most current development attention. But, I would also argue that the third class is one of the most critical: to understand where one is in a large knowledge space, much better larger-scale visualization and navigation tools are needed. Unfortunately, this third category is also the one that appears to be receiving the least development attention. (To be sure, large-scale graphs pose computational and performance challenges.)

In the nearly four years since my last major survey of 26 tools in this category, the new entrants appear quite limited. I’ve surely overlooked some, but the most notable are Gruff, NAViGaTOR, NetworkX and Gephi [1]. Gruff actually appears to belong most in Category #2; I could find no examples of graphs on the scale of thousands of nodes. NAViGaTOR is biomedical only. NetworkX has no direct semantic graph importing and — while apparently some RDF libraries can be used for manipulating imports — alternative workflows were too complex for me to tackle for initial evaluation. This leaves Gephi as the only potential new candidate.

From a clean Web site to well-designed intro tutorials, first impressions of Gephi are strongly positive. The real proof, of course, was getting it to perform against my real use case tests. For that, I used a “big” ontology for a current client that captures about 3000 different concepts and their relationships and more than 100 properties. What I recount here — from first installing the program and plug-ins and then setting up, analyzing, defining display parameters, and then publishing the results — took me less than a day from a totally cold start. The Gephi program and environment is surprisingly easy to learn, aided by some great tutorials and online info (see concluding section).

The critical enabler for being able to use Gephi for this source and for my purposes is the SemanticWebImport plug-in, recently developed by Fabien Gandon and his team at Inria as part of the Edelweiss project [2]. Once the plug-in is installed, you need only open up the SemanticWebImport tab, give it the URL of your source ontology, and pick the Start button (middle panel):

[Figure: SemWeb Plug-in for Gephi]

Note the SemanticWebImport tool also has the ability (middle panel) to issue queries to a SPARQL endpoint, the results of which return a results graph (partial) from the source ontology. (This feature is not further discussed herein.) This ontology load and display capability worked without error for the five or six OWL 2 ontologies I initially tested against the system.

Once loaded, an ontology (graph) can be manipulated with a conventional IDE-like interface of tabs and views. In the right-hand panels above we are selecting various network analysis routines to run, in this case Average Degrees. Once one or more of these analysis options is run, we can use the results to then cluster or visualize the graph; the upper left panel shows highlighting the Modularity Class, which is how I did the community (clustering) analysis of our big test ontology. (When run you can also assign different colors to the cluster families.) I also did some filtering of extraneous nodes and properties at this stage and also instructed the system via the ranking analysis to show nodes with more link connections as larger than those nodes with fewer links.
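For readers unfamiliar with these measures, the following Python sketch shows what an average degree and a modularity score actually compute. It is not Gephi's code: Gephi's modularity routine searches for a good partition, whereas this sketch only scores a partition that is already given, using the standard Newman modularity formula. The edge list and the partition are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected edge list: two tight groups joined by one bridge edge.
EDGES = [
    ("a", "b"), ("b", "c"), ("a", "c"),
    ("d", "e"), ("e", "f"), ("d", "f"),
    ("c", "d"),
]


def degrees(edges):
    """Count how many edges touch each node."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return dict(degree)


def average_degree(edges):
    """Average degree of an undirected graph: 2 * edges / nodes."""
    degree = degrees(edges)
    return 2 * len(edges) / len(degree)


def modularity(edges, community_of):
    """Newman modularity: sum over communities of L_c / m - (d_c / 2m)^2."""
    m = len(edges)
    degree = degrees(edges)
    internal = defaultdict(int)      # L_c: edges with both ends in community c
    total_degree = defaultdict(int)  # d_c: summed degree of nodes in community c
    for u, v in edges:
        if community_of[u] == community_of[v]:
            internal[community_of[u]] += 1
    for node, d in degree.items():
        total_degree[community_of[node]] += d
    return sum(
        internal[c] / m - (total_degree[c] / (2 * m)) ** 2
        for c in total_degree
    )


# Hypothetical partition putting each triangle in its own community.
PARTITION = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
```

A partition that matches the two triangles scores well above zero, while lumping every node into one community scores zero, which is why the measure is useful for judging cluster assignments. The per-node counts from `degrees` are also the kind of value a ranking can use to draw better-connected nodes larger.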

At this juncture, you can also set the scale for varying such display options as linear or some power function. You can also select different graph layout options (lower left panel). There are many layout plug-in options for Gephi. The layout plugin called OpenOrd, for instance, is reported to be able to scale to millions of nodes.
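The difference between a linear and a power-function scale can be sketched in a few lines of Python. This is only an illustration of the idea, not Gephi's implementation; the function name, the default size range and the exponent parameter are assumptions made for the example.

```python
def scaled_size(value, lo, hi, min_size=4.0, max_size=40.0, exponent=1.0):
    """Map a ranking value in [lo, hi] to a display size.

    exponent=1.0 gives linear interpolation; other exponents apply a power
    curve to the normalized value before interpolating.
    """
    if hi == lo:
        return min_size
    t = (value - lo) / (hi - lo)           # normalize to 0..1
    t = max(0.0, min(1.0, t)) ** exponent  # clamp, then bend the curve
    return min_size + t * (max_size - min_size)
```

With an exponent above 1, mid-ranked nodes stay small and only the most connected nodes grow large; with an exponent below 1, sizes rise quickly and then flatten out. Which is preferable depends on how skewed the degree distribution of the graph is.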

At this point I played extensively with the combination of filters, analysis, clusters, partitions and rankings (as may be separately applied to nodes and edges) to: 1) begin to understand the gross structure and characteristics of the big graph; and 2) refine the ultimate look I wanted my published graph to have.

In our example, I ultimately chose the standard Yifan Hu layout in order to get the communities (clusters) to aggregate close to one another on the graph. I then applied the Parallel Force Atlas layout to organize the nodes and make the spacings more uniform. The parallel aspect of this force-based layout allows these intense calculations to run faster. The result of these two layouts in sequence is then what was used for the results displays.
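To give a sense of why force-based layouts are computationally intense, here is a basic spring-embedder sketch in Python in the style of Fruchterman and Reingold. It is deliberately not the Yifan Hu or Force Atlas algorithm, both of which are more refined; the function name, parameters and cooling schedule are assumptions for illustration only.

```python
import math
import random


def force_layout(nodes, edges, iterations=200, size=100.0, seed=42):
    """Basic spring-embedder layout in the style of Fruchterman and Reingold."""
    rng = random.Random(seed)
    pos = {n: [rng.uniform(0, size), rng.uniform(0, size)] for n in nodes}
    k = size / math.sqrt(len(nodes))  # ideal distance between nodes
    temperature = size / 10.0         # caps how far a node moves per step

    for _ in range(iterations):
        disp = {n: [0.0, 0.0] for n in nodes}

        # Every pair of nodes repels with force k^2 / distance.
        for i, u in enumerate(nodes):
            for v in nodes[i + 1:]:
                dx = pos[u][0] - pos[v][0]
                dy = pos[u][1] - pos[v][1]
                dist = math.hypot(dx, dy) or 1e-9
                force = k * k / dist
                disp[u][0] += dx / dist * force
                disp[u][1] += dy / dist * force
                disp[v][0] -= dx / dist * force
                disp[v][1] -= dy / dist * force

        # Connected nodes attract with force distance^2 / k.
        for u, v in edges:
            dx = pos[u][0] - pos[v][0]
            dy = pos[u][1] - pos[v][1]
            dist = math.hypot(dx, dy) or 1e-9
            force = dist * dist / k
            disp[u][0] -= dx / dist * force
            disp[u][1] -= dy / dist * force
            disp[v][0] += dx / dist * force
            disp[v][1] += dy / dist * force

        # Move each node, limited by the current temperature, then cool down.
        for n in nodes:
            length = math.hypot(disp[n][0], disp[n][1]) or 1e-9
            step = min(length, temperature)
            pos[n][0] += disp[n][0] / length * step
            pos[n][1] += disp[n][1] / length * step
        temperature *= 0.95

    return {n: tuple(p) for n, p in pos.items()}


# Hypothetical graph: two triangles joined by a single bridge edge.
NODES = ["a", "b", "c", "d", "e", "f"]
EDGES = [("a", "b"), ("b", "c"), ("a", "c"),
         ("d", "e"), ("e", "f"), ("d", "f"), ("c", "d")]
layout = force_layout(NODES, EDGES)
```

The repulsion step compares every pair of nodes on every iteration, so the cost grows with the square of the node count. That pairwise work is what makes a parallel implementation attractive for graphs with thousands of nodes.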

Upon completion of this analysis, I was ready to publish the graph. One of the best aspects of Gephi is its flexibility and control over outputs. Via the main Preview tab, I was able to do my final configurations for the published graph:

[Figure: Publication Options for Gephi]

The graph results from the earlier-worked out filters and clusters and colors are shown in the right-hand Preview pane. On the left-hand side, many aspects of the final display are set, such as labels on or off, font sizes, colors, etc. It is worth looking at the figure above in full size to see some of the options available.

Standard output options include either SVG (vector image) or PDFs, as shown at the lower left, with output size scaling via slider bar. Also, it is possible to do standard saves under a variety of file formats or to do targeted exports.

One really excellent publication option is to create a dynamically zoomable display using the Seadragon technology via a separate Seadragon Web Export plug-in. (However, because of cross-site scripting limitations due to security concerns, I only use that option for specific sites. See next section for the Zoom It option — based on Seadragon — to work around that limitation.)

Outputs Speak for Themselves

I am very pleased with the advances in display and analysis provided by Gephi. Using the Zoom It alternative [3] to embedded Seadragon, we can see our big ontology example with:

  • All 3000 nodes labeled, with connections shown (though you must zoom to see) and
  • When zooming (use scroll wheel or + icon) or panning (via mouse down moves), wait a couple of seconds to get the clearest image refresh:

Note: at standard resolution, if this graph were to be rendered in actual size, it would be larger than 7 feet by 7 feet square at full zoom !!!

To compare output options, you may also;

Still, Some Improvements Would be Welcomed

It is notable that Gephi still only versions itself as an “alpha”. There is already a robust user community with promise for much more technology to come.

As an alpha, Gephi is remarkably stable and well-developed. Though clearly useful as is, I measure the state of Gephi against my complete list of desired functionality, with these items still missing:

  • Real-time and interactive navigation — the ability to move through the graph interactively and to issue queries and discover relationships
  • Huge node numbers — perhaps the OpenOrd plug-in somewhat addresses this need. We will be testing Gephi against UMBEL, which is an order of magnitude larger than our test big ontology
  • More node and edge control — Cytoscape still retains the advantage in the degree to which nodes and edges can be graphically styled
  • Full round-tripping — being able to use Gephi in an edit mode would be fantastic; the edit functionality is fairly straightforward, but the ability to round-trip in appropriate formats (OWL, RDF or otherwise) may be the greater sticking point.

Ultimately, of course, as I explained in an earlier presentation on a Normative Landscape for Ontology Tools, we would like to see a full-blown graphical program tie in directly with the OWL API. Some initial attempts toward that have been made with the non-Gephi GLOW visualization approach, but it is still in very early phases with ongoing commitments unknown. Optimally, it would be great to see a Gephi plug-in that ties directly to the OWL API.

In any event, while perhaps Cytoscape development has stalled a bit for semantic technology purposes, Gephi and its SemanticWebImport plug-in have come roaring into the lead. This is a fine toolset that promises usefulness for many years to come.

Some Further Gephi Links

To learn more about Gephi, also see the:

Also, for future developments across the graph visualization spectrum, check out the Wikipedia general visualization tools listing on a periodic basis.


[1] The R open source math and statistics package is very rich with apparently some graph visualization capabilities, such as the dedicated network analysis and visualization project statnet. rrdf may also provide an interesting path for RDF imports. R and its family of tools may indeed be quite promising, but the commitment necessary to R appears quite daunting. Longer-term, R may represent a more powerful upgrade path for our general toolsets. Neo4j is also a rising star in graph databases, with its own visualization components. However, since we did not want to convert our underlying data stores, we also did not test this option.
[2] Erwan Demairy is the lead developer and committer for SemanticWebImport. The first version was released in mid-April 2011.
[3] For presentations like this blog post, the Seadragon JavaScript enforces some security restrictions against cross-site scripting. To overcome that, the option I followed was to:
  • Use Gephi’s SVG export option
  • Open the SVG in Inkscape
  • Expand the size of the diagram as needed (with locked dimensions to prevent distortion)
  • Save As a PNG
  • Go to Zoom It and submit the image file
  • Choose the embed function, and
  • Embed the link provided, which is what is shown above.
(Though Zoom.it also accepts SVG files directly, I found performance to be spotty, with many graphical elements dropped in the final rendering.)

by Mike Bergman at August 08, 2011 09:27 AM

August 02, 2011

Do What I Mean (Richard Cyganiak)

Multiple itemtypes in Microdata

There’s a lot of discussion recently around HTML5’s microdata proposal, and how it relates to W3C’s earlier RDFa standard that is currently being updated for HTML5. Microdata solves many of the use cases of RDFa in a much simpler way. …

August 02, 2011 02:50 PM

July 31, 2011

HyperDanja (Danny Ayers)

July 25, 2011

AI3:::Adaptive Information (Mike Bergman)

Five Iterations of Site Search

Overcoming the Limitations of WordPress Search

Since the inception of this AI3 blog a bit over six years ago, I have gone through five different approaches to local site search, all geared to overcome the limitations of WordPress’ native search function. The current and last iteration uses the Relevanssi plug-in, the best I have used so far. (Check it out yourself in the search box to the upper right.) I describe these five iterations in this post.

Iteration #1: Native WordPress Search

When first released, AI3 used the native search that comes with the WordPress installation (when first installed that was WP version 1.5; the current version is at 3.2.1). That was OK when few knew of my site and the number of visitors was low.

But the WP search is known to suck, mostly because it orders search results by date posted rather than relevance and because it is slow. Once I began to get more traffic, it was time for a change.

Iteration #2: Google Custom Search

The option I have kept longest on this site is Google’s Custom Search. When first announced at the end of 2006 it was a real godsend and very innovative. I installed my first version in January 2007 and continued to make modifications and use it up through April 2010. I used it on various sites with many different types of Custom Search implementations.

Unfortunately, to use the free version it is necessary to include ads that Google provides. For a while, this served my purposes, since I was actively trying to learn whether ad revenues were viable for a standard blog and what kinds of traffic are necessary to produce meaningful revenues. However, by early 2010 I had come to the conclusion that, even with a quite popular blog for its niche, ad revenues would never be that meaningful and it was not worth cluttering up my site. So I ended my experiment with Google ads and, being cheap, chose not to use the paid version of the search service and thus dropped the system.

What I liked:

  • Easy set up
  • Familiar search syntax and interface.

What I did not like:

  • Inclusion of Google ad panels
  • Lack of flexibility in styling search results presentation
  • Need for a Google key
  • Inability to tweak ordering of search results
  • Intrusive Google logos in multiple places.

Iteration #3: Bing Site Owner

Microsoft’s Bing was starting to come on strongly at that time so I decided next to try the Bing Site Owner’s service. I began this new approach immediately upon retiring Google.

What I liked:

  • Very easy set up
  • Acceptable flexibility in styling results
  • Nice popup implementation
  • Not overly intrusive with the Bing (MS) brand.

However, without direct notice, Microsoft ended this service as of April of this year.

What I did not like:

  • Service went dark
  • Cancelled service without any notification (except on the Bing webmaster’s site, a location I never visited)
  • No alternatives to the Bing API 2.0 with its difficult set up.

I was pretty pleased with the Bing service and would likely have continued using it because it wasn’t broke. But the sudden plug-pulling was off-putting.

Thus, I decided, heck, if I was going to have to go through the effort of learning the new Bing API, I might as well learn to do it all myself.

Iteration #4: WPSearch 2

So, it was back to researching options and WP plug-ins on the Web. After assembling the options, I first chose to go with WPSearch 2. The thing that most initially attracted me to this option was its reliance on the Lucene open source search engine, the same option that my company Structured Dynamics uses in its Solr text indexing for the Open Semantic Framework (OSF).

Since my AI3 blog theme is of my own design with many changes over the years, I had lost its original capabilities in having a native search form and search results page. So, my first task after installing the WPSearch plugin and indexing my content was to add these pages to my theme. The WP Codex has an OK set of instructions on creating a search page and related discussion.

There are some valuable tutorials out there that explain how this is done; I refer to them rather than repeat such information here.

I completed this work and kept WPSearch 2 up and active on my site for roughly the past week. But, I also kept trying to achieve some of the aspects I wanted in formatting and organizing search results and became increasingly frustrated. I also experienced numerous freezes and white screens and fatal PHP errors while editing new pages or deleting comment spam that told me I simply had to abandon this option.

In summary, what I liked:

  • Use of Lucene search engine
  • Very fast performance
  • Known search syntax.

What I did not like:

  • Duplicate results
  • Freezes and timeouts when managing comments or new edits
  • Inability to capture total search count (at least with my own PHP skills)
  • Inability to highlight search terms.

I’m sorry that I needed to abandon this option, since I do view highly the underlying Lucene text engine. But, the integration with existing WP functionality and other modules was not fully baked. I think with more work, including exposing more of the Lucene search API functionality, that this option could redeem itself. But, as of today, it is not reliable enough for my site.

Iteration #5: Relevanssi

In trying to find hacks and workarounds to some of the desires and issues noted above, I had come across reference to the Relevanssi plug-in, which appeared to embrace much of what I was looking to achieve. The download is quite small (100 K) and must therefore use the native WP MySQL for the index, but it is feature rich and has strong, flexible relevance ranking. There is also great flexibility and configurability in how search results get presented, another attraction.

Installation of this system and then indexing was very clean and straightforward. It has a syntax that readily supports the Boolean AND operator (the default behavior I have set for the site; if the AND search finds no matches, it will automatically do an OR search) and phrase searching, with the prior links showing examples from this blog (also see the search form at upper right).
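The AND-then-OR behavior described above can be sketched in a few lines of Python. This is only an illustration of the fallback logic, not Relevanssi's actual code: the `POSTS` corpus, the `tokenize` helper and the `search` function are all hypothetical names made up for this example, and relevance ranking and phrase searching are left out.

```python
# Hypothetical in-memory corpus; the ids and text are made up for illustration.
POSTS = {
    "gephi-post": "gephi graph visualization for large ontologies",
    "search-post": "site search with relevance ranking in wordpress",
    "rdf-post": "advantages of the rdf data model",
}


def tokenize(text):
    """Lowercase and split on whitespace; a real plug-in does far more."""
    return text.lower().split()


def search(query, posts=POSTS):
    """Return (mode, matching post ids): try AND first, fall back to OR."""
    terms = set(tokenize(query))
    if not terms:
        return "AND", []
    index = {post_id: set(tokenize(body)) for post_id, body in posts.items()}

    # AND: every query term must appear in the post.
    hits = sorted(pid for pid, words in index.items() if terms <= words)
    if hits:
        return "AND", hits

    # OR fallback: any single query term is enough.
    hits = sorted(pid for pid, words in index.items() if terms & words)
    return "OR", hits
```

With this toy corpus, `search("relevance ranking")` is answered by the AND pass, while `search("gephi rdf")` finds no post containing both terms and so falls through to the OR pass.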

As implemented, then, here is the listing of major features in Relevanssi:

  • Total number of search results (implemented)
  • Search term highlighting (implemented)
  • Contextual excerpt snippets (implemented)
  • Sort by date (not implemented)
  • Category search (not implemented)
  • Filter by date (not implemented)
  • Filter by category or tag (not implemented).

Here is a screen capture of the complete configuration menu in WordPress for Relevanssi:

Relevanssi Configuration Options

For further information, you may also want to see some more advanced search functions and the Relevanssi knowledge base.

by Mike Bergman at July 25, 2011 09:40 AM

July 19, 2011

AI3:::Adaptive Information (Mike Bergman)

In the Midst of an Evolutionary Explosion

Photo courtesy of levelofhealth.com

A Decade of Remarkable Advances in Ten Grand IT Challenges

I’ve been in the information theory and technology game for quite some time, but believe nothing has matched the pace of advances of the past ten years. As one example, it was a mere eight years ago that I was sitting in a room with language translation vendors contemplating automated translation techniques for US intelligence agencies. The prospects finally looked doable, but the success of large-scale translation was not assured.

At about that same time, and the years until just recently, a whole slew of Grand Challenges [1] in computing hung out there: tantalizing yet not proven. These areas ranged from information extraction and natural language understanding to speech recognition and automated reasoning.

But things have been changing fast, and with a subtle steadiness that has caused it to go largely unremarked. Sure, all of us have been aware of the huge changes on the Web and search engine ubiquity and social networking. But some of the fundamentally hard problems in computing have also gone through some remarkable (but largely unremarked) advances.

We now have smart phones that speak instructions to us while we instruct them by voice in turn. Virtually all information conceivable is now indexed and made available through the Web; structure is now rapidly characterizing that information, making it even more useful to discover and organize. We can translate documents online with acceptable accuracy into more than 60 languages [2]. We can get directions to or see satellite views of virtually any place on earth. We have in fact become accustomed to new technology magic on a nearly daily basis, so much so that the pace of these advances seems to be a constant, blunting our perspective of just how rapid these advances have been progressing.

These advances are perhaps not the realization of artificial intelligence as articulated in the 1950s to 1980s, but are contributing to a machine-based ability to do tasks useful to humans heretofore impossible and at scales unimaginable. As Google and IBM’s Watson are showing, statistics (among other techniques) applied to massive knowledge bases or text corpora are breaking down all of the Grand Challenges of symbolic computing. The image that is emerging is less one of intelligent machines working autonomously than it is of computers working interactively or semi-automatically with humans to address previously unsolvable problems.

By using a perspective of the decade past, we also demark the seminal paper on the semantic Web by Berners-Lee, Hendler and Lassila from May 2001 [3]. Yet, while this semantic Web vision has been a contributor to the success of the Grand Challenge advances of the past ten years, I think we can also say that it has not been the key or even a primary driver. That day may still yet come. Rather, I think we have to look to natural language and statistics surrounding large-scale corpora as the more telling drivers.

Ten Grand Challenge Advances

Over the past ten years there have been significant advances on at least ten Grand Challenges in symbolic computation. As the concluding section notes, these advances can be traced in most part to broader advances in natural language processing, the logical and semiotic bases for interoperability, and standards (nominally in the semantic Web) for embracing them. Here are these ten areas of advance, all achieved over the past ten years:

#1 Information Extraction

Information extraction (IE) uses various forms of natural language processing (NLP) to identify structured information within unstructured or semi-structured documents. These documents are presented in machine-readable form (including straight text, various document formats or HTML) with the various types of information “tagged” or prompted for inclusion. Information types that can be extracted with one of the various techniques include entities, relations, topics, categories, and so forth. Once tagged or extracted, the information in the documents can now be included and linked to standard structured information (as might come from conventional databases) or to structure in other documents.

Most recently, a large number of online services and open source systems have also become available with strengths in one or more of these extraction types [4]. Some current examples include Yahoo! Term Extraction, OpenCalais, BeliefNetworks, OpenAmplify, Alchemy API, Evri, Extractiv, Illinois Tagger, and about 80 others [4].

#2 Machine Translation

Machine translation is the automatic translation of machine-readable text from one human language to another. Accurate and acceptable machine translation requires applying different types of knowledge including grammar, semantics, facts about the real world, etc. Various approaches have been developed and refined over time.

Especially helpful has been the availability of huge corpora in multiple languages to which large-scale statistical analysis may be applied (as is the case of Google’s machine translation) or human editing and refinement (as is the case with the more than 280 language versions of Wikipedia).

While it is true none of these systems have 100% accuracy (even human translators show much variation), the more advanced ones are truly impressive with remaining ambiguities flagged for resolution by semi-automatic means.

#3 Sentiment Analysis

Though sentiment analysis is strictly speaking a subset of information extraction, it has the more demanding and useful task of extracting subjective information, often across a group of documents or texts. Sentiment analysis can be applied to online reviews to determine the “polarity” about specific objects, and it is especially useful for identifying public opinion trends or evaluating social media for ranking, polling or marketing purposes.

Because of its greater difficulty and potential high value, many of the leading sentiment analysis capabilities remain proprietary. Some capable open source versions are available nonetheless. There is also an interesting online application using Twitter feeds.
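To make the idea of “polarity” across a group of texts concrete, here is a deliberately naive lexicon-based sketch in Python. The word lists and the function names `polarity` and `aggregate_polarity` are assumptions made for this example; it does not handle negation, sarcasm or context, and it is nowhere near what the proprietary or open source systems mentioned above actually do.

```python
import re

# Hypothetical, tiny polarity lexicon; real systems use far larger resources.
POSITIVE = {"good", "great", "excellent", "love", "fast"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "slow"}


def polarity(text):
    """Score one text: count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def aggregate_polarity(reviews):
    """Label a group of texts as positive, negative, or neutral overall."""
    total = sum(polarity(r) for r in reviews)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"
```

Summing the per-text scores is the simplest possible way to move from single documents to a group; a more careful approach might weight by text length or by the confidence of each score.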

#4 Disambiguation

Many words have more than one meaning. Word sense disambiguation uses either machine learning, dictionaries (gazetteers) of known entities and concepts, ontologies or linguistic databases such as WordNet, or combinations thereof to evaluate ambiguous terms or phrases and resolve them based on context. Some systems need to be “trained”, some work automatically, and others rely on evaluation and prompting (semi-automatic) to complete the disambiguation process.

State-of-the-art systems have greater than 90% precision [5]. Most of the leading open source NLP toolkits have quite capable disambiguation modules, and even better proprietary systems exist.
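One of the simplest dictionary-based approaches to resolving a term from its context is gloss overlap, in the spirit of the simplified Lesk algorithm. The Python sketch below is illustrative only: the `SENSES` inventory and its glosses were written for this example (they are not WordNet entries), and the stopword list and function names are likewise assumptions.

```python
# Hypothetical mini sense inventory; the glosses are written for this example
# and are not taken from WordNet.
SENSES = {
    "bank": {
        "financial institution": "an institution that accepts deposits and lends money",
        "river edge": "the sloping land beside a river or stream of water",
    },
}

STOPWORDS = {"a", "an", "and", "the", "of", "or", "that", "to", "in", "on", "by"}


def content_words(text):
    """Lowercased words minus a small stopword list."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def disambiguate(word, sentence, senses=SENSES):
    """Pick the sense whose gloss shares the most content words with the sentence."""
    context = content_words(sentence) - {word}
    best_sense, best_overlap = None, -1
    for sense, gloss in senses[word].items():
        overlap = len(context & content_words(gloss))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense
```

For instance, `disambiguate("bank", "she sat on the bank of the river")` selects the river sense because “river” appears in that gloss. When no gloss overlaps the context at all, this sketch simply returns the first listed sense, which is one reason real systems add training or human prompting.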

#5 Speech Synthesis and Recognition

Speech synthesis is the conversion of text to spoken speech and has been around for quite some time. Speech recognition is a far more difficult task in that a given sound clip or real-time spoken speech of a person must be converted to a textual representation, which itself can then be acted upon such as navigating or making selections. Speech recognition is made difficult because of individual voice differences, the variations of human languages and speech patterns, and the need to segment speech into a sequence of words. (In most spoken languages, the sounds representing successive letters blend into each other, so the conversion of the modulated wave form to discrete characters or tokens can be a very difficult process.)

Crude systems of a decade ago required much training with a specific speaker’s voice to show much effectiveness. Today, the range and ability to use these systems without training has markedly improved.

Until recently, improvements largely were driven by military and intelligence requirements. Today, however, with the ubiquity of smart phones and speech interfaces, the consumer market is greatly accelerating progress.

#6 Image Recognition

Image recognition is the ability to determine whether or not an electronic image contains some specific object, feature, or activity, and then to extract the image data associated with it. Today, under specific circumstances and for specific tasks, this can be done by computer. However, for the general case of arbitrary objects in arbitrary situations this challenge has not yet been fully met. The systems of today work best for simple geometric objects (e.g., polyhedra), human faces, printed or hand-written characters, or vehicles, and in specific situations, typically described in terms of well-defined illumination, background, and orientation of the object relative to the camera.

Auto license recognition at intersections, face recognition by security cameras, and greatly expanded and improved character recognition systems (machine vision) represent some of the current state-of-the-art. Again, smart phone apps are helping to drive advances.

#7 Interoperability Standards and Methods


Rapid Progress in Climbing the Data Federation Pyramid

Most of the previous advances are related to extracting structured information or mapping or deriving additional structured information. Once obtained, of course, the next challenge is in how to relate that information together; that is, how to make it interoperate.

We have been steadily climbing a data federation pyramid [6] — and at an impressively accelerating rate since the adoption of the Internet and Web. These network innovations gave us a common basis and protocols for connecting distributed devices. That, in turn, has freed us to concentrate on the standards for data representation and interoperability.

XML first provided a means for a common data serialization that encouraged various communities and industries to devise exchange vocabularies. RDF provided a means for a common data model, one that was both simple and extensible at the same time [7]. OWL built upon that basis to enable us to build common domain models (see next).
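The “simple and extensible” character of the RDF data model can be illustrated with plain subject-predicate-object triples. The sketch below uses ordinary Python tuples rather than an RDF library, and every identifier in it (the `ex:` names, the `match` helper) is a made-up example, not a term from a published vocabulary.

```python
# Each statement is a (subject, predicate, object) triple. The identifiers
# below are made-up examples, not terms from any published vocabulary.
triples = {
    ("ex:Gephi", "ex:type", "ex:GraphTool"),
    ("ex:Gephi", "ex:exportsTo", "ex:SVG"),
    ("ex:Cytoscape", "ex:type", "ex:GraphTool"),
}


def match(store, s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Extending the data needs no schema change: just add a triple that uses
# a predicate the store has never seen before.
triples.add(("ex:Gephi", "ex:hasPlugin", "ex:SemanticWebImport"))

graph_tools = [s for s, _, _ in match(triples, p="ex:type", o="ex:GraphTool")]
```

The point of the sketch is that one uniform shape of statement covers every fact, so new kinds of facts can be added and queried with the same pattern-matching function.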

There are alternatives to the semantic Web standards of RDF and OWL such as common logic and there are many competing data exchange formats to XML. None of these standards is essential on its own and all have their communities and advocates. However, because they are standards and they share common network bases, it has also been relatively easy to convert amongst the various available protocols. We are nearly at a global level where everything is connected, machine-readable, and in structured form.

#8 Common Domain Models

Semantics in machine-readable form means that we can more confidently link and combine available information. We are seeing a veritable explosion of domain models to represent various domains and viewpoints in consensual, interoperable form. What this means is that we are now gaining the computing vocabularies and grammars — along with shared community models (world views) — to get this stuff to work together.

Five years ago we called this phenomenon mashups, but no one uses that term any longer because these information brewpots are everywhere, including in our very hands when we interact with the apps on our smart phones. This glue of domain models is generally as invisible to us as is the glue in laminates or the resin in plastics. But it is nonetheless the strength and foundation that enables much of the computing magic unfolding around us.

#9 Virtual Apps (Cloud Computing)

Once the tyranny of physical separation was shattered between data and machine by the network, the rationale for keeping the data with the app or even the user with the app disappeared. Cloud computing may seem mysterious or sound to have some high-octave hum, but it really is nothing more than saying that the Web enables us to treat all of our computing resources as virtual. Data can be anywhere; machines and hard drives can be anywhere; and applications can be anywhere.

And, virtualness brings benefits in and of itself. Whole computing environments can be installed or removed nearly instantaneously. Peak computing demands can be met with virtual headrooms. Backup and rollover and redundancy practices and strategies can change. Web services mean tailored capabilities can be invoked from anywhere and integrated for local needs. Massive computing resources and server farms can be as accessible to the individual as they are to prior computing behemoths. Combined with continued advances in underlying computing hardware and chips, the computing power available to any user is rising exponentially. There is now even more power in the power curve.

#10 Big Data

One hears stories of Google or the National Security Agency having access to and managing servers measured in the hundreds of thousands. Entirely new operating systems and computing environments, many with roots in open source, such as virtual operating systems and MapReduce approaches like Hadoop, have been innovated to deal with the current era of “big data”.

MapReduce is a framework for processing huge datasets using a large number of servers. The “map” step partitions the problem into tractable sub-problems, organized in a tree structure. The “reduce” step then takes the answers to all the sub-problems and combines them to produce the final output.
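The two steps can be shown with the customary word-count example. The Python sketch below runs in a single process, so it only mimics the shape of the computation; in a real framework such as Hadoop the map calls and the per-key reduce calls would be spread across many servers. The function names and the intermediate `shuffle` grouping step are choices made for this illustration.

```python
from collections import defaultdict


def map_step(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word, 1) for word in document.lower().split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the two steps."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_step(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map over each document, then reduce each key's grouped values."""
    pairs = [pair for doc in documents for pair in map_step(doc)]
    return dict(reduce_step(k, v) for k, v in shuffle(pairs).items())
```

Because each `map_step` call touches only one document and each `reduce_step` call touches only one key, both steps can be parallelized without the workers needing to coordinate, which is what makes the approach suit very large datasets.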

Such techniques enable analysis of datasets of a size impossible before. This has enabled the development of statistics and analytical techniques that have been able to make correlations and find patterns for some of the Grand Challenge tasks noted before that simply could not be addressed within previous limits. The “big data” approach is providing a brute force alternative to previously intractable problems.

Why Such Progress?

Declining hardware costs and increasing performance (such as from Moore’s Law), combined with the adoption of the Internet + Web network, set the fertile conditions for these unprecedented advances in computing’s Grand Challenges. But the adaptive radiation in innovations now occurring has its own dynamics. In computing terms, we are seeing the equivalent of the Cambrian explosion in evolutionary history.

The dynamics driving this computing explosion are based largely, I believe, on the statistics of information retrieval and extraction needed to cope with the scale of documents on the Web. That, in turn, has impelled innovations in big data and distributed architectures and designs that have pried open previously closed computing lockboxes. As data from everywhere and from every provenance pours into the system, means for handling and interoperating with it have become imperatives. These forces, in turn, have been channeled and are being met through the open and standards-based approaches that helped lead to the development of the Internet and its infrastructure in the first place.

These powerful evolutionary forces in computing are clearly evident in the ten Grand Challenge advances above. But the challenges above are also silent on another factor, underpinning the interoperability initiatives, that is only now just becoming evident and exerting its own powerful force. That is the workable, intellectual foundations for interoperability itself.

Clearly, as the advances in the Grand Challenges show, we are seeing immense exposures of new structured information and impressive means for accessing and managing it on a global, distributed scale. Yet all of this data and this structure raises the question of how to get the information to work together. Further, the sources and viewpoints and methods by which all of this data has been created also put a huge premium on means to deal with the diversity. Though not evident, and perhaps not even known to many of the innovators and practitioners, there has been a growing intellectual force shaping our foundational views about the nature of things and their representations. This force has been, I believe, one of those root cause drivers helping to show the way to interoperability.

John Sowa, despite his unending criticism of the semantic Web in favor of common logic, has nonetheless been a very positive evangelist for the 19th century American logician and philosopher, Charles Sanders Peirce. Sowa points out that the entire 20th century largely neglected Peirce’s significant contributions in many areas and some philosophers appropriated Peircean insights without proper attribution [8]. Indeed, Peirce has only come to wider attention within the past decade or so. Much of his voluminous lifetime writings have still not yet been committed to publication.

Among many notable contributions, Peirce was passionate about signs and their triadic representations, in a field known as semiotics. The philosophical and logical basis of his triangle of signs deserves your attention, but cannot be adequately treated here [9]. However, as summarized by Sowa [8], “A semiotic view of language and logic gets to the heart of the philosophical controversies and their practical implications for linguistics, artificial intelligence, and related subjects.”

In essence, Peirce’s triadic logic of semiotics helps clarify philosophical questions about things, how they are perceived and how they are named that have vexed philosophers at least since the time of Aristotle. What Peirce was able to put forward was a testable logic for how things and the names of things can be understood and related to one another, via logical statements or structures. These, in turn, can be symbolized and formalized into logical constructs that can capture the structure of natural language as well as more structured data.

The clarity of Peirce’s logic of signs is an underlying factor, I believe, for why we are finally seeing our way clear to how to capture, represent and relate information from a diversity of sources and viewpoints that is defensible and interoperable [10]. As we plumb Peircean logics further, I believe we will continue to gain additional insights and methods for combining and relating information. The next phase of our advances on these Grand Challenges is likely to be fueled more by connections and interoperability than in basic extraction or representation.

The Widening Explosion

We are not seeing the vision of artificial intelligence unfold as posed three decades ago. Nor are we seeing the AI-complete type of problems being solved in their entirety [11]. Rather, we are seeing impressive but incomplete approaches. Full automation and autonomy are not yet at hand, and may be so far in the future as to never be. But we are nevertheless seeing advances across the board in all Grand Challenge areas.

What is emerging is a practical achievement of the Grand Challenges, the scale and scope of which is unprecedented in symbolic computing. As we see Peircean logic continue to take hold and interoperability grow in usefulness and stature, I think it fair to say we can look back in ten years to describe where we stand today as having been in the midst of an evolutionary explosion.


[1] Grand Challenges were United States policy objectives for high-performance computing and communications research set in the late 1980s. According to “A Research and Development Strategy for High Performance Computing”, Executive Office of the President, Office of Science and Technology Policy, 29 pp., November 20, 1987, “A grand challenge is a fundamental problem in science or engineering, with broad applications, whose solution would be enabled by the application of high performance computing resources that could become available in the near future.”
[2] For example, as of July 17, 2011, Google offered 63 different source or target languages for translation.
[3] Tim Berners-Lee, James Hendler and Ora Lassila, 2001. “The Semantic Web”. Scientific American Magazine; see http://www.scientificamerican.com/article.cfm?id=the-semantic-web.
[4] Go to Sweet Tools, and enter the search ‘information extraction’ to see a list of about 85 tools.
[5] See, for example, Roberto Navigli, 2009. “Word Sense Disambiguation: A Survey,” ACM Computing Surveys, 41(2), 2009, pp. 1–69. See http://www.dsi.uniroma1.it/~navigli/pubs/ACM_Survey_2009_Navigli.pdf.
[6] M.K. Bergman, 2006. “Climbing the Data Federation Pyramid,” AI3:::Adaptive Information blog, May 25, 2006; see http://www.mkbergman.com/229/climbing-the-data-federation-pyramid/.
[7] M. K. Bergman, 2009. “Advantages and Myths of RDF,” AI3:::Adaptive Information blog, April 8, 2009. See http://www.mkbergman.com/483/advantages-and-myths-of-rdf/
[8] John Sowa, 2006. “Peirce’s Contributions to the 21st Century”, in H. Schärfe, P. Hitzler, & P. Øhrstrøm, eds., Conceptual Structures: Inspiration and Application, LNAI 4068, Springer, Berlin, 2006, pp. 54-69. See http://www.jfsowa.com/pubs/csp21st.pdf.
[9] See, as a start, the Wikipedia article on Charles Sanders Peirce (pronounced “purse”), as well as the Arisbe collection of his assembled papers (to date). Also see John Sowa, 2010. “The Role of Logic and Ontology in Language and Reasoning,” from Chapter 11 of Theory and Applications of Ontology: Philosophical Perspectives, edited by R. Poli & J. Seibt, Berlin: Springer, 2010, pp. 231-263. See http://www.jfsowa.com/pubs/rolelog.pdf. Sowa also says, “Although formal logic can be studied independently of natural language semantics, no formal ontology that has any practical application can ever be developed and used without acknowledging its intimate connection with NL semantics.”
[10] While Peirce’s logic and clarity of conceptual relationships is compelling, I find reading his writings quite demanding.
[11] In the field of artificial intelligence, the most difficult problems are informally known as AI-complete or AI-hard, meaning that the difficulty of these computational problems is equivalent to solving the central artificial intelligence problem of making computers as intelligent as people. Computer vision, autonomous robots and understanding natural language are amongst challenges recognized by consensus as being AI-complete. However, practical advances on the Grand Challenges were never defined as needing to meet the AI-complete criterion. Indeed, it is questionable whether such a hurdle is even worthwhile or meaningful on its own.

by Mike Bergman at July 19, 2011 04:00 AM

July 09, 2011

DBpedia Blog

Official DBpedia Live Release

We are pleased to announce the official release of DBpedia Live. The main objective of DBpedia is to extract structured information from Wikipedia, convert it into RDF, and make it freely available on the Web. In a nutshell, DBpedia is the Semantic Web mirror of Wikipedia.

Wikipedia users constantly revise Wikipedia articles, with updates happening almost every second. Hence, data stored in the official DBpedia endpoint can quickly become outdated, and Wikipedia articles need to be re-extracted. DBpedia Live enables such a continuous synchronization between DBpedia and Wikipedia.

The DBpedia Live framework has the following new features:

  1. Migration from the previous PHP framework to the new Java/Scala DBpedia framework.
  2. Support of clean abstract extraction.
  3. Automatic reprocessing of all pages affected by a schema mapping change at http://mappings.dbpedia.org.
  4. Automatic reprocessing of pages that have not changed for more than one month. The main objective of this feature is to ensure that any change in the DBpedia framework, e.g. addition/change of an extractor, will eventually affect all extracted resources. It also serves as a fallback for technical problems in Wikipedia or the update stream.
  5. Publication of all changesets.
  6. Provision of a tool to enable other DBpedia mirrors to be in synchronization with our DBpedia Live endpoint. The tool continuously downloads changesets and performs changes in a specified triple store accordingly.
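
The mirroring step in item 6 can be pictured with a small sketch. It assumes that a changeset boils down to a list of removed triples and a list of added triples, and that the target triple store is just an in-memory set. The actual changeset file format and the interface of the synchronization tool are not described in this announcement, so `apply_changeset` and the `ex:` values below are purely hypothetical.

```python
from typing import Iterable, Set, Tuple

Triple = Tuple[str, str, str]


def apply_changeset(store: Set[Triple],
                    removed: Iterable[Triple],
                    added: Iterable[Triple]) -> None:
    """Apply one changeset in place: deletions first, then additions.

    The deletion-before-addition order is an assumption of this sketch.
    """
    store.difference_update(removed)
    store.update(added)


# Illustrative data only; these are not real DBpedia triples.
mirror: Set[Triple] = {("ex:Page", "ex:abstract", "old text")}
apply_changeset(
    mirror,
    removed=[("ex:Page", "ex:abstract", "old text")],
    added=[("ex:Page", "ex:abstract", "new text")],
)
```

A mirror would repeat this for every downloaded changeset, in publication order, so that its store converges on the state of the live endpoint.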

Thanks a lot to Mohamed Morsey, who implemented this version of DBpedia Live, as well as to Sebastian Hellmann and Claus Stadler, who worked on its predecessor. We also thank our partners at the FU Berlin and OpenLink as well as the LOD2 project for their support.

by Sören at July 09, 2011 10:50 AM

June 30, 2011

Wikier.org Blog (Sergio Fernandez)

Easily document your vocabularies/ontologies with Parrot

After several months of development within the ONTORULE project, today Tejo has finally announced that Parrot is online. This tool is the natural evolution of others, such as SpecGen or Neologism; not better, just different. It adds the possibility to generate documentation from several artifacts: not only vocabularies/ontologies, but rules too.

Since the tool is also provided as an online service, it introduces an interesting option: easily documenting your vocabularies/ontologies backed by this service. Based on the recipes, it’d be easy to write the necessary rules in your .htaccess to use Parrot in this way:


RewriteEngine On
RewriteBase /exampledir
AddDefaultCharset utf-8
AddType application/rdf+xml .rdf
AddType application/rdf+xml .owl

# Rewrite rule to serve HTML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} !application/rdf\+xml.*(text/html|application/xhtml\+xml)
RewriteCond %{HTTP_ACCEPT} text/html [OR]
RewriteCond %{HTTP_ACCEPT} application/xhtml\+xml [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/.*
RewriteRule ^example$ http://ontorule-project.eu/parrot/parrot?documentUri=http://example.org/exampledir/example.owl [R=303,L]

# Rewrite rule to serve RDF/XML content from the vocabulary URI if requested
RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
RewriteRule ^example$ example.owl [R=303]

# Rewrite rule to serve the RDF/XML content from the vocabulary URI by default
RewriteRule ^example$ example.owl [R=303]

This configuration assumes that your vocabulary/ontology has http://example.org/exampledir/example# as its namespace, and that the source file is located at http://example.org/exampledir/example.owl. But you can customize it at your convenience.
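
To make the decision logic of those rewrite rules easier to follow, here is a rough Python sketch of the same content negotiation. It is only an illustration, not part of Parrot: the function name `redirect_target` is made up, and the two URLs are the example ones from the configuration above.

```python
import re

PARROT = ("http://ontorule-project.eu/parrot/parrot"
          "?documentUri=http://example.org/exampledir/example.owl")
RDF_SOURCE = "http://example.org/exampledir/example.owl"

# Same pattern as the first (negated) RewriteCond: RDF/XML listed before HTML.
RDF_BEFORE_HTML = re.compile(
    r"application/rdf\+xml.*(text/html|application/xhtml\+xml)")


def redirect_target(accept: str, user_agent: str = "") -> str:
    """Return the URL a 303 redirect would point to for this request."""
    wants_html = (
        "text/html" in accept
        or "application/xhtml+xml" in accept
        or user_agent.startswith("Mozilla/")
    )
    # First rule: HTML documentation, unless RDF/XML is listed before HTML.
    if wants_html and not RDF_BEFORE_HTML.search(accept):
        return PARROT
    # Second and third rules: RDF/XML on request, and also by default.
    return RDF_SOURCE
```

For example, `redirect_target("text/html")` yields the Parrot URL, while `redirect_target("application/rdf+xml")` yields the OWL file.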

For instance, the documentation for SIOC would be this. And based on the mappings provided by prefix.cc, Parrot also generates documentation from the most popular prefix of a vocabulary, such as http://ontorule-project.eu/parrot/parrot?documentUri=foaf for FOAF.

Please check the full features list. And as usual, any comment is very welcome!

by Sergio Fernández at June 30, 2011 02:15 PM

June 28, 2011

DBTune Blog

Using RDFa for testing templates

I promised a bunch of people from the BBC that I would write about this, so here it is! This is my first Software Engineering-related post, bear with me :)

We recently released a new iteration of /programmes, built on top of a completely new technology stack. As part of that move, we decided we wanted to review our template testing strategy. In our old application, the process for a new feature would basically be:

  1. Software Engineer writes models, controllers and data views;
  2. Web Developer writes XHTML templates;
  3. Software Engineer writes controller tests, which would actually test the routes, the controllers and the templates (such controller 'unit tests' are actually fairly standard across MVC frameworks, for some reason - e.g. in Zend)

Those controller tests were based on CSS selectors or XPaths. Therefore, any time a small front-end tweak needed to be done, the controller tests would break, which is very annoying for everyone.

We had two problems:

  • Our controller 'unit tests' were not really unit tests - front-end developers shouldn't have to understand the whole chain of routing, controllers and models to make a front-end change.
  • Using CSS selectors or XPaths for template tests is brittle. We don't want our tests to break every time we change the name of a CSS class.

In order to solve the first problem, we divided our controller tests into route tests (here is a route, assert that the application forwards it to the right controller/action with the right parameters), real unit controller tests (mock the models, call an action with some request parameters, check that the right data is sent to the view), and template tests.

In order to solve our second problem, we made those template tests rely on RDFa markup embedded within the page. In order to test a template, we create some mock data, generate the template using this data, and check that the right RDF triples can be extracted from the resulting page. It ensures that tests are actually based on data - front-end changes won't have an impact on them. We just want to make sure we present the right data to the user. As the tests do not rely on other application code, it also means that someone writing the templates can maintain their own test suite.

A simple example of one of these tests is the following one:

    public function testLetterSearch()
    {
        $this->setDefaultComponent('/components/atoz/letters');
        $data = (object) array(
            "by"      => "by",
            "search"  => "b",
            "slice"   => "player",
            "letters" => array('@', 'a', 'b', 'c'),
        );

        $this->assertTriples($data, array(
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/%40/player'),
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/a/player'),
            array('/programmes/a-z/by/b', 'rdfs:seeAlso', '/programmes/a-z/by/c/player'),
        ));
    }

which can be read as: if you are displaying a list of letters on an A-Z page and you have selected one, you shouldn't link to that one.
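
For readers who don't use PHP, the idea can be sketched end to end in Python. Everything here is hypothetical: `render_letters` stands in for the real template, and `ToyRdfaExtractor` understands only an `about` attribute on an ancestor plus `rel`/`href` on a link, which is a tiny fraction of RDFa and would not cope with void elements such as `<br>`. The `assert_triples` helper assumes that the check is "all expected triples are present"; the actual semantics of the PHP `assertTriples` are not shown above.

```python
from html.parser import HTMLParser
from urllib.parse import quote


def render_letters(by, search, slice_, letters):
    """Hypothetical template: link every letter except the selected one."""
    items = []
    for letter in letters:
        if letter == search:
            continue  # the selected letter must not be linked
        href = "/programmes/a-z/%s/%s/%s" % (by, quote(letter, safe=""), slice_)
        items.append('<li><a rel="rdfs:seeAlso" href="%s">%s</a></li>'
                     % (href, letter))
    return '<ul about="/programmes/a-z/%s/%s">%s</ul>' % (by, search, "".join(items))


class ToyRdfaExtractor(HTMLParser):
    """Toy extractor: inherits `about` from ancestors, reads `rel` + `href`."""

    def __init__(self):
        super().__init__()
        self.subjects = []    # one entry per currently open element
        self.triples = set()

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        inherited = self.subjects[-1] if self.subjects else None
        subject = attrs.get("about", inherited)
        self.subjects.append(subject)
        if subject and "rel" in attrs and "href" in attrs:
            self.triples.add((subject, attrs["rel"], attrs["href"]))

    def handle_endtag(self, tag):
        if self.subjects:
            self.subjects.pop()


def assert_triples(html, expected):
    """Fail if any expected triple cannot be extracted from the page."""
    parser = ToyRdfaExtractor()
    parser.feed(html)
    missing = set(expected) - parser.triples
    assert not missing, "missing triples: %s" % missing
    return parser.triples


page = render_letters("by", "b", "player", ["@", "a", "b", "c"])
triples = assert_triples(page, [
    ("/programmes/a-z/by/b", "rdfs:seeAlso", "/programmes/a-z/by/%40/player"),
    ("/programmes/a-z/by/b", "rdfs:seeAlso", "/programmes/a-z/by/a/player"),
    ("/programmes/a-z/by/b", "rdfs:seeAlso", "/programmes/a-z/by/c/player"),
])
```

Note that the test never mentions `<ul>`, `<li>` or any CSS class, so the markup can change freely as long as the same triples come out.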

Another very nice side-effect is that developers have a motivation to put lots of RDFa inside our pages! Compare the 168 triples extracted from a new /programmes page, including full tracklist information, programme metadata and broadcasts, to the 18 triples extracted from an old one. And as we add new components to this page, more RDFa will become available.

Also, the speed at which our developers picked up RDFa (1.0, not even 1.1, which is apparently simpler) defeats the eternal argument about RDFa being too complicated, but that's just my opinion :) The RDFa cheat sheet has proved immensely helpful.

by Yves at June 28, 2011 10:52 AM

June 16, 2011

HyperDanja (Danny Ayers)

June 03, 2011

AI3:::Adaptive Information (Mike Bergman)

Structured Web Gets Massive Boost

Contrary to Some Views, Google and Co.’s Microdata Effort will Also Boost RDF

In my opinion, perhaps the most important event for the structured Web since RDF was released a dozen years ago was today’s joint announcement by the search engine triumvirate of Google, Bing and Yahoo! releasing Schema.org. Schema.org is a vendor specification for nearly 300 mini-schema (or structured record definitions) that can be used to tag information in Web pages. These schema are organized into a clean little hierarchy and cover many of the leading things — from organizations to people to products and creative works — that can be written about and characterized on the Web.

These schema specifications are based on the microdata standard presently under review as part of the pending HTML5 specification. Microdata are set record descriptions of key-value pair attributes that can be embedded into the HTML Web page language. These microdata schema are similar to microformats, but broader in coverage and extensible. Microdata is also simpler than RDFa, another W3C specification that the Schema.org organizers call “. . . extensible and very expressive, but the substantial complexity of the language has contributed to slower adoption.”

Is the Initiative a Slap in RDF’s Face?

Various forums have been alive with howls and questions from many RDF and RDFa advocates that this initiative negates years of effort behind those formats. Yet I and my company, Structured Dynamics, which base our entire technology approach on semantics and RDF, do not see this announcement as a threat or rejection. What gives; what is the difference in perspective?

In our view, RDF, with the triple representation in its data model, is the simplest and most expressive means to represent any data or any data relationship. As such, RDF and its language extensions, such as OWL and ontologies, provide a robust and flexible canonical data model for capturing any extant data or schema. No matter what the native form of the source information, we can boil it down to RDF and inter-relate it to any other information. It is for these reasons (and others) we have frequently termed RDF the universal data solvent.

But simple records and simple data need not be encumbered with the complexity of RDF. We have long argued for the importance of naive data structs. Many of these are simple key-value pairs where the subject is implied. The little structured data records in Wikipedia, called infoboxes, are of this form. JSON and many other simple data formats have similarly clean structures.
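
As a minimal illustration of "the subject is implied", the sketch below turns a key-value record into explicit triples once a subject is supplied. The function name, the `example.org` identifier and the record contents are all made up for this example; a real conversion would also map the keys to proper predicate URIs.

```python
def record_to_triples(subject, record):
    """Make the implied subject explicit: one triple per key-value pair."""
    return [(subject, key, value) for key, value in record.items()]


# A naive key-value record, in the spirit of an infobox or a JSON object.
infobox = {"name": "Example City", "country": "Exampleland"}
triples = record_to_triples("http://example.org/ExampleCity", infobox)
```

The record itself stays simple to write and exchange; the subject only has to be named at the point where the data is interoperated.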

The basic fact that RDF provides a universal data model for any kind of native data does not necessarily translate into its use as the actual data exchange format. Rather, winning data exchange formats are those that can be easily understood, easily expressed and therefore widely used. I think there is a real prospect that microdata, ready for ingest and expression by the Web’s leading search engines, may represent a real sea change in the availability and expression of structured data on the Web.

More structure — not less — is the real fuel that will promote greater adoption of RDF when it comes time to interoperate that data. The RDF community should rejoice that more structure will be coming to the Web from Google et al.’s announcement. We should also soon see an explosion of tools and utilities and services that make it easy to automatically add such structure to Web pages via single clicks. Then, once this structure is available, watch out!

So, while the backers of Schema.org also announced their continued support for microformats and RDFa as they presently exist, I rather suspect today’s announcement represents a denouement for these alternative formats. Though these formats may be creatively destroyed, I think the effect on RDF itself will be a profound and significant boost. I foresee clarity coming to the marketplace regarding RDF’s role:  as a canonical means for expressing data of any form, and not necessarily as a data exchange format.

The Initiative is No Surprise

This initiative, led by Google, should be no surprise. Google is the registered agent for the Schema.org Web site and has been the key proponent of microdata via its support of Ian Hickson in the WhatWG and HTML5 work groups. As I stated a couple of years back, Google has also not hidden its interests in structured data. Practically daily we see more structured data appear in Google search results and it has maintained a very active program in structured data extraction from text and tables for some years.

Google and its search engine partners recognize that search needs are evolving from keyword retrievals to structure, relationships, and filtered, targeted results. Those advances come from structure — as well as the semantic relationships between things that something like the Schema.org begins to represent.

Many within the W3C and elsewhere questioned why Google was pushing microdata when there were competing options such as microformats or RDFa (or even earlier variants). Of course, like Microsoft of a decade earlier, some ascribed Google’s microdata advocacy as arising from commercial interests or clout in advertising alone. Of course Google has an economic interest in the growth and usefulness of the Web. But I do not believe its advocacy to be premised on clout or “my way or the highway.”

Google and the search engine triumvirate understand well — much better than many of the researchers and academics that dominate mailing list discussions — that use and adoption trump elegance and sophistication. When one deconstructs the design of microdata and the nearly 300 schema now released behind it, I think the pragmatic observer can only come to one conclusion: Job well done!

Why This is Exciting

I have been a fervent RDF advocate for nearly a decade and have also been a vocal proponent of the structured Web as a necessary stepping stone to the semantic Web. In fact, here is a repeat of a diagram I have used many times over the past 5 years:

Transition in Web Structure

Document Web
  • Document-centric
  • Document resources
  • Unstructured data and semi-structured data
  • HTML
  • URL-centric
  • circa 1993

Structured Web
  • Data-centric
  • Structured data
  • Semi-structured data and structured data
  • XML, JSON, RDF, etc
  • URI-centric
  • circa 2003

Linked Data
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S
  • URI-centric
  • circa 2007

Semantic Web
  • Data-centric
  • Linked data
  • Semi-structured data and structured data
  • RDF, RDF-S, OWL
  • URI-centric
  • circa ???

When one looks at the schema of schema that accompany today’s announcement, it is really clear just how encompassing and important these instant standards will become:

DataType

Thing
  • Intangible
  • CreativeWork
  • Event
  • Organization
      • LocalBusiness
          • AnimalShelter
          • AutomotiveBusiness
          • ChildCare
          • DryCleaningOrLaundry
          • EmergencyService
          • EmploymentAgency
          • EntertainmentBusiness
          • FinancialService
          • FoodEstablishment
          • GovernmentOffice
          • HealthAndBeautyBusiness
          • HomeAndConstructionBusiness
          • InternetCafe
          • Library
          • LodgingBusiness
          • MedicalOrganization
          • ProfessionalService
          • RadioStation
          • RealEstateAgent
          • RecyclingCenter
          • SelfStorage
          • ShoppingCenter
          • SportsActivityLocation
          • Store
          • TelevisionStation
          • TouristInformationCenter
          • TravelAgency
      • NGO
      • SportsTeam
  • Person
  • Place
  • Product

Today’s announcement is the best news I have heard in years regarding the structured Web, RDF, and the semantic Web. This announcement is — I believe — the signal event of the structured Web. With regard to my longstanding diagram above, I can go to bed tonight knowing we have now crossed the threshold into the semantic Web.

by Mike Bergman at June 03, 2011 02:57 AM

May 31, 2011

Wikier.org Blog (Sergio Fernandez)

SDoW2011

Following the successful SDoW workshops at ISWC 2008, 2009 and 2010, this year we (Alex, John, Uldis and I) are back with the 4th international workshop on Social Data on the Web (SDoW2011) at ISWC2011. We aim to bring together Semantic Web experts and Web 2.0 practitioners and users to discuss the application of semantic technologies to data from the Social Web. It is motivated by recent active developments in collaborative and social software and their Semantic Web counterparts, notably in industry, such as Facebook's Open Graph Protocol.

Submissions are welcomed until August 15th. See you in Bonn!

by Sergio Fernández at May 31, 2011 07:57 AM

May 26, 2011

Project squin

Analysis of HTTP-based cache control support in Linked Data servers

Today, I gave a talk at the monthly Talis Research meeting. The topic of this talk was “The Impact of Data Caching on Query Execution for Linked Data” (find the slides of the talk). Primarily, I discussed the findings of the corresponding paper I presented at this year’s Linked Data workshop. However, I also outlined some ideas for cache coherency mechanisms in SQUIN, assuming that nearly no Linked Data server supports the different options for cache control that are provided by the HTTP protocol (e.g. conditional GET, Last-Modified headers, ETag headers, etc.). During the preparation of this talk, I was curious whether this assumption is still valid today. The only analysis that I was aware of is from 2009, performed by Michael Hausenblas, described in this blog post. Michael concludes that the “results of the LOD caching evaluation are somewhat deflating: more than half of the samples do not support cache control and less than 20% support Last-Modified or ETag headers.” So, has that changed?

To answer this question, I developed a small Java program checkCacheSupport.java (available as Free Software under the terms of the GNU General Public License, v3). This program first issues a query at the SPARQL endpoint provided for the CKAN catalog of linked datasets; this query asks for example resources of the registered linked datasets. Each of the reported example resources is then requested by the program in two ways: The first request is a conditional GET with an If-Modified-Since header that specifies the current time. Assuming that the requested resources are not modified at the time of the experiment, a server that supports conditional GET should respond with a 304 Not Modified. Hence, the program takes a 304 response as evidence that the corresponding Linked Data server supports conditional GET. After this conditional GET, the program requests each example resource a second time using an ordinary GET request. If the server responds with a 200 OK, the program records the following header fields from the response: Last-Modified, ETag, Cache-Control, and Expires.
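
The two probes can be summarized in a few lines of Python. This is not the Java program mentioned above, only a sketch of the same decision logic under the assumptions stated in the text; the function names are invented, and the actual HTTP requests are left out so that the logic can be read (and run) without network access.

```python
from email.utils import formatdate

CACHE_HEADERS = ("Last-Modified", "ETag", "Cache-Control", "Expires")


def conditional_get_headers(now=None):
    """Headers for the first probe: If-Modified-Since set to the current time."""
    return {"If-Modified-Since": formatdate(now, usegmt=True)}


def supports_conditional_get(status):
    """A 304 Not Modified is taken as evidence of conditional GET support."""
    return status == 304


def record_cache_headers(status, headers):
    """For a 200 OK response, keep only the four cache-related header fields."""
    if status != 200:
        return {}
    # Header field names are case-insensitive, so compare in lower case.
    lowered = {name.lower(): value for name, value in headers.items()}
    return {name: lowered[name.lower()]
            for name in CACHE_HEADERS if name.lower() in lowered}
```

A driver would send the first request with `conditional_get_headers()`, pass the status code to `supports_conditional_get`, then send a plain GET and feed its status and headers to `record_cache_headers`.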

When I executed the program it checked 154 example resources from an overall number of 110 different datasets. Here is the result of this experiment:

  • For 41 of the 154 example resources the server supports
    conditional GET (26.6%).
  • For 54 of the 154 example resources the server provides
    a Last-Modified header (35.1%).
  • For 49 of the 154 example resources the server provides
    an ETag header (31.8%).
  • For 78 of the 154 example resources the server provides
    a Cache-Control header (50.6%).
  • For 58 of the 154 example resources the server provides
    a Cache-Control header with a max-age entry (37.7%).
  • For 57 of the 154 example resources the server provides
    an Expires header (37.0%).
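
As a quick sanity check, the percentages can be re-derived from the counts. In this sketch the Cache-Control count is taken to be 78, which is an assumption: it is the count that 50.6% of 154 implies, and it is consistent with the 58 responses that carry a max-age entry, since those are a subset of the Cache-Control responses.

```python
TOTAL = 154

counts = {
    "conditional GET": 41,
    "Last-Modified": 54,
    "ETag": 49,
    "Cache-Control": 78,  # assumption: inferred from the stated 50.6%
    "Cache-Control with max-age": 58,
    "Expires": 57,
}

# Share of the 154 example resources, rounded to one decimal place.
percentages = {name: round(100 * n / TOTAL, 1) for name, n in counts.items()}

for name, share in percentages.items():
    print("%-28s %5.1f%%" % (name, share))
```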

You may want to take a look at the results in detail.

So, it seems that the situation has improved a bit since 2009, but still not to a degree where it is reasonable to solely rely on HTTP-based cache control mechanisms in a Linked Data consuming application.

Olaf

by Olaf Hartig at May 26, 2011 03:38 PM

May 19, 2011

DBTune Blog

Music and the Semantic Web workshop

Quite a big week for Semantic Web and music last week in London. On Thursday, there was the MusicNet workshop, with (among others) a talk from the BBC, given by Nick Humfrey. Sadly, I could not attend the workshop due to other BBC duties.

On the Friday, David De Roure and I organised a Music and the Semantic Web workshop at the AES, which David already blogged about. We had four panelists:

We started with a presentation from each of the four panelists. David started by presenting MusicNet, which aims at creating canonical Web identifiers for classical music composers. He demonstrated a new tool for merging identifiers for music composers - finding common properties between groups of composers, and providing an interface to review those groups.

Alexandre then presented Seevl. He explained how it worked, aggregating and consolidating structured data about music artists from a range of different places, and generating recommendations using this data, as well as explanations of these recommendations (those two artists played together, they had the same producer for their first album, etc.). I wrote quite a lot about this kind of thing on this blog - it's really nice to finally see it taking shape!

Gregg presented his experience within the Connected Media Experience (CME) project. His presentation was supported by a position paper, which is extremely interesting. CME (a large consortium of key music industry players) worked for a couple of years on an RDF/Music Ontology format for online releases. Sadly, they recently abandoned this format for simple HTML5+CSS - structured data about those releases is not a priority anymore. Gregg gave us insights on what went wrong, and on the lessons to be learned by both the Semantic Web community and the music industry.

Evan then presented Decibel, giving us very interesting insights about music metadata, and a demonstration of their service. It was interesting to see semantic technologies used in a completely different model. The richness of the data they hold is truly amazing (Evan demoed their internationalisation feature as well - all their data is available in a variety of languages), but sadly not available under an open license.

After that, we had a number of questions from David and me, as well as from the audience, about ease of editing and ownership of music metadata (who should own it? third parties? artists? record labels? who should host the canonical URI for an artist?), and about relationships of Semantic Web standards with industry standards like ISRCs, ISNIs, MPEG etc.

Overall, a very interesting workshop - I hope we can do it again next year!

by Yves at May 19, 2011 04:30 PM

May 18, 2011

Do What I Mean (Richard Cyganiak)

The RDF 1.1 Literal Quiz

Let’s pretend we live in January 2013, and RDF 1.1 has just been published. This includes the RDF Working Group’s attempt to clean up string literals. The issue with string literals is that RDF currently offers three different ways for …

May 18, 2011 09:50 PM

May 17, 2011

AI3:::Adaptive Information (Mike Bergman)

Intro to structOntology

A Video Introduction to a New Online Ontology Editor and Manager

Structured Dynamics is pleased to unveil structOntology — its ontology manager application within the conStruct open source semantic technology suite. We are doing so via a video, which provides a bit more action about this exciting new app.

structOntology has been on our radar for more than two years. But it was only in embracing the OWLAPI some eight months back that we finally saw a clear way to implement the system.

The app, superbly developed by Fred Giasson, has many notable advantages — some of which are covered by the video — but two deserve specific attention:  1) the superior search function (if you have been using Protégé or similar, you will love the fact this search indexes everything, courtesy of Solr); and 2) the availability of its functionality directly within the applications that are driven by the ontologies. Of course, there’s other cool stuff too!:

 

 

(If you have trouble seeing this, here is the direct YouTube link or an alternate local Flash version if you can not access YouTube.)

More information on structOntology will be forthcoming over the coming weeks. We will be posting it as open source as part of the Open Semantic Framework by early summer.

by Mike Bergman at May 17, 2011 03:06 AM