Linked Data Blog Aggregator

March 12, 2010

AI3:::Adaptive Information (Mike Bergman)

Brown Bag Lunch: Untapped Assets: The $3 Trillion Value of US Documents

Friday Brown Bag Lunch

Today, in the advanced knowledge economy of the United States, the information contained within documents represents about a third of total gross domestic product, or an amount of about $3.3 trillion annually.

Yet our understanding of the value of documents and the means to manage them is abysmal. These failures impact enterprises of all sizes from the standpoints of revenues, profitability and reputation. Continued national productivity growth — and thus the wealth of all citizens — depends critically on understanding and managing these document values.

As this white paper describes, the lack of a compelling and demonstrable common understanding of the importance of documents is in itself a major factor limiting available productivity benefits. There is an old Chinese saying that roughly translated is “what cannot be measured, cannot be improved.” Many corporate officers may believe this to be the case for document creation and productivity, but, as this paper shows, in fact many of these document issues can be measured.

Friday Brown Bag Lunch This Friday brown bag leftover was first placed into the AI3 refrigerator on July 20, 2005. No changes have been made to the original posting.

I’d like to thank David Siegel for recently highlighting this post from 5 years ago with nice kudos on his PowerOfPull blog. That reference is what caused me to dust off the cobwebs from this older piece.

To wit, some 25% of all of the annual trillions of dollar spent on document creation costs lend themselves to actionable improvements:

U.S. FIRMS

$ Million

%
Cost to Create Documents

$3,261,091

Benefits
Benefits to Finding Missed or Overlooked Documents

$489,164

63%

Benefits to Improved Document Access

$81,360

10%

Benefits of Re-finding Web Documents

$32,967

4%

Benefits of Proposal Preparation and Wins

$6,798

1%

Benefits of Paperwork Requirements and Compliance

$119,868

15%

Benefits of Reducing Unauthorized Disclosures

$51,187

7%

Total Annual Benefits

$781,314

100%

PER LARGE FIRM

$ Million

Cost to Create Documents

$955.6

Benefits to Finding Missed or Overlooked Documents

$143.3

Benefits to Improving Document Access

$23.8

Benefits of Re-finding Web Documents

$9.7

Benefits of Proposal Preparation and Wins

$2.0

Benefits of Paperwork Requirements and Compliance

$35.1

Benefits of Reducing Unauthorized Disclosures

$15.0

Total Annual Benefits

$229.0

Table 1. Mid-range Estimates for the Annual Value of Documents, U.S. Firms, 2002[1]

The total benefit from improved document access and use to the U.S economy is on the order of $800 billion annually, or about 8% of GDP. For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm. About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation. About one-quarter of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited grants and contracts.

Indeed, even these figures likely severely underestimate the benefits to enterprises from an improved leverage of document assets. It has always been the case that the best and most successful companies have been able to make better advantage of their intellectual assets than their competitors. The competitiveness advantage from better document access and use alone may exceed the huge benefits in the table above.

Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.

Search and enterprise content management software today only represents a fraction of that amount — perhaps on the order of $500 million annually. But given that intellectual content in documents represents three to four times the amount in numeric structured data, it is clear that document software capabilities are not being well utilized, reaching only a small fraction of their market potential.

The estimates provided in this white paper are drawn from numerous sources and are extremely fragmented, perhaps even inconsistent. One hope in preparing this document was to stimulate more research attention and data gathering around the critical issues of document value to the enterprise and the economy at large.

EXECUTIVE SUMMARY

I. INTRODUCTION

Documents: The Drivers of a Knowledge Economy

Documents: The Linchpin of Corporate Intellectual Assets

Documents: Unknown Value, Huge Implications

Documents: The Next Generation of Data Warehousing?

Connecting the Dots: A Pointillistic Approach

II. INTERNAL DOCUMENTS

Number of ‘Valuable’ Documents Produced per Firm

Total Annual U.S. ‘Costs’ to Create Documents

‘Cost’ of Creating a ‘Typical’ Document

‘Cost’ of a Missed or Overlooked Document

Other Document Total ‘Cost’ Factors and Summary

Archival Lifetime of ‘Valuable’ Documents

III. WEB DOCUMENTS AND SEARCH

Estimate of Time and Effort Devoted to Document Search

Effect of Non-persistent Search Efforts

‘Cost’ of Creating and Maintaining a Document Category Portal

‘Cost’ of Inaccessible or Hidden Intranet Sites

IV. OPPORTUNITIES AND THREATS

‘Costs’ and Opportunity Costs of Winning Proposals

‘Costs’ of Regulation and Regulatory Non-compliance

‘Cost’ of an Unauthorized Posted Document

V. CONCLUSIONS

I. INTRODUCTION

How many documents does your organization create each year? What effort does this represent in terms of total staffing costs? What does it cost to create a ‘typical’ document? Of documents created, how much of the value in them is readily sharable throughout your organization? How long do you need to keep valuable documents and how can you access them? How much existing document content is re-created simply because prior work cannot be found? When prior information is missed, what do these prior investments in documents represent in terms of loss of market share, revenue or reputation? Indeed, what does the term, “document” represent in your organization’s context?

If you have difficulty answering these questions, you are not alone. Depending on the survey, from 90% to 97% of enterprises cannot answer these questions — in whole or in part. The purpose of this white paper is to provide the first comprehensive assessment ever of these document values.

Enterprises and the analyst community have historically overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important — from an embedded cost standpoint — than document handling. Second, all aspects of document creation, and later access and use, assume a much greater role in the overall economics of enterprises than have been realized previously.

Documents: The Drivers of a Knowledge Economy

Put your index finger one inch from your nose. That is how close — and unfocused — document importance is to an organization. Documents are the salient reality of a knowledge economy, but like your finger, documents are often too close, ubiquitous and commonplace to appreciate.

How do your employees earn their livings? Writing proposals? Marketing or selling? Evaluating competitors or opportunities? Persuading? Analyzing? Communicating? Teaching? Of course, in some sectors, many make their living from growing things or making things. These are essential jobs — indeed, until the last few decades were the predominant drivers of economies — but are now being supplanted in advanced economies by knowledge work. Perhaps up to 35% of all company employees in the U.S. can be classified as knowledge workers.

And knowledge work means documents. The fact is that knowledge is produced and communicated through the written word. When we search, when we write, when we persuade, we may often do so verbally but make it persistent through the written word.

Documents: The Linchpin of Corporate Intellectual Assets

IBM estimates that corporate data doubles every six to eight months, 85% of which are documents.[2] At least 10% of an enterprise’s information changes on a monthly basis.[3] Year-on-year office document growth rates are on the order of 22%.[4] As later analysis indicates, there are perhaps on the order of 10 billion documents created annually in the U.S with a mid-range “asset” value of $3.3 trillion per year. Documents are a huge contributor to the United States’ gross domestic product of $10.5 trillion (2002).

  • According to a Coopers & Lybrand study in 1993:[5]
  • Ninety percent of corporate memory exists on paper
  • Ninety percent of the papers handled each day are merely shuffled
  • Professionals spend 5-15 percent of their time reading information, but up to 50 percent looking for it
  • On average, 19 copies are made of each paper document.

A Xerox Corporation study commissioned in 2003 and conducted by IDC surveyed 1000 of the largest European companies and had similar findings:[6],[7]

  • On average 45% of an executive’s time was spent dealing with documents
  • 82% believe that documents were crucial to the successful operation of their organizations
  • A further 70% claimed that poor document processes could impact the operational agility of their organizations
  • While 83%, 78% and 76% consider faxes, email and electronic files as documents, respectively, only 48% and 46% categorize web pages and multimedia content as such.

Documents: Unknown Value, Huge Implications

But, if defining what constitutes a document is hard, identifying the costs associated with all the document activities is almost impossible for many organizations. Ninety to 97 percent of the corporate respondents to the Coopers & Lybrand and Xerox studies, respectively, could not estimate how much they spent on producing documents each year. Almost three quarters of them admit that the information is unavailable or unknown to them.

An A.T. Kearney study sponsored by Adobe, EDS, Hewlett-Packard, Mayfield and Nokia, published in 2001, estimated that workforce inefficiencies related to content publishing cost organizations globally about $750 billion. The study further estimated that knowledge workers waste between 15% to 25% of their time in non-productive document activities.[8]

Enterprise document use (SPIN)

Figure 1. The Situation of Poor Enterprise Document Use Leads to Real Implications

But the situation is much broader and results in part from the inability to quantify the importance of both internal and external document assets to all aspects of the enterprise’s bottom line. For examples drawn from the main body of this white paper, early adopters of enterprise content software typically capture less than 1% of valuable internal documents available; large enterprises are witnessing the proliferation of internal and external Web sites, sometimes exceeding thousands; use of external content is presently limited to Internet search engines, producing non-persistent results and no capture of the investment in discovery or results; and “deep” content in searchable databases, which is common to large organizations and represents 90% of external Internet content, is completely untapped.

A USC study reported that typically only 32% of employees in knowledge organizations have access to good information about technical developments relevant to their work, and 79% claim they have inadequate information about what their competitors are doing.[9]

The enterprise content integration software market is fragmented and confused, with only a few established companies providing partial solutions. Content integration is still a small market with annual revenues of less than $50 million worldwide.[10] Vendor offerings fail to satisfy customer needs because of a lack of functionality and a lack of scalability to enterprise volumes. Sales in the market remain distinctly lower than those projected by industry analysts, even as the magnitude of “information overload” continues to grow at a dramatic rate.

Documents: The Next Generation of Data Warehousing?

Documents — that is, unstructured and semi-structured data — are now at the point where structured data was at 15 years ago. At that time, companies realized that consolidating information from multiple numeric databases would be a key source of competitive advantage. That realization led to the development and growth of the data warehousing or business intelligence markets, now representing about $3.9 billion in annual software sales.[11]

Certain categories of businesses have been leaders in content integration, especially those that have recently had mergers and acquisitions activity, those that need to integrate business applications with content, and those for which the reuse of marketing assets across the organization is critical.10

Stonebraker and Hellerstein have provided an insightful roadmap for how enterprise data integration or “federation” has trended over time: Data warehousing → Enterprise application integration → Enterprise content integration → Enterprise information integration.[12] There are two threads to this trend. First, there has been a growing recognition of the importance of document (unstructured) content to contribute to actionable information. Second, increasingly unified and integrated means are being applied to all data sources to allow single-access retrievals.

Connecting the Dots: A Pointillistic Approach

The state of information regarding the value and cost of documents is extremely poor. Lack of defensible and vetted estimates for this information undercuts the ability to properly estimate the intellectual assets tied up in documents or the impacts of overlooked or misused documents.

Only three large document studies — the Coopers & Lybrand, Xerox and A.T. Kearney studies noted above — have been conducted in the past ten years regarding the use and importance of documents within enterprises, and then solely from the standpoint of executive perceptions.

The quantified picture presented in this white paper regarding the costs and benefits of document creation, access and use is a paint-by-the-numbers assemblage of disparate data. The paper draws upon about 80 different data sources, many fragmented. The analysis approach by necessity has needed to conjoin assumptions and data from many diverse sources.

This approach leads to both uncertainty regarding “true” values and likely inaccuracies or mis-estimates in some areas. To make the assessment as consistent as possible, a base year of 2002 was used, the common year reference for most of the available data sources. To bracket uncertainties, most estimates are provided in low, medium and high estimates.

Thus, this study should be viewed as preliminary, but strongly indicative of the value of documents. Further research and data collection will surely refine these estimates. Clearly, though, by any measure, the value of documents to the enterprise is significant and huge, and should not continue to be overlooked.

II. INTERNAL DOCUMENTS

Though valuable content resides everywhere, the first challenge to enterprises is getting a handle on their own internal document content.

Number of ‘Valuable’ Documents Produced per Firm

A recent UC Berkeley study on “How Much Information?” estimated that more than 4 billion pages of internal office documents with archival value are generated annually in the U.S. (Note: this is not the amount created, only those documents deemed worthy of retaining for more than one year).

Firm Size (employees)

1-9

10-19

20-99

100-499

500-999

1000-2500

2500-9999

>10,000

Firms

3,716,944

616,064

518,258

85,304

8,572

5,161

2,704

930

Employees

12,328,094

8,274,541

20,370,447

16,410,367

5,906,266

7,894,226

12,519,664

31,357,579

Knowledge Workers

2,217,093

1,488,099

3,663,435

2,951,251

1,062,187

1,419,703

2,251,545

5,639,368

Number of Pages  – Low

465,842,666

312,670,737

769,739,697

620,099,840

223,180,542

298,299,744

473,081,537

1,184,911,325

Number of Pages  – High

1,164,606,665

781,676,843

1,924,349,242

1,550,249,599

557,951,355

745,749,360

1,182,703,842

2,962,278,313

Number of Docs  – Low

46,584,267

31,267,074

76,973,970

62,009,984

22,318,054

29,829,974

47,308,154

118,491,133

Number of Docs- High

116,460,666

78,167,684

192,434,924

155,024,960

55,795,135

74,574,936

118,270,384

296,227,831

Docs/Firm  – Low

13

51

149

727

2,604

5,780

17,496

127,410

Docs/Firm  – High

31

127

371

1,817

6,509

14,450

43,739

318,525

Docs/Firm – 3 yr Low

38

152

446

2,181

7,811

17,340

52,487

382,229

Docs/Firm – 5 yr High

157

634

1,857

9,087

32,545

72,249

218,695

1,592,623

Content Management Workers

105,709

70,951

174,670

140,713

50,644

67,690

107,352

268,881

CMWs/Firm

0

0

0

2

6

13

40

289

Table 2. Document Projections for U.S. Firms by Size, 2002 Basis

Sources: UC Berkeley[13], U.S. Commerce Department[14], U.S. Bureau of Labor Statistics[15], U.S. Census Bureau[16]

Table 2 and Table 3 attempt to summarize the scale of this challenge for U.S. firms (for internal enterprise documents only). (See[17] for a description of methodology regarding document scales, note[18] for estimating the numbers of enterprise knowledge workers, and note[19] for estimating content workers. A rough multiplier of 3x to 4x can be applied to extrapolate globally.[20]) Breakouts are provided by size of firm; these include estimates for the number of knowledge and content workers within U.S. firms.

Category

Value

Firms

4,953,937

Employees

127,273,960

Knowledge Workers

20,692,680

Annual Number of Docs – Low

9,291,013,320

Annual Number of Docs- High

21,739,130,435

Annual Docs/Firm – Low

1,875

Annual Docs/Firm – High

4,388

Total Docs/Firm – 3 yr Low

1,990

Total Docs/Firm – 5 yr High

5,601

Content Management Workers

986,610

CMWs/Firm

0.2

Table 3. Total Annual Document Projections for U.S. Firms, 2002 Basis

Table 4 takes this information and breaks out distribution of document production for a ‘typical’ knowledge worker according to major document types. The data from this table is based on analysis of dozens of BrightPlanet customers averaged across about 10 million documents in various repositories.

% Based On

All

Unique

MBs

KB/Page

Pg/Doc

Pages

Docs

MBs

Pages

Archival Documents (3 yrs)
DOC

281

59

20

10.5

2,938

52%

36%

50%

PDF

46

28

14

43.6

2,017

9%

17%

34%

PPT

32

26

55

14.6

474

6%

16%

8%

XLS

178

51

100

2.7

484

33%

31%

8%

Weighted

537

164

28

11.0

5,912

100%

100%

100%

Current Documents (I yr)
DOC

221

71

20

5.1

1,127

49%

35%

32%

PDF

66

36

14

24.7

1,634

15%

18%

46%

PPT

53

76

55

12.9

687

12%

38%

20%

XLS

108

17

100

0.6

70

24%

8%

2%

Weighted

449

199

57

7.8

3,517

100%

100%

100%

Total per Employee
DOC

502

129

20

8.1

4,065

51%

36%

43%

PDF

112

64

14

32.5

3,650

11%

18%

39%

PPT

86

102

55

13.5

1,161

9%

28%

12%

XLS

285

68

100

1.9

554

29%

19%

6%

Weighted

986

363

39

9.6

9,430

100%

100%

100%

Table 4. Document Production for a ‘Typical’ Knowledge Worker

Note that word processed documents account for about 50% of typical production and storage demands. However, also note that documents of the highest archival value, as converted to PDFs for sharing and deployment, also represent about a third to two-fifths of stored documents.

Total Annual U.S. ‘Costs’ to Create Documents

Based on the information from Table 2 to Table 4 above, all updated to a common year 2002 basis, we can now estimate the total annual costs in the U.S. for creating all internal enterprise documents. The analysis is based on the UC Berkeley information and the Coopers & Lybrand studies. The “bottom up” case is based on the number of annual U.S. documents estimated based on Table 2. These results are shown in the table below:

Annual U.S. Office Documents

Number (M)

$/Document

Total $ (B)

“Bottom Up” – Low

1,387

$738.58

$1,024

“Bottom Up” – High

7,242

$141.43

$1,024

Coopers & Lybrand

11,975

$272.33

$3,261

C&L – UCB

27,737

$272.33

$7,554

C&L – “Bottom Up”

4,315

$272.33

$1,175

Average

10,531

$384.11

$3,253

Table 5. Annual U.S. Office Document Cost Estimates[21]

The average numbers above represent the average of the unique values in each column. The Table 5 analysis suggests there may be on the order of 10 billion documents created annually in the U.S with a total “asset” value on the order of $3.3 trillion per year.

‘Cost’ of Creating a ‘Typical’ Document

Based on the averages in the table above, a ‘typical’ document may cost on the order of $380 each to create.[22] Of course, a “document” can vary widely in size, complexity and time to create, and therefore its individual cost and value will vary widely. An invoice generated from an automated accounting system could be a single page and produced automatically in the thousands; proposals for very large contracts can take tens of thousands to millions of dollars to create. For examples, here are some other ‘typical’ costs for a variety of documents:

Ave. Cost

‘Typical’ Document

$384.11

Invoice

$4.43

[23]
Mortgage Application

$210.00

[24]
‘Typical’ Proposal

$17,500.00

[25]

Table 6. ‘Typical’ per Document Creation Costs

Depending on document mix and activities, individual enterprises may want to vary the average document creation costs used in their cost-benefit estimates.

‘Cost’ of a Missed or Overlooked Document

The Coopers & Lybrand study suggests that 7.5 percent of all documents are lost forever, and that it costs $120 in labor ($150 updated to 2002) to find a misfiled document;[26] other studies suggest that 5% to 6% of documents are routinely misplaced or misfiled.

In fact, the extent of this problem is unknown and is affirmed by the Xerox results:[27]

  • Almost three quarters of corporate respondents admit that the information is unavailable or unknown to them
  • 95% of the companies are not able to estimate the cost of wasted or unused documents
  • On average 19% of printed documents were wasted.

Other Document Total ‘Cost’ Factors and Summary

Five independent studies suggest that, on average, organizations spend from 5% to 15% of total company revenue on handling documents.27,[28],[29],[30],[31] These seemingly innocuous percentages can translate into huge bottom-line impacts for U.S. enterprises. For example, the total GDP of the United States was on the order of $10.5 trillion at the end of 2002.[32] Translating this value into the results of Table 5 and the information in previous sections indicates the importance of document creation and handling for U.S enterprises:

Low

Medium

High

Total U.S. Gross Domestic Product ($B)

$10,487

$10,487

$10,487

Total Document Handling ($B)

$524

$1,049

$1,573

% of total GDP:

5.0%

10.0%

15.0%

Total Document Creation ($B)

$1,100

$3,261

$7,554

% of total GDP:

10.5%

31.1%

72.0%

Total Document Misfiled ($B)

$32

$81

$160

% of total GDP:

0.3%

0.8%

1.5%

ALL U.S. Document Burdens ($B)

$1,656

$4,390

$9,287

% of total GDP:

15.8%

41.9%

88.6%

Table 7. Range Estimates for Total U.S. Document Burdens in Enterprises, 2002[33]

A few observations relate to this table. First, enterprises and the analyst community have greatly overlooked the impact of document creation as opposed to document handling. Document creation is about 2-3 times more important  – from an embedded cost standpoint  – than document handling. Second, all aspects of document creation assume a much greater role in the overall economics of enterprises than has been realized previously.

The fact that documents have received so little management attention, awareness, measurement and direct attention to improve performance is shocking.

Archival Lifetime of ‘Valuable’ Documents

The ‘low’ and ‘high’ estimates for documents in Table 2 and Table 3 assume that 2% and 5%, respectively, of internal documents have archival value. Were these percentages to be higher, the volume of documents requiring integration and access would likewise increase. The 2% value is derived from the UC Berkeley study,[34] which also refers to an unpublished European study that places archival amounts at 10%. Unfortunately, there is little empirical information to support the degree to which documents deserve to be kept for archival purposes.

Assuming that documents may retain value for three to five years, the largest firms perhaps have as many as 4 million internal documents on average with enterprise-wide value. Firms with fewer employees generally have lower document counts. Archival percentages, however, are a tricky matter, since apparently 85% of all archived documents are accessed.[35]

III. WEB DOCUMENTS AND SEARCH

Various estimates by Cowles/Simba,[36] Veronis, Suhler & Associates,[37] and Outsell[38] place the current market for on line business information in the $30 billion to $140 billion range, with significant projected growth. Outsell also indicates that marketing, sales, and product development professionals rely most heavily on information from the Internet for their daily decision making, based on a comparative study of Fortune 500 business professionals’ use of the open Web and fee-based desktop information content services.[39] Clearly, relevant and targeted content, much of which resides on line, has extreme value to enterprises.

UC Berkeley estimates that about 500 petabytes of new information was published on the Web in 2002,34 based on original analysis conducted by BrightPlanet.[40] The compound growth rate in Web documents has been on the order of more than 200% annually.[41] Estimates for deep Web content range from about 6-8 times larger [42] to 500 times larger40 than standard “surface web” content. The size of Internet content is overwhelming, of highly variable quality, growing at a rapid pace, and with much of its content ephemeral.

Estimate of Time and Effort Devoted to Document Search

According to a recent study by iProspect, about 56 percent of users use search engines every day, based on a population of which more than 70 percent use the Internet more than 10 hours per week. Professionals abandon a current search 38% of the time after inspecting only one results page (the listing of document result URLs), and overall 82% of users attempt another search if relevant results are not found within the first three results pages. Just 13 percent of users said that they use different search engines for different types of searches.[43] Only 7.5 percent of Internet users said they refined their search with additional keywords in cases where they were unable to achieve satisfactory results.[44]

The average knowledge worker spends 2.3 hrs per day  – or about 25% of work time  – searching for critical job information.[45] IDC estimates that enterprises employing 1,000 knowledge workers waste well over $6 million per year each in searching for information that does not exist, failing to find information that does, or recreating information that could have been found but was not.[46] As that report stated, “It is simply impossible to create knowledge from information that cannot be found or retrieved.”

Vendors and customers often use time savings by knowledge workers as a key rationale for justifying a document or content initiative. This comes about because many studies over the years have noted that white collar employees spend a consistent 20% to 25% of their time seeking information; the premise is that more effective search will save time and drop these percentages. As a sample calculation, each 1% reduction in time devoted to search produces:

$50,000 (base salary) * 1.8 (burden rate) * 1.0% = $900/ employee

The stable percentage effort devoted to search over time suggests it is the “satisficing” allocation. (In other words, knowledge workers are willing to devote a quarter of their time to finding relevant information.) Thus, while better tools to aid better discovery may lead to finding better information and making better decisions more productively  – a far more important justification in itself  – there may not result a strict time or labor savings from more efficient search.[47]

Effect of Non-persistent Search Efforts

The percentage of Web page visits that are re-visits is estimated at between 58%[48] and 80%.[49] While many of these re-visitations occur shortly after the first visit (e.g., during the same session using the back button), a significant number occur after a considerable amount of time has elapsed. Thus, it is not surprising that a survey of problems using the Web found “Not being able to find a page I know is out there,” and “Not being able to return to a page I once visited,” accounted for 17% of the problems reported, and that the most common problem using bookmarks was, “Changed content.”[50] Depending on the content type, users use either “direct” or “indirect” approaches to re-find previously discovered information:

Direct

Indirect

Specific Information

42%

58%

General Information

58%

43%

Specific Documents

29%

71%

Web Documents

77%

23%

Emails

9%

91%

Table 8. General Approaches to Re-finding Previously Discovered Information [51]

Direct approaches require remembering or specifically noting the specific location of the information. Direct approaches include: direct entry; emailing to self; emailing to others; printing out; saving as file; pasting the URL into a document; and posting to a personal Web site.

Indirect approaches include: searching; looking through bookmarks; and recalling from a history file. All of these indirect approaches are supported by modern browsers. Note that re-finding Web pages or documents relies heavily on having a record of a previously visited URL.

As a University of Washington study supported by Microsoft discovered, all of the specific direct and indirect techniques applied to these re-discovery approaches have significant drawbacks in terms of desired functions for the recall process: [52]

Portability No of Access Points Persistence Preservation Currency Context Reminding Ease of Integration Communication Ease of Maintenance

DIRECT APPROACHES

Direct Entry

Low

High

Low

Med

High

Low

Low

?

Low

High

Email to Self

Low

High

Low

Med

High

High

High

Med

Low

Med

Email to Others

Low

High

Low

Med

High

High

Low

Low?

High

High

Print-out

High

High

High

Low

Low

Low

High

Med

High

Med

Save as File

Med?

Low?

High

High

Low

Low

Low

Med?

Low

Med

Paste URL in Doc

Low

Low?

Low

Med

High

High

High?

High?

Low

High

Personal Web Site

Low

High

Low

Med

High

High

High?

High

Med

High?

INDIRECT APPROACHES

Search

Low

High

Low

Med

High

Low

Low

?

Low

High

Bookmark

Low

Low

Low

Med

High

Low

Low

Low

Low

Low

History

Low

Low

Low

Med

High

Low

Low

Low?

Low

?

Table 9. Strengths and Weakness of Existing Techniques to Re-use Web Information

The general observation is that no present technique is able alone to keep search persistent, current or maintain context. These combined inadequacies mean that previously found information is not easily found again, or re-discovered, as the following table shows:

Percent

Information No Longer Available

37%

Re-tracing Path Fails

14%

Time Length Since Last Find

9%

Other Failure Reasons

9%

Total Information Lost

68%

Success Finding Lost Information

32%

Table 10. Success in Finding Important Earlier Found Web Information [53]

This table has a number of important observations. First, some 37% of previously found information disappears from the Web, consistent with other findings that estimate about 40% of all Web content disappears annually, some of which has historical or archival value.[54]

Second, and most importantly, nearly 70% of previously found valuable information cannot be rediscovered again. More than half of this problem is because the information is no longer available on the Web, but other reasons relate to the inadequacies of recall techniques for finding previously discovered information.

These observations can translate into some relatively huge costs on a per employee and per enterprise basis, as the table below shows:

Per Knowledge Worker

Per ‘Large’

All

Per Doc

All Docs

Enterprise ($000)

Enterprises ($M)

Re-finding Documents

$148.54

$585

$3,547

$12,103

Re-creating Documents

$384.11

$1,008

$6,114

$20,864

TOTAL

$1,593

$9,661

$32,967

Table 11. ‘Cost’ of Not Readily Re-finding Valuable Web Information

This analysis assumes that some previously found information of value is again re-found (60%), but some is also not re-found and must be re-created (40%).[55] The ‘large’ enterprise is identical to the definition in Table 2 (which is also nearly equivalent to a Fortune 1000 company).[56]

The analysis indicates that poor methods to recall previously found and valuable Web documents may cost $1,600 per knowledge worker per year. This translates into nearly a $10 million productivity loss for the largest enterprises, or nearly $33 billion across all U.S. industries.

In relation to the total document costs noted in Table 7 above, these may seem to be comparatively small numbers. However, when viewed in the context of unproductive standard Web search, they indicate important failings in the ability to recall previously found valuable results from searches and their attendant productivity losses.

‘Cost’ of Creating and Maintaining a Document Category Portal

Users, administrators and industry analysts alike recognize the importance of placing content into logical, intuitive and hierarchically organized categories. About 60% of knowledge workers note that search is a difficult process, made all the more difficult without a logical organization to content.[57] While technical distinctions exist, these logical structures organized into a hierarchical presentation are most often referred to as “taxonomies,” though other terms such as ontology, subject directory, subject tree, directory structure or classification schema may be used.

Delphi Group’s research with corporate Web sites points to the lack of organized information as the number one problem in the opinion of business professionals. More than three-quarters of the surveyed corporations indicated that a taxonomy or classification system for documents is imperative or somewhat important to their business strategy; more than one-third of firms that classify documents still use manual techniques.57 Hierarchical arrangements of categorized subjects trigger associations and relationships that are not obvious when simply searching keywords. Other advantages cited for the taxonomic presentation of documents are the greater likelihood of discovery, ease-of-use, overcoming the difficulty of formulating effective search queries, being able to search only within related documents, discovery of relationships among similar terminology and concepts, and user satisfaction.[58],[59]

From the user standpoint, knowledge workers want to impose taxonomic order on document chaos, but only if the taxonomy models their domain accurately. They also want software to assist with categorizing, as long as it respects the taxonomy they created. Finally, the results of these category placements should be presented via a portal. Thus, as the common concern across all requirements, the taxonomy takes on tremendous importance for an application’s success.[60]

Large firm documents

Figure 2. Typical Large Firm Documents, Thousands

Enterprises that have adopted directory structures for content management are not yet achieving enterprise-wide relevance, presenting on average 1% of all relevant documents in an organized portal view. These limitations appear to be driven by weaknesses in the technology and high costs associated with conventional approaches:

  • Comprehensiveness and Scale – according to a market report published by Plumtree in 2003, the average document portal contains about 37,000 documents.[61] This was an increase from a 2002 Plumtree survey that indicated average document counts of 18,000.[62] However, about 60% of respondents to a Delphi Group survey said they had more than 50,000 internal documents in their portal environment (generally the department level), 3 and as Table 2 indicates above, most of the largest firms likely have millions or more internal documents deserving of common access and archiving.
  • The left-hand bar in Figure 2 indicates current averages for documents in existing content portals. The right-hand (yellow and orange) bar indicates potential based on high and low estimates. The ‘Archive’ case (middle bar) show the same values as provided in Table 2, and represent a conservative view of “archival-likely” documents. The right bar is a more representative view of actual current internal content that enterprises may want to make available to their employees.[63] Two observations have merit: 1) under current practice, enterprises are at most making 10% of their useful documents available, and more likely slightly over 1%; 2) the documents that are being made available are solely internal, and neglect potentially important external sources that would increase document counts considerably.
  • Implementation Times – though average time to stand-up a new content installation is about 6 months, there is also a 22% risk that deployment times exceeds that and an 8% risk it takes longer than one year. Furthermore, internal staff necessary for initial stand-up average nearly 14 people (6 of whom are strictly devoted to content development), with the potential for much larger head counts[64]
  • Ongoing Maintenance and Staffing Costs – ongoing maintenance and staffing costs typically exceed the initial deployment effort. This trend is perhaps not surprising in that once a valuable content portal has been created there will be demands to expand its scope and coverage. Based on these various factors, Table 12 summarizes set-up, ongoing maintenance and key metrics for today’s conventional approaches versus what BrightPlanet can do (the BrightPlanet document count is based on a ‘typical’ installation; there are no practical scale limits)

DOCUMENT

INITIAL SET-UP

MAINTENANCE

BASIS

Staff

Mos

$/Doc

Staff

$/Doc

Current Practice

37,000

6.2

5.4

$4.861

6.4

$11.278

BrightPlanet

250,000

1.0

0.8

$0.017

0.3

$0.078

BP Advantage

6.8 x + up

6.2 x

6.7 x

280.4 x

21.4 x

144.6 x

Table 12. Staff, Time and per Document Costs for Categorized Document Portals

  • The content staff level estimates in the table are consistent with anecdotal information and with a survey of 40 installations that found there were on average 14 content development staff managing each enterprise’s content portal.[65]

Though conventional approaches to content integration seem to lead to high per document set-up and maintenance costs, these should be contrasted with standard practice that suggests it may cost on average $25 to $40 per document simply for filing.29 Indeed, labor costs can account for up to 30% of total document handling costs.28 Nonetheless, at $5 to $11 per document for content management alone, this could result in no actual cost savings if electronic access does not displace current filing practices. When multiplied across all enterprise documents, these uncertainties can translate into huge swings in costs or benefits for a content portal initiative.

  • Software License v. Full Project Costs – according to Charles Phillips of Morgan Stanley, only 30% of the money spent on major software projects goes to the actual purchase of commercially packaged software. Another third goes to internal software development by companies. The remaining 37% goes to third-party consultants.[66] In evaluating a commitment, internal staff and consulting time should be carefully scrutinized. Efficiencies in initial deployment and ongoing support are the biggest cost drivers
  • Internal PLUS External Sources – weaknesses in scalability and high implementation costs often lead to a dismissal of the importance of integrating internal plus external content. Few installations address relevant content external to the enterprise essential to achieving its missions. Granted, the increase in scales associated with external content are large, but for some businesses integration with external content may be essential.

While other vendors claim fast categorization times, what they fail to mention is the lengthy pre-processing times necessary for generating their categorization metatags. According to Forrester Research, some of these metatagging systems can only process five to 15 documents per hour![67]

‘Cost’ of Inaccessible or Hidden Intranet Sites

In 2003, the portal vendor Plumtree noticed a new trend that it called “Web sprawl,” by which it meant the costly proliferation of Web applications, intranets and extranets.[68] BEA has taken up this trend as a major thrust to its Web service offerings through an approach it calls “enterprise portal rationalization” (EPR).[69] According to BEA, its architectural offerings are meant to control the “metastasizing” of corporate Web sites.

How common and to what scale is the proliferation of enterprise Web sites? I have not been able to find any comprehensive studies on this topic, but has been able to find many anecdotal examples. The proliferation, in fact, began as soon as the Internet became popular:

  • As reported in 2000, Intel had more than 1 million URLs on its intranet with more than 100 new Web sites being introduced each month[70]
  • In 2002, IBM consolidated over 8,000 intranet sites, 680 ‘major’ sites, 11 million Web pages and 5,600 domain names into what it calls the IBM Dynamic Workplaces, or W3 to employees[71]
  • Silicon Graphics’ ‘Silicon Junction’ company-wide portal serves 7,200 employees with 144,000 Web pages consolidated from more than 800 internal Web sites[72]
  • Hewlett-Packard Co., for example, has sliced the number of internal Web sites it runs from 4,700 (1,000 for employee training, 3,000 for HR) to 2,600, and it makes them all accessible from one home, @HP [73],[74]
  • Avaya Corporation is now consolidating more than 800 internal Web sites globally[75]
  • The Wall Street Journal recently reported that AT&T has 10 information architects on staff to maintain its 3,600 intranet sets that contain 1.5 million public Web pages[76]
  • The new Department of Homeland Security is faced with the challenge of consolidating more than 3,000 databases inherited from its various constituent agencies.[77]

BrightPlanet’s customers confirm these trends, with indicators of hundreds if not thousands of internal Web sites common in the largest companies. Indeed, it is surprising how many instances there are where corporate IT does not even know the full extent of Web site proliferation. The problem is likely much greater than realized:

Low

Med

High

Number of Large Firms

930

1,500

3,000

Ave Number of Web Sites per Firm

100

500

900

Ave. Number of Documents per Web Site

100

350

1,500

Total Large Firm Web Sites

93,000

750,000

2,700,000

Percentage of Known Web Sites

85%

60%

40%

Percentage of Doc Federation for Known Sites

50%

10%

2%

Site Development & Maintenance
Development Cost per Web Site

$300

$1,701

$9,000

Annual Maintenance Cost per Site

$800

$3,947

$21,000

Total Yr 1 Cost per Site

$1,100

$5,649

$30,000

Total Yr 1 per Large Firm Costs ($000)

$110

$2,824

$27,000

Total Yr 1 Large Firm Costs ($M)

$102

$4,237

$81,000

‘Cost’ of Unfound Documents
No. of Unknown Documents per Firm

5,750

80,500

820,800

Total Number of Large Firm Unknown Docs

5,347,500

120,750,000

2,462,400,000

Total Cost per Web Site

$6,900

$23,915

$350,310

Cost of Unknown Docs per Firm ($000)

$690

$11,958

$315,279

Total Cost of Large Firm Unknown Docs ($M)

$642

$17,937

$945,837

Summary
Total Cost per Firm ($000)

$800

$14,782

$342,279

Total Cost all Large Firms ($M)

$744

$22,173

$1,026,837

Development as % of Total Costs

14%

19%

8%

Unfound Documents as % of Total Costs

86%

81%

92%

Table 13. Development and Unfound Document ‘Costs’ for Large Firms due to Web Sprawl

Table 13 consolidates previous information to estimate what the ‘costs’ of Web sprawl might be to larger firms (analogous to the Fortune 1000). The table presents Low, Medium and High estimates for number of Web sites per firm, known and unknown documents in each, and associated costs for initial site development and first-year maintenance plus the value of unfound information. The Medium category uses the average values from previous tables. The Low and High values bracket these amounts based on distribution of known values and expert judgment.

The table indicates as a mid-range estimate that an individual Web site for a large enterprise may cost about $6,000 to set-up and maintain in the first year and represents $24,000 in opportunity costs due to unknown or unfound documents. For the average large enterprise across all Web sites, these costs may be $4.2 million and $12.0 million, respectively. Across all large firms, total costs due to Web sprawl may be on the order of $22 billion.

While site development and maintenance costs are not trivial, exceeding $4 billion for all large firms (which can also be significantly reduced  – see previous section), the major cost impact comes from the inability to find or federate the information that is available. Unfound documents represent well in excess of 80% of the costs associated with Web sprawl.

The Web sprawl situation is analogous to other major technology shifts. For example, in the early 1980s, IT grappled mightily with the proliferation of personal computers. Centralized control was impossible in that circumstance because individuals and departments recognized the productivity benefits to be gained by PCs. Only when enterprise-capable vendors of networking technology, such as Novell, were able to offer integration solutions was the corporation able to control and fully exploit the PC’s technology potential.

The proliferation of internal enterprise Web sites is responding to similar drivers: innovation, customer service, or superior methods of product or solutions delivery. Ambitious mid-level managers will continue to exploit these advantages by “cowboy” additions of more corporate Web sites, and that is likely to the good for most enterprises. Gaining control and fully realizing the value of this Web site proliferation  – while not stymieing innovation  – will likely require enabling technology analogous to the networking of PCs.

IV. OPPORTUNITIES AND THREATS

The previous analysis has focused on more-or-less direct costs and drivers. These impacts are huge and deserve proper consideration. But there are other implications from the inability to access and manage relevant document information. These implications fall into the categories of lost opportunities, liabilities, or non-compliance. These implications often far outweigh the direct costs in their bottom-line impacts. This section presents only a few of these many opportunities.

‘Costs’ and Opportunity Costs of Winning Proposals

Competitive proposals are an important revenue factor to hundreds of thousands of businesses. Indeed, contracts and grants from federal, state and local governments accounted for 12.1% of GDP in 2002; the amount competitively awarded equaled about 5.6% of GDP.[78] Reducing the fully-burdened costs of producing responses to competitive procurements and improving the rate of successfully obtaining them can be a huge competitive advantage to business.

Significant proportions of commercial projects and programs are likewise awarded through competitive proposals and bids. However, literature references to these are limited, and the remainder of this section relies on federal sector statistics as a proxy for the overall category.

Though the federal government is making strides in providing central clearinghouses to opportunities  – and is also doing much in moving to uniform application standards and electronic application submissions  – these efforts are still in their nascent stages and similar efforts at the state and local level are severely lagging. As a result, the magnitude of the proposal opportunity is perhaps largely unknown to many businesses. This lack of appreciation and attention to the cost- and success-drivers behind winning proposals is a real gap in the competitiveness of many individual businesses.

Table 14 on the following page consolidates information from many government sources to quantify the magnitude of this competitively-awarded grant and contract opportunity with governments.

Number of Awards

Amount ($000)

Federal Government
Total Grants

1,335,813

$441,037,633

[79] [80]
Total Contract Procurements

1,155,096

$327,413,076

Competitively-awarded Grants

336,091

$99,234,657

[81]
Competitively-awarded Procurements

909,087

$231,878,136

[82]
Total Competitive Opportunities

1,245,179

$331,112,793

Ave Competitive Opportunity

$266

[83]
State & Local Government [84] [85]
Total Grants

757,199

$190,000,000

Total Contract Procurements

1,439,031

$310,000,000

Competitively-awarded Grants

190,512

$42,750,512

[86]
Competitively-awarded Procurements

1,132,551

$219,545,972

Total Competitive Opportunities

1,323,063

$262,296,485

Ave Competitive Opportunity

$198

Total (no B-to-B)
Competitively-awarded Grants

526,603

$141,985,169

Competitively-awarded Procurements

2,041,638

$451,424,108

Total Competitive Opportunities

2,568,241

$593,409,277

Ave Competitive Opportunity

$231

Table 14. Federal, State & Local Contract and Grant Opportunities, 2002

This analysis suggests there are nearly $600 billion available each year for competitively awarded grants and procurements from all levels of government within the U.S.; about 60% from the federal sector. The average competitive award is about $270 K for grants; about $220 K for contract procurements.

Aside from construction firms (which are excluded in this and prior analyses), there are on the order of 92,500 federal contract-seeking firms today.[87] In 2003, the top 200 federal contracting firms accounted for nearly $190 billion in contract outlays.[88] While it is unclear what proportion of these commitments were competitive (81% of total federal commitments) or based on all contract procurements (57% of total federal commitments), it is clear that more than 90,000 firms are competing via a classic power curve for a minor portion of available federal revenues. This power curve is shown in Figure 3 below for the 200 largest federal contractors, which obtain a proportionately high percentage of all contract dollars.

Power curve distribution of Fedeeral contractors

Figure 3. Power Curve Distribution of Top 200 Federal Contractors by Revenue, 2002

The combination of these factors enables an estimate of the bottom-line proposal impacts by firm. This information is shown in the table below:

Number

Amount ($000)

Total Competitive Awards
Federal

1,245,179

$331,112,793

[89]
State & Local

1,323,063

$262,296,485

Number of Competing Firms

120,250

[90]
Number of Winning Firms

90,805

Number of Winning Proposals

2,326,485

Number of Submitted Proposals

11,211,974

Direct Proposal Preparation Costs
Winning Proposal Preparation

$5,021,357

[91]
Losing Proposals Preparation

$16,939,516

TOTAL Proposal Preparation

$21,960,873

Low

Med

High

Improvement in RFP Development

7.5%

15.0%

35.0%

[92]
Proposal Preparation
Benefits – Individual Submitters ($000)

$14

$27

$64

Benefits – All Submitters ($000)

$1,647,065

$3,294,131

$7,686,305

Proposal Success Benefits
Increase in Number of Winning Submissions

6,810

13,621

31,782

[93]
Increase in Number of Winning Firms

1,406

2,812

6,562

[94]
Benefits – Individual Submitters ($000)

$1,235

$1,235

$1,235

Benefits – All Submitters ($000)

$1,737,101

$3,474,203

$8,106,473

Benefits – All Submitters/All Aspects

$3,384,167

$6,768,334

$15,792,778

Table 15. Combined Preparation Costs and Opportunity Costs for Proposals

Across all entities, the annual cost of preparing proposals to competitive solicitations from government agencies at all levels is on the order of $22 billion, $5 billion for winning firms and $17 billion for losing firms. Better access to missing information and better information  – assuming no change in the underlying ideas or proposal-writing skills  – suggests that proposal response costs could be reduced by more than $3 billion annually. Another $3 billion annually is available for better winning of competitive proposals. Individual benefits to firms that respond to competitive solicitations is on average $1.25 million per competing firm.[95]

The more significant benefit to individual firms from improved access to “missing” information and better information is increasing the likelihood of winning a competitive award. Firms that embrace these practices are estimated to obtain a $1.2 million annual benefit. Given that many firms that have previously been losing awards have relatively low annual revenues, the percent impact on the bottom line can be quite striking due to improved proposal preparation information.

‘Costs’ of Regulation and Regulatory Non-compliance

A December 2001 small business poll by the National Federation of Independent Business (NFIB) gauged the impacts of the regulatory workload on firms. When asked “is government regulation a very serious, somewhat serious, not too serious, or not at all serious problem for your business,” nearly half, or 43.6 percent, answered “very serious” or “somewhat serious.” The respondents indicated the most serious regulatory problems were at the federal level (49 %), state level (35 %) or local level (13%) of government. The biggest single regulatory problem cited was extra paperwork, followed by difficulty understanding how to comply with regulations and dollars spent doing so.[96] A later December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[97]

Type of Regulation

All Firms

<20 Employees

20-499 Employees

500+ Employees

All Federal Regulations

$5,107

$7,544

$4,671

$4,827

Environmental

$1,312

$3,600

$1,269

$776

Economic

$2,234

$1,748

$1,782

$2,688

Workplace

$843

$897

$944

$755

Tax Compliance

$719

$1,300

$676

$608

Table 16. Per Employee Costs of Federal Regulation by Firm Size, 2002

According to a 2001 report, “The Impact of Regulatory Costs on Small Firms” by W. Mark Crain and Thomas D. Hopkins, the total costs of Federal regulations were estimated to be $843 billion in 2000, or 8 percent of the U. S. Gross Domestic Product. Of these costs, $497 billion fell on business and $346 billion fell on consumers or other governments. Here are how those impacts are estimated on a per employee basis across a range of firm sizes:[98]

As of September 30, 2002, federal agencies estimated there were about 8.2 billion “burden hours” of paperwork government-wide. Almost 95 percent of those 8.2 billion hours were being collected primarily for the purpose of regulatory compliance. [99]

Burden Hrs (million)

Labor Costs ($M)

Total Government

8,223.17

$318,237

Total Gov (excl. Treasury)

1,472.74

$56,995

Treasury

6,750.43

$261,242

Transportation

244.73

$9,471

HHS

224.83

$8,701

Labor

189.22

$7,323

EPA

140.47

$5,436

Defense

92.36

$3,574

Agriculture

88.59

$3,428

Justice

46.60

$1,803

Education

38.44

$1,488

State

29.23

$1,131

HUD

21.93

$849

Commerce

11.65

$451

Interior

7.66

$296

Energy

3.76

$146

SEC

136.58

$5,286

FTC

69.66

$2,696

FCC

26.80

$1,037

SSA

24.89

$963

FAR (contracts)

24.49

$948

FCIC

9.87

$382

NRC

8.34

$323

FEMA

7.77

$301

Veterans Administration

7.31

$283

NASA

5.95

$230

NSF

4.46

$173

FERC

4.38

$170

SBA

2.77

$107

Table 17. Federal Government Paperwork Burdens, 2002[100]

A December 2003 NFIB survey indicates that the average cost per hour of complying with paperwork requirements was $48.72.[101] If these costs are substituted, the total cost burden in the table above would be about $400 billion, $71 billion of which excludes Treasury and the IRS.

Despite legislation requiring federal paperwork reduction and embracing of e-government initiatives, paperwork burdens continue to increase. Total burden hours in 2002, for example, increased 600 million hours, or about 4 percent, from the previous year. The Code of Federal Regulations (CFR) continues to expand despite efforts to curtail further growth. The CFR grew from 71,000 pages in 1975 to 135,000 pages in 1998. Annually, there are more than 4,000 regulatory changes introduced by the federal government. The federal government now has over 8,000 separate information collection requests authorized by OMB.[102]

Federal Source

Fines ($ 000)

Internal Revenue Service

$4,119,622

[103]

Corporate Income

$1,120,531

Employment Taxes

$2,691,021

Excise Taxes

$200,585

Other Taxes

$107,486

[104]
Agriculture

$2,000

Economic Stabilization

$9,000

Labor & Immigration

$72,000

Commerce & Customs (excl SEC)

$22,000

SEC

$101,000

[105]

Narcotics & Alcohol

$2,000

Mine Safety

$18,000

Environmental Protection

$212,000

[106]

Miscellaneous

$1,000

Other

$448,000

TOTAL

$5,006,622

Table 18. Federal Fines and Penalties to Corporations, 2002

Another source of costs to enterprises are civil penalties and fines for non-compliance with existing regulations, as shown in the table above for 2002 by agency. A total of $5 billion annually is expended by U.S. businesses for civil penalties due to non-compliance with federal regulation, $1 billion of which is due to non-tax purposes.

However, these estimates may undercount actual fines and penalties levied by the federal government due to the accounting basis of the OMB source. For example, the Department of Labor (DOL) collected fines and penalties totaling $175 million from employers in fiscal year 2002 for Fair Labor Standards Act (FLSA) violations.[107] According to a 2002 report, since 1990, 43 of the government’s top contractors paid approximately $3.4 billion in fines/penalties, restitution, and settlements.[108] And, according to another report, the corporations liable to the top 100 False Claims Act paid more than $12 billion since 1986.[109] Since there is no central clearinghouse for this information, with both individual agency general counsels and the Department of Justice responsible for actual collections, the figures in Table 18 should be interpreted as estimates.

Table 19 on the next page consolidates the information in Table 16 to Table 18 to estimate the overall regulatory and paperwork burdens on U.S. businesses, plus estimates of the benefits to be gained from better document access and use.

‘Cost’ of an Unauthorized Posted Document

Unauthorized information disclosures derive mainly from within an organization. The ease of electronic record duplication and dissemination  – particularly through postings on enterprise Web sites  – increases a firm’s vulnerability to this problem. Records mutate and propagate in poorly controlled environments. On average, unauthorized disclosure of confidential information costs Fortune 1000 companies about $15 million per company per year.[110]

A few privacy laws demonstrate the potential liabilities associated with disclosure of confidential information due to inadvertent mistakes or disgruntled employees. As one example, the Health Insurance Portability and Accountability Act (HIPAA) of 1996 sets security standards protecting the confidentiality and integrity of “individually identifiable health information,” past, present or future. Failure to comply with any of the electronic data, security, or privacy standards can result in civil monetary penalties up to $25,000 per standard per year. Violation of the privacy regulations for commercial or malicious purposes can result in criminal penalties of $50,000 to $250,000 in fines and one to ten years of imprisonment.[111]

Amount ($000)

Total Federal Paperwork Burden (non-tax)

$56,995,038

[112]
Total Federal Other Regulatory Burden

$331,791,551

[113]
Total Federal Fines and Penalties

$5,006,622

[114]
Total State and Local Paperwork Burden (non-tax)

$32,059,709

[115]
Total State and Local Other Regulatory Burden

$186,632,748

Total State and Local Fines and Penalties

$2,816,225

Low

Med

High

Improvements Due to Better Information

7.5%

15.0%

35.0%

Paperwork Burdens (non-tax)
Benefits per Large Firm

$1,957

$3,915

$9,134

[116]
Benefits – All Firms

$6,679,106

$13,358,212

$31,169,161

Other Regulatory Burdens
Benefits per Large Firm

$11,394

$22,788

$53,172

Benefits – All Firms

$38,881,822

$77,763,645

$181,448,505

Reductions in Fines and Penalties
Benefits per Large Firm

$4,212

$8,424

$19,655

Benefits – All Firms

$14,372,953

$28,745,905

$67,073,779

TOTAL – All Regulatory Burdens
Benefits per Large Firm

$17,563

$35,126

$81,962

Benefits – All Firms

$59,933,881

$119,867,762

$279,691,445

Table 19. Regulatory Burden and Benefits to Firms from Improved Information

As another example, the Gramm-Leach-Bliley Act (GLBA) of 1999 mandates the financial industry to create guidelines for the safeguarding of customer information. GLBA includes severe civil and criminal penalties for non-compliance, with civil penalties up to $100,000 for each violation and key officers may be fined up to $10,000 per violation. Violation of the GLBA can also carry hefty sanctions, including termination of FDIC insurance and fines of up to $1,000,000 for an individual or one percent of the total assets of the financial institution.[117]

Other major areas of unauthorized disclosure liability occur in national security, identity theft, and commerce, tax and Social Security information. Indeed, virtually every state and federal agency related to a company’s business has policies and fines regarding unauthorized disclosures. Monitoring these requirements is thus an imperative for enterprise management to prevent exposure to fines and loss of reputation.

On a less-quantifiable basis there are also risks about the clarity of the enterprise message to customers, suppliers and partners. Unmanaged Web sprawl is a critical hole for enterprises to ensure compliance with privacy and confidentiality regulations, and to promote clarity of message and accuracy to stakeholders.

V. CONCLUSIONS

Prior to the analysis in this white paper, the state of understanding about the value of document assets had been abysmal. While still preliminary and subject to much improvement, this study has nonetheless found:

  • The value of documents  – in their creation, access and use  – can indeed be measured
  • The information contained within U.S. enterprise documents represents about a third of gross domestic product, or an amount of about $3.3 trillion annually
  • Some 25% of all of these expenditures lend themselves to actionable improvements
  • There are perhaps on the order of 10 billion documents created annually in the U.S.
  • Corporate data doubles every six to eight months; 85% of this data is contained in documents
  • Ninety to 97 percent of enterprises cannot estimate how much they spend on producing documents each year
  • Document creation is about 2-3 times more important  – from an embedded cost standpoint  – than document handling
  • It costs, on average, $350 to create a ‘typical’ document
  • The total potential benefit from practical improvements in document access and use to the U.S economy is on the order of $800 billion annually, or about 8% of GDP
  • For the 1,000 largest U.S. firms, benefits from these improvements can approach nearly $250 million annually per firm
  • About three-quarters of these benefits arise from not re-creating the intellectual capital already invested in prior document creation
  • Another 25% of the benefits are due to reduced regulatory non-compliance or paperwork, or better competitiveness in obtaining solicited contracts and grants
  • $33 billion is wasted each year in re-finding previously found Web documents
  • Paperwork and regulatory improvements due to documents can save U.S. enterprises $120 billion each year
  • Lack of document access due to Web sprawl costs U.S. enterprises $22 billion each year
  • $8 billion in annual benefits is available due to document improvements for competitive governmental grant and contract solicitations
  • These figures likely severely underestimate the benefits to enterprises from improved competitiveness, a factor not analyzed in this study
  • Documents are now at the point where structured data was at 15 years ago at the nascent emergence of the data warehousing market.

As noted throughout, there is a considerable need for additional research and data on document creation, use, costs and benefits. Additional technical endnotes are provided in the PDF version of the full paper.


[1] All sources and assumptions are fully documented in footnotes in the main body of this white paper; general assumptions used in multiple tables are provided in the Technical Endnotes.

[2] As quoted by Armando Garcia, vice president of content management at IBM; see http://www.contentworld.com/conference/conthur.html

[3] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[4] Based on the 1999 to 2001 estimate changes in reference 34, Table 2-6.

[5] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp

[6] J. Snowdon, Documents The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf

[7] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf

[8] A.T. Kearney, Network Publishing: Creating Value Through Digital Content, A.T. Kearney White Paper, April 2001, 32 pp. See http://www.adobe.com/aboutadobe/pressroom/pressmaterials/networkpublishing/pdfs/netpubwh.pdf.

[9] S.A. Mohrman and D.L. Finegold, Strategies for the Knowledge Economy: From Rhetoric to Reality, 2000,http://www.marshall.usc.edu/ceo/Books/pdf/knowledge_economy.pdf. University of Southern California study as supported by Korn/Ferry International, January 2000, 43 pp. See

[10] C. Moore, TheContent Integration Imperative, Forrester Research Trends Report, March 26, 2004, 14 pp.

[11] D. Vesset, Worldwide Business Intelligence Forecast and Anal ysis, 2003-2007, International Data Corporation, June 2003, 18 pp. See http://www.dwway.com/file/20030708085453_IDC_WW-BIFORECASTANDANALYSIS2003-07_JUN03.pdf.

[12] M. Stonebraker and J. Hellerstein, “Content Integration for E-Business,” in ACM SIGMOD Proceedings, Santa Barbara, CA, pp. 552-560, May 2001.

[13] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.

[14] U.S. Department of Commerce, Digital Economy 2003, Economic Statistics Administration, U.S. Dept. of Commerce, Washington, D.C., April 2004, 155 pp. See http://www.esa.doc.gov/DigitalEconomy2003.cfm.

[15] U.S. Department of Labor, “Occupation Employment and Wages, 2002,” Bureau of Labor Statistics. See http://www.bls.gov/news.release/archives/ocwage_11192003.pdf.

[16] U.S. Census Bureau, “Statistics of U.S. Businesses 2001.” See http://www.census.gov/epcd/susb/2001/us/US–.htm.

[17] Total office documents counts were obtained on a page basis from reference 13, which used a value of 2% for what documents deserve to be archived. This formed the ‘lo’ case, with the high case using a 5% estimate (lower still than the ENST 10% estimated cited in reference 13). Total pages were converted to numbers of documents on an average 8 pp per document basis; see Technical Endnotes for further discussion.

[18] See Technical Endnotes for the derivation of knowledge worker estimates.

[19] See Technical Endnotes for the derivation of content worker estimates.

[20] Citation sources and assumptions for this analysis are presented in the BrightPlanet white paper, “A Cure to IT Indigestion: Deep Content Federation,” BrightPlanet Corporation White Paper, June 2004, 31 pp.

[21] The “bottom up” cases are built from the number of assumed knowledge workers in Table 3. The “low” and “high” variants are based on a 5% archival value or 350 annual documents created per worker, respectively, applied to worker staff costs associated with document creation. The “Coopers & Lybrand” case is a strict updating of that study to 2002. The other two “C&L” cases use the updated per document costs from the C&L study; the first variant uses the annual documents created from the UC Berkeley study without archiving; the second variant uses the average of the “low” and “high” document numbers. See further Technical Endnotes for other key assumptions.

[22] The individual values in Table 5 range from about $140 to $740 per document, with the update of the Coopers & Lybrand study being about $270. Separate Delphi analysis by BrightPlanet has shown median values of about $550 per document.

[23] See http:// www.eds.com/services_offerings/ibill_openbill_b2b.shtml

[24] See http://www.hsh.com/cfee-sample.html.

[25] See http://www.atp.nist.gov/eao/applicants/section9.htm.

[26] As initially published in Inc Magazine in 1993. Reference to this document may be found at: http://www.contingencyplanning.com/PastIssues/marapr2001/6.asp

[27] Xerox Global Services, Documents – An Opportunity for Cost Control and Business Transformation, 28 pp., 2003. The findings may be found at: http://www.sap.com/solutions/srm/pdf/CCS_Xerox.pdf and J. Snowdon, Documents  – The Lifeblood of Your Business?, October 2003, 12 pp. The white paper may be found at: http://www.mdy.com/News&Events/Newsletter/IDCDocMgmt.pdf

[28] Optika Corporation. See http://www.optika.com/ROI/calculator/ROI_roiresults.cfm.

[29] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.

[30] ALL Associates Group, Inc., EDAM Sector Summary, April 2003, 2 pp.

[31] ALL Associates Group, 2002 EDAM Metrics for Major U.S. Companies.

[32] By the second Q 2004, this amount was $11.6 trillion. U.S. Federal Reserve Board, Flow of Funds Accounts for the United States, Sept. 16, 2004. See http://www.federalreserve.gov/releases/Z1/current/accessible/f6.htm.

[33] The bases for this table have the following assumptions: 1) the three cases for document handling are based on 5%, 10% and 15% of total enterprise revenues, per the earlier section; 2) the three cases for document creation are based on the ‘C&L Bottom-Up’, ‘Bottom-up  – High,’ and ‘Coopers & Lybrand’ items for the Low, Medium, and High columns, respectively, in Table 5; and 3) the document misfiling case draws on the same basis but using the total document estimates and misfiled percentages of 5%, 7.5% and 9% consistent with the previous discussion section. See further the Technical Endnotes.

[34] P. Lyman and H. Varian, “How Much Information, 2003,” retrieved from http://www.sims.berkeley.edu/how-much-info-2003 on December 1, 2003.

[35] Cap Ventures information, as cited in ZyLAB Technologies B.V., “Know the Cost of Filing Your Paper Documents,” Zylab White Paper, 2001. See http://www.zylab.com/downloads/whitepapers/PDF/21%20-%20Know%20the%20cost%20of%20filing%20your%20paper%20documents.pdf.

[36] As reported in http://www.hoovers.com/company/archive/detail/0,2049,7_2322,00.html.

[37] See http://www.veronissuhler.com/businfo/segment.html, August 2, 2000.

[38] See http://www.outsellinc.com/docs/pr_release/pr20000602_01.htm, June 2, 2000.

[39] See http://www.outsellinc.com/docs/pr_release/pr20000629_01.htm.

[40] M.K. Bergman, “The Deep Web: Surfacing Hidden Value,” BrightPlanet Corporation White Paper, June 2000. The most recent version of the study was published by the University of Michigan’s Journal of Electronic Publishing in July 2001. See http://www.press.umich.edu/jep/07-01/bergman.html.

[41] This analysis assumes there were 1 million documents on the Web as of mid-1994.

[42] See, for example, C. Sherman and G. Price, The Invisible Web, Information Today, Inc., Medford, NJ, 2001, 439 pp., and P. Pedley, The Invisible Web: Searching the Hidden Parts of the Internet, Aslib-IMI, London, 2001, 138pp.

[43] iProspect Corporation, iProspect Search Engine User Attitudes, April/May 2004, 28 pp. See http://www.iprospect.com/premiumPDFs/iProspectSurveyComplete.pdf.

[44] As reported at http://www.nua.ie/surveys/index.cgi?f=VS&art_id=905358569&rel=true.

[45] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[46] C. Sherman and S. Feldman, “The High Cost of Not Finding Information,” International Data Corporation Report #29127, 11 pp., April 2003.

[47] M.E.D. Koenig, “Time Saved  – a Misleading Justification for KM,” KMWorld Magazine, Vol 11, Issue 5, May 2002. See http://www.kmworld.com/publications/magazine/index.cfm.

[48] G. Xu, A. Cockburn and B. McKenzie, Lost on the Web: An Introduction to Web Navigation Research, http://www.cosc.canterbury.ac.nzq/ACMchapterq/NZCSPGq/papers.

[49] A. Cockburn and B. McKenzie, What Do Web Users Do? An Empirical Analysis of Web Use, 2000. See http://citeseer.ist.psu.edu/cockburn00what.html.

[50] Tenth edition of GVU’s (graphics, visualization and usability} WWW User Survey, May 14, 1999. See http://www.gvu.gatech.edu/user_surveys/survey-1998-10/tenthreport.html.

[51] C. Alvarado, J. Teevan, M. S. Ackerman and D.Karger, “Surviving the Information Explosion: How People Find Their Electronic Information,” AI Memo 2003-06, April 2003, 11 pp.., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2003/AIM-2003-006.pdf.

[52] W. Jones, H. Bruce and S. Dumais, “Keeping Found Things Found on the Web,” See http://washington.edu/KFTF_Web.pdf.

[53] J. Teevan, “How People Re-find Information When the Web Changes,” AI Memo 2004-014, June 2004, 10 pp., Massachusetts Institute of Technology, Computer Science and Artificial Intelligence Laboratory. See ftp://publications.ai.mit.edu/ai-publications/2004/AIM-2004-012.pdf.

[54] Library of Congress, “Preserving Our Digital Heritage: Plan for the National Digital Information Infrastructure and Preservation Program”, a Report to Congress by the U.S. Library of Congress, 2002, 66 pp. See http://www.digitalpreservation.gov/ndiipp/.

[55] Consistent with Table 8; this analysis also assumes the 25% search time commitment by employee and previous values from earlier tables.

[56] All subsequent references to ‘Large’ firms is based on the last column in Table 2, namely the 930 U.S. firms with more than 10,000 employees.

[57] Delphi Group, “Taxonomy & Content Classification Market Milestone Report,” Delphi Group White Paper, 2002. See http://delphigroup.com.

[58] S. Stearns, “Realize the Value Locked in Your Content Silos Without Breaking the Bank: Automated Classification Tools to Improve Information Discovery,” Inmagic White Paper, version 1.0, 2004. 10 pp. See http://www.inmagic.com.

[59] P. Sonderegger, “Weave Search into the Browsing Experience,” ForresterQuick Take, Forrester Research, Inc., Feb. 18, 2004. 2 pp.

[60] P. Russom, “An Eye for the Needle,” Intelligent Enterprise, January 14, 2002. See http://www.iemagazine.com/020114/502feat2_1.

[61] This average was estimated by interpolating figures shown on Figure 8 in reference 68.

[62] This average was estimated by interpolating figures shown on the p.14 figure in Plumtree Corporation, “The Corporate Portal Market in 2002,” Plumtree Corp. White Paper, 27 pp. See http://www.plumtree.com/pdf/Corporate_Portal_Survey_White_Paper_February2002.pdf.

[63] The ‘low’ case represents the archival value in the middle bars with the addition that 30% of internal documents generated in the current year have a value to be shared for one year; the ‘high’ case represents the related archival value in the middle bars but with 40% of documents generated in that year having a value to be shared for one year.

[64] Analysis based on reference 68, with interpolations from Figure 16.

[65] M. Corcoran, “When Worlds Collide: Who Really Owns the Content,” AIIM Conference, New York, NY, March 10, 2004. See http://show.aiimexpo.com/convdata/aiim2003/brochures/64CorcoranMary.pdf.

[66] C. Phillips, “Stemming the Software Spending Spree,” Optimize Magazine, April 2002, Issue 6. See http://www.optimizemag.com/article/showArticle.jhtml?articleId=17700698&pgno=1.

[67] C. Moore, “The Content Integration Imperative,” Forrester Research, Inc., March 26, 2004, 14 pp.

[68] Plumtree Corporation, “The Corporate Portal Market in 2003,” Plumtree Corp. White Paper, 30 pp. See http://www.plumtree.com/portalmarket2003/default.asp.

[69] BEA Corporation, “Enterprise Portal Rationalization,” BEA Technical White Paper, 23 pp., 2004. See http://www.bea.com/content/news_events/white_papers/BEA_epr_wp.pdf.

[70] A. Aneja, C.Rowan and B. Brooksby, “Corporate Portal Framework for Transforming Content Chaos on Intranets,” Intel Technology Journal Q1, 2000. See http://developer.intel.com/technology/itj/q12000/pdf/portal.pdf.

[71] J. Smeaton, “IBM’s Own Intranet: Saving Big Blue Millions,” Intranet Journal, Sept. 25, 2002. See http://www.intranetjournal.com/articles/200209/ij_09_25_02a.html.

[72] See http://www.wookieweb.com/Intranet/.

[73] D. Voth, “Why Enterprise Portals are the Next Big Thing,” LTI Magazine, October 1, 2002. See http://www.ltimagazine.com/ltimagazine/article/articleDetail.jsp?id=36877.

[74] A. Nyberg, “Is Everybody Happy?” CFO Magazine, November 01, 2002. See http://www.cfo.com/article/1%2C5309%2C8062%2C00.html.

[75] See http://www.proudfoot-plc.com/pdf_20004-USPR1002Avayaweb.asp.

[76] Wall Street Journal, May 4, 2004, p. B1.

[77] pers. comm.., Jonathon Houk, Director of DHS IIAP Program, November 2003.

[78] These figures are based on Table 12 and the GDP figures from reference 32. Note, the analysis in this section also ignores business-to-business opportunities, which are also likely significant.

[79] Total grant and procurement amounts are derived from the U.S. Census Bureau, Consolidated Federal Funds Report (CFFR). See http://harvester.census.gov/cffr/asp/Reports.asp.

[80] The number of awards and an analysis of which line items are competitively awarded was derived from the U.S. Census Bureau, Federal Assistance Award Data System (FAADS). See http://www.census.gov/govs/faads/021sumus.htm.

[81] Specific categories of grants were analyzed based on the U.S. General Services Administration’s Catalog of Federal Domestic Assistance (CFDA) definitions to determine degree of competitiveness; see http://12.46.245.173/cfda/cfda.html. Figures from the U.S. Department of Health and Human Services, Grant.gov Clearinghouse (see http://www.grants.gov/) suggest that $350 billion in federal grants is available, but many of the specific grant opportunities are geared to state governments or individuals. That is why the figures shown indicate only $100 billion in competitive opportunities available directly to enterprises.

[82] U.S. General Services Administration, Federal Procurement Data System  – NG (FY 2003 data); see http://www.fpdc.gov/fpdc/FPR2003a.pdf and http://www.fpdc.gov/fpdc/FPR2003c.pdf. These sources are also the reference for the number of actions or successful awards. Due to discrepancies, these amounts were adjusted to conform with the totals in reference 79.

[83] Average competitive opportunities are derived by dividing the total award amount by category by the number of awards for that category.

[84] See http://www.gcswin.com/opportunities/opp2.htm. This is the only summary reference for state and local information found. Splits between grants and contract procurements were adjusted based on the assumption that contract amounts differed at the non-federal level. Thus, while the split for grant-contract procurements in the federal sector is about 58%-42% in the federal sector, it is assumed to be 38%-62% at the state and local level.

[85] There may also be some double counting of state amounts due to transfers from the federal government. For example, in 2002, $360,534 million in direct transfers was made to states and localities from the federal government. U.S. Census Bureau, State and Local Government Finances by Level of Government and by State: 2001  – 02. See http://www.census.gov/govs/estimate/0200ussl_1.html.

[86] This analysis assumes that individual grant and contract awards are 80% of the amount shown at the federal level.

[87] To be listed requires a minimum of $10,000 in federal contracts; see http://clinton2.nara.gov/WH/EOP/OP/html/aa/aa06.html.

[88] See http://www.govexec.com/features/0804-15/0804-15s1s1.htm.

[89] This header information is drawn from Table 12.

[90] Number of competing firms is increased from the federal contractor baseline by a factor of 1.30 to account for new state and local government contractors.

[91] Winning and losing proposal preparation costs are based on the empirical percentages from NIST (see reference 93), namely 0.85% and 0.59%, respectively, as a percent of total award amounts.

[92] The ‘Low’ basis for improvements is based on the finding of missing information discussed in a previous section; the ‘High” basis reflects the difference between lowest quartile and highest quartile efforts spent on successful proposal preparation (see reference 93). The ‘Med’ basis is an intermediate value between these two.

[93] The increase in winning submissions is calculated based on numbers of winning proposals times the RFP improvement factor. In fact, because all things being equal the pool of contract dollars does not change, this amount merely represents a shift of winning awards from existing winners to new winners. In other words, total contracts amounts are a zero-sum game with proposal improvements by previous losers taken from the pool of previous winners.

[94] The analysis in Figure 2 indicates there is a power curve distribution of awards. The number of new winning proposals was applied to this curve to estimate the actual number of new firms winning awards; see Figure 2 for the power-curve fitting equation.

[95] Of course, better probabilities of winning competitive solicitations are a zero-sum game. New winners displace old winners. The real advantage in this arena is to individual firms that better succeed at securing the existing pool of competitive funds. The benefits to individual companies can be the difference between profitability, indeed survival.

[96] NFIB, Coping with Regulation, NFIB National Small Business Poll, Vol. 1, Issue 5. See http://www.nfib.com/object/3105105.html.

[97] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.

[98] W. M. Crain & T. D. Hopkins, “The Impact of Regulatory Costs on Small Firms”, Report to the Small Business Administration, RFP No. SBAHQ-00-R-0027 (2001). The report’s 2000 year basis was updated to 2002 based on a 4% annual inflation factor.

[99] U.S. General Accounting Office, Paperwork Reduction Act: Record Increase in Agencies’ Burden Estimates, testimony of V. S. Rezendes, before the Subcommittee on Energy, Policy, Natural Resources and Regulatory Affairs, Committee on Government Reform, House of Representatives, April 11, 2003. See http://www.reform.house.gov/UploadedFiles/Testimony_GAO_Revised.pdf.

[100] Office of Management and Budget, Managing Information Collection and Dissemination, Fiscal Year 2003, 198 pp. (Table A1). See http://www.whitehouse.gov/omb/inforeg/2003_info_coll_dism.pdf.

[101] NFIB, Paperwork and Record-keeping, NFIB National Small Business Poll, Vol. 3, Issue 5. See http://www.nfib.com/object/4131277.html.

[102]U.S. Small Business Administration, Final Report of the Small Business Paperwork Relief Task Force, June 27, 2003, 64 pp. See http://www.sbaonline.sba.gov/advo/laws/final_paperwork03.pdf.

[103] IRS, Civil Penalties Assessed and Abated, by Type of Penalty and Type of Tax (Table 26), September 20, 2002. See http://www.irs.gov/pub/irs-soi/02db26cp.xls.

[104] Except as footnoted, the figures below are drawn from the OMB Public Budget Tables. Civil penalties for crime victims have been excluded from these figures. See http://www.whitehouse.gov/omb/budget/fy2005/db.html.

[105] Obtained orders in SEC judicial and administrative proceedings requiring securities law violators to disgorge illegal profits of approximately $1.293 billion. Civil penalties ordered in SEC proceedings totaled approximately $101 million. See SEC http://www.sec.gov/pdf/annrep02/ar02enforce.pdf.

[106] T. L. Sansonetti, U.S. Department of Justice, testimony before the House Committee on the Judiciary, Subcommittee on Commercial and Administrative Law, March 9, 2004. See http://www.house.gov/judiciary/sansonetti030904.htm.

[107]Argy, Wiltse & Robinson, Business Insights, Summer 2003, 4 pp. See http://www.awr.com/news_let/Argy%20Summer%202003.pdf

[108] Project on Government Oversight, Federal Contractor Misconduct: Failures of the Suspension and Debarment System, revised May 10, 2002. See http://www.pogo.org/p/contracts/co-020505-contractors.html.

[109]Corporate Crime Reporter, Top 100 False Claims Act Settlements, December 30, 2003, 64 pp. See http://www.corporatecrimereporter.com/fraudrep.pdf.

[110] According to Alchemia Corporation testimony citing a Price Waterhouse Coopers study, FDA Hearing, Jan. 17, 2002. See http://www.fda.gov/ohrms/dockets/dockets/ 00d1538/00d-1538_mm00023_01_vol7.doc.

[111] For example, see http://www.medschool.ucsf.edu/curriculum/clinical/guide/section2/confidentiality.asp.

[112] From Table 17.

[113] From Table 16 after adjusting by total number of employees for all firms as shown on Table 2, and removal of total burdens as shown in Table 17.

[114] From Table 18.

[115] All ‘State and Local’ items are based on the ratio of state and local budgets in relation to the federal budget, excluding direct federal transfers, and applied to those factors for the federal sector. This ratio is 0.563. See http://www.gpoaccess.gov/usbudget/fy01/guide01.html.

[116] All ‘Large Firm’ estimates are based on the ratio of large firm documents to total firm documents; see Table 2.

[117] For example, see http://www.nfr.com/why/mandates.php#gramm

by Mike at March 12, 2010 06:43 PM

DBpedia Blog

Invitation to contribute to DBpedia by improving the infobox mappings + New Scala-based Extraction Framework

Hi all,

in order to extract high quality data from Wikipedia, the DBpedia extraction framework relies on infobox to ontology mappings which define how Wikipedia infobox templates are mapped to classes of the DBpedia ontology.

Up to now, these mappings were defined only by the DBpedia team and as Wikipedia is huge and contains lots of different infobox templates, we were only able to define mappings for a small subset of all Wikipedia infoboxes and also only managed to map a subset of the properties of these infoboxes.

In order to enable the DBpedia user community to contribute to improving the coverage and the quality of the mappings, we have set up a public wiki at http://mappings.dbpedia.org/index.php/Main_Page which contains:

1.  all mappings that are currently used by the DBpedia extraction framework
2. the definition of the DBpedia ontology and
3. documentation for the DBpedia mapping language as well as step-by-step guides on how to extend and refine mappings and the ontology.

So if you are using DBpedia data and you you were always annoyed that DBpedia did not properly cover the infobox template that is most important to you, you are highly invited to extend the mappings and the ontology in the wiki. Your edits will be used for the next DBpedia release expected to be published in the first week of April.

The process of contributing to the ontology and the mappings is as follows:

1.  You familiarize yourself with the DBpedia mapping language by reading the documentation in the wiki.
2.  In order to prevent random SPAM, the wiki is read-only and new editors need to be confirmed by a member of the DBpedia team (currently Anja Jentzsch does the clearing). Therefore, please create an account in the wiki for yourself. After this, Anja will give you editing rights and you can edit the mappings as well as the ontology.
3. For contributing to the next DBpedia relase, you can edit until Sunday, March 21. After this, we will check the mappings and the ontology definition in the Wiki for consistency and then use both for the next DBpedia release.

So, we are starting kind of a social experiment on if the DBpedia user community is willing to contribute to the improvement of DBpedia and on how the DBpedia ontology develops through community contributions J

Please excuse, that it is currently still rather cumbersome to edit the mappings and the ontology. We are currently working on a visual editor for the mappings as well as a validation service, which will check edits to the mappings and test the new mappings against example pages from Wikipedia. We hope that we will be able to deploy these tools in the next two months, but still wanted to release the wiki as early as possible in order to already allow community contributions to the DBpedia 3.5 release.

If you have questions about the wiki and the mapping language, please ask them on the DBpedia mailing list where Anja and Robert will answer them.

What else is happening around DBpedia?

In order to speed up the data extraction process and to lay a solid foundation for the DBpedia Live extraction, we have ported the DBpedia extraction framework from PHP to Scala/Java. The new framework extracts exactly the same types of data from Wikipedia as the old framework, but processes a single page now in 13 milliseconds instead of the 200 milliseconds. In addition, the new framework can extract data from tables within articles and can handle multiple infobox templates per article. The new framework is available under GPL license in the DBpedia SVN and is documented at http://wiki.dbpedia.org/Documentation.

The whole DBpedia team is very thankful to two companies which enabled us to do all this by sponsoring the DBpedia project:

1. Vulcan Inc. as part of its Project Halo (www.projecthalo.com). Vulcan Inc. creates and advances a variety of world-class endeavors and high impact initiatives that change and improve the way we live, learn, do business (http://www.vulcan.com/).
2.  Neofonie GmbH, a Berlin-based company offering leading technologies in the area of Web search, social media and mobile applications (http://www.neofonie.de/index.jsp).

Thank you a lot for your support!

I personally would also like to thank:

1.  Anja Jentzsch, Robert Isele, and Christopher Sahnwaldt for all their great work on implementing the new extraction framework and for setting up the mapping wiki.
2.  Andreas Lange and Sidney Bofah for correcting and extending the mappings in the Wiki.

Cheers,

Chris Bizer

by ChrisBizer at March 12, 2010 12:07 PM

March 10, 2010

Blog Data Space (Kingsley Idehen)

URIBurner: Painless Generation & Exploitation of Linked Data (Update 1 - Demo Links Added)

What is URIBurner?

A service from OpenLink Software, available at: http://uriburner.com, that enables anyone to generate structured descriptions -on the fly- for resources that are already published to HTTP based networks. These descriptions exist as hypermedia resource representations where links are used to identify:

  • the entity (data object or datum) being described,
  • each of its attributes, and
  • each of its attributes values (optionally).

The hypermedia resource representation outlined above is what is commonly known as an Entity-Attribute-Value (EAV) Graph. The use of generic HTTP scheme based Identifiers is what distinguishes this type of hypermedia resource from others.

Why is it Important?

The virtues (dual pronged serendipitous discovery) of publishing HTTP based Linked Data across public (World Wide Web) or private (Intranets and/or Extranets) is rapidly becoming clearer to everyone. That said, the nuance laced nature of Linked Data publishing presents significant challenges to most. Thus, for Linked Data to really blossom the process of publishing needs to be simplified i.e., "just click and go" (for human interaction) or REST-ful orchestration of HTTP CRUD (Create, Read, Update, Delete) operations between Client Applications and Linked Data Servers.

How Do I Use It?

In similar vane to the role played by FeedBurner with regards to Atom and RSS feed generation, during the early stages of the Blogosphere, it enables anyone to publish Linked Data bearing hypermedia resources on an HTTP network. Thus, its usage covers two profiles: Content Publisher and Content Consumer.

Content Publisher

The steps that follow cover all you need to do:

  • place a tag within your HTTP based hypermedia resource (e.g. within section for HTML )
  • use a URL via the @href attribute value to identify the location of the structured description of your resource, in this case it takes the form: http://linkeddata.uriburner.com/about/id/{scheme-or-protocol}/{your-hostname-or-authority}/{your-local-resource}
  • for human visibility you may consider adding associating a button (as you do with Atom and RSS) with the URL above.

That's it! The discoverability (SDQ) of your content has just multiplied significantly, its structured description is now part of the Linked Data Cloud with a reference back to your site (which is now a bona fide HTTP based Linked Data Space).

Examples

HTML+RDFa based representation of a structured resource description:

<link rel="describedby" title="Resource Description (HTML)"type="text/html" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

JSON based representation of a structured resource description:

<link rel="describedby" title="Resource Description (JSON)" type="application/json" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

N3 based representation of a structured resource description:

<link rel="describedby" title="Resource Description (N3)" type="text/n3" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

RDF/XML based representations of a structured resource description:

<link rel="describedby" title="Resource Description (RDF/XML)" type="application/rdf+xml" href="http://linkeddata.uriburner.com/about/id/http/example.org/xyz.html"/>

Content Consumer

As an end-user, obtaining a structured description of any resource published to an HTTP network boils down to the following steps:

  1. go to: http://uriburner.com
  2. drag the Page Metadata Bookmarklet link to your Browser's toolbar
  3. whenever you encounter a resource of interest (e.g. an HTML page) simply click on the Bookmarklet
  4. you will be presented with an HTML representation of a structured resource description (i.e., identifier of the entity being described, its attributes, and its attribute values will be clearly presented).

Examples

If you are a developer, you can simply perform an HTTP operation request (from your development environment of choice) using any of the URL patterns presented below:

HTML:
  • curl -I -H "Accept: text/html" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}

JSON:

  • curl -I -H "Accept: application/json" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/json/{scheme}/{authority}/{local-path}

Notation 3 (N3):

  • curl -I -H "Accept: text/n3" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/n3/{scheme}/{authority}/{local-path}
  • curl -I -H "Accept: text/turtle" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/ttl/{scheme}/{authority}/{local-path}

RDF/XML:

  • curl -I -H "Accept: application/rdf+xml" http://linkeddata.uriburner.com/about/id/{scheme}/{authority}/{local-path}
  • curl http://linkeddata.uriburner.com/about/data/xml/{scheme}/{authority}/{local-path}

Conclusion

URIBurner is a "deceptively simple" solution for cost-effective exploitation of HTTP based Linked Data meshes. It doesn't require any programming or customization en route to immediately realizing its virtues.

If you like what URIBurner offers, but prefer to leverage its capabilities within your domain -- such that resource description URLs reside in your domain, all you have to do is perform the following steps:

  1. download a copy of Virtuoso (for local desktop, workgroup, or data center installation) or
  2. instantiate Virtuoso via the Amazon EC2 Cloud
  3. enable the Sponger Middleware component via the RDF Mapper VAD package (which includes cartridges for over 30 different resources types)

When you install your own URIBurner instances, you also have the ability to perform customizations that increase resource description fidelity in line with your specific needs. All you need to do is develop a custom extractor cartridge and/or meta cartridge.

Related:

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 10, 2010 05:52 PM

AI3:::Adaptive Information (Mike Bergman)

Citizen DAN, Prise Deux

Citizen DAN LogoHuzzah! for Local Government Open Data, Transparency, Community Indicators and Citizen Journalism

While the Knight News Challenge is still working its way through the screening details, Structured Dynamics Citizen DAN proposal remains in the hunt. Listen to this:

To date, we have been the most viewed proposal by far (2x more than the second most viewed!!! Hooray!) and are in the top five of highest rated (have also been at #1 or #2, depending. Hooray!). Thanks to all of you for your interest and support.

There is much to recommend this KNC approach, not the least of which being able to attract some 2,500 proposals seeking a piece of the 2010 $5 million potential grant awards. Our proposal extends SD’s basic structWSF and conStruct Drupal frameworks to provide a data appliance and network (DAN) to support citizen journalists with data and analysis at the local, community level.

None of our rankings, of course, guarantees anything. But, we also feel good about how the market is looking at these frameworks. We have recently been awarded some pretty exciting and related contracts. Any and all of these initiatives will continue to contribute to the open source Citizen DAN vision.

And, what might that vision be? Well, after some weeks away from it, I read again our online submission to the Knight News Challenge. I have to say: It ain’t too bad! (Plus many supporting goodies and details.)

So, I repeat in its entirety below, the KNC questions and our formal responses. This information from our original submittal is unchanged, except to add some live links where they could not be submitted as such before. (BTW, the bold headers are the KNC questions.) Eventual winners are slated to be announced around mid-June. We’re keeping our fingers crossed, but we are pursuing this initiative in any case.


Describe your project:

Citizen DAN is an open source framework to leverage relevant local data for citizen journalists. It is a:

  • Appliance for filtering and analyzing data specific to local community indicators
  • Means to visualize local data over time or by neighborhood
  • Meeting place for the public to upload and share local data and information
  • Web data portal that can be individually tailored by any local community
  • Node in a global network of communities across which to compare indicators of community well-being.

Good decisions and good journalism require good information. Starting with pre-loaded government data, Citizen DAN provides any citizen the framework to learn and compare local statistics and data with other similar communities. This helps to promote the grist for citizen journalism; it is also a vehicle for discovery and learning across the community.

Citizen DAN comes pre-packaged with all necessary deployment components and documentation, including local data from government sources. It includes facilities for direct upload of additional local data in formats from spreadsheets to standard databases. Many standard converters are included with the basic package.

Citizen DAN may be implemented by local governments or by community advocacy groups. When deployed, using its clear documentation, sponsors may choose whether or what portions of local data are exposed to the broader Citizen DAN network. Data exposed on the network is automatically available to any other network community for comparison and analysis purposes.

This data appliance and network (DAN) is multi-lingual. It will be tested in three cities in Canada and the US, showing its multi-lingual capabilities in English, Spanish and French.

How will your project improve the way news and information are delivered to geographic communities?

With Citizen DAN, anyone with Web access can now get, slice, and dice information about how their community is doing and how it compares to other communities. We have learned from Web 2.0 and user-generated content that once exposed, useful information can be taken and analyzed in valuable and unanticipated ways.

The trick is to get information that already exists. Citizen journalists of the past may not have either known:

  1. Where to find relevant information, or
  2. How to ‘slice-and-dice’ that information to extract meaningful nuggets.

By removing these hurdles, Citizen DAN improves the ways information is delivered to communities and provides the framework for sifting through it to extract meaning.

How is your idea innovative? (new or different from what already exists)

Government public data in electronic tabular form or as published listings or tables in local newspapers has been available for some time. While meeting strict ‘disclosure’ requirements, this information has neither been readily analyzable nor actionable.

The meaning of information lies in its interpretation and analysis.

Citizen DAN is innovative because it:

  1. Is a platform for accessing and exposing available community data
  2. Provides powerful Web-based tools for drilling down and mining data
  3. Changes the game via public-provided data, and
  4. Packages Citizen DAN in a Web framework that is available to any local citizen and requires no expertise other than clicking links.

What experience do you or your organization have to successfully develop this project?

Structured Dynamics has already developed and released as open-source code structWSF and conStruct , the basic foundations to this proposal. structWSF provides the network and dataset “backbone” to this proposal; conStruct provides the Drupal portal and Web site framework.

To this foundation we add proven experience and knowledge of datasets and how to access them, as well as tools and converters for how to stage them for standard public use. A key expertise of Structured Dynamics is the conversion of virtually any legacy data format into interoperable canonical forms.

These are important challenges, which require experience in the semantics of data and mapping from varied forms into useful and common frameworks. Structured Dynamics has codified its expertise in these areas into the software underlying Citizen DAN.

Structured Dynamics’ principals are also multi-lingual, with language-neutral architectures and code. The company’s principals are also some of the most prominent bloggers and writers in the semantic Web. We are acknowledged as attentive to documentation and communication.

Finally, Structured Dynamics’ principals have more than a decade of track record in successful data access and mining, and software and venture development.

To this strong basis, we have preliminary city commitments for deploying this project in the United States (English and Spanish) and Canada (French and English).

What unmet need does your proposal answer?

ThisWeKnow offers local Census data, but no community or publishing aspects. Data sharing is in DataSF and DataMine (NYC), but they lack collaboration, community networks and comparisons, or powerful data visualization or mapping.

Citizen DAN is a turnkey platform for any size community to create, publish, search, browse, slice-and-dice, visualize or compare indicators of community well-being. Its use makes the Web more locally focused. With it, researchers, watchdog groups, reporters, local officials and interested citizens can now discover hard data for ‘new news’ or fact-check mainstream media.

What tasks/benchmarks need to be accomplished to develop your project and by when will you complete them?

There are two releases with feedback. Each task summary, listing of task hours (hr) and duration in months (mo), in rough sequence order with overlaps, is:

  1. Dataset Prep/Staging: identify, load and stage baseline datasets; provide means for aggregating data at different levels; 420 hr; 2.5 mo
  2. Refine Data Input Facility: feature to upload other external data, incl direct from local sources; XML, spreadsheet, JSON forms; dataset metadata; 280 hr; 3 mo
  3. Add Data Visualization Component: Flex mapping/data visualization (charts, graphs) using any slice-and-dice; 390 hr; 3 mo
  4. Make Multi-linguality Changes: English, French, Spanish versions; 220 hr; 2 mo
  5. Refine User Interface: update existing interface in faceted browse; filter; search; record create, manage and update; imports; exports; and user access rights; 380 hr; 3 mo
  6. Standard Citizen DAN Ontologies: the coherent schema for the data; 140 hr; 3 mo
  7. Create Central Portal: distribution and promotion site for project; 120 hr; 2 mo
  8. Deploy/Test First Release: release by end of Mo 5 @ 3 test sites; 300 hr; 4 mo
  9. Revise Based on Feedback: bug fixing and 4 mo testing/feedback, then revision #2; 420 hr
  10. Package/Document: component packaging for easier installs; increased documentation; 310 hr; 2 mo
  11. Marketing/Awareness: see next question; 240 hr; 12 mo
  12. Project Management: standard PM/interact with test communities, partners; 220 hr; 12 mo.

See attached task details.

What will you have changed by the end of your project?

"Information is the currency of democracy." Thomas Jefferson (n.b.)

We intuitively understand that an informed citizenry is a healthy polity. At the global level and in 250 languages, we see how Wikipedia, matched with the Internet and inexpensive laptops, is bringing unforeseen information and enrichment to all. Across the board, we are seeing the democratization of information.

But very little of this revolution has percolated to the local level.

Only in the past decade or so have we seen free, electronic access to national Census data. We still see local data only published in print or not available at all, limiting both awareness but more importantly understanding and analysis. Data locked up in municipal computers or available but not expressed via crowdsourcing is as good as non-existent.

Though many citizens at the local level are not numeric, intuition has to tell us that the absense of empirical, local data hurts our ability to understand, reason and debate our local circumstances. Are we doing better or worse than yesterday? Than in comparison with our peers? Under what measures does this have meaning about community well being?

The purpose of the Citizen DAN project is to create an appliance — in the same sense of refrigerators keeping our food from spoiling — by which any citizen can crack open and expose relevant data at the local level. Citizen DAN is about enrichening our local information and keeping our communities healthy.

How will you measure progress and ultimately success?

We will measure the progress of the project by the number of communities and local organizations that use the Citizen DAN platform to create and publish community data. Subsidiary measures include the number of:

  • Individual users across all installations
  • Users contributing uploaded datasets
  • Contributed datasets
  • Contributed applications based on the platform
  • Interconnected sites in the network
  • Different Citizen DAN networks
  • Substantive articles and blog posts on Citizen DAN
  • Mentions of ‘Citizen DAN’ (and local naming or variants, which will be tracked) in news articles
  • Contributed blog posts on the central Citizen DAN portal
  • Software package downloads, and
  • Google citations and hits on ‘Citizen DAN’ (and prominent variants).

These measures, plus active sites with profiles of each, will be monitored and tracked on the central Citizen DAN portal.

‘Ultimate success’ is related to the general growth in transparent government at the local level. Growth in Citizen DAN-related measures on a year-over-year basis or in relation to Gov2.0 would indicate success.

Do you see any risk in the development of your project?

There is no technical risk to this proposal, but there are risks in scope, awareness and acceptance. Our system has been operational for one year for relevant use cases; all components have been integrated, debugged, and put into production.

Scope risks relate to how much data the Citizen DAN platform is loaded with, and how much functionality is included. We balance the data question by using common public datasets for baseline data, then add features for localities to “crowdsource” their own supplementary data. We balance the functionality question by limiting new development to data visualization/mapping and to upload functions (per above), and then to refine what already exists.

Awareness risks arise from a crowded attention space. We can overcome this in two ways. The first is to satisfy users at our test sites. That will result in good recommendations to help seed a snowball effect. The second way is to use social media and our existing Web outlets aggressively. We have been building awareness for our own properties in steady, inch-by-inch measures. While a notable few Web efforts may go viral, the process is not predictable. Steady, constant focus is our preferred recipe.

Acceptance risk is intimately linked with awareness and use. If we can satisfy each Citizen DAN community, then new datasets, new functionality and new awareness will naturally arise. More users and more contributions through the network effect are the best way to broad acceptance.

What is your marketing plan? How will people learn about what you are doing?

Marketing and awareness efforts will include our use of social media, dedicated Web sites, support from test communities, and outreach to relevant community Web sites.

Our own blogs are popular in the semantic Web and structured data space (~3K uniques daily); we have published two posts on Citizen DAN and will continue to do so with more frequency once the effort gets underway.

We will create a central portal (http://citizen-dan.org) based on the project software (akin to our other project sites). The model for this apps and deployments clearinghouse is CrimeReports.com. Using social aspects and crowdsourcing, the site will encourage sharing and best practices amongst the growing number of Citizen DAN communities.

We will blog and post announcements for key releases and milestones on relevant external Web sites including various Gov 2.0 sites, Community Indicators Consortium, GovLoop, Knight News Challenge, the Sunlight Foundation, and so forth. In addition, we will collate and track individual community efforts (maintained on the central Citizen DAN site) and make specific outreach to community data sites (such as DataSF or DataMine at NYC.gov). We will use Twitter (#CitizenDAN, etc) and the social networks of LinkedIn, Facebook, and Meetup to promote Citizen DAN activity.

We will interact with advocates of citizen journalism, and engage civic organizations, media, and government officials (esp in our three test communities) to refine our marketing plan.

Is this a one-time experiment or do you think it will continue after the grant?

Citizen DAN is not an experiment. It is a working framework that gives any locality and its citizenry the means to assemble, share and compare measures of its community well-being with other communities. These indicators, in turn, provide substance and grist for greater advocacy and writing and blogging (”journalism”) at the local level.

Granted, there are unknowns: How many localities will adopt the Citizen DAN appliance? How essential will its data be to local advocacy and news? How active will each Citizen DAN installation be in attracting contributions and local data?

We submit the better way to frame the question is the degree of adoption, as opposed to will it work.

Web-based changes in our society and social interaction are leading to the democratization of information, access to it, and channels for expression. Whether ultimately successful in the specific form proposed herein, Citizen DAN and its open source software and frameworks will surely be adopted in one form or another — to one degree or another — in the unassailable trend toward local government transparency and citizen involvement.

In short, Yes: We believe Citizen DAN will continue long after the grant.

If it is to be self-sustainable, what is the plan for making that happen?

Our plan begins with the nature of Citizen DAN as software and framework. Sustainability is a question of whether the appliance itself is useful, and how users choose to leverage it.

Mediawiki, the software behind Wikipedia, is an analog. Mediawiki is an enabling infrastructure. Some sites using it are not successful; others wildly so. Success has required the combination of a good appliance with topicality and good management. The same is true for Citizen DAN.

Our plan thus begins with Citizen DAN as a useful appliance, as free open source with great documentation and prominent initial use cases. Our plan continues with our commitment to the local citizen marketplace.

We are developing Citizen DAN because of current trends. We foresee many hundreds of communities adopting the system. Most will be able to do so on their own. Some others may require modifications or assistance. Our self-interest is to ensure a high level of adoption.

An era of citizen engagement is unfolding at the local level, fueled by Web technologies and growing comfort with crowdsourcing and social networks. Meanwhile, local government constraints and pressures for transparency are unleashing locked-up data. These forces will create new opportunities for data literacy by the public, that will itself bring new understanding and improvements in governance and budgeting. We plan on Citizen DAN and its offspring to be one of the catalysts for those changes.

by Mike at March 10, 2010 04:36 AM

March 06, 2010

Blog Data Space (Kingsley Idehen)

Meshups Demonstrating How SPARQL-GEO Enhances Linked Data Exploitation (Update 1)

Deceptively simple demonstrations of how Virtuoso's SPARQL-GEO extensions to SPARQL lay critical foundation for Geo Spatial solutions that seek to leverage the burgeoning Web of Linked Data.

Setup Information

SPARQL Endpoint: Linked Open Data Cache (8.5 Billion+ Quad Store which includes data from Geonames and the Linked GeoData Project Data Sets) .

Live Linked Data Meshup Links:

Related

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 06, 2010 10:43 PM

March 04, 2010

Blog Data Space (Kingsley Idehen)

Revisiting HTTP based Linked Data (Update 1 - Demo Video Links Added)

Motivation for this post arose from a series of Twitter exchanges between Tony Hirst and I, in relation to his blog post titled: So What Is It About Linked Data that Makes it Linked Data™ ?

At the end of the marathon session, it was clear to me that a blog post was required for future reference, at the very least :-)

What is Linked Data?

"Data Access by Reference" mechanism for Data Objects (or Entities) on HTTP networks. It enables you to Identify a Data Object and Access its structured Data Representation via a single Generic HTTP scheme based Identifier (HTTP URI). Data Object representation formats may vary; but in all cases, they are hypermedia oriented, fully structured, and negotiable within the context of a client-server message exchange.

Why is it Important?

Information makes the world tick!

Information doesn't exist without data to contextualize.

Information is inaccessible without a projection (presentation) medium.

All information (without exception, when produced by humans) is subjective. Thus, to truly maximize the innate heterogeneity of collective human intelligence, loose coupling of our information and associated data sources is imperative.

How is Linked Data Delivered?

Linked Data is exposed to HTTP networks (e.g. World Wide Web) via hypermedia resources bearing structured representations of data object descriptions. Remember, you have a single Identifier abstraction (generic HTTP URI) that embodies: Data Object Name and Data Representation Location (aka URL).

How are Linked Data Object Representations Structured?

A structured representation of data exists when an Entity (Datum), its Attributes, and its Attribute Values are clearly discernible. In the case of a Linked Data Object, structured descriptions take the form of a hypermedia based Entity-Attribute-Value (EAV) graph pictorial -- where each Entity, its Attributes, and its Attribute Values (optionally) are identified using Generic HTTP URIs.

Examples of structured data representation formats (content types) associated with Linked Data Objects include:

  • text/html
  • text/turtle
  • text/n3
  • application/json
  • application/rdf+xml
  • Others

How Do I Create Linked Data oriented Hypermedia Resources?

You markup resources by expressing distinct entity-attribute-value statements (basically these a 3-tuple records) using a variety of notations:

  • (X)HTML+RDFa,
  • JSON,
  • Turtle,
  • N3,
  • TriX,
  • TriG,
  • RDF/XML, and
  • Others (for instance you can use Atom data format extensions to model EAV graph as per OData initiative from Microsoft).

You can achieve this task using any of the following approaches:

  • Notepad
  • WYSIWYG Editor
  • Transformation of Database Records via Middleware
  • Transformation of XML based Web Services output via Middleware
  • Transformation of other Hypermedia Resources via Middleware
  • Transformation of non Hypermedia Resources via Middleware
  • Use a platform that delivers all of the above.

Practical Examples of Linked Data Objects Enable

  • Describe Who You Are, What You Offer, and What You Need via your structured profile, then leave your HTTP network to perform the REST (serendipitous discovery of relevant things)
  • Identify (via map overlay) all items of interest based on a 2km+ radious of my current location (this could include vendor offerings or services sought by existing or future customers)
  • Share the latest and greatest family photos with family members *only* without forcing them to signup for Yet Another Web 2.0 service or Social Network
  • No repetitive signup and username and password based login sequences per Web 2.0 or Mobile Application combo
  • Going beyond imprecise Keyword Search to the new frontier of Precision Find - Example, Find Data Objects associated with the keywords: Tiger, while enabling the seeker disambiguate across the "Who", "What", "Where", "When" dimensions (with negation capability)
  • Determine how two Data Objects are Connected - person to person, person to subject matter etc. (LinkedIn outside the walled garden)
  • Use any resource address (e.g blog or bookmark URL) as the conduit into a Data Object mesh that exposes all associated Entities and their social network relationships
  • Apply patterns (social dimensions) above to traditional enterprise data sources in combination (optionally) with external data without compromising security etc.

How Do OpenLink Software Products Enable Linked Data Exploitation?

Our data access middleware heritage (which spans 16+ years) has enabled us to assemble a rich portfolio of coherently integrated products that enable cost-effective evaluation and utilization of Linked Data, without writing a single line of code, or exposing you to the hidden, but extensive admin and configuration costs. Post installation, the benefits of Linked Data simply materialize (along the lines described above).

Our main Linked Data oriented products include:

  • OpenLink Data Explorer -- visualizes Linked Data or Linked Data transformed "on the fly" from hypermedia and non hypermedia data sources
  • URIBurner -- a "deceptively simple" solution that enables the generation of Linked Data "on the fly" from a broad collection of data sources and resource types
  • OpenLink Data Spaces -- a platform for enterprises and individuals that enhances distributed collaboration via Linked Data driven virtualization of data across its native and/or 3rd party content manager for: Blogs, Wikis, Shared Bookmarks, Discussion Forums, Social Networks etc
  • OpenLink Virtuoso -- a secure and high-performance native hybrid data server (Relational, RDF-Graph, Document models) that includes in-built Linked Data transformation middleware (aka. Sponger).

Related

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 04, 2010 03:16 PM

March 02, 2010

Blog Data Space (Kingsley Idehen)

Linked Data & Socially Enhanced Collaboration (Enterprise or Individual) -- Update 1

Socially enhanced enterprise and invididual collaboration is becoming a focal point for a variety of solutions that offer erswhile distinct content managment features across the realms of Blogging, Wikis, Shared Bookmarks, Discussion Forums etc.. as part of an integrated platform suite. Recently, Socialtext has caught my attention courtesy of its nice features and benefits page . In addition, I've also found the Mike 2.0 portal immensely interesting and valuable, for those with an enterprise collaboration bent.

Anyway, Socialtext and Mike 2.0 (they aren't identical and juxtaposition isn't seeking to imply this) provide nice demonstrations of socially enhanced collaboration for individuals and/or enterprises is all about:

  1. Identifying Yourself
  2. Identifying Others (key contributors, peers, collaborators)
  3. Serendipitous Discovery of key contributors, peers, and collaborators
  4. Serendipitous Discovery by key contributors, peers, and collaborators
  5. Develop and sustain relationships via socially enhanced professional network hybrid
  6. Utilize your new "trusted network" (which you've personally indexed) when seeking help or propagating a meme.

As is typically the case in this emerging realm, the critical issue of discrete "identifiers" (record keys in sense) for data items, data containers, and data creators (individuals and groups) is overlooked albeit unintentionally.

How HTTP based Linked Data Addresses the Identifier Issue

Rather than using platform constrained identifiers such as:

  • email address (a "mailto" scheme identifier),
  • a dbms user account,
  • application specific account, or
  • OpenID.

It enables you to leverage the platform independence of HTTP scheme Identifiers (Generic URIs) such that Identifiers for:

  1. You,
  2. Your Peers,
  3. Your Groups, and
  4. Your Activity Generated Data,

simply become conduits into a mesh of HTTP -- referencable and accessible -- Linked Data Objects endowed with High SDQ (Serendipitious Discovery Quotient). For example my Personal WebID is all anyone needs to know if they want to explore:

  1. My Profile (which includes references to data objects associated with my interests, social-network, calendar, bookmarks etc.)
  2. Data generated by my activities across various data spaces (via data objects associated with my online accounts e.g. Del.icio.us, Twitter, Last.FM)
  3. Linked Data Meshups via URIBurner (or any other Virtuoso instance) that provide an extend view of my profile

How FOAF+SSL adds Socially aware Security

Even when you reach a point of equilibrium where: your daily activities trigger orchestratestration of CRUD (Create, Read, Update, Delete) operations against Linked Data Objects within your socially enhanced collaboration network, you still have to deal with the thorny issues of security, that includes the following:

  1. Single Sign On,
  2. Authentication, and
  3. Data Access Policies.

FOAF+SSL, an application of HTTP based Linked Data, enables you to enhance your Personal HTTP scheme based Identifer (or WebID) via the following steps (peformed by a FOAF+SSL compliant platform):

  1. Imprint WebID within a self-signed x.509 based public key (certificate) associated with your private key (generated by FOAF+SSL platform or manually via OpenSSL)
  2. Store public key components (modulous and exponent) into your FOAF based profile document which references your Personal HTTP Identifier as its primary topic
  3. Leverage HTTP URL component of WebID for making public key components (modulous and exponent) available for x.509 certificate based authentication challenges posed by systems secured by FOAF+SSL (directly) or OpenID (indirectly via FOAF+SSL to OpenID proxy services).

Contrary to conventional experiences with all things PKI (Public Key Infrastructure) related, FOAF+SSL compliant platforms typically handle the PKI issues as part of the protocol implementation; thereby protecting you from any administrative tedium without compromising security.

Conclusions

Understanding how new technology innovations address long standing problems, or understanding how new solutions inadvertently fail to address old problems, provides time tested mechanisms for product selection and value proposition comprehension that ultimately save scarce resources such as time and money.

If you want to understand real world problem solution #1 with regards to HTTP based Linked Data look no further than the issues of secure, socially aware, and platform independent identifiers for data objects, that build bridges across erstwhile data silos.

If you want to cost-effectively experience what I've outlined in this post, take a look at OpenLink Data Spaces (ODS) which is a distributed collaboration engine (enterprise of individual) built around the Virtuoso database engines. It simply enhances existing collaboration tools via the following capabilities:

Addition of Social Dimensions via HTTP based Data Object Identifiers for all Data Items (if missing)

  1. Ability to integrate across a myriad of Data Source Types rather than a select few across RDBM Engines, LDAP, Web Services, and various HTTP accessible Resources (Hypermedia or Non Hypermedia content types)
  2. Addition of FOAF+SSL based authentication
  3. Addition of FOAF+SSL based Access Control Lists (ACLs) for policy based data access.

Related:

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at March 02, 2010 08:47 PM

March 01, 2010

AI3:::Adaptive Information (Mike Bergman)

Open SEAS: A Framework to Transition to a Semantic Enterprise

New Release Builds on the MIKE2.0 Methodology and Deliverables

Open SEAS

Today, Structured Dynamics is pleased to release Open SEAS, its methodology for Semantic Enterprise Adoption and Solutions. At the same time, we are donating the framework to the open source MIKE2.0 Method for an Integrated Knowledge Environment project.

Open SEAS provides a framework for the enterprise to establish a coherent, consistent and interoperable layer across its information assets. It is compliant with the MIKE2.0 Semantic Enterprise Solution Offering.

Open SEAS has been developed for enterprises desiring to initiate or extend their involvement with semantic technologies. It is inherently incremental, low-cost and low-risk.

Donation and Relation to MIKE2.0

Concurrent with this release, Structured Dynamics is also donating the methodology and all of its related intellectual assets to the MIKE2.0 project. Under Creative Commons license and MIKE2.0’s content governance policies, the community’s current 2000+ members are now free to expand and use the Open SEAS methodology in any manner they see fit.
MIKE2.0 Logo

Last week, I began to introduce MIKE2.0 and its methodology to the readers of this blog. MIKE2.0 provides a complete delivery environment and methodology for information management projects in the enterprise. Solutions — from the specific to the composite — are described and packaged with respect to plans, management communications, products (open source and proprietary), activities, benchmarks, and deliverables. Delivery is accomplished over multiple increments, split into five phases from definition and planning to deployment. The assets associated with this framework first are based on templates and guidelines that can be applied to any information management area. The framework allows for multiple projects to be combined and inter-related, all under a common methodology. More information and a good entry point is provided on the What is MIKE2.0? page on the project’s main Web site.

MIKE2.0 presently has some 800 resources across about 40 solution areas. With Structured Dynamics’ donation, there are now about 40 resources related to the semantic enterprise, many of them major, accompanied by many images and figures. This contribution makes the Semantic Enterprise Solution Offering instantly one of the more complete within MIKE2.0. As noted below, this contribution is also just a beginning of our commitment.

Basic Overview of Open SEAS

The Open SEAS framework is Structured Dynamics’ specific implementation framework for MIKE2.0’s Semantic Enterprise Solution Offering. This section overviews some of Open SEAS‘ key facets.

A Grounding in the Open World Approach

Many enterprise information systems, particularly relational ones, embody a closed world assumption that holds that any statement that is not known to be true is false. This premise works well where there is complete coverage of specific items, such as the enumeration of all customers or all products.

Yet, in most areas of the real (”open”) world there is no guarantee or likelihood of complete coverage. Under an open world assumption the lack of a given assertion or fact does not imply whether that possible assertion is true or false: it simply is not known. An open world assumption is one of the key factors that defines the open Semantic Enterprise Offering and enables it to be deployed incrementally. It is also the basis for enabling linkage to external (often incomplete) datasets.
Pillars of the Open Semantic Enterprise

Fortunately, there is no requirement for enterprises to make some philosophical commitment to either closed- or open-world systems or reasoning. It is perfectly acceptable to combine traditional closed-world relational systems with open-world reasoning. It is also not necessary to make any choices or trade-offs about using public v. private data or combinations thereof. All combinations are acceptable when the basis for integration is an open-world one.

Open SEAS is grounded in this “open” style. It can be employed in virtually any enterprise circumstance and at any scope, and expanded in a similar way as budget and needs allow.

Other Basic Pillars to the Framework

Open SEAS is based on seven pillars, which themselves inform the basis for the MIKE2.0 Guiding Principles for the Open Semantic Enterprise. These principles cover data model, architecture, deployment practices and approach for how an enterprise can begin and then extend its use of semantics for information interoperability.

Important aspects are linked data or Web-oriented architecture, but it is really the unique combination of open-world approach and the RDF data model and its semantic power that provide the distinctive differences for Open SEAS. An exciting prospect — but still in its early stages of discovery and implementation — is the role of adaptive ontologies to power ontology-driven applications. These prospects, if fully realized, could totally remake how knowledge workers interact and specify the applications that manage their information environment.

Embracing the Layered Semantic Enterprise Architecture

Open SEAS also fully embraces the Layered Semantic Enterprise Architecture of MIKE2.0’s Semantic Enterprise Offering. This architecture acts as a subsequent set of functions or middleware with respect to the MIKE2.0’s standard SAFE Architecture. Most of the existing SAFE architecture resides in the Existing Assets layer. The specific aspects of Open SEAS resides in the layers above, namely Access/Conversion, Ontologies and the Applications Layers.

Using (Mostly) Open Source to Fill Gaps in the Technology Stack

Stitching together this interoperability layer above existing information and infrastructure assets requires many diverse tools and products, and there still are gaps. The layer figure below shows the semantic enterprise architecture overlaid with some representative open source projects and tools that plug some of those gaps.

Open SEAS also maintains a comprehensive roster of open source and proprietary tools in all aspects of semantic technology, ranging from data storage and converters, to Web services and middleware, and then to ultimate user applications. A database of nearly 1,000 tools in all areas is maintained for potential applicability to the methodology.

Quick, Adaptive, Agile Increments

The inherently incremental nature of the Open SEAS framework encourages experimentation, affordable deployments, and experience gathering. Because the systems and deployments put into place with this framework are based on the open world approach and use the extensible RDF data model, expansions in scope, sophistication or domain can be incorporated at any time without adverse effects on existing assets or systems or prior Open SEAS deployments.

Quick and (virtually) risk-free increments means that adopting semantic approaches in the enterprise can be accelerated (or not) based on empirical benefits and available budgets.

An Emphasis on Learning

The Open SEAS framework is built on a solid foundation, but it also one that is incomplete. Deployments of semantic technologies and approaches are still quite early in the enterprise, whether measured in numbers, scope or depth. In order for the framework — and the practice of semantic adoption in general — to continue to expand and be relevant in the enterprise, active learning and documentation is essential. One of the reasons for the affiliation of Open SEAS with MIKE2.0 is to leverage these strong roots in methodological learning.

Where Do We Go From Here?

The nature of Open SEAS and its parent Semantic Enterprise Solution Offering touches most offerings within the MIKE2.0 framework. There is much to be done to integrate the semantic enterprise perspective into these other possibilities, plus much that needs to be learned and documented for the offering itself. The concept of the semantic enterprise, after all, is relatively new with few prominent case studies.

As the offering points out, there are some dozens of addition necessary resources that are available and ready to be packaged and moved into the MIKE2.0 framework. These efforts are a priority, and will continue over the coming weeks.

But, more importantly, beyond that, the experience and practitioner base needs to grow. Much is unknown regarding key aspects of the offering:

  • What are the priority application areas which promise the greatest return on investment?
  • What are best practices for adoption and technologies across the entire semantic enterprise stack?
  • Many tools and techniques are still legacies and outgrowths of the research and academic communities. How can these be adopted and modified to meet enterprise standards and expectations?
  • What are the “best” ontology and vocabulary building blocks upon which to model and help frame the enterprise’s interoperability needs?
  • What are the most cost-effective strategies for leveraging existing information and infrastructure assets, while transitioning away from them where appropriate?

Despite these questions, emergence is the way complex systems arise out of a multiple of relatively simple interactions, exhibiting new and unforeseen properties in the process. RDF is an emergent model. It begins as simple “fact” statements of triples, that may then be combined and expanded into ever-more complex structures and stories. As an internal, canonical data model, RDF has advantages for information federation and development over any other approach. It can represent, describe, combine, extend and adapt data and their organizational schema flexibly and at will. Applications built upon RDF can explore and analyze in ways not easily available with other models.

Combined with an open-world approach, new information can be brought in and incorporated to the framework step-by-step. Perhaps the greatest promise in an ongoing transition to become a semantic enterprise is how an inherently incremental and building-block approach might alter prior practices and risks across the entire information management spectrum.

We invite you to join us and to contribute to this effort. I encourage you to join MIKE2.0 if you have not already done so, and check out announcements on this blog for ongoing developments.

by Mike at March 01, 2010 05:19 PM

February 26, 2010

Blog Data Space (Kingsley Idehen)

OpenLink Virtuoso - Product Value Proposition Overiew

Situation Analysis

Since the beginning of the modern IT era, each period of innovation has inadvertently introduced its fair share of Data Silos. The driving force behind this anomaly remains an overemphasis on the role of applications when selecting problem solutions. Unfortunately, most solution selecting decision makers remain oblivious to the fact that most applications are architecturally monolithic; i.e., they fail to separate the following five layers that are critical to all solutions:

  1. Data Unit (Datum or Data Object) Identity,
  2. Data Storage/Persistence,
  3. Data Access,
  4. Data Representation, and
  5. Data Presentation/Visualization.

The rise of the Internet, and its exponentially-growing user-friendly enclave known as the World Wide Web, is bringing the intrinsic costs of the monolithic application architecture anomaly to bear -- in manners unanticipated by many. For example, the emergence of network-oriented solutions across the realms of Enterprise 2.0-based Collaboration and Web 2.0-based Software-as-a-Service (SaaS), combined with the overarching influence of Social Media, are producing more heterogeneously-structured and disparately-located data sources than people can effectively process.

As is often the case, a variety of problem and product monikers have emerged for the data access and integration challenges outlined above. Contemporary examples include Enterprise Information Integration, Master Data Management, and Data Virtualization. Labeling aside, the fundamental issues of the unresolved Data Integration challenge boil down to the following:

  • Data Model Heterogeneity
  • Data Quality (Cleanliness)
  • Semantic Variance across Contexts (e.g., weights and measures).

Effectively solving today's data integration challenges requires a move away from monolithic application architecture to loosely-coupled, network-centric application architectures. Basically, we need a ubiquitous network-centric application protocol that lends itself to loosely-coupled across-the-wire orchestration of data interactions. In short, this will be what revitalizes the art of application development and deployment.

The World Wide Web is built around a network application protocol called HTTP. This protocol intrinsically separates the five layers listed earlier, thereby enabling:

  • Use of Generic HTTP URIs as Data Object (Entity) Identifiers;
  • Identifier Co-reference, such that multiple Data Object Identifiers may reference the same Data Object;
  • Use of the Entity-Attribute-Value Model to describe Data Objects using real world modeling friendly conceptual graphs;
  • Use of HTTP URLs to Identify Locations of Resources that bear (host) Data Object Descriptions (Representations);
  • Data Access mechanism for retrieving Data Object Representations from persistent or transient storage locations.

What is Virtuoso?

A uniquely designed to address today's escalating Data Access and Integration challenges without compromising performance, security, or platform independence. At its core lies an unrivaled commitment to industry standards combined with unique technology innovation that transcends erstwhile distinct realms such as:

When Virtuoso is installed and running, HTTP-based Data Objects are automatically created as a by-product of its powerful data virtualization, transcending data sources and data representation formats. The benefits of such power extend across profiles such as:

Product Benefits Summary

  • Enterprise Agility — Virtuoso lets you mix-&-match best-of-class combinations of Operating Systems, Programming Environments, Database Engines and Data-Access Middleware when building or tweaking your IS infrastructure, without the typical impedance of vendor-lock-in.
  • Data Model Dexterity — By supporting multiple protocols and data models in a single product, Virtuoso protects you against costly vulnerabilities such as: perennial acquisition and accumulation of expensive data model specific DBMS products that still operate on the fundamental principle of: proprietary technology lock-in, at a time when heterogeneity continues to intrinsically define the information technology landscape.
  • Cost-effectiveness — By providing a single point of access (and single-sign-on, SSO) to a plethora of Web 2.0-style social networks, Web Services, and Content Management Systems, and by using Data Object Identifiers as units of Data Virtualization that become the focal points of all data access, Virtuoso lowers the cost to exploit emerging frontiers such as socially-enhanced enterprise collaboration.
  • Speed of Exploitation — Virtuoso provides the ability to rapidly assemble 360-degree conceptual views of data, across internal line-of-business application (CRM, ERP, ECM, HR, etc.) data and/or external data sources, whether these are unstructured, semi-structured, or fully structured.

Bottom line, Virtuoso delivers unrivaled flexibility and scalability, without compromising performance or security.

Related

 

by Kingsley Uyi Idehen (kidehen@openlinksw.com) at February 26, 2010 07:12 PM

February 23, 2010

AI3:::Adaptive Information (Mike Bergman)

MIKE2.0: Open Source Information Development in the Enterprise

MIKE2.0 Logo
A Maturing Standard, Worthy of Adoption

Enterprises are hungry for guidance and assistance in learning how to embrace semantics and semantic technologies in their organization. Because of our services and products and my blog writings, we field many inquiries at Structured Dynamics about best practices and methods for transitioning to a semantic enterprise.

Until the middle of last year, we had been mostly focused on software development projects and our middleware efforts via things like conStruct, structWSF, irON and UMBEL. While we also were helping in early engagement and assessment efforts, it was becoming clear that more formalized (and documented!) methods and techniques were warranted. We needed concrete next steps to offer the organization once they became intrigued and then excited about what it might mean to become a semantic enterprise.

For decades, of course, various management and IT consultancies have focused on assisting enterprises adopt new work methods and information management approaches and technologies. These practices have resulted in a wealth of knowledge and methods, all attuned to enterprise needs and culture. Unfortunately, these methods have also been highly proprietary and hidden behind case studies and engagements often purposely kept from public view.

So, in parallel with formulating and documenting our own approaches — some of which are quite new and unique to the semantic space (with its open world flavor as we practice it) — we also have been active students for what others have done and written about information management assessment and change in the enterprise. Despite the hundreds of management books published each year and the deluge of articles and pundits, there are surprisingly few “meaty” sources of actual methods and templates around which to build concrete assessment and adoption methods.

The challenge here is not to present simply a few ideas or to spin some writings (or a full book!) around them. Rather, we need the templates, checklists, guidances, tools listings, frameworks, methods, test harnesses, codified approaches, scheduling and budgeting constructs, and so forth that takes initial excitement and ideas to prototyping and then deployment. These methodological assets take tens to hundreds of person-years to develop. They must also embody the philosophies and approaches consistent with our views and innovations.

Customers like to see the methods and deliverables that assessment and planning efforts can bring to them. But traditional consultancies have been naturally reluctant to share these intellectual assets with the marketplace — unless for a fee. Like many growing small companies before us, Structured Dynamics was thus embarking on systematically building its own assets up, as engagements and time allowed.

Welcome to MIKE2.0 and A Bit of History

I first heard of MIKE2.0 from Alan Morrison of PriceWaterhouseCoopers’ Center for Technology and Innovation and from Steve Ardire, a senior advisor to SD. My first reaction was pretty negative, both because I couldn’t believe why anyone would name a methodology after me (hehe) and I also have been pretty cool to the proliferation of version numbers for things other than software or standards.

However, through Alan and Steve’s good offices we were then introduced to two of the leaders of MIKE2.0, Sean McClowry of PWC and then Rob Hillard of Deloitte. Along with BearingPoint, the original initiator and contributor to MIKE2.0, these three organizations and their key principals provide much of the organizational horsepower and resource support to MIKE2.0.

Based on the fantastic support of the community and the resources of MIKE2.0 itself (see concluding section on Why We Like the Framework), we began digging deeper into the MIKE2.0 Web site and its methodology and resources. For the reasons summarized in this article, we were amazed with the scope and completeness of the framework, and very comfortable with its approach to developing working deployments consistent with our own philosophy of incremental expansion and learning.

Method for an Integrated Knowledge Environment (MIKE2.0) is an open source delivery framework for enterprise information management. It provides a comprehensive methodology (747 significant articles so far) that can be applied across a number of different projects within the information management space. While initially focused around structured data, the goal of MIKE2.0 is to provide a comprehensive methodology for any type of information development.

Information development is an approach organizations can apply to treat information as a strategic asset through their complete supply chain: from how it is created, accessed, presented and used in decision-making to how it is shared, kept secure, stored and destroyed. Information development is a key concept of the MIKE2.0 methodology and a central tenet of its philosophy:

The concept of Information Development is based on the premise that due to its complexity, we currently lack the methods, technologies and skills to solve our information management challenges. Many of the techniques in use today are relatively immature and fragmented and the problems keep getting more difficult to solve. This is one of the reasons we see so many problems today and why organizations that manage information well are so successful.

MIKE2.0 is not a framework for general transactional or operational purposes regarding data or records in the enterprise. (Though it does support functions related to analyzing that information.) Rather, MIKE2.0 is geared to the knowledge management or information management environment, with a clear emphasis on enterprise-wide issues, information integration and collaboration.

The MIKE2.0 methodology was initially created by a team from BearingPoint, a leading management and technology consultancy. The project started as “MIKE2″, an internal approach to aid enterprises to improve their information management. The MIKE2 initiative was started in early 2005 and the methodology was brought through a number of release cycles until it reached a mature state in late 2005. “MIKE2.0″ involved taking this approach and making it open source and more collaborative. Much of the content of the MIKE2.0 methodology was made available to the open source community in late December 2006. The actual MIKE2.0 Web site and release occurred in 2007.

Anyone can join MIKE2.0, which adheres to an open source and Creative Commons model. Governance of MIKE2.0 is based on a meritocracy model, similar to the principles followed by the Apache Software Foundation.

There is much additional background on MIKE2.0. Also, for an explanation of the rationale for the framework, see the MIKE2.0 article, A New Model for the Enterprise.

A Surprisingly Robust and Complete Framework

MIKE2.0 provides a complete delivery framework for information management projects in the enterprise. The assets associated with this framework first are based on templates and guidelines that can be applied to any information management area. This is a key source of our interest in the framework.

But, there is also real content behind these templates. There is a slate of “solution offerings” geared to most areas of enterprise information management. There are “solution capabilities” that describe the tools and templates by which these solutions need to be specified, planned and tracked. There are frameworks for relating specific vendor and open source tools to each offering. And, there are general strategic and other guidances for how to communicate the current state of the discipline as well as its possible future states.

The next diagram captures some of these major elements:

Perhaps the most important aspect of this framework, however, are the ways by which it provides solid guidance for how entirely new solution areas — the semantic enterprise, for example, in Structured Dynamics’ own case — can be expressed and “codified” in ways meaningful to enterprise customers. These frameworks provide a common competency across all areas of enterprise interest in information development and management. For a relatively new and small vendor such as us, this framework provides a credible meeting ground with the market.

A Phased and Incremental Approach to Information Development

The fundamental approach to a MIKE2.0 offering is staged and incremental. This is very much in keeping with Structured Dynamics’ own philosophy, which, more importantly, also is consonant with the phased adoption and expansion of open semantic techologies within the enterprise.

Under the MIKE2.0 framework, the first two phases relate to strategy and assessment. The next three phases (of the five standard ones) produce the first meaningful implementation of the offering. Depending, that may range from a prototype to broader deployment, based on the maturity of the offering. Thereafter, scale-out and expansion occurs via a series of potential increments:

The incremental aspects of the later three phases are not dissimilar from “spiral” deployments common to some government procurements. The truth remains, however, that actual experience is quite limited in later increments, and whether these methodologies can hold over long periods of time is unknown. Despite this caution, most failures occur in the earliest phases of a project. MIKE2.0 has strong framework support in these early phases.

A Broad Spectrum of Capabilities, Assets and Solutions

MIKE2.0 “solutions” are presented as offerings from single ones to a variety of clusters or groupings. These types reflect the real circumstances of applications and deployments at either the departmental or enterprise level. They may range from systematic to those that address specific business and technology problems. Tools and solutions may be work process, human, or technological, proprietary or open.

An overarching purpose of the MIKE2.0 methodology is to couch these variations into a consistent and holistic framework that allows individual or multiple pieces to be combined and inter-related. This consistency is a key to the core objective of information management interoperability across whatever solution profile the enterprise may choose to adopt.

This objective is best expressed via the Overall Implementation Guide. Thus, while detailed aspects of MIKE2.0’s solution offerings may encompass very specific techniques, design patterns and process steps, in combination these pieces can be combined into meaningful wholes.

This spectrum of solution possibilities is organized according to:

  • Vendor Solution Offerings – integrated approaches to solving problems from a vendor perspective and are often product-specific
  • The SAFE Architecture – an adaptive framework that is flexible and extensible, and can be deployed incrementally

These groupings are shown in the diagram below, with the “core” and composite groupings shown in the middle:

Nearly Two Score Core and Composite Offerings

These central core and composite groupings, of course, are comprised of more focused and specific solutions. While it is really not the purpose of this piece to describe any of these MIKE2.0 specifics in detail, the next diagram helps illustrate the scope and breadth of the current framework.

Here are the some 30+ individual “core” solution offerings:

These are also accompanied by 8 or so cross-cutting “composite” solutions that reach across many of the core aspects.

Whether core or component, there is a patterned set of resources, guidances and templates that accompany each solution. The MIKE2.0 Web site and resources are generally organized around these various core or composite solutions.

Why We Like the Framework

MIKE2.0 is a project that walks its talk. Here are some of the reasons why we like the framework and how it is managed, and why we plan to be active participants as it moves forward:

  • Open source with a true collaboration and welcoming commitment
  • A sympatico philosophy and grounding
  • Knowledgeable and friendly leaders and governance structure
  • An incremental, adaptive framework, in keeping with our own beliefs and the reality of the marketplace for emerging practices
  • A proven core methodology, with many existing templates and tools for leveraging methodology development and extensions
  • Much available content (about 800 articles as of this date)
  • A smart, active community contributing on a constant basis
  • Backup and support from leading management and IT consultants
  • Active evangelists, with attention to community communications and care-and-feeding
  • The MIKE2.0 environment itself (based on the OmCollab collaboration platform), which is well thought out with many nice community and content aspects
  • A budget and roadmap for how to extend the methodology and achieve the vision.

We invite you to learn more about MIKE2.0 and join with us in helping it to continue to grow and mature.

And, oh, as to that aversion to the MIKE2.0 name? Well, with our recent addition of Citizen DAN, it is apparent we are adopting as many boys as we can. Welcome to the family, MIKE2.0!

by Mike at February 23, 2010 06:51 AM

February 18, 2010

Frederick Giasson

structWSF Web Services Tutorial

One thing that was hard to do with structWSF was explaining what structWSF is, and how users can interact with it. For most people, structWSF was abstracted behind conStruct and they didn’t know that each single functionalities of conStruct was bound to one, or multiple queries to one, or multiple, structWSF instance.

It is the reason why we took the time to write a complete structWSF interaction tutorial. This tutorial explains what the general structWSF architecture is, and it describes a series of general interaction usecases. We hope that this tutorial will helps developers and system implementators understanding the capabilities of structWSF and how they can use it.

You can read the complete structWSF Web Services Tutorial here.

Additionally, we released a new version of structWSF, conStruct and the irJSON Parser which are products of this toturial.

by Fred at February 18, 2010 09:45 PM

February 15, 2010

AI3:::Adaptive Information (Mike Bergman)

Two Contrasting Styles for the Semantic Enterprise

Two Faces in Circle, from http://energeticrelations.com/Our Own Approach is Adaptive and Incremental

It is gratifying to see the emergence of the term semantic enterprise, with much increased attention and commentary. But, similar to different styles and patterns in software programming, there is not a single (nor best, depending on circumstance) way to approach becoming a semantic enterprise.

In this piece I contrast two styles. The more traditional and familiar one is comprehensive, complete and “engineered” in its approach. The second, and emerging style, is more adaptive and incremental. While Structured Dynamics is a proponent and thought leader for the adaptive style, the use and applicability of either approach is really a function of objectives and circumstances. The choice of approach depends on use case, and should not be a dogmatic one.

Any time a contrast is posed, one should be on guard about setting up a rhetorical strawman. There may perhaps be a bit of this flavor in this article; if so, it is unintended. It is probably best to realize that there is a gradient — or spectrum — of possible approaches between these contrasting styles. The real message is to understand these differences such that you can comfortably place your own organization at the right points along this spectrum.

A Spectrum of Advantages and Differences

The general idea of semantics in the enterprise preceeds the use of the term, having been somewhat captured before by the ideas of enterprise application integration, enterprise information integration and other concepts even related to data federation and data warehousing stretching back to the 1980s. However, as a specific label, we can look back to the first mentions in the late 1990s and more concerted attention beginning from about 2002 or so onward [1]. As another indicator, since 2005 the Semantic Technology Conference has given specific prominence to the enterprise [2].

Throughout this period, the sense from academic papers, many vendors, and most pundits [3] has been on things like automated reasoning, machine-aided decision making, aspects of artificial intelligence, and so forth. The general tone is often framed as “revolution” or “massive changes” or something “entirely new.” If you are a consultant or software/implementation vendor — especially where VC money is backing the venture with hopes for big returns and home runs — it may make cynical sense to sell such large and costly change.

I believe there are circumstances where the Semantic Enterprise writ this large may make sense and be financially justified. But, this kind of “big change” view has also seen relatively few visible (or successful) deployments. It has colored what it means to be a semantic enterprise. And, I believe, it has weakened market credibility by perhaps overpromising and underdelivering. The conventional view of what it is be a semantic enterprise deserves to be balanced.

So, as we balance this understanding of the semantic enterprise to one that is more nuanced, we can contrast the characteristics of the two apposite styles as follows:

Characteristics of the
Comprehensive, ‘Engineered’ Style
Characteristics of the
Adaptive, Incremental Style
  • A focus on a more complete, comprehensive coverage of the semantics in the domain
  • More enterprise-wide, less partial or departmental
  • Greater emphasis on “closed world” approaches [4]; more akin to relational database architecting and schema
  • Expansion is possible, but effort may be somewhat complex
  • A general implication is to replace or supplant existing information structures with semantic ones
  • Not necessarily based on semantic Web standards and languages [5] (e.g., may include Common Logic, frame logics, etc.)
  • Richer set of predicates (relations)
  • Though a distinction is maintained between schema and instances, their separation may not be consistently (physically) enforced
  • Often more complicated inferencing and logic tests
  • More complete enumeration and characterization of items
  • Much process around semantics agreement across groups
  • Fairly well-developed implementation tools, including for ontology engineering
  • Implementation times in months to years
  • Implementation costs akin to traditional large-scale IT projects
  • An emphasis on a simpler, incremental, “learn as you go” approach
  • Start with single departments or limited vertical apps
  • Embedded in the “open world” approach [4], with incorporation of external information
  • Design and approach inherently allows incremental expansion and adaptation
  • A key premise is to build from and leverage existing information structures, vocabularies and assets
  • Fully based on semantic Web standards and languages [5], often including linked data [6]
  • Tends to start simply with hierarchical or related concepts (e.g., SKOS)
  • Conscious distinction in the structure for handling schema separate from instances [7]
  • Inferencing logic based more on concept matching, or parent-child or part-of relationships
  • Degree of item characterization based on current scope
  • Initial semantic matching can be driven from existing assets
  • Fairly well-developed implementation tools, except for how to engage publics in the development process
  • Implementation times in weeks to months
  • Implementation costs driven by available budgets (and thus scope)

Note we have labeled the conventional approach as the “comprehensive, engineering” style; its contrast, and the one we position more closely to, is the “adaptive, incremental” style.

[Others have posited contrasting styles, most often as "top down" v. "bottom up." However, in one interpretation of that distinction, "top down" means a layer on top of the existing Web [8]. On the other hand, “top down” is more often understood in the sense of a “comprehensive, engineered” view, consistent with my own understanding [9]. Yet no matter which characterization, neither captures what I feel to be the more important considerations of mindset, logic and premise.]

Though the table above contrasts many points, I think there are two main distinctions to the adaptive approach. First, it firmly embraces the open world assumption. OWA is key to an incremental, “learn as you go” deployment that is also well suited to incorporation of external information. The second main distinction is to leverage and build from existing assets.

A Spectrum of Applications

Yet as noted in the opening, which of these approaches makes better sense depends on circumstance. One aspect of circumstance is available budget and deployment times for pilots or proofs-of-concept. Another aspect, of course, is the planned use or application for the deployment.

These are by no means hard distinctions, but in general we can see these contrasting approaches applying to the following uses:

Applications and Uses for the
Comprehensive, ‘Engineered’ Style
(i.e., more CWA driven)
Applications and Uses for the
Adaptive, Incremental Style
(i.e., more OWA driven)
  • Bounded, “inward” applications (high degree of control and completeness)
  • Engineering enterprises
  • Technical domains and organizations
  • Aeronautics
  • Pharmaceuticals
  • Chemicals
  • Petroleum
  • Energy
  • A/E firms (construction)
  • External facing applications, organizations (customers, incorporation of external data)
  • Faceted Search
  • Taxonomy updates
  • Multi-domain master data management (MDM)
  • Simple (initially) inferencing
  • Consumer products
  • Finance
  • Health care
  • Knowledge enterprises

A critical distinction is the nature of the enterprise itself. “External-facing” enterprises or functions that want or need to incorporate much external information (say, marketing or competitive intelligence) are advised to look closely at the adaptive approach. Organizations that have more complete control over their circumstances should perhaps focus on the conventional approach.

Adoption Thresholds and Risks

In previous writings I have pointed to the manifest benefits that can accrue to the semantic enterprise [see, esp. 10]. But we also have witnessed nearly a decade of promotion for semantics in the enterprise, with perhaps a lack of progress in some areas or unmet promises in others. These raise questions and skepticism of the real eventual costs and benefits.

I believe some of this skepticism is inherent with anything new — the general IT fatigue from what the current “next great thing” might be. But I also believe that some of this skepticism results from an approach to semantics in the enterprise that is both lengthy to deploy and high cost.

The key advantage of the adaptive, incremental approach is that the whole IT game in the enterprise can change. An open world approach enables adoption as it proves itself and as budgets allow. Commitments made under this approach have, in essence, permanent value. Past fears and concerns about making “wrong” bets no longer apply. With learning, targets can be re-adjusted, structure re-defined and applications re-focused, all as new discoveries and broadening scope dictate.

This does not make the adaptive approach better than the conventional one. But, it does make it less risky and, well, more adaptive.


[1] For example, the earliest Google mentions on “semantic enterprise” date to about 1998 or 1999. In 2002, the University of Georgia and Amit Sheth offered the first known academic course on the Semantic Enterprise; see http://lsdis.cs.uga.edu/SemanticEnterprise/.
[2] See the conference guide for the Semantic Technology Conference 2005. The sixth one, the 2010 Semantic Technology Conference, is upcoming on June 21-25 in San Francisco.
[3] See, for example, Mitchell Ummell, ed., 2009. “The Rise of the Semantic Enterprise,” special dedicated edition of the Cutter IT Journal, Vol. 22(9), 40 pp., September 2009. See http://www.cutter.com/offers/semanticenterprise.html (after filling out contact form). Partially in response to this conventional view, I wrote [10]. In that article I offered as a working definition that “a semantic enterprise is one that adopts the languages and standards of the semantic Web . . . and applies them to the issues of information interoperability, preferably using the best practices of linked data.” That happens to be Structured Dynamics’ preferred definition, though as this posting indicates, there is a spectrum of definitions of the term.
[4] See, M.K. Bergman, 2009. “The Open World Assumption: Elephant in the Room“, AI3:::Adaptive Information blog, December 21, 2009.
[5] See for example RDF, RDFS, OWL , SKOS and SPARQL and others.
[6] Linked data is a set of best practices for publishing and deploying instance and class data using the RDF data model. Two of the best practices are to name the data objects using uniform resource identifiers (URIs), and to expose the data for access via the HTTP protocol. Both of these practices enable the Web to become a distributed database, which also means that Web architectures can also be readily employed.
[7] We use a basis in description logics for defining the roles and splits in schema and instances. As we define it:

“Description logics and their semantics traditionally split concepts and their relationships from the different treatment of instances and their attributes and roles, expressed as fact assertions. The concept split is known as the TBox (for terminological knowledge, the basis for T in TBox) and represents the schema or taxonomy of the domain at hand. The TBox is the structural and intensional component of conceptual relationships. The second split of instances is known as the ABox (for assertions, the basis for A in ABox) and describes the attributes of instances (and individuals), the roles between instances, and other assertions about instances regarding their class membership with the TBox concepts.”
[8] One article that got quite a bit of play a few years back was A. Iskold, 2007. “Top Down: A New Approach to the Semantic Web,” in ReadWrite Web, Sept. 20, 2007. The problem with this terminology is that it offers a completely different sense of “top down” to traditional uses. In Iskold’s argument, his “top down” is a layering on top of the existing Web.
[9] The more traditional view of “top down” with respect to the semantic Web is in relation to how the system is constructed. This is reflected well in a presentation from the NSF Workshop on DB & IS Research for Semantic Web and Enterprises, April 3, 2002, entitled “The ‘Emergent, Semantic Web: Top Down Design or Bottom Up Consensus?“. Under this view, top down is design and committee-driven; bottom up is more decentralized and based on social processes, which is more akin to Iskold’s “top down.”
[10] M.K. Bergman, 2009. “Fresh Perspectives on the Semantic Enterprise,” AI3:::Adaptive Information blog, Sept. 28, 2009.

by Mike at February 15, 2010 03:36 PM

February 01, 2010

Displacement Activities (Tom Heath)

Wash down the Apple tablet with a gulp of Kool Aid

I’m not in the least bit excited about the iPad, and it seems I’m not alone. The mood seems to have changed since before the launch, with countless tech journalists previously falling over themselves to declare tablets the next big thing. (Thankfully Rory Cellan-Jones from the BBC was more measured, focusing on personal projectors as a more exciting development). The mood since is considerably more downbeat, and I think more realistic.

I may be missing some crucial usage context that reveals the killer characteristics of the iPad, but I’ve tried really hard and still nothing. There are many obvious practical issues with the device:

  • it’s too big for a pocket, but not sufficiently more useful than an iPhone or an HTC Hero.
  • it’s about the same size as a compact laptop, but with less scope for comfortable rapid input.
  • it’s probably too big to cradle comfortably in my hand for prolonged periods, and sitting with one ankle on the other knee is not always practical.

The only scenarios I can conjure up where I could imagine using the device are:

  • showing people my holiday photos.
  • reviewing design proofs without needing to print them out.

Neither of these, or even both, are very compelling at all. TVs are getting good for viewing photos, by including e.g. an SD card slot, and rumours of the death of paper are greatly exagerated.

Perhaps the most annoying thing about the scenarios used to promote the device is the one about the San Francisco to Tokyo flight, watching video all the way without running out of battery. Any airline with planes worth boarding has personal video screens. I don’t want to bring my own. I’d rather use the space to carry a decent pair of noise-canceling headphones, which I’m sure increase my enjoyment of onboard media far more than a little bit of extra screen real estate. The development I want to see is not a new device that I have to prop on the flimsy airline table, hold tight when we hit some turbulence, and stow away when my food arrives, but the capability to connect my own device to the in-built screen via USB or Bluetooth. Even a bare USB port with power but no connectivity would be a start, allowing me to run low-powered devices (that I already own) during long flights.

OK, so the flight reference is just a touchstone for how long the device can run without mains power, but I think it demonstrates a lack of grounding of the device in realistic scenarios.

Any new device has to have two key characteristics these days for me to get excited: interoperability and convergence. The iPad seems to have very little of either. You could argue that it offers some convergence between smartphones and e-readers, but that’s about as exciting as convergence between a smartphone and a wall clock.

I’m left wondering what the iPad is competing against? I’m guessing it’s paper, whether that’s in the form of a book, brochure, newspaper, restaurant menu or whatever. Unfortunately for Apple, paper is pretty well suited to each of these, especially when you introduce bath water, the risk of theft, or just ketchup, into the equation. Perhaps this is the end of electronic picture frames as dedicated device? Probably about time. Maybe the iPad will make an excellent Spotify console for the living room. Who knows? Whatever happens I can’t see this becoming a mass-market product worthy of even a fraction of the hype.

Where I wish that Apple had expended their creative talent was in addressing the power issue. Not in making sure I could watch 10 hours of back to back video, but in enabling me to spend that energy in whatever way I choose, powering whichever device I choose. It drives me crazy that I carry several batteries around, and short of running my phone off my laptop via USB there is no interoperability between these power sources. If Apple could produce a universal power supply that was sleek, sexy, efficient and interoperable, then I would be interested. Sadly this doesn’t seem to be the way.

No related posts.

Related posts brought to you by Yet Another Related Posts Plugin.

by Tom Heath at February 01, 2010 03:30 PM

January 29, 2010

DBTune Blog

BBC Semantic Web use-case

After a very long time writing it, we finally have a BBC Semantic Web use-case on the W3C website! It describes work we did around BBC Programmes, BBC Music, BBC Wildlife Finder and Search+. I hope it all makes a bit of sense :-) For a more detailed writeup about these issues, Patrick's Linked Data on the BBC are very good.

by Yves at January 29, 2010 03:37 PM

January 27, 2010

Frederick Giasson

Behind Oz’s Curtain

Benjamin Nowack, creator of ARC and Trice, wrote an interesting blog post about the place of Microformats and RDFa in the HTML 5 specification. I am not deep into the specification itself, and so may lack some history context. However, the most interesting point in this article is not related to Microformats, RDFx or the new HTML 5 specification.

The point is that apparently, some people believe that it is RDF or nothing. This is not new, but is that true?

People (and particularly enterprises) want the benefits of structured data, not necessarily RDF. In fact, many people don’t know about RDF, or don’t understand RDF, or just don’t care about RDF. But, is it because you don’t know, understand or care about RDF that you cannot benefit from it? No, certainly not. And I think that is what Benjamin is talking about when he mentions things such as: “[...] to get RDF to the broader developer community“, “[...] here could have been a solution that would have served everybody sufficiently well, both HTMLers and RDFers“. “[...] they would most probably have been able to define RDFa 1.1 as a proper superset of Microdata”. RDF can be incarnated in multiple bodies, but it is still RDF. I think it is what Benjamin was suggesting, and it the path we took at Structured Dynamics.

We choose to use RDF behind Oz’s curtain. This means that at the core of any of our methodologies, systems and specifications, we use RDF. Why? Because it is the more flexible description framework available that helps us handle any other source of data. However, does that mean that we should push RDF in everybody’s face? Certainly not.

Our work with different enterprises from all kind of domains told us that we have to look beyond RDF while still using it (as paradoxically as that may appear). For example, we developed structWSF and conStruct such that people can upload (and manage) their data in different formats while being able to export it in all other different formats. At the core, these systems use RDF to manipulate all these different kind of formats, but from the outside, users simply use the format they care about, they use, or that they have available in their workflow. These users benefits from RDF without knowing it, understanding it or without caring about it. We don’t think RDF is for everyone, but everyone can benefit from RDF.

Another example of RDF behind Oz’s curtain is the irON description framework and its three serialization profiles: irJSON, irXML and commON that we developed. As stated in the Purpose section of this document, the goal was quite clear:

irON (instance record and Object Notation) is a abstract notation and associated vocabulary for specifying RDF triples and schema in non-RDF forms. Its purpose is to allow users and tools in non-RDF formats to stage interoperable datasets using RDF. The notation supports writing RDF and schema in JSON (irJSON), XML (irXML) and comma-delimited (CSV) formats (commON). The notation specification includes guidance for creating instance records (including in bulk), linkages to existing ontologies and schema, and schema definitions. Profiles and examples are also provided for each of the irXML, irJSON and commON serializations.

irON is premised on these considerations and observations:

  • RDF (Resource Description Framework) is a powerful canonical data model for data interoperability
  • However, most existing data is not written in RDF and many authors and publishers prefer other formats for various reasons
  • Many formats that are easier to author and read than RDF are variants of the attribute-value pair construct [2], which can readily be expressed as RDF, and
  • A common abstract notation for converting to RDF would also enable non-RDF formats to become somewhat interchangeable, thus allowing the strengths of each to be combined.

The irON notation and vocabulary is designed to allow the conceptual structure (”schema”) of datasets to be described, to facilitate easy description of the instance records that populate those datasets, and to link different structures for different schema to one another. In these manners, more-or-less complete RDF data structures and instances can be described in alternate formats and be made interoperable. irON provides a simple and naive information exchange notation expressive enough to describe most any data entity.

I think this is what Benjamin was talking about in his article, and the kind of mindset he was suggesting the RDF community to adopt. At least this is the minding we adopted at Structured Dynamics, and apparently it is the minding Benjamin adopted for his own business. I am sure there are many other people and organizations out there that are adopting the same point of view according to RDF and its role in the current data ecosystem.

by Fred at January 27, 2010 08:50 PM

January 20, 2010

Frederick Giasson

New versions of structWSF and conStruct

triple_120construct_logo_120

We just released a new (major) version of both structWSF and conStruct. Though some months had passed since we last released this software, we finally got the time and opportunity to make these important upgrades. Many things have changed in both packages. I don’t want to iterate all the changes in this blog post, so I would suggest you to read the changes log files here:

These new versions have greatly been impacted by the needs of our clients. We also started to introduce some new concepts we wrote about the last few months.

A really good addition to this release is the a brand new Installation Manual. Hopefully people will be able to “easily” and properly install and setup a Web server to host these two packages.

All documentation files have been updated:

You can download both software packages from here:

An Amazon EC2/EBS Architecture

Some of the changes to these new versions have been made to help create, setup and maintain Web servers that host structWSF and conStruct instances.

At Structured Dynamics, we have developed and use a server architecture that leverages Amazon computer-in-the-clouds services such as: EC2, EBS, Elastic IP in the Cloud. Such an architecture is giving us the flexibility to easily maintain and upgrade server instances, to instantly create new structWSF instances in one click (without performing all these steps everytime), etc.

You can contact us for more information about these EC2 AMIs and EBS Volumes that we developed for this purpose. Here is an overview of the architecture that is now in place:

structwsf_amazon

There is a clear separation of concerns between three major things:

  • Software & libraries
  • Configuration files
  • Data files.

We chose to put all software and libraries needed to create a stand-alone structWSF instance in an EC2 AMI. This means that all needed software to run a structWSF instance is present on the Virtuoso server running Ubuntu server.

Then we chose to put all configuration and data files on an EBS volume that we attach, and mount, on the EC2 instance. You can think about a EBS volume as a physical hard drive: it can be mounted on a server instance, but it can’t be shared between multiple instances.

By splitting the software & libraries, configuration and data files, we make sure that we can easily upgrade a structWSF server in production with the latest version of structWSF (its code base and all related software such as Virtuoso, Solr, etc). Since the configuration and data files are not on the EC2 instance, we can easily create a new EC2 instance by using the latest structWSF AMI we produced, and then to mount the configuration and data files EBS volume on the new (and upgraded) structWSF instance. That way, in a few clicks, we can fully upgrade a server in production without fear of disturbing the configuration or data files.

Additionally, we can easily create backups of configuration and data files at different intervals by using Amazon’s Snapshot technology.

Finally, we chose to put all related software and configuration files needed to run a conStruct instance in another, separate, EBS volume. That way, we have a clean structWSF AMI instance that can be upgraded at any time, and we can plug (mount) a conStruct instance (EBS instance) into a structWSF server at any time. This means that we can easily have structWSF instances with or without a conStruct instance. The same strategy can easily be used to create plugin packages that can be mounted and unmounted to any structWSF instance at any time, depending on the needs.

All this makes structWSF server instances maintenance easier, simpler and faster.

by Fred at January 20, 2010 10:42 PM

January 14, 2010

DBTune Blog

Live SPARQL end-point for BBC Programmes

Update: We seem to have an issue with the 4store hosting the dataset, so the data is stale since the end of February.

Last year, we got OpenLink and Talis to crawl BBC Programmes and provide two SPARQL end-points on top of the aggregated data. However, getting the data by crawling it means that the end-points did not have all the data, and that the data can get quite outdated -- especially as our programme data changes a lot.

At the moment, our data comes from two sources: PIPs (the central programme database at the BBC) and PIT (our content mangement system for programme information). In order to populate the /programmes database, we monitor changes on these two sources and replicate them on our database. We have a small piece of Ruby/ActiveRecord software (that we call the Tapp) which handles this process.

I made a small experiment, converting our ActiveRecord objects to RDF and hooking an HTTP POST or an HTTP DELETE request to a 4store instance for each change we receive. This means that this 4store instance is kept in sync with upstream data sources.

It took a while to backfill, but it is now up-to-date. Check out the SPARQL end-point, a test SPARQL query form and the size of the endpoint (currently about 44 million triples).

The end-point holds all information about services, programmes, categories, versions, broadcasts, ondemands, time intervals and segments, as defined within the Programme Ontology. All of these resources are held within their own named graph, which means we have a very large number of graphs (about 5 million). It makes it far easier to update the endpoint, as we can just replace the whole graph whenever something changes for a resource.

This is still highly experimental though, and and I already found a few bugs: some episodes seem to be missing (for example, some Strictly Come Dancing episodes are missing, for some reason). I've also encountered some really weird crashes of the machine hosting the end-point when concurrently pushing a large number of RDF documents at it - I still didn't succeed to identify the cause of it. To summarise: it might die without notice :-)

Here are some example SPARQL queries:

  • All programmes related to James Bond:
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?uri ?label
WHERE {
  ?uri po:category 
    <http://www.bbc.co.uk/programmes/people/bmFtZS9ib25kLCBqYW1lcyAobm8gcXVhbGlmaWVyKQ#person> ; rdfs:label ?label
}
  • FInd all Eastenders broadcast dates after 2009-01-01, along with the type of the version that was broadcast:
PREFIX event: <http://purl.org/NET/c4dm/event.owl#> 
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#> 
PREFIX po: <http://purl.org/ontology/po/> 
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?version_type ?broadcast_start
WHERE
{ <http://www.bbc.co.uk/programmes/b006m86d#programme> po:episode ?episode .
  ?episode po:version ?version .
  ?version a ?version_type .
  ?broadcast po:broadcast_of ?version .
  ?broadcast event:time ?time .
  ?time tl:start ?broadcast_start .
  FILTER ((?version_type != <http://purl.org/ontology/po/Version>) && (?broadcast_start > "2009-01-01T00:00:00Z"^^xsd:dateTime))}
PREFIX po: <http://purl.org/ontology/po/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX mo: <http://purl.org/ontology/mo/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX event: <http://purl.org/NET/c4dm/event.owl#>
PREFIX tl: <http://purl.org/NET/c4dm/timeline.owl#>
PREFIX owl: <http://www.w3.org/2002/07/owl#> 
SELECT DISTINCT ?programme ?label
WHERE {
  ?event1 po:track ?track1 .
  ?track1 foaf:maker ?maker1 . ?maker1 owl:sameAs <http://www.bbc.co.uk/music/artists/67f66c07-6e61-4026-ade5-7e782fad3a5d#artist> .
  ?event2 po:track ?track2 .
  ?track2 foaf:maker ?maker2 . ?maker2 owl:sameAs <http://www.bbc.co.uk/music/artists/fb7272ba-f130-4f0a-934d-6eeea4c18c9a#artist> .
  ?event1 event:time ?t1 .
  ?event2 event:time ?t2 .
  ?t1 tl:timeline ?tl .
  ?t2 tl:timeline ?tl .
  ?version po:time ?t .
  ?t tl:timeline ?tl .
  ?programme po:version ?version .
  ?programme rdfs:label ?label .
}

by Yves at January 14, 2010 12:30 PM

January 11, 2010

Do What I Mean (Richard Cyganiak)

prefix.cc, MkII

prefix.cc is a website I’ve made last February to ease a very common task in the life of RDF developers and SPARQL users: looking up namespace URIs. A short summary of what the site can do for you is available here.

The site was developed during a few weekends, and I haven’t touched the code since I first deployed it. Today I’m publishing the first serious update to the site. This post describes what’s new.

Reverse lookup. One of the most requested features is reverse lookup. You can now enter a URI of an RDF term into the query box on the start page, and the site will respond with the best prefix for contracting that URI into a QName. This functionality is also available as an API.

Negative votes. The site has received a moderate amount of spam, mostly from pranksters who think it would be funny to propose their own homepage as a better expansion for the foaf prefix. I’ve mostly cleaned this up manually, but I think it would be better to equip the user community with tools to handle this.

The site has always had a voting mechanism, which I intended as a tiebreaker in cases where people have submitted different URIs for the same prefix, for example in the case of the dc prefix. Starting today, you can submit both positive and negative votes. If a URI receives a certain amount of negative votes, it will be no longer shown.

New export formats. One of my favourite features is the ability to directly get output in various machine-readible syntaxes by composing an appropriate URI, such as http://prefix.cc/foaf.file.n3, which produces a declaration of the FOAF prefix in N3 format. I find this handy for copy-pasting into a text editor, but also for automating things.

A few formats have been added: vann produces an RDF/XML version of the namespace mapping in the VANN vocabulary (example). xmlns produces raw XML prefix declarations (example). go redirects to the namespace URI, so you can type http://prefix.cc/foaf.go into your browser bar as a shortcut for opening the FOAF specification. I’ve also added a table of all supported formats.

A side effect of the introduction of VANN support is that there is now a single VANN representation of all mappings known to the site.

Tweaks and fixes. Regular users will note a number of further small changes and bugfixes throughout the site. One notable fix is to the way namespace lookups are calculated for the list of popular prefixes. Ironically, most of the lookups actually are from web crawlers that followed the links in the list itself, making the list self-perpetuating. Also, the list featured the non-existing robots prefix, because many crawlers are looking for http://prefix.cc/robots.txt. These issues should now be fixed.

Internal changes. The site is developed in PHP, and started out as a quick weekend hack, so the initial code was a horrible mess that was hardly maintainable. I spent quite some time cleaning this up and refactoring the code into a much nicer structure that should be able to grow along with some of the additional features I’ve planned for the future. The codebase now totals some 1600 lines of PHP, CSS and Javascript.

Hidden goodies: RDFa markup and feed of latest additions. Finally, I want to highlight some features that have existed all along, but are easily missed: First, many pages contain RDFa markup, so if you want to re-use any prefix.cc data in your own site or application, you most likely can. Second, there is an RSS feed of the latest additions to the prefix database, and it is a neat way of learning about new vocabularies and ontologies that show up around the Web.

Bugs, comments, suggestions? Any feedback is appreciated. I did a lot of refactoring without a test harness, so it’s quite likely that a few new bugs have crept in. If you notice anything, please let me know. Also, if there is anything that you would like to see in prefix.cc Mk III, please share!

by Richard Cyganiak at January 11, 2010 05:25 PM

December 29, 2009

Orri Erling

Linked Data and Virtuoso in 2010

It is again time for the end-of-year blog post.

In 2009, RDF scalability questions were solved in their broad outline and the corresponding Virtuoso release was built and used in production internally. Its general availability is now imminent while it has been available on a case-by-case basis thus far.

In 2010, we take on a new challenge: To bring RDF closer to parity with equivalent relational solutions. This will also entail some significant improvements to our relational technology.

Storage density is a key ingredient of performance. Some of the advances will be in this area; other advances will be in increased parallelism of execution. Right now we run things in vectored batches in cluster situations where message latency forces operations to be shipped in large chunks. Next we will do this across the board, also in single servers. The advantages of this for cache behavior and other factors are known in the literature.

Looking at environmental factors, we have a new SPARQL at a Working Draft stage. We have basic parity with SQL expressivity, which is a prerequisite for RDF to become a data model that can be an alternative to relational outside of very specialized contexts.

As the standards process makes SPARQL closer to being an alternative to SQL for data integration, we will make the database engine technology such that RDF's inherent penalty in terms of storage overhead and processing time substantially decreases. This will make RDF a workable integration medium also in places where it was not such before. Of course, an application-specific schema will retain some advantage over a generic one, but then one can have a purely relational application on Virtuoso as well. Just think of the possibility of an application-specific schema emerging by itself in a workload-driven fashion.

As background data for an increasing number of fields becomes available as linked data, using this together with proprietary data for analytics and discovery becomes increasingly interesting. This is the initial line of RDF data warehousing. The biomedical field has many examples. The technologies we will release during 2010 will be geared towards enabling a second line of RDF applications, where ad hoc agile integration with RDF as a lingua franca becomes a real alternative to relational solutions with ETL point solutions for harvesting information from diverse systems. One may see how RDF's flexibility and expressivity may add to agility in any number of situations where data from heterogenous sources needs to be integrated. Which of today's business scenarios does not face this issue?

References:

by Orri Erling (oerling@openlinksw.com) at December 29, 2009 03:24 PM

Linked Data and Virtuoso in 2010

It is again time for the end-of-year blog post.

In 2009, RDF scalability questions were solved in their broad outline and the corresponding Virtuoso release was built and used in production internally. Its general availability is now imminent while it has been available on a case-by-case basis thus far.

In 2010, we take on a new challenge: To bring RDF closer to parity with equivalent relational solutions. This will also entail some significant improvements to our relational technology.

Storage density is a key ingredient of performance. Some of the advances will be in this area; other advances will be in increased parallelism of execution. Right now we run things in vectored batches in cluster situations where message latency forces operations to be shipped in large chunks. Next we will do this across the board, also in single servers. The advantages of this for cache behavior and other factors are known in the literature.

Looking at environmental factors, we have a new SPARQL at a Working Draft stage. We have basic parity with SQL expressivity, which is a prerequisite for RDF to become a data model that can be an alternative to relational outside of very specialized contexts.

As the standards process makes SPARQL closer to being an alternative to SQL for data integration, we will make the database engine technology such that RDF's inherent penalty in terms of storage overhead and processing time substantially decreases. This will make RDF a workable integration medium also in places where it was not such before. Of course, an application-specific schema will retain some advantage over a generic one, but then one can have a purely relational application on Virtuoso as well. Just think of the possibility of an application-specific schema emerging by itself in a workload-driven fashion.

As background data for an increasing number of fields becomes available as linked data, using this together with proprietary data for analytics and discovery becomes increasingly interesting. This is the initial line of RDF data warehousing. The biomedical field has many examples. The technologies we will release during 2010 will be geared towards enabling a second line of RDF applications, where ad hoc agile integration with RDF as a lingua franca becomes a real alternative to relational solutions with ETL point solutions for harvesting information from diverse systems. One may see how RDF's flexibility and expressivity may add to agility in any number of situations where data from heterogenous sources needs to be integrated. Which of today's business scenarios does not face this issue?

References:

by Orri Erling (oerling@openlinksw.com) at December 29, 2009 03:24 PM

December 10, 2009

Orri Erling

BSBM Again

Benchmarks are very useful. They strive to resolve a complex whole into a single number and to do so without losing too much real life relevance.

For this reason, benchmarks are also intrinsically controversial: People will not agree on what is relevant and they will not agree on what constitutes proper measurement.

We have previously written about the Berlin Benchmark.

On the relevance side, we note that it is a primarily relational benchmark and by this fact tends to show RDF in a bad light. This is OK, since it is among other things meant to compare RDB to RDF mappings, meaning that the benchmark schema must be relational to start with.

On the execution side, we have to again protest against bad configuration and use of outdated software versions, as concerns the latest reported results. We will submit the specifics of this separately.

This is somewhat tedious and unnecessary, as there is a better way: Ontotext and ourselves are namely good friends, to the point of having extensively discussed all the issues of RDF store benchmarking.

Thus, this most recent publication of BSBM gives us an opportunity to practice what we preach.

Therefore, I would propose we jump the gun on implementing proper benchmark processes and audits and do the following:

  • Have the vendors run the benchmark themselves, disclosing configuration by TPC rules.
  • Have an independent, mutually agreeable auditor verify the execution of the benchmark. Again, TPC rules offer a precedent.

In this way, anytime a vendor comes out with a new release, they may publish BSBM audited figures if they please. The figures would be for the two variants of the query mixes and at a specified scale. The details of the hardware should be published as per TPC rules, also with a price/performance metric.

So, Ontotext, we could do this. We both gain by running on proper configurations with proper tuning; the audience gains by learning what constitutes proper tuning and configuration. Cheating is dealt with by having a third party audit the process. (To avoid the problem of finding a third party, we can this time around audit each other over remote login.)

This would also have the added advantage of running on decent hardware as opposed to what the BSBM publication was run on (4 core/8G). Further down the line there is an opportunity for third parties to become auditors.

As for other vendors — e.g., Franz (AllegroGraph), Garlik, and SYSTAP (Bigdata) — they are also invited.

by Orri Erling (oerling@openlinksw.com) at December 10, 2009 08:02 PM

December 09, 2009

DBpedia Blog

Open Knowledge Conference 2010

OKCon, now in its fifth year, is the interdisciplinary conference that brings together individuals from across the open knowledge spectrum (such as also DBpedia in particular and Linked Open Data in general) for a day of presentations and workshops.Open knowledge promises significant social and economic benefits in a wide range of areas from governance to science, culture to technology. Opening up access to content and data can radically increase access and reuse, improving transparency, fostering innovation and increasing societal welfare.

In addition to high profile initiatives such as Wikipedia, OpenStreetMap and the Human Genome Project, there is enormous growth among open knowledge projects and communities at all levels. Moreover, in the last year, many governments across the world have begun opening up their data.

And it doesn’t stop there. In academia, open access to both publications and data has been gathering momentum, and similar calls to open up learning materials have been heard in education. Furthermore, this gathering flood of open data and content is the creator and driver of massive technological change. How can we make this data available, how can we connect it together, how can we use it collaborate and share our work?

  • where: London, UK
  • when: Saturday 24th April, 2010
  • www: http://www.okfn.org/okcon/
  • cfp: http://www.okfn.org/okcon/cfp/ (deadline: Jan 31st 2010)
  • hashtag: #okcon2010

by Sören at December 09, 2009 11:35 AM

December 03, 2009

DBpedia Blog

DBpedia in ReadWriteWeb’s Top 10 Semantic Web Products of 2009

The new year is slowly approaching and people start compiling their top x lists of 2009, with x usually ranging between 10 and 365. ;-)

The popular Web technology blog ReadWriteWeb has chosen x with value 10 and picked DBpedia as one of their top Semantic Web products of 2009. Its actually the only non-commercial community project in the list and in good company with products such as Google’s Search Options and Rich Snippets, Apperture and Data.gov. Other picks, which btw. heavily use or link to DBpedia, include OpenCalais, Freebase, BBC Music and Zemanta.

Read the full article at http://www.readwriteweb.com/archives/top_10_semantic_web_products_of_2009.php

by Sören at December 03, 2009 05:53 PM

November 21, 2009

Do What I Mean (Richard Cyganiak)

What’s in a name? And the Linked Data Police

So I wrote a rather angry private email to Erik Wilde a few days ago, complaining about his use of the term “linked data” for a site that doesn’t follow the linked data practices. Erik decided to publish my email on his blog, along with a long defense of his use of the term, in a post called “The Linked Data™ Police”. Since it’s in public now, we can just as well see if we can get a useful discussion out of this.

First, I realize that Erik probably responded more to the tone of my email than to the content. It was an angry rant, and my tone misses the mark, so his response is fair enough, and I have apologised to him. I also have to say that I’m speaking only for myself and nobody else—before others become scared of the “Linked Data™ Police”, I can assure them that to the best of my knowledge, it has a staff of one, and my peers in the community are in general a friendly and civil lot.

The site in question. So, what is this about? Erik and team have built recovery.berkeley.edu, a site that publishes structured data about Recovery Act spending. The site was built with a grant from the Sunlight Foundation. The technologies of choice are Atom and various other XML formats. As far as I can see, it’s excellent in its adherence to REST principles, including good URI design and “hypermedia as the engine of application state”. These two together are labelled as “linked data” in the site’s technical documentation.

This is a discussion about names, and not about substance. At the very core is the following question: Should we understand “linked data” to mean “the idea of somehow connecting pieces of data with links”, or should we take it to mean “RDF published according to the rules outlined by Tim Berners-Lee in the design note that coined the term”?

Obviously, I’m of the latter opinion. In this post, I want to do two things: First, I want to respond to some specific points from Erik’s post. I will do this by paraphrasing each point, and then responding to it. Second, I want to explain why I care about the matter and why I think that “linked data” should continue to be associated with Tim’s rules, and why advocates of different sets of rules should use different terms.

Erik: “The attitude is scary: Instead of figuring out the most effective way of adding more semantics to the web, it starts with a set of technologies and claims that whatever you want to do, you have to use those.” Linked data didn’t start with a set of technologies; a lot of deliberation by a lot of people went into the choice. Also, I have no quarrel with Erik’s choice of technologies, and I didn’t even suggest that he should or shouldn’t use any technology. There can be good reasons against using RDF, and it’s a good thing that innovation continues in other areas of web data technology.

But if Erik and colleagues don’t buy into the set of technology choices commonly called “linked data”, then why would they insist on using that name? What’s wrong with the established technical terms, REST and Resource-oriented Architecture? Is it just because those are already way past the peak of the hype cycle?

Erik: “Using generic problem names to refer to specific technologies only confuses people.” Erik mentions Linked Data, XML Schema, Semantic Web and Web Services as specific technologies that he considers to be badly labelled. But the peculiar thing about those four is not their names; they are controversial for other reasons. The IT world is full of specific technologies that use generic names: World Wide Web. Structured Query Language. Extensible Markup Language. Hypertext Transfer Protocol. Portable Document Format. Resource-Oriented Architecture. Scalable Vector Graphics. Open Document Format. Erik may not like it, but it’s a common practice.

Erik: “Choosing such names is usually an attempt to make competition harder.” Usually? I doubt that. Technologies are named in their very infancy, when their future and success is far from certain, and when competition is usually not an issue. The naming is usually an attempt to communicate as clearly as possible what the proposed technology is supposed to achieve, which is not a bad thing at all. Some fail at achieving the goal, but everyone designs (and names) assuming that it can be eventually achieved.

Erik: “RDF is just a stylesheet away.” Erik points out that it would be trivial to create a GRDDL transform that translates from the service’s output to RDF. Personally I wouldn’t call it trivial, and being just one transformation away from being compatible is not the same as being compatible. If there were GRDDL transforms in place, I would have no reason at all to complain, although just a few linked data clients support GRDDL at this time.

Back to the roots. So where did the term “linked data” come from? To the best of my knowledge, Tim Berners-Lee coined it in his 2006 Design Note that is titled “Linked Data”. The document introduced the four rules that are now known as the “Linked Data Principles.” Erik’s service is following all of them except the one that demands RDF or SPARQL.

It’s worth pointing out that the four rules did not mention RDF when Tim originally published them, but it is clear from the rest of the document that the use of RDF was implied. The document was aimed at the semantic web community. His later change was a clarification, not a change of intention.

I don’t know why Tim wrote this piece back in 2006, but my interpretation was that he wanted more people to publish data that can be browsed with his Tabulator RDF browser, and most RDF out there at that time couldn’t be browsed because of problems with one of the four rules. So I read it as a call for better interoperability among RDF publishers.

Broadening the term? There have been a number of calls for broadening the meaning of the term, most eloquently from Paul Miller, so Erik is certainly not alone in his view. Their intention is to get linked data quicker into the mainstream, which is a goal that I share. The problem is that broadening a term makes it less meaningful. There is a danger that the term gets extended to the point where it’s equally meaningless to other buzzwords such as Web 3.0 or the venerable Semantic Web. If you can use other formats instead of RDF, then why not also use SOAP instead of HTTP? Why not do away with the URIs? Why not YQL instead of SPARQL? Where does “linked data” stop? Everything is somehow “data” and somehow “linked.”

Interoperability requires choices to be made. In my eyes, the great thing about the term “linked data” is that it has a reasonably precise technical definition, rooted in Tim’s Design Note and the early work of the Linking Open Data project. That work has turned the Semantic Web’s compelling but vague promises of a side-by-side “web for humans” and “web for machines” into concrete guidelines that people can actually implement, and the result is an ecosystem of interoperable tools, clients and datasets that continues to grow around these guidelines.

These guidelines will continue to evolve with the emergence of new technologies (e.g., RDFa) and increasing experience and maturity (e.g., importance of licensing and provenance handling).

But at the core, it has to be about a set of concrete technology choices and deployment practices that foster an interoperable ecosystem of data sources and clients. “Linked data” is the best name we have for that particular set of technology choices and practices. There is nothing magic about the name “linked data”, to the best of my knowledge it didn’t exist at all in the web community before 2006. The term has gained popularity because it has associated rules that tell you how to do it, not because of the “words”. Without the rules, the term would be meaningless fluff. Everything is somehow “linked” and somehow “data.”

If you think that a different set of rules would work better (which is entirely possible), then it would be prudent to write them down, coin a new term for them, and start the legwork of advertising them, just as Tim did since 2006.

by Richard Cyganiak at November 21, 2009 07:34 PM

November 20, 2009

DBpedia Blog

German government proclaims Faceted Wikipedia/DBpedia Search one of the 365 most innovative ideas in Germany

The German federal government has proclaimed Faceted Wikipedia Search as one of the 365 most innovative ideas in Germany in the context of the Deutschland – Land der Ideen competition. The competition showcases innovative ideas in areas such as science and technology, business, education, art and ecology. The patron of the competition is the German President Horst Köhler.

Faceted Wikipedia/DBpedia Search allows users to ask complex queries, like “Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?” against Wikipedia. The answers to these queries are not generated using key word matching as the answers of search engines like Google or Yahoo, but are generated based on structured information that has been extracted from many different Wikipedia articles. Faceted Wikipedia/DBpedia Search allows users to query Wikipedia like a structured database and thus enables them to truly exploit Wikipedia’s collective intelligence.

Faceted Wikipedia/Dbpedia Search can be tested online at http://dbpedia.neofonie.de/browse/

Please click on the example queries below to see Faceted Wikipedia Search in action:

Faceted Wikipedia/DBpedia Search has been jointly developed by neofonie GmbH, Berlin and the Web-based Systems Group at Freie Universität Berlin. Technically, Faceted Wikipedia/DBpedia Search is based on the DBpedia data extraction framework and neofonie search technology.

 

The DBpedia data extraction framework extracts structured data from Wikipedia, such as the content of infoboxes which summarize relevant facts as a table on the top right-hand side of Wikipedia articles. The extracted data is represented using the Resource Description Framework, a data model for web-based systems. Currently, the framework extracts around 190 million facts from the English editon of Wikipedia and 289 million facts from Wikipedia editions in 90 further languages. The DBpedia data extraction framework is developed by the Web-based Systems group at Freie Universität Berlin and the Agile Knowledge Engineering and Semantic Web group at Universität Leizpig.

 

The neofonie search engine, neofonie search, is employed to execute complex queries over the extracted data. neofonie search aggregates RDF data from DBpedia with full-text data from Wikipedia. The aggregated data is then divided into hierarchical facets, composed of 200 types with 2.9 million values. In addition to providing the search technology and processing power, neofonie is also responsible for the hosting of the Faceted Wikipedia/DBpedia Search on the Amazon Elastic Compute Cloud (Amazon EC2).

As DBpedia covers a wide range of domains and has a high degree of conceptual overlap with various other open-license datasets, an increasing number of data publishers have started to set data-level links from their data sources to DBpedia, making DBpedia one of the cristalization points of the emerging Web of Linked Data. In the future, the links between databases will allow applications like Faceted Wikipedia Search to answer queries based not only on Wikipedia knowledge but based on knowledge from a world wide web of databases.

Faceted Wikipedia Search will be presented as part of the Land der Ideen series on April 12th, 2010 at neofonie, Berlin.

Additional information about the Land der Ideen competition, DBpedia, neofonie and the Web of Data is found at:

by ChrisBizer at November 20, 2009 04:17 PM

November 19, 2009

Displacement Activities (Tom Heath)

Putting a Conference into the Semantic Web

Chris Gutteridge asked this question about semantically enabling conference Web sites, which is a subject close to my heart. It’s hard to give a meaningful response in 140 characters, so I decided to get some headline thoughts down for posterity. If you want a fuller account of some first-hand experiences, then the following papers are a good place to start:

Top Five Tips for Semantic Web-enabling a Conference

1. Exploit Existing Workflows

Conferences are incredibly data-rich, but much of this richness is bound up in systems for e.g. paper submission, delegate registration, and scheduling, that aren’t native to the Semantic Web. Recognise this in advance and plan for how you intend to get the data from these systems out into the Web. The good news is that scripts now exists to handle dumps from submission systems such as EasyChair, but you may need to ensure that the conference instance of these systems is configured correctly for your needs. For example, getting dumps from these systems often comes at a price, and if you’re using one instance per track rather than the multi-track options, you may be in for a shock when you ask for the dumps. Speak to the Programme Chairs about this as soon as possible.

In my experience, delegate registration opens months in advance of a conference and often uses a proprietary, one-off system. As early as possible make contact with the person who will be developing and/or running this system, and agree how the registration system can be extended to collect data about the delegates and their affiliations, for example. Obviously there needs to be an opt-in process before this data is published on the public Web.

Collecting these types of data from existing workflows is so monumentally easier than asking people to submit it later through some dedicated means. With this in mind, have modest expectations (in terms of degree of participation) for any system you hope to deploy for people to use before, during and after the conference, whether this is a personalised schedule planner, paper annotation system or rating system for local restaurants. People have massive demands on their time always, and especially at a conference, so any system that isn’t already part of a workflow they are engaged with is likely to get limited uptake.

2. Publish Data Early then Incrementally Improve

Perhaps your goal in publishing RDF data about your conference is simply to do the right thing by eating your own dog food and providing an archival record of the event in machine-readable form. This is fine, but ideally you want people to use the published data before and during the event, not just afterwards. In an ideal world, people will use the data you publish as a foundation for demos of their applications and services and the conference, as means to enhance the event and also to promote their own work. To maximise the chances of this happening you need to make it clear in advance that you will be publishing this data, and give an indication of what the scope of this will be. The RDF available from previous events in the ESWC and ISWC series can give an impression of the shape of the data you will publish (assuming you follow the same modelling patterns), but get samples out early and basic structures in place so people have the chance to prepare. Better to incrementally enhance something than save it all up for a big bang just one week before the conference.

3. Attend to the details

Many of the recent ESWC and ISWC events have done a great job of publishing conference data, and have certainly streamlined the process considerably. However, along the way we’ve lost (or failed to attend to) some of the small but significant facts that relate to a conference, such as the location, venue, sponsors and keynote speakers. This stuff matters, and is the kind of data that probably doesn’t get recorded elsewhere. Obviously publishing data about the conference papers is important, but from an archival point of view this information is at least recorded by the publishers of the proceedings. The more tacit, historical knowledge about a conference series may be of great interest in the future, but is at risk of slipping away.

4. Piggy-back on Existing Infrastructure

As I discovered while coordinating the Semantic Web Technologies for ESWC2006, deploying event-specific services is simply making a rod for your own back. Who is going to ensure these stay alive after the event is over and everyone moves onto the next thing? The answer is probably no-one. The domain-registration will lapse, the server will get hacked or develop a fault, the person who once knew why that site mattered will take a job elsewhere, and the data will disappear in the process. Therefore it’s critical that every event uses infrastructure that is already embedded in everyday usage and also/therefore has a future. The best example of this is data.semanticweb.org, the de facto home for Linked Data from Web-related events. This service has support from SWSA, and enough buy-in from the community, to minimise the risk that it will ever go away. By all means host the data on the conference Web site if you must, but don’t dream of not mirroring it at data.semanticweb.org, with owl:sameAs links to equivalent URIs in that namespace for all entities in your data set.

5. Put Your Data in the Web

Remember that while putting your data on the Web for others to use is a great start, it’s going to be of greatest use to people if it’s also *in* the Web. This is a frequently overlooked distinction, but it really matters. No one in their right mind would dream of having a Web site with no incoming or outgoing links, and the same applies to data. Wherever possible the entities in your data set need to be linked to related entities in other data sets. This could be as simple as linking the conference venue to the town in which it is located, where the URI for the town comes from Geonames. Linking in this way ensures that consumers of the data can discover related information, and avoids you having to publish redundant information that already exists somewhere else on the Web. The really great news is that data.semanticweb.org already provides URIs for many people who have published in the Semantic Web field, and (aside from some complexities with special characters in names) linking to these really can be achieved in one line of code. When it’s this easy there really are no excuses.

Conclusions

Reading the above points back before I hit publish, I realise they focus on Semantic Web-enabling the conference as a whole, rather than specifically the conference Web site, which was the focus of Chris’s original question. I think we know a decent amount about publishing Linked Data on the Web, so hopefully these tips usefully address the more process-oriented than technical aspects.

Related posts:

  1. Linked Data Tutorials at Semantic Web Austin I spent a few days last week in Austin, Texas,...

Related posts brought to you by Yet Another Related Posts Plugin.

by Tom Heath at November 19, 2009 12:42 PM

November 16, 2009

Frederick Giasson

When Linked Data Rules Fail

Image Source: www.adhd-mindbydesign.com

High Visibility Problems with NYT, data.gov Show Need for Better
Practices

When I say, “shot”, what do you think of? A flu shot? A shot of whisky? A moon shot? A gun shot? What if I add the term “bank”? Do you now think of someone being shot in an armed robbery of a local bank or similar?

And, now, what if I add a reference to say, The Hustler, or Minnesota Fats, or “Fast Eddie” Felson? Do you now see the connection to a pressure-packed banked pool shot in some smoky bar room?

As humans we need context to make connections and remove ambiguity. For machines, with their limited reasoning and inference engines, context and accurate connections are even more important.

Over the past few weeks we have seen announcements of two large and high-visibility linked data

projects:  One, a first release of references for articles concerning about 5,000 people from the New York Times at data.nytimes.com; and Two, a massive exposure of 5 billion triples from data.gov datasets provided by the Tetherless World Constellation (TWC) at Rennselaer Polytechnic Institute (RPI).

On various grounds from licensing to data characterization and to creating linked data for its own sake, some prominent commentators have weighed in on what is good and what is not so good with these datasets. One of us, Mike, commented about a week ago that “we have now moved beyond ‘proof of concept’ to
the need for actual useful data of trustworthy provenance and proper mapping and characterization. Recent efforts are a disappointment that no enterprise would or could rely upon.”

Reactions to that posting and continued discussion on various mailing lists warrant a more precise dissection of what is wrong and still needs to be done with these datasets [1].

Berners-Lee’s Four Linked Data “Rules”

It is useful, then, to return to first principles, namely the original four “rules” posed by Tim Berners-Lee in his design note on linked data [2]:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names
  3. When someone looks up a URI, provide useful information, using thestandards (RDF, SPARQL)
  4. Include links to other URIs so that they can discover more things.

The first two rules are definitional to the idea of linked data. They cement the basis of linked data in the Web, and are not at issue with either of the two linked data projects that are the subject of this posting.

However, it is the lack of specifics and guidance in the last two rules where the breakdowns occur. Both the NYT and the RPI datasets suffer from a lack of “providing useful information” (Rule #3). And, the nature of the links in Rule #4 is a real problem for the NYT dataset.

What Constitutes “Useful Information”?

The Wikipedia entry on linked data expands on “useful information” by augmenting the original rule with the parenthetical clause, ” (i.e., a structured description — metadata).” But even that expansion is insufficient.

Fundamentally, what are we talking about with linked data? Well, we are talking about instances that are characterized by one or more attributes. Those instances exist within contexts of various natures. And, those contexts may relate to other existing contexts.

We can break this problem description down into three parts:

  • A vocabulary that defines the nature of the instances and their descriptive attributes
  • A schema of some nature that describes the structural relationships amongst instances and their characteristics, and, optimally,
  • A mapping to existing external schema or constructs that help place the data into context.

At minimum, ANY dataset exposed as linked data needs to be described by a vocabulary. Both the NYT and RPI datasets fail on this score, as we elaborate below. Better practice is to also provide a schema of relationships in which to embed each instance record. And, best practice is to also map those structures to external schema.

Lacking this “useful information”, especially a defining vocabulary, we cannot begin to understand whether our instances deal with drinks, bank robberies or pool shots. This lack, in essence, makes the information worthless, even though available via URL.

The data.gov (RPI) Case

With the support of NSF and various grant funding, RPI has set up the
Data-Gov Wiki [3], which is in the process of converting the datasets on data.gov to RDF,placing them into a semantic wiki to enable comment and annotation, and providing that data as RSS feeds. Other demos are also being placed on the site.

As of the date of this posting, the site had a catalog of 116 datasets from the 800 or so available on data.gov, leading to these statistics:

  • 459,412,419 table entries
  • 5,074,932,510 triples, and
  • 7,564 properties (or attributes).

We’ll take one of these datasets, #319, and look a bit closer at it:

Wiki Title Agency Name data.gov Link No Properties No Triples RDF File
Dataset 319 Consumer Expenditure Survey Department of Labor LABOR-STAT http://www.data.gov/details/319 22 1,583,236 http://data-gov.tw.rpi.edu/raw/319/index.rdf

This report was picked solely because it had a small number of attributes (properties), and is thus easier to screen capture. The summary report on the wiki is shown by this page:


Data-gov-Wiki Dataset #319

(click to expand)

So, we see that this specific dataset contains about 22 of the nearly 8,000 attributes across all datasets.

When we click on one of these attribute names, we are then taken to a specific wiki page that only reiterates its label. There is no definition or explanation.

When we inspect this page further we see that, other than the broad characterization of the dataset itself (the bulk of the page), we see at the bottom 22 undefined attributes with labels such as item code, periodicity code, seasonal, and the like. These attributes are the real structural basis for the data in this dataset.

But, what does all of this mean???

To gain a clue, now let’s go to the source data.gov site for this dataset (#319). Here is how that report looks:


Data.gov Dataset #319

(click to expand)

Contained within this report we see a listing for additional metadata. This link tells us about the various data fields contained in this dataset; we see many of these attributes are “codes” to various data categories.

Probing further into the dataset’s technical documentation, we see that there is indeed a rich structure underneath this report, again provided
via various code lookups. There are codes for geography, seasonality (adjusted or not), consumer demographic profiles and a variety of consumption categories. (See, for example, the link to this glossary page.) These are the keys to understanding the actual values within this dataset.

For example, one major dimension of the data is captured by the attribute item_code. The survey breaks down consumption expenditures within the broad categories of  Food, Housing, Apparel and Services, Transportation, Health Care, Entertainment, and Other. Within a category, there is also a rich structural breakdown. For xample, expenditures for Bakery Products within Food is given a code of FHC2.

But, nowhere are these codes defined or unlocked in the RDF datasets. This absence is true for virtually all of the datasets exposed on this wiki.

So, for literally billions of triples, and 8,000 attributes, we have ABSOLUTELY NO INFORMATION ABOUT WHAT THE DATA CONTAINS OTHER THAN A PROPERTY LABEL. There is much,much rich value here in data.gov, but all of it remains locked up and hidden.

The sad truth about this data release is that it provides absolutely no value in its current form. We lack the keys to unlock the value.

To be sure, early essential spade work has been done here to begin putting in place the conversion infrastructure for moving text files, spreadsheets and the like to an RDF form. This is yeoman work important to ultimate access. But, until a vocabulary is published that defines the attributes and their codes so we can unlock this value, it will remain hidden. And only when its further value (by connecting attributes and relations across datasets) through a schema of some nature is also published, the real value from connecting the dots will also remain hidden.The Hustler

These datasets may meet the partial conditions of providing clickable URLs, but the crucial “useful information” as to what any of this data means is absent.

Every single dataset on data.gov has supporting references to text files, PDFs, Web pages or the like that describe the nature of the data within each dataset. Until that information is exposed and made usable, we have no linked data.

Until ontologies get created from these technical documents, the value of these data instances remain locked up, and no value can be created from having these datasets expressed in RDF.

The devil lies in the details. The essential hard work has not yet begun.

The NYT Case

Though at a much smaller scale with many fewer attributes, the NYT dataset suffers from the same failing: it too lacks a vocabulary.

So, let’s take the case of one of the lead actors in The Hustler, Paul Newman, who played the role of “Fast Eddie” Felson. Here is the NYT record for the “person” Paul
Newman
(which they also refer to as http://data.nytimes.com/newman_paul_per). Note the header title of Newman, Paul:


NYT 'Paul Newman Articles' Record

(click to expand)

Click on any of the internal labels used by the NYT for its own attributes (such as nyt:first_use), and you will be given this message:

“An RDFS description and English language documentation for the NYT namespace will be provided soon. Thanks for your patience.”

We again have no idea what is meant by all of this data except for the labels used for its attributes. In this case for nyt:first_use we have a value of “2001-03-18″.

Hello? What? What is a “first use” for a “Paul Newman” of “2001-03-18″???

The NYT put the cart before the horse: even if minimal, they should have released their ontology first — or at least at the same time — as they released their data instances. (See further this discussion about how an ontology creation workflow can be incremental by starting simple and then upgrading as needed.)

Links to Other Things

Since there really are no links to other things on the Data-Gov Wiki, our focus in this section continues with the NYT dataset using our same example.

We now are in the territory of the fourth “rule” of linked data: 4. Include links to other URIs so that they can discover more things.

This will seem a bit basic at first, but before we can talk about linking to other things, we first need to understand and define the starting “thing” to which we are linking.

What is a “Newman, Paul” Thing?

Of course, without its own vocabulary, we are left to deduce what this thing “Newman, Paul“ is that is shown in the previous screen shot. Our first clue comes from the statement that it is of rdf:type SKOS concept. By looking to the SKOS vocabulary, we see that concept is a class and is defined as:

A SKOS concept can be viewed as an idea or notion; a unit of thought. However, what constitutes a unit of thought is subjective, and this
definition is meant to be suggestive, rather than restrictive. The notion of a SKOS concept is useful when describing the conceptual or intellectual structure of a knowledge organization system, and when referring to specific ideas or meanings established within a KOS.

We also see that this instance is given a foaf:primaryTopic of Paul Newman.

So, we can deduce so far that this instance is about the concept or idea of Paul Newman. Now, looking to the attributes of this instance — that is the defining properties provided by the NYT — we see the properties of nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage. Completing our deductions, and in the absence of its own vocabulary, we can now define this concept instance somewhat as follows:

New York Times articles in the period 2001 to 2009 having as their primary topic the actor Paul Newman

(BTW, across all records in this dataset, we could see what the earliest first use was to better deduce the time period over which these articles have been assembled, but that has not been done.)

We also would re-title this instance more akin to “2001-2009 NYT Articles with a Primary Topic of Paul Newman” or some such and use URIs more akin to this usage.

sameAs Woes

Thus, in order to make links or connections with other data, it is essential to understand what the nature is of the subject “thing” at hand. There is much confusion about actual “things” and the references to “things” and what is the nature of a “thing” within the literature and on mailing lists.

Our belief and usage in matters of the semantic Web is that all “things” we deal with are a reference to whatever the “true”, actual thing is. The question then becomes:  What is the nature (or scope) of this referent?

There are actually quite easy ways to determine this nature. First, look to one or more instance examples of the “thing” being referred to. In our case above, we have the “Newman, Paul” instance record. Then, look to the properties (or attributes) the publisher of that record has used to describe that thing. Again, in the case above, we have nyt:associated_article_count, nyt:first_use, nyt:last_use and nyt:topicPage.

Clearly, this instance record — that is, its nature — deals with articles or groups of articles. The relation to Paul Newman occurs as a basis of
the primary topic of these articles, and not a person basis for which to describe the instance. If the nature of the instance was indeed the person Paul Newman, then the attributes of the record would more properly be related to “person” properties such as age, sex, birthdate, death date, marital status, etc.

This confusion by NYT as to the nature of the “things” they are describing then leads to some very serious errors. By confusing the topic (Paul Newman) of a record with the nature of that record (articles about topics), NYT next misuses one of the most powerful semantic Web predicates available, owl:sameAs.

By asserting in the “Newman, Paul” record that the instance has a sameAs relationship with external records in Freebase and DBpedia, the NYT both entails that properties from any of the associated records are shared and infers a chain of other types to describe the record. More precisely, the NYT is asserting that the “thing” referred to by these instances are identical resources.

Thus, by the sameAs statements in the “Newman, Paul” record, the NYT is also asserting that that record is an instance of all these classes:

Furthermore, because of its strong, reciprocal entailments, the owl:sameAs assertion would also now entail that the person Paul Newman has the nyt:first_use and nyt:last_use attributes, clearly illogical for a “person” thing.

This connection is clearly wrong in both directions. Articles are not persons and don’t have marital status; and persons do not have first_uses. By misapplying this sameAs linkage relationship, we have screwed things up in every which way. And the error began with misunderstanding what kinds of “things” our data is about.

Some Options

However, there are solutions. First, the sameAs assertions, at least involving these external resources, should be dropped.

Second, if linkages are still desired, a vocabulary such as UMBEL [4] could be used to make an assertion between such a concept, and these other related resources. So, even though these resources are not the same, they are closely related. The UMBEL ontology helps us to define this kind of relation between related, but non-identical, resources.

Instead of using the owl:sameAs

property, we would suggest the usage of the umbel:linksEntity, which links a skos:Concept to related named entities resources. Additionally, Freebase, which also currently asserts a sameAs relationship to the NYT resource, could use the umbel:isAbout relationship to assert that their resource “is about” a certain concept, which is the one defined by the NYT.

Alternatively, still other external vocabularies that more precisely capture the intent of the NYT publishers could be found, or the NYT editors could define their own properties specifically addressing their unique linkage interests.

Other Minor Issues

As a couple of additional, minor suggestions for the NYT dataset, we would suggest:

  • Create a foaf:Organization description of the NYT organization, then use it with dc:creator and dcterms:rightsHolder rather than using a literal, and
  • The dual URIs such as “http://data.nytimes.com/N31738445835662083893” and “http://data.nytimes.com/newman_paul_per” are not wrong in themselves, but the purpose is hard to understand. Why does a single organization need to create multiple resources for the identical resource, when it comes from the same system and has the same purpose?

Re-visiting the Linkage “Rule”

There are very valuable benefits from entailment, inference and logic to be gained from linking resources. However, if the nature of the “things” being linked — or the properties that define these linkages — are incorrect, then very wrong logical implications result. Great care and understanding should be applied to linkage assertions.

In the End, the Challenge is Not Linked Data, but Connected Data

Our critical comments are not meant to be disrespectful and are not being picky. The NYT and TWC are prominent institutions for which we should expect leadership on these issues. Our criticisms (and we believe those of others) are also not an expression of a “trough of disillusionment” as some have been pointing out.

This posting is about poor practices, pure and simple. The time to correct them is now. If asked, we would be pleased to help either institution establish exemplar practices. This is not automatic, and it is not always easy. The data.gov datasets, in particular, will require much time and effort to get right. There is much documentation that needs to be transitioned and expressed in semantic Web formats.

In a broader sense, we also seem to lack a definition of best practices related to vocabularies, schema and mappings. The Berners-Lee rules are imprecise and insufficient as is. Prior best guidance documents tend to
be more how to publish and make URIs linkable, than to properly characterize, describe and connect the data.

Perhaps, in part, this is a bit of a semantics issue. The challenge is not the mechanics of linking data, but the meaning and basis for connecting that data. Connections require logic and rationality sufficient to reliably inform inference and rule-based engines. It also needs to pass the sniff test as we “follow our nose” by clicking the links exposed by the data.

It is exciting to see high-quality content such as from national governments and major publishers like the New York Times begin to be exposed as linked data. When this content finally gets embedded into usable contexts, we should see manifest uses and benefits emerge. We hope both institutions take our criticisms in that spirit.

This posting has been jointly authored by Mike Bergman and Fred Giasson and simultaneously published on both of their blogs, hoping to draw more attention to the need for better practices in publishing linked data.

[1] The NYT has been updated with improvements and they fixed multiple issues from the first release. The
problems listed herein, however, still pertain after these improvements.
[2] Tim Berners-Lee, 2006. Linked Data (Design Issues), first posted on 2006-07-27; last updated on
2009-06-18. See http://www.w3.org/DesignIssues/LinkedData.html. Berners-Lee refers to the steps above as “rules,” but he elaborates they are expectations of behavior. Most later citations refer to these as “principles.”
[3] Li Ding, Dominic DiFranzo, Sarah Magidson, Deborah L. McGuinness and Jim Hendler, 2009. Data-GovWiki: Towards Linked Government Data. See
http://www.cs.vu.nl/~pmika/swc/documents/Data-gov%20Wiki-data-gov-wiki-v1.pdf.
[4] UMBEL (Upper Mapping and Binding Exchange Layer) is a lightweight ontology structure in development for relating Web content and data to a standard set of subject concepts. It purpose has resulted in its creation of an associated vocabulary geared to both class-instance and reciprocal relationships, as well as partial or likelihood relationships. See http://umbel.org/technical_documentation.html#vocabulary.

by Fred at November 16, 2009 05:03 PM

November 12, 2009

Project squin

Experiences developing a simple Linked Data based application

One of the characteristics that make the Linked Data community so special is the pragmatic “let’s do it” attitude. The Linked Data-a-thon co-located at this year’s International Semantic Web Conference (ISWC) revealed this attitude once more. The main idea of the Linked Data-a-thon was to encourage conference attendees to develop a “quick and dirty” Linked Data based application during the first days of the conference; the proposition of this challenge was to showcase that it is possible to develop simple but innovative Linked Data applications with very little effort. The outcome of this event were eight amazing applications which exceeded all expectations. Nonetheless, the participation in this challenge also revealed some unexpected difficulties and reminded us of some of the current issues that still exist. In the following we describe our experiences.

The Idea
After being urged by Juan, the main organizer of the Linked Data-a-thon, we seriously thought about participating. Coming up with an idea for a cool application that could also be developed in only a few hours turned out to be a minor challenge. However, we remembered one of the demo queries of our SQUIN service. This query asks for traditional Chinese medicine as an alternative to the western drug Varenicline. Answering this queries requires data from at least three different linked datasets provided by the Linking Open Drug Data (LODD) project. Hence, this query demonstrates the added value of interlinking data from multiple sources on the Web. Furthermore, the query gives a glimpse of how ordinary people may benefit from openly available data. For this reason we agreed to build a simple application around this query and let users vary the drug for which the alternatives are to be found. Since answering only one type of questions is quite boring we decided to add some functionality that enable users to drill a bit deeper into the data that is available. Based on our knowledge of the LODD datasets we came up with the idea of adding value to the search results by allowing users to inspect possible side effects of the alternative medicines. This valuable functionality would require the evaluation of data from at least two additional datasets and, by using SQUIN, it would be realizable with only one additional type of SPARQL queries.

The Queries
As outlined, our application is based on two types of SPARQL queries which we execute with SQUIN. Each of these types corresponds to a template in which a placeholder has to be substituted by a URI in order to create an actual SPARQL query that, then, can be executed over the Web of Linked Data using a SQUIN service. The query template for the alternative medicine queries was easy to create; we only had to generalize the aforementioned SQUIN demo query by replacing the Varenicline URI with the placeholder. Creating the side effects query was a bit more difficult. It required that: 1/ one has the domain-specific knowledge about the application in order to conceive how a medicine can be related to its side effects; 2/ one has to have good knowledge about the collection datasets, knowing which dataset contains what information and how one dataset is connected to another one through which properties. Such knowledge is essential to define a valid SPARQL query; a Linked Data browser such as the one powered by Pubby has been very helpful here. 3/ one must have a mechanism to debug the query and sufficient logs about the query execution, to understand what has gone wrong, e.g., the predicates being in the wrong order or the data URIs being partially updated.

The General Operation
To instantiate our query templates the placeholder in each of them has to be substituted by a URI. For the alternative medicine query this URI would identify the western drug specified by the user; for the side effects query the URI would identify the Chinese medicine. Getting the medicine URIs wouldn’t be a problem because they are part of the results determined for the alternative medicine queries. Obtaining the western drug URIs was a bit more difficult because we wanted our users to provide with a convenient interface in which they only have to specify the drugs by their names. For this reason we added another type of query to look-up drug URIs based on drug names provided by the users. These queries use the following template which contains a filter clause with a regular expression:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX drugbank: <http://www4.wiwiss.fu-berlin.de/drugbank/resource/drugbank/>
SELECT ?uri ?label WHERE {
?uri a drugbank:drugs .
?uri rdfs:label ?label .
FILTER( regex(STR(?label),"[DRUGNAME]","i") )
}
ORDER BY ?uri

The placeholder [DRUGNAME] in this template is replaced by the drug name provided by the user. Notice, queries of this type will not yield any results when executed with SQUIN; they do not contain a seed URI linking to an RDF graph with triples that match any of the patterns in query. Therefore, we send these look-up queries directly to the SPARQL endpoint of the drugbank dataset that describes western drugs and their names.

ScreenshotAltMedSelectionFrom the result of the look-up query we initiate the alternative medicine query. If the look-up query yields multiple results we allow the user to choose one of them. The alternative medicine query is then executed over the Web of Linked Data using a SQUIN service. The results of this query execution are visualized, including an option to show side effects of each medicine. When the user selects one of these options our app initializes the template of the side effects query with the corresponding medicine URI, sends this query to the SQUIN service, and visualizes the results after they are returned by SQUIN.

Performance Discussion
Executing our queries over the Web using SQUIN takes very long for an interactive user interface (1.30 min in the average case). However, we used SQUIN to demonstrate that it is able to answer these queries. Another advantage of this solution: SQUIN provides the most up-to-date results and it would immediately include data from new sources after they are linked from the LODD datasets. To improve the user experience we enabled the query result cache in SQUIN so that results for queries that had been issued before are available immediately. Unfortunately, this result cache suppresses the chance for up-to-date results. To avoid this problem we aim to extend SQUIN with a suitable caching strategy that includes a reasonable invalidation policy.

The query execution strategy implemented in SQUIN is not alone to blame for the long execution times. The majority of the LODD datasets accessed during the query executions (4 out of 5) is published on the same machine. Hence, each query execution stresses this machine extremely so that it becomes a bottleneck for the performance of our application.

Furthermore, it is not only the queries executed with SQUIN that make the user interface feel slow. For our look-up queries we also often had the problem that the response time of the drugbank SPARQL endpoint was quite long (a few seconds) for a query service over a single dataset. We assume this problem is caused by the technology with which drugbank is published: the drugbank SPARQL endpoint operates on a virtual RDF graph that is created on-the-fly from a relational database; the mappings are not able to express the filter clause in our look-up query and, thus, the regular expression cannot be pushed-down to the database query engine; instead, the mapper has to materialize the RDF data for all drugs and the SPARQL engine has to filter afterwards.

Development Effort
Apart from the performance problems that slowed down testing, developing our application was very easy once the query templates were created. Due to integrating SQUIN we did not had to develop any program logic to query the Web. All the time we spend for coding included only the implementation of the workflow and of the AJAX user interface, i.e., everything that is also necessary for an ordinary Web application. In the end it took us about 7 hours to finish the application and we are convinced we could have been much faster had we been more experienced in JavaScript development.

Conclusion
You may try the final outcome of our work here. By integrating SQUIN in our application it required no implementation work to issue queries over up-to-date data on the Web. While the development effort is comparable to a usual Web application we can easily benefit from the availability of open data from multiple sources on the Web.

ScreenshotAltMed

Unfortunately, the user interface does not feel very responsive due to different issues raised in this text. The performance problems that are caused by SQUIN are on our todo list. In particular, in the coming months we want to improve overall query execution times by the implementation of smart caching strategies. During the development we also discovered some performance problems caused by the infrastructure that runs today’s Web of Linked Data. These issues of responsiveness and reliability of Linked Data servers have to be addressed by the community.

Jun and Olaf

by Olaf Hartig at November 12, 2009 11:01 AM

November 11, 2009

Orri Erling

RDF Geography With Virtuoso

We have just added a geometry data type and corresponding R-tree index to Virtuoso. This follows the general scheme of SQL/MM, as is implemented by PostGIS and many others. We have all the engine-side stuff, including optimizer support for geometry cardinality sampling and good execution plans for combinations of spatial and other joins. We have however not yet implemented all the different geometry types and library function support for them, like shortest distance between two arbitrary shapes.

The geometry support is for both SQL and SPARQL. On the SQL side, it works with the ISO/IEC 13249 SQL/MM API; with RDF, a geometry can occur as the object of a quad. If the object is a typed-literal of the virtrdf:Geometry type, it gets indexed in a geometry index over all geometries in quads; no special declarations are needed. After this, SQL MM predicates and functions can be used with SPARQL, like this:

  PREFIX  geo:  <http://www.w3.org/2003/01/geo/wgs84_pos#>  
  SELECT  ?class
          COUNT (*) 
   WHERE  { ?m  geo:geometry  ?geo    . 
            ?m  a             ?class  . 
                FILTER ( <bif:st_intersects> 
                          ( ?geo, 
                            <bif:st_point> (0, 52), 
                            100
                          )
                       )
          } 
GROUP BY  ?class 
ORDER BY  DESC 2 
 

This returns the counts of objects of each class occurring within 100 km of (0, 52), a point near London.

For any data set with WGS 84 geo:long and geo:lat values, a simple SQL function makes a point geometry for each such coordinate pair and adds it as the geo:geometry property of the subject with the long/lat. This then enables fast spatial access to arbitrary location data in RDF.

Right now, we hardly see any geometries other than points in RDF data, even though there are some efforts for vocabularies for more complex entities. As these get adopted we will support them.

For scalability, we tried the implementation with OpenStreetMap's 350 million or so points. The geometry implementation partitions well over a cluster, similarly to a full text index, i.e., every server has its slice of the geometries, partitioned by the geometry object's key, thus not by range of coordinates or such. Like this, the items are evenly spread even though the coordinate distribution is highly uneven.

We can do spatial joins like —

   SELECT  ?s 
           ( <sql:num_or_null> (?p) )  
           COUNT (*) 
    WHERE  { ?s   <http://dbpedia.org/ontology/populationTotal>  ?p    . 
             FILTER 
               ( <sql:num_or_null> (?p) > 1000000 )                      . 
             ?s   geo:geometry                                   ?geo  .
             FILTER 
               ( <bif:st_intersects> ( ?pt, ?geo, 5 ) )                  . 
             ?xx  geo:geometry                                   ?pt 
           } 
 GROUP BY  ?s 
           ( <sql:num_or_null> (?p) )
 ORDER BY  DESC 3 
    LIMIT  20  

This takes the DBpedia subjects that have a population over 1 million and a geometry. We then count all the geometries within 5 km of the point location of the first geometry. With DBpedia (about 5 million points), GeoNames (7 million points), and OpenStreetMap (350 million points), we get the result:

http://dbpedia.org/resource/Munich                        1356594    117280
http://dbpedia.org/resource/London                        7355400     81486
http://dbpedia.org/resource/Davao_City                    1363337     58640
http://dbpedia.org/resource/Belo_Horizonte                2412937     58640
http://dbpedia.org/resource/Chengde                       3610000     58640
http://dbpedia.org/resource/Hamburg                       1769117     51664
http://dbpedia.org/resource/San_Diego%2C_California       1266731     47685
http://dbpedia.org/resource/Bursa                         1562828     47685
http://dbpedia.org/resource/Port-au-Prince                1082800     47685
http://dbpedia.org/resource/Oakland_County%2C_Michigan    1194156     45636
http://dbpedia.org/resource/Sana%27a                      1747627     40923
http://dbpedia.org/resource/Milan                         1303437     40923
http://dbpedia.org/resource/Campinas                      1059420     40923
http://dbpedia.org/resource/Hohhot                        2580000     40923
http://dbpedia.org/resource/Brussels                      1031215     40923
http://dbpedia.org/resource/Bogra_District                2988567     40923
http://dbpedia.org/resource/Cort%C3%A9s_Department        1202510     40923
http://dbpedia.org/resource/Berlin                        3416300     35668
http://dbpedia.org/resource/New_York_City                 8274527     30810
http://dbpedia.org/resource/Los_Angeles%2C_California     3849378     25614
20 Rows. -- 1733 msec.
Cluster 8 nodes, 1 s. 358 m/s 1596 KB/s 664% cpu 2% read 16% clw threads 1r 0w 0i buffers 1124351 0 d 0 w 0 pfs

This takes 1.7 seconds on a Virtuoso Cluster configured with 8 processes on a single dual-Xeon 5520 box, running at about 664% CPU with warm cache. Fair enough for a first crack, this can obviously be optimized further. Still, the geo part of the processing is already as good as instantaneous.

We will shortly have the geography features installed on DBpedia and the other data sets we host. As these come online we will show more demo queries.

For more about SQL/MM, you can look to a couple of PDFs:

by Orri Erling (oerling@openlinksw.com) at November 11, 2009 05:17 PM

DBpedia Blog

DBpedia 3.4 released

We are happy to announce the release of DBpedia 3.4. The new release is based on Wikipedia dumps dating from September 2009.

The new DBpedia data set describes more than 2.9 million things, including 282,000 persons, 339,000 places, 88,000 music albums, 44,000 films, 15,000 video games, 119,000 organizations, 130,000 species and 4400 diseases. The DBpedia data set now features labels and abstracts for these things in 91 different languages; 807,000 links to images and 3,840,000 links to external web pages; 4,878,100 external links into other RDF datasets, 415,000 Wikipedia categories, and 75,000 YAGO categories. The data set consists of 479 million pieces of information (RDF triples) out of which 190 million were extracted from the English edition of Wikipedia and 289 million were extracted from other language editions.

The new release provides the following improvements and changes compared to the DBpedia 3.3 release:

  1. the data set has been extracted from more recent Wikipedia dumps.
  2. the data set now provides labels, abstracts and infobox data in 91 different languages.
  3. we provide two different version of the DBpedia Infobox Ontology (loose and strict) in order to meet different application requirements. Please refer to http://wiki.dbpedia.org/Datasets#h18-11 for details.
  4. as Wikipedia has moved to dual-licensing, we also dual-license DBpedia. The DBpedia 3.4 data set is licensed under the terms of the Creative Commons Attribution-ShareAlike 3.0 license and the GNU Free Documentation License.
  5. the mapping-based infobox data extractor has been improved and now normalizes units of measurement.
  6. various bug fixes and improvements throughout the code base. Please refer to the change log for the complete list http://wiki.dbpedia.org/Changelog

You can download the new DBpedia dataset from http://wiki.dbpedia.org/Downloads34. As usual, the dataset is also available as Linked Data and via the DBpedia SPARQL endpoint at http://dbpedia.org/sparql.

Lots of thanks to

  • Anja Jentzsch, Christopher Sahnwaldt, Robert Isele, and Paul Kreis (all Freie Universität Berlin) for improving the DBpedia extraction framework and for extracting the new data set.
  • Jens Lehmann and Sören Auer (Universität Leipzig) for providing new data set via the DBpedia download server at Universität Leipzig.
  • Kingsley Idehen and Mitko Iliev for loading the new data set into the Virtuoso instance that serves the Linked Data view and SPARQL endpoint. OpenLink Software (http://www.openlinksw.com/) altogether for providing the server infrastructure for DBpedia.
  • Neofonie GmbH, Berlin, (http://www.neofonie.de/index.jsp) for supporting the DBpedia project by paying Christopher Sahnwaldt.

The next steps for the DBpedia project will be to

  1. synchronize Wikipedia and DBpedia by deploying the DBpedia live extraction which updates the DBpedia knowledge base immediately when a Wikipedia article changes.
  2. enable the DBpedia user community to edit and maintain the DBpedia ontology and the infobox mappings that are used by the extraction framework in a public Wiki.
  3. increase the quality of the extracted data by improving and fine-tuning the extraction code.

All this will hopefully happen soon.

Have fun with the new data set!

Cheers

Chris Bizer

by ChrisBizer at November 11, 2009 12:52 PM

October 31, 2009

Project squin

SQUIN at ISWC 2009

For the last two weeks (Oct. 17 to Oct. 31) I was in the US. During my trip I presented SWClLib, SQUIN, and the general idea of link traversal based query execution as implemented in SWClLib on several occasions.

tn_p1000306The first week I visited my friend Juan Sequeda in Austin, Texas. Austin is a great city in which I realized a healthy dominance of students. Juan took me to dozens of places and I got to know many great people. In particular, I loved the Monday night at which we went from pub to pub on Sixth Street; all of them had a band playing live music. Our evening at the Ghostbusters quote-along in the Drafthouse movie theater is also unforgettable. If you have the chance to spend a few days in Austin I advise you to take this opportunity.
During the week Juan organized a meeting of the Austin Semantic Web meet-up; here I gave the first talk during my trip. I introduced the idea of Linked Data, presented our link traversal based query execution approach, and outlined how users benefit from Linked Data by showcasing our Researchers Map application. The talk was well received by the audience who mainly consisted of several experienced AI guys and entrepreneurs.
36505894 On the following day I had the chance to present my research at the University of Texas (UT). I decided to present link traversal based query execution and to discuss the iterator-based implementation as I would do in my ISWC talk the next week. The computer science departments at UT have a strong focus on theory. For this reason I expected very difficult questions and, thus, was quite nervous. In the end the majority of the people who attended my talk were grad and undergrad students, mainly from Juan’s database and bio-informatics focused research group, and the talk turned more into a lecture. I’d the feeling the students were very eager to learn about RDF, SPARQL, and my ideas to query the Web of Linked Data; I hope my talk inspired some of them.

logoFrom Austin I flew to Washington DC to attend the 8th International Semantic Web Conference during the second week of my trip. The conference venue was far outside of the city, “in the middle of nowhere” as several people put it. I still don’t understand the reason for organizing a conference in such a remote place. However, the conference itself was great; familiar faces, interesting conversations, and many new contacts.

At the first day we (Juan, Patrick Sinclair, Jamie Taylor, and me) gave a tutorial on “how to consume Linked Data”. Even if the audience was fairly diverse -from real beginners to more experienced practitioners- we satisfied their expectations quite well (at least that’s what Ivan Herman attested us ;-). With a discussion on “Querying Linked Data with SPARQL” (slides) I presented one of the more technical parts. Of course, I also introduced the idea of link traversal based query execution and advertised SWClLib and SQUIN here.

A novel idea for this conference was the Linked Data-a-thon, which aimed to show that it is possible to develop simple but innovative Linked Data applications with very little effort. This event resulted in the development of eight amazing applications, two of them made use of SQUIN: a search engine for traditional Chinese medicine and a lda-afterrecent posts section in FOAF letter. The latter was developed by Matthias Quasthoff who also shared his experiences of developing his Linked Data-a-thon submission. One of Matthias’ conclusions was that it turned “out again that SQUIN and LOD work nicely together if links from FOAF profiles to distributed SIOC user accounts exist.” The other SQUIN based application was the submission of Jun Zhao and me. Our app allows you to find traditional Chinese medicine as an alternative to western drugs and to inspect the possible side effects of the determined alternative medicines. For the implementation we used SQUIN to execute queries over multiple interlinked datasets provided by the Linking Open Drug Data (LODD) project. I will write a separate post in which we describe our experiences similar as Matthias did. The other six Linked Data-a-thon submission are also great examples of what can be done thanks to Linked Data. Hence, this event was a huge success! It was even sponsored with prizes. Thanks to Juan for the wonderful organization of this extraordinary event and to Jamie who did a great job presenting the submissions to the ISWC audience.

On the fourth day of the conference the presentation of my paper was scheduled and the room was packed! People were standing in the back and in the door. As probably everyone in the audience realized I was very nervous. Nonetheless, apart from a few stumbles I managed to present the idea of link traversal based query execution to the audience. After the talk and during the remaining days many people came to talk about the approach; all of them loved the general idea; and many expressed their interest in trying SQUIN for their applications.

After all, these two weeks were very exciting and they helped to spread the work about SQUIN.

Olaf
(written at Washington Dulles Airport)

by Olaf Hartig at October 31, 2009 11:00 PM

October 30, 2009

Do What I Mean (Richard Cyganiak)

Linked data at the New York Times: Exciting, but buggy

Update: Evan Sandhaus reports that all the issues mentioned below will be fixed. Great!

Yesterday at the International Semantic Web Conference, Evan Sandhaus of the New York Times unveiled data.nytimes.com, a site that publishes linked data for some parts of the Times’ index. To me, this was one of the most exciting announcements at the conference, and it caused quite a tweetstorm during and after Evan’s talk.

A bit of background: Every article published in the newspaper or on the website is tagged, classified and categorized in many ways by skilled editors. This metadata allows the creation of topic pages that automatically collect relevant articles for notable people, organisations, and events. Examples include Michelle Obama, Swine Flu (H1N1 Virus) and Wrestling.

What’s in the data? The dataset published yesterday contains information on each of the concepts that have a topic page. For now, it is limited to topic pages about people. The concepts are modelled in SKOS. The information attached to each concept consists mostly of links: to DBpedia, to Freebase, into the Times API (which is not available as RDF at this point), and of course to the corresponding topic page. This means that if you have a DBpedia URI for an especially notable entity, a high-quality New York Times topic page with the latest news about the topic is only two RDF links away. A notable feature of the links is that every single one has been manually reviewed, making this perhaps the highest-quality linkset in the LOD cloud.

How to get the data? This being linked data, every concept has a dereferenceable URI. Examples:

The site’s URI scheme follows one of the Cool URIs recipes: The identifiers above are resolvable, and by using content negotiation, web browsers are redirected to

http://data.nytimes.com/N13941567618952269073.html

which has a nicely formatted summary of the data available about Michelle Obama. Data browsers and other RDF-enabled clients, on the other hand, are redirected to

http://data.nytimes.com/N13941567618952269073.rdf

which has all the data goodness in RDF/XML.

There is also a dump: people.rdf. You can browse the data starting from the data.nytimes.com page. Everything is available under a CC-BY license.

Bugs and problems

This being a new dataset and the Times’ first foray into linked data, it turns out that the Beta label on the site is quite warranted. I will highlight four issues.

Data and metadata are mixed together. Let’s look at the data about Michelle Obama, available at the N13941567618952269073.rdf URI above. I’m reformatting the data into Turtle for legibility.

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

This makes perfect sense, it’s data about a person, modelled as a SKOS concept. But then it goes on:

<http://data.nytimes.com/N13941567618952269073>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

This is not data about Michelle Obama the person, it’s metadata about the data published by the NYT. It’s certainly not true that Michelle Obama was created by the New York Times, or that she “started” in 2007 (whatever that’s supposed to mean), and don’t even get me started on asserting a rights or a license over a person.

Note that the NYT team actually went through the effort of setting up separate URIs for Michelle the person (http://data.nytimes.com/N13941567618952269073), and for the HTML and RDF documents describing the concepts (http://data.nytimes.com/N13941567618952269073.html and http://data.nytimes.com/N13941567618952269073.rdf). The reason why linked data experts advocate this practice of having separate URIs is exactly because it enables separation of data and metadata: It lets you state some facts about the concepts, and other things about the documents that describe the concepts. This is what should be done in the data above: The metadata should not be asserted about the URI identifying Michelle, but about the URI identifying the document published by the NYT: N13941567618952269073.rdf. So we would get:

<http://data.nytimes.com/N13941567618952269073>
    a skos:Concept;
    skos:prefLabel "Obama, Michelle";
    skos:definition "Michelle Obama is the first …";
    skos:inScheme nyt:nytd_per;
    nyt:topicPage <http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html>;
    owl:sameAs <http://rdf.freebase.com/rdf/en.michelle_obama>;
    owl:sameAs <http://data.nytimes.com/obama_michelle_per>;
    owl:sameAs <http://dbpedia.org/resource/Michelle_Obama>;

<http://data.nytimes.com/N13941567618952269073.rdf>
    dc:creator "The New York Times Company";
    time:start "2007-05-18"^^xsd:date;
    time:end "2009-10-08"^^xsd:date;
    dcterms:rightsHolder "The New York Times Company"^^xsd:string;
    cc:license "http://creativecommons.org/licenses/by/3.0/us/";
    .

Eric Hellman has a post about this issue, calling it “a potential legal disaster” because a license is attached to a resource that’s said to be the same as a resource on a different site (DBpedia and Freebase). He’s a bit alarmist, but this example highlights why the separation of data and metadata, of concept URIs and document URIs, is critically important in a general-purpose data model.

Distinguishing URIs and literals. Here’s some selected snippets from the RDF/XML output:

    <nyt:topicPage>http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html</nyt:topicPage>
    <cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
    <cc:Attribution>http://data.nytimes.com/N13941567618952269073</cc:Attribution>

The value of all three properties are URIs. In the RDF data model, URIs are of such central importance that they are treated differently from any other kind of value (strings, integers, dates). But not so in the code example above. There, the three URIs are encoded as simple strings. This should be:

    <nyt:topicPage rdf:resource="http://topics.nytimes.com/top/reference/timestopics/people/o/michelle_obama/index.html" />
    <cc:License rdf:resource="http://creativecommons.org/licenses/by/3.0/us/" />
    <cc:Attribution rdf:resource="http://data.nytimes.com/N13941567618952269073" />

Why does this matter? It’s basically like making links “clickable” in HTML by putting them into a <a href=”…”> tag: RDF clients will not recognize URIs if they are encoded as literals, and will not know that they can treat them as links that can be followed.

Content negotiation for hybrid clients. As usual for linked data emitting sites, there is content negotiation on the concept URIs: They redirect either to RDF or HTML, based on the Accept header sent by the client when resolving the URI via the HTTP protocol. Also as usual for first-time linked data producers, the content negotiation is a bit broken.

Here is what happens when I ask for HTML (using cURL, which is a handy tool for debugging the HTTP behaviour of linked data sites):

$ curl -I -H "Accept: text/html" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.html

Next I will ask for RDF:

$ curl -I -H "Accept: application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf

So far, so good. But many clients are “hybrid”, they can consume both RDF and HTML. This includes many tools that can consume RDFa (RDF embedded in HTML pages). So it’s not uncommon to find tools that combine multiple media types in the accept header. The Times server should also redirect those tools to the RDF, because any RDF-consuming client can probably handle the raw RDF data better than the (not overly useful) HTML pages. But let’s see what happens:

$ curl -I -H "Accept: text/html,application/rdf+xml" http://data.nytimes.com/N13941567618952269073

Response:

HTTP/1.1 303 See Other
Server: Apache/2.2.3 (Red Hat)
Location: http://data.nytimes.com/N13941567618952269073.rdf.html

The server redirects to a file that doesn’t exist, ending in .rdf.html. This is pretty funny to me as a programmer, because the bug gives me a glimpse into the Times codebase, where obviously a programmer didn’t consider that the two alternatives—sending HTML or sending RDF—are exclusive.

Update: Someone at the Times seems to be working on the server as I’m writing this; the latest behaviour is even worse; it redirects to .rdf.html even if I request only RDF, and uses 301 redirects instead of 303.

Using the Creative Commons schema. The NYT data uses the Creative Commons schema to license the data under CC-BY. Here’s the relevant RDF, in Turtle (I fixed the subject URI and turned literals into URIs where appropriate):

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:License <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:Attribution >http://data.nytimes.com/N13941567618952269073<;
    cc:attributionName "The New York Times Company";
    .

This uses three properties: cc:License, cc:Attribution and cc:attributionName. But according to the schema, cc:License and cc:Attribution are classes, not properties. This should be:

<http://data.nytimes.com/N13941567618952269073.rdf>
    cc:license <http://creativecommons.org/licenses/by/3.0/us/>;
    cc:attributionURL <http://data.nytimes.com/N13941567618952269073>;
    cc:attributionName "The New York Times Company";
    .

Summary. The Times’ foray into linked data is an exciting new development, but it also shows how hard it is to get linked data right. This is a weakness of the linked data approach.

Can we do anything about this? Better tutorials and education can probably help. Another activity that is trying to address the issue is the Pedantic Web Group, a loose collection of people like me who obsess about the technical details of publishing data on the web and work with data publishers to get issues like the above fixed. We might even give you a hand with reviewing your stuff before you go live with it.

by Richard Cyganiak at October 30, 2009 09:19 PM

October 27, 2009

Orri Erling

European Commission and the Data Overflow

The European Commission recently circulated a questionnaire to selected experts on what could be done for the future of big data.

Since the questionnaire is public, I am publishing my answers below.

  1. Data and data types

    1. What volumes of data are we dealing with today? What is the growth rate? Where can we expect to be in 2015?

      Private data warehouses of corporations have more than doubled yearly for the past years; hundreds of TB is not exceptional. This will continue. The real shift is in structured data being published in increasing quantities with a minimum level of integrate-ability through use of RDF and linked data principles. There are rewards for use of standard vocabularies and identifiers through search engines recognizing such data. There is convergence around DBpedia identifiers for real-world entities, e.g., most things that would be in the news.

      This also means that internal data processes and silos may be enriched with this content. There is consequent pressure for accommodating more diversity of data, with more flexible schema.

      Ultimately, all content presently stored in RDBs and presented in public accessible dynamic web pages will end up on the web of linked data. Examples are product catalogs, price lists, event schedules and the like.

      The volume of the well known linked data sets is around 10 billion statements. With the above mentioned trends, growth by two or three orders of magnitude by 2015 seems reasonable, This is so especially if explicit semantics are extracted from the document web and if there is some further progress in the precision/recall of such extraction.

      Relevant sections of this mass of data are a potential addition to any present or future analytics application.

      Since arbitrary analytics over the database which is the web cannot be economically provided by a centralized search engine, a cloud model may be used for on-demand selection of relevant data and mixing it with private data. This will drive database innovation for the next years even more than the continued classical warehouse growth.

      Science data is another driver of the data overflow. For example, faster gene sequencing, more accurate measurements in high energy physics, better imaging, and remote sensing will produce l