
Study: PDF, HTML files dominate Data.gov

Data.gov relies heavily on HTML and PDF files, leading two George Mason University researchers to question whether the federal government’s data repository is achieving what it set out to accomplish.

By Greg Otto

June 13, 2016


In a paper published in April, Anne Washington and David Morar of George Mason’s School of Policy, Government and International Affairs combed through the entire Data.gov catalog to determine which file formats were available and whether those formats were the most convenient for the intended audience.

The researchers examined files hosted on Data.gov against the five-star open data scheme advocated for by internet pioneer Tim Berners-Lee. The system ranks data stored in PDF and HTML formats on the lower end of the scale, and four- and five-star data formats — those that can be linked together in a fashion similar to how URLs are hyperlinked across the internet — at the upper end.

Washington and Morar modified the five-star structure to account for the number of files on Data.gov posted in obscure file formats. These files, typically in formats used by word processing or mapping programs, were given zero stars. Unstructured formats, such as HTML or PDF, received one star. Proprietary files, such as Microsoft Word or Excel files, were given two stars. Structured machine-readable formats, such as XML or CSV files, were given three stars. Files that contained uniform resource identifiers were ranked the highest, with four stars.
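As a rough illustration of the modified scale described above, the ratings can be expressed as a simple lookup. This is only a sketch, not the researchers' own method: the format keys, the `star_rating` helper, the choice of RDF as an example of a URI-bearing format, and the decision to treat any unlisted format as "obscure" (zero stars) are all assumptions made for this example.

```python
# Hypothetical lookup table for the modified star scale described in the study.
# The specific format keys are illustrative assumptions, not the study's list.
STARS_BY_FORMAT = {
    "html": 1,  # unstructured
    "pdf": 1,   # unstructured
    "doc": 2,   # proprietary (Microsoft Word)
    "xls": 2,   # proprietary (Microsoft Excel)
    "xml": 3,   # structured, machine-readable
    "csv": 3,   # structured, machine-readable
    "rdf": 4,   # assumed example of a format containing URIs
}


def star_rating(file_format: str) -> int:
    """Return the star rating for a format name.

    Assumption: any format missing from the table is treated as
    obscure and receives zero stars.
    """
    return STARS_BY_FORMAT.get(file_format.strip().lower(), 0)


if __name__ == "__main__":
    for fmt in ["PDF", "csv", "xls", "rdf", "octet-stream"]:
        print(fmt, star_rating(fmt))
```

A table like this makes the trade-off in the scale easy to see: two files with identical content can land three stars apart depending only on the format in which they are published.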


Researchers found that of the 244,000 files on Data.gov, more than 30 percent (77,217) are posted in HTML. The second-most popular file format is XML, at 17 percent (42,846). PDFs came in third at 14 percent (34,381), while two lesser-known file formats — ODF and Octet Stream — rounded out the top five.
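The percentages can be checked against the reported counts with a few lines of arithmetic. The sketch below assumes the rounded total of 244,000 files quoted above, so the computed shares are approximate; the `share` helper and variable names are made up for this example.

```python
# Counts as reported in the article; the total is the rounded figure it quotes.
TOTAL_FILES = 244_000
FORMAT_COUNTS = {"HTML": 77_217, "XML": 42_846, "PDF": 34_381}


def share(count: int, total: int = TOTAL_FILES) -> float:
    """Return count as a percentage of total."""
    return 100 * count / total


if __name__ == "__main__":
    for fmt, count in FORMAT_COUNTS.items():
        print(f"{fmt}: {share(count):.1f}% of files")
```

With the rounded total, HTML comes out just under 32 percent, XML between 17 and 18 percent, and PDF at about 14 percent, which is consistent with the figures reported.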

More than 60 percent of Data.gov’s files were given a one-star rating. Formats that earned three stars — meaning the files are open and machine-readable — finished second, with 23 percent of all Data.gov files falling into this category.

Only 18,347 files — 7 percent — were found to meet the four-star criteria.

The study’s authors found that agencies have embraced publishing information to Data.gov in formats that a wide swath of the public can use. However, the study points out that the government may be more focused on informing the “English-literate public than the data literate who want machine-readable information.”

“If the goal of open government data is machine readable structured file, there may be a legitimate concern about the large number of PDF and HTML files,” the report reads. “The innovators and the data entrepreneurs expect structure machine-readable data.”

Congress is pushing for machine-readable data to be the government’s default format. In April, groups in the House and Senate introduced a bill that calls on agencies to create an inventory of all enterprise data, determine what can be released publicly, and post it with open licenses and in machine-readable formats.

The authors also conclude that the government is going to have to decide how to reach both average users and techies alike.

“Governments attempt to satisfy both the average user, with simple accessible formats, and the sophisticated data consumer, with structured machine-readable formats,” the report reads. “Open government data has established an important pattern of considering both the least and the most sophisticated users. This study suggests that we need a broader conversation about who the data audience will be in the context of open government.”

You can download the full study here.

