![]() |
The 2000 U.S. Census: 1 Billion RDF Triples

August 14, 2007

Overview

I've been interested lately in getting large amounts of
existing data into RDF so that
databases once isolated by being in vastly different data formats can
start to be meshed more easily. (See data on the U.S. Congress in
RDF). This page describes the 2000 U.S. Census
converted into RDF (what is RDF?) and exposed via
SPARQL. The U.S. Census data is
provided by the Census Bureau in a structured format (with an enormous
amount of documentation, no less) and yields on the order of 1 billion
RDF triples. The task of extracting those triples is a hefty one,
though fairly straightforward, and this document explains what I did from start to finish --- first
transforming the Census data into Notation 3 (with a Perl script), and
then loading it into a MySQL database and serving it via SPARQL (using
my own C# library for
RDF). Also see my original announcement of this data set on the semantic-web mail list.

Try it out...
More information from the Census...
Later on this page...
About the Census DataThe Census data comprises population statistics at various geographic
levels, from the U.S. as a whole, down through states, counties,
sub-counties (roughly, cities and
incorporated towns), so-called "census data
places" ("CDP"s, what I would call a named "village", but might
correspond better with the colloquial use of the word town), ZIP Code Tabulation Areas
(ZCTAs, which approximate ZIP codes), and even deeper levels of granularity.
Side notes: The data set contains around 3,200 counties, 36,000
"towns", 16,000 "villages", and 33,000 ZCTAs. There are fewer CDPs than towns here
because I exclude CDPs that represent 100% of the town they are
contained in. A big chunk of the 25k stats per region is iterations of
the same statistics but for race-based subsets of the total population
of each region, and since I don't think that's so interesting, and I
wanted to keep the data size manageable (1B is large enough, thank
you), I omit those stats, leaving around 11 thousand per region.
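
As a rough sanity check on the "1 billion" figure, the approximate numbers in the side notes can be multiplied out. The sketch below is Python rather than the Perl used for the conversion, and it assumes that each retained statistic contributes roughly one triple. That assumption ignores the extra title and rdf:value triples, as well as the states and congressional districts, so treat the result as an order-of-magnitude estimate only.

```python
# Approximate region counts and per-region statistic count from the side notes.
region_counts = {
    "counties": 3_200,
    "towns": 36_000,
    "villages (CDPs)": 16_000,
    "ZCTAs": 33_000,
}
stats_per_region = 11_000  # after dropping the race-based iterations


def estimate_statistic_values(counts, per_region):
    """Multiply the total number of regions by the statistics kept per region."""
    return sum(counts.values()) * per_region


total_regions = sum(region_counts.values())
estimate = estimate_statistic_values(region_counts, stats_per_region)
print(f"{total_regions:,} regions x {stats_per_region:,} stats = {estimate:,} values")
```

The product comes out a little under one billion, which is consistent with the size of the data set described on this page.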
The statistics
themselves contain total population counts, counts by age, sex, and
race, information on commuting time to work, mean income, latitude and
longitude of the region, etc. In fact, for each of the around
55,000 geographic regions from country down to CDP, 25
thousand statistics are reported! That's a lot.

The thousands of statistics are, fortunately for the human user,
structured. The stats break down into tables, and tables within tables,
etc. For instance, population by sex and age (i.e. how many individuals
are male and 24 years old) is a two-level table. First the total population
is broken down by sex (total male, female), and then each of those parts
is further broken down by age. It is not uncommon to see tables four
or five levels deep. And since the statistics break things down into
smaller and smaller categories, around 25% of all of the numbers reported
over all of the regions are just zero.

Further, the stats break down into what the Census calls "universes."
For a region, one universe is "total population", which means that the
statistics represent counts of people out of all of the people in the
region. Another universe is "households", which means the numbers
are not counting people but households, out of all of the households
in the region. Other universes break down those two, such as "total
population 18 years and over."

And, lastly, there are some statistics which are not counts but
are instead aggregates or medians, such as the median income out
of a subset of the population within some region.

The fact that the statistics are structured into
hierarchical tables makes the relational database model already
problematic for representing the data --- unless you want a single
table with thousands of columns, or otherwise several hundred tables, but in
either case the hierarchy is not explicitly modeled. An XML-ish
database might work well, but the flexibility of a native RDF store for
encoding a graph seems like a fair enough fit.

Modeling the Census Data in RDF and Conversion to Notation 3

Modeling Choices

The Census data could have been modeled in RDF any number of ways,
and I chose one way that seemed to work out all right. I wanted
the notion of a hierarchy of tables to carry over into RDF. That is,
there should be nodes in the RDF graph representing a table, i.e. a
subset of the population, out of which subtables may slice the population
into smaller groups. Each slice of the world is represented by an RDF predicate.
Tables as such are represented by blank (anonymous) nodes.

Okay, to take an example: Each region starts off with a node
representing the region, i.e. <http://www.rdfabout.com/rdf/usgov/geo/us>
for the United States as a whole. To keep the thousands of statistics
slightly separated from the basic geographic data, that node follows a
census:details predicate to a new node representing the 2000 Census
statistics for that region. By the current convention, the URI for the details
node just appends "/censustables" to the region URI.

The first way the Census slices
the world as it pertains to that region is by dividing the world into
universes, which as I mentioned above are "total population", "households",
etc. In the RDF model, predicates representing each universe are
applied to the ...region/censustables entity and land on a bnode representing a table
that will further divide that universe. If the "total population" universe
is then subdivided into "male" versus "female", predicates leave the first
bnode and land on two new bnodes representing the males out of the total
population and the females of the total population. Further subdivisions,
such as by age, leave these bnodes and may land on new bnodes that may be
subdivided further.

There is more to explain, below, but here is a graphical representation of
what is going on. The black-colored nodes and edges are the nodes and
triples I've discussed so far.

![]()

Side note: You may notice in the figure that there are two population
edges leaving the Region node terminating on bnodes that repeat the
same total population count (120,000) in their rdf:values.
This redundancy is there in order to model each table in the Census data
as an independent bnode. That is, if there were only one population
predicate leaving Region, then the male/female division and the inHousehold/inGroupQuarters
division would be collapsed. The benefit of keeping them apart is that
the model is explicit about which categories are mutually exclusive
and which fall into natural groupings, which applications may find useful.

As the figure shows, each Census predicate (in black) takes you from
a table bnode to either a) another table or b) a literal numeric value
(in blue). When it takes you to a numeric value, you've reached the end
of the line and it tells you how many people (households, etc.) fall into
all of the categories that brought you to that value. So if you follow
the path Region > population > female > age10-19, you end on a
numeric value that tells you how many women aged 10-19 are in that
region. (If you followed Region > households >
nonFamilyHouseholds you would get the number of households, not
people, that are nonFamilyHouseHolds. To know what a "non-family
household" is, you would have to consult the PDFs published by the
Census.) Now, what makes this a little bit weird is that we may also
want to know how many women there are total. If we followed the path
only part way (Region > population > female), you might
want a literal numeric value here, but we can't do that since
we've already put a bnode here so we can branch further. Instead, we
branch off a rdf:value predicate to a literal value that
has the total number for that category (the edges and literals in red).
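
The branching described so far can be mimicked with nested dictionaries. This is only a toy Python sketch: dicts stand in for bnodes, integers for literal counts, every number is invented, and the helper `count_at` is a hypothetical name. A dict also cannot hold two edges with the same predicate, so it does not capture the case of several population tables leaving one region.

```python
RDF_VALUE = "rdf:value"

# Toy stand-in for one region's table graph: dicts play the role of bnodes,
# ints play the role of literal counts. All numbers are invented.
region = {
    "population": {
        RDF_VALUE: 120000,
        "male": {RDF_VALUE: 59000, "age10-19": 8000},
        "female": {RDF_VALUE: 61000, "age10-19": 7900},
    },
    "households": {RDF_VALUE: 45000, "nonFamilyHouseholds": 14000},
}


def count_at(node, path):
    """Follow predicate names from node; if the path ends on a table, use its rdf:value."""
    for predicate in path:
        node = node[predicate]
    if isinstance(node, dict):
        return node[RDF_VALUE]
    return node


print(count_at(region, ["population", "female", "age10-19"]))  # ends on a literal
print(count_at(region, ["population", "female"]))              # ends on a table, so rdf:value is used
```

The `isinstance` check is the awkward part: the caller cannot tell from the path alone whether the extra rdf:value hop is needed.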
So if you want the total number of women, you follow Region >
population > female > rdf:value. (This is, yes, a little bit
awkward since you can't know whether you need that extra
rdf:value or not without looking at the structure of the
graph.)

Besides the graph above, there is also a hierarchy established
between regions themselves using dcterms:isPartOf, and
several other features including actual names and latitude/longitude
are represented as predicates directly off of the region node.

Converting the Census Raw Data Files to Notation 3

The Census publishes their data in text-based data files (compressed,
since even compressed it is several gigabytes in all). I chose
to first convert them to Notation 3 files on disk, and then to load
those files into a triple store.

Geographic Data

Basic geographic data, including the hierarchical relationship between
the regions, latitude/longitude, and names, comes from the file
usgeo_uf1.txt.
It's a fixed-column-width text file, which is described here.
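
To illustrate what reading a fixed-column-width record involves, here is a minimal Python sketch (the actual conversion used a Perl script). The column offsets in `LAYOUT` are invented for illustration and are not the real layout of usgeo_uf1.txt. The helper names `parse_record` and `to_n3` are likewise hypothetical, and the sample line is synthetic.

```python
from collections import namedtuple

Field = namedtuple("Field", "name start width")

# Hypothetical layout: these offsets are invented for illustration and are
# NOT the real column positions of usgeo_uf1.txt.
LAYOUT = [
    Field("usps", 0, 2),
    Field("fips", 2, 2),
    Field("name", 4, 20),
    Field("population", 24, 9),
]


def parse_record(line, layout=LAYOUT):
    """Slice one fixed-width line into a dict of stripped field values."""
    return {f.name: line[f.start:f.start + f.width].strip() for f in layout}


def to_n3(record):
    """Render a parsed record as a few N3 statements about a state URI."""
    uri = "http://www.rdfabout.com/rdf/usgov/geo/us/" + record["usps"].lower()
    lines = [
        "<%s>" % uri,
        '    usgovt:uspsStateCode "%s" ;' % record["usps"],
        '    usgovt:fipsStateCode "%s" ;' % record["fips"],
        '    dc:title "%s" ;' % record["name"],
        "    census:population %d ." % int(record["population"]),
    ]
    return "\n".join(lines)


# Synthetic sample line built to match the hypothetical layout above.
sample = "NY36" + "New York".ljust(20) + "18976457".rjust(9)
record = parse_record(sample)
print(to_n3(record))
```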
Besides some standard schemas, I use two of my own: Census and USGovt. Below is the segment of the N3 version for New York.

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix dcterms: <http://purl.org/dc/terms/> .
@prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#> .
@prefix census: <tag:govshare.info,2005:rdf/census/> .
@prefix usgovt: <tag:govshare.info,2005:rdf/usgovt/> .
<http://www.rdfabout.com/rdf/usgov/geo/us/ny>
    rdf:type usgovt:State ;
    usgovt:censusStateCode "21" ;
    usgovt:fipsStateCode "36" ;
    usgovt:uspsStateCode "NY" ;
    dc:title "New York" ;
    dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us> ;
    geo:lat 42.155127 ;
    geo:long -75.164667 ;
    census:population 18976457 ;
    census:households 7679307 ;
    census:landArea "122283145776 m^2" ;
    census:waterArea "19016249880 m^2" ;
    census:details <http://www.rdfabout.com/rdf/usgov/geo/us/ny/censustables> .

You may notice the landArea and waterArea values are given in a strange half-number/half-unit literal value. I didn't know what to do since there isn't much consensus on how to indicate the units of physical quantities in RDF. (See the thread "ontology for units of measurement and/or physical quantities" at the semantic-web mail list archives and my own suggestion that came later on.)

A link to download the geographic data set (about 1 million triples)
is at
the start of this document.

Population Tables

The tables-within-tables population data came next. There are two
sets of this data. The first set is the "100 Percent" data which is data for
questions asked of every individual in the United States. The second data set is
the Sample data, which is data for questions asked of about one-sixth of the
population. The two data sets overlap in questions, and so there are two values
for some statistics, one as determined by the 100 Percent data and one from
the Sample data, and this forced the use of two separate namespaces for the
statistics, "100pct" and "samp".

This data was far more difficult to get into N3 than the
geographic data, which is just a flat table with a few fields per region. The "summary files"
(as they're called) come from here (documentation is there too). For instance, the file SF1_all_0Final_National.zip (one gigabyte) contains about one-third of the census data. In that zip file are zip files for each state, and in those are comma-delimited text files with several hundred fields per file.
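
The field descriptions that accompany these files mark sub-categories by indenting them under their parent. As a rough idea of how such indentation can be turned into table paths, here is a minimal Python sketch. It is not the Perl script used for the conversion; the sample labels and the two-space indent are assumptions, and `paths_from_indentation` is a hypothetical helper.

```python
def paths_from_indentation(lines, indent_width=2):
    """Turn indented labels into full paths, using a stack of ancestor labels."""
    stack = []   # stack[i] holds the most recent label seen at depth i
    paths = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        stripped = raw.lstrip(" ")
        depth = (len(raw) - len(stripped)) // indent_width
        del stack[depth:]          # forget labels at this depth or deeper
        stack.append(stripped.strip())
        paths.append(tuple(stack))
    return paths


# Invented sample descriptions; real labels come from the Census metadata files.
sample = [
    "Total population",
    "  Urban",
    "    Inside urbanized areas",
    "    Inside urban clusters",
    "  Rural",
]
paths = paths_from_indentation(sample)
for path in paths:
    print(" > ".join(path))
```

Each resulting path corresponds to one chain of predicates from a table node down to a value.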
The comma-delimited files are described in a set of metadata files for the SAS statistics program. Those
files list the order of the fields within the data files and give each field a short description. In addition, whitespace indentations in the descriptions, which presumably are ignored by SAS, were crucial for establishing the hierarchical nature of the fields.

I used a Perl script to convert the data into N3. The script runs
for about an hour and a half (on modern hardware) and yields 1 billion triples. To
get the triples yourself, you will have to download my Perl script, a patch file,
and Census data files, all linked at the start of this document. The
resulting N3 files are just too big (2.4GB) for me to provide as a download.

The beginning of the output for the United States looks like this:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix dc: <http://purl.org/dc/elements/1.1/> .
@prefix : <tag:govshare.info,2005:rdf/census/details/100pct> .
<http://www.rdfabout.com/rdf/usgov/geo/us/censustables>
    :totalPopulation 281421906 ; # P001001
    :totalPopulation [
        dc:title "URBAN AND RURAL (P002001)";
        rdf:value 281421906 ; # P002001
        :urban [
            rdf:value 222360539 ; # P002002
            :insideUrbanizedAreas 192323824 ; # P002003
            :insideUrbanClusters 30036715 ; # P002004
        ] ;
        :rural 59061367 ; # P002005
    ] ;
    :totalPopulation [
        dc:title "RACE (P003001)";
        rdf:value 281421906 ; # P003001
        :populationOfOneRace [
            rdf:value 274595678 ; # P003002
            :whiteAlone 211460626 ; # P003003
            :blackOrAfricanAmericanAlone 34658190 ; # P003004
            :americanIndianAndAlaskaNativeAlone 2475956 ; # P003005
...

Overall Statistics

The conversion process to Notation 3 yielded a total of
1,002,848,918 triples. Of those, 1,016,219 triples were for the geographic data.
The rest were for the detailed population statistics. Regions covered
include the U.S., states, counties, towns, Census data places ("villages"),
ZCTAs (roughly ZIP Codes), and current congressional districts (2007-08).
Also included are owl:sameAs links between geographic entities
defined here and some defined in the Geonames data set.

Loading the Data into a Triple Store and Exposing it to the World

Loading the Data

I used my own SemWeb Library for C# to
load the triples into a MySQL database.
Loading it took 39 hours on a 2.13 GHz Core 2 Duo (~7,000 statements per second). The database is 85 GB
(~90 bytes per statement). Within my library there is a
command-line tool for loading data into a database. I used this to load
the geographic data:
export SEMWEB_MYSQL_IMPORT_MODE=LOCK
mono rdfstorage.exe --clear -in n3 \
-out "mysql:censusgeo:Database=rdf;User name=rdf" \
geo-* link-*

Because of the enormous size of the remaining data, the detailed data
triples were in GZip'ed N3 files, and so one has to be
a little bit more creative to load them into a database without
uncompressing them first:
export SEMWEB_MYSQL_IMPORT_MODE=DISABLEKEYS
cat sumfile* cong* | gunzip | cat -- schema.n3 - | \
nice mono rdfstorage.exe \
-out "mysql:censustables:Database=rdf;User name=rdf" \
-in n3 -

The library creates three MySQL tables to store the data.
The first table has columns Subject, Predicate, and Object, each an integer
key representing the resource. Two other tables map integers to URIs, for
named entities, and integers to values and datatypes for literals.
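
A small Python sketch using SQLite may make this three-table layout more concrete. It is not the SemWeb library's actual schema: the table names, column names, and indexes are invented, and sharing one ID counter between the two lookup tables is an assumption made so that an object ID is unambiguous.

```python
import itertools
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entities (id INTEGER PRIMARY KEY, uri TEXT UNIQUE);
    CREATE TABLE literals (id INTEGER PRIMARY KEY, value TEXT, datatype TEXT);
    CREATE TABLE statements (subject INTEGER, predicate INTEGER, object INTEGER);
    CREATE INDEX idx_sp ON statements (subject, predicate);
    CREATE INDEX idx_po ON statements (predicate, object);
""")
next_id = itertools.count(1)  # one counter shared by both lookup tables (an assumption)


def entity_id(uri):
    """Return the integer key for a URI, creating it on first use."""
    row = conn.execute("SELECT id FROM entities WHERE uri = ?", (uri,)).fetchone()
    if row:
        return row[0]
    new_id = next(next_id)
    conn.execute("INSERT INTO entities (id, uri) VALUES (?, ?)", (new_id, uri))
    return new_id


def literal_id(value, datatype):
    """Store a literal value with its datatype and return its integer key."""
    new_id = next(next_id)
    conn.execute("INSERT INTO literals (id, value, datatype) VALUES (?, ?, ?)",
                 (new_id, value, datatype))
    return new_id


ny = entity_id("http://www.rdfabout.com/rdf/usgov/geo/us/ny")
population = entity_id("tag:govshare.info,2005:rdf/census/population")
conn.execute("INSERT INTO statements VALUES (?, ?, ?)",
             (ny, population, literal_id("18976457", "http://www.w3.org/2001/XMLSchema#integer")))

# Answering "what is the population of New York?" means joining the integer
# keys back to their URIs and literal values.
row = conn.execute("""
    SELECT l.value FROM statements st
    JOIN entities s ON s.id = st.subject
    JOIN entities p ON p.id = st.predicate
    JOIN literals l ON l.id = st.object
    WHERE s.uri = ? AND p.uri = ?
""", ("http://www.rdfabout.com/rdf/usgov/geo/us/ny",
      "tag:govshare.info,2005:rdf/census/population")).fetchone()
print(row[0])
```

Storing integer keys instead of full URIs keeps the statements table narrow, at the cost of joins on every lookup.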
MySQL indexes are created over the columns in several ways to support
different types of queries.

Exposing the Data with SPARQL

Once the data is in a triple store, exposing the data via SPARQL is
the best way to access it. SPARQL allows one to run all sorts of queries
against the data set. We could use it, for instance, to get a list of all
of the counties in New York sorted by the median income in the county:
SELECT ?name ?medianincome WHERE {
    ?county dcterms:isPartOf <http://www.rdfabout.com/rdf/usgov/geo/us/ny> ;
        rdf:type usgov:County ;
        dc:title ?name .
    ?county census:details [
        census2:population15YearsAndOverWithIncomeIn1999 [
            census2:medianIncomeIn1999 ?medianincome
        ]
    ] .
} ORDER BY ?medianincome

Another example would be to find all states in the
United States with more students living in dorms than prison inmates:
SELECT ?name ?prisoners ?students WHERE {
    ?state rdf:type usgov:State ;
        dc:title ?name .
    ?state census:details [
        census1:populationInGroupQuarters [
            census1:institutionalizedPopulation [
                census1:correctionalInstitutions ?prisoners
            ] ;
            census1:noninstitutionalizedPopulation [
                census1:collegeDormitories__includesCollegeQuartersOffCampus ?students
            ]
        ]
    ] .
    FILTER(?students > ?prisoners) .
}

You can try out queries
here. That page has the full query examples with PREFIXes and some additional suggestions for querying the data.

Again, I used my own library to set up a SPARQL end-point,
since I have included in it an ASP.NET HTTP protocol handler that
can plug into ASP.NET to provide a SPARQL end-point. I run it under
Mono's mod_mono under Apache
but in principle it should work with IIS on Windows too.
Getting that going is very straightforward, at least once you have mod_mono
working. First, instruct Apache
that a certain location will be handled by the ASP.NET runtime by
editing the server or vhost configuration and adding:
<Location /sparql>
    SetHandler mono
</Location>

(I believe you could also do something similar with a
.htaccess file in a "sparql" directory.) Then, at the
root-level of the website, in the web.config file, add:
<configuration>
    <configSections>
        <section name="sparqlSources" type="System.Configuration.NameValueSectionHandler,System"/>
    </configSections>
    <system.web>
        <httpHandlers>
            <add verb="*" path="sparql"
                type="SemWeb.Query.SparqlProtocolServerHandler, SemWeb.Sparql" />
        </httpHandlers>
    </system.web>
    <sparqlSources>
        <add key="/sparql" value="noreuse,mysql:censusgeo:Database=rdf;Server=localhost;User Id=rdf
mysql:censustables:Database=rdf;Server=localhost;User Id=rdf"/>
    </sparqlSources>
</configuration>

and place the DLLs from the library in the bin
directory of the website.

This creates a SPARQL end-point at
http://www.rdfabout.com/sparql.

Dereferencing URIs and Linked Data

Current Semantic Web best practices are to mint URIs
for entities that are dereferenceable, that is, http: URIs
that you could plug into a web browser and get something back. For
reasons beyond the scope of this web page, the practice is to have
a web server reply with a 303 "see other" status when dereferencing RDF
URIs, sending the user to a different page which actually provides some
RDF information (usually RDF/XML) about the resource in question.
(For more, see How
to Publish Linked Data on the Web.)

Providing the redirect from the URIs to some other
page describing them is easy enough with Apache. But what page should
describe them, and how should those pages be created? With a SPARQL
end-point already set up, an easy solution is to use URLs for SPARQL
DESCRIBE queries as the targets of the redirects. I've thus included the following in my .htaccess
file at the root of my website. It redirects URLs in a certain subdirectory
to the SPARQL end-point with a DESCRIBE query on the URL originally accessed.
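
To make the mapping concrete, here is a Python sketch of how a redirect target of this kind could be computed. It is an illustration only, not part of the Apache setup described here; `describe_redirect` is a hypothetical helper. Unlike a plain pattern substitution, it percent-encodes the whole query, so characters such as & and # in a URI would survive.

```python
from urllib.parse import parse_qs, quote, urlsplit

ENDPOINT = "http://www.rdfabout.com/sparql"
HOST = "http://www.rdfabout.com"


def describe_redirect(path):
    """Build the DESCRIBE URL for a requested path such as /rdf/usgov/geo/us."""
    query = "DESCRIBE <%s%s>" % (HOST, path)
    # safe="" forces every reserved character, including < > / : & #, to be encoded.
    return ENDPOINT + "?query=" + quote(query, safe="")


target = describe_redirect("/rdf/usgov/geo/us")
print(target)

# Decoding the query string again recovers the original DESCRIBE query.
decoded = parse_qs(urlsplit(target).query)["query"][0]
print(decoded)
```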
Note that the angled brackets in the query are URI-escaped, and that I assume
no special URI characters like %, #, and & are present in URIs, or a different
redirecting method would have to be used so that URIs could be properly escaped.

RedirectMatch 303 (/rdf/usgov/geo/.*) http://rdfabout.com/sparql?query=DESCRIBE+%3Chttp://www.rdfabout.com$1%3E

You could try it out by visiting
http://www.rdfabout.com/rdf/usgov/geo/us,
the URI I minted for the United States, right in your browser, or pasting that URI
into any of the client-side tools
on the LinkedData ESW wiki page.

Update: A robust solution for redirects, for URIs that may include special escaped characters, is here: .htaccess. It employs a redirect for all URLs in the /rdf path space for the virtual host. Note that this requires that a line be added to httpd.conf.

Providing a Sitemap

Following the new Semantic Web
Crawling:
a Sitemap Extension guide, and the robots.txt
extension for finding sitemaps, I have created a sitemap.xml file for the data
stored on this website which you can take a look at here,
and it is referenced in the robots.txt file for the website. |