Your very own MusicBrainz

Importing MusicBrainz RDF dump into Virtuoso OpenSource

Introduction

With the availability of large databases, the semantic web is slowly becoming a reality. We now have access to vast amounts of data through services like Freebase, MusicBrainz or DBpedia. With these services, through the use of query languages like MQL or SPARQL, you can easily find the directors of movies starring Dolph Lundgren, the daughters of all astronauts that walked on the moon, or the album names of bands that had Robert Trujillo as one of their members.

Most of these databases provide web services to query the data directly without going through their sites. For instance, you can query the BBC Programmes and Music database using SPARQL at http://lod.openlinksw.com/sparql/ with the SPARQLWrapper Python module:

>>> from SPARQLWrapper import SPARQLWrapper, JSON
>>> sparql = SPARQLWrapper("http://lod.openlinksw.com/sparql/")
>>> sparql.setQuery("""
 SELECT DISTINCT ?title WHERE {
 <http://www.bbc.co.uk/programmes/b006q2x0#programme> <http://purl.org/ontology/po/episode> ?episode .
 ?episode <http://purl.org/dc/terms/title> ?title .
 } LIMIT 10
""")
>>> sparql.setReturnFormat(JSON)
>>> results = sparql.query().convert()
>>> [ result['title']['value'] for result in results['results']['bindings']]
[u'Victory of the Daleks', u'The Fourth Dimension', u'The Beast Below Jigsaw', u'Series 5, Flesh and Stone', u'The Eleventh Hour', u'Vincent and the Doctor Characters', u'The Beast Below Characters', u'The Beast Below', u'Smiler Mask', u'The Stolen Earth']

The snippet shows the titles of 10 episodes of the BBC series “Doctor Who”.

Although these sites provide an invaluable service, they cannot be used as back-ends for production environments, as they limit the number of requests we can make, either with fixed quotas or by throttling. Luckily for us, dumps are available for these databases, and we can have our own versions up and running in no time.

Database Set-Up

There are several options when it comes to triple stores, but we noticed that a lot of the public sites offering SPARQL endpoints use Virtuoso, so we’re going to use it for our example.

Installing Virtuoso on an Ubuntu installation is trivial:

$ sudo apt-get install virtuoso-server virtuoso-vad-isparql

This creates an initial database with the default configuration. Although it can be used as is for smaller graphs, we’re going to be processing several million triples, so we need to tweak both the server and database configurations before we can move on.

First, we need to follow the instructions in the “General” and “Swappiness” sections of the Virtuoso performance tuning guide.

Open Virtuoso’s .ini file (in our case /etc/virtuoso-opensource-6.1/virtuoso.ini). The guide states that for large datasets it is recommended to allocate between 3/5 and 2/3 of the system RAM by modifying the values of the NumberOfBuffers and MaxDirtyBuffers configuration variables in the [Parameters] section according to the following table:

System RAM   NumberOfBuffers   MaxDirtyBuffers
 2 GB            170000            130000
 4 GB            340000            250000
 8 GB            680000            500000
16 GB           1360000           1000000
32 GB           2720000           2000000
48 GB           4000000           3000000
64 GB           5450000           4000000
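The table follows a simple pattern: Virtuoso buffers hold 8 KB pages, so NumberOfBuffers is roughly 2/3 of RAM expressed in 8 KB pages, and MaxDirtyBuffers is about 3/4 of that. A minimal sketch for intermediate RAM sizes (the function name is ours, and the values are approximations, not the guide's exact figures):

```python
def estimate_buffers(ram_gb):
    """Approximate NumberOfBuffers / MaxDirtyBuffers for a given RAM size.

    Each buffer holds an 8 KB page; 2/3 of a GB in 8 KB pages is ~85000
    buffers, and MaxDirtyBuffers is roughly 3/4 of NumberOfBuffers.
    """
    number_of_buffers = ram_gb * 85000
    max_dirty_buffers = number_of_buffers * 3 // 4
    return number_of_buffers, max_dirty_buffers

print(estimate_buffers(8))  # close to the 8 GB row: (680000, 510000)
```

For a 12 GB machine, for example, this suggests values around 1020000 and 765000, which sit sensibly between the 8 GB and 16 GB rows.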

The guide also suggests using striping for the database. Change the [Striping] section of the Virtuoso .ini file so it looks like the following:

[Striping]
Segment1 = 10G, db-seg1-1.db
Segment2 = 10G, db-seg1-2.db
Segment3 = 10G, db-seg1-3.db
Segment4 = 10G, db-seg1-4.db

The database is approximately 25 GB, so four stripes of 10 GB each should be sufficient.
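If you need a different total size, the section can be generated rather than typed by hand. A small sketch (the function name and file-name scheme are ours, following the pattern above):

```python
import math


def striping_section(total_gb, stripe_gb=10, prefix="db-seg1-"):
    """Generate a Virtuoso [Striping] section covering total_gb,
    split into stripe_gb-sized segments."""
    segments = math.ceil(total_gb / stripe_gb)
    lines = ["[Striping]"]
    for i in range(1, segments + 1):
        lines.append("Segment%d = %dG, %s%d.db" % (i, stripe_gb, prefix, i))
    return "\n".join(lines)


print(striping_section(40))
```

Running it with a total of 40 GB reproduces the four-segment configuration shown above.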

You should now restart the service. It will take a while, as it needs to allocate the space for the stripes.

We discovered that without these changes, the ingestion process usually hangs (most likely due to deadlocks), requiring us to restart it.

Downloading the RDF Dump

It is possible to download the MusicBrainz RDF dump directly from their FTP site, but the latest one is quite old. Luckily, LinkedBrainz provides updated dumps as well as a SPARQL endpoint.

Download all the files from the dump folder and put them in /var/lib/virtuoso-opensource-6.1/linkedbrainz or any other directory listed in the DirsAllowed configuration variable. The files are organized by date, so you’ll have to put them all in a single directory (overwriting the older files where necessary).
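When the dump spans several dated folders, "keep only the newest copy of each file" is easy to get wrong by hand. A minimal sketch, assuming the parent directory names sort chronologically (the function name is ours):

```python
import os


def latest_per_name(paths):
    """Return one path per file name, keeping the copy from the
    latest dated directory (e.g. '2012-05-01/artists.nt.gz')."""
    newest = {}
    for path in sorted(paths):  # later dates sort, and so overwrite, last
        newest[os.path.basename(path)] = path
    return sorted(newest.values())


print(latest_per_name([
    "2012-04-01/artists.nt.gz",
    "2012-05-01/artists.nt.gz",
    "2012-04-01/labels.nt.gz",
]))
```

Only the May copy of artists.nt.gz survives; labels.nt.gz, present only in April, is kept as-is.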

Loading the Triples into the Graph

Go to the directory where the files live and run the following command:

$ echo "http://musicbrainz.org" > global.graph

This creates a global.graph file containing the MusicBrainz IRI. It tells Virtuoso’s bulk loader script to use this IRI as the graph name for all the files in the directory. The files are gzipped, but Virtuoso recognizes and handles compressed files without problems.

Open an isql shell to start the ingestion process:

$ isql-vt 1111 dba password
SQL> ld_dir ('/var/lib/virtuoso-opensource-6.1/linkedbrainz', '*.nt.gz', 'http://musicbrainz.org');
SQL> set isolation='uncommitted';
SQL> rdf_loader_run(log_enable=>2);

This runs the loading script. Depending on your specific configuration it can take several hours (or even days!) to complete. You can check the progress by opening up another isql shell and querying the status of the load_list table:

$ isql-vt 1111 dba password
SQL> select * from load_list;

Assuming everything went well (all the files listed in the load_list table have an ll_state value of 2), we can now query Virtuoso’s SPARQL endpoint.
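If you save the (ll_file, ll_state) rows from that query, a tiny helper can summarize them; in load_list, 0 means the file is pending, 1 that it is being loaded, and 2 that it is done. A sketch (the helper is ours, the state codes are Virtuoso's):

```python
from collections import Counter

# ll_state codes from Virtuoso's load_list table.
STATES = {0: "pending", 1: "loading", 2: "done"}


def load_progress(rows):
    """Summarize (ll_file, ll_state) rows into a state -> count dict."""
    counts = Counter(state for _file, state in rows)
    return {STATES[state]: n for state, n in counts.items()}


print(load_progress([("a.nt.gz", 2), ("b.nt.gz", 2), ("c.nt.gz", 1)]))
```

The load is finished once every file reports "done".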

Querying the SPARQL Endpoint

We’re going to use SPARQLWrapper one more time to discover the names of the female members of the Swedish pop group ABBA:

>>> from SPARQLWrapper import SPARQLWrapper, JSON
>>> sparql = SPARQLWrapper("http://192.168.0.1:8890/sparql/")
>>> sparql.setQuery("""
select distinct ?name where {
 ?abba <http://xmlns.com/foaf/0.1/name> 'ABBA' .
 ?member <http://purl.org/ontology/mo/member_of> ?abba .
 ?member <http://xmlns.com/foaf/0.1/gender> 'female' .
 ?member <http://xmlns.com/foaf/0.1/name> ?name
} LIMIT 5
""")
>>> sparql.setReturnFormat(JSON)
>>> results = sparql.query().convert()
>>> [result['name']['value'] for result in results['results']['bindings']]
[u'Anni-Frid Lyngstad', u'Agnetha F\xe4ltskog']

You now have your own version of MusicBrainz!
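The query above is easy to parameterize for other bands; a small sketch using string.Template (the helper and its names are ours, and the literal matching assumes exact foaf:name values as stored by LinkedBrainz):

```python
from string import Template

MEMBER_QUERY = Template("""
select distinct ?name where {
 ?group <http://xmlns.com/foaf/0.1/name> '$band' .
 ?member <http://purl.org/ontology/mo/member_of> ?group .
 ?member <http://xmlns.com/foaf/0.1/gender> '$gender' .
 ?member <http://xmlns.com/foaf/0.1/name> ?name
} LIMIT $limit
""")


def member_query(band, gender, limit=5):
    """Build the member-lookup SPARQL query for any band and gender."""
    return MEMBER_QUERY.substitute(band=band, gender=gender, limit=limit)


print(member_query("ABBA", "female"))
```

Pass the result to sparql.setQuery() exactly as before; only the literals change.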

Conclusions

Our initial objective was to show you how to set up a local version of MusicBrainz, but the same process can be applied to larger databases like DBpedia or Freebase. Bear in mind, though, that it might take days for a single host to process the entire dump.

