MongoDB as RDF triple store based on Node.js

This is a short summary of my bachelor thesis, which was written in German. Maybe the results are of interest to someone else.

About

The topic of the thesis was “Implementing an RDF storage solution for MongoDB”. In addition to the implementation, the performance of the resulting API was compared with Virtuoso Open-Source.

In short, the main question of the thesis is: “Could MongoDB be an alternative to Virtuoso as a triple store?”

Implementation

There was prior work on this topic, such as this or that, but no scientific evaluation of the systems. You can find the API written for Node.js here. It mainly implements an insert method and some query methods. For details, take a look at the README, which also explains the URI transformation methods that avoid dots in the key identifiers of the JSON objects.
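To illustrate why such a transformation is needed: MongoDB uses dots in field names to address embedded fields (dot notation), so a URI cannot be used as a key without some preparation. The following is a minimal sketch of two possible strategies, not the code of the API itself. The names escapeKey, unescapeKey and createDictionary are hypothetical, and whether the nested and dict transformations work like this is an assumption; the README describes the real ones.

```javascript
// Hypothetical sketch, not taken from the thesis code.

// Strategy 1: reversibly escape the characters that are problematic in keys.
// '%' is escaped as well so that decoding stays unambiguous.
const ESCAPES = { '%': '%25', '.': '%2E', '$': '%24' };

function escapeKey(uri) {
  return uri.replace(/[%.$]/g, (ch) => ESCAPES[ch]);
}

function unescapeKey(key) {
  return key.replace(/%(25|2E|24)/g, (match) => decodeURIComponent(match));
}

// Strategy 2: replace every URI by a short generated key and keep a
// two-way lookup table (assumed here to live in memory).
function createDictionary() {
  const uriToKey = new Map();
  const keyToUri = new Map();
  return {
    keyFor(uri) {
      if (!uriToKey.has(uri)) {
        const key = 'k' + uriToKey.size;
        uriToKey.set(uri, key);
        keyToUri.set(key, uri);
      }
      return uriToKey.get(uri);
    },
    uriFor(key) {
      return keyToUri.get(key);
    },
  };
}

// Usage: build a document whose key is derived from a predicate URI.
const predicate = 'http://xmlns.com/foaf/0.1/name';
const escapedDoc = { [escapeKey(predicate)]: 'Alice' };
const dictionary = createDictionary();
const dictDoc = { [dictionary.keyFor(predicate)]: 'Alice' };
console.log(Object.keys(escapedDoc), Object.keys(dictDoc));
```

The first strategy keeps every document self-describing, while the second produces shorter keys but needs an extra lookup to get back to the original URI.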

Evaluation

A short explanation of the systems and the data sets.

systems

nested (a.k.a. flat): MongoDB implementation with the nested URI transformation

dict: MongoDB implementation with the dict URI transformation

virtNative: Virtuoso via ODBC interface

virtHttp: Virtuoso via HTTP

data sets

abbr.   subjects   triples
2k          2056     10698
5k          5464     29446
10k         9830     59431
20k        19583    110431
40k        41901    238965

Insert

Insert results

data set   nested      dict        virtNative   virtHttp
2k           1.609s     10.031s     0.512s       1.415s
5k           7.811s     30.522s     1.300s       3.163s
10k         23.148s     83.707s     2.697s       6.368s
20k        176.297s    251.442s     5.674s      13.605s
40k        733.773s    998.763s    15.153s      42.731s

As you can see, both MongoDB implementations are extremely slow at inserting the triples. For the 40k data set, nested takes more than 12 minutes and dict more than 16 minutes, compared with about 15 seconds for Virtuoso via ODBC. It is hard to say what the reasons for these results are. Because the article mentioned above presents similar results, we can only guess that MongoDB itself is the reason and not the implementations. Maybe it has something to do with updating the indices.
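To put the gap into numbers, each insert time can be divided by the virtNative time for the same data set. The small script below does exactly that. The timings are copied from the insert table above; the names insertSeconds and slowdown are only illustrative.

```javascript
// Insert times in seconds, copied from the insert results table.
const insertSeconds = {
  '2k':  { nested: 1.609,   dict: 10.031,  virtNative: 0.512,  virtHttp: 1.415 },
  '5k':  { nested: 7.811,   dict: 30.522,  virtNative: 1.300,  virtHttp: 3.163 },
  '10k': { nested: 23.148,  dict: 83.707,  virtNative: 2.697,  virtHttp: 6.368 },
  '20k': { nested: 176.297, dict: 251.442, virtNative: 5.674,  virtHttp: 13.605 },
  '40k': { nested: 733.773, dict: 998.763, virtNative: 15.153, virtHttp: 42.731 },
};

// Slowdown factor of every system relative to a baseline system.
function slowdown(times, baseline) {
  const result = {};
  for (const [dataSet, row] of Object.entries(times)) {
    result[dataSet] = {};
    for (const [system, seconds] of Object.entries(row)) {
      result[dataSet][system] = seconds / row[baseline];
    }
  }
  return result;
}

const factors = slowdown(insertSeconds, 'virtNative');
for (const [dataSet, row] of Object.entries(factors)) {
  const line = Object.entries(row)
    .map(([system, factor]) => `${system}: ${factor.toFixed(1)}x`)
    .join(', ');
  console.log(`${dataSet}  ${line}`);
}
```

For the 40k data set this gives a factor of roughly 48 for nested and roughly 66 for dict, while Virtuoso via HTTP stays below a factor of 3.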

Query

Query results

data set   nested     dict       virtNative   virtHttp
2k         0.0115s    0.1095s    0.0025s      0.0325s
5k         0.0120s    0.2485s    0.0030s      0.0325s
10k        0.0120s    0.4680s    0.0030s      0.0320s
20k        0.0110s    0.8550s    0.0030s      0.0320s
40k        0.0100s    2.0470s    0.0010s      0.0220s

The results for findBySubject(), probably the most used query method, are considerably better than the insert results. While Virtuoso via ODBC is the fastest system, the nested MongoDB implementation is faster than Virtuoso via HTTP. The dict MongoDB implementation is consistently the slowest system.

For the results of the other query methods, please take a look at the thesis.

Conclusion

  1. The dict implementation is consistently the slowest one.
  2. The nested implementation is faster than Virtuoso via HTTP in four of five query methods.
  3. Virtuoso via ODBC is consistently the fastest system.

Overall: With these results in mind, we can conclude that MongoDB as a triple store is only useful for small to mid-sized data sets. Furthermore, it is advisable to use it only in query-oriented environments because of its poor performance when inserting triples.

