MongoDB as RDF triple store based on Node.js
This is a short summary of my bachelor thesis, which was written in German, for anyone who is interested in the results.
About
The topic of the thesis was “Implementing a RDF storage solution for MongoDB”. In addition to the implementation, the thesis compares the performance of the resulting API with Virtuoso Open-Source.
In short, the main question is: “Could MongoDB be an alternative to Virtuoso as a triple store?”
Implementation
There was prior work on this topic, such as this or that, but no scientific evaluation of the systems. You can find the written API for Node.js here. It mainly implements an insert method and some query methods. For details, take a look at the README. It also explains the URI transformation methods, which avoid dots in the key identifiers of the JSON objects.
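To illustrate why such a transformation is needed: MongoDB treats a dot in a field name as a path separator, so a URI such as `http://xmlns.com/foaf/0.1/name` cannot simply be used as a key. The following is a minimal sketch of one possible escaping scheme. It is not the nested or dict transformation from the thesis (those are described in the README); the helper names `encodeKey` and `decodeKey` and the placeholder character are assumptions made for this example only.

```javascript
// Hypothetical helpers for illustration, NOT the API from the thesis.
// Assumption: the fullwidth full stop (U+FF0E) never occurs in the URIs.
const DOT_PLACEHOLDER = "\uff0e";

// Replace every dot so the URI can be used as a field name.
function encodeKey(uri) {
  return uri.split(".").join(DOT_PLACEHOLDER);
}

// Restore the original URI from an escaped field name.
function decodeKey(key) {
  return key.split(DOT_PLACEHOLDER).join(".");
}

const predicate = "http://xmlns.com/foaf/0.1/name";

// A document that uses the escaped predicate URI as its key.
const doc = { [encodeKey(predicate)]: "Alice" };

console.log(Object.keys(doc)[0]);            // escaped key without any dots
console.log(decodeKey(Object.keys(doc)[0])); // the original predicate URI
```

The round trip only works if the placeholder cannot appear in the input, which is why a real implementation has to choose its transformation carefully.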
Evaluation
A short explanation of the systems and the data sets.
Systems
- nested (a.k.a. flat): MongoDB implementation with the nested URI transformation
- dict: MongoDB implementation with the dict URI transformation
- virtNative: Virtuoso via the ODBC interface
- virtHttp: Virtuoso via HTTP
Data sets
abbr. | subjects | triples |
---|---|---|
2k | 2056 | 10698 |
5k | 5464 | 29446 |
10k | 9830 | 59431 |
20k | 19583 | 110431 |
40k | 41901 | 238965 |
Insert
data set | nested | dict | virtNative | virtHttp |
---|---|---|---|---|
2k | 1.609s | 10.031s | 0.512s | 1.415s |
5k | 7.811s | 30.522s | 1.300s | 3.163s |
10k | 23.148s | 83.707s | 2.697s | 6.368s |
20k | 176.297s | 251.442s | 5.674s | 13.605s |
40k | 733.773s | 998.763s | 15.153s | 42.731s |
As you can see, both MongoDB implementations are extremely slow at inserting triples. For the 40k data set, nested needs about 12 minutes and dict more than 16 minutes, while Virtuoso via ODBC finishes in about 15 seconds. It is hard to say what causes these results. Because the article mentioned above presents similar results, we can only guess that MongoDB itself is the reason and not the implementations. Maybe it has something to do with updating the indices.
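To make the insert numbers easier to compare across data sets, they can be converted into throughput (triples per second). The small script below does this using only the triple counts and insert times from the tables in this post; the function name `throughput` is just a helper chosen for this sketch.

```javascript
// Triple counts per data set, copied from the data set table.
const triples = { "2k": 10698, "5k": 29446, "10k": 59431, "20k": 110431, "40k": 238965 };

// Insert times in seconds, copied from the insert table.
const insertSeconds = {
  nested:     { "2k": 1.609,  "5k": 7.811,  "10k": 23.148, "20k": 176.297, "40k": 733.773 },
  dict:       { "2k": 10.031, "5k": 30.522, "10k": 83.707, "20k": 251.442, "40k": 998.763 },
  virtNative: { "2k": 0.512,  "5k": 1.300,  "10k": 2.697,  "20k": 5.674,   "40k": 15.153 },
  virtHttp:   { "2k": 1.415,  "5k": 3.163,  "10k": 6.368,  "20k": 13.605,  "40k": 42.731 },
};

// Inserted triples per second for one system and one data set.
function throughput(system, dataSet) {
  return triples[dataSet] / insertSeconds[system][dataSet];
}

// Print one line per system with the rounded throughput for every data set.
for (const system of Object.keys(insertSeconds)) {
  const cells = Object.keys(triples).map(
    (dataSet) => `${dataSet}: ${Math.round(throughput(system, dataSet))}`
  );
  console.log(`${system.padEnd(10)} ${cells.join("  ")} (triples/s)`);
}
```

For the 40k data set this gives roughly 326 triples/s for nested and roughly 15,770 triples/s for virtNative. The nested throughput also drops considerably between the 2k and the 40k data set, which suggests that the MongoDB insert time grows faster than linearly with the number of triples.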
Query
data set | nested | dict | virtNative | virtHttp |
---|---|---|---|---|
2k | 0.0115s | 0.1095s | 0.0025s | 0.0325s |
5k | 0.0120s | 0.2485s | 0.0030s | 0.0325s |
10k | 0.0120s | 0.4680s | 0.0030s | 0.0320s |
20k | 0.0110s | 0.8550s | 0.0030s | 0.0320s |
40k | 0.0100s | 2.0470s | 0.0010s | 0.0220s |
The results for findBySubject(), probably the most frequently used query method, are considerably better than the insert results. Virtuoso via ODBC is the fastest system, but the nested MongoDB implementation is faster than Virtuoso via HTTP. The dict implementation is consistently the slowest system, and it is the only one whose query time grows noticeably with the size of the data set.
For the results of the other query methods, please take a look at the thesis.
Conclusion
- The dict implementation is consistently the slowest one.
- The nested implementation is faster than Virtuoso via HTTP in four of five query methods.
- Virtuoso via ODBC is consistently the fastest system.
Overall: With these results in mind, we can conclude that MongoDB as a triple store is only useful for small to mid-sized data sets. Furthermore, it is advisable to use it only in query-oriented environments because of its poor performance when inserting triples.