Semantic Desktop and KDE 4 – State and Plans of Nepomuk-KDE [Update]

Nepomuk-KDE is the basis for the semantic technologies we will see in KDE 4. Sebastian Trüg, the main developer behind Nepomuk-KDE, provided me with some up-to-date information about the current state and future plans.

The Semantic Desktop describes the idea that users will be able not only to search existing information, but also to search for the meaning of that information and the relations within it. The Nepomuk project creates open standards and APIs around this idea.
And Nepomuk-KDE is the implementation of these standards for KDE.

Nepomuk-KDE: Basics

Technically, Nepomuk-KDE mainly uses RDF/S for storing the aggregated data. RDF/S is the standard for storing meta data for the Semantic Web and is therefore also used as the standard in the Semantic Desktop.
The current implementation of Nepomuk-KDE contains an implementation of an RDF repository which stores all the data. According to Sebastian the data can be accessed via D-Bus, which is the default way to communicate in Nepomuk-KDE (of course). But there are also other ways which might be more convenient for KDE developers:

For a KDE developer, however, it is much simpler to use the knepomuk library which provides convenience wrapper classes to access the repository.
Additionally there is the KMetaData library, yet another wrapper library, which provides easy access to the metadata through a resource-centric view. This is what is supposed to be used for implementing things like tagging or rating in applications.

The definitions (aka ontologies) of how data like tags, comments and so on should be stored in the repository can be found in kmetadata/ontologies in KDE’s svn or in the directory $KDEDIR/share/apps/knepomuk/ontologies if you install kmetadata on your hard disk.
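To make the RDF idea a bit more concrete, here is a minimal sketch of how a tag, a rating and a comment could be expressed as subject/predicate/object triples. The namespace and the property names (`hasTag`, `rating`, `comment`) are invented for illustration and are not taken from the actual Nepomuk-KDE ontologies; the file URI is hypothetical as well.

```python
# Hypothetical namespace and property names, for illustration only.
# The real definitions live in kmetadata/ontologies.
EX = "http://example.org/ontology#"


def describe(resource, tags=(), rating=None, comment=None):
    """Return a list of (subject, predicate, object) triples for one resource."""
    triples = [(resource, EX + "hasTag", tag) for tag in tags]
    if rating is not None:
        triples.append((resource, EX + "rating", rating))
    if comment is not None:
        triples.append((resource, EX + "comment", comment))
    return triples


def objects(triples, subject, predicate):
    """Return all objects stored for a given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


store = describe(
    "file:///home/user/song.ogg",
    tags=["KDE4"],
    rating=4,
    comment="Nice song for a desktop environment demo",
)
print(objects(store, "file:///home/user/song.ogg", EX + "hasTag"))
```

The point of the triple form is that new kinds of metadata only need a new predicate, not a new table or file format.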

Last but not least, if you integrate Nepomuk-KDE with an application, KMetaData can help by generating code that hides all “nasty meta data type and property named and type conversion”. See the tutorial KMetaData First Steps at techbase. Also, have a look at the KMetaData apidox.

Besides these entry links and the homepage itself, the best way to get into the development is, of course, to subscribe to the Nepomuk-KDE e-mail list. Btw., one of the current topics is a possible new, catchier name for Nepomuk-KDE 🙂

Nepomuk-KDE: Current State

So much about the basics and development stuff – now to the sparkling bits:
In the current implementation Nepomuk-KDE enables the user to store additional meta data in the form of tags, comments or ratings. See Nepomuk-KDE in action within Dolphin:

Nepomuk-KDE - Dolphin integration with rating

The music file is rated, has a comment and also a tag below the comment field. And this can be done not only with music files but with all kinds of files:

Nepomuk-KDE - Dolphin integration with txt file

These comments and tags can be searched of course:

Nepomuk-KDE - search after environment in comment

Nepomuk-KDE - search after KDE 4 in tags

In the first image everything is searched for the term “environment” – and a file with a comment containing this string is listed. In the second search example the string “KDE4” is searched for, and the result shows files which are tagged “KDE4”.

So, finally, tagging has also reached KDE 4 – I must admit that I’m very pleased with this result because I’ve waited for tag support in KDE for much too long.

Nepomuk-KDE: Future

At the moment Sebastian is working on a backend for strigi. The final goal is to share the data backend so that strigi uses RDF as well. This would create one single data pool for all meta data on your machine.

The next steps are to integrate Nepomuk-KDE further with the applications of the desktop: There is a Google Summer of Code project to replace digikam’s rating and tagging system with the one from Nepomuk-KDE. Amarok already has exactly the features that Nepomuk-KDE supports for all files (rating, tagging, comments), so it only makes sense to merge the data as well. And of course all PIM applications contain a huge amount of data which should be analyzed semantically: think of displaying not only an e-mail by a contact but also all related e-mails by other contacts and also all related files which were sent and received.
You can extend this list of applications with any program which needs or wants to store any kind of additional information to used or modified files.

Besides integrating Nepomuk-KDE into other applications there is also work ongoing to bring other meta data to Nepomuk-KDE: while at the moment tags, rating and comments are supported, there will be many more types in the future. As already mentioned above with the PIM example, data can be grouped around discussions (which e-mail is a forward of or reply to which) but also around origin (where does this file come from and where has it been before).
Also, think of an image viewer which does not only display the given image but also all images taken at similar times or with similar people on it (digikam supports this already with person tags) or even taken at similar places (geo tagging or identifying names like barcelona07.jpg).
And if you really dare to have a look at a possible but still far away future: IBM, a Nepomuk partner, has text analysis tools which could be used to analyze the content of, for example, mails to get a better understanding of what the e-mail is really about.

And there are other things which have to be done as well: Data export, cooperation with other desktop environments, etc. For example, the Nepomuk project itself plans to create a P2P based solution for sharing files together with their meta data, and at some point in the future Nepomuk-KDE should implement this part of the standard as well.

As you see there are lots of things to do, and there is room for almost every kind of participation. Just send an e-mail to the developers list. You can also simply leave a comment on this post if you are interested; the developers will keep an eye on the comments.

Update
Sebastian has posted a FAQ about Nepomuk-KDE, featuring also the question about x-attributes.

52 thoughts on “Semantic Desktop and KDE 4 – State and Plans of Nepomuk-KDE [Update]”

  1. Hey, I’m submitting a link to this on the dot… 🙂 It’s already on osnews 🙂

  2. Forgive my potentially stupid questions: I’m a bit skeptical about all this “meta” stuff, perhaps because of the way I’ve experienced it with Amarok. What happens if you move / rename a file? Or if you convert from PNG to JPEG? Is all the meta-data lost?
    If the system makes an effort to “track” the movement of files, how does this impact the performance of, say, moving 1000 photos? And what happens if you skip the GUI and move them with mv?
    Also, where is this meta-data being stored? In ~/.kde? Or is it embedded in each file somehow? MP3s and JPGs each have their own species of “tags”, but what about other kind of files?
    And finally, what would be necessary to transfer files from one computer to another without losing the meta-data?

  3. Good questions Constantin, I too want to hear the answer, I use mv quite a bit.

    I hope they eventually create a Filesystem that supports this data in the file itself.

  4. @Constantin: Moving and copying of files is indeed a complicated area. I asked Sebastian and he told me that they are implementing it these days – this also means that there are no benchmarks yet.
    But in the end it would only be copying or moving entries in a database which is not a big deal, given that there are many well performing databases around.

    About the data storage itself: if it is combined with strigi it is very likely that it will find its way into ~/.kde, but there might also be a Wasabi solution in the long run, I think.
    That would not be an “embedded” solution like xattributes (although xattributes would make the moving and copying easier).

    The question about tags and non-mp3/jpeg files: you can attach any kind of tag you wish. Very similar to the tag system in digikam (which is not standardized with other apps I think).

    But about the last question, the transfer of such information to other computers: this is not solved at the moment (in general all search machines have this problem) afaik. The Nepomuk project has the goal to also provide such mechanisms, but mainly by p2p solutions. Don’t know if the strigi/Nepomuk-KDE guys will come up with another solution.

    You also might want to ask the Nepomuk-KDE developers themselves, I am just “a voice” 😉

  5. about meta data ….
    I think the only way to easily use metadata is to make it easy. This means to store the metadata with the file. This also means that using metadata is just a matter of reading and writing meta data. The only problem may be that people will tend to gather identical files with different metadata. That could be solved by putting an MD5sum of the file in the metadata and allowing for the files to be merged (that is of course the metadata to be merged and one of the two identical files to be deleted).
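The checksum idea from this comment could look roughly like the sketch below: files are grouped by the MD5 sum of their content, and the tags of identical files are merged. The function names and the `path -> set of tags` layout are assumptions made for the example, not anything Nepomuk-KDE provides.

```python
import hashlib
from collections import defaultdict


def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def merge_by_content(metadata):
    """Group files by checksum and merge their tags.

    metadata maps path -> set of tags (a hypothetical layout).
    Returns checksum -> (list of paths, merged set of tags).
    """
    merged = defaultdict(lambda: ([], set()))
    for path, tags in metadata.items():
        paths, all_tags = merged[md5sum(path)]
        paths.append(path)
        all_tags.update(tags)
    return dict(merged)
```

The sketch only reports which files are identical and what the merged tag set would be; deleting one of the duplicates is deliberately left to the caller.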

  6. So there will be a standard widget or dialog to rate a file and add comments and tags, I guess. Will this for example be integrated in the ‘save as…’ dialogs too? and in the open dialogs?

    What also would be interesting is that you don’t have to type tags each time, but that you can select from a list, with the most used tags on top. And maybe your top 5 tags immediately clickable, and your last 5 tags also immediately clickable.

  7. In my opinion, the “right” metadata system should come from the filesystem. as the fs stores the position of the files, it should also be able to store metadata in a part of the fs :
    – unified way of accessing metadata
    – embedded in the core system makes it faster
    – allows file operations to keep metadata
    But this implies to have more complex filesystem tools and merge the concept of fs and database.
    If i remember well, this was supposed to be WinFS, the next generation NTFS.
    On top of that, i have a dream that one day i could search my filesystem through command line by typing something like : search * from all folders where author like “%me%”
    (oh! that sounds sqlish! :))
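The “sqlish” query above maps quite directly onto a real database. As a small sketch, here is the same search against an in-memory SQLite table; the table layout (`path`, `key`, `value`) and the sample rows are made up for the example and say nothing about how Nepomuk-KDE stores its data.

```python
import sqlite3

# In-memory database with a hypothetical one-table metadata layout.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE metadata (path TEXT, key TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO metadata VALUES (?, ?, ?)",
    [
        ("/home/me/report.odt", "author", "me"),
        ("/home/me/photo.jpg", "author", "alice"),
        ("/home/me/notes.txt", "author", "me and a friend"),
    ],
)


def search(connection, key, pattern):
    """Return the paths whose metadata value for `key` matches a LIKE pattern."""
    rows = connection.execute(
        "SELECT path FROM metadata WHERE key = ? AND value LIKE ? ORDER BY path",
        (key, pattern),
    )
    return [path for (path,) in rows]


print(search(conn, "author", "%me%"))
```

A command-line tool would only need to wrap `search` in a small argument parser.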

  8. reginald: Saving meta data with the file itself is only possible in two ways: via xattributes which are not supported by every file system, or via introducing a new container file type.
    The second is totally out of question since all files would be unusable on all non-KDE machines. The first one depends maybe too much on the file system.

    Fred: of course, a perfect file system would deal with all that – but there is no perfect file system like that at the moment as you say yourself. And honestly, I haven’t seen any file system project which got larger attention which would be capable of such things.

  9. Well, I think I know how to solve a problem with losing metadata or saving metadata for files which don’t support metadata in their container. It’s easy – just use a file, say “.metadata.db” (like “Thumbnails.db” in Windows), to save a copy of metadata for every file in current directory that have metadata. For example:
    You have a directory named “Folder” 😉 and 2 files inside: music.ogg and movie.avi without any metadata. When you rate your movie.avi, Nepomuk should create “.metadata.db” and add a copy of your ratings to it. When you add a comment for music.ogg, Nepomuk will add a comment to Ogg Tag, to Nepomuk database and to previously created “.metadata.db”.
    And now there are two cases:

    Case 1: When you copy music.ogg to a different directory, say to your pendrive, Nepomuk will create another “.metadata.db” in the target directory with the music.ogg metadata. When you connect your pendrive to another KDE box, the Nepomuk there will be able to retrieve the music.ogg metadata from this “.metadata.db”. Easy, isn’t it?

    Case 2: You convert music.ogg to music.wav. Nepomuk should notice that there is a new file with the same name (except for the extension) as music.ogg but without any metadata. Then, if possible, it should check some additional attribute (like the duration of music.wav) to make sure this isn’t a false positive, and simply copy the music.ogg metadata to music.wav. It could work similarly for the “mv” case mentioned by Constantin.
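Case 1 of this proposal can be sketched in a few lines. The file name “.metadata.db” comes from the comment; storing it as JSON, and the helper names `annotate` and `copy_with_metadata`, are assumptions made only for this illustration.

```python
import json
import os
import shutil

SIDECAR = ".metadata.db"  # name from the comment; the JSON format is an assumption


def load_sidecar(directory):
    """Read the per-directory metadata file, or return an empty dict."""
    path = os.path.join(directory, SIDECAR)
    if not os.path.exists(path):
        return {}
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def save_sidecar(directory, data):
    """Write the per-directory metadata file."""
    with open(os.path.join(directory, SIDECAR), "w", encoding="utf-8") as handle:
        json.dump(data, handle, indent=2)


def annotate(path, **metadata):
    """Add or update metadata for one file in its directory's sidecar."""
    directory, name = os.path.split(os.path.abspath(path))
    data = load_sidecar(directory)
    data.setdefault(name, {}).update(metadata)
    save_sidecar(directory, data)


def copy_with_metadata(src, target_dir):
    """Copy a file and carry its sidecar entry over to the target directory."""
    src_dir, name = os.path.split(os.path.abspath(src))
    shutil.copy2(src, target_dir)
    entry = load_sidecar(src_dir).get(name)
    if entry:
        data = load_sidecar(target_dir)
        data[name] = entry
        save_sidecar(target_dir, data)
```

Note that this only works when the copy goes through `copy_with_metadata`; a plain `cp` or `mv` would leave the sidecar behind, which is exactly the weakness discussed in the following comments.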

  10. Ext2, ext3, ReiserFS and XFS all do support extended attributes – that was what I meant by xattributes.

    BeCe: about your idea with directory files – how would that be different from a database? The only difference would be that you would be able to copy the information given that the directory is copied. However, a database might be faster than directory files.

  11. When “describing” a file (i.e. tagging, rating, whatever other predicate), one can consider using either filename and/or the checksum of the file. The checksum addresses the case of rename. Doesn’t help if one converts file formats and renames to something totally different, but could handle the case of only the extension being changed by pointing out to the user any similarly-named files.

  12. user xattrs seem to me the architecturally right spot to save this data, plus it solves a whole set of issues and enables /usr/bin/find-like commands to operate on the metadata. and how hard is it to make a metadata storage adapter that detects xattr support and switches to DB-based storage when they’re not available.

    *Of course* metadata will need to be stored in a DB anyway, because the project needs fast search capabilities, *but* the authoritative point of reference and final storage should be xattrs if they are present.
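A storage adapter of the kind described here is indeed not much code. The sketch below tries user xattrs first and falls back to a second store when they are unavailable; the class name is invented, and a plain dict stands in for the database. `os.setxattr` and `os.getxattr` exist only on Linux, so the missing-attribute case is treated like an unsupported file system.

```python
import os


class MetadataStore:
    """Prefer user xattrs; fall back to a dict standing in for a database."""

    def __init__(self):
        # (absolute path, key) -> value; a real system would use a database here.
        self.fallback = {}

    def set(self, path, key, value):
        try:
            os.setxattr(path, "user." + key, value.encode("utf-8"))
        except (AttributeError, OSError):
            # No xattr support on this platform or file system.
            self.fallback[(os.path.abspath(path), key)] = value

    def get(self, path, key):
        try:
            return os.getxattr(path, "user." + key).decode("utf-8")
        except (AttributeError, OSError):
            # Attribute missing or xattrs unsupported: consult the fallback.
            return self.fallback.get((os.path.abspath(path), key))
```

This covers only the switching logic. The indexed database that the comment asks for on top of the xattrs, for fast searching, would be a separate cache and is not shown.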

  13. IMHO, it would be foolish to *not* use xattrs for the storage of metadata. I already have Digikam storing metadata within EXIF and IPTC tags rather than only in its own database, because I’ve had problems with the Digikam database between versions, requiring messy restoration of older backups. I have lost my Amarok song ratings due to version upgrades at least twice, and am currently unable to make new ratings because of a strange problem. Frankly, I’m tired of re-rating my files!

    The ratings should be cached in a well-indexed database, and perhaps inotify/dnotify/whatever could be used to keep it up-to-date, but the metadata should *always* be stored in xattrs as the final, canonical reference. I *really* don’t want to lose *all* my metadata for *all* my files because one file or directory got corrupted, or a library upgrade incompatibility caused it to become unusable, etc. It’s a big pain to figure out from how far back I must restore my metadata to recover it, without losing too much of the newer data.

    Really, xattrs have been around for long enough, and are on enough filesystems. I think this is one case where it would be foolish to handicap the system by designing for the past, supporting metadata on non-xattr-supporting filesystems.

    –2c

  14. I agree too – storing the metadata in files directly makes absolute sense.
    The problem might be fast access, and that not EVERY filesystem supports it. So the obvious solution is doing both (as mentioned): a DB for daily reference, which is also the only solution if xattrs are not available, and storing the same metadata with the files for portability and safety reasons. Of course, consistency must be ensured.

  15. @Debian-user
    “I already have Digikam storing metadata within EXIF and IPTC tags rather than only in its own database, because I’ve had problems with the Digikam database between versions, requiring messy restoration of older backups. I have lost my Amarok song ratings due to version upgrades at least twice, and am currently unable to make new ratings because of a strange problem. Frankly, I’m tired of re-rating my files!”
    with one metadata system for all applications we will have a much better, well tested metadata system. So in KDE4.1 there won’t be many problems ]:->

  16. By using xattr instead you have a damn big advantage: you can use the shell, mv, graphical interfaces or whatever to move files in your system while keeping the metadata consistent for every file, for free 😉

  17. What happens with xattr data when copying files to CDs (ISO, Rockridge, UDF) or Windows-Filesystems, VFAT on MP3-Players, etc.?

    On AmigaOs you had a “.info” file for each file that did contain the metadata. This is not very pretty, but very pragmatic:

    my.jpg
    my.jpg.info

    (But ok, what happens with filesystems that only allow one dot in the name?)

  18. Thanks for the info! However, please keep in mind that I’m only the voice, not a developer. Although the developers keep an eye on this discussion you might want to post this directly to the Nepomuk-KDE developers or even to the Nepomuk lists directly since there you will have the ability to address them directly.

  19. Some ideas about saving metadata. In our semap project we decided to store it in the database directly. But we provide an additional solution for traditional applications to notify semap about updated resources. This is very important for us, because we plan to use it under Windows too. In the database we use the RDF model, and anyone may export the data and back it up.

  20. This isn’t helping much unless there are aids and prods to putting in useful info about every file as it is created.
    Not often would anybody go to the effort of typing: “This is a nice song…” So, based on file type, a number of likely “tags” might be proposed with check boxes next to each.
    It occurs to me that John from libraryclips.blogsome.com might have some good ideas and I suggest contacting him.

  21. fuxam, it seldom makes sense to tag a file just created because usually the file and therefore also the meaning evolves over time – and over time tags should be entered.

    However, I think you don’t really get the idea of Nepomuk: manual tagging is just the first step. Music files are already tagged via Amarok in KDE, and that can be done automatically (based on how often you play a song, etc.). Photos are also already tagged via Digikam – this data is used by Nepomuk, you don’t have to change your workflow in that regard.

    Anyway, in the long term Nepomuk will be able to identify relations between files automatically – and maybe even a subset of sets. Also, in the really long term Nepomuk might be able to extract the meaning of a paper all by itself.

    A last word about your link: I wasn’t able to find any relevant information in regard to this topic at the given link therefore I disabled it.

  22. What about different users on the same machine? Will every user have his own database, independent of the others? I’m not sure that would be a good idea – besides being a waste of resources, every file would have to be retagged per user – and there would be no consistency. My wife and I have access to the same folder storing our photos – I’ve tagged them with digikam and everything works great: I can access them, she can access them and we both can access them from a remote machine – and all the time with the same tags and metadata. I guess that is the behaviour a user would normally expect (not only for photos)…

  23. maybe a bit off topic, but anyway:
    How about treating commentaries in metadata like ‘post-it’-notes?
    I mean, something you can ‘stick’ on a file – maybe one could write a knote and then stick it to a file by dragging it onto it. The availability of a comment in a file’s metadata could be visualized by drawing a small ‘post-it’ sticker in the upper-right corner of the file’s icon…
    I think that would be an appropriate metaphor and I think it would be helpful to see if a file contains comments, without being forced to click on it…

  24. Heiko:
    In the first place, shared tags and comments are a big security concern, so by default they should be disabled!
    However, they should be interchangeable and exportable, I agree with you there, and with central databases this becomes difficult when you talk about shared folders, no question there.

    I think we have to wait how Nepomuk-KDE matures to see how this works and which options we will see.

    About the stick-it notes: remember that notes are in the end just a small sub-set of the abilities of Nepomuk-KDE. Still, I pretty much like the idea, you could post that on the Nepomuk-KDE list 🙂

  25. Extended attributes on ext3, jfs, xfs, reiser, alternate data streams on ntfs and data forks on hfs+. The only problem is how to store metadata on CDs and FAT32 filesystems (FAT32 being a commonly used filesystem for usb drives…). Since the database approach doesn’t handle CDFS and FAT32 either, I really don’t see a reason why the meta-data wouldn’t be stored on the place that file-system creators provided for meta-data storing.

    As Debian-user said, we are all tired of retagging amaroK files because the database was corrupt or lost for some reason. I do agree that the database would be very efficient for caching meta-data and that is what the database should do.

  26. What about the storage overhead? All this metadata is nice, but then we’ll need to create metadata databases a la locate.db, which takes up additional hard drive space, and it will need to be loaded into RAM to be of any real use. Y’know, having too many options is almost as bad as having too few options, one can get lost in the search engine. I recently finished organizing my mp3s, it was a pain to tag each of them properly, and I have over 400 songs in my player.

  27. @Alex: I’m not sure why the database should be loaded into ram?! Also, the additional hard drive space is marginal compared to the amount of data you index.
    And yes, self tagging is hard – for that reason the Nepomuk-KDE system aims at extracting meta data automatically in the long run.

  28. liquidat:
    about security concerns: I agree – there are different use cases.
    the first one I described above may apply when some colleagues work on the same set of files and want to be informed about what is going on -> metainfo should be shared there. IMHO this situation shows the real strength of the whole metadata thing.
    the second case is a worker in a huge company tagging publicly accessible files for his personal use.
    If the metadata is not necessarily shared between users, the whole ‘store-metadata-in-filesystem-approach’ is out of the question. And it gets really nasty to accomplish data consistency for shared metadata.
    Maybe the user has to choose, if his tag is private or public – in the first case it is stored only in his personal database, in the second one it’s stored in the filesystem too.

  29. Does Nepomuk do Cross-Language-Search?

    Cross-language Search: What’s it all about?

    The term “cross-language search” is used in many different senses:

    1. Some search engine providers claim to support multilingual or cross-language search if they can handle and index documents written in different languages. They search for the exact appearance of the entered search terms, e.g. “war” finds English documents referring to military actions and it finds German documents containing “war” in the sense of “was” (i.e. a meaningless glue word).

    2. Other search engines (see, e.g., http://www.google.com/intl/en/press/annc/translate_20070523.html) provide a tool for the translation of a query into a selectable other language, and then, the query is submitted with the translated query text. This is certainly a progress and can be useful in some specific situations, e.g. if one is looking for a hairdresser in Paris.

    Shortcomings:
    – If one is looking for “member of the board” and “SAir Group” (Swissair) and searches for German documents, the translated query “Mitglied des Brettes” and “SAir Gruppe” won’t provide any results. If “member of the board” is replaced by “Aufsichtsrat”, some documents are found, but they do not correspond to the commonly used terms “Verwaltungsrat” or “Verwaltungsräte” in conjunction with the SAir case.
    – For information research and intelligence services the above-mentioned method does not help because it is not able to compare and rank documents written in different languages.

    3. A true cross-language search is possible only if the search engine is able to recognize the thematic content, i.e., if the system realizes that the English translation of a French (or a German etc.) document is equivalent to the original document. This advanced technique is implemented in http://www.infocodex.com. It simultaneously finds documents in all supported languages, without the need for a cumbersome (and arbitrary) translation into each other language. Because of the cross-language content recognition and a well-founded similarity measure, the documents can be ordered by their relevance with respect to the query.

  30. Hi Zeno – nice read. You might also post this to the Nepomuk-KDE developers list to probably reach more people.

    I’m not sure if Nepomuk-KDE at the moment supports cross language search, but it should not be a problem at all: Sonnet, the KDE 4 spell checking system, is capable of detecting languages and could therefore be integrated with Nepomuk-KDE/strigi (because the component of indexing is actually done by strigi).

  31. Hi, very nice read! Could I translate it into Italian and post it on my blog?
    I read the CC license, but I wanted to notify you of my intention first.

  32. Thanks for asking first! And of course you can translate the article, the licence is exactly for such cases.
    I think the easiest is to also publish it under cc, and link back to this page. If that would not work for some reason, ask again for other permissions you need 🙂

  33. Hello everybody, my name is Damion, and I’m glad to join your community,
    and wish to assist as far as possible.
