kabum | 5 Apr 12:34 2012

Re: [OSM-dev] Google Summer of Code

On 3 April 2012 20:02, Paul Norman <penorman <at> mac.com> wrote:

The problem with detecting when changesets are closed is that there is no way to determine exactly when they are closed short of an API query. You can fake it by assuming changesets are closed an hour after the last change to them or 24 hours after the first change to them, whichever comes first.
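The heuristic described above could be sketched roughly like this in Python. It is only an illustration: the function name `probably_closed` and its parameters are made up for this example, and it assumes a changeset counts as closed as soon as either of the two timeouts has passed.

```python
from datetime import datetime, timedelta, timezone

IDLE_TIMEOUT = timedelta(hours=1)    # assumed: closed 1 hour after the last change
MAX_LIFETIME = timedelta(hours=24)   # assumed: closed 24 hours after the first change


def probably_closed(first_change, last_change, now):
    """Guess whether a changeset is closed without asking the API."""
    idle_expired = now - last_change >= IDLE_TIMEOUT
    lifetime_expired = now - first_change >= MAX_LIFETIME
    return idle_expired or lifetime_expired
```

The result is a guess only; a changeset that the editor closed explicitly a few minutes after the last upload would still be reported as open for up to an hour.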

<osm version="0.6" generator="OpenStreetMap server">
<changeset id="11187430" user="regedi" uid="645826" created_at="2012-04-05T10:28:21Z" open="true" min_lat="50.0106489" min_lon="36.3515771" max_lat="50.0112144" max_lon="36.3586195">
<tag k="created_by" v="Potlatch 2"/>
<tag k="build" v="2.3-375-g9f05171"/>
<tag k="version" v="2.3"/>

<osm version="0.6" generator="OpenStreetMap server">
<changeset id="11167430" user="bergfrei" uid="327035" created_at="2012-03-31T15:11:30Z" closed_at="2012-03-31T15:16:55Z" open="false" min_lat="47.9912789" min_lon="9.7206276" max_lat="48.0492344" max_lon="9.8521079">
<tag k="comment" v="Hochdorf Ausgleich Luftbildversatz"/>
<tag k="created_by" v="JOSM/1.5 (5047 de)"/>

Or have I missed something?
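The two responses above differ in their `open` attribute, and the closed one also carries `closed_at`. A minimal sketch of reading these fields with Python's standard library follows. The sample is a shortened copy of the second response with closing tags added so that it parses, and the function name `changeset_status` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened example modelled on the second response quoted above,
# with closing tags added so that it is well-formed.
SAMPLE = """<osm version="0.6" generator="OpenStreetMap server">
  <changeset id="11167430" user="bergfrei" uid="327035"
             created_at="2012-03-31T15:11:30Z" closed_at="2012-03-31T15:16:55Z"
             open="false">
    <tag k="created_by" v="JOSM/1.5 (5047 de)"/>
  </changeset>
</osm>"""


def changeset_status(xml_text):
    """Return (id, is_open, closed_at, tags) for the first changeset element."""
    root = ET.fromstring(xml_text)
    cs = root.find("changeset")
    tags = {t.get("k"): t.get("v") for t in cs.findall("tag")}
    # closed_at is None while the changeset is still open
    return cs.get("id"), cs.get("open") == "true", cs.get("closed_at"), tags
```

This still needs one request per changeset, so it does not remove the limitation mentioned in the quoted mail; it only shows what such a query returns.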


It is better to detect problems when they occur, not up to 24 hours after they’ve occurred.

That's correct. A good practice would be to code it as abstractly as possible and so only parse the create/modify/delete sets. The origin of the data (minutely diff, hourly diff, or changeset) would then be ignored.
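As a rough sketch of what such source-independent parsing could look like, assuming the input is osmChange XML with `create`, `modify` and `delete` blocks. The sample data is hand-written for illustration and the function name `iter_changes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written osmChange fragment for illustration only.
SAMPLE = """<osmChange version="0.6">
  <create><node id="1" version="1" changeset="100" lat="50.0" lon="36.3"/></create>
  <modify><way id="2" version="3" changeset="100"><nd ref="1"/></way></modify>
  <delete><node id="3" version="2" changeset="101"/></delete>
</osmChange>"""


def iter_changes(xml_text):
    """Yield (action, element_type, element_id, changeset_id) tuples."""
    root = ET.fromstring(xml_text)
    for block in root:
        if block.tag not in ("create", "modify", "delete"):
            continue  # skip anything that is not one of the three action blocks
        for element in block:
            yield block.tag, element.tag, element.get("id"), element.get("changeset")
```

Because the function only looks at the action blocks, it does not matter whether the XML came from a minutely diff, an hourly diff, or a changeset download.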

I will try to take this into account in my proposal.

Thanks for all of your ideas! It's time to finish my proposal :)




From: kabum [mailto:uu.kabum <at> gmail.com]
Sent: Tuesday, April 03, 2012 2:20 AM
To: Derick Rethans
Cc: OpenStreetMap dev list

Subject: Re: [OSM-dev] Google Summer of Code




On 2 April 2012 22:20, Paul Norman <penorman <at> mac.com> wrote:

A tool that operates on the changeset level is https://github.com/pnorman/osm-weirdness

It detects changesets that have a high probability of being an import or mechanical edit. The detection is pretty crude but it does find a fair number of undocumented imports, mechanical edits, and other weirdness. If you point it at an old state.txt file it will start in the past and work up to the present.


I'll have a look at your script later today.


When working with the minutely diffs there are some limitations:

Limited knowledge of changesets. In practice, if you start your detection an hour in the past you can have a list of all open changesets, but it is not possible to know the tags of the changesets.

No knowledge of the previous state of objects. You know where deleted objects were, but you can’t tell how far an object was moved or what its tags were before. To tell this you need to query a service with a full history DB, and handling full history files is difficult.

No knowledge of way geometry if using existing nodes. Iandees’ https://github.com/pnorman/osm-weirdness/tree/way_check solves this by fetching from jxapi the nodes of a way that aren’t also in the changeset, so it can then detect bad geometry (e.g. ways that trace over themselves).
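One part of the "ways that trace over themselves" check can be approximated from the node references alone, without any coordinates. The following is a hypothetical sketch, not the way_check implementation: it only reports node pairs that a way traverses more than once and cannot see overlaps between segments that use different nodes.

```python
def repeated_segments(node_refs):
    """Return segments (as sorted node-id pairs) that a way traverses more than once."""
    seen = set()
    repeated = []
    for a, b in zip(node_refs, node_refs[1:]):
        segment = (min(a, b), max(a, b))  # ignore direction
        if segment in seen and segment not in repeated:
            repeated.append(segment)
        seen.add(segment)
    return repeated
```

A closed ring such as `[1, 2, 3, 4, 1]` yields no repeated segments, while a way that runs out and back along the same nodes, such as `[1, 2, 3, 2, 1]`, does.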


If you were to code a vandalism detection tool I think it should work on the minutely replication diffs (http://wiki.openstreetmap.org/wiki/Planet.osm/diffs)


I had thought about analysing the data after the changeset is closed, but these diffs also sound good. I will look into this approach :) Thanks!
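A first step for working with the replication diffs would be to read a state.txt file and work out which diff file it refers to. The sketch below assumes that state.txt consists of `key=value` lines with escaped colons in the timestamp, and that diffs are stored under a path built from the zero-padded sequence number. Both are assumptions that should be checked against the wiki page, and the sample values and function names are made up.

```python
# Hypothetical state.txt content; the values are invented for illustration.
SAMPLE_STATE = r"""#Thu Apr 05 10:29:02 UTC 2012
sequenceNumber=1234567
timestamp=2012-04-05T10\:28\:02Z
"""


def parse_state(text):
    """Parse the key=value lines of a replication state.txt into a dict."""
    state = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the comment header
        key, _, value = line.partition("=")
        state[key] = value.replace("\\:", ":")  # assumed: colons are escaped
    return state


def diff_path(sequence_number):
    """Map a sequence number to its assumed diff path, e.g. 001/234/567.osc.gz."""
    padded = f"{sequence_number:09d}"
    return f"{padded[0:3]}/{padded[3:6]}/{padded[6:9]}.osc.gz"
```

Starting from an old sequence number and counting upwards would then allow the tool to replay the past and catch up to the present.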



On 3 April 2012 09:38, Derick Rethans <osm <at> derickrethans.nl> wrote:

On Mon, 2 Apr 2012, kabum wrote:

> Result:
> - each changeset has a total rating -> use a threshold value to divide them
> into suspicious and not suspicious

Instead of just using static thresholds, I think that something like SVM
(http://en.wikipedia.org/wiki/Support_vector_machine) might be highly
beneficial here; and it's another cool technology to play with. There is
a cool library for this (http://www.csie.ntu.edu.tw/~cjlin/libsvm/) and
I know there is at least an extension to use it from PHP:


Thanks for this method ... seems to be very suitable for our use case.


I already have some years of experience with PHP, but I wouldn't prefer it for this part of the project. I was thinking of Python (libsvm has native Python bindings ;))
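To illustrate what learning the threshold instead of fixing it by hand could mean: with a single total rating per changeset and cleanly separable training data, a hard-margin linear SVM reduces to putting the threshold halfway between the two classes. The sketch below shows only this special case in plain Python and does not use libsvm. The training data and function names are hypothetical.

```python
def max_margin_threshold(ratings, labels):
    """Place the threshold midway between the two classes (1-D, separable case)."""
    highest_ok = max(r for r, suspicious in zip(ratings, labels) if not suspicious)
    lowest_bad = min(r for r, suspicious in zip(ratings, labels) if suspicious)
    if highest_ok >= lowest_bad:
        raise ValueError("classes overlap; a soft-margin SVM (e.g. libsvm) is needed")
    return (highest_ok + lowest_bad) / 2


def is_suspicious(rating, threshold):
    """Classify a changeset by comparing its total rating with the threshold."""
    return rating > threshold


# Hypothetical training data: total ratings of hand-labelled changesets.
ratings = [2, 5, 8, 20, 25, 40]
labels = [False, False, False, True, True, True]
threshold = max_margin_threshold(ratings, labels)
```

With several features per changeset, or with overlapping classes, this shortcut no longer applies and a real SVM library would be the better choice.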




> Some questions came up within this preparation:
> - Is there a preferred language? Does this have to be specified within the
> proposal? (language skill has to be rated, so I would decide this during
> the project phase)

Not really any preferred language. What did you have in mind? For the
front end I was thinking PHP, but the engine, I wouldn't know. I think
something highly performant (so C or C++) might be beneficial.


My thinking was that Python is easy to set up, and the tool can easily be called from a terminal or included in other Python scripts (e.g. a web frontend).


If C++ is necessary because of its speed, then I think I could master this. In the past semester I took part in a practical software engineering course at university (in a team of five fellow students), where we made extensive use of C++ (https://github.com/brainafk/Empire).


> - I would also like to discuss the libraries and frameworks to use during the
> project phase, or should I also decide this in my proposal?
> - Should the frontend be integrated into the current website (Ruby on Rails
> project) or should this just be an optional feature?

I think it can easily live as its own website.


Ok :)


> - How detailed should the proposal be? Is it enough to formulate this draft?

That's a tricky one, the more information you provide the better I
think, as it shows you have thought about it :-)


I think it is growing a lot through this discussion, and I will try to be as detailed as possible. :)


Thanks for the response :)



