detection methods

 
dryden



Joined: 06 Sep 2007
Posts: 17

PostPosted: Fri Sep 07, 2007 5:06 am    Post subject: detection methods

It is actually quite possible for a tracker to analyse its torrent statistics and infer which users are modifying their upload figures.

That is, it works if there are discrepancies between the total accounted upload and download figures over a certain set of intervals, and if the cheaters use constant ratio modification factors.

Given a set of N intervals or torrents and a set of P peers, the upload-download discrepancy of each interval yields one equation in a linear system. For every interval,

a(1) x(1) + a(2) x(2) + a(3) x(3) + ... + a(P) x(P) = d

where
a(i) is the fraction of peer i's reported upload that is fake
x(i) is the upload figure reported by peer i
d is the discrepancy between the total upload and total download reported for that interval or torrent.

Now if there is no error, and d is fully attributable to the a's, and if every a(i) is constant, then the solution to this system of equations, which gives the vector a, can be obtained by a simple pivoting algorithm that transforms the matrix [ X d ] into [ I b ]. Given these ideal circumstances, this solution is exact: you can tell exactly what part of each reported upload is fake.

This solution can also be obtained with a least squares method, but that is a bit of overkill in this case.

Now it becomes a lot harder if the ratios are not constant.

Let me show you an example. Suppose we have 7 peers and 10 torrents, with this constant weight vector:

a = [0.33 0.50 0.66 0.20 0.50 0.00 0.00]

This means peer 1 is reporting 33% fake and 67% real upload. Peer 2 is doing it fifty-fifty. Peers 6 and 7 are not cheating at all.

Now suppose we have this matrix of reported upload figures:
Code:

    60     20     10      0      0      0    100    36.40
    50     30      0     80     10      0      0    52.50
    20     10      0     60      0     30    100    23.60
     0      0     40      0      0     30    100    26.40
     0     40      0     40     30      0      0    43.00
    20      0     40      0     30      0    100    48.00
     0      0      0      0     50    200    200    25.00
    10     20     30     40     50     60     70    66.10
    70     60     50     40     30     20     10   109.10
    30     30     30     30     30     30     30    65.70


The last column is the discrepancy that would be created by the fake reports, as given by d = X*a'.

Transforming this matrix gives:
Code:
     1      0      0      0      0      0      0   0.33
     0      1      0      0      0      0      0   0.50
     0      0      1      0      0      0      0   0.66
     0      0      0      1      0      0      0   0.20
     0      0      0      0      1      0      0   0.50
     0      0      0      0      0      1      0   0.00
     0      0      0      0      0      0      1  -0.00
     0      0      0      0      0      0      0   0.00
     0      0      0      0      0      0      0   0.00
     0      0      0      0      0      0      0  -0.00


The first P values of the last column now reflect the weight vector that we injected at the start.
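
For anyone who wants to reproduce this ideal case, here is a minimal Matlab/Octave sketch (my own reconstruction, not a tracker tool): rref performs the pivoting transform described above, and the backslash operator gives the least squares solution; both recover a exactly here.
Code:
X = [ 60 20 10  0  0   0 100;
      50 30  0 80 10   0   0;
      20 10  0 60  0  30 100;
       0  0 40  0  0  30 100;
       0 40  0 40 30   0   0;
      20  0 40  0 30   0 100;
       0  0  0  0 50 200 200;
      10 20 30 40 50  60  70;
      70 60 50 40 30  20  10;
      30 30 30 30 30  30  30 ];
a = [0.33 0.50 0.66 0.20 0.50 0.00 0.00]';
d = X*a;       % per-interval discrepancies (the last column above)
rref([X d])    % the pivoting step: reduces [X d] to [I b; 0 0]
b = X \ d      % least squares solve; gives the same exact answer here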

Suppose we do not use a constant weight vector but a weight matrix such as this:

Code:
 
  0.66   0.50   0.00   0.66   0.20   0.50   0.00
  0.66   0.50   0.00   0.66   0.20   0.50   0.00
  0.66   0.50   0.00   0.66   0.20   0.50   0.00
  0.33   0.00   0.30   0.66   0.20   0.50   0.00
  0.33   0.00   0.00   0.66   0.20   0.50   0.00
  0.33   0.00   0.00   0.66   0.20   0.50   0.00
  0.33   0.00   0.00   0.66   0.20   0.50   0.00
  0.33   0.00   0.00   0.66   0.20   0.50   0.00
  0.33   0.20   0.00   0.66   0.20   0.50   0.00
  0.33   0.20   0.00   0.66   0.20   0.50   0.00


As you can see, peers 1, 2 and 3 have used different weights in the various intervals.

The pivoting algorithm now yields:
Code:
  0.81   0.35   0.38   0.69  -0.30   0.72  -0.10

As you can see, the results are quite off, although the shape of the original weight vector is still somewhat visible.

Matlab's least squares solution gives:
Code:
b = [0.64   0.03  -0.07   0.78  -0.07   0.49   0.08]


But the weighted averages, which are what a tracker would really be interested in, are:
Code:
a = [0.50    0.23    0.06    0.66    0.20    0.50   0]
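
For reference, these upload-weighted averages follow from the weight matrix (call it A) and the upload matrix X; this one-liner reproduces the vector above:
Code:
a = sum(A .* X, 1) ./ sum(X, 1)   % per-peer factor, weighted by upload volume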


So extracting the true mod-factors seems to be a troublesome endeavour. But suppose there is only one peer falsifying his data, and only half of the time. The pivoting algorithm gives:
Code:
b =  0.31   0.13  -0.00  -0.02  -0.15   0.05  -0.01

Least squares gives:
Code:
b = [0.23   -0.05   -0.21    0.04   -0.01   -0.06    0.07]

The weighted averages (not known to the tracker) are:
Code:
a = [0.17    0    0    0    0    0    0]


It appears that this peer runs a real risk of being banned, or at least of being watched.
I don't think trackers have any tools other than statistical analysis at their disposal, but it is quite potent nonetheless. You can gather data over a long period of time, run linear regression on subsets, and perhaps do some kind of meta-analysis on the intermediate results. You can use statistical tests to see whether certain results are likely to be caused by natural variance, or are significant enough to give some certainty to the conclusion that someone is a cheater.
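
To make that last point concrete, here is a minimal sketch of such a test on real (noisy) data. It is my own illustration, and it assumes roughly independent errors of equal variance, which real tracker data will only approximate: compute a t-statistic for each estimated factor and watch the peers whose factor sits far above zero.
Code:
b  = X \ d;                              % least squares factor estimates
r  = d - X*b;                            % residuals
s2 = (r'*r) / (size(X,1) - size(X,2));   % residual variance estimate
se = sqrt(diag(inv(X'*X)) * s2);         % standard error per peer
t  = b ./ se                             % a large positive t is suspicious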

Whereas I have focussed on modification as a percentage of upload amounts, the model could easily be extended to account for downloaded amounts or constant additives as well. You'd just need more data samples to get a solution.
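
A sketch of that extension (the names are mine): with U and D the reported upload and download matrices, and T an indicator matrix with T(i,p) = 1 when peer p was active in interval i, you solve for all three factor sets at once.
Code:
% U, D, T and d are hypothetical inputs as described above
M  = [U, D, T];      % N x 3P design matrix, so you need N >= 3P samples
b  = M \ d;          % least squares solve of the augmented system
au = b(1:P);         % per-peer upload mod-factors
ad = b(P+1:2*P);     % per-peer download mod-factors
c  = b(2*P+1:3*P);   % per-peer constant additives
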
ADSL (l)user



Joined: 16 Mar 2007
Posts: 19

PostPosted: Thu Sep 13, 2007 2:01 pm

Excellent thread.

While I was never that good at mathematical statistics (and I figure that you didn't use any of the "advanced" statistical methods), I think that your analysis is very good.

Still, I think that in order to use this kind of analysis to catch "cheaters" you must take some other, more "earthly" factors into account:
* most of the coders on private trackers are good with network protocols and/or HTML+CSS+PHP programming. While being a good programmer requires certain knowledge and a good grasp of general mathematics, it does not require knowledge of mathematical statistics (and I see that you used matrices to solve the equations - a method probably unheard of for many web programmers out there).
* you assumed that there was a single cheater in a case where we had 7 users and 10 torrents. It would rarely be the case to have such a small number of users and torrents active. The amount of calculation it would take to process all these equations would be a great strain on already loaded servers, which are sometimes leased or non-dedicated machines (some of them are dedicated, but most of the processing resources are spent on the tracker itself).
* you assumed that all statistics are reported by the time the analysis is made. That is rarely (if ever) the case. Many things can contribute to this:
** clients report statistics at various and different times. When you make the analysis, a number of peers that are downloading may not have been able to report yet, thus creating suspicion on other peers that they have reported too much upload.
** some client programs don't report statistics when you close them, effectively losing that part of the statistics
** a client's computer may suffer a crash, thus losing download/upload statistics for a certain period of time
** there may be a network interruption - statistics may be lost during this time
** ... and so on.

On the other hand:
* if someone makes a good tool out of this analysis, many of the "tracker coders" would adopt it if it's effective (it's been common practice so far, at least).
* the larger the sample, the more certain you are, and the more accurate the analysis is.


In closing, this has been the best thread on this forum yet.
dryden



Joined: 06 Sep 2007
Posts: 17

PostPosted: Fri Sep 14, 2007 2:10 am

Yo. Thanks for the reply.

* you assumed that all statistics are reported by the time the analysis is made. That is rarely (if ever) the case.

Of course, but I factored that in when I thought of the method. The key to creating comparable intervals is to extrapolate using the data you already have for each peer. Although some recent data may be absent, it is true (I think) that all clients report back to the tracker at regular times (although these times differ per client, of course) to get updated lists of peers, reporting their current statistics at the same time.

The tracker need only extrapolate the missing data using the average rates of the last recorded interval (if that period is large enough), but that does mean it must keep track of the totals of the last (usable) update plus the totals of the update before it. It could also use the data available at time X to create an interpolated total for time X-i, where i is the typical update interval (the interval field encoded in tracker responses), by taking the last recorded figure in [X-i, X] and the last recorded figure in [X-2i, X-i], and then simply interpolating for time X-i.

So a tracker must keep track of 6 4-byte numbers per torrent per client. At the end of each sampling interval (say every 2 hours) it must write the totals for the last sampling interval to a file, as well as the discrepancies found for that interval.
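
A minimal sketch of that interpolation step (the function and variable names are mine): given a peer's last announce (ta, ua) in [X-2i, X-i] and its last announce (tb, ub) in [X-i, X], estimate its cumulative total at the boundary t = X-i.
Code:
function u = total_at_boundary(ta, ua, tb, ub, t)
  % linear interpolation of a cumulative counter between two announces,
  % assuming a constant transfer rate over [ta, tb]
  u = ua + (ub - ua) * (t - ta) / (tb - ta);
end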

If this interpolation does not provide a good enough approximation, you can add a new column to the matrix with the total of all uploads for each interval, to see if the error is a linear combination of that (and a column with the number of peers seen in each interval, to see if the error is a linear combination of that).

** clients report statistics at various and different times. When you make the analysis, a number of peers that are downloading may not have been able to report yet, thus creating suspicion on other peers that they have reported too much upload.

Faulty reporting by other peers can never create suspicion on you, unless somehow this error is a linear combination of your upload rate and time (over multiple intervals). The chances of this are very small. It is also very hard to deliberately frame someone, because you would need to obtain the statistics for that peer and create a discrepancy that is a linear combination of those statistics, but not of your own. You cannot do this on your own, because the only statistics you can gather about a peer are its upload and download rates to you, but these are your own statistics as well, so any manipulation will frame you just as hard. The only way to do this is to cooperate with other peers (or to participate in many torrents, and to create a discrepancy that is related to whether a certain peer is active in a torrent or not).

* The amount of calculation it would take to process all these equations would be a great strain on already loaded servers, which are sometimes leased or non-dedicated machines (some of them are dedicated, but most of the processing resources are spent on the tracker itself).

Yes, probably. Naturally, the statistics gathered would be in the order of gigabytes if stored in matrix format. A tracker with a max of 7000 users, 1000 active torrents and a sampling interval of 2 hours, at 2 4-byte ints per user per torrent, would produce about 640 meg of uncompressed data per day and 4.5 gig per week. To perform an analysis over the last two weeks, you would need to process 9 GB of data.

Note however that these matrices are extremely sparse: 99.9% would probably be zeros, and the compressed files may be 600 times smaller. If the average number of torrents per user in a given period is 5, then the average number of users per torrent is 35. Using a different format (a 'packed' matrix), these 2 weeks only take up 35*1000*12*14*12 = 67 megabytes. An analysis tool can use this format internally and perform matrix multiplications on the packed matrix directly. The matrix that results from this is 7000x7000 mostly non-zero float values (could be 4, 6 or 8 bytes each) and would fit in memory.

(The size of this matrix depends only on the number of peers. You could sample for a year with the same set of users and the matrix would still be 7000x7000. But if you solve for upload factor, download factor and constant additive at once, then each dimension triples in size and the whole matrix takes up 9x as much space: 21K^2 * 4 bytes = 1.7 GB.)
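
(Matlab's sparse type is essentially such a packed format. A sketch with illustrative names, where each triplet (iv(k), pr(k), amt(k)) says how much peer pr reported in interval iv:)
Code:
% iv, pr, amt are hypothetical triplet arrays: interval, peer, amount
Xs = sparse(iv, pr, amt, N, P);   % packed N x P upload matrix from triplets
M  = Xs' * Xs;                    % the P x P (7000x7000) mostly-dense product
b  = M \ (Xs' * d)                % normal equations solve, never densifying Xs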

I have tried multiplying a random sparse 5000x3000 matrix by its transpose; it took a couple of minutes on a Duron 800. Solving the equation that resulted from it (least squares) took far less, under a minute. I couldn't try a larger matrix because Matlab seems to want to allocate it in one contiguous block and I don't have enough memory for that, but a 7000x168000 matrix should take about 300 minutes on my system (if I had the memory for it). A custom algorithm that operates on the packed matrix could be faster. The calculation that makes up the biggest part of the pie is the multiplication of the matrix with its transpose; since the result is symmetric, up to 50% of the multiplications could be saved, but I don't know if Matlab exploits this.

So, you would just download this ~70 meg file to a modern computer, set it to crunch for an hour, and you'd have your results. The added load on the server running the tracker is minimal.
ADSL (l)user



Joined: 16 Mar 2007
Posts: 19

PostPosted: Sat Sep 15, 2007 9:02 pm

I want to congratulate you.

You've made some excellent points in answer to my counterpoints.

Compressed matrices never occurred to me. Even I, with my limited knowledge, could come up with an algorithm that is not hopelessly unoptimized for multiplying packed matrices.

So "all" that is left is for someone with enough time on his hands to make a tool and try it out.

I encourage all private tracker owners to take up some of the ideas presented here. They're free ... as free as the content you're providing on your "private" trackers.

And, again, big props to dryden (is it Taylor or what? ;))
dryden



Joined: 06 Sep 2007
Posts: 17

PostPosted: Sun Sep 16, 2007 12:43 am

Thank you.

It may be possible to frame someone else after all... it seems that uTorrent is able to discover a peer's total download speed (although I can find nothing in the documented protocol that would explain that). You could create a discrepancy that is a linear combination of that, but it would probably still be partly attributed to you, because it can be linked to your presence in a torrent interval; for that interval you'd be sharing the attribution with that peer.

Naturally, while saving the intervals you'd save them in packed-row format, but for the multiplication you need packed-column format, so you'd have to convert first; that is no big deal.
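
(In Matlab/Octave terms you get this conversion almost for free, since its sparse type is column-packed internally; illustrative names again:)
Code:
% iv, pr, amt: the row-ordered triplets saved per interval
Xcol = sparse(iv, pr, amt, N, P);  % column-packed, ready for Xcol' * Xcol
Xrow = Xcol.';                     % row-packed view, if you need it back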

My best advice for a cheater is to make sure that at least 33% of the time you use zero modification, so that trackers have a hard time matching discrepancies with peer presence. So I think the most useful feature addition for GreedyTorrent would be the ability to set the cheat ratio per torrent.

Who is Taylor? The scientific management guy?
ADSL (l)user



Joined: 16 Mar 2007
Posts: 19

PostPosted: Tue Sep 18, 2007 10:38 pm

dryden wrote:

Who is Taylor? The scientific management guy?


I misspelled Tyler Durden :)
dryden



Joined: 06 Sep 2007
Posts: 17

PostPosted: Thu Sep 27, 2007 7:41 pm

Ah. Tyler is a cool guy. But Dryden is a cool guy too ;).