If you're literally just looking for code that appears in two places, then there...

scott_s · on March 10, 2009

Yeah, it's polynomial, but it's still really slow and not practical.

In a string of length N, there are N substrings of length 1, N-1 substrings of length 2, N-2 substring of length 3, ..., 1 substring of length N. This is the sum of the first N integers, which is N(N+1)/2. If we're doing big oh analysis, we can just say that's O(N^2).

Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).

Doing a O(N^4) algorithm on nontrivial sizes of N (and nontrivial sizes of N are where you need it the most) will take a long time.

Disclaimer: it's been a long time since I've done any algorithm analysis, so please check my work.

ramchip · on March 10, 2009

My algorithms teacher does research in something similar. IIRC he has a program exactly for showing copy/pasted code. You may want to have a look at his papers on Clone Detection Tools, "A Novel Approach to Optimize Clone Refactoring Activity", etc.

http://www.polymtl.ca/recherche/rc/en/professeurs/details.ph...

silentbicycle · on March 10, 2009

Awesome, thanks!

lacker · on March 10, 2009

Comparing every something to every other something is an N^2 operation. But, in this case, our something is already O(N^2), so the final algorithm is actually O(N^4).

If you are just checking for equality, comparing each of N items to another group of N items is O(N). Use a hash table.

silentbicycle · on March 11, 2009

Exactly. Interned strings / symbols work out the same, too.