Two Sum Problem Analysis 3: Handling Duplicates Input

Overview

In this post, I give further analysis of the two sum problem by considering the most complicated case, where a unique solution is not guaranteed and the input may contain duplicates. You are then asked to find all candidate index pairs whose sum equals the given target!

This makes things even more complicated than the last post, “Two Sum Problem Analysis 2: Not Unique and No Duplicates”, and I will show that a time complexity of O(NlogN) or O(N) might not be achievable: O(N^2) would be the worst case.

Two Sum Problem

Similar to the description in the last post, but pay attention to the bold line, where we make different assumptions for this problem:

Given an array of integers, find two numbers such that they add up to a specific target number. The function twoSum should return all pairs of indices of two numbers that add up to the target, where index1 must be less than index2. Please note that your returned answers (both index1 and index2) are not zero-based. You may assume nothing about this problem; that is, a unique solution is not guaranteed and the input may contain duplicates.

Input: numbers={2, 7, 11, 15}, target=9
Output: index1=1, index2=2

Two Sum Problem Solution Analysis

Well, since we need to return all index pairs (i, j) such that A[i] + A[j] = target, consider the worst extreme case: say the array consists of 100 elements, all with the same value 1, and the target is 2. Obviously every index pair (1, 2), (1, 3), …, (1, 100), (2, 3), …, (2, 100), …, (99, 100) is an answer, and the number of such pairs is O(N^2). This means we need at least O(N^2) time just to output the pairs. Based on this fact, I conclude that in this different context, where we are asked to return all possible index pairs, no algorithm or technique can guarantee a worst-case time complexity better than O(N^2).

So much for the time complexity; now let’s work out the details of how to find all such pairs. The approach I propose is a pre-processing approach:

  1. Prepare a hash map h1 and a new vector v1. h1 is a hash map (unordered_map) that tracks (value, list) pairs; each entry maps “value –> list of indexes or positions where this value occurs in the input array”. This preserves the original index information. v1 is our new array with duplicates removed, to be processed in the next steps.
  2. Linear scan: for each input element A[i], if it is not yet in the hash map h1, we append it to v1; otherwise we skip it. In either case, we push the index i into the vector keyed at value A[i] in h1. This step takes O(N) time.
  3. Pass the duplicate-free array v1 into the algorithm discussed in the last post, “Two Sum Problem Analysis 2: Not Unique and No Duplicates”, but return all the value pairs rather than index pairs. That is, we get a list of pairs (a1, b1), (a2, b2), … such that ai + bi = target. Depending on the algorithm we use, this step takes O(N) by hashing or O(NlogN) by sorting.
  4. For each value pair (ai, bi), generate all the index pairs: retrieve h1[ai] and h1[bi], each of which is a list of indexes into the original input; any pair composed of one index from h1[ai] and one from h1[bi] is an answer. This can take up to O(N^2) time.

Remarks: Now that we are here, is it a bit frustrating to end up with a worst-case time complexity of O(N^2)? But given the extreme special case above, this seems to be the best we can do. It feels frustrating because, in that special case, the only VALUE pair is (1, 1), yet we still have to spend O(N^2) time enumerating index pairs. The whole point forcing O(N^2) on us is that we are required to return all INDEX pairs rather than VALUE pairs. So the open question is: is there a better algorithm to get all possible VALUE pairs? I will discuss this interesting question in later posts soon. Think about it for yourself for now.

Summary

So here in this post, I relax the assumptions of the two sum problem even further by allowing duplicates in the input. I give a pre-processing approach and show that the worst-case time complexity is O(N^2), and that there might not be any better bound. If anyone has more insightful comments, that would be highly appreciated! Thanks.

Written on September 21, 2014