187. Repeated DNA Sequences

  • 30.1%

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

1
2
3
4
5
6
For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

方法一:

我的代码实现:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
class Solution {
public:
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
if(s.size()<10)
return res;
unordered_set<int> set1, set2;
unordered_map<char, int> map = {{'A', 0}, {'C', 1}, {'G', 2}, {'T', 3}};
for(int i=0; i<=s.size()-10; i++){
int k=0;
for(int j=i; j<i+10; j++){
k <<= 2;
k |= map[s[j]];
}
if(set1.find(k)!=set1.end() && set2.find(k)==set2.end()){
res.push_back(s.substr(i, 10));
set2.insert(k);
}
set1.insert(k);
}
return res;
}
};

https://discuss.leetcode.com/topic/8894/clean-java-solution-hashmap-bits-manipulation

Clean Java solution (hashmap + bits manipulation)

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
public List<String> findRepeatedDnaSequences(String s) {
Set<Integer> words = new HashSet<>();
Set<Integer> doubleWords = new HashSet<>();
List<String> rv = new ArrayList<>();
char[] map = new char[26];
//map['A' - 'A'] = 0;
map['C' - 'A'] = 1;
map['G' - 'A'] = 2;
map['T' - 'A'] = 3;

for(int i = 0; i < s.length() - 9; i++) {
int v = 0;
for(int j = i; j < i + 10; j++) {
v <<= 2;
v |= map[s.charAt(j) - 'A'];
}
if(!words.add(v) && doubleWords.add(v)) {
rv.add(s.substring(i, i + 10));
}
}
return rv;
}

https://discuss.leetcode.com/topic/8487/i-did-it-in-10-lines-of-c

I did it in 10 lines of C++

The main idea is to store the substring as int in map to bypass the memory limits.

There are only four possible character A, C, G, and T, but I want to use 3 bits per letter instead of 2.

Why? It’s easier to code.

A is 0x41, C is 0x43, G is 0x47, T is 0x54. Still don’t see it? Let me write it in octal.

A is 0101, C is 0103, G is 0107, T is 0124. The last digit in octal are different for all four letters. That’s all we need!

We can simply use s[i] & 7 to get the last digit which are just the last 3 bits, it’s much easier than lookup table or switch or a bunch of if and else, right?

We don’t really need to generate the substring from the int. While counting the number of occurrences, we can push the substring into result as soon as the count becomes 2, so there won’t be any duplicates in the result.

1
2
3
4
5
6
7
8
9
10
11
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<int, int> m;
vector<string> r;
int t = 0, i = 0, ss = s.size();
while (i < 9)
t = t << 3 | s[i++] & 7;
while (i < ss)
if (m[t = t << 3 & 0x3FFFFFFF | s[i++] & 7]++ == 1)
r.push_back(s.substr(i - 10, 10));
return r;
}

BTW, the OJ doesn’t seems to have test cases which the given string length is smaller than 9, so I didn’t check it to make the code simpler.

Any suggestions?

Update:

I realised that I can use s[i] >> 1 & 3 to get 2 bits, but then I won’t be able to remove the first loop as 1337c0d3r suggested.

https://discuss.leetcode.com/topic/8487/i-did-it-in-10-lines-of-c/2

Neat idea. The additional 1 bit per letter still encode each substring in 10x3 = 30 bits, just under 4 bytes for a 32-bit integer.

Your code could be further simplified. By observing that s[i] & 7 is never 0, each of the first nine substrings with length < 10 will have unique hash key and will never collide with other 10-letter long sequences. Therefore the first loop could be removed and be compacted into a single loop.

1
2
3
4
5
6
7
8
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<int, int> m;
vector<string> r;
for (int t = 0, i = 0; i < s.size(); i++)
if (m[t = t << 3 & 0x3FFFFFFF | s[i] & 7]++ == 1)
r.push_back(s.substr(i - 9, 10));
return r;
}

Another observation is the mapped value need not be an integer counter, and could simply be a boolean to further save space. This requires some extra logic though:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
vector<string> findRepeatedDnaSequences(string s) {
unordered_map<int, bool> m;
vector<string> r;
for (int t = 0, i = 0; i < s.size(); i++) {
t = t << 3 & 0x3FFFFFFF | s[i] & 7;
if (m.find(t) != m.end()) {
if (m[t]) {
r.push_back(s.substr(i - 9, 10));
m[t] = false;
}
} else {
m[t] = true;
}
}
return r;
}

PS: OJ does have test cases which the given string has length that is less than 9.

https://discuss.leetcode.com/topic/8487/i-did-it-in-10-lines-of-c/3

Concise solution! A little longer but more readable solution base on your idea.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
int str2int(string s) {
int str=0;
for (int i = 0; i < s.size(); ++i)
str = (str<<3) +(s[i]&7);
return str;
}
vector<string> findRepeatedDnaSequences(string s) {
vector<string> res;
unordered_map<int,int> coll;
for (int i = 0; i+10 <= s.size(); ++i)
if (coll[str2int(s.substr(i,10))]++ == 1)
res.push_back(s.substr(i,10));
return res;
}

https://discuss.leetcode.com/topic/27517/7-lines-simple-java-o-n

7 lines simple Java, O(n)

1
2
3
4
5
6
7
8
9
public List<String> findRepeatedDnaSequences(String s) {
Set seen = new HashSet(), repeated = new HashSet();
for (int i = 0; i + 9 < s.length(); i++) {
String ten = s.substring(i, i + 10);
if (!seen.add(ten))
repeated.add(ten);
}
return new ArrayList(repeated);
}

https://discuss.leetcode.com/topic/8539/short-java-rolling-hash-solution

Short Java “rolling-hash” solution

Hi guys!

The idea is to use rolling hash technique or in case of string search also known as Rabin-Karp algorithm. As our alphabet A consists of only 4 letters we can be not afraid of collisions. The hash for a current window slice could be found in a constant time by subtracting the former first character times size of the A in the power of 9 and updating remaining hash by the standard rule: hash = hash*A.size() + curr_char.

Check out the Java code below.

Hope it helps!

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
public class Solution {
private static final Map<Character, Integer> A = new HashMap<>();
static { A.put('A',0); A.put('C',1); A.put('G',2); A.put('T',3); }
private final int A_SIZE_POW_9 = (int) Math.pow(A.size(), 9);

public List<String> findRepeatedDnaSequences(String s) {
Set<String> res = new HashSet<>();
Set<Integer> hashes = new HashSet<>();
for (int i = 0, rhash = 0; i < s.length(); i++) {
if (i > 9) rhash -= A_SIZE_POW_9 * A.get(s.charAt(i-10));
rhash = A.size() * rhash + A.get(s.charAt(i));
if (i > 8 && !hashes.add(rhash)) res.add(s.substring(i-9,i+1));
}
return new ArrayList<>(res);
}
}

https://discuss.leetcode.com/topic/18263/10-lines-c-code-8-ms-passed

10 lines C++ code, 8 ms passed!

1
2
3
4
5
6
7
8
9
10
11
12
vector<string> findRepeatedDnaSequences(string s) {
char hashMap[1048576] = {0};
vector<string> ans;
int len = s.size(),hashNum = 0;
if (len < 11) return ans;
for (int i = 0;i < 9;++i)
hashNum = hashNum << 2 | (s[i] - 'A' + 1) % 5;
for (int i = 9;i < len;++i)
if (hashMap[hashNum = (hashNum << 2 | (s[i] - 'A' + 1) % 5) & 0xfffff]++ == 1)
ans.push_back(s.substr(i-9,10));
return ans;
}

https://discuss.leetcode.com/topic/10880/11ms-solution-with-unified-hash-fxn

~ 11ms Solution with Unified Hash Fxn

Appricate for advice.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
vector<string> findRepeatedDnaSequences(string s)
{
vector<string> ret;
if ( s.length() < 11 )
{
return ret;
}

char table[1048576] = "";
unsigned int hash = 0U;

for ( size_t i = 0; i < 10; ++i )
{
/** 'A' - 'A' + 1 = 1 = 1 (mod 5)
* 'C' - 'A' + 1 = 3 = 3 (mod 5)
* 'G' - 'A' + 1 = 7 = 2 (mod 5)
* 'T' - 'A' + 1 = 20 = 0 (mod 5)
*/
hash = ( hash << 2 ) | ( ( s[i] - 'A' + 1 ) % 5 );
}

table[hash] = 1;

for ( size_t i = 10; i < s.length(); ++i )
{
hash = ( ( hash << 2 )
^ ( ( s[ i - 10 ] - 'A' + 1 ) % 5 ) << 20 )
| ( ( s[i] - 'A' + 1 ) % 5 );

if ( table[hash] == 0 )
{
table[hash] = 1;
}
else if ( table[hash] == 1 )
{
table[hash] = 2;
ret.push_back( string( s, i - 9, 10 ) );
}
}

return ret;
}

https://discuss.leetcode.com/topic/31640/20-ms-solution-c-with-explanation

20 ms solution (C++) with explanation

One obvious way to do this is to use a hash table mapping strings to counts. (e.g. H[“AAAAAAAAAA”] represents the number of times we have seen AAAAAAAAAA. This will work in O(n) time, but its useful to discuss why this is not a good solution:

  • Runtime constant (from hashing): When using a hash table, there is a runtime hit for hashing the string. In this case, converting the string to a table index. That will presumably mean looking at each character of the string. Since each character is part of 10 different substrings (other than the end characters), that means 10n character reads. Still linear, but we can do better on the constant.
  • Memory (values): There isn’t any reason to store a count. Each possible string has only 3 states we need to track: “never been seen”, “been seen once”, and “been seen more than once”. This requires only two bits to track – not the 4-8 bytes needed for an integer.
  • Memory (keys): A hash table needs to store each key (to resolve collisions). At 10 bytes per key, thats 10*n bytes – a potential problem if n is every large, and completely unnecessary.

Here is how we address the three problems:

Hashing: We compute the hash ourselves, but take advantage of the overlapping. We treat each letter as a two-bit number. (Arbitrarily, A=0, C=1, G=2, T=3.) We treat ten consecutive letters as a 20-bit integer. We can calculate the first one with:

1
2
3
int val = 0;
for (int i=0; i < 10; i++)
val = (val << 2) | char2val(s[i]);

Now, to compute the next string:

1
val = ((val << 2) & mask) | char2val(s[10]);

Where:

  1. mask: 20 consecutive bits ((1 << 21) -1).
  2. ((val << 2) & mask: shift everything over two bits, and get rid of the most significant bits.
  3. ((val << 2) & mask) | char2val(s[10]): Replace the right-most two bits with the character code.
    Much faster than looking at every character 10 times.

Hash table: We instead use two bit-sets. There are 2^{21}-1 possible strings. We need a bit in each set for each possible string. The first set (S1) tells us if the string has been seen once or not. The second set (S2) tell us whether the string has been seen more than once.

Code:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
vector<string> findRepeatedDnaSequences(string s) {
if (s.size() <= 10)
return vector<string>();

vector<string> R;
bitset<1<<20> S1;
bitset<1<<20> S2;

int val = 0;
for (int i=0; i < 10; i++) // Calc. the has value for the first string.
val = (val << 2) | char2val(s[i]);
S1.set(val);
cout << val << " | ";

int mask = (1 << 20) - 1;
for (int i=10; i < s.size(); i++) {
// Calc the hash value for the string ending at position i.
val = ((val << 2) & mask) | char2val(s[i]);
if (S2[val])
continue;
if (S1[val]) {
R.push_back(s.substr(i-10+1, 10));
S2.set(val);
}
else
S1.set(val);
}
return R;
}

int char2val(char c) {
switch (c) {
case 'A': return 0;
case 'C': return 1;
case 'G': return 2;
case 'T': return 3;
}
}
谢谢你,可爱的朋友。