2016-02-22

Leetcode-Repeated DNA Sequences(Java)

Question:

All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: “ACGAATTCCG”. When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Thinking:

Because it limits the length of repeated substrings, so we can travse the list one time from the left to right. Pick substring from the list to check whether this substring emerge beofe.

Solution:

public List<String> findRepeatedDnaSequences(String s) {
    HashSet<String> set = new HashSet<String>();
    List<String> res = new ArrayList<String>();

    for (int i = 0; i+10 <= s.length(); i ++) {
        String temp = s.substring(i, i+10);
        if (!set.contains(temp)) {
            set.add(temp);
        }
        else {
            if (!res.contains(temp))
                res.add(temp);
        }
    }

    return res;
}

More efficient way:(which use the bits to represent map fucntion, reference: https://leetcode.com/discuss/25399/clean-java-solution-hashmap-bits-manipulation)

public List<String> findRepeatedDnaSequences(String s) {
    HashMap<Character, Integer> map = new HashMap<Character, Integer>();
    HashSet<Integer> set = new HashSet<Integer>();
    HashSet<Integer> second = new HashSet<Integer>();
    List<String> res = new ArrayList<String>();

    if (s.length() < 10)    return res;

    map.put('A', 0);
    map.put('C', 1);
    map.put('G', 2);
    map.put('T', 3);

    int v = 0;

    for (int i = 0; i < s.length(); i++) {
        v <<= 2;
        v |= map.get(s.charAt(i));
        v &= 0xfffff;
        if (i < 9)    continue;
        if (!set.add(v) && second.add(v)) {
            res.add(s.substring(i-9, i+1));
        }
    }

    return res;
}