techInterview Discussion

Answers to technical interview questions. A part of TechInterview.org

A part of techInterview.org: answers to technical interview questions.

Your host: Michael Pryor

Another substring problem

Given a very very long string and a charater set. you have to find the smallest substring in the given string which has all characters in the given set.
K.Ashwin Kumar Send private email
Saturday, June 30, 2007
 
 
Aw... so we firts start with the function telling is if some character is in the given k character set:

bool isin(char c, char[] test)
{
 bool is =false;
 for(int i=0; i < k; i++)
 {
    if(test[i]==c)
    {
      is=true;
      break;
    }
 }
 return bool;
}

Then we have a function that does this on as subset/subrrray of the initial array A:

int minrange=sizeof(A)+1;

bool isinrange( char[] A, int start, int end, char[] test)
{
  bool is=false;

  for(int i= start; i < end; ++i)
  {
      if ( isin(A[i], test) )
      continue;
      else
        break;
  }   

  if(i==end) // all test cases are in start-end range
  {
    //record this in global variable
    if(end-start+1<minrange)
      minrange=end-start+1;

    is=true;
  } 

  return is;   
}

void finaltest( char[] A, char[] test)
{
  int level = 0;
  int size = sizeof A/sizeof A[0];
  if(isinrange( A, level, size, test)
  {
      isinrange( A, level+1, size, test);
      isinrange( A, level, size-1, test);   
  }
}
Bytheway
Saturday, June 30, 2007
 
 
I think I screwed up recursion
Bytheway
Saturday, June 30, 2007
 
 
Should be:

void isinrange( char[] A, int start, int end, char[] test)
{
  if (start >= end )
    return;

  bool is=false;

  for(int i= start; i < end; ++i)
  {
      if ( isin(A[i], test) )
      continue;
      else
        break;
  }   

  if(i==end) // all test cases are in start-end range
  {
    //record this in global variable
    if(end-start+1<minrange)
      minrange=end-start+1;

    is=true;
  } 

  if(is)
  {
    isinrange( A, start+1, end, test);
    isinrange( A, start, end-1, test);
  }
}
Bytheway
Saturday, June 30, 2007
 
 
Admittedly, this above is quite naive. I assumed the worst case where all the characters in the test array may be the same character like "aaaaaaa".

Thus any single character 'a' (surrounded by two characters different from it ) is the smallest substring and has a length of 1.
Bytheway
Saturday, June 30, 2007
 
 
So what would be the performance of this algo?

My first take using CLR notation is:

T(0) = O(1)
T(n) = 2*T(n-1) + O(k)
Bytheway
Saturday, June 30, 2007
 
 
Ok, next improvement step is that linear search like:

bool isin(char c, char[] test)
{
 bool is =false;
 for(int i=0; i < k; i++)
 {
    if(test[i]==c)
    {
      is=true;
      break;
    }
 }
 return bool;
}

can be changed to binary tree search to take O(log k) time
Bytheway
Saturday, June 30, 2007
 
 
"Admittedly, this above is quite naive. I assumed the worst case where all the characters in the test array may be the same character like "aaaaaaa".

Thus any single character 'a' (surrounded by two characters different from it ) is the smallest substring and has a length of 1."

The answer, in that case, is "", because *all* of the characters in the character set have to be included.

Sincerely,

Gene Wirchenko
Gene Wirchenko Send private email
Saturday, June 30, 2007
 
 
Of course, the function isinrange is completely wrong, answering a different question. It needs to record the indices of the found characters so that they don't get tested again.


void isinrange( char[] A, int start, int end, char[] test)
{
  int len=sizeof test/sizeof test[0];

  if ((end-start+1) < len) // all characters must be in
    return;

  int indices[len];

  for( int i=0; i<len; i++)
  {
    int j=start;
    for(; j < end; j++)
    {
      bool skip=false;
      // skip indices where match found
      for(int k=0; k<i; k++)
      {
          if(indices[k]==j)
          {
            skip=true;
            break;
          }
      }

      if(skip)
        continue;

      if( A[j]== test[i] )
      {
          indices[i]=j; // record index where found
          break;
      }
    }

    if( j==end ) // test character nowhere in range
        return;
  }   

  // all test cases are in start-end range
  // record this in global variable
  if(end-start+1<minrange)
    minrange=end-start+1;

  isinrange( A, start+1, end, test);
  isinrange( A, start, end-1, test);
}
Bytheway
Sunday, July 01, 2007
 
 
I think performance would be O(N^2)
Bytheway
Sunday, July 01, 2007
 
 
Can it be done better than O(N^2), O(N) probably!!
K.Ashwin Kumar Send private email
Sunday, July 01, 2007
 
 
We can scan the big string and record all the indices where each character from the test set is located. This will take some space for indices but can be done in O(N) time.

Then we have to do something smart with those indices but that eludes me for the moment.
Bytheway
Sunday, July 01, 2007
 
 
Uhm, I can't make much progress with the indices idea. The main problem with it is that it adds a complexity to the algorithm that depends on how many locations contain each of the test characters.

I think this in itself is an interesting problem so I will pose it in another thread.

So even if we start with O(N) the rest is unrelated to N unless we are content estimating the number of indices per each test character as O(N). Then I can't imagine any processing of indices that would make algo stay O(N).

Second, duplicates complicate this approach as the set of indices for identical characters will be the same.

Sure, if *all* of the test characters were the same (like "aaaaaa") then this approach would quickly find the smallest subset that contains extactly that many a's.
Bytheway
Sunday, July 01, 2007
 
 
Let Count[] be an array or hash table of counts by character set memnber.  Let M be the number of zero counts - initially M = size of character set.

Find the first substring with all characters by scanning the string and updating the value of Count[] and M for each character.  When M reaches zero you have your first substring containing all characters.  Remove characters from the left o this substring subtracting on from the count array until one of the entries goes to zero this gives you first candidate miminum string.

Given a candidate minimum string you get the next one as follows.

1.  Drop 1 character from the left (which brings the count of that character to zero).
2.  Add characters to the right until you add the character back from the string.
3.  Drop characters from the left until the count for a character would go to zero.

Continue to the end of the string looking for candidates and take the shortest candidate.  This should be O(N).
Send private email
Monday, July 02, 2007
 
 
So say your string is "qwerty". After scanning you finally found 't' that was the only missing char so far.


So now you have counts (in square brackets):

[10]  [3]  [6]  [2]  [1]  [8]
  q    w    e    r    t    y

What exactly does this tell you about the size of the substring and what do you exactly do by 'subtracting the count from the left'?
Bytheway
Monday, July 02, 2007
 
 
@Anonymous

Can u explain ur O(N) with an detailed example??
K.Ashwin Kumar Send private email
Tuesday, July 03, 2007
 
 
Example.  Say the character set is abcd and the string is

aabcaadcccbadda

1. The first string that contains all characters 'aabcaad'

2. We get our first candidate droping characters from the left until it is minimal 'bcdaad' (Length = 6)

3. We drop the first character b giving 'cdaad'

4. We add characters until we get the b we dropped 'cdaadcccb'

5. Drop from left to get next candidate: 'adcccb' (Lenth = 7)

6. Drop 1 from left:'dcccb'.
7. Add to right: 'dcccba' 
8. Drop from left 'dcccba' (Length = 6)

9. Drop 1 from left: 'cccba'
10: Add to right: 'cccbad'
11: Drop from left: 'cbad' (Length = 4)

12. Drop 1 from left: 'bad'
13. There are no c's to the right so we are done.

So in this case the minimum was 'cbad' lenth = 4.
Bill Hamaker Send private email
Tuesday, July 03, 2007
 
 
Bill hamaker,

at step 5) Drop until next candidate

in such steps, are u not checking whether it contains all characters from set abcd, for this wont u run another loop inside to confirm string candidature ??
K.Ashwin Kumar Send private email
Tuesday, July 03, 2007
 
 
No because we maintain running counts in the count array/hash table as in the following example (untested) c# code.

private bool NextCandidate(ref int start, ref int length,
    string s, Dictionary<char, int> count)
{
// returns false if no next candidate
// start, length are reference variables
// initially they refer to the previous candidate
// upon completion they refer to the next candidate
// by definition of 'candidate' count[s[start]] = 1    
// and count[s[start+length-1]] = 1
// remove one character from left;
char c = s[start];
count[c] = 0;
start++;
// add characters to right
while (start+length < s.Length && count[c] == 0)
{
  count[start+length]++;
  length++;
}
if (start + length >= s.Length)
{
  // no next candidate
  return false;
}
// remove from left
while (count[start] > 1)
{
  count[start]--;
  start++;
}
return true;
}
Bill Hamaker Send private email
Tuesday, July 03, 2007
 
 
"2. We get our first candidate droping characters from the left until it is minimal 'bcdaad' (Length = 6)"

It should be "bcaad"

How do you test for minimality ? And whats the complexity of the code ?
prog
Saturday, July 28, 2007
 
 
Algorithm:

step 1:
    add next character to string
    Optimize string
    
    if we have a substring
        Record substring if its the best so far
        
step 2:        
        remove leftmost character from string
        If we still have a substring
            Optimize substring
            Record substring if its the best so far
            goto step 2
    
    goto step 1        

---------------------------------------------------------

Here's the simplified implementation, should run O(N):

typedef struct tagSS_SESSION
{
    int unique;
    char Array[256];
    int char_set_size;
} SS_SESSION;


/* AddChar: Returns TRUE when session contains full substring */
BOOL AddChar(char character, SS_SESSION *pSession)
{
    if( pSession->Array[character] == 0 )
        pSession->unique++;
    
    pSession->Array[character]++;

    if( pSession->unique >= pSession->char_set_size )
        return TRUE    ;

    return FALSE;
}

/* RemoveChar: Returns TRUE when session contains full substring */
BOOL RemoveChar(char character, SS_SESSION *pSession)
{
    if( pSession->Array[character] == 1 )
        pSession->unique--;
    pSession->Array[character]--;
    
    if( pSession->unique >= pSession->char_set_size )
        return TRUE;
    return FALSE;
}

int Optimize(char* string, int len)
{
    int count = 0;
    
    if( len < 2)
        return 0;

    for( int i = 1; i < len; i++ ) {
        if( string[count] == string[i] ) {
            count++;
            i = count;
        }
    }
    
    return count;
}



char* FindSubString(char* orig_substring, char* char_set)
{
    SS_SESSION    session = {0};
                
    int            best_len        = 100000, /* fake high length */
                best_start_offset, /* best start offset so far */
                orig_len    = strlen(orig_substring),
                current_len = 0, /* current substring length */
                start_offset = 0; /* current start offset */
    

    session.char_set_size = strlen(char_set);

    for( int i = 0; i < orig_len; i++ ) {
        int opt_count = 0; /* how many bytes we optimized */
        
        /* here we add a character to our current 'string' and find out if we have a valid substring */
        BOOL bMatchFound = AddChar(orig_substring[i], &session);
        current_len++;
        
        /* optimize the left side if we can */
        opt_count = Optimize(&orig_substring[start_offset], current_len);
        
        while( opt_count-- ) {
            RemoveChar(orig_substring[start_offset], &session);
            start_offset++;
            current_len--;
        }
        
        /* do we have a substring? */
        if( bMatchFound ) {
            if( current_len < best_len ) {
                /* the current substring is shorter than our best */
                best_len            = current_len;
                best_start_offset    = start_offset;
            }
            
            /* OK, so we had a substring match after adding the last character.
              Now, lets start removing characters from the left and see how we can
              reduce this string while maintaining the substring
            */
            while( RemoveChar(orig_substring[start_offset], &session )) {
                    
                opt_count        = Optimize(&orig_substring[++start_offset], --current_len);

                start_offset    += opt_count;
                current_len        -= opt_count;
        
                if( current_len < best_len ) {
                    /* after truncating, the current substring is shorter than our best */
                    best_len            = current_len;
                    best_start_offset    = start_offset;
                }
            }
    
            start_offset++;
            current_len--;
        }

    }
    
    /* we either have the smallest substring ( best_len != 100000 ) or not (best_len == 100000 ) */

    if( best_len == 100000 )
        return NULL;

    printf("result:");
    
    for( int bl = 0; bl < best_len; bl++ )
        printf("%c", orig_substring[bl + best_start_offset]);
    return &orig_substring[best_start_offset];
}
Mark Urbanus Send private email
Saturday, August 04, 2007
 
 

This topic is archived. No further replies will be accepted.

Other recent topics Other recent topics
 
Powered by FogBugz