Riddles and more ...

Fascinating as they are, riddles are a good way to hone one's thinking and break the habit of a short attention span. So I thought I should try a few. The other side of it is to see whether there is a brute-force approach to solving some of them programmatically.

So I started my iGoogle a while back, but never had the courage to really try some of those. I was more interested in weather, sports and other stuff. Yesterday I saw one pop up on my screen. It goes like this: you have a 10-digit number whose digits, as we all know, come from the set {0,1,...,9}. We need to come up with a number that describes itself, position by position: if the leftmost position is position 0, the digit we place there is how many zeros the number contains. Similarly, position 1 holds the number of ones, and so on.

 

For a 10-digit number, the range is [0000000000, 9999999999]. A brute-force approach that takes one candidate at a time and validates it would be pretty time consuming even for a program!!!

 

So I started with the easy end, meaning how many zeros I can put in the number and still meet the stipulated conditions. Nine zeros gives 9000000000, but then I have to put a 1 at the 9th position (note the indexing is zero based), and that number fails. So I thought, start with 8 zeros, which suggests 8000000010. Close, but not quite: the 1 at position 8 forces position 1 to hold a 1 rather than a 0. Working down thru fewer zeros eventually lands on 6210001000, which does check out. Bang!

Not sure how many such numbers exist within that range (it turns out 6210001000 is the only one), but perhaps we can try a brute-force algorithm, and can we use some backtracking here? A sketch of the idea follows. Did not think it thru fully yet :)
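Here is a minimal backtracking sketch in C (my own, so treat it as an illustration rather than the blessed solution). The pruning observation: every digit occurrence is counted exactly once, so the ten digits of a valid number must sum to exactly 10, and any prefix whose running sum already exceeds 10 can be abandoned.

#include <stdio.h>

static int d[10];

/* d[i] must equal the number of times digit i occurs. */
static int self_describing(void)
{
    int count[10] = {0};
    for (int i = 0; i < 10; i++)
        count[d[i]]++;
    for (int i = 0; i < 10; i++)
        if (d[i] != count[i])
            return 0;
    return 1;
}

static void search(int pos, int sum)
{
    if (sum > 10)
        return;                        /* prune: digits must sum to 10 */
    if (pos == 10) {
        if (sum == 10 && self_describing()) {
            for (int i = 0; i < 10; i++)
                printf("%d", d[i]);
            printf("\n");
        }
        return;
    }
    for (int v = 0; v <= 9; v++) {     /* try each digit at this position */
        d[pos] = v;
        search(pos + 1, sum + v);
    }
}

int main(void)
{
    search(0, 0);
    return 0;
}

With that single cut the search visits only a few hundred thousand prefixes instead of 10^10 candidates, and it prints 6210001000.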

And another one I saw was about permutation cycles. I did not try to solve it. It would be interesting to solve, because it has a strong mathematical structure, so attacking it from a non-mathematical side is the challenge I want my brain to take on. Reasons? Mathematics is good and rigorous, but intuition is a weapon that leads to sleek solutions to problems.

Well, if one gets a half hour a day, I would strongly encourage trying some of them.

 

Posted on Saturday, January 22, 2011 at 09:53AM by Prokash Sinha

Algorithmic Complexity

I must say that I've seen way too many smart software engineers who lack an understanding of algorithmic complexity. Not sure if they never had a basic course or just slept through those classes, since such classes hardly seem to help in hacking some code into working condition. So a good basic knowledge of it is perhaps helpful.

 So what is this Algorithmic complexity anyway?

 

Well, consider a simple program first. Now in a structured programming language, one would code with some of the following constructs -

 

  • loops - for, while, do while, etc.
  • conditional statements - if-then-else and its variants
  • sequential flat statements - initialization, allocation of memory, etc.
  • API calls to the system - malloc, strcpy, etc.
  • finally, composition of functions to form the program - main(), foo(), bar(), etc.

 

In a loop, there could be function calls and other statements. If it is a function call, find the complexity of the function and substitute it there. If it is a statement, find the complexity of the statement. Since finding the complexity of a function really boils down to finding the complexity of the statements it uses, knowing the complexity of different kinds of statements is all that is needed here.

Now any initialization of a scalar variable takes constant time, say k units of time. The unit of time is unnecessary to define at this stage, since the idea is to analyze the complexity and compare it with other algorithms that solve the same problem. Initializing a vector variable, for example an array, takes n * k units of time, where n is the size of the array. In complexity-analysis terms, constant time is written O(1), and n * k is written O(n), where n is the size of the array.

A conditional statement's complexity is defined to be the maximum complexity over all of its branches. So in a simple if-then-else statement, if the if block has complexity O(n) and the else block has complexity O(1), then the whole conditional statement has complexity O(n).

A loop's complexity is defined to be the product of the loop's bound and the maximum complexity of the statements inside the loop. If the loop runs thru a list or one-dimensional array of size n, the loop's bound is O(n). Now if the maximum complexity of the statements inside the loop is O(n), then the whole loop has complexity O(n) * O(n) = O(n^2). The sketch below walks these rules thru a small C function.
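As an illustration (my sketch, with arbitrary code, not from any particular textbook), here is how those rules add up on a small C function:

#include <stdio.h>

/* Applying the rules: scalar init is O(1), array init is O(n),
 * a loop multiplies its bound by the cost of its body, and a
 * conditional costs the maximum of its branches.               */
void analyze(int *a, int n)
{
    int sum = 0;                       /* scalar init: O(1) */

    for (int i = 0; i < n; i++)        /* bound O(n), body O(1) */
        a[i] = 0;                      /* => this loop is O(n)  */

    for (int i = 0; i < n; i++)        /* bound O(n)              */
        for (int j = 0; j < n; j++)    /* body is an O(n) loop    */
            sum += a[i] * a[j];        /* => O(n) * O(n) = O(n^2) */

    if (sum > 0)                       /* conditional: max of branches */
        printf("%d\n", sum);           /* both branches O(1) => O(1)   */
}

int main(void)
{
    int a[8];
    analyze(a, 8);
    return 0;
}

Summing the pieces gives O(n^2) + O(n) + O(1), and the dominant term wins: the function is O(n^2).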

But remember that a loop using while or do-while is a little tricky, since the condition that drives the loop could evaluate to false in different situations, so the worst case is what we need to take. By the way, all of this analysis is worst-case complexity, i.e., upper-bound analysis, so we use Big-O here. Finding a lower bound is usually much harder; for a lot of algorithms, a tight lower bound is unknown.

 After analyzing all the components, you will come up with a polynomial of the form -

O(n^k) + O(n^(k-1)) + ... + O(n^2) + O(n) + O(1), which collapses to O(n^k), since the highest-order term dominates.

Moreover, some algorithms have complexity like O(log n) or O(log log n). These terms are still dominated by the general polynomial form above, so such algorithms also count as polynomial time.

This is where polynomial complexity and the classes P and NP come in. NP-hardness is defined in a slightly different way. When we try to solve some problem and find no fixed k for which we can get a complexity of the above form (every algorithm we find is, say, exponential), we start to suspect the problem has no polynomial-time algorithm at all. A problem that is at least as hard as every problem in NP is called NP-hard. Graph algorithms and other optimization problems often run into this type of hardness. See an algorithms book for the precise definitions.

When someone suspects a problem is NP-hard, the next step is to reduce a known NP-complete problem to this new problem using a polynomial-time algorithm. If the reduction is mathematically correct (and the new problem is itself in NP), the new problem belongs to the NP-complete class. What that tells us is that if any one problem in the NP-complete class has a polynomial-time solution, then the whole class has polynomial-time solutions. Showing that any problem in the NP-complete set has a polynomial-time algorithm is the famously hard open question. It is also quite hard to come up with a polynomial-time reduction from a known NP-complete problem to a new problem; for a lot of problems, finding such a reduction is no small feat. And that is why we see a lot of problems conjectured hard, where some of the brightest people have not settled the question yet.

 

So in summary, most complexity analysis is for worst-case scenarios, hence upper-bound analysis. Finding the lower bound for a problem is a bit more difficult. For example, sorting by comparison has an Omega(n log n) lower bound, but for a lot of problems no nontrivial lower bound is known. And when a problem is classified as NP-hard or NP-complete, alternatives such as restricting the problem domain, heuristics and approximation, and probabilistic algorithms are the approaches for finding good-enough solutions. For example, a lot of graph problems are either NP-hard or NP-complete, but by restricting the domain of graphs, especially to recursively definable graphs, linear-time solutions can be found. Note that saying recursively definable restricts the domain of all possible graphs.

 

That's it about Computational complexity analysis.

 

Posted on Saturday, January 15, 2011 at 11:13AM by Prokash Sinha

Concurrent Programming!

 

Hmm, we can't really get by without it! So what exactly do we gain from it?

Because of multi-core and multi-processor machines, we can't rely on single-processor sequential programming to achieve better performance; we need concurrency. The idea behind it is to have each processor do a part of the work so the total work finishes faster. Two heads are better than one! Four are better than two! And so on...

Not necessarily! Some work is inherently sequential, some is purely parallel, and some falls between the two extremes. Identifying them, and estimating how much parallel processing can be done, is the essential first step in achieving good concurrency. For example, merge sort falls in between, linked-list traversal is purely sequential, and matrix multiplication is purely parallel.

Though the above classification is somewhat informal, the discussion here is about kernel-mode drivers. How much parallelism can we get? And how exactly can we estimate the speedup? In other words, I have a driver, and my concern is whether it will scale as the number of processors scales. Also, what should the approach be for estimating the speedup?

 

Well, you will see books and articles about parallel and concurrent programming that take mostly scientific programs and analyze how best they can be parallelized, and sometimes they also calculate the estimated speedup. If I'm posed with this question, I will take the following steps first -

 

  • Take Amdahl's law for the maximum speedup.
  • Divide the driver code by frequency of execution to define the workload in measurable units.
  • Look at the places where we need locks: coarse- or fine-grained.
  • Then apply the formula to get an estimate.

Why go through such trouble? It should give us areas to improve upon based on our estimation. Let's look at Amdahl's formula.

Assume that the work takes 1 unit of time when run sequentially. Now the work has some sequential part. If the sequential part is zero, it is pure parallel work, and it scales with the number of processors. Yeah, a somewhat idealized condition. Assume the parallel part takes p units of time, 0 < p < 1, and the sequential part takes 1 - p units. Then if we have n processors to work on this, the time it takes is -

TPn = time spent by n parallel processors = (1 - p) + p/n. So the speedup is 1 / ((1 - p) + p/n).

Example: suppose 1/2 of the work is purely sequential, and the other 1/2 is parallelizable. Then the speedup with 2 processors is 1 / (1 - 1/2 + 1/(2*2)) = 1 / (1/2 + 1/4) = 4/3. So basically the result shows that we have added just 1/3 more processing capacity, so to speak mathematically!
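To make the arithmetic concrete, here is a minimal C sketch of the formula (mine, not from any driver):

#include <stdio.h>

/* Amdahl's law: with a parallelizable fraction p of the work,
 * time on n processors = (1 - p) + p/n (sequential time = 1),
 * so speedup(n) = 1 / ((1 - p) + p/n).                         */
static double amdahl_speedup(double p, unsigned n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* The example from the text: p = 1/2 on 2 processors -> 4/3. */
    printf("speedup(2)    = %.4f\n", amdahl_speedup(0.5, 2));

    /* Even with unlimited processors, p = 1/2 caps the speedup at 2. */
    printf("speedup(10^6) = %.4f\n", amdahl_speedup(0.5, 1000000));
    return 0;
}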

For drivers, you cannot just assume everything runs to completion. Some parts run on demand, some parts run intermittently, and other parts run very frequently. So before we even look at which parts are inherently sequential and which are parallelizable, we have to inject the execution frequency to get a weighted method of estimating the speedup (a rough sketch of this weighting follows).
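One way to fold the frequencies in (my own reading of the idea, with made-up numbers, not a standard formula): weight each code path's parallelizable fraction by its share of total run time, and plug the resulting effective p into Amdahl's law.

#include <stdio.h>

struct path {
    double time_share;   /* fraction of total run time spent in this path */
    double par_fraction; /* how much of that path is parallelizable       */
};

/* Effective parallel fraction across all paths, weighted by run-time share. */
static double effective_p(const struct path *paths, int n)
{
    double p = 0.0;
    for (int i = 0; i < n; i++)
        p += paths[i].time_share * paths[i].par_fraction;
    return p;
}

int main(void)
{
    /* Hypothetical driver paths: the data path is hot, init is rare. */
    struct path paths[] = {
        { 0.70, 0.80 },  /* receive/transmit path: mostly parallel     */
        { 0.25, 0.30 },  /* housekeeping: lock-heavy, mostly sequential */
        { 0.05, 0.00 },  /* init/teardown: purely sequential            */
    };
    double p = effective_p(paths, 3);
    int n = 4;
    printf("effective p = %.2f, speedup on %d cpus = %.2f\n",
           p, n, 1.0 / ((1.0 - p) + p / n));
    return 0;
}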

Once we have an estimate, our push would be to take the most heavily weighted sequential part and try to parallelize it. As we all know, any place we guard using locks is essentially sequential. So critical sections are essentially sequential execution paths, and our objective is to push as much code as possible out of the locks, as in the sketch below.
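For example, here is a minimal user-mode sketch with pthreads (hypothetical names, not driver code) of pushing work out of a lock:

#include <pthread.h>
#include <stdio.h>
#include <string.h>

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static char table[64][128];
static int  count;

/* Before: the formatting happens while holding the lock, so the
 * whole body is part of the sequential fraction.                 */
void record_slow(const char *name, int value)
{
    pthread_mutex_lock(&table_lock);
    snprintf(table[count++], 128, "%s=%d", name, value);
    pthread_mutex_unlock(&table_lock);
}

/* After: format into a private buffer first; only the copy into
 * the shared table remains inside the critical section.          */
void record_fast(const char *name, int value)
{
    char local[128];
    snprintf(local, sizeof local, "%s=%d", name, value);  /* parallel part */

    pthread_mutex_lock(&table_lock);
    memcpy(table[count++], local, sizeof local);          /* sequential part */
    pthread_mutex_unlock(&table_lock);
}

int main(void)
{
    record_fast("rx_ring", 1);
    printf("%s\n", table[0]);
    return 0;
}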

In summary, next time you are asked how much speedup you achieved by modifying your critical section, you don't have to run specific tests to answer the question. Moreover, your tests may not be representative of the full spectrum of your driver's running conditions. You should be able to plug the numbers into the formula and answer with the estimated speedup! Wonderful, at least I think so!!!

Posted on Sunday, September 19, 2010 at 09:18AM by Prokash Sinha

Looking thru the microscope - performance!

The producer-consumer approach is very popular in network driver design. Transmit and receive descriptors are usually laid out in physical pages as shared memory between host and device, and there is a producer index and a consumer index. To give it a ring structure, these indices are incremented modulo the size of the total memory each descriptor table takes, so the indices wrap around.

Depending on the device memory available and the size of the descriptor structure, we can sometimes use a native type such as a 16-bit unsigned integer, so that incrementing an index wraps by itself and we never need a check. But 16 bits addresses 64K, so with, say, a 64-byte descriptor structure, the table would have to hold about 1K descriptors for the natural wrap to line up. If the device instead provides, say, 16KB of memory for the table, one option is to check after each increment of an index for a possible wrap-around. That seems like an extra burden on the high-performance interrupt path. The alternative would be to define a bit field that wraps automatically after 16KB worth of addresses, avoiding the check and, one might hope, gaining a performance boost.

But if we look thru the microscope, we reach a different conclusion. In the test-and-set case there is hardly any branch hazard, if that is what you worry about: the wrap branch is almost never taken. Think about it! Looking at the code generated by gcc on an example case, I was surprised to see that the bit-field approach is indeed slower. A sketch of the alternatives is below, followed by the generated listing. Take a look!
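Roughly, here is what the alternatives look like in C (my sketch; the 16KB size and field widths are just illustrative):

#include <stdint.h>

#define RING_BYTES 16384u   /* 16KB of descriptor memory, a power of two */

/* Variant 1: test-and-reset after each increment. The branch is
 * taken only once per 16384 increments, so it predicts well.     */
static inline uint32_t advance_check(uint32_t idx)
{
    if (++idx >= RING_BYTES)
        idx = 0;
    return idx;
}

/* Variant 2: mask to the ring size; no branch at all. */
static inline uint32_t advance_mask(uint32_t idx)
{
    return (idx + 1) & (RING_BYTES - 1);
}

/* Variant 3: let a 14-bit bit field wrap by itself. The compiler
 * must emit load/mask/or/store sequences to keep the neighboring
 * bits intact, which is what makes this the slow option.          */
struct ring {
    uint32_t idx : 14;
    uint32_t pad : 18;
};

static inline void advance_field(struct ring *r)
{
    r->idx++;   /* wraps automatically at 16384 */
}

The gcc listing below shows what the bit-field manipulation costs in instructions.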

 

GAS LISTING /tmp/ccF3PvQX.s page 1


   1              .file "sundry.c"
   4              .text
   5              Ltext0:
 211              .section .rdata,"dr"
 212              LC0:
 213 0000 6D796269 .ascii "mybitfield.mytestfield is:%d\0"
 213      74666965
 213      6C642E6D
 213      79746573
 213      74666965
 214 001d 000000   .text
 216              .globl _bitfield_0
 218              _bitfield_0:
   1:sundry.c      **** /*
   2:sundry.c      ****  * sundry.c
   3:sundry.c      ****  *
   4:sundry.c      ****  *  Created on: Sep 15, 2010
   5:sundry.c      ****  *      Author: Prokash Sinha
   6:sundry.c      ****  */
   7:sundry.c      **** #include <stdio.h>
   8:sundry.c      ****
   9:sundry.c      **** void bitfield_0()
  10:sundry.c      **** {
 220              LM1:
 221 0000 55       pushl %ebp
 222 0001 89E5     movl %esp, %ebp
 223 0003 83EC18   subl $24, %esp
  11:sundry.c      ****
  12:sundry.c      ****         unsigned int uintvar =0;
 225              LM2:
 226 0006 C745FC00 movl $0, -4(%ebp)
 226      000000
  13:sundry.c      ****
  14:sundry.c      **** struct {
  15:sundry.c      **** unsigned int : 0;
  16:sundry.c      **** unsigned int mytestfield: 4;
  17:sundry.c      **** unsigned int pad: 28;
  18:sundry.c      **** }mybitfield;
  19:sundry.c      ****
  20:sundry.c      **** mybitfield.mytestfield = 7;
 228              LM3:
 229 000d 8B45F8   movl -8(%ebp), %eax
 230 0010 83E0F0   andl $-16, %eax
 231 0013 83C807   orl $7, %eax
 232 0016 8945F8   movl %eax, -8(%ebp)
  21:sundry.c      **** printf("mybitfield.mytestfield is:%d", mybitfield.mytestfield);
 234              LM4:
 235 0019 0FB645F8 movzbl -8(%ebp), %eax
 236 001d 83E00F   andl $15, %eax
 237 0020 89442404 movl %eax, 4(%esp)
 238 0024 C7042400 movl $LC0, (%esp)
 238      000000
 239 002b E8000000 call _printf
 239      00
  22:sundry.c      ****
  23:sundry.c      ****         /*show that inc beyond the width makes it zero */
  24:sundry.c      **** mybitfield.mytestfield = 16;



 241              LM5:
 242 0030 8B45F8   movl -8(%ebp), %eax
 243 0033 83E0F0   andl $-16, %eax
 244 0036 8945F8   movl %eax, -8(%ebp)
  25:sundry.c      **** printf("mybitfield.mytestfield is:%d", mybitfield.mytestfield);
 246              LM6:
 247 0039 0FB645F8 movzbl -8(%ebp), %eax
 248 003d 83E00F   andl $15, %eax
 249 0040 89442404 movl %eax, 4(%esp)
 250 0044 C7042400 movl $LC0, (%esp)
 250      000000
 251 004b E8000000 call _printf
 251      00
  26:sundry.c      ****
  27:sundry.c      ****         /*show just the instructions to inc w/o worry for wrapping */
  28:sundry.c      **** mybitfield.mytestfield++ ;
 253              LM7:
 254 0050 0FB645F8 movzbl -8(%ebp), %eax
 255 0054 83E00F   andl $15, %eax
 256 0057 8D5001   leal 1(%eax), %edx
 257 005a 8B45F8   movl -8(%ebp), %eax
 258 005d 83E20F   andl $15, %edx
 259 0060 83E0F0   andl $-16, %eax
 260 0063 09D0     orl %edx, %eax
 261 0065 8945F8   movl %eax, -8(%ebp)
  29:sundry.c      ****
  30:sundry.c      ****         /* show how many instruction it generates */
  31:sundry.c      ****         uintvar++;
 263              LM8:
 264 0068 8D45FC   leal -4(%ebp), %eax
 265 006b FF00     incl (%eax)
  32:sundry.c      ****         if ( uintvar > 4000000)
 267              LM9:
 268 006d 817DFC00 cmpl $4000000, -4(%ebp)
 268      093D00
 269 0074 7607     jbe L1
  33:sundry.c      ****              uintvar = 0;
 271              LM10:
 272 0076 C745FC00 movl $0, -4(%ebp)
 272      000000
 273              L1:
  34:sundry.c      ****
  35:sundry.c      **** }
 275              LM11:
 276 007d C9       leave
 277 007e C3       ret
 282              Lscope0:
 286              .globl _main
 288              _main:
  36:sundry.c      ****
  37:sundry.c      **** void main (void)
  38:sundry.c      **** {
 290              LM12:
 291 007f 55       pushl %ebp
 292 0080 89E5     movl %esp, %ebp
 293 0082 83EC08   subl $8, %esp
 294 0085 83E4F0   andl $-16, %esp



 295 0088 B8000000 movl $0, %eax
 295      00
 296 008d 83C00F   addl $15, %eax
 297 0090 83C00F   addl $15, %eax
 298 0093 C1E804   shrl $4, %eax
 299 0096 C1E004   sall $4, %eax
 300 0099 8945FC   movl %eax, -4(%ebp)
 301 009c 8B45FC   movl -4(%ebp), %eax
 302 009f E8000000 call __alloca
 302      00
 304              LM13:
 305 00a4 E8000000 call ___main
 305      00
  39:sundry.c      **** bitfield_0();
 307              LM14:
 308 00a9 E852FFFF call _bitfield_0
 308      FF
  40:sundry.c      **** }
 310              LM15:
 311 00ae C9       leave
 312 00af C3       ret
 313              Lscope1:
 315              .text
 317              Letext:



DEFINED SYMBOLS
                            *ABS*:00000000 sundry.c
     /tmp/ccF3PvQX.s:218    .text:00000000 _bitfield_0
     /tmp/ccF3PvQX.s:288    .text:0000007f _main

UNDEFINED SYMBOLS
___main
__alloca
_printf

Posted on Wednesday, September 15, 2010 at 07:31PM by Prokash Sinha

Band on the run! PCIe 3.0 DIO / SR-IOV / MR-IOV

If you know the concept behind the song "Band on the Run...", you might have already guessed how I'm going to start this blog; otherwise just forget about the name :). If you are not aware of them, SR-IOV and MR-IOV exist specifically to lift virtualization to a higher level of I/O performance.

Virtualization on commodity hardware started about a decade ago. With the x86 family being widely accepted, the first wave of virtualization on this hardware took the emulation path. A basic tenet of virtualization is that any instruction with a visible effect on processor state must be trappable; there are about 17 instructions in x86 that were not in this class, hence emulation was the first approach. With that in mind, the next step was para-virtualized I/O implementations to share a hardware I/O device among multiple virtual machines. Para-virtualization came at a cost in I/O performance: virtual machines saw around 50% of the real device's I/O throughput.

So there is currently (as of 2010) very extensive research and development going on in this area. The basic problem is how to boost I/O performance in a virtualized environment, and the first approach is to tease the I/O processing apart. The idea is that I/O consists of two main parts: the data payload path, and the control and management path. If the data payload path can be optimized through suitable hardware and software architecture changes, we have a chance of realizing throughput comparable to the real device's capacity. SR-IOV, MR-IOV, IOAT and others are efforts in that direction.
Posted on Friday, August 27, 2010 at 10:07PM by Prokash Sinha