Riddles and more ...
Fascinating as they are, riddles are a good way to hone one's thinking and break the habit of a short attention span. So I thought I should try a few. The other side of it is to see whether there is a brute force approach to solve some of them programmatically.
I set up my iGoogle page a while back, but never had the courage to really try some of those riddles; I was more interested in weather, sports and other stuff. Yesterday I saw one pop up on my screen. It says: "You have a 10 digit number, and the alphabet for this number, as we all know, is the set {0,1,...,9}. We need to come up with a number in which each digit indicates how many times its position's digit appears in the number. So the leftmost position (position 0) holds the count of zeros in the number, position 1 holds the count of ones, and so on".
For a 10 digit number, the range is [0000000000, 9999999999]. A brute force approach that takes one candidate at a time and validates it would be pretty time consuming even for a program!
So I started with the easy question: how many zeros can I put in the number and still get away with the stipulated conditions? Try 9000000000. But then I have to put a 1 at the 9th position (note the indexing is zero based), and that number fails. So I thought, start with 8 zeros, and it came out to 8000000010. Bang!
Not sure how many such numbers exist within that range, but perhaps we can try a brute force algorithm. Or can we use some backtracking here? I have not thought it through yet :)
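Here is a minimal sketch of how such a search could be coded, assuming only the rule stated above (the digit at position i equals the count of digit i). The pruning relies on the fact that the ten counts must add up to ten, since each of the ten positions is counted exactly once; the names and structure are just illustration, not a worked-out solution.

#include <stdio.h>

#define N 10

static int d[N];   /* candidate digits, d[0] is the leftmost position */

/* Check a complete candidate: d[i] must equal the count of digit i. */
static int is_self_describing(void)
{
    int count[N] = {0};
    for (int i = 0; i < N; i++)
        count[d[i]]++;
    for (int i = 0; i < N; i++)
        if (d[i] != count[i])
            return 0;
    return 1;
}

/* Place one digit per position, pruning prefixes whose digit sum
 * already exceeds 10 (the counts of a valid number must sum to 10). */
static void search(int pos, int sum)
{
    if (sum > N)
        return;
    if (pos == N) {
        if (is_self_describing()) {
            for (int i = 0; i < N; i++)
                printf("%d", d[i]);
            printf("\n");
        }
        return;
    }
    for (int v = 0; v < N; v++) {
        d[pos] = v;
        search(pos + 1, sum + v);
    }
}

int main(void)
{
    search(0, 0);   /* prints every 10-digit number satisfying the rule */
    return 0;
}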
And another one I saw was about permutation cycles. I did not try to solve it. It would be interesting to solve, because it has a strong mathematical structure, so approaching it from a non-mathematical side would be the challenge I want my brain to take on. The reason? Mathematics is good and rigorous, but intuition is a weapon that leads to sleek solutions to problems.
Well, if one gets a half hour a day, I would strongly encourage trying some of them.
Algorithmic Complexity
I must say that I've seen way too many smart software engineers who lack an understanding of algorithmic complexity. I'm not sure whether they never had a basic course or just slept through those classes, since the topic hardly seems to matter when hacking some code into working condition. So a good basic knowledge of it is certainly helpful.
So what is this Algorithmic complexity anyway?
Well, consider a simple program first. Now in a structured programming language, one would code with some of the following constructs -
- loops - for loop, while, do while etc.
- conditional statements - if then else and its variant
- sequential flat statements - Initialization, Allocation of memory etc.
- api calls to systems - malloc, strcpy etc.
- Finally, composition of functions to form the program - main(), foo(), bar() etc.
In a loop, there could be function calls and other statements. If it is a function call, find the complexity of the function and substitute it there. If it is a statement, find the complexity of the statement. Since finding the complexity of a function really boils down to finding the complexity of the statements it uses, knowing the complexity of different kinds of statements is all that is needed here.
Now any initialization of a scalar variable takes constant time, say k units of time. The unit of time is unnecessary to define at this stage, since the idea is to analyze the complexity and compare it with other algorithms that solve the same problem. Initializing a vector variable, for example an array, takes n * k units of time, where n is the size of the array. In complexity-analysis terms, constant time is written as O(1), and n * k is written as O(n), where n is the size of the array.
A conditional statement's complexity is defined to be the maximum complexity over all of its branches. So in a simple if-then-else statement, if the if block has complexity O(n) and the else block has complexity O(1), then the whole conditional statement has complexity O(n).
A loop's complexity is defined to be the loop's bound multiplied by the maximum complexity of the statements inside the loop. If the loop runs through a list or a one-dimensional array of size n, the loop's bound is O(n). Now if the maximum complexity of the statements inside the loop is O(n), then the whole loop has complexity O(n) * O(n) = O(n^2).
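To make these rules concrete, here is a small, hypothetical C fragment with each construct annotated with its worst-case cost; the function and the sizes are made up purely for illustration:

#include <string.h>

/* Annotated sketch: each construct is tagged with its worst-case cost. */
void complexity_demo(int *a, int n)
{
    int sum = 0;                      /* scalar initialization: O(1)          */

    memset(a, 0, n * sizeof(int));    /* touches all n elements: O(n)         */

    if (n > 100)                      /* conditional: max of its branches     */
        sum = a[0];                   /*   this branch is O(1)                */
    else
        for (int i = 0; i < n; i++)   /*   this branch is O(n)                */
            sum += a[i];              /*   => whole if/else counts as O(n)    */

    for (int i = 0; i < n; i++)       /* outer loop bound: O(n)               */
        for (int j = 0; j < n; j++)   /* inner loop bound: O(n)               */
            if (a[i] == a[j])         /*   O(1) body                          */
                sum++;                /*   => nested loops: O(n)*O(n) = O(n^2) */

    (void)sum;                        /* total: O(n^2) + O(n) + O(1) = O(n^2) */
}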
But remember that a loop using while or do-while is a little tricky, since the condition that drives the loop can evaluate to false under different circumstances, so the worst case is what we need to take. By the way, all of this analysis is worst-case analysis, i.e. an upper bound, so we are using Big-O here. Usually, finding the minimum complexity is quite a bit harder; for a lot of algorithms the minimum complexity is unknown.
After analyzing all the components, you will come up with a polynomial of the form -
O(n^k) + O(n^(k-1)) + ... + O(n^2) + O(n) + O(1).
Moreover, some algorithms have complexity like O(log n), O(log log n) etc. These terms are also bounded by the general polynomial form stated above.
This is where the polynomial complexity classes come in: P, NP and so on. NP is defined in a slightly different way. When we try to solve some problem and see that there is no fixed k for which we can get a complexity of the above form, it does not look like the problem has a polynomial-time algorithm; for example, every known algorithm for it may have exponential complexity. When this happens, the problem may be NP-hard. Graph algorithms and other optimization problems often run into this type of hard problem. See an algorithms book for more detail.
When someone suspects a problem is NP-hard, the next step is to reduce a known NP-complete problem to this new problem using a polynomial-time algorithm. If the reduction is mathematically correct (and the new problem is itself in NP), the new problem belongs to the NP-complete class. What this tells us is that if any problem in the NP-complete class has a polynomial-time solution, then the whole class has polynomial-time solutions. Showing that any problem in the NP-complete set has polynomial complexity is a really hard problem, and it is also quite hard to come up with a polynomial-time reduction from a known NP-complete problem to a new problem. For a lot of problems, finding such a reduction is no small feat, and that is why we see many problems that are merely known to be NP-hard, for which some of the brightest people have not found such a reduction yet.
So in summary, most complexity analysis is really for worst-case scenarios, hence upper bound analysis. Finding the lower bound for a problem is a bit more difficult. For example, sorting by comparison has an Omega(n log n) lower bound, but for a lot of problems no tight lower bound is known. And when a problem is classified as NP-hard or NP-complete, alternative options - restricting the problem domain, heuristics and approximation, probabilistic algorithms - are the approaches for finding good-enough solutions. For example, a lot of graph problems are either NP-hard or NP-complete, but by restricting the domain of those graphs, especially to recursively definable graphs, linear-time solutions can be found. Note that saying recursively definable restricts the domain of all possible graphs.
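As one concrete illustration of the heuristics-and-approximation route, here is a minimal C sketch (with a made-up toy graph) of the textbook 2-approximation for minimum vertex cover, an NP-hard optimization problem: repeatedly pick an edge that is not yet covered and take both of its endpoints.

#include <stdio.h>
#include <stdbool.h>

#define NVERT 6

struct edge { int u, v; };

int main(void)
{
    /* A small made-up graph, given as an edge list. */
    struct edge edges[] = { {0,1}, {0,2}, {1,3}, {2,4}, {3,5}, {4,5} };
    int nedges = sizeof edges / sizeof edges[0];
    bool in_cover[NVERT] = { false };

    /* Greedy 2-approximation: for every edge not yet covered,
     * put BOTH of its endpoints into the cover. */
    for (int i = 0; i < nedges; i++) {
        int u = edges[i].u, v = edges[i].v;
        if (!in_cover[u] && !in_cover[v]) {
            in_cover[u] = true;
            in_cover[v] = true;
        }
    }

    printf("vertex cover (at most 2x optimal):");
    for (int i = 0; i < NVERT; i++)
        if (in_cover[i])
            printf(" %d", i);
    printf("\n");
    return 0;
}

The cover it produces is guaranteed to be at most twice the size of an optimal one, and it runs in time linear in the number of edges.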
That's it about Computational complexity analysis.
Concurrent Programming!
Hmm, we can't really get by without it! If so, then what do we gain from it?
Because of multi-core and multi-processor machines, we can no longer rely on single-processor sequential programming to achieve better performance; we need concurrency. The idea behind it is to have each processor do a part of the work so the total work finishes sooner. Two heads are better than one! Four are better than two! And so on...
Not necessarily! Some work is inherently sequential, some is purely parallel, and some falls in between the two extremes. Identifying them, and estimating how much parallel processing can be done, is the essential first step in achieving good concurrency. For example, merge sort falls in between, linked list processing could be purely sequential, and matrix multiplication could be purely parallel.
Though the above classification is somewhat fuzzy, the discussion here is about kernel mode drivers. How much parallelism can we get? And how exactly can we estimate the speed-up? In other words, I have a driver, and my concern is: will it scale as the number of processors scales? And what should the approach be for estimating the speed-up?
Well, you will see books and articles about parallel and concurrent programming that mostly take scientific programs, analyze how best they can be parallelized, and sometimes calculate the estimated speed-up. If I'm posed with this question, I would take the following steps first -
- Take Amdahl's law for the maximum speed-up.
- Divide the driver code by frequency of execution to define the workload in measurable units.
- Look at the places where we need locks: coarse or fine grained.
- Then apply the formula to get an estimate.
Why go through such trouble? Because the estimate points us at the areas to improve. Let's look at Amdahl's formula.
Assume that the work takes 1 unit of time when run sequentially. Now the work has some sequential part. If the sequential part is zero, then it is purely parallel work, and it scales with the number of processors - yes, a somewhat idealized condition. Assume the parallelizable part takes p units of time, with 0 < p < 1, and the sequential part takes 1 - p units of time. Then if we have n processors to work on this, the time it takes is -
TPn = Time spent by n parallel processors = 1 - p + p/n. So the Speed up is 1 / ( 1 - p + p/n ).
Example: Suppose half of the work is purely sequential, and the other half is parallelizable. Then the speed-up with 2 processors is 1 / ( 1 - 1/2 + 1/(2*2) ) = 1 / ( 1/2 + 1/4 ) = 4/3. So basically the result shows that we have just added 1/3 more processing capacity, mathematically speaking!
For drivers, you cannot just assume everything runs straight to completion. Some parts run on demand, some parts run intermittently, and other parts run very frequently. So before we even look at which part is inherently sequential and which part(s) are parallelizable, we have to inject the frequency to get a weighted way of estimating the speed-up.
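As a rough sketch of what such a weighted estimate could look like, here is a small C fragment that weights each driver path by its share of execution time and applies Amdahl's formula per path. The path names, weights and parallel fractions are made-up numbers purely for illustration, not measurements from a real driver:

#include <stddef.h>
#include <stdio.h>

/* One driver execution path: its share of total run time and how much
 * of that path is parallelizable. These numbers are assumptions picked
 * for illustration only. */
struct path {
    const char *name;
    double weight;    /* fraction of total sequential run time, sums to 1 */
    double parallel;  /* parallelizable fraction of this path, 0..1       */
};

/* Amdahl: time on n processors for one unit of work that is
 * p-parallelizable is (1 - p) + p/n. */
static double amdahl_time(double p, int n)
{
    return (1.0 - p) + p / n;
}

int main(void)
{
    /* Hypothetical workload profile for a driver. */
    struct path paths[] = {
        { "interrupt / DPC path", 0.60, 0.70 },  /* runs very frequently   */
        { "ioctl path",           0.30, 0.40 },  /* runs on demand         */
        { "init / teardown",      0.10, 0.00 },  /* rare, fully sequential */
    };
    int nprocs = 4;
    double total = 0.0;

    for (size_t i = 0; i < sizeof paths / sizeof paths[0]; i++)
        total += paths[i].weight * amdahl_time(paths[i].parallel, nprocs);

    printf("estimated speed-up on %d processors: %.2f\n", nprocs, 1.0 / total);
    return 0;
}

The heavier a path's weight and the smaller its parallel fraction, the more it drags the overall speed-up down, which is exactly where the tuning effort should go.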
Once we have an estimate, our push would be to take the most heavily weighted sequential part and try to parallelize it. As we all know, any place we guard using locks is essentially sequential; critical sections are sequential execution paths, and our objective is to push code out of locks as much as possible.
In summary, next time you are asked how much speed-up you achieved by modifying your critical section, you don't have to run some specific tests to answer the question. Moreover, your tests may not be representative of the full spectrum of running conditions of your driver. You should be able to plug the numbers into the formula and answer with the estimated speed-up! Wonderful, at least I think so!!!
Looking through the microscope - performance!
The producer-consumer approach is very popular in network driver design. Transmit and receive descriptors are usually laid out in physical pages as shared memory between the host and the device, and there is a producer index and a consumer index. To give it a ring structure, these indices are incremented modulo the total memory each descriptor table takes, so the indices wrap around.

Depending on the device memory available and the size of the descriptor structure, we could use a native type like a 16-bit unsigned integer, so that we never have to check for wrap-around when incrementing the indices. But 16 bits covers 64K, so if the structure size is, say, 64 bytes, we would need about 1K descriptors before the indices wrap naturally. If the device instead provides, say, 16KB of memory for this, one option is to check for a possible wrap-around after each increment of an index. That seems like an extra burden in a high-performance interrupt path, so the alternative would be to define a bit field that wraps automatically after 16KB worth of entries. We avoid the explicit check, and might expect a performance boost.

But if we look through the microscope, we might come to a different conclusion. In the explicit test-and-reset case there is hardly any branch hazard to worry about - think about it! Looking at the code gcc generates for an example case, I was surprised to see that the bit field approach is indeed slower.
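For reference, here is a minimal sketch of the two alternatives being compared, assuming a hypothetical ring of 256 descriptors (16KB of device memory divided by 64-byte descriptors); the names and sizes are made up for illustration:

#include <stdint.h>

#define RING_ENTRIES 256u   /* assumed: 16KB of device memory / 64-byte descriptors */

/* Alternative 1: plain integer index with an explicit wrap check. */
struct ring_checked {
    uint32_t prod_idx;
};

static inline void ring_checked_advance(struct ring_checked *r)
{
    r->prod_idx++;
    if (r->prod_idx >= RING_ENTRIES)   /* well-predicted branch in the hot path */
        r->prod_idx = 0;
}

/* Alternative 2: a bit field sized to the ring, wrapping "for free". */
struct ring_bitfield {
    unsigned int prod_idx : 8;         /* 8 bits index 256 entries, wraps on overflow */
    unsigned int pad      : 24;
};

static inline void ring_bitfield_advance(struct ring_bitfield *r)
{
    r->prod_idx++;                      /* compiler emits a load/mask/or sequence */
}

The gcc listing below tells the story - take a look!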
GAS LISTING /tmp/ccF3PvQX.s page 1
1 .file "sundry.c"
4 .text
5 Ltext0:
211 .section .rdata,"dr"
212 LC0:
213 0000 6D796269 .ascii "mybitfield.mytestfield is:%d\0"
213 74666965
213 6C642E6D
213 79746573
213 74666965
214 001d 000000 .text
216 .globl _bitfield_0
218 _bitfield_0:
1:sundry.c **** /*
2:sundry.c **** * sundry.c
3:sundry.c **** *
4:sundry.c **** * Created on: Sep 15, 2010
5:sundry.c **** * Author: Prokash Sinha
6:sundry.c **** */
7:sundry.c **** #include <stdio.h>
8:sundry.c ****
9:sundry.c **** void bitfield_0()
10:sundry.c **** {
220 LM1:
221 0000 55 pushl %ebp
222 0001 89E5 movl %esp, %ebp
223 0003 83EC18 subl $24, %esp
11:sundry.c ****
12:sundry.c **** unsigned int uintvar =0;
225 LM2:
226 0006 C745FC00 movl $0, -4(%ebp)
226 000000
13:sundry.c ****
14:sundry.c **** struct {
15:sundry.c **** unsigned int : 0;
16:sundry.c **** unsigned int mytestfield: 4;
17:sundry.c **** unsigned int pad: 28;
18:sundry.c **** }mybitfield;
19:sundry.c ****
20:sundry.c **** mybitfield.mytestfield = 7;
228 LM3:
229 000d 8B45F8 movl -8(%ebp), %eax
230 0010 83E0F0 andl $-16, %eax
231 0013 83C807 orl $7, %eax
232 0016 8945F8 movl %eax, -8(%ebp)
21:sundry.c **** printf("mybitfield.mytestfield is:%d", mybitfield.mytestfield);
234 LM4:
235 0019 0FB645F8 movzbl -8(%ebp), %eax
236 001d 83E00F andl $15, %eax
237 0020 89442404 movl %eax, 4(%esp)
238 0024 C7042400 movl $LC0, (%esp)
238 000000
239 002b E8000000 call _printf
239 00
22:sundry.c ****
23:sundry.c **** /*show that inc beyond the width makes it zero */
24:sundry.c **** mybitfield.mytestfield = 16;
241 LM5:
242 0030 8B45F8 movl -8(%ebp), %eax
243 0033 83E0F0 andl $-16, %eax
244 0036 8945F8 movl %eax, -8(%ebp)
25:sundry.c **** printf("mybitfield.mytestfield is:%d", mybitfield.mytestfield);
246 LM6:
247 0039 0FB645F8 movzbl -8(%ebp), %eax
248 003d 83E00F andl $15, %eax
249 0040 89442404 movl %eax, 4(%esp)
250 0044 C7042400 movl $LC0, (%esp)
250 000000
251 004b E8000000 call _printf
251 00
26:sundry.c ****
27:sundry.c **** /*show just the instructions to inc w/o worry for wrapping */
28:sundry.c **** mybitfield.mytestfield++ ;
253 LM7:
254 0050 0FB645F8 movzbl -8(%ebp), %eax
255 0054 83E00F andl $15, %eax
256 0057 8D5001 leal 1(%eax), %edx
257 005a 8B45F8 movl -8(%ebp), %eax
258 005d 83E20F andl $15, %edx
259 0060 83E0F0 andl $-16, %eax
260 0063 09D0 orl %edx, %eax
261 0065 8945F8 movl %eax, -8(%ebp)
29:sundry.c ****
30:sundry.c **** /* show how many instruction it generates */
31:sundry.c **** uintvar++;
263 LM8:
264 0068 8D45FC leal -4(%ebp), %eax
265 006b FF00 incl (%eax)
32:sundry.c **** if ( uintvar > 4000000)
267 LM9:
268 006d 817DFC00 cmpl $4000000, -4(%ebp)
268 093D00
269 0074 7607 jbe L1
33:sundry.c **** uintvar = 0;
271 LM10:
272 0076 C745FC00 movl $0, -4(%ebp)
272 000000
273 L1:
34:sundry.c ****
35:sundry.c **** }
275 LM11:
276 007d C9 leave
277 007e C3 ret
282 Lscope0:
286 .globl _main
288 _main:
36:sundry.c ****
37:sundry.c **** void main (void)
38:sundry.c **** {
290 LM12:
291 007f 55 pushl %ebp
292 0080 89E5 movl %esp, %ebp
293 0082 83EC08 subl $8, %esp
294 0085 83E4F0 andl $-16, %esp
295 0088 B8000000 movl $0, %eax
295 00
296 008d 83C00F addl $15, %eax
297 0090 83C00F addl $15, %eax
298 0093 C1E804 shrl $4, %eax
299 0096 C1E004 sall $4, %eax
300 0099 8945FC movl %eax, -4(%ebp)
301 009c 8B45FC movl -4(%ebp), %eax
302 009f E8000000 call __alloca
302 00
304 LM13:
305 00a4 E8000000 call ___main
305 00
39:sundry.c **** bitfield_0();
307 LM14:
308 00a9 E852FFFF call _bitfield_0
308 FF
40:sundry.c **** }
310 LM15:
311 00ae C9 leave
312 00af C3 ret
313 Lscope1:
315 .text
317 Letext:
DEFINED SYMBOLS
*ABS*:00000000 sundry.c
/tmp/ccF3PvQX.s:218 .text:00000000 _bitfield_0
/tmp/ccF3PvQX.s:288 .text:0000007f _main
UNDEFINED SYMBOLS
___main
__alloca
_printf