This is going to be my next blog post. I will try to cover the main ideas and what to look for!
Before I start, I need to revisit PCIe error handling. It confused me a lot - especially when I'm responsible for shedding light on the design alternatives in architectural meetings. The first thing to note is that PCIe is a packet-based serial communication method. Since it is packet based, it has different layers to handle the complexity of communication between devices on point-to-point links. Different layers can find error situations and can trigger events to report errors, so error classification, reporting, and handling should be a fundamental paradigm for such protocols. It is not new though; TCP/IP and other layer 3 and 4 protocols do this type of error management.
A few fundamentals, though an aside, are the differences between asynchronous and synchronous communication. Any packet-based communication is asynchronous. Synchronizing clock speed proved to be both difficult and time consuming, hence asynchronous packetized protocols are the current trend, and clocking information is carried in the handshaking or link-training packets. Note that I did not go into detail on synchronous versus asynchronous communication, but to be clear, PCIe is really asynchronous serial communication between devices on a link. It employs 8b/10b self-synchronizing line coding. The signaling scheme is also quite elegant: there are no separate address and data signals as there were in PCI, and there is no side-band clock signal alongside the data. It is also scalable, since devices can have x4, x8, x16, and x32 lanes. An x4 configuration means 4 lanes between two devices, and each lane is a pair of unidirectional bit flows, one in each direction. x4 has a theoretical limit of about 1 GB/s, so if x16 is used, close to 4 GB/s is possible! This is pretty much the base specification, PCIe 1.0; later versions (2.x and 3.0) achieve higher speeds.
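For reference, the arithmetic behind those numbers (gen-1 signaling, per direction): 2.5 Gb/s raw per lane x 8/10 for the 8b/10b coding overhead = 2 Gb/s = 250 MB/s of payload per lane; 250 MB/s x 4 lanes is roughly 1 GB/s for x4, and 250 MB/s x 16 lanes is roughly 4 GB/s for x16.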
Now coming back to the PCIe error mechanism. Since PCIe has to maintain backward compatibility, it supports the old PERR# and SERR# reporting. PERR# is for data parity errors, SERR# for system errors. While PERR# errors are potentially recoverable, SERR# errors are usually considered unrecoverable. PCI-X basically follows the same rules as PCI but defines device-specific error handling; it is really a prelude to the more comprehensive error handling in PCIe.
Since error detection, reporting, and handling are discussed here from the PCIe device point of view rather than the platform point of view, the discussion will stress only the device side. Moreover, I will try to stress some apparently salient points and the associated complexities that come with virtualization of device functions. Briefly, when we introduce virtual functions off of a physical function and try to assign them to virtual machines (VMs), each VM is now running a PCIe device (sort of).
First, the detection mechanism from the driver. In its simplest form, the device should have an error status register, as in the PCI config space. But that is just the simplest case. Since errors can occur at several protocol layers, not all errors need to percolate up to the top layer for detection by the driver, and how the device (i.e., the endpoint) reports them can vary. For example, most of the TLP (transaction layer) errors actually show up in the error status register. Because of PCIe's backward compatibility, both the PCI-compatible status register and the PCIe error status register get set; but software written against one bus specification does not necessarily clear the bits of the other bus type when it is done with error handling. So the driver will know from the error status register whether an error got reported. At the driver level, this is the basic error detection.
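As a concrete (if simplified) illustration of that dual bookkeeping, here is a sketch of what a driver-level check might look like. The register offsets and bit positions are from the PCI/PCIe specs, but pci_cfg_read16/pci_cfg_write16 are placeholders for whatever config-space accessors your OS provides, and the PCIe capability base is assumed to have been found already by walking the capability list (capability ID 0x10).

```c
#include <stdint.h>

#define PCI_STATUS              0x06    /* legacy PCI status register        */
#define PCI_STATUS_PARITY       0x8000  /* Detected Parity Error             */
#define PCI_STATUS_SIG_SERR     0x4000  /* Signaled System Error             */

/* Offsets relative to the PCI Express capability structure (cap ID 0x10). */
#define PCIE_CAP_DEVSTA         0x0A    /* Device Status register            */
#define PCIE_DEVSTA_CED         0x0001  /* Correctable Error Detected        */
#define PCIE_DEVSTA_NFED        0x0002  /* Non-Fatal Error Detected          */
#define PCIE_DEVSTA_FED         0x0004  /* Fatal Error Detected              */

uint16_t pci_cfg_read16(int bdf, int off);                 /* assumed helpers */
void     pci_cfg_write16(int bdf, int off, uint16_t val);

void check_and_clear_errors(int bdf, int pcie_cap_base)
{
    uint16_t sta  = pci_cfg_read16(bdf, PCI_STATUS);
    uint16_t dsta = pci_cfg_read16(bdf, pcie_cap_base + PCIE_CAP_DEVSTA);

    if (sta & (PCI_STATUS_PARITY | PCI_STATUS_SIG_SERR)) {
        /* legacy-compatible (PCI) error bits are set */
    }
    if (dsta & (PCIE_DEVSTA_CED | PCIE_DEVSTA_NFED | PCIE_DEVSTA_FED)) {
        /* PCIe-native error bits are set */
    }

    /* Both sets of bits are RW1C: write the 1s back to clear them,
     * otherwise software on the "other" bus model may see stale errors. */
    pci_cfg_write16(bdf, PCI_STATUS, sta);
    pci_cfg_write16(bdf, pcie_cap_base + PCIE_CAP_DEVSTA, dsta);
}
```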
To understand the reporting, transactions are classified into (1) non-posted requests: reads, I/O writes, and configuration requests, and (2) posted requests: messages and memory writes. For a non-posted request, the completer reports an error using the completion status; the reporting targets are the requester and optionally the root complex, and usually it is the requester who decides how to handle the error. For posted requests, the requester's TLP does not expect any completion TLP to be returned from the completer (i.e., it is fire-and-forget), so the completer creates an error message and sends it to the root complex, and the root complex needs to handle the error.
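For the non-posted case, the 3-bit completion status field is what the requester gets to look at. The encodings below are from the PCIe spec; the handler around them is just my illustration of "the requester decides".

```c
/* Completion Status field of a completion TLP (non-posted requests). */
enum cpl_status {
    CPL_SC  = 0x0,   /* Successful Completion                 */
    CPL_UR  = 0x1,   /* Unsupported Request                   */
    CPL_CRS = 0x2,   /* Configuration Request Retry Status    */
    CPL_CA  = 0x4,   /* Completer Abort                       */
};

static void requester_handle_completion(enum cpl_status st)
{
    switch (st) {
    case CPL_SC:  /* data (if any) is valid, nothing to do        */ break;
    case CPL_CRS: /* retry the configuration request later        */ break;
    case CPL_UR:
    case CPL_CA:  /* requester decides: log, retry, or give up    */ break;
    }
}
```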
From the software driver point of view, it is vital to look at what level of device support exists in the chipset for PCIe compatibility. One area becomes a decision point when we introduce virtual functions off of a physical function. Depending on the OS, there could be different instances of the driver in kernel space, or the driver could have different contexts and scopes. When an error gets reported, perhaps every instance of the driver, or every context, might detect it. Now who should handle this, and how? This is where the host VM comes into play: any VM but the host should report the error back to the host VM for handling!
I will end this installment here. I did not cover a lot of things, for example: error sources, error message types, advanced error reporting, DLLP and TLP error types, etc.
In the past I only used kernel APIs to probe and use PCI-related resources. The fun (or the dumb) part of it is that most of the work is done for you, and in most cases it is pretty routine to get those resources and use them...
But recently, I was using PCI scan and configuration probing to get some data out of a device. Depending on the chipset, the information I get, or the programmable interface I have to the chipset, can be pretty raw and complicated. Usually these chipsets are quite powerful, having multiple processors and a couple of megabytes of storage. This is mainly due to convergence; in my case it was Fibre Channel and a network interface together.
In my case, I was probing for manufacturer's data, including 8 to 10 boot codes that are burned into EEPROM, NVRAM, or option ROM. The PCI spec gives 256 bytes of PCI config space, and often vendors don't use all of it, so information is missing; on top of that, the end tag (0x78) is not placed at the end of the space but somewhere in the middle. Any processing could just read the whole space or stop at the end marker. But the main part is to read the PCIR (PCI ROM) structure to find what the boot codes and their versions are.
The fun part is that the PCI spec says that at offset 0x18 from the ROM image base address there is the PCIR pointer, and from there the respective boot code image information can be retrieved. But when option ROM or NVRAM space is tight, that may no longer be true. Finding these offsets is quite challenging, and UEFI diagnostic tools are quite handy. Once the structure is found, a 4-byte entity could be in big-endian format or not, so that requires yet another hack to figure out whether we need a byte swap.
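Putting the pieces together, here is a minimal sketch of walking the option ROM images via the 0x18 pointer, with the caveat mentioned above that the pointer must be validated against the "PCIR" signature before it is trusted. The offsets are from the PCI firmware spec; the helper names and the idea that the ROM has already been read into memory are my assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>
#include <stdio.h>

/* Read a 16-bit little-endian field regardless of host endianness. */
static uint16_t rd16(const uint8_t *p) { return (uint16_t)(p[0] | (p[1] << 8)); }

void walk_rom_images(const uint8_t *rom, size_t rom_len)
{
    size_t off = 0;
    for (;;) {
        if (off + 0x1A > rom_len) break;
        if (rd16(rom + off) != 0xAA55) break;          /* ROM image signature     */

        uint16_t pcir_off = rd16(rom + off + 0x18);    /* pointer to PCIR struct  */
        if (off + pcir_off + 0x18 > rom_len) break;
        const uint8_t *pcir = rom + off + pcir_off;
        if (memcmp(pcir, "PCIR", 4) != 0) break;       /* pointer was bogus       */

        uint8_t  code_type = pcir[0x14];               /* 0 = x86 BIOS, 3 = UEFI  */
        uint16_t img_len   = rd16(pcir + 0x10);        /* in 512-byte units       */
        printf("image at 0x%zx, code type %u, %u bytes\n",
               off, code_type, (unsigned)img_len * 512);

        if (pcir[0x15] & 0x80) break;                  /* indicator: last image   */
        off += (size_t)img_len * 512;
    }
}
```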
Well, it was no fun when I started, but once all the information was found without relying on the flawed information provided by the docs, it turned out to be fun!
I was recently trying to dig deep into PCI/PCI-X/PCIe interrupts. The basic concept of an interrupt is known to most of us: it is an asynchronous event to get the attention of the CPU(s). In the older days there was a pin on the CPU; a device asserted the pin, the CPU checked at an instruction boundary whether a device needed attention, and it proceeded accordingly based on priority. When it comes to electronics, real estate is not cheap, but we hungry people want more and more devices attached to our base gadgets. So naturally interrupt sharing, interrupt delivery, and interrupt priorities came into the design process. And as old news is no news, excessive delay in servicing an interrupt is no service.
So that was the start of the PIC, the programmable interrupt controller. Then came multiprocessors, and interrupt delivery to specific subsets of the processors, and the IOAPIC was born. In its primitive form, an interrupt could be broadcast to all the processors, and depending on the mask a processor would either ack it or ignore it, depending on the policies. Some of this falls in the area of APIC bus technology.
Finally (as of now) came message-based interrupts, as opposed to the older line-based interrupts. The idea is to have lots of different types of interrupts served by a set of processors. With the older line-based interrupts there were two main steps: the interrupt acknowledge, and receiving the vector index so that the OS could jump to the service routine and serve the interrupt. From the HW point of view, at a high level, it is a two-step process. For MSI-X, the story is quite different, and somewhat not so well documented...
The PCI specs and the Mindshare books (PCI Architecture, PCI-X, and PCI Express) do not spell out the sequence of events and the corresponding handlers at the HW level that service interrupts. We see how the configuration space gets primed. We see how the device uses the MSI address register and data register to signal (or, to be precise, to do a posted PCI write transaction). But how does it get to the CPU? When does a CPU know that an interrupt is pending, stop whatever is being done, and go service the interrupt? Well, read those books and let me know if there is a clear way it has been described...
But if the web is your friend (well, it is not my friend, since walking outside often gives me trouble bumping into invisible spider webs), you can search and get some hints. Basically, when the transaction is posted to some specific location (yeah, hand-waving here), some component of the root complex or some other chipset logic knows that it is not a regular PCI write transaction but an MSI-X write transaction, so it tries to get the CPU's attention. There are ICRs (interrupt control registers), and those come into play...
So basically a few things are important -
1) When a device posts the (MSI/MSI-X) write transaction, who lets the CPU know that an interrupt has arrived, and how does it do it?
2) When does the CPU go check whether an interrupt is pending? It is still at an instruction boundary!
3) When it knows an interrupt is pending, where does it get the index from to vector into the MSI-X interrupt vector table?
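To make question (3) a bit more concrete, here is a sketch of one MSI-X table entry as the PCI spec lays it out, together with the x86 convention that, as far as I can tell, answers question (1): the OS programs each entry with an address in the 0xFEExxxxx range, so when the device posts a memory write there, the root complex / host bridge recognizes it as an interrupt message rather than an ordinary memory write and forwards it to the local APIC; the low byte of the data field supplies the vector, and the CPU still takes the interrupt at an instruction boundary (question 2). The struct and field names below are mine, not from any header.

```c
#include <stdint.h>

/* One MSI-X table entry: 16 bytes, living in a BAR-mapped table on the device. */
struct msix_table_entry {
    uint32_t msg_addr_lo;   /* on x86, 0xFEExxxxx: encodes the destination APIC  */
    uint32_t msg_addr_hi;   /* upper 32 bits of the message address              */
    uint32_t msg_data;      /* low byte = vector number, plus delivery mode bits */
    uint32_t vector_ctrl;   /* bit 0 = per-vector mask                           */
};
```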
Long back, when I first started hacking on IBM TSO and UNIX, I was told by one of my coworkers that the only thing constant in this industry is change. And not much to my surprise, I've been trying to live with it. Hence there are two digressions: not finishing the NDIS 6.0 hacks, and not finishing the UI side of the combinatorial game Hackenbush. I will have to get to them later...
For now, I'm trying to wrap my head around Mac OS X. If you have been around the hacking area for long, you know strange thoughts come around quite often. A few months ago, I was trying to refresh the basic paging mechanism in NT. Long back, in the 16-bit days, we knew that executables are file-mapped in NT, meaning loading a large exe is fast. So the obvious question is: what is the backing store for an app or for a driver? I recall that long back, NT used to keep the driver binary file-mapped, but due to online updates and other security measures the backing store for drivers is now the paging file. Now the question is: if I try to load a whole lot of drivers, can it make the system page file overly crowded? In Mac OS X (xnu), there is not a single paging file if we configure it correctly; paging files are created and merged back when the system is not heavily loaded.
When it comes to user vs. kernel protection, we know that the kernel essentials are share-mapped into user address spaces, both in Windows and Linux. Remember the 2 GB or 3 GB user address space; the rest is the kernel's property. Mac OS X does not share this idea that much, so the kernel is in its own address space. Quite nice, since a lot of xnu ideas are from Mach. But system calls and context switching are a bit heavier, hence less optimized. Then again, processor speeds became many-fold better, so it is a nice trade-off too.
When it comes to bitness or architecture, a natural question is: when I have a CD that has the OS for both PowerPC (ppc) and x86, what happens at installation time, and what are the forms of those binaries? While I've not figured out all the questions I have, Mach-O, the binary format, does accommodate multi-architecture representation in a single file. In other words, if I build an executable for ppc, x86, and x64, I can combine them into a single binary - called a fat binary - that can be used on any of these architectures!
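For the curious, here is roughly what the front of such a fat binary looks like, using the real structures from <mach-o/fat.h>; the one wrinkle is that the fat header and its fat_arch entries are always stored big-endian, regardless of the host. The little walker function is just my illustration, not an Apple API.

```c
#include <mach-o/fat.h>   /* struct fat_header, struct fat_arch, FAT_MAGIC */
#include <arpa/inet.h>    /* ntohl() to undo the big-endian on-disk layout */
#include <stdint.h>
#include <stdio.h>

void list_fat_archs(const struct fat_header *fh)
{
    if (ntohl(fh->magic) != FAT_MAGIC) {
        printf("not a fat binary\n");
        return;
    }
    const struct fat_arch *fa = (const struct fat_arch *)(fh + 1);
    uint32_t n = ntohl(fh->nfat_arch);
    for (uint32_t i = 0; i < n; i++)
        printf("arch %u: cputype %d, offset 0x%x, size 0x%x\n",
               i, (int)ntohl((uint32_t)fa[i].cputype),
               ntohl(fa[i].offset), ntohl(fa[i].size));
}
```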
Now that I realized there has to be more than one random tree just to start with, I had to be sure there is a way to dynamically find out where the trees are placed. Visually, anyone can draw multiple trees on a bounded area, but programmatically it requires some thought. This is an area I need to improve in the implementation, but for now it looks fairly nice. So essentially I will leave this at that and come back to it later.
The next thing to tackle is removing tree edges when a user cuts an edge. Ah, this is a familiar problem in graph algorithms: we just need to find the connected components and drop the component that does not contain the root. As it turned out, a BFS (breadth-first search) table becomes very handy. Note also that cutting any internal edge of a tree turns it into two trees.
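Here is a minimal sketch of that cut rule, assuming a simple adjacency-matrix representation of the drawing (the names and bounds are made up for illustration): remove the clicked edge, BFS from the grounded root, and delete every edge that can no longer reach the root.

```c
#include <stdbool.h>

#define MAX_NODES 256

int n_nodes;
int adj[MAX_NODES][MAX_NODES];   /* adj[u][v] != 0 means edge u-v exists */

void cut_edge(int root, int u, int v)
{
    adj[u][v] = adj[v][u] = 0;            /* remove the edge the user cut  */

    bool seen[MAX_NODES] = { false };
    int  queue[MAX_NODES], head = 0, tail = 0;

    queue[tail++] = root;                 /* BFS from the grounded root    */
    seen[root] = true;
    while (head < tail) {
        int x = queue[head++];
        for (int y = 0; y < n_nodes; y++)
            if (adj[x][y] && !seen[y]) {
                seen[y] = true;
                queue[tail++] = y;
            }
    }

    /* drop every edge whose endpoints are no longer reachable from root */
    for (int a = 0; a < n_nodes; a++)
        for (int b = 0; b < n_nodes; b++)
            if (adj[a][b] && (!seen[a] || !seen[b]))
                adj[a][b] = adj[b][a] = 0;
}
```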
Now I can have multiple trees on the canvas and I can start the play, so when I cut an edge of my choice, it takes down the connected component without the root. I never said that one or both connected components could be an empty set of edges :).
Next in line: when a user tries to cut an edge, we need to know which edge it was. I plugged in some simple geometric primitives to do just that; otherwise I would never be able to cut that edge and the connected component it generates. But the problem is that one has to click several times to hit the right (x, y) pixel coordinate - yeah, a very good way to hurt your eyes!
Since the line widths are very small when I go for single-pixel thickness, I have to parameterize the thickness so it adjusts dynamically to the device's capability. For now, I can test with 5 to 10 pixels worth of thickness. And right there, I come across another geometric problem: how do I compute on the rectangle that is just formed, so that the user can click once to cut an edge and its associated non-rooted component? That is my next design step. Interestingly, these little mathematical and algorithmic problems make me think! For example, a rectangle is a convex shape, so given a point I need to decide whether it is inside the convex region or not!
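Here is the kind of hit test I have in mind, as a sketch: rather than testing the four sides of the thick rectangle, compute the distance from the click point to the underlying segment and accept the click if it is within half the thickness. The function and parameter names are made up for illustration.

```c
#include <stdbool.h>
#include <math.h>

bool click_hits_edge(double px, double py,
                     double x1, double y1, double x2, double y2,
                     double thickness)
{
    double dx = x2 - x1, dy = y2 - y1;
    double len2 = dx * dx + dy * dy;

    /* projection of (px,py) onto the segment, clamped to [0,1] */
    double t = len2 > 0.0 ? ((px - x1) * dx + (py - y1) * dy) / len2 : 0.0;
    if (t < 0.0) t = 0.0;
    if (t > 1.0) t = 1.0;

    double cx = x1 + t * dx, cy = y1 + t * dy;   /* closest point on segment */
    double dist = hypot(px - cx, py - cy);

    return dist <= thickness / 2.0;
}
```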
In summary -
- I've been able to connect the host to the OpenMoko.
- I've got the debug environment set up.
- I've got the user's side of play working.
- I will get to the game evaluation and game engine part soon!