Debug performance matters: The value of good metrics | Network Systems Designline










Debug performance matters: The value of good metrics





Courtesy of Embedded.com

Debugging is the most difficult and costly phase of software development for systems large and small. Deeply embedded systems don't have the standard PC user interfaces of keyboards, mice, graphic displays or even network consoles, so you need specialized debugging tools to get the critical system information necessary to find and fix bugs.

For many systems, that access is provided by a hardware debug device which communicates to your system's microprocessors through an on-chip debug (OCD) port. These debug devices can have dramatically different performance characteristics.

Your development time is valuable, so when you build your system and select a hardware debug device, carefully consider debugging performance, or you may find yourself waiting when you should be debugging.

Debugging Performance Metrics
The difficulty in measuring "debugging performance" lies in the varied nature of debugging. One day's task is nothing like the next, and debugging as a whole can be thought of as a sequence of experiments undertaken one after the other.

Figure 1. Typical embedded debug setup

Piece by piece, you try to pin down the problem, working through the same section of code as different scenarios are examined, tested, and then set aside. During this process, you spend a lot of time reloading or reprogramming your application, re-running the application to specific breakpoints, stepping through code, uploading logging or trace information, and examining the state of the system.

Fortunately for measuring debugging performance, the time taken for these tasks is dominated by a single factor: memory access speed on your system. Reloading or reprogramming the application requires direct writes to RAM and/or non-volatile memory.

Need to read out that log of what the system was doing when it died? Want to view a peripheral's memory-mapped registers? Trying to debug your deadlocked application, looking for which of your system's tasks is holding a semaphore when it shouldn't?

In all of these cases, it's memory access to the rescue. If memory access is slow, you're looking at a lot of dead time while debugging, waiting on your debugging system to catch up.

As such, memory access speed is the fundamental measure of productivity and performance for debugging an embedded processor. It is also easy to measure; simply dump a large number of pseudo-random bytes from the debugging host into the memory of the system under debug and time how long it takes to complete. This will give the memory write speed of the system, and read speed can be measured by simply reading the data back.
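The measurement described above can be sketched in a few lines of Python. This is only an illustration: `FakeTarget`, `write_memory` and `read_memory` are hypothetical names standing in for whatever interface your debug host software actually exposes, and the fake target simply stores bytes in host memory so the sketch runs on its own.

```python
import os
import time


class FakeTarget:
    """Hypothetical stand-in for a real debug connection (host memory only)."""

    def __init__(self, size):
        self.mem = bytearray(size)

    def write_memory(self, address, data):
        self.mem[address:address + len(data)] = data

    def read_memory(self, address, length):
        return bytes(self.mem[address:address + length])


def measure_memory_speed(target, address, size):
    """Return (write_bytes_per_sec, read_bytes_per_sec, data_ok)."""
    payload = os.urandom(size)  # random test pattern

    start = time.perf_counter()
    target.write_memory(address, payload)
    write_elapsed = time.perf_counter() - start

    start = time.perf_counter()
    readback = target.read_memory(address, size)
    read_elapsed = time.perf_counter() - start

    # Guard against a zero interval on very fast (fake) targets.
    write_speed = size / max(write_elapsed, 1e-9)
    read_speed = size / max(read_elapsed, 1e-9)
    return write_speed, read_speed, readback == payload


target = FakeTarget(1 << 20)
write_bps, read_bps, ok = measure_memory_speed(target, 0x1000, 256 * 1024)
print(f"write {write_bps / 1024:.0f} KB/sec, read {read_bps / 1024:.0f} KB/sec, ok={ok}")
```

Against real hardware, use a transfer large enough (hundreds of kilobytes or more) that fixed per-command latency does not dominate the result.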

Why does memory access performance vary from system to system? Performance bottlenecks lurk everywhere, but the most important ones are the hardware debug device and the design of the microprocessor's debug port.

In addition, certain other design factors of the system under debug (not just the selection of microprocessor) can affect memory access speed. To understand this better, we must examine the details of debug-mode memory access through a debug port.

Memory Access Through a Debug Port
Debug ports come in many shapes and sizes. The focus here will be on debug ports based on the widely used IEEE 1149.1 boundary scan standard (commonly known as "JTAG"). This four-signal standard was originally designed for in-circuit PCB and device testing, but has been extended to include software debugging.

The standard has a number of characteristics which make it well-suited to the task of in-system debugging, including access to multiple devices simultaneously, and the possibility of combining software debug, manufacturing test, and device programming into a single low-pin count connector.

JTAG is a simple interface at the pin level, with a single clock called TCK driven by the debug device. Along with TCK the debug device sends one bit of data per TCK cycle on the TDI signal and one bit of control information on the TMS signal.

On each cycle the system under debug replies with a single bit of data out on the TDO signal. Since only one bit can be sent and received per TCK period, the frequency of TCK is a significant factor in the performance of the JTAG interface; at a TCK frequency of 10MHz the interface can carry no more than 10 million bits per second.

On top of this simple signaling scheme, microprocessors add their own protocols for allowing memory access. Some device families require thousands of JTAG TCK periods per byte read from or written to memory, while the most efficient device families require only slightly more than 8 TCK cycles for each byte of memory accessed.

On the whole, most devices add between 20% and 100% of overhead in TCK periods for their most efficient memory access method, so each byte of memory read or written requires roughly 10 to 16 JTAG TCK periods.
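As a rough check of those figures, the sketch below converts a protocol overhead fraction into TCK cycles per byte and a best-case byte rate. The function names are illustrative, and the model assumes nothing beyond one data bit per TCK cycle plus a fixed percentage of overhead.

```python
def tck_cycles_per_byte(overhead):
    """TCK cycles needed per byte for an overhead fraction (0.2 means 20%)."""
    return 8 * (1.0 + overhead)


def max_bytes_per_sec(tck_hz, overhead):
    """Best-case memory throughput at a given TCK frequency."""
    return tck_hz / tck_cycles_per_byte(overhead)


for overhead in (0.2, 1.0):
    cycles = tck_cycles_per_byte(overhead)
    rate = max_bytes_per_sec(10_000_000, overhead)
    print(f"{overhead:.0%} overhead: {cycles:.1f} cycles/byte, {rate:,.0f} bytes/sec at 10MHz")
```

At 20% overhead this gives 9.6 cycles per byte (about 10), and at 100% it gives 16.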

The topology of the system under debug can also affect memory access efficiency. One of the strengths of the JTAG standard lies in its ability to serially chain multiple devices from different manufacturers into a single scan chain that is all accessible through a single debug device.

This makes system-level testing, visibility and debug very convenient, but it comes with a cost. Systems with multiple devices in a scan chain incur extra overhead for each operation, which reduces throughput. A system with tens of devices chained together can easily cut the theoretical best-case memory access throughput in half.
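To see how chain length eats into throughput, here is a deliberately simplified model. It assumes that every other device in the chain sits in BYPASS and so adds one extra bit to each data scan, and that each memory access needs one data scan. Both are assumptions made for illustration, and a real device may need more scans per access.

```python
def chained_throughput_fraction(base_cycles, bypass_devices, scans_per_access=1):
    """Fraction of single-device throughput left after chain overhead.

    base_cycles:      TCK cycles per access with one device in the chain
    bypass_devices:   other devices in the chain, each adding one bit per scan
    scans_per_access: data scans needed per access (assumed, device specific)
    """
    extra_cycles = bypass_devices * scans_per_access
    return base_cycles / (base_cycles + extra_cycles)


# A 4-byte access at 10 TCK cycles per byte is 40 cycles on its own.
for devices in (0, 10, 40):
    fraction = chained_throughput_fraction(40, devices)
    print(f"{devices:2d} bypassed devices: {fraction:.0%} of best-case throughput")
```

Under these assumptions, 40 bypassed devices double the cycles per access and halve the throughput.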

Careful system design and signal routing are also required for a JTAG-based system to perform at its full potential. Remember that JTAG-based systems can send and receive only a single bit of data per TCK cycle, so it is very important that the system handle high TCK frequencies while maintaining the timing relationship of TCK to the other three JTAG signals.

If the four high-speed JTAG signals are not treated carefully in circuit design and layout, the maximum frequency of the JTAG interface may be limited, and this will limit the maximum memory access performance of the system.

The ARM1176JZF-S: putting performance metrics to work
For more in-depth analysis of memory access, let's examine the debug system provided on the ARM1176JZF-S high-performance embedded processor core (for a full discussion, see ARM's excellent user manual for this core).

The ARM11 debug port allows arbitrary opcodes to be fed to and executed on the processor core while in debug mode, and offers a register (the Debug Data Transfer Register, or DTR) that is visible to both the processor core and the debug port. A naïve but logical way to read memory from the debug port is shown in Figure 2 below.

Figure 2. Unoptimized scan sequence

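This works, but it is inefficient for large-scale memory access, requiring 648 JTAG clock cycles to read a single 4-byte value from memory. To put that level of efficiency into context, we can easily compute the best-case memory access speed of a debug device from the TCK frequency and the number of TCK cycles required per memory access:

$$\text{memory access speed (bytes/sec)} = \frac{f_{TCK}}{\text{TCK cycles per access}} \times \text{bytes per access}$$

So this scan sequence running at a typical 10MHz JTAG clock can read memory at no more than 60.3 kilobytes per second (taking 1 kilobyte as 1,024 bytes):

$$\frac{10{,}000{,}000}{648} \times 4 \approx 61{,}728 \text{ bytes/sec} \approx 60.3 \text{ KB/sec}$$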

This same sequence with minor changes can be used to write memory at the same efficiency. Unfortunately, 60 kilobytes per second isn't very fast. As an example, a developer with a 2.5 megabyte application would have to wait 42 seconds each time the program is downloaded. An extra 42 seconds for every new test case or scenario quickly adds up to a significant loss of expensive developer time.
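The arithmetic behind those two numbers is easy to reproduce. The sketch below is only a calculator for the figures quoted above. The function names are illustrative, and a kilobyte is taken as 1,024 bytes, which is what makes the numbers match.

```python
KB = 1024
MB = 1024 * KB


def memory_speed_bytes_per_sec(tck_hz, cycles_per_access, bytes_per_access=4):
    """Best-case memory access speed for a given scan sequence length."""
    return tck_hz / cycles_per_access * bytes_per_access


def download_seconds(image_bytes, bytes_per_sec):
    """Time to load an image at a given memory access speed."""
    return image_bytes / bytes_per_sec


speed = memory_speed_bytes_per_sec(10_000_000, 648)
print(f"{speed / KB:.1f} KB/sec")                          # 60.3 KB/sec
print(f"{download_seconds(2.5 * MB, speed):.0f} seconds")  # 42 seconds
```

Changing `cycles_per_access` shows directly how much a shorter scan sequence is worth.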

Fortunately, it is easy to do much better. If we only execute steps 1 through 4 once and use a load instruction with auto-increment in step 5, then we increase efficiency so only 216 cycles are used for each 4-byte load or store. Thanks to the ingenuity and forethought of the ARM11 engineering team, steps 5 and 6 can also be combined and optimized so each 4-byte load or store consumes just 41 JTAG clock cycles as shown in Figure 3, below.

Figure 3. Optimized scan sequence

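Now the best-case memory transfer speed at a 10MHz JTAG clock is much faster, and debugging cycles for our hypothetical developer are practically instantaneous:

$$\frac{10{,}000{,}000}{41} \times 4 \approx 975{,}610 \text{ bytes/sec} \approx 952.7 \text{ KB/sec}$$

At that rate the 2.5 megabyte application downloads in under 3 seconds.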

This analysis is simple for the ARM1176JZF-S, but for other devices the process of efficient memory access is not always obvious or well documented. It is critical that debug devices use efficient memory access routines, and they must execute those routines within tight time constraints in order to achieve high performance.

Debug Device Implementation
One common way to drive these JTAG lines is through the general purpose I/O signals (GPIOs) of a small microcontroller. This has the advantage of being inexpensive and simple. The major drawback is speed: the microcontroller must compute JTAG command sequences, extract TDI bit values, compute any TMS values and do bit operations on the GPIO registers.

If it takes as few as 20 microcontroller cycles per TCK clock edge and the microcontroller runs at 60MHz, then the 41 TCK cycles (82 clock edges) of each 4-byte access require 1,640 microcontroller cycles, and the maximum effective clock speed of the JTAG interface for this device will be only about 1.5MHz:

$$f_{TCK,\text{effective}} = \frac{60{,}000{,}000}{1{,}640} \times 41 = 1{,}500{,}000 \text{ Hz} = 1.5 \text{ MHz}$$

After accounting for the microcontroller handling the transfer of data to and from the debugging host, such a system is slowed further: if the microcontroller spends half its time moving data from the host, the effective TCK speed drops to 750kHz. Substituting this figure into the memory transfer speed equation for the ARM1176 (41 cycles per 4-byte load/store) yields a transfer speed of 71.5KB/sec.
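The bit-banged case can be worked through step by step. This sketch assumes two clock edges per TCK period and treats host communication as a fixed share of CPU time. The function and parameter names are illustrative.

```python
KB = 1024


def bitbang_effective_tck_hz(cpu_hz, cpu_cycles_per_edge, host_share=0.0):
    """Effective TCK rate when a microcontroller toggles TCK through GPIO.

    host_share is the fraction of CPU time spent moving data to/from the host.
    """
    cpu_cycles_per_tck = 2 * cpu_cycles_per_edge  # rising plus falling edge
    return cpu_hz * (1.0 - host_share) / cpu_cycles_per_tck


def memory_speed_bytes_per_sec(tck_hz, cycles_per_access, bytes_per_access=4):
    return tck_hz / cycles_per_access * bytes_per_access


raw_tck = bitbang_effective_tck_hz(60_000_000, 20)
shared_tck = bitbang_effective_tck_hz(60_000_000, 20, host_share=0.5)
speed = memory_speed_bytes_per_sec(shared_tck, 41)

print(f"{raw_tck / 1e6:.2f} MHz")     # 1.50 MHz
print(f"{shared_tck / 1e3:.0f} kHz")  # 750 kHz
print(f"{speed / KB:.1f} KB/sec")     # 71.5 KB/sec
```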

One way to speed things up is to use programmable logic to handle transforming high-level "shift commands" into the bit-level signal patterns. Using the same microcontroller example, if 100 cycles are required on average per command and each command can shift 16 JTAG TCK cycles, then each 4-byte memory access (which uses 41 JTAG TCK cycles and therefore three shift commands) can be accomplished in 300 microcontroller CPU cycles.

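With only 300 cycles required per 4-byte memory access and assuming that 50% of the CPU cycles are still dedicated to transferring data from the debugging host, memory access throughput increases above 390 KB/sec:

$$\frac{60{,}000{,}000 \times 0.5}{300} \times 4 = 400{,}000 \text{ bytes/sec} \approx 390.63 \text{ KB/sec}$$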

Does actual TCK frequency matter?
An interesting fact emerges here: the actual TCK frequency no longer matters, as you can see from its absence in the above equation. (One caveat: TCK frequency must be above about 4MHz, or the system will be limited by the TCK frequency. For figuring the upper boundary of performance, though, TCK frequency is no longer necessarily a limiting factor.)

This microcontroller system could have a TCK clock generator capable of 50MHz or more, and throughput would still be exactly 390.63KB/sec. The throughput of the debug device has become limited by the computing power of the microcontroller and its programmable shifting logic.
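One way to see the crossover is to model the debug device as the slower of two limits: the TCK-limited rate and the CPU-limited rate. The sketch below uses the figures from the example (60MHz CPU, 100 CPU cycles per shift command, 16 TCK cycles per command, 41 TCK cycles per 4-byte access, half the CPU spent on host traffic). The function names are illustrative.

```python
import math

KB = 1024


def cpu_limited_bytes_per_sec(cpu_hz, cpu_cycles_per_command, tck_per_command,
                              tck_per_access, host_share, bytes_per_access=4):
    """Throughput ceiling set by the microcontroller issuing shift commands."""
    commands = math.ceil(tck_per_access / tck_per_command)  # 41 / 16 -> 3
    cpu_cycles_per_access = commands * cpu_cycles_per_command
    return cpu_hz * (1.0 - host_share) / cpu_cycles_per_access * bytes_per_access


def tck_limited_bytes_per_sec(tck_hz, tck_per_access, bytes_per_access=4):
    """Throughput ceiling set by the JTAG clock alone."""
    return tck_hz / tck_per_access * bytes_per_access


def device_bytes_per_sec(tck_hz):
    cpu_limit = cpu_limited_bytes_per_sec(60_000_000, 100, 16, 41, 0.5)
    return min(cpu_limit, tck_limited_bytes_per_sec(tck_hz, 41))


for tck_mhz in (2, 10, 50):
    rate = device_bytes_per_sec(tck_mhz * 1_000_000)
    print(f"TCK {tck_mhz:2d}MHz: {rate / KB:.1f} KB/sec")
```

At 2MHz the TCK clock is the bottleneck (about 190.5 KB/sec). At 10MHz and 50MHz the result is the same 390.6 KB/sec, because the microcontroller has become the limit.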

The only way to increase memory access performance for this debug device is to increase the computational throughput of the microcontroller, either by increasing its clock rate or by decreasing the number of cycles needed per shift command.

Figure 4. Hardware debug device speed

This is an important piece of information to remember as you consider any JTAG-oriented hardware debug device: maximum TCK frequency is important, but the ability to fill those TCK cycles with useful work is even more critical.

Today's typical high-end hardware debug devices are often built like the example microcontroller+PLD device benchmarked in Figure 4 above. With a high-performance microprocessor and dedicated JTAG management logic, memory access speeds of 2MB/sec or more on a system like this ARM1176 example are common, but they are only possible with a well-designed and highly optimized hardware debug device.

In fact, the debug ports of some of today's devices are capable of correct operation at TCK speeds above 100MHz, offering a challenge to designers of hardware debug devices.

For the ARM1176 example, 100MHz means a throughput of 9527 KB/sec, making debugging and programming tasks virtually instantaneous. To live up to the performance potential of such systems, careful system-level design of the hardware debug device is required to ensure that bottlenecks within the device do not limit performance.

If the device is connected to the debug host by USB, the device must support USB 2.0 high speed or be limited by the 1.5 MB/sec throughput of USB 1.1. If the debug device is connected by Ethernet, a high-performance networking subsystem capable of nearly saturating 100 megabit Ethernet must be used.

On top of that, the debug device must be able to issue whatever commands are necessary to execute a 4-byte load or store every 410ns to maintain the 9527 KB/sec transfer rate, and the system must have sufficient buffering and power to sustain that throughput while simultaneously transferring nearly 10 megabytes of data per second from the debugging host.
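These budget figures can be checked with a short calculation. The link rates below are raw signaling rates (12 Mbit/sec for USB 1.1 full speed, 480 Mbit/sec for USB 2.0 high speed, 100 Mbit/sec for Ethernet) with protocol overhead ignored, so real sustained throughput will be lower. The dictionary and function names are illustrative.

```python
KB = 1024

# Raw line rates in bytes per second; protocol overhead is ignored.
HOST_LINKS = {
    "USB 1.1 full speed": 12_000_000 / 8,
    "100 megabit Ethernet": 100_000_000 / 8,
    "USB 2.0 high speed": 480_000_000 / 8,
}


def access_period_ns(tck_hz, tck_per_access):
    """Time available to issue each memory access at full TCK rate."""
    return tck_per_access / tck_hz * 1e9


def required_bytes_per_sec(tck_hz, tck_per_access, bytes_per_access=4):
    """Host data rate needed to keep the JTAG port busy."""
    return tck_hz / tck_per_access * bytes_per_access


def links_that_keep_up(required):
    return [name for name, rate in HOST_LINKS.items() if rate >= required]


need = required_bytes_per_sec(100_000_000, 41)
print(f"{access_period_ns(100_000_000, 41):.0f} ns per 4-byte access")  # 410 ns
print(f"{need / KB:.0f} KB/sec required")                               # 9527 KB/sec
print(links_that_keep_up(need))
```

Under these raw figures, USB 1.1 falls well short, while 100 megabit Ethernet has to run at close to 80% of its line rate.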

Conclusion
At the end of the day, you just want to get your work done in as quick and painless a way as possible. With this in mind, it is important to examine hardware debug performance when designing a new system, and especially when selecting a hardware debug device.

High-performance debugging equipment means you can spend less time waiting around for system restart and critical debugging information, and more time solving the real-world problems of a deeply embedded system.

Anderson MacKay is Engineering Manager in Green Hills Software's Target Connections group, responsible for product planning, engineering, and project management for the  Probe and SuperTrace Probe products.

