Developing Embedded Software Using Davinci and Omap Technology
( Σ_{i=1}^{N} C(i) ) / N %
9.4. LEVERAGING THE APPLICATION
This API lets the application developer obtain a quick estimate of the average DSP CPU load in the recent past. It is also useful for monitoring the fluctuations in CPU load.
On the flip side, C(i) is not accurate on a frame-by-frame basis. For the very first frame, C(1) = 100, since the DSP has been idle until that point. In other words, C(i) takes time to stabilize. Apart from this, the return value is a DSP Engine-level CPU load; we cannot isolate the performance of an individual codec from this number. In addition, when two codecs are in operation simultaneously, this method is not reliable.
9.4.1.2 ARM timestamps
This method is the most intuitive to implement in the application. It involves capturing the timestamps immediately before and after the VISA API VIDENC_process() call on the ARM. As seen earlier, the API is a blocking call on the ARM, and the application unblocks only after the encoding completes on the DSP. The difference between the timestamps before and after the VIDENC_process() API gives the overall time taken to encode a frame, including the actual codec encode cycles, the system framework (CE) latencies, ARM-DSP message-passing latencies, and any cache maintenance overheads.
Figure 9.5 shows how timestamps can be used to calculate the performance.
After recording the timestamps for each frame, the DSP CPU load is calculated as follows:
1. Let's assume that video capture is NTSC. The encode operation for each frame will occur once every 33 ms, i.e., 30 frames are processed in 1 second, so each frame must be processed in 33 ms.
2. Let's also assume that the DSP is running at 594 MHz.
For frame i:
B(i) = Timestamp before the VIDENC_process() call
A(i) = Timestamp after the VIDENC_process() call
C(i) = A(i) - B(i) microseconds
P(i) = ( C(i) / 33000 ) * 594 MHz
Average Encode Duration = ( Σ_{i=1}^{N} C(i) ) / N microseconds
Average DSP MHz = ( Σ_{i=1}^{N} P(i) ) / N MHz
CHAPTER 9. SAMPLE APPLICATION USING EPSI AND XDM
[Figure 9.5 is a sequence diagram between the Controller (ARM) and the DSP: after Engine_open() and VIDENC_create(), the timestamps B(i) and A(i) are recorded immediately before and after each VIDENC_process() call, for frames 1 through N.]
Figure 9.5: ARM Timestamp.
The following code snippet shows how to capture the timestamps on ARM/Linux. The
variable encodeTime contains the time taken to encode a frame in microseconds. This variable is
the same as C(i) above.
Using the formula given above, we can compute the average DSP MHz consumed by the
video encode application.
#define NUM_MICROSECS_IN_SEC (1000000)
typedef struct timeval TimeStamp;
...
TimeStamp t1, t2;
long encodeTime;
/* Get timestamp before and after encode */
gettimeofday(&t1, 0);
VIDENC_process(videncHdl, &inbufDesc, &outbufDesc,
               &inArgs, &outArgs);
gettimeofday(&t2, 0);
/* Calculate the time taken to encode, in microseconds.
 * Subtracting the seconds fields first avoids overflowing
 * a 32-bit integer when scaling the absolute tv_sec values. */
encodeTime = (t2.tv_sec - t1.tv_sec) * NUM_MICROSECS_IN_SEC +
             (t2.tv_usec - t1.tv_usec);
...
This method is more granular than using Engine_getCpuLoad() described in Section 3.3.1.1. We can get a detailed performance report on a frame-by-frame basis, and we can plot and analyze CPU load patterns based on the nature of the input content.
On the flip side, there could be an atomicity problem when recording timestamps. For example, if there is a context switch on the ARM side just after the VIDENC_process() call but before the timestamp t2 gets recorded, it will skew the results.
9.4.1.3 Codec Engine Traces
This method uses the trace support provided by the Codec Engine to capture the performance data. Refer to the chapter on Codec Engine for the various trace options supported and how to enable them.
The CE tracing can also be turned on at the time of executing the ARM application by setting the CE trace level from the command line. The typical trace level used to capture the performance numbers is trace level 2.
root# CE_DEBUG=2 ./controller > trace.txt
After the traces are turned on, the CE prints detailed Codec Engine traces with timestamps on the console. The trace output on the console is captured into a log file. A post-processor can then be run on the log file to filter out the CE traces required for determining performance.
Figure 9.6 shows a section of a sample CE trace captured while a frame of video is being encoded.
Figure 9.6 highlights in blue the trace lines relevant to performance calculations. The output contains traces from both the ARM and the DSP. Each trace line is preceded by the corresponding timestamp. ARM timestamps are recorded in microseconds (us); DSP timestamps are recorded in the form of DSP ticks (tk).
The DSP ticks are obtained directly from the TSC registers and right-shifted by 8. If <T> is the ticks value printed in the trace line and the DSP is running at 594 MHz:
DSP cycles consumed = <T> * 256
Duration in microseconds = ( <T> * 256 ) / 594 us
The important trace lines are marked by a letter in Figure 9.6.
A = Start of CE processing on ARM
B = Start of CE processing on DSP
C = Start of input buffer cache invalidation on DSP
D = End of input buffer cache invalidation on DSP
E = End of algorithm activation on DSP
F = End of codec processing on DSP
G = Start of output buffer cache writeback-invalidate on DSP
H = End of output buffer cache writeback-invalidate on DSP
I = End of CE processing on DSP
J = End of CE processing on ARM
By calculating the difference between appropriate timestamps, different performance numbers
can be derived as follows:
J - A = Total time taken to encode a frame from ARM
I - B = Total time taken to encode a frame on DSP
F - E = Actual encoder processing time
D - C = Time taken for input buffer cache invalidation
H - G = Time taken for output buffer cache writeback-invalidate
. . .
@2,386,158us: [+0 T:0x4118eb60] ti.sdo.ce.video.VIDENC - VIDENC_process> Enter
(handle=0x91af0, inBufs=0x80b2c, outBufs=0x80bfc, inArgs=0x80a0c, outArgs=0x80a18)
@2,386,331us: [+5 T:0x4118eb60] CV - VISA_allocMsg> Allocating message for
messageId=0x00020fa6
@2,386,498us: [+0 T:0x4118eb60] CV - VISA_call(visa=0x91af0, msg=0x4199e880):
messageId=0x00020fa6, command=0x0
[DSP] @4,054,353tk: [+5 T:0x8c4cefcc] CN - NODE> 0x8fa9a3e8(h264enc#0)
call(algHandle=0x8fa9a4a8, msg=0x8fe04880); messageId=0x00020fa6
[DSP] @4,054,463tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> Enter(addr=0x8a02f000,
sizeInBytes=921600)
[DSP] @4,055,361tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> return
[DSP] @4,055,413tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> Enter(addr=0x89ca8000,
sizeInBytes=460800)
[DSP] @4,055,897tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> return
[DSP] @4,055,949tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> Enter(addr=0x89e46000,
sizeInBytes=921600)
[DSP] @4,056,859tk: [+0 T:0x8c4cefcc] OM - Memory_cacheInv> return
[DSP] @4,056,912tk: [+0 T:0x8c4cefcc] ti.sdo.ce.video.VIDENC - VIDENC_process> Enter
(handle=0x8fa9a4a8, inBufs=0x8c4d20e4, outBufs=0x8c4d21b4, inArgs=0x8fe04a04,
outArgs=0x8fe04a10)
[DSP] @4,057,036tk: [+5 T:0x8c4cefcc] CV - VISA_enter(visa=0x8fa9a4a8): algHandle =
0x8fa9a4d0
[DSP] @4,057,101tk: [+0 T:0x8c4cefcc] ti.sdo.ce.alg.Algorithm - Algorithm_activate>
Enter(handle=0x8fa9a4d0)
[DSP] @4,057,173tk: [+0 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_activateAlg> Enter
(scratchId=0, alg=0x8ba045e8)
[DSP] @4,057,249tk: [+2 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_activateAlg> Last
active algorithm 0x8ba045e8, current algorithm to be activated 0x8ba045e8
[DSP] @4,057,341tk: [+2 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_activateAlg>
Activation of algorithm 0x8ba045e8 not required, already active
[DSP] @4,057,422tk: [+0 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_activateAlg> Exit
[DSP] @4,057,628tk: [+0 T:0x8c4cefcc] ti.sdo.ce.alg.Algorithm - Algorithm_activate>
return
[DSP] @4,080,334tk: [+5 T:0x8c4cefcc] CV - VISA_exit(visa=0x8fa9a4a8): algHandle =
0x8fa9a4d0
[DSP] @4,080,433tk: [+0 T:0x8c4cefcc] ti.sdo.ce.alg.Algorithm -
Algorithm_deactivate> Enter(handle=0x8fa9a4d0)
[DSP] @4,080,661tk: [+0 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_deactivateAlg> Enter
(scratchId=0, algHandle=0x8ba045e8)
[DSP] @4,080,736tk: [+2 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_deactivateAlg> Lazy
deactivate of algorithm 0x8ba045e8
[DSP] @4,080,811tk: [+0 T:0x8c4cefcc] ti.sdo.fc.dskt2 - _DSKT2_deactivateAlg> Exit
[DSP] @4,080,863tk: [+0 T:0x8c4cefcc] ti.sdo.ce.alg.Algorithm -
Algorithm_deactivate> return
[DSP] @4,080,923tk: [+0 T:0x8c4cefcc] ti.sdo.ce.video.VIDENC - VIDENC_process> Exit
(handle=0x8fa9a4a8, retVal=0x0)
[DSP] @4,081,005tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> Enter(addr=0x89e46000,
sizeInBytes=921600)
[DSP] @4,081,905tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> return
[DSP] @4,081,957tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> Enter(addr=0x8bc183ca,
sizeInBytes=1207808)
[DSP] @4,083,134tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> return
[DSP] @4,083,187tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> Enter(addr=0x8bd3c680,
sizeInBytes=603904)
[DSP] @4,083,796tk: [+0 T:0x8c4cefcc] OM - Memory_cacheWb> return
[DSP] @4,083,848tk: [+5 T:0x8c4cefcc] CN - NODE> returned from
call(algHandle=0x8fa9a4a8, msg=0x8fe04880); messageId=0x00020fa6
@2,404,415us: [+0 T:0x4118eb60] CV - VISA_call Completed: messageId=0x00020fa6,
command=0x0, return(status=0)
@2,404,562us: [+5 T:0x4118eb60] CV - VISA_freeMsg(0x91af0, 0x4199e880): Freeing
message with messageId=0x00020fa6
@2,404,681us: [+0 T:0x4118eb60] ti.sdo.ce.video.VIDENC - VIDENC_process> Exit
(handle=0x91af0, retVal=0x0)
. . .
Figure 9.6: Sample CE Trace.
This method is the most comprehensive and accurate way to measure the performance of a system developed using the DaVinci software framework. Using this method, we can calculate the performance of the pure codec encode, the cache maintenance, and the overall performance as seen at the system level.
On the flip side, this method produces quite a few trace messages per encoded frame. Hence, it may not be suitable for real-time systems. It is most useful when the application developer finds a bottleneck in the system and wants to fine-tune the different parts of the system to optimize performance.
9.4.2 MEASURING THE CODEC ENGINE LATENCY
When the VISA API is called on the ARM, the Codec Engine on the ARM marshals the parameters into a message and sends the message to the DSP using DSP Link. The Codec Engine component on the DSP receives the message, unmarshals the parameters, activates the appropriate algorithm instance, and calls the corresponding codec _process() function. Once the _process() function completes in the codec, the CE performs the reverse process: the results are marshaled into a message and sent over DSP Link. The CE on the ARM receives the message and passes it back to the application. Only now does the controller application on the ARM unblock from the VIDENC_process() call.
In the above sequence, the CE performs several operations in the background: message passing, algorithm activation, and cache maintenance. These operations are necessary but introduce latency. This latency will vary depending on the codec and the input parameters. For example, if an input of D1 size is passed to the encoder, cache maintenance will take longer than if an input of CIF size is passed. Additionally, different codecs have different buffer requirements.
Once we have the performance numbers as shown in Section 3.3.1.3, further performance
metrics can be derived as follows:
Overheads on DSP = (I - B) - (F - E)
ARM <-> DSP buffer-passing latency = (J - A) - (I - B)
Some of the overheads, like cache maintenance and algorithm activation, are necessary. However, knowledge of these overheads will enable the application developer to determine the headroom available on the DSP. In addition, the application developer can fine-tune the codec configurations depending on how the overheads are affected by those configurations.
The performance report obtained through the CE trace is given in the file below. The performance numbers were captured on a per-frame basis. From this data, the codec performance on the DSP, the overhead on the DSP, the overall performance on the DSP, and the overall performance as seen from the ARM were derived.
The performance report in Figure 9.7 captures data for 1800 frames. Figure 9.8 shows the line graph of the same report for the first 250 frames. This graph is useful for the ready interpretations it provides:
Note: This is a URL to an Excel sheet.
http://www.morganclaypool.com/page/pawate
Figure 9.7: Encoder Performance Report.
[Figure 9.8 is a line graph of Encode Duration (us), ranging from 0 to 25000, against Frame No, with four series: DSP Codec, DSP Overhead, DSP Total, and ARM Total.]
Figure 9.8: Encoder Performance Graph.
• DSP overhead is constant across frames.
• There are no spikes in the performance graph except at the beginning. This implies that the peak performance will not deviate much from the average performance.
• Total DSP encode time and total ARM encode time track each other. This implies that the CE latency remains consistent.
• There is a regular trough every 15 frames. This is understandable, as the I-frame interval configured for this test is 15. As the cycles consumed for an I-frame are fewer than for a P-frame, the troughs are seen at regular intervals.
9.4.3 MULTI-CHANNEL APPLICATION
Developing a multi-channel application is the same as writing a single-channel application. The only restriction from the Codec Engine is that the Engine handles need to be serialized. This will not be a problem if all the codec instances access the engine handle from the same thread. If the codec instances are running in different threads, then each thread needs to have a separate Engine handle created using the Engine_open() API.
We saw the single-channel application in Section 3.2. The multi-channel application is provided below. This application opens two input files, test1.yuv and test2.yuv, creates two video encoder instances, and configures both instances. After the creation phase, the application reads the first input file, test1.yuv, for the video frame to be encoded and passes this frame to the first encoder instance. It then reads the second input file, test2.yuv, and passes that video frame to the second encoder instance. The encoded frames from instances 1 and 2 are stored into two encoded files, test1.enc and test2.enc.
Multi Channel Application Code
void
APP_videoEncode(int numFramesToCapture)
{
. . .
/* Initialize Video Encoder instance #1 */
videncHdl1 = VIDENC_create(engineHdl, "h264enc",
                           &videncParams1);
/* Initialize Video Encoder instance #2 */
videncHdl2 = VIDENC_create(engineHdl, "h264enc", &videncParams2);
/* Configure Video Encoders */
VIDENC_control(videncHdl1, XDM_SETPARAMS,
               &videncDynParams1, &videncStatus1);
VIDENC_control(videncHdl2, XDM_SETPARAMS,
               &videncDynParams2, &videncStatus2);
/* Initialize files */
fileIn1 = FILE_open("test1.yuv", "r");
fileIn2 = FILE_open("test2.yuv", "r");
fileOut1 = FILE_open("test1.enc", "w");
fileOut2 = FILE_open("test2.enc", "w");
while (nframes++ < numFramesToCapture)
{
FILE_read(fileIn1, inbuf1, FRAME_SIZE);
FILE_read(fileIn2, inbuf2, FRAME_SIZE);
VIDENC_process(videncHdl1, &inbufDesc1,
&outbufDesc1, &inArgs1, &outArgs1);
VIDENC_process(videncHdl2, &inbufDesc2, &outbufDesc2,
&inArgs2, &outArgs2);
FILE_write(fileOut1, outbuf1, outArgs1.bytesGenerated);
FILE_write(fileOut2, outbuf2, outArgs2.bytesGenerated);
}
VIDENC_delete(videncHdl1);
VIDENC_delete(videncHdl2);
Engine_close(engineHdl);
. . .
}
C H A P T E R 10
IP Network Camera on DM355
Using TI Software Platform
10.1 INTRODUCTION
This document provides detailed information on the source code organization and execution, and suggestions for modifying the ARM and iMX programs on a DM355 IPNetCam reference design. The DM355 is a multimedia processor from Texas Instruments (TI) with an ARM core, a hardware video accelerator for MPEG4 and JPEG, and a set of peripherals for multimedia products. The DM355 can support a range of resolutions from SIF to 720p, and it can support single as well as multiple channels of MPEG4. The IPNetCam takes input from CMOS sensors, processes/compresses the video, and streams the processed/compressed video over Ethernet. Its web-based management console lets users adjust various settings and stream video and audio data. Recently, the next-generation version of this reference design, based on the DM365, became available on the TI web site at www.ti.com/ipcamera.
10.2 SYSTEM OVERVIEW
The figure below shows the top-level software architecture of the IPNetCam. The IPNetCam software is built on top of the TI DVSDK. The identified partner will work mostly on the Application Layer (APL) and the Input/Output Layer (IOL). The IPNetCam software will use the existing DM365 Codec Engine and provide the necessary codec combinations to the users.
This comprises MontaVista Linux Pro, which is designed for the IPNC board using the standard DVSDK Linux kernel. It has various device drivers to support the various interfaces. The application uses this layer through EPSI (Embedded Peripheral Software Interface).
10.3 OPERATING SYSTEM
We will use the MontaVista Linux Pro kernel, which is shipped along with the DM365 EVM. This innovative embedded Linux solution features dynamic power management, rapid kernel boot time, enhanced file systems, new development tools for system performance tuning, and rich processor and peripheral support.
MontaVista Linux comes with TI's xDM VISA APIs, making it the most efficient and handy platform on which to build this solution.
[Figure 10.1 diagram: the Application Layer (APL) hosts the user's value-added applications (conductor thread, video analytics app, rules management, audio/video streaming, manufacturing diagnostic, user diagnostic, system management, encryption, and connectivity). It sits above the Input/Output Layer (IOL), reached through the EPSI APIs, and the Signal Processing Layer, reached through the VISA API. The Signal Processing Layer contains the Codec Engine and Resource Server, hosting xDM codec instances (MPEG4, JPEG, H.264, preprocessor, and video analytics) over the DMAN/ACPY/DSKT/MEM/TSK APIs, DSP Link, and DSP/BIOS.]
Figure 10.1: IP network camera built on top of DVSDK.
10.4 DEVICE DRIVERS
The IPNetCam has various interfaces, such as video capture, audio capture, SD, USB, etc. To support all these interfaces, the corresponding device drivers for the MontaVista platform need to be developed/configured. For most of the interfaces, MontaVista Linux provides the basic driver, which is customized for the actual hardware interface used.
10.5 SUPPORTED SERVICES AND FEATURES
The IPNetCam supports the features that appear in Tables 10.3, 10.4, and 10.5.
10.6 ACRONYMS
Acronyms used throughout this document appear in Table 10.6.
10.7 ASSUMPTIONS AND DEPENDENCIES
• This document is based upon the IPNC reference design set with the DM355 EVM.
Table 10.3: Application Layer
Connectivity HTTP Web Server (HTTP)
TSL/SSL
FTP Server
SMTP client
NTP client
DHCP client
UPnP client
Network discovery
PoE
Audio Video Streaming Web-based video streaming using QuickTime/RealPlayer/VLC to ensure compliance. Additionally, a low-latency video player is required on the host PC in order to meet the end-to-end latency of 150 ms.
Audio/video capture date and time are marked on top of the video and inserted in the audio/video stream.
Audio volume control
Play voice alert
RTP, RTSP over TCP or UDP
System Management Multiple user access levels with password protection
Firmware updates on the IPNetCam for further software updates
Firmware backup and restore
SD and Network Storage settings
End-to-end low latency requirement: 150 ms
Analog output for local preview and monitoring the captured (compressed) video/image
USB for network detection and configuration
Storage Local Encrypted Local Storage (MPEG4 SIF stream, JPEG image, and alert details can be stored locally.)
Network Storage (MPEG4 HD video stream can be stored on a host PC; JPEG and alert details as an email attachment or by FTP protocol.)
Motion detection Basic motion detection for an area of interest is expected
Manufacturing Diagnostic A detailed hardware diagnostic software package for customers who want to take this reference design to manufacturing
User Diagnostic Simple hardware diagnostic software tool to test the basic IO functionality of all peripherals.
Table 10.3: Application Layer (continued)
Image Control H2A software for auto white balance and auto exposure
Zooming of an image using the digital zoom based on ePTZ. This is based on the video capture driver; the capture region can be changed frame by frame. By changing the location of the capture region, the active scope of the video is panned/tilted/zoomed electronically. The IPNC is expected to enable ePTZ at D1 and VGA resolution at 30 fps.
User-defined video image capture size
Switch day/night mode
Switch indoor/outdoor mode
Command Control For Admin Video and audio channel start & stop
ePTZ control (PTZ at a certain step.)
Video input setting (720P raw RGB data generated from sensor)
Brightness, contrast, saturation, hue, gain setting
Setting JPEG parameters (QP)
Setting MPEG-4 parameters (CBR/VBR, bitrate, GOP, etc.)
Setting G.711 parameters
Setting dual codec combos (CBR/VBR, bitrate, GOP, etc.)
Setting area of interest (ROI) for motion detection
Event notification
Network settings
User access control
JPEG image storage options
Secondary MPEG4 SIF storage options
Active connection list
Control the alarm output of the I/O port on the camera
Play an audio file (voice alert)
Switch day/night mode
Switch indoor/outdoor mode
Synchronize the date and time of the camera with those of the computer
Table 10.4: Input-Output Layer (IOL)
Video Input Video Input Driver
Video capture directly from the CMOS image sensor
The video input driver can change the capture region and location frame by frame.
Enable ePTZ functionality at the video driver level (enable the ePTZ feature at D1 and VGA resolution at 30 fps.)
Auto focus, iris, white balancing, dark-frame subtraction, exposure, and lens shading correction using DM355 ISP/VPSS capabilities.
Audio Mono Input Driver
Stereo Output Driver
Storing SD Memory driver
LAN EMAC driver
POE
GPIO & PWM GPIO driver
RTC RTC driver
NAND Flash NAND Flash driver
Table 10.5: Signal Processing Layer (SPL)
CODEC Combos MPEG-4 (SP, 720P) + JPEG compression + motion detection + G.711 speech codec
Dual Stream CODEC Combo MPEG4 (SP, 720P) + MPEG4 (SP, SIF) or JPEG (SIF) + motion detection + G.711 speech coding
Triple Stream CODEC Combo MPEG4 (SP, 720P) + JPEG (VGA) + MPEG4 (SP, SIF) + motion detection + G.711 speech coding
Table 10.6: Acronyms
Acronym Description
IPNC Internet Protocol Net Camera
EVM Evaluation Module
CMOS Complementary Metal-Oxide-Semiconductor
2A Auto White Balance and Auto Exposure
• The operating system comes from MontaVista Linux version 2.6.10.
• Texas Instruments Incorporated provides the Digital Video Software Development Kit (DVSDK).
• Code Composer Studio (CCStudio) version CCS 3.3.38.2 or higher is used for flashing to NAND memory.
• For the single-stream and dual-stream modes, the frame rate achieved will be 30 fps.
• For the triple stream, the MPEG4 (SP, 720P) will be at 30 fps, whereas the JPEG (VGA) and MPEG4 (SP, SIF) will be at 15 fps.
• Motion detection will reduce the frame rate by approximately three fps.
• For latency under 150 ms, the PC must meet the following requirements.
Hardware
• Intel(R) Pentium(R) D (Dual Core) CPU 3.0 GHz or equivalent
• 512 MB system memory or above
• Sound Card: DirectX 9.0c compatible sound card
• Video Card: 3D hardware accelerator card required; 100% DirectX 9.0c compatible
• Ethernet network port/card
• Network cable
• 10/100 Ethernet switch/hub
Software
• VLC media player 0.8.6b or above
• Windows XP Service Pack 2 or above
• Screen resolution setting: 1280x960 or higher for the display of 720P
10.8 SOURCE CODE ORGANIZATION
We now discuss the development tools you need in order to compile the code, followed by a brief description of how we have organized the code.
10.8.1 DEVELOPMENT TOOLS ENVIRONMENT(S)
Before starting to build the source code, please ensure that the required software packages and build tools are installed correctly. Below is the list of required software:
1. TI DVSDK software package version 1.30.00.40.
2. MontaVista Linux Pro v4.0.1.
3. Root file system for development (optional).
10.8.2 INSTALLATION AND GETTING STARTED
1. Copy <release>/source/ipnc_app_XXXX.tgz into the <installDir>/ directory on your Linux desktop.
2. Uncompress the installation file using the command below:
tar zxvf ipnc_app_XXXX.tgz
Then, in <installDir>, this creates a directory ipnc_app/, a file Rules.make, and the following sub-directories. Details of the directory structure are as follows:
1. ipnc_app/multimedia/encode_stream/: Single/dual codec/dual size streaming.
2. ipnc_app/sys_adm/alarm_control/: Demo code for communication with the alarm server.
3. ipnc_app/sys_adm/alarm_server/: Alarm server for processing events when an event triggers.
4. ipnc_app/sys_adm/file_mng/: Manager for the system parameters.
5. ipnc_app/sys_adm/param_transfer/: Communication interface with the web server.
6. ipnc_app/sys_adm/system_control/: Demo code for communication with the system server.
7. ipnc_app/sys_adm/system_control/: Application for processing commands from the web server.
8. ipnc_app/util/: Common utilities for inter-process communication.
9. ipnc_app/include/: Common header files.
10. ipnc_app/lib/: Common libraries.
11. ipnc_app/network/boa:
ActiveX control for the web server.
Display 720P or CIF images in the web browser.
Network configuration in the web browser.
12. ipnc_app/network/live: Adds support for getting CIF images by RTP.
13. ipnc_app/network/msmtp-1.4.13/: Message e-mail sender.
14. ipnc_app/network/quftp-1.0.7/: FTP client for sending JPEG images periodically.
15. ipnc_app/network/WebData/: Homepage and some data for the web server to use.
Figure 10.2: Directory structure of IP Netcam software.
3. Once the installation is complete, you need to modify Rules.make based on your system's deployment. Shown below is a brief description of Rules.make for reference. Please set the correct environment paths for your system:
# The installation directory of the DVSDK dvsdk_1_30_00_23.
DVSDK_INSTALL_DIR=/home/user/workdir/dvsdk_1_30_00_23
# For backwards compatibility.
DVEVM_INSTALL_DIR=$(DVSDK_INSTALL_DIR)
# Where the Codec Engine package is installed.
CE_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/codec_engine_2_00
# Where the XDAIS package is installed.
XDAIS_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/xdais_6_00
# Where the DSP Link package is installed.
#LINK_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/NOT_USED
# Where the CMEM (contiguous memory allocator) package is installed.
CMEM_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/cmem_2_00
# Where the codec servers are installed.
CODEC_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/dm355_codecs_1_06_01
# Where the RTSC tools package is installed.
XDC_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/xdc_3_00_02_11
# Where the Framework Components product is installed.
FC_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/framework_components_2_00
# Where DSP/BIOS is installed.
BIOS_INSTALL_DIR=$(DVSDK_INSTALL_DIR)/
# The directory that points to your kernel source directory.
LINUXKERNEL_INSTALL_DIR=/home/user/workdir/ti-davinci
# The prefix to be added before the GNU compiler tools (optionally including
# path), i.e., "arm_v5t_le-" or "/opt/bin/arm_v5t_le-".
MVTOOL_DIR=/opt/mv_pro_4.0.1/montavista/pro/devkit/arm/v5t_le
MVTOOL_PREFIX=$(MVTOOL_DIR)/bin/arm_v5t_le-
# Where to copy the resulting executables and data to (when executing make
# install) in a proper file structure. This EXEC_DIR should either be visible
# from the target, or one will have to copy this (whole) directory onto the
# target filesystem.
EXEC_DIR=/home/user/workdir/filesys/opt/net
# The directory that points to the IPNC software package
IPNC_DIR=/home/user/workdir/ipnc_app
# The directory for application includes
PUBLIC_INCLUDE_DIR=$(IPNC_DIR)/include
# The directory for application libraries
LIB_DIR=$(IPNC_DIR)/lib
# The root directory of your root file system
ROOT_FILE_SYS = /home/user/workdir/filesys
4. If the login is not as root, use the commands below to prevent errors during installation:
chown -R <useracct> <IPNC_DIR>
chown -R <useracct> <ROOT_FILE_SYS>
Substitute your user name for <useracct>; <IPNC_DIR> and <ROOT_FILE_SYS> are the directories you set in Rules.make in Step 3.
10.8.3 LIST OF INSTALLABLE COMPONENTS
Note: Any links appearing on this manifest were verified at the time it was created. TI makes no guarantee that they will remain active in the future.
10.8.4 BUILD PROCEDURE
1. Change directory to <InstallDir>/ipnc_app/ using the command below:
cd <InstallDir>/ipnc_app/
2. Build the software package using the commands:
make clean
make
3. Install the application to your root file system:
make install
Note:
This installation will overwrite files under /etc and /var in your root file system.
Please back up your data before you start.
10.8.5 EXECUTION PROCEDURE
In order to launch the Encode Demo, the following commands need to be executed from the target command prompt:
1. # test command for sensor 640x480
<target prompt># ./encode_ipnc -t 10 -d -r 640x480 -b 200000 -v record_480P.mpeg4
2. # test command for sensor 1280x720
<target prompt># ./encode_ipnc -t 10 -d -r 1280x720 -b 200000 -v record_720P.mpeg4
After the build is successful, the following modules will be generated in the directory $(EXEC_DIR) set in $(installDir)/Rules.make:
1. wis-streamer, wis-streamer2, file_mng
[Figure 10.3 is a manifest table of the installed third-party components. For each component it records the file or directory name, the license category (GPL v2/v3 or LGPL v2/v3 source code distribution per the respective license, TI RD SLA with source code distribution not permitted, BSD, or other licenses whose applicable terms must be read and followed), where the original source was obtained, and whether the source was modified by TI (Y/N). The recoverable entries are:]
Boa Webserver: http://www.boa.org, Version 0.94.13, downloaded 03 Aug 2007, modified by TI: Y
Dhcpcd: http://www.phystech.com/download/, Version v.1.3.22-pl4, downloaded 03 Aug 2007, modified by TI: N
ntpclient: http://doolittle.icarus.com/ntpclient/, Version 2007_365, downloaded 31 Dec 2007, modified by TI: N
libesmtp: http://www.stafford.uklinux.net/libesmtp/download.html, Version 1.0.4, downloaded 1 Mar 2008, modified by TI: N
Esmtp: http://esmtp.sourceforge.net/download.html, Version 0.6.0, downloaded 1 Mar 2008, modified by TI: N
Quftp: http://sourceforge.net/projects/quftp, Version 1.0.7, downloaded 31 Dec 2007, modified by TI: N
Libupnp: BSD (Berkeley Standard Distribution) license, see LICENSE file or http://pupnp.sourceforge.net/, Version 1.6.0, downloaded 03 Aug 2007, modified by TI: N
FFMpeg: http://ffmpeg.mplayerhq.hu/legal.html, Version SVN-r12347, downloaded 31 Dec 2007, modified by TI: Y
LIVE555 Streaming Media: http://www.live555.com/liveMedia/public/, Version 2007.08.03, downloaded 03 Aug 2007, modified by TI: Y
Figure 10.3: Installed Components.
104 CHAPTER 10. IP NETWORK CAMERA ON DM355 USING TI SOFTWARE
2. encode_stream
3. test.m4e
4. boa
5. loadmodules_ipnc.sh
Before you start streaming, ensure the files below are in the directory $(EXEC_DIR) set in $(installDir)/Rules.make:
1. dm350mmap.ko
2. cmemk.ko
3. mapdmaq
Example for execution:
1. VGA Demo:
Start:
$cd $(EXEC_DIR)
$./encode_stream -u 0 -q 50 -d -r 640x480 -b 4000000 -v test.mpeg4 &
$./wis-streamer &
Leave:
$killall -9 wis-streamer
$killall -9 encode_stream
2. 720P Demo:
Start:
$cd $(EXEC_DIR)
$./encode_stream -u 0 -q 50 -d -r 1280x720 -b 4000000 -v test.mpeg4 &
$./wis-streamer &
Leave:
$killall -9 wis-streamer
$killall -9 encode_stream
3. Dual (720P+CIF) Demo:
Start:
$cd $(EXEC_DIR)
$./encode_stream -u 3 -q 50 -d -r 1280x720 -e 352x192 -b 4000000 -v test.mpeg4 &
$./wis-streamer &
$./wis-streamer2 &
Leave:
$killall -9 wis-streamer
$killall -9 wis-streamer2
$killall -9 encode_stream
10.9 ARM9EJ PROGRAMMING
This section explains the top-level process and threads in the system, task partitioning across MJCP
and ARM9EJ for codecs, ARM9EJ load, thread/process scheduling, and component addition or
deletion.
10.9.1 ARM9EJ TASK PARTITIONING
10.9.1.1 Process/Threads and Scheduling
There are multiple processes within the IPNC software that enable functions including video capture, compression, streaming, and configuration. The most important processes are described below.
Encode_stream: enables video capture, resizing, compression, the 2A algorithms, and motion detection.
Wis-streamer: takes the MPEG4 elementary stream from the encode_stream process, packs it into RTP packets, and streams it over IP.
Webserver: an HTTP server based on BOA. An ActiveX control is included to display the MPEG4 stream. Network configuration is supported in this webserver process.
An inter-process interface function, GetAVData(), is implemented to ease process-level synchronization and communication. Part of GetAVData() is shown below.
int GetAVData( unsigned int field, int serial, AV_DATA * ptr )
{
    int ret = RET_SUCCESS;

    if (virptr == NULL)
        return RET_ERROR_OP;

    switch (field) {
    case AV_OP_GET_MJPEG_SERIAL:
        /* For this query the caller must pass serial == -1. */
        if (serial != -1) {
            ret = RET_INVALID_PRM;
        } else {
            FrameInfo_t curframe = GetCurrentFrame(FMT_MJPEG);
            if (curframe.serial_no < 0) {
                ret = RET_NO_VALID_DATA;
            } else {
                /* Fill in the descriptor of the most recent MJPEG frame. */
                ptr->serial = curframe.serial_no;
                ptr->size   = curframe.size;
                ptr->width  = curframe.width;
                ptr->height = curframe.height;
            }
        }
        break;
    /* ... handling of the other field queries is elided here ... */
    }
    return ret;
}
Both Wis-streamer and the webserver reuse well-known open-source code, and developers should be able to find enough details online. The encode_stream process is the most important process, so we discuss it in detail; every thread within this process is addressed.
The encode_stream process consists of nine separate POSIX threads (pthreads): the main thread (main.c), which eventually becomes the control thread (ctrl.c), the video thread (video.c), the display thread (display.c), the capture thread (capture.c), the stream writer thread (writer.c), the 2A thread (appro_aew.c), the motion detection thread (motion_detect.c), the audio/video message thread (stream.c), and the speech thread (speech.c). The video, display, capture, writer, 2A, motion, stream interface, and speech threads are spawned from the main thread before the main thread becomes the control thread. All threads except the original main/control thread are configured for preemptive, priority-based scheduling (SCHED_FIFO). The video and 2A threads share the highest priority, followed by the stream writer, display, and capture threads. The speech and motion threads have lower priority than the writer and capture threads, and the control thread has the lowest priority of all.
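The SCHED_FIFO configuration described above can be sketched with the POSIX thread-attribute API. This is a generic illustration, not code from the IPNC sources; the priority value 50 is an arbitrary example, and raising real-time priorities typically requires elevated privileges.

```c
#include <pthread.h>
#include <sched.h>

/* Build a pthread attribute object for a preemptive, priority-based
 * (SCHED_FIFO) thread, as used by the video, display, capture, and
 * writer threads described above.  Returns 0 on success. */
int make_fifo_attr(pthread_attr_t *attr, int priority)
{
    struct sched_param sp;

    if (pthread_attr_init(attr) != 0)
        return -1;
    /* Without EXPLICIT_SCHED the new thread silently inherits the
     * creator's policy and the settings below are ignored. */
    pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
    pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    sp.sched_priority = priority;
    pthread_attr_setschedparam(attr, &sp);
    return 0;
}

/* Round-trip check: read back the policy that was set. */
int fifo_policy_roundtrip(void)
{
    pthread_attr_t attr;
    int policy = -1;

    if (make_fifo_attr(&attr, 50) != 0)
        return -1;
    pthread_attr_getschedpolicy(&attr, &policy);
    pthread_attr_destroy(&attr);
    return policy;
}
```

Creating the thread with pthread_create(&tid, &attr, fn, arg) then runs fn under SCHED_FIFO at the given priority, subject to privileges.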
The initialization and cleanup of the threads are synchronized using the provided Rendezvous utility module. This module uses POSIX condition variables to synchronize thread execution. Each thread performs its initialization and signals the Rendezvous object when completed. When all threads have finished initializing, all threads are unblocked simultaneously and start executing their main loops. The same method is used for thread cleanup. This way, buffers that are shared between threads are not freed in one thread while still being used in another.
Figure 10.4: Application processes and threads.
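The Rendezvous mechanism can be sketched as a counting barrier built on a POSIX mutex and condition variable. This is a minimal illustration of the idea, not the actual Rendezvous module shipped with the TI software; the names and structure here are ours.

```c
#include <pthread.h>

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             expected;  /* number of threads that must arrive */
    int             arrived;   /* number that have arrived so far    */
} Rendezvous;

void Rendezvous_init(Rendezvous *r, int numThreads)
{
    pthread_mutex_init(&r->lock, NULL);
    pthread_cond_init(&r->cond, NULL);
    r->expected = numThreads;
    r->arrived  = 0;
}

/* Each thread calls this after finishing its initialization.  All
 * callers block until the last thread arrives, then all are released
 * simultaneously, so shared buffers are never used before every
 * thread has set them up (nor freed while another still needs them). */
void Rendezvous_meet(Rendezvous *r)
{
    pthread_mutex_lock(&r->lock);
    if (++r->arrived == r->expected)
        pthread_cond_broadcast(&r->cond);   /* last one releases all */
    else
        while (r->arrived < r->expected)
            pthread_cond_wait(&r->cond, &r->lock);
    pthread_mutex_unlock(&r->lock);
}

/* Degenerate single-thread self-check: with one expected thread,
 * meet() must return immediately.  Returns the arrival count. */
int Rendezvous_selftest(void)
{
    Rendezvous r;
    Rendezvous_init(&r, 1);
    Rendezvous_meet(&r);
    return r.arrived;
}
```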
10.9.1.2 Main Thread
The job of the main thread is to perform necessary initialization tasks, to parse the command-line
parameters provided by the user when invoking the application, and to spawn the other threads with
parameters depending on the values of the command-line parameters.
10.9.1.3 Display Thread
In order to show a preview of the frames being encoded while they are being encoded, the captured raw frames from the VPSS front end need to be copied to the frame buffer of the VPSS back end. To allow the copying to be performed in parallel with the DSP processing, it is performed by a separate display thread. The thread execution begins by initializing the FBDev display device driver in initDisplayDevice(). In this function, the display resolution (D1) and bits per pixel (16) are set using the FBIOPUT_VSCREENINFO ioctl, before the three buffers (triple-buffered display) are made available to the user-space process from the Linux device driver using the mmap() call. The buffers are initialized to black, since the video resolution might not be full D1 resolution, and the background of a smaller frame should be black. Next, a Rszcopy job is created. The Rszcopy module uses the VPSS resizer module on the DM355 to copy an image from source to destination without consuming CPU cycles. When the display thread has finished initializing, it synchronizes with the other threads using the Rendezvous utility module. Because of this, the main loop of the display thread executes only after the other threads have finished initializing.
Figure 10.5: Frame based processing of IP Netcam.
10.9.1.4 Capture Thread
The video capture device is initialized by initCaptureDevice(). The video capture device driver is a Video4Linux2 (v4l2) device driver. In this function, the capabilities of the capture device are verified using the VIDIOC_QUERYCAP ioctl. Next, the video standard (NTSC or PAL) is auto-detected from the capture device and verified against the display video standard selected on the Linux kernel command line. Then three video capture buffers are allocated inside the capture device driver using the VIDIOC_REQBUFS ioctl, and these buffers are mapped into the user-space application process using mmap(). Finally, the capturing of frames in the capture device driver is started using the VIDIOC_STREAMON ioctl.
10.9.1.5 Stream Writer Thread
To allow the writing of encoded video frames to the circular memory buffer to be done in parallel with the DSP processing, the stream writing is performed by a separate writer thread. First, the destination buffer in the memory manager is allocated by stream_init(). Then the Rendezvous object is notified that the stream writer thread's initialization is complete. Note that the speech thread, unlike the video thread, writes to its circular buffer in the speech thread itself, because speech has lower performance requirements than video.
Figure 10.6: Flowchart of thread and user commands.
Figure 10.7: Processing sequence.
10.9.1.6 Video Thread Interaction
Figure 10.8 shows one iteration of each of the threads involved in processing a video frame once they start executing their main loops, and how these threads interact.
Figure 10.8: More on processing sequence, control, and threads.
First, the capture thread dequeues a raw captured buffer from the VPSS front end device driver using the VIDIOC_DQBUF ioctl. To show a preview of the video frame being encoded, a pointer to this captured buffer is sent to the display thread using FifoUtil_put(). The capture thread then fetches an empty raw buffer pointer from the video thread, and the captured buffer pointer is sent to the video thread for encoding.
The video thread receives this captured buffer pointer and then fetches an I/O buffer from the stream writer thread using FifoUtil_get(). The encoded video data will be put in this I/O buffer.
While the display thread copies the captured raw buffer to the FBDev display frame buffer using the Rszcopy_execute() call, the video thread is encoding the same captured buffer into the fetched I/O buffer on the DSP using the VIDENC_process() call. Note that the encoder algorithm on the DSP core and the Rszcopy module might access the captured buffer simultaneously, but only for reading. When the display thread has finished copying the buffer, it makes the frame buffer to which the captured frame was just copied the new display buffer on the next vertical sync using the FBIOPAN_DISPLAY ioctl, before the thread waits on the next vertical sync using the FBIO_WAITFORVSYNC ioctl. When the video encoder running on the DSP core has finished encoding the captured buffer into the I/O buffer, the I/O buffer is sent to the writer thread using FifoUtil_put(), where it is written to the circular memory buffer using the stream_write() call. The captured raw buffer pointer is sent back to the capture thread to be refilled. The captured buffer pointer is collected in the capture thread from the display thread using FifoUtil_get(), as a handshake indicating that the display copying of this buffer is finished, before the captured buffer is re-enqueued at the VPSS front end device driver using the VIDIOC_QBUF ioctl. The writer thread writes the encoded frame to the circular memory buffer while the capture thread is waiting for the next dequeued buffer from the VPSS front end device driver to be ready. If the writing of the encoded buffer is not complete when the next dequeued buffer is ready and the capture thread is unblocked, there is no wait provided IO_BUFFERS is larger than 1, since another buffer will be available on the FIFO at this time. The encode_stream application has IO_BUFFERS set to 2.
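The buffer handoff above can be sketched with a tiny fixed-size FIFO of buffer pointers. This is an illustrative stand-in for the FifoUtil module, written from scratch for this sketch; with IO_BUFFERS set to 2, the writer can still hold one encoded buffer while the video thread fills the other.

```c
#include <stddef.h>

#define IO_BUFFERS 2

typedef struct {
    void *slot[IO_BUFFERS];
    int   head, tail, count;
} Fifo;

void Fifo_init(Fifo *f) { f->head = f->tail = f->count = 0; }

/* Returns 0 on success, -1 if the FIFO is full. */
int Fifo_put(Fifo *f, void *buf)
{
    if (f->count == IO_BUFFERS)
        return -1;
    f->slot[f->tail] = buf;
    f->tail = (f->tail + 1) % IO_BUFFERS;
    f->count++;
    return 0;
}

/* Returns 0 on success, -1 if the FIFO is empty. */
int Fifo_get(Fifo *f, void **buf)
{
    if (f->count == 0)
        return -1;
    *buf = f->slot[f->head];
    f->head = (f->head + 1) % IO_BUFFERS;
    f->count--;
    return 0;
}

/* Self-check: two puts succeed, a third fails (both buffers in
 * flight), and buffers come back out in order.  Returns 1 if ok. */
int Fifo_selftest(void)
{
    static int a, b, c;
    void *out;
    Fifo f;
    Fifo_init(&f);
    if (Fifo_put(&f, &a) || Fifo_put(&f, &b)) return 0;
    if (Fifo_put(&f, &c) != -1) return 0;   /* full: no free buffer */
    if (Fifo_get(&f, &out) || out != &a) return 0;
    if (Fifo_get(&f, &out) || out != &b) return 0;
    return 1;
}
```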
10.9.2 ARM CPU UTILIZATION
ARM CPU (running at 216 MHz) utilization is profiled statistically. The CPU loading information is collected 300 times over a period of 5 minutes. The details are listed below.
Note: ARM is running at 216 MHz.
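A statistical profile like the one described (300 samples over 5 minutes) reduces to simple aggregation over the collected samples. The helper below is a generic sketch of that post-processing step; the sample values in the self-test are hypothetical.

```c
/* Aggregate periodically collected CPU-load samples (in percent)
 * into average and peak figures, as done for the 300 samples
 * gathered over 5 minutes. */
typedef struct { double avg; int peak; } LoadStats;

LoadStats load_stats(const int *samples, int n)
{
    LoadStats s = { 0.0, 0 };
    long sum = 0;
    int i;

    for (i = 0; i < n; i++) {
        sum += samples[i];
        if (samples[i] > s.peak)
            s.peak = samples[i];
    }
    if (n > 0)
        s.avg = (double)sum / n;
    return s;
}

/* With the ARM at 216 MHz, a load percentage maps to MHz consumed. */
double load_to_mhz(double load_percent)
{
    return load_percent * 216.0 / 100.0;
}

/* Self-check on a small hypothetical sample set. */
int load_stats_selftest(void)
{
    int samples[5] = { 40, 50, 60, 50, 40 };
    LoadStats s = load_stats(samples, 5);
    return s.peak == 60 && s.avg == 48.0 && load_to_mhz(50.0) == 108.0;
}
```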
10.10 IMX PROGRAMMING
This section provides ways of offloading computational load to the iMX coprocessor available in the DM355, which runs concurrently with the ARM9EJ. Special treatment is required because the image and video codecs, such as JPEG and MPEG4, are tightly coupled to iMX and the other coprocessors/accelerators. iMX is free for 70 to 84% of the MPEG4 encoder execution time, depending on encoder settings.
10.10.1 IMX PROGRAM EXECUTION
The iMX program runs concurrently with the ARM9EJ. Typical iMX programs are math-intensive, requiring MAC operations. The iMX in the DM355 can perform 4 MACs per cycle. iMX and the ARM9 run at the same clock (216 MHz on the DM355H and 271 MHz on the DM355UH).
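As a rough sizing exercise, the 4-MACs-per-cycle figure lets you estimate the iMX cycle cost of a MAC-bound kernel. The helper below is our own back-of-the-envelope sketch, not a TI-provided formula, and it ignores setup and DMA overheads.

```c
/* Ideal iMX cycle count for a kernel needing a given number of MAC
 * operations per pixel, at 4 MACs per cycle (DM355). */
unsigned long imx_cycles(unsigned long macs_per_pixel,
                         unsigned long width, unsigned long height)
{
    return macs_per_pixel * width * height / 4;
}

/* At a 216 MHz iMX clock, convert a cycle count to microseconds. */
double imx_cycles_to_us(unsigned long cycles)
{
    return cycles / 216.0;
}
```

For example, a hypothetical 16-MAC-per-pixel kernel over a 64x64 block would need about 16384 iMX cycles at best.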
Figure 10.9: Measuring ARM CPU utilization and determining available headroom.
Note: ARM is running at 216 MHz.
10.10.1.1 iMX Utilization by MPEG4 Encoder
The MPEG4 encoder uses iMX for color conversion. If the 8x8 intra/inter decision is enabled, iMX is also used for the 8x8 average computation needed by the intra/inter decision logic. This takes about 400 cycles on iMX.
10.10.1.2 Sequential with MPEG4 Encoder
iMX algorithms can be run sequentially with the MPEG4/JPEG Coprocessor (MJCP) as shown in Figure 10.10. In this case, the entire SEQ and iMX program and data memory is available to the algorithm, and the iMX execution cycles corresponding to unused MJCP cycles are available as well. The feasibility of such a scenario depends on MJCP free time; in other words, iMX can run when MJCP is idle. The activate() and deactivate() xDM calls implemented by codecs protect against context switches in iMX and MJCP usage. Similarly, the algorithms will have to save iMX context, if needed, by implementing activate and deactivate calls: activate is used to restore context, and deactivate to save it.
In addition, the iMX program memory can be extended to 4096 bytes (instead of 1024 bytes) by swapping command memory with MJCP (since MJCP is not executing).
Figure 10.10: Sequential execution of iMX programs with MPEG4/JPEG codecs.
10.10.1.3 Concurrent with MPEG4 Encoder
In the case of concurrent execution, iMX programs are executed in parallel with MJCP execution, as shown in Figure 10.11.
The availability of iMX for processing other than encode/decode operations depends on:
Availability of IO memory (image buffer and coefficient buffer). The iMX program for algorithms should use the space not used by the codecs. 3848 bytes of the 4096-byte image buffer are used (248 bytes free in the image buffer); 4352 bytes of the 8192-byte coefficient buffer are used (3840 bytes free in the coefficient buffer).
Availability of program memory (command buffer). The iMX program of the algorithm will have to be inserted before or after the iMX program of the codec. 468 bytes out of 1024 are used (556 bytes free in command memory). The iMX program start and end addresses are 0x11F06000 and 0x11F061D4, respectively.
Availability of SEQ memory (program and data). The sequencer is required for scheduling DMA transfers to fetch data into or out of the iMX image and coefficient memory. If the iMX program is included as an extension to the codec (either pre- or post-processing operating on the same data as the codec), then the SEQ code of the codec may not require changes to handle the extra program in iMX. 2560 bytes of the 4096 bytes of program memory are used (1536 bytes of program memory are free; there is no data memory). The SEQ program start and end addresses are 0x11F0F000 and 0x11F0FA00, respectively.
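The memory figures quoted above can be summarized, and the free space double-checked, with a small table. The numbers come straight from the text; the code is just the arithmetic.

```c
typedef struct {
    const char *region;
    unsigned    total;  /* bytes available in the region        */
    unsigned    used;   /* bytes already consumed by the codec  */
} ImxRegion;

/* Budget quoted in the text for concurrent operation with the
 * MPEG4/JPEG codec on the DM355. */
static const ImxRegion budget[] = {
    { "iMX image buffer",       4096, 3848 },  /*  248 bytes free */
    { "iMX coefficient buffer", 8192, 4352 },  /* 3840 bytes free */
    { "iMX command buffer",     1024,  468 },  /*  556 bytes free */
    { "SEQ program memory",     4096, 2560 },  /* 1536 bytes free */
};

/* Bytes left for an algorithm in region i. */
unsigned imx_free_bytes(int i)
{
    return budget[i].total - budget[i].used;
}
```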
Figure 10.11: Concurrent execution of iMX programs with MPEG4/JPEG codecs.
DMA for IO. If additional DMA transfers are required by the algorithm to fetch input or output, the DMA transfers will have to be chained/linked to the existing transfers in the codec. This is needed to avoid a control flow change within codec processing; the codec control flow is managed by COPC and SEQ. This would require changes to the codec source files (at least a few of them), and the codec would have to be revalidated.
Availability of iMX cycles. The iMX free period may be used for iMX algorithms. About 400 cycles of iMX are used per macroblock encoding; in other words, iMX is free for 70 to 84% of the codec execution time, depending on encoder settings.
iMX program execution time. Currently, iMX is not the hardware block that determines codec execution time; codec execution time is determined by the worst-case blocks (in terms of execution time), which are DMA and b-iMX. Thus, if the iMX execution time exceeds the execution time of the block that is the bottleneck for performance, codec performance will degrade and cause timing problems (since codecs are not tested for this case). The codec will need revalidation, and the performance impact will have to be accounted for by the application.
10.11 CONCLUSION
The ARM9EJ is available for 40% to 50% of the total execution time to perform additional services. The MPEG4 and JPEG codecs use minimal ARM cycles, as seen from their datasheets; the ARM load per codec is less than 20 MHz. The rest of the codec processing is performed by MJCP. Scheduling the additional services concurrently with MJCP (performing encode/decode) yields optimal utilization of the ARM9EJ and MJCP. Additionally, the ARM9EJ can be utilized when MJCP has finished its encode/decode operation.
iMX can be used for pre/post-processing algorithms operating on macroblock data of the frame that is being encoded or decoded. In this case, care needs to be taken to ensure the iMX program does not become a performance bottleneck for the codecs (since codecs are not tested for this timing scenario). In addition, this concurrent operation requires the algorithms to be fitted within the available iMX program/data memory alongside the iMX program of the codecs.
If spare time is available after codec execution, iMX programs can be run sequentially with the codecs. In this mode of operation, there is no program/data memory limitation for algorithms, as the entire hardware is available to the iMX program of the algorithms. Also, other hardware modules such as the SEQUENCER and EDMA can be used by the algorithm without many restrictions.
C H A P T E R 11
Adding your secret sauce to the
Signal Processing Layer (SPL)
11.1 INTRODUCTION
So far we have discussed the software platform that Texas Instruments provides and how you may
develop your product based on it. However, some of you may have your own intellectual property
that brings a unique differentiation to your product. While you may add this differentiation in the
Application Layer (APL), depending on your algorithm, it may not run fast enough if the APL runs
on the ARM processor alone. In such cases, you would want to leverage the power of the DSP on
the SoC and migrate your algorithm to the DSP. This chapter shows how you may componentize your
secret sauce, port it to the DSP and integrate it with the rest of the software platform.
The process of doing this is quite involved, and we may not be able to do it full justice here. In the remainder of this chapter, I will try to touch on the key aspects. You may wish to get more hands-on experience by going through the teaching example posted online:
http://www.ti.com/davinciwiki-portingtodsp
11.2 FROM ANY C MODEL TO GOLDEN C MODEL ON PC
In most cases, the starting point for your algorithm might be code from an ITU standard, or you may have developed your own code. This will most likely be in floating point, developed and tested on the PC. As a developer you will probably assume that you have access to an unlimited amount of memory and processing power. Your goal would have been to first develop an algorithm that meets the needs of your application. The challenge now is to migrate it from the PC world to the embedded world, where space (memory) and time (MHz) are limited due to the target cost goals of the end product. Your C code base must be modified to adhere to certain rules so that it can run in real time and be easily integrated with the rest of the software platform. Golden C is the resulting code base that follows these rules.
Step 1: Create a test harness.
After confirming that the algorithm meets the needs of the target application, the first step is to create a test harness with well-defined input files and corresponding output files that will be used later for verifying the correctness of the port. This is an important step, since you will be using the test files several times in the process of porting your code to the DSP.
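A minimal form of such a harness is a bit-exact comparison of the algorithm's output against the stored reference file. The helper below is a generic sketch in portable C; the file names and formats are up to you.

```c
#include <stdio.h>

/* Compare two open streams byte by byte.  Returns 1 if identical,
 * 0 if they differ in content or length.  Used to check the ported
 * code's output against the golden reference output. */
int streams_match(FILE *a, FILE *b)
{
    int ca, cb;
    do {
        ca = fgetc(a);
        cb = fgetc(b);
        if (ca != cb)
            return 0;
    } while (ca != EOF);
    return 1;
}

/* Self-check using temporary files instead of real reference data. */
int harness_selftest(void)
{
    FILE *ref = tmpfile(), *out = tmpfile(), *bad = tmpfile();
    int ok;
    if (!ref || !out || !bad)
        return -1;
    fputs("golden output", ref);
    fputs("golden output", out);
    fputs("golden outpuX", bad);
    rewind(ref); rewind(out); rewind(bad);
    ok = streams_match(ref, out);          /* identical -> 1 */
    rewind(ref);
    ok = ok && !streams_match(ref, bad);   /* differs   -> 0 */
    fclose(ref); fclose(out); fclose(bad);
    return ok;
}
```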
Next, ideally, the algorithm should be converted from floating-point data types to fixed-point data types. Depending on the type of SoC you choose from Texas Instruments, you may or may not have to do this. For example, there are devices with floating-point support. However, the majority of devices support only fixed-point arithmetic, and if cost and power are your concerns then these fixed-point devices become more attractive. While you may think that your algorithm must have floating-point support, in general it is possible to work around that requirement and develop an algorithm using fixed-point arithmetic. There is a wealth of information and documentation available on how to convert your floating-point processing to fixed point, and describing the steps involved is outside the scope of this book.
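To make the idea concrete, the most common fixed-point format on 16-bit DSP data paths is Q15, where a value in [-1, 1) is stored as a 16-bit integer scaled by 2^15. The conversion sketch below is generic and not tied to any particular TI library.

```c
/* Convert a float in [-1.0, 1.0) to Q15 with rounding and
 * saturation:  0.5 -> 16384, -1.0 -> -32768; values at or above
 * +1.0 saturate to 32767. */
short float_to_q15(float x)
{
    float scaled = x * 32768.0f;
    int   v = (int)(scaled + (scaled >= 0.0f ? 0.5f : -0.5f));

    if (v >  32767) v =  32767;
    if (v < -32768) v = -32768;
    return (short)v;
}

/* Inverse mapping, useful when checking the fixed-point port
 * against the floating-point golden output. */
float q15_to_float(short q)
{
    return (float)q / 32768.0f;
}

/* Q15 multiply: the 32-bit product has 30 fractional bits, so
 * shift right by 15 to return to Q15. */
short q15_mul(short a, short b)
{
    return (short)(((long)a * b) >> 15);
}
```

For example, q15_mul(float_to_q15(0.5f), float_to_q15(0.5f)) yields the Q15 representation of 0.25.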
Step 2: Convert your C to a Golden C model.
In order to make your code componentized and meet real-time and embedded processing requirements, it should meet some basic rules. While there are many more rules to follow, here are some key ones:
Rule 1: Organize your code base. Organize your code by functionality and place it in appropriate folders. While this might be obvious, organizing source code, test files, and documentation allows multiple teams to work together and share their developments at a later stage in the development cycle. Test files should be isolated from the algorithm implementation.
Rule 2: Your algorithm should not perform any file input and output operations. All data in and out should be passed via buffers using pointers, for efficiency. Your top-level application code may do file I/O operations, but your core algorithm should never do them.
Rule 3: Remove any mallocs and callocs. This is probably the hardest part of the conversion process, since algorithm developers tend to use mallocs. However, for embedded processing, we want the framework, such as the Codec Engine, to be in charge of managing resources.
Rule 4: Classify the types of memory used into persistent, scratch, and constants. Persistent memory needs to be maintained from one frame to the next, while scratch memory need not be saved or maintained from one frame to the next. Constants are tables or coefficients that are needed by the algorithm. In large-volume applications, constants can be moved to ROM, thereby reducing the cost of the system.
Rule 5: Avoid use of static and global variables.
Rule 6: Data types should all be isolated into one header file with clear explanations. This will become useful later, when you wish to leverage optimized code or kernels provided by TI and/or third parties. Well-defined data types that match the word length of the device and the library routines are important.
Rule 7: Your code should not contain any endian-specific instructions.
Rule 8: The stack should be used only for local variables and parameter passing. Large local arrays and structures should not be allocated on the stack.
Rule 9: Your code should be xDAIS-compliant. Tools are available for testing for this compliance.
Rule 10: Your code should be xDM-compliant.
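Rule 4's classification is what frameworks such as xDAIS ultimately ask the algorithm to report through its memory-request tables. The sketch below uses our own simplified record type to illustrate the bookkeeping; the real xDAIS IALG_MemRec structure differs in its details, and the sizes shown are hypothetical.

```c
#include <stddef.h>

typedef enum { MEM_PERSIST, MEM_SCRATCH, MEM_CONST } MemClass;

typedef struct {
    const char *name;
    size_t      size;
    MemClass    cls;
} MemRec;

/* Hypothetical memory table for a frame-based algorithm: state that
 * must survive between frames, working space that need not, and
 * coefficient tables that could live in ROM. */
static const MemRec memtab[] = {
    { "history buffer", 2048, MEM_PERSIST },
    { "working buffer", 8192, MEM_SCRATCH },
    { "filter coeffs",   512, MEM_CONST   },
};

/* Total bytes requested for one class; the framework (for example
 * the Codec Engine) would then place each class appropriately,
 * e.g. scratch in fast on-chip RAM shared between algorithms. */
size_t mem_total(MemClass cls)
{
    size_t total = 0;
    size_t i;
    for (i = 0; i < sizeof(memtab) / sizeof(memtab[0]); i++)
        if (memtab[i].cls == cls)
            total += memtab[i].size;
    return total;
}
```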
Step 3: Build, run, and test your Golden C code on the PC.
Now that you have made your code embedded-processing friendly, build and test it using the test harness defined in Step 1 to ensure that you did not introduce any new bugs! Benchmark your code on the PC to evaluate the performance. For example, you may wish to know the frames per second (fps) that you observe by running the code on the PC.
Step 4: Build, run, and test your Golden C code on the DSP using CCS.
Now you are ready to use Code Composer Studio (CCStudio), a software development tool and environment provided by Texas Instruments. Compile your code base for the DSP and use CCStudio to load it and run it on the DSP. Reuse your test harness to ensure that your code runs correctly on the DSP.
Step 5: Basic DSP optimization using compiler options.
By turning on certain compiler options, you may quickly see significant boosts in the performance of your code. Please see the wiki page shown in Section 11.1 for details.
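As a rough illustration, a release build rule for the C6000 DSP compiler might look like the fragment below. The flag names are from the cl6x toolchain as we understand it and should be checked against your compiler documentation; the exact option set that helps depends on your code.

```makefile
# Hypothetical release build rule for a C64x+ DSP using cl6x.
#   -o3      : highest file-level optimization
#   -mv6400+ : generate code for the C64x+ core
CFLAGS = -o3 -mv6400+

alg.obj: alg.c
	cl6x $(CFLAGS) -c alg.c
```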
Step 6: Make the code xDAIS- and xDM-compliant.
This is a necessary step to make your code integrate with the rest of the software from Texas Instruments and third parties. You may have already done this in Step 2; while Step 2 is ideally the correct stage at which to make your code xDAIS- and xDM-compliant, you may delay it to Step 6. There are several tools for testing your embedded code for compliance that are not available to you on the PC.
Step 7: Create a server for the Codec Engine.
Step 8: Test the server using DVTB as a reference example.
You can now use the Digital Video Test Bench (DVTB) code as a reference application for calling your xDM-compliant algorithm from the ARM and measuring the performance. You should observe a significant boost in performance when your code runs on the DSP.
These eight steps have provided you with a process for adding your unique differentiating features easily to the standard TI software platform.
Using the TI software platform allows you to focus your creative energy more on your differentiation and less on the mundane tasks of writing basic software. You need not know all the complexity and details of the underlying hardware architecture in order to build your compelling product. The beauty of this is that you can take advantage of years of software engineering effort provided with TI's silicon and build on top of it.
C H A P T E R 12
Further Reading
We hope that this book has given you an insight into the software platform and how it is organized
and how you may develop applications based on it. In addition to this reading, there are several other
resources and books that will accelerate your software development. Some of them are:
1. OMAP and DaVinci Software for Dummies, by Steve Blonstein and Alan Campbell. TI part
#: SPRW184
2. DVTB documentation available online at www.ti.com/davinciwiki_dvtb
3. Porting GPP code to DSP and Codec Engine at
www.ti.com/davinciwiki-portingtodsp
4. Three different URLs: wiki.davincidsp.com, wiki.omap.com, and tiexpressdsp.com. The same content appears regardless of which URL you use. This was done in order to serve the needs of the DaVinci(TM), OMAP(TM), and eXpressDSP(TM) platforms without fracturing or duplicating content.
5. Related Wiki and Project Sites
The Real-Time Software Components (RTSC) project wiki at eclipse.org
TI Open-source projects
Target Content Downloads (DSP/BIOS, Codec Engine, XDAIS, RTSC etc)
Code Generation Tools Downloads
Applications PowerToys Downloads
TI DSP Village Knowledgebase
6. Leveraging DaVinci Technology for creating IP Network Cameras, TI Developers Conference, Dallas, Texas, March 14, 2007.
7. Accelerating Innovation with Off-the-Shelf Software and Published APIs, ARM Developers
Conference 06, Santa Clara, CA, Oct 3, 2006. Invited presentation.
8. Accelerating Innovation with the DaVinci software code and Programming Model, TI De-
velopers Conference, Dallas, Texas, Feb 28, 2006.
About the Author
BASAVARAJ I. PAWATE
Basavaraj I. Pawate (Raj) , Distinguished Member Technical Staff, has held several leadership
positions for TI worldwide in North America, Japan, and India. These cover a wide spectrum of
responsibilities ranging from individual research to initiating R&D programs, from establishing
product development groups to outsourcing and creating reference designs, from winning designs
and helping customers ramp to production to being CTO of emerging markets.
After completing his M.A.Sc. in signal processing at the University of Ottawa, Ottawa,
Canada, Raj joined TI Corporate R&D in 1985 and worked on speech processing, in particular
speech recognition for almost 10 years. He then moved to Japan where he established the Multimedia
Signal Processing group from the ground up. When TI identied VoIP as an EEE, Raj went to
Bangalore, India to establish a large effort in product R&D. Here he worked withTelogy, a company
that Texas Instruments acquired, to deliver Janus, a multicore DSP device with VoIP software.
Raj is credited with several early innovations; a few examples include the world's first Internet Audio Player, a precursor to MP3 players, a worldwide standard for DSPs in standardized modules (Basava Technology), reuse methodologies for codecs and pre-silicon validation (CDR & Links & Chains), and one software platform for diverse hardware platforms.
Raj has fifteen issued patents in DSP algorithms, memory, and systems. Several of these patents have been deployed in products. Raj has published more than thirty technical papers.
Raj and his wife Parvathi have three daughters and live in Houston. Raj enjoys talking, walking, and, recently, reading philosophy.