2.2. Data Sheet¶
Read This First
All performance numbers provided in this document are gathered using following Evaluation Modules unless otherwise specified.
NOTE: All performance measurements provided in this document are PRELIMINARY and have not necessarily been fully optimized yet. They are provided here as-is for this release.
Name | Description |
---|---|
J721E EVM | J721E Evaluation Module, SOM rev E6 with ARM running at 2GHz, DDR data rate 3733 MT/S, L3 Cache size 3MB, J721EXG01EVM |
Table: Evaluation Modules
About This Manual
This document provides a roadmap for benchmarks to be provided in this release and future releases, as well as specific performance data for benchmarks, kernel drivers and use-case examples provided as part of this release of the Processor SDK Linux Automotive package. This document should be used in conjunction with release notes and user guides provided with the Processor SDK Linux Automotive package for information on specific issues present with drivers and other software included in a particular release.
If You Need Assistance
For further information or to report any problems, contact http://community.ti.com/ or http://support.ti.com/
2.2.1. Benchmarking Roadmap¶
The current plan for availability of benchmark measurements on the J721E EVM is shown in the table below.
Domain | Feature | Benchmark Application | Metric | Target SDK |
---|---|---|---|---|
MPU/A72 | MIPS | CoreMark | Score | SDK6.0 |
Dhrystone | Score | SDK6.0 | ||
Nbench | Score | SDK6.0 | ||
Whetstone | Score | SDK6.0 | ||
Linpack | Score | SDK6.0 | ||
Multi-tasking | CoreMarkPro | Single core Score | SDK6.0 | |
Multicore scaling | SDK6.0 | |||
SpecInt2K6 | Single core Score | SDK6.0 | ||
Multicore scaling | SDK6.0 | |||
Multibench | Single core Score | SDK6.0 | ||
Multicore scaling | SDK6.0 | |||
MSMC | L3 Cache | CoreMark Pro | Single core Score | SDK6.0 |
Multicore scaling | SDK6.0 | |||
LMBench | Memory access latency | SDK6.0 | ||
SpecInt2K6 | Single core Score | SDK6.0 | ||
Multicore scaling | SDK6.0 | |||
I/O Coherence | Ethernet throughput | CPU Load (%) | SDK6.1 | |
I/O throughput (GBps) | SDK6.1 | |||
DSS7 & PAT | On-the-fly composition | Multi-layer composition | Resolution | SDK6.0 |
FPS | SDK6.0 | |||
CPU Load (%) | SDK6.0 | |||
Memory bandwidth (GBps) | SDK6.0 | |||
Multimedia | HEVC 4K@60fps decoding | HEVC 4K stream decoding | Latency | SDK6.1 |
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
4x 1080p@60fps decoding | 4-channel decoding | Latency | SDK6.1 | |
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
H264 1080p@60fps encoding | 1080p encoding | Latency | SDK6.1 | |
CPU Load | TBD | |||
Memory Bandwidth | TBD | |||
4x 720p@30fps Encoding | 4-channel Encoding | Latency | TBD | |
CPU Load (%) | TBD | |||
Memory Bandwidth (GBps) | TBD | |||
3D Graphics | GPU off screen rendering performance | GFXBench 3.0 Manhattan 1080p offscreen (fps) | FPS | SDK6.0 |
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
GFXBench 3.1 Manhattan 1080p offscreen (fps) | FPS | SDK6.0 | ||
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
GFXBench 4.0 Car Chase 1080p offscreen (fps) | FPS | SDK6.0 | ||
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
GFXBench Trex OffScreen | FPS | SDK6.0 | ||
CPU Load (%) | SDK6.1 | |||
Memory Bandwidth (GBps) | SDK6.1 | |||
Texture Read/Write | FPS | SDK6.2 | ||
Memory Bandwidth (GBps) | SDK6.2 | |||
DDR/Interconnect | Throughput | STREAM | GBps | SDK6.0 |
LMBench | GBps | SDK6.0 | ||
udma mem<->mem xfers | GBps | SDK6.2 | ||
DRU xfers | GBps | SDK6.2 | ||
ECC overhead | Synthetic application with ECC memory region | GBps | SDK6.2 | |
% loss in performance | SDK6.2 | |||
Boot-time Measurement | Cold boot to Linux prompt | Time measurement | seconds | SDK6.1 |
Peripheral Performance Results | Throughput | Unit Test | Mbits/sec | SDK6.0 |
Table: Benchmarking Release Plan
2.2.2. Benchmarks¶
2.2.2.1. LMBench¶
LMBench is a collection of microbenchmarks of which the memory bandwidth and latency-related ones are typically used to estimate processor memory system performance.
Latency: lat_mem_rd-stride128-szN, where N is equal to or smaller than the cache size at a given level, measures the cache miss penalty. N that is at least double the size of last level cache is the latency to external memory.
Bandwidth: bw_mem_bcopy-N, where N is is equal to or smaller than the cache size at a given level, measures the achievable memory bandwidth from software doing a memcpy() type operation. Typical use is for external memory bandwidth calculation. The bandwidth is calculated as byte read and written counts as 1 which should be roughly half of STREAM copy result.
Benchmarks | j721e-evm: perf |
---|---|
af_unix_sock_stream_latency (microsec) | 11.29 |
af_unix_socket_stream_bandwidth (MBs) | 2849.56 |
bw_file_rd-io-1mb (MB/s) | 4959.59 |
bw_file_rd-o2c-1mb (MB/s) | 2914.39 |
bw_mem-bcopy-16mb (MB/s) | 3196.16 |
bw_mem-bcopy-1mb (MB/s) | 6046.59 |
bw_mem-bcopy-2mb (MB/s) | 4056.80 |
bw_mem-bcopy-4mb (MB/s) | 3386.39 |
bw_mem-bcopy-8mb (MB/s) | 3190.64 |
bw_mem-bzero-16mb (MB/s) | 8956.06 |
bw_mem-bzero-1mb (MB/s) | 8407.49 (min 6046.59, max 10768.39) |
bw_mem-bzero-2mb (MB/s) | 7240.76 (min 4056.80, max 10424.71) |
bw_mem-bzero-4mb (MB/s) | 6578.58 (min 3386.39, max 9770.76) |
bw_mem-bzero-8mb (MB/s) | 6210.45 (min 3190.64, max 9230.26) |
bw_mem-cp-16mb (MB/s) | 1134.75 |
bw_mem-cp-1mb (MB/s) | 6321.01 (min 1878.13, max 10763.89) |
bw_mem-cp-2mb (MB/s) | 6392.84 (min 2338.79, max 10446.89) |
bw_mem-cp-4mb (MB/s) | 5533.59 (min 1270.65, max 9796.53) |
bw_mem-cp-8mb (MB/s) | 5193.39 (min 1144.33, max 9242.45) |
bw_mem-fcp-16mb (MB/s) | 3255.34 |
bw_mem-fcp-1mb (MB/s) | 8891.47 (min 7014.54, max 10768.39) |
bw_mem-fcp-2mb (MB/s) | 7180.86 (min 3937.01, max 10424.71) |
bw_mem-fcp-4mb (MB/s) | 6565.77 (min 3360.78, max 9770.76) |
bw_mem-fcp-8mb (MB/s) | 6245.12 (min 3259.98, max 9230.26) |
bw_mem-frd-16mb (MB/s) | 6707.66 |
bw_mem-frd-1mb (MB/s) | 6642.07 (min 6269.59, max 7014.54) |
bw_mem-frd-2mb (MB/s) | 4858.16 (min 3937.01, max 5779.30) |
bw_mem-frd-4mb (MB/s) | 5166.74 (min 3360.78, max 6972.69) |
bw_mem-frd-8mb (MB/s) | 4998.70 (min 3259.98, max 6737.41) |
bw_mem-fwr-16mb (MB/s) | 8956.06 |
bw_mem-fwr-1mb (MB/s) | 8516.74 (min 6269.59, max 10763.89) |
bw_mem-fwr-2mb (MB/s) | 8113.10 (min 5779.30, max 10446.89) |
bw_mem-fwr-4mb (MB/s) | 8384.61 (min 6972.69, max 9796.53) |
bw_mem-fwr-8mb (MB/s) | 7989.93 (min 6737.41, max 9242.45) |
bw_mem-rd-16mb (MB/s) | 7151.37 |
bw_mem-rd-1mb (MB/s) | 12469.51 (min 11214.63, max 13724.38) |
bw_mem-rd-2mb (MB/s) | 5398.97 (min 3764.35, max 7033.59) |
bw_mem-rd-4mb (MB/s) | 6074.99 (min 3723.01, max 8426.97) |
bw_mem-rd-8mb (MB/s) | 4472.57 (min 1612.42, max 7332.72) |
bw_mem-rdwr-16mb (MB/s) | 1529.05 |
bw_mem-rdwr-1mb (MB/s) | 4502.86 (min 1878.13, max 7127.58) |
bw_mem-rdwr-2mb (MB/s) | 2482.81 (min 2338.79, max 2626.83) |
bw_mem-rdwr-4mb (MB/s) | 2454.50 (min 1270.65, max 3638.35) |
bw_mem-rdwr-8mb (MB/s) | 1420.79 (min 1144.33, max 1697.25) |
bw_mem-wr-16mb (MB/s) | 1434.21 |
bw_mem-wr-1mb (MB/s) | 10425.98 (min 7127.58, max 13724.38) |
bw_mem-wr-2mb (MB/s) | 3195.59 (min 2626.83, max 3764.35) |
bw_mem-wr-4mb (MB/s) | 3680.68 (min 3638.35, max 3723.01) |
bw_mem-wr-8mb (MB/s) | 1654.84 (min 1612.42, max 1697.25) |
bw_mmap_rd-mo-1mb (MB/s) | 12495.41 |
bw_mmap_rd-o2c-1mb (MB/s) | 3616.29 |
bw_pipe (MB/s) | 4463.63 |
bw_unix (MB/s) | 2849.56 |
lat_connect (us) | 22.63 |
lat_ctx-2-128k (us) | 2.99 |
lat_ctx-2-256k (us) | 3.19 |
lat_ctx-4-128k (us) | 4.02 |
lat_ctx-4-256k (us) | 4.34 |
lat_fs-0k (num_files) | 813.00 |
lat_fs-10k (num_files) | 209.00 |
lat_fs-1k (num_files) | 236.00 |
lat_fs-4k (num_files) | 218.00 |
lat_mem_rd-stride128-sz1000k (ns) | 8.43 |
lat_mem_rd-stride128-sz125k (ns) | 5.15 |
lat_mem_rd-stride128-sz250k (ns) | 5.15 |
lat_mem_rd-stride128-sz31k (ns) | 2.00 |
lat_mem_rd-stride128-sz50 (ns) | 2.00 |
lat_mem_rd-stride128-sz500k (ns) | 5.15 |
lat_mem_rd-stride128-sz62k (ns) | 5.15 |
lat_mmap-1m (us) | 6.72 |
lat_ops-double-add (ns) | 0.32 |
lat_ops-double-mul (ns) | 2.00 |
lat_ops-float-add (ns) | 0.32 |
lat_ops-float-mul (ns) | 2.00 |
lat_ops-int-add (ns) | 0.50 |
lat_ops-int-bit (ns) | 0.33 |
lat_ops-int-div (ns) | 4.00 |
lat_ops-int-mod (ns) | 4.67 |
lat_ops-int-mul (ns) | 1.52 |
lat_ops-int64-add (ns) | 0.50 |
lat_ops-int64-bit (ns) | 0.33 |
lat_ops-int64-div (ns) | 3.00 |
lat_ops-int64-mod (ns) | 5.67 |
lat_pagefault (us) | 1.10 |
lat_pipe (us) | 8.55 |
lat_proc-exec (us) | 512.91 |
lat_proc-fork (us) | 494.18 |
lat_proc-proccall (us) | 0.00 |
lat_select (us) | 9.61 |
lat_sem (us) | 1.13 |
lat_sig-catch (us) | 2.15 |
lat_sig-install (us) | 0.41 |
lat_sig-prot (us) | 0.16 |
lat_syscall-fstat (us) | 1.02 |
lat_syscall-null (us) | 0.25 |
lat_syscall-open (us) | 121.36 |
lat_syscall-read (us) | 0.38 |
lat_syscall-stat (us) | 1.85 |
lat_syscall-write (us) | 0.32 |
lat_tcp (us) | 0.53 |
lat_unix (us) | 11.29 |
latency_for_0.50_mb_block_size (nanosec) | 5.15 |
latency_for_1.00_mb_block_size (nanosec) | 4.22 (min 0.00, max 8.43) |
pipe_bandwidth (MBs) | 4463.63 |
pipe_latency (microsec) | 8.55 |
procedure_call (microsec) | 0.00 |
select_on_200_tcp_fds (microsec) | 9.61 |
semaphore_latency (microsec) | 1.13 |
signal_handler_latency (microsec) | 0.41 |
signal_handler_overhead (microsec) | 2.15 |
tcp_ip_connection_cost_to_localhost (microsec) | 22.63 |
tcp_latency_using_localhost (microsec) | 0.53 |
Table: LM Bench Metrics
2.2.2.2. Dhrystone¶
Dhrystone is a core only benchmark that runs from warm L1 caches in all modern processors. It scales linearly with clock speed. For standard ARM cores the DMIPS/MHz score will be identical with the same compiler and flags.
Benchmarks | j721e-evm: perf |
---|---|
cpu_clock (MHz) | 2000.00 |
dhrystone MIPS (dual-core) | 23K DMIPS |
dhrystone_per_second (DhrystoneP) per core | 20000000.00 |
Table: Dhrystone Benchmark
2.2.2.3. Whetstone¶
Whetstone is a benchmark that measures the speed and efficiency at which a core performs floating-point operations.
Benchmarks | j721e-evm: perf |
---|---|
whetstone (MIPS) | 10000.00 |
Table: Whetstone Benchmark
2.2.2.4. Linpack¶
Linpack measures peak double precision (64 bit) floating point performance in solving a dense linear system.
Benchmarks | j721e-evm: perf |
---|---|
linpack (Kflops) | 2634946.00 |
Table: Linpack Benchmark
2.2.2.5. NBench¶
NBench is a CPU benchmark designed to expose the capabilities of a system’s CPU, FPU, and memory system. NBench is a single-threaded benchmark and is not designed to measure the performance gain on multi-core machines.
Benchmarks | j721e-evm: perf |
---|---|
assignment (Iterations) | 29.88 |
fourier (Iterations) | 32335.00 |
fp_emulation (Iterations) | 257.81 |
huffman (Iterations) | 2466.30 |
idea (Iterations) | 7718.70 |
lu_decomposition (Iterations) | 1431.80 |
neural_net (Iterations) | 24.09 |
numeric_sort (Iterations) | 871.86 |
string_sort (Iterations) | 428.25 |
Table: NBench Benchmarks
2.2.2.6. Stream¶
STREAM is a microbenchmark for measuring data memory system performance without any data reuse. It is designed to miss on caches and exercise data prefetcher and speculative accesseses. It uses double precision floating point (64bit) but in most modern processors, the memory access will be the bottleneck. The four individual scores are copy, scale (as in multiply by constant), add two numbers, and triad for multiply accumulate. For bandwidth, a byte read counts as one, and a byte written counts as one, resulting in a score that is double the bandwidth LMBench will show.
Benchmarks | j721e-evm: perf |
---|---|
add (MB/s) | 6789.60 |
copy (MB/s) | 6637.90 |
scale (MB/s) | 6565.70 |
triad (MB/s) | 6783.10 |
Table: Stream
2.2.2.7. CoreMark¶
CoreMark® is a benchmark that measures the performance of CPUs in embedded systems. Coremark contains implementations of the following algorithms: list processing (find and sort), matrix manipulation (common matrix operations), state machine (determine if an input stream contains valid numbers), and CRC (cyclic redundancy check). The result of the benchmark is a single score for the CPU divided by the clock rate (CoreMark/MHz).
Benchmarks | j721e-evm: Score/MHz |
---|---|
CoreMark/MHz (dual-core) | 11.4 |
Table: CoreMark
2.2.2.8. CoreMarkPro¶
CoreMark®-Pro is a comprehensive, advanced processor benchmark that works with and enhances the market-proven industry-standard EEMBC CoreMark® benchmark. While CoreMark stresses the CPU pipeline, CoreMark-Pro tests the entire processor, adding comprehensive support for multicore technology, a combination of integer and floating-point workloads, and data sets for utilizing larger memory subsystems.
Benchmarks | j721e-evm: perf |
---|---|
cjpeg-rose7-preset (workloads/) | 83.33 |
core (workloads/) | 0.78 |
coremark-pro () | 2561.66 |
linear_alg-mid-100x100-sp (workloads/) | 82.51 |
loops-all-mid-10k-sp (workloads/) | 2.50 |
nnet_test (workloads/) | 3.55 |
parser-125k (workloads/) | 12.20 |
radix2-big-64k (workloads/) | 281.69 |
sha-test (workloads/) | 156.25 |
zip-test (workloads/) | 52.63 |
Table: CoreMarkPro
2.2.2.9. MultiBench¶
MultiBench™ is a suite of benchmarks that allows processor and system designers to analyze, test, and improve multicore processors. It uses three forms of concurrency:
- Data decomposition: multiple threads cooperating on achieving a unified goal and demonstrating a processor’s support for fine grain parallelism.
- Processing multiple data streams: uses common code running over multiple threads and demonstrating how well a processor scales over scalable data inputs.
- Multiple workload processing: shows the scalability of general-purpose processing, demonstrating concurrency over both code and data.
MultiBench combines a wide variety of application-specific workloads with the EEMBC Multi-Instance-Test Harness (MITH), compatible and portable with most any multicore processors and operating systems. MITH uses a thread-based API (POSIX-compliant) to establish a common programming model that communicates with the benchmark through an abstraction layer and provides a flexible interface to allow a wide variety of thread-enabled workloads to be tested.
Benchmarks | j721e-evm: perf |
---|---|
4m-check (workloads/) | 1296.68 |
4m-check-reassembly (workloads/) | 239.81 |
4m-check-reassembly-tcp (workloads/) | 135.14 |
4m-check-reassembly-tcp-cmykw2-rotatew2 (workloads/) | 46.37 |
4m-check-reassembly-tcp-x264w2 (workloads/) | 2.80 |
4m-cmykw2 (workloads/) | 324.15 |
4m-cmykw2-rotatew2 (workloads/) | 63.36 |
4m-reassembly (workloads/) | 247.53 |
4m-rotatew2 (workloads/) | 73.15 |
4m-tcp-mixed (workloads/) | 210.53 |
4m-x264w2 (workloads/) | 2.84 |
idct-4m (workloads/) | 35.10 |
idct-4mw1 (workloads/) | 35.10 |
ippktcheck-4m (workloads/) | 1200.19 |
ippktcheck-4mw1 (workloads/) | 1203.08 |
ipres-4m (workloads/) | 229.36 |
ipres-4mw1 (workloads/) | 227.96 |
md5-4m (workloads/) | 54.44 |
md5-4mw1 (workloads/) | 53.08 |
rgbcmyk-4m (workloads/) | 164.20 |
rgbcmyk-4mw1 (workloads/) | 164.07 |
rotate-4ms1 (workloads/) | 58.82 |
rotate-4ms1w1 (workloads/) | 58.89 |
rotate-4ms64 (workloads/) | 59.31 |
rotate-4ms64w1 (workloads/) | 59.31 |
x264-4mq (workloads/) | 1.46 |
x264-4mqw1 (workloads/) | 1.46 |
Table: Multibench
2.2.2.10. Spec2K6¶
CPU2006 is a set of benchmarks designed to test the CPU performance of a modern server computer system. It is split into two components, the first being CINT2006, the other being CFP2006 (SPECfp), for floating point testing.
SPEC defines a base runtime for each of the 12 benchmark programs. For SPECint2006, that number ranges from 1000 to 3000 seconds. The timed test is run on the system, and the time of the test system is compared to the reference time, and a ratio is computed. That ratio becomes the SPECint score for that test. (This differs from the rating in SPECINT2000, which multiplies the ratio by 100.)
As an example for SPECint2006, consider a processor which can run 400.perlbench in 2000 seconds. The time it takes the reference machine to run the benchmark is 9770 seconds. Thus, the ratio is 4.885. Each ratio is computed, and then the geometric mean of those ratios is computed to produce an overall value.
Benchmarks | j721e-evm: perf |
---|---|
Spec2K6_speed (PRELIM) | 5.53 score/GHz |
Spec2K6_rate (PRELIM) | 4.8 score/GHz/Core |
Table: Spec2K6
2.2.2.11. Boot-time Measurement¶
2.2.2.11.1. Boot media: MMCSD¶
Boot Configuration | j721e-evm: boot time (sec) |
---|---|
Kernel boot time test when bootloader, kernel and sdk-rootfs are in mmc-sd | 22.7 |
Kernel boot time test when init is /bin/sh and bootloader, kernel and sdk-rootfs are in mmc-sd | 14.5 |
Table: Boot time MMC/SD
2.2.2.12. DSS Composition¶
A DSS unit test is used to perform on-the-fly composition with 4 distinct display layers all going to the same display output. Frame-rate performance is measured, along with CPU load and DDR memory bandwidth consumption.
DSS Display Composition | Resolution | j721e-evm: FPS | CPU Load | Bandwidth: GB/sec |
---|---|---|---|---|
4-layer composition | 1080p | 60 | ~0 % | 2 GBps |
Table: DSS Composition
2.2.2.13. 3D Graphics Benchmarks¶
Run GFXBench and capture performance reported: output render rate (Fps) and Score. All results here are “offscreen” only.
Benchmarks | j721e-evm: FPS | Score | CPU Load | Bandwidth: GB/sec |
---|---|---|---|---|
GFXBench 3.0 Manhattan 1080p offscreen | 16.6 | 1029 | 39.4% | 2.98 |
GFXBench 3.1 Manhattan 1080p offscreen | 10.2 | 632 | 21.2% | 2.29 |
GFXBench 4.0 Car Chase 1080p offscreen | 5.56 | 328 | 20.3% | 2.43 |
GFXBench Trex offScreen | 32.1 | 1798 | 33.5% | 4.5 |
Table: GFXBench Results
2.2.2.14. Multimedia (Decode)¶
Run gstreamer pipeline “gst-launch-1.0 playbin uri=file://<Path to stream> video-sink=”kmssink sync=false connector=<connector id>” audio-sink=fakesink” and calculate performance based on the execution time reported. CPU load of the V4L2 driver is also reported.
HEVC Decode Performance (V4L2 driver level), Stream(s) | Resolution | j721e-evm: Decode latency (usec) | CPU Load |
---|---|---|---|
Netflix_FoodMarket_4096x2160_60fps_IPP_51level_main_40mbps.265 | 4K @60fps (4096 x 2160) | 10234 | 2 % |
Netflix_FoodMarket_4096x2160_30fps_IPP_51level_main_40mbps.265 | 4K @30fps (4096 x 2160) | 10239 | 1 % |
crowd_run_2560x1440_350frame_75mbps_60fps.265 | 2.5K @60fps (2560 x 1440) | 9936 | 1.4% |
CrowdRun_p1920x1080_nv12_60fps_500fr.yuv.265 | 1080p @60fps (1920 x 1080) | 9898 | 2 % |
4x CrowdRun_p1920x1080_nv12_60fps_500fr.yuv.265 | 4 x 1080p @ 60fps (1920 x 1080) | 9966 | 7.5% |
CrowdRun_p1920x1080_nv12_30fps_500fr.265 | 1080p @ 30fps (1920 x 1080) | 9903 | 1 % |
4x CrowdRun_p1920x1080_nv12_30fps_500fr.265 | 4 x 1080p @ 30fps (1920 x 1080) | 10043 | 3.5% |
sintotrees_p1280x720_60fps_nv12_480fr.yuv.265 | 720p @ 60fps (1280 x 720) | 9789 | 1.5% |
4x sintotrees_p1280x720_60fps_nv12_480fr.yuv.265 | 4 x 720p @ 60fps (1280 x 720) | 9850 | 6.4% |
Table: HEVC V4L2 Decode Performance
H.264 Decode Performance (V4L2 driver level), Stream(s) | Resolution | j721e-evm: Decode latency (usec) | CPU Load |
---|---|---|---|
Netflix_FoodMarket_4096x2160_60fps_8bit_500frames.264 | 4K @60fps (4096 x 2160) | 13176 | 4.2% |
Netflix_FoodMarket_4096x2160_30fps_8bit_500frames.264 | 4K @30fps (4096 x 2160) | 13559 | 1.5% |
crowd_run_2560x1440_350frame_70mbps_60fps.264 | 2.5K @60fps (2560 x 1440) | 9830 | 2.7% |
CrowdRun_p1920x1080_nv12_60fps_500fr.yuv.264 | 1080p @60fps (1920 x 1080) | 9798 | 3.5% |
4x CrowdRun_p1920x1080_nv12_60fps_500fr.yuv.264 | 4 x 1080p @ 60fps (1920 x 1080) | 9855 | 20% |
CrowdRun_p1920x1080_nv12_30fps_500fr.264 | 1080p @ 30fps (1920 x 1080) | 9790 | 2 % |
4x CrowdRun_p1920x1080_nv12_30fps_500fr.264 | 4 x 1080p @ 30fps (1920 x 1080) | 9859 | 11% |
sintotrees_p1280x720_60fps_nv12_480fr.yuv.264 | 720p @ 60fps (1280 x 720) | 9741 | 3 % |
4x sintotrees_p1280x720_60fps_nv12_480fr.yuv.264 | 4 x 720p @ 60fps (1280 x 720) | 9806 | 22% |
GA20_13_panning_720x480_185_420SP.264 | 720 x 480 | 9696 | 0.7 % |
container_640x480_420sp_300fr.h264 | 640 x 480 | 9690 | 0.6 % |
AUD_MW_E.264 | 176 x 144 | 9659 | 0.5 % |
Table: H.264 V4L2 Decode Performance
2.2.2.15. Multimedia (Encode)¶
- Run encode test command and calculate performance based on the execution time reported, as in the following example:
- tienc_encode -i pedestrain_1080p_nv12.yuv -w 1920 -h 1080 -o out19.264 -b 15000000 -g 1 -p 1 –j
H.264 Encode Performance (V4L2 driver level), Stream (Encoder) | Resolution | j721e-evm: Encode latency (usec) |
---|---|---|
pedestrain_1080p_nv12.yuv | 1920 x 1080 | 12770 |
oldtowncross_1280x720_nv12.yuv | 1280 x 720 | 5746 |
concert_640x480_nv12.yuv | 640 x 480 | 2636 |
Table: H.264 V4L2 Encode Performance
2.2.2.16. Ethernet Driver (CPSW_2G)¶
2.2.2.16.1. TCP Throughput¶
TCP Window Size (KBytes) | j721e-evm: Throughput (Mbits/sec) | j721e-evm: CPU Load |
---|---|---|
8 | 734.40 | |
16 | 958.40 | |
32 | 1333.60 | |
64 | 1712.00 | |
128 | 1696.00 | |
256 | 1408.80 |
Table: TCP Throughput
2.2.2.16.2. UDP Throughput¶
UDP Throughput Egress
UDP Packet Size(bytes) | j721e-evm: Throughput (Mbits/sec) | j721e-evm: CPU Load | j721e-evm: Packets Per Second (kpps) |
---|---|---|---|
1024 | 924.00 | 53.70 | 112.00 |
1470 | 546.00 | 29.90 | 46.00 |
1500 | 922.00 | 57.80 | 76.00 |
4000 | 956.00 | 47.20 | 29.00 |
8000 | 958.00 | 46.00 | 14.00 |
Table: UDP Throughput Egress
UDP Throughput Ingress
UDP Packet Size(bytes) | j721e-evm: Throughput (Mbits/sec) | j721e-evm: CPU Load | j721e-evm: Packets Per Second (kpps) |
---|---|---|---|
64 | 63.90 | 32.90 | 123.00 |
128 | 29.80 | 24.40 | 28.00 |
256 | 120.00 | 31.30 | 58.00 |
512 | 241.00 | 37.20 | 58.00 |
1024 | 815.00 | 54.00 | 99.00 |
1470 | 956.00 | 55.50 | 81.00 |
1500 | 861.00 | 59.80 | 71.00 |
4000 | 946.00 | 55.20 | 29.00 |
8000 | 958.00 | 46.60 | 14.00 |
Table: UDP Throughput Ingress
2.2.2.17. PCIe Driver¶
2.2.2.17.1. PCIe-ETH¶
TCP Window Size(Kbytes) | j721e-evm: Bandwidth (Mbits/sec) |
---|---|
128 | 1317.60 |
256 | 1427.20 |
Table: PCI Ethernet
2.2.2.18. UFS Driver¶
2.2.2.18.1. UFS Throughput¶
Important
The performance numbers can be severely affected if the media is mounted in sync mode. For performance sensitive applications, umount the auto-mounted filesystem and re-mount in async mode.
Buffer size (bytes) | j721e-evm: Write VFAT Throughput (Mbytes/sec) | j721e-evm: Write VFAT CPU Load (%) | j721e-evm: Read VFAT Throughput (Mbytes/sec) | j721e-evm: Read VFAT CPU Load (%) |
---|---|---|---|---|
102400 | 97.39 (min 96.05, max 98.28) | 3.15 (min 2.33, max 3.72) | 663.08 | 19.35 |
262144 | 96.95 (min 95.39, max 97.66) | 3.22 (min 2.79, max 3.64) | 778.28 | 25.00 |
524288 | 96.52 (min 95.48, max 97.27) | 2.77 (min 2.33, max 3.65) | 798.43 | 25.93 |
1048576 | 97.34 (min 95.83, max 98.00) | 3.15 (min 2.79, max 3.65) | 798.91 | 23.08 |
5242880 | 97.45 (min 96.10, max 98.22) | 2.79 (min 2.35, max 3.23) | 797.60 | 24.00 |
Table: UFS Throughput VFAT
Buffer size (bytes) | j721e-evm: Write EXT4 Throughput (Mbytes/sec) | j721e-evm: Write EXT4 CPU Load (%) | j721e-evm: Read EXT4 Throughput (Mbytes/sec) | j721e-evm: Read EXT4 CPU Load (%) |
---|---|---|---|---|
102400 | 88.79 (min 86.78, max 90.73) | 3.21 (min 2.60, max 4.56) | 470.46 | 15.56 |
262144 | 96.57 (min 95.18, max 97.60) | 3.40 (min 2.82, max 4.07) | 795.68 | 23.08 |
524288 | 96.80 (min 96.00, max 97.65) | 3.05 (min 2.33, max 4.13) | 799.37 | 26.92 |
1048576 | 96.48 (min 94.61, max 97.60) | 3.39 (min 2.79, max 4.48) | 757.11 | 24.14 |
5242880 | 97.27 (min 95.57, max 98.13) | 3.34 (min 2.35, max 4.11) | 780.68 | 25.93 |
Table: UFS Throughput EXT4
2.2.2.19. EMMC Driver¶
2.2.2.19.1. EMMC Throughput¶
Important
The performance numbers can be severely affected if the media is mounted in sync mode. For performance sensitive applications, umount the auto-mounted filesystem and re-mount in async mode.
Buffer size (bytes) | j721e-evm: Write VFAT Throughput (Mbytes/sec) | j721e-evm: Write VFAT CPU Load (%) | j721e-evm: Read VFAT Throughput (Mbytes/sec) | j721e-evm: Read VFAT CPU Load (%) |
---|---|---|---|---|
102400 | 59.52 (min 55.27, max 60.90) | 3.51 (min 2.60, max 6.82) | 287.13 | 12.99 |
262144 | 59.91 (min 56.19, max 60.96) | 3.12 (min 2.33, max 5.12) | 293.21 | 7.14 |
524288 | 60.00 (min 56.52, max 61.07) | 3.22 (min 2.33, max 5.66) | 298.61 | 7.25 |
1048576 | 59.85 (min 56.45, max 60.79) | 3.44 (min 2.61, max 5.66) | 294.10 | 7.04 |
5242880 | 60.15 (min 56.62, max 61.17) | 3.41 (min 2.63, max 5.41) | 293.04 | 8.45 |
Table: EMMC Throughput VFAT
Buffer size (bytes) | j721e-evm: Write EXT2 Throughput (Mbytes/sec) | j721e-evm: Write EXT2 CPU Load (%) | j721e-evm: Read EXT2 Throughput (Mbytes/sec) | j721e-evm: Read EXT2 CPU Load (%) |
---|---|---|---|---|
102400 | 60.71 (min 59.92, max 61.01) | 1.91 (min 1.46, max 2.85) | 292.55 | 6.94 |
262144 | 60.57 (min 59.43, max 60.99) | 1.62 (min 1.17, max 2.56) | 302.14 | 10.00 |
524288 | 60.76 (min 59.53, max 61.28) | 1.62 (min 1.17, max 2.56) | 319.51 | 4.69 |
1048576 | 60.79 (min 59.33, max 61.26) | 1.67 (min 1.17, max 2.82) | 320.64 | 6.15 |
5242880 | 60.58 (min 59.37, max 61.09) | 1.73 (min 1.45, max 2.83) | 320.41 | 7.81 |
Table: EMMC Throughput EXT2
Buffer size (bytes) | j721e-evm: Write EXT4 Throughput (Mbytes/sec) | j721e-evm: Write EXT4 CPU Load (%) | j721e-evm: Read EXT4 Throughput (Mbytes/sec) | j721e-evm: Read EXT4 CPU Load (%) |
---|---|---|---|---|
102400 | 60.53 (min 60.32, max 60.93) | 1.90 (min 1.45, max 2.59) | 306.33 | 10.00 |
262144 | 61.74 (min 61.05, max 62.08) | 1.71 (min 1.47, max 2.33) | 317.46 | 9.09 |
524288 | 61.69 (min 60.96, max 62.05) | 1.82 (min 1.47, max 2.62) | 334.83 | 6.56 |
1048576 | 61.74 (min 60.81, max 62.26) | 1.70 (min 1.48, max 2.03) | 337.18 | 7.94 |
5242880 | 61.80 (min 61.11, max 62.23) | 1.88 (min 1.47, max 2.33) | 337.15 | 6.67 |
Table: EMMC Throughput EXT4
2.2.2.20. MMC/SD Driver¶
2.2.2.20.1. MMC/SD Throughput¶
Important
The performance numbers can be severely affected if the media is mounted in sync mode. Hot plug scripts in the filesystem mount removable media in sync mode to ensure data integrity. For performance sensitive applications, umount the auto-mounted filesystem and re-mount in async mode.
Buffer size (bytes) | j721e-evm: Write VFAT Throughput (Mbytes/sec) | j721e-evm: Write VFAT CPU Load (%) | j721e-evm: Read VFAT Throughput (Mbytes/sec) | j721e-evm: Read VFAT CPU Load (%) |
---|---|---|---|---|
102400 | 10.32 (min 10.28, max 10.37) | 0.98 (min 0.69, max 1.47) | 40.64 | 2.14 |
262144 | 10.18 (min 9.70, max 10.44) | 0.93 (min 0.65, max 1.48) | 41.47 | 2.57 |
524288 | 10.15 (min 9.65, max 10.40) | 0.92 (min 0.64, max 1.56) | 42.09 | 2.61 |
1048576 | 9.97 (min 9.40, max 10.39) | 0.91 (min 0.58, max 1.52) | 43.28 | 4.98 |
5242880 | 10.17 (min 9.36, max 10.41) | 0.95 (min 0.60, max 1.38) | 42.55 | 6.25 |
Table: MMC/SD Throughput VFAT
Buffer size (bytes) | j721e-evm: Write EXT2 Throughput (Mbytes/sec) | j721e-evm: Write EXT2 CPU Load (%) | j721e-evm: Read EXT2 Throughput (Mbytes/sec) | j721e-evm: Read EXT2 CPU Load (%) |
---|---|---|---|---|
102400 | 9.19 (min 3.84, max 11.08) | 0.59 (min 0.42, max 0.69) | 41.69 | 1.59 |
262144 | 10.74 (min 10.25, max 11.05) | 0.66 (min 0.45, max 0.99) | 42.81 | 1.23 |
524288 | 10.52 (min 10.15, max 11.10) | 0.66 (min 0.45, max 0.87) | 44.63 | 1.49 |
1048576 | 10.30 (min 9.85, max 10.50) | 0.64 (min 0.50, max 0.84) | 44.77 | 1.07 |
5242880 | 10.67 (min 10.27, max 11.15) | 0.64 (min 0.48, max 0.89) | 45.18 | 0.87 |
Table: MMC/SD Throughput EXT2
Buffer size (bytes) | j721e-evm: Write EXT4 Throughput (Mbytes/sec) | j721e-evm: Write EXT4 CPU Load (%) | j721e-evm: Read EXT4 Throughput (Mbytes/sec) | j721e-evm: Read EXT4 CPU Load (%) |
---|---|---|---|---|
102400 | 10.87 (min 10.11, max 12.15) | 0.62 (min 0.47, max 0.75) | 42.14 | 1.41 |
262144 | 10.91 (min 10.34, max 11.63) | 0.61 (min 0.47, max 0.72) | 42.53 | 2.43 |
524288 | 10.97 (min 10.68, max 11.64) | 0.62 (min 0.52, max 0.78) | 45.04 | 1.29 |
1048576 | 11.23 (min 10.92, max 11.68) | 0.64 (min 0.50, max 0.83) | 45.65 | 1.09 |
5242880 | 11.25 (min 10.90, max 11.74) | 0.64 (min 0.47, max 0.88) | 45.60 | 1.73 |
Table: MMC/SD Throughput EXT4
The performance numbers were captured using the following:
- SanDisk 8GB MicroSDHC Class 10 Memory Card
- Partition was mounted with async option
2.2.2.21. USB Driver¶
2.2.2.21.1. MUSB/XHCI Host controller¶
Important
For Mass-storage applications, the performance numbers can be severely affected if the media is mounted in sync mode. Hot plug scripts in the filesystem mount removable media in sync mode to ensure data integrity. For performance sensitive applications, umount the auto-mounted filesystem and re-mount in async mode.
Setup : Inateck ASM1153E USB hard disk is connected to usb0 port. File read/write performance data on usb0 port is captured.
Buffer size (bytes) | j721e-evm: Write VFAT Throughput (Mbytes/sec) | j721e-evm: Write VFAT CPU Load (%) | j721e-evm: Read VFAT Throughput (Mbytes/sec) | j721e-evm: Read VFAT CPU Load (%) |
---|---|---|---|---|
102400 | 312.86 (min 236.02, max 332.26) | 17.04 (min 14.52, max 24.42) | 333.52 | 12.90 |
262144 | 317.29 (min 247.76, max 335.04) | 17.01 (min 14.52, max 21.69) | 354.90 | 12.07 |
Table: USB Host VFAT
Buffer size (bytes) | j721e-evm: Write EXT2 Throughput (Mbytes/sec) | j721e-evm: Write EXT2 CPU Load (%) | j721e-evm: Read EXT2 Throughput (Mbytes/sec) | j721e-evm: Read EXT2 CPU Load (%) |
---|---|---|---|---|
102400 | 349.94 (min 311.81, max 359.84) | 11.25 (min 8.77, max 13.43) | 320.95 | 12.50 |
1048576 | 351.06 (min 312.71, max 360.91) | 11.05 (min 7.14, max 13.64) | 394.87 | 11.54 |
5242880 | 353.43 (min 315.19, max 363.50) | 10.77 (min 7.27, max 12.50) | 405.38 | 10.00 |
Table: USB Host EXT2
Window Size (kbytes) | j721e-evm: TX Throughput (Mbits/sec) | j721e-evm: RX Throughput (Mbits/sec) |
---|---|---|
8 | 263.50 | 44.50 |
16 | 266.80 | 78.70 |
32 | 337.30 | 230.90 |
64 | 347.90 | 345.00 |
128 | 351.60 | 351.00 |
Table: USBDEVICE NCM IPERF TCP THROUGHPUT
2.2.2.22. CRYPTO Driver¶
2.2.2.22.1. OpenSSL Performance¶
Algorithm | Buffer Size | j721e-evm: throughput |
---|---|---|
aes-128-cbc | 1024 | 59030.53 |
aes-128-cbc | 16 | 1125.79 |
aes-128-cbc | 256 | 18987.78 |
aes-128-cbc | 64 | 4571.37 |
aes-128-cbc | 8192 | 174153.73 |
aes-192-cbc | 1024 | 53752.49 |
aes-192-cbc | 16 | 1176.94 |
aes-192-cbc | 256 | 18654.46 |
aes-192-cbc | 64 | 4529.56 |
aes-192-cbc | 8192 | 174410.41 |
aes-256-cbc | 1024 | 56535.38 |
aes-256-cbc | 16 | 1146.76 |
aes-256-cbc | 256 | 18744.58 |
aes-256-cbc | 64 | 4550.21 |
aes-256-cbc | 8192 | 157827.07 |
des-cbc | 1024 | 40360.62 |
des-cbc | 16 | 11463.94 |
des-cbc | 256 | 35921.66 |
des-cbc | 64 | 25251.09 |
des-cbc | 8192 | 41822.89 |
des3 | 1024 | 46702.25 |
des3 | 16 | 1176.65 |
des3 | 256 | 15924.74 |
des3 | 64 | 4926.95 |
des3 | 8192 | 94093.31 |
md5 | 1024 | 96581.63 |
md5 | 16 | 2185.08 |
md5 | 256 | 31671.89 |
md5 | 64 | 8521.94 |
md5 | 8192 | 242316.63 |
sha1 | 1024 | 64368.30 |
sha1 | 16 | 1077.21 |
sha1 | 256 | 16950.44 |
sha1 | 64 | 4295.45 |
sha1 | 8192 | 353219.93 |
sha224 | 1024 | 119689.56 |
sha224 | 16 | 2122.24 |
sha224 | 256 | 33028.01 |
sha224 | 64 | 8506.86 |
sha224 | 8192 | 511008.77 |
sha256 | 1024 | 61989.21 |
sha256 | 16 | 1038.27 |
sha256 | 256 | 16261.12 |
sha256 | 64 | 4130.03 |
sha256 | 8192 | 342261.76 |
sha384 | 1024 | 76010.84 |
sha384 | 16 | 2083.06 |
sha384 | 256 | 28211.63 |
sha384 | 64 | 8395.31 |
sha384 | 8192 | 151688.53 |
sha512 | 1024 | 46941.87 |
sha512 | 16 | 990.04 |
sha512 | 256 | 14470.91 |
sha512 | 64 | 3952.11 |
sha512 | 8192 | 131323.22 |
Table: OpenSSL Algorithm Throughput
Algorithm | j721e-evm: CPU Load |
---|---|
aes-128-cbc | 35.00 |
aes-192-cbc | 36.00 |
aes-256-cbc | 34.00 |
des-cbc | 99.00 |
des3 | 36.00 |
md5 | 99.00 |
sha1 | 99.00 |
sha224 | 99.00 |
sha256 | 99.00 |
sha384 | 99.00 |
sha512 | 99.00 |
Table: OpenSSL CPU Load During Throughput Testing
2.2.3. Resources¶
2.2.3.1. Linux Memory Map¶
- You can view the RAM allocations for the Kernel at run-time as follows:
root@j7-evm:/proc# cat /proc/iomem
[snip...]
80000000-9e7fffff : System RAM
80080000-80bdffff : Kernel code
80be0000-80c6ffff : reserved
80c70000-80e4ffff : Kernel data
a9000000-a9ffffff : System RAM
abc00000-ffffffff : System RAM
[snip]
[snip...]
880000000-8ffffffff : System RAM
8ff480000-8fff2ffff : reserved
8fff30000-8fff4ffff : reserved
8fff50000-8fffeffff : reserved
8ffff0000-8ffffffff : reserved
[snip]
Memory Carveouts
- Both reserved & shared memory carveouts are defined in the following file for the J721E EVM:
linux/arch/arm64/boot/dts/ti/k3-j721e-som-p0.dtsi
- The following table shows the memory carveouts designated in this release:
Memory Carveout Purpose | Start Address | Carveout Size (MiB) | Carveout Type |
---|---|---|---|
Secure DDR (OPTEE) | 0x9e800000 | 24 | reserved |
MCU_R5F0 Core0 IPC data | 0xa0000000 | 1 | shared DMA pool |
MCU_R5F0 Core0 Memory | 0xa0100000 | 15 | shared DMA pool |
MCU_R5F0 Core1 IPC data | 0xa1000000 | 1 | shared DMA pool |
MCU_R5F0 Core1 Memory | 0xa1100000 | 15 | shared DMA pool |
MAIN_R5F0 Core0 IPC data | 0xa2000000 | 1 | shared DMA pool |
MAIN_R5F0 Core0 Memory | 0xa2100000 | 15 | shared DMA pool |
MAIN_R5F0 Core1 IPC data | 0xa3000000 | 1 | shared DMA pool |
MAIN_R5F0 Core1 Memory | 0xa3100000 | 15 | shared DMA pool |
MAIN_R5F1 Core0 IPC data | 0xa4000000 | 1 | shared DMA pool |
MAIN_R5F1 Core0 Memory | 0xa4100000 | 15 | shared DMA pool |
MAIN_R5F1 Core1 IPC data | 0xa5000000 | 1 | shared DMA pool |
MAIN_R5F1 Core1 Memory | 0xa5100000 | 15 | shared DMA pool |
C66x_1 IPC data | 0xa6000000 | 1 | shared DMA pool |
C66x_0 Memory | 0xa6100000 | 15 | shared DMA pool |
C66x_0 IPC data | 0xa7000000 | 1 | shared DMA pool |
C66x_1 Memory | 0xa7100000 | 15 | shared DMA pool |
C71x_0 IPC data | 0xa8000000 | 1 | shared DMA pool |
C71x_0 Memory | 0xa8100000 | 15 | shared DMA pool |
IPC Memory (Remote Cores IPC VRING space) | 0xaa000000 | 28 | reserved |
Display Memory (dynamic allocation) | 0xc0000000 | 512 | CMA carveout |
Table: Linux Memory Carveouts