Embedded World 2018 - Speeches
A New Scalable Architecture to Accelerate Deep Convolutional Neural Networks for Low Power IoT Applications
by Giuseppe Desoli
Deep Learning is an extremely promising set of techniques that achieves state-of-the-art results in many applications involving recognition, identification and/or classification tasks. These results, however, come at the price of significant processing-power requirements, which hinders adoption because low-cost, energy-efficient solutions are lacking. Recently, a push towards ever-increasing deployment of DL inference on embedded devices supporting the edge-computing paradigm has been observed, overcoming the limitations of cloud-based computing in latency, bandwidth requirements, security, privacy, and availability. At the edge, severe performance requirements must coexist with tight constraints on power and energy consumption. Deep Convolutional Neural Network (DCNN) algorithms require billions of multiply-accumulate operations per second for real-time workloads, as well as local storage of millions of bytes of pre-trained weights. To cope with these constraints, low-power IoT end-nodes must resort to specialized hardware blocks for specific compute-intensive data processing, while retaining full software programmability for tasks of lower computational intensity. The STMicroelectronics Orlando is a configurable, scalable and design-time parametric CNN Processing Engine powered by an energy-efficient set of HW convolutional accelerators supporting kernel compression, and an on-chip reconfigurable data transfer fabric that improves data reuse and reduces on-chip and off-chip memory traffic. The Orlando SoC prototype integrates custom-designed DSPs and a reconfigurable, custom-designed dataflow HW accelerator fabric connecting camera interfaces, sensor pipelines, croppers, color converters, feature detectors, video encoders, streaming DMAs and 8 convolutional accelerators.
The chip includes four SRAM banks, each with 1MB, a dedicated bus port, and fine-grained power gating, to sustain the maximum throughput for different DCNN topologies while reducing the need for external memory to save power. The prototype chip, built in 28nm FD-SOI technology, adopts mono-supply SRAMs based on a single-well bitcell with low-power features and adaptive circuitry to support a wide voltage range from 1.1V down to 0.575V, and leverages a GALS clocking architecture to reduce the clock network dynamic power and the skew sensitivity due to on-chip variation at lower voltages. A power consumption of 41mW is achieved on a typical DCNN algorithm (AlexNet), with a peak efficiency of 2.9 TOPS/W.
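To give a sense of scale for the "billions of multiply-accumulate operations per second" mentioned in the abstract, the following is a minimal sketch that counts the MACs of a single convolutional layer. The layer shape is the first AlexNet convolution as commonly described (96 filters of 11x11x3 producing a 55x55 output); the 30 frames per second rate and the function name `conv_macs` are assumptions made for illustration and do not come from the talk.

```python
def conv_macs(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """Multiply-accumulate count for one convolutional layer.

    Each output element needs kernel_h * kernel_w * in_channels MACs
    (bias and grouped convolutions are ignored in this sketch).
    """
    return out_h * out_w * out_channels * kernel_h * kernel_w * in_channels


# First AlexNet convolution as commonly described: 96 filters of 11x11x3,
# 55x55 output feature map.
macs_per_frame = conv_macs(55, 55, 96, 11, 11, 3)

# Assumed real-time frame rate (illustrative, not from the abstract).
frames_per_second = 30
macs_per_second = macs_per_frame * frames_per_second

print(f"MACs per frame:  {macs_per_frame:,}")    # 105,415,200
print(f"MACs per second: {macs_per_second:,}")   # 3,162,456,000
```

Under these assumptions a single layer already needs more than three billion MACs per second, which is why dedicated convolutional accelerators are attractive at the edge.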
Dr. Giuseppe Desoli graduated in Electronics Engineering from the University of Genoa, Italy in 1991 (summa cum laude) where he also received a PhD in Telecommunications in 1995, working in the field of parallel architectures and algorithms for signal and image processing.
He was a computer scientist at HP Labs' Cambridge Lab in Massachusetts, USA, from 1995 to 2002, doing research on VLIW architectures with a special focus on embedded custom-defined microprocessors, tools and binary translation technologies, working together with some of the most influential pioneers of the field.
He later came to STMicroelectronics, attracted by the practical application of VLIW processor platforms to embedded devices and SoCs, and led a microprocessor research group within the ST-Italy AST (Advanced System Technology) R&D group.
Since 2010 he has been in charge of advanced HW and SW R&D projects, and since 2015 he has focused on low-power HW architectures for deep learning acceleration in embedded applications. He has co-authored more than 50 scientific publications on computer science and signal processing, and holds a number of patents with both the European and US patent offices. Dr. Desoli is a fellow member of the STMicroelectronics technical staff.
Embedded Algorithms for Motion Detection and Processing
by Marco Castellano
The speech introduces two embedded digital modules, Finite State Machine and Machine Learning Processing, on the LSM6DSOX smart sensor, and shows how this solution paves the way for further innovation and reduces current consumption at system level.
The Finite State Machine is configured by connecting atomic blocks in sequence. These blocks have been defined to cover most of the gestures required by customers, while leaving the customization of the application to the final user thanks to the high reconfigurability of each atomic block.
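The idea of chaining atomic blocks into a gesture detector can be illustrated with a minimal sketch. This is not the LSM6DSOX programming model: the `Block` type, its `condition` and `timeout` fields, and the `detect` function are hypothetical names invented here, and a real sensor would be configured through its own registers and tools.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Block:
    """Hypothetical atomic block: a condition to meet within a sample budget."""
    condition: Callable[[float], bool]
    timeout: int  # samples allowed to pass before the sequence resets


def detect(samples: Sequence[float], blocks: Sequence[Block]) -> int:
    """Return the index of the sample that completes the sequence, or -1."""
    state = 0   # index of the block currently awaited
    waited = 0  # samples spent waiting in the current block
    for i, x in enumerate(samples):
        if blocks[state].condition(x):
            state += 1
            waited = 0
            if state == len(blocks):
                return i
        else:
            waited += 1
            if waited > blocks[state].timeout:
                state = 0   # timeout: restart from the first block
                waited = 0
    return -1


# Illustrative gesture: the signal rises above 1.5, then drops below 0.5
# within 3 samples (thresholds are arbitrary example values).
gesture = [
    Block(condition=lambda x: x > 1.5, timeout=5),
    Block(condition=lambda x: x < 0.5, timeout=3),
]

print(detect([1.0, 1.0, 1.8, 1.2, 0.3, 1.0], gesture))  # 4
print(detect([1.8, 1.0, 1.0, 1.0, 1.0, 0.3], gesture))  # -1 (second block timed out)
```

Reconfiguring a gesture then amounts to changing the list of blocks and their parameters, without touching the stepping logic.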
Machine Learning Processing is based on pattern acquisition and machine learning computation on pattern statistics. MLP can be widely configured by the user depending on the aim of the application. The smart sensor hardware module computes the parameters at runtime and evaluates the decision-tree branches.
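As a rough illustration of runtime parameter computation followed by decision-tree evaluation, the sketch below computes a few statistics over a window of samples and walks a small hand-written tree. The feature set, thresholds, class labels and the names `features`, `TREE` and `classify` are all assumptions for this example; they do not describe the sensor's actual feature list or any trained model.

```python
from statistics import mean, pvariance


def features(window):
    """Compute simple statistics over one window of samples."""
    return {
        "mean": mean(window),
        "variance": pvariance(window),
        "peak_to_peak": max(window) - min(window),
    }


# Hypothetical decision tree. An inner node is
# (feature name, threshold, subtree if value <= threshold, subtree otherwise);
# a leaf is a class label string. Thresholds are arbitrary example values.
TREE = (
    "variance", 0.01,
    "stationary",
    ("peak_to_peak", 1.0, "walking", "running"),
)


def classify(feats, node=TREE):
    """Follow the branches until a leaf label is reached."""
    while not isinstance(node, str):
        name, threshold, below, above = node
        node = below if feats[name] <= threshold else above
    return node


print(classify(features([1.0, 1.0, 1.0, 1.0])))  # stationary
print(classify(features([0.8, 1.2, 0.8, 1.2])))  # walking
print(classify(features([0.2, 1.8, 0.2, 1.8])))  # running
```

Each decision needs only a handful of comparisons once the statistics are available, which suits evaluation inside a low-power sensor.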
Both FSM and MLP are supported by effective custom STM software tools for fast prototyping and implementation.
Marco Castellano received the Laurea degree from the University of Pavia, Italy, in 2005 and a Ph.D. in electrical engineering from the same university in 2009, in a joint research center supported by the University of Pavia and STMicroelectronics. His research topic was high-speed, low-power arithmetic circuits for Digital Signal Processing.
In 2008 he joined STMicroelectronics in Cornaredo (Italy), working in the MEMS division as a digital designer. His main fields of interest in digital design were the implementation of complex gesture recognition algorithms, FIFO, and sensor DSP and compensation design.
In 2016 he became leader of a team of digital experts in the field of co-design of controller and related software for custom low-power application design. He has authored many papers, conference contributions and patents on topics related to algorithms integration.
Efficient Solution for Secure Firmware Install and Upgrade of Embedded Applications
by Christophe Mani
Security has become a major constraint for developers of many embedded applications. This has been driven by the growing complexity of the applications and especially by their connectivity to the internet and/or other proprietary networks. In parallel, Secure Firmware Install, used during production, and in-application firmware upgrade, used in the field, have become mandatory elements: they provide faster time to market, eliminate piracy and lower the cost of application maintenance, including flexible adaptation of various business models around the final applications and their ecosystem.
A real, practical use case will be presented, addressing the complete volume production chain and demonstrating the benefits of full and easy secure control over the secure firmware install process. The secured version of the STM32H7 MCU will be used as the target platform. It comes with ready-to-use secure firmware install code located in a protected area, including a certificate to validate its authenticity. All of this is provisioned with an individual asymmetric key pair, used to receive the encrypted customer binary code together with its encryption key. The decryption is then done internally by the STM32H7.
In practice, the STM32H7 includes a user memory space named Securable Memory Area, which is sizable and activated by configuration. When activated, this user area is visible after any reset and is protected against debug. After its use, the code jumps to the main FW located in the regular user area; from then on, accessing the Securable Memory Area becomes impossible, and all the code and data stored in that memory domain totally disappear. This new memory space enhances the code and key protection of any secure firmware upgrade solution.
Christophe Mani has a long application background and experience in the field of secure MCUs, RFID and smartcard products.