Friday, 30 June 2017

FPGAs and AI processors: DNN and CNN for all

Here is a nice hidden node from a traditional 1990's style gender identification neural net I did a few weeks ago.

A 90's style hidden node image in a simple gender identifier net
Source: my laptop
My daughter was doing a course as part of her B Comp Eng, the degree after her acting degree. Not being in the same city, I thought maybe I could look at her assignment and help in parallel. Unsurprisingly, she neither needed nor wanted my help. No mansplaining necessary from the old-timer father. Nevertheless, it was fun to play with the data Bec pulled down on faces. Bec's own gender id network worked fine for 13 out of 14 photos of herself fed into the trained net. Nice.

I was late to the party and first spent time with neural nets in the early nineties. As a prop trader at Bankers Trust in Sydney, I used a variety of software including a slightly expensive graphical tool from NeuroDimension that also generated C++ code for embedding. It had one of those parallel port copy protection dongles that were a pain. I was doing my post-grad at a group at uni that kept changing its name from something around connectionism, to adaptive methods, and then data fusion. I preferred open source and the use of NeuroDimension waned. I ported the Stuttgart Neural Network Simulator, SNNS, to the new MS operating system, Windows NT (with OS/3 early alpha branding ;-) ), and briefly became the support guy for that port. SNNS was hokey code with messy static globals but it worked pretty fly for a white guy.

My Master of Science research project was a kind of cascade-correlation-like neural net, the Multi-rate Optimising Order Statistic Equaliser (MOOSE), for intraday Bund trading. The MOOSE was originally a bit of work designed for acquiring fast LEO satellite signals (McCaw's Teledesic), repurposed for playing with Bunds as they migrated from LIFFE to DTB. As a prop trader at an investment bank, I could buy neat toys. I had the world's fastest computer at the time: an IBM MicroChannel box with dual Pentium Pro 200MHz processors plus SCSI and some megabytes of RAM. Pulling 800,000 points into my little C++ stream/dag processor seemed like black magic in 1994. Finite differencing methods let me do oodles of O(1) incremental linear regressions and the like to get 1000-fold speed-ups. It seemed good at the time. Today, your phone would laugh in my general direction.
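For flavour, here is a minimal Python sketch of the kind of O(1) incremental trick I mean: keep running sums so an ordinary least-squares line can be updated per tick without re-fitting over the whole window. Illustrative only, and not the MOOSE code; the class and data here are made up.

class IncrementalLinearRegression:
    """Maintain running sums so adding a point is O(1) rather than re-fitting in O(n)."""

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def add(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def remove(self, x, y):
        # A rolling window is the same trick in reverse: subtract the oldest point.
        self.n -= 1
        self.sx -= x
        self.sy -= y
        self.sxx -= x * x
        self.sxy -= x * y

    def fit(self):
        # Ordinary least squares recovered from the accumulated sums.
        denom = self.n * self.sxx - self.sx * self.sx
        slope = (self.n * self.sxy - self.sx * self.sy) / denom
        intercept = (self.sy - slope * self.sx) / self.n
        return slope, intercept

reg = IncrementalLinearRegression()
for i, price in enumerate([100.0, 100.5, 100.4, 101.0]):
    reg.add(float(i), price)
print(reg.fit())   # slope and intercept, updated in constant time per tick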

There was plenty of action in neural nets back in those days. Not much of it was overly productive but it was useful. I was slightly bemused to read Eric Schmidt's take on machine learning and trading in Lindsay Fortado and Robin Wigglesworth's FT article "Machine learning set to shake up equity hedge funds",
Eric Schmidt, executive chairman of Alphabet, Google’s parent company, told a crowd of hedge fund managers last week that he believes that in 50 years, no trading will be done without computers dissecting data and market signals.
“I’m looking forward to the start-ups that are formed to do machine learning against trading, to see if the kind of pattern recognition I’m describing can do better than the traditional linear regression algorithms of the quants,” he added. “Many people in my industry think that’s amenable to a new form of trading.”
Eric, old mate, you know I was late to the party in the early nineties, what does that make you?

Well, things are different now. I like to think of this new neural renaissance, and have written about it, as The Age of Perception. It is not intelligence, it is just good at patterns. It is still a bit hopeless at language ambiguities. It will also be a while before it grasps the underlying values and concepts needed for deep financial understanding.

Deep learning is simultaneously overhyped and underestimated. It is not intelligence, but it will help us get there. It is overhyped by some as an AI breakthrough that will give us cybernetic human-like replicants. We still struggle with common knowledge and ambiguity in simple text for reasoning. We have a long way to go. Yet the impact of relatively simple planning algorithms and heuristics, along with the dramatic deep-learning-based perception abilities from vision, sound, text, radar, et cetera, will be as profound as every person and their dog now understands. That's why I call it The Age of Perception. It is as if the supercomputers in our pockets have suddenly awoken, their eyes quickly adjusting to the bright blinking blight that is the real world.

The impact will be dramatic and lifestyle changing for the entire planet. Underestimate it at your peril. No, we don't have a date with a deep Turing conversationalist that will provoke and challenge our deepest thoughts - yet. That will inevitably come, but it is not on the visible horizon. Smart proxies aided by speech, text and Watson-like Jeopardy databases will give us a very advanced Eliza, but no more. Autonomous transport, food production, construction, yard and home help will drive dramatic lifestyle and real-estate value changes.

Apart from this rambling meander, my intention here was to collect some thoughts on the chips driving the current neural revolution. Not the most exciting thought for many, but it is a useful exercise for me.

Neural network hardware


Neural processing is not a lot different today compared to twenty years ago. Deep is more of a brand than a difference. The activation functions have been simplified, which suits hardware better. Mainly there is more data and a better understanding of how to initialise the weights, handle many layers, parallelise, and improve robustness via techniques such as dropout. The Neocognitron architecture from 1980 is not much different from today's deep learner or CNN, but it helped that Yann LeCun later showed how to make such an architecture learn with backpropagation.
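As a rough illustration of two of those ingredients, here is a tiny numpy sketch contrasting the old sigmoid with the hardware-friendly ReLU, plus dropout as a simple training-time mask. This is a generic sketch, not any particular framework's implementation.

import numpy as np

def sigmoid(x):
    # The classic nineties squashing function: an exponential per unit.
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    # The modern simplification: a compare and select, cheap in silicon.
    return np.maximum(0.0, x)

def dropout(activations, rate=0.5, rng=np.random.default_rng(0)):
    # Robustness trick used at training time: zero random units, rescale the rest.
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

x = np.linspace(-3.0, 3.0, 7)
print(sigmoid(x))
print(relu(x))
print(dropout(relu(x)))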

Back in the nineties there were also plenty of neural hardware platforms, such as CNAPS (1990) with its 64 processing units and 256kB of memory doing 1.6 GCPS (giga connections per second) at 8/16-bit or 12.8 GCPS at 1-bit. You can read about Synapse-1, CNAPS, SNAP, the CNS Connectionist Supercomputer, Hitachi WSI, My-Neupower, LNeuro 1.0, UTAK1, GNU Implementation (no, not GNU GNU, General Neural Unit), UCL, Mantra 1, Biologically-Inspired Emulator, INPG Architecture, BACHUS, and ZISC036 in "Overview of neural hardware" [Heemskerk, 1995, draft].

Phew, that seems a lot, but it excludes the software and accelerator board/CPU combos, such as ANZA plus, SAIC SIGMA-1, NT6000, Balboa 860 coprocessor, Ni1000 Recognition Accelerator Hardware (Intel), IBM NEP, NBC, Neuro Turbo I, Neuro Turbo II, WISARD, Mark II & IV, Sandy/8, GCN (Sony), Topsi, BSP400 (400 microprocessors), DREAM Machine, RAP, COKOS, REMAP, General Purpose Parallel Neurocomputer, TI NETSIM, and GeNet. Then there were quite a few analogue and hybrid analogue implementations, including Intel's Electrically Trainable Analog Neural Network (80170NX). You get the idea: there was indeed a lot back in the day.

All a go go in 1994:


Optimistically, Moore's Law was telling us a TeraCPS was just around the corner,
"In the next decade micro-electronics will most likely continue to dominate the field of neural network implementation. If progress advances as rapidly as it has in the past, this implies that neurocomputer performances will increase by about two orders of magnitude. Consequently, neurocomputers will be approaching TeraCPS (10^12 CPS) performance. Networks consisting of 1 million nodes, each with about 1,000 inputs, can be computed at brain speed (100-1000 Hz). This would offer good opportunities to experiment with reasonably large networks."
The first neural winter came from the cruel subversion of research dollars after Minsky and Papert dissed Rosenblatt's perceptron dream with incorrect, hand-wavy generalisations about hidden layers, a turn that ultimately led to Rosenblatt's untimely death. In 1995 another neural winter was kind of underway, although I didn't really know it at the time. As a frog in the saucepan, I didn't notice the boil. This second winter was brought on by a lack of exciting progress and general boredom.

The second neural winter ended with the dramatic improvements in ImageNet processing from the University of Toronto's SuperVision entry, better known as AlexNet, in 2012, thanks to Geoffrey Hinton's winter survival skills. This result was then blown apart by Google's GoogLeNet Inception model in 2014. So, the Age of Perception started in 2012 by my reckoning. Mark your diaries. We're now five years in.

Google did impressive parallel CPU work with lossy updates across a few thousand regular machines. Professor Andrew Ng and friends made the scale approachable by enabling dozens of GPUs to do the work of thousands of CPUs. Thus, we were saved from the prospect of neural processing being only for the well funded. Well, kind of: the state of the art now sometimes needs thousands of GPUs or specialised chips.

More data and more processing have been central. Let's get to the point and list some of the platforms fighting the Age of Perception's big data battle:

GPUs from Nvidia

These are hard to beat. The subsidisation that comes from the large video processing market drives tremendous economies of scale. The new Nvidia V100 can do 15 TFlops of SP, or 120 TFlops with its new Tensor core architecture, which does an FP16 multiply with FP32 accumulate to suit ML. Nvidia packs eight boards into their DGX-1 for 960 Tensor TFlops.
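To make the Tensor core arithmetic concrete, here is a small numpy emulation of the FP16-multiply, FP32-accumulate idea. It is illustrative only; real Tensor cores do this on small matrix tiles in hardware.

import numpy as np

def tensor_core_style_matmul(a, b):
    """Round the inputs to FP16, but keep the products and accumulator in FP32."""
    a16 = a.astype(np.float16)
    b16 = b.astype(np.float16)
    # The product of two FP16 values is exactly representable in FP32, so promoting
    # before the multiply matches FP16-multiply / FP32-accumulate behaviour.
    return a16.astype(np.float32) @ b16.astype(np.float32)

rng = np.random.default_rng(1)
a = rng.standard_normal((256, 256)).astype(np.float32)
b = rng.standard_normal((256, 256)).astype(np.float32)

full = a @ b                                     # pure FP32 reference
mixed = tensor_core_style_matmul(a, b)           # FP16 inputs, FP32 accumulation
pure16 = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

print(np.max(np.abs(mixed - full)))    # small: only the input rounding error
print(np.max(np.abs(pure16 - full)))   # larger: FP16 accumulation loses precision
# For scale: 8 boards x 120 Tensor TFlops = the DGX-1's 960 Tensor TFlops.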

GPUs from AMD

AMD has been playing catch-up with Nvidia in the ML space. The soon to be released AMD Radeon Instinct MI25 is promising 12.3 TFlops of SP or 24.6 TFlops of FP16. If your calculations are amenable to Nvidia's Tensor cores, then AMD can't compete. Nvidia also has nearly twice the memory bandwidth: 900 GB/s versus AMD's 484 GB/s.

Google's TPUs

Google's original TPU had a big lead over GPUs and helped power DeepMind's AlphaGo victory over Lee Sedol in a Go tournament. The original 700MHz TPU is described as delivering 95 TOPS for 8-bit calculations or 23 TOPS for 16-bit whilst drawing only 40W (integer operations, so TOPS rather than TFlops). This was much faster than GPUs on release but is now slower than Nvidia's V100, though not on a per-watt basis. The new TPU2 is referred to as a TPU device with four chips and can do around 180 TFlops. Each chip's performance has been doubled to 45 TFlops for 16-bit. You can see the gap to Nvidia's V100 is closing. You can't buy a TPU or TPU2. Google is making them available for use in their cloud, with TPU pods containing 64 devices for up to 11.5 PetaFlops of performance. The giant heatsinks on the TPU2 are some cause for speculation, but the market is clearly shifting from single devices to groups of devices, and to such groups offered within the cloud.
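The pod figure follows directly from the per-chip numbers; a quick sanity check:

tflops_per_chip = 45        # 16-bit TFlops per TPU2 chip
chips_per_device = 4        # a "TPU device" is a four-chip board
devices_per_pod = 64
pod_pflops = tflops_per_chip * chips_per_device * devices_per_pod / 1000
print(pod_pflops)           # 11.52, matching the quoted ~11.5 PetaFlops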

Wave Computing

Wave's Aussie CTO, Dr Chris Nicol, has produced a wonderful piece of work with Wave's asynchronous data flow processor in their Compute Appliance. I was introduced to Chris briefly a few years ago in California by Metamako Founder Charles Thomas. They both used to work on clockless async stuff at NICTA. Impressive people those two. 

I'm not sure Wave's appliance was initially targeting ML, but their ability to run TensorFlow at 2.9 PetaOPS/s on their 3RU appliance is pretty special. Wave refers to their processors as DPUs, and an appliance has 16 DPUs. Wave uses processing elements it calls Coarse Grained Reconfigurable Arrays (CGRAs). It is unclear what bit width the 2.9 PetaOPS/s refers to. From their white paper, the ALUs can do 1-bit, 8-bit, 16-bit and 32-bit operations,
"The arithmetic units are partitioned. They can perform 8-b operations in parallel (ideal for DNN inferencing) as well as 16-b and 32-b operations (or any combination of the above). Some 64-b operations are also available and these can be extended to arbitrary precision using software.
Here is a bit more on one of the 16 DPUs included in the appliance,
"The Wave Computing DPU is an SoC that contains a 16,384 PEs, configured as a CGRA of 32x32 clusters. It includes four Hybrid Memory Cube (HMC) Gen 2 interfaces, two DDR4 interfaces, a PCIe Gen3 16-lane interface and an embedded 32-b RISC microcontroller for SoC resource management. The Wave DPU is designed to execute autonomously without a host CPU."
On TensorFlow ops, 
"The Wave DNN Library team creates pre-compiled, relocatable kernels for common DNN functions used by workflows like TensorFlow. These can be assembled into Agents and instantiated into the machine to form a large data flow graph of tensors and DNN kernels."
"...a session manager that interfaces with machine learning workflows like TensorFlow, CNTK, Caffe and MXNet as a worker process for both training and inferencing. These workflows provide data flow graphs of tensors to worker processes. At runtime, the Wave session manager analyzes data flow graphs and places the software agents into DPU chips and connects them together to form the data flow graphs. The software agents are assigned regions of global memory for input buffers and local storage. The static nature of the CGRA kernels and distributed memory architecture enables a performance model to accurately estimate agent latency. The session manager uses the performance model to insert FIFO buffers between the agents to facilitate the overlap of communication and computation in the DPUs. The variable agents support software pipelining of data flowing through the graph to further increase the concurrency and performance. The session manager monitors the performance of the data flow graph at runtime (by monitoring stalls, buffer underflow and/or overflow) and dynamically tunes the sizes of the FIFO buffers to maximize throughput. A distributed runtime management system in DPU-attached processors mounts and unmounts sections of the data flow graph at run time to balance computation and memory usage. This type of runtime reconfiguration of a data flow graph in a data flow computer is the first of its kind."
Yeah, me too. Very cool.

The exciting thing about this platform is that it is coarser than an FPGA in architectural terms and thus less flexible, but likely to perform better. Very interesting.
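To give a feel for the session-manager description quoted above, here is a toy Python sketch of kernels as agents connected by FIFO buffers so that one stage's computation overlaps with the next stage's input. Every name here is made up; it is a cartoon of the dataflow idea, not Wave's software.

import threading, queue

def agent(fn, in_q, out_q):
    # Each agent consumes from its input FIFO and produces to its output FIFO,
    # so upstream and downstream stages run concurrently.
    def run():
        while True:
            item = in_q.get()
            if item is None:          # sentinel: pass the shutdown downstream
                out_q.put(None)
                return
            out_q.put(fn(item))
    threading.Thread(target=run, daemon=True).start()

# A tiny three-stage data flow graph: source FIFO -> scale -> offset -> sink FIFO.
q0, q1, q2 = (queue.Queue(maxsize=4) for _ in range(3))
agent(lambda x: x * 2.0, q0, q1)
agent(lambda x: x + 1.0, q1, q2)

for value in [1.0, 2.0, 3.0]:
    q0.put(value)
q0.put(None)

results = []
while True:
    out = q2.get()
    if out is None:
        break
    results.append(out)
print(results)   # [3.0, 5.0, 7.0]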

KnuEdge's KnuPath

I tweeted about KnuPath back in June 2016. Their product page has since gone missing in action. I'm not sure what they are up to with the $100M they put into their MIMD architecture. It was described at the time as having 256 tiny DSP, or tDSP, cores on each ASIC along with an ARM controller suitable for sparse matrix processing in a 35W envelope. 

(source: HPC Wire)
The performance is unknown, but they compared their chip to an Nvidia part current at the time and claimed 2.5 times the performance. We know Nvidia is now more than ten times faster with their Tensor cores, so KnuEdge will have a tough job keeping up. A MIMD or DSP approach will have to execute awfully well to take some share in this space. Time will tell.

Intel's Nervana

Intel purchased Nervana Systems, which was developing both a GPU/software approach and its Nervana Engine ASIC. Comparable performance is unclear. Intel is also planning to integrate the technology into the Phi platform via the Knights Crest project. NextPlatform suggested the 2017 target on 28nm may be 55 TOPS for some unspecified operand width. Intel has scheduled a NervanaCon for December, so perhaps we'll see the first fruits then.

Horizon Robotics

This Chinese start-up has a Brain Processing Unit (BPU) in the works. Dr Kai Yu has the right kind of pedigree, as he was previously the head of Baidu's Institute of Deep Learning. Earlier this year a BPU emulation on an Arria 10 FPGA was shown in this Youtube clip. There is little public information on this platform.

Eyeriss

Eyeriss is an MIT project that developed a 65nm ASIC with unimpressive raw performance. The chip is about half the speed of an Nvidia TK1 on AlexNet. The neat aspect was that such middling performance was achieved by a 278mW reconfigurable accelerator thanks to its row-stationary dataflow. Nice.

Graphcore

Graphcore raised $30M of Series-A late last year to support the development of their Intelligence Processing Unit, or IPU. Their website is a bit sparse on detail, with hand-wavy claims such as >14,000 independent processor threads and >100x memory bandwidth. Some snippets have snuck out, with NextPlatform reporting over a thousand true cores on the chip with a custom interconnect. Its PCIe board reportedly has 16 processor elements. It sounds kind of dataflowy. Unconvincing PR aside, the team has a strong rep and the investors are not naive, so we'll wait and see.

Tenstorrent

Tenstorrent is a small Canadian start-up in Toronto claiming, like most, an order of magnitude improvement in efficiency for deep learning. No real public details, but they are on the Cognitive 300 list.

Cerebras

Cerebras is notable due to its backing from Benchmark and that its founder was the CEO of SeaMicro. It appears to have raised $25M and remains in stealth mode.


Thinci

Thinci is developing vision processors out of Sacramento, with employees in India too. They claim their first silicon, the Thinci-tc500, is at hand, with benchmarking and customer wins already happening. Apart from "doing everything in parallel" we have little to go on.


Koniku

Koniku's web site is counting down, with 72 days showing until my new reality. I can hardly wait. They have raised very little money, and after watching their Youtube clip embedded in this Forbes page, you too will likely not be convinced, but you never know. Harnessing biological cells is certainly different. It sounds like a science project, but then there's this,
"We are a business. We are not a science project," Agabi, who is scheduled to speak at the Pioneers Festival in Vienna, next week, says, "There are demands that silicon cannot offer today, that we can offer with our systems."
The core of the Koniku offer is the so-called neuron-shell, inside which the startup says it can control how neurons communicate with each other, combined with a patent-pending electrode which allows to read and write information inside the neurons. All this packed in a device as large as an iPad, which they hope to reduce to the size of a nickel by 2018.

Adapteva

Adapteva is a favourite little tech company of mine to watch, as you'll see in this previous meander, "Adapteva tapes out Epiphany-V: A 1024-core 64-bit RISC processor." Andreas Olofsson taped out his 1024-core chip late last year and we await news of its performance. Epiphany-V has new instructions for deep learning and we'll have to see if this memory-controller-less design with 64MB of on-chip memory will have appropriate scalability. The impressive efficiency of Andreas's design and build may make this a chip we can all actually afford, so let's hope it performs well.

Knowm

Knowm talks about Anti-Hebbian and Hebbian (AHaH) plasticity and memristors. Here is a paper covering the subject, "AHaH Computing–From Metastable Switches to Attractors to Machine Learning." It's a bit too advanced for me. With a quick glance I can't tell the difference between this tech and hocus-pocus but it looks sciency. I'm gonna have to see this one in the flesh to grok it. The idea of neuromemristive processors is intriguing. I do like a good buzzword in the morning.

Mythic

A battery-powered neural chip from Mythic claiming 50x lower power. Not many real details out there. The chip is the size of a button, but aren't most chips?
"Mythic's platform delivers the power of desktop GPU in a button-sized chip"
Perhaps another one suitable for drones and phones, but likely to be eaten or sidelined by the phone chips themselves.

Qualcomm

Phones are an obvious place for ML hardware to crop up. We want to identify the dog type, flower, leaf, cancerous mole, translate a sign, understand the spoken word, etc. Our pocket supercomputers would like all the help they can get for the Age of Perception.

Qualcomm has been fussing around ML for a while with the Zeroth SDK and Snapdragon Neural Processing Engine. The NPE certainly works reasonably well on the Hexagon DSP that Qualcomm uses. The Hexagon DSP is far from a very wide parallel platform, though, and Yann LeCun has confirmed that Qualcomm and Facebook are working together on a better way, in Wired's "The Race To Build An AI Chip For Everything Just Got Real",
"And more recently, Qualcomm has started building chips specifically for executing neural networks, according to LeCun, who is familiar with Qualcomm's plans because Facebook is helping the chip maker develop technologies related to machine learning. Qualcomm vice president of technology Jeff Gehlhaar confirms the project. "We're very far along in our prototyping and development," he says."
Perhaps we'll see something soon beyond the Kryo CPU, Adreno GPU, Hexagon DSP, and Hexagon Vector Extensions. It is going to be hard to be a start-up in this space if you're competing against Qualcomm's machine learning.

Pezy-SC and Pezy-SC2

These are the 1024-core and 2048-core processors that Pezy develops. The Pezy-SC 1024-core chip powered the top 3 systems on the Green500 list of supercomputers back in 2015. The Pezy-SC2 is the follow-up chip that was meant to be delivered by now. I do see a talk about it in June, but details are scarce yet intriguing,
"PEZY-SC2 HPC Brick: 32 of PEZY-SC2 module card with 64GB DDR4 DIMM (2.1 PetaFLOPS (DP) in single tank with 6.4Tb/s"
It will be interesting to see what 2,048 MIMD MIPS Warrior 64-bit cores can do. In the June 2017 Green500 list, an Nvidia P100 system took the number one spot and there is a Pezy-SC2 system at number 7. So the chip seems alive, but details are thin on the ground. Motoaki Saito is certainly worth watching.

Kalray

Despite many promises, Kalray has not progressed their chip offering beyond the 256 core beast I covered back in 2015, "Kalray - new product meander." Kalray is advertising their product as suitable for embedded self-driving car applications though I can't see the product architecture being an ideal CNN platform in its current form. Kalray has a Kalray Neural Network (KaNN) software package and claims better efficiency than GPUs with up to 1 TFlop/s on chip.

Kalray's NN fortunes may improve with an imminent product refresh, and just this month Kalray completed a new funding round that raised $26M. The new Coolidge processor is due in mid-2018 with 80 or 160 cores along with 80 or 160 co-processors optimised for vision and deep learning.

This is quite a change in architecture from their >1000 core approach and I think it is most sensible.

IBM TrueNorth

TrueNorth is IBM's Neuromorphic CMOS ASIC developed in conjunction with the DARPA SyNAPSE program.
It is a manycore processor network on a chip design, with 4096 cores, each one simulating 256 programmable silicon "neurons" for a total of just over a million neurons. In turn, each neuron has 256 programmable "synapses" that convey the signals between them. Hence, the total number of programmable synapses is just over 268 million (2^28). In terms of basic building blocks, its transistor count is 5.4 billion. Since memory, computation, and communication are handled in each of the 4096 neurosynaptic cores, TrueNorth circumvents the von-Neumann-architecture bottlenecks and is very energy-efficient, consuming 70 milliwatts, about 1/10,000th the power density of conventional microprocessors. [Wikipedia]
TrueNorth was previously criticised for supporting spiking neural networks rather than being fit for deep learning, so IBM developed a new algorithm for running CNNs on it,
Instead of firing every cycle, the neurons in spiking neural networks must gradually build up their potential before they fire...Deep-learning experts have generally viewed spiking neural networks as inefficient—at least, compared with convolutional neural networks—for the purposes of deep learning. Yann LeCun, director of AI research at Facebook and a pioneer in deep learning, previously critiqued IBM’s TrueNorth chip because it primarily supports spiking neural networks... 
...the neuromorphic chips don't inspire as much excitement because the spiking neural networks they focus on are not so popular in deep learning.
To make the TrueNorth chip a good fit for deep learning, IBM had to develop a new algorithm that could enable convolutional neural networks to run well on its neuromorphic computing hardware. This combined approach achieved what IBM describes as “near state-of-the-art” classification accuracy on eight data sets involving vision and speech challenges. They saw between 65 percent and 97 percent accuracy in the best circumstances.
When just one TrueNorth chip was being used, it surpassed state-of-the-art accuracy on just one out of eight data sets. But IBM researchers were able to boost the hardware’s accuracy on the deep-learning challenges by using up to eight chips. That enabled TrueNorth to match or surpass state-of-the-art accuracy on three of the data sets.
The TrueNorth testing also managed to process between 1,200 and 2,600 video frames per second. That means a single TrueNorth chip could detect patterns in real time from between as many as 100 cameras at once..." [IEEE Spectrum]
Power efficiency is quite brilliant on TrueNorth and makes it very worthy of consideration.

Brainchip's Spiking Neuron Adaptive Processor (SNAP)

SNAP will not do deep learning and is a curiosity without being a practical drop-in CNN engineering solution, yet. IBM's stochastic phase-change neurons seem more interesting if that is a path you wish to tread.

Apple's Neural Engine

Will it or won't it? Bloomberg is reporting it will, as a secondary processor, but there is little detail. Not only is it an important area for Apple, but it also helps Apple avoid and compete with Qualcomm.

Others


Cambricon - the Chinese Academy of Sciences invests $1.4M for the chip. It is an instruction set architecture for NNs with data-level parallelism, customised vector/matrix instructions, and on-chip scratchpad memory. It claims 91 times the speed of an x86 CPU and 3 times that of a K40M at 1%, or 1.695W, of the peak power use. See "Cambricon-X: An Accelerator for Sparse Neural Networks" and "Cambricon: An Instruction Set Architecture for Neural Networks."

Ex-Googlers and Groq Inc. Perhaps another TPU?

Aimotive.

Deep Vision is building low-power chips for deep learning. Perhaps one of these papers by the founders has clues: "Convolution Engine: Balancing Efficiency & Flexibility in Specialized Computing" [2013] and "Convolution Engine: Balancing Efficiency and Flexibility in Specialized Computing" [2015].

Deep Scale.

Reduced Energy Microsystems are developing lower power asynchronous chips to suit CNN inference. REM was Y Combinator's first ASIC venture according to TechCrunch.

Leapmind is busy too.

FPGAs


Microsoft has thrown its hat into the FPGA ring, "Microsoft Goes All in for FPGAs to Build Out AI Cloud." Wired did a nice story on the MSFT use of FPGAs too, "Microsoft Bets Its Future on a Reprogrammable Computer Chip"
"On Bing, which an estimated 20 percent of the worldwide search market on desktop machines and about 6 percent on mobile phones, the chips are facilitating the move to the new breed of AI: deep neural nets."
I have some affinity for this approach. Xilinx and Intel's (née Altera) FPGAs are powerful engines. Xilinx naturally claims its FPGAs are best for INT8, making that case in one of its white papers.
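For the uninitiated, the INT8 argument is roughly that inference tolerates quantisation well, and 8-bit multiply-accumulate maps neatly onto FPGA DSP blocks. Here is a minimal numpy sketch of symmetric linear quantisation, purely illustrative and not from any vendor toolchain:

import numpy as np

def quantise(x, bits=8):
    # Symmetric linear quantisation: map the FP32 range onto signed 8-bit integers.
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1)
    return np.round(x / scale).astype(np.int8), scale

rng = np.random.default_rng(2)
w = rng.standard_normal((64, 64)).astype(np.float32)   # pretend layer weights
a = rng.standard_normal(64).astype(np.float32)         # pretend activations

qw, sw = quantise(w)
qa, sa = quantise(a)

# Accumulate in 32-bit integers, as a DSP-block MAC chain would, then rescale.
int_acc = qw.astype(np.int32) @ qa.astype(np.int32)
approx = int_acc * (sw * sa)

print(np.max(np.abs(approx - w @ a)))   # small error versus the FP32 result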


Both vendors have good support for machine learning with their FPGAs.

Whilst performance per watt is impressive for FPGAs, the vendors' larger chips have long had earth-shatteringly high prices. Xilinx's VU9P lists at over US$50k at Avnet.

Finding a balance between price and capability is the main challenge with the FPGAs.

One thing to love about the FPGA approach is the ability to make some quite wonderful architectural decisions. Say you want to improve your memory streaming of floating-point data by compressing it on the way out to off-board DRAM or HBM and decompressing it in real time on the way back in; there is a solution if you try hard enough, "Bandwidth Compression of Floating-Point Numerical Data Streams for FPGA-Based High-Performance Computing"


This kind of dynamic architectural agility would be a hard thing to pull off with almost any other technology.

Too many architectural choices may be considered a problem, but I kind of like that problem myself. Here is a nice paper on closing the performance gap between custom hardware and FPGA processors with an FPGA-based horizontally microcoded compute engine that reminds me of the old DISC, the dynamic instruction set computer, from many moons ago, "Reducing the Performance Gap between Soft Scalar CPUs and Custom Hardware with TILT"




Winners


Trying to forecast a winner in this kind of race is a fool's errand. Qualcomm will be well placed simply due to their phone dominance. Apple will no doubt succeed with whatever they do. Nvidia's V100 is quite a winner with its Tensor units. I'm not sure I can see Google's TPU surviving in a relentless long-term silicon march despite its current impressive performance. I'm fond of the FPGA approach but I can't help but think they should release DNN editions at much cheaper price points so that they don't get passed by the crowd. Intel and AMD will have their co-processors. As all the major players are mucking in, much of it will come down to supporting standard toolkits, such as TensorFlow, and then we will not have to care too much about the specifics, just the benchmarks.

Of the smaller players, as much as I like and am cheering for the Adapteva approach, I think their memory architecture may not be well suited to DNNs. I hope I'm wrong.

Wave Computing is probably my favourite approach after FPGAs. Their whole asynchronous data flow approach is quite awesome. It appears REM is doing something similar, but I think they may be too late. Will Wave Computing be able to hold their head up in the face of all the opposition? Perhaps, as their asynchronous CGRA has an inherent advantage. Then again, I'm not sure they need DNNs alone to succeed, as their tech has much broader applicability.

Neuromorphic spiking processor thing-a-ma-bobs are probably worth ignoring for now but keep your eye on them due to their power advantage. Quantum crunching may make it all moot anyway. The exception to this rule is probably IBM's TrueNorth thanks to its ability to not just do spiking networks but to also run DNNs efficiently.

For me, Random Forests are friendly. They are much harder to screw up ;-)

Happy trading,

--Matt.
