A Survey on Fault-Tolerant Methodologies for Deep Neural Networks

Article in English. DOI: 10.14313/PAR_248/89

Rizwan Tariq Syed, Markus Ulbricht, Krzysztof Piotrowski, Milos Krstic, IHP – Leibniz-Institut für innovative Mikroelektronik, Frankfurt (Oder), Germany


Abstract

A significant rise in Artificial Intelligence (AI) has impacted many applications around us, so much so that AI is now increasingly used in safety-critical applications. AI at the edge is a reality, meaning that data computation is performed close to the source of the data rather than in the cloud. Safety-critical applications have strict reliability requirements; therefore, AI models running on edge hardware must fulfill the required safety standards. Within the vast field of AI, Deep Neural Networks (DNNs) are the focal point of this survey, as they continue to produce extraordinary results in applications such as medicine, automotive, aerospace, and defense. Traditional reliability techniques for DNN implementations are not always practical, as they fail to exploit the unique characteristics of DNNs. Furthermore, it is essential to understand the targeted edge hardware, because the impact of faults can differ between ASICs and FPGAs. Therefore, in this survey we first examine the impact of faults in ASICs and FPGAs, and then provide a glimpse of the recent progress made towards fault-tolerant DNNs. We discuss several factors that can impact the reliability of DNNs, and extend this discussion to shed light on many state-of-the-art fault mitigation techniques for DNNs.

Keywords

ASICs, fault tolerance, FPGAs, neural networks, reliability

