By Paul Devlin (University of Dallas; STEM-Trek Blogger)
I started my day with the “Machine Learning in HPC Environments” workshop. The workshop moderator briefly recounted Alan Turing’s proposal of intelligent computers and how the term “machine learning” was coined in reference to teaching a computer to play games. ML developers still work hard to make computers good at games, but the moderator named some important changes that have occurred in the field. He pointed to better algorithms, big data, and more computing power as the primary drivers of the dramatic improvement in machine learning software in recent years. After all these huge advances, how will machine learning be able to progress even further? By the end of the workshop, I was convinced that the answer is training parallelization.
The first keynote talk was by Dr. Azalia Mirhoseini from Google Brain. She discussed a novel architecture for deep neural networks that allows for an incredible amount of model parallelization. The “mixture-of-experts” layer is the part of the network that parallelizes well. It’s actually a layer that contains multiple neural subnetworks. Because of this, mixture-of-experts networks have an absurdly large number of parameters. As Dr. Mirhoseini pointed out, though, large networks have proven their value by persistently outperforming smaller networks, so this kind of work is likely to pay off. Specifically, Google has found that networks with mixture-of-experts layers do quite well at the task of machine translation. Dr. Mirhoseini and her team trained a model and found that each “expert” in the mixture-of-experts layer naturally acquired a specific, observable specialization. One expert would be given dates while another would be given words related to biology. This specialization is exactly why the model’s training parallelizes so well: each expert only has to be trained on data relevant to its specialization. Dr. Mirhoseini showed the state-of-the-art results achieved by the mixture-of-experts model and pointed out that with this architecture, we could soon see models with over 1 trillion parameters being successfully trained. That’s a lot of computing, but parallelization is just that powerful.
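To make the routing idea concrete, here’s a toy sketch of a mixture-of-experts layer with top-1 gating: a small gating network scores each input, and only the winning expert’s weights are used for it. Everything here (names, sizes, the NumPy implementation) is my own illustration, not code from the talk, and real MoE layers add load balancing and learned gating that this skips.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, d_in, d_out = 4, 8, 3
gate_w = rng.standard_normal((d_in, n_experts))           # gating network weights
expert_w = rng.standard_normal((n_experts, d_in, d_out))  # one weight matrix per expert

def moe_forward(x):
    """Route each input in the batch to its single best expert (top-1 gating)."""
    scores = x @ gate_w                   # (batch, n_experts) gating scores
    chosen = scores.argmax(axis=1)        # winning expert index per input
    out = np.empty((x.shape[0], d_out))
    for e in range(n_experts):
        mask = chosen == e
        if mask.any():
            out[mask] = x[mask] @ expert_w[e]   # only the chosen expert runs
    return out, chosen

x = rng.standard_normal((16, d_in))
y, chosen = moe_forward(x)
print(y.shape, np.unique(chosen))
```

The parallelization win is visible in the loop: each expert touches only the inputs routed to it, so the experts can live on different devices and train on disjoint slices of the batch.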
The second part of Dr. Mirhoseini’s talk was about techniques for device placement (determining exactly how to split the training up among a set of GPUs and CPUs to optimize training time). The technique Google has been trying is almost comical: using machine learning to optimize machine learning training. It sounds funny, but it turns out that it works really well. Dr. Mirhoseini’s team developed a reinforcement learning framework that divides training work among devices better than human experts can. One of the inputs for the reinforcement model is the neural network that will be trained. If that’s not meta, I don’t know what is. Also, the training of the reinforcement model is parallelized. I wonder how they figure out the device placement for that training.
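For readers who haven’t seen placement framed as reinforcement learning, here’s a heavily simplified sketch of the loop, inspired by (and much smaller than) what the talk described: a softmax policy picks a device for each operation, a simulated runtime provides the reward, and REINFORCE nudges the policy toward faster placements. The cost model, sizes, and hyperparameters are all invented for illustration; the real system learns from measured runtimes of actual graphs.

```python
import numpy as np

rng = np.random.default_rng(1)
n_ops, n_devices = 6, 2
op_cost = rng.uniform(1.0, 3.0, n_ops)   # pretend compute cost per operation
logits = np.zeros((n_ops, n_devices))    # policy parameters: one row per op

def runtime(placement):
    """Simulated runtime: the busiest device dominates (communication ignored)."""
    loads = np.zeros(n_devices)
    for op, dev in enumerate(placement):
        loads[dev] += op_cost[op]
    return loads.max()

lr, baseline = 0.5, None
for step in range(300):
    z = logits - logits.max(axis=1, keepdims=True)       # stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    placement = [rng.choice(n_devices, p=probs[op]) for op in range(n_ops)]
    r = -runtime(placement)                              # reward = negative runtime
    baseline = r if baseline is None else 0.9 * baseline + 0.1 * r
    adv = r - baseline                                   # advantage vs. running baseline
    for op, dev in enumerate(placement):
        grad = -probs[op].copy()
        grad[dev] += 1.0                                 # d log pi(dev|op) / d logits
        logits[op] += lr * adv * grad                    # REINFORCE update

best = logits.argmax(axis=1)
print("learned placement:", best, "runtime:", runtime(best))
```

Even this toy version shows the shape of the idea: the policy never sees the cost model directly, only rewards, yet it learns to spread load across devices.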
After Dr. Mirhoseini’s amazing presentation, I listened to several other shorter talks. The one I found most interesting was by Guojing Cong from IBM. He’s been figuring out how to make Adam-trained models more accurate. He showed us sensitivity heatmaps from SGD-trained models and Adam-trained models. The SGD models were looking at the entirety of the objects they were supposed to identify in images, while the Adam models focused only on a small part of each object. This disparity suggested that Adam isn’t considering enough features. Guojing’s team then looked at the L2 norms and found that SGD’s L2 norms are orders of magnitude smaller than Adam’s, so they tried lowering the initial learning rate on Adam and found that it makes Adam-trained models much more accurate. Take note, Adam users!
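In case you’ve only ever used Adam through a framework, here’s a minimal NumPy implementation of the Adam update on a toy least-squares problem, so you can see exactly where the initial learning rate enters the update. The problem, sizes, and learning rates below are my own illustration, not numbers from the talk; the point is simply that `lr` scales every (bias-corrected, variance-normalized) step.

```python
import numpy as np

def adam_minimize(lr, steps=2000):
    """Run Adam on 0.5*||Aw - b||^2 and return the final loss."""
    rng = np.random.default_rng(0)
    A = rng.standard_normal((20, 5))
    b = rng.standard_normal(20)
    w = np.zeros(5)
    m = np.zeros(5)                       # first-moment (mean) estimate
    v = np.zeros(5)                       # second-moment (variance) estimate
    b1, b2, eps = 0.9, 0.999, 1e-8        # standard Adam hyperparameters
    for t in range(1, steps + 1):
        g = A.T @ (A @ w - b)             # gradient of 0.5*||Aw - b||^2
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        mhat = m / (1 - b1 ** t)          # bias correction
        vhat = v / (1 - b2 ** t)
        w -= lr * mhat / (np.sqrt(vhat) + eps)   # lr scales the whole step
    return 0.5 * np.linalg.norm(A @ w - b) ** 2

print(adam_minimize(1e-3), adam_minimize(1e-4))
```

On a toy convex problem like this, a smaller `lr` mostly just converges more slowly; the accuracy effect Guojing described shows up in deep networks, where Adam’s large effective steps appear to hurt generalization.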
I had to leave the workshop before the second keynote to make it to the mentoring panel, where all the SC18 mentors and protégés received good advice for maintaining long-lasting relationships. After the panel, we all went to the speed mentoring session, where I met my mentor in person: Vipin Sachdeva from Silicon Therapeutics. I talked to Vipin for a while because I find the work done at Silicon Therapeutics so interesting. They use HPC to search for new drugs by simulating protein-ligand binding. After talking to Vipin, I talked to Elizabeth Bartlett from Lockheed Martin. She gave me a lot of great career advice.
When the mentoring events were over, I wandered around the exhibit hall, which was still being set up.
Cover Photo: Nvidia showed off real-time image generation with deep learning
I would like to thank Jim Wilgenbusch (Minnesota Supercomputing Institute), Elizabeth Leake (STEM-Trek) and David Tur (HPCNow!); I would never have gotten this amazing opportunity without them.
Paul Devlin is a sophomore in the computer science department at the University of Dallas. In his free time, he competes in programming competitions and hackathons, and plays the piano. As part of his coursework, he has written multithreaded Java programs and familiarized himself with C. Last summer, he used TensorFlow and TACC’s Maverick supercomputer to train machine learning models as part of a research program at the University of Texas at Arlington. He said, “I am super excited about the many uses for HPC, especially bioinformatics.”
Devlin heard about SC in mid-October from Jim Wilgenbusch (U-Minnesota; Minnesota Supercomputing Institute). Wilgenbusch wrote to ask Leake if she knew of a way that Devlin could participate in SC. Leake knew it was too late to apply for the student program, but she recalled that HPCNow! offered to share an SC18 exhibitor pass with her when she spoke at their conference in Barcelona last summer. Devlin offered to help HPCNow! as a booth show floor runner in exchange for the pass. STEM-Trek covered the cost of his workshops-only registration, and the student program coordinator suggested there was room for one more in the mentoring session and other events–Ubuntu Style!