This event has ended. View the official site or create your own event → Check it out
This event has ended. Create your own
Welcome to BSidesLV 2016, our 8th annual BSides in beautiful Las Vegas, Nevada!
View analytic
Wednesday, August 3 • 11:35 - 12:30
Labeling the VirusShare Corpus: Lessons Learned

Sign up or log in to save this to your schedule and see who's attending!

A machine learning researcher needs a nice dataset to work with, but all of the publicly available malware datasets have major issues. We'll start by reviewing the basics of machine learning on malware: what works, what doesn't, and what data is out there. We'll introduce the VirusShare dataset, show how we fixed the labels issue (using VirusTotal) so that it may be used for supervised machine learning, and discuss why this corpus should be used as a standard for machine learning research. Finally, we'll look at pyspark, and how it can be used to both summarize the corpus and to help us find which chunks have high concentrations of particular families of malware.

avatar for John Seymour

John Seymour

University of Maryland, Baltimore County
John Seymour is a Data Scientist at ZeroFOX, Inc. by day, and Ph.D. student at University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He's mostly interested in avoiding and helping others avoid some of the major pitfalls in machine learning, especially in dataset preparation (seriously, do people still use malware datasets from 1998?) He has spoken at both DEFCON and his... Read More →

Wednesday August 3, 2016 11:35 - 12:30
Ground Truth Florentine F