BSidesLV 2016 has ended
Welcome to BSidesLV 2016, our 8th annual BSides in beautiful Las Vegas, Nevada!
Wednesday, August 3 • 11:35 - 12:30
Labeling the VirusShare Corpus: Lessons Learned

Machine learning researchers need a good dataset to work with, but every publicly available malware dataset has major issues. We'll start by reviewing the basics of machine learning on malware: what works, what doesn't, and what data is out there. We'll then introduce the VirusShare dataset, show how we fixed its labeling problem (using VirusTotal) so that it can be used for supervised machine learning, and discuss why this corpus should be used as a standard for machine learning research. Finally, we'll look at PySpark and how it can be used both to summarize the corpus and to find which chunks have high concentrations of particular malware families.
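
As a rough illustration of the last point, here is a minimal sketch of the kind of per-chunk summary the abstract describes. It is not the speaker's code: it uses plain Python rather than PySpark so that it runs without a cluster, and the chunk names, family labels, and function names are all hypothetical. The same aggregation could be expressed in PySpark as a group-by on chunk and family.

```python
from collections import Counter, defaultdict

# Hypothetical (chunk, family) records; names and labels are made up
# for illustration and do not come from the real corpus.
records = [
    ("chunk_000", "zbot"),
    ("chunk_000", "zbot"),
    ("chunk_000", "vobfus"),
    ("chunk_001", "vobfus"),
    ("chunk_001", "vobfus"),
    ("chunk_001", "vobfus"),
    ("chunk_001", "zbot"),
]


def family_concentration(records):
    """Return {chunk: {family: fraction of that chunk's samples}}."""
    counts = defaultdict(Counter)
    for chunk, family in records:
        counts[chunk][family] += 1
    return {
        chunk: {fam: n / sum(fams.values()) for fam, n in fams.items()}
        for chunk, fams in counts.items()
    }


def densest_chunk(records, family):
    """Return the chunk in which the given family makes up the largest share."""
    conc = family_concentration(records)
    return max(conc, key=lambda c: conc[c].get(family, 0.0))
```

Calling `densest_chunk(records, "vobfus")` on the toy data picks out the chunk where that family is most concentrated, which is the question a researcher would ask before downloading a chunk to study one family.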

Speakers
John Seymour

University of Maryland, Baltimore County
John Seymour is a Senior Data Scientist at ZeroFOX, Inc. by day, and a Ph.D. student at University of Maryland, Baltimore County by night. He researches the intersection of machine learning and InfoSec in both roles. He’s mostly interested in dataset bias (seriously, do people still...


Wednesday August 3, 2016 11:35 - 12:30
Ground Truth Florentine F