澳门新葡萄新京 Academic Seminar Announcement
Title: An Approach for Validating the Quality of Datasets for Machine Learning
Speaker: Junhua Ding (丁俊華)
Time: 3:00 PM, February 27, 2019
Venue: 6205
Abstract:
There are basically two ways to improve the accuracy of machine learning: building relevant machine learning models, and providing high-quality datasets for training the models. Significant efforts have been made in designing powerful machine learning models. Furthermore, many open-source datasets have been created for machine learning research. However, research on assessing the impact of the quality of a dataset on the accuracy of a machine learning system has not received much attention. In this paper, we present an experimental study to show how the quality of datasets impacts the accuracy of machine learning models. We discovered a common problem in datasets that could greatly impact the accuracy of machine learning. This problem could also exist in many other machine learning systems, especially those that are developed using crowd-sourced datasets, and it is difficult to detect using traditional validation approaches. We propose a novel technique based on metamorphic testing for validating a machine learning system together with its training and testing data. The key to metamorphic testing is to create tests that will adequately test the system, and we propose an approach for creating such tests. The effectiveness of the proposed approach is demonstrated through a case study of automated classification of biological cell images.
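For readers unfamiliar with metamorphic testing, the general idea can be illustrated with a minimal Python sketch. It is not the speaker's method: the abstract does not state which metamorphic relations are used, so the relation chosen here (a predicted label should not change when an image is rotated by 90 degrees), the toy images, and the names `rotate90`, `violations`, `mean_classifier`, and `corner_classifier` are all assumptions made purely for illustration.

```python
from typing import Callable, List

Image = List[List[int]]  # toy grayscale image: rows of pixel intensities


def rotate90(img: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def violations(classify: Callable[[Image], str], images: List[Image]) -> List[int]:
    """Return indices of images whose predicted label changes after rotation.

    Assumed metamorphic relation: classify(img) == classify(rotate90(img)).
    No ground-truth labels are needed to run this check.
    """
    return [i for i, img in enumerate(images)
            if classify(img) != classify(rotate90(img))]


def mean_classifier(img: Image) -> str:
    """Hypothetical classifier based on mean intensity (rotation-invariant)."""
    pixels = [p for row in img for p in row]
    return "dense" if sum(pixels) / len(pixels) > 127 else "sparse"


def corner_classifier(img: Image) -> str:
    """Hypothetical flawed classifier that inspects only the top-left pixel."""
    return "dense" if img[0][0] > 127 else "sparse"


# Toy stand-ins for cell images (illustrative data, not from the study).
images = [
    [[200, 10], [10, 10]],
    [[250, 250], [250, 250]],
]

print(violations(mean_classifier, images))    # classifier that respects the relation
print(violations(corner_classifier, images))  # classifier that breaks it on image 0
```

The check compares a model's outputs on related inputs rather than against known labels, which is why this style of test can surface problems that label-based validation may miss.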
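About the speaker:
Junhua Ding is a professor in the Department of Information Science at the University of North Texas, USA. Ding graduated from China University of Geosciences in 1994, received a master's degree in computer science from Nanjing University in 1997, and received master's and doctoral degrees in computer science from Florida International University, USA, in 2000 and 2004, respectively. After working at Beckman Coulter and Johnson & Johnson in the United States, Ding has been engaged in research and teaching since 2007, in the Department of Computer Science at East Carolina University and at the University of North Texas. Ding has published more than 70 papers, written 2 monographs, and led 6 projects funded by the U.S. National Science Foundation, with extensive experience in data and information quality analysis, software security, computational law, and information retrieval. Ding currently serves as a reviewer for journals including IEEE Transactions on Software Engineering and IEEE TSMC-A, as an editor of Information and Software Technology, and as a reviewer for the U.S. National Science Foundation.
All faculty and students interested in machine learning are welcome to attend.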