Machine Learning Techniques for the
Diagnosis of Pediatric Tuberculosis

Coston, Amanda

Please use this identifier to cite or link to this item: http://arks.princeton.edu/ark:/88435/dsp013x816m72k

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Schapire, Robert	-
dc.contributor.author	Coston, Amanda	-
dc.date.accessioned	2013-07-26T15:41:24Z	-
dc.date.available	2013-07-26T15:41:24Z	-
dc.date.created	2013-05-06	-
dc.date.issued	2013-07-26	-
dc.identifier.uri	http://arks.princeton.edu/ark:/88435/dsp013x816m72k	-
dc.description.abstract	The goal of this project was two-fold: first, to improve the performance of machine learning algorithms for the diagnosis of pediatric tuberculosis, and second, to use machine learning algorithms to better understand the problem of diagnosis. We constructed and examined Bayes nets using a MATLAB toolbox by Kevin Murphy and we experimented with 26 other machine learning algorithms in the Weka software package. We found that while the Bayes nets have better accuracy when we initialize parameters based on medical knowledge, creating our own structure based on medical knowledge did not increase performance; a naive Bayes net does better than our handcrafted Bayes net. Neither the Bayes nets nor any of the Weka algorithms performed at the level necessary for use in real medical settings. Calibration curves show that the predicted probabilities of the Bayes nets and Weka algorithms do not correspond to the probability of positive diagnosis. Among the Weka algorithms, we found that decision algorithms generally have better performance, with the alternating decision tree and the ensemble methods (bagging and AdaBoost) on decision stumps performing the best. Overall, false negative rates are much higher than false positive rates, which does not bode well for practical applications, since false negatives have dire consequences in real life. We found that we could lower the false negative rates and generally improve the performance of the Bayes nets by guessing the label of unknown instances, a method we call predictive labeling. Using a variety of algorithms, we also tested which features were most important to diagnosis. The structure of alternating decision trees as well as traditional decision trees contributed to our understanding. We also randomized the data for each feature to see which had the greatest effect on performance, reasoning that the feature whose randomization had the greatest effect would be the most important. In addition, we implemented an explanation algorithm by selecting which feature in each patient would change the probability of diagnosis most if not present. Using these algorithms we found that the most important features for diagnosis were malaise and weight loss. Moving forward, we recommend obtaining larger and more comprehensive data sets that may yield better performance from the Bayes nets and other machine learning algorithms.	en_US
dc.format.extent	68 pages	en_US
dc.language.iso	en_US	en_US
dc.title	Machine Learning Techniques for the Diagnosis of Pediatric Tuberculosis	en_US
dc.type	Princeton University Senior Theses	-
pu.date.classyear	2013	en_US
pu.department	Computer Science	en_US
pu.pdf.coverpage	SeniorThesisCoverPage	-
dc.rights.accessRights	Walk-in Access. This thesis can only be viewed on computer terminals at the Mudd Manuscript Library (http://mudd.princeton.edu).	-
pu.mudd.walkin	yes	-
Appears in Collections:	Computer Science, 1988-2020; Princeton School of Public and International Affairs, 1929-2020

Files in This Item:

File	Size	Format
Amanda_COSTON_Jocelyn_TANG_.pdf	2.53 MB	Adobe PDF
