Machine Learning
QIIME2's sample-classifier plugin provides supervised machine learning tools for classification and regression using microbiome feature tables. Here we ask two questions:
- Can we predict the facility where a sample was collected? (classification)
- Can we predict the accumulated degree days (ADD) at time of collection? (regression)
Classification
Predict Facility
Train a Random Forest classifier to predict facility from microbial community composition:
```shell
qiime sample-classifier classify-samples \
  --i-table table_nomitochloro_1500_abund_L7.qza \
  --m-metadata-file metadata_q2_workshop.txt \
  --m-metadata-column facility \
  --p-random-state 123 \
  --p-n-jobs 1 \
  --output-dir sample_classifier_results_facility
```
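A classifier's accuracy is easier to interpret when you know how many samples each class contributes. The sketch below counts samples per value of a metadata column. It assumes the metadata file is tab-separated with column names on its first line; count_by_column is a hypothetical helper, not a QIIME 2 command.

```shell
# count_by_column FILE COLUMN
# Print "value<TAB>count" for each distinct value in COLUMN.
# Assumption: FILE is tab-separated and its first line holds the column names.
count_by_column() {
  LC_ALL=C awk -F '\t' -v col="$2" '
    NR == 1 {
      # Locate the requested column in the header row.
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    /^#/ { next }                 # skip comment lines such as #q2:types
    $idx != "" { n[$idx]++ }      # count non-empty values only
    END { for (k in n) printf "%s\t%d\n", k, n[k] }
  ' "$1" | LC_ALL=C sort
}

# Usage with the files from this section (left as a comment):
#   count_by_column metadata_q2_workshop.txt facility
```

If one facility has far fewer samples than the others, a high overall accuracy can hide poor performance on that class, so the confusion matrix is worth reading alongside the summary score.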
Random State
Setting --p-random-state 123 makes the results reproducible. Any integer works; reuse the same value whenever you need to reproduce the exact same train/test split and model.
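The effect of a fixed seed can be seen in a toy example. The sketch below randomly assigns sample IDs to a training or test set using awk's seeded random number generator. It illustrates seeding in general and is not QIIME 2's internal splitting logic; split_samples and the sample IDs are hypothetical.

```shell
# split_samples SEED TEST_FRACTION
# Read one sample ID per line on stdin and label each as "train" or "test".
# Each ID goes to the test set with probability TEST_FRACTION.
split_samples() {
  LC_ALL=C awk -v seed="$1" -v frac="$2" '
    BEGIN { srand(seed) }   # fixed seed: the same random sequence on every run
    { printf "%s\t%s\n", (rand() < frac ? "test" : "train"), $0 }
  '
}

# Hypothetical sample IDs, split twice with the same seed.
ids='s1 s2 s3 s4 s5 s6 s7 s8 s9 s10'
first=$(printf '%s\n' $ids | split_samples 123 0.2)
second=$(printf '%s\n' $ids | split_samples 123 0.2)

# The two assignments match because the seed is the same.
[ "$first" = "$second" ] && echo "same seed, same split"
```

Changing the seed will generally produce a different assignment, which is why reported accuracy can shift slightly between runs that use different values.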
Visualize Top Features
Facility Classifier
Generate a heatmap showing the normalized abundance of the 100 most important features, with samples grouped by facility:
```shell
qiime sample-classifier heatmap \
  --i-table table_nomitochloro_1500_abund_L7.qza \
  --i-importance sample_classifier_results_facility/feature_importance.qza \
  --m-sample-metadata-file metadata_q2_workshop.txt \
  --m-sample-metadata-column facility \
  --p-group-samples \
  --p-feature-count 100 \
  --o-heatmap sample_classifier_results_facility/heatmap_100_features.qzv \
  --o-filtered-table sample_classifier_results_facility/filtered_table_100_features.qza
```
Regression
Predict Accumulated Degree Days
Train a Random Forest regressor to predict the continuous variable add_0c (accumulated degree days):
```shell
qiime sample-classifier regress-samples \
  --i-table table_nomitochloro_1500_abund_L7.qza \
  --m-metadata-file metadata_q2_workshop.txt \
  --m-metadata-column add_0c \
  --p-estimator RandomForestRegressor \
  --p-n-estimators 20 \
  --p-random-state 123 \
  --output-dir sample_regressor_results_ADD
```
Inspect Feature Importances
ADD Regressor
```shell
qiime metadata tabulate \
  --m-input-file sample_regressor_results_ADD/feature_importance.qza \
  --o-visualization sample_regressor_results_ADD/feature_importance.qzv
```
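To rank the importances outside the QIIME 2 viewer, the artifact can be exported and sorted on the command line. The sketch below assumes that qiime tools export writes a tab-separated file named importance.tsv with one header line, the feature ID in the first column and the importance score in the second. Check the exported directory if your version differs. top_features and the exported_importance directory name are hypothetical.

```shell
# Export step (requires QIIME 2, so it is left as a comment here):
#   qiime tools export \
#     --input-path sample_regressor_results_ADD/feature_importance.qza \
#     --output-path sample_regressor_results_ADD/exported_importance

# top_features FILE N
# Print the N highest-scoring rows as "score<TAB>feature".
# Assumptions: line 1 is a header, column 1 is the feature ID,
# column 2 is the importance score. Lines starting with # are skipped.
top_features() {
  LC_ALL=C awk -F '\t' 'NR > 1 && !/^#/ { printf "%.6f\t%s\n", $2, $1 }' "$1" |
    LC_ALL=C sort -t "$(printf '\t')" -k1,1nr |
    head -n "$2"
}

# Hypothetical usage, assuming the export above produced importance.tsv:
#   top_features sample_regressor_results_ADD/exported_importance/importance.tsv 10
```

Scores are reformatted to fixed-point before sorting so that values written in scientific notation are ordered correctly.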
Outputs
| File | Type | Description |
|---|---|---|
| sample_classifier_results_facility/ | Directory | Classifier results (accuracy, confusion matrix, feature importance) |
| sample_classifier_results_facility/heatmap_100_features.qzv | Visualization | Heatmap of top 100 features |
| sample_classifier_results_facility/filtered_table_100_features.qza | Artifact | Feature table subset to top 100 features |
| sample_regressor_results_ADD/ | Directory | Regressor results (R², scatter plot, feature importance) |
| sample_regressor_results_ADD/feature_importance.qzv | Visualization | Tabulated feature importances for ADD |