Image of a young boy gathering water from a swamp (source: LifeWater)

Predictive Modeling of Tanzanian Water Well Functionality

Nancy Ho


Many villages across the African continent lack access to the sanitary resources needed for a healthy life, and the most prominent of these is water. In Tanzania in particular, according to the non-profit organization LifeWater International, almost half of the population lacks access to water that meets the standard of “safe” water. LifeWater International is one of many non-profits that work to provide clean water to Tanzanian citizens, using donations to drill new wells that give villagers access to water meeting these safety standards.

For this project, I was tasked with creating a model that predicts the functionality of a water well in Tanzania from descriptive information about each well. Although drilling new wells is important for developing Tanzania’s water systems, I strongly believe it is equally important to monitor the condition of wells already built. For a non-profit that specializes in providing clean water to countries like Tanzania, I would recommend making sure the populace doesn’t lose access to a consistent water source by keeping existing wells functional while new ones are being planned.

The data I used in this project comes from DrivenData, whose datasets are drawn from Tanzania Ministry of Water records describing Tanzanian water wells and the condition of each well.

To build the predictive models needed to address this business problem, I used machine learning algorithms from the scikit-learn library. After going through some of the documentation on those models, I chose to fit the data on three models: logistic regression, random forest, and Naive Bayes.

After preprocessing the data by one-hot encoding the categorical features with pandas.get_dummies(), I used train_test_split to split the data into training and test datasets. This guards against data leakage by ensuring we have an untouched set of holdout test data to validate our model with later.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=123)
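
As a rough illustration of these two preprocessing steps, here is a minimal sketch on a toy table. The column names and values are made up for the example and are not taken from the actual well dataset.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the well data; column names and values are hypothetical.
wells = pd.DataFrame({
    "pump_type": ["hand", "motor", "hand", "solar", "motor", "hand", "solar", "hand"],
    "altitude_m": [1200, 300, 850, 40, 1500, 610, 95, 1010],
    "status": ["functional", "non functional", "functional", "needs repair",
               "functional", "non functional", "functional", "functional"],
})

# Separate the target, then one-hot encode the categorical predictors.
y = wells["status"]
X = pd.get_dummies(wells.drop(columns="status"))

# Hold out 25% of the rows; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

print(X.columns.tolist())
print(X_train.shape, X_test.shape)
```

Each category in pump_type becomes its own indicator column, while the numeric column passes through unchanged.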

The first model I ran was a LogisticRegression model, which estimates a probability for each class from a weighted combination of the features and predicts the most probable class. The multinomial setting handles a multiclass target like the one I attempt to model here.

logreg_pipeline = Pipeline([('ss', StandardScaler()),
                            ('logreg', LogisticRegression(C=1.0, multi_class='multinomial'))])
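
To give a feel for how a multinomial logistic regression turns features into class probabilities, here is a small sketch of the softmax step in plain NumPy. The weights, intercepts, and sample below are invented for illustration and are not fitted values from the project.

```python
import numpy as np

def softmax(scores):
    """Convert per-class linear scores into probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum()

# Hypothetical weights (3 classes x 2 features) and intercepts, not fitted values.
W = np.array([[0.8, -0.5], [-0.2, 0.1], [-0.6, 0.4]])
b = np.array([0.1, 0.0, -0.1])
x = np.array([1.5, -0.3])  # one standardized sample

# One linear score per class, then softmax to get probabilities.
probs = softmax(W @ x + b)
predicted_class = int(np.argmax(probs))
print(probs.round(3), predicted_class)
```

The predicted class is simply the one with the largest probability, which here is the class with the largest linear score.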

To evaluate the performance of this model and the rest of my models, I used two methods from scikit-learn’s library: cross validation, which splits the training set into smaller “folds”, trains the model on all but one fold, and scores it on the fold left out, rotating until every fold has been held out once; and the F1 score, which returns the harmonic mean of the precision and recall of the model.

cross_val_score(logreg_pipeline, X_train, y_train, cv=5)
logreg_pipeline.fit(X_train, y_train)
logreg_y_pred = logreg_pipeline.predict(X_test)
f1_score(y_test, logreg_y_pred, average='micro')
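
The sketch below runs the same two evaluation steps end to end on synthetic data from make_classification, which stands in for the well data here. It also checks one property worth knowing: when each sample has exactly one label, the micro-averaged F1 score equals plain accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic three-class data standing in for the well dataset.
X, y = make_classification(n_samples=600, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

pipe = Pipeline([("ss", StandardScaler()),
                 ("logreg", LogisticRegression(max_iter=1000))])

# Five-fold cross-validation on the training data only.
cv_scores = cross_val_score(pipe, X_train, y_train, cv=5)

# Fit on the full training set, then score the untouched test set.
pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
micro_f1 = f1_score(y_test, y_pred, average="micro")
accuracy = accuracy_score(y_test, y_pred)

print(cv_scores.mean(), micro_f1, accuracy)
```

Because the pipeline is passed to cross_val_score as a whole, the scaler is refit inside each fold rather than on all of the training data at once.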

Both of our metrics returned an average score of 0.70, which is pretty decent for the first model; however, I want to try and find a more accurate model for correctly classifying water well functionality.

Next, I ran a RandomForest model, which fits many decision tree classifiers on bootstrapped subsets of the data and averages their predictions to improve accuracy and reduce overfitting.

rf_pipeline = Pipeline([('ss', StandardScaler()),
                        ('rf', RandomForestClassifier(max_depth=None))])
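
To make the averaging idea concrete, this sketch fits a small forest on synthetic data (not the well data) and checks that the forest's class probabilities match the mean of the probabilities from its individual trees. The forest size of 25 is an arbitrary choice for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=3, random_state=123)

forest = RandomForestClassifier(n_estimators=25, random_state=123).fit(X, y)

# Each fitted tree is available in estimators_; average their class probabilities.
tree_probs = np.stack([tree.predict_proba(X) for tree in forest.estimators_])
averaged = tree_probs.mean(axis=0)

# The forest's own probabilities should agree with the manual average.
forest_probs = forest.predict_proba(X)
print(np.allclose(averaged, forest_probs))
```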

This model averaged a score of 0.72. Not much better than the first model I ran, but this is still an improvement!

Lastly, I tried out the Naive Bayes model. This model builds on Bayes’ theorem and makes the “naive” assumption that all the features are independent of each other given the class, so it estimates the probability of each class by multiplying the class prior by the conditional probability of each feature. Scikit-learn has multiple Naive Bayes classifiers available; for this model I chose to use the Gaussian Naive Bayes classifier.

gnb_pipeline = Pipeline([('ss', StandardScaler()),
                         ('gnb', GaussianNB())])

Unfortunately, this model performed much worse, only averaging a score of 0.22.

Of the three models, the random forest performed best, so I proceeded to tune its hyperparameters, which essentially means searching for the parameter values that optimize the model’s performance.

param_grid = {
    "rf__criterion": ["gini", "entropy"],
    "rf__max_depth": [None, 2, 5, 10],
    "rf__min_samples_split": [2, 5, 10],
    "rf__min_samples_leaf": [1, 3, 6]
}
rf_gridsearch = GridSearchCV(estimator=rf_pipeline,
                             param_grid=param_grid,
                             scoring='f1_micro')
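
For reference, this is roughly what running such a search looks like from start to finish. The sketch uses synthetic data and a deliberately smaller grid (small_grid, chosen only so the example runs quickly), so its results say nothing about the well data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=123)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

rf_pipeline = Pipeline([
    ("ss", StandardScaler()),
    ("rf", RandomForestClassifier(n_estimators=20, random_state=123)),
])

# A deliberately small grid so the search finishes quickly.
# The "rf__" prefix routes each parameter to the pipeline step named "rf".
small_grid = {"rf__max_depth": [None, 5], "rf__min_samples_leaf": [1, 3]}

search = GridSearchCV(rf_pipeline, param_grid=small_grid,
                      scoring="f1_micro", cv=3)
search.fit(X_train, y_train)

print(search.best_params_)
print(round(search.best_score_, 3))

# With the default refit=True, the search object predicts with the best pipeline.
test_score = search.score(X_test, y_test)
```

The number of fits grows with the product of the grid sizes times the number of folds, which is why larger grids can take a long time.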

Our best model with the optimal parameters ended up with an average score of 0.75. While it may not be the best score, it shows that the model performs reasonably well. If anything, it provides a good baseline with plenty of room for refinement later on. Some ways to improve it might be to include additional features or to test parameter values beyond scikit-learn’s defaults and those in our grid search.

Something I would like to experiment with later is fitting the data on other machine learning algorithms available in scikit-learn. Two models I considered training during this project were a K-Nearest Neighbors model and a Support Vector Machine (SVM) model. However, I ran into numerous issues running them, so for the sake of time I had to set them aside:

  • The K-Nearest Neighbors model simply took too much time to run compared to the random forest and Naive Bayes models.
  • The SVM model also took a long time to run and was taxing on my computer’s performance; when I tried to run a grid search on the SVM, it would not finish even after hours.

Given more time and perhaps a more powerful system, I would consider trying to run these models again. Both models, with their default parameters, happened to outperform the best model in the final analysis (random forest). If I have more time in the future, I would like to see whether they can show a significant improvement over my previous models.
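
One way to revisit these two models without waiting for hours might be to try them on a random subsample first and time each run. The sketch below does this on synthetic data. The subsample size of 500 and the three folds are arbitrary choices for illustration, and this is not something I ran during the project.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.utils import resample

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           n_classes=3, random_state=123)

# Draw a stratified subsample of 500 rows without replacement.
X_small, y_small = resample(X, y, n_samples=500, replace=False,
                            stratify=y, random_state=123)

candidates = {"knn": KNeighborsClassifier(), "svm": SVC()}
results = {}
for name, model in candidates.items():
    pipe = Pipeline([("ss", StandardScaler()), (name, model)])
    start = time.perf_counter()
    scores = cross_val_score(pipe, X_small, y_small, cv=3, scoring="f1_micro")
    # Store the mean score together with the elapsed time in seconds.
    results[name] = (scores.mean(), time.perf_counter() - start)

for name, (score, seconds) in results.items():
    print(f"{name}: f1_micro={score:.3f}, {seconds:.2f}s")
```

Scores on a subsample are only a rough guide, but the timings give an early sense of whether a full run or a grid search is feasible.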

For more detail on the scope of this project, or if you are interested in seeing the code, you can review my GitHub repository: https://github.com/nancyho83/Tanzania_Water_Well_Analysis
