Creating a data science application is an excellent hands-on way to deepen your understanding of the field. This guide will walk you through every step, from gathering and preprocessing data to training a model and deploying predictions via an API.
In this tutorial, we'll use Python, scikit-learn, and FastAPI to develop a straightforward data science app. The app will leverage the built-in wine dataset from scikit-learn to keep the process simple and accessible. By the end, you'll have a functional API serving machine learning predictions.
Ready to dive in? Let's start building your first data science application!
Step 1: Setting Up the Environment
Make sure you have a recent version of Python installed on your system. Next, install the required libraries to build the machine learning model and set up the API for serving predictions. Here's how:
$ pip3 install fastapi uvicorn scikit-learn pandas
This command will install the necessary libraries:
- FastAPI: To create and manage the API.
- Uvicorn: A server to run the API.
- scikit-learn: For building and training the machine learning model.
- Pandas: To handle and preprocess data efficiently.
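Before moving on, it can be worth a quick sanity check that everything imported correctly. Here is a minimal sketch (the file name verify_setup.py is just a suggestion):

# verify_setup.py -- optional check that the installed libraries import cleanly
import fastapi
import uvicorn
import sklearn
import pandas

print(f"FastAPI {fastapi.__version__}")
print(f"Uvicorn {uvicorn.__version__}")
print(f"scikit-learn {sklearn.__version__}")
print(f"pandas {pandas.__version__}")

If any of these imports fail, revisit the pip install step before continuing.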
Step 2: Loading the Dataset
We'll use the wine dataset from scikit-learn. To make data manipulation straightforward, we'll load it and convert it into a pandas DataFrame. Here's how:
# model_training.py
import pandas as pd
from sklearn.datasets import load_wine

def get_wine_data():
    # Load the wine dataset
    wine_data = load_wine()
    # Create a DataFrame with feature names
    data_frame = pd.DataFrame(wine_data.data, columns=wine_data.feature_names)
    # Add the target column to the DataFrame
    data_frame['target'] = wine_data.target
    return data_frame
Step 3: Exploring the Dataset
Before moving forward, it's a good idea to explore the dataset briefly. This will help you understand the structure of the data and identify any preprocessing needs. Here's how you can do it:
# Exploring the dataset
if __name__ == "__main__":
    df = get_wine_data()
    # Display the first few rows
    print("Dataset Preview:")
    print(df.head())
    # Check for missing values
    print("\nMissing Values:")
    print(df.isnull().sum())
    # Display basic statistical details
    print("\nDataset Statistics:")
    print(df.describe())
    # Check the distribution of target classes
    print("\nTarget Class Distribution:")
    print(df['target'].value_counts())
This exploration provides insights into the dataset's features, the presence of any missing values, statistical summaries, and the distribution of target classes.
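If you want to dig a little deeper, comparing average feature values per class often hints at how separable the classes are. A small optional addition to the __main__ block above:

# Optional: average feature values per wine class
print("\nPer-class feature means:")
print(df.groupby('target').mean()[['alcohol', 'flavanoids', 'color_intensity']])

Large per-class differences in features like flavanoids suggest that a simple linear model may already perform well on this dataset.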
Step 4: Data Preprocessing
Next, we will preprocess the dataset by splitting it into training and test sets and scaling the features.
The preprocess_data function handles both steps:
# model_training.py
import pickle

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

def preprocess_data(df):
    """
    Preprocesses the dataset by splitting it into training and test sets
    and scaling the features for better model performance.

    Parameters:
    - df: pandas DataFrame, the input dataset containing features and target.

    Returns:
    - X_train_scaled: numpy array, scaled training features.
    - X_test_scaled: numpy array, scaled test features.
    - y_train: pandas Series, training target values.
    - y_test: pandas Series, test target values.
    """
    # Separate features and target variable
    X = df.drop(columns=['target'])  # Features
    y = df['target']                 # Target (wine class)

    # Split dataset into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=27
    )

    # Scale features using StandardScaler
    scaler = StandardScaler()
    X_train_scaled = scaler.fit_transform(X_train)
    X_test_scaled = scaler.transform(X_test)

    # Save the fitted scaler so the API can apply the same transformation later
    with open('scaler.pkl', 'wb') as file:
        pickle.dump(scaler, file)

    return X_train_scaled, X_test_scaled, y_train, y_test
Using StandardScaler ensures that all features are standardized to the same scale, allowing them to contribute equally during model training. We also pickle the fitted scaler: the API will need to apply exactly the same transformation to incoming data before making predictions.
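If StandardScaler is new to you, here is a tiny standalone illustration: after fitting, each column has (approximately) mean 0 and standard deviation 1.

# Standalone illustration of StandardScaler
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 100.0],
              [2.0, 200.0],
              [3.0, 300.0]])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # ~[1. 1.]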
Step 5: Training the Logistic Regression Model
Next, we will train a LogisticRegression model using the preprocessed data and save the trained model to a pickle file. The train_model function handles this process:
# model_training.py
from sklearn.linear_model import LogisticRegression
import pickle

def train_model(X_train, y_train):
    """
    Trains a LogisticRegression model on the given training data
    and saves the trained model to a pickle file.

    Parameters:
    - X_train: numpy array, the scaled training features.
    - y_train: pandas Series or numpy array, the training target values.

    Returns:
    - model: trained LogisticRegression model.
    """
    # Initialize and train the Logistic Regression model
    model = LogisticRegression(random_state=42)
    model.fit(X_train, y_train)

    # Save the trained model to a pickle file
    with open('classifier.pkl', 'wb') as file:
        pickle.dump(model, file)

    return model
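As a quick sanity check, you can reload the pickle and confirm the restored model matches the one in memory. This snippet assumes it runs after train_model has been called, for example at the end of the __main__ block shown in the next step:

# Optional: verify the pickled model round-trips correctly
import pickle

with open('classifier.pkl', 'rb') as file:
    restored = pickle.load(file)

# The restored model should reproduce the in-memory model's predictions
assert (restored.predict(X_train_scaled[:5]) == model.predict(X_train_scaled[:5])).all()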
Step 6: Evaluating the Model
After training the model, we evaluate its performance by calculating its accuracy on the test set. The evaluate_model function is defined as follows:
# model_training.py
from sklearn.metrics import accuracy_score

def evaluate_model(model, X_test, y_test):
    """
    Evaluates the performance of a trained model by calculating its accuracy on the test set.

    Parameters:
    - model: trained model to be evaluated.
    - X_test: numpy array, scaled test features.
    - y_test: pandas Series or numpy array, true labels for the test set.

    Returns:
    - accuracy: float, the calculated accuracy of the model.
    """
    # Generate predictions on the test set
    y_pred = model.predict(X_test)
    # Calculate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    print(f"Model Accuracy: {accuracy:.2f}")
    return accuracy

if __name__ == "__main__":
    # Load and preprocess the data
    df = get_wine_data()
    X_train_scaled, X_test_scaled, y_train, y_test = preprocess_data(df)
    # Train the model
    model = train_model(X_train_scaled, y_train)
    # Evaluate the trained model
    evaluate_model(model, X_test_scaled, y_test)
Running the script performs each step in order: the data is loaded, preprocessed, the model is trained, and its performance is evaluated. The output should look like:

Model Accuracy: 0.98
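Accuracy is a single number and can hide per-class behavior. If you want a more detailed view, scikit-learn's classification_report and confusion_matrix are easy optional additions to the __main__ block (exact numbers may differ slightly across scikit-learn versions):

# Optional: per-class precision, recall, and F1, plus a confusion matrix
from sklearn.metrics import classification_report, confusion_matrix

y_pred = model.predict(X_test_scaled)
print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))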
Step 7: Setting Up FastAPI
Next, we’ll create a basic FastAPI application to serve predictions using the trained model. This application will allow users to send data to the API and receive predictions from the LogisticRegression model.
# app.py
from fastapi import FastAPI

app = FastAPI()

@app.get("/")
def home():
    """
    Root endpoint for the API.
    Returns a welcome message.
    """
    return {"message": "Welcome to the Wine Prediction API"}
In this step, we set up a basic FastAPI application and defined a root endpoint, which creates a simple web server capable of responding to HTTP requests.
You can run the FastAPI app by executing the following command:
uvicorn app:app --reload
Go to http://127.0.0.1:8000 to see the welcome message. FastAPI also generates interactive documentation automatically at http://127.0.0.1:8000/docs, which is handy for exploring and testing your endpoints.
Step 8: Loading the Model in FastAPI
We will now load the pre-trained Logistic Regression model within the FastAPI application to make predictions.
Let’s define a function that loads the pre-trained model into the FastAPI app:
# app.py
import pickle

def load_model():
    """
    Loads the pre-trained Logistic Regression model from a pickle file.

    Returns:
    - model: The loaded Logistic Regression model.
    """
    with open('classifier.pkl', 'rb') as file:
        model = pickle.load(file)
    return model

def load_scaler():
    """
    Loads the StandardScaler fitted during preprocessing, so that incoming
    requests can be scaled exactly like the training data.
    """
    with open('scaler.pkl', 'rb') as file:
        scaler = pickle.load(file)
    return scaler
With the model and scaler loaded, the app is ready to generate predictions whenever it receives a request.
Step 9: Creating the Prediction Endpoint
We will define an endpoint that accepts wine features as input and returns the predicted wine class.
Define Input Data Model
We will create a prediction endpoint that accepts wine feature data in JSON format. The input data model will be defined using Pydantic to validate the incoming data.
# app.py
from pydantic import BaseModel

class WineFeatures(BaseModel):
    """
    Pydantic model to validate the incoming wine feature data.
    Each attribute represents a specific wine feature.
    """
    alcohol: float
    malic_acid: float
    ash: float
    alcalinity_of_ash: float
    magnesium: float
    total_phenols: float
    flavanoids: float
    nonflavanoid_phenols: float
    proanthocyanins: float
    color_intensity: float
    hue: float
    od280_od315_of_diluted_wines: float
    proline: float
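A quick way to see the validation in action, outside the API, is to construct the model directly. Pydantic raises a ValidationError when required fields are missing or non-numeric:

# Optional: see Pydantic validation fail on an incomplete payload
from pydantic import ValidationError

try:
    WineFeatures(alcohol=13.2)  # the other twelve fields are missing
except ValidationError as exc:
    print(exc)  # lists every missing field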
Prediction Endpoint
When a request is received, the API scales the provided features and uses the loaded model to predict the wine class.
# app.py
@app.post("/predict")
def predict_wine(features: WineFeatures):
    """
    Accepts wine feature data as input and returns the predicted wine class.

    Parameters:
    - features: WineFeatures, the input wine feature data validated by Pydantic.

    Returns:
    - prediction: The predicted wine class (integer).
    """
    model = load_model()    # Load the pre-trained model
    scaler = load_scaler()  # Load the scaler fitted during preprocessing
    input_data = [[
        features.alcohol, features.malic_acid, features.ash, features.alcalinity_of_ash,
        features.magnesium, features.total_phenols, features.flavanoids,
        features.nonflavanoid_phenols, features.proanthocyanins, features.color_intensity,
        features.hue, features.od280_od315_of_diluted_wines, features.proline
    ]]
    # Apply the same scaling used at training time, then predict
    input_scaled = scaler.transform(input_data)
    prediction = model.predict(input_scaled)
    return {"prediction": int(prediction[0])}  # Return the prediction as an integer
Step 10: Testing the Application Locally
If the server isn't still running, start it again with:
uvicorn app:app --reload
To test the application, send a POST request to the /predict endpoint with the wine feature data in JSON format. Here's an example using curl (tools like Postman work just as well):
curl -X 'POST' \
'http://127.0.0.1:8000/predict' \
-H 'Content-Type: application/json' \
-d '{
"alcohol": 13.2,
"malic_acid": 2.87,
"ash": 2.35,
"alcalinity_of_ash": 19.0,
"magnesium": 100.0,
"total_phenols": 2.8,
"flavanoids": 3.06,
"nonflavanoid_phenols": 0.3,
"proanthocyanins": 1.2,
"color_intensity": 5.2,
"hue": 1.03,
"od280_od315_of_diluted_wines": 3.0,
"proline": 1500
}'
If the request is successful, you will receive a JSON response with the predicted wine class, like:

{"prediction": 0}

This output is the predicted class (0, 1, or 2, corresponding to the three cultivars in the dataset) for the given input features.
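If you prefer testing from Python, the same request can be sent with the requests library (assuming it is installed, e.g. pip install requests):

# test_request.py -- same request as the curl example, via Python
import requests

payload = {
    "alcohol": 13.2, "malic_acid": 2.87, "ash": 2.35,
    "alcalinity_of_ash": 19.0, "magnesium": 100.0, "total_phenols": 2.8,
    "flavanoids": 3.06, "nonflavanoid_phenols": 0.3, "proanthocyanins": 1.2,
    "color_intensity": 5.2, "hue": 1.03,
    "od280_od315_of_diluted_wines": 3.0, "proline": 1500,
}

response = requests.post("http://127.0.0.1:8000/predict", json=payload)
print(response.json())  # e.g. {'prediction': 0}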
Conclusion
We’ve successfully built a simple yet functional data science app!
Here’s a recap of the process:
- We trained a machine learning model using scikit-learn.
- We used FastAPI to create an API that allows users to input wine feature data and receive predictions from the trained model.
The app is now capable of handling user input, making predictions, and returning results via an HTTP interface. This foundation opens up many possibilities for enhancement:
- Building more complex models: You can try using more advanced algorithms or deep learning models.
- Adding additional features: You can integrate more input fields, preprocess the data in different ways, or add more model evaluation metrics.
- Scaling: Once you're confident the app works, consider deploying it to cloud platforms for scalability and accessibility.
With this approach, you can easily extend the functionality to fit more advanced use cases. Enjoy exploring and building on this framework!