Greetings, aspiring data enthusiasts! As we continue to navigate the expansive universe of NumPy and Pandas, it's time to examine their applications across various fields. NumPy and Pandas are well established in data science, and their relevance extends into fields such as Bioinformatics, Astronomy, and Social Networks. This lesson explores these cross-disciplinary applications and shows how NumPy and Pandas support these fields with robust data manipulation capabilities.
Bioinformatics combines biology and computer science, providing a platform for analyzing and interpreting complex biological data, particularly genetic data. Bioinformatics is often confronted with vast and intricate datasets that require advanced data manipulation techniques to extract valuable insights.
Consider a realistic illustration of bioinformatics data: DNA sequences. These sequences are strings of characters representing nucleotides, labeled `A`, `T`, `C`, and `G`. Here's a pandas DataFrame that holds the DNA sequences of several genes.
```python
import pandas as pd

# DNA sequences for several genes
data = {
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Sequence": ["ATCGTACGA", "CGATCGATG", "TAGCTAG", "CGTAGCTA"]
}

df_genes = pd.DataFrame(data)
print(df_genes)
"""
     Gene   Sequence
0  Gene A  ATCGTACGA
1  Gene B  CGATCGATG
2  Gene C    TAGCTAG
3  Gene D   CGTAGCTA
"""
```
In the DataFrame `df_genes`, each row corresponds to a distinct gene. For instance, the first row describes "Gene A", with the sequence `"ATCGTACGA"`. Now, suppose we wish to determine the length of these sequences. Pandas allows us to do this with relative ease. Let's use the Pandas `apply` method, a versatile method that applies a function along an axis of a DataFrame or, as here, to each value of a single column. In this case, it computes the length of each DNA sequence in our DataFrame. We then add the result as a new column, `Length`, to our DataFrame:
```python
df_genes['Length'] = df_genes['Sequence'].apply(len)
print(df_genes)
"""
     Gene   Sequence  Length
0  Gene A  ATCGTACGA       9
1  Gene B  CGATCGATG       9
2  Gene C    TAGCTAG       7
3  Gene D   CGTAGCTA       8
"""
```
As we can see, we used `apply` to call Python's built-in `len` function on each value in the 'Sequence' column and stored the resulting lengths in a new column, 'Length'. This kind of operation is a staple of data manipulation in bioinformatics.
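The same pattern extends to any function of a sequence. The sketch below is an illustration only: it rebuilds the gene DataFrame so that it runs on its own, computes the lengths with the vectorized `.str.len()` accessor as an alternative to `apply(len)`, and then uses `apply` with a helper named `gc_content`. That helper and the `GC_Content` column are assumptions introduced here, not part of the lesson's dataset.

```python
import pandas as pd

# Same simulated gene data as in the lesson
df_genes = pd.DataFrame({
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Sequence": ["ATCGTACGA", "CGATCGATG", "TAGCTAG", "CGTAGCTA"],
})


def gc_content(sequence):
    """Hypothetical helper: fraction of nucleotides that are G or C."""
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Vectorized alternative to apply(len) for string columns
df_genes["Length"] = df_genes["Sequence"].str.len()

# apply accepts any callable that takes one value and returns one value
df_genes["GC_Content"] = df_genes["Sequence"].apply(gc_content)

print(df_genes)
```

For simple string measurements such as length, the `.str` accessor is usually the more direct choice, while `apply` remains useful when the per-row logic lives in a custom function.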
Astronomical research often grapples with large datasets. For instance, astronomical surveys create extensive catalogs of millions to billions of stars. NumPy and Pandas give us the power to manage and manipulate these immense datasets efficiently.
Consider a dataset of star observations that includes their astronomical coordinates (Right Ascension and Declination), magnitudes, and the dates of the observations.
```python
import numpy as np
import pandas as pd

# Star dataset (Simulated data for demonstration)
data = {
    "Star_ID": np.arange(1, 5),
    "Right_Ascension": [204.85, 63.70, 305.29, 45.2],
    "Declination": [-29.72, 38.03, -14.78, 7.8],
    "Magnitude": [2.04, 1.25, 3.17, 1.9],
    "Observation_Date": pd.date_range('01/01/2020', periods=4)
}

df_stars = pd.DataFrame(data)
print(df_stars)
"""
   Star_ID  Right_Ascension  Declination  Magnitude Observation_Date
0        1           204.85       -29.72       2.04       2020-01-01
1        2            63.70        38.03       1.25       2020-01-02
2        3           305.29       -14.78       3.17       2020-01-03
3        4            45.20         7.80       1.90       2020-01-04
"""
```
Filtering stars by magnitude or observation date is commonplace in Astronomy, and Pandas makes such operations intuitive and direct. Let's demonstrate this with a simple filter that keeps only the stars observed after a specific date:
```python
filter_date = pd.to_datetime('2020-01-02')
filtered_stars = df_stars[df_stars['Observation_Date'] > filter_date]
print(filtered_stars)
"""
   Star_ID  Right_Ascension  Declination  Magnitude Observation_Date
2        3           305.29       -14.78       3.17       2020-01-03
3        4            45.20         7.80       1.90       2020-01-04
"""
```
Here, we use the pandas `to_datetime` function to convert a string into a datetime object. Then, we use boolean indexing on the DataFrame to select the stars observed strictly after the specified date, so a star observed on 2020-01-02 itself is excluded.
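Filtering by magnitude follows the same boolean-indexing pattern. The sketch below rebuilds a trimmed version of the simulated star table so it runs on its own. The threshold `magnitude_limit = 2.0` is an arbitrary value assumed for illustration; recall that a lower magnitude means a brighter star. It also shows how two conditions can be combined with `&`, and how NumPy can summarize the selected values.

```python
import numpy as np
import pandas as pd

# Trimmed copy of the lesson's simulated star data
df_stars = pd.DataFrame({
    "Star_ID": np.arange(1, 5),
    "Magnitude": [2.04, 1.25, 3.17, 1.9],
    "Observation_Date": pd.date_range("2020-01-01", periods=4),
})

magnitude_limit = 2.0  # assumed threshold, chosen only for this example

# Boolean mask: True for stars brighter than the limit (lower magnitude)
bright_mask = df_stars["Magnitude"] < magnitude_limit
bright_stars = df_stars[bright_mask]

# Combine conditions with & (each condition wrapped in parentheses)
filter_date = pd.to_datetime("2020-01-02")
recent_bright = df_stars[bright_mask & (df_stars["Observation_Date"] > filter_date)]

# NumPy operates directly on the underlying array of a column
mean_bright_magnitude = np.mean(bright_stars["Magnitude"].to_numpy())

print(bright_stars)
print(recent_bright)
print(mean_bright_magnitude)
```

The parentheses around each condition matter because `&` binds more tightly than comparison operators in Python.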
Social Network Analysis (SNA) effectively reveals the underlying structure within social networks using network and graph theories. It characterizes network structures using nodes (individual elements in the network) and edges (which represent relationships between these nodes).
Consider the following dataset, which represents social interaction among individuals:
```python
# Social interaction data (Simulated for demonstration)
data = {
    "Person": ["Alice", "Bob", "Charlie", "Dave"],
    "Friends": [10, 5, 8, 2],
    "Posts": [100, 50, 80, 200]
}

df_social = pd.DataFrame(data)
print(df_social)
"""
    Person  Friends  Posts
0    Alice       10    100
1      Bob        5     50
2  Charlie        8     80
3     Dave        2    200
"""
```
Now, suppose our goal is to compute the average number of posts per friend for each individual. We can accomplish this quickly using Pandas. Here is a demonstration of how we introduce a new column into our DataFrame, `Posts_per_Friend`, which represents the average number of posts per friend:
```python
df_social['Posts_per_Friend'] = df_social['Posts'] / df_social['Friends']
print(df_social)
"""
    Person  Friends  Posts  Posts_per_Friend
0    Alice       10    100              10.0
1      Bob        5     50              10.0
2  Charlie        8     80              10.0
3     Dave        2    200             100.0
"""
```
We divide the 'Posts' column by the 'Friends' column for each person using vectorized operations from Pandas, which helps us to calculate the average number of posts per friend. We then assign this new calculation to a new column, 'Posts_per_Friend'. Data manipulation of this type is a typical aspect of Social Network Analysis.
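One edge case worth keeping in mind with this kind of ratio is a person with zero friends. The sketch below is an illustration under assumptions: the extra row for "Eve" is hypothetical and not part of the lesson's dataset. It replaces zero friend counts with `NaN` before dividing, so the result is marked as missing instead of infinite.

```python
import numpy as np
import pandas as pd

# Lesson data plus one hypothetical person ("Eve") with no friends
df_social = pd.DataFrame({
    "Person": ["Alice", "Bob", "Charlie", "Dave", "Eve"],
    "Friends": [10, 5, 8, 2, 0],
    "Posts": [100, 50, 80, 200, 30],
})

# Treat a zero friend count as missing so the ratio is NaN, not inf
safe_friends = df_social["Friends"].replace(0, np.nan)

# Vectorized, element-wise division across the two columns
df_social["Posts_per_Friend"] = df_social["Posts"] / safe_friends

print(df_social)
```

Whether a missing value is the right answer for such rows depends on the analysis; dropping them or assigning a sentinel value are equally plausible choices.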
Consider a problem involving multiple disciplines. We are examining a dataset that contains gene names (recalling our exploration in Bioinformatics), their discovery dates (linking back to our Astronomy example), and their popularity on a specific social network. These cross-disciplinary problems present a distinctive niche wherein we can seamlessly navigate disparate data sources using NumPy and Pandas.
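A minimal sketch of such a dataset might look like the following. Everything in it is simulated for demonstration: the column names `Discovery_Date`, `Mentions`, and `Mention_Share`, along with all the values, are assumptions made here for illustration. It reuses the date filter from the Astronomy example and the vectorized division from the Social Network example.

```python
import pandas as pd

# Simulated cross-disciplinary data (all values are made up for illustration)
df_cross = pd.DataFrame({
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Discovery_Date": pd.to_datetime(
        ["2019-06-15", "2020-01-02", "2020-03-10", "2021-11-30"]
    ),
    "Mentions": [120, 45, 300, 80],  # hypothetical popularity measure
})

# Date filter, as in the Astronomy example
cutoff = pd.to_datetime("2020-01-01")
recent_genes = df_cross[df_cross["Discovery_Date"] > cutoff]

# Vectorized arithmetic, as in the Social Network example
df_cross["Mention_Share"] = df_cross["Mentions"] / df_cross["Mentions"].sum()

# Most-mentioned gene among the recently discovered ones
most_popular_recent = recent_genes.sort_values("Mentions", ascending=False).iloc[0]["Gene"]

print(df_cross)
print(most_popular_recent)
```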
Today's lesson broadened our perspective on the utility of NumPy and Pandas across various fields. Through real-world examples, we learned how these tools are integral to managing genetic data in Bioinformatics, handling large-scale datasets in Astronomy, and analyzing and interpreting network data in Social Networks. Applying these tools in data manipulation underscores their significance well beyond data science.
Equipped with the insights gained in today's lesson, it's time to participate in some hands-on exercises! These challenges are centered around the discussed problems and will strengthen your understanding of the concepts. By participating in these exercises, we will be better prepared to tackle real-life challenges using NumPy and Pandas. Fasten your seatbelts, and let's embark on this fascinating journey of data manipulation across various fields!