Expanding Horizons: Applications of NumPy and Pandas in Bioinformatics, Astronomy, and Social Networks

Lesson 5

Introduction

Greetings, aspiring data enthusiasts! As we continue exploring NumPy and Pandas, it's time to examine how they are applied beyond core data science. Both libraries are well established in data science and have broadened their relevance into fields such as Bioinformatics, Astronomy, and Social Network Analysis. This lesson walks through one small example from each field to show how NumPy and Pandas support them with robust data manipulation capabilities.

Bioinformatics and Data Manipulation

Bioinformatics combines biology and computer science, providing a platform for analyzing and interpreting complex biological data, particularly genetic data. Bioinformatics is often confronted with vast and intricate datasets that require advanced data manipulation techniques to extract valuable insights.

Consider a realistic illustration of bioinformatics data - DNA sequences. These sequences are strings of characters representing nucleotides labeled as A, T, C, G. Here's a pandas DataFrame that encapsulates the DNA sequences of several genes.

Python
import pandas as pd

# DNA sequences for several genes
data = {
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Sequence": ["ATCGTACGA", "CGATCGATG", "TAGCTAG", "CGTAGCTA"]
}

df_genes = pd.DataFrame(data)
print(df_genes)
"""
     Gene   Sequence
0  Gene A  ATCGTACGA
1  Gene B  CGATCGATG
2  Gene C    TAGCTAG
3  Gene D   CGTAGCTA
"""

In the DataFrame df_genes, each row corresponds to a distinct gene. For instance, the first row describes "Gene A", whose sequence is "ATCGTACGA". Now, suppose we wish to determine the length of these sequences. Pandas lets us do this with little effort. Selecting the Sequence column gives us a Series, and its apply method calls a function on each value in that Series. Here the function is Python's built-in len, so the result is the length of each DNA sequence, which we store in a new column, Length:

Python
df_genes['Length'] = df_genes['Sequence'].apply(len)
print(df_genes)
"""
     Gene   Sequence  Length
0  Gene A  ATCGTACGA       9
1  Gene B  CGATCGATG       9
2  Gene C    TAGCTAG       7
3  Gene D   CGTAGCTA       8
"""

As the output shows, apply called len on every value in the 'Sequence' column, and the results now sit in the new 'Length' column. Deriving a column from existing ones in this way is a staple of data manipulation in bioinformatics.
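As a hedged aside, the same lengths can also be obtained with the Pandas string accessor, .str.len(), which avoids passing a Python function explicitly. The sketch below rebuilds the sample table so it runs on its own; the variable names lengths_apply and lengths_str are illustrative and do not come from the lesson.

```python
import pandas as pd

# Rebuild the sample gene table from this lesson so the snippet runs on its own
df_genes = pd.DataFrame({
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Sequence": ["ATCGTACGA", "CGATCGATG", "TAGCTAG", "CGTAGCTA"],
})

# Two ways to compute the sequence lengths
lengths_apply = df_genes["Sequence"].apply(len)  # calls len() on each value
lengths_str = df_genes["Sequence"].str.len()     # pandas string accessor

# Both approaches agree on this data
print(lengths_apply.tolist() == lengths_str.tolist())
print(lengths_str.tolist())  # [9, 9, 7, 8]
```

Either form works for a table of this size; the string accessor is often preferred because it also copes with missing values, whereas len would raise an error on them.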

Astronomy: Handling Large Datasets

Astronomical research often grapples with large datasets. For instance, astronomical surveys create extensive catalogs of millions to billions of stars. NumPy and Pandas help us manage and manipulate such datasets efficiently.
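To give a rough feel for what "efficiently" means here, the following sketch applies a NumPy boolean mask to one million simulated magnitudes. Everything in it is an assumption made for illustration: the uniform distribution, the random seed, and the cutoff value are not taken from any real survey.

```python
import numpy as np

# Simulated catalog: one million magnitudes drawn uniformly (not real survey data)
rng = np.random.default_rng(seed=42)
magnitudes = rng.uniform(low=0.0, high=20.0, size=1_000_000)

# Illustrative cutoff (an assumption); lower magnitude means a brighter star
cutoff = 6.0

# A boolean mask evaluates the comparison for every element without a Python loop
bright_mask = magnitudes < cutoff
bright_count = int(bright_mask.sum())
mean_bright = magnitudes[bright_mask].mean()

print(bright_count, round(float(mean_bright), 2))
```

The comparison and the aggregation both run over the whole array at once, which is what keeps operations of this kind practical as catalogs grow.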

Consider a dataset of star observations that includes their astronomical coordinates (Right Ascension and Declination), magnitudes, and the dates of the observations.

Python
import numpy as np
import pandas as pd

# Star dataset (Simulated data for demonstration)
data = {
    "Star_ID": np.arange(1, 5),
    "Right_Ascension": [204.85, 63.70, 305.29, 45.2],
    "Declination": [-29.72, 38.03, -14.78, 7.8],
    "Magnitude": [2.04, 1.25, 3.17, 1.9],
    # Four consecutive days starting on 1 January 2020
    "Observation_Date": pd.date_range('2020-01-01', periods=4)
}

df_stars = pd.DataFrame(data)
print(df_stars)
"""
   Star_ID  Right_Ascension  Declination  Magnitude Observation_Date
0        1           204.85       -29.72       2.04       2020-01-01
1        2            63.70        38.03       1.25       2020-01-02
2        3           305.29       -14.78       3.17       2020-01-03
3        4            45.20         7.80       1.90       2020-01-04
"""

Filtering stars by magnitude or observation date is commonplace in astronomy, and Pandas makes such filters concise. Let's demonstrate with a simple filter that keeps only the stars observed after a specific date (stars observed on that date are excluded too, because the comparison is strict):

Python
filter_date = pd.to_datetime('2020-01-02')
filtered_stars = df_stars[df_stars['Observation_Date'] > filter_date]
print(filtered_stars)
"""
   Star_ID  Right_Ascension  Declination  Magnitude Observation_Date
2        3           305.29       -14.78       3.17       2020-01-03
3        4            45.20         7.80       1.90       2020-01-04
"""

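Here, we use pandas' to_datetime function to convert a string into a Timestamp (pandas' datetime type). Comparing the 'Observation_Date' column with it produces a boolean Series, and indexing df_stars with that Series keeps only the rows where the comparison is True, that is, the stars observed after the specified date.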
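Filters on magnitude and date can also be combined. The sketch below is one possible way to do it, assuming two illustrative thresholds (max_magnitude and start_date are hypothetical names and values, not part of the lesson's dataset description); it rebuilds the simulated star table so it can be run independently.

```python
import numpy as np
import pandas as pd

# Simulated star table from this lesson, rebuilt so the snippet is self-contained
df_stars = pd.DataFrame({
    "Star_ID": np.arange(1, 5),
    "Right_Ascension": [204.85, 63.70, 305.29, 45.2],
    "Declination": [-29.72, 38.03, -14.78, 7.8],
    "Magnitude": [2.04, 1.25, 3.17, 1.9],
    "Observation_Date": pd.date_range("2020-01-01", periods=4),
})

# Illustrative thresholds (assumptions, not values from any real survey)
max_magnitude = 2.0                        # lower magnitude means a brighter star
start_date = pd.to_datetime("2020-01-02")

# Combine two boolean masks with &; each condition needs its own parentheses
bright_recent = df_stars[
    (df_stars["Magnitude"] < max_magnitude)
    & (df_stars["Observation_Date"] >= start_date)
]
print(bright_recent["Star_ID"].tolist())  # [2, 4]
```

Note the use of >= here, which keeps stars observed on the start date itself, in contrast to the strict > used in the filter above.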

Data Analysis in Social Networks

Social Network Analysis (SNA) effectively reveals the underlying structure within social networks using network and graph theories. It characterizes network structures using nodes (individual elements in the network) and edges (which represent relationships between these nodes).

Consider the following dataset, which represents social interaction among individuals:

Python
# Social interaction data (Simulated for demonstration)
data = {
    "Person": ["Alice", "Bob", "Charlie", "Dave"],
    "Friends": [10, 5, 8, 2],
    "Posts": [100, 50, 80, 200]
}

df_social = pd.DataFrame(data)
print(df_social)
"""
    Person  Friends  Posts
0    Alice       10    100
1      Bob        5     50
2  Charlie        8     80
3     Dave        2    200
"""

Now, suppose our goal is to compute the number of posts per friend for each individual. Pandas handles this in a single line. We add a new column, Posts_per_Friend, that holds the ratio of each person's posts to their number of friends:

Python
df_social['Posts_per_Friend'] = df_social['Posts'] / df_social['Friends']
print(df_social)
"""
    Person  Friends  Posts  Posts_per_Friend
0    Alice       10    100              10.0
1      Bob        5     50              10.0
2  Charlie        8     80              10.0
3     Dave        2    200             100.0
"""

We divide the 'Posts' column by the 'Friends' column with a single vectorized operation: Pandas performs the division row by row without an explicit loop. We then assign the result to a new column, 'Posts_per_Friend'. Notice that Dave stands out with 100 posts per friend, ten times the value for everyone else. Data manipulation of this type is a typical aspect of Social Network Analysis.
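One practical caveat with this kind of ratio is a person with zero friends: dividing a positive post count by zero yields inf in Pandas and NumPy rather than raising an error. A minimal sketch of one way to guard against that follows; the row for "Erin" is a hypothetical addition for illustration, and replacing zeros with NaN is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Simulated data; "Erin" is a hypothetical extra row with zero friends
df_social = pd.DataFrame({
    "Person": ["Alice", "Bob", "Erin"],
    "Friends": [10, 5, 0],
    "Posts": [100, 50, 30],
})

# Replace zero denominators with NaN so the ratio is marked missing, not inf
safe_friends = df_social["Friends"].replace(0, np.nan)
df_social["Posts_per_Friend"] = df_social["Posts"] / safe_friends

print(df_social)
```

With this guard, Erin's ratio is reported as NaN, which later aggregations such as mean() skip by default.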

Advanced Problem-Solving

Consider a problem that spans all three areas. We are examining a dataset that contains gene names (recalling our exploration in Bioinformatics), their discovery dates (handled like the observation dates in our Astronomy example), and their popularity on a specific social network. Problems like this call for the same techniques we used above, namely deriving columns, filtering rows by date, and computing ratios between columns, so NumPy and Pandas let us work across disparate data sources with one consistent toolkit.
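The scenario above can be sketched in a few lines. All values below are invented for illustration, and the column names (Discovery_Date, Mentions, Mentions_per_Base), the Sequence column, and the cutoff date are assumptions rather than part of any dataset described in this lesson.

```python
import pandas as pd

# Entirely simulated values for illustration; column names are assumptions
df_combined = pd.DataFrame({
    "Gene": ["Gene A", "Gene B", "Gene C", "Gene D"],
    "Sequence": ["ATCGTACGA", "CGATCGATG", "TAGCTAG", "CGTAGCTA"],
    "Discovery_Date": pd.to_datetime(
        ["2019-06-15", "2020-03-01", "2021-11-20", "2018-01-05"]
    ),
    "Mentions": [120, 45, 300, 80],
})

# Bioinformatics-style step: derive the sequence length
df_combined["Length"] = df_combined["Sequence"].str.len()

# Astronomy-style step: keep rows dated after a cutoff
cutoff = pd.to_datetime("2019-01-01")
recent = df_combined[df_combined["Discovery_Date"] > cutoff].copy()

# Social-network-style step: a vectorized ratio of two columns
recent["Mentions_per_Base"] = recent["Mentions"] / recent["Length"]

print(recent[["Gene", "Length", "Mentions_per_Base"]])
```

The call to .copy() after filtering is a deliberate choice: it makes recent an independent DataFrame, so adding a column to it does not trigger a warning about modifying a view of df_combined.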

Wrapping Up

Today's lesson broadened our perspective on the utility of NumPy and Pandas across various fields. Through small simulated examples, we saw how these tools help manage genetic data in Bioinformatics, filter observation catalogs in Astronomy, and compute per-user metrics for Social Network Analysis. These applications show that their data manipulation capabilities matter well beyond data science.

Ready To Practice?

Equipped with the insights gained in today's lesson, it's time for some hands-on exercises! These challenges build on the problems discussed above and will strengthen your understanding of the concepts. By working through them, you will be better prepared to tackle real-life challenges using NumPy and Pandas. Fasten your seatbelts, and let's embark on this fascinating journey of data manipulation across various fields!
