By Jeremy Schendel
Data Scientist

Jeremy-Schendel-e1455057461949What is Python?python-logo-master-v3-TM

Python is a general purpose programming language which emphasizes code readability and programmer productivity, and is at the heart of NextHealth Technologies’ analytics engine. It has been used for a wide variety of purposes, including building desktop applications, website backends, creating video games, and automating routine tasks, among others. As of July 2016, Python is the fourth most popular programming language, and is used by some of the most innovative organizations in the world.

Why Use Python for Data Science?

Ease of Use

Python is often described as an easy language to use; it has a clear syntax, large community, and small learning curve. By having a very clear and readable syntax, we can quickly and efficiently write working code, spending less time fixing bugs and troubleshooting. Python, more so than any other language, allows us to maximize the time spent focusing on our craft: processing large amounts of healthcare data, finding trends and insights, and driving outcomes. Whenever we do encounter a bug, a fix is usually easy to find. In most cases, someone else in the community has already encountered a similar problem and documented a fix.

python-logo-master-v3-TM

Data Science Libraries

A major strength of Python is its abundance of free third-party libraries. As of July 2016, Python’s official repository contains over 85,000 third-party libraries. For nearly every programming use case, there’s a high quality library available to be leveraged, eliminating the need to reinvent the wheel. Among these there are some fantastic libraries for data science, covering every stage of the data science pipeline.

At the beginning of our data science pipeline is Pandas, a library which provides high performance data structures used to manipulate, clean, and prepare data for the machine learning phase of the pipeline. This is one of the most important and overlooked aspects of machine learning; supplying good data in the proper format is pivotal to making good machine learning predictions. Pandas allows us to efficiently get our data in exactly the format we need, allowing us to avoid the “garbage in, garbage out” pitfall that is common to data science.

After the data has been properly wrangled by Pandas, it goes on to the machine learning phase of the pipeline, which is handled by the scikit-learn library. One of the things that makes scikit-learn great is that it provides a vast number of algorithms from many branches of machine learning under a common API. What this means is that numerous machine techniques can be used with essentially the same syntax, so we can easily pivot between different techniques depending on the type of data being analyzed and the problem that is being solved. Scikit-learn also provides tools for validating results and ensure that models are chosen optimally. Doing all of this in other programming languages would often require multiple libraries, which could all potentially have different syntax and underlying data assumptions.

Perhaps the most important library in Python’s scientific computing ecosystem is NumPy, a library for high performance numerical computation. NumPy provides the basis for which many of Python’s scientific computing libraries are built upon, including Pandas and scikit-learn. In simple terms, NumPy is the high horsepower motor that gives these libraries great performance. Without NumPy, many of the computations across the data science pipeline would be infeasible to do in Python.

Ease of Integration

Aside from data science, Python has a wide variety of libraries that allow for interoperability with other programming languages and technologies, allowing it to seamlessly integrate into almost any platform. In fact, many companies’ platforms are written in Python from end to end. Within the data science team, we use Python to perform necessary tasks that don’t fall strictly within the realm of statistical computing. For example, we use Python to connect to external web API’s to pull in additional data, web scraping for additional data, and even system administration tasks such as moving, converting, and preparing files to enter the data science pipeline. These tasks would be more difficult in many of the other programming languages used for statistical computing, perhaps requiring the additional programming languages. Python’s versatility allows us to do everything we need with a single language, reducing overall complexity, resources, and programmer time.

Concluding Remarks

Python allows us to efficiently process healthcare data and uncover key insights that drive outcomes, all with minimal programming time. Its versatile nature allows it to seamless integrate into NextHealth’s platform.

If you’d like to learn more about how NextHealth can help your organization reduce medical costs using our prescriptive analytics and consumer engagement platform, please contact us.