PyData Carolinas 2016 brought together hundreds of professionals, researchers and students interested in data analysis to discuss how best to apply Python tools to challenges in data management, processing, analytics and visualization. Among the attendees was Clarence White, one of two students from North Carolina A&T State University whom the South Big Data Hub sponsored to attend. The Hub was also a silver sponsor of PyData Carolinas. Below are Clarence's thoughts on the conference.
My name is Clarence White, and I am a Ph.D. student in computational science and engineering at North Carolina A&T State University. My research applies machine learning methods to bioinformatics problems. Two areas of particular interest to me are beta-lactamase prediction and phosphorylation site prediction. Beta-lactamase is one of the main reasons behind the development of antibiotic resistance among pathogenic bacteria, and protein phosphorylation plays an important role in a wide range of cellular processes. Attending the PyData Carolinas 2016 conference gave me much-needed insights and ideas to use in my research. Some of the interesting sessions I attended included the following:
- Integrating Scala/Java with your Python Code
- Scalable Data Science with Spark and R
- Python, C, C++, and Fortran Relationship Status: It’s Not That Complicated
- Transforming Data to Unlock Its Latent Potential
- Introduction to Zeppelin Notebooks and PySpark 2.0
- Just Bring Glue – Leveraging Multiple Libraries To Quickly Build Powerful New Tools
- Balancing scale and interpretability in analytical applications with sklearn and ensembling methods
I had not been exposed to the Jupyter Notebook environment before the conference, but since then I have used it almost exclusively and have found it very useful. Jupyter Notebook is a web application that allows you to create and share documents that contain live code, equations, visualizations and explanatory text. I also picked up some applications of Lasso and Ridge regression for machine learning and, more specifically, for ensemble methods. I am currently applying an ensemble method similar to Random Forest, known as Canonical Correlation Forest, in my research, and I am now evaluating Lasso as a feature selection method to improve the accuracy of my classifier. At PyData, I also heard about various data science techniques that I can apply to data sets to gain valuable insights. In the weeks since attending PyData, after trying some of the methods I learned there, I have made significant progress in my research.
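To illustrate how Lasso can act as a feature selection step ahead of a classifier, here is a minimal sketch using scikit-learn. Everything in it is an assumption made for the example rather than the setup used in the research described above: the data is synthetic, the `alpha` value is arbitrary, and a Random Forest stands in for Canonical Correlation Forest, which scikit-learn does not provide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption for this sketch): 300 samples,
# 10 features, and only features 0 and 1 influence the class label.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Lasso's L1 penalty drives the coefficients of uninformative features to
# exactly zero; SelectFromModel keeps the features with nonzero coefficients.
# alpha=0.1 is an illustrative value; real data would usually be standardized
# first and alpha tuned, for example by cross-validation.
selector = SelectFromModel(Lasso(alpha=0.1)).fit(X, y)
selected = np.flatnonzero(selector.get_support())
X_selected = selector.transform(X)

# Compare a classifier trained on all features with one trained on the
# Lasso-selected subset, using 5-fold cross-validated accuracy.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
score_all = cross_val_score(clf, X, y, cv=5).mean()
score_selected = cross_val_score(clf, X_selected, y, cv=5).mean()

print("Selected feature indices:", selected)
print(f"CV accuracy with all features:      {score_all:.3f}")
print(f"CV accuracy with selected features: {score_selected:.3f}")
```

Whether the selected subset actually improves accuracy depends on the data set, so comparing the two cross-validated scores, as the sketch does, is the safer way to judge the selection step.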
PyData Carolinas was held at IBM in Research Triangle Park. For more on the conference see http://pydata.org/carolinas2016/