GSoC 2018 Wrap-up: Haskell Dataframes, Postgres Type Providers and more
Aug 9, 2018

Overview

Frames-beam is the library I worked on during Google Summer of Code 2018 as part of the Haskell.org open source organization. It is primarily intended as an extension to the Frames library, adding Postgres as an additional data source. The features implemented over the summer were:

  • Generative Type Provider: Generates Haskell types corresponding to your Postgres database schema in a separate, import-able module, at compile time.

  • Query Helpers: Provides a very limited, relevant subset of SQL for selecting rows from Postgres tables, in the form of “canned” queries.

  • GenericVinyl: TemplateHaskell and generics-sop based conversion of query results (i.e. plain Haskell records) to vinyl records, which form the basis of the in-memory Haskell dataframes.

  • Two Modes of Operation:

    A. Interactive Exploration Mode: Enables exploration of a subset of Postgres data (one that fits in memory), read in one shot from the DB and explored in GHCi using Frames.

    B. Scaled-up Streaming Mode: Once you are satisfied with the line of analysis on smaller subsets of data in interactive mode, you can scale things up using conduit-based, constant-memory streaming “pipelines” that work on streams of vinyl records.
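
The idea behind the streaming mode can be pictured with a minimal conduit sketch. This is not the Frames-beam API: the `Order` record, `sampleRows` and `regionTotal` are hypothetical names invented for illustration, plain Haskell records stand in for vinyl records, and an in-memory list stands in for the Postgres source. Only the conduit combinators themselves come from the real library.

```haskell
import Conduit (filterC, mapC, runConduit, sumC, yieldMany, (.|))

-- Hypothetical row type; in Frames-beam the stream would carry vinyl records.
data Order = Order
  { orderId :: Int
  , region  :: String
  , amount  :: Double
  } deriving Show

-- Stand-in for rows that would normally be streamed from the database.
sampleRows :: [Order]
sampleRows =
  [ Order 1 "north" 10.0
  , Order 2 "south" 5.5
  , Order 3 "north" 4.5
  ]

-- Sum the amounts for one region. Each row flows through the filter and
-- the projection one at a time, so only the running total is retained.
regionTotal :: Monad m => String -> [Order] -> m Double
regionTotal r rows =
  runConduit $
       yieldMany rows
    .| filterC ((== r) . region)
    .| mapC amount
    .| sumC
```

Loaded in GHCi, `regionTotal "north" sampleRows` runs the whole pipeline in `IO`. Swapping `yieldMany rows` for a source backed by a database cursor would leave the rest of the pipeline unchanged, which is the appeal of moving from the interactive mode to the streaming one.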

Motivation

Assume you have a bunch of CSV files that you’d like to explore in the added context of data in a Postgres DB instance, or vice versa. You don’t have control over the shape of the data (CSV format/DB schema) that you’re going to be looking at: that is, you’re primarily consuming CSVs and have a read-only Postgres instance.

The world outside can change, leading to modifications in the shape of the data you consume, which leaves a fair amount of your data wrangling code written in R/Python broken and in need of a rewrite. This is where Haskell’s expressive types stand to help, by making explicit the assumptions in your code that are determined by the shape of your data. Broken code becomes easier to refactor confidently into working code again, with gains in reproducibility as well.

Frames-beam (intended for exploration of Postgres data), an extension of Frames (used for CSVs), is motivated by this situation, which comes up often in data analysis workflows.

Design Choices

The following is a small sampling of the design choices made in this library:

  • DB access library: While I didn’t try out any other library, beam worked out pretty well for the purposes of this project. Particularly useful was the beam-migrate feature that powers the “type provider” feature under the hood, so that end users don’t have to write any of the DB access boilerplate themselves. Moreover, the streaming bits are powered by beam-postgres’s conduit-based query execution interface.

  • generics-sop: This is used to convert from plain Haskell records to vinyl. I have blogged in depth about the implementation details of this before. Usage of this library’s deriveGeneric feature in a transparent manner meant using TemplateHaskell, instead of its GHC.Generics based interface. Which also brings us to the next point…

  • TemplateHaskell (TH): I understand this is a slightly polarizing topic, but for some of the functionality that was desired in this library, there simply was no other way to accomplish things. Notably:

    A. The GenericVinyl feature makes use of TH to generate a typeclass instance for an arbitrary plain record, to enable its conversion into vinyl. I couldn’t think of a way to achieve this without TH. I’ll be very happy to be proven wrong here, though.

    B. The “type provider” feature makes use of TH to establish a network connection at compile time to access the DB schema, generate Haskell types corresponding to it, and place the resulting module in a file in the src directory, so that it can be imported wherever necessary. Compile-time code generation makes sure that the most up-to-date version of the database schema is used in the code. The bit about placing the module in a file at compile time is not entirely settled, and there is a GitHub issue about it that describes in detail the reasons for doing so, at least for now. The tl;dr of the issue is that having the generated code come into scope in the calling module, instead of going into a separate file, came down to getting haskell-src-meta to work on the generated code, along with some concerns around user experience.
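
To make the “generate types, then place them in a module” step concrete, here is a minimal sketch using only the template-haskell package. It is not the Frames-beam implementation: `Column`, `usersSchema`, `mkRecord` and `renderModule` are hypothetical names, and a hard-coded column list stands in for the schema that the real library reads from Postgres.

```haskell
import Language.Haskell.TH

-- Hypothetical schema description: (column name, Haskell type name).
type Column = (String, String)

-- Stand-in for what would be discovered from the database at compile time.
usersSchema :: [Column]
usersSchema = [("userId", "Int"), ("userName", "String")]

-- Build the AST for a single-constructor record type that derives Show,
-- with one field per column.
mkRecord :: String -> [Column] -> Q [Dec]
mkRecord tyName cols =
  pure
    [ DataD [] name [] Nothing
        [RecC name fields]
        [DerivClause Nothing [ConT (mkName "Show")]]
    ]
  where
    name    = mkName tyName
    noBang  = Bang NoSourceUnpackedness NoSourceStrictness
    fields  = [ (mkName col, noBang, ConT (mkName ty)) | (col, ty) <- cols ]

-- Pretty-print the generated declaration under a module header, giving
-- source text that could be written to a file with writeFile.
renderModule :: String -> String -> [Column] -> IO String
renderModule modName tyName cols = do
  decs <- runQ (mkRecord tyName cols)
  pure (unlines ["module " ++ modName ++ " where", "", pprint decs])
```

In GHCi, `renderModule "Schema" "User" usersSchema >>= putStrLn` prints the generated module text. Here `runQ` runs the generator from ordinary `IO`; inside an actual splice, `runIO` is what lets Template Haskell perform I/O such as a database query at compile time.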

Challenges Encountered

A small sampling of the rough edges in the ecosystem/general challenges that I had to navigate during the project:

  • While working with generics-sop, at one point I was stuck for a long-ish period because I could not figure out a way to get GHC to (safely) coerce the kind of a type from * to Symbol. I had to dig around a lot, including into parts of the singletons library, which I tried to use to promote the coerce function (from Data.Coerce) from a term-level function to a type-level function (i.e. a type family). It didn’t work out. I’m still not sure if this coercion is even possible, and I had to change my overall approach in order to get past this.

  • It took me a bit to figure out the conduit-based streaming interface of beam-postgres. It wasn’t very tricky in the end, but I do wish this was better documented. Additionally, usage of beam requires monomorphization of its (highly) polymorphic types at certain points of usage; this was something I discovered through the issue tracker, after much trying on my own. These are just two things I wish were better documented, in a library that was very fun to use and has excellent docs/user guides otherwise.

Reflections on GSoC

I had a really good time working on Frames-beam, and exploring different parts of the Haskell ecosystem in the process. I would strongly recommend the GSoC program, as part of the Haskell.org organization, to anyone looking to get beyond the initial stages of learning Haskell and accelerate the learning process in a very hands-on way.

Acknowledgements

I would like to thank Marco Zocca (@ocramz), my GSoC mentor, for his patient guidance, invaluable insights into the Haskell library ecosystem and enthusiastic encouragement all along the way.
