Question

What are the best tools for data preparation?

As a Data Scientist 👨‍🔬, I know we (still) spend most of our time discovering, accessing, collecting & cleaning data.

I haven't found any good solutions that help deal with these issues.

What tools help you master data?

Mentioned
#Database
mike_seekwell
2 years ago

Founder of SeekWell here (http://seekwell.io/). A couple things we do to help in this space:

  1. Shared database connections - One problem we saw early on is that people didn't know the credentials for their databases (especially if they had more than one). We let you easily set up shared connections across your team so you don't need to set up a connection for every new user.
  2. Snippets / code repository - A good bit of the "cleaning" time is spent finding / remembering functions to transform data. We speed that up considerably with Snippets* and a shared code repository that lets you search through your entire team's code history.

Would love to hear everyone's thoughts on what we could be doing better!

*https://www.notion.so/Snippets-88bb32aaf06743149b892ae93d8a0e24

5 points
phionahq
2 years ago

[Disclosure: self-promotion]

My colleagues and I built Phiona (https://phiona.com) because of issues we ran into in our past roles in product management and data engineering. Access to data is difficult (who wants to give folks access to databases?), cleaning and transforming the data is manual and undocumented in Excel or Google Sheets, and it's a pain to repeat the same steps week after week for recurring analyses.

The problem that we see is that there are very few options for folks who aren't as technical (thus leaving SQL and pandas/other free cleaning libraries out of the question) but aren't able to afford incredibly expensive tools meant more for enterprise ETL use cases.

We try to help in a few ways:

  • Maintain secure, governed read-only connections to data sources like MySQL, PostgreSQL and S3 so that non-technical folks can pull data they need without a SQL query.
  • Simplify the process for identifying potential data issues (duplicates, non-standard dates, missing values) by using our automated "co-pilot" to highlight potential problems for the user to take action on.
  • Data joins and transformations without using SQL or Python code.
  • Automating your data transformation steps so that you don't have to do the work all over again, and making it easy to send analyses for others to work on.
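For readers who do write code, the checks named in the list above (duplicates, non-standard dates, missing values) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not how Phiona works; the function name `profile_issues`, the field names, and the expected date format are all assumptions made up for the example.

```python
from collections import Counter
from datetime import datetime


def profile_issues(rows, date_field, date_format="%Y-%m-%d"):
    """Count duplicates, missing values and off-format dates in a list of dict rows."""
    # Duplicates: rows whose (field, value) pairs exactly repeat an earlier row.
    seen = Counter(tuple(sorted(row.items())) for row in rows)
    duplicate_rows = sum(count - 1 for count in seen.values())

    # Missing values: None or empty string, counted per field.
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None or value == "":
                missing[field] += 1

    # Non-standard dates: present, but not parseable with the expected format.
    bad_dates = 0
    for row in rows:
        value = row.get(date_field)
        if value is None or value == "":
            continue  # already counted as missing
        try:
            datetime.strptime(value, date_format)
        except ValueError:
            bad_dates += 1

    return {
        "duplicate_rows": duplicate_rows,
        "missing": dict(missing),
        "bad_dates": bad_dates,
    }


# Hypothetical sample data, invented for this example.
sample = [
    {"order_id": 1, "order_date": "2021-01-05", "amount": 10.0},
    {"order_id": 2, "order_date": "05/01/2021", "amount": 20.0},
    {"order_id": 2, "order_date": "05/01/2021", "amount": 20.0},
    {"order_id": 3, "order_date": None, "amount": None},
]
report = profile_issues(sample, "order_date")
```

A report like this only flags problems; deciding whether to drop, fix, or keep the affected rows is still left to the person doing the analysis.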

Let us know how we might be able to improve the data process even further. We're still pretty early and are always looking for great feedback from data scientists and others!

4 points
bludrop
2 years ago

I've been loving using Trifacta (https://www.trifacta.com/) - both their regular free version and also through Google Cloud Dataprep (https://cloud.google.com/dataprep).

Another option that could work (although less free unless you're dealing with public data) is Exploratory (https://exploratory.io/), which is based on R.

4 points