I am Duyệt

2023-09-03 (1 month ago) in Data

In this post, I explore the features and capabilities of DuckDB, an open-source, in-process SQL OLAP database management system written in C++11 that has been gaining popularity recently. DuckDB is reportedly designed to be easy to use and flexible, letting you run complex queries on relational datasets using either local, file-based DuckDB instances or the cloud service MotherDuck.

2023-07-16 (3 months ago) in Data

Airflow: control the parallelism and concurrency (draw)

How to control parallelism and concurrency in Airflow.

2023-05-07 (5 months ago) in Data

Running Spark in GitHub Actions

This post is a quick and easy guide to running Apache Spark in GitHub Actions for testing purposes.

2023-04-15 (6 months ago) in Data

Why do Helm charts interpret 0777 as 511?

Why do Helm charts interpret 0777 as 511? It took me quite some time to debug.
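The number 511 is what 0777 becomes when it is read as an octal integer and printed back in decimal. Below is a minimal sketch of that arithmetic in Python. It assumes the unquoted value was parsed as an octal integer somewhere along the way; the excerpt above does not spell out the exact mechanism, so treat that as an assumption. The variable names are hypothetical.

```python
# Assumption: the unquoted value 0777 is read as an octal integer
# and later rendered as a plain decimal number.
mode_text = "0777"

# Parse the digits in base 8, the way an octal literal is interpreted.
mode_value = int(mode_text, 8)
print(mode_value)        # 511

# Converting back shows the two spellings are the same number.
print(oct(mode_value))   # 0o777

# Keeping the value as a string avoids the numeric reinterpretation.
quoted_mode = str(mode_text)
print(quoted_mode)       # 0777
```

The arithmetic is 7 * 64 + 7 * 8 + 7 = 511, so the permission bits are unchanged and only the notation differs.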

2023-04-01 (7 months ago) in Data

GPT vs Traditional NLP Models

The field of Natural Language Processing (NLP) has seen remarkable advancements in recent years, and the emergence of the Generative Pre-trained Transformer (GPT) has revolutionized the way NLP models operate. GPT is a cutting-edge language model that employs deep learning to generate human-like text. Unlike conventional NLP models, which require extensive training on specific tasks, GPT is pre-trained on vast amounts of data and can be fine-tuned for various NLP tasks.

2023-02-26 (8 months ago) in Data

Ask ChatGPT about 20 important concepts of Apache Spark

I asked ChatGPT to explain 20 important concepts of Apache Spark. Let's see what it has to say!

2023-01-22 (9 months ago) in Data

Data Engineering Tools written in Rust

This blog post gives an overview of the data engineering tools written in Rust and their advantages, and discusses why Rust is a great choice for data engineering.

2023-01-10 (9 months ago) in Data

Why ClickHouse Should Be the Go-To Choice for Your Next Data Platform?

Recently, I was building a new Logs dashboard at Fossil to serve our internal team for log retrieval, and I found ClickHouse to be a very interesting and fast engine for this purpose. In this post, I'll share my experience using ClickHouse as the foundation of a lightweight data platform and how it compares to another popular choice, Athena. We'll also explore how ClickHouse can be integrated with other tools such as Kafka to create a robust and efficient data pipeline.

2022-09-27 (1 year ago) in Data

Airflow Dataset (Data-aware scheduling)

Since Airflow 2.4, DAGs can be scheduled not only based on time but also based on a task updating a dataset. This will change the way you schedule DAGs.

2022-03-09 (2 years ago) in Data

Spark on Kubernetes at Fossil 🤔

Apache Spark was chosen as the technology for the batch layer because of its ability to process a large amount of data at once. In the initial design, the data team chose to run Apache Spark on AWS EMR because it was readily available and quick to deploy. Over time, AWS EMR showed some limitations in the production environment. In this post, I explain why and how the Data team moved from Spark on AWS EMR to Kubernetes.

2022-02-24 (2 years ago) in Data

grant-rs: Manage Redshift/Postgres Privileges GitOps Style

The grant project aims to manage Postgres and Redshift database roles and privileges in GitOps style. Grant is the culmination of my learning of Rust for data engineering tools.

2021-11-27 (2 years ago) in Data

Rust and Data Engineering? 🤔

As a Data Engineer, I prioritize choosing a language based on whether it can cover most of my needs and problems: Data Engineering, Distributed Systems, and Web Development. In the end, I plan to start with Rust, because ...

2021-11-22 (2 years ago) in Data

Spark on Kubernetes - better handling for node shutdown

With Spark 3.1, the Spark on Kubernetes project is now officially declared production-ready and Generally Available. Spot instances in Kubernetes can cut your bill by up to 70-80% if you are willing to trade off some reliability. The new feature, SPIP: Add better handling for node shutdown (SPARK-20624), was implemented to deal with the problem of losing an executor when working with spot nodes, which forces shuffle or cached data to be recomputed.

2021-08-29 (2 years ago) in Data

Good reasons to use ClickHouse

More than 200 companies are using ClickHouse today. With its many supported features, it is equally powerful as an analytics engine and as a Big Data service backend.

2021-07-04 (2 years ago) in Data

Postgres Full Text Search

Postgres has built-in functions to handle Full Text Search queries. This is like a "search engine" within Postgres.

2017-05-31 (6 years ago) in Data

Installing the pre-built Apache Spark standalone

I received a lot of feedback on the post "BigData - Installing Apache Spark on Ubuntu 14.04" asking why the installation was so hard and complicated. In fact, that post showed how to build and install from source.

2016-09-20 (7 years ago) in Data

Running Apache Spark with Jupyter Notebook

IPython Notebook is a handy tool for Python. You can easily debug a PySpark program line by line in IPython Notebook, which saves a lot of time.

2016-09-08 (7 years ago) in Data

PySpark - Missing Python libraries on Workers

Apache Spark runs on a cluster, and with Java this is simple. With Python, the Python packages must be installed on every Worker node. Otherwise you will run into missing-library errors.

2016-06-29 (7 years ago) in Data

Exploring data in modern sports

Exploring data in modern sports. One of the answers to the question: what do Information Systems people do?

2016-02-03 (8 years ago) in Data

Bigdata - Columnar Database and Graph Database

As discussed in the post on big data, we have different kinds of data that need to be stored in a database. Big data can be processed and stored in many different kinds of databases. Below I say a little about columnar databases and graph databases.

2016-02-03 (8 years ago) in Data

Graph Database

In the previous post I talked about Columnar Database and Graph Database. The goal here is to compare them and dig deeper into Graph Database, followed by processing Graph Database with Big Data.

2015-12-12 (8 years ago) in Data

Apache Spark on Docker

Docker and Spark are two technologies that are very hyped these days.

2015-04-18 (8 years ago) in Data

Bigdata - Getting Started with Spark (in Python)

Hadoop is the standard tool for distributed computing across really large data sets and is the reason why you see "Big Data" on advertisements as you walk through the airport. It has become an operating system for Big Data, providing a rich ecosystem of tools and techniques that allow you to use a large cluster of relatively cheap commodity hardware to do computing at supercomputer scale. Two ideas from Google in 2003 and 2004 made Hadoop possible: a framework for distributed storage (The Google File System), which is implemented as HDFS in Hadoop, and a framework for distributed computing (MapReduce).
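To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python of the map, shuffle, and reduce steps for a word count. It does not use Hadoop or any distributed runtime; the function names and the sample input are hypothetical and only illustrate the programming model.

```python
from collections import defaultdict

def map_phase(document):
    # Emit a (word, 1) pair for every word in one input document.
    return [(word.lower(), 1) for word in document.split()]

def shuffle_phase(pairs):
    # Group all emitted values by key, as a framework would do
    # between the map and reduce steps.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    # Combine all values for one key into a single result.
    return key, sum(values)

# Hypothetical input: each string stands in for one input split.
documents = ["big data is big", "data is everywhere"]

mapped = [pair for doc in documents for pair in map_phase(doc)]
counts = dict(reduce_phase(k, v) for k, v in shuffle_phase(mapped).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

In a real cluster the map and reduce calls would run on different machines, with the shuffle moving data between them over the network.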

2015-04-12 (9 years ago) in Data

Big Data - Explained in Less Than 2 Minutes - To Absolutely Anyone

There are some things that are so big that they have implications for everyone, whether we want them to or not. Big Data is one of those concepts: it is completely transforming the way we do business and is affecting most other parts of our lives.

2015-04-09 (9 years ago) in Data

MongoDB - How to set up an App Server to connect to a MongoDB Server

Typically, we host the code and the database on the same server. For large applications, we have to separate them across multiple servers to keep them manageable. Because MongoDB by default does not allow remote connections and only accepts local ones, I will show how to configure it so that the App Server (the server hosting the code) can connect to the MongoDB Server (or a cluster of MongoDB Servers).

2015-04-06 (9 years ago) in Data

Database - An introduction to Redis

Redis is one of the database management systems built in the NoSQL style. It is a key-value store with many features and is widely used. Redis stands out for supporting many basic data structures (hash, list, set, sorted set, string) and for allowing scripting in Lua.

2015-03-04 (9 years ago) in Data

DBMS - The Importance of Data Types

A post from Blog kĩ thuật máy tính (a computer engineering blog).

Resources

me@duyet.net